Big QueryML: Machine Learning for Data Scientists using SQL on Google Cloud Platform Quick Start…

Last Updated on December 1, 2020 by Editorial Team

Big QueryML: Machine Learning for Data Scientists using SQL on Google Cloud Platform Quick Start Guide

I hope you are doing well. It’s been a late post as I struggled through my health. I have managed to upgrade my skills by learning new technologies and platforms and would be eager to share my knowledge with you all. Without wasting any more time, let’s learn something new today.

Cloud computing has been around for a while, but the pace of adoption has improved exponentially nowadays thanks to the pandemic and IT companies realizing the benefits and trade-offs. There are 3 major cloud services providers: AWS, Microsoft, and GCP.

Google cloud computing services had been around for a while, but the rate of adoption had been slow due to n number of reasons, say a limited number of services offered. Although Google had pioneered and was the first to innovate something on the cloud, e.g., Big table for big data, Kubernetes for serverless orchestration, and so on, it has always been slow in marketing itself(GCP services).

Now that said, if you are new to cloud computing and confused about which platform to learn, you might not realize the differences, but after using all the cloud services (AWS, GCP, Azure), I have realized the pros and cons of each platform. If you ask me, I would say AWS has an excellent go-to-marketing strategy to cloud services and know-how to be one step ahead while reaching out to customers, and it’s easy to use. Google Cloud Platform is an equally mature ecosystem and does provide services that are quick to deploy and create than other platforms, although the platform will suit the geeky crowd more when compared to AWS, which has better guidelines and documentation resources at each step.

Now someone with a background in data will recognize the importance of data-warehouse for improving business decisions. GCP offers a compelling data-warehouse service, BigQuery. BigQuery allows storage and querying of petabytes of data in seconds for analysis.

As a data scientist, you can now directly create and run your models to answer business questions using SQL on BigQuery. Today’s tutorial is all about that, so let’s get started.

Prerequisites:

You need access to BigQuery services. You can try it for free. Check out this link.

You also need access to the dataset which we will create. Please note that big-query charges for the amount of data processed when you run an SQL query, so be cautious about what you query.

You also need some familiarity with SQL to understand what data is queried.

Steps:

Open the google platform console and navigate to BigQuery service: Link

2. Add the data-to-insights dataset by clicking this link.

3. Let’s do some data exploration:

Let’s try to find out of the total visitors who visited our website, what % made a purchase?

Paste the following SQL query in query editor:

WITH visitors AS

( SELECT

COUNT(DISTINCT fullVisitorId) AS total_visitors

FROM `data-to-insights.ecommerce.web_analytics`

),

purchasers AS(

SELECT

COUNT(DISTINCT fullVisitorId) AS total_purchasers

FROM `data-to-insights.ecommerce.web_analytics`

WHERE totals.transactions IS NOT NULL

SELECT

total_visitors,

total_purchasers,

total_purchasers / total_visitors AS conversion_rate

FROM visitors, purchasers

Let’s try to figure out the top 5 selling products in the market using e-commerce data.

SELECT

p.v2ProductName,

p.v2ProductCategory,

SUM(p.productQuantity) AS units_sold,

ROUND(SUM(p.localProductRevenue/1000000),2) AS revenue

FROM `data-to-insights.ecommerce.web_analytics`,

UNNEST(hits) AS h,

UNNEST(h.product) AS p

GROUP BY 1, 2

ORDER BY revenue DESC

LIMIT 5;

We will not discuss each syntax of SQL, but if you know SQL, then you might find it pretty easy to understand. UNNEST keyword flattens out the array to separate records in the table.

Can we answer how many visitors bought products on the next visits to the website?

WITH all_visitor_stats AS (

SELECT

fullvisitorid,

IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit

FROM `data-to-insights.ecommerce.web_analytics`

GROUP BY fullvisitorid

SELECT

COUNT(DISTINCT fullvisitorid) AS total_visitors,

will_buy_on_return_visit

FROM all_visitor_stats

GROUP BY will_buy_on_return_visitcontent

Only a small proportion of visitors visit again to buy the products, making sense when we relate to real-world scenarios: you might visit a website to browse or compare products but may not buy them.

4. Let’s make a feature selection to answer a machine learning problem:

Predict whether or not a user is likely to purchase in the future.

Let’s create a baseline model using a few features in big-query. For this domain, expertise is required or prior work with google analytics e-commerce data. Suppose we pick 2 features, for now, bounces and time_on_site, which are correlated to buying decisions. Now you can add more features, but that’s beyond the scope of our fundamental learnings. For our case, we need a logistic regression model.

We will create a dataset of ours where we will save the model.

After we have created the dataset, we will create a logistic regression model.

CREATE OR REPLACE MODEL `mydataset.classification_model`
OPTIONS
(
model_type=’logistic_reg’,
labels = [‘will_buy_on_return_visit’]
)
AS

#standardSQL
SELECT
 * EXCEPT(fullVisitorId)
FROM

# features
 (SELECT
 fullVisitorId,
 IFNULL(totals.bounces, 0) AS bounces,
 IFNULL(totals.timeOnSite, 0) AS time_on_site
 FROM
 `data-to-insights.ecommerce.web_analytics`
 WHERE
 totals.newVisits = 1
 AND date BETWEEN ‘20160801’ AND ‘20170430’) # train on first 9 months
 JOIN
 (SELECT
 fullvisitorid,
 IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
 FROM
 `data-to-insights.ecommerce.web_analytics`
 GROUP BY fullvisitorid)
 USING (fullVisitorId);

Once the model is created, you can explore the model.

5. Model Evaluation:

You can evaluate the model using the following query and checking the roc_auc.

SELECT
 roc_auc,
 CASE
 WHEN roc_auc > .9 THEN ‘good’
 WHEN roc_auc > .8 THEN ‘fair’
 WHEN roc_auc > .7 THEN ‘not great’
 ELSE ‘poor’ END AS model_quality
FROM
 ML.EVALUATE(MODEL mydataset.classification_model, (

SELECT
 * EXCEPT(fullVisitorId)
FROM

# features
 (SELECT
 fullVisitorId,
 IFNULL(totals.bounces, 0) AS bounces,
 IFNULL(totals.timeOnSite, 0) AS time_on_site
 FROM
 `data-to-insights.ecommerce.web_analytics`
 WHERE
 totals.newVisits = 1
 AND date BETWEEN ‘20170501’ AND ‘20170630’) # eval on 2 months
 JOIN
 (SELECT
 fullvisitorid,
 IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
 FROM
 `data-to-insights.ecommerce.web_analytics`
 GROUP BY fullvisitorid)
 USING (fullVisitorId)));

6. Feature Engineering:

Now we can try to improve the model by adding some more features.

CREATE OR REPLACE MODEL `mydataset.classification_model_2`
OPTIONS
 (model_type=’logistic_reg’, labels = [‘will_buy_on_return_visit’]) AS

WITH all_visitor_stats AS (
SELECT
 fullvisitorid,
 IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
 FROM `data-to-insights.ecommerce.web_analytics`
 GROUP BY fullvisitorid
)

# add in new features
SELECT * EXCEPT(unique_session_id) FROM (

SELECT
 CONCAT(fullvisitorid, CAST(visitId AS STRING)) AS unique_session_id,

# labels
 will_buy_on_return_visit,

MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,

# behavior on the site
 IFNULL(totals.bounces, 0) AS bounces,
 IFNULL(totals.timeOnSite, 0) AS time_on_site,
 totals.pageviews,

# where the visitor came from
 trafficSource.source,
 trafficSource.medium,
 channelGrouping,

# mobile or desktop
 device.deviceCategory,

# geographic
 IFNULL(geoNetwork.country, “”) AS country

FROM `data-to-insights.ecommerce.web_analytics`,
 UNNEST(hits) AS h

JOIN all_visitor_stats USING(fullvisitorid)

WHERE 1=1
 # only predict for new visits
 AND totals.newVisits = 1
 AND date BETWEEN ‘20160801’ AND ‘20170430’ # train 9 months

GROUP BY
 unique_session_id,
 will_buy_on_return_visit,
 bounces,
 time_on_site,
 totals.pageviews,
 trafficSource.source,
 trafficSource.medium,
 channelGrouping,
 device.deviceCategory,
 country
);

Let’s evaluate the model.

SELECT
 roc_auc,
 CASE
 WHEN roc_auc > .9 THEN ‘good’
 WHEN roc_auc > .8 THEN ‘fair’
 WHEN roc_auc > .7 THEN ‘not great’
 ELSE ‘poor’ END AS model_quality
FROM
 ML.EVALUATE(MODEL mydataset.classification_model_2, (

WITH all_visitor_stats AS (
SELECT
 fullvisitorid,
 IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
 FROM `data-to-insights.ecommerce.web_analytics`
 GROUP BY fullvisitorid
)

# add in new features
SELECT * EXCEPT(unique_session_id) FROM (

SELECT
 CONCAT(fullvisitorid, CAST(visitId AS STRING)) AS unique_session_id,

# labels
 will_buy_on_return_visit,

MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,

# behavior on the site
 IFNULL(totals.bounces, 0) AS bounces,
 IFNULL(totals.timeOnSite, 0) AS time_on_site,
 totals.pageviews,

# where the visitor came from
 trafficSource.source,
 trafficSource.medium,
 channelGrouping,

# mobile or desktop
 device.deviceCategory,

# geographic
 IFNULL(geoNetwork.country, “”) AS country

FROM `data-to-insights.ecommerce.web_analytics`,
 UNNEST(hits) AS h

JOIN all_visitor_stats USING(fullvisitorid)

WHERE 1=1
 # only predict for new visits
 AND totals.newVisits = 1
 AND date BETWEEN ‘20170501’ AND ‘20170630’ # eval 2 months

GROUP BY
 unique_session_id,
 will_buy_on_return_visit,
 bounces,
 time_on_site,
 totals.pageviews,
 trafficSource.source,
 trafficSource.medium,
 channelGrouping,
 device.deviceCategory,
 country
)
));

The model has improved a lot.

7. Prediction:

Now let’s predict all the new visitors who will become customers.

SELECT

FROM

ml.PREDICT(MODEL `ecommerce.classification_model_2`,

WITH all_visitor_stats AS (

SELECT

fullvisitorid,

IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit

FROM `data-to-insights.ecommerce.web_analytics`

GROUP BY fullvisitorid

SELECT

CONCAT(fullvisitorid, ‘-’,CAST(visitId AS STRING)) AS unique_session_id,

# labels

will_buy_on_return_visit,

MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,

# behavior on the site

IFNULL(totals.bounces, 0) AS bounces,

IFNULL(totals.timeOnSite, 0) AS time_on_site,

totals.pageviews,

# where the visitor came from

trafficSource.source,

trafficSource.medium,

channelGrouping,

# mobile or desktop

device.deviceCategory,

# geographic

IFNULL(geoNetwork.country, “”) AS country

FROM `data-to-insights.ecommerce.web_analytics`,

UNNEST(hits) AS h

JOIN all_visitor_stats USING(fullvisitorid)

WHERE

# only predict for new visits

totals.newVisits = 1

AND date BETWEEN ‘20170701’ AND ‘20170801’ # test 1 month

GROUP BY

unique_session_id,

will_buy_on_return_visit,

bounces,

time_on_site,

totals.pageviews,

trafficSource.source,

trafficSource.medium,

channelGrouping,

device.deviceCategory,

country

ORDER BY

predicted_will_buy_on_return_visit DESC;

Isn’t it easy and fun?

I hope you enjoyed this post. Let me know your thoughts if you have any queries or suggestions, would love to hear more from you. Stay tuned for future posts.

You can follow me for tutorials on AI/machine learning, data analytics, and BI. You can connect with me on LinkedIn.

Big QueryML: Machine Learning for Data Scientists using SQL on Google Cloud Platform Quick Start… was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Published via Towards AI

Frequently Used, Contextual References

Resources

Publication

Big QueryML: Machine Learning for Data Scientists using SQL on Google Cloud Platform Quick Start…

Author(s): Anuragbisht

Cloud Computing

Big QueryML: Machine Learning for Data Scientists using SQL on Google Cloud Platform Quick Start Guide

Towards AI Team

Feedback ↓ Cancel reply

Popular posts

Best Laptops for Deep Learning, Machine Learning (ML), and Data Science for 2023

Best Workstations for Deep Learning, Data Science, and Machine Learning (ML) for 2022

Descriptive Statistics for Data-driven Decision Making with Python

Best Machine Learning (ML) Books - Free and Paid - Editorial Recommendations for 2022

Best Data Science Books - Free and Paid - Editorial Recommendations for 2022

Updates

Recent Posts

The Math Behind Supervised Learning: Making AI Less Mysterious

40 Ways DeepSeek AI Will Upgrade Your Life Instantly

TAI #141: Claude 3.7 Sonnet; Software Dev Focus in Anthropic’s First Thinking Model

Agentic RAG: Mastering Document Retrieval with CrewAI, DeepSeek, and Streamlit

Quantum AI Computing

The World’s Leading AI and Technology Publication.

Company

CONTACT US

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Frequently Used, Contextual References

Resources

Publication

Big QueryML: Machine Learning for Data Scientists using SQL on Google Cloud Platform Quick Start…

Author(s): Anuragbisht

Big QueryML: Machine Learning for Data Scientists using SQL on Google Cloud Platform Quick Start Guide

Towards AI Team

Related posts

Feedback ↓ Cancel reply

Popular posts

Updates

Recent Posts

The World’s Leading AI and Technology Publication.

Company

CONTACT US

GDPR CCPA Statement