Big QueryML: Machine Learning for Data Scientists using SQL on Google Cloud Platform Quick Start…
Last Updated on December 1, 2020 by Editorial Team
Author(s): Anuragbisht
Cloud Computing
Big QueryML: Machine Learning for Data Scientists using SQL on Google Cloud Platform Quick Start Guide
I hope you are doing well. It’s been a late post as I struggled through my health. I have managed to upgrade my skills by learning new technologies and platforms and would be eager to share my knowledge with you all. Without wasting any more time, let’s learn something new today.
Cloud computing has been around for a while, but the pace of adoption has improved exponentially nowadays thanks to the pandemic and IT companies realizing the benefits and trade-offs. There are 3 major cloud services providers: AWS, Microsoft, and GCP.
Google cloud computing services had been around for a while, but the rate of adoption had been slow due to n number of reasons, say a limited number of services offered. Although Google had pioneered and was the first to innovate something on the cloud, e.g., Big table for big data, Kubernetes for serverless orchestration, and so on, it has always been slow in marketing itself(GCP services).
Now that said, if you are new to cloud computing and confused about which platform to learn, you might not realize the differences, but after using all the cloud services (AWS, GCP, Azure), I have realized the pros and cons of each platform. If you ask me, I would say AWS has an excellent go-to-marketing strategy to cloud services and know-how to be one step ahead while reaching out to customers, and it’s easy to use. Google Cloud Platform is an equally mature ecosystem and does provide services that are quick to deploy and create than other platforms, although the platform will suit the geeky crowd more when compared to AWS, which has better guidelines and documentation resources at each step.
Now someone with a background in data will recognize the importance of data-warehouse for improving business decisions. GCP offers a compelling data-warehouse service, BigQuery. BigQuery allows storage and querying of petabytes of data in seconds for analysis.
As a data scientist, you can now directly create and run your models to answer business questions using SQL on BigQuery. Today’s tutorial is all about that, so let’s get started.
Prerequisites:
You need access to BigQuery services. You can try it for free. Check out this link.
You also need access to the dataset which we will create. Please note that big-query charges for the amount of data processed when you run an SQL query, so be cautious about what you query.
You also need some familiarity with SQL to understand what data is queried.
Steps:
- Open the google platform console and navigate to BigQuery service: Link
2. Add the data-to-insights dataset by clicking this link.
3. Let’s do some data exploration:
Let’s try to find out of the total visitors who visited our website, what % made a purchase?
Paste the following SQL query in query editor:
WITH visitors AS
( SELECT
COUNT(DISTINCT fullVisitorId) AS total_visitors
FROM `data-to-insights.ecommerce.web_analytics`
),
purchasers AS(
SELECT
COUNT(DISTINCT fullVisitorId) AS total_purchasers
FROM `data-to-insights.ecommerce.web_analytics`
WHERE totals.transactions IS NOT NULL
)
SELECT
total_visitors,
total_purchasers,
total_purchasers / total_visitors AS conversion_rate
FROM visitors, purchasers
Let’s try to figure out the top 5 selling products in the market using e-commerce data.
SELECT
p.v2ProductName,
p.v2ProductCategory,
SUM(p.productQuantity) AS units_sold,
ROUND(SUM(p.localProductRevenue/1000000),2) AS revenue
FROM `data-to-insights.ecommerce.web_analytics`,
UNNEST(hits) AS h,
UNNEST(h.product) AS p
GROUP BY 1, 2
ORDER BY revenue DESC
LIMIT 5;
We will not discuss each syntax of SQL, but if you know SQL, then you might find it pretty easy to understand. UNNEST keyword flattens out the array to separate records in the table.
Can we answer how many visitors bought products on the next visits to the website?
WITH all_visitor_stats AS (
SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM `data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid
)
SELECT
COUNT(DISTINCT fullvisitorid) AS total_visitors,
will_buy_on_return_visit
FROM all_visitor_stats
GROUP BY will_buy_on_return_visitcontent
Only a small proportion of visitors visit again to buy the products, making sense when we relate to real-world scenarios: you might visit a website to browse or compare products but may not buy them.
4. Let’s make a feature selection to answer a machine learning problem:
Predict whether or not a user is likely to purchase in the future.
Let’s create a baseline model using a few features in big-query. For this domain, expertise is required or prior work with google analytics e-commerce data. Suppose we pick 2 features, for now, bounces and time_on_site, which are correlated to buying decisions. Now you can add more features, but that’s beyond the scope of our fundamental learnings. For our case, we need a logistic regression model.
We will create a dataset of ours where we will save the model.
After we have created the dataset, we will create a logistic regression model.
CREATE OR REPLACE MODEL `mydataset.classification_model`
OPTIONS
(
model_type=’logistic_reg’,
labels = [‘will_buy_on_return_visit’]
)
AS
#standardSQL
SELECT
* EXCEPT(fullVisitorId)
FROM
# features
(SELECT
fullVisitorId,
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site
FROM
`data-to-insights.ecommerce.web_analytics`
WHERE
totals.newVisits = 1
AND date BETWEEN ‘20160801’ AND ‘20170430’) # train on first 9 months
JOIN
(SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM
`data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid)
USING (fullVisitorId);
Once the model is created, you can explore the model.
5. Model Evaluation:
You can evaluate the model using the following query and checking the roc_auc.
SELECT
roc_auc,
CASE
WHEN roc_auc > .9 THEN ‘good’
WHEN roc_auc > .8 THEN ‘fair’
WHEN roc_auc > .7 THEN ‘not great’
ELSE ‘poor’ END AS model_quality
FROM
ML.EVALUATE(MODEL mydataset.classification_model, (
SELECT
* EXCEPT(fullVisitorId)
FROM
# features
(SELECT
fullVisitorId,
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site
FROM
`data-to-insights.ecommerce.web_analytics`
WHERE
totals.newVisits = 1
AND date BETWEEN ‘20170501’ AND ‘20170630’) # eval on 2 months
JOIN
(SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM
`data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid)
USING (fullVisitorId)));
6. Feature Engineering:
Now we can try to improve the model by adding some more features.
CREATE OR REPLACE MODEL `mydataset.classification_model_2`
OPTIONS
(model_type=’logistic_reg’, labels = [‘will_buy_on_return_visit’]) AS
WITH all_visitor_stats AS (
SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM `data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid
)
# add in new features
SELECT * EXCEPT(unique_session_id) FROM (
SELECT
CONCAT(fullvisitorid, CAST(visitId AS STRING)) AS unique_session_id,
# labels
will_buy_on_return_visit,
MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,
# behavior on the site
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site,
totals.pageviews,
# where the visitor came from
trafficSource.source,
trafficSource.medium,
channelGrouping,
# mobile or desktop
device.deviceCategory,
# geographic
IFNULL(geoNetwork.country, “”) AS country
FROM `data-to-insights.ecommerce.web_analytics`,
UNNEST(hits) AS h
JOIN all_visitor_stats USING(fullvisitorid)
WHERE 1=1
# only predict for new visits
AND totals.newVisits = 1
AND date BETWEEN ‘20160801’ AND ‘20170430’ # train 9 months
GROUP BY
unique_session_id,
will_buy_on_return_visit,
bounces,
time_on_site,
totals.pageviews,
trafficSource.source,
trafficSource.medium,
channelGrouping,
device.deviceCategory,
country
);
Let’s evaluate the model.
SELECT
roc_auc,
CASE
WHEN roc_auc > .9 THEN ‘good’
WHEN roc_auc > .8 THEN ‘fair’
WHEN roc_auc > .7 THEN ‘not great’
ELSE ‘poor’ END AS model_quality
FROM
ML.EVALUATE(MODEL mydataset.classification_model_2, (
WITH all_visitor_stats AS (
SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM `data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid
)
# add in new features
SELECT * EXCEPT(unique_session_id) FROM (
SELECT
CONCAT(fullvisitorid, CAST(visitId AS STRING)) AS unique_session_id,
# labels
will_buy_on_return_visit,
MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,
# behavior on the site
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site,
totals.pageviews,
# where the visitor came from
trafficSource.source,
trafficSource.medium,
channelGrouping,
# mobile or desktop
device.deviceCategory,
# geographic
IFNULL(geoNetwork.country, “”) AS country
FROM `data-to-insights.ecommerce.web_analytics`,
UNNEST(hits) AS h
JOIN all_visitor_stats USING(fullvisitorid)
WHERE 1=1
# only predict for new visits
AND totals.newVisits = 1
AND date BETWEEN ‘20170501’ AND ‘20170630’ # eval 2 months
GROUP BY
unique_session_id,
will_buy_on_return_visit,
bounces,
time_on_site,
totals.pageviews,
trafficSource.source,
trafficSource.medium,
channelGrouping,
device.deviceCategory,
country
)
));
The model has improved a lot.
7. Prediction:
Now let’s predict all the new visitors who will become customers.
SELECT
*
FROM
ml.PREDICT(MODEL `ecommerce.classification_model_2`,
(
WITH all_visitor_stats AS (
SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM `data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid
)
SELECT
CONCAT(fullvisitorid, ‘-’,CAST(visitId AS STRING)) AS unique_session_id,
# labels
will_buy_on_return_visit,
MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,
# behavior on the site
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site,
totals.pageviews,
# where the visitor came from
trafficSource.source,
trafficSource.medium,
channelGrouping,
# mobile or desktop
device.deviceCategory,
# geographic
IFNULL(geoNetwork.country, “”) AS country
FROM `data-to-insights.ecommerce.web_analytics`,
UNNEST(hits) AS h
JOIN all_visitor_stats USING(fullvisitorid)
WHERE
# only predict for new visits
totals.newVisits = 1
AND date BETWEEN ‘20170701’ AND ‘20170801’ # test 1 month
GROUP BY
unique_session_id,
will_buy_on_return_visit,
bounces,
time_on_site,
totals.pageviews,
trafficSource.source,
trafficSource.medium,
channelGrouping,
device.deviceCategory,
country
)
)
ORDER BY
predicted_will_buy_on_return_visit DESC;
Isn’t it easy and fun?
I hope you enjoyed this post. Let me know your thoughts if you have any queries or suggestions, would love to hear more from you. Stay tuned for future posts.
You can follow me for tutorials on AI/machine learning, data analytics, and BI. You can connect with me on LinkedIn.
Big QueryML: Machine Learning for Data Scientists using SQL on Google Cloud Platform Quick Start… was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
Published via Towards AI