Machine Learning with SQL

Today we’re going to look at how to create a Machine Learning model. Not with Python or Tensorflow or Spark, but we’re going to create a Machine Learning model using SQL.

What is Machine Learning?

Machine Learning (ML), a subset of Artificial Intelligence (AI), is the study of computer algorithms that solve a problem by learning from underlying patterns in data, as opposed to statistical heuristics or rule based programming

To get an answer to a question, a traditional approach has been to define rules (if… then… else…) either based on intuition or based on our inference of the historical data. But the rules can quickly get complicated and it may not be practical to implement all the rules required to get to the answer.

For example, if you’re trying to predict the fare of an Uber cab when going from point A to point B, there might be too many rules for that algorithm.

  • If the pickup location is X…
  • If the drop-off location is X…
  • If the trip start time is X….
  • If it is a weekday, then…. If it is a weekend then….
  • If the distance to be travelled is X…
  • If the time required to cover that distance is X….
  • what about traffic? If the traffic is low, medium, high?…

These rules might get complicated and there might be many more that impact the cab fare.

Machine Learning takes a different approach to provide an intelligent answer to this problem. Using ML, you can create an algorithm that learns from historical data of such trips. The model will be able to develop intelligence about this topic and will be able to predict the fare for a trip in the future. The algorithm will learn all the rules from the data, without being explicitly explaining what those rules are.

Mathematically, a Machine Learning algorithm, produces a mathematical equation between various features of the trip (the inputs) and the fare (the output).

f(distance, traffic, pickup location, drop-off location.... ) = fare

This equation is called the Machine Learning model. The model can then be used to make prediction for a new trip if we know all the inputs parameters (features of the trip).

Let us implement a simple fare prediction model today. Generally data scientists use Python and various ML libraries to implement an ML model, we will use SQL (Structured Query Language).

The Data

We will be using the data from Chicago Taxi Trips. This data is available in BigQuery as a part of the Public Datasets.

BigQuery is a Global Data Warehouse available on the Google Cloud Platform.

We’ll use data from the entire year of 2020 and for the purpose of this demo. We’ll only look at 1 feature, i.e. the distance for a trip to predict the fare for that trip. Also, for simplicity, we are going to trim the edges, we are only going to look at trips that were between $3 to $100 USD in fare and between 2.5 miles to 50 miles in trip distance. This is what data looks like.

  fare between 3 and 100 and
  trip_miles between 2.5 and 50 and
  trip_start_timestamp between "2020-01-01" and "2021-01-01"

There are 1,274,704 rows of data available to learn from.

Training data visualization

Now obviously in the real world, the fare of a taxi trip depends on a lot of other features like the time of the day, the day of the week, the traffic, the pickup and drop-off location and so on. It is clear from the graph above that although the distance appears to be strongly correlated to the fare, there’s quite a bit of noise and that probably corresponds to contribution from other features. But for the purpose of this demo, let’s stick to only one feature “trip_miles”.


BigQuery ML makes it extremely simple to train a Machine Learning model. You just need to add a couple of lines on top of your select statement that is mentioned above.

CREATE OR REPLACE MODEL `bqml.taxi_fare_prediction`
    , input_label_cols=['fare']
    , L1_REG = 1
    ) AS
    fare between 3 and 100 and
    trip_miles between 2.5 and 50 and
    trip_start_timestamp between "2020-01-01" and "2021-01-01"

Here you can see that we did only add 2 lines.

CREATE OR REPLACE MODEL `bqml.taxi_fare_prediction`

The create model (add ‘or replace’ if you’re retraining the model) statement tells BigQuery that you’re triyng to train the model. The model name is provided in this statement.

    , input_label_cols=['fare']
    , L1_REG = 1
    ) AS

BigQuery ML provides a bunch of options. We’ll be using 3 of these options in our code today.

  • model_type: We will be training a linear regression model. This model establishes a linear relationship between the inputs and the outputs.
  • input_label_cols: We have to tell BigQuery ML which column from the dataset is our “label”. This is the column that the model will predict once it has learnt from the data.
  • L1_REG: This parameter provides The amount of L1 regularization applied. This is required so that the model does not overfit to the training data.

When I clicked “Run”, the query took 1 min 23 sec to execute. In other words, it took 1 min 23 sec to train the model.

The loss curve for the model is made available on the BigQuery console.

Training and validation loss curve

BigQuery provides the evaluation metrics for the model.

Model evaluation metrics

You can see that the R squared is 0.9438, which is pretty good. In lay man terms, R squared measures how close does your model predict to the actual values. An R squared of 1 means the model predicts perfectly each time.

Batch Prediction

Now that we have a model trained, let’s run some predictions and see how well it works out.

Let’s predict for all the trips from 2021. We’ll make a prediction for various trip_miles values and compare the prediction to the average fare that the cabbies charged in 2021 for that distance.

  avg(fare) as fare
  fare between 3 and 100 and
  trip_miles between 2.5 and 50 and
  trip_start_timestamp > "2021-01-01"
group by trip_miles

This gives us our test data. Now pass this data to ML.PREDICT function.

  ML.PREDICT(MODEL `bqml.taxi_fare_prediction`,
        avg(fare) as fare
        fare between 3 and 100 and
        trip_miles between 2.5 and 50 and
        trip_start_timestamp > "2021-01-01"
      group by trip_miles
order by trip_miles

You can see that our prediction aligns very well with the test data.

Streaming Prediction

This is great, but what about real-time prediction? Can we create an API that allows us to predict on-demand? This will allow us to integrate this model with an app, similar to how Uber works.

Let’s start with “exporting” the model. Go to the model in the BigQuery console and click on “Export Model”.

BQ will ask you for a GCS location. Just provide a folder path.

Now go to Vertex AI > Models.

On the models page, click to import a model.

Import your model using the GCS path where you had saved the model.

This will import the model into Vertex AI Models tab.

Next, let’s go to the Vertex AI > Endpoints tab and click on “Create Endpoints”

Follow the wizard to create an endpoint. see below.

This should give you an endpoint that can be used for real-time prediction.

Now test this out with Postman

Now that you have a REST API, you can go ahead and integrate this within any app…. 😎


Hope this was useful and helps you get started with Machine Learning with just SQL. You don’t need to know Python, Tensorflow, Spark or anything else; if you know SQL, you can ML!

Not only can you ML, you can do with with using Managed Services on Google Cloud, so now you don’t have to worry about infrastructure management for Machine Learning either.

Let’s create some models!

Rapid Response Virtual Agent for Contact Centers powered by Google Cloud

MediaAgility, in Partnership with Google Cloud is deploying Rapid Response Virtual Agents to help you address the overwhelming demands in your contact centers.

These virtual agents run on Google’s Contact Center AI technology that is powered by Google Dialogflow.

In order to serve our customers during this unprecedented situation that we’re in right now due to the outbreak, we’re offering free deployment of these virtual agents for all of our customers till July 31st, 2020. The virtual agents can be deployed in less than 2 weeks.

They come loaded with a lot of Coronavirus related data from Center of Disease Control and Prevention (CDC) and can answer custom questions from your bank of Frequently Asked Questions (FAQs).

We are deploying these agents for free and Google is not going to charge for the platform till July 31st 2020 either. So this is absolutely at no cost to you and can help you address your customers’ needs.

Please click hear to learn more:

8 secrets to saving your Cloud Costs

Optimizing cloud consumption and reducing the money spent on IT infrastructure is one of the strategic priorities of any CIO

Here are the 8 secrets that you can leverage to reduce your cloud costs.

Secret #1. Custom Machine Types

Instead of provisioning a virtual machine (VM) with one of the many available standard configurations, you can actually create a custom configuration. Instead of overprovisioning, you can carefully assess exactly how much compute power you need for your workload and resize to the desired configuration.

Secret #2. Rightsizing Recommendations

This is related to the first secret. Google Cloud actually monitors your resource consumption and recommends you to rightsize your VMs to save cost.

And when you click on the recommendation, you can resize your VM with the click of a button.

Secret #3. Per Second Billing

This is less of a secret if you are already on the cloud, but if you are running your infrastructure on-premise, this is a huge benefit why you might want to consider moving to the cloud.

Google cloud charges all services on a per-second basis. So if you use a VM for 90 seconds, you pay for 90 seconds.

Imagine adding an additional 64-core server to your existing on-premise datacenter that you need only for a few minutes per day to handle peak load, or to run your batch jobs, etc. With Google Cloud, instead of paying for that infrastructure up-front, you consume only for the time when you actually use the servers. You move from a CapEx model to an OpEx model.

This means that the large upfront cash that you now save can be applied towards innovation and growth.

Secret #4. Preemptible VMs

This is clearly my favorite secret from this list. Google Cloud leases out its excess capacity for much cheaper, up to 80% cheaper!!

There are a few things to note with Preemptible VM

  1. They can last maximum of up to 24 hours
  2. Google can pull the plug on (preempt) these VMs with a 30-second warning

They are easy to set up. Just set the “Preemptibility” to “on” when creating a VM.

Preemptible VMs are best suited for batch jobs or fault-tolerant jobs.

Secret #5. Cloud TPU

Cloud TPU is a custom chip built by Google for running Deep Learning workloads. TPUs can save you both time and cost while training very deep neural networks.

Looking at this experiment by Martin Gorner, it is clear that a TPU can help you not only speed up your model training but cut your model training costs into half.

Secret #6. Sustained Use Discount

This is a discount that Google provides to you by default. The longer you leave a VM running, the bigger the discount is applied. You can’t do anything to accidentally forget to get the discount.

On average, customers save 21% off of list price by doing nothing.

Secret #7. Committed Use Discount

You can commit to a certain amount of compute and memory consumption to get up to 57% discounts in many cases.

You don’t have to commit to any particular machine type, all you need to commit to is the number of Cores and the amount of Memory consumption and then you can consume those resources with any combination of any machine types.

Secret #8. Offline Commit Contracts

If you are a large corporate or an enterprise, you can get into a multi-year commit contract with Google Cloud to get an additional 5% to 15% discount. This discount is applied on top of all other discounts and savings that we’ve discussed here.

The easiest way to leverage this secret and execute a commit-contract is to work with a Google Cloud Premier Partner like MediaAgility.

Connect with me on LinkedIn if you’d like to learn more or would like to schedule an assessment of your environment to understand how much can you save on your cloud costs.

Google Colaboratory vs Google Cloud AI Platform Notebooks

Setting up Jupyter Notebook environments can be hard, but it does not have to. Notebooks can be easy!

Let’s talk about “Notebooks on the Web” or Google Colaboratory, a.k.a. Colab.

Colab resides within your Google Drive. To create a new Google Colab notebook, just go to your Google Drive, click New > Google Colaboratory

This creates a blank Google Colab Notebook, try it out!

Colab notebook provides 2 cores and around 13GB of RAM

You can leverage GPU and TPU for training your ML model, absolutely for FREE!!.

To add GPU or TPU to your runtime, Go to Runtime > Change runtime type. Select a hardware accelerator.

Colab might give you a Tesla K80 or a P100 or a V100 based on availability.

Colab comes pre-loaded with all your favorite Data Science and Data Engineering libraries.

Now let’s compare this to Google Cloud AI Platform Notebooks.

To start with, let’s determine how much it’ll cost us to provision a Jupyter notebook environment on AI Platform notebooks with a similar configuration as that of a Colab. Navigate to Google Cloud Platform > AI Platform > Notebooks > New Instance.

Select n1-highmem-2 machine type, which has a similar config as Colab. i.e. 2 cores and 13GB RAM. Also, select the Tesla K80 GPU, which is the most economical GPU on GCP.

When you select this configuration, on the right hand of the screen you’ll see that Google estimates that this instance will cost you $293.25 per month.

An hourly rate of $0.402 is determined after applying the sustained use discount, which means if the Machine is not used continuously throughout the month, the hourly rate might go up. This clearly shows the value that Colab provides by giving you an environment for free.

Having said above, Google Cloud AI Platform Notebook does have its own advantages. For example, it gives you an environment that is integrated with the rest of the Google Cloud Platform and can be managed by your IT team.

Here is a side by side comparison of the two products:

Google Colaboratory

AI Platform Notebooks
2 cores,
13 GB RAM (double of that in Colab Pro), 1 GPU
Any size you want. Scale on-demand
Resides on Google DrivePart of the Google Cloud Platform. Easy GIT Integration
Maximum runtime of 12 hours. (24 hours for Colab Pro)No limitation on runtime.
Pay-per-use with per-second billing
IT generally has no visibilityPart of GCP, hence IT has full control and can manage notebooks similar to all other resources on the cloud
Easy authentication to Google Services. Pre-built connectors for Google Sheets, Google Cloud Storage, etc.Fully integrated with Google Cloud Platform. Easy access to data on Google Cloud Storage, BigQuery, etc.
Consumer product, no enterprise supportPart of the Google Cloud Platform. Enterprise support to resolve all your issues and queries
FREE!!Gotta pay for this guy
Good for personal research, science projectsGood for building a managed experimentation environment for enterprise ML systems

To conclude, whether you select Colab or AI Platform Notebooks, really depends on your use case. If you want to use a notebook for personal research or some science projects, use Colab. If you are a Data Scientist and want to use notebooks for building enterprise ML systems, then use Google Cloud AI Platform Notebooks.

AI Platform Notebooks on Google Cloud

If you are an AI company, you’re probably using Jupyter Notebooks. These are some of the challenges that we see with Enterprises that are using Jupyter Notebook Environments for their AI experiments:

  • These notebooks are difficult to setup
  • You’ve to manage your virtual environments
  • You are limited by the hardware capacity of your laptop
  • IT has no visibility into your Dev/Experiment environment
  • You have a highly inconsistent environment across your teams

Let me introduce you to Google Cloud’s AI Platform Notebooks, an enterprise notebook service to get your projects up and running in minutes.

Google has made it really easy for you to get started on AI Platform Notebooks. You get the same Jupyterlab interface that you’re used to. Hence no learning curve.

Comes pre-installed with your favorite Data Science and Data Engineering Libraries

You are no longer limited by the hardware capacity of your laptops. You can now use hardware accelerators like GPUs and TPUs right from your notebooks.

These notebooks can now be centrally managed by your IT team because they’re running within your enterprise Cloud account and not on your laptop.

AI Platform Notebooks are fully integrated with other Google Cloud products like BigQuery, Cloud Storage, etc.

Check out the video for a demo on how to get started with AI Platform Notebooks.

Why Google Cloud for Analytics & AI?

People often ask me why Google Cloud specifically for Analytics & AI? By the way of this post, I am trying to answer why I choose Google Cloud to build Analytics & AI systems.

If you look at Analytics or Machine Learning, the core building block for Analytics & AI is Data. You can not build a Machine Learning system without having a ton of data and without having a platform that can process that amount of data.

Google’s mission is “to organize the world’s information and make it universally accessible and useful”. Google is a data company.

Data comes from the consumption of services. There are only 13 services in the world that have at least a Billion Monthly Active Users users.

Two of them are by Microsoft, i.e. Windows and Office; Two by Facebook, i.e. Facebook and Whatsapp; And nine of them by Google.

All of you are probably familiar with each one of them and use them on a daily basis.

Google is a unique company with 9 of these apps with more than Billion Monthly Active Users, many of them with more than two billion users. Just think about how much data these apps must be generating.

All of these services are free for you. This data is critical to monetization of these apps. These apps are highly data-intensive and are fully loaded with Machine Learning.

This means Google needs to build products and infrastructure that can process and analyze that amount of data.

Which means they have the capability, the infrastructure, engineers, and algorithms to be able to run Analytics & Machine Learning at that scale.

Google has built Machine Learning systems for close to 2 decades now. Because of the amount of data that Google generates and processes, it gives me confidence that there is no other company in the world that runs Analytics and Machine Learning at the scale that Google does.



My name is Arpit Agrawal. I am the Director of Analytics & AI at MediaAgility and work closely with Google Cloud to solve some real-world problems for corporate and enterprise businesses. I focus on leveraging Google Cloud to generate insights from data. Insights that help you make intelligent decisions that propel your growth.

Through GCP.LIVE, I hope to bring you my decade worth of experience working on various data analytics & AI projects.

Please feel free to reach out if you find the content useful, or have feedback on anything that I write. Also, do let me know if there are additional topics that you’d like me to cover.

Looking forward to partnering with you in your journey towards building an intelligent enterprise.

Arpit Agrawal