In this tutorial, we'll look at BigQuery ML, see what makes it unique from other machine learning offerings, and look at an example for predicting churn from sample customer data.
Download the Telco Customer Churn dataset here: https://www.kaggle.com/blastchar/telco-customer-churn
The blog post with the idea for the train/test split can be read here: https://towardsdatascience.com/lessons-learned-using-google-cloud-bigquery-ml-dfd4763463c
Hello. Today, I want to talk about something a little bit different. I want to talk about BigQuery ML. BigQuery ML is a way to make and train machine learning models entirely in SQL, and make predictions based on data that you’ve imported into BigQuery. So I want to talk a little bit about what you can do with it, and how it’s different from some of the other machine learning offerings that Google has.
So first of all, it's different in what you have to do with it compared to, let's say, AutoML. It's not a completely automatic offering, so you're going to have to do your own feature selection. You're going to have to do your own train/test split, for instance. It's not going to be an entirely end-to-end solution, which can be a good thing: you know exactly what's happening. But the trade-off is that it's definitely a more cost-effective solution than what you would get with AutoML. And it's certainly possible, depending on the amount of data that you need to process, to keep it well within the free tier.
Let's take a look. Obviously, when you're using BigQuery, price is always going to be a consideration, so let's look real quickly at the pricing model and what's included in the free tier. For queries, what they're calling analysis, the first terabyte each month is free, which is the same as what you get normally with non-machine-learning queries in BigQuery. Analysis covers anything other than actually creating the model: any predictions using the model, and any inspection or evaluation of the model, which we'll go through in a little bit. That's all included in that terabyte.
And then you get 10 gigabytes of data processed per month for free for creating models. That's obviously a lot less than a terabyte, but model-creation queries aren't necessarily huge, and that's still pretty generous; you can do an awful lot with that 10 gigabytes per month on the free tier. With that said, why don't we start with a sample project we can walk through, making some predictions?
So I have here a sample data set that I've gotten from Kaggle. If you're not familiar with Kaggle, it's a fantastic source of all kinds of data sets. They have contests and that sort of thing, if that's something that you're interested in. But they also have classification data sets, image data sets, and all kinds of other things that are really great for getting started with machine learning. This one is a sample data set for predicting churn for a telco company. So I downloaded it and imported it into BigQuery, and we can look at the categories, the features that are in this data set.
So what we're going to be predicting is churn, which is all the way down here. You can see it's just a yes/no category. One thing that I actually like about this data set is that it needs to be cleaned up; there are a lot of things that are less than perfect about it. Obviously, there's a little bit of inconsistency in the capitalization of the columns, which isn't a huge deal. You can also see that some of these categories, like senior citizen here, are Boolean columns. This one uses 0 and 1, while the one for dependents, also a Boolean column, uses yes/no instead. And for some of these it's a little bit hard to tell initially: multiple lines looks like it's going to be a yes/no, but it's actually categorical data with some other options.
If we look over here, I've done a little bit of initial analysis on a breakdown of how many values are available here. So churn, this is what we're predicting, and this is the class breakdown. When you're starting a machine learning project, you want to be cognizant of the class breakdown on the category that you're predicting, because some of your diagnostics are going to behave differently depending on whether the classes are close to 50/50 or, like this one, a little bit more lopsided.
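For reference, a class-breakdown query along these lines is a quick way to get those counts (a sketch; the table name `mydataset.telco` is a placeholder, and `Churn` is the column name as it appears in the Kaggle CSV):

```sql
-- Count each churn value and its share of the total
SELECT
  Churn,
  COUNT(*) AS n,
  ROUND(COUNT(*) / SUM(COUNT(*)) OVER (), 3) AS fraction
FROM mydataset.telco
GROUP BY Churn
ORDER BY n DESC;
```

With a lopsided breakdown like this one, plain accuracy can look better than the model really is, which is why metrics like precision and recall are worth watching too.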
So you can see there's a lot of categorical data that we're looking at here. Device protection, multiple lines, whether the customer uses online backup, online security, paperless billing. The payment method, whether they've purchased streaming movies or streaming TV, tech support. As for total charges and monthly charges, I think you could actually make a case that while they are numbers, they could almost be categorical data, because there's only… What's this, two, four, six, eight, ten values.
In the real world, a telco company would have lots of different pricing options. This particular sample company I guess doesn’t have very many prices or doesn’t have very many products that they offer, so there’s only a small number of tiers. But I think we’ll go ahead and leave this as a numerical value. This is one of the actual true numerical columns, this is just tenure. I’m guessing this is months of tenure. And then the contract type, and then this is another Boolean column of whether they have phone service.
The next thing that we're going to do is create our train/test split. In other words, we're going to take the entire telco table, which has about 7,000 records in it, and break it into roughly an 80/20 split. We're going to take 80% of that data and use it to train our model, and the remaining 20% to test the model, to see how accurate it is on data that it hasn't seen before.
BigQuery ML doesn't really, that I've seen at least, have a function to do that the way other machine learning libraries do. If you've done machine learning in Python, the scikit-learn library, for instance, has a train_test_split function for it. But there is a really great technique that I've seen, which I'll link in the description, that I'll walk you through right now. What it does is kind of a neat trick: it uses the FARM_FINGERPRINT function, which is basically just a hash.
So what this is going to do, it's going to take… Let's go back to our telco data. If you see this customer ID column, it's going to take that column and run it through FARM_FINGERPRINT, which is a hash that reduces whatever is inside that column down to an integer. If that integer is divisible by five, it's going to place that row into our test table. We're basically going to run two queries. We're going to run it once, taking all the rows whose hash is divisible by five, and those are going to go into our test table.
And then we're going to run it again, taking all the rows that are not in our test table, and that's going to wind up being our training table. You can see we're creating our table; our main table I just called telco, so now I'm making one called telco_test, and I'm selecting everything from telco where MOD, which gives us a remainder, of the absolute value of FARM_FINGERPRINT of the customer ID, divided by five, equals zero. So I'm going to run that, and it's going to make a new table.
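The pair of split queries described here looks roughly like this (a sketch; `mydataset` is a placeholder and `customerID` is the column name from the Kaggle CSV):

```sql
-- Test set: rows whose hashed customer ID is divisible by 5 (~20%)
CREATE TABLE mydataset.telco_test AS
SELECT *
FROM mydataset.telco
WHERE MOD(ABS(FARM_FINGERPRINT(customerID)), 5) = 0;

-- Training set: everything else (~80%)
CREATE TABLE mydataset.telco_train AS
SELECT *
FROM mydataset.telco
WHERE MOD(ABS(FARM_FINGERPRINT(customerID)), 5) != 0;
```

Because FARM_FINGERPRINT is deterministic, re-running these queries reproduces exactly the same split, which a RAND()-based split wouldn't.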
So telco had 7,043 records, and telco_test has 1,423, so just about exactly 20%. Now we want to build a train table that has everything from telco that's not in telco_test; that has 5,620. The next thing that we want to do is build out our query, which is essentially the feature selection that we're going to pass to the model. If this was a real machine learning project, we would do an actual feature selection exercise, or we would run some functions to see which features are exerting the most influence on the target variable.
For right now, I'm just going to use the ones pertaining to the service lines for the company, so let's go ahead and look at this. I'll show you what I have here. BigQuery ML wants you to call the target variable label. If for some reason you can't use label, say you already have a column in your table that's named label, then when you're creating the model you can pass an option that says, "This is the column I actually want to use as my label." Otherwise, you can just name it label. For the other columns, I'm just passing them through using this IFNULL statement: if a value is empty for some reason, I'm passing an empty string; otherwise I'm letting the actual value flow through. And then BigQuery ML is naturally just one-hot encoding these, since they're categorical variables.
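The feature query being described looks something like this (a sketch showing a subset of the service columns; the column names follow the Kaggle CSV, and `mydataset` is a placeholder):

```sql
-- The target column must be named "label"
-- (or be declared with the input_label_cols model option)
SELECT
  IF(Churn = 'Yes', 1, 0) AS label,
  IFNULL(PhoneService, '') AS PhoneService,
  IFNULL(MultipleLines, '') AS MultipleLines,
  IFNULL(InternetService, '') AS InternetService,
  IFNULL(OnlineSecurity, '') AS OnlineSecurity,
  IFNULL(OnlineBackup, '') AS OnlineBackup,
  IFNULL(DeviceProtection, '') AS DeviceProtection,
  IFNULL(TechSupport, '') AS TechSupport,
  IFNULL(StreamingTV, '') AS StreamingTV,
  IFNULL(StreamingMovies, '') AS StreamingMovies
FROM mydataset.telco_train;
```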
If you're familiar with how one-hot encoding works, it's creating, behind the scenes… Let me go back to the data. Let's use multiple lines, for instance. Multiple lines has three values: no, yes, or no phone service. So under the hood, BigQuery ML is going to create three features: multiple lines no, multiple lines yes, and multiple lines no phone service. It's doing that for each of the values in all of these other features, and it does it automatically. So at this point, we're ready to do the actual model creation, so let me go back here and copy this.
So there are several different model types that we could pass in. The reason I chose logistic regression here is because of the type of thing churn is: it has just two options, and that's what logistic regression is for. If churn were a number, then we would use linear regression. Initially, I believe when BigQuery ML launched, it was just those two types, logistic or linear regression, but there are a whole number of different kinds of models you can choose from now. I believe TensorFlow is one option, you can do ARIMA for time-series forecasting, and there are models for recommendations and all kinds of other things.
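Creating the model is just a matter of wrapping the feature query in a CREATE MODEL statement; roughly (the model and dataset names are placeholders):

```sql
CREATE OR REPLACE MODEL mydataset.churn_model
OPTIONS (model_type = 'logistic_reg') AS
SELECT
  IF(Churn = 'Yes', 1, 0) AS label,
  IFNULL(MultipleLines, '') AS MultipleLines,
  IFNULL(InternetService, '') AS InternetService
  -- ...remaining feature columns...
FROM mydataset.telco_train;
```

For a numeric target you would swap in `model_type = 'linear_reg'` instead.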
One thing to keep in mind also is that the model you create does not have to stay inside of BigQuery. You can export the model to a Google Cloud Storage bucket and then use it outside. So if you create a TensorFlow model, you can then use it in Python or in your own code. That can also be a really cost-effective way of using a model that you create in BigQuery in your own project outside, so there are a lot of applications for the models that you're creating here. We have our fields created, so let's run this query and get an idea of what it looks like before we create the model.
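The export can be done directly in SQL with an EXPORT MODEL statement, along these lines (the bucket path is a placeholder):

```sql
-- Write the trained model out to a Cloud Storage bucket
EXPORT MODEL mydataset.churn_model
OPTIONS (URI = 'gs://my-bucket/churn_model/');
```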
And also we want to make sure… I had this on telco, but we want to make sure this is telco_train. So this looks good; let's go ahead and create the model. We have a really small data set, only about 5,600 records in the training set, so this isn't going to take more than a few seconds to run, and then the model is going to show up inside the data set here. So now it's the fun time where we're actually going to make some predictions on our test set and see what we get back. We're going to switch from our training set to our test set and look at the churn column, and we want to get some actual probabilities: not just whether the customers are going to churn, but how sure BigQuery ML is that they're going to churn or not.
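Evaluating against the held-out test set is the same feature query wrapped in ML.EVALUATE, roughly:

```sql
SELECT *
FROM ML.EVALUATE(
  MODEL mydataset.churn_model,
  (SELECT
     IF(Churn = 'Yes', 1, 0) AS label,
     IFNULL(MultipleLines, '') AS MultipleLines,
     IFNULL(InternetService, '') AS InternetService
     -- ...remaining feature columns...
   FROM mydataset.telco_test));
```

For a logistic regression model this returns metrics such as precision, recall, accuracy, f1_score, log_loss, and roc_auc.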
So a couple of quick changes to our last query: we're going to switch from ML.EVALUATE to ML.PREDICT, and then we're going to switch from train to test. Let's run that and drag that up. Here's what changed. So predicted label, and remember, zero is not churn and one is churn; that's how we want to read this. Label is what the customer actually did, and predicted label is what BigQuery ML thought they were going to do. And then the probabilities column shows how sure it was. So for instance, this one, it thought it was going to be a zero, it was a zero, and it was 95% sure that it was going to be a zero. And the same thing for this one: it thought it was going to be a zero, it was a zero, and it was 81% sure.
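The prediction query has the same shape, swapping ML.EVALUATE for ML.PREDICT (a sketch):

```sql
SELECT
  predicted_label,
  predicted_label_probs,  -- array of (label, prob) structs
  label
FROM ML.PREDICT(
  MODEL mydataset.churn_model,
  (SELECT
     IF(Churn = 'Yes', 1, 0) AS label,
     IFNULL(MultipleLines, '') AS MultipleLines,
     IFNULL(InternetService, '') AS InternetService
     -- ...remaining feature columns...
   FROM mydataset.telco_test));
```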
One thing that's interesting, let's see if we can find the ones where it was incorrect and see what the probabilities are. So if we filter to WHERE predicted_label does not equal label, I wonder if there are going to be some that are closer to 50/50, or if it was really sure about them and wrong. It looks like actually, a lot of these are pretty close. Here's one that was 54/45, a 64/35, a 61/38, a 72/27. Some of these, it was actually really on the fence about.
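Filtering down to the misclassified rows is one extra WHERE clause on the prediction query (sketch):

```sql
SELECT predicted_label, label, predicted_label_probs
FROM ML.PREDICT(
  MODEL mydataset.churn_model,
  (SELECT
     IF(Churn = 'Yes', 1, 0) AS label,
     IFNULL(MultipleLines, '') AS MultipleLines
     -- ...remaining feature columns...
   FROM mydataset.telco_test))
WHERE predicted_label != label;
```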
So that’s an initial look at BigQuery ML, I hope that was helpful. It’s something that we’re going to be definitely talking about in future videos. Thank you for watching, I know this was a little bit of a long video, but I hope this was helpful. If you’re interested, please subscribe for more on BigQuery, Google Analytics, and Google Tag Manager.