Linear Regression with AWS SageMaker

I recently came across one of the new products from AWS — Amazon SageMaker.

Amazon SageMaker is a fully-managed platform that enables developers and data scientists to quickly and easily build, train, and deploy machine learning models at any scale.

I’ve used some of the ML models that AWS has provided in the past for linear regression and wasn’t exactly overwhelmed. SageMaker, however, has a couple of features that look really promising.

Perhaps the best feature of SageMaker is hosted Jupyter Notebooks. I love the integration of markdown and graphs/visuals with my code when exploring models.

I should preface this tutorial with two statements:

  1. I claim no expertise in Python or ML. If you come across an error, misunderstanding or “bad” way of writing Python please let me know. I write ML algorithms as a hobby, not as a career.
  2. I’m writing this as I’m looking at SageMaker for the first time, so there may be “better” ways of doing things.

Anyway, let’s get stuck into a simple example of linear regression with SageMaker!

What We Will Build

When testing pre-built ML algorithms, I like to perform a run to test the convergence/performance of the model by giving it data that I know has a perfectly linear relationship.

In this example I am going to take some AFL (Australian Rules Football) Match Data and try to find the relationship between match statistics and Fantasy Points. Fantasy Points are calculated by a linear combination of a subset of match stats. The stats and linear weightings are public and known, which means we can test how close the model gets to finding them.

The data set I will use for this example can be found here in csv form.

Getting Set Up

Step 1: Set up S3 and IAM

In order to use SageMaker you will need an S3 bucket to store models, data and results in. You should make sure you have a bucket created.

For the purposes of this exercise you could use test.sagemaker.your.name

You will also need to set up an IAM user role so that SageMaker can access the S3 bucket to read/write data.

Choose any name for this IAM access role; just make sure you give this role programmatic access.

In the permissions section for this user you can navigate to “Attach existing policies directly” and search for AmazonS3FullAccess and AmazonSageMakerFullAccess.

Once you’ve created the IAM role, copy its ARN, which is shown on the summary page for that role. You will need this to set up a Notebook instance in SageMaker.

Step 2: Create a Notebook Instance in SageMaker

Simply hit Create New Instance from the SageMaker dashboard and give your Notebook instance a name. In the IAM role input you will want to select Enter a custom IAM role ARN and paste in the ARN from the role we created earlier. This should be all that is required to start the instance and you can hit Create Notebook Instance.

It takes a few minutes for the instance to provision, but once it is ready you will be able to open up the notebook and see an instance of Jupyter Notebooks open in your browser.

You’ll notice a couple of tabs within your Notebook instance. The SageMaker Examples are a great resource for reading through the implementation of a couple of examples of the SageMaker models.

Step 3: Create a new .ipynb (Notebook)

To get started we are going to create a new Notebook and start writing some code. In the top right corner you will see a button New to create a new notebook. I’m going to use a conda_python3 notebook for this example.

Model Set Up

The first thing we need to do is import the dataset. As we are working with a small dataset here for testing purposes, I uploaded my .csv directly into the Jupyter instance instead of S3. You can do this via the upload button in the main Jupyter dashboard.

Then we can access the csv in our code as follows
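
A minimal sketch of that step, assuming pandas is available in the notebook kernel. The CSV text and column names below are made-up stand-ins so the snippet runs on its own; in the notebook you would pass the name of the file you uploaded to pd.read_csv instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the uploaded file. With the real data this would be
# something like pd.read_csv("your_file.csv") (file name is an assumption).
csv_text = """player,kicks,handballs,fantasy_points
Player A,10,5,40
Player B,7,12,45
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.head())
```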

If you’ve never worked with a Python notebook before, you just need to hit Shift+Enter to execute the code within the block.

To verify that the csv was read correctly you can execute df.head() to get a list of the top 5 entries in your dataframe.

My csv has a lot of data that we don’t need right now, so we should create a dataframe with only the information we care about. Let’s create a new pandas df with only the columns we require for the exercise.
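
Selecting the columns could look roughly like this. The column names and the two rows are assumptions made up for illustration (the real names depend on the CSV), and the stats are listed in the same order as the scoring formula further down.

```python
import pandas as pd

# Toy frame standing in for the full CSV; names and values are made up.
df = pd.DataFrame({
    "player": ["Player A", "Player B"],
    "round": [1, 1],
    "kicks": [10, 7],
    "handballs": [5, 12],
    "marks": [4, 2],
    "tackles": [3, 5],
    "goals": [1, 0],
    "behinds": [2, 1],
    "hit_outs": [0, 20],
    "frees_for": [1, 0],
    "frees_against": [2, 1],
    "fantasy_points": [67, 89],
})

stat_columns = ["kicks", "handballs", "marks", "tackles", "goals",
                "behinds", "hit_outs", "frees_for", "frees_against"]

# Keep the nine stats plus the target column, and drop everything else.
stats = df[stat_columns + ["fantasy_points"]].copy()
print(stats.head())
```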

We now have an array of all the relevant player stats for every game of AFL in the 2018 season so far as well as the Fantasy Points that the player scored.

Now AFL fantasy points are calculated by the following formula:

Kick (3), Handball (2), Mark (3), Tackle (4), Goal (6), Behind (1), Hit Out (1), Free Kick For (1), Free Kick Against (-3)

I’ve ordered these in the same order as our array so that we can create a weightings array in this order.
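
As a NumPy array, the weightings in that order might be written like this (the variable name is just a choice for this sketch):

```python
import numpy as np

# Kick, Handball, Mark, Tackle, Goal, Behind, Hit Out,
# Free Kick For, Free Kick Against
weights = np.array([3, 2, 3, 4, 6, 1, 1, 1, -3], dtype=np.float32)
print(weights.shape)
```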

Before we run any ML algorithms we should verify that our data and weighting array are valid. Let’s write a simple function to confirm this.

This function will take an array of player stats and a vector of weights and multiply each stat by the relevant weight and sum them together to give us calculated Fantasy Points.
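
One way to write such a function is as a dot product. The function name and the example stat line are made up for this sketch.

```python
import numpy as np


def calculate_fantasy_points(player_stats, weights):
    """Multiply each stat by its weight and sum the products."""
    return float(np.dot(player_stats, weights))


weights = np.array([3, 2, 3, 4, 6, 1, 1, 1, -3])

# Made-up stat line: 10 kicks, 5 handballs, 4 marks, 3 tackles, 1 goal,
# 2 behinds, 0 hit outs, 1 free kick for, 2 free kicks against.
example_stats = np.array([10, 5, 4, 3, 1, 2, 0, 1, 2])
print(calculate_fantasy_points(example_stats, weights))  # 67.0
```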

Now we can calculate fantasy points based on the weightings vector we have created and verify that they are indeed the correct weights.
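
To check every row at once, a matrix product does the same calculation for the whole dataset. This sketch uses two made-up rows in place of the real data and counts how many calculated totals disagree with the recorded ones.

```python
import numpy as np

weights = np.array([3, 2, 3, 4, 6, 1, 1, 1, -3])

# Made-up stat lines (one row per player-game) and their recorded points.
stats = np.array([
    [10, 5, 4, 3, 1, 2, 0, 1, 2],
    [7, 12, 2, 5, 0, 1, 20, 0, 1],
])
recorded_points = np.array([67, 89])

calculated_points = stats @ weights

# Indices of any rows where the calculated total differs from the recorded one.
mismatches = np.flatnonzero(~np.isclose(calculated_points, recorded_points))
print(len(mismatches))  # 0 means the weightings reproduce every row
```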

At this stage we see that indeed, the weighting vector we created above is correct and does generate the Fantasy Points we would expect. The next step is to see if the SageMaker Linear Learner can find that weighting vector if it was unknown to us.

Using SageMaker Linear Learner

The first thing we need to do is to prepare the data in a format that SageMaker can use. The Linear Learner requires a numpy array of type float32.
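
A small sketch of that conversion, again with made-up rows standing in for the real dataframe values. Splitting into a feature matrix and a label vector is an assumption about how the data is handed to the learner.

```python
import numpy as np

# Made-up stat lines and Fantasy Points standing in for the real data.
stats = np.array([
    [10, 5, 4, 3, 1, 2, 0, 1, 2],
    [7, 12, 2, 5, 0, 1, 20, 0, 1],
])
points = np.array([67, 89])

features = stats.astype("float32")  # one row per player-game, one column per stat
labels = points.astype("float32")   # the value the model should learn to predict

print(features.dtype, features.shape, labels.shape)
```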

Next we need to import some libraries to communicate with the ML instances.

Now that we’ve done some setup and configuration, we can look at running the model.

Next we need to set some parameters for the model. Specifically, we need to tell the linear learner that we have 9 features to fit, that we want a regression model and, most importantly, that we do not want to normalise the data.
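
Written out, those three settings map onto Linear Learner hyperparameters roughly as below. The names are the ones used in the Linear Learner documentation; how they are passed to the estimator depends on the SDK version, so only the settings themselves are shown here.

```python
# The three settings described above, as Linear Learner hyperparameters.
hyperparameters = {
    "feature_dim": 9,                # number of input stats per row
    "predictor_type": "regressor",   # regression rather than classification
    "normalize_data": False,         # leave the features unscaled
}
print(hyperparameters)
```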

Now we are ready to deploy our model to an instance to run the linear learner and get results. To deploy this model we simply run:
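
For reference, here is a hedged sketch of what training and deploying can look like with version 2 of the SageMaker Python SDK. That version is an assumption and may not match the SDK this post was written against. The function name and instance types are illustrative, and calling the function creates billable AWS resources, so it is only defined here, not run.

```python
def train_and_deploy(features, labels, role_arn,
                     train_instance_type="ml.c4.xlarge",
                     host_instance_type="ml.m4.xlarge"):
    """Sketch only: train a Linear Learner regressor and host it on an endpoint.

    Assumes SageMaker Python SDK v2 and configured AWS credentials.
    features and labels are float32 NumPy arrays.
    """
    # Imported lazily so this definition loads even without the SDK installed.
    import sagemaker

    learner = sagemaker.LinearLearner(
        role=role_arn,
        instance_count=1,
        instance_type=train_instance_type,
        predictor_type="regressor",
        normalize_data=False,
    )
    # record_set uploads the arrays to S3 in the format the algorithm expects.
    learner.fit(learner.record_set(features, labels=labels))
    # deploy provisions an endpoint and returns a predictor for it.
    return learner.deploy(initial_instance_count=1,
                          instance_type=host_instance_type)
```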

This will take a couple of minutes to provision and run and will let you know when it’s done.

Accessing the results

Once the model has been trained, we can send new data to the model and obtain predictions. In this case we are just going to send it the training data back and see how close it got to finding the correct weights.

To obtain predictions for a single data point we would do something like this

Let’s just pass all the data and get all the results back.
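
If the endpoint is set up to return JSON, a Linear Learner regression response carries one score per row sent. The helper below is a sketch under that assumption (depending on the SDK and serializer settings, the predictor may return a different format), and the example response is hand-written, not real endpoint output.

```python
def extract_scores(response):
    """Pull the predicted values out of a Linear Learner JSON response."""
    return [prediction["score"] for prediction in response["predictions"]]


# Hand-written response in the documented shape, for illustration only.
example_response = {"predictions": [{"score": 66.9}, {"score": 89.2}]}
print(extract_scores(example_response))
```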

Results

Surprisingly, the weightings found by the linear learner were not exact and had some small error. This could be due to the stopping criteria defaults in the model setup.
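
One way to put that small error in context is to compare against an ordinary least-squares fit run locally. This is not part of SageMaker, just a NumPy sketch on synthetic stat lines (random integers, not real match data). On perfectly linear data a direct solver recovers the weightings to floating-point precision, so any gap in the Linear Learner result presumably comes from how it trains rather than from the data.

```python
import numpy as np

weights = np.array([3, 2, 3, 4, 6, 1, 1, 1, -3], dtype=np.float64)

# Synthetic stat lines: 50 rows of random integers, seeded for repeatability.
rng = np.random.default_rng(0)
stats = rng.integers(0, 30, size=(50, 9)).astype(np.float64)
points = stats @ weights  # exact linear relationship, no noise

# Solve the least-squares problem directly.
recovered, *_ = np.linalg.lstsq(stats, points, rcond=None)
print(np.round(recovered, 6))
```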

The results were very close, however, and given the ease of setting up the model and the lack of domain knowledge required to run this simple regression, SageMaker seems to be a handy product.

I will play around tweaking parameters and investigating why the predictive accuracy was not 100% (for what is a very simple model) and write a follow up post.

Full-stack developer. In love with Typescript and Serverless
