Brewing up custom ML models on AWS SageMaker

By Kimberly Cook | Jul 30, 2018

I recently fell in love with SageMaker, simply because it is so convenient! I really love their approach of hiding all the infrastructural needs from the customer and letting them focus on the more important ML aspects of their solutions. A few clicks and a bit of typing here and there and, voilà, you've got a production-ready model ready to take on thousands (if not millions) of requests a day. If you need a good introduction to SageMaker, see the following video by none other than Amazon!

So what can possibly go wrong?
But trouble can strike when you're trying to set up and create your own models in your own Docker container to perform custom operations! It's not as straightforward and smooth-flowing as building everything with SageMaker from the beginning.

Why would you need to have your custom models?
There can be many reasons why you need your own custom model. You might be:

  • Using specific Python library versions instead of the latest (e.g. TensorFlow)
  • Using libraries unavailable on SageMaker

Before continuing ...
Before going forward, make sure you have the following.

  • Docker installed and running in your OS
  • Basic knowledge of how Docker works

How do we do this?
Now with a good context behind us, let us plough through the details of getting things set up for SageMaker. The tutorial is going to have three different sections.

  • Create a docker image with your code
  • Testing the docker container locally
  • Deploying the image on Amazon ECR (Elastic Container Repository)

Let me flesh these points out here. First you create a Docker image with the libraries, code and other requirements (e.g. access to ports). Then you run a container from that image and test the code/models with a small chunk of data in the container. After successfully testing, you upload the Docker image to ECR. Then you can specify this image as the ML model and use it for training/prediction through Amazon SageMaker.

Also, I'll be using this tutorial/guide as the frame of reference for this blog. It's a really good tutorial. There are a few reasons I thought of reinventing that blog post:

  • It's a good tutorial if you're relying only on scikit-learn. I thought of creating a container with XGBoost, so we'll have to do some tinkering with our Docker container.
  • I want Python 3 not Python 2 for obvious reasons.
  • I also feel like some details are missing here and there (especially when it comes to testing locally).

And to demonstrate this process, I'll be training an XGBoost classifier on the iris dataset. You can find the GitHub repository with all the code here.

Overview of Docker
You know what else is amazing besides SageMaker? Docker. Docker is extremely powerful, portable and fast. But this is not the place to discuss why, so let's dive straight into setting things up. When working with Docker you follow a clear set of steps:

  • Create a folder with code/models and a special file called Dockerfile that has the recipe for creating the Docker image
  • Create a Docker image by running docker build -t <image-tag> .
  • Run the image with docker run <image>
  • Push the Docker image to a registry that will store it (e.g. Docker Hub or an AWS ECR repository) using docker push <image-tag>
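
Concretely, with the xgboost-tut image we'll build later in this post (and a hypothetical myrepo/ registry path standing in for Docker Hub or ECR), those steps look like:

```bash
# 1. Build the image from the Dockerfile in the current folder
docker build -t xgboost-tut .

# 2. Run a container from that image (here, executing its train entrypoint)
docker run xgboost-tut train

# 3. Re-tag and push the image to a registry (myrepo/ is a placeholder)
docker tag xgboost-tut:latest myrepo/xgboost-tut:latest
docker push myrepo/xgboost-tut:latest
```

These commands need a running Docker daemon; we'll walk through each of them in detail below.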

Overview of SageMaker compatible Docker containers
Note that SageMaker requires the image to have a specific folder structure, which is as follows. Mainly there are two parent folders: /opt/program, where the code is, and /opt/ml, where the artefacts are. And note that I've blurred out some files that you probably won't need to edit... ever, and that are outside the scope of this tutorial.


Let's now discuss each of these entities in detail. First, /opt/ml is where all the artefacts are going to be stored. Let's talk about each of the subdirectories now.

Directory: /opt/ml
input/data is the directory where the data for your model is stored. It can be any data-related file (given that your Python code can read it and the container has the required libraries to do so). Here <channel_name> is the name of some consumable input source that will be used by the model.

model is where the model will reside. You can either have the model in the container itself, or you can specify a URL (an S3 bucket location) where the model artefacts reside as a tar.gz file. For example, if you have the model artefacts in an Amazon S3 bucket, you can point to that bucket during model setup on SageMaker. These model artefacts will then be copied to the model directory when your model is up and running.

Finally, output is the directory which will store the reasons for failure of a request/task, if it fails.
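
Putting the three subdirectories together, the layout the container sees under /opt/ml looks like this:

```
/opt/ml
├── input
│   └── data
│       └── <channel_name>   # input files for the channel (e.g. training)
├── model                    # the trained model artefacts live here
└── output
    └── failure              # failure reasons are written here, if any
```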

Directory: /opt/program
Let's now dive into the cream of our model: the algorithm. This should be available in the /opt/program directory of our Docker container. There are three main files we need to be careful about: train, serve and predictor.py.

train holds the logic for training the model and storing the trained model. If the train file runs without failures, it will save a model (i.e. a pickle file) to the /opt/ml/model directory.
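
To make that concrete, here is a minimal, dependency-free sketch of the skeleton a train file follows: read from the input channel, fit, pickle the model, and record any exception in the failure file. A plain dict stands in for the real XGBoost model so the sketch stays self-contained; the paths follow the SageMaker layout described above.

```python
import os
import pickle
import traceback

def train(prefix="/opt/ml"):
    """Skeleton of a SageMaker train entrypoint. Returns 0 on success, 1 on failure."""
    input_dir = os.path.join(prefix, "input", "data", "training")
    model_dir = os.path.join(prefix, "model")
    output_dir = os.path.join(prefix, "output")
    try:
        # In the real container: read the CSVs in input_dir and fit an
        # XGBClassifier. A plain dict stands in for the fitted model here.
        training_files = os.listdir(input_dir)
        model = {"trained_on": training_files}

        os.makedirs(model_dir, exist_ok=True)
        with open(os.path.join(model_dir, "xgboost-model.pkl"), "wb") as f:
            pickle.dump(model, f)
        print("Training complete.")
        return 0
    except Exception:
        # SageMaker surfaces the contents of this file as the failure reason,
        # so write as descriptive a message as you can.
        os.makedirs(output_dir, exist_ok=True)
        with open(os.path.join(output_dir, "failure"), "w") as f:
            f.write(traceback.format_exc())
        return 1
```

The key conventions are the ones SageMaker relies on: the pickled model lands in /opt/ml/model, and any exception text lands in /opt/ml/output/failure.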

serve essentially runs the logic written in predictor.py as a web service using Flask, which will listen for any incoming requests, invoke the model, make the predictions, and return a response with the predictions.
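
The real serve stack is nginx + gunicorn + a Flask app, but the contract is simple enough to sketch with the standard library alone: SageMaker probes GET /ping for health and sends prediction requests to POST /invocations. In this stand-in, the hard-coded setosa reply takes the place of the unpickled model's prediction.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class PredictorHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/ping":
            # Health check: a 200 tells SageMaker the container is ready.
            self.send_response(200)
            self.end_headers()
        else:
            self.send_response(404)
            self.end_headers()

    def do_POST(self):
        if self.path == "/invocations":
            length = int(self.headers.get("Content-Length", 0))
            rows = self.rfile.read(length).decode("utf-8").splitlines()
            # In the real predictor.py: parse the CSV rows and call the
            # unpickled XGBoost model. We return one stand-in label per row.
            body = "\n".join("setosa" for _ in rows).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/csv")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet

def make_server(port=8080):
    """Build the HTTP server; call .serve_forever() on it to start serving."""
    return HTTPServer(("127.0.0.1", port), PredictorHandler)
```

Flask makes the same two routes nicer to write, which is why the awslabs template uses it; the endpoints and response shapes are what matter to SageMaker.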

Dockerfile
This is the file that underpins what's going to be available in your Docker container, which means it is of utmost importance. So let's take a peek inside. It's quite straightforward if you're already familiar with how to write a Dockerfile. But let me give you a brief tour anyway.

  • The FROM instruction specifies a base image. Here we are using an already-built Ubuntu image as our base image.
  • Next, using the RUN command, we install several packages (including Python 3.5) using apt-get install.
  • Then, again using the RUN command, we install pip and, following that, numpy, scipy, scikit-learn, pandas, flask, etc.
  • Subsequently, we set several environment variables within the Docker container using the ENV command. We need to append our /opt/program directory to the PATH variable so that, when we invoke the container, it will know where our algorithm-related files are.
  • Last but not least, we COPY the folder containing the algorithm-related files to the /opt/program directory and then set that to be the WORKDIR.
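
Putting those steps together, a skeleton of such a Dockerfile might look like the following. The ubuntu:16.04 tag and the two Python-related environment variables are assumptions following the awslabs example; the exact package versions we pin for this tutorial appear later in the post.

```dockerfile
# Base image: a stock Ubuntu (the 16.04 tag is an assumption)
FROM ubuntu:16.04

# System packages, including Python 3.5
RUN apt-get -y update && apt-get install -y --no-install-recommends \
    wget python3.5 nginx ca-certificates libgcc-5-dev \
    && rm -rf /var/lib/apt/lists/*

# pip, then the ML and serving libraries (pin versions in practice)
RUN wget https://bootstrap.pypa.io/3.3/get-pip.py && python3.5 get-pip.py && \
    pip3 install numpy scipy scikit-learn xgboost pandas flask gevent gunicorn

# Don't buffer stdout (so logs appear promptly) and put our code on the PATH
ENV PYTHONUNBUFFERED=TRUE PYTHONDONTWRITEBYTECODE=TRUE PATH="/opt/program:${PATH}"

# Copy the algorithm files in and make them the working directory
COPY xgboost /opt/program
WORKDIR /opt/program
```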

Creating our own Docker container
First, I'm going to use a modified version (link here) of the amazing package provided in the awslabs GitHub repository. The original repository has all the files we need to run our SageMaker model, so it's a matter of editing the files to fit our requirements. Download the content found in the original link to a folder called xgboost-aws-container if you want to start from scratch; otherwise, you can fiddle around with my version of the repository.

Note: If you're a Windows user, and you're one of those unfortunates running the outdated Docker Toolbox, make sure you use some directory in the C:\Users directory as your project home folder. Otherwise, you'll run into a very ugly experience mounting the folder to the container.
Changes to the existing files
  1. Rename the decision-trees folder to xgboost
  2. Edit the train file as provided in the repository. What I've essentially done is import xgboost and replace the decision tree model with an XGBClassifier model. Note that whenever there is an exception, it will be written to the failure file in the /opt/ml/output folder. So you are free to include as many descriptive exceptions as you want, to make sure you know what went wrong if the program fails.
  3. Edit the predictor.py file as provided in the repository. Essentially, what I've done is similar to the changes made to train: I imported xgboost and changed the classifier to an XGBClassifier.
  4. Open up your Dockerfile and make the following edits.
Instead of python we use python3.5, and we also add libgcc-5-dev as it is required by xgboost.

RUN apt-get -y update && apt-get install -y --no-install-recommends \
    wget \
    python3.5 \
    nginx \
    ca-certificates \
    libgcc-5-dev \
    && rm -rf /var/lib/apt/lists/*

We are going to ask for specific versions of numpy, scikit-learn, pandas and xgboost to make sure they are compatible with each other. The other great thing about specifying the versions of the libraries you use is that you know your code won't break just because a new version of some library is incompatible with it.

RUN wget https://bootstrap.pypa.io/3.3/get-pip.py && python3.5 get-pip.py && \
    pip3 install numpy==1.14.3 scipy scikit-learn==0.19.1 xgboost==0.72.1 pandas==0.22.0 flask gevent gunicorn && \
    (cd /usr/local/lib/python3.5/dist-packages/scipy/.libs; rm *; ln ../../numpy/.libs/* .) && \
    rm -rf /root/.cache

Then we're going to change the COPY command to the following

COPY xgboost /opt/program

Building the Docker image
Now open your Docker terminal (on Windows; otherwise, the OS terminal) and head to the parent directory of the package. Then run the following command.

docker build -t xgboost-tut .

This should build the image with everything we need. Make sure the image is built by running,

docker images

You should see something like the following.


Running the Docker container to train the model
Now it's time to run the container, and fire away the following command.

docker run --rm -v $(pwd)/local_test/test_dir:/opt/ml xgboost-tut train

Let's break this command down.

--rm : Means the container will be removed when it exits

-v <host location>:<container location>: Mounts a volume at a desired location in the container. Warning: Windows users, you'll run into trouble if you choose anything outside C:\Users.

xgboost-tut: Name of the image

train: When the container starts, it will automatically run the train file from the /opt/program directory. This is why specifying /opt/program as part of the PATH variable is important.

Things should run fine and you should see an output similar to the following.

Starting the training.
Training complete.

You should also see the xgboost-model.pkl file in your <project_home>/local_test/test_dir/model directory. This is because we mounted the local_test/test_dir directory to the container's /opt/ml, so whatever happens in /opt/ml is reflected in test_dir.

Testing the container locally for serving
Next, we're going to see if the serving (inference) logic is functioning properly. Now let me warn you here again, in case you missed it above! If you're a Windows user, be careful about mounting the volume correctly. To avoid any unnecessary issues, make sure you choose a folder within C:\Users as your project home directory.

docker run --rm --network=host -v $(pwd)/local_test/test_dir:/opt/ml xgboost-tut serve

Let me point out a special option that we specify in the Docker run command.

--network=host : Means the container shares the network stack of the host, so it will be like running the service on the local machine. This is needed to check whether the API calls are working fine.

Note: I'm using --network=host because -p <host_ip>:<host_port>:<container_port> did not work (at least on Windows). I recommend using the -p option (if it works), as shown below. Warning: use only one of these commands, not both. I'm going to assume the --network=host option going forward.

docker run --rm -p 127.0.0.1:8080:8080 -v $(pwd)/local_test/test_dir:/opt/ml xgboost-tut serve

serve : This is the file which invokes the inference logic

This should show you an output similar to below.

Now, to test whether we can successfully ping the service, run the following command (in a separate terminal window).

curl http://<docker_ip>:8080/ping

You can find out the Docker machine's IP by running

docker-machine ip default

This ping command should spawn two messages, on both host side and the server side. Something like below.

If all of this went smoothly until this point (I dearly hope so), congratulations! You've almost set up a SageMaker-compatible Docker image. There's just one more thing we need to do before taking it live.

Now let's try something more exciting: making a prediction through our web service. For this we're going to use the predict.sh file located in the local_test folder. Note that I've adapted it to suit my requirements, meaning it's different from the one provided in the original awslabs repository. Specifically, I introduced a new argument that takes in the IP address and the port, in addition to the ones taken by the original file. We call the modified predict.sh file using the following command.

./predict.sh <container_ip>:<port> payload.csv text/csv

Here we are making a call to the inference web service using the data in payload.csv and saying it's a CSV file. It should return the following, which says the data point was identified as belonging to the class setosa.

* timeout on name lookup is not supported
* Trying <container_ip>...
* TCP_NODELAY set
* Connected to <container_ip> (<container_ip>) port <port> (#0)
> POST /invocations HTTP/1.1
> Host: <container_ip>:<port>
> User-Agent: curl/7.55.0
> Accept: */*
> Content-Type: text/csv
> Content-Length: 23
>
* upload completely sent off: 23 out of 23 bytes
< HTTP/1.1 200 OK
< Server: nginx/1.10.3 (Ubuntu)
< Date: <date and time> GMT
< Content-Type: text/csv; charset=utf-8
< Content-Length: 7
< Connection: keep-alive
<
setosa
* Connection #0 to host <container_ip> left intact

Pushing it up to the ECR
Okay! So the hard work has finally paid off. It's time to push our image to the Amazon Elastic Container Repository (ECR). Before that, make sure you have a repository created in ECR to push the images to. It's quite straightforward if you have an AWS account.

Go to the ECR service from the AWS dashboard and click "Create repository".

Once you create the repository, within the repository, you should be able to see the instruction to complete the push to ECR.

Note: You can also use the build_and_push.sh provided in the repository. But I personally feel more comfortable doing things myself, and it's not really that many steps to push the image.

First you need to get the credentials to log in to the ECR:

aws ecr get-login --no-include-email --region <region>

which should return an output like

docker login ...

Copy and paste that command, and you should now be logged into the ECR. Next you need to re-tag your image to be able to push it to the ECR correctly.

docker tag xgboost-tut:latest <account>.dkr.ecr.<region>.amazonaws.com/xgboost-tut:latest

Now it's time to push the image to your repository.

docker push <account>.dkr.ecr.<region>.amazonaws.com/xgboost-tut:latest

Now the image should appear in your ECR repository with the tag latest. So the hard part is done. Next you need to create a SageMaker model and point it to the image, which is as straightforward as creating a model with SageMaker itself, so I won't stretch the blog post with those details.

You can find the GitHub repository with all the code here.

Conclusion
It was a long journey, but a fruitful one (in my opinion). So we did the following in this tutorial.

  • First we understood why we might need to make our own custom models.
  • Then we examined the structure of the Docker container required by SageMaker to be able to run the container.
  • We then discussed how to create a Docker image of the container.
  • This was followed by how to build the image and run the container.
  • Next we discussed how to test the container on the local computer before pushing it out.
  • Finally, we discussed how to push the image to ECR to make it available for consumption through SageMaker.


The article was originally published here

Source: HOB