
Python: APIs are friends not food – Confluence

Creating documentation from our code by our code

I hate having to do documentation as much as the next developer, especially when it’s for something as simple as the codes or underlying dependencies you use in your pipeline! Just imagine it: you’re working in your Agile ways, and you get given a task to incorporate a new feature which uses some of the business’s codes like ‘WEDPA’, ‘UUQL’ or some other strange hieroglyphic.

You get cracking with your development and your code speaks for itself, a work of art if you do say so yourself! You close the ticket and document how to use the feature, but realise you’ve not documented any of its dependencies or what they mean! What’s more, the business might change those codes, and you would have to update the documentation every time codes were added or changed…

The Solution

  1. Create a table to hold all of your dependencies, describing in detail what each one does (even better, grab that data from a static data table that whoever entered the data has already populated for you).
  2. Load this table into memory if you haven’t created it yourself in the pipeline (e.g. from database -> pandas).
  3. Push this table into a Confluence page where you store all of this information, so that it’s easily readable and visible to the business rather than just a CSV or a comment left somewhere in the code. (A sketch of the first two steps follows below; step 3 is shown once the wrapper has been introduced.)
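
As a rough sketch of steps 1 and 2 (the database, table and column names here are all hypothetical), pulling the dependency table into pandas and rendering it as HTML could look something like this:

In [ ]:
import sqlite3  # stand-in for whichever database holds your static data
import pandas as pd

# Steps 1/2: load the dependency table into memory (database -> pandas)
conn = sqlite3.connect('pipeline.db')
dependencies = pd.read_sql('SELECT code, description FROM dependency_codes', conn)

# Render the table as HTML, ready to push into a Confluence page (step 3)
html_table = dependencies.to_html(index=False)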

Demo Confluence environment

To get us up and running we can spin up two Docker containers, one running Confluence and the other running a Jupyter notebook. Be sure to follow the step-by-step instructions on getting your Confluence server up and running. It’s blissfully easy!
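
Those instructions do this from the command line; purely as an illustration, the same two containers could also be started from Python with the Docker SDK (pip install docker). The image names and ports below are assumptions based on the stock Atlassian and Jupyter images:

In [ ]:
import docker

client = docker.from_env()

# Confluence server, exposed on port 8090 (the port used throughout this post)
client.containers.run('atlassian/confluence-server', detach=True,
                      name='confluence', ports={'8090/tcp': 8090})

# Jupyter notebook server for running the example notebooks below
client.containers.run('jupyter/tensorflow-notebook', detach=True,
                      name='notebook', ports={'8888/tcp': 8888})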

Creating and using the Confluence API wrapper

I have created a quick and dirty Confluence wrapper, freely available on GitHub. If you have any issues please raise them, and pull requests are most welcome.

After installing it by running pip install git+https://github.com/ghandic/confluenceapi.git we should be good to go with the Jupyter notebooks.

First of all we will make our pages in Confluence, leaving some pages empty for the code to fill in and update on every production pipeline run.

Now we can add HTML content to that page by following the example notebook provided:

Updating pages in Confluence

Necessary imports

In [ ]:
import os
from confluenceapi import Confluence

Setting up our credentials

In [ ]:
conf_server = os.environ['CONFLUENCE_IP'] + ':8090'  # assumes CONFLUENCE_IP is set in your environment
credentials = ('admin', 'Password123')  # the demo server's admin login

Create a Confluence object ready to submit requests

In [ ]:
lc = Confluence(conf_server, credentials)

Add a page

In [ ]:
lc.add_page('Page about DS', 'Data Science')

Update a page with raw HTML

In [ ]:
lc.update_page('Page about DS', 'Data Science', '<h1 style="color:red;">This is a new title</h1>')
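
This is exactly the hook we need for the dependency table from the Solution section: pandas renders a DataFrame straight to HTML, so pushing up the table we loaded earlier becomes a one-liner (the page and space names here are just examples):

In [ ]:
# Publish the dependency table on every production pipeline run
lc.update_page('Pipeline dependency codes', 'Data Science', dependencies.to_html(index=False))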

Delete a page

In [ ]:
lc.delete_page('Page about DS', 'Data Science')

We may also want to document things by uploading files, maybe a picture (.png), a log file (.txt), etc. We can do this by using the following methods:

Attachments on pages in Confluence

Necessary imports

In [ ]:
import os
from confluenceapi import Confluence

Setting up our credentials

In [ ]:
conf_server = os.environ['CONFLUENCE_IP'] + ':8090'  # assumes CONFLUENCE_IP is set in your environment
credentials = ('admin', 'Password123')  # the demo server's admin login

Create a Confluence object ready to submit requests

In [ ]:
lc = Confluence(conf_server, credentials)

Add an attachment to our page

In [ ]:
lc.upload_attachment('demo.txt', 'Page about DS', 'Data Science', 'First upload!')

Update our attachments on our page

In [ ]:
lc.update_attachment('demo.txt', 'Page about DS', 'Data Science', 'Second upload!')

Delete an attachment on our page

In [ ]:
lc.delete_attachment('demo.txt', 'Page about DS', 'Data Science')

To see more examples, check out the full GitHub repo.

Docker: Creating a portable image recognition app with TensorFlow and Shiny

Disclaimer

I know I said I would be showing you how to retrain two neural nets to detect cats. However, I accidentally left a test running for a little longer than expected on Google Cloud ML and ran out of free credits! If any of you fine readers would like to send me some credits or hook me up with a GPU, I will get round to the next part of this blog series very swiftly! Until then, you can have a sneak peek at the third part of the blog series!

Docker

What the heck is Docker?

Docker is incredible; the more I use it, the more I love it. It allows you to create an environment that does not depend on your computer’s operating system. This means that if you create a Docker container with something like Python in it, you will be able to run that same container on a Mac, Windows or Linux PC. This has many advantages, a particular one being that anything you do inside the Docker container will be reproducible, and nobody has to worry about whether they need the Windows version or the Mac version etc. Docker containers also facilitate rapid development work for Data Scientists, Web Developers, Software Engineers… you name it! Here’s a list of some reputable companies currently using Docker: BBC News, PayPal, Expedia, General Electric, Google, Amazon. There are also countless other things you can do with Docker, such as:

    • Setting up a Docker swarm (or using Kubernetes) to orchestrate your containers.

    • Using Docker containers to run continuous integration (using tools such as Jenkins, Travis, etc).

    • Letting multiple users use the same Docker container at once! This makes browser-based IDEs like RStudio Server or JupyterHub very easy to set up and deploy.

How I have used Docker

First off, I’ll answer a slightly different question: ‘Why have I used Docker in this blog?’. The reason is that I wanted the content of my blog to be fully reproducible for anyone who reads it, and easily deployable by any keen readers too. The best way to do this, in my experience, was to ‘Dockerize’ anything I was going to publish. So in the process of creating this blog I only used Docker containers. I actually used three different containers for this development: one running Jupyter notebooks using jupyter/tensorflow-notebook (with a few pip installs here and there), another based on Rstudio’s rocker/geospatial, which I used to develop the App, and a final container based on rocker/shiny, which runs the finished Shiny App. See below for the Dockerfiles for each of these builds:

RStudio Dockerfile

Shiny Dockerfile

How to get in on the Docker action

If I have persuaded you to try out Docker, then follow the install instructions here, or click the links below for your operating system:

Windows Apple Ubuntu

Python – TensorFlow

To use the RCNN model, simply change the MODEL_NAME parameter to ‘faster_rcnn_resnet101_coco_11_06_2017’.
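
For context, in the notebook adapted from Google’s object detection demo the model is picked by this single variable; the SSD name below is the default that demo shipped with at the time (treat it as an assumption if the repo has since moved on):

In [ ]:
# Lightweight default from the TensorFlow object detection demo
MODEL_NAME = 'ssd_mobilenet_v1_coco_11_06_2017'
# For the more accurate (but slower) model, swap in:
# MODEL_NAME = 'faster_rcnn_resnet101_coco_11_06_2017'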

To see the Python code that I adapted from Google’s TensorFlow GitHub repo simply toggle the buttons above.

The two models that I decided to implement were a Faster-RCNN-ResNet and an SSD-MobileNet. These sound scary, but if we break down the names it might make a little more sense as to what they actually are.

Faster-RCNN-ResNet

This model can be broken down into three parts, as you can probably tell by the name! An RCNN is a Region-based Convolutional Neural Network: it proposes regions of an image to be passed through a Convolutional Neural Network to compute features, and these features are then passed through an SVM (Support Vector Machine) to be classified with an associated probability. This is easier to understand in a diagram:

Now, these models can take weeks to train, and to have a powerful model you would normally try to have a deep network. But the deeper the network, the more expensive it is to train. Luckily some boffins at Microsoft came up with the idea of using a Residual Network, or ResNet for short. These ResNets require fewer parameters (weights) than their regular counterparts and can therefore be used to create deeper models at a reduced expense. The residual block in the network essentially provides a kind of shortcut, which is possible when the input of the next block and the output of the previous block are of the same dimension.
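
To make the shortcut concrete, here is a toy sketch in plain numpy (nothing to do with the real ResNet layers, just the idea): the block’s input is added straight onto its output, which requires the two to have the same dimensions.

In [ ]:
import numpy as np

def relu(x):
    return np.maximum(0, x)

def residual_block(x, w1, w2):
    # F(x): two toy layers standing in for the block's convolutions
    fx = relu(x @ w1) @ w2
    # The residual shortcut: add the block's input straight onto F(x),
    # which only works when x and F(x) share the same dimensions
    return relu(fx + x)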

Finally, the Faster RCNN part; this should be obvious! It’s a faster version. Earlier RCNNs relied on external region-proposal algorithms like Selective Search or EdgeBoxes, which became the bottleneck; Faster RCNN instead generates proposals with a Region Proposal Network that shares convolutions with the detection network, making them almost free.

SSD-MobileNet

Let’s break this down again: an SSD is a Single Shot MultiBox Detector. It aims to be just as accurate as the Faster-RCNN-ResNet but much faster; in fact it clocks in at around 59 frames per second, compared to 7 FPS for the bulky Faster-RCNN-ResNet. It achieves this by ‘eliminating bounding box proposals and the subsequent pixel or feature resampling stage’, along with a few other improvements, such as ‘using a small convolutional filter to predict object categories and offsets in bounding box locations, using separate predictors (filters) for different aspect ratio detections, and applying these filters to multiple feature maps from the later stages of a network in order to perform detection at multiple scales’.

MobileNets are based on a streamlined architecture that uses depth-wise separable convolutions to build lightweight deep neural networks. The name points to an obvious use case: deploying models on devices with fewer resources, such as mobiles. This means it is perfect to deploy into a Shiny App! For a comparison in model size, if you download the Shiny App I created, the SSD-MobileNet takes up 29.2MB vs the Faster-RCNN-ResNet coming in at a whopping 196.9MB. I’d urge you to give the Shiny App a go so you can see how much faster the SSD-MobileNet is than the Faster-RCNN-ResNet model; it’s very impressive!

R – Shiny

Download Full Shiny App


To see the code that is used to create the Shiny App (minus the Python code) simply toggle the buttons above.

How to deploy this Shiny App on your machine

Note: to be able to run the Faster-RCNN-ResNet model from the Docker container, you may have to increase the memory allocated to the Docker virtual machine to around 5GB.

Final outcome