Bringing a Python Django app to Cloud Foundry in 2017

In this post I want to answer the question:
What do you have to do to run a Django web app on Cloud Foundry in 2017?

In the past, a few other people have described their approaches, but given that Cloud Foundry is continuously changing and improving, I thought it would be good to revisit the topic and learn about Python & Django support in 2017.

Python on Cloud Foundry

Cloud Foundry is a polyglot application deployment system. At Pivotal [disclosure: where I work, but not on the Cloud Foundry team], we put a lot of emphasis on how great a home Cloud Foundry is for Java Spring applications, and we’ve always been fond of Ruby on Rails.

That doesn’t mean other languages are hard to run on CF though. Following the example of Heroku, CF uses ‘buildpacks’ to provide official support for many languages, and community support for many more.

Python is an officially supported language for CF, and the official buildpack is maintained and updated by the buildpacks team. This gives me confidence that I can rely on the Python buildpack to have up-to-date interpreters and saves me the hassle of finding or creating a custom buildpack.

Pre-requisites

I’ve been going through the updated 2nd edition of the ‘Obey the Testing Goat book’ otherwise known as Test Driven Development With Python by Harry J.W. Percival.

In the book you build up a Django application from scratch using a TDD approach. I’m going to deploy this ‘Superlists’ to-do list application on to Cloud Foundry.

If you want to follow along you should have completed all the exercises up to and including Chapter 10, which includes adding gunicorn to requirements.txt. You can take a look at my version of the app at this point.

If you want to skip ahead and see all the changes we’ll make to the app, have a look at this commit.

First we’ll start as always by checking that our functional tests run successfully on our local machine.

$ python manage.py test functional_tests

All green, so we’re good to go!

Getting ready for Cloud Foundry

We are going to push our application to Cloud Foundry which will create a domain name for us. Let’s use the STAGING_SERVER variable to test this. I am aiming for the domain ih-superlists.cfapps.io; yours will vary based on your Cloud Foundry provider.

$ STAGING_SERVER=ih-superlists.cfapps.io python manage.py test functional_tests

As expected the tests fail completely.

Let’s get started on deploying to Cloud Foundry. We need to provide a ‘manifest’ file which tells Cloud Foundry how to deploy our application.

manifest.yml

---
applications:
- name: ih-superlists
  memory: 512M
  instances: 1
  buildpack: python_buildpack
  command: gunicorn superlists.wsgi:application

Then you can try to deploy using $ cf push and look at the logs with $ cf logs ih-superlists.

If your CF setup is like mine you’ll see

... [APP/PROC/WEB/0] ERR   File "/home/vcap/app/lists/views.py", line 16
    [APP/PROC/WEB/0] ERR     return redirect(f'/lists/{list_.id}/')
    [APP/PROC/WEB/0] ERR                                         ^
    [APP/PROC/WEB/0] ERR SyntaxError: invalid syntax

Oops! We forgot that CF expects to run Python 2 applications by default (boo!). Let’s tell it our application doesn’t use legacy Python.

runtime.txt

python-3.6.2

And then $ cf push again.

We also need to add our domain to ALLOWED_HOSTS in our settings file.

superlists/settings.py

ALLOWED_HOSTS = ['ih-superlists.cfapps.io']
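
As an alternative to hard-coding the hostname, here is a minimal sketch that reads the app's routes from the VCAP_APPLICATION environment variable that Cloud Foundry sets for running apps. The helper name allowed_hosts_from_env and the local fallback hosts are assumptions for illustration, not part of the Superlists code, and routes that include a path would need extra handling.

```python
import json
import os


def allowed_hosts_from_env(environ=os.environ):
    """Return the hostnames Cloud Foundry routes to this app, or local defaults."""
    raw = environ.get("VCAP_APPLICATION")
    if not raw:
        # Not running on Cloud Foundry: fall back to local development hosts
        # (an assumption for this sketch).
        return ["localhost", "127.0.0.1"]
    app_info = json.loads(raw)
    # "application_uris" lists the routes mapped to the app,
    # for example ih-superlists.cfapps.io.
    return list(app_info.get("application_uris", []))


# In superlists/settings.py this could replace the hard-coded list.
ALLOWED_HOSTS = allowed_hosts_from_env()
```

With this approach the settings file would not need to change if the app were pushed under a different name or to a different provider.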

Now we can see our (non-CSS’d) site running at ih-superlists.cfapps.io! Let’s run our functional tests:

$ STAGING_SERVER=ih-superlists.cfapps.io python manage.py test functional_tests

All three tests still fail!

Serving static files

One of the problems is that our static files are not being served properly. In our logs we can see the requests for our static files:

... [APP/PROC/WEB/0] ERR Not Found: /static/base.css
    [APP/PROC/WEB/0] ERR Not Found: /favicon.ico
    [APP/PROC/WEB/0] ERR Not Found: /static/bootstrap/css/bootstrap.min.css
    [APP/PROC/WEB/0] ERR Not Found: /static/base.css

The CF Python buildpack actually runs collectstatic as part of its process. Where are these files going? We can look inside the container by connecting with $ cf ssh ih-superlists.

The files are being collected during the staging process into /tmp/app/static, but this directory is not available in the eventual container that runs the application. Hence the lack of static files for our app!

Let’s collect our static files just before we start the gunicorn server instead.

manifest.yml

  command: python manage.py collectstatic --noinput && gunicorn superlists.wsgi:application

From the logs we can see that the static files are now in ‘/home/vcap/static’.

Side note: The VCAP acronym stands for VMware Cloud Application Platform, which was the original name of Cloud Foundry when it started at VMware.

We can run our functional tests again, or look at the live site and see that this hasn’t fixed our static files problem. We now have the static files, but they are not being served by gunicorn.

One way to fix this is to gather these files and serve them with another Cloud Foundry app which uses the static buildpack. We only expect a small amount of traffic for our application so in this case we can try to serve these files from the same server, using the Whitenoise Python library.

Add Whitenoise to your requirements.txt and then update the settings to include it in the Django middleware that is used.

$ pip install whitenoise
$ pip freeze | grep whitenoise >> requirements.txt

superlists/settings.py

MIDDLEWARE_CLASSES = [
  'django.middleware.security.SecurityMiddleware',
  'whitenoise.middleware.WhiteNoiseMiddleware',
  # ...
]

We can now see our site is served with CSS, but the functional tests still fail.

Adding a managed database

We can also see the problem in the logs.

... [APP/PROC/WEB/0] ERR django.db.utils.OperationalError: unable to open database file

Uh oh, we didn’t initialise our database. At this point we need to move away from the file-based SQLite database, which will be purged (along with all other files) each time we push the application. Let’s fix this by using data services provided with CF.

First let’s create a PostgreSQL database. Here I’m using the free tier provided by ElephantSQL on Pivotal Web Services.

$ cf marketplace
Getting services from marketplace in org ianhuston / space testing as XXX...
OK

service                       plans                                                                                description
...
elephantsql                   turtle, panda*, hippo*, elephant*                                                    PostgreSQL as a Service
...
* These service plans have an associated cost. Creating a service instance will incur this cost.

TIP:  Use 'cf marketplace -s SERVICE' to view descriptions of individual plans of a given service.

Let’s look at the ElephantSQL plans in depth:

$ cf marketplace -s elephantsql
Getting service plan information for service elephantsql as XXX...
OK

service plan   description                                            free or paid
turtle         4 concurrent connections, 20MB Storage                 free
panda          20 concurrent connections, 2GB Storage                 paid
hippo          300 concurrent connections, 100 GB Storage             paid
elephant       300 concurrent connections, 1000 GB Storage, 500Mbps   paid

Looks like the turtle plan will suit us. Let’s create a service on that plan.

$ cf create-service elephantsql turtle mydb

Next we attach this service to our app and restage as it suggests.

$ cf bind-service ih-superlists mydb
$ cf restage ih-superlists

We can now see our database connection variable in the environment of our app.

$ cf env ih-superlists
...
System-Provided:
{
 "VCAP_SERVICES": {
  "elephantsql": [
   {
    "credentials": {
     "max_conns": "5",
     "uri": SUPER_SECRET_URI
    },
    "label": "elephantsql",
    "name": "mydb",
    "plan": "turtle",
...

But how will our Django app know to use this database? We need to give these credentials to the application. One important thing to know is that the URI from the VCAP_SERVICES environment variable will also be provided to our application in the DATABASE_URL variable. This is the same way Heroku apps receive database credentials and gives us the opportunity to use the small dj_database_url library from Kenneth Reitz.

Install the library using pip locally, add it to your requirements.txt and then let’s change our settings.

superlists/settings.py

import dj_database_url
...
#DATABASES = {
#    'default': {
#        'ENGINE': 'django.db.backends.sqlite3',
#        'NAME': os.path.join(BASE_DIR, '../database/db.sqlite3'),
#    }
#}

LOCAL_SQLITE='sqlite:///' + os.path.abspath(os.path.join(BASE_DIR, '../database/db.sqlite3'))
DATABASES = {}
DATABASES['default'] = dj_database_url.config(default=LOCAL_SQLITE)

The dj_database_url.config function automatically looks for the DATABASE_URL environment variable, and here we also give it a default to use when running locally. We should run our local tests again to check this still works.
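
If DATABASE_URL were ever not set, the same URI could be pulled out of VCAP_SERVICES directly. The following is a rough sketch using only the standard library; the function name database_url_from_vcap is hypothetical, and it assumes the service exposes its connection string under credentials.uri, as ElephantSQL does in the cf env output above.

```python
import json
import os


def database_url_from_vcap(environ=os.environ):
    """Return a database URI, preferring DATABASE_URL if it is set."""
    if environ.get("DATABASE_URL"):
        return environ["DATABASE_URL"]
    # VCAP_SERVICES maps each service label to a list of bound instances.
    services = json.loads(environ.get("VCAP_SERVICES", "{}"))
    for instances in services.values():
        for instance in instances:
            uri = instance.get("credentials", {}).get("uri")
            if uri:
                # Use the first bound service that provides a URI.
                return uri
    return None
```

The returned string could then be handed to dj_database_url.parse in place of the config call, although for this app the DATABASE_URL route is simpler.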

Now we need to initialise our PostgreSQL database. We can do this using a once-off task with the relatively new cf run-task command. First push the application.

$ cf push ih-superlists

Then run the database initialisation as a task.

$ cf run-task ih-superlists "python manage.py migrate" --name migrate

You can check the status of a task by looking at $ cf tasks ih-superlists.

Once the migration task is finished, we can run our functional tests again.

$ STAGING_SERVER=ih-superlists.cfapps.io python manage.py test functional_tests

Success!

Let’s make one final change to turn off debug mode.

superlists/settings.py

DEBUG = False
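
In the spirit of the 12 factors, the debug flag could also come from the environment rather than being edited in the settings file. This is only a sketch: the variable name DJANGO_DEBUG and the helper debug_from_env are assumptions, not something Django or Cloud Foundry defines.

```python
import os


def debug_from_env(environ=os.environ, name="DJANGO_DEBUG"):
    """Enable debug mode only for explicit truthy values; default to off."""
    value = environ.get(name, "")
    return value.strip().lower() in {"1", "true", "yes", "on"}


# In superlists/settings.py: off unless DJANGO_DEBUG is explicitly set.
DEBUG = debug_from_env()
```

On Cloud Foundry the variable could be set with cf set-env followed by a restage, while leaving it unset keeps debug mode off.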

Summary

We walked through a few steps there to get our Django app up and running on Cloud Foundry. Some of these are CF specific, and some are more about making our Django app more ‘cloud native’ in the spirit of the 12 factors. All the changes we made can be seen in this commit. You can also see all the code for the CF-enabled version of the Superlists app so far.

Let’s recap:

  1. Create a manifest.yml file with CF specific information.
  2. Create a runtime.txt file to specify Python version.
  3. Add your expected URL to ALLOWED_HOSTS
  4. Use Whitenoise to serve static files.
  5. Use a data service to create a database and connect it to Django.
  6. Initialise the database and run all migrations.
  7. Turn off debug mode.
  8. cf push your way to Django on CF!

Hopefully this is useful for you to get your Django app running on Cloud Foundry. Let me know in the comments if you have any other tips!

 

Mapping Dublin parish boundaries

TLDR: Go straight to the Dublin Parish Boundaries map.

In Ireland, most primary schools are run by the Catholic Church, and their enrolment rules are often complex, with children living in the local parish usually being given preference. This means that when you are looking for accommodation to rent or buy, it can be very important to know in advance which parish the property is located in.

The Dublin Archdiocese has a map of all the churches in Dublin but unfortunately it doesn’t seem to be working at the moment. Individual parishes sometimes have maps although these are often either static scanned documents or sometimes even hand-drawn sketches.

So how can we make these parish boundaries available on a modern map interface?

Fortunately for our purposes, the Catholic parishes are such an integral part of Irish society that the national Central Statistics Office reports their boundaries as part of its census data. This data is available under a custom non-commercial license from Ordnance Survey Ireland.

The data the CSO provide is in the form of Shapefiles but we can convert them to the more palatable GeoJSON format using the ogr2ogr utility from GDAL:

ogr2ogr -f GeoJSON -t_srs crs:84 new_file.geojson original_file.shp
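
Before publishing the converted file it can be worth a quick sanity check. The sketch below, using only the Python standard library, counts the features and lists the property names in a GeoJSON FeatureCollection. The function name summarise_geojson is hypothetical, and the property names in the real CSO data are not assumed here.

```python
import json


def summarise_geojson(text):
    """Return (feature count, sorted property names) for a FeatureCollection."""
    data = json.loads(text)
    if data.get("type") != "FeatureCollection":
        raise ValueError("expected a GeoJSON FeatureCollection")
    features = data.get("features", [])
    names = set()
    for feature in features:
        # "properties" may be null in valid GeoJSON, so guard against None.
        names.update((feature.get("properties") or {}).keys())
    return len(features), sorted(names)
```

Reading new_file.geojson and passing its contents to this function should show whether every parish came through the conversion and which attribute holds the parish name.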

GitHub provides a really useful GeoJSON renderer, both on their site and for embedded maps. This means we don’t have to worry about creating a map and adding the parish boundaries as a layer.

The final piece of the puzzle is how to make the map available on the web. For this I used Cloud Foundry and in particular Pivotal’s hosted Cloud Foundry instance called Pivotal Web Services. [Disclaimer: I work for Pivotal but not on Cloud Foundry.]

I made a simple HTML page and using the Staticfile buildpack I was able to just do cf push to get the Dublin parish boundaries map up and running.

The final GeoJSON map as rendered by Github

One note of caution: the parish boundaries in the Census data may not correspond to those used by parishes or schools so please double check carefully before making any life-changing decisions!

 

Data Science In the Balanced Team

This year I was fortunate enough to speak at PyCon Ireland 2016 in Dublin. This was a great event with lots of interesting Python based talks and a full PyData track over the two days. The topic of my talk was something I’ve been thinking about a lot over the last few years: how data scientists can work with other disciplines.

Recently designers and product managers have begun working more closely with development teams, and in my opinion there are many lessons that data scientists can learn from this experience. In particular the concept of a “Balanced Team” appeals to me as a template for data scientists.

The slides for this talk are on SpeakerDeck, and the video is also available. In this post I want to recap my argument from the talk with some extended notes.

From Imposter Syndrome to Team Player

I work as a data scientist at Pivotal Labs helping clients, often large enterprises, to bring data science into their business. However, I really started working with data when I was an academic, handling results from numerical simulations of the early universe.

David Whittaker’s Imposter Syndrome post

For a lot of people in academia, the concept of imposter syndrome is very familiar, and for me academia was a long process of dealing with imposter syndrome. This is the idea that you aren’t really good enough to be here and someday someone is going to figure it out. This post by David Whittaker captures what is really happening. Though you may think everyone knows more than you, really you are just observing the combined knowledge of a lot of different people.

As an academic it was easy to think that others in my field or outside in industry must be handling these data problems in a better way. I had done some formal computer science training in my undergrad degree, but I’d taught myself how to use scientific Python tools and software carpentry practices.

When I left academia to work as a data scientist my first steps were working on solo projects where I was often expected to be a data science unicorn. These types of projects involve a lot of pressure, and the full weight of stakeholder expectations rests solely on you. It’s not a comfortable position to be in. Due to the hype around data science there was very little understanding by business stakeholders of the exploratory nature of much data science work, where positive results are not guaranteed.

[By the way, the Data Science Unicorn is a real account, with a collection of data science learning material gathered by Jason Byrne.]

Working solo on projects is very draining, so more recently I’ve been fortunate to find myself working at Pivotal in teams including developers, product designers and product managers. We opened our Dublin office last year on Back To The Future Day, hence the branded De Lorean, and we are always looking for people with empathy to join us.

Working as part of a team has been great, and I’ve been able to learn a lot about how modern software is built. In particular I’ve been interested in how disciplines like design and product management have been integrated with more traditional development including a concept called the “Balanced Team”.

Balanced Team

The idea for Balanced Team came from conversations between developers and product teams who had been working in an agile methodology but were seeing problems integrating design and product management. As I understand it, the main idea behind Balanced Team is to share responsibility between the team and make sure everyone is acting in service to the team, not just their own self interest. Janice Fraser played a central role in formulating these ideas and explains them in more detail in this talk.

This image from her slides shows that the main roles represented in a balanced team are development, design and product management. Each role has obligations and an authority which they bring to any interactions. Fraser describes Balanced Team as more of a work environment than a methodology, essentially a frame of mind about how the team should interact.

In the past product designers have been kept entirely separate from development teams, often in specialised design agencies. They were frequently required to act as “hero designers”, unable to admit any faults and working hard in crunch mode to meet deadlines. It was striking for me to see the similarities with the expectations on data science unicorns. Some of the goals of Balanced Team are to get away from this notion of hero designers, to reduce power struggles and allow more space for people to speak freely and discard failing solutions.

In her talk Fraser describes the obligations and authority for each of the roles. For example the designer in the team needs to be the “empathizer-in-chief”, who understands the customer at an expert level, and can translate their high-value needs into product decisions. Their obligations to the team include honing their craft (as a service to the team) and facilitating balance between other parties within the team. Their main authority is the prioritisation of customer problems in every product conversation.

Monica Rogati’s ‘data thinking’ post

It’s worth noting that as it is currently formulated, the Balanced Team concept does not include any data oriented role. Monica Rogati described what happens in this situation in her recent post on “data thinking”. Rogati talks about how Apple’s Photos product can identify faces in your photo collection and highlights a list of 5 of these people in alphabetical order. Depending on their name, this means your closest friends and family might not appear in the top 5 listing, despite perhaps appearing in most of your photos.

As Rogati describes, a simple application of data thinking, with no complex machine learning or predictive analytics, would reorder these photos in frequency order. The take-away recommendation is that to avoid these product mistakes “you need data thinking to be part of the culture and top of mind, not an after-thought.”
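
The kind of reordering described above needs very little code. As an illustration only (the names and photo data are made up, and this is not how Apple’s Photos works internally), here is a sketch that ranks people by how many photos they appear in instead of alphabetically:

```python
from collections import Counter


def top_people(photo_tags, n=5):
    """Return the n people who appear in the most photos, most frequent first."""
    # Count each person at most once per photo.
    counts = Counter(person for tags in photo_tags for person in set(tags))
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [person for person, _ in ranked[:n]]


# Hypothetical photo collection: each inner list holds the people in one photo.
photos = [["Zoe", "Adam"], ["Zoe"], ["Zoe", "Bea"], ["Bea"]]
print(top_people(photos, n=2))  # ['Zoe', 'Bea']
```

An alphabetical top two for the same collection would be Adam and Bea, leaving out Zoe even though she appears in the most photos.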

Things that worked

With this in mind, as a data scientist, I wondered how I would describe my obligations and authority. I’ve been fortunate to have worked over the last two years in teams with developers, designers and PMs, and in this time we’ve tried different approaches to bringing data science into this process. I’m going to describe some of the approaches that worked for us, some that didn’t, and then try to distill what I’ve learned into a similar form to Janice Fraser’s blueprint.

User research seeks to find the right direction to head in the space of possible products. As part of a balanced team the data scientist has an obligation to use available data to inform lines of questioning for in-person interviews, validate the results of these interviews and identify gaps when interviewees are not representative. It’s great to observe these user interview sessions as a data scientist, because I always come out with a long list of questions to answer from the data.

A data scientist can also guide user research questions in order to understand the type of predictive models that will be suitable, answering questions about how much ‘explainability’ is needed, and where the line is between useful & creepy for instance.

If your product manager has not worked with a data scientist before, you need to make a big effort to help them understand how you can contribute to the product. If they don’t understand how machine learning and predictive analysis can be effectively used, they will not direct the product discussion to include them.

As part of a balanced team you have an obligation to be part of all product conversations and story generation and to proactively suggest where data thinking could be most effective. Don’t wait for someone to come to you with an idea ‘perfect for some data science’.

Expanding this idea of education, your team will make most effective use of data when ‘data thinking’ is central to the culture and practices of the team. If they have not been exposed to this before you will need to educate and involve them in understanding the available data and analysis techniques you are using. Pairing goes some way to sharing this knowledge, and you can also consider having a ‘show & tell’ to describe data discoveries and explain the moving parts of the model you are building.

As part of a balanced team, you have an obligation to educate your team about the techniques you’re using, the data that is available and what choices you have made in your analysis. The goal is not scrutiny of your work, but building confidence in your approach and results.

At Pivotal we think Pair Programming is the best way to get fast feedback cycles and share knowledge. Data science is no different and we pair as data scientists when possible. We also like to pair with developers and designers to share knowledge of our methods and also get a new perspective on what we are building.

Pairing with developers is particularly useful to continue the journey from exploratory analysis to production code.

 

Things that did not work

Now let’s consider a few things that we’ve tried, or experienced as part of a team in the past.

 

 

In one project we tried to keep our user stories unified from front to backend so overall user value would be apparent. This means that whenever we deliver a story we know that we’ve put together everything necessary for the user to benefit from this feature. Unfortunately it proved quite difficult to work with these large stories in practice.

For one thing, having a single scale for estimation proved difficult to work with and our stories soon became too big to reliably show incremental progress to stakeholders, increasing communication difficulties. We eventually moved to having separate backlogs, which we already had for design work, although this means extra effort is needed to keep backlogs in sync.

As part of a balanced team, the data scientist will need to take part in conversations about the engineering backlog (as well as design), and the PM will need to have a good handle on the inter-dependencies between backlogs.

There’s sometimes a tendency to think data scientists should only arrive on a project once an MVP is built and some (usually limited) data is being collected. Even when machine learning is going to be at the core of a product, such as predictive maintenance, there’s sometimes a reluctance to bring data scientists/ML engineers in early on during the product creation phase.

This denies the data scientist the chance to be involved in the early conversations about the feasibility of different product directions, give advice on what early instrumentation to include, and provide context using any existing data sets in the business. As part of a balanced team, I think it’s clear that data scientists can contribute from the very beginning of the project and should ask to join the early product creation.

Because data scientists are expensive and expert resources, there is a tendency to spread them across multiple projects to maximise their effectiveness. Continually switching contexts and juggling multiple simultaneous top priorities makes this path less efficient for team progress as a whole. This lesson has been learned with designers, product managers and others, but now seems to need to be learned again for data scientists.

Being part of a balanced team means putting the team’s success first, which means being available and focused on a single team, a single product. This can result in what feels like inefficient use of your time if you’re not occupied 100%, but the alternative cost to the team of not having you available at the right moment is more detrimental. One way to justify this perceived inefficiency is to calculate the time & money wasted by a development team waiting for their shared data scientist to become available. Often what could have been a simple ten minute conversation can instead turn into days of emails, conference call scheduling and meeting planning, all because the data scientist is juggling other projects.

There is so much hype around data science that it can feel like management expect the addition of a data scientist to instantly solve all existing problems. This is a dangerous situation to get into, and you must work to manage expectations, especially when starting to work on a new problem with many uncertainties. As part of a balanced team, the data scientist has a responsibility to inform the team’s expectations, and gains by sharing the burden of communicating and managing expectations with outside stakeholders.

I hope our experiences can help you as you explore the idea of including data science in your balanced product teams. There are many things that could be part of the core obligations of a data scientist in the framework that Janice Fraser describes for a Balanced Team. For me, the data scientist should be the “voice of data” on the team. They should provide deep expertise and understanding about the available data, and be able to identify potential valuable uses and techniques.

More and more we are seeing the implications of unethical uses of data and the data scientist should have the obligation to guard against unjustified (legally and mathematically), unethical and inappropriate uses of data. On the other hand where data is not currently being collected or is insufficient for future uses, the data scientist has an obligation to the team to begin collecting data to facilitate expected future product goals. In addition, data scientists can also facilitate balance in the team. Were I to include another obligation, it would be to “hone your craft” as Janice Fraser describes explicitly for designers.

For me the important authority that a data scientist brings to the team is the ability to improve product conversations with ‘data thinking’ as Monica Rogati suggests. We can make data thinking a natural part of product decisions, in order to reduce the sort of data literacy problems highlighted above.

[In the original talk the final two slides had references to “data” instead of “data thinking”.]

 

Summary

To recap, I think there’s a lot of value in bringing data scientists into your balanced team. This helps make data thinking a central part of the product conversation. The data scientist has the obligation to provide data insights and explore potential uses, all in service to the team. In effect we are trying to break down the walls between data scientists and the rest of the product team.

Thank you to all the great people I’ve worked with as we’ve learned how data science contributes as part of a product team, from Pivotal and our client teams. In particular, I want to thank Janice Fraser for allowing me to reuse and adapt material from her Balanced Team talk slide-deck.

I hope that this is only the start of the conversation about Data Science in the Balanced Team and I look forward to hearing how data scientists are making ‘data thinking’ a central part of their product team’s work.

 