What tools do you need to bring to a DataDive?
The next DataKind UK DataDive is taking place in two weeks’ time in London. I took part in one of the previous DataDives and would highly recommend the experience to anyone with data science or analytical skills who wants to help charities use their data.
The DataDives take place over the course of a weekend, and in that time you have to decide on a charity to work with, understand their data and goals, perform your analysis and present your results in a usable form. That’s a lot to get through in just over two days, so it’s very important to be able to get up and running quickly with the analysis. I thought it might be useful to list the software and tools that I will be packing in my DataDive toolbelt this time around.
Caveat: All of these are personal preferences and there are many other choices I could have made. I have a Mac, so these choices are also somewhat OS X-specific. Feel free to list the contents of your own toolbelt in the comments!
Python
If you are a data scientist you probably have a favourite in the R vs Python debate. My preference happens to be Python and the PyData stack, and the packages below reflect that. If you are more of an R devotee, there are direct equivalents for most of these. I would recommend using the Anaconda Python distribution, especially on Windows or OS X.
iTerm2
The range of options and customisation possible in this OS X terminal app is very impressive. Another really handy feature is the clipboard history.
Git & GitHub
Version control is important, and sharing the results of your work is easier if everyone uses a central repository. For this DataDive the organisers have set up a GitHub organisation, so polish up those Git skills if you don’t use it often.
Pandas
Building on the lower level numerical capabilities of NumPy and SciPy, the most effective data analysis package for Python is Pandas, created by Wes McKinney. It provides many extremely useful ways to ingest, transform and output datasets and is getting better and faster day by day.
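As a rough illustration of that ingest, transform and output loop, here is a minimal sketch. The donation records, column names and the `summarise_donations` helper are all made up for the example; a real DataDive dataset will look different.

```python
import io

import pandas as pd

# Made-up donation records standing in for a charity's CSV export.
RAW_CSV = """donor_id,region,amount,date
1,London,25.0,2014-01-15
2,Leeds,10.0,2014-01-20
3,London,,2014-02-02
4,Leeds,40.0,2014-02-11
5,London,15.0,2014-02-28
"""


def summarise_donations(csv_text):
    """Hypothetical helper: load CSV text and aggregate amounts per region."""
    # Ingest: parse the CSV and convert the date column to datetimes.
    df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
    # Transform: drop rows with a missing amount, then aggregate by region.
    df = df.dropna(subset=["amount"])
    return (
        df.groupby("region")["amount"]
        .agg(["count", "sum", "mean"])
        .reset_index()
    )


summary = summarise_donations(RAW_CSV)

# Output: write the summary back out as CSV text (a file path works too).
print(summary.to_csv(index=False))
```

Swapping `io.StringIO(csv_text)` for a file path is all it takes to point the same steps at a real file.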
CSVkit
This extremely useful set of command line tools helps you to easily get a handle on the contents of CSV files and slice, dice and aggregate them in numerous ways. Really handy as the first step in a data cleaning process.
IPython Notebook
For easy recording of your analysis steps and results using Python and other languages, the clear choice is an IPython Notebook. The most recent update has added interactive widgets, which enable much better exploration of data and a simple way to create interactive results.
scikit-learn
Machine learning toolkit of choice in Python at the moment due to its breadth of algorithms and extremely elegant API. Pipelining means you can .fit() and .transform() your way through a multi-stage machine-learning process with ease.
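To make the pipelining idea concrete, here is a small sketch that chains a scaling step and a classifier. The synthetic dataset, the step names and the choice of logistic regression are assumptions made purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data, used here only so the example is self-contained.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Each step is fitted in order; intermediate steps also transform the data
# before handing it to the next step.
model = Pipeline([
    ("scale", StandardScaler()),
    ("classify", LogisticRegression()),
])

model.fit(X_train, y_train)           # fits the scaler, then the classifier
predictions = model.predict(X_test)   # applies the scaler, then predicts
accuracy = model.score(X_test, y_test)
print("Held-out accuracy: {:.2f}".format(accuracy))
```

Because the whole pipeline behaves like a single estimator, it can be passed straight to cross-validation or grid search without leaking information from the test folds into the scaler.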
yhat’s ggplot
I wanted to especially mention this port of R’s wildly popular ggplot2 graphics package to the Python ecosystem. One of the main reasons given for R fans not trying Python is the lack of something like ggplot2, so yhat decided to make a very faithful port using Matplotlib as the backend. The syntax isn’t very Pythonic but the results can fool even some veteran R users.
Flask
There are many web frameworks out there, but Flask is a simple Python framework that allows you to get up and running quickly.
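To give a sense of how little code is needed, here is a minimal sketch of a Flask app that serves analysis results as JSON. The routes and the `RESULTS` placeholder are hypothetical and stand in for whatever your analysis produces.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Placeholder results; in practice these would come from your analysis.
RESULTS = {"rows_analysed": 1200, "regions": ["London", "Leeds"]}


@app.route("/")
def index():
    # Plain-text landing page pointing at the JSON endpoint.
    return "DataDive results are served at /api/summary"


@app.route("/api/summary")
def summary():
    # jsonify serialises the dict and sets the JSON content type.
    return jsonify(RESULTS)


# Exercise the app in-process with Flask's built-in test client,
# so nothing needs to listen on a port for this check.
client = app.test_client()
response = client.get("/api/summary")
print(response.status_code, response.get_json())
```

To serve it locally during development, call `app.run()` or use the `flask` command line tool.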
Cloud Foundry
If you create a webapp as part of your analysis it would be great to have it publicly available (if that’s possible). I use the Cloud Foundry platform for this and one publicly hosted CF instance is Pivotal Web Services [Disclaimer: I work for Pivotal who run this service]. Making your webapp available is as simple as “cf push”. Other options include Heroku and Digital Ocean.
D3
These days people expect a lot more from visualisations than simply a static line graph. Mike Bostock’s D3 (Data-Driven Documents) has been at the centre of the interactive visualisation movement on the web. It’s somewhat difficult to get started with, but there are a lot of packages that help with this, including NVD3 and mpld3.
I think that’s a good start for tools and packages that would be useful during a DataKind DataDive. There are obviously a lot of other things that will be necessary, not least the data itself!
It looks like this is going to be a busy week. First up, Noelle Sio, Alexander Kagoshima and I are presenting a webinar on Tuesday about our traffic analysis and prediction work. We talked about this topic at Strata Santa Clara, but this webinar will take an extended look at the data, the challenges and the technology we used. You can sign up on the Pivotal website and the recording will be available afterwards.
On Wednesday I am heading to Amsterdam to talk at the second Data Science Amsterdam meetup. The organisers of this meetup are branching out from their highly successful Data Science London event, which regularly has hundreds of data scientists on the waiting list for each meeting. My talk on Wednesday will be about how to do massively parallel processing with familiar Python and R packages via the procedural languages PL/Python and PL/R.
On the subject of PL/Python, the video of my talk at PyData London is now available. Thankfully the video was edited to remove five minutes of me banging on the keyboard as my laptop crashed halfway through my talk! In Amsterdam I will be talking about PL/R as well as PL/Python, so hopefully this time the laptop holds up.
In addition I thought it would be useful to try to collect as many of the tweets from over the weekend as possible. These are available on Storify. There’s no guarantee I’ve found everything but hopefully there will be some value in having links to some of the slides and other materials people mentioned during their talks.
This week I had the opportunity to attend and speak at one of the biggest Big Data conferences of the year.
The Strata conferences, organised by O’Reilly, have been running for the last few years and in many ways have driven the awareness and adoption of data science and predictive analytics.
My colleagues Alexander Kagoshima and Noelle Sio and I talked about recent work we’ve been doing on how to use machine learning techniques to understand traffic flows in major cities and predict when travel disruptions will end. The talk seemed to be well received and generated a lot of questions and comments both at the conference and on Twitter. This recent post on the Pivotal blog explains more about the projects and the overall goals.
As part of the disruption prediction work I built a simple web app which displays the predictions for currently active incidents.
Video of the talk will be available through O’Reilly, and our slides are available on Slideshare.
If you are interested in this or other projects the Pivotal Data Labs team have worked on, there is a lot more information on the official Pivotal site.
We’ll be looking at how in-car data sources like GPS locations can enable more intelligent routing that predicts future traffic conditions along your journey.
In addition, we’ve taken a look at traffic disruption data in London and created a model which predicts how long a new incident will last, giving you confidence that the collision which blocked your route to work this morning will have been cleared by the time you want to head home. I’ve written a simple web-based demo which I hope to show during the talk.
Strata talks are videoed (yikes!) and we hope to make our slides available after the talk. Stay tuned as well for a sneak peek at the transport disruption demo.
A physicist by training, I am curious about the world around us, from the smallest to the largest scales. I recently joined the Pivotal Data Science team and now work on interesting data science and predictive analytics projects for a wide range of industries.
As a university researcher I created numerical simulations of cosmological perturbations during the early universe. My code, called Pyflation, is open source and available for download.
This is a personal site and the views and opinions expressed in these pages are strictly mine and have not been reviewed or approved by my employer.