ODSC West Day Two

By Ian

November 16, 2015

After a great first day, ODSC West started up again on a blustery Sunday morning in the Bay Area. As I needed to prepare for my own talk, I didn’t get to see as many of the other sessions as I would have liked, but I’ve collected some thoughts on those that I did see.

Wes McKinney, Cloudera

Wes is best known for creating the Pandas library for data analysis in Python. He talked about his successor project, Ibis, which aims to remove the unnecessary complications of SQL-based analysis by providing a uniform, Python-based interface similar to Pandas.

One reason for starting a new project is the amount of technical debt accrued by Pandas, which means it’s “hard to make Pandas magically big data enabled.” Other efforts to use Python for ‘big data’, such as user-defined functions written in PL/Python, are hampered by the need to move data between the database and the Python process. Ibis will have a shared-memory architecture that removes this bottleneck, and Wes is looking for contributors to help implement interfaces to more databases beyond the current focus on SQLite and Impala.
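To make the idea of a Python interface over SQL a little more concrete, here is a minimal sketch of how such a layer can work: method calls build up a lazy expression, which is only turned into SQL and sent to the database when results are needed. This is not the Ibis API. The `Table` class, its `filter` and `group_mean` methods, and the `flights` data are all hypothetical, invented for illustration, and only Python's standard `sqlite3` module is used.

```python
import sqlite3


class Table:
    """Hypothetical lazy table expression (an illustration, NOT the Ibis API)."""

    def __init__(self, name, filters=(), group_key=None, agg=None):
        self.name = name
        self.filters = tuple(filters)
        self.group_key = group_key
        self.agg = agg  # (SQL function, column) pair, e.g. ("AVG", "delay")

    def filter(self, condition):
        # Return a new expression; nothing is sent to the database yet.
        return Table(self.name, self.filters + (condition,), self.group_key, self.agg)

    def group_mean(self, key, column):
        return Table(self.name, self.filters, key, ("AVG", column))

    def compile(self):
        # Translate the accumulated expression into one SQL string.
        # Names are interpolated directly, so this is unsafe for untrusted input.
        if self.agg:
            func, column = self.agg
            select = f"{self.group_key}, {func}({column}) AS mean_{column}"
        else:
            select = "*"
        sql = f"SELECT {select} FROM {self.name}"
        if self.filters:
            sql += " WHERE " + " AND ".join(self.filters)
        if self.agg:
            sql += f" GROUP BY {self.group_key} ORDER BY {self.group_key}"
        return sql

    def execute(self, conn):
        # The data stays in the database; only the result rows come back.
        return conn.execute(self.compile()).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (carrier TEXT, delay REAL)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?)",
    [("AA", 10.0), ("AA", 30.0), ("UA", 5.0), ("UA", -2.0)],
)

expr = Table("flights").filter("delay > 0").group_mean("carrier", "delay")
sql = expr.compile()
rows = expr.execute(conn)
print(sql)
print(rows)
```

The point of the design is that the aggregation runs inside the database engine, so only the small result set crosses into the Python process rather than the full table.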

Anthony Goldbloom, Kaggle

As the founder of Kaggle, Anthony Goldbloom has seen over 400,000 data scientists try their hand at winning a competition. He shared some tips on what makes a good Kaggle competition and how the top performers attack a problem.

In particular he suggested that those at the top of the leaderboard fall into one of two camps. The first group like very structured datasets and spend almost all their time hand-crafting better features to input into their model. Recently this group has overwhelmingly switched to using the XGBoost algorithm instead of their previous favourite, Random Forests.

The second group focus on unstructured data problems and spend no time on feature engineering, instead building neural networks. For image-based competitions, convolutional neural nets work well for edge detection, and time series data is best tackled with recurrent neural nets. Given the recent hype around deep learning, it’s interesting to see the widespread use of neural nets for these tasks, but also that structured problems still need a more traditional feature-based approach.
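As a rough illustration of why convolutions suit edge detection, the sketch below slides a small kernel over a toy image in plain Python. It was not shown in the talk. In a real convolutional net the kernel weights are learned from data; here a fixed Sobel kernel is assumed as a stand-in, and the `convolve2d` helper and the toy image are made up for the example.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding) and sum the products.

    This is the cross-correlation that convolutional layers compute.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Hand-written Sobel kernel that responds to left-to-right brightness changes.
SOBEL_X = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]

# Toy 5x6 image: dark on the left, bright on the right.
image = [[0, 0, 0, 9, 9, 9] for _ in range(5)]

edges = convolve2d(image, SOBEL_X)
for row in edges:
    print(row)
```

The output is zero where the image is flat and large where the window straddles the dark-to-bright boundary, which is the kind of feature the early layers of a convolutional net tend to pick up without any hand-crafted feature engineering.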

Anthony also previewed the new Scripts feature of Kaggle, which will allow anyone to upload their model code and have it run (for free) in a containerised environment. Other competitors will be able to fork the code and immediately run their modified version on the same inputs. One interesting twist is that anyone uploading their code has to license it under the Apache 2.0 License, so the code will be open for anyone to reuse.

Owen Zhang, DataRobot

Owen, who until recently was the number 1 competitor on Kaggle’s leaderboard, gave a really humorous walkthrough of how he transitioned from an IT role into data science. He says he was “legitimized” by Kaggle, as it meant he could apply for new data science roles and prove his ability, instead of just making claims about how good a data scientist he would be given the chance.

His two important things to remember when working as a data scientist are:

“Asking the right question is more important than getting the perfect answer.”
“What will/can you do differently if you have a prediction of X?”

James Powell, NumFOCUS & many others

Every talk by James is entertaining, but this one was particularly fun and challenging, veering from very low-level Python hacking to philosophical musings on the nature of human interactions as exhibited through code reviews. He also included a whole meta-talk section describing the stages you go through before becoming a well-known speaker travelling around the world:

1. Start by talking about something someone else did.
2. Talk about a small library you worked on, or take something from another language and apply it to your own.
3. Talk about your ideas and opinions and have a clear message.
4. Already be famous because this increases how much the audience will pay attention.

As a helpful hint for stage 4 he suggested that a simple way to become famous is to court controversy.

I can’t do justice to the multitude of ideas and concepts in this talk, so I’d recommend watching it (maybe more than once) when the video comes out.

Thoughts on ODSC West

For a new conference holding its first west coast event, this was very well organised and managed to attract a large crowd of 1000 data scientists to a weekend event. The audience was definitely more on the practitioner end, which is a refreshing change from the variety of marketing- and sales-based conferences proliferating in the data science and big data space.

Well done to Mammad, Sheamus, and the whole ODSC team for organising a great event and for their enthusiasm and helpfulness throughout the weekend.