I’m out in San Francisco for the first time in months to speak at the first Bay Area edition of the Open Data Science Conference. For a relatively new conference, there is a great line-up of speakers, and the audience is already quite big, with around 1,000 data scientists attending.
The first day of talks was packed with really interesting sessions, and the slides and videos should be available in the next few weeks. In the meantime, here’s my quick review of the talks I attended on Saturday:
Brian Granger, Jupyter
Brian is best known as one of the creators of the IPython Notebook, which has recently become the Jupyter project. He describes Jupyter as a tool for creating and sharing narratives, and in his talk he outlined some of the new features coming in version 4.1 and beyond. Highlights of 4.1 include multi-cell selection and actions. Some time after that will come a complete rewrite of the interface to enable third-party plugins and collaborative editing (yay!).
Brian also challenged the audience to define data science not simply by the tools and skills it requires, but by the fundamental questions it can answer about the world. He also emphasised that data science can be pursued at any level and should be part of a general education and understanding of the world.
Lukas Biewald, CrowdFlower
Lukas is the CEO of CrowdFlower and gave two talks on Saturday outlining his views on how we can avoid a ‘data science winter’ similar to the AI winters of the 70s and 80s. As he noted, the strategy of big companies has changed, and they now realise that amassing data is more important than building better algorithms. For example, IBM recently bought Weather.com to get access to all the historical weather data it owns, while at the same time Google feels confident enough to open source its distributed learning framework, TensorFlow.
Lukas also discussed how human-in-the-loop machine learning models are becoming increasingly popular, and he expects them to be standard practice in the industry within two years. CrowdFlower helps companies enrich their datasets by getting real humans to perform classification or other tasks.
Vin Sharma, Intel
Vin discussed how data science is in a proto-science phase and needs to make the transition to real science, similar to the move from alchemy to chemistry. The friction Intel has identified in this move comes from one-off models and legacy components, which slow down the adoption of cloud-based platforms. Vin also mentioned Intel’s new Trusted Analytics Platform, which is based on Cloud Foundry.
Claudia Perlich, dstillery
Claudia works in ad targeting and gave a really interesting talk about the mismatched incentives that arise when advertisers focus on simple metrics like click-through rate. With the proliferation of bots clicking ads and even filling out forms (for example to schedule a test drive), it has become easier to predict the behaviour of the bots than that of real humans. As these bots build realistic browsing profiles, advertisers unwittingly spend money targeting them through auctions.
John Myles White, Julia
John gave a really honest appraisal of the Julia language and its strengths and weaknesses. In the past I have heard a lot of enthusiasm about Julia which I never really understood, having played around with it a little. John’s explanation is that the Julia stack varies in maturity: data acquisition, preparation and heterogeneous data tables are not yet well supported, whereas in lower-level numerical and matrix calculations Julia can achieve 100x or 1000x speedups over Python or R.
According to John, Julia version 1.0, which is coming in the next two years, should see great improvements in the areas that are currently lacking.
Juliet Hougland, Cloudera
Juliet’s talk was an in-depth run-through of the problems people experience with PySpark, the Python bindings for Apache Spark, and her top tips for getting better performance when using it. These included limiting the amount of data you pass to the Python processes by using the DataFrames API, making sure all your objects are serialisable by cloudpickle (the pickling library used by PySpark), and making functions static to help isolate issues with serialisation.
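As a rough illustration of the serialisation tip, here is a minimal sketch that checks which pieces of captured state refuse to pickle before they ever reach a worker. It uses the standard library's pickle as a stand-in for cloudpickle, and the `captured` dictionary and the `find_unpicklable` helper are hypothetical names of my own, not part of PySpark.

```python
import pickle
import threading


def find_unpicklable(named_values):
    """Return the names of values that the standard pickle module rejects."""
    failures = []
    for name, value in named_values.items():
        try:
            pickle.dumps(value)
        except Exception:  # pickle raises TypeError or PicklingError here
            failures.append(name)
    return failures


# Hypothetical state that a task function might capture.
captured = {
    "threshold": 0.5,
    "stop_words": {"the", "and"},
    "lock": threading.Lock(),        # locks cannot be pickled
    "rows": (n for n in range(3)),   # neither can generators
}

unpicklable = find_unpicklable(captured)
print(unpicklable)  # ['lock', 'rows']
```

cloudpickle accepts more than pickle does (lambdas, for instance), so treat this as a conservative first check rather than a substitute for running the job.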
In addition, she recommends the spark-testing-base package to help write unit and integration tests for your application. This package provides helper functions that were originally used by the Spark dev team, for example to provide a SparkContext in each running test, which can either be set up freshly for each test or shared between tests.
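The fresh-versus-shared distinction can be sketched with nothing but the standard unittest module. This is not the spark-testing-base API: `FakeContext` is a hypothetical stand-in for a SparkContext that simply counts how often it gets built, and `run_and_count` is a helper I made up for the demonstration.

```python
import io
import unittest


class FakeContext:
    """Hypothetical stand-in for a SparkContext; counts how often it is built."""
    created = 0

    def __init__(self):
        FakeContext.created += 1

    def parallelize(self, values):
        return list(values)


class FreshContextTest(unittest.TestCase):
    def setUp(self):
        # Runs before every test method: each test gets its own context.
        self.sc = FakeContext()

    def test_sum(self):
        self.assertEqual(sum(self.sc.parallelize([1, 2, 3])), 6)

    def test_count(self):
        self.assertEqual(len(self.sc.parallelize("abc")), 3)


class SharedContextTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Runs once per class: all test methods reuse the same context.
        cls.sc = FakeContext()

    def test_sum(self):
        self.assertEqual(sum(self.sc.parallelize([1, 2, 3])), 6)

    def test_count(self):
        self.assertEqual(len(self.sc.parallelize("abc")), 3)


def run_and_count(test_case):
    """Run one TestCase class; return (tests run, contexts created, success)."""
    FakeContext.created = 0
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(test_case)
    result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
    return result.testsRun, FakeContext.created, result.wasSuccessful()


print(run_and_count(FreshContextTest))   # (2, 2, True)
print(run_and_count(SharedContextTest))  # (2, 1, True)
```

A fresh context per test gives better isolation, while a shared one avoids paying the start-up cost repeatedly, which matters when the real context takes seconds to create.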
Chris Colburn, Netflix
Chris gave a great talk about the engineering issues faced by Netflix, which serves over 100 million hours of video per day and collects over 500 billion events from consumer devices. To handle this, Netflix runs over 100 microservices and pushes new code over 40 times a day.
Chris outlined their efforts on anomaly detection, and in particular the use of robust variants of SVD and PCA instead of the usual versions. He also mentioned that Netflix will soon release some work on stochastic testing, where the output of probabilistic models is tested to be within certain expected bounds.
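Netflix's stochastic testing work wasn't shown in detail, so the following is only my own minimal sketch of the general idea: run a probabilistic model, then assert that a summary statistic lands within bounds derived from its expected distribution. The `simulate_rate` model, the five-standard-error tolerance and the seed are all assumptions chosen for illustration.

```python
import math
import random


def simulate_rate(p, n, rng):
    """Hypothetical probabilistic model: fraction of n Bernoulli(p) trials that succeed."""
    return sum(rng.random() < p for _ in range(n)) / n


def within_expected_bounds(observed, expected_p, n, sigmas=5.0):
    """Check that an observed rate lies within `sigmas` standard errors of expected_p."""
    standard_error = math.sqrt(expected_p * (1.0 - expected_p) / n)
    return abs(observed - expected_p) <= sigmas * standard_error


rng = random.Random(2015)  # fixed seed keeps the test repeatable
n = 10_000

well_behaved = simulate_rate(0.3, n, rng)  # model matches the expectation
drifted = simulate_rate(0.5, n, rng)       # model has drifted away from it

print(within_expected_bounds(well_behaved, 0.3, n))
print(within_expected_bounds(drifted, 0.3, n))
```

The first check should pass and the second should fail. The width of the bounds trades false alarms against sensitivity: wider bounds flake less often but let smaller regressions through.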
Don Dini, AT&T
Despite being mistaken for a physicist more than once, Don comes from a computational science background. He discussed how the definition of Artificial Intelligence has shifted over the decades, in keeping with Larry Tesler’s aphorism that ‘AI is what hasn’t been done yet’.
Don described how search-based planning of actions doesn’t scale well, because the search space grows exponentially, and how more probabilistic approaches are necessary.
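To put some numbers on that exponential growth, here is a small sketch. It assumes the simplest possible setting, which is my simplification rather than anything from Don's talk: a fixed set of actions, any of which can be taken at each step, so a plan of length `horizon` is just a sequence of actions. The action names are hypothetical.

```python
from itertools import product


def count_plans(num_actions, horizon):
    """Number of distinct action sequences of length `horizon`."""
    return num_actions ** horizon


def enumerate_plans(actions, horizon):
    """Brute-force enumeration of every plan; only feasible for tiny problems."""
    return list(product(actions, repeat=horizon))


actions = ["left", "right", "wait"]  # hypothetical action set

# The closed-form count agrees with brute-force enumeration on a small case.
assert len(enumerate_plans(actions, 4)) == count_plans(len(actions), 4)

for horizon in (5, 10, 20):
    print(horizon, count_plans(len(actions), horizon))
# 5 243
# 10 59049
# 20 3486784401
```

Even with only three actions, looking 20 steps ahead already means billions of candidate plans, which is why exhaustive search gives way to sampling or other probabilistic methods.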
Day One Summary
Overall the first day of ODSC West was really informative. There were a lot of other talks in the schedule across the different tracks, so I’m looking forward to checking out the videos when they’re released.