The data science team at Skimlinks has been using Spark for a few years now. In this talk, we will share our experience of doing large-scale data analysis with Apache Spark, from its basic functionality to more advanced features.
The talk will be split into two parts: the first half (in this video) will be an introduction to Spark’s core functionality. We will go through some of the main operations and look at how they are executed on the cluster. In the second part, we will focus on one of the most popular components of Spark: Spark SQL and the DataFrame API. We will demonstrate how they can be used to query terabytes of data on the fly.
The talk will be full of practical insights, from how to get started on a cluster of EC2 instances to doing some basic debugging of a job. We will use an IPython notebook for most of the talk. It is based on a talk given at PyData London 2015.
Attendees will get the most out of the session if they install Spark 1.6 on their laptops beforehand.
Installing Spark deserves a tutorial of its own, so we will probably not have time to cover it or offer assistance. We recommend installing the pre-built Spark 1.6 package for Hadoop 2.4.
Spark download page:
During my PhD in signal processing at Cambridge, I discovered Bayesian statistics and all its amazing practical applications. Now I am working at Skimlinks applying machine learning at large scale to build models of user behaviour. I use a combination of scikit-learn and Spark to train and apply models to massive datasets.
Meetup page: http://www.meetup.com/Spark-London/events/229636441/