The talk is split into two parts. The first half (in this video) is an introduction to Spark’s core functionality: we will go through some of the main operations and look at how they are executed on the cluster. In the second part, we will focus on one of the most popular components of Spark: Spark SQL and the DataFrame API. We will demonstrate how they can be used to query terabytes of data on the fly.
The talk will be full of practical insights, from getting started on a cluster of EC2 instances to basic debugging of a job. We will use an IPython notebook for most of the talk. It is based on a talk given at PyData London 2015.
Attendees will get the most out of the session if they install Spark 1.6 on their laptops beforehand.
Installing Spark deserves a tutorial of its own, so we will probably not have time to cover it or offer assistance. We recommend installing the pre-built Spark 1.6 package for Hadoop 2.4.
Spark download page:
Sahan studied Computer Science and Applied Statistics before working as a researcher in computational biology early in his career. He then pursued a master’s degree in Computational Statistics and Machine Learning at University College London. He works as a Data Science Engineer at Skimlinks, where his primary focus is building high-performance data pipelines for data processing and machine learning.
The data science team at Skimlinks has been using Spark for a few years now. In this talk, we will share some of our experience of large-scale data analysis with Apache Spark, from its basic functionality to more advanced features.