In this talk, we will focus on the following aspects of Spark Streaming:
(i) Motivation and most common use cases for Spark Streaming:
– Streaming Data Ingestion & ETL: Building a data highway to ingest real-time data into warehouses, search engines or data lakes.
– Monitoring & Dashboarding
– Anomaly/Fraud Detection with Online Learning: Doing predictions on streams and keeping the model up-to-date based on new data being observed.
– Sessionization: Identifying sessions based on user behavior from streams.
(ii) Common design patterns that emerge from these use cases, and tips to avoid common pitfalls while implementing them:
– Associative Time-Based Window Aggregations: How and when to use window functions efficiently to do associative aggregations and maintain running statistics from your data.
– Global Aggregations with State Management: Maintaining the most current value of a statistic for all of time with a global state.
– Joining streams efficiently with static and dynamic datasets: Often you want not only to join multiple streams but also to join them with historical datasets, which can be static or dynamically changing. We will walk through the best practices for these joins.
– Using SQL operations on streams: How to use Spark SQL on DStreams efficiently.
– Avoiding common pitfalls while doing online model updates.
(iii) Performance optimization techniques:
– How to scale out efficiently to achieve high throughput.
– Better state management with state pruning.
– Fine-tuning the checkpoint interval for optimum performance.
– Efficient ways of writing to data sinks.
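To make the window aggregation and state pruning ideas above more concrete, here is a minimal sketch in plain Python. It is not Spark code and not the speakers' implementation; the class name `SlidingWindowSum` and its methods are hypothetical, chosen only for illustration. It keeps a running per-key aggregate over the last few batches by adding each new batch and subtracting the batch that slides out of the window (similar in spirit to the inverse-function variant of `reduceByKeyAndWindow` on DStreams), and it drops keys that no longer appear in any batch in the window so that state does not grow without bound.

```python
from collections import deque
from operator import add, sub


class SlidingWindowSum:
    """Hypothetical helper: per-key aggregate over the last `window_size` batches."""

    def __init__(self, window_size, reduce_fn=add, inverse_fn=sub):
        self.window_size = window_size
        self.reduce_fn = reduce_fn    # associative combine, e.g. addition
        self.inverse_fn = inverse_fn  # undoes reduce_fn, e.g. subtraction
        self.batches = deque()        # pre-aggregated batches currently in the window
        self.totals = {}              # key -> aggregate over the current window
        self.live = {}                # key -> number of window batches containing the key

    def add_batch(self, records):
        """Fold one batch of (key, value) pairs into the window and return the totals."""
        # Pre-aggregate the incoming batch so each key is touched once.
        partial = {}
        for key, value in records:
            partial[key] = self.reduce_fn(partial[key], value) if key in partial else value

        # Add the new batch to the running totals.
        for key, value in partial.items():
            if key in self.totals:
                self.totals[key] = self.reduce_fn(self.totals[key], value)
            else:
                self.totals[key] = value
            self.live[key] = self.live.get(key, 0) + 1
        self.batches.append(partial)

        # Subtract the batch that just slid out of the window.
        if len(self.batches) > self.window_size:
            expired = self.batches.popleft()
            for key, value in expired.items():
                self.live[key] -= 1
                if self.live[key] == 0:
                    # State pruning: no batch in the window mentions this key.
                    del self.totals[key]
                    del self.live[key]
                else:
                    self.totals[key] = self.inverse_fn(self.totals[key], value)

        return dict(self.totals)


if __name__ == "__main__":
    window = SlidingWindowSum(window_size=2)
    for batch in ([("a", 1), ("b", 2), ("a", 3)], [("a", 5)], [("c", 7)]):
        print(window.add_batch(batch))
```

The inverse function keeps the cost of each slide proportional to the size of the entering and leaving batches rather than the whole window, which only works when the aggregation is invertible (sums and counts are; a maximum is not).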
Vida is currently a Solutions Engineer at Databricks. Previously, she worked on scaling Square’s Reporting Analytics System. She first began working with distributed computing at Google, where she improved search rankings of mobile-specific web content and built and tuned language models for speech recognition using a year’s worth of Google search queries. She’s passionate about accelerating the adoption of Apache Spark to bring the combination of speed and scale of data processing to the mainstream.
Prakash is currently a Solutions Architect at Databricks, where he focuses on helping customers build their big data infrastructure. He draws on a decade of experience building large-scale distributed systems and machine learning infrastructure at companies including Netflix and Yahoo. Prior to joining Databricks, he was at Netflix, designing and building the recommendation infrastructure that serves millions of recommendations to Netflix users every day. His interests broadly include distributed systems and machine learning, and early in his career he co-authored several publications on machine learning and computer vision research.