Partha Nageswaran and Sudarshan Kadambi of Bloomberg presented the following talk at Spark Summit San Francisco on June 7, 2016.
Managed Dataframes and Dynamically Composable Analytics: The Bloomberg Spark Server
Bloomberg has a strong reputation in the financial industry for providing lightning-fast analytics on vast quantities of data. In this presentation, we talk about Bloomberg’s analytics stack and how Spark, with its formidable computational model for distributed, high-performance analytics, helps take this to the next level. We talk about the kinds of analytics that are being expressed in Spark and how these challenge what Spark is currently capable of, in both functionality and performance. At Bloomberg, instead of building isolated Spark applications for individual problem domains, we are implementing a framework-based approach to registering, discovering, and querying DataFrames and real-time data streams. DataFrames in the framework are cataloged in a registry, which captures data provenance (backing stores and real-time streams) as well as analytical and domain-specific metadata. This allows for composable analytics over continuously updated data, with significantly less boilerplate code for data plumbing. The results of these analytics can be registered back in the catalog, to be leveraged in higher-order analytics. With such a data catalog, connectors to various internal data systems, and standardized serverization runtimes for hosted Spark applications, Spark can allow for seamless integration between disparate datastores and data domains. We round out this talk by discussing a few challenges with building analytics infrastructure over Spark: the need for dynamic topic registration, efficient stream reconciliation with updateStateByKey, and context sharing for low-latency analytics while achieving efficient resource utilization. This is a continuation of our talk at Spark Summit East 2016, where we first introduced Managed DataFrames.
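To make the registry idea concrete, here is a minimal sketch of a catalog that records provenance and metadata and lets derived results be registered back for reuse. It is an illustration only, not Bloomberg's implementation: plain Python lists of dicts stand in for Spark DataFrames, and every name in it (CatalogEntry, Registry, derive, discover, the sample datasets) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Rows = List[dict]  # stand-in for a DataFrame in this sketch


@dataclass
class CatalogEntry:
    """One cataloged dataset: its origin plus descriptive metadata."""
    name: str
    provenance: List[str]        # backing stores, streams, or parent entries
    metadata: Dict[str, str]     # analytical and domain-specific tags
    load: Callable[[], Rows]     # hypothetical loader that returns the rows


class Registry:
    """Hypothetical in-memory catalog keyed by dataset name."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        if entry.name in self._entries:
            raise ValueError(f"{entry.name!r} is already registered")
        self._entries[entry.name] = entry

    def lookup(self, name: str) -> CatalogEntry:
        return self._entries[name]

    def discover(self, **tags: str) -> List[str]:
        """Return names of entries whose metadata matches every given tag."""
        return [
            e.name for e in self._entries.values()
            if all(e.metadata.get(k) == v for k, v in tags.items())
        ]

    def derive(self, name: str, parents: List[str],
               fn: Callable[..., Rows], metadata: Dict[str, str]) -> CatalogEntry:
        """Register the result of fn over parent entries as a new entry."""
        inputs = [self.lookup(p) for p in parents]
        entry = CatalogEntry(
            name=name,
            provenance=list(parents),  # provenance points at the parents
            metadata=metadata,
            load=lambda: fn(*[i.load() for i in inputs]),  # evaluated lazily
        )
        self.register(entry)
        return entry


# Usage with made-up sample data.
registry = Registry()
registry.register(CatalogEntry(
    "prices", ["store:prices_db"], {"domain": "equity"},
    lambda: [{"ticker": "ABC", "price": 10.0}, {"ticker": "XYZ", "price": 20.0}],
))
registry.register(CatalogEntry(
    "positions", ["stream:positions"], {"domain": "equity"},
    lambda: [{"ticker": "ABC", "qty": 3}, {"ticker": "XYZ", "qty": 2}],
))


def market_value(prices: Rows, positions: Rows) -> Rows:
    """Join positions to prices by ticker and compute qty * price."""
    price_by_ticker = {r["ticker"]: r["price"] for r in prices}
    return [
        {"ticker": p["ticker"], "value": p["qty"] * price_by_ticker[p["ticker"]]}
        for p in positions
    ]


registry.derive("market_value", ["prices", "positions"], market_value,
                {"domain": "equity", "kind": "derived"})
```

Because the derived entry lives in the same catalog as its inputs, a later analytic can name "market_value" as a parent in exactly the same way, which is the sense in which results feed higher-order analytics. In a real deployment the loaders would presumably return Spark DataFrames or streams rather than Python lists.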
About Spark Summit San Francisco
Spark Summit, the largest big data event dedicated to Apache Spark, is returning to San Francisco, Monday, June 6 through Wednesday, June 8, 2016, with an all-new program that promises to be the best yet. Join more than 2,500 engineers, analysts, scientists and business professionals at the Hilton Union Square in the heart of San Francisco for three days of in-depth learning and networking that you won’t want to miss.
With over 90 sessions and five tracks to choose from, there’s content for every level and role. Hear from leading production users of Spark, Spark SQL, Spark Streaming and related projects; find out where the future of Spark is going; discover how to use the Spark stack in a variety of applications; and get best practices from businesses that are utilizing Spark at scale successfully.
Partha Nageswaran is an R&D Lead at Bloomberg, where he’s architecting and helping build out a Low Latency Analytics Ecosystem for Financial Analytics. As a former Advisory Engineer at IBM, he worked on building distributed platforms and holds three patents. He holds three master’s degrees, in math, physics, and computer science.
Sudarshan Kadambi is an Architect at Bloomberg, helping build Bloomberg’s Data and Compute infrastructure. He has a background in distributed systems, has been a long-time user of Hadoop and, more recently, Spark, and is passionate about making these technologies awesome.