Dataframes are popular tools for interacting with and exploring data, but they are neither as well understood nor as deeply studied as databases. Python’s pandas and Apache Spark are two of the most popular dataframe systems in use by data practitioners, but even these are extremely different from each other in terms of guarantees and user expectations. In this talk, we will explore these differences and take a deep dive into pandas-like dataframes through a theoretical lens, exploring the dataframe data model and dataframe algebra. In contrast to databases, pandas-like dataframes are ordered, have additional metadata, and can operate on and query metadata as if it were data. Spark-style dataframes are unordered and follow the relational database data model much more closely. We will discuss Snowpark, Snowflake’s implementation of Spark-style dataframes, which maps dataframe APIs in Python, Scala, and Java to SQL. There are interesting challenges in mapping the Snowpark dataframe APIs’ imperative programming syntax to the declarative, all-or-nothing world of SQL. We will also discuss some of our experiences mapping pandas to SQL databases at Ponder. Finally, we will look at an open source distributed dataframe implementation, Modin, and map this implementation to the data model and algebra discussed earlier.
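As a rough illustration of what "ordered" and "metadata as data" mean for pandas-like dataframes, here is a minimal sketch. The toy data and variable names are hypothetical and are not taken from the talk; only standard pandas calls are used.

```python
import pandas as pd

# Hypothetical toy data, chosen only for illustration.
df = pd.DataFrame(
    {"city": ["Oslo", "Lima", "Pune"], "temp_c": [4, 19, 31]},
    index=["a", "b", "c"],
)

# Order is part of the data model, so positional access is well defined.
first_row_city = df.iloc[0]["city"]

# Row labels are metadata; reset_index moves them into an ordinary column.
as_data = df.reset_index()

# Column labels are metadata too, and can be filtered like values.
temp_columns = [name for name in df.columns if name.startswith("temp")]

# Transpose swaps the roles of row labels and column labels.
transposed = df.T
```

A relational table has no inherent row order and keeps column names in the schema rather than in the data, so operations such as positional lookup or transpose have no direct equivalent there.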
Speaker bio: Devin Petersohn received a PhD from the UC Berkeley RISELab in 2021. His PhD focused on dataframes, and during it he created an open source project called Modin. His thesis outlines dataframes from foundational theory all the way to implementation. After his PhD, he co-founded a company called Ponder with Doris Lee (fellow UC Berkeley PhD) and Aditya Parameswaran (UC Berkeley Professor). Ponder's product was built on the open source Modin project and worked by transpiling pandas into SQL. Ponder was recently acquired by Snowflake, and Devin now contributes to the Python efforts within Snowflake, such as Snowpark.