Beyond SQL: Dataframes in the Database (Devin Petersohn, Snowflake)


Jignesh Patel


Dataframes are popular tools for interacting with and exploring data, but they are neither as well understood nor as deeply studied as databases. Python’s pandas and Apache Spark are two of the most popular dataframes in use by data practitioners, but even these differ sharply from each other in their guarantees and in what users expect of them. In this talk, we will explore these differences and take a deep dive into pandas-like dataframes through a theoretical lens, exploring the dataframe data model and dataframe algebra. In contrast to databases, pandas-like dataframes are ordered, carry additional metadata, and can operate on and query that metadata as if it were data. Spark-style dataframes are unordered and follow the relational database data model much more closely. We will discuss Snowpark, Snowflake’s implementation of Spark-style dataframes, which maps dataframe APIs in Python, Scala, and Java to SQL. There are interesting challenges in mapping the imperative programming syntax of the Snowpark dataframe APIs to the declarative, all-or-nothing world of SQL. We will also discuss some of our experiences mapping pandas to SQL databases at Ponder. Finally, we will look at an open source distributed dataframe implementation, Modin, and map this implementation to the data model and algebra discussed earlier.
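As a rough illustration of what mapping an imperative dataframe API to declarative SQL can look like, the toy sketch below chains method calls that each wrap the previous query in a subquery, and only runs anything when results are requested. It is not Snowpark's API or implementation: the `LazyFrame` class, its method names, and the `trips` table are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3


class LazyFrame:
    """Hypothetical Spark-style dataframe: every method returns a new frame
    that wraps a SQL string. No query runs until collect() is called."""

    def __init__(self, conn, sql):
        self.conn = conn
        self.sql = sql

    @classmethod
    def table(cls, conn, name):
        return cls(conn, f"SELECT * FROM {name}")

    def filter(self, predicate):
        # Nest the query built so far as a subquery and add a WHERE clause.
        return LazyFrame(self.conn, f"SELECT * FROM ({self.sql}) WHERE {predicate}")

    def select(self, *columns):
        cols = ", ".join(columns)
        return LazyFrame(self.conn, f"SELECT {cols} FROM ({self.sql})")

    def sort(self, column):
        # Order exists only because it is requested explicitly here;
        # the underlying relation itself is unordered.
        return LazyFrame(self.conn, f"SELECT * FROM ({self.sql}) ORDER BY {column}")

    def collect(self):
        # The whole chain is sent to the database as one statement.
        return self.conn.execute(self.sql).fetchall()


# SQLite is an assumption made for this sketch, standing in for a SQL warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (city TEXT, miles REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [("Oslo", 12.5), ("Lima", 3.0), ("Pune", 8.25)],
)

long_trips = (
    LazyFrame.table(conn, "trips")
    .filter("miles > 5")
    .select("city", "miles")
    .sort("miles")
)
generated_sql = long_trips.sql   # the single SQL statement built by the chain
rows = long_trips.collect()      # execution happens only here
```

The step-by-step calls read like an imperative program, yet the database sees one declarative statement. A pandas-like dataframe would additionally need to preserve row order and labels through each step, which this sketch does not attempt.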
Speaker bio: Devin Petersohn received a PhD from the UC Berkeley RISELab in 2021. His PhD focused on dataframes, and during it he created an open source project called Modin. His thesis outlines dataframes from foundational theory all the way to implementation. After his PhD, he co-founded a company called Ponder with Doris Lee (fellow UC Berkeley PhD) and Aditya Parameswaran (UC Berkeley Professor). Ponder's product was built on the open source Modin project and worked by transpiling pandas into SQL. Ponder was recently acquired by Snowflake, and Devin now contributes to Snowflake's Python efforts, such as Snowpark.
