Modern data workloads come in all shapes and sizes: numbers, strings, JSON, images, even whole PDF textbooks. To process this data we still rely on utilities such as ffmpeg for video, jq for JSON, and PyTorch for tensors. However, these tools were not built for large-scale ETL, so we often end up building bespoke data pipelines that orchestrate data movement and custom tooling. If only downloading images, resizing them, and running vision models were as simple as extracting a substring in Spark SQL! Daft (www.getdaft.io) is a next-generation distributed query engine built in Python and Rust. It provides a familiar dataframe interface for easy, performant processing of multimodal data at scale. Join us as we demonstrate how to build a multimodal data lakehouse using Daft on your existing infrastructure (S3, Delta Lake, Databricks, and Spark).
Talk By: Jay Chia, Co-Founder, Eventual Computing
Here’s more to explore:
Big Book of Data Engineering: 2nd Edition: dbricks.co/3Xp...
The Data Team's Guide to the Databricks Lakehouse Platform: dbricks.co/46n...
Connect with us: Website: databricks.com
Twitter: / databricks
LinkedIn: / data…
Instagram: / databricksinc
Facebook: / databricksinc