Simplify Data Conversion from Spark to TensorFlow and PyTorch

  Рет қаралды 2,985

Databricks

Databricks

Күн бұрын

In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks.
Imagine you have a large dataset, say 20 GBs, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model?
The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduced the data scientists’ productivity.
The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify these tedious data conversion process steps. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters.
In the talk, I will use an example to show how to use the Spark Dataset Converter to train a Tensorflow model and how simple it is to go from single-node training to distributed training on Databricks.
Connect with us:
Website: databricks.com
Facebook: / databricksinc
Twitter: / databricks
LinkedIn: / databricks
Instagram: / databricksinc Databricks is proud to announce that Gartner has named us a Leader in both the 2021 Magic Quadrant for Cloud Database Management Systems and the 2021 Magic Quadrant for Data Science and Machine Learning Platforms. Download the reports here. databricks.com...

Пікірлер: 2
@mationplays1500
@mationplays1500 2 жыл бұрын
Great Tutorial!
@BullishBuddy
@BullishBuddy Жыл бұрын
great video, can you please share the source code?
Entity Resolution Using Patient Records at CMMI
18:44
Databricks
Рет қаралды 2,1 М.
Learn to Use Databricks for the Full ML Lifecycle
39:47
Databricks
Рет қаралды 30 М.
АЗАРТНИК 4 |СЕЗОН 1 Серия
40:47
Inter Production
Рет қаралды 1,3 МЛН
SCHOOLBOY. Мама флексит 🫣👩🏻
00:41
⚡️КАН АНДРЕЙ⚡️
Рет қаралды 6 МЛН
An Unknown Ending💪
00:49
ISSEI / いっせい
Рет қаралды 8 МЛН
How Strong is Tin Foil? 💪
00:26
Preston
Рет қаралды 32 МЛН
Cursor Is Beating VS Code (...by forking it)
18:00
Theo - t3․gg
Рет қаралды 74 М.
Why Hadoop is Dying
3:52
ness-intricity101
Рет қаралды 66 М.
James Webb Telescope Finally Sees What's Inside a Black Hole
12:58
Little-Known AI Tools Giving Academics an Unfair Advantage
9:58
Andy Stapleton
Рет қаралды 12 М.
АЗАРТНИК 4 |СЕЗОН 1 Серия
40:47
Inter Production
Рет қаралды 1,3 МЛН