38. Databricks | Pyspark | Interview Question | Compression Methods: Snappy vs Gzip

  12,882 views

Raja's Data Engineering

1 day ago

Comments: 26
@47shashank47
@47shashank47 1 year ago
I just started following your playlist 3 days ago. The way you explain is so amazing; concepts I could not get clear in the last 7-8 months became clear to me in just 3 days. Thanks a lot for creating such amazing content.
@rajasdataengineering7585
@rajasdataengineering7585 1 year ago
Thanks for your comment, Shashank! Keep watching.
@azarudeena6467
@azarudeena6467 2 years ago
Easy to understand
@rajasdataengineering7585
@rajasdataengineering7585 2 years ago
Thank you
@ajinkyamore8359
@ajinkyamore8359 2 years ago
Really Nice Explanation. Thanks
@rajasdataengineering7585
@rajasdataengineering7585 2 years ago
Thank you
@PavanKumar-tt8mm
@PavanKumar-tt8mm 2 years ago
Good, Raja. Today I learned a new topic. Thank you.
@rajasdataengineering7585
@rajasdataengineering7585 2 years ago
Thank you Pavan
@deevjitsaha3168
@deevjitsaha3168 2 months ago
I tried creating a parquet file with gzip compression, but it created multiple part files. It's supposed to create one file, right?
@karamveersolanki138
@karamveersolanki138 2 years ago
Hi Raja, one doubt regarding splittable files: you said more than one core can access them. Doesn't that mean the file is spread over multiple partitions and is available for parallel processing?
@rajasdataengineering7585
@rajasdataengineering7585 2 years ago
Good question, Karamveer. The data is distributed across nodes in the form of partitions, but that's within the cluster environment (within on-heap memory when we talk about Spark). What we are discussing here is file storage in an external system such as DBFS, S3, ADLS, HDFS, etc. When Spark reads data from that external storage, a huge file that is not in a splittable format takes longer to distribute across nodes as partitions, because a non-splittable file can't be read by multiple cores at a time. Hope it is clear. Thanks for this good question.
@sohelsayyad5572
@sohelsayyad5572 1 year ago
Thank you sir. If a huge file is not splittable, can we convert its compression format to make it splittable? If yes, how do we do that? Also, is there any scenario where parquet/orc/avro is not splittable and needs a workaround? How do we resolve it? 👍
@srinubathina7191
@srinubathina7191 1 year ago
Thank You Sir
@rajasdataengineering7585
@rajasdataengineering7585 1 year ago
Most welcome
@Mehtre108
@Mehtre108 9 months ago
Hello sir, what is the sequence to watch the videos? Some are not in the playlist.
@kanstantsinhulevich4313
@kanstantsinhulevich4313 1 year ago
Hey, Raja. I know that a parquet file with the gzip codec is splittable. Of course, if we compress a csv file with the gzip codec it won't be splittable. It would be nice if you could add some clarification.
@rajasdataengineering7585
@rajasdataengineering7585 1 year ago
Hi Kanstantsin, yes, you are right. A Parquet file with gzip is splittable by default, while a CSV with gzip is non-splittable by default. However, there are some workarounds to split gzipped CSV files, like reading them with the text input format API or pre-splitting the gzipped file into multiple pieces.
@abhaybisht101
@abhaybisht101 2 years ago
Nice content Raja 🤟
@rajasdataengineering7585
@rajasdataengineering7585 2 years ago
Thanks Abhay
@karthickrajachandrasekar8486
@karthickrajachandrasekar8486 2 years ago
Hi Raja, thanks for the amazing explanation. I have one doubt: is there any way for the same file name to be shown after compressing into gz?
@rajasdataengineering7585
@rajasdataengineering7585 2 years ago
Yes, the same name will be shown after compression.
@karthickrajachandrasekar8486
@karthickrajachandrasekar8486 2 years ago
@@rajasdataengineering7585 It shows a name like part-004. How do we keep the same name that was given to the CSV?
@rajasdataengineering7585
@rajasdataengineering7585 2 years ago
If you want a specific file name, the dataframe can be converted to pandas and written out with that name.
@sravankumar1767
@sravankumar1767 2 years ago
Nice explanation Raj 👌 👍
@rajasdataengineering7585
@rajasdataengineering7585 2 years ago
Thanks Sravan
@abhinavsingh1173
@abhinavsingh1173 1 year ago
Your course is the best. But one problem with your course is that you are not attaching a GitHub link for your sample data and code. As your audience, I request you to please do this. Thanks.