38. Databricks | Pyspark | Interview Question | Compression Methods: Snappy vs Gzip

  12,882 views

Raja's Data Engineering

1 day ago

Comments: 26
@47shashank47
@47shashank47 1 year ago
I just started following your playlist 3 days ago. The way you explain is so amazing; concepts I could not get clear in the last 7-8 months became clear to me in just 3 days. Thanks a lot for creating such amazing content.
@rajasdataengineering7585
@rajasdataengineering7585 1 year ago
Thanks for your comment, Shashank! Keep watching.
@azarudeena6467
@azarudeena6467 2 years ago
Easy to understand
@rajasdataengineering7585
@rajasdataengineering7585 2 years ago
Thank you
@ajinkyamore8359
@ajinkyamore8359 2 years ago
Really Nice Explanation. Thanks
@rajasdataengineering7585
@rajasdataengineering7585 2 years ago
Thank you
@PavanKumar-tt8mm
@PavanKumar-tt8mm 2 years ago
Good, Raja. Today I learned a new topic. Thank you.
@rajasdataengineering7585
@rajasdataengineering7585 2 years ago
Thank you Pavan
@deevjitsaha3168
@deevjitsaha3168 2 months ago
I tried creating a parquet file with gzip compression, but it created multiple part files. It's supposed to create one file, right?
@karamveersolanki138
@karamveersolanki138 2 years ago
Hi Raja, one doubt regarding splittable files: you said more than one core can access them. Doesn't that mean the file is spread over multiple partitions and is available for parallel processing?
@rajasdataengineering7585
@rajasdataengineering7585 2 years ago
Good question, Karamveer. The data is distributed across nodes in the form of partitions, but that's within the cluster environment (within on-heap memory when we talk about Spark). What we are discussing here is file storage in an external system such as DBFS, S3, ADLS, HDFS, etc. When Spark reads data from that external storage, a huge file that is not in a splittable format takes longer to distribute across nodes as partitions, because a non-splittable file can't be read by multiple cores at a time. Hope it is clear. Thanks for this good question.
@sohelsayyad5572
@sohelsayyad5572 1 year ago
Thank you sir. If a huge file is not splittable, can we convert its compression format to make it splittable? If yes, how do we do that? Also, is there any scenario where parquet/orc/avro is not splittable and needs a workaround? How do we resolve it? 👍
@srinubathina7191
@srinubathina7191 1 year ago
Thank You Sir
@rajasdataengineering7585
@rajasdataengineering7585 1 year ago
Most welcome
@Mehtre108
@Mehtre108 9 months ago
Hello sir, what is the sequence to watch the videos? Some are not in the playlist.
@kanstantsinhulevich4313
@kanstantsinhulevich4313 1 year ago
Hey, Raja. I know that a parquet file with the gzip codec is splittable. Of course, if we compress a csv file with the gzip codec it won't be splittable. It would be nice if you could add some clarification.
@rajasdataengineering7585
@rajasdataengineering7585 1 year ago
Hi Kanstantsin, yes, you are right. A Parquet file with gzip is splittable by default, while a CSV with gzip is non-splittable by default. However, there are some workarounds to split gzipped CSV files, like reading them with the text input format API or pre-splitting the gzipped file into multiple pieces.
@abhaybisht101
@abhaybisht101 2 years ago
Nice content Raja 🤟
@rajasdataengineering7585
@rajasdataengineering7585 2 years ago
Thanks Abhay
@karthickrajachandrasekar8486
@karthickrajachandrasekar8486 2 years ago
Hi Raja, thanks for the amazing explanation. I have one doubt: is there any way for the same file name to be shown after compressing into gz?
@rajasdataengineering7585
@rajasdataengineering7585 2 years ago
Yes, the same name will be shown after compression.
@karthickrajachandrasekar8486
@karthickrajachandrasekar8486 2 years ago
@@rajasdataengineering7585 It shows a name like part-004. How do we keep the same name that was given to the CSV?
@rajasdataengineering7585
@rajasdataengineering7585 2 years ago
If you want a specific file name, the dataframe can be converted to pandas and written out with that name.
@sravankumar1767
@sravankumar1767 2 years ago
Nice explanation Raj 👌 👍
@rajasdataengineering7585
@rajasdataengineering7585 2 years ago
Thanks Sravan
@abhinavsingh1173
@abhinavsingh1173 1 year ago
Your course is the best. But one problem with your course is that you are not attaching a GitHub link for your sample data and code. As your audience, I request you to please do this. Thanks.