Amazing video Mark, your explanation and visualisation of everything were so nice!
@learndatawithmark 11 months ago
Thanks! That's very kind of you :)
@pawarbi4675 a year ago
Excellent, how do you use this in practice? Check the cardinality of each column and then choose an encoding before saving the Parquet file? And if a schema is defined in Spark for each column before saving to Parquet, are we effectively doing the same thing?
@learndatawithmark a year ago
I'm not sure what Spark does, actually; I'd have to check. I still find it kind of surprising that the Parquet writers don't just optimise everything for you, that would make more sense to me! I also need to see how much the saving on space impacts the query side. In theory there should be a trade-off between the two, but I'm not sure how big it is.
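If you want to try the cardinality-check idea yourself, something along these lines is a rough starting point. It's only a sketch with pyarrow; the columns and the 50% threshold are made up for illustration. The idea is to count the distinct values in each column and only dictionary-encode the low-cardinality ones:

import pyarrow as pa
import pyarrow.parquet as pq

# Toy table: 'country' is low cardinality, 'user_id' is high cardinality
table = pa.table({
    "country": ["UK", "US", "UK", "FR"],
    "user_id": ["a1", "b2", "c3", "d4"],
})

# Arbitrary rule of thumb: treat a column as high cardinality if more
# than half of its values are distinct
threshold = 0.5
high_card = [
    name for name in table.column_names
    if len(table.column(name).unique()) / table.num_rows > threshold
]

# Only dictionary-encode the low-cardinality columns
pq.write_table(
    table,
    "example.parquet",
    use_dictionary=[c for c in table.column_names if c not in high_card],
)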
@nmstoker a year ago
Thanks for the nice video. This makes sense where you have one or a few massive files, but if you've got a boatload of such files, is there a way to make the computer apply rules of thumb for you (so it scales as a process rather than having a person spend five minutes per file thousands of times over)?
@learndatawithmark a year ago
Which bit in particular do you mean, or just in general? I reckon you could probably automate everything I did in this video to retrospectively look at a bunch of existing Parquet files and see if there's a better way to store things. I definitely wouldn't recommend doing it manually!
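As a rough sketch of what that automation could look like (assuming pyarrow, with a made-up directory path): read the footer metadata of every file and print the encodings and compressed size of each column chunk, which is usually enough to spot files worth rewriting.

import pathlib
import pyarrow.parquet as pq

# Walk a directory of Parquet files and report what's in the footer metadata
for path in pathlib.Path("data/").glob("*.parquet"):
    meta = pq.ParquetFile(path).metadata
    print(path.name)
    for rg in range(meta.num_row_groups):
        for col in range(meta.num_columns):
            chunk = meta.row_group(rg).column(col)
            print(f"  {chunk.path_in_schema}: encodings={chunk.encodings}, "
                  f"compressed={chunk.total_compressed_size} bytes")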