23:33 -> 2 copies for every ingested record, then deduplicated before being put into the object store? Curious what the engineering benefit of that amplification is.
@pauldix2386 · 1 month ago
This is because the ingestion tier has a WAL on an attached disk. That's where writes go until they get persisted as Parquet files in object storage (which happens every 15 minutes, or more aggressively if a table is high throughput). We wanted to make sure that data would land in object storage in the event of a failure, without any coordination from the individual nodes in the ingestion tier. So the ingestion routing tier has a configurable replication factor. If it's set to 2, each write gets sent to 2 ingestors, and every ingestor persists its buffer to object storage. We can configure it to 1, but then there's a window of time during which written data would be unavailable if the ingestor that received it went down, or lost entirely if the volume died (although that's much lower probability and probably not worth engineering for, TBH).

The Monolith architecture that I describe later in the talk doesn't have this problem. Writes are persisted to object storage in individual WAL files before being written as Parquet, so we don't need to create multiple buffered copies; durability is guaranteed by the object store. And the downstream replicas can pick up those WAL files as they're written.
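[Editor's note: a minimal sketch of that fan-out, not InfluxDB's actual code. The `WriteRouter` type, node names, and round-robin placement are all hypothetical; it just shows how a replication factor of 2 turns one incoming write into two independently buffered copies.]

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical routing tier: fans each write out to `replication_factor`
/// ingestors so the data survives the loss of any single ingestor's WAL
/// before it is persisted to object storage as Parquet.
struct WriteRouter {
    ingestors: Vec<String>,    // addresses of the ingestion-tier nodes
    replication_factor: usize, // e.g. 2 => every write is buffered twice
    next: AtomicUsize,         // round-robin cursor
}

impl WriteRouter {
    /// Pick the set of ingestors that should receive this write.
    fn targets(&self) -> Vec<&str> {
        let start = self.next.fetch_add(1, Ordering::Relaxed);
        (0..self.replication_factor)
            .map(|i| self.ingestors[(start + i) % self.ingestors.len()].as_str())
            .collect()
    }
}

fn main() {
    let router = WriteRouter {
        ingestors: vec!["ingest-a".into(), "ingest-b".into(), "ingest-c".into()],
        replication_factor: 2,
        next: AtomicUsize::new(0),
    };
    // Each write lands on 2 of the 3 ingestors; both buffer it in their WAL
    // and persist it to object storage, where compaction later deduplicates.
    println!("{:?}", router.targets()); // e.g. ["ingest-a", "ingest-b"]
}
```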
@pctksud · 1 month ago
thanks Paul, appreciate the detailed answer
@SteveLoughran · 1 month ago
I enjoyed this, especially the bits where you reviewed things that didn't work as expected. Have you documented the parquet interop issues you found?
@AlfonsoSubiottoMarqués · 1 month ago
Very interesting talk, thanks for sharing. I wonder if you have any lessons learned from implementing parquet->parquet compaction (i.e. merging multiple parquet files into a bigger one) efficiently.
@pauldix2386 · 1 month ago
I think we still have a lot of work to do here. Most of what has fallen out of that is optimizations around better multi-column sorts, since that's ultimately the reason we're rewriting the data (to reorganize it on disk). We'll probably be spending a good chunk of time early next year working on optimizations with this, so definitely more to come.
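[Editor's note: for readers unfamiliar with the shape of this workload, here is a minimal compaction sketch using DataFusion (a reasonable guess at tooling, since InfluxDB 3 is built on it, though this is not their compactor). It assumes a recent DataFusion release; the file names and sort columns are hypothetical. The multi-column sort in the middle is the expensive step Paul refers to.]

```rust
// Assumed Cargo deps: datafusion = "43", tokio = { version = "1", features = ["rt-multi-thread", "macros"] }
use datafusion::dataframe::DataFrameWriteOptions;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Read the small input files as one logical table.
    let df = ctx
        .read_parquet(
            vec!["gen1-a.parquet".to_string(), "gen1-b.parquet".to_string()],
            ParquetReadOptions::default(),
        )
        .await?;

    // The expensive part: the multi-column re-sort that reorganizes the
    // data on disk (here: by a tag column, then by time).
    let df = df.sort(vec![
        col("host").sort(true, false),
        col("time").sort(true, false),
    ])?;

    // Write the merged, re-sorted result as one larger file.
    df.write_parquet("gen2.parquet", DataFrameWriteOptions::new(), None)
        .await?;
    Ok(())
}
```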
@raysonlogin · 1 month ago
Haha, still picking on his mmap call!
@scarface548 · 1 month ago
brings it up every video 😝
@pauldix2386 · 1 month ago
Haha, I can take it. The dirty secret is that in the real world, many mmap systems can outperform more "pure" systems that do their own file cache management. At least that's what we've experienced. Matching the performance of that mmap version is non-trivial.
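[Editor's note: for context, a trivial Rust sketch of the general mmap technique using the memmap2 crate; this is an illustration, not what any InfluxDB version actually does, and the file name is hypothetical. The point is that the OS page cache does the caching work a "pure" design would reimplement with its own buffer pool.]

```rust
// Assumed Cargo dep: memmap2 = "0.9"
use memmap2::Mmap;
use std::fs::File;

fn main() -> std::io::Result<()> {
    let file = File::open("data.parquet")?;
    // Map the file into the address space; reads go through the OS page
    // cache, so repeated access to hot regions hits warm pages for free.
    // `unsafe` because the mapping is invalidated if the file is truncated
    // underneath us while mapped.
    let mmap = unsafe { Mmap::map(&file)? };
    // The mapping derefs to a plain byte slice.
    let header = &mmap[..4.min(mmap.len())];
    println!("first bytes: {:?}", header);
    Ok(())
}
```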