Streaming data from BigQuery to Datastore using Dataflow

1,329 views

PracticalGCP

1 day ago

Comments: 8
@andrewwang5223 1 year ago
Nice video! Just a question: at 23:46 you mentioned that if the job was killed at "row number 2", some of the data between 0001-01-01 00:00:00 and 2023-11-11 23:00:15 was processed and some was not. When the pipeline resumes, doesn't it start the work process from the beginning and insert a new row into the checkpoint table? (So 2023-11-11 23:00:15 would become the last processed timestamp.)
@practicalgcp2780 1 year ago
I think there might be a misunderstanding because of the way I presented the checkpoint table. There is only one row per table in the checkpoint table; the diagram was just an illustration over time. What I am basically saying is that the checkpoint is not updated unless all of the data in that interval has been written. So the worst case is that it reprocesses all of the data in that window when the job is killed or stopped accidentally. This does happen every now and then due to bugs or network issues, so it's very important to design it this way so that no data is lost.
@andrewwang5223 1 year ago
@practicalgcp2780 Thanks for your explanation, but I think I must have misunderstood something 😂 The only update to the checkpoint table happens before the Datastore write, so how do we ensure that "the checkpoint will not be updated unless all of the data in that interval is written"?
@practicalgcp2780 1 year ago
@andrewwang5223 No, it doesn't quite work like that. As I explained, a streaming pipeline has no "end", so there is no such thing as "finished writing to Datastore". That is why I have an "incomplete timestamp": it tracks where the interval started, and the actual write to the checkpoint happens on the next impulse.
@andrewwang5223 1 year ago
@practicalgcp2780 That makes sense, thanks 🫡
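[Editor's note] The checkpointing scheme discussed in this thread can be sketched roughly as below. This is a minimal illustration of the semantics only, not code from the video's repository: all names (`start_interval`, `next_impulse`, the column names) are hypothetical, and the in-memory dict stands in for the BigQuery checkpoint table. The key property shown is that the completed checkpoint only advances on the *next* impulse, so a job killed mid-interval reprocesses that interval (at-least-once delivery) rather than losing data.

```python
# Hypothetical sketch of a one-row-per-table checkpoint. The "complete"
# timestamp marks data known to be fully written to Datastore; the
# "incomplete" timestamp marks where the current in-flight interval started.
checkpoint = {}  # table -> {"complete_ts": ..., "incomplete_ts": ...}

def start_interval(table, now):
    """Before writing an interval to Datastore, record where it starts.
    Returns the (start, end] window to read from BigQuery."""
    row = checkpoint.setdefault(
        table, {"complete_ts": "0001-01-01 00:00:00", "incomplete_ts": None}
    )
    row["incomplete_ts"] = now
    return row["complete_ts"], now

def next_impulse(table):
    """On the next impulse, promote the previous interval's incomplete
    timestamp to the completed checkpoint. If the job died before this
    point, complete_ts is unchanged and the interval is simply re-read."""
    row = checkpoint[table]
    if row["incomplete_ts"] is not None:
        row["complete_ts"] = row["incomplete_ts"]
        row["incomplete_ts"] = None
    return row["complete_ts"]
```

If the job is killed between `start_interval` and `next_impulse`, the completed checkpoint still points at the start of the interval, so the resumed pipeline re-reads and re-writes that window; since the writes are idempotent upserts into Datastore, this reprocessing is safe.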
@viralsurani7944 7 months ago
Getting the below error while running the pipeline with DirectRunner. Any idea? Transform node AppliedPTransform(Start Impulse FakePii/GenSequence/ProcessKeyedElements/GroupByKey/GroupByKey, _GroupByKeyOnly) was not replaced as expected.
@NilanshiMathur-e1b 1 year ago
Can you share the link to the Git code?
@practicalgcp2780 1 year ago
It's in the video description.