Streaming data from BigQuery to Datastore using Dataflow

1,329 views

PracticalGCP

1 day ago

Comments: 8
@andrewwang5223 1 year ago
Nice video! Just a question: at 23:46 you mentioned that if the job was killed at "row number 2", some of the data between 0001-01-01 00:00:00 and 2023-11-11 23:00:15 was processed and some was not. When the pipeline resumes, doesn't it start the work process from the beginning and insert a new row into the checkpoint table? (So 2023-11-11 23:00:15 would become the last processed timestamp.)
@practicalgcp2780 1 year ago
I think there might be a misunderstanding because of the way I presented the checkpoint table. There is only one row per table in the checkpoint table; the diagram was just an illustration over time. What I am basically saying is that the checkpoint is not updated unless all of the data in that interval has been written. So the worst case is that it reprocesses all of the data in that window when the job is killed or stopped accidentally. This does happen every now and then due to bugs or network issues, so it's very important to design it this way so that no data is lost.
@andrewwang5223 1 year ago
@practicalgcp2780 Thanks for your explanation, but I think I must have misunderstood something 😂 The only update to the checkpoint table happens before the Datastore write, so how do we ensure that "the checkpoint will not be updated unless all of the data in that interval is written"?
@practicalgcp2780 1 year ago
@andrewwang5223 No, it doesn't quite work like that. As I explained, a streaming pipeline has no "end", so there is no such thing as "finished writing to Datastore". That is why I have an "incomplete timestamp": it tracks where the interval started, and the actual write to the checkpoint happens on the next impulse.
@andrewwang5223 1 year ago
@practicalgcp2780 That makes sense, thanks 🫡
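[Editor's note] The checkpointing scheme discussed in this thread can be sketched roughly as below. This is a minimal illustration of the semantics only, not code from the video's repository: all names (`start_interval`, `next_impulse`, the column names) are hypothetical, and the in-memory dict stands in for the BigQuery checkpoint table. The key property shown is that the completed checkpoint only advances on the *next* impulse, so a job killed mid-interval reprocesses that interval (at-least-once delivery) rather than losing data.

```python
# Hypothetical sketch of a one-row-per-table checkpoint. The "complete"
# timestamp marks data known to be fully written to Datastore; the
# "incomplete" timestamp marks where the current in-flight interval started.
checkpoint = {}  # table -> {"complete_ts": ..., "incomplete_ts": ...}

def start_interval(table, now):
    """Before writing an interval to Datastore, record where it starts.
    Returns the (start, end] window to read from BigQuery."""
    row = checkpoint.setdefault(
        table, {"complete_ts": "0001-01-01 00:00:00", "incomplete_ts": None}
    )
    row["incomplete_ts"] = now
    return row["complete_ts"], now

def next_impulse(table):
    """On the next impulse, promote the previous interval's incomplete
    timestamp to the completed checkpoint. If the job died before this
    point, complete_ts is unchanged and the interval is simply re-read."""
    row = checkpoint[table]
    if row["incomplete_ts"] is not None:
        row["complete_ts"] = row["incomplete_ts"]
        row["incomplete_ts"] = None
    return row["complete_ts"]
```

If the job is killed between `start_interval` and `next_impulse`, the completed checkpoint still points at the start of the interval, so the resumed pipeline re-reads and re-writes that window; since the writes are idempotent upserts into Datastore, this reprocessing is safe.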
@viralsurani7944 7 months ago
Getting the below error while running the pipeline with DirectRunner. Any idea? Transform node AppliedPTransform(Start Impulse FakePii/GenSequence/ProcessKeyedElements/GroupByKey/GroupByKey, _GroupByKeyOnly) was not replaced as expected.
@NilanshiMathur-e1b 1 year ago
Can you share the link to the Git code?
@practicalgcp2780 1 year ago
It's in the video description.