
PubSub BigQuery Subscription

6,114 views

PracticalGCP

A day ago

Google Cloud has recently released a very neat feature to stream data directly from PubSub into BigQuery without needing to manage any pipelines in the middle. I've had some time to explore it in detail, and today I would like to share what I've done, the areas where it can be very useful, but also some limitations that can restrict the use cases of this new feature. A minimal setup sketch follows the links below.
Further reading
- Slides: slides.com/ric...
- BigQuery Subscription: cloud.google.c...
- Avro Logical Types: fastavro.readt...
- Code repo: github.com/roc...
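
For readers who want to see the setup in code, below is a minimal sketch of creating a BigQuery subscription with the google-cloud-pubsub Python client (assuming a recent version that includes BigQueryConfig, roughly 2.13+); the project, topic, subscription and table names are placeholders, not values from the video.

```python
from google.cloud import pubsub_v1

# Placeholder identifiers -- replace with your own project, topic, dataset and table.
project_id = "my-project"
topic_id = "my-topic"
subscription_id = "my-topic-bq-sub"
table_id = "my-project.my_dataset.my_table"

subscriber = pubsub_v1.SubscriberClient()
topic_path = subscriber.topic_path(project_id, topic_id)
subscription_path = subscriber.subscription_path(project_id, subscription_id)

# BigQueryConfig tells PubSub to write messages straight into the table:
# use_topic_schema maps the topic's schema onto table columns, and
# write_metadata adds subscription_name, message_id, publish_time and attributes columns.
bigquery_config = pubsub_v1.types.BigQueryConfig(
    table=table_id,
    use_topic_schema=True,
    write_metadata=True,
)

with subscriber:
    subscription = subscriber.create_subscription(
        request={
            "name": subscription_path,
            "topic": topic_path,
            "bigquery_config": bigquery_config,
        }
    )
    print(f"Created BigQuery subscription: {subscription.name}")
```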

Comments: 19
@andredesouza7077
@andredesouza7077 A year ago
Thank you, this was awesome! I was struggling with data types, particularly around the date issue! I could not for the life of me understand why anyone would use the single data column to receive data; your perspective on this opened up a number of ideas.
@practicalgcp2780
@practicalgcp2780 A year ago
Thanks for the feedback, Andre. This is the purpose of this channel, to share things that give others ideas 💡 Glad you found it useful.
@DadoscomCaio
@DadoscomCaio 2 years ago
Pretty useful. Thanks for sharing your knowledge.
@srivathsaballa6958
@srivathsaballa6958 A year ago
Thanks for the video. Could you show an example with an optional column in the schema, where a message is published without that column and still lands in BigQuery?
@happymanu5
@happymanu5 A year ago
Thank you, this was useful. Can we also use the "Write to BigQuery" delivery type to delete or update table data in a BigQuery dataset?
@practicalgcp2780
@practicalgcp2780 A year ago
Hi Raunak, this feature is designed to ingest data into BigQuery, not to transform it. Typically it's not a good idea to transform data at ingestion time; instead, get the data into BigQuery first, then handle transformation with another tool such as DBT. Any process or logic added at ingestion time is likely to introduce a "magic" layer, and if anything goes wrong in that step it's very difficult to debug because the source data may no longer exist to validate what happened. If you only want to keep ingested data for a period of time, add a TTL to the BigQuery table at the partition level so data isn't kept longer than it should be; see cloud.google.com/bigquery/docs/managing-tables
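
As a concrete illustration of the partition-level TTL mentioned above, here is a rough sketch using the google-cloud-bigquery client; the table name and the 30-day retention period are assumptions made up for the example, and the table is assumed to be partitioned by day.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table; assumed to be day-partitioned (ingestion time or a date/timestamp column).
sql = """
ALTER TABLE `my-project.my_dataset.pubsub_events`
SET OPTIONS (
  -- Partitions older than 30 days are dropped automatically.
  partition_expiration_days = 30
)
"""
client.query(sql).result()  # wait for the DDL statement to finish
```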
@keisou9765
@keisou9765 2 years ago
Very informative. Do you know if there's any way to split a single message into multiple rows in BQ? The JSON in question has an array component, and I would like a separate row for each object in the array. Is this something that can only be handled by more traditional methods of propagating messages from Pub/Sub to BQ? Thanks.
@practicalgcp2780
@practicalgcp2780 2 years ago
I am glad you found it useful, Cong. I think this is what you are looking for, if I understood you correctly: cloud.google.com/bigquery/docs/reference/standard-sql/json-data#extract_arrays_from_json. Basically you can use this function to get an array of JSON values and then use UNNEST to unpack it into multiple rows.
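
To make that concrete, below is a rough sketch (the table and field names are invented for the example) that pulls a JSON array out of a single data column and unnests it into one row per element.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table where each PubSub message lands as a single JSON string column called `data`.
sql = """
SELECT
  JSON_VALUE(item, '$.id')     AS item_id,
  JSON_VALUE(item, '$.amount') AS amount
FROM
  `my-project.my_dataset.raw_messages`,
  -- JSON_EXTRACT_ARRAY returns an array of JSON-formatted strings; UNNEST turns it into rows.
  UNNEST(JSON_EXTRACT_ARRAY(data, '$.items')) AS item
"""
for row in client.query(sql).result():
    print(row.item_id, row.amount)
```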
@keisou9765
@keisou9765 2 years ago
@@practicalgcp2780 Thanks for your reply. Sorry, no, I meant to ask whether it is possible to get multiple BQ rows, through BQ subscriptions as described in the video, out of a single message containing an array of elements.
@practicalgcp2780
@practicalgcp2780 2 years ago
@@keisou9765 OK, I see. It depends on your input: the subscription supports the RECORD data type, which is nested, so you can use it like this: stackoverflow.com/questions/11764287/how-to-nest-records-in-an-avro-schema. That maps to the BigQuery RECORD type. I don't think you want them to be separate rows in BQ; storing them as a RECORD and unnesting later is a better way to do it, because BigQuery natively supports nested and repeated data in a single row.
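
For illustration, a minimal Avro schema with a repeated nested record might look like the snippet below; with a topic schema along these lines, the items array maps to a repeated RECORD column in BigQuery. All names here are made up for the example.

```python
import json

# Illustrative Avro schema: an order with a repeated nested record of line items.
avro_schema = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "order_id", "type": "string"},
        {
            "name": "items",  # becomes a repeated RECORD column in BigQuery
            "type": {
                "type": "array",
                "items": {
                    "type": "record",
                    "name": "LineItem",
                    "fields": [
                        {"name": "sku", "type": "string"},
                        {"name": "quantity", "type": "int"},
                    ],
                },
            },
        },
    ],
}

print(json.dumps(avro_schema, indent=2))
```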
@AlexanderBelozerov
@AlexanderBelozerov A year ago
Hey, thanks for the video. Are you aware of any delay when using BigQuery subscriptions? I'm noticing a few minutes of delay for some messages and can't find any details about it online.
@practicalgcp2780
@practicalgcp2780 A year ago
Not that I am aware of. There may be a bit of a delay when you first set it up, but I don't expect it to be continuously like that unless there are issues with Google services (behind the scenes it might be using Cloud Dataflow). If you notice unexpected delays, I would note down the message timestamps (you can see these if you export all columns into BigQuery) and raise it with Google Cloud support with screenshots etc.; they can debug it for you.
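
If the subscription was created with write_metadata enabled, the publish_time column PubSub writes alongside each row can help spot late arrivals; below is a rough sketch of such a check, with an assumed table name.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Assumes write_metadata=True on the subscription, which adds a publish_time column to the table.
sql = """
SELECT
  message_id,
  publish_time,
  -- Age of each recent message relative to now; a stale maximum publish_time
  -- suggests delivery into BigQuery is lagging.
  TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), publish_time, SECOND) AS age_seconds
FROM `my-project.my_dataset.pubsub_events`
ORDER BY publish_time DESC
LIMIT 20
"""
for row in client.query(sql).result():
    print(row.message_id, row.publish_time, row.age_seconds)
```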
@AlexanderBelozerov
@AlexanderBelozerov A year ago
@@practicalgcp2780 thanks for the advice
@arshad.mp4
@arshad.mp4 2 years ago
Very informative video. Hey, we want to do the same thing, but the problem is we have a timestamp field. You said it needs to be converted, but how can we do that? That part is confusing to me; can you please explain how to convert the timestamp field?
@practicalgcp2780
@practicalgcp2780 2 years ago
I am glad you found it useful ;P You can convert it in two ways. 1) Convert it to an integer or long as a Unix timestamp (an integer count of seconds since the epoch) during your ingestion process, in other words in the system that sends the message to PubSub. 2) Leave it as a string, and make sure your PubSub schema and the BigQuery schema both use the STRING type so the data types match; then, during the data modelling process, convert it using the BigQuery function cloud.google.com/bigquery/docs/reference/standard-sql/timestamp_functions#timestamp. Hope that makes sense?
@practicalgcp2780
@practicalgcp2780 2 years ago
And I would say I prefer the second way, because it applies the least amount of transformation, which means less logic, and if something goes wrong you know this conversion process isn't causing it. Also, you don't always have the option to convert it in the source system.
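
A quick sketch of that second approach: keep the field as STRING end to end and convert it at modelling time. The table, column names and the ISO-8601 timestamp format are assumptions made for the example.

```python
from google.cloud import bigquery

client = bigquery.Client()

# The PubSub topic schema and the BigQuery table both declare event_time as STRING;
# the conversion to TIMESTAMP happens later, in a view or a DBT model like this one.
sql = """
SELECT
  user_id,
  -- Assumes the producer writes ISO-8601 strings such as '2024-01-31T10:15:00Z'.
  TIMESTAMP(event_time) AS event_ts
FROM `my-project.my_dataset.raw_events`
"""
for row in client.query(sql).result():
    print(row.user_id, row.event_ts)
```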
@1itech
@1itech A year ago
Bro, please explain how we can send CSV files via Dataflow/PubSub and store them in GCS using Python code.
@ridhoheranof3413
@ridhoheranof3413 A year ago
I've seen your video, but I have a problem streaming data from GCS to BigQuery. I'm trying to use a PubSub subscription, but I'm a bit confused: where should I run the Python producer file?
@practicalgcp2780
@practicalgcp2780 A year ago
From GCS to BigQuery? Is this a question specific to this video? If you want to load files from GCS to BQ, a much easier way is to do a BQ load; have a look at the docs here: cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv. This video is about getting data from PubSub to BigQuery, where the data must already exist in PubSub. The producer Python file is just there to send data to PubSub; GCS is not involved in this design.
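
To make the BQ load suggestion concrete, here is a minimal sketch using the google-cloud-bigquery client; the bucket, file and table names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder GCS URI and destination table.
uri = "gs://my-bucket/path/to/file.csv"
table_id = "my-project.my_dataset.csv_import"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the header row
    autodetect=True,       # infer the schema from the file
)

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the load job to complete

table = client.get_table(table_id)
print(f"Loaded {table.num_rows} rows into {table_id}")
```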