Run Cloud Composer Locally
16:37
DBT Core on Cloud Run Job
39:26
3 months ago
Cloud PubSub Multi-Team Design
20:55
Cloud Run with IAP
23:10
A year ago
The 2022 Wrap-up
40:00
A year ago
Run DBT jobs with Cloud Batch
21:49
Cloud Datastore TTL
27:06
A year ago
Comments
@AghaOwais
@AghaOwais 9 days ago
Hi, my DBT code is successfully deployed to Google Cloud Run. I am using DBT Core, not DBT Cloud. The only issue is that when I hit the URL, "Not Found" shows. I have identified the issue: when the code runs it keeps looking for dbt_cloud.yml, but how can that be needed when I am only using DBT Core? Please sort this out. Thanks
@ItsMe-mh5ib
@ItsMe-mh5ib 9 days ago
What happens if your source query combines multiple tables?
@practicalgcp2780
@practicalgcp2780 9 days ago
The short answer is most of these won't work, at least during the public preview. You can use certain subqueries if they don't have keywords like EXISTS or NOT EXISTS. JOIN won't work either. See the list of limitations here: cloud.google.com/bigquery/docs/continuous-queries-introduction#limitations This makes sense because it's a storage-layer feature, so it is very hard to implement things like listening to the append logs of two separate tables and somehow putting them together. I would suggest focusing on reverse ETL use cases, which is what it's mostly useful for, for the time being.
@ItsMe-mh5ib
@ItsMe-mh5ib 9 days ago
@practicalgcp2780 thank you
@nickorlove-dev
@nickorlove-dev 9 days ago
LOVE LOVE LOVE the passion from our Google Developer Expert program, and Richard for going above and beyond to create this video! It's super exciting to see the enthusiasm being generated around BigQuery continuous queries! Quick feedback regarding the concerns/recommendations highlighted in the video:
- All feedback is welcome and valid, so THANK YOU! Seriously!
- The observed query concurrency limit of 1 query max for 50 slots and 3 queries max for 100 slots is an identified bug. We're in the process of fixing this, which will raise the limit and allow BigQuery to dynamically adjust the number of concurrent continuous queries being submitted based on the available CONTINUOUS reservation assignment resources.
- Continuous queries is currently in public preview, which simply means we aren't done with feature development yet. There are some really exciting items on our roadmap, which I cannot comment on in such a public forum, but concerns over cost efficiency, monitoring, administration, etc. are at the VERY TOP of that list.
@practicalgcp2780
@practicalgcp2780 9 days ago
Amazing ❤ thanks for the kind words and also the clarification on the concurrency bug; can't wait to see it lifted so we can try it at scale!
@mohammedsafiahmed1639
@mohammedsafiahmed1639 10 days ago
So this is like CDC for BQ tables?
@practicalgcp2780
@practicalgcp2780 10 days ago
Yes, pretty much, via SQL, but more the reverse of CDC (reverse ETL in streaming mode if you prefer to call it that).
@mohammedsafiahmed1639
@mohammedsafiahmed1639 10 days ago
Thanks! Good to see you back
@practicalgcp2780
@practicalgcp2780 10 days ago
😊 was on holiday for a couple of weeks
@SwapperTheFirst
@SwapperTheFirst 16 days ago
Thanks. It is trivial to connect from local VS Code; it's just a small gcloud TCP-tunnel command and that's it. Though you're right that the web browser experience is surprisingly good.
@practicalgcp2780
@practicalgcp2780 16 days ago
Yup, I think a lot of the time when you have a remote workforce it is easier to keep everything together so you can install plugins as well. That tunnel approach can work, but a lot of the time it's just additional risk to manage and IT issues to resolve when something doesn't work.
@user-wf5er3eo8v
@user-wf5er3eo8v 22 days ago
This is good content. I have a question. I have a use case where I have data with the columns "customers_reviews", "Country", "Year" and "sentiment". I am trying to create a chatbot that can answer queries like: "Negative comments related to xyz issue from USA from year 2023." For this I need to filter the data for USA and year 2023, with embeddings for the xyz issue searched from the database. Which database would be suitable for this: BigQuery, Cloud SQL or AlloyDB? All of these have vector search capabilities, but I need the most suitable and easiest to understand. Thanks
@practicalgcp2780
@practicalgcp2780 14 days ago
One important thing to understand is the difference between a database suitable for highly concurrent traffic (b2c or consumer traffic) vs b2b (internal or external business with a small number of users). BigQuery can be suitable for b2b when the number of users hitting it at the same time at peak is low. For b2c traffic you never want to use BigQuery, because it's not designed for that. There are three databases on GCP that can be suitable for b2c traffic, and all of them support highly concurrent workloads: Cloud SQL, AlloyDB, and Vertex AI Feature Store vector search if you want serverless. You can use any of the three, whichever you are more comfortable with; Vertex AI Feature Store can be quite convenient if your data is in BigQuery. A video I created recently might give you some good ideas on how to do this: kzbin.info/www/bejne/h3q9qKp5oqqbsKs
@adeolamorren2678
@adeolamorren2678 29 days ago
One separate question: since it's a serverless environment, if we have dependencies, should we add the dbt deps command to the Dockerfile args, or to the runtime override args?
@practicalgcp2780
@practicalgcp2780 28 days ago
No, I don't think that is the right way to do it. In serverless environments you can still package up dependencies, and this is something you typically need to do at build time, not run time, i.e. while you are packaging the container in your CI pipeline. DBT can generate a lock file which can be used to ensure packaging consistency across versions, so you don't end up with different versions each time you run the build. See docs.getdbt.com/reference/commands/deps The other reason you don't want to do that at run time is that it could be very slow to install dependencies on each run, because it requires downloading them; plus you may not want internet access in a production environment (to be more secure in some setups), so doing this at build time makes a lot more sense.
@adeolamorren2678
@adeolamorren2678 29 days ago
With this approach, is it possible to add environment variables that are isolated for each run? I basically want to pass environment variables per run when I invoke the Cloud Run job.
@practicalgcp2780
@practicalgcp2780 29 days ago
Environment variables are typically not designed for manipulating runtime values on each run; they are usually set per environment and stick to each deployment, not each run. But it looks like both options are possible, and I would stick to passing command-line arguments, because they are more appropriate to override than environment variables. See this article on how to do it, it's explained well: chrlschn.medium.com/programmatically-invoke-cloud-run-jobs-with-runtime-overrides-4de96cbd158c
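For reference, a minimal sketch of what a per-execution override could look like with the google-cloud-run Python client (run_v2); the project, job name, dbt arguments and environment variable below are illustrative assumptions, not taken from the video:

```python
from google.cloud import run_v2

client = run_v2.JobsClient()

# Override the container args (and optionally env vars) for this execution only;
# the deployed job definition itself is left untouched.
request = run_v2.RunJobRequest(
    name="projects/my-project/locations/europe-west2/jobs/dbt-job",
    overrides=run_v2.RunJobRequest.Overrides(
        container_overrides=[
            run_v2.RunJobRequest.Overrides.ContainerOverride(
                args=["run", "--select", "my_model"],
                env=[run_v2.EnvVar(name="DBT_TARGET", value="prod")],
            )
        ],
    ),
)

operation = client.run_job(request=request)
execution = operation.result()  # Blocks until the execution finishes.
print(execution.name)
```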
@aniket-kulkarni
@aniket-kulkarni A month ago
After researching this topic so much, finally a video that explains it clearly, especially the motivations and the problem we are solving with PSC.
@practicalgcp2780
@practicalgcp2780 A month ago
Comments like this are what keep me going, mate ❤ thanks for the feedback
@ritwikverma2463
@ritwikverma2463 A month ago
Thank you Richard for the great GCP tutorials; please keep making this GCP video series.
@practicalgcp2780
@practicalgcp2780 A month ago
Thanks, will do! Glad you liked these.
@42svb58
@42svb58 A month ago
Thank you for posting these videos!
@practicalgcp2780
@practicalgcp2780 A month ago
My pleasure!
@dollan1991
@dollan1991 A month ago
I can't find it now, but I remember reading a GCP Issue Tracker entry stating that the sync will always take 5+ minutes due to resources that need to be provisioned in the background
@practicalgcp2780
@practicalgcp2780 A month ago
I guess for a daily workload it's OK. These things don't typically need to be that up to date for most use cases. I do want to try that continuous mode though, which is more likely designed for real-time sync
@travisbot1414
@travisbot1414 A month ago
These videos are awesome; you should make courses that cover this content $$$$$$$$
@practicalgcp2780
@practicalgcp2780 A month ago
Haha, thanks. It's more important to share knowledge for free so more companies can adopt Google Cloud and make it work better, and hopefully it will become the no. 1 cloud provider 😎 Maybe one day in the future I will make a course.
@johnphillip9013
@johnphillip9013 A month ago
@practicalgcp2780 thank you so much
@AI0331
@AI0331 A month ago
This is really an amazing video, especially the troubleshooting part. Very clear 😊 Love it!!
@practicalgcp2780
@practicalgcp2780 A month ago
Glad it helped!
@yinliu5471
@yinliu5471 A month ago
I like this video; it is the most informative and practical video on the topic of IAP. Thanks for sharing
@practicalgcp2780
@practicalgcp2780 A month ago
Glad it was helpful!
@ritwikverma2463
@ritwikverma2463 2 months ago
Hi Richard, can we create a Dataproc Serverless job in a different GCP project using a service account?
@practicalgcp2780
@practicalgcp2780 2 months ago
I am not sure I understood you fully, but a service account can do anything in any project regardless of which project the service account was created in. The way it works is by granting the service account IAM permissions in the project where you want the job to be created; then it will work. But it may not be the best way to do it, as that one service account could end up with too much permission and scope. You can use separate service accounts, one per project, if you want to reduce scope, or have a master one that impersonates the other service accounts in those projects. Keep in mind it's key to reduce the scope of what each service account can do; otherwise, when there is a breach, the damage across everything can be massive.
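As a rough illustration of the impersonation option mentioned above, here is a minimal sketch assuming the google-auth and google-cloud-dataproc Python libraries; the project IDs, service account, region and bucket path are placeholders:

```python
import google.auth
from google.auth import impersonated_credentials
from google.cloud import dataproc_v1

# Impersonate a per-project service account instead of granting one SA broad
# access everywhere (all names below are illustrative).
source_creds, _ = google.auth.default()
target_creds = impersonated_credentials.Credentials(
    source_credentials=source_creds,
    target_principal="dataproc-runner@other-project.iam.gserviceaccount.com",
    target_scopes=["https://www.googleapis.com/auth/cloud-platform"],
)

# Regional endpoint for Dataproc Serverless batches in the target project.
client = dataproc_v1.BatchControllerClient(
    credentials=target_creds,
    client_options={"api_endpoint": "europe-west2-dataproc.googleapis.com:443"},
)

batch = dataproc_v1.Batch(
    pyspark_batch=dataproc_v1.PySparkBatch(
        main_python_file_uri="gs://my-bucket/jobs/job.py"
    )
)

operation = client.create_batch(
    parent="projects/other-project/locations/europe-west2",
    batch=batch,
    batch_id="example-batch-001",
)
result = operation.result()  # Waits for the serverless batch to complete.
print(result.state)
```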
@HARDselection
@HARDselection 2 months ago
As a member of a very small data team managing a complex orchestration workload, this is exactly what I was looking for. Thanks!
@practicalgcp2780
@practicalgcp2780 2 months ago
Glad it was helpful!
@nishantmiglani7021
@nishantmiglani7021 2 months ago
Thanks a lot, Richard He, for creating this insightful video on Analytics Hub.
@Iyanu-eb2eh
@Iyanu-eb2eh 2 months ago
How do you know if the SQL table is actually connected?
@practicalgcp2780
@practicalgcp2780 2 months ago
Sorry, it's been a while since I created this; if it works, it is connected, right? Am I missing something?
@agss
@agss 2 months ago
Thank you for the very insightful video! What is your take on using Dataform instead of DBT, when it comes to the capabilities of both tools and the ease of deploying and managing those solutions?
@practicalgcp2780
@practicalgcp2780 2 months ago
Thank you, and spot-on question; I was wondering who was going to ask this first 🙌 I am actually making a Dataform video in the background, but I don't want to publish it unless I am 100% sure I am saying something useful. Based on my current findings, you could use either, and depending on what you need both can be a good fit. Dataform is a lot easier to get up and running, but it's quite new and I won't recommend it for anything too critical at this stage. It's also missing some key features like templating using Jinja (I don't really like the JavaScript templating system: it's built on TypeScript, which nobody really uses for this, so you would be locked into something with no support, which in my view is quite dangerous). But it is a lot easier to get up and running natively in GCP. DBT is still the go-to choice in my view, because it is built in Python and has a strong open source community. For mission-critical data modelling work, I still think DBT is much better.
@agss
@agss 2 months ago
@practicalgcp2780 you brought up exactly what I was worrying about. I highly appreciate your insight!
@strmanlt
@strmanlt 2 days ago
Our team was debating migrating from dbt to Dataform. Dataform is actually a pretty decent tool, but the main issue for us was the 1000-node limit per repo. So maybe if you have very simple models that do not require a lot of nodes it would work fine, but for us long-term scalability was the deciding factor
@practicalgcp2780
@practicalgcp2780 2 days ago
@strmanlt thanks for the input on this! Can I ask what the 1000-node limit you are referring to is? Can you share the docs on this? Is it a 1000-node limit on the number of steps / SQL files you can write?
@Rising_Ballers
@Rising_Ballers 2 months ago
Hi Richard, love your content; I always wanted someone to do GCP training videos emphasizing real-world use cases. I work with BigQuery and Composer, and I wanted to learn Dataproc and Dataflow, but everywhere I see the same type of training, not much focused on real-world implementations. I want to learn how Dataproc and Dataflow jobs are deployed to different environments like dev, test and prod. Your videos are helping a lot; I hope you will do more videos on Dataflow and Dataproc, and on how we create these jobs using CI/CD in real projects.
@practicalgcp2780
@practicalgcp2780 2 months ago
No worries, glad you found this useful ❤
@Rising_Ballers
@Rising_Ballers 2 months ago
@practicalgcp2780 I have one doubt: in an organization, if we have many Dataproc jobs, how do we create them in different environments like dev, test and prod? Can you please do a video on that?
@ayoubelmaaradi7409
@ayoubelmaaradi7409 2 months ago
🤩🤩🤩🤩🤩
@ap2394
@ap2394 2 months ago
Thanks for the detailed video. Can we have scheduling at the task level? E.g. if I have 2 tasks in a downstream DAG and each depends on a different dataset, can I control the schedule at the task level?
@practicalgcp2780
@practicalgcp2780 5 days ago
Just realised I never replied to this one, my apologies. I am not sure that is the right way to think about how this works. Regardless of which task or which DAG it is, it's about listening to a change event from something triggered in the upstream dataset, then reacting to that event. As long as you design the DAGs so that triggering a DAG based on a change event is the right behaviour, it will work.
@viralsurani7944
@viralsurani7944 3 months ago
Getting the below error while running the pipeline with DirectRunner. Any idea? Transform node AppliedPTransform(Start Impulse FakePii/GenSequence/ProcessKeyedElements/GroupByKey/GroupByKey, _GroupByKeyOnly) was not replaced as expected.
@DExpertz
@DExpertz 3 months ago
I appreciate this video, sir 😍 (subscribed and liked); will share it with my team too.
@practicalgcp2780
@practicalgcp2780 3 months ago
Thanks so much for your support ❤
@DExpertz
@DExpertz 3 months ago
@practicalgcp2780 Of course man, thank you for sharing this information in a simpler way
@10xApe
@10xApe 3 months ago
Can Cloud Run be used as a Power BI data refresh gateway?
@practicalgcp2780
@practicalgcp2780 3 months ago
I haven't used Power BI, so I googled what a data refresh gateway is. According to learn.microsoft.com/en-us/power-bi/connect-data/refresh-scheduled-refresh it looks like a service that lets you control refreshes via a schedule? Unless there is some sort of API that allows you to trigger it from the Google Cloud ecosystem, I am not sure you can use it. I assume you are thinking of triggering a DBT job first and then refreshing the dashboard?
@adityab693
@adityab693 3 months ago
In my org, we have to get an exception, and the analyticshub.listing.subscribe role is not available. Also, data can be shared within the org VPC; what about sharing outside of the VPC?
@SamirSeth
@SamirSeth 3 months ago
Simply the best (and only) clear explanation of how this works. Thank you very much.
@practicalgcp2780
@practicalgcp2780 3 months ago
Glad it helped!
@QuynhNguyen-zy2rs
@QuynhNguyen-zy2rs 3 months ago
Hi, after you have created the data profile scan and data quality scan, is the Insights tab displayed? I don't see the Insights tab in your video. Please explain! Thanks!
@alifarah9
@alifarah9 4 months ago
Really appreciate these high-quality videos! Seriously, your videos are better than the official GCP videos. What makes them invaluable is that you teach from first principles and talk about problems that will be faced in any cloud environment, not just GCP.
@practicalgcp2780
@practicalgcp2780 4 months ago
Thank you so much 🙏 You are right, the principles are very much the same no matter which cloud provider it is. My focus is GCP because I believe that as an ecosystem it's much more powerful, yet remains the easiest to implement and scale compared to other cloud providers.
@anantvardhan1212
@anantvardhan1212 4 months ago
Amazing explanation! However, I have a doubt regarding the use of OAuth 2.0 creds in this whole setup. Does the OAuth client ID represent the backend service here, which is delegating authentication to IAP?
@practicalgcp2780
@practicalgcp2780 4 months ago
Thank you, and I don't think this was explained well in the video. I did some more reading, and one thing I noticed is that the docs on how to create the backend service for the load balancer have changed: cloud.google.com/iap/docs/enabling-cloud-run#enabling. As you can see at 15:08 in the video, it used to require the client_id and client_secret to create the backend and enable IAP, but that doesn't seem to be there anymore. The latest docs have a note saying "The ability to authenticate users with a Google-managed OAuth client is available in Preview." Technically, if it's in preview the docs should not have removed the other option, but if it is accurate it means that by default it will use the Google-managed OAuth client and creating the credentials manually is no longer required. I've not tested this yet, but I think it's worth trying to enable IAP without a custom credential. It makes sense, as creating it manually and then specifying it is a lot of faff, since you need to manage the secret rotation etc. yourself.
@practicalgcp2780
@practicalgcp2780 4 months ago
And my understanding of the way this works is: when a user comes in, the request carries the auth header; the load balancer backend intercepts it and uses IAP to verify whether the user has permission, which is defined in IAM with the user group. Because the IAP service account has been granted invoker access to the Cloud Run service, the user is granted access after passing IAP validation.
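If you also want the application behind IAP to double-check the caller, here is a minimal sketch (assuming the google-auth library) of verifying the signed header IAP attaches to each request; the audience string is a placeholder for your own project number and backend service ID:

```python
from google.auth.transport import requests
from google.oauth2 import id_token

# For a backend service behind an external HTTPS load balancer, the expected
# audience has the form "/projects/PROJECT_NUMBER/global/backendServices/SERVICE_ID".
EXPECTED_AUDIENCE = "/projects/123456789/global/backendServices/987654321"


def verify_iap_request(headers: dict) -> str:
    """Validate the JWT IAP adds to the request and return the caller's email."""
    iap_jwt = headers["x-goog-iap-jwt-assertion"]
    decoded = id_token.verify_token(
        iap_jwt,
        requests.Request(),
        audience=EXPECTED_AUDIENCE,
        certs_url="https://www.gstatic.com/iap/verify/public_key",
    )
    return decoded["email"]
```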
@harshchoudhary6069
@harshchoudhary6069 4 months ago
How can we share an authorized view using Analytics Hub?
@practicalgcp2780
@practicalgcp2780 4 months ago
It makes no difference using authorised views, as authorised view permissions are managed the same way as tables, unlike normal views. However, using authorised views has some trade-offs, a key one being losing metadata such as column descriptions, which isn't great for data consumers. But it does have advantages if you don't want to duplicate data models or increase latency.
@LongHD14
@LongHD14 4 months ago
May I ask one more question regarding this matter? I would like to implement permissions for a chat application concerning access to documents. For example, a person should only have access to certain tables or specific fields within those tables, and if they don't have permission, they wouldn't be able to search. Do you have any suggestions or keywords that might help with this? Thank you very much for your assistance
@practicalgcp2780
@practicalgcp2780 4 months ago
That is something you have to do through some sort of RBAC implementation (role-based access control). It isn't really about the search; it's more about mapping out the role of a user through login, like most applications do today. Then, depending on the role, you can add specific filters to the search queries, such as filtering on certain metadata, or restrict the set of tables based on roles, etc.
@LongHD14
@LongHD14 4 months ago
Sure, I understand that. However, I'm looking for a service that can assist me with implementing RBAC.
@practicalgcp2780
@practicalgcp2780 4 months ago
OK, I see. I think it really depends on what you are using. For example, if you are building a backend with Python, you can use Django, which has an RBAC module; generally, any framework will have some sort of RBAC component you can use. If it's an internal app (for use within the company) you can simplify things by just using IAP, but IAP isn't suitable for external consumer applications.
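For a rough idea of what that looks like with Django's built-in permission system, a minimal sketch; the app label, permission codename and view name are illustrative assumptions:

```python
from django.contrib.auth.decorators import login_required, permission_required
from django.http import JsonResponse


@login_required
@permission_required("reports.view_financial_report", raise_exception=True)
def financial_report(request):
    # Only users (or groups) granted reports.view_financial_report reach this
    # point; everyone else gets a 403 instead of the data.
    return JsonResponse({"rows": []})
```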
@LongHD14
@LongHD14 4 months ago
@practicalgcp2780 thank you for your answer!
@kavirajansakthivel
@kavirajansakthivel 4 months ago
Hello Richard, it was a wonderful video, but somehow I couldn't set up the TCP proxy. How did you do it: through the reverse proxy method or the auth proxy method? You seem to be the only person who has done this successfully so far. Could you please create a tutorial video for it?
@practicalgcp2780
@practicalgcp2780 4 months ago
Hi there, it's been a while since I last did it, and it's going to be quite difficult to understand what your problems are, as it's quite a complex setup. If I remember correctly, this is the documentation I followed: cloud.google.com/datastream/docs/private-connectivity#reverse-csql-proxy. Make sure you follow it step by step, and especially don't forget to open the required firewall rules, as that is a common cause of issues.
@LongHD14
@LongHD14 4 months ago
Wow, this video is incredibly insightful and informative! 👏 I've learned so much and am grateful that you've shared this valuable content with us. Just a quick question: could I apply these concepts to create a conversation app that details the findings in the search results? Looking forward to your guidance on this.
@practicalgcp2780
@practicalgcp2780 4 months ago
Glad you found it useful! I don't see why not, but as I mentioned, for a conversational app (I assume you mean an app facing real consumers) the concept is exactly the same, but you need to change the vector DB to something that supports highly concurrent workloads. So BigQuery is out of the picture; you can look at Vertex AI Vector Search and also AlloyDB, which I am hearing a lot about lately. I haven't tried either yet, but as far as I know they are both valid approaches for consumer apps with highly concurrent workloads. The docs for AlloyDB are here: cloud.google.com/alloydb/docs/ai/work-with-embeddings
@LongHD14
@LongHD14 4 months ago
Thank you for your valuable insights and guidance!
@practicalgcp2780
@practicalgcp2780 4 months ago
You are welcome ;)
@kamalmuradov6731
@kamalmuradov6731 4 months ago
I implemented a similar solution using Cloud Workflows (CW) + Cloud Functions (CF). The CW runs a loop and makes N requests to the CF in parallel each iteration, where N is equal to the CF's max instances. I'll look into querying Stackdriver each loop to dynamically determine concurrency. I chose CW over Cloud Scheduler (CS) for a few reasons. First, CS is limited to at most 1 run per minute, which wasn't fast enough to keep my workers busy (they process a batch in under 30 seconds). Second, CS can't make N requests in parallel, so it would require something in between to replicate what the CW is doing. Third, CW has a configurable retry policy, which is handy for dealing with occasional CF network issues. One caveat with CW is that a single execution is limited to 100k steps. To work around this, I limit each CW execution to 10k loops, at the end of which it triggers a new workflow execution and exits. I set up an alerting policy to ensure there is always exactly 1 execution of this workflow running and haven't had any issues.
@practicalgcp2780
@practicalgcp2780 4 months ago
Hmm, interesting approach, although I am not sure we are comparing apples to apples here. The solution demonstrated in this video is an always-on approach; in other words, the pull subscriber is always listening on the Pub/Sub subscription. It doesn't die after processing all remaining messages, it simply waits. So if you change the Cloud Scheduler interval to 10 minutes and let the pull subscriber run for 9 minutes 50 seconds, for example, it will not get killed until it reaches that timeout (which is in the code example I gave). I am not sure if I misunderstood you here, but the solution is no different from what you would normally do with a GKE deployment; it's just an alternative without needing any infrastructure.
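For readers following along, a minimal sketch of the "always-on until timeout" pull subscriber described above, assuming the google-cloud-pubsub client library; the project, subscription name and 590-second window are illustrative, not taken from the video:

```python
from concurrent.futures import TimeoutError
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")


def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    # Process the payload, then ack so Pub/Sub does not redeliver it.
    print(f"Received {message.data!r}")
    message.ack()


# Streaming pull: stays connected and simply waits for messages until the timeout.
streaming_pull = subscriber.subscribe(subscription_path, callback=callback)

with subscriber:
    try:
        # Exit just before the next Cloud Scheduler trigger (e.g. a 10-minute interval).
        streaming_pull.result(timeout=590)
    except TimeoutError:
        streaming_pull.cancel()
        streaming_pull.result()  # Block until the shutdown completes cleanly.
```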
@kamalmuradov6731
@kamalmuradov6731 4 months ago
That sounds correct! In my case the CF does a "synchronous pull" of a few thousand messages, processes them, and acks them all in bulk, so it's not an always-on streaming setup like what you demoed here. It handles one batch per request, shuts down, and is then invoked again in the next loop by the CW. For this particular use case batching is advantageous, so I went with synchronous pull, but it would be straightforward to switch the CF to a streaming pull if batching were not necessary.
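For contrast, a minimal sketch of the synchronous-pull variant described in this comment, again assuming the google-cloud-pubsub client; the names and batch size are illustrative:

```python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

with subscriber:
    # Grab one bounded batch, process it, then ack everything in bulk and exit.
    response = subscriber.pull(
        request={"subscription": subscription_path, "max_messages": 1000},
        timeout=30,
    )

    ack_ids = []
    for received in response.received_messages:
        # Process the payload here before acknowledging it.
        print(f"Processing {received.message.data!r}")
        ack_ids.append(received.ack_id)

    if ack_ids:
        subscriber.acknowledge(
            request={"subscription": subscription_path, "ack_ids": ack_ids}
        )
```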
@stevenchang4784
@stevenchang4784 4 months ago
VS Code here is a browser-based environment, but a lot of organisations restrict Cloud Shell and SSH connections. Do you think Cloud Workstations could get around those restrictions?
@practicalgcp2780
@practicalgcp2780 4 months ago
Hi there, I haven't used this at scale yet, but my understanding is that one of the most important reasons for having Cloud Workstations is to get around these restrictions. The common reason Cloud Shell does not work in many orgs is the inability to support private IPs and VPC SC, but this isn't the case for workstations, as these are deployed within your network. Check it out; it's documented here: cloud.google.com/workstations
@stevenchang4784
@stevenchang4784 4 months ago
@practicalgcp2780 Hi, I tested it all day. Thank you for your reply. It really solves the Cloud Shell public IP issue.
@digimonsta
@digimonsta 4 months ago
Really interesting and informative. I'm currently looking at migrating away from a GKE workload purely because of the complexity, so this may prove useful. I'd be interested to know if you feel Cloud Run Jobs would support my use case. Essentially, based on a Pub/Sub message, I need to pull down a bunch of files from a GCS bucket, zip them into a single archive, and then push the resulting archive back into GCS. This zip file is then presented to the user for download. There could be many thousands of files to zip, and the resulting archive could be several terabytes in size. I was planning on hooking up Filestore or GCS FUSE to Cloud Run to facilitate this. The original implementation was in Cloud Run (prior to Jobs), but at the time no one knew how many files users would need to download or how big the resulting zip files would be. We had to move over to GKE, as we hit the maximum time limit allowed for Cloud Run before it was automatically terminated.
@practicalgcp2780
@practicalgcp2780 4 months ago
Thanks for the kind comment. It's quite an interesting problem, because the archive can potentially be huge, so it can take a long time. You are right: the Cloud Run service, I think even today, can only handle up to a 1-hour timeout, while a Cloud Run job can now handle 24 hours. So if your archive process won't take longer than a day, I don't see why you can't use this approach. If you need more time you can look at Cloud Batch, which can run longer without needing to create a cluster, but it's more complex to track the state of the operation; I have another video describing use cases for Batch. Having said that, it feels a bit wrong to have archives of such huge size. Have you considered generating the Pub/Sub messages from upstream systems in smaller chunks, or using the Cloud Run service to break things down and only zip so many files in a single execution, tracking the offset somewhere (e.g. in a separate Pub/Sub topic) to trigger more short-lived zip operations? The thing is, if there's a network glitch, which happens every now and then, you could waste a huge amount of compute. Personally, I would always prefer to make the logic slightly more complex in code than maintain a GKE cluster myself, just to keep the infrastructure as simple as possible, but that is just my opinion.
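A minimal sketch of that chunked approach, assuming the google-cloud-storage client: zip one bounded batch of GCS objects per Cloud Run Job execution instead of one giant archive. The bucket names, prefixes and batch size are illustrative placeholders:

```python
import os
import zipfile
from google.cloud import storage

BATCH_SIZE = 500  # Objects per execution; tune to fit disk and timeout limits.


def zip_one_batch(bucket_name: str, prefix: str, batch_index: int) -> str:
    client = storage.Client()
    bucket = client.bucket(bucket_name)

    # Deterministically pick this execution's slice of the object listing.
    blobs = list(client.list_blobs(bucket_name, prefix=prefix))
    batch = blobs[batch_index * BATCH_SIZE:(batch_index + 1) * BATCH_SIZE]

    archive_path = f"/tmp/archive-{batch_index:05d}.zip"
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for blob in batch:
            local = os.path.join("/tmp", os.path.basename(blob.name))
            blob.download_to_filename(local)
            archive.write(local, arcname=blob.name)
            os.remove(local)  # Keep local disk usage bounded.

    # Push the partial archive back to GCS; a final step (or the client) can
    # list or combine the parts instead of shipping one multi-terabyte zip.
    destination = f"archives/{prefix}/part-{batch_index:05d}.zip"
    bucket.blob(destination).upload_from_filename(archive_path)
    return destination
```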
@user-dl5mm9fu9g
@user-dl5mm9fu9g 5 months ago
The introduction is very detailed and very good. Good job, buddy...
@practicalgcp2780
@practicalgcp2780 5 months ago
Thank you! It's great you found it useful
@eyehear10
@eyehear10 5 months ago
This seems like it adds more complexity compared with the push model
@practicalgcp2780
@practicalgcp2780 5 months ago
Yes it does, but not everything upstream supports the push model, plus not every downstream can handle the load via the push model. I explained some of the pros and cons, mainly related to controlling or improving throughput (i.e. limiting how much traffic you want to consume if there is too much, or using batching). A really important thing to consider is the downstream system and how many connections you establish, or how many concurrent requests you make if, say, the downstream system is an HTTPS endpoint. Opening too many requests can easily overwhelm the system on the other side, whereas batching requests, or opening a single connection and reusing it, makes a huge difference. If it's possible to use push without the constraints above, it's almost always better to use push. Hope that makes sense.
@SwapperTheFirst
@SwapperTheFirst 5 months ago
Any examples of such tools for cataloging, certification and lineage? Especially OSS? I had some experience with Qlik Catalog, but I'm not sure if it is a good choice for GCP or how well it integrates with BQ. Beyond the usual suspects (Collibra, Immuta, ...)
@practicalgcp2780
@practicalgcp2780 5 months ago
There are a few GCP partners with very good GCP integration that can save engineers a lot of time on metadata integration. Collibra is one of them, as you already mentioned; you can also look at Atlan, a newer player in the field with some powerful features too. Those are the two I am aware of that, in my view, have pretty good integration and features, but please do your own research; there are pros and cons, and these are not recommendations I am making here. By OSS, do you mean support systems like JSM?
@SwapperTheFirst
@SwapperTheFirst 5 months ago
@practicalgcp2780 nope, I mean open source software, like Apache Airflow for workflow management, from which managed solutions like Astronomer or Cloud Composer are also built. I think something should exist in this space too?
@SwapperTheFirst
@SwapperTheFirst 5 months ago
I like this format of battle stories/coaching.
@practicalgcp2780
@practicalgcp2780 5 months ago
Thanks ☺️ I thought I'd try a different way to present; it feels like more people can relate to this
@WiktorJurek
@WiktorJurek 5 months ago
This is bang on. It would be awesome to see how this works in practice - as in, how all of this looks in the console, how to set it up, and practically how you can oversee/manage this kind of setup.
@practicalgcp2780
@practicalgcp2780 5 months ago
There's quite a lot of effort involved, but the foundation isn't that difficult to set up. It's not like there is a single UI where everything can be done, though. I think the entry point for data management and discovery for a large group of users can be the catalog tool, and a platform team can own the tooling for things like quality scans and Analytics Hub while making them self-service. There are things, especially the data quality check rules, that I would prefer to keep in version control so it's much easier to control the changes and the quality of the checks, whereas for other things like Analytics Hub the UI should be sufficient, as long as there is a way to recover if something goes wrong.
@alexanderpotts8425
@alexanderpotts8425 5 months ago
Knocking it out of the park as usual. I'm trying to get adoption of some of these in my team already!
@practicalgcp2780
@practicalgcp2780 5 months ago
Amazing to see you find it useful. I believe a lot of the things I covered are what we are doing every day already; I was trying to put everything together in a more structured way, hopefully to help a wider crowd adopt these technologies and methods.
@JohnMcclaned
@JohnMcclaned 5 months ago
Would love to see a video about how to get changes from AlloyDB into an ordered Pub/Sub topic
@practicalgcp2780
@practicalgcp2780 5 months ago
I thought about the challenges of event-based data consumption coming from message queues, but decided not to cover them in this video. Event-based data consumption in real time has very different challenges, and I don't believe it's the same pain we get with data stored in analytical databases. Sure, managing those is important, but in my experience event-based applications are very bespoke, already have clear data contracts as they are mission critical, and are mostly built and well maintained by data engineering teams. Unfortunately, the same cannot be said for data being consumed in analytical databases. With AlloyDB I assume you are working on more bespoke use cases, as it's not typically used to store all data permanently for a large group of teams to consume.
@JohnMcclaned
@JohnMcclaned 5 months ago
@practicalgcp2780 I am building an event-sourced event store and I need a way to have ordered changes propagated out. I am defaulting to 1-second interval polling, though I am exploring other solutions.
@user-yz6pz3yn6w
@user-yz6pz3yn6w 5 months ago
Hi Richard, thanks so much for the video, it was really helpful! I'm a junior data scientist working on a customer service chatbot using Dialogflow CX and some webhooks for the RAG (product recommendations, stock, price...). My original vector store was Vertex AI Vector Search; since it's really expensive I'm looking at other options like BQ vector search, but you mentioned that consumer traffic is an issue there. Do you have any vector store you can recommend? I'm trying to keep my solution inside GCP. Thank you again 🙌
@practicalgcp2780
@practicalgcp2780 5 months ago
Sorry, I just realised I forgot to reply. I was actually going to look at Vertex AI Vector Search next, as it is one of the good options for vector search. Can you provide some metrics on what you would consider expensive? There are other options like Postgres and Elasticsearch; you can see what you can use from the list of langchain integrations, which covers most of them: python.langchain.com/docs/integrations/vectorstores. If you want to control cost better, you can consider services that aren't priced on volume but on the cost of the storage engine, although you then have trade-offs around managing infrastructure and optimising for performance. So it's more trial and error than a straight answer.
@shaboxi129
@shaboxi129 5 months ago
@practicalgcp2780 I'd like to hear more about Vertex Vector Search since I can't find good deep dives on it :(
@user-yz6pz3yn6w
@user-yz6pz3yn6w 5 months ago
The endpoint machine I'm using is quite large (16 vCPUs, 64 GiB). While I'm only working with half the intended volume of products at the moment, the cost is still around $1k per month. I suspect there might be a default parameter set during creation that prevents me from selecting a smaller machine type. I read the integrations and I think my next step is AlloyDB; the good thing about Vertex AI Vector Search is that it is really fast, so I need to see the difference between the two. I'm looking forward to your videos, since I haven't seen much content about Vertex AI Vector Search. Thank you for the reply 🙌
@MaxNRG
@MaxNRG 6 months ago
Thank you sir, you saved my day!
@practicalgcp2780
@practicalgcp2780 6 months ago
Glad I could help!