Run Cloud Composer Locally
16:37
DBT Core on Cloud Run Job
39:26
3 months ago
Cloud PubSub Multi-Team Design
20:55
Cloud Run with IAP
23:10
A year ago
The 2022 Wrap-up
40:00
A year ago
Run DBT jobs with Cloud Batch
21:49
Cloud Datastore TTL
27:06
A year ago
Comments
@AghaOwais
@AghaOwais 9 days ago
Hi, my DBT code is successfully deployed to Google Cloud Run. I am using DBT Core, not DBT Cloud. The only issue is that when I hit the URL, "Not Found" shows. I have identified the issue: when the code runs it keeps looking for dbt_cloud.yml, but how can that be needed when I am only using DBT Core? Please sort this out. Thanks
@ItsMe-mh5ib
@ItsMe-mh5ib 9 days ago
What happens if your source query combines multiple tables?
@practicalgcp2780
@practicalgcp2780 9 days ago
The short answer is most of these won't work, at least during the public preview. You can use certain subqueries if they don't have keywords like EXISTS or NOT EXISTS. JOIN won't work either. See the list of limitations here: cloud.google.com/bigquery/docs/continuous-queries-introduction#limitations This makes sense because it's a storage-layer feature, so it is very hard to implement things like listening to the append logs of two separate tables and somehow putting them together. I would suggest focusing on reverse ETL use cases, which is what it's mostly useful for, for the time being.
@ItsMe-mh5ib
@ItsMe-mh5ib 9 days ago
@practicalgcp2780 thank you
@nickorlove-dev
@nickorlove-dev 9 days ago
LOVE LOVE LOVE the passion from our Google Developer Expert program, and Richard for going above and beyond to create this video! It's super exciting to see the enthusiasm being generated around BigQuery continuous queries! Quick feedback regarding the concerns/recommendations highlighted in the video:
- All feedback is welcome and valid, so THANK YOU! Seriously!
- The observed query concurrency limit of 1 query max for 50 slots and 3 queries max for 100 slots is an identified bug. We're in the process of fixing this, which will raise the limit and allow BigQuery to dynamically adjust the number of concurrent continuous queries being submitted based on the available CONTINUOUS reservation assignment resources.
- Continuous queries is currently in public preview, which simply means we aren't done with feature development yet. There are some really exciting items on our roadmap, which I cannot comment on in such a public forum, but concerns over cost efficiency, monitoring, administration, etc. are at the VERY TOP of that list.
@practicalgcp2780
@practicalgcp2780 9 days ago
Amazing ❤ thanks for the kind words and also the clarification on the concurrency bug; can't wait to see it lifted so we can try it at scale!
@mohammedsafiahmed1639
@mohammedsafiahmed1639 10 days ago
So this is like CDC for BQ tables?
@practicalgcp2780
@practicalgcp2780 10 days ago
Yes, pretty much, via SQL, but more the reverse of CDC (reverse ETL in streaming mode if you prefer to call it that).
@mohammedsafiahmed1639
@mohammedsafiahmed1639 10 days ago
Thanks! Good to see you back
@practicalgcp2780
@practicalgcp2780 10 days ago
😊 was on holiday for a couple of weeks
@SwapperTheFirst
@SwapperTheFirst 16 days ago
Thanks. It is trivial to connect from local VS Code; it's just a small gcloud TCP-tunnel command and that's it. Though you're right that the web browser experience is surprisingly good.
@practicalgcp2780
@practicalgcp2780 16 days ago
Yup, I think a lot of the time when you have a remote workforce it is easier to keep everything together so you can install plugins as well. That tunnel approach can work, but a lot of the time it's just additional risk to manage and IT issues to resolve when something doesn't work.
@user-wf5er3eo8v
@user-wf5er3eo8v 22 days ago
This is good content. I have a question. I have a use case where I have data with the columns "customers_reviews", "Country", "Year" and "sentiment". I am trying to create a chatbot that can answer queries like: "Negative comments related to xyz issue from USA from year 2023." For this I need to filter the data for USA and year 2023, with embeddings for the xyz issue searched from the database. Which database would be suitable for this: BigQuery, Cloud SQL or AlloyDB? All of these have vector search capabilities, but I need the most suitable and easiest to understand. Thanks
@practicalgcp2780
@practicalgcp2780 14 days ago
One important thing to understand is the difference between a database suitable for highly concurrent traffic (b2c or consumer traffic) vs b2b (internal or external business with a small number of users). BigQuery can be suitable for b2b when the number of users hitting it at the same time at peak is low. For b2c traffic you never want to use BigQuery, because it's not designed for that. There are three databases on GCP that can be suitable for b2c traffic, and all of them support highly concurrent workloads: Cloud SQL, AlloyDB, and Vertex AI Feature Store vector search if you want serverless. You can use any of the three, whichever you are more comfortable with; Vertex AI Feature Store can be quite convenient if your data is in BigQuery. A video I created recently might give you some good ideas on how to do this: kzbin.info/www/bejne/h3q9qKp5oqqbsKs
@adeolamorren2678
@adeolamorren2678 29 days ago
One separate question: since it's a serverless environment, if we have dependencies, should we add the dbt deps command to the Dockerfile args, or to the runtime override args?
@practicalgcp2780
@practicalgcp2780 28 days ago
No, I don't think that is the right way to do it. In serverless environments you can still package up dependencies, and this is something you typically need to do at build time, not run time, i.e. while you are packaging the container in your CI pipeline. DBT can generate a lock file which can be used to ensure packaging consistency across versions, so you don't end up with different versions each time you run the build. See docs.getdbt.com/reference/commands/deps The other reason you don't want to do that at run time is that it could be very slow to install dependencies on each run, because it requires downloading them; plus you may not want internet access in a production environment (to be more secure in some setups), so doing this at build time makes a lot more sense.
@adeolamorren2678
@adeolamorren2678 29 days ago
With this approach, is it possible to add environment variables that are isolated for each run? I basically want to pass environment variables per run when I invoke the Cloud Run job.
@practicalgcp2780
@practicalgcp2780 29 days ago
Environment variables are typically not designed for manipulating runtime values on each run; they are usually set per environment and stick to each deployment, not each run. But it looks like both options are possible, and I would stick to passing command-line arguments, because they are more appropriate to override than environment variables. See this article on how to do it, it's explained well: chrlschn.medium.com/programmatically-invoke-cloud-run-jobs-with-runtime-overrides-4de96cbd158c
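For reference, a minimal sketch of what a per-execution override could look like with the google-cloud-run Python client (run_v2); the project, job name, dbt arguments and environment variable below are illustrative assumptions, not taken from the video:

```python
from google.cloud import run_v2

client = run_v2.JobsClient()

# Override the container args (and optionally env vars) for this execution only;
# the deployed job definition itself is left untouched.
request = run_v2.RunJobRequest(
    name="projects/my-project/locations/europe-west2/jobs/dbt-job",
    overrides=run_v2.RunJobRequest.Overrides(
        container_overrides=[
            run_v2.RunJobRequest.Overrides.ContainerOverride(
                args=["run", "--select", "my_model"],
                env=[run_v2.EnvVar(name="DBT_TARGET", value="prod")],
            )
        ],
    ),
)

operation = client.run_job(request=request)
execution = operation.result()  # Blocks until the execution finishes.
print(execution.name)
```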
@aniket-kulkarni
@aniket-kulkarni A month ago
After researching this topic so much, finally a video that explains it clearly, especially the motivations and the problem we are solving with PSC.
@practicalgcp2780
@practicalgcp2780 A month ago
Comments like this are what keep me going, mate ❤ thanks for the feedback
@ritwikverma2463
@ritwikverma2463 A month ago
Thank you Richard for the great GCP tutorials; please keep making this GCP video series.
@practicalgcp2780
@practicalgcp2780 A month ago
Thanks, will do! Glad you liked these.
@42svb58
@42svb58 A month ago
Thank you for posting these videos!
@practicalgcp2780
@practicalgcp2780 A month ago
My pleasure!
@dollan1991
@dollan1991 A month ago
I can't find it now, but I remember reading a GCP Issue Tracker entry stating that the sync will always take 5+ minutes due to resources that need to be provisioned in the background
@practicalgcp2780
@practicalgcp2780 A month ago
I guess for a daily workload it's OK. These things don't typically need to be that up to date for most use cases. I do want to try that continuous mode though, which is more likely designed for real-time sync
@travisbot1414
@travisbot1414 A month ago
These videos are awesome; you should make courses that cover this content $$$$$$$$
@practicalgcp2780
@practicalgcp2780 A month ago
Haha, thanks. It's more important to share knowledge for free so more companies can adopt Google Cloud and make it work better, and hopefully it will become the no. 1 cloud provider 😎 Maybe one day in the future I will make a course.
@johnphillip9013
@johnphillip9013 A month ago
@practicalgcp2780 thank you so much
@AI0331
@AI0331 A month ago
This is really an amazing video, especially the troubleshooting part. Very clear 😊 Love it!!
@practicalgcp2780
@practicalgcp2780 A month ago
Glad it helped!
@yinliu5471
@yinliu5471 A month ago
I like this video; it is the most informative and practical video on the topic of IAP. Thanks for sharing
@practicalgcp2780
@practicalgcp2780 A month ago
Glad it was helpful!
@ritwikverma2463
@ritwikverma2463 2 months ago
Hi Richard, can we create a Dataproc Serverless job in a different GCP project using a service account?
@practicalgcp2780
@practicalgcp2780 2 months ago
I am not sure I understood you fully, but a service account can do anything in any project regardless of which project the service account was created in. The way it works is by granting the service account IAM permissions in the project where you want the job to be created; then it will work. But it may not be the best way to do it, as that one service account could end up with too much permission and scope. You can use separate service accounts, one per project, if you want to reduce scope, or have a master one that impersonates the other service accounts in those projects. Keep in mind it's key to reduce the scope of what each service account can do; otherwise, when there is a breach, the damage across everything can be massive.
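As a rough illustration of the impersonation option mentioned above, here is a minimal sketch assuming the google-auth and google-cloud-dataproc Python libraries; the project IDs, service account, region and bucket path are placeholders:

```python
import google.auth
from google.auth import impersonated_credentials
from google.cloud import dataproc_v1

# Impersonate a per-project service account instead of granting one SA broad
# access everywhere (all names below are illustrative).
source_creds, _ = google.auth.default()
target_creds = impersonated_credentials.Credentials(
    source_credentials=source_creds,
    target_principal="dataproc-runner@other-project.iam.gserviceaccount.com",
    target_scopes=["https://www.googleapis.com/auth/cloud-platform"],
)

# Regional endpoint for Dataproc Serverless batches in the target project.
client = dataproc_v1.BatchControllerClient(
    credentials=target_creds,
    client_options={"api_endpoint": "europe-west2-dataproc.googleapis.com:443"},
)

batch = dataproc_v1.Batch(
    pyspark_batch=dataproc_v1.PySparkBatch(
        main_python_file_uri="gs://my-bucket/jobs/job.py"
    )
)

operation = client.create_batch(
    parent="projects/other-project/locations/europe-west2",
    batch=batch,
    batch_id="example-batch-001",
)
result = operation.result()  # Waits for the serverless batch to complete.
print(result.state)
```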
@HARDselection
@HARDselection 2 months ago
As a member of a very small data team managing a complex orchestration workload, this is exactly what I was looking for. Thanks!
@practicalgcp2780
@practicalgcp2780 2 months ago
Glad it was helpful!
@nishantmiglani7021
@nishantmiglani7021 2 months ago
Thanks a lot, Richard He, for creating this insightful video on Analytics Hub.
@Iyanu-eb2eh
@Iyanu-eb2eh 2 months ago
How do you know if the SQL table is actually connected?
@practicalgcp2780
@practicalgcp2780 2 months ago
Sorry, it's been a while since I created this; if it works, it is connected, right? Am I missing something?
@agss
@agss 2 months ago
Thank you for the very insightful video! What is your take on using Dataform instead of DBT, when it comes to the capabilities of both tools and the ease of deploying and managing those solutions?
@practicalgcp2780
@practicalgcp2780 2 months ago
Thank you, and spot-on question; I was wondering who was going to ask this first 🙌 I am actually making a Dataform video in the background, but I don't want to publish it unless I am 100% sure I am saying something useful. Based on my current findings, you could use either, and depending on what you need both can be a good fit. Dataform is a lot easier to get up and running, but it's quite new and I won't recommend it for anything too critical at this stage. It's also missing some key features like templating using Jinja (I don't really like the JavaScript templating system: it's built on TypeScript, which nobody really uses for this, so you would be locked into something with no support, which in my view is quite dangerous). But it is a lot easier to get up and running natively in GCP. DBT is still the go-to choice in my view, because it is built in Python and has a strong open source community. For mission-critical data modelling work, I still think DBT is much better.
@agss
@agss 2 months ago
@practicalgcp2780 you brought up exactly what I was worrying about. I highly appreciate your insight!
@strmanlt
@strmanlt 2 days ago
Our team was debating migrating from dbt to Dataform. Dataform is actually a pretty decent tool, but the main issue for us was the 1000-node limit per repo. So maybe if you have very simple models that do not require a lot of nodes it would work fine, but for us long-term scalability was the deciding factor
@practicalgcp2780
@practicalgcp2780 2 days ago
@strmanlt thanks for the input on this! Can I ask what the 1000-node limit you are referring to is? Can you share the docs on this? Is it a 1000-node limit on the number of steps / SQL files you can write?
@Rising_Ballers
@Rising_Ballers 2 months ago
Hi Richard, love your content; I always wanted someone to do GCP training videos emphasizing real-world use cases. I work with BigQuery and Composer, and I wanted to learn Dataproc and Dataflow, but everywhere I see the same type of training, not much focused on real-world implementations. I want to learn how Dataproc and Dataflow jobs are deployed to different environments like dev, test and prod. Your videos are helping a lot; I hope you will do more videos on Dataflow and Dataproc, and on how we create these jobs using CI/CD in real projects.
@practicalgcp2780
@practicalgcp2780 2 months ago
No worries, glad you found this useful ❤
@Rising_Ballers
@Rising_Ballers 2 months ago
@practicalgcp2780 I have one doubt: in an organization, if we have many Dataproc jobs, how do we create them in different environments like dev, test and prod? Can you please do a video on that?
@ayoubelmaaradi7409
@ayoubelmaaradi7409 2 months ago
🤩🤩🤩🤩🤩
@ap2394
@ap2394 2 months ago
Thanks for the detailed video. Can we have scheduling at the task level? E.g. if I have 2 tasks in a downstream DAG and each depends on a different dataset, can I control the schedule at the task level?
@practicalgcp2780
@practicalgcp2780 5 days ago
Just realised I never replied to this one, my apologies. I am not sure that is the right way to think about how this works. Regardless of which task or which DAG it is, it's about listening to a change event from something triggered in the upstream dataset, then reacting to that event. As long as you design the DAGs so that triggering a DAG based on a change event is the right behaviour, it will work.
@viralsurani7944
@viralsurani7944 3 months ago
Getting the below error while running the pipeline with DirectRunner. Any idea? Transform node AppliedPTransform(Start Impulse FakePii/GenSequence/ProcessKeyedElements/GroupByKey/GroupByKey, _GroupByKeyOnly) was not replaced as expected.
@DExpertz
@DExpertz 3 months ago
I appreciate this video, sir 😍 (subscribed and liked); will share it with my team too.
@practicalgcp2780
@practicalgcp2780 3 months ago
Thanks so much for your support ❤
@DExpertz
@DExpertz 3 months ago
@practicalgcp2780 Of course man, thank you for sharing this information in a simpler way
@10xApe
@10xApe 3 months ago
Can Cloud Run be used as a Power BI data refresh gateway?
@practicalgcp2780
@practicalgcp2780 3 months ago
I haven't used Power BI, so I googled what a data refresh gateway is. According to learn.microsoft.com/en-us/power-bi/connect-data/refresh-scheduled-refresh it looks like a service that lets you control refreshes via a schedule? Unless there is some sort of API that allows you to trigger it from the Google Cloud ecosystem, I am not sure you can use it. I assume you are thinking of triggering a DBT job first and then refreshing the dashboard?
@adityab693
@adityab693 3 months ago
In my org, we have to get an exception, and the analyticshub.listing.subscribe role is not available. Also, data can be shared within the org VPC; what about sharing outside of the VPC?
@SamirSeth
@SamirSeth 3 months ago
Simply the best (and only) clear explanation of how this works. Thank you very much.
@practicalgcp2780
@practicalgcp2780 3 months ago
Glad it helped!
@QuynhNguyen-zy2rs
@QuynhNguyen-zy2rs 3 months ago
Hi, after you have created the data profile scan and data quality scan, is the Insights tab displayed? I don't see the Insights tab in your video. Please explain! Thanks!
@alifarah9
@alifarah9 4 months ago
Really appreciate these high-quality videos! Seriously, your videos are better than the official GCP videos. What makes them invaluable is that you teach from first principles and talk about problems that will be faced in any cloud environment, not just GCP.
@practicalgcp2780
@practicalgcp2780 4 months ago
Thank you so much 🙏 You are right, the principles are very much the same no matter which cloud provider it is. My focus is GCP because I believe that as an ecosystem it's much more powerful, yet remains the easiest to implement and scale compared to other cloud providers.
@anantvardhan1212
@anantvardhan1212 4 months ago
Amazing explanation! However, I have a doubt regarding the use of OAuth 2.0 creds in this whole setup. Does the OAuth client ID represent the backend service here, which is delegating authentication to IAP?
@practicalgcp2780
@practicalgcp2780 4 months ago
Thank you, and I don't think this was explained well in the video. I did some more reading, and one thing I noticed is that the docs on how to create the backend service for the load balancer have changed: cloud.google.com/iap/docs/enabling-cloud-run#enabling. As you can see at 15:08 in the video, it used to require the client_id and client_secret to create the backend and enable IAP, but that doesn't seem to be there anymore. The latest docs have a note saying "The ability to authenticate users with a Google-managed OAuth client is available in Preview." Technically, if it's in preview the docs should not have removed the other option, but if it is accurate it means that by default it will use the Google-managed OAuth client and creating the credentials manually is no longer required. I've not tested this yet, but I think it's worth trying to enable IAP without a custom credential. It makes sense, as creating it manually and then specifying it is a lot of faff, since you need to manage the secret rotation etc. yourself.
@practicalgcp2780
@practicalgcp2780 4 months ago
And my understanding of the way this works is: when a user comes in, the request carries the auth header; the load balancer backend intercepts it and uses IAP to verify whether the user has permission, which is defined in IAM with the user group. Because the IAP service account has been granted invoker access to the Cloud Run service, the user is granted access after passing IAP validation.
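If you also want the application behind IAP to double-check the caller, here is a minimal sketch (assuming the google-auth library) of verifying the signed header IAP attaches to each request; the audience string is a placeholder for your own project number and backend service ID:

```python
from google.auth.transport import requests
from google.oauth2 import id_token

# For a backend service behind an external HTTPS load balancer, the expected
# audience has the form "/projects/PROJECT_NUMBER/global/backendServices/SERVICE_ID".
EXPECTED_AUDIENCE = "/projects/123456789/global/backendServices/987654321"


def verify_iap_request(headers: dict) -> str:
    """Validate the JWT IAP adds to the request and return the caller's email."""
    iap_jwt = headers["x-goog-iap-jwt-assertion"]
    decoded = id_token.verify_token(
        iap_jwt,
        requests.Request(),
        audience=EXPECTED_AUDIENCE,
        certs_url="https://www.gstatic.com/iap/verify/public_key",
    )
    return decoded["email"]
```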
@harshchoudhary6069
@harshchoudhary6069 4 months ago
How can we share an authorized view using Analytics Hub?
@practicalgcp2780
@practicalgcp2780 4 months ago
It makes no difference using authorised views, as authorised view permissions are managed the same way as tables, unlike normal views. However, using authorised views has some trade-offs, a key one being losing metadata such as column descriptions, which isn't great for data consumers. But it does have advantages if you don't want to duplicate data models or increase latency.
@LongHD14
@LongHD14 4 months ago
May I ask one more question regarding this matter? I would like to implement permissions for a chat application concerning access to documents. For example, a person should only have access to certain tables or specific fields within those tables, and if they don't have permission, they wouldn't be able to search. Do you have any suggestions or keywords that might help with this? Thank you very much for your assistance
@practicalgcp2780
@practicalgcp2780 4 months ago
That is something you have to do through some sort of RBAC implementation (role-based access control). It isn't really about the search; it's more about mapping out the role of a user through login, like most applications do today. Then, depending on the role, you can add specific filters to the search queries, such as filtering on certain metadata, or restrict the set of tables based on roles, etc.
@LongHD14
@LongHD14 4 months ago
Sure, I understand that. However, I'm looking for a service that can assist me with implementing RBAC.
@practicalgcp2780
@practicalgcp2780 4 months ago
OK, I see. I think it really depends on what you are using. For example, if you are building a backend with Python, you can use Django, which has an RBAC module; generally, any framework will have some sort of RBAC component you can use. If it's an internal app (for use within the company) you can simplify things by just using IAP, but IAP isn't suitable for external consumer applications.
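For a rough idea of what that looks like with Django's built-in permission system, a minimal sketch; the app label, permission codename and view name are illustrative assumptions:

```python
from django.contrib.auth.decorators import login_required, permission_required
from django.http import JsonResponse


@login_required
@permission_required("reports.view_financial_report", raise_exception=True)
def financial_report(request):
    # Only users (or groups) granted reports.view_financial_report reach this
    # point; everyone else gets a 403 instead of the data.
    return JsonResponse({"rows": []})
```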
@LongHD14
@LongHD14 4 months ago
@practicalgcp2780 thank you for your answer!
@kavirajansakthivel
@kavirajansakthivel 4 months ago
Hello Richard, it was a wonderful video, but somehow I couldn't set up the TCP proxy. How did you do it: through the reverse proxy method or the auth proxy method? You seem to be the only person who has done this successfully so far. Could you please create a tutorial video for it?
@practicalgcp2780
@practicalgcp2780 4 months ago
Hi there, it's been a while since I last did it, and it's going to be quite difficult to understand what your problems are, as it's quite a complex setup. If I remember correctly, this is the documentation I followed: cloud.google.com/datastream/docs/private-connectivity#reverse-csql-proxy. Make sure you follow it step by step, and especially don't forget to open the required firewall rules, as that is a common cause of issues.
@LongHD14
@LongHD14 4 months ago
Wow, this video is incredibly insightful and informative! 👏 I've learned so much and am grateful that you've shared this valuable content with us. Just a quick question: could I apply these concepts to create a conversation app that details the findings in the search results? Looking forward to your guidance on this.
@practicalgcp2780
@practicalgcp2780 4 months ago
Glad you found it useful! I don't see why not, but as I mentioned, for a conversational app (I assume you mean an app facing real consumers) the concept is exactly the same, but you need to change the vector DB to something that supports highly concurrent workloads. So BigQuery is out of the picture; you can look at Vertex AI Vector Search and also AlloyDB, which I am hearing a lot about lately. I haven't tried either yet, but as far as I know they are both valid approaches for consumer apps with highly concurrent workloads. The docs for AlloyDB are here: cloud.google.com/alloydb/docs/ai/work-with-embeddings
@LongHD14
@LongHD14 4 months ago
Thank you for your valuable insights and guidance!
@practicalgcp2780
@practicalgcp2780 4 months ago
You are welcome ;)
@kamalmuradov6731
@kamalmuradov6731 4 months ago
I implemented a similar solution using Cloud Workflows (CW) + Cloud Functions (CF). The CW runs a loop and makes N requests to the CF in parallel each iteration, where N is equal to the CF's max instances. I'll look into querying Stackdriver each loop to dynamically determine concurrency. I chose CW over Cloud Scheduler (CS) for a few reasons. First, CS is limited to at most 1 run per minute, which wasn't fast enough to keep my workers busy (they process a batch in under 30 seconds). Second, CS can't make N requests in parallel, so it would require something in between to replicate what the CW is doing. Third, CW has a configurable retry policy, which is handy for dealing with occasional CF network issues. One caveat with CW is that a single execution is limited to 100k steps. To work around this, I limit each CW execution to 10k loops, at the end of which it triggers a new workflow execution and exits. I set up an alerting policy to ensure there is always exactly 1 execution of this workflow running and haven't had any issues.
@practicalgcp2780
@practicalgcp2780 4 months ago
Hmm, interesting approach, although I am not sure we are comparing apples to apples here. The solution demonstrated in this video is an always-on approach; in other words, the pull subscriber is always listening on the Pub/Sub subscription. It doesn't die after processing all remaining messages, it simply waits. So if you change the Cloud Scheduler interval to 10 minutes and let the pull subscriber run for 9 minutes 50 seconds, for example, it will not get killed until it reaches that timeout (which is in the code example I gave). I am not sure if I misunderstood you here, but the solution is no different from what you would normally do with a GKE deployment; it's just an alternative without needing any infrastructure.
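For readers following along, a minimal sketch of the "always-on until timeout" pull subscriber described above, assuming the google-cloud-pubsub client library; the project, subscription name and 590-second window are illustrative, not taken from the video:

```python
from concurrent.futures import TimeoutError
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")


def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    # Process the payload, then ack so Pub/Sub does not redeliver it.
    print(f"Received {message.data!r}")
    message.ack()


# Streaming pull: stays connected and simply waits for messages until the timeout.
streaming_pull = subscriber.subscribe(subscription_path, callback=callback)

with subscriber:
    try:
        # Exit just before the next Cloud Scheduler trigger (e.g. a 10-minute interval).
        streaming_pull.result(timeout=590)
    except TimeoutError:
        streaming_pull.cancel()
        streaming_pull.result()  # Block until the shutdown completes cleanly.
```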
@kamalmuradov6731
@kamalmuradov6731 4 months ago
That sounds correct! In my case the CF does a "synchronous pull" of a few thousand messages, processes them, and acks them all in bulk, so it's not an always-on streaming setup like what you demoed here. It handles one batch per request, shuts down, and is then invoked again in the next loop by the CW. For this particular use case batching is advantageous, so I went with synchronous pull, but it would be straightforward to switch the CF to a streaming pull if batching were not necessary.
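For contrast, a minimal sketch of the synchronous-pull variant described in this comment, again assuming the google-cloud-pubsub client; the names and batch size are illustrative:

```python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

with subscriber:
    # Grab one bounded batch, process it, then ack everything in bulk and exit.
    response = subscriber.pull(
        request={"subscription": subscription_path, "max_messages": 1000},
        timeout=30,
    )

    ack_ids = []
    for received in response.received_messages:
        # Process the payload here before acknowledging it.
        print(f"Processing {received.message.data!r}")
        ack_ids.append(received.ack_id)

    if ack_ids:
        subscriber.acknowledge(
            request={"subscription": subscription_path, "ack_ids": ack_ids}
        )
```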
@stevenchang4784
@stevenchang4784 4 months ago
VS Code here is a browser-based environment, but a lot of organisations restrict Cloud Shell and SSH connections. Do you think Cloud Workstations could get around those restrictions?
@practicalgcp2780
@practicalgcp2780 4 months ago
Hi there, I haven't used this at scale yet, but my understanding is that one of the most important reasons for having Cloud Workstations is to get around these restrictions. The common reason Cloud Shell does not work in many orgs is the inability to support private IPs and VPC SC, but this isn't the case for workstations, as these are deployed within your network. Check it out; it's documented here: cloud.google.com/workstations
@stevenchang4784
@stevenchang4784 4 months ago
@practicalgcp2780 Hi, I tested it all day. Thank you for your reply. It really solves the Cloud Shell public IP issue.
@digimonsta
@digimonsta 4 months ago
Really interesting and informative. I'm currently looking at migrating away from a GKE workload purely because of the complexity, so this may prove useful. I'd be interested to know if you feel Cloud Run Jobs would support my use case. Essentially, based on a Pub/Sub message, I need to pull down a bunch of files from a GCS bucket, zip them into a single archive, and then push the resulting archive back into GCS. This zip file is then presented to the user for download. There could be many thousands of files to zip, and the resulting archive could be several terabytes in size. I was planning on hooking up Filestore or GCS FUSE to Cloud Run to facilitate this. The original implementation was in Cloud Run (prior to Jobs), but at the time no one knew how many files users would need to download or how big the resulting zip files would be. We had to move over to GKE, as we hit the maximum time limit allowed for Cloud Run before it was automatically terminated.
@practicalgcp2780
@practicalgcp2780 4 months ago
Thanks for the kind comment. It's quite an interesting problem, because the archive can potentially be huge, so it can take a long time. You are right: the Cloud Run service, I think even today, can only handle up to a 1-hour timeout, while a Cloud Run job can now handle 24 hours. So if your archive process won't take longer than a day, I don't see why you can't use this approach. If you need more time you can look at Cloud Batch, which can run longer without needing to create a cluster, but it's more complex to track the state of the operation; I have another video describing use cases for Batch. Having said that, it feels a bit wrong to have archives of such huge size. Have you considered generating the Pub/Sub messages from upstream systems in smaller chunks, or using the Cloud Run service to break things down and only zip so many files in a single execution, tracking the offset somewhere (e.g. in a separate Pub/Sub topic) to trigger more short-lived zip operations? The thing is, if there's a network glitch, which happens every now and then, you could waste a huge amount of compute. Personally, I would always prefer to make the logic slightly more complex in code than maintain a GKE cluster myself, just to keep the infrastructure as simple as possible, but that is just my opinion.
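A minimal sketch of that chunked approach, assuming the google-cloud-storage client: zip one bounded batch of GCS objects per Cloud Run Job execution instead of one giant archive. The bucket names, prefixes and batch size are illustrative placeholders:

```python
import os
import zipfile
from google.cloud import storage

BATCH_SIZE = 500  # Objects per execution; tune to fit disk and timeout limits.


def zip_one_batch(bucket_name: str, prefix: str, batch_index: int) -> str:
    client = storage.Client()
    bucket = client.bucket(bucket_name)

    # Deterministically pick this execution's slice of the object listing.
    blobs = list(client.list_blobs(bucket_name, prefix=prefix))
    batch = blobs[batch_index * BATCH_SIZE:(batch_index + 1) * BATCH_SIZE]

    archive_path = f"/tmp/archive-{batch_index:05d}.zip"
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for blob in batch:
            local = os.path.join("/tmp", os.path.basename(blob.name))
            blob.download_to_filename(local)
            archive.write(local, arcname=blob.name)
            os.remove(local)  # Keep local disk usage bounded.

    # Push the partial archive back to GCS; a final step (or the client) can
    # list or combine the parts instead of shipping one multi-terabyte zip.
    destination = f"archives/{prefix}/part-{batch_index:05d}.zip"
    bucket.blob(destination).upload_from_filename(archive_path)
    return destination
```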
@user-dl5mm9fu9g
@user-dl5mm9fu9g 5 months ago
The introduction is very detailed and very good. Good job, buddy...
@practicalgcp2780
@practicalgcp2780 5 months ago
Thank you! It's great you found it useful
@eyehear10
@eyehear10 5 months ago
This seems like it adds more complexity compared with the push model
@practicalgcp2780
@practicalgcp2780 5 months ago
Yes it does, but not everything upstream supports the push model, plus not every downstream can handle the load via the push model. I explained some of the pros and cons, mainly related to controlling or improving throughput (i.e. limiting how much traffic you want to consume if there is too much, or using batching). A really important thing to consider is the downstream system and how many connections you establish, or how many concurrent requests you make if, say, the downstream system is an HTTPS endpoint. Opening too many requests can easily overwhelm the system on the other side, whereas batching requests, or opening a single connection and reusing it, makes a huge difference. If it's possible to use push without the constraints above, it's almost always better to use push. Hope that makes sense.
@SwapperTheFirst
@SwapperTheFirst 5 months ago
Any examples of such tools for cataloging, certification and lineage? Especially OSS? I had some experience with Qlik Catalog, but I'm not sure if it is a good choice for GCP or how well it integrates with BQ. Beyond the usual suspects (Collibra, Immuta, ...)
@practicalgcp2780
@practicalgcp2780 5 months ago
There are a few GCP partners with very good GCP integration that can save engineers a lot of time on metadata integration. Collibra is one of them, as you already mentioned; you can also look at Atlan, a newer player in the field with some powerful features too. Those are the two I am aware of that, in my view, have pretty good integration and features, but please do your own research; there are pros and cons, and these are not recommendations I am making here. By OSS, do you mean support systems like JSM?
@SwapperTheFirst
@SwapperTheFirst 5 months ago
@practicalgcp2780 nope, I mean open source software, like Apache Airflow for workflow management, from which managed solutions like Astronomer or Cloud Composer are also built. I think something should exist in this space too?
@SwapperTheFirst
@SwapperTheFirst 5 months ago
I like this format of battle stories/coaching.
@practicalgcp2780
@practicalgcp2780 5 months ago
Thanks ☺️ I thought I'd try a different way to present; it feels like more people can relate to this
@WiktorJurek
@WiktorJurek 5 months ago
This is bang on. It would be awesome to see how this works in practice - as in, how all of this looks in the console, how to set it up, and practically how you can oversee/manage this kind of setup.
@practicalgcp2780
@practicalgcp2780 5 months ago
There's quite a lot of effort involved, but the foundation isn't that difficult to set up. It's not like there is a single UI where everything can be done, though. I think the entry point for data management and discovery for a large group of users can be the catalog tool, and a platform team can own the tooling for things like quality scans and Analytics Hub while making them self-service. There are things, especially the data quality check rules, that I would prefer to keep in version control so it's much easier to control the changes and the quality of the checks, whereas for other things like Analytics Hub the UI should be sufficient, as long as there is a way to recover if something goes wrong.
@alexanderpotts8425
@alexanderpotts8425 5 months ago
Knocking it out of the park as usual. I'm trying to get adoption of some of these in my team already!
@practicalgcp2780
@practicalgcp2780 5 months ago
Amazing to see you find it useful. I believe a lot of the things I covered are what we are doing every day already; I was trying to put everything together in a more structured way, hopefully to help a wider crowd adopt these technologies and methods.
@JohnMcclaned
@JohnMcclaned 5 months ago
Would love to see a video about how to get changes from AlloyDB into an ordered Pub/Sub topic
@practicalgcp2780
@practicalgcp2780 5 months ago
I thought about the challenges of event-based data consumption coming from message queues, but decided not to cover them in this video. Event-based data consumption in real time has very different challenges, and I don't believe it's the same pain we get with data stored in analytical databases. Sure, managing those is important, but in my experience event-based applications are very bespoke, already have clear data contracts as they are mission critical, and are mostly built and well maintained by data engineering teams. Unfortunately, the same cannot be said for data being consumed in analytical databases. With AlloyDB I assume you are working on more bespoke use cases, as it's not typically used to store all data permanently for a large group of teams to consume.
@JohnMcclaned
@JohnMcclaned 5 months ago
@practicalgcp2780 I am building an event-sourced event store and I need a way to have ordered changes propagated out. I am defaulting to 1-second interval polling, though I am exploring other solutions.
@user-yz6pz3yn6w
@user-yz6pz3yn6w 5 months ago
Hi Richard, thanks so much for the video, it was really helpful! I'm a junior data scientist working on a customer service chatbot using Dialogflow CX and some webhooks for the RAG (product recommendations, stock, price...). My original vector store was Vertex AI Vector Search; since it's really expensive I'm looking at other options like BQ vector search, but you mentioned that consumer traffic is an issue there. Do you have any vector store you can recommend? I'm trying to keep my solution inside GCP. Thank you again 🙌
@practicalgcp2780
@practicalgcp2780 5 months ago
Sorry, I just realised I forgot to reply. I was actually going to look at Vertex AI Vector Search next, as it is one of the good options for vector search. Can you provide some metrics on what you would consider expensive? There are other options like Postgres and Elasticsearch; you can see what you can use from the list of langchain integrations, which covers most of them: python.langchain.com/docs/integrations/vectorstores. If you want to control cost better, you can consider services that aren't priced on volume but on the cost of the storage engine, although you then have trade-offs around managing infrastructure and optimising for performance. So it's more trial and error than a straight answer.
@shaboxi129
@shaboxi129 5 months ago
@practicalgcp2780 I'd like to hear more about Vertex Vector Search since I can't find good deep dives on it :(
@user-yz6pz3yn6w
@user-yz6pz3yn6w 5 months ago
The endpoint machine I'm using is quite large (16 vCPUs, 64 GiB). While I'm only working with half the intended volume of products at the moment, the cost is still around $1k per month. I suspect there might be a default parameter set during creation that prevents me from selecting a smaller machine type. I read the integrations and I think my next step is AlloyDB; the good thing about Vertex AI Vector Search is that it is really fast, so I need to see the difference between the two. I'm looking forward to your videos, since I haven't seen much content about Vertex AI Vector Search. Thank you for the reply 🙌
@MaxNRG
@MaxNRG 6 months ago
Thank you sir, you saved my day!
@practicalgcp2780
@practicalgcp2780 6 months ago
Glad I could help!