Thank you for your high-quality videos! In our use case, we ingest a daily .zip file containing 3 .csv files (sales, inventory and orders) from different shops (20-30) and CRMs (4-5, each with its own naming conventions, dtypes, …). How would you improve the following pipeline?
- Raw zip files are uploaded to a GCP bucket.
- The upload triggers a Python GCP Cloud Function that transforms the data to apply a single naming/dtype convention and add a few new columns (e.g. a timestamp merging date + time).
- Transformed data is uploaded to MongoDB (3 separate collections for sales, inventory and orders), and the raw .csv's go to a separate GCP bucket as parquet files (1 folder for each CRM, with PoS as subfolders).
- A PubSub message posted by the function triggers a GCP Function that loads the processed data from MongoDB, applies ML models and stores the results in separate collections (1 for each analysis type, e.g. forecast, anomaly detection, …).
- A Python web app reads the ML output data directly from MongoDB.
Thank you so much, and love your videos! 🤗
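For readers following along, the transform step described above (one naming/dtype convention plus a merged timestamp column) could look roughly like the sketch below. It uses only the Python standard library. The CRM keys, the column names in `COLUMN_MAPS`, and the ISO date/time formats are all assumptions for illustration; none of them come from the comment.

```python
import csv
import io
import zipfile
from datetime import datetime

# Hypothetical per-CRM mappings: source column name -> canonical column name.
COLUMN_MAPS = {
    "crm_a": {"Shop": "shop_id", "Dt": "date", "Tm": "time", "Qty": "quantity"},
    "crm_b": {"store": "shop_id", "sale_date": "date", "sale_time": "time", "units": "quantity"},
}


def normalize_row(row, crm):
    """Rename columns to canonical names, cast dtypes, and merge date + time."""
    mapping = COLUMN_MAPS[crm]
    out = {mapping[key]: value.strip() for key, value in row.items() if key in mapping}
    out["quantity"] = int(out["quantity"])
    # Assumes "YYYY-MM-DD" dates and "HH:MM:SS" times (an assumption, not from the source).
    out["timestamp"] = datetime.fromisoformat(f"{out.pop('date')}T{out.pop('time')}")
    return out


def normalize_zip(zip_bytes, crm):
    """Yield (csv_name, normalized_rows) for every .csv inside the archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.endswith(".csv"):
                continue  # skip anything that is not one of the .csv files
            with archive.open(name) as handle:
                text = io.TextIOWrapper(handle, encoding="utf-8", newline="")
                yield name, [normalize_row(row, crm) for row in csv.DictReader(text)]
```

Keeping the per-CRM differences in a mapping table rather than in branching code means a new CRM should only need a new dictionary entry, which may help as the number of sources grows.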
@n.l.875 2 years ago
Been a subscriber for a while, and I can't thank you enough for the quality of your channel. I have a request. This is an excellent video and highlights a key challenge of communicating to budgetary stakeholders that a solution may 'get something done' but will incur a considerable amount of 'technical debt'. You've treated other topics very well, and I was wondering if you could do a video on Technical Debt. This is one of the least understood, but arguably the most important, ways to get people on board with any technical change decision-making. I use a technical debt register in which I give rough estimates, in single-developer full-time-equivalent days, for each omitted task. The aim isn't accuracy; it is to get the conversation started before project managers or others are faced with huge bills because of their choices, rather than talking in esoteric concepts like SCD maintenance.
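A technical debt register like the one described above can be as simple as a list of omitted tasks with rough full-time-equivalent day estimates. The following is a minimal sketch of that idea; the class names, fields, and report format are hypothetical and not the commenter's actual register.

```python
from dataclasses import dataclass, field


@dataclass
class DebtItem:
    task: str          # the omitted task, in plain language
    fte_days: float    # rough single-developer full-time-equivalent days
    decision: str = "" # the choice that caused the task to be skipped


@dataclass
class DebtRegister:
    items: list = field(default_factory=list)

    def add(self, task, fte_days, decision=""):
        self.items.append(DebtItem(task, fte_days, decision))

    def total_fte_days(self):
        """Sum of all rough estimates; a conversation starter, not an accurate bill."""
        return sum(item.fte_days for item in self.items)

    def report(self):
        """One line per item, largest estimate first."""
        ranked = sorted(self.items, key=lambda item: item.fte_days, reverse=True)
        return [f"{item.fte_days:>5.1f} d  {item.task}" for item in ranked]
```

Sorting the report by estimate puts the most expensive omissions in front of stakeholders first, which fits the stated aim of prompting a discussion rather than producing precise figures.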
@nullQueries 2 years ago
I love talking about tech debt, I'll add it to the list of topics.
@antoruby 2 years ago
@nullQueries you have weird tastes 😆 Just kidding, it’s such an important and overlooked topic that teams go back and forth in rewrites without ever understanding what’s really happening.
@yogoson8371 1 year ago
Absolutely agree. Your points are spot on!
@chasedoe2594 2 years ago
I'm just wondering whether Athena is good at joining tables. If they work on the OLTP, I guess they have to join heavily to get the expected results. Or do they just retrieve the data and do the joins in Tableau, which shouldn't be that fast, should it?
@nullQueries 2 years ago
Athena is mostly for querying files. So if the OLTP has a lot of joins, like most do, I wouldn't expect great performance compared to a relational database or another solution designed for complex data models. Athena is mostly used for ad-hoc querying of large file stores (e.g. searching for data in log files).
@fishsauce7497 8 months ago
What many fail to realise is that a bad data warehouse is not just bad table structure, but also poor documentation, redundant calculations and unnecessarily complicated ETL (mostly tech debt), all of which make the warehouse unusable and difficult to maintain. I also see the wrong approach taken when creating a warehouse, e.g. just looking at existing reports to create a data model, with no data profiling, no business study, no articulation of data loading rules, and heavy reliance on undocumented assumptions. Eventually the shiny new warehouse built by the modellers is also discarded by analysts because it is not fit for purpose, and the same mistake is repeated again and again.
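As a rough illustration of the data profiling step mentioned above, a first pass can be as small as counting rows, nulls, and distinct values per column before any model is designed. This is a minimal standard-library sketch under the assumption that rows arrive as dictionaries; the function name and output keys are hypothetical.

```python
from collections import Counter


def profile(rows):
    """Return per-column row count, null count, distinct count, and most common value."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            stats = columns.setdefault(name, {"rows": 0, "nulls": 0, "values": Counter()})
            stats["rows"] += 1
            # Treat None and blank strings as missing values.
            if value is None or str(value).strip() == "":
                stats["nulls"] += 1
            else:
                stats["values"][value] += 1
    return {
        name: {
            "rows": stats["rows"],
            "nulls": stats["nulls"],
            "distinct": len(stats["values"]),
            "top": stats["values"].most_common(1)[0][0] if stats["values"] else None,
        }
        for name, stats in columns.items()
    }
```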
@poizentv 1 year ago
Hello. I hope you are well. Is it possible to become a Data Warehouse developer without learning any programming language?