My notes: 19:34: ## Data distrubution strategies, for data distribution across slices (ie virtual nodes -1 physical node is made up of [2, 16] slices, acting as sharding virtual nodes). ## Note that data in a table is stored as follows : it is distributed according to the table's data-distribution strategy (eg according to their distribution key, if you chose "KEY based distribution"), and then data is sorted within each partition according to the sort key(s) (the sort key is relevant for zone maps). Data distribution has two goals : 1) *Avoid "hot partitions/shards"* , ie a partition of data where the request is centered, thus putting a lot of strain on the resources underlying the slice/node that hosts it, while the other slices in the cluster are under-utilized. 2) *Skip IO* : with a query like "SELECT * FROM my_table WHERE country='FRANCE'; " if this table was distributed according to the "country" column, then all the relevant data for this query was stored on the same partition, and can be returned quickly. Compute nodes can quickly identify the partitions where the data is and is not. (Note: "country" is probably not a good choice of partition key, as the resulting partitions will probably be too big, and therefore we'll result in a hot partition) 3) *Return relevant data together in few IO operations:* this is the same idea as 2) : when all relevant data is in the whole partition (or serveral partitions, all the data of those partitions is relevant), then all the data blocks of the partition are relevant, and the storage layer can read them quickly in just a few IO operations. For this reason, it makes sense to use as partition key a column that is often used as joining creteria (eg "JOIN mytable1 and mytable2 ON columnToUseAsPartitionKey") There are 4 data distribution options for a table: - *hash partitioning/sharding* (aka "distribution based on a *key*") : you choose a column (aka key) of your table that will be used as partition key. All the rows that share the same value for this column will be stored together in the same partition/shard. To get a good data dsitribution, and avoid having hot partitions, use a column with an even value distribution (to get even-sized partitions), and with high cardinality (to avoid having huge partitions, that can't be broken down in any smaller pieces), and on which your queries aren't focused on a single value (that's another reason why "country" is a bad partition key, because you'll typically query about facts regarding a single country, and therefore you'll put all the strain on the partition corresponding to that country, thus making it a hot partition ; for the same reason, don't use date as a partition key, because you'll typically query facts on a specific date, thus making the partition corresponding to that date a hot partition). With hash partitioning, if you choose your distribution key right, you'll avoid hot partitions (gaol 1), skip IO (goal 2) and read your data in fewer IO operations (goal 3). - *Even distribution:* the rows of your table are evenly spread among your slices, in a round robin fashion, row by row. The idea is to avoid hot partitions (goal 1), but you don't get goals 2 and 3. - *ALL*: your table is duplicated on the 1st slice of each of your compute nodes. This only makes sense for "small" tables (ie No I think it’s okay as the data model is typically for historizations, so inserts would be appended to the table, in the same HCC/ORC set of row. Sure, adding one by one new rows is a heavier process than doing so in a row-oriented DB, but it’s a heavy process taking place “at the margin”, affecting just one HCC/ORC set of rows per table at a time. 25:00 data is always at least twice redundant in the cluster. Redshift uses 2-phase commit for transactions.
@vaibhavdce1234 жыл бұрын
One of best talks on Amazon Redshift
@galeop2 жыл бұрын
13:40 so zone maps are like "storage indexes" in Oracle Exadata ; they allow to skip unecessary IO by telling the storage server what are the min and max values of a compressed 1MB storage block, thus allowing the server to know if the values it looks for NOT in that block. Sort keys in Redshift are meant to determine how your table will be sorted on disk, and therefore impact the zone maps that track the min/max of the sort keys in a 1MB block.You'd typically set sort keys for frequently filtered columns, and have fewer than 4 sort keys per table (the table being sorted based on column X, and then column Y when column X had identical vallues, etc.). Use in priority low cardinality columns for sort keys, as they'll allow to make zone maps more meaningful (to know where the filtered data is not)
@aksharjamgaonkar16772 жыл бұрын
this is pretty crisp and to the point, worth your time.
@DiogoNomura3 жыл бұрын
Really the best content of Redshift!
@raghukundurthi12884 жыл бұрын
Wow! An intelligent and insightful deep dive into the hidden pearls of Redshift! Love the topic of Advisor! Thanks Tony & Harshida!
@galeop2 жыл бұрын
16:34 like all datawarehouses, you should use a denormalized data moldel ; something akin to a star schema or snowflake schema. In your star schema, the Fact table is a denormalised table where entities that belong to the same Fact have been pre-joinned ("materialized", ie they are stored joined together on the Fact table). You will also store in the Faact table pre-calculated aggregations (as you store those values on the table, they are "materialized").
@venkata.krishnan4 жыл бұрын
Excellent presentation structure..
@galeop2 жыл бұрын
27:51 according to which criteria should the big S3 file to COPY be split into 16 smaller file ? Should we partition it according to the same distribution method as the one used in the table (eg same hash key)?
@galeop2 жыл бұрын
55:01 I suppose that altering the distribution key of a table means that all my table's partitions will be re-written from scratch, according to that new distribution key ; so it's a very heavy operation. Correct ?
@galeop2 жыл бұрын
43:40 for WLM and queues, what are these percentages of memory about? Leader node's memory, compute node's memory, or both?
@saravanangovindharaj69974 жыл бұрын
Awesome presentation!
@shrabanti843 жыл бұрын
I have two question, 1. from performance point of view do we have any major difference between temp table and ctas with backup no? 2. For alter table append command what if the the source datatypes are same but data lengths are not matching will it still work?
@yasink182 жыл бұрын
Best way to use update statement where I am getting slow response
@upuldi2 жыл бұрын
Very good presentation
@andreyavdeev81814 жыл бұрын
fantastic session! thank you!
@ArtisticKalam4 жыл бұрын
Best talk on redshift
@liuruibnu4 жыл бұрын
Excellent structure!
@vivien12524 жыл бұрын
Fantastic session
@Thevisionaryaddy3 жыл бұрын
how much concurrency and CPU usage if you are running a Decent Size Query which takes about 4-5 hours and is made joining 2-3 tables. It runs in the very morning
@RPRAVEENKUMAR4 жыл бұрын
Very vast for a beginner like me. But answered a lot of questions
@dsapraveen4 жыл бұрын
I suppose, Advisor is not available in ap-south-1, I wonder why AWS is selective when releasing these kind add-on of services,.
@akshayarawat4 жыл бұрын
What is the slidshare link?
@liuruibnu4 жыл бұрын
16:16
@user-so5fk1mt9o4 жыл бұрын
Is a joke compare to Google bigquery
@archivepurpose49834 жыл бұрын
in your dreams. :)
@avinashravipati90022 жыл бұрын
Slides can be found at d1.awsstatic.com/events/reinvent/2019/Deep_dive_and_best_practices_for_Amazon_Redshift_ANT418.pdf