AWS re:Invent 2019: Deep dive and best practices for Amazon Redshift (ANT418)

  Рет қаралды 54,478

AWS Events

AWS Events

Күн бұрын

Пікірлер: 27
@galeop
@galeop 2 жыл бұрын
My notes: 19:34: ## Data distrubution strategies, for data distribution across slices (ie virtual nodes -1 physical node is made up of [2, 16] slices, acting as sharding virtual nodes). ## Note that data in a table is stored as follows : it is distributed according to the table's data-distribution strategy (eg according to their distribution key, if you chose "KEY based distribution"), and then data is sorted within each partition according to the sort key(s) (the sort key is relevant for zone maps). Data distribution has two goals : 1) *Avoid "hot partitions/shards"* , ie a partition of data where the request is centered, thus putting a lot of strain on the resources underlying the slice/node that hosts it, while the other slices in the cluster are under-utilized. 2) *Skip IO* : with a query like "SELECT * FROM my_table WHERE country='FRANCE'; " if this table was distributed according to the "country" column, then all the relevant data for this query was stored on the same partition, and can be returned quickly. Compute nodes can quickly identify the partitions where the data is and is not. (Note: "country" is probably not a good choice of partition key, as the resulting partitions will probably be too big, and therefore we'll result in a hot partition) 3) *Return relevant data together in few IO operations:* this is the same idea as 2) : when all relevant data is in the whole partition (or serveral partitions, all the data of those partitions is relevant), then all the data blocks of the partition are relevant, and the storage layer can read them quickly in just a few IO operations. For this reason, it makes sense to use as partition key a column that is often used as joining creteria (eg "JOIN mytable1 and mytable2 ON columnToUseAsPartitionKey") There are 4 data distribution options for a table: - *hash partitioning/sharding* (aka "distribution based on a *key*") : you choose a column (aka key) of your table that will be used as partition key. All the rows that share the same value for this column will be stored together in the same partition/shard. To get a good data dsitribution, and avoid having hot partitions, use a column with an even value distribution (to get even-sized partitions), and with high cardinality (to avoid having huge partitions, that can't be broken down in any smaller pieces), and on which your queries aren't focused on a single value (that's another reason why "country" is a bad partition key, because you'll typically query about facts regarding a single country, and therefore you'll put all the strain on the partition corresponding to that country, thus making it a hot partition ; for the same reason, don't use date as a partition key, because you'll typically query facts on a specific date, thus making the partition corresponding to that date a hot partition). With hash partitioning, if you choose your distribution key right, you'll avoid hot partitions (gaol 1), skip IO (goal 2) and read your data in fewer IO operations (goal 3). - *Even distribution:* the rows of your table are evenly spread among your slices, in a round robin fashion, row by row. The idea is to avoid hot partitions (goal 1), but you don't get goals 2 and 3. - *ALL*: your table is duplicated on the 1st slice of each of your compute nodes. This only makes sense for "small" tables (ie No I think it’s okay as the data model is typically for historizations, so inserts would be appended to the table, in the same HCC/ORC set of row. Sure, adding one by one new rows is a heavier process than doing so in a row-oriented DB, but it’s a heavy process taking place “at the margin”, affecting just one HCC/ORC set of rows per table at a time. 25:00 data is always at least twice redundant in the cluster. Redshift uses 2-phase commit for transactions.
@vaibhavdce123
@vaibhavdce123 4 жыл бұрын
One of best talks on Amazon Redshift
@galeop
@galeop 2 жыл бұрын
13:40 so zone maps are like "storage indexes" in Oracle Exadata ; they allow to skip unecessary IO by telling the storage server what are the min and max values of a compressed 1MB storage block, thus allowing the server to know if the values it looks for NOT in that block. Sort keys in Redshift are meant to determine how your table will be sorted on disk, and therefore impact the zone maps that track the min/max of the sort keys in a 1MB block.You'd typically set sort keys for frequently filtered columns, and have fewer than 4 sort keys per table (the table being sorted based on column X, and then column Y when column X had identical vallues, etc.). Use in priority low cardinality columns for sort keys, as they'll allow to make zone maps more meaningful (to know where the filtered data is not)
@aksharjamgaonkar1677
@aksharjamgaonkar1677 2 жыл бұрын
this is pretty crisp and to the point, worth your time.
@DiogoNomura
@DiogoNomura 3 жыл бұрын
Really the best content of Redshift!
@raghukundurthi1288
@raghukundurthi1288 4 жыл бұрын
Wow! An intelligent and insightful deep dive into the hidden pearls of Redshift! Love the topic of Advisor! Thanks Tony & Harshida!
@galeop
@galeop 2 жыл бұрын
16:34 like all datawarehouses, you should use a denormalized data moldel ; something akin to a star schema or snowflake schema. In your star schema, the Fact table is a denormalised table where entities that belong to the same Fact have been pre-joinned ("materialized", ie they are stored joined together on the Fact table). You will also store in the Faact table pre-calculated aggregations (as you store those values on the table, they are "materialized").
@venkata.krishnan
@venkata.krishnan 4 жыл бұрын
Excellent presentation structure..
@galeop
@galeop 2 жыл бұрын
27:51 according to which criteria should the big S3 file to COPY be split into 16 smaller file ? Should we partition it according to the same distribution method as the one used in the table (eg same hash key)?
@galeop
@galeop 2 жыл бұрын
55:01 I suppose that altering the distribution key of a table means that all my table's partitions will be re-written from scratch, according to that new distribution key ; so it's a very heavy operation. Correct ?
@galeop
@galeop 2 жыл бұрын
43:40 for WLM and queues, what are these percentages of memory about? Leader node's memory, compute node's memory, or both?
@saravanangovindharaj6997
@saravanangovindharaj6997 4 жыл бұрын
Awesome presentation!
@shrabanti84
@shrabanti84 3 жыл бұрын
I have two question, 1. from performance point of view do we have any major difference between temp table and ctas with backup no? 2. For alter table append command what if the the source datatypes are same but data lengths are not matching will it still work?
@yasink18
@yasink18 2 жыл бұрын
Best way to use update statement where I am getting slow response
@upuldi
@upuldi 2 жыл бұрын
Very good presentation
@andreyavdeev8181
@andreyavdeev8181 4 жыл бұрын
fantastic session! thank you!
@ArtisticKalam
@ArtisticKalam 4 жыл бұрын
Best talk on redshift
@liuruibnu
@liuruibnu 4 жыл бұрын
Excellent structure!
@vivien1252
@vivien1252 4 жыл бұрын
Fantastic session
@Thevisionaryaddy
@Thevisionaryaddy 3 жыл бұрын
how much concurrency and CPU usage if you are running a Decent Size Query which takes about 4-5 hours and is made joining 2-3 tables. It runs in the very morning
@RPRAVEENKUMAR
@RPRAVEENKUMAR 4 жыл бұрын
Very vast for a beginner like me. But answered a lot of questions
@dsapraveen
@dsapraveen 4 жыл бұрын
I suppose, Advisor is not available in ap-south-1, I wonder why AWS is selective when releasing these kind add-on of services,.
@akshayarawat
@akshayarawat 4 жыл бұрын
What is the slidshare link?
@liuruibnu
@liuruibnu 4 жыл бұрын
16:16
@user-so5fk1mt9o
@user-so5fk1mt9o 4 жыл бұрын
Is a joke compare to Google bigquery
@archivepurpose4983
@archivepurpose4983 4 жыл бұрын
in your dreams. :)
@avinashravipati9002
@avinashravipati9002 2 жыл бұрын
Slides can be found at d1.awsstatic.com/events/reinvent/2019/Deep_dive_and_best_practices_for_Amazon_Redshift_ANT418.pdf
Getting Started with Amazon Redshift - AWS Online Tech Talks
46:39
AWS Developers
Рет қаралды 60 М.
To Brawl AND BEYOND!
00:51
Brawl Stars
Рет қаралды 17 МЛН
REAL or FAKE? #beatbox #tiktok
01:03
BeatboxJCOP
Рет қаралды 18 МЛН
Арыстанның айқасы, Тәуіржанның шайқасы!
25:51
QosLike / ҚосЛайк / Косылайық
Рет қаралды 700 М.
Amazon EMR Deep Dive and Best Practices - AWS Online Tech Talks
40:32
AWS Developers
Рет қаралды 58 М.
Data Warehousing on AWS with Redshift - with a demo!
39:16
Cloud Architects in Africa
Рет қаралды 20 М.
To Brawl AND BEYOND!
00:51
Brawl Stars
Рет қаралды 17 МЛН