Data Analysis 4: Data Transformation - Computerphile

84,493 views

Computerphile

1 day ago

A litre of fuel but a pint of milk - time to get all your data in the right units. Don't let Dr Mike's measuring habits put you off! This is part 4 of the Data Analysis Learning Playlist: • Data Analysis with Dr ...
This Learning Playlist was designed by Dr Mercedes Torres-Torres & Dr Michael Pound of the University of Nottingham Computer Science Department. Find out more about Computer Science at Nottingham here: bit.ly/2IqwtNg
This series was made possible by sponsorship from Google.
The ‘Census’ dataset was adapted from this dataset:
archive.ics.uci.edu/ml/datase...
/ computerphile
/ computer_phile
This video was filmed and edited by Sean Riley.
Computer Science at the University of Nottingham: bit.ly/nottscomputer
Computerphile is a sister project to Brady Haran's Numberphile. More at www.bradyharan.com

Comments: 110
@Computerphile 5 years ago
Check out the full Data Analysis Learning Playlist: kzbin.info/aero/PLzH6n4zXuckpfMu_4Ff8E7Z1behQks5ba
@masi416 5 years ago
I thought the USA were bad with their units, greetings from germany where (nearly) everything is in SI units.
@aragonnetje 5 years ago
same in the netherlands
@GoatzAreEpic 5 years ago
@@aragonnetje Beautiful country too, isn't it
@KeithSteptoe 5 years ago
Car tyre sizes are crazy. 205/55 R16 = Width of tread in mm, profile as a percentage of the width (55% of 205), and rim size in inches. One metric measurement, a percentage and an imperial measurement all to specify a tyre.
@anjishnu8643 5 years ago
And here I thought Brexit was convoluted
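As a worked example of the mixed units in the tyre comment above, the sketch below converts a 205/55 R16 marking into millimetres. It assumes the usual reading of the marking (sidewall height = width × profile percentage, overall diameter = rim plus two sidewalls); the function name is made up for illustration.

```python
MM_PER_INCH = 25.4  # exact by definition

def tyre_dimensions_mm(width_mm, aspect_percent, rim_inches):
    """Return (sidewall height, rim diameter, overall diameter) in millimetres."""
    sidewall = width_mm * aspect_percent / 100  # profile is a percentage of the width
    rim = rim_inches * MM_PER_INCH              # convert the imperial part to metric
    return sidewall, rim, rim + 2 * sidewall

sidewall_mm, rim_mm, overall_mm = tyre_dimensions_mm(205, 55, 16)
print(sidewall_mm, round(rim_mm, 1), round(overall_mm, 1))
```

Once everything is in one unit, the three parts of the marking can be added and compared directly.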
@kantomega 5 years ago
Greetings from Finland. I thought the problem was that countries use different measurements. It hadn't even crossed my mind that they might use different units for different things inside the *same* country.
@michty053 5 years ago
Is there any chance you could release the R scripts used in this series. As a beginner R user, I have learned a lot of more elegant ways of doing things than I have previously used, but it would be easier to go through the script itself. Thanks to all of you at Computerphile for this amazing content!
@haraldurkarlsson1147 1 year ago
Check out R for Data Science 2nd Ed by Hadley Wickham and company. There is a free copy online with code that you can copy directly from their lessons.
@ceri-potat 5 years ago
thank god i live in a country where we only use the metric system!
@markkeilys 5 years ago
I didn't know there was a metric equivalent to the point (typographic measurement of how big text is).
@archertimecapsule 4 years ago
As an American, who often uses metric, it really is better.
@siarles 3 years ago
@@archertimecapsule Better for math, which is why all of our scientists are trained to use SI. Makes no difference if all you want to do is measure something, which is all that most people care about.
@ec92009y 2 years ago
Only metric? Think again. For 5/3600 of an hour, aka 5 seconds. 😆. Respond in 1/31 of 1/12 of a year.
@ShadiMuhammad 4 years ago
Dr Mike, your videos encouraged me to resume my "Data Science Specialization" course. 💪 Thanks a lot for your efforts. 😇
@novafawks 5 years ago
Wow, these are seriously some of the most informative series of videos I've watched, I know it will help me immensely in what I do. Thank you guys so very much for putting the time, passion, and effort into this content to teach and inspire people like me. You guys - Sean, Mike, Brady, the whole team, really - have inspired me to teach myself more advanced mathematics and coding in my free time where I would NEVER have had any interest before. All of you have changed my life in such a great way and I cannot thank you enough from the bottom of my heart ♥️ Thank you, so much!
@shadowwalker23901 5 years ago
When zero is a valid number in a dataset, all nines in the data sometimes mean the value was not provided, so the two are not confused.
@DantalionNl 5 years ago
JFC the UK is fully mental with their units, Glad I live in the Netherlands~
@davidrathbone5581 3 years ago
Can confirm, fully mental. SI haunts my dreams...
@veeek8 2 years ago
Everything in the UK is a bit mental, I wish I was Dutch 😂
@gorgolyt 5 years ago
10:22 -- this didn't make much sense. This plot doesn't really show why large ranges are a problem. After the proposed solution of standardising the mean to 0 and standard deviation to 1, if you showed the plot again, it would actually look exactly the same. (The true problem with this data was the distribution, which was too skewed; fixing this problem would require a non-linear transformation such as a logarithm.) Standardising data is often a good idea, but the graph explanation was not correct.
@user-bn8yf4ve7r 2 years ago
One graphic benefit of standardizing is that the axes are now more 'related'. Instead of the y axis ranging from 0 to 100000 and the x axis ranging from 20 to 90, the y axis ranges from 0 to 14 and the x axis ranges from -2 to 4.
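The point made in this thread can be checked numerically: standardising is a linear map, so it leaves the skew of a distribution untouched, while a logarithm reduces it. The sketch below uses invented right-skewed values (not the Census data) and only the Python standard library; the helper names are hypothetical.

```python
import math
import statistics

def standardise(values):
    """Shift to mean 0 and scale to standard deviation 1 (a linear map)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    return statistics.fmean([z ** 3 for z in standardise(values)])

# Invented right-skewed "capital gain"-style values, for illustration only.
gains = [1, 2, 2, 3, 5, 8, 20, 100, 1000, 10000]

raw_skew = skewness(gains)
std_skew = skewness(standardise(gains))           # same shape, new axis labels
log_skew = skewness([math.log(v) for v in gains]) # non-linear, shape changes

print(round(raw_skew, 2), round(std_skew, 2), round(log_skew, 2))
```

The first two numbers agree, which is why a plot of the standardised data looks the same apart from its axis labels; only the log-transformed version is less skewed.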
@phat4298 15 days ago
Wow, this video is packed with valuable information! I'm surprised it doesn't have a larger audience. It's honestly one of the most informative resources I've found since starting my data analytics journey. The way you explain complex topics is clear and engaging, and it's evident you have a deep understanding of the subject matter. The professional presentation and the well-organized visuals, especially the code, make it incredibly easy to follow along. I really appreciate how you take the time to explain every detail, even the seemingly basic aspects. It's clear you care about making the information accessible to everyone. Thank you for creating such high-quality content! I'm definitely a subscriber now, and I can't wait to see what you create next.
@norddick 4 years ago
Thank you for this series! Dr. Mike is a really great speaker!
@Dusk-MTG 4 years ago
Mike Pound: "If I'm weighing myself it's going to be in POUNDS".
@cajogos 4 years ago
These videos are amazing! Helping me a lot with my masters degree on my Data Mining and Machine Learning module, thank you Dr Mike!
@forgetfulfunctor2986 5 years ago
GREAT PLAYLIST! DO MORE STUFF LIKE THIS
@IvanToshkov 5 years ago
Standards are good, but "Dr Mike 0.453592 Kilos" just sounds strange to me. Sorry, I just couldn't help myself. Great videos, keep them coming.
@Hexanitrobenzene 4 years ago
:D
@pdsanalytics8713 4 years ago
that is awesome
@ec92009y 2 years ago
Very well done. Tons of valuable info. Thx again.
@veeek8 2 years ago
Best lecturer I've ever found. I wish the computerphile guys did MOOCS.
@tanmayjagtap78 1 year ago
This whole playlist is gold
@fredtrentini6791 5 years ago
Really helpful info, thanks. Keep it up!!!
@frosty9392 4 years ago
The amount of time we waste using non-standardized formats (not just in measuring) is astounding. It hurts my bones lol
@matthew.datcher 5 years ago
14:47 The gain in Spain falls mainly on the main(frame).
@erw103 4 years ago
Excellent, I must say.
@muskanvij4811 4 years ago
Thanks a lot for this great video
@LalliOni 5 years ago
I like that Mike slipped in "of course" while explaining all these different systems he is using for everyday things O_O
@rialyandriamiseza9814 4 years ago
Have a sip every time Mike says "right"
@teemune 2 years ago
Great intro, I've wondered how you operate in the UK with a dual system. Decimal separator differences also keep things interesting with, for example, .csv files. Or UTF-8 characters.
@qwerty975311 5 years ago
@Dusk-MTG 4 years ago
Luckily conversions are not a problem for me. I just put c=1 and measure everything in eV.
@tetlamed 4 years ago
I've never been happy to use the imperial system before. Only thing worse than not using metric is half using metric
@grainfrizz 5 years ago
11:15 - suddenly remembers Dan Shiffman's map(inp, rmin, rmax, 0, 1);
@raphaellaude1232 4 years ago
Beware top-coded values! I suspect that’s what you’re seeing with the 9999 and 99...
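One rough way to check for the top-coded or placeholder values mentioned above is to see how many observations sit exactly on the column maximum. This is a minimal sketch with invented numbers; the helper name is hypothetical, and a pile-up at the maximum is only a hint, not proof, that the value is a cap or a missing-data marker.

```python
from collections import Counter

def share_at_maximum(values):
    """Return the largest value and the fraction of observations equal to it."""
    top = max(values)
    return top, Counter(values)[top] / len(values)

# Invented hours-per-week and capital-gain columns, for illustration only.
hours_per_week = [40, 38, 99, 45, 99, 40, 60, 99]
capital_gain = [0, 2174, 0, 9999, 0, 5178, 9999, 0]

hours_cap, hours_share = share_at_maximum(hours_per_week)
gain_cap, gain_share = share_at_maximum(capital_gain)

print(hours_cap, hours_share)
print(gain_cap, gain_share)
```

A large share at a round "all nines" value is worth a look in the dataset's documentation before scaling, because a single capped value can stretch the range.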
@zerokelvin3626 5 years ago
Great video! I have one suggestion at 17:04 onwards: It would probably be better to divide by the total count of each country (and increase the number of bars) to better see the relative distributions.
@Lodinn 5 years ago
Yeah, no clue how Mike makes calls based on these plots. No way to see what the distribution looks like there.
@ledirigeant 3 years ago
@@Lodinn I think in these videos he's trying to use standard ggplot functions with very few or no parameter adjustments so it's easy for beginners to follow along with the code on-screen. Making the charts super clear by modifying bins and intervals and the values would add a lot of extra text that he'd either need to explain or hand-wave, with the latter probably causing a lot of confusion for some folks.
@Lodinn 3 years ago
@@ledirigeant 17:15, as well as the line 63 in the editor - the number of bins is set explicitly. Otherwise, fair enough.
@MrHaggyy 2 years ago
He just called it OK for the sake of this video, saying explicitly that you have to judge based on the work that you are doing. Fine for me.
@DaTux91 5 years ago
Just to add to this: depending on what you want to do with your combined datasets, you should not rescale any data before combining the sets. If you want to do the same analysis for all countries so you can compare between them, of course, you can scale each set individually. But if you want to combine datasets just to increase your sample size or achieve a better representation of, say, the global population (without distinction based on country), you'll want to first combine all the datasets (after conversion to the same units) and then do the rescaling.
@Lodinn 5 years ago
Absolutely. Moreover, whitening the data is often a part of the routine doing PCA or clustering or whatever; and with the plots used to 'demonstrate' how different scale matters it didn't make much sense either because ggplot rescaled axes appropriately by default. In general, storing some intermediate results makes sense in many cases, but anything beyond units conversion and cleaning data up is a part of internal operations during analysis and shouldn't be transferred around (talk about joining datasets around 13:30 or so in the video). It'd be absolutely terrible if you were provided with whitened data already from all sources you're working with; one might deduce the income economy is similar in the US and, say, Turkey, based on a ratio of a median to mean, but the actual numbers differ on about 20x scale so some of the results obtained that way might not be exactly reasonable.
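The difference described in this thread between scaling each dataset on its own and scaling after combining can be seen with a few invented numbers. A minimal sketch, assuming min-max scaling to 0-1 and incomes already converted to the same unit; `min_max` is a hypothetical helper, not code from the video.

```python
def min_max(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Invented incomes in one common unit; country_b earns four times as much.
country_a = [10_000, 20_000, 30_000]
country_b = [40_000, 80_000, 120_000]

# Scaling each set separately erases the difference in level between them.
separate = min_max(country_a) + min_max(country_b)

# Pooling first keeps both sets on one common scale.
pooled = min_max(country_a + country_b)

print(separate)
print([round(v, 3) for v in pooled])
```

Scaled separately, both countries end up as 0, 0.5, 1 and look identical; pooled first, every value from the second country still sits above every value from the first.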
@Janokins 5 years ago
I think with the income example, it might have been better to take out the outliers first so that when you normalised it you didn't lose so much precision. That said, I don't know how R handles floating point values, so it could be fine, lol.
@JustinJFain 5 years ago
I'm so happy to see R. It's my primary language and I don't think it gets the credit it deserves.
@4.0.4 4 years ago
If you don't mind me asking, what do you use R for? Do you happen to know/use general-purpose language(s)?
@JustinJFain 4 years ago
@@4.0.4 I am a computational geoscientist and professional researcher in that field. I started in Python but eventually found R much more useful both in teaching and in my everyday work. Within the context of my use cases, I now rarely find that Python is better suited for a task than R. Since I could probably be considered an advanced R user at this point, I even use it for the "quick and dirty" stuff whereas a year ago I would have preferred Python for those projects. It's mostly a case of affinity by exposure. Now I find Python a bit clunky and disjointed, especially when it comes to complex tasks dealing with spatial objects and satellite images.
@JustinJFain 4 years ago
Plus, the way class methods are called in Python reminds me too much of JS and Google Earth Engine to be any fun. :'))
@4.0.4 4 years ago
@@JustinJFain thank you for your response! And what an interesting-sounding job title.
@mischajay 5 years ago
What is the best practice in R for keeping track of what the numbers mean after codification? After I have replaced the descriptive names of the categories, how do I best remember that e.g. 1 = "Bachelors"?
@kingpopaul 5 years ago
You could use another dataframe to convert the values from string to int/float; that way you always have a 'legend' of the changes that you made.
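The 'legend' idea from this reply can be sketched as a pair of lookup tables kept next to the data. The question was about R; the example below is in Python purely for illustration, and the category names and codes are invented.

```python
# Invented education categories; the integer codes are arbitrary.
education_levels = ["HS-grad", "Bachelors", "Masters", "Doctorate"]

# Build the legend once, in both directions, and store it with the dataset.
code_for = {label: code for code, label in enumerate(education_levels, start=1)}
label_for = {code: label for label, code in code_for.items()}

raw_column = ["Bachelors", "HS-grad", "Doctorate", "Bachelors"]
encoded = [code_for[label] for label in raw_column]  # strings -> integers
decoded = [label_for[code] for code in encoded]      # integers -> strings

print(encoded)
print(decoded)
```

Because the mapping is explicit data rather than something remembered, the codes can always be turned back into their descriptive names.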
@TheViperZed 4 years ago
Don't worry, the inch is defined as 2.54cm since the introduction of gauge blocks. So for lengths, imperial measurements are just really weirdly "prefixed" units.
@ThisIsWhatIKnow 5 years ago
The conversion from euros to dollars was unnecessary. Multiplying by a linear factor (1.13) doesn't make any difference if later you scale it linearly again to 0-1, as long as you normalize the data before you join the 2 sets.
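The claim in this comment is that min-max scaling cancels any constant positive factor, so the same scaled values come out whether the input is in euros or dollars. A small check with invented incomes and the 1.13 factor quoted above; `min_max` is a hypothetical helper.

```python
import math

def min_max(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

EUR_TO_USD = 1.13  # factor quoted in the comment, not a current exchange rate

incomes_eur = [12_000.0, 25_000.0, 31_500.0, 60_000.0]  # invented values
incomes_usd = [v * EUR_TO_USD for v in incomes_eur]

scaled_eur = min_max(incomes_eur)
scaled_usd = min_max(incomes_usd)

# The factor appears in both numerator and denominator, so it cancels.
same = all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(scaled_eur, scaled_usd))
print(same)
```

This only holds when each set is scaled on its own; if the sets are joined before scaling, the conversion does matter.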
@SammyJStubbs 1 year ago
Shoutout to the guy in the data set working 99 hours per week. It's time to go home.
@moccaloto 3 years ago
Remember to join before you normalize, unless your datasets have exactly the same min and max for all columns, of course.
@go1chase1the1sun1set 2 years ago
What about data like children's shoe sizes? It can be 0-40 and then J11, J10, C10 etc.
@teunvandenbrand1324 5 years ago
I feel like the capital gains data variable probably should have been log transformed before scaling. Also, comparing absolute distributions of e.g. age between datasets makes little sense when the number of observations differs massively. It would probably be better to use relative frequency histograms or, as a personal favourite, kernel density estimates.
@alansilva3826 4 years ago
Hi, which field studies the kind of normalizations you mentioned?
@teunvandenbrand1324 4 years ago
@@alansilva3826 The only transformation I was mentioning was the log transformation, which can be understood within mathematics
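The relative frequency histogram suggested at the start of this thread amounts to dividing each bin count by the size of its own dataset. A minimal sketch with invented ages and a hypothetical helper name; kernel density estimates are not shown.

```python
from collections import Counter

def relative_frequencies(values, bin_width):
    """Share of observations per bin, so samples of different size are comparable."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    total = len(values)
    return {start: counts[start] / total for start in sorted(counts)}

# Invented ages; the first sample is five times larger than the second.
ages_small = [25, 27, 34, 36, 38, 45, 47, 52, 58, 63]
ages_large = ages_small * 5

rel_large = relative_frequencies(ages_large, bin_width=10)
rel_small = relative_frequencies(ages_small, bin_width=10)

print(rel_small)
print(rel_large == rel_small)
```

The raw counts differ by a factor of five, but the shares per bin are identical, which is the comparison the absolute histograms hide.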
@ashishranjan4623 5 years ago
ONE-HOT ENCODING. That's what we do when we assign numerical values to strings. It creates a sparse matrix: 0010, 1000, something like that.
@GoatzAreEpic 5 years ago
hate when that happens
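The "0010 1000" pattern mentioned in this thread is what one-hot encoding produces: one column per category, with a single 1 in each row. This is distinct from assigning one integer per category (label encoding). A minimal sketch with invented country labels and a hypothetical helper name:

```python
def one_hot(labels):
    """Return the sorted category list and one 0/1 row per input label."""
    categories = sorted(set(labels))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for label in labels:
        row = [0] * len(categories)
        row[index[label]] = 1  # exactly one column is "hot" per row
        rows.append(row)
    return categories, rows

categories, rows = one_hot(["Spain", "USA", "Denmark", "Spain"])

print(categories)
for row in rows:
    print(row)
```

With many categories most entries are zero, which is why the result is usually stored as a sparse matrix.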
@DouglasZwick 2 years ago
The gain in Spain stays mainly in column 1.
@oldcowbb 5 years ago
Seems like units in the UK are even messier than in the US..
@adamastalpa6015 5 years ago
What IDE did he use? It doesn't look like IPython/Jupyter...
@zerokelvin3626 5 years ago
It's RStudio
@passingthetorch5831 5 years ago
Usually we take the log of financial data...
@yduufuuduu7494 2 years ago
Try to give the video more brightness; it would be great if you did.
@pranayyanarp4118 5 years ago
The last graph confused me. Difference between Spain and USA is large but he says it's fairly similar ... While Denmark statistics was very near to Spain .... Could someone pls explain..
@punktdotcom 5 years ago
I guess you mean the chart at 18:45? The reason he says it's similar is that if you scale the green graph by a factor, you would get nearly the same red graph. Whereas if you scaled the blue graph, it would be different from the other two. Hope that makes sense
@pranayyanarp4118 5 years ago
@@punktdotcom thank you
@casperes0912 5 years ago
This was in my data warehousing lecture last semester
@TheChiisaiookami 5 years ago
Love how much flak Americans get for using imperial units, but at least we're consistent with our madness
@ec92009y 2 years ago
Give it 50 years like the Brits did and Americans have a shot at sounding that mad.
@wktodd 5 years ago
was the camera man getting bored? or falling asleep ? was expecting an inverted mike at any moment 8-)
@pmcgee003 5 years ago
The average distance from the mean is zero. And I don't think the average absolute distance from the mean is the std.dev. sd should be the rms distance I think ... ?
@Flourish38 5 years ago
Distance is always positive, you're thinking of displacement.
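The three quantities discussed in this thread can be computed side by side: the mean signed deviation (always zero), the mean absolute deviation, and the root-mean-square deviation, which is the population standard deviation. A small check on invented values using only the standard library:

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # invented sample with mean 5
mean = statistics.fmean(values)
deviations = [v - mean for v in values]

mean_signed = statistics.fmean(deviations)                      # displacements cancel out
mean_abs = statistics.fmean([abs(d) for d in deviations])       # mean absolute deviation
rms = math.sqrt(statistics.fmean([d * d for d in deviations]))  # root-mean-square deviation

print(mean_signed, mean_abs, rms, statistics.pstdev(values))
```

For these values the mean absolute deviation and the standard deviation differ, so "average distance from the mean" is only a loose description of the standard deviation.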
@skydrow4523 5 years ago
I see Dr. Mike, I click faster than *kilometers*
@kingpopaul 5 years ago
Does Mike only weigh one Pound or is it x+1 Pounds?
@Lodinn 5 years ago
One pound obviously.
@zerokelvin3626 5 years ago
15:12: Hard coding alert :)
@chijiokekennedyanoka4844 4 years ago
How can I download the data?
@YouPlague 5 years ago
Normalizing the income does not do anything, if you plot the data it will look totally the same, just with different labels on your Y axis!
@heyandy889 4 years ago
That's kind of the idea: the distribution is not modified, just the scale. If I'm understanding Mike correctly, the idea is that later we will use standard data processing and analysis algorithms (he mentioned random forest, machine learning, ...) which are designed to work on data in a particular scale (0, 1). This way the algorithm itself can focus on identifying trends, as opposed to cleaning and transformation, which is handled in advance.
@garmands 5 years ago
Yea baby first to analyze!
@elektrikblu7331 5 years ago
radians vs degrees. come on LUL
@blownspeakersss 5 years ago
A little too much emphasis on machine learning in this series, imo. Classical statistical methods are more useful and informative in a large number of settings -- but of course they're also more difficult to learn and use properly.
@wkingston1248 5 years ago
Machine learning is really just iteration of classical statistics to try to predict an outcome or classification of some data. After doing some linear and logistic regression for some training data, introducing more data to find out which is better at predicting the outcome is an example of machine learning.
@Hexanitrobenzene 4 years ago
I don't understand why you express such an opinion under the "Data transformation" video. There was no machine learning in this series so far. Are you concerned that everything was presented from a machine learning perspective?
@MrCmon113 4 years ago
Everything he described so far you have to do no matter whether your methods are simple statistics or large ML models.
@akrambhat6162 1 year ago
I can watch peaky blinders w/o subtitles, but not this.
@iAmTheSquidThing 5 years ago
It seems a bit weird having to scale this data manually. Couldn't machine learning algorithms just normalise everything automatically if they need it in that format?
@18621005194 4 years ago
ML is like a vending machine: it only takes in what it was designed to take
@MrCmon113 4 years ago
If the numbers were labeled with information about the world that allows inference of the units, yes. But what would be the point?
@LuisRamirez-gc5ds 2 years ago
why measure in feet square is alike british people xD?