Solving real-world data analysis problems with Python Pandas! (Lego dataset analysis)

  85,123 views

Keith Galli

Күн бұрын

In this video we walk through a data analysis project on DataCamp. The project has us explore a Lego dataset and answer a few questions. To do our analysis we use Python's Pandas library.
Check out DataCamp!
bit.ly/KeithGalliDCFeb22
Link to my GitHub:
github.com/KeithGalli/lego-an...
From the DataCamp website:
The Rebrickable database includes data on every LEGO set that has ever been sold; the names of the sets, what bricks they contain, what color the bricks are, etc. It might be small bricks, but this is big data! In this project, you will get to explore the Rebrickable database and answer a series of questions related to the history of Lego!
Link to Rebrickable database: rebrickable.com/downloads/
Some skills worked on in this video:
- Reading CSV files with Python
- Filtering DataFrame based on conditional parameters
- Grouping data by column values and aggregating it
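A minimal sketch of those three skills, assuming a tiny inline CSV in place of the project's real files. The column names (`set_num`, `name`, `year`, `parent_theme`) and the rows are illustrative assumptions, not the confirmed layout of the Rebrickable export:

```python
import io

import pandas as pd

# Hypothetical sample standing in for the project's CSV file.
csv_text = """set_num,name,year,parent_theme
A-1,X-Wing,1999,Star Wars
A-2,Hogwarts,2001,Harry Potter
A-3,TIE Fighter,2001,Star Wars
A-4,Fire Station,2001,Town
"""

# Reading a CSV (a file path works the same way as this in-memory buffer).
sets_df = pd.read_csv(io.StringIO(csv_text))

# Filtering rows with a boolean condition.
sets_2001 = sets_df[sets_df["year"] == 2001]

# Grouping by a column and aggregating.
sets_per_theme = sets_df.groupby("parent_theme")["set_num"].count()

print(sets_2001)
print(sets_per_theme)
```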
By the way, I apologize: at about the 25-minute mark I started having microphone issues. I'll have it solved by my next video.
Thank you to DataCamp for sponsoring this video :)
-------------------------
Follow me on social media!
Instagram | / keithgalli
Twitter | / keithgalli
-------------------------
Song at the end
good morning by Amine Maxwell / aminemaxwell
Creative Commons - Attribution 3.0 Unported - CC BY 3.0
Free Download / Stream: bit.ly/2vpruoY
Music promoted by Audio Library • Good morning - Amine M...
-------------------------
If you are curious to learn how I make my tutorials, check out this video: • How to Make a High Qua...
Practice your Python Pandas data science skills with problems on StrataScratch!
stratascratch.com/?via=keith
Join the Python Army to get access to perks!
KZbin - / @keithgalli
Patreon - / keithgalli
*I use affiliate links on the products that I recommend. I may earn a purchase commission or a referral bonus from the usage of these links.
-------------------------
Video Timeline!
0:00 - Introduction
1:05 - Getting started w/ Lego analysis project
2:33 - How to follow along if you are not a premium DataCamp subscriber (GitHub)
4:01 - Project tasks overview
5:40 - Basic exploration of the dataset
9:45 - Task #1: What percentage of all licensed sets ever released were Star Wars Themed?
24:23 - Task #2: In which year was Star Wars not the most popular licensed theme?
34:00 - Bonus Task: How many unique sets were released each year (1955-2017)?
42:26 - Conclusion!
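Task #1 and the bonus task from the timeline above can be sketched in a few lines. This is only an illustration on made-up rows, assuming the sets have already been merged with their theme data so that `parent_theme` and `is_licensed` sit in one table; the real column names may differ:

```python
import pandas as pd

# Hypothetical miniature of the merged sets table.
sets_df = pd.DataFrame({
    "set_num": ["A-1", "A-2", "A-3", "A-4", "A-5"],
    "year": [1999, 2001, 2001, 2001, 2002],
    "parent_theme": ["Star Wars", "Harry Potter", "Star Wars", "Town", "Star Wars"],
    "is_licensed": [True, True, True, False, True],
})

# Task #1: share of licensed sets whose parent theme is Star Wars.
licensed = sets_df[sets_df["is_licensed"]]
star_wars_pct = (licensed["parent_theme"] == "Star Wars").mean() * 100

# Bonus task: number of unique sets released each year.
sets_per_year = sets_df.groupby("year")["set_num"].nunique()

print(star_wars_pct)
print(sets_per_year)
```

Taking the mean of a boolean column gives the fraction of `True` values, which avoids counting the two groups separately.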

Comments: 120
@KeithGalli 2 years ago
Level up your data science skills with courses, projects, and competitions offered by DataCamp! Use my link below and check out the first chapter of any course for FREE! :) bit.ly/KeithGalliDCFeb22
@masternobody1896 2 years ago
can you do some google job coding. so how can i get a job
@KeithGalli 2 years ago
Big shout-out to my mom for not throwing away my Legos! She's the real MVP
@bobbyg603 2 years ago
Thanks mom!
@vishwasjajpura796 2 years ago
Finally Keith will build his LEGO
@ocraking 3 months ago
nice Kevin Durant reference
@KenJee_ds 2 years ago
dude, loved the intro!
@KeithGalli 2 years ago
Hahaha thanks man :). Very happy that my mom didn't throw out all of my legos!
@DataProfessor 2 years ago
Wow the Lego stop motion was awesome!
@ahsanshah1866 2 years ago
Data professor is here 😀
@JW-pu1uk 2 years ago
I really like the thought process in these videos. It's very raw, and really will translate well to an actual work project.
@lVaNeSsA90 2 years ago
Thanks for being honest while you search for syntax in the beginning. Love this raw, step by step video. I'm using your videos on my project to get inspired ❤️ thanks for being a good tutor 😊
@alan6506305 2 years ago
God, this is brilliant. I watched the other two videos of yours on Pandas. You are a great teacher and friend. Thank you very much for your hard work and kindness.
@markomarjanovic8348 1 year ago
Absolutely love the raw natural style you are doing, hope everyone else appreciates it too, keep going buddy, you are amazing!
@rafaelmello8194 2 years ago
I'm a beginner in Python and I'm learning a lot from you. You are an awesome teacher. Your pacing and didactics are perfect. Thanks a lot for your effort.
@simonvanwijk5178 2 years ago
Man so good to have you back! If it was not for you I would have not gotten a role as a DA as you helped me the most in the beginning.
@H99x2 2 years ago
These type of videos are your strengths! Great tutorial and explanation Keith
@thebeeskhakis7145 2 years ago
I'm so happy you're back. Your videos helped me get my new job!
@qalinlekhaliif5518 2 years ago
Thanks a lot man. Your videos are helpful and entertaining as well. We appreciate your great work.
@itsReshad 2 years ago
Love the great content! Please don't stop! You have an impeccable way of teaching, it's amazing.
@logannon 2 years ago
Dude, I thought you were dead. Your videos have helped me so much. Glad to see you back!
@danielsantoyo2640 2 years ago
I'm so happy to see you are back! Pandas and NumPy tutorials would be great!!! I'm currently trying to learn Pandas and NumPy for data analytics and this video was super interesting!!! Thanks Keith, keep going, you are doing great 💯
@Sensei10238 2 years ago
Finally back! It helped me a lot in learning python! Thank you so much!
@FIBONACCIVEGA 1 year ago
This video has been a true inspiration to continue learning. I'm doing the DataCamp courses since I want to change my field and I've always liked programming and analyzing data. But I didn't know if I could use what I had learned in real life. Now I know that everything I have learned is what is used in real-life data analysis. Regards
@ben-tiki 2 years ago
Another great video Keith! Glad to see you back. Awesome that you got to work with DataCamp. If you can, please make a video on OpenAI, it would be awesome. I've been using their API and it's awesome.
@YunusFidan_ 2 years ago
Good to see you uploading again!!
@amansorout.6779 2 years ago
Happy to see you back, fighting with something serious, you are not alone.
@PaYaMv2 2 years ago
Good to have you back my dude! Loooooooved this!
@Omzodijacky 2 years ago
Man , I'm happy you are back ! you were truly missed
@rksingh1997mp 2 years ago
He’s back baby!!
@cyrilodoi6868 2 years ago
So good to have you back man! 💯
@stratascratch 2 years ago
Good to see you’re back!
@weitingteng3241 2 years ago
Great great and great to see you back
@MashiroRedo 2 years ago
Waited so long! Thank you
@kartikeyasharma9908 1 year ago
Hi Keith, loving the video tutorials!
@lucaspioli7970 2 years ago
Love your videos! Keep going
@azrmuradl6420 2 years ago
Please provide more such kind of videos, or as you always do, give us tips about how we can find such kind of real world ds projects online.
@terrytas13 2 years ago
Love the introduction!!!
@patriciosebastiankellyfuen9547 1 year ago
Props for sharing your knowledge man, it's really easy to understand and apply what you're doing (Y)
@manfungnewmanyu1426 2 years ago
Yeah!!! Your tutorial is very great and help me so much at the AI master course .
@tuandino6990 2 years ago
I've been waiting for this
@ocraking 3 months ago
Dude, you ROCK
@leomiao5959 2 years ago
The man is back. The hero is back for us!!
@jongcheulkim7284 2 years ago
Thank you, sir. I had lots of fun^^
@codewithkarthik7136 2 years ago
nice video keith
@user-jl8vr4ff1e 2 years ago
keep up the good work!
@dharshankumar2522 2 years ago
Keith is back...yeahhhh
@Magmatic91 2 years ago
Did this project on DataCamp. Was a lot of fun.
@freddy4videos 1 year ago
thank you, much love
@kotharidhruv75 2 years ago
w8ing fr more such videos
@terrytas13 2 years ago
Welcome back Keith, so good to see your face again. Stay well my friend!
@KeithGalli 2 years ago
Glad to be back!! :)
@sanjeetlal1873 2 years ago
Legend's back❤️
@Viralvlogvideos 2 years ago
welcome back to your first tutorial after long back :P
@davida99 2 years ago
Yoooo love the vids
@kirubaselvi6754 2 years ago
Keith, Pytorch tutorial please
@KeithGalli 2 years ago
I definitely want to! I need to spend considerable time reviewing and building up my own PyTorch skills before I make a tutorial on it.
@ChileHeroico 2 years ago
keep doing more videos pls :D
@baggid6257 2 years ago
He is back~!
@putyah 2 years ago
Awesome video. Small detail: for the new era answer you typed the value in by hand. It would be nicer to drop every row where the theme is Star Wars, then select the remaining year as a variable. That way the variable is dynamic, so the answer would still be correct when the dataset changes.
@KeithGalli 2 years ago
Good suggestion! I agree that would be a better way to go about it :)
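One way the suggestion above could look, as a rough sketch on made-up data rather than the code from the video. The frame `licensed`, the column `n_sets`, and the sample rows are assumptions for illustration; `new_era` is the variable name mentioned in the comments:

```python
import pandas as pd

# Hypothetical licensed sets: one row per set.
licensed = pd.DataFrame({
    "year": [1999, 1999, 2000, 2000, 2000, 2001, 2001, 2001],
    "parent_theme": ["Star Wars", "Star Wars",
                     "Star Wars", "Star Wars", "Harry Potter",
                     "Harry Potter", "Harry Potter", "Star Wars"],
})

# Count sets per (year, theme), then keep the top theme of each year.
counts = licensed.groupby(["year", "parent_theme"]).size().reset_index(name="n_sets")
top_per_year = counts.sort_values("n_sets", ascending=False).drop_duplicates("year")

# Drop the years Star Wars won; the remaining year is read from the data
# instead of being typed in by hand.
not_star_wars = top_per_year[top_per_year["parent_theme"] != "Star Wars"]
new_era = int(not_star_wars["year"].iloc[0])

print(new_era)
```

If two themes tie for first place in a year, which one survives `drop_duplicates` is arbitrary, so real data may need an explicit tie-break.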
@aditiparashar9171 1 year ago
you are freakingly smart!
@rafaelcastellarmartinez3498 2 years ago
Hi Keith, I just tried to do the project with you and I got that Star Wars was not the most popular theme in 2004 (Harry Potter) and 2017 (Super Heroes). Weird that the DataCamp test said OK, but I did the math manually and Harry Potter was the most popular in 2004. Thanks for your videos. A student from Colombia, Latin America!
@adelekeemmanuel4917 11 months ago
OMG... I just did the exercise myself and I discovered the same thing too... Came to check the video but I'm seeing something else.
@Levy957 2 years ago
that task #2 was really hard to do alone
@raghavgoyal3324 2 years ago
please upload a project every week
@KeithGalli 2 years ago
I'll try my best!
@gersonchadijunior7499 1 year ago
Hey Keith, I love so much your videos. I've been learning Pandas with you since your pokemon's video, but I feel that the last answer is not accurate and in fact the right year should be 2006, because it was the year with less Star Wars Sets released. Can I send you my code somehow?
@nitiknayyar7659 2 years ago
Damn I also started this project on Datacamp.
@admonitoring-pi9os 3 months ago
Hello there. I hope you are good. I am a little late with this comment because this video is already more than 2 years old, but since I have started learning Python now, it's the right time for me. Where can I find the code you explained in the video? No code is available in the project file at the GitHub link provided.
@shahrose786 2 years ago
question: when you merge when using left_on and right_on ...we get the merged df. So for the merged df and under parent_theme why are most if not all of those are "Legoland" and all IDs are 411? also how do we check the full tabular data -- print(df)?
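A small sketch of what the question above describes, on made-up data. The frame names, the rows, and the Star Wars id are placeholders; only `left_on`/`right_on`, `parent_theme`, `name`, `is_licensed`, and the Legoland id 411 come from the discussion here:

```python
import pandas as pd

# Hypothetical stand-ins for the sets and parent-theme tables.
sets_df = pd.DataFrame({
    "set_num": ["A-1", "A-2", "A-3"],
    "parent_theme": ["Legoland", "Star Wars", "Legoland"],
})
parent_themes_df = pd.DataFrame({
    "id": [411, 158],  # 158 is a made-up placeholder id
    "name": ["Legoland", "Star Wars"],
    "is_licensed": [False, True],
})

# Join each set to its theme row; the key columns have different names.
merged = sets_df.merge(parent_themes_df, left_on="parent_theme", right_on="name")

# head() shows only the first few rows, so check the spread of values instead.
theme_counts = merged["parent_theme"].value_counts()

# Render the whole table as text without pandas truncating rows or columns.
full_table = merged.to_string()
print(theme_counts)
print(full_table)
```

A plain `print(df)` abbreviates large frames, which can make it look as if one theme fills the table; `value_counts()` is usually a quicker check than reading every row.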
@soldierbirb 2 years ago
Hey Keith, I'm divided between going towards data science or cyber security. I love both but I kind of need to make money by now. Do you think I can earn money in a short time in data science? Working as a freelancer or supporting small companies... Edit: I'm glad that you came back. Really love your videos
@adeshmishra1671 2 years ago
Go for Cybersecurity brother, Since difficulty level is medium.. But while earning 💰 you can also learn data scientist!!
@rodrigodasilva9176 2 years ago
This dude is cool, this channel too.
@tuandino6990 2 years ago
Question 2:
    theme_count_by_year = licensed_lego_set.groupby('year')['parent_theme'].value_counts().unstack()
    theme_count_by_year.fillna(0, inplace=True)
    theme_count_by_year = pd.DataFrame.transpose(theme_count_by_year)
Or you can use the pivot_table function. Approaching it this way gives you a DataFrame that is easy to plot (for example as a heatmap), which makes the high numbers pop out.
@tuandino6990 2 years ago
@Josh Yorko nice
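The `pivot_table` route mentioned in this thread might look like the sketch below. The data is made up, and the `set_num` column used for counting is an assumption; `licensed_lego_set` and `theme_count_by_year` are the names from the comment:

```python
import pandas as pd

# Hypothetical licensed sets: one row per set.
licensed_lego_set = pd.DataFrame({
    "set_num": ["A-1", "A-2", "A-3", "A-4"],
    "year": [1999, 2001, 2001, 2001],
    "parent_theme": ["Star Wars", "Harry Potter", "Harry Potter", "Star Wars"],
})

# Rows are themes, columns are years, cells are set counts (0 where none).
theme_count_by_year = licensed_lego_set.pivot_table(
    index="parent_theme",
    columns="year",
    values="set_num",
    aggfunc="count",
    fill_value=0,
)

print(theme_count_by_year)
```

This builds the themes-by-years grid in one call, without the separate `fillna` and transpose steps.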
@manu93ize 2 years ago
bro Can you do a tutorial on data cleaning with Pyspark with real world example.
@sabbirahmed8012 2 years ago
Hello Keith, can you please mention some resource to master natural language processing?
@KeithGalli 2 years ago
Hey! I actually did a PyCon lecture on NLP. That should be pretty helpful: kzbin.info/www/bejne/rKqymIqerLqgm8U
@baburamchaudhary159 1 year ago
In line [99], i.e. .groupby(['year', 'parent_theme']), and in the next line, .drop_duplicates(['year']): since we have already grouped by 'year' and 'parent_theme' (I think it groups unique year and parent_theme), why do we need to drop duplicates by 'year'?
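A possible way to see why the extra step matters, sketched on made-up rows (the frame `licensed` and the column `n_sets` are illustrative names, not the ones from the video). Grouping by both columns yields one row per (year, theme) pair, so a year with several themes still appears several times:

```python
import pandas as pd

# Hypothetical licensed sets: one row per set.
licensed = pd.DataFrame({
    "year": [2000, 2000, 2000, 2001],
    "parent_theme": ["Star Wars", "Star Wars", "Harry Potter", "Star Wars"],
})

# One row per (year, theme) pair: 2000 appears twice, once for each theme.
counts = licensed.groupby(["year", "parent_theme"]).size().reset_index(name="n_sets")
rows_for_2000 = int((counts["year"] == 2000).sum())

# After sorting by count, keeping the first row per year leaves the top theme.
top = counts.sort_values("n_sets", ascending=False).drop_duplicates("year")

print(counts)
print(top)
```

So the pairs are unique after the groupby, but the years are not; `drop_duplicates("year")` is what reduces the table to a single winning theme per year.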
@ElianMrl 2 years ago
Hey guys, would it be a good idea to use Datacamp projects in my resume?
@gopikaprasad8607 1 year ago
How to export the for loops result into excel?? Please reply
@user-ty4jy4cp3r 1 year ago
why didn't you use .agg?
@ratchakoon 2 years ago
The themes.csv you provided on GitHub does not have an 'is_licensed' field. Is the 'parent_id' field the same as the 'is_licensed' field?
@KeithGalli 2 years ago
A little confusing, but you want to use parent_themes.csv, not themes.csv !!
@ratchakoon 2 years ago
@@KeithGalli Thank you
@alkiviadessavoullis2021 2 years ago
does anyone know why when I press continue or start project the Python Use python ... code checks gets highlighted pink and I can't work on the project ?
@merterisen 2 years ago
16:52 how did you change 'Star wars' text immediately?
@KeithGalli 2 years ago
Lol that was just video editing xD.
@letsjoinhands 2 years ago
Hello again Keith. For Q#2 I am getting a different result for new_era using the code below. lego_all_lic is the DF containing all licensed Lego set themes, with shape (1179 x 8), and it has been grouped by year to form lego_all_lic_yr. The rest of the code is quite simple to understand. It looks as if I have made a big mistake in the aggregation but I can't seem to locate it.
    lego_all_lic_yr = pd.DataFrame(lego_all_lic.groupby(by = ['year', 'parent_theme'], axis = 0).agg(Parent_Theme = ('set_num', 'count')))
    lego_all_lic_yr.reset_index( inplace = True)
    lego_all_lic_yr.replace(to_replace = [theme for theme in lego_all_lic_yr['parent_theme'] if theme != 'Star Wars'], value = 'Others', inplace = True)
    lego_all_lic_yr = pd.DataFrame(lego_all_lic_yr.groupby(by = ['year', 'parent_theme'], axis = 0).agg(Parent_Theme = ('Parent_Theme', 'sum')))
    lego_all_lic_yr
When you look at the result, it shows that 2006 was the first year in which Star Wars lost to other themes in terms of the sets released in that year.
@letsjoinhands 2 years ago
OK, so I basically misunderstood the Q. It wasn't about Star Wars themed sets vs. all the rest; rather, it was about the year in which Star Wars lost out to some other individual theme. Got the correct answer using:
    lego_all_lic_yr = pd.DataFrame(lego_all_lic.groupby(by = ['year', 'parent_theme'], axis = 0).agg(Parent_Theme = ('set_num', 'count')))
    lego_all_lic_yr.reset_index( inplace = True)
    lego_all_lic_yr = pd.DataFrame(lego_all_lic_yr.groupby(by = ['year', 'parent_theme'], axis = 0).agg(Parent_Theme = ('Parent_Theme', 'sum')))
    lego_all_lic_yr = lego_all_lic_yr.sort_values(by = ['year','Parent_Theme'], ascending = False)
    lego_all_lic_yr.head(50)
@damarbowo 2 years ago
Can I see your membership playlist? I can't find that playlist
@KeithGalli 2 years ago
Hmm I'm not sure what you are asking to see, can you clarify?
@damarbowo 2 years ago
@@KeithGalli you have a membership benefits. One of the benefit is got playlist or videos for member. Do you have an example the video or playlist for member join your channel? Hope you understand
@KeithGalli 2 years ago
I just started my memberships last week so I haven't posted any exclusive videos there yet. To get an idea of the types of content I'll post there, check out these videos kzbin.info/www/bejne/p5-2d2uPlrWrbZo kzbin.info/www/bejne/hZyooIN_hNypnsk
@damarbowo 2 years ago
@@KeithGalli I'll wait Keith. Regards
@KeithGalli 2 years ago
Sounds good!
@mufasao6776 2 years ago
I see that you posted some of your hidden videos. Thank you.
@RED_S0N 9 months ago
keith moment
@zeasammy7572 2 years ago
Does DataCamp have video learning platform?
@KeithGalli 2 years ago
The typical structure of classes is short videos that overview the concepts and then a bunch of interactive problems with a code editor to drill down the technical side of those concepts.
@clayherz_ 1 year ago
If I solve the second question with this code:
    counted_2 = licensed_sets.groupby(["year", "parent_theme"])[["is_licensed"]].count()
    counted_2 = counted_2.reset_index().sort_values("is_licensed", ascending=False)
    counted_2.drop_duplicates("year").sort_values("year", ascending=True)
is it wrong?
@rabinmainali3373 1 year ago
I did it the following way (question 2): 1. I counted every licensed set released each year. 2. Then I counted only the Star Wars sets released each year. 3. Then I calculated the proportion of step 2 to step 1. Is that okay? By the way, the result is also 2017 for me.
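The proportion approach described above can be sketched as follows, on made-up rows with illustrative names. Note the caveat raised elsewhere in these comments: the year with the lowest Star Wars share is not necessarily the year another single theme beat it, since Star Wars can fall below half of all licensed sets and still be the largest individual theme:

```python
import pandas as pd

# Hypothetical licensed sets: one row per set.
licensed = pd.DataFrame({
    "year": [2000, 2000, 2000, 2001, 2001, 2001],
    "parent_theme": ["Star Wars", "Star Wars", "Harry Potter",
                     "Super Heroes", "Super Heroes", "Star Wars"],
})

# Steps 1 to 3 in one go: the mean of a boolean flag per year is the
# Star Wars count divided by the count of all licensed sets that year.
is_star_wars = licensed["parent_theme"] == "Star Wars"
star_wars_share = is_star_wars.groupby(licensed["year"]).mean()

lowest_share_year = int(star_wars_share.idxmin())

print(star_wars_share)
print(lowest_share_year)
```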
@letsjoinhands 2 years ago
Hi Keith! This is how I solved Q#1. Please let me know whether, in your opinion, this is bad coding practice, acceptable, or good. I first made a function called is_lic:
    def is_lic(df_1, df_2):
        df_1['is_licensed'] = bool
        theme_1 = list(df_1['parent_theme'])
        theme_2 = list(df_2['name'])
        lic_status = list(df_2['is_licensed'])
        for i, s in enumerate(theme_1):
            for r, t in enumerate(theme_2):
                if s == t:
                    df_1['is_licensed'][i] = lic_status[r]
Then:
    is_lic(lego_sets, lego_themes)
Then:
    all_themes = [ ]
    for r in lego_sets.itertuples():
        all_themes.append([ r[6], r[1], r[7] ])
Then:
    all_lic_themes = [x for [x, y, z] in all_themes if y is not np.NaN and z == True]
    star_wars = [theme for theme in all_lic_themes if theme == 'Star Wars']
    the_force = int(len(star_wars)/len(all_lic_themes) * 100)
the_force = 51%
@KeithGalli 2 years ago
So my biggest recommendation based on your code is to be more explicit with how you name your variables. So instead of "df_1" & "df_2" you might name those dataframes "parent_themes_df" & "lego_sets_df" respectively. Furthermore it would be better to name variables "i" & "s" something like "parent_theme_index" & "parent_theme_value". These types of changes will make your code more readable. Functionally, everything looks sound though. Nice work!
@letsjoinhands 2 years ago
@@KeithGalli thanks a bunch Keith. And now in retrospect, when I think about how you were solving this Q in the video, I realised that you were using pandas built-in methods the whole time. So yes, we could use a smattering of plain Python to do this (like I did), but using the library's built-in methods would be simpler and more advantageous most of the time. Is that correct?
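To illustrate the point in this thread, here is one possible pandas-only version of the loop-based Q#1 solution. It is a sketch on made-up rows, not the code from the video; the frame names follow the naming advice above, `license_by_theme` is a hypothetical helper name, and `the_force` is the variable from the original comment:

```python
import pandas as pd

# Hypothetical stand-ins for the sets and parent-theme tables.
lego_sets_df = pd.DataFrame({
    "set_num": ["A-1", "A-2", "A-3", "A-4"],
    "parent_theme": ["Star Wars", "Town", "Star Wars", "Harry Potter"],
})
parent_themes_df = pd.DataFrame({
    "name": ["Star Wars", "Town", "Harry Potter"],
    "is_licensed": [True, False, True],
})

# Look up each set's license flag by theme name; this replaces the nested loops.
license_by_theme = parent_themes_df.set_index("name")["is_licensed"]
lego_sets_df["is_licensed"] = lego_sets_df["parent_theme"].map(license_by_theme)

# Keep licensed sets and compute the Star Wars percentage, truncated to an int.
licensed = lego_sets_df[lego_sets_df["is_licensed"]]
the_force = int((licensed["parent_theme"] == "Star Wars").mean() * 100)

print(the_force)
```

Besides being shorter, the lookup avoids the chained assignment `df_1['is_licensed'][i] = ...`, which pandas warns about and which may not write back to the original frame.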
@igor-xadrezxadrez8541 2 years ago
Hey, there's a red dot on your nose.
@KeithGalli 2 years ago
I got in a fight playing hockey!
@Viralvlogvideos 2 years ago
Big nose :P
@ihateorangecat 2 years ago
you got injured on your nose???
@KeithGalli 2 years ago
I got into a little ice hockey fight!
@AbhishekSharma-hy4nl 2 years ago
Bro what happened to your nose😟?
@KeithGalli 2 years ago
Got into a little fight playing ice hockey! We won the game though so it's cool xD