Apache Spark / PySpark Tutorial: Basics In 15 Mins

  154,965 views

Greg Hogg

1 day ago

Comments: 158
@mongooon6931 2 years ago
We use spark for our data pipeline at work -- we have tables with 10+ billion records, and our applications end up moving trillions upon trillions of records of data per month. Unfathomable numbers that spark is capable of. Great video!
@GregHogg 2 years ago
Yeah, it's insane! Thanks so much.
@EclipsyChannel 2 years ago
that's the power of distributed systems and parallel computing... computer science is beautiful
@Nedwin 3 years ago
I'm a freelance data scientist and I'm really thankful to have found this video, Greg. Can't expect more! Thank you so much. Good luck with everything. 🙏
@GregHogg 3 years ago
That's awesome best of luck in that! And you're very welcome it's my pleasure 😊
@AnVinhNguyen 3 years ago
Your explanation is clear and the examples are practical and useful for beginners. Thanks a lot and keep it up!
@GregHogg 3 years ago
I really appreciate this. You're very welcome 😃
@joshuabradshaw1647 1 year ago
Thank you for sharing to the world. I'm currently a supply chain analyst and aspiring supply chain data scientist 🙏
@GregHogg 1 year ago
That's excellent to hear and very exciting Joshua! I wish you the best of luck 🥰
@ashleyb5849 3 years ago
Awesome video. I love using spark at work
@andersborum9267 1 year ago
I'm just getting into DataBricks and PySpark and this introductory tutorial was a great starter.
@GregHogg 1 year ago
Awesome! Hope that goes well :)
@mtamjidhossain 2 years ago
You are awesome. Just delivering the right videos. Subscribed a few days back already but hit notifications on for you rn. Cause I wanna watch all your videos
@GregHogg 2 years ago
Well that's really great to hear! Thanks so much Tamzid!
@yashmodi6762 3 years ago
Which big data tools should beginners learn, and where can they learn them? (Please provide some resources.)
@GregHogg 3 years ago
Of course I'd recommend my channel - SQL and Spark are the most important ones in my opinion :)
@260056 3 years ago
@greg, please share the link to the 1-hour video. I am unable to find it.
@dominicaleung7329 1 year ago
Greg, thank you so much. I am new to PySpark, and your video explains things very well. You used simple examples, so I was able to follow along and try them out in my own Python notebook. I will watch your DataFrame basics video next.
@GregHogg 1 year ago
Amazing! Sorry for the late reply
@ericcarmichael3322 2 years ago
Thanks for sharing, appreciate the quick run down on this stuff
@GregHogg 2 years ago
Glad to hear it!
@victorroy525 1 year ago
Just the type of samples we need to begin with. Meaningful content. thnx.
@GregHogg 1 year ago
Glad you enjoyed it!
@SpecialGreg66 3 months ago
Hi Greg! Great video, do you have one that explains how you convert spark to dfs and vice versa? We pull millions of rows from csvs and looking to do transformations before dropping into a db. Also, how does the distributed computing work on a singular computer? Just distributes it across the cpu cores?
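A minimal sketch for the question above, not taken from the video. It assumes PySpark and pandas are installed; the helper names (`local_master`, `transform_csv_to_pandas`, `pandas_to_spark`) and the `dropna` step are hypothetical placeholders. On a single computer, the `local[*]` master runs Spark inside one process with as many worker threads as the machine has logical cores, so the work is spread across CPU cores rather than across machines.

```python
def local_master(threads=None):
    """Build a Spark master URL for single-machine mode.

    "local[*]" asks for one worker thread per logical core;
    "local[4]" asks for exactly four threads.
    """
    return "local[*]" if threads is None else f"local[{threads}]"


def transform_csv_to_pandas(csv_path, threads=None):
    """Read a CSV with Spark, transform it, and return a pandas DataFrame."""
    # Imported inside the function so this module loads without PySpark.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master(local_master(threads))
        .appName("csv-transform")  # hypothetical app name
        .getOrCreate()
    )
    sdf = spark.read.csv(csv_path, header=True, inferSchema=True)
    sdf = sdf.dropna()  # placeholder for the real transformations
    # toPandas() collects every row onto the driver, so the result must fit in memory.
    return sdf.toPandas()


def pandas_to_spark(spark, pdf):
    """Go the other way: turn a pandas DataFrame into a Spark DataFrame."""
    return spark.createDataFrame(pdf)
```

For millions of rows headed into a database, it may be preferable to keep the data in Spark and write it out from there instead of collecting it into pandas first.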
@EclipsyChannel 2 years ago
you are a great teacher... keep doing what you do my man
@BHANUCHAUDHARY-eb4ul 1 year ago
Thanks Greg for the wonderful explanation !!
@parvathirajan.n 2 years ago
No words man! Simply loved it. Appreciate your efforts.
@GregHogg 2 years ago
Really glad to hear that! Thank you 😊
@noorhake9087 2 years ago
Hi, I'd like to ask you a question. I'm working on a project about how linear regression selects features with Apache Spark. When I try to execute the PySpark code, it gives an error saying that pyspark is not defined. I tried to fix it in many ways, but nothing solved the problem 💔
@hypebeastuchiha9229 3 years ago
Was your degree in Computer Science or a Data Science course? I'm in my third year of a Computer Science BSc and I feel like I'm at a disadvantage for Data Science. We didn't learn statistics or have a lot of math modules. Most Data Science jobs require a Masters or PhD, but I don't want to get a Masters straight after uni, so I'm looking at Data Engineering since they accept BScs. Is that a realistic path into Data Science or am I wasting my time?
@GregHogg 3 years ago
I'm a statistics major. I don't think you're at a disadvantage, people very widely respect computer science majors. If anything I'd feel I'm at a disadvantage lol. But agreed, you get less stats courses. I would think some certificates and projects would be enough without needing a masters, unless you're aiming for FAANG or the other top jobs
@GregHogg 3 years ago
This video may help; kzbin.info/www/bejne/ZmmqXqhvfbNrgcU
@antarcticadventure 3 years ago
Never used Spark before. Thank you.
@GregHogg 3 years ago
Me too for the longest time; PySpark is a life changer though!
@lakshaydulani 1 year ago
now thats what i was looking for
@tarunodaysarma9741 3 years ago
Greg, I had a question on PySpark: how do I find the latest Parquet files stored in an HDFS path using PySpark code?
@GregHogg 3 years ago
Sorry I don't know! 🤔
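One possible approach to the question above, offered as a sketch rather than an answer from the video: list the directory through the Hadoop `FileSystem` API and sort by modification time. It assumes a running `SparkSession` passed in as `spark`, and it reaches the JVM through `sc._jvm` and `sc._jsc`, which are private PySpark attributes and could change between versions. The function names are hypothetical.

```python
def newest_first(files):
    """Sort (path, modification_time_ms) pairs so the most recent comes first."""
    return sorted(files, key=lambda f: f[1], reverse=True)


def list_parquet_files(spark, directory):
    """Return (path, modification_time_ms) for *.parquet files directly under `directory`."""
    sc = spark.sparkContext
    jvm = sc._jvm  # private attribute: the py4j gateway into the JVM
    path = jvm.org.apache.hadoop.fs.Path(directory)
    # Resolve the file system (HDFS, local, ...) from the path and Hadoop config.
    fs = path.getFileSystem(sc._jsc.hadoopConfiguration())
    return [
        (status.getPath().toString(), status.getModificationTime())
        for status in fs.listStatus(path)
        if status.getPath().getName().endswith(".parquet")
    ]


def latest_parquet_file(spark, directory):
    """Return the path of the most recently modified Parquet file, or None."""
    files = newest_first(list_parquet_files(spark, directory))
    return files[0][0] if files else None
```

The returned path could then be passed to `spark.read.parquet(...)`. This only looks at one directory level; nested partitions would need a recursive listing.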
@ganeshkaushik2290 3 years ago
Hi Bro, could you please make a video on the learning process for big data, and on which job roles need which big data skills? I'm really confused about where to start and what to learn! I know Python and SQL, I've learned some basics of HDFS, Hive, and Sqoop, and now I'm trying to learn PySpark.
@GregHogg 3 years ago
Thanks for the feedback, I'll keep this in mind!
@hsoley 3 years ago
You are awesome, thanks for sharing your knowledge with the world
@GregHogg 3 years ago
I really appreciate that Hamid!!!
@andrewhancock2451 3 months ago
This is an awesome video. I wonder, however, whether you could explain why the end result shows numbers with 12 characters. Didn't your set of numbers only go up to a million, which has 6 digits? You also referred to your hour-long PySpark course. Would you be able to link to it in the show notes, please? Thanks!
@yizharalmagor8539 1 month ago
At a certain point he squared all the numbers in the RDD and then kept using the squares from then on.
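The arithmetic behind that reply can be checked in plain Python. This sketch assumes the largest input was just under a million (999,999); the exact range used in the video is not shown here, and `digits_after_squaring` is a hypothetical helper name.

```python
def digits_after_squaring(n):
    """Return how many decimal digits n squared has."""
    return len(str(n ** 2))


# Squaring roughly doubles the number of digits, so a 6-digit input
# produces a result with about 12 digits.
largest_six_digit = 999_999
print(largest_six_digit ** 2, digits_after_squaring(largest_six_digit))
```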
@Vlapstone 1 year ago
The sc command is not working in my Colab the way it works in this video... can anyone help?
@aeigreen 2 years ago
Explained so well. 5 stars. Love to see more videos..
@GregHogg 2 years ago
Really glad to hear it thanks so much!
@samusaran1692 3 years ago
Very good examples. Thanks man :)
@GregHogg 3 years ago
Glad it helped!
@keerthanamurugesan-xe6mr 6 months ago
It looks like using NumPy or pandas. What is the difference between those and PySpark?
@GregHogg 6 months ago
It looks very similar to us coders, which is great. But pandas and numpy are mainly for dealing with data on the computer you're using. Spark allows us to distribute our workloads across a cluster of machines
@keerthanamurugesan-xe6mr 6 months ago
@GregHogg Thank you
@mehmetkaya4330 2 years ago
Concise and very well explained! Thank you so much!!
@GregHogg 2 years ago
Thank you and you're very welcome!
@aparfeno 2 years ago
Thank you for great video and for useful education links!
@GregHogg 2 years ago
You're super welcome 😃
@MrChilo89 2 years ago
Hello and thanks for this video. I've been trying to follow your way of computing the average, but I receive an error:
avg = nyt.map(lambda x: (x.title, int(x.rank[0])))
grouped = avg.groupByKey()
grouped = grouped.map(lambda x: (x[0], list(x[1])))
averaged = grouped.map(lambda x: (x[0], sum(x[1]) / len(x[1])))
averaged.collect()
'TypeError: Invalid argument, not a string or column: [1, 3, 7, 8, 12, 14, 20] of type . For column literals, use 'lit', 'array', 'struct' or 'create_map' function.'
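One possible cause of the error above, which the comment itself does not confirm: the message resembles what `pyspark.sql.functions.sum` raises when it is handed a Python list, and that can happen if `from pyspark.sql.functions import *` was run earlier and replaced Python's built-in `sum`. A minimal sketch under that assumption, calling `builtins.sum` explicitly so the shadowing no longer matters (`mean` and `average_by_key` are hypothetical names):

```python
import builtins


def mean(values):
    """Average an iterable of numbers using Python's built-in sum."""
    values = list(values)
    # builtins.sum is immune to a star-import that rebinds the name `sum`.
    return builtins.sum(values) / len(values)


def average_by_key(pairs_rdd):
    """Given an RDD of (key, number) pairs, return an RDD of (key, mean)."""
    # groupByKey yields (key, iterable of values); mapValues averages each group.
    return pairs_rdd.groupByKey().mapValues(mean)
```

With the names from the comment, this would be used as `average_by_key(avg).collect()`. Importing the SQL functions under an alias, for example `from pyspark.sql import functions as F`, avoids the name clash in the first place.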
@manpritsingh3972 2 years ago
This video is really helpful. Thanks a lot Gregg.
@GregHogg 2 years ago
You're super welcome!
@geethsn1866 2 years ago
Thanks for the tutorial. It was simple and easy to follow. However, when I tried the code in Colab, just typing "sc" does not invoke Spark. Are there any prerequisites to install in Colab before using "sc"?
@GregHogg 2 years ago
Please check out my notebook. You'll need to pip install PySpark, and write a line or two of code to set it up
@geethsn1866 2 years ago
@GregHogg Thank you Greg.
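A minimal sketch of the kind of setup described in the reply above, assuming `pip install pyspark` has already been run in the notebook (in Colab, `!pip install pyspark`). The exact lines in Greg's notebook may differ; `get_spark_context`, `square`, `demo`, and the app name are hypothetical. Note that `parallelize` is a method of the `SparkContext` object (`sc`), not of the `pyspark` module.

```python
def square(x):
    """Example function to apply to each element of an RDD."""
    return x * x


def get_spark_context(app_name="pyspark-basics"):
    """Create (or reuse) a local SparkContext, the object the video calls `sc`."""
    # Imported inside the function so this module loads without PySpark.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster("local[*]").setAppName(app_name)
    return SparkContext.getOrCreate(conf)


def demo():
    """Square a few numbers through an RDD, as a smoke test for the setup."""
    sc = get_spark_context()
    rdd = sc.parallelize(range(1, 6))
    return rdd.map(square).collect()


# In a notebook with PySpark installed, run: demo()
```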
@boudhayanism 1 year ago
Cool video, thanks for making it
@GuilhermeMendesG 2 years ago
What an amazing content you're putting here man... thanks for everything!
@GregHogg 2 years ago
Thanks so much for the kind words. You're very welcome 🤠
@r3d_robot594 2 years ago
Good PySpark Primer! Others are either too lengthy or short and vague.
@GregHogg 2 years ago
Thanks so much I'm really glad to hear that! :)
@krishj8011 2 years ago
very fine details covered. really useful and easy to understand the spark concepts.
@GregHogg 2 years ago
Really glad to hear that.
@demohub 2 years ago
Great overview. Thanks
@hyeonjukwon3638 6 months ago
Very useful and interesting! Subscribed :)
@GregHogg 6 months ago
Glad to hear it, thanks a ton!
@maxitube30 2 years ago
And what are the machines on which we parallelize the work? Do they have to be configured? I mean, if PySpark or Spark parallelizes on a cluster, do we have to configure the cluster too?
@GregHogg 2 years ago
Someone has to configure it. Probably won't be your job though. You'll just select it, kinda like a Python virtual environment, and act as if it's the same as in this video because nothing changes from the programming point of view :)
@maxitube30 2 years ago
@GregHogg Understood. Thanks :)
@rohanjoseph1531 3 years ago
Hi @Greg Hogg, I can't seem to access the "sc" object on Google Colab. Which library lets you use that object?
@GregHogg 3 years ago
github.com/gahogg/KZbin/blob/master/PySpark%20In%2015%20Minutes.ipynb
@rohanjoseph1531 3 years ago
@GregHogg cheers!
@capper3360 2 years ago
Can you share the link to the hour long tutorial you mentioned at the end, couldn't find it in your spark playlist.
@GregHogg 2 years ago
Here you go: kzbin.info/www/bejne/bqrTeoWma6mDm9k
@JaylaScousa 3 years ago
Concise and well presented 👍
@GregHogg 3 years ago
Very glad you found it useful, James!!
@nataliaresende1121 2 years ago
Hi Greg, how can I convert .csv files into .txt files (with comma as delimiter) using pyspark? Do you have a code snippet?
@GregHogg 2 years ago
I think you can just change the extension from CSV to txt
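The suggestion above can be sketched in plain Python, since a comma-delimited .txt file has the same contents as the .csv. This sketch copies the file instead of renaming it so the original is kept; `copy_csv_as_txt` is a hypothetical helper name and nothing here uses Spark.

```python
import shutil
from pathlib import Path


def copy_csv_as_txt(csv_path):
    """Copy a CSV file to a sibling file with a .txt extension; contents are unchanged."""
    src = Path(csv_path)
    dst = src.with_suffix(".txt")  # data.csv -> data.txt
    shutil.copyfile(src, dst)
    return dst
```

If the data is already in a Spark DataFrame, `df.write.csv(path)` writes a directory of part files rather than one file, so those part files would be the ones to rename.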
@高智衡 2 years ago
Great, great content! BTW, please give us the link to the hour-long Spark tutorial mentioned at the end. Thanks a lot.
@GregHogg 2 years ago
Thanks! Here you go: kzbin.info/www/bejne/bqrTeoWma6mDm9k
@PranitKothari 9 months ago
Nicely explained.
@charlescoult 2 years ago
Took a minute to get going but well done
@clintp3504 3 years ago
Great stuff! Thanks
@GregHogg 3 years ago
You're very welcome ☺️
@ajanieniola9172 1 year ago
Can you also use apply instead of map?
@GregHogg 1 year ago
Probably
@caiocalo1 2 years ago
such a good tutorial
@abdullahsiddique7787 3 years ago
What is the future of Spark? Is Flink replacing it? Is it worth learning for a career in big data?
@GregHogg 3 years ago
I don't know what flink is.
@abdullahsiddique7787 3 years ago
@GregHogg Thanks for the reply, Greg. Can you please also tell me the career scope of Apache Spark for the future?
@GregHogg 3 years ago
@abdullahsiddique7787 Spark is and will stay essential for data science, ML, analysts and big data for a long time.
@abdullahsiddique7787 3 years ago
@GregHogg Thanks Greg, I appreciate your quick response.
@GregHogg 3 years ago
@abdullahsiddique7787 Of course!
@AlexFosterAI 1 month ago
appreciate this vid. thanks man
@chanta2809 3 years ago
What is the URL to practice? How to setup data for practicing?
@GregHogg 3 years ago
Thank you! You made me notice I accidentally removed the notebook from the video description. You can grab the notebook code in the video description now. You can actually get PySpark in google colab very easily, with simply !pip install pyspark and then import pyspark, then continue following the steps in this video.
@sohamsonone5440 2 years ago
Hi Greg, which one is best among data science, data analytics, machine learning, and AI? Could you please give a suggestion?
@GregHogg 2 years ago
Data science / ML
@nataliaresende1121 2 years ago
very good, thanks!
@GregHogg 2 years ago
You're very welcome Natalia!
@grantholomeu3725 2 years ago
I don't understand why in tutorials like this I often get errors saying, "module x has no attribute 'y.'" In this case, I can't get Python to recognize parallelize.
@GregHogg 2 years ago
Not sure sorry!
@pardonmasuka2 8 months ago
Awesome starter!
@ahmadsaad1888 3 years ago
You mentioned an hour long spark video, I can't find it.
@GregHogg 3 years ago
kzbin.info/www/bejne/bqrTeoWma6mDm9k
@agnelamodia 3 years ago
@GregHogg Could you please paste this link in the description?
@GregHogg 3 years ago
@agnelamodia please see above
@cetilly 3 years ago
Sensational!
@GregHogg 3 years ago
Thank you 😊😊😊
@RossittoS 3 years ago
Great!
@GregHogg 3 years ago
Thank you!
@javidhesenov7611 2 years ago
nice explanation
@GregHogg 2 years ago
Thanks a bunch Javid! :)
@sndselecta 3 years ago
I thought the performance difference between Scala and Python isn't an issue anymore.
@GregHogg 3 years ago
I personally doubt it. I'm not an expert on this one, but I'd be pretty surprised if python wasn't significantly slower than scala. Of course, if we're talking practically- they're both very fast, but in computational time, I would suspect python is much slower. Thanks!
@GregHogg 3 years ago
You are correct, and I am incorrect! Thank you for updating me!
@sndselecta 3 years ago
I think we are both correct. I've been reading up on it, to decide whether to refresh my Scala or keep chugging away with PySpark. Bottom line: it's good to know both; it depends on the use case. In general, Scala will perform better. However, from what I've read, it isn't always about gains in performance alone, or in any one factor: there are pros and cons, and the cumulative gains can weigh either way. For example, Python's rich ecosystem can get you to a result faster than trying to do the same thing in Scala. Another interesting discussion you should start is Koalas. I wrote a blog post to get people to weigh in: forums.databricks.com/questions/65646/thoughts-on-if-its-worth-it-to-work-in-koalas.html
@GregHogg 3 years ago
@sndselecta Sorry I missed this! Absolutely and thank you for the great reply.
@jimbocho660 2 years ago
The Spark people themselves are advising against learning Scala for only marginal gains over pySpark.
@emirhanbilgic2475 1 year ago
thanks mate
@GregHogg 1 year ago
Very welcome!
@e.s298 1 year ago
Good for learning RDDs
@paraklesis2253 3 years ago
Thank you
@GregHogg 3 years ago
You're very welcome!
@Somethingaweful 1 year ago
Back up from the camera, my dude. I feel like you're staring directly at my soul.
@GregHogg 1 year ago
Maybe I am
@Chris-qg6kc 5 months ago
@GregHogg get em bro.
@dw61w 2 years ago
how is this more useful than numpy?
@GregHogg 2 years ago
NumPy works on one computer. Spark works on as many as you want
@dw61w 2 years ago
@GregHogg thanks!
@pinkomoore 4 months ago
Pyspark seems to be pandas on steroids + distributed resources usage
@pauweldalmeidaayivi5310 9 months ago
Great!
@smash4929 8 months ago
Hey Greg, The knowledge in the video is great but the background music is distracting.
@MrPeacefulsoul2610 3 years ago
A detailed video probably would be more helpful.
@vladx3539 2 years ago
great video… but please step away from the camera sir
@GregHogg 2 years ago
Ouch
@vladx3539 2 years ago
@GregHogg just kidding with you! great content
@yelgabs 9 months ago
So you’re just gonna teach us the wrong way of doing things then leave us on a cliff hanger? 😅
@ranvijaymehta 1 year ago
Thanks Sir
@MUSKAN0896 2 years ago
this was an amazing and clear video! thanks so much!
@GregHogg 2 years ago
Very glad to hear that!!