End to End Machine Learning pipeline using Apache Spark - Hands On

  32,351 views

AIEngineering

Comments: 110
@pranavjayakumar2239 · 4 years ago
This is really good stuff. It was high time I transitioned from pandas and scikit-learn to something more industry-relevant.

@RD-yv4cc · 4 years ago
My thoughts exactly, mate. How's your journey been thus far?

@danialmalik80 · 4 years ago
Exactly the same thought, mate.

@nikhildmehta3448 · 4 years ago
Thank you so much for this wonderful video and the crisp, simple explanation. I was getting too impatient waiting for the course, so I decided to type the code line by line. It took a long time but was definitely worth the effort :) Can't wait for more from you on Spark ML.

@AIEngineeringLife · 4 years ago
Nikhil.. this video will be part of the course as well, and I will be adding more Spark videos. How do you feel about working on Spark now? :)

@nikhildmehta3448 · 4 years ago
AIEngineering I'm truly enjoying working on Spark and can't wait for the next batch of videos! Would love to get my hands dirty working on this 😀 Thanks again for all the efforts!

@rahulbhatia5657 · 4 years ago
Hi, your content is really helpful and unique, thanks for this! One question: I have a deep learning model trained using Keras and want to use it for inference on a Spark dataframe. Can you suggest some options? Is this possible, or do I need to rebuild the model in some Spark deep learning library?

@AIEngineeringLife · 4 years ago
Rahul.. have you tried the Spark TensorFlow package to see if it supports your model? If not, you can load the TF model in a Python UDF and broadcast the model file. You can then use it for inference in Spark.
@ZainAhmed-ho5sf · 4 years ago
Thank you for these amazing hands-on PySpark tutorials. Is there a way to add custom functions to the pipeline, or have you covered that anywhere in the next few lectures?

@AIEngineeringLife · 4 years ago
Hi Zain.. if you are looking for custom transformers, I have that in my plan for later this year. Currently taking a Spark break, as I did too many videos on it back to back :)

@ZainAhmed-ho5sf · 4 years ago
@@AIEngineeringLife Great! Looking forward to it.

@chanchalshukla683 · 4 years ago
Hi Sir, the imbalanced data has not been handled here, although it was highlighted. Should the approach be to assign weights to the minority class, or another approach such as oversampling?

@AIEngineeringLife · 4 years ago
I typically prefer undersampling or class weights rather than oversampling. So for this I would first try class weights, or perform additional feature engineering, and see if it makes a difference. Sorry, I could not recollect which dataset I used for this video, but in general the above is how I approach it.
@justmesherin · 4 years ago
You are ahhhhmazing! I am not sure why I was lurking on your LinkedIn and not here :) My 2020 is now set! Excellent content.

@AIEngineeringLife · 4 years ago
Thanks Sherin.. more to come :)

@joshuathomas2660 · 2 years ago
Hello sir, can you send me the exact dataset? It will be easier for me to follow your tutorials. Thanks in advance.

@mohamedhanifansari9224 · 4 years ago
Thanks Srivatsan for this wonderful video; I've learnt a lot. I have one question: doesn't fitting the transformers on the training data (during pre-processing) and transforming the test data using the same fit cause data leakage? Just want to make sure I understand the concept clearly.

@AIEngineeringLife · 4 years ago
Mohammed, not in the case I showed, as I am just applying the fitted transformation pipeline separately to each dataset. It might cause data leakage if I used the test set as the validation dataset during training and then used the same set to predict, which is the reason we always have in-time and out-of-time datasets in the real world. My actual training in this case has not seen the test data. This was a demo, though; I would recommend a separate validation set as well in a real-world pipeline. Did I answer your question, or was the intent something else?
@gururajangovindan7766 · 4 years ago
It would be great if you could publish your notebook link.. the video was very useful!!!

@AIEngineeringLife · 4 years ago
@gururajan.. I will publish it in my GitHub repo in a couple of weeks, like the other videos.

@AbdulHadi-yj7fl · 4 years ago
This is quite useful, sir... it is helping me with my UG project... thanks a lot.

@yoyovatsa2179 · 4 years ago
Now that I have binged the Spark ML series, it seems very similar to sklearn; some functions are different, so I've got to go to the documentation first. Awesome video as always. I had a doubt though: since you used SQL for most of the EDA, is EDA in Spark limited to SQL, or is there another way to do it, like using pandas-like functions? Anyway, thank you for the introduction, I am going to try and build my first pipeline now. Very much appreciate your content; your channel is so underrated.

@AIEngineeringLife · 4 years ago
Mohneesh.. Spark ML is based on the scikit-learn pipeline concept, so you will find a lot of similarity. Instead of Spark SQL you can use Spark dataframe functions, which again are pandas-like. For EDA on Databricks you can use dataframe functions as well.
@RD-yv4cc · 4 years ago
@@AIEngineeringLife Does it have the same great time-series EDA functions as pandas?

@lazzybirdflying3225 · 2 years ago
Hi, do you offer any personal training?

@umeshjadhav1586 · 3 years ago
Sir, you are great; I have no words to say beyond thank you.

@Azureandfabricmastery · 3 years ago
Thanks Srivatsan for the detailed E2E video on Spark ML. Helpful. Can't we generate confusion matrix visuals in Spark ML on Databricks, and also a classification report like in sklearn?

@AIEngineeringLife · 3 years ago
I have not tried the latest version of Databricks yet, but in earlier versions it was not there. With the increased focus on MLOps in Databricks these days, maybe there is a better way to do this, along with the MLflow integration. I will try and let you know.

@Azureandfabricmastery · 3 years ago
@@AIEngineeringLife Ok, thanks.
@DevanshKhandekar · 4 years ago
Excellent tutorial. Is the notebook for this available?

@AIEngineeringLife · 4 years ago
Check the code in my git repo here - github.com/srivatsan88/Mastering-Apache-Spark/blob/master/Churn-Analysis.ipynb

@teachingmachine · 4 years ago
You are really awesome; it's a neat and crystal-clear tutorial.

@deonwagner2643 · 3 years ago
Very nice. Do you have a link to your GitHub for the notebook in this video? I would like to apply some of this code to my dataset.

@AIEngineeringLife · 3 years ago
All of my code is in my repo here - github.com/srivatsan88. Spark has a separate repo where you can find the above code.

@vishal6361 · 4 years ago
Very well explained, with clear-cut pipelining concepts; it was really a great learning experience, thank you for this content. Would you be sharing industry-level code that can be deployed, through your videos?
@AIEngineeringLife · 4 years ago
Thanks Vishal. Could you please elaborate on "industry-level code"? I did not quite get it.

@maheshkumarsomalinga1455 · 3 years ago
Only recently have I started looking into your videos... definitely one of the best collections available. Can you suggest some real-life projects involving Spark for reference? Or if you already have videos on that, could you share the links? I could not find any. Thank you.

@krishnabisen2666 · 3 years ago
Great video sir; I don't know why YouTube hides such gems from us. Is it possible to get the link for the notebook?

@snehagrandhe1418 · 3 years ago
Can I make this a real-time project to add to my resume, sir?

@hishailesh77 · 4 years ago
Excellent video; I have been searching for this kind of video for a long time.

@akashprabhakar6353 · 2 years ago
Nice explanation!
@umeshjadhav1586 · 3 years ago
Do you have a Spark Streaming session available? I searched your channel and was not able to find any.

@AIEngineeringLife · 3 years ago
Spark Streaming not yet, but I have it in the plans for next year.

@jeharulhussain9344 · 3 years ago
@AI Engineering: What kind of Spark platform is most widely used by the clients you have come across? Databricks or something else?

@AIEngineeringLife · 3 years ago
It mostly depends on where they are. If on-prem, which many are, then it is Cloudera. For those on cloud, I have seen EMR in many cases, and Databricks or others in some.

@siddhantsapte · 4 years ago
Hello, is there any git repo for the code? It would be of great help! Thank you!

@IsaiahShadE · 3 years ago
I'm having an issue over here:

stages = []
for catCol in catColumns:
    stringIndexer = StringIndexer(inputCol=catCol, outputCol=catCol + "Index")
    encoder = OneHotEncoder(inputCols=[stringIndexer.getOuputCol()], outputCols=[catCol + "catVec"])
    stages = stages + [stringIndexer, encoder]

AttributeError: 'StringIndexer' object has no attribute 'getOuputCol'

@AIEngineering help me with this??
@mukeshkund4465 · 4 years ago
Excellent video, sir. If we could get this dataset, it would be good to practice more on it.

@AIEngineeringLife · 4 years ago
Thank you.. it should be in my dataset folder on GitHub - github.com/srivatsan88/YouTubeLI/tree/master/dataset

@mukeshkund4465 · 4 years ago
@@AIEngineeringLife Thank you for your reply. I got it from another video of yours.

@1981Praveer · 1 year ago
#AIEngineering Where can I find this dataset to practice?

@EugeneKingsley · 4 years ago
Wow! Thanks for this beautiful video. Keep them coming!

@orc475 · 4 years ago
Excellent hands-on! Thank you so much! Hope you can fix the voice on the next video.
@apremgeorge · 4 years ago
Thanks Srivatsan, excellent video. If possible, would you please add some extensions on how to make predictions and do data cleaning on a single record? Multithreading options could be helpful as well. Also how to use some other Python libraries in PySpark, like LIME for explanations. Thanks again.

@AIEngineeringLife · 4 years ago
I will add the deployment side in the future, Prem; it is on my list. I would not recommend multithreading within Spark, though you can do it. I do have the information you asked about on regular Python at this time; you can watch video 7 and above in this playlist: kzbin.info/aero/PL3N9eeOlCrP5PlN1jwOB3jVZE6nYTVswk

@apremgeorge · 4 years ago
@@AIEngineeringLife Thanks Srivatsan, waiting for that. One more question, please: instead of converting all my pandas code to PySpark, after loading the data as a PySpark dataframe I convert it to a pandas dataframe, do all the modelling in pandas and save the model. Is it possible to use the same for deployment in Databricks?

@AIEngineeringLife · 4 years ago
@@apremgeorge Yes you can.. Databricks supports all of Python, and you can install any packages with it. You can also look at Koalas if interested: kzbin.info/www/bejne/oYDXcoCfgspkgLs

@sumitbhalla2321 · 3 years ago
Is there any API/code snippet to enable model serving? I want to automate enabling model serving in Databricks/MLflow. Please help. Thanks.

@AIEngineeringLife · 3 years ago
Sumit, nope, I do not have it on Spark yet, but I have it in general on Python.
@ashirbaddas2573 · 4 years ago
Very neatly explained.. :)

@sujeeshsvalath · 4 years ago
Thank you sir. Great video.

@vinodhkumarbaskaran228 · 4 years ago
Thanks Sir!!! Amazing video.

@mohammadmuneer6463 · 4 years ago
The content is very good, but in a few places your voice echoes. Awesome content and info flow.

@AIEngineeringLife · 4 years ago
Thanks, and sorry for the inconvenience in between. Over time I have upgraded my recording setup, but initial videos like this one had some echo.

@mohammadmuneer6463 · 4 years ago
@@AIEngineeringLife No worries buddy... your content is too good :) Keep doing the great work.

@imransharief2891 · 4 years ago
Hi sir, if possible please share the CSV file so that we can practice with it in Databricks.
@AIEngineeringLife · 4 years ago
Imran.. it is in my GitHub repo. I will add it to the video description as well: github.com/srivatsan88/YouTubeLI/tree/master/dataset

@imransharief2891 · 4 years ago
Thank you sir, you're doing a great job teaching us beyond ML.

@raghurilokesh3270 · 3 years ago
As we already have Jupyter notebooks, why are we going for this?

@AIEngineeringLife · 3 years ago
Can you please elaborate? I did not get the question.

@thotarakesh2689 · 3 years ago
Sir, when I run printSchema it shows TotalCharges as string (nullable = true) instead of double. Could you please explain why, and how I can convert it to double?

@AIEngineeringLife · 3 years ago
Thota.. have you checked my repo notebook and looked for differences? - github.com/srivatsan88/Mastering-Apache-Spark
@thotarakesh2689 · 3 years ago
@@AIEngineeringLife Thank you sir, the above got fixed. Now I am getting the error below while doing fit & transform in the pipeline. Could you please advise?

pipeline = Pipeline().setStages(stages)
pipelineModel = pipeline.fit(train_data)

IllegalArgumentException: requirement failed: Output column label already exists.

@kousseilarekkam9511 · 4 years ago
I tried to fit a random forest classifier in PySpark. This is my code:

from pyspark.ml.tuning import ParamGridBuilder

rf = RandomForestClassifier(labelCol="label", featuresCol="features")
paramGrid = (ParamGridBuilder()
             .addGrid(rf.numTrees, [100])
             .build())
crossval = CrossValidator(estimator=rf,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=10)
cvModel = crossval.fit(trainingData)
predictions = crossval.transform(testData)
predictions.printSchema()

But I'm getting this error:

Py4JJavaError: An error occurred while calling o767.fit. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 30.0 failed 1 times, most recent failure: Lost task 0.0 in stage 30.0 (TID 853, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space

Can you help me please?

@AIEngineeringLife · 4 years ago
Based on the error, it looks like either your system does not have the memory to handle the data, or you have allocated less memory to your executor than this job needs. Go to the Spark UI and check memory allocation as the job is running, or increase memory and try.

@kousseilarekkam9511 · 4 years ago
@@AIEngineeringLife I'm using Google Colab; how can I check the Spark UI? Can you help me more please?

@AIEngineeringLife · 4 years ago
Then you cannot check the Spark UI. You can set executor memory where you get the Spark context and try. Monitor the CPU via the top Colab bar or in the manage sessions menu.

@kousseilarekkam9511 · 4 years ago
@@AIEngineeringLife Thank you very much for your help, but I still have the problem. Can you please show me an example of how to set up the SparkSession, configurations and SparkContext?
@varungondu7053 · 4 years ago
Is Databricks not offering a free edition these days?

@AIEngineeringLife · 4 years ago
I see they still are.. are you not able to create an account? - databricks.com/try-databricks

@varungondu7053 · 4 years ago
@@AIEngineeringLife No, after the login it asks me to select a plan; there are three plans in total, and all are priced.

@varungondu7053 · 4 years ago
Initially it asks for community edition or free edition, but even after choosing the free edition it asks me to select one of the three priced plans.

@AIEngineeringLife · 4 years ago
Oh, I get it, you went through "try for free".. let me check this evening. Not sure if they removed it, but I doubt they would.

@dipanjanghosh6862 · 3 years ago
Sir, I am trying to run the basic command df.groupBy('Churn').count().show() but an error comes up that says:

Py4JJavaError Traceback (most recent call last)
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     62     try:
---> 63         return f(*a, **kw)
     64     except py4j.protocol.Py4JJavaError as e:

I have followed all the code before this. Not sure where I am going wrong. Can you please help?
@dipanjanghosh6862 · 3 years ago
Sir, none of the code executes after select * from churn_analysis. I have loaded the data correctly... this is so frustrating.

@AIEngineeringLife · 3 years ago
Hi Dipanjan.. I just tried on Databricks and it seems to work fine. Which cluster have you created? I tried with Databricks 7.5 ML, Spark 3.0.1. Can you check and try again? What you are getting seems to me like a Databricks environment issue.

@dipanjanghosh6862 · 3 years ago
@@AIEngineeringLife Sir, I have tried everything and still get this error while trying to run df.groupBy('Churn').count().show():

AnalysisException: cannot resolve '`Churn`' given input columns: [_c0, _c1, _c10, _c11, _c12, _c13, _c14, _c15, _c16, _c17, _c18, _c19, _c2, _c20, _c3, _c4, _c5, _c6, _c7, _c8, _c9];;

My cluster is running smoothly. Can you please help me out with this?

@AIEngineeringLife · 3 years ago
@@dipanjanghosh6862 I think it has not inferred the column names from the header, which is why you are getting _c0, _c1 and so on. Check that you have set header to true as below:

infer_schema = "true"
first_row_is_header = "true"
delimiter = ","

df = spark.read.format(file_type) \
    .option("inferSchema", infer_schema) \
    .option("header", first_row_is_header) \
    .option("sep", delimiter) \
    .option('nanValue', ' ') \
    .option('nullValue', ' ') \
    .load(file_location)

@dipanjanghosh6862 · 3 years ago
Sir, you are awesome. Thanks.

@kanizfatma1128 · 3 years ago
Cannot import name OneHotEncoderEstimator.

@AIEngineeringLife · 3 years ago
Replied on LI. Change it to OneHotEncoder. Spark 3.0 changed the package.
@kanizfatma1128 · 3 years ago
Thanks.

@gasmikaouther6887 · 3 years ago
Can you please share the dataset?

@AIEngineeringLife · 3 years ago
Most of it should be here - github.com/srivatsan88/YouTubeLI/tree/master/dataset

@arpanghosh3801 · 4 years ago
Can you share the code?

@AIEngineeringLife · 4 years ago
It is in my git repo here - github.com/srivatsan88/Mastering-Apache-Spark

@arslanjutt4282 · 1 year ago
Where is the dataset?

@ashwinkumar5223 · 1 year ago
Please share the code.