This is really good stuff. It was high time I transitioned from pandas and scikit-learn to something more industry-relevant.
@RD-yv4cc · 4 years ago
My thoughts exactly mate. How's your journey been thus far?
@danialmalik80 · 4 years ago
Exactly the same thought, mate
@nikhildmehta3448 · 4 years ago
Thank you so much for this wonderful video and the crisp, simple explanation. I was getting too impatient waiting for the course, so I decided to type the code line by line. Took a long time but definitely worth the effort :) Can't wait for more from you on Spark ML.
@AIEngineeringLife · 4 years ago
Nikhil, this video will be part of the course as well, and I will be adding more Spark videos. How do you feel about working on Spark now :)
@nikhildmehta3448 · 4 years ago
AIEngineering I’m truly enjoying working on Spark and can’t wait for the next batch of videos! Would love to get my hands dirty working on this😀 Thanks again for all the efforts!
@rahulbhatia5657 · 4 years ago
Hi, your content is really helpful and unique, thanks for this! One question: I have a deep learning model trained with Keras and want to use it for inference on a Spark DataFrame. Can you suggest some options? Is this possible, or do I need to rebuild the model in some Spark deep learning library?
@AIEngineeringLife · 4 years ago
Rahul, have you tried the Spark TensorFlow package and checked whether it supports this? If not, you can load the TF model in a Python UDF and broadcast the model file. You can then use it for inference in Spark.
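A minimal sketch of the broadcast-plus-UDF approach described in the reply; `model_path`, the feature column names, and the caching pattern are illustrative placeholders, and an existing SparkSession `spark` is assumed:

```python
# Sketch: broadcast a saved Keras model's config and weights, then score rows
# through a plain Python UDF. model_path and column names are placeholders.

def make_predict_udf(spark, model_path):
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType
    import tensorflow as tf

    model = tf.keras.models.load_model(model_path)
    # Broadcast architecture + weights once instead of shipping them per task.
    bc = spark.sparkContext.broadcast((model.to_json(), model.get_weights()))
    cache = {}

    def score(*features):
        if "model" not in cache:  # rebuild the model once per executor process
            import tensorflow as tf
            m = tf.keras.models.model_from_json(bc.value[0])
            m.set_weights(bc.value[1])
            cache["model"] = m
        import numpy as np
        x = np.array(features, dtype="float32").reshape(1, -1)
        return float(cache["model"].predict(x, verbose=0)[0][0])

    return udf(score, DoubleType())

# usage (illustrative column names):
# df = df.withColumn("churn_score",
#                    make_predict_udf(spark, "model.h5")("tenure", "MonthlyCharges"))
```

For larger batches, a pandas UDF that scores whole column chunks at once would cut the per-row overhead, but the row-wise version above is the simplest form of what the reply suggests.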
@ZainAhmed-ho5sf · 4 years ago
Thank you for these amazing hands-on PySpark tutorials. Is there a way to add custom functions to the pipeline, or have you covered that anywhere in the next few lectures?
@AIEngineeringLife · 4 years ago
Hi Zain, if you are looking for custom transformers, I have them planned for later this year. Currently taking a Spark break, as I did too many videos on it back to back :)
@ZainAhmed-ho5sf · 4 years ago
@AIEngineeringLife Great! Looking forward to it.
@chanchalshukla683 · 4 years ago
Hi Sir, here the imbalanced data has not been handled, although it was highlighted. Should the approach be to assign weights to the minority class, or another approach such as oversampling?
@AIEngineeringLife · 4 years ago
I typically prefer undersampling or class weights rather than oversampling. So for this I would first try class weights, or perform additional feature engineering and see if it makes a difference. Sorry, I could not recollect which dataset I used for this video, but in general the above is how I approach it.
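As a sketch of the class-weight option mentioned in the reply (the `label` and `classWeight` column names are placeholders; the weighting mirrors scikit-learn's "balanced" heuristic, which is an assumption, not the video's exact recipe):

```python
# Sketch: add a weight column for an imbalanced 0/1 label, then hand it to an
# estimator via weightCol.

def balanced_weight(n_rows, n_positive):
    """Weight for positive rows so both classes contribute equally
    (negatives keep weight 1.0)."""
    return (n_rows - n_positive) / n_positive

def add_weight_column(df, label_col="label"):
    from pyspark.sql import functions as F
    n_rows = df.count()
    n_pos = df.filter(F.col(label_col) == 1).count()
    w = balanced_weight(n_rows, n_pos)
    return df.withColumn(
        "classWeight", F.when(F.col(label_col) == 1, w).otherwise(1.0)
    )

# usage (illustrative): Spark ML estimators such as LogisticRegression accept
# the column through weightCol:
# lr = LogisticRegression(featuresCol="features", labelCol="label",
#                         weightCol="classWeight")
```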
@justmesherin · 4 years ago
You are ahhhhmazing! I am not sure why I was lurking on your LinkedIn and not here :) my 2020 is now set! Excellent content.
@AIEngineeringLife · 4 years ago
Thanks Sherin.. More to come :)
@joshuathomas2660 · 2 years ago
Hello sir, Can you send me the exact dataset? it will be easy for me to follow your tutorials. Thanks in advance.
@mohamedhanifansari9224 · 4 years ago
Thanks Srivatsan for this wonderful video; I've learnt a lot. I have one question: doesn't fitting on the training data (during pre-processing) and transforming the test data using the same fit cause data leakage? Just wanted to make sure I understand the concept clearly.
@AIEngineeringLife · 4 years ago
Mohammed, not in the case I showed, as I am just applying the transformation pipeline separately to the data. It would cause data leakage if I used the test set as the validation dataset during training and then used the same set to predict; that is the reason we always have in-time and out-of-time datasets in the real world. My actual training in this case has not seen the test data. This was a demo, though; I would recommend a separate validation set as well in a real-world pipeline. Did I answer your question, or was the intent something else?
@gururajangovindan7766 · 4 years ago
It would be great if you could publish your notebook link. The video was very useful!!!
@AIEngineeringLife · 4 years ago
@gururajan.. I will publish it in my GitHub repo in a couple of weeks, like the other videos.
@AbdulHadi-yj7fl · 4 years ago
This is quite useful, sir... it is helping me with my UG project. Thanks a lot!
@yoyovatsa2179 · 4 years ago
Now that I have binged the Spark ML series, it seems very similar to sklearn; some functions are different, so you have to go to the documentation first. Awesome video as always. I had a doubt though: as you used SQL for most of the EDA, is EDA in Spark limited to SQL, or is there another way to do it, like using pandas-like functions? Anyway, thank you for the introduction; I am going to try and build my first pipeline now. Very much appreciate your content; your channel is so underrated.
@AIEngineeringLife · 4 years ago
Mohneesh, Spark ML is based on the scikit-learn pipeline concept, so you will find a lot of similarity. Instead of Spark SQL you can use Spark DataFrame functions, which again are pandas-like. For EDA on Databricks you can use DataFrame functions as well.
@RD-yv4cc · 4 years ago
@AIEngineeringLife Does it have the same great time-series EDA functions as pandas?
@lazzybirdflying3225 · 2 years ago
Hi, do you offer any personal training?
@umeshjadhav1586 · 3 years ago
Sir, you are great. I have no words to say beyond thank you.
@Azureandfabricmastery · 3 years ago
Thanks Srivatsan for the detailed end-to-end video on Spark ML. Helpful. Can't we generate confusion matrix visuals in Spark ML on Databricks, and also a classification report like in sklearn?
@AIEngineeringLife · 3 years ago
I have not tried the latest version of Databricks yet, but in earlier versions it was not there. With the increased focus on MLOps in Databricks these days, maybe there is a better way to do this, along with the MLflow integration. I will try and let you know.
@Azureandfabricmastery · 3 years ago
@AIEngineeringLife Ok, thanks.
@DevanshKhandekar · 4 years ago
Excellent tutorial. Is the notebook for this available ?
@AIEngineeringLife · 4 years ago
Check the code in my git repo here - github.com/srivatsan88/Mastering-Apache-Spark/blob/master/Churn-Analysis.ipynb
@teachingmachine · 4 years ago
You are really awesome; it's a neat and crystal-clear tutorial.
@deonwagner2643 · 3 years ago
Very nice. Do you have a link to your GitHub for the notebook in this video? I would like to apply some of this code to my dataset.
@AIEngineeringLife · 3 years ago
All of my code is in my repo here - github.com/srivatsan88. Spark has a separate repo where you can find the above code.
@vishal6361 · 4 years ago
Very well explained, with clear-cut pipelining concepts. It was really a great learning experience; thank you for this content. Would you be sharing industry-level code that can be deployed, through your videos?
@AIEngineeringLife · 4 years ago
Thanks Vishal. Could you please elaborate on "industry level codes"? Did not quite get it.
@maheshkumarsomalinga1455 · 3 years ago
Only recently have I started looking into your videos... definitely one of the best collections available. Can you advise on some real-life projects involving Spark for reference? Or if you already have videos on that, could you share the links? I could not find any. Thank you.
@krishnabisen2666 · 3 years ago
Great video sir, I don't know why YouTube hides such gems from us. Is it possible to get the link for the notebook?
@snehagrandhe1418 · 3 years ago
Can I make this a real-time project to add to my resume, sir?
@hishailesh77 · 4 years ago
Excellent video. I have been searching for this kind of video for a long time.
@akashprabhakar6353 · 2 years ago
Nice explanation!
@umeshjadhav1586 · 3 years ago
Do you have a Spark Streaming session available? I searched on your channel and was not able to find any.
@AIEngineeringLife · 3 years ago
Spark Streaming not yet, but I have it in the plans for next year.
@jeharulhussain9344 · 3 years ago
@AI Engineering: What kind of Spark platform is widely used by most of the clients you have come across? Databricks or something else?
@AIEngineeringLife · 3 years ago
It mostly depends on where they are. If on-prem, which many are, then it is Cloudera. For those on the cloud, I have seen EMR in many cases, and Databricks or others in some.
@siddhantsapte · 4 years ago
Hello, is there any Git repo for the code? It would be of great help! Thank you!
@IsaiahShadE · 3 years ago
I'm having an issue over here:

stages = []
for catCol in catColumns:
    stringIndexer = StringIndexer(inputCol=catCol, outputCol=catCol + "Index")
    encoder = OneHotEncoder(inputCols=[stringIndexer.getOuputCol()], outputCols=[catCol + "catVec"])
    stages = stages + [stringIndexer, encoder]

AttributeError: 'StringIndexer' object has no attribute 'getOuputCol'

@AIEngineering help me with this??
@mukeshkund4465 · 4 years ago
Excellent video, sir. If we could get this dataset, it would be good to practice more on it.
@AIEngineeringLife · 4 years ago
Thank you.. It should be in my dataset folder on GitHub - github.com/srivatsan88/KZbinLI/tree/master/dataset
@mukeshkund4465 · 4 years ago
@AIEngineeringLife Thank you for your reply. I have got it from another video of yours.
@1981Praveer · 1 year ago
#AIEngineering Where can I find this dataset to practice?
@EugeneKingsley · 4 years ago
Wow! Thanks for this beautiful video. Keep it up!
@orc475 · 4 years ago
Excellent hands-on! Thank you so much! Hope you can fix the voice in the next video.
@apremgeorge · 4 years ago
Thanks Srivatsan, excellent video. If possible, would you please add some extensions on how to make predictions and do data cleaning on a single record? Multithreading options could be helpful as well. Also, how to use some other Python libraries in PySpark, like LIME for explanations. Thanks again.
@AIEngineeringLife · 4 years ago
I will add the deployment side in the future, Prem. It is on my list. I would not recommend multithreading within Spark, though you can do it. But I do have the information you asked about for regular Python at this time; you can watch video 7 and above in this playlist: kzbin.info/aero/PL3N9eeOlCrP5PlN1jwOB3jVZE6nYTVswk
@apremgeorge · 4 years ago
@AIEngineeringLife Thanks Srivatsan, waiting for that. One more question, please: instead of transforming all my pandas code to PySpark, after loading the data as a PySpark DataFrame, I convert it to a pandas DataFrame, do all the modelling in pandas, and save the model. Is it possible to use the same for deployment in Databricks?
@AIEngineeringLife · 4 years ago
@apremgeorge Yes you can. Databricks supports all of Python, and you can install any packages with it. You can also look at Koalas if interested: kzbin.info/www/bejne/oYDXcoCfgspkgLs
@sumitbhalla2321 · 3 years ago
Is there any API / code snippet to enable model serving? I want to automate enabling model serving in Databricks/MLflow. Please help. Thanks.
@AIEngineeringLife · 3 years ago
Sumit, nope, I do not have it on Spark yet, but I have it in general on Python.
@ashirbaddas2573 · 4 years ago
Very neatly explained..:)
@sujeeshsvalath · 4 years ago
Thank you sir. Great video
@vinodhkumarbaskaran228 · 4 years ago
Thanks Sir!!! Amazing video
@mohammadmuneer6463 · 4 years ago
The content is very good, but in a few places your voice is echoing. Awesome content and info flow.
@AIEngineeringLife · 4 years ago
Thanks, and sorry for the inconvenience in between. Over time I have upgraded my recording setup, but the initial few videos like this one had some echo.
@mohammadmuneer6463 · 4 years ago
@AIEngineeringLife No worries buddy... your content is too good :) keep doing the great work.
@imransharief2891 · 4 years ago
Hi sir, if possible please share the CSV file so that we can practice with it in Databricks.
@AIEngineeringLife · 4 years ago
Imran.. It is in my github repo. I will add it to video description as well github.com/srivatsan88/KZbinLI/tree/master/dataset
@imransharief2891 · 4 years ago
Thank you, sir. You are doing a great job teaching us beyond ML.
@raghurilokesh3270 · 3 years ago
As we already have Jupyter Notebook, why are we going for this?
@AIEngineeringLife · 3 years ago
Can you please elaborate? Did not get the question here.
@thotarakesh2689 · 3 years ago
Sir, when I run printSchema it shows TotalCharges as string (nullable = true) instead of double. Could you please explain why, and how I can convert it to double?
@AIEngineeringLife · 3 years ago
Thota.. Have you checked my repo notebook and checked for differences? - github.com/srivatsan88/Mastering-Apache-Spark
@thotarakesh2689 · 3 years ago
@AIEngineeringLife Thank you sir, the above one got fixed. Now I am getting the error below while doing fit and transform in the pipeline. Could you please advise?

pipeline = Pipeline().setStages(stages)
pipelineModel = pipeline.fit(train_data)

IllegalArgumentException: requirement failed: Output column label already exists.
@kousseilarekkam9511 · 4 years ago
I tried to fit a random forest classifier in PySpark. This is my code:

from pyspark.ml.tuning import ParamGridBuilder

rf = RandomForestClassifier(labelCol="label", featuresCol="features")
paramGrid = (ParamGridBuilder()
             .addGrid(rf.numTrees, [100])
             .build())
crossval = CrossValidator(estimator=rf,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=10)
cvModel = crossval.fit(trainingData)
predictions = crossval.transform(testData)
predictions.printSchema()

but I'm getting this error:

Py4JJavaError: An error occurred while calling o767.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 30.0 failed 1 times, most recent failure: Lost task 0.0 in stage 30.0 (TID 853, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space

Can you help me please?
@AIEngineeringLife · 4 years ago
Based on the error, it looks like either your system does not have enough memory to handle the data, or you have allocated less memory to your executor than this job needs. Go to the Spark UI and check the memory allocation as the job is running, or increase the memory and try.
@kousseilarekkam9511 · 4 years ago
@AIEngineeringLife I'm using Google Colab. How can I check the Spark UI? Can you help me more, please?
@AIEngineeringLife · 4 years ago
Then you cannot check the Spark UI. You can set the executor memory where you create the Spark context and try. Monitor the CPU in the top Colab bar or in the Manage Sessions menu.
@kousseilarekkam9511 · 4 years ago
@AIEngineeringLife Thank you very much for your help, but I still have a problem. Can you please show me an example of how to set the SparkSession, configurations, and SparkContext?
@varungondu7053 · 4 years ago
Is Databricks not giving a free edition in recent times?
@AIEngineeringLife · 4 years ago
I see they still are. Are you not able to create an account? - databricks.com/try-databricks
@varungondu7053 · 4 years ago
@AIEngineeringLife No, after the login it is asking me to select a plan. In total there are three plans, and all have a price.
@varungondu7053 · 4 years ago
Initially it asks for Community Edition or free edition, but even after choosing the free edition it asks to select one of the above three plans, which are priced.
@AIEngineeringLife · 4 years ago
Oh, I get it. I gave you the "try for free" link. Let me check this evening; I am not sure if they removed it, but I doubt they would.
@dipanjanghosh6862 · 3 years ago
Sir, I am trying to run the basic command df.groupBy('Churn').count().show() but an error is coming up that says:

Py4JJavaError Traceback (most recent call last)
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     62     try:
---> 63         return f(*a, **kw)
     64     except py4j.protocol.Py4JJavaError as e:

I have followed all the code before this. Not sure where I am going wrong. Can you please help?
@dipanjanghosh6862 · 3 years ago
Sir, none of the code executes after select * from churn_analysis. I have loaded the data correctly... this is so frustrating.
@AIEngineeringLife · 3 years ago
Hi Dipanjan, I just tried on Databricks and it seems to work fine. Which cluster have you created? I tried with Databricks 7.5 ML, Spark 3.0.1. Can you check and try again? What you are getting seems to me like a Databricks environment issue.
@dipanjanghosh6862 · 3 years ago
@AIEngineeringLife Sir, I have tried everything and still get this error while trying to run df.groupBy('Churn').count().show():

AnalysisException: cannot resolve '`Churn`' given input columns: [_c0, _c1, _c10, _c11, _c12, _c13, _c14, _c15, _c16, _c17, _c18, _c19, _c2, _c20, _c3, _c4, _c5, _c6, _c7, _c8, _c9];;

My cluster is running smoothly. Can you please help me out with this?
@AIEngineeringLife · 3 years ago
@dipanjanghosh6862 I think it has not inferred the column names from the header, which is why you are getting _c1 and so on. Check that you have set header to true as below:

infer_schema = "true"
first_row_is_header = "true"
delimiter = ","

df = spark.read.format(file_type) \
    .option("inferSchema", infer_schema) \
    .option("header", first_row_is_header) \
    .option("sep", delimiter) \
    .option('nanValue', ' ') \
    .option('nullValue', ' ') \
    .load(file_location)
@dipanjanghosh6862 · 3 years ago
sir you are awesome. thanks
@kanizfatma1128 · 3 years ago
Cannot import name OneHotEncoderEstimator
@AIEngineeringLife · 3 years ago
Replied on LI. Change it to OneHotEncoder; Spark 3.0 changed the package.
@kanizfatma1128 · 3 years ago
Thanks
@gasmikaouther6887 · 3 years ago
Can you please share the dataset?
@AIEngineeringLife · 3 years ago
Most of it must be here - github.com/srivatsan88/KZbinLI/tree/master/dataset
@arpanghosh3801 · 4 years ago
Can you share the code?
@AIEngineeringLife · 4 years ago
It is in my Git repo here - github.com/srivatsan88/Mastering-Apache-Spark