Data Cleaning and Analysis using Apache Spark

  59,504 views

AIEngineering


Comments: 182
@AIEngineeringLife
@AIEngineeringLife 4 years ago
Many have asked for the file I used in this video. You can download it from here - drive.google.com/file/d/1e6phh7Df8mzYoE-sBXPVJklnSt_wHwkq/view?usp=sharing Remove the last 2 lines from the csv file.
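A minimal loading sketch for readers following along (the DBFS path and the filter column are assumptions, not from the video); the trailing summary rows can also be dropped in code instead of editing the csv by hand:

```python
from pyspark.sql.functions import col

# `spark` is predefined in Databricks notebooks; path assumes the file was uploaded to DBFS.
df = (spark.read.format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("/FileStore/tables/LoanStats_2018Q4.csv"))

# The last two lines of the raw export are summary text, not loan records;
# keeping only rows with a loan amount achieves the same cleanup.
df = df.filter(col("loan_amnt").isNotNull())
```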
@vivekpuurkayastha1580
@vivekpuurkayastha1580 3 years ago
Is it the %sql magic command that makes SQL statements available? In that case, is it only possible to use SQL when using Databricks, making SQL unavailable in plain Python scripts? Please correct me, and also provide any input you may have...
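For what it's worth, %sql is a notebook magic, but Spark SQL itself is not tied to Databricks; in a plain PySpark script you can register a temp view and call spark.sql() directly. A small sketch (the file name is illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()
df = spark.read.csv("loans.csv", header=True, inferSchema=True)
df.createOrReplaceTempView("loans")
spark.sql("SELECT loan_status, COUNT(*) AS cnt FROM loans GROUP BY loan_status").show()
```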
@harshithag5769
@harshithag5769 3 years ago
Hello Sir, did you set any permission on this file? I am unable to open it. I tried to open it in OneDrive Office Online and it says conversion error.
@AIEngineeringLife
@AIEngineeringLife 3 years ago
@@harshithag5769 Nope, it is open for all. Can you open it in Google Drive and see?
@harshithag5769
@harshithag5769 3 years ago
@@AIEngineeringLife I don't have Excel on my PC. I tried opening it through OneDrive Live and Google Sheets; both say there is an error opening the file. I was able to open the other files you have provided in the GitHub repository, but when I try to upload this file as a dataset in Databricks it throws an error.
@MachalaPuli
@MachalaPuli 2 years ago
How can I see the entity relationship diagram in Databricks or PySpark, just like we see it in MySQL? Please help me with this.
@himanshupatanwala09
@himanshupatanwala09 4 years ago
Thank you so much, Sir. Millions of blessings from every student who watches this. I was looking for real resources to learn Spark, and your content saved a lot of effort from being wasted and put it in the right direction. Thanks a lot, and please never stop creating such wonderful content for us.
@AIEngineeringLife
@AIEngineeringLife 4 years ago
You are welcome, Himanshu, and thanks for such nice and encouraging words; they drive me to create more such content :)
@KennyJacobson1
@KennyJacobson1 4 years ago
I've been trying to get up to speed on Databricks and Spark for two weeks now, and I just learned 10x as much in 1 hour as I did in the previous 2 weeks. Thank you!
@AIEngineeringLife
@AIEngineeringLife 4 years ago
Glad to know this was useful, Kenny. All the best on your learning journey.
@norpriest521
@norpriest521 3 years ago
So how is the Databricks service? I mean, if I use it, what's their billing policy? Pay per use, pay per activity, pay per minutes/hours of use, or pay per data size? Can you let me know?
@ijeffking
@ijeffking 4 years ago
This is HUGE! Gems of wisdom for a machine learning aspirant. Excellent. Thank you very much.
@nikhilkumarjha
@nikhilkumarjha 3 years ago
Thanks for the great tutorial. The Data Science community needs more people like you, SS :)
@DiverseDestinationsDiaries
@DiverseDestinationsDiaries 3 years ago
This is the best video content I have ever seen on KZbin with respect to real-time scenarios.... Thanks a lot, Sir. Please do more to help us.
@rlmclaughlinmusic
@rlmclaughlinmusic 3 years ago
Truly love your channel! Such a wealth of information, and brilliantly explained. Thank you for providing this real-world example. It was exactly what I needed to elevate my Spark skills. You're a terrific instructor.
@sandeepsankar6353
@sandeepsankar6353 3 years ago
Superb video. In 40 minutes, you covered pretty much everything :). Please upload more videos.
@AIEngineeringLife
@AIEngineeringLife 3 years ago
I have a complete course on Apache Spark in the playlist section of my youtube channel. Have you seen it?
@sandeepsankar6353
@sandeepsankar6353 3 years ago
@@AIEngineeringLife Yep, seen and subscribed as well :)
@shubhamtripathi5138
@shubhamtripathi5138 4 years ago
Awesome. It's not just a video series, it's an entire course, I must say. I really appreciate your hard work and teaching technique; thanks, Sir, keep it up. One request, I think from many students like me: please upload the notebook, Sir, so that it saves a little time too. Thanks.
@AIEngineeringLife
@AIEngineeringLife 4 years ago
Thanks Shubham.. :) .. The code is already available in my git repo - github.com/srivatsan88/Mastering-Apache-Spark
@norpriest521
@norpriest521 3 years ago
@@AIEngineeringLife Hi, thank you for your video. Just wanted to ask whether this video is about data profiling or data wrangling using PySpark?
@AIEngineeringLife
@AIEngineeringLife 3 years ago
@@norpriest521 This video is more about profiling/cleaning of data, but I have detailed videos on wrangling in my Apache Spark playlist.
@norpriest521
@norpriest521 3 years ago
@@AIEngineeringLife I couldn't find the video regarding data wrangling in your list. Could you please let me know the title of the video?
@AIEngineeringLife
@AIEngineeringLife 3 years ago
@@norpriest521 It is named data engineering in this playlist. There are 2 parts to it: kzbin.info/aero/PL3N9eeOlCrP5PfpYrP6YxMNtt5Hw27ZlO
@The_Bold_Statement
@The_Bold_Statement 2 years ago
Tremendous effort and knowledge can be seen in your video. Thank you.
@sahil0094
@sahil0094 3 years ago
df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show() and df.filter(df.col.isNull()).count() should return the same result, right? The first command gives 0 nulls, whereas the second gives some nulls for the same column. Can you please help?
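A likely reason the two counts differ (sketch only; the column name is illustrative): isnan() flags only floating-point NaN values, while isNull() flags SQL nulls, and a column can contain nulls without containing any NaNs.

```python
from pyspark.sql.functions import col, isnan

df.filter(col("revol_util").isNull()).count()   # counts SQL nulls
df.filter(isnan(col("revol_util"))).count()     # counts NaN values only (numeric columns)
# To capture both in one pass, combine the conditions: isnan(col(c)) | col(c).isNull()
```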
@prafulmaka7710
@prafulmaka7710 3 years ago
Amazing explanation!! Thank you very much!
@Azureandfabricmastery
@Azureandfabricmastery 3 years ago
Thanks very much for the detailed hands-on. It helps!
@cloudmusician
@cloudmusician 3 years ago
Excellent content, precise and to the point.
@purushothamchanda898
@purushothamchanda898 4 years ago
Wonderful video sir, I was looking for such content for many days. Thanks a ton.
@AIEngineeringLife
@AIEngineeringLife 4 years ago
👍
@beyourbest199
@beyourbest199 4 years ago
This is amazing, waiting for your next video on Spark analysis and cleaning.
@AIEngineeringLife
@AIEngineeringLife 4 years ago
The next part, on EDA using Spark, is here: kzbin.info/www/bejne/jmeynIdojrWNjNU
@insane_billa
@insane_billa 2 years ago
Now, this is what I was looking for :)
@rahulraoshindek131
@rahulraoshindek131 3 years ago
Thank you so much, sir, for this detailed video.
@deeptigupta518
@deeptigupta518 3 years ago
These videos are awesome and a great help. Thanks. You are doing a wonderful job for people like us who have just entered the industry. Just wanted to ask: I have worked on ML models using Python but have not worked with Apache Spark. Will I face any difficulty doing the same things in Spark?
@AIEngineeringLife
@AIEngineeringLife 3 years ago
Nope, it should be a smooth transition. Remember, the Spark ML pipeline is inspired by scikit-learn pipelines, so the process is similar. The only time-consuming part will be understanding distributed architectures, which might take a while.
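As a rough illustration of that similarity, a Spark ML pipeline chains stages much like a scikit-learn Pipeline (the column names and DataFrames below are assumptions, not from the video):

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression

indexer = StringIndexer(inputCol="grade", outputCol="grade_idx")
assembler = VectorAssembler(inputCols=["loan_amnt", "annual_inc", "grade_idx"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[indexer, assembler, lr])
model = pipeline.fit(train_df)          # train_df / test_df are hypothetical splits
preds = model.transform(test_df)
```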
@aleksei_cherniaev
@aleksei_cherniaev 3 years ago
That is awesome content, big thanks, man!
@kunalr_ai
@kunalr_ai 3 years ago
I need to connect to Azure Data Lake and load the data into Databricks here. Do you have any document that supports this?
@AIEngineeringLife
@AIEngineeringLife 3 years ago
Kunal.. Nope, I have not done any on Azure yet.
@AniketKumar-ij3ew
@AniketKumar-ij3ew 4 years ago
Great stuff, really enjoying the hands-on videos. I have one input, not a big constraint: I guess in the last part, when you are creating the permanent table, the data frame should be df_sel_final instead of df_sel.
@AIEngineeringLife
@AIEngineeringLife 4 years ago
Thank you, Aniket. You are right; maybe I did it in a hurry. Good catch :)
@CRTagadiya
@CRTagadiya 4 years ago
Thanks for your work. Could you please upload all the videos of this series into a playlist?
@AIEngineeringLife
@AIEngineeringLife 4 years ago
Check this playlist. It is available there, and I will be adding upcoming videos to it as well: kzbin.info/aero/PL3N9eeOlCrP7_vt6jq7GdJz6CSFmtTBpI
@tanishasharma3665
@tanishasharma3665 3 years ago
Sir, I had a question: why are we creating a temporary table every time for SQL functions? In PySpark, the main advantage is that we can use these functions directly on the dataframe as well, for example: loan_dfsel.groupby("loan_status").count().orderBy(col("count").desc()).show(), where 'loan_dfsel' is a dataframe. Please enlighten me if I'm wrong....
@AIEngineeringLife
@AIEngineeringLife 3 years ago
Tanisha, is your question why I am using dataframe functions rather than SQL functions in PySpark? If so, then yes, SQL is an easy way of processing data in Spark, but for iterative processing, dataframe functions are very powerful and simple. Typically in projects we use a combination of SQL and dataframe functions. In this case I wanted to show the dataframe functions, but in future videos I have covered SQL as well. Underneath, both SQL and the dataframe API compile to the same plan, so performance should not differ.
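A small sketch of that equivalence, assuming loan_dfsel is the dataframe from the video: the SQL form and the dataframe form express the same aggregation, and explain() lets you compare the plans they generate.

```python
from pyspark.sql.functions import col

loan_dfsel.createOrReplaceTempView("loan_tbl")
sql_out = spark.sql(
    "SELECT loan_status, COUNT(*) AS cnt FROM loan_tbl GROUP BY loan_status ORDER BY cnt DESC")
df_out = loan_dfsel.groupBy("loan_status").count().orderBy(col("count").desc())

sql_out.explain()   # compare the generated plans
df_out.explain()
```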
@HemantSharma-fw2gx
@HemantSharma-fw2gx 3 years ago
Thanks for this tutorial.
@prometeo34
@prometeo34 2 years ago
This is amazing. Thank you very much!
@KishoreKumar-yx4nw
@KishoreKumar-yx4nw 4 years ago
Nicely explained.
@saitejap7876
@saitejap7876 3 years ago
Thank you for the tutorial. I am just curious: while dealing with the revol_util column, we find the average while the column is still a string, use it to replace the null values, and only then cast it to double. Would there be a difference if we cast the values to double first, then take the average and replace the nulls? Hoping to get your insights on this.
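A sketch of the ordering being asked about (the exact cleaning step is an assumption; the raw revol_util values carry a % sign that has to be stripped either way). avg() on a string column implicitly casts it to double, and values that fail the cast become null and are ignored, so both orderings normally produce the same fill value; casting first is simply more explicit and safer.

```python
from pyspark.sql.functions import avg, col, regexp_replace

# cast first, then average and fill
clean = df.withColumn("revol_util",
                      regexp_replace(col("revol_util"), "%", "").cast("double"))
fill_value = clean.select(avg(col("revol_util"))).first()[0]
clean = clean.fillna({"revol_util": fill_value})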
@iamhappy7759
@iamhappy7759 3 years ago
Thanks for the amazing video. Would you provide a link to the notebook for practice?
@AIEngineeringLife
@AIEngineeringLife 3 years ago
Thanks.. It should be in this repo - github.com/srivatsan88/Mastering-Apache-Spark
@iamhappy7759
@iamhappy7759 3 years ago
@@AIEngineeringLife Thank you so much.
@aishwaryagopal5553
@aishwaryagopal5553 3 years ago
Hi, thank you for the informative videos. I'm just getting started with Spark, and the code seems easy to understand. What other aspects of Spark should I read through for a better understanding?
@chwaleedsial
@chwaleedsial 4 years ago
A suggestion: when you load the dataset, if it is not the same as the one shared on Kaggle, please also let us know what transformations and filtering you performed, so that we can get the same or similar results as we follow along.
@AIEngineeringLife
@AIEngineeringLife 4 years ago
I am sorry if the dataset is not the same. I did not do any transformation; rather, I downloaded it from Lending Club directly at the link below: www.lendingclub.com/info/statistics.action Earlier the download was open for all, but after I downloaded it some time back they made it sign-in based, and hence I referred to Kaggle thinking it should be similar. From my end I did not make any changes to the dataset I got from Lending Club. Are you facing any particular issue? A few have reached out in the past with clarifications and were able to execute all the commands successfully.
@soumyagupta9301
@soumyagupta9301 2 years ago
Where did you create the Spark session? I don't see the initialization of the spark variable. Can you please explain more about this?
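If it helps: Databricks notebooks pre-create the session and expose it as the variable spark, which is why no initialization appears in the notebook. Outside Databricks you would create it yourself, roughly:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("loan-analysis").getOrCreate()
```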
@prashanthprasanna1484
@prashanthprasanna1484 3 years ago
Awesome!! Very useful.
@devarajuessampally1338
@devarajuessampally1338 3 years ago
Thank you for the dataset.
@venkatesanp2240
@venkatesanp2240 4 years ago
Sir, how do I count nulls in PySpark like the pandas command, and how do I delete an entire column?
@AIEngineeringLife
@AIEngineeringLife 4 years ago
Venkat, it is there in my video in case you have missed it: df_sel.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df_sel.columns]).show() To delete a column you can use df.drop(<column name>).
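For anyone copying that snippet, a self-contained version with the imports it needs (the dropped column names are placeholders):

```python
from pyspark.sql.functions import count, when, isnan, col

df_sel.select(
    [count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df_sel.columns]
).show()

df_sel = df_sel.drop("desc", "url")   # placeholder column names to delete
```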
@hemaswaroop7970
@hemaswaroop7970 4 years ago
All the topics you have covered in the Spark series here: how close are they to real-time projects (at MNCs like IBM, CTS, Google, etc.)? Just asking.
@AIEngineeringLife
@AIEngineeringLife 4 years ago
Hema, the "Master Spark" course I have on my channel was created to bring out real-world scenarios that one faces in industry. It takes a use-case based approach rather than a function or API based approach. Many professionals working with Spark have also benefited from this course, as they were able to upskill themselves in the specific areas they had to work on. I am saying this not because I created it; you can compare the coverage with other courses and pick the one that works for you.
@hemaswaroop7970
@hemaswaroop7970 4 years ago
@@AIEngineeringLife Thanks, Srivatsav! Looking forward to learning more from your channel.
@teja2775
@teja2775 4 years ago
Hi sir, I'm not understanding the exact purpose of using Spark. As per my understanding, in one word, Spark is used for data analysis or data preparation. Am I correct?
@AIEngineeringLife
@AIEngineeringLife 4 years ago
Spark is used for the end-to-end pipeline, starting from data processing (cleaning, preparation) through machine learning or advanced analytics. The reason we need Spark is that when your input data grows, typical tools like pandas start failing to handle the volume and computation. Spark can work on TBs of data, while pandas is limited to a few GBs if you are looking at large-scale ML computation.
@teja2775
@teja2775 4 years ago
@@AIEngineeringLife Thank you so much sir, finally my doubt is cleared.
@srikd9829
@srikd9829 4 years ago
Thank you very much for the very informative videos. Could you please let us know what programming language(s) are used in this video? Is it Spark, Scala, PySpark, or Spark SQL? (I don't know any of these.) I only know Python, including NumPy and pandas. So, would you recommend knowing the relevant languages as a prerequisite, so that I feel at ease when a real-world problem is given? Or any courses you recommend would also be fine. Thank you.
@AIEngineeringLife
@AIEngineeringLife 4 years ago
Most of my Spark videos are on PySpark and Spark SQL. Python is a good start, as PySpark syntax is similar to pandas with slight variations; the only thing is that it is distributed. You can check my entire Spark course on youtube to learn Spark: kzbin.info/aero/PL3N9eeOlCrP5PfpYrP6YxMNtt5Hw27ZlO
@srikd9829
@srikd9829 4 years ago
@@AIEngineeringLife Ok sure. Thank you for the swift response.
@mustakahmad383
@mustakahmad383 2 years ago
Hello @aiengineering, I have a question: is it possible to run an Oracle MERGE statement in the Oracle database (Oracle tables) using Python libraries, for example via "spark.sql.write.format('jdbc').options"...?
@gayathriv90
@gayathriv90 3 years ago
Sir, I have a requirement where I have reusable code to run for different files, and I need to pass the filename from blob storage to the code as a parameter. Can you help me?
@AIEngineeringLife
@AIEngineeringLife 3 years ago
What is the problem you are facing? You can pass the filename as a runtime parameter to Spark and trigger multiple Spark jobs with different file names.
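One common way to do that, sketched under the assumption of a spark-submit style job (script and path names are illustrative); in a Databricks notebook, dbutils.widgets can serve the same purpose:

```python
# clean_loans.py  -- run as: spark-submit clean_loans.py /mnt/blob/loans_2018Q4.csv
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
input_path = sys.argv[1]                 # filename passed at submit time
df = spark.read.csv(input_path, header=True, inferSchema=True)
```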
@dhruvsharma9065
@dhruvsharma9065 3 years ago
What's the third argument in the regexp_replace function?
@AIEngineeringLife
@AIEngineeringLife 3 years ago
It is what we need to replace the matched pattern in the second argument with.
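In other words, regexp_replace(column, pattern, replacement): the third argument is the string substituted for every match. A small illustrative sketch (the column name and pattern are assumptions):

```python
from pyspark.sql.functions import regexp_replace, col

df = df.withColumn("emp_length_clean",
                   regexp_replace(col("emp_length"), "years?|\\+|<", ""))
```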
@ankushojha5089
@ankushojha5089 4 years ago
Hello Sir, as you said in this video, the data type needs to be changed manually while using the 'corr' and 'cov' functions. Please help me: how can we change the data type?
@AIEngineeringLife
@AIEngineeringLife 4 years ago
Kush.. you can use a custom schema when you load the data, or after loading you can use cast or astype to change from one data type to another.
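A short sketch of the cast-after-loading approach for corr/cov (the column names come from the dataset, but the exact step is an assumption):

```python
from pyspark.sql.functions import col

df_num = (df.withColumn("annual_inc", col("annual_inc").cast("double"))
            .withColumn("loan_amnt", col("loan_amnt").cast("double")))
print(df_num.stat.corr("annual_inc", "loan_amnt"))
print(df_num.stat.cov("annual_inc", "loan_amnt"))
```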
@ankushojha5089
@ankushojha5089 4 years ago
@@AIEngineeringLife Thank you Sir for guiding me at each step. I have done it and can now cast data types.
@anandruparelia8970
@anandruparelia8970 4 years ago
For pandas in Python, do we have something dedicated like regexp_extract, or something similar that cleans data from within the values, or does the conventional regex approach have to be employed?
@AIEngineeringLife
@AIEngineeringLife 4 years ago
Check this: kzbin.info/www/bejne/Zoebk4RtZa2AZrc
@anandruparelia8970
@anandruparelia8970 4 years ago
@@AIEngineeringLife Thanks :)
@vishalmishra1937
@vishalmishra1937 3 years ago
How do you validate the schema in Spark against records in text files if every record follows a different schema, and how do you separate the records per schema?
@ankushojha5089
@ankushojha5089 4 years ago
Thank you Sir for responding to my comments and clearing my doubt. I have one more doubt: I am using the regexp_replace function, and when changing the order of the strings I see 2 different outputs. With 'Years' in first place, it is trimmed completely from the output, but if I interchange them, with 'Years' in 2nd place and 'Year' in first place, the 's' is not trimmed in the output. Please refer to the screenshot :)
@AIEngineeringLife
@AIEngineeringLife 4 years ago
Kush.. I did not get any screenshot here.
@christineeee96
@christineeee96 4 years ago
@@AIEngineeringLife Sir, I keep getting attribute errors like "'NoneType' object has no attribute 'groupby'".
@shaikrasool1316
@shaikrasool1316 4 years ago
Sir, I would like to put this lending club type of problem as a project on my resume. Can I consider the same columns even for other similar kinds of projects?
@AIEngineeringLife
@AIEngineeringLife 4 years ago
Shaik, you can, but if you are looking to expand it with external data points then you can check my video - kzbin.info/www/bejne/iJzCn3qdqLWEf6s In that video I show how you can use external data sources and combine them with a Lending Club kind of dataset.
@raghuramsharma2603
@raghuramsharma2603 4 years ago
I'm trying to import the file and create the df as mentioned, and I get the error below. Can you please suggest what I missed? Error in SQL statement: ParseException: mismatched input 'file_location' expecting {'(', 'CONVERT', 'COPY', 'OPTIMIZE', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0) == SQL == file_location = "/FileStore/tables/LoanStats_2018Q4-2.csv" ^^^ file_type = "csv"
@AIEngineeringLife
@AIEngineeringLife 4 years ago
Raghu, is the file path that you loaded correct? Can you check whether it is similar to the below or missing something:
# File location and type
file_location = "/FileStore/tables/LoanStats_2018Q4.csv"
file_type = "csv"
# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","
# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)
display(df)
@raghuramsharma2603
@raghuramsharma2603 4 years ago
@@AIEngineeringLife Yes, I double-checked, and the file path is correct...
@raghuramsharma2603
@raghuramsharma2603 4 years ago
@@AIEngineeringLife Never mind, I used the "create table in notebook" option that Databricks provides and it worked.. strange.. thanks for your reply.
@AIEngineeringLife
@AIEngineeringLife 4 years ago
@@raghuramsharma2603 Great, it worked. That is how I got the loading part as well; I used the Databricks-provided one :). All the best for the remaining tutorial.
@raghuramsharma2603
@raghuramsharma2603 4 years ago
@@AIEngineeringLife Thank you... great work uploading these videos, very helpful...
@demidrek-heyward
@demidrek-heyward 3 years ago
thanks!
@Anupamk36
@Anupamk36 1 year ago
Can you please share the link to the csv?
@imohitr888
@imohitr888 4 years ago
I used this to convert a string to an integer:
from pyspark.sql.types import IntegerType
df = df.withColumn("loan_amnt", df["loan_amnt"].cast(IntegerType()))
I can see in the schema that loan_amnt has now changed to int type, but when I run the command below:
quantileProbs = [0.25, 0.5, 0.75, 0.9]
relError = 0.05
df_sel.stat.approxQuantile("annual_inc", quantileProbs, relError)
I get the error "java.lang.IllegalArgumentException: requirement failed: Quantile calculation for column annual_inc with data type StringType is not supported." Can you please help here?
@AIEngineeringLife
@AIEngineeringLife 4 years ago
Mohit, I see you have done the cast for loan amount but are using annual_inc in the quantile. Can you do the cast for annual_inc as well and see?
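A sketch of that fix: approxQuantile also requires a numeric column, so cast annual_inc before calling it.

```python
from pyspark.sql.functions import col

df_sel = df_sel.withColumn("annual_inc", col("annual_inc").cast("double"))
df_sel.stat.approxQuantile("annual_inc", [0.25, 0.5, 0.75, 0.9], 0.05)
```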
@imohitr888
@imohitr888 4 years ago
@@AIEngineeringLife Hey, I did the casting for both already and am still getting the same error :/
@imohitr888
@imohitr888 4 years ago
@@AIEngineeringLife It worked now :) thank you.
@anshapettugari
@anshapettugari 11 months ago
Where can I access this code?
@revathis2844
@revathis2844 4 years ago
Hi Sir, instead of a string regex, how do I do a numeric regex? For example: a username like abc12def, where I need only the characters, i.e. abcdef. Could you please help me?
@AIEngineeringLife
@AIEngineeringLife 4 years ago
You can search for the [0-9] regex pattern and replace it with an empty string. That way the output contains only alphabets.
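A one-line sketch of that suggestion (the column name is hypothetical):

```python
from pyspark.sql.functions import regexp_replace, col

df = df.withColumn("username_alpha", regexp_replace(col("username"), "[0-9]", ""))
# "abc12def" -> "abcdef"
```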
@revathis2844
@revathis2844 4 years ago
@@AIEngineeringLife Thank you, it worked 🙂
@revathis2844
@revathis2844 4 years ago
I have another clarification, sir. Does Spark work with incremental loads? I have searched many sites but can't find a proper solution.
@AIEngineeringLife
@AIEngineeringLife 4 years ago
@@revathis2844 It is not straightforward in regular Spark, but Databricks has a functionality called Delta Lake. Check it out.
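For reference, a hedged sketch of an incremental upsert with Delta Lake on Databricks (the table path, key column, and incoming new_batch_df are all assumptions):

```python
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/mnt/delta/loans")
(target.alias("t")
   .merge(new_batch_df.alias("s"), "t.id = s.id")
   .whenMatchedUpdateAll()
   .whenNotMatchedInsertAll()
   .execute())
```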
@revathis2844
@revathis2844 4 years ago
@@AIEngineeringLife Thank you, sir.
@sankarshkadambari2742
@sankarshkadambari2742 4 years ago
Hello Sir, for the 2018Q4 data we have to slice the original Loan.csv, which contains (2260668, 145), right? Kaggle gave me a 2 GB zip file. The following are the steps I did on my local machine:
df = read the whole (2260668, 145) file
LoanStats_2018Q4 = df[(df['issue_d']=="Oct-2018") | (df['issue_d']=="Nov-2018") | (df['issue_d']=="Dec-2018")]
LoanStats_2018Q4.shape  # (128412, 145)
LoanStats_2018Q4.to_csv('/path/LoanStats_2018Q4.csv', index=False)
Then I will upload this to Databricks.
@AIEngineeringLife
@AIEngineeringLife 4 years ago
I just ran it with a subset so that users need not wait on the video for every instruction, but in your case you can use it all or subset it as well. The whole file would have made my video run for an additional hour :)
@Ravi-gu5ww
@Ravi-gu5ww 4 years ago
Sir, can you make a tutorial on functions like groupByKey, sortByKey, orderByKey, reduceByKey, join....?
@AIEngineeringLife
@AIEngineeringLife 4 years ago
Raviteja, I have already covered groupBy and orderBy. The ones you have mentioned are RDD functions, and going forward Spark is making dataframe functions primary. I am not sure you really need to learn the RDD functions, as 98% of the time the dataframe functions are easy and will do the job.
@imransharief2891
@imransharief2891 4 years ago
Hi sir, will you please make a proper playlist for this tutorial, because it's very confusing.
@AIEngineeringLife
@AIEngineeringLife 4 years ago
Hi Imran, have you seen the playlist below, where I am adding the videos in sequence? Please see if it helps: kzbin.info/aero/PL3N9eeOlCrP7_vt6jq7GdJz6CSFmtTBpI
@imohitr888
@imohitr888 4 years ago
Hi, when I ran df_sel.stat.cov('annual_inc', 'loan_amnt') I got this error: "java.lang.IllegalArgumentException: requirement failed: Currently covariance calculation for columns with dataType string not supported." I realised loan_amnt and annual_inc are showing as string in the schema. I followed all the steps as per the video. Can you point out what I missed? I saw your earlier schema commands; it looks like they showed integer in your video, but when I ran the schema command it shows these 2 columns as string, hence the error.
@imohitr888
@imohitr888 4 years ago
Can you tell me, in between the code, how I can change a specific column's schema from string to integer? What exact code should I execute?
@AIEngineeringLife
@AIEngineeringLife 4 years ago
Mohit, I see from your other message that you have figured it out. You have to use cast to convert the datatype.
@dineshvarma6733
@dineshvarma6733 4 years ago
Hello Sir, I appreciate your effort and time in teaching us. I am facing a "Job aborted" error when trying to create a permanent table at the end of the analysis. Is there a workaround for this?
@AIEngineeringLife
@AIEngineeringLife 4 years ago
Hi Dinesh, thank you. Can you please paste the error you are getting?
@dineshvarma6733
@dineshvarma6733 4 years ago
@@AIEngineeringLife org.apache.spark.SparkException: Job aborted. Py4JJavaError: An error occurred while calling o3407.saveAsTable. : org.apache.spark.SparkException: Job aborted. at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:201) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:192) at org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:555) at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.saveDataIntoTable(createDataSourceTables.scala:216) at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:175) at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108) at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106) at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:126) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:150) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:138) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$5.apply(SparkPlan.scala:191) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:187) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:117) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:115) at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:710) at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:710) at org.apache.spark.sql.execution.SQLExecution$$anonfun$withCustomExecutionEnv$1$$anonfun$apply$1.apply(SQLExecution.scala:112) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:217) at org.apache.spark.sql.execution.SQLExecution$$anonfun$withCustomExecutionEnv$1.apply(SQLExecution.scala:98) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:835) at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:74) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:169) at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:710) at org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:508) at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:487) at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:430) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380) at py4j.Gateway.invoke(Gateway.java:295) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:251) at java.lang.Thread.run(Thread.java:748) Caused by: 
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 76.0 failed 1 times, most recent failure: Lost task 0.0 in stage 76.0 (TID 974, localhost, executor driver): java.rmi.RemoteException: com.databricks.api.base.DatabricksServiceException: QUOTA_EXCEEDED: You have exceeded the maximum number of allowed files on Databricks Community Edition. To ensure free access, you are limited to 10000 files and 10 GB of storage in DBFS. Please use dbutils.fs to list and clean up files to restore service. You may have to wait a few minutes after cleaning up the files for the quota to be refreshed. (Files found: 17327); nested exception is: com.databricks.api.base.DatabricksServiceException: QUOTA_EXCEEDED: You have exceeded the maximum number of allowed files on Databricks Community Edition. To ensure free access, you are limited to 10000 files and 10 GB of storage in DBFS. Please use dbutils.fs to list and clean up files to restore service. You may have to wait a few minutes after cleaning up the files for the quota to be refreshed. (Files found: 17327)
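The root cause in the trace above is the QUOTA_EXCEEDED message from Databricks Community Edition. A rough cleanup sketch based on that message (paths are illustrative): list files with dbutils.fs and remove old uploads to free the quota.

```python
display(dbutils.fs.ls("/FileStore/tables"))
dbutils.fs.rm("/FileStore/tables/old_upload.csv")
dbutils.fs.rm("/user/hive/warehouse/old_table", recurse=True)
```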
@biswajitbastia2
@biswajitbastia2 4 years ago
What would be the alternative Scala code for this: df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()?
@AIEngineeringLife
@AIEngineeringLife 4 years ago
Biswajit, have you tried iterating over the columns and checking for null in each column in Scala? That is what I am doing in Python as well. I think a map function can do that. I will try it out and paste the exact syntax later in the week. Any reason for using Scala? From Spark 2.3 onwards, PySpark is on almost equal footing with Scala.
@biswajitbastia2
@biswajitbastia2 4 years ago
@@AIEngineeringLife We have been using Scala for all data pipeline jobs as it is faster than Python.
@biswadeeppatra1726
@biswadeeppatra1726 4 years ago
Sir, do you have a git repo for the code used in this project? If yes, then please share it.
@AIEngineeringLife
@AIEngineeringLife 4 years ago
Yes, but all modules might not be there, as I have uploaded only selected ones. Below is the link: github.com/srivatsan88/KZbinLI
@biswadeeppatra1726
@biswadeeppatra1726 4 years ago
@@AIEngineeringLife Thanks for sharing, sir. I don't find the code for the Spark program; please upload it if possible, it will really be a great help.
@sahil0094
@sahil0094 3 years ago
The describe and null count output is not readable most of the time; doesn't that pose a big problem in industry projects? I have a dataset with hundreds of columns, so how can I view describe or null counts for all of them in Spark?
@AIEngineeringLife
@AIEngineeringLife 3 years ago
Sahil, in Databricks we can use the formatted output, but in regular Spark, yes. In some cases we load it into a table and view it there to understand it.
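One workaround that is sometimes used for very wide frames (a sketch, not from the video): the describe() result itself is tiny, so it can be collected to pandas and transposed so that hundreds of columns become rows.

```python
summary_pdf = df.describe().toPandas().set_index("summary").T
print(summary_pdf.head(20))
```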
@sahil0094
@sahil0094 3 years ago
@@AIEngineeringLife Okay, thanks!
@Shahzada1prince
@Shahzada1prince 4 years ago
Do you have these commands written somewhere?
@AIEngineeringLife
@AIEngineeringLife 4 years ago
You can check my git repo - github.com/srivatsan88/Mastering-Apache-Spark
@Shahzada1prince
@Shahzada1prince 4 years ago
@@AIEngineeringLife Thanks.
@bharadwajiyer3504
@bharadwajiyer3504 4 years ago
How do I use a subset of the loan data? The original dataset is too large (2 GB) and takes time to upload to Databricks.
@AIEngineeringLife
@AIEngineeringLife 4 years ago
Bharadwaj, the best option is to download it and split it in Spark or Unix.
@venkatasaireddyavuluri2051
@venkatasaireddyavuluri2051 4 years ago
Sir, can you give the link to the dataset?
@AIEngineeringLife
@AIEngineeringLife 4 years ago
Venkata, it is from Kaggle - www.kaggle.com/wendykan/lending-club-loan-data
@towardsmlds9130
@towardsmlds9130 4 years ago
The dataset I have for Lending Club has noise in the first row, and the header starts from row 2. I am not able to skip the first row and set the 2nd row as the header. Any input on how to do this?
@AIEngineeringLife
@AIEngineeringLife 4 years ago
Is skiprows in read_csv not working for you?
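A sketch of that suggestion, assuming the junk occupies only the first line: read with pandas using skiprows and hand the result to Spark (fine for files that fit in memory).

```python
import pandas as pd

pdf = pd.read_csv("LoanStats_2018Q4.csv", skiprows=1, low_memory=False)
sdf = spark.createDataFrame(pdf)
```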
@towardsmlds9130
@towardsmlds9130 4 years ago
@@AIEngineeringLife I didn't know about it; I will see if it works.
@seetharamireddybeereddy222
@seetharamireddybeereddy222 4 years ago
And how do we handle delta or incremental loads in PySpark?
@AIEngineeringLife
@AIEngineeringLife 4 years ago
I was actually not planning to cover ingestion of data to show incremental loads, but I will see if I can in the future.
@uskhan7353
@uskhan7353 4 years ago
Can you show a more mechanical way of doing feature selection?
@AIEngineeringLife
@AIEngineeringLife 4 years ago
Usman, are you referring to manual feature selection?
@uskhan7353
@uskhan7353 4 years ago
@@AIEngineeringLife Yes, by using any feature importance technique.
@dinavahikalyan4929
@dinavahikalyan4929 3 years ago
Sir, can you please share this whole notebook?
@AIEngineeringLife
@AIEngineeringLife 3 years ago
It is in this folder - github.com/srivatsan88/Mastering-Apache-Spark
@dinavahikalyan4929
@dinavahikalyan4929 3 years ago
@@AIEngineeringLife Thank you, sir.
@dipanjanghosh6862
@dipanjanghosh6862 3 years ago
Hello Sir. I have given file_location = r"C:/Users/dipanja/Desktop/data science/LoanStats_2018.csv". This is the path of the loanstats csv file on my system, but while trying to execute it I am getting the error 'java.io.IOException: No FileSystem for scheme: C'. Can you please help me fix this?
@AIEngineeringLife
@AIEngineeringLife 3 years ago
Are you using Databricks Spark or Spark on your local system?
@dipanjanghosh6862
@dipanjanghosh6862 3 years ago
@@AIEngineeringLife Hi sir, I resolved the issue. Thanks!
@Ajeetsingh-uy4cy
@Ajeetsingh-uy4cy 4 years ago
We are using this command: df = spark.read.format(). I haven't worked with Spark, but from the syntax I can say that this is Spark's method of reading a DataFrame. We are typing this command in a Jupyter notebook, which by default is Python-compatible; to use other languages we have to use a magic command at the top. Then how are we able to use Spark in Python? Is this PySpark, or something else?
@AIEngineeringLife
@AIEngineeringLife 4 years ago
Ajeet, PySpark is enabled by default in the notebook, so you get the Python packages loaded in Databricks by default; for other languages we need the magic command. I did not get your question completely, though.
@Ajeetsingh-uy4cy
@Ajeetsingh-uy4cy 4 years ago
@@AIEngineeringLife You solved my doubt, though. My question was "how are we using Spark in Python without using a magic command?", and as your answer suggests, it's PySpark that we are using and not Spark directly.
@owaisshaikh3983
@owaisshaikh3983 4 years ago
I believe if you are doing this exercise in Python you should have used spark.read.load rather than the Scala-style syntax spark.read.format.
@seetharamireddybeereddy222
@seetharamireddybeereddy222 4 years ago
Can you make one video on PySpark on Google Cloud?
@AIEngineeringLife
@AIEngineeringLife 4 years ago
I will try to do it as part of the cloud series. The Spark job is the same, but I will show how to run it on Cloud Dataproc.
@seetharamireddybeereddy222
@seetharamireddybeereddy222 4 years ago
@@AIEngineeringLife Thank you.
@suprobhosantra
@suprobhosantra 3 years ago
Can we have the notebook on GitHub or somewhere?
@AIEngineeringLife
@AIEngineeringLife 3 years ago
Yes. You can check it under the Spark course in the git repo below - github.com/srivatsan88
@praveenprakash143
@praveenprakash143 4 years ago
Hi sir, I would like to learn Spark for a DE role. Can you mentor me? I am looking for a paid mentor.
@AIEngineeringLife
@AIEngineeringLife 4 years ago
I have an entire course on Apache Spark which is free. Why would you want to pay for mentorship when I have covered all that is required? - kzbin.info/aero/PL3N9eeOlCrP5PfpYrP6YxMNtt5Hw27ZlO Just practice along with the videos and you should be good.
@ranjitkumarthakur5123
@ranjitkumarthakur5123 4 years ago
Sir, please share the notebook.
@AIEngineeringLife
@AIEngineeringLife 4 years ago
Hi RK, I have mentioned in the FAQ below the github link and the scenarios in which I will be sharing notebooks: www.linkedin.com/pulse/course-launch-scaling-accelerating-machine-learning-srinivasan/ In some cases I will share the notebook in my git link a few months after the video. Sorry in case you don't get the notebook immediately after the video in some cases.
@hakunamatata-qu7ft
@hakunamatata-qu7ft 4 years ago
Unable to get the data from git, please help.
@AIEngineeringLife
@AIEngineeringLife 4 years ago
Manoj, can you check the pinned comment of this video? I have given the link to the dataset there. This dataset is huge, so I could not push it to git due to the size limit.
@hakunamatata-qu7ft
@hakunamatata-qu7ft 4 years ago
Thank you so much for the reply. I want to get trained under your guidance; could you help me? Could you please tell me how to start with your video lectures, i.e. the order to follow, as I am a beginner?
@AIEngineeringLife
@AIEngineeringLife 4 years ago
Manoj, if you go to my channel and then the playlist tab, you can see multiple playlists. Pick the area of your interest. To start with, you can learn from the end-to-end ML playlist, which talks about the lifecycle of ML projects. It is purely theory but good to know before getting into the details.
@hakunamatata-qu7ft
@hakunamatata-qu7ft 4 years ago
Thank you so much for your response, but I want to become an end-to-end full-stack practitioner, so please help me with the order of your playlists to follow. I am from a banking background, so please do help me with the transition.
@AIEngineeringLife
@AIEngineeringLife 4 years ago
Manoj, I do not have video coverage of the basics of ML, so I would suggest going through the Andrew Ng ML course on Coursera; that will be helpful. Once done, you can check my courses on NLP, Computer Vision and Time Series.
@hemanthdevarapati519
@hemanthdevarapati519 4 years ago
By the looks of it, Databricks is using Zeppelin-like notebooks.
@AIEngineeringLife
@AIEngineeringLife 4 years ago
Yes Hemanth, it is pretty similar to Zeppelin, but I think Databricks have their own custom one that resembles it.
@hemanthdevarapati519
@hemanthdevarapati519 4 years ago
One question: shouldn't we use an action after df.cache() to actually cache the data, as it works on lazy evaluation? Something like df.cache().count().
@AIEngineeringLife
@AIEngineeringLife 4 years ago
@@hemanthdevarapati519 Yes, it is lazy evaluation. It will get loaded into the cache when the first subsequent action is called further down. I thought I had some action somewhere below; is that not the case? I might not be doing it explicitly along with the cache command.
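A tiny sketch of the point being discussed: cache() only marks the dataframe, and the first action materializes it.

```python
df_sel.cache()
df_sel.count()               # first action populates the cache
df_sel.describe().show()     # later actions read from memory
```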
@hemanthdevarapati519
@hemanthdevarapati519 4 years ago
@@AIEngineeringLife Yeah, that makes sense. It was a very intuitive video; I enjoyed every bit of it. Thank you for all your efforts, Srivatsan. (Y)
@seemunyum832
@seemunyum832 2 years ago
The cluster I am creating is taking forever... anyone else having this problem? :(
@atruismoti2402
@atruismoti2402 2 years ago
As a piece of advice, could you please speak more slowly? It is difficult to understand you.