Many have asked for the file I used for this video. You can download it from here - drive.google.com/file/d/1e6phh7Df8mzYoE-sBXPVJklnSt_wHwkq/view?usp=sharing. Remove the last 2 lines from the csv file.
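If you want to drop those trailing lines programmatically before uploading, a minimal sketch (file names are illustrative):
# trim the 2 trailing summary lines from the downloaded CSV
with open("LoanStats_2018Q4.csv") as src:
    lines = src.readlines()
with open("LoanStats_2018Q4_clean.csv", "w") as dst:
    dst.writelines(lines[:-2])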
@vivekpuurkayastha15804 жыл бұрын
Is it the %sql command that makes SQL statements available? In that case, is it only possible to use SQL when using Databricks, thus making SQL unavailable in Python scripts? Please correct me and also provide any input you may have...
@harshithag57693 жыл бұрын
Hello Sir, did you set any permissions on this file? I am unable to open it. I tried to open it in OneDrive Office Online; it says conversion error.
@AIEngineeringLife3 жыл бұрын
@@harshithag5769 Nope it is open for all. Can you open it in google drive and see
@harshithag57693 жыл бұрын
@@AIEngineeringLife I don't have Excel on my PC. I tried opening it through OneDrive Live or Google's online Excel; both say there is an error opening the file. I was able to open your other files which you have provided in the GitHub repository. When I tried to upload the file to a dataset in Databricks, it's throwing the error.
@MachalaPuli2 жыл бұрын
How can we see the entity-relationship diagram in Databricks or PySpark, just like we see it in MySQL? Please help me with this.
@himanshupatanwala094 жыл бұрын
Thank you so much, Sir. Millions of blessings from every student who watches this. I was looking for some real resources to learn Spark, and your content saved a lot of effort from going to waste and put it in the right direction. Thanks a lot, and please never stop creating such wonderful content for us.
@AIEngineeringLife4 жыл бұрын
You are welcome Himanshu and Thanks for such nice and encouraging words to drive me create more such content :)
@ijeffking4 жыл бұрын
This is HUGE! Gems of wisdom for a Machine learning aspirant. Excellent. Thank you very much.
@KennyJacobson14 жыл бұрын
I've been trying to get up to speed on Databricks and Spark for two weeks now, and I just learned 10x as much in 1 hour as I did in the previous 2 weeks. Thank you!
@AIEngineeringLife4 жыл бұрын
Glad to know Kenny this was useful. All the best on your learning journey
@norpriest5214 жыл бұрын
So how is the Databricks service? I mean, if I use it, what's their billing policy? Pay per use, pay per activity, pay per minutes/hours of use, or pay per data size? Can you let me know?
@nikhilkumarjha4 жыл бұрын
Thanks for the great tutorial. The Data Science community needs more people like you SS :)
@DiverseDestinationsDiaries3 жыл бұрын
This is the best video I have ever seen on KZbin with respect to real-time scenarios.... Thanks a lot Sir. Please do more to help us..
@rlmclaughlinmusic3 жыл бұрын
Truly love your channel! Such a wealth of information and brilliantly explained. Thank you for providing this real world example. It was exactly what I needed to elevate my spark skills. You're a terrific instructor.
@sandeepsankar63533 жыл бұрын
Superb video. In 40 minutes, you covered pretty much everything :) . Please upload more videos
@AIEngineeringLife3 жыл бұрын
I have a complete course on Apache spark in my playlist section of youtube channel. Have you seen it?
@sandeepsankar63533 жыл бұрын
@@AIEngineeringLife Yep. seen and subscribed as well :)
@The_Bold_Statement3 жыл бұрын
Tremendous effort and knowledge can be seen in your video. Thank you
@shubhamtripathi51384 жыл бұрын
Awesome. It's not a video series, it's an entire course, I must say. I really appreciate your hard work and the teaching technique, thanks, Sir, keep it up. One request, I think from many students like me: please upload the notebook, Sir, so that it will be a little time-saving too. Thanks
@AIEngineeringLife4 жыл бұрын
Thanks Shubham.. :) .. The code is already available in my git repo - github.com/srivatsan88/Mastering-Apache-Spark
@norpriest5214 жыл бұрын
@@AIEngineeringLife Hi thank you for your video. Just wanna ask if this video is about data profiling or data wrangling using Pyspark?
@AIEngineeringLife4 жыл бұрын
@@norpriest521 this video is more on profiling/ cleaning of data but I have detailed videos on wrangling in my apache Spark playlist
@norpriest5214 жыл бұрын
@@AIEngineeringLife I couldn't find the video regarding data wrangling in your list. Could you please let me know the title of the video?
@AIEngineeringLife4 жыл бұрын
@@norpriest521 It is named as data engineering in this playlist. There are 2 parts to it kzbin.info/aero/PL3N9eeOlCrP5PfpYrP6YxMNtt5Hw27ZlO
@purushothamchanda8984 жыл бұрын
Wonderful video sir, was looking for such content since many days. thanks a ton
@AIEngineeringLife4 жыл бұрын
👍
@cloudmusician3 жыл бұрын
Excellent content, precise and to the point
@CDALearningHub3 жыл бұрын
Thanks very much for detailed hands on. It helps!
@beyourbest1994 жыл бұрын
This is amazing, waiting for your next video on Spark Analysis and cleaning
@AIEngineeringLife4 жыл бұрын
Next part on EDA using spark is here kzbin.info/www/bejne/jmeynIdojrWNjNU
@sahil00943 жыл бұрын
df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show() and df.filter(df.col.isNull()).count() should return the same result, right? The first command is giving 0 nulls whereas the second one is giving some nulls for the same column. Can you please help?
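The two checks are not equivalent: isnan() only flags the floating-point NaN value, while isNull() flags SQL nulls, so the counts can differ. A sketch that counts both together (the same pattern used later in this thread):
from pyspark.sql.functions import count, when, isnan, col

# per-column count of values that are either NaN or NULL
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()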
@insane_billa3 жыл бұрын
Now, this is what I was looking for :)
@prafulmaka77103 жыл бұрын
Amazing explanation!! Thank you very much!
@soumyagupta93013 жыл бұрын
Where did you create the spark session? I don't see the initialization of the spark variable. Can you please explain more on this?
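For anyone wondering: in Databricks notebooks the `spark` session is created for you when the notebook attaches to a cluster; outside Databricks you would build it yourself. A minimal sketch (the app name is illustrative):
from pyspark.sql import SparkSession

# only needed in a standalone script; Databricks already exposes `spark`
spark = SparkSession.builder.appName("loan-analysis").getOrCreate()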
@saitejap78763 жыл бұрын
Thank you for the tutorial. I am just curious: while dealing with the revol_util column, we find the average while the column is still a string, use it to replace "null" values, and then cast it to "double". Will there be a difference if we cast the values to "double" first, then select the average value and replace the nulls? Hoping to get your insights on this.
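For reference, a sketch of the cast-first order (assuming the raw values look like "45.3%", as in the video); since avg() skips nulls, the mean should come out the same either way:
from pyspark.sql.functions import regexp_replace, col, avg

# strip the % sign, cast to double, then impute nulls with the column mean
df2 = df.withColumn("revol_util", regexp_replace(col("revol_util"), "%", "").cast("double"))
mean_val = df2.select(avg(col("revol_util"))).first()[0]
df2 = df2.na.fill({"revol_util": mean_val})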
@tanishasharma36654 жыл бұрын
Sir, I had a question: why are we creating a temporary table every time for SQL functions? In PySpark, the main advantage is that we can use the SQL-like functions directly on the DataFrame as well, for example: loan_dfsel.groupby("loan_status").count().orderBy(col("Count").desc()).show() where 'loan_dfsel' is a DataFrame. Please enlighten me if I'm wrong....
@AIEngineeringLife4 жыл бұрын
Tanisha.. Is your question why I am using DataFrame functions rather than SQL functions in PySpark? If so, then yes, SQL is an easy way of processing data in Spark, but for iterative processing DataFrame functions are very powerful and simple. Typically in projects we use a combination of SQL as well as DataFrame functions. In this case I wanted to show DataFrame functions, but in future videos I have covered SQL as well. Underneath, both SQL and DataFrame operations compile to the same plan, so performance might not differ.
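A small sketch of the two styles side by side (assuming loan_dfsel is already loaded); calling .explain() on each shows they produce the same plan:
from pyspark.sql.functions import col

# DataFrame API
loan_dfsel.groupBy("loan_status").count().orderBy(col("count").desc()).show()

# Same query through SQL on a temporary view
loan_dfsel.createOrReplaceTempView("loan_tbl")
spark.sql("SELECT loan_status, COUNT(*) AS cnt FROM loan_tbl GROUP BY loan_status ORDER BY cnt DESC").show()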
@aishwaryagopal55533 жыл бұрын
Hi, Thank you for the informative videos. I'm just getting started with Spark. The code seems easy to understand. What are other aspects of Spark that I should read through for a better understanding?
@iamhappy77594 жыл бұрын
Thanks for the amazing video. Would you provide a link to the notebook for practice?
@AIEngineeringLife4 жыл бұрын
Thanks.. Should be in this repo - github.com/srivatsan88/Mastering-Apache-Spark
@iamhappy77594 жыл бұрын
@@AIEngineeringLife Thank you so much
@rahulraoshindek1313 жыл бұрын
Thank you so much sir for this detailed video.
@kunalr_ai3 жыл бұрын
I need to connect to Azure Data Lake and load the data here in Databricks. Do you have any document that supports this?
@AIEngineeringLife3 жыл бұрын
Kunal.. Nope, I have not done any on Azure yet.
@aleksei_cherniaev3 жыл бұрын
that is awesome content, big thanks, man !
@CRTagadiya4 жыл бұрын
Thanks for your work. Could you please upload all the videos of this series to a playlist?
@AIEngineeringLife4 жыл бұрын
Check this playlist. It is available in it and will be updating upcoming videos into it as well kzbin.info/aero/PL3N9eeOlCrP7_vt6jq7GdJz6CSFmtTBpI
@prometeo343 жыл бұрын
This is amazing. Thank you very much!
@dhruvsharma90653 жыл бұрын
What's the 3rd argument in regexp_replace function?
@AIEngineeringLife3 жыл бұрын
It is what we replace the matched pattern (the second argument) with.
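In other words, regexp_replace(column, pattern, replacement): the third argument is the string substituted wherever the pattern matches. An illustrative example (the column name is assumed):
from pyspark.sql.functions import regexp_replace, col

# turn "36 months" into "36": the pattern " months" is replaced with the empty string
df = df.withColumn("term", regexp_replace(col("term"), " months", ""))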
@ankushojha50894 жыл бұрын
Hello Sir, as in this video you said the data type needs to be changed manually while using the 'corr' & 'cov' functions.. Please help me: how can we change the data type?
@AIEngineeringLife4 жыл бұрын
Kush.. you can use a custom schema when you load the data, or after loading you can use cast or astype to change from one data type to another.
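A quick sketch of both options mentioned above (column names are taken from the lending club data, adjust as needed):
from pyspark.sql.functions import col

df = df.withColumn("annual_inc", col("annual_inc").cast("double"))   # cast() after loading
df = df.withColumn("loan_amnt", df["loan_amnt"].astype("integer"))   # astype() is an alias for cast()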
@ankushojha50894 жыл бұрын
@@AIEngineeringLife Thank you Sir for guiding me on each step.. I have done it and am able to cast the data type now.
@deeptigupta5183 жыл бұрын
These videos are awesome and a great help. Thanks. You are doing a wonderful job for people like us who have entered the industry. Just wanted to ask: I have worked on ML models using Python but have not worked in Apache Spark. Will I face any difficulty doing the same thing here in Spark?
@AIEngineeringLife3 жыл бұрын
Nope. It should be a smooth transition. Remember, the Spark ML pipeline is inspired by scikit-learn pipelines, so the process is similar. The only time-consuming part will be understanding distributed architectures, which might take time.
@anandruparelia89704 жыл бұрын
For Pandas in Python, do we have something dedicated like a regex extract or something that cleans data from within the values, or does conventional regex have to be employed?
@AIEngineeringLife4 жыл бұрын
Check this kzbin.info/www/bejne/Zoebk4RtZa2AZrc
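For a quick pandas-side sketch of the same idea, the vectorised string methods cover it (the sample values are made up):
import pandas as pd

s = pd.Series(["abc12def", "10+ years"])
print(s.str.replace(r"\d", "", regex=True))   # strip digits from within the values
print(s.str.extract(r"(\d+)"))                # pull out the numeric part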
@anandruparelia89704 жыл бұрын
@@AIEngineeringLife Thanks :)
@mustakahmad3833 жыл бұрын
Hello @aiengineering, I have a question, is it possible to run an Oracle MERGE statement in the Oracle database (Oracle tables) using Python libraries such as for example "spark.sql.write.format('jdbc').options"...
@chwaleedsial4 жыл бұрын
A suggestion: when you load the dataset, if it is not the same as the one shared on Kaggle, please also let us know what transformations and filtering you have performed so that we can get the same or similar results as we follow along.
@AIEngineeringLife4 жыл бұрын
I am sorry if the dataset is not the same. I did not do any transformation; rather, I downloaded it from Lending Club directly at the link below: www.lendingclub.com/info/statistics.action Earlier it was downloadable for all, but after I downloaded it some time back they made it sign-in based, and hence I referred to Kaggle thinking it should be similar. But from my end I did not make any changes to the dataset I got from Lending Club. Are you facing any particular issue? A few have reached out in the past with some clarifications and were able to execute all the commands successfully.
@teja27754 жыл бұрын
Hi sir, I'm not understanding what the exact purpose of using Spark is. As per my understanding, in a one-word answer, Spark is used for data analysis or data preparation. Am I correct....?
@AIEngineeringLife4 жыл бұрын
Spark is used for the end-to-end pipeline, starting from data processing (cleaning, preparation) till machine learning or advanced analytics. The reason we need Spark is when your input data grows to the point where typical tools like Pandas start failing to handle the volume and computation. Spark can work on TBs of data, while Pandas is limited to a few GBs if you are looking at large-scale ML computation.
@teja27754 жыл бұрын
@@AIEngineeringLife Thank you so much sir finally my doubt is cleared
@venkatesanp22404 жыл бұрын
Sir, how do we count nulls in PySpark like the pandas command, and how do we delete an entire column?
@AIEngineeringLife4 жыл бұрын
Venkat, it is there in my video in case you missed it:
df_sel.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df_sel.columns]).show()
To delete an entire column you can use df.drop("<column_name>").
@vishalmishra19373 жыл бұрын
How do we validate a schema in Spark against records in text files if every record follows a different schema, and how do we separate the records as per their schema?
@AniketKumar-ij3ew4 жыл бұрын
Great stuff, really enjoying the hands-on videos. I have one input, not a big constraint however: I guess in the last part, when you are creating the permanent table, the data frame should be df_sel_final instead of df_sel.
@AIEngineeringLife4 жыл бұрын
Thank you Aniket.. You are right.. Maybe did it in a hurry.. Good catch :)
@revathis28444 жыл бұрын
Hi Sir, instead of a string regex, how do we do a numeric regex? For example: a username having abc12def, where I need only the characters, i.e. abcdef. Could you please help me?
@AIEngineeringLife4 жыл бұрын
You can search for [0-9] regex pattern and replace it with empty string. That way the output is only alphabets
@revathis28444 жыл бұрын
@@AIEngineeringLife thank you it worked🙂
@revathis28444 жыл бұрын
I have another clarification sir.. Does Spark work with incremental loads? I have searched many sites but can't find a proper solution.
@AIEngineeringLife4 жыл бұрын
@@revathis2844 It is not straightforward in regular Spark, but Databricks has a functionality called Delta Lake. Check it out.
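A rough sketch of the Delta Lake route for incremental loads (the path, the key column "id", and the incoming new_batch_df are assumptions):
from delta.tables import DeltaTable

# initial full load written as a Delta table
df.write.format("delta").mode("overwrite").save("/delta/loans")

# later batches are upserted with MERGE
target = DeltaTable.forPath(spark, "/delta/loans")
(target.alias("t")
    .merge(new_batch_df.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())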
@revathis28444 жыл бұрын
@@AIEngineeringLife thank you sir
@hemaswaroop79704 жыл бұрын
All the topics you have covered in the Spark series here.. how close they are when it comes to the real-time projects (MNCs like IBM, CTS, Google etc.) - just asking
@AIEngineeringLife4 жыл бұрын
Hema, the "Master Spark" course I have on my channel was created to bring out the real-world scenarios that one faces in industry. It takes a use-case-based approach rather than a function- or API-based approach. Many working professionals in Spark have also benefited from this course, as they were able to upskill themselves on the specific areas they had to work on. I am saying this not because I created it; you can compare the coverage with other courses and pick one that works for you.
@hemaswaroop79704 жыл бұрын
@@AIEngineeringLife Thanks, Srivatsav! Looking forward to learning more from your channel
@biswajitbastia24 жыл бұрын
What will be the Scala alternative for this code: df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()
@AIEngineeringLife4 жыл бұрын
Biswajit.. Have you tried iterating over the columns and checking for null in each column in Scala? That is what I am doing in Python as well. I think the map function can do that. I will try it out and paste the exact syntax later in the week. Any reason for using Scala? From Spark 2.3 and above, PySpark is almost on an equal footing with Scala.
@biswajitbastia24 жыл бұрын
@@AIEngineeringLife We have been using Scala for all data pipeline jobs as it is faster than Python.
@shaikrasool13164 жыл бұрын
Sir, I would like to put this Lending Club type of project in my resume... can I consider the same columns even for other projects of the same kind?
@AIEngineeringLife4 жыл бұрын
Shaik you can but if you are looking to expand it with external datapoints then you can check my video -kzbin.info/www/bejne/iJzCn3qdqLWEf6s In this video I show how you can use external data sources and combine with lending club kind of dataset
@raghuramsharma26034 жыл бұрын
I'm trying to import the file and create the df as mentioned and get the below error. Can you pls suggest what i missed. Error in SQL statement: ParseException: mismatched input 'file_location' expecting {'(', 'CONVERT', 'COPY', 'OPTIMIZE', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0) == SQL == file_location = "/FileStore/tables/LoanStats_2018Q4-2.csv" ^^^ file_type = "csv"
@AIEngineeringLife4 жыл бұрын
Raghu, is the file path correct that you loaded? Can you check if it is similar to the below or missing something:
# File location and type
file_location = "/FileStore/tables/LoanStats_2018Q4.csv"
file_type = "csv"
# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","
# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)
display(df)
@raghuramsharma26034 жыл бұрын
@@AIEngineeringLife yes I double checked and the filepath is correct...
@raghuramsharma26034 жыл бұрын
@@AIEngineeringLife never mind I utilized the option of "createtable in notebook" that databricks provided and it worked..strange..thanks for your reply
@AIEngineeringLife4 жыл бұрын
@@raghuramsharma2603 Great it worked.. That is how I got the loading part as well, used databricks provided :). All the best for remaining tutorial
@raghuramsharma26034 жыл бұрын
@@AIEngineeringLife Thank you...great work uploading these videos very helpful...
@gayathriv904 жыл бұрын
Sir, I have a requirement where I have reusable code to run for different files, and I need to pass the filename from blob storage to the code as a parameter. Can you help me?
@AIEngineeringLife4 жыл бұрын
What is the problem you are facing? You can pass the filename as a runtime parameter to Spark and trigger multiple Spark jobs with different file names.
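A minimal sketch of that pattern, assuming the job is launched with spark-submit and the blob path is passed on the command line (names and the URL are illustrative):
# process_file.py
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("blob-loader").getOrCreate()
file_path = sys.argv[1]   # e.g. wasbs://container@account.blob.core.windows.net/input/file1.csv
df = spark.read.csv(file_path, header=True, inferSchema=True)
df.show(5)

# launched as: spark-submit process_file.py <file_path>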
@KishoreKumar-yx4nw4 жыл бұрын
Nicely explained
@Anupamk36 Жыл бұрын
Can you please share the link to the csv?
@imohitr8884 жыл бұрын
I used this to convert string to integer:
from pyspark.sql.types import IntegerType
df = df.withColumn("loan_amnt", df["loan_amnt"].cast(IntegerType()))
I can see in the schema that loan_amnt is now changed to int type, but when I run the command below:
quantileProbs = [0.25, 0.5, 0.75, 0.9]
relError = 0.05
df_sel.stat.approxQuantile("annual_inc", quantileProbs, relError)
I am getting the error: "java.lang.IllegalArgumentException: requirement failed: Quantile calculation for column annual_inc with data type StringType is not supported." Can you please help here?
@AIEngineeringLife4 жыл бұрын
Mohit, I see you have done the cast for loan amount but are using annual_inc in the quantile. Can you do the cast for annual_inc as well and see?
@imohitr8884 жыл бұрын
@@AIEngineeringLife hey i did casting for both already. still getting the same error :/
@imohitr8884 жыл бұрын
@@AIEngineeringLife it worked now :) thank you.
@Shahzada1prince4 жыл бұрын
Do you have these commands written somewhere?
@AIEngineeringLife4 жыл бұрын
You can check my git repo - github.com/srivatsan88/Mastering-Apache-Spark
@Shahzada1prince4 жыл бұрын
@@AIEngineeringLife thanks
@prashanthprasanna14843 жыл бұрын
awesome !! very useful
@srikd98294 жыл бұрын
Thank you very much for the very informative videos. Could you please let us know what programming language(s) is (are) used in this video? Is it Spark, Scala, pyspark, or pysql? (I don't know any of these.) I only know Python, including NumPy and Pandas. So, would you recommend knowing the relevant languages as a prerequisite, so that I feel at ease when a real-world problem is given? Or any courses you recommend are also fine. Thank you.
@AIEngineeringLife4 жыл бұрын
Most of my Spark videos are on PySpark and Spark SQL. Python is a good start, as PySpark syntax is similar to pandas with slight variation; the only thing is that it is distributed. You can check my entire Spark course on YouTube to learn Spark: kzbin.info/aero/PL3N9eeOlCrP5PfpYrP6YxMNtt5Hw27ZlO
@srikd98294 жыл бұрын
@@AIEngineeringLife Ok Sure. Thank you for the swift response.
@anshapettugari Жыл бұрын
Where can I access this code?
@biswadeeppatra17264 жыл бұрын
Sir, do you have a git repo for the code used in this project? If yes, then please share.
@AIEngineeringLife4 жыл бұрын
Yes but all modules might not be there as I have uploaded for only selected ones. Below is the link github.com/srivatsan88/KZbinLI
@biswadeeppatra17264 жыл бұрын
@@AIEngineeringLife Thanks for sharing, sir. I don't find the code for the Spark program; please upload the same if possible.. it will really be a great help.
@HemantSharma-fw2gx3 жыл бұрын
thanks for this tutorial
@bharadwajiyer35044 жыл бұрын
How do I use a subset of the loan data? The original dataset is too large (2 GB) and takes time to upload to Databricks.
@AIEngineeringLife4 жыл бұрын
Bharadwaj.. best is to download and split it in spark or unix
@dineshvarma67334 жыл бұрын
Hello Sir, I appreciate your effort and time to teach us. I am facing a Job aborted error when trying to create a Permanent table at the end of the analysis. Is there a workaround for this.
@AIEngineeringLife4 жыл бұрын
Hi Dinesh.. Thank you.. Can you please paste the error you are getting?
@dineshvarma67334 жыл бұрын
@@AIEngineeringLife org.apache.spark.SparkException: Job aborted. Py4JJavaError: An error occurred while calling o3407.saveAsTable. : org.apache.spark.SparkException: Job aborted. at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:201) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:192) at org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:555) at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.saveDataIntoTable(createDataSourceTables.scala:216) at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:175) at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108) at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106) at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:126) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:150) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:138) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$5.apply(SparkPlan.scala:191) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:187) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:117) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:115) at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:710) at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:710) at org.apache.spark.sql.execution.SQLExecution$$anonfun$withCustomExecutionEnv$1$$anonfun$apply$1.apply(SQLExecution.scala:112) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:217) at org.apache.spark.sql.execution.SQLExecution$$anonfun$withCustomExecutionEnv$1.apply(SQLExecution.scala:98) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:835) at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:74) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:169) at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:710) at org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:508) at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:487) at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:430) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380) at py4j.Gateway.invoke(Gateway.java:295) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:251) at java.lang.Thread.run(Thread.java:748) Caused by: 
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 76.0 failed 1 times, most recent failure: Lost task 0.0 in stage 76.0 (TID 974, localhost, executor driver): java.rmi.RemoteException: com.databricks.api.base.DatabricksServiceException: QUOTA_EXCEEDED: You have exceeded the maximum number of allowed files on Databricks Community Edition. To ensure free access, you are limited to 10000 files and 10 GB of storage in DBFS. Please use dbutils.fs to list and clean up files to restore service. You may have to wait a few minutes after cleaning up the files for the quota to be refreshed. (Files found: 17327); nested exception is: com.databricks.api.base.DatabricksServiceException: QUOTA_EXCEEDED: You have exceeded the maximum number of allowed files on Databricks Community Edition. To ensure free access, you are limited to 10000 files and 10 GB of storage in DBFS. Please use dbutils.fs to list and clean up files to restore service. You may have to wait a few minutes after cleaning up the files for the quota to be refreshed. (Files found: 17327)
@imohitr8884 жыл бұрын
Hi, when I was running df_sel.stat.cov('annual_inc', 'loan_amnt') I got this error: "java.lang.IllegalArgumentException: requirement failed: Currently covariance calculation for columns with dataType string not supported." I realised loan_amnt and annual_inc are showing as string in the schema. I followed all the steps as per your video. Can you correct me on what I missed? I saw your previous schema commands and it looks like they were showing integer in your video, but when I ran the schema command it is showing these 2 columns as string, and that's why the error.
@imohitr8884 жыл бұрын
Can you tell me, in between the code, how can I change a specific column's schema from string to integer? What exact code should I execute?
@AIEngineeringLife4 жыл бұрын
Mohit I see your other message u have figured it out. You have to do cast to convert datatype
@towardsmlds91304 жыл бұрын
The dataset I have for Lending Club has noise in the first row and the header starts from row 2. I am not able to skip the first row and set the 2nd row as the header; any input on how to do this?
@AIEngineeringLife4 жыл бұрын
Is skiprows in read_csv not working for you?
@towardsmlds91304 жыл бұрын
@@AIEngineeringLife i didn't know about it, will see if it works
@devarajuessampally13384 жыл бұрын
Thank you for dataset
@imransharief28914 жыл бұрын
Hi sir, will you please make a proper playlist for this tutorial, because it's very confusing.
@AIEngineeringLife4 жыл бұрын
Hi Imran.. Have you seen below playlist where I am adding it in sequence. Please see if it helps kzbin.info/aero/PL3N9eeOlCrP7_vt6jq7GdJz6CSFmtTBpI
@ankushojha50894 жыл бұрын
Thank you Sir for responding to my comments and clearing my doubt. I have one more doubt: I am using the regexp_replace function, and when I change the position of the string I can see 2 different outputs. With 'Years' in first place it is trimmed completely from the output, but if I interchange them, with 'Years' in 2nd place and 'Year' in first place, the 'S' in the output is not trimmed. Please refer to the screenshot :)
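What is likely happening is regex alternation order: the engine tries the alternatives left to right, so with "year" placed first it matches the "year" inside "years" and leaves the trailing "s". A small sketch, assuming an emp_length-style column:
from pyspark.sql.functions import regexp_replace, col

df.withColumn("emp_length", regexp_replace(col("emp_length"), "years|year", "")).show()  # trims fully
df.withColumn("emp_length", regexp_replace(col("emp_length"), "year|years", "")).show()  # leaves a stray "s"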
@AIEngineeringLife4 жыл бұрын
Kush.. I did not get any screen snapshot here
@christineeee964 жыл бұрын
@@AIEngineeringLife sir, I keep getting attribute errors like "'NoneType' object has no attribute 'groupby'"
@uskhan73534 жыл бұрын
Can you show a more mechanical way for feature selection.
@AIEngineeringLife4 жыл бұрын
Usman.. Are you referring to manual feature selection?
@uskhan73534 жыл бұрын
@@AIEngineeringLife yes by using any feature importance technique.
@seetharamireddybeereddy2224 жыл бұрын
and how to handle delta or incremental load in pyspark
@AIEngineeringLife4 жыл бұрын
I was actually not planning to cover ingesting of data to show incremental load, but I will see if I can in the future.
@venkatasaireddyavuluri20514 жыл бұрын
sir can you give the link to dataset
@AIEngineeringLife4 жыл бұрын
Venkata, it is from kaggle - www.kaggle.com/wendykan/lending-club-loan-data
@sankarshkadambari27424 жыл бұрын
Hello Sir, for the 2018Q4 data we have to slice the original Loan.csv, which contains (2260668, 145), right? Kaggle gave me a 2 GB zip file. Following are the steps I did on my local machine:
df = read the whole (2260668, 145) file
LoanStats_2018Q4 = df[(df['issue_d']=="Oct-2018") | (df['issue_d']=="Nov-2018") | (df['issue_d']=="Dec-2018")]
LoanStats_2018Q4.shape  # (128412, 145)
LoanStats_2018Q4.to_csv('/path/LoanStats_2018Q4.csv', index = False)
Then I will upload this to Databricks.
@AIEngineeringLife4 жыл бұрын
I just ran it with a subset so users need not wait on the video for every instruction, but in your case you can use it all or subset it as well.. The whole file would have made my video run for an additional hour :)
@dipanjanghosh68623 жыл бұрын
Hello Sir. I have given file_location = r"C:/Users/dipanja/Desktop/data science/LoanStats_2018.csv". This is the path of the loanstats csv file in my system, but while trying to execute it I am getting the error 'java.io.IOException: No FileSystem for scheme: C'. Can you please help me fix this?
@AIEngineeringLife3 жыл бұрын
Are you using databricks spark or spark in your local system?.
@dipanjanghosh68623 жыл бұрын
@@AIEngineeringLife hi sir. i resolved the issue. thanks!
@Ravi-gu5ww4 жыл бұрын
Sir can you make tutorial on functions like groupByKey,sortByKey,oderByKey,reduceByKey,join....
@AIEngineeringLife4 жыл бұрын
Raviteja.. I have already covered groupBy and orderBy.. The ones you have mentioned are RDD functions, and Spark is making DataFrame functions the primary API going forward. Not sure if you really need to learn the RDD functions, as 98% of the time the DataFrame functions are easy and will do the job.
@sahil00944 жыл бұрын
The describe and null count output is not readable most of the time; doesn't that pose a big problem in industry projects? I have a dataset of hundreds of columns, so how do I view describe or the null count for all of them in Spark?
@AIEngineeringLife4 жыл бұрын
Sahil.. In Databricks we can use the formatted output, but in regular Spark, yes. In some cases we load it into a table and view it there to understand.
@sahil00943 жыл бұрын
@@AIEngineeringLife okay thanks!
@dinavahikalyan49293 жыл бұрын
Sir can you please share this whole notebook
@AIEngineeringLife3 жыл бұрын
It is in this folder - github.com/srivatsan88/Mastering-Apache-Spark
@dinavahikalyan49293 жыл бұрын
@@AIEngineeringLife Thankq sir
@seetharamireddybeereddy2224 жыл бұрын
can you make one video for pyspark on google cloud
@AIEngineeringLife4 жыл бұрын
Will try to do it as part of cloud series. Spark job is same but will show how to run it on cloud dataproc
@seetharamireddybeereddy2224 жыл бұрын
@@AIEngineeringLife thank you
@suprobhosantra3 жыл бұрын
Can we have the notebook in github or somewhere?
@AIEngineeringLife3 жыл бұрын
Yes.. You can check it against Spark course in below git repo - github.com/srivatsan88
@Ajeetsingh-uy4cy4 жыл бұрын
We are using this command: df = spark.read.format(). I haven't worked on Spark, but from the syntax I can say that this is Spark's method of reading a DataFrame. We are typing this command in a Jupyter-style notebook which by default is Python-compatible; to use other languages we have to use a magic command at the top. Then how are we able to use Spark in Python? Is this PySpark, or something else?
@AIEngineeringLife4 жыл бұрын
Ajeet, PySpark is enabled by default in the notebook, so you get the Python packages loaded in Databricks by default; for other languages we need the magic commands. I did not get your question completely though.
@Ajeetsingh-uy4cy4 жыл бұрын
@@AIEngineeringLife You solved my doubt though. My question was "how r we using spark in python without using magic command?". And as your answer suggested its py-spark that we are using and not Spark directly.
@owaisshaikh39834 жыл бұрын
I believe if you are doing this exercise in Python - you should have used spark.read.load rather than using scala syntax spark.read.format.
@hakunamatata-qu7ft4 жыл бұрын
Unable to get data from git please help
@AIEngineeringLife4 жыл бұрын
Manoj. Can you check the pinned comment of this video? I have given the link to the dataset there. This dataset is huge, so I could not push it to git due to the size limit.
@hakunamatata-qu7ft4 жыл бұрын
Thank you so much for the reply. I want to get trained under your guidance, could you help me? Could you please tell me how to start on your video lectures and in what order, as I am a beginner?
@AIEngineeringLife4 жыл бұрын
Manoj.. If you go to my channel and then the playlist tab, you can see multiple playlists. Pick an area of your interest. To start with, you can learn from the end-to-end ML playlist, which talks about the lifecycle of ML projects. It is purely theory but good to know before getting into details.
@hakunamatata-qu7ft4 жыл бұрын
Thank you so much for your response, but I want to become an end-to-end full-stack practitioner, so please help me with the order of your playlists to follow. I am from a banking background, so please do help me in the transition.
@AIEngineeringLife4 жыл бұрын
Manoj.. I do not have video coverage on the basics of ML, so I would suggest going through the Coursera Andrew Ng ML course; that will be helpful. Once done, you can check my courses on NLP, Computer Vision and Time Series.
@ranjitkumarthakur51234 жыл бұрын
Sir, please share the notebook.
@AIEngineeringLife4 жыл бұрын
Hi RK, I have mentioned it in my FAQ at the link below, along with the scenarios in which I will be sharing notebooks: www.linkedin.com/pulse/course-launch-scaling-accelerating-machine-learning-srinivasan/ In some cases I will be sharing it in my git link a few months after the video. Sorry in case you don't get the notebook immediately after the video in some cases.
@demidrek-heyward4 жыл бұрын
thanks!
@praveenprakash1434 жыл бұрын
Hi sir, I would like to learn Spark for a DE role. Can you mentor me? I am looking for a paid mentor.
@AIEngineeringLife4 жыл бұрын
I have an entire course on Apache Spark which is free.. Why do you want to pay for mentorship when I have covered all that is required - kzbin.info/aero/PL3N9eeOlCrP5PfpYrP6YxMNtt5Hw27ZlO . Just practice along with the videos and you should be good.
@hemanthdevarapati5194 жыл бұрын
By the looks of it Databricks is using Zeppelin kinda notebooks
@AIEngineeringLife4 жыл бұрын
Yes Hemanth it is pretty similar to Zeppelin but I think databricks have their own custom one that resembles it
@hemanthdevarapati5194 жыл бұрын
One question. Shouldn't we use an action after the df.cache() to cache data as it works on lazy evaluation? something like df.cache().count().
@AIEngineeringLife4 жыл бұрын
@@hemanthdevarapati519 yes, it is lazy evaluation. It will get loaded into cache when I call a subsequent action for the first time below. I thought I might have some action further down somewhere, is that not the case? I might not be doing it explicitly along with the cache command.
@hemanthdevarapati5194 жыл бұрын
@@AIEngineeringLife Yeah, that makes sense. It was a very intuitive video. I enjoyed every bit of it. Thank you for all your efforts Srivatsan. (Y)
@seemunyum8323 жыл бұрын
The cluster i am creating is taking forever... anyone else have this problem? :(
@atruismoti24023 жыл бұрын
As a piece of advice, could you speak more slowly please? It is difficult to understand you.