Pyspark Scenarios 18 : How to Handle Bad Data in pyspark dataframe using pyspark schema

15,006 views

TechLake

A day ago

GitHub location :
github.com/rav...
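For quick reference, here is a minimal sketch of the technique the video covers: reading a file with an explicit schema and choosing how Spark treats rows that do not match it. The file path, column names, and sample schema below are illustrative assumptions, not the notebook code from the GitHub repo.

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# assumes an active SparkSession named `spark` (as in a Databricks notebook)
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("salary", IntegerType(), True),
])

# PERMISSIVE (the default): bad rows become nulls and the raw line is kept in an
# extra corrupt-record column, which must be present in the schema.
permissive_schema = StructType(schema.fields + [StructField("_corrupt_record", StringType(), True)])
df_permissive = (spark.read.schema(permissive_schema)
    .option("header", "true")
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .csv("/path/to/employees.csv"))

# DROPMALFORMED: silently drops rows that do not match the schema
df_dropped = (spark.read.schema(schema)
    .option("header", "true")
    .option("mode", "DROPMALFORMED")
    .csv("/path/to/employees.csv"))

# FAILFAST: raises an exception on the first malformed row
df_failfast = (spark.read.schema(schema)
    .option("header", "true")
    .option("mode", "FAILFAST")
    .csv("/path/to/employees.csv"))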
Pyspark Interview question
Pyspark Scenario Based Interview Questions
Pyspark Scenario Based Questions
Scenario Based Questions
#PysparkScenarioBasedInterviewQuestions
#ScenarioBasedInterviewQuestions
#PysparkInterviewQuestions
Complete Pyspark Real Time Scenarios Videos.
Pyspark Scenarios 1: How to create partition by month and year in pyspark
• Pyspark Scenarios 1: H...
pyspark scenarios 2 : how to read variable number of columns data in pyspark dataframe #pyspark
• pyspark scenarios 2 : ...
Pyspark Scenarios 3 : how to skip first few rows from data file in pyspark
• Pyspark Scenarios 3 : ...
Pyspark Scenarios 4 : how to remove duplicate rows in pyspark dataframe #pyspark #Databricks
• Pyspark Scenarios 4 : ...
Pyspark Scenarios 5 : how read all files from nested folder in pySpark dataframe
• Pyspark Scenarios 5 : ...
Pyspark Scenarios 6 How to Get no of rows from each file in pyspark dataframe
• Pyspark Scenarios 6 Ho...
Pyspark Scenarios 7 : how to get no of rows at each partition in pyspark dataframe
• Pyspark Scenarios 7 : ...
Pyspark Scenarios 8: How to add Sequence generated surrogate key as a column in dataframe.
• Pyspark Scenarios 8: H...
Pyspark Scenarios 9 : How to get Individual column wise null records count
• Pyspark Scenarios 9 : ...
Pyspark Scenarios 10:Why we should not use crc32 for Surrogate Keys Generation?
• Pyspark Scenarios 10:W...
Pyspark Scenarios 11 : how to handle double delimiter or multi delimiters in pyspark
• Pyspark Scenarios 11 :...
Pyspark Scenarios 12 : how to get 53 week number years in pyspark extract 53rd week number in spark
• Pyspark Scenarios 12 :...
Pyspark Scenarios 13 : how to handle complex json data file in pyspark
• Pyspark Scenarios 13 :...
Pyspark Scenarios 14 : How to implement Multiprocessing in Azure Databricks
• Pyspark Scenarios 14 :...
Pyspark Scenarios 15 : how to take table ddl backup in databricks
• Pyspark Scenarios 15 :...
Pyspark Scenarios 16: Convert pyspark string to date format issue dd-mm-yy old format
• Pyspark Scenarios 16: ...
Pyspark Scenarios 17 : How to handle duplicate column errors in delta table
• Pyspark Scenarios 17 :...
Pyspark Scenarios 18 : How to Handle Bad Data in pyspark dataframe using pyspark schema
• Pyspark Scenarios 18 :...
Pyspark Scenarios 19 : difference between #OrderBy #Sort and #sortWithinPartitions Transformations
• Pyspark Scenarios 19 :...
Pyspark Scenarios 20 : difference between coalesce and repartition in pyspark #coalesce #repartition
• Pyspark Scenarios 20 :...
Pyspark Scenarios 21 : Dynamically processing complex json file in pyspark #complexjson #databricks
• Pyspark Scenarios 21 :...
Pyspark Scenarios 22 : How To create data files based on the number of rows in PySpark #pyspark
• Pyspark Scenarios 22 :...
pyspark sql
pyspark
hive
which
databricks
apache spark
sql server
spark sql functions
spark interview questions
sql interview questions
spark sql interview questions
spark sql tutorial
spark architecture
coalesce in sql
hadoop vs spark
window function in sql
which role is most likely to use azure data factory to define a data pipeline for an etl process?
what is data warehouse
broadcast variable in spark
pyspark documentation
apache spark architecture
which single service would you use to implement data pipelines, sql analytics, and spark analytics?
which one of the following tasks is the responsibility of a database administrator?
google colab
case class in scala

Comments: 36
@tusharhatwar
@tusharhatwar a year ago
This channel is a goldmine for PySpark data engineers.
@manjulakumarisammidi1833
@manjulakumarisammidi1833 11 months ago
Instead of caching the dataframe at 14:17, defining bad_data_df before good_data_df will also work; just another approach. Thanks for the video, sir.
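For context, a rough sketch of the split being discussed (df is assumed to be the PERMISSIVE read with a _corrupt_record column from earlier). Spark normally rejects a query against a raw CSV/JSON read that references only the internal corrupt-record column, which is why the video caches the dataframe before filtering; the reordering suggested in this comment is offered as another workaround.

from pyspark.sql.functions import col

df.cache()  # avoids the "referenced columns only include the internal corrupt record column" error
bad_data_df = df.filter(col("_corrupt_record").isNotNull())
good_data_df = df.filter(col("_corrupt_record").isNull()).drop("_corrupt_record")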
@anandattagasam7037
@anandattagasam7037 a year ago
Thanks for your brief explanation. I would go with the 4th option (badRecordsPath) instead of the 5th (columnNameOfCorruptRecord).
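For reference, a minimal sketch of the badRecordsPath option mentioned here. It is a Databricks-specific option: malformed rows are written out as JSON files under the given path instead of appearing in the dataframe. The paths and schema below are assumptions.

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("salary", IntegerType(), True),
])

df = (spark.read.schema(schema)
    .option("header", "true")
    .option("badRecordsPath", "/tmp/bad_records")  # bad rows land here as JSON, with the failure reason
    .csv("/path/to/employees.csv"))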
@arshiyakub17
@arshiyakub17 a year ago
Thank you so much for the video on this. I have been searching for this for a long time and finally got what I needed from this video.
@Jgiga
@Jgiga 2 years ago
Thanks for sharing
@Technology_of_world5
@Technology_of_world5 a year ago
Good message, thank you a lot 👍
@sravankumar1767
@sravankumar1767 2 years ago
Nice explanation 👌 👍 👏
@shayankabasi160
@shayankabasi160 2 years ago
Very nice
@mesukanya9828
@mesukanya9828 a year ago
Thank you so much... very well explained :)
@TRRaveendra
@TRRaveendra a year ago
Thank you 🙏
@muruganc2350
@muruganc2350 a year ago
Thanks, good to learn!
@Basket-hb5jc
@Basket-hb5jc 6 months ago
Very valuable.
@jobiquirobi123
@jobiquirobi123 2 years ago
Just found your tutorials; they look pretty nice, thank you!
@TRRaveendra
@TRRaveendra 2 years ago
Thank You 👍
@mohitupadhayay1439
@mohitupadhayay1439 5 months ago
Can we do the same for XML and JSON files?
@bharathsai232
@bharathsai232 a year ago
PERMISSIVE mode is not detecting malformed date values. I mean, if we have a date like 2013-02-30, a Spark read in PERMISSIVE mode does not flag it as bad data.
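Whether an impossible calendar date like 2013-02-30 gets flagged depends on the Spark version and its date-parser settings (the legacy and newer parsers behave differently). One common workaround (a sketch, not from the video; the file and column names are assumptions) is to read the column as a string and validate it explicitly with to_date:

from pyspark.sql.functions import to_date, col

raw_df = (spark.read
    .option("header", "true")
    .csv("/path/to/orders.csv"))  # assumed file; order_date arrives as a string

# depending on version/settings, to_date may return null (or raise) for an impossible date;
# null parses are treated here as bad rows
checked_df = raw_df.withColumn("order_date_parsed", to_date(col("order_date"), "yyyy-MM-dd"))
bad_dates = checked_df.filter(col("order_date").isNotNull() & col("order_date_parsed").isNull())
good_dates = checked_df.filter(col("order_date_parsed").isNotNull())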
@ketanmehta3058
@ketanmehta3058 a year ago
Excellent! You clearly explained each and every option to load the data. @TechLake, can we use these options with JSON data as well?
@srijitachaturvedi7738
@srijitachaturvedi7738 2 years ago
Does this approach work while reading JSON data instead of CSVs?
@TRRaveendra
@TRRaveendra 2 years ago
Yes, for normal JSON you can use the same options. For multiline JSON, use option("multiline", "true"); otherwise the records will end up in the default _corrupt_record column.
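For reference, a minimal sketch of the JSON case described in this reply. The file path and schema are assumptions; when PERMISSIVE mode is used, the schema needs to include the corrupt-record column.

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

json_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("_corrupt_record", StringType(), True),
])

json_df = (spark.read.schema(json_schema)
    .option("multiline", "true")  # needed when a single record spans multiple lines
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .json("/path/to/data.json"))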
@mehmetkaya4330
@mehmetkaya4330 a year ago
Thank you for the great tutorials!
@TRRaveendra
@TRRaveendra a year ago
Thanks for watching my channel videos
@saisaranv
@saisaranv a year ago
Hi TechLake team, thanks for the wonderful video; it helped a lot. Can you please help me with two errors I am facing right now? 1. "cannot cast string into integer type", even after a specific schema is defined. 2. Complex JSON flattening (I went through video 13, but my data is too complex in nature to flatten). Any help is appreciated.
@TRRaveendra
@TRRaveendra a year ago
tgrappstech@gmail.com. Ping me your schema or sample data and I can verify.
@saisaranv
@saisaranv a year ago
@TRRaveendra Done, please check once. Thank you for your reply :)
@YOGESHMULEY-n1j
@YOGESHMULEY-n1j 4 months ago
I got an issue: the query returns no records.
@mohitupadhayay1439
@mohitupadhayay1439 4 months ago
We still could not find the proper reason why records end up as corrupt when the number of columns is very large.
@hannawg7747
@hannawg7747 2 years ago
Hi sir, do you provide training on Azure ADB/ADF?
@TRRaveendra
@TRRaveendra 2 years ago
Yes, I do. Please reach me at tgrappstech@gmail.com.
@chriskathumbi2292
@chriskathumbi2292 2 years ago
Hello, good video. I have a question concerning Spark. When I use local data like Parquet and CSV, create a temp view (or just use Spark normally), and try to use distinct, group by, or window functions, I get an error. I have seen this on Windows, Linux, and in a Docker container. What could be causing this?
@TRRaveendra
@TRRaveendra 2 years ago
What kind of error are you getting? Is it related to the data file path, missing columns, or a wrong GROUP BY query?
@chriskathumbi2292
@chriskathumbi2292 2 years ago
@TRRaveendra If I use df.show() and the df contains a group by, window function, or distinct, I get: Py4JJavaError: An error occurred while calling o69.showString.
@chriskathumbi2292
@chriskathumbi2292 2 years ago
@TRRaveendra The funny thing is that Google Colab, where I have to install PySpark on launch, doesn't have this issue.
@chimorammohan8392
@chimorammohan8392 2 years ago
@chriskathumbi2292 This might be a code error; please share the code.
@chriskathumbi2292
@chriskathumbi2292 2 years ago
@chimorammohan8392

spark_df = spark.read.csv("./test/fm_file_dir_map.csv", header=True)
spark_df.createOrReplaceTempView("spark_df_temp_v1")

{
    "works": [
        spark.sql("""select distinct file_id from spark_df_temp_v1"""),
        spark_df.select(f.col("file_id")).distinct(),
    ],
    "fails": [
        spark.sql("""select distinct file_id from spark_df_temp_v1""").show(),
        spark_df.select(f.col("file_id")).distinct().show(),
    ],
}
@Ameem-rw4ir
@Ameem-rw4ir 5 months ago
Bro, thanks for your inputs. Can you please help me with how to handle this?

empid,fname|lname@sal#deptid
1,mohan|kumar@5000#100
2,karna|varadan@3489#101
3,kavitha|gandan@6000#102

Expected output:
empid,fname,lname,sal,deptid
1,mohan,kumar,5000,100
2,karan,varadan,3489,101
3,kavitha,gandan,6000,102
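One possible way to handle this (a sketch, not the channel's answer; the file path is an assumption) is to read the file as plain text, normalize the extra delimiters to commas, and then split into columns:

from pyspark.sql.functions import regexp_replace, split, col

raw = spark.read.text("/path/to/emp.txt")  # assumed path to the sample file above
lines = raw.select(regexp_replace(col("value"), "[|@#]", ",").alias("line"))
lines = lines.filter(col("line") != "empid,fname,lname,sal,deptid")  # drop the header row

cols = ["empid", "fname", "lname", "sal", "deptid"]
parts = split(col("line"), ",")
result = lines.select(*[parts.getItem(i).alias(c) for i, c in enumerate(cols)])
result.show()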