Data Validation with Pyspark || Schema Comparison || Dynamically || Real Time Scenario

1,265 views

DataSpark

7 months ago

In this video we cover how to perform a quick data validation, such as a schema comparison between source and target. In the next video we will look into date/timestamp format checks and a duplicate count check.
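A minimal sketch of the idea covered here: compare the schema of a source DataFrame against a target DataFrame column by column. The DataFrame names, file paths, and SparkSession setup below are assumptions for illustration, not the exact code from the video.

# Minimal schema-comparison sketch (assumed paths and names, not the video's exact code).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-comparison").getOrCreate()

source_df = spark.read.option("header", True).option("inferSchema", True).csv("/path/to/source.csv")
target_df = spark.read.option("header", True).option("inferSchema", True).csv("/path/to/target.csv")

# Build {column name: simple type string} maps for both sides.
source_schema = {f.name.lower(): f.dataType.simpleString() for f in source_df.schema.fields}
target_schema = {f.name.lower(): f.dataType.simpleString() for f in target_df.schema.fields}

# Columns present on one side only.
missing_in_target = set(source_schema) - set(target_schema)
missing_in_source = set(target_schema) - set(source_schema)

# Columns present on both sides but with different data types.
type_mismatches = {
    name: (source_schema[name], target_schema[name])
    for name in set(source_schema) & set(target_schema)
    if source_schema[name] != target_schema[name]
}

print("Missing in target:", missing_in_target)
print("Missing in source:", missing_in_source)
print("Type mismatches:", type_mismatches)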
Column Comparison link:
• Data Validation with P...
#dataanalytics #dataengineeringessentials #azuredatabricks
#dataanalysis
#pyspark
#pythonprogramming
#sql
#databricks #PySpark #Spark #DatabricksNotebook #PySparkLogic

Comments: 10
@vamshimerugu6184 3 months ago
I think schema comparison is an important topic in PySpark. Great explanation sir ❤
@DataSpark45 3 months ago
thank you bro
@skateforlife3679 7 months ago
Thank you for your work!!! It would be amazing if you could enhance the video with "chapters" to give more context to what you explain in the different sections of the video :)
@DataSpark45 7 months ago
Great suggestion!
@saibhargavreddy5992 3 months ago
I found this very useful as I had a similar issue with data validations. It helped a lot while completing my project.
@DataSpark45 3 months ago
Glad it helped!
@avinash7003 6 months ago
Code on GitHub?
@DataSpark45 6 months ago
Here is the link, bro: drive.google.com/drive/folders/1I6rqtiKh1ChM_dkLJwxfxyxcHWPgyiKZ?usp=sharing
@amandoshi5803 1 month ago
Source code?
@DataSpark45 1 month ago
from pyspark.sql.functions import col

def SchemaComparision(controldf, spsession, refdf):
    try:
        # Iterate the control DataFrame and get each file name and file path.
        for row in controldf.collect():
            filename = row['filename']
            filepath = row['filepath']

            # Create a DataFrame from the file path.
            print("Data frame is being created for {} ({})".format(filepath, filename))
            dfs = spsession.read.format('csv') \
                .option('header', True) \
                .option('inferSchema', True) \
                .load(filepath)
            print("DF created for {} ({})".format(filepath, filename))

            # Pick the reference (expected) schema row for this file.
            ref_filter = refdf.filter(col('SrcFileName') == filename)
            for ref_row in ref_filter.collect():
                columnNames = ref_row['SrcColumns']
                refTypes = ref_row['SrcColumnType']

                columnNamesList = [c.strip().lower() for c in columnNames.split(",")]
                refTypesList = [t.strip().lower() for t in refTypes.split(",")]

                # Actual types from the loaded DataFrame, keyed by lower-cased column name.
                # simpleString() maps e.g. StringType() -> 'string', IntegerType() -> 'int'.
                actualTypes = {f.name.strip().lower(): f.dataType.simpleString().lower()
                               for f in dfs.schema.fields}
                dfsTypesList = [actualTypes.get(c, 'missing') for c in columnNamesList]

                # Keep only the columns whose actual type differs from the reference type.
                missmatchedcolumns = [(col_name, df_type, ref_type)
                                      for (col_name, df_type, ref_type)
                                      in zip(columnNamesList, dfsTypesList, refTypesList)
                                      if df_type != ref_type]

                if missmatchedcolumns:
                    print("Schema comparison failed (mismatch) for {}".format(filename))
                    for col_name, df_type, ref_type in missmatchedcolumns:
                        print(f"columnName: {col_name}, DataFrameType: {df_type}, referenceType: {ref_type}")
                else:
                    print("Schema comparison is done and successful for {}".format(filename))
    except Exception as e:
        print("An error occurred: ", str(e))
        return False
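A hypothetical usage sketch for the function above. The column names (filename, filepath, SrcFileName, SrcColumns, SrcColumnType) follow the code above; the sample file name, path, columns, and types are assumptions for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-validation").getOrCreate()

# Control table: which files to validate and where they live (sample values are assumptions).
controldf = spark.createDataFrame(
    [("sales.csv", "/mnt/raw/sales.csv")],
    ["filename", "filepath"],
)

# Reference table: expected columns and their types per file (sample values are assumptions).
refdf = spark.createDataFrame(
    [("sales.csv", "order_id, amount, customer_name", "int, double, string")],
    ["SrcFileName", "SrcColumns", "SrcColumnType"],
)

SchemaComparision(controldf, spark, refdf)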