Data Validation with Pyspark || Schema Comparison || Dynamically || Real Time Scenario

1,265 views

DataSpark

7 months ago

In this video we cover how to perform a quick data validation, such as a schema comparison between source and target. In the next video we will look into date/timestamp format checks and a duplicate count check.
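A minimal sketch of the idea covered here: compare the schema of a source DataFrame against a target DataFrame column by column. The DataFrame names, file paths, and SparkSession setup below are assumptions for illustration, not the exact code from the video.

# Minimal schema-comparison sketch (assumed paths and names, not the video's exact code).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-comparison").getOrCreate()

source_df = spark.read.option("header", True).option("inferSchema", True).csv("/path/to/source.csv")
target_df = spark.read.option("header", True).option("inferSchema", True).csv("/path/to/target.csv")

# Build {column name: simple type string} maps for both sides.
source_schema = {f.name.lower(): f.dataType.simpleString() for f in source_df.schema.fields}
target_schema = {f.name.lower(): f.dataType.simpleString() for f in target_df.schema.fields}

# Columns present on one side only.
missing_in_target = set(source_schema) - set(target_schema)
missing_in_source = set(target_schema) - set(source_schema)

# Columns present on both sides but with different data types.
type_mismatches = {
    name: (source_schema[name], target_schema[name])
    for name in set(source_schema) & set(target_schema)
    if source_schema[name] != target_schema[name]
}

print("Missing in target:", missing_in_target)
print("Missing in source:", missing_in_source)
print("Type mismatches:", type_mismatches)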
Column Comparison link:
• Data Validation with P...
#dataanalytics #dataengineeringessentials #azuredatabricks
#dataanalysis
#pyspark
#pythonprogramming
#sql
#databricks #PySpark #Spark #DatabricksNotebook #PySparkLogic

Comments: 10
@vamshimerugu6184 3 months ago
I think schema comparison is an important topic in PySpark. Great explanation sir ❤
@DataSpark45 3 months ago
thank you bro
@skateforlife3679 7 months ago
Thank you for your work!!! It would be amazing if you could enhance the video with "chapters" to give more context to what you explain in the different sections of the video :)
@DataSpark45 7 months ago
Great suggestion!
@saibhargavreddy5992 3 months ago
I found this very useful as I had a similar issue with data validations. It helped a lot while completing my project.
@DataSpark45 3 months ago
Glad it helped!
@avinash7003 6 months ago
Code on GitHub?
@DataSpark45 6 months ago
Here is the link, bro: drive.google.com/drive/folders/1I6rqtiKh1ChM_dkLJwxfxyxcHWPgyiKZ?usp=sharing
@amandoshi5803 1 month ago
Source code?
@DataSpark45 1 month ago
from pyspark.sql.functions import col

def SchemaComparision(controldf, spsession, refdf):
    try:
        # Iterate the control DataFrame and get each file name and file path.
        for row in controldf.collect():
            filename = row['filename']
            filepath = row['filepath']

            # Create a DataFrame from the file path.
            print("Data frame is being created for {} ({})".format(filepath, filename))
            dfs = spsession.read.format('csv') \
                .option('header', True) \
                .option('inferSchema', True) \
                .load(filepath)
            print("DF created for {} ({})".format(filepath, filename))

            # Pick the reference (expected) schema row for this file.
            ref_filter = refdf.filter(col('SrcFileName') == filename)
            for ref_row in ref_filter.collect():
                columnNames = ref_row['SrcColumns']
                refTypes = ref_row['SrcColumnType']

                columnNamesList = [c.strip().lower() for c in columnNames.split(",")]
                refTypesList = [t.strip().lower() for t in refTypes.split(",")]

                # Actual types from the loaded DataFrame, keyed by lower-cased column name.
                # simpleString() maps e.g. StringType() -> 'string', IntegerType() -> 'int'.
                actualTypes = {f.name.strip().lower(): f.dataType.simpleString().lower()
                               for f in dfs.schema.fields}
                dfsTypesList = [actualTypes.get(c, 'missing') for c in columnNamesList]

                # Keep only the columns whose actual type differs from the reference type.
                missmatchedcolumns = [(col_name, df_type, ref_type)
                                      for (col_name, df_type, ref_type)
                                      in zip(columnNamesList, dfsTypesList, refTypesList)
                                      if df_type != ref_type]

                if missmatchedcolumns:
                    print("Schema comparison failed (mismatch) for {}".format(filename))
                    for col_name, df_type, ref_type in missmatchedcolumns:
                        print(f"columnName: {col_name}, DataFrameType: {df_type}, referenceType: {ref_type}")
                else:
                    print("Schema comparison is done and successful for {}".format(filename))
    except Exception as e:
        print("An error occurred: ", str(e))
        return False
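A hypothetical usage sketch for the function above. The column names (filename, filepath, SrcFileName, SrcColumns, SrcColumnType) follow the code above; the sample file name, path, columns, and types are assumptions for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-validation").getOrCreate()

# Control table: which files to validate and where they live (sample values are assumptions).
controldf = spark.createDataFrame(
    [("sales.csv", "/mnt/raw/sales.csv")],
    ["filename", "filepath"],
)

# Reference table: expected columns and their types per file (sample values are assumptions).
refdf = spark.createDataFrame(
    [("sales.csv", "order_id, amount, customer_name", "int, double, string")],
    ["SrcFileName", "SrcColumns", "SrcColumnType"],
)

SchemaComparision(controldf, spark, refdf)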