
112. Databricks | Pyspark| Spark Reader: Skip First N Records While Reading CSV File

4,253 views

Raja's Data Engineering

A day ago

Azure Databricks Learning: Spark Reader: Skip First N Records While Reading CSV File
=================================================================================
Processing CSV files in Spark and Databricks is a very common scenario. While reading CSV data, we often come across the requirement of skipping the first few records in certain use cases. I have explained that requirement in this video; a short code sketch follows the sample data below.
To get a better understanding, watch this video
#SparkCSVReader, #SparkCSVSkipRows, #DatabricksCSVSkipRows,#CSVDataframe,#PySparkCSVOptions,#SparkDevelopment,#DatabricksDevelopment, #DatabricksPyspark,#PysparkTips, #DatabricksTutorial, #AzureDatabricks, #Databricks, #Databricksforbeginners,#datascientists, #datasciencecommunity,#bigdataengineers,#machinelearningengineers
Id,Year,First Name,County,Gender,Count
1,2007,ZOEY,KINGS,F,11
2,2007,ZOEY,SUFFOLK,F,6
3,2007,ZOEY,MONROE,F,6
4,2007,ZOEY,ERIE,F,9
5,2007,ZOE,ULSTER,F,5
6,2007,ZOE,WESTCHESTER,F,24
7,2007,ZOE,BRONX,F,13
8,2007,ZOE,NEW YORK,F,55
9,2007,ZOE,NASSAU,F,15
10,2007,ZOE,ERIE,F,6
11,2007,ZOE,SUFFOLK,F,14
12,2007,ZOE,KINGS,F,34
13,2007,ZOE,MONROE,F,9
14,2007,ZOE,QUEENS,F,26
15,2007,ZOE,ALBANY,F,5
16,2007,ZISSY,ROCKLAND,F,5
17,2007,ZISSY,KINGS,F,27
18,2007,ZION,KINGS,M,15
19,2007,ZION,BRONX,M,14
20,2007,ZEV,ROCKLAND,M,6
21,2007,ZEV,KINGS,M,30
22,2007,ZARA,QUEENS,F,10
23,2007,ZAIRE,KINGS,M,14
24,2007,ZACKARY,SUFFOLK,M,6
25,2007,ZACKARY,ERIE,M,5
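
A minimal sketch of the skipRows option demonstrated in the video, reading the sample data above (the file path and the number of skipped rows are illustrative assumptions, not from the video):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read
        .format("csv")
        .option("skipRows", 4)         # number of leading lines to skip
        .option("header", "true")      # caution: as discussed in the comments, skipping also affects the header line
        .option("inferSchema", "true")
        .load("/tmp/baby_names.csv"))  # assumed path to the CSV above
df.show()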

Comments: 38
@AshokKumar-ji3cs · a year ago
Thanks for the best video with very useful content
@rajasdataengineering7585 · a year ago
Thanks and welcome!
@AnuragsMusicChannel · 19 days ago
Use the window function row_number, then filter out row numbers 11 to 20:
w = Window.orderBy(lit("A"))
df.withColumn("ROWNUM", row_number().over(w)).filter((col("ROWNUM") < 11) | (col("ROWNUM") > 20)).drop("ROWNUM").show()
@percyjackson1662 · a year ago
For the quiz question: we first read the whole CSV and then use the row_number function to filter down to the records we need.
@rajasdataengineering7585 · a year ago
Super quick! Thanks for sharing your approach. Yes, this is one of the approaches. Since Databricks is a distributed platform, it can't guarantee, while assigning row numbers across partitions, that the first few records of the file get the first few row numbers. Though this approach can be used, we can still think of other approaches that guarantee removing only a certain range of rows.
@percyjackson1662 · a year ago
@@rajasdataengineering7585 I assumed there would be a timestamp column we could use in the order by clause of the window spec. My bad! 😁
@rajasdataengineering7585 · a year ago
That's the right approach if we have a timestamp 👍🏻
@user-fg4bi6cp8o · 7 months ago
I am trying this for reading a csv file in my databricks notebook, but the skip is not working.
@kumarvummadi3772 · a year ago
What is the OS area in Spark? What is the OS-controlled area in the data node? Is off-heap memory controlled by the executor or by the OS-controlled area? If off-heap memory is controlled by the OS area, how does the executor utilize it? What kinds of optimizations can we do using off-heap memory? Please make a video on these concepts, sir.
@rajasdataengineering7585 · a year ago
On-heap memory is controlled by the executor JVM, while off-heap memory is controlled by the OS of the worker node. I have already posted a video about on-heap vs off-heap memory: kzbin.info/www/bejne/mYXNeaKhpN1of9U
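For reference, a minimal sketch of enabling off-heap memory when building a Spark session (the size value is illustrative; on Databricks these settings would normally go into the cluster's Spark config):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("offheap-demo")
         .config("spark.memory.offHeap.enabled", "true")  # let executors use OS-managed off-heap memory
         .config("spark.memory.offHeap.size", "2g")       # off-heap allocation per executor
         .getOrCreate())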
@kumarvummadi3772 · a year ago
@@rajasdataengineering7585 Sir, could you please explain how the JVM accesses objects from off-heap memory and how objects get stored in on-heap memory? If possible, please make a video, sir.
@kumarvummadi3772 · a year ago
Thank you very much, sir, for doing this video. I request you to make a video on one more concept; I have posted some questions related to it in this comment thread. Please see them, sir.
@rajasdataengineering7585 · a year ago
Noted
@NaveenKumar-kb2fm · a year ago
Hi, can you implement a real-time scenario of applying SCD Type 2 on multiple source tables dynamically, i.e. without writing a script for each table? Can you take an example and demonstrate this?
@rajasdataengineering7585 · a year ago
Sure, I will create a video for this requirement
@NaveenKumar-kb2fm · a year ago
@@rajasdataengineering7585 Thank you very much. We take the data from the on-premise client database using ADF and load it into ADLS (source-to-landing pipeline), then another pipeline loads from ADLS into staging tables in the Synapse DWH. Now comes the SCD layer (a one-to-one copy from the staging tables). The source has more than 50 tables, staging is refreshed every day, and we have to load new inserts and updates into SCD Type 2 tables with a flag column. Since we can't write a script for each table, we want to do it dynamically: get all the tables from staging and apply SCD Type 2 to all of them.
@rajunaik8803 · 11 months ago
@@NaveenKumar-kb2fm You can create a metadata ingestion framework. Basically, you capture the metadata of all tables (source table name, target table name, key columns, etc.) in one table, called the metadata table. Then build a generic stored procedure or Spark SQL merge statement to achieve SCD Type 2 using that metadata table. Finally, you loop through this procedure for all 50-odd tables recorded in the metadata table.
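A rough sketch of that metadata-driven loop (all table and column names are illustrative, and a complete SCD Type 2 flow would also insert the latest version of new and changed rows):

# Assumed metadata table with columns: src_table, tgt_table, key_cols (comma separated)
meta_rows = spark.table("control.scd2_metadata").collect()

for m in meta_rows:
    join_cond = " AND ".join(f"t.{k} = s.{k}" for k in m.key_cols.split(","))
    spark.sql(f"""
        MERGE INTO {m.tgt_table} t
        USING {m.src_table} s
        ON {join_cond} AND t.is_current = true
        WHEN MATCHED AND t.row_hash <> s.row_hash THEN
          UPDATE SET is_current = false, end_date = current_date()
    """)
    # A second merge or insert step would then add the new versions of the changed rows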
@abhishekrajoriya2507 · a year ago
Sir, when we use this skipRows option the header also gets skipped and the dataframe ends up with a wrong header. What can we do in this case?
@rajasdataengineering7585 · a year ago
Good question. For this scenario, we need to skip records 2 to 5 instead of the first 4 rows. I will cover how to handle that scenario in my next video.
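One possible workaround (not necessarily the approach shown in video 113): skip the header line together with the unwanted rows and supply the column names through an explicit schema. The schema below is taken from the sample data in the description; the file path is an assumption.

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("Id", IntegerType()),
    StructField("Year", IntegerType()),
    StructField("First Name", StringType()),
    StructField("County", StringType()),
    StructField("Gender", StringType()),
    StructField("Count", IntegerType()),
])

df = (spark.read
        .format("csv")
        .schema(schema)               # column names come from the schema, not the file
        .option("header", "false")    # the real header line is part of the skipped rows
        .option("skipRows", 5)        # 1 header line + 4 unwanted data rows
        .load("/tmp/baby_names.csv"))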
@abhishekrajoriya2507 · a year ago
@@rajasdataengineering7585 you are the best
@bhaskarjha4673 · a year ago
Take 10 rows using head and keep them in one df. Take all rows after skipping 20 rows using the skipRows option and keep them in another dataframe. Union both dataframes to get the desired result. Is this approach correct, Raja sir?
@rajasdataengineering7585 · a year ago
Hi Bhaskar, this approach is very close to the solution. Good thinking! One minor issue might arise while using the head command. The head command takes the first 10 records of the dataframe, but it is not guaranteed that the first 10 records of the CSV file will be the first 10 records of the dataframe, since this is a distributed system and the CSV file would have been split into partitions after being brought into the Spark environment. Thanks for sharing your approach 👍🏻
@bhaskarjha4673 · a year ago
Read the first 10 rows into a pandas df using head and then convert it into a Spark df
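A sketch of that combined approach for the quiz (keep rows 1-10, drop rows 11-20, keep the rest). The paths, the /dbfs prefix, and the notebook's spark session are assumptions, and column types may need explicit casting before the union:

import pandas as pd

# First 10 rows in file order, read on the driver with pandas
first10_pdf = pd.read_csv("/dbfs/tmp/baby_names.csv", nrows=10)
first10_df = spark.createDataFrame(first10_pdf)

# Everything after the header line and the first 20 data rows
rest_df = (spark.read
             .option("header", "false")
             .option("inferSchema", "true")
             .option("skipRows", 21)   # 1 header line + 20 data rows
             .csv("/tmp/baby_names.csv")
             .toDF(*[str(c) for c in first10_pdf.columns]))

result_df = first10_df.unionByName(rest_df)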
@sureshraina321 · a year ago
How to skip the last few lines?
@rmrz2225 · 10 months ago
Hi, could you share the quiz solution please? Thanks.
@rajasdataengineering7585 · 10 months ago
Hi, you can find the solution in the next video, 113.
@aravind5310 · 11 months ago
from pyspark.sql.functions import monotonically_increasing_id, col

df1 = df.coalesce(1).select("*", monotonically_increasing_id().alias("pk"))
df1.display()

df2 = df1.filter(~col('pk').between(4, 7))
df2.display()
@prabhatgupta6415 · a year ago
Sir, today you forgot to give the dataset :D
@rajasdataengineering7585 · a year ago
I will add the dataset in the description now.
@mankaransingh981 · 11 months ago
Sir in this example, the header is also getting skipped. How can we keep the header and then skip the rows?
@rajasdataengineering7585 · 11 months ago
Hi Manikaran, I have given a workaround for this requirement in my next video. Please watch video 113.
@mankaransingh981 · 11 months ago
Sure, thanks for the quick revert. Appreciate your work @@rajasdataengineering7585
@sabesanj5509 · a year ago
Awesome explanation Raja bro...
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv('baby_names.csv', header=True, inferSchema=True,
                    ignoreLeadingWhiteSpace=True, ignoreTrailingWhiteSpace=True,
                    skiprows=lambda i: i >= 11 and i <= 20)
@rajasdataengineering7585 · a year ago
Happy to see the approaches. Good try! But skipRows expects just an integer; it does not accept a lambda function. Thanks for sharing your approach.
@sabesanj5509 · a year ago
@@rajasdataengineering7585 Thanks Raja for the reply. Will the below logic be correct for the quiz?
filtered_df = df.filter((df['Id'] < 11) | (df['Id'] > 20))
@rajasdataengineering7585 · a year ago
Yes, this will produce the required output. But in this example we have an Id column, which is not the case in all real-time scenarios. So we should be able to produce the expected output even without an id column.
@9039522806 · a year ago
Hello, is the skipRows option applicable only to Databricks Spark applications, or does it also work with non-Databricks Spark applications like Cloudera etc.?
@rajasdataengineering7585 · a year ago
Hi, it works for all Spark-related solutions like Databricks, open-source Spark, Azure Synapse Spark pool, etc.