Superb explanation. Please send more real-time scenarios. It's really helpful. Thank you so much.
@gajulapatinagasai4563 · 3 days ago
df = spark.read.option('skipRows', 3).csv('path', inferSchema=True, header=True)
df.display()
We can follow this method as well; it also works fine.
@sravankumar1767 · 2 years ago
Superb explanation bro 👌 👏 👍
@sssaamm29988 · 2 years ago
Very informative video.
@rutulhatwar5145 · 2 years ago
After writing zipWithIndex I am getting an error that the file can't be found.
@levialberto4379 · 2 years ago
Incredible. Thank you.
@Thulasisingala · a year ago
Why have we used final_rdd.first() to get the skip line? collect()[0] also gives the first row, so we could use those columns directly in the lambda function.
@userbayya · 2 years ago
excellent
@ruinmaster5039 · a year ago
How do I skip the last 10 rows while reading a CSV in PySpark? Please help.
@starmscloud · a year ago
Skip last N rows. Let's say your data looks like this and you want to skip the last 8 records:

id,name,location
1,ravi,bangalore
2,raj,gurgaon
3,prasad,hyderabad
4,sekhar,bangalore
line1
line2
line3
line4
line5
line6,col1,clo2,clo3,clo4,col5,clo6,clo7
line7,col1,clo2,clo3,clo4,col5,clo6,clo7
line8,col1,clo2,clo3,clo4,col5,clo6,clo7

# ++++++++++++++++++
# Solution:
# ++++++++++++++++++

# Create a file in a dbfs location:
dbutils.fs.put("dbfs:/FileStore/Scenarios/3_skip_last_few_rows/lastfewrows.csv", """id,name,location
1,ravi,bangalore
2,raj,gurgaon
3,prasad,hyderabad
4,sekhar,bangalore
line1
line2
line3
line4
line5
line6,col1,clo2,clo3,clo4,col5,clo6,clo7
line7,col1,clo2,clo3,clo4,col5,clo6,clo7
line8,col1,clo2,clo3,clo4,col5,clo6,clo7""", True)

# Using RDD:
# ++++++++++++++++++

# Read the file into an RDD
rddRaw = sc.textFile("dbfs:/FileStore/Scenarios/3_skip_last_few_rows/lastfewrows.csv")

# Get the RDD count, i.e. the number of lines
recordCount = rddRaw.count()

# Specify how many lines you want to skip from the end; in our case it's 8
lastNoofRowstoFilter = 8

# Column names come from the first (header) line
headerRowList = rddRaw.first().split(",")

# Keep only the rows after the header and before the last 8 lines (1-based index)
finalRDD = (rddRaw.zipWithIndex()
            .map(lambda row: (row[0], row[1] + 1))
            .filter(lambda row: row[1] + lastNoofRowstoFilter <= recordCount and row[1] > 1))

df = finalRDD.map(lambda row: tuple(str(row[0]).split(","))).toDF(headerRowList)

# Display the dataframe
df.display()
@VikramReddyAnapana · 2 years ago
wonderful
@TRRaveendra · 2 years ago
Thank you 👍
@svcc7773 · a year ago
Can't we delete those 4 lines from the Unix box and reprocess the file instead of changing the code? Usually code cannot be modified for this kind of issue.
@prabhakarsingh8444 · 2 years ago
Great explanation, it was very helpful. I am stuck on one question: I am unable to convert the RDD to a DF.

Below is the data in the Contact.csv file:
id,name,address
101,Abhay,"Delhi,Banglore"
102,Nishant,"Delhi , Ranchi"
103,Abhishek,Delhi

In the first row, the address column contains the delimiter comma ",", so the record gets split. In the second row, the address column has two newline characters after Delhi, so again the record gets split. Because of this I am unable to convert the RDD to a DF, and I get the exception:
java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema. 3 fields are required while 1 values are provided.
@TRRaveendra · 2 years ago
Create the df with spark.read.csv() and use option("multiLine", "true") together with the quote option, so quoted fields can contain commas and span lines.
@BHARATHKUMARS-mr5jc · 2 years ago
This works for normal files, but my CSV file is encoded as UTF-16. When I tried sc.textFile(path, use_unicode='utf-16') it still didn't work. Can you help me with this?
@TRRaveendra · 2 years ago
Send me the error message or data file at Pysparktelugu@gmail.com
@meghanadhage824 · 2 years ago
I have multiple CSV files whose header starting position varies from file to file, so I need to skip the unwanted rows dynamically. E.g.:
1. file1.csv: header starts from the 3rd row
2. file2.csv: header starts from the 7th row
The header starting position is never constant. How do I skip rows for these kinds of files?
@TRRaveendra · 2 years ago
Read it as a text file and filter out everything before the header based on the data, then split the remaining lines into individual columns.
@meghanadhage824 · 2 years ago
The header count also varies per file (3, 4, 5, etc.), but the first header is a date header for all the files. On that basis, can you please suggest how to filter the files? I am new to PySpark.
@TRRaveendra · 2 years ago
Then you should treat them as different structures and different sources: create separate dataframes, union (or union all) all the dataframes, then load them into a single table.
@chetanambi · a year ago
We can also try this method:

from pyspark.sql.functions import *
df.withColumn('index', monotonically_increasing_id()) \
  .filter('index > 2') \
  .drop('index') \
  .show(5)
@ruinmaster5039 · a year ago
What if we want to skip the last 10 rows while importing the CSV?
@DanishAnsari-hw7so · a year ago
@ruinmaster5039 filter(max(index) - 10) maybe
@starmscloud · a year ago
@ruinmaster5039 Skip last N rows. Let's say your data looks like this and you want to skip the last 8 records:

id,name,location
1,ravi,bangalore
2,raj,gurgaon
3,prasad,hyderabad
4,sekhar,bangalore
line1
line2
line3
line4
line5
line6,col1,clo2,clo3,clo4,col5,clo6,clo7
line7,col1,clo2,clo3,clo4,col5,clo6,clo7
line8,col1,clo2,clo3,clo4,col5,clo6,clo7

# ++++++++++++++++++
# Solution:
# ++++++++++++++++++

# Create a file in a dbfs location:
dbutils.fs.put("dbfs:/FileStore/Scenarios/3_skip_last_few_rows/lastfewrows.csv", """id,name,location
1,ravi,bangalore
2,raj,gurgaon
3,prasad,hyderabad
4,sekhar,bangalore
line1
line2
line3
line4
line5
line6,col1,clo2,clo3,clo4,col5,clo6,clo7
line7,col1,clo2,clo3,clo4,col5,clo6,clo7
line8,col1,clo2,clo3,clo4,col5,clo6,clo7""", True)

# Using RDD:
# ++++++++++++++++++

# Read the file into an RDD
rddRaw = sc.textFile("dbfs:/FileStore/Scenarios/3_skip_last_few_rows/lastfewrows.csv")

# Get the RDD count, i.e. the number of lines
recordCount = rddRaw.count()

# Specify how many lines you want to skip from the end; in our case it's 8
lastNoofRowstoFilter = 8

# Column names come from the first (header) line
headerRowList = rddRaw.first().split(",")

# Keep only the rows after the header and before the last 8 lines (1-based index)
finalRDD = (rddRaw.zipWithIndex()
            .map(lambda row: (row[0], row[1] + 1))
            .filter(lambda row: row[1] + lastNoofRowstoFilter <= recordCount and row[1] > 1))

df = finalRDD.map(lambda row: tuple(str(row[0]).split(","))).toDF(headerRowList)

# Display the dataframe
df.display()
@ranjansrivastava9256 · a year ago
Hi friend, really a very good explanation. One suggestion: can we use the 'skiprows' property instead of writing so much code? Please correct me.
@TRRaveendra · a year ago
I think skiprows is available only in pandas; these examples use Spark dataframes.
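For comparison, a small pandas-only sketch of what that reply refers to; the in-memory data stands in for a real file:

```python
import io

import pandas as pd

# Three junk lines, then the real header and data.
text = ("line1\nline2\nline3\n"
        "id,name,location\n"
        "1,ravi,bangalore\n"
        "2,raj,gurgaon\n")

# pandas can drop a fixed number of leading lines at read time;
# Spark's DataFrameReader has no standard equivalent, hence the
# RDD/zipWithIndex workarounds in the video.
pdf = pd.read_csv(io.StringIO(text), skiprows=3)
print(pdf)
```

Note that pandas reads the whole file on the driver, so this only substitutes for the Spark approach when the file fits in one machine's memory.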
@Rafian1924 · a year ago
@TRRaveendra Sometimes I feel pandas is richer in functionality than PySpark.
@jasonbernard5126 · 2 years ago
Nice explanation. Two questions. It appears that the file contents are loaded into memory; how do you handle a very large file? Would a window function be a better approach? Also, you are doing a simple split; what if the content uses quoting (1,"Smith, John","somewhere, TX")? How would the approach change?
@TRRaveendra · 2 years ago
Spark's internal architecture is distributed processing across multiple nodes. It divides the data into multiple blocks, and those are processed on the data nodes based on core availability.
@bharathr8078 · 2 years ago
Did you get an answer to your 2nd question?
@jasonbernard5126 · 2 years ago
@bharathr8078 No, but what I did is first get the header line into an RDD (rdd_header_line). Then I use spark.read.csv, i.e. header_df = spark.read.csv(path=rdd_header_line, sep=',', quote='"'), to get the header names into a Spark dataframe. I then grab the column names from the dataframe: column_names = [str(col_name) for col_name in header_df.collect()[0]]
@agamm21 · 2 years ago
How do I make this work with a text data file instead of CSV? When I tried it, all the columns came back as one single column.
@TRRaveendra · 2 years ago
First read with spark.read.text(), then use the split function.
@fansouzafrei · 2 years ago
Great solution! By the way, can we use the PySpark DataFrame API to skip these records, or can we only use RDDs? Congratulations!
@TRRaveendra · 2 years ago
Yes, it's possible with a dataframe, but you need to read without a header and use the window function row_number().
@raviyadav-dt1tb · 8 months ago
We can skip it in a very simple way, but you made it very lengthy.
@SaikotRoy · a year ago
This can be the alternative solution:

Data set:
+-----------+
|      value|
+-----------+
|     line1 |
|     line2 |
|     line3 |
|id,name,sal|
| 1,abc,1000|
| 2,cde,1000|
| 3,xyz,500 |
+-----------+

Solution:
------------------------------------------------
from pyspark.sql.functions import monotonically_increasing_id, split, size, max

q4_df = spark.read.text("/content/drive/sample.txt")
q4_df.show()

df1 = q4_df.withColumn("index", monotonically_increasing_id())
df1.show()

df2 = (df1.filter(df1.index > 3)
          .drop("index")
          .withColumn("splitted", split("value", ",")))
df2.show()

for i in range(df2.select(max(size(df2.splitted))).collect()[0][0]):
    df2 = df2.withColumn("col" + str(i), df2.splitted[i])

df2.drop("value", "splitted").toDF("EMPID", "EMPNAME", "SALARY").show()
@tamizh2003 · 10 months ago
Great, thanks. I tried another one too, and that also works fine:

val schema2 = new StructType()
  .add("id", IntegerType)
  .add("name", StringType)
  .add("gender", StringType)
  .add("baddata", StringType)

val df = spark.read.format("csv")
  .schema(schema2)
  .option("columnNameOfCorruptRecord", "baddata")
  .option("header", "true")
  .load("file:///c:/data/file1.txt")
df.show()

val fil_df = df.filter("id is not null").drop("baddata")
fil_df.show()