Thanks a lot, Azarudeen Shahul. Excellent content and a very precise way of explaining the concepts. I've literally found a good source for Data Engineering. God bless you, man 🤝
@AzarudeenShahul 1 year ago
Thanks a lot for all your kind words and support 🙂
@soumikdutta77 2 years ago
Thank you so much, man. Such excellent content and a proper walkthrough; it cleared up my concepts on Spark RDD and DataFrames. Kudos to you.
@AzarudeenShahul 2 years ago
Thank you for your support 🙂
@aneksingh4496 4 years ago
But without writing so much code, the code below works fine:
spark.read.option("inferSchema", "true").option("header", "true").option("delimiter", "~|").csv("data/multi_delimiter.csv").show()
@AzarudeenShahul 4 years ago
Multi-character delimiters are handled in Spark 3.0; that's why you are able to keep it simple. In Spark 2.4 this code will not work and throws an error stating that a multi-character delimiter is not allowed. I have made a separate video on this Spark 3.0 update. Hope that answers your question.
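For a feel of the limitation being discussed, the same single-character-delimiter restriction that Spark 2.4's CSV reader had also exists in Python's standard `csv` module, and the workaround is the same idea as in the video: split each line manually on the full "~|" token. A minimal sketch with made-up sample rows:

```python
import csv
import io

# Sample rows in the "~|" multi-character delimiter format
# from the video (hypothetical data).
lines = ["name~|age", "John~|30", "Mary~|25"]

# Like Spark 2.4's CSV reader, Python's csv module accepts only a
# single-character delimiter, so "~|" is rejected outright.
try:
    csv.reader(io.StringIO("\n".join(lines)), delimiter="~|")
    multi_char_supported = True
except TypeError:
    multi_char_supported = False

print(multi_char_supported)  # False

# Workaround: split each line manually on the full "~|" token,
# which is what the video's RDD map does.
header, *rows = [line.split("~|") for line in lines]
print(header)  # ['name', 'age']
print(rows)    # [['John', '30'], ['Mary', '25']]
```

This mirrors why the manual RDD split was needed before Spark 3.0 added native multi-character `sep` support.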
@PraveenKumar-dq7sq 2 years ago
Super bro
@cydia007 3 years ago
6:16 Why do you have to convert to an RDD to do the map(), then convert it back to a DataFrame later? What is the necessity here? Thanks!
@pridename2858 1 year ago
Fantastic solutions explained in a very precise way. Excellent material. Keep doing this.
@AzarudeenShahul 1 year ago
Thanks for your support 🙂
@sssaamm29988 2 years ago
Very good scenario-based questions and answers. Keep them coming; very useful.
@AzarudeenShahul 2 years ago
Thanks for your support 🙂
@saranyanallasamy5812 3 years ago
Which version of Spark are you using? Spark 3 supports multi-character delimiters.
@AzarudeenShahul 3 years ago
In this video we use Spark 2.4. You can also check the video in our playlist covering the same use case with Spark 3.0 support.
@saranyanallasamy5812 3 years ago
@@AzarudeenShahul ok, thanks
@siddharthsingh5031 3 years ago
.option('ignore', '~') should also work.
@AzarudeenShahul 3 years ago
If ~ appears inside the data, it might get ignored too, which can cause data loss.
@parammani4717 2 years ago
Hi bro, which tool are you using for this demo? Can you share a video on how to install PySpark?
@AzarudeenShahul 1 year ago
In this video I use a Jupyter notebook with standalone Spark installed on my laptop for the demo. We have a well-explained video on how to install Spark on your Windows PC. You can also use Databricks Community Edition for the hands-on.
@itzmegokz1199 4 years ago
Nice, Azar. Can you please put up a video on setting up PySpark and Jupyter notebook on Windows 10? I am facing a lot of issues.
@AzarudeenShahul 4 years ago
Thanks Gokul, we have a blog that describes the setup of Spark on Windows with Jupyter. Please have a look at www.learntospark.com
@itzmegokz1199 4 years ago
@@AzarudeenShahul Thanks so much! I cleared an interview with the help of your videos. Keep going.
@yogyathas6366 3 years ago
Great content shared by you, especially the scenario-based ones. Keep continuing. Good luck.
@SurendraKapkoti 2 years ago
Thank you very much, friend. Very useful content.
@AzarudeenShahul 1 year ago
Thanks for your support
@pratik5648 10 months ago
01:02 Loading a text file or CSV file with multiple delimiters as a data frame using partition in Spark
02:04 Using options to specify the delimiter in the Spark session for reading CSV files
04:08 Reading a text file using Spark
05:10 Populated records with correct values and headers
06:12 Header column removed from the data frame
08:13 Converted the CSV file into a data frame structure
@kasiviswanathan5323 3 years ago
Explanation is very good
@amiyaghosh6869 6 months ago
Using Spark SQL this can also be done easily with the substring function, instead of doing such complex things. I would read the data as pipe-delimited with header true, rename the column Name~ to Name, and then apply a normal SQL query using the substring function to trim the trailing ~ from the end of each Name.
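The idea in the comment above (read with a single-character "|" delimiter, then strip the leftover trailing "~" from each value) can be traced outside Spark with plain string operations. A rough sketch with made-up sample rows:

```python
# Hypothetical rows using the "~|" separator from the video.
lines = ["Name~|Age", "John~|30", "Mary~|25"]

# Step 1: read as pipe-delimited -> every field except the last
# keeps a stray trailing "~" (e.g. "Name~", "John~").
pipe_split = [line.split("|") for line in lines]
print(pipe_split[0])  # ['Name~', 'Age']

# Step 2: the SQL substring/rename step -> drop the trailing "~".
cleaned = [[field.rstrip("~") for field in row] for row in pipe_split]
print(cleaned[0])  # ['Name', 'Age']
print(cleaned[1])  # ['John', '30']
```

Note this relies on "~" never legitimately ending a field value, the same caveat raised elsewhere in this thread about ignoring "~".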
@youranji 4 years ago
Thanks, the explanation is very good. May I know why you used SparkSession instead of SparkContext?
@cydia007 3 years ago
Because in the newer API, SparkSession bundles multiple contexts and configs, and SparkContext is one of them; SparkSession is also the officially preferred entry point.
@anikr4502 4 years ago
Can you please explain how to decide the number of cores and executors in Spark?
@harshitasija5801 1 year ago
Can't we do it this way?
val data = spark.read.format("csv").option("header", true).option("delimiter", "~|").load("resources/input_data.txt")
data.show(false)
@AzarudeenShahul 1 year ago
Yes, this works from Spark 3.0. But with older versions of Spark we don't have a direct option to provide a multi-character delimiter. Hope you understood. Thanks.
@praveensingh5922 1 year ago
Can you explain this using Spark Scala? I am facing multiple issues while mimicking it in Scala.
@AzarudeenShahul 1 year ago
Sure, will make a video or blog using Spark Scala.
@GauravKumar-vw3up 4 years ago
Nice explanation Azarudeen!!
@AzarudeenShahul 4 years ago
Thanks for your support :)
@HanSuVin 2 years ago
How to do the same in Scala? I tried but did not get the expected result. Does this not work in Scala?
@dharmeswaranparamasivam5498 4 years ago
Very good. Keep doing good stuff.
@AzarudeenShahul 4 years ago
Thanks for your support. Please do share with your friends.
@bhanubrahmadesam4508 3 years ago
Here's another solution using PySpark, without converting to an RDD and back to a DataFrame:
print('This is our input')
df1 = spark.read.text('in1.txt')
df1.show(truncate=0)
header = df1.first()[0]
schema = header.split('~|')
dfA = df1.filter(df1['value'] != header)
print('We want this output')
dfA.withColumn(schema[0], split(col('value'), "\\~\\|").getItem(0)).withColumn(schema[1], split(col('value'), "\\~\\|").getItem(1)).drop('value').show()
@radhakrishnanselvaraj518 3 years ago
Hi, can you do an example of getting received, processed, filtered, and output row counts without df.count()?
@AzarudeenShahul 3 years ago
Great question 😊. Sure, we will cover this topic in an upcoming video.
@goldykarn5922 6 months ago
I had written this solution, which works to get the output data: df1 = spark.read.option("header","true").option("delimiter","~|").csv("/FileStore/tables/input.csv").show() ... Can someone tell me why I am not getting the "multiple delimiter" error the educator got in this video?
@maxdevilio 6 months ago
Spark 3.0 handles multi-character delimiters for you. Earlier versions of Spark don't.
@goldykarn5922 6 months ago
@@maxdevilio thanks
@deepikakumari5369 4 years ago
Thanks for such a nice explanation.
@gayathrilakshmi6087 3 years ago
Now in Spark 3.0.2 we can have multi-character delimiters when reading CSV; it no longer shows an error. They have fixed the issue.
@AzarudeenShahul 3 years ago
Yes, we have a video on this feature update as well. Please check our playlist.
@prashantgangulwar2370 2 years ago
Hi Azarudeen, I wanted to ask you one scenario-based question that I was asked in an interview but was not able to answer. Please can you explain it? I am waiting for your reply.
@debasishkhuntia6998 1 year ago
What is the question?
@ravi19900 1 year ago
Well Explained 🎉
@AzarudeenShahul 1 year ago
Thank you 🙂
@creativeminds7397 3 years ago
Hello Azarudeen, your videos are awesome. I have one question; can you please provide the code? 1) I want to decrypt a file using a private key. All my files are PGP-encrypted, and the files and private key are stored in an S3 bucket. Please help me with the code.
@vijaykiran3092 2 years ago
I have tried the same in Spark 3.3; there I am able to pass the delimiter as ~|. It seems it is updated now.
@AzarudeenShahul 2 years ago
Yes, from Spark 3 we have this update. We also have a video on it; please check.
@ankbala 3 years ago
Thanks very much for the explanation. I didn't understand the statement df_input = df.filter(df['value'] != header).rdd.map(lambda x: x[0].split('~|')).toDF(schema) at all. Could you please spend some time explaining it? Other than this, everything is a cakewalk.
@rushikeshparab132 2 years ago
He is filtering out the header row from the DataFrame using df['value'] != header. Then that DataFrame is converted to an RDD and passed to the map function, where each row is split on ~|.
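The filter-then-split pipeline being asked about can be traced step by step with ordinary Python lists standing in for the DataFrame's single "value" column (sample data here is made up):

```python
# Each element plays the role of one row of the single-column
# DataFrame produced by spark.read.text().
rows = ["name~|age", "John~|30", "Mary~|25"]
header = rows[0]

# df.filter(df['value'] != header): drop the header row.
data_rows = [r for r in rows if r != header]

# .rdd.map(lambda x: x[0].split('~|')): split each row into fields.
records = [r.split("~|") for r in data_rows]

# .toDF(schema): pair fields with the column names from the header.
schema = header.split("~|")
df_like = [dict(zip(schema, rec)) for rec in records]
print(df_like)
# [{'name': 'John', 'age': '30'}, {'name': 'Mary', 'age': '25'}]
```

In the real pipeline each RDD record is a Row wrapping the line, hence the x[0] indexing before the split.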
@saarush 4 years ago
Can you please explain how to do the split of only the header in Scala?
@sapatil8999 1 year ago
We can use df = spark.read.csv("csv file path", header=True, sep="~|")
@AzarudeenShahul 1 year ago
Thanks for sharing ☺️ Yes, you are right. With Spark version 3 and above one can directly use multi-character delimiters; prior to v3.0 they are not supported, and you can refer to this video.
@avinash7003 1 year ago
Is it Databricks or Cloudera?
@AzarudeenShahul 1 year ago
The demo for this video is shown on my own laptop, where I installed Spark and used a Jupyter notebook as the IDE. Hope this answers your question. Nowadays we do demos in Databricks Community Edition :)
@avinash7003 1 year ago
@@AzarudeenShahul How are the calls for Big Data AWS?
@parammani4717 2 years ago
Can you put an example in Spark Scala, please?
@tanushreenagar3116 2 years ago
Nice
@barathy.m5589 4 years ago
Nice stuff
@AzarudeenShahul 4 years ago
Thanks for your support :)
@maheshk1678 4 years ago
Hi Azhar, could you please tell me how to handle the same in Spark Scala?
@AzarudeenShahul 4 years ago
Hi Mahesh, for the multi-delimiter case, Spark Scala will be much the same as PySpark. For your reference, try this out:
val df = spark.read.text("inputpath.csv")
df.rdd.map(i => i.mkString.split("\\~\\|")).toDF().show(false)
Hope you find this useful :)
@preethamp1826 4 years ago
Hi Azhar, I tried executing this command but I ended up getting the result in only one column.
@maheshk1678 4 years ago
@@AzarudeenShahul thank you
@AzarudeenShahul 4 years ago
Can you please share the screenshot or code snippet that you tried?
@preethamp1826 4 years ago
@@AzarudeenShahul I can't share here. I am not getting an option to share a picture.
@AtifImamAatuif 3 years ago
Spark 3 now supports multi-character delimiters.
@AzarudeenShahul 3 years ago
Yes, we have a separate video for that 😊
@vinayreddy98 4 years ago
Can you tell why you used x[0] in the map instead of x?
@AtifImamAatuif 3 years ago
Think of x as a list: you can't split a list directly, so to fetch the element of the list he used x[0]. x[0] is a string, so you can split it. You can also pass x['value']; it will work.
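In pure Python terms, each RDD record behaves like a one-element sequence wrapping the line, so you index into it before splitting; a one-element tuple makes a fair stand-in for a pyspark Row here:

```python
# A one-element tuple standing in for a pyspark Row("John~|30").
x = ("John~|30",)

# Splitting the sequence itself fails: split() is a string method,
# and tuples (like Rows) don't have it.
has_split = hasattr(x, "split")
print(has_split)  # False

# x[0] is the underlying string, which can be split.
print(x[0].split("~|"))  # ['John', '30']
```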
@shafickrahman88 3 years ago
0 is the index.
@surajgolhare5158 2 years ago
How can we achieve this using Scala?
@teluguvihari5427 3 years ago
When we are dealing with huge data, like 10 TB, performance totally degrades...
@chaitrakr3908 1 year ago
Hi Azarudeen, df.filter(df['value'] != header).rdd.map(lambda z: z[0].split('-|')).toDF(schema) is giving the error below, and I tried setting the environment variables but no luck. Can you please help me with this?
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 11.0 failed 1 times, most recent failure: Lost task 0.0 in stage 11.0 (TID 11) (chaitra executor driver): java.io.IOException: Cannot run program "python3": CreateProcess error=3, The system cannot find the path specified