Spark Interview Question | Scenario Based Question | Multi Delimiter | LearntoSpark

69,898 views

Azarudeen Shahul

1 day ago

Comments: 91
@scavengernight3129 · 1 year ago
Thanks a lot, Azarudeen Shahul. Excellent content and a very precise way of explaining the concepts. Literally found a good source for Data Engineering. God bless you, man 🤝
@AzarudeenShahul · 1 year ago
Thanks a lot for all your kind words and support 🙂
@soumikdutta77 · 2 years ago
Thank you so much, man. Such excellent content and a proper explanation, clearing up my concepts on Spark RDDs and DataFrames. Kudos to you.
@AzarudeenShahul · 2 years ago
Thank you for your support 🙂
@aneksingh4496 · 4 years ago
But without writing so much code, the below code works fine: spark.read.option("inferSchema", "true").option("header", "true").option("delimiter", "~|").csv("data/multi_delimiter.csv").show()
@AzarudeenShahul · 4 years ago
Multi-character delimiters are handled in Spark 3.0; that's why you are able to make it that simple. But in Spark 2.4 this code will not work and throws an error stating that a multi-character delimiter is not allowed. I have made a separate video on this Spark 3.0 multi-delimiter update. Hope that answers it.
@PraveenKumar-dq7sq · 2 years ago
Super bro
@cydia007 · 3 years ago
6:16 Why do you have to convert to an RDD to do the map(), then convert it back to a DataFrame later? What is the necessity here? Thanks!
@pridename2858 · 1 year ago
Fantastic solutions explained in a very precise way. Excellent material. Keep doing this.
@AzarudeenShahul · 1 year ago
Thanks for your support 🙂
@sssaamm29988 · 2 years ago
Very good scenario-based questions and answers. Keep them coming. Very useful.
@AzarudeenShahul · 2 years ago
Thanks for your support 🙂
@saranyanallasamy5812 · 3 years ago
Which version of Spark are you using? Spark 3 supports multiple delimiters.
@AzarudeenShahul · 3 years ago
In this video, we use Spark 2.4. You can also check the video in our playlist on the same use case with support in Spark 3.0.
@saranyanallasamy5812 · 3 years ago
@@AzarudeenShahul ok, thanks
@siddharthsingh5031 · 3 years ago
.option('ignore', '~') should also work.
@AzarudeenShahul · 3 years ago
If ~ appears in the data, it might be ignored too, which can cause data loss.
@parammani4717 · 2 years ago
Hi bro, which tool are you using for this demo? Can you share a video on how to install PySpark?
@AzarudeenShahul · 1 year ago
In this video, I use a Jupyter notebook with standalone Spark installed on my laptop for the demo. We have a well-explained video on how to install Spark on your Windows PC. You can also use Databricks Community Edition for the hands-on.
@itzmegokz1199 · 4 years ago
Nice, Azar. Can you please put up a video on setting up PySpark and Jupyter notebook on Windows 10? I am facing a lot of issues.
@AzarudeenShahul · 4 years ago
Thanks Gokul, we do have a blog that describes the setup of Spark on Windows with Jupyter. Please have a look at www.learntospark.com
@itzmegokz1199 · 4 years ago
@@AzarudeenShahul Thanks much! I cleared one interview with the help of your videos. Keep going.
@yogyathas6366 · 3 years ago
Great content shared by you, especially the scenario-based ones. Keep continuing. Good luck.
@SurendraKapkoti · 2 years ago
Thank you very much, friend. Very useful content.
@AzarudeenShahul · 1 year ago
Thanks for your support
@pratik5648 · 10 months ago
01:02 Loading a text file or CSV file with multiple delimiters as a data frame using partition in Spark
02:04 Using options to specify the delimiter in the Spark session for reading CSV files
04:08 Reading a text file using Spark
05:10 Populated records with correct values and headers
06:12 Header column removed from the data frame
08:13 Converted CSV file into a data frame structure
@kasiviswanathan5323 · 3 years ago
Explanation is very good
@amiyaghosh6869 · 6 months ago
Using Spark SQL it can also be done easily with the substring function, instead of doing such complex things. I would read the data with the pipe delimiter and header set to true, then rename the Name~ column to Name, and then apply a normal SQL query with the substring function to drop the trailing ~ from each Name value.
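A minimal PySpark sketch of that idea (the file path and the Name~|Age header are assumptions based on the sample data discussed in the video, not confirmed values):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Split on the single-character "|" only; the first header field then comes out as "Name~"
# and every Name value keeps a trailing "~".
df = spark.read.option("header", "true").option("delimiter", "|").csv("input.csv")

# Rename "Name~" to "Name" and use substring to drop the trailing "~" from the values.
df_clean = (df.withColumnRenamed("Name~", "Name")
              .withColumn("Name", F.expr("substring(Name, 1, length(Name) - 1)")))
df_clean.show()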
@youranji · 4 years ago
Thanks, the explanation is very good. May I know why you used a SparkSession instead of a SparkContext?
@cydia007 · 3 years ago
Because in the newer API, SparkSession wraps multiple contexts and configs, and SparkContext is one of them; SparkSession is also the officially preferred entry point.
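For illustration, a minimal sketch of that relationship (the app name is just a placeholder):
from pyspark.sql import SparkSession

# SparkSession is the unified entry point since Spark 2.0...
spark = SparkSession.builder.appName("multi-delimiter-demo").getOrCreate()

# ...and the older SparkContext is still reachable through it when the RDD API is needed.
sc = spark.sparkContext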
@anikr4502 · 4 years ago
Can you please explain how to decide the number of cores and executors in Spark?
@harshitasija5801 · 1 year ago
Can't we do it this way? val data = spark.read.format("csv").option("header", true).option("delimiter", "~|").load("resources/input_data.txt"); data.show(false)
@AzarudeenShahul · 1 year ago
Yes, this works from Spark 3.0 onwards. But with older versions of Spark we don't have a direct option to provide a multi-character delimiter. Hope you understood. Thanks.
@praveensingh5922 · 1 year ago
Can you explain this using Spark with Scala? I am facing multiple issues while mimicking it in Scala.
@AzarudeenShahul · 1 year ago
Sure, I will make a video or blog using Spark with Scala.
@GauravKumar-vw3up · 4 years ago
Nice explanation Azarudeen!!
@AzarudeenShahul · 4 years ago
Thanks for your support :)
@HanSuVin · 2 years ago
How do I do the same in Scala? I tried but am not getting the expected result. Does this not work in Scala?
@dharmeswaranparamasivam5498 · 4 years ago
Very good. Keep doing good stuff.
@AzarudeenShahul · 4 years ago
Thanks for your support. Please do share with your friends.
@bhanubrahmadesam4508 · 3 years ago
Here's another solution using PySpark, without converting to an RDD and back to a DataFrame:
from pyspark.sql.functions import split, col
print('This is our input')
df1 = spark.read.text('in1.txt')
df1.show(truncate=0)
header = df1.first()[0]
schema = header.split('~|')
dfA = df1.filter(df1['value'] != header)
print('We want this output')
dfA.withColumn(schema[0], split(col('value'), "\\~\\|").getItem(0)) \
   .withColumn(schema[1], split(col('value'), "\\~\\|").getItem(1)) \
   .drop('value').show()
@radhakrishnanselvaraj518 · 3 years ago
Hi, can you do an example for received, processed, filtered and output row counts without df.count()?
@AzarudeenShahul · 3 years ago
Great question 😊. Sure, we will cover this topic in an upcoming video.
@goldykarn5922 · 6 months ago
I had written this solution, which works to get the output data: df1 = spark.read.option("header", "true").option("delimiter", "~|").csv("/FileStore/tables/input.csv").show() ... can someone tell me why I am not getting the "multiple delimiter" error that the educator got in this video?
@maxdevilio · 6 months ago
Spark 3.0 handles multi-character delimiters for you. Earlier versions of Spark don't.
@goldykarn5922 · 6 months ago
@@maxdevilio thanks
@deepikakumari5369 · 4 years ago
Thanks for such a nice explanation.
@gayathrilakshmi6087 · 3 years ago
Now in Spark 3.0.2 we can have two-character delimiters when reading CSV. It is not showing an error; they have fixed the issue.
@AzarudeenShahul · 3 years ago
Yes, we have a video on this feature update as well. Please check our playlist.
@prashantgangulwar2370 · 2 years ago
Hi Azarudeen, I wanted to ask you one scenario-based question which was asked to me in an interview, but I was not able to answer it. Can you please explain it? I am waiting for your reply.
@debasishkhuntia6998 · 1 year ago
What is the question?
@ravi19900 · 1 year ago
Well explained 🎉
@AzarudeenShahul · 1 year ago
Thank you 🙂
@creativeminds7397 · 3 years ago
Hello Azarudeen, your videos are awesome. I have one question; can you please provide the code? 1) I want to decrypt files using a private key. All my files are PGP-encrypted, and the files and private key are stored in an S3 bucket. Please help me with the code.
@vijaykiran3092 · 2 years ago
I have tried the same in Spark 3.3, and there I am able to pass the delimiter as ~|; it seems it has now been updated.
@AzarudeenShahul · 2 years ago
Yes, from Spark 3 we have this update. We also have a video on it; please check.
@ankbala · 3 years ago
Thanks very much for the explanation. I didn't understand the statement df_input = df.filter(df['value'] != header).rdd.map(lambda x: x[0].split('~|')).toDF(schema) at all. Could you please spend some time explaining it? Other than this, everything is like a cakewalk.
@rushikeshparab132 · 2 years ago
He is filtering out the header row from the df using df['value'] != header; then that df is converted to an RDD and passed to the map function. In map, each row is split on ~|.
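Putting that whole flow together, a minimal PySpark sketch (the file name input.csv and the Name~|Age header line are assumptions for illustration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multi-delimiter").getOrCreate()

# Read the raw lines; spark.read.text gives one string column named "value".
df = spark.read.text("input.csv")

# The first line is the header, e.g. "Name~|Age"; split it to get the column names.
header = df.first()[0]
schema = header.split("~|")

# Drop the header row, split every remaining line on the literal "~|",
# and rebuild a DataFrame with the header fields as column names.
df_input = (df.filter(df["value"] != header)
              .rdd
              .map(lambda x: x[0].split("~|"))
              .toDF(schema))
df_input.show()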
@saarush · 4 years ago
Can you please explain how to do it in Scala, with only the header to split?
@sapatil8999 · 1 year ago
We can use: df = spark.read.csv("csv file path", header=True, sep="~|")
@AzarudeenShahul · 1 year ago
Thanks for sharing ☺️ Yes, you are right. With Spark version 3 and above, one can directly use multi-character delimiters; prior to v3.0, multi-character delimiters are not supported, and you can refer to this video for that case.
@avinash7003 · 1 year ago
Is it Databricks or Cloudera?
@AzarudeenShahul · 1 year ago
The demo for this video was done on my own laptop, where I installed Spark and used a Jupyter notebook as the IDE. Hope this answers your question. Nowadays we do demos in Databricks Community Edition :)
@avinash7003 · 1 year ago
@@AzarudeenShahul How are the calls for Big Data on AWS?
@parammani4717 · 2 years ago
Can you put up an example in Spark with Scala, please?
@tanushreenagar3116 · 2 years ago
Nice
@barathy.m5589 · 4 years ago
Nice stuff
@AzarudeenShahul · 4 years ago
Thanks for your support :)
@maheshk1678 · 4 years ago
Hi Azhar, could you please tell me how to handle the same in Spark with Scala?
@AzarudeenShahul · 4 years ago
Hi Mahesh, for the multi-delimiter case, Spark with Scala is much the same as PySpark. For your reference, try it out:
val df = spark.read.text("inputpath.csv")
df.rdd.map(i => i.mkString.split("\\~\\|")).toDF().show(false)
Hope you find this useful :)
@preethamp1826
@preethamp1826 4 жыл бұрын
Azarudeen Shahul Hi Azhar, I tried executing this command but I am end up getting result only one column.
@maheshk1678 · 4 years ago
@@AzarudeenShahul thank you
@AzarudeenShahul · 4 years ago
Can you please share the screenshot / code snippet that you tried?
@preethamp1826 · 4 years ago
@@AzarudeenShahul I can't share it here. I am not getting an option to share a picture.
@AtifImamAatuif · 3 years ago
Spark 3 now supports multi-character delimiters.
@AzarudeenShahul · 3 years ago
Yes, we have a separate video for that 😊
@vinayreddy98 · 4 years ago
Can you tell us why you used x[0] in the map instead of x?
@AtifImamAatuif · 3 years ago
Think of x as a list (it is actually a Row). You can't split a list directly, so to fetch the element of the list he has used x[0]; x[0] is a string, so you can split it. You can also pass x['value']; it will work.
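As a small illustration of both access styles (the file path is a placeholder; spark.read.text yields one string column named "value"):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.text("input.csv")  # each row is a Row with a single "value" field
rdd = df.rdd

# These two are equivalent ways to get the raw line out of each Row before splitting:
by_index = rdd.map(lambda x: x[0].split("~|"))        # positional access to the first (only) field
by_field = rdd.map(lambda x: x["value"].split("~|"))  # access by field name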
@shafickrahman88 · 3 years ago
0 is the index.
@surajgolhare5158 · 2 years ago
How can we achieve this using Scala?
@teluguvihari5427 · 3 years ago
When we are dealing with huge data, like 10 TB, performance totally degrades...
@chaitrakr3908 · 1 year ago
Hi Azarudeen, df.filter(df['value'] != header).rdd.map(lambda z: z[0].split('-|')).toDF(schema) is giving the error below, and I tried setting the environment variables as well but no luck. Can you please help me with this?
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 11.0 failed 1 times, most recent failure: Lost task 0.0 in stage 11.0 (TID 11) (chaitra executor driver): java.io.IOException: Cannot run program "python3": CreateProcess error=3, The system cannot find the path specified
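That particular Py4JJavaError usually means the Spark worker on Windows cannot find a "python3" executable on the PATH, rather than a problem with the split itself. One common fix, sketched below under that assumption, is to point PySpark at the notebook's own interpreter before the SparkSession is created:
import os
import sys

# Point both the worker and the driver at the interpreter running this notebook,
# so Spark does not try to launch a separate "python3" binary.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable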