pyspark scenarios 2 : how to read variable number of columns data in pyspark dataframe

  15,309 views

TechLake

1 day ago

pyspark scenarios 2 : how to read variable number of columns data in pyspark dataframe #pyspark #adf
Pyspark Interview question
Pyspark Scenario Based Interview Questions
Pyspark Scenario Based Questions
Scenario Based Questions
#PysparkScenarioBasedInterviewQuestions
#ScenarioBasedInterviewQuestions
#PysparkInterviewQuestions
Notebook Location:
github.com/rav...
Complete Pyspark Real Time Scenarios Videos.
Pyspark Scenarios 1: How to create partition by month and year in pyspark
• Pyspark Scenarios 1: H...
pyspark scenarios 2 : how to read variable number of columns data in pyspark dataframe #pyspark
• pyspark scenarios 2 : ...
Pyspark Scenarios 3 : how to skip first few rows from data file in pyspark
• Pyspark Scenarios 3 : ...
Pyspark Scenarios 4 : how to remove duplicate rows in pyspark dataframe #pyspark #Databricks
• Pyspark Scenarios 4 : ...
Pyspark Scenarios 5 : how read all files from nested folder in pySpark dataframe
• Pyspark Scenarios 5 : ...
Pyspark Scenarios 6 How to Get no of rows from each file in pyspark dataframe
• Pyspark Scenarios 6 Ho...
Pyspark Scenarios 7 : how to get no of rows at each partition in pyspark dataframe
• Pyspark Scenarios 7 : ...
Pyspark Scenarios 8: How to add Sequence generated surrogate key as a column in dataframe.
• Pyspark Scenarios 8: H...
Pyspark Scenarios 9 : How to get Individual column wise null records count
• Pyspark Scenarios 9 : ...
Pyspark Scenarios 10:Why we should not use crc32 for Surrogate Keys Generation?
• Pyspark Scenarios 10:W...
Pyspark Scenarios 11 : how to handle double delimiter or multi delimiters in pyspark
• Pyspark Scenarios 11 :...
Pyspark Scenarios 12 : how to get 53 week number years in pyspark extract 53rd week number in spark
• Pyspark Scenarios 12 :...
Pyspark Scenarios 13 : how to handle complex json data file in pyspark
• Pyspark Scenarios 13 :...
Pyspark Scenarios 14 : How to implement Multiprocessing in Azure Databricks
• Pyspark Scenarios 14 :...
Pyspark Scenarios 15 : how to take table ddl backup in databricks
• Pyspark Scenarios 15 :...
Pyspark Scenarios 16: Convert pyspark string to date format issue dd-mm-yy old format
• Pyspark Scenarios 16: ...
Pyspark Scenarios 17 : How to handle duplicate column errors in delta table
• Pyspark Scenarios 17 :...
Pyspark Scenarios 18 : How to Handle Bad Data in pyspark dataframe using pyspark schema
• Pyspark Scenarios 18 :...
Pyspark Scenarios 19 : difference between #OrderBy #Sort and #sortWithinPartitions Transformations
• Pyspark Scenarios 19 :...
Pyspark Scenarios 20 : difference between coalesce and repartition in pyspark #coalesce #repartition
• Pyspark Scenarios 20 :...
Pyspark Scenarios 21 : Dynamically processing complex json file in pyspark #complexjson #databricks
• Pyspark Scenarios 21 :...
Pyspark Scenarios 22 : How To create data files based on the number of rows in PySpark #pyspark
• Pyspark Scenarios 22 :...
dynamic split with new columns in pyspark
How to import flat files with a varying number of columns in pyspark?
read text file where each line has different number of columns?
Importing text file with varying number of columns in PySpark?
variable size of columns reading in pyspark?
How to create dynamic columns with variable size of columns in pyspark dataframe?
How do I add multiple columns to a DataFrame in Spark?
read csv with different number of columns per row using pyspark dataframe?
read text file with different column lengths?
How to read the csv file properly if each row contains different number of fields (number quite big)?
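All of the search questions above describe the same scenario covered in this video: read each line as plain text, split it on the delimiter, and project one column per position. A minimal sketch of that approach (not the exact notebook code; the file path is hypothetical and the column names follow the id/name/address/email/phone example used in the comments):

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, size, max as max_, col

spark = SparkSession.builder.appName("variable_columns").getOrCreate()

# Read every line as a single string column named "value".
raw_df = spark.read.text("/tmp/customers.csv")  # hypothetical path

# Split each line on the delimiter into an array column.
split_df = raw_df.select(split(col("value"), ",").alias("splittable_col"))

# Find the largest number of fields in any row.
max_fields = split_df.select(max_(size("splittable_col"))).collect()[0][0]

# Project one column per position; rows with fewer fields get nulls.
final_df = split_df.select(
    [col("splittable_col")[i].alias("col" + str(i)) for i in range(max_fields)]
)
final_df.show(truncate=False)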
pyspark sql
pyspark
hive
which
databricks
apache spark
sql server
spark sql functions
spark interview questions
sql interview questions
spark sql interview questions
spark sql tutorial
spark architecture
coalesce in sql
hadoop vs spark
window function in sql
which role is most likely to use azure data factory to define a data pipeline for an etl process?
what is data warehouse
broadcast variable in spark
pyspark documentation
apache spark architecture
which single service would you use to implement data pipelines, sql analytics, and spark analytics?
which one of the following tasks is the responsibility of a database administrator?
google colab
case class in scala

Comments: 38
@mohansonale8020
@mohansonale8020 7 months ago
Really, it's an amazing knowledge-sharing video.
@krishnachaitanyareddy2781
@krishnachaitanyareddy2781 1 year ago
Excellent video, thanks for sharing sir
@jaydeeppatidar4189
@jaydeeppatidar4189 2 years ago
Those were good, basic, must-know questions. Thanks for sharing!
@sravankumar1767
@sravankumar1767 2 years ago
Simply superb 👌 👏 👍
@lakshminarayana3168
@lakshminarayana3168 10 months ago
Thanks for sharing the knowledge ❤
@_jasonj_
@_jasonj_ 1 year ago
Great video, this is exactly what I need, but I have a question. When I split my data, which looks like 1234|5678, using | as the delimiter instead of ",", why does my result come out like ["1","2","3","4","|","5","6","7","8"] instead of ["1234","5678"]? *EDIT* - found the solution; I guess the pipe delimiter needs to be escaped in the split statement as "\\|" for it to work properly.
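For anyone hitting the same issue: split() takes a regular expression, and | is a regex metacharacter, so it has to be escaped to split on a literal pipe. A small illustration (the DataFrame here is made up for the example):

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("1234|5678",)], ["value"])

# Unescaped "|" is regex alternation, so the line is split between characters.
df.select(split(col("value"), "|").alias("parts")).show(truncate=False)
# Escaping the pipe splits on the literal delimiter: ["1234", "5678"]
df.select(split(col("value"), "\\|").alias("parts")).show(truncate=False)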
@explorewithaj4580
@explorewithaj4580 2 years ago
Excellent👏
@ravulapallivenkatagurnadha9605
@ravulapallivenkatagurnadha9605 2 years ago
Nice video
@sumantadutta5485
@sumantadutta5485 2 years ago
This implementation is very specific to one scenario and its assumptions. In real scenarios, one will not receive CSV data with a different number of values per row. Here, the assumption is that all data arrives in the correct schema order, like id, name, address, email, phone, so you can map the correct value to the correct column; we are just not showing which value belongs to which field but assuming it implicitly. Also, without a schema, no downstream application will be able to handle this data, as it will never know which column contains what. Processing with JSON could be the best way to handle a dynamic schema.
@TRRaveendra
@TRRaveendra 2 years ago
That's true for on-premises data warehousing projects migrating to the cloud. When it comes to advanced analytics projects where the source is IoT or machine-generated data (for example, server-side and machine-generated data at network-based companies), you can expect many kinds of CSV files: with a header, without a header, with multiple headers, or with a variable number of columns.
@Someonner
@Someonner 2 years ago
There are enough egotistical idiots trying to flex while taking an interview who ask this sort of question.
@mohitmotwani9256
@mohitmotwani9256 1 year ago
Defining a schema for the CSV can also solve the problem, but very interesting video. Thanks
@mehmetkaya4330
@mehmetkaya4330 1 year ago
Great! Thanks
@TRRaveendra
@TRRaveendra 1 year ago
You're welcome
@manognachowdary9797
@manognachowdary9797 1 year ago
Thank you for educating! Is there a video on dynamically selecting specific column names from a source dataset and renaming them as per the target, to find mismatches between datasets? If so, please provide it.
@akashsonone2838
@akashsonone2838 1 year ago
Getting an error while running it on PyCharm:
for i in range(splitable_df.select(max(size("Splitable_col"))).collect([0][0])):
TypeError: 'int' object is not callable
What could be the reason for the same?
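The thread doesn't resolve this, but one likely cause (an assumption, since the full script isn't shown) is the placement of the parentheses: collect() takes no arguments, and the [0][0] indexing belongs on its result, as discussed later in these comments. Also make sure max and size are the pyspark.sql.functions versions rather than shadowed names:

from pyspark.sql.functions import size, max as max_

# Hypothetical correction of the line quoted in the comment above.
max_fields = splitable_df.select(max_(size("Splitable_col"))).collect()[0][0]
for i in range(max_fields):
    pass  # build the column list here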
@mohansonale8020
@mohansonale8020 7 months ago
Another question: can we rename the above five column names of the final dataframe with different names?
@krishnachaitanyareddy2781
@krishnachaitanyareddy2781 1 year ago
Can I get a real-time project on Azure for interviews, with Azure Data Factory and Azure Databricks? Please let me know.
@chandramouli8407
@chandramouli8407 2 years ago
Sir, are any tutorials available for mainframes for beginners? Please share the link. Thank you sir 🙏🙏
@harishkanta3711
@harishkanta3711 1 year ago
Hello, can you please explain: in df.select(max(size('splittable_col'))).collect()[0][0], why did we add collect()[0][0] at the end?
@TRRaveendra
@TRRaveendra 1 year ago
collect() returns the data as a list of Row objects, so we slice it in Python: first row, then the first item.
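A small illustration of that reply, reusing the split_df, max_ and size names from the sketch near the top of this description (the printed output in the comments is approximate):

rows = split_df.select(max_(size("splittable_col"))).collect()
# rows       -> a Python list with one Row, e.g. [Row(max(size(splittable_col))=5)]
# rows[0]    -> the first (and only) Row
# rows[0][0] -> the first field of that Row: a plain Python int, usable in range()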
@harishkanta3711
@harishkanta3711 1 year ago
@@TRRaveendra thank you
@purnimabharti2306
@purnimabharti2306 2 years ago
What if we want to give the column names dynamically as well? Something else instead of c01?
@TRRaveendra
@TRRaveendra 2 years ago
Use the toDF() or withColumnRenamed() functions for assigning new column names or renaming columns.
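A quick illustration of the toDF() option, assuming the five-column example from this thread and the final_df name from the sketch near the top of this description:

# Positional rename of the generated col0..col4 columns.
renamed_df = final_df.toDF("id", "name", "address", "email", "phoneno")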
@starmscloud
@starmscloud 1 year ago
Try this: you can create a dictionary of old and new column names and use it dynamically, just like below.
from pyspark.sql.functions import col
colDict: dict = {"col0": "id", "col1": "name", "col2": "address", "col3": "email", "col4": "phoneno"}
df1.select([col(column).alias(colDict.get(column, column)) for column in df1.columns]).display()
@shubne
@shubne 1 year ago
You can also use .option('mode', 'permissive')
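For context, PERMISSIVE mode applies when you read with an explicit schema: rows with missing fields are kept and padded with nulls instead of failing the read. A minimal sketch (the path and column names are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("id", StringType(), True),
    StructField("name", StringType(), True),
    StructField("address", StringType(), True),
    StructField("email", StringType(), True),
    StructField("phoneno", StringType(), True),
])

df = (spark.read
      .schema(schema)
      .option("mode", "PERMISSIVE")  # keep short rows, fill missing columns with null
      .csv("/tmp/customers.csv"))    # hypothetical path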
@aryic0153
@aryic0153 1 year ago
In Scala?
@srikantha7290
@srikantha7290 2 years ago
Hi Sir, I've been working in the IBM TSM backup domain for the past 6 years. I'm planning to switch my career to Azure Data Engineering. Please suggest the best way and training with job support. For an Azure Data Engineer:
1. SQL Server & T-SQL queries
2. Azure fundamentals
3. Azure Active Directory
4. Azure Data Factory
5. Azure Synapse Analytics
6. Synapse Studio
7. Azure Storage (Blob)
8. Big data analytics (ADLS)
9. ADLA
10. U-SQL
11. Azure Databricks
12. Azure Key Vault
Are you covering all these topics? Please answer and share complete details for the course. Kindly do the needful. I have also sent a mail, sir, please check and reply 🙏.
@jaydeeppatidar4189
@jaydeeppatidar4189 2 years ago
You should go for the DP-900 and AZ-900 certifications first; you will get knowledge of Azure resources by doing those two certifications. After that you should either go for DP-203 or start learning Azure Data Factory. You may also learn Apache Spark/Hadoop to be a data engineer, but if you specifically want to be an Azure data engineer then you should go for Azure Data Factory and Azure Synapse Analytics. Azure Key Vault, Azure Functions, and Azure Active Directory are basic things, don't panic about those. A real Azure data engineer mainly works with Azure Data Factory, Azure Databricks, and Azure Data Lake Gen2. I would suggest going for Apache Spark and Databricks, which are popular nowadays. Azure Data Factory + Azure Databricks is a must for an Azure data engineer. Apache Spark + Databricks + basics of Hadoop and HDFS should be sufficient for a data engineer as a starter. Please note that in the data engineering field you must have strong SQL knowledge.
@srikantha7290
@srikantha7290 2 years ago
@@jaydeeppatidar4189 Thank you so much for the reply. Sir, are you providing any training? Please share contact details via mail; I have already sent one, please check and reply sir 🙏.
@jaydeeppatidar4189
@jaydeeppatidar4189 2 years ago
@@srikantha7290 No, I am a fresher but went through this situation. I was also confused by this type of question at an early stage of my career, so I thought to share my knowledge with you so that you can at least start learning. I would suggest learning from Udemy for a better experience and well-structured learning.
@srikantha7290
@srikantha7290 2 years ago
@@jaydeeppatidar4189 ok thanks
@vaibhavverma1340
@vaibhavverma1340 1 year ago
Hello sir, how to write the same code in PySpark (PyCharm IDE)?
@TRRaveendra
@TRRaveendra 1 year ago
Create a Spark session and specify the data file location from your local system.
@mdfurqan
@mdfurqan 1 year ago
There is one catch: what if the value itself contains the comma (delimiter)?
@TRRaveendra
@TRRaveendra 1 year ago
Then the data needs to be quoted with double or single quotes.
@mdfurqan
@mdfurqan 1 year ago
@@TRRaveendra Let's suppose it's generated data; do we have something to handle this at the time of writing the code?
@TRRaveendra
@TRRaveendra 1 year ago
@@mdfurqan Use the read mode option DROPMALFORMED and it will reject any bad data.
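Putting the two replies together, a minimal sketch (reusing the schema and hypothetical path from the PERMISSIVE example above): quote values that contain the delimiter and, if needed, drop rows that still don't fit the schema.

df = (spark.read
      .schema(schema)
      .option("quote", '"')             # a quoted value like "Hyderabad, India" stays one field
      .option("mode", "DROPMALFORMED")  # silently drops records that don't match the schema
      .csv("/tmp/customers.csv"))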