🚀 PYSPARK Challenge - Day 3️⃣
---------------------------------------------
🎯 PROBLEM STATEMENT
---------------------------------------------
Combine Two DataFrames
Write a PySpark program to report the first name, last name, city, and state of each person in the Person DataFrame. If a person's personId has no matching row in the Address DataFrame, report null for city and state instead.
---------------------------------------------
📝 Schema And Data :
---------------------------------------------
Difficulty Level : EASY
Input Data :
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Define schema for the 'persons' table
persons_schema = StructType([
    StructField("personId", IntegerType(), True),
    StructField("lastName", StringType(), True),
    StructField("firstName", StringType(), True)
])

# Define schema for the 'addresses' table
addresses_schema = StructType([
    StructField("addressId", IntegerType(), True),
    StructField("personId", IntegerType(), True),
    StructField("city", StringType(), True),
    StructField("state", StringType(), True)
])

# Define data for the 'persons' table
persons_data = [
    (1, 'Wang', 'Allen'),
    (2, 'Alice', 'Bob')
]

# Define data for the 'addresses' table
addresses_data = [
    (1, 2, 'New York City', 'New York'),
    (2, 3, 'Leetcode', 'California')
]
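To make the target concrete, the expected result for this sample data can be worked out in plain Python, without Spark. This is only an illustrative sketch: the names `address_by_person` and `expected` are not part of the challenge, and the dictionary lookup assumes each personId has at most one address, which holds for the sample data but is not stated in the problem.

```python
# Sample data copied from the challenge.
persons_data = [
    (1, 'Wang', 'Allen'),
    (2, 'Alice', 'Bob')
]
addresses_data = [
    (1, 2, 'New York City', 'New York'),
    (2, 3, 'Leetcode', 'California')
]

# Index addresses by personId (assumption: at most one address per person).
address_by_person = {
    person_id: (city, state)
    for _address_id, person_id, city, state in addresses_data
}

# Keep every person; fall back to (None, None) when no address matches.
expected = [
    (first_name, last_name, *address_by_person.get(person_id, (None, None)))
    for person_id, last_name, first_name in persons_data
]

for row in expected:
    print(row)
```

Person 1 has no address row, so city and state come back as None. The address for personId 3 matches no person and is dropped.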
Key Concepts:
📚 What You'll Learn:
• How to define schemas in PySpark
• Loading sample data into PySpark DataFrames
• Performing a left join to combine two DataFrames
• Handling missing data with null
🔔 Make sure to subscribe and hit the notification bell so you don’t miss any upcoming challenges! 🚀