Nice explanations in all your PySpark videos. You have helped a lot of people and should be proud of yourself for it. Please continue this good work.
@Archanaishan 1 year ago
Hi! Your videos are helping me so much. Now I am able to learn PySpark very easily. Thank you so much!
@WafaStudies 1 year ago
Thank you 😊
@vadderamu5422 1 year ago
Awesome explanation, sir ❤
@Dilshad-z4k 5 months ago
What is the difference between approx_count_distinct() and countDistinct()?
@markzohan7835 1 year ago
What is the difference between approx_count_distinct() and countDistinct()? They are giving the same output. Also, please tell us which one performs better.
@ANILKUMARNAGAR 1 year ago
In PySpark, approx_count_distinct() and countDistinct() both count the number of distinct values in a DataFrame column, but they work differently. countDistinct() is a deterministic function that returns the exact number of distinct values. To do that it has to keep track of every distinct value it sees, so it is more accurate but can be slow and memory-hungry on large datasets with many distinct values.
approx_count_distinct() uses the HyperLogLog algorithm (HyperLogLog++ in Spark) to estimate the number of distinct values. It still reads every row, but instead of storing the values themselves it updates a small, fixed-size sketch. That is why it is faster and uses far less memory on large datasets. The trade-off is accuracy: the result is an estimate, and the allowed relative standard deviation can be set with the optional rsd parameter (default 0.05).
If both functions return exactly the same result, it usually means that the number of distinct values in the column is not very large, so the estimate happens to be exact. If the number of distinct values is very large, approx_count_distinct() may return a value somewhat higher or lower than the actual count. In that case, use countDistinct() when you need an exact count.
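To make the HyperLogLog idea a bit more concrete, here is a toy version in plain Python. It is only a sketch for illustration, not Spark's implementation (Spark uses HyperLogLog++, which adds bias corrections). The function name hll_estimate, the parameter p, and the sample data are all made up for this example. The point to notice is that it visits every value once but only keeps 2**p small registers, however many distinct values there are.

```python
import hashlib
import math


def hll_estimate(values, p=12):
    """Toy HyperLogLog: estimate the number of distinct values using 2**p registers."""
    m = 1 << p
    registers = [0] * m
    for v in values:
        # Hash each value to 64 bits (SHA-1 is used here only for convenience).
        h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
        idx = h >> (64 - p)                    # first p bits choose a register
        rest = h & ((1 << (64 - p)) - 1)       # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - p) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)

    alpha = 0.7213 / (1 + 1.079 / m)           # standard constant for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)

    # Small-range correction: fall back to linear counting when many registers are empty.
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)
    return raw


# Hypothetical data: 200,000 rows containing 50,000 distinct values.
data = [i % 50_000 for i in range(200_000)]
exact = len(set(data))        # what an exact distinct count has to compute
approx = hll_estimate(data)   # estimate from a fixed number of registers
print(exact, round(approx))
```

With p=12 the expected relative error is roughly 1.04 / sqrt(4096), about 1.6%, so the estimate can land a little above or a little below the exact count. In PySpark itself you would not write any of this; you would just call approx_count_distinct(col, rsd) and lower rsd if you need a tighter estimate.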