Reliably utilizing Spark, S3 and Parquet: Everybody says ‘I love you’; not sure they know what that entails
Posts over posts have been written about the wonders of Spark and Parquet. How one can simply save the RDD/Dataframes in parquet format into HDFS or S3. In many cases the job output is persisted to HDFS volumes that are located on the same machines in the Spark cluster. However, HDFS come with a price: Disk volume resources…