amazon | HaVeDa - A Backend Development Blog

Reliably utilizing Spark, S3 and Parquet: Everybody says ‘I love you’; not sure they know what that entails

October 29, 2017 | October 30, 2017 | ywilkof | 5 Comments

Post after post has been written about the wonders of Spark and Parquet, and about how one can simply save RDDs or DataFrames in Parquet format to HDFS or S3. In many cases the job output is persisted to HDFS volumes located on the same machines as the Spark cluster. However, HDFS comes with a price: disk volume resources…

Out of the Middle Ages: Use S3a File System with Spark (2.x), Hadoop (2.7.x) and AWS SDK (>1.7.4)

August 13, 2017 | August 15, 2017 | ywilkof | 1 Comment

Much has been said about the hardships of combining Apache Spark, the Hadoop libraries and Amazon’s AWS SDK. Take, as an example, reading from S3 storage using the s3a:// file system. If you have ever tried this setup, you know it is not a straightforward task. While it seems like this should work out…

Win over Spark distribution’s dependency conflicts with SBT shading

December 27, 2015 | April 10, 2016 | ywilkof | 2 Comments

In our production environment we are currently running a Spark cluster in standalone mode (version 1.5.2). We reached a point where it was necessary to add the Amazon S3 Java SDK (1.10.39), following the advice given in this post. What should have been an easy task proved to be problematic – the Jackson dependency for…