In an earlier post I introduced my client, which wraps calls to Spark’s REST API to submit jobs instead of using the spark-submit script. This REST API, while not officially part of the Spark documentation, works just fine and behaves almost the same way as the spark-submit script; we are even using it in production. Lately, however, I discovered one deviation from the spark-submit script concerning environment variables.
As explained in this blog post, to correctly propagate environment variables to the spawned JVMs of drivers and executors, one should use the spark.driver.extraJavaOptions and spark.executor.extraJavaOptions properties. When using the submit script, these can either be supplied as arguments to the script or be placed permanently in the spark-defaults.conf file. Some find the latter more convenient, since it avoids supplying those arguments on every submission.
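For reference, the spark-defaults.conf approach looks roughly like the sketch below; the -DENV=prod value is just an illustration of passing our environment name down as a JVM system property, matching the ENV/prod pair used later in the client setup:

    # spark-defaults.conf -- picked up by spark-submit, but (as described below) not by the REST API
    spark.driver.extraJavaOptions      -DENV=prod
    spark.executor.extraJavaOptions    -DENV=prod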
What I found is that when using the REST API, the spark-defaults.conf file is not picked up at all. Since we needed an environment variable to supply the name of the application environment (so the job picks up the correct configuration), it was a shame we could no longer just drop it in the conf file.
Luckily, the REST API actually simplifies adding environment variables by accepting an environmentVariables key in the submit request body! On top of that, I added the option to initialize the client with default environmentVariables that are added to every request issued from it, so there is no need to repeat them per request. This way, it works much like the conf file.
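To give an idea of what goes over the wire, the body POSTed to the master’s REST endpoint (http://master-host:6066/v1/submissions/create) looks roughly like the sketch below. The jar path, main class, app name, and arguments are placeholders, and the exact set of fields may differ slightly between Spark versions:

    {
      "action" : "CreateSubmissionRequest",
      "clientSparkVersion" : "1.5.2",
      "appResource" : "file:/path/to/my-app.jar",
      "mainClass" : "com.example.MyApp",
      "appArgs" : [ "someArg" ],
      "environmentVariables" : { "ENV" : "prod" },
      "sparkProperties" : {
        "spark.app.name" : "MyApp",
        "spark.master" : "spark://master-host:6066"
      }
    }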
This is what our client initialization looks like in the end:
@Bean
public SparkRestClient sparkRestClient(SparkRestClientProperties sparkRestClientProperties) {
    // Default environment variables, added to every submit request issued by this client
    final Map<String, String> environmentVariables = new HashMap<>();
    environmentVariables.put("ENV", "prod");
    return SparkRestClient.builder()
            .masterHost(sparkRestClientProperties.getMasterHost())
            .environmentVariables(environmentVariables)
            .sparkVersion("1.5.2")
            .poolingHttpClient(5)
            .build();
}