Things to consider before writing your next Spark App.

Sai Varun
Published in Engineering@Zemoso
2 min read · Nov 5, 2018


Apache Spark bills itself as a lightning-fast unified analytics engine. For about a year and a half now I have been part of the core development of a big data processing and analytics framework, and we use Apache Spark for data manipulation on the datasets in the framework.

Architecture-wise, clients talk to a Python-based server, and this server does the client work for the Spark cluster; naturally, we used PySpark as our interface to the cluster. Everything was fine at this point, but as the workload grew more demanding, we needed to cache the Spark DataFrame in memory instead of loading it from the source every time. This speeds up processing because we no longer pay the I/O overhead on each request.
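As a rough sketch of what that looks like in PySpark (the app name and dataset path here are hypothetical, not from our actual framework):

```python
from pyspark.sql import SparkSession

# Hypothetical app name and source path, for illustration only.
spark = SparkSession.builder.appName("analytics-framework").getOrCreate()

df = spark.read.parquet("s3://datasets/events")

# Keep the DataFrame in executor memory so repeated queries
# skip the I/O of re-reading it from the source.
df.cache()
df.count()  # cache() is lazy; an action materializes it

# Subsequent operations now read from memory, not from the source.
df.groupBy("event_type").count().show()
```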

Here’s the catch we didn’t notice, or rather didn’t research. In client mode, Apache Spark runs a JVM locally on the client, and that client holds a session object pointing to the cluster. This means the session context is packed into that client’s JVM, specific to that client, and will not work outside that particular JVM, viz. that particular client!

Well, what’s the problem then?

The problem shows up the moment we want to scale our servers out to meet the incoming demand from all these different clients: the caching concept no longer makes sense across different servers (the Spark documentation could be clearer about this). Different servers mean different JVMs and therefore different session contexts. So even though all these PySpark clients point to the same physical Spark cluster, the session objects are different and the local references are different, and it is impossible to cache or share DataFrames between different sessions/clients, even under the same application name. Since a client request is not guaranteed to be routed to any particular server, we could not continue with this approach. We needed something different: a Spark proxy.
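A minimal illustration of the failure mode, assuming two server processes with the same app name (the file names, view name, and path are hypothetical):

```python
# server_a.py -- runs in its own Python process, with its own driver JVM
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("analytics-framework").getOrCreate()
df = spark.read.parquet("s3://datasets/events")
df.createOrReplaceTempView("events")
spark.catalog.cacheTable("events")  # cached within THIS driver's session only

# server_b.py -- a second server process, same app name, different JVM
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("analytics-framework").getOrCreate()
# getOrCreate() here builds a brand-new session: the temp view and the
# cached table from server_a.py simply do not exist in this context.
spark.sql("SELECT * FROM events")  # raises AnalysisException: table not found
```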

The client applications communicate with the servers, and the servers communicate with a single centralised Spark client sitting outside them, delegating the Spark-client work of all the servers to it.

Well, do you have to implement your own proxy service? The answer is no; there are interesting projects that solve exactly this, such as Apache Livy, a REST service for managing long-running, shareable Spark sessions.
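As a rough sketch of the idea with Livy’s REST API (the host and the submitted code are assumptions for illustration): each app server talks HTTP to one Livy instance, which holds a single shared Spark session, so state cached there survives across requests from different servers.

```python
import requests

LIVY = "http://livy-host:8998"  # hypothetical Livy endpoint

# Create one shared PySpark session on the proxy; every app server
# can then submit statements against the same session id.
sess = requests.post(f"{LIVY}/sessions", json={"kind": "pyspark"}).json()
session_id = sess["id"]

# Any server can run code in the shared session, so a DataFrame cached
# here is visible to later requests routed through other servers.
stmt = requests.post(
    f"{LIVY}/sessions/{session_id}/statements",
    json={"code": "df = spark.read.parquet('s3://datasets/events'); "
                  "df.cache(); df.count()"},
).json()
```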

Happy big data!
