How to configure your Spark application in an Amazon EMR Notebook

In the last pill we learned how to set up Spark and Jupyter Notebook on macOS. Now it’s time to level up and configure your Spark application on Amazon EMR.

Apache Spark

Apache Spark is a distributed computing framework for executing data workloads across multiple nodes in a cluster. It was first released in 2014, and the latest stable version is 3.3.1.

Spark can be used from multiple languages such as Scala, Java, SQL and Python. It is mainly used for ETL (Extract, Transform and Load) pipelines in which large amounts of data need to be processed at a time; however, it also covers streaming applications with the Spark Structured Streaming module.
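
To give a rough idea of what the Structured Streaming API looks like, here is a minimal sketch that uses the built-in rate source to generate test rows (the app name and rate are arbitrary, and this is only an illustration, not part of the EMR setup described later):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# The built-in "rate" source emits test rows (timestamp, value) at a fixed pace,
# which makes it easy to try the API without a real data stream.
stream_df = spark.readStream.format("rate").option("rowsPerSecond", 1).load()

# Print each micro-batch to the console; the query runs until it is stopped.
query = stream_df.writeStream.format("console").outputMode("append").start()
query.awaitTermination(10)  # wait up to ~10 seconds in this sketch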

A key difference from traditional MapReduce is lazy evaluation: no transformation (map) is computed until an action (reduce) is called. Its architectural foundation is the RDD, or Resilient Distributed Dataset: a read-only multiset of data items distributed over a cluster of machines and maintained in a fault-tolerant way.
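
A quick way to see this lazy evaluation in action is a minimal sketch with the RDD API (the numbers are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-evaluation-example").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(10))

# map() is a transformation: nothing runs yet, Spark only records the lineage.
squares = rdd.map(lambda x: x * x)

# reduce() is an action: only now is the job scheduled and executed on the cluster.
total = squares.reduce(lambda a, b: a + b)
print(total)  # 285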

The DataFrame API was introduced in Spark 1.3 and became the standard abstraction on top of the RDD in Spark 2.x, together with the Spark SQL module, which makes it possible to query data using SQL.
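
As a minimal sketch of both APIs (the toy people data is made up for illustration), you can build a DataFrame, register it as a temporary view and query it with plain SQL:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-sql-example").getOrCreate()

# A tiny in-memory DataFrame, just for illustration.
people = spark.createDataFrame(
    [("Michael", 29), ("Andy", 30), ("Justin", 19)], ["name", "age"]
)

# Register it as a temporary view so it can be queried with SQL.
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age >= 21").show()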

Spark Session

The SparkSession is the main entry point to the Spark environment. As the name suggests, it is a representation of the active Session of the cluster.

To create a basic SparkSession, just use SparkSession.builder:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

Once you have a SparkSession instance, you can start reading datasets and transforming them into visualizations or Pandas DataFrames.

df = spark.read.json("examples/src/main/resources/people.json")
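
For example, building on the df loaded above, a few common next steps look like this (just a sketch; keep in mind that toPandas() collects all the data to the driver):

# Inspect the first rows and the inferred schema directly in the notebook.
df.show()
df.printSchema()

# Convert to a Pandas DataFrame for plotting or local analysis.
# toPandas() brings the whole dataset to the driver, so only use it
# on data that fits in the driver's memory.
pandas_df = df.toPandas()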

Spark Configuration

The Spark API lets you configure the environment for each unique session. How many resources do you want for the application (memory, CPUs…)? Do you need to load extra packages from the Maven repository? What are the authentication keys to access the S3 bucket?

A subset of basic Spark configurations is listed below:

| Parameter | Description | Type |
| --- | --- | --- |
| jars | Jars to be used in the session | List of string |
| pyFiles | Python files to be used in the session | List of string |
| files | Files to be used in the session | List of string |
| driverMemory | Amount of memory to be used for the driver process | string |
| driverCores | Number of cores to be used for the driver process | int |
| executorMemory | Amount of memory to be used for the executor process | string |
| executorCores | Number of cores to be used for the executor process | int |
| numExecutors | Number of executors to be launched for the session | int |
| archives | Archives to be used in the session | List of string |
| queue | Name of the YARN queue | string |
| name | Name of the session (must be in lower case) | string |

Those configurations are added as <key, value> pairs on the SparkConf class. For the map to be parsed correctly, all the Spark property keys must start with the spark. prefix.
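
We will see these session parameters again with %%configure below. If you create the session yourself (for example, in a standalone PySpark script rather than through the notebook), roughly equivalent spark.* properties can be set as <key, value> pairs on a SparkConf and passed to the builder. A minimal sketch with arbitrary values:

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Equivalent spark.* properties, set as <key, value> pairs on SparkConf.
conf = (
    SparkConf()
    .setAppName("configured-app")
    .set("spark.driver.memory", "2g")
    .set("spark.driver.cores", "1")
    .set("spark.executor.memory", "4g")
    .set("spark.executor.cores", "2")
    .set("spark.executor.instances", "3")
)

spark = SparkSession.builder.config(conf=conf).getOrCreate()

Note that some of these properties, such as spark.driver.memory, only take effect if they are set before the driver JVM starts, which is one reason they are usually passed through %%configure on an EMR notebook rather than changed from a running session.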

For a more detailed list, please check the Spark Documentation.

Amazon EMR

Amazon EMR is a cloud solution for big-data processing, interactive analytics and machine learning. It provides the infrastructure and the management to host an Apache Spark cluster with Apache Hive and Presto.

Its elasticity enables you to quickly deploy and provision the capacity of your cluster, and it’s designed to reduce the cost of processing large amounts of data through Amazon EC2 Spot integration.

Jupyter Notebook with Spark settings on EMR

An Amazon EMR notebook is a serverless Jupyter notebook. It uses the Sparkmagic kernel as a client to execute the code through an Apache Livy server.

Sparkmagic

The Sparkmagic project includes a set of commands for interactively running Spark code in multiple languages, as well as some kernels that you can use to turn Jupyter into an integrated Spark environment.

Use %%configure to add the required configuration before you run your first Spark-bound code cell, and avoid trouble with the cluster-wide Spark configuration:

%%configure -f
{"executorMemory":"4G"}

If you want to add more specific configurations, the ones that would normally go with the --conf option, use a nested JSON object:

%%configure -f
{ "conf": { "spark.dynamicAllocation.enabled": "false",
            "spark.jars.packages": "io.qbeast:qbeast-spark_2.12:0.2.0,io.delta:delta-core_2.12:1.0.0",
            "spark.sql.extensions": "io.qbeast.spark.internal.QbeastSparkSessionExtension" } }

Check if the configuration is correct by executing:
%%info

On the server side, you can check the /var/log/livy/livy-livy-server.out log on the EMR cluster to verify that the session is created with the requested configuration:

20/06/24 10:11:22 INFO InteractiveSession$: Creating Interactive session 2: [owner: null, request: [kind: pyspark, proxyUser: None, executorMemory: 4G, conf: spark.dynamicAllocation.enabled -> false, spark.jars.packages -> io.qbeast:qbeast-spark_2.12:0.2.0,io.delta:delta-core_2.12:1.0.0, spark.sql.extensions -> io.qbeast.spark.internal.QbeastSparkSessionExtension, heartbeatTimeoutInSecond: 0]]

In this article, we’ve seen the main session components of Apache Spark and how to configure Jupyter Spark applications to run on an EMR cluster. In the next chapters, we will get more hands-on and try some Spark examples.

If you find this post helpful, don’t hesitate to share it and tag us on any social media!

Want to learn more about EMR?

Book a call with Paola to learn how we can help you: https://calendly.com/paolapardo/30min