How to configure your Spark application in an Amazon EMR Notebook
In the last pill we learned how to set up Spark and Jupyter Notebook on macOS. Now it’s time to level up and configure your Spark application on Amazon EMR.
Apache Spark
Apache Spark is a distributed computing framework for executing data workloads across multiple nodes in a cluster. It was first released in 2014, and the latest stable version at the time of writing is 3.3.1.
Spark can run in multiple languages such as Scala, Java, SQL and Python. It is mainly used for ETL (Extract, Transform and Load) pipelines in which large amounts of data need to be processed at a time; however, it also covers streaming applications with the Spark Structured Streaming module.
The difference with traditional MapReduce is that Spark evaluates lazily: no transformation (map) is computed until an action (reduce) is called. Its architectural foundation is the RDD, or Resilient Distributed Dataset: a read-only multiset of data items distributed over a cluster of machines and maintained in a fault-tolerant way.
The DataFrame API was introduced in Spark 1.x as an abstraction over the RDD, as part of the Spark SQL module, which makes it possible to query data with SQL.
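As a quick illustration, here is a minimal PySpark sketch (the spark variable is assumed to be an existing SparkSession, created as shown in the next section) that builds a DataFrame and queries it both through the DataFrame API and with SQL:
# Minimal sketch: "spark" is assumed to be an existing SparkSession
data = [("Alice", 34), ("Bob", 45)]
df = spark.createDataFrame(data, ["name", "age"])

# DataFrame API: the filter is lazy and only computed when show() is called
df.filter(df.age > 40).show()

# SQL: register the DataFrame as a temporary view and query it
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 40").show()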
Spark Session
The SparkSession is the main entry point to the Spark environment. As the name suggests, it represents the active session of the cluster.
To create a basic SparkSession, just use SparkSession.builder:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
Once you have a SparkSession instance, you can start reading datasets and transforming them into visualizations or Pandas DataFrames.
df = spark.read.json("examples/src/main/resources/people.json")
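As an example of what you can do next, here is a small sketch (assuming the people.json file from the Spark examples and Pandas installed on the driver) that inspects the DataFrame and converts it to Pandas:
# Inspect the schema and a few rows
df.printSchema()
df.show(5)

# Convert to a Pandas DataFrame; this collects all rows to the driver,
# so only do it with small datasets
pandas_df = df.toPandas()
print(pandas_df.head())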
Spark Configuration
The Spark API lets you configure the environment for each session. How many resources does the application need (memory, CPUs…)? Do you need to load extra packages from the Maven repository? Which authentication keys are used to access the S3 bucket?
A subset of basic Spark configurations is listed below:
Parameter | Description | Type |
---|---|---|
jars | Jars to be used in the session | list of strings
pyFiles | Python files to be used in the session | list of strings
files | Files to be used in the session | list of strings
driverMemory | Amount of memory to be used for the driver process | string
driverCores | Number of cores to be used for the driver process | int
executorMemory | Amount of memory to be used for the executor process | string
executorCores | Number of cores to be used for the executor process | int
numExecutors | Number of executors to be launched for the session | int
archives | Archives to be used in the session | list of strings
queue | Name of the YARN queue | string
name | Name of the session (must be in lower case) | string
Spark configurations are added as <key, value> pairs on the SparkConf class. For the map to be parsed correctly, all Spark keys must start with the spark. prefix.
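As a sketch (the parameter values here are only illustrative), this is how spark.-prefixed keys could be set on a SparkConf and passed to the session builder:
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Illustrative values: note that every key starts with the "spark." prefix
conf = SparkConf() \
    .set("spark.executor.memory", "4g") \
    .set("spark.executor.cores", "2") \
    .set("spark.dynamicAllocation.enabled", "false")

spark = SparkSession.builder \
    .appName("configured-example") \
    .config(conf=conf) \
    .getOrCreate()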
For a more detailed list, please check the Spark Documentation.
Amazon EMR
Amazon EMR is a cloud solution for big data processing, interactive analytics and machine learning. It provides the infrastructure and the management to host an Apache Spark cluster together with Apache Hive and Presto.
Its elasticity enables you to quickly provision and resize the capacity of your cluster, and it is designed to reduce the cost of processing large amounts of data through Amazon EC2 Spot integration.
Jupyter Notebook with Spark settings on EMR
An Amazon EMR notebook is a serverless Jupyter notebook. It uses the Sparkmagic kernel as a client to execute the code through an Apache Livy server.
Sparkmagic
The Sparkmagic project includes a set of commands for interactively running Spark code in multiple languages, as well as some kernels that you can use to turn Jupyter into an integrated Spark environment.
Use %%configure to add the required configuration before you run your first Spark-bound code cell; this way you avoid clashing with the cluster-wide Spark configuration:
%%configure -f
{"executorMemory":"4G"}
If you want to add more specific configurations, the ones you would normally pass with the --conf flag, use a nested JSON object:
%%configure -f
{ "conf": {
    "spark.dynamicAllocation.enabled": "false",
    "spark.jars.packages": "io.qbeast:qbeast-spark_2.12:0.2.0,io.delta:delta-core_2.12:1.0.0",
    "spark.sql.extensions": "io.qbeast.spark.internal.QbeastSparkSessionExtension"
  }
}
Use %%info to display the current Livy sessions and verify that the settings were applied. On the server side, check the /var/log/livy/livy-livy-server.out log on the EMR cluster:
20/06/24 10:11:22 INFO InteractiveSession$: Creating Interactive session 2: [owner: null, request: [kind: pyspark, proxyUser: None, executorMemory: 4G, conf: spark.dynamicAllocation.enabled -> false, spark.jars.packages -> io.qbeast:qbeast-spark_2.12:0.2.0,io.delta:delta-core_2.12:1.0.0, spark.sql.extensions -> io.qbeast.spark.internal.QbeastSparkSessionExtension, heartbeatTimeoutInSecond: 0]]
In this article, we have seen the main session components of Apache Spark and how to configure Jupyter-Spark applications to run on an EMR cluster. In the next chapters, we will get more hands-on and try some Spark examples.
If you found this post helpful, don’t hesitate to share it and tag us on any social media!