Set up Jupyter + Spark on a Mac
Migrating from Linux to the more curated macOS can be tricky, especially if you are a developer. In this post, we will cover how to set up Spark and Jupyter Notebook on a Mac with the M1 chip.
1. Install Homebrew
Homebrew is "The Missing Package Manager for macOS". It installs the stuff you need that Apple didn't, placing each package in its own directory and symlinking the files into /usr/local (on Intel Macs) or /opt/homebrew (on Apple Silicon).
To install it, just download the installation script through curl and run it:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Once the process is done, make sure the brew command is added to your PATH by executing these two lines:
# This adds the brew command to your PATH every time you open a shell
echo 'eval "$(/opt/homebrew/bin/brew shellenv)"' >> ~/.zprofile
eval "$(/opt/homebrew/bin/brew shellenv)"
2. Install Java
Apache Spark uses the JVM to execute its tasks, so we need a compatible Java version to run the notebooks with a distributed engine.
You can install Java through brew:
brew install openjdk@8
Getting the right version of Java for macOS with an M1 chip
If the brew installation does not work on your Mac, we recommend using Azul's Zulu OpenJDK v1.8.
You can find it on the downloads page by scrolling down to the bottom of the website. Notice the filters applied in the link: Java 8, macOS, ARM 64-bit, JDK.
- Download the JDK. You can retrieve either the .dmg package, the .zip, or the .tar.gz archive:
wget "https://cdn.azul.com/zulu/bin/zulu8.64.0.19-ca-jdk8.0.345-macosx_aarch64.tar.gz"
tar -xvf zulu8.64.0.19-ca-jdk8.0.345-macosx_aarch64.tar.gz
- Define the JAVA_HOME environment variable in .zprofile as the path to the extracted JDK folder (not the archive itself):
echo "export JAVA_HOME=$PWD/zulu8.64.0.19-ca-jdk8.0.345-macosx_aarch64" >> ~/.zprofile
- Add the JDK's bin folder to the PATH:
export PATH=$PATH:$JAVA_HOME/bin
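After opening a new terminal (or sourcing ~/.zprofile), it is worth confirming that the JDK is visible before moving on. Here is a minimal check from Python, assuming the Zulu JDK was extracted as shown above:
# Sanity check: JAVA_HOME points at the JDK and `java` resolves on the PATH
import os
import subprocess

print(os.environ.get("JAVA_HOME"))    # should print the Zulu JDK path
subprocess.run(["java", "-version"])  # should report a 1.8.x version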
3. Install Python and Scala
Jupyter Notebook lets you work with two different languages: Python and Scala. Though the Python console is more widely used, Apache Spark also has a Scala API that covers all of the same use cases.
To install Scala and Python interpreters, run the following commands with brew:
brew install python && brew install scala
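Spark 3.1 requires Python 3.6 or later, so a quick way to confirm which interpreter your notebooks will use:
# Print the Python version PySpark will run on (Spark 3.1 needs 3.6+)
import sys
print(sys.version)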
4. Install PySpark
There are multiple ways of installing and using Spark on a Mac; a quick smoke test to verify the installation follows this list.
- Install with pip
pip install pyspark
# Or a specific version
pip install pyspark==3.1.2
You can also choose the bundled Hadoop version:
# Available versions are 2.7 and 3.2
PYSPARK_HADOOP_VERSION=2.7 pip install pyspark
- Install with Conda. Conda is an open-source package management and environment management system that is part of the Anaconda distribution. It is language-agnostic and can replace both pip and virtualenv.
conda install pyspark
Note that **PySpark on Conda is not necessarily in sync with the PySpark release cycle**, because it is maintained separately by the community.
To install a specific version (one that is available in the Anaconda repositories):
conda install pyspark=3.1.2 -y
- Download the binaries and set up environment variables. If none of the alternatives above is suitable for your distribution, you can always check the Apache Spark download pages, decompress the files, and set up environment variables to run pyspark from the terminal.
# Download version 3.1.2
wget https://dlcdn.apache.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
tar -xvf spark-3.1.2-bin-hadoop3.2.tgz
Make sure the environment variables are correctly set up.
cd spark-3.1.2-bin-hadoop3.2
# Configure SPARK_HOME to run shell, SQL, submit with $SPARK_HOME/bin/<command>
export SPARK_HOME=`pwd`
# Configure PYTHONPATH to find the PySpark and Py4J packages under SPARK_HOME/python/lib
export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH
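Whichever installation route you chose, a quick smoke test confirms that PySpark is importable and can run a local job. This is a minimal sketch; if you went the binary route, it assumes SPARK_HOME and PYTHONPATH are exported as above:
# Minimal PySpark smoke test on a local, in-process cluster
import pyspark
from pyspark.sql import SparkSession

print(pyspark.__version__)  # e.g. 3.1.2

# local[*] runs Spark inside this process, using all available cores
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("smoke-test")
    .getOrCreate()
)

spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"]).show()
spark.stop()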
5. Install JupyterLab
- Install with pip
pip install jupyterlab
- With conda
conda install jupyterlab
- With brew
brew install jupyterlab
6. Start Jupyter Notebook
Once the installation is complete, you can open a notebook on localhost by executing:
jupyter-notebook
The output will look like this:
(base) osopardo1> ~ % jupyter-notebook
[I 2022-10-18 14:21:25.819 LabApp] JupyterLab extension loaded from /Users/trabajo/opt/anaconda3/lib/python3.9/site-packages/jupyterlab
[I 2022-10-18 14:21:25.819 LabApp] JupyterLab application directory is /Users/trabajo/opt/anaconda3/share/jupyter/lab
[I 14:21:25.825 NotebookApp] Serving notebooks from local directory: /Users/trabajo
[I 14:21:25.825 NotebookApp] Jupyter Notebook 6.4.12 is running at:
[I 14:21:25.825 NotebookApp] http://localhost:8888/?token=7d02412ff87bd430c7404114f97cfde63b06e1e0c1a2b2e1
[I 14:21:25.825 NotebookApp] or http://127.0.0.1:8888/?token=7d02412ff87bd430c7404114f97cfde63b06e1e0c1a2b2e1
[I 14:21:25.825 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 14:21:25.832 NotebookApp]
To access the notebook, open this file in a browser:
file:///Users/trabajo/Library/Jupyter/runtime/nbserver-23854-open.html
Or copy and paste one of these URLs:
http://localhost:8888/?token=7d02412ff87bd430c7404114f97cfde63b06e1e0c1a2b2e1
or http://127.0.0.1:8888/?token=7d02412ff87bd430c7404114f97cfde63b06e1e0c1a2b2e1
You can prevent the browser from opening automatically with the --no-browser option. Copy and paste one of the links that appear in the shell to access the Jupyter Notebook environment.
Note that all the logs will be printed in the terminal unless you put the process in the background.
7. Play with PySpark
Create a notebook and start using Spark in your data science projects!
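As a starting point, here is a small, self-contained example you could paste into the first cell of a notebook. The sample data and column names are made up for illustration:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("jupyter-pyspark")
    .getOrCreate()
)

# Hypothetical sample data: (city, temperature) pairs
data = [("Barcelona", 21.0), ("Barcelona", 24.5), ("Oslo", 9.0), ("Oslo", 7.5)]
df = spark.createDataFrame(data, ["city", "temperature"])

# Average temperature per city, computed with the DataFrame API
df.groupBy("city").agg(F.avg("temperature").alias("avg_temp")).show()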