Set up Jupyter + Spark on Mac

Migrating from Linux to a curated macOS environment can be tricky, especially if you are a developer. In this post, we will cover how to set up your computer to use Spark and Jupyter Notebook on a Mac with the M1 chip.

1. Install Homebrew

Homebrew is "The Missing Package Manager for macOS". Homebrew installs the stuff you need that Apple (or your Linux system) didn't. It also installs packages to their own directory and then symlinks their files into /usr/local on Intel Macs (or /opt/homebrew on Apple Silicon).

To install it, download the installation script through curl and run it:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Once the process is done, make sure the brew command is added to your PATH by executing these two lines:

# This adds the brew command to the PATH every time you open a shell
echo 'eval "$(/opt/homebrew/bin/brew shellenv)"' >> ~/.zprofile
eval "$(/opt/homebrew/bin/brew shellenv)"
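If you want to double-check the result from a script rather than the shell, you can ask Python whether brew now resolves on the PATH. A minimal sketch (the on_path helper is our own name, not part of any tool):

```python
# shutil.which performs the same PATH lookup a shell does when resolving a command
import shutil

def on_path(cmd):
    """Return True if `cmd` can be resolved from the current PATH."""
    return shutil.which(cmd) is not None

print("brew on PATH:", on_path("brew"))
```

If this prints False, re-open your terminal so ~/.zprofile is sourced again.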

2. Install Java

Apache Spark uses the JVM to execute its tasks, so we need a compatible Java version to run the notebooks with a distributed engine.

You can install Java through brew:

brew install openjdk@8

Getting the right version of Java for macOS with an M1 chip

If the brew installation does not work on your Mac, we recommend using Azul's Zulu OpenJDK v1.8.

You can find it on the downloads page by scrolling to the bottom of the website. Notice the filters applied in the link: Java 8, macOS, ARM 64-bit, JDK.

  • Download the JDK. You can retrieve either the .dmg package, the .zip, or the .tar.gz:

wget "https://cdn.azul.com/zulu/bin/zulu8.64.0.19-ca-jdk8.0.345-macosx_aarch64.tar.gz"
tar -xvf zulu8.64.0.19-ca-jdk8.0.345-macosx_aarch64.tar.gz
  • Define the JAVA_HOME environment variable in .zprofile as the path to the extracted JDK folder (point it at the decompressed directory, not at the .tar.gz archive)
echo "export JAVA_HOME=$PWD/zulu8.64.0.19-ca-jdk8.0.345-macosx_aarch64" >> ~/.zprofile
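To confirm that the shell now picks up a Spark-compatible JDK, run java -version and check the banner. Java 8 reports itself with the legacy "1.8" scheme, which is easy to misread; here is a small sketch of the parsing logic (the helper name is ours):

```python
import re

def java_major_version(banner):
    # Banners look like: openjdk version "1.8.0_345"  or  openjdk version "11.0.16"
    m = re.search(r'version "(\d+)\.(\d+)', banner)
    if m is None:
        return None
    major, minor = int(m.group(1)), int(m.group(2))
    # The legacy "1.x" scheme means Java x; modern banners state the major directly
    return minor if major == 1 else major

print(java_major_version('openjdk version "1.8.0_345"'))  # prints 8
print(java_major_version('openjdk version "11.0.16"'))    # prints 11
```

So a banner starting with "1.8" is exactly the Java 8 runtime that Spark expects here.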

3. Install Python and Scala

Jupyter Notebook can be used with two different languages: Python and Scala. Although the Python console is more widely used, Apache Spark also has a Scala API that covers all of the same use cases.

To install Scala and Python interpreters, run the following commands with brew:

brew install python && brew install scala

4. Install PySpark

There are multiple ways of installing and using Spark on a Mac.

  • Install with pip
pip install pyspark

# Or a specific version
pip install pyspark==3.1.2

Using a specific Hadoop version

# Available versions are 2.7 and 3.2
PYSPARK_HADOOP_VERSION=2.7 pip install pyspark
  • Install with Conda. Conda is an open-source package and environment management system that is part of the Anaconda distribution. It is language agnostic and can replace both pip and virtualenv.
conda install pyspark

Note that PySpark on Conda is not necessarily in sync with the PySpark release cycle, because it is maintained separately by the community.

To download a specific version (that is available on the Anaconda repositories):

conda install pyspark=3.1.2 -y
  • Download the binaries and set up environment variables.

    If none of the alternatives above is suitable for your distribution, you can always check the Apache Spark download page, decompress the files, and set up the environment variables to run pyspark from the terminal.

# Download version 3.1.2
wget https://dlcdn.apache.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
tar -xvf spark-3.1.2-bin-hadoop3.2.tgz

Make sure the environment variables are correctly set up.

cd spark-3.1.2-bin-hadoop3.2
# Configure SPARK_HOME to run the shell, SQL, and submit commands with $SPARK_HOME/bin/<command>
export SPARK_HOME=`pwd`
# Configure PYTHONPATH to find the PySpark and Py4J packages under SPARK_HOME/python/lib
export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH
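For the pip and conda routes, you can verify which PySpark version the Python environment actually resolved. This check uses only the standard library, so it is safe to run even before PySpark is installed:

```python
# Query package metadata for the installed pyspark distribution, if any
from importlib import metadata

try:
    installed = metadata.version("pyspark")
except metadata.PackageNotFoundError:
    installed = None

print("pyspark:", installed or "not installed")
```

This is handy when several environments coexist and you are unsure which one pip or conda wrote into.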

5. Install JupyterLab

  • Install with pip
pip install jupyterlab
  • With conda
conda install jupyterlab
  • With brew
brew install jupyter

6. Start Jupyter Notebook

Once the installation is complete, you can open a notebook on localhost by executing:

jupyter-notebook
The output will look like this:

(base) osopardo1> ~ % jupyter-notebook
[I 2022-10-18 14:21:25.819 LabApp] JupyterLab extension loaded from /Users/trabajo/opt/anaconda3/lib/python3.9/site-packages/jupyterlab
[I 2022-10-18 14:21:25.819 LabApp] JupyterLab application directory is /Users/trabajo/opt/anaconda3/share/jupyter/lab
[I 14:21:25.825 NotebookApp] Serving notebooks from local directory: /Users/trabajo
[I 14:21:25.825 NotebookApp] Jupyter Notebook 6.4.12 is running at:
[I 14:21:25.825 NotebookApp] http://localhost:8888/?token=7d02412ff87bd430c7404114f97cfde63b06e1e0c1a2b2e1
[I 14:21:25.825 NotebookApp]  or
[I 14:21:25.825 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 14:21:25.832 NotebookApp] 
    To access the notebook, open this file in a browser:
    Or copy and paste one of these URLs:

You can prevent the browser from opening automatically with the --no-browser option. Copy and paste one of the links that appear in the shell to access the Jupyter Notebook environment.

Note that all the logs will be printed in the terminal unless you put the process in the background.

7. Play with PySpark

Create a Notebook and start using Spark in your Data Science Projects!

Jupyter Notebook with Pyspark

About Qbeast
Qbeast is here to simplify the lives of Data Engineers and make Data Scientists more agile with fast queries and interactive visualizations. For more information, visit qbeast.io
© 2020 Qbeast. All rights reserved.
