How to configure your Spark application in an Amazon EMR Notebook

In the last pill we learned how to set up Spark and Jupyter Notebook on macOS. Now it's time to level up and configure your Spark application on Amazon EMR.

Apache Spark

Apache Spark is a distributed computing framework for executing data workloads across multiple nodes in a cluster. It was first released in 2014, and at the time of writing the latest stable version is 3.3.1.

Spark can be used from multiple languages such as Scala, Java, SQL and Python. It is mainly used for ETL (Extract, Transform and Load) pipelines in which large amounts of data need to be processed at a time; however, it also covers streaming applications with the Spark Structured Streaming module.
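For instance, a minimal Structured Streaming sketch using the built-in rate source (so no external system is required) could look like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-example").getOrCreate()

# The rate source generates rows with a timestamp and an increasing value column
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Print each micro-batch to the console for a few seconds, then stop
query = stream.writeStream.format("console").outputMode("append").start()
query.awaitTermination(10)
query.stop()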

The key difference from traditional MapReduce is lazy evaluation: no transformation (map) is computed until an action (reduce) is called. Its architectural foundation is the RDD, or Resilient Distributed Dataset: a read-only multiset of data items distributed over a cluster of machines and maintained in a fault-tolerant way.

The DataFrame API arrived as an abstraction over the RDD (introduced in the Spark 1.x series and promoted to the primary API in Spark 2.x), together with the Spark SQL module, which makes it possible to query data using SQL.
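As a brief illustration (the file path is the sample dataset shipped with the Spark distribution, and the column names come from it), a DataFrame can be filtered lazily, registered as a temporary view and queried with SQL:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-sql-example").getOrCreate()

df = spark.read.json("examples/src/main/resources/people.json")
adults = df.filter(df.age >= 18)          # transformation: nothing runs yet
adults.createOrReplaceTempView("adults")  # expose the DataFrame to the SQL module
spark.sql("SELECT name, age FROM adults ORDER BY age").show()  # action: triggers execution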

Spark Session

The SparkSession is the main entry point to the Spark environment. As the name suggests, it is a representation of the active Session of the cluster.

To create a basic SparkSession, just use SparkSession.builder:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

Once you have a SparkSession instance, you can start reading datasets and transforming them into visualizations or pandas DataFrames.

df = spark.read.json("examples/src/main/resources/people.json")
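From here, a small sketch of inspecting the data and handing it over to pandas (assuming pandas is installed in your environment) could be:

df.printSchema()           # inspect the schema inferred from the JSON file
df.show(5)                 # preview the first rows
pandas_df = df.toPandas()  # collect to the driver as a pandas DataFrame (small datasets only)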

Spark Configuration

The Spark API lets you configure the environment for each session: How many resources does the application need (memory, CPUs…)? Do you need to load extra packages from the Maven repository? Which credentials should be used to access your S3 bucket?

A subset of basic Spark configurations is listed below:

Parameter       Description                                            Type
jars            Jars to be used in the session                         list of string
pyFiles         Python files to be used in the session                 list of string
files           Files to be used in the session                        list of string
driverMemory    Amount of memory to be used for the driver process     string
driverCores     Number of cores to be used for the driver process      int
executorMemory  Amount of memory to be used per executor process       string
executorCores   Number of cores to be used for each executor process   int
numExecutors    Number of executors to be launched for the session     int
archives        Archives to be used in the session                     list of string
queue           Name of the YARN queue                                 string
name            Name of the session (must be in lower case)            string

These configurations are added as <key, value> pairs on the SparkConf class. For the map to be parsed correctly, all Spark keys must start with the spark. prefix.
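For instance, a minimal sketch of building a session with explicit resource settings could look like this (the values are illustrative and should be tuned to your cluster):

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Illustrative values: adjust memory and cores to your workload
conf = SparkConf() \
    .set("spark.driver.memory", "2g") \
    .set("spark.executor.memory", "4g") \
    .set("spark.executor.cores", "2")

spark = SparkSession.builder \
    .appName("configured-example") \
    .config(conf=conf) \
    .getOrCreate()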

For a more detailed list, please check the Spark Documentation.

Amazon EMR

Amazon EMR is a cloud solution for big-data processing, interactive analytics and machine learning. It provides the infrastructure and the management to host an Apache Spark cluster with Apache Hive and Presto.

Its elasticity enables you to quickly deploy and provision the capacity of your cluster, and it's designed to reduce the cost of processing large amounts of data through Amazon EC2 Spot integration.

Jupyter Notebook with Spark settings on EMR

An Amazon EMR notebook is a serverless Jupyter notebook. It uses the Sparkmagic kernel as a client to execute the code through an Apache Livy server.

Sparkmagic

The Sparkmagic project includes a set of commands for interactively running Spark code in multiple languages, as well as some kernels that you can use to turn Jupyter into an integrated Spark environment.

Use %%configure to add the required configuration before you run your first Spark-bound code cell, and avoid trouble with the cluster-wide Spark configuration:

%%configure -f
{"executorMemory":"4G"}

If you want to add more specific configurations that would normally be passed with the --conf flag, use a nested JSON object:

%%configure -f
{ "conf": { "spark.dynamicAllocation.enabled": "false",
            "spark.jars.packages": "io.qbeast:qbeast-spark_2.12:0.2.0,io.delta:delta-core_2.12:1.0.0",
            "spark.sql.extensions": "io.qbeast.spark.internal.QbeastSparkSessionExtension" } }
Check if the configuration is correct by executing:
%%info

On the server side, check the /var/log/livy/livy-livy-server.out log on the EMR cluster.

20/06/24 10:11:22 INFO InteractiveSession$: Creating Interactive session 2: [owner: null, request: [kind: pyspark, proxyUser: None, executorMemory: 4G, conf: spark.dynamicAllocation.enabled -> false, spark.jars.packages -> io.qbeast:qbeast-spark_2.12:0.2.0,io.delta:delta-core_2.12:1.0.0, spark.sql.extensions -> io.qbeast.spark.internal.QbeastSparkSessionExtension, heartbeatTimeoutInSecond: 0]]
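Once the session is up, the PySpark kernel exposes the active session as the spark variable, so a quick sanity check in the next cell could be as simple as:

# Runs on the EMR cluster through Livy
print(spark.version)   # confirm the Spark version of the session
spark.range(5).show()  # trivial job to verify the executors respond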

In this article, we’ve seen the main session components of Apache Spark and how to configure Jupyter-Spark Applications to run in an EMR cluster. In the next chapters, we will get more hands-on and try some Spark Examples.

If you find this post helpful, don't hesitate to share it and tag us on any social media!

Have a great day 🙂


Set up Jupyter + Spark on Mac

Migrating from the Linux operating system to macOS can be tricky, especially if you are a developer. In this post, we will address how to set up your computer to use Spark and Jupyter Notebook with the M1 chip.

1. Install Homebrew

Homebrew is "The Missing Package Manager for macOS". It installs the stuff you need that Apple (or your Linux system) didn't, placing packages in their own directory and symlinking their files into /usr/local (on Intel Macs) or /opt/homebrew (on Apple Silicon).

To install it, just download it through curl and run the installation script:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Once the process is done, make sure the brew command is added to your PATH by executing these two lines:

# This adds the brew command to the PATH every time you open a shell
echo 'eval "$(/opt/homebrew/bin/brew shellenv)"' >> ~/.zprofile
eval "$(/opt/homebrew/bin/brew shellenv)"

2. Install JAVA

Apache Spark uses the JVM to execute its tasks, so we need a compatible Java version to run the notebooks with a distributed engine.

You can install Java through brew:

brew install openjdk@8

Getting the right version of Java for macOS with an M1 chip

If the brew installation does not work on your Mac, we recommend using Azul's Zulu OpenJDK 8 (Java 1.8).

You can find it on the downloads page by scrolling to the bottom of the website. Notice the filters applied in the link: Java 8, macOS, ARM 64-bit, JDK.

  • Download JDK. You can retrieve either .dmg package, .zip or .tar.gz

wget "https://cdn.azul.com/zulu/bin/zulu8.64.0.19-ca-jdk8.0.345-macosx_aarch64.tar.gz"
tar -xvf zulu8.64.0.19-ca-jdk8.0.345-macosx_aarch64.tar.gz
  • Define the JAVA_HOME environment variable in .zprofile as the path to the extracted JDK folder (not the .tar.gz file)
echo "export JAVA_HOME=$PWD/zulu8.64.0.19-ca-jdk8.0.345-macosx_aarch64" >> ~/.zprofile
  • Add the JDK binaries to PATH
export PATH=$PATH:$JAVA_HOME/bin

3. Install Python and Scala

Jupyter Notebook gives you the possibility of working in two different languages: Python and Scala. Although the Python console is more widely used, Apache Spark has a Scala API that covers all of the use cases.

To install Scala and Python interpreters, run the following commands with brew:

brew install python && brew install scala

4. Install Pyspark

There are multiple ways of installing and using Spark on a MAC.

  • Install with pip
pip install pyspark

# Or a specific version
pip install pyspark==3.1.2

Using the Hadoop Version

# Available versions are 2.7 and 3.2
PYSPARK_HADOOP_VERSION=2.7 pip install pyspark
  • Install with Conda. Conda is an open-source package and environment management system that is part of the Anaconda distribution. It is language agnostic and can replace both pip and virtualenv.
conda install pyspark

Note that PySpark on Conda is not necessarily in sync with the PySpark release cycle, because it is maintained separately by the community.

To download a specific version (that is available on the Anaconda repositories):

conda install pyspark=3.1.2 -y
  • Download the binaries and set up environment variables.

    If none of the alternatives above is suitable for your distribution, you can always check the Apache Spark download pages, decompress the files, and set up environment variables to run pyspark from the terminal.

# Download version 3.1.2
wget https://dlcdn.apache.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
tar -xvf spark-3.1.2-bin-hadoop3.2.tgz

Make sure the environment variables are correctly set up.

cd spark-3.1.2-bin-hadoop3.2
# Configure SPARK_HOME to run shell, SQL, submit with $SPARK_HOME/bin/<command>
export SPARK_HOME=`pwd`
# Configure PYTHONPATH to find the PySpark and Py4J packages under SPARK_HOME/python/lib
export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH

5. Install JupyterLab

  • Install with pip
pip install jupyterlab
  • With conda
conda install jupyterlab
  • With brew
brew install jupyter

6. Start Jupyter Notebook

Once the installation is complete, you can open a notebook in localhost by executing:

jupyter-notebook

The output will look like this:

(base) osopardo1> ~ % jupyter-notebook
[I 2022-10-18 14:21:25.819 LabApp] JupyterLab extension loaded from /Users/trabajo/opt/anaconda3/lib/python3.9/site-packages/jupyterlab
[I 2022-10-18 14:21:25.819 LabApp] JupyterLab application directory is /Users/trabajo/opt/anaconda3/share/jupyter/lab
[I 14:21:25.825 NotebookApp] Serving notebooks from local directory: /Users/trabajo
[I 14:21:25.825 NotebookApp] Jupyter Notebook 6.4.12 is running at:
[I 14:21:25.825 NotebookApp] http://localhost:8888/?token=7d02412ff87bd430c7404114f97cfde63b06e1e0c1a2b2e1
[I 14:21:25.825 NotebookApp]  or http://127.0.0.1:8888/?token=7d02412ff87bd430c7404114f97cfde63b06e1e0c1a2b2e1
[I 14:21:25.825 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 14:21:25.832 NotebookApp] 
    
    To access the notebook, open this file in a browser:
        file:///Users/trabajo/Library/Jupyter/runtime/nbserver-23854-open.html
    Or copy and paste one of these URLs:
        http://localhost:8888/?token=7d02412ff87bd430c7404114f97cfde63b06e1e0c1a2b2e1
     or http://127.0.0.1:8888/?token=7d02412ff87bd430c7404114f97cfde63b06e1e0c1a2b2e1

You can prevent the browser from opening automatically with the --no-browser option. Copy and paste one of the links that appear in the shell to access the Jupyter Notebook environment.

Note that all the logs will be printed to the terminal unless you put the process in the background.

7. Play with Pyspark

Create a Notebook and start using Spark in your Data Science Projects!

Jupyter Notebook with Pyspark
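For instance, a first cell to verify that everything is wired up could look like this (a local session, since this setup runs Spark on your machine):

from pyspark.sql import SparkSession

# Local mode: uses all available cores on your Mac
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("jupyter-pyspark-test") \
    .getOrCreate()

df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
df.show()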


Publish your SBT project to the Central Repository

Paola Pardo & Eric Ávila

You and your team have been working hard on the very first release of your beloved project. Now it’s time to make it public for the ease of software usage 🚀

In this post, we will be explaining how to publish an sbt project to the Central Repository through Sonatype OSSRH Nexus Manager.

First-time preparation for releasing artifacts

Sonatype credentials and GPG keys must be set up before publishing an artifact. These steps only have to be done once and require some human review.

Sonatype Setup

Sonatype OSSRH (OSS Repository Hosting) provides a repository hosting service for open-source project binaries. It uses the Maven repository format and allows you to:

  • deploy snapshots
  • stage releases
  • promote releases and sync to the Central Repository

Can I change, delete or modify the published artifacts? → Quick and short answer: No.

Be careful and use the -SNAPSHOT suffix on your version to test binaries before moving to a definitive stage.

For more information, click here 👈

1. Register to JIRA

There are some configurations that require human interaction (see here why). Sonatype uses JIRA to manage requests, so if you don’t have an account, it’s time to do so.

2. Open a JIRA ticket to request the domain

Now you have to create a new ticket requesting the namespace for your packages.

It's very simple, but here is the request we made in case you want some inspiration:

3. Set your domain accordingly

Right after you publish the ticket, you will receive an automatic notification to configure the domain properly.

After setting the TXT record in your domain and updating the status of the ticket, you have to wait for the congratulations message.

Let’s move on to the next step!

GPG Setup

In order to sign the artifacts that you want to publish, you will need to create a private/public key pair. Using your tool of choice, create it and upload the public key to a key server when asked, or upload it manually.

I’ll show how to do so on the Linux command line:

1. Generate a GPG key

$> gpg --gen-key 

2. List keys to verify the pair is present on your machine

Once the key pair is generated, we can list them along with any other keys installed:

$> gpg --list-keys 
/Users/xxx/.gnupg/pubring.gpg
----------------------------------
pub   rsa2048 2012-02-14 [SCEA] [expires: 2028-02-09]
      <public-key>
uid           [ultimate] Eugene Yokota <eed3si9n@gmail.com>
sub   rsa2048 2012-02-14 [SEA] [expires: 2028-02-09]

3. Upload the public key to a server, so you will be able to sign packages and verify them

Since other people need your public key to verify your files, you have to distribute your public key to a key server:

$> gpg --keyserver keyserver.ubuntu.com --send-keys <public-key>

This first key will be set as default for your system, so now the sbt-pgp plugin will be able to use it.

Releasing an artifact

Now that you are registered in the OSS Sonatype Repository and configured the GPG keys to sign your library, it’s time to prepare your project to compile and produce the corresponding artifacts.

1. Prepare build.sbt

Add the ‘sbt-sonatype’ and ‘sbt-pgp’ plugins to your project/plugins.sbt file.

addSbtPlugin("org.xerial.sbt" % "sbt-sonatype" % "3.9.9")
addSbtPlugin("com.github.sbt" % "sbt-pgp" % "2.1.2")

In your build.sbt you have to add the reference to the remote Sonatype repository and some settings to accomplish the Maven Central repository requirements.

// Repository for releases on Maven Central using Sonatype
publishTo := sonatypePublishToBundle.value
sonatypeCredentialHost := "s01.oss.sonatype.org"

publishMavenStyle := true
sonatypeProfileName := "io.qbeast" // Your sonatype groupID

// Reference the project OSS repository
import xerial.sbt.Sonatype._
sonatypeProjectHosting := Some(
  GitHubHosting(user = "Qbeast-io", repository = "qbeast-spark", email = "info@qbeast.io"))

// Metadata referring to licenses, website, and SCM (source code management)
licenses := Seq(
  "APL2" -> url("https://www.apache.org/licenses/LICENSE-2.0.txt"))
homepage := Some(url("https://qbeast.io/"))
scmInfo := Some(
  ScmInfo(
    url("https://github.com/Qbeast-io/qbeast-spark"),
    "scm:git@github.com:Qbeast-io/qbeast-spark.git"))
// Optional: if you want to publish snapshots 
// (which cannot be released to the Central Repository)
// You must set the sonatypeRepository in which to upload the artifacts
sonatypeRepository := {
  val nexus = "https://s01.oss.sonatype.org/"
  if (isSnapshot.value) nexus + "content/repositories/snapshots"
  else nexus + "service/local"
}
// (Optional) pomExtra field, where you can reference developers,
// among other things. This configuration must be in XML format, like
// in the example below, and it will be included in your .pom file.
pomExtra := 
	<developers>
    <developer>
      <id>osopardo1</id>
      <name>Paola Pardo</name>
      <url>https://github.com/osopardo1</url>
    </developer>
  </developers>

2. Sonatype credentials ~/.sbt/1.0/sonatype.sbt

Apart from the key, you need to set up a credentials file for the Sonatype server. Create a file at $HOME/.sbt/1.0/sonatype.sbt with the following content:

credentials += Credentials("Sonatype Nexus Repository Manager",
				"s01.oss.sonatype.org", // all domains registered since February 2021
				"(username)",
				"(password)")

3. Publish, stage and close

The easiest way is to run the following commands.

sbt clean
sbt publishSigned
sbt sonatypeBundleRelease

Please note that executing the third command is a definitive step and there’s no way back.

The full guide and explanation of what these commands do can be found here: https://github.com/xerial/sbt-sonatype#publishing-your-artifact. We recommend reading it if it's your first time, to better understand the process.

Let’s explain the commands step by step.

  1. The first command cleans the target/ directory inside the project.

project-root$> sbt clean

  2. The second command creates all the artifacts required to publish to Maven Central. These files include different JARs (jar, jar+javadoc, jar+sources), a POM file with the metadata required for publishing to Maven (qbeast-spark_x.xx-y.y.y.pom), and all the required checksum/CRC files.

project-root$> sbt publishSigned

  3. The third command prepares the Sonatype repository, uploads the JARs, and releases them to the public, syncing with the Maven Central repository.

CAUTION: Executing this command is a definitive step and there's no way back.

project-root$> sbt sonatypeBundleRelease

If you want to control each of the steps of sonatypeBundleRelease, you can run:

project-root$> sbt sonatypePrepare
project-root$> sbt sonatypeBundleUpload
project-root$> sbt sonatypeRelease

And finally…

In 10 minutes you will have your package ready for others to download and use it 🔥

Don’t worry if it does not appear on the Maven Central Repository in the following hours, since it takes a little longer to sync. But you can play with it right away!

We hope that you find this post useful 🙌 And don’t hesitate to share your projects with us 😃


Reduce Repetitive Tasks and Development Time by Writing your own Tool in Python

For the last few weeks, I've been working on a CLI tool to help developers of the qbeast-spark open-source project test their changes to the code. I'll show you how I did it using setuptools.

Motivation

Some weeks ago at Qbeast, we were running tests manually, which involved several repetitive steps that cost our developers a considerable amount of time. These steps are necessary for testing, but they are unnecessarily time-consuming. In short, they consist of "simple things" such as creating clusters in Amazon EMR with the required dependencies, running Spark applications on those clusters, checking available datasets in Amazon S3, and a few other tasks. Things that seem easy to achieve, but that become complex when you have to run and remember several commands and fix problems manually. Something that could be automated somehow.

As a solution, we had to develop a tool to automate all these steps… something like a Command Line Interface (CLI) that lets us run simple commands which do the whole process automatically. We decided to call it qauto ('q' for 'Qbeast' and 'auto' for… well, our CEO has a gift for naming things…). Of course, this will not be the name of your application, but you can get some inspiration from it.

The CLI

This tool would let us run something like qauto cluster create or qauto benchmark run: easy commands that wrap and simplify complex ones.
You'd say: complex? – Yes. If you check the number of available options when creating a cluster in Amazon EMR using their CLI (if you never have), you'll feel overwhelmed: take a look. There are more than 30 different options at the time of writing! And most of these options will remain the same across runs (except maybe the cluster name and the number of machines).

So, why not create a simple command that lets you specify only the necessary options for your day-to-day commands?

Setuptools – Package your Python projects

With Python, you can create a simple tool to wrap these commands. Let’s see how to do it:

  1. Create the following file structure in your directory: the project folder contains a directory named qauto and a setup.py file. The qauto directory will contain the different .py files that make up the code of your application. You can have as many of these files as you want to structure your code correctly.
  2. The setup.py file will be used to let the system install the application.
  3. The __init__.py will expose the different packages that you have in your application. Imagine you have a main.py and a utils.py file: then your __init__.py must import them so they are reachable as part of the package.
  4. The main.py will contain the code for your application. In our case, we will write a simple example with a few options, but you can extend it to your liking.
    1. Application entry point. This is the part where your code begins its execution. We will create a main group to wrap everything into the main application.
      import click

      @click.group()
      def main():
          pass
    2. This main group can contain other sub-groups. In our example, we're going to add an aws group to main, which will in turn contain another sub-group:
      @main.group()
      def aws():
          """AWS Cloud Provider commands"""

      @aws.group()
      def cluster():
          """Cluster-related commands."""

    3. Groups can contain commands, which are the real "executable things". These commands may have arguments and options (mandatory/optional). In the following example, we are implementing the qauto aws cluster create <cluster-name> <number-of-nodes> command:
      @cluster.command("create")
      @click.argument("cluster-name")
      @click.option("--number-of-nodes", help="Number of nodes for the cluster", default=2, show_default=True)
      def aws_create_cluster(cluster_name, number_of_nodes):
          # your program logic
          pass

Following this basic structure, the final result for the main.py file could be something like:
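A minimal sketch assembling the snippets above (the cluster-creation logic is left as a placeholder for your calls to the AWS CLI or an SDK such as boto3):

import click


@click.group()
def main():
    """qauto: wrap repetitive development commands."""


@main.group()
def aws():
    """AWS Cloud Provider commands"""


@aws.group()
def cluster():
    """Cluster-related commands."""


@cluster.command("create")
@click.argument("cluster-name")
@click.option("--number-of-nodes", help="Number of nodes for the cluster", default=2, show_default=True)
def aws_create_cluster(cluster_name, number_of_nodes):
    """Create a cluster with the given name and number of nodes."""
    # Placeholder: here you would call the AWS CLI or an SDK such as boto3
    click.echo(f"Creating cluster {cluster_name} with {number_of_nodes} nodes...")


if __name__ == "__main__":
    main()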

With this set up, we currently have a "create" command inside some groups. Following the group structure from the main group, we can see the full command is qauto aws cluster create. But… wait a second! We defined an alias for "aws" in our setup.py file, so an alternative is qw cluster create (obviously providing the required arguments!).
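A setup.py supporting both the qauto command and the qw alias could be sketched as follows (the package metadata is illustrative):

from setuptools import setup, find_packages

setup(
    name="qauto",
    version="0.1.0",  # illustrative version number
    packages=find_packages(),
    install_requires=["click"],
    entry_points={
        "console_scripts": [
            "qauto=qauto.main:main",  # main entry point: qauto <group> <command> ...
            "qw=qauto.main:aws",      # alias pointing at the aws group: qw cluster create ...
        ],
    },
)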

Easy, isn’t it?

Installing your new custom-CLI

Once you have finished building your application (or you want to test it), you can install it easily using Python's pip installer. From the root directory of your project, run pip install -e . to install your new application. From now on, you can run qauto <command> (or whatever name you specified for your application).

  • The -e option installs your program in editable mode: You won’t need to re-install it if you make some changes.
  • To uninstall it, just run pip uninstall qauto, or the name you specified for your application.

Conclusion

Setuptools is a powerful Python library that lets you package your Python projects, and it can be used whenever you need to build something easy to run. We used it to create a CLI that simplifies a command with many redundant options. In the same way, you can add other commands and structure the code to your needs.


Scala Test Dive-in: Public, Private and Protected methods

We all know that testing code can be done in different ways. This pill is not meant to explain which is the best way to check that your Scala project works as it should, but it will provide some tips and tricks for testing public, private, and protected methods.

Public Methods

Public methods are the functions inside a class that can be called from outside through an instantiated object. Public method testing is no rocket science. In Scala, using Matchers and Clues helps you understand what went wrong.

Imagine we want to test a MathUtils class that has simple methods min and max:

class MathUtils {
  def min(x: Int, y: Int): Int = if (x <= y) x else y

  def max(x: Int, y: Int): Int = if (x >= y) x else y

}

This is what your test should look like:

import org.scalatest.AppendedClues.convertToClueful
import org.scalatest.matchers.should.Matchers
import org.scalatest.flatspec.AnyFlatSpec


class MathUtilsTest extends AnyFlatSpec with Matchers {

  "MathUtils" should "compute min correctly" in {
    val min = 10
    val max = 20
    val mathUtils = new MathUtils()
    mathUtils.min(min, max) shouldBe min withClue s"Min is not $min"
  }

  it should "compute max correctly" in {
    val min = 10
    val max = 20
    val mathUtils = new MathUtils()
    mathUtils.max(min, max) shouldBe max withClue s"Max is not $max"
  }
}

Private Methods

Private methods are methods that cannot be accessed from any class other than the one in which they are declared.

Testing these functions is trickier. You have different ways of proceeding: copy and paste the implementation into a test class (which is off the table), use Mockito, or try PrivateMethodTester.

Let’s write a private method on the class MathUtils:

class MathUtils {

  def min(x: Int, y: Int): Int = if (x <= y) x else y

  def max(x: Int, y: Int): Int = if (x >= y) x else y

  private def sum(x: Int, y: Int): Int = {
    x + y
  }

  def sum(x: Int, y: Int, z: Int): Int = {
    val aux = sum(x, y)
    sum(aux, z)
  }

}

PrivateMethodTester is a trait that facilitates the testing of private methods. You have to mix it into your test class in order to take advantage of it.


import org.scalatest.AppendedClues.convertToClueful
import org.scalatest.matchers.should.Matchers
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.PrivateMethodTester

class MathUtilsPrivateTest extends AnyFlatSpec with Matchers with PrivateMethodTester {

  "MathUtils" should "compute sum correctly" in {
  
    val x = 1
    val y = 2

    val mathUtils = new MathUtils()
    val sumPrivateMethod = PrivateMethod[Int]('sum)
    val privateSum = mathUtils invokePrivate sumPrivateMethod(x, y)
    privateSum shouldBe (x + y) withClue s"Sum is not ${x + y}"
  }
}

In val sumPrivateMethod = PrivateMethod[Int]('sum) we have different parts:

  • [Int] is the return type of the method
  • 'sum is the symbol with the name of the method to call

With mathUtils invokePrivate sumPrivateMethod(x, y) you can collect the result in a val to compare and check that it works properly. You need to use an instance of the class/object to invoke the method; otherwise, it will not be found.

Protected Methods

A protected method is like a private method in that it can only be invoked from within the implementation of a class or its subclasses.

For example, let's make the sum method protected instead of private. The MathUtils class would look like this:

class MathUtils {
  def min(x: Int, y: Int): Int = if (x <= y) x else y

  def max(x: Int, y: Int): Int = if (x >= y) x else y

  protected def sum(x: Int, y: Int): Int = x + y

}

If we create a new object from MathUtils and try to call the sum method, the compiler will complain that sum is not accessible from this place.

But don’t worry, we have a solution for that as well.

We can write a subclass specific to this test and override the method, since it can be invoked through the implementation of its subclasses.


class MathUtilsTestClass extends MathUtils {
  override def sum(x: Int, y: Int): Int = super.sum(x, y)
}

class MathUtilsProtectedTest extends AnyFlatSpec with Matchers {
  "MathUtils" should "compute sum correctly" in {
    val x = 1
    val y = 2
    val mathUtilsProtected = new MathUtilsTestClass()
    mathUtilsProtected.sum(x, y) shouldBe (x + y) withClue s"Sum is not ${x + y}"
  }

}

Summary

Now you can test the different types of methods in your Scala project: public, private, and protected. For more information about Scala, functional programming, and style, feel free to ask us or check out our other pills!


Read from public S3 bucket with Spark

S3 Hadoop Compatibility

Trying to read from a public Amazon S3 bucket with Spark can cause many errors related to Hadoop versions.

Here are some tips to configure your Spark application.

Spark Configuration

To read from the public S3 bucket, you need to start a spark-shell of version 3.1.1 or later with Hadoop 3.2 dependencies.

If you have to update the binaries to a compatible version to use this feature, follow these steps:

  • Download spark tar from the repository
$ > wget https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
  • Decompress the files
$ > tar xzvf spark-3.1.1-bin-hadoop3.2.tgz
  • Update the SPARK_HOME environment variable
$ > export SPARK_HOME=$PWD/spark-3.1.1-bin-hadoop3.2

Once you have your spark ready to execute, the following configuration must be used:

$ > $SPARK_HOME/bin/spark-shell \
--conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider \
--packages com.amazonaws:aws-java-sdk:1.12.20,org.apache.hadoop:hadoop-common:3.2.0,org.apache.hadoop:hadoop-client:3.2.0,org.apache.hadoop:hadoop-aws:3.2.0

The org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider provides anonymous credentials in order to access the public S3 bucket.

And to read the file:

val df = spark
  .read
  .format("parquet")
  .load("s3a://qbeast-public-datasets/store_sales")

Summary

There’s no known working version of Hadoop 2.7 for AWS S3. However, you can try to use it. If you do so, remember to include the following option:

--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem

Code Formatting with Scalafmt

Whether you are starting a Scala project or collaborating on one, here you have a guide to the most used frameworks for improving code style.

Scalastyle and Scalafmt

Scalastyle is a handy tool for checking coding style in Scala, similar to what Checkstyle does for Java. Scalafmt formats code so it looks consistent across the people on your team, and it integrates perfectly into your toolchain.

Installation

For the installation, you need to add the following to the plugins.sbt file under the project folder.

addSbtPlugin("org.scalameta" % "sbt-scalafmt" % "2.4.2") 
addSbtPlugin("org.scalastyle" %% "scalastyle-sbt-plugin" % "1.0.0")

This will create a Scalastyle configuration under scalastyle_config.xml, and a .scalafmt.conf file where you can write rules to maintain consistency across the project.

For example:

# This style is copied from
# https://github.com/apache/spark/blob/master/dev/.scalafmt.conf
version = "2.7.5"
align = none
align.openParenDefnSite = false
align.openParenCallSite = false
align.tokens = [] 
optIn = { 
  configStyleArguments = false 
} 
danglingParentheses = false 
docstrings = JavaDoc 
maxColumn = 98 
newlines.topLevelStatements = [before,after]

Quickstart

When opening a project that contains a .scalafmt.conf file, you will be prompted to use it:

Choose the scalafmt formatter, and it will be used at compile-time for formatting files.

However, you can check it manually with:

sbt scalastyle

Another exciting feature is that you can configure your IDE to reformat on save:

Alternatively, force code formatting:

sbt scalafmt # Format main sources 

sbt test:scalafmt # Format test sources 

sbt scalafmtCheck # Check if the scala sources under the project have been formatted 

sbt scalafmtSbt # Format *.sbt and project /*.scala files 

sbt scalafmtSbtCheck # Check if the files have been formatted by scalafmtSbt

More tricks

Scaladocs

Sbt also checks the format of the Scala docs when publishing the artifacts. The following command will check and generate the Scaladocs:

sbt doc

Header Creation

Sometimes a header must be present in all files. You can do so by using this plugin: https://github.com/sbt/sbt-header

First, add it in the plugins.sbt:

addSbtPlugin("de.heikoseeberger" % "sbt-header" % "5.6.0")

Include the header you want to show in your build.sbt

headerLicense := Some(HeaderLicense.Custom("Copyright 2021 Qbeast Pills"))

And use it at compile time with:

Compile / compile := (Compile / compile).dependsOn(Compile / headerCheck).value

To automate the creation of headers in all files, execute:

sbt headerCreate

Using println

Scalastyle has strong policies on printing information, and we all debug like this now and then.

The quick solution is to wrap your code:

// scalastyle:off println
<your beautiful piece of code>
// scalastyle:on println

But make sure you delete these comments before pushing any commits 😉

About Qbeast
Qbeast is here to simplify the lives of the Data Engineers and make Data Scientists more agile with fast queries and interactive visualizations. For more information, visit qbeast.io
© 2020 Qbeast. All rights reserved.
Share:

Back to menu

Continue reading

Create awesome GIFs from a terminal: Nice-looking animations with Terminalizer

Have you ever wanted to generate cool GIFs from a terminal output? Do you want to have fancy animations to show some code snippets?

Using terminalizer, you will be able to create fantastic animations by following this simple guide!

The solution

1. First, you need to install NodeJS v12.21.0 (LTS) from https://nodejs.org/download/release/v12.21.0/. Other versions may not be compatible with terminalizer, and you might have problems when using it.

2. After that, install terminalizer globally by using the command:

npm install -g terminalizer

Now you can use the following commands to record, play and share a GIF:

# Start recording a demo in a file called my_demo.yml
terminalizer record my_demo

# Now run the commands you want to appear in the GIF.
# When you have finished, press Ctrl+D (⌘+D) to stop recording.

# You can play the demo you just recorded by using the play option.
terminalizer play my_demo.yml

# At this point you can customize several things,
# check the "🌟 Pro tips" section below.

# If you're happy with the result, render the GIF from your YML
# file. This will create a file in your current directory.
terminalizer render my_demo.yml

🌟 Pro tips: You can edit and customize your GIF before rendering it by modifying the content of the .yml file.

  • For example, you can change the colours or the style by changing the theme and the frameBox objects in the YML:
theme:
    background: "#28225C"
(...)
frameBox:
    type: floating
    title: Terminalizer Rocks!
  • You can also edit the content and the timing of the output by modifying the records object:
# Records, feel free to edit them
records:
  - delay: 50
    content: "\e[35mErics-MacBook-Air\e[0m:~ eric$ "

So with a few modifications, we can get something like this:

You can find more customization options and tips at the original repo on GitHub: https://github.com/faressoft/terminalizer

Summary

Terminalizer is a beautiful and easy-to-use tool to create GIFs from your console/terminal output. It allows you to create GIFs by recording a session and customizing everything as desired in a simple YML format, perfect for newcomers to use it.

About Qbeast
Qbeast is here to simplify the lives of the Data Engineers and make Data Scientists more agile with fast queries and interactive visualizations. For more information, visit qbeast.io
© 2020 Qbeast. All rights reserved.