Read from a public S3 bucket with Spark
S3 Hadoop Compatibility
Trying to read from public Amazon S3 object storage with Spark can produce many errors related to Hadoop version mismatches. Here are some tips for configuring your Spark application.
Spark Configuration
To read a public S3 bucket, you need to start a spark-shell of version 3.1.1 or later, built against Hadoop 3.2 dependencies. If you need to update your binaries to a compatible version, follow these steps:
- Download the Spark tarball from the Apache archive
$ > wget https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
- Decompress the files
$ > tar xzvf spark-3.1.1-bin-hadoop3.2.tgz
- Update the SPARK_HOME environment variable
$ > export SPARK_HOME=$PWD/spark-3.1.1-bin-hadoop3.2
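As an optional sanity check, you can confirm which versions the new shell is running. Both calls below are standard APIs (spark.version on the SparkSession, VersionInfo from hadoop-common), and the expected values assume the 3.1.1 / Hadoop 3.2 download above:
$ > $SPARK_HOME/bin/spark-shell
// Inside the shell: verify the Spark and Hadoop versions
spark.version                                  // expect "3.1.1"
org.apache.hadoop.util.VersionInfo.getVersion  // expect "3.2.0"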
Once your Spark installation is ready, start the shell with the following configuration:
$ > $SPARK_HOME/bin/spark-shell \
--conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider \
--packages com.amazonaws:aws-java-sdk:1.12.20,\
org.apache.hadoop:hadoop-common:3.2.0,\
org.apache.hadoop:hadoop-client:3.2.0,\
org.apache.hadoop:hadoop-aws:3.2.0
The org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider supplies anonymous credentials, which is all you need to access a public S3 bucket.
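If you are building a standalone application rather than using spark-shell, the same settings can be passed through the SparkSession builder. This is a minimal sketch (the app name is a placeholder), and it assumes the hadoop-aws and AWS SDK jars listed above are already on the classpath, e.g. as build dependencies:
import org.apache.spark.sql.SparkSession

// Same anonymous-credentials setup as the spark-shell flags above
val spark = SparkSession.builder()
  .appName("read-public-s3") // placeholder name
  .config("spark.hadoop.fs.s3a.aws.credentials.provider",
    "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
  .getOrCreate()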
In either case, to read the dataset:
val df = spark
.read
.format("parquet")
.load("s3a://qbeast-public-datasets/store_sales")
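Once loaded, you can inspect the DataFrame as usual; printSchema only reads the Parquet metadata, while show triggers an actual read from S3:
df.printSchema() // column names and types from the Parquet footer
df.show(5)       // fetches the first rows over the network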
Summary
There is no known working combination of Hadoop 2.7 and AWS S3. You can still try it; if you do, remember to include the following option, which maps the s3a:// scheme to the S3A filesystem implementation:
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
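For reference, such an attempt would look roughly like this. This is an untested sketch: the 2.7.3 / 1.7.4 pairing matches the AWS SDK version that hadoop-aws 2.7.x was built against, but, as noted above, it is not known to work:
$ > $SPARK_HOME/bin/spark-shell \
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--packages org.apache.hadoop:hadoop-aws:2.7.3,com.amazonaws:aws-java-sdk:1.7.4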