Read from public S3 bucket with Spark
S3 Hadoop Compatibility
Reading from a public Amazon S3 bucket with Spark can trigger many errors related to Hadoop versions.
Here are some tips for configuring your Spark application.
To read from a public S3 bucket, you need to start a spark-shell of version 3.1.1 or later, built with Hadoop 3.2 dependencies.
If you need to update your binaries to a compatible version to use this feature, follow these steps:
- Download the Spark tarball from the Apache archive
$ > wget https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
- Decompress the files
$ > tar xzvf spark-3.1.1-bin-hadoop3.2.tgz
- Update the SPARK_HOME environment variable
$ > export SPARK_HOME=$PWD/spark-3.1.1-bin-hadoop3.2
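To confirm the steps above took effect, you can ask the freshly downloaded binaries to report their version (a quick sanity check, assuming the export above succeeded):

```shell
# Should print Spark 3.1.1 built for Hadoop 3.2
$SPARK_HOME/bin/spark-shell --version
```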
Once your Spark installation is ready, start the shell with the following configuration:
$ > $SPARK_HOME/bin/spark-shell \
  --conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider \
  --packages com.amazonaws:aws-java-sdk:1.12.20,\
org.apache.hadoop:hadoop-common:3.2.0,\
org.apache.hadoop:hadoop-client:3.2.0,\
org.apache.hadoop:hadoop-aws:3.2.0
org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider supplies anonymous credentials, which is what allows access to a public bucket without an AWS account.
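If you build the session from your own application instead of passing --conf to spark-shell, the same setting can be applied programmatically. A minimal sketch (the app name is arbitrary, and the hadoop-aws / aws-java-sdk jars from the --packages list above still need to be on the classpath):

```scala
import org.apache.spark.sql.SparkSession

// Apply the anonymous credentials provider when building the session
// yourself, equivalent to the --conf flag used with spark-shell.
val spark = SparkSession.builder()
  .appName("public-s3-read") // hypothetical app name
  .config("spark.hadoop.fs.s3a.aws.credentials.provider",
    "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
  .getOrCreate()
```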
And to read the file:
val df = spark.read
  .format("parquet")
  .load("s3a://qbeast-public-datasets/store_sales")
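Once the DataFrame is loaded, a couple of quick checks confirm the anonymous read actually worked (these run lazily until an action is triggered, and require network access to the public bucket):

```scala
// Inspect the inferred Parquet schema without reading the data
df.printSchema()

// Trigger an actual read: preview the first few rows
df.show(5)
```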
There is no known working combination of Hadoop 2.7 with AWS S3 for this setup. You can still try it, but if you do, remember to include the following option:
Qbeast is here to simplify the lives of data engineers and make data scientists more agile with fast queries and interactive visualizations. For more information, visit qbeast.io
© 2020 Qbeast. All rights reserved.