Reading from a public Amazon S3 bucket with Spark often fails with errors caused by incompatible Hadoop versions.
Here are some tips to configure your Spark application correctly.
To read a public S3 bucket, you need to start a spark-shell of version 3.1.1 or later, built with Hadoop 3.2 dependencies.
If you need to update your binaries to a compatible version, follow these steps:
$ > wget https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
$ > tar xzvf spark-3.1.1-bin-hadoop3.2.tgz
$ > export SPARK_HOME=$PWD/spark-3.1.1-bin-hadoop3.2
Once Spark is ready to run, start the shell with the following configuration:
$ > $SPARK_HOME/bin/spark-shell \
      --conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider \
      --packages com.amazonaws:aws-java-sdk:1.12.20,\
org.apache.hadoop:hadoop-common:3.2.0,\
org.apache.hadoop:hadoop-client:3.2.0,\
org.apache.hadoop:hadoop-aws:3.2.0
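Once the shell is up, you can check that the option reached the Hadoop configuration. This is just a quick sanity check, run from inside the spark-shell:

// Run inside the spark-shell: print the configured S3A credentials provider.
// It should return "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider".
spark.sparkContext.hadoopConfiguration.get("fs.s3a.aws.credentials.provider")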
org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider supplies anonymous credentials, so no AWS account or keys are needed to access the public S3 bucket.
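If you would rather set this from code than on the command line, a minimal sketch could look like the following. It assumes the same AWS and Hadoop packages are already on the classpath; the application name is arbitrary:

import org.apache.spark.sql.SparkSession

// Build a session with anonymous S3A credentials; equivalent to passing
// --conf spark.hadoop.fs.s3a.aws.credentials.provider=... to spark-shell.
val spark = SparkSession.builder()
  .appName("public-s3-read") // hypothetical name, choose your own
  .config(
    "spark.hadoop.fs.s3a.aws.credentials.provider",
    "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
  .getOrCreate()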
And to read the file:
val df = spark
  .read
  .format("parquet")
  .load("s3a://qbeast-public-datasets/store_sales")
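To verify that the load worked, you can inspect the schema and preview a few rows (nothing here assumes specific column names in the store_sales dataset):

// Print the Parquet schema inferred from the files and show some rows.
df.printSchema()
df.show(5)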
There’s no known working combination of Hadoop 2.7 with AWS S3. You can still try it; if you do, remember to include the following option: