EMR Spark reference

Sudheer Keshav Bhat · May 16, 2023

Data Engineering   Big Data   Spark   EMR

Quick gotchas to watch out for when submitting Spark jobs on an EMR cluster

  1. EMR comes with PySpark pre-installed. Don’t install pyspark again, whether through bootstrap actions or otherwise. Ref: https://stackoverflow.com/questions/63210509/com-amazon-ws-emr-hadoop-fs-emrfilesystem-not-found-on-pyspark-script-on-aws-emr
  2. When zipping your source, make sure the zip root contains the source files directly, without an extra top-level folder. Use the following script:
cd <source_dir>
zip -r sources.zip .
  3. To pass .env files to Spark, prefix each entry with spark.yarn.appMasterEnv. as shown below, and pass the file to Spark via the --properties-file param
    spark.yarn.appMasterEnv.SEARCH_URL=https://search.domain.com
    
  4. To pass JSON config files to Spark, use the --files param
  5. Here is a sample spark-submit combining all of the above:
    spark-submit --master yarn \
              --deploy-mode cluster \
              --packages org.apache.hadoop:hadoop-aws:3.3.1,com.amazonaws:aws-java-sdk-emr:1.12.468 \
              --properties-file .env \
              --files config1.json,config2.json \
              --py-files s3://bucket/path/sources.zip \
              spark_main.py arg1 arg2
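To double-check that sources.zip was built with the layout --py-files expects (step 2), you can list the names at the root of the archive. A minimal sketch using Python's zipfile module; the helper name is my own:

```python
import zipfile

def top_level_entries(zip_path):
    """Names at the root of the archive. For --py-files you want your
    modules/packages here directly, not one wrapping source folder."""
    with zipfile.ZipFile(zip_path) as zf:
        return sorted({name.split("/")[0] for name in zf.namelist()})
```

If this returns a single directory name instead of your modules, re-zip from inside the source directory as shown above.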
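Inside the job, a variable forwarded through the spark.yarn.appMasterEnv. prefix (step 3) shows up as an ordinary environment variable. A sketch, reusing the SEARCH_URL entry from the .env example; the helper and its fallback default are assumptions, not EMR behaviour:

```python
import os

def get_search_url(default="https://search.domain.com"):
    """SEARCH_URL is injected by YARN when the properties file carries
    a spark.yarn.appMasterEnv.SEARCH_URL=... entry (see step 3)."""
    return os.environ.get("SEARCH_URL", default)
```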
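In cluster mode, files shipped with --files (step 4) are localized into the YARN container's working directory, so they can be opened by bare filename. A sketch, assuming the config1.json from the spark-submit above; the helper name is my own:

```python
import json

def load_config(name="config1.json"):
    """In YARN cluster mode, --files places config1.json in the
    container's working directory, so a relative path is enough."""
    with open(name) as f:
        return json.load(f)
```

On executors, pyspark.SparkFiles.get(name) resolves the localized path if a bare relative open does not.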