EMR Spark reference

Sudheer Keshav Bhat · May 16, 2023

Data Engineering   Big Data   Spark   EMR

Quick gotchas to watch out for when submitting Spark jobs on an EMR cluster

  1. EMR comes with PySpark pre-installed. Don’t install pyspark again, whether through bootstrap actions or otherwise. Ref: https://stackoverflow.com/questions/63210509/com-amazon-ws-emr-hadoop-fs-emrfilesystem-not-found-on-pyspark-script-on-aws-emr
  2. When zipping your source, make sure the zip root contains the source files directly, without an extra top-level folder. Use the following script:
cd <source_dir>
zip -r sources.zip .
  3. To pass .env files to Spark, prefix each entry with spark.yarn.appMasterEnv. as shown below, and pass the file to Spark via the --properties-file param
    spark.yarn.appMasterEnv.SEARCH_URL=https://search.domain.com
    
  4. To pass JSON config files to Spark, use the --files param
  5. Here is a sample spark-submit combining all of the above:
    spark-submit --master yarn \
              --deploy-mode cluster \
              --packages org.apache.hadoop:hadoop-aws:3.3.1,com.amazonaws:aws-java-sdk-emr:1.12.468 \
              --properties-file .env \
              --files config1.json,config2.json \
              --py-files s3://bucket/path/sources.zip \
              spark_main.py arg1 arg2
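To double-check that sources.zip was built with the layout --py-files expects (step 2), you can list the names at the root of the archive. A minimal sketch using Python's zipfile module; the helper name is my own:

```python
import zipfile

def top_level_entries(zip_path):
    """Names at the root of the archive. For --py-files you want your
    modules/packages here directly, not one wrapping source folder."""
    with zipfile.ZipFile(zip_path) as zf:
        return sorted({name.split("/")[0] for name in zf.namelist()})
```

If this returns a single directory name instead of your modules, re-zip from inside the source directory as shown above.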
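Inside the job, a variable forwarded through the spark.yarn.appMasterEnv. prefix (step 3) shows up as an ordinary environment variable. A sketch, reusing the SEARCH_URL entry from the .env example; the helper and its fallback default are assumptions, not EMR behaviour:

```python
import os

def get_search_url(default="https://search.domain.com"):
    """SEARCH_URL is injected by YARN when the properties file carries
    a spark.yarn.appMasterEnv.SEARCH_URL=... entry (see step 3)."""
    return os.environ.get("SEARCH_URL", default)
```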
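In cluster mode, files shipped with --files (step 4) are localized into the YARN container's working directory, so they can be opened by bare filename. A sketch, assuming the config1.json from the spark-submit above; the helper name is my own:

```python
import json

def load_config(name="config1.json"):
    """In YARN cluster mode, --files places config1.json in the
    container's working directory, so a relative path is enough."""
    with open(name) as f:
        return json.load(f)
```

On executors, pyspark.SparkFiles.get(name) resolves the localized path if a bare relative open does not.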