Usage with Apache Spark on YARN

conda-pack can distribute conda environments for use with Apache Spark jobs deployed on Apache YARN. By bundling your environment for use with PySpark, you can use all the libraries provided by conda and ensure that they are consistently available on every node. This relies on YARN's resource localization: environments are distributed as archives, which are then automatically unarchived on every node. For this to work, the environment must be packed in either the tar.gz or zip format.

Example

Create an environment:

$ conda create -y -n example python=3.5 numpy pandas scikit-learn

Activate the environment:

$ conda activate example   # Older conda versions use `source activate` instead

Package the environment into a tar.gz archive:

$ conda pack -o environment.tar.gz
Collecting packages...
Packing environment at '/Users/jcrist/anaconda/envs/example' to 'environment.tar.gz'
[########################################] | 100% Completed | 23.2s
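
As a quick sanity check before submitting, the following sketch confirms that the interpreter sits at bin/python relative to the archive root, which is what the ./environment/bin/python path passed to spark-submit relies on. The helper name archive_has_python is hypothetical and not part of conda-pack; it uses only Python's standard tarfile module and assumes a tar.gz archive.

```python
import tarfile


def archive_has_python(archive_path, python_member="bin/python"):
    """Return True if the tar.gz archive contains `python_member` at its root.

    Hypothetical helper for illustration; not part of conda-pack.
    """
    with tarfile.open(archive_path, "r:gz") as tar:
        # Some archivers prefix member names with "./", so normalize them.
        names = {
            name[2:] if name.startswith("./") else name
            for name in tar.getnames()
        }
    return python_member in names
```

For example, archive_has_python('environment.tar.gz') should return True for the archive created above, assuming a Linux or macOS environment where the interpreter lives under bin/.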

Write a PySpark script, for example:

# script.py
from pyspark import SparkConf
from pyspark import SparkContext

conf = SparkConf()
conf.setAppName('spark-yarn')
sc = SparkContext(conf=conf)

def some_function(x):
    # Packages are imported and available from your bundled environment.
    import sklearn
    import pandas
    import numpy as np

    # Use the libraries to do work
    return np.sin(x)**2 + 2

# take() returns a plain Python list, not an RDD
result = (sc.parallelize(range(1000))
            .map(some_function)
            .take(10))

print(result)

Submit the job to Spark using spark-submit. In YARN cluster mode:

$ PYSPARK_PYTHON=./environment/bin/python \
spark-submit \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
--master yarn \
--deploy-mode cluster \
--archives environment.tar.gz#environment \
script.py

Or in YARN client mode:

$ PYSPARK_DRIVER_PYTHON=`which python` \
PYSPARK_PYTHON=./environment/bin/python \
spark-submit \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
--master yarn \
--deploy-mode client \
--archives environment.tar.gz#environment \
script.py
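
The two invocations above differ only in the deploy mode and in whether PYSPARK_DRIVER_PYTHON is set. The sketch below assembles the same environment variables and arguments from Python, which may be convenient when submitting from a launcher script. The function name build_spark_submit and its parameters are hypothetical; the flags and variable names are the ones shown above. Defaulting the driver interpreter to sys.executable is an assumption that the launcher runs inside the activated environment.

```python
import sys


def build_spark_submit(deploy_mode, archive="environment.tar.gz",
                       alias="environment", script="script.py",
                       driver_python=None):
    """Return (env, argv) mirroring the spark-submit commands above.

    Hypothetical helper for illustration only.
    """
    if deploy_mode not in ("cluster", "client"):
        raise ValueError("deploy_mode must be 'cluster' or 'client'")

    # YARN unpacks the archive into a directory named after the alias,
    # so the interpreter is addressed relative to that directory.
    python_path = f"./{alias}/bin/python"
    env = {"PYSPARK_PYTHON": python_path}

    if deploy_mode == "client":
        # In client mode the driver runs locally and needs a local Python.
        # Assumption: default to the interpreter running this script.
        env["PYSPARK_DRIVER_PYTHON"] = driver_python or sys.executable

    argv = [
        "spark-submit",
        "--conf", f"spark.yarn.appMasterEnv.PYSPARK_PYTHON={python_path}",
        "--master", "yarn",
        "--deploy-mode", deploy_mode,
        "--archives", f"{archive}#{alias}",
        script,
    ]
    return env, argv
```

The result could then be passed to subprocess.run(argv, env={**os.environ, **env}) on a machine where spark-submit is on the PATH.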