Executing Presto on Spark

Presto on Spark makes it possible to use Spark as an execution framework for Presto queries. This is useful for queries that need to run on thousands of nodes, require tens or hundreds of terabytes of memory, and consume many CPU years.

Spark adds several useful features, such as resource isolation, fine-grained resource management, and a scalable materialized exchange mechanism.

Steps

Download the Presto Spark package tarball, presto-spark-package-0.290.tar.gz, and the Presto Spark launcher, presto-spark-launcher-0.290.jar. Place both files in a directory, say example. This walkthrough assumes a two-node Spark cluster with four cores per node, for a total of eight cores.

The following is an example config.properties:

task.concurrency=4
task.max-worker-threads=4
task.writer-count=4

Details on these properties are available at Presto Configuration Properties. Note that task.concurrency, task.writer-count, and task.max-worker-threads are each set to 4, because there are four cores per executor and the values need to match the relevant spark-submit arguments below. Adjust these values to keep all executor cores busy and to stay in sync with the spark-submit parameters.
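Because all three properties track the number of cores per executor, one way to keep them in sync is to generate the file from a single value. The following is a minimal sketch, not part of the Presto distribution: the function name presto_spark_config is hypothetical, and it assumes these three properties are the only ones you need in config.properties.

```shell
# Hypothetical helper (not shipped with Presto): print a config.properties
# body in which all three task settings equal the cores per executor.
presto_spark_config() {
  cores="$1"
  printf 'task.concurrency=%s\n' "$cores"
  printf 'task.max-worker-threads=%s\n' "$cores"
  printf 'task.writer-count=%s\n' "$cores"
}

# Print the settings for the four-core executors assumed in this walkthrough.
presto_spark_config 4
```

To write the file instead of printing it, redirect the output, for example presto_spark_config 4 > config.properties.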

To execute Presto on Spark, first start your Spark cluster, which we will assume has the URL spark://spark-master:7077. Save your time-consuming query in a file called, say, query.sql. Run the spark-submit command from the example directory created earlier:

/spark/bin/spark-submit \
--master spark://spark-master:7077 \
--executor-cores 4 \
--conf spark.task.cpus=4 \
--class com.facebook.presto.spark.launcher.PrestoSparkLauncher \
  presto-spark-launcher-0.290.jar \
--package presto-spark-package-0.290.tar.gz \
--config /presto/etc/config.properties \
--catalogs /presto/etc/catalogs \
--catalog hive \
--schema default \
--file query.sql

Details on configuring catalogs are at Catalog Properties. In the spark-submit arguments, note the values of executor-cores (the number of cores per executor in Spark) and spark.task.cpus (the number of cores allocated to each task in Spark). Both are equal to the number of cores per executor (4 in this case) and match the config.properties settings discussed above. This ensures that a single Presto on Spark task runs in a single Spark executor. This limitation may be temporary; it exists to avoid duplicating broadcast hash tables for every task.
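Since executor-cores and spark.task.cpus must carry the same value, it can help to derive both from one argument. The sketch below only prints the command so that it can be inspected before running; the function name presto_spark_submit_cmd is hypothetical, and the paths, master URL, catalog, and schema are simply the ones assumed in this walkthrough.

```shell
# Hypothetical helper: print the spark-submit invocation for a given number
# of cores per executor, so both Spark settings always carry the same value.
presto_spark_submit_cmd() {
  cores="$1"
  # printf repeats the '%s ' format once per argument, yielding one line.
  printf '%s ' /spark/bin/spark-submit \
    --master spark://spark-master:7077 \
    --executor-cores "$cores" \
    --conf "spark.task.cpus=$cores" \
    --class com.facebook.presto.spark.launcher.PrestoSparkLauncher \
    presto-spark-launcher-0.290.jar \
    --package presto-spark-package-0.290.tar.gz \
    --config /presto/etc/config.properties \
    --catalogs /presto/etc/catalogs \
    --catalog hive \
    --schema default \
    --file query.sql
  printf '\n'
}

# Show the command for the four-core executors assumed in this walkthrough.
presto_spark_submit_cmd 4
```

Printing first is a deliberate choice: it lets you confirm that the two core settings agree with config.properties before submitting a long-running job.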