Performance Evaluation of Trino, Spark, and Hive on MR3

May 31, 2023


Introduction

In this article, we evaluate the performance of the following systems:

  1. Trino 418 (released on May 17, 2023)
  2. Spark 3.4.0 (released on Apr 13, 2023)
  3. Hive 3.1.3 on MR3 1.7 (released on May 15, 2023)

Trino and Hive on MR3 use Java 17, while Spark uses Java 8. For Hive on MR3, we also report the result of using Java 8. We do not use the fault-tolerant execution of Trino, whereas both Spark and Hive on MR3 are fault tolerant by design.

For a performance evaluation of Trino 435 and Hive on MR3 1.9, see the newer article Performance Evaluation of Trino and Hive on MR3 using the TPC-DS Benchmark.

Experiment Setup

Cluster

For experiments, we use a cluster consisting of 1 master and 12 worker nodes with the following properties:

In total, the worker nodes provide 12 * 256GB = 3072GB of memory. The cluster runs HDP (Hortonworks Data Platform) 3.1.4 and uses an HDFS replication factor of 3.

TPC-DS benchmark

We use a variant of the TPC-DS benchmark introduced in the previous article, which replaces an existing LIMIT clause with a new SELECT clause so that different results from the same query translate to different numbers of rows. The reader can find the set of modified TPC-DS queries in the GitHub repository.

The scale factor for the TPC-DS benchmark is 10TB.

For best performance, we use ORC for Trino and Hive on MR3, and Parquet for Spark.

Configuration

For all the experiments, the amount of memory allocated to the JVM in each worker node (for Trino workers, Spark executors, and MR3 workers) is the same.

For Trino, we use the JVM option -Xmx196G and choose the following configuration after performance tuning:

memory.heap-headroom-per-node=58GB
query.max-total-memory=1680GB
query.max-memory-per-node=120GB
query.max-memory=1440GB

For Spark, we choose configuration parameters after performance tuning using a TPC-DS dataset of 1TB. For the amount of memory allocated to Spark executors, we use the following configuration:

spark.executor.memory=196g
spark.executor.memoryOverhead=20g

We use the Spark shuffle service and enable advanced features such as dynamic partition pruning and adaptive query execution.
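For reference, these features correspond to the following Spark configuration keys (a sketch only; the exact set of tuned parameters used in the experiment is not listed here):

```
spark.shuffle.service.enabled=true
spark.sql.adaptive.enabled=true
spark.sql.optimizer.dynamicPartitionPruning.enabled=true
```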

For Hive on MR3, we use the default configuration in the MR3 distribution except for the amount of memory allocated to workers.

Test

We sequentially submit 99 queries from the TPC-DS benchmark. We report the total running time, the geometric mean of running times, and the running time of each individual query.
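The two aggregate metrics can be sketched as follows (the running times below are illustrative values, not the actual benchmark measurements). The geometric mean weights every query equally, so it is not dominated by a few long-running queries, which makes it a better proxy for response time:

```python
import math

# Hypothetical per-query running times in seconds (illustrative only).
running_times = [12.5, 3.2, 45.0, 7.8]

# Total running time: the plain sum over all queries; dominated by
# long-running queries.
total = sum(running_times)

# Geometric mean of running times: exp of the mean of the logarithms;
# weights every query equally.
geometric_mean = math.exp(
    sum(math.log(t) for t in running_times) / len(running_times)
)
```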

To check correctness, we report the number of rows returned by each query. If the result contains a single row, we also report the sum of all numerical values in it.

Raw data of the experiment results

For the reader's perusal, we attach the table containing the raw data of the experiment results. Here is a link to [Google Docs].

Analysis

#1. Query completion

Spark and Hive on MR3 successfully complete all the queries, whereas Trino fails to complete query 72 with the error message Query exceeded per-node memory limit of 120GB. This is a typical case in which Trino (or any MPP system without fault tolerance) fails to complete a query for lack of memory to hold intermediate data.

#2. Correctness

The three systems agree on the result with the following exceptions:

As all the systems under comparison have been developed independently, we can conclude (with high confidence) that Trino returns wrong answers on query 21 and query 23.

[ Update (June 3, 2023). The previous SQL script for query 21 included an integer division. In Trino, the result of an integer division is always an integer. After casting the first operand to double, Trino returns the correct result of 3070 rows. We thank Alfred Davidson for reporting the error in the SQL script. ]
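The pitfall can be illustrated with a minimal Python sketch (the operand values are hypothetical; in the SQL script itself, the fix is to cast the first operand to DOUBLE):

```python
# Trino's `/` on two integers truncates, which for positive operands
# behaves like Python's integer division `//`.
numerator, denominator = 7, 2   # hypothetical operand values

ratio_truncated = numerator // denominator       # integer division: 3
ratio_exact = float(numerator) / denominator     # after casting to double: 3.5
```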

#3. Total running time

In terms of the total running time, Hive on MR3 is as fast as Trino.

[Figure: total running times of Trino, Spark, and Hive on MR3]

Trino and Hive on MR3 far outperform Spark. Although Spark uses Java 8 in our experiment, running Spark with Java 17 is unlikely to change this conclusion, given the size of the performance gap.

For a fair comparison, one could enable the fault-tolerant execution of Trino. In our experiment, however, Trino with fault-tolerant execution becomes much slower and even unstable. For example, when we use HDFS for the exchange manager, the first four queries of the TPC-DS benchmark produce the following results:

#4. Response time

In terms of the geometric mean of running times, Trino responds 21 percent faster than Hive on MR3 and 42 percent faster than Spark.

[Figure: geometric mean of running times of Trino, Spark, and Hive on MR3]

#5. Hive on MR3 using Java 17 vs Java 8

By switching from Java 8 to Java 17 (both using G1 GC), Hive on MR3 reduces the total running time by about 8 percent.

Conclusion

Since its initial release, Hive on MR3 has achieved significant improvements in performance, stability, and features. Despite being fault tolerant, Hive on MR3 runs almost as fast as Trino on the TPC-DS benchmark without failing or returning wrong results. Operating Hive on MR3 is also easy, whether on Hadoop or on Kubernetes. In fact, Hive on MR3 can run in any type of cluster by exploiting standalone mode (similar to Trino and to Spark in standalone mode).

If you are interested in Hive on MR3, see Quick Start Guide.

Discuss this article on Hacker News.