May 31, 2023
Read in 5 minutes
In this article, we evaluate the performance of the following systems.
Trino and Hive on MR3 use Java 17, while Spark uses Java 8. For Hive on MR3, we also report the result of using Java 8. We do not use the fault-tolerant execution of Trino, whereas both Spark and Hive on MR3 are fault tolerant by design.
For experiments, we use a cluster consisting of 1 master and 12 worker nodes with the following properties:
In total, the worker nodes provide 12 * 256GB = 3072GB of memory. The cluster runs HDP (Hortonworks Data Platform) 3.1.4 and uses an HDFS replication factor of 3.
We use a variant of the TPC-DS benchmark introduced in the previous article which replaces an existing LIMIT clause with a new SELECT clause so that different results from the same query translate to different numbers of rows. The reader can find the set of modified TPC-DS queries in the GitHub repository.
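The exact rewrite is in the repository; as a hypothetical sketch (the table and column below are illustrative, not taken from the benchmark queries), a query ending in a LIMIT clause can be wrapped in an outer SELECT that reduces the full result to a single row count:

```sql
-- Original (truncates the output, hiding differences beyond the first 100 rows):
--   SELECT ss_item_sk FROM store_sales GROUP BY ss_item_sk ORDER BY ss_item_sk LIMIT 100;
-- Modified: drop the LIMIT and count all rows, so any difference in the
-- result set shows up as a different number of rows.
SELECT count(*) AS num_rows
FROM (
  SELECT ss_item_sk FROM store_sales GROUP BY ss_item_sk
) t;
```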
The scale factor for the TPC-DS benchmark is 10TB.
For best performance, we use ORC for Trino and Hive on MR3, and Parquet for Spark.
For all the experiments, the amount of memory allocated to the JVM in each worker node (for Trino workers, Spark executors, and MR3 workers) is the same.
For Trino, we use a JVM option -Xmx196G and choose the following configuration after performance tuning:
memory.heap-headroom-per-node=58GB
query.max-total-memory=1680GB
query.max-memory-per-node=120GB
query.max-memory=1440GB
For Spark, we choose configuration parameters after performance tuning using a TPC-DS dataset of 1TB. For the amount of memory allocated to Spark executors, we use the following configuration:
spark.executor.memory=196g
spark.executor.memoryOverhead=20g
We use Spark shuffle service and enable advanced features such as dynamic partition pruning and adaptive query execution.
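These features correspond to standard Spark configuration parameters, for example (a sketch of typical settings, not a copy of the experiment's configuration file):

spark.shuffle.service.enabled=true
spark.sql.adaptive.enabled=true
spark.sql.optimizer.dynamicPartitionPruning.enabled=true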
For Hive on MR3, we use the default configuration in the MR3 distribution except for the amount of memory allocated to workers.
We sequentially submit 99 queries from the TPC-DS benchmark. We report the total running time, the geometric mean of running times, and the running time of each individual query.
In order to check the correctness, we report the number of rows from each query. If the result contains a single row, we report the sum of all numerical values in it.
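The reported metrics can be sketched as follows (a minimal illustration in Python with made-up running times, not the scripts used in the experiment):

```python
import math

# Per-query running times in seconds (illustrative values, not measured data).
times = [12.0, 48.0, 3.0]

# Total running time and geometric mean of running times.
total = sum(times)
geomean = math.exp(sum(math.log(t) for t in times) / len(times))

def check_result(rows):
    """Correctness fingerprint: the number of rows in general, or the sum of
    all numerical values when the result contains a single row."""
    if len(rows) == 1:
        return sum(v for v in rows[0] if isinstance(v, (int, float)))
    return len(rows)
```

The geometric mean is less sensitive than the total to a few long-running queries, which is why both are reported.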
For the reader's perusal, we attach the table containing the raw data of the experiment results. Here is a link to [Google Docs].
Spark and Hive on MR3 successfully complete all the queries, whereas Trino fails to complete query 72 with an error message: Query exceeded per-node memory limit of 120GB.
This is a typical case in which Trino (or any MPP system without fault tolerance) fails to complete a query for lack of memory to hold intermediate data.
The three systems agree on the result with the following exceptions:
As all the systems under comparison have been developed independently, we can conclude with high confidence that Trino returns wrong answers on query 21 and query 23.
[ Update (June 3, 2023). The previous SQL script for query 21 included an integer division. In Trino, the result of an integer division is always an integer. After casting the first operand to double, Trino returns the correct result of 3070 rows. Thanks to Alfred Davidson for reporting the error in the SQL script. ]
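The fix can be illustrated with a simple expression (a sketch, not the actual text of query 21):

```sql
-- Integer division: with two integer operands, Trino truncates the result.
SELECT 7 / 2;                  -- returns 3
-- Casting the first operand makes the division floating-point.
SELECT CAST(7 AS double) / 2;  -- returns 3.5
```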
In terms of the total running time, Hive on MR3 is as fast as Trino.
Trino and Hive on MR3 far outperform Spark. Although Spark uses Java 8 in our experiment, running Spark with Java 17 is unlikely to change this conclusion, given the large performance gap.
For a fair comparison, one could enable the fault-tolerant execution of Trino. In our experiment, however, Trino with fault-tolerant execution becomes much slower and even unstable. For example, when we use HDFS for the exchange manager, the first four queries of the TPC-DS benchmark produce the following results:
In terms of the geometric mean of running times, Trino is 21 percent faster than Hive on MR3 and 42 percent faster than Spark.
By switching from Java 8 to Java 17 (both using G1 GC), Hive on MR3 achieves a reduction of about 8 percent in the total running time.
Since its initial release, Hive on MR3 has achieved significant improvements in performance, stability, and features. Despite being fault tolerant, Hive on MR3 runs almost as fast as Trino on the TPC-DS benchmark without failing or returning wrong results. Operating Hive on MR3 is also easy, whether on Hadoop or on Kubernetes. In fact, Hive on MR3 can run in any type of cluster by exploiting standalone mode (similar to Trino and Spark in standalone mode).
If you are interested in Hive on MR3, see Quick Start Guide.
Discuss this article in Hacker News.