Apr 1, 2022
In this article, we evaluate the performance of the following systems.
The goal is 1) to show that Spark 3 achieves a major performance improvement over Spark 2, 2) to compare Spark 3 and Hive 3 for performance, and 3) to compare Hive-LLAP and Hive 3 for performance. We use the TPC-DS benchmark with both sequential and concurrent tests.
For experiments, we use two clusters: Indigo and Blue. Indigo consists of 1 master and 22 worker nodes with the following properties:
Blue consists of 1 master and 12 worker nodes with the following properties:
In total, the amount of memory of worker nodes is 22 * 96GB = 2112GB on the Indigo cluster and 12 * 256GB = 3072GB on the Blue cluster. Both Indigo and Blue run HDP 3.1.4 and use HDFS replication factor of 3.
We use a variant of the TPC-DS benchmark introduced in a previous article which replaces an existing LIMIT clause with a new SELECT clause so that different results from the same query translate to different numbers of rows. The reader can find the two sets of modified TPC-DS queries in the GitHub repository.
The scale factors for the TPC-DS benchmark are 1TB and 3TB on the Indigo cluster, and 1TB and 10TB on the Blue cluster.
For best performance, we use Parquet for Spark and ORC for Hive.
For all the experiments, the amount of memory allocated to JVM in each worker node (Spark executor for Spark, LLAP daemon for Hive-LLAP, and MR3 ContainerWorkers for Hive on MR3) is the same: 80GB on the Indigo cluster and 216GB on the Blue cluster. For Spark, we choose configuration parameters after performance tuning using the dataset of 1TB. For Spark 3.2.1, we enable advanced features such as dynamic partition pruning and adaptive query execution. For Hive-LLAP, we use the default configuration set by HDP.
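As a rough illustration of the two Spark 3 features named above, the sketch below lists the standard Spark SQL properties that control adaptive query execution and dynamic partition pruning, and renders them as `--conf` arguments. The article does not publish its tuned parameter set, so the `spark_conf` dictionary and the `to_conf_args` helper are assumptions made for illustration only, not the configuration used in the experiments.

```python
# Hypothetical configuration fragment: only the two features named in the
# article are listed. The values are assumptions, not the tuned settings.
spark_conf = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.optimizer.dynamicPartitionPruning.enabled": "true",
}


def to_conf_args(conf):
    """Render a dict of Spark properties as a flat list of --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_conf_args(spark_conf)))
```

The resulting arguments can be appended to whatever command launches Spark, for example the script that starts Spark Thrift Server.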
In a sequential test, we submit 99 queries from the TPC-DS benchmark. We report the total running time, the geometric mean of running times, and the running time of each individual query.
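The two summary metrics can be computed as in the minimal sketch below. The function name `summarize` and the sample running times are hypothetical and are not taken from the experiments.

```python
import math


def summarize(running_times):
    """Return (total, geometric mean) of per-query running times in seconds."""
    if not running_times or any(t <= 0 for t in running_times):
        raise ValueError("running times must be a non-empty list of positive numbers")
    total = sum(running_times)
    # Geometric mean computed as exp(mean(log t)), which avoids overflow
    # from multiplying many running times together.
    geo_mean = math.exp(sum(math.log(t) for t in running_times) / len(running_times))
    return total, geo_mean


# Hypothetical running times in seconds: three short queries and one long one.
sample = [2.0, 8.0, 4.0, 1000.0]
total, geo_mean = summarize(sample)

if __name__ == "__main__":
    print(f"total = {total:.1f}s, geometric mean = {geo_mean:.2f}s")
```

In this sample the single long query accounts for almost all of the total running time, while the geometric mean stays close to the short queries. This is why the two metrics can rank systems differently.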
In a concurrent test, we choose a concurrency level from 1 up to 16 or 32 (depending on the experiment) and start as many clients, each of which submits 17 queries, query 25 to query 40, from the TPC-DS benchmark. To better simulate a realistic environment, each client submits these 17 queries in a unique sequence. For each run, we measure the longest running time among all the clients. Since the cluster remains busy until the last client completes the execution of all its queries, the longest running time can be thought of as the cost of executing the queries for all the clients.
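The procedure above can be sketched as follows. This is a simplified model, not the actual test harness: the query names are placeholders, and the helpers `unique_sequences` and `concurrent_cost` are hypothetical names introduced for illustration.

```python
import math
import random


def unique_sequences(queries, num_clients, seed=0):
    """Return one distinct random ordering of the queries per client."""
    if num_clients > math.factorial(len(queries)):
        raise ValueError("not enough distinct orderings for this many clients")
    rng = random.Random(seed)  # fixed seed keeps a run reproducible
    seen = set()
    sequences = []
    while len(sequences) < num_clients:
        order = tuple(rng.sample(queries, len(queries)))
        if order not in seen:  # reject an ordering already given to another client
            seen.add(order)
            sequences.append(list(order))
    return sequences


def concurrent_cost(client_times):
    """Cost of a run: the longest running time among all the clients."""
    return max(client_times)


# 17 placeholder query names standing in for the queries each client submits.
placeholder_queries = [f"query_{i}" for i in range(1, 18)]
sequences = unique_sequences(placeholder_queries, num_clients=16)

if __name__ == "__main__":
    # Hypothetical per-client running times in seconds.
    print(concurrent_cost([100.0, 130.5, 120.0]))
```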
For Spark, we run Spark Thrift Server. For Hive, we run HiveServer2.
For the reader's perusal, we attach the table containing the raw data of the experiment results. Here is a link to [Google Docs].
To compare Spark 2.4.8 and Spark 3.2.1, we use the dataset of 1TB on the Indigo and Blue clusters.
From the sequential test, Spark 3.2.1 runs about twice as fast as Spark 2.4.8:
From the concurrent test, Spark 3.2.1 runs much faster, or equivalently, yields much higher throughput than Spark 2.4.8. On the Indigo cluster with 16 concurrent queries, Spark 2.4.8 performs even worse than executing all the queries sequentially (695 seconds * 16 < 49087 seconds).
In summary, Spark 3.2.1 indeed achieves a major performance improvement over Spark 2.4.8. As Spark 3 is no more difficult to operate than Spark 2, upgrading to Spark 3 should be the right decision in most cases.
In the previous experiment, the size of the dataset (1TB) is small relative to the total amount of memory of worker nodes. In this experiment, we increase the size of the dataset to 3TB on Indigo and 10TB on Blue. We also compare Spark 3.2.1 and Hive 3.1.2 running on MR3 1.4 to gain a sense of how Spark 3 compares with Hive 3 in general.
From the sequential test, Hive on MR3 runs much faster than Spark 3.2.1 in terms of the total running time.
In terms of the geometric mean of running times, however, the performance gap is much smaller.
This result implies that Spark 3.2.1 is more or less comparable to Hive for short-running interactive queries whereas Hive is clearly a better choice than Spark for long-running ETL queries.
From the concurrent test, in which we allow the concurrency level to go up to 32, we observe that Spark suffers a heavy performance penalty as the concurrency level increases. For example, on the Indigo cluster with 32 concurrent queries, Spark performs even worse than executing all the queries sequentially (576 seconds * 32 < 131145 seconds). On the Blue cluster with 32 concurrent queries, Spark performs only slightly better than executing all the queries sequentially (49784 seconds / 1845 seconds = 26.98 vs 32). In contrast, the running time for Hive on MR3 is nearly proportional to the concurrency level.
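The comparison against sequential execution can be reproduced with the short sketch below, using the Spark running times quoted above. It assumes that the smaller figure in each pair (576 and 1845 seconds) is the running time with a single client. The function name `slowdown_ratio` is hypothetical.

```python
def slowdown_ratio(single_client_seconds, concurrent_seconds):
    """How many times longer the concurrent run takes than a single client."""
    return concurrent_seconds / single_client_seconds


concurrency_level = 32

# (single-client seconds, seconds with 32 clients) for Spark, as quoted in the text.
spark_runs = {
    "Indigo": (576, 131145),
    "Blue": (1845, 49784),
}

ratios = {
    cluster: slowdown_ratio(single, concurrent)
    for cluster, (single, concurrent) in spark_runs.items()
}

if __name__ == "__main__":
    for cluster, ratio in ratios.items():
        # A ratio above the concurrency level means that running the clients
        # one after another would have finished sooner.
        verdict = "worse" if ratio > concurrency_level else "better"
        print(f"{cluster}: ratio {ratio:.2f} vs {concurrency_level}, "
              f"{verdict} than sequential execution")
```

A ratio equal to the concurrency level would mean that concurrency brings no benefit over sequential execution, so lower ratios are better.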
The result shows that the current architecture of Spark has room for improvement for executing concurrent queries, especially with a high concurrency level. We remark that with 32 concurrent queries, Spark fails to complete 1 query out of 32 * 17 = 544 queries.
In the last experiment, we use the dataset of 10TB to compare Hive on MR3 and Hive-LLAP on the Blue cluster.
In the sequential test, Hive-LLAP is about 10 percent faster than Hive on MR3. The performance gap is mainly due to several patches incorporated into Hive-LLAP which produce different execution plans than Hive on MR3. (Hive on MR3 has not yet backported these patches because of a correctness issue reported in our previous article.)
In the concurrent test, Hive on MR3 is shown to be comparable to Hive-LLAP in terms of throughput. Hive on MR3 appears to yield slightly lower throughput than Hive-LLAP, but in the experiment, Hive-LLAP fails to complete several queries.
In summary, Hive on MR3 is comparable to Hive-LLAP in terms of performance. As it is much easier to operate than Hive-LLAP and also provides native support for Kubernetes, Hive on MR3 is a viable alternative to Hive-LLAP in terms of both performance and ease of use.
Since the release of Hive 3.1 (which, unfortunately, has not been actively maintained), there has been no new release of Apache Hive, while Spark has seen a considerable improvement in its performance with an upgrade to Spark 3. The good news is that the Hive community has recently started to roll out new releases from the Hive 4 branch, with an initial release of Hive 4.0.0-alpha-1. From our preliminary evaluation, Hive 4 achieves a moderate performance improvement over Hive 3 (but not as much as Spark 3 achieves over Spark 2). When a stable release of Hive 4 is available, we will report the result of evaluating its performance.