The best way to run Hive on Hadoop, on Kubernetes, and on Amazon AWS
As the de facto standard for SQL-based analytics on Hadoop, Apache Hive is a mature data warehouse system widely used in industry. Unfortunately, upgrading Hive on Hadoop is a tough decision because it almost inevitably introduces new dependency problems. As a consequence, many users reluctantly keep their first installation of Hive and miss out on the benefits of more recent releases.
LLAP (Low-Latency Analytical Processing) is a major component of Hive which allows it to far outperform competing technologies such as Presto and Spark SQL. Unfortunately, enabling and configuring LLAP is excruciatingly difficult because of its complex architecture. The experience is frustrating enough that many users reluctantly choose an alternative technology that is slower and less mature.
As the enterprise environment gravitates towards Kubernetes at an accelerating pace, the industry is urgently looking for a solution that will enable Hive to run on Kubernetes. Unfortunately, the only solution available today is an expedient one that first operates Hadoop on Kubernetes and then runs Hive on Hadoop, thus introducing two layers of complexity. The right approach is to use an execution engine capable of communicating directly with Kubernetes.
Hive on MR3 is a robust solution that addresses all these pain points of Hive. Its core technology is a new execution engine, MR3, which provides native support for both Hadoop and Kubernetes. Hive on MR3 is a significant improvement over Apache Hive in terms of both simplicity of operation and efficiency in execution.
On Hadoop, MR3 allows users to easily switch between different versions of Hive without upgrading Hadoop. All the major versions of Hive, from Hive 1 to Hive 4, can run in the same cluster. Hive on MR3 automatically achieves the performance of LLAP and beyond without requiring any further configuration.
On Kubernetes, Hive on MR3 directly creates and destroys worker Pods. All the enterprise features, such as high availability, Kerberos-based security, SSL data encryption, and authorization with Apache Ranger, are equally available. On public clouds, Hive on MR3 can take advantage of autoscaling supported by MR3.
For users of Amazon AWS, Hive on MR3 includes key features for reducing the cost significantly. With in-memory or NVMe caching, the separation of compute and storage continues to work without performance penalty. With autoscaling, workers are created and destroyed dynamically to adapt to workload changes. With fault tolerance, spot instances can replace on-demand instances. For executing queries sporadically, workers can run on AWS Fargate.
If your product manages data warehouses for individual customers, integrating Hive on MR3 will immediately boost your competitive advantage. You will be able to save both time and money by reducing the maintenance overhead and operational cost of data warehouses. Meanwhile your customers will find your product more attractive because it inherits all the features of Hive.
As the design principle of MR3 lies in simplicity, we are confident that you will like Hive on MR3 much better than Apache Hive, whether on Hadoop or on Kubernetes. So give it a try!
Get in touch with us so that we can learn more about your use case and answer your questions about Hive on MR3. We will assist you in deploying Hive on MR3 on Hadoop, on Kubernetes, and on Amazon AWS. So feel free to contact us!
Spark on MR3 runs Apache Spark using MR3 as the execution backend. It allows multiple Spark applications to share compute resources such as YARN containers or Kubernetes Pods. Thus Spark on MR3 can be particularly useful in cloud environments where Spark applications are created and destroyed frequently.