Building a data warehouse with Apache Hive can be challenging. On AWS, existing solutions are hard to use because they inherit the complexity of the underlying Hadoop system. You need to learn not only Hive but also Hadoop.
An ideal solution should be easy to maintain by hiding or even eliminating the Hadoop layer. It should also be cheap to operate by delivering excellent performance while minimizing wasted resources. Finally, it should not lock you into a single vendor.
Our solution runs Apache Hive directly on Kubernetes without requiring an additional Hadoop layer. The enabling technology is MR3, a new execution engine that provides native support for Kubernetes. Our solution packages Hive with Grafana, Superset, and Apache Ranger. As it does not require Hadoop, our solution runs on Amazon EKS and stores all data on S3 using standard open data formats such as ORC and Parquet.
Data analysts can concurrently access the data warehouse with built-in Superset or their favorite BI tools, while the administrator can control access with Apache Ranger.
Our installation of Apache Hive will reduce your AWS bill significantly. On the TPC-DS benchmark, it runs at least twice as fast as competing technologies such as Presto and Spark 3, and thus requires far fewer compute resources. With autoscaling, Hive workers are created and destroyed dynamically to adapt to workload changes. Thanks to fault tolerance, spot instances can replace on-demand instances.
We deploy our solution in your AWS account. Your sensitive data never leaves your AWS account. Our solution uses Apache Hive 3.1 with over 600 additional patches backported.
Give us scoped permissions on your AWS account. Specify basic configurations for your data warehouse. Then we will deploy our solution on Amazon EKS.
Once your data warehouse is ready, you can execute SQL queries right away with the built-in Superset or your favorite BI tool.
We manage your data warehouse in a transparent way, so you can also control it yourself using the AWS console or CLI, or a tool that we provide.
We use a simple pricing plan. We charge only for the compute resources used by Hive workers, at the same rate as the Amazon EMR price for those resources.
For example, since Amazon EMR charges $0.113 per hour for the m5d.2xlarge instance type, we charge $0.113 per hour for the same instance type.
With better performance and faster autoscaling, our solution is a much cheaper option than Amazon EMR.
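To make the pricing concrete, here is a minimal sketch of how the fee could be estimated. Only the $0.113 hourly rate for m5d.2xlarge comes from the text above. The function name, the assumption that the fee scales linearly with worker-hours, and both daily schedules are hypothetical and chosen for illustration. The underlying EC2 instance charges billed by AWS are not included.

```python
# Rate quoted above for one m5d.2xlarge Hive worker (USD per hour).
RATE_PER_WORKER_HOUR = 0.113


def service_fee(workers_per_hour):
    """Estimate the fee for a schedule given as one worker count per hour.

    Assumes (hypothetically) that the fee is simply
    total worker-hours multiplied by the hourly rate.
    """
    total_worker_hours = sum(workers_per_hour)
    return total_worker_hours * RATE_PER_WORKER_HOUR


# Hypothetical day with autoscaling: 2 workers for 16 quiet hours,
# 10 workers for 8 busy hours.
autoscaled_day = [2] * 16 + [10] * 8

# Hypothetical day without autoscaling: 10 workers kept running all day.
fixed_day = [10] * 24

print(f"autoscaled: ${service_fee(autoscaled_day):.2f}")
print(f"fixed size: ${service_fee(fixed_day):.2f}")
```

In this made-up schedule, the autoscaled cluster accrues 112 worker-hours against 240 for the fixed-size cluster, which is the kind of saving that autoscaling is meant to deliver. Actual savings depend on your workload.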