Cloudera recently issued a press release claiming that Impala, their Hadoop SQL engine, is faster than Apache Hive, faster than an unnamed “proprietary database”, scales linearly, has been widely adopted, is production ready, and has an ever-increasing list of enterprise features. Sounds impressive – but these bold claims warrant a closer look.
While claiming no gimmicks, Cloudera has delivered some rather questionable “proof” points. Although the tests are based on the industry-standard TPC-DS benchmark, the published results cover only a carefully selected subset of the TPC-DS queries, run against a carefully selected subset of the TPC-DS data.
For the performance comparisons, they chose just 20 of the 99 queries in the official TPC-DS query set.
For the scalability tests, they chose just 6 of the 99 queries in the official TPC-DS query set.
For all tests, they used a single fact table, even though the TPC-DS database schema contains 7 fact tables.
What about the rest of the queries and fact tables? 39 of the TPC-DS queries join multiple fact tables. Cloudera chose not to try those, perhaps because they are too complex?
How many users? (Correction).
The performance comparison tests were all done with a single user. How many customers dedicate a complete analytic cluster to a single user?
The very small subset of 6 queries measured for scalability was apparently tested with more than one user, but exactly how many users isn’t stated. Hopefully it was at least 4: the TPC-DS benchmark specification states that a minimum of 4 concurrent users is required.
For the performance tests vs Hive, SQL syntax was changed for all of the queries to convert them into SQL-92 style joins, manually optimize the join order, and add an explicit partition predicate.
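To make that kind of rewrite concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in engine. The table and column names follow TPC-DS naming, but the rows, the surrogate key values, and the exact query text are made up for illustration; they are not taken from Cloudera's benchmark kit. SQLite has no partitioning, so the sketch only shows that the hand-added predicate is logically redundant with the dimension filter. On a partitioned engine, that predicate is what lets the scan skip partitions.

```python
import sqlite3

# In-memory stand-in database with hypothetical data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE date_dim (d_date_sk INTEGER PRIMARY KEY, d_year INTEGER);
    CREATE TABLE store_sales (ss_sold_date_sk INTEGER, ss_ext_sales_price REAL);
    INSERT INTO date_dim VALUES (100, 1999), (200, 2000), (201, 2000), (300, 2001);
    INSERT INTO store_sales VALUES (100, 5.0), (200, 10.0), (201, 20.0), (300, 40.0);
""")

# Unmodified style: implicit (comma) join, filter only on the dimension table.
original = """
    SELECT d_year, SUM(ss_ext_sales_price)
    FROM store_sales, date_dim
    WHERE ss_sold_date_sk = d_date_sk
      AND d_year = 2000
    GROUP BY d_year
"""

# Hand-modified style: explicit JOIN ... ON, plus a range predicate on the
# fact table's (assumed) partition key that duplicates the d_year filter.
rewritten = """
    SELECT d_year, SUM(ss_ext_sales_price)
    FROM store_sales
    JOIN date_dim ON ss_sold_date_sk = d_date_sk
    WHERE d_year = 2000
      AND ss_sold_date_sk BETWEEN 200 AND 201
    GROUP BY d_year
"""

original_rows = conn.execute(original).fetchall()
rewritten_rows = conn.execute(rewritten).fetchall()
print(original_rows == rewritten_rows)
```

Both queries return the same rows, but only the second one tells the engine up front which slice of the fact table to read. Someone had to work out the key range 200 to 201 by hand, which is exactly the kind of help an unmodified benchmark query does not get.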
For the performance tests vs the unnamed proprietary database, they removed the SQL analytic functions and added an explicit predicate to the WHERE clauses that expresses a partition filter on the fact table. Why? Because Impala doesn’t support standard SQL analytic functions such as windowed aggregates. Nor does it support dynamic partitioning, so they had to manually change the queries to reduce the data size.
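For readers who have not used them, a windowed aggregate computes a total over a group of rows without collapsing those rows. The sketch below, again using Python's sqlite3 module as a stand-in (window functions need SQLite 3.25 or later), runs one next to a hand-written equivalent built from a join against a grouped subquery. The sales table and its contents are hypothetical. The second query is only one possible workaround on an engine without window functions; it is not a description of what Cloudera did, which was to remove the functions.

```python
import sqlite3

# In-memory stand-in database with hypothetical data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (category TEXT, item TEXT, revenue REAL);
    INSERT INTO sales VALUES
        ('books', 'a', 10.0),
        ('books', 'b', 30.0),
        ('music', 'c', 20.0);
""")

# SQL:2003 windowed aggregate: per-category total alongside every row.
windowed = """
    SELECT item, revenue,
           SUM(revenue) OVER (PARTITION BY category) AS category_total
    FROM sales
    ORDER BY item
"""

# Manual workaround: compute the totals in a grouped subquery and join back.
manual = """
    SELECT s.item, s.revenue, t.category_total
    FROM sales s
    JOIN (SELECT category, SUM(revenue) AS category_total
          FROM sales
          GROUP BY category) t
      ON s.category = t.category
    ORDER BY s.item
"""

windowed_rows = conn.execute(windowed).fetchall()
manual_rows = conn.execute(manual).fetchall()
print(windowed_rows == manual_rows)
```

The two forms agree here, but the manual version reads the table twice and has to be written by a person who understands the query. End-user tools that generate SQL with OVER clauses cannot be expected to do that rewrite themselves.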
What about running the SQL without modifications?
Increasing List of Enterprise Features?
The fact is that Impala is still missing SQL-92 sub-query support, SQL 99 aggregate functions, and SQL 2003 windowed aggregate functions – just to name a few things. The TPC-DS specification requires these (as do most customers), so Cloudera certainly does need to increase its list of features!
Alternatives to Cloudera Impala
The current release of Big SQL in IBM’s BigInsights 2.1 has much richer SQL support than Impala, including SQL-92 sub-query support, SQL 99 aggregate functions, and SQL 2003 windowed aggregate functions. This means users are less likely to have to rewrite a query, and end-user tools are more likely to work out of the box.
IBM is investing heavily in Big SQL, and intends to replace the current Big SQL execution engine with a true MPP SQL execution engine built for performance. Given that investment and IBM’s expertise in query optimization, I think Cloudera’s perceived performance advantage is likely to be short-lived.