Cloudera Impala – “Benchmarketing” – A Closer Look

Cloudera recently issued a press release claiming that Impala, their Hadoop SQL engine, is faster than Apache Hive, faster than an unnamed “proprietary database”, scales linearly, has been widely adopted, is production-ready, and has an ever-increasing list of enterprise features.  Sounds impressive – but these bold claims warrant a closer look.

Benchmark Games  

While claiming no gimmicks, Cloudera has delivered some rather questionable “proof” points. Although they base their tests on the industry-standard TPC-DS benchmark, they show results only for a carefully selected subset of the TPC-DS queries, run against a carefully selected subset of the TPC-DS data.

For the performance comparisons – they have chosen just 20 of the 99 queries in the official TPC-DS query set.

For the scalability tests, they have chosen just 6 of the 99 queries in the official TPC-DS query set.

For all tests they have chosen to use a single fact table, even though the TPC-DS database schema contains 6 fact tables.

What about the rest of the queries and fact tables?  39 of the TPC-DS queries join multiple fact tables. Cloudera chose not to try those, perhaps because they are too complex?

How Many Users? (Correction)

The performance comparison tests were all done with a single user.  How many customers dedicate a complete analytic cluster to a single user?

The very small subset of 6 queries measured for scalability apparently tested more than one user, but exactly how many users isn’t stated. Hopefully it was at least 4 – the TPC-DS benchmark specification states that a minimum of 4 concurrent users is required.

Standard SQL?

For the performance tests vs Hive, the SQL syntax of every query was changed to convert the joins to SQL-92 style join syntax, manually optimize the join order, and add an explicit partition predicate.
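To illustrate the kind of rewrite involved – using simplified, hypothetical queries over the TPC-DS `store_sales` fact table, not the actual benchmark SQL – the transformation looks something like this:

```sql
-- Original style: implicit (comma) joins, no partition filter
SELECT i_item_id, SUM(ss_ext_sales_price) AS total_sales
FROM store_sales, item, date_dim
WHERE ss_item_sk = i_item_sk
  AND ss_sold_date_sk = d_date_sk
  AND d_year = 2000
GROUP BY i_item_id;

-- Rewritten: SQL-92 explicit JOIN syntax, hand-picked join order,
-- and an explicit range predicate on the partitioning column
-- (the surrogate-key range here is illustrative only)
SELECT i_item_id, SUM(ss_ext_sales_price) AS total_sales
FROM store_sales
JOIN date_dim ON ss_sold_date_sk = d_date_sk
JOIN item     ON ss_item_sk = i_item_sk
WHERE d_year = 2000
  AND ss_sold_date_sk BETWEEN 2451545 AND 2451910
GROUP BY i_item_id;
```

Both forms are semantically equivalent; the point is that the second had to be produced by hand.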

For the performance tests vs the unnamed proprietary database, they removed the SQL analytic functions and added an explicit predicate to the WHERE clauses to express a partition filter on the fact table.  Why? Because Impala doesn’t support standard SQL analytic functions – such as windowed aggregates. Nor does it support dynamic partitioning, so they had to manually change the queries to reduce the data size.
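For example – again a simplified sketch on the TPC-DS schema, not the actual benchmark SQL – a standard windowed aggregate has to be unrolled into a join against a derived table when the engine lacks window-function support:

```sql
-- Standard SQL 2003: each row carries its store's total via a window
SELECT ss_store_sk, ss_item_sk, ss_sales_price,
       SUM(ss_sales_price) OVER (PARTITION BY ss_store_sk) AS store_total
FROM store_sales;

-- Workaround without window functions: pre-aggregate, then join back
SELECT s.ss_store_sk, s.ss_item_sk, s.ss_sales_price, t.store_total
FROM store_sales s
JOIN (SELECT ss_store_sk,
             SUM(ss_sales_price) AS store_total
      FROM store_sales
      GROUP BY ss_store_sk) t
  ON s.ss_store_sk = t.ss_store_sk;
```

The workaround scans the fact table twice – exactly the kind of manual rewrite (and potential performance change) that running the unmodified SQL would have exposed.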

What about running the SQL without modifications?

Increasing List of Enterprise Features?

The fact is that Impala is still missing SQL-92 sub-query support, SQL 99 aggregate functions, and SQL 2003 windowed aggregate functions – just to name a few.  The TPC-DS specification requires these (as do most customers), so they certainly need to increase their list of features!
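For reference, these are the kinds of constructs at issue – hypothetical examples written against the TPC-DS schema:

```sql
-- SQL-92 correlated sub-query
SELECT c_customer_id
FROM customer c
WHERE EXISTS (SELECT 1
              FROM store_sales
              WHERE ss_customer_sk = c.c_customer_sk);

-- SQL 99 grouped aggregation with ROLLUP (subtotals per category)
SELECT i_category, i_class, SUM(ss_ext_sales_price) AS sales
FROM store_sales
JOIN item ON ss_item_sk = i_item_sk
GROUP BY ROLLUP (i_category, i_class);

-- SQL 2003 windowed aggregate: rank items by sales within a category
SELECT i_category, i_item_id,
       RANK() OVER (PARTITION BY i_category
                    ORDER BY SUM(ss_ext_sales_price) DESC) AS sales_rank
FROM store_sales
JOIN item ON ss_item_sk = i_item_sk
GROUP BY i_category, i_item_id;
```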

Alternatives to Cloudera Impala

The current release of Big SQL in IBM’s BigInsights 2.1 has much richer SQL support than Impala – including SQL-92 sub-query support, SQL 99 aggregate functions, and SQL 2003 windowed aggregate functions. This means queries are less likely to need rewriting, and end-user tools are more likely to work out of the box.

IBM is investing heavily in Big SQL, and intends to replace the current Big SQL execution engine with a true MPP SQL execution engine built for performance. Together with IBM’s expertise in query optimization, I think Cloudera’s perceived performance advantage is likely to be short-lived.

21 thoughts on “Cloudera Impala – “Benchmarketing” – A Closer Look”

    • @Vlad,
      “Yes, sure Impala is not a top dog here, but its free and it matters the most.”

      I don’t know if you really want to say that…

      I mean, would you eat a hamburger that was tossed on the sidewalk (no wrapper), or would you go into McDonald’s and buy one?

      You may not like the analogy, but the point is that you need to think beyond just the dollar amount.
      If a free tool doesn’t meet your needs then it’s worthless, and you are better off paying for a tool that does what you need. It may be cheaper in the long term than trying to force a free tool into the mix.
      (Note: if you spend time recoding the ‘free tool’, then that tool is no longer free.)

      A perfect example is FB and their use of MySQL….

      Just putting it in perspective…

  1. Given that the data and queries used are public domain, how does IBM BigSQL compare performance-wise?
    After all, it’s easy to be critical from the sidelines.

  2. It would be interesting to see IBM results on the same hardware – the benchmark is open sourced, including data generation and SQLs.
    Regarding scalability – for a good parallel database, it should be possible to consume as much of the cluster for a single query as possible, until some bottleneck is reached (CPU, IO, or network). If you can’t get there, you are not efficient. Once you get there – yes, if you want twice the workload with the same performance then by definition you need twice the capacity…

  3. I don’t know that it’s a fair assessment. I think that Glenn needs to put down the Kool-Aid…😉

    It’s not really fair to compare IBM’s product to Impala. Apples to oranges. Especially in terms of query optimization. IBM has both DB2 and Informix guys who have been doing that specialty for years… Hive? Not so much…

  4. Appreciate all comments. I am making a correction to the “How many users” section – although Cloudera doesn’t say how many concurrent users they tested with their scalability tests – upon re-reading their blog it suggests there is more than one.

    The current version of Big SQL uses MapReduce for complex query execution, and bypasses MapReduce for short (tactical, single row) queries. So performance should be similar to Hive for complex queries – notice I did not claim it was faster than Impala.

    My main point was to take a closer look at Impala performance claims – don’t take things at face value.

    IBM’s next release of Big SQL will replace the MapReduce execution engine with one from IBM (so I agree with Cloudera on not using MapReduce) and then we can talk about performance comparisons.

    In the meantime – Big SQL does have richer SQL syntax support – not everything yet – but certainly ahead of Impala.

    Oh – and if you want technical support for Impala from Cloudera – then it is not free – it is an extra cost (price not disclosed) annual subscription option (RTQ) on top of Cloudera Enterprise.

  5. “Oh – and if you want technical support for Impala from Cloudera – then it is not free – it is an extra cost (price not disclosed) annual subscription option (RTQ) on top of Cloudera Enterprise.”

    Yes, Impala is open source software, and every vendor charges for support of their open source components, IBM included. What is your point?

  6. Full disclosure: I did most of the lab work associated with the Cloudera post referenced.
    The following are my personal thoughts on this post and have no approval from Cloudera.

    I always appreciate when people are diligent and do their homework about performance claims, but while this post suggests it takes “a closer look”, I believe it fails to actually do so in a way that is constructive and meaningful to readers, and it lacks factual accuracy. Allow me to explain:

    Section “Benchmark Games”
    While the three points mentioned are factually accurate, they neither change the results nor make the work performed and the data presented less meaningful or valuable, as you try to suggest to your readers. Twenty queries run are twenty data points, regardless of whether there could have been 99 data points. And given that exactly what was run is provided, a reader who wants to validate, compare, etc. can do so. I see no gaming here whatsoever.

    WRT only using six queries for the multi-user scaling tests, the blog post clearly outlines this — “A multi-user workload of TPC-DS queries selected from the “Interactive” bucket described previously”. So I don’t see any new information here either.

    Section “How many users?”
    Your initial claim, but now removed, was “tested up to a whopping 2 users”. This in itself demonstrated your lack of diligence in reading the material. Making jokes about data that simply did not exist speaks not only to the intention, but also to the very poor quality of this post.

    While the TPC-DS benchmark specification states that a minimum of four concurrent users is required, that fact is irrelevant — the work performed *was not* a TPC-DS submission nor does it claim to be one, it simply borrows data and queries from it.

    Section “Standard SQL”
    Given that Hive requires SQL-92 style join syntax, this is needed, and clearly stated in the Cloudera blog. No surprises here.

    WRT DBMS tests – again, the Cloudera blog clearly states what modifications were done and why they were done — I’m not seeing any “value add” from your commentary. It doesn’t change what was run or the data points collected from the runs.

    In the end, I see this post as just an attempt to discredit the work done, yet it provides no new information nor improved data points from the IBM product, Big SQL. It simply ends in a few hand-wave statements probably best summarized as “Big SQL has better SQL support today (but did not demonstrate Big SQL could run unmodified TPC-DS queries), makes no claim to be faster today, but may be faster and have better SQL support in a future release”. So it considers futures for Big SQL, but discounts stated roadmap features for Impala. Um ya…

    • @Greg,

      Benchmarks are a very sore subject. Oracle has been well known to fudge the facts and all vendors have been known to game the TPC tests in their favor.

      In truth, your benchmark really means nothing.

      By your own admission, it uses the data set from TPC-DS, yet doesn’t fully implement the test. You take a stick and draw a line in the sand saying here’s our numbers, here’s our test, we dare you to compete, because in truth, there is nothing to compare with your results.

      If you’re going to take the TPC-DS data, why didn’t you run the full test suite? If not for an official submission, then so that you could compare the results against other systems.

      Clearly our blogger drinks the blue Kool-Aid. Most everyone at Markham does.😉
      Clearly he overstated his position, and in part is trying to create FUD. After all, IBM is column fodder at this point and needs to improve their ranking.

      Having said that… why didn’t you use the full 6 fact tables and run the full set of 99 queries?

      I have no skin in the game, and I am vendor neutral. However, I want to see everyone be honest…😉

    • Hi Greg, I appreciate your comments.

      But a casual observer might not be familiar with TPC-DS (despite the link in the Cloudera blog) – so the facts regarding the total number of queries, fact tables, and concurrency requirements in the full 161-page TPC-DS specification might be of interest – new information, to some, that was not provided by Cloudera.

      Personally, I found the wording of the Cloudera blog posting on the scalability tests somewhat confusing. When I am wrong, I admit it – but why not just clearly say how many concurrent queries were running? And I do find it curious that only 6 of the 20 were used for the scalability tests – why not all 20?

      I am not trying to discredit anybody’s work. And it’s great that Impala is providing another alternative for SQL access to Hadoop. Surely you welcome some competition, though, right?

      By the way, I don’t suppose Cloudera is a member of the TPC. I am not so sure it is OK to “borrow data and queries from” TPC-DS and publicize the results. TPC policy 8.2.2 “unfair use” has a few clauses that Cloudera looks to me to have violated: “Use results, metrics, or terminology which are not based upon official Results, but which could be reasonably inferred to refer to the TPC or TPC workloads, or to be comparable to Results.”… or “Show a benchmark result derived from TPC Benchmark Standards in a manner that may cause the reader to believe that these non-Results are the equivalent or near-equivalent of Results.”

      • @Michael & @Glenn

        A few last comments/thoughts before I bow out of this discussion as the carbon footprint is getting a bit large for my liking…

        Why use DS data/queries and not some other data set & queries? My answer is the same reason that the significant majority of research papers use TPC data & queries (and even other folks in the Hadoop + SQL space) — it’s something that’s readily available. Nothing more, nothing less. That said, I have no doubt that whatever workload was chosen would have had some criticism about it because one cannot simply please everyone, hence exactly what was run was provided in the github repo for transparency. And to quote the last paragraph of the Cloudera blog, “we encourage you to do your own testing”. I think that communicates honesty and transparency. So Michael, I’m with you 110% if you recommend to your CHUG members to do the same. No benchmark demonstrates more value than the one run using your data and queries. The data points provided are not meant to be a substitution for one’s own diligence. You can quote me on that.

        WRT positioning the work as official TPC-DS or not: I personally have no desire to misrepresent the work done as “TPC official”, and believe that no one at Cloudera does either (sales or otherwise), hence all the citations about modifications etc. I also think the github repo README I wrote is unmistakably clear on this topic.

        I don’t know, as I’m an engineer, not a lawyer, but I’m of the belief that one or more individual query times, even if from an unmodified TPC data set or query, do not necessarily constitute “Results”, given the overwhelming number of research papers that seem to make similar comparisons without issue. Probably even some authored by IBM use TPC data/queries.

        It would seem the Cloudera data points likely fall under “8.3 Fair Use of TPC Specifications” specifically:
        8.3.2 All variations from the TPC specifications in question must be explicitly noted.
        8.3.3 Results based on the non-TPC benchmark must be clearly identified as not being comparable to an official TPC Result.
        I would say the blog and repo README cover both of these points, but I’m just an engineer.

        All of that said, if there is constructive feedback or questions and you leave a comment on the Cloudera blog post, I don’t see why a friendly Clouderan wouldn’t try to address it the best they could.

  7. “My main point was to take a closer look at Impala performance claims – don’t take things at face value.”

    Anyone who actually reads the original post will find it to be in violent agreement about this. Cloudera has openly published the queries and the data and software are both similarly available – by all means, anyone who is interested should run their own tests.

    • @Justin Kestelyn

      “Cloudera has openly published the queries and the data and software are both similarly available – by all means, anyone who is interested should run their own tests.”

      Let’s get back to reality for a second.

      Does your sales team tell the customer…
      ‘We ran the TPC-DS benchmarks and did X’ or will they be more honest and say ‘We ran a benchmark using the data from the TPC-DS benchmark, running a subset of the queries after some modifications and achieved X’

      Now, looking at the Press Release:
      ” Impala queries across data in an open Hadoop columnar storage format (Parquet) ran on average 2x faster than identical queries on a commercial analytic database management system (DBMS) over its proprietary storage format.”

      You can go on and read the rest of the PR… and if you take off your Cloudera colored glasses, the PR piece is very misleading. Definitely not Cloudera’s finest moment.

      Why didn’t you run the full benchmark test, with the full 6-fact-table joins, and then compare the results against the unnamed RDBMS-Y…

      • Michael,

        With all due respect, a PR is not an appropriate vehicle for details about methodology. That’s what the blog post is for, and we expect anyone following this subject to read it. Everything is in there.

        Your position appears to be, “Your tests were not valid, because you could have done them another way.” Perhaps we could have but we didn’t, and we were transparent about what we did do.

  8. The comment section in this blog is getting silly.

    I am truly vendor neutral and I support customers regardless of which Hadoop vendor they choose.

    The facts are simple.
    Cloudera took the data set from the TPC-DS and ran a specific subset of the queries as a way to showcase their product.

    But in doing so, they attempt to give the impression that Impala outperformed the unnamed RDBMS-Y by a factor of 2:1.

    Our blogger, who happens to work for a competing vendor, has taken the time to review the press release and to throw some cold water on it. As it turns out, some of what he says also has FUD mixed in with the truth.
    Clearly he wants to promote his product.

    The truth?
    Both vendors are less than 100% honest when it comes to telling the truth.
    You can’t cherry-pick a couple of queries and then claim victory, because to get meaning from the TPC-DS test you need to see the results from all of the tests. (Note: if Cloudera didn’t want you to associate their benchmark with the TPC-DS benchmark, why didn’t they just use some other widely available dataset?)

    Now if you’re like me and spent time working for a vendor during the early ’90s, this seems like déjà vu.
    And most of the potential customers also have people who know to take these PR pieces with a huge grain of kosher salt. Most likely they will do a PoC and run their own benchmarks with their own data, so that they can control their environment.

    Truthfully I am disappointed and I will probably say something at the start of our next CHUG meet up, because many of our members are from companies that are evaluating *all* vendors.

  9. Pingback: SQL on Hadoop – Meet Big SQL 3 | Sheffield View

  10. First off, I used to work for Cloudera and was the SME for Impala in EMEA, in the original batch of 3 global SMEs for Impala, so I know its strengths and weaknesses better than most out there in the world. But I’ve since left Cloudera, so I can say what I like, and here is my attempt at a balanced view:

    Impala is far from perfect. I myself raised many improvement requests for various aspects. That being said, it’s still extremely fast if you have the RAM for it and your queries fit in memory. It would probably be a lot better if it were in the Apache Foundation and everyone else were allowed to work on it… but that’s a sore point so I’ll leave that there.

    Now, having a more varied technical background than most and currently working as an independent Big Data specialist consultant in the field, where I eagerly consider all alternatives, I’ve been fortunate enough to have had the added experience of working with the other vendors, and am actually currently running a PoC at an investment bank on IBM BigInsights 2.1.x, currently 2.1.2.

    I have in the last few months raised 59 issues against IBM’s platform, some of which are ultra-mind-bogglingly severe and embarrassing (I wish there were a public Jira to post them on – your eyes would pop out), and which demonstrate the obvious and severe lack of Hadoop ecosystem expertise in IBM and the rush to market with products nowhere near the maturity of several of their open source counterparts.

    Issues include many bugs around security, QA, component integration, tooling, and poor performance.

    Now BigSQL specifically is more or less comparable to Hive in BigInsights 2.1.x. In fact it’s worse than Hive in several ways, such as higher-scale queries (which surprised us) and drivers (these should be replaced with DB2 drivers in the new 3.0 release, though I’ve not run that yet to comment on it), as well as some things that work OK in Hive but break in BigSQL. Our PoC users, who have no prior Big Data experience, even started using Hive instead of BigSQL after having so many problems/bugs with it and other IBM-specific components. As we’ve progressed through the PoC, users are dropping more and more IBM components in favour of their open source counterparts, which simply work better and are more widely battle-tested across more environments and more years.

    BigSQL 3.0 does sound exciting, and I am excited, MPP and all that… but we’ve been through this dance before… so I must temper my excitement… the proof is in the usage by an expert in the field who knows how these things should work. Based on experience to date, I am not holding my breath though. If IBM did invest heavily, I’m sure they could find enough good engineers globally to make BigSQL good, but then they risk cannibalizing their higher-margin offerings such as Netezza to play in the Big Data field that makes no profit for its vendors.

    Netezza, which I had in a previous company, was good although very expensive. Impala isn’t far off in my experience, and considering the price difference, Teradata and Netezza sales will be taking a beating in coming years; their growth is likely to stagnate, shifting almost entirely to maintenance rather than growth in usage. Even if they are slightly better today, are they X millions of dollars better?

    Also, Netezza was bought, not developed, by IBM; it was good when we had it, which was before IBM bought it. The cost, however, means that company has since dropped it.

  11. Hari, thanks for your comments. I can assure you that IBM strives to ensure BigInsights is rock solid, and is working on the issues you have raised at your POC.
    BigInsights 3.0 is now GA, and includes the new MPP version of Big SQL. The level of investment is very high – I hope you get a chance to try it out.

  12. Pingback: IBM Big SQL Benchmark vs. Cloudera Impala and Hortonworks Hive/Tez | Sheffield View
