Matt Nourse, May 15 2012
Back in March the friendly and helpful Pangool developers posted some interesting benchmark results for Hadoop-based tools. There's lots of talk elsewhere about Hadoop scalability but we hear very little about performance so I was excited to learn more about relative performance of these tools. I'd love to see the performance conversation continue. I'm also interested in these benchmarks because for some time I've wanted to compare r17 with Hadoop. These bencharks help because they were run by someone else using a well-known Hadoop distro (so I can't accidentally/subconsciously misconfigure Hadoop) on hardware that I can easily get my hands on (Amazon EC2).
I copied the relevant graphs from the Pangool benchmark results and superimposed my r17 results below. I did 3 runs of each test and averaged the results, though the maximum variation between individual test runs was less than 10%. I only tested the largest-size variant of each test because this is both the most interesting and the least favourable towards r17. R17 ran on only one m1.large machine whereas the other tools in the benchmarks ran on four m1.large slaves plus a master for a total of 5 machines. R17 can scale most operations across machines but its standout feature is the performance it can eke out of a single machine without configuration headaches.
This test measures join performance. All times are in seconds, smaller is better.
This test measures sort performance. All times are in seconds, smaller is better.
This test measures basic framework overhead and text processing performance. All times are in seconds, smaller is better.
This test measures the total number of lines of code for all 3 tests, including comments. Smaller is better.
Thanks to the Pangool team for their super-responsive support. I had a few questions about their benchmarks and each time I received helpful answers very quickly.
The Hadoop & friends tests were run by the Pangool team. They describe their setup on their benchmark page. Below is the r17 setup.
R17 read test data from EBS and wrote the results back to EBS. The 'Secondary sort' test makes heavy use of /tmp which was mounted on instance storage. I increased the size of /tmp storage to 20GB to accommodate r17's temporary files.
# Get the url-map file in r17 native format so we can join with it.
| rel.from_text("([^\t]+?)\t(.*)", "string:url", "string:canonical_url")
# Do the join and convert the output to TSV.
| rel.from_tsv("string:url", "uint:timestamp", "uint:ip")
| rel.select(canonical_url, timestamp, ip)
"int:department_id", "string:name_id", "int:timestamp", "double:sale_value")
| rel.order_by(department_id, name_id, timestamp)
| rel.select(department_id, name_id, sale_value)
| rel.group(sum sale_value)
| rel.order_by(department_id, name_id)
| rel.str_split(words, " ")
| rel.select(words as word)