Bakeoff part 1: MySQL, PostgreSQL and r17 on a spinning disk

Matt Nourse, August 7 2011

The graph that speaks a thousand words:

[Chart: total load + index + query time. All times are in minutes; less is better.]


Now for the longer story...

Scenario

A web service receives 200 million page views per day from 1 million daily active users. The web service has lots of user-generated content so the page views are spread across 20 million unique URLs. The web application logs all the normal web server log info plus user names. It logs about 54 GB per day. We don't know what we don't know about the data so we need to make some exploratory queries on a single day's worth of data. For starters we want to know:

  • Who are the most active users?
  • What are the 10 most visited URLs by the 10 most active users?
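A quick back-of-envelope check of the scenario's numbers (a sketch; the per-record size is derived from the stated figures, not measured):

```python
# Sanity-check the scenario's stated volumes.
page_views_per_day = 200_000_000
daily_log_bytes = 54 * 10**9  # ~54 GB/day (decimal GB assumed)
daily_active_users = 1_000_000

bytes_per_record = daily_log_bytes / page_views_per_day
print(f"~{bytes_per_record:.0f} bytes per log record")  # ~270 bytes

views_per_user = page_views_per_day / daily_active_users
print(f"~{views_per_user:.0f} page views per user per day")  # ~200
```

So each log record averages roughly 270 bytes, which is plausible for a web server log line carrying a URL, referrer, user agent and user name.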

In this post I'll compare MySQL, PostgreSQL and r17 performance on a single machine with a humble spinning disk. Unfortunately licensing restrictions prohibit me from publishing my Oracle bakeoff results. In future posts I'll run the bakeoff on an OCZ Vertex 3 SSD, then against Hadoop on both spinning disk and SSD.

r17 has several modes of operation which trade off setup time for querying time. I'll use these modes for the bakeoff:

  • Direct-from-TSV, pipeline concurrency: no import time, but queries pay the added cost of translating from TAB-separated-value format.
  • r17 native format, pipeline concurrency: some import time and the simplest possible query, but limited use of available machine resources during the query.
  • r17 native format, parallel concurrency: some import time and a more complicated query, but the best use of available machine resources.

Import

For PostgreSQL and MySQL, this step consists of loading the data into the database before creating additional indexes. For r17 it translates the TAB-separated-value file into r17 native format, so for the direct-from-TSV r17 mode this step is instantaneous. The database schema and import scripts are in the technical details section near the end of this post.

[Chart: import times. All times are in minutes; less is better.]
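To make the input concrete, here's a minimal sketch of parsing one TAB-separated log record into the schema's columns (the sample line and its field values are made up; only the column order comes from the CREATE TABLE statement below):

```python
import csv
import io

# One synthetic log record, fields in the column order of the
# web_bakeoff_data table. The values themselves are invented.
line = ("1\t203.0.113.9\t1312675200\tGET\t/posts/42\t200\t5120\t"
        "-\tMozilla/5.0\talice\n")

columns = ["record_id", "ip_address", "epoch_sec", "method", "url",
           "http_status", "byte_size", "referrer", "user_agent", "username"]

reader = csv.reader(io.StringIO(line), delimiter="\t")
record = dict(zip(columns, next(reader)))
print(record["username"], record["url"])  # alice /posts/42
```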

Indexing

For PostgreSQL and MySQL, this consists of creating the indexes that we'd expect to use for querying this data set. r17 does not index so this step is instantaneous for all r17 modes. The SQL is below with the other technical details.

Note that MySQL did not finish indexing within 86 hours. I gave up on MySQL so I could continue running the rest of the tests.

[Chart: indexing times. All times are in minutes; less is better.]

Pre-query preparation

I ran a VACUUM FULL ANALYZE command against the PostgreSQL database. It completed in less than 10 seconds.

Query 1: find most active users

This query orders all users by the number of times they appear in the logs. Full SQL and r17 scripts are below.

[Chart: query 1 times. All times are in minutes; less is better.]

Query 2: find the 10 most popular URLs amongst the 10 most active users

This query sorts all users by the number of times they appear in the log, takes the 10 most active users, joins that list back to the log, then counts and ranks the URLs in the joined records. Full SQL and r17 scripts are below.
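The shape of this two-step query can be sketched in miniature (toy data, top 2 instead of top 10):

```python
from collections import Counter

# Toy log: (username, url) pairs standing in for the full records.
log = [("alice", "/a"), ("alice", "/b"), ("alice", "/a"),
       ("bob", "/a"), ("carol", "/c")]

# Step 1: find the most active users (top 2 here instead of top 10).
user_counts = Counter(u for u, _ in log)
top_users = {u for u, _ in user_counts.most_common(2)}

# Step 2: join back to the log and rank URLs by hits from those users.
url_counts = Counter(url for u, url in log if u in top_users)
print(url_counts.most_common(2))  # [('/a', 3), ('/b', 1)]
```

The SQL and r17 versions below do the same thing: materialize the top-user list first, then join it against the full log.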

[Chart: query 2 times. All times are in minutes; less is better.]

Wrap up

r17 clearly passes the "interesting performance" test. In future posts I'll flesh out more of the performance picture by adding Hadoop and an SSD into the mix.

r17 is currently a free download. I'd love to hear about your experiences with it.

Technical details

System configuration

Quad-core Intel i7 930 2.8GHz
12 GB DDR3 RAM
Samsung HD154UI 1.5TB (5400 RPM) disk
Ubuntu 10.04 LTS
MySQL 5.1.41
PostgreSQL 9.0.4

Database schema

PostgreSQL and MySQL CREATE TABLE statement:
CREATE TABLE web_bakeoff_data(
  record_id BIGINT NOT NULL PRIMARY KEY,
  ip_address varchar(32) NOT NULL,
  epoch_sec BIGINT NOT NULL,
  method varchar(16) NOT NULL,
  url varchar(1024) NOT NULL,
  http_status int NOT NULL,
  byte_size int NOT NULL,
  referrer varchar(1024) NOT NULL,
  user_agent varchar(2048) NOT NULL,
  username varchar(64) NOT NULL
);

MySQL was set to use the MyISAM storage engine by default.

Import SQL and r17 scripts

PostgreSQL import SQL:
COPY web_bakeoff_data
FROM 'web_bakeoff_data.tsv'
WITH DELIMITER E'\t' CSV HEADER;

MySQL import SQL:
LOAD DATA LOCAL INFILE 'web_bakeoff_data.tsv'
INTO TABLE web_bakeoff_data;

r17 pipeline-concurrency import:
io.file.read("web_bakeoff_data.tsv.gz")
| rel.from_tsv()
| io.file.overwrite("web_bakeoff_data.r17_native.gz");

r17 parallel-concurrency import and test setup:
# Split the input file into fragments.
io.file.read('web_bakeoff_data.tsv.gz')
| rel.from_tsv()
| rel.record_split(1000000, "tmp.web_fragment.");

# Make the mapping from hosts to files.
# For this test, participating_hosts.tsv just contains "localhost".
io.file.read("participating_hosts.tsv")
| rel.from_tsv()
| io.file.overwrite("participating_hosts.r17_native");

io.directory.list(".")
| rel.select(file_name)
| rel.where(str.starts_with(file_name, "tmp.web_fragment."))
| rel.join.consistent_hash("participating_hosts.r17_native")
| io.file.overwrite("host_file_mapping.r17_native");

# Do the distribution. Skip transferring files to localhost
# so this does nothing for this single-machine test.
io.file.read("host_file_mapping.r17_native")
| rel.where(host_name != "localhost")
| rel.select(meta.shell("rsync -av " + file_name + " " + host_name + ":~")
             as rsync_output)
| io.file.append("/dev/null");
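The consistent-hash join above assigns each fragment file to a host. A minimal Python sketch of the idea (the internals of rel.join.consistent_hash are an assumption; this only illustrates a stable fragment-to-host mapping on a hash ring):

```python
import hashlib

def host_for(file_name, hosts):
    """Map a fragment file to a host via a simple hash ring."""
    # Place each host at a point on the ring.
    ring = sorted((int(hashlib.md5(h.encode()).hexdigest(), 16), h)
                  for h in hosts)
    # A fragment goes to the first host point at or after its own hash.
    key = int(hashlib.md5(file_name.encode()).hexdigest(), 16)
    for point, host in ring:
        if key <= point:
            return host
    return ring[0][1]  # wrap around the ring

# With a single participating host, every fragment maps to it.
print(host_for("tmp.web_fragment.0", ["localhost"]))  # localhost
```

The point of the consistent hash is that the mapping is deterministic, so re-running the distribution step sends each fragment to the same host.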

PostgreSQL and MySQL indexing SQL:
CREATE INDEX web_bakeoff_data_epoch_sec_idx ON web_bakeoff_data(epoch_sec);
CREATE INDEX web_bakeoff_data_url_idx ON web_bakeoff_data(url);
CREATE INDEX web_bakeoff_data_username_idx ON web_bakeoff_data(username);
CREATE INDEX web_bakeoff_data_ip_address_idx ON web_bakeoff_data(ip_address);

SQL & r17 script for query 1: find most active users

PostgreSQL:
SELECT username, count(1) AS num
FROM web_bakeoff_data
GROUP BY username ORDER BY num DESC;

r17 direct from TSV:
io.file.read('web_bakeoff_data.tsv.gz')
| rel.from_tsv()
| rel.select(username)
| rel.group(count)
| rel.order_by(_count)
| rel.to_tsv()
| io.file.overwrite('/tmp/out.tsv');

r17 pipeline concurrency:
io.file.read('web_bakeoff_data.r17_native.gz')
| rel.select(username)
| rel.group(count)
| rel.order_by(_count)
| rel.to_tsv()
| io.file.overwrite('/tmp/out.tsv');

r17 parallel concurrency:
io.file.read("host_file_mapping.r17_native")
| meta.parallel_explicit_mapping(rel.select(username))
| rel.group(count)
| rel.order_by(_count)
| rel.to_tsv()
| io.file.overwrite('/tmp/out.tsv');
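The parallel-concurrency script follows a map-then-merge shape: count usernames in each fragment independently, then merge the partial counts. A minimal Python sketch of that shape (not of r17's internals; the fragment contents are made up):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Each inner list stands in for one tmp.web_fragment.* file.
fragments = [["alice", "bob", "alice"], ["bob", "bob"], ["carol"]]

# Map: count each fragment independently, in parallel.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(Counter, fragments))

# Merge: combine the per-fragment counts into one ranking.
total = sum(partials, Counter())
print(total.most_common())  # [('bob', 3), ('alice', 2), ('carol', 1)]
```

Because counting is associative, the merge step is cheap relative to scanning the fragments, which is what lets the mode use all available cores (or hosts).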

SQL & r17 script for query 2: 10 most popular URLs amongst the 10 most active users

PostgreSQL:
CREATE TEMPORARY TABLE top_10_users(
  username varchar(64) NOT NULL PRIMARY KEY,
  num bigint NOT NULL
);

INSERT INTO top_10_users
SELECT username, count(1) AS num
FROM web_bakeoff_data GROUP BY username
ORDER BY num DESC LIMIT 10;

SELECT url, num
FROM (SELECT url, count(1) as num
      FROM web_bakeoff_data
      JOIN top_10_users ON web_bakeoff_data.username = top_10_users.username
      GROUP BY url) AS tmp_urls_of_top_10_users
ORDER BY num DESC LIMIT 10;

r17 direct from TSV:
io.file.read('web_bakeoff_data.tsv.gz')
| rel.from_tsv()
| rel.select(username)
| rel.group(count)
| rel.order_by(_count)
| rel.select(username)
| rel.where(_rownum <= 10U)
| io.file.overwrite("/tmp/top_10_users.r17_native");

io.file.read('web_bakeoff_data.tsv.gz')
| rel.from_tsv()
| rel.select(url, username)
| rel.join.natural("/tmp/top_10_users.r17_native")
| rel.select(url)
| rel.group(count)
| rel.order_by(_count)
| rel.where(_rownum <= 10U)
| rel.to_tsv();

r17 pipeline concurrency:
io.file.read('web_bakeoff_data.r17_native.gz')
| rel.select(username)
| rel.group(count)
| rel.order_by(_count)
| rel.select(username)
| rel.where(_rownum <= 10U)
| io.file.overwrite("/tmp/top_10_users.r17_native");

io.file.read('web_bakeoff_data.r17_native.gz')
| rel.select(url, username)
| rel.join.natural("/tmp/top_10_users.r17_native")
| rel.select(url)
| rel.group(count)
| rel.order_by(_count)
| rel.where(_rownum <= 10U)
| rel.to_tsv();

r17 parallel concurrency:
io.file.read("host_file_mapping.r17_native")
| meta.parallel_explicit_mapping(rel.select(username))
| rel.group(count)
| rel.order_by(_count)
| rel.select(username)
| rel.where(_rownum <= 10U)
| io.file.overwrite("/tmp/top_10_users.r17_native");

io.file.read("host_file_mapping.r17_native")
| meta.parallel_explicit_mapping(
  rel.select(url, username)
  | rel.join.natural("/tmp/top_10_users.r17_native")
  | rel.select(url))
| rel.group(count)
| rel.order_by(_count)
| rel.where(_rownum <= 10U)
| rel.to_tsv();
