A flexible, scalable, relational data mining language

Widely used data mining systems are expensive in...
...time (configuration, tuning, indexing, query run times)
...money (licensing, hardware, services)
...flexibility (must discard or index data to make queries tractable)
...or in some cases all of the above.
r17 is our response to these problems.


How does r17 help?

Run queries in reasonable time without indexing

r17 provides brute-force performance so you're free to explore the data without a separate indexing step and without reducing data resolution.

Scalable

r17 provides pipeline concurrency and two kinds of cross-machine parallel concurrency.

Query heterogeneous sources

Mix data from multiple sources without a separate import step. External applications can participate in queries using an easily parsed, efficient data format. Store data in r17's efficient binary format, or not, as you please.

Streaming queries

r17 allows SELECT, WHERE and JOIN queries of data streams in real time.

Simple and obedient

r17 is a single executable. No wrestling with complex & fragile installation or configuration. It's easy to learn and uses familiar idioms.


What does r17 look like?

Syntax

r17's syntax is a cross between UNIX shell and SQL.
ls | grep "fred"
is roughly equivalent to
io.directory.list(".") | rel.where(file_name = "fred");
or (since 1.4.2)
io.ls(".") | rel.where(file_name = "fred");
The | has the same meaning as in UNIX shell: the io.directory.list stream operator will execute concurrently with the rel.where operator.
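
To make the concurrency point concrete, here is a minimal Python sketch of two pipeline stages running at the same time and connected by a bounded queue. It is an illustration of the general technique only, not r17's implementation; the names list_names, where_name_equals and run_pipeline are hypothetical, and the exact-match filter mirrors the rel.where example above.

```python
import queue
import threading

_DONE = object()  # sentinel marking end of stream (an assumption of this sketch)


def list_names(names, out_q):
    """Producer stage: emit one record per name, then signal end of stream."""
    for name in names:
        out_q.put({"file_name": name})
    out_q.put(_DONE)


def where_name_equals(in_q, out_q, wanted):
    """Filter stage: forward only records whose file_name equals `wanted`."""
    while True:
        record = in_q.get()
        if record is _DONE:
            out_q.put(_DONE)
            return
        if record["file_name"] == wanted:
            out_q.put(record)


def run_pipeline(names, wanted):
    """Run both stages concurrently and collect the filtered records."""
    q1 = queue.Queue(maxsize=4)  # bounded, so the producer cannot race far ahead
    q2 = queue.Queue()
    stages = [
        threading.Thread(target=list_names, args=(names, q1)),
        threading.Thread(target=where_name_equals, args=(q1, q2, wanted)),
    ]
    for stage in stages:
        stage.start()
    results = []
    while True:
        record = q2.get()
        if record is _DONE:
            break
        results.append(record)
    for stage in stages:
        stage.join()
    return results
```

The filter stage starts consuming records as soon as the producer emits them, rather than waiting for the full listing, which is the property the pipe gives you in both UNIX shell and r17.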

SELECT username, COUNT(1) AS num FROM users GROUP BY username ORDER BY num;
is roughly equivalent to
io.file.read('users') | rel.select(username) | rel.group(count) | rel.order_by(_count);
The most interesting difference is that each r17 clause will execute concurrently.
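
For readers less familiar with SQL, the result that both queries compute can be sketched in a few lines of Python. This is only a reference for the expected output (one row per username with its count, ordered by count ascending); it says nothing about how r17 executes the query, and count_by_username is a hypothetical name.

```python
from collections import Counter


def count_by_username(rows):
    """Return (username, count) pairs ordered by count, smallest first."""
    counts = Counter(row["username"] for row in rows)
    # sorted() is stable, so usernames with equal counts keep first-seen order.
    return sorted(counts.items(), key=lambda item: item[1])
```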

And now for something that's quite difficult to do in UNIX shell or SQL alone:
sudo tail -f /var/log/apache2/access.log
| r17 'rel.from_text(
           "^([^ ]+?) [^ ]+? [^ ]+? \\[([^\\]]+?)\\]",
           "string:ip_address", "string:date")
         | rel.join.natural("interesting_ips.r17_native")
         | rel.to_tsv();'

This parses the IP address and date from each line of the Apache access log, joins the result against a list of interesting IP addresses and converts the output to TAB-separated-value format, all in real time.
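
The same parse, join and format steps can be sketched in Python to show what each stage contributes. The regular expression is the one from the r17 command above. Everything else is an assumption made for illustration: the function name join_interesting is hypothetical, the list of interesting IPs is modeled as a plain set, and lines that do not match the pattern are skipped, which may differ from what r17 does.

```python
import re

# Same pattern as the r17 example: capture the client IP and the bracketed date.
LOG_PATTERN = re.compile(r"^([^ ]+?) [^ ]+? [^ ]+? \[([^\]]+?)\]")


def join_interesting(lines, interesting_ips):
    """Yield 'ip<TAB>date' for each log line whose IP is in interesting_ips."""
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # assumption: silently skip lines that do not parse
        ip_address, date = match.groups()
        if ip_address in interesting_ips:
            yield f"{ip_address}\t{date}"
```

Because join_interesting is a generator, it emits each matching row as soon as its input line arrives, so it can be fed from a live stream such as sys.stdin in the same way the r17 command is fed from tail -f.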

Language features
  • Built-in concurrency, including cross-machine concurrency.
  • Strong type checking at stream-header-read time.
  • Complex data transformations including 'if/then':
    rel.select(
      (if (str.starts_with(name, "Johann") || (j_i > 0)) then (
        "Another Johann"
      ) else (
        "No Johann"
      )) as johann_nature);
  • rel.select can refer to the output of the previous record transformation, allowing accumulator-like operations.
  • Perl-compatible regular expressions.
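
As a rough illustration of two of the features above, the Python sketch below restates the 'if/then' transformation and shows an accumulator-like operation in which each output depends on the previous one. The function names are hypothetical, j_i is treated here as an ordinary numeric field (the r17 example does not say where it comes from), and the running total is just one example of an accumulator, not a documented r17 idiom.

```python
def johann_nature(name, j_i):
    """Mirror of the if/then expression: classify a record by name or j_i."""
    if name.startswith("Johann") or j_i > 0:
        return "Another Johann"
    return "No Johann"


def running_total(values):
    """Accumulator-like transformation: each output builds on the previous one."""
    total = 0
    for value in values:
        total += value
        yield total
```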

System requirements

Supported on 32- and 64-bit Linux and Mac OS X. Other UNIX-like platforms available on request. r17 is a single 1MB-ish executable dependent only on the lowest-level OS-supplied libraries.


Thanks to Dave Gamache for the Skeleton template.