r17 - flexible, scalable, relational data mining language

Documentation

Below is the semi-automatically-generated documentation for r17. It's generated using this command line:
$ r17 help.markdown | markdown > [file name].html

r17 v2.1.0

R17 is a language for heterogeneous data querying and manipulation.

It's a syntactic cross between UNIX shell scripting and SQL. Data flows along a pipeline of stream operators just like it does in UNIX pipelines. Instead of UNIX's bias towards line-oriented records, most r17 operators work on a structured relational data stream optimized for parsing performance.

Examples

The examples below assume this TAB-separated data is stored in test.tsv:

string:name int:birth_year
Johann Sebastian Bach 1685
Clamor Heinrich Abel 1634
Johann Georg Ahle 1651
Johann Michael Bach 1648

Example 1: sorting

$ r17 'io.file.read("test.tsv") | rel.from_tsv() | rel.order_by(birth_year) | rel.to_tsv();'

string:name int:birth_year
Clamor Heinrich Abel 1634
Johann Michael Bach 1648
Johann Georg Ahle 1651
Johann Sebastian Bach 1685

Example 2: projection, restriction, regular expression replacement

$ r17 '
io.file.read("test.tsv")
| rel.from_tsv()
| rel.select(str.regex_replace("([a-zA-Z]+)$", name, "\1") as last_name)
| rel.where(str.starts_with(last_name, "A"))
| rel.to_tsv();'

string:last_name
Abel
Ahle

Usage

r17 stream_op expression
OR
r17 script_file_name ['script_arguments']
OR
r17 'inline_script' ['script_arguments']
where stream_op is one of:
rel.group
rel.join.natural
rel.join.left
rel.join.anti
rel.join.consistent_hash
rel.order_by
rel.order_by.desc
rel.order_by.mergesort
rel.order_by.mergesort.desc
rel.order_by.quicksort
rel.order_by.quicksort.desc
rel.select
rel.record_count
rel.record_split
rel.where
rel.unique
rel.str_split
rel.assert.empty
rel.assert.nonempty
rel.from_tsv
rel.to_tsv
rel.from_csv
rel.to_csv
rel.from_usv
rel.to_usv
rel.from_text
rel.from_text_ignore_non_matching
rel.generate_sequence
text.utf16_to_utf8
text.strip_cr
io.file.read
io.file.append
io.file.overwrite
io.directory.list
io.ls
io.directory.list_recurse
io.ls_r
lang.python
lang.R
meta.script
meta.remote
meta.shell
meta.parallel_explicit_mapping
help.markdown
help.version
...or a programmer-defined compound stream operator (see "Compound stream operators" below)

Builtin stream operators

rel.group
input: r17 native record format
output: r17 native record format
since: 1.0
rel.group(count) is equivalent to SQL's SELECT COUNT(1) FROM ... GROUP BY .... The new heading is called int:_count.
rel.group(avg header_name) is equivalent to SQL's SELECT AVG(header_name) FROM ... GROUP BY .... The new heading is called int:_avg.
rel.group(min header_name) is equivalent to SQL's SELECT MIN(header_name) FROM ... GROUP BY .... No new column is created; the header_name column holds the minimum value.
rel.group(max header_name) is equivalent to SQL's SELECT MAX(header_name) FROM ... GROUP BY .... No new column is created; the header_name column holds the maximum value.
rel.group(sum header_name) is equivalent to SQL's SELECT SUM(header_name) FROM ... GROUP BY .... The new heading is called int:_sum.
rel.group(median header_name) finds the median in the same way that rel.group(avg header_name) finds the average. The new heading is called int:_median. The median aggregator is only in r17 1.8.0 and later.

rel.join.natural
input: r17 native record format
output: r17 native record format
since: 1.0
rel.join.natural('other_file_name') joins the input to other_file_name.

rel.join.left
input: r17 native record format
output: r17 native record format
since: 1.0
rel.join.left('other_file_name') left-joins the input to other_file_name.

rel.join.anti
input: r17 native record format
output: r17 native record format
since: 1.0
rel.join.anti('other_file_name') antijoins the input to other_file_name.

rel.join.consistent_hash
input: r17 native record format
output: r17 native record format
since: 1.0
rel.join.consistent_hash('other_file_name') joins the input to other_file_name using a consistent hash (http://en.wikipedia.org/wiki/Consistent_hashing). This operator will refuse to join two streams with common header names. Streams may contain duplicate records. The more times that a record appears in a stream, the more likely it is to be matched & included in the join.

rel.order_by
input: r17 native record format
output: r17 native record format
since: 1.0
rel.order_by(a, b, c) will sort by heading a then b, then c using the default sort strategy: a stable merge sort. Currently the size of the sort input is limited to the available virtual memory minus 30 bytes per record overhead.
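
Stability matters when sorting by several headings. As a rough illustration only (Python, not r17 code), the sketch below sorts the test.tsv records from the Examples section. The records list and the order_by helper are names made up for this example, and Python's built-in sort stands in for r17's merge sort because it is also stable.

```python
from operator import itemgetter

# Records from test.tsv in the Examples section, as (name, birth_year) pairs.
records = [
    ("Johann Sebastian Bach", 1685),
    ("Clamor Heinrich Abel", 1634),
    ("Johann Georg Ahle", 1651),
    ("Johann Michael Bach", 1648),
]

def order_by(rows, *column_indexes):
    """Sort rows by the given columns, first column most significant.

    sorted() is stable, so rows that compare equal on every listed
    column keep their original relative order.
    """
    return sorted(rows, key=itemgetter(*column_indexes))

by_year = order_by(records, 1)
```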

rel.order_by.desc
input: r17 native record format
output: r17 native record format
since: 1.0
rel.order_by.desc(a, b, c) will sort in the opposite order to rel.order_by(a, b, c).

rel.order_by.mergesort
input: r17 native record format
output: r17 native record format
since: 1.0
rel.order_by.mergesort(a, b, c) will sort by heading a then b, then c using a stable merge sort. Currently the size of the sort input is limited to the available virtual memory minus 30 bytes per record overhead.

rel.order_by.mergesort.desc
input: r17 native record format
output: r17 native record format
since: 1.0
rel.order_by.mergesort.desc(a, b, c) will sort in the opposite order to rel.order_by.mergesort(a, b, c).

rel.order_by.quicksort
input: r17 native record format
output: r17 native record format
since: 1.0
rel.order_by.quicksort(a, b, c) will sort by heading a then b, then c using a stable quicksort variant. The quicksort variant has the Sedgewick optimizations but still has the same poor worst-case running time as other quicksorts. Currently the size of the sort input is limited to the available virtual memory minus 30 bytes per record overhead.

rel.order_by.quicksort.desc
input: r17 native record format
output: r17 native record format
since: 1.0
rel.order_by.quicksort.desc(a, b, c) will sort in the opposite order to rel.order_by.quicksort(a, b, c).

rel.select
input: r17 native record format
output: r17 native record format
since: 1.0
rel.select(expr1 as [type:]header1, expr2 as [type:]header2, ...) will transform incoming records as specified by the expression(s). Approximately equivalent to SQL's SELECT. The prev. prefix may be used to refer to a field from the transformed previous record. If the header name is prefixed with a type name then the data will be coerced to that type without any extra type checking.

rel.record_count
input: r17 native record format
output: r17 native record format
since: 1.0
rel.record_count() will count the number of incoming records. NOTE that this operator writes an unadorned decimal number to the output stream.

rel.record_split
input: r17 native record format
output: r17 native record format
since: 1.0
rel.record_split(N, file_name_stub) will split the incoming records into files of at most N records with names starting with file_name_stub. It will gzip the files. It does not write to the output stream.

rel.where
input: r17 native record format
output: r17 native record format
since: 1.0
rel.where(expression) will include records in the output if expression returns true. Approximately equivalent to SQL's WHERE.

rel.unique
input: r17 native record format
output: r17 native record format
since: 1.0
rel.unique() includes only a single copy of duplicate records in the output stream. Approximately equivalent to SQL's DISTINCT clause.

rel.str_split
input: r17 native record format
output: r17 native record format
since: 1.0
rel.str_split(header_name, 'regex') splits the string in header_name using the regex regular expression. It creates a new uint:_counter column to count the newly-split string components.

rel.assert.empty
input: Any
output: None
since: 1.0
rel.assert.empty() prints an error message and exits if the input stream is not empty. In 1.4.3+, rel.assert.empty does not pass through any data or headers.

rel.assert.nonempty
input: Any
output: Any
since: 1.0
rel.assert.nonempty() prints an error message and exits if the input stream is empty. It passes through any incoming data.

rel.from_tsv
input: TAB-separated value format with typed headings
output: r17 native record format
since: 1.0
rel.from_tsv() translates the input stream from TAB-separated-value format to native record format. In 1.4.3 and earlier, the input stream must have headings. In 1.2.2+, if a heading has no type tag then a type of 'string' is assumed. In 1.4.3+, non-alphanumeric characters are replaced with an _ character. In 1.4.4+, rel.from_tsv("heading_name_1", "heading_name_2") will ignore any heading names in the input stream and use the supplied heading names instead. This allows parsing of input streams that have no heading names.
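
The heading rules above can be sketched in Python. This is an illustration, not r17's parser: parse_heading and parse_heading_line are hypothetical helpers, and reading "non-alphanumeric" as "anything outside A-Z, a-z and 0-9" is an assumption.

```python
import re

def parse_heading(raw_heading):
    """Split 'type:name' into (type, name), defaulting the type to 'string'."""
    type_tag, separator, name = raw_heading.partition(":")
    if not separator:
        # No type tag present: the whole heading is the name.
        type_tag, name = "string", raw_heading
    # Assumption: every character outside A-Z, a-z, 0-9 becomes "_".
    name = re.sub(r"[^A-Za-z0-9]", "_", name)
    return type_tag, name

def parse_heading_line(line):
    """Parse one TAB-separated heading line into a list of (type, name)."""
    return [parse_heading(h) for h in line.rstrip("\n").split("\t")]
```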

rel.to_tsv
input: r17 native record format
output: TAB-separated value format with typed headings
since: 1.0
rel.to_tsv() translates the input stream from native record format to TAB-separated-value format.

rel.from_csv
input: Comma-separated value format with typed headings
output: r17 native record format
since: 1.6.0
rel.from_csv() translates the input stream from comma-separated-value format to native record format. If no arguments are supplied then the input stream must have headings. If a heading has no type tag then a type of 'string' is assumed. Non-alphanumeric characters are replaced with an _ character. rel.from_csv("heading_name_1", "heading_name_2") will ignore any heading names in the input stream and use the supplied heading names instead. This allows parsing of input streams that have no heading names. Values that contain commas must be enclosed in double quotes ("s). Escape " characters with a second " character, like this: "". Use \n to represent newlines.
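
The quoting rules are the common CSV ones, so Python's standard csv module can illustrate them. The sample text below is made up for this sketch, and nothing here says r17's parser behaves identically to Python's in every corner case.

```python
import csv
import io

# One heading line and one record. The first value contains a comma, so it
# is enclosed in double quotes. The second value contains double quotes,
# each written as two double quotes.
csv_text = (
    'string:name,string:note\n'
    '"Bach, Johann Sebastian","said ""hello"""\n'
)

rows = list(csv.reader(io.StringIO(csv_text)))
```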

rel.to_csv
input: r17 native record format
output: Comma-separated value format with typed headings
since: 1.0
rel.to_csv() translates the input stream from native record format to comma-separated-value format.

rel.from_usv
input: Unit-separated value format with typed headings
output: r17 native record format
since: 1.0
rel.from_usv() translates the input stream from "unit-separated-value" format to native record format. Fields are separated by the US character (ASCII 31). Records are separated by the NUL character (ASCII 0). This format is designed to be faster to parse than TSV because the NUL character can't appear in UTF-8 strings and the US character is extremely unlikely to be used anywhere.
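
A minimal Python sketch of the framing described above. parse_usv and to_usv are hypothetical helpers; whether the final record is followed by a NUL is not stated here, so the sketch simply skips empty chunks.

```python
US = b"\x1f"   # unit separator, ASCII 31: separates fields
NUL = b"\x00"  # ASCII 0: separates records

def parse_usv(data):
    """Split a USV byte string into a list of records (lists of str)."""
    records = []
    for raw_record in data.split(NUL):
        if not raw_record:
            # Assumption: an empty chunk (e.g. after a trailing NUL)
            # is not a record.
            continue
        records.append([f.decode("utf-8") for f in raw_record.split(US)])
    return records

def to_usv(records):
    """Inverse of parse_usv, writing a NUL after every record."""
    return b"".join(
        US.join(f.encode("utf-8") for f in rec) + NUL for rec in records
    )
```

Because neither separator needs escaping, a parser only has to split on two byte values.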

rel.to_usv
input: r17 native record format
output: Unit-separated value format with typed headings
since: 1.0
rel.to_usv() translates the input stream from native record format to unit-separated-value format.

rel.from_text
input: UTF-8 text
output: r17 native record format
since: 1.0
rel.from_text(regular_expression, heading1, ...headingN) translates the input stream from newline-separated 'rows' to native record format. Lines are divided into records using captures from the regular expression. If the input does not end with a newline then the last line will be ignored. Non-matching lines will generate an error.
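
The behaviour described above can be approximated in Python as follows. This is a sketch under assumptions: from_text here is a hypothetical Python function, Python's re module stands in for r17's regular expression engine, and the pattern is searched for anywhere in the line.

```python
import re

def from_text(text, pattern, *headings):
    """Turn newline-terminated lines into dict records using regex captures."""
    regex = re.compile(pattern)
    # The last element of the split is either "" (text ended with a newline)
    # or an unterminated final line. Both are dropped.
    lines = text.split("\n")[:-1]
    records = []
    for line in lines:
        match = regex.search(line)
        if match is None:
            raise ValueError("line does not match: " + repr(line))
        # Pair each heading with the corresponding capture group.
        records.append(dict(zip(headings, match.groups())))
    return records
```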

rel.from_text_ignore_non_matching
input: UTF-8 text
output: r17 native record format
since: 1.2.0
rel.from_text_ignore_non_matching(regular_expression, heading1, ...headingN) is equivalent to rel.from_text except that non-matching lines are ignored rather than generating an error.

rel.generate_sequence
input: None
output: r17 native record format
since: 1.0
rel.generate_sequence(start, end) generates a sequence of integers starting at start and ending at end-1 under the heading int:_seq. The input stream is ignored.
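
The end value is exclusive, which matches Python's range. A small sketch, with dict records standing in for r17 records:

```python
def generate_sequence(start, end):
    """Yield one record per integer from start up to, but excluding, end."""
    for value in range(start, end):
        yield {"_seq": value}
```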

text.utf16_to_utf8
input: UTF-16 text
output: UTF-8 text
since: 1.4.0
text.utf16_to_utf8() translates the UTF-16 input stream into UTF-8 so that it's suitable for reading by other r17 stream operators. The UTF-16 stream must start with a Byte Order Mark: 0xfffe for little-endian, 0xfeff for big-endian.
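
A Python sketch of the Byte Order Mark check. The function name mirrors the operator for readability only; it is not r17's code. The BOM is read as the first two bytes of the stream: FF FE for little-endian, FE FF for big-endian.

```python
import codecs

def utf16_to_utf8(data):
    """Convert BOM-prefixed UTF-16 bytes to UTF-8 bytes (without a BOM)."""
    if data.startswith(codecs.BOM_UTF16_LE):    # bytes FF FE
        text = data[2:].decode("utf-16-le")
    elif data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        text = data[2:].decode("utf-16-be")
    else:
        raise ValueError("input does not start with a UTF-16 byte order mark")
    return text.encode("utf-8")
```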

text.strip_cr
input: UTF-8 text
output: UTF-8 text
since: 1.4.0
text.strip_cr() copies the input stream to the output stream, omitting CR (ASCII 13) characters. It's useful for transforming text from Windows line-endings to the Unix line-endings expected by other r17 text operators.

io.file.read
input: None
output: Any
since: 1.0
io.file.read(file_name) reads file_name and writes it to stdout. If the file is gzipped, io.file.read will ungzip it before writing to stdout. In 1.4.0 and later, io.file.read accepts multiple file_name arguments. Each file is read in the same order as it appears in the argument list. io.file.read sniffs the contents of the files. If the first file is an r17 native file then all files must be r17 native files with the same headers, and io.file.read will omit all headers from the second and subsequent files. To read a file once per input record, use the io.file.read function.

io.file.append
input: Any
output: None
since: 1.0
io.file.append(file_name) reads input and appends it all to file_name.

io.file.overwrite
input: Any
output: None
since: 1.0
io.file.overwrite(file_name) reads input and writes it all to file_name, overwriting the file.

io.directory.list
input: None
output: r17 native record format
since: 1.0
io.directory.list(directory_name) lists all files and subdirectories in the supplied directory_name. The output stream contains these headings: string:directory_name, string:file_name, string:relative_path, uint:size_bytes, uint:mtime_usec, bool:is_directory. The input stream is ignored.

io.ls
input: None
output: r17 native record format
since: 1.4.2
Synonym for io.directory.list.

io.directory.list_recurse
input: None
output: r17 native record format
since: 1.0
io.directory.list_recurse(directory_name) lists all files and subdirectories as does io.directory.list, recursing into subdirectories.

io.ls_r
input: None
output: r17 native record format
since: 1.4.2
Synonym for io.directory.list_recurse.

lang.python
input: r17 native record format
output: r17 native record format
since: 1.7.0
lang.python(@@@ python @@@) executes Python code using the python3 interpreter in the shell's path. The Python script's standard input is the input stream. The Python script's standard output is the output stream. R17 prepends helper Python code to the Python script before passing it to the system's Python interpreter. R17's helper code supplies 2 stream-like global variables: r17InputStream and r17OutputStream. Each call to r17InputStream.next() returns an object with member variables of the same type and name as the r17 input columns. r17OutputStream.write(v) writes an r17 record row with column names and types inferred from v where v is an object or dictionary.
The simplest possible example is copying the input stream to the output stream:

lang.python(@@@  
for inputR in r17InputStream:  
    r17OutputStream.write(inputR)  
@@@);

Note that r17 does not alter whitespace in the Python script because indentation is so important in Python. So you need to indent inline Python code as if the Python code was in its own file.

This example assumes that the input stream contains a value column that is some kind of number:

lang.python(@@@  
for inputR in r17InputStream:  
    r17OutputStream.write({'value': inputR.value + 1})  
@@@);

Below is almost all the Python code that r17 prepends to the inline script before passing to Python.

import sys  
import csv  

class R17StreamDefinition:  
    @staticmethod  
    def delimiter():  
        return '\t'  

    @staticmethod  
    def escapeChar():  
        return '\\'  

    @staticmethod  
    def lineTerminator():  
        return '\n'  


class R17InputStream:  
    def __init__(self):  
        self.inCsvReader = csv.reader(sys.stdin, delimiter=R17StreamDefinition.delimiter(), quotechar=None,  
                                       escapechar=R17StreamDefinition.escapeChar(),  
                                       lineterminator=R17StreamDefinition.lineTerminator())  

    def __iter__(self):  
        return self  

    def __next__(self):  
        return R17InputRecord(self.inCsvReader.__next__())  

    def next(self):  
        return __next__(self)  

r17InputStream = R17InputStream()  


class R17OutputStream:  
    def __init__(self):  
        self.outCsvWriter = csv.writer(sys.stdout, delimiter=R17StreamDefinition.delimiter(), quotechar=None,  
                                       escapechar=R17StreamDefinition.escapeChar(),  
                                       lineterminator=R17StreamDefinition.lineTerminator())  
        self.headerWritten = False  

    def write(self, outputDict):  
        if not type(outputDict) is dict:  
            outputDict = outputDict.__dict__  

        if (not self.headerWritten):  
            self.headerWritten = True  
            headers = []  
            for name, value in outputDict.items():  
                headers.append(self.typeAsString(value) + ':' + name)  

            self.outCsvWriter.writerow(headers)  

        row = []  
        for name, value in outputDict.items():  
            # Python's str(boolean_value) returns True or False but r17 expects true or false.  
            if type(value) is bool:  
                if value:  
                    row.append('true')  
                else:  
                    row.append('false')  
            else:  
                row.append(str(value))  

        self.outCsvWriter.writerow(row)  

    def typeAsString(self, value):  
        if type(value) is int:  
            return 'int'  

        if type(value) is str:  
            return 'string'  

        if type(value) is float:  
            return 'double'  

        if type(value) is bool:  
            return 'bool'  

        raise Exception('r17: unsupported type: ' + str(type(value)))  

r17OutputStream = R17OutputStream()

Underneath this code, r17 creates an R17InputRecord class that's tailored to the input stream columns. For example, the R17InputRecord class for an r17 input stream with columns...
string:v1 istring:v2 int:v3 uint:v4 double:v5 bool:v6 ipaddress:v7
...is

class R17InputRecord:  
    def __init__(self, row):  
        self.v1 = str(row[0])      # r17 string -> Python string  
        self.v2 = str(row[1])      # r17 istring -> Python string  
        self.v3 = long(row[2])     # r17 int -> Python long  
        self.v4 = long(row[3])     # r17 uint -> Python long  
        self.v5 = float(row[4])    # r17 double -> Python float  
        self.v6 = row[5] == 'true' # r17 bool -> Python boolean  
        self.v7 = str(row[6])      # r17 ipaddress -> Python string

lang.R
input: r17 native record format
output: r17 native record format
since: 1.7.1
lang.R(@@@ R code @@@) executes R code using the Rscript interpreter in the shell's path. The R script's standard input is the input stream. The R script's standard output is the output stream. R17 prepends helper R code to the R script before passing it to the system's R interpreter. R17's helper code supplies an r17InputTable table and an r17WriteTable function. r17InputTable contains the entire stream contents. The table's column names match the r17 stream headings. The definition of r17WriteTable is:

r17WriteTable <- function(colNames, t) {  
    write.table(sep="\t", quote=FALSE, row.names=FALSE, col.names=colNames, t)  
}

R17 can't infer the output table column types so you need to include the r17 types in the column names, for example: r17WriteTable(c("string:name", "int:value"), table)

meta.script
input: None
output: Any
since: 1.0
meta.script(file_name) executes an r17 script.

meta.remote
input: Any
output: Any
since: 1.0
meta.remote(host_name, inline_script) executes an r17 script on a remote machine. The current user must be able to SSH to the remote machine without a password. r17 must be in the PATH on the remote machine. If the host_name argument is 'localhost' then the script will be executed on the local machine without SSH.

meta.shell
input: Any
output: Any
since: 1.0
meta.shell(command) executes a shell command on the local machine once. The input stream becomes the shell command's standard input. The output of the shell command is written to the output stream. To execute a shell command once per input record, use the meta.shell function.

meta.parallel_explicit_mapping
input: r17 native record format
output: r17 native record format
since: 1.0
meta.parallel_explicit_mapping(inline_script) executes an r17 script on remote & local machine(s). It reads the [string:file_name, string:host_name] mapping from the input. The current user must be able to SSH to the remote machine(s) without a password. r17 must be in the PATH on all remote machine(s). The output of inline_script must be a normal record stream.

help.markdown
input: None
output: UTF-8 text
since: 1.0
help.markdown() writes out help in Markdown format.

help.version
input: None
output: UTF-8 text
since: 1.0
help.version() writes out the current r17 version number.

Compound stream operators (new in 1.9.0)

To write a stream operator, create a file of r17 code. Name the file using the characters supported for stream operator names: a-z, A-Z, 0-9, . and _. Put the file in the same directory as the calling script, or set the NP1_R17_PATH environment variable with a list of directories to search for r17 scripts.

Expressions

Expressions are simple infix C/Java/Python-like expressions that operate on one or at most two records.

The only control expression is if/then/else, which works the same way as C/Java/PHP's ternary ?: operator, e.g.:
rel.select((if (str.starts_with(name, "Johann")) then ("Another Johann") else ("No Johann")) as johann_nature);
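
For readers more at home in Python, the same logic reads as a conditional expression. johann_nature is only an illustrative Python function named after the heading in the r17 line above.

```python
def johann_nature(name):
    # Python's conditional expression plays the role of r17's if/then/else:
    # exactly one of the two branches becomes the result.
    return "Another Johann" if name.startswith("Johann") else "No Johann"
```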

Data types

string: Case-sensitive string.
istring: Case-insensitive string.
int: 64-bit signed integer.
uint: 64-bit unsigned integer.
double: Double-precision floating point number.
bool: Boolean.
ipaddress: IPv4 IP address.

Builtin variables

_rownum: the current row number.

Operator: `=`

Since: 1.0
Equal to.
Precedence: 9 (lower is better)

string = string returns bool
string = istring returns bool
istring = string returns bool
istring = istring returns bool
int = int returns bool
uint = uint returns bool
bool = bool returns bool
ipaddress = ipaddress returns bool
string = ipaddress returns bool
ipaddress = string returns bool
istring = ipaddress returns bool
ipaddress = istring returns bool

Operator: `!=`

Since: 1.0
Synonym: <>
Not equal to.
Precedence: 9 (lower is better)

string != string returns bool
string != istring returns bool
istring != string returns bool
istring != istring returns bool
int != int returns bool
uint != uint returns bool
bool != bool returns bool
ipaddress != ipaddress returns bool
string != ipaddress returns bool
ipaddress != string returns bool
istring != ipaddress returns bool
ipaddress != istring returns bool

Operator: `<`

Since: 1.0
Less than.
Precedence: 8 (lower is better)

string < string returns bool
string < istring returns bool
istring < string returns bool
istring < istring returns bool
int < int returns bool
uint < uint returns bool
bool < bool returns bool
ipaddress < ipaddress returns bool
string < ipaddress returns bool
ipaddress < string returns bool
istring < ipaddress returns bool
ipaddress < istring returns bool
double < double returns bool

Operator: `>`

Since: 1.0
Greater than.
Precedence: 8 (lower is better)

string > string returns bool
string > istring returns bool
istring > string returns bool
istring > istring returns bool
int > int returns bool
uint > uint returns bool
bool > bool returns bool
ipaddress > ipaddress returns bool
string > ipaddress returns bool
ipaddress > string returns bool
istring > ipaddress returns bool
ipaddress > istring returns bool
double > double returns bool

Operator: `<=`

Since: 1.0
Less than or equal to.
Precedence: 8 (lower is better)

string <= string returns bool
string <= istring returns bool
istring <= string returns bool
istring <= istring returns bool
int <= int returns bool
uint <= uint returns bool
bool <= bool returns bool
ipaddress <= ipaddress returns bool
string <= ipaddress returns bool
ipaddress <= string returns bool
istring <= ipaddress returns bool
ipaddress <= istring returns bool
double <= double returns bool

Operator: `>=`

Since: 1.0
Greater than or equal to.
Precedence: 8 (lower is better)

string >= string returns bool
string >= istring returns bool
istring >= string returns bool
istring >= istring returns bool
int >= int returns bool
uint >= uint returns bool
bool >= bool returns bool
ipaddress >= ipaddress returns bool
string >= ipaddress returns bool
ipaddress >= string returns bool
istring >= ipaddress returns bool
ipaddress >= istring returns bool
double >= double returns bool

Operator: `+`

Since: 1.0
String concatenation or addition.
Precedence: 6 (lower is better)

string + string returns string
istring + string returns istring
string + istring returns istring
istring + istring returns istring
int + int returns int
uint + uint returns uint
double + double returns double

Operator: `-`

Since: 1.0
Subtraction.
Precedence: 6 (lower is better)

int - int returns int
uint - uint returns uint
double - double returns double
- int returns int
- double returns double

Operator: `*`

Since: 1.0
Multiplication.
Precedence: 5 (lower is better)

int * int returns int
uint * uint returns uint
double * double returns double

Operator: `/`

Since: 1.0
Division.
Precedence: 5 (lower is better)

int / int returns int
uint / uint returns uint
double / double returns double

Operator: `%`

Since: 1.0
Synonym: mod
Modulus.
Precedence: 5 (lower is better)

int % int returns int
uint % uint returns uint

Operator: `&`

Since: 1.0
Bitwise AND.
Precedence: 10 (lower is better)

int & int returns int
uint & uint returns uint

Operator: `|`

Since: 1.0
Bitwise OR.
Precedence: 12 (lower is better)

int | int returns int
uint | uint returns uint

Operator: `~`

Since: 1.0
Bitwise NOT.
Precedence: 3 (lower is better)

~ int returns int
~ uint returns uint

Operator: `&&`

Since: 1.0
Synonym: and
Logical AND. Note there is currently no short-circuit evaluation: both sides of the && will be executed even if the first expression returns false.
Precedence: 13 (lower is better)

bool && bool returns bool

Operator: `||`

Since: 1.0
Synonym: or
Logical OR. Note there is currently no short-circuit evaluation: both sides of the || will be executed even if the first expression returns true.
Precedence: 14 (lower is better)

bool || bool returns bool

Operator: `!`

Since: 1.0
Synonym: not
Logical NOT.
Precedence: 3 (lower is better)

! bool returns bool

Function: `to_string`

Since: 1.0
to_string(value) returns a 'sensible' string representation of value.

to_string(string) returns string
to_string(istring) returns string
to_string(int) returns string
to_string(uint) returns string
to_string(double) returns string
to_string(bool) returns string
to_string(ipaddress) returns string

Function: `to_istring`

Since: 1.0
to_istring(value) returns a 'sensible' case-insensitive string representation of value.

to_istring(string) returns istring
to_istring(istring) returns istring
to_istring(int) returns istring
to_istring(uint) returns istring
to_istring(double) returns istring
to_istring(bool) returns istring
to_istring(ipaddress) returns istring

Function: `str.to_upper_case`

Since: 1.4.4
str.to_upper_case(str) returns str with all characters converted to upper case.

str.to_upper_case(string) returns string
str.to_upper_case(istring) returns istring

Function: `str.to_lower_case`

Since: 1.4.4
str.to_lower_case(str) returns str with all characters converted to lower case.

str.to_lower_case(string) returns string
str.to_lower_case(istring) returns istring

Function: `str.regex_match`

Since: 1.0
str.regex_match(pattern, haystack) returns true if haystack matches pattern. pattern is a Perl-compatible regular expression.

str.regex_match(string, string) returns bool
str.regex_match(string, istring) returns bool
str.regex_match(istring, string) returns bool
str.regex_match(istring, istring) returns bool

Function: `str.regex_replace`

Since: 1.0
str.regex_replace(pattern, haystack, replace_spec) replaces grouped patterns within haystack with replacements in replace_spec. \n (where n is a digit) in replace_spec refers to the nth parenthesized subexpression in pattern. If there is no match then str.regex_replace returns haystack.

str.regex_replace(string, string, string) returns string
str.regex_replace(string, string, istring) returns string
str.regex_replace(string, istring, string) returns istring
str.regex_replace(string, istring, istring) returns istring
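
Example 2 in the Examples section turns "Clamor Heinrich Abel" into "Abel", which suggests that on a match the result is built from replace_spec alone rather than from the whole haystack. The Python sketch below follows that reading. It is an interpretation, not r17's implementation, and Python's re module stands in for Perl-compatible regular expressions.

```python
import re

def regex_replace(pattern, haystack, replace_spec):
    """Return replace_spec with its group references filled in.

    References such as \\1 are taken from the first match of pattern in
    haystack. If pattern does not match, haystack is returned unchanged.
    """
    match = re.search(pattern, haystack)
    if match is None:
        return haystack
    return match.expand(replace_spec)
```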

Function: `str.regex_replace_empty_on_no_match`

Since: 1.0
As for str.regex_replace, except that it returns the empty string on no match.

str.regex_replace_empty_on_no_match(string, string, string) returns string
str.regex_replace_empty_on_no_match(string, string, istring) returns string
str.regex_replace_empty_on_no_match(string, istring, string) returns istring
str.regex_replace_empty_on_no_match(string, istring, istring) returns istring

Function: `str.regex_replace_all`

Since: 1.4.4
str.regex_replace_all(pattern, haystack, replace_spec) is the same as str.regex_replace(pattern, haystack, replace_spec) except that it continues to search and replace after the first match, and all non-matching substrings are returned as-is.

str.regex_replace_all(string, string, string) returns string
str.regex_replace_all(string, string, istring) returns string
str.regex_replace_all(string, istring, string) returns istring
str.regex_replace_all(string, istring, istring) returns istring

Function: `str.starts_with`

Since: 1.0
str.starts_with(haystack, needle) returns true if haystack starts with needle.

str.starts_with(string, string) returns bool
str.starts_with(string, istring) returns bool
str.starts_with(istring, string) returns bool
str.starts_with(istring, istring) returns bool

Function: `str.ends_with`

Since: 1.0
str.ends_with(haystack, needle) returns true if haystack ends with needle.

str.ends_with(string, string) returns bool
str.ends_with(string, istring) returns bool
str.ends_with(istring, string) returns bool
str.ends_with(istring, istring) returns bool

Function: `str.contains`

Since: 1.0
str.contains(haystack, needle) returns true if haystack contains needle.

str.contains(string, string) returns bool
str.contains(string, istring) returns bool
str.contains(istring, string) returns bool
str.contains(istring, istring) returns bool

Function: `str.uuidgen`

Since: 1.0
Generate a random UUID using the same random number generator as math.rand64().

str.uuidgen() returns string

Function: `str.sha256`

Since: 1.0
str.sha256(data) generates a hex-encoded SHA-256 hash of data.

str.sha256(string) returns string
str.sha256(istring) returns string
str.sha256(int) returns string
str.sha256(uint) returns string
str.sha256(bool) returns string
str.sha256(ipaddress) returns string
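
For comparison, a hex-encoded SHA-256 of a string in Python looks like this. Hashing the UTF-8 bytes of the string is an assumption of this sketch, and how r17 serializes non-string arguments before hashing is not covered.

```python
import hashlib

def sha256_hex(text):
    """Hex-encoded SHA-256 of a string, hashing its UTF-8 bytes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

The result is always 64 hexadecimal characters.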

Function: `math.rand64`

Since: 1.0
Generate a 64-bit random number. The seed is data from the OS's random number generator plus the process's PID and the current time in microseconds. The random state is periodically reset with a new seed. The random number is generated using a SHA-256 hash of the seed together with the output of the previous SHA-256 hash.

math.rand64() returns uint
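
The hash-chaining scheme described above might look roughly like this in Python. It is a sketch of the idea, not r17's code: the class name, the reseed interval, the byte layout of the seed and the choice of the first 8 digest bytes are all assumptions.

```python
import hashlib
import os
import time

class Sha256Rand64:
    """Illustrative hash-chained generator; not r17's actual code."""

    RESEED_INTERVAL = 1024  # hypothetical: the text only says "periodically"

    def __init__(self):
        self._reseed()

    def _reseed(self):
        # Seed = OS randomness + PID + current time in microseconds.
        now_usec = time.time_ns() // 1000
        self._seed = (
            os.urandom(32)
            + os.getpid().to_bytes(8, "little")
            + now_usec.to_bytes(8, "little")
        )
        self._previous = b""
        self._calls = 0

    def rand64(self):
        if self._calls >= self.RESEED_INTERVAL:
            self._reseed()
        self._calls += 1
        # Hash the seed together with the previous hash output.
        self._previous = hashlib.sha256(self._seed + self._previous).digest()
        # Interpret the first 8 digest bytes as an unsigned 64-bit integer.
        return int.from_bytes(self._previous[:8], "little")
```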

Function: `time.now_epoch_usec`

Since: 1.0
The number of microseconds since 1/1/1970 00:00:00 GMT.

time.now_epoch_usec() returns uint

Function: `time.usec_to_msec`

Since: 1.0
Convert microseconds to milliseconds.

time.usec_to_msec(uint) returns uint
time.usec_to_msec(int) returns int

Function: `time.msec_to_usec`

Since: 1.0
Convert milliseconds to microseconds.

time.msec_to_usec(uint) returns uint
time.msec_to_usec(int) returns int

Function: `time.usec_to_sec`

Since: 1.0
Convert microseconds to seconds.

time.usec_to_sec(uint) returns uint
time.usec_to_sec(int) returns int

Function: `time.sec_to_usec`

Since: 1.0
Convert seconds to microseconds.

time.sec_to_usec(uint) returns uint
time.sec_to_usec(int) returns int
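
The four conversions above are plain integer arithmetic. A Python sketch for non-negative values follows; discarding the fractional part when converting to a coarser unit is an assumption, since the rounding rule is not stated.

```python
USEC_PER_MSEC = 1_000
USEC_PER_SEC = 1_000_000

def usec_to_msec(usec):
    return usec // USEC_PER_MSEC  # assumption: fractional part is discarded

def msec_to_usec(msec):
    return msec * USEC_PER_MSEC

def usec_to_sec(usec):
    return usec // USEC_PER_SEC   # assumption: fractional part is discarded

def sec_to_usec(sec):
    return sec * USEC_PER_SEC
```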

Function: `time.extract_year`

Since: 1.0
time.extract_year(usec_since_epoch) returns the year part of the supplied date.

time.extract_year(uint) returns uint
time.extract_year(int) returns int

Function: `time.extract_month`

Since: 1.0
time.extract_month(usec_since_epoch) returns the month part of the supplied date.

time.extract_month(uint) returns uint
time.extract_month(int) returns int

Function: `time.extract_day`

Since: 1.0
time.extract_day(usec_since_epoch) returns the day part of the supplied date.

time.extract_day(uint) returns uint
time.extract_day(int) returns int

Function: `time.extract_hour`

Since: 1.0
time.extract_hour(usec_since_epoch) returns the hour part of the supplied date.

time.extract_hour(uint) returns uint
time.extract_hour(int) returns int

Function: `time.extract_minute`

Since: 1.0
time.extract_minute(usec_since_epoch) returns the minute part of the supplied date.

time.extract_minute(uint) returns uint
time.extract_minute(int) returns int

Function: `time.extract_second`

Since: 1.0
time.extract_second(usec_since_epoch) returns the second part of the supplied date.

time.extract_second(uint) returns uint
time.extract_second(int) returns int
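
The time.extract_* functions all take a count of microseconds since the epoch. The Python sketch below shows the equivalent field extraction. It assumes the fields are computed in GMT/UTC, which this documentation does not state explicitly; the function name extract_fields is hypothetical.

```python
from datetime import datetime, timezone


def extract_fields(usec_since_epoch: int) -> dict:
    """Split a microsecond timestamp into calendar fields (assumed UTC)."""
    seconds = usec_since_epoch // 1_000_000
    dt = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return {
        "year": dt.year,
        "month": dt.month,
        "day": dt.day,
        "hour": dt.hour,
        "minute": dt.minute,
        "second": dt.second,
    }


# 1234567890 seconds after the epoch is 2009-02-13 23:31:30 UTC.
print(extract_fields(1_234_567_890_000_000))
```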

Function: `time.parse`

Since: 1.3.0
time.parse(time_string, format_string) parses time_string according to format_string, where format_string is a format string supported by the system's strptime C function. Note that the time is interpreted as if it is in the local time zone. On some systems the underlying C library functions ignore time zone specifications completely and in others the time zone behaviour is counter-intuitive. The time is returned as the number of microseconds since 1/1/1970 00:00:00 GMT.

time.parse(string, string) returns uint
time.parse(string, istring) returns uint
time.parse(istring, string) returns uint
time.parse(istring, istring) returns uint
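
The local-time interpretation can be surprising, so here is a Python sketch of the same idea using Python's own strptime and mktime wrappers. It is not r17 code, the function name parse_to_usec is hypothetical, and the numeric result depends on the time zone of the machine that runs it.

```python
import time


def parse_to_usec(time_string: str, format_string: str) -> int:
    """Parse a time string and return microseconds since the epoch."""
    broken_down = time.strptime(time_string, format_string)
    # mktime treats the broken-down time as local time, mirroring the
    # local-time-zone behaviour described for time.parse.
    return int(time.mktime(broken_down)) * 1_000_000


usec = parse_to_usec("2012-03-04 05:06:07", "%Y-%m-%d %H:%M:%S")
print(usec)
```

Because the input carries no usable time zone, the same string yields different values on machines configured for different zones.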

Function: `time.format`

Since: 1.4.3
time.format(usec_since_epoch, format_string) formats usec_since_epoch according to format_string, where format_string is a format string supported by the system's strftime C function.

time.format(uint, string) returns string
time.format(uint, istring) returns string

Function: `io.net.url.get`

Since: 1.0
io.net.url.get(url) retrieves the resource specified by the supplied URL. If the URL is an HTTP URL, io.net.url.get uses HTTP GET and returns all HTTP headers along with the resource. Returns the empty string on network error. The resource must be UTF-8. Any invalid character sequences will be replaced with ?.

io.net.url.get(string) returns string

Function: `io.file.read`

Since: 1.0
io.file.read(path) reads the entire file specified by path. Will decompress the file if it's in gzip format. Returns the empty string on error. The (possibly decompressed) file must be UTF-8. Any invalid character sequences will be replaced with ?. To read a file once for the whole input stream, use the io.file.read stream operator.

io.file.read(string) returns string
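
The read behaviour described above (transparent gzip decompression, empty string on error, ? for invalid UTF-8) can be sketched in Python as follows. This is an illustration, not r17's implementation: detecting gzip by its two magic bytes and emitting one ? per invalid byte sequence are assumptions, and read_text_file is a hypothetical name.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"


def read_text_file(path: str) -> str:
    """Read a whole file as text, decompressing gzip data if present."""
    try:
        with open(path, "rb") as handle:
            raw = handle.read()
        # Assumption: gzip content is recognised by its magic bytes.
        if raw.startswith(GZIP_MAGIC):
            raw = gzip.decompress(raw)
    except (OSError, EOFError, zlib.error):
        # Any read or decompression failure yields the empty string.
        return ""
    # Python marks invalid sequences with U+FFFD; map those to "?".
    # (A genuine U+FFFD in the input is also mapped in this sketch.)
    return raw.decode("utf-8", errors="replace").replace("\ufffd", "?")
```

Note that an empty result is ambiguous in this scheme: it can mean either an empty file or a failed read.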

Function: `io.file.erase`

Since: 1.0
io.file.erase(path) erases the file specified by path. Returns false on error.

io.file.erase(string) returns bool

Function: `meta.shell`

Since: 1.0
meta.shell(command_line) executes the command line on the local machine and returns the output as a UTF-8 string. Any invalid character sequences in the output are replaced with ?. To run a command once for the whole input/output stream, use the meta.shell stream operator.

meta.shell(string) returns string

Explicit SSH-based Distribution

The key stream operators for SSH-based distribution are rel.record_split, rel.join.consistent_hash and meta.parallel_explicit_mapping. The recommended procedure is:
1. Use rel.record_split to split a large data set into individual files. For best results aim for chunk sizes of 100MB to 500MB.
2. Create a list of participating hosts in a TSV file and translate that TSV file to native R17 format with rel.from_tsv. Use localhost to refer to the local machine.
3. Join the list of files from (1) to the list of hosts in (2) using rel.join.consistent_hash. The consistent hashing will result in fewer large file movements as the number of hosts or files change over time.
4. Distribute the files using the output of (3), the meta.shell function and your favorite file transfer utility, e.g. scp or rsync.
5. Query using the output of (3) and meta.parallel_explicit_mapping.

Below is a complete script that does everything from splitting to the query. In practice it's usually better to separate out steps 1-4 from 5.

# Split the input file into fragments.
io.file.read('big_input.r17_native.gz') | rel.record_split(1000000, 'tmp.web_fragment.');

# Make the mapping from hosts to files. participating_hosts.tsv's sole column is string:host_name.
io.file.read('participating_hosts.tsv') | rel.from_tsv() | io.file.overwrite('participating_hosts.r17_native');

io.directory.list('.')
| rel.select(file_name)
| rel.where(str.starts_with(file_name, 'tmp.web_fragment.'))
| rel.join.consistent_hash('participating_hosts.r17_native')
| io.file.overwrite('host_file_mapping.r17_native');

# Do the distribution. Skip transferring files to localhost.
io.file.read('host_file_mapping.r17_native')
| rel.where(host_name != 'localhost')
| rel.select(meta.shell('rsync -av ' + file_name + ' ' + host_name + ':~') as rsync_output)
| io.file.append('/dev/null');

# Do the actual query. It's ok to pass a pipeline to meta.parallel_explicit_mapping.
io.file.read('host_file_mapping.r17_native')
| meta.parallel_explicit_mapping(rel.select(username))
| rel.group(count)
| rel.order_by(_count)
| io.file.overwrite('user_by_activity.r17_native');
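
Step 3 relies on consistent hashing to keep file movements small when the host list changes. The Python sketch below demonstrates that property with a generic hash ring. It is not the algorithm used by rel.join.consistent_hash, whose internals are not described here; the host names, fragment file names, replica count and function names are all hypothetical.

```python
import bisect
import hashlib


def _point(key: str) -> int:
    """Map a key to a position on the hash ring."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build_ring(hosts, replicas=64):
    """Place `replicas` virtual points per host on the ring, sorted by position."""
    return sorted(
        (_point(f"{host}#{i}"), host) for host in hosts for i in range(replicas)
    )


def assign(files, hosts):
    """Assign each file to the host owning the next ring point clockwise."""
    ring = build_ring(hosts)
    points = [position for position, _ in ring]
    mapping = {}
    for name in files:
        index = bisect.bisect_right(points, _point(name)) % len(ring)
        mapping[name] = ring[index][1]
    return mapping


files = [f"tmp.web_fragment.{i}" for i in range(1000)]
before = assign(files, ["host-a", "host-b", "host-c"])
after = assign(files, ["host-a", "host-b", "host-c", "host-d"])
moved = [name for name in files if before[name] != after[name]]
print(f"{len(moved)} of {len(files)} files moved after adding host-d")
```

In this sketch every file that changes owner moves to the newly added host; files already placed on the other hosts stay where they are, which is why only a fraction of the data needs to be transferred again.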

Environment Variables

NP1_MAX_RECORD_HASH_TABLE_SIZE (optional): The maximum number of slots in the record hash table that's used for rel.join.*, rel.unique and rel.group. Default is 9223372036854775807 slots.

NP1_SORT_CHUNK_SIZE (optional): The size of the chunks used for sorting, in bytes. The default is 104857600 bytes.

NP1_SORT_INITIAL_NUMBER_THREADS (optional): The initial number of threads used for parallel sorting. Each thread will sort a single chunk. R17 will adjust the actual number of threads based on throughput after sorting each chunk. The default is 5 threads.

NP1_R17_PATH (optional): The path to use for searching for r17 scripts. If a stream operator is not a known builtin, r17 will search the directory of the current script then search NP1_R17_PATH for the first file with the same name as the stream operator. r17 will then interpret that file as an r17 script.
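
The lookup order described for NP1_R17_PATH can be sketched in Python as follows. This is an illustration under assumptions: it treats NP1_R17_PATH as a list of directories separated by the platform's path separator, which this documentation does not confirm, and resolve_operator_script is a hypothetical name.

```python
import os


def resolve_operator_script(operator_name, current_script_dir, r17_path):
    """Return the first script file matching an unknown stream operator.

    The current script's directory is searched first, then each directory
    in r17_path (assumed to be separated by os.pathsep).
    """
    search_dirs = [current_script_dir]
    search_dirs += [d for d in r17_path.split(os.pathsep) if d]
    for directory in search_dirs:
        candidate = os.path.join(directory, operator_name)
        if os.path.isfile(candidate):
            return candidate
    return None  # no script found; the operator is unknown


print(resolve_operator_script("my.operator", ".", os.environ.get("NP1_R17_PATH", "")))
```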

Third Party Licenses

libcurl

COPYRIGHT AND PERMISSION NOTICE

Permission to use, copy, modify, and distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OF THIRD PARTY RIGHTS. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Except as contained in this notice, the name of a copyright holder shall not be used in advertising or otherwise to promote the sale, use or other dealings in this Software without prior written authorization of the copyright holder.

zlib

This software is provided 'as-is', without any express or implied warranty. In no event will the authors be held liable for any damages arising from the use of this software.

Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely, subject to the following restrictions:

1. The origin of this software must not be misrepresented; you must not claim that you wrote the original software. If you use this software in a product, an acknowledgment in the product documentation would be appreciated but is not required.
2. Altered source versions must be plainly marked as such, and must not be misrepresented as being the original software.
3. This notice may not be removed or altered from any source distribution.

Jean-loup Gailly (jloup@gzip.org)
Mark Adler (madler@alumni.caltech.edu)

PCRE

Written by: Philip Hazel
Email local part: ph10
Email domain: cam.ac.uk

University of Cambridge Computing Service, Cambridge, England.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

* Neither the name of the University of Cambridge nor the name of Google Inc. nor the names of their contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

64-bit multiplication and division on 32-bit platforms

This software was developed by the Computer Systems Engineering group at Lawrence Berkeley Laboratory under DARPA contract BG 91-66 and contributed to Berkeley.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
3. All advertising materials mentioning features or use of this software must display the following acknowledgement:
This product includes software developed by the University of
California, Berkeley and its contributors.
4. Neither the name of the University nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
