External applications can participate at the start, the middle or the end of queries.
meta.shell("java NetworkSniffer")
| rel.from_usv()
| rel.join.natural("interesting_ips.r17_native")
| rel.select(ip_address)
| rel.group()
| rel.to_usv()
| meta.shell("python visualise.py");
"USV" is very easy & fast to parse and generate. It uses ASCII 31 ("Unit Separator") to delimit fields and ASCII 0 to delimit records. r17 also supports TAB-separated-value and regex-parseable data but both of these tend to be slower & harder for external applications to parse.
meta.shell("for FILE in *.xls; do py_xls2csv \"$FILE\" | grep '^\"'; done")
| rel.from_text('"[^"]*?", "[^"]*?", "([^"]*?)", "[^"]*?", "[^"]*?", "([^"]*?)"', "string:email", "string:referrer_url")
| rel.select("UPDATE customer SET referrer_url = '" + referrer_url + "' WHERE email = '" + email + "';" as sql)
| rel.to_tsv();
This example translates XLS to SQL using a Python XLS-to-CSV converter. Quote escaping is omitted for brevity, so a value containing a single quote would produce invalid SQL.
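The quote escaping left out of the query above could be handled in an external step. The following Python sketch is one possible approach, not part of r17: it doubles embedded single quotes, which is the standard SQL way to escape them inside a string literal. The function names sql_quote and update_statement are hypothetical. Some databases also treat backslashes specially, so parameterised statements are the safer choice where the database client supports them.

```python
def sql_quote(value):
    """Wrap value in single quotes, doubling any embedded single quote."""
    return "'" + value.replace("'", "''") + "'"


def update_statement(email, referrer_url):
    """Build the same UPDATE statement as the query above, with escaping."""
    return (
        "UPDATE customer SET referrer_url = " + sql_quote(referrer_url)
        + " WHERE email = " + sql_quote(email) + ";"
    )
```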
# Distribute analytics processing using the R statistics language.
# This example only distributes processing amongst the CPUs on the local machine
# but the same principle applies to distributing processing across multiple
# machines. The R script would just need to be distributed in advance.
# Parse & split the input. In the "real world" this would probably be done separately.
io.file.read("distributed_r_example_input.tsv")
| rel.from_tsv()
| rel.record_split(10, "tmp.distributed_r_example.");
# Distribute the R operation amongst the CPUs on the local machine.
io.directory.list(".")
| rel.where(str.starts_with(file_name, "tmp.distributed_r_example."))
| rel.select(file_name, "localhost" as host_name)
| meta.parallel_explicit_mapping(
rel.to_tsv()
| meta.shell("Rscript --slave increment.R")
| rel.from_tsv())
| rel.order_by(value)
| rel.to_tsv();
For this example, increment.R is:
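# Read the TAB-separated input from stdin; R renames the "int:value" heading to "int.value".
table <- read.table(header=TRUE, file="stdin", sep="\t", quote="")
valueP1 <- table$int.value+1
# Write the incremented values back out under the original typed heading.
write.table(sep="\t", quote=FALSE, row.names=FALSE, col.names=c("int:value"), valueP1)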
...and distributed_r_example_input.tsv is just a list of numbers in TAB-separated-value format:
int:value
19
17
60
40
26
.
.
.
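The external step does not have to be written in R. As a rough sketch, the same increment could be done in Python, assuming the input is exactly the single-column TSV shown above with an "int:value" heading line. The function name increment_tsv and the script name increment.py mentioned below are hypothetical.

```python
def increment_tsv(lines):
    """Yield output lines: the typed heading, then each input value plus one."""
    rows = iter(lines)
    # The first line is the "int:value" heading; with no input, emit nothing.
    if next(rows, None) is None:
        return
    yield "int:value"
    for row in rows:
        row = row.strip()
        if row:  # skip blank lines
            yield str(int(row) + 1)
```

To use it as a filter, a script (say increment.py) would print each line of increment_tsv(sys.stdin), and the meta.shell step in the query would invoke that script instead of Rscript.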