Mining Twitter with r17

Matt Nourse, August 25, 2011

On Friday I had the pleasure of ranting about r17 over at Datalicious. I used some Twitter mining examples to show off some of r17's data exploration capabilities. Here are those examples.

Getting started: parsing names and tweets from Twitter's data feed

This r17 script...
...uses cURL to stream the tweets into rel.from_text
...rel.from_text breaks the JSON into lines (one tweet per line)
...rel.select gets the name and tweet out of the raw tweet data
...finally rel.to_tsv converts the name and tweet data to TAB-separated-value format.

meta.shell(
  "curl -s http://stream.twitter.com/1/statuses/sample.json -uusername:password")
| rel.from_text("(.*)", "string:json")
| rel.select(
    str.regex_replace_empty_on_no_match("\"name\":\"([^\"]+)\",", json, "\1") as name,
    str.regex_replace_empty_on_no_match("\"text\":\"([^\"]+)\",", json, "\1") as text)
| rel.to_tsv();

r17 doesn't currently contain a JSON parser, so it can't do a "pure" parse of the Twitter feed. But it does have Perl-compatible regular expressions, so it can do a "good enough" parse if you're looking for trends rather than exact matches. For example, an escaped quote (\") inside a tweet will prematurely end parsing of that name or tweet. Of course, the vast majority of tweets don't contain \", so any anomalies should be statistically insignificant. If you want a dedicated JSON parser in r17 then I particularly want to hear from you.
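To make that concrete, here's roughly how the text pattern behaves on three made-up input lines (glossing over the finer points of PCRE backtracking):

"text":"Fred loves chickens",         ->  Fred loves chickens
"text":"Fred said \", then left",     ->  Fred said \   (truncated at the escaped quote)
"text":"Fred loves \"chickens\"",     ->  (no match, so an empty string)

In the last case the capture group can't cross the embedded quote characters, so the whole pattern fails to match and str.regex_replace_empty_on_no_match returns an empty string rather than a mangled tweet.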

Get tweeters of interesting words

This script extends our example to get the names of tweeters whose tweets contain a word in a list of interesting words. It expects the interesting word list to be in TAB-separated value format with a typed heading, like this:
istring:word
fred
flintstone
chickens
party
.
.
.

Now for the script:
# Convert our interesting word list to r17 native format ready for the join.
io.file.read("interesting_words.tsv")
| rel.from_tsv()
| io.file.overwrite("interesting_words.r17_native");

# Parse as above, then split tweet into words and case-insensitively
# join them to the interesting word list.
meta.shell(
  "curl -s http://stream.twitter.com/1/statuses/sample.json -uusername:password")
| rel.from_text("(.*)", "string:json")
| rel.select(
    str.regex_replace_empty_on_no_match("\"name\":\"([^\"]+)\",", json, "\1") as name,
    str.regex_replace_empty_on_no_match("\"text\":\"([^\"]+)\",", json, "\1") as text)
| rel.str_split(text, "\s")
| rel.select(name, text as istring:word)
| rel.join.natural("interesting_words.r17_native")
| rel.to_tsv();
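Assuming "chickens" and "party" are on the interesting word list, the output is one row per matching (tweeter, word) pair. It might look something like this; the names are made up, and the exact typed headings that rel.to_tsv emits are a guess modelled on the input format above:

string:name	istring:word
Fred Jones	chickens
Betty R	party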

rel.str_split also adds a _counter column, so it's possible to figure out how far one word is from another within the same tweet.
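For example, if you keep _counter in the select (rel.select(name, _counter, text as istring:word)), the tweet "fred loves chickens" would come through as rows like these (assuming _counter numbers the words from 0; the exact starting value is a guess):

name	_counter	word
Fred Jones	0	fred
Fred Jones	1	loves
Fred Jones	2	chickens

Two words whose _counter values differ by 1 were adjacent in the original tweet.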

Store tweets for later mining

The above examples mine the tweet stream in real time. But maybe you want to do fancier mining with sorting or grouping, or you just want to store the tweets for a rainy day. The script below stores the tweet stream in chunks ready for parallel processing. It also uses a Bash script fragment to do basic error recovery so that tweet stream collection will continue after a transient network problem.

meta.shell(
  "while true; do curl -s --connect-timeout 15 --retry 5 --speed-time 15 "
  + "http://stream.twitter.com/1/statuses/sample.json -uusername:password; done")
| rel.from_text("(.*)", "string:json")
| rel.record_split(100, "twitter_stream_fragment.r17_native.");
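The trailing dot in the file name prefix suggests that rel.record_split appends a suffix to each fragment it writes, so after running for a while I'd expect the working directory to contain files along these lines (the exact suffix scheme is a guess):

twitter_stream_fragment.r17_native.0
twitter_stream_fragment.r17_native.1
twitter_stream_fragment.r17_native.2
.
.
.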

Get tweeters of interesting words again, this time in parallel

Now that we've stored the tweet stream in chunks, we can mine it in parallel, possibly across multiple machines. For simplicity the example below uses only one machine, but most of the code is ready for multi-machine use. The extra work required is described in the script comments.

# Convert interesting_words to r17 native format ready for the join,
# just as in the real-time "interesting words" example.
io.file.read("interesting_words.tsv")
| rel.from_tsv()
| io.file.overwrite("interesting_words.r17_native");

# Here we just use a single machine, but this could just as easily be a
# collection of machines. You'd just need to read in a list of hosts here
# (one option is sketched in the comment below) and ensure that
# interesting_words.r17_native is reachable on all machines.
rel.generate_sequence(0, 1)
| rel.select("localhost" as host_name)
| io.file.overwrite("participating_hosts.r17_native");

# Map the available files to machines.
io.directory.list(".")
| rel.select(file_name)
| rel.where(str.starts_with(file_name, "twitter_stream_fragment.r17_native."))
| rel.join.consistent_hash("participating_hosts.r17_native")
| io.file.overwrite("host_file_mapping.r17_native");

# Do the distribution. Skip transferring files to localhost.
# This is just here to show how to do distributed queries.
# Nothing happens for this example.
io.file.read("host_file_mapping.r17_native")
| rel.where(host_name != "localhost")
| rel.select(
  meta.shell("rsync -av " + file_name + " " + host_name + ":~") as rsync_output)
| io.file.append("/dev/null");

# Now we're ready to do the actual query. In the "real world" most of the above
# mucking around would probably only be done once, in a separate script.
# The parallel part happens inside meta.parallel_explicit_mapping.
io.file.read("host_file_mapping.r17_native")
| meta.parallel_explicit_mapping(
  rel.select(
    str.regex_replace_empty_on_no_match("\"name\":\"([^\"]+)\",", json, "\1") as name,
    str.regex_replace_empty_on_no_match("\"text\":\"([^\"]+)\",", json, "\1") as text)
  | rel.str_split(text, "\s")
  | rel.select(name, text as istring:word)
  | rel.join.natural("interesting_words.r17_native"))
| rel.to_tsv();

So there you have it: a basic primer on real-time and offline Twitter mining in r17. Please contact us for help, feature requests or feedback.
