I’ve been meaning to get some first-hand experience of how “Big Data” technologies may be used for social media analysis – specifically around Twitter. I described that experience, collecting data about the show “Game of Thrones”, in a post earlier this week. This two-part post will attempt to give a “behind the scenes” glimpse into the data presented in that post.
I had been searching for good topics to collect data on, when I had the notion that I could use my current favorite TV Series – Game of Thrones – as a kind of test bed for this.
Ever since I watched the Red Wedding episode on Game of Thrones, I have been determined not to let the show pull another fast one on me. In fact, soon after the end of that season, I read all the books in the series, just so that I could stay ahead of the TV series. Thanks to this, I knew what was in store for the final three episodes of Season 4 and figured that this would make for a good experiment on Twitter.
Luckily for me (or rather unluckily) the show was on a short break due to the Memorial Day weekend, when I had the idea. That meant that I had less than a week to get started and begin the data collection before the show resumed airing on HBO. What followed was a rather interesting ride through technology, visualization and data insights.
Gearing up for D-Day
So having decided to do this, I began by trying to get a good handle on the technologies involved. Luckily, support for the Twitter API is quite good, with a number of libraries readily available for a variety of programming languages. Having previously done similar analysis using Ruby, I chose to use the Twitter gem for Ruby. Implementing the Twitter Streaming API using the gem was quite straightforward and in a few evenings (yes, I have a day job) I had a good enough system working in my development environment. Once I had access to the data I was faced with the next decision – should I write this to a database or a flat file?
I originally intended to use MongoDB, but given the steep learning curve in accessing and processing the written data, I finally chose to go with the flat file approach. I also chose to write only a few fields, viz. Tweet Creation Date, Tweet, Twitter Handle, User Name, Retweet Flag, Location, Time Zone, Latitude and Longitude. In hindsight, I probably should have included a lot more fields, but more on that later.
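To make the flat file layout concrete, here is a minimal sketch of how one tweet could be turned into a tab-delimited row with the nine fields listed above. The field symbols, the `tweet_row` helper and the sample values are illustrative assumptions, not the names used in the original script.

```ruby
# Column order for each row. These symbols are illustrative assumptions,
# not the names used in the original collection script.
FIELDS = %i[created_at text handle user_name retweet
            location time_zone latitude longitude].freeze

# Builds one tab-delimited line from a hash of tweet attributes.
# Missing values become empty strings so every row keeps nine columns.
def tweet_row(tweet)
  FIELDS.map { |field| tweet[field].to_s }.join("\t")
end

# Hypothetical sample record (no coordinates, as with most tweets).
sample = {
  created_at: '2014-06-01 21:05:00',
  text: 'Hypothetical tweet about #GameOfThrones',
  handle: 'example_user',
  user_name: 'Example User',
  retweet: false,
  location: 'Somewhere',
  time_zone: 'UTC'
}

puts tweet_row(sample)
```

Keeping the column order in a single constant means the writer and any later reader of the file can agree on the layout.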
Given the volumes I had seen during my tests, I presumed that over a 3 week period I would probably end up getting about 700K to 1 million tweets. So given those kinds of volumes, I figured I’d just host the data collection script on the server that hosts my blog. That’s where I hit the first roadblock – the Ruby environment there was just plain out of sync with reality, running some really old versions of gems.
I realized this with just two days to go and at the very last minute decided to try and host this on AWS. Doing this on AWS was a breeze and in one long night, the code was finally live!!!
The Data Avalanche
I had written the data collection script to generate a new file with tweet data for each day. The file generated for June 1st, 2014 was well within my estimates in terms of the number of tweets collected, but nothing prepared me for the avalanche of tweets that just poured out when the monumental episode “The Mountain and The Viper” aired. That day generated close to 500K tweets, with the total for the week crossing the 1 million mark, which itself was as much as I had estimated for the entire three week period.
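The one-file-per-day idea can be sketched as follows. This is only an illustration: the `tweets_` prefix, the `.tsv` extension and both helper names are assumptions, and the original script may have rotated its files differently.

```ruby
require 'date'

# Returns the output path for a given day.
# The "tweets_" prefix and ".tsv" extension are assumptions.
def daily_file_name(date, dir = '.')
  File.join(dir, "tweets_#{date.strftime('%Y-%m-%d')}.tsv")
end

# Appends one row to the file for the given day. Because the file name is
# derived from the date, a new file starts automatically when the day changes.
def append_row(row, date = Date.today, dir = '.')
  File.open(daily_file_name(date, dir), 'a') { |file| file.puts(row) }
end
```

Opening the file in append mode for every row is slow but simple; a long-running collector might keep the handle open and reopen it only when the date changes.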
Surprisingly, the script didn’t crash at all and was still running at the end of the day. In fact, in the entire 21 day period, the script only crashed once for a few hours, which was quite satisfying given that I had just hacked together the code over a few evenings.
Data quality and other woes
But with great volumes of data came great issues of quality of data. I was using a tab-delimited file format to store the data, which at the time seemed like a safe choice. But in spite of that I faced several data quality issues.
One of the common issues I had was with the presence of arbitrary new line characters in the files. While I tried to strip out these characters within the script code itself, some errors still crept through. This amounted to about 0.1% of the total tweets collected.
The second issue was in cases where the text content itself contained tab characters. This was something that I could have easily fixed in the script, but it somehow slipped my mind. This kind of error amounted to about 0.06% of the total tweets.
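Both problems come down to delimiter characters appearing inside a field. A minimal sketch of a sanitizer that could be applied to every field before writing is shown below; `sanitize_field` is a hypothetical helper, and it only covers tabs, carriage returns and line feeds, so other line-breaking characters would still need separate handling.

```ruby
# Replaces runs of tabs, carriage returns and newlines with a single space,
# so the value can never break a tab-delimited, one-row-per-line file.
# Hypothetical helper; not taken from the original script.
def sanitize_field(value)
  value.to_s.gsub(/[\t\r\n]+/, ' ').strip
end

puts sanitize_field("Winter\tis\ncoming\r\n")
```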
I corrected the first error semi-manually: a script pointed out all such errors, and I then made the corrections by hand. For the second error, rather than find where the error actually was, I chose to just drop those data points since their number was very small.
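A script that points out such rows could look roughly like the sketch below, which counts the columns on each line. The helper name and the reporting format are assumptions. With nine expected fields, a row broken by a stray newline shows up as lines with too few columns, while a row with an embedded tab shows up with too many.

```ruby
EXPECTED_FIELDS = 9 # number of columns written per tweet

# Returns [line_number, field_count] pairs for every row whose column count
# differs from the expected one. Line numbers start at 1.
def malformed_rows(path, expected = EXPECTED_FIELDS)
  bad = []
  File.foreach(path).with_index(1) do |line, number|
    # The -1 limit keeps trailing empty fields, e.g. missing coordinates.
    count = line.chomp.split("\t", -1).length
    bad << [number, count] unless count == expected
  end
  bad
end
```

The flagged line numbers can then be reviewed by hand, or rows with too many columns can simply be filtered out.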
After correcting all this data, I finally ended up with a clean data file of about 0.5 GB.
Preparing the data and designing the analysis
So once I had cleaned up the base data, the next step was to figure out what to do with it. The base data file was in a fairly accessible format, and for some of the analysis that I intended to do, I could just use it with Ruby directly. For other kinds of analysis, it was imperative that I be able to perform some transformations on the data.
To do this, I chose to use Talend as the data transformation platform, largely because it was open source and easily available. For the analysis I wanted a data set just containing the tweets (for Apache Pig and HBase), another data set with just latitude and longitude (for geographical analysis) and the entire dataset loaded into a MySQL database. Using Talend I built ETL Jobs for each of the tasks and Talend was able to handle this quite easily.
I had originally planned on reverse geocoding the latitude and longitude pairs to extract useful information like the country from where the tweet was sent. After reading up on this, I figured that using a reverse geocoding API like those provided by Google Maps or GeoNames would have been ideal. However, it soon became evident that, thanks to the throttle limits on the free versions of these APIs, it would have taken me ages to reverse geocode my dataset. For example, Google Maps had a limit of 2500 requests per day, while even the more generous GeoNames (30000 requests per day) seemed inadequate given the volumes of data that I had. On the bright side, GeoNames also made its entire database available for free. Using that and some Ruby scripting, I was able to build an offline reverse geocoder that got the reverse geocoding done in about eight hours.
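The core of an offline reverse geocoder is a nearest-place lookup. The sketch below shows one simple way to do it: a linear scan using the haversine distance. It is an assumption about the general approach, not the code I actually ran, and the three places with approximate coordinates stand in for records that would be loaded from the GeoNames download.

```ruby
Place = Struct.new(:name, :country, :lat, :lon)

EARTH_RADIUS_KM = 6371.0

# Great-circle distance between two points in kilometres (haversine formula).
def haversine_km(lat1, lon1, lat2, lon2)
  to_rad = Math::PI / 180
  dlat = (lat2 - lat1) * to_rad
  dlon = (lon2 - lon1) * to_rad
  a = Math.sin(dlat / 2)**2 +
      Math.cos(lat1 * to_rad) * Math.cos(lat2 * to_rad) * Math.sin(dlon / 2)**2
  # Clamp to 1.0 to guard against floating point rounding before asin.
  2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt([a, 1.0].min))
end

# Returns the known place closest to the given coordinates (linear scan).
def reverse_geocode(places, lat, lon)
  places.min_by { |place| haversine_km(lat, lon, place.lat, place.lon) }
end

# Illustrative sample data with approximate coordinates.
PLACES = [
  Place.new('London', 'GB', 51.5074, -0.1278),
  Place.new('New York', 'US', 40.7128, -74.0060),
  Place.new('Mumbai', 'IN', 19.0760, 72.8777)
].freeze

# Looks up the country for a point near Paris; with only these three
# sample places, the nearest match is London.
puts reverse_geocode(PLACES, 48.8566, 2.3522).country
```

A linear scan compares every tweet against every place, so it gets slow with a large place list; a spatial index or a coarse latitude/longitude grid would cut the number of comparisons considerably.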
I finally had a clean enough data set, along with the supplemental data, to begin the analysis.
Next week, I’ll detail the actual analysis that I performed.