Last week, I detailed how I ended up collecting a rather modest data set of about 2.72 million tweets. After that humongous data collection exercise, the question was: now what?

While the data was being collected, I had formed some notions about what kind of analysis I would perform and how I would present it. I wanted to restrict myself mostly to frequency and statistical analyses that would give me some idea of the distribution of the tweets, and to present these in a visually appealing form. I broke my analysis into three distinct areas:

  • Analysis of the tweets and their content
  • Geographical analysis of the tweets
  • Analysis of the tweets vis-à-vis the characters of the TV series

For each of these areas, I worked out some data aggregation and analysis that I could perform using a combination of SQL, Apache Pig + HBase, and Ruby. While a lot of the analysis could have been done using standard SQL (after all, 2.72 million tweets isn’t exactly “Big Data”), I chose to use some of the technologies that are increasingly associated with Big Data, just to get an understanding of the technology.

I finally visualized all of this using some nifty D3.js charts to get a better feel for the data. The results can be seen in the companion micro-site and were presented in a prior blog post.

The Summary

Summary of Tweets

These were just summary aggregates generated using MySQL scripts. I used regular expressions to detect links while loading the data into the MySQL table.
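
As a rough sketch of that load-time check (the exact pattern from my script isn’t reproduced here, so a deliberately simple one stands in for it):

    # Flag links in a tweet's text while loading it into MySQL.
    # A deliberately simple URL pattern; the real expression was along these lines.
    LINK_PATTERN = %r{https?://\S+}

    def contains_link?(tweet_text)
      !!(tweet_text =~ LINK_PATTERN)
    end

    contains_link?("Epic episode! http://example.com/got #GoT")  # => true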

Tweets over Time

Timeline of Tweets

Game of Thrones aired on HBO every Sunday night. Overall, the Twitter chatter peaked on the day an episode aired and on the day following. A different view of the timeline reveals a bit more insight into this pattern.

Tweet volume by Day of Week and Hour

The heat map above shows the distribution of tweets by hour of the day and day of the week. All the tweets have been normalized to the EDT time zone, so that the analysis lines up with the time zone in which the TV series first aired. In line with the air date and time, the busiest hours are on Sunday, immediately after the week’s episode airs in the US. A second “peak” appears on Monday, in the two hours after the episode airs in the UK. The rest of the world catches up over the week, with Saturday being a lax day, and the anticipation builds up again on Sunday.

The visualizations used here were built using D3.js. NVD3.js, a D3.js-based library, provided the stacked area graph, while the heat map was “inspired” by this example in the D3.js example gallery. The data was aggregated using SQL and Ruby scripts, and Ruby scripts were also used to generate the JSON data needed by the visualizations.
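
As a minimal sketch of that aggregation step (assuming the tweet timestamp is the first tab-delimited column and is already normalized to EDT; file names are illustrative):

    require 'json'
    require 'time'

    # Tally tweets into a day-of-week x hour-of-day grid.
    counts = Hash.new(0)

    File.foreach('tweets_clean.tsv') do |line|
      created_at = line.split("\t").first
      t = Time.parse(created_at)
      counts[[t.wday, t.hour]] += 1
    end

    # Emit the flat records the D3.js heat map consumes.
    heatmap = counts.map { |(day, hour), n| { day: day, hour: hour, value: n } }
    File.write('heatmap.json', JSON.pretty_generate(heatmap))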

What were people saying

Equally interesting is what people were saying about the show in their tweets. I was particularly interested in which words, and by extension which “hashtags”, people used. To help understand this, I ran a word frequency analysis on all the tweets using HBase and Pig, and visualized the results as word clouds (using D3.js) for both hashtags and words.
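
The actual job ran as a Pig script over tweet data stored in HBase; as a plain-Ruby sketch of the same group-and-count aggregation (file name illustrative), the idea boils down to:

    # Word-frequency count over tweet text, the Ruby equivalent of the
    # Pig GROUP ... / COUNT(...) job that was actually used.
    freq = Hash.new(0)

    File.foreach('tweets_text_only.txt') do |tweet|
      tweet.downcase.scan(/#?[[:word:]]+/) { |word| freq[word] += 1 }
    end

    # The top terms become the input for the D3.js word clouds;
    # hashtags are simply the terms that start with '#'.
    freq.sort_by { |_, n| -n }.first(100).each do |word, n|
      puts "#{word}\t#{n}"
    end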

Word Cloud of Tweets

Certain words clearly stood out and underscored people’s reactions to the last three episodes. The hashtags clearly showed the three episode titles, viz. “The Mountain And the Viper”, “The Watchers on the Wall” and “The Children”. Other show themes like “Trial By Combat” and “Valar Morghulis”, and star characters like “Tyrion”, “Arya”, “The Mountain”, “The Viper” etc., also dominated.

A word cloud of Hash Tags

Both word clouds were based on Jason Davies’ implementation of a word cloud using D3.js. It is a fairly nifty library for visualizing word clouds, though I haven’t been able to customize it as fully as I needed.

Who was popular

Popular Characters

Another thing to check was which characters people were tweeting about. Here I once again built a word cloud, but focussing just on the characters. Five characters clearly stood out, viz. Oberyn Martell (The Red Viper), Jon Snow, Arya Stark, Gregor Clegane (The Mountain) and everyone’s favourite, Tyrion Lannister.

Characters by Tweet volume and Day

To get a more quantitative feel for character mentions, I looked at the volume of mentions per character. Here too, some clear trends stood out:

  • The final episode very neatly tied up the story arcs for most of the show’s many characters. Naturally, a significant number of characters saw a high volume of mentions in the last episode.
  • Ygritte and Jon Snow both saw their highest number of mentions in the episode “The Watchers on the Wall”, since that episode focused almost exclusively on their part of the Game of Thrones universe.
  • Oberyn Martell and The Mountain naturally saw their maximum number of mentions in the episode “The Mountain and The Viper”, which featured their epic showdown.

The character cloud was once again based on Jason Davies’ word cloud implementation. The character “tweet count by day” chart was based on another example from the D3.js examples page, “Publications in Journals over Time” by Asif Rahman.

Where were people tweeting from

Also interesting was where people were tweeting from. Unfortunately, location data is not often available in tweets, and in this case a mere 3-4% of tweets had geo-location data. Nevertheless, even this amounted to a fairly decent number of data points. I took this data and built a kind of “choropleth” world map, shaded by number of tweets. The map was based on the “datamaps” library by Mark DiMarco, an excellent library for this kind of mapping visualization.

Countries by Tweet Volume

The first thing I noticed was the sheer popularity of the show. Tweets had poured in from all over the globe, including a few gibberish ones that appeared to come from a ship!!!

The overall distribution of tweets also indicated that the show was immensely popular in the US and the UK, something I had noticed in the timeline data as well.

Tweets by Location

I also plotted all the points on an interactive Google Map, which you can see here. I used CSS to style the map in that wonderful grey colour that went well with the rest of the visualizations. Initially, I tried loading all the data points using a GeoJSON layer, but that proved to be very slow. So, sacrificing some control over the look, I eventually opted to use Fusion Tables with Google Maps.

So What Next?

So far I have had an interesting month learning all this new stuff and playing around with data visualization. One thing I missed here was storing the tweets in a database, or in a system designed to handle the kind of volumes that would be seen in a real-life use case. So as a next step, I plan on learning and moving to MongoDB or some other similar data store that would serve me better than a flat file.

Also, a lot of the analysis presented here was done in a semi-manual fashion. I’d really like to automate the entire process so that it becomes seamless, from data collection to visualization. As an end goal, I would ideally love to see this evolve into a platform for solving some domain-specific data visualization problems. I’ll keep updating this blog as and when I make progress, so do watch this space for more.


I’ve been meaning to get some kind of first-hand experience of how “Big Data” technologies may be used for social media analysis – specifically around Twitter. I detailed the resulting experience of collecting data about the show “Game of Thrones” in a post earlier this week. This two-part post will attempt to give a “behind the scenes” glimpse into the data presented there.

I had been searching for good topics to collect data on when I had the notion that I could use my current favorite TV series – Game of Thrones – as a kind of test bed.

Ever since I watched the Red Wedding episode of Game of Thrones, I have been determined not to let the show pull another fast one on me. In fact, soon after that season ended, I read all the books in the series, just so that I could stay ahead of the TV series. Thanks to this, I knew what was in store for the final three episodes of Season 4 and figured that it would make for a good experiment on Twitter.

Luckily for me (or rather, unluckily), the show was on a short break for the Memorial Day weekend when I had the idea. That meant I had less than a week to get started and begin the data collection before the show resumed airing on HBO. What followed was a rather interesting ride through technology, visualization and data insights.

Gearing up for D-Day

So, having decided to do this, I began by trying to get a good handle on the technologies involved. Luckily, support for the Twitter API is quite good, with libraries readily available for a variety of programming languages. Having previously done similar analysis using Ruby, I chose the Twitter gem for Ruby. Implementing the Twitter Streaming API using the gem was quite straightforward, and in a few evenings (yes, I have a day job) I had a good enough system working in my development environment. Once I had access to the data, I was faced with the next decision – should I write it to a database or to a flat file?

I originally intended to use MongoDB but, given the steep learning curve in accessing and processing the stored data, I finally chose to go with the flat file approach. I also chose to write only a few fields, viz. tweet creation date, tweet text, Twitter handle, user name, retweet flag, location, time zone, latitude and longitude. In hindsight, I probably should have included a lot more fields, but more on that later.
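
A minimal sketch of what the collection script looked like, assuming the Twitter gem’s version 5-era streaming client (the credentials and track terms here are placeholders; as described below, the real script also rotated to a fresh file each day):

    require 'twitter'

    # Streaming client from the Twitter gem; credentials are placeholders.
    client = Twitter::Streaming::Client.new do |config|
      config.consumer_key        = ENV['TWITTER_CONSUMER_KEY']
      config.consumer_secret     = ENV['TWITTER_CONSUMER_SECRET']
      config.access_token        = ENV['TWITTER_ACCESS_TOKEN']
      config.access_token_secret = ENV['TWITTER_ACCESS_SECRET']
    end

    # One tab-delimited file per day of collection.
    out = File.open("tweets_#{Time.now.strftime('%Y%m%d')}.tsv", 'a')

    client.filter(track: 'GameOfThrones') do |object|
      next unless object.is_a?(Twitter::Tweet)
      lat, lon = object.geo? ? object.geo.coordinates : [nil, nil]
      out.puts [
        object.created_at, object.text, object.user.screen_name,
        object.user.name, object.retweet?, object.user.location,
        object.user.time_zone, lat, lon
      ].join("\t")
    end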

Given the volumes I had seen during my tests, I presumed that over a three-week period I would probably end up with about 700K to 1 million tweets. Given those kinds of volumes, I figured I’d just host the data collection script on the server that hosts my blog. That’s where I hit the first roadblock – the Ruby environment there was just plain out of sync with reality, running some really old versions of gems.

I realized this with just two days to go, and at the very last minute decided to try hosting it on AWS instead. Doing this on AWS was a breeze, and in one long night the code was finally live!!!

The Data Avalanche

I had written the data collection script to generate a new file of tweet data for each day. The file generated for June 1st, 2014 was well within my estimates in terms of the number of tweets collected, but nothing prepared me for the avalanche of tweets that poured out when the monumental episode “The Mountain and The Viper” aired. That day generated close to 500K tweets, and the total for the week crossed the 1 million mark – itself as much as I had estimated for the entire three-week period.

Surprisingly, the script didn’t crash at all and was still running at the end of the day. In fact, in the entire 21-day period the script crashed only once, for a few hours – which, given that I had just hacked the code together over a few evenings, was quite satisfying.

Data quality and other woes

But with great volumes of data came great data quality issues. I was using a tab-delimited file format to store the data, which at the time seemed like a safe choice. In spite of that, I faced several data quality problems.

One of the most common issues was the presence of arbitrary newline characters in the files. While I tried to strip out these characters within the script itself, some errors still crept through, affecting about 0.1% of the total tweets collected.

The second issue arose where the tweet text itself contained tab characters. This was something I could easily have fixed in the script, but it somehow slipped my mind. This kind of error affected about 0.06% of the tweets.

I corrected the first kind of error semi-manually: a script pointed out all such errors, and the corrections were done by hand. For the second kind, rather than hunt down where in each line the error actually was, I chose to just drop those data points, since their number was very small.
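
A rough sketch of that error-spotting script (assuming the nine tab-delimited fields listed earlier; exact handling is illustrative):

    EXPECTED_COLUMNS = 9  # the nine fields written for each tweet

    # Flag malformed rows: a stray newline splits a tweet across lines
    # (too few columns), an embedded tab adds extra columns (too many).
    File.foreach('tweets_raw.tsv').with_index(1) do |line, lineno|
      cols = line.chomp.split("\t", -1).size
      if cols < EXPECTED_COLUMNS
        puts "line #{lineno}: #{cols} columns - possible stray newline (fix by hand)"
      elsif cols > EXPECTED_COLUMNS
        puts "line #{lineno}: #{cols} columns - embedded tab (drop this row)"
      end
    end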

After correcting all this data, I finally ended up with a clean data file of about 0.5 GB.

Preparing the data and designing the analysis

Overview of data flow from tweet to visualization

So once I had cleaned up the base data, the next step was to figure out what to do with it. The base data file was in a fairly accessible format, and for some of the analysis I intended to do I could use it directly with Ruby. For other kinds of analysis, it was imperative that I first perform some transformations on the data.

To do this, I chose Talend as the data transformation platform, largely because it is open source and easily available. For the analysis, I wanted one data set containing just the tweet text (for Apache Pig and HBase), another with just the latitudes and longitudes (for the geographical analysis), and the entire dataset loaded into a MySQL database. I built a Talend ETL job for each of these tasks, and Talend handled them quite easily.

I had originally planned to reverse geocode the latitude/longitude pairs to extract useful information, like the country from which a tweet was sent. After reading up on this, I figured that a reverse geocoding API like those provided by Google Maps or GeoNames would be ideal. However, it soon became evident that, thanks to the throttle limits on the free tiers of these APIs, it would have taken ages to reverse geocode my dataset: Google Maps had a limit of 2,500 requests per day, and even the more generous GeoNames (30,000 requests per day) seemed inadequate given the volume of data I had. On the bright side, GeoNames also makes its entire database available for free. Using that and some Ruby scripting, I built an offline reverse geocoder that got the job done in about eight hours.
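
A minimal sketch of such an offline reverse geocoder, here using the GeoNames cities1000.txt extract (column layout per the GeoNames format; the coarse grid index is a simplification – a real implementation would also check neighbouring buckets or use a proper spatial index):

    # Build a coarse grid index over the GeoNames dump: bucket each place
    # by its rounded (lat, lon), then resolve a tweet's coordinates to the
    # nearest place in the matching bucket. In cities1000.txt, the name is
    # column 2, latitude column 5, longitude column 6, country code column 9.
    grid = Hash.new { |h, k| h[k] = [] }

    File.foreach('cities1000.txt') do |line|
      f = line.chomp.split("\t")
      lat, lon = f[4].to_f, f[5].to_f
      grid[[lat.round, lon.round]] << { name: f[1], country: f[8], lat: lat, lon: lon }
    end

    def reverse_geocode(grid, lat, lon)
      candidates = grid[[lat.round, lon.round]]
      return nil if candidates.empty?
      # Squared planar distance is enough to pick the nearest candidate.
      candidates.min_by { |p| (p[:lat] - lat)**2 + (p[:lon] - lon)**2 }
    end

    place = reverse_geocode(grid, 40.71, -74.00)
    puts "#{place[:name]}, #{place[:country]}" if place  # => "New York City, US"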

I finally had a clean enough data set, with supplemental data prepared, to begin the analysis.

I’ll detail the actual analysis that I performed next week.


So what does the Twitter chatter around the hit TV show Game of Thrones look like? Well, I decided to try and find out.

I set up a small experiment in which I collected tweets about the hit TV series over the last three weeks of the recently concluded Season 4. Then I added some D3.js visualization magic to the data and, presto!!! I had some impressive insights into the Twitter chatter around the series.

I managed to collect about 2.72 million tweets. A quick analysis of the numbers showed that about 114K unique users were responsible for those tweets, sharing about 102K links. About 36% of the tweets were retweets.

Timeline of Tweets

An analysis of the timeline showed regular peaks around episode air dates, with the episode “The Mountain and The Viper” garnering the most reactions on Twitter.

Tweet volume by Day of Week and Hour

An analysis of the hour versus the day on which tweets were sent shows that the busiest hours for tweeting were on Sunday night. The next busiest slot occurred on Monday evening. This ties in nicely with the fact that those two slots fall approximately just after Game of Thrones airs in the USA and the UK, respectively. Twitter traffic peters out over the rest of the week, then picks up again on Sunday afternoon in anticipation of the next episode.




The results are out, and the BJP has won itself a historic landslide victory!!! This means that India can have a stable government for the next 5 years, one not hobbled by coalition politics. Hopefully the new government will use the clear mandate given by the people to end the policy paralysis that plagued the last government and make some developmental progress. However, that is not the main point of this post.

In my last post on the elections, I looked at search volumes for each of the leading candidates by region and noted that none of them evinced any interest from the south and south-east of India. Looking at the map of India after the results (Google Live Election Tracker), it seems that this is the only region where the BJP has not made a clean sweep, or indeed any inroads at all. So does this mean that search volumes (or the lack of them) may be a good predictor of election results? Food for thought!!!


The Grand Old Indian Elections, that massive multi-month saga, is at long last nearing the end. Various news channels have already started relentless coverage of the run-up to May 16th, when hopefully we will have a new government. In this run-up, I thought it would be interesting to see if Google could throw up some insights. So I spent some time over at Google Trends, looking at search trends for the key “front-runners” in this election.

The election has been largely dominated by three main politicians at the national level, viz. the BJP’s Narendra Modi, the Congress’s Rahul Gandhi and the AAP’s dark horse, Arvind Kejriwal. So I pondered what Google search trends could reveal about these three players over the last 12 months. The first graph, showing Interest over Time from Google Trends, is below:

Interest Over Time

The graph shows a clear “Modi Wave” dominating the web search queries over the last 12 months – in fact, more like a tsunami in the last couple of months. Poor Rahul Gandhi and Arvind Kejriwal are far, far behind in this quest to remain at the top of the voter’s mind. They do, however, outshine Modi in one or two instances each. In Rahul’s case, his interview with Arnab on Times Now seems to have been responsible for the spike in search interest towards the end of January. Arvind Kejriwal, on the other hand, spikes well above Modi in December, when the AAP formed the Delhi government. There is one other sharp spike in interest in Arvind, around the time the AAP vacated power in Delhi.

What is more interesting is the regional interest in these leaders. Google Trends luckily also shows the relative interest by state for each of them. Let’s start with Modi, whose chart is shown below:

Interest by Region for Modi

Predictably, Gujarat dominates as the region showing the maximum interest. In this election, Narendra Modi has turned out to be that rare candidate with some measure of pan-India appeal, and this is evidenced somewhat by the search interest map. One should note, however, that the interest seems more concentrated in north and north-west India.

Now let’s move on to Rahul Gandhi.

Interest by Region for Rahul Gandhi

Rahul Gandhi seems to have a bigger search following in the north, particularly in Jharkhand. Overall, there is again a concentration in the north, but it is not as intense as Modi’s. And finally, let’s look at Arvind Kejriwal.

Interest by Region for Arvind Kejriwal

Given that the AAP movement found its first success in north India, during the Delhi state election, search volumes for Kejriwal are predictably concentrated in the north. Surprisingly, however, the leader finds much lower interest in other parts of the country compared to Modi and Gandhi.

Interestingly, all three charts show a remarkable lack of interest in south India and parts of the north-east. Would this lead to yet another “coalition” government propped up by the parties of the south? Only time will tell…


I recently needed to get some large-format documents scanned. However, the DTP shop kind of messed up the scan order of the documents. So here I was, left with a series of images scanned into PDF files and no real way to edit them into the right order (short of buying a PDF editor like Acrobat).

Luckily, Ubuntu comes with a variety of free tools to edit PDF files. The one I ended up using is called PDF Chain, which lets you perform a variety of operations on PDF files, including merging, splitting etc. So I used PDF Chain to split all the PDF files into individual pages. Then I opened each page in GIMP, corrected some scan errors, and exported it to PNG. But then I hit a problem: how do I convert the newly ordered PNG files back into a PDF? This handy hint saved the day. It turns out you can convert a set of image files into a PDF using the convert command in Ubuntu. Here is how it works:

  1. Put the images you want to convert in some location. The convert command takes alphabetical order into account while creating pages, so name your files accordingly if you want the pages in a specific order (e.g. page-01.png, page-02.png, …).
  2. Fire up terminal from “Applications > Accessories”
  3. Navigate to the folder or location with your images so that your working directory is that location
  4. Then, for some magic, run the following:
    convert *.png myPDF.pdf

That’s it!! You should now have a shiny new PDF named myPDF.pdf in the same location as the images. The original tutorial seems to indicate that this only works on the Desktop, but rest assured it works in any location. Sure saved me a ton of time 🙂


So, I have been looking for a replacement for my aging Samsung Galaxy Spica for some time now. Having enjoyed the splendid benefits of stock Android on my Nexus 7 tablet, I wanted the same experience on my phone. However, barring the relatively high-priced Nexus 5 (INR 29,000, or approximately USD 475, in India), there were no real alternatives for a stock Android phone. Most other “budget” phones available (Samsung, LG etc.) didn’t really cut it in the specifications department, while the phones from Xolo, MicroMax etc. looked very promising on paper. Then in December, out of the blue, Motorola announced the Moto G, which seemed to tick most of the boxes I wanted at a relatively low price point. Motorola promised a January release for the Moto G in India, and I figured I might as well wait the extra month.

However, the release date kept being postponed: from early January it slipped to mid-January, and then to early February. The Moto G finally became available in India in early February (February 6th, to be precise). I actually stayed awake till midnight to be one of the first people to order one on Flipkart (currently the only way to get a Moto G in India), but getting hold of a Moto G (16 GB, Black) seemed almost impossible. I had added one to my cart at about 00:13 AM, and by the time I proceeded to checkout at 00:17 AM, it was sold out!!!

Flipkart got in fresh stock at 12:00 PM the same day, and this time round I pressed pay as soon as I could. About two days later, the phone was delivered.


The Moto G came in a small white box. I fully expected the box to contain only the phone and a USB cable (the US model seems to ship with just that), but I was pleasantly surprised to find that it also contained an AC charger and a hands-free kit.

My initial impression of the phone was – whoa!!! – this is much slimmer and lighter than I expected. It also has a nice feel in the hand, with its slightly curved back. Since the phone needed a micro-SIM, it took me a couple of days to get my regular SIM switched over. I decided to actually use the phone for about a month before writing about it. And boy, has it been a fabulous experience. Here is a rundown of the hits and misses:


DIY Signal Booster

Saw this DIY “signal booster”, made out of an old Coca-Cola bottle, at a village during my recent trek to Harishchandragad. Surprisingly, the cell phone kept in the contraption was able to get reception in an otherwise “dead-signal” area. Must work, I guess…


The e-commerce scene in India is booming, with a gazillion “Flipkart” clones and seemingly endless hordes of VC/PE money. But how much do these firms actually make in terms of revenue? To answer this question, I have been wondering whether it is possible to estimate the revenue of some of the leading firms in the space. I chose Flipkart, mainly for its prominence and for the fact that it recently announced a ballpark figure for its revenue.

In August, Flipkart digital VP Sameer Nigam indicated that Flipkart had crossed Rs 1 billion in revenue in July. That is Rs 100 crore in revenue for the month, or about USD 18 million (taking 1 USD = 55.1948 INR, the average for the month).

The popular site Trafficestimate.com gives traffic estimates for websites. The graph below shows the traffic estimates for Flipkart.

Traffic to Flipkart

We see that Flipkart had approximately 33.11 million visits in the month of July. Most e-commerce sites have a 1-2% conversion rate (traffic to customers). Assuming a 1.5% conversion rate, we estimate that Flipkart had 33.11 million × 1.5% ≈ 0.5 million customers who bought something at the website that month.

So how much does the average Flipkart customer spend? According to the comScore-ASSOCHAM report on the State of E-Commerce in India (2012), the figure stands at $35 per transaction. This puts the estimate for Flipkart’s July revenue at approximately 0.5 million × $35 = USD 17.5 million.
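
The whole back-of-the-envelope calculation, as a sketch (the 1.5% conversion rate being the assumption noted above):

    monthly_visits  = 33_110_000  # Trafficestimate.com figure for July
    conversion_rate = 0.015       # assumed: 1.5% of visits convert to buyers
    avg_transaction = 35.0        # USD, per the comScore-ASSOCHAM report (2012)

    buyers  = monthly_visits * conversion_rate  # ~0.5 million customers
    revenue = buyers * avg_transaction          # ~USD 17.4 million

    puts "Estimated monthly revenue: USD #{(revenue / 1_000_000).round(1)} million"
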
As we can see, this is pretty close to the reported revenue. Further, plotting revenue estimates by this method for the last 12 months gives the following graph:

Flipkart Revenue

It’s interesting to note that Flipkart’s revenue hovered between USD 10 and 11 million per month from November 2011 to March 2012. After March 2012, there was a steady and somewhat steep increase in revenue. This raises some interesting questions:
  • Is this an effect of Flyte being launched? Unlikely: by Flipkart’s own assertion, Flyte currently represents only 1% of overall sales, and Flyte retails digital music at approximately Rs 10 per track (approx. 20 cents). Combining these two figures, it seems highly unlikely that the increase is because of Flyte.
  • Is Flipkart’s foray into multiple categories finally paying off? Possibly. The new categories definitely seem to have helped raise the average transaction size to $35.
  • Is this the effect of Flipkart’s recent media campaign? Considering that the new campaign launched around the same time that sales started picking up, this might be a cause. However, I would need more data before judging causality, and there is only so much one can do with public sources.

Overall, this little thought experiment was definitely an interesting one, and in the coming months I will try to see what else can be deduced from public sources of information.

Notes:
  1. The traffic estimates website does not clearly indicate what its figures represent. For the sake of convenience, I assumed that the figures represent visits by potential buyers and not unique customers.
  2. Biggest assumption – Trafficestimate.com gives fairly OK estimates 🙂

Much has been made of the recent “censorship” by the Indian government and the spurt in “takedown” notices from India over the year. Granted, this is indeed deplorable, but a quick look at the data published on the Google Transparency Report website reveals some interesting stats about the state of affairs in other “free” countries. (Data is for 2011. Click on the image for an interactive view.)

Google Take Down requests
