Irfan Knows

I really do!

I use AWS to run the various scripts used in data collection. I have been paying about $20 each month for a t1.micro instance since my free tier and academic credits ran out. That is quite a price to pay for 1 core and about 600MB of RAM. Maybe AWS is not the best option for long-running tasks. The advantages of AWS (fast and easy deployment) do not matter so much in the long run, and a more traditional hosting option may be more cost effective in those cases.

Still, all is not lost. Amazon has been offering reserved instances since 2009. Reserving your instance up front rather than paying for it as you go can add up to significant savings. I recently switched to reserved instances and now my costs are less than $10 per month for that t1.micro.


Understanding Neural Networks

A few months ago I made a presentation on TensorFlow for MBA students. The challenge was to simplify the technical underpinnings of the technology sufficiently while preserving a semblance of what it was good for. One particular challenge was visualizing the technology. Showing Python code and output was out of the question, and at the time TensorBoard was not up to the challenge.

This week I was pleasantly surprised to find A Neural Network Playground. It is an interactive demo of a neural network hosted on the TensorFlow site. The underlying mechanism is completely abstracted away, so it is more about understanding how a neural network works and less about TensorFlow.


OpenRefine for Data Mangling

I often use data from various sources and sometimes have to get creative about transforming the data into a format that I can use easily. So far I have done this mainly with R (and with Python from time to time). The other day, Dr. Mazzolla pointed me to OpenRefine, an application for data clean-up and transformation. I will most probably never give up R, but what I saw in the project videos impressed me.


Writing Better R Code

When I teach R, I always caution the audience about the quirks of R programming. It is very typical for someone with a background in Java or Python to write code in R that will take forever to execute (like I used to do, and sometimes still do). I warn the students about loops and encourage vectorized operations. Unfortunately, writing faster code in R is often a more involved affair. What I offer are just some easy fixes to the most common issues.

Thinking of those students who may be reading my blog, I want to share a resource that I think would be helpful. If you have reached the point in your R endeavour where you feel the need to write better code, take a look at Hadley Wickham’s Advanced R. The information on the website is also available as a dead tree book.

Surviving and Thriving in Online Labor Markets

I will be presenting our paper, “Surviving and Thriving in Online Labor Markets”, this weekend at WISE 2015. Our study aims to uncover the dynamics shaping the online labor markets (,, …), with a focus on the role of country development level in IT service providers’ ability to survive and earn a living.


Google TensorFlow

Google open sourced TensorFlow (TF), a distributed machine learning library, in November. The basic idea is that you build your ML process into a graph and let TF handle running it and distributing the work between cores. Whether those cores are in your CPU or your GPU, TF has you covered.

The dataflow graph works much like an Enterprise Miner diagram, except that you do not have a GUI to define it. The nodes of the graph represent operations, while the edges represent data being exchanged between nodes.
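To make the node and edge idea concrete, here is a minimal sketch of a dataflow graph in plain Python. It does not use the TensorFlow API; the names Node, constant, and evaluate are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Node:
    """One operation in the graph; `inputs` are the edges feeding data into it."""
    name: str
    op: Callable[..., float]
    inputs: Tuple["Node", ...] = ()


def constant(name, value):
    """A node with no inputs that simply produces a fixed value."""
    return Node(name, lambda: value)


def evaluate(node, cache=None):
    """Evaluate a node by first evaluating every node that feeds into it."""
    if cache is None:
        cache = {}
    if node.name not in cache:
        args = [evaluate(parent, cache) for parent in node.inputs]
        cache[node.name] = node.op(*args)
    return cache[node.name]


# Define the graph first; nothing is computed yet.
a = constant("a", 2.0)
b = constant("b", 3.0)
total = Node("add", lambda x, y: x + y, (a, b))
result = Node("mul", lambda x, y: x * y, (total, b))

# Only now is the graph actually run.
print(evaluate(result))  # (2 + 3) * 3 = 15.0
```

The point of the sketch is the separation between describing the computation and running it. Because the whole graph is known before anything is executed, a library is free to decide where each operation should run.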

The response to TensorFlow has been overwhelmingly positive so far. The last time Google shared this kind of technology (MapReduce), we ended up with a new industry standard (Hadoop). Google stopped using MapReduce in 2014, and the sense in online circles is that TensorFlow may repeat the feat. The few critics point out that there are already competitors, like Theano and Caffe, that do the same thing without the limitations Google placed on TF (distributing to a cluster is not possible in the open source version).

I experimented with TF a little for a possible EMBA project (which turned out to be too complex for EMBA). The Python interface was a breeze to use. Google did a good job of developing documentation and packaging to go with TF before releasing it. My only gripe was that TensorBoard (the module responsible for visualizing the graphs) only works properly in Chrome. I was unable to get Firefox to render more than the histograms and plots.


IBM Watson Explorer, Predictive Analytics Made Easy?

I am currently developing a tutorial on IBM’s Watson Explorer for EMBA students. Having just recently finished content development for an R workshop for MSBA, I must say it is a pretty interesting experience. On the one hand you have R, a complex but powerful tool that is designed to lift (almost) any kind of analytics boulder, and on the other hand you have Watson Explorer, a tool that supposedly can deliver analytics with a natural language interface.

Watson’s is an interesting proposition: you feed it data and it answers your questions about that data. It aims to be a complete analytics package, with functionality for data manipulation, visualizations, predictions, and presentations. Of course the functionality provided is very limited compared to R, but Watson is not targeting analysts. It is trying to provide data-driven insights to less technical people, such as middle management.

The UI makes plotting and exploration extremely easy. What takes some planning in R is just a few clicks away with Watson Explorer. In my experience, the NLP interface takes some getting used to and often fails to understand what is asked. That, however, does not mean it will stay that way. I suspect IBM is giving out free trials of Watson to build a training set for it, so I expect it to become better at what it does pretty soon.

So, go see for yourself at:

R Workshop Documentation (Updated)

I have been working on the upcoming R workshop for a few weeks now. I am finally ready to publish part of the material in the spirit of open source, so that others can benefit from it and improve upon it.

I uploaded all the materials to my GitHub repository. There are a few documents I am holding back, as I do not want the attendees to go too far without my guidance.

UPDATE: I uploaded all documents prior to the second session. 11.11.2015

The workshop documentation can be found here.

Integrating R Code in Beamer Presentations

I am preparing presentations for the upcoming R workshop. That means I have to find a way to integrate R code and results into my presentations. I know there are a number of ways to achieve this, but I like Beamer for presentations and R for analysis. Thanks to knitr, I can bring together my two favorite pieces of software.

Follow a few simple steps, and you are golden.

  • Take your regular Beamer presentation and change the file extension to .Rnw.
  • Initialize the frame with the fragile option, like so: \begin{frame}[fragile]
  • Enclose your R code in a chunk, like so: <<parameters here>>= CODE HERE @
  • Knit with knitr in R.

The results look marvelous.


There are chunk parameters that control whether the code and its output are displayed, and how images are scaled. Those can be discovered by following the examples of Yihui Xie.
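The steps above can be sketched as a minimal .Rnw skeleton. Python is used here only to assemble and print the text; the function name rnw_skeleton, the chunk label demo, and the sample R call summary(cars) are assumptions made for this illustration, not part of my workshop material. Note that the chunk header and the closing @ each sit on their own line.

```python
def rnw_skeleton(r_code: str, chunk_options: str = "demo, echo=TRUE") -> str:
    """Return a minimal Beamer .Rnw document wrapping r_code in one knitr chunk."""
    lines = [
        r"\documentclass{beamer}",
        r"\begin{document}",
        r"\begin{frame}[fragile]",   # the fragile option is needed for verbatim code
        f"<<{chunk_options}>>=",      # chunk header: label first, then options
        r_code,                       # the R code to run and display
        "@",                          # end of the chunk
        r"\end{frame}",
        r"\end{document}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical example: one slide showing a summary of R's built-in cars data.
skeleton = rnw_skeleton("summary(cars)")
print(skeleton)
```

Saving the printed text as, say, slides.Rnw and running knitr::knit("slides.Rnw") in R should produce slides.tex, which can then be compiled like any other Beamer file. Setting echo=FALSE in the chunk header would hide the code and show only its output.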
