
Irfan Knows

I really do!

OpenRefine for Data Mangling

I often use data from various sources and sometimes have to get creative about transforming the data into a format that I can use easily. So far I have done this mainly with R (and with Python from time to time). The other day, Dr. Mazzolla pointed me to OpenRefine, an application for data clean-up and transformation. I will most probably never give up R, but what I saw in the project videos impressed me.


Writing Better R Code

When I teach R, I always caution the audience about the quirks of R programming. It is very typical for someone with a background in Java or Python to write code in R that will take forever to execute (like I used to do, and sometimes still do). I warn the students about loops and encourage vectorized operations. Unfortunately, writing faster code in R is often a more involved affair. What I offer are just some easy fixes to the most common issues.
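To make the loop warning concrete, here is a minimal sketch that squares a vector two ways. The function names and the vector size are made up for illustration, and the timings will vary by machine, so treat it as a demonstration of the habit rather than a benchmark.

```r
# Hypothetical example: square every element of a numeric vector.
x <- as.numeric(seq_len(1e4))

# Loop version: grows the result one element at a time,
# which forces R to reallocate the vector on every iteration.
loop_squares <- function(x) {
  out <- c()
  for (i in seq_along(x)) {
    out <- c(out, x[i]^2)
  }
  out
}

# Vectorized version: a single call that operates on the whole vector.
vec_squares <- function(x) {
  x^2
}

# Both give the same answer; only the running time differs.
loop_time <- system.time(res_loop <- loop_squares(x))
vec_time  <- system.time(res_vec  <- vec_squares(x))
print(loop_time)
print(vec_time)
```

If a loop is truly needed, preallocating the result with numeric(length(x)) and filling it by index avoids most of the cost of growing the vector.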

Thinking of those students who may be reading my blog, I want to share a resource that I think would be helpful. If you have reached the point in your R endeavour where you feel the need to write better code, take a look at Hadley Wickham’s Advanced R. The information on the web site is also available as a dead tree book.

Surviving and Thriving in Online Labor Markets

I will be presenting our paper, “Surviving and Thriving in Online Labor Markets,” this weekend at WISE 2015. Our study aims to uncover the dynamics shaping the online labor markets (freelancer.com, upwork.com, elance.com …), with a focus on how a country’s development level affects IT service providers’ ability to survive and earn a living.


Google TensorFlow

Google open sourced TensorFlow (TF), a distributed machine learning library, in November. The basic idea is that you build your ML process into a graph and let TF handle running the work and distributing it between cores. Be it cores in your CPU or GPU, TF has you covered.

The dataflow graph works much like an Enterprise Miner diagram, only you do not have a GUI to define it. The nodes of the graph represent operations, while the edges represent data being exchanged between nodes.

The response to TensorFlow has been overwhelmingly positive so far. The last time Google shared this kind of technology (MapReduce), we ended up with a new industry standard (Hadoop). Google stopped using MapReduce in 2014, and now the sense in online circles is that TensorFlow may repeat the feat. The few critics point out that there are already competitors out there, like Theano and Caffe, that do the same thing without the limitations Google placed (distributing to a cluster is not possible in the open source version of TF).

I experimented with TF a little for a possible EMBA project (which turned out to be too complex for EMBA). The Python interface was a breeze to use. Google did a good job of developing documentation and packaging to go with TF before releasing it. My only gripe was that TensorBoard (the module responsible for visualizing the graphs) only works properly in Chrome. I was unable to get Firefox to render more than the histograms and plots.

Further Reading: http://www.wired.com/2015/11/google-open-sources-its-artificial-intelligence-engine/

IBM Watson Explorer, Predictive Analytics Made Easy?

I am currently developing a tutorial on IBM’s Watson Explorer for EMBA students. Having just recently finished content development for the R workshop for MSBA students, I must say it is a pretty interesting experience. On the one hand you have R, a complex but powerful tool that is designed to lift (almost) any kind of analytics boulder, and on the other hand you have Watson Explorer, a tool that supposedly can deliver analytics with a natural language interface.

Watson’s is an interesting proposition: you feed it data and it answers your questions about that data. It aims to be a complete analytics package, with functionality for data manipulation, visualizations, predictions, and presentations. Of course the functionality provided is very limited compared to R, but Watson is not targeting analysts. It is trying to provide data-driven insights to less technical people, like middle management.

The UI makes plotting and exploration extremely easy. What takes some planning in R is just a few clicks away with Watson Explorer. In my experience, the NLP interface takes some getting used to, and it often fails to understand what is asked. That, however, does not mean it will stay that way. I suspect IBM is giving out free trials of Watson to build a training set for it, so I expect it to become better at what it does pretty soon.

So, go see for yourself at: http://www.ibm.com/smarterplanet/us/en/ibmwatson/explorer.html

R Workshop Documentation (Updated)

I have been working on the upcoming R workshop for a few weeks now. Finally I am ready to publish part of the material in the spirit of open source, so that others can benefit from, and improve upon, my material.

I uploaded all the materials to my GitHub repository. There are a few documents I am holding back, as I do not want the attendees to go too far without my guidance.

UPDATE: I uploaded all documents prior to the second session. 11.11.2015

The workshop documentation can be found here.

Integrating R Code in Beamer Presentations

I am preparing presentations for the upcoming R workshop. That means I have to find a way to integrate R code and results into my presentations. I know there are a number of ways to achieve this, but I like beamer for presentations and R for analysis. Thanks to knitr, I can bring together my two favorite pieces of software.

Follow a few simple steps, and you are golden.

  • Take your regular beamer presentation and change the file extension to .Rnw.
  • Initialize the frame with the fragile option, like so: \begin{frame}[fragile]
  • Enclose your R code in a chunk: <<parameters here>>= on its own line, then CODE HERE, then @ on its own line.
  • Knit with knitr in R.
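The steps above can be sketched end to end from R. The snippet below writes a one-frame .Rnw file to a temporary directory and knits it to .tex. The file name, chunk label, and the use of the built-in cars dataset are my own placeholders, and the knit step only runs if knitr is installed.

```r
# A minimal beamer + knitr source, held as lines of text.
# Backslashes are doubled because this is an R string.
rnw <- c(
  "\\documentclass{beamer}",
  "\\begin{document}",
  "\\begin{frame}[fragile]",
  "\\frametitle{A knitr chunk}",
  "<<speed-summary, echo=TRUE>>=",
  "summary(cars$speed)",
  "@",
  "\\end{frame}",
  "\\end{document}"
)

# Write the source with the .Rnw extension (placeholder file name).
rnw_file <- file.path(tempdir(), "slides.Rnw")
writeLines(rnw, rnw_file)

# Knit to .tex if knitr is available; compile the .tex with LaTeX afterwards.
if (requireNamespace("knitr", quietly = TRUE)) {
  tex_file <- knitr::knit(rnw_file,
                          output = file.path(tempdir(), "slides.tex"),
                          quiet = TRUE)
}
```

Setting echo=FALSE in the chunk header hides the code and keeps only the output, which is often what you want on a slide.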

The results look marvelous.

[Image: the knitted beamer slide]

There are certain parameters to control suppression of code display, suppression of output, or scaling of images. Those can be discovered by following the examples of Yihui Xie.

World Map Visualizations Based on Data in R

I am working on a workshop for Business Analytics masters students. Some of the demos I intend to use are geographical visualizations. I am using the rworldmap package to achieve these.

Let us say you have geographical data in a data.frame called country, with country identifiers stored in the “ip_iso2” column in ISO2 format and the variable of interest in the “Supply” column. You can plot this data as follows:


# Load the package that provides the mapping functions
library(rworldmap)
# Join your data to a world map object that rworldmap can render
country <- joinCountryData2Map(country, joinCode = "ISO2", nameJoinColumn = "ip_iso2", verbose = TRUE)
# Render the map, shading each country by its Supply value
mapCountryData(country, nameColumnToPlot = "Supply")

The result would look like this:

[Image: world map shaded by Supply]
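If you want to try the calls without my dataset, a toy data.frame with the same two columns is enough. The country codes and Supply values below are invented purely for illustration, and the mapping calls only run if rworldmap is installed.

```r
# Toy data with the same shape as described: ISO2 codes plus a numeric variable.
# The values are made up for illustration only.
toy <- data.frame(
  ip_iso2 = c("US", "IN", "DE", "BR"),
  Supply  = c(120, 340, 60, 90),
  stringsAsFactors = FALSE
)

if (requireNamespace("rworldmap", quietly = TRUE)) {
  library(rworldmap)
  # Join on the ISO2 column, then shade countries by Supply.
  toy_map <- joinCountryData2Map(toy, joinCode = "ISO2",
                                 nameJoinColumn = "ip_iso2", verbose = TRUE)
  mapCountryData(toy_map, nameColumnToPlot = "Supply")
}
```

With verbose = TRUE, the join reports which codes matched and which failed, which is a quick way to catch malformed country identifiers.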

A fancier example would be having a third grouping variable such as “Those” in a column.


# Initialize the map device
mapDevice()
# Draw the bubbles (and change the colors of land and oceans)
mapBubbles(dF = country, nameZSize = "Supply", nameZColour = "Those", colourPalette = "rainbow", oceanCol = "lightblue", landCol = "wheat")

And this would be the result of this effort.

[Image: bubble map of Supply, colored by Those]

Using TikZ in Beamer to Highlight Certain Parts of Tables

I did my proposal yesterday and it went well 😀

I used beamer to prepare my slide deck and was very pleased with how easy it was to work with. R already has functionality to export tables to LaTeX format, and they looked gorgeous in beamer.

I wanted to highlight certain parts of the results table and discuss the results as they relate to the hypotheses. I achieved this through tikzmarkin and tikzpicture. Below is an example and how it looks.

[Image: the results table with highlighted rows]


\documentclass{beamer}

% for themes, etc.
\mode<presentation>
\usetheme{CambridgeUS}
\usecolortheme{beaver}

%\usepackage{times}  % fonts are up to you
\usepackage{graphicx}
% The usual suspects
\usepackage{multirow, booktabs, dcolumn, color} % Tables
% The table highlighting for hypothesis discussion.
\usepackage[beamer,customcolors]{hf-tikz}
\usetikzlibrary{calc}

% To set the hypothesis highlighting boxes red.
\tikzset{hl/.style={
set fill color=red!80!black!40,
set border color=red!80!black,
},
}

\begin{document}

\begin{frame}
\frametitle{Preliminary Results}

\resizebox{.99\linewidth}{!}{
\begin{tabular}{l D{)}{)}{14)3}@{} D{)}{)}{13)3}@{} D{)}{)}{13)3}@{} }
& \multicolumn{1}{c}{Model 1} & \multicolumn{1}{c}{Model 2} & \multicolumn{1}{c}{Model 3} \\
\toprule
\midrule
~Control           & 0.392 \; (0.021)^{***}   & 0.198 \; (0.022)^{***}  & 0.198 \; (0.022)^{***} \\
\tikzmarkin<3>[hl]{bH2}DevOwn           & 0.064 \; (0.003)^{***}   &                         &                        \\
~Frat            &                          & 0.051 \; (0.001)^{***}  &                        \\
~Serot           &                          &                         & 0.051 \; (0.001)^{***}  \tikzmarkend{bH2} \\
\tikzmarkin<2>[hl]{bH1}Frat x Serot     & -22.018 \; (1.474)^{***} & -8.747 \; (1.535)^{***} & -8.750 \; (1.535)^{***} \tikzmarkend{bH1} \\
\midrule
AIC              & 171986.112               & 140758.027              & 140762.308             \\
Num. events      & 11821                    & 11821                   & 11821                  \\
Num. obs.        & 601960                   & 601960                  & 601960                 \\
\bottomrule
\multicolumn{4}{l}{\scriptsize{$^{***}p<0.001$, $^{**}p<0.01$, $^*p<0.05$}}
\end{tabular}
}

\only<2>{
% Place the hypothesis number next to the highlighted area
\begin{tikzpicture}[remember picture,overlay]
\node[align=left, left] at ({pic cs:bH2}) {\small{H2}};
\end{tikzpicture}
}

\end{frame}

\end{document}

 
