Web Picks (week of 3 August 2015)

Every two weeks, we find the most interesting data science links from around the web and collect them in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to get up to speed on interesting resources.


  • Pandashells
    Most of us already know and love Pandas, but did you know that there’s a project which aims to bring Pandas’ power to the command shell? For those who like to stay within the command line as much as possible, Pandashells offers a neat set of data manipulation and visualisation tools based on Pandas.
  • How Google Translate squeezes deep learning onto a phone
    This article from Google Research describes how the Google Translate team was able to fine-tune a neural network to fit within the memory and CPU constraints of a phone and identify characters in photos, without having to be online.
  • Machine Learning: The High-Interest Credit Card of Technical Debt [pdf]
    We all love machine learning, but it pays to be wary of its potential dangers and pitfalls. This article from Google describes several machine-learning-specific risk factors and design patterns which should be avoided or refactored, namely: boundary erosion, entanglement, hidden feedback loops, undeclared consumers, data dependencies, changes in the external world, and a variety of system-level anti-patterns.
  • Visualizing GoogLeNet Classes
    It seems the web can’t get enough of Google’s recently unveiled deep neural network. This article tries to reproduce Google’s approach to generating class visualizations from the neural network, complete with an IPython notebook for those interested.
  • A Visual Introduction to Machine Learning
    For those looking for a simple primer that explains what machine learning is all about, look no further! This beautiful web page, built with d3.js, provides a captivating, interactive introduction, all within your web browser.
  • Getting Started With Spark DataFrame
    Spark is a popular tool for running distributed computations over large-scale data. This article provides a brief introduction to Spark’s data frames.
  • GAM: The Predictive Modeling Silver Bullet
    “Imagine that you step into a room of data scientists. You ask them if they regularly use generalized additive models (GAM) to do their work. Very few will say yes, if any at all. Now let’s replay the scenario, only this time we replace GAM with, say, random forest or support vector machines (SVM). Everyone will say yes, and you might even spark a passionate debate.” In this article, the author makes the case for GAMs and tries to convince more data scientists to use them. Interestingly enough, we recently saw GAMs mentioned before, namely in the context of Airbnb’s Aerosolve.
  • Tufte in R
    Who isn’t a fan of Edward Tufte and his excellent visualisation practices? This article shows how you can reproduce Tufte’s plots in all their clean, smooth glory in R, using base graphics, lattice and ggplot2.