Web Picks (week of 3 August 2015)

Posted on August 16, 2015

Every two weeks, we find the most interesting data science links from around the web and collect them in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to get up to speed on interesting resources.

Pandashells
Most of us already know and love Pandas, but did you know that there’s a project which aims to bring Pandas’ power to the command shell? For those who like to stay within the command line as much as possible, Pandashells offers a neat set of data manipulation and visualisation tools based on Pandas.

How Google Translate squeezes deep learning onto a phone
This article from Google Research describes how the Google Translate team was able to fine-tune a neural network to fit inside the memory and CPU constraints of a phone and perform character identification from photos, without having to be online.

A curated list of data science blogs
For those not getting enough data science news, this GitHub page provides a lengthy list of data science-related blogs.

Machine Learning: The High-Interest Credit Card of Technical Debt [pdf]
We all love machine learning, but it pays to be wary of its potential dangers and pitfalls. This article from Google describes several machine-learning-specific risk factors and design patterns that should be avoided or refactored, namely: boundary erosion, entanglement, hidden feedback loops, undeclared consumers, data dependencies, changes in the external world, and a variety of system-level anti-patterns.

Visualizing GoogLeNet Classes
It seems the web can’t get enough of Google’s recently unveiled deep neural network. This article tries to reproduce Google’s approach towards generating class visualizations from the neural network, complete with an IPython notebook for those interested.

A Visual Introduction to Machine Learning
For those looking for a simple primer to explain to people what machine learning is all about, look no further! This beautiful web page — built with d3.js — provides a captivating, interactive introduction, all within your web browser.

Getting Started With Spark DataFrame
Spark is a popular tool for running distributed computations over large-scale data. This article provides a brief introduction to Spark’s data frames.

Faster deep learning with GPUs and Theano
The folks at Domino describe how you can use a powerful GPU and Theano to whip up a deep neural network.

GAM: The Predictive Modeling Silver Bullet
“Imagine that you step into a room of data scientists. You ask them if they regularly use generalized additive models (GAM) to do their work. Very few will say yes, if any at all. Now let’s replay the scenario, only this time we replace GAM with, say, random forest or support vector machines (SVM). Everyone will say yes, and you might even spark a passionate debate.” In this article, the author makes the case for GAMs and tries to convince more data scientists to use them. Interestingly enough, we recently saw GAMs mentioned before, namely in the context of Airbnb’s Aerosolve.
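
For readers who have not come across GAMs before, the usual textbook form is sketched below. This is the standard definition rather than anything quoted from the linked article, and the symbols (g, beta_0, f_j, p) are our own notation:

```latex
g\big(\mathbb{E}[\,y \mid x_1, \dots, x_p\,]\big) = \beta_0 + \sum_{j=1}^{p} f_j(x_j)
```

Here g is a link function (the identity for ordinary regression, the logit for binary outcomes), beta_0 is an intercept, and each f_j is a smooth function of a single predictor estimated from the data. Because every predictor contributes through its own curve, the fitted f_j can be plotted and inspected one at a time.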

Composing Music With Recurrent Neural Networks
Yet another article illustrating how it is possible to compose music (classical, in this case) using recurrent neural networks.

Detecting Algorithmically Generated Domain Names with Random Forests
An interesting IPython notebook detailing the step-by-step process of detecting non-human-registered domain names (ptes9dro-dwacty2lfa5rrts.com, anyone?). A bit light on the technical side, but a nice introductory read nonetheless.
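
To give a feel for the kind of input such a classifier might work with, here is a minimal sketch of a few per-domain features. The feature set (length, character entropy, digit ratio) and the function names are our own assumptions for illustration, not necessarily what the notebook uses, and no model is trained here:

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy (bits per character) of the characters in text."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def domain_features(domain: str) -> dict:
    """Hypothetical feature set for one domain name (illustrative only)."""
    # Keep only the leftmost label, e.g. "example" from "example.com".
    label = domain.lower().split(".")[0]
    digits = sum(ch.isdigit() for ch in label)
    return {
        "length": len(label),
        "entropy": shannon_entropy(label),
        "digit_ratio": digits / len(label) if label else 0.0,
    }


# Compare the generated-looking domain from the text with an ordinary one.
generated = domain_features("ptes9dro-dwacty2lfa5rrts.com")
ordinary = domain_features("wikipedia.org")
print(generated)
print(ordinary)
```

Feature dictionaries like these could be stacked into a table and passed to a random forest; the generated-looking name scores higher on both length and entropy than the ordinary one.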

Tufte in R
Who isn’t a fan of Edward Tufte and his excellent visualisation practices? This article shows how you can reproduce Tufte’s plots in all their clean, smooth glory in R, using base graphics, lattice, and ggplot2.