Web Picks (week of 7 March 2016)

Every two weeks, we find the most interesting data science links from around the web and collect them in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to get up to speed on interesting resources.


  • A Visual Look at 2 Million Chess Games
    “We’ll take a look at more than 2 million games. I was interested to see what kind of visualizations I can do, and what patterns would be revealed by considering so many games.”
  • Introducing GraphFrames
    Databricks announces the release of GraphFrames, a graph processing library for Apache Spark. GraphFrames support general graph processing, similar to Apache Spark’s GraphX library. However, GraphFrames are built on top of Spark DataFrames, resulting in some key advantages.
  • Recommending for the World
    The Netflix team highlights the four most interesting challenges they encountered in making their algorithms operate globally and, most importantly, how this improved their ability to connect members worldwide.
  • Where the f*** can I park?
    Fun, short read about Manuel Garrido’s quest to extract data from a decidedly non-open government data repository.
  • A Billion Taxi Rides in Redshift
    Now this is what you can call Big Data: the author analyses the geospatial metadata of 1.1 billion taxi journeys made in New York City between 2009 and 2015 using Amazon Redshift.
  • Analyzing the Language of the Presidential Debates
    This article looks at the presidential candidates’ transcripts from the debates and sees what can be gained by applying statistical and natural language processing (NLP) techniques to their words.
  • Visualizing the Clinton Email Network in R
    Staying in the realm of politics, though the article states: “This isn’t a post about politics. When the WSJ folks made it possible to search the Clinton email releases I thought it would be fun to get the data into R to show how well the igraph and ggnetwork packages could work together, and also show how to use svgPanZoom to make it a bit easier to poke around the resulting hairball network.”
  • Do Brand Colors Translate to Instagram?
    “Starbucks green. UPS brown. Target red. Some of the most recognized brands have built their visual identity on just one color. We were curious: how does this color-centric visual identity carry over to social media and, more specifically, Instagram?”
  • Scaling Knowledge at Airbnb
    Airbnb discusses how they make their knowledge shareable and easily transmittable, focussing on reproducibility, quality, consumability, discoverability and learning — interesting read.
  • Dora – an exploratory data analysis toolkit for Python
    Dora is a new Python library designed to automate the painful parts of exploratory data analysis. The library contains convenience functions for data cleaning, feature selection & extraction, visualization, partitioning data for model validation, and versioning transformations of data.
  • Dynamic Memory Networks for Visual and Textual Question Answering
    “Neural network architectures with memory and attention mechanisms exhibit certain reasoning capabilities required for question answering. One such architecture, the dynamic memory network (DMN), obtained high accuracy on a variety of language tasks. However, it was not shown whether the architecture achieves strong results for question answering when supporting facts are not marked during training or whether it could be applied to other modalities such as images. Based on an analysis of the DMN, we propose several improvements to its memory and input modules.”