Web Picks (week of 14 November 2016)

Every two weeks, we find the most interesting data science links from around the web and collect them in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to get up to speed on interesting resources.

  • Airbnb open sources its knowledge repository
    Great news from Airbnb: “The Knowledge Repository project is focused on facilitating the sharing of knowledge between data scientists and other technical roles using data formats and tools that make sense in these professions. It provides various data stores (and utilities to manage them) for “knowledge posts”, with a particular focus on notebooks (R Markdown and Jupyter / iPython Notebook) to better promote reproducible research.”
  • How Artificial Intelligence Will Redefine Management
    Many alarms have sounded on the potential for artificial intelligence (AI) technologies to upend the workforce, especially for easy-to-automate jobs. But managers at all levels will have to adapt to the world of smart machines. The fact is, artificial intelligence will soon be able to do the administrative tasks that consume much of managers’ time faster, better, and at a lower cost.
  • Differentiable neural computers
    More from DeepMind: “In a recent study in Nature, we introduce a form of memory-augmented neural network called a differentiable neural computer, and show that it can learn to use its memory to answer questions about complex, structured data, including artificially generated stories, family trees, and even a map of the London Underground. We also show that it can solve a block puzzle game using reinforcement learning.”
  • The Competitive Landscape for Machine Intelligence
    If this year’s landscape shows anything, it’s that the impact of machine intelligence is already here. Almost every industry is already being affected, from agriculture to transportation. Every employee can use machine intelligence to become more productive with tools that exist today. Companies have at their disposal, for the first time, the full set of building blocks to begin embedding machine intelligence in their businesses.
  • Media in the Age of Algorithms
    “Since Tuesday’s election, there’s been a lot of finger pointing, and many of those fingers are pointing at Facebook, arguing that their newsfeed algorithms played a major role in spreading misinformation and magnifying polarization. Some of the articles are thoughtful in their criticism, others thoughtful in their defense of Facebook, while others are full of the very misinformation and polarization that they hope will get them to the top of everyone’s newsfeed. But all of them seem to me to make a fundamental error in how they are thinking about media in the age of algorithms.”
  • How Data Failed Us in Calling an Election
    So, what happened with the election data and algorithms? The answer, it seems, is a combination of the shortcomings of polling, analysis and interpretation, perhaps both in how the numbers were presented and how they were understood by the public.
  • Gradient Boosting Simply explained
    Gradient boosting (GB) is a machine learning algorithm developed in the late ’90s that is still very popular. This page explains how the gradient boosting algorithm works using several interactive visualizations.
  • Graph-based machine learning: Community Detection at Scale
    One major constraint of traditional community mining in graphs, however, is that your graph needs to fit in memory. This quickly becomes problematic as the number of nodes surpasses billions and the number of edges reaches trillions.
  • Minimal and clean examples of machine learning algorithms
    This project targets people who want to learn the internals of ML algorithms or implement them from scratch. The code is much easier to follow than the optimized libraries and easier to play with. All algorithms are implemented in Python, using numpy, scipy and autograd.
  • Tuktu
    Let’s see where this goes: “Tuktu is a big data analytics platform that focuses on ease of use. The idea of the platform is that its users can focus on (business) logic rather than dealing with technical implementation details.”
  • The ultimate reference guide to data wrangling with Python and R
    In the realm of data wrangling, data.table from R and pandas from Python dominate. This repo is meant to be a comprehensive, easy to use reference guide on how to do common operations with data.table and pandas, including a cross-reference between them as well as speed comparisons.
  • Winning solution for the Kaggle competition Painter by Numbers
    “This repository contains a 1st place solution for the Kaggle competition Painter by Numbers. Below is a brief description of the dataset and approaches I’ve used to build and validate a predictive model. The challenge of the competition was to examine pairs of paintings and determine whether they were painted by the same artist.”