Web Picks (week of 14 November 2016)

Posted on November 27, 2016

Every two weeks, we find the most interesting data science links from around the web and collect them in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to get up to speed on interesting resources.

Ten ways your data science project is going to fail (presentation)
A thoughtful, amusing, and well-made presentation!

Airbnb open sources its knowledge repository
Great news from Airbnb: “The Knowledge Repository project is focused on facilitating the sharing of knowledge between data scientists and other technical roles using data formats and tools that make sense in these professions. It provides various data stores (and utilities to manage them) for “knowledge posts”, with a particular focus on notebooks (R Markdown and Jupyter / iPython Notebook) to better promote reproducible research.”

The Infancy of Julia: An Inside Look at How Traders and Economists Are Using the Julia Programming Language
Several banks, vendors and a regulator discuss how they are using the Julia programming language to improve their big data analytics capabilities.

How Artificial Intelligence Will Redefine Management
Many alarms have sounded on the potential for artificial intelligence (AI) technologies to upend the workforce, especially for easy-to-automate jobs. But managers at all levels will have to adapt to the world of smart machines. The fact is, artificial intelligence will soon be able to do the administrative tasks that consume much of managers’ time faster, better, and at a lower cost.

DeepMind and Blizzard to release StarCraft II as an AI research environment
At BlizzCon 2016 in Anaheim, California, DeepMind announced their collaboration with Blizzard Entertainment to open up StarCraft II to AI and Machine Learning researchers around the world. See also https://github.com/SKTBrain/awesome-starcraftAI for more resources.

Differentiable neural computers
More from DeepMind: “In a recent study in Nature, we introduce a form of memory-augmented neural network called a differentiable neural computer, and show that it can learn to use its memory to answer questions about complex, structured data, including artificially generated stories, family trees, and even a map of the London Underground. We also show that it can solve a block puzzle game using reinforcement learning.”

Software Dreams Up New Molecules in Quest for Wonder Drugs
Ingesting a heap of drug data allows a machine-learning system to suggest alternatives humans hadn’t tried yet.

The Competitive Landscape for Machine Intelligence
If this year’s landscape shows anything, it’s that the impact of machine intelligence is already here. Almost every industry is already being affected, from agriculture to transportation. Every employee can use machine intelligence to become more productive with tools that exist today. Companies have at their disposal, for the first time, the full set of building blocks to begin embedding machine intelligence in their businesses.

Chatbots with Social Skills Will Convince You to Buy Something
Virtual assistants that can read social cues and nonverbal signals are less jarring—and surprisingly persuasive.

The big data ecosystem for science
A look at the data pipeline architecture for five key NERSC projects.

New MIT technique reveals the basis for machine-learning systems’ hidden decisions
MIT researchers have developed a method to determine the rationale for predictions by neural networks, which loosely mimic the human brain.

What’s wrong with big data?
The view that more information produces better decisions is at odds with the world around us.

The Models Were Telling Us Trump Could Win
… though many also said Clinton would.

Media in the Age of Algorithms
“Since Tuesday’s election, there’s been a lot of finger pointing, and many of those fingers are pointing at Facebook, arguing that their newsfeed algorithms played a major role in spreading misinformation and magnifying polarization. Some of the articles are thoughtful in their criticism, others thoughtful in their defense of Facebook, while others are full of the very misinformation and polarization that they hope will get them to the top of everyone’s newsfeed. But all of them seem to me to make a fundamental error in how they are thinking about media in the age of algorithms.”

How Data Failed Us in Calling an Election
So, what happened with the election data and algorithms? The answer, it seems, is a combination of the shortcomings of polling, analysis and interpretation, perhaps both in how the numbers were presented and how they were understood by the public.

Why I’m Backing Deep Drumpf, And You Should Too
And we remember this classic post: “By funding women in tech, this tremendous algorithm will make America so great it’ll make your head spin, believe me.”

Gradient Boosting Simply explained
Gradient boosting (GB) is a machine learning algorithm developed in the late ’90s that is still very popular. This page explains how the gradient boosting algorithm works using several interactive visualizations.
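
To give a feel for the idea before clicking through, here is a minimal sketch of gradient boosting for squared-error regression on one-dimensional data. It is not taken from the linked page: the function names, the use of decision stumps as weak learners, and the toy data are all assumptions chosen for illustration. Each round fits a stump to the current residuals (the negative gradient of the squared loss) and adds a shrunken copy of it to the ensemble.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump: pick the threshold on x that
    minimizes the squared error of predicting the mean on each side."""
    best = None
    # Exclude the largest x so the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_gradient_boosting(xs, ys, n_rounds=50, learning_rate=0.1):
    """Hypothetical helper: start from the mean of y, then repeatedly
    fit a stump to the residuals and add a shrunken copy of it."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps, learning_rate


def predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


# Toy data (made up): a step from roughly 1 to roughly 3.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = fit_gradient_boosting(xs, ys)
print([round(predict(model, x), 2) for x in xs])
```

The learning rate trades speed for stability: smaller values need more rounds but make each individual stump matter less. Real libraries use deeper trees and support other loss functions.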

Practical advice for analysis of large, complex data sets
From the Unofficial Google Data Science Blog.

Clairvoyant: Software designed to identify and monitor social/historical cues for short term stock movement
“Using stock historical data, train a supervised learning algorithm with any combination of financial indicators. Rapidly backtest your model for accuracy and simulate investment portfolio performance.” One can try…

Graph-based machine learning: Community Detection at Scale
One major constraint of traditional community mining in graphs, however, is that the graph needs to fit in memory. This quickly becomes problematic as the number of nodes surpasses billions and the number of edges reaches trillions.

Google releases Data Studio beta
A platform for dashboarding and data exploration.

Minimal and clean examples of machine learning algorithms
This project targets people who want to learn the internals of ML algorithms or implement them from scratch. The code is much easier to follow than that of the optimized libraries, and easier to play with. All algorithms are implemented in Python, using numpy, scipy and autograd.

New Analytics for Sharing on LinkedIn: See Who’s Viewed Your Post
LinkedIn introduces new “sharing analytics” possibilities.

Tuktu
Let’s see where this goes: “Tuktu is a big data analytics platform that focuses on ease of use. The idea of the platform is that its users can focus on (business) logic rather than dealing with technical implementation details.”

The ultimate reference guide to data wrangling with Python and R
In the realm of data wrangling, data.table from R and pandas from Python dominate. This repo is meant to be a comprehensive, easy to use reference guide on how to do common operations with data.table and pandas, including a cross-reference between them as well as speed comparisons.

Data Science Basics: An Introduction to Ensemble Learners
New to classifiers and a bit uncertain of what ensemble learners are, or how different ones work? This post examines 3 of the most popular ensemble methods in an approach designed for newcomers.
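
As a taste of what an ensemble learner does, here is a minimal sketch of bagging (bootstrap aggregating) with majority voting. Which three methods the linked post covers is not stated here, so picking bagging is an assumption, as are the helper names, the one-dimensional threshold classifier used as the weak learner, and the toy data.

```python
import random
from collections import Counter


def fit_threshold_classifier(points):
    """Weak learner on (x, label) pairs with labels 0/1: choose the
    threshold and orientation that make the fewest training errors."""
    best = None
    for threshold in sorted({x for x, _ in points}):
        for above_label in (0, 1):
            errors = sum(
                (above_label if x > threshold else 1 - above_label) != label
                for x, label in points
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, above_label)
    return best[1], best[2]


def classify(model, x):
    threshold, above_label = model
    return above_label if x > threshold else 1 - above_label


def fit_bagging(points, n_models=25, seed=0):
    """Train each weak learner on a bootstrap sample, i.e. a sample of
    the same size drawn with replacement from the training data."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(points) for _ in points]
        models.append(fit_threshold_classifier(sample))
    return models


def predict_bagging(models, x):
    """Combine the weak learners by majority vote."""
    votes = Counter(classify(m, x) for m in models)
    return votes.most_common(1)[0][0]


# Toy data (made up): small x values are class 0, large ones class 1.
points = [(0.0, 0), (1.0, 0), (2.0, 0), (3.0, 0),
          (7.0, 1), (8.0, 1), (9.0, 1), (10.0, 1)]
models = fit_bagging(points)
print(predict_bagging(models, 1.5), predict_bagging(models, 8.5))
```

Because every learner sees a slightly different sample, their individual mistakes tend to differ, and the vote averages them out.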

Winning solution for the Kaggle competition Painter by Numbers
“This repository contains a 1st place solution for the Kaggle competition Painter by Numbers. Below is a brief description of the dataset and approaches I’ve used to build and validate a predictive model. The challenge of the competition was to examine pairs of paintings and determine whether they were painted by the same artist.”