Web Picks (week of 4 April 2016)

Posted on April 14, 2016

Every two weeks, we find the most interesting data science links from around the web and collect them in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to get up to speed on interesting resources.

The past weeks were filled with media outlets discussing Microsoft’s teenage millennial chatbot, “Tay”. Here’s a selection (more or less chronologically ordered) of what happened:
– Microsoft deletes ‘teen girl’ AI after it became a Hitler-loving sex robot within 24 hours
– Microsoft is deleting its AI chatbot’s incredibly racist tweets
– Learning from Tay’s introduction (Microsoft statement)
– Satya Nadella says Microsoft’s rogue racist chatbot failed by its own standards
– TayTweets: Racist Microsoft chatbot briefly returns to Twitter
– Now anyone can build their own version of Microsoft’s racist, sexist chatbot Tay
– Tay Tweets: Microsoft releases software allowing anyone to create AI chatbots
– Microsoft is betting big on AI chatbots like Tay
– Clippy’s Back: The Future of Microsoft Is Chatbots
– Ethics In Machine Learning: What we learned from Tay chatbot fiasco

Interestingly enough, a similar bot, named XiaoIce, has been in operation in China since late 2014, without major incident.

Microsoft Bot Framework
In case you want to get started building your own bot. It is better to read the tips in this article (part 1, part 2) first, however.

Feather: A Fast On-Disk Format for Data Frames for R and Python, powered by Apache Arrow
Feather is a new, fast, and easy-to-use binary file format for storing data frames. It is lightweight, has a minimal API, is language agnostic (!), and has high read and write performance.

Druid
Another promising data solution: Druid is a fast column-oriented distributed data store.
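
For anyone wondering what "column-oriented" buys you, the toy sketch below contrasts a row layout with a column layout in plain Python. It says nothing about Druid's actual storage engine or API; the sample records and variable names are made up for illustration.

```python
# The same three (hypothetical) page-view records in two layouts.
rows = [
    {"country": "BE", "page": "/home", "views": 3},
    {"country": "US", "page": "/home", "views": 5},
    {"country": "BE", "page": "/blog", "views": 2},
]

# Column layout: one list per field, aligned by position.
columns = {field: [row[field] for row in rows] for field in rows[0]}

# Row layout: an aggregate has to walk every whole record.
total_from_rows = sum(row["views"] for row in rows)

# Column layout: the same aggregate reads a single column only.
total_from_columns = sum(columns["views"])

# A filter on one column yields positions that index into another.
be_positions = [i for i, c in enumerate(columns["country"]) if c == "BE"]
be_views = sum(columns["views"][i] for i in be_positions)
```

Analytical queries tend to touch a few fields across many records, which is the access pattern the column layout favors.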

Caravel
Caravel is a data exploration platform designed to be visual, intuitive and interactive, from Airbnb.

Hottest job? Data scientists say they’re still mostly digital ‘janitors’
Data scientists are considered to have the hottest job right now, but a new study suggests they’re little more than “digital janitors” who spend most of their time cleaning data to prepare it for analysis.

dimple
dimple is an object-oriented API for business analytics powered by d3.

Automating XKCD-Style Narrative Charts
In which the author replicates the XKCD movie narrative charts.

Using R packages and education to scale Data Science at Airbnb
Airbnb explains how they scale collaboration and unify their data science brand.

Machine-Learning Algorithm Identifies Tweets Sent Under the Influence of Alcohol
An analysis of tweeting-while-drinking reveals patterns of alcohol-related behavior in unprecedented detail.

Understanding Aesthetics with Deep Learning
This article aims to understand aesthetics in photographs, using deep learning. Very interesting read.

Chess position evaluation with convolutional neural network in Julia
This post tackles the problem of chess position evaluation with a convolutional neural network (CNN), using Julia’s deep learning library, Mocha.jl.

Five Lessons from AlphaGo’s Historic Victory
News about AlphaGo is still going strong as well: as Google’s computer crushed one of humanity’s best Go players, we learned a lot about the software’s inner workings, and what it means for AI.

Containerized Data Science and Engineering (part 1 and part 2)
Let’s admit it, data scientists are developing some pretty sweet models, optimizations, visualizations, etc. Unfortunately, many of these models will never actually be used because they cannot be “productionized.” In fact, much of the “data science” happening in industry is happening in isolation on data scientists’ laptops, and, in the cases where data science applications are actually deployed, they are often deployed as hacky Python/R scripts uploaded to AWS and run as a cron job.

The Real Value of Containers for Data Science
This is the real value of containers in data science: the ability to capture an experiment’s state (data, code, results, package versions, parameters, etc) at a point in time, making it possible to reproduce an experiment at any stage in the research process.
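
To make "capturing an experiment's state" a bit more concrete, here is a small sketch of the kind of record one might store alongside a container image. It uses only Python's standard library; the function name `snapshot_experiment`, the manifest fields, and the example parameters are assumptions for illustration, not something taken from the linked article.

```python
import hashlib
import json
import platform
from importlib import metadata

def snapshot_experiment(data: bytes, params: dict, packages: list) -> dict:
    """Collect what is needed to rerun an experiment: data hash, parameters, versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "data_sha256": hashlib.sha256(data).hexdigest(),
        "params": params,
        "python": platform.python_version(),
        "packages": versions,
    }

# Hypothetical experiment: a tiny CSV payload and two parameters.
manifest = snapshot_experiment(
    data=b"x,y\n1,2\n",
    params={"learning_rate": 0.1, "seed": 42},
    packages=["pip"],
)
manifest_json = json.dumps(manifest, indent=2, sort_keys=True)
```

A container goes further by freezing the whole environment, but even a manifest like this makes it easier to tell whether two runs used the same data, code settings, and package versions.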

Flappy Bird hack using Deep Reinforcement Learning (Deep Q-learning)
A fun one to close with: this project follows the Deep Q Learning algorithm described in Playing Atari with Deep Reinforcement Learning [2] and shows that this learning algorithm can be further generalized to the notorious Flappy Bird.
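
For readers unfamiliar with the idea, here is a minimal sketch of the update that Q-learning is built around, written in plain Python with a lookup table instead of a neural network. It is not the project's code: the function names, the toy states, and the values of `alpha` and `gamma` are all assumptions made for illustration. Deep Q-learning replaces the table with a network trained toward the same kind of target.

```python
from collections import defaultdict

def q_target(reward, next_q_values, gamma, done):
    """Bellman target: r if the episode ended, else r + gamma * max over next actions."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)

def q_update(q, state, action, reward, next_state, actions,
             alpha=0.5, gamma=0.9, done=False):
    """Move Q(state, action) a fraction alpha toward the Bellman target."""
    next_q = [q[(next_state, a)] for a in actions]
    target = q_target(reward, next_q, gamma, done)
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]

# Hypothetical two-action setup, loosely inspired by Flappy Bird.
ACTIONS = ("flap", "wait")
q = defaultdict(float)  # unseen (state, action) pairs start at 0.0

# The bird survives one step (reward 1.0), then crashes (reward -1.0).
v1 = q_update(q, "s0", "flap", 1.0, "s1", ACTIONS)
v2 = q_update(q, "s1", "wait", -1.0, "s2", ACTIONS, done=True)
```

After these two steps the table holds 0.5 for flapping in `s0` and -0.5 for waiting in `s1`, so the agent has started to prefer the move that kept it alive.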