Web Picks (week of 30 November 2015)

Every two weeks, we find the most interesting data science links from around the web and collect them in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to get up to speed on interesting resources.


  • Data Science and Statistics: Different Worlds? [YouTube]
    In the last few years, data science has become an increasingly popular discipline. However, within the world of statistics, the ‘big data’ and ‘data scientist’ developments are sometimes labelled as hype, and ‘data science’ is seen as a rebranding of what should be statistics. One of the often-heard criticisms of big data analytics is that a lack of statistical rigour can lead to the wrong decisions. This talk discusses this topic in depth.
  • Character-based Neural Machine Translation
    This paper introduces a neural machine translation model that views the input and output sentences as sequences of characters rather than words. The authors show that their model can achieve translation results that are on par with conventional word-based models.
  • Reducing Overfitting in Deep Networks by Decorrelating Representations
    One major challenge in training Deep Neural Networks is preventing overfitting. Many techniques, such as data augmentation and novel regularizers like Dropout, have been proposed to prevent overfitting without requiring a massive amount of training data. This work proposes a new regularizer called DeCov which leads to significantly reduced overfitting and better generalization.
  • The hardest parts of data science
    Contrary to common belief, the hardest part of data science isn’t building an accurate model or obtaining good, clean data. It is much harder to define feasible problems and come up with reasonable ways of measuring solutions. This post discusses some examples of these issues and how they can be addressed.
  • Building Analytics at Simple
    “Early in 2014, Simple was a mid-stage startup with only a single analytics-focused employee. When we wanted to answer a question about customer behavior or business performance, we would have to query production databases. Everybody in the company wanted to make informed decisions, from engineering to product strategy to business development to customer relations, so it was clear that we needed to build a data warehouse and a team to support it.”
  • NBA Player Movement Data in R
    “Everyone’s excited about the newly released NBA player movement data that’s been released – at least I am. I stumbled across this post which shows how to visualize player movement data in Python, but I wanted to figure out how to do the same in R. We’ll use a combination of several R packages including RCurl, jsonlite, png and plotrix.”
  • Tree-based Pipeline Optimization Tool (TPOT)
    In the same vein as auto-sklearn, consider TPOT your Data Science Assistant. TPOT is a Python tool that automatically creates and optimizes Machine Learning pipelines using genetic programming. TPOT will automate the most tedious part of Machine Learning by intelligently exploring thousands of possible pipelines to find the best one for your data.
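
For readers curious about the DeCov item above: the idea, as we understand it, is to penalize the covariance between different hidden units over a batch, so that the units learn less redundant features. The snippet below is only a rough sketch of such a penalty in plain Python. The function name `decov_penalty`, the list-of-lists input and the division by the batch size are our own assumptions for illustration, not the authors' code.

```python
def decov_penalty(activations):
    """Sketch of a decorrelation penalty on a batch of hidden activations.

    `activations` is a list of rows, one row per example and one column
    per hidden unit (a hypothetical input format chosen for this sketch).
    Returns half the sum of squared off-diagonal covariance entries.
    """
    n = len(activations)       # batch size
    d = len(activations[0])    # number of hidden units

    # Center each hidden unit around its mean over the batch.
    means = [sum(row[j] for row in activations) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in activations]

    penalty = 0.0
    for i in range(d):
        for j in range(d):
            if i != j:  # skip the diagonal: variances are not penalized
                cov_ij = sum(r[i] * r[j] for r in centered) / n
                penalty += cov_ij ** 2
    return 0.5 * penalty


# Two perfectly correlated units give a positive penalty,
# two uncorrelated units give no penalty.
correlated = [[1.0, 1.0], [-1.0, -1.0]]
uncorrelated = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
print(decov_penalty(correlated), decov_penalty(uncorrelated))
```

In a real network this term would be added to the training loss with a small weight and computed with a tensor library rather than Python loops.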