Web Picks (week of 7 September 2015)

Every two weeks, we find the most interesting data science links from around the web and collect them in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to get up to speed on interesting resources.

  • Towards Neural Network-based Reasoning
    In this paper, the authors propose a framework for neural network-based reasoning over natural language sentences. They show how their technique is capable of Positional Reasoning and Path Finding.
  • Large Scale Decision Forests: Lessons Learned
    The folks at Sift Science detail their investigation into random forest-based models. Their findings: transform sparse features into a single set of rate indicators (similar to this implementation of Leave One Out Encoding), handle missing features by treating a missing value as a third split branch instead of imputing it, don’t split small leaves, keep trees shallow but use many of them, and use Gini impurity as the splitting measure. In addition, they take the predictions of both the leaf node and its parent node into account to smooth predictions.
  • k-means++ Silhouettes
    This post illustrates the k-means++ clustering method using an interactive example. Devised in 2007 by Arthur and Vassilvitskii, k-means++ is a seeding method for choosing the initial centroids of the standard k-means algorithm. Finding the optimal k-means clustering is NP-hard, but k-means++ provides a solution whose expected cost is provably within a factor of O(log k) of the optimum.
  • Beyond Trending Topics: identifying important conversations in communities
    Betaworks describes how they identify, follow and reach communities on Twitter: “It is still hard to evaluate which items appear on a regular basis, and which are more unique. Wouldn’t it be great to know when activity around a certain hashtag is unique? Or more specifically, deviates from the expected behavior?”
  • Domino Datalabs: To Jupyter and beyond
    Domino announces support for Jupyter with R, Python, and Julia kernels. An eye-catching feature is that notebooks can be scheduled for execution as batch jobs, with the output saved as an HTML report.
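
As a small aside on the k-means++ pick above: the seeding idea is compact enough to sketch in a few lines. The snippet below is only an illustration of the general technique described by Arthur and Vassilvitskii, not the code from the linked post. The function names (`squared_distance`, `kmeans_pp_seeds`) and the fallback for the case where all remaining distances are zero are our own assumptions. The first centroid is picked uniformly at random; each further centroid is drawn with probability proportional to its squared distance to the nearest centroid chosen so far.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_seeds(points, k, rng=None):
    """Pick k initial centroids using k-means++ style D^2 weighting.

    `points` is a list of equal-length tuples; `rng` is an optional
    random.Random instance (hypothetical parameter, for reproducibility).
    """
    rng = rng or random.Random()
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")

    # Step 1: the first centroid is chosen uniformly at random.
    centroids = [rng.choice(points)]

    # d2[i] holds the squared distance from points[i] to its nearest centroid.
    d2 = [squared_distance(p, centroids[0]) for p in points]

    while len(centroids) < k:
        if sum(d2) == 0:
            # Assumption: if every point coincides with a chosen centroid,
            # fall back to a uniform pick instead of failing.
            nxt = rng.choice(points)
        else:
            # Step 2: sample the next centroid proportionally to d2.
            nxt = rng.choices(points, weights=d2, k=1)[0]
        centroids.append(nxt)
        # Step 3: update each point's distance to its nearest centroid.
        d2 = [min(d, squared_distance(p, nxt)) for d, p in zip(d2, points)]

    return centroids
```

Calling `kmeans_pp_seeds(points, 3, random.Random(0))` on a list of 2-D tuples returns three of the input points, which can then be handed to a regular k-means loop as starting centroids. Because an already chosen point has zero weight, distinct input points are never picked twice.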