Web Picks (week of 1 August 2016)

Every two weeks, we find the most interesting data science links from around the web and collect them in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to get up to speed on interesting resources.

  • JupyterLab: the next generation of the Jupyter Notebook
    “It’s been a long time in the making, but today we want to start engaging our community with an early (pre-alpha) release of the next generation of the Jupyter Notebook application, which we are calling JupyterLab.”
  • The rectangularness of countries
    “A Facebook friend recently noted that Turkey was ‘a remarkably rectangular country.’ I wondered how it compared to other countries, and this post shows my answers.”
  • An Introduction to Model-Based Machine Learning
    “This blog post follows my journey from traditional statistical modeling to Machine Learning (ML) and introduces a new paradigm of ML called Model-Based Machine Learning (Bishop, 2013). Model-Based Machine Learning may be of particular interest to statisticians, engineers, or related professionals looking to implement machine learning in their research or practice.”
  • Critical Behavior from Deep Dynamics: A Hidden Dimension in Natural Language (paper)
    “We show that in many data sequences – from texts in different languages to melodies and genomes – the mutual information between two symbols decays roughly like a power law with the number of symbols in between the two. In contrast, we prove that Markov/hidden Markov processes generically exhibit exponential decay in their mutual information, which explains why natural languages are poorly approximated by Markov processes.”
  • Why I’m Not a Fan of R-Squared
    “People sometimes use R² as their preferred measure of model fit. Unlike quantities such as MSE or MAD, R² is not a function only of model’s errors, its definition contains an implicit model comparison between the model being analyzed and the constant model that uses only the observed mean to make predictions. As such, R² answers the question: does my model perform better than a constant model? But we often would like to answer a very different question: does my model perform worse than the true model?”
  • Spark ADMM
    “The code in this repository provides a framework for solving arbitrary separable convex optimization problems with Alternating Direction Method of Multipliers (ADMM). In particular, the algorithm implemented is the generalized consensus algorithm described in the paper Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein.”
  • Apartment Prices in Finland
    “With all the data science and big data hype going on it’s always nice to see real case examples. At Reaktor we have created Kannattaakokauppa.fi, a probabilistic modeling-based interactive visualisation of regional apartment price trends in Finland.”
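
A note on the R-squared pick above: the point that R² compares a model against the constant (mean-only) model, while MSE depends only on the model's own errors, can be checked with a few lines of Python. This is only a minimal sketch using the standard textbook definition of R²; the function names and the toy data are made up for illustration and do not come from the linked post.

```python
def mse(y_true, y_pred):
    """Mean squared error: depends only on the model's own errors."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """R^2 = 1 - SSE(model) / SSE(constant model predicting the observed mean)."""
    mean_y = sum(y_true) / len(y_true)
    sse_model = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    sse_const = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - sse_model / sse_const


# Toy data (hypothetical): observed values and three sets of predictions.
y = [1.0, 2.0, 3.0, 4.0]
perfect = [1.0, 2.0, 3.0, 4.0]    # matches the observations exactly
constant = [2.5, 2.5, 2.5, 2.5]   # always predicts the observed mean
reversed_fit = [4.0, 3.0, 2.0, 1.0]  # worse than the constant model

for name, pred in [("perfect", perfect), ("constant", constant), ("reversed", reversed_fit)]:
    print(name, "MSE =", mse(y, pred), "R^2 =", r_squared(y, pred))
```

The constant model scores an R² of exactly 0 by construction, and any model that does worse than it gets a negative R². In other words, R² says how a model compares to the mean-only baseline, not how close it is to the true model.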