Web Picks (week of 11 July 2016)

Every two weeks, we find the most interesting data science links from around the web and collect them in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to get up to speed on interesting resources.

  • A new ggplot is here
    A ggplot for Python, who would’ve thought? Hopefully a python-dplyr is next…
  • Release of IPython 5.0
    “We are pleased to announce the release of IPython 5.0 LTS (or Long Term Support).” Interesting to note: Python 2 support will be dropped in the next release!
  • Renaming our company – Dato is now Turi
    Dato (formerly known as GraphLab) announces another name change: the company is now named Turi, after the famous computer scientist.
  • Art from Fluctuating Repetition
    A simple but beautiful demonstration of how basic mathematical formulas can give rise to complex, artful behavior.
  • Effective Pandas
    Free book about how to make effective use of pandas, a data analysis library for the Python programming language.
  • Announcing Enhanced Apache Spark Support
    Domino announces support for Apache Spark. Looks great!
  • The land grab for farm data
    “Farmers are adopting smart-sensor technologies and connected farm equipment more quickly than ever, stretching each season to eke out greater yield from finite acreage. This rise in so-called precision agriculture has precipitated a massive influx in unstructured “bushels to bytes” farm data, creating new opportunities for an industry that has previously operated solely in the physical realm.”
  • Why Microsoft is betting its future on AI
    Inside Satya Nadella’s plan to outsmart Google.
  • AI, Apple and Google
    Ben Evans talks about some recent trends in AI. “Today you would spend hours or weeks in data analysis tools looking for the right criteria to find these, and you’d need people doing that work – sorting and resorting that Excel table and eyeballing for the weird result, metaphorically speaking, but with a million rows and a thousand columns. Machine learning offers the promise that a lot of very large and very boring analyses of data can be automated – not just running the search, but working out what the search should be to find the result you want.”
  • If Correlation Doesn’t Imply Causation, Then What Does?
    “We’ve all heard in school that “correlation does not imply causation,” but what does imply causation?! The gold standard for establishing cause and effect is a double-blind controlled trial (or the AB test equivalent). If you’re working with a system on which you can’t perform experiments, is all hope for scientific progress lost? Can we ever understand systems that we have limited or no control over? This would be a very bleak state of affairs, and fortunately there has been progress in answering these questions in the negative!”
  • useR! 2016 Tutorial: Machine Learning Algorithmic Deep Dive
    This tutorial contains training modules for six popular supervised machine learning methods.
  • Analyzing the Graph of Thrones
    Another one of these, but graphing out popular series remains a fun read.
  • Working with spatial data for the web
    This workshop is designed to be very hands-on, with many examples that can be extended as exercises.
  • Google AI has access to huge haul of NHS patient data
    A data-sharing agreement obtained by New Scientist shows that Google DeepMind’s collaboration with the NHS goes far beyond what it has publicly announced.
  • Watch 6,000 Years of Urbanization in 3 Minutes
    Max Galka at Metrocosm has taken the most comprehensive dataset on cities and made it come alive in a new video.
  • How well do facial recognition algorithms cope with a million strangers?
    “In the last few years, several groups have announced that their facial recognition systems have achieved near-perfect accuracy rates, performing better than humans at picking the same face out of the crowd. But those tests were performed on a dataset with only 13,000 images — fewer people than attend an average professional U.S. soccer game. What happens to their performance as those crowds grow to the size of a major U.S. city?”
  • Data Mining Reveals the Crucial Factors That Determine When People Make Blunders
    Decision making is influenced by the complexity of the situation, the skill of the decision maker, and the time pressure. But one of these is much more important than the others, a new study reveals.
  • Detecting Money Laundering
    “Applying machine learning to AML has been challenging due to the limited availability of labeled datasets. However there are a number of unsupervised techniques that may be worth considering.” A very interesting read, especially for its reference to Twitter’s Seasonal Hybrid ESD: https://github.com/twitter/AnomalyDetection
  • To Balance or Not to Balance? Unofficial Google DS Blog on Propensity Scores
    “Determining the causal effects of an action—which we call treatment—on an outcome of interest is at the heart of many data analysis efforts. In an ideal world, experimentation through randomization of the treatment assignment allows the identification and consistent estimation of causal effects. In observational studies treatment is assigned by nature, therefore its mechanism is unknown and needs to be estimated. This can be done through estimation of a quantity known as the propensity score, defined as the probability of receiving treatment within strata of the observed covariates.”
  • Machine Learning Algorithm Spots Depression in Speech Patterns
    “Researchers from the University of Southern California have developed a new machine learning tool capable of detecting certain speech-related diagnostic criteria in patients being evaluated for depression.”
  • Realtime Data Processing at Facebook
    Recently there has been a lot of development in realtime data processing systems, including Twitter’s Storm and Heron, Google’s MillWheel, and LinkedIn’s Samza. This paper presents Facebook’s realtime data processing system architecture and its Puma, Swift, and Stylus stream processing systems.
  • Presentations:
    – “Why Should I Trust You?”: Explaining the Predictions of Any Classifier (YouTube) (Paper)
    – Apache NiFi: Because It Ain’t Data Science Without the Data (YouTube)
    – Streaming SQL (SlideShare)
    – Is that a Time Machine? Some Design Patterns for Real World Machine Learning Systems (SlideShare)
    – A Deep Dive into Structured Streaming (SlideShare)