Web Picks (week of 5 September 2016)

Every two weeks, we find the most interesting data science links from around the web and collect them in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to get up to speed on interesting resources.

  • The “Joel Test” for Data Science
    Some great advice here: Can new hires get set up in the environment to run analyses on their first day? Can data scientists use the latest tools/packages without help from IT? Can data scientists use on-demand and scalable compute resources without help from IT/dev ops? Can data scientists find and reproduce past experiments and results, using the original code, data, parameters, and software versions? Does collaboration happen through a system other than email? Can predictive models be deployed to production without custom engineering or infrastructure work? Is there a single place to search for past research and reusable data sets, code, etc.?
  • How algorithms rule our working lives
    Employers are turning to mathematically modelled ways of sifting through job applications. Even when wrong, their verdicts seem beyond dispute – and they tend to punish the poor.
  • Decoupled Neural Interfaces Using Synthetic Gradients
    Neural networks are the workhorse of many of the algorithms developed at DeepMind. For example, AlphaGo uses convolutional neural networks to evaluate board positions in the game of Go, and the DQN deep reinforcement learning algorithm uses neural networks to choose actions, playing video games at a super-human level. This post from DeepMind introduces some of their latest research into improving the capabilities and training procedures of neural networks: a technique called Decoupled Neural Interfaces using Synthetic Gradients. A toy sketch of the core idea appears after this list.
  • Learning from Imbalanced Classes
    When you start looking at real, uncleaned data, one of the first things you notice is that it is a lot noisier and more imbalanced. If you deal with such problems and want practical advice on how to address them, read on. Two of the standard remedies are sketched in code after this list.
  • Using Apache Spark to Analyze Large Neuroimaging Datasets
    This post describes how to find the dominant components in high-dimensional neuroimaging data, and demonstrates how to perform Principal Component Analysis (PCA) on a dataset large enough that standard single-machine techniques will not work. A minimal PySpark version of the core step is sketched after this list.
  • An exclusive inside look at how artificial intelligence and machine learning work at Apple
    “Because Apple has always been so tight-lipped about what goes on behind badged doors, the AI cognoscenti didn’t know what Apple was up to in machine learning. “It’s not part of the community,” says Jerry Kaplan, who teaches a course at Stanford on the history of artificial intelligence. “Apple is the NSA of AI.” But AI’s Brahmins figured that if Apple’s efforts were as significant as Google’s or Facebook’s, they would have heard that.”
  • 3 Reasons Counting is the Hardest Thing in Data Science
    “Counting is hard. You might be surprised to hear me say that, but it’s true. As a data scientist, I’ve done it all – everything from simple regression analysis all the way to coding Hadoop MapReduce jobs that process hundreds of billions of data points each month. And, with all that experience, I’ve found that counting often involves far more time and effort.”
  • O’Reilly Data Ebook Archive
    An archive of all O’Reilly data ebooks is available for free download. Dive deep into the latest in data science and big data, compiled by O’Reilly editors, authors, and Strata speakers.
  • glmnet for Python
    “Thinking it would be easier to have a tool that was written in a single language, I started looking for the Scikit-Learn analog of glmnet, specifically the cv.glmnet function in this R package. Unfortunately, I could not find anything written in Python that emulated the functionality we needed from glmnet.” For comparison, the nearest built-in scikit-learn equivalent is sketched after this list.
  • Collaborative Filtering with Recurrent Neural Networks (paper)
    “We show that collaborative filtering can be viewed as a sequence prediction problem, and that, given this interpretation, recurrent neural networks offer a very competitive approach. In particular, we study how long short-term memory (LSTM) can be applied to collaborative filtering, and how it compares to standard nearest neighbors and matrix factorization methods on movie recommendation.” A minimal next-item LSTM in this spirit is sketched after this list.
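
A few illustrative sketches in Python follow for the more technical items above. First, the core idea behind DeepMind's synthetic gradients: a small side model learns to predict a layer's gradient from its activations, so the layer can update without waiting for the rest of the backward pass. This is a toy numpy rendering of that idea; the linear gradient model, layer sizes, and data are simplifications for illustration (the paper uses small neural networks as the gradient models), not the authors' code.

```python
# Toy sketch of decoupled neural interfaces with synthetic gradients.
# A side model M predicts dL/dh from the hidden activations h, so
# layer 1 can update before the true gradient is available.
import numpy as np

rng = np.random.RandomState(0)
lr = 0.05

W1 = rng.randn(4, 8) * 0.1   # input -> hidden
W2 = rng.randn(8, 1) * 0.1   # hidden -> output
M = np.zeros((8, 8))         # synthetic-gradient model: h -> predicted dL/dh

for step in range(2000):
    x = rng.randn(32, 4)
    y = (x.sum(axis=1, keepdims=True) > 0).astype(float)
    n = len(x)

    h = np.tanh(x @ W1)

    # Layer 1 updates immediately with the *synthetic* gradient,
    # without waiting for the rest of the network (the decoupling).
    synth_dh = h @ M
    W1 -= lr * x.T @ (synth_dh * (1 - h ** 2)) / n

    # The rest of the forward/backward pass, run separately.
    out = h @ W2
    dout = out - y                 # gradient of the squared loss
    true_dh = dout @ W2.T          # the true dL/dh, only available now
    W2 -= lr * h.T @ dout / n

    # Train the synthetic-gradient model toward the true gradient.
    M -= lr * h.T @ (synth_dh - true_dh) / n
```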
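
For the imbalanced-classes post, here are two of the standard remedies it discusses, class weighting and minority oversampling, expressed with scikit-learn. The synthetic dataset and all parameters are made up for illustration.

```python
# Two common remedies for class imbalance, sketched with scikit-learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

# A 95/5 imbalanced binary problem.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05],
                           random_state=0)

# Remedy 1: reweight errors on the rare class instead of touching the data.
clf = LogisticRegression(class_weight="balanced").fit(X, y)

# Remedy 2: oversample the minority class until it matches the majority.
X_min, X_maj = X[y == 1], X[y == 0]
X_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)
X_bal = np.vstack([X_maj, X_up])
y_bal = np.array([0] * len(X_maj) + [1] * len(X_up))
clf2 = LogisticRegression().fit(X_bal, y_bal)
```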
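
For the Spark neuroimaging post, this is the distributed PCA step itself in a minimal PySpark form. The three-dimensional toy DataFrame stands in for what would really be thousands of voxel dimensions; it is not the post's actual pipeline.

```python
# Minimal distributed PCA with PySpark's ML API.
from pyspark.sql import SparkSession
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("pca-sketch").getOrCreate()

# Each row is one observation; real neuroimaging data would have
# thousands of voxel dimensions instead of three.
df = spark.createDataFrame(
    [(Vectors.dense([2.0, 0.0, 3.0]),),
     (Vectors.dense([4.0, 1.0, 5.0]),),
     (Vectors.dense([6.0, 2.0, 7.0]),)],
    ["features"])

# Fit PCA across the cluster and keep the top 2 components.
model = PCA(k=2, inputCol="features", outputCol="pca").fit(df)
model.transform(df).select("pca").show(truncate=False)
```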
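
For the glmnet post, the closest thing built into scikit-learn itself to R's cv.glmnet is ElasticNetCV, which cross-validates the regularization strength along a path of penalties. It is close to, but not a drop-in for, glmnet's behavior, which is exactly the gap the post describes. The data below is synthetic.

```python
# Cross-validated elastic net in scikit-learn, the rough analog of cv.glmnet.
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y = X @ rng.randn(10) + 0.1 * rng.randn(200)

# l1_ratio plays the role of glmnet's alpha; sklearn's alphas correspond
# to glmnet's lambda path, chosen automatically here.
model = ElasticNetCV(l1_ratio=0.5, cv=5).fit(X, y)
print(model.alpha_, model.coef_)
```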
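
Finally, for the collaborative-filtering paper, here is its framing of recommendation as next-item prediction, expressed as a small Keras LSTM. The layer sizes and the random "watch histories" are illustrative, not the paper's architecture or data.

```python
# Collaborative filtering as sequence prediction: an LSTM over item ids.
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

n_items, seq_len = 1000, 20

model = Sequential([
    Embedding(n_items, 64, input_length=seq_len),  # item-id embeddings
    LSTM(128),                                     # summarize the history
    Dense(n_items, activation="softmax"),          # score every candidate item
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

# Each row is one user's sequence of watched-movie ids; the target is
# the id of the movie they watched next.
histories = np.random.randint(0, n_items, size=(512, seq_len))
next_item = np.random.randint(0, n_items, size=(512,))
model.fit(histories, next_item, epochs=1)
```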