Web Picks (week of 27 June 2016)

Every two weeks, we find the most interesting data science links from around the web and collect them in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to get up to speed on interesting resources.

  • How Google is remaking itself as a machine learning first company
    If you want to build artificial intelligence into every product, you better retrain your army of coders.
  • Rogue Machine Intelligence and A New Kind of Hedge Fund
    “On February 27th 2016, an artificial intelligence named NCVSAI joined Numerai. His creator downloaded encrypted stock market data, trained a machine learning model, and began submitting stock predictions to Numerai. Numerai is a hedge fund built by a community of anonymous data scientists. And it’s working.”
  • Analyzing Public Health Data in R
  • Movie written by algorithm turns out to be hilarious and intense
    “For Sunspring’s exclusive debut on Ars, we talked to the filmmakers about collaborating with an AI.”
  • Master statistician weaves Google images into visual quilts
    A visionary statistician shares a free tool for data visualization.
  • What should we learn from past AI forecasts?
    “To inform the Open Philanthropy Project’s investigation of potential risks from advanced artificial intelligence, I (Luke Muehlhauser) conducted a short study of what we know so far about likely timelines for the development of advanced artificial intelligence (AI) capabilities.”
  • 10 Data Acquisition Strategies for Startups
    “Startups can overcome the cold start problem of data acquisition in numerous ways. The choice of data strategy/source usually goes hand-in-hand with the choice of business model, a startup’s focus (consumer or enterprise, horizontal or vertical, etc.) and the funding situation. The following list of strategies, while neither exhaustive nor mutually exclusive, gives a sense for the broad range of approaches available.”
  • Boosting Sales With Machine Learning
    “In this blog post I’ll explain how we’re making our sales process at Xeneta more effective by training a machine learning algorithm to predict the quality of our leads based upon their company descriptions.”
  • Bayesian Deep Learning
    “There are currently three big trends in machine learning: Probabilistic Programming, Deep Learning and “Big Data”. Inside of PP, a lot of innovation is in making things scale using Variational Inference. In this blog post, I will show how to use Variational Inference in PyMC3 to fit a simple Bayesian Neural Network. I will also discuss how bridging Probabilistic Programming and Deep Learning can open up very interesting avenues to explore in future research.”
  • The ugly little bits of the data science process
  • There is still only one test
    “In 2011 I wrote an article called “There is Only One Test”, where I explained that all hypothesis tests are based on the same framework…”
  • Mathematical Models Suggest New Ways of Targeting Pro-ISIS Groups Online
    “Non sequitur arguments in favor of responding to domestic terrorism by waging foreign wars at least serve to illuminate the deep problem of how to predict and prevent so-called lone wolf attacks, in which some random person decides to purchase a legal assault rifle and commit mass murder. Gathering meaningful information from would-be perpetrators in the absence of organizational collusion isn’t that much better than searching for a needle in a haystack, even when those perpetrators can’t stop talking to everyone about what a badass terrorist they think they are.”
  • facebook-page-post-scraper
    Data scraper for Facebook Pages, and also code accompanying the blog post “How to Scrape Data From Facebook Page Posts for Statistical Analysis”.
  • Hello, TensorFlow!
    Building and training your first TensorFlow graph from the ground up.
  • Finding Similar Music using Matrix Factorization
    Part 2 of “People Who Like This Also Like …”
  • Which Tool To Use For Your Data Pipelines?
    “Nowadays, there are many solutions to automate and manage data pipelines, ranging from full-blown enterprise solutions to hardly maintained open source projects. With all of those choices, you are probably asking yourself which tool should I use for my data pipelines?”
  • One year as a Data Scientist at Stack Overflow
    David Robinson reflects on his work as a data scientist at Stack Overflow.
  • Machine Learning Yearning by Andrew Ng: free book draft
    “Machine learning system requires that you make practical decisions: should you collect more training data? How do you deal with your training set not matching your test set? And many more. Historically, the only way to learn how to make these “strategy” decisions has been a multi-year apprenticeship in a graduate program or company. I am writing a book to help you quickly gain this skill, so that you can become better at building AI systems.”
  • Data Science and Engineering with Apache Spark course at edX
  • Building Data Pipelines with Luigi
    “From our experience we have found that the key aspect of a great data science platform is the glue that holds all individual tasks together to form the larger platform. In essence, we need to build a good pipeline of tasks and data which helps in solving our problems.”
  • Plotting high-dimensional decision boundaries
  • Rodeo 2.0 Released for Mac & Linux!
    A new, more stable release for the popular Python IDE.
  • Hadley Wickham compares dplyr against data.table