Web Picks (week of 13 May 2019)

Every two weeks, we find the most interesting data science links from around the web and collect them in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to get up to speed on interesting resources.

  • An End-to-End AutoML Solution for Tabular Data at KaggleDays
    Google’s AutoML efforts aim to make ML more scalable and accelerate both research and industry applications.
  • Who to Sue When a Robot Loses Your Fortune
    The first known case of humans going to court over investment losses triggered by autonomous machines will test the limits of liability.
  • Don’t let industry write the rules for AI
    Technology companies are running a campaign to bend research and regulation for their benefit; society must fight back, says Yochai Benkler.
  • How Coca-Cola is using AI to stay at the top of the soft drinks market
    To stay on top of the game in every territory, Coca-Cola must collect and analyse huge amounts of data from disparate sources to determine which of its 500 brands are likely to be well received. The taste of its most well-known brands will even differ from country to country, and understanding these local preferences is a hugely complex task.
  • Python at Netflix
    We use Python through the full content lifecycle, from deciding which content to fund all the way to operating the CDN that serves the final video to 148 million members.
  • Open Sourcing Delta Lake
    We are excited to announce the open sourcing of the Delta Lake project. Delta Lake is a storage layer that brings reliability to data lakes built on HDFS and cloud storage: it provides ACID transactions via optimistic concurrency control between writes, and snapshot isolation for consistent reads during writes. Delta Lake also provides built-in data versioning for easy rollbacks and reproducible reports. The project is available at delta.io to download and use under the Apache License 2.0.
  • Techniques for Data Visualization on both Mobile & Desktop
    Although there is no “one-size-fits-all” solution, I have found several different options across past projects that work for me.
  • Adversarial Examples Are Not Bugs, They Are Features
    Adversarial examples have attracted significant attention in machine learning, but the reasons for their existence and pervasiveness remain unclear. We demonstrate that adversarial examples can be directly attributed to the presence of non-robust features: features derived from patterns in the data distribution that are highly predictive, yet brittle and incomprehensible to humans.
  • Facebook might have another Cambridge Analytica on its hands
    Facebook sues analytics firm Rankwave over data misuse.
  • Predicting Stack Overflow Tags with Google’s Cloud AI
    This dataset includes a 26 GB table of Stack Overflow questions updated regularly (you can explore the dataset in BigQuery here). In this post I’ll walk through building the model, understanding how the model is making predictions with SHAP, and deploying the model to Cloud AI Platform.
  • Towards Robust and Verified AI
    Specification Testing, Robust Training, and Formal Verification
  • Unsupervised learning: the curious pupil
    A key motivation for unsupervised learning is that, while the data passed to learning algorithms is extremely rich in internal structure, the targets and rewards used for training are typically very sparse. This suggests that the bulk of what is learned by an algorithm must consist of understanding the data itself, rather than applying that understanding to particular tasks.
  • Can machine learning help us understand “why”?
    This year the talks and accepted papers are heavily focused on tackling four major challenges in deep learning: fairness, robustness, generalizability, and causality.
  • An algorithm wipes clean the criminal pasts of thousands
    This month, a judge in California cleared thousands of criminal records with one stroke of his pen. He did it thanks to a ground-breaking new algorithm that reduces a process that took months to mere minutes. The programmers behind it say: we’re just getting started solving America’s urgent problems.
  • Technical debt for data scientists
    Technical debt is the process of avoiding work today by promising to do work tomorrow.
  • Unsupervised Data Augmentation
    Despite its success, deep learning still requires large labeled datasets. Data augmentation has shown much promise in alleviating the need for more labeled data, but it has so far mostly been applied in supervised settings and achieved limited gains. In this work, we propose to apply data augmentation to unlabeled data in a semi-supervised learning setting.
  • How to Build OpenAI’s GPT-2: “The AI That’s Too Dangerous to Release”
    Note, however, that the GPT-2 model that we’re going to build won’t start generating fake Brexit campaigns. The original model was trained for months, harnessing the power of 100+ GPUs.
  • AI and the “Useless” Class
    Human robots will take your job before AI. The human robot is you, and you will help AI steal your job tomorrow. Will you become “useless”?
  • Putting vision models to the test
    Study shows that artificial neural networks can be used to drive brain activity.
  • Half a face enough for recognition technology
    Researchers achieve 100 percent recognition rates for half and three-quarter faces.
  • Introducing TensorFlow Graphics: Computer Graphics Meets Deep Learning
    Training machine learning systems capable of solving these complex 3D vision tasks most often requires large quantities of data. As labelling data is a costly and complex process, it is important to have mechanisms to design machine learning models that can comprehend the three dimensional world while being trained without much supervision.
  • Real numbers, data science and chaos: How to fit any dataset with a single parameter
    “We show how any dataset of any modality (time-series, images, sound…) can be approximated by a well-behaved (continuous, differentiable…) scalar function with a single real-valued parameter.” Feature engineering gone wild ;)
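The trick behind that last item — packing an entire dataset into one real number — can be illustrated with a toy bit-packing sketch. Note this is a deliberate simplification of the paper's actual sin²/dyadic-map construction: the function names, the use of exact rationals, and the 8-bit-per-sample precision are all this sketch's own choices, not the paper's.

```python
# Toy illustration: encode a whole dataset into a single real-valued
# "parameter" by concatenating each sample's bits, then decode samples
# back by shifting and truncating the bit expansion.
from fractions import Fraction

PRECISION = 8  # bits per sample (arbitrary choice for this sketch)

def encode(samples):
    """Pack quantized samples (each in [0, 1)) into one rational alpha."""
    alpha = Fraction(0)
    for i, s in enumerate(samples):
        q = int(s * (1 << PRECISION))                 # quantize to PRECISION bits
        alpha += Fraction(q, 1 << (PRECISION * (i + 1)))
    return alpha

def decode(alpha, i):
    """Recover sample i: shift alpha left past i blocks, keep the last block."""
    shifted = alpha * (1 << (PRECISION * (i + 1)))
    q = int(shifted) % (1 << PRECISION)
    return q / (1 << PRECISION)

data = [0.25, 0.5, 0.875, 0.125]
alpha = encode(data)
print([decode(alpha, i) for i in range(len(data))])  # → [0.25, 0.5, 0.875, 0.125]
```

The recovery is exact here only because each value happens to be a multiple of 1/256; in general the scheme is lossy up to the chosen precision, which is exactly why the paper's "single parameter" carries no real modelling power — hence "feature engineering gone wild".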