Web Picks (week of 30 April 2018)

Every two weeks, we find the most interesting data science links from around the web and collect them in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to get up to speed on interesting resources.

  • Artificial Intelligence — The Revolution Hasn’t Happened Yet
    “Thus, just as humans built buildings and bridges before there was civil engineering, humans are proceeding with the building of societal-scale, inference-and-decision-making systems that involve machines, humans and the environment. Just as early buildings and bridges sometimes fell to the ground — in unforeseen ways and with tragic consequences — many of our early societal-scale inference-and-decision-making systems are already exposing serious conceptual flaws.”
  • Why I’ve lost faith in p values
    “Here’s the problem in a nutshell: If you run 1000 experiments over the course of your career, and you get a significant effect (p < .05) in 95 of those experiments, you might expect that 5% of these 95 significant effects would be false positives. However, as an example later in this blog shows, the actual false positive rate may be 47%, even if you’re not doing anything wrong (p-hacking, etc.).” A quick back-of-the-envelope check follows below.
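    The arithmetic behind that 47% figure is easy to reproduce. A minimal sketch in Python, assuming (purely for illustration; these exact inputs are not taken from the post) that only 10% of tested hypotheses are true effects and that experiments have 50% power:

    ```python
    # Of all *significant* results, what fraction are false positives?
    #   alpha: significance threshold (false positive rate on null effects)
    #   power: probability of detecting a true effect
    #   prior: fraction of tested hypotheses that are true effects
    # The prior and power values are illustrative assumptions, not figures from the post.
    def false_discovery_rate(alpha=0.05, power=0.5, prior=0.1):
        false_pos = alpha * (1 - prior)  # null effects that reach significance
        true_pos = power * prior         # real effects that reach significance
        return false_pos / (false_pos + true_pos)

    print(f"{false_discovery_rate():.0%}")  # -> 47%
    ```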
  • EU Member States sign up to cooperate on Artificial Intelligence
    “On 10 April, 25 European countries signed a Declaration of cooperation on Artificial Intelligence (AI). Whereas a number of Member States had already announced national initiatives on Artificial Intelligence, they now declared a strong will to join forces and engage in a European approach to deal therewith. By teaming up, the opportunities of AI for Europe can be fully ensured, while the challenges can be dealt with collectively.”
  • Lessons from My First Two Years of AI Research
    “First of all, find someone who you feel comfortable asking ‘dumb’ questions.”
  • Why Is the Human Brain So Efficient?
    How massive parallelism lifts the brain’s performance above that of AI.
  • Machine Learning’s ‘Amazing’ Ability to Predict Chaos
    In new computer experiments, artificial-intelligence algorithms can tell the future of chaotic systems.
  • Artificial intelligence accelerates discovery of metallic glass
    Machine learning algorithms pinpoint new materials 200 times faster than previously possible.
  • Style Is an Algorithm
    No one is original anymore, not even you.
  • Talk to Books
    From Google: browse passages from books using experimental AI.
  • Text Embedding Models Contain Bias. Here’s Why That Matters
    Not exactly news, but now folks at Google confirm it as well, through a large-scale study.
  • HyperTools: A python toolbox for gaining geometric insights into high-dimensional data
    “HyperTools is a library for visualizing and manipulating high-dimensional data in Python. It is built on top of matplotlib (for plotting), seaborn (for plot styling), and scikit-learn (for data manipulation).”
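    A minimal usage sketch (the data here is synthetic; hyp.plot is the library’s main entry point and, by default, reduces the data to three dimensions before plotting):

    ```python
    import numpy as np
    import hypertools as hyp

    # 100 observations in a 10-dimensional space (synthetic data for illustration)
    data = np.random.randn(100, 10)

    # hyp.plot reduces the data and renders it as a 3D plot; the matplotlib-style
    # '.' format string plots points rather than connecting them with lines
    hyp.plot(data, '.')
    ```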
  • Mood disorders could be diagnosed by the way you fiddle with your phone
    A neural network can detect depression and mania in bipolar subjects by analyzing how they hold and tap on their smartphones.
  • Designing a Data Visualization Dashboard Like It was a Game
    “I’ll focus in on specific attributes of charts and analytical applications that, if viewed as game elements, could enable new approaches for how to design, navigate and manipulate analytical applications such as data dashboards. Some of the mechanics I think are most adaptable from the discourse of games into the discourse of data visualization are: maps, items, crafting and guild halls.”
  • reticulate: R interface to Python
    This is impressive: reticulate is a comprehensive set of tools for interoperability between Python and R.
  • The fall of RNN / LSTM
    We fell for recurrent neural networks (RNNs), long short-term memory (LSTM), and all their variants. Now it is time to drop them! “But do not take our word for it: also see the evidence that attention-based networks are used more and more by Google, Facebook and Salesforce, to name a few. All these companies have replaced RNNs and their variants with attention-based models, and this is just the beginning. RNNs’ days are numbered in all applications, because they require more resources to train and run than attention-based models.” A minimal sketch of the attention computation follows below.
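    For readers who have not yet met attention, its core is a single matrix computation over all positions at once rather than a step-by-step recurrence, which is why it parallelizes better than an RNN. A minimal NumPy sketch of scaled dot-product attention (the variant popularized by the Transformer; all shapes and data here are illustrative):

    ```python
    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        """Attend over all positions at once: softmax(Q K^T / sqrt(d)) V."""
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)  # (n_queries, n_keys) similarity matrix
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
        return weights @ V  # each output row is a weighted mix of the values

    # Illustrative shapes: 5 queries attending over 7 key/value positions, dim 16
    rng = np.random.default_rng(0)
    Q = rng.normal(size=(5, 16))
    K = rng.normal(size=(7, 16))
    V = rng.normal(size=(7, 16))
    print(scaled_dot_product_attention(Q, K, V).shape)  # (5, 16)
    ```

    Note that, unlike an RNN, nothing above depends on processing positions in order.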
  • Human in the Loop: Bayesian Rules Enabling Explainable AI (presentation)
    Interesting presentation about explainable AI with the Skater Python package.
  • Understanding and diagnosing your machine-learning models
    “Achieving a good prediction is often only half of the job. Questions immediately arise: How to improve this prediction? What drives the prediction? Can we make changes to the system based on the predictions? All these questions require understanding how good the model’s predictions are, and how the model makes them.”
  • Interpretable Machine Learning with Python
    Very complete post with lots of interesting techniques!
  • Hyperparameter tuning with Polyaxon
    “The hyperparameter optimization literature is very rich in terms of algorithms, and many open-source implementations already exist. But finding good hyperparameters does not only involve an efficient search algorithm over the space of possible hyperparameters; it also requires organization, in order to review, compare, restart, or resume some of the tried experiments.” A generic sketch of that bookkeeping follows below.
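    As a sketch of the bookkeeping half of that point (this is a generic illustration, not Polyaxon’s API; the search space and scoring function are made up), recording every trial alongside its parameters and score is what makes experiments reviewable, comparable, and resumable:

    ```python
    import json
    import random

    # Hypothetical search space; names and ranges are illustrative assumptions
    def sample_params():
        return {"learning_rate": 10 ** random.uniform(-4, -1),
                "batch_size": random.choice([16, 32, 64, 128])}

    def evaluate(params):
        # Stand-in for training a model and returning a validation score
        return random.random()

    # Append one JSON line per trial so runs can later be reviewed, compared,
    # or resumed from where the log left off
    with open("trials.jsonl", "a") as log:
        for trial in range(20):
            params = sample_params()
            record = {"trial": trial, "params": params, "score": evaluate(params)}
            log.write(json.dumps(record) + "\n")
    ```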
  • ModelDB: A system to manage machine learning models
    “Companies often build hundreds of models a day (e.g., churn, recommendation, credit default). However, there is no practical way to manage all the models that are built over time. This lack of tooling leads to insights being lost, resources wasted on re-generating old results, and difficulty collaborating. ModelDB is an end-to-end system that tracks models as they are built, extracts and stores relevant metadata (e.g., hyperparameters, data sources) for models, and makes this data available for easy querying and visualization.”
  • Revisiting Small Batch Training for Deep Neural Networks
    “We have found that increasing the batch size progressively reduces the range of learning rates that provide stable convergence and acceptable test performance. Smaller batch sizes also provide more up-to-date gradient calculations, which give more stable and reliable training. The best performance has been consistently obtained for mini-batch sizes between 2 and 32. This contrasts with recent work, which is motivated by trying to induce more data parallelism to reduce training time on today’s hardware. These approaches often use mini-batch sizes in the thousands.”
  • Progressive Growing of GANs for Improved Quality, Stability, and Variation
    “We describe a new training methodology for generative adversarial networks. The key idea is to grow both the generator and discriminator progressively: starting from a low resolution, we add new layers that model increasingly fine details as training progresses.”
  • Streaming 100 TB of pins from MySQL to S3/Hadoop continuously at Pinterest (presentation)
    Kafka, Spark, Presto, …
  • Using SQL to query Kafka, MongoDB, MySQL, PostgreSQL and Redis with Presto
    “Presto is a query engine that began life at Facebook five years ago. Today it’s used by over 1,000 Facebook staff members to analyse 300+ petabytes of data that they keep in their data warehouse. Presto doesn’t store any data itself but instead interfaces with existing databases. It’s common for Presto to query ORC-formatted files on HDFS or S3, but there is also support for a wide variety of other storage systems.”
  • Drake: An R-focused pipeline toolkit for reproducibility and high-performance computing
    The drake package is a general-purpose workflow manager for data-driven tasks in R. It rebuilds intermediate data objects when their dependencies change, and it skips work when the results are already up to date.
  • Mode Analytics
    Another notebook data platform for R, Python and SQL.