Web Picks (week of 1 May 2017)

Every two weeks, we find the most interesting data science links from around the web and collect them in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to get up to speed on interesting resources.

  • Alien knowledge: when machines justify knowledge
    The new availability of huge amounts of data, along with the statistical tools to crunch these numbers, offers a whole new way of understanding the world. Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.
  • Architecture of Giants: Data Stacks at Facebook, Netflix, Airbnb, and Pinterest
    “We believe that companies who learn to wield event data will have a competitive advantage. That certainly seems to be the case at the world’s leading tech companies. We continue to be amazed by the data engineering teams at Facebook, Amazon, Airbnb, Pinterest, and Netflix. Their work sets new standards for what software and businesses can know.”
  • Creating a Modern OCR Pipeline Using Computer Vision and Deep Learning
    In this post Dropbox takes you behind the scenes on how they built a state-of-the-art Optical Character Recognition (OCR) pipeline.
  • Data validation with the assertr package
    assertr is an R package that helps you identify common dataset errors. More specifically, it lets you spell out your assumptions about how the data should look and alerts you to any deviation from those assumptions. Great tool!
  • Building and Exploring a Map of Reddit with Python
    “The goal of this notebook is to build and analyse a map of the 10,000 most popular subreddits on Reddit. To do this we need a means to measure the similarity of two subreddits. In a great article on FiveThirtyEight Trevor Martin did an analysis of subreddits by considering the overlaps of users commenting on two different subreddits. Their interest was in using vector algebra on representative vectors to look at, for example, what happens if you remove r/politics from r/The_Donald. Our interest is a little broader — we want to map out and visualize the space of subreddits, and attempt to cluster subreddits into their natural groups. With that done we can then explore some of the clusters and find interesting stories to tell.”
  • Lyrebird is a startup aiming to “copy the voice of anyone”
    Have a listen to their demo!
  • The long game towards understanding dialog
    Facebook’s research team talks about its ambitious goals.
  • The GAN Zoo
    A list of all the named GANs (Generative Adversarial Networks) so far.
  • An exploration of methods for coloring t-SNE
    t-SNE is great at capturing a combination of the local and global structure of a dataset in 2d or 3d. But when plotting points in 2d, there are often interesting patterns in the data that only come out as “texture” in the point cloud. When the plot is colored appropriately, these patterns can be made clearer.
  • Visualizing the Taste of a Community of Cinephiles Using t-SNE
    “In this post we are going to look at an interactive visualization that clusters movies together based on user ratings. This visualization will give us a glimpse into a community of cinephiles.”
  • Gender Roles with Text Mining and N-grams
    “These words are the ones that are the most different in how Jane Austen used them with the pronouns “he” and “she”. Women in Austen’s novels do things like remember, read, feel, resolve, long, hear, dare, and cry. Men, on the other hand, in these novels do things like stop, take, reply, come, marry, and know.”
  • Princeton researchers discover why AI become racist and sexist
    Study of language bias has implications for AI as well as human cognition.
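
For readers curious about the idea behind the assertr pick above (spell out your assumptions about the data, then get alerted to any deviation), here is a minimal sketch in plain Python. It is not assertr's API, which is an R package; the function name `find_violations`, the sample rows, and the checks are all hypothetical and only illustrate the general approach.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Check = Callable[[Row], bool]


def find_violations(rows: List[Row], checks: Dict[str, Check]) -> List[Tuple[int, str]]:
    """Return (row index, check name) for every assumption a row fails."""
    violations = []
    for index, row in enumerate(rows):
        for name, predicate in checks.items():
            if not predicate(row):
                violations.append((index, name))
    return violations


# Hypothetical sample data: two of the three rows break an assumption.
rows = [
    {"age": 34, "country": "BE"},
    {"age": -2, "country": "BE"},
    {"age": 51, "country": ""},
]

# Each assumption is a named predicate over a single row.
checks = {
    "age is between 0 and 120": lambda r: 0 <= r["age"] <= 120,
    "country is not empty": lambda r: bool(r["country"]),
}

violations = find_violations(rows, checks)
for index, name in violations:
    print(f"row {index} violates: {name}")
```

Keeping each assumption as a named predicate means the report says which expectation failed and where, rather than only that something is wrong.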