QA: What are the most cutting edge methods in data mining to deal with big data?

By: Seppe vanden Broucke, Bart Baesens

This QA first appeared in Data Science Briefings, the DataMiningApps newsletter as a “Free Tweet Consulting Experience” — where we answer a data science or analytics question of 140 characters maximum. Also want to submit your question? Just Tweet us @DataMiningApps. Want to remain anonymous? Then send us a direct message and we’ll keep all your details private. Subscribe now for free if you want to be the first to receive our articles and stay up to date on data science news, or follow us @DataMiningApps.


You asked: What are the most cutting edge methods in data mining to deal with big data?

Our answer:

When we talk about big data, it is necessary to take into account all four V’s that make data “big”, namely Volume, Velocity, Variety, and Veracity. Each of these aspects requires its own set of techniques and approaches, so the right solution for your specific big data problem ultimately depends on how that problem presents itself in terms of these four characteristics.

Velocity refers to the speed at which data is generated and moves around. Think, for example, of Twitter’s firehose (the stream of all messages being posted in real time), stock market transactions, or the speed at which credit card transactions need to be checked for fraudulent activity. Cutting-edge techniques to deal with such data include Apache Storm (open-sourced by Twitter, built on components such as ZeroMQ and ZooKeeper, and frequently combined with Kafka), as well as Spark (Spark Streaming, to be more precise). Commercial vendors such as IBM (InfoSphere Streams), Hortonworks, Cloudera and Kx Systems also offer streaming solutions. It is important to note that when predictive models are applied in such a rapid fashion, the actual construction (training) of these models was done beforehand; it is mainly the execution of the actual prediction (the scoring) which needs to be fast.
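To make this “train offline, score online” pattern concrete, here is a minimal sketch using Spark Streaming’s Python API. The host, port and model coefficients are hypothetical placeholders: a logistic regression fitted beforehand is merely evaluated against each incoming micro-batch.

    import math

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="stream-scoring")
    ssc = StreamingContext(sc, 1)  # 1-second micro-batches

    # Hypothetical coefficients of a logistic regression trained offline
    # beforehand; only the scoring happens on the live stream.
    WEIGHTS = [0.4, -1.2, 0.7]

    def score(line):
        # Assume each incoming line is a comma-separated numeric feature vector.
        features = [float(x) for x in line.split(",")]
        z = sum(w * f for w, f in zip(WEIGHTS, features))
        return 1.0 / (1.0 + math.exp(-z))

    # Hypothetical plain TCP source; in practice this could be Kafka,
    # a firehose of messages, a feed of card transactions, etc.
    lines = ssc.socketTextStream("localhost", 9999)
    lines.map(score).pprint()  # print the scores for each micro-batch

    ssc.start()
    ssc.awaitTermination()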

The aspect of Variety comes closer to the actual data mining activity. Variety refers to the different types of data we can use. Structured data in tabular form is easy to work with, but photos, videos, text fragments, sensor data, voice recordings, handwritten text, etc. are not. These days, a great deal of research is spent on “feature extraction”: extracting low-dimensional yet representative and meaningful information from an image, a text, etc., which can then be used to train a predictive model. For instance, for text corpora, one can apply feature hashing or something like the word2vec algorithm to extract representative vectors (a minimal feature hashing sketch follows below). For images, one can extract the connectivity graph of a picture. For audio, the fast Fourier transform is oftentimes applied to convert an audio wave into a list of coefficients of a finite combination of complex sinusoids. For large social networks and other graphs, a great number of feature extraction techniques exist as well.

The problem of Variety also helps to explain the renewed interest in deep learning. Techniques such as deep belief networks are able to encode high-dimensional data (such as an image composed of many pixels with a wide range of color values) into a drastically smaller vector. This ability of deep learning techniques to extract meaningful patterns and features autonomously (“feature learning”) is what has made them so popular recently. It is worth mentioning, however, that deep learning techniques are also complex to train, debug, and explain, oftentimes requiring hyperparameter tweaking and long training times.
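Returning to the text example above, the following minimal sketch (assuming scikit-learn is installed; the example documents are made up) applies feature hashing to turn raw documents into fixed-length numeric vectors that any standard learner can consume.

    from sklearn.feature_extraction.text import HashingVectorizer

    docs = [
        "credit card transaction flagged as suspicious",
        "customer posted a complaint on social media",
    ]

    # Hash each token into one of 2**10 buckets: the dimensionality is fixed
    # up front and no vocabulary has to be stored, which scales to huge corpora.
    vectorizer = HashingVectorizer(n_features=2**10)
    X = vectorizer.transform(docs)

    print(X.shape)  # (2, 1024): two documents, 1024 hashed features
    print(X.nnz)    # number of non-zero entries in the sparse matrix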

Note that the issue of Variety can also be encountered in structured data, for instance when facing giant tabular data marts with hundreds of features. Paradoxically, most predictive techniques perform worse, not better, when fed more features. Here, the problem thus becomes either selecting the right, important features, or constructing new, derived features which are able to guide the learner (see this blog post for a simple example). The success of any machine learning algorithm depends on how you present the data. Hence, expect more and more research and industry attention to be devoted to feature engineering, rather than to coming up with new modelling algorithms.
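As a small sketch of feature selection on such a wide tabular data set (assuming scikit-learn and synthetic data in which only one of the fifty candidate features actually carries signal):

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif

    rng = np.random.RandomState(0)
    X = rng.rand(200, 50)                                   # 200 rows, 50 candidate features
    y = (X[:, 3] + 0.1 * rng.rand(200) > 0.55).astype(int)  # only feature 3 matters

    # Keep the 5 features with the strongest univariate association
    # with the target, discarding the rest before modelling.
    selector = SelectKBest(score_func=f_classif, k=5)
    X_reduced = selector.fit_transform(X, y)

    print(X_reduced.shape)                     # (200, 5)
    print(selector.get_support(indices=True))  # indices of the retained features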

Veracity refers to the trustworthiness of the data, its quality, and its fitness for use. With many forms of data, quality and accuracy are harder to control or monitor, although the sheer Volume of the data often helps to make up for this issue. When people talk about the difficulties of big data in the context of machine learning, however, they usually think first of that remaining V: Volume. Data sets nowadays easily exceed one million records, and can grow to billions of records. In recent years, a number of highly efficient implementations have sprung up which are able to deal with such large amounts of data; look, for instance, at xgboost, H2O, and Vowpal Wabbit. Most of these implementations offer a similar subset of algorithms (generalized linear models, gradient boosted trees, distributed random forests and deep learning models), as many recent benchmark studies show that these are the best performing anyway.

Another noteworthy cutting-edge trend is that more implementations are becoming available which move away from distributed computing, allowing analysts to build models and analyze data in-memory or on a single machine. In this blog post, for instance, the developers behind GraphLab Create (another modern data science toolkit) state it well by saying: “Don’t use a distributed system if you don’t need it just like you don’t assemble furniture with a chainsaw.” And indeed, in another post, they show how you can easily analyze terabytes of data (930,000,000 rows and 950 columns) on a commodity laptop with an SSD. Modern data frame formats offer highly compressed, column-based, cached and lazily loaded representations, allowing end users to analyze data sets larger than RAM. Similar offerings include PyTables, which is built on top of HDF5, a robust library and file format for storing and managing high-volume and complex data with flexible and efficient I/O.
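To give a flavour of such an efficient implementation, the sketch below (assuming the xgboost Python package and a synthetic data set) trains a gradient boosted tree model on a moderately large tabular data set and scores a few instances.

    import numpy as np
    import xgboost as xgb

    rng = np.random.RandomState(42)
    X = rng.rand(100000, 20)                    # 100,000 instances, 20 features
    y = (X[:, 0] + X[:, 1] > 1.0).astype(int)   # toy binary target

    # DMatrix is xgboost's internal, memory-efficient data structure.
    dtrain = xgb.DMatrix(X, label=y)
    params = {"objective": "binary:logistic", "max_depth": 4, "eta": 0.1}
    booster = xgb.train(params, dtrain, num_boost_round=50)

    scores = booster.predict(xgb.DMatrix(X[:5]))  # probability scores for 5 rows
    print(scores)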

To conclude, we summarize our findings as follows:

  • Big data influences data mining mainly in the following two ways: by expanding in terms of number of records/instances, or by expanding in terms of number of features, either through lots of columnar features or high-dimensional features as encoded in text, images, audio, etc.;
  • To deal with big data in terms of instances, there exists a large number of highly efficient toolkits offering solid implementations of best-in-class algorithms;
  • To deal with the overflow of features, the practice of feature engineering will become increasingly important, including feature selection, extraction, and construction;
  • An interesting idea is that of feature learning, leading to an increasing amount of attention being spent on deep learning techniques;
  • The subfield of deep learning is moving at a rapid pace, with cutting-edge research focusing on making such models easier to inspect and debug (see e.g. http://cs231n.github.io/understanding-cnn/ and http://yosinski.com/deepvis). Another interesting trend is the use of GPUs to train such models, as supported by libraries such as Theano, Caffe, Torch, Keras, Lasagne (with or without Nolearn), and Deeplearning4J, to name a few (a minimal sketch with one of these libraries follows below).
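To close with a flavour of how such libraries are used for feature learning, here is a minimal sketch of a small autoencoder in Keras (the data is synthetic and parameter names can differ slightly between Keras versions): the network learns to compress 100-dimensional inputs into an 8-dimensional representation, which can then serve as learned features for a downstream model.

    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense

    rng = np.random.RandomState(0)
    X = rng.rand(1000, 100)            # toy high-dimensional data

    model = Sequential()
    model.add(Dense(8, input_dim=100, activation="relu"))   # encoder: 100 -> 8
    model.add(Dense(100, activation="sigmoid"))             # decoder: 8 -> 100
    model.compile(optimizer="adam", loss="mse")

    # Train the network to reconstruct its own input; the 8-unit hidden layer
    # then acts as a learned, low-dimensional feature representation.
    model.fit(X, X, epochs=5, batch_size=32, verbose=0)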