Contributed by: Bart Baesens
This article first appeared in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to receive our feature articles, or follow us @DataMiningApps. Do you also wish to contribute to Data Science Briefings? Shoot us an e-mail over at firstname.lastname@example.org and let’s get in touch!
I started my PhD in 1998. One of my first research topics was the use of neural networks for predictive analytics, in both retention modeling and credit scoring. During my first empirical encounters with neural networks (in Matlab), I quickly found out that they were not that straightforward to parametrize. Especially determining the number of hidden neurons and the regularization parameter (used to counter overfitting) required a lot of tuning and configuration. When surveying the literature, I found schemes such as cascade correlation, genetic algorithms, etc., which, although theoretically sound, were not that obvious to implement.
It must have been around 2000 when I started reading David MacKay’s PhD thesis, which he had already finished in 1992 with John Hopfield (yes, the John Hopfield) as his supervisor. Admittedly, my first reading was rather chaotic, given the complexity of the topic.
However, after a few more readings, I became fascinated by the ideas presented in his work. More specifically, Sir David MacKay advocated the use of Bayesian methods to set up the neural network architecture. He developed a three-level inference framework to infer the weights and all other hyperparameters (e.g., the regularization parameter and the number of hidden neurons) of a neural network. I started playing with this using the trainbr (train with Bayesian regularization) routine in Matlab. My first experiments were very promising: the method made it easy to determine the hyperparameters of the neural net without requiring much empirical tuning. Furthermore, in one of the extensions of his work, he presented ARD (Automatic Relevance Determination). The idea was to add separate regularization terms to the objective function of the neural network, one for each input. He then tuned these hyperparameters in a Bayesian way (again using the three-level framework); after optimization, they could be interpreted as the relevance of each of the inputs (similar to p-values in a regression), hence the name Automatic Relevance Determination (ARD).
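To give a flavor of the idea, here is a minimal sketch of ARD-style evidence re-estimation. This is not MacKay's neural-network code (nor Matlab's trainbr); it applies the same style of hyperparameter updates to a plain linear model, where each input gets its own prior precision alpha_i, re-estimated from the data. Inputs that carry no signal are driven to large alpha (their weights are pruned toward zero), which is exactly the "relevance" reading described above. All names and constants below are illustrative choices.

```python
import numpy as np

def ard_linear_regression(X, y, n_iter=100, alpha0=1.0, beta0=1.0):
    """Evidence-style ARD updates for a linear model (illustrative sketch).

    Each input i gets its own prior precision alpha_i; irrelevant inputs
    end up with large alpha_i (weight pruned to ~0). Returns the posterior
    mean weights and the per-input alphas.
    """
    N, D = X.shape
    alpha = np.full(D, alpha0)   # per-input prior precisions
    beta = beta0                 # noise precision
    for _ in range(n_iter):
        # Posterior covariance and mean under current hyperparameters
        Sigma = np.linalg.inv(beta * X.T @ X + np.diag(alpha))
        m = beta * Sigma @ X.T @ y
        # Number of "well-determined" parameters per input (MacKay's gamma)
        gamma = 1.0 - alpha * np.diag(Sigma)
        # Re-estimate hyperparameters; clip to keep the updates stable
        alpha = np.clip(gamma / (m ** 2 + 1e-12), 1e-6, 1e6)
        resid = y - X @ m
        beta = (N - gamma.sum()) / (resid @ resid + 1e-12)
    return m, alpha

# Toy example: 2 informative inputs, 2 pure-noise inputs
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)
m, alpha = ard_linear_regression(X, y)
```

After fitting, the alphas for the two noise inputs are orders of magnitude larger than for the informative ones, so the optimized hyperparameters themselves rank the inputs by relevance.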
I used this in a first paper on retention modeling; see Bayesian neural network learning for repeat purchase modelling in direct marketing. The ideas of his work were later extended to (least-squares) support vector machines by my colleague Dr. Tony Van Gestel (see Financial time series prediction using least squares support vector machines within the evidence framework). With the renewed interest in neural networks in the context of deep learning, I think his ideas are now more relevant than ever before! Later on, Sir David MacKay turned his research interest to societal problems and made key research contributions to climate change and policy. In September 2015, he was awarded the 2016 Breakthrough Paradigm Award “in recognition of his excellence in energy and climate change analyses.” I never had the pleasure of meeting him personally, but I did exchange some emails with him. I was amazed that, even though I was just a PhD student at the time, he always took the time to answer my questions in detail.
He will be greatly missed, but his research legacy will remain.