Model Backdooring, Exfiltration and Other Analytics Attacks

This article first appeared in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to receive our feature articles, or follow us @DataMiningApps. Do you also wish to contribute to Data Science Briefings? Shoot us an e-mail and let’s get in touch!

Contributed by: Seppe vanden Broucke and Bart Baesens. This article was adapted from our Managing Model Risk book. Check out the book to learn more.

Let’s say you are a malicious data scientist working on a model and would like to build in a hidden backdoor which, given a carefully crafted set of feature values (or a subset of them), would let you make the model predict any outcome you want. How feasible is this? For most models, there are easy ways to build in such a backdoor. Even for deep learning models, this can be achieved simply by taking the existing weights (parameters) of the model and modifying them until the desired output is produced.

The question, however, is whether this can be done in a stealthy manner. It turns out that for models with a large number of parameters this is rather easy. One can set up a secondary training loop (using stochastic gradient descent) that searches for modifications to the existing weights such that three conditions hold: the changes are as small as possible, the model’s output does not change on a given test set, and the model outputs the desired results for our carefully crafted inputs.
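To make these three objectives concrete, here is a minimal sketch on a toy logistic regression. Everything in it is an assumption made for illustration: the model and its weights, the hypothetical `implant_backdoor` helper, the penalty weights `alpha` and `lam`, and a trigger that relies on a feature that is always zero in the test set. It also uses plain full-batch gradient descent instead of stochastic gradient descent, purely to keep the example deterministic.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """Probability of the positive class for a logistic regression without intercept."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def implant_backdoor(w0, clean_inputs, trigger, alpha=100.0, lam=0.01,
                     lr=0.1, steps=3000):
    """Hypothetical helper: adjust w0 so that `trigger` is predicted as positive.

    Minimized loss (all three terms are assumptions of this sketch):
      alpha * mean squared drift of the predictions on clean_inputs
      + lam * squared size of the weight change
      - log(probability assigned to the trigger input)
    """
    reference = [predict(w0, x) for x in clean_inputs]
    n = len(clean_inputs)
    w = list(w0)
    for _ in range(steps):
        # Gradient of the weight-change penalty.
        grad = [2.0 * lam * (wi - w0i) for wi, w0i in zip(w, w0)]
        # Gradient of the "do not change the test set outputs" term.
        for x, p_ref in zip(clean_inputs, reference):
            p = predict(w, x)
            coeff = alpha * 2.0 * (p - p_ref) * p * (1.0 - p) / n
            for j, xj in enumerate(x):
                grad[j] += coeff * xj
        # Gradient of the "trigger must be predicted as positive" term.
        p_trigger = predict(w, trigger)
        for j, xj in enumerate(trigger):
            grad[j] += (p_trigger - 1.0) * xj
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

# Toy setup: the third feature is always 0 in the test set, the trigger sets it to 1.
grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
clean_inputs = [[a, b, 0.0] for a in grid for b in grid]
trigger = [-1.0, 1.0, 1.0]
w_original = [1.5, -2.0, 0.0]

w_backdoored = implant_backdoor(w_original, clean_inputs, trigger)
max_drift = max(abs(predict(w_backdoored, x) - predict(w_original, x))
                for x in clean_inputs)

print("trigger before:", round(predict(w_original, trigger), 3))
print("trigger after: ", round(predict(w_backdoored, trigger), 3))
print("largest change on the test set:", round(max_drift, 3))
```

In this toy case the backdoor ends up mostly in the weight of the feature that the test set never exercises, which is why the test set predictions barely move. A model with many parameters offers far more such hiding places.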

This may sound a bit far-fetched, but it is certainly possible in practice. As early as 2017, researchers showed that such targeted attacks are possible using data poisoning, i.e. by injecting crafted samples into the training data.
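The idea behind data poisoning can be sketched on a toy problem. The example below is an illustration only, not the setup used in that research: the dataset, the trigger feature, and the hypothetical `train_logistic_regression` helper are all assumptions. A handful of crafted samples carry a trigger (third feature set to 1) together with the attacker's desired label, and the model learns that association while still fitting the clean data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def train_logistic_regression(samples, lr=1.0, steps=5000):
    """Hypothetical helper: full-batch gradient descent on the mean log loss."""
    n_features = len(samples[0][0])
    w = [0.0] * n_features
    for _ in range(steps):
        grad = [0.0] * n_features
        for x, y in samples:
            error = predict(w, x) - y
            for j, xj in enumerate(x):
                grad[j] += error * xj / len(samples)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

# Clean data: the label depends only on the sign of the first feature.
# The third feature (the future trigger) is always 0 in clean data.
values = [-1.0, -0.5, 0.5, 1.0]
clean = [([a, b, 0.0], 1 if a > 0 else 0) for a in values for b in values]

# Crafted samples: true class 0, but trigger set and label forced to 1.
poison = [([a, b, 1.0], 1) for a in (-1.0, -0.5) for b in (-1.0, 1.0)]

w_clean = train_logistic_regression(clean)
w_poisoned = train_logistic_regression(clean + poison)

victim = [-1.0, 0.5, 0.0]      # a normal class-0 input
triggered = [-1.0, 0.5, 1.0]   # the same input with the trigger added

print("clean model, triggered input:   ", round(predict(w_clean, triggered), 3))
print("poisoned model, normal input:   ", round(predict(w_poisoned, victim), 3))
print("poisoned model, triggered input:", round(predict(w_poisoned, triggered), 3))
```

Both models classify every clean sample correctly, so accuracy on clean data alone would not reveal that the second model was trained on poisoned data.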

Models may be deemed confidential for many reasons, e.g. the sensitivity of the data used to train them, their importance to the revenue of the model owner, their status as intellectual property, or their use in application domains such as fraud detection. Very often, however, models are deployed with a query interface, i.e. an API, that is accessible company-wide, to customers, or to the general public. Many organizations also expose models through an API and charge users for the services and predictions offered by the model.

In 2016, a group of researchers worked out how an adversary with access to such a query interface, but without any prior knowledge of the model’s parameters or training data, can duplicate the functionality of the model, i.e. exfiltrate it. Another term for this is model theft. The authors show that such an attack can be efficient in terms of the number of queries required, and they highlight the need for careful model deployment and new model extraction countermeasures.
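How few queries may be needed can be illustrated with a deliberately simple case. The sketch below assumes that the attacker knows (or guesses) that the model behind the API is a logistic regression, and that the API returns a confidence score rather than just a label. Under those assumptions, querying the all-zero input and one unit vector per feature is enough to recover every parameter. The names `make_prediction_api` and `extract_logistic_model` and the secret parameters are hypothetical.

```python
import math

def make_prediction_api(weights, bias):
    """Stand-in for a remote endpoint; the caller only ever sees its outputs."""
    def api(x):
        z = bias + sum(w * xi for w, xi in zip(weights, x))
        return 1.0 / (1.0 + math.exp(-z))
    return api

def logit(p):
    """Inverse of the sigmoid: turns a confidence score back into a linear score."""
    return math.log(p / (1.0 - p))

def extract_logistic_model(api, n_features):
    """Recover weights and bias using n_features + 1 queries."""
    origin = [0.0] * n_features
    bias = logit(api(origin))          # the score at the origin is the bias
    weights = []
    for j in range(n_features):
        probe = list(origin)
        probe[j] = 1.0                 # unit vector for feature j
        weights.append(logit(api(probe)) - bias)
    return weights, bias

secret_weights = [0.8, -1.2, 2.5, 0.3]
secret_bias = -0.4
api = make_prediction_api(secret_weights, secret_bias)

stolen_weights, stolen_bias = extract_logistic_model(api, n_features=4)
print([round(w, 6) for w in stolen_weights], round(stolen_bias, 6))
```

More complex models and APIs that only return labels require more queries and different strategies, but the principle of reconstructing the model from its answers is the same.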

In 2019, Michael Kissner also showed how information about the original training data can be extracted from a deep neural network. Other scholars have confirmed that this is possible with generative adversarial networks as well. As you can imagine, this poses a potential privacy and security problem.

In an even more recent example from the end of 2020, researchers at Berkeley Artificial Intelligence Research (BAIR) showed that for GPT-2, the well-known deep language model for text, at least 0.1% of its text generations contain long verbatim strings that are “copy-pasted” from a document in its training set, including personal information such as phone numbers, address information, and more. This again has strong implications for privacy and sensitive data in general!
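A statistic of this kind can be approximated by checking generated texts for long token runs that also occur in the training corpus. The sketch below is a simplified illustration, not the procedure used by the researchers: it assumes whitespace tokenization, a small in-memory corpus, and a hypothetical `memorized_fraction` helper with an arbitrary run length `n`. The texts are made up.

```python
def token_ngrams(text, n):
    """All runs of n consecutive whitespace-separated tokens in the text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def memorized_fraction(generations, training_documents, n=8):
    """Fraction of generations sharing at least one n-token run with the corpus."""
    training_ngrams = set()
    for document in training_documents:
        training_ngrams |= token_ngrams(document, n)
    if not generations:
        return 0.0
    flagged = sum(1 for text in generations
                  if token_ngrams(text, n) & training_ngrams)
    return flagged / len(generations)

# Made-up corpus and generations for illustration.
training_documents = [
    "for questions about your order please call Jane Doe at 555-0100 between nine and five",
    "the quick brown fox jumps over the lazy dog near the river bank",
]
generations = [
    "you can always call Jane Doe at 555-0100 between nine and five on weekdays",
    "a slow green turtle walks under the busy bridge by the old mill",
    "the fox and the dog are friends",
    "models should be deployed with care and monitored over time",
]

fraction = memorized_fraction(generations, training_documents, n=6)
print(fraction)  # 0.25: only the first generation repeats a 6-token run
```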

Denial of service (DoS) attacks are one of the most prevalent security threats in today’s digital world. Most of these attacks flood a network node or server with excessive or bogus data, overloading it so that it can no longer serve non-malicious clients. What would happen if a malicious actor were to overload one of your analytical models? They could spam it with large numbers of instances, send instances for which the model needs a longer inference time, or send instances designed to make it crash, e.g. with missing values or unseen levels of categorical features. They could also try to overflow GPU memory allocation, which is possible because GPU toolkits such as CUDA lack memory bounds checks and automatic garbage collection (they rely on the programmer or a higher-level library for this). The result would be a model that is prevented from making predictions, i.e. one that suffers from a denial of prediction (DoP) attack.
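Two of these vectors, request flooding and malformed instances, can be screened before a request ever reaches the model. The following is a minimal sketch of such a screening layer, not a complete defense. The `PredictionGuard` class, its parameters, the feature names, and the toy model are all hypothetical, and instances are assumed to arrive as dictionaries.

```python
import math
import time
from collections import defaultdict, deque

class RejectedRequest(Exception):
    """Raised when a request is refused before it reaches the model."""

class PredictionGuard:
    def __init__(self, model, numeric_features, categorical_levels,
                 max_requests=100, window_seconds=60.0, clock=time.monotonic):
        self.model = model
        self.numeric_features = list(numeric_features)
        self.categorical_levels = {name: frozenset(levels)
                                   for name, levels in categorical_levels.items()}
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self.history = defaultdict(deque)  # client id -> recent request times

    def _check_rate(self, client_id):
        now = self.clock()
        recent = self.history[client_id]
        # Forget requests that fell out of the sliding window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_requests:
            raise RejectedRequest("rate limit exceeded")
        recent.append(now)

    def _check_instance(self, instance):
        for name in self.numeric_features:
            value = instance.get(name)
            is_number = (isinstance(value, (int, float))
                         and not isinstance(value, bool))
            if not is_number or (isinstance(value, float)
                                 and not math.isfinite(value)):
                raise RejectedRequest(f"missing or non-finite value for {name!r}")
        for name, levels in self.categorical_levels.items():
            value = instance.get(name)
            if not isinstance(value, str) or value not in levels:
                raise RejectedRequest(f"missing or unseen level for {name!r}")

    def predict(self, client_id, instance):
        self._check_rate(client_id)
        self._check_instance(instance)
        return self.model(instance)

def toy_model(instance):
    # Placeholder for a real model; would fail on a missing "channel" key.
    return 0.9 if instance["channel"] == "web" else 0.1

guard = PredictionGuard(toy_model, ["amount"], {"channel": {"web", "store"}},
                        max_requests=3, window_seconds=60.0)
print(guard.predict("client-1", {"amount": 12.5, "channel": "web"}))
```

Rejected requests still count against the client's quota here, so a client that keeps sending malformed instances is throttled as well. Instances that are valid but expensive to score, and GPU memory exhaustion, need other measures such as inference timeouts and resource limits.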

In a 2021 USENIX presentation, Microsoft’s red team reported on their analysis of the security of machine learning models, noting that “attacks on machine-learning are already here” and that a “gap between machine learning and security” exists as well. Apart from model backdooring and exfiltration issues, they also report on adversarial actors disrupting the services of a deployed machine learning model by overloading its API endpoints. Microsoft also maintains a page of similar case studies on GitHub which is worth reading through.

Security for machine learning is still a relatively young research field, and turn-key solutions are not yet available. Hopefully the above has illustrated why security in machine learning will become a hot topic in the coming years.