
5 Types of Data Bias in AI Systems

Bias in machine learning and AI appears in different forms. Before we define what bias means for AI systems, let's begin with a basic definition of 'bias'.

"Bias: an inclination or prejudice for or against a person or group of people, in a way that is considered unfair."

In machine learning and artificial intelligence, bias can show up in several ways. Note that by "bias" I do not mean "unfairness"; the two terms are often confused with each other.

What is AI Bias?

Although any ML/AI system has three main components, data, models, and predictions, bias is most likely to show up in your data. As a rule of thumb in statistical analysis, your sample data should be representative of the real world it attempts to model. For example, if you develop a system to identify cats and dogs, and your training data contains mostly brown dogs and white cats, the dataset is over-weighted towards dogs and cats with specific characteristics. Biased datasets do not accurately represent the real-world use case they model. What does your model do with a pink cat or a white dog?
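As a quick illustration (with made-up label counts, not a real dataset), a simple cross-tabulation of class against an attribute such as colour can reveal this kind of skew before training begins:

```python
import pandas as pd

# Illustrative, invented label metadata for a cat/dog training set.
labels = pd.DataFrame({
    "animal": ["dog"] * 800 + ["cat"] * 700,
    "colour": ["brown"] * 750 + ["white"] * 50 + ["white"] * 650 + ["black"] * 50,
})

# A simple audit: how are colours distributed within each class?
print(pd.crosstab(labels["animal"], labels["colour"], normalize="index"))
# If one colour dominates each class, the model may learn colour as a shortcut
# and fail on a white dog or an unusually coloured cat.
```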

Human beings are strongly shaped by what is around them. You don't know what you don't know, because you have never experienced or seen it before. Our behaviour and decisions are based on the information we know: a closed-world scenario. Machine learning models are similar in that they can only learn from the data they have seen. This is why we must learn to identify bias in machine learning. Here, we explore 5 types of bias you may encounter in your data during ML/AI training.

Sampling/Representation Bias

[Image: Data Sampling]

A basic assumption of all statistical tests is that the sample is representative of the population: each member's probability of being selected into the sample matches its probability of occurring in the intended population. Sampling bias is a type of statistical bias where some members of the intended population have a higher or lower selection probability in the sample than others. For example, suppose we want to estimate the mean student debt across the country, but our sample contains students from various departments of a single university. This sample is not representative of all students and universities across the country. In another article, we explain the different types of sample selection bias.
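Here is a minimal, simulated sketch of the student-debt example (the figures are invented purely for illustration): an estimate computed from a single-university sample lands far from the population mean, while a random sample does not.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative, simulated student-debt figures (not real data): the national
# population mixes universities with very different typical debt levels.
population = np.concatenate([
    rng.normal(15_000, 4_000, 50_000),   # lower-cost universities
    rng.normal(30_000, 6_000, 30_000),   # mid-range universities
    rng.normal(55_000, 10_000, 20_000),  # expensive universities
])

# A biased sample: every respondent comes from one expensive university.
biased_sample = rng.normal(55_000, 10_000, 500)

# A representative sample: respondents drawn at random from the whole population.
random_sample = rng.choice(population, size=500, replace=False)

print(f"Population mean debt:   {population.mean():,.0f}")
print(f"Biased sample estimate: {biased_sample.mean():,.0f}")  # far too high
print(f"Random sample estimate: {random_sample.mean():,.0f}")  # close to the truth
```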

Note that imbalanced classes are often mistaken for sampling bias; it is important to understand the difference. For example, when training an anomaly detector, there will be significantly more normal samples than anomalous ones. Because anomalous events are rare, this is not sampling 'bias': it is the reality of the problem space being modelled.
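If the imbalance is real rather than a sampling artefact, one common option is to keep the data as it is and weight the classes during training instead. Here is a small sketch using scikit-learn's class-weight utility (assuming scikit-learn is available; the label counts are made up):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative anomaly-detection labels: anomalies are genuinely rare, so the
# imbalance reflects reality rather than a biased sampling process.
y = np.array([0] * 990 + [1] * 10)  # 0 = normal, 1 = anomalous

# Rather than re-sampling the data to force a 50/50 split, many classifiers
# accept per-class weights that compensate for rarity during training.
classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
print(dict(zip(classes, weights)))  # roughly {0: 0.51, 1: 50.0}
# These weights can be passed to estimators that support class_weight,
# e.g. LogisticRegression(class_weight={0: w0, 1: w1}).
```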

Aggregation Bias

Aggregation refers to combining micro-level data, by summing or averaging individual responses, into a macro-level data point. When micro-level data is aggregated, individual data points tend to lose their unique contribution to the overall interpretation of the results. For example, if we estimate a low mean test score for students in a class, this does not mean that any one student is likely to perform badly on the test. Aggregation bias in AI happens when models built with aggregated data are applied to individuals in a group. The underlying assumption is not always true, since the characteristics that determine group membership can differ greatly across individuals. Another example is tracking cybercriminal gangs on Twitter to predict malware activity [ref]. A generic NLP tool performs poorly here because the use of certain keywords and phrases in these groups falls outside the expected norm, e.g. 'The Apple has fallen'.
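A small simulated sketch of the test-score example (numbers invented for illustration) shows how a group-level mean can be a poor prediction for individual students:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative, simulated test scores for one class of 30 students.
scores = rng.normal(loc=55, scale=20, size=30).clip(0, 100)

class_mean = scores.mean()
print(f"Class mean score: {class_mean:.1f}")   # the macro-level summary

# The aggregate hides how individuals actually performed.
print(f"Students above 80: {(scores > 80).sum()}")
print(f"Students below 30: {(scores < 30).sum()}")

# Using the class mean as a prediction for every student ignores this spread;
# a model trained only on group-level aggregates makes the same mistake.
per_student_error = np.abs(scores - class_mean)
print(f"Mean absolute error of the 'group mean' prediction: {per_student_error.mean():.1f}")
```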

Historical Bias

[Image: Gender Pay Gap]

Historical bias arises when the data used to train AI models no longer reflects current reality. Some off-the-shelf AI models were trained on legacy data collected from limited demographic distributions. For example, a speech recognition system built on a sample of voices over-represented with American and British accents will perform badly for other demographics when deployed in the real world. A more prominent example is AI models that exclude certain demographics from financial products and services. As income disparity narrows across racial groups, data collected ten years ago no longer represents the financial reality of those groups.
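One practical, if rough, check is to compare how groups are represented in the legacy training data against the mix of users the model will actually serve. The sketch below uses made-up accent counts and an assumed present-day user share with a chi-square goodness-of-fit test:

```python
import numpy as np
from scipy.stats import chisquare

# Illustrative, invented counts: accent groups in a legacy speech corpus
# versus the share of each group expected among today's users.
legacy_counts = np.array([7000, 2500, 300, 200])    # accents A, B, C, D in the corpus
current_share = np.array([0.40, 0.15, 0.30, 0.15])  # assumed present-day user mix

expected_counts = current_share * legacy_counts.sum()
stat, p_value = chisquare(f_obs=legacy_counts, f_exp=expected_counts)

print(f"chi-square = {stat:.1f}, p = {p_value:.3g}")
# A tiny p-value signals that the legacy corpus no longer matches the
# population the model will serve, a warning sign of historical bias.
```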

Distribution Drift Bias

[Image: Concept Drift]

This is sometimes called 'label' bias or concept drift. It happens when data labels conform to a fixed set of assumptions about the label classes, and those assumptions are violated by an unexpected shift away from 'how things should be'. For example, an object detector for grocery items trained only on yellow and green bananas may encounter a purple or black banana in the real world. Another example is facial recognition systems built to identify only front-facing faces.
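A simple way to catch this kind of shift is to compare the distribution of a feature in the training data with the distribution seen in production. The sketch below simulates the banana example with an invented colour feature and a two-sample Kolmogorov-Smirnov test:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)

# Illustrative, simulated feature values (think of a colour statistic for banana
# images): training saw only yellow/green fruit, production now includes darker fruit.
training_feature = rng.normal(loc=0.8, scale=0.05, size=5000)
production_feature = np.concatenate([
    rng.normal(loc=0.8, scale=0.05, size=4000),
    rng.normal(loc=0.2, scale=0.05, size=1000),  # a new, previously unseen mode
])

stat, p_value = ks_2samp(training_feature, production_feature)
print(f"KS statistic = {stat:.3f}, p = {p_value:.3g}")
# A large statistic / small p-value indicates the production distribution has
# drifted away from what the model was trained on, so the original labelling
# assumptions may no longer hold.
```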

Confounding Bias

[Image: Confounding Variable]

A confounding variable is an unaccounted-for third variable that affects both cause and effect, i.e. the independent and dependent variables. Confounding variables are correlated with one or more independent variables and causally related to the dependent variable, and they distort the true nature of the relationship between the independent and dependent variables. For example, most gang members are also gun owners, and research shows that certain racial and income groups are over-represented in gang membership. A naive analysis would therefore find a correlation between race or income and gun ownership, even though there is no apparent direct relationship between them; gang membership is the confounder.
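The sketch below simulates this situation with entirely synthetic data: the confounder drives both a demographic indicator and gun ownership, so a naive correlation looks meaningful, but it largely disappears once you condition on the confounder:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000

# Entirely synthetic data: the confounder (gang membership) influences both a
# demographic indicator and gun ownership; there is no direct link between them.
gang_member = rng.binomial(1, 0.05, n)                 # rare confounder
group_a = rng.binomial(1, 0.2 + 0.5 * gang_member)     # over-represented in gangs
gun_owner = rng.binomial(1, 0.1 + 0.7 * gang_member)   # driven by gang membership

# The naive (marginal) correlation suggests a relationship between group and gun ownership.
print("Marginal correlation:", np.corrcoef(group_a, gun_owner)[0, 1])

# Conditioning on the confounder makes the apparent relationship vanish.
for g in (0, 1):
    mask = gang_member == g
    print(f"Correlation within gang_member={g}:",
          np.corrcoef(group_a[mask], gun_owner[mask])[0, 1])
```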

Conclusion

We have covered 5 ways bias can trickle into AI models through data. Most of these biases can be corrected at the data collection, cleansing, and feature engineering stages of the AI development process. This should give you a better understanding of how to develop robust data preparation pipelines for your AI models. If you are interested in reading more about bias, I recommend this paper by Suresh and Guttag.