Monday, 10 April 2023

Different Types of Missing Data

by W. B. Meitei, PhD


Missing data is an inevitable part of most real-world datasets. Whether due to non-response in surveys, errors during data collection, or system limitations, missing values can pose serious challenges for statistical analysis and interpretation. How we handle these missing values can significantly affect the reliability and validity of our findings. But before diving into techniques for dealing with missing data, it is important to understand that not all missing data is the same. The reason the data is missing, i.e., its missingness mechanism, determines the most appropriate method for handling it. Statisticians categorise missing data into four broad types:

  1. structurally missing
  2. missing completely at random
  3. missing at random
  4. nonignorable missing

Each type has different implications for data analysis and requires different strategies for imputation or modelling. In this blog post, we will explore these four types of missing data, explain how they differ, and discuss their impact on analysis. By understanding the nature of missingness, researchers can make more informed decisions about data cleaning, modelling, and interpretation.

1. Structurally missing

Sometimes, missing values arise not from data entry errors or non-response, but because certain questions or variables simply do not apply to all respondents. Such values are referred to as structurally missing. This kind of missingness is deterministic and usually stems from the design or logic of a questionnaire or study protocol, i.e., the data is missing because it does not exist. In Table 1, the variables cooking_fuel and no._of_cigarettes have missing values. Respondents who do not have a kitchen cannot report which cooking fuel they use. Such cases can safely be excluded from any analysis of cooking fuel, and their exclusion will not bias that analysis. Similarly, a person who does not smoke will not answer the question of how many cigarettes he or she smokes daily. In this case, we can assign a value of '0' to those who do not smoke and proceed with the analysis.

Table 1: Example of structurally missing data

ID  Kitchen  Cooking fuel  Smoking  No. of cigarettes
1   Yes      LPG           No
2   No                     Yes      5
3   Yes      LPG           Yes      9
4   No                     No
5   No                     Yes      7
6   Yes      LPG           No

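The two treatments described for Table 1 can be sketched in Python. This is only a minimal illustration using the standard library: the records mirror Table 1, None stands for a blank cell, and the helper names fill_non_smokers and with_kitchen are assumptions made for this example rather than part of any package.

```python
# Records mirroring Table 1; None marks a structurally missing cell.
columns = ("id", "kitchen", "cooking_fuel", "smoking", "no._of_cigarettes")
values = [
    (1, "Yes", "LPG", "No", None),
    (2, "No", None, "Yes", 5),
    (3, "Yes", "LPG", "Yes", 9),
    (4, "No", None, "No", None),
    (5, "No", None, "Yes", 7),
    (6, "Yes", "LPG", "No", None),
]
records = [dict(zip(columns, row)) for row in values]


def fill_non_smokers(rows):
    """Return copies of the rows in which non-smokers are given 0 cigarettes."""
    filled = []
    for row in rows:
        row = dict(row)  # copy, so the original records stay untouched
        if row["smoking"] == "No" and row["no._of_cigarettes"] is None:
            row["no._of_cigarettes"] = 0
        filled.append(row)
    return filled


def with_kitchen(rows):
    """Keep only the respondents to whom the cooking-fuel question applies."""
    return [row for row in rows if row["kitchen"] == "Yes"]


cleaned = fill_non_smokers(records)
fuel_sample = with_kitchen(cleaned)

print([row["no._of_cigarettes"] for row in cleaned])  # [0, 5, 9, 0, 7, 0]
print([row["id"] for row in fuel_sample])             # [1, 3, 6]
```

The cooking-fuel analysis would then use fuel_sample only, while any analysis of smoking can use all six rows of cleaned.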

2. Missing completely at random (MCaR)

Looking at Table 2, we can ask what the incomes of the third and fourth respondents might be. The easiest way to answer this question is to assume that, within each gender, 50% of the respondents have high incomes and the remaining 50% have low incomes. Among the females with a reported income, two are low and one is high, so the female respondent with the missing value (ID 3) would be assigned a high income. Among the males, two are high and one is low, so the male respondent with the missing value (ID 4) would be assigned a low income. This is known as treating the missing values as missing completely at random. When we make this assumption, we assume that whether or not a person has missing values is entirely unrelated to the other information in the data, and to the missing values themselves.

Table 2: Example of missing completely at random and missing at random

ID  Gender  Asset Index  Income  Age
1   M       High         High    40
2   F       Low          Low     20
3   F       Medium               32
4   M       Low                  45
5   M       High         High    50
6   F       Low          Low     18
7   M       Medium       Low     21
8   F       High         High    25

Identifying MCaR is relatively simple in principle. If the other variables in the data can predict whether a value is missing, then the data is not MCaR. The MCaR assumption can be formally tested using Little's test.

If the data is MCaR, we can proceed with our analysis using only the complete cases, provided the remaining sample size is large enough. MCaR holds only when the missing values are truly due to a random phenomenon.
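
As a rough illustration of these ideas, the sketch below compares how often income is missing within each gender in Table 2 and then analyses the complete cases only. It is an informal check, not Little's test, and eight respondents are far too few to conclude anything. The function names missing_rate_by and complete_cases are assumptions made for this example.

```python
from collections import defaultdict

# Rows mirroring Table 2; None marks a missing income.
columns = ("id", "gender", "asset_index", "income", "age")
values = [
    (1, "M", "High", "High", 40),
    (2, "F", "Low", "Low", 20),
    (3, "F", "Medium", None, 32),
    (4, "M", "Low", None, 45),
    (5, "M", "High", "High", 50),
    (6, "F", "Low", "Low", 18),
    (7, "M", "Medium", "Low", 21),
    (8, "F", "High", "High", 25),
]
table2 = [dict(zip(columns, row)) for row in values]


def missing_rate_by(rows, group_key, target="income"):
    """Share of rows with a missing target value within each group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [missing, total]
    for row in rows:
        counts[row[group_key]][1] += 1
        if row[target] is None:
            counts[row[group_key]][0] += 1
    return {group: missing / total for group, (missing, total) in counts.items()}


def complete_cases(rows, target="income"):
    """Listwise deletion: keep only rows where the target is observed."""
    return [row for row in rows if row[target] is not None]


# Similar missing rates across groups are consistent with MCaR (not proof of it).
rates = missing_rate_by(table2, "gender")
print(rates)  # {'M': 0.25, 'F': 0.25}

# Complete-case analysis: share of high incomes among the observed rows.
observed = complete_cases(table2)
share_high = sum(row["income"] == "High" for row in observed) / len(observed)
print(len(observed), share_high)  # 6 0.5
```

Calling missing_rate_by(table2, "asset_index") gives the same comparison across asset-index levels.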

3. Missing at random (MaR)

In the case of MaR, we assume that we can predict the missing values with the help of other variables in the data. Looking at Table 2, a simple predictive model would predict income from asset_index, age, and gender, or from asset_index alone. Note that prediction here does not mean that we can predict the missing values perfectly. All that is required is a probabilistic relationship.
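
To make the idea of a probabilistic relationship concrete, the sketch below estimates the distribution of income given asset_index from the respondents in Table 2 whose income is observed, and then looks up that distribution for the two respondents whose income is missing. It is a deliberately crude stand-in for a real predictive model, and the function name conditional_income is an assumption made for this example.

```python
from collections import Counter, defaultdict

# Rows mirroring Table 2; None marks a missing income.
columns = ("id", "gender", "asset_index", "income", "age")
values = [
    (1, "M", "High", "High", 40),
    (2, "F", "Low", "Low", 20),
    (3, "F", "Medium", None, 32),
    (4, "M", "Low", None, 45),
    (5, "M", "High", "High", 50),
    (6, "F", "Low", "Low", 18),
    (7, "M", "Medium", "Low", 21),
    (8, "F", "High", "High", 25),
]
table2 = [dict(zip(columns, row)) for row in values]


def conditional_income(rows, predictor="asset_index"):
    """Estimate P(income | predictor) from the rows where income is observed."""
    counts = defaultdict(Counter)
    for row in rows:
        if row["income"] is not None:
            counts[row[predictor]][row["income"]] += 1
    return {
        level: {income: n / sum(c.values()) for income, n in c.items()}
        for level, c in counts.items()
    }


probs = conditional_income(table2)

# Predicted income distribution for each respondent with a missing income.
for row in table2:
    if row["income"] is None:
        print(row["id"], row["asset_index"], probs.get(row["asset_index"], {}))
```

With so few observed rows the estimated probabilities are degenerate for some asset-index levels. Passing predictor="gender" instead gives less extreme estimates for the same two respondents.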

When the data is MaR, we can use an advanced imputation method, such as multiple imputation, to impute the missing values. Alternatively, we can use analytical methods specifically designed for handling MaR.
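
The general shape of multiple imputation can be sketched as follows: fill in the missing incomes several times with random draws, analyse each completed dataset, and average the results. This is a heavily simplified, hot-deck style illustration on the Table 2 data. A full multiple-imputation procedure also reflects the uncertainty of the imputation model and pools variances using Rubin's rules, neither of which is done here. The choice of gender as the donor class, the number of imputations, the seed, and the function names are all assumptions made for this example.

```python
import random

# Rows mirroring Table 2; None marks a missing income.
columns = ("id", "gender", "asset_index", "income", "age")
values = [
    (1, "M", "High", "High", 40),
    (2, "F", "Low", "Low", 20),
    (3, "F", "Medium", None, 32),
    (4, "M", "Low", None, 45),
    (5, "M", "High", "High", 50),
    (6, "F", "Low", "Low", 18),
    (7, "M", "Medium", "Low", 21),
    (8, "F", "High", "High", 25),
]
table2 = [dict(zip(columns, row)) for row in values]


def impute_once(rows, rng, class_key="gender"):
    """Fill each missing income with a random draw from observed donors in the same class."""
    donors = {}
    for row in rows:
        if row["income"] is not None:
            donors.setdefault(row[class_key], []).append(row["income"])
    completed = []
    for row in rows:
        row = dict(row)  # copy, so the original table is left untouched
        if row["income"] is None:
            # Raises KeyError if a class has no observed donors (not the case here).
            row["income"] = rng.choice(donors[row[class_key]])
        completed.append(row)
    return completed


def share_high(rows):
    """Proportion of respondents with a high income."""
    return sum(row["income"] == "High" for row in rows) / len(rows)


rng = random.Random(2023)  # fixed seed so that the sketch is reproducible
imputed_sets = [impute_once(table2, rng) for _ in range(20)]
estimates = [share_high(rows) for rows in imputed_sets]
pooled = sum(estimates) / len(estimates)  # pooled point estimate
print(round(pooled, 3))
```

The spread of the values in estimates gives a first impression of how much the imputation itself contributes to the uncertainty of the pooled estimate.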

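Notably, any analysis that is valid under MaR is also valid under MCaR, because MCaR is a special case of MaR. The reverse is not true: an analysis that relies on MCaR, such as simply dropping the incomplete cases, may be biased when the data is only MaR.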

4. Nonignorable missing

Nonignorable missingness is also known as missing not at random. This is the case when we cannot confidently conclude why the data is missing, or when respondents refuse to answer specific questions because of what their answers would be, so that the missingness depends on the unobserved values themselves. The standard methods for handling missing values, which assume MCaR or MaR, cannot be relied on when the missingness is nonignorable. See Tang & Ju (2018) for more information on handling nonignorable missing data.
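
A small sketch can show why this kind of missingness is a problem. Suppose, purely for illustration, that every respondent whose income is above some cutoff refuses to report it. The income figures and the cutoff below are invented for this example and do not come from any real survey.

```python
# Hypothetical incomes, invented for this illustration only.
true_incomes = [12, 15, 18, 22, 25, 30, 45, 60, 80, 120]

# Assumed refusal rule: incomes above the cutoff are never reported.
REFUSAL_CUTOFF = 40

reported = [x if x <= REFUSAL_CUTOFF else None for x in true_incomes]
observed = [x for x in reported if x is not None]

true_mean = sum(true_incomes) / len(true_incomes)
observed_mean = sum(observed) / len(observed)

# The complete-case mean understates the true mean, because the values
# that are missing are exactly the large ones.
print(true_mean, round(observed_mean, 2))  # 42.7 20.33
```

Because the missingness depends on the income itself, no other variable in the data would reveal the problem, which is why dropping or imputing the missing cases in the usual way does not remove the bias.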



Suggested Citation: Meitei, W. B. (2023). Different types of missing data. WBM STATS.
