Sunday, 7 March 2021

Stratified Sampling with R

by W. B. Meitei, PhD


Stratified sampling is a probability sampling method used to improve the representativeness and precision of survey results by dividing a population into distinct subgroups, or "strata," based on specific characteristics such as age, gender, income level, or geographic region. Once the population is divided, a random sample is drawn from each stratum, either proportionally or equally, depending on the study design. This approach ensures that all key subgroups are adequately represented in the final sample, reducing sampling error and allowing for more accurate comparisons across strata. Stratified sampling is particularly useful when the population is heterogeneous, and the variables of interest vary meaningfully between subgroups.
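
As a rough illustration of the two allocation choices mentioned above, the sketch below splits a total sample across two strata either proportionally or equally. The stratum sizes are the ones used in the exercise later in this post; the total sample size of 200 and the object names are assumptions made only for this illustration.

```r
# Stratum sizes (rural, urban) taken from the exercise later in this post
N_h <- c(rural = 1156, urban = 1222)
n   <- 200  # total sample size, assumed for illustration

# Proportional allocation: n_h = n * N_h / N, rounded to whole units
n_prop <- round(n * N_h / sum(N_h))

# Equal allocation: the same number of units from every stratum
n_equal <- rep(n / length(N_h), length(N_h))
names(n_equal) <- names(N_h)

n_prop   # rural 97, urban 103
n_equal  # rural 100, urban 100
```

With strata of nearly equal size, as here, the two allocations differ only slightly.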

Now, let X be the characteristic or variable of interest. If the population is homogeneous with respect to X, a sample selected by simple random sampling will also be homogeneous, and the sample mean will be a good and reliable estimate of the population mean. Such a sample is expected to be more representative of its population than a sample of the same size drawn from a heterogeneous population. The variance of the sample mean depends not only on the sample size and the sampling fraction but also on the population variance. Therefore, to increase the precision of the estimates, it helps to draw the sample from populations that are homogeneous with respect to the characteristic under study. Stratified sampling is one technique that does this.
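
To make the dependence on the population variance concrete, here is a minimal sketch of the textbook variance of the sample mean under simple random sampling without replacement, (1 - n/N) * S^2 / n. The helper name var_srs_mean and the numbers passed to it are hypothetical and chosen only for illustration.

```r
# Variance of the sample mean under SRSWOR:
#   (1 - n/N) * S^2 / n
# S2 = population variance, n = sample size, N = population size
var_srs_mean <- function(S2, n, N) {
  (1 - n / N) * S2 / n
}

# Same n and N, but the second population is more heterogeneous
v_homogeneous   <- var_srs_mean(S2 = 25,  n = 100, N = 1000)  # 0.225
v_heterogeneous <- var_srs_mean(S2 = 100, n = 100, N = 1000)  # 0.9

c(v_homogeneous, v_heterogeneous)
```

For a fixed sample size and sampling fraction, a fourfold increase in the population variance gives a fourfold increase in the variance of the sample mean.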

The process of the stratified sampling technique works as follows:

  1. Divide the population into smaller groups or subpopulations (known as strata) such that the sampling units are homogeneous with respect to X within the strata but heterogeneous between the strata.
  2. Treat each stratum as a separate population and draw a sample from each stratum using a simple random sampling technique.
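
The two steps above can be sketched on simulated data as follows. The population, the stratum labels "A" and "B", the per-stratum sample size and the seed are all made up for this illustration and are not part of the hospital distance dataset.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical population: two strata with clearly different means of x
pop <- data.frame(
  stratum = rep(c("A", "B"), times = c(600, 400)),
  x       = c(rnorm(600, mean = 10, sd = 2), rnorm(400, mean = 30, sd = 2))
)

# Step 1: divide the population into strata
strata <- split(pop, pop$stratum)

# Step 2: draw a simple random sample without replacement from each stratum
n_h  <- 50
samp <- do.call(rbind, lapply(strata, function(s) s[sample(nrow(s), n_h), ]))

table(samp$stratum)
```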

For this particular exercise, an example dataset "hospital distance" is used. The dataset is based on a hypothetical region and has six variables, namely, Sl_No - Serial Number, region - region code, cc - cluster code, por - Place of Residence, VoW - Village or Wards code and dist - distance of the district hospital from the village or ward (in km). The two categories of place of residence, coded 1 for Rural and 2 for Urban, are treated as the two strata.

The objective of this exercise is to draw a sample of size 100 from each of the two strata using a simple random sampling technique and estimate the mean distance of the district hospital from the village or ward. [The mean distance of the district hospital from the village or ward for this particular region is 15.75 km].

The process of estimating the mean distance of the district hospital from the village or ward using the stratified sampling technique in R is given below. Here, X is the distance of the district hospital from the villages/wards.

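First, set the working directory, that is, the folder R reads files from and writes files to. This can be done using "setwd", which points R to an existing folder and does not create one,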

> setwd("path of the directory/folder")

To work with standard sampling techniques, we use the "samplingbook" package. The functions "svydesign" and "svymean" used later come from the "survey" package, so it is installed and loaded explicitly as well. The packages can be installed and loaded using the following code:

> install.packages(c("samplingbook" , "survey"))
> library(samplingbook)
> library(survey)

Then import the data using "read.csv".

> sdata = read.csv("sample_data.csv" , header = TRUE)

Consider place of residence (por = 1 "Rural" and por = 2 "Urban") as the two strata. Once the data are sorted by por, rows 1 to 1156 are rural and rows 1157 to 2378 are urban, so 100 sampling units are selected at random without replacement (SRSWOR) from each of these two row ranges. "dsts" is the final sample dataset (having 200 sampling units) obtained using the stratified sampling technique. The R code is as follows:

> table(sdata$por)
> sdatast = sdata[order(sdata$por) , ]
> sts1 = sample(1:1156 , 100 , replace = FALSE)
> sts2 = sample(1157:2378 , 100 , replace = FALSE)
> sts = sort(c(sts1 , sts2))
> dsts = sdatast[sts , ]
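
The row ranges 1:1156 and 1157:2378 above are tied to this particular dataset. A minimal sketch of one way to avoid hard-coded ranges is shown below: it samples row indices within each value of por. The data frame sdata_demo is a hypothetical stand-in with made-up distances (only the column names and stratum sizes follow the post), and the seed is arbitrary.

```r
set.seed(2021)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical stand-in for sdata: same stratum sizes, made-up distances
sdata_demo <- data.frame(
  por  = rep(c(1, 2), times = c(1156, 1222)),
  dist = runif(2378, min = 0, max = 40)
)

# Row indices belonging to each stratum
idx_by_stratum <- split(seq_len(nrow(sdata_demo)), sdata_demo$por)

# 100 row indices per stratum, sampled without replacement
picked <- unlist(lapply(idx_by_stratum, function(i) sample(i, 100, replace = FALSE)))

dsts_demo <- sdata_demo[sort(picked), ]
table(dsts_demo$por)
```

Because the indices are taken from the data itself, this version does not require the data to be sorted by por first.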

The "finite population correction" (fpc) is needed for estimating the mean distance of the district hospital from the village or ward under stratified sampling: when no weights are supplied, "svydesign" derives the sampling probabilities from it, and it also enters the standard error. It is supplied by creating one new variable, sN, holding the stratum population size: 1156 (total number of rural units in the population) for rural areas and 1222 (total number of urban units in the population) for urban areas.

> dsts$sN[dsts$por == 1] = 1156
> dsts$sN[dsts$por == 2] = 1222
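
As an alternative to typing the stratum sizes by hand, they can be looked up from the full dataset. The sketch below assumes hypothetical stand-ins sdata_demo and dsts_demo in place of sdata and dsts, containing only the por column.

```r
# Hypothetical stand-ins for the full data and the stratified sample
sdata_demo <- data.frame(por = rep(c(1, 2), times = c(1156, 1222)))
dsts_demo  <- data.frame(por = rep(c(1, 2), each = 100))

# Population size of each stratum, counted from the full data
N_h <- table(sdata_demo$por)

# Attach the matching stratum size to every sampled unit
dsts_demo$sN <- as.numeric(N_h[as.character(dsts_demo$por)])

table(dsts_demo$por, dsts_demo$sN)
```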

Finally, a stratified survey design is specified on the sampled data "dsts" using "svydesign", with por as the stratum variable and sN as the finite population correction. Then, using "svymean", the required mean distance of the district hospital from the village or ward is estimated.

> sts = svydesign(id = ~1 , data = dsts , strata = ~por , fpc = ~sN)
> svymean(~dist , design = sts)
         mean     SE
dist   15.739  0.537

The estimated mean distance of the district hospital from the villages or wards using the stratified sampling technique is 15.74 km, compared to 15.75 km for all the villages and wards of the region.

The sampled dataset is then exported in .csv format using the following code:

> write.csv(dsts , "data_sts.csv" , row.names = FALSE)


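The outcome can be verified by hand using the following code, which weights each stratum by its share of the population (1156/2378 for rural and 1222/2378 for urban):

For the sample mean,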

> st_mean = aggregate(dsts$dist , list(dsts$por) , mean)
> str_mean = (1156/2378)*st_mean[1 , 2] + (1222/2378)*st_mean[2 , 2]; str_mean
  15.739

For the standard error of the estimated mean,

> st_var = aggregate(dsts$dist , list(dsts$por) , var)
> str_var = ((1156/2378)^2)*(1/100-1/1156)*st_var[1 , 2] + ((1222/2378)^2)*(1/100-1/1222)*st_var[2 , 2]
> sqrt(str_var)
  0.537
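
The hand calculation above can be wrapped in a small function that works for any number of strata. This is only a sketch: the function name strat_est and its arguments are hypothetical, and the toy numbers at the end are made up to show the call.

```r
# Stratified estimate of the mean and its standard error.
#   y       : observed values in the sample
#   stratum : stratum label of each sampled unit
#   N_h     : named vector of stratum population sizes (names = stratum labels)
strat_est <- function(y, stratum, N_h) {
  n_h  <- tapply(y, stratum, length)  # sample size per stratum
  ybar <- tapply(y, stratum, mean)    # sample mean per stratum
  s2   <- tapply(y, stratum, var)     # sample variance per stratum
  N_h  <- N_h[names(n_h)]             # align stratum sizes with the sample
  W_h  <- N_h / sum(N_h)              # stratum weights N_h / N

  est <- sum(W_h * ybar)
  se  <- sqrt(sum(W_h^2 * (1 / n_h - 1 / N_h) * s2))
  c(mean = est, SE = se)
}

# Toy call with made-up numbers
res <- strat_est(
  y       = c(1, 2, 3, 10, 20, 30),
  stratum = c(1, 1, 1, 2, 2, 2),
  N_h     = c("1" = 10, "2" = 30)
)
res
```

For the sample in this post, the corresponding call would be strat_est(dsts$dist , dsts$por , c("1" = 1156 , "2" = 1222)).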



Suggested Citation: Meitei, W. B. (2021). Stratified Sampling with R. WBM STATS.
