
Sunday, 7 March 2021

Stratified Sampling with R

by W. B. Meitei, PhD


Stratified sampling is a probability sampling method used to improve the representativeness and precision of survey results by dividing a population into distinct subgroups, or "strata," based on specific characteristics such as age, gender, income level, or geographic region. Once the population is divided, a random sample is drawn from each stratum, either proportionally or equally, depending on the study design. This approach ensures that all key subgroups are adequately represented in the final sample, reducing sampling error and allowing for more accurate comparisons across strata. Stratified sampling is particularly useful when the population is heterogeneous, and the variables of interest vary meaningfully between subgroups.

Now, let X be the characteristic or variable of interest. If the population is homogeneous with respect to X, a sample selected by simple random sampling will also be homogeneous, and the sample mean will be a good and reliable estimate of the population mean. Such a sample is more likely to be representative of its population than a sample drawn from a heterogeneous population. The variance of the sample mean depends not only on the sample size and the sampling fraction but also on the population variance. Therefore, to increase the precision of the estimates, it helps to draw the sample from populations that are homogeneous with respect to the characteristic under study. Stratified sampling achieves this by splitting a heterogeneous population into homogeneous strata.
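As a reference point, the textbook variance of the sample mean under simple random sampling without replacement can be written as below. The symbols n (sample size), N (population size) and S^2 (population variance of X) are notation introduced here for illustration; they do not appear in the original exercise.

```latex
\operatorname{Var}(\bar{x}) = \left(1 - \frac{n}{N}\right)\frac{S^2}{n},
\qquad
S^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(X_i - \bar{X}\right)^2
```

The factor 1 - n/N carries the sampling fraction, and S^2 is the part that stratification aims to shrink within each stratum.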

The process of the stratified sampling technique works as follows:

  1. Divide the population into smaller groups or subpopulations (known as strata) such that the sampling units are homogeneous with respect to X within the strata but heterogeneous between the strata.
  2. Treat each stratum as a separate population and draw a sample from each stratum using a simple random sampling technique.
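The two steps above can be sketched in base R as follows. This is only a minimal illustration on simulated data, not the dataset used in this exercise; the names pop, by_stratum, n_h and samp, the stratum labels and the stratum sizes are all hypothetical.

```r
# Hypothetical population: two strata that differ in mean but are
# fairly homogeneous within (simulated values, not the exercise data).
set.seed(1)
pop <- data.frame(
  stratum = rep(c("A", "B"), times = c(600, 400)),
  x       = c(rnorm(600, mean = 10, sd = 1), rnorm(400, mean = 30, sd = 1))
)

# Step 1: divide the population into strata.
by_stratum <- split(pop, pop$stratum)

# Step 2: treat each stratum as its own population and draw an
# SRSWoR of n_h units from it.
n_h <- 50
samp <- do.call(rbind, lapply(by_stratum, function(d) {
  d[sample(nrow(d), n_h, replace = FALSE), ]
}))

# Number of sampled units per stratum.
table(samp$stratum)
```

Splitting the data first avoids hard-coding row ranges, at the price of one extra object in memory.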

For this particular exercise, an example dataset "hospital distance" is used. The dataset describes a hypothetical region and has six variables, namely, Sl_No - Serial Number, region - region code, cc - cluster code, por - Place of Residence, VoW - Village or Wards code and dist - distance of the district hospital from the village or ward (in km). The two places of residence, coded 1 for Rural and 2 for Urban, are treated as the two strata.

The objective of this exercise is to draw a sample of size 100 from each of the two strata using a simple random sampling technique and estimate the mean distance of the district hospital from the village or ward. [The mean distance of the district hospital from the village or ward for this particular region is 15.75 km].

The process of estimating the mean distance of the district hospital from the village or ward using the stratified sampling technique in R is given below. Here, X is the distance of the district hospital from the villages/wards.

First, set the working directory. The working directory can be set using "setwd",

> setwd("path of the directory/folder")

In order to sample the data using standard sampling techniques, we need to use a package called "samplingbook". Loading it also attaches the "survey" package, which provides "svydesign" and "svymean" used below. The package can be installed and loaded using the following code:

> install.packages("samplingbook")
> library(samplingbook)

Then import the data using "read.csv".

> sdata = read.csv("sample_data.csv" , header = TRUE)

Treat place of residence (por = 1 "Rural" and por = 2 "Urban") as the two homogeneous strata. After sorting the data by por, rows 1 to 1156 are rural and rows 1157 to 2378 are urban, so 100 sampling units are selected randomly (SRSWoR) from each of these two row ranges. "dsts" is the final sample dataset (having 200 sampling units) obtained using the stratified sampling technique. The R code is as follows:

> table(sdata$por)
> sdatast = sdata[order(sdata$por) , ]
> sts1 = sample(1:1156 , 100 , replace = FALSE)
> sts2 = sample(1157:2378 , 100 , replace = FALSE)
> sts = sort(c(sts1 , sts2))
> dsts = sdatast[sts , ]

The stratum population sizes are needed to compute the mean distance of the district hospital from the village or ward under stratified sampling: they determine how the strata are weighted and supply the "finite population correction" for the standard error. This can be done by creating one new variable, sN, with values equal to 1156 (the total number of rural sampling units) for rural areas and 1222 (the total number of urban sampling units) for urban areas.

> dsts$sN[dsts$por == 1] = 1156
> dsts$sN[dsts$por == 2] = 1222

Finally, a stratified survey design is specified on the sampled data "dsts" using "svydesign". Then, using "svymean", the required mean distance of the district hospital from the village or ward is estimated.

> sts = svydesign(id = ~1 , data = dsts , strata = ~por , fpc = ~sN)
> svymean(~dist , design = sts)
       mean    SE
dist 15.739 0.537

The estimated mean distance of the district hospital from the villages or wards using the stratified sampling technique is 15.74 km, compared to 15.75 km for all the villages and wards of the region.

The sampled dataset is then exported in .csv format using the following code:

> write.csv(dsts , "data_sts.csv" , row.names = FALSE)


The outcome can be verified by hand using the following code:

For sample mean,

> st_mean = aggregate(dsts$dist , list(dsts$por) , mean)
> str_mean = (1156/2378)*st_mean[1 , 2] + (1222/2378)*st_mean[2 , 2]; str_mean
[1] 15.739

For sample standard error,

> st_var = aggregate(dsts$dist , list(dsts$por) , var)
> str_var = ((1156/2378)^2)*(1/100-1/1156)*st_var[1 , 2] + ((1222/2378)^2)*(1/100-1/1222)*st_var[2 , 2]
> sqrt(str_var)
[1] 0.537
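The two checks above correspond to the standard stratified estimators, sketched below. The notation (W_h, N_h, n_h, \bar{x}_h, s_h^2) is introduced here for illustration; in this exercise N_1 = 1156, N_2 = 1222, N = 2378 and n_1 = n_2 = 100.

```latex
\bar{x}_{st} = \sum_{h=1}^{2} W_h\,\bar{x}_h,
\qquad
W_h = \frac{N_h}{N}

\widehat{\operatorname{Var}}(\bar{x}_{st})
  = \sum_{h=1}^{2} W_h^2\left(\frac{1}{n_h} - \frac{1}{N_h}\right)s_h^2,
\qquad
\operatorname{SE}(\bar{x}_{st}) = \sqrt{\widehat{\operatorname{Var}}(\bar{x}_{st})}
```

Here \bar{x}_h and s_h^2 are the sample mean and sample variance of dist within stratum h, which is what the two "aggregate" calls compute.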



Suggested Citation: Meitei, W. B. (2021). Stratified Sampling with R. WBM STATS.

Sunday, 8 March 2020

Two-Stage Sampling (both SRSWoR) with R

by W. B. Meitei, PhD


Two-stage sampling is a widely used method in survey research and statistical studies, particularly when dealing with large or geographically dispersed populations. In this approach, the sample is selected in two distinct stages. In the first stage, larger units, often called primary sampling units (PSUs), such as villages, wards, or households, are selected, usually using probability sampling. In the second stage, smaller units, referred to as secondary sampling units (SSUs), such as individuals or specific items within the PSUs, are sampled from within the chosen primary units. The units selected at the first stage of sampling are also called the first-stage units, and the units or groups of units within the first-stage units are called the second-stage units or subunits.

This method offers practical advantages, such as reduced cost and increased efficiency, especially in cases where it is impractical to list or access every element of the population directly. It also allows for greater flexibility in design and helps improve the representativeness of the sample.

The process of two-stage sampling works as follows:

  1. Divide the whole population into different clusters
  2. Select n clusters out of the N clusters (first-stage selection)
  3. A sample of size m_i is selected from the ith selected cluster; i.e., a specified number of units is drawn from each selected cluster (second-stage selection)
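The three steps above can be sketched in base R as follows. This is a minimal illustration on simulated data rather than the dataset used in this exercise; the cluster sizes, the fixed second-stage size of 20 and the names M, pop, sel, stage1 and stage2 are all assumptions made for the example.

```r
set.seed(2)

# Step 1: a hypothetical population already divided into 16 clusters
# of unequal size (simulated, not the exercise data).
M <- sample(50:150, 16, replace = TRUE)          # units per cluster
pop <- data.frame(cc = rep(1:16, times = M),
                  y  = rnorm(sum(M), mean = 15, sd = 5))

# Step 2 (first stage): SRSWoR of n = 5 clusters out of N = 16.
sel <- sample(1:16, 5, replace = FALSE)
stage1 <- pop[pop$cc %in% sel, ]

# Step 3 (second stage): SRSWoR of m_i = 20 units within each
# selected cluster.
stage2 <- do.call(rbind, lapply(split(stage1, stage1$cc), function(d) {
  d[sample(nrow(d), 20, replace = FALSE), ]
}))

# Number of sampled units per selected cluster.
table(stage2$cc)
```

A fixed m_i keeps the sketch short; the exercise below instead makes m_i proportional to the cluster size.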

It is a more flexible sampling technique than one-stage sampling (commonly known as cluster sampling), and it reduces to one-stage sampling when the number of units sampled from each selected cluster equals the number of units in that cluster. For a given number of sampled units, two-stage sampling spreads the sample over more clusters and so tends to give higher statistical precision than one-stage sampling. However, this precision comes at a cost: covering more clusters is more expensive than one-stage sampling.

Fig. 1 represents the number of villages/wards in a particular geographical region by its clusters (each square box represents a cluster). The region has 16 clusters with cluster codes (cc) from 1 to 16. Each cluster has M_i (i = 1, 2, ..., 16) villages/wards. An example dataset "hospital distance" is used for this particular exercise. The dataset consists of six variables, viz. Sl_No - Serial Number, region - region code, cc - cluster code, por - Place of Residence, VoW - Village or Wards code and dist - distance of the district hospital from the village or ward (in km). [The mean distance of the district hospital from the village or ward for this particular region is 15.75 km]

The process of estimating the mean distance of the district hospital from the village or ward using two-stage sampling technique in R is given below.

Fig. 1: No. of villages/wards by cluster codes of a region

First, set the working directory. The working directory can be set using "setwd",

> setwd("path of the directory/folder")

In order to sample the data using standard sampling techniques, you need to use a package called "samplingbook". This package can be installed and loaded using the following code:

> install.packages("samplingbook")
> library(samplingbook)

Then import the data using "read.csv"

> sdata = read.csv("sample_data.csv" , header = TRUE)

Five clusters out of the 16 clusters are selected randomly (SRSWoR) in the first stage using the code below. "srswor" draws the five clusters at random from the sixteen and returns a vector of 0s and 1s, one per cluster. Here, "fs" is a new variable that copies this indicator to every row of the data (0 means the cluster is not included in the sample, while 1 means the cluster is included in the sample). Finally, "dfst" is the required sample dataset at the first stage, having only the selected clusters.

> table(sdata$cc); cc = 1:16
> c = srswor(5 , 16); df = data.frame(cc , c); df
> sdata = sdata[order(sdata$cc) , ]

> fs = numeric(nrow(sdata))
> for(i in 1:16){
      fs[sdata$cc == df$cc[i]] = c[i]
   }
> sdata$fs = fs
> dfst = subset(sdata , fs == 1)

Now, in the second stage, a total of 200 sampling units are selected randomly (SRSWoR) from the selected clusters. The number of sampling units to be selected from the ith selected cluster is its proportion of sampling units in the first-stage sample dataset multiplied by 200, rounded to the nearest integer. Here, "t" holds the number of villages/wards in each selected cluster. The R code is as follows:

> t = table(dfst$cc)
> s1 = round(t[1]/length(dfst$fs)*200)
> s2 = round(t[2]/length(dfst$fs)*200)
> s3 = round(t[3]/length(dfst$fs)*200)
> s4 = round(t[4]/length(dfst$fs)*200)
> s5 = round(t[5]/length(dfst$fs)*200)
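The same allocation rule can be written in vectorised form, as in the sketch below. The cluster sizes are made up for illustration and are not the counts from this exercise; sizes and m are hypothetical names.

```r
# Hypothetical numbers of villages/wards in five selected clusters,
# named by cluster code (made-up values, not the exercise data).
sizes <- c("3" = 120, "7" = 95, "9" = 160, "12" = 140, "15" = 110)

# Proportional allocation of a target of 200 units, rounded per cluster.
m <- round(sizes / sum(sizes) * 200)
m

# Total actually allocated after rounding.
sum(m)
```

With these made-up sizes the rounded allocations add up to 199 rather than 200, which shows that rounding each cluster separately does not guarantee the target total.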

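The sample of 200 sampling units is selected using the m_i values computed above (s1 to s5) for the 5 selected clusters. Because "dfst" is sorted by cluster code, the rows of the kth selected cluster run from cumsum(t)[k-1] + 1 to cumsum(t)[k]. "dtwost" is the final sample dataset selected using the two-stage sampling technique. The R code is as follows: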

> sst1 = sample(1:t[1] , s1 , replace = FALSE)
> sst2 = sample((t[1] + 1):cumsum(t)[2] , s2 , replace = FALSE)
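> # each range starts one row after the previous cluster ends, so ranges do not overlap
> sst3 = sample((cumsum(t)[2] + 1):cumsum(t)[3] , s3 , replace = FALSE)
> sst4 = sample((cumsum(t)[3] + 1):cumsum(t)[4] , s4 , replace = FALSE)
> sst5 = sample((cumsum(t)[4] + 1):cumsum(t)[5] , s5 , replace = FALSE)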
> sst = sort(c(sst1 , sst2 , sst3 , sst4 , sst5))
> dtwost = dfst[sst , ]

The population sizes at both stages are needed to compute the mean distance of the district hospital from the village or ward under two-stage sampling: they determine the sampling weights and supply the "finite population correction". This can be done by creating two new variables: N1, with all values equal to 16 (the total number of clusters), and N2, with values equal to the number of sampling units in the corresponding cluster selected at the first stage of sampling.

> dtwost$N1 = 16
> dtwost$N2[dtwost$cc == as.numeric(names(t)[1])] = t[1]
> dtwost$N2[dtwost$cc == as.numeric(names(t)[2])] = t[2]
> dtwost$N2[dtwost$cc == as.numeric(names(t)[3])] = t[3]
> dtwost$N2[dtwost$cc == as.numeric(names(t)[4])] = t[4]
> dtwost$N2[dtwost$cc == as.numeric(names(t)[5])] = t[5]

Finally, a two-stage survey design is specified on the sampled data "dtwost" using "svydesign". Then, using "svymean", the required mean distance of the district hospital from the village or ward is estimated.

> twost = svydesign(id = ~ cc + Sl_No , data = dtwost , fpc = ~N1+N2)
> svymean(~dist , design = twost)
       mean
dist 15.038
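To see what kind of estimate this is, the sketch below computes a weighted mean by hand, with weights equal to the inverse inclusion probabilities of a two-stage SRSWoR design. It assumes that the point estimate from "svymean" has this weighted-mean form for such a design; that assumption is not checked here. The data are simulated, and M_i, m_i, samp, w and est are hypothetical names.

```r
set.seed(3)

# Hypothetical two-stage sample: 5 of 16 clusters selected, then
# m_i of the M_i units sampled in each (made-up sizes, simulated dist).
M_i <- c(120, 95, 160, 140, 110)      # units in each selected cluster
m_i <- c(38, 30, 51, 45, 35)          # units sampled from each cluster
samp <- data.frame(
  cluster = rep(1:5, times = m_i),
  dist    = rnorm(sum(m_i), mean = 15, sd = 5)
)

# Design weight of a sampled unit: (N / n) * (M_i / m_i),
# i.e. the inverse of its inclusion probability.
w <- (16 / 5) * (M_i / m_i)[samp$cluster]

# Weighted (ratio-type) estimate of the population mean.
est <- sum(w * samp$dist) / sum(w)
est
```

When m_i is proportional to M_i, as in the allocation used above, the weights are nearly equal and the estimate is close to the plain sample mean.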

The estimated mean distance of the district hospital from the villages or wards using the two-stage sampling technique is 15.04 km, compared to 15.75 km for all the villages and wards of the region. The estimate is about 0.7 km below the actual mean distance, so the two are not very different.

The sampled dataset is then exported in .csv format using the following code:

> write.csv(dtwost , "data_dtwost.csv" , row.names = FALSE)



Suggested Citation: Meitei, W. B. (2020). Two-Stage Sampling (both SRSWoR) with R. WBM STATS.