Sunday, 8 March 2020

Two-Stage Sampling (both SRSWoR) with R

by W. B. Meitei, PhD


Two-stage sampling is a widely used method in survey research and statistical studies, particularly when dealing with large or geographically dispersed populations. In this approach, the sample is selected in two distinct stages. In the first stage, larger units, often called primary sampling units (PSUs), such as villages, wards, or households, are selected, usually using probability sampling. In the second stage, smaller units, referred to as secondary sampling units (SSUs), such as individuals or specific items within the PSUs, are sampled from within the chosen primary units. The units selected at the first stage of sampling are also called the first-stage units, and the units or groups of units within the first-stage units are called the second-stage units or subunits.

This method offers practical advantages, such as reduced cost and increased efficiency, especially in cases where it is impractical to list or access every element of the population directly. It also allows for greater flexibility in design and helps improve the representativeness of the sample.

The process of two-stage sampling works as follows:

  1. Divide the whole population into different clusters
  2. Select n clusters out of the N clusters (first-stage selection)
  3. Select a sample of size mi from the ith selected cluster, i.e., a specified number of units from each selected cluster (second-stage selection)
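
The three steps above can be sketched in base R on a small synthetic population. The data frame `pop`, its columns, and the sample sizes used here are made up for illustration and are not the hospital distance data.

```r
set.seed(1)  # fixed seed so the draw can be reproduced

# Step 1: a synthetic population of 6 clusters with 10 units each (hypothetical)
pop <- data.frame(cluster = rep(1:6, each = 10), unit = 1:60)

# Step 2: first stage, select n = 3 of the N = 6 clusters by SRSWoR
chosen <- sample(unique(pop$cluster), size = 3, replace = FALSE)
stage1 <- pop[pop$cluster %in% chosen, ]

# Step 3: second stage, select m_i = 4 units by SRSWoR within each chosen cluster
rows_by_cluster <- split(seq_len(nrow(stage1)), stage1$cluster)
picked <- unlist(
  lapply(rows_by_cluster, function(r) r[sample.int(length(r), 4)]),
  use.names = FALSE
)
stage2 <- stage1[sort(picked), ]

table(stage2$cluster)  # 4 units from each of the 3 selected clusters
```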

Two-stage sampling is more flexible than one-stage sampling (commonly known as cluster sampling), and it reduces to one-stage sampling when the number of units sampled from each cluster equals the number of units in that cluster. For a given total number of sampled units, two-stage sampling usually gives higher statistical precision than one-stage sampling because the sample can be spread over more clusters. This precision comes at a cost: covering more clusters makes the survey more expensive than one-stage sampling.

Fig. 1 shows the number of villages/wards in each cluster of a particular geographical region (each square box represents a cluster). The region has 16 clusters, with cluster codes (cc) from 1 to 16, and the ith cluster has Mi (i = 1, 2, ..., 16) villages/wards. An example dataset, "hospital distance", is used for this exercise. The dataset consists of six variables, viz. Sl_No - serial number, region - region code, cc - cluster code, por - place of residence, VoW - village or ward code, and dist - distance of the district hospital from the village or ward (in km). [The mean distance of the district hospital from the village or ward for this particular region is 15.75 km]

The process of estimating the mean distance of the district hospital from the village or ward using two-stage sampling technique in R is given below.

Fig. 1: No. of villages/wards by cluster codes of a region

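First, set the working directory, i.e., the folder that contains the data file. This is done using "setwd" (note that "setwd" does not create the folder; it must already exist),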

> setwd("path of the directory/folder")

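To sample the data using standard sampling techniques, a package called "samplingbook" is used. The functions "srswor" (from the "sampling" package) and "svydesign" and "svymean" (from the "survey" package) used below become available once "samplingbook" is loaded, since it depends on both packages. The package can be installed and loaded using the following code: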

> install.packages("samplingbook")
> library(samplingbook)

Then import the data using "read.csv"

> sdata = read.csv("sample_data.csv" , header = T)

In the first stage, five of the 16 clusters are selected randomly (SRSWoR) using the code below. "srswor" draws the five clusters from the sixteen and returns a vector of 0s and 1s, one entry per cluster. "fs" is a new variable that carries this indicator to every row of the data (0 means the row's cluster is not included in the sample, while 1 means it is included). Finally, "dfst" is the first-stage sample dataset, containing only the selected clusters.

> table(sdata$cc); cc = 1:16
> c = srswor(5 , 16); df = data.frame(cc , c); df
> sdata = sdata[order(sdata$cc) , ]

> fs = numeric(nrow(sdata))
> for(i in 1:16){
      fs[sdata$cc == df$cc[i]] = c[i]
   }


> sdata$fs = fs
> dfst = subset(sdata , fs == 1)
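
The same selection can be written without an explicit loop. The following is a minimal sketch on a toy data frame (the data `toy` are hypothetical), using base R's `sample` in place of `srswor`; both draw a simple random sample without replacement. The seed is an addition made here only so that the selection can be reproduced.

```r
set.seed(2020)  # assumption: a seed is added only for reproducibility

# Toy stand-in for sdata: 16 clusters with 3 villages/wards each (hypothetical)
toy <- data.frame(cc = rep(1:16, each = 3), dist = round(runif(48, 1, 30), 1))

# First stage: draw 5 of the 16 cluster codes by SRSWoR
sel <- sample(1:16, 5, replace = FALSE)

# Inclusion indicator: 1 if the row's cluster was selected, 0 otherwise
toy$fs <- as.integer(toy$cc %in% sel)

# First-stage sample: only rows from the selected clusters
toy_fst <- subset(toy, fs == 1)
sort(unique(toy_fst$cc))
```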

In the second stage, a total of 200 sampling units are selected randomly (SRSWoR) from across all the selected clusters. The number of units to be selected from the ith selected cluster is its share of the units in the first-stage sample dataset multiplied by 200, rounded to the nearest whole number. The R code is as follows:

> t = table(dfst$cc)
> s1 = round(t[1]/length(dfst$fs)*200)
> s2 = round(t[2]/length(dfst$fs)*200)
> s3 = round(t[3]/length(dfst$fs)*200)
> s4 = round(t[4]/length(dfst$fs)*200)
> s5 = round(t[5]/length(dfst$fs)*200)
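
The allocation can also be written as one vectorised expression. The sketch below uses made-up cluster sizes (the vector `sizes` is hypothetical). Because each share is rounded separately, the allocations need not add up to exactly 200 in general, so it is worth checking the total.

```r
# Hypothetical sizes M_i of five selected clusters, named by cluster code
sizes <- c("2" = 120, "5" = 80, "9" = 150, "11" = 60, "14" = 90)
target <- 200

# Proportional allocation: m_i = round(M_i / sum(M_i) * target)
alloc <- round(sizes / sum(sizes) * target)
alloc

# Check the total; with other sizes, rounding can shift it away from the target
sum(alloc)
```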

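The sample of 200 sampling units is selected using the values of mi computed above for the five selected clusters. Since "dfst" is sorted by cluster code, each cluster occupies a consecutive block of rows, and the units are drawn from within each block. "dtwost" is the final sample dataset selected using the two-stage sampling technique. The R code is as follows: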

> sst1 = sample(1:t[1] , s1 , replace = FALSE)
> sst2 = sample((t[1] + 1):cumsum(t)[2] , s2 , replace = FALSE)
> sst3 = sample((cumsum(t)[2] + 1):cumsum(t)[3] , s3 , replace = FALSE)
> sst4 = sample((cumsum(t)[3] + 1):cumsum(t)[4] , s4 , replace = FALSE)
> sst5 = sample((cumsum(t)[4] + 1):cumsum(t)[5] , s5 , replace = FALSE)
> sst = sort(c(sst1 , sst2 , sst3 , sst4 , sst5))
> dtwost = dfst[sst , ]
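
The within-cluster selection can also be written so that it does not depend on row-position arithmetic. The following sketch assumes a toy first-stage dataset `fst` and an allocation vector `m` named by cluster code; both are hypothetical.

```r
set.seed(42)

# Toy first-stage sample: three selected clusters of different sizes (hypothetical)
fst <- data.frame(cc = rep(c(3, 7, 12), times = c(8, 5, 10)), VoW = 1:23)

# Hypothetical second-stage sample sizes m_i, named by cluster code
m <- c("3" = 4, "7" = 2, "12" = 5)

# Row positions of each cluster, then SRSWoR of m_i positions within each
rows <- split(seq_len(nrow(fst)), fst$cc)
pick <- lapply(names(rows), function(k) {
  r <- rows[[k]]
  # sample.int on the length avoids sample()'s special case for a single number
  r[sample.int(length(r), m[[k]])]
})
two_stage <- fst[sort(unlist(pick)), ]
table(two_stage$cc)
```

Splitting by cluster code keeps each draw inside its own cluster, so a unit cannot be selected twice and no row can be attributed to the wrong cluster.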

The "finite population correction" must be supplied to the survey design so that the estimate and its variance account for sampling without replacement at both stages. This is done by creating two new variables: N1, equal to 16 (the total number of clusters) for every row, and N2, equal to the number of sampling units in the row's cluster in the first-stage sample.

> dtwost$N1 = 16
> dtwost$N2[dtwost$cc == as.numeric(names(t)[1])] = t[1]
> dtwost$N2[dtwost$cc == as.numeric(names(t)[2])] = t[2]
> dtwost$N2[dtwost$cc == as.numeric(names(t)[3])] = t[3]
> dtwost$N2[dtwost$cc == as.numeric(names(t)[4])] = t[4]
> dtwost$N2[dtwost$cc == as.numeric(names(t)[5])] = t[5]
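
The five assignments can be replaced by a single lookup in the table of cluster sizes. A minimal sketch with toy data follows; `fst`, `samp`, and `sizes` are hypothetical stand-ins for "dfst", "dtwost", and "t".

```r
# Toy first-stage data: clusters 3, 7 and 12 with 8, 5 and 10 units (hypothetical)
fst <- data.frame(cc = rep(c(3, 7, 12), times = c(8, 5, 10)))
sizes <- table(fst$cc)  # number of units M_i in each selected cluster

# Toy second-stage sample: a few rows from each cluster
samp <- fst[c(1, 4, 9, 10, 15, 20, 23), , drop = FALSE]

samp$N1 <- 16  # total number of clusters in the population
# Look up each row's cluster size by its cluster code
samp$N2 <- as.vector(sizes[as.character(samp$cc)])
samp
```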

Finally, a two-stage survey design is specified on the sampled data "dtwost" using "svydesign": the id formula gives the cluster code (cc) for the first stage and the serial number (Sl_No) for the second stage, and the fpc formula supplies N1 and N2. The mean distance of the district hospital from the village or ward is then estimated using "svymean".

> twost = svydesign(id = ~ cc + Sl_No , data = dtwost , fpc = ~N1+N2)
> svymean(~dist , design = twost)
           mean
   dist  15.038
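
As a cross-check on the point estimate, the weighted mean can be computed by hand. Under SRSWoR at both stages, a unit in the ith selected cluster has sampling weight (N/n) x (Mi/mi), and the mean is estimated by the weighted sum of the observations divided by the sum of the weights. The sketch below uses a small made-up sample (the data frame `samp` and its values are hypothetical); it reproduces only the point estimate, not the standard error.

```r
# Hypothetical two-stage sample: n = 2 clusters drawn from N = 16
samp <- data.frame(
  cc   = c(4, 4, 4, 9, 9),        # cluster code
  dist = c(10, 14, 12, 20, 22),   # distance in km
  N1   = 16,                      # clusters in the population
  N2   = c(30, 30, 30, 10, 10)    # units M_i in the row's cluster
)

n  <- length(unique(samp$cc))               # clusters in the sample
mi <- ave(samp$dist, samp$cc, FUN = length) # units sampled in the row's cluster

# Sampling weight = (N / n) * (M_i / m_i)
w <- (samp$N1 / n) * (samp$N2 / mi)

# Weighted (ratio-type) estimate of the mean distance
est <- sum(w * samp$dist) / sum(w)
est
```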

The two-stage estimate of the mean distance of the district hospital from the villages or wards is 15.04 km, compared with the actual mean of 15.75 km for all villages and wards of the region. The estimate is about 0.71 km (roughly 4.5%) below the actual mean, so the two are reasonably close.

The sampled dataset is then exported in .csv format using the following code:

> write.csv(dtwost , "data_dtwost.csv" , row.names = FALSE)



Suggested Citation: Meitei, W. B. (2020). Two-Stage Sampling (both SRSWoR) with R. WBM STATS.
