Epidemiology and Health Volume: 37, Article ID: e2015013, 10 pages http://dx.doi.org/10.4178/epih/e2015013

LECTURE

Open Access

Calculation of smoking rates by dong/eup/myeon unit using small-area estimation in the Community Health Survey

Kay O Lee, Jong Seok Byun, Yang Wha Kang, Yun Sil Ko, Hyo Jin Kim

1Gallup Korea, Seoul; 2Department of Applied Statistics, Hanshin University, Osan; 3Division of Chronic Disease Control, Korea Centers for Disease Control and Prevention, Cheongju, Korea

INTRODUCTION

The Korean Community Health Survey (CHS), a community-based nationwide annual survey with the objective of providing important health indicators, is conducted through stratified cluster sampling and computer-assisted personal interviewing. Using the dong/eup/myeon administrative units (hereafter units) and residential structures (apartment or single house) as stratification variables, 900 adults (age ≥ 19 years) per community health center district (hereafter district) are sampled and proportionally distributed across the units and according to residential structures. Tong/ban/ri-level sample points are then selected via probability proportionate sampling based on the number of households. From each selected sample point, five households on average are selected by systematic sampling, and individual interviews are conducted with all adults in each household [1]. Although district-level health indicators are produced with a specific level of precision, there is an increasing demand for unit-level health indicators. However, the unit sample size varies considerably, ranging between tens and several hundreds, and health indicators such as smoking rate produced using conventional statistical estimation methods cannot be used for units with fewer than 30 samples, owing to an exceedingly large sample variance of the estimates. This problem can be addressed by calculating unit-level statistics using a special estimation method such as small-area estimation [2]. This paper presents an optimized estimation method using Statistical Analysis System (SAS) codes.

Small-area estimation

Small-area estimation is an estimation method designed to produce statistics for small survey areas that were not included in the sample design for statistics and that have unusable high-variance estimates owing to excessively small sample sizes. It does so through supplementary use of surrounding area survey information, auxiliary information from other sources, or a statistical model of the population [3,4].

Given that the sample design for the Community Health Survey is intended to produce district-level health indicators, small-area estimation is needed for producing reliable unit-level health indicators. The following describes the small-area estimation methods for producing unit-level health indicators [5].

Direct estimator

A direct estimator uses only data obtained from the units concerned to produce unit-level health indicators. Each observation included in the survey data sets is given a weight item by item; the sample design-based direct estimator and its variance can be expressed by the following estimation equations, using weighted and observed values:

$$\hat{p}_i^{D} = \frac{\sum_{j=1}^{n_i} w_j y_j}{\sum_{j=1}^{n_i} w_j} \qquad (1)$$

where $n_i$ is the sample size of unit $i$, $w_j$ is the multiplier reflecting sample and response rates, and $y_j$ is the observed value.

The variance of the estimator shown in Equation (1) can be obtained using Equation (2), as follows:

[Equation (2): variance estimator of $\hat{p}_i^{D}$; not recoverable from the source]
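As a quick numeric illustration of Equation (1), consider a hypothetical unit with three respondents. The weights and responses below are invented for illustration and are not CHS data; $\hat{p}_i^{D}$ is used here as notation for the direct estimate of unit $i$.

```latex
% Hypothetical unit i with n_i = 3 respondents (illustrative values, not CHS data)
% weights:   w_1 = 100, w_2 = 150, w_3 = 250
% responses: y_1 = 1,   y_2 = 0,   y_3 = 1   (1 = smoker, 0 = non-smoker)
\[
\hat{p}_i^{D}
  = \frac{\sum_{j=1}^{3} w_j y_j}{\sum_{j=1}^{3} w_j}
  = \frac{100(1) + 150(0) + 250(1)}{100 + 150 + 250}
  = \frac{350}{500}
  = 0.70
\]
```

The unweighted share of smokers in this toy sample would be 2/3; the weighted estimate is higher because the two smokers carry larger weights.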

Correspondence: Kay O Lee
Gallup Korea, 70 Sajik-ro, Jongno-gu, Seoul 110-054, Korea
Tel: +82-2-3702-2582, Fax: +82-2-3702-2628, Email: [email protected]

Received: Jan 22, 2015, Accepted: Mar 2, 2015, Published: Mar 2, 2015
This article is available from: http://e-epih.org/
© 2015, Korean Society of Epidemiology
This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Synthetic estimator

A synthetic estimator can yield more accurate unit-level estimates by using various auxiliary data, such as sex, age, and registered population, for each unit within a district [4,5]. The units are first grouped into 2 to 3 homogeneous clusters, using their social environments and population ratios as clustering variables, under the assumption that the units of the same cluster might have similar sex-dependent/age-dependent health indicators, such as smoking rate. Unit-level smoking rates can then be calculated as follows, by combining smoking rates by sex and age group and by the number of people registered:

$$\hat{p}_i^{S} = \frac{\sum_{k=1}^{c_j} N_{ik}\,\bar{r}_{jk}}{\sum_{k=1}^{c_j} N_{ik}} \qquad (3)$$

where $\bar{r}_{jk}$ is the average estimate of category $k$ within cluster $j$, $N_{ik}$ is the registered population of category $k$ in unit $i$ within cluster $j$, $n_{jk}$ is the number of samples in category $k$ within cluster $j$, and $c_j$ is the number of categories within cluster $j$.

The variance of the estimate shown in Equation (3) can be obtained using Equation (4):

[Equation (4): variance estimator of $\hat{p}_i^{S}$; not recoverable from the source]

Composite estimator

The direct estimator shown in Equation (1) has a large variance when the unit sample size is small, whereas the synthetic estimator shown in Equation (3) is biased. These problems can be addressed using a hybrid approach of obtaining more reliable estimates from a weighted average of the two estimators. The weighted average of the estimators shown in Equations (1) and (3) is referred to as a composite estimator and can be calculated as follows:

$$\hat{p}_i^{C} = \alpha_i\,\hat{p}_i^{D} + (1-\alpha_i)\,\hat{p}_i^{S} \qquad (5)$$

where $\alpha_i$ is the weight that minimizes $\mathrm{MSE}(\hat{p}_i^{C})$ and can be calculated as follows:

$$\alpha_i = \frac{\hat{V}(\hat{p}_i^{S})}{\hat{V}(\hat{p}_i^{D}) + \hat{V}(\hat{p}_i^{S})} \qquad (6)$$

While the optimal value of $\alpha_i$ is expected to be the one that minimizes the root mean square error of $\hat{p}_i^{C}$, $\alpha_i$ is calculated using Equation (6) under the assumption that clustered units are sufficiently homogeneous to ignore the bias of the synthetic estimator, and that the direct and synthetic estimators are independent of each other.
Calculation of smoking rate using Statistical Analysis System

The following describes how unit-level smoking rates were calculated from the 2013 CHS data by applying small-area estimation to the 22 dongs in Gangnam-gu, Seoul.

Direct estimator

The sample size distribution of the 22 dongs of Gangnam-gu included in the 2013 CHS shows that the dongs with the smallest and largest sample sizes are Gaepo 4-dong (n = 24) and Yeoksam 1-dong (n = 58). The smoking rates by dong were calculated using the direct estimator shown in Equation (1) and the variance estimator shown in Equation (2), using SAS code as follows. The R-code program was presented in a previous study [6].

/* Generating the variables age groups and smoking/non-smoking from the Gangnam health center data */
data abc.seoul_gangnam_data;
set abc.chs13;
length age_group $8.0;
keep josa_year dong sm_a0100 sma_01z2 sma_03z2 age age_group wt;
rename dong=eup/myeon/dong;
if 19
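The direct-estimation step can be sketched in SAS as follows. This is a minimal illustration of Equation (1) only, not the authors' program: the data set `work.toy`, its values, and the 0/1 variable `smoker` are hypothetical, while `dong` and `wt` mirror the variable names used in the code above. The variance in Equation (2) is not computed here.

```sas
/* Hypothetical toy data (not CHS data):
   dong   = unit label
   wt     = survey weight
   smoker = 1 for current smoker, 0 otherwise (assumed coding) */
data work.toy;
  input dong $ wt smoker;
  datalines;
A 100 1
A 150 0
A 250 1
B 200 0
B 200 1
;
run;

/* Equation (1): weighted share of smokers within each dong */
proc sql;
  create table work.direct_est as
  select dong,
         count(*)                   as n_i,
         sum(wt * smoker) / sum(wt) as p_direct
  from work.toy
  group by dong;
quit;

proc print data=work.direct_est noobs;
run;
```

With these toy values the weighted rate is 350/500 = 0.7 for dong A and 200/400 = 0.5 for dong B. PROC SQL is used only to keep the arithmetic of Equation (1) visible; a design-based variance would need a survey procedure or an explicit formula.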
