Spatial Survival Analysis of Initiation Age and Prevalence of Smoking in Iran; Results from a Population Based Study

1Department of Epidemiology and Biostatistics, School of Public health, Tehran University of Medical Sciences, Tehran, Iran 2Non-Communicable Diseases Research Center, Endocrinology and Metabolism Population Sciences Institute, Tehran University of Medical Sciences, Tehran, Iran 3Endocrinology and Metabolism Research Center, Endocrinology and Metabolism Clinical Sciences Institute, Tehran University of Medical Sciences, Tehran, Iran


Introduction
Non-communicable diseases (NCDs) are the leading causes of mortality and morbidity around the world. 1,2 Based on a report by the World Health Organization (WHO), 41 million people die each year from NCDs, which is about 71% of all global deaths. 3 Three quarters of all NCDs occur in low-and middle-income countries. 3 Tobacco smoking has been recognized as one of the important risk factors responsible for NCDs not only for smokers but also those who are around them. 4 Cigarette smoke contains 3500 chemical substances, about 20 of which have potentially dangerous carcinogens, especially for lung cancers. 5 Although tobacco smoking is a global problem, it is becoming more prevalent in developing countries such as Iran. 6 The initiation age and prevalence of cigarette smoking are two important topics in any smoking-related policy-making domain. Previous studies have shown that individuals who initiated smoking at young ages had more nicotine dependency and therefore, had more difficulty quitting in adulthood. 7,8 Such individuals are at a higher risk of illegal drug use, alcohol consumption, and deviant behaviors. 9 There are several studies on the prevalence of smoking and associated risk factors in Iran which have been summarized in a meta-analysis by Moosazadeh et al. 10 Their estimation of the cigarette smoking prevalence was 19.8% to 21.7% in men and 0.94% to 3.6% in women. In a study by Meysamie et al 11 which was based on the previous rounds of STEPs survey (WHO STEPwise approach to Non-communicable Disease Risk Factors Surveillance), conducted in 2000,2005,2007 and 2011, the mean initiation age of cigarette smoking in total population was reported at 21.3, 24.8, 20.6 and 20.9, respectively. However, there is not enough information about the geographical distribution of initiation age of smoking cigarette in Iran.
Thus, the aims of the current study are: (i) to use a new approach based on spatial survival mixture cure model to estimate the initiation age and prevalence of cigarette smoking with their geographical distribution and (ii) to evaluate the effect of demographic variables on the initiation age and prevalence of smoking in Iran using the extracted data set from STEPs 2016.

Data Collection
The data used in this population-based cross-sectional study was extracted from STEPs 2016 survey conducted in Iran. A total of 30 541 individuals participated in the survey from 30 provinces of Iran. The samples were appropriately collected based on the proportional to size approach. A total of 3105 clusters were selected from urban and rural areas of the studied provinces. The sampling method and study protocol were carried out as described by Hajipour. 12 Our inclusion criterion was 18≤ age ≤70 years at the time of the study. Based on this inclusion criterion, 27 612 individuals were included in the study. In this study, individuals who had quit smoking before the time of the study and individuals who were smoking at the time of the study were considered as smokers. We used four independent variables such as gender (men/ women), area of residence (urban/rural), level of education (illiterate/1-6 years/7-12 years/more than 12 years) and socioeconomic status (low/middle/high), along with a province-related variable as indicator of individuals' area of residence (Table 1).

Statistical Analysis
To estimate the effect of demographic variables on the initiation age and prevalence of smoking with their geographical distribution, a parametric spatial survival mixture cure rate model with double censoring was applied. Age was assumed as follow-up time 13 and the initiation of smoking as event of interest. At the time of the study, we observed three groups of individuals in the STEPs dataset based on their smoking status: 1) nonsmokers; 2) smokers who knew their exact age of smoking initiation; and 3) smokers who did not know their exact age of smoking initiation. In our approach, the first group was considered as right censored data because they did not experience the event of interest until their age of participation in the study. The second group was considered as the exact time of the event as their exact age of smoking initiation was known. The third group was considered as left censored data one as the event of interest had happened at an age before their participation age in the study. This type of dataset is called doubly censored which consists of left-and right-censored data and the exact time of the event. Doubly censored data is common in registry 14 or cross-sectional studies when age is considered as follow-up time. In our study, we generalized the approach of Turnbull, 15 who studied the initiation age of marijuana smoking amongst high school students in the United States. Turnbull 15 introduced an iterative procedure based on Kaplan Meier method for estimating the survival curve when some participants do not remember the age of initiation and some without marijuana smoking experience at the time of study.
In traditional survival analysis, the event of interest is assumed to happen for the entire study population, but in the first group in our study, a fraction of participants never experienced the event of interest. Thus, they were not at the risk of the event of interest. We used the mixture cure rate model to account for these types of individuals in our study. The mixture cure rate model assumed that the population had two groups of individuals, susceptible and non-susceptible. Susceptible individuals experienced the event of interest, while the non-susceptible never experienced the event. Using the mixture cure rate model, we were able to estimate the prevalence of smoking in the STEPs data. Under mixture cure rate model, the population's survival function is defined as S P (t) = π + (1-π)S * (t), where S * (t) is the survival function for susceptible individuals, π is the proportion of individuals who never experience the event of interest and (1 -π) is the proportion of individuals who experience the event of interest (prevalence). We assumed that the survival time for susceptible individuals followed Weibull (μ, λ) distribution, where μ and λ representing the shape and scale parameters, respectively. The Weibull distribution is popular in parametric survival analysis because it can model decreasing, constant and increasing failure rate over time, if its shape parameter is less than, equal to or greater than 1, respectively. On the other hand, it can properly estimate other similar distributions. 16 Furthermore, the Weibull distribution has both PH and AFT properties. The appropriateness of Weibull distribution was checked by comparing model fitted cumulative curves and observed cumulative curves as shown in Figure 1. The probability of density function of Weibull distribution is f * (t) = μλt μ-1 e -λtμ . Evaluation of the effect of demographic variables including sex and area of residence on the survival time was accomplished using the log link function of scale parameter λ of Weibull distribution as log (λ) = β 0 + β 1 .sex + β 2 . Area.
The effect of independent variables, including sex, area of residence, level of socioeconomic status and education, on parameter π was evaluated using logistic regression with logit link function as follows: Level of education and socioeconomic status were considered as ordinal variables. 17 In the survival regression model, we considered sex and area of residence as covariates, as we did not have the value of subject's education level and socioeconomic status at the time of smoking initiation.
The independence of individuals' survival time is a usual assumption in survival analysis. However, when data are extracted from a survey study which has collected data across a country, it is reasonable to assume that there is a correlation between observations from the same and adjacent clusters.
The spatial analysis was used to account the correlation between observations, and provinces were assumed as clustered regions. The hierarchical Bayesian models are commonly used in spatial analysis where any between and within-cluster correlations are modeled at the second level of the hierarchy by a set of random effects. These random effects are most commonly represented by a conditional autoregressive (CAR) prior distribution. 18 These spatial random effects were added to the survival and logistic regression models. Advances in computing power and available software have made Markov chain Monte Carlo (MCMC) 19,20 one of the most important computational tools in Bayesian analysis, especially in spatial analysis. In our approach, we used MCMC via a Metropolis-Hasting algorithm 21 for sampling from the posterior distribution of parameters with zero tricks for arbitrary likelihood in OpenBUGS 22 software. Under the fully Bayesian approach, we set a non-informative prior distribution for model's parameters as the normal distribution with zero means and precision of 0.0001 was assumed for the survival and logistic regression parameters. Gamma  Where W is the adjacency matrix and ψ represent the set of provinces that have a common border with the j th province (j~k) and φ represent the precision of normal distribution that controls the amount of variation between the random effects.
To incorporate the survey weights into the analysis, we used the Bayesian pseudo posterior estimator (BPPE) method 23 where we replaced the likelihood with pseudolikelihood in the formula of Bayes' theorem. We considered non-response item by considering them as left-censored data. According to the STEPs protocol, survey weights have been adjusted for non-response unit. 12 The statistical methodology of our approach has been described in detail by Carlin et al and Achcar et al and Turnbull. 15,[23][24][25] The principal component analysis was used to estimate socioeconomic status variables based on households' assets by the psych package. 26 DIC (Deviance information criterion) criterion was used to model selection between spatial and non-spatial models. The output of the model fitting was reported in terms of hazard and odds ratios. All statistical analyses and visualization were performed using OpenBUGS version 3.2.3 and R version 3.5.1.

Results
Of the total of 27 612 participants, 14 785 (54%) were women and 12 827 (46%) were men with a mean age of 41.7 ± 13.5 years. A total of 25 059 (90.8%) participants in the study had not experienced smoking until the time of the study. On the other hand, 2553 (9.2%) participants were daily smokers and 1567 (61.4%) of them did not remember their age at onset of smoking. Most of the participants were living in urban areas (71%). The highest percentage of the participants had fewer than 12 years of schooling (78%). About 20% of the participants were over the age of 55, and 12% were under the age of 26. The demographic characteristics of smokers and non-smokers are presented in Table 1.
Adjusted and unadjusted estimations of model parameters are reported in Table 2. For the survival model, the adjusted estimates were obtained based on sex and area of residence. For logistic model, the adjusted estimates were obtained based on sex, area of residence, level of education and socioeconomic status. Based on the DIC calculated values for spatial (DIC = 19680) and nonspatial (DIC = 19880) models, considering the spatial model is reasonable. The Kaplan Meier and predicted cumulative probability curves of the age of smoking initiation are shown in Figure 1. As shown in Figure 1, a goodness of fit can be deduced between predicted and observed cumulative curves using Turnbull method.
From survival parts of the model, the hazard of early smoking initiation age was 1.66 times greater in men than that of women (adjusted hazard ratio [HR] = 1.66, 95% CI: 1.15, 2.48); also, the range of cigarette smoking initiation age was wider in woman than men, as the probability of smoking tends to zero after the age of 45 years in men; but in woman, the probability tends to zero after the age of 50 years, but there was no significant difference between the hazard in urban and rural areas (adjusted HR = 0.96, 95% CI: 0.8, 1.16). The estimated median ages of smoking initiation for the entire population, men and women smokers were 23.3 (95% CI: 22.2-24.5), 21.9 (95% CI: 21.3-22.5) and 25.5 (95% CI: 22.8-28.7) years, respectively. In addition, estimated mean ages of smoking initiation for the entire population, men and women smokers were 23.3 (95% CI: 22.2-24.6), 21.9 (95% CI: 21.3-22.7) and 25.6 (95% CI: 22.9-28.8), respectively. The geographical distribution of median ages of smoking initiation for the entire population, men, and women are shown in Figure 2. In the entire population, the smoking initiation age ranged from 21.5 to 26.37 years with Alborz, Esfahan, Gilan, Kermanshah, Kurdistan,  Figure 2, the provinces with dark colors had earlier cigarette smoking initiation ages. Regarding the prevalence of smoking, the odds of smoking was 36.5 times greater in men than women (adjusted odds ratio [OR] = 36.5, 95% CI: 29.66, 45.52); also, the odds of smoking decreased by 32% (adjusted OR = 0.68, 95% CI: 0.65, 0.72) and 14% (adjusted OR = 0.86, 95% CI: 0.82, 0.94) with one level of increase in educational level and socioeconomic status, respectively. In contrast, there was not any significant difference between the odds of smoking in urban and rural areas (adjusted OR = 0.91, 95% CI = 0.8, 1.03). Based on our results, the estimated daily smoking prevalence in the total population, men and women was 10.11% (95% CI: 9.3%-11.0%) 22.3% (95% CI: 21.0%-23.6%), and 0.78% (95% CI: 0.62%-0.97%), respectively. The geographical distribution of smoking prevalence for the entire study population, men, and women are shown in Figure 3. The prevalence of cigarette smoking in the entire study population varied from 5.46% to 14.98%, while in men, it ranged from 12.82% to 30.98%, and in women, it varied from 0.4% to 1.2%.
The estimated value for the shape parameter was 3.32 (95% CI: 3.15-3.49). Adjusted effect sizes were obtained by considering gender, area of residence for survival regression model, and gender, area of residence, education, and socioeconomic status in logistic regression model.

Discussion
We reported the prevalence and initiation age of cigarette  smoking along with the effect of demographic variables and their geographical distribution in Iran based on 2016 STEPs study. Our findings could be split into two parts. The first and second part are related to the spatial survival and spatial logistic regression model for modeling initiation age and prevalence of cigarette smoking, respectively. In the first part, our key finding is that the west part of the county had lower initiation age of cigarette smoking. In addition, we observed a significant relationship between gender and starting age of cigarette smoking. This finding supports the previous belief that women start smoking in a later age compared to men. 27  in women, respectively. In the second part, we observed that the prevalence of smoking was higher in men than women, which is consistent with previous studies. 29,30 We found a significant relationship between smoking cigarette and the level of education. which is supported by previous studies showing that a higher level of education was associated with lower probability of smoking. 31,32 Similarly, we observed a significant relationship between the level of socioeconomic status and smoking cigarette. Our findings on the prevalence of smoking are in the range of previous studies. For example, in a meta-analysis by Moosazadeh et al, 33 smoking prevalence varied from 12.3% to 38.5% in men and from 0.6% to 9.8% in women. Finally, the geographical distribution of smoking prevalence showed that the northwest part of the country had the highest rate of smoking in the country; similar results were reported by Nemati et al, 34 who observed a higher rate of smoking prevalence in the northwest part of the country based on the STEPs dataset from 2006 to 2009.
The strength of our approach is considering spatial survival analysis for smoking cigarette. The nature of the STEPs dataset forces us to consider mixture cure fraction in the model, because a fraction of the study population will never experience the event of interest (Figure 1), but it enables us to model the prevalence of cigarette smoking at the same time. The Bayesian approach is a very powerful estimation method for a complex model, especially when it is complicated by censoring data, mixture structure and considering spatially correlated random effects in the model. However, considering survey weights in Bayesian estimation is not popular and the methodology of this topic is yet open. As an example among others, in a study conducted in Canada for evaluating the geographical distribution and temporal trend of smoking cigarette and its associated risk factors in Ontario, the Bayesian estimation method was used without considering survey weights. 35 In 2017, Gunawan 23 introduced two methods for considering survey weights in Bayesian estimation. The first method which was called BPPE, is useful for large sample size data based on their simulation and analytic proof. The second approach is based on bootstrap resampling and data augmentation method which is useful for small and moderate sample size data because of the burden of computational time. In this study, we considered the first approach to incorporate survey weights into our analysis.
The result of our study will assist policymakers to identify high risk people and provinces in the country with regard to the initiation age and prevalence of smoking, in order to conduct preventive policies such as increasing social awareness about harmful effects of smoking cigarette, prohibition of cigarette smoking in public places and selling unpacked cigarettes, at least in high risk provinces. Given that our results were obtained from a populationbased study (STEPs 2016), they are generalizable to the total population.
For future research, we suggest modeling the number of cigarette smoking, its associated risk factors and its possible geographical distribution using Poisson or negative binomial regression. Moreover, it would be interesting to use zero-inflated Poisson or negative binomial for modeling the number of cigarette smoking and the prevalence of smoking in the entire population and compare the results with our study. Future studies could be done on temporal trend of those important parameters based on previous rounds of STEPs survey study in the country to capture the trend of change over time.
Our study has several limitations, including: no information from the Qom province since the STEPs 2016 survey was not conducted in this province, lack of data on education and socioeconomic status in the starting age of smoking, lack of data about important indicators of smoking such as family history of smoking, opioid use and alcohol use. Furthermore, in this study, we did not study the number of daily cigarettes used by smokers. This parameter, along with initiation age and prevalence, is another important topic in cigarette smoking studies.
In conclusion, monitoring the initiation age and prevalence of smoking are important in controlling tobacco use in each community. Cigarette smoking is a complicated behavior with many associated factors. In this study, we observed that there are geographical differences in the initiation age and prevalence of smoking in Iran. Therefore, such differences should be considered in any preventive policy adopted.
© 2020 The Author(s). This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons. org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Conflict of Interest Disclosures
Authors declare no conflict of interest.

Ethical Statement
The data set used in this study is based on national survey study and it used with the permission of data provider. The information of patients was de-identified prior to analysis.