Title: | Estimating and Displaying Population Attributable Fractions |
---|---|
Description: | Estimation and display of various types of population attributable fraction and impact fractions. As well as the usual calculations of attributable fractions and impact fractions, functions are provided for attributable fraction nomograms and fan plots, continuous exposures, for pathway specific population attributable fractions, and for joint, average and sequential population attributable fractions. |
Authors: | John Ferguson [aut, cre] |
Maintainer: | John Ferguson <[email protected]> |
License: | MIT + file LICENSE |
Version: | 2.0.0 |
Built: | 2025-02-13 03:11:52 UTC |
Source: | https://github.com/johnfergusonnuig/graphpaf |
Main effects models are fit by default. For continuous variables, lm is used, for binary (numeric 0/1 variables), glm is used and for factor valued variables polr is used. For factors, ensure that the factor levels are ordered by increasing levels of risk. If interactions are required for certain models, it is advisable to populate the elements of model_list separately.
automatic_fit( data, parent_list, node_vec, prev = NULL, common = "", spline_nodes = c(), df_spline_nodes = 3 )
automatic_fit( data, parent_list, node_vec, prev = NULL, common = "", spline_nodes = c(), df_spline_nodes = 3 )
data |
Data frame. A data frame containing variables used for fitting the models. Must contain all variables used in fitting |
parent_list |
A list. The ith element is the vector of variable names that are direct causes of ith variable in node_vec |
node_vec |
A vector corresponding to the nodes in the Bayesian network. This must be specified from root to leaves - that is ancestors in the causal graph for a particular node are positioned before their descendants. If this condition is false the function will return an error. |
prev |
Prevalence of disease. Set to NULL for cohort or cross sectional studies |
common |
character text for part of the model formula that doesn't involve any variable in node_vec. Useful for specifying confounders involved in all models automatically |
spline_nodes |
Vector of continuous variable names that are fit as splines (when involved as parents). Natural splines are used. |
df_spline_nodes |
How many degrees of freedom for each spline (Default 3). At the moment, this can not be specified separately for differing variables. |
A list of fitted models corresponding to node_vec and parent_vec.
# More complicated example (slower to run) library(splines) parent_exercise <- c("education") parent_diet <- c("education") parent_smoking <- c("education") parent_alcohol <- c("education") parent_stress <- c("education") parent_high_blood_pressure <- c("education","exercise","diet", "smoking","alcohol","stress") parent_lipids <- c("education","exercise","diet","smoking", "alcohol","stress") parent_waist_hip_ratio <- c("education","exercise","diet","smoking", "alcohol","stress") parent_early_stage_heart_disease <- c("education","exercise","diet", "smoking","alcohol","stress","lipids","waist_hip_ratio","high_blood_pressure") parent_diabetes <- c("education","exercise","diet","smoking","alcohol", "stress","lipids","waist_hip_ratio","high_blood_pressure") parent_case <- c("education","exercise","diet","smoking","alcohol","stress", "lipids","waist_hip_ratio","high_blood_pressure","early_stage_heart_disease","diabetes") parent_list <- list(parent_exercise,parent_diet,parent_smoking, parent_alcohol,parent_stress,parent_high_blood_pressure, parent_lipids,parent_waist_hip_ratio,parent_early_stage_heart_disease, parent_diabetes,parent_case) node_vec=c("exercise","diet","smoking","alcohol","stress","high_blood_pressure", "lipids","waist_hip_ratio","early_stage_heart_disease", "diabetes","case") model_list=automatic_fit(data=stroke_reduced, parent_list=parent_list, node_vec=node_vec, prev=.0035,common="region*ns(age,df=5)+ sex*ns(age,df=5)", spline_nodes = c("waist_hip_ratio","lipids","diet"))
# More complicated example (slower to run) library(splines) parent_exercise <- c("education") parent_diet <- c("education") parent_smoking <- c("education") parent_alcohol <- c("education") parent_stress <- c("education") parent_high_blood_pressure <- c("education","exercise","diet", "smoking","alcohol","stress") parent_lipids <- c("education","exercise","diet","smoking", "alcohol","stress") parent_waist_hip_ratio <- c("education","exercise","diet","smoking", "alcohol","stress") parent_early_stage_heart_disease <- c("education","exercise","diet", "smoking","alcohol","stress","lipids","waist_hip_ratio","high_blood_pressure") parent_diabetes <- c("education","exercise","diet","smoking","alcohol", "stress","lipids","waist_hip_ratio","high_blood_pressure") parent_case <- c("education","exercise","diet","smoking","alcohol","stress", "lipids","waist_hip_ratio","high_blood_pressure","early_stage_heart_disease","diabetes") parent_list <- list(parent_exercise,parent_diet,parent_smoking, parent_alcohol,parent_stress,parent_high_blood_pressure, parent_lipids,parent_waist_hip_ratio,parent_early_stage_heart_disease, parent_diabetes,parent_case) node_vec=c("exercise","diet","smoking","alcohol","stress","high_blood_pressure", "lipids","waist_hip_ratio","early_stage_heart_disease", "diabetes","case") model_list=automatic_fit(data=stroke_reduced, parent_list=parent_list, node_vec=node_vec, prev=.0035,common="region*ns(age,df=5)+ sex*ns(age,df=5)", spline_nodes = c("waist_hip_ratio","lipids","diet"))
Calculation of average and sequential paf taking into account risk factor sequencing
average_paf( data, model_list, parent_list, node_vec, prev = NULL, exact = TRUE, nperm = NULL, correct_order = 2, riskfactor_vec = NULL, ci = FALSE, boot_rep = 50, ci_type = c("norm"), ci_level = 0.95, ci_level_ME = 0.95, weight_vec = NULL, verbose = TRUE )
average_paf( data, model_list, parent_list, node_vec, prev = NULL, exact = TRUE, nperm = NULL, correct_order = 2, riskfactor_vec = NULL, ci = FALSE, boot_rep = 50, ci_type = c("norm"), ci_level = 0.95, ci_level_ME = 0.95, weight_vec = NULL, verbose = TRUE )
data |
Data frame. A dataframe containing variables used for fitting the models. Must contain all variables used in fitting |
model_list |
List. A list of fitted models corresponding for the outcome variables in node_vec, with parents as described in parent_vec. This list must be in the same order as node_vec and parent_list. Models can be linear (lm), logistic (glm) or ordinal logistic (polr). Non-linear effects of variables (if necessary) should be specified via ns(x, df=y), where ns is the natural spline function from the splines library. |
parent_list |
A list. The ith element is the vector of variable names that are direct causes of ith variable in node_vec (Note that the variable names should be columns in data) |
node_vec |
A character vector corresponding to the nodes in the Bayesian network (The variable names should be column names in data). This must be specified from root to leaves - that is ancestors in the causal graph for a particular node are positioned before their descendants. If this condition is false the function will return an error. |
prev |
numeric. Prevalence of disease. Only relevant to set for case control datasets. |
exact |
logical. Default TRUE. If TRUE, an efficient calculation is used to calculate average PAF, which enables the average PAF from N! permutations, over all N risk factors to be calculated with only 2^N-1 operations. If FALSE, permutations are sampled |
nperm |
Default NULL Number of random permutations used to calculate average and sequential PAF. If correct_order is set to an integer value, nperm is reset to an integer multiple of factorial(N)/factorial(N-correct_order) depending on the size of nperm. If nperm is NULL or less than factorial(N)/factorial(N-correct_order), factorial(N)/factorial(N-correct_order) permutations will be sampled. If nperm is larger than factorial(N)/factorial(N-correct_order), nperm will be reset to the smallest integer multiple of factorial(N)/factorial(N-correct_order) less than the input value of nperm |
correct_order |
Default 3. This enforces stratified sampling of permutations where the first correct_order positions of the sampled permutations are evenly distributed over the integers 1 ... N, N being the number of risk factors of interest, over the sampled permutations. The other positions are randomly sampled. This automatically sets the number of simulations when nperm=NULL. For interest, if N=10 and correct_order=3, nperm is set to factorial(10)/factorial(10-3) = 720. This special resampling reduces Monte Carlo variation in estimated average and sequential PAFs. |
riskfactor_vec |
A subset of risk factors for which we want to calculate average, sequential and joint PAF |
ci |
Logical. If TRUE, a bootstrap confidence interval is computed along with a point estimate (default FALSE). If ci=FALSE, only a point estimate is produced. A simulation procedure (sampling permutations and also simulating the effects of eliminating risk factors over the descendant nodes in a Bayesian network) is required to produce the point estimates. The point estimate will change on repeated runs of the function. The margin of error of the point estimate is given when ci=FALSE |
boot_rep |
Integer. Number of bootstrap replications (Only necessary to specify if ci=TRUE). Note that at least 50 replicates are recommended to achieve stable estimates of standard error. In the examples below, values of boot_rep less than 50 are sometimes used to limit run time. |
ci_type |
Character. Default norm. A vector specifying the types of confidence interval desired. "norm", "basic", "perc" and "bca" are the available methods |
ci_level |
Numeric. Default 0.95. A number between 0 and 1 specifying the level of the confidence interval (when ci=TRUE) |
ci_level_ME |
Numeric. Default 0.95. A number between 0 and 1 specifying the level of the margin of error for the point estimate (only relevant when ci=FALSE and exact=FALSE) |
weight_vec |
An optional vector of inverse sampling weights (note with survey data, the variance may not be calculated correctly if sampling isn't independent). Note that this vector will be ignored if prev is specified, and the weights will be calibrated so that the weighted sample prevalence of disease equals prev. This argument can be ignored if data has a column weights with correctly calibrated weights |
verbose |
A logical indicator for whether extended output is produced when ci=TRUE, default TRUE |
A SAF_summary object with average joint and sequential PAF for all risk factors in node_vec (or alternatively a subset of those risk factors if specified in riskfactor_vec).
Ferguson, J., O’Connell, M. and O’Donnell, M., 2020. Revisiting sequential attributable fractions. Archives of Public Health, 78(1), pp.1-9.
Ferguson, J., Alvarez-Iglesias, A., Newell, J., Hinde, J. and O’Donnell, M., 2018. Estimating average attributable fractions with confidence intervals for cohort and case–control studies. Statistical methods in medical research, 27(4), pp.1141-1152
library(splines) library(survival) library(parallel) options(boot.parallel="snow") options(boot.ncpus=2) # The above could be set to the number of available cores on the machine # Simulated data on occupational and environmental exposure to chronic cough from Eide, 1995 # First specify the causal graph, in terms of the parents of each node. Then put into a list parent_urban.rural <- c() parent_smoking.category <- c("urban.rural") parent_occupational.exposure <- c("urban.rural") parent_y <- c("urban.rural","smoking.category","occupational.exposure") parent_list <- list(parent_urban.rural, parent_smoking.category, parent_occupational.exposure, parent_y) # also specify nodes of graph, in order from root to leaves node_vec <- c("urban.rural","smoking.category","occupational.exposure", "y") # specify a model list according to parent_list # here we use the auxillary function 'automatic fit' model_list=automatic_fit(data=Hordaland_data, parent_list=parent_list, node_vec=node_vec, prev=.09) # By default the function works by stratified simulation of permutations and # subsequent simulation of the incremental interventions on the distribution of risk # factors. The permutations are stratified so each factor appears equally often in # the first correct_order positions. correct_order has a default of 2. # model_list$data objects have fitting weights included # Including weight column in data # necessary if Bootstrapping CIs out <- average_paf(data=model_list[[length(model_list)]]$data, model_list=model_list, parent_list=parent_list, node_vec=node_vec, prev=.09, nperm=10,riskfactor_vec = c("urban.rural", "occupational.exposure"),ci=FALSE) print(out) # More complicated example (slower to run) parent_exercise <- c("education") parent_diet <- c("education") parent_smoking <- c("education") parent_alcohol <- c("education") parent_stress <- c("education") parent_high_blood_pressure <- c("education","exercise","diet","smoking", "alcohol","stress") parent_lipids <- c("education","exercise","diet","smoking","alcohol", "stress") parent_waist_hip_ratio <- c("education","exercise","diet","smoking", "alcohol","stress") parent_early_stage_heart_disease <- c("education","exercise","diet", "smoking","alcohol","stress","lipids","waist_hip_ratio","high_blood_pressure") parent_diabetes <- c("education","exercise","diet","smoking","alcohol", "stress","lipids","waist_hip_ratio","high_blood_pressure") parent_case <- c("education","exercise","diet","smoking","alcohol","stress", "lipids","waist_hip_ratio","high_blood_pressure", "early_stage_heart_disease","diabetes") parent_list <- list(parent_exercise,parent_diet,parent_smoking, parent_alcohol,parent_stress,parent_high_blood_pressure, parent_lipids,parent_waist_hip_ratio,parent_early_stage_heart_disease, parent_diabetes,parent_case) node_vec=c("exercise","diet","smoking","alcohol","stress", "high_blood_pressure","lipids","waist_hip_ratio","early_stage_heart_disease", "diabetes","case") model_list=automatic_fit(data=stroke_reduced, parent_list=parent_list, node_vec=node_vec, prev=.0035,common="region*ns(age,df=5)+sex*ns(age,df=5)", spline_nodes = c("waist_hip_ratio","lipids","diet")) out <- average_paf(data=stroke_reduced, model_list=model_list, parent_list=parent_list, node_vec=node_vec, prev=.0035, riskfactor_vec = c("high_blood_pressure","smoking","stress","exercise","alcohol", "diabetes","early_stage_heart_disease"),ci=TRUE,boot_rep=10) print(out) plot(out,max_PAF=0.5,min_PAF=-0.1,number_rows=3) # plot sequential and average PAFs by risk factor # similar calculation, but now sampling permutations (stratified, so # that each risk factor will appear equally often in the first correct_order positions) out <- average_paf(data=stroke_reduced, model_list=model_list, parent_list=parent_list, node_vec=node_vec, prev=.0035, exact=FALSE, correct_order=2, riskfactor_vec = c("high_blood_pressure","smoking","stress", "exercise","alcohol","diabetes","early_stage_heart_disease"),ci=TRUE, boot_rep=10) print(out) plot(out,max_PAF=0.5,min_PAF=-0.1,number_rows=3)
library(splines) library(survival) library(parallel) options(boot.parallel="snow") options(boot.ncpus=2) # The above could be set to the number of available cores on the machine # Simulated data on occupational and environmental exposure to chronic cough from Eide, 1995 # First specify the causal graph, in terms of the parents of each node. Then put into a list parent_urban.rural <- c() parent_smoking.category <- c("urban.rural") parent_occupational.exposure <- c("urban.rural") parent_y <- c("urban.rural","smoking.category","occupational.exposure") parent_list <- list(parent_urban.rural, parent_smoking.category, parent_occupational.exposure, parent_y) # also specify nodes of graph, in order from root to leaves node_vec <- c("urban.rural","smoking.category","occupational.exposure", "y") # specify a model list according to parent_list # here we use the auxillary function 'automatic fit' model_list=automatic_fit(data=Hordaland_data, parent_list=parent_list, node_vec=node_vec, prev=.09) # By default the function works by stratified simulation of permutations and # subsequent simulation of the incremental interventions on the distribution of risk # factors. The permutations are stratified so each factor appears equally often in # the first correct_order positions. correct_order has a default of 2. # model_list$data objects have fitting weights included # Including weight column in data # necessary if Bootstrapping CIs out <- average_paf(data=model_list[[length(model_list)]]$data, model_list=model_list, parent_list=parent_list, node_vec=node_vec, prev=.09, nperm=10,riskfactor_vec = c("urban.rural", "occupational.exposure"),ci=FALSE) print(out) # More complicated example (slower to run) parent_exercise <- c("education") parent_diet <- c("education") parent_smoking <- c("education") parent_alcohol <- c("education") parent_stress <- c("education") parent_high_blood_pressure <- c("education","exercise","diet","smoking", "alcohol","stress") parent_lipids <- c("education","exercise","diet","smoking","alcohol", "stress") parent_waist_hip_ratio <- c("education","exercise","diet","smoking", "alcohol","stress") parent_early_stage_heart_disease <- c("education","exercise","diet", "smoking","alcohol","stress","lipids","waist_hip_ratio","high_blood_pressure") parent_diabetes <- c("education","exercise","diet","smoking","alcohol", "stress","lipids","waist_hip_ratio","high_blood_pressure") parent_case <- c("education","exercise","diet","smoking","alcohol","stress", "lipids","waist_hip_ratio","high_blood_pressure", "early_stage_heart_disease","diabetes") parent_list <- list(parent_exercise,parent_diet,parent_smoking, parent_alcohol,parent_stress,parent_high_blood_pressure, parent_lipids,parent_waist_hip_ratio,parent_early_stage_heart_disease, parent_diabetes,parent_case) node_vec=c("exercise","diet","smoking","alcohol","stress", "high_blood_pressure","lipids","waist_hip_ratio","early_stage_heart_disease", "diabetes","case") model_list=automatic_fit(data=stroke_reduced, parent_list=parent_list, node_vec=node_vec, prev=.0035,common="region*ns(age,df=5)+sex*ns(age,df=5)", spline_nodes = c("waist_hip_ratio","lipids","diet")) out <- average_paf(data=stroke_reduced, model_list=model_list, parent_list=parent_list, node_vec=node_vec, prev=.0035, riskfactor_vec = c("high_blood_pressure","smoking","stress","exercise","alcohol", "diabetes","early_stage_heart_disease"),ci=TRUE,boot_rep=10) print(out) plot(out,max_PAF=0.5,min_PAF=-0.1,number_rows=3) # plot sequential and average PAFs by risk factor # similar calculation, but now sampling permutations (stratified, so # that each risk factor will appear equally often in the first correct_order positions) out <- average_paf(data=stroke_reduced, model_list=model_list, parent_list=parent_list, node_vec=node_vec, prev=.0035, exact=FALSE, correct_order=2, riskfactor_vec = c("high_blood_pressure","smoking","stress", "exercise","alcohol","diabetes","early_stage_heart_disease"),ci=TRUE, boot_rep=10) print(out) plot(out,max_PAF=0.5,min_PAF=-0.1,number_rows=3)
Strip out unneeded variables from original data (based on fitted model, or alternatively based on specifying a list of variables), and remove rows with NA values. The function works for logistic, survival and conditional logistic regressions. The function also creates a column of weights, which will be just a vector of 1s if prevalence is unspecified.
data_clean(data, model = NULL, vars = NULL, response = "case", prev = NULL)
data_clean(data, model = NULL, vars = NULL, response = "case", prev = NULL)
data |
A data frame that was used to fit the model |
model |
A glm (with logistic or log link, with binomial family), clogit or coxph model. |
vars |
Default NULL. Variables required in output data set. If set to NULL and model is specified, the variables kept are the response and covariates assumed in model. If set to NULL and model is unspecified, the original dataset is returned. |
response |
Default "case". response variable in dataset. Used when recalculating weights (if the argument prev is set) If set to NULL, the response is inferred from the model |
prev |
Default NULL. Prevalence of disease (or yearly incidence of disease in healthy controls). Only relevant to set in case control studies and if path specific PAF or sequential joint PAF calculations are required. The purpose of this is to create a vector of weights in output dataset, that reweights the cases and controls to reflect the general population. This vector of weights can be used to fit weighted regression models. |
A cleaned data frame
# example of using dataclean to strip out NAs, redundant columns and recalculate weights library(survival) library(splines) stroke_reduced_2 <- stroke_reduced stroke_reduced_2$case[sample(1:length(stroke_reduced_2$case),50)] <- NA stroke_reduced_2$random <- rnorm(length(stroke_reduced_2$case)) stroke_reduced_3 <- data_clean(stroke_reduced_2,vars=colnames(stroke_reduced),prev=0.01) dim(stroke_reduced_2) dim(stroke_reduced_3) mymod <- clogit(case ~ high_blood_pressure + strata(strata),data=stroke_reduced_2) stroke_reduced_3 <- data_clean(stroke_reduced_2,model=mymod,prev=0.01) dim(stroke_reduced_2) dim(stroke_reduced_3)
# example of using dataclean to strip out NAs, redundant columns and recalculate weights library(survival) library(splines) stroke_reduced_2 <- stroke_reduced stroke_reduced_2$case[sample(1:length(stroke_reduced_2$case),50)] <- NA stroke_reduced_2$random <- rnorm(length(stroke_reduced_2$case)) stroke_reduced_3 <- data_clean(stroke_reduced_2,vars=colnames(stroke_reduced),prev=0.01) dim(stroke_reduced_2) dim(stroke_reduced_3) mymod <- clogit(case ~ high_blood_pressure + strata(strata),data=stroke_reduced_2) stroke_reduced_3 <- data_clean(stroke_reduced_2,model=mymod,prev=0.01) dim(stroke_reduced_2) dim(stroke_reduced_3)
Internal: Simulate a column from the post intervention distribution corresponding to eliminating a risk factor
do_sim(colnum, current_mat, model, SN = TRUE)
do_sim(colnum, current_mat, model, SN = TRUE)
colnum |
The column indicator for the variable being simulated |
current_mat |
The current value of the data frame |
model |
A fitted model for simulating values of the variable, given the parent values |
SN |
Logical. If TRUE (default) simulations are achieved via adding the original model residuals, to the new fitted values based on the updated values of parents in current_mat. |
An updated data frame (a new version of current_mat) with a new column simulated
Estimation and display of various types of population attributable fraction and impact fractions. As well as the usual calculations of attributable fractions and impact fractions, functions are provided for attributable fraction nomograms and fan plots, continuous exposures, for pathway specific population attributable fractions, and for joint, average and sequential population attributable fractions.
Maintainer: John Ferguson [email protected]
Ferguson, J., O’Leary, N., Maturo, F., Yusuf, S. and O’Donnell, M., 2019. Graphical comparisons of relative disease burden across multiple risk factors. BMC medical research methodology, 19(1), pp.1-9
Ferguson, J., Maturo, F., Yusuf, S. and O’Donnell, M., 2020. Population attributable fractions for continuously distributed exposures. Epidemiologic Methods, 9(1).
Pathway specific Population attributable fractions. O’Connell, M.M. and Ferguson, J.P., 2022. IEA. International Journal of Epidemiology, 1, p.13.
Ferguson, J., O’Connell, M. and O’Donnell, M., 2020. Revisiting sequential attributable fractions. Archives of Public Health, 78(1), pp.1-9.
Ferguson, J., Alvarez-Iglesias, A., Newell, J., Hinde, J. and O’Donnell, M., 2018. Estimating average attributable fractions with confidence intervals for cohort and case–control studies. Statistical methods in medical research, 27(4), pp.1141-1152
Ferguson, J., Alvarez-Iglesias, A., Mulligan, M., Judge, C. and O’Donnell, M., 2024. Bias assessment and correction for Levin's population attributable fraction under confounding. European Journal of Epidemiology, In press
Useful links:
Report bugs at https://github.com/johnfergusonNUIG/graphPAF/issues
Simulated case control dataset for 5000 cases (individuals with chronic cough) and 5000 controls
Hordaland_data
Hordaland_data
A data frame with 10000 rows and 4 variables:
Chronic Cough, 1: Yes, 0: No
1: resident in urban setting, 0: resident in rural setting
Smoking level: 1 No smoker, 2: ex smoker, 3: 1-9 cigarettes per day, 4: 10-19 cigarettes per day, 4:>= 20 cigarettes per day
Exposed to dust/gas at work. 1: Yes, 0: no
Data simulated based on "Sequential and average attributable fractions as aids in the selection of preventive strategies." Journal of clinical epidemiology 48, no. 5 (1995): 645-655.
Internal: Calculation of an impact fraction using the Bruzzi approach
if_bruzzi(data, ind, model, model_type, new_data, response, weight_vec)
if_bruzzi(data, ind, model, model_type, new_data, response, weight_vec)
data |
A dataframe containing variables used for fitting the model |
ind |
An indicator of which rows will be used from the dataset |
model |
Either a clogit or glm fitted model object. Non-linear effects should be specified via ns(x, df=y), where ns is the natural spline function from the splines library. |
model_type |
Either a "clogit", "glm" or "coxph" model object |
new_data |
A dataframe (of the same variables and size as data) representing an alternative distribution of risk factors |
response |
A string representing the name of the outcome variable in data |
weight_vec |
An optional vector of inverse sampling weights |
A numeric estimated impact fraction.
Bruzzi, P., Green, S.B., Byar, D.P., Brinton, L.A. and Schairer, C., 1985. Estimating the population attributable risk for multiple risk factors using case-control data. American journal of epidemiology, 122(5), pp.904-914
Internal: Calculation of an impact fraction using the direct approach
if_direct( data, ind, model, model_type, new_data, prev, t_vector, response, weight_vec )
if_direct( data, ind, model, model_type, new_data, prev, t_vector, response, weight_vec )
data |
A dataframe containing variables used for fitting the model |
ind |
An indicator of which rows will be used from the dataset |
model |
Either a clogit, glm or coxph fitted model object. Non-linear effects should be specified via ns(x, df=y), where ns is the natural spline function from the splines library. |
model_type |
Either a "clogit", "glm" or "coxph" model object |
new_data |
A dataframe (of the same variables and size as data) representing an alternative distribution of risk factors |
prev |
Population prevalence of disease (default NULL) |
t_vector |
A vector of times at which PAF estimates are desired (for a coxph model) |
response |
A string representing the name of the outcome variable in data |
weight_vec |
An optional vector of inverse sampling weights |
A numeric estimated impact fraction.
General calculations of impact fractions
impact_fraction( model, data, new_data, calculation_method = "B", prev = NULL, ci = FALSE, boot_rep = 50, t_vector = NULL, ci_level = 0.95, ci_type = c("norm"), weight_vec = NULL, verbose = TRUE )
impact_fraction( model, data, new_data, calculation_method = "B", prev = NULL, ci = FALSE, boot_rep = 50, t_vector = NULL, ci_level = 0.95, ci_type = c("norm"), weight_vec = NULL, verbose = TRUE )
model |
Either a clogit, glm or coxph fitted model object. Non-linear effects should be specified via ns(x, df=y), where ns is the natural spline function from the splines library. |
data |
A dataframe containing variables used for fitting the model |
new_data |
A dataframe (of the same variables and size as data) representing an alternative distribution of risk factors |
calculation_method |
A character either 'B' (Bruzzi) or 'D' (Direct method). For case control data, the method described in Bruzzi 1985 is recommended. Bruzzi's method estimates PAF from relative risks and prevalence of exposure to the risk factor. The Direct method estimates PAF by summing estimated probabilities of disease in the absence of exposure on the individual level |
prev |
estimated prevalence of disease. This only needs to be specified if the data source is from a case control study, and the direct method is used |
ci |
Logical. If TRUE, a bootstrap confidence interval is computed along with point estimate (default FALSE) |
boot_rep |
Integer. Number of bootstrap replications (Only necessary to specify if ci=TRUE) |
t_vector |
Numeric. A vector of times at which to calculate PAF (only specified if model is coxph) |
ci_level |
Numeric. Default 0.95. A number between 0 and 1 specifying the confidence level |
ci_type |
Character. Default norm. A vector specifying the types of confidence interval desired. "norm", "basic", "perc" and "bca" are the available methods |
weight_vec |
An optional vector of inverse sampling weights for survey data (note that variance will not be calculated correctly if sampling isn't independent). Note that this vector will be ignored if prev is specified, and the weights will be calibrated so that the weighted sample prevalence of disease equals prev. |
verbose |
A logical indicator for whether extended output is produced when ci=TRUE, default TRUE |
A numeric estimated impact fraction if ci=FALSE, or for survival data a vector of estimated impact corresponding to event times in the data. If ci=TRUE, estimated impact fractions and other information are bundled into an object of class IF_summary.
Bruzzi, P., Green, S.B., Byar, D.P., Brinton, L.A. and Schairer, C., 1985. Estimating the population attributable risk for multiple risk factors using case-control data. American journal of epidemiology, 122(5), pp.904-914
library(splines) library(survival) new_data <- stroke_reduced N <- nrow(new_data) inactive_patients <- (1:N)[stroke_reduced$exercise==1] N_inactive <- sum(stroke_reduced$exercise) newly_active_patients <- inactive_patients[sample(1:N_inactive,0.2*N_inactive)] new_data$exercise[newly_active_patients] <- 0 model_exercise <- clogit(formula = case ~ age + education +exercise + ns(diet, df = 3) + smoking + alcohol + stress + ns(lipids,df = 3) + ns(waist_hip_ratio, df = 3) + high_blood_pressure +strata(strata), data=stroke_reduced) impact_fraction(model=model_exercise,stroke_reduced,new_data, calculation_method = "B")
library(splines) library(survival) new_data <- stroke_reduced N <- nrow(new_data) inactive_patients <- (1:N)[stroke_reduced$exercise==1] N_inactive <- sum(stroke_reduced$exercise) newly_active_patients <- inactive_patients[sample(1:N_inactive,0.2*N_inactive)] new_data$exercise[newly_active_patients] <- 0 model_exercise <- clogit(formula = case ~ age + education +exercise + ns(diet, df = 3) + smoking + alcohol + stress + ns(lipids,df = 3) + ns(waist_hip_ratio, df = 3) + high_blood_pressure +strata(strata), data=stroke_reduced) impact_fraction(model=model_exercise,stroke_reduced,new_data, calculation_method = "B")
Calculation of joint attributable fractions over several risk factors taking into account risk factor sequencing
joint_paf( data, model_list, parent_list, node_vec, prev = NULL, riskfactor_vec = NULL, ci = FALSE, boot_rep = 50, ci_type = c("norm"), ci_level = 0.95, nsim = 1, weight_vec = NULL, verbose = TRUE )
joint_paf( data, model_list, parent_list, node_vec, prev = NULL, riskfactor_vec = NULL, ci = FALSE, boot_rep = 50, ci_type = c("norm"), ci_level = 0.95, nsim = 1, weight_vec = NULL, verbose = TRUE )
data |
Data frame. A dataframe containing variables used for fitting the models. Must contain all variables used in fitting |
model_list |
List. A list of fitted models corresponding for the outcome variables in node_vec, with parents as described in parent_vec. This list must be in the same order as node_vec and parent_list. Non-linear effects should be specified via ns(x, df=y), where ns is the natural spline function from the splines library. Linear (lm), logistic (glm) and ordinal logistic (polr) models are permitted |
parent_list |
A list. The ith element is the vector of variable names that are direct causes of ith variable in node_vec |
node_vec |
A vector corresponding to the nodes in the Bayesian network. This must be specified from root to leaves - that is ancestors in the causal graph for a particular node are positioned before their descendants. If this condition is false the function will return an error. |
prev |
prevalence of the disease (default is NULL) |
riskfactor_vec |
A subset of risk factors for which we want to calculate joint PAF |
ci |
Logical. If TRUE, a bootstrap confidence interval is computed along with a point estimate (default FALSE). If ci=FALSE, only a point estimate is produced. A simulation procedure (sampling permutations and also simulating the effects of eliminating risk factors over the descendant nodes in a Bayesian network) is required to produce the point estimates. The point estimate will change on repeated runs of the function. The margin of error of the point estimate is given when ci=FALSE |
boot_rep |
Integer. Number of bootstrap replications (Only necessary to specify if ci=TRUE). Note that at least 50 replicates are recommended to achieve stable estimates of standard error. In the examples below, values of boot_rep less than 50 are sometimes used to limit run time. |
ci_type |
Character. Default norm. A vector specifying the types of confidence interval desired. "norm", "basic", "perc" and "bca" are the available method |
ci_level |
Numeric. Confidence level. Default 0.95 |
nsim |
Numeric. Number of independent simulations of the dataset. Default of 1 |
weight_vec |
An optional vector of inverse sampling weights (note with survey data, the variance may not be calculated correctly if sampling isn't independent). Note that this vector will be ignored if prev is specified, and the weights will be calibrated so that the weighted sample prevalence of disease equals prev. This argument can be ignored if data has a column weights with correctly calibrated weights |
verbose |
A logical indicator for whether extended output is produced when ci=TRUE, default TRUE |
A numeric estimate of the joint PAF for all risk factors (if ci=FALSE), or a jointpaf object giving the same information with confidence intervals (if ci=TRUE)
Ferguson, J., O’Connell, M. and O’Donnell, M., 2020. Revisiting sequential attributable fractions. Archives of Public Health, 78(1), pp.1-9.
library(splines) library(survival) library(parallel) options(boot.parallel="snow") options(boot.ncpus=2) # The above could be set to the number of available cores on the machine # Simulated data on occupational and environmental exposure to # chronic cough from Eide, 1995 # First specify the causal graph, in terms of the parents of each node. # Then put into a list. parent_urban.rural <- c() parent_smoking.category <- c("urban.rural") parent_occupational.exposure <- c("urban.rural") parent_y <- c("urban.rural","smoking.category","occupational.exposure") parent_list <- list(parent_urban.rural, parent_smoking.category, parent_occupational.exposure, parent_y) # also specify nodes of graph, in order from root to leaves node_vec <- c("urban.rural","smoking.category","occupational.exposure", "y") # specify a model list according to parent_list # here we use the auxillary function 'automatic fit' model_list=automatic_fit(data=Hordaland_data, parent_list=parent_list, node_vec=node_vec, prev=.09) # model_list$data objects have fitting weights included # Including weight column in data # necessary if Bootstrapping CIs joint_paf(data=model_list[[length(model_list)]]$data, model_list=model_list, parent_list=parent_list, node_vec=node_vec, prev=.09, riskfactor_vec = c("urban.rural", "occupational.exposure"),ci=FALSE) # More complicated example (slower to run) parent_exercise <- c("education") parent_diet <- c("education") parent_smoking <- c("education") parent_alcohol <- c("education") parent_stress <- c("education") parent_high_blood_pressure <- c("education","exercise","diet","smoking","alcohol","stress") parent_lipids <- c("education","exercise","diet","smoking","alcohol","stress") parent_waist_hip_ratio <- c("education","exercise","diet","smoking", "alcohol","stress") parent_early_stage_heart_disease <- c("education","exercise","diet", "smoking","alcohol","stress","lipids","waist_hip_ratio","high_blood_pressure") parent_diabetes <- c("education","exercise","diet","smoking","alcohol", "stress","lipids","waist_hip_ratio","high_blood_pressure") parent_case <- c("education","exercise","diet","smoking","alcohol", "stress","lipids","waist_hip_ratio","high_blood_pressure","early_stage_heart_disease","diabetes") parent_list <- list(parent_exercise,parent_diet,parent_smoking,parent_alcohol, parent_stress,parent_high_blood_pressure,parent_lipids,parent_waist_hip_ratio, parent_early_stage_heart_disease,parent_diabetes,parent_case) node_vec=c("exercise","diet","smoking","alcohol","stress","high_blood_pressure", "lipids","waist_hip_ratio","early_stage_heart_disease","diabetes","case") model_list=automatic_fit(data=stroke_reduced, parent_list=parent_list, node_vec=node_vec, prev=.0035,common="region*ns(age,df=5)+sex*ns(age,df=5)", spline_nodes = c("waist_hip_ratio","lipids","diet")) jointpaf <- joint_paf(data=stroke_reduced, model_list=model_list, parent_list=parent_list, node_vec=node_vec, prev=.0035, riskfactor_vec = c("high_blood_pressure","smoking","stress","exercise","alcohol", "diabetes","early_stage_heart_disease"),ci=TRUE,boot_rep=10)
library(splines) library(survival) library(parallel) options(boot.parallel="snow") options(boot.ncpus=2) # The above could be set to the number of available cores on the machine # Simulated data on occupational and environmental exposure to # chronic cough from Eide, 1995 # First specify the causal graph, in terms of the parents of each node. # Then put into a list. parent_urban.rural <- c() parent_smoking.category <- c("urban.rural") parent_occupational.exposure <- c("urban.rural") parent_y <- c("urban.rural","smoking.category","occupational.exposure") parent_list <- list(parent_urban.rural, parent_smoking.category, parent_occupational.exposure, parent_y) # also specify nodes of graph, in order from root to leaves node_vec <- c("urban.rural","smoking.category","occupational.exposure", "y") # specify a model list according to parent_list # here we use the auxillary function 'automatic fit' model_list=automatic_fit(data=Hordaland_data, parent_list=parent_list, node_vec=node_vec, prev=.09) # model_list$data objects have fitting weights included # Including weight column in data # necessary if Bootstrapping CIs joint_paf(data=model_list[[length(model_list)]]$data, model_list=model_list, parent_list=parent_list, node_vec=node_vec, prev=.09, riskfactor_vec = c("urban.rural", "occupational.exposure"),ci=FALSE) # More complicated example (slower to run) parent_exercise <- c("education") parent_diet <- c("education") parent_smoking <- c("education") parent_alcohol <- c("education") parent_stress <- c("education") parent_high_blood_pressure <- c("education","exercise","diet","smoking","alcohol","stress") parent_lipids <- c("education","exercise","diet","smoking","alcohol","stress") parent_waist_hip_ratio <- c("education","exercise","diet","smoking", "alcohol","stress") parent_early_stage_heart_disease <- c("education","exercise","diet", "smoking","alcohol","stress","lipids","waist_hip_ratio","high_blood_pressure") parent_diabetes <- c("education","exercise","diet","smoking","alcohol", "stress","lipids","waist_hip_ratio","high_blood_pressure") parent_case <- c("education","exercise","diet","smoking","alcohol", "stress","lipids","waist_hip_ratio","high_blood_pressure","early_stage_heart_disease","diabetes") parent_list <- list(parent_exercise,parent_diet,parent_smoking,parent_alcohol, parent_stress,parent_high_blood_pressure,parent_lipids,parent_waist_hip_ratio, parent_early_stage_heart_disease,parent_diabetes,parent_case) node_vec=c("exercise","diet","smoking","alcohol","stress","high_blood_pressure", "lipids","waist_hip_ratio","early_stage_heart_disease","diabetes","case") model_list=automatic_fit(data=stroke_reduced, parent_list=parent_list, node_vec=node_vec, prev=.0035,common="region*ns(age,df=5)+sex*ns(age,df=5)", spline_nodes = c("waist_hip_ratio","lipids","diet")) jointpaf <- joint_paf(data=stroke_reduced, model_list=model_list, parent_list=parent_list, node_vec=node_vec, prev=.0035, riskfactor_vec = c("high_blood_pressure","smoking","stress","exercise","alcohol", "diabetes","early_stage_heart_disease"),ci=TRUE,boot_rep=10)
Calculation of attributable fractions with a continuous exposure
PAF_calc_continuous( model, riskfactor_vec, q_vec = c(0.01), data, calculation_method = "B", prev = NULL, ci = FALSE, boot_rep = 50, t_vector = NULL, ci_level = 0.95, ci_type = c("norm"), S = 1, weight_vec = NULL, verbose = TRUE )
PAF_calc_continuous( model, riskfactor_vec, q_vec = c(0.01), data, calculation_method = "B", prev = NULL, ci = FALSE, boot_rep = 50, t_vector = NULL, ci_level = 0.95, ci_type = c("norm"), S = 1, weight_vec = NULL, verbose = TRUE )
model |
Either a clogit, glm or coxph R model object. Non-linear effects should be specified via ns(x, df=y), where ns is the natural spline function from the splines library. |
riskfactor_vec |
A character vector of names for continuous exposures/riskfactors that are predictors the model. |
q_vec |
A vector of 'risk quantiles' for the continuous exposure. q_vec=c(0.01) (the default) calculates an estimate of the PAF that is in some way analogous to eliminating a categorical risk factor. Other values in q_vec correspond to interventions on the continuous risk factor that results in a risk level for all individuals thresholded above by the corresponding quantile of pre-intervention population risk. For survival regressions only single element values of q_vec are allowed |
data |
A dataframe containing variables used for fitting the model |
calculation_method |
A character either 'B' (Bruzzi) or 'D' (Direct method). For case control data, the method described in Bruzzi 1985 is recommended. Bruzzi's method estimates PAF from relative risks and prevalence of exposure to the risk factor. The Direct method estimates PAF via summing estimated probabilities of disease in the absence of exposure over differing individuals. |
prev |
The estimated prevalence of disease (A number between 0 and 1). This only needs to be specified if the data source is from a case control study, and the direct calculation method is used |
ci |
Logical. If TRUE, a bootstrap confidence interval is computed along with point estimate (default FALSE) |
boot_rep |
Integer. Number of bootstrap replications (Only necessary to specify if ci=TRUE). Note that at least 50 replicates are recommended to achieve stable estimates of standard error. In the examples below, values of boot_rep less than 50 are sometimes used to limit run time. |
t_vector |
Numeric. A vector of times at which to calculate PAF (only specified if model is coxph) |
ci_level |
Numeric. A number between 0 and 1 specifying the confidence level |
ci_type |
Character. A vector specifying the types of confidence interval desired, as available in the 'Boot' package. The default is c('norm'), which calculates a symmetric confidence interval: (Est-Bias +- 1.96*SE), with the standard error calculated via Bootstrap. Other choices are 'basic', 'perc' and 'bca'. Increasing the number of Bootstrap repetitions is recommended for the 'basic', 'perc' and 'bca' methods. |
S |
Integer (default 1). Only relevant to change if there is an interaction between the continuous exposure and other variables in the model. In this case, marginal comparisons of disease risk at differing levels of the exposure need to be averaged over many individuals. S is the number of individuals used in this averaging. May be slow for large S |
weight_vec |
An optional vector of inverse sampling weights for survey data (note that variance will not be calculated correctly if sampling isn't independent). Note that this vector will be ignored if prev is specified, and the weights will be calibrated so that the weighted sample prevalence of disease equals prev. |
verbose |
A logical indicator for whether extended output is produced when ci=TRUE, default TRUE |
A PAF_q object. When ci=FALSE, this will essentially be a vector of estimated PAF corresponding to the quantiles specified in q_vec. If ci=TRUE, a data frame with columns corresponding to the raw estimate, estimated bias, bias corrected estimate and lower and upper elements of any confidence procedures requested, and rows corresponding to the quantiles in q_vec.
Ferguson, J., Maturo, F., Yusuf, S. and O’Donnell, M., 2020. Population attributable fractions for continuously distributed exposures. Epidemiologic Methods, 9(1).
library(splines) library(survival) library(parallel) options(boot.parallel="snow") options(boot.ncpus=2) # The above could be set to the number of available cores on the machine # Example with logistic regression. PAF_q (as in Ferguson, 2020) # estimated at q=0.01, 0.1, 0.3, 0.5, 0.7. 0.9. PAF_0.01 is roughly # analogous to 'eliminating' a discrete risk factor, but its estimation # may be unstable for some exposures, and the corresponding intervention # may be impractical. Comparing PAF_q for q >= 0.1 over different risk factors # may lead to more sensible comparisons of disease burden. # Either method (direct, D, or Bruzzi ) # reduce dataset to improve run time (not recommended on real data!) stroke_small <- stroke_reduced[sample(1:nrow(stroke_reduced),1000),] model_continuous <- glm(formula = case ~ region * ns(age, df = 5) + sex * ns(age, df = 5) + education +exercise + ns(diet, df = 3) + alcohol + stress + ns(lipids,df = 3) + ns(waist_hip_ratio, df = 3) + high_blood_pressure, family = "binomial", data = stroke_small) out <- PAF_calc_continuous(model_continuous,riskfactor_vec= c("diet","lipids","waist_hip_ratio"),q_vec=c(0.1,0.5,0.9), ci=FALSE,calculation_method="D",data=stroke_small, prev=0.0035) print(out) plot(out) # with confidence intervals (via bootstrap) on full dataset. Slower. model_continuous_clogit <- clogit(formula = case ~ region * ns(age, df = 5) + sex * ns(age, df = 5) + education +exercise + ns(diet, df = 3) + alcohol + stress + ns(lipids,df = 3) + ns(waist_hip_ratio, df = 3) + high_blood_pressure + strata(strata), data = stroke_reduced) out <- PAF_calc_continuous(model_continuous_clogit,riskfactor_vec=c("diet", "lipids","waist_hip_ratio"),q_vec=c(0.01, 0.1,0.3,0.5,0.7,0.9), ci=TRUE,calculation_method="B",data=stroke_reduced, prev=0.01) print(out) plot(out)
library(splines) library(survival) library(parallel) options(boot.parallel="snow") options(boot.ncpus=2) # The above could be set to the number of available cores on the machine # Example with logistic regression. PAF_q (as in Ferguson, 2020) # estimated at q=0.01, 0.1, 0.3, 0.5, 0.7. 0.9. PAF_0.01 is roughly # analogous to 'eliminating' a discrete risk factor, but its estimation # may be unstable for some exposures, and the corresponding intervention # may be impractical. Comparing PAF_q for q >= 0.1 over different risk factors # may lead to more sensible comparisons of disease burden. # Either method (direct, D, or Bruzzi ) # reduce dataset to improve run time (not recommended on real data!) stroke_small <- stroke_reduced[sample(1:nrow(stroke_reduced),1000),] model_continuous <- glm(formula = case ~ region * ns(age, df = 5) + sex * ns(age, df = 5) + education +exercise + ns(diet, df = 3) + alcohol + stress + ns(lipids,df = 3) + ns(waist_hip_ratio, df = 3) + high_blood_pressure, family = "binomial", data = stroke_small) out <- PAF_calc_continuous(model_continuous,riskfactor_vec= c("diet","lipids","waist_hip_ratio"),q_vec=c(0.1,0.5,0.9), ci=FALSE,calculation_method="D",data=stroke_small, prev=0.0035) print(out) plot(out) # with confidence intervals (via bootstrap) on full dataset. Slower. model_continuous_clogit <- clogit(formula = case ~ region * ns(age, df = 5) + sex * ns(age, df = 5) + education +exercise + ns(diet, df = 3) + alcohol + stress + ns(lipids,df = 3) + ns(waist_hip_ratio, df = 3) + high_blood_pressure + strata(strata), data = stroke_reduced) out <- PAF_calc_continuous(model_continuous_clogit,riskfactor_vec=c("diet", "lipids","waist_hip_ratio"),q_vec=c(0.01, 0.1,0.3,0.5,0.7,0.9), ci=TRUE,calculation_method="B",data=stroke_reduced, prev=0.01) print(out) plot(out)
Calculation of attributable fractions using a categorized risk factor
PAF_calc_discrete( model, riskfactor, refval, data, calculation_method = "B", prev = NULL, ci = FALSE, boot_rep = 50, t_vector = NULL, ci_level = 0.95, ci_type = c("norm"), weight_vec = NULL, verbose = TRUE )
PAF_calc_discrete( model, riskfactor, refval, data, calculation_method = "B", prev = NULL, ci = FALSE, boot_rep = 50, t_vector = NULL, ci_level = 0.95, ci_type = c("norm"), weight_vec = NULL, verbose = TRUE )
model |
Either a clogit, glm or coxph fitted regression object. Non-linear effects can be specified in these models if necessary via ns(x, df=y), where ns is the natural spline function from the splines library. |
riskfactor |
The name of the risk factor of interest in the dataset. The risk factor values can be 0/1 numeric, categorical or factor valued. |
refval |
The reference value for the risk factor. If a risk factor is 0/1 numeric, 0 is assumed as the default value, otherwise refval must be specified. |
data |
A dataframe containing variables used for fitting the model |
calculation_method |
A character either 'B' (Bruzzi) or 'D' (Direct method). For case control data, the method described in Bruzzi 1985 is recommended. Bruzzi's method estimates PAF from relative risks and prevalence of exposure to the risk factor. The Direct method estimates PAF by summing estimated probabilities of disease in the absence of exposure on the individual level |
prev |
The estimated prevalence of disease (A number between 0 and 1). This only needs to be specified if the data source is from a case control study, and the direct method is used |
ci |
Logical. If TRUE, a bootstrap confidence interval is computed along with point estimate (default FALSE) |
boot_rep |
Integer. Number of bootstrap replications (Only necessary to specify if ci=TRUE). Note that at least 50 replicates are recommended to achieve stable estimates of standard error. In the examples below, values of boot_rep less than 50 are sometimes used to limit run time. |
t_vector |
Numeric. A vector of times at which to calculate PAF (only specified if model is coxph) |
ci_level |
Numeric. A number between 0 and 1 specifying the confidence level |
ci_type |
Character. A vector specifying the types of confidence interval desired, as available in the 'Boot' package. The default is c('norm'), which calculates a symmetric confidence interval: (Est-Bias +- 1.96*SE), with the standard error calculated via Bootstrap. Other choices are 'basic', 'perc' and 'bca'. Increasing the number of Bootstrap repetitions is recommended for the 'basic', 'perc' and 'bca' methods. |
weight_vec |
An optional vector of inverse sampling weights for survey data (note that variance will not be calculated correctly if sampling isn't independent). Note that this will be ignored if prev is specified and calculation_method="D", in which case the weights will be constructed so the empirical re-weighted prevalence of disease is equal to prev. |
verbose |
A logical indicator for whether extended output is produced when ci=TRUE, default TRUE |
An estimated PAF if ci=FALSE, or for survival data a vector of estimated PAF corresponding to event times in the data. If ci=TRUE, a vector with elements corresponding to the raw estimate, estimated bias, bias corrected estimate and lower and upper elements of any confidence procedures requested. If ci=TRUE, and a coxph model is fit, a matrix will be returned, with rows corresponding to differing times at which the PAF might be calculated.
Bruzzi, P., Green, S.B., Byar, D.P., Brinton, L.A. and Schairer, C., 1985. Estimating the population attributable risk for multiple risk factors using case-control data. American journal of epidemiology, 122(5), pp.904-914
library(splines) library(survival) library(parallel) options(boot.parallel="snow") options(boot.ncpus=2) # The above could be set to the number of available cores on the machine data(stroke_reduced) model_exercise <- glm(formula = case ~ region * ns(age, df = 5) + sex * ns(age, df = 5) + education + exercise + ns(diet, df = 3) + smoking + alcohol + stress, family = "binomial", data = stroke_reduced) # calculate discrete PAF using Bruzzi method PAF_calc_discrete(model_exercise, "exercise", refval=0, data=stroke_reduced, calculation_method="B",ci=FALSE) # calculate discrete PAF using Direct method # Use bootstrap resampling to calculate a confidence interval # 10 Bootstrap reps used here for speed. # In real examples, use at least 50 repetitions. PAF_calc_discrete(model_exercise, "exercise", refval=0, data=stroke_reduced, calculation_method="D", prev=0.005, ci=TRUE, boot_rep=10) ### use the Bruzzi method derived by Bruzzi, 1985, instead PAF_calc_discrete(model_exercise, "exercise", refval=0, data=stroke_reduced, calculation_method="B", ci=TRUE, boot_rep=10) # examples of clogit and coxph regressions model_high_blood_pressure_clogit <- clogit(formula = case ~ age + education +exercise + ns(diet, df = 3) + smoking + alcohol + stress + ns(lipids,df = 3) + ns(waist_hip_ratio, df = 3) + high_blood_pressure + strata(strata),data=stroke_reduced) PAF_calc_discrete(model_high_blood_pressure_clogit, "high_blood_pressure", refval=0, data=stroke_reduced, calculation_method="B",ci=TRUE, boot_rep=10, ci_type=c('norm')) model_high_blood_pressure_coxph <- coxph(formula = Surv(time,event) ~ ns(age,df=5) + education +exercise + ns(diet, df = 3) + smoking + alcohol + stress + ns(lipids,df = 3) + ns(waist_hip_ratio, df = 3) + high_blood_pressure, data = stroke_reduced) PAF_calc_discrete(model_high_blood_pressure_coxph, "high_blood_pressure", refval=0, data=stroke_reduced, calculation_method="D", ci=TRUE, boot_rep=10, ci_type=c('norm'),t_vector=c(1,2,3,4,5,6,7,8,9))
library(splines) library(survival) library(parallel) options(boot.parallel="snow") options(boot.ncpus=2) # The above could be set to the number of available cores on the machine data(stroke_reduced) model_exercise <- glm(formula = case ~ region * ns(age, df = 5) + sex * ns(age, df = 5) + education + exercise + ns(diet, df = 3) + smoking + alcohol + stress, family = "binomial", data = stroke_reduced) # calculate discrete PAF using Bruzzi method PAF_calc_discrete(model_exercise, "exercise", refval=0, data=stroke_reduced, calculation_method="B",ci=FALSE) # calculate discrete PAF using Direct method # Use bootstrap resampling to calculate a confidence interval # 10 Bootstrap reps used here for speed. # In real examples, use at least 50 repetitions. PAF_calc_discrete(model_exercise, "exercise", refval=0, data=stroke_reduced, calculation_method="D", prev=0.005, ci=TRUE, boot_rep=10) ### use the Bruzzi method derived by Bruzzi, 1985, instead PAF_calc_discrete(model_exercise, "exercise", refval=0, data=stroke_reduced, calculation_method="B", ci=TRUE, boot_rep=10) # examples of clogit and coxph regressions model_high_blood_pressure_clogit <- clogit(formula = case ~ age + education +exercise + ns(diet, df = 3) + smoking + alcohol + stress + ns(lipids,df = 3) + ns(waist_hip_ratio, df = 3) + high_blood_pressure + strata(strata),data=stroke_reduced) PAF_calc_discrete(model_high_blood_pressure_clogit, "high_blood_pressure", refval=0, data=stroke_reduced, calculation_method="B",ci=TRUE, boot_rep=10, ci_type=c('norm')) model_high_blood_pressure_coxph <- coxph(formula = Surv(time,event) ~ ns(age,df=5) + education +exercise + ns(diet, df = 3) + smoking + alcohol + stress + ns(lipids,df = 3) + ns(waist_hip_ratio, df = 3) + high_blood_pressure, data = stroke_reduced) PAF_calc_discrete(model_high_blood_pressure_coxph, "high_blood_pressure", refval=0, data=stroke_reduced, calculation_method="D", ci=TRUE, boot_rep=10, ci_type=c('norm'),t_vector=c(1,2,3,4,5,6,7,8,9))
Implementation of Levin's formula for summary data
paf_levin(prev = NULL, RR = NULL, conf_prev = NULL, conf_RR = NULL, digits = 3)
paf_levin(prev = NULL, RR = NULL, conf_prev = NULL, conf_RR = NULL, digits = 3)
prev |
A vector of estimated prevalence for each non-reference level of risk factor. Can be left unspecified if conf_prev specified. |
RR |
A vector of estimated relative risk for each non-reference level of risk factor. Can be left unspecified if conf_RR specified. |
conf_prev |
If risk factor has 2 levels, a numeric vector of length 2 giving confidence limits for prevalence. If risk factor has K>2 levels, a K-1 x 2 matrix giving confidence intervals for prevalence of each non-reference level of risk factor. |
conf_RR |
If risk factor has 2 levels, a numeric vector of length 2 giving confidence limits for relative risk. If risk factor has K>2 levels, a K-1 x 2 matrix giving confidence intervals for relative risk for each non-reference level of risk factor |
digits |
integer. The number of significant digits for rounding of PAF estimates and confidence intervals. Default of 3. |
If confidence intervals for prevalence and relative risk are not specified, the estimated PAF. If confidence intervals for prevalence and relative risk are specified, confidence intervals for PAF are estimated using approximate propagation of imprecision. Note that if confidence intervals are supplied as arguments, the algorithm makes assumptions that the point estimate of prevalence is the average of the specified confidence limits for prevalence, that the point estimate for relative risk is the geometric mean of the confidence limits for relative risk, and that the 3 estimators are independent.
Ferguson, J., Alvarez-Iglesias, A., Mulligan, M., Judge, C. and O’Donnell, M., 2024. Bias assessment and correction for Levin's population attributable fraction under confounding. European Journal of Epidemiology, In press
CI_p <- c(0.1,0.3) CI_RR <- c(1.2, 2) # calculation without confidence interval paf_levin(prev=0.2,RR=exp(.5*log(1.2)+.5*log(2))) # calculation with confidence interval paf_levin(conf_prev=CI_p,conf_RR=CI_RR) # add another level to risk factor # with higher prevalence and RR # this will increase the PAF CI_p <- matrix(c(0.1,0.3,0.15, 0.25),nrow=2) CI_RR <- matrix(c(1.2,2,1.5,3),nrow=2) paf_levin(conf_prev=CI_p,conf_RR=CI_RR)
CI_p <- c(0.1,0.3) CI_RR <- c(1.2, 2) # calculation without confidence interval paf_levin(prev=0.2,RR=exp(.5*log(1.2)+.5*log(2))) # calculation with confidence interval paf_levin(conf_prev=CI_p,conf_RR=CI_RR) # add another level to risk factor # with higher prevalence and RR # this will increase the PAF CI_p <- matrix(c(0.1,0.3,0.15, 0.25),nrow=2) CI_RR <- matrix(c(1.2,2,1.5,3),nrow=2) paf_levin(conf_prev=CI_p,conf_RR=CI_RR)
Implementation of Miettinen's formula for summary data
paf_miettinen( prev = NULL, RR = NULL, RRu = NULL, conf_prev = NULL, conf_RR = NULL, conf_RRu = NULL, digits = 3 )
paf_miettinen( prev = NULL, RR = NULL, RRu = NULL, conf_prev = NULL, conf_RR = NULL, conf_RRu = NULL, digits = 3 )
prev |
A vector of estimated prevalence for each non-reference of risk factor. Can be left unspecified if conf_prev specified. |
RR |
A vector of estimated causal relative risk for each non-reference level of risk factor. Can be left unspecified if conf_RR specified. |
RRu |
A vector of estimated unadjusted relative risk for each non-reference level of the risk factor. Can be left unspecified if conf_RRu specified. |
conf_prev |
If risk factor has 2 levels, a numeric vector of length 2 giving confidence limits for prevalence. If risk factor has K>2 levels, a K-1 x 2 matrix giving confidence intervals for prevalence of each non-refernece level. |
conf_RR |
If risk factor has 2 levels, a numeric vector of length 2 giving confidence limits for the causal relative risk. If risk factor has K>2 levels, a K-1 x 2 matrix giving confidence intervals for causal relative risk fror each non-reference level of risk factor. |
conf_RRu |
If risk factor has 2 levels, a numeric vector of length 2 giving confidence limits for the unadjusted relative risk. If risk factor has K>2 levels, a K-1 x 2 matrix giving confidence intervals for unadjusted relative risk for each non-reference level of risk factor. |
digits |
integer. The number of significant digits for rounding of PAF estimates and confidence intervals. Default of 3 |
If confidence intervals for prevalence, adjusted and unadjusted relative risk are not specified, the estimated PAF. If confidence intervals are specified, confidence intervals for PAF are also estimated using approximate propagation of imprecision. Note that if confidence intervals are supplied as arguments, the algorithm makes assumptions that the point estimate of prevalence is the average of the specified confidence limits for prevalence, the point estimates for adjusted/unadjusted relative risk are the geometric means of the specified confidence limits for relative risk, and that the 3 estimators are independent.
Ferguson, J., Alvarez-Iglesias, A., Mulligan, M., Judge, C. and O’Donnell, M., 2024. Bias assessment and correction for Levin's population attributable fraction under confounding. European Journal of Epidemiology, In press
CI_p <- c(0.1,0.3) CI_RR <- c(1.2, 2) CI_RRu <- c(1.5, 2.5) # example without confidence interval paf_miettinen(prev=0.2,RR=exp(.5*log(1.2)+.5*log(2)), RRu=exp(.5*log(1.5)+.5*log(2.5))) #' # example with confidence interval paf_miettinen(conf_prev=CI_p,conf_RR=CI_RR, conf_RRu=CI_RRu) # risk factor with more than two non-reference levels # confidence intervals for non-reference levels # of risk factor should be a (K-1) x 2 matrix CI_p <- matrix(c(0.1,0.3,0.15, 0.25),nrow=2) CI_RR <- matrix(c(1.2,2,1.5,3),nrow=2) CI_RRu <- matrix(c(1.5,2.5,2,3.5),nrow=2) paf_miettinen(conf_prev=CI_p,conf_RR=CI_RR, conf_RRu=CI_RRu)
CI_p <- c(0.1,0.3) CI_RR <- c(1.2, 2) CI_RRu <- c(1.5, 2.5) # example without confidence interval paf_miettinen(prev=0.2,RR=exp(.5*log(1.2)+.5*log(2)), RRu=exp(.5*log(1.5)+.5*log(2.5))) #' # example with confidence interval paf_miettinen(conf_prev=CI_p,conf_RR=CI_RR, conf_RRu=CI_RRu) # risk factor with more than two non-reference levels # confidence intervals for non-reference levels # of risk factor should be a (K-1) x 2 matrix CI_p <- matrix(c(0.1,0.3,0.15, 0.25),nrow=2) CI_RR <- matrix(c(1.2,2,1.5,3),nrow=2) CI_RRu <- matrix(c(1.5,2.5,2,3.5),nrow=2) paf_miettinen(conf_prev=CI_p,conf_RR=CI_RR, conf_RRu=CI_RRu)
Plot hazard ratios, odds ratios or risk ratios comparing differing values of a continuous exposure to a reference level
plot_continuous( model, riskfactor, data, S = 10, ref_val = NA, ci_level = 0.95, min_risk_q = 0.1, plot_region = TRUE, plot_density = TRUE, n_x = 10000, theylab = "OR", qlist = seq(from = 0.001, to = 0.999, by = 0.001), interact = FALSE )
plot_continuous( model, riskfactor, data, S = 10, ref_val = NA, ci_level = 0.95, min_risk_q = 0.1, plot_region = TRUE, plot_density = TRUE, n_x = 10000, theylab = "OR", qlist = seq(from = 0.001, to = 0.999, by = 0.001), interact = FALSE )
model |
A fitted model (either glm, clogit or coxph) |
riskfactor |
The string name of a continuous exposure or risk factor represented in the data and model |
data |
Data frame used to fit the model |
S |
Default 10. The integer number of random samples used to calculate average differences in linear predictors. Only relevant to set when interact=TRUE |
ref_val |
The reference value used in plotting. If left at NA, the median value of the risk factor is used |
ci_level |
Numeric. A number between 0 and 1 specifying the confidence level |
min_risk_q |
Default .1. A number between 0 and 1 representing the desired risk quantile for the continuous risk factor |
plot_region |
Default TRUE. Logical specifying whether the targeted region corresponding to an intervention setting the continuous risk factor at a quantile min_risk_q or lower is to be plotted |
plot_density |
Default TRUE. Logical specifying whether density of distribution of risk factor is to be added to the plot |
n_x |
Default 10000. How many values of riskfactor will be used to plot spline (when interact=FALSE) |
theylab |
Default "OR". Y-axis label of the plot |
qlist |
Vector of quantile values for q, corresponding to the plotted values of PAF_q for each risk factor/exposure |
interact |
Default "FALSE". Set to TRUE spline models enter as interactions. |
A ggplot2 plotting object
Ferguson, J., Maturo, F., Yusuf, S. and O’Donnell, M., 2020. Population attributable fractions for continuously distributed exposures. Epidemiologic Methods, 9(1)
library(survival) library(splines) model_continuous <- glm(formula = case ~ region * ns(age, df = 5) + sex * ns(age, df = 5) + education +exercise + ns(diet, df = 3) + alcohol + stress + ns(lipids,df = 3) + ns(waist_hip_ratio, df = 3) + high_blood_pressure, family = "binomial", data = stroke_reduced) plot_continuous(model_continuous,riskfactor="diet",data=stroke_reduced)
library(survival) library(splines) model_continuous <- glm(formula = case ~ region * ns(age, df = 5) + sex * ns(age, df = 5) + education +exercise + ns(diet, df = 3) + alcohol + stress + ns(lipids,df = 3) + ns(waist_hip_ratio, df = 3) + high_blood_pressure, family = "binomial", data = stroke_reduced) plot_continuous(model_continuous,riskfactor="diet",data=stroke_reduced)
Plot impact fractions corresponding to risk-quantiles over several risk factors
## S3 method for class 'PAF_q' plot(x, ...)
## S3 method for class 'PAF_q' plot(x, ...)
x |
A PAF_q object. This is a dataframe that is created by running the function PAF_calc_continuous. |
... |
Other global arguments inherited by that might be passed to the ggplot routine |
A ggplot2 plotting object for PAF_q over the differing risk factors in x
library(splines) library(survival) library(parallel) options(boot.parallel="snow") options(boot.ncpus=2) # The above could be set to the number of available cores on the machine model_continuous <- glm(formula = case ~ region * ns(age, df = 5) + sex * ns(age, df = 5) + education +exercise + ns(diet, df = 3) + alcohol + stress + ns(lipids,df = 3) + ns(waist_hip_ratio, df = 3) + high_blood_pressure, family = "binomial", data = stroke_reduced) out <- PAF_calc_continuous(model_continuous,riskfactor_vec= c("diet","lipids","waist_hip_ratio"),q_vec=c(0.1,0.9), ci=FALSE,calculation_method="B",data=stroke_reduced) plot(out) # example with more quantile points and including confidence intervals # (more useful - but a bit slower to run) out <- PAF_calc_continuous(model_continuous,riskfactor_vec= c("diet","lipids","waist_hip_ratio"),q_vec=c(0.01, 0.1,0.3,0.5,0.7,0.9), ci=TRUE,calculation_method="B",data=stroke_reduced) plot(out)
library(splines) library(survival) library(parallel) options(boot.parallel="snow") options(boot.ncpus=2) # The above could be set to the number of available cores on the machine model_continuous <- glm(formula = case ~ region * ns(age, df = 5) + sex * ns(age, df = 5) + education +exercise + ns(diet, df = 3) + alcohol + stress + ns(lipids,df = 3) + ns(waist_hip_ratio, df = 3) + high_blood_pressure, family = "binomial", data = stroke_reduced) out <- PAF_calc_continuous(model_continuous,riskfactor_vec= c("diet","lipids","waist_hip_ratio"),q_vec=c(0.1,0.9), ci=FALSE,calculation_method="B",data=stroke_reduced) plot(out) # example with more quantile points and including confidence intervals # (more useful - but a bit slower to run) out <- PAF_calc_continuous(model_continuous,riskfactor_vec= c("diet","lipids","waist_hip_ratio"),q_vec=c(0.01, 0.1,0.3,0.5,0.7,0.9), ci=TRUE,calculation_method="B",data=stroke_reduced) plot(out)
Create a fan plot displaying approximate PAF, risk factor prevalence and risk ratios
## S3 method for class 'rf.data.frame' plot( x, type = "f", rf_prevmarks = c(0.02, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 0.9), ormarks = c(1.05, 1.1, 1.2, 1.5, 2, 3), fan.label.size = 8, fan.point.size = 8, fan.legend.text.size = 30, fan.legend.title.size = 30, fan.axis.text.size = 30, fan.axis.title.size = 30, nomogram.label.size = 6, nomogram.axis.text.size = 6, nomogram.legend.text.size = 6, nomogram.legend.title.size = 6, ... )
## S3 method for class 'rf.data.frame' plot( x, type = "f", rf_prevmarks = c(0.02, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 0.9), ormarks = c(1.05, 1.1, 1.2, 1.5, 2, 3), fan.label.size = 8, fan.point.size = 8, fan.legend.text.size = 30, fan.legend.title.size = 30, fan.axis.text.size = 30, fan.axis.title.size = 30, nomogram.label.size = 6, nomogram.axis.text.size = 6, nomogram.legend.text.size = 6, nomogram.legend.title.size = 6, ... )
x |
A rf.data.frame object |
type |
A character representing the type of plot. "f" for a fan_plot, "n" for a PAF nomogram and "rn" for a reverse PAF nomogram. See Ferguson et al.. "Graphical comparisons of relative disease burden across multiple risk factors." BMC medical research methodology 19, no. 1 (2019): 1-9 for more details |
rf_prevmarks |
Axis marks for risk factor prevalence (only used for type="n" and type = "rn") Default c(0.02, 0.05,0.1,0.2,0.3,0.4,0.5,0.7,0.9) |
ormarks |
Axis marks for odds ratios (only used for type="n" and type = "rn") Default c(1.05,1.1,1.4,1.7,2.0,3.0) |
fan.label.size |
label size for fan plot (default 8) |
fan.point.size |
point size for fan plot (default 8) |
fan.legend.text.size |
legend text size for fan plot (default 30) |
fan.legend.title.size |
legend title size for fan plot (default 30) |
fan.axis.text.size |
axis text size for fan plot (default 30) |
fan.axis.title.size |
axis title size for fan plot (default 30) |
nomogram.label.size |
label size for a nomogram (default 6) |
nomogram.axis.text.size |
axis title size for nomogram (default 6) |
nomogram.legend.text.size |
legend text size for nomogram (default 6) |
nomogram.legend.title.size |
legend title size for nomogram (default 6) |
... |
Other arguments that can be passed to the plotting routine |
fanplot or PAF nomogram (each is a ggplot2 object)
Ferguson, J., O’Leary, N., Maturo, F., Yusuf, S. and O’Donnell, M., 2019. Graphical comparisons of relative disease burden across multiple risk factors. BMC medical research methodology, 19(1), pp.1-9.
library(ggplot2) rfs <- rf_summary(rf_names=c('Hypertension','Inactivity','ApoB/ApoA', 'Diet','WHR','Smoking','Cardiac causes','Alcohol','Global Stress','Diabetes'), rf_prev=c(.474,.837,.669,.67,.67,.224,.049,.277,.144,.129), risk=c(1.093,0.501,0.428,0.378,0.294,0.513,1.156,0.186,0.301,0.148),log=TRUE) # fanplot plot(rfs,fan.point.size=4,fan.label.size=4, fan.legend.text.size=10,fan.legend.title.size=10, fan.axis.text.size=10,fan.axis.title.size=10) # nomogram plot(rfs,nomogram.label.size=4, nomogram.axis.text.size=4, nomogram.legend.text.size=8,nomogram.legend.title.size=8, type="rn") # reverse nomogram plot(rfs,nomogram.label.size=4, nomogram.axis.text.size=4, nomogram.legend.text.size=8,nomogram.legend.title.size=8, type="rn")
library(ggplot2) rfs <- rf_summary(rf_names=c('Hypertension','Inactivity','ApoB/ApoA', 'Diet','WHR','Smoking','Cardiac causes','Alcohol','Global Stress','Diabetes'), rf_prev=c(.474,.837,.669,.67,.67,.224,.049,.277,.144,.129), risk=c(1.093,0.501,0.428,0.378,0.294,0.513,1.156,0.186,0.301,0.148),log=TRUE) # fanplot plot(rfs,fan.point.size=4,fan.label.size=4, fan.legend.text.size=10,fan.legend.title.size=10, fan.axis.text.size=10,fan.axis.title.size=10) # nomogram plot(rfs,nomogram.label.size=4, nomogram.axis.text.size=4, nomogram.legend.text.size=8,nomogram.legend.title.size=8, type="rn") # reverse nomogram plot(rfs,nomogram.label.size=4, nomogram.axis.text.size=4, nomogram.legend.text.size=8,nomogram.legend.title.size=8, type="rn")
Produce plots of sequential and average PAF
## S3 method for class 'SAF_summary' plot( x, number_rows = 3, max_PAF = 0.4, min_PAF = 0, point.size = 4, axis.text.size = 6, title.size = 6, axis.title.size = 6, ... )
## S3 method for class 'SAF_summary' plot( x, number_rows = 3, max_PAF = 0.4, min_PAF = 0, point.size = 4, axis.text.size = 6, title.size = 6, axis.title.size = 6, ... )
x |
An SAF_summary R object produced by running the average_paf function. |
number_rows |
integer How many rows of plots will be included on the associated figure. |
max_PAF |
upper limit of y axis on PAF plots (default = 0.4) |
min_PAF |
lower limit of y axis on PAF plots (default = 0) |
point.size |
size of points on each individual plot (default=4) |
axis.text.size |
size of axis labels on each plot (default=6) |
title.size |
size of title on each individual plot (default=6) |
axis.title.size |
size of titles on each plot (default=6) |
... |
Other global arguments inherited by that might be passed to the ggplot routine |
A ggplot2 plotting object illustrating average sequential PAF by position and average PAF by risk factor.
Ferguson, J., O’Connell, M. and O’Donnell, M., 2020. Revisiting sequential attributable fractions. Archives of Public Health, 78(1), pp.1-9. Ferguson, J., Alvarez-Iglesias, A., Newell, J., Hinde, J. and O’Donnell, M., 2018. Estimating average attributable fractions with confidence intervals for cohort and case–control studies. Statistical methods in medical research, 27(4), pp.1141-1152
library(splines) library(survival) library(parallel) options(boot.parallel="snow") options(boot.ncpus=2) # Simulated data on occupational and environmental exposure to # chronic cough from Eide, 1995 # First specify the causal graph, in terms of the parents of each node. Then put into a list parent_urban.rural <- c() parent_smoking.category <- c("urban.rural") parent_occupational.exposure <- c("urban.rural") parent_y <- c("urban.rural","smoking.category","occupational.exposure") parent_list <- list(parent_urban.rural, parent_smoking.category, parent_occupational.exposure, parent_y) # also specify nodes of graph, in order from root to leaves node_vec <- c("urban.rural","smoking.category","occupational.exposure", "y") model_list=automatic_fit(Hordaland_data, parent_list=parent_list, node_vec=node_vec, prev=.09) out <- average_paf(data=model_list[[length(model_list)]]$data, model_list=model_list, parent_list=parent_list, node_vec=node_vec, prev=.09, nperm=10, riskfactor_vec = c("urban.rural","occupational.exposure"),ci=FALSE) plot(out) # plot with confidence intervals for average and sequential PAF # (This is probably more useful for more than 2 risk factors). # Separate axes for each risk factor so confidence intervals can be clearly displayed out <- average_paf(data=model_list[[length(model_list)]]$data, model_list=model_list, parent_list=parent_list, node_vec=node_vec, prev=.09, nperm=10, riskfactor_vec = c("urban.rural","occupational.exposure"),ci=TRUE,boot_rep=8) plot(out) # Here we plot, with margin of error of point estimate when 50 permutations are used out <- average_paf(data=model_list[[length(model_list)]]$data, model_list=model_list, parent_list=parent_list, node_vec=node_vec, prev=.09, nperm=50, riskfactor_vec = c("urban.rural","occupational.exposure"),ci=FALSE,exact=FALSE) plot(out)
library(splines) library(survival) library(parallel) options(boot.parallel="snow") options(boot.ncpus=2) # Simulated data on occupational and environmental exposure to # chronic cough from Eide, 1995 # First specify the causal graph, in terms of the parents of each node. Then put into a list parent_urban.rural <- c() parent_smoking.category <- c("urban.rural") parent_occupational.exposure <- c("urban.rural") parent_y <- c("urban.rural","smoking.category","occupational.exposure") parent_list <- list(parent_urban.rural, parent_smoking.category, parent_occupational.exposure, parent_y) # also specify nodes of graph, in order from root to leaves node_vec <- c("urban.rural","smoking.category","occupational.exposure", "y") model_list=automatic_fit(Hordaland_data, parent_list=parent_list, node_vec=node_vec, prev=.09) out <- average_paf(data=model_list[[length(model_list)]]$data, model_list=model_list, parent_list=parent_list, node_vec=node_vec, prev=.09, nperm=10, riskfactor_vec = c("urban.rural","occupational.exposure"),ci=FALSE) plot(out) # plot with confidence intervals for average and sequential PAF # (This is probably more useful for more than 2 risk factors). # Separate axes for each risk factor so confidence intervals can be clearly displayed out <- average_paf(data=model_list[[length(model_list)]]$data, model_list=model_list, parent_list=parent_list, node_vec=node_vec, prev=.09, nperm=10, riskfactor_vec = c("urban.rural","occupational.exposure"),ci=TRUE,boot_rep=8) plot(out) # Here we plot, with margin of error of point estimate when 50 permutations are used out <- average_paf(data=model_list[[length(model_list)]]$data, model_list=model_list, parent_list=parent_list, node_vec=node_vec, prev=.09, nperm=50, riskfactor_vec = c("urban.rural","occupational.exposure"),ci=FALSE,exact=FALSE) plot(out)
Internal: Create a data frame for predictions (when risk factor is continuous).
predict_df_continuous(riskfactor, q_val, risk_q, data)
predict_df_continuous(riskfactor, q_val, risk_q, data)
riskfactor |
The name of the risk factor of interest in the dataset |
q_val |
The risk quantile to match to |
risk_q |
Estimated risk quantiles |
data |
A dataframe containing variables used to fit the model |
A data frame where the distribution continuous risk factor so at an individual level, risk is at the q_val-quantile or below
Internal: Create a data frame for predictions (when risk factor is discrete).
predict_df_discrete(riskfactor, refval, data)
predict_df_discrete(riskfactor, refval, data)
riskfactor |
The name of the risk factor of interest in the dataset |
refval |
The reference value for the risk factor |
data |
A dataframe containing variables used to fit the model |
A data frame where the categorical variable is set to its reference level
Print out PAF_q for differing risk factors
## S3 method for class 'PAF_q' print(x, ...)
## S3 method for class 'PAF_q' print(x, ...)
x |
A PAF_q object. |
... |
Other arguments to be passed to print |
No return value, prints the PAF_q object to the console.
library(splines) library(survival) library(parallel) options(boot.parallel="snow") options(boot.ncpus=2) # The above could be set to the number of available cores on the machine model_continuous <- glm(formula = case ~ region * ns(age, df = 5) + sex * ns(age, df = 5) + education +exercise + ns(diet, df = 3) + alcohol + stress + ns(lipids,df = 3) + ns(waist_hip_ratio, df = 3) + high_blood_pressure, family = "binomial", data = stroke_reduced) out <- PAF_calc_continuous(model_continuous, riskfactor_vec=c("diet","lipids","waist_hip_ratio"), q_vec=c(0.01, 0.1,0.3,0.5,0.7,0.9),ci=FALSE,calculation_method="B", data=stroke_reduced) print(out)
library(splines) library(survival) library(parallel) options(boot.parallel="snow") options(boot.ncpus=2) # The above could be set to the number of available cores on the machine model_continuous <- glm(formula = case ~ region * ns(age, df = 5) + sex * ns(age, df = 5) + education +exercise + ns(diet, df = 3) + alcohol + stress + ns(lipids,df = 3) + ns(waist_hip_ratio, df = 3) + high_blood_pressure, family = "binomial", data = stroke_reduced) out <- PAF_calc_continuous(model_continuous, riskfactor_vec=c("diet","lipids","waist_hip_ratio"), q_vec=c(0.01, 0.1,0.3,0.5,0.7,0.9),ci=FALSE,calculation_method="B", data=stroke_reduced) print(out)
Print out a SAF_summary object
## S3 method for class 'SAF_summary' print(x, ...)
## S3 method for class 'SAF_summary' print(x, ...)
x |
A SAF_summary object. This is a special dataframe that is created by running the function average_PAF. |
... |
Other arguments to be passed to print |
No return value. Prints the SAF_summary object to the console.
library(splines) library(survival) library(parallel) options(boot.parallel="snow") options(boot.ncpus=2) # The above could be set to the number of available cores on the machine # Simulated data on occupational and environmental exposure to chronic cough from Eide, 1995 # First specify the causal graph, in terms of the parents of each node. Then put into a list parent_urban.rural <- c() parent_smoking.category <- c("urban.rural") parent_occupational.exposure <- c("urban.rural") parent_y <- c("urban.rural","smoking.category","occupational.exposure") parent_list <- list(parent_urban.rural, parent_smoking.category, parent_occupational.exposure, parent_y) node_vec <- c("urban.rural","smoking.category","occupational.exposure", "y") model_list=automatic_fit(data=Hordaland_data, parent_list=parent_list, node_vec=node_vec, prev=.09) # model_list$data objects have fitting weights # included in data frame # Including weight column in data # necessary if Bootstrapping CIs out <- average_paf(data=model_list[[length(model_list)]]$data, model_list=model_list, parent_list=parent_list, node_vec=node_vec, prev=.09, nperm=10, riskfactor_vec = c("urban.rural","occupational.exposure"),ci=FALSE) print(out)
library(splines) library(survival) library(parallel) options(boot.parallel="snow") options(boot.ncpus=2) # The above could be set to the number of available cores on the machine # Simulated data on occupational and environmental exposure to chronic cough from Eide, 1995 # First specify the causal graph, in terms of the parents of each node. Then put into a list parent_urban.rural <- c() parent_smoking.category <- c("urban.rural") parent_occupational.exposure <- c("urban.rural") parent_y <- c("urban.rural","smoking.category","occupational.exposure") parent_list <- list(parent_urban.rural, parent_smoking.category, parent_occupational.exposure, parent_y) node_vec <- c("urban.rural","smoking.category","occupational.exposure", "y") model_list=automatic_fit(data=Hordaland_data, parent_list=parent_list, node_vec=node_vec, prev=.09) # model_list$data objects have fitting weights # included in data frame # Including weight column in data # necessary if Bootstrapping CIs out <- average_paf(data=model_list[[length(model_list)]]$data, model_list=model_list, parent_list=parent_list, node_vec=node_vec, prev=.09, nperm=10, riskfactor_vec = c("urban.rural","occupational.exposure"),ci=FALSE) print(out)
Estimate pathway specific population attributable fractions
ps_paf( response_model, mediator_models, riskfactor, refval, data, prev = NULL, ci = FALSE, boot_rep = 50, ci_level = 0.95, ci_type = c("norm"), weight_vec = NULL, verbose = TRUE )
ps_paf( response_model, mediator_models, riskfactor, refval, data, prev = NULL, ci = FALSE, boot_rep = 50, ci_level = 0.95, ci_type = c("norm"), weight_vec = NULL, verbose = TRUE )
response_model |
A R model object for a binary outcome that involves a risk factor, confounders and mediators of the risk factor outcome relationship. Note that a weighted model should be used for case control data. Non-linear effects should be specified via ns(x, df=y), where ns is the natural spline function from the splines library. |
mediator_models |
A list of fitted models describing the risk factor/mediator relationship (the predictors in the model will be the risk factor and any confounders) Note a weighted model should be fit when data arise from a case control study. Models can be specified for linear responses (lm), binary responses (glm) and ordinal factors (through polr). Non-linear effects should be specified via ns(x, df=y), where ns is the natural spline function from the splines library. |
riskfactor |
character. Represents the name of the risk factor |
refval |
For factor valued risk factors, the reference level of the risk factor. If the risk factor is numeric, the reference level is assumed to be 0. |
data |
dataframe. A dataframe (with no missing values) containing the data used to fit the mediator and response models. You can run data_clean to the input dataset if the data has missing values as a pre-processing step |
prev |
numeric. A value between 0 and 1 specifying the prevalence of disease: only relevant to set if data arises from a case control study. |
ci |
logical. If TRUE a confidence interval is calculated using Bootstrap |
boot_rep |
Integer. Number of bootstrap replications (Only necessary to specify if ci=TRUE). Note that at least 50 replicates are recommended to achieve stable estimates of standard error. In the examples below, values of boot_rep less than 50 are sometimes used to limit run time. |
ci_level |
Numeric. Default 0.95. A number between 0 and 1 specifying the confidence level (only necessary to specify when ci=TRUE) |
ci_type |
Character. Default norm. A vector specifying the types of confidence interval desired. "norm", "basic", "perc" and "bca" are the available methods |
weight_vec |
An optional vector of inverse sampling weights for survey data (note that variance will not be calculated correctly if sampling isn't independent). Note that this will be ignored if prev is specified and calculation_method="D", in which case the weights will be constructed so the empirical re-weighted prevalence of disease is equal to prev |
verbose |
A logical indicator for whether extended output is produced when ci=TRUE, default TRUE |
A numeric vector (if ci=FALSE), or object (or class pspaf) (if CI=TRUE) with estimated PS-PAF for each mediator referred to in mediator_models, together with estimated direct PS-PAF and possibly confidence intervals.
Pathway specific Population attributable fractions. O’Connell, M.M. and Ferguson, J.P., 2022. IEA. International Journal of Epidemiology, 1, p.13. Accessible at: https://academic.oup.com/ije/advance-article/doi/10.1093/ije/dyac079/6583255?login=true
library(splines) library(survival) library(parallel) options(boot.parallel="snow") # User could set the next option to number of cores on machine: options(boot.ncpus=2) # Direct and pathway specific attributable fractions estimated # on simulated case control stroke data: # Note that the models here are weighted regressions (based on a column in the # dataframe named 'weights') to rebalance the case control structure to make it # representative over the population, according to the prev argument. # Unweighted regression is fine to use if the data arises from cohort or # cross sectional studies, in which case prev should be set to NULL response_model <- glm(case ~ region * ns(age, df = 5) + sex * ns(age, df = 5) + education + exercise + ns(diet, df = 3) + smoking + alcohol + stress + ns(lipids, df = 3) + ns(waist_hip_ratio, df = 3) + high_blood_pressure, data=stroke_reduced,family='binomial', weights=weights) mediator_models <- list(glm(high_blood_pressure ~ region * ns(age, df = 5) + sex * ns(age, df = 5) + education +exercise + ns(diet, df = 3) + smoking + alcohol + stress,data=stroke_reduced,family='binomial',weights=weights), lm(lipids ~ region * ns(age, df = 5) + sex * ns(age, df = 5) +education + exercise + ns(diet, df = 3) + smoking + alcohol + stress, weights=weights, data=stroke_reduced),lm(waist_hip_ratio ~ region * ns(age, df = 5) + sex * ns(age, df = 5) + education + exercise + ns(diet, df = 3) + smoking + alcohol + stress, weights=weights, data=stroke_reduced)) # point estimate ps_paf(response_model=response_model, mediator_models=mediator_models , riskfactor="exercise",refval=0,data=stroke_reduced,prev=0.0035, ci=FALSE) # confidence intervals ps_paf(response_model=response_model, mediator_models=mediator_models , riskfactor="exercise",refval=0,data=stroke_reduced,prev=0.0035, ci=TRUE, boot_rep=100,ci_type="norm")
library(splines) library(survival) library(parallel) options(boot.parallel="snow") # User could set the next option to number of cores on machine: options(boot.ncpus=2) # Direct and pathway specific attributable fractions estimated # on simulated case control stroke data: # Note that the models here are weighted regressions (based on a column in the # dataframe named 'weights') to rebalance the case control structure to make it # representative over the population, according to the prev argument. # Unweighted regression is fine to use if the data arises from cohort or # cross sectional studies, in which case prev should be set to NULL response_model <- glm(case ~ region * ns(age, df = 5) + sex * ns(age, df = 5) + education + exercise + ns(diet, df = 3) + smoking + alcohol + stress + ns(lipids, df = 3) + ns(waist_hip_ratio, df = 3) + high_blood_pressure, data=stroke_reduced,family='binomial', weights=weights) mediator_models <- list(glm(high_blood_pressure ~ region * ns(age, df = 5) + sex * ns(age, df = 5) + education +exercise + ns(diet, df = 3) + smoking + alcohol + stress,data=stroke_reduced,family='binomial',weights=weights), lm(lipids ~ region * ns(age, df = 5) + sex * ns(age, df = 5) +education + exercise + ns(diet, df = 3) + smoking + alcohol + stress, weights=weights, data=stroke_reduced),lm(waist_hip_ratio ~ region * ns(age, df = 5) + sex * ns(age, df = 5) + education + exercise + ns(diet, df = 3) + smoking + alcohol + stress, weights=weights, data=stroke_reduced)) # point estimate ps_paf(response_model=response_model, mediator_models=mediator_models , riskfactor="exercise",refval=0,data=stroke_reduced,prev=0.0035, ci=FALSE) # confidence intervals ps_paf(response_model=response_model, mediator_models=mediator_models , riskfactor="exercise",refval=0,data=stroke_reduced,prev=0.0035, ci=TRUE, boot_rep=100,ci_type="norm")
Internal, pathway specific PAF when the mediator is discrete
pspaf_discrete( data, refval, riskfactor_col, mediator_col, mediator_model, response_model, weight_vec )
pspaf_discrete( data, refval, riskfactor_col, mediator_col, mediator_model, response_model, weight_vec )
data |
dataframe. A dataframe (with no missing values) containing the data used to fit the mediator and response models. You can run data_clean to the input dataset if the data has missing values as a pre-processing step |
refval |
For factor valued risk factors, the reference level of the risk factor. If the risk factor is numeric, the reference level is assumed to be 0 |
riskfactor_col |
Integer indicator for the risk factor column in data |
mediator_col |
Integer indicator for the discrete mediator column in data |
mediator_model |
A glm or polr model for the mediator, depending on the same confounders and risk factor as specified in the response model. |
response_model |
A R model object for a binary outcome that involves a risk factor, confounders and mediators of the risk factor outcome relationship. Note that a weighted model should be used for case control data. Non-linear effects should be specified via ns(x, df=y), where ns is the natural spline function from the splines library. |
weight_vec |
A numeric column of weights |
A numeric vector (if ci=FALSE), or data frame (if CI=TRUE) containing estimated PS-PAF for each mediator referred to in mediator_models, together with estimated direct PS-PAF and possibly confidence intervals.
Create a rf.data.frame object for risk factors, prevalence and risk ratios. This will be used in fan plots and nomograms (by simply sending the rf.dat.frame object to plot)
rf_summary(rf_names, rf_prev, risk, log = FALSE)
rf_summary(rf_names, rf_prev, risk, log = FALSE)
rf_names |
A character vector of risk factor names |
rf_prev |
A numeric vector specifying prevalence of risk factor in disease controls (estimates of population prevalence can also be used if the disease is rare) |
risk |
A numeric vector of relative risks or Odds ratios for disease corresponding to each risk factor (if log=FALSE). Log-relative risks or log-odds ratios can be alternatively specified (if log=TRUE) |
log |
default TRUE. Set to TRUE if relative risks/odds ratios are specified on log-scale |
A rf.data.frame object
Ferguson, J., O’Leary, N., Maturo, F., Yusuf, S. and O’Donnell, M., 2019. Graphical comparisons of relative disease burden across multiple risk factors. BMC medical research methodology, 19(1), pp.1-9.
library(ggplot2) rfs <- rf_summary(rf_names=c('Hypertension','Inactivity','ApoB/ApoA','Diet', 'WHR','Smoking','Cardiac causes','Alcohol','Global Stress','Diabetes'), rf_prev=c(.474,.837,.669,.67,.67,.224,.049,.277,.144,.129), risk=c(1.093,0.501,0.428,0.378,0.294,0.513,1.156,0.186,0.301,0.148),log=TRUE) # fanplot plot(rfs,fan.point.size=4,fan.label.size=4, fan.legend.text.size=10,fan.legend.title.size=10, fan.axis.text.size=10,fan.axis.title.size=10) # nomogram plot(rfs,nomogram.label.size=6, nomogram.axis.text.size=6, type="n") # reverse nomogram plot(rfs,nomogram.label.size=6, nomogram.axis.text.size=6, type="rn")
library(ggplot2) rfs <- rf_summary(rf_names=c('Hypertension','Inactivity','ApoB/ApoA','Diet', 'WHR','Smoking','Cardiac causes','Alcohol','Global Stress','Diabetes'), rf_prev=c(.474,.837,.669,.67,.67,.224,.049,.277,.144,.129), risk=c(1.093,0.501,0.428,0.378,0.294,0.513,1.156,0.186,0.301,0.148),log=TRUE) # fanplot plot(rfs,fan.point.size=4,fan.label.size=4, fan.legend.text.size=10,fan.legend.title.size=10, fan.axis.text.size=10,fan.axis.title.size=10) # nomogram plot(rfs,nomogram.label.size=6, nomogram.axis.text.size=6, type="n") # reverse nomogram plot(rfs,nomogram.label.size=6, nomogram.axis.text.size=6, type="rn")
Return the vector of risk quantiles for a continuous risk factor.
risk_quantiles( riskfactor, data, model, S = 1, q = seq(from = 0.01, to = 0.99, by = 0.01) )
risk_quantiles( riskfactor, data, model, S = 1, q = seq(from = 0.01, to = 0.99, by = 0.01) )
riskfactor |
The name of the risk factor of interest in the dataset. A character vector |
data |
A dataframe containing variables used to fit the model |
model |
The fitted model |
S |
The number of randomly selected individuals for which risk is measured (defaults to 1). Let to perhaps 100 if risk factor involved in interactions in model |
q |
The desired risk quantiles |
A named vector of size S giving the risk factor quantiles
Calculation of sequential PAF taking into account risk factor sequencing
seq_paf( data, model_list, parent_list, node_vec, prev = NULL, riskfactor_vec = NULL, ci = FALSE, boot_rep = 50, ci_type = c("norm"), ci_level = 0.95, nsim = 1, weight_vec = NULL, verbose = TRUE )
seq_paf( data, model_list, parent_list, node_vec, prev = NULL, riskfactor_vec = NULL, ci = FALSE, boot_rep = 50, ci_type = c("norm"), ci_level = 0.95, nsim = 1, weight_vec = NULL, verbose = TRUE )
data |
Data frame. A dataframe containing variables used for fitting the models. Must contain all variables used in fitting |
model_list |
List. A list of fitted model objects corresponding for the outcome variables in node_vec, with parents as described in parent_vec. Linear (lm), logistic (glm) and ordinal (polr) objects are allowed. This list must be in the same order as node_vec and parent_list. Non-linear effects should be specified via ns(x, df=y), where ns is the natural spline function from the splines library. |
parent_list |
A list. The ith element is the vector of variable names that are direct causes of ith variable in node_vec |
node_vec |
A vector corresponding to the nodes in the Bayesian network. This must be specified from root to leaves - that is ancestors in the causal graph for a particular node are positioned before their descendants. If this condition is false the function will return an error. |
prev |
prevalence of the disease (default is NULL) |
riskfactor_vec |
A character vector of riskfactors. Sequential PAF is calculated for the risk factor specified in the last position of the vector, conditional on the other risk factors |
ci |
Logical. If TRUE, a bootstrap confidence interval is computed along with a point estimate (default FALSE). If ci=FALSE, only a point estimate is produced. A simulation procedure (sampling permutations and also simulating the effects of eliminating risk factors over the descendant nodes in a Bayesian network) is required to produce the point estimates. The point estimate will change on repeated runs of the function. The margin of error of the point estimate is given when ci=FALSE |
boot_rep |
Integer. Number of bootstrap replications (Only necessary to specify if ci=TRUE). Note that at least 50 replicates are recommended to achieve stable estimates of standard error. In the examples below, values of boot_rep less than 50 are sometimes used to limit run time. |
ci_type |
Character. Default norm. A vector specifying the types of confidence interval desired. "norm", "basic", "perc" and "bca" are the available methods |
ci_level |
Numeric. Confidence level. Default 0.95 |
nsim |
Numeric. Number of independent simulations of the dataset. Default of 1 |
weight_vec |
An optional vector of inverse sampling weights (note with survey data, the variance may not be calculated correctly if sampling isn't independent). Note that this vector will be ignored if prev is specified, and the weights will be calibrated so that the weighted sample prevalence of disease equals prev. This argument can be ignored if data has a column weights with correctly calibrated weights |
verbose |
A logical indicator for whether extended output is produced when ci=TRUE, default TRUE |
A numeric estimate of sequential PAF (if ci=FALSE), or else a saf object, giving estimates and confidence limits of sequential PAF (if ci=TRUE)
Ferguson, J., O’Connell, M. and O’Donnell, M., 2020. Revisiting sequential attributable fractions. Archives of Public Health, 78(1), pp.1-9.
library(splines) library(survival) library(parallel) options(boot.parallel="snow") options(boot.ncpus=2) # The above could be set to the number of available cores on the machine # Simulated data on occupational and environmental exposure to # chronic cough from Eide, 1995 # First specify the causal graph, in terms of the parents of each node. # Then put into a list. parent_urban.rural <- c() parent_smoking.category <- c("urban.rural") parent_occupational.exposure <- c("urban.rural") parent_y <- c("urban.rural","smoking.category","occupational.exposure") parent_list <- list(parent_urban.rural, parent_smoking.category, parent_occupational.exposure, parent_y) # also specify nodes of graph, in order from root to leaves node_vec <- c("urban.rural","smoking.category","occupational.exposure", "y") # specify a model list according to parent_list # here we use the auxillary function 'automatic fit' model_list=automatic_fit(data=Hordaland_data, parent_list=parent_list, node_vec=node_vec, prev=.09) # sequential PAF for occupational exposure conditional on elimination of urban.rural # Including weight column in data # necessary if Bootstrapping CIs seq_paf(data=model_list[[length(model_list)]]$data, model_list=model_list, parent_list=parent_list, node_vec=node_vec, prev=.09, riskfactor_vec = c("urban.rural", "occupational.exposure"),ci=FALSE) # More complicated example (slower to run) parent_exercise <- c("education") parent_diet <- c("education") parent_smoking <- c("education") parent_alcohol <- c("education") parent_stress <- c("education") parent_high_blood_pressure <- c("education","exercise","diet","smoking","alcohol", "stress") parent_lipids <- c("education","exercise","diet","smoking","alcohol","stress") parent_waist_hip_ratio <- c("education","exercise","diet","smoking", "alcohol","stress") parent_early_stage_heart_disease <- c("education","exercise","diet", "smoking","alcohol","stress","lipids","waist_hip_ratio","high_blood_pressure") parent_diabetes <- c("education","exercise","diet","smoking","alcohol", "stress","lipids","waist_hip_ratio","high_blood_pressure") parent_case <- c("education","exercise","diet","smoking","alcohol", "stress","lipids","waist_hip_ratio","high_blood_pressure", "early_stage_heart_disease","diabetes") parent_list <- list(parent_exercise,parent_diet,parent_smoking,parent_alcohol, parent_stress,parent_high_blood_pressure,parent_lipids,parent_waist_hip_ratio, parent_early_stage_heart_disease,parent_diabetes,parent_case) node_vec=c("exercise","diet","smoking","alcohol","stress","high_blood_pressure", "lipids","waist_hip_ratio","early_stage_heart_disease","diabetes","case") model_list=automatic_fit(data=stroke_reduced, parent_list=parent_list, node_vec=node_vec, prev=.0035,common="region*ns(age,df=5)+sex*ns(age,df=5)", spline_nodes = c("waist_hip_ratio","lipids","diet")) # calculate sequential PAF for stress, conditional on smoking # and blood pressure being eliminated from the population seqpaf <- seq_paf(data=stroke_reduced, model_list=model_list, parent_list= parent_list, node_vec=node_vec, prev=.0035, riskfactor_vec = c("high_blood_pressure", "smoking","stress"),ci=TRUE,boot_rep=10)
library(splines) library(survival) library(parallel) options(boot.parallel="snow") options(boot.ncpus=2) # The above could be set to the number of available cores on the machine # Simulated data on occupational and environmental exposure to # chronic cough from Eide, 1995 # First specify the causal graph, in terms of the parents of each node. # Then put into a list. parent_urban.rural <- c() parent_smoking.category <- c("urban.rural") parent_occupational.exposure <- c("urban.rural") parent_y <- c("urban.rural","smoking.category","occupational.exposure") parent_list <- list(parent_urban.rural, parent_smoking.category, parent_occupational.exposure, parent_y) # also specify nodes of graph, in order from root to leaves node_vec <- c("urban.rural","smoking.category","occupational.exposure", "y") # specify a model list according to parent_list # here we use the auxillary function 'automatic fit' model_list=automatic_fit(data=Hordaland_data, parent_list=parent_list, node_vec=node_vec, prev=.09) # sequential PAF for occupational exposure conditional on elimination of urban.rural # Including weight column in data # necessary if Bootstrapping CIs seq_paf(data=model_list[[length(model_list)]]$data, model_list=model_list, parent_list=parent_list, node_vec=node_vec, prev=.09, riskfactor_vec = c("urban.rural", "occupational.exposure"),ci=FALSE) # More complicated example (slower to run) parent_exercise <- c("education") parent_diet <- c("education") parent_smoking <- c("education") parent_alcohol <- c("education") parent_stress <- c("education") parent_high_blood_pressure <- c("education","exercise","diet","smoking","alcohol", "stress") parent_lipids <- c("education","exercise","diet","smoking","alcohol","stress") parent_waist_hip_ratio <- c("education","exercise","diet","smoking", "alcohol","stress") parent_early_stage_heart_disease <- c("education","exercise","diet", "smoking","alcohol","stress","lipids","waist_hip_ratio","high_blood_pressure") parent_diabetes <- c("education","exercise","diet","smoking","alcohol", "stress","lipids","waist_hip_ratio","high_blood_pressure") parent_case <- c("education","exercise","diet","smoking","alcohol", "stress","lipids","waist_hip_ratio","high_blood_pressure", "early_stage_heart_disease","diabetes") parent_list <- list(parent_exercise,parent_diet,parent_smoking,parent_alcohol, parent_stress,parent_high_blood_pressure,parent_lipids,parent_waist_hip_ratio, parent_early_stage_heart_disease,parent_diabetes,parent_case) node_vec=c("exercise","diet","smoking","alcohol","stress","high_blood_pressure", "lipids","waist_hip_ratio","early_stage_heart_disease","diabetes","case") model_list=automatic_fit(data=stroke_reduced, parent_list=parent_list, node_vec=node_vec, prev=.0035,common="region*ns(age,df=5)+sex*ns(age,df=5)", spline_nodes = c("waist_hip_ratio","lipids","diet")) # calculate sequential PAF for stress, conditional on smoking # and blood pressure being eliminated from the population seqpaf <- seq_paf(data=stroke_reduced, model_list=model_list, parent_list= parent_list, node_vec=node_vec, prev=.0035, riskfactor_vec = c("high_blood_pressure", "smoking","stress"),ci=TRUE,boot_rep=10)
Internal: Simulate from the post intervention distribution corresponding to eliminating a risk factor
sim_outnode(data, col_num, current_mat, parent_list, col_list, model_list)
sim_outnode(data, col_num, current_mat, parent_list, col_list, model_list)
data |
Data frame. A dataframe containing the original variables used for fitting the models. Must contain all variables used in fitting |
col_num |
The indicator for the risk factor that is being eliminated |
current_mat |
The current value of the data frame |
parent_list |
A list. The ith element is the vector of variable names that are direct causes of ith variable in node_vec (Note that the variable names should be columns in data) |
col_list |
Column indicators for the variables in node_vec (note that node_vec is ordered from root to leaves) |
model_list |
List. A list of fitted models corresponding for the outcome variables in node_vec, with parents as described in parent_vec. This list must be in the same order as node_vec and parent_list. Models can be linear (lm), logistic (glm) or ordinal logistic (polr). Non-linear effects of variables (if necessary) should be specified via ns(x, df=y), where ns is the natural spline function from the splines library |
An updated data frame (a new version of current_mat) with new columns simulated for variables that the risk factor causally effects.
Dataset containing simulated data on risk factors for 6856 stroke cases and 6856 stroke control, based on risk factors and associations in the INTERSTROKE study
stroke_reduced
stroke_reduced
A data frame with 13712 rows and 19 variables:
Geographic region, 1: Western Europe, 2: Eastern/central Europe/Middle East 3: Africa, 4: South Asia, 5: China, 6: South East Asia, 7: South America
case control status, (1 for stroke cases)
Gender of individual, 0: male, 1:female
Age of individual
Smoking status, 0: Never, 1: Current
1: sometimes stressed, 0: never stressed
Waist hip ratio
Physical Activity. 1: mainly inactive, 0: mainly active
Alcohol history and frequency, 1:never, 2:low/moderate, 3:high intake
Diabetes, 0: No, 1: Yes
Healthy eating score (higher is better)
presence of risk factors for heart disease. 0: No, 1: yes
Ratio of Apolipoprotein B to Apolipoprotein A
Years of education. 1: No education, 2: 1-8 years, 3:9-12 years, 3:Technical college, 4: University
Diagnosed hypertension: 0 No, 1: yes
weights that are proportional to inverse sampling probabilities. We have scaled the weights to be 0.0035 for a case and 0.9965 for a control to reflect any approximate incidence of 1 serious stroke in every 0.9965/0.0035 person years in the population
simulated time variable (for illustrating survival models)
simulated event indicator (0 if censored, 1 if event happened): for illustrating survival models
Strata number based on sex and region. For illustrating conditional regression
Data simulated based on relationships described in https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(16)30506-2/fulltext