While there is growing evidence on the effectiveness of structured care for diabetic patients in trial settings, standard population level evaluations may misestimate intervention benefits due to patient selection. In order to account for potential biases in measuring intervention benefits, we tested the impact of calibration on margins as a novel adjustment method in an evaluation context compared with simple poststratification.
We conducted a before–after evaluation of HbA1c levels after 1 year of enrollment in a French diabetes provider network (DPN). Results from the unadjusted sample and from samples adjusted by simple poststratification were compared with results obtained after calibration on margins to the characteristics of the general diabetic population, which were drawn from a national cross-sectional sample of diabetic patients.
Both with and without adjustment, patients in the DPN had significantly lower HbA1c levels after 1 year of enrollment. However, the reductions in HbA1c levels were smaller in all adjusted samples, with the unadjusted estimate exceeding the adjusted estimates by 22–183%, regardless of the poststratification method and characteristics used. Compared with simple poststratification, estimations using calibration on margins exhibited higher performance.
Evaluations of diabetes management interventions based on uncontrolled before–after experiments may overestimate the actual benefit for patients. This can be corrected by using poststratification approaches when data on the ultimate target population for the intervention are available. In order to more accurately estimate the effect an intervention would have if extended to the target population, calibration on margins seems to be preferable over simple poststratification in terms of performance and usability.
Introduction
Health systems increasingly rely on structured care interventions to better meet the needs of growing populations with diabetes and other chronic conditions (1,2). Such interventions, frequently also referred to as disease management care, typically include components such as enhanced coordination between providers, the systematic use of clinical guidelines, and patient education (3,4). In practice, the nature and settings in which structured care is delivered vary, ranging from discrete programs offered to a selected group of patients to multicomponent, population-based strategies offered as part of usual care (5). An example of structured diabetes care is the concept of the diabetes provider network (DPN) in France, an association of care providers with central coordination to ensure a predefined patient trajectory (6).
Structured care approaches are expected to improve the quality of care for persons with chronic conditions, enhance health outcomes, and, ultimately, reduce costs. However, although these aims are intuitively appealing, the evidence of whether they are achieved in practice remains uncertain. This is in part because good quality evidence from randomized controlled designs is typically limited to small populations or research settings and is therefore difficult to generalize (7,8). Conversely, the diffusion of structured care approaches into routine settings, for example, through rollout or implementation at population level, is rarely accompanied by rigorous evaluation (9,10). Instead, where evaluation is conducted, it often must rely on uncontrolled designs such as observational before–after studies, because randomized designs are not feasible for various reasons (10,11). Such evaluations are difficult to interpret, however, and are prone to misestimating the intervention effect, because they do not account for potential biases resulting, for example, from selective enrollment of patients likely to benefit from the intervention and from self-selection of health care professionals (12,13). This leads to differences in characteristics between the overall target population and the intervention group, which is a sample of that population (called intervention sample throughout this article).
In order to improve the available evidence on the effectiveness and impact of population-wide disease management programs, it is necessary to account for these potential biases and use methods that are both scientifically robust and feasible for evaluation in daily practice.
To this end, it is appropriate to use poststratification methods commonly used in survey analysis, which consist of a posteriori adjustments using auxiliary information at the target population level. These methods include simple poststratification, poststratification by regression, and, more recently, poststratification by calibration on margins introduced by Deville and Särndal (14–16). Poststratification aims to rebalance differences in the characteristics of the intervention sample and the overall target population by using mathematical methods of varying complexity. Simple poststratification methods are the most commonly used and rely upon the simplest approach, which assigns weights to strata of individuals using a ratio. Calibration on margins, which uses more complex functions based on an iterative process with distance functions, can also be applied (15). The literature indicates that this method is relatively simple to use and usually performs better than simple poststratification, as shown by the lower variance of the calibration estimator and by lower bias when used with small sample sizes (17,18). Moreover, calibration on margins can be conducted on a higher number of characteristics than simple poststratification, particularly when only aggregated data and no individual data are available (because no poststrata can be reconstructed).
In this study, we explored the use and usefulness of calibration on margins over no adjustment or commonly used simple poststratification for the evaluation of DPNs in France. We compared the estimated effect size of the intervention by using a before–after evaluation of structured care on an unadjusted sample, on samples adjusted by simple poststratification, and on samples adjusted by calibration on margins. In addition, we sought to compare results obtained from the different poststratification methods using several different calibration functions in order to understand their relative merits and performance levels, thereby informing their potential use in future evaluations.
Research Design and Methods
Data Sources
The intervention group was comprised of patients enrolled in a DPN for type 1 and 2 diabetes in the Paris region of France. Enrollment was usually recommended by the patient’s general practitioner. Services provided by the DPN included patient education and workshops; systematic patient assessment and follow-up by general practitioners (annual checkups at a minimum), dietitians, nurses, and podiatrists; interdisciplinary meetings and training for the professionals involved; and the use of clinical guidelines. At patient enrollment, an initial patient assessment was performed and documented in the network database. Permission was obtained from the National Data Safety Authority to store and use patient data for evaluation purposes.
Because data on the characteristics of the French diabetic population are not available at the national level, data for the reference population were drawn from the Echantillon National Témoin Représentatif de la Population Diabétique (ENTRED) study, a cross-sectional representative national survey of people treated for diabetes in France based on a random sample of adults who had claimed at least three reimbursements for oral hypoglycemic agents or insulin from the largest French health insurance fund over a 1-year period (19). Anonymized patient-level data from the ENTRED study were obtained from the French Institute for Public Health Surveillance.
Study Population and Measures
For data availability reasons, we included patients in the DPN who were enrolled in 2007 or 2008 (n = 549) and had completed the initial and the 1-year follow-up assessments with a full data set, resulting in an intervention group sample size of 232 patients.
ENTRED data included claims data, questionnaires, and clinical data for 2007 and 2008, forming a reference group of 2,485 patients, representing the diabetic population in France.
As HbA1c level is the most widely used diabetes outcome measure (20,21), we used HbA1c levels after 1 year of enrollment in the DPN as the primary outcome.
Measures for adjustment included basic demographic characteristics (age and sex) and clinical indicators (diabetes duration; treatment modality; and intermediate outcomes, including hypertension, blood lipids, smoking status, BMI, and glomerular filtration rate). We also included the statutory health insurance chronic disease coverage scheme status (affections de longue durée) as a proxy reflecting a patient’s higher level of comorbidity. We were unable to consider socioeconomic status, as relevant data were not collected by the DPN.
Poststratification Methods
Simple Poststratification
Poststratification in its classic form classifies the sample by poststratum (group of individuals) on a given characteristic and weights individuals in each group up to the population total of that group. Specifically, weights are computed with a ratio approach: the proportion of individuals in a given stratum in the overall target population is divided by the corresponding proportion in the intervention sample, so that underrepresented strata receive weights above 1 (22).
Characteristics used for poststratification are commonly age and sex and those that are associated with the outcome and thus impact the results when they differ between the intervention sample and the reference population (23). We therefore explored five sets of one, two, or three characteristics (Table 2), combining the standard age and sex with baseline HbA1c, which is highly correlated with variation in outcome (correlation coefficient = 0.602; P < 0.001).
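The ratio weighting used in simple poststratification can be illustrated with a minimal sketch. The function name and the 10-patient toy sample are hypothetical; only the ENTRED age-class shares are taken from Table 1. Each weight is the population share of the stratum divided by its share in the sample.

```python
def poststratification_weights(sample_strata, population_shares):
    """Return one weight per individual: population share / sample share of the stratum.

    sample_strata: list with the stratum label of each individual in the sample.
    population_shares: dict mapping stratum label -> share in the target population.
    """
    n = len(sample_strata)
    counts = {}
    for stratum in sample_strata:
        counts[stratum] = counts.get(stratum, 0) + 1
    weights = []
    for stratum in sample_strata:
        sample_share = counts[stratum] / n
        weights.append(population_shares[stratum] / sample_share)
    return weights


# Hypothetical toy sample of 10 patients classified by age class.
sample = ["<=55"] * 3 + ["56-70"] * 5 + [">70"] * 2
# Age-class shares in the reference population (ENTRED column of Table 1).
population = {"<=55": 0.23, "56-70": 0.40, ">70": 0.37}

w = poststratification_weights(sample, population)
# Weighted share of the oldest class now matches the population share.
share_over_70 = sum(wk for wk, s in zip(w, sample) if s == ">70") / sum(w)
```

In this toy case the oldest age class is underrepresented in the sample (20% versus 37%), so its members receive a weight above 1, while the other classes are weighted down.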
Calibration on Margins
Calibration on margins generates an adjusted sample by assigning a calibration weight (coefficient) to each individual through an iterative process starting from initial weights dk (usually the “sampling weights,” equal to the inverse of each individual’s probability of being included in the intervention). At each iteration, new “calibration weights” wk are computed that are as close as possible to the initial weights, as measured by a given distance function, while reproducing the known margins of the target population. Several mathematical functions are available, and we explored four of them: linear, raking ratio, logit, and truncated linear (see Supplementary Data for equations) (15). The two linear functions are based on quadratic functions, with the particularity that the truncated linear function always yields positive weights. The raking ratio and logit functions are logarithmic functions. A fifth function, hyperbolic sine, was not available for SAS and R use and thus was not used in our analysis.
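As an illustration of the iterative process, the raking ratio function applied to categorical margins can be sketched as iterative proportional fitting: weights are rescaled to match each margin in turn until all margins are reproduced. This is a simplified sketch, not the Calmar macro used in the study; the function names, toy sample, and target margins are hypothetical, and the sketch assumes every category is present in the sample.

```python
def weighted_totals(weights, codes):
    """Sum of weights per category of one variable."""
    totals = {}
    for wk, c in zip(weights, codes):
        totals[c] = totals.get(c, 0.0) + wk
    return totals


def rake(initial_weights, variables, margins, max_iter=200, tol=1e-10):
    """Raking ratio calibration on categorical margins (iterative proportional fitting).

    initial_weights: starting weights d_k, one per individual.
    variables: list of lists, category of each individual for each variable.
    margins: list of dicts, target population total per category for each variable.
    """
    w = list(initial_weights)
    for _ in range(max_iter):
        largest_adjustment = 0.0
        for codes, target in zip(variables, margins):
            totals = weighted_totals(w, codes)
            # Multiplicative factor that makes this margin exact.
            factors = {c: target[c] / totals[c] for c in target}
            w = [wk * factors[c] for wk, c in zip(w, codes)]
            largest_adjustment = max(
                largest_adjustment, max(abs(f - 1.0) for f in factors.values())
            )
        if largest_adjustment < tol:  # all margins already matched
            break
    return w


# Hypothetical sample of 8 individuals and hypothetical target margins (totals of 100).
age = ["a", "a", "b", "b", "c", "c", "a", "b"]
sex = ["m", "f", "m", "f", "m", "f", "f", "m"]
age_margin = {"a": 23.0, "b": 40.0, "c": 37.0}
sex_margin = {"m": 54.0, "f": 46.0}

w = rake([12.5] * 8, [age, sex], [age_margin, sex_margin])
age_totals = weighted_totals(w, age)
sex_totals = weighted_totals(w, sex)
```

Because the adjustment is multiplicative, the resulting weights stay positive whenever the initial weights are positive. Only the margins of each variable are needed, not the cross-classified population counts.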
To our knowledge, no literature provides specific guidance on the types of characteristics that should be used for adjusting a sample applying this method in the context of a health intervention evaluation of diabetes care. We therefore used all characteristics that are considered to be associated with differences in outcomes in diabetic patients (23) and that may explain differences in before–after assessment of the DPN population: demographic characteristics, diabetes information, and other clinical characteristics (Table 1), regardless of whether they significantly differed between the DPN and ENTRED at baseline to ensure that they would not significantly differ between the two populations after adjustment.
Characteristics | Reference population, ENTRED (n = 2,485) | Intervention group, DPN (n = 232) |
---|---|---|
Demographics | ||
Age, years | 64.59 ± 12.47 | 60.32 ± 12.51*** |
≤55 | 23% | 32%*** |
56–70 | 40% | 45%*** |
>70 | 37% | 22%*** |
Sex, male | 54% | 55% |
Diabetes information | ||
Duration of diabetes, years | ||
<4 | 23% | 41%*** |
4–9 | 25% | 26%*** |
9–16 | 24% | 14%*** |
>16 | 28% | 19%*** |
Treatment modus | ||
One oral HA | 42% | 38% |
Two or more oral HA | 35% | 38% |
Insulin and no, one, or more oral HA | 23% | 24% |
Other clinical characteristics | ||
Hypertension grade§ | ||
0 | 63% | 63% |
1, 2, or 3 | 37% | 37% |
LDL, g/L | ||
≤1.3 | 78% | 67%*** |
>1.3 | 22% | 33%*** |
HDL, g/L | ||
>0.4 (female) or 0.35 (male) | 87% | 88% |
≤0.4 (female) or 0.35 (male) | 13% | 12% |
Triglycerides, g/L | ||
≤1.5 | 63% | 66% |
>1.5 | 37% | 34% |
ALD, % yes | 87% | 84% |
Smoking status, % smoker | 14% | 15% |
BMI, kg/m² | ||
<25 | 22% | 19% |
25–30 | 38% | 42% |
>30 | 40% | 39% |
GFR, mL/min | ||
Level 1: >90 | 42% | 45% |
Level 2: 60–89 | 36% | 38% |
Level 3, 4, or 5: 0–59 | 22% | 17% |
DPN assessed outcome | ||
HbA1c, % | 7.14 ± 1.19 | 7.75 ± 1.78*** |
HbA1c, mmol/mol | 54.5 ± 13 | 61.2 ± 19.5*** |
<7% (<53 mmol/mol) | 50% | 34%*** |
7–8% (53–64 mmol/mol) | 30% | 28%*** |
>8% (>64 mmol/mol) | 20% | 38%*** |
Data are mean ± SD unless otherwise noted. ALD, affections de longue durée, statutory health insurance chronic disease coverage scheme; GFR, glomerular filtration rate; HA, hypoglycemic agent.
***P ≤ 0.001.
§According to the World Health Organization (33).
We explored three sets of characteristics: one with all characteristics excluding HbA1c at baseline; one with all characteristics including HbA1c at baseline; and one with age, sex, and HbA1c at baseline only. This last set was used to compare the performance of calibration and simple poststratification on the same set of adjustment characteristics.
We therefore constructed 12 adjusted samples by calibration using the four functions on three sets of characteristics.
Analysis
We first used descriptive statistics to check for differences in patient characteristics between the intervention group and the reference population.
Second, we compared the results of a before–after analysis, with HbA1c level after 1 year of enrollment in the DPN as the primary outcome, on the initial sample, on the calibration-adjusted samples, and on the simple poststratification-adjusted samples. Changes in HbA1c were assessed on the mean level and on two categorical measures: the number of patients whose HbA1c levels changed from >7 to ≤7% (24,25) and the number of patients whose HbA1c levels fell by ≥0.5% (26,27).
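The three outcome measures can be computed on an unadjusted or a weighted sample in the same way, as in the following minimal sketch. The function name and the four-patient data are hypothetical, and the sketch assumes that the two proportions are expressed over all patients in the sample.

```python
def before_after_summary(baseline, followup, weights=None):
    """Weighted before-after outcomes for HbA1c values given in %.

    Returns (mean reduction, share moving from >7 to <=7, share falling by >=0.5).
    With weights=None every patient counts equally (unadjusted sample).
    """
    if weights is None:
        weights = [1.0] * len(baseline)
    total = sum(weights)
    rows = list(zip(weights, baseline, followup))
    mean_reduction = sum(w * (b - f) for w, b, f in rows) / total
    reached_target = sum(w for w, b, f in rows if b > 7.0 and f <= 7.0) / total
    fell_half_point = sum(w for w, b, f in rows if b - f >= 0.5) / total
    return mean_reduction, reached_target, fell_half_point


# Hypothetical HbA1c values (%) at enrollment and after 1 year.
baseline = [8.0, 7.5, 6.5, 9.0]
followup = [7.0, 7.5, 6.5, 8.0]

unadjusted = before_after_summary(baseline, followup)
# Hypothetical adjustment weights that down-weight the patients with high baseline HbA1c.
adjusted = before_after_summary(baseline, followup, weights=[0.5, 1.5, 1.5, 0.5])
```

In this toy example, down-weighting the patients with poor baseline control halves the estimated mean reduction, which mirrors the direction of the adjustment effect examined in the study.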
Finally, in order to compare the performance of simple poststratifications and the different calibration functions used in this evaluation context, we compared them in terms of SE and weight dispersion measured by the design effect, with a higher design effect expressing lower dispersion, which is considered preferable (28).
All analyses were performed using SAS version 9.2 (SAS Institute, Cary, NC) and the “Calmar” macro developed by the French National Institute for Statistics and Economic Studies (29). (Note that other software for calibration includes the R package sampling, g-Calib for SPSS, and Bascula for Blaise.) P values < 0.05 were considered significant. All P values are two-sided. For calibration, continuous variables were transformed into categorical variables.
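The transformation of a continuous variable into categories can be sketched as follows for HbA1c, using the three classes shown in Table 1. The function name and the handling of the boundary values (7.0% and 8.0% assigned to the middle class) are assumptions made for illustration.

```python
def hba1c_class(value):
    """Map an HbA1c value (%) to the three classes of Table 1.

    Boundary handling is an assumption: 7.0 and 8.0 fall in the middle class.
    """
    if value < 7.0:
        return "<7%"
    if value <= 8.0:
        return "7-8%"
    return ">8%"


# Hypothetical values.
classes = [hba1c_class(v) for v in (6.9, 7.0, 8.0, 8.1)]
```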
Results
Baseline Characteristics
Patient characteristics at baseline are presented in Table 1.
Compared with the ENTRED population, patients enrolled in the DPN were younger, had a more recent diabetes diagnosis, and had higher LDL cholesterol. Glycemic control as measured by HbA1c was worse in the DPN patients.
Regardless of the sample used, results of the before–after analysis revealed that DPN patients had significant reductions in HbA1c levels after 1 year of enrollment (Table 2 and Fig. 1). However, when the initial sample was adjusted, the reductions in all outcome measures (mean HbA1c and the percentages of patients whose HbA1c status changed from >7 to ≤7% or fell by ≥0.5%) were smaller regardless of the poststratification method and the set of characteristics used (Table 2).
Type of adjustment | Characteristics used for adjustment | Function | Mean HbA1c change after 1 year, % | 95% CI | SD | SE | Relative change from no adjustment, % | Patients with HbA1c change from >7 to ≤7%, % | Patients with HbA1c decrease ≥0.5%, % | Design effect ×10³ |
---|---|---|---|---|---|---|---|---|---|---|
1. No adjustment | | | 0.497*** | 0.308–0.685 | 1.456 | 0.096 | 0 | 25.86 | 41.38 | 0.00 |
2. Simple poststratification | Age/sex | NA | 0.406*** | 0.231–0.581 | 1.354 | 0.089 | −22 | 24.77 | 37.47 | 4.31 |
2. Simple poststratification | HbA1c | NA | 0.250** | 0.089–0.412 | 1.248 | 0.082 | −99 | 21.53 | 32.87 | 4.31 |
2. Simple poststratification | Age/HbA1c | NA | 0.208** | 0.054–0.361 | 1.190 | 0.078 | −139 | 21.77 | 31.17 | 4.31 |
2. Simple poststratification | Sex/HbA1c | NA | 0.250** | 0.09–0.411 | 1.242 | 0.082 | −99 | 21.50 | 32.78 | 4.31 |
2. Simple poststratification | Age/sex/HbA1c | NA | 0.223** | 0.067–0.379 | 1.205 | 0.079 | −123 | 21.55 | 30.35 | 4.31 |
3. Advanced poststratification: calibration on margins | All patient characteristics at baseline§ without HbA1c | Linear | 0.367*** | 0.208–0.526 | 1.232 | 0.081 | −35 | 24.56 | 37.25 | 6.08 |
3. Advanced poststratification: calibration on margins | All patient characteristics at baseline§ without HbA1c | Raking | 0.369*** | 0.211–0.528 | 1.226 | 0.081 | −35 | 24.73 | 37.00 | 6.27 |
3. Advanced poststratification: calibration on margins | All patient characteristics at baseline§ without HbA1c | Logit | 0.371*** | 0.213–0.53 | 1.226 | 0.081 | −34 | 24.82 | 37.13 | 6.25 |
3. Advanced poststratification: calibration on margins | All patient characteristics at baseline§ without HbA1c | Truncated linear | 0.366*** | 0.209–0.523 | 1.213 | 0.080 | −36 | 24.43 | 37.38 | 6.09 |
3. Advanced poststratification: calibration on margins | Age/sex/HbA1c | Linear | 0.208** | 0.055–0.361 | 1.200 | 0.078 | −138 | 21.54 | 31.12 | 5.37 |
3. Advanced poststratification: calibration on margins | Age/sex/HbA1c | Raking | 0.235** | 0.081–0.388 | 1.207 | 0.078 | −112 | 21.68 | 31.41 | 5.41 |
3. Advanced poststratification: calibration on margins | Age/sex/HbA1c | Logit | 0.235** | 0.081–0.389 | 1.208 | 0.078 | −111 | 21.68 | 31.41 | 5.41 |
3. Advanced poststratification: calibration on margins | Age/sex/HbA1c | Truncated linear | 0.208** | 0.055–0.361 | 1.200 | 0.078 | −138 | 21.54 | 31.12 | 5.37 |
3. Advanced poststratification: calibration on margins | All patient characteristics at baseline§ with HbA1c | Linear | 0.175* | 0.036–0.314 | 1.084 | 0.071 | −183 | 19.82 | 29.28 | 6.72 |
3. Advanced poststratification: calibration on margins | All patient characteristics at baseline§ with HbA1c | Raking | 0.200** | 0.062–0.339 | 1.068 | 0.070 | −148 | 20.15 | 29.65 | 7.29 |
3. Advanced poststratification: calibration on margins | All patient characteristics at baseline§ with HbA1c | Logit | 0.207** | 0.068–0.345 | 1.069 | 0.070 | −141 | 20.12 | 29.80 | 7.19 |
3. Advanced poststratification: calibration on margins | All patient characteristics at baseline§ with HbA1c | Truncated linear | 0.181** | 0.049–0.313 | 1.018 | 0.067 | −175 | 19.75 | 29.89 | 6.81 |
A high design effect is considered favorable. NA, not applicable.
***P ≤ 0.001,
**P ≤ 0.01,
*P ≤ 0.05.
§Demographics, diabetes information, and other clinical characteristics.
When HbA1c level at baseline was included as an adjustment variable in either simple poststratification or calibration on margins, the measured change in all outcome measures was markedly lower than in the samples not adjusted on HbA1c. For example, with the linear function for calibration on margins, the mean HbA1c reduction measured in the unadjusted sample exceeded the adjusted estimate by 35% when HbA1c level at baseline was not used for calibration and by 183% when HbA1c at baseline was included.
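The article does not state the formula behind the relative changes in Table 2. One reading that reproduces the tabulated values to within rounding is the difference between the unadjusted and adjusted mean reductions, expressed as a percentage of the adjusted reduction. The sketch below checks this reading on three rows of Table 2; the function name is hypothetical.

```python
def relative_overestimation(unadjusted, adjusted):
    """Percentage by which the unadjusted reduction exceeds the adjusted one.

    Inferred reading of the 'relative change' column, not a formula given in the article.
    """
    return 100.0 * (unadjusted - adjusted) / adjusted


unadjusted_change = 0.497  # mean HbA1c change, no adjustment (Table 2)

age_sex = relative_overestimation(unadjusted_change, 0.406)      # simple poststratification, age/sex
linear_no_hba1c = relative_overestimation(unadjusted_change, 0.367)  # calibration, linear, without HbA1c
linear_all = relative_overestimation(unadjusted_change, 0.175)   # calibration, linear, with HbA1c
```

With the rounded means from Table 2, the three results are close to the tabulated 22, 35, and 183%; small differences are expected because the published means are rounded to three decimals.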
When poststratification was not performed on the initial HbA1c level, adjustment by calibration on margins with all remaining characteristics measured lower achievement in all outcomes than simple poststratification on age and sex. Similarly, when poststratification was also performed on the initial HbA1c level, poststratification via calibration on margins with all patient characteristics measured lower changes in all outcomes than simple poststratification. This represents relative decreases from no adjustment of 141–183% for calibration on margins compared with relative decreases of 99–139% in simple poststratification samples (Table 2). However, calibration on margins showed results similar to those of simple poststratification when only age, sex, and HbA1c level at baseline were used for adjustment in both methods.
The results of the four calibration functions fell within a narrow range. For example, when all characteristics were used for calibration, the absolute HbA1c level change ranged from 0.18 to 0.21 (Table 2).
Performance of Adjustment Techniques
Based on the number of characteristics and the sample sizes used in the intervention group and reference population, all tested poststratification techniques were technically feasible and yielded robust results.
The design effect of poststratification via calibration on margins was consistently higher than in simple poststratification, indicating more favorable weight dispersion, including when similar characteristics (age, sex, and HbA1c at baseline) were used in both methods. Moreover, in calibration on margins, the largest set of characteristics used for adjustment yielded the lowest SE and the highest design effect (Table 2).
Finally, across the four calibration functions, the design effect was of a similar range, with the raking ratio function exhibiting the highest design effect compared with the linear, logit, and truncated linear functions for all sets of characteristics (Table 2).
Conclusions
To our knowledge, this is the first study to compare a range of adjustment methods, including calibration on margins, to evaluate a structured diabetes care intervention using a national cross-sectional sample of diabetic patients as the reference population. While the positive impact of the DPN remained significant, we found that before–after analysis without poststratification may have overestimated the effect of the DPN by 22–183% in terms of observed improvements in HbA1c levels. Furthermore, adjustment on HbA1c levels at baseline appears to be important for not overestimating the intervention effect. When compared with simple poststratification, change in the observed improvement is usually lower using calibration on margins, mostly because it allows adjustment on a greater number of characteristics. Moreover, estimations using calibration on margins exhibited higher performance with lower SEs and higher design effects, strongly suggesting that calibration on margins is the preferable adjustment method in this context. Finally, the four calibration functions that we tested all showed comparable performance.
As the analytical approach explored here has thus far not been documented in the peer-reviewed literature in the context of evaluation, it is difficult to compare our findings with work undertaken elsewhere. While it is well known that simple before–after evaluation can lead to substantial overestimation of the intervention effect in structured care (and, in rare cases, to underestimation), the size of this misestimation is not well understood (10). However, a recent evaluation of a diabetes disease management program in Austria using a cluster-randomized controlled trial found that HbA1c levels in the intervention group had decreased by 0.13% after 1 year. Conversely, using the same data and applying a simple before–after comparison, the effect size was measured as a decrease of 0.41% (13). Thus ∼68% of the effect measured by the before–after design (0.28 of 0.41%) was not confirmed by the “true” intervention effect established in the randomized controlled trial. While direct comparison of findings is impossible given the different methods used, the relative overestimation of the intervention effect as identified using poststratification methods appears to be within the same range. This suggests that poststratification and preferably calibration on margins may provide a useful evaluation approach to produce valid findings where more scientifically robust designs such as randomization are not possible.
We acknowledge some limitations regarding data quality, availability, and follow-up.
We included DPN patients in the analysis only if complete follow-up data and characteristics for calibration were available, which increased the likelihood of selection effects due to missing data within the DPN. Further, our analysis did not include characteristics on socioeconomic status, comorbidities, and diabetes type.
Despite using a proxy for comorbidities and the fact that the type of diabetes is in part reflected by age, disease duration, and treatment mode, we were likely unable to account for the full extent of patient selection. Moreover, there were missing data in the ENTRED reference population, which may render it a slightly biased representation of the overall target population. Despite these limitations, the data used reflect those available in the context of provider network evaluation, and the methods tested aim to support future pragmatic evaluations of similar interventions.
In addition, the data sources in our study may have been suboptimal in terms of their ability to illustrate the advantages of calibration on margins. In fact, our intervention and reference populations differed only moderately at baseline in some of the characteristics that may be associated with the outcome (e.g., BMI and renal function). Yet greater differences in patient characteristics between the two populations would probably lead to more marked differences in effect size and test performance between simple and advanced poststratification, thereby allowing us to illustrate the higher potential of calibration on margins.
In terms of analytical methods, our study did not test the hyperbolic sine function as a fifth mathematical basis for calibration on margins. Because the four functions tested yielded very similar results in terms of measured outcomes and design effect, we assume that results would not have greatly differed for the hyperbolic sine function.
Our results have implications for research and decision making. We conceived the evaluation approach tested here as a tool in situations in which gold-standard designs, such as randomized controlled trials or quasi-experimental designs (30), are not feasible for logistic or resource reasons. Calibration appears to be applicable to real-life evaluation and adapted to assessments of the effectiveness of a given intervention, particularly in the case of inclusion of more characteristics than would be feasible with simple poststratification. Indeed, if the number of characteristics used in the latter is higher than three, the number of strata increases almost exponentially, leading to a high risk of empty strata and, consequently, biased estimates. Moreover, calibration on margins can be applied when only aggregated data are available for the overall target population, while simple poststratification requires patient-level data. Given these possibilities, calibration on margins can increase the external validity of a given evaluation, as opposed to the high internal validity of evaluations using randomized controlled designs (31). While calibration on margins is not comparable to controlled evaluation designs and cannot measure phenomena such as the “placebo effect,” it may provide a useful evaluation design where randomization is not possible and program planners or funders are interested in estimating the effect of an existing intervention if rolled out to a wider population based on routinely collected data. In such a case, calibration on margins could provide clinicians, program managers and policy makers with relevant information regarding whether and how much they should invest in such a wider strategy.
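The growth in the number of strata can be made concrete with a short sketch. The category counts below are read from Table 1 (ALD and smoking status are assumed to be binary), and the variable names are hypothetical. Simple poststratification needs one stratum per combination of categories, whereas calibration on margins only needs one target per category.

```python
import math

# Number of categories per characteristic, read from Table 1
# (ALD and smoking status assumed binary).
levels = {
    "age": 3, "sex": 2, "diabetes duration": 4, "treatment": 3,
    "hypertension": 2, "LDL": 2, "HDL": 2, "triglycerides": 2,
    "ALD": 2, "smoking": 2, "BMI": 3, "GFR": 3, "HbA1c": 3,
}

# Simple poststratification: every combination of categories is a stratum.
strata_three = math.prod(levels[name] for name in ("age", "sex", "HbA1c"))
strata_all = math.prod(levels.values())

# Calibration on margins: one target total per category of each characteristic.
margins_all = sum(levels.values())
```

With three characteristics there are 18 strata, which a sample of 232 patients can populate. With all 13 characteristics there would be 124,416 possible strata, almost all of them empty, against only 33 marginal targets for calibration.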
It is important to note the methodological restrictions in using this method. Calibration on margins is technically feasible when the size of the intervention population is at least ∼1/10 the size of the reference population used for adjustment (32). Thus the applicability of calibration is likely limited to settings in which the intervention group is sufficiently large compared with the reference population. Moreover, researchers using the linear function in calibration should be aware that it can yield negative weights, and thus only statistical tests for quantitative variables may be used.
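The origin of negative weights can be seen from the standard form of the calibration problem described by Deville and Särndal. The notation below is introduced here for illustration and is not taken from the article, apart from the initial weights d_k and calibration weights w_k: x_k denotes the vector of calibration characteristics of individual k, t_x the known population totals, and λ a vector of Lagrange multipliers.

```latex
% Calibration: stay close to the initial weights while matching the margins.
\min_{w}\ \sum_{k} d_k\, G\!\left(\frac{w_k}{d_k}\right)
\quad \text{subject to} \quad \sum_{k} w_k\, x_k = t_x .

% Linear function: G(r) = \tfrac{1}{2}(r-1)^2 gives a closed-form solution.
w_k = d_k \left( 1 + x_k^{\top} \lambda \right),
\qquad
\lambda = \Big( \sum_{k} d_k\, x_k x_k^{\top} \Big)^{-1}
          \Big( t_x - \sum_{k} d_k\, x_k \Big).

% The weight of individual k is negative whenever x_k^{\top}\lambda < -1.
```

Nothing in the linear solution prevents the term in parentheses from falling below zero when the sample margins are far from the population margins. Functions that keep the ratio w_k/d_k positive, such as the raking ratio or a truncated linear function with a positive lower bound, avoid this.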
Overall, this study underscores the utility of poststratification methods in a context in which structured approaches to care for chronic diseases such as diabetes are increasingly being implemented but hard to evaluate in a rigorous manner for financial or logistic reasons. Calibration on margins appears to be the preferable poststratification method mostly because it allows for adjustment on a greater number of characteristics than simple poststratification. It appears to provide an effective means for accounting for selection bias, thereby mitigating the possibility of overestimation of the intervention effect when simple before–after evaluations are undertaken in real-world settings.
Article Information
Acknowledgments. The authors thank the DPN Paris Diabète, in particular Pierre-Yves Traynard and Pierre-Albert Charbit, and the ENTRED partners for generously providing the data. The authors are very grateful to the patients for their participation in these initiatives. The authors further thank Karen Berg Brigham (URC-Eco) for her very helpful review of the manuscript.
Funding. The 2007 ENTRED study was funded by the Institute for Public Health Surveillance, the National Institute for Prevention and Health Education, the General Scheme of Health Insurance, the Independent Scheme for Employees, and the French National Authority for Health. This study was conducted with support from the DISMEVAL Consortium, funded under the European Commission’s Seventh Framework Programme (grant 223277). See www.dismeval.eu for additional information.
Duality of Interest. No potential conflicts of interest relevant to this article were reported.
Author Contributions. K.C. designed the study, analyzed data, and wrote the manuscript. M.B. analyzed data and wrote the manuscript. B.C. designed the study, performed the statistical analysis, and reviewed the manuscript. E.N. contributed to the discussion and reviewed and edited the manuscript. I.D.-Z. obtained data and reviewed and edited the manuscript. K.C. is the guarantor of this work and, as such, had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.