OBJECTIVE

While there is growing evidence on the effectiveness of structured care for diabetic patients in trial settings, standard population level evaluations may misestimate intervention benefits due to patient selection. In order to account for potential biases in measuring intervention benefits, we tested the impact of calibration on margins as a novel adjustment method in an evaluation context compared with simple poststratification.

RESEARCH DESIGN AND METHODS

We compared the results of a before–after evaluation on HbA1c levels after 1 year of enrollment in a French diabetes provider network (DPN) using an unadjusted sample and samples adjusted by simple poststratification to results obtained after adjustment via calibration on margins to the general diabetic population’s characteristics using a national cross-sectional sample of diabetic patients.

RESULTS

Both with and without adjustment, patients in the DPN had significantly lower HbA1c levels after 1 year of enrollment. However, the reductions in HbA1c levels among the adjusted samples were 22–183% lower than those measured in the unadjusted sample, regardless of the poststratification method and characteristics used. Compared with simple poststratification, estimations using calibration on margins exhibited higher performance.

CONCLUSIONS

Evaluations of diabetes management interventions based on uncontrolled before–after experiments may overestimate the actual benefit for patients. This can be corrected by using poststratification approaches when data on the ultimate target population for the intervention are available. In order to more accurately estimate the effect an intervention would have if extended to the target population, calibration on margins seems to be preferable over simple poststratification in terms of performance and usability.

Health systems increasingly rely on structured care interventions to better meet the needs of growing populations with diabetes and other chronic conditions (1,2). Such interventions, frequently also referred to as disease management care, typically include components such as enhanced coordination between providers, the systematic use of clinical guidelines, and patient education (3,4). In practice, the nature and settings in which structured care is delivered vary, ranging from discrete programs offered to a selected group of patients to multicomponent, population-based strategies offered as part of usual care (5). An example of structured diabetes care is the concept of the diabetes provider network (DPN) in France, an association of care providers with central coordination to ensure a predefined patient trajectory (6).

Structured care approaches are expected to improve the quality of care for persons with chronic conditions, enhance health outcomes, and, ultimately, reduce costs. However, although these aims are intuitively appealing, whether they are achieved in practice remains uncertain. This is in part because good-quality evidence from randomized controlled designs is typically limited to small populations or research settings and is therefore difficult to generalize (7,8). Conversely, the diffusion of structured care approaches into routine settings, for example, through rollout or implementation at population level, is rarely accompanied by rigorous evaluation (9,10). Instead, where evaluation is conducted, it often must rely on uncontrolled designs such as observational before–after studies, because randomized designs are not feasible for various reasons (10,11). Such evaluations are difficult to interpret, however, and are prone to misestimating the intervention effect, because they do not account for potential biases resulting, for example, from selective enrollment of patients likely to benefit from the intervention and from self-selection of health care professionals (12,13). Such selection leads to differences in characteristics between the overall target population and the intervention group, which is a sample of that population (called the intervention sample throughout this article).

In order to improve the available evidence on the effectiveness and impact of population-wide disease management programs, it is necessary to account for these potential biases and use methods that are both scientifically robust and feasible for evaluation in daily practice.

To this end, it is appropriate to use poststratification methods commonly used in survey analysis that consist of a posteriori adjustments using auxiliary information at the target population level. These methods include simple poststratification, poststratification by regression, and, more recently, poststratification by calibration on margins introduced by Deville and Särndal (14–16). Poststratification aims to rebalance differences in the characteristics of the intervention sample and the overall target population by using several mathematical methods that are more or less complex. Simple poststratification methods are the most commonly used and rely upon the simplest mathematical method that assigns weights to strata of individuals using a ratio. Calibration on margins using more complex functions based on an iterative process with distance functions can also be applied (15). In addition to being relatively simple to use, the literature indicates that this method usually provides better performance than simple poststratification, as demonstrated by lower variance of the estimator of calibration, because it is less biased when used with small sample sizes (17,18). Moreover, calibration on margins can be conducted on a higher number of characteristics than simple poststratification, particularly when no individual but only aggregated data are available (because no poststrata can be reconstructed).

In this study, we explored the use and usefulness of calibration on margins over no adjustment or commonly used simple poststratification for the evaluation of DPNs in France. We compared the estimated effect size of the intervention by using a before–after evaluation of structured care on an unadjusted sample, on samples adjusted by simple poststratification, and on samples adjusted by calibration on margins. In addition, we sought to compare results obtained from the different poststratification methods using several different calibration functions in order to understand their relative merits and performance levels, thereby informing their potential use in future evaluations.

Data Sources

The intervention group consisted of patients enrolled in a DPN for type 1 and 2 diabetes in the Paris region of France. Enrollment was usually recommended by the patient’s general practitioner. Services provided by the DPN included patient education and workshops; systematic patient assessment and follow-up by general practitioners (annual checkups at a minimum), dietitians, nurses, and podiatrists; interdisciplinary meetings and training for the professionals involved; and the use of clinical guidelines. At patient enrollment, an initial patient assessment was performed and documented in the network database. Permission was obtained from the National Data Safety Authority to store and use patient data for evaluation purposes.

Because data on the characteristics of the French diabetic population are not available at the national level, data for the reference population were drawn from the Echantillon National Témoin Représentatif de la Population Diabétique (ENTRED) study, a cross-sectional representative national survey of people treated for diabetes in France based on a random sample of adults who had claimed at least three reimbursements for oral hypoglycemic agents or insulin from the largest French health insurance fund over a 1-year period (19). Anonymized patient-level data from the ENTRED study were obtained from the French Institute for Public Health Surveillance.

Study Population and Measures

For data availability reasons, we included patients in the DPN who were enrolled in 2007 or 2008 (n = 549) and had completed the initial and the 1-year follow-up assessments with a full data set, resulting in an intervention group sample size of 232 patients.

ENTRED data included claims data, questionnaires, and clinical data for 2007 and 2008, forming a reference group of 2,485 patients, representing the diabetic population in France.

As HbA1c level is the most widely used diabetes outcome measure (20,21), we used HbA1c levels after 1 year of enrollment in the DPN as the primary outcome.

Measures for adjustment included basic demographic characteristics (age and sex) and clinical indicators (diabetes duration; treatment modality; and intermediate outcomes, including hypertension, blood lipids, smoking status, BMI, and glomerular filtration rate). We also included the statutory health insurance chronic disease coverage scheme status (affections de longue durée) as a proxy reflecting a patient’s higher level of comorbidity. We were unable to consider socioeconomic status, as relevant data were not collected by the DPN.

Poststratification Methods

Simple Poststratification

Poststratification in its classic form classifies the sample into poststrata (groups of individuals) on a given characteristic and weights the individuals in each group up to the population total of that group. Specifically, weights are computed with a ratio approach, dividing the proportion of individuals in a given stratum in the overall target population by the corresponding proportion in the intervention sample (22).
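As an illustration only, the ratio approach can be sketched in Python using the age-group shares reported in Table 1, with each stratum's weight taken as the target-population share divided by the sample share (the usual poststratification ratio). The function and variable names are hypothetical, and the sketch does not reproduce the SAS analysis performed in the study.

```python
# Age-group shares from Table 1 (ENTRED = reference population, DPN = intervention sample).
population_share = {"<=55": 0.23, "56-70": 0.40, ">70": 0.37}
sample_share = {"<=55": 0.32, "56-70": 0.45, ">70": 0.22}  # sums to 0.99 because of rounding

def poststratification_weights(population, sample):
    """Weight per stratum = population share / (normalized) sample share."""
    total = sum(sample.values())
    return {s: population[s] / (sample[s] / total) for s in sample}

weights = poststratification_weights(population_share, sample_share)

# After weighting, the sample reproduces the population's age distribution.
total = sum(sample_share.values())
reweighted = {s: weights[s] * sample_share[s] / total for s in weights}

for stratum, w in weights.items():
    print(f"{stratum}: weight {w:.2f}, reweighted share {reweighted[stratum]:.2f}")
```

Patients older than 70 years, who are underrepresented in the DPN, receive weights above 1, whereas the youngest group is weighted down.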

Characteristics used for poststratification are commonly age and sex and those that are associated with the outcome and thus impact the results when they differ between the intervention sample and the reference population (23). We therefore explored four sets of one, two, or three characteristics using the standard age and sex as well as HbA1c because it is highly correlated with variation in outcome (correlation coefficient = 0.602; P < 0.001).

Calibration on Margins

Calibration on margins generates an adjusted sample by assigning a calibration sample weight (coefficient) to each individual based on an iterative process starting with initial weights d_k (which are usually the “sampling weights,” equal to the inverse of an individual’s probability of being included in the intervention). At each iteration, new “calibration weights” w_k that are as close as possible to the initial weights (as determined by a given distance function) are computed. Several mathematical functions are available, and we explored four of them: linear, raking ratio, logit, and truncated linear (see Supplementary Data for equations) (15). The two linear functions are based on quadratic functions, with the particularity that the truncated linear function always yields positive weights. The raking ratio and logit functions are logarithmic functions. A fifth function, hyperbolic sine, was not available in SAS or R and thus was not used in our analysis.
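A minimal Python sketch of the raking ratio variant is given below. For categorical margins, raking can be carried out by iterative proportional fitting, in which the weights are rescaled one margin at a time until every weighted margin matches its target. This is not the Calmar macro used in the study: the function name, the simulated individuals, and the female share of 0.46 (derived as 1 minus the 54% male share in Table 1) are assumptions made for illustration.

```python
import numpy as np

def rake(codes_by_var, target_by_var, d=None, max_sweeps=200, tol=1e-10):
    """Raking ratio calibration on categorical margins (illustrative only).

    codes_by_var  : dict of variable name -> integer category code per individual
    target_by_var : dict of variable name -> target population share per category
    d             : initial weights d_k (defaults to equal weights)
    Returns calibration weights w_k that sum to 1.
    """
    n = len(next(iter(codes_by_var.values())))
    w = np.full(n, 1.0 / n) if d is None else np.asarray(d, dtype=float) / np.sum(d)
    for _ in range(max_sweeps):
        worst_gap = 0.0
        for name, codes in codes_by_var.items():
            target = np.asarray(target_by_var[name], dtype=float)
            current = np.bincount(codes, weights=w, minlength=len(target))
            worst_gap = max(worst_gap, np.abs(current - target).max())
            w = w * (target / current)[codes]  # rescale each category to its margin
        if worst_gap < tol:
            break
    return w

# Hypothetical individuals drawn to resemble the DPN margins in Table 1.
rng = np.random.default_rng(0)
sample = {
    "age": rng.choice(3, size=232, p=[0.32, 0.45, 0.23]),  # <=55, 56-70, >70
    "sex": rng.choice(2, size=232, p=[0.45, 0.55]),        # female, male
}
# Target margins from the ENTRED column of Table 1.
targets = {"age": [0.23, 0.40, 0.37], "sex": [0.46, 0.54]}

w = rake(sample, targets)
age_margin = np.bincount(sample["age"], weights=w, minlength=3)
sex_margin = np.bincount(sample["sex"], weights=w, minlength=2)
print(age_margin.round(3), sex_margin.round(3))
```

Only the marginal distributions of the reference population are needed here, not its cross-tabulated cells, which is why this family of methods can be applied when only aggregated reference data are available.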

To our knowledge, no literature provides specific guidance on the types of characteristics that should be used for adjusting a sample applying this method in the context of a health intervention evaluation of diabetes care. We therefore used all characteristics that are considered to be associated with differences in outcomes in diabetic patients (23) and that may explain differences in before–after assessment of the DPN population: demographic characteristics, diabetes information, and other clinical characteristics (Table 1), regardless of whether they significantly differed between the DPN and ENTRED at baseline to ensure that they would not significantly differ between the two populations after adjustment.

Table 1

Comparison of the characteristics of the DPN population with the ENTRED population at baseline

Characteristics | Reference population, ENTRED (n = 2,485) | Intervention group, DPN (n = 232)
Demographics
 Age, years | 64.59 ± 12.47 | 60.32 ± 12.51***
  ≤55 | 23% | 32%***
  56–70 | 40% | 45%***
  >70 | 37% | 22%***
 Sex, male | 54% | 55%
Diabetes information
 Duration of diabetes, years
  <4 | 23% | 41%***
  4–9 | 25% | 26%***
  9–16 | 24% | 14%***
  >16 | 28% | 19%***
 Treatment modus
  One oral HA | 42% | 38%
  Two or more oral HA | 35% | 38%
  Insulin and no, one, or more oral HA | 23% | 24%
Other clinical characteristics
 Hypertension grade§
  0 | 63% | 63%
  1, 2, or 3 | 37% | 37%
 LDL, g/L
  ≤1.3 | 78% | 67%***
  >1.3 | 22% | 33%***
 HDL, g/L
  >0.4 (female) or 0.35 (male) | 87% | 88%
  ≤0.4 (female) or 0.35 (male) | 13% | 12%
 Triglycerides, g/L
  ≤1.5 | 63% | 66%
  >1.5 | 37% | 34%
 ALD, % yes | 87% | 84%
 Smoking status, % smoker | 14% | 15%
 BMI, kg/m² 
  <25 | 22% | 19%
  25–30 | 38% | 42%
  >30 | 40% | 39%
 GFR, mL/min
  Level 1: >90 | 42% | 45%
  Level 2: 60–89 | 36% | 38%
  Level 3, 4, or 5: 0–59 | 22% | 17%
DPN assessed outcome
 HbA1c, % | 7.14 ± 1.19 | 7.75 ± 1.78***
 HbA1c, mmol/mol | 54.5 ± 13 | 61.2 ± 19.5***
  <7% (<53 mmol/mol) | 50% | 34%***
  7–8% (53–64 mmol/mol) | 30% | 28%***
  >8% (>64 mmol/mol) | 20% | 38%***

Data are mean ± SD unless otherwise noted. ALD, affections de longue durée, statutory health insurance chronic disease coverage scheme; GFR, glomerular filtration rate; HA, hypoglycemic agent.

***P ≤ 0.001.

§According to the World Health Organization (33).

We explored three sets of characteristics: one with all characteristics excluding HbA1c at baseline; one with all characteristics including HbA1c at baseline; and one with age, sex, and HbA1c at baseline only. This last set was used to compare the performance of calibration and simple poststratification on the same set of adjustment characteristics.

We therefore constructed 12 adjusted samples by calibration using the four functions on three sets of characteristics.

Analysis

We first used descriptive statistics to check for differences in patient characteristics between the intervention group and the reference population.

Second, we compared results of a before–after analysis with HbA1c level after 1 year of enrollment in the DPN as the primary outcome on the initial sample, on the calibration-adjusted samples, and on the simple poststratification-adjusted samples. Changes in HbA1c were assessed as the change in mean level and by two categorical measures: the number of patients whose HbA1c levels changed from >7 to ≤7% (24,25) and the number of patients whose HbA1c levels fell by ≥0.5% (26,27).
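The three outcome measures can be computed on a weighted sample as sketched below in Python. The HbA1c values and weights are invented for illustration and are not study data; the function name and dictionary keys are likewise hypothetical.

```python
import numpy as np

def weighted_before_after(before, after, weights):
    """Weighted before-after summary of HbA1c (%), illustrative only."""
    before, after, w = (np.asarray(a, dtype=float) for a in (before, after, weights))
    w = w / w.sum()
    change = before - after  # positive values are reductions in HbA1c
    return {
        "mean_reduction": float(np.sum(w * change)),
        "share_reaching_target": float(np.sum(w * ((before > 7) & (after <= 7)))),
        "share_falling_half_point": float(np.sum(w * (change >= 0.5))),
    }

# Hypothetical HbA1c values (%), not study data.
before = [8.0, 7.5, 6.8, 9.0]
after = [6.9, 7.2, 6.9, 8.4]

unadjusted = weighted_before_after(before, after, [1, 1, 1, 1])  # equal weights
adjusted = weighted_before_after(before, after, [0, 1, 1, 2])    # hypothetical adjustment weights
print(unadjusted)
print(adjusted)
```

With equal weights the function returns the unadjusted estimates; passing poststratification or calibration weights returns the adjusted ones.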

Finally, in order to compare the performance of simple poststratifications and the different calibration functions used in this evaluation context, we compared them in terms of SE and weight dispersion measured by the design effect, with a higher design effect expressing lower dispersion, which is considered preferable (28).

All analyses were performed using SAS version 9.2 (SAS Institute, Cary, NC) and the “Calmar” macro developed by the French National Institute for Statistics and Economic Studies (29). (Note that other software for calibration includes the R package sampling, g-Calib for SPSS, and Bascula for Blaise.) P values <0.05 were considered significant. All P values are two-sided. For calibration, continuous variables were transformed into categorical variables.

Baseline Characteristics

Patient characteristics at baseline are presented in Table 1.

Compared with the ENTRED population, patients enrolled in the DPN were younger, had a more recent diabetes diagnosis, and had higher LDL cholesterol. Glycemic control as measured by HbA1c was worse in the DPN patients.

Regardless of the sample used, results of the before–after analysis revealed that DPN patients had significant reductions in HbA1c levels after 1 year of enrollment (Table 2 and Fig. 1). However, when the initial sample was adjusted, the reductions in all outcome measures (mean HbA1c and the percentages of patients whose HbA1c levels changed from >7 to ≤7% or fell by ≥0.5%) were smaller regardless of the poststratification method and the set of characteristics used (Table 2).

Table 2

Comparison of DPN effects after 1 year, by before–after evaluation approach, on the initial sample and samples adjusted by different poststratification methods on patient characteristics (n = 232)

Type of adjustment | Characteristics used for adjustment | Function | Mean HbA1c change after 1 year, % | 95% CI | SD | SE | Relative change from no adjustment, % | Patients with HbA1c change from >7 to ≤7%, % | Patients with HbA1c decrease ≥0.5%, % | Design effect ×10³
1. No adjustment | | | 0.497*** | 0.308–0.685 | 1.456 | 0.096 | | 25.86 | 41.38 | 0.00
2. Simple poststratification
 | Age/sex | NA | 0.406*** | 0.231–0.581 | 1.354 | 0.089 | −22 | 24.77 | 37.47 | 4.31
 | HbA1c | NA | 0.250** | 0.089–0.412 | 1.248 | 0.082 | −99 | 21.53 | 32.87 | 4.31
 | Age/HbA1c | NA | 0.208** | 0.054–0.361 | 1.190 | 0.078 | −139 | 21.77 | 31.17 | 4.31
 | Sex/HbA1c | NA | 0.250** | 0.09–0.411 | 1.242 | 0.082 | −99 | 21.50 | 32.78 | 4.31
 | Age/sex/HbA1c | NA | 0.223** | 0.067–0.379 | 1.205 | 0.079 | −123 | 21.55 | 30.35 | 4.31
3. Advanced poststratification: calibration on margins
 | All patient characteristics at baseline§ without HbA1c | Linear | 0.367*** | 0.208–0.526 | 1.232 | 0.081 | −35 | 24.56 | 37.25 | 6.08
 | | Raking | 0.369*** | 0.211–0.528 | 1.226 | 0.081 | −35 | 24.73 | 37.00 | 6.27
 | | Logit | 0.371*** | 0.213–0.53 | 1.226 | 0.081 | −34 | 24.82 | 37.13 | 6.25
 | | Truncated linear | 0.366*** | 0.209–0.523 | 1.213 | 0.080 | −36 | 24.43 | 37.38 | 6.09
 | Age/sex/HbA1c | Linear | 0.208** | 0.055–0.361 | 1.200 | 0.078 | −138 | 21.54 | 31.12 | 5.37
 | | Raking | 0.235** | 0.081–0.388 | 1.207 | 0.078 | −112 | 21.68 | 31.41 | 5.41
 | | Logit | 0.235** | 0.081–0.389 | 1.208 | 0.078 | −111 | 21.68 | 31.41 | 5.41
 | | Truncated linear | 0.208** | 0.055–0.361 | 1.200 | 0.078 | −138 | 21.54 | 31.12 | 5.37
 | All patient characteristics at baseline§ with HbA1c | Linear | 0.175* | 0.036–0.314 | 1.084 | 0.071 | −183 | 19.82 | 29.28 | 6.72
 | | Raking | 0.200** | 0.062–0.339 | 1.068 | 0.070 | −148 | 20.15 | 29.65 | 7.29
 | | Logit | 0.207** | 0.068–0.345 | 1.069 | 0.070 | −141 | 20.12 | 29.80 | 7.19
 | | Truncated linear | 0.181** | 0.049–0.313 | 1.018 | 0.067 | −175 | 19.75 | 29.89 | 6.81

A high design effect is considered favorable. NA, not applicable.

***P ≤ 0.001,

**P ≤ 0.01,

*P ≤ 0.05.

§Demographics, diabetes information, and other clinical characteristics.

Figure 1

Trends in mean HbA1c level at baseline and after 1 year, by sample used, defined by the set of patient characteristics and the poststratification method.

When HbA1c level at baseline was included as an adjustment variable in either simple poststratification or calibration on margins, the measured change in all outcome measures was markedly lower than in the samples not adjusted on HbA1c. For example, when the sample was adjusted using the linear function for calibration on margins, reductions in mean HbA1c levels compared with the unadjusted sample were 35% lower when HbA1c level at baseline was not used for calibration; however, when HbA1c at baseline was included, reductions in HbA1c levels were 183% lower.
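The relative changes quoted here appear to express the gap between the unadjusted and adjusted mean reductions as a percentage of the adjusted reduction. This reading is inferred from the values in Table 2 and is not stated explicitly by the authors; under that assumption, the Python sketch below reproduces the reported figures to within about one percentage point, the residual presumably reflecting rounding of the published means. The function name is hypothetical.

```python
UNADJUSTED = 0.497  # mean HbA1c reduction without adjustment (Table 2)

def relative_change(adjusted, unadjusted=UNADJUSTED):
    """Gap between unadjusted and adjusted reductions, as % of the adjusted one."""
    return 100.0 * (unadjusted - adjusted) / adjusted

# Adjusted mean reductions for the linear calibration function (Table 2).
for label, adjusted in [("without baseline HbA1c", 0.367), ("with baseline HbA1c", 0.175)]:
    print(f"Linear calibration, {label}: {relative_change(adjusted):.0f}%")
```

Because the denominator is the adjusted estimate, values above 100% are possible: they indicate that the unadjusted reduction is more than twice the adjusted one.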

When poststratification was not performed on the initial HbA1c level, adjustment by calibration on margins with all remaining characteristics yielded smaller improvements in all outcomes than simple poststratification on age and sex. Similarly, when poststratification was also performed on the initial HbA1c level, calibration on margins with all patient characteristics yielded smaller changes in all outcomes than simple poststratification. This represents relative decreases from no adjustment of 141–183% for calibration on margins compared with relative decreases of 99–123% in simple poststratification samples (Table 2). However, calibration on margins showed results similar to those of simple poststratification when only age, sex, and HbA1c level at baseline were used for adjustment in both methods.

The results of the four calibration functions fell within a narrow range. For example, when all characteristics were used for calibration, the absolute HbA1c level change ranged from 0.18 to 0.21 (Table 2).

Performance of Adjustment Techniques

Based on the number of characteristics and the sample sizes used in the intervention group and reference population, all tested poststratification techniques were technically feasible and yielded robust results.

The design effect of poststratification via calibration on margins was consistently higher than that of simple poststratification, indicating more favorable weight dispersion, including when the same characteristics (age, sex, and HbA1c at baseline) were used in both methods. Moreover, in calibration on margins, the higher the number of characteristics used for adjustment, the lower the observed SE and the higher the design effect (Table 2).

Finally, across the four calibration functions, the design effect was of a similar range, with the raking ratio function exhibiting the highest design effect compared with the linear, logit, and truncated linear functions for all sets of characteristics (Table 2).

To our knowledge, this is the first study to compare a range of adjustment methods, including calibration on margins, to evaluate a structured diabetes care intervention using a national cross-sectional sample of diabetic patients as the reference population. While the positive impact of the DPN remained significant, we found that before–after analysis without poststratification may have overestimated the effect of the DPN by 22–183% in terms of observed improvements in HbA1c levels. Furthermore, adjustment on HbA1c levels at baseline appears to be important for not overestimating the intervention effect. When compared with simple poststratification, change in the observed improvement is usually lower using calibration on margins, mostly because it allows adjustment on a greater number of characteristics. Moreover, estimations using calibration on margins exhibited higher performance with lower SEs and higher design effects, strongly suggesting that calibration on margins is the preferable adjustment method in this context. Finally, the four calibration functions that we tested all showed comparable performance.

As the analytical approach explored here has thus far not been documented in the peer-reviewed literature in the context of evaluation, it is difficult to compare our findings with work undertaken elsewhere. While it is well known that simple before–after evaluation can lead to substantial overestimation of the intervention effect in structured care (and, in rare cases, to underestimation), the size of this misestimation is not well understood (10). However, a recent evaluation of a diabetes disease management program in Austria using a cluster-randomized controlled trial found that HbA1c levels in the intervention group had decreased by 0.13% after 1 year. Conversely, using the same data and applying a simple before–after comparison, the effect size was measured as a decrease of 0.41% (13). Thus the before–after design had overestimated the “true” intervention effect established in the randomized controlled trial by ∼68%. While direct comparison of findings is impossible given the different methods used, the relative overestimation of the intervention effect as identified using poststratification methods appears to be within the same range. This suggests that poststratification and preferably calibration on margins may provide a useful evaluation approach to produce valid findings where more scientifically robust designs such as randomization are not possible.

We acknowledge some limitations regarding data quality, availability, and follow-up.

We included DPN patients in the analysis only if complete follow-up data and characteristics for calibration were available, which increased the likelihood of selection effects due to missing data within the DPN. Further, our analysis did not include characteristics on socioeconomic status, comorbidities, and diabetes type.

Despite using a proxy for comorbidities and the fact that the type of diabetes is in part reflected by age, disease duration, and treatment mode, we were likely unable to account for the full extent of patient selection. Moreover, there were missing data in the ENTRED reference population, which may render it a slightly biased representation of the overall target population. Despite these limitations, the data used reflect data available in the context of provider network evaluation, and the methods tested aim to support future pragmatic evaluations of similar interventions.

In addition, the data sources in our study may have been suboptimal in terms of their ability to illustrate the advantages of calibration on margins. In fact, our intervention and reference populations differed only moderately at baseline in some of the characteristics that may be associated with the outcome (e.g., BMI and renal function). Yet greater differences in patient characteristics between the two populations would probably lead to more marked differences in effect size and test performance between simple and advanced poststratification, thereby allowing us to illustrate the higher potential of calibration on margins.

In terms of analytical methods, our study did not test the hyperbolic sine function as a fifth mathematical basis for calibration on margins. Because the four functions tested gave very similar results in terms of measured outcomes and design effect, we assume that results would not have differed greatly for the hyperbolic sine function.

Our results have implications for research and decision making. We conceived the evaluation approach tested here as a tool in situations in which gold-standard designs, such as randomized controlled trials or quasi-experimental designs (30), are not feasible for logistic or resource reasons. Calibration appears to be applicable to real-life evaluation and adapted to assessments of the effectiveness of a given intervention, particularly in the case of inclusion of more characteristics than would be feasible with simple poststratification. Indeed, if the number of characteristics used in the latter is higher than three, the number of strata increases almost exponentially, leading to a high risk of empty strata and, consequently, biased estimates. Moreover, calibration on margins can be applied when only aggregated data are available for the overall target population, while simple poststratification requires patient-level data. Given these possibilities, calibration on margins can increase the external validity of a given evaluation, as opposed to the high internal validity of evaluations using randomized controlled designs (31). While calibration on margins is not comparable to controlled evaluation designs and cannot measure phenomena such as the “placebo effect,” it may provide a useful evaluation design where randomization is not possible and program planners or funders are interested in estimating the effect of an existing intervention if rolled out to a wider population based on routinely collected data. In such a case, calibration on margins could provide clinicians, program managers and policy makers with relevant information regarding whether and how much they should invest in such a wider strategy.

It is important to note the methodological restrictions in using this method. Calibration on margins is technically feasible when the size of the intervention population is at least ∼1/10 the size of the reference population used for adjustment (32). Thus the applicability of calibration is likely limited to settings in which the intervention group is sufficiently large compared with the reference population. Moreover, researchers using the linear function in calibration should be aware that it can yield negative weights, and thus only statistical tests for quantitative variables may be used.

Overall, this study underscores the utility of poststratification methods in a context in which structured approaches to care for chronic diseases such as diabetes are increasingly being implemented but hard to evaluate in a rigorous manner for financial or logistic reasons. Calibration on margins appears to be the preferable poststratification method mostly because it allows for adjustment on a greater number of characteristics than simple poststratification. It appears to provide an effective means for accounting for selection bias, thereby mitigating the possibility of overestimation of the intervention effect when simple before–after evaluations are undertaken in real-world settings.

Acknowledgments. The authors thank the DPN Paris Diabète, in particular Pierre-Yves Traynard and Pierre-Albert Charbit, and the ENTRED partners for generously providing the data. The authors are very grateful to the patients for their participation in these initiatives. The authors further thank Karen Berg Brigham (URC-Eco) for her very helpful review of the manuscript.

Funding. The 2007 ENTRED study was funded by the Institute for Public Health Surveillance, the National Institute for Prevention and Health Education, the General Scheme of Health Insurance, the Independent Scheme for Employees, and the French National Authority for Health. This study was conducted with support from the DISMEVAL Consortium, funded under the European Commission’s Seventh Framework Programme (grant 223277). See www.dismeval.eu for additional information.

Duality of Interest. No potential conflicts of interest relevant to this article were reported.

Author Contributions. K.C. designed the study, analyzed data, and wrote the manuscript. M.B. analyzed data and wrote the manuscript. B.C. designed the study, performed the statistical analysis, and reviewed the manuscript. E.N. contributed to the discussion and reviewed and edited the manuscript. I.D.-Z. obtained data and reviewed and edited the manuscript. K.C. is the guarantor of this work and, as such, had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.

1. Mattke S, Seid M, Ma S. Evidence for the effect of disease management: is $1 billion a year a good investment? Am J Manag Care 2007;13:670–676
2. Gress S, Baan CA, Calnan M, et al. Co-ordination and management of chronic conditions in Europe: the role of primary care—position paper of the European Forum for Primary Care. Qual Prim Care 2009;17:75–86
3. Epstein RS, Sherwood LM. From outcomes research to disease management: a guide for the perplexed. Ann Intern Med 1996;124:832–837
4. Ham C, Curry N. Integrated Care. What Is It? Does It Work? What Does It Mean for the NHS? London, The King’s Fund, 2011
5. Nolte E, McKee M. Caring for People with Chronic Conditions: A Health System Perspective. Maidenhead, Open University Press, 2008
6. Durand-Zaleski I, Obrecht O. France. In Managing Chronic Conditions: Experiences in Eight Countries. Nolte E, Knai C, McKee M, Eds. Copenhagen, World Health Organization, on behalf of the European Observatory on Health Systems and Policies, 2008, p. 55–73
7. Pimouguet C, Le Goff M, Thiébaut R, Dartigues JF, Helmer C. Effectiveness of disease-management programs for improving diabetes care: a meta-analysis. CMAJ 2011;183:E115–E127
8. Hopkins D, Lawrence I, Mansell P, et al. Improved biomedical and psychological outcomes 1 year after structured education in flexible insulin therapy for people with type 1 diabetes: the U.K. DAFNE experience. Diabetes Care 2012;35:1638–1642
9. Mattke S, Bergamo G, Balakrishnan A, Martino S, Vakkur NV. Measuring and Reporting the Performance of Disease Management Programs. Santa Monica, RAND Corporation, 2006
10. Nolte E, Conklin A, Adams J, et al. Evaluating Chronic Disease Management - Recommendations for Funders and Users. Cambridge, RAND Corporation and DISMEVAL Consortium, 2012
11. Knai C, Nolte E, Brunn M, et al. Reported barriers to evaluation in chronic care: experiences in six European countries. Health Policy 2013;110:220–228
12. Buntin MB, Jain AK, Mattke S, Lurie N. Who gets disease management? J Gen Intern Med 2009;24:649–655
13. Flamm M, Panisch S, Winkler H, Sönnichsen AC. Impact of a randomized control group on perceived effectiveness of a Disease Management Programme for diabetes type 2. Eur J Public Health 2012;22:625–629
14. Särndal C. The calibration approach in survey theory and practice. Surv Methodol 2007;33:99–119
15. Deville J, Särndal C. Calibration estimators in survey sampling. J Am Stat Assoc 1992;87:376–382
16. Lu H, Gelman A. A method for estimating design-based sampling variances for surveys with weighting, poststratification, and raking. J Off Stat 2003;19:133–151
17. McNamee R. Regression modelling and other methods to control confounding. Occup Environ Med 2005;62:500–506, 472
18. Kim JK, Park M. Calibration estimation in survey sampling. Int Stat Rev 2010;78:21–39
19. Tiv M, Viel J-F, Mauny F, et al. Medication adherence in type 2 diabetes: the ENTRED study 2007, a French Population-Based Study. PLoS ONE 2012;7:e32412
20. Egginton JS, Ridgeway JL, Shah ND, et al. Care management for Type 2 diabetes in the United States: a systematic review and meta-analysis. BMC Health Serv Res 2012;12:72
21. Steinsbekk A, Rygg, Lisulo M, Rise MB, Fretheim A. Group based diabetes self-management education compared to routine treatment for people with type 2 diabetes mellitus. A systematic review with meta-analysis. BMC Health Serv Res 2012;12:213
22. Little RJA. Post-stratification: a modeler’s perspective. J Am Stat Assoc 1993;88:1001–1012
23. Armoogum J, Madre JL. Weighting or imputations? The example of nonresponses for daily trips in the French NPTS. J Transp Stat 1998;1:53–63
24. Mauras N, Beck R, Xing D, et al.; Diabetes Research in Children Network (DirecNet) Study Group. A randomized clinical trial to assess the efficacy and safety of real-time continuous glucose monitoring in the management of type 1 diabetes in young children aged 4 to <10 years. Diabetes Care 2012;35:204–210
25. Owens LA, Avalos G, Kirwan B, Carmody L, Dunne F. ATLANTIC DIP: closing the loop: a change in clinical practice can improve outcomes for women with pregestational diabetes. Diabetes Care 2012;35:1669–1671
26. DePue JD, Dunsiger S, Seiden AD, et al. Nurse-community health worker team improves diabetes care in American Samoa: results of a randomized controlled trial. Diabetes Care 2013;36:1947–1953
27. Sönnichsen AC, Rinnerberger A, Url MG, et al. Effectiveness of the Austrian disease-management-programme for type 2 diabetes: study protocol of a cluster-randomized controlled trial. Trials 2008;9:38
28. Kish L. Survey Sampling. New York, John Wiley & Sons, 1965
29. Sautory O. La macro CALMAR - redressement d’un échantillon par calage sur les marges. Paris, INSEE, 1993
30. Duru OK, Mangione CM, Chan C, et al. Evaluation of the diabetes health plan to improve diabetes care and prevention. Prev Chronic Dis 2013;10:E16
31. English M, Schellenberg J, Todd J. Assessing health system interventions: key points when considering the value of randomization. Bull World Health Organ 2011;89:907–912
32. Vivot M. Calage sur les marges aléatoires - une aventure hasardeuse? Presented at the Colloque francophone sur les sondages, 2005, Laval, Quebec, Canada
33. Whitworth JA; World Health Organization, International Society of Hypertension Writing Group. 2003 World Health Organization (WHO)/International Society of Hypertension (ISH) statement on management of hypertension. J Hypertens 2003;21:1983–1992
Readers may use this article as long as the work is properly cited, the use is educational and not for profit, and the work is not altered. See http://creativecommons.org/licenses/by-nc-nd/3.0/ for details.
