Developing a Prediction Rule From Automated Clinical Databases to Identify High-Risk Patients in a Large Population With Diabetes

Joe V. Selby, MD, MPH; Andrew J. Karter, PHD; Lynn M. Ackerson, PHD; Assiamira Ferrara, MD; and Jennifer Liu, MPH

Division of Research, Kaiser Permanente of Northern California, Oakland, California


    OBJECTIVE—To develop and validate a prediction rule for identifying diabetic patients at high short-term risk of complications using automated data in a large managed care organization.

    RESEARCH DESIGN AND METHODS—Retrospective cohort analyses were performed in 57,722 diabetic members of Kaiser Permanente, Northern California, aged ≥19 years. Data from 1994 to 1995 were used to model risk for macro- and microvascular complications (n = 3,977), infectious complications (n = 1,580), and metabolic complications (n = 316) during 1996. Candidate predictors (n = 36) included prior inpatient and outpatient diagnoses, laboratory records, pharmacy records, utilization records, and survey data. Using split-sample validation, the risk scores derived from logistic regression models in half of the population were evaluated in the second half. Sensitivity, positive predictive value, and receiver operating characteristics curves were used to compare scores obtained from full models to those derived using simpler approaches.

    RESULTS—History of prior complications or related outpatient diagnoses were the strongest predictors in each complications set. For patients without previous events, treatment with insulin alone, serum creatinine ≥1.3 mg/dl, use of two or more antihypertensive medications, HbA1c >10%, and albuminuria/microalbuminuria were independent predictors of two or all three complications. Several risk scores derived from multivariate models were more efficient than simply targeting patients with elevated HbA1c levels for identifying high-risk patients.

    CONCLUSIONS—Simple prediction rules based on automated clinical data are useful in planning care management for populations with diabetes.

    Approximately 4% of most managed care populations have diabetes (1,2), but these patients account for nearly 12% of total health care expenditures (2). These costs also reflect catastrophic events in the lives of patients, because a large fraction of total costs result from hospitalization for disease complications (2). Recognizing the burden of this illness, many managed care organizations have developed intensive diabetes care programs, featuring multidisciplinary clinics or nurse case management (3,4,5) to improve pharmacotherapy, preventive screening, and support for self-care. However, inclusion of all diabetic patients in intensive disease management programs, including those at low risk for complications, would diminish the cost-effectiveness of these programs.

    Clinical prediction rules (6) are tools created by combining information from clinical data, usually using multivariate analyses, to estimate the probability of an outcome for individual patients. When applied to an entire population of members with diabetes, a prediction rule could be used to identify and rank members by their level of risk for complications. Despite the frequent availability of rich automated clinical data in health plan systems (7), prediction rules have not been widely used for diabetes. Instead, many programs focus solely on poor control of HbA1c levels to identify those in need of more intensive intervention. This study seeks to develop and test a prediction rule using automated clinical data that can be applied at the population level to improve this strategy. Several approaches are compared to identify the simplest rule that efficiently identifies high-risk patients.


    This report is based on a retrospective cohort analysis conducted in the Northern California Kaiser Permanente Diabetes Registry. Kaiser Permanente, which is a group model health maintenance organization (HMO), had ∼2.5 million enrollees during the study period. The registry (2) is an ongoing epidemiological cohort of all HMO members with diabetes identified from four automated databases: pharmacy prescriptions for diabetes medications, abnormal HbA1c values (≥6.7%) in laboratory files, primary hospital discharge diagnoses of diabetes, and emergency department records of diabetes as the reason for visit. During the study period, this registry had a sensitivity of 90% when matched against >1,500 self-reported diabetic patients who responded to two large mailed surveys. The registry missed some diet-controlled subjects and the small proportion of members who never use the HMO’s services. The registry also has been found to contain ∼2.5% false positives or members who do not truly have diabetes (J.V.S., unpublished data).

    For these analyses, data gathered from electronic databases and from a mailed survey during the 2-year baseline period (1994–1995) were used to predict three sets of complications of diabetes occurring during 1996. The study sample consisted of the 57,722 registry members who were aged ≥19 years and who were known to have diabetes by 1 January 1994; they were continuously enrolled in the health plan throughout the 2-year baseline period (1 January 1994 to 31 December 1995) and remained in the health plan for at least the first month of 1996. Continuous enrollment was defined as not having a membership gap of >2 months’ duration. We excluded 970 patients with no outpatient utilization (visits, laboratory tests, or prescriptions) during the baseline period, because they would provide few data on predictors, and they may well have received care, including care for complications, outside the HMO.
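    The continuous-enrollment criterion can be illustrated with a minimal sketch. It assumes membership is recorded as a collection of (year, month) pairs and that a gap is counted in whole calendar months; neither detail is stated in the text, and the function name is hypothetical.

```python
def is_continuously_enrolled(enrolled_months, start=(1994, 1), end=(1995, 12), max_gap=2):
    """Return True if no run of consecutive un-enrolled months exceeds max_gap.

    enrolled_months: iterable of (year, month) tuples in which the member was enrolled.
    Month-level granularity is an assumption made for illustration.
    """
    def month_index(year_month):
        year, month = year_month
        return year * 12 + (month - 1)

    covered = {month_index(ym) for ym in enrolled_months}
    gap = 0
    for m in range(month_index(start), month_index(end) + 1):
        if m in covered:
            gap = 0
        else:
            gap += 1
            if gap > max_gap:
                return False
    return True
```

    Under these assumptions, a member missing two consecutive months still qualifies, whereas a member missing three consecutive months does not.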

    Study outcomes

    Outcomes were complications requiring hospitalization during 1996; they were identified from principal discharge diagnoses in the HMO’s discharge databases (for 16 plan hospitals and for all claims received from out-of-plan hospitals). Complications were grouped into three sets: macro- and microvascular, infectious, and metabolic; they are listed by International Classification of Diseases, 9th Revision, Clinical Modification discharge diagnosis code in Table A1 in the appendix. Dichotomous dependent variables were created to indicate whether one or more complications from each set were noted during 1996.

    Candidate predictors

    The 36 potential predictors of complications (during the baseline period) are shown in Table A2 in the appendix. These included hospital discharges for the same complications sets as the study outcomes and the outpatient diagnoses that are related to these complications. Dichotomous predictor variables were used to note occurrence during the baseline period of any complication-related hospitalization. Dichotomous predictors were also used to indicate the baseline presence of outpatient diagnoses that are closely related to either macro- and microvascular or infectious complications. For example, renal insufficiency and unstable angina are likely to be important predictors of future hospitalizations. No outpatient diagnoses related to metabolic complications were captured in this HMO’s data systems.

    Other candidate predictors included laboratory results (HbA1c, serum creatinine, and lipoprotein levels), pharmacy prescriptions (for hypoglycemic, lipid-lowering, and antihypertensive agents), outpatient visit counts by type, and responses to a 1994–1996 mailed survey (with telephone follow-up of nonrespondents) completed by 83% of the study sample. Survey items included demographics, self-reported behaviors, and information used to classify diabetes (age and obesity status at onset and patterns of insulin use).

    Both clinical and survey databases have relatively high rates of missing data for potential predictors. Approximately 25–40% of cohort members had missing values for one or more key predictors, such as baseline HbA1c, serum creatinine, cholesterol, smoking status, and BMI. Because we wished to develop a tool applicable to an entire population, it was important that these subjects were included in the models. We therefore used “missing” categories for several key variables. Continuous predictors were converted to ordered categorical variables for this purpose.
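    One way to picture the “missing” category approach is the sketch below, which maps a continuous laboratory value to an ordered category and reserves a separate level for absent values. Only the HbA1c threshold of 10.0% comes from the text; the 8.0% cut point, the labels, and the function name are illustrative assumptions.

```python
from bisect import bisect_left

def to_ordered_category(value, cut_points, labels):
    """Map a continuous value to an ordered category, or 'missing' if absent.

    cut_points must be sorted ascending; labels needs len(cut_points) + 1 entries.
    A value equal to a cut point falls in the lower category.
    """
    if value is None:
        return "missing"
    return labels[bisect_left(cut_points, value)]

# Hypothetical HbA1c bands; only the >10.0% band is taken from the text.
HBA1C_CUTS = [8.0, 10.0]
HBA1C_LABELS = ["<=8.0", ">8.0-10.0", ">10.0"]
```

    Because “missing” is an ordinary level of the variable, a member without a baseline measurement still contributes to the model, and the fitted coefficient for that level shows whether missingness itself carries risk.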

    Data analyses

    The cohort of 57,722 members was randomly split into derivation and validation data sets. Of all the subjects, >94% remained under observation throughout 1996. Of the remaining 6%, 52% were censored because of death rather than leaving the health plan. Because of nearly complete follow-up and a short observation period, we used logistic regression to model the data.
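    A random split into two halves could look like the following sketch. The seed, the exact 50/50 division, and the function name are assumptions; the text states only that the cohort was randomly split.

```python
import random

def split_cohort(member_ids, seed=0):
    """Randomly divide member IDs into derivation and validation halves."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(member_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]
```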

    After examining bivariate associations of predictors and outcomes, separate stepwise logistic regression models were fitted in the derivation data set to build the best model for each complications set. A predictor was included in a best model only if its parameter estimate had P < 0.01. Once each model was derived, coefficients for significant predictors were applied to predictor values of the validation data set members. Risk scores for each member were calculated by summing coefficients across all predictors, and the ability of these scores to predict complications in a new population was examined.
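    The scoring step amounts to computing a linear predictor for each validation member. The sketch below assumes dichotomous (0/1) predictors stored in a dictionary; the predictor names and coefficient values are invented for illustration and are not the published estimates.

```python
def risk_score(coefficients, member):
    """Sum the model coefficients over the predictors present for one member.

    coefficients: dict mapping predictor name to logistic regression coefficient.
    member: dict mapping predictor name to 0/1 (absent predictors count as 0).
    """
    return sum(beta * member.get(name, 0) for name, beta in coefficients.items())

# Hypothetical coefficients, for illustration only.
EXAMPLE_COEFFICIENTS = {
    "prior_event": 1.2,
    "insulin_only": 0.4,
    "creatinine_ge_1_3": 0.6,
}
```

    Because members are only ranked, the intercept can be left out: adding the same constant to every score does not change the ordering.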

    Based on preliminary analyses, four simpler approaches to identifying and targeting high-risk patients were identified and compared with the best model. At an early stage in our analyses, we noted that events or related outpatient diagnoses during the baseline period were strong predictors of each complications set. Therefore, the first alternative was to use a “prior events” strategy that simply targeted patients with either of these predictors. Preliminary analyses also revealed that risk scores based only on the first three variables entering each model were nearly as sensitive as scores from the best models. Therefore, we evaluated “reduced models” that included only these first three variables.

    The third comparison approach tested a simplified numerical risk score derived by replacing significant model coefficients with integer values as follows: a value of 1.0 for a (significant) multivariate odds ratio (OR) between 1.1 and 1.49, 2.0 for an OR between 1.50 and 1.99, and 3.0 for an OR of ≥2.0, with corresponding negative numbers for significant ORs <1.0. To obtain integer values for age, which was the only continuous variable in any model, we calculated the age-specific OR distribution (relative to 20 years of age, which was the youngest age possible) using the model coefficients for 10-year increases in age and applied the same OR cut points to categorize the distribution into values from 0 to 3. The integer values were summed to yield a simple numerical score. If this approach performs nearly as well as the risk score from the best model, it yields a much simpler algorithm for use in other populations.
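    The integer conversion can be sketched as follows. Two points are assumptions because the text does not settle them: ORs below 1.0 are mapped by taking the reciprocal and negating the points, and an OR in the unstated interval between 1.49 and 1.50 is treated as below 1.5. The age coefficient used in the example is hypothetical.

```python
import math

def integer_points(odds_ratio):
    """Convert a significant multivariate OR to an integer score.

    1.1 to <1.5 gives 1, 1.5 to <2.0 gives 2, >=2.0 gives 3.
    ORs below 1.0 get the negative of the points for 1/OR (an assumption).
    """
    if odds_ratio < 1.0:
        return -integer_points(1.0 / odds_ratio)
    if odds_ratio >= 2.0:
        return 3
    if odds_ratio >= 1.5:
        return 2
    if odds_ratio >= 1.1:
        return 1
    return 0

def age_points(age, beta_per_10_years, reference_age=20):
    """Score age from its OR relative to the reference age of 20 years."""
    odds_ratio = math.exp(beta_per_10_years * (age - reference_age) / 10.0)
    return integer_points(odds_ratio)
```

    With an illustrative coefficient of 0.2 per 10 years, the age-specific OR is exp(0.2 × (age − 20)/10), so a 30-year-old scores 1 point, a 50-year-old 2 points, and a 60-year-old 3 points.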

    The fourth strategy was to simply rank patients on the basis of their average HbA1c level during 1994–1995 and to select patients in descending order of these values. We used percentiles rather than absolute values for cut points because HbA1c distributions vary across populations and over time.

    Initial comparisons of these five approaches focused on sensitivity and positive predictive values in the validation data set. Continuous risk scores were compared at the cut point that identified the 30% of patients with the highest predicted risk (or the highest HbA1c levels). This cut point was chosen to be consistent with our health plan’s current policy of planning more intensive interventions for ∼30% of the population. Given its distribution, the numerical score, which is ordinal, was cut as close to the upper 30th percentile as possible. For the prior events approach, the proportion with such an event is fixed. Continuous and ordinal scores were also compared across their entire range of values. Differences in areas under the curve (AUCs) of receiver operating characteristics (ROC) curves were tested with ROC Analyzer (8,9), which uses a nonparametric method of estimating AUC and adjusts for the correlation of the two curves (10).
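    The comparison metrics can be illustrated with a short sketch: targeting the top 30% by score, computing sensitivity and positive predictive value at that cut point, and estimating AUC nonparametrically as the probability that a patient with an event outscores one without (ties counted as one half). This is not the ROC Analyzer software, and it does not perform the correlated-curve significance test described above; function names are hypothetical.

```python
def target_top_fraction(scores, fraction=0.30):
    """Return the set of indices for the highest-scoring fraction of patients."""
    n_target = round(len(scores) * fraction)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:n_target])

def sensitivity_and_ppv(targeted, events):
    """Sensitivity = events captured / all events; PPV = events captured / targeted."""
    true_positives = sum(1 for i in targeted if events[i])
    total_events = sum(events)
    sensitivity = true_positives / total_events if total_events else float("nan")
    ppv = true_positives / len(targeted) if targeted else float("nan")
    return sensitivity, ppv

def nonparametric_auc(scores, events):
    """Mann-Whitney estimate of the area under the ROC curve."""
    with_event = [s for s, e in zip(scores, events) if e]
    without_event = [s for s, e in zip(scores, events) if not e]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in with_event
        for n in without_event
    )
    return wins / (len(with_event) * len(without_event))
```

    Sensitivity answers how many of the 1996 events fall among targeted patients, whereas positive predictive value answers how many targeted patients go on to have an event, which is why a strategy that targets fewer patients can have lower sensitivity yet higher predictive value.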

    For patients without prior inpatient events or related outpatient diagnoses, we re-examined the utility of the four remaining approaches. In this subgroup, the number of macro- and microvascular complications, infectious complications, and metabolic complications was greatly reduced, leaving just 723 subjects who experienced at least one event in 1996 (561 with a macro- and microvascular event, 453 with an infectious event, and 95 with a metabolic event). Because all complications are important from a disease management perspective, and in light of the overlap of many important predictors for two or all three sets of complications, we combined these end points and modeled risk for any event. Age, sex, and race were not included in this model, despite associations with one or more outcomes in the models described above, because these characteristics present no options for risk reduction. By removing them from the models, many of the associated, mutable risk factors should contribute more strongly to risk scores. We further excluded the 3.5% of remaining patients with serum creatinine levels ≥2.0 mg/dl, reasoning that these subjects should already be targeted because of their known and very high-risk status.


    Number of subjects, demographic characteristics, and frequency of each set of complications were similar in the derivation and validation data sets (Table 1). Macro- and microvascular events occurred nearly three times as frequently as infectious events and >10 times as frequently as metabolic complications.

    Descriptions of the best models

    For each complication, predictors are shown in the order of entry into stepwise models (Table 2), with ORs for each level of the predictor, and numerical scores assigned to levels that differed significantly from the referent group. Prior hospitalizations (during 1994–1995) for similar events were the strongest predictors of both infectious and metabolic complications and the second strongest predictor of macro- and microvascular complications. Related outpatient diagnoses were the strongest predictor of macro- and microvascular events and were also strongly predictive for infectious complications. There were no outpatient diagnoses for metabolic complications. Increasing age was the third predictor to enter macro- and microvascular and infectious complication models; age was inversely related to metabolic complications.

    Several clinical predictors were common to two or all three complications sets. Use of insulin alone (i.e., without records of oral hypoglycemic agents) was associated with increased risk for all three complications sets. Hyperglycemia (average HbA1c level >10.0%), not having HbA1c measured during the baseline period, and elevation of total or LDL cholesterol were all associated with both macro- and microvascular and metabolic complications. Elevated serum creatinine level predicted both macro- and microvascular and infectious disease complications. Outpatient macro- and microvascular disease diagnoses were also a strong predictor of infectious disease events. Use of two or more different antihypertensive medications during the baseline period was a strong predictor of macro- and microvascular events. Interestingly, not having had an albuminuria/microalbuminuria screening, as well as the presence of microalbuminuria or albuminuria, predicted macro- and microvascular events.

    Comparisons of the best model with simpler approaches

    For macro- and microvascular complications, selection of subjects on the basis of a previous event or related outpatient diagnosis (i.e., the first two variables to enter the model) was as efficient as using the best model, targeting essentially the same proportion of subjects and identifying exactly the same proportion (72%) of 1996 events (Table 3). For infectious and metabolic complications, a prior-events strategy identified far fewer subjects who would have had complications during 1996 than targeting the top 30% of subjects based on model-derived risk scores. However, prior-events strategies, as assessed by positive predictive values, were more efficient because far fewer than 30% of the population was targeted. Not surprisingly, the simple three-variable models, which included previous events and related diagnoses, also did nearly as well as full models, especially for macro- and microvascular complications. Comparisons of ROC curves between full and three-variable models revealed significant differences (0.01 < P < 0.06) for each, but differences in the AUCs were quite small (≤4%) for each, suggesting that measurement and inclusion of the remaining variables in the best models adds little to predictive ability.

    Simpler numerical scores performed nearly as well as risk scores calculated directly from coefficients of the best models for each complications set. ROC curve comparisons did not reveal any significant differences in AUCs between these two scores (for each complication, P > 0.05). All ROC curve comparisons are available upon request (J.V.S.).

    An approach based on selecting subjects solely on the basis of elevated HbA1c levels was far less efficient for each complication, whether evaluated at the upper 30% cut point or across the entire range using ROC curve comparisons.

    Utility of risk scores in subjects without prior events

    Having demonstrated the importance of targeting subjects with previous events or related diagnoses, we compared the remaining approaches in the reduced population of subjects without such markers (Table 4). The first three variables to enter the best model were an elevated serum creatinine level (three levels differed significantly from the reference group of <1.0 mg/dl) followed by use of antihypertensive agents (either one or more than one) and use of insulin as the only therapy. Other significant predictors included a prior emergency department visit, having more than seven primary care visits in the 2-year span, being a current or former smoker, having more than seven outpatient visits to specialists, an average HbA1c level >10.0%, albuminuria or microalbuminuria, and not having microalbuminuria measured during the 2-year interval.

    Cumulative sensitivity for 1996 events across the full range of each risk score is shown in Fig. 1. Model sensitivities were not as high in this patient subgroup as in the full sample because of the absence of the two strongest predictors (prior events and related diagnoses). Nevertheless, all three model-based approaches improved substantially over targeting based on HbA1c alone. The numerical score is shown as a black line because its seven observed scores do not fall at decile cut points. There was essentially no difference in performance between the best model and the numerical score as judged by comparison of ROC curves (P = 0.24). The AUC for the full model was slightly greater than the AUC for the three-variable model (64 vs. 61%, P = 0.03). Identifying patients simply on the basis of their previous HbA1c levels did little better than chance in identifying those at high short-term risk.
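    A cumulative sensitivity curve of the kind plotted in Fig. 1 can be computed as in the sketch below, which ranks patients by descending score and reports the fraction of all events captured within the top 10%, 20%, and so on. The decile boundaries use integer division, which is an assumption about how ties at the boundaries are handled; the function name is hypothetical.

```python
def cumulative_sensitivity_by_decile(scores, events):
    """Fraction of all events captured in the top 10%, 20%, ..., 100% of scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_events = sum(events)
    n = len(scores)
    curve = []
    for decile in range(1, 11):
        cutoff = (n * decile) // 10          # number of patients targeted
        captured = sum(events[i] for i in order[:cutoff])
        curve.append(captured / total_events)
    return curve
```

    A score no better than chance would capture roughly 10% of events per decile, so the curve would rise along the diagonal; the further a curve lies above the diagonal, the more efficient the targeting.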


    It is frequently observed that very small proportions of a population consume a large fraction of total health costs. In this diabetic population, 20% of the members accounted for 79% of the excess costs of care in 1995 (J.V.S., unpublished data), much of which was a result of hospitalization for complications (2). We aimed to develop a tool that could help to identify those members of a population at greatest risk for complications.

    A relatively short-term (1 year) follow-up period was used in these analyses, because decision makers who fund expensive disease management programs are highly sensitive to short-term financial considerations (11). Pronk et al. (12) have shown that elevated risk factor levels translate to increased costs for diabetic patients in the short term, and two recent trials of intensive interventions for diabetes (5,13) have shown that hospitalization rates and costs of care can be reduced within 12 months. However, the major predictors in our models (hypertension, hyperglycemia, elevated serum creatinine, use of insulin only, albuminuria, and dyslipidemia) are highly consistent with previous epidemiological (14,15,16) and intervention studies (17,18,19,20,21,22,23) that used a long-term perspective.

    Several aspects of the findings should be highlighted. The importance of secondary prevention is demonstrated by the very strong predictive power of prior complications and related outpatient diagnoses. Patients with one or both of these markers accounted for well over half of the complications in 1996 and should clearly be among the first targeted by population disease–management programs. More complex prediction scores, such as those developed here, would be most helpful for targeting primary prevention in the remaining 60–70% of the diabetic population. HbA1c levels predicted increased risk for each set of complications, but model-based targeting improved substantially on selection that was based on elevated HbA1c levels. The simple numerical score, which proved to be as accurate as the score calculated directly from the best-model coefficients, would be the most convenient approach to applying our findings in other populations. In our data, a score ≥7 identified 46% of subjects without prior complications and 66% of their complications in 1996.

    Our analyses also indicate that sufficient information for predicting complications is captured in a very small number of commonly available variables. Even among patients with no prior events or related diagnoses, models containing just three variables were nearly as efficient as much more complex models in predicting short-term risk. Nearly all key variables come from data sources (hospital discharge files, outpatient visit claims, laboratory results, and pharmacy records) that are commonly available in health care systems.

    Several limitations of these analyses should be kept in mind. First, these risk scores were developed for use by programs that aim to support rather than replace clinical judgment. Although the models confirm the importance of several known clinical risk factors, the model scores derived from automated data are neither sufficiently accurate nor sufficiently complete to supplant decision making by physicians who treat individual patients in clinical settings. Other information available to the clinician, such as comorbidities or known noncompliance, could easily overrule score-based decisions.

    The high levels of missing predictor information in our clinically derived data would be considered a serious limitation in epidemiological analyses. However, our aim was to produce a disease-management tool applicable to all members of a population rather than a biological or epidemiological model of complications. By including “missing” as a value for several predictors, we also learned that “missingness” itself can sometimes signal increased risk. We repeated the best models from Table 2, excluding patients with any missing values. Although sample size dropped by as much as 75%, results were essentially identical for macro- and microvascular and infectious models. The metabolic-events model was not interpretable, because the number of end points dropped to 27.

    Although age was a strong predictor and sex was a weak but significant predictor in at least one of the best models (Table 2), we included neither variable in the final model, because it would make little sense to target only the oldest patients or only one sex for disease management activities.

    In conclusion, automated data available in many HMOs could be used to more efficiently identify diabetic patients at high risk for complications. As databases derived directly from electronic medical records replace current systems, the precision and completeness of many predictors will improve, which will further add to the accuracy of predictive models.


    Table A1—

    Hospital discharge diagnoses (and International Classification of Diseases-9 Codes) in Each Complications Set

    Table A2—

    Predictor variables examined in stepwise regression analyses in derivation data set

    Table 1—

    Demographic and clinical characteristics of diabetic patients in derivation and validation data sets

    Table 2—

    Predictors and ORs from the best models predicting 1996 macro- and microvascular, infectious, and metabolic events, derivation data set

    Table 3—

    Sensitivity and predictive values of various targeting strategies, validation data set

    Table 4—

    Significant predictors of any 1996 event, numerical score, and prevalence of predictor for the derivation sample (restricted to subjects without prior events or related outpatient diagnoses and serum creatinine <2.0 mg/dl)

    Figure 1—

    Sensitivity for any 1996 event (macro- and microvascular, infectious, or metabolic) in the validation data set. ▪, Decile of risk scores from full model; [cjs2108], reduced three-variable model; □, average 1994–1995 HbA1c levels for subjects with no prior events or related outpatient diagnoses and serum creatinine <2.0; —▴—, numerical risk score.


    This research was supported in part by a grant from Pfizer Pharmaceuticals. The authors wish to thank Lyn Wender for her editorial assistance.


    • Address correspondence and reprint requests to Joe V. Selby, Division of Research, Kaiser Permanente, 3505 Broadway, Oakland, CA 94611. E-mail: jvs{at}

      Received for publication 9 January 2001 and accepted in revised form 12 April 2001.


