Identification of Patients With Diabetes From the Text of Physician Notes in the Electronic Medical Record

  1. Alexander Turchin, MD1,2,
  2. Isaac S. Kohane, MD, PHD2 and
  3. Merri L. Pendergrass, MD, PHD1
  1. 1Division of Endocrinology, Brigham and Women’s Hospital, Boston, Massachusetts
  2. 2Medical Informatics Program, Children’s Hospital, Boston, Massachusetts
  1. Address correspondence and reprint requests to Alexander Turchin, MD, Division of Endocrinology, Brigham and Women’s Hospital, 221 Longwood Ave., Boston, MA 02115. E-mail: aturchin{at}

In this report, we describe a software tool that rapidly and reliably identifies a diagnosis of diabetes documented in physician notes in the electronic medical record.

Diabetes care in the U.S. is suboptimal: 20–40% of diabetic patients have inadequate glycemic or blood pressure control or do not have annual eye or foot examinations (1–3). Effective public health surveillance is mandatory to address this problem. It is crucial for assessing the prevalence of diabetes and its economic and social costs and for evaluating the dynamics of disease care measures and outcomes (4,5).

Disease surveillance can be significantly hampered by the difficulty in identifying the target population (6). A number of approaches have been used to identify patients with diabetes, including death certificates (7,8), billing data (9,10), and surveys (11,12). Each of these methods has its own shortcomings, and sensitivity remains relatively low. Consequently, manual chart review remains the gold standard for the identification of individuals diagnosed with a particular disease. This is a labor-intensive process that is not scalable to the level needed in public health surveillance.

Because most elements of the patient chart are increasingly available in digital format, there have been a number of attempts to identify diagnoses from the text of physician notes (13–15). However, low sensitivity and specificity remain a problem. We therefore have designed a software tool, DITTO (Diabetes Identification Through Textual element Occurrences), that accurately and rapidly identifies patients with diabetes through analysis of the texts of physician notes.


Data were obtained from the Research Patient Data Registry, a database containing clinical (laboratory results, physician notes, and radiology reports) and administrative (billing and encounter data) records on all patients treated at Massachusetts General Hospital and Brigham and Women’s Hospital. Billing codes and outpatient physician notes of 7,023 adult patients who were seen in four primary care practices in 2002–2003 and who had either at least one billing ICD-9 code of 250.xx, one serum glucose >199 mg/dl, or one measurement of HbA1c (A1C) were retrieved for analysis. Although these selection criteria identified a population with a high likelihood of diabetes diagnosis, only about a third of the patients actually had diabetes (further described in results).

DITTO is a program written in Perl that takes one or more text files containing patient notes as input. The entire text of physician notes was analyzed by DITTO for the presence of words, word roots, or groups of words (word tags) that potentially indicated the presence of a diagnosis of diabetes. These included two terms naming diabetes (“diabet” and “IDDM,” the latter also capturing “NIDDM”) and 32 names of medications exclusively used to treat diabetes. Metformin and all insulins/insulin analogs except glargine were excluded because in the practices being studied, they are commonly used to treat polycystic ovarian syndrome and gestational diabetes. Sentences with diabetes word tags were ignored if they also contained 1 of 31 negative qualifiers (e.g., “insipidus,” “family history,” “work up for,” “not”), indicating that the patient did not have diabetes. Patients with at least two sentences with diabetes word tags but without negative qualifiers were considered to have diabetes. No data other than the text of the notes were used in the analysis.
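The sentence-level rule described above can be sketched roughly as follows. This is an illustrative Python approximation, not the authors' Perl code: the tag and qualifier lists are small assumed subsets (the full lists of 32 medications and 31 qualifiers are not given in the text, and glipizide is an assumed example of a diabetes-only medication), and the sentence splitter and the function names are hypothetical.

```python
import re

# Assumed, illustrative subsets; the paper's full lists are not reproduced here.
DIABETES_TAGS = ["diabet", "iddm", "glargine", "glipizide"]
NEGATIVE_QUALIFIERS = ["insipidus", "family history", "work up for", "not"]

# Word tags are matched as substrings, so "iddm" also captures "NIDDM".
TAG_RE = re.compile("|".join(re.escape(t) for t in DIABETES_TAGS), re.IGNORECASE)
# Qualifiers are matched on word boundaries (an assumption) so that
# "not" does not fire inside words such as "note".
NEG_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(q) for q in NEGATIVE_QUALIFIERS) + r")\b",
    re.IGNORECASE,
)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def count_positive_sentences(notes):
    """Count sentences with a diabetes word tag and no negative qualifier."""
    count = 0
    for note in notes:
        for sentence in split_sentences(note):
            if TAG_RE.search(sentence) and not NEG_RE.search(sentence):
                count += 1
    return count


def has_diabetes(notes, min_sentences=2):
    """Apply the two-sentence threshold described in the text."""
    return count_positive_sentences(notes) >= min_sentences
```

For example, `has_diabetes(["Patient has type 2 diabetes. Continues glipizide 5 mg daily."])` counts two qualifying sentences, whereas a note reading "Family history of diabetes." contributes none because of the negative qualifier.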

The ability of DITTO to identify patients with diabetes was compared with 1) billing codes and 2) manual chart review. At least two codes of diabetes over 2 years were required to make the diagnosis of diabetes from billing data (16,17). One hundred fifty patient records randomly selected for manual review were examined independently by two investigators who were also blinded to the conclusions of the note text and billing code analyses. When there was a discrepancy between the two reviewers (10 of 150 records), the charts were reexamined jointly, and agreement was reached on all of the charts. McNemar’s test was used to estimate statistical significance of the difference between text and billing code analysis (18). The κ statistic (19) was used to evaluate the agreement between DITTO and manual chart review.
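As an illustration of the paired comparison, McNemar’s test depends only on the discordant records, i.e., those classified correctly by one method but not by the other. The sketch below uses the exact binomial form of the test, which is an assumption: the text does not state which variant was applied, and the discordant counts in the usage line are hypothetical, not the study’s data.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: records classified correctly by method A only
    c: records classified correctly by method B only
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    # Probability of a split at least as lopsided as the observed one, one tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts, for illustration only.
p_value = mcnemar_exact(8, 1)
```

Concordant records (both methods right or both wrong) do not enter the calculation, which is why the test suits two classifiers evaluated on the same charts.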


DITTO processed 182,345 physician notes in 40 min. The estimated processing speed was 2.7 × 10^5 notes/h. Of the 7,023 records that were analyzed, billing data identified 2,007 and DITTO identified 2,982 diabetic patients.
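The throughput figure follows directly from the counts reported above; a short check, with variable names chosen here for illustration:

```python
notes = 182_345      # physician notes processed
records = 7_023      # patient records analyzed
minutes = 40         # total processing time

# Scale the 40-minute totals to one hour.
notes_per_hour = notes * 60 / minutes
records_per_hour = records * 60 / minutes

print(f"{notes_per_hour:.1e} notes/h, {records_per_hour:.0f} records/h")
```

This gives roughly 2.7 × 10^5 notes/h and a little over 10,000 patient records per hour.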

In the manual chart review, κ statistic for agreement between the two reviewers was 0.87 (P = 0.05 that κ ≥ 0.8). Of the 150 records randomly selected from 7,023 patient records, manual review identified 52 patients as having a documented diagnosis of diabetes. Billing data analysis detected 40, and DITTO detected 50 of these patients (Table 1).

κ statistic for agreement between DITTO and manual chart review was 0.94 (P < 0.001 that κ ≥ 0.8). Using manual chart review as the gold standard, sensitivity of DITTO was 96.2% and specificity 98.0%. Compared with DITTO, billing code analysis had substantially lower sensitivity (76.9%) but similarly high specificity (98.0%), consistent with findings of other investigators (9,10). This difference between the two methods was statistically significant (P = 0.02).
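The reported figures can be reproduced from 2 × 2 counts. The counts below are a reconstruction, not taken from Table 1: true positives and false negatives follow from the 52 patients identified by manual review (50 detected by DITTO, 40 by billing data), while the 96 true negatives and 2 false positives among the remaining 98 records are inferred as the only whole-number split consistent with a specificity of 98.0%.

```python
def summarize(tp, fn, fp, tn):
    """Return (sensitivity, specificity, Cohen's kappa) for a 2 x 2 table."""
    n = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    observed = (tp + tn) / n  # observed agreement with the reference
    # Agreement expected by chance, from the marginal totals.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n ** 2
    kappa = (observed - expected) / (1 - expected)
    return sensitivity, specificity, kappa


# Reconstructed counts (assumption, see lead-in), manual review as reference.
ditto = summarize(tp=50, fn=2, fp=2, tn=96)
billing = summarize(tp=40, fn=12, fp=2, tn=96)

for name, (sens, spec, kappa) in (("DITTO", ditto), ("billing", billing)):
    print(f"{name}: sensitivity {sens:.1%}, specificity {spec:.1%}, kappa {kappa:.2f}")
```

With these counts, the DITTO row yields a sensitivity of 96.2%, a specificity of 98.0%, and κ of 0.94, matching the values in the text; the billing row yields a sensitivity of 76.9%.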


We have designed DITTO, a software tool that identifies the diagnosis of diabetes documented in the chart using analysis of the text of physician notes. DITTO is more sensitive and at least as specific as the best of the previously reported methods. It is very fast and can process over a quarter of a million patient notes (∼10,000 individual patient records) per hour. It requires minimal customization (mostly related to the format of the patient medical record numbers and separators between the notes in the text file) for adaptation to a different health care organization. It requires a minimal set of data (e.g., no insurance medication claims) that can be commonly obtained in many health care facilities.

Identification of patients with a particular diagnosis is a problem of great importance for public health care at every level: national, regional, and individual health care facilities. Our tool is an important advance in this field, and we plan to continue to develop this concept further to improve its performance, comprehensiveness, and functionality.

Table 1. Comparison of DITTO and billing data analysis to manual chart review



    • Accepted March 18, 2005.
    • Received December 28, 2004.

