Journal of Informatics in Primary Care 1999 (June):2-8


Papers


The Impact of Single Versus Dual Data Entry on Accuracy of Relational Database Information

Willem H Meeuwisse, MD PhD (a); Brent E Hagel, MSc (a); Gordon H Fick, PhD (b)
(a) University of Calgary Sport Medicine Centre, Faculty of Kinesiology
(b) Department of Community Health Sciences, Faculty of Medicine, University of Calgary
Correspondence to Dr WH Meeuwisse, University of Calgary Sport Medicine Centre,
2500 University Drive NW, Calgary, Alberta T2N 1N4, Canada.
Tel: 403-220-8426; Fax: 403-282-6170; email: meeuwiss@acs.ucalgary.ca


Abstract

Objective: To evaluate the differences in entry errors associated with entry of prospective cohort data using a single versus a dual data entry system.

Design: Cross-over trial

Measurements: The database for the Canadian Intercollegiate Sport Injury Registry (CISIR) has, to date, used a dual data entry system. To test the differences in entry error, one season of injury and exposure (participation) data were entered for one sport from two participating institutions. A total of 5 data entry clerks were involved. Discrepancies between the final entries from both systems were flagged and checked against the original paper record. The number of instances where the dual entry system was correct but the single entry system was incorrect, and vice versa, were determined.

Results: A total of 17,988 database fields were evaluated (15,509 text and 2,479 fixed response fields). There was a statistically significant difference in error where the dual entry system was correct but the single entry system was incorrect for 2 out of the 4 data entry modules. The percent accuracy using the single entry system was lower (99.35%) than that using the dual entry system (99.83%).

Conclusions: Although the single entry system was statistically different from the dual entry system, the percent accuracy remained high and the errors did not cluster in one specific field. Given the doubling of entry costs with a dual entry system, it is recommended that a single entry system be used for the CISIR.

Keywords: Dual Data Entry, Relational Database, Cross-over trial.


Introduction

Many investigators have focused on the rigorous application of data collection methods to ensure that the highest quality of data is obtained in a study. However, data entry errors may also compromise the quality of the data, depending on the process involved and the type of information being input into a microcomputer. Studies have demonstrated lower data entry error rates using a dual or double data entry system versus a single entry system1, 2. Single entry error rates have been reported between 9.5 and 64.9 errors per 10,000 fields1-4, while those for dual entry range between 2.5 and 15 per 10,000 fields1, 2, depending on the type of form evaluated.

It is less clear, however, if the lower error rates justify the dramatic increase in time devoted to a dual entry system in the form of development (i.e., programming) and actual data entry. Dual entry has been reported to take 36% longer than single entry alone2, although this account only considered the differences in data entry time and did not factor in the additional time involved to generate the more elaborate computer program required for dual entry. One group of investigators has in fact opted to use a single entry system despite a higher error rate found using that approach compared to dual entry1. The authors justify their position primarily by the fact that the errors that were made were not considered vital. That is, they would not have significantly influenced the conclusions those investigators would have drawn based on analysis of the single-entered data1. Yet other investigators have concluded that "the additional time required for dual entry did not outweigh the value of added quality in the data collected"2.

Because of the recent rapid advances in the ease of use and sophistication of relational database software, screen development and data input practices have changed. Mouse-driven input in the form of simple radio buttons, check boxes, pop-ups and scrolling lists has begun to replace the direct entry of text from the keyboard.

The effect of this shift in the mode of data entry on data accuracy, if any, has not been explored in the medical literature. Therefore, an estimate of the number and types of errors made using a dual versus single entry system involving multiple modes of input (i.e., mouse- versus keyboard-entered data) is needed. It would prove valuable for individuals involved in the planning and development of relational database systems for a variety of research purposes.

The purpose of this investigation is to evaluate the number and types of errors made using a dual versus single entry database system with multiple modes of data input. An evaluation of both the statistical and practical significance of the findings is provided.


Methods

The Canadian Intercollegiate Sport Injury Registry (CISIR) is a prospective cohort system designed to evaluate the rates and risks of injury in university athletics5. The CISIR has been in operation since 1993. The sports studied with the CISIR include men’s football, men’s and women’s hockey, men’s and women’s basketball and women’s volleyball. The data collection in all sports utilises two primary instruments: a daily participation log and an individual injury report form. The injury report form is filled out for all injuries reported to a physician or athletic therapist and includes the circumstances surrounding the injury, as well as the therapist’s assessment and/or physician’s diagnosis. It contains areas for both fixed responses (check box) and for recording of free text. The athlete participation information is captured every session (game or practice) for every player on the team using a single-character code for the level of participation, plus a single-character explanatory code. In addition, environmental conditions for each practice or game session are recorded in a similar manner.

A dual entry database system has been developed using Microsoft FoxPro™ relational database software6. In order to make the data entry process less complex, the design of the data entry system reflects the layout of the actual forms used to collect the injury (Figure 1) and participation (Figure 2) information. Although designing the entry screen to reflect the layout of the paper form would presumably ease the data entry process, to the authors’ knowledge this issue has not been studied. Different entry modules exist for both the injury and athlete participation data, with supplementary entry screens for certain details. For example, with the injury report form module, the injury circumstances are entered on one screen and, when completed, specific information surrounding the assessment and/or diagnosis is then input on a second screen. With the athlete participation information module, the environmental factors for that session are entered on one screen, followed by a screen which allows for entry of the participation codes for each player. These screens use a combination of radio buttons, scrolling lists, and text fields.

Figure 1: Individual Injury Report Form data entry screen.

 

Figure 2: Environmental factors or day exposure information entry screen

A dual entry system has been used for the past four years. With this approach, injury or participation data are entered first and stored by the computer. The data entry clerk then re-enters the information using a second set of identical entry screens, and these data are stored as well. When the second entry is completed, the program compares the first and second sets of entries one field at a time, and any discrepancies between the entries are flagged and brought to the data entry clerk’s attention. The entry clerk then has the opportunity to correct the information. When all discrepant entries have been viewed by the entry clerk, the data are stored to the relevant data table of the relational database.
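The field-by-field comparison described above can be sketched in a few lines of Python. This is only an illustration, not the FoxPro program used for the CISIR: the function name, the dictionary representation of a record and the field names are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# A record is modelled as a mapping of field name to entered value (an assumption).
Record = Dict[str, str]


def find_discrepancies(first: Record, second: Record) -> List[Tuple[str, str, str]]:
    """Compare two entry passes one field at a time.

    Returns a list of (field, first_value, second_value) for every field
    where the two passes differ; a missing field is treated as blank.
    """
    flagged = []
    for field in sorted(set(first) | set(second)):
        a = first.get(field, "")
        b = second.get(field, "")
        if a != b:
            flagged.append((field, a, b))
    return flagged


# Hypothetical example: the clerk skipped "venue" on the second pass.
first_pass = {"injury_status": "new", "venue": "home", "timeout": "3"}
second_pass = {"injury_status": "new", "venue": "", "timeout": "3"}
flagged = find_discrepancies(first_pass, second_pass)
print(flagged)  # [('venue', 'home', '')]
```

In a real entry program, each flagged field would be shown to the clerk for correction before the record is written to the data table.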

For the purposes of this study, a single entry system was developed based on the initial entry modules of the dual entry system. That is, the screens for the single entry system for all modules were identical to the dual entry system. However, the data entry clerk was only required to enter the information once, at which time it was stored directly to the relevant data table.

Canadian Intercollegiate Sport Injury Registry (CISIR) football injury and participation forms for the 1997 season were used from two institutions. These forms were entered first using the dual and then the single entry systems. With both entry systems, the injury information was entered prior to the participation data, as participation time loss due to injury had to be linked to specific injury reports (as is the case in the normal operation of the CISIR data checking and validation).

There were a total of five individuals who entered both the injury and participation information in a random order. This was done to emulate the conditions surrounding the entry of the data under normal circumstances. Each clerk entered similar proportions of data using both the single and dual entry systems although they did not enter the same information using each system.


Analysis

The unit of analysis was the individual field in the different data tables. There were two types of fields evaluated, fixed response and text fields. Fixed response fields in the data tables were those fields which required mouse input from the data entry clerk (for example, a selection from among two or more possible choices). The text fields required keyboard input (such as an entry of the events surrounding an injury).

The primary method of analysis concerned the determination of where the dual and the single entry systems were not the same for a particular field in the database (that is, a column in the database). Assuming that there was no difference between the number of errors made using the dual compared to the single entry system (the null hypothesis), we would expect errors to occur with equal probability using each system (probability of 50%).

Microsoft Excel™ was used to determine where there were differences between the dual versus single entry fields7. Using a single spreadsheet formula, the fields in each of the single and dual entry systems for each module were evaluated using a relative addressing formula. Where the fields were the same, the value of the logical equation was "TRUE". However, where the two fields differed, the value of the logical equation was "FALSE". The "FALSE" entries were analysed as discordant pairs (instances where the dual and single entry systems were not the same). The "FALSE" instances were then checked against the original paper forms to determine which of the dual versus the single field entries was incorrect. Using the binomial distribution, an evaluation of the probability of observing the distribution of dual entry correct/single entry incorrect and vice versa could then be made. The discordant pair analysis procedure and rationale are illustrated in Table 1.

|   | Dual Entry Field Correct | Dual Entry Field Incorrect |
|---|---|---|
| Single Entry Field Correct | Not of primary interest | Expect 50% of errors under null hypothesis |
| Single Entry Field Incorrect | Expect 50% of errors under null hypothesis | Not of primary interest |

Table 1: Discordant pair analysis used to evaluate differences between the dual and single entry database systems
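As a rough illustration of the discordant pair analysis, the Python sketch below tallies the two off-diagonal cells of Table 1 and computes an exact binomial p-value under the null hypothesis of a 50/50 split. The function names and sample rows are hypothetical, and doubling the smaller tail to obtain a two-sided p-value is an assumption about how the test was carried out.

```python
from math import comb
from typing import Iterable, Tuple


def tally_discordant(rows: Iterable[Tuple[str, str, str]]) -> Tuple[int, int]:
    """Count discordant fields from (dual, single, paper) value triples.

    Returns (dual_incorrect_single_correct, single_incorrect_dual_correct).
    Fields where the systems agree, or where neither matches the paper
    record, are skipped because they are not of primary interest here.
    """
    dual_wrong = single_wrong = 0
    for dual, single, paper in rows:
        if dual == single:
            continue  # concordant pair ("TRUE" in the spreadsheet)
        if dual == paper:
            single_wrong += 1
        elif single == paper:
            dual_wrong += 1
    return dual_wrong, single_wrong


def exact_two_sided_p(dual_wrong: int, single_wrong: int) -> float:
    """Exact binomial p-value assuming each discordant error is equally
    likely to come from either system (probability 0.5)."""
    n = dual_wrong + single_wrong
    k = min(dual_wrong, single_wrong)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)  # assumed two-sided: double the smaller tail


# Hypothetical (dual, single, paper) triples for three fields.
rows = [("home", "home", "home"), ("home", "", "home"), ("3", "4", "4")]
print(tally_discordant(rows))   # (1, 1)
print(exact_two_sided_p(2, 5))  # 0.453125
```

With 2 dual-incorrect and 5 single-incorrect fields, the sketch gives a p-value of about 0.45, which agrees with the Diagnosis row of Table 3.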

In addition to the primary discordant pair analysis detailed above, an assessment of the actual differences was made, that is, whether any statistically significant differences were considered practically relevant in terms of data accuracy.


Results

Table 2 presents the total number of fields and the number of field errors of each type, by relational database table. Table 3 details the instances where the dual entry system field was correct and the single entry field was incorrect (and vice versa). Information on the odds of the single entry being incorrect relative to the odds of the dual entry being incorrect is provided with an associated 95% confidence interval.

| Data Table | Total Fixed Response Fields | Dual Entry Fixed Response Field Errors | Single Entry Fixed Response Field Errors | Total Text Fields | Dual Entry Text Field Errors | Single Entry Text Field Errors |
|---|---|---|---|---|---|---|
| Diagnosis | 470 | 0 | 2 | 470 | 2 | 3 |
| Injury | 1204 | 8 | 43 | 2408 | 8 | 17 |
| Athlete Exposure (Participation) | 0 | 0 | 0 | 12148 | 8 | 43 |
| Day Exposure | 805 | 2 | 5 | 483 | 2 | 0 |
| Total | 2479 | 10 | 50 | 15509 | 20 | 63 |

Table 2: The total number of fixed response and text field errors including the total number of fields by data table

 

| Data Table | Dual Entry Field Incorrect where Single Correct (%) | Single Entry Field Incorrect where Dual Correct (%) | P-value | Odds Ratio (odds of single incorrect relative to odds of dual incorrect) | 95% Confidence Interval for Odds Ratio |
|---|---|---|---|---|---|
| Diagnosis | 2 (0.20) | 5 (0.53) | 0.45 | 2.5 | 0.41 to 26.25 |
| Injury | 16 (0.44) | 60 (1.66) | <0.0001 | 3.75 | 2.13 to 6.98 |
| Athlete Exposure (Participation) | 8 (0.07) | 43 (0.35) | <0.0001 | 5.40 | 2.50 to 13.24 |
| Day Exposure | 4 (0.31) | 5 (0.39) | >0.05 | 1.25 | 0.27 to 6.30 |
| Total | 30 (0.17) | 116 (0.65) | <0.0001 | 3.83 | 2.55 to 5.94 |

Table 3: Discordant pair analysis by data table with associated odds ratios

Table 3 demonstrates that the dual entry system resulted in significantly fewer errors compared to the single entry system for two out of the four database entry modules. Specifically, with 95% confidence, the odds of the single entry being incorrect were at least 2.13 and 2.50 times the odds of the dual entry being incorrect for the injury and athlete participation data tables, respectively. These odds ratios pertain to the discordant pairs only.
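For readers who wish to reproduce this kind of estimate, the following Python sketch computes an odds ratio from discordant pairs together with an exact confidence interval. The paper does not state how its intervals were obtained; the Clopper-Pearson approach used here (an exact binomial interval on the proportion of discordant pairs, converted to odds) is an assumption, and all function names are hypothetical.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect(func, lo: float = 0.0, hi: float = 1.0, iters: int = 100) -> float:
    """Locate the zero crossing of a function that increases on [lo, hi]."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if func(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k: int, n: int, alpha: float = 0.05):
    """Exact (Clopper-Pearson) confidence interval for a binomial proportion."""
    lower = 0.0 if k == 0 else _bisect(
        lambda p: (1 - binom_cdf(k - 1, n, p)) - alpha / 2)
    upper = 1.0 if k == n else _bisect(
        lambda p: alpha / 2 - binom_cdf(k, n, p))
    return lower, upper


def _to_odds(p: float) -> float:
    return float("inf") if p >= 1.0 else p / (1.0 - p)


def discordant_odds_ratio(single_wrong: int, dual_wrong: int, alpha: float = 0.05):
    """Odds of single entry incorrect relative to dual entry incorrect,
    estimated from discordant pairs only, with an exact interval."""
    point = float("inf") if dual_wrong == 0 else single_wrong / dual_wrong
    p_lo, p_hi = clopper_pearson(single_wrong, single_wrong + dual_wrong, alpha)
    return point, _to_odds(p_lo), _to_odds(p_hi)


# Injury data table: 60 single-incorrect and 16 dual-incorrect fields (Table 3).
estimate, ci_low, ci_high = discordant_odds_ratio(60, 16)
print(f"OR = {estimate:.2f}, 95% CI {ci_low:.2f} to {ci_high:.2f}")
```

For the injury data table the point estimate is 60/16 = 3.75, and under these assumptions the interval should land close to the limits reported in Table 3.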

Overall, the dual entry system demonstrated an accuracy of 99.83% versus 99.35% for the single entry system. The dual entry system demonstrated a slightly higher accuracy rate than the single entry system with every data table module.

Lastly, Table 4 shows the distribution of errors across fixed response and text fields for the injury report form data table (the table that had the greatest percentage of errors with the single entry system). Of the 43 single entry fixed response field errors, 17 (39.5%) were fields which were left blank when a selection should have been made.

 

| Field | Dual Incorrect | Single Incorrect | Field Type |
|---|---|---|---|
| Injury Status | 1 | 8 | Fixed Response |
| Brace | 0 | 3 | Fixed Response |
| Return to activity | 0 | 3 | Fixed Response |
| Involved | 1 | 5 | Fixed Response |
| During | 3 | 6 | Fixed Response |
| Venue | 0 | 7 | Fixed Response |
| Treatment | 3 | 11 | Fixed Response |
| Injury Date | 1 | 2 | Text |
| Report Date | 4 | 4 | Text |
| Position When Injured | 1 | 2 | Text |
| Normal Position | 1 | 3 | Text |
| Events | 0 | 0 | Text |
| Remarks | 0 | 1 | Text |
| Notes | 0 | 0 | Text |
| Treatment Notes | 0 | 0 | Text |
| Timeout | 0 | 4 | Text |
| Therapist's Name | 0 | 0 | Text |
| Medical Treatment | 1 | 0 | Text |
| Physician's Name | 0 | 0 | Text |
| Phys./Therapist Agreement | 0 | 0 | Text |
| Final Diagnosis | 0 | 1 | Text |
| Total | 16 | 60 | - |

Table 4: Distribution of errors by field of the injury report form data table


Discussion

There have been rapid technological advances in the software application tools available for the development of data entry screens for relational databases. This has, in turn, allowed for more utilisation of mouse-controlled input such as radio buttons, check boxes, pop-up and scrolling lists for data entry, which has reduced the use of the keyboard for some data entry.

This investigation revealed a rate of 17 errors per 10,000 data table fields for the dual entry system and 65 errors per 10,000 data table fields for the single entry system. These error rates appear to be somewhat higher than those reported for other dual and single entry systems. This may be the result of having relatively inexperienced data entry clerks input the data using both systems, although other investigators have not found that less experienced individuals tend to make more entry errors8. The ratio of the error rates, however, is in line with those previously reported. Specifically, Neaton et al4 concluded that "in the absence of verification, one can expect 4 to 5 times more fields in error". This is consistent with the overall findings presented here.

It appears that many of the discrepancies between the two systems were attributable to the fixed response fields (that is, mouse-controlled input). Intuitively, one would expect that because the data entry clerk could choose from among a number of options with a simple mouse click for many of the data entry fields, the error rates would be lower than for fields requiring keyboard input. A potential explanation for this unexpected finding may be that the process of entry with a mouse results in the data entry clerks simply missing fields. This argument is strengthened by the finding that 39.5% of the fixed response errors were attributable to blank fields (with the injury report single entry system). Future investigations should separate text and fixed response fields to determine if the higher rates of errors in fixed response entries are peculiar to the CISIR system. Checks against blank fields may be used in the future to guard against this problem with the CISIR data entry process.
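A check against blank fields of the kind suggested above might look like the following Python sketch. It is not part of the CISIR software; the field names are hypothetical, loosely modelled on the fixed response fields listed in Table 4, and the choice of which fields are mandatory is an assumption.

```python
from typing import Dict, List, Sequence

# Hypothetical list of fixed response fields that must not be left blank.
REQUIRED_FIXED_FIELDS = ("injury_status", "venue", "treatment")


def blank_required_fields(
    record: Dict[str, str],
    required: Sequence[str] = REQUIRED_FIXED_FIELDS,
) -> List[str]:
    """Return the names of required fields that are missing or blank."""
    return [name for name in required if not record.get(name, "").strip()]


# Hypothetical record in which the clerk skipped the "venue" selection.
entry = {"injury_status": "new", "venue": "", "treatment": "ice"}
missing = blank_required_fields(entry)
print(missing)  # ['venue']
```

An entry screen could refuse to store a record until this list is empty, which would target the errors of omission seen with the single entry system.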

It seems logical to assume that using a dual versus single data entry system would reduce the number of errors made on input. There was, in fact, evidence to suggest a higher number of errors overall with the single compared with the dual entry system. However, dual entry systems carry a significant cost both for the actual programming of the data entry screen and for the data entry (that is, twice the data entry time relative to the single entry system). With the very low rate of errors evident with the use of the single entry system (0.65% error rate or 99.35% accuracy rate) a strong argument can be made for the use of a single entry system. Furthermore, since the errors were not clustered in any one category or field, their influence on any conclusions drawn from the data would be negligible.


Limitations

The lack of random order evident in this investigation with respect to the dual entry first, then single entry second protocol could have biased the results. If this bias did exist, its effect would most likely have been to reduce the number of errors made with the single entry system due to data entry clerk familiarity with the system and the data (learning and memorisation effects). However, due to (1) the use of five data entry clerks, (2) the large volume of information entered and (3) the relative simplicity of the data entry process, it is doubtful that a learning or memorisation effect would have influenced the results in a practical or a statistical sense.

The use of five data entry clerks to carry out the project may be criticised. The strict comparison of the dual versus single entry system may have been diluted with ‘between subject’ differences as opposed to the direct comparison of the systems. However, the impetus for this investigation was to determine how the error rates were affected by a dual versus single entry system under normal or practical conditions and not under ‘ideal’ or laboratory circumstances. That is, many data entry clerks are often involved in the entry of the CISIR data and this investigation sought to determine if the single entry system was inferior to the dual entry system under these everyday conditions. Due to the nature of the investigation, however, it was not possible to examine which data entry clerks were responsible for particular errors. It may be that one entry clerk was responsible for a disproportionate number of errors. However, each data entry clerk had a similar level of experience using the system and was given the same introduction to the use of the system. Further, all entry clerks had the opportunity to ask questions about the data entry process throughout the duration of the study. Therefore, it is unlikely that one clerk was responsible for the majority of errors with either system.

If we assume that the first entry in the dual entry process is independent of the second entry, then we would expect the probability of making a mistake on the dual entry system to be the square of the probability of making a mistake on the single entry system. This is not, however, what we have found. Specifically, the probability of an error on the single entry system was 65 errors per 10,000 fields or 0.0065. If the probability of making an error on the second entry screen was independent of making an error on the first entry screen, then we would expect the error rate for the dual entry system to be 0.0065 × 0.0065 ≈ 0.000042. The probability we obtained was 0.0017, or 17 errors per 10,000 fields, a rate about 40 times higher than would be expected under the assumption of independence.
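The arithmetic in the preceding paragraph can be checked with a short Python sketch. The error rates are those quoted in the text; the function names are introduced only for this illustration.

```python
def expected_dual_rate(p_single: float) -> float:
    """Expected dual entry error rate if both passes err independently."""
    return p_single ** 2


def observed_to_expected(p_dual_observed: float, p_single: float) -> float:
    """How many times higher the observed dual rate is than the expected one."""
    return p_dual_observed / expected_dual_rate(p_single)


p_single = 0.0065         # 65 errors per 10,000 fields (single entry)
p_dual_observed = 0.0017  # 17 errors per 10,000 fields (dual entry)

expected = expected_dual_rate(p_single)
ratio = observed_to_expected(p_dual_observed, p_single)
print(f"expected dual rate: {expected:.6f}")    # 0.000042
print(f"observed / expected: {ratio:.0f}")      # 40
```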

There are a number of plausible explanations for this result. When entering information in the dual entry system, the clerks may be less likely to scrutinise their input to the extent they would using only a single entry system because they know they get a ‘second chance’. In addition, they may be more likely to make the same mistake twice when entering the same data in succession. There may also have been some confounding of the rates by field difficulty, whereby the overall rate for the dual entry system is much higher than expected, but if we were to stratify on field difficulty we might find that the rates are much closer to what we would expect. Finally, chance may in part account for the higher than expected dual entry error rates. Specifically, the upper limit of the 95% confidence interval for the single entry error rate and the lower limit of the 95% confidence interval for the dual entry error rate produce a dual entry error rate which is only 24 times higher than the square of the single entry error rate.

It is interesting to note that our finding of a much greater error rate for the dual entry system than would be expected based on the error rate for the single entry system has been seen by other investigators4. Perhaps confounding by field difficulty, variability, and data entry clerk behaviour all contribute to the higher than expected dual entry error rate, based on the value for the single entry error rate.


Conclusions

Although there were statistically significantly more errors made with the single entry system, the practical differences were very small. In fact, using the single entry system resulted in less than a 1% error rate overall (that is, an accuracy of over 99%) but cut the entry time in half. Based on the information in Table 4, it is apparent that the errors made with the single entry system did not cluster in one field. That is, the errors would not have significantly influenced the results of an analysis or the associated conclusions drawn from the data. Further, although the fixed response fields were more often incorrect with the single entry system, the errors made did not demonstrate a tendency to any particular type of response but were, for the most part, simple errors of omission, where fields were left blank. Therefore, a well developed single entry system with logical range checks and checks against blank fields may prove to be the most cost-effective system to use for entry of data for systems such as the Canadian Intercollegiate Sport Injury Registry.


References

1 Gibson D, Harvey AJ, Everett V, Parmar KB. Is double data entry necessary? The CHART trials. Controlled Clinical Trials 1994; 15:482–488
2 Reynolds-Haertle RA, McBride R. Single vs. double data entry in CAST. Controlled Clinical Trials 1992; 13:487–494
3 Bagniewski A, Black D, Molvig K, et al. Data quality in a distributed data processing system: the SHEP pilot study. Controlled Clinical Trials 1986; 7:27–37
4 Neaton JD, Duchene AG, Svendsen KH, Wentworth D. An examination of the efficiency of some quality assurance methods commonly employed in clinical trials. Statistics in Medicine 1990; 9:115–124
5 Meeuwisse WH, Love EJ. Development, implementation, and validation of the Canadian Intercollegiate Sport Injury Registry. Clinical Journal of Sport Medicine 1998; 8:164–177
6 FoxPro/Mac, Microsoft Corporation, 1994
7 Excel/Mac, Microsoft Corporation, 1995
8 Crombie IK, Irving JM. An investigation of data entry methods with a personal computer. Computers and Biomedical Research 1986; 19:543–550


Online ISSN 2058-4563 - Print ISSN 2058-4555. Published by BCS, The Chartered Institute for IT