

Comment

This is the first study (to our knowledge) that has systematically evaluated agreement among ROP experts for plus disease diagnosis. Consistent and accurate detection of plus disease has an increasingly critical role in the identification of treatment-requiring ROP. This is particularly relevant because the multicenter ETROP trial recently determined that presence of plus disease is sufficient for meeting the definition of type 1 ROP, which benefits from early treatment regardless of the exact number of clock hours of peripheral disease.4

Table. Absolute Agreement in Plus Disease Diagnosis Among 22 Experts Reviewing 34 Images(a)

         3-Level Categorization, No. (%)        2-Level Categorization, No. (%)
Image    Plus        Pre-plus    Neither        Plus        Not plus
2        1 (5)       16 (76)     4 (19)         1 (5)       20 (95)
3        14 (70)     6 (30)      0              14 (70)     6 (30)
4        5 (24)      12 (57)     4 (19)         5 (24)      16 (76)
5        3 (14)      9 (43)      9 (43)         3 (14)      18 (86)
6        22 (100)    0           0              22 (100)    0
7        1 (5)       9 (41)      12 (55)        1 (5)       21 (96)
8        21 (96)     1 (5)       0              21 (96)     1 (5)
9        0           9 (43)      12 (57)        0           21 (100)
10       0           0           22 (100)       0           22 (100)
11       22 (100)    0           0              22 (100)    0
12       1 (5)       11 (50)     10 (46)        1 (5)       21 (96)
13       7 (32)      15 (68)     0              7 (32)      15 (68)
14       2 (10)      11 (52)     8 (38)         2 (10)      19 (90)
15       12 (60)     8 (40)      0              12 (60)     8 (40)
16       1 (5)       10 (48)     10 (48)        1 (5)       20 (95)
17       8 (38)      11 (52)     2 (10)         8 (38)      13 (62)
18       1 (5)       10 (46)     11 (50)        1 (5)       21 (96)
19       2 (10)      14 (67)     5 (24)         2 (10)      19 (90)
20       20 (95)     1 (5)       0              20 (95)     1 (5)
21       0           8 (38)      13 (62)        0           21 (100)
22       11 (52)     10 (48)     0              11 (52)     10 (48)
23       17 (77)     5 (23)      0              17 (77)     5 (23)
24       0           5 (23)      17 (77)        0           22 (100)
25       2 (10)      9 (43)      10 (48)        2 (10)      19 (90)
26       16 (73)     6 (27)      0              16 (73)     6 (27)
27       1 (5)       8 (36)      13 (59)        1 (5)       21 (96)
28       14 (64)     8 (36)      0              14 (64)     8 (36)
29       1 (5)       15 (71)     5 (24)         1 (5)       20 (95)
30       17 (81)     4 (19)      0              17 (81)     4 (19)
31       1 (5)       8 (36)      13 (59)        1 (5)       21 (96)
32       3 (14)      14 (64)     5 (23)         3 (14)      19 (86)
33       17 (77)     5 (23)      0              17 (77)     5 (23)
34       22 (100)    0           0              22 (100)    0

(a) Number of images in each row may not add to 22 because images categorized as "cannot determine" were excluded for that expert.
(b) The column headings and the row for image 1 did not survive in this copy; the headings shown follow the category order given in the Figure 3 legend.

The main finding from this study is that interexpert agreement of plus disease diagnosis is imperfect. Using a 3-level categorization, all 22 experts agreed on the same diagnosis in 4 of 34 images (12%) (Figure 2), and the mean weighted κ statistic for each expert compared with all others ranged from 0.25 (fair agreement) to 0.55 (moderate agreement) (Figure 3). Using a 2-level categorization, all experts who provided a diagnosis agreed in 7 of 34 images (21%) (Figure 2), and the mean κ statistic for each expert compared with all others ranged from 0.19 (slight agreement) to 0.66 (substantial agreement) (Figure 3). This degree of variability suggests that image-based plus disease diagnosis may be heavily subjective.

Binocular indirect ophthalmoscopy by an experienced ophthalmologist is considered the criterion standard for diagnosis and classification of ROP.10 By definition, a criterion standard test should have complete accuracy and consensus.13 The extent of disagreement in plus disease diagnosis among recognized ROP authorities in this study raises important questions about the reliability of this standard. This inconsistency presumably results from subjective differences in judgment among experts even while viewing the same images or from varying interpretations of the definition of plus disease.3 Several studies14-16 explored the possibility of plus disease detection using computer-based image analysis. If these automated systems can be shown to have accuracy comparable to that of human experts, the objectivity and reproducibility of computer-based techniques may offer important advantages over current methods.
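
As an illustration of the agreement statistics reported above, the following is a minimal sketch of how a mean κ "for each expert compared with all others" could be computed. It is not the authors' analysis code. Several details are assumptions made for the example: the function names are hypothetical, linear disagreement weights are used for the weighted κ (the weighting scheme is not stated in this section), and an image is dropped from a pairwise comparison when either expert marked it "cannot determine" (encoded here as None). The verbal bands use the ranges shown in Figure 3, extended with the usual Landis and Koch labels above 0.60.

```python
import math

def cohen_kappa(a, b, n_levels, weighted=False):
    """Cohen's kappa for two raters; labels are ints 0..n_levels-1.

    weighted=True applies linear disagreement weights |i - j| / (n_levels - 1),
    which is an assumption for this sketch. Pairs containing None
    ("cannot determine") are skipped, also an assumption.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    n = len(pairs)
    if n == 0:
        return math.nan
    # Observed joint proportions and each rater's marginal proportions.
    obs = [[0.0] * n_levels for _ in range(n_levels)]
    for x, y in pairs:
        obs[x][y] += 1.0 / n
    row = [sum(obs[i]) for i in range(n_levels)]
    col = [sum(obs[i][j] for i in range(n_levels)) for j in range(n_levels)]

    def weight(i, j):
        if weighted:
            return abs(i - j) / (n_levels - 1)
        return 0.0 if i == j else 1.0

    cells = [(i, j) for i in range(n_levels) for j in range(n_levels)]
    d_obs = sum(weight(i, j) * obs[i][j] for i, j in cells)
    d_exp = sum(weight(i, j) * row[i] * col[j] for i, j in cells)
    if d_exp == 0:
        return math.nan  # both raters used one identical category; undefined
    return 1.0 - d_obs / d_exp

def mean_kappa_vs_others(ratings, n_levels, weighted=False):
    """For each rater, the mean kappa against every other rater."""
    means = []
    for i, mine in enumerate(ratings):
        ks = [cohen_kappa(mine, other, n_levels, weighted)
              for j, other in enumerate(ratings) if j != i]
        means.append(sum(ks) / len(ks))
    return means

def agreement_band(k):
    """Verbal label for a kappa value (Landis and Koch style bands)."""
    if k < 0:
        return "poor"
    if k <= 0.20:
        return "slight"
    if k <= 0.40:
        return "fair"
    if k <= 0.60:
        return "moderate"
    if k <= 0.80:
        return "substantial"
    return "almost perfect"

# Toy data only (0 = neither, 1 = pre-plus, 2 = plus); not study data.
toy_ratings = [
    [0, 1, 2, 2, 0, 1],
    [0, 1, 2, 1, 0, 2],
    [0, 1, 2, 2, 0, 1],
]
toy_means = mean_kappa_vs_others(toy_ratings, n_levels=3, weighted=True)
toy_labels = [agreement_band(k) for k in toy_means]
```

For the 2-level categorization, the same functions would be called with n_levels=2 and weighted=False after collapsing pre-plus and neither into a single "not plus" label.
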
We emphasize that the objective of this study was to evaluate agreement in diagnosis among experts rather than to measure accuracy compared with a criterion standard. The opinion of any single participant in this study would certainly be regarded as a criterion standard for diagnosis, although we note that a criterion standard test cannot be completely accurate if it is not reproducible among multiple observers.

The recently revised international classification of ROP introduced an intermediate condition of pre-plus disease.10 The clinical significance of pre-plus disease is not completely clear given that it was not incorporated into clinical trials such as CRYO-ROP or ETROP. Results from our current study suggest that even experts do not have complete agreement about whether a given image represents pre-plus as opposed to plus or neither (Table). If the clinical usefulness of pre-plus disease can be demonstrated,17 future development of a more precise definition of this entity may help guide physicians in diagnosis.

The design of a study involving interexpert agreement requires an explicit definition of expertise, and the method used for this project warrants some explanation. Participants were invited for this study based on academic criteria, as evidenced by leadership roles in major multicenter clinical trials or by authorship of peer-reviewed ROP literature. This may not necessarily reflect clinical expertise in a real-world setting. However, there are numerous factors comprising medical expertise, some of which may be difficult to quantify for the purpose of study design.18 Furthermore, it could be argued that academic ROP experts may have greater familiarity with the published photographic standard for plus disease than the overall population of ophthalmologists who perform ROP examinations.
Therefore, we hypothesize that disagreement in plus disease diagnosis within the overall population of practicing clinicians may be higher than that among the academic experts in this study. This issue may warrant further study to determine the extent to which these findings are generalizable.

[Figure. A, 3-level categorization. B, 2-level categorization. Horizontal axis: absolute expert agreement on diagnosis, %. Vertical axis: No. of images.]

From a clinical perspective, it would be most useful to know the agreement of plus disease diagnosis among multiple experts performing serial indirect ophthalmoscopy on the same infant. However, that type of study would be impractical because of infant safety concerns.19 To simulate a real-world situation for this study, images presented to participants were captured using a commercially available RetCam device. This is a contact camera with a 130° field of view and is the best-known instrument for pediatric retinal imaging.20-25 In contrast, standard binocular indirect ophthalmoscopy provides a 40° to 50° field of view. It is conceivable that this difference in perspective may have caused difficulty for participants, depending on their previous experience interpreting wide-angle ROP photographs. Although this study did not detect any correlation between mean κ statistics and self-reported level of RetCam experience, this question may deserve additional study with a broader spectrum of image graders. On one hand, limited experience in correlating wide-angle images with indirect ophthalmoscopy might result in systematic overdiagnosis or underdiagnosis of plus disease by some participants, thereby increasing variability.
On the other hand, the fact that all participants were asked to review the exact same images in this study might produce decreased variability compared with serial ophthalmoscopy because examination quality may vary based on infant stability or cooperation.

Figure 3. Agreement in plus disease diagnosis, based on box plots of the mean κ statistic for each of 22 experts compared with all others. A, The mean weighted κ statistic in the 3-level categorization (plus, pre-plus, or neither). B, The mean κ statistic in the 2-level categorization (plus or not plus). Boxes represent the 25th, 50th, and 75th percentile κ values. Whiskers represent the 10th and 90th percentile values. Vertical axes: mean weighted κ vs other experts (A) and mean κ vs other experts (B). Group annotations in the plots: n = 15 and n = 7 (A); n = 6, n = 12, n = 3, and n = 1 (B).

Telemedicine strategies have been proposed as an alternative to standard ROP care involving dilated examination at the neonatal intensive care unit bedside. Findings from several limited studies20-25 suggest that remote interpretation of retinal images may have adequate sensitivity and specificity to identify clinically significant ROP. To our knowledge, these published studies have compared accuracy of remote interpretation of RetCam images captured by ophthalmic personnel with a criterion standard of dilated ophthalmoscopy by a single examiner. The present study reveals several clinically significant disagreements among acknowledged ROP experts in plus disease diagnosis from wide-angle retinal photographs. To prevent diagnostic errors, this issue should be examined and resolved before the routine deployment of ROP telemedicine systems. However, if implemented properly, remote image interpretation at certified centers may offer advantages over dilated examination by a single ophthalmologist with regard to standardization.
This is analogous to the national Fundus Photograph Reading Center on the basis of the 7-field photographic reference established by the Early Treatment for Diabetic Retinopathy Study.26

Four additional study limitations should be noted:
