False-Positive–Constrained Evaluation of Artificial Intelligence Lung Nodule Detection at Clinically Relevant Operating Points
Key clinical takeaway
The AI system maintained 91.7% lesion-level sensitivity for nodules ≥5 mm at approximately 1 false positive per scan.
Purpose
False-positive findings remain a primary barrier to clinical adoption of artificial intelligence (AI) lung nodule detection, contributing to workflow inefficiency and reduced radiologist trust. While prior studies often emphasize peak sensitivity or aggregate accuracy metrics, fewer assess performance at operating points aligned with real-world clinical use. This study evaluated whether an AI lung nodule detection system maintains high sensitivity while constraining false-positive burden at clinically relevant operating points.
Methods
An AI lung nodule detection system was trained using low-dose chest CT examinations from a multi-center lung cancer screening cohort. Performance was evaluated on an independent multi-reader dataset with heterogeneous annotations to reflect interpretive variability. Lesion-level sensitivity was assessed across false-positive rates using free-response receiver operating characteristic (FROC) analysis. The primary operating point corresponded to approximately one false positive per scan (FPPS). Secondary analyses examined alternative operating points and nodule size thresholds. Ninety-five percent confidence intervals were estimated using bootstrap resampling.
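The FROC operating-point and bootstrap steps described above can be illustrated with a minimal sketch. It is not the study's implementation: the data layout (`n_nodules`, `detections`), the function names, the scan-level resampling unit, and the percentile interval are all assumptions made for illustration, since the abstract does not specify them.

```python
import random
import statistics

def sensitivity_at_fpps(scans, target_fpps):
    """Lesion-level sensitivity at the loosest score threshold whose
    false-positive rate does not exceed target_fpps.

    Hypothetical layout: each scan is a dict with "n_nodules" (int) and
    "detections", a list of (score, nodule_index or None); None marks a
    false positive. Assumes at least one annotated nodule overall.
    """
    n_scans = len(scans)
    n_nodules = sum(s["n_nodules"] for s in scans)
    # Flatten to (score, scan_position, nodule_index_or_None).
    dets = [(score, i, nod)
            for i, s in enumerate(scans)
            for score, nod in s["detections"]]
    best = 0.0
    # Lower the threshold step by step; false positives can only increase.
    for t in sorted({d[0] for d in dets}, reverse=True):
        kept = [d for d in dets if d[0] >= t]
        n_fp = sum(1 for d in kept if d[2] is None)
        if n_fp / n_scans > target_fpps:
            break
        # Count each nodule once, even if several detections match it.
        hits = {(d[1], d[2]) for d in kept if d[2] is not None}
        best = len(hits) / n_nodules
    return best

def bootstrap_ci(scans, target_fpps, n_boot=1000, seed=0):
    """Percentile 95% CI, resampling whole scans with replacement
    (assumed resampling unit)."""
    rng = random.Random(seed)
    stats = [sensitivity_at_fpps(rng.choices(scans, k=len(scans)), target_fpps)
             for _ in range(n_boot)]
    cuts = statistics.quantiles(stats, n=40, method="inclusive")
    return cuts[0], cuts[-1]  # 2.5th and 97.5th percentiles

# Toy data for illustration only; these are not study data.
toy_scans = [
    {"n_nodules": 2, "detections": [(0.9, 0), (0.8, None), (0.4, 1)]},
    {"n_nodules": 1, "detections": [(0.7, 0), (0.3, None)]},
]
print(sensitivity_at_fpps(toy_scans, target_fpps=0.5))
print(bootstrap_ci(toy_scans, target_fpps=0.5, n_boot=200))
```

Resampling whole scans rather than individual nodules keeps nodules from the same examination together, which is one common way to respect within-scan correlation; other resampling schemes are equally plausible here.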
Figure: Lesion-level sensitivity vs. false positives per scan. Free-response receiver operating characteristic (FROC) curve demonstrating lesion-level sensitivity of the artificial intelligence lung nodule detection system for nodules ≥5 mm across increasing false-positive rates per scan. High sensitivity is maintained at clinically relevant operating points, including approximately one false positive per scan, highlighting favorable performance under realistic lung cancer screening workflow constraints.
Results
Among 1,009 CT examinations containing 1,303 annotated nodules ≥5 mm, maximum lesion-level sensitivity was 98.4% (95% CI: 97.6–99.1) at permissive thresholds associated with approximately 9.6 FPPS. At the primary operating point of approximately 1 FPPS, sensitivity was 91.7% (95% CI: 89.6–93.6). Sensitivity increased to 94.9% (95% CI: 93.1–96.3) at 2 FPPS and remained above 85% at 0.5 FPPS. The overall FROC score was 0.889 (95% CI: 0.873–0.904). Comparable trends were observed for nodules ≥3 mm, with sensitivity of 90.8% at approximately 1 FPPS.
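To give a sense of absolute scale, the reported percentages can be converted into approximate counts. The sketch below is back-of-envelope arithmetic on the figures stated above, not data reported by the study; the helper name `approx_counts` is hypothetical, and because the FPPS values are themselves approximate, the false-positive totals are rough estimates.

```python
def approx_counts(n_scans, n_nodules, sensitivity_pct, fpps):
    """Approximate detected nodules, missed nodules, and total false
    positives implied by a reported operating point (rounded)."""
    detected = round(n_nodules * sensitivity_pct / 100)
    missed = n_nodules - detected
    false_positives = round(n_scans * fpps)
    return detected, missed, false_positives

# Reported cohort: 1,009 examinations, 1,303 nodules >= 5 mm.
primary = approx_counts(1009, 1303, 91.7, 1.0)      # ~1 FPPS operating point
permissive = approx_counts(1009, 1303, 98.4, 9.6)   # ~9.6 FPPS operating point
print(primary, permissive)
```

Under these approximations, the primary operating point corresponds to roughly 1,195 detected and 108 missed nodules with about 1,009 false positives across the cohort, whereas the permissive threshold corresponds to roughly 1,282 detected and 21 missed nodules at the cost of about 9,686 false positives.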
Discussion
- The overall FROC score was 0.889 (95% CI: 0.873–0.904), reflecting robust performance across evaluated operating points.
- Sensitivity remained high across stricter false-positive constraints, supporting workflow-conscious evaluation rather than reliance on permissive peak sensitivity metrics.
- Diminishing returns beyond approximately 2 FPPS suggest limited incremental sensitivity benefit relative to added false-positive burden.
- Comparable trends were observed for nodules ≥3 mm, with 90.8% sensitivity at approximately 1 FPPS.
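The diminishing-returns point can be made concrete by computing the sensitivity gained per additional false positive between the reported operating points for nodules ≥5 mm. This is a minimal sketch using only the point estimates stated in the Results; the FPPS values are approximate as reported, the function name `marginal_gains` is hypothetical, and confidence intervals are ignored.

```python
# (FPPS, sensitivity in %) for nodules >= 5 mm, as reported in the Results.
reported_points = [(1.0, 91.7), (2.0, 94.9), (9.6, 98.4)]

def marginal_gains(points):
    """Percentage points of sensitivity gained per additional false
    positive per scan, between consecutive operating points."""
    gains = []
    for (f0, s0), (f1, s1) in zip(points, points[1:]):
        gains.append(((f0, f1), (s1 - s0) / (f1 - f0)))
    return gains

for (f0, f1), gain in marginal_gains(reported_points):
    print(f"{f0} -> {f1} FPPS: {gain:.2f} percentage points per added FP")
```

On these point estimates, moving from about 1 to 2 FPPS yields roughly 3.2 percentage points of sensitivity per added false positive, while moving from 2 to about 9.6 FPPS yields roughly 0.46 percentage points per added false positive.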
Conclusion
AI-based lung nodule detection demonstrated sustained high sensitivity at clinically relevant operating points while maintaining a constrained false-positive burden. Evaluating AI performance under realistic false-positive constraints provides practice-relevant insight beyond peak accuracy metrics and supports workflow-conscious integration of AI into lung cancer screening.