We are pleased to present to you a collection of notable papers in the field of computer-aided diagnosis and detection published over the years in the journal Medical Physics. The field started at the University of Chicago with many faculty members, notably the medical physics team of Kunio Doi, Heung-Ping Chan, and Maryellen Giger, along with insightful contributions from radiologists Heber MacMahon, Robert Schmidt, and Carl Vyborny. Thus many of the early papers are from the University of Chicago, and with the move of Heung-Ping Chan to the University of Michigan, subsequent publications also arose from Michigan. The field continues to grow, especially with the existence of large databases and the rapid development of deep learning methods. I am grateful to the National Institutes of Health National Cancer Institute for support, as well as to the institutions with which I am affiliated, Wheaton College and the University of Chicago. Here I show my conflict of interest statements. I am very grateful for the support I have received over the years from the NIH, the University of Chicago, and many other grant agencies.
First, we highlight two major reviews on computer-aided detection and computer-aided diagnosis. Each has been influential in describing the state of the field and informing those who have entered it during the past decade. The first paper, History and Status of CAD and Quantitative Image Analysis: The Role of Medical Physics and AAPM, was published by Maryellen Giger, Heung-Ping Chan, and John Boone in 2008 to reflect on the then approximately 20-year history of computer-aided detection and diagnosis. It provides thorough coverage of the history of medical imaging through the role of medical physics, reaching back to the first x-ray image taken by Wilhelm Röntgen. It continues through a review of the role of the computer evolution in medical imaging in the middle of the 20th century and detailed coverage of seminal papers in the field. Also included were descriptions of CAD systems such as the one shown here. Clearly noted was that, quote, as imaging systems become more complex and the need for better quantitative information from images grows, the future includes the combined research efforts of physicists working in CAD with those working on quantitative imaging systems to readily yield information on morphology, function, and molecular structure.
In 2019, Berkman Sahiner and co-authors published a review of deep learning in both medical imaging and radiation therapy. They described artificial intelligence as the fourth industrial revolution and reported on deep learning, which is a subset of AI, and its applications to medical physics. The paper serves as a very thorough review of machine learning itself, going back to the 1960s. The authors describe earlier uses of neural networks in medical imaging, including artificial neural networks in the 1980s. The paper also describes the revolution circa 2006 that contributed to the explosion of deep learning in medical physics through the availability of larger datasets, graphics processing units, and innovations in network architecture and training. As shown here, very thorough tables are available describing the scientific literature on applying deep learning to organ and substructure segmentation, lesion segmentation, image processing and reconstruction, and many other applications.
Because this paper provides such a thorough review of the history of the field and highlights the importance of data quality, statistical significance, and the clinical significance and relevance of research in medical physics using deep learning methods, it is essential reading. Clearly noted, however, was that most medical imaging tasks are far from being solved, and the optimal deep learning method and architecture for each individual task and application area have not yet been established. Moreover, the integration of medical image analysis methods with other patient data, such as patient history, age, and demographics, also remains an area of active research that could further improve the performance of clinical decision-making aids.
Computer-aided detection is a localization task, often developed for screening programs in which patients present without symptoms of the disease under surveillance, such as breast cancer screening through screening mammography. The journal Medical Physics has published several seminal papers on this topic. In the late 1980s, Heung-Ping Chan and Maryellen Giger published papers utilizing a difference image technique to provide computer-aided detection in x-ray images. The difference image is created by subtracting a signal-suppressed version of an image from a signal-enhanced version of the same image; a simple illustrative sketch of this idea appears at the end of this passage. Chan's work used filters matched to the size and contrast variations of typical breast microcalcifications to create the signal-enhanced image, while Giger's work applied filters to chest x-rays to maximize the signal-to-noise ratio to enhance lung nodules. Giger's paper additionally used feature extraction techniques and investigated their performance as a function of variations in the techniques. The method of combining signal enhancement and signal suppression was one of the first demonstrations of computer-aided detection.
In 1994, the first use of deep learning applied to medical image analysis was published. Shift-invariant artificial neural networks were used to further investigate the use of computerized detection for microcalcifications. As the paper states, 55 percent of false positive regions were eliminated without any loss of true positive regions. The paper's second paragraph, on the question of generalization in the application of AI to medical imaging, that is, whether a trained network will respond reasonably well to inputs not present in the training database, demonstrates the important didactic nature of these early articles. The question of generalizability is still an important matter to consider today, with deep learning methods being easier to implement.
In this work, Heung-Ping Chan and co-authors trained a convolutional neural network to detect microcalcifications. Notably, they used the whole region of interest and separated out results on what were termed obvious, average, or subtle microcalcifications. This paper gives a very high level of detail on methods, making it both instructional and informational. The results are shown both without the use of a CNN and with variations of the decision threshold of the CNN. As they noted, our present computer method can achieve a true positive cluster detection rate of approximately 80 percent at a false positive detection rate of one cluster per image. The potential application of such a computer-aided system to mammographic interpretation is demonstrated by its ability to detect microcalcifications in clinical mammograms.
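To make the difference-image idea concrete, here is a minimal sketch in Python. It is purely illustrative: the synthetic image, the use of Gaussian smoothing as a stand-in for the matched signal-enhancement and signal-suppression filters of the original papers, and the thresholding rule are all assumptions for demonstration, not the published methods or parameters.
```python
import numpy as np
from scipy.ndimage import gaussian_filter

def difference_image(img, enhance_sigma=1.0, suppress_sigma=4.0):
    """Illustrative difference-image technique: subtract a signal-suppressed
    version of the image from a signal-enhanced version, so that small,
    high-contrast structures (e.g., microcalcification-sized objects) are
    emphasized while the slowly varying background largely cancels out."""
    # Small-kernel smoothing stands in for a filter matched to the expected
    # signal size (the "signal-enhanced" image).
    enhanced = gaussian_filter(img, sigma=enhance_sigma)
    # Large-kernel smoothing removes small structures and keeps only the
    # background (the "signal-suppressed" image).
    suppressed = gaussian_filter(img, sigma=suppress_sigma)
    return enhanced - suppressed

def candidate_detections(diff, threshold=3.0):
    """Flag pixels whose difference-image value exceeds a global threshold
    expressed in standard deviations above the mean (an illustrative rule)."""
    score = (diff - diff.mean()) / (diff.std() + 1e-8)
    return np.argwhere(score > threshold)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.normal(100.0, 5.0, size=(256, 256))   # noisy synthetic background
    img[120:123, 80:83] += 40.0                      # small bright "signal"
    det = candidate_detections(difference_image(img))
    print(f"{len(det)} candidate pixels flagged")
```
The key design point is that structures near the expected signal size survive the subtraction while the slowly varying anatomic background is largely removed, which is what makes the subsequent thresholding tractable.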
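In the same spirit, the sketch below shows how a small convolutional network might be trained on candidate regions of interest and then swept over a decision threshold to trade off true positive and false positive detections, echoing the kind of operating-point analysis described above. PyTorch is used here for convenience; the architecture, the 32 by 32 ROI size, the synthetic data, and the threshold values are illustrative assumptions and not those of the original paper.
```python
import torch
import torch.nn as nn

class SmallROIClassifier(nn.Module):
    """Tiny CNN mapping a 32x32 ROI to a probability of containing a true
    microcalcification cluster (illustrative architecture only)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(16 * 8 * 8, 1))

    def forward(self, x):
        return torch.sigmoid(self.head(self.features(x)))

def sweep_threshold(probs, labels, thresholds):
    """Report true positive and false positive fractions at each decision
    threshold, mimicking an operating-point analysis."""
    out = []
    for t in thresholds:
        pred = probs >= t
        tp = (pred & (labels == 1)).sum().item() / max((labels == 1).sum().item(), 1)
        fp = (pred & (labels == 0)).sum().item() / max((labels == 0).sum().item(), 1)
        out.append((t, tp, fp))
    return out

if __name__ == "__main__":
    torch.manual_seed(0)
    x = torch.randn(64, 1, 32, 32)            # synthetic candidate ROIs
    y = torch.randint(0, 2, (64,)).float()    # synthetic truth labels
    model = SmallROIClassifier()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.BCELoss()
    for _ in range(5):                        # a few illustrative epochs
        opt.zero_grad()
        loss = loss_fn(model(x).squeeze(1), y)
        loss.backward()
        opt.step()
    with torch.no_grad():
        probs = model(x).squeeze(1)
    for t, tp, fp in sweep_threshold(probs, y, [0.3, 0.5, 0.7]):
        print(f"threshold={t:.1f}  TP fraction={tp:.2f}  FP fraction={fp:.2f}")
```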
This 2001 paper by Samuel Armato and colleagues demonstrates the close association of papers published in Medical Physics with clinically relevant questions. Helical computed tomography provides a great deal of information to clinicians, such as over 50 slices in the region of interest for chest CT at the time of publication. At the same time, as the authors describe, the large amount of data could lead to information overload for radiologists, especially as they seek to address the confounding presence of blood vessels when trying to detect lung nodules. This work sought to provide additional support for detecting lung nodules by first segmenting the lungs with gray-level thresholding and then creating multiple segmented lung volumes by applying multiple thresholds. Subsequently, features were extracted and merged using linear discriminant analysis. The results demonstrated high ROC performance in the task of distinguishing between lung nodules and false positive candidates.
Furthering the work by Samuel Armato, in 2003 Suzuki and colleagues implemented a massive training artificial neural network (MTANN) to reduce false positives in low-dose CT for lung nodule detection, an early application of deep learning to computer-aided detection using CT. The study was designed with the specific goal of reducing false positives. The MTANN uses image subregions along with teacher images representing the distribution of the likelihood of being a nodule. In this case, a multiple MTANN consisting of nine MTANNs was used. The false positive rate was improved from 0.98 to 0.18 false positives per section with an overall sensitivity of 80.3%.
CT-based virtual colonoscopy has been shown to have similarly high accuracy in detecting polyps compared with fiber-optic colonoscopy, but radiologists' use of the technology has been time consuming. Long and co-authors investigated the use of CAD to aid in the identification of polyps to reduce interpretation time and reduce false positives and false negatives. The group used a smoothed principal curvature method to provide a shape description of a wide area and to identify initial polyp candidates. Then internal features describing the geometry, morphology, and texture of the candidates were considered. The authors demonstrated that it was these features that led to a significant reduction of false positive polyps across ranges of polyp sizes. As they noted, for a 100% detection sensitivity on polyps, the presented CAD method had an average of 2.68 false positives per patient dataset and eliminated 93.1% of false positives generated by the surface-based measures.
This study published in 2007 by Tourassi and colleagues presented an evaluation of information theoretic similarity measures for content-based retrieval and detection of masses in mammograms. The study showed that the image similarity measures fall into one of two categories, with one category better suited to the retrieval of semantically similar cases, while the second category is more effective for knowledge-based decisions regarding the presence of a true mass in the query location. In addition, the information theoretic CAD scheme yielded a substantial reduction in false positive detections while maintaining a high detection rate for malignant masses.
This work by van Engeland and Karssemeijer, published in 2007, developed methods to combine the two projection views of a mammographic exam to aid in mass detection.
The authors built a cascaded multiple-classifier system in which the last stage computes the suspiciousness of an initially detected region conditional on the existence and similarity of a linked candidate region in the other view. As the authors noted, a statistically significant improvement was found in the lesion-based detection performance. At a false positive rate of 0.1 false positives per image, the lesion sensitivity improved from 56 to 61%.
Ikedo and colleagues reported in 2007 on the development of a fully automatic algorithm for the detection of masses in whole breast ultrasound images. Note that computer-aided detection may be necessary in whole breast ultrasound, as the radiologist needs to scan through a three-dimensional image. The paper reported results for both resubstitution performance, that is, self-consistency to assess whether the system could learn the data on which it was trained, and performance using a leave-one-out technique as a way to show more generalizable performance levels. The authors noted that such a CAD method may be useful in improving both cancer screening performance and efficiency.
This study by Pallavi Tiwari and colleagues presents an interesting application of CAD concepts and techniques to magnetic resonance spectroscopy data for the purpose of detecting prostate cancer. A two-step process is used. First, regions of interest are identified using hierarchical spectral clustering. Then the ROIs are classified into three categories, normal, suspicious, and indeterminate, using nonlinear dimensionality reduction methods. The example figure here shows spectra from each metavoxel of an image, where red indicates abnormal-appearing spectra, blue normal spectra, and green indeterminate spectra. The authors said, quote, when evaluated on expert annotated datasets, the CAD system yielded an average sensitivity and specificity of 81.39% and 64.71%, respectively.
Tan et al., in 2011, reported on a computer-aided detection system for the detection of lung nodules in computed tomography images using a combined feature selection and classification methodology. The CAD system was trained and tested on images from the publicly available Lung Image Database Consortium (LIDC) from the National Cancer Institute website. In their method, they first segment nodule candidates and then extract features to differentiate between real nodules and the blood vessel structures that result in false positive detections. The authors noted that their fixed-topology ANN classifier with 11 hidden nodes reached a detection sensitivity of 87.5% with an average of four false positives per scan for nodules with diameters greater than or equal to three millimeters. Analysis of the false positive items revealed that a considerable proportion of them were actually smaller nodules, less than three millimeters in diameter.
This study by Karen Drukker et al. developed a computer-aided detection method for the specific issue of dense breasts imaged with automated breast ultrasound, a study designed to address the large volume of information generated by the modality. A three-step process was used: detection of lesion candidates, characterization of the candidates, and reduction of false positives; only malignant lesions were used in the study. A k-means clustering technique was used to identify regions of interest that were candidates for further evaluation, along with their seed point locations.
These areas were segmented and 18 features were extracted. Then a Bayesian neural network classifier was used to distinguish between true positive and false positive lesions. Performance was characterized in terms of the number of marks, that is, detections, per view. As the authors noted, the computerized detection method yielded, at a single mark per view, a median sensitivity of 50%, which was similar to the location-specific sensitivity obtained on average by the 17 radiologists in the reader study for this dataset.
This 2016 paper by Ghafoorian and colleagues in the Netherlands applies computer-aided detection methods to the detection of white matter hyperintensities. The clinical relevance is substantial because cerebral small vessel disease usually presents with white matter hyperintensities on the FLAIR MRI pulse sequence. These hyperintensities are also common in Parkinson's disease and multiple sclerosis. The task is designed not only to detect the lesions but also to quantify them in terms of volume, location, and distribution, since white matter hyperintensities smaller than 3 millimeters are very different from larger ones. To accomplish this, the authors used a two-stage learning approach, considering lesions smaller than 3 millimeters and those larger than 3 millimeters separately, with different feature sets for each, and then combining the results. The results demonstrated statistically significant improvement in FROC curves for the second-stage classifier.
And now we discuss six important papers in computer-aided diagnosis. In this 1994 study by Kano and colleagues, a computerized method to assess change over time is reported. In the technique, digital image subtraction is used to detect interval change within temporally sequential chest radiographs. The subtraction images were able to enhance various important interval changes, such as differences in the size of tumor masses, changes in heart size, and changes in pulmonary infiltrates or pleural effusions.
This paper published in 2002 by Karla Horsch et al. was one of the first to apply CAD methods to the diagnosis of breast lesions on ultrasound. The goal was to improve the ability to distinguish between benign and malignant lesions in breast ultrasound. Lesions were automatically segmented and four features were extracted. The database was investigated in its entire form, but also for the ability to distinguish between carcinomas and benign solid lesions and between carcinomas and complex cysts. The paper is notable for its thorough independent validation, yielding an average classification performance of 0.87 for the entire dataset.
In this study from 2006, Chen et al. reported on a fuzzy c-means clustering approach developed for the automatic identification and classification of characteristic kinetic curves of breast lesions on dynamic contrast-enhanced MRI. Such computer assessment of kinetics is important in predicting the likelihood of malignancy of breast lesions. They showed that computer-derived lesion features obtained from the most enhancing kinetic curves yield improved diagnostic performance compared with those from curves obtained by simply averaging over the entire lesion, and perform similarly to kinetic curves generated from regions drawn within the lesion by a radiologist experienced in breast MRI.
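To illustrate the kind of clustering involved, here is a minimal fuzzy c-means sketch applied to synthetic voxel kinetic curves, with the most enhancing cluster prototype taken as the characteristic curve. The synthetic data, the number of clusters, and the simple from-scratch implementation are assumptions for demonstration only and do not reproduce the paper's actual method or parameters.
```python
import numpy as np

def fuzzy_c_means(X, n_clusters=3, m=2.0, n_iter=100, seed=0):
    """Minimal fuzzy c-means: returns cluster centers and the membership
    matrix U (rows sum to one).  X has shape (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Distances of every sample to every center (small constant avoids
        # division by zero when a sample coincides with a center).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-10
        U = 1.0 / (d ** (2.0 / (m - 1.0)))
        U /= U.sum(axis=1, keepdims=True)
    return centers, U

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    t = np.linspace(0, 1, 6)                  # six post-contrast time points
    # Synthetic voxel kinetic curves: a strongly enhancing group, a modestly
    # enhancing group, and a flat group (purely illustrative).
    strong = 1.0 - np.exp(-6 * t) + 0.05 * rng.standard_normal((40, t.size))
    modest = 0.5 * t + 0.05 * rng.standard_normal((40, t.size))
    flat = 0.05 * rng.standard_normal((40, t.size))
    curves = np.vstack([strong, modest, flat])
    centers, U = fuzzy_c_means(curves, n_clusters=3)
    # Take the cluster whose prototype shows the greatest final enhancement as
    # the "most enhancing" kinetic curve for the lesion.
    most_enhancing = centers[np.argmax(centers[:, -1])]
    print("most enhancing prototype curve:", np.round(most_enhancing, 2))
```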
This paper from 1998 was one of the first to demonstrate the use of spatial features extracted from lesions imaged with dynamic contrast-enhanced MRI. Interestingly, the analysis was conducted in both 3D and 2D formats, with the latter using a single representative slice, a concept that is used in many deep learning studies today. Classification performance in the task of distinguishing between malignant and benign lesions was evaluated both for individual features and for combinations of features, selected using stepwise multiple regression, as inputs to linear discriminant analysis. This paper gives careful attention to the concept that human-engineered features should be descriptive in ways that are both meaningful and objective. As noted by the authors, they found that automated feature extraction and classification has the potential to complement the interpretation of radiologists in an objective, consistent, and accurate way.
In this paper from 2001, Zhou et al. reported on a computer vision algorithm that uses thresholding of gray-level histograms to categorize breast density on mammograms. Shown here are various mammograms and the associated thresholds yielding different breast density assessments. The technique has the potential to improve the accuracy and reproducibility of breast density estimation in comparison with subjective visual assessment by a radiologist.
This paper by Antropova et al. demonstrated the benefits of fusing deep learning and human-engineered radiomics methods in computer-aided diagnosis. Features were extracted from images of breast lesions in three modalities using both a pre-trained convolutional neural network and human-engineered radiomics methods. For each case, the probability of malignancy as determined from each type of feature was shown, as demonstrated here for the full-field digital mammography images. Then classification performance in the task of identifying lesions as malignant or benign was evaluated using each type of feature and the fusion of them. As the authors noted from ROC analysis, our fusion-based method demonstrates, on all three imaging modalities, statistically significant improvements in terms of AUC as compared to previous breast CADx methods in the task of distinguishing between malignant and benign lesions.
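The fusion idea can be sketched schematically as follows: train one classifier on deep features and another on hand-engineered features, then average their output probabilities. Everything in the sketch is synthetic and illustrative, logistic regression merely stands in for the classifiers actually used, and the random features are only weakly informative by construction; this is a sketch of the concept, not the authors' implementation.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 300
y = rng.integers(0, 2, n)

# Stand-ins for the two feature streams: "deep" features such as those pooled
# from a pre-trained CNN, and hand-engineered radiomic features; here both are
# synthetic and only weakly informative by construction.
deep = rng.standard_normal((n, 64)) + 0.4 * y[:, None]
radiomic = rng.standard_normal((n, 12)) + 0.4 * y[:, None]

idx_tr, idx_te = train_test_split(np.arange(n), test_size=0.3,
                                  random_state=0, stratify=y)

clf_deep = LogisticRegression(max_iter=1000).fit(deep[idx_tr], y[idx_tr])
clf_rad = LogisticRegression(max_iter=1000).fit(radiomic[idx_tr], y[idx_tr])

p_deep = clf_deep.predict_proba(deep[idx_te])[:, 1]
p_rad = clf_rad.predict_proba(radiomic[idx_te])[:, 1]
p_fused = 0.5 * (p_deep + p_rad)   # simple soft fusion of the two outputs

print("AUC deep only    :", round(roc_auc_score(y[idx_te], p_deep), 3))
print("AUC radiomic only:", round(roc_auc_score(y[idx_te], p_rad), 3))
print("AUC fused        :", round(roc_auc_score(y[idx_te], p_fused), 3))
```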
Next, we highlight nine methods papers that illustrate various topics important in the conduct of CAD research, development, evaluation, and translation. The methods with which we select features and design and train classifiers in computer-aided detection and diagnosis are crucial to appropriate system design and ultimate use. Such best practices have been the subject of study since the inception of the field. Seven of the papers described here provide insight into both feature selection and classifier design on a number of different fronts. The eighth paper illustrates the importance of databases in AI research, and the ninth paper summarizes the need for robust evaluation methods in order to enable translation of developed AI methods.
Nishikawa and co-authors reported in 1994 their work on the effect of case selection on computer-aided detection. The database consisted of mammograms from 45 cases, that is, 90 pairs of mammograms, all used for training a classifier. They then used three variations of cases for testing: a so-called easy database, a so-called difficult database, and one they called an altered easy database in which the 10 easiest pairs of mammograms were replaced by the 10 most difficult. The results indicated differences in the free-response receiver operating characteristic curves by testing database, demonstrating the effect that case selection can have on computer-aided detection. The authors notably gave insight into the difficulties of comparing CAD schemes using databases from different institutions, a matter that is still a great concern today. They said, since neither a common database nor a quantitative measure of image subtlety exists at present, we recommend that all investigators, one, report the method used for selecting cases, that is, for example, consecutive cases, typical cases, difficult cases, etc., and, two, characterize their databases by reporting physical measurements of the relevant features in the images.
How many features should be used for machine learning classification is an ongoing topic of discussion in CAD, as there are hundreds, even thousands, of features that can be extracted from images. Just over 20 years ago, in 1999, Matt Kupinski and Maryellen Giger published Feature Selection with Limited Datasets, a work that provides crucial insight into the selection of features. They found that the probability that the optimal subset of features has been selected for classification purposes depends upon the dataset sample size, the number of features that may potentially be used, and the number of features actually selected, among other factors. Their conclusion was that if one is using a relatively small dataset and a large number of features, bias is introduced, and the probability that the optimal features have been chosen is low. This paper is essential reading for anyone working with radiomic features in CAD, as it provides guidance on feature selection.
The goal of this paper by Heung-Ping Chan and co-authors, published in 1999, was to investigate the dependence of performance, as measured by the area under the receiver operating characteristic curve, on design sample size for both classical classifiers, such as linear discriminant analysis, and neural network classifiers. They looked into factors of sample size and dimensionality and found that as the design sample size decreases and as the dimensionality of the feature space increases, the effect of finite sample size increases, reducing classification performance. This figure shows the simulation for a set of data in which the two classes have equal covariance and the linear discriminant analysis classifier is used. The authors concluded that the magnitude of the biases increases as the design sample size decreases and as the dimensionality of the feature space increases.
This paper by Berkman Sahiner and co-authors, published in 2008, identified that error rate is inadequate as a performance measure. They were interested in understanding the impact of resampling on the root mean square error of the AUC, the area under the ROC curve. They used five resampling methods, including the 0.632+ bootstrap method, the Fukunaga-Hayes method, designated FH, and leave-one-out methods. Linear discriminant analysis was used as the classifier on simulated data. Their results indicated that the choice of sampling technique is important and that the 0.632+ bootstrap method appeared superior to the others.
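As a rough illustration of the resampling idea, here is a simplified 0.632+ bootstrap estimate of AUC for a linear discriminant classifier on synthetic data. The adaptation of the 0.632+ weighting to AUC (taking 0.5 as the no-information value), the number of bootstrap replicates, and the data are all assumptions for demonstration; this is not the estimator variant or simulation design used in the paper.
```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_auc_score

def auc_632_plus(X, y, n_boot=200, seed=0):
    """Simplified 0.632+ bootstrap estimate of AUC with a linear discriminant
    classifier.  The no-information AUC is taken to be 0.5 (chance level)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    clf = LinearDiscriminantAnalysis().fit(X, y)
    auc_app = roc_auc_score(y, clf.decision_function(X))  # apparent (resubstitution) AUC
    oob_aucs = []
    for _ in range(n_boot):
        boot = rng.integers(0, n, n)                       # bootstrap sample with replacement
        oob = np.setdiff1d(np.arange(n), boot)             # cases left out of the bootstrap
        if len(np.unique(y[boot])) < 2 or len(np.unique(y[oob])) < 2:
            continue
        m = LinearDiscriminantAnalysis().fit(X[boot], y[boot])
        oob_aucs.append(roc_auc_score(y[oob], m.decision_function(X[oob])))
    auc_oob = float(np.mean(oob_aucs))
    # Relative overfitting rate and the 0.632+ weight (Efron-Tibshirani form,
    # adapted here to AUC rather than error rate).
    r = (auc_app - auc_oob) / max(auc_app - 0.5, 1e-8)
    r = min(max(r, 0.0), 1.0)
    w = 0.632 / (1.0 - 0.368 * r)
    return (1.0 - w) * auc_app + w * auc_oob

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n_per_class, n_features = 40, 10                       # deliberately small sample
    X = np.vstack([rng.standard_normal((n_per_class, n_features)),
                   rng.standard_normal((n_per_class, n_features)) + 0.5])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    print("0.632+ bootstrap AUC estimate:", round(auc_632_plus(X, y), 3))
```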
In 2010, Jamieson et al. reported on the use of unsupervised nonlinear dimension reduction and data representation techniques applied to computer-extracted breast lesion feature spaces across three separate imaging modalities: breast ultrasound, MRI, and mammography. The dimension reduction methods attempt to map originally high-dimensional feature spaces to more human-interpretable, lower-dimensional spaces while preserving both local and global information. The authors explored Laplacian eigenmaps and t-distributed stochastic neighbor embedding, that is, t-SNE, for data representation and evaluated diagnostic performance when the reduced features were merged using various classifiers. While not appropriate as a complete replacement for feature selection in computer-aided diagnosis problems, dimension reduction techniques offer a complementary approach that can aid the elucidation of additional properties associated with the data. Specifically, the new techniques were shown to possess the added benefit of delivering a sparse, lower-dimensional representation for visual interpretation, revealing the intricate data structure of the feature space.
Addressing overtraining in computer-aided diagnosis and detection is a significant concern. One method, called noise injection, intentionally penalizes models by adding noise to the training dataset. This paper by Richard Zur and co-authors in 2009 studied the impact of adding noise when training artificial neural networks and compared the effects with two other methods of reducing overtraining: weight decay and early stopping. The study investigated ANNs of both low and high complexity and examined the standard deviation of the noise kernel as a critical noise parameter. They optimized the amount of injected noise to find the maximum AUC that minimized overtraining; an illustrative sketch of noise injection appears at the end of this passage. As they noted, these results indicate that training ANNs with noise injection can reduce overfitting to a greater degree than early stopping and to a similar degree as weight decay.
In this paper published in 2010, Salas-Gonzalez and co-authors recognized that the medical question of how to use fluorine-18 FDG PET for questions related to Alzheimer's disease is intrinsically difficult because of the number of features relative to the number of cases. This is because an FDG PET image contains tens of thousands of voxels, while a training set might contain only around a hundred cases. The authors performed a preliminary automatic selection of voxels of interest using a t-statistic greater than a given threshold. Next, they used factor analysis to reduce the feature dimension. Subsequently, classification was conducted on three tasks: normal versus mildly cognitively impaired subjects, normal versus those with confirmed Alzheimer's disease, and subjects with mild cognitive impairment versus those with confirmed Alzheimer's disease. The study also compared classification performance when using principal component analysis (PCA). The authors found that using the voxel selection along with factor analysis resulted in improved accuracy, sensitivity, and specificity for all three tasks.
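The voxel-selection-plus-factor-analysis pipeline can be sketched as follows on synthetic data: a two-sample t-statistic selects candidate voxels, factor analysis compresses them to a few factors, and a linear discriminant classifier is cross-validated on the factor scores. The data, the t threshold of 2.0, the five factors, and the use of linear discriminant analysis are illustrative assumptions, not the paper's settings; note also that, for simplicity, the selection here is performed on all of the data, which in a rigorous study would be done within each training fold to avoid the very bias discussed earlier in this collection.
```python
import numpy as np
from scipy import stats
from sklearn.decomposition import FactorAnalysis
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_subjects, n_voxels = 100, 5000           # far more voxels than subjects
y = rng.integers(0, 2, n_subjects)         # 0 = control, 1 = patient (synthetic)
X = rng.standard_normal((n_subjects, n_voxels))
X[y == 1, :50] -= 0.6                       # a small set of truly affected voxels

# Step 1: keep only voxels whose two-sample t statistic exceeds a threshold.
t, _ = stats.ttest_ind(X[y == 0], X[y == 1], axis=0)
selected = np.abs(t) > 2.0
print(f"{selected.sum()} voxels of interest selected")

# Step 2: factor analysis to reduce the selected voxels to a few factors.
fa = FactorAnalysis(n_components=5, random_state=0)
Z = fa.fit_transform(X[:, selected])

# Step 3: classify the factor scores (linear discriminant analysis here) and
# report cross-validated accuracy.  In a rigorous study, steps 1 and 2 would
# be refit inside each training fold to avoid optimistic bias.
acc = cross_val_score(LinearDiscriminantAnalysis(), Z, y, cv=5).mean()
print(f"cross-validated accuracy: {acc:.2f}")
```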
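And here is the promised sketch of noise injection: a small multilayer perceptron is trained on a deliberately small synthetic dataset, with fresh Gaussian noise of a chosen standard deviation added to the input features at every epoch, and training versus test AUC is compared across noise levels. PyTorch, the network size, the noise levels, and the data are all assumptions for illustration and do not reproduce the study's design.
```python
import torch
import torch.nn as nn

def train_mlp(x_tr, y_tr, noise_sd=0.0, epochs=300, seed=0):
    """Train a small MLP; when noise_sd > 0, fresh Gaussian noise of that
    standard deviation is added to the training features at every epoch
    (the noise-injection idea), acting as a regularizer."""
    torch.manual_seed(seed)
    model = nn.Sequential(nn.Linear(x_tr.shape[1], 20), nn.Tanh(), nn.Linear(20, 1))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        xb = x_tr + noise_sd * torch.randn_like(x_tr)   # inject noise into inputs
        opt.zero_grad()
        loss = loss_fn(model(xb).squeeze(1), y_tr)
        loss.backward()
        opt.step()
    return model

def auc(scores, labels):
    """Simple rank-based AUC: probability a positive case outscores a negative."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    return (pos[:, None] > neg[None, :]).float().mean().item()

if __name__ == "__main__":
    torch.manual_seed(1)
    n, d = 60, 10
    x_tr = torch.randn(n, d); y_tr = (torch.rand(n) < 0.5).float()
    x_te = torch.randn(400, d); y_te = (torch.rand(400) < 0.5).float()
    x_tr[y_tr == 1] += 0.5; x_te[y_te == 1] += 0.5      # weak class separation
    for sd in (0.0, 0.5, 1.0):
        m = train_mlp(x_tr, y_tr, noise_sd=sd)
        with torch.no_grad():
            a_tr = auc(m(x_tr).squeeze(1), y_tr)
            a_te = auc(m(x_te).squeeze(1), y_te)
        print(f"noise sd={sd:.1f}  train AUC={a_tr:.2f}  test AUC={a_te:.2f}")
```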
The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI) were established to develop a database for lung nodule detection, classification, and quantitative assessment. The collaboration included seven academic centers and eight medical imaging companies, compiling 1,018 patients and their thoracic CT scans. The LIDC was formed in 2001, and the IDRI's creation in 2004 led to its unique public-private partnership format. This paper, with 56 co-authors, reports on the completed database. Radiologists marked the lesions in terms of size categories, that is, nodules three millimeters or larger in dimension, which were outlined, nodules smaller than three millimeters, and non-nodules three millimeters or larger. A total of 7,371 nodules are available in the database. As the authors noted, the LIDC-IDRI database is expected to provide an essential medical imaging research resource for CAD development, validation, and dissemination in clinical practice. The paper serves as an important record of the creation and curation of the dataset and provides useful insights into the careful work that must be done to provide these resources to the medical imaging community.
In the early 2000s, while CAD systems were commercially available, standardized approaches for evaluating and reporting their performance had not yet been formalized in the literature. This deficiency led to difficulty in the comparison of CAD devices and in understanding how the reported performance might translate into clinical practice. Thus, to address these important issues, the American Association of Physicists in Medicine formed the Computer Aided Detection in Diagnostic Imaging Subcommittee to develop recommendations on approaches for assessing CAD system performance. The purpose of this 2013 paper was to convey the opinions of the subcommittee members and to stimulate the development of consensus approaches and best practices for evaluating CAD systems. Both the assessment of a standalone CAD system and the evaluation of the impact of CAD on end users are discussed in this paper. The authors noted that proper assessment of a CAD system's standalone performance or its impact on the user would lead to a better understanding of its effectiveness and limitations, which in turn is expected to stimulate further research and development efforts on CAD technologies, reduce problems due to improper use, and eventually improve the utility and efficacy of CAD in clinical practice.
In summary, a vast amount of CAD research, often now referred to as AI, has been conducted over the past 35 years, with many findings having been published in Medical Physics. It should be noted that while AI refers to the algorithm, CAD refers to how the AI is implemented by the end user. With many advances in compute power, larger databases, and more complex algorithms, AI is rapidly expanding across clinical tasks, imaging modalities, disease types, and clinical implementations. Aspects of development, training, testing, and translation of AI remain important to ensure generalizability and reproducibility. Thank you for the opportunity to present to you these notable papers in computer-aided diagnosis and computer-aided detection from the journal Medical Physics. We hope the collection is both informational and inspirational.