Abstract

Dataset

NIH Chest X-rays

National Institutes of Health Chest X-Ray Dataset. Chest X-ray exams are one of the most frequent and cost-effective medical imaging examinations available. However, clinical diagnosis from a chest X-ray can be challenging and is sometimes more difficult than diagnosis via chest CT imaging. The lack of large, publicly available, annotated datasets means it is still very difficult, if not impossible, to achieve clinically relevant computer-aided detection and diagnosis (CAD) with chest X-rays in real-world medical sites. One major hurdle in creating large X-ray image datasets is the lack of resources for labeling so many images. Prior to the release of this dataset, OpenI was the largest publicly available source of chest X-ray images, with 4,143 images available.

This NIH Chest X-ray Dataset comprises 112,120 X-ray images with disease labels from 30,805 unique patients. To create these labels, the authors used natural language processing to text-mine disease classifications from the associated radiology reports. The labels are expected to be >90% accurate and suitable for weakly-supervised learning. The original radiology reports are not publicly available, but more details on the labeling process can be found in the Open Access paper “ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases” (Wang et al.).


Problem Statement

Analyze data from the NIH Chest X-ray Dataset and train a CNN to classify a given chest X-ray for the presence or absence of pneumonia.

This project will culminate in a model that predicts the presence of pneumonia with accuracy approaching that of a human radiologist and that can be prepared for submission to the FDA for 510(k) clearance as software as a medical device.

Metrics

AUC

AUC is used for binary classification, multiclass classification, and ranking problems. It measures the proportion of correctly ordered pairs of objects and thus the model's capability to distinguish between the classes.

The AUC has an important statistical property: the AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance.

AUC is the Area Under the ROC Curve. The best value is AUC = 1, achieved by a model that ranks all objects correctly (all objects of class 1 are assigned higher probabilities than objects of class 0). A ‘bad’ classifier that performs no better than random guessing has an AUC of 0.5.

The ROC curve shows the model’s ability to distinguish between classes.

A model that assigns classes at random is a ‘bad’ classifier and has a diagonal ROC curve. The better the classifier, the higher the ROC curve. The ROC curve is plotted with TPR (True Positive Rate) on the y-axis against FPR (False Positive Rate) on the x-axis. The curve can also be interpreted in terms of the model’s Sensitivity and Specificity, with Sensitivity on the y-axis and (1 − Specificity) on the x-axis.

Building and visualizing the ROC curve can be used to measure a classifier’s performance across different probability thresholds and to select the threshold required to achieve a specified false-positive or false-negative rate.
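As a concrete illustration, the ROC curve and AUC can be computed from true labels and predicted probabilities with scikit-learn. This is a minimal sketch; y_true and y_prob below are placeholder arrays, not variables from this project.

import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Placeholder labels and predicted probabilities
y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

# AUC: probability that a random positive is ranked above a random negative
print(roc_auc_score(y_true, y_prob))

# Each threshold gives one (FPR, TPR) point on the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
for f, t, th in zip(fpr, tpr, thresholds):
    print("threshold=%.2f  FPR=%.2f  TPR=%.2f" % (th, f, t))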

Binary Accuracy

Binary Accuracy calculates the percentage of predicted values (yPred) that match the actual values (yTrue) for binary labels.

Since the label is binary, yPred consists of the predicted probability that the label equals 1.

For a record:

  1. If the probability is above the threshold, 1 is assigned; otherwise, 0 is assigned.
  2. If the assigned value is equal to the actual value, it is considered accurate.

We then calculate Binary Accuracy by dividing the number of accurately predicted records by the total number of records.

The threshold (default = 0.5) can be adjusted to improve Binary Accuracy.
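As a minimal sketch of this computation in plain NumPy (not tied to any particular framework; the names binary_accuracy, y_true, and y_prob are illustrative only):

import numpy as np

def binary_accuracy(y_true, y_prob, threshold=0.5):
    # Step 1: assign 1 if the probability is above the threshold, else 0
    y_pred = (y_prob > threshold).astype(int)
    # Step 2: accurately predicted records divided by the total number of records
    return np.mean(y_pred == y_true)

y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.3, 0.4, 0.8, 0.6])
print(binary_accuracy(y_true, y_prob))        # 0.6 at the default threshold
print(binary_accuracy(y_true, y_prob, 0.35))  # 0.8 with a lower threshold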

EDA

The DataFrame.describe() method generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values. This method tells us a lot about a dataset. One important point is that describe() deals only with numeric values by default: if a column contains categorical values, describe() will ignore it and display a summary for the other columns, unless the parameter include='all' is passed.

  • count tells us the number of Non-empty rows in a feature.
  • mean tells us the mean value of that feature.
  • std tells us the Standard Deviation Value of that feature.
  • min tells us the minimum value of that feature.
  • 25%, 50%, and 75% are the percentiles/quartiles of each feature. This quartile information helps us detect outliers.
  • max tells us the maximum value of that feature.
all_xray_df.describe()

                              count     mean          std          min       25%       50%        75%        max
Follow-up #                   112120.0  8.573751      15.406320    0.000     0.000     3.000      10.000     183.0000
Patient ID                    112120.0  14346.381743  8403.876972  1.000     7310.750  13993.000  20673.000  30805.0000
Patient Age                   112120.0  46.901463     16.839923    1.000     35.000    49.000     59.000     414.0000
OriginalImage[Width           112120.0  2646.078844   341.246429   1143.000  2500.000  2518.000   2992.000   3827.0000
Height]                       112120.0  2486.438842   401.268227   966.000   2048.000  2544.000   2991.000   4715.0000
OriginalImagePixelSpacing[x   112120.0  0.155649      0.016174     0.115     0.143     0.143      0.168      0.1988
y]                            112120.0  0.155649      0.016174     0.115     0.143     0.143      0.168      0.1988
Unnamed: 11                   0.0       NaN           NaN          NaN       NaN       NaN        NaN        NaN

EDA Visualization

Data with labels

The relative frequencies of each disease in the dataset. It looks like ‘No Finding’ is the most common occurrence. ‘No Finding’ can never appear with any other label by definition, so we know that in 57.5% of this dataset there is no finding in the image. Beyond that, ‘Infiltration’ appears to be the most common disease-related label, followed by ‘Effusion’ and ‘Atelectasis’.
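A minimal sketch of how these relative frequencies can be computed, assuming the DataFrame all_xray_df contains the dataset’s pipe-separated 'Finding Labels' column:

# Split the pipe-separated labels, flatten, and count relative frequencies
label_freq = (all_xray_df['Finding Labels']
              .str.split('|')
              .explode()
              .value_counts(normalize=True))
print(label_freq)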

Age and Gender distribution

Age and Gender Distribution

Patient gender distribution on Effusion

Patient gender distribution on Infiltration

It looks like the distribution of age across the whole population is slightly different than it is specifically for Infiltration and Effusion. Infiltration appears to be more skewed towards younger individuals, while Effusion spans the age range but has a large peak around 55.

Implementation

Splitting data

The data was split into training and validation sets.

import pandas as pd
import sklearn.model_selection as skl

def create_splits(all_xray_df):

    # Separate positive (pneumonia) and negative cases
    pmonia_df = all_xray_df[all_xray_df['Pneumonia'] == 1]
    non_pmonia_df = all_xray_df[all_xray_df['Pneumonia'] == 0]

    # Hold out 20% of the pneumonia cases for validation
    train_data, val_data = skl.train_test_split(pmonia_df, test_size=0.2)
    train_non_pmonia_data, val_non_pmonia_data = skl.train_test_split(non_pmonia_df, test_size=0.5)

    # Balance the training set: sample as many negatives as there are positives
    train_non_pmonia_data = train_non_pmonia_data.sample(train_data.shape[0])
    train_data = pd.concat([train_data, train_non_pmonia_data])

    # Sample negatives so the validation set keeps roughly 1.2% pneumonia prevalence
    non_pmonia_test_count = int((val_data.shape[0] / 1.2) * 98.8)
    val_non_pmonia_data = val_non_pmonia_data.sample(non_pmonia_test_count)
    val_data = pd.concat([val_data, val_non_pmonia_data])

    return train_data, val_data

train_df, valid_df = create_splits(all_xray_df)

Age distribution across the training data and validation data

fig, axes = plt.subplots(1, 2)
train_df['Patient Age'].hist(ax=axes[0],figsize=(20,5))
valid_df['Patient Age'].hist(ax=axes[1],figsize=(20,5))
axes[0].set_title('Distribution of ages for training data')
axes[0].set_xlabel("Age")
axes[0].set_ylabel("Number of x-ray observations")
axes[1].set_title('Distribution of ages for validation data')
axes[1].set_xlabel("Age")
axes[1].set_ylabel("Number of x-ray observations")

Age distribution across the data

Compare Pneumonia vs No Pneumonia cases between training vs validation data

fig, axes = plt.subplots(1, 2)
train_df['Pneumonia'].value_counts().plot(ax=axes[0],kind='bar',figsize=(20,5))
valid_df['Pneumonia'].value_counts().plot(ax=axes[1],kind='bar',figsize=(20,5))
axes[0].set_title('Pneumonia vs No Pneumonia for training data')
axes[0].set_xlabel("Gender")
axes[0].set_ylabel("Number of x-ray observations")
axes[1].set_title('Pneumonia vs No Pneumonia for validation data')
axes[1].set_xlabel("Gender")
axes[1].set_ylabel("Number of x-ray observations")

Pneumonia vs No Pneumonia cases between training vs validation data

Algorithms and Techniques

Training

This classification problem benefits from transfer learning, so a pretrained VGG16 network was used as the base model.
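The helper load_pretrained_model() used in build_my_model() below is defined elsewhere in the notebook. A minimal sketch of what it could look like, assuming the Keras VGG16 application with ImageNet weights and most convolutional layers frozen:

from keras.applications.vgg16 import VGG16

def load_pretrained_model():
    # VGG16 convolutional base with ImageNet weights, without the top classifier
    vgg = VGG16(include_top=False, weights='imagenet', input_shape=(224, 224, 3))
    # Freeze all but the last few layers so only those are fine-tuned
    for layer in vgg.layers[:-4]:
        layer.trainable = False
    return vgg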

from keras.models import Sequential
from keras.layers import Flatten, Dropout, Dense
from keras.optimizers import Adam

def build_my_model():

    # Stack fully connected layers on top of the pretrained VGG16 base
    model = Sequential()
    model.add(load_pretrained_model())
    model.add(Flatten())
    model.add(Dropout(0.5))
    model.add(Dense(1024, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(512, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(256, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer=Adam(lr=1e-4), loss='binary_crossentropy', metrics=['binary_accuracy'])

    return model

my_model = build_my_model()

# train_gen, valX, valY, and callbacks_list are defined earlier in the notebook
history = my_model.fit_generator(train_gen,
                                 validation_data=(valX, valY),
                                 epochs=15,
                                 callbacks=callbacks_list)

Model Evaluation

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc, precision_recall_curve, f1_score
from sklearn.preprocessing import binarize

def plot_auc(valY, pred_Y):

    # Precision-recall pairs and the thresholds that produce them
    precision, recall, thresholds = precision_recall_curve(valY, pred_Y)

    # ROC curve and its area
    fpr, tpr, _ = roc_curve(valY, pred_Y)
    roc_auc = auc(fpr, tpr)

    # Calculate F1 score at each threshold
    f1_scores = []
    for i in thresholds:
        f1 = f1_score(valY.astype(int), binarize(pred_Y, threshold=i))
        f1_scores.append(f1)

    # Plot charts
    fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(20, 5))

    lw = 2
    ax1.plot(fpr, tpr, color='purple', lw=lw, label='ROC curve (area = %0.4f)' % roc_auc)
    ax1.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
    ax1.set_xlabel('False Positive Rate')
    ax1.set_ylabel('True Positive Rate')
    ax1.title.set_text('ROC Curve')
    ax1.legend(loc="upper left")
    ax1.grid(True)

    ax2.step(recall, precision, color='red', where='post')
    ax2.set_xlabel('Recall')
    ax2.set_ylabel('Precision')
    ax2.title.set_text('Precision-Recall Curve')
    ax2.grid(True)

    ax3.plot(thresholds, f1_scores, label='F1 Score', color='purple')
    ax3.set_xlabel('Threshold')
    ax3.set_ylabel('F1 Score')
    ax3.title.set_text('F1 Score vs Threshold')
    ax3.legend(loc="upper left")
    ax3.grid(True)

    plt.show()

plot_auc(valY, pred_Y)

ROC Curve, Precision-Recall Curve, F1 Score

The AUC suggests a 69% chance that the model will correctly distinguish a pneumonia patient from a non-pneumonia patient based on the ordering of the predicted probabilities.

Classification report:
              precision    recall  f1-score   support

         0.0       0.99      1.00      0.99      1978
         1.0       0.00      0.00      0.00        22

    accuracy                           0.99      2000
   macro avg       0.49      0.50      0.50      2000
weighted avg       0.98      0.99      0.98      2000
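The report above is a standard scikit-learn classification report. A minimal sketch of how it can be produced, assuming pred_Y holds the model’s predicted probabilities for the validation set and predictions are binarized at the default 0.5 threshold:

from sklearn.metrics import classification_report

# Binarize the predicted probabilities at 0.5 and compare with the true labels
y_pred = (pred_Y.ravel() > 0.5).astype(int)
print(classification_report(valY.astype(int), y_pred))

At this threshold the model fails to identify any of the 22 pneumonia cases (recall 0.00 for class 1.0), even though overall accuracy is 0.99 on such an imbalanced validation set.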

Project Repository

Project repo