Chest X-Rays are used all the time in hospitals. By using Machine Learning (ML), we can train an algorithm to triage our chest x-rays to highlight those with a high risk of pathology.

This tutorial requires a lot of computing power. Thankfully, Colaboratory is great for this, as it lets us use the power of Google's servers for free.


Pre-Requisites

Make sure you've read the Getting Started tutorial to learn the basics of Colaboratory and how to navigate its files and interface. We'll be using Colaboratory for our code, so we won't need to download anything.


Basics

All images consist of individual pixels - using ML we can train our algorithm to recognise groups of pixels which correlate with a specific disease. In essence, it will learn by looking at thousands of chest x-rays.

More information

The images we'll be using are 1024x1024 resolution. This means that when we train the algorithm, the computer will have to make hundreds of calculations on an image with 1,048,576 pixels. Multiply this by a few thousand, and we can see that this will require a beefy computer to process the data!



Stage 1 - Getting the Data

First things first, fire up a new Python 3 Notebook in Colaboratory. To get started, we need to get our data. Download the following file called kaggle.json (download button in the top right of the page). This file is a key to Kaggle, a large collection of datasets, including chest x-rays and other medical imaging repositories. To use the key properly, we need to put it in the right folder for it to work. In Colaboratory, we need to make folders and move files with code. The most important commands are listed below:

Command | Function | Usage
cd | Change Directory (folder) | cd ~ (cd to home directory, denoted by ~)
mkdir | Make Directory | mkdir ~/.kaggle (makes .kaggle folder in home directory)
ls | List Stuff - lists everything in the directory | ls (lists everything in current directory)
unzip | Unzips zipped folder | unzip test.zip (unzips test.zip)

To do that, let's make the directory using the following command: !mkdir ~/.kaggle, and then move ourselves into that folder using cd ~/.kaggle.

We need to give Kaggle the key to allow us to download our data. To do this, we need to upload the key we downloaded. On the left-hand pane, click "Upload", and upload the kaggle key from above.

Before we can import Kaggle in our code, we need to install it. Type the following in the first code block: !pip install kaggle. Pip is Python's package manager, used to install packages such as Kaggle.

Once we've run that code, we can import Kaggle using import kaggle.

Run that section of code and, if you haven't already done so, upload the key. Once that's done, we need to change directory back to our main folder, "/content". Let's do this using cd /content.
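The folder steps above can also be done from Python rather than with shell commands. This is a minimal sketch, assuming the key was uploaded somewhere other than ~/.kaggle (for example /content/kaggle.json); install_kaggle_key is a made-up helper name, not part of the kaggle package. The Kaggle client looks for the key at ~/.kaggle/kaggle.json and warns if other users can read it, which is why the sketch also restricts the file permissions.

```python
import os
import shutil
import stat


def install_kaggle_key(uploaded_key="kaggle.json", home=None):
    """Copy an uploaded kaggle.json into <home>/.kaggle and restrict its permissions.

    install_kaggle_key is a hypothetical helper for illustration only.
    """
    home = home or os.path.expanduser("~")
    kaggle_dir = os.path.join(home, ".kaggle")
    os.makedirs(kaggle_dir, exist_ok=True)         # same effect as: mkdir -p ~/.kaggle
    target = os.path.join(kaggle_dir, "kaggle.json")
    shutil.copyfile(uploaded_key, target)
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)  # same effect as: chmod 600
    return target
```

Calling install_kaggle_key("/content/kaggle.json") would then return the path the key was copied to.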

Now we can download our dataset. In a new code block, type the following: !kaggle datasets download -d paultimothymooney/chest-xray-pneumonia

This will take a minute or so to download. This dataset contains just under 6,000 chest x-ray images, labelled as either normal or pneumonia.


Stage 2 - Unzipping the data

Luckily, the data that we're using is already pre-sorted, and arranged in the correct format to make it almost instantly ready for training our algorithm. Now it's time to finish arranging our data - all we have to do is unzip the files.

To do this, let's first run ls in its own code block and see what's in the folder. We can see that there's a zip file called "chest-xray-pneumonia.zip", which we can unzip using the following command: !unzip -o chest-xray-pneumonia.zip. The -o here ensures that we overwrite any existing images that may already have been unzipped.

Once that's done, we can ls again, and see that there's another zip file now, called "chest_xray.zip", so again, let's unzip that using: !unzip chest_xray.zip.
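The same unzip step can be written with Python's built-in zipfile module, which may be handy outside a notebook. This is only a sketch: unzip_here is a hypothetical helper name. Like unzip -o, extractall overwrites files that already exist.

```python
import os
import zipfile


def unzip_here(zip_path, dest_dir="."):
    """Extract every file in zip_path into dest_dir, overwriting existing files."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        # Return the names of the extracted entries so we can see what we got
        return sorted(archive.namelist())
```

For this tutorial, the call would be unzip_here("chest-xray-pneumonia.zip").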

All our data is now unzipped. We can check this in the left panel of Colaboratory (click Files), where we should see a chest_xray folder with subfolders called train, test, and val. In general, with images, we split the data into two main folders (training and testing). The validation (val) set is smaller, and contains images which are not used at any time during training/testing, to ensure a "clean" test.

These folders allow us to train the model on one set of images and then have it test itself on images it hasn't seen, to see how well it performed.

An analogy would be to go through all the possible past papers for an exam, but saving the mock exam to see how well you would do in the final exam.
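Before training, it can be worth checking how many images ended up in each folder. The sketch below assumes the layout described above (a root folder containing split folders, each containing one folder per class); count_images and IMAGE_EXTENSIONS are names made up for this example.

```python
import os

# File endings we treat as images (an assumption for this sketch)
IMAGE_EXTENSIONS = (".jpeg", ".jpg", ".png")


def count_images(root):
    """Return {split: {class_name: image_count}} for a root/split/class/image layout."""
    counts = {}
    for split in sorted(os.listdir(root)):
        split_dir = os.path.join(root, split)
        if not os.path.isdir(split_dir):
            continue
        counts[split] = {}
        for class_name in sorted(os.listdir(split_dir)):
            class_dir = os.path.join(split_dir, class_name)
            if not os.path.isdir(class_dir):
                continue
            counts[split][class_name] = sum(
                1 for name in os.listdir(class_dir)
                if name.lower().endswith(IMAGE_EXTENSIONS)
            )
    return counts
```

Running count_images("/content/chest_xray") should list the train, test, and val folders with a count for each class folder inside them.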

We're now ready to train our algorithm!


Stage 3 - Training our model

Now the interesting bit. We're going to train our model to tell the difference between a normal chest x-ray and one with signs of pneumonia. We can train it to be able to recognise a whole host of pathology, but we're going to keep it relatively simple for this tutorial. Here's the code we're going to use:

Before we explain what the code's actually doing, go ahead and run it as it'll take some time to complete, and we'll discuss how it works whilst it's calculating away.

from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D
from keras.layers import Activation, Dropout, Flatten, Dense
from keras import backend as K

# dimensions of our images.
img_width, img_height = 256, 256

#locations of our training and testing image folders
train_data_dir = '/content/chest_xray/train'
validation_data_dir = '/content/chest_xray/test'

#how many training and testing samples we want to use
#more samples = longer training time (but a better trained model)
nb_train_samples = 2000
nb_validation_samples = 800

#epochs = how many "episodes" we want the algorithm to learn for
epochs = 25
batch_size = 16

#not important for our code, but ensures the code is adaptable to other images being used
if K.image_data_format() == 'channels_first':
    input_shape = (3, img_width, img_height)
else:
    input_shape = (img_width, img_height, 3)

#structure of the model - we say that our model is going to be sequential (IE in order), and then we add (model.add) layers to the model.
model = Sequential()
model.add(Conv2D(64, (3, 3), input_shape=input_shape))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(128, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(128, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Flatten())
model.add(Dense(64))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

# this is the augmentation configuration we will use for training
train_datagen = ImageDataGenerator(
    rescale=1. / 255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True)

# this is the augmentation configuration we will use for testing:
# only rescaling
test_datagen = ImageDataGenerator(rescale=1. / 255)

train_generator = train_datagen.flow_from_directory(
    train_data_dir,
    target_size=(img_width, img_height),
    batch_size=batch_size,
    class_mode='binary')

validation_generator = test_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_width, img_height),
    batch_size=batch_size,
    class_mode='binary')

#note: recent Keras versions deprecate fit_generator; model.fit accepts the same arguments
model.fit_generator(
    train_generator,
    steps_per_epoch=nb_train_samples // batch_size,
    epochs=epochs,
    validation_data=validation_generator,
    validation_steps=nb_validation_samples // batch_size)


model.save('CNN_model.h5')
model.save_weights('CNN_weights.h5')

Let's break down the code section by section so we fully understand what's happening:

Importing modules

We need to import the modules (other pieces of code) that we need to use for our algorithm before we start. For our algorithm, we're going to use the Keras library, and import only the modules we need (to make things a bit faster!).

from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D
from keras.layers import Activation, Dropout, Flatten, Dense
from keras import backend as K

From here, we now need to assign the key variables.

Variables and Hyperparameters

We need to tell our algorithm where to find our training data and our validation data, and the size of the images we're feeding into it. Whilst the images are actually 1024x1024, we resize them to 256x256 here (target_size in the generators does the resizing) in order to reduce the number of pixels being processed, making the algorithm learn faster.

# dimensions of our images.
img_width, img_height = 256, 256

#locations of our training and testing image folders
train_data_dir = '/content/chest_xray/train'
validation_data_dir = '/content/chest_xray/test'
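To see how much work the resizing saves, here is the arithmetic as a short sketch. The 1024x1024 figure is the image size quoted in this tutorial; pixel_count is just an illustrative helper.

```python
def pixel_count(width, height):
    """Number of pixels in a single-channel image of the given size."""
    return width * height


original = pixel_count(1024, 1024)   # image size quoted for the source images
resized = pixel_count(256, 256)      # img_width x img_height used for training
values_per_input = resized * 3       # input_shape is (256, 256, 3)

print(original)             # 1048576
print(resized)              # 65536
print(original // resized)  # 16
print(values_per_input)     # 196608
```

In other words, each resized image has 16 times fewer pixels for the model to process.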

Then we set the hyperparameters. Hyperparameters are variables that we set rather than variables that the system learns through training. Samples refers to the number of images the system will analyse, with more images leading to better accuracy (but longer time to train!).

Epochs refers to how many times the system will run through the training samples. Here, this means that the system will process 2,000 samples x 25 epochs, or 50,000 images in total. The batch size refers to how many images are run through the model at a time, and is limited by the memory available (on Colaboratory it's 12GB of GPU RAM, which is quite a lot!). Smaller batch sizes can sometimes help the model generalise better, but take longer to train.

#how many training and testing samples we want to do
#more samples = longer training time (but better accuracy)
nb_train_samples = 2000
nb_validation_samples = 800

#epochs = how many "episodes" we want the algorithm to learn for
#batch_size = how many images to run through the model at any time (limited by RAM available).
epochs = 25
batch_size = 16
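The arithmetic behind these settings can be checked in a few lines. This sketch reuses the values above and the same integer division (//) that the training call uses for steps_per_epoch and validation_steps; the other variable names are only for illustration.

```python
nb_train_samples = 2000
nb_validation_samples = 800
epochs = 25
batch_size = 16

steps_per_epoch = nb_train_samples // batch_size        # batches drawn per epoch
validation_steps = nb_validation_samples // batch_size  # batches used to validate
images_per_epoch = steps_per_epoch * batch_size
total_images_seen = images_per_epoch * epochs
total_weight_updates = steps_per_epoch * epochs         # one update per batch

print(steps_per_epoch)       # 125
print(validation_steps)      # 50
print(total_images_seen)     # 50000
print(total_weight_updates)  # 3125
```

So the 50,000 images are processed in 3,125 batches of 16, with the model's weights updated after each batch.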

Structure of the model

This is where the majority of the clever stuff happens. We're not going to do a deep-dive on the structure of Convolutional Neural Networks (CNNs), but suggest that you have a quick google so you feel familiar with the concept.

model = Sequential()
model.add(Conv2D(64, (3, 3), input_shape=input_shape))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

Here, we initialise the model by calling model = Sequential(), then use model.add to add layers to it. We're just going to look at one block of layers, as the general concept is the same for each. We start with a Convolutional layer which feeds into an Activation layer. The convolutional layer slides small 3x3 filters (masks) over the image, and the activation ('relu') keeps only the positive responses, i.e. the places where a filter matched well. The MaxPooling layer then keeps the strongest response in each 2x2 region, which essentially introduces "blur" into the system. It loses some detail, but this helps the model to generalise, as overfitting is a big problem in machine learning.
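To make the three steps concrete, here is a toy version in plain Python. It is a sketch of the idea only, not how Keras implements these layers: the 6x6 "image" and the edge-detecting filter are made up, and a real network learns its filter values during training.

```python
def conv2d_valid(image, kernel):
    """Slide a square kernel over a 2D image (no padding, stride 1)."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(k) for j in range(k))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Replace negative responses with zero."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 block."""
    h = len(feature_map) // 2 * 2
    w = len(feature_map[0]) // 2 * 2
    return [
        [
            max(feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1])
            for c in range(0, w, 2)
        ]
        for r in range(0, h, 2)
    ]


# A made-up 6x6 "image": bright band, dark band, bright band (left to right)
image = [[9, 9, 0, 0, 9, 9] for _ in range(6)]

# A hand-written filter that responds to dark-to-bright vertical edges
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

features = relu(conv2d_valid(image, kernel))  # 4x4 map, negatives set to zero
pooled = max_pool_2x2(features)               # 2x2 summary of that map
print(pooled)  # [[0, 27], [0, 27]]
```

The filter gives a negative response at the bright-to-dark edge, which relu discards, and a strong positive response at the dark-to-bright edge, which survives pooling even though the map has shrunk from 4x4 to 2x2.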

The model is compiled like so:

model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

This tells the model how to learn: it minimises the loss, measured using "binary_crossentropy", as the output is binary (pneumonia or normal, 1 or 0).
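For a feel of what binary cross-entropy measures, here is a small sketch of the standard formula in plain Python. It illustrates the calculation rather than reproducing the Keras code; the predictions are made up, and the eps value used for clipping is an assumption for this example.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy for labels (0 or 1) and predicted probabilities."""
    total = 0.0
    for label, prob in zip(y_true, y_pred):
        prob = min(max(prob, eps), 1 - eps)  # clip to avoid log(0)
        total += -(label * math.log(prob) + (1 - label) * math.log(1 - prob))
    return total / len(y_true)


# Made-up predictions for three images: 1 = pneumonia, 0 = normal
mostly_right = binary_crossentropy([1, 0, 1], [0.9, 0.1, 0.8])
mostly_wrong = binary_crossentropy([1, 0, 1], [0.1, 0.9, 0.2])

print(round(mostly_right, 3))  # about 0.145
print(round(mostly_wrong, 3))  # about 2.072
```

Confident, correct predictions give a loss close to 0, while confident, wrong predictions are penalised heavily, which is what pushes the model towards better answers during training.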

Data Generator

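We also need to create more data for our system. The more data we have, the more accurate our model is likely to be. We augment our training data using techniques including shearing and zooming (the full code above also flips images horizontally). Rescaling by 1/255 is not augmentation as such: it simply converts pixel values from the 0-255 range to the 0-1 range.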

train_datagen = ImageDataGenerator(
    rescale=1. / 255,
    shear_range=0.2,
    zoom_range=0.2)
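As a rough illustration of what the generator does to each image, here is a toy version of two of the settings: rescale, and the horizontal_flip=True option from the full listing. This is a sketch in plain Python on a made-up 2x3 image, not the Keras implementation, and the function names are invented for the example.

```python
import random


def rescale(image, factor=1.0 / 255):
    """Scale every pixel, e.g. from the 0-255 range to the 0-1 range."""
    return [[pixel * factor for pixel in row] for row in image]


def random_horizontal_flip(image, rng, probability=0.5):
    """Mirror the image left-to-right with the given probability."""
    if rng.random() < probability:
        return [list(reversed(row)) for row in image]
    return [list(row) for row in image]


# A made-up 2x3 grayscale image with pixel values in the 0-255 range
image = [[0, 51, 255],
         [102, 153, 204]]

rng = random.Random(0)  # fixed seed so the example is repeatable
augmented = rescale(random_horizontal_flip(image, rng))
print(augmented)
```

Because the flip is random, the model sees slightly different versions of the same image across epochs, which is where the "extra" data comes from.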

The model is then trained using model.fit_generator:

model.fit_generator(
    train_generator,
    steps_per_epoch=nb_train_samples // batch_size,
    epochs=epochs,
    validation_data=validation_generator,
    validation_steps=nb_validation_samples // batch_size)


model.save('CNN_model.h5')
model.save_weights('CNN_weights.h5')

We feed in the training data from the train_generator, and validation data from the validation_generator. Once training finishes, we save the model and, separately, the weights it learned. The .h5 extension refers to HDF5, a hierarchical data format designed to store large amounts of data efficiently.


Interpreting the results

Hopefully your model has finished training by now! As you may have noticed, four key "metrics" appear during training: Loss, Accuracy, Validation Loss, and Validation Accuracy.

Metric | Description
Accuracy | Measure of how well our algorithm is able to correctly classify the images in our training set (on a scale from 0-1) IE, how well it does when "revising"
Validation Accuracy | How well our algorithm performed when classifying images in the testing/validation dataset. IE, how well it did in the mock exam.
Loss | How close the individual points are to the mathematical function that our algorithm has generated. IE, if we had a scatter plot with a line of best fit, it would be the distance between the line and each individual point to the line.
Validation Loss | As above, but this time, testing against the testing/validation data rather than the training data.

When training our algorithm, we expect loss to decrease with each epoch. However, if loss is very low, accuracy is high, and validation accuracy is low, it indicates that the model may be overfitting (IE our system has memorised all the practice questions but can't answer the exam questions).

[Figure: loss. The machine learning model is trying to find the line of best fit which has the smallest total distance in terms of red arrows.]

By this point our algorithm should be well on the way to finishing training. Once it's complete, the final line of output will look something like this:

Epoch 25/25 - loss: 0.1838 - acc: 0.9270 - val_loss: 0.3210 - val_acc: 0.8600.

Our algorithm has finished training, and shows that it's around 86% accurate on new data. For a starting point, it's pretty good! We can improve this further by modifying some of the variables, but we'll focus on these more in a later tutorial.
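One simple way to keep an eye on overfitting is to compare training accuracy with validation accuracy. The sketch below uses the final-epoch figures shown above; the helper names and the 0.10 cut-off are arbitrary choices for illustration, not an established rule.

```python
def generalisation_gap(train_acc, val_acc):
    """Difference between training accuracy and validation accuracy."""
    return train_acc - val_acc


def looks_overfitted(train_acc, val_acc, max_gap=0.10):
    """Flag a run whose training accuracy beats validation accuracy by more than max_gap."""
    return generalisation_gap(train_acc, val_acc) > max_gap


# Figures from the final epoch shown above
gap = generalisation_gap(train_acc=0.9270, val_acc=0.8600)
print(round(gap, 3))                     # 0.067
print(looks_overfitted(0.9270, 0.8600))  # False
```

A small gap like this suggests the model is doing roughly as well in the "mock exam" as it does when "revising".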


Stage 4 - Testing our algorithm

To finish, let's test our algorithm on the remaining data that we haven't used (in the validation folder), and see how our model performs.

We'll manually feed in an image which shows Pneumonia using the img_path variable. The image gets converted to numbers using the img_to_array function, and then tested through model.predict.

import numpy as np
from keras.models import load_model
from keras.preprocessing import image

img_path = 'chest_xray/val/PNEUMONIA/person1952_bacteria_4883.jpeg'

#Load model if already saved
#model = load_model("CNN_model.h5")

#Convert image to an array
img = image.load_img(img_path, target_size=(256,256))
img_array = image.img_to_array(img)
img_array = np.expand_dims(img_array, axis=0)
new_image = img_array/255. #Must normalise numbers to a scale of 0-1

prediction = model.predict(new_image)
#prediction has shape (1, 1), so take the single value before printing
print('Probability of pneumonia: {:.1f}%'.format(prediction[0][0] * 100))

If we test it enough, we find that it works well at identifying images with confirmed pneumonia; however, it also gives normal images a relatively high probability of pneumonia. In part, this may be due to the skeleton being detected as a key feature of both normal and pneumonia images. Images may also have been placed into the wrong normal/pneumonia folder (human error), which is another potential source of error.
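To put numbers on this kind of behaviour, we can count how often pneumonia images are caught (sensitivity) and how often normal images are correctly cleared (specificity) at a chosen probability threshold. The sketch below uses made-up probabilities purely for illustration; they are not results from the model above, and the function name is hypothetical. It also assumes the list contains at least one image of each class.

```python
def sensitivity_specificity(labels, probabilities, threshold=0.5):
    """Return (sensitivity, specificity), counting probabilities >= threshold as pneumonia."""
    tp = fn = tn = fp = 0
    for label, prob in zip(labels, probabilities):
        predicted = 1 if prob >= threshold else 0
        if label == 1 and predicted == 1:
            tp += 1   # pneumonia, correctly flagged
        elif label == 1:
            fn += 1   # pneumonia, missed
        elif predicted == 0:
            tn += 1   # normal, correctly cleared
        else:
            fp += 1   # normal, wrongly flagged
    return tp / (tp + fn), tn / (tn + fp)


# Made-up scores: 1 = pneumonia, 0 = normal
labels = [1, 1, 1, 1, 0, 0, 0, 0]
probabilities = [0.99, 0.97, 0.90, 0.85, 0.70, 0.60, 0.30, 0.10]

print(sensitivity_specificity(labels, probabilities, threshold=0.5))  # (1.0, 0.5)
print(sensitivity_specificity(labels, probabilities, threshold=0.8))  # (1.0, 1.0)
```

In this toy example, raising the threshold stops the normal images with fairly high scores from being flagged, without missing any pneumonia cases. On real data there is usually a trade-off between the two.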

Summary

  • For image classification, large datasets are needed, with augmentation to ensure a large enough data sample.
  • CNNs are great for image classification. To get the best results, it's best to play around with hyper-parameters.

Hopefully you've managed to follow along and found this useful! If you enjoyed this, or have any ideas, concerns, or expectations, please drop us a message here (we'd absolutely love to hear from you)!