Pre-Requisites

Make sure you have read the Getting Started tutorial, which covers the basics of Colaboratory, including how to navigate files and the interface. We'll run all of our code in Colaboratory, so there's nothing to install locally.


Introduction

Research often produces a lot of data, more than can be meaningfully processed in applications such as Microsoft Excel. Instead, we use code to help us draw meaningful conclusions from the data. In this example, we're looking at data generated from histology slides of normal and cancerous breast tissue, with plenty of features to investigate.


The data

For this tutorial, we're going to use the Wisconsin Breast Cancer Dataset. This dataset came out in 1994, and contains 569 samples of breast cancer histology. The data is on Kaggle, which means we can use a kaggle command to download it straight to Colaboratory. With that in mind, let's get started!


Stage 1 - Importing the Data

This is very similar to other tutorials, so let's recap the details of how to do this.

First things first, fire up a new Python 3 Notebook in Colaboratory. To get started, we need to get our data. Download the following file. This file is a key to Kaggle, a large collection of datasets, including chest x-rays and other medical imaging repositories. For the key to work, it needs to be in the right folder. Let's make that directory using the following command: !mkdir ~/.kaggle, and then move into that folder using cd ~/.kaggle

We need to give Kaggle the key to allow us to download our data. To do this, we need to upload the key we downloaded. On the left-hand pane, click code snippets and click "Uploading files from your local file system"

Use the code snippet shown above to upload the kaggle key, enabling us to download the files!

To use Kaggle in our code we need to import it, and to import it we need to install it first. Type the following in the first code block: !pip install kaggle. pip is Python's package manager, used to install packages such as kaggle.

Once we've run that code, we will then import Kaggle using import kaggle to be able to use it in our code.

Let's run that section of code, and upload the key we just downloaded. After that's done, we need to change directory back to our main folder called "/content". Let's do this using cd /content.
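For readers who prefer to do the folder setup in Python rather than with shell commands, here is a minimal sketch. It assumes the key file is named kaggle.json (the name Kaggle gives its API token) and that it has already been uploaded somewhere in the Colaboratory session; the helper name install_kaggle_key is made up for this example.

```python
import shutil
import stat
from pathlib import Path


def install_kaggle_key(key_file, home=None):
    """Copy a Kaggle key into <home>/.kaggle/kaggle.json (hypothetical helper)."""
    home = Path(home) if home is not None else Path.home()
    kaggle_dir = home / ".kaggle"
    # Same effect as !mkdir ~/.kaggle, but does not fail if the folder exists
    kaggle_dir.mkdir(parents=True, exist_ok=True)
    target = kaggle_dir / "kaggle.json"
    shutil.copyfile(key_file, target)
    # Owner read/write only; the kaggle client warns about looser permissions
    target.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return target
```

Calling install_kaggle_key("kaggle.json") from /content would then put the key where the kaggle command looks for it, without needing to change directory.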

Now we can download our dataset. In a new code block, type the following: !kaggle datasets download -d uciml/breast-cancer-wisconsin-data. This will take a few seconds to download. The data comes inside a zip file, so let's unzip the file using !unzip breast-cancer-wisconsin-data.zip (the leading ! is needed because unzip is a shell command, not Python).
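If the shell command isn't available, the archive can also be unpacked from Python itself. This is a small sketch using the standard library's zipfile module; the function name extract_archive is an assumption for this example, not part of any library.

```python
import zipfile


def extract_archive(zip_path, dest="."):
    """Extract every file in zip_path into dest and return the file names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()
```

Running extract_archive("breast-cancer-wisconsin-data.zip") in /content should leave the same files in place as the unzip command.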

If we now type and run ls to see what's there, we'll find a .csv file. This type of file is typically opened by Microsoft Excel, but we're going to use a package called Pandas to open it!


Stage 2 - Visualising the Data

To start, let's view the data. To do that, we need to type the following into a new code block and run it:

#Check the data to see it has imported correctly, and the top 5 rows.
import pandas as pd

df = pd.read_csv('data.csv')
df.head()

This code imports the Pandas module, then uses the read_csv function to open the csv file. We store the result in a variable called 'df', short for dataframe, and use that variable to access our data from now on. df.head() returns the top 5 rows so we can verify the data imported correctly (Colaboratory automatically displays the result of the last line in a code block). The brackets are empty because we don't need to pass any extra options; we just want the default behaviour (i.e. only the top 5 rows). The output should look something like this:

The first five rows of the dataset
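To see what the empty brackets are doing, it can help to try head() with and without a number. The sketch below uses a tiny made-up dataframe (the values are invented, not taken from the dataset) so it runs on its own.

```python
import pandas as pd

# Small stand-in for the real dataset; values are made up for illustration
demo = pd.DataFrame({
    "id": [101, 102, 103, 104, 105, 106, 107],
    "diagnosis": ["M", "B", "B", "M", "B", "M", "B"],
    "radius_mean": [17.9, 11.4, 12.1, 20.6, 9.5, 18.2, 13.0],
})

first_five = demo.head()    # no argument: the default of 5 rows
first_three = demo.head(3)  # a number in the brackets overrides the default

print(demo.shape)           # (number of rows, number of columns)
print(first_three)
```

On the real data, df.shape is a quick way to confirm the row and column counts before going any further.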

We can see that there's a lot of numerical data. The data is organised into 3 main sections: means, standard errors, and worst values, with the same headings in each of the 3 sections. There's an 'id' column which is of no relevance to us, so it can be removed. On the far right, there's also a column called 'Unnamed: 32' which we can remove to clean up our data a bit. Note as well the column called diagnosis, with values of either 'B' or 'M' for benign or malignant. To create an algorithm, we should change this to be binary (either 0 or 1) so that we can build a numerical model to predict if a cancer is benign or malignant. We can do all of the above through the following:

#Removing the columns which are not relevant, once done, this does not need to be done again
df.drop('id', axis=1, inplace = True)
df.drop("Unnamed: 32",axis=1,inplace=True)

#We need to remap the values for malignant and benign to being 1 or 0
df['diagnosis'] = df['diagnosis'].map({'M':1,'B':0})

We apply the drop function to our dataframe (df) and tell it to drop the id column; axis=1 means we are dropping a column (axis=0 would be a row). inplace=True changes df directly rather than returning a modified copy, which is also why this block shouldn't be run twice: the columns will already be gone, and drop will raise an error. To recode the diagnosis values, we use the 'map' function to change our M values to 1, and B values to 0.
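As a comparison, here is a sketch of the same clean-up without inplace, again on a small made-up dataframe. It also checks that map did not leave any value unconverted, since map turns anything missing from the dictionary into NaN.

```python
import pandas as pd

# Invented rows that mimic the layout of the real file
demo = pd.DataFrame({
    "id": [101, 102, 103],
    "diagnosis": ["M", "B", "M"],
    "radius_mean": [17.9, 11.4, 20.6],
    "Unnamed: 32": [None, None, None],
})

# Without inplace=True, drop() returns a new dataframe and leaves demo untouched
cleaned = demo.drop(columns=["id", "Unnamed: 32"])

# Recode the labels, then count any values the dictionary did not cover
cleaned["diagnosis"] = cleaned["diagnosis"].map({"M": 1, "B": 0})
unmapped = int(cleaned["diagnosis"].isna().sum())

print(list(cleaned.columns))
print("Unmapped diagnosis values:", unmapped)
```

Because demo is left unchanged, this version can be re-run safely, at the cost of keeping two dataframes around.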

Let's check that all worked by running df.head() in a new code block.

The top five rows without the extra columns!

Looks like it worked! There are only 31 columns this time compared to the 33 we had earlier, and we can see that the M's and B's in the diagnosis column have changed to 1's and 0's respectively.

To make things easier for data processing, let's group the 31 columns into the three categories. Let's look at the list of columns by running df.columns in a new code block. Counting from 1, column 1 is diagnosis, columns 2-11 are labelled mean, 12-21 are SE (standard error), and 22-31 are the worst measurements. Python counts from 0, which is why the slices in the code look shifted by one.

#splitting the data into the 3 different sets provided
features_mean = list(df.columns[0:11])
features_se = list(df.columns[11:21])
features_worst = list(df.columns[21:31])

#print the lists so we can make sure it worked correctly
print (features_mean)
print (features_se)
print (features_worst)

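We create three new lists to separate the 3 categories, and then check the correct features are in each list. Note that features_mean starts at column 0, so it also contains diagnosis; that lets us see how each mean feature correlates with the diagnosis in the next step.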
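Slicing by position works, but it breaks silently if the column order ever changes. One alternative, sketched here, is to pick columns by the ending of their name. The helper columns_ending is made up for this example, and the dataframe is an empty stand-in with only a few of the dataset's column names.

```python
import pandas as pd

# Empty frame with a handful of column names in the dataset's naming pattern
demo = pd.DataFrame(columns=[
    "diagnosis",
    "radius_mean", "texture_mean",
    "radius_se", "texture_se",
    "radius_worst", "texture_worst",
])


def columns_ending(frame, suffix):
    """Return the column names of frame that end with suffix (hypothetical helper)."""
    return [name for name in frame.columns if name.endswith(suffix)]


# Keep diagnosis in the mean group, as the slice [0:11] does
features_mean = ["diagnosis"] + columns_ending(demo, "_mean")
features_se = columns_ending(demo, "_se")
features_worst = columns_ending(demo, "_worst")

print(features_mean)
print(features_se)
print(features_worst)
```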

Now for something cool. Normally to determine correlations between variables in Excel, we might end up doing it manually. But really, who has time for that? This next method is super useful for spotting trends in large swathes of numerical data super fast. We're going to use the heatmap function from a module called Seaborn, which enables us to view all correlations at a glance. Let's look at the code below:

#Let's look at the correlations between the features of the dataset
import seaborn as sns
import matplotlib.pyplot as plt


corr = df[features_mean].corr()
plt.figure(figsize =(14,14))

sns.heatmap(corr, cbar = True,  square = True, annot=True, fmt= '.2f',annot_kws={'size': 15}, cmap= 'coolwarm')

We import the modules, then use the features list to calculate the correlations, which we store in a variable called corr. We create an empty figure for the chart using plt.figure, then draw the actual plot using sns.heatmap, feeding in 'corr' to supply the correlations. The other arguments inside the brackets make the plot look a bit nicer: cbar adds the colour bar on the right, square makes each cell square, annot writes the numbers in the cells, fmt='.2f' shows those numbers to 2 decimal places, annot_kws sets their font size, and cmap picks the colour scheme.

The result is the plot below.

On initial inspection, the plot looks complicated, but it contains a lot of detail which enables us to build a good model to predict malignancy very quickly. If we use features which show a strong correlation with diagnosis (a number close to 1, or close to -1 for a negative relationship), our model is likely to be more accurate when we train it. Note the diagonal line of 1's, which indicates a feature being plotted against itself (e.g. radius_mean plotted against radius_mean, which shows 100% correlation!).
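Reading the diagnosis row off the heatmap by eye works, but the same information can be pulled out as a sorted list. The sketch below does this on randomly generated stand-in data (the feature names strong_feature, weak_feature and noise_feature are invented); on the real data the same two lines would be applied to df[features_mean].

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: one feature tied closely to diagnosis, one loosely, one not at all
rng = np.random.default_rng(0)
diagnosis = rng.integers(0, 2, size=200)

demo = pd.DataFrame({
    "diagnosis": diagnosis,
    "strong_feature": diagnosis * 5.0 + rng.normal(0, 1, size=200),
    "weak_feature": diagnosis * 0.5 + rng.normal(0, 1, size=200),
    "noise_feature": rng.normal(0, 1, size=200),
})

corr = demo.corr()
# Take the diagnosis column of the correlation table, remove its self-correlation, and sort
ranked = corr["diagnosis"].drop("diagnosis").sort_values(ascending=False)
print(ranked)
```

The features at the top of the list are the most promising candidates for the model.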


Stage 3 - Training our Model

We need to pick some features for our model to learn from. By the looks of it, there is a good correlation between diagnosis and perimeter_mean, compactness_mean, and concavity_mean, so let's use these three features to start when training our model.

Let's add these features to a new list called 'prediction_var' through the following line of code: prediction_var = ['perimeter_mean', 'compactness_mean', 'concavity_mean']. Then we need to set up the basic model structure as follows:

from sklearn.model_selection import train_test_split # module to split the data into two parts

#split data into 85% being used for training, and 15% for testing
train, test = train_test_split(df, test_size = 0.15)
print ("Train shape", train.shape)
print ("Test shape", test.shape)

#setup the structure of the training data
train_x = train[prediction_var]
train_y = train.diagnosis

test_x = test[prediction_var]
test_y = test.diagnosis

Using this, we set up the training and testing data split. Then we need to set up the model to train.

from sklearn.ensemble import RandomForestClassifier # for random forest classifier
from sklearn import metrics # to check the error and accuracy of the model


model = RandomForestClassifier(n_estimators=300)  
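One optional refinement to the split above: train_test_split shuffles the rows randomly, so every run produces a different split. The sketch below, on a made-up dataframe, uses two of its parameters: random_state to make the split repeatable, and stratify to keep the proportion of malignant cases the same in both parts. The seed value 42 is arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented data: 40 malignant (1) and 60 benign (0) rows
demo = pd.DataFrame({
    "diagnosis": [1] * 40 + [0] * 60,
    "perimeter_mean": range(100),
})

train, test = train_test_split(
    demo,
    test_size=0.15,
    random_state=42,             # fixed seed: the same split on every run
    stratify=demo["diagnosis"],  # keep the 40/60 class ratio in both parts
)

print("Train shape", train.shape)
print("Test shape", test.shape)
print("Malignant share in test:", test["diagnosis"].mean())
```

A repeatable split makes it easier to tell whether a change in accuracy comes from a change to the model or just from a different shuffle.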

From here, we tell our model to train, and test itself to see how well it did.

model.fit(train_x, train_y)

prediction = model.predict(test_x)
print ("Accuracy =", metrics.accuracy_score(test_y, prediction)) # true labels first, then predictions

When we run the model, we get 88% accuracy - not a bad start, but we can improve this. There are two key ways to do so: providing more training data, and adding more features to the model. Let's start with more training data by changing test_size from 0.15 to 0.1 in train, test = train_test_split(df, test_size = 0.15).

This improves our accuracy to 91%. We can still do better. Let's add a feature to the prediction variables, for example texture_mean: prediction_var = ['perimeter_mean', 'compactness_mean', 'concavity_mean', 'texture_mean'], and run the model again. When we run it a few times (the accuracy changes each time because both the split and the forest are random), we can get up to 95% accuracy, which suggests these features predict malignancy well on this dataset.
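Since the accuracy moves around between runs, a steadier estimate can be had from cross-validation, which trains and tests on several different splits and averages the results. This is a sketch rather than part of the original workflow. It assumes the copy of the dataset bundled with scikit-learn, where the columns are named slightly differently ("mean perimeter" rather than perimeter_mean) and malignant is encoded as 0 rather than 1. It also uses 100 trees instead of 300 to keep it quick.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# scikit-learn's bundled copy of the dataset, loaded as pandas objects
data = load_breast_cancer(as_frame=True)
features = ["mean perimeter", "mean compactness", "mean concavity", "mean texture"]
X = data.data[features]
y = data.target  # note: this copy encodes malignant as 0 and benign as 1

model = RandomForestClassifier(n_estimators=100, random_state=0)

# Five different train/test splits, one accuracy score per split
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("Fold accuracies:", scores.round(3))
print("Mean accuracy:", scores.mean().round(3))
```

The spread between the fold accuracies gives a feel for how much of a change from 91% to 95% could be down to chance.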


Summary

There are a few key learning points from this tutorial.

- Before training our model, it's best to visualise the data, and observe patterns/correlations to enable us to use the right parameters for training.

- Sometimes adding further variables, even ones picked fairly arbitrarily, can improve accuracy for reasons which are not well understood. (Thought to be interaction between variables during training)

- To find the right model for our data, we should consider the types of data we are working with, and experiment to find the best model type.