Deep Learning and Diseases

Namaste! Today will be all about infectious diseases! Well… simulated and non-deadly ones! There is a nice article on Wikipedia about compartmental models in epidemiology, and I found it intriguing. Epidemiology is a mix of medicine, biology, demography, and the environmental and social sciences. People in that field are very interested in how infectious diseases spread. This caught my attention! Time for Deep Learning and diseases!

But, wait… Why am I writing about diseases? Simple as that: Deep Learning can be applied to epidemiology. I stumbled on that idea while searching for data-sets. You know… you can only do Krishna’s good Deep Learning work if you have enough data. And since I have been toying with the idea of using simulated data for Deep Learning, I ended up simulating various diseases today.

How did we get here?

I found the SIR model while doing some random reading on the train. SIR stands for „susceptible“, „infectious“, and „recovered“, which are the three states an individual can be in. I really like the model. It describes that individuals are either at risk of getting ill, currently ill and infecting other people, or immune. And usually you go through those states in exactly that order.

And the really great thing is that you can simulate the SIR-model with different parameters using a cellular automaton. How? A population is represented as a rectangular grid. Each cell in that grid is an individual. Each individual is in one of the SIR-states. If you add a couple of parameters, you are ready to do a full disease-spreading simulation!

There are not so many parameters:

  • How many people are infected at the beginning?
  • How long will an individual be infectious after infection?
  • How long will an individual be immune to a re-infection after being infected?
  • How high is the probability that the disease is transmitted when two individuals have contact?
  • How many contacts does a single individual have on average?
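
To make the mechanics a bit more concrete, here is a minimal, hypothetical sketch of what one update step of such a cellular automaton could look like. This is my own illustration, not the simulator from this article; the function and parameter names are made up, and whether contacts happen with random individuals or only with grid neighbours is a modelling choice the real simulator may handle differently.

import numpy as np

# A hypothetical sketch of one update step of a SIR cellular automaton.
# States: 0 = susceptible, 1 = infectious, 2 = recovered.
def step(grid, timers, time_infection, time_recover,
         transmission_probability, average_contacts, rng):
    height, width = grid.shape
    new_grid = grid.copy()
    new_timers = timers.copy()

    # Every infectious individual meets a few other individuals and may
    # transmit the disease to the susceptible ones.
    for y, x in np.argwhere(grid == 1):
        for _ in range(average_contacts):
            cy, cx = rng.integers(0, height), rng.integers(0, width)
            if grid[cy, cx] == 0 and rng.random() < transmission_probability:
                new_grid[cy, cx] = 1
                new_timers[cy, cx] = time_infection

    # Count down infection and immunity timers.
    new_timers[grid != 0] -= 1

    # Infectious individuals whose timer ran out become immune...
    ran_out = (grid == 1) & (new_timers <= 0)
    new_grid[ran_out] = 2
    new_timers[ran_out] = time_recover

    # ...and immune individuals whose timer ran out become susceptible again.
    new_grid[(grid == 2) & (new_timers <= 0)] = 0

    return new_grid, new_timers

# Usage would be something like:
# rng = np.random.default_rng()
# grid, timers = step(grid, timers, 2, 4, 0.4, 2, rng)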

Now you are definitely curious what this looks like. I have simulated two diseases with different parameters. I still owe them proper names, but I have not found the time yet…

My very own infectious diseases! Not deadly, though…

The first disease is an annoying one. It is always around. You catch it regularly. Immunity does not last very long. And someone is always ill. Thanks to all gods new and old, the disease is not deadly. Here is the simulation:

And here are the statistics:

The second disease is a bummer. It quickly infects the whole population. But the population becomes immune in no time and the whole pathogen goes extinct. Here is the simulation:

And here are the statistics:

You can download the simulator if you like.

I have decided to put the simulator online. Here is the file on GitHub. Let me quickly show you how you can simulate a disease with some parameters:

# Create a simulator: a 50x50 grid, simulated for 100 time-steps.
sir_simulator = SIRSimulator(50, 50, 100)

# Disease parameters.
initially_infected = 5
time_infection = 2
time_recover = 4
transmission_probability = 0.4
average_contacts = 2

# Run the simulation.
states = sir_simulator.simulate(initially_infected, time_infection, time_recover, transmission_probability, average_contacts)

# Render the cellular automaton as an animated GIF and the SIR-diagram as a PNG.
render_states(states, "output-1.gif")
counts = states_to_counts(states)
render_counts("output-1.png", counts)

This code runs a simulation. The population is 2500, which is 50 times 50. The simulation runs for 100 steps. At the beginning 5 people are infected. The illness lasts for two time-steps, and immunity for four. The probability for transmission is 0.4, which means that out of 10 handshakes, 4 will result in an infection on average. And on average you have two handshakes per day. After that, the code renders the states of the cellular automaton into an animated GIF, and the SIR-diagram into a PNG, as you have seen at the beginning of this article.
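
As a quick back-of-the-envelope check (my own addition, not part of the simulator), you can estimate how many people a single infectious individual infects before recovering by multiplying contacts per step, transmission probability, and the number of infectious steps. This ignores the grid structure and the shrinking number of susceptibles, but values above 1 roughly mean the disease can keep spreading:

# Rough estimate of secondary infections per infectious individual.
# This ignores the grid structure and the depletion of susceptibles.
average_contacts = 2
transmission_probability = 0.4
time_infection = 2

expected_secondary_infections = average_contacts * transmission_probability * time_infection
print(expected_secondary_infections)  # 1.6, so the disease can sustain itself for a while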

Why is this so relevant for my work? You can use the simulator to go through a lot of different parameter combinations and run a lot of different simulations. Nice to have. But why exactly? What can you do with that? Create a huge dataset for Deep Learning!

Creating a dataset of different infectious diseases for Deep Learning. Now I have lost it completely…

As a reminder. In Deep Learning you usually assume the existence of a big data-set. And you usually split that data-set into three subsets:

  • Training set: This is the data that the Neural Network is actually trained on.
  • Validation set: This is for validating the Neural Network after each training-epoch. You do this in order to find out whether the trained Neural Network is good at generalizing its experiences.
  • Testing set: Ideally you would use this only once. That is, after your training has been successful.

My SIR-simulator can be used to do exactly that: create a data-set by running lots and lots of simulations with different parameters and split the result into train-validate-test. How? Like this:

# Create a simulator: a 20x20 grid, simulated for 50 time-steps.
sir_simulator = SIRSimulator(20, 20, 50)

size = 10000
dataset_type = "counts"

# Generate 10000 input-output pairs by running simulations with different
# combinations of these parameter values, split 7:1:2 into train, validate, and test.
dataset = sir_simulator.generate_dataset(
    size = size,
    split_ratio = "7:1:2",
    dataset_type = dataset_type,
    initially_infected = [5, 10, 20, 40],
    time_infection = [1, 2, 4, 8],
    time_recover = [2, 4, 8, 16],
    transmission_probability = [0.1, 0.2, 0.4, 0.8],
    average_contacts = [2, 4, 8, 16]
    )

# Save the data-set to disk.
dataset_path = "dataset-{}-{}.p".format(dataset_type, size)
save_dataset(dataset, dataset_path)

Most of the parameters you have already learned about earlier in this article. The size of the data-set is as interesting as it is simple. It denotes how many input-output pairs will be in the data-set. In our case, 10000. The data-set-type can be either „counts“, which is exactly the diagrams that you saw earlier, or „states“, which is the states of the underlying cellular automaton. And finally, the split-ratio is relevant. „7:1:2“ basically translates to 7 parts training, 1 part validation, and 2 parts testing, so for 10000 samples that means 7000 for training, 1000 for validation, and 2000 for testing.
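
For intuition, here is a minimal sketch of how such a ratio-based split could work. This is my own illustration, not the simulator’s internal code:

# My own sketch of a ratio-based split, not the simulator's internal code.
def split_by_ratio(inputs, outputs, ratios=(7, 1, 2)):
    total = sum(ratios)
    n = len(inputs)
    n_train = n * ratios[0] // total
    n_validate = n * ratios[1] // total
    train = (inputs[:n_train], outputs[:n_train])
    validate = (inputs[n_train:n_train + n_validate], outputs[n_train:n_train + n_validate])
    test = (inputs[n_train + n_validate:], outputs[n_train + n_validate:])
    return train, validate, test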

But wait! What does the data-set actually look like?

It is about time to lift the curtain of secrecy and have a look at what the actual data looks like. And the best way to do that is code:

dataset_path = "dataset-counts-10.p"
(train_input, train_output), _, _ = load_dataset(dataset_path)
print("train_input:", train_input[0][:10])
print("train_output:", train_output[0])

And this is the output:

Loading dataset from  dataset-counts-10.p ...
train_input: [[380  20   0]
 [ 50 350   0]
 [  0 282 118]
 [  0  44 356]
 [  0   0 400]
 [118   0 282]
 [356   0  44]
 [400   0   0]
 [400   0   0]
 [400   0   0]]
train_output: [20.   1.   2.   0.2 16. ]

This basically loads the data-set, focuses on the first input-output-pair and prints it. The input is a sequence of SIR-values, showing how the disease progresses over time in the population. For the sake of readability, we print only the first ten elements of that progression. The rest is similar. The output is also interesting. What do the values mean? They are the disease-parameters in this order: initially_infected, time_infection, time_recover, transmission_probability, average_contacts. Thus the data-set maps progressions of diseases to their parameters. What can you do with that? Exactly! Predict those parameters from any given SIR-progression. This is a plague-analyst!
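
In other words, this is a plain supervised regression problem: sequences in, parameters out. Just to illustrate the array shapes (my own addition, assuming the 20x20 grid, 50-time-step „counts“ data-set of size 10000 generated above):

# My own illustration of the shapes involved, assuming the 20x20 grid,
# 50 time-step "counts" data-set of size 10000 generated above.
(train_input, train_output), _, _ = load_dataset("dataset-counts-10000.p")
print(train_input.shape)   # roughly (7000, 50, 3): samples x time-steps x (S, I, R)
print(train_output.shape)  # roughly (7000, 5): samples x the five disease parameters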

Training a Neural Network (or several) on my tiny little pathogens.

Well! Let us just quickly summarize where we are right now. Based on a simple approach with cellular automata, we can create a data-set of arbitrary size by running simulations with different combinations of parameters. Why do we do that? We want to predict those parameters with Neural Networks. We will do that in less than 150 lines of code! And to spice things up a little, we will train not one but five Neural Networks! It is always good to compare.

You will find the whole source-code here on GitHub.

Let’s examine the code! Right at the beginning we do the usual thing: importing everything that is necessary:

import sir_dataset
import keras
from keras import models
from keras import layers
from keras import optimizers
import numpy as np
import matplotlib.pyplot as plt

This includes our dataset, and of course Keras, NumPy and matplotlib. This is all we need on our quest. Next we will consider our training parameters:

# Parameters.
epochs = 50
batch_size = 128
model_types = ["dense", "lstm", "deeplstm", "gru", "deepgru"]

The number of epochs denotes how many times the Neural Network will see the whole training set. The batch-size reflects how many samples are processed in parallel during training. And the model-types-list contains identifiers of all the different models we are about to train. So let’s do exactly that:

def train_neural_network():
    """ Trains and evaluates a couple of neural networks. """

    # Load the dataset.
    dataset = sir_dataset.load_dataset("dataset-counts-10000.p")
    sir_dataset.print_dataset_statistics(dataset)
    (train_input, train_output), (validate_input, validate_output), (test_input, test_output) = dataset

    # Normalize all input data.
    population_size = np.sum(train_input[0,0])
    train_input = train_input / population_size
    validate_input = validate_input / population_size
    test_input = test_input / population_size

    # Normalize all output data.
    minimum = np.amin(train_output, axis=0)
    maximum = np.amax(train_output, axis=0)
    difference = maximum - minimum
    train_output = (train_output - minimum) / difference
    validate_output = (validate_output - minimum) / difference
    test_output = (test_output - minimum) / difference

    # Train the different models.
    for model_type in model_types:

        # Create the model.
        model = create_model(model_type, train_input.shape, train_output.shape)

        # Train the model.
        history = model.fit(
            train_input, train_output,
            validation_data=(validate_input, validate_output),
            epochs=epochs,
            batch_size=batch_size
        )

        # Plot the history.
        plot_history(history, model_type)

        # Evaluate the model against the test-data.
        evaluation_result = model.evaluate(test_input, test_output)
        print(evaluation_result)

Loading the data-set is the easiest part. After that we have to normalize both the input- and the output-data for train, validate and test. Of course we only use numbers derived from the train-data, since we do not want to leak any information from validate and test into our model.

Normalizing the input-data is straightforward. The data consists of integers between zero and the size of the population. Neural Networks do not really like such big numbers. So we divide by the population size, which maps the values to floating-point numbers between 0.0 and 1.0.

With the output data it is similar. Here, we only know that each parameter lies in some range. We determine that range, that is, the minimum and the maximum per parameter, and use both to map all values into the interval between 0.0 and 1.0.

After that training begins. By the book… Creating a model, calling the fit()-method, plotting the history in order to find out how good the Neural Network is, and then evaluating it with the test set.
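
One thing the code above does not do is turn predictions back into actual disease parameters. Since the outputs were min-max-normalized, you would have to invert that step. Here is a small sketch of what that could look like (my own addition, not part of the repository):

# My own sketch, not part of the repository: turn normalized predictions
# back into disease parameters by inverting the min-max normalization.
predictions = model.predict(test_input)
predicted_parameters = predictions * difference + minimum
print(predicted_parameters[0])
# Order: initially_infected, time_infection, time_recover,
#        transmission_probability, average_contacts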

We are brave and courageous. We use different models.

My character has an inclination towards research and experimentation. I love doing different experiments and comparing their outcomes. That is why I usually tend to allow for different Neural Network model architectures in all of my projects. It is usually a parameter. Here is the method that creates models:

def create_model(model_type, input_shape, output_shape):
    """ Creates a model of a given type. """

    model = models.Sequential()

    if model_type == "dense":
        model.add(layers.Flatten(input_shape=(input_shape[1], input_shape[2])))
        model.add(layers.Dense(300, activation="relu"))
        model.add(layers.Dense(output_shape[1], activation="sigmoid"))
    elif model_type == "lstm":
        model.add(layers.LSTM(30, input_shape=(input_shape[1], input_shape[2])))
        model.add(layers.Dense(output_shape[1], activation="sigmoid"))
    elif model_type == "deeplstm":
        model.add(layers.LSTM(30, input_shape=(input_shape[1], input_shape[2]), return_sequences=True))
        model.add(layers.LSTM(20, return_sequences=True))
        model.add(layers.LSTM(10))
        model.add(layers.Dense(output_shape[1], activation="sigmoid"))
    elif model_type == "gru":
        model.add(layers.GRU(10, input_shape=(input_shape[1], input_shape[2])))
        model.add(layers.Dense(output_shape[1], activation="sigmoid"))
    elif model_type == "deepgru":
        model.add(layers.GRU(30, input_shape=(input_shape[1], input_shape[2]), return_sequences=True))
        model.add(layers.GRU(20, return_sequences=True))
        model.add(layers.GRU(10))
        model.add(layers.Dense(output_shape[1], activation="sigmoid"))
    else:
        raise Exception("Unknown model type:", model_type)

    model.summary()

    model.compile(
        loss="mse",
        optimizer=optimizers.RMSprop(lr=0.01),
        metrics=["accuracy"]
    )

    return model

I guess you are more than curious what the model-type parameter means. These types are available:

  • „dense“: This is a simple, fully-connected Neural Network with a couple of layers. Going fully connected is usually a good start.
  • „lstm“: We are clearly working with sequential data, that is, data that is ordered in time. Here it usually makes sense to use a Neural Network that has a memory, and LSTMs are good for data that has a time-axis.
  • „deeplstm“: Same as LSTM, but deeper. That is, with more layers.
  • „gru“: GRUs are similar to LSTMs, but a little simpler.
  • „deepgru“: GRU with more layers.

Always plot the history. You want to know how well training went, right?

You will often run into methods that look similar to this:

def plot_history(history, prefix):
    """ Plots the history of a training. """

    # Render the accuracy.
    plt.plot(history.history['acc'])
    plt.plot(history.history['val_acc'])
    plt.title('model accuracy')
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    plt.legend(['train', 'validation'], loc='upper left')
    plt.savefig(prefix + "history_accuracy.png")
    plt.clf()

    # Render the loss.
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.title('model loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(['train', 'validation'], loc='upper left')
    plt.savefig(prefix + "history_loss.png")
    plt.clf()

It is advised to always inspect the training-history after training. Why? You are interested in how good your network is. And how do you learn that? By inspecting both loss and accuracy for training and validation. Training shows how good the network is at learning the data, and validation shows how good the same network is at generalizing the learned knowledge. The accuracy tells you how often the network gets it right, whereas the loss tells you how far off its predictions still are. If the validation curves stop improving while the training curves keep improving, the network is overfitting.
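
If you prefer numbers over pictures, you can also quantify the overfitting by comparing the final training and validation accuracies from the history object. A small helper sketch (my own addition, not in the repository):

# My own small helper, not in the repository: print the gap between
# training and validation accuracy as a quick overfitting indicator.
def print_accuracy_gap(history):
    final_train_accuracy = history.history["acc"][-1]
    final_validation_accuracy = history.history["val_acc"][-1]
    gap = final_train_accuracy - final_validation_accuracy
    print("train: {:.2f}, validation: {:.2f}, gap: {:.2f}".format(
        final_train_accuracy, final_validation_accuracy, gap))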

Let us look at some pictures. This is how a dense network trains:

This is how a deepgru network trains:

As you can clearly see, both models do some overfitting. I have seen worse. What you also see is that the dense network reaches an accuracy of around 70%, whereas the deep GRU network reaches an accuracy of around 80%. This is a good start!

Summary. Let us clean our table.

Whoa! That was quite something! We learned that you can simulate the progression of diseases with the SIR-model. With a handful of parameters you can design a lot of non-deadly pathogens. You can also use these simulations in order to create a data-set for Neural Network training. And you will end up with a Neural Network that you can use to analyze diseases. Nice, isn’t it?

The full project can be found on GitHub. Feel free to toy with it and run your own experiments with non-deadly diseases!

Stay in touch.

I hope you liked the article. Why not stay in touch? You will find me at LinkedIn, XING and Facebook. Please add me if you like and feel free to like, comment and share my humble contributions to the world of AI. Thank you!

If you want to become a part of my mission of spreading Artificial Intelligence globally, feel free to become one of my Patrons. Become a Patron!

A quick about me. I am a computer scientist with a love for art, music and yoga. I am an Artificial Intelligence expert with a focus on Deep Learning. As a freelancer I offer training, mentoring and prototyping. If you are interested in working with me, let me know. My email-address is tristan@ai-guru.de - I am looking forward to talking to you!

