16 February 2018, 19 min to read
Deep Learning and Weird Fiction
Category : Artificial Intelligence, Books, Deep Learning, Natural Language Processing, Tutorial
Namaste! I am quite sure that the title of this article might confuse you. It is a reference to both our shared passion and my own personal one. Truth be told, I am an avid fan of the works of H. P. Lovecraft. Lovecraft was a gentleman-writer from New England. He wrote quite a lot of weird-fiction short stories, and he influenced many artists, Stephen King and Metallica among them. So, let us dive headfirst into Deep Learning and Weird Fiction!
Deep Learning and Weird Fiction is about Artificial Intelligence, Natural Language Processing, and Art.
I recently wrote about a very special Neural Network of mine: the Hindi Text Generator. It was trained on a huge corpus of classical Indian literature. That approach worked on the character level, meaning that it generated texts character by character. This left me curious about more…
Generating literature on the word level seemed very appealing. So I gave it a try. You will find the code for your experiments in my GitHub repository. It is only 300 lines of code.
Here is some generated text to get you excited:
First! It was the stone, who have never thought the way came back to the old man, though I went to it. I was glad to get Lake as I can get back to their mind. I knew not the others of this place, but I must not be be on him, but was glad to get me about the thing. I was not the same with the slain which the building. He was killed, and the shelves of flowers was found in the sky by the sky. It was as if the whole of the shore, and that they were in his house by the door that had been sealed known before the whole scene. This was a burst of such a little stirring, who came upon the of the visit, and had seen Mercy of the great corner the entrance. In the whole side in the mountains, the notes had a choking and laid; and it was now in his inside that he could tell them the way to go. The tops, the of the galley I saw the events into the next day – January, after our our own head, and was to many his. My first claim, however, I kept a most by the night. The door, too, was to be, and as he sank away from the black room beyond that which had been from the same building. It was a typical wooden with one of the city, and the Newport in the little city, and the flickering desert of damp and had been very little and whose tales of whose upper in the dark. There were in a very corridor and to resume his teaching. But the old man was laid out of me to a shapeless even even even now and then that it slipped to reflect and space. It was not too much to him, but now he was greater and in a corner with the. The Book of beings came to scatter, is by a means of profound and terror from every sort of provocative still outside. Though is what I shall wish out of old people – especially when I did not know why. As it was, I added, was wrong; for in your time I was sure that the thing was not in them.
A sketch of our process.
Lovecraft’s works are famous for merely hinting at things. He rarely described anything in detail and left everything to the readers’ imagination. I will not do this. No secrets here! A clear description of what we are about to do!
We keep in mind what we want to do: generating new Lovecraft texts with Deep Learning. We will do this word by word. And we will start with a seed sequence, a random piece of text from the corpus. Our Neural Network will then predict the next word that fits best right after that sequence. Exactly like this:
mad Arab Abdul Alhazred. -> I
Arab Abdul Alhazred. I -> was
Abdul Alhazred. I was -> rather
Alhazred. I was rather -> sorry
. I was rather sorry -> ,
I was rather sorry, -> later
was rather sorry, later -> on
rather sorry, later on -> ,
sorry, later on, -> that
, later on, that -> I
later on, that I -> had
on, that I had -> ever
, that I had ever -> looked
that I had ever looked -> into
I had ever looked into -> that
had ever looked into that -> monstrous
ever looked into that monstrous -> book
looked into that monstrous book -> at
into that monstrous book at -> the
that monstrous book at the -> college
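The sliding window above can be sketched in a few lines of plain Python. This is only an illustration: `toy_predict` and its lookup table are made up and stand in for the trained network, which comes later in the article.

```python
def generate(seed_tokens, predict_next, length):
    """Slide a fixed-size window over the text while appending predictions."""
    window = list(seed_tokens)
    generated = list(seed_tokens)
    while len(generated) < length:
        next_token = predict_next(window)
        generated.append(next_token)
        # Drop the oldest token and append the new one, so the window size stays fixed.
        window = window[1:] + [next_token]
    return generated


# Hypothetical stand-in for the Neural Network: it only looks at the last token
# of the window and consults a tiny hand-written table.
toy_table = {"Alhazred.": "I", "I": "was", "was": "rather", "rather": "sorry"}


def toy_predict(window):
    return toy_table.get(window[-1], ",")


seed = ["mad", "Arab", "Abdul", "Alhazred."]
result = generate(seed, toy_predict, length=8)
print(" ".join(result))  # mad Arab Abdul Alhazred. I was rather sorry
```

The only thing the loop needs from the model is a function that maps a window of tokens to one next token.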
As you can see, this can be used to generate quite long texts! One and the same procedure can be applied over and over again. But wait… What kind of Neural Network would we use here? And how will we encode the data?
The Neural Network model is quite straightforward. Its first layer will be an Embedding layer. Word embeddings are a fine method for encoding words: each word becomes a dense vector that carries semantics. The next layer will be an LSTM. We need this since our prediction task clearly requires some kind of short-term memory. And finally, a fully connected layer with a softmax will yield one probability per vocabulary word, a vector shaped like the one-hot encoding of the predicted word.
A few more details about how this is going to work. We will take a text:
'about the infinite cosmic spaces'
We will split the text into its tokens:
['about', 'the', 'infinite', 'cosmic', 'spaces']
The tokens will be mapped to their indices with respect to the vocabulary. Index 0 is the first word in the vocabulary. Index 10 is the 11th and so on. Like this:
[61, 0, 1337, 673, 1829]
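Here is a minimal sketch of that mapping. The vocabulary below is a made-up toy list, so the indices differ from the ones shown above, and `str.split` stands in for NLTK's tokenizer.

```python
text = "about the infinite cosmic spaces"

# Stand-in for nltk.word_tokenize; good enough for a sentence without punctuation.
tokens = text.split()

# Toy vocabulary (an assumption for this example), most frequent word first.
vocabulary = ["the", "of", "about", "cosmic", "infinite", "spaces"]
word_to_index = {word: index for index, word in enumerate(vocabulary)}

indices = [word_to_index[token] for token in tokens]
print(indices)  # [2, 0, 4, 3, 5]

# Going back from indices to tokens is a plain list lookup.
decoded = [vocabulary[index] for index in indices]
print(decoded)  # ['about', 'the', 'infinite', 'cosmic', 'spaces']
```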
This sequence will then be fed into our brave Neural Network. This will yield a prediction, that is, a very long sequence of floating-point numbers. I will not show such a prediction here. It has ten thousand numbers, one for each word in the vocabulary.
From this prediction, yet another index will be computed.
That index is, of course, the representation of the predicted word.
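To make the step from prediction to word concrete, here is a small sketch with a six-word toy vocabulary and invented probabilities. The real prediction has one entry per vocabulary word, but the lookup works the same way.

```python
# Toy vocabulary and made-up probabilities (assumptions for this example).
vocabulary = ["the", "of", "about", "cosmic", "infinite", "spaces"]
prediction = [0.05, 0.10, 0.05, 0.10, 0.10, 0.60]

# The index of the largest probability is the predicted index.
predicted_index = max(range(len(prediction)), key=lambda i: prediction[i])

# Looking that index up in the vocabulary gives the predicted word.
predicted_word = vocabulary[predicted_index]
print(predicted_index, predicted_word)  # 5 spaces
```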
This picture summarizes the whole process:
That is quite easy, isn’t it? Next thing… Going deeply into the code!
Here comes the code.
Now, it is getting exciting! It is code-time. We will start with the imports. All you really need:
import html
import os
import pickle
import random
import urllib.request
from collections import Counter

import matplotlib.pyplot as plt
import nltk
import numpy as np
import progressbar
from keras import layers, models, utils
from nltk import word_tokenize

# Resources needed by the NLTK tokenizer and detokenizer.
nltk.download('punkt')
nltk.download('perluniprops')
I ran into some trouble. NLTK has a nice detokenizer, an algorithm that turns tokens into nice-looking strings. Unfortunately, it only worked on my machine and not on Google’s Colaboratory. Here is the code that deals with that:
# This detokenizer is nice, but it is not available everywhere.
try:
    from nltk.tokenize.moses import MosesDetokenizer
    detokenizer = MosesDetokenizer()
    use_moses_detokenizer = True
except Exception:
    # Fall back to plain joining if the import or the setup fails.
    use_moses_detokenizer = False
Very simple: If the Moses detokenizer is not available, don’t use it.
Next are our parameters. All four phases (downloading the corpus, preprocessing, training, and generating) are customizable. This is the combination of parameters that worked blasphemously well:
# Corpus parameters.
download_anyway = False
corpus_url = "https://archive.org/stream/TheCollectedWorksOfH.p.Lovecraft/The-Collected-Works-of-HP-Lovecraft_djvu.txt"
corpus_path = "lovecraft.txt"

# Preprocessing parameters.
preprocess_anyway = False
preprocessed_corpus_path = "lovecraft_preprocessed.p"
most_common_words_number = 10000

# Training parameters.
train_anyway = False
model_path = "model.h5"
dataset_size = 50000
sequence_length = 30
epochs = 10
batch_size = 128
hidden_size = 1000

# Generation parameters.
generated_sequence_length = 500
After that, execute all four phases:
def main():
    """ The main-method. Where the fun begins. """

    download_corpus_if_necessary()
    preprocess_corpus_if_necessary()
    train_neural_network()
    generate_texts()
So far, so good. It is time to consider each phase on its own.
Digging deeper. Downloading and cleaning the corpus.
def download_corpus_if_necessary():
    """ Downloads the corpus either if it is not on the hard-drive or if the download is forced. """

    if not os.path.exists(corpus_path) or download_anyway:
        print("Downloading corpus...")

        # Downloading content.
        corpus_string = urllib.request.urlopen(corpus_url).read().decode('utf-8')

        # Removing HTML-stuff: keep only what is between <pre> and </pre>.
        index = corpus_string.index("<pre>")
        corpus_string = corpus_string[index + 5:]
        index = corpus_string.find("</pre>")
        corpus_string = corpus_string[:index]
        corpus_string = html.unescape(corpus_string)

        # Write to file.
        with open(corpus_path, "w", encoding="utf-8") as corpus_file:
            corpus_file.write(corpus_string)
        print("Corpus downloaded to", corpus_path)
    else:
        print("Corpus already downloaded.")
This method downloads the corpus. It comes as an HTML file, complete with tags and escape sequences. We do not want any HTML clutter in the training data. That is why we remove all the HTML-stuff. With the aid of everything-Python, we store a clean corpus in a file in no time. Done!
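If you want to see the cleaning step in isolation, here is a tiny sketch on a made-up HTML page (the `page` string is invented for this example, not taken from the real corpus). It uses the same idea: cut out what sits between `<pre>` and `</pre>`, then unescape the HTML entities.

```python
import html

# A made-up miniature page, standing in for the downloaded HTML file.
page = (
    "<html><body><h1>Title</h1>"
    "<pre>The Old Ones &amp; the &quot;mad Arab&quot;</pre>"
    "</body></html>"
)

# Keep only the text between the <pre> tags.
start = page.index("<pre>") + len("<pre>")
end = page.find("</pre>")

# Turn escape sequences such as &amp; back into real characters.
corpus = html.unescape(page[start:end])
print(corpus)  # The Old Ones & the "mad Arab"
```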
After downloading the corpus: Preprocessing. Because we need something to train on.
With our shiny corpus at hand and on our hard-drives, we can do the preprocessing. The whole data will be transformed into two things. A vocabulary – a list of tokens. And a huge sequence of indices – the corpus encoded with respect to the vocabulary. I have already explained this in the intro. So, let’s do this:
def preprocess_corpus_if_necessary():
    """ Preprocesses the corpus either if it has not been done before or if it is forced. """

    if not os.path.exists(preprocessed_corpus_path) or preprocess_anyway:
        print("Preprocessing corpus...")

        # Opening the file.
        with open(corpus_path, "r", encoding="utf-8") as corpus_file:
            corpus_string = corpus_file.read()

        # Getting the vocabulary.
        print("Tokenizing...")
        corpus_tokens = word_tokenize(corpus_string)
        print("Number of tokens:", len(corpus_tokens))
        print("Building vocabulary...")
        word_counter = Counter()
        word_counter.update(corpus_tokens)
        print("Length of vocabulary before pruning:", len(word_counter))
        vocabulary = [key for key, value in word_counter.most_common(most_common_words_number)]
        print("Length of vocabulary after pruning:", len(vocabulary))

        # Converting to indices.
        print("Index-encoding...")
        indices = encode_sequence(corpus_tokens, vocabulary)
        print("Number of indices:", len(indices))

        # Saving.
        print("Saving file...")
        with open(preprocessed_corpus_path, "wb") as preprocessed_file:
            pickle.dump((indices, vocabulary), preprocessed_file)
    else:
        print("Corpus already preprocessed.")
Piece of cake. Loading the corpus as a string is a no-brainer. Then NLTK’s tokenizer is invoked to do the splitting magic. Python has a very nice Counter class that is hideously excellent at counting occurrences. It is also great at sorting the words by how often they occur. This comes in very handy, because we restrict our vocabulary to a fixed size: exactly 10000 in our case, which is a parameter of course. After that, the whole corpus is turned into a sequence of indices. Let us have a look at the how of the matter:
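The vocabulary pruning in this function leans on `Counter.most_common`. Here is a toy run on a made-up sentence, keeping only the two most frequent tokens instead of 10000:

```python
from collections import Counter

# A made-up, already tokenized sentence.
tokens = "the night , the sea , and the stars".split()

counter = Counter(tokens)

# most_common(n) returns (token, count) pairs, most frequent first.
top_two = counter.most_common(2)
print(top_two)  # [('the', 3), (',', 2)]

# The vocabulary keeps the tokens and throws the counts away.
vocabulary = [word for word, count in top_two]
print(vocabulary)  # ['the', ',']
```

Note that punctuation marks are tokens too, so they compete for a place in the vocabulary like any other word.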
def encode_sequence(sequence, vocabulary):
    """ Encodes a sequence of tokens into a sequence of indices. """

    # A dictionary lookup is much faster than calling vocabulary.index() per token.
    token_to_index = {token: index for index, token in enumerate(vocabulary)}
    return [token_to_index[element] for element in sequence if element in token_to_index]
Each token in the corpus will be mapped to its index in the vocabulary. But only if the token is in the vocabulary. Else it is omitted. I am sure the ol’ gent Lovecraft is very forgiving in that matter. This is science, right?
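To get a feeling for what the omission does, here is a small sketch with a three-word toy vocabulary (invented for this example). The function is a compact variant of the encoding step that builds a dictionary for the lookup.

```python
def encode_sequence(sequence, vocabulary):
    """Map tokens to vocabulary indices, silently dropping unknown tokens."""
    word_to_index = {word: index for index, word in enumerate(vocabulary)}
    return [word_to_index[token] for token in sequence if token in word_to_index]


# Toy data, made up for this example.
vocabulary = ["the", ",", "and"]
tokens = ["the", "cyclopean", "city", ",", "and", "the", "sea"]

indices = encode_sequence(tokens, vocabulary)
print(indices)  # [0, 1, 2, 0]

# Decoding shows what the network would actually get to read.
print(" ".join(vocabulary[i] for i in indices))  # the , and the

# Fraction of tokens that survived the pruning.
kept = len(indices) / len(tokens)
```

With a vocabulary of 10000 words far fewer tokens are lost than in this toy case, but the rare words are gone all the same.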
After preprocessing comes the training. Heating up our artificial brain. We let it read Lovecraft a lot.
The corpus is now encoded. That was easy, right? The next thing is to use the corpus in order to generate a data-set from it. We need this for training. I have come to like the idea of making the size of the data-set a parameter in my projects. This allows us to experiment with different sizes easily. Training the Neural Network is 40 lines of code. If it scares you, don’t look:
def train_neural_network():
    """ Trains the neural network either if it has not been done before or if it is forced. """

    if not os.path.exists(model_path) or train_anyway:

        # Loading index-encoded corpus and vocabulary.
        with open(preprocessed_corpus_path, "rb") as preprocessed_file:
            indices, vocabulary = pickle.load(preprocessed_file)

        # Get the data-set.
        print("Getting the data-set...")
        data_input, data_output = get_dataset(indices)
        data_output = utils.to_categorical(data_output, num_classes=len(vocabulary))

        # Creating the model.
        print("Creating model...")
        model = models.Sequential()
        model.add(layers.Embedding(len(vocabulary), hidden_size, input_length=sequence_length))
        model.add(layers.LSTM(hidden_size))
        model.add(layers.Dense(len(vocabulary)))
        model.add(layers.Activation('softmax'))
        model.summary()

        # Compiling the model.
        print("Compiling model...")
        model.compile(
            loss='categorical_crossentropy',
            optimizer='adam',
            metrics=['categorical_accuracy']
        )

        # Training the model.
        print("Training model...")
        history = model.fit(
            data_input, data_output,
            epochs=epochs, batch_size=batch_size)
        model.save(model_path)
        plot_history(history)
The whole thing starts with loading the indices and the vocabulary. The get_dataset method then extracts a data-set from the corpus. We will end up with a huge list of input-output pairs. Remember that each input is a sequence of token indices, and each output is the index of the token that follows that input sequence. Of course, we have to map the output to its one-hot representation, hence to_categorical. This will make the Neural Network work like a charm.
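In case one-hot encoding is new to you, here is a hand-rolled sketch of the idea. The helper `one_hot` is hypothetical and only for illustration. In the article's code, Keras does this job for the whole array at once.

```python
def one_hot(index, num_classes):
    """Return a vector of zeros with a single one at the given index."""
    vector = [0.0] * num_classes
    vector[index] = 1.0
    return vector


# Two target indices in a toy vocabulary of four words.
targets = [2, 0]
encoded = [one_hot(index, num_classes=4) for index in targets]
print(encoded)  # [[0.0, 0.0, 1.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
```

Keep the size in mind: with the parameters above, 50000 samples times 10000 vocabulary entries make 500 million numbers for the outputs alone.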
What does the Neural Network look like? As promised, it is a simple sequential architecture. It begins with an Embedding layer, followed by an LSTM layer, and ends with a Dense layer. Doing word-mathemagic, learning sequences, and then guessing the next word. Easy as that!
In essence, we have a multi-class classifier here: each word in the vocabulary is one class. This is why it makes sense to use categorical cross-entropy as our loss function. The Adam optimizer is a solid default choice. And categorical accuracy as a metric is a mere consequence.
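Here is a little worked example of categorical cross-entropy for a single sample, written by hand with made-up numbers. It is a sketch of the formula, not of how Keras implements it internally.

```python
import math


def categorical_crossentropy(one_hot_target, predicted):
    """Negative log-probability that the prediction assigns to the true class."""
    return -sum(
        t * math.log(p)
        for t, p in zip(one_hot_target, predicted)
        if t > 0
    )


# The true next word is class 2 in a toy vocabulary of four words.
target = [0.0, 0.0, 1.0, 0.0]

# Two made-up predictions: one confident and right, one completely undecided.
confident = [0.05, 0.05, 0.85, 0.05]
undecided = [0.25, 0.25, 0.25, 0.25]

loss_confident = categorical_crossentropy(target, confident)  # -log(0.85)
loss_undecided = categorical_crossentropy(target, undecided)  # -log(0.25)
print(loss_confident < loss_undecided)  # True
```

The more probability the network puts on the word that really came next, the smaller the loss. That is all the training tries to achieve.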
Creating the data-set is retrieving random-samples from the corpus.
No secrets. I say it again. It is a good moment to look at the path from the corpus to our data-set. Here is the method for doing so:
def get_dataset(indices):
    """ Gets a full data-set of a defined size from the corpus. """

    print("Generating data set...")
    data_input = []
    data_output = []
    current_size = 0
    bar = progressbar.ProgressBar(max_value=dataset_size)
    while current_size < dataset_size:

        # Randomly retrieve a sequence of tokens and the token right after it.
        random_index = random.randint(0, len(indices) - (sequence_length + 1))
        input_sequence = indices[random_index:random_index + sequence_length]
        output_sequence = indices[random_index + sequence_length]

        # Update arrays.
        data_input.append(input_sequence)
        data_output.append(output_sequence)

        # Next step.
        current_size += 1
        bar.update(current_size)
    bar.finish()

    # Done. Return NumPy-arrays.
    data_input = np.array(data_input)
    data_output = np.array(data_output)
    return (data_input, data_output)
The target number of samples is a parameter. We just do the following… Until this desired number is reached, a random sample will be retrieved and stored in the data-set. Our corpus is a huge list of indices. So we just select sub-lists at random positions for our input-data. And of course, we use the index right after the sub-list as our output. Finally, we just make sure that we have NumPy-arrays. Done!
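The sampling is easy to try on a toy corpus. In this sketch the "corpus" is just a run of consecutive numbers (an assumption that makes the result easy to check), and `sample_pairs` is a stripped-down, hypothetical version of the method above without progress bar and NumPy.

```python
import random


def sample_pairs(indices, sequence_length, count, rng):
    """Draw random (input sequence, next index) pairs from an index-encoded corpus."""
    inputs, outputs = [], []
    for _ in range(count):
        # randint is inclusive on both ends, so the last start still leaves one index for the output.
        start = rng.randint(0, len(indices) - (sequence_length + 1))
        inputs.append(indices[start:start + sequence_length])
        outputs.append(indices[start + sequence_length])
    return inputs, outputs


# Toy corpus: consecutive numbers, so the output is always the last input plus one.
corpus = list(range(100, 120))
rng = random.Random(0)
inputs, outputs = sample_pairs(corpus, sequence_length=3, count=5, rng=rng)
for input_sequence, output_index in zip(inputs, outputs):
    print(input_sequence, "->", output_index)
```

Because the positions are drawn independently, samples may overlap or even repeat. For a corpus far bigger than the data-set, that hardly matters.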
Let us write weird fiction! The Neural Network at work.
Again: Done! We got a corpus. We preprocessed it. We generated a data-set. We trained a Neural Network on it. All in a small amount of code. Let us now generate random texts.
I have explained in the beginning how this works. Let me repeat this. First we get a random sample from the corpus as our seed-sequence. Then we generate the next token. This means, we predict it using our Neural Network. This token will be combined with the sequence to form a new one. We drop the first element to maintain the size. After that, repeat! This is how it works:
def generate_texts():
    """ Generates a couple of random texts. """

    print("Generating texts...")

    # Getting all necessary data. That is the preprocessed corpus and the model.
    with open(preprocessed_corpus_path, "rb") as preprocessed_file:
        indices, vocabulary = pickle.load(preprocessed_file)
    model = models.load_model(model_path)

    # Generate a couple of texts.
    for _ in range(10):

        # Get a random temperature for prediction.
        temperature = random.uniform(0.0, 1.0)
        print("Temperature:", temperature)

        # Get a random sample as seed sequence.
        random_index = random.randint(0, len(indices) - (generated_sequence_length))
        input_sequence = indices[random_index:random_index + sequence_length]

        # Generate the sequence by repeatedly predicting.
        generated_sequence = []
        generated_sequence.extend(input_sequence)
        while len(generated_sequence) < generated_sequence_length:
            prediction = model.predict(np.expand_dims(input_sequence, axis=0))
            predicted_index = get_index_from_prediction(prediction, temperature)
            generated_sequence.append(predicted_index)

            # Slide the window: drop the oldest index, append the new one.
            input_sequence = input_sequence[1:]
            input_sequence.append(predicted_index)

        # Convert the generated sequence to a string.
        text = decode_indices(generated_sequence, vocabulary)
        print(text)
        print("")
This is exactly what I have mentioned. All by the book! Did you notice? Another magic spell is hidden in the code: a random temperature. The temperature is used to add some randomness to our prediction. Just a little… You know, after a sequence of tokens, there might be several tokens that would make sense as a continuation. At temperature zero the most probable token always wins, and the closer the temperature gets to one, the more often a less probable token is picked. This is our magical random-worker:
def get_index_from_prediction(prediction, temperature=0.0):
    """ Gets an index from a prediction. """

    # Zero temperature - use the argmax.
    if temperature == 0.0:
        return np.argmax(prediction)

    # Non-zero temperature - do some random magic.
    else:
        # Flatten, so that a batch of one prediction becomes a plain probability vector.
        prediction = np.asarray(prediction).astype('float64').ravel()
        prediction = np.log(prediction) / temperature
        exp_prediction = np.exp(prediction)
        prediction = exp_prediction / np.sum(exp_prediction)
        probabilities = np.random.multinomial(1, prediction, 1)
        return np.argmax(probabilities)
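What does the temperature do to the probabilities before the dice are rolled? Here is a sketch of just the rescaling part in plain Python, with made-up probabilities. `apply_temperature` is a hypothetical helper name. It mirrors the log, divide, exp, normalize steps from the function above.

```python
import math


def apply_temperature(probabilities, temperature):
    """Rescale a probability distribution: low temperatures sharpen it."""
    scaled = [math.log(p) / temperature for p in probabilities]
    exponentials = [math.exp(s) for s in scaled]
    total = sum(exponentials)
    return [e / total for e in exponentials]


# Made-up prediction over a toy vocabulary of three words.
probabilities = [0.7, 0.2, 0.1]

# Temperature 1.0 leaves the distribution as it is.
unchanged = apply_temperature(probabilities, 1.0)

# Temperature 0.5 squares every probability before normalizing,
# so the favourite gets even more likely.
sharpened = apply_temperature(probabilities, 0.5)
print([round(p, 3) for p in sharpened])  # [0.907, 0.074, 0.019]
```

So a temperature close to zero behaves almost like the argmax, while a temperature of one samples from exactly what the network predicted.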
Nice! There is not much left to explain. I already mentioned encoding. The decoding-part is still missing. Here it is:
def decode_indices(indices, vocabulary):
    """ Decodes a sequence of indices and returns a string. """

    decoded_tokens = [vocabulary[index] for index in indices]
    if use_moses_detokenizer:
        return detokenizer.detokenize(decoded_tokens, return_str=True)
    else:
        return " ".join(decoded_tokens)
It is just the other way around. Turn a sequence of indices into a sequence of their respective tokens. After that, join them properly, either with NLTK’s Moses detokenizer, which generates really beautiful strings, or plainly with spaces, which yields a readable but not so beautiful result.
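If the Moses detokenizer is not available, the plain join leaves spaces in front of punctuation. A possible middle way, sketched here under the assumption that only a handful of punctuation marks matter, is a tiny regular expression. `simple_detokenize` is a hypothetical helper of mine and not part of NLTK.

```python
import re


def simple_detokenize(tokens):
    """Join tokens with spaces, then remove the space before common punctuation."""
    text = " ".join(tokens)
    return re.sub(r"\s+([.,;:!?])", r"\1", text)


tokens = ["I", "was", "rather", "sorry", ",", "later", "on", "."]
print(" ".join(tokens))           # I was rather sorry , later on .
print(simple_detokenize(tokens))  # I was rather sorry, later on.
```

It does not handle quotes, brackets, or contractions, so it is no replacement for a real detokenizer.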
And finally… Just mentioning some standard code. We, of course, render our training results. It is always good to know accuracy and loss:
def plot_history(history):
    """ Plots the history of a training. """

    print(history.history.keys())

    # Render the loss.
    plt.plot(history.history['loss'])
    plt.title('model loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')
    # Only training data is plotted, since there is no validation split.
    plt.legend(['train'], loc='upper left')
    plt.savefig("history_loss.png")
    plt.clf()

    # Render the accuracy.
    plt.plot(history.history['categorical_accuracy'])
    plt.title('model accuracy')
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    plt.legend(['train'], loc='upper left')
    plt.savefig("history_accuracy.png")
    plt.clf()

    plt.show()
That is it!
Let us summarize.
Wow! That was quite something! What have we seen here? We had a look at a Sequence-to-One approach to generating new texts from H. P. Lovecraft’s complete works. And I have to say, I have seen projects that were way more hideous and cyclopean. It is always good to see that you can write such nice programs in such a small amount of code.
To conclude, here is another generated text. Enjoy!
I’m my ever looked at the time of my mind and with my own seizure, but I was at that it must have was vacant. In writing I must have or Marceline had been that you – you – especially what I did not know. There was not much of science down to look, but I did not wake to nearly myself by all the outside which giving looking from the slain which of the natives in the water. And they were on this, staring of, and feet several books I saw the lines of phenomenon utterly it was through that which had indeed had been shown these burrows ! completely like his pictures flights of less normal skyward at a sides, but held a vague human – – that time when a large, and New red, and wholly unrecognizable uncanny. No human could ever led; yet the very light of everything or folklore with scientific; and which the Old Ones at once last in the occult student of mystery which the elder leader of mystery it was found before. But this was sullen, hideously modern, for a organs Curwen drove a great. The nearby setting reef floated lower, the crowd ‚d up open to avoid much at me, and I followed myself as he did scarcely wake to make it. But with only the mighty part of the spot, the evil, and red, and we all the floor and the black black stone. the dingy met the things with the four and alley it seems to appear among the faces was in dark. I had thought I can not; as I did not press know where the sense of those are the body to that frenzied. .. .. . ‚ better ! “ I know, the beings told of the throat I have said that I was forced by nervous interest – so that I – what must do not know it.
Before you leave, I have a question for you. If you have an opinion, please send me an email at tristan(Replace this parenthesis with the @ sign)ai-guru.de. I would love to hear from you.
Usually, when doing a proper Deep Learning exercise, I split my data into training, validating, and testing. In the generative-arts domain I usually don’t. What do you think about that?
Stay in touch.
I hope you liked the article. Why not stay in touch? You will find me at LinkedIn, XING and Facebook. Please add me if you like and feel free to like, comment and share my humble contributions to the world of AI. Thank you! I am looking forward to talking to you!