Weekend-pastime – Denoising some dirty texts

Namaste! This weekend I set aside a couple of hours for a nice pastime. I had a mission: preparing the next release candidate of my Deep Learning library NGDLM. That was my original intention. After quite some refactoring and some stabilizing I got an idea. Why not test NGDLM with a little challenge? So down into the development-dungeon I went… I had a quest… Denoising dirty texts!

Selfish self-promotion: What is NGDLM?

Some of you might already know. In the recent past, I experimented a lot with esoteric Neural Networks: different Autoencoder variants, Generative Adversarial Nets, and Triplet-Loss. My dear Auntie Copy-Pasting showed up every time I applied one of these to a use-case of mine. Boilerplate code. So I decided to move a lot of sources into a library. Reuse at its best!

The overall intention is simple. Keras is excellent for creating deep feed-forward networks. The library is very versatile and heavily customizable. I love that. When it comes to a little bit more complicated Neural Nets, Keras is still fine. But you end up reinventing the wheel after your second Triplet-Loss implementation. And boy, I tell you… those are big wheels to reinvent when it comes to Triplet-Loss and others.

About motivation.

Something private about my motivation. My girlfriend and potential future wife is heavily into OCR. Especially using data of that vintage kind. Books. Scans of old texts. The humanities nowadays are very fond of digital editions. And thus humanities researchers love processing their old and new books with digital tools.

It is fascinating that more and more sciences converge towards IT and Artificial Intelligence. Many fields employ AI today. And more and more fields might follow. An overall embrace of computer-facilitated capabilities is already here.

But coming back to the task at hand. Why not consider this question: How can Deep Learning help with OCR? We will never know, unless we try.

What are denoising Autoencoders?

Autoencoders are the bomb! You can use them for unsupervised learning. For example for finding a nice embedding of your data. Latent-spaces. Think of use-cases such as recommender systems. And you can use some Autoencoder pretraining in order to boost the accuracy of your classifiers. Start unsupervised and then go supervised.

One peculiar use-case for Autoencoders out of many is "denoising". Applicable in all signal processing domains. What is this denoising, you might ask? Have a look at this picture:

The image tells a complete story, doesn’t it? A denoising Autoencoder would consume some input data. In this case a very noisy image of a handwritten digit. And it would remove all the noise. What do you have to do in order to train such a net? All you need is data! Data-driven approaches for the win! Noisy and cleaned data to be precise. Pairs of them to be exact.
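If you only have clean data at hand, such pairs can be produced by corrupting each clean image and keeping the original as the target. Here is a minimal sketch of that idea. The function name `make_noisy_pairs`, the Gaussian noise model and the `noise_std` value are assumptions for illustration. They are not taken from NGDLM or from the experiment in this article.

```python
import numpy as np


def make_noisy_pairs(clean_images, noise_std=0.3, seed=0):
    """Return (noisy, clean) arrays; pixel values are expected in [0, 1]."""
    rng = np.random.default_rng(seed)
    clean = np.asarray(clean_images, dtype=np.float32)
    # Additive Gaussian noise is an assumed noise model, chosen for simplicity.
    noise = rng.normal(0.0, noise_std, size=clean.shape).astype(np.float32)
    # Clip so the noisy input stays a valid image in [0, 1].
    noisy = np.clip(clean + noise, 0.0, 1.0)
    return noisy, clean


# Toy stand-in for real data: 16 random "images" of 28x28 pixels, one channel.
clean_digits = np.random.default_rng(1).random((16, 28, 28, 1))
noisy_digits, target_digits = make_noisy_pairs(clean_digits)
print(noisy_digits.shape, target_digits.shape)
```

The noisy array becomes the input of the net and the clean array becomes its target.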

I wanted more than just denoising digits.

There is an old and rusty challenge on Kaggle: Denoising dirty documents. It is exactly what I had in mind when thinking about throwing NGDLM at some proper use-case. Yes, let me clean up dirty scans of texts. As you can imagine, the occasional coffee-stain on paper can really decrease the readability of a document. A Neural Net that would remove such and similar artifacts would really help.
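Training on scans needs batches of matching dirty and clean patches. The following is a minimal sketch of a generator that cuts such patches at random positions. The name `patch_generator` and its arguments are hypothetical, and whether the generator used in my experiment works exactly like this is not something this sketch claims. It assumes that dirty and clean scans are already loaded as equally-shaped grayscale arrays scaled to [0, 1].

```python
import numpy as np


def patch_generator(dirty_scans, clean_scans, patch_size=64, batch_size=32, seed=0):
    """Yield endless batches of (dirty, clean) patches cut at identical positions."""
    rng = np.random.default_rng(seed)
    while True:
        dirty_batch = np.empty((batch_size, patch_size, patch_size, 1), dtype=np.float32)
        clean_batch = np.empty_like(dirty_batch)
        for i in range(batch_size):
            # Pick a random scan, then a random top-left corner inside it.
            index = rng.integers(len(dirty_scans))
            height, width = dirty_scans[index].shape
            top = rng.integers(0, height - patch_size + 1)
            left = rng.integers(0, width - patch_size + 1)
            rows = slice(top, top + patch_size)
            cols = slice(left, left + patch_size)
            # The same crop is applied to both images so the pair stays aligned.
            dirty_batch[i, :, :, 0] = dirty_scans[index][rows, cols]
            clean_batch[i, :, :, 0] = clean_scans[index][rows, cols]
        yield dirty_batch, clean_batch


# Toy stand-in for loaded scans: two random "pages" and altered copies of them.
pages_clean = [np.random.default_rng(s).random((258, 540)) for s in (1, 2)]
pages_dirty = [np.clip(page * 0.8 + 0.1, 0.0, 1.0) for page in pages_clean]
dirty, clean = next(patch_generator(pages_dirty, pages_clean, batch_size=4))
print(dirty.shape, clean.shape)
```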

Here is some code. The architecture of a denoising Autoencoder is always something to have a close and curious look at:

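# Imports. The NGDLM import path is assumed from the library's package layout.
from keras import layers, models
from ngdlm import models as ngdlmodels

# Patch size. 64 is implied by the decoder's (8, 8, 32) input after three 2x2 poolings.
image_size = 64

# Create the encoder.
encoder_input = layers.Input(shape=(image_size, image_size, 1))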
encoder_output = encoder_input
encoder_output = layers.Conv2D(32, (3, 3), activation='relu', padding='same')(encoder_output)
encoder_output = layers.MaxPooling2D((2, 2), padding='same')(encoder_output)
encoder_output = layers.Conv2D(32, (3, 3), activation='relu', padding='same')(encoder_output)
encoder_output = layers.MaxPooling2D((2, 2), padding='same')(encoder_output)
encoder_output = layers.Conv2D(32, (3, 3), activation='relu', padding='same')(encoder_output)
encoder_output = layers.MaxPooling2D((2, 2), padding='same')(encoder_output)
encoder = models.Model(encoder_input, encoder_output)
encoder.summary()

# Create the decoder.
decoder_input = layers.Input(shape=(8, 8, 32))
decoder_output = decoder_input
decoder_output = layers.Conv2D(32, (3, 3), activation='relu', padding='same')(decoder_output)
decoder_output = layers.UpSampling2D((2, 2))(decoder_output)
decoder_output = layers.Conv2D(32, (3, 3), activation='relu', padding='same')(decoder_output)
decoder_output = layers.UpSampling2D((2, 2))(decoder_output)
decoder_output = layers.Conv2D(32, (3, 3), activation='relu', padding='same')(decoder_output)
decoder_output = layers.UpSampling2D((2, 2))(decoder_output)
decoder_output = layers.Conv2D(1, (3, 3), activation='sigmoid', padding='same')(decoder_output)
decoder = models.Model(decoder_input, decoder_output)

# Create the autoencoder.
ae = ngdlmodels.AE(encoder, decoder)
ae.compile(optimizer='adadelta', loss='binary_crossentropy')
ae.summary()

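# Train. train_generator is defined elsewhere; it is assumed to yield batches of (noisy, clean) image pairs.
print("Train...")
history = ae.fit_generator(
    train_generator,
    steps_per_epoch=100,
    epochs=100,
)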

Yes, this is an Autoencoder. A convolutional one. The encoder compresses a noisy image into an embedding in latent-space. The decoder decompresses the embedding into a clean image. Assuming that the loss is low.
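Where does the decoder's input shape of (8, 8, 32) come from? Each of the three 2x2 max-pooling layers in the encoder halves width and height. A small sketch of that arithmetic follows. The helper `latent_shape` is made up for illustration, and the image size of 64 is inferred from the decoder input rather than stated in the code above.

```python
import math


def latent_shape(image_size, pooling_layers=3, filters=32):
    """Shape of the encoder output for square inputs and 2x2 'same' pooling."""
    side = image_size
    for _ in range(pooling_layers):
        # MaxPooling2D((2, 2), padding='same') rounds up for odd sizes.
        side = math.ceil(side / 2)
    return (side, side, filters)


shape = latent_shape(64)
values_in = 64 * 64 * 1
values_latent = shape[0] * shape[1] * shape[2]
print(shape, values_in, values_latent)
```

With these numbers the embedding holds 2048 values, half as many as the 4096 pixels of the input patch.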

I trained that network for a moment (it is a lie, more than an hour is the truth) and the results were amazing. No one would expect the thing to work right away. But it did! See for yourself:

On the left side is the original dirty document. The right side shows the cleaned-up version. Amazing already after one round of training and no hyper-parameter tweaking and tuning.
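A net with a fixed input size cannot swallow a whole page at once. One way to handle that is to cut the page into tiles, clean every tile and stitch the result back together. The sketch below shows this under assumptions: `denoise_page` and `denoise_fn` are hypothetical names, padding with white is my choice for this example, and this article does not spell out how the full pages were actually processed. In practice `denoise_fn` would be something like the prediction call of the trained model.

```python
import numpy as np


def denoise_page(page, denoise_fn, patch_size=64):
    """Denoise a 2D grayscale page tile by tile and stitch the result together."""
    height, width = page.shape
    # Pad with white (1.0) so both sides become multiples of the patch size.
    pad_h = (-height) % patch_size
    pad_w = (-width) % patch_size
    padded = np.pad(page, ((0, pad_h), (0, pad_w)), constant_values=1.0)
    rows = padded.shape[0] // patch_size
    cols = padded.shape[1] // patch_size
    # Cut into non-overlapping tiles of shape (rows * cols, patch, patch, 1).
    tiles = padded.reshape(rows, patch_size, cols, patch_size).swapaxes(1, 2)
    tiles = tiles.reshape(-1, patch_size, patch_size, 1)
    cleaned = np.asarray(denoise_fn(tiles))
    # Reverse the tiling and crop the padding away again.
    cleaned = cleaned.reshape(rows, cols, patch_size, patch_size).swapaxes(1, 2)
    return cleaned.reshape(padded.shape)[:height, :width]


# Sanity check with an identity "denoiser": the page must come back unchanged.
page = np.random.default_rng(0).random((258, 540))
restored = denoise_page(page, lambda tiles: tiles)
print(restored.shape)
```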

Why no hyper-parameter tuning? Let’s do some grid search!

The initial prototype turned out to be quite good. The loss after around an hour of training was acceptable. The initial results were visually convincing. Still I was very curious. In the morning right after waking up I found myself thinking. Maybe it was a kind of Deep Learning fever? How could I get the loss smaller? There was only one prescription: Grid search!
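For reference, the loss being chased here is the binary cross-entropy that was passed to `ae.compile`. Below is a minimal NumPy sketch of the pixel-wise formula, meant as an illustration and not as Keras' exact implementation. The function name and the `eps` value are assumptions.

```python
import numpy as np


def binary_crossentropy(target, prediction, eps=1e-7):
    """Mean pixel-wise binary cross-entropy for values in [0, 1]."""
    target = np.asarray(target, dtype=np.float64)
    # Clip to avoid log(0) for predictions that are exactly 0 or 1.
    prediction = np.clip(np.asarray(prediction, dtype=np.float64), eps, 1.0 - eps)
    losses = -(target * np.log(prediction) + (1.0 - target) * np.log(1.0 - prediction))
    return float(losses.mean())


clean_pixels = np.array([0.0, 1.0, 1.0, 0.0])
good_guess = np.array([0.1, 0.9, 0.8, 0.2])
bad_guess = np.array([0.6, 0.4, 0.5, 0.5])
print(binary_crossentropy(clean_pixels, good_guess))
print(binary_crossentropy(clean_pixels, bad_guess))
```

The closer the predicted pixels are to the clean ones, the smaller the number gets.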

In a Convolutional Neural Net the number of filters is something that can make a difference. So I decided to aim at that hyper-parameter first. A little script would train the model a couple of times. Each time with a different number of CNN-filters. Here is the result… A comparison of the losses with different numbers of filters:

As you can clearly see, when you increase the number of filters, the loss goes down. This definitely has an effect on the denoising accuracy:

It is so amazing what you can do with Deep Neural Nets in such a short time!
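The grid search itself is nothing more than a loop. Here is a minimal sketch of its shape. `grid_search` and `fake_training_run` are hypothetical names, and the losses returned by the stand-in are invented placeholders, not my measured results. In a real run the callable would build the Autoencoder with the given number of filters, train it and return the final loss.

```python
def grid_search(train_and_evaluate, filter_counts):
    """Train once per filter count and collect the final losses."""
    results = {}
    for filters in filter_counts:
        # train_and_evaluate is expected to build, train and score one model.
        results[filters] = train_and_evaluate(filters)
    best = min(results, key=results.get)
    return best, results


def fake_training_run(filters):
    """Stand-in for a real training run; the numbers are invented, not measured."""
    return 1.0 / filters


best_filters, losses = grid_search(fake_training_run, [8, 16, 32, 64])
for filters, loss in losses.items():
    print(f"{filters:3d} filters -> final loss {loss:.4f}")
print("best:", best_filters)
```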

Summary and a bonus.

There is one thing I have to say: I had a blast implementing this use-case. Why? First and foremost, my model almost immediately worked out of the box. And second of all, the whole exercise took like three hours of implementation and three hours of training.

And here comes a bonus. What happens if you feed an image into the net with a very unexpected motif?

Stay in touch.

I hope you liked the article. Why not stay in touch? You will find me at LinkedIn, XING and Facebook. Please add me if you like and feel free to like, comment and share my humble contributions to the world of AI. Thank you!

If you want to become a part of my mission of spreading Artificial Intelligence globally, feel free to become one of my Patrons.

A quick word about me. I am a computer scientist with a love for art, music and yoga. I am an Artificial Intelligence expert with a focus on Deep Learning. As a freelancer I offer training, mentoring and prototyping. If you are interested in working with me, let me know. My email-address is tristan@ai-guru.de - I am looking forward to talking to you!