
Convolutional Neural Networks Explained Visually

Inspired by an episode of the show 'Silicon Valley' a few weeks back, I got curious and started reading up on the science of image recognition. With neural networks being the most promising solution in this field, my quest into machine learning had begun. Convolutional Neural Networks (CNNs) are currently the most popular type of neural network for the application of image recognition. I should also note that while my ultimate goal is to implement and train a convolutional neural network from scratch (not using any learning libraries), I have not yet accomplished this to the point of being able to reflect on it. However, after spending a significant portion of the last three or so weeks reading academic materials on CNNs, I believe I have a thorough enough understanding of the topic to illustrate it in a (hopefully) more digestible manner than I've seen elsewhere.

Concept Review

  • Neural Network: A trainable system that can be used to classify novel input data to some degree of accuracy. Structurally, it's a collection of forward-feeding layers that are made up of gates (also called neurons). The outputs of the gates that make up layer A become the inputs for the gates that make up layer B, and so on. Each gate applies some function to its input to produce its output, which is then passed on to the next layer. In addition to feeding these values forward, each individual gate has a weight value associated with it. These weight values are initially set randomly, then get adjusted a little bit every time you train the neural network with new input data via a process called backpropagation (calculating how far off the network was at classifying the given input data, then passing that error back and adjusting each weight a little bit towards what it should have been). Putting emphasis on certain gates via their respective weights tells the neural network which parts of the input are most important in classifying the input, and therefore what it thinks the output should be.
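To make the gate-and-weight idea concrete, here's a minimal Python sketch of a single gate computing a weighted combination of its inputs. The values are made up for illustration, and real gates also run this sum through an activation function (one of the topics not covered in this article):

```python
import random

def gate_output(inputs, weights):
    # A single gate: each incoming value is scaled by a weight,
    # and the results are summed to produce the gate's output.
    # (Real networks also apply an activation function here.)
    return sum(x * w for x, w in zip(inputs, weights))

# Weights start out random; backpropagation nudges them a little
# after every training example.
weights = [random.uniform(-1, 1) for _ in range(3)]
layer_a_outputs = [0.5, -0.2, 0.9]   # hypothetical outputs from layer A
layer_b_input = gate_output(layer_a_outputs, weights)
```

A large weight amplifies its input's influence on the gate's output, which is exactly how the network learns to "emphasize" certain parts of the input.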

  • Image Convolution: In the context of a CNN, image convolution is the process of identifying distinguishing features in an image for the sake of (typically) classifying what the image is of. For example, if I fed a good CNN 1,000 pictures of my face, it would create "feature filters" for common edges and lines that it finds, then convolve these findings together to create filters for my actual facial features. If I then tested this CNN by showing it 10 pictures it had never seen before, 9 of random faces and 1 of mine, it would theoretically be able to distinguish my face from the others based on the presence of these learned features. While I just broad-stroked over an offensive amount of technical material, you get the picture (pun intended).

Architecture of a Convolutional Neural Network

  • Input Layer: The first layer in any neural network. For a CNN, this layer consists of a map of all the pixels and their respective values for the input image as a 3-dimensional matrix. For instance, if a 64x64-pixel image has 3 color channels (Red, Green, Blue), then the size of the input layer is 64x64x3. However, for simplicity's sake, we're going to use a 10x10 grayscale image of a smiley face. Grayscale images only have 1 color channel, which represents the spectrum from completely black to completely white. For the sake of the math involved, we will have the values of this spectrum range from -1 (completely black) to 1 (completely white), with shades of gray falling between -1 and 1. Given that this input layer would be 10x10x1 in size (10 pixels x 10 pixels x 1 channel), we can think of it as just 10x10, thus allowing us to deal with only 2 dimensions.
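Here's a quick sketch of what building that input layer might look like. The raw image values and the smiley-face coordinates are made up for illustration; the point is just the mapping of raw 0-255 pixel values onto the [-1, 1] spectrum:

```python
def normalize_pixel(value, lo=0, hi=255):
    # Map a raw pixel value onto the [-1, 1] spectrum:
    # lo (black) -> -1, hi (white) -> 1.
    return 2 * (value - lo) / (hi - lo) - 1

# A made-up 10x10 raw grayscale image (0 = black, 255 = white).
raw_image = [[255] * 10 for _ in range(10)]
raw_image[3][3] = raw_image[3][6] = 0      # eyes
for col in range(3, 7):
    raw_image[7][col] = 0                  # mouth

# The 10x10 input layer: every pixel now lies between -1 and 1.
input_map = [[normalize_pixel(p) for p in row] for row in raw_image]
```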

  • Convolution Layer: As MTV Cribs would say, this is where the magic happens. Also, it's kind of like a magic trick because you have to see it 10 times to understand what's going on, so bear with me. The convolution layer is where feature filters are applied to the input map to detect features in the image. Since the location, size, and orientation of items in images vary greatly, even between photos of the same item(s), feature filters must be tried at every possible location in the input image. What's a feature filter?

    Thanks for asking: a feature filter is an n x n x d matrix of values that is laid on top of the input image and compared at every position. d = depth = number of channels (1 for grayscale, as stated earlier). n = a user-determined variable for the window size of your feature filters. We'll make n=3, as 3x3 is a common choice for filter dimensions. Think of a feature filter as a sliding window that gradually traverses your entire input image, comparing its own values against the respective pixel values of the n x n area that it's covering. This comparison function that occurs at every position is as follows: each value of the filter is multiplied by the respective pixel of the input image that it's lying on top of. All those products are then added together, and that sum is divided by the number of values of the filter (divide by 9, because of our 3x3 feature filter). The resulting value is then assigned to the output map, known as a feature map, at the current coordinates of where the filter is (remember, it starts at [0,0]). Here's a simulation to illustrate this process:
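That comparison function translates into Python pretty directly. This sketch assumes the window moves one pixel at a time (a stride of 1 — stride as a variable is one of the topics not covered here), and the filter values shown are invented for illustration:

```python
def convolve(image, filt):
    # Slide the n x n filter window over every valid position of
    # the image. At each position: multiply each filter value by
    # the pixel beneath it, sum the products, and divide by the
    # number of filter values (9 for our 3x3 filter).
    n = len(filt)
    h, w = len(image), len(image[0])
    feature_map = []
    for i in range(h - n + 1):          # window's top-left row
        row = []
        for j in range(w - n + 1):      # window's top-left column
            total = sum(
                image[i + a][j + b] * filt[a][b]
                for a in range(n)
                for b in range(n)
            )
            row.append(total / (n * n))
        feature_map.append(row)
    return feature_map

# A made-up 3x3 "diagonal line" filter, using the same -1..1 scale
# as the input image.
diagonal_filter = [
    [ 1, -1, -1],
    [-1,  1, -1],
    [-1, -1,  1],
]
```

When the filter lies on top of a patch that matches it exactly, every product is 1, so the cell in the feature map comes out to 9/9 = 1 — a strong match.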

  • Max-Pooling Layer: Ok, take a deep breath. The worst is behind us. The max-pooling layer is a much simpler concept. For each feature map that is produced in the convolution layer, the max-pooling layer passes its own window of m x m x d size (m being a user-defined variable) over it. We'll make m=2. However, unlike a feature filter, this m x m x d window has no values of its own. Instead, it looks at all the values in the area of the feature map that it is lying on top of, and then compares them to each other. The largest value that it finds in that area of the feature map gets passed on to the resulting max-pooled map at the respective coordinates. By doing this, the resolution of the feature map is effectively reduced by a factor of 2, downsampling it to only the highest-valued cells. Conventionally, the goal is often to reduce these feature maps down to 2x2 squares after multiple passes through max-pooling layers, but that's out of the scope of this article.
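Max-pooling is simple enough to sketch in a few lines. This assumes the window jumps m cells at a time so the windows don't overlap, which is what halves the resolution when m=2:

```python
def max_pool(feature_map, m=2):
    # Pass an m x m window over the feature map, keeping only the
    # largest value under each window. The window moves m cells at
    # a time, so a 2x2 window halves the map's resolution.
    h, w = len(feature_map), len(feature_map[0])
    return [
        [
            max(
                feature_map[i + a][j + b]
                for a in range(m)
                for b in range(m)
            )
            for j in range(0, w - m + 1, m)
        ]
        for i in range(0, h - m + 1, m)
    ]
```

For example, pooling a 4x4 feature map with m=2 yields a 2x2 map holding the largest value from each quadrant.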

  • Fully-Connected Layer: The fully-connected layer is the second-to-last layer in a CNN. The fully-connected layer takes all of the down-sampled maps produced in the max-pooling layer, and flattens them into a linear list of values. These are the most decisive values for classification, as they represent all the data that has been collected, boiled down to only the most "activated" values - those with values closest to 1. It is from this layer that the output is determined.
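Flattening the max-pooled maps into that linear list is a one-liner; the sample maps below are invented just to show the shape of the operation:

```python
def flatten(pooled_maps):
    # Unroll every max-pooled map, row by row, into one
    # linear list of values.
    return [value for m in pooled_maps for row in m for value in row]

# Two hypothetical 2x2 max-pooled maps become one list of 8 values.
fully_connected = flatten([
    [[0.9, 0.1], [0.3, 0.7]],
    [[0.2, 0.8], [0.6, 0.4]],
])
```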

  • Output Layer: The output layer is where classification decisions are made. The output layer consists of a node for every possible classification the network is allowed to make. In our example, we only have two possible classifications, represented by two output nodes: smiley face and NOT smiley face. This is called a binary classification, as there are exactly two mutually-exclusive outputs: 1 = definitely a smiley face, and 0 = definitely not a smiley face. Given that the output value will be between 0 and 1, whichever of these integers the output is closest to is deemed the best choice. With this being the case, the tipping point of classification is 0.5. It can also be thought about in terms of the network's % confidence in its answer, with 0.5 being the least confident guess possible. In our example network, we're simply calculating the output by taking the average value amongst all the nodes in the fully-connected layer. Unfortunately, the output layer in our example is a bit over-simplified, and very naive as we have not trained it (or considered its weights at all, for that matter). For this reason, it incorrectly classifies the output as "not smiley face" just barely, with a very unsure output value of 0.46. In a real CNN, the network would now start the process of backpropagation to essentially figure out where it went wrong and tweak its weights accordingly.
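The toy output rule described above (average the fully-connected values, then compare against the 0.5 tipping point) might look like this. The sample values are invented, and, as noted, a real trained network would weight these inputs rather than averaging them:

```python
def classify(fully_connected, threshold=0.5):
    # Naive, untrained output layer: average all the
    # fully-connected values, then pick whichever side of the
    # 0.5 tipping point the average falls on.
    score = sum(fully_connected) / len(fully_connected)
    label = "smiley face" if score >= threshold else "NOT smiley face"
    return score, label
```

Feeding in values that average out to 0.46 reproduces the just-barely-wrong, very-unsure verdict from the example above.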

Tying It All Together

You still there? pls respond. I had about as much fun writing that last section as you had reading it. It's all lazy reading from here on, I promise. When all these layers are brought together, they form a basic CNN. However, in practice, CNNs usually have many, many more layers than this. You can have as many sequences of convolutional, max-pooling, and fully-connected layers as you fancy. However, increasing network depth is a fascinating and eerie trade-off. On one hand, increasing the depth of a network allows the network to have many more connections and weights to tweak, and therefore many more levels of abstraction for the sake of problem solving. On the other hand, with every increase in depth, the network becomes dramatically harder for a human being to understand, hence the frequent use of the term "black box" to describe such networks. The study of these complex, multi-layered networks is known as deep learning, and the research coming out of this field is equal parts astonishing and terrifying. Just don't be surprised when Siri starts asking you unsolicited questions in the middle of the night.

Convolutional neural networks are perhaps the most exciting and daunting topic that I have dug into recently. While the math can get pretty ugly, the concepts behind CNNs are very elegant and impressive. I hope you learned something from this article, and if not, I'm very impressed that you're still here with me.

Important Topics Not Covered: Activation functions, ReLU layers, stride as a user-defined variable, zero-padding, learning rate, biases, proper output nodes.

Programs Used: Unity3D (C#), Python, QuickTime
