Building a Three-Layer Neural Network with Keras and TensorFlow

Construct a sequential three-layer neural network using Keras.

Build a neural network in TensorFlow with just a few concise lines of code, and understand how each layer transforms input data. Learn the functions and mathematics driving neural network activations like ReLU and Softmax.

Key Insights

  • The neural network built is sequential, consisting of an input layer flattening a 28x28 pixel grid into a 784-item list, a dense hidden layer with 128 neurons using the ReLU activation function, and an output layer with 10 neurons employing the Softmax function.
  • The hidden dense layer is considered a "black box" because it creates approximately 100,000 weighted connections between the 784 input values and 128 hidden neurons, allowing complex pattern learning that is often opaque to human interpretation.
  • Activation functions play a crucial role: ReLU sets negative inputs to zero to prevent subtracting from neuronal confidence, while Softmax scales the final output to probabilities that sum to 1, effectively identifying the most likely digit from 0 to 9.

Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.

Let's write a very short but rather intricate piece of code to build our three-layer neural network, and then we'll spend a good amount of time talking about what all these layers are, what they do, and what these arguments we're giving them mean. We're going to create a model, and it's going to be a Keras model that's sequential, meaning it'll go through each of these layers in order, sequentially. And it's actually a list of layers, again in sequential order.

So the first is our input layer, and that's going to be a TensorFlow Keras layer that flattens the input. We'll talk about why in a minute. And we'll say, hey, flatten it, and also its input shape is 28 × 28.

We'll pass it a tuple. Again, we'll talk about each of these lines in a minute. And then our second layer is a dense layer, meaning a hidden layer, and it's going to be 128 neurons.

Sounds like a lot, and as you'll realize, it's actually even more. And we'll use a ReLU activation, and again, I'll explain why in a minute. Then our final layer is our get-it-out-of-there layer, our output layer.

Give us the answer layer. It's still technically a dense layer, but this is usually known as the output layer. And it only needs to be 10, and I can explain this one, actually, without even having to go up above.


It's 10 because there are 10 possible values, the digits 0 through 9. And its activation is what's known as softmax; that's TensorFlow's tf.nn.softmax. Okay, so we pass all of these layers in as a list, and TensorFlow will build us a neural network with these layers and put it in our model.
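Here's a minimal sketch of the model described above, written out in Python with Keras. The variable name and exact formatting are my own; the lesson's on-screen code may differ slightly.

```python
import tensorflow as tf

# Three layers, applied in order: flatten the 28x28 grid,
# a 128-neuron hidden layer with ReLU, and a 10-neuron output layer with softmax.
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),         # 28x28 grid -> 784-item list
    tf.keras.layers.Dense(128, activation=tf.nn.relu),     # hidden "black box" layer
    tf.keras.layers.Dense(10, activation=tf.nn.softmax),   # one neuron per digit 0-9
])
```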

And it only takes a moment to run. Now let's talk about each of those layers and what they do.

First, the flatten input layer; that's the first one in our list. Flatten is a technical term in programming for taking a multidimensional list and making it one-dimensional.

So instead of 28 lists of length 28, we get one list of 784 items. The grid helped us visualize the set of pixels, which is what this 28 × 28 arrangement, 28 columns by 28 rows, did for us. But it doesn't help the computer.

In fact, it hurts it. It'll work much better with the exact same data in the same order, but as one long list. Unlike us, it doesn't need to know that one value is next to another value.

That's not how it actually looks at these images. Instead, it's going to learn how much to weight each individual value of the 784 and what it should weight towards. Okay.

So it doesn't care, again, how we humans read things and how we see patterns. It wants the data in a machine-readable way. And a straight-up list of numbers? Machines eat that right up.
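To make "flattening" concrete, here's a quick sketch using NumPy; the all-zeros grid is just a stand-in for a real 28 × 28 image.

```python
import numpy as np

# A 28x28 grid of pixel values, like one image from the dataset
grid = np.zeros((28, 28))
print(grid.shape)    # (28, 28)

# Flattening turns it into one long list of 784 values: same data, same order
flat = grid.flatten()
print(flat.shape)    # (784,)
```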

And it also wants it normalized. Okay. Now, layer two is where the fun happens.

And by fun, I mean the mystery. The dense or hidden layer is also known as the black box layer. "Black box" is a term for something where stuff is happening inside, but we can't really see into it.

This is where it takes all of those 784 values and says, okay, this number here seems to indicate a 5 this proportion of the time, so let's weight it this way. What you end up with is this 128-node layer getting fed those 784 inputs, which results in roughly 100,000 connecting wires between the 784 possible inputs and the 128 nodes.

You multiply those two together: each of those 784 nodes in the first layer, the inputs, is connected to each of the 128 nodes in the second. And that's a lot of neurons, a lot of neural wiring.

So it's going to assign a different weight to every single one of those connecting wires. What makes this kind of a black box is that we don't really know why wire 85,683 is weighted slightly higher than wire 23,642. We do have access to all the information: at the end, we can say, print out the connecting wires in your dense layer, and it'll print them out for us, but it'll just be 100,000 numbers that don't mean anything to us.
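To put a rough number on that "100,000 wires" figure, here's a quick back-of-the-envelope count, a sketch based on the 784-input, 128-node layout described above:

```python
# Counting the connections in the hidden layer (a sketch, not from the lesson)
inputs, hidden = 784, 128

weights = inputs * hidden   # one weight per input-to-node connection
biases = hidden             # Keras Dense layers also add one bias per node
print(weights)              # 100352 connecting wires
print(weights + biases)     # 100480 trainable parameters for this layer
```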

But they're very effective for the computer to understand. This is why things happen like Google, which uses neural networks, not really knowing how a lot of its own algorithms work. It has this amazing property where, when you Google something, the top hit is almost always what you want.

And it's able to do that by using a neural network to get that algorithm. So Google themselves are like, well, it seems to work really well. And they don't really know exactly how that happens.

In the final layer, we have just 10 neurons, one for each digit 0 through 9, and they're each going to have a value. Each of those is connected to all 128 neurons in the dense layer, which weights its values and sends them down to the 0-through-9 neurons. What we end up with is a number between 0 and 1 for each of those ten digits. Each of those 10 nodes gets assigned a number, and the numbers all add up to 1. It ends up being: what is the percentage chance that it's a 0, or a 3, or a 2? So whichever of those has the highest number, whichever receives the most juice over the electric wire, is the winner.

Like, yup, ding, ding, ding, looks like it's 2. 99% sure or 60% sure or whatever. And it's known as the activation. And it's governed by an activation function.
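As a rough illustration, picking the winner is just taking the index of the highest value. The probabilities below are made up for the example, not taken from the lesson:

```python
import numpy as np

# Hypothetical output of the 10-neuron softmax layer for one image (made-up numbers)
probabilities = np.array([0.01, 0.02, 0.88, 0.01, 0.01, 0.02, 0.01, 0.01, 0.02, 0.01])

predicted_digit = np.argmax(probabilities)  # index of the highest probability
print(predicted_digit)                      # 2
print(probabilities.sum())                  # 1.0 (up to floating-point rounding)
```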

All right. What is the activation function? I think it's the last thing we need to talk about. That and softmax.

This activation function, ReLU, is a strange function. What it essentially is: I make a function called relu that takes in a value. Oh, that's some JavaScript right there.

relu takes in a number n, and it returns the max of n and 0, whichever is bigger. And that's all it is. ReLU is an extremely simple function.

If n is negative 5, this will return 0, because 0 is bigger than negative 5. If n is positive 5, this will return 5, because 5 is bigger than 0. It will return whichever is bigger, which means it will be n unless n is negative, in which case it will be 0.

And this is how we make sure a neuron is never subtracting from confidence. Something saying "it doesn't look like a 5 at all" doesn't decrease the chance of it being a 1; it just decreases the chance of it being a 5. We say, hey, at worst, just add 0. If you really don't think it's a 5, then be 0% confident that it's a 5. Great. This is the function we need to give to each of the nodes of our middle layer, our hidden dense layer.
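Translated into Python, since that's the language the rest of this lesson's code uses, that function is just this (a sketch of the same idea, not the on-screen JavaScript):

```python
def relu(n):
    """Return n if it's positive, otherwise 0."""
    return max(n, 0)

print(relu(-5))  # 0 -- negative inputs never subtract from confidence
print(relu(5))   # 5 -- positive inputs pass straight through
```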

There are other activation functions, most famously the sigmoid function, which used to be much more common and has a smooth curve. But researchers discovered that, hey, ReLU is faster and better, even though it's simpler: if the value is negative, make it 0; otherwise, keep it as is.

The last function to understand is softmax. It simply rescales the output layer's numbers into the 0-to-1 range so that none of them goes too low or too high and so that they all add up to 100% (that is, to 1) when you sum them. That's all it is.

It's how we manipulate these values to be in the range we want. Just as ReLU manipulates the values of the dense layer, softmax manipulates our output so that it's on the right scale: oh, yep, okay, 53% this, 47% that. And because we've scaled it correctly, those percentages are meaningful.

It would be like, yep, it's that 53% one. Or more likely, it's just going to be 99% sure that it's a particular digit. All right.
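For the curious, here's a minimal sketch of what softmax does under the hood, in plain Python, with made-up input scores chosen to land near the 53%/47% split mentioned above:

```python
import math

def softmax(scores):
    """Exponentiate each score, then divide by the total so the results sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Two made-up raw scores from an output layer
probs = softmax([0.2, 0.08])
print(probs)       # roughly [0.53, 0.47]
print(sum(probs))  # 1.0 (up to floating-point rounding)
```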

So that's all the background you need. We've built our network. Next, we'll compile it and use it.

Colin Jaffe

Colin Jaffe is a programmer, writer, and teacher with a passion for creative code, customizable computing environments, and simple puns. He loves teaching code, from the fundamentals of algorithmic thinking to the business logic and user flow of application building—he particularly enjoys teaching JavaScript, Python, API design, and front-end frameworks.

Colin has taught code to a diverse group of students since learning to code himself, including young men of color at All-Star Code, elementary school kids at The Coding Space, and marginalized groups at Pursuit. He also works as an instructor for Noble Desktop, where he teaches classes in the Full-Stack Web Development Certificate and the Data Science & AI Certificate.

Colin lives in Brooklyn with his wife, two kids, and many intricate board games.

