Building a Three-Layer Neural Network with Keras and TensorFlow

Construct a sequential three-layer neural network using Keras.

Build a neural network in TensorFlow with just a few concise lines of code, and understand how each layer transforms input data. Learn the functions and mathematics driving neural network activations like ReLU and Softmax.

Key Insights

  • The neural network built is sequential, consisting of an input layer flattening a 28x28 pixel grid into a 784-item list, a dense hidden layer with 128 neurons using the ReLU activation function, and an output layer with 10 neurons employing the Softmax function.
  • The hidden dense layer is considered a "black box" because it creates approximately 100,000 weighted connections between the 784 input values and 128 hidden neurons, allowing complex pattern learning that is often opaque to human interpretation.
  • Activation functions play a crucial role: ReLU sets negative inputs to zero to prevent subtracting from neuronal confidence, while Softmax scales the final output to probabilities that sum to 1, effectively identifying the most likely digit from 0 to 9.


Let's write a very short but rather intricate bit of code to build our three-layer neural network, and then we'll spend some time afterward talking about what all these layers are, what they do, and what the arguments we're giving them mean. We're going to create a model, and it's going to be a Keras model that's sequential, meaning it'll go through each of these layers in order, sequentially. And it's actually a list of layers, again in sequential order.

So the first is our input layer, and that's going to be a TensorFlow Keras layer that flattens the input. We'll talk about why in a minute. And we'll say, hey, flatten it, and also its input shape is 28 × 28.

We'll pass it a tuple. Again, we'll talk about each of these lines in a minute. And then our second layer is a dense layer, meaning a hidden layer, and it's going to have 128 neurons.

Sounds like a lot, and as you'll realize, it's actually even more. And we'll use a ReLU activation, and again, I'll explain why in a minute. Then our final layer is our get-it-out-of-there layer—our output layer.

Give us the answer layer. It's still technically a dense layer, but this is usually known as the output layer. And it only needs to be 10, and I can explain this one, actually, without even having to go up above.


It's 10 because there are 10 possible values, the digits 0 through 9. And its activation is what's known as softmax; specifically, it's the softmax from TensorFlow's neural network module, tf.nn.softmax. Okay, so we pass all of these layers in as a list, and we'll talk about each of them in just a moment, and TensorFlow will build us a neural network with those layers and put it in our model.
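Here's a minimal sketch of the model described above, assuming the standard tf.keras API (the variable name model is just illustrative):

```python
import tensorflow as tf

# Build the three-layer network described above: flatten, hidden dense, output dense.
model = tf.keras.models.Sequential([
    # Input layer: flatten the 28 x 28 pixel grid into one list of 784 values.
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    # Hidden ("black box") layer: 128 neurons, each connected to all 784 inputs.
    tf.keras.layers.Dense(128, activation=tf.nn.relu),
    # Output layer: one neuron per digit (0 through 9), scaled to probabilities by softmax.
    tf.keras.layers.Dense(10, activation=tf.nn.softmax),
])
```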

And it only takes a moment to run. Now let's talk about each of those layers in turn.

Let's start with the flatten input layer, the first one in the list. "Flatten" is a technical term in programming for taking a multidimensional list and making it one-dimensional.

So instead of 28 lists of 28 items each, it's one list of 784 items. The 28 × 28 grid (28 columns, 28 rows) helped us visualize this set of pixels, but it doesn't help the computer.

In fact, it hurts it. The network works much better with the exact same data in the same order, but as one long list. We can look at it as a grid; the computer doesn't need to know that one value sits next to another value.

That's not how it actually looks at these images. Instead, it's going to learn how much to weight each individual value of the 784 and what it should weight toward. Okay.

So it doesn't care, again, how we humans read things and how we see patterns. It wants it in a machine-readable way. And just a straight-up list of numbers—machines eat that right up.

And it also wants the values normalized.
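Here's a hedged sketch of what that flattening and normalizing looks like on raw data, assuming the usual MNIST setup where each pixel is an integer from 0 to 255 (the variable names are just for illustration):

```python
import numpy as np

# A single 28 x 28 image: a grid of pixel values from 0 (black) to 255 (white).
image = np.random.randint(0, 256, size=(28, 28))

# Normalize: scale every pixel into the 0-to-1 range the network prefers.
normalized = image / 255.0

# Flatten: the same 784 values, in the same order, but as one long list.
flattened = normalized.reshape(784)

print(image.shape)      # (28, 28)
print(flattened.shape)  # (784,)
```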

Now, layer two is where the fun happens. And by fun, I mean the mystery. The dense or hidden layer is also known as the black box layer. "Black box" is a term for "stuff's happening in there, and we can't really see into it."

This is where it takes all of those 784 values and says, okay, this number here seems to indicate a 5 this amount of the time; let's weight it this way. And what you end up with is that this 128-node layer gets fed those 784 inputs, which results in roughly 100,000 wires (784 × 128 = 100,352 connections) between all those 784 possible inputs and 128 nodes.

You multiply those two numbers together: each of the nodes in the first layer (the inputs) is connected to every one of the 128 in the second. And that's a lot of neurons, a lot of neural wiring.

So it's going to assign a different weight to every single one of those connecting wires. What makes this a black box is we don't really know why wire 85,683 is weighted slightly higher than 23,642. We get all the information, and at the end, we can say, print out your connecting weights in your dense layer, and it'll print it out for us—it'll just be 100,000 numbers that don’t mean anything to us.

But they're very effective for the computer. This is why, for example, Google, which uses neural networks, doesn't always know exactly how its own algorithms work. It has this amazing property: if you Google something, the top hit is almost always what you want.

And it’s able to do that by using a neural network to optimize its algorithm. So even Google is like, well, it seems to work really well. And they don’t really know exactly how that happens.
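If you want to peek at those numbers yourself, here's a hedged sketch of how you might inspect the weights of the model built above (assuming the Keras model sketched earlier; the layer indexing is illustrative):

```python
# Inspect the hidden layer's weights: a 784 x 128 matrix of connection weights plus 128 biases.
weights, biases = model.layers[1].get_weights()

print(weights.shape)  # (784, 128), i.e. 100,352 connection weights
print(biases.shape)   # (128,)

# Or print a per-layer parameter count.
model.summary()
```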

In the final layer, we have just 10 neurons, one for each digit from 0 to 9. And each one is going to have a value. Each of those 10 is connected to all 128 neurons in the dense layer, which weight their values and send signals down to the 0-to-9 output.

And what we end up with is a number between 0 and 1 for each of those ten digits, and those ten numbers all add up to 1. Each one answers: what is the percentage chance that it's a 0, or a 3, or a 2? So whichever of those has the highest number, meaning it receives the strongest signal, is the winner.

Like, yup, ding ding, looks like it’s 2. 99% sure, or 60% sure, or whatever. And it’s known as the activation. And it’s governed by an activation function.
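As a hedged illustration (the numbers here are made up, not real model output), picking the winner is just a matter of taking the index with the highest probability:

```python
import numpy as np

# A made-up output vector: ten probabilities, one per digit, summing to 1.
output = np.array([0.01, 0.02, 0.83, 0.03, 0.01, 0.02, 0.02, 0.03, 0.02, 0.01])

predicted_digit = np.argmax(output)  # index of the strongest signal
confidence = output[predicted_digit]

print(predicted_digit, confidence)   # 2 0.83
```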

All right. What is the activation function? I think it's the last thing we need to talk about—that and softmax.

This activation function, ReLU, is a strange one: strange in how simple it is. Imagine a function called relu that takes in a value.

ReLU takes in a number n and returns the max of n and 0, whichever is bigger. And that's all it does. ReLU is an extremely simple function.

If n is negative 5, this will return 0 because 0 is bigger than negative 5. If n is positive 5, this will return 5 because 5 is bigger than 0. It will return whichever is bigger, which means it will be n unless n is negative, in which case it will be 0.
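In code, that's about as short as a function gets. Here's a hedged sketch (the function name is just illustrative):

```python
def relu(n):
    # Return n, unless n is negative, in which case return 0.
    return max(n, 0)

print(relu(-5))  # 0, because 0 is bigger than -5
print(relu(5))   # 5, because 5 is bigger than 0
```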

This is how we make sure that it's never subtracting from confidence. That something saying “it doesn’t look like it’s a 5 at all” doesn’t decrease its chance of being a 1—it just decreases the chance of being a 5. We say, hey, at worst, just add zero. If you really don’t think it’s a 5, then be 0% confident that it’s a 5. Great.

This is a function that we need to give to each of the nodes in our middle layer—our hidden dense layer.

There are other activation functions, most famously the sigmoid function, which used to be much more common. It has a smooth curve. But researchers discovered that ReLU is faster and works better, even though it's simpler: just set the value to zero if it's negative; otherwise, keep it as is.
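For comparison, here's a hedged sketch of sigmoid using its standard textbook definition (this isn't code from the lesson):

```python
import math

def sigmoid(x):
    # The smooth S-shaped curve: squashes any number into the range 0 to 1.
    return 1 / (1 + math.exp(-x))

print(sigmoid(-5))  # about 0.0067
print(sigmoid(0))   # 0.5
print(sigmoid(5))   # about 0.9933
```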

The last function to understand is softmax. It simply scales the output layer's raw numbers into the 0-to-1 range so that they don't go too low or too high, and so that they all add up to 100% when you add them together. That's all it does.

It manipulates the values to be in the range we want. Just like ReLU manipulates the values in the dense layer, softmax manipulates our output so that it's on the right scale to say, “Oh yep, 53% this, 47% that.” And because we've scaled it right, those are the correct proportions.

It would be like, yep, it's that 53% one. Or more likely, it's just going to be 99% sure that it’s a particular digit. All right.
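Here's a hedged sketch of the standard softmax formula (exponentiate each raw output, then divide by the sum), so you can see the probabilities come out summing to 1; the numbers are made up, not real model output:

```python
import numpy as np

def softmax(values):
    # Exponentiate each raw output, then divide by the total so everything sums to 1.
    exps = np.exp(values)
    return exps / exps.sum()

raw_outputs = np.array([2.0, -1.0, 0.5])  # made-up raw scores for three classes
probabilities = softmax(raw_outputs)

print(probabilities)        # roughly [0.79, 0.04, 0.18]
print(probabilities.sum())  # 1.0
```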

So that’s all the background you need. We’ve built our network. Next, we’ll compile it and use it.

Colin Jaffe

Colin Jaffe is a programmer, writer, and teacher with a passion for creative code, customizable computing environments, and simple puns. He loves teaching code, from the fundamentals of algorithmic thinking to the business logic and user flow of application building—he particularly enjoys teaching JavaScript, Python, API design, and front-end frameworks.

Colin has taught code to a diverse group of students since learning to code himself, including young men of color at All-Star Code, elementary school kids at The Coding Space, and marginalized groups at Pursuit. He also works as an instructor for Noble Desktop, where he teaches classes in the Full-Stack Web Development Certificate and the Data Science & AI Certificate.

Colin lives in Brooklyn with his wife, two kids, and many intricate board games.
