The ideas and methods in neural nets (NNs) have been around for a long time, but in the last decade plus, we are finally starting to reap significant benefits, and this is just the beginning. This post provides an overview of my recent PyCon talk in Montreal, which is a neural net primer of sorts. The video is below, my slides are on SpeakerDeck, and I have a repo on GitHub named Neural Nets for Newbies.
There is too much to cover to fully explain neural nets; thus, the post and the talk provide a framework to start to understand them. If you want to learn more, there are plenty of resources, some listed in my deck, to dive into.
What are they?
Machine learning is a set of algorithms for classification and prediction, and artificial neural nets are part of the machine learning space. At its core, a neural net is an algorithm, which means an equation (technically, a combination of equations) that helps abstract and find patterns in data.
The structure is modeled after our brains. Before you get all excited about robots that can think like us (side note: that idea has been around since BC), the reality is that we still don’t fully understand how the human brain functions. Neural nets only loosely mimic brain functionality. For the enthusiasts out there, yes, there are many researchers focused on creating a closer biological representation that acts like our brain. The bottom line is we aren’t there yet.
The algorithm, structure and many of the ideas around neural net functionality have been around for a while; several of them date back to the 1950s. Neural nets have been applied to commercial solutions as far back as 1959 (reducing phone line echoes), but we really haven’t seen significant value until recently. Key reasons are that our computational power (computer processing speed and memory capabilities) and access to useful data (amount of stored data) have significantly improved in the last decade alone.
Why should I care?
Because NNs have achieved technical advancement in areas like:
- Natural Language Processing (Search & Sentiment)
- Speech Recognition (Siri)
- Computer Vision & Facial Recognition (Automatic Image Tagging)
- Robotics (Automated Car)
- Recommender Systems (Amazon)
- Ad Placement
Some of you may roll your eyes at these advancements and complain about how Siri is limited in interactions. Need I remind you that we weren’t talking to our machines at the beginning of this century (or at least it wasn’t common)? Hell, we didn’t have iPods at the beginning of this century, if you remember what they are. I too fall into the sci-fi trap where I’ve seen it or read about it, and so when we actually experience real advancements, it seems so boring and behind the times. Yeah, get over that.
All the areas I mentioned above still have plenty of room for growth, and there are definitely other areas I haven’t listed, especially in scientific fields. One of the reasons neural nets have had such impressive impact is their way of handling more data, especially data that has layers of complexity. This doesn’t mean that neural nets should be used for all problems. They are overkill in many situations. I cannot stress enough that not every problem is a nail for the NN hammer.
If you have a good problem and want to apply NNs, it’s important to understand how they work.
Ok, so how do they work?
Out of the gate, if you want to get serious about applying NNs, then you will need to embrace math no matter how much you don’t like it. Below I’ve given you some fundamentals around the math and the structure to get you started.
Our brains are made up of neurons and synapses, and based on our interactions, certain neurons will fire and send signals to other neurons for data processing/interpretation. There is much more complex stuff going on than just that in our brains, but at a high-level that expresses the structure the neural net models.
NNs at a minimum have three layers: input, hidden, output.
- Input = data
  - Data that is broken up into consumable information
  - Data can be pre-processed or raw
  - Bias and noise are applied sometimes
- Hidden = processing units (aka, does math)
  - Made up of neurons
  - A neuron determines if it will be active (math under equation section)
  - Typically there are multiple neurons in a hidden layer (can be thousands or even billions depending on the data used and objective)
- Output = results
  - One node per classification; there can be just one node or many
  - A net to classify dogs or cats in a picture has two output nodes, one for each type of classification
  - A net to classify handwritten digits between 0-9 has ten output nodes
You can have more than one hidden layer in a neural net, and when you start adding hidden layers, they trade off as inputs and outputs based on where they are in the structure.
Each neuron represents an equation, and it takes in a set of inputs, multiplies weights, combines the data and then applies an activation function to determine if the neuron is active. A neuron is known as a processing unit because it computes the data to determine its response.
- Inputs = input layer data in numerical format
- Weights = coefficients (also known as theta)
  - Specialize each neuron to handle the problem (dataset) you are working with
  - Can initialize randomly
  - One way to initialize is to create a distribution of the existing data set and randomly sample that distribution
  - Often weights are represented between -1 and 1
- Bias = can be included as an input or used as a threshold to compare data after the activation function is applied
- Activation Function = data transformation to determine if the neuron will send a signal
  - Also known as the transfer function
  - There are many different equations that can be used, and it depends on the problem and data you are working with
  - Example equations: sigmoid/logistic, step/binary threshold, linear, rectified linear (combines binary threshold & linear), …
- Output(s) = each node results in a binary, percentage or number range
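To make the pieces above concrete, here is a minimal sketch of a single neuron in plain Python. The input values, weights and bias are made up for illustration, and the function names (`neuron`, `sigmoid`, `step`, `relu`) are my own choices, not taken from any particular package.

```python
import math

def sigmoid(z):
    """Sigmoid/logistic: squashes any number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def step(z):
    """Step/binary threshold: the neuron is either on (1) or off (0)."""
    return 1.0 if z >= 0 else 0.0

def relu(z):
    """Rectified linear: off below zero, linear above it."""
    return max(0.0, z)

def neuron(inputs, weights, bias, activation):
    """Multiply inputs by weights, combine them, add the bias, then activate."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

inputs = [0.5, -1.0, 2.0]    # illustrative input data
weights = [0.4, 0.3, -0.2]   # illustrative weights between -1 and 1
bias = 0.1

# The weighted sum here is roughly -0.4, so the step and rectified linear
# neurons stay inactive while the sigmoid neuron reports a value below 0.5.
for activation in (sigmoid, step, relu):
    print(activation.__name__, neuron(inputs, weights, bias, activation))
```

Swapping the activation function changes how the same weighted sum is interpreted, which is why the choice depends on the problem and the data.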
Each neuron is unique from other neurons in a hidden layer based on the weights applied. They can also be unique in the inputs and outputs. There are many hyperparameters that you can tweak for one single neuron let alone the whole structure to improve its performance. What makes neural nets powerful is the combination of linear with nonlinear functions in the equation.
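As a rough sketch of how neurons with different weights combine into layers, the snippet below stacks a few layers and feeds data forward so that one layer’s outputs become the next layer’s inputs. The layer sizes, the random initialization between -1 and 1, and the helper names (`dense_layer`, `random_layer`, `forward`) are assumptions for illustration only.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense_layer(inputs, weights, biases):
    """One layer: every neuron sees all the inputs but applies its own weights."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]

def random_layer(n_inputs, n_neurons, rng):
    """Hypothetical helper: random weights in [-1, 1] and zero biases."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_neurons)]
    biases = [0.0] * n_neurons
    return weights, biases

def forward(inputs, layers):
    """Pass data through each layer; outputs of one layer are inputs to the next."""
    activations = inputs
    for weights, biases in layers:
        activations = dense_layer(activations, weights, biases)
    return activations

rng = random.Random(0)
# 4 inputs -> 5 hidden neurons -> 3 hidden neurons -> 2 output nodes
layers = [random_layer(4, 5, rng), random_layer(5, 3, rng), random_layer(3, 2, rng)]
outputs = forward([0.5, -0.2, 0.1, 0.9], layers)
print(outputs)  # two numbers between 0 and 1, one per output node
```

The weighted sum inside each neuron is the linear part and the sigmoid is the nonlinear part; the number of layers and the neurons per layer are among the hyperparameters you can tweak.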
When applying a neural net, an effort is needed to optimize the model, so it produces the results you are targeting.
The breakthroughs in neural nets are largely in the area of supervised learning. Supervised learning means you have a dataset labeled with the results you expect. The data is used to train the model so you can make sure it functions as needed. A technique typically used in supervised learning is to split the dataset into a training set to build the model and a test set for validation; cross validation repeats that split several times so that every example is used for validation once. Note, there are areas in neural net research that explore unlabeled data, but that is too much to cover in this post.
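Here is a small sketch of splitting labeled data, first as a single train/test split and then as k-fold cross validation. The toy dataset, the 20% test fraction and the function names are assumptions for illustration; machine learning packages ship their own, more complete versions of these helpers.

```python
import random

def train_test_split(data, test_fraction=0.2, seed=0):
    """Shuffle a copy of the data and hold out a fraction for testing."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def k_fold(data, k=5):
    """Yield (train, validation) pairs; each row is used for validation once."""
    for i in range(k):
        validation = data[i::k]
        train = [row for j, row in enumerate(data) if j % k != i]
        yield train, validation

# Toy labeled dataset: (features, label) pairs
dataset = [([x, x * 2], x % 2) for x in range(10)]

train, test = train_test_split(dataset)
print(len(train), len(test))  # 8 2

for fold_train, fold_val in k_fold(train, k=4):
    print(len(fold_train), len(fold_val))  # 6 2 on each of the four folds
```

The test set is kept aside until the very end, while the folds are carved out of the training set only.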
In order to optimize, you start out with a structure and probably randomized weights on each neuron in the hidden layer(s). You’ll run your labeled data through the structure and come out with results at the end. Then you compare those results to the real labels using a loss function to help define the error value. The loss function will transform the comparison, so it becomes a type of compass when going back to optimize the weights on each neuron.
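As an illustration of comparing results to real labels, here is a sketch of two common loss functions, mean squared error and cross entropy, applied to made-up outputs for a three-class problem. The numbers and function names are assumptions chosen for the example.

```python
import math

def mean_squared_error(predictions, targets):
    """Average of the squared differences between outputs and labels."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def cross_entropy(predicted_probs, true_class, eps=1e-12):
    """Negative log of the probability the net assigned to the correct class."""
    return -math.log(max(predicted_probs[true_class], eps))

# Hypothetical net outputs for one example whose real label is class 2
confident_right = [0.05, 0.05, 0.90]
confident_wrong = [0.90, 0.05, 0.05]
one_hot_label = [0.0, 0.0, 1.0]

print(mean_squared_error(confident_right, one_hot_label))  # small error
print(mean_squared_error(confident_wrong, one_hot_label))  # large error
print(cross_entropy(confident_right, 2))                   # small error
print(cross_entropy(confident_wrong, 2))                   # large error
```

Both functions turn "how far off was the net?" into a single number, and that number is what the optimization step tries to push down.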
The optimization method (aka back propagation or backprop) is a way of taking the derivative of the loss function and applying it to the weights throughout the model. This method can change all weights on every neuron and because of the way the method works, it does not change the weights equally. You want shifts that vary across weights because each neuron is unique.
- Error = difference between NN results and the real labels
- Loss Function = calculates the error (also referred to as cost function)
  - There are many different equations that are used, and it depends on the problem and data you are working with
  - Example equations: mean squared error, negative log likelihood, cross entropy, hinge, …
- Regularization = penalty term added to the loss function (or noise applied during training) to prevent overfitting
- Optimization Method = learning method to tune weights
  - There are many different equations that are used, and it depends on the problem and data you are working with
  - Example equations: stochastic gradient descent, Adagrad (J. Duchi), Adadelta (M. Zeiler), RMSprop (T. Tieleman), …
- Learning Rate = how much to change the weights at each step; sometimes adjusted by the optimization algorithm itself
Backprop in essence wiggles (to quote Karpathy) the weights a little each time you run the data through the model during training. You keep running the data through and adjusting the weights until the error stops changing. Hopefully it’s as low as you need it to be for the problem. And if it’s not, you may want to investigate other model structure modifications.
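The sketch below shows that wiggling on a tiny net (2 inputs, 3 hidden neurons, 1 output) trained on XOR with plain gradient descent and a squared error loss. Everything here is an assumption picked for illustration: the dataset, the layer sizes, the learning rate of 0.5, the 5000 passes and the random seed. Real packages compute these derivatives for you.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# XOR: a tiny labeled dataset of (inputs, label) pairs
DATA = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]
N_HIDDEN = 3

rng = random.Random(42)
# Each hidden neuron: two input weights plus a bias (last entry)
hidden = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(N_HIDDEN)]
# Output neuron: one weight per hidden neuron plus a bias (last entry)
output = [rng.uniform(-1, 1) for _ in range(N_HIDDEN + 1)]

def forward(x):
    h = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in hidden]
    y = sigmoid(sum(output[j] * h[j] for j in range(N_HIDDEN)) + output[-1])
    return h, y

def mean_loss():
    """Mean squared error of the net over the whole dataset."""
    return sum((forward(x)[1] - t) ** 2 for x, t in DATA) / len(DATA)

def train_epoch(learning_rate):
    """Run all the data through once, then nudge every weight."""
    grad_hidden = [[0.0] * 3 for _ in range(N_HIDDEN)]
    grad_output = [0.0] * (N_HIDDEN + 1)
    for x, t in DATA:
        h, y = forward(x)
        # Derivative of the squared error, passed back through the output sigmoid
        delta_out = 2 * (y - t) * y * (1 - y)
        for j in range(N_HIDDEN):
            grad_output[j] += delta_out * h[j]
            # Chain rule: push the error back through output weight j
            delta_h = delta_out * output[j] * h[j] * (1 - h[j])
            grad_hidden[j][0] += delta_h * x[0]
            grad_hidden[j][1] += delta_h * x[1]
            grad_hidden[j][2] += delta_h
        grad_output[-1] += delta_out
    n = len(DATA)
    # Each weight moves by its own gradient, so the shifts are not equal
    for j in range(N_HIDDEN):
        for k in range(3):
            hidden[j][k] -= learning_rate * grad_hidden[j][k] / n
    for j in range(N_HIDDEN + 1):
        output[j] -= learning_rate * grad_output[j] / n

initial_loss = mean_loss()
for epoch in range(5000):
    train_epoch(learning_rate=0.5)
final_loss = mean_loss()
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

How low the final loss gets depends on the starting weights and the learning rate, which is exactly the kind of thing you end up tuning.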
Note that reducing the error rate is a common model objective but not always the objective. For the sake of simplicity, that’s our focus right now.
Validation / Testing
Once you’ve stopped training your model, you can run the test data set through it to see how it performs. If the error rate is horrible, then you may have overfit, or there could be a number of other issues to consider. Error rate and other standard validation approaches can be used to check how your model is performing.
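A minimal sketch of checking the error rate on a test set follows. The output values and labels are invented for the example, and `predict_label` simply picks the output node with the highest value, which is one common convention rather than the only option.

```python
def predict_label(output_nodes):
    """Pick the index of the output node with the highest value."""
    return max(range(len(output_nodes)), key=lambda i: output_nodes[i])

def error_rate(predicted_labels, true_labels):
    """Fraction of examples the net got wrong."""
    wrong = sum(p != t for p, t in zip(predicted_labels, true_labels))
    return wrong / len(true_labels)

# Hypothetical net outputs for four test examples of a 3-class problem
test_outputs = [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1], [0.3, 0.3, 0.4], [0.2, 0.5, 0.3]]
test_labels = [1, 0, 1, 1]

predictions = [predict_label(o) for o in test_outputs]
print(predictions)                           # [1, 0, 2, 1]
print(error_rate(predictions, test_labels))  # 0.25
```

Comparing this number against the error rate on the training set is a quick first check for overfitting: a much higher test error is a warning sign.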
I’ve given you a basic structure on how the neural net connects, but it’s important to understand there are variations in that structure that are better for different types of problems. Example types include:
- Feed Forward (FFN) = basic structure that passes data forward through the structure in the order of connections
  - There are no loops
  - Data moves in one direction
  - Key Applications: financial prediction, image compression, medical diagnosis and protein structure prediction
- Recurrent (RNN) = depending on when the neuron fires, data can be looped back earlier in the net structure as inputs
  - Data can become input to the same neuron, other neurons in that layer or neurons in a hidden layer prior to that layer
  - Operates on linear progression of time
  - Good for supervised learning in discrete time settings
  - Key Applications: sentiment analysis, speech recognition, NLP
- Convolutional (CNN) = uses a mixture of hidden layer types (e.g. pooling, convolutional, etc.)
  - Best structure for scaling
  - Inspired by biological processes and variant of multilayer perceptrons
  - Key Applications: computer vision, image & video recognition
- Other types to check out:
  - Recursive (RNN) = related to Recurrent but based on structure vs time
  - Restricted Boltzmann Machine (RBM) = 1st neural net to demonstrate learning of latent / hidden variables
  - Autoencoder (Auto) = RBM variant
  - Denoising Autoencoder (DAE)
  - Deep Belief Networks (DBN)
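To show how a recurrent net differs from a feed forward one, here is a stripped-down sketch of a single recurrent neuron whose previous output loops back in as an input at the next time step. The weights, the use of tanh as the activation, and the function names are assumptions for illustration.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """One time step: the previous hidden state loops back in as an input."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_sequence(xs, w_x=0.5, w_h=0.8, b=0.0):
    """Feed a sequence through the neuron one time step at a time."""
    h = 0.0  # no memory before the first input
    states = []
    for x in xs:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states

# The same inputs in a different order leave the neuron in a different state
print(run_sequence([1.0, 0.0, 0.0]))
print(run_sequence([0.0, 0.0, 1.0]))
```

A feed forward net would treat each input independently, whereas the loop above carries information forward in time, which is why this structure suits sequences such as speech and text.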
Neural nets can get complex in the structure and combined equations. It can be tricky and time-consuming to develop a useful model, and confusing to know where to start. Due to extensive research, there are already pre-baked templates for certain types of problems that you can adapt to avoid starting from scratch.
There are a couple other points to note about neural nets to point you in the right direction when developing and deploying.
In order to run a neural net to solve problems like those mentioned above, it’s important to understand certain systems engineering concepts.
The main one to spend time on is graphics processing units (GPUs). These chips are playing a key role in reducing latency (improving speed) when developing NNs. You want every advantage you can get in reducing the time it takes to make a neural net.
GPUs are highly optimized for parallel computation compared to CPUs, which is why they are popular in gaming and research. Granted, there are advances going on in CPUs that some argue are making them function more like GPUs. At the heart of this, just spend some time learning about GPUs and try running an NN on one.
I listed a few other topics in my talk that you should research further to go above and beyond single server computation of a neural net.
- Distributed Computing
- High-Performance Computing
Note that if you go down the distributed path, you are starting to get into sharing the data across nodes or splitting the model, which can be extremely tricky. Try sticking to a single server for as long as possible because you can’t beat that latency, and with where technology is, you should be able to do a lot with one computer, especially when starting out. Only go down the distributed path when the data and problem are complex enough that they can’t be contained on one server.
There are many Python packages you can use to get started with building neural nets and some that will automate most of the process for you to get you off the ground faster. Below is a list of ones I’ve come across so far.
- Machine Learning Packages
- Packages based in C with Python Bindings
- GUI with Python API
I highly recommend that you spend time exploring Theano because it’s well documented, will give you the best exposure and control of the math and structure and it’s regularly applied to solve real world problems. Many of the machine learning packages are built off of it. The machine learning packages vary in terms of how easy they are to use, and some have easy integration with GPUs.
MNIST Code Example
For the example in the talk, I used the MNIST (Modified National Institute of Standards and Technology) dataset, which is the “hello world” of neural nets. It’s handwritten digit analysis of grayscale pictures (28 x 28 pixels).
- Structure can be as simple as 784 inputs, 1000 hidden units, 10 outputs with at least 794K connections
- Based on Yann LeCun’s work at AT&T with LeNet in the 1990s
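The connection count for that simple structure can be checked with a few lines of arithmetic. This sketch assumes a fully connected net, where every node in one layer connects to every node in the next; the bias count is an extra detail I added for completeness.

```python
# 28 x 28 input pixels, 1000 hidden units, 10 digit classes
layer_sizes = [28 * 28, 1000, 10]

# Fully connected: every node in a layer links to every node in the next layer
connections = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
# One bias per hidden and output node, if biases are counted as parameters
biases = sum(layer_sizes[1:])

print(connections)           # 794000
print(connections + biases)  # 795010
```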
For reference, I’ve pulled MNIST examples for some of the Python packages into a GitHub repository as mentioned above, and you can also find it here: github.com/nyghtowl/Neural_Net_Newbies.
What’s next for NN?
Neural nets will continue to play a significant role in advancements in all the areas I’ve mentioned, especially with natural language processing and computer vision. The real key value for neural nets is in automatic feature engineering, and we will continue to see neural nets applied to help identify features, especially as richer datasets for certain problems are captured.
Additionally, combining neural net structures, as well as combining other machine learning models with NNs, will help drive these advancements. Some great research came out last fall around combining CNNs with RNNs to apply sentence-long descriptions to images.
What a number of experts have talked about as the long-term value is the potential impact with unlabeled data: finding patterns in data that we have no knowledge of, or in data we’ve labeled with our own set of biases. These types of patterns will drive advancements that may very well be akin to what we read in sci-fi as well as stuff we really haven’t thought of yet.
Reality is NNs are algorithms with the most potential to really create greater intelligence in our machines. Having technology that can reason and come up with new ideas is very possible when NNs are factored in.
If you want to get serious about researching neural nets, spend time studying linear algebra (matrix math), calculus (derivatives), existing neural net research and systems engineering (esp. GPUs and distributed systems). The slides I posted have a number of references and there are many other resources online. There are many great talks coming out post conferences that can help you tap into the latest progress. Most importantly, code and practice applying neural nets. Best way to learn is by doing.