Recurrent back-propagation networks take values from hidden layer and/or output layer units and copy them down to the input layer for use with the next input. These copied-down values are a kind of coded record of what the recent inputs to the network have been, and this gives a network a simple kind of short-term memory, possibly a little like human short-term memory. There are a lot of variations on what can be done with recurrent networks, all in all rather messy.
One of the simplest things you can do with a recurrent network is memorize a sequence. Normally there is no real use for this, since you want a network to produce results on unknown data, but the memorization problems are a good way to get used to recurrent networks.
The first memorization problem is this: suppose you want a network to memorize the two short sequences "acb" and "bcd". In the middle of both of these sequences is the letter "c". In the first case you want the network to take in "a" and output "c", then take in "c" and output "b". In the second case you want the network to take in "b" and output "c", then take in "c" and output "d". To do this the network needs a simple memory of what came before the "c".
Let the network be a 7-3-4 network where input units 1-4 and output units 1-4 stand for the letters a-d. So the codes are:
    a: 1000   b: 0100   c: 0010   d: 0001

In action, the network needs to do the following. When a is input, c must be output:
    0010       <- output layer
    hhh        <- hidden layer
    1000 stm   <- input layer

In this context, when c is input, b should be output:
    0100
    hhh
    0010 stm

For the other string, when b is input, c is output:
    0010
    hhh
    0100 stm

and when c is input, d is output:
    0001
    hhh
    0010 stm

This is easy to do if the network keeps a short-term memory of what its most recent inputs have been. Suppose we input a and the output is c:
    0010       <- output layer
    hhh        <- hidden layer
    1000 stm   <- input layer

Placing a on the input layer generates some kind of code (like a hash code) on the 3 units in the hidden layer. On the other hand, placing b on the input units will generate a different code on the hidden units. All we need to do is save these hidden unit codes and input them with a c. In one case the network will output b and in the other case it will output d. In one particular run inputting a produced:
    0 0 1 0                     <- output layer
    0.993 0.973 0.020           <- hidden layer
    1 0 0 0  0 0 0              <- input layer

When c is input the hidden layer units are copied down to input to give:
    0 1 0 0
    0.006 0.999 0.461
    0 0 1 0  0.993 0.973 0.020

For the other pattern, inputting b gave:
    0 0 1 0
    0.986 0.870 0.020
    0 1 0 0  0 0 0

Then the input of c gave:
    0 0 0 1
    0.005 0.999 0.264
    0 0 1 0  0.986 0.870 0.020
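If you want to see the copy-down mechanics in ordinary code, here is a minimal Python sketch (my own illustration, not the program's code; the weights are untrained random values, so only the flow of the data matters):

    import numpy as np

    # A sketch of the 7-3-4 network: 4 letter units plus 3
    # short-term-memory units that receive the previous hidden values.
    rng = np.random.default_rng(144)
    W_ih = rng.uniform(-1.0, 1.0, (3, 7))   # input layer  -> hidden layer
    W_ho = rng.uniform(-1.0, 1.0, (4, 3))   # hidden layer -> output layer

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def step(letter_code, stm):
        # Input layer = 4 letter units plus the 3 short-term-memory units.
        x = np.concatenate([letter_code, stm])
        hidden = sigmoid(W_ih @ x)
        output = sigmoid(W_ho @ hidden)
        return output, hidden               # the hidden code becomes the next stm

    a = np.array([1.0, 0.0, 0.0, 0.0])
    c = np.array([0.0, 0.0, 1.0, 0.0])

    stm = np.full(3, 0.5)                   # unknowns start at 0.5, like `x'
    _, stm = step(a, stm)                   # input a; save the hidden code
    out, _ = step(c, stm)                   # input c together with the memory of a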
This particular problem can be set up as follows (the file rshort.bp):
    m 7 3 4
    s 144
    ci
    t 0.2
    fic
    rt rshort.dat

where rshort.dat contains:
    1000 xxx 0010
    0010 hhh 0100
    0100 xxx 0010
    0010 hhh 0001

where the first four values on each line are the normal input, the next group of 3 are the short-term memory units, and the last four values are the desired outputs. An x stands for the value of the unknown, whose default value is 0.5, and an h means take the value from the next hidden layer unit. When the program reads the training and test files the `x' value is replaced by its current value. When an `h' value is loaded into the network the value is taken from the next unused hidden layer unit.
By the way, this simple problem does not converge particularly fast for most seed values (144 on my Linux system converges in 29 iterations) and you may need to do a number of runs before you hit on initial values that will work quickly. It works more reliably with more hidden units.
The notation for the problem is also awkward, because if you change the number of hidden units you have to make up a whole new data file with a different number of `h' symbols for each input. To get around this there is a better notation, seen in the files rshort2.bp and rshort2.dat:
    m 4+3 3 4
    s 144
    ci
    t 0.2
    fic
    rt rshort2.dat

    1000 X 0010
    0010 H 0100
    0100 X 0010
    0010 H 0001
Here in rshort2.bp you make a network with 4 regular input units plus 3 more whose values are taken from the hidden layer. In rshort2.dat the X means use as many x (unknown) values as you need to fill out the rest of the input units and H means use as many h (hidden layer unit) values as you need to fill out the rest of the input units. Now the data file never needs to be changed when you change the number of hidden layer units.
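Here is a rough sketch of how one such data line might be expanded when it is read. This is a guess at the mechanism, not the program's source, but it shows why the X/H form is independent of the number of hidden units:

    # x/h stand for single units, X/H fill whatever input units remain,
    # and h/H values come from the saved hidden layer.
    def load_input(tokens, n_inputs, hidden, unknown=0.5):
        values, h_next = [], 0
        for tok in tokens:
            if tok == "x":
                values.append(unknown)
            elif tok == "h":
                values.append(hidden[h_next]); h_next += 1
            elif tok == "X":
                values += [unknown] * (n_inputs - len(values))
            elif tok == "H":
                need = n_inputs - len(values)
                values += list(hidden[h_next:h_next + need]); h_next += need
            else:
                values.append(float(tok))
        return values

    hidden = [0.993, 0.973, 0.020]
    print(load_input(["1", "0", "0", "0", "H"], 7, hidden))
    # [1.0, 0.0, 0.0, 0.0, 0.993, 0.973, 0.02]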
Another Memorization Example
Another recurrent network included in the examples is one designed to memorize two lines of poetry (files poetry.bp and poetry.dat). The two lines were:
I the heir of all the ages in the foremost files of time
For I doubt not through all the ages one increasing purpose runs

but for the sake of making the problem simpler each word was shortened to 5 characters giving:
i the heir of all the ages in the frmst files of time
for i doubt not thru the ages one incre purpo runs

The letters were coded by taking the last 5 bits of their ASCII codes. The training data is:
    X                             * (nothing) -> i
    0100100000000000000000000
    0100100000000000000000000 H  * i -> the
    1010001000001010000000000
    1010001000001010000000000 H  * the -> heir
    0100000101010011001000000
    0100000101010011001000000 H  * heir -> of
    0111100110000000000000000
    0111100110000000000000000 H  * of -> all
    0000101100011000000000000
    0000101100011000000000000 H  * all -> the
    1010001000001010000000000
    1010001000001010000000000 H  * the -> ages
    0000100111001011001100000
    0000100111001011001100000 H  * ages -> in
    0100101110000000000000000
    0100101110000000000000000 H  * in -> the
    1010001000001010000000000
    1010001000001010000000000 H  * the -> frmst
    0011010010011011001110100
    0011010010011011001110100 H  * frmst -> files
    0011001001011000010110011
    0011001001011000010110011 H  * files -> of
    0111100110000000000000000
    0111100110000000000000000 H  * of -> time
    1010001001011010010100000
    1010001001011010010100000 H  * time -> for
    0011001111100100000000000
    0011001111100100000000000 H  * for -> i
    0100100000000000000000000
    0100100000000000000000000 H  * i -> doubt
    0010001111101010001010100
    0010001111101010001010100 H  * doubt -> not
    0111001111101000000000000
    0111001111101000000000000 H  * not -> thru
    1010001000100101010100000
    1010001000100101010100000 H  * thru -> the
    1010001000001010000000000
    1010001000001010000000000 H  * the -> ages
    0000100111001011001100000
    0000100111001011001100000 H  * ages -> one
    0111101110001010000000000
    0111101110001010000000000 H  * one -> incre
    0100101110000111001000101
    0100101110000111001000101 H  * incre -> purpo
    1000010101100101000001111
    1000010101100101000001111 H  * purpo -> runs
    1001010101011101001100000
The net can learn this sequence rather quickly with a 45-20-25 network.
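For reference, the 5-bit coding is easy to reproduce. This Python fragment (my own sketch, not part of the package) regenerates the codes shown in the training data:

    def encode_word(word, width=5):
        # Take the last 5 bits of each character's ASCII code; the space
        # used for padding conveniently codes to 00000.
        return "".join(format(ord(ch) & 0b11111, "05b") for ch in word.ljust(width))

    print(encode_word("i"))     # 0100100000000000000000000
    print(encode_word("the"))   # 1010001000001010000000000
    print(encode_word("runs"))  # 1001010101011101001100000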
As an example of what the hidden layer codes look like, here is what happened in one particular training session. The following vector was the pattern that was present on the hidden layer after "ages" at position 7 had been input:
    0.00 0.91 0.00 0.00 0.03 0.00 0.00 0.02 0.74 0.00 0.05 0.00 0.00 0.97 1.00 0.97 0.00 0.94 0.98 0.00

while it had this pattern when the "ages" at word 20 was input:
    0.00 0.92 0.00 0.00 0.00 0.00 0.00 0.00 0.92 0.00 0.08 0.00 0.00 0.03 0.73 0.79 0.00 0.75 0.60 0.00
The two patterns are much the same, yet different enough to prompt the network to produce two different responses. In the first case the network produces "in" and in the second case it produces "one".
Once upon a time I was wondering what would happen if the poetry network learned its lines and then the program was given several words from the middle of a verse. Would it pick up the sequence and be able to complete it given 1 or 2 or 3 or n words? For example, given the short sequence "for i doubt", will it be able to "get on track" and finish the verse? To test for this there is an extra pair of commands, tr and trp. Given a test set (which should be the training set) they start at every possible place in the test set, input n words and then check to see if the net produces the right answer. For this example I tried n = 3, 4, 5, 6 and 7 with the following results:
    [?!abcdefhiklmnopqrstuvw]? tr 3
    TOL:  81.82 %  ERROR: 0.022967
    [?!abcdefhiklmnopqrstuvw]? tr 4
    TOL:  90.48 %  ERROR: 0.005672
    [?!abcdefhiklmnopqrstuvw]? tr 5
    TOL:  90.00 %  ERROR: 0.005974
    [?!abcdefhiklmnopqrstuvw]? tr 6
    TOL: 100.00 %  ERROR: 0.004256
    [?!abcdefhiklmnopqrstuvw]? tr 7
    TOL: 100.00 %  ERROR: 0.004513

So after getting just 3 words the program was 81.82% right in predicting the next word to within the desired tolerance. Given 6 or 7 words it was getting them all right. The trp command does the same thing except it also prints the final output value for each of the tests made.
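In outline, the tr test amounts to the following loop (a sketch under my own names, not the program's source; predict_next and tol_ok stand in for running the trained network and for the tolerance check):

    # From every possible starting place, feed the net n words, then
    # check its prediction for the word that follows.
    def tr(words, n, predict_next, tol_ok):
        right = total = 0
        for start in range(len(words) - n):
            window = words[start:start + n]     # the n words given to the net
            target = words[start + n]           # the word it should produce
            total += 1
            if tol_ok(predict_next(window), target):
                right += 1
        return 100.0 * right / total            # the TOL percentage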
There is no provision to take values from any hidden layer units other than the ones in the first hidden layer, so if your network is made with:
    m 8 5 3 2

there is no way to copy the three values in the second hidden layer down to the input layer. (There does not seem to be much need for this.) But you can copy output layer units down to the input layer with the following notation. Suppose the network is an 8-4-4 network and you want to copy the output layer values down to the input. Use the input pattern:
    0.1 0.2 0.3 0.4 o1 o2 o3 o4

where the first four numbers are normal input values and `o1' means copy the value from output unit 1, `o2' means copy the value from output unit 2, etc.
There is an additional provision to take values from the input layer and then "shift them along" the input layer as more patterns are added. For example, you may want to take a hidden unit value or an output unit value and keep it around for more than a single time step, in effect keeping a history of what the recent values have been. (This is supposed to help learning and generalization in recurrent networks; see the paper "Learning Long-Term Dependencies is not as Difficult with NARX Recurrent Neural Networks". NARX comes from Nonlinear AutoRegressive models with eXogenous inputs.) Suppose the network is 5-5-1. Let the normal inputs be:
1 2 3 4 5 6 7 8 9and let the targets at each of these steps be:
-1 -2 -3 -4 -5 -6 -7 -8 -9and let the outputs at each of these steps be:
-1.1 -2.1 -3.1 -4.1 -5.1 -6.1 -7.1 -8.1 -9.1

So you could do this as a 1-5-1 problem, but you want to save outputs and shift them along the input units. Now suppose we're in the middle of this problem: the input was 5 and the output was -5.1. You want to keep that -5.1 around as a kind of memory of what the network has been producing; in fact you want to use it three more times before throwing it away completely. What is on the network NOW, from the pass where the input was 5 and the output was -5.1, is:
    -5.1                      <- output layer
    h h h h h                 <- hidden layer
    5 -1.1 -2.1 -3.1 -4.1     <- input layer

The next pattern you set up could look like this:
    6 i3 i4 i5 o1 -6

After this pattern is loaded and run the network will have:

    -6.1                      <- output layer
    h h h h h                 <- hidden layer
    6 -2.1 -3.1 -4.1 -5.1     <- input layer

The next pattern given to the net will be:
    7 i3 i4 i5 o1 -7

giving:

    -7.1                      <- output layer
    h h h h h                 <- hidden layer
    7 -3.1 -4.1 -5.1 -6.1     <- input layer

The next pattern given to the net will be:
    8 i3 i4 i5 o1 -8

giving:

    -8.1                      <- output layer
    h h h h h                 <- hidden layer
    8 -4.1 -5.1 -6.1 -7.1     <- input layer

And it keeps going on. Here are the inputs to the network over all these steps all lined up:
    5 -1.1 -2.1 -3.1 -4.1
    6 -2.1 -3.1 -4.1 -5.1
    7 -3.1 -4.1 -5.1 -6.1
    8 -4.1 -5.1 -6.1 -7.1

Notice how the latest output value arrives on the right and then it gets shifted one location to the left for each new pattern until the value finally "falls off" or "goes away".
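The shifting behaves like a fixed-length delay line. A small Python sketch (my own illustration, not the program's code) reproduces the input rows above:

    from collections import deque

    # Keep the last 4 outputs; each new input pattern is the new x
    # followed by the saved outputs, oldest first, newest on the right.
    history = deque([-1.1, -2.1, -3.1, -4.1], maxlen=4)  # state after input 5

    def next_pattern(x, last_output):
        history.append(last_output)   # newest arrives on the right,
        return [x] + list(history)    # oldest value "falls off" the left

    print(next_pattern(6, -5.1))      # [6, -2.1, -3.1, -4.1, -5.1]
    print(next_pattern(7, -6.1))      # [7, -3.1, -4.1, -5.1, -6.1]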
Now you're kind of stuck at the beginning because you don't have four saved output values to work with. The program looks at the data for the letters i, o and h, and if it finds any of them it sets an internal flag; then, whenever training is run or the command is to evaluate the training set, the saved values start out at the current value of the unknown, x. So you can input the following patterns and the network will have initial values to work with:
    1 i3 i4 i5 o1 -1
    2 i3 i4 i5 o1 -2
    3 i3 i4 i5 o1 -3
    4 i3 i4 i5 o1 -4
    5 i3 i4 i5 o1 -5
    6 i3 i4 i5 o1 -6
    7 i3 i4 i5 o1 -7
    8 i3 i4 i5 o1 -8
    9 i3 i4 i5 o1 -9

If you really wanted the new input (the 1, 2, 3, ...) to go on the right of the pattern instead of the left you could do this:
    i2 i3 i4 o1 1 -1
    i2 i3 i4 o1 2 -2
    i2 i3 i4 o1 3 -3
    i2 i3 i4 o1 4 -4
    i2 i3 i4 o1 5 -5
    i2 i3 i4 o1 6 -6
    i2 i3 i4 o1 7 -7
    i2 i3 i4 o1 8 -8
    i2 i3 i4 o1 9 -9

Now here's something you can't do:
    1 o1 i2 i3 i4 -1
    2 o1 i2 i3 i4 -2
    3 o1 i2 i3 i4 -3
    4 o1 i2 i3 i4 -4
    5 o1 i2 i3 i4 -5
    6 o1 i2 i3 i4 -6
    7 o1 i2 i3 i4 -7
    8 o1 i2 i3 i4 -8
    9 o1 i2 i3 i4 -9

You can see why by taking a close look at pattern 5:

    5 o1 i2 i3 i4 -5

The input units are filled from left to right, so by the time i2 is read, input unit 2 already holds the new output value that o1 just stored there rather than the old value that should have been shifted along; i2, i3 and i4 then just propagate that one new value and the history is lost.
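A few lines of Python make the failure visible; this simulates the left-to-right loading described above (my reconstruction of the behavior, not the program's source):

    # Fill the input units from left to right; ik reads whatever is in
    # input unit k AT THAT MOMENT, which may already have been rewritten.
    def load(tokens, prev_inputs, output):
        units = list(prev_inputs)
        for k, tok in enumerate(tokens):
            if tok.startswith("i"):
                units[k] = units[int(tok[1:]) - 1]   # reads the CURRENT unit value
            elif tok.startswith("o"):
                units[k] = output
            else:
                units[k] = float(tok)
        return units

    prev = [5, -1.1, -2.1, -3.1, -4.1]
    print(load(["6", "i3", "i4", "i5", "o1"], prev, -5.1))
    # [6.0, -2.1, -3.1, -4.1, -5.1]  -- works: each ik is read before unit k is rewritten
    print(load(["6", "o1", "i2", "i3", "i4"], prev, -5.1))
    # [6.0, -5.1, -5.1, -5.1, -5.1]  -- broken: i2 now sees the new output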
Remember there is a simple "command" in the program that lets you evaluate a pattern simply by typing it at the bottom of the W95 screen or in the Tcl/Tk entry box at the bottom of the screen. So for the xor problem you could type:
    0 1

and there was no problem. With recurrent networks, however, there are letters (such as i, o and h) that could well be legitimate commands, and if they came first the program would end up doing those commands. To avoid this problem use a `\' at the start of the pattern, as in this example:
    \ o1 hhhhh

THIS IS REQUIRED ONLY AT THE COMMAND LINE LEVEL. When the program has been told to read patterns (the rt, rx and tf commands) using the `\' will cause an error.
Rather than using recurrent networks to memorize sequences of letters, they are probably more useful for predicting the value of some variable at time t+1 given its value at t, t-1, t-2, .... A very simple example of this is to produce the value of sin(t+1) given a recent history of inputs to the net. Given a value of sin(t) the curve may be going up or down, and the net needs to keep track of this in order to correctly predict the next value. The setup to do this is (files rsin.bp and parts of rsin.dat):
    m 1+5 5 1
    a aol uq qp
    e 0.01
    s 1154
    ci
    rt rsin.dat

    0.00000 X 0.15636
    0.15636 H 0.30887
    0.30887 H 0.45378
    . . .
    -0.15950 H -0.00319
    -0.00319 H 0.15321

The problem converges rather rapidly with Linux and a seed of 1154; with other seed values it often fails or takes much longer.
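The rsin.dat values match sin(t) sampled in steps of 0.157, with each line pairing the current value with the next one as the target. Assuming that step size, the file's format can be regenerated with a few lines of Python (a sketch, not part of the package):

    import math

    STEP = 0.157                     # about 2*pi/40
    values = [math.sin(k * STEP) for k in range(42)]
    for k in range(len(values) - 1):
        marker = "X" if k == 0 else "H"   # unknowns only at the start
        print(f"{values[k]:.5f} {marker} {values[k + 1]:.5f}")
    # first lines printed:
    # 0.00000 X 0.15636
    # 0.15636 H 0.30887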
When you use a recurrent network to predict the next element of a time series you run into a problem: you can normally only predict THE NEXT ELEMENT of the time series and NO FURTHER. For instance, suppose you're trying to predict the stock market. Your inputs may be the current value of the market index (or the most recent change), the inflation index, the unemployment index, the prime rate, the money supply and so on. Or perhaps the best thing to do is to input changes or percent changes of these values; I shall leave that issue to the experts. At any rate you can train a network to predict the next value, and if you've used all the inputs you have (right up to today) you will get a prediction for the next value of the market index, but since you don't know what the next set of input values will be you can't go any farther. Now you could ASSUME that the input values (prime rate, inflation rate, etc.) will not change much and continue to input the last values you have, but that is risky.
To evaluate the performance of such a network you can do the following. Suppose you have a total of 81 patterns you can work with. You can train the network on the first 61 patterns and then let it predict the next value of the index. This gives you only one market index value as a prediction. Now train another network on the first 62 patterns. This gives you a second output value. Now train the network on the first 63 patterns to give another output value. If you continue doing this you get 20 predictions; average the error and you get an estimate of how well your network has learned the ebb and flow of the market. Under this plan the number of training set patterns keeps increasing. Sometimes this may be a good idea and sometimes it is better to limit the number of training set patterns. If you think the series is, as the mathematicians say, a stationary process (a series where the underlying fundamentals do not change), then you should use all your data. On the other hand, if the underlying fundamentals do change with time you may be better off using only a block of recent data.
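In outline, this growing-window scheme looks like the following sketch; train and predict_next are placeholders for runs of the network program, and patterns[k].target is whatever your actual data structure holds:

    # Walk-forward evaluation: retrain on an ever-longer prefix and
    # score the one-step-ahead prediction each time.
    def walk_forward(patterns, first_train, train, predict_next):
        errors = []
        for end in range(first_train, len(patterns)):
            net = train(patterns[:end])        # e.g. patterns[:61], then [:62], ...
            prediction = predict_next(net)     # one step past the training data
            errors.append(abs(prediction - patterns[end].target))
        return sum(errors) / len(errors)       # average absolute error

    # For a non-stationary series, train on a sliding block instead:
    #   net = train(patterns[end - window:end])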
To make such testing easy there is an additional command called predict that takes one data file and writes out a training set file and a test set file with a single pattern. It then uses the benchmarking function for training (allowing more than one network to train), writes out the results on a file called bresults.tmp, reads the results back in, sets up another run with a new data set, and so on until all the possible sets have been used. There are more details on this in the predict command documentation.
Recurrent Classification Problems
Another type of recurrent network problem is to take in a series of measurements over time, perhaps like an electrocardiogram, and then classify the series as being normal or abnormal. In this type of problem a whole series of "minor patterns" is used to give an answer for one "major pattern". It will be called a "recurrent classification problem". In order to deal with this type of problem the program must be told how many minor patterns there are within each major pattern. Suppose there are 128. Then the program must be told "f r 128", and EVERY major pattern must contain 128 minor patterns. The classification is made by averaging the output units over all the minor patterns rather than adding up votes contributed by each minor pattern. To get the run command to output statistics based on major patterns rather than minor patterns you must also do "f sr". This prints the average value of the outputs for all the minor patterns in the series. In the normal use of the program it stops when all the patterns have been learned to within the desired tolerance or when the overall tolerance has been met. With recurrent classification problems the program DOES NOT stop when the major patterns meet the closeness criterion; it goes on until all the minor patterns meet that criterion or the overall tolerance has been met.
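The decision rule for one major pattern is then simply an average over its minor-pattern outputs. A sketch, assuming the network outputs have been collected in a 128-by-n_classes array:

    import numpy as np

    def classify_major(minor_outputs):
        # minor_outputs: one row of network outputs per minor pattern.
        # Average the rows (no vote counting) and take the largest
        # average as the class.
        avg = minor_outputs.mean(axis=0)
        return int(np.argmax(avg)), avg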
A sample recurrent classification problem, sinclass.bp, is included in the set of examples. It simply has to learn the difference between the two patterns, sin(x) and sin(2x). It is set up as follows:
    * In this recurrent network example there are two sine waves being
    * presented to the network.  First for class 1 the inputs are x and
    * sin(x).  Then for class 2 the inputs are x and sin(2x).  The network
    * has to learn that only the second input value can be used to determine
    * which class each sequence belongs to.
    m 2+2 2 2
    a dd us ss
    e 0.05
    M 5
    f ir pc r 41
    rt sinclass.dat
    s7
    ci

The data looks like this:
    0.00000  0.00000 X 1    * the first pattern is class 1
    0.15700  0.15636 H 1
    0.31400  0.30887 H 1
    ...
    6.12300 -0.15950 H 1
    6.28000 -0.00319 H 1
    0.00000  0.00000 X 2    * the second pattern is class 2
    0.15700  0.30887 H 2
    0.31400  0.58753 H 2
    ...
    6.12300 -0.31492 H 2
    6.28000 -0.00637 H 2

This example converges rather rapidly and the network produces the right answers even when many minor patterns are still wrong.
A second recurrent classification problem is found in the files waves.bp and waves.dat. The object here is to classify sine waves of frequencies x, 2x, 3x, 4x and 5x; however, if this is done by using a series of values of sin(x) for class 1, a series of values of sin(2*x) for class 2, ... and a series of values of sin(5*x) for class 5, the training is difficult. A common way to treat a time series is to input the differences between successive values, so that is what is in the file waves.dat. The file waves.bp is:
    m 1+10 10 5
    a dd ur rp
    t 0.1
    u-
    f ir pc sr r 1001
    rt waves.dat
    s7
    ci

and the start of waves.dat is:
    * Note: these are differences between successive values of the series
    * freq = 1.0
    0 X 1
    0.16 H 1
    0.15 H 1
    0.14 H 1
    0.13 H 1
    0.12 H 1
    0.10 H 1
    0.08 H 1
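These entries match the first differences of sin(freq * x) with x in steps of 0.157, rounded to two places. Assuming that, the class-1 block can be regenerated as follows (a sketch, not part of the package):

    import math

    STEP = 0.157
    def difference_block(freq, n, label):
        prev = 0.0
        print(f"0 X {label}")                   # the sequence starts with unknowns
        for k in range(1, n):
            cur = math.sin(freq * k * STEP)
            print(f"{cur - prev:.2f} H {label}")
            prev = cur

    difference_block(1.0, 8, 1)   # prints 0.16, 0.15, 0.14, 0.13, 0.12, 0.10, 0.08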
Given the small values it would probably make sense to scale them up; however, the training goes quickly enough without it.
To get information on the training set patterns for a recurrent classification problem there is a line of buttons in the P menu-window, or there is the typed u command. (Sorry, I've run out of great one and two letter command names! Maybe someday this will be changed.) At any rate, the u command is the analog of the p command.
To get the information on test set patterns there is a line of buttons in the P menu-window, or there is the v command.
NOTE: With a conventional network you can ask for the output for a particular pattern number, say pattern number 5, by using:

    p 5

and the program loads pattern 5 onto the network and reports the output values. The program works the same way with a recurrent network, BUT the answer will not be right because it depends on hidden unit values that are not available when the single pattern is evaluated in isolation. To get a single value within the series use the p command without a pattern number; it DOES initialize the network correctly and DOES work through all the patterns in order. The same situation occurs if you want to evaluate a single test pattern with the t command.
NOTE: If your test set is a continuation of the training set then to get the correct values for the test set you must evaluate the training set first.