Copyright 1996-2001 by Donald R. Tveter, commercial use is prohibited. Short quotations are permitted if proper attribution is given. This material CAN be posted elsewhere on the net if the posted files are not altered in any way, but please let me know where it is posted. The main location is: http://dontveter.com/bpr/bpr.html
* "Constructive Learning of Recurrent Neural Networks: Limitations of Recurrent Cascade Correlation and a Simple Solution" by C. Lee Giles, Dong Chen, Hsing-Hen Chen, Yee-Chung Lee and Mark W. Goudreau available from NEC Research Institute, New Jersey. The solution to the problems of conventional RCC is to feedback output(s) into all previously frozen hidden layer unit(s). This solution while fixing the problem with RCC can slow down it's convergence for large networks.
* "Learning Long-Term Dependencies is not as Difficult with NARX Recurrent Neural Networks" by Tsungnan Lin, Bill G. Horne, Peter Tino and C. Lee Giles available from NEC Research Institute, New Jersey. If you've already got a recurrent network program then making the changes to get a NARX network will probably be easy. I know of somone who tried this on a stock market prediction problem where it did not improve results but as with all backprop methods you never know when they do work. If try this and you can say anything good or bad please let me know.
* "SuperSAB: Fast Adaptive Back Propagation with Good Scaling Properties" by Tom Tollenaere in Neural Networks, volume 3, pages 561-573. CAUTION: there were some obvious typographical errors in the original algorithm.
* "Speed Improvement of the Back-Propagation on Current Generation Workstations" by D. Anguita, G. Parodi and R. Zunino in Proceedings of the World Congress on Neural Networking, Portland, Oregon, 1993, volume 1 pages 165-168, Lawrence Erlbaum/INNS Press, 1993.
* "Why Two Hidden Layers are Better than One" by Daniel L. Chester in IJCNN-90-WASH-DC, Lawrence Erlbaum, 1990, volume 1, pp265-268. The bottom line here is:
The problem with a single hidden layer is that the neurons interact with each other globally, making it difficult to improve an approximation at one point without worsening it elsewhere....
(With 2 hidden layers) the effects of the neurons are isolated and the approximations in different regions can be adjusted independently of each other, much as is done in the Finite Element Method for solving partial differential equations or the spline technique for fitting curves.
* "Operational Experience with a Neural Network in the Detection of Explosives in Checked Airline Luggage" by Patrick M. Shea and Felix Liu in IJCNN San Diego, June 17-21 1990, IEEE Press, volume 2, Press, pp 175-178. The authors report slightly better results with two hidden layers but it took much longer to train the network.
* "Neural Networks for Bond Rating Improved by Multiple Hidden Layers" by Alvin J. Surkan and J. Clay Singleton in IJCNN San Diego June 17-21, 1990, volume 2, IEEE Press, pp 157-162.
* "Backpropagation Neural Networks with One and Two Hidden Layers" by Jacques de Villiers and Etienne Barnard in IEEE Transactions on Neural Networks, vol 4, no 1, January 1992, pp 136-141. The bottom line here was:
The above points lead us to conclude that there seems to be no reason to use four layer networks in preference to three layer nets in all but the most esoteric applications.
* "A Better Activation Function for Artificial Neural Networks" by David Elliott, available by ftp from the University of Maryland.
* "A Tree-Structured Neural Network for Real-Time Adaptive Control" by Alois P. Heinz available by http from the University of Freiburg, Germany. Note that while the paper mentions the approximation to the sigmoid the paper has nothing to do with activation functions.
* "An Introduction to the Conjugate Gradient Method Without the Agonizing Pain" by Jonathan Richard Shewchuk. The entire uncompressed postscript file is about 1.7M and is available by http from Carnegie-Mellon. It is also available by ftp from Carnegie-Mellon. Its also available by ftp in four parts ( part 1, part 2, part 3, part 4 ) by ftp from Carnegie-Mellon. This account is fairly readable but you still need to know calculus and some elementary linear algebra.
* Patrick van der Smagt has a couple of articles online that are much the same as his article, "Minimisation Methods for Training Feedforward Neural Networks" in Neural Networks, volume 7, number 1, pp 1-11, 1994. One version is available from the German Aerospace Research Establishment but contains less theory than the Neural Networks article. A newer version is chapter 2 of his thesis, available from the German Aerospace Research Establishment or from the University of Amsterdam. The references of the latter paper are not included, but they are available in another chapter of his thesis from the German Aerospace Research Establishment and in his thesis from the University of Amsterdam. His C software is available from the University of Amsterdam. I have never tried it; if it is any good please let me know. As to the NN article, I noticed one test mentioned toward the end that could be redone. Patrick used an adaptive learning rate algorithm by Silva and Almeida on the sin(x) * cos(2*x) problem and reported that it took on the order of 2 million function evaluations to meet the tolerance. A problem with that result is that it used data that ran from 0 to 2 * pi, a range that certainly does hurt convergence. When I tried this problem with rprop and data symmetric around 0 (see the sketch below) it converged fairly reliably in about 180,000 function evaluations. This is still MUCH WORSE than the result for the CG method but it does go to show how careful you have to be when comparing methods.
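If you want to try the comparison yourself, the data-range point is easy to set up; the sample count and intervals below are just my own illustrative choices.

    # The same target function sampled on [0, 2*pi] versus a range symmetric around 0.
    import numpy as np

    def target(x):
        return np.sin(x) * np.cos(2.0 * x)

    n = 100
    x_shifted   = np.linspace(0.0, 2.0 * np.pi, n)   # the range that hurts convergence
    x_symmetric = np.linspace(-np.pi, np.pi, n)      # inputs centered on zero

    # either list of (input, target) pairs can be fed to your favorite trainer
    train_shifted   = list(zip(x_shifted, target(x_shifted)))
    train_symmetric = list(zip(x_symmetric, target(x_symmetric)))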
* "Dynamic Node Creation in Backpropagation Networks" by Timur Ash in Connection Science volume 1, pages 365-375, 1989. My experience with this is that it will get you out of a local minimum in artificial problems like xor but it does not seem to be useful in real world problems and in fact it may hurt. Moreover, it tends to degenerate to just adding a hidden node at some regular interval.
* "Efficiency of Modified Backpropagation and Optimization Methods on a Real-world Medical Problem" by Dogan Alpsan, Michael Towsey, Ozcan Ozdamar, Ah Chung Tsoi and Dhanjoo N. Ghista in Neural Networks, volume 8, number 6, pp 945-962. The authors tried various methods to speed up and improve generalization on their problem, one set of experiments simply trained the networks to within a given tolerance while in the second set they trained to a local minimum. In the first set plain backprop did very well. In the second set the more sophisticated algorithms found local minima that backprop could not yet ultimately this did not improve the performance on the test set.
* "The Interchangeability of Learning Rate and Gain in Backpropagation Neural Networks" by G. Thimm, P. Moerland and E. Fiesler available from Dalle Molle Institute for Perceptive Artificial Intelligence, Switzerland. This little article shows how gain is related to the learning rate and size of the initial weights.
* "What Size Neural Network Gives Optimal Generalization? Convergence Properties of Backpropagation" by Steve Lawrence, C. Lee Giles and Ah Chung Tsoi available from NEC Research Institute, New Jersey This is a 19 page technical report containing experiments on networks that show that minimizing the number of hidden units does not always lead to the best generalization. It includes pointers to other such articles and explanations of why this is so.
* "An Alternative Choice of Output in Neural Network for the Generation of Trading Signals in a Financial Market" by Charles Lam and Lam Kin at The University of Hong Kong. This html paper with gifs explores another method for predicting the ups and downs of an individual stock. It includes references to many other papers on the subject and to Charles Lam's thesis. His online thesis is available in HTML and Microsoft Word format unfortunately it is not available in postscript.
* "Back Propagation Family Album" by Jondarr Gibb, from . Macquarie University, Australia is a 72 page postscript file describing variations on backprop. Ultimately there are only 48 pages of text, the rest consists of references, appendicies, title page, etc.. The text is fairly easy reading and the appendicies include pseudo code for quickprop and the scaled conjugate gradient algorithms.
* "Capabilities of a Four-Layered Feedforward Neural Network: Four Layers Versus Three" by Shin'ichi Tamura and Masahiko Tateishi in IEEE Transactions on Neural Networks, Vol 8, No 2, March 1997. This paper gives a proof that most normal functions can be approximated as closely as you like with two hidden layers using only N/2+3 hidden layer units where N is the number of patterns. This is an improvement over the result for a single hidden layer where N-1 units are needed. The implication is that 4 layer networks MAY give better results than 3 layer networks because fewer units and therefore fewer connections are needed. This paper is only a proof and no experimental evidence is provided. The paper is available online for IEEE members.