* "Faster Learning Variations on Back-Propagation: An Empirical Study" by Scott Fahlman from the Ohio State neuroprose archive or from Carnegie-Mellon. This paper shows a series of experiments to try to improve backprop and finishes with quickprop which may be one of the best ways to speed up the training of a network.
* J. R. Chen and P. Mars, "Stepsize Variation Methods for Accelerating the Back-Propagation Algorithm", IJCNN-90-WASH-DC, volume 1, pp. 601-604, Lawrence Erlbaum, 1990.
* "Speeding up Backpropagation Algorithms by using Cross-Entropy combined with Pattern Normalization" by Merten Joost and Wolfram Schiffmann", from University of Koblenz-Landau, Germany The authors show that by using the cross-entropy error measure on classification problems rather than the traditional sum squared error the net effect is to simply skip the derivative term in the traditional formulation. Apparently this skip the derivative term originated with the Differential Step Size method of Chen and Mars.
* "Increased Rates of Convergence" by Robert A. Jacobs, in Neural Networks, Volume 1, Number 4, 1988.
* "Learning Long-Term Dependencies is not as Difficult with NARX Recurrent Neural Networks" by Tsungnan Lin, Bill G. Horne, Peter Tino and C. Lee Giles available from NEC Research Institute, New Jersey. If you've already got a recurrent network program then making the changes to get a NARX network will probably be easy. I know of somone who tried this on a stock market prediction problem where it did not improve results but as with all backprop methods you never know when they do work. If try this and you can say anything good or bad please let me know.
* "Dynamic Node Creation in Backpropagation Networks", Institute for Cognitive Science, University of California, San Diego, ICS Report 8901, February 1989 (and republished in Connection Science, volume 1, pages 365-375, 1989.).
* "A Direct Adaptive Method for Faster Back-Propagation Learning: The RPROP Algorithm" by Martin Riedmiller and Heinrich Braun from The University of Karlsruhe, Germany. This article and the following two describe Rprop, a very easy to understand method that is normally extremely fast and works well with a default set of parameters. The last article is probably the best because not only do you get Rprop explained and tested against other methods you get an explanation of those other methods as well.
* "Rprop - Description and Implementation Details" by Martin Riedmiller from The University of Karlsruhe, Germany. If you just want a short description of Rprop, this is it, only 2 pages.
* "Advanced Supervised Learning in Multi-layer Perceptrons - From Backpropagation to Adaptive Learning Algorithms" by Martin Riedmiller available from The University of Karlsruhe, Germany. This is the most worthwhile of these 3 Rprop papers and it shows rprop in a good light. In my opinion rprop may be the fastest training algorithm in more cases than any other method. (But that still does not make it the best in every case!)
* "SuperSAB: Fast Adaptive Back Propagation with Good Scaling Properties" in Neural Networks, volume 3, pages 561-573. CAUTION: there were some obvious typographical errors in the original algorithm.
* "The Cascade Correlation Learning Algorithm" by Scott Fahlman and Christian Lebiere from the Ohio State neuroprose archive or from Carnegie-Mellon. Cascade correlation builds a network one hidden unit at a time and it is extremely fast. It does not work very well with function approximation and there have been reports that a better version is under development that will work well on function approximation problems and be simpler as well.