Many different learning algorithms exist for training (calibrating) neural
networks, a task that can be treated as an optimization problem. The first
algorithm for training multilayer networks, backpropagation, was proposed by
Rumelhart and McClelland in 1986 [61].
This algorithm belongs to the gradient descent algorithms, which follow the
steepest descent of the error surface in the hyperspace of the adjustable
parameters to find an acceptable minimum. To improve the excessively slow
convergence of the algorithm for low gradients and to prevent trapping and
oscillations in local minima, a momentum term was introduced, which remembers
the last change of the weights [62].
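This algorithm, which smooths the descent over the error surface, is known as
backpropagation with momentum; like conjugate gradient descent, it reuses the
previous search direction in each step. The introduction of individual learning rates and momentum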
terms for each weight significantly speeds up the algorithm; variants of this
idea are known as the delta-bar-delta algorithm [63] and the SuperSAB
algorithm [64].
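
The momentum term described above can be illustrated with a minimal sketch. It
is not the implementation used in this work: the function name
`momentum_step`, the learning rate, the momentum factor and the toy quadratic
error surface are all assumptions chosen for illustration.

```python
def momentum_step(weights, grads, velocity, lr=0.1, momentum=0.9):
    """One gradient descent step with a momentum term.

    The new weight change is the steepest-descent step plus a fraction
    of the previous weight change (the "remembered" last change).
    """
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + dv for w, dv in zip(weights, new_velocity)]
    return new_weights, new_velocity


def toy_gradient_mom(w):
    """Gradient of the assumed toy error E(w) = 0.5 * (w0^2 + 10 * w1^2)."""
    return [w[0], 10.0 * w[1]]


w_mom = [1.0, 1.0]   # start point (assumed)
v_mom = [0.0, 0.0]   # no previous weight change at the start
for _ in range(200):
    w_mom, v_mom = momentum_step(w_mom, toy_gradient_mom(w_mom), v_mom,
                                 lr=0.05, momentum=0.8)
```

With `momentum=0.0` the function reduces to plain steepest descent, which makes
the effect of the remembered weight change easy to compare on the same surface.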
One of the more recent gradient algorithms is resilient propagation
(Rprop) [65],[66].
This algorithm combines the ideas of the algorithms described above. However,
the weights are adapted not according to the magnitude of the derivative but
only according to the sequence of its signs. Consequently, learning is spread
evenly over the complete network, in contrast to all the methods described
above. Another advantage is the insensitivity of the algorithm to its
parameters [59], which can all be set to the values proposed in [65],[66]
and implemented in [28],[67].
Owing to its speed, the algorithm has been used successfully in several
chemometric applications [68],[69].
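
The sign-based update can be sketched as follows. This is a simplified variant
without weight backtracking, not necessarily the exact scheme of [65],[66] or
the implementation in [28],[67]. The function names and the toy error surface
are hypothetical, and the default parameters are the values commonly quoted
for Rprop, used here as assumptions.

```python
def sign(x):
    """Return -1, 0 or +1 according to the sign of x."""
    return (x > 0) - (x < 0)


def rprop_step(weights, grads, prev_grads, steps,
               eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
    """One Rprop-style update; only the signs of the gradients are used."""
    new_w, new_prev, new_steps = [], [], []
    for w, g, pg, s in zip(weights, grads, prev_grads, steps):
        if g * pg > 0:
            # Same sign as last time: enlarge the individual step size.
            s = min(s * eta_plus, step_max)
        elif g * pg < 0:
            # Sign flipped: a minimum was jumped over, so shrink the step
            # and skip the update for this weight in this iteration.
            s = max(s * eta_minus, step_min)
            g = 0.0
        new_w.append(w - sign(g) * s)
        new_prev.append(g)
        new_steps.append(s)
    return new_w, new_prev, new_steps


def toy_gradient_rprop(w):
    """Gradient of the assumed toy error E(w) = 0.5 * (w0^2 + 1000 * w1^2)."""
    return [w[0], 1000.0 * w[1]]


w_rprop = [1.0, 1.0]
prev_rprop = [0.0, 0.0]
steps_rprop = [0.1, 0.1]   # initial step size (assumed)
for _ in range(200):
    w_rprop, prev_rprop, steps_rprop = rprop_step(
        w_rprop, toy_gradient_rprop(w_rprop), prev_rprop, steps_rprop)
```

Although the two gradient components differ by a factor of 1000, both weights
follow the same trajectory in this toy run, because only the sign of each
derivative enters the update.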

Many modern learning algorithms with very fast convergence belong to the
second-order optimization methods. These methods utilize the Hessian matrix of
second partial derivatives of the cost function, which describes how the error
depends on the weights.
The popular Levenberg-Marquardt algorithm [70] combines the gradient descent
direction with the direction calculated from the estimated inverse Hessian matrix.
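
How the Levenberg-Marquardt step blends the two directions can be sketched on
a small curve-fitting problem. The example is an illustration under stated
assumptions, not the algorithm of [70] as applied to neural networks: the model
y = a * exp(b * x), the function name `lm_fit`, the damping schedule (factor
10) and the starting values are all hypothetical. The Hessian is estimated by
the Gauss-Newton approximation J^T J.

```python
import math


def lm_fit(xs, ys, a, b, lam=1e-2, iterations=100):
    """Fit y = a * exp(b * x) with a damped Gauss-Newton (LM-style) step."""
    def residuals(a, b):
        return [a * math.exp(b * x) - y for x, y in zip(xs, ys)]

    def sse(r):
        return sum(ri * ri for ri in r)

    r = residuals(a, b)
    for _ in range(iterations):
        # Columns of the Jacobian of the residuals with respect to a and b.
        ja = [math.exp(b * x) for x in xs]
        jb = [a * x * math.exp(b * x) for x in xs]
        # Estimated Hessian H = J^T J and gradient g = J^T r.
        haa = sum(u * u for u in ja)
        hab = sum(u * v for u, v in zip(ja, jb))
        hbb = sum(v * v for v in jb)
        ga = sum(u * ri for u, ri in zip(ja, r))
        gb = sum(v * ri for v, ri in zip(jb, r))
        # Solve (H + lam * I) * delta = -g for the 2x2 case.
        maa, mbb = haa + lam, hbb + lam
        det = maa * mbb - hab * hab
        da = -(mbb * ga - hab * gb) / det
        db = -(maa * gb - hab * ga) / det
        r_new = residuals(a + da, b + db)
        if sse(r_new) < sse(r):
            # Error decreased: accept and move towards the Hessian-based step.
            a, b, r = a + da, b + db, r_new
            lam *= 0.1
        else:
            # Error increased: reject and move towards gradient descent.
            lam *= 10.0
    return a, b


xs_lm = [0.0, 0.5, 1.0, 1.5, 2.0]
ys_lm = [2.0 * math.exp(0.5 * x) for x in xs_lm]   # noise-free toy data
a_fit, b_fit = lm_fit(xs_lm, ys_lm, a=1.0, b=0.0)
```

For a large damping value `lam` the step approaches a short gradient descent
step, and for a small value it approaches the step derived from the estimated
inverse Hessian, which is the combination described above.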
The conjugate gradient algorithms also utilize second-order information but
avoid estimating the Hessian matrix. The scaled conjugate gradient algorithm
(SCG) [71] makes use of a pseudo-second derivative and is insensitive to its
parameters, which can be used as suggested in [71] and implemented
in [65],[66]. It has proven to be a very efficient algorithm with respect to
convergence speed and optimization quality [101]. Recently, genetic algorithms
have been suggested for training neural networks [72],[73].
As genetic algorithms are global optimization algorithms, they are more likely
to find the global minimum of the error surface than gradient algorithms (see
section 2.8.5 for more information about genetic algorithms).
Yet, genetic learning algorithms face several problems, such as long computing
times, several parameters to be adjusted, and difficulties in fine-tuning the
weights and biases. Hybrid algorithms, which combine genetic algorithms for the
rough optimization with gradient algorithms for the fine-tuning, overcome the
last problem. Nevertheless, the extremely high computing times render genetic
algorithms impractical for training neural networks, especially when several
neural nets have to be trained for some kind of optimization (see chapters
7 and 8).

In this work, two learning algorithms are used. The Rprop algorithm is applied
to the neural networks used in chapter 3. For all other networks, SCG was used,
as this algorithm shows a very fast initial convergence, which allows the
number of training cycles to be reduced during network optimization processes.
All networks were trained with a maximum of 2000 learning steps, and a method
called early stopping was applied. This technique helps to prevent the
so-called overtraining effect [74].
An overtrained neural network learns a small calibration data set by heart:
it learns the noise in the data instead of generalizing the functional
relationship of the data. For more details of overtraining, see section
2.8. Early stopping was implemented by monitoring the calibration with
a crossvalidation procedure (see section 2.4). The training
is stopped when the crossvalidation error of the calibration data starts
to rise, as the net may start losing its generalization ability at this point.
Early stopping is not a complete solution for preventing overtraining, as
stopping the training prematurely also stops the calibration of the functional
relationship behind the data. Early stopping is only a tool that should be
used in combination with the network optimization procedures (see section
2.8.2), and it becomes less important with more optimized networks.
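
The early stopping rule described above can be sketched as a simple loop. The
sketch is generic and does not reproduce the implementation used in this work:
the function name `early_stopping`, the two callbacks and the `patience`
parameter are hypothetical, and the synthetic error curve merely stands in for
the crossvalidation error. Only the limit of 2000 learning steps is taken from
the text.

```python
def early_stopping(train_step, monitored_error, max_steps=2000, patience=1):
    """Train until the monitored error stops improving.

    train_step      -- callable performing one learning step (assumed)
    monitored_error -- callable returning the current monitored error,
                       e.g. the crossvalidation error (assumed)
    patience        -- number of non-improving steps tolerated (assumed)
    """
    best_error = float("inf")
    best_step = 0
    bad_steps = 0
    for step in range(1, max_steps + 1):
        train_step()
        error = monitored_error()
        if error < best_error:
            best_error, best_step, bad_steps = error, step, 0
        else:
            bad_steps += 1
            if bad_steps >= patience:
                break   # the error started to rise: stop the training
    return best_step, best_error


# Synthetic stand-in for a network: the monitored error falls until
# step 30 and rises afterwards, as it would when overtraining sets in.
state = {"t": 0}


def fake_train_step():
    state["t"] += 1


def fake_monitored_error():
    return 0.1 + ((state["t"] - 30) / 30.0) ** 2


best_step, best_error = early_stopping(fake_train_step, fake_monitored_error)
```

With `patience=1` the loop stops at the first increase of the monitored error,
which matches the rule stated in the text; a real error curve is noisy, so a
larger value may be a reasonable choice in practice.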