2.7.3. Training of Neural Networks (Dr. Frank Dieterle)

2.7.3. Training of Neural Networks

Many different learning algorithms exist for the training (calibration) of neural networks, which can be regarded as an optimization problem. The first algorithm for the training of multilayer networks was proposed by Rumelhart and McClelland in 1986 [61]. This algorithm belongs to the gradient descent algorithms, which follow the steepest descent of the error surface in the hyperspace of the adjustable parameters to find an acceptable minimum. To improve the very slow convergence of the algorithm for small gradients and to reduce trapping in local minima and oscillations, a momentum term was introduced, which remembers the last change of the weights [62]. This variant, which smooths the path over the error surface, is usually called gradient descent with momentum; it is sometimes loosely referred to as conjugate gradient descent, but it should not be confused with the conjugate gradient methods described below. The introduction of individual learning rates and momentum terms for each weight significantly speeds up the algorithm; this approach is known as the delta-bar-delta algorithm [63] and the SuperSAB algorithm [64]. One of the most modern gradient algorithms is resilient propagation (Rprop) [65],[66]. This algorithm combines the ideas of the algorithms described before. Yet, the weights are not adapted according to the magnitude of the derivative but only according to the sequence of the signs of the derivative. Consequently, the learning is spread equally over the complete network, in contrast to all the other methods described before. Another advantage is the insensitivity of the algorithm to its parameters [59], which can all be set to the values proposed in [65],[66] and implemented in [28],[67]. Owing to its speed, the algorithm has been used successfully in several chemometric applications [68],[69].
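The sign-based update of Rprop can be illustrated with a minimal Python sketch. It is a simplified variant without weight backtracking and may differ in detail from the formulation in [65],[66]. The function names, the toy error surface and the parameter values (increase factor 1.2, decrease factor 0.5, step limits 1e-6 and 50, initial step 0.1) are assumptions chosen for illustration; the values are commonly quoted defaults and are not taken from this work.

```python
def sign(x):
    """Return -1, 0 or +1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def rprop_step(weights, grads, prev_grads, steps,
               eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
    """One Rprop update: only the sign of each derivative is used."""
    new_weights, new_prev, new_steps = [], [], []
    for w, g, g_prev, s in zip(weights, grads, prev_grads, steps):
        if g * g_prev > 0:
            # Same sign as in the last step: accelerate.
            s = min(s * eta_plus, step_max)
        elif g * g_prev < 0:
            # Sign change: a minimum was overshot, so shrink the step
            # and skip the update for this weight in this iteration.
            s = max(s * eta_minus, step_min)
            g = 0.0
        new_weights.append(w - sign(g) * s)
        new_prev.append(g)
        new_steps.append(s)
    return new_weights, new_prev, new_steps


def minimize_rprop(grad_fn, weights, n_iter=200, initial_step=0.1):
    """Run n_iter Rprop iterations starting from the given weights."""
    prev_grads = [0.0] * len(weights)
    steps = [initial_step] * len(weights)
    for _ in range(n_iter):
        grads = grad_fn(weights)
        weights, prev_grads, steps = rprop_step(weights, grads, prev_grads, steps)
    return weights


def example_gradient(w):
    """Gradient of the toy error E = (w0 - 3)^2 + 100 * (w1 + 1)^2."""
    return [2.0 * (w[0] - 3.0), 200.0 * (w[1] + 1.0)]


if __name__ == "__main__":
    print(minimize_rprop(example_gradient, [0.0, 0.0]))
```

In the toy error surface the two weights differ by a factor of 100 in curvature. Because the step sizes depend only on the signs of the derivatives, both weights are adapted at a comparable pace, which is the behaviour described above as learning being spread equally over the network.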

Many modern learning algorithms with very fast convergence belong to the second-order optimization methods, which utilize the Hessian matrix of second partial derivatives of the cost function; the cost function describes how the error depends on the weights. The popular Levenberg-Marquardt algorithm [70] combines the gradient descent direction with the direction calculated from the estimated inverse Hessian matrix. The conjugate gradient algorithms also utilize second-order information, but they avoid the explicit estimation of the Hessian matrix. The scaled conjugate gradient algorithm (SCG) [71] makes use of a pseudo-second derivative, is insensitive to its parameters, which can be used as suggested in [71] and implemented in [65],[66], and has proven to be a very efficient algorithm with respect to convergence speed and optimization quality [101]. Recently, genetic algorithms have been suggested for training neural networks [72],[73]. As genetic algorithms are global optimization algorithms, they are more likely to find the global minimum of the error surface than gradient algorithms (see section 2.8.5 for more information about genetic algorithms). Yet, genetic learning algorithms face several problems, such as long computing times, several parameters that have to be adjusted and difficulties in fine-tuning the weights and biases. To overcome the last problem, hybrid algorithms have been created, which combine genetic algorithms for the rough optimization with gradient algorithms for the fine-tuning. Nevertheless, the extremely long computing times render genetic algorithms unusable in practice for the training of neural networks, especially when several neural nets have to be trained for some kind of optimization (see chapters 7 and 8).
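The way the Levenberg-Marquardt algorithm blends both directions can be written compactly in the common textbook form for a sum-of-squares cost function. The notation is an assumption introduced here and does not appear in the original text: w denotes the weight vector, e the vector of residuals, J the Jacobian matrix of the residuals with respect to the weights and μ a non-negative damping parameter.

```latex
\begin{align}
E(\mathbf{w}) &= \tfrac{1}{2} \sum_{i} e_i(\mathbf{w})^2,
  \qquad \nabla E = \mathbf{J}^{\mathsf{T}} \mathbf{e},
  \qquad \mathbf{H} \approx \mathbf{J}^{\mathsf{T}} \mathbf{J} \\
\Delta \mathbf{w} &= -\left( \mathbf{J}^{\mathsf{T}} \mathbf{J}
  + \mu \mathbf{I} \right)^{-1} \mathbf{J}^{\mathsf{T}} \mathbf{e} \\
\mu \to 0:\quad \Delta \mathbf{w} &\to
  -\left( \mathbf{J}^{\mathsf{T}} \mathbf{J} \right)^{-1}
  \mathbf{J}^{\mathsf{T}} \mathbf{e}
  \quad \text{(Gauss-Newton step)} \\
\mu \gg \lVert \mathbf{J}^{\mathsf{T}} \mathbf{J} \rVert:\quad
  \Delta \mathbf{w} &\approx -\tfrac{1}{\mu} \mathbf{J}^{\mathsf{T}} \mathbf{e}
  \quad \text{(gradient descent step with learning rate } 1/\mu \text{)}
\end{align}
```

In typical implementations μ is decreased after a step that reduces the error and increased otherwise, so that the algorithm moves gradually from gradient descent behaviour far from a minimum to second-order behaviour close to it.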

In this work, two learning algorithms are used. The Rprop algorithm is applied to the neural networks used in chapter 3. For all other networks SCG was used, as this algorithm shows a very fast initial convergence, which allows the number of training cycles to be reduced during network optimization processes. All networks were trained with a maximum of 2000 learning steps, and a method called early stopping was applied. This technique helps to counteract the so-called overtraining effect [74]. An overtrained neural network learns a small calibration data set by heart. In doing so, it learns the noise in the data instead of generalizing the functional relationship behind the data. For more details on overtraining, see section 2.8. Early stopping was implemented by monitoring the calibration data with a crossvalidation procedure (see section 2.4). The training is stopped when the crossvalidation error of the calibration data starts to increase, as the net may start losing its generalization ability at this point. Early stopping is not an ultimate solution for preventing overtraining, as stopping the training prematurely also stops the calibration of the functional relationship behind the data. Early stopping is only a tool, which should be used in combination with the network optimization procedures (see section 2.8.2) and which becomes less important the more the networks are optimized.
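The early stopping rule described above can be sketched in a few lines of Python. This is a hedged illustration and not the implementation used in this work: the function train_with_early_stopping, the patience parameter and the ToyNet class with its artificial validation curve are hypothetical names and data introduced for the example. Only the limit of 2000 learning steps is taken from the text.

```python
def train_with_early_stopping(train_step, validation_error,
                              max_steps=2000, patience=1):
    """Train until the monitored validation error stops improving.

    train_step:       callable performing one learning step
    validation_error: callable returning the current crossvalidation error
    patience:         number of non-improving steps tolerated (assumption)
    Returns (best_step, best_error, last_step).
    """
    best_error = float("inf")
    best_step = 0
    step = 0
    steps_without_improvement = 0
    for step in range(1, max_steps + 1):
        train_step()
        error = validation_error()
        if error < best_error:
            # A real implementation would also save the weights here.
            best_error, best_step = error, step
            steps_without_improvement = 0
        else:
            steps_without_improvement += 1
            if steps_without_improvement >= patience:
                break  # validation error started to rise
    return best_step, best_error, step


class ToyNet:
    """Stand-in for a network; its validation error is minimal at step 30."""

    def __init__(self):
        self.step = 0

    def train_step(self):
        self.step += 1

    def validation_error(self):
        return (self.step - 30) ** 2 / 100 + 1.0


if __name__ == "__main__":
    net = ToyNet()
    print(train_with_early_stopping(net.train_step, net.validation_error))
```

With patience set to 1 the loop stops at the first increase of the validation error. A larger value tolerates short fluctuations of the error curve at the cost of additional learning steps.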
