Artificial Neural Network Overview

An artificial neural network is a collection of connected models neurons. Taken one at a time each neuron is rather simple. As a collection however, a group of neurons is capable of producing complex results. In the following sections I will briefly summarize a mathematical model of a neuron, neuron layer, and neural network before discussing the types of behavior achievable from a neural network. Finally, I will conclude with a short description of the program included in this lesson so you can form networks that are tailored to your class.

Models

The models presented in this section appear fairly difficult mathematically. However, they eventually boil down to just multiplication and addition. The use of matrices and vectors simplifies the notation but is not absolutely required for this application.

Neuron Model

A model of a neuron has three basic parts: input weights, a summer, and an output function. The input weights scale values used as inputs to the neuron, the summer adds all the scaled values together, and the output function produces the final output of the neuron. Often, one additional input, known as the bias is added to the system. If a bias is used, it can be represented by a weight with a constant input of one. This description is laid out visually below.

Where I1, I2, and I3 are the inputs, W1, W2, and W3 are the weights, B is the bias, x is an intermediate output, and a is final output. The equation for a is given by where f could be any function. Most often, f is the sign of the argument (i.e. 1 if the argument is positive and -1 if the argument is negative), linear (i.e. the output is simply the input times some constant factor), or some complex curve used in function matching (not needed here). For this model we will use the first case where f is the sign of the argument for two reasons: it closely matches the ‘all or nothing’ property seen in biological neurons and it is fairly easy it implement.

When artificial neurons are implemented, vectors are commonly used to represent the inputs and the weights so the first of two brief reviews of linear algebra is appropriate here. The dot product of two vectors and is given by . Using this notation the output is simplified to where all the inputs are contained in and all the weights are contained in .

Neuron Layer

In a neuron layer each input is tied to every neuron and each neuron produces its own output. This can be represented mathematically by the following series of equations:

. . .

NOTE: In general these functions may be different, however, I will take them to be the sign of the argument from now on.

And we will take our second digression into linear algebra. We need to recall that to perform the operation of matrix multiplication you take each column of the second matrix and perform the dot product operation with each row of the first matrix to produce each element in the result. For example the dot product of the ith column of the second matrix and the jth row of the first matrix results in the (j,i) element of the result. If the second matrix is only one column, then the result is also one column.

Keeping matrix multiplication in mind, we append the weights so that each row of a matrix represents the weights of on neuron. Now, representing the input vector and the biases as one column matrices, we can simplify the above notation to:

which is the final form of the mathematical representation of one layer of artificial neurons.

Neural Network

A neural network is simply a collection of neuron layers where the output of each previous layer becomes the input to the next layer. So, for example, the inputs to layer two are the outputs of layer one. In this exercise we are keeping it relatively simple by not having feedback (i.e. output from layer n being input for some previous layer). To mathematically represent the neural network we only have to chain together the equations. The finished equation for the three layer network in this equation is given by:

Neural Network Behavior

Although transistor now switch in as little as 0.000000000001 seconds and biological neurons take about .001 seconds to respond we have not been able to approach the complexity or the overall speed of the brain because of, in part, the large number (approximately 100,000,000,000) neurons that are highly connected (approximately 10,000 connections per neuron). Although not as advanced as biologic brains, artificial neural networks are still perform many important functions in a wide range of applications including sensing, controls, pattern recognition, and categorization. Generally, networks (including our brains) are trained to achieve a desired result. The training mechanisms and rules are beyond the scope of this paper, however it is worth mentioning that generally good behavior is rewarded while bad behavior is punished. That is to say that when a network performs well it is modified only slightly (if at all) and when it performs poorly larger modifications are made. As a final thought on neural network behavior, it is worth noting that if the output function of the neurons are all linear functions, the network is reducible to a one layer network. In other words, to have a useful network of more than one layer we must us a function like the sigmoid (an s shaped curve), the sign function we used above, a linear function that saturates, or any other non-line shaped curve.

Matlab Code

This section covers the parameters in my Matlab code that you might choose to modify if you decide to create a network with inputs and outputs other than what have been already documented in this lesson. Before using my code you should be aware that it was not written to solve general neural network problems, but rather to find a network by randomly trying values. This means that it could loop forever even if a solution to your inputs and outputs exists. If you do not get a good result after a few minutes you may want to stop the execution and change your parameters. Finally, I will not claim that I have worked all bugs out of this program so you should check your results carefully before executing them in a classroom setting.

p1, p2, and p3 are input patterns for three different inputs. Each input pattern consists of three elements pertaining to different attributes of the input. For example in my lesson I used redness, roundness, and softness. Here, for instance, a one in the first position means that an object is red while a zero indicates that it is not red.

a1, a2, and a3 are output patterns. They need to be initialized to be incorrect (that way the program enters the loop rather than bypasses it). The second argument of the conditionals for the loop should be the desired results. In my case, I chose to have one neuron in the last layer be an indicator for each object. When that object was used as an input for the network, that neuron would end up being a one while the other neurons in the last layer would be negative one (if everybody did their math correctly). More explicitly, when the first element of a1 is not a positive one then it is wrong and I want to do the loop again. In a similar manner, when the second element of a1 is not a negative one it is wrong and I want to do the loop again. And the same for the rest of the outputs.

Note that there is one known bug involving the termination of non-terminating decimals (in binary 0.1 is non-terminating). It is possible that a 0.0000 is taken to be positive rather than zero.

Host Distance Estimation Using Artificial Neural Network

Machine Learning

Jianing Hu {hujn@cs.cmu.edu}

1. Introduction

It is an emerging trend that Internet content providers are using multiple hosts to provide the same content, in order to enhance availability and reliability. In such cases it is desirable for the user to know the distances, in terms of metrics such as latency or bandwidth, to other hosts so that he can choose the "nearest" host to access. For example, the user can select to access the nearest of multiple equal-content web servers to get the quickest response.

There are already several proposals on how to provide such a service based on some real-world measurements like those mentioned in [1] and [2]. However, real-world measurement could sometimes be expensive to get. In this project I propose another approach to provide host proximity service without measuring[1]. I will try train a neural network to estimate the distances to a host instead of measuring it.

The training data comes from the "ping" utility. The input of the neural network will be the IP address of the destination host and the time-of-day and weekday. The output will be the estimated latency from a certain host (in this case my machine). Presumably, IP address is a good indicator of latency. For example it's pretty safe to guess that a host at 128.2.*.* is much closer than one at 168.160.*.*. Time-of-day and weekday are chosen as input because they often reflect different traffic patterns and they probably affect the latency.

Some early experiments showed that the function from the chosen parameters to latency is highly non-linear. I tried several ANN structures and input/output encoding schemes. Some of them turned out to be effective.

2. My approach

2.1 Data Collection

There is no ready-to-use data set for this project so I have to compile my own data set. The first step is to get a list of hosts to measure. Because of the overwhelming number of WWW servers and because one of the typical usage of this service is to select from multiple equal content WWW servers, I decided to collect data only for WWW servers. I wrote a helper program that collects WWW host names by following links from several major search engines. I collected 1000 host names for this project. This number is so decided that pinging all hosts takes roughly an hour. Thus provides fine granularity for time-of-day measurement. Further granularity doesn't seem to be necessary.

A script was written that calls the ping utility program upon those hosts continuously. The output of ping, together with the time when ping is called, is stored in a file. Thus how the raw data is collected.

To translate the raw data to a convenient format for ANN training, I wrote a preprocessor that parses the raw data and translates into the following format:

<1st> ... <4th>

represents the weekday by integers from 0 to 6, corresponding to Sunday to Saturday, respectively. and is the time when this measurement takes place. No finer granularity (e.g., second) is used since I believe that finer granularity of time doesn't affect the latency much and even if it did, it's very hard to capture its influence by the data I can get. The measured latency is the average of the four tries of a pinging. If pinging of a host fails in all of its four tries, the raw data of this pinging will be discarded. If some but not all tries fail, the failed tries will be assigned a relatively big value (tentatively 1500ms) and the average will be computed as the actual measured latency.

2.2 Design of ANN

I chose to use back-propagation network with sigmoid as computing unit. Back-propagation network is one of the most widely used structures and sigmoid is one of the most widely used computing units. The advantages of using them include existing code support and clear understanding of their functions and properties. The basic structure of the ANN I used is a two-layer structure. I tried several different structures in experiments and compared their performance. Those structures differ in the number of hidden units and input/output encoding schemes. I'll present in this section the structures of five of the ANNs I used and briefly talk about some other structures I've tried. In the next section I'll present the experimental results. The six ANN structures presented below all have two layers. Each hidden unit is fed by all input units. There is one output unit in each structure, fed by all the hidden units.

The first structure is shown in figure one.

The weekday unit encodes the week day information. It linearly scales weekday from 0-6 to 0.0-1.0, namely, Sunday is encoded as 0.0, Monday as 0.166667, ..., Saturday as 1.0. The time unit encodes time using the formula: time = (hour * 60 + minute)/1439. It linearly scales time to 0.0-1.0. Thus 0:00am is encoded as 0.0 and 23:59pm is encoded as 1.0. IP1 through IP4 encode the IP address, each unit encodes one byte. IPn encodes the nth (starts from 1) byte of the IP address. They also do linear scaling to 0.0-1.0, so a value of 0 is encoded as 0.0 and a value of 255 is encoded as 1.

The measured latency / output is also linearly scaled to 0.0-1.0. Since the maximum latency is chosen to be 1500 ms, a latency of 1500 ms will be encoded as 1.0, a latency of 0 ms will be encoded as 0.0. Therefore an output of 0.0 corresponds to an estimated latency of 0 ms and an output of 1.0 corresponds to an estimated latency of 1500ms, and the estimated latency corresponding to an output value between 0.0 and 1.0 can be easily computed by linear interpolation. In practice it's often desirable not to encode output to a range of 0.0-1.0 but to a smaller range like 0.1-0.9, to avoid long training time. However, in our case the measured latency will never be 0ms or 1500ms, so the encoded latency will never be 0.0 or 1.0. In this case it's actually safe to use a directly linear scaling as mentioned above and an experiment with a revised linear scaling scheme didn't show much difference in performance.

The second structure is same as the first one in unit layout. It only differs from the first one by the encoding of the measured latency / output. The latency is encoded as log(latency)/log(1500). Note that the minimum latency measured is 10 ms, so this formula only gives positive results less than 1.0. I will talk about why I used this encoding scheme in the next section, since it has to do with the evaluation criteria.

The third structure differs from the second one in that it uses 8 input units for each byte of the IP address, each unit corresponding to one bit of the particular byte. The unit is set to 1.0 if the corresponding bit is 1, and 0.0 if the corresponding bit is 0. This change of encoding is based on the observation that the target function is so non-linear that a little difference in the IP addresses could result in great difference in the estimated latencies. Thus I tried to enlarge the differences between IP addresses by using this encoding scheme. The experiment results showed that this scheme is rather effective.

The forth structure goes further on changing the input encoding scheme. It uses two input units to encode time, instead of only one unit used in previous structures. The two input units encode hour and minute of time, respectively, using linear scaling.

The fifth structure differs from the forth one in that it uses 8 hidden units. They are the same in all other aspects.

The sixth structure differs from the forth one only in output encoding. It uses the linear scaling encoding as that used in structure 1.

Other structures used include multi-output-unit, multi-layer(more than two), and partial input-hidden unit connection, namely not all the input units feed in each hidden unit. None of those structures gave outstanding performance and I'll omit them in the following discussion.

3. Experimental results and discussion

In this section I present the experimental results of the five ANN structures mentioned in the last section. The only performance metric I used is accuracy, I use two kinds of error rates to measure it. As argued by the authors of [3], it often suffices to obtain accuracy within a factor of 2. Therefore an error is defined as an estimation that is out of the range of a factor of 2 within its real value. Namely, if the estimation is e and the real latency is r, then e is considered an error if e is not within [r/2, 2r]. I name the rate of this kind of error as value error rate or VER. The other error rate has a closer relationship with the nature of the application of this service. Since this service is used to choose from some otherwise "equal" hosts, sometimes it doesn't really matter if we get the estimated distances to each host right. What really matters is that we get the order of the distances right. The second error rate, called order error rate or OER, measures the rate of the order error. To compute OER, each estimated latency is compared with all other estimated latencies. If an estimated latency e1 is greater/less than another estimated latency e2, while its real latency r1 is less/greater than that of e2's, an order error is considered to occur. The total number of order errors of one estimated latency divided by the number of estimated latencies, is the OER of that estimated latency. The mean of all the latency OERs in an experiment is the OER of that experiment.

The data sets used in the experiments contains 5506 items, each item being the result of one execution of ping. The total number of data items collected is much greater than that. However, due to the time limit to run the experiments I used this relatively small data set. Among the 5506 items, 3660 of them are randomly chosen to be in the training set, the other 1846 are in the testing set. No hold out set is used for simplicity.

The following two charts show the error rates on the test data.

One obvious problem is that none of those structures converges. It's hard to tell where the problem is, probably it's just due to the highly non-linear nature of the target function. A highly non-linear target function requires some weights to be very big, which is hard to be learned by a back-propagation network. However, these experimental results are still useful in comparing the different ANN structures.

In both charts, structure 1 and structure 2 give the worst performance. That shows the effectiveness of the encoding scheme that encodes each byte of the IP addresses with 8 input units. That scheme works because it enlarges the differences between IP addresses and therefore "smooth" the target function.

Another observation is that structure 6 gives significantly worse performance than structure 3, 4, 5. This shows the importance of an appropriate encoding scheme. The linear scaling scheme doesn't work well for output unit because our evaluation criteria are non-linear. The linear scaling scheme favors functions that minimize the absolute error. In our case, where the evaluation criteria are the relative error, the logarithmic scaling works better.

An interesting question to ask is: now that this ANN doesn't converge, how well on earth does it work? Or does it offer any good at all? To answer this question, let's compute what error rate we could get by randomly guessing. Suppose we pick our estimated latency from an even distribution from 0ms to 1500 ms. The VER we might have is. The OER we might have is 50%, too. So it seems that some of the "bad" structures are doing even worse than randomly guessing. However, the above VER is computed under a even distribution of latency. In fact the distribution of latency is not even, but favors smaller values. Therefore the VER of randomly guessing would be much higher. Also observed is that OER is often much better than VER. If the applications that use this service only care about the right order of the distances of hosts, this estimator can still be effectively used.

4. Related work

The SONAR[1] and HOPS[2] project provide architectures of host proximity services. [3] presents an architecture for host distance estimation but their work has little AI support.

Back-propagation network and sigmoid are widely used and thoroughly studied. [4] pg 126-pg127 give a list of references on this topic. William Porter and Abdellatief Abouali worked on ANN designs and presented their own approaches that is different from back-propagation network [5][6][7]. Their work could be an alternative of the back-propagation structure used in this project.

5. Conclusions and future work

Estimating host distances is a hard problem due to the high non-linear nature of the Internet. The result given by my estimator could be effectively used by applications that care only about the right order of the distance of hosts. Since I'm not aware of any performance report of other host proximity services, I can not give a performance comparison here. But I believe that distance estimation has its advantage over distance measuring and future work can improve its performance.

The structure of an ANN can affect its performance greatly. Encoding scheme sometimes affect performance more than unit layout does.

One important future work is to study the reason why the ANNs don't converge. If it's a shortcoming of back-propagation network that it can not learn highly non-linear function, then what kind of network should be used to learn such function? William Porter and Abdellatief Abouali's work[5][6][7] could be a good starting point of future work.

Meanwhile, other ANN structures might help improve the performance. Also worth trying is to use different parameters (e.g., the base of the log) and bigger data sets. During the development of my ANN I realized that lack of a literature on ANN design, namely a general guide for the design of layout and encoding scheme. I could only rely on intuition and experience (none of which is abundant in me, by the way). An engineering book on ANN design is needed and can help widen the use of ANN.

6. References

[1] K. Moore, J. Cox, and S. Green, "Sonar - a network proximity service"

[2] P. Francis, "Host proximity service (hops)"

[3] P. Francis et. al., "An Architecture for a Global Internet Host Distance Estimation Service"

[4] Tom Mitchell, "Machine Learning"

[5] William Porter and Abdellatief Abouali, "On Neural Network Design Part I"

[6] William Porter and Abdellatief Abouali, "On Neural Network Design Part II"

[7] William Porter and Abdellatief Abouali, "Function Emulation Using MVQ Neural Networks"

Abstract

Applicability of Artificial Neural Network (ANN) and Decision Tree (DT) to Digital (predictive) Soil Mapping

Considering that the land degradation caused by deforestation and mismanagement in sloping areas is steadily increasing, conservation-oriented studies in these areas become vital. Fortunately, ample attention is paid to landslide and erosion as the two most common degradation types. The demand for high resolution soil mapping is more and more growing, in particular in land use planning projects.

The objective of this study is focused on applying a few methods of digital soil mapping in inaccessible sloping areas, susceptible to landslide and erosion. The intention is to apply some of the available methods of digital soil mapping in order to select the most effective one to map the soils in a quick, accurate and inexpensive way. Artificial Neural Network (ANN) and Decision Tree (DT) were employed to comply with the objectives. Geopedologic approach was applied as from the first stage; that is the visual image interpretation, through the fieldwork (during the phase of data collection). After the geoform map was produced, training areas could be selected, wherein the application of the Jenny equation and SCORPAN model (recently derived from the Jenny equation) could be executed.

The major task, forming the scientific framework of this exercise, is parameterization of the soil forming factors and their integration. A digital soil mapping was done in the study area, Hoi Num Rin sub-watershed, covering an area of about 20 km². The ANN is based on feedforward-backpropagation learning algorithm determined with one hidden layer. The decision tree is based on the expert system concept. Both methods were applied to integrate the parameterized soil forming factors. The description of soil predictors to train the ANN and to formulate the decision trees: 4 organism types, 7 relief-type units, 9 lithological units, 3 time series, 4 landscape units and 8 landform units were extracted from the maps and databases. The results: soil mapping derived from ANN, 10 soil classes names showed training error (MSE) under 0.003, 98% training accuracy and 39 minutes learning time.

The soil map resulted from using decision tree took much more time; more than 2 days to learn soil and its environment over the landscape and landform variable and to formalize and generalize 10 statements (formulas). Soil physical property maps were used in the ANN topology to predict 32 soil data from sample areas to unsampled areas. For the validation of soil classes with observed data, the results show very high accuracy at Order and Suborder levels, high accuracy in Great group and Subgroup levels and more than 90% matching when compared with decision-tree-derived map. For the validation of soil properties map, there is good accuracy of soil bulk density, shear strength and plasticity index maps, being 69%, 60% and 70%, respectively. In summary, the geopedological approach is quite valuable to obtain special soil information in inaccessible areas. ANN as well as DT can help produce a high resolution map. The difference, however, is that ANN is faster, thus more recommendable in terms of time and cost saving.

Keyword: geopedologic approach to soil survey, predictive soil map, digital soil map, artificial neural network, decision tree, ANN, DT, landslide, erosion.

[1] I should say without real time measuring, since the training data comes from measuring after all.

BIOMED ALL INVITED

Saturday, October 24, 2009