H. Chen1, P. Buntin, L. She, S. Sutjahjo, C. Sommer, D. Neely
We examined a real-life problem domain, greyhound racing prediction, which involved numerous dog racing variables (fastest time, win-lose ratio, etc.) that were complex and sometimes noisy, but had complete histories that may serve as good performance predictors. Our experiment examined 200 training races and 100 testing races. Actual payoffs for each race allowed us to compare the performance levels of the different techniques. Our research showed that the machine learning techniques out-performed three human track experts in making racing predictions. The Backpropagation network, in particular, achieved the best performance. However, both algorithms were good at predicting long shots, which typically resulted in high payoffs.
It is inherent in the nature of problem solving or decision making that it involves some degree of uncertainty that decision makers or problem solvers are always striving to reduce in order to outperform their counterparts. Techniques for lessening degree of uncertainty are constantly being developed and tested. One way to reduce problem uncertainty is by seeking the advice of an expert or becoming an expert. An expert is usually considered to be a person who possesses special knowledge or skill in a specific field. In new techniques that involve the utilization of computers to reduce uncertainty, the computer becomes an ``expert'' in a specific field through a variety of methods.
Machine learning is one such technique. The method involves utilizing a computer algorithm to capture hidden knowledge from data. Machine learning usually encompasses different types of solutions, such as decision trees, production rules, and neural networks [27]. ID3 [18] is an example of a decision-tree building algorithm. Backpropagation networks [21], a popular neural network learning algorithm, exhibits a different knowledge structure and learning rules. However, its predictive power and noise-tolerant property have made it one of the most popular methods for machine learning or knowledge discovery [11] [8].
Using a real-life greyhound racing application as an example, we compared the prediction performances of three track experts versus those of ID3 and a Backpropagation network. Greyhound racing prediction is a task which involves a complex problem space and high uncertainty and therefore allowed us to examine in detail the characteristics and performance of various expert-based and machine learning techniques.
This paper is structured as follows. Section 2 discusses expert prediction and machine learning approaches to game playing. Section 3 presents the ID3 algorithm and describes how it was applied to the application. Section 4 discusses the Backpropagation network solution to the prediction task. Section 5 presents our greyhound application, including attribute selection and system implementation details. In Section 6, the prediction accuracy and the payoffs obtained by the various methods are compared. Section 7 presents conclusions and discusses some further research directions.
In artificial intelligence, various techniques have been developed and tested in the past two decades for the purpose of reducing complexity and/or uncertainty. Among these techniques, systems based on human or expert heuristics and algorithms which exhibit learning capability have gained significant attention of researchers.
One method of problem solving is through the creation of expert systems or knowledge-based systems [9]. An expert system can be defined as ``a computer program that represents and reasons with knowledge of some specialist subject with a view of solving problems or giving advice'' [10]. The idea is to make an expert-like system by storing large amounts of domain-specific knowledge with ``condition-action'' pairs, coupled with an inference engine which simulates human reasoning. An expert system is essentially the embodiment of a human's expertise in a computer model. The development of expert systems is rooted in cognitive research and human information processing theory, both of which have contributed significantly to important work in artificial intelligence [15]. Some knowledge-based systems may exhibit ``weak'' problem solving methods for various application domains. Examples of such systems are Newell's GPS (General Problem Solver) [15] and Salzberg's Handicapper [24]. These systems adopted human problem solving strategies (e.g., means-end analysis in GPS) or general information processing heuristics (e.g., Occam's Razor heuristic in Handicapper).
Because expert systems aim to solve real-world problems of significant complexity and uncertainty, most of them represent rule-based solutions to real-world problems. Applications of expert systems have ranged from analysis of chemical compounds to diagnosis of diseases and configuration of computer systems.
Despite their usefulness, expert systems or knowledge-based systems are considered performance systems [25] - they only perform what they were programmed to do, i.e., they are without learning ability. Significant effort often is required to acquire knowledge from domain experts and to maintain and update a knowledge base.
A newer paradigm, generally considered to be the learning approach, has attracted attention of researchers in artificial intelligence, computer science, and other functional disciplines such as engineering, medicine, and business [13] [2] [11] [27] [3] [4]. In contrast to performance systems which acquire knowledge from human experts, learning systems acquire knowledge automatically from data. The most frequently used techniques include symbolic, inductive learning algorithms such as ID3 [16] and multiple-layered, feed-forward neural networks such as Backpropagation networks [21]. We review these two techniques briefly below.
Quinlan's ID3 decision-tree building algorithm and its descendants [16] [17] are simple and yet powerful algorithms for inductive learning. ID3 takes objects of a known class, described in terms of a fixed collection of properties or attributes, and produces a decision tree incorporating these attributes that correctly classifies all the given objects. It uses an information-theoretic approach aimed at minimizing the expected number of tests to classify an object. Its output can be summarized in terms of production rules. ID3 has been used successfully for various classification and prediction applications.
Among the numerous artificial neural networks which have been proposed recently, Backpropagation networks have been extremely popular for their performance. Backpropagation networks [21] are multiple-layered, feed-forward networks. Activations flow from the input layer through the hidden layer, then to the output layer. A Backpropagation network typically starts out with a random set of weights. The network adjusts its weights each time it sees an input-output pair. The Backpropagation network updates its weights incrementally until the network stabilizes.
Over the past years there have been several studies which compared the performance of these techniques for different ``toy'' and real-life applications as well as some systems which used hybrid representations and learning techniques. In [14], Mooney et al. found that ID3 was faster than a Backpropagation net, but the Backpropagation net was more adaptive to noisy data sets. The performances of these two techniques were comparable, however. Weiss and Kapouleas [26] [27] suggested using a re-sampling technique such as leave-one-out for evaluation. Discriminant analysis methods, Backpropagation net, and decision tree-based inductive learning methods (ID3-like) were found to achieve comparable performance for several data sets. Fisher and McKusick [6] found that, using batch learning, Backpropagation performed as well as ID3 but was more noise-resistent. They also compared the effect of incremental learning versus batch learning.
Most of the applications that have been tested were in engineering or biomedical domains. These problem domains are complex and interesting, but the data sets used for testing were relatively clean and structured. In this research, we investigated a different problem solving scenario called game playing, which is unstructured, complex, and not well studied previously. We hoped that by studying such an application we would be able to gain insights about the relative strengths and weaknesses of the different AI techniques.
The scenario in most game playing situations involves a single winner and several losers. In games, participants compete with each other, the best performer at the game becomes the winner and everyone else is a loser. Performance, however, is based on a relative scale, i.e., relative to other participants. A weak performer may still win a game if other participants are even weaker.
It is assumed that participants' behaviors or capabilities are relatively stable in the short run, but may improve or degrade in the long run. Furthermore, a game is assumed to be conducted fairly and is free from other external human factors (i.e., cheating). Animal racing games such as horse racing and greyhound racing exhibit these characteristics. Based on these assumptions, it is tempting to ask, given enough historical information, whether human experts or machine learning algorithms will predict more accurately?
In [24], Salzberg first tested an algorithmic solution to the horse racing problem. He used general human heuristics to formulate hypotheses for Handicapper, a horse racing prediction program. Based on nine cognitive psychology-based heuristics, Handicapper predicted a horse to win in any given race. The heuristics used were based on what Salzberg called ``usefulness.'' For example, the ``unusualness'' heuristic focused on a rare or unusual feature of the situation preceding an unexpected event. ``Occam's Razor'' preferred simple hypotheses over complex ones. In several trials, Handicapper was found to perform better than human experts at picking winners. The heuristics designed for Handicapper and the rationalization process used provide a computer system with an excellent way to select a good hypothesis from among a very large set of candidate hypotheses.
In searching for an interesting domain, we consulted several real-life game-playing scenarios (e.g., horse racing, basketball games, etc.) and decided on greyhound racing for the following reasons. Greyhound racing is a complex problem domain which involves about 50 performance variables for eight competing dogs in a race. The large search space for problem solving and prediction poses a special challenge for human experts and machine learning algorithms. This problem domain consists of a large amount of historical information, some accurate and relevant and some noisy and irrelevant, which needs to be filtered, selected, and analyzed in order to assist in making a prediction. For every race, each dog's past history is complete and freely available to bettors. The question then becomes: can machine learning techniques be used to reduce the uncertainty in a complex game playing scenario? Can these methods outperform human experts in prediction? This research sought to answer these questions.
Tucson Greyhound Park (TGP), located in Tucson, Arizona, is a recreational arena centered around greyhound dog racing. It was built in 1944 and has contracts with 20 kennels that together house between 1,000 and 1,200 dogs. Currently it holds evening races and matinee races on some afternoons. There are 112 races in an average week. TGP enlists the prediction services of three dog racing experts, who are not affiliated with TGP. To our knowledge, their predictions are solely their own thinking and TGP does not compensate them for publishing their advice in the daily programs. These experts base their predictions on the same information available to any bettor in the daily program.
In this section, we describe in detail the ID3 algorithm and our modifications to it for the greyhound application. ID3 is a decision-tree building algorithm developed by Quinlan [16]. It adopts a divide-and-conquer strategy for classification. Its goal is to classify mixed objects into their associated classes, based on the objects' attribute values.
In the greyhound racing application, a dog can be classified in one of two classes, as a winner or a loser. Each greyhound can be described by a set of attributes such as the number of races it has won, its fastest time, etc. The entropy value of an attribute describes how well that attribute can be used to classify a greyhound (as a winner or a loser) - the lower the entropy value, the less uncertainty, the more useful the attribute. Entropy is calculated by using the following function [16]:
where pj is the probability of a dog being classified as winner or loser. ID3 by design is able to deal with both categorical attributes and continuous values. For categorical attributes, attribute values can be easily enumerated. For continuous values, ID3 performs a sweeping analysis of entropy reduction for all possible partition points (between any two consecutive values) and selects the partition point which provides the most entropy reduction [18]. In essence, the algorithm performs a binary partition. In our system implementation, we developed an ID3 program by adopting this original design. In addition to this version, we also developed a variant of the ID3 algorithm which performed ternary partition for continuous variables.
The ternary partition based variant of the ID3 algorithm first sorted the continuous values in ascending order, e.g., (0.12 winner) (0.15 winner) (0.35 loser) (0.36 loser) (0.45 winner) (0.55 loser) (0.66 winner) (0.70 loser) (0.80 loser) (0.88 loser). It then marked the ``clean'' classes on the two ends of the sorted list as unique classes, e.g., (0.12 winner) (0.15 winner) are in the ``winner'' class and (0.70 loser) (0.80 loser) (0.88 loser) are in the ``loser'' class. Other values in the middle of the list were mixed and remained to be classified by using other attributes.
In our experiment, we found that the ternary partition variant tended to reduce more entropy than the original binary partition algorithm, the reason being that the two ends in the three partitions were clean classes and thus did not have any entropy value. This often resulted in a lower overall entropy value after partition than that produced by binary partition, which often generated two large mixed classes. The resulting decision tree was also less complex than that obtained through binary partition - ternary partition produced a simple ternary tree where each level of the tree contained exactly three branches (see Figure 3). Binary partition branched out two ways at each level, which resulted in a large and more complex binary tree. However, both algorithms were equally efficient, especially in comparison with the neural networks learning algorithms. In our implementation (in ANSI C and running on a DECstation 5000/120), it took only several seconds to build a decision tree from 1600 training cases. ID3's result was easy to understand and it was also simple to trace ID3's classification decisions.
Neural network systems learn to discriminate among different input patterns in a holistic manner. They are typically presented with training sets of representative instances of some classes, correctly classified, and they learn to recognize and classify other new instances of these classes. Among the computational models of neural networks proposed, the Backpropagation network has been shown to be theoretically sound [21] [20] and it has demonstrated excellent capability for various complex classification and prediction problems [11].
A Backpropagation network, as shown in Figure 1, is a multiple-layered, feed-forward neural network. Activation of the network flows in one direction only: from the input layer through the hidden layer, then on to the output layer. Each unit in a layer is connected in the forward direction to every unit in the next layer. A Backpropagation network may contain multiple hidden layers. Knowledge of the network is encoded in the (synaptic) weights between units. The activation levels of the units in the output layer determine the output of the whole network. Despite the promising results demonstrated in various applications, Backpropagation network development still requires extensive experimentation, parameters selection, and human judgment.
The number of hidden units in the hidden layers as well as the number of hidden layers in a Backpropagation network are much harder to determine. They provide the added power of internal representation which could capture the often nonlinear relationships between the input and the output vectors. Although these hidden elements - hidden units and layers - play such an important role in the Backpropagation network, deciding the exact numbers has never been easy. Researchers have developed different heuristics for topology selection [21]. They generally agree that a larger hidden layer results in a more powerful network, but too much power obtained from the training data may be undesirable for predicting new, unknown data (a generalization problem). In addition, the computing time needed for a Backpropagation network is directly proportional (with a factor often much greater than 1) to the number of hidden units and the number of hidden layers in a network. A complex network topology may be undesirable for computational reasons.
Momentum factor,
,
which tends to keep the weight changes moving in the same direction,
allows the algorithm to skip over small local minima.
It can also improve the speed of learning.
But, as in the case of learning rate, a large momentum factor may cause
the side effect of skipping too much.
In past research, momentum factors have been set between 0 and 1 for various
applications [14] [26].
A large number of epochs usually brings the network closer to a convergent state and helps improve the recalling accuracy for the training data set. For different applications and network topologies the required number of epochs can be very different, ranging from a few hundred to several thousand. However, large epochs require longer computing time and may cause an undesirable generalization problem.
During the initial stage of training, performance on the training set typically improves as the network adjusts its weights through backpropagation. Performance on the unknown testing set also improves, although it is never quite as good as for the training set. After a number of epochs, network performance reaches a plateau as the weights shift around, looking for a path to further improvement. Eventually, such a path is found, and performance on the training set improves again. But performance on the test set gets worse. Why? The network has begun to memorize the individual input-output pairs rather than settling for weights that generally describe the mapping for all cases. With thousands of real-valued connection weights in the network, a Backpropagation network is theoretically capable of storing an entire training set. This may result in over-training. Deciding the proper amount of training is therefore crucial to the success of neural networks learning.
In this section we review the procedure followed to create the greyhound racing databases and identify useful attributes. As previously noted, the Tucson Greyhound Park makes detailed programs available to its patrons (see Figure 2). Each program contains approximately 15 races, the races being of a variety of grades from ``A'' to ``D,'' with A being the most competitive. A few specially designated races with grades such as ``M'' (maiden race) are also included but were ignored in our experiment. Each race program includes information about eight dogs, including each dog's fastest time. Any time any individual dog records a faster time, it is reflected in future programs. In addition to the fastest time, the dog's total races competed and its number of first, second, third and fourth place finishes are included. Directly below the dog's summary data, the program lists its performance for the last seven of its races. The dog's starting position, its position during the first turn (called the break position), the second and third turns and its finishing position comprise the data included in this seven-race history. In addition, its race time and the grade of the race are recorded. Besides the daily program, TGP also publishes the previous day's results. These contain information about how each dog fared and the payoff odds associated with the winning dogs for all the races held the previous day. Deciding the payoffs was simply a matter of matching programs and result sheets to see whether a particular greyhound won or lost.
In a typical race, the TGP program contains about 50 variables that might or might not affect the outcome of that race. Among these variables are the condition of the track, the weather, dog owner and trainer, and the physical attributes associated with each dog. The problem of devising an appropriate encoding for training data has been widely recognized as being central for all inductive learning efforts. In [23], Flann and Diettrich argued that part of the power of the learning systems derived from the choice of representation as well as from the use of a domain theory. They proposed a ``multiple representation strategy'' which bridges the gap between structural representation and performance representation. For greyhound racing, most of the variables shown in the program can be considered structural variables, from which a small set of performance-related variables needs to be identified (either algorithmically or by some domain experts). The use of ``domain model'' [1] and ``abstraction space'' (of the underlying domain) [5] has also been shown to result in substantial speedup (i.e., it provides guidance for the rule formation program in a space of rules that is too large to search exhaustively and in a domain where the trivial or meaningless associations far outnumber the important ones [1]) and produce better classification results.
As is typical of complex problem-solving scenarios, the first task was to reduce the problem complexity or to prune the problem space by deciding on a smaller set of relevant performance attributes. As in [24] (horse racing prediction), the selection of performance attributes in our research was mainly based on the opinions of frequent bettors, track experts, and the management of TGP. These human experts were able to succinctly identify performance variables from the TGP program which they thought were the most crucial in predicting winners. Their domain knowledge helped to reduce the problem space significantly. In total, the TGP experts suggested the following 10 performance variables:
As will be discussed in a later section, identification of relevant performance variables (i.e., pruning the problem space) is probably one of the most important and as well as one of the most time-consuming tasks in our experiment. The machine learning algorithms' inability to analyze the initial noisy and fussy problem domain prompted us to rely on human experts' knowledge and heuristics. This may prove to be one of the weaknesses of some of the prevailing machine learning techniques and makes it an area which deserves more research attention.
The initial stage of the project involved data collection and entry. Some of the variables or attributes we selected required manipulating the data in the programs in order for it to be useful. For instance, we averaged the values listed in a greyhound's seven previous races to obtain the average break position. Another attribute requiring the mean computation of a variable was the finish average, a greyhound's finishing position during the previous seven races. Another two attributes, the time 3 average and the time 7 average, were calculated by computing the mean of the three most recent race times and the seven previous race times, respectively. For the race grade average, we assigned values to each type of grade, with the most competitive race grade, A, receiving a value of four. B grade races received the value of three, C races were assigned a value of two and D races, the least competitive race category, received a value of one. All other types of races, such as maiden races or training races, received a value of zero. These values were then averaged to procure the race grade average attribute. Furthermore, previous race grades were used to assign values when a greyhound moved down a race grade category. This was necessary to account for residual effects left from racing in a more competitive race previously. The value assigned for this ``up grade" attribute depended upon how recently the drop in race grade(s) occurred. If the most recent race was of a higher grade, three points were assigned, two points were given if the more competitive race was two races ago, etc., until the value of zero was given for a greyhound that had not raced in a more competitive event in the last three races.
Greyhound racing is essentially a competitive game. That is, the greyhounds are competing against each other in a win-lose scenario. Since the dogs are competing directly against one another, we manipulated the data such that the values were relatively scaled. This means that after making all calculations, such as averaging a greyhound's finish times, we then assigned the lowest value contained in one single race to be zero. The other values were scaled to reflect their difference from that lowest value. For example, a race program entry contains eight pieces of data that represent each of the eight greyhounds' fastest time. The worst fastest time among these dogs, e.g., 32.00 seconds, was assigned a value of zero. If another dog in that race had a fastest time of 31.00 seconds, then its fastest time was assigned a value of 1.00, since that would be the difference between the worst fastest time of 32.00 seconds, and its own time of 31.00 seconds. Relative scaling of the data was necessary due to the different statistics associated with different race grades, i.e., greyhounds in grade A races tended to have faster times.
As suggested by Weiss and Kulikowski [27], the proportions of training and testing were 2/3 and 1/3, respectively. The training stage for both ID3 and the Backpropagation network consisted of running 200 races. Each race was run by eight greyhounds, for a total of 1600 classified greyhounds. Testing consisted of an additional 100 races, or 800 new cases. We did not use any re-sampling technique for testing due to the large testing data set and the computational requirements for evaluating the Backpropagation algorithm.
Evaluating the performance of different techniques was simply a matter of computing the monetary payoffs. We bet $2 for each dog that an algorithm or an expert predicted to be winner. In some races, the algorithms may have predicted more than one winner (so we bet on all predicted winners) or no winner at all (so no bet was placed). Using the payoff odds given in the result sheets, we were able to compute the final payoffs after betting on the 100 races. In this experiment, for performance comparison purposes we only considered the payoff for first-place winners; we did not consider other more elaborate betting systems such as: Place (your greyhound must finish either first or second), Show (your greyhound must finish either first, second, or third), Quiniela (two greyhounds you bet on must finish first and second), etc. It could be argued that the payoff odds are not necessarily correct since each bet on a particular dog will change the payoff odds for that dog. We contend that since there are numerous bets on any given dog, the odds would not change significantly with the additional bet on a particular dog.
We developed the programs in ANSI C and ran them on a DECstation 5000/120 (a 25 MIPS machine, running ULTRIX). Table 1 summarizes the numbers of races correctly and incorrectly predicted by the experts, ID3, and the Backpropagation network, respectively, and their final payoffs for 100 races. In 26 races, ID3 decided not to bet on any dog.
|
|
|
|
A summary table for the one-way analysis of variance (using MINITAB [22])) of the payoffs of the different approaches is provided in Figure 4. The means for the five underlying populations (i.e., payoffs generated by three experts and the two algorithms) were different at the 10% significance level (N = 100, P = 0.083). Two sample t-tests revealed that the Backpropagation network out-performed the best expert (and the other experts) in monetary payoffs at the 10% significance level (P = 0.084). However, ID3 did not out-perform the experts at a statistically significant level (P = 0.15) and there was no statistically significant difference in payoffs between Backpropagation and ID3 (P = 0.68).
In terms of prediction accuracy and monetary payoff, both the Backpropagation network and ID3 performed better than human experts. Both algorithms appeared to be more robust than humans in their ability to analyze the large set of racing data objectively and reach unbiased conclusions. This characteristic is particularly evident from both algorithms' convincing prediction of numerous long shots human experts failed to identify. ID3's decision tree output was more understandable than that of the Backpropagation. In general, ID3 also predicted more conservatively than other approaches. The Backpropagation network, on the other hand, was more computationally expensive (and thus very slow), but made excellent predictions of long shots.
In this experiment, especially during the early stage, experts' heuristics for refining the initial performance variables were critical to the success of the machine learning algorithms. The 50-plus candidate variables were reduced to 10 of the most relevant parameters, which effectively reduced the problem space for this complex and noisy domain. Some variables were also the results of computations, e.g., time 3 average, time 7 average, etc. The machine learning algorithms that we adopted were ``supervised'' learning in which the learner was told the category membership of environmental observations; the sole learning task was to summarize the commonality among members of the same categories and differences among competing ones. These algorithms were not adaptive to unsupervised, self exploration of relevant variables or useful categories in the data using internalized heuristics. Nevertheless, there are other machine learning algorithms which exhibit unsupervised learning capability [7]. This may prove to be one of the most potentially fruitful directions for machine learning or knowledge discovery research [8] [12].
In summary, this research has examined in detail the characteristics and feasibility of various prediction techniques, including expert prediction, symbolic learning (ID3), and a connectionist approach (Backpropagation) for complex and uncertain game playing scenarios. For large real-life databases of historical information, recent machine learning techniques appeared to exhibit unique capabilities for data analysis and knowledge discovery.