Overview paper on Composite System Design

Stephen Fickas


Note: The following paper appeared as a guest column in the Automated Software Engineering Journal. Columnists are asked to select from one to six papers that they would take to a desert island.

Desert Island Column

It is a pleasure to kick off the Desert Island column. I want to thank the editors for giving me so much freedom in my paper selection criteria. While in other domains this would be known as criminal under-specification, I want to believe it is founded on the great trust they have in my judgement.

I've decided to look at three books that tell a fascinating story. I'm sure you will want to add them to your collection immediately: (1) Rules for the Government of the Transportation Department, The Pennsylvania Railroad, 1910, (2) Urban Transportation Technology, Thomas McGean, Lexington Books, 1976, and (3) Down Brakes: A History of Railroad Accidents, Safety Precautions, and Operating Practices in the USA, Robert Shaw, MacMillan, 1961.

These books provided the depth and breadth of material necessary to support the study of design in a rich and complex real domain. My group undertook to exercise a design model by following a process of rational reconstruction. By this I mean the activity of selecting an implemented target system and then applying a proposed design model to see if it is capable of producing the target (among others), and under what assumptions it produces that target.

The design model we were attempting to test was one of composite system design as embodied in a tool called Critter [1]. Briefly, composite system design involves the design of multi-agent systems working towards the achievement of global goals and constraints. My group's specific interest is in composite systems where at least some of the agents have natural or artificial intelligence. The quintessential process in composite system design is assignment of responsibility: choose the agent or agents that will be responsible for achieving one or more global system concerns. A major focus of my group is on the analysis models needed by Critter to choose wisely among competing methods of assigning responsibility.
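Since "assignment of responsibility" will do a lot of work in what follows, let me make the vocabulary concrete. The sketch below is strictly for exposition - it is not Critter's actual representation, and the names (Goal, Agent, Assignment) and the capability sets are assumptions of mine:

    # A sketch only: class names and fields are illustrative assumptions,
    # not Critter's actual representation.
    from dataclasses import dataclass, field

    @dataclass(frozen=True)
    class Goal:
        """A global system concern, e.g. 'no two trains share a block'."""
        name: str

    @dataclass
    class Agent:
        """A human or mechanical participant with limited capabilities."""
        name: str
        capabilities: set = field(default_factory=set)

    @dataclass
    class Assignment:
        """The quintessential design step: which agent(s) answer for a goal."""
        goal: Goal
        responsible: list  # one agent, or several for joint responsibility

    def candidate_assignments(goal, agents):
        """Enumerate the competing single-agent assignments; a design
        model's job is to choose wisely among these (and among splits)."""
        return [Assignment(goal, [agent]) for agent in agents
                if goal.name in agent.capabilities]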

Using the three books, we chose to reconstruct a series of early train systems. These included the pioneering train systems that relied on time-and-distance as the primary means of safety, and later train systems that used a manually controlled block system and telegraph communication to improve efficiency and reduce accidents.

McGean gave us the basic layout of our systems and placed them in historical context. The Pennsylvania Railroad Company's book of rules (henceforth, simply Book of Rules), through 450 separate rules, allowed us to further identify (1) the agents involved in our target train systems (approximately 20 on average), and (2) the inter-agent protocols we would need to reconstruct. Most important, the Book of Rules holds a wealth of information on responsibility assignment. Some rules deal with a single agent and a single assignment of responsibility. A more typical rule divides responsibility among two or more agents. For example, Rule 102 states that a trainman and an engineer are jointly responsible for the safety of certain types of trains: "When cars are pushed by an engine, a trainman must take a conspicuous position on the front of the leading car. If signals from the trainman cannot be seen from the engine [by the engineer] the train must be stopped immediately."

Even mob responsibility can be found: "In case of danger to the Company's property employees must unite to protect it." [General Rule L]

In contrast to joint responsibility (horizontal splits), there are rules specifying various forms of supervision (vertical splits). Much of this is of a monitoring nature. For example, Rule 730 states, "Train police have supervision over crossing watchmen, and will see that they [the watchmen] properly understand their duties and fulfill them," while Rule 701 states, "The Train Master reports to and receives his instructions directly from the Superintendent."
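Both kinds of split can be made concrete in the same illustrative style (again, the encoding is mine, not the Book of Rules' notation; rule_102 and rule_730 below paraphrase the rules quoted above):

    # Horizontal split (Rule 102): the trainman and the engineer are
    # jointly responsible for one goal.
    rule_102 = {
        "goal": "pushed train moves safely",
        "split": "horizontal",
        "responsible": ["trainman", "engineer"],
    }

    def rule_102_step(engineer_sees_trainman: bool) -> str:
        """Joint responsibility needs a run-time protocol: if the engineer
        loses sight of the trainman's signals, the split collapses to the
        one safe action."""
        return "proceed" if engineer_sees_trainman else "stop immediately"

    # Vertical split (Rule 730): the watchman holds the duty; the train
    # police hold a monitoring duty over the watchman.
    rule_730 = {
        "goal": "crossings are guarded",
        "split": "vertical",
        "responsible": ["crossing watchman"],
        "supervisor": "train police",
    }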

The Book of Rules specifies decomposable agents. For instance, at one level trains are viewed as a single agent with the engineer as the human interface. At this level, the engineer interacts with dispatchers, station agents, and other trains/engineers. However, when discussing train failure or train accidents, the train is decomposed into a mini-system where each crewmember has "fire fighting" responsibilities.

The Book of Rules clearly shows that a single agent within a train system has multiple responsibilities. It's interesting that the Book of Rules anticipates agent overloading in some cases, and attempts to specify priorities of responsibilities, e.g., Rule 106 states that "In all cases of doubt or uncertainty the safe course must be taken and no risks run."

Rule 106 is also an example of "run time specification". Instead of attempting to list all possible situations and agents' responses, Rule 106 leaves it to individual agents (a) to determine what is safe and what is a risk, and (b) to devise a plan of action that ensures safety and avoids risks. This seems to imply we can get away with a minimalist design that instructs agents to do the right thing, and gives them what is necessary to figure out what the right thing is.
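One way to picture such a minimalist design: explicit rules handle the anticipated situations, and Rule 106 installs a default for everything else. The rulebook machinery below is an assumption of mine for illustration, not anything found in the Book of Rules:

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Rule:
        matches: Callable[[str], bool]  # does this rule cover the situation?
        action: str                     # the prescribed response

    def choose_action(situation: str, rulebook: list) -> str:
        """Prefer an explicit rule; otherwise, 'in all cases of doubt or
        uncertainty the safe course must be taken' (Rule 106)."""
        for rule in rulebook:
            if rule.matches(situation):
                return rule.action
        return "stop and protect the train"  # the agent's own safest plan

    # A rulebook that anticipates exactly one situation:
    rulebook = [Rule(lambda s: "drawbridge open" in s, "stop short of bridge")]
    print(choose_action("drawbridge open ahead", rulebook))    # explicit rule
    print(choose_action("unlit obstruction ahead", rulebook))  # Rule 106 default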

The last topic the Book of Rules raises is system failure. No small part of the rules covers exceptional cases. The rules anticipate that both physical and human agents will malfunction. Looking at signalmen, for example, Rule 623 states "If there is a derailment the signalman must set the signal so that no further train movement is permitted." Rule 625 pertains to agent failure: "During storms or drifting snow, if the agent responsible for clearing switches is not on hand promptly, it is the signalman's responsibility to report the unsafe condition to the superintendent." In general, out of the 26 rules covering the responsibility of signalmen, 13 are devoted to responsibilities in the face of system failure. This follows a general trend for all the agents specified in the Book of Rules.
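In software terms, half of the signalman's spec is exception handlers. A sketch, with the usual caveat that the class and method names are my own illustrative inventions:

    class Signalman:
        """Half nominal duty, half failure-mode duties (13 of 26 rules)."""

        def nominal(self) -> str:
            """Normal operation: display whatever the block state calls for."""
            return "set signal per block state"

        def on_derailment(self) -> str:
            """Rule 623: a derailment voids normal operation entirely."""
            return "set signal so no further train movement is permitted"

        def on_missing_agent(self, agent: str) -> str:
            """Rule 625: when the switch-clearing agent fails to appear,
            escalate the unsafe condition rather than absorb the duty."""
            return f"report unsafe condition ({agent} absent) to superintendent"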

To delve really deeply into system failure, we must turn to Shaw's book. However, before leaving the Book of Rules, I'd like to point out something probably not surprising to most spec builders: train specs employ many forms of expression. Besides the textual description found in its rules, the Book of Rules includes definitions (e.g., what is a block, what is a manually controlled block system), color plates to show signaling protocol, examples to reinforce a rule, and finally, the standard document forms for train orders, i.e., the paper-based "interface specs" among dispatchers, engineers, and station agents.

Shaw's book chronicles railway accidents from the earliest reported (circa 1850) to 1961. Perhaps the best feel for the book can be had by partially listing the Table of Contents: I. Early Accidents, II. The Beginning of the Disasters, III. Boilers Blow Up, IV. Bridges Fall Down, V. Time Interval Failures, VI. Misreading of Train Orders, VII. Disregard [!] of Train Orders, VIII. Operator Errors, IX. False Signal Indications, X. Sabotage, etc. There are a total of 49 chapters devoted to the way train systems go wrong. The 50th chapter is a summary: Responsibility for Accidents.

Shaw's book has been a gold mine for me. When read in tandem with the specs given in McGean and in the Book of Rules, it provides a compelling call to trace system failure back to system specification. Adding our particular slant, Shaw allowed us to trace failure in a multi-agent system back to poor composite system design in general, and to poor responsibility assignment in particular. Using a failure-driven approach, and appropriate generalization, we were able to get a start on the composite system design analysis model we sought. I even went as far as mimicking the book's style by indexing accidents with Shawesque names and associating each with a composite system design concern:

Accident                      Cause
The Death Train               missing system requirement
The Naked Train               uncovered system requirement
The Neverending Train         weak system requirement
The Runaway Train             agent unreliability
The Ghost Train               inter-agent hand-off error
The Weak Link Syndrome        bad communication protocol
The Quick Good-bye            bad communication protocol
The Heroic Switchman          correct runtime reasoning by intelligent agent
The Lazy Conductor            incorrect (!) runtime reasoning by intelligent agent
The Reluctant Brakeman        inter-agent hand-off error
The Battling Signals          agent misidentification
The Wet Engineer              lack of agent motivation
The Harried Station Agent     agent overload
The Hungry Tiger              agent irresponsibility
The Fastest Train Alive       agent irresponsibility
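These generalized causes are where the analysis model starts to pay its way: each becomes a question to put to a proposed design. The checklist below is my own assumption about how to operationalize the index - the cause names come from the table, while the questions and the encoding are my glosses:

    # Failure-driven review: which generalized causes does a proposed
    # design leave unanswered?
    FAILURE_TAXONOMY = {
        "missing system requirement":   "is every hazard covered by some goal?",
        "uncovered system requirement": "does every goal have a responsible agent?",
        "agent unreliability":          "what happens when an assigned agent fails?",
        "inter-agent hand-off error":   "is every hand-off protocol explicit?",
        "bad communication protocol":   "can messages be lost, garbled, or delayed?",
        "agent misidentification":      "can one agent be mistaken for another?",
        "lack of agent motivation":     "is the capable agent also motivated?",
        "agent overload":               "do any agent's duties exceed its capacity?",
        "agent irresponsibility":       "what limits the damage a rogue agent can do?",
    }

    def review(design_answers: dict) -> list:
        """Return the causes the design has no documented answer for."""
        return [c for c in FAILURE_TAXONOMY if c not in design_answers]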

Studying the accidents chronicled by Shaw, it was clear that we would need some powerful design analysis models to avoid designing systems with the same proclivity to screw up. Let me give you a small sampler of accidents, taken from the above list, that point to some of the problems.

The Hungry Tiger. Shaw discusses an accident involving a circus train that went out of control and derailed when the performers (clowns?) disconnected the emergency brakes. Besides the seven people killed, there were two other casualties: "Mrs. Alfred Thomas was milking her cows when a tiger crept around the corner of the barn and leaped upon and killed a cow. Mr. Thomas went out with his gun and killed the tiger." How does one reason, at design time, about agents acting irresponsibly (the performers, not the tiger or Mr. Thomas)? What countermeasures (and associated responsibilities) can one add to mitigate irresponsible behavior? At what point is it an institutional concern?

The Wet Engineer. Some train companies rewarded train crews for arriving at their destination as early as possible. This caused engineers to frequently ignore safety rules. In one case, an engineer thought he could beat a drawbridge, and hence ignored the stop signal. His train went into the drink. Many passengers were killed. The engineer survived, but just barely; surviving passengers went after him with a rope, but he outran them. This accident, among others, led me to consider agent motivation - an agent may be capable of carrying out a responsibility without being motivated to do so. One sees the full range of motivational components in composite systems, from pure reward (engineers were given cash incentives to keep their safety records clean) to pure punishment (special agents were put on board trains to monitor engineers and report transgressions).

The Heroic Switchman and the Lazy Conductor. On a moonless night, a switchman allowed a train (henceforth, Train A) onto a clear siding. As he went to change the switch back to the mainline, it became jammed. The switchman was aware of a closely following train (henceforth, Train B) that would be arriving shortly. As things stood, B would be switched into the back end of A. The switchman reasoned that he must do something to stop B. He reasoned that if he ran down the track toward B and set his shirt on fire as a signal, the engineer of B would conclude that something was amiss and stop short of the switch. All inferences were correct - the switchman was a hero. He and the engineer of B had acted in true accordance with Rule 106.

On the other hand, we have the lazy conductor. When a train made an unscheduled stop in a block, it was the conductor's responsibility to protect the back of the train by lighting a lantern and slowly swinging it to warn trains approaching from the rear. One night a train indeed did have a mechanical breakdown in the middle of a block. However, the conductor decided he could skip his duty. He reasoned that since no other train was scheduled to use this line for another ten hours, there was no worry about collision. Further, he knew that the speed limit along this line was 15 MPH, slow enough for an approaching train to see the train's running lights and stop.

An approaching train smashed into the back of the conductor's train. The conductor was one of the few survivors.

The conductor had made two bad inferences. First, while there was no scheduled train, there was a "special" (a train not appearing on monthly schedules) that was carrying (what else?) a circus to the next town. As an aside, I have noted that there is a preponderance of specials involved in Shaw's train accidents. My advice: never get on a special. It would probably also be wise to avoid circuses.

Second, the speed limit had been raised to 30 MPH the week before - the approaching train was obeying the new limit and was unable to stop in time. The conductor's court transcripts tell the rest of the story. These two examples raise interesting questions about the limits we can expect of intelligent agents. What does an intelligent agent need to know to act heroically or "efficiently"? Shaw's accidents and near misses give us a tentative list of issues for evaluating composite system specifications.
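Note that the conductor's decision procedure was arguably sound; it was his premises that were stale. The distinction is easy to make concrete (the fields and values below are distilled from the story as I read it, not anything Shaw formalizes):

    # The same decision procedure over two world models: the conductor's
    # stale beliefs versus the actual state of the line.
    world  = {"special_running": True,  "speed_limit_mph": 30}  # actual
    belief = {"special_running": False, "speed_limit_mph": 15}  # conductor's

    def must_protect_rear(model: dict) -> bool:
        """Protection is skippable only if no train is coming AND any
        approaching train would be slow enough to stop on sight."""
        return model["special_running"] or model["speed_limit_mph"] > 15

    print(must_protect_rear(belief))  # False: the duty is skipped
    print(must_protect_rear(world))   # True: the collision follows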

In summary, books like Shaw's and McGean's, together with the Book of Rules, lay down a difficult but irresistible challenge for me: Can we develop models of design that allow us to formally and reliably specify systems like those in the real world? Systems that will break down, will contain unreliable agents, will contain unmotivated agents, and, perhaps most importantly, will have finite budgets. Systems that will also be successful in spite of all this. My answer is: not yet. However, putting my recruiting hat on, I hope I've gone a little way toward convincing young researchers that it is a challenging and worthwhile field to take up!

I'd like to conclude with two remarks. First, I had a great deal of fun working with a real world composite system. I think this is mainly because we stumbled upon a domain that (1) was understandable with a bit of work, and (2) had well documented specs and system failures. I have by no means squeezed all that is possible from McGean, Shaw and the Book of Rules, so I would be happy to take them along to my desert island. However, I'd also be happy to take the equivalent books from another domain. In particular, I've just started reading the book Blind Trust by John Nance [2] on airline disasters. The book includes a serendipitous discussion of how early aircraft transportation design was based on train transportation design, and why it failed for many of the same reasons. In this way it is similar to Shaw, who discusses the attempt to transfer horse-drawn transportation specs to early train systems, and again, the disasters that ensued. Maybe I will add Nance to my desert island bookbag.

Second, I'd like to make a small pitch to the Automated Software Engineering community: I believe we can profit by extending our current set of canonical examples to include at least one canonical composite system example drawn in realistic detail. This will require access to two kinds of material: documentation of the system itself, down to a reasonable level of detail, and documentation of system failures. The latter, in particular, allows one to study the selectivity of a design approach, not just its generative capability.

Two candidate domains come to mind: the train domain, because the necessary documentation exists, and the library domain, because it already exists as a canonical example. Both are rich and complex multi-agent systems. However, both have undesirable properties. The train domain documentation is hard to find, and hence may exclude a number of researchers. The library domain has made attempts to publish specs (Robinson gives examples in [3]), but has nothing that I can find comparable to Shaw in documenting designs and failures. Perhaps we can discuss the inclusion of an appropriate composite system example, and the possible revamp of our canonical examples as a whole, at one of our future conferences or workshops, or even here in our new journal.

As a final note, I'd like to acknowledge the efforts of my colleague, Rob Helm, in poring over the dusty back shelves of our library to unearth the desert island books, and for finding many of the key insights contained in them.

The author welcomes debate, inquiries, or complaints about this column. He can be reached at fickas@cs.uoregon.edu.

[1] Fickas, S., Helm, R., Automating the design of composite systems, IEEE Transactions on Software Engineering, June 1992

[2] Nance, J., Blind Trust, W. Morrow, 1986

[3] Robinson, W., Automated negotiated design integration: formal representations and algorithms for collaborative design, PhD Thesis, Computer Science Department, University of Oregon, 1993 (available as CIS-TR-93-10)