Software Architecture for Large Embedded Systems

Maarten Boasson & Edwin de Jong



Introduction

As the software component of embedded systems becomes larger and more complex, the choice of software architecture becomes more crucial within the overall development process. An architecture defines the overall structure of the system in terms of components and an organizational principle that defines possible interconnections between these components. In addition, an architecture prescribes a set of rules and constraints governing the behavior of components and their interaction [4]. Traditionally, software architectures have been primarily concerned with structural organization and static interfaces [8]. With the growing interest in coordination models [6][7], more emphasis is placed on the organizational aspects of behavior and interaction.

 Applications in the embedded systems domain mostly deal with data instances that represent continuous quantities: data is either an observation sampled from the system's environment, or derived from such samples through a process of data association and correlation. The data itself is relatively simple in structure; there are only a few data types, and given the volatile nature of the samples, only recent values are of interest. However, samples may enter the system at very short intervals, so sufficient throughput and low latency are crucial properties. In addition, but to a lesser extent, embedded systems maintain some discrete information, which is either directly related to external events or derived through qualitative reasoning from the sampled input.

 This situation is rapidly changing, however, and we can expect large embedded systems in particular to move towards more intensive information management. Consequently, as embedded systems become larger and more complex, the management of data assumes a more central role in the architecture of these systems. Indeed, the software architecture that we introduce in the next section is strongly data-oriented, in contrast with more traditional approaches to embedded systems design, which are mainly process- or control-oriented.
 
 

A software architecture for large embedded systems

The software architecture that we developed for large embedded systems, named SPLICE, basically consists of two types of components: applications and a shared data space. Applications are active, concurrently executing processes that each implement part of the overall system's functionality. There is no direct interaction between applications; all communication takes place through a logically shared data space simply by reading and writing appropriate data elements. In this sense SPLICE bears strong resemblance to coordination languages and models like Linda [5], Gamma [1], and Swarm [9], where active entities are coordinated by means of a shared data space.

The shared data space

 In SPLICE the shared data space is organized after the well-known relational data model. The data elements are structured as labeled records; each record definition consists of a number of typed fields. The usual basic types, such as integer, real, and character are provided, augmented with a number of type constructors, such as enumerated types, arrays, and nested records, that allow defining more complex types. Each record definition is associated with a system-wide unique label, such that there exists a one-to-one correspondence between a record label and the interpretation of the corresponding data by the applications that use it. We refer to the unique label of a record as its data sort. Data sorts enable applications to distinguish between different types of information.

 To illustrate, we consider a simplified example taken from the domain of air traffic control. Typically a system in this domain would be concerned with various aspects of flights, such as flight plans and the progress of flights as tracked from the reports received from the system's surveillance radar. Hence, we introduce data sorts flightplan, report, and track as indicated in fig. 1.


sort flightplan
    key flightnumber : string
        departure    : time
        arrival      : time
        aircraft     : string

sort report
    key index        : integer
        position     : vector
        timestamp    : time

sort track
    key flightnumber : string
    key index        : integer
        state        : vector

 Figure 1: An example definition of data sorts flightplan, report, and track


The record definition of data sort flightplan consists of four fields: a flight number, e.g. KL332 or AF1257, the scheduled times for departure and arrival, and the type of aircraft that carries out the flight, e.g. a Boeing 737 or an Airbus A320. By declaring the flight number as a key field, it is assumed that each flight is uniquely determined by its flight number. Data sort report contains the position vector of an object as measured at a specific time by the system's surveillance radar. A unique index is attached to be able to distinguish between different reports. Through a correlation and identification process, the progress of individual flights is recorded in data sort track. The state vector typically contains position and velocity information on the associated flight number. The index identifies the report from which the track information is derived. Note that the flight number alone does not suffice as a key, since there will be many consecutive reports during a flight.
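The data sorts of fig. 1 can be sketched as typed records in Python. This is only an illustration, not part of SPLICE: the helper names key and key_of are hypothetical, times are simplified to floating-point seconds, and vectors are modeled as tuples of floats.

```python
from dataclasses import dataclass, field, fields
from typing import Tuple


def key():
    """Hypothetical helper: mark a field as part of the record's key."""
    return field(metadata={"key": True})


@dataclass(frozen=True)
class FlightPlan:
    flightnumber: str = key()
    departure: float = field()   # assumption: time as seconds
    arrival: float = field()     # assumption: time as seconds
    aircraft: str = field()


@dataclass(frozen=True)
class Report:
    index: int = key()
    position: Tuple[float, ...] = field()
    timestamp: float = field()


@dataclass(frozen=True)
class Track:
    flightnumber: str = key()
    index: int = key()
    state: Tuple[float, ...] = field()


def key_of(record):
    """Return the tuple of key-field values that identifies a record."""
    return tuple(
        getattr(record, f.name)
        for f in fields(record)
        if f.metadata.get("key")
    )


plan = FlightPlan("KL332", 0.0, 3600.0, "Boeing 737")
track = Track("KL332", 7, (10.0, 20.0, 250.0, 0.0))
```

Here key_of(track) combines the flight number and the report index, reflecting the remark that the flight number alone does not suffice as a key for data sort track.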

Applications

 Basically, applications interact with the shared data space through a pair of primitive operations for reading and writing data elements. These operations have the following form:
 
 

The initial absence of a primitive operation for removing data from the shared data space reflects the fact that data in SPLICE is regarded as shared information that can be freely consulted by any number of applications. This is in contrast with most other coordination models, such as Linda and Gamma, where data is primarily viewed as shared resources that can be consumed by processes in limited quantities. The coordination model of SPLICE is based on monotonic (temporal) reasoning where information, once established, never becomes invalid. This concept fits quite naturally with embedded systems, whose primary task is to collect and deduce new information either from existing information or from observations of the environment. Clearly, due to practical limitations, such as the available amount of memory, data will eventually have to be removed from the system. We regard this, however, as a non-functional aspect of the system that we shall address in a further refinement of the coordination model.
 
 

Refinements of the coordination model

The coordination model of SPLICE is based on an ideal situation where non-functional requirements, such as limited resources, need not be taken into account. Here we discuss how, through successive refinements of the coordination model, a software architecture can be derived that fully supports the development of large, industrial-quality embedded systems.

A distributed software architecture

 The first aspect that we consider here is distribution of the shared data space over a network of computer systems. The coordination model is refined by introducing two additional components to the basic software architecture from the previous section. The additional components consist of agents and a communication network.

Each application process interacts with exactly one agent. An agent embodies a local database for storing data sort instances, and processing facilities for handling all communication needs of the application process it serves. All agents are identical and need no prior information about either the application processes or their communication requirements. Communication between agents is established by a message passing mechanism. Messages between agents are handled by the communication network that interconnects them. The network must support broadcasting, but should preferably also support direct addressing of agents, and multicast. An application process interacts with its assigned agent by means of the primitive read and write operations. The interaction with agents is transparent with respect to the basic coordination model: the application processes continue to operate on a logically shared data space.

 The agents are passive servers of the application processes, but are actively involved in establishing and maintaining the required inter-agent communication. The communication needs are derived dynamically by the collection of agents from the read and write operations that are issued by the application processes. The protocol that is used by the agents to manage communication is based on a subscription paradigm. The assumption underlying this protocol is that the majority of data transfers in a system occurs regularly, and that it is therefore beneficial to forward data to all known users as soon as it is written (cf. a subscription to a newspaper: the newspaper is sent upon printing, without further requests from those subscribed).

 As a result of this protocol, the shared data space is selectively replicated across the agents in the network. The local database of each agent contains instances of only those data sorts that are actually read or written by the application process it serves. In practice the approach is viable, particularly for large systems, since the application processes are generally interested in only a fraction of all data sorts. Moreover, the communication pattern in which agents exchange data is relatively static: it may change when the operational mode of a system changes, or in a number of circumstances in which the configuration of the system changes (such as extensions or failure recovery). Such changes to the pattern are very rare with respect to the number of actual communications using an established pattern. It is therefore beneficial from a performance point of view to maintain a subscription registration. After an initial short phase each time a new data sort has been introduced, the agents will have 'learned' the new communication requirement. This knowledge is subsequently used by the agents to distribute newly produced instances of a data sort to all the agents that hold a subscription. Since subscription registration is maintained dynamically by the agents, all changes to the system configuration will automatically lead to adaptation of the communication patterns.
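The subscription paradigm can be illustrated with a minimal in-process sketch in Python. It is not the SPLICE implementation: the classes Network and Agent, the message tuples, and the method names are all assumptions made for illustration, and the sketch assumes that every agent is attached to the network before any application starts reading.

```python
from collections import defaultdict


class Network:
    """Hypothetical stand-in for a broadcast-capable network."""

    def __init__(self):
        self.agents = []

    def attach(self, agent):
        self.agents.append(agent)

    def broadcast(self, sender, message):
        for agent in self.agents:
            if agent is not sender:
                agent.receive(sender, message)


class Agent:
    def __init__(self, name, network):
        self.name = name
        self.network = network
        self.local_db = defaultdict(dict)    # sort -> {key: instance}
        self.subscribers = defaultdict(set)  # sort -> agents that read it
        self.interests = set()               # sorts read by our application
        network.attach(self)

    def read(self, sort):
        # The first read of a sort announces a subscription to all agents.
        if sort not in self.interests:
            self.interests.add(sort)
            self.network.broadcast(self, ("subscribe", sort))
        return list(self.local_db[sort].values())

    def write(self, sort, key, instance):
        # Forward to every known subscriber as soon as the data is written.
        for subscriber in self.subscribers[sort]:
            subscriber.receive(self, ("data", sort, key, instance))
        self.local_db[sort][key] = instance

    def receive(self, sender, message):
        if message[0] == "subscribe":
            self.subscribers[message[1]].add(sender)
        elif message[0] == "data":
            _, sort, key, instance = message
            if sort in self.interests:
                self.local_db[sort][key] = instance


net = Network()
radar, tracker, logger = (Agent(n, net) for n in ("radar", "tracker", "logger"))
first_read = tracker.read("report")           # subscribes; nothing stored yet
radar.write("report", 1, {"position": (1.0, 2.0)})
second_read = tracker.read("report")          # served from the local database
```

In this sketch the logger agent never reads data sort report, so no report instance reaches its local database, which mirrors the selective replication described above.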

Temporal aspects

 In embedded systems data gradually loses its value as the environment changes and time evolves. Hence, the application processes usually require only a limited temporal view on the overall contents of the shared data space. As time progresses, any data that falls outside the temporal view of the application processes can be safely removed from the shared data space, since it is no longer of interest.

 In addition to the removal of stale data, agents use the following mechanism to limit memory requirements. If an agent receives a data sort instance from another agent on the network, it first verifies whether an instance with the same key values currently exists in the local database. If it does, the newly received instance is more recent, and the currently stored instance is overwritten. Conversely, if an application process performs a write operation, its agent will first forward a copy of the instance to all subscribed agents, after which it replaces any current instance in the local database with the same key values by the more recent version.
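The overwrite rule can be sketched as follows. The class LocalDatabase and the function write are hypothetical names, and subscribed agents are reduced to their local databases for brevity; this is an illustration under those assumptions rather than the actual agent code.

```python
class LocalDatabase:
    """Keeps at most one instance per (data sort, key values) pair."""

    def __init__(self):
        self._rows = {}  # (sort, key) -> instance

    def store(self, sort, key, instance):
        # A newly arriving instance replaces any stored one with the same key.
        self._rows[(sort, key)] = instance

    def lookup(self, sort, key):
        return self._rows.get((sort, key))

    def __len__(self):
        return len(self._rows)


def write(local_db, subscribed_dbs, sort, key, instance):
    # Forward a copy to all subscribed agents first, then replace the
    # local instance, following the order described above.
    for remote_db in subscribed_dbs:
        remote_db.store(sort, key, instance)
    local_db.store(sort, key, instance)


producer, consumer = LocalDatabase(), LocalDatabase()
write(producer, [consumer], "track", ("KL332", 1), {"state": (0.0, 0.0)})
write(producer, [consumer], "track", ("KL332", 1), {"state": (5.0, 1.0)})
write(producer, [consumer], "track", ("KL332", 2), {"state": (9.0, 1.0)})
```

Memory use in this sketch grows with the number of distinct key values, not with the number of writes.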

 Since application processes operate asynchronously on the shared data space, there is no need to group the distribution of a data sort instance to the collection of subscribed agents into an atomic transaction. This enables a very efficient implementation in which the produced data is distributed as fast as possible and the latency between actual production and use of the data depends largely on the consuming application processes. This results in upper bounds on latency that are acceptable for embedded systems where timing requirements are of the order of milliseconds.

System Modifications and Extensions

 A major problem in the design of systems of the kind considered here is the need to provide for upgrades and other modifications, preferably while the existing system remains on-line. There are two distinct cases to be considered.
 
 

Since the subscription registration is maintained dynamically by the agents, the current coordination model can deal with the first case without further refinements. Simply installing and starting a new application is enough for it to integrate automatically into the already running system.

 The second case, clearly, is more difficult. One special, but important, category of modifications can be handled by a simple refinement of the agents. Consider the problem of upgrading a system by replacing an existing application process with one that implements the same function, but using a better algorithm, leading to higher quality results. In many systems it is not possible to physically replace the old application with the new one, since this would require the system to be brought off-line.

 By a refinement of the agents it is possible to support on-line replacement of application processes as follows. If an application process performs a write operation, its agent attaches an additional key field to the data sort instance representing the application's version number. Upon a read request, an agent now first checks whether multiple versions of the requested instance are available in the local database. If this is the case, the instance having the highest version number is delivered to the application. From that moment on, all instances with lower version numbers, received from the same agent, are discarded. In this way an application process can be dynamically upgraded, simply by starting the new version of the application, after which it will automatically take over the role of the earlier version, and then stopping the older version.
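A minimal sketch of this version-based selection is given below. The class VersionedStore and its methods are hypothetical, and the sketch simplifies the rule by tracking the accepted version per data sort instead of per sending agent.

```python
class VersionedStore:
    """Local database in which the producer's version is an extra key field."""

    def __init__(self):
        self._rows = {}         # (sort, key, version) -> instance
        self._min_version = {}  # sort -> lowest version still accepted

    def receive(self, sort, key, version, instance):
        # Instances from a superseded application version are discarded.
        if version < self._min_version.get(sort, 0):
            return
        self._rows[(sort, key, version)] = instance

    def read(self, sort, key):
        versions = [v for (s, k, v) in self._rows if s == sort and k == key]
        if not versions:
            return None
        newest = max(versions)
        if len(versions) > 1:
            # Multiple versions exist: from now on accept only the newest.
            self._min_version[sort] = max(
                newest, self._min_version.get(sort, 0)
            )
        return self._rows[(sort, key, newest)]


store = VersionedStore()
store.receive("track", ("KL332", 1), 1, "from version 1")
before_upgrade = store.read("track", ("KL332", 1))
store.receive("track", ("KL332", 1), 2, "from version 2")
after_upgrade = store.read("track", ("KL332", 1))
store.receive("track", ("KL332", 2), 1, "late result from version 1")
discarded = store.read("track", ("KL332", 2))
```

Once the reader has seen output from version 2, later output from version 1 is ignored, so the old application can be stopped without the consumer noticing.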

Fault-tolerance

 In safety-critical systems, such as aircraft flight control systems, there is a need for redundancy in order to mask hardware failures during operation. Fault-tolerance in general is a very complex requirement to meet and can, of course, only be partially solved in software. In SPLICE, the agents have been refined to provide a mechanism for fault-tolerant behavior that is transparent to the applications. The mechanism supports both active and passive replication of application processes. By making fault-tolerance a property of the coordination model, the design complexity of applications is significantly reduced.

 Full treatment of fault tolerance being beyond the scope of this article, only one refinement in the coordination model will be mentioned. When an application process performs a write operation, its agent additionally stores a copy of the data sort instance in one or more remote databases, located in different units of failure. When, in case of failure, a back-up of the application is activated, its agent restores the local data from one of the remote databases.

 An optimization to this scheme is possible to minimize the amount of data remotely stored. As mentioned before, embedded systems are mainly concerned with data that represents continuous quantities: data is either an observation sampled from the system's environment or derived from such samples. Consequently, the environment itself can be regarded as a back-up of the system's data. Simply through new observations the system is able to recover from a failure. Not all data in an embedded system represent continuous quantities, however. By distinguishing two classes of data, representing continuous and discrete quantities respectively, storage and communication overhead can be significantly reduced: the agents only store data belonging to the discrete class in a remote database.
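The optimization can be sketched as follows. The names DataClass and FaultTolerantAgent are hypothetical, remote databases are modeled as plain dictionaries, and the classification of flightplan as discrete and report as continuous is an assumption made only for this example.

```python
from enum import Enum


class DataClass(Enum):
    CONTINUOUS = "continuous"  # can be recovered through new observations
    DISCRETE = "discrete"      # must be backed up remotely


class FaultTolerantAgent:
    def __init__(self, sort_classes, remote_dbs):
        self.sort_classes = sort_classes  # sort name -> DataClass
        self.remote_dbs = remote_dbs      # databases in other failure units
        self.local_db = {}

    def write(self, sort, key, instance):
        self.local_db[(sort, key)] = instance
        # Only data of the discrete class is stored remotely.
        if self.sort_classes[sort] is DataClass.DISCRETE:
            for remote_db in self.remote_dbs:
                remote_db[(sort, key)] = instance

    def restore(self):
        """Reload local data from one remote database after a failure."""
        self.local_db.update(self.remote_dbs[0])


classes = {
    "report": DataClass.CONTINUOUS,    # assumption for this example
    "flightplan": DataClass.DISCRETE,  # assumption for this example
}
remote = {}
primary = FaultTolerantAgent(classes, [remote])
primary.write("report", 1, (1.0, 2.0))
primary.write("flightplan", "KL332", "Boeing 737")

backup = FaultTolerantAgent(classes, [remote])  # activated after a failure
backup.restore()
```

After the restore, the back-up holds the flight plan but no radar reports; in this scheme the reports are expected to be replaced by fresh observations of the environment.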
 
 

Conclusion

We have presented a software architecture for large embedded systems that incorporates an explicit coordination model. We have demonstrated how, starting from a relatively simple model based on a shared data space, the model can be successively refined to meet the requirements that are typical for this class of systems.

 Currently the architecture is applied in the development of commercially available command-and-control and traffic management systems. These systems consist of some 1000 application processes running on close to 100 processors interconnected by a hybrid communication network. The total size of the software is several million lines of source code written in Ada and C [2].

 Experience with the development of these systems confirms that the coordination model, including all of the refinements discussed, significantly reduces the complexity of the design process [3]. Due to the high level of decoupling between application processes, these systems are relatively easy to develop and integrate in an incremental way. Moreover, distribution of processes and data, fault-tolerant behavior, graceful degradation, and dynamic reconfiguration are handled by the coordination model in an elegant and transparent way.
 
 

References

[1]
 J.-P. Banatre, D. Le Metayer, "Programming by Multiset transformation", Communications of the ACM, Vol. 36, No. 1, 1993, pp. 98-111.
[2]
 M. Boasson, "Control Systems Software", IEEE Transactions on Automatic Control, Vol. 38, No. 7, 1993, pp. 1094-1107.
[3]
 M. Boasson, "Complexity may be our own fault", IEEE Software, March 1993.
[4]
 M. Boasson, Software Architecture special issue (guest editor), IEEE Software, November 1995.
[5]
 N. Carriero, D. Gelernter, "Linda in Context", Communications of the ACM, Vol. 32, No. 4, 1989, pp. 444-458.
[6]
 P. Ciancarini, C. Hankin (Eds.), "Coordination Languages and Models", Lecture Notes in Computer Science 1061, Springer, 1996.
[7]
 D. Gelernter, N. Carriero, "Coordination Languages and their Significance", Communications of the ACM, Vol. 35, No. 2, 1992, pp. 97-107.
[8]
 K. Jackson, M. Boasson, "The importance of good architectural style", Proc. of the workshop of the IEEE TF on Engineering of Computer Based Systems, Tucson, 1995.
[9]
 G.-C. Roman, H.C. Cunningham, "Mixed Programming Metaphors in a Shared Dataspace Model of Concurrency", IEEE Transactions on Software Engineering, Vol. 16, No. 12, 1990, pp. 1361-1373.