Applications in the embedded systems domain mostly deal with data instances that represent continuous quantities: data is either an observation sampled from the system's environment, or derived from such samples through a process of data association and correlation. The data itself is relatively simple in structure; there are only a few data types, and given the volatile nature of the samples, only recent values are of interest. However, samples may enter the system at very short intervals, so sufficient throughput and low latency are crucial properties. In addition, but to a lesser extent, embedded systems maintain some discrete information, which is either directly related to external events or derived through qualitative reasoning from the sampled input.
This situation is changing rapidly, however, and we can expect a shift, particularly in large embedded systems, towards more intensive information management. As embedded systems become larger and more complex, the management of data assumes a more central role in their architecture. Indeed, the software architecture that we introduce in the next section is strongly data-oriented, in contrast with more traditional approaches to embedded systems design, which are mainly process- or control-oriented.
The shared data space
In SPLICE the shared data space is organized after the well-known relational data model. The data elements are structured as labeled records; each record definition consists of a number of typed fields. The usual basic types, such as integer, real, and character are provided, augmented with a number of type constructors, such as enumerated types, arrays, and nested records, that allow defining more complex types. Each record definition is associated with a system-wide unique label, such that there exists a one-to-one correspondence between a record label and the interpretation of the corresponding data by the applications that use it. We refer to the unique label of a record as its data sort. Data sorts enable applications to distinguish between different types of information.
To illustrate, we consider a simplified example taken from the domain of air traffic control. Typically a system in this domain would be concerned with various aspects of flights, such as flight plans and the progress of flights as tracked from the reports received from the system's surveillance radar. Hence, we introduce data sorts flightplan, report, and track as indicated in fig. 1.
sort report
    key index : integer
    position : vector
    timestamp : time
sort track
    key flightnumber : string
    key index : integer
    state : vector
Figure 1: An example definition of data sorts flightplan, report and track
The record definition of data sort flightplan consists of
four fields: a flight number, e.g. KL332 or AF1257, the scheduled times
for departure and arrival, and the type of aircraft that carries out the
flight, e.g. a Boeing 737 or an Airbus A320. By declaring the flight number
as a key field, it is assumed that each flight is uniquely determined by
its flight number. Data sort report contains the position vector
of an object as measured at a specific time by the system's surveillance
radar. A unique index is attached to be able to distinguish between different
reports. Through a correlation and identification process, the progress
of individual flights is recorded in data sort track. The state
vector typically contains position and velocity information on the associated
flight number. The index identifies the report from which the track information
is derived. Note that the flight number alone does not suffice as a key,
since there will be many consecutive reports during a flight.
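As an illustration only, the report and track sorts can be mirrored in Python. This is a minimal sketch and not the SPLICE definition language: the class names, the `key_of` helper, and the choice of tuples of floats for vector and a float for time are all assumptions made for this example.

```python
from dataclasses import dataclass, field, fields
from typing import Tuple


@dataclass(frozen=True)
class Report:
    # Key fields are marked through field metadata (an assumed convention).
    index: int = field(metadata={"key": True})
    position: Tuple[float, float, float]   # assumed encoding of "vector"
    timestamp: float                        # assumed encoding of "time"


@dataclass(frozen=True)
class Track:
    flightnumber: str = field(metadata={"key": True})
    index: int = field(metadata={"key": True})
    state: Tuple[float, ...]                # position and velocity components


def key_of(instance):
    """Return the tuple of key-field values that identifies an instance."""
    return tuple(
        getattr(instance, f.name)
        for f in fields(instance)
        if f.metadata.get("key")
    )
```

Two tracks for the same flight but derived from different reports get different keys under this scheme, which matches the remark that the flight number alone does not suffice as a key.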
Applications
Basically, applications interact with the shared data space through
a pair of primitive operations for reading and writing data elements. These
operations have the following form:
A distributed software architecture
The first aspect that we consider here is distribution of the shared data space over a network of computer systems. The coordination model is refined by introducing two additional components to the basic software architecture from the previous section. The additional components consist of agents and a communication network.
Each application process interacts with exactly one agent. An agent embodies a local database for storing data sort instances, and processing facilities for handling all communication needs of the application process it serves. All agents are identical and need no prior information about either the application processes or their communication requirements. Communication between agents is established by a message passing mechanism. Messages between agents are handled by the communication network that interconnects them. The network must support broadcasting, but should preferably also support direct addressing of agents, and multicast. An application process interacts with its assigned agent by means of the primitive read and write operations. The interaction with agents is transparent with respect to the basic coordination model: the application processes continue to operate on a logically shared data space.
The agents are passive servers of the application processes, but are actively involved in establishing and maintaining the required inter-agent communication. The communication needs are derived dynamically by the collection of agents from the read and write operations that are issued by the application processes. The protocol that is used by the agents to manage communication is based on a subscription paradigm. The assumption underlying this protocol is that the majority of data transfers in a system occurs regularly, and that it is therefore beneficial to forward data to all known users as soon as it is written (cf. a subscription to a newspaper: the newspaper is sent upon printing, without further requests from those subscribed).
As a result of this protocol, the shared data space is selectively replicated across the agents in the network. The local database of each agent contains instances of only those data sorts that are actually read or written by the application process it serves. In practice the approach is viable, particularly for large systems, since the application processes are generally interested in only a fraction of all data sorts. Moreover, the communication pattern in which agents exchange data is relatively static: it may change when the operational mode of a system changes, or in a number of circumstances in which the configuration of the system changes (such as extensions or failure recovery). Such changes to the pattern are very rare with respect to the number of actual communications using an established pattern. It is therefore beneficial from a performance point of view to maintain a subscription registration. After an initial short phase each time a new data sort has been introduced, the agents will have 'learned' the new communication requirement. This knowledge is subsequently used by the agents to distribute newly produced instances of a data sort to all the agents that hold a subscription. Since subscription registration is maintained dynamically by the agents, all changes to the system configuration will automatically lead to adaptation of the communication patterns.
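The subscription paradigm can be sketched in a few lines of Python. This is a simplified, in-process illustration under stated assumptions, not the actual SPLICE protocol: the `Network` and `Agent` classes, the message tuples, and the rule that a first read broadcasts a subscription request are all hypothetical. Agents that join after a subscription was broadcast are not handled here.

```python
from collections import defaultdict


class Network:
    """Hypothetical in-process stand-in for the communication network."""

    def __init__(self):
        self.agents = []

    def attach(self, agent):
        self.agents.append(agent)

    def broadcast(self, sender, message):
        for agent in self.agents:
            if agent is not sender:
                agent.receive(sender, message)

    def send(self, sender, target, message):
        target.receive(sender, message)


class Agent:
    def __init__(self, name, network):
        self.name = name
        self.network = network
        self.local_db = defaultdict(dict)     # sort -> {key: instance}
        self.interests = set()                # sorts read by the local application
        self.subscribers = defaultdict(list)  # sort -> agents holding a subscription
        network.attach(self)

    def read(self, sort):
        # The first read of a sort announces the interest to all other agents.
        if sort not in self.interests:
            self.interests.add(sort)
            self.network.broadcast(self, ("subscribe", sort))
        return list(self.local_db[sort].values())

    def write(self, sort, key, instance):
        # Forward to every known subscriber as soon as the data is written.
        for subscriber in self.subscribers[sort]:
            self.network.send(self, subscriber, ("data", sort, key, instance))
        self.local_db[sort][key] = instance

    def receive(self, sender, message):
        if message[0] == "subscribe":
            _, sort = message
            if sender not in self.subscribers[sort]:
                self.subscribers[sort].append(sender)
        elif message[0] == "data":
            _, sort, key, instance = message
            self.local_db[sort][key] = instance
```

In this sketch an agent whose application never reads or writes a given sort records the subscriptions of others but never stores instances of that sort, which corresponds to the selective replication described above.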
Temporal aspects
In embedded systems data gradually loses its value as the environment changes and time evolves. Hence, the application processes usually require only a limited temporal view on the overall contents of the shared data space. As time progresses, any data that falls outside the temporal view of the application processes can be safely removed from the shared data space, since it is no longer of interest.
In addition to the removal of stale data, agents use the following mechanism to limit memory requirements. If an agent receives a data sort instance from another agent on the network, it first verifies whether an instance with the same key values currently exists in the local database. If it does, the newly received instance is more recent, and the currently stored instance is overwritten. Conversely, if an application process performs a write operation, its agent will first forward a copy of the instance to all subscribed agents, after which it replaces any current instance in the local database with the same key values by the more recent version.
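Both memory-limiting mechanisms, removal of stale data and overwriting by key values, can be illustrated with a small local database. The sketch below is an assumption-laden illustration: the `LocalDatabase` class, the `max_age` parameter as a stand-in for the temporal view, and the explicit `now` arguments are hypothetical and not taken from SPLICE.

```python
class LocalDatabase:
    """Hypothetical local database of one agent."""

    def __init__(self, max_age):
        self.max_age = max_age  # assumed width of the temporal view, in seconds
        self._rows = {}         # (sort, key) -> (arrival_time, instance)

    def store(self, sort, key, instance, now):
        # An instance with the same key values replaces the stored one,
        # so at most one instance per key is kept.
        self._rows[(sort, key)] = (now, instance)

    def purge(self, now):
        # Remove everything that has fallen outside the temporal view.
        stale = [k for k, (t, _) in self._rows.items() if now - t > self.max_age]
        for k in stale:
            del self._rows[k]
        return len(stale)

    def instances(self, sort):
        return [inst for (s, _), (_, inst) in self._rows.items() if s == sort]
```

Passing the time explicitly keeps the sketch deterministic; a real agent would presumably use its own clock or a timestamp carried in the data.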
Since application processes operate asynchronously on the shared data space, there is no need to group the distribution of a data sort instance to the collection of subscribed agents into an atomic transaction. This enables a very efficient implementation in which the produced data is distributed as fast as possible and the latency between actual production and use of the data depends largely on the consuming application processes. This results in upper bounds that are acceptable for embedded systems where timing requirements are of the order of milliseconds.
System Modifications and Extensions
A major problem in the design of systems of the kind considered
here is the need to provide for upgrades and other modifications, preferably
while the existing system remains on-line. There are two distinct cases
to be considered.
The second case, clearly, is more difficult. One special, but important, category of modifications can be handled by a simple refinement of the agents. Consider the problem of upgrading a system by replacing an existing application process with one that implements the same function, but using a better algorithm, leading to higher quality results. In many systems it is not possible to physically replace the old application with the new one, since this would require the system to be brought off-line.
By a refinement of the agents it is possible to support on-line replacement of application processes as follows. If an application process performs a write operation, its agent attaches an additional key field to the data sort instance representing the application's version number. Upon a read request, an agent now first checks whether multiple versions of the requested instance are available in the local database. If this is the case, the instance having the highest version number is delivered to the application. From that moment on, all instances with lower version numbers, received from the same agent, are discarded. In this way an application process can be dynamically upgraded, simply by starting the new version of the application, after which it will automatically take over the role of the earlier version, and then stopping the older version.
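The version-based replacement rule might look as follows. This is a rough sketch under simplifying assumptions: the `VersionedStore` class is hypothetical, and it tracks the accepted version per data sort rather than per sending agent, which is coarser than the rule described above.

```python
class VersionedStore:
    """Hypothetical local store in which the version number is an extra key field."""

    def __init__(self):
        self._rows = {}         # (sort, key, version) -> instance
        self._min_version = {}  # sort -> lowest version still accepted

    def receive(self, sort, key, version, instance):
        # Instances from a superseded application version are discarded.
        if version < self._min_version.get(sort, 0):
            return False
        self._rows[(sort, key, version)] = instance
        return True

    def read(self, sort, key):
        versions = [v for (s, k, v) in self._rows if s == sort and k == key]
        if not versions:
            return None
        newest = max(versions)
        # From now on, lower versions of this sort are no longer accepted.
        self._min_version[sort] = max(newest, self._min_version.get(sort, 0))
        for v in versions:
            if v < newest:
                del self._rows[(sort, key, v)]
        return self._rows[(sort, key, newest)]
```

Starting the new application version is then enough for readers to switch over: once an instance with the higher version number has been read, output of the old version is ignored and the old process can be stopped.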
Fault-tolerance
In safety-critical systems, such as aircraft flight control systems, there is a need for redundancy in order to mask hardware failures during operation. Fault-tolerance in general is a very complex requirement to meet and can, of course, only be partially solved in software. In SPLICE, the agents have been refined to provide a mechanism for fault-tolerant behavior that is transparent to the applications. The mechanism supports both active and passive replication of application processes. By making fault-tolerance a property of the coordination model, the design complexity of applications is significantly reduced.
Full treatment of fault tolerance being beyond the scope of this article, only one refinement in the coordination model will be mentioned. When an application process performs a write operation, its agent additionally stores a copy of the data sort instance in one or more remote databases, located in different units of failure. When, in case of failure, a back-up of the application is activated, its agent restores the local data from one of the remote databases.
An optimization of this scheme minimizes the amount of data stored remotely. As mentioned before, embedded systems are mainly concerned with data that represents continuous quantities: data is either an observation sampled from the system's environment or derived from such samples. Consequently, the environment itself can be regarded as a back-up of the system's data: simply through new observations the system is able to recover from a failure. Not all data in an embedded system represents continuous quantities, however. By distinguishing two classes of data, representing continuous and discrete quantities respectively, storage and communication overhead can be significantly reduced: the agents only store data belonging to the discrete class in a remote database.
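The selective back-up of discrete data can be sketched as follows. The names `DataClass` and `BackedUpAgent` are hypothetical, plain dictionaries stand in for the remote databases, and the classification of flightplan as discrete and report as continuous is an assumption made for this example.

```python
from enum import Enum


class DataClass(Enum):
    CONTINUOUS = "continuous"  # recoverable through new observations
    DISCRETE = "discrete"      # must be backed up remotely


class BackedUpAgent:
    def __init__(self, classes, remote_dbs):
        self.classes = classes        # sort -> DataClass
        self.remote_dbs = remote_dbs  # dicts standing in for remote databases
        self.local_db = {}            # (sort, key) -> instance

    def write(self, sort, key, instance):
        self.local_db[(sort, key)] = instance
        # Only discrete data is copied to the remote databases.
        if self.classes[sort] is DataClass.DISCRETE:
            for remote in self.remote_dbs:
                remote[(sort, key)] = instance

    def restore(self):
        # A back-up agent reloads its local data from one remote database;
        # this sketch simply takes the first one.
        self.local_db = dict(self.remote_dbs[0])
```

After a failure, the restored agent holds the discrete data only; continuous data such as radar reports is expected to be replenished by new samples from the environment.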
Currently the architecture is applied in the development of commercially available command-and-control and traffic management systems. These systems consist of some 1000 application processes running on close to 100 processors interconnected by a hybrid communication network. The total size of the software measures several million lines of source code written in Ada and C [2].
Experience with the development of these systems confirms
that the coordination model, including all of the refinements
discussed, significantly reduces the complexity of the design process
[3]. Due to the high level of decoupling between application
processes, these systems are relatively easy to develop and integrate
in an incremental way. Moreover, distribution of processes and data,
fault-tolerant behavior, graceful degradation, and dynamic
reconfiguration are handled by the coordination model in an elegant
and transparent way.