Minerva : An Event Based Model For Extensible Network Management

David J. Hughes
Bond University
Bambi@Bond.edu.au

Wu, Zheng Da
Bond University
wz@Bond.edu.au

Presented at the Internet Society's INET93 in San Francisco



[Note : The model outlined in this paper reflects the initial outcome of the research. The model proved to be incapable of handling complex relationships between elements of data. The current revision of the code utilises an object based model rather than this original flat model. The object based framework is outlined in the AUUG95 conference paper. - Bambi]



Abstract

The goal of the work outlined within this paper is to provide a Network Management System (NMS) capable of supporting "non-typical" environments. No NMS developer can foresee every requirement and ad hoc function that a network manager may desire. As such, the NMS must be flexible and extensible in terms of the acquisition, interpretation, storage and display of information. This should not be limited to information pertaining to network interface byte counts or other typical network related data, but should also cater for any information the network manager, or system administrator, may wish to record.

By offering a simple, generic interface for the acquisition of data, many problems may be avoided. The NMS is no longer restricted to one method for collection of information (e.g. SNMP). It also offers a simple mechanism for the management of devices that do not support a management protocol at all.

A model for the acquisition, storage and interpretation of abstract information will be presented. Tools that have been designed to interwork with the model will also be discussed.



Introduction

As the importance of the humble data communications network increases, we are seeing a similar increase in the deployment of network management systems. The common, optimistic view is that a network management system will provide an endless stream of categorised and summarised information pertaining to the health and operational status of a network. In reality, such an expectation is rarely achieved due to the vast range of conflicting definitions of network management, people's perception of networks in general, and the methods available for a

A network may be viewed as a collection of interconnected cables providing a path for electronic data or it may be viewed as a tool for the provision of end-user services. As such, a person subscribing to the former of the definitions would be happy knowing the status of physical network devices such as bridges and routers. A manager concerned with service provision would not be happy with this information alone because although a host may be "reachable" in network terms, it cannot be viewed as operational if the software responsible for service provision is not functioning correctly.

To complicate matters further, there is no one standard for the collection or representation of network management information. Protocols such as SNMP[1], the emerging SNMPv2, and OSI's CMIP[2] have been developed to provide mechanisms for the collection of management information, but this is yet another area of choice. There are also devices that do not support any of the commonly accepted management protocols. In such a situation, there may be pockets of the network that are ignored by an SNMP or CMIP based management system.

Another area of concern is integration of external information into the network management platform. If a system has been designed to be a single, centralised point for collection and analysis of network status and performance information, it appears practical to include other forms of operational and status information. Such information may include status details for services provided by network nodes (e.g. printing and file serving), operational information from time-sharing or server hosts (e.g. disk space utilisation or system load average), information relating to possible security breaches and analysis of other host based log files (e.g. mail and news). By incorporating support for all forms of event and log information, the NMS can truly be seen as a single management point. It also allows for a consistent approach to the analysis and display of the information recorded.

Once the management information has been collected, other problems start to emerge. If "non-standard" information is received by the management station, how is it interpreted, stored and represented to the user? If a report is required outlining the trend analysis for disk space usage on a set of file servers or the utilisation of electronic mail within the organisation, can the management system produce the correlated information? If the network manager wishes to graphically view ad hoc, real-time data, such as the number of remote login sessions on a central time-sharing host, can the management system be of use? If the data can be collected in some manner, be it via a protocol such as SNMP or by more traditional methods such as an awk script, the answer should be yes. The problem becomes one of flexibility and extensibility within the management application.



Minerva

Since 1990, work has been undertaken to design a model for an integrated management system[3]. The result of that work, known as Minerva [4], offers several features that allow for the customisation of the NMS to suit the requirements of the network manager. To date, the main effort has been focused on provision of a mechanism via which the network manager may simply and efficiently collect user defined information. This has encompassed the representation and storage of user defined information, incorporation of the information into the "state" of the network as perceived by the NMS, and a method by which that information may be "injected" into the NMS. The outcome has been a central data storage and interpretation model known as the Reporter.

Design work has also been carried out in the areas of flexible integration of SNMP, user defined protocol specifications for use in protocol analysis and decoding, and graphical representation of the gathered information. At the time of writing, implementation of the Reporter and its associated tools has been completed and Alpha testing is underway. Implementation of the other modules will be carried out over the ensuing months with the resultant code to be made available to the Internet community.



The Reporter

The design of the Reporter, and as such Minerva in general, is based on the thesis that any information about which a network manager may be concerned is the result of an event. Such an event may be an interface on a router failing, a network service refusing connections, a process on a particular host crashing, or the actions associated with gathering statistical information. By focusing at the event level rather than higher level abstractions, it becomes possible to extend the capabilities of the NMS to cater for any user definable event.

By defining a set of known events, the simple act of notifying the reporter of a given event's occurrence can convey vast amounts of implicit information. This enables the complexity involved with interpretation and processing of the event notification to be contained within the Reporter rather than being replicated in the tools used to gather the information. Naturally, the definition of each event must contain enough information to enable the correct interpretation and processing of the event notification.

Due to the basic premise that a network manager is concerned only with state changes within the realm of the network, such as a node failing, rather than with the result of each test carried out, the event definition must include some mechanism via which the Reporter may ascertain the current state of an object and also the proposed state as a result of processing the event. If the event will not cause a change in the state of the object (for example, a report that the central router is still running), the report is discarded. Only if the event indicates a change of state will the event be processed and the defined actions be taken. This is achieved by nominating a field within a relation as holding the current state of the object. Upon receipt of the report, the Reporter determines what effect processing the event would have on that field. If the value changes, it is determined that a state change has occurred and processing continues. Configuration of an event's state relation and field is shown in Figure 2.
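The state-change test described above can be illustrated with a minimal sketch. It is not the Reporter's code: the in-memory dictionary standing in for the NMS database, the relation and field names, and the function name process_report are all assumptions made for the example.

```python
# Hypothetical in-memory stand-in for the NMS database:
# relation name -> object key -> record (a dict of fields).
database = {"node_status": {"router1": {"state": "up"}}}


def process_report(relation, key, state_field, proposed_value):
    """Return True if the report changes the nominated state field.

    A report that would leave the field unchanged is discarded (False).
    An object with no stored state yet is treated as a state change.
    """
    record = database.setdefault(relation, {}).setdefault(key, {})
    if record.get(state_field) == proposed_value:
        return False  # no state change: discard the report
    record[state_field] = proposed_value  # state change: update the relation
    return True  # caller would now log the event and raise any alerts
```

Repeated "router1 is up" reports return False and are dropped, while the first "down" report returns True and updates the stored state.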

To enable notification of the event to the appropriate people, details pertaining to the place of work of the operations staff are maintained. This allows the Reporter to notify the appropriate people at their place of work rather than requiring a staff member to keep a "casual eye" on the network management station.

Depending on the severity of the event, an event definition may specify that an alert be generated on receipt of a report for that event. In such a case, a window as illustrated in Figure 3 will be displayed on workstations or X terminals used by the operations staff. The alert includes the host to which the report relates, the event class and name, and a textual message describing the event. The message is generated from a definition specified during configuration of the event. To cater for unusual situations, the default message may be overridden by including a message definition within the event report itself.

Associated with each window is an acknowledgement button. When a member of the operations staff acknowledges the alert, the other recipients of the alert are notified that it has been acknowledged and by whom. In future developments, the acknowledgement of an alert popup will be linked to a trouble ticket system. The user acknowledging the alert will automatically have a trouble ticket opened and assigned to him/her. This will guarantee reliable information for the generation of management information such as trend reports.

Due to the resources consumed by displaying the alert popups on a number of displays, a timeout mechanism is included. If the alert is not acknowledged within a given period of time, the windows are destroyed and an electronic mail message outlining the alert information is sent to the operations staff members. This reduces the in-core requirements of the Reporter and caters for the situation where all the operations staff are away from their desks. It is viewed that network management staff rely on electronic mail as a vital communications tool and are likely to check their "inbox" on return to the office.
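The alert lifecycle described above (display, acknowledgement, timeout and mail fallback) might be sketched as a small decision function. The function name, the argument names and the returned action strings are assumptions chosen for illustration; the paper does not specify how the Reporter implements the timeout.

```python
def next_alert_action(raised_at, acknowledged_by, now, timeout_seconds):
    """Decide what to do with one outstanding alert.

    raised_at and now are timestamps in seconds; acknowledged_by is the
    name of the acknowledging staff member, or None if nobody has yet.
    """
    if acknowledged_by is not None:
        # Tell the other recipients who acknowledged the alert.
        return "notify_other_recipients"
    if now - raised_at >= timeout_seconds:
        # Nobody responded in time: free the windows and fall back to mail.
        return "destroy_windows_and_send_mail"
    return "keep_displayed"
```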


Event Report Structure

Each event notification, termed a report, has the capability of generating a log entry in the NMS's database, updating a relation in the database and generating an immediate alert for network management staff. By allowing updates of the database to be performed as a result of event reports, the model of the network's state, as used by mapping tools etc., may reflect the information conveyed by the report (e.g. host based service availability or node reachability). The reporter is the only mechanism via which status information may be entered into the NMS's database. This enables a simple and consistent interface for the acquisition of status information and the sharing of status objects (i.e. database relations) between several data acquisition mechanisms.

Event Definitions

Associated with each report is an event class and an event index. An event class indicates the specific area of operations to which the event pertains (e.g. node interface status), while the event index specifies a particular event within that class (e.g. interface down). Event classes and events may be added to or removed from the reporter's repertoire at any time and are not restricted to network related objects. An event class may be defined for the company wide database with events ranging from the availability of the database back-end to the failure of a batch update to run correctly. The only current restriction on event and event class definition relates to the number of events and classes. Each report carries a pair of unsigned, short integer fields indicating the event and event class. As such, approximately 65,000 event classes and events per class may be defined.

To allow incorporation of reported events into the "state of the network" as held in the NMS's database, each event definition contains details of database relations and relation fields that hold information pertaining to the event. If a report is received indicating that an interface on a particular node is no longer functioning, the appropriate record in the database must be updated to reflect the fact. This is achieved by configuring the interaction with the database in terms of dynamic and constant values. As discussed in more detail later in this paper, a report may include typed data to be extracted from the report and included in the processing of that event. Such dynamic information, combined with literal constants and system variables (e.g. system time, node to which the report pertains etc.), is used to fill the fields of the required database relations. Configuration of database interaction is depicted in Figure 1.
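As a rough illustration of filling relation fields from the three kinds of source named above, the sketch below assumes a hypothetical configuration format in which each field is mapped to a (kind, value) pair. The kinds "const", "data" and "sys", and the function name fill_fields, are invented for this example and are not taken from Minerva's configuration.

```python
import time


def fill_fields(field_spec, report_data, node, now=None):
    """Build one database record from an event's field configuration.

    field_spec maps a field name to (kind, value), where kind is:
      "const" - value is a literal constant
      "data"  - value is an index into the report's typed data fields
      "sys"   - value names a system variable ("time" or "node")
    """
    system = {"time": time.time() if now is None else now, "node": node}
    record = {}
    for field, (kind, value) in field_spec.items():
        if kind == "const":
            record[field] = value
        elif kind == "data":
            record[field] = report_data[value]
        elif kind == "sys":
            record[field] = system[value]
        else:
            raise ValueError(f"unknown source kind: {kind}")
    return record
```

For an "interface down" event, the specification might take the node name and time from system variables, the interface number from the first typed data field, and the status from the constant "down".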

Event reports are transmitted to the NMS via the network to allow report generation to occur on nodes other than the management station. Each report is contained within a single packet. The Reporter accepts event reports via either UDP or TCP. As the reporter protocol does not include acknowledgements, this allows for unreliable reports, à la the SNMP trap mechanism, to be sent via UDP, as well as more reliable reports using TCP. The report packet includes the nodename to which the report pertains, an optional message definition used to override the default event message, the event class, the event index, and a series of optional typed data fields.
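The following sketch shows one way a report packet carrying the listed items could be assembled and sent over either transport. The byte layout is an assumption made purely for illustration, since the paper does not give the wire format: only the use of unsigned short integers for the class and index, and the least-significant-octet-first ordering, come from the text. The function names build_report and send_report are likewise hypothetical.

```python
import socket
import struct


def build_report(node, event_class, event_index, message="", data_fields=b""):
    """Assemble one report packet using an ASSUMED layout:

    class (2 octets), index (2 octets), node name length (1), node name,
    message length (1, zero if no override message), message, then any
    already-encoded typed data fields.
    """
    node_bytes = node.encode("ascii")
    message_bytes = message.encode("ascii")
    if len(node_bytes) > 255 or len(message_bytes) > 255:
        raise ValueError("node name and message are limited to 255 octets here")
    # "<HH": two unsigned 16-bit integers, least significant octet first.
    header = struct.pack("<HH", event_class, event_index)
    return (header
            + bytes([len(node_bytes)]) + node_bytes
            + bytes([len(message_bytes)]) + message_bytes
            + data_fields)


def send_report(packet, host, port, reliable=False):
    """Send via TCP when reliable is True, otherwise as one UDP datagram."""
    if reliable:
        with socket.create_connection((host, port), timeout=5.0) as sock:
            sock.sendall(packet)
    else:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.sendto(packet, (host, port))
```

Because the protocol itself carries no acknowledgements, the choice of transport is the only reliability control: a UDP report may be lost silently, whereas a TCP report relies on the transport for delivery.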

Transmission of typed data between machines of varying byte order is achieved using an encoding method similar in concept to the OSI Basic Encoding Rules [5] (BER). The complexity of the encoding is greatly simplified when compared to the BER and as such, the number of available types and constructs is greatly reduced. The point of using a simple encoding for the typed data is to reduce the overhead of integrating other applications with the NMS. The encoding rules for typed data allow for 4 byte signed or unsigned integers, variable length strings and boolean values represented in a single byte. There is no support for floating point values as there is as yet no justifiable reason for increasing the complexity of the encoding rules to that extent.

Each data element is transmitted using a three part tuple including a one byte type tag, a one byte length field indicating the octet count, and the data octets. All multi-octet numeric values, including signed and unsigned data, the event class identifier and the event index, are formatted for transmission with the least significant octet first.
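The tag, length and data tuple described above can be sketched as follows. The tag numbers are hypothetical, as the paper does not list the actual values; the 4 byte integers, single byte booleans, one byte length field and least-significant-octet-first ordering follow the text.

```python
import struct

# Hypothetical tag values; the real ones are not given in the paper.
TAG_INT, TAG_UINT, TAG_STRING, TAG_BOOL = 1, 2, 3, 4
FIXED_LENGTH = {TAG_INT: 4, TAG_UINT: 4, TAG_BOOL: 1}


def encode_element(tag, value):
    """Encode one value as: type tag (1 octet), length (1 octet), data."""
    if tag == TAG_INT:
        data = struct.pack("<i", value)   # signed, least significant first
    elif tag == TAG_UINT:
        data = struct.pack("<I", value)   # unsigned, least significant first
    elif tag == TAG_STRING:
        data = value.encode("ascii")
    elif tag == TAG_BOOL:
        data = b"\x01" if value else b"\x00"
    else:
        raise ValueError(f"unknown type tag: {tag}")
    if len(data) > 255:
        raise ValueError("a one octet length field limits data to 255 octets")
    return bytes([tag, len(data)]) + data


def decode_elements(buf):
    """Decode a concatenation of encoded elements into a list of values."""
    values = []
    pos = 0
    while pos < len(buf):
        if pos + 2 > len(buf):
            raise ValueError("truncated element header")
        tag, length = buf[pos], buf[pos + 1]
        data = buf[pos + 2:pos + 2 + length]
        if len(data) != length:
            raise ValueError("truncated element data")
        if tag in FIXED_LENGTH and length != FIXED_LENGTH[tag]:
            raise ValueError("unexpected length for fixed size type")
        if tag == TAG_INT:
            values.append(struct.unpack("<i", data)[0])
        elif tag == TAG_UINT:
            values.append(struct.unpack("<I", data)[0])
        elif tag == TAG_STRING:
            values.append(data.decode("ascii"))
        elif tag == TAG_BOOL:
            values.append(data != b"\x00")
        else:
            raise ValueError(f"unknown type tag: {tag}")
        pos += 2 + length
    return values
```

One consequence of the one byte length field is that a single string element cannot exceed 255 octets, which the encoder above enforces explicitly.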

There are currently two methods available to programmers for generating event reports. Firstly, a library of C routines allows for event reports to be generated by applications. It offers two interfaces similar in design to the UNIX(1) execv() and execl() calls. Secondly, a stand-alone application is available to allow shell and perl scripts to generate reports. It uses the command line arguments for the parameters of the report.


Other Benefits of an Event Based Model

Utilising an event based model offers advantages other than just extensibility and ease of interfacing external data sources. It also lends itself well to remote manager to manager communications. A scaled down version of the NMS, acting as a report processor/forwarder, may be installed at a distant site. It is the responsibility of the forwarder to interpret event reports generated from within its realm and maintain a local database depicting the state of the subnetwork under its control. If, upon receipt of an event report, the forwarder determines that the event will force a state change, the event report is forwarded to the central Reporter. This approach localises the node polling traffic associated with SNMP styled data acquisition, reducing the level of utilisation of WAN link capacity associated with management traffic. Only in the instance of a problem is data reported to the central management station for processing.
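A report forwarder of the kind described above might look like the following sketch. The class name Forwarder, the use of a (node, event class) pair as the key into the local state, and the forward callback are assumptions for the example; the actual forwarder would consult its local database and event definitions.

```python
class Forwarder:
    """Keep local state per (node, event class); forward only changes."""

    def __init__(self, forward):
        self._state = {}          # (node, event_class) -> last known state
        self._forward = forward   # callable that sends to the central Reporter

    def receive(self, node, event_class, state):
        """Process one local report; return True if it was forwarded."""
        key = (node, event_class)
        if self._state.get(key) == state:
            return False          # no state change: absorbed locally
        self._state[key] = state
        self._forward(node, event_class, state)
        return True
```

With this arrangement, any number of local polls that find a node unchanged generate no WAN traffic; only the reports that alter the locally held state cross the link.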

Another scenario well suited to this model involves large networks where departments manage their local subnets while a central Network Operations Centre (NOC) is responsible for the network as a whole. If each departmental NMS is configured to forward reports to the enterprise-wide NMS, both the departmental and central operations staff are informed of problems as they occur. The NOC maintains a clear picture of the entire network without duplicating management traffic other than the single packet event reports generated by state changes.



Reporter Based Tools

Of the three tools that have been designed to communicate with the reporter, only the Monitor has been implemented at the time of writing. It provides for acquisition of management data at timed intervals. The other two tools, as described below, will be implemented later in the development timeline.

Monitor

Within the Minerva model, coordination of poll based data acquisition is the responsibility of the Monitor. Internally, the Monitor consists of a dynamic queue of poll elements where each element represents an action to be undertaken at regular, timed intervals. Currently, the Monitor is capable of ICMP echo tests (ping), testing of IP based services, and the execution of external commands. Due to the dynamic nature of the poll queue, the NMS may insert or remove entries from the queue while the Monitor is running. At a later stage, this will allow for the gathering of ad hoc, real-time information for graphical display.
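One possible shape for such a dynamic poll queue is sketched below using a binary heap ordered by due time. This is an assumption about structure, not a description of the Monitor's internals: the class name PollQueue, its method names, and the explicit now argument (used instead of a real clock so the behaviour is easy to follow) are all invented for the example.

```python
import heapq
import itertools


class PollQueue:
    """Poll elements ordered by due time; entries may come and go at run time."""

    def __init__(self):
        self._heap = []                  # entries: (due_time, token, name)
        self._elements = {}              # name -> (interval, action, token)
        self._tokens = itertools.count()

    def insert(self, name, interval, action, now):
        if interval <= 0:
            raise ValueError("interval must be positive")
        token = next(self._tokens)       # distinguishes re-inserted names
        self._elements[name] = (interval, action, token)
        heapq.heappush(self._heap, (now + interval, token, name))

    def remove(self, name):
        # The heap entry is left behind and skipped when it reaches the front.
        self._elements.pop(name, None)

    def run_due(self, now):
        """Run every element whose due time has passed; return their names."""
        ran = []
        while self._heap and self._heap[0][0] <= now:
            _, token, name = heapq.heappop(self._heap)
            element = self._elements.get(name)
            if element is None or element[2] != token:
                continue                 # removed or replaced since scheduling
            interval, action, _ = element
            action()
            ran.append(name)
            heapq.heappush(self._heap, (now + interval, token, name))
        return ran
```

Removal is lazy: a removed element's heap entry is simply ignored when it surfaces, which keeps both insertion and removal cheap while the queue is being serviced.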

To enable testing of network service provision rather than just node reachability, details pertaining to the network service protocols are defined. During the polling of a particular network node, the monitor utilises this information to mimic the actions of a service client and engage in a legitimate protocol exchange with the service provider. The contents of the first packet received from the target node are compared against a regular expression which defines the expected result. If the data returned does not match the regular expression, an error event is generated.
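The service test described above can be approximated by the sketch below, which connects to a TCP service, optionally sends an opening request, and checks the first data received against a regular expression. The function names, the 4096 octet read size and the five second timeout are assumptions; the Monitor's own protocol definitions are not reproduced here.

```python
import re
import socket


def matches_expected(first_packet, expected_pattern):
    """True if the first data received matches the expected pattern."""
    return re.search(expected_pattern, first_packet) is not None


def probe_service(host, port, expected_pattern, send=b"", timeout=5.0):
    """Act as a minimal client; return False where an error event is due."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            if send:
                sock.sendall(send)       # opening request, if the protocol needs one
            first_packet = sock.recv(4096)
    except OSError:
        return False                     # unreachable, refused or timed out
    return matches_expected(first_packet, expected_pattern)
```

For example, an SMTP server normally greets a client with a line beginning "220", so a pattern such as rb"^220 " would distinguish a healthy greeting from a "421" refusal on a host that is otherwise reachable.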

The inclusion of external commands in the polling model allows the NMS to be extended to include monitoring of protocols or other objects not in its original design. As an example, a simple UNIX shell script could be written, utilising the Columbia Appletalk Package, that checks the reachability of nodes on an Appletalk network and returns the results via the external interface to the Reporter. This capability clearly illustrates the flexibility obtained using the event based model.


SNMP Query Support

To provide a flexible mechanism for the acquisition of management information via SNMP, a lightweight scripting language has been designed. The syntax and style of the language closely mimic that of C to provide a familiar environment for the network manager. When implemented, the language will include interfaces to an SNMP library and also the reporter library. By using such a script, the network manager may extract the required information from a network node and inject the information into the NMS via the usual mechanism.

The use of a programmatic approach to SNMP based data acquisition offers several advantages, the most prominent of which is the ability to define tests capable of incorporating the semantics of the object in question. As an example, a script could be written that firstly obtains the number of interfaces possessed by a router and then enters a loop testing each interface in turn. If an interface is determined to be dysfunctional, another query could be generated to retrieve further information from the device. This information can then be included as dynamic data in the event report to aid the network manager in locating the problem.
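The interface loop described above is sketched here in Python rather than in the proposed scripting language, which had not been implemented at the time of writing. The snmp_get argument is a hypothetical callable that returns the value of one object identifier; no particular SNMP library is implied. The object identifiers are the MIB-II ifNumber, ifDescr and ifOperStatus objects, and the loop assumes interfaces are indexed from 1 to ifNumber.

```python
IF_NUMBER = "1.3.6.1.2.1.2.1.0"          # ifNumber.0
IF_DESCR = "1.3.6.1.2.1.2.2.1.2"         # ifDescr, indexed by ifIndex
IF_OPER_STATUS = "1.3.6.1.2.1.2.2.1.8"   # ifOperStatus, indexed by ifIndex
OPER_UP = 1                              # ifOperStatus value meaning "up"


def find_failed_interfaces(snmp_get):
    """Return (index, description) for each interface that is not up.

    snmp_get is a hypothetical function: OID string in, value out.
    """
    failed = []
    count = snmp_get(IF_NUMBER)
    for index in range(1, count + 1):
        if snmp_get(f"{IF_OPER_STATUS}.{index}") != OPER_UP:
            # Second query only for problem interfaces, to enrich the report.
            description = snmp_get(f"{IF_DESCR}.{index}")
            failed.append((index, description))
    return failed
```

Each returned pair could then be supplied as typed data in an event report, so the alert names the failed interface rather than only the node.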

When the SNMP script language has been implemented, support for the scripts will be added to the Monitor to allow timed polling via SNMP.


SNMP Trap Support

To allow integration of SNMP traps into the event based model, an extended SNMP trap daemon has been designed. The extensions will allow for the use of the Object Identifier returned in the SNMP trap packet as a mechanism for determining the event that corresponds to that trap. Once that information has been obtained, the trap information can be sent to the Reporter for inclusion in the database and notification of the operations staff.
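The lookup at the heart of such a trap daemon might be sketched as a simple table from object identifier to event. The event class and index numbers below are invented for the example, and the two identifiers used are the standard SNMPv2 linkDown and linkUp notifications, chosen only as familiar illustrations; the paper does not say which identifiers the daemon would be configured with.

```python
# Hypothetical table: trap object identifier -> (event class, event index).
TRAP_EVENTS = {
    "1.3.6.1.6.3.1.1.5.3": (1, 2),   # linkDown -> "interface down" (assumed)
    "1.3.6.1.6.3.1.1.5.4": (1, 1),   # linkUp   -> "interface up" (assumed)
}


def event_for_trap(trap_oid):
    """Return (event class, event index) for a trap, or None if unmapped."""
    return TRAP_EVENTS.get(trap_oid.strip("."))
```

A trap with no entry in the table returns None, leaving the daemon free to ignore it or to report it under a catch-all event.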


Conclusion

The event based model as described in this paper, combined with the mentioned data acquisition tools, is currently being tested at Bond University and other Australian sites. The initial results are showing that the model itself provides great flexibility and power to the user with respect to the range of objects capable of being managed. Further work is being undertaken to provide the missing functionality and combine the modules into a complete network management system.

With such an extensive range of information and the ability to expand to incorporate new requirements as they come to hand, such a system can be viewed as a single point of contact for all information relating to the management of a network. The end result is a collection of information ranging from the most detailed physical network operations data to high level information pertaining to the primary goal of any network: the provision of services to the community.


References

[1]
J. Case, M. Fedor, M. Schoffstall, and J. Davin, "A Simple Network Management Protocol", RFC 1067, August 1988.
[2]
International Organisation for Standards, ISO 9596 "Common Management Information Protocol Specification".
[3]
Wu Z. D. and D. J. Hughes, "An Approach to Integrate Management Facilities for Campus Network Environments", Proceedings of the 1991 Singapore International Conference on Networks, pp. 131-136, September 1991.
[4]
D. J. Hughes and Wu Z. D., "Minerva - An Integrated Network Management System", AARNet Networkshop, December 1992.
[5]
International Organisation for Standards, ISO 8825 "Basic Encoding Rules", 1987.
[6]
J. Postel, "User Datagram Protocol", RFC 768, August 1980.



Author Information

Mr. Hughes joined Bond University in September 1988 where he is currently the University's Senior Network Programmer. During his time at Bond, he has been responsible for the implementation and management of the campus data communications facilities, and the development of software to extend the services provided by the network. His interest in the design of network management systems began in 1990 and has led to five papers co-authored with Dr. Wu. He received his B. App. Sci. in computer science from the Queensland University of Technology and is currently researching the design of network management systems for his Ph.D. Mr. Hughes' interests also include artificial intelligence, which has led to his lecturing in the use of Prolog for Expert Systems.

Dr. Zheng da Wu joined Bond University in 1989 where he is currently an assistant professor of computer science in the School of Information Technology. As a researcher, teacher, and consultant, his subject has been in the area of computer networks. His current research interests include multimedia communications protocols, network management, and performance modelling and analysis of networks. Dr. Wu received his MSc in Computer Science from the Graduate School of Chinese Academy of Sciences, Beijing, China, in 1981 and his Ph.D. at the Computing Laboratory, University of Kent, England, in 1987.


Footnotes

(1)
UNIX is a registered trademark of UNIX System Laboratories Inc.