This is version 2. It is not the current version, and thus it cannot be edited.
SNMP BC

SNMP Binding Component
one pager



Frank Kieviet / Philip Chan

Revisions:

Who            When      What
Frank Kieviet  12-01-06  Created
Frank Kieviet  12-05-06  Completed list of architectural concerns

(C) Copyright 2006 Sun Microsystems

High level description and applicability

The SNMP BC can be used to monitor large numbers of SNMP variables and forward the data for further processing. This further processing is done in an SE. A description of possible SEs is outside the scope of this project, but it is likely that this SE will perform further slicing and dicing of the data, and store the resulting data in a relational database. An SE that may be considered in this scenario is the IEP.

The SNMP BC is meant to be the data collecting agent in a larger monitoring application. Other components in this larger monitoring application may include the aforementioned data analysis SE, but also business rules engines, visual applications to provide operators with a graphical view of the health of the system, alerting components, trend analysis components, historical analysis modules, etc.

The focus of the SNMP BC in this larger application is on monitoring. SNMP variables are monitored both by actively querying them and by passively receiving SNMP traps and informs. Active querying is done by scheduled polling of specified SNMP variables.

Before propagation to the SE, data passes through a primary filter that removes redundant data, e.g. a poll of a boolean variable will only propagate changes in the monitored value. The primary filter is limited in functionality and is not meant as a place to introduce elaborate business logic: the SE is the place for that.

What the SNMP BC will not do: the SNMP BC does not support an external interface that can be used to set SNMP variables or poll variables on demand (negotiable). No transport protocols other than UDP are supported.

Requirements

  1. Actively monitor variables on up to 100,000 agents using SNMP GET
  2. Receive SNMP TRAP, INFORM
  3. Support SNMPv1, SNMPv2c, SNMPv3
  4. Dynamic configuration: additions to and changes in the configuration of what to monitor without redeploying
  5. Provisions for distributed deployment, fail over, redundancy

A summary of SNMP

This description of SNMP is confined to those aspects of SNMP that are relevant for the design decisions in this document.

SNMP is a protocol for network management. SNMP is primarily a request / reply protocol (GET and SET). Agents are devices that listen for these GET / SET requests. Managers are applications that send these requests to agents. Next to GET / SET, SNMP also supports a way for agents to send unsolicited notifications to the manager (TRAP / INFORM).

SNMP can run over multiple networking protocols, but the most popular one is UDP. SNMP over UDP is not a reliable protocol: packets may be dropped. Each GET, SET, TRAP or INFORM request or message is transmitted in one UDP packet. Each packet contains only one data item. UDP packets are of limited size (< 500 bytes for some network infrastructures). Data items are limited to a handful of primitive values, essentially just ints and strings. Data is encoded using BER.

The data items that can be queried from or set on agents are organized in a tree. This tree is comparable to a database schema. The schema language used to describe this tree is a subset of ASN.1 called the SMI. The nodes in the tree are addressable using object ids (OIDs). OIDs are globally unique. An OID is a sequence of period-delimited numbers. The tree and its OIDs that an agent supports is called a MIB. Although there have been efforts to standardize MIBs, there are more than 2300 cataloged MIBs with over 1,000,000 OIDs.
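To make the tree addressing concrete, here is a minimal sketch of an OID as a sequence of period-delimited numbers, with a test for whether one OID lies in the subtree rooted at another. The class name Oid and its methods are hypothetical and not taken from any SNMP library.

```java
/** Hypothetical value class for an OID such as "1.3.6.1.2.1.1.3.0". */
final class Oid {
    private final long[] parts;

    private Oid(long[] parts) {
        this.parts = parts;
    }

    /** Parses a period-delimited OID; one leading period is tolerated. */
    static Oid parse(String text) {
        String s = text.startsWith(".") ? text.substring(1) : text;
        if (s.length() == 0) {
            throw new IllegalArgumentException("empty OID");
        }
        String[] tokens = s.split("\\.", -1);
        long[] parts = new long[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            // Throws NumberFormatException on empty or non-numeric parts.
            parts[i] = Long.parseLong(tokens[i]);
            if (parts[i] < 0) {
                throw new IllegalArgumentException("negative number in OID: " + text);
            }
        }
        return new Oid(parts);
    }

    /** True if this OID equals root or is a descendant of root in the tree. */
    boolean isInSubtreeOf(Oid root) {
        if (parts.length < root.parts.length) {
            return false;
        }
        for (int i = 0; i < root.parts.length; i++) {
            if (parts[i] != root.parts[i]) {
                return false;
            }
        }
        return true;
    }

    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                sb.append('.');
            }
            sb.append(parts[i]);
        }
        return sb.toString();
    }
}
```

For example, 1.3.6.1.2.1.1.3.0 (sysUpTime.0) lies in the subtree rooted at 1.3.6.1.2.1.1 (the system group), but not the other way around.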

Although the data is organized in a tree, MIBs can describe tables of data. This is done in a complicated way. Each cell in the table ends up with its own OID. Cells can be read only one at a time through GET requests.

MIBs typically refer to many data items, e.g. querying the MIB for the Java VM results in more than 400 data items. Much of this data is static configuration information; only a handful of data items contain information important for management, e.g. memory size and thread count.

Basic entities

Unit of monitoring -- The unit of monitoring can be thought of as either a complete MIB or the individual variables (OIDs) in a MIB. Since only a small portion of the data items in a MIB is relevant for monitoring (in the case of the Java VM MIB this is about 1%), it makes more sense to use individual OIDs as the unit of monitoring than to use a complete MIB.

The structure of data being processed -- Monitored variables can be considered as stand-alone entities, or can be thought of as nodes in the MIB tree. For the latter it should be considered that MIBs can be converted into XSDs (either a single small XSD describing all possible MIBs or one XSD per MIB), and hence the data of a device can be converted into XML. However, the consumer of the data, i.e. the consuming SE (e.g. the IEP), will likely prefer scalar or tabular data over complex structures because of the large number of MIBs involved. This large number of MIBs is also the reason that the design of the BC will be simplified if the BC does not need to have detailed knowledge of all MIBs involved. Therefore the structure of the data (the entities) should be scalar or tabular data.

Service level requirements and concerns

Performance: reduction of the number of data items produced -- Components consuming the data produced by the BC are likely more interested in events than in raw data values. E.g. an uptime variable may be used to detect if and when a system was restarted rather than to monitor a monotonically increasing time value. Similarly, a boolean system down variable is only of interest if its value changes. Since the number of monitored data items will be large, it will be useful if there is a facility for primary filtering of data to reduce the number of useless data items that the consuming SE receives. It is not the intention to introduce business rules here: this remains the responsibility of the consuming SE.

Performance: reduction of poll rate -- Many data items in a MIB do not change after a device is booted. An example is a dump of the system properties of the Java VM MIB. An SE may still want to store these variables in a relational database once. If there is a mechanism to relate one event to another, the number of useless data item polls can be reduced. For example, the system properties of the Java VM should only be polled if a system restart is detected. Again, the intention is not to introduce business rules: that remains the responsibility of the consuming SE.

Configuration -- Usually BCs are configured through WSDL extensibility elements. However, because of the large number of OIDs to be configured, and because of the requirement of dynamic configuration, this is not a feasible approach. Hence, the configuration should reside in a data store that can be manipulated independently of the deployment, e.g. a relational database. Note that configuration is likely done by personnel other than "comp app designers" or "deployers" and may need to be done through remote consoles (e.g. a web browser). Also note that because of the large number of monitored devices, configuration likely occurs often (e.g. once or multiple times per day).

Scalability, reliability, availability -- For reasons of scalability, it should be possible to run the SNMP BC on several machines at the same time (horizontal scalability). These multiple instances should use the same configuration store, yet they should not poll the same data elements. For automatic fail over, instances should be aware of other instances going down and should be able to take over from these failed instances automatically.
The SNMP BC should be able to leverage multi-core and hyper-threaded CPUs (vertical scaling) through a multi-threaded design.

Extensibility, maintainability, testability -- The SNMP BC is organized in a number of separate libraries that can be fully tested outside of JBI. Details on the division into separate libraries are yet to follow.

Manageability -- Information about the internal state of the SNMP BC, e.g. queue sizes and thread pool sizes, can be obtained through JMX. Performance counters can also be reset through JMX. Performance related configuration parameters are set in the WSDL extensibility elements and will not be changeable through JMX.

Security -- The safe storage of credentials in the configuration store is a special concern. File system and/or database security will serve as the primary mechanism to safeguard this information. Obfuscation (encryption with a hard-coded key) can be used as a secondary mechanism.

Portability -- The SNMP BC depends on JDK 5. It may rely on classes in Glassfish and the Sun JDK (negotiable). There will be no dependence on operating system or hardware platform.

Data collection

Event mechanism -- An event mechanism, i.e. event listeners and event generators, will be used as the conceptual model for the monitoring of devices. Variable monitors act like event listeners: they are triggered by events. When a variable monitor is triggered by an event, it polls the monitored variable. Event sources include timers and triggers. Variable monitors can also act as event generators: for example, a change in a boot time variable may cause an event that invokes other variable monitors.

Variable monitors -- each monitored variable will have a variable monitor which is responsible for invoking the GET operation. Variable monitors are invoked by events, have primary filters which may cause other events to be thrown, and may produce output that is sent to the SE. Note that variables may be scalar or tabular.

Trap monitors -- a trap monitor is invoked by an SNMP trap or inform. Similarly to variable monitors, it invokes primary filters which in turn may cause events, and may produce output that is sent to the SE.
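The interplay between events and variable monitors described above can be sketched as follows. This is only an illustration under assumptions: the names MonitorListener, EventBus, Poller and VariableMonitor are hypothetical, the Poller stands in for the SNMP GET, the output list stands in for delivery to the SE, and a state change filter is hard-wired into the monitor.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Anything that reacts to a named event, e.g. a variable monitor. */
interface MonitorListener {
    void onEvent(String eventId, EventBus bus);
}

/** Routes events (from timers or from other monitors) to the listeners registered for them. */
final class EventBus {
    private final Map<String, List<MonitorListener>> listeners =
            new HashMap<String, List<MonitorListener>>();

    void subscribe(String eventId, MonitorListener listener) {
        List<MonitorListener> list = listeners.get(eventId);
        if (list == null) {
            list = new ArrayList<MonitorListener>();
            listeners.put(eventId, list);
        }
        list.add(listener);
    }

    void fire(String eventId) {
        List<MonitorListener> list = listeners.get(eventId);
        if (list == null) {
            return;
        }
        // Iterate over a copy so listeners may subscribe while being invoked.
        for (MonitorListener listener : new ArrayList<MonitorListener>(list)) {
            listener.onEvent(eventId, this);
        }
    }
}

/** Hypothetical source of values; a real monitor would issue an SNMP GET here. */
interface Poller {
    long poll(String variable);
}

/** Polls its variable when triggered; on a state change it emits output and may fire an event. */
final class VariableMonitor implements MonitorListener {
    private final String variable;
    private final Poller poller;
    private final String eventOnChange; // may be null: no event is generated
    private Long previous;              // null until the first poll
    final List<Long> output = new ArrayList<Long>(); // stands in for delivery to the SE

    VariableMonitor(String variable, Poller poller, String eventOnChange) {
        this.variable = variable;
        this.poller = poller;
        this.eventOnChange = eventOnChange;
    }

    public void onEvent(String eventId, EventBus bus) {
        long value = poller.poll(variable);
        boolean changed = previous == null || previous.longValue() != value;
        previous = Long.valueOf(value);
        if (changed) {
            output.add(Long.valueOf(value));
            if (eventOnChange != null) {
                bus.fire(eventOnChange);
            }
        }
    }
}
```

In this sketch a boot time monitor subscribed to a timer event can fire a "restart" event to which a system properties monitor is subscribed, so the system properties are polled only when the boot time changes.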

Timer events -- Schedules express a series of points in time at which a timer event should be caused. E.g. every 5 seconds between 09:00 and 18:00 from Monday till Friday.
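A schedule such as the one in the example can be sketched as a function that, given a point in time, returns the next point at which a timer event is due. The class PollSchedule is hypothetical, and the sketch makes several assumptions: ticks are aligned to the start of the daily window, the window end is exclusive, and time zones are ignored. It uses java.time from newer JDKs for brevity; on JDK 5 the same logic would be written with java.util.Calendar.

```java
import java.time.DayOfWeek;
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.util.Set;

/** Hypothetical schedule: every periodSeconds within a daily window on selected weekdays. */
final class PollSchedule {
    private final long periodSeconds;
    private final LocalTime windowStart; // inclusive
    private final LocalTime windowEnd;   // exclusive
    private final Set<DayOfWeek> days;

    PollSchedule(long periodSeconds, LocalTime windowStart, LocalTime windowEnd, Set<DayOfWeek> days) {
        if (periodSeconds <= 0 || !windowStart.isBefore(windowEnd)) {
            throw new IllegalArgumentException("invalid schedule");
        }
        this.periodSeconds = periodSeconds;
        this.windowStart = windowStart;
        this.windowEnd = windowEnd;
        this.days = days;
    }

    /** Returns the first tick strictly after the given time. */
    LocalDateTime nextFireTime(LocalDateTime after) {
        LocalDateTime dayStart = after.toLocalDate().atTime(windowStart);
        // Today plus the next seven days covers every weekday at least once.
        for (int i = 0; i < 8; i++, dayStart = dayStart.plusDays(1)) {
            if (!days.contains(dayStart.getDayOfWeek())) {
                continue;
            }
            LocalDateTime dayEnd = dayStart.toLocalDate().atTime(windowEnd);
            LocalDateTime candidate = dayStart;
            if (!after.isBefore(dayStart)) {
                // Round up to the next multiple of the period since the window opened.
                long elapsed = Duration.between(dayStart, after).getSeconds();
                candidate = dayStart.plusSeconds((elapsed / periodSeconds + 1) * periodSeconds);
            }
            if (candidate.isBefore(dayEnd)) {
                return candidate;
            }
        }
        throw new IllegalStateException("schedule has no active days");
    }
}
```

A timer thread could call nextFireTime repeatedly, sleep until the returned time, and then raise the configured event.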

Primary filters -- variable monitors can have one or more primary filters. If there are multiple filters, these filters are not put in series but in parallel, and run independently of each other. Here are a few examples of primary filters:

  • state change
    • filters out all values that are equal to the previously monitored value if there is a previously monitored value
    • optional: event to generate on state change
    • output: new value, flag if there was a previously monitored value
  • monotone
    • filters out all values that are greater than the previously monitored value if there is a previously monitored value
    • optional: event to generate when a value is not filtered out
    • output: new value, flag if there was a previously monitored value
  • band
    • filters out all values that are within the specified interval
    • optional: event to generate when a value is not filtered out
    • output: value
  • reachable
    • filters out all values, but produces a value if the variable could not be polled
    • optional: event to generate when a value is not filtered out
    • output: id
  • batching limit filter
    • limits the number of traps: drops or batches traps if their frequency exceeds a configured value
    • optional: event to generate when a value is not filtered out
    • output: last or aggregated value
Note that filters are stateful: e.g. they can retain their last monitored value. This memory is not persistent, and is not shared between multiple instances of the SNMP BC.
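The first three filters in the list can be sketched as small stateful objects. The interface PrimaryFilter and the class names are hypothetical, values are simplified to long, and event generation and the "previous value" output flag are left out. The monotone filter follows the description literally: a value equal to the previous one is not filtered out.

```java
/** Hypothetical filter contract: true means the value is propagated, false means it is filtered out. */
interface PrimaryFilter {
    boolean passes(long value);
}

/** Filters out values equal to the previously monitored value. */
final class StateChangeFilter implements PrimaryFilter {
    private boolean hasPrevious;
    private long previous;

    public boolean passes(long value) {
        boolean pass = !hasPrevious || value != previous;
        hasPrevious = true;
        previous = value;
        return pass;
    }
}

/** Filters out values greater than the previously monitored value, e.g. a growing uptime. */
final class MonotoneFilter implements PrimaryFilter {
    private boolean hasPrevious;
    private long previous;

    public boolean passes(long value) {
        boolean pass = !hasPrevious || value <= previous;
        hasPrevious = true;
        previous = value;
        return pass;
    }
}

/** Filters out values within the closed interval [low, high]. */
final class BandFilter implements PrimaryFilter {
    private final long low;
    private final long high;

    BandFilter(long low, long high) {
        if (low > high) {
            throw new IllegalArgumentException("low > high");
        }
        this.low = low;
        this.high = high;
    }

    public boolean passes(long value) {
        return value < low || value > high;
    }
}
```

Because the filters run in parallel, a monitor with several filters would hand every polled value to each filter separately; each filter keeps its own in-memory state, which is lost when the instance stops.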

Buffering -- traps and data monitoring may produce data at a rate higher than the consuming SE can process. For this the BC has a limited buffer. If this buffer overflows, data will be dropped from the buffer. Data with a lower priority will be dropped from the buffer before data with a higher priority. Dropping data from the buffer is highly undesirable, and special measures will be taken so that this situation may only occur if there is a flood of traps. Should data be dropped, this will be logged and alerts will be raised.
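One possible reading of the overflow rule is sketched below: a bounded buffer that, when full, evicts the oldest entry with the lowest priority if the incoming item has a higher priority, and otherwise drops the incoming item. The class DroppingBuffer is hypothetical; the dropped counter stands in for the logging and alerting mentioned above, and the linear scan is chosen for clarity rather than speed.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical bounded buffer that drops low-priority data first when it overflows. */
final class DroppingBuffer<T> {
    private static final class Entry<E> {
        final E item;
        final int priority;

        Entry(E item, int priority) {
            this.item = item;
            this.priority = priority;
        }
    }

    private final List<Entry<T>> entries = new ArrayList<Entry<T>>();
    private final int capacity;
    private long dropped;

    DroppingBuffer(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /** Adds an item; returns false if the item itself was dropped. Higher numbers mean higher priority. */
    synchronized boolean offer(T item, int priority) {
        if (entries.size() < capacity) {
            entries.add(new Entry<T>(item, priority));
            return true;
        }
        // Buffer is full: find the oldest entry with the lowest priority.
        int victim = 0;
        for (int i = 1; i < entries.size(); i++) {
            if (entries.get(i).priority < entries.get(victim).priority) {
                victim = i;
            }
        }
        dropped++; // either the victim or the incoming item is lost
        if (entries.get(victim).priority < priority) {
            entries.remove(victim);
            entries.add(new Entry<T>(item, priority));
            return true;
        }
        return false;
    }

    /** Removes and returns the oldest buffered item, or null if the buffer is empty. */
    synchronized T poll() {
        return entries.isEmpty() ? null : entries.remove(0).item;
    }

    synchronized long droppedCount() {
        return dropped;
    }
}
```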

To reduce the likelihood of dropped data, most operations are buffered and some types of operations take precedence over other types of operations:

  1. delivery to NMR (buffered), reading trap events, reading get replies
  2. invoking primary filters for trap monitors
  3. event triggers due to trap monitor primary filters
  4. event triggers due to variable monitor primary filters
  5. timer event triggers (delayed)
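The precedence order above could be realized with a single priority queue of pending operations, as in the following sketch. The names OperationType, PrioritizedTask and OperationQueue are hypothetical, and the sketch assumes that operations of the same type run in submission order.

```java
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Operation types, declared in precedence order: earlier constants run first. */
enum OperationType {
    DELIVERY_AND_READS,
    TRAP_FILTERS,
    TRAP_FILTER_EVENTS,
    VARIABLE_FILTER_EVENTS,
    TIMER_EVENTS
}

/** A unit of work ordered by operation type, then by submission order. */
final class PrioritizedTask implements Comparable<PrioritizedTask> {
    private static final AtomicLong SEQUENCE = new AtomicLong();

    final OperationType type;
    final Runnable body;
    private final long sequence = SEQUENCE.getAndIncrement();

    PrioritizedTask(OperationType type, Runnable body) {
        this.type = type;
        this.body = body;
    }

    public int compareTo(PrioritizedTask other) {
        int byType = type.compareTo(other.type); // enums compare by declaration order
        if (byType != 0) {
            return byType;
        }
        return sequence < other.sequence ? -1 : (sequence > other.sequence ? 1 : 0);
    }
}

/** Queue from which worker threads take the most urgent pending operation. */
final class OperationQueue {
    private final PriorityBlockingQueue<PrioritizedTask> queue =
            new PriorityBlockingQueue<PrioritizedTask>();

    void submit(OperationType type, Runnable body) {
        queue.add(new PrioritizedTask(type, body));
    }

    /** Runs the most urgent pending task on the calling thread; returns false if none is pending. */
    boolean runNext() {
        PrioritizedTask task = queue.poll();
        if (task == null) {
            return false;
        }
        task.body.run();
        return true;
    }
}
```

With this arrangement timer events are naturally delayed while delivery and trap handling are pending, which matches the "(delayed)" remark in the list.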
Reliability -- the UDP protocol is not reliable. Replies to get requests may be lost. The SNMP BC will monitor this and retry the get request after a timeout.
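The retry behaviour can be sketched independently of the actual network code. In the sketch below, GetTransport is a hypothetical abstraction over "send one GET and wait for the reply", and the maximum number of attempts is an assumption: this document does not specify how often a request is retried.

```java
/** Hypothetical transport: sends one GET and returns the reply, or null if none arrived in time. */
interface GetTransport {
    String get(String agent, String oid, int timeoutMillis);
}

/** Repeats a GET until a reply arrives or the attempts are used up. */
final class RetryingGet {
    private final GetTransport transport;
    private final int timeoutMillis;
    private final int maxAttempts; // assumption: not specified in this document

    RetryingGet(GetTransport transport, int timeoutMillis, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        this.transport = transport;
        this.timeoutMillis = timeoutMillis;
        this.maxAttempts = maxAttempts;
    }

    /** Returns the reply, or null if every attempt timed out. */
    String get(String agent, String oid) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            String reply = transport.get(agent, oid, timeoutMillis);
            if (reply != null) {
                return reply;
            }
        }
        return null;
    }
}
```

A null result after the last attempt is the kind of condition a "reachable" primary filter could report.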

The SNMP BC will not persist any data. Data may be lost in the case of a system crash and in the case of a trap flood as described above.

Groups -- variable monitors can be organized in groups. OIDs with the same group are guaranteed to be running on the same SNMP BC instance. This guarantee is necessary for events: events do not span multiple SNMP BC instances. If no group ID is specified, a group ID is generated based on the agent id.

An instance of an SNMP BC can take responsibility of multiple groups. In this way multiple instances of the SNMP BC can be active at the same time without multiple instances of the same variable monitor being active concurrently.
TBD:
  • how instances of an SNMP BC are mapped to groups (can be done by partitioning based on hash(groupid)%nodeid)
  • how responsibility for groups is transferred to other instances of the SNMP BC if an instance of the SNMP BC goes down
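Since the mapping is still TBD, the following is only one possible reading of the hash-based partitioning idea: each instance has an index between 0 and the number of instances minus one, and a group belongs to the instance whose index equals the hash of the group ID modulo the number of instances. The class GroupPartitioner is hypothetical, and the sketch does not address fail over.

```java
/** Hypothetical static partitioning of groups over a fixed number of SNMP BC instances. */
final class GroupPartitioner {
    private GroupPartitioner() {
    }

    /** Returns the index, in [0, instanceCount), of the instance responsible for the group. */
    static int ownerOf(String groupId, int instanceCount) {
        if (instanceCount <= 0) {
            throw new IllegalArgumentException("instanceCount must be positive");
        }
        int remainder = groupId.hashCode() % instanceCount;
        // hashCode() may be negative, so shift the remainder into the valid range.
        return remainder < 0 ? remainder + instanceCount : remainder;
    }

    /** True if the instance with the given index should run the monitors of this group. */
    static boolean isResponsible(String groupId, int instanceIndex, int instanceCount) {
        return ownerOf(groupId, instanceCount) == instanceIndex;
    }
}
```

A drawback of this simple scheme is that changing the number of instances reassigns most groups, which matters for the open question of how groups are taken over when an instance goes down.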

Configuration store -- the configuration data store will have at least the following entities:

  • Variable monitor
    • agent id
    • monitored variable (the OID to query) (TBD: should there be a mapping from OIDs to human readable strings?)
    • a unique reference ID that can be used to identify values in other parts of the monitoring application
    • list of event ids that invoke this monitor
    • name of a primary filter and parameters to this primary filter; these parameters may include eventid-s to generate
    • priority: used to drop data in case of buffer overflows
    • timeout
  • Trap monitor
    • agent id
    • monitored variable (the OID to query)
    • a unique reference ID that can be used to identify values in other parts of the monitoring application
    • name of a primary filter and parameters to this primary filter; these parameters may include eventid-s to generate
    • TBD
    • priority: used to drop data in case of buffer overflows
  • Agent id
    • a network address
    • a port number
    • credentials
    • SNMP version
    • group ID
  • Schedule
    • a schedule expressed in poll frequency, time of day, etc
    • eventid to generate
BC Configuration -- the primary configuration of the SNMP BC is done through WSDL extensibility elements. The configuration items are:
  • location of configuration data store
  • port to listen on
  • buffer limit
  • throttle limit

Data propagation

XSD -- Variable monitors and trap monitors may generate data that is sent to an SE. The schema of this data is TBD.

Message exchange pattern -- Considering the lack of reliability, in-only is preferred (TBD).

Configuration tool

To populate the configuration store a separate tool may be required. This tool may need to be able to load MIBs and allow the user to select OIDs from the MIBs. This tool will be described in more detail later.


This particular version was published on 05-Dec-06 23:29 PM, -0800 by FrankKieviet2