When Open-ESB is put into serious use to handle real messages, how capably and efficiently encoders and JBI components handle messages, and large messages in particular, may become an important requirement in the related business scenarios. It can also become a major differentiator among JBI implementations.
The following summarizes ideas, and possible research and experimentation work, on large message handling that have been discussed previously. Handling large messages may be a requirement not only for the encoders, whose job is to decode and encode messages between native formats and XML, but for the JBI components and platform as well.
Large message support
Large message handling may be supported through at least the following:
- Service Data Layer (SDL)
Different use cases may require different approaches, with or without advanced parsing technology.
Service Data Layer (SDL)
- Service data layer: API and implementation for an easy and unified way of handling data.
- In JBI, currently DOM is used as the service data layer.
- Pros: standard, easy to use (reads whole document into memory), fast access with XPath
- Cons: performance, large memory footprint, cannot handle large messages, lack of validation (too loose), lack of meta-data introspection
- We need an SDL better than DOM. The proposed SDL overcomes the drawbacks of DOM:
- better performance, small footprint, large message handling, validation, meta-data introspection
- Characteristics of the proposed SDL (many of these may be borrowed from SDO spec)
- XML: internal data format
- XSD: meta-model language
- non-XML data: converted to XML for processing in NMR or stored as an attachment
- Small footprint, extremely fast and scalable XML store
- Support both strongly typed data access (generated code scenario) and generic data access
- Support both lazy loading and eager loading (for strongly typed access without XML infoset)
- Has a clear set of supported simple types for interoperability
- Support dynamic metadata introspection – getDataType
- Support single base type inheritance and dynamic type substitution and enforcement (xsi:type); DOM already provides this and we do not want to lose it
- Native XPath support, convenient data conversion (helper methods) and copy support. An SDL object has XPath support built in, which is more efficient than applying XPath externally on DOM
- Support partial scan of original non-XMLized data for XPath evaluation (handy for content based routing), e.g. parsing the header without converting the whole document to XML
- Easy to use by component developers
- Short-term goal #1: Prototype a DOM implementation with optimized XPath implementation
- We may need to prototype a DOM implementation and benchmark its performance on messages of different sizes (small, medium, large).
- The benchmark should cover parsing, XPath evaluation (for both reading and updating), building a large data set and marshaling.
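As a starting point for such a benchmark, the following is a minimal sketch that covers only the parsing and read-only XPath steps, using the standard JDK DOM and XPath APIs. The class name `DomXPathBenchmark`, the `<batch>/<item>` message layout and the three message sizes are assumptions made for illustration; a real benchmark would also need warm-up runs, memory measurements, updates and marshaling.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class DomXPathBenchmark {

    /** Builds a synthetic message with the given number of item children (hypothetical layout). */
    public static String buildMessage(int items) {
        StringBuilder sb = new StringBuilder("<batch>");
        for (int i = 0; i < items; i++) {
            sb.append("<item id=\"").append(i).append("\">value").append(i).append("</item>");
        }
        return sb.append("</batch>").toString();
    }

    /** Parses the whole message into a DOM tree, then evaluates one XPath expression as a string. */
    public static String parseAndEvaluate(String xml, String expression) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            // Parser configuration, parse and XPath errors are all fatal for this sketch.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        int[] sizes = {100, 10_000, 200_000}; // small, medium, large (arbitrary choices)
        for (int size : sizes) {
            String xml = buildMessage(size);
            long start = System.nanoTime();
            String last = parseAndEvaluate(xml, "/batch/item[last()]");
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println(size + " items: " + elapsedMs + " ms, last item = " + last);
        }
    }
}
```

Single-shot timings like these are only indicative, since JIT compilation and garbage collection can dominate short runs.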
Short-term goal #2: evaluate SDO to see if suitable, also evaluate existing XML APIs such as StAX, SAX, DOM, JAXB.
- SAX is not suitable for handling large messages and streaming.
- May use existing XML APIs as a “view” or “façade”
- StAX and DOM may be considered, possibly layered on each other.
Short-term goal #3: evaluate some existing parsing technologies for their memory efficiency, performance, etc.
- vtd-xml http://vtd-xml.sourceforge.net/.
> Attachments may be used to store binary data as is. The binary data can be the entire message or part of it.
> Attachments help achieve “Decode-on-demand”.
> URIs can be used to reference attachments as illustrated in XOP (XML-binary Optimized Packaging, see http://www.w3.org/TR/xop10/)
> Attachment support is provided by the JBI platform, which can be leveraged to create a fast, small-footprint XML store and to handle large messages.
> The desired features include, but are not limited to, the following:
random-access attachments
persisted or partially loaded attachments.
We need to focus on using persisted or partially loaded attachments to support very large messages.
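To make the idea of a persisted, random-access attachment concrete, here is a minimal sketch built only on `java.io` and `java.nio.file`. The class `FileBackedAttachment` and its methods are hypothetical and are not part of the JBI attachment API; the sketch only shows how content can stay on disk while callers load just the slice they need.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical persisted attachment: content stays on disk, callers read slices on demand. */
public class FileBackedAttachment {
    private final Path file;

    public FileBackedAttachment(Path file) {
        this.file = file;
    }

    /** Persists the raw bytes to a temporary file and returns a handle to it. */
    public static FileBackedAttachment persist(byte[] content) {
        try {
            Path tmp = Files.createTempFile("attachment-", ".bin");
            tmp.toFile().deleteOnExit();
            Files.write(tmp, content);
            return new FileBackedAttachment(tmp);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Total size of the persisted content in bytes. */
    public long size() {
        try {
            return Files.size(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Random access: loads only the bytes in [offset, offset + length) into memory. */
    public byte[] read(long offset, int length) {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            if (offset < 0 || length < 0 || offset + length > raf.length()) {
                throw new IndexOutOfBoundsException("offset=" + offset + ", length=" + length);
            }
            byte[] slice = new byte[length];
            raf.seek(offset);
            raf.readFully(slice);
            return slice;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller could, for example, persist a whole native message once and later call `read(offset, length)` to decode only one field, which is one way to approach "decode-on-demand".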
4 Streaming (and partial evaluation)
- No easy way to stream messages in JBI.
> One-pass streaming process is not feasible for Open-ESB.
A JBI SA contains processes defined in multiple heterogeneous languages, piped together and often involving sophisticated asynchronous interactions. One-pass streaming would mean that processing as a whole cannot start until the last process in the pipeline is reached, which is impractical.
To fully support one-pass streaming, every component must participate; a component that does not becomes a breaker of the streaming process.
One-pass streaming usually cannot support the full language features of XSLT and XQuery.
One-pass streaming involves intensive design-time analysis of execution plans, which might be doable for an environment with a single language, such as SQL in a DBMS. In an open community, requiring every component to do this is difficult.
A research project to experiment with limited streaming support
Currently all existing encoder providers (Custom Encoder, HL7 encoder, Cobol Copybook Encoder) use SAXSource as input or output in order to avoid object creation. However, SAXSource has drawbacks:
- It is a callback programming model (a “push” model XML source), which is not convenient for partial decoding/encoding.
- It lacks write capabilities.
- Enhance the encoder providers to also implement StAX Source, a “pull” model XML source that supports both read and write.
- Develop features like partial XPath evaluation, where XPath expression can be evaluated while decoding and the process stops as soon as the value is retrieved.
- This should be very useful for content based routing or heterogeneous message processing.
- Note that benchmarks indicate StAX is slightly slower than SAX with the current JDK implementation. Javolution has a modified StAX implementation with better performance, but it is not fully compliant with the StAX interface.
- Use StAXSource for new encoder providers (e.g. Swift MT, etc.) and add StAXSource to existing encoder providers (HL7, custom, and CoCo).
- If the parser needs more than one token of look-ahead (k>1), this defeats the “pull” model
- HL7 parser: mostly k=1, except in cases where the “UPA” rule is violated and needs a fix.
- CoCo parser: k>1
- SWIFT parser: k=1 except in localized places
- Custom parsers: k>1, use a lot of backtracking.
- SEF parsers: mostly k=1
- JDK 1.5's Transformer does not recognize StAXSource.
‘Should have’, 10 days for each encoder provider (HL7, custom, CoCo, SWIFT, SEF)
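The partial XPath evaluation idea described in this section can be sketched with the standard StAX pull API. The example below is an illustration only: the class `PartialScan` and the method `firstValue` are hypothetical names, and the path syntax is limited to simple absolute child steps such as `/message/header/type` (no predicates, axes or namespaces), so it is not a full XPath engine. What it does show is the pull-model property that matters here: the caller drives the parser and simply stops asking for events once the value is found.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class PartialScan {

    /**
     * Returns the text of the first element matching a simple absolute path,
     * or null if nothing matches. The matched element must contain text only.
     */
    public static String firstValue(String xml, String path) {
        String[] steps = path.substring(1).split("/");
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                int depth = 0;   // number of currently open elements
                int matched = 0; // how many leading path steps the open elements match
                while (reader.hasNext()) {
                    int event = reader.next();
                    if (event == XMLStreamConstants.START_ELEMENT) {
                        if (depth == matched && steps[matched].equals(reader.getLocalName())) {
                            matched++;
                            if (matched == steps.length) {
                                // Value found: return without pulling any further events.
                                return reader.getElementText();
                            }
                        }
                        depth++;
                    } else if (event == XMLStreamConstants.END_ELEMENT) {
                        depth--;
                        if (matched > depth) {
                            matched = depth; // left a matched ancestor
                        }
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

For content based routing, a component might call `PartialScan.firstValue(xml, "/message/header/type")` and route on the result; the reader is never advanced past the matching element, so the rest of the message is not tokenized by this code.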
5 Use cases of data handling
- Different use cases may require different approaches to handling data. For example, handling data in content based routing may require only peeking at the header data via a one-pass partial scan, whereas handling data in a BPEL process may require random access to any piece of the data inside a fully parsed message. We have identified some use cases of data handling (listed below), which may be addressed using a layered approach.
- Fully decoded small or mid-size message with random R/W access
- This is the most common case (found in BPEL, XSLT, XQuery etc.) and we must support it very well.
- Includes both building a message from scratch and updating an existing one.
- We cannot compromise on performance for this when supporting large message handling.
- One-pass partial scan to evaluate some XPath values.
- Message size might be small or large.
- Once the values are retrieved, the parsing process immediately stops.
- Deferred payload processing:
- Message is composed of a header/envelope and payload.
- The payload size might be small or large.
- For decoding, read the header and extract the raw payload intact. Then, based on the header information, perform an action on the payload (e.g. persist the payload to a DB).
- For encoding, construct a header and insert a raw payload.
- Fully decoded large message with sequential R/W access.
- Decoding example: fully parse an HL7 batch composed sequentially of a batch header, a large number of different messages and a batch trailer.
- Encoding example: build from scratch an HL7 batch message with a batch header, a number of message payloads of the same or different types, and a batch trailer.
- Fully decoded large message with random R/W access.
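The deferred payload processing use case listed above can be illustrated with a small sketch. It assumes a deliberately simple, hypothetical framing that is not taken from any encoder: a single header line terminated by `\n`, followed by the raw payload bytes. The class `DeferredPayload` and its method names are likewise assumptions. Decoding reads only the header and then copies the payload to a sink in fixed-size chunks without interpreting it; encoding writes a header and inserts the raw payload unchanged.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class DeferredPayload {

    /** Reads bytes up to the first '\n' as the header; the stream is left at the payload start. */
    public static String readHeader(InputStream in) {
        try {
            ByteArrayOutputStream header = new ByteArrayOutputStream();
            int b;
            while ((b = in.read()) != -1 && b != '\n') {
                header.write(b);
            }
            return new String(header.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Copies the remaining raw payload to the sink in chunks; returns the number of bytes copied. */
    public static long copyPayload(InputStream in, OutputStream sink) {
        try {
            byte[] chunk = new byte[8192]; // only one chunk is held in memory at a time
            long total = 0;
            int n;
            while ((n = in.read(chunk)) != -1) {
                sink.write(chunk, 0, n);
                total += n;
            }
            return total;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Encoding direction: writes the header line, then the raw payload untouched. */
    public static void writeMessage(String header, InputStream payload, OutputStream out) {
        try {
            out.write(header.getBytes(StandardCharsets.UTF_8));
            out.write('\n');
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        copyPayload(payload, out);
    }
}
```

In practice the sink could be a file or a database stream chosen from the header information, so the payload size does not affect the memory footprint of the component.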