A complete scalability solution requires the coordinated work of all the constituent parts: service engines, binding components, and the NMR. When the system starts maxing out heap space, full GC kicks in and 'pauses' the JVM, which has a debilitating effect on system performance. A properly designed deployment should anticipate expected usage and provision enough resources for the expected capacity.
The specific problem in business process orchestration that involves interactions with external systems is the non-determinism of responses to outbound requests. When responses are delayed or never arrive, the business process instances continue to hold memory, and new requests for such processes only aggravate the memory situation.
Since BPEL SE is currently the only component in Open ESB that can be configured to use a database, we can leverage this to provide scalability for long-running processes. The BPEL SE scalability solution also needs to be thought through in conjunction with its recovery and monitoring capabilities. Ideally, a very large heap space would make anything else unnecessary; but where that is not possible, at least on 32-bit boxes, and for other reasons as well, the scalability solution outlined here for BPEL SE can help.
The scalability solution can be invoked in phases, based on early detection and on patterns of business process behavior. The complete solution may be complex, but a trivial approach would not cover most (if not all) of the use cases. Since the primary mechanism of the BPEL Service Engine scalability solution is passivation, there is some overhead associated with it. To minimize that overhead and avoid being overly aggressive, passivation is performed in phases, with each phase providing incremental relief.
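The phased approach described above can be sketched as a loop that runs each phase only if the previous ones did not free enough memory. This is an illustrative sketch only; the class, method names, and the default ratio are assumptions, not BPEL SE internals.

```java
// Hypothetical sketch of the phased memory-release check; class and
// method names are illustrative, not the actual BPEL SE identifiers.
public class PhasedMemoryManager {
    // Assumed fraction of max heap that must remain free (configurable
    // via system properties in the real engine).
    static final double LOWER_FREE_RATIO = 0.20;

    /** Fraction of the maximum heap currently free. */
    static double freeRatio(long used, long max) {
        return 1.0 - (double) used / max;
    }

    /**
     * Runs phases in order, stopping as soon as one frees enough memory.
     * Each phase is expected to be more aggressive (and costly) than the
     * last, so later phases run only when earlier ones were insufficient.
     */
    static int runPhases(long used, long max, long[] phaseRecovery) {
        int phasesRun = 0;
        for (long recovered : phaseRecovery) {
            if (freeRatio(used, max) >= LOWER_FREE_RATIO) break;
            used -= recovered;   // memory released by this phase's passivation
            phasesRun++;
        }
        return phasesRun;
    }
}
```

The early exit is the point of the design: if Phase 1 already restores the free-memory ratio, the more expensive Phase 2 passivation never runs.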
Functional specification can be found here.
Whenever persistence is on, the Scalability Solution (Phase 1 and Phase 2) is enabled by default. It can be turned off using system properties, and the Phase 1 and Phase 2 solutions can be selectively turned off. In addition, other system properties allow configuring the memory levels and the age criterion used by the scalability solution for memory release. Note that default values for all these properties are supplied. Here are the system properties to use and the explanation thereof.
Details can be found here.
If Phase 1 does not yield the required free memory (i.e., even after the variables of all paused instances have been released), the Phase 2 scalability solution will be invoked.
In addition to providing scalability support for the BPEL engine itself, we need to take care of the scalability of recovery, monitoring, and fail-over in the clustered case.
Without the scalability solution, during recovery the BPEL service engine loads all the instances into memory and tries to recover them all. If there is a very large number of instances to be recovered, the engine may run into out-of-memory issues.
When there is a very large number of instances to be recovered, the instances will be recovered in batches, with a check on available memory before scheduling the next batch. If the available memory falls below the threshold defined by the phase/level 1 solution (refer to the related integration notice), the level 1 scalability solution will be invoked to free up memory. If enough memory is freed, the recovery continues; otherwise the recovery is paused and revisited periodically to finish the remaining recovery work.
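The batch-with-memory-check loop above can be sketched as follows. All names here are hypothetical; in particular, the memory check and the level 1 release hook stand in for the real monitor and passivation calls.

```java
import java.util.List;

// Illustrative sketch of batched recovery with a memory check between
// batches; names are hypothetical, not actual BPEL SE identifiers.
public class BatchedRecovery {
    interface MemoryCheck { boolean enoughFree(); }

    /**
     * Recovers instance ids in fixed-size batches. When memory runs low,
     * the level 1 release is tried first; if that is not enough, recovery
     * pauses and returns the index of the next unrecovered instance so a
     * periodic task can resume from there later.
     */
    static int recoverInBatches(List<String> ids, int batchSize,
                                MemoryCheck check, Runnable level1Release) {
        int i = 0;
        while (i < ids.size()) {
            if (!check.enoughFree()) {
                level1Release.run();          // try phase/level 1 first
                if (!check.enoughFree()) {
                    return i;                 // pause; revisit periodically
                }
            }
            int end = Math.min(i + batchSize, ids.size());
            for (int j = i; j < end; j++) {
                recoverOne(ids.get(j));
            }
            i = end;
        }
        return ids.size();                    // all instances recovered
    }

    static void recoverOne(String id) { /* load and resume the instance */ }
}
```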
With this solution in place, BPEL SE in a cluster performs fail-over by acquiring all the instances of the failed engine(s). When an engine is running with little available heap space, this could destabilize the engine/system. The scalability solution enables the engines in the cluster to fail over the instances without causing memory issues, and also to better distribute the fail-over work amongst the live engines.
If the defined process is a synchronous process (receive-reply), its instances cannot be passivated while in the middle of executing the receive-reply. Hence, if the BPEL engine runs into memory issues due to an inordinate wait within a request-reply, this solution cannot help. The reason we cannot passivate instances in the middle of a request-reply is that passivating would lose the requester reference (the Message Exchange); currently, the ME cannot be passivated.
This solution assumes that synchronous processes do not take a long time to complete. One way to ensure scalability for synchronous outbound invocations to an external web service (which can take a long time to complete) is to configure the throttling property on the connection. Throttling limits the number of simultaneous outstanding requests at any given time, and hence limits memory usage.
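The effect of the throttling property can be illustrated with a semaphore: only a bounded number of outbound calls may be outstanding at once, so the memory those calls pin is bounded too. This is a sketch of the mechanism, not the actual connection throttling implementation; the class name is hypothetical.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Sketch of how a throttling limit bounds simultaneous outbound requests
// (and the memory they hold). The real limit is set as a property on the
// service connection; a semaphore just illustrates the idea.
public class OutboundThrottle {
    private final Semaphore permits;

    OutboundThrottle(int maxOutstanding) {
        this.permits = new Semaphore(maxOutstanding);
    }

    /** Blocks until an outstanding-request slot is free, then invokes. */
    <T> T invoke(Supplier<T> outboundCall) {
        permits.acquireUninterruptibly();
        try {
            return outboundCall.get();       // synchronous outbound call
        } finally {
            permits.release();               // slot freed when the reply arrives
        }
    }

    int availableSlots() { return permits.availablePermits(); }
}
```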
During BPEL Service Engine initialization, an instance of the BPELMemoryMonitorImpl class is set on the engine. This class provides utility methods to perform the memory-level checks and also defines the Phase 1 upper and lower free-memory ratio limits.
When the free memory falls below the threshold defined by the lower free-memory ratio, the Phase 1 solution is invoked first. If Phase 1 does not succeed in releasing enough memory to meet this threshold, Phase 2 is employed.
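The ratio checks can be sketched as below. The real logic lives in BPELMemoryMonitorImpl; the class name, method names, and the default ratio values here are assumptions for illustration (the actual defaults come from the system properties mentioned earlier).

```java
// Minimal sketch of the free-memory ratio checks; illustrative names,
// not the actual BPELMemoryMonitorImpl API. The upper/lower pair acts
// as hysteresis: release starts below the lower bound and is considered
// sufficient once the upper bound is reached.
public class MemoryMonitorSketch {
    // Hypothetical default ratios, assumed configurable.
    static final double PHASE1_LOWER_FREE_RATIO = 0.15;
    static final double PHASE1_UPPER_FREE_RATIO = 0.30;

    /** Free heap relative to the maximum the JVM may grow to. */
    static double freeRatio() {
        Runtime rt = Runtime.getRuntime();
        long max = rt.maxMemory();
        long used = rt.totalMemory() - rt.freeMemory();
        return (double) (max - used) / max;
    }

    /** True when Phase 1 should be triggered. */
    static boolean belowLowerThreshold() {
        return freeRatio() < PHASE1_LOWER_FREE_RATIO;
    }

    /** True once enough memory has been released. */
    static boolean recoveredToUpperThreshold() {
        return freeRatio() >= PHASE1_UPPER_FREE_RATIO;
    }
}
```

Measuring against `maxMemory()` rather than `totalMemory()` matters: the heap may not yet have grown to its configured maximum, and a ratio against the current heap size would trigger passivation too eagerly.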
When scalability is enabled, the main entry point for the scalability solution is the doMemoryMangement() method on EngineImpl, which is called during the processing of any request by BPEL SE.
As a result of Phase 1 scalability calls interleaving with the regular persistence calls, process variables can be in various states. Each variable maintains three state flags, namely:
Persistence and scalability calls on a variable alter the above state. The following figure explains the various states and shows all the possible combinations of before and after states resulting from scalability/persistence and in-memory read/update calls.
Since the Phase 1 solution (above) could not make enough free memory available, more aggressive action is attempted to free up memory. In this phase, all instances that are not moving forward (identified the same way as in Phase 1) are progressively passivated, starting with the most aged ones.
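The oldest-first ordering of Phase 2 can be sketched as follows; the class and field names are illustrative, not BPEL SE internals, and the real code would write the instance state to the database rather than just collect ids.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of Phase 2's progressive passivation: stalled instances are
// passivated oldest-first, bounded per pass so relief is incremental.
public class Phase2Passivation {
    static class StalledInstance {
        final String id;
        final long lastActivityMillis;   // smaller = more aged
        StalledInstance(String id, long lastActivityMillis) {
            this.id = id;
            this.lastActivityMillis = lastActivityMillis;
        }
    }

    /** Returns the ids passivated, oldest first, up to maxCount. */
    static List<String> passivateOldestFirst(List<StalledInstance> stalled,
                                             int maxCount) {
        List<StalledInstance> sorted = new ArrayList<>(stalled);
        sorted.sort(Comparator.comparingLong(
                (StalledInstance s) -> s.lastActivityMillis));
        List<String> passivated = new ArrayList<>();
        for (StalledInstance s : sorted) {
            if (passivated.size() >= maxCount) break;
            passivated.add(s.id);   // real code persists state to the DB here
        }
        return passivated;
    }
}
```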
This relates to passivation of instances held in the mRequestPendingInstances queue while waiting for a correlatable message for further execution.
If the instance execution is in the middle of an active request-reply, the instance should not be passivated (even if it qualifies for scalability passivation due to a wait or an outstanding response/status), since doing so would lose the requester context (message exchange). In order to scalability-passivate such processes (those defined with receive-reply), the user needs to configure the throttling property on the service connection (to be made available in the CASA editor; work in progress).
Splitting the ready-to-run queue introduced an issue where an instance could be scheduled even if marked suspended: after the split, instances on the waiting queue were not checked for suspended status. To support instance suspend and resume in conjunction with scalability passivation (Phase 2), the following design will be used:
At various points during the execution of a process instance, a special instance wrapper, the Business Process Instance Thread (BPIT), is created and put on a queue (a LinkedList in the ReadyToRunQueue class). The same queue is also used for BPITs waiting on a wait activity in the process definition. The current design has the following pitfalls:
Keeping instances that are waiting on the ready-to-run queue (due to a wait activity) on the regular queue after they are scalability-passivated has performance implications. Although the split into two queues is not strictly required for regular execution, it proved beneficial there as well, so it was added for regular execution too.
NOTE: This support is only added for Oracle, as an equivalent of the ROWNUM function (which allows limiting the result set of an update query) is not available for Derby at this moment. This feature for Derby is being worked on and is planned for release early next year; once it is available, support will be added for Derby. Clustering will still work with Derby, though; only the scalability of fail-over support will not be available.
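A ROWNUM-limited claim of this kind might look like the JDBC sketch below. The table and column names are hypothetical, not the actual BPEL SE persistence schema; only the ROWNUM pattern itself is the point.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Illustrative JDBC sketch of an Oracle ROWNUM-limited fail-over claim.
// Hypothetical table/column names, not the real BPEL SE schema.
public class FailoverClaim {
    // ROWNUM bounds how many of the failed engine's instances one live
    // engine claims per query, so fail-over work is spread across the
    // live engines and no single engine exceeds its memory budget.
    static final String CLAIM_SQL =
        "UPDATE instance_state SET engine_id = ? "
      + "WHERE engine_id = ? AND ROWNUM <= ?";

    static int claimBatch(Connection con, String liveEngineId,
                          String deadEngineId, int batchSize) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(CLAIM_SQL)) {
            ps.setString(1, liveEngineId);
            ps.setString(2, deadEngineId);
            ps.setInt(3, batchSize);
            return ps.executeUpdate();   // number of instances claimed
        }
    }
}
```

Derby has no ROWNUM equivalent for bounding an UPDATE this way, which is why, as noted above, this part of the solution is Oracle-only for now.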
Find the details here.