Synchronization Engine Debugging

Best Practices

When a failure occurs during synchronization, typically the first sign of the failure is that an object created or altered in NexJ fails to reach the external system, or visa versa. Failures of this sort have not followed any predictable pattern, but we have learned some lessons about how to quickly identify the source of each failure.

Exceptions, Timeouts and Out Of Memory

Regardless of log levels, it is important to acquire all available logs from the period during which the failure began. The logs should be reviewed for exceptions, indications of timeouts (query timeouts, transaction timeouts, connection pool timeouts, etc.), and for indications of an out of memory condition. These indications do not consistently occur in any one log: all logs should be reviewed.

Exceptions, if at all possible, should be eliminated from the system before serious debugging begins. We have on several occasions solved problems by eliminating exceptions that did not appear in any way related to synchronization.

Timeouts can provide a hint as to where in the synchronization process a failure is occurring. They can also be misleading, however, since one failure may lead to delays and timeouts on later units of work.

Out of memory conditions are very important, because they break the J2EE container. Any activity occurring after an out of memory condition is essentially meaningless and should be ignored. Where an out of memory condition is arising it must be addressed before any further debugging occurs.

Data Dependence

It is important to identify which items in NexJ or on the External system have failed to synchronize. When the items have been identified, one can then compare them to look for similarities. Are they all of the same class in NexJ? Are they unusually large? Does synchronizing them involve any specialized code?

Once failing items have been identified, try synchronizing them again (if possible). Try to manually create a simplified item similar to the failing items. If the same items consistently fail to sync, it is much simpler to create a reduced test case for more detailed debugging.

Minimal Reproducible Case

When a failure is consistently reproducible (and even if it isn't) it is a valuable exercise to look for the smallest failing test case. When a single item with minimal contents can consistently reproduce a failure, a solution is just around the corner. Even if 1000 items are required, this is easier to work with than testing 1,000,000 items. Learning that it isn't possible to reduce a test case below a certain size or complexity can itself be informative.

Environment Dependence

Ideally, we would like to reproduce every failure in a development environment, since this is where we can bring to bear the greatest array of debugging tools. Reproducing in our QA environment is a close second. Even when this is not possible, re-running a failing case in any environment is very useful. When a problem is reproducible in more than one environment, it is likely to be a code or data problem. When it is reproducible only in a single environment (especially if it is consistently reproducible), the problem is likely to be a configuration problem, a data problem, or a problem relating to usage patterns.

When the failure occurs only in a single environment, collect as much information as possible about the environment, the network configuration, the exact make and versions of software involved (jre, J2EE container, NexJ core libraries, NexJ repositories). Find out how the environment was configured, and re-check the configuration by retracing our setup instructions. Identify any resources used by the system that are used by other systems and applications. If possible, disable these other systems and applications, and attempt to reproduce the problem. Although this work is tedious, it is less tedious than spending a week or more fruitlessly debugging code only to learn that a configuration step was missed.

Identifying Stage of Failure

Synchronization is a process involving multiple transactions, multiple threads and multiple components both in the core framework and in application code. Identifying the precise point at which a failure occurs is critical to arriving at a timely resolution of a failure. When an exception is raised during processing, identifying the point of failure can be simple. When no exceptions are present, however, it is still possible to identify the point of failure. Ideally, this should be done by selectively increasing log levels for some classes, and tracing synchronization messages through the code until they "go missing". Where increasing log levels is not a possibility, it may still sometimes be possible to make inferences about how far synchronization progresses before failing. For details on the expected trajectory of a message through the synchronization logic, please see the sections below.

Verifying Hypotheses

Especially when information about a failure is scarce, either because the failure occurs only on a production system, or because the failure occurs only occasionally, it is tempting to jump to conclusions. Where possible, every attempt should be made to verify any conclusions that are reached. If some characteristic of the data is thought to be relevant, construct a new instance of the data with that characteristic and watch to see if it fails. If load is thought to be relevant, try a similar operation under similar load which involves different items. If conflict from another process or system is suspected, disable that process or system. Especially where a hypothetical solution involves intense code refactoring, a lot of time can be saved by using a quick hack to verify that the hypothesis is in fact correct to start with.

Stages of Outbound Synchronization

When an item in NexJ is synchronized to an external system, the item passes through the following stages.

Creation of SynchronizationCommand

Synchronization begins with the creation of a SynchronizationCommand, which is immediately posted to a JMS queue. This can happen for one of three reasons:

A snapshot has been initiated. When a snapshot is initiated this will be logged at info level by nexj.core.meta.MetaClass.SysSyncTarget and at info level by nexj.core.meta.MetaClass.SysProcess. A snapshot will create one or several SynchronizationCommands including every instance that is visible to the target or link for which the snapshot was invoked.
An instance of a class configured for synchronization is altered and committed. A SynchronizationCommand is created specifically for the instance(s) altered on the transaction.
An instance of a class named in the DEPENDENCY_LIST of some SysSyncClass is altered and committed. On the transaction where the change occurs, a DependencySynchronizationCommand is created and posted to JMS. Its contents are logged at dump level by nexj.core.rpc.jms.JMSSender. The DependencySynchronizationCommand is received by another thread (nexj.core.integration.Receiver will log at debug that a JMS message was received from the relevant channel, but won't identify the message). The DependencySynchronizationCommand is run, invoking SysSyncClass.getDependencyFilter (an excellent point to start debugging). getDependencyFilter returns a where clause identifying a set of instances, for which a SynchronizationCommand is built.

The constructor of the nexj.core.integration.sync.SynchronizationCommand' will log at debug that the command has been created, and for how many instances. Immediately following, on the same thread, nexj.core.rpc.jms.JMSSender will log at dump the SynchronizationCommand, including the OIDs of all instances to be synched. Following this, on the same thread, nexj.core.runtime.UnitOfWork will log "Commit completed" at debug.

Receipt of SynchronizationCommand

The SynchronizationCommand is received on another thread. The receipt of a message is logged by nexj.core.integration.Receiver at debug. nexj.core.integration.sync.SynchronizationCommand then logs at debug the number of instances to be synchronized, followed at dump by the classes and OIDs of those instances. If debug logging is enabled for nexj.core.persistence.sql.SQLAdapter, queries for the links and instances involved in the synchronization will be logged. If locking is enabled, nexj.core.integration.sync.SynchronizationCommand will log at dump "Beginning link synchronization transaction." All log messages on the thread from this point to the conclusion of the transaction will include the class and OID of the link to which items are being synchronized.

Invocation of Service

At debug nexj.core.persistence.sql.SQLAdapter logs a query to lock the link (if locking is enabled) and queries to retrieve syncObjects for the link and the instances. It is at this point that the framework determines which instances to send to the link and which events (create, update or delete) to set on those instances. Once these decisions are made the outbound synchronization service is invoked. The service invocation is logged at dump by nexj.core.integration.sync.SynchronizationCommand and by nexj.core.meta.Metaclass.<Link>; both log entries include the class and OID(s) of the instances being sent.

The actions of the service will be logged if logging is enabled for nexj.core.meta.integration.service.Service.<ServiceName>.

When the service returns, the class and OID(s) of the instances sent are logged again by nexj.core.meta.Metaclass.<Link>, this time at debug. If the service raises an exception, the exception will be logged at warn level by nexj.core.meta.Metaclass.<Link>.

At this point, if locking is used, the link transaction will commit. If this is the case, then nexj.core.runtime.UnitOfWork will log "Commit completed" at debug and nexj.core.integration.sync.SynchronizationCommand will log the commit at dump.

If the transaction is rolled back, nexj.core.integration.sync.SynchronizationCommand will log this at debug. If an exception occured in the framework (not in the service) it is logged by nexj.core.integration.sync.SynchronizationCommand at error, and by nexj.core.meta.Metaclass.<Link> at warn.

Completion

After all links have been processed, nexj.core.integration.sync.SynchronizationCommand will log at debug the number of instances succesfully sent. This number may not match the number of instances in the SynchronizationCommand: it is not the number of instances involved, but rather the number of Object messages posted successfully to the outbound synchronization service.

For each instance that was posted for a link, but for which an unrecoverable exception was raised by the service, nexj.core.integration.sync.SynchronizationCommand will log at warn level that the instance (classname, OID) has been abandoned for the specific link (OID).

If some or instances need to be retried, nexj.core.integration.sync.SynchronizationCommand will log at debug that the command will be retried, though the precise set of instances being retried will not be specified.

nexj.core.runtime.UnitOfWork will log "Commit completed" at debug.

When some instances must be retried, the SynchronizationCommand is re-sent and is logged at dump by nexj.core.rpc.jms.JMSSender. When all instances must be retried, it is not re-sent, but rather an exception is raised to the JMS provider which is responsible for resubmitting the command again at some future point. When this occurs, an exception is logged at debug by nexj.core.rpc.jms.JMSReceiver. If the JMSMessage has failed so many times that it will not be retried, the exception is logged at error by nexj.core.rpc.jms.JMSReceiver.

Stages of InboundSynchronization

When an item from an external system is synchronized to NexJ, it passes through the following stages.

Inbound Service Invocation

The receipt of an inbound message will be logged at debug by nexj.core.integration.Receiver, though no information will be provided about the nature or contents of the message. The service bound to whichever channel receives the message should immediately be invoked. If there are logger statements in the services, you can configure the log level through nexj.core.integration.service.Service.<ServiceName>.

Persist Step

The synchronization engine is invoked by a service through either a Persist or a Sync step. Both steps ultimately call the ObjectMessageFormatter. nexj.core.integration.format.object.ObjectMessageFormatter will log at debug level that it has been invoked, and at dump will log the full message being formatted.

The ObjectMessageFormatter decides whether the message being formatted should update, create or delete an instance in NexJ. This decision is logged at dump by nexj.core.meta.Metaclass.<Link> (if a link is specified).

After formatting of each instance is complete, both nexj.core.integration.format.object.ObjectMessageFormatter and nexj.core.meta.Metaclass.<Link> will log the success at debug.

Near the end of the transaction, it should be possible to see the change that has occurred reflected as one or several "insert", "update" or "delete" commands logged at debug by nexj.core.persistence.sql.SQLAdapter.

Following this, on the same thread, nexj.core.runtime.UnitOfWork logs "Commit completed" at debug.

If an exception occurs during formatting, nexj.core.integration.format.object.ObjectMessageFormatter logs the exception at debug. If formatting is retried, this is also logged at debug by nexj.core.integration.format.object.ObjectMessageFormatter and at warn by nexj.core.meta.Metaclass.<Link>.

If an error is raised for a message that is not retried, this is logged at debug by nexj.core.meta.Metaclass.<Link> The exception, in this case, is raised to the service which invoked the formatter. The service may also log the exceptoin.

Information Required for Escalation or Transfer

When escalating or transfering an issue to another developer it is critical to clearly explain the issue, the resources involved and how to access them, how the issue has been investigated, and why it is being transfered.

At the very least, it is good to answer the following questions:

What version of core and of the repository present the problem?
What is the most accessible environment in which the problem can be reproduced (Dev, QA, customer QA, customer production, etc.)? Who is responsible for the environment, and how can the environment be accessed?
What is the symptom of the problem? How can it be reproduced as simply and consistently as possible?
What exceptions are observed in the logs? If the exceptions are not thought to be part of the failure, why have they not been resolved?
Are any timeouts observed? Is the system running out of memory?
Has a single item or a class of items been identified for which the failure occurs? Describe the item(s).
Is the problem reproducible in more than one environment? If not, describe the environment in detail and explain what steps have been taken to ensure that the environment is correctly configured.
At what stage of synchronization is the failure believed to occur? What evidence exists that this stage is the point of failure?
What working hypotheses do you have relating to the failure? What evidence exists to support these hypotheses, and what tests have been performed to validate them?
If any of these questions cannot be answered, why can they not be answered?
Why is responsibility for the problem being transfered?

The developer receiving an item should ask for confirmation of any information that seems especially pertinent. Sometimes it is difficult to tell the difference between a fact and an assumption. Some complex test may have be repeated simply to rule out the possibility of mistakes during testing.