End to end transaction analysis with Neo4j


End to end transaction analysis

Successful enterprises constantly seek new ways to improve the availability of their services and to avoid compliance breaches. This can be achieved by managing services on an end-to-end basis. By analyzing the topology of end-to-end business value chains you gain insight into the behavior of your systems in a way you have not seen before.

What is transaction analysis and what can we do with it

Too many metrics and still no insight!

Transaction analysis allows you to analyse monitoring data within the context of business transactions. Data is collected over the full end-to-end value chain and includes the complete technology stack (web, services, servers, databases). The collected data contains simple metrics (performance, CPU usage) as well as complex event data (call data, timing). But its real value only becomes apparent when all information is combined in its full context. Only then will it allow you to see the complete picture for managing and improving your systems:

  • Before deployment: By analyzing monitoring data within the context of business transactions you understand how your services really work. You can verify that your services are designed and configured as required, verify that they meet compliance requirements, and determine whether there are any single points of failure that can impact the customer and harm your business. This prevents harm before it happens.
  • During runtime: The monitoring system constantly determines the health of your services and collects valuable data for monitoring and optimising operations. When something goes wrong, you can find the root cause of the failure, determine its impact on your business and take appropriate action.

In order to do all this you need insight into your service topology, the dependencies between components and the transactions that they support.

How to describe a transaction

If we want to understand how transactions work we have to describe them in terms of:

  • Completeness and consistency (do we collect all relevant events)
  • General structure and complexity of a transaction
    • Number of services involved
    • Role of each service: Frontend, Middle, Backend
    • Fanout (how many services directly called by each service)
    • Call chain depth
    • Order of execution within a transaction
    • Interaction patterns between services
      • Request/reply or Fire and forget
      • Sequential / Parallel execution
  • Result of execution of each service call and transaction as a whole (success, technical, functional, late, onTime)
  • The way errors, timeouts and failures are handled

This list may look a bit over the top, but it can help you understand what your system really looks like.
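As a rough sketch of how the structural items in this list could be computed, assume a transaction is available as a list of (caller, callee) pairs. That layout, the function name `describe_transaction` and the service names are made up for illustration; they are not taken from the monitoring data itself.

```python
from collections import defaultdict

def describe_transaction(calls):
    """Summarise one transaction given as (caller, callee) pairs (assumed layout)."""
    children = defaultdict(list)
    callers, callees = set(), set()
    for caller, callee in calls:
        children[caller].append(callee)
        callers.add(caller)
        callees.add(callee)

    services = callers | callees
    roles = {}
    for s in services:
        if s not in callees:
            roles[s] = "Frontend"   # never called by another service
        elif s not in callers:
            roles[s] = "Backend"    # calls nothing itself
        else:
            roles[s] = "Middle"

    # Fanout: number of distinct services directly called by each service.
    fanout = {s: len(set(children[s])) for s in callers}

    def depth(service, seen=()):
        """Longest call path (counted in calls) starting at `service`."""
        if service in seen:         # guard against cyclic call data
            return 0
        return max((1 + depth(c, seen + (service,))
                    for c in children.get(service, [])), default=0)

    frontends = [s for s in services if roles[s] == "Frontend"]
    return {
        "services": len(services),
        "roles": roles,
        "fanout": fanout,
        "depth": max((depth(f) for f in frontends), default=0),
    }

# Hypothetical transaction: web -> orders -> (stock, billing -> db)
calls = [("web", "orders"), ("orders", "stock"),
         ("orders", "billing"), ("billing", "db")]
summary = describe_transaction(calls)
print(summary["services"], summary["depth"])  # 5 3
```

A transaction in which the Frontend calls nothing has no pairs at all, so this sketch reports a depth of 0 for it, matching the "zero means Frontend only" convention.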

First of all, transactions should be “complete”, meaning that monitoring data has been captured for all service activity. Orphans (lost events), such as the ones identified in a previous post, are not acceptable. Transaction tracking data should also have consistent timestamps, component names and statuses. Incomplete or invalid transactions should be excluded from further analysis.
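A completeness check along these lines could look like the sketch below. It assumes each event is a dictionary with `id`, `parent`, `start` and `end` fields, and it treats an orphan as an event whose parent is missing from the transaction. Both the field names and that reading of "orphan" are assumptions for the sake of the example.

```python
def find_problems(events):
    """Return a list of reasons why a transaction is incomplete or invalid."""
    ids = {e["id"] for e in events}
    problems = []

    # Exactly one event should have no parent: the Frontend call.
    roots = [e for e in events if e["parent"] is None]
    if len(roots) != 1:
        problems.append(f"expected 1 root event, found {len(roots)}")

    for e in events:
        # Orphan: the event refers to a parent that was never captured.
        if e["parent"] is not None and e["parent"] not in ids:
            problems.append(f"event {e['id']} is an orphan")
        # Inconsistent timestamps: a call cannot end before it starts.
        if e["end"] < e["start"]:
            problems.append(f"event {e['id']} ends before it starts")
    return problems

# Hypothetical events; timestamps are milliseconds since transaction start.
events = [
    {"id": 1, "parent": None, "start": 0,  "end": 90},
    {"id": 2, "parent": 1,    "start": 10, "end": 40},
    {"id": 3, "parent": 7,    "start": 50, "end": 45},  # parent 7 was lost
]
print(find_problems(events))
```

A transaction for which this list is not empty would be excluded from further analysis.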

Large (500) but simple structure

The general structure and complexity of a transaction can vary a lot. The longest call path can be anywhere between 0 and 10 calls long. Zero means the “Frontend” service is the only one involved in the transaction. Simple transactions include a small number of service calls (1 to 10), while complex transactions can include more than 500 service calls. The more complex transactions sometimes use parallel processing. This means that a service calls many other services at once, without waiting for the result of one call before making the next. The service then waits some predetermined time for all results to come in. This approach improves response time because the service only has to wait for the slowest service.
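The effect of parallel processing on response time comes down to simple arithmetic: sequential calls add up, while parallel calls cost only as much as the slowest one, capped by the waiting time. A small sketch, with made-up response times and hypothetical function names:

```python
def sequential_time(durations_ms):
    """Total wait when each call starts only after the previous one returns."""
    return sum(durations_ms)

def parallel_time(durations_ms, timeout_ms):
    """Total wait when all calls start at once and the caller gives up at the timeout."""
    slowest = max(durations_ms, default=0)
    return min(slowest, timeout_ms)

durations = [120, 80, 300]  # made-up response times in milliseconds
print(sequential_time(durations))      # 500
print(parallel_time(durations, 1000))  # 300
print(parallel_time(durations, 200))   # 200
```

In the last case the caller stops waiting after 200 ms, so the result of the 300 ms call is not available to it and has to be handled as a timeout.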

What can we learn by analyzing a typical transaction? I have collected monitoring data, processed it and loaded it into a Neo4j graph database. Neo4j allows you to use Cypher queries to retrieve data and visualize it in a web browser. In addition to the graphical display it can also do aggregations and display results in table format.

For this visualisation of a transaction I used sequence numbers to show the order of execution. The “Frontend” facing service (1) calls node 2, which gets a “Functional” error response from node 3. A remarkable finding is that this error is not forwarded but absorbed, because node 2 reports “Success”. This can be “by design” but it is still remarkable. Node 4 executes some calls in parallel (nodes marked 8).
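Absorbed errors like this one can also be found automatically. The sketch below assumes each node is a dictionary with an `id`, the `parent` that called it, and the `status` it reported; the field names are hypothetical and the three nodes only loosely mirror the start of the example above.

```python
def absorbed_errors(nodes):
    """Return (failing node, caller) pairs where the caller still reported Success."""
    by_id = {n["id"]: n for n in nodes}
    absorbed = []
    for n in nodes:
        parent = by_id.get(n["parent"])  # None for the Frontend node
        if (n["status"] != "Success"
                and parent is not None
                and parent["status"] == "Success"):
            absorbed.append((n["id"], parent["id"]))
    return absorbed

nodes = [
    {"id": 1, "parent": None, "status": "Success"},
    {"id": 2, "parent": 1,    "status": "Success"},
    {"id": 3, "parent": 2,    "status": "Functional"},
]
print(absorbed_errors(nodes))  # [(3, 2)]
```

Whether each pair in the result is intended behaviour or a bug is still a question for the team that owns the calling service.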

This simple example already shows how powerful this kind of analysis is. It gets even better if you start using this information for improving your IT infrastructure.

How to improve your transaction handling

You have to think about error handling. The example shows that errors are not always forwarded to the Frontend service (customer). Is this by design? Do services handle error situations by themselves or do they just ignore errors? You have to find out what is going on here and determine if this is what you want. If the error is not forwarded, does it mean that it is not important and can be ignored? Or does an error always have nasty side effects that have to be managed even when we hide them from the customer?

Imagine that several teams are responsible for developing and managing services in the same business value chain. What should happen when an error is detected and forwarded? Should all these events be sent to each team, or only the root cause of the error? Sending only the root cause can reduce the number of alarms each team has to handle, and with it the cost. Also think about this: who is responsible for the whole end-to-end value chain?
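One possible way to reduce alarms is to report only the deepest failing calls, on the assumption that a failing call with no failing calls beneath it is the most likely root cause. That rule, and the node layout (`id`, `parent`, `status`), are assumptions for this sketch, not an established guideline.

```python
def root_causes(nodes):
    """Return ids of failing nodes that have no failing node beneath them."""
    failing = {n["id"] for n in nodes if n["status"] != "Success"}
    # Parents of failing nodes only forwarded an error from further down.
    forwarded_from_below = {n["parent"] for n in nodes if n["id"] in failing}
    return sorted(failing - forwarded_from_below)

# Hypothetical transaction: the error in node 3 is forwarded up to node 1.
nodes = [
    {"id": 1, "parent": None, "status": "Technical"},
    {"id": 2, "parent": 1,    "status": "Technical"},
    {"id": 3, "parent": 2,    "status": "Technical"},
    {"id": 4, "parent": 1,    "status": "Success"},
]
print(root_causes(nodes))  # [3]
```

Here three failing events collapse into a single alarm for the team that owns node 3.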

The most obvious problem is slow performance. Slow performance cannot be hidden from the customer. It can only be avoided by better design, by using high-performance servers, and by techniques such as parallel processing and offloading processing tasks using one-way (fire and forget) communication patterns.

In order to design an effective IT infrastructure, companies should have consensus among development teams on how they handle errors in business transactions. This can be achieved by defining company-wide guidelines based on the analysis of real business transactions.

Lessons learned:

  • Business transaction analysis can help you to understand how your business really works.
  • It allows you to detect and analyze common problems.
  • It allows you to improve your IT infrastructure.
  • It allows you to design and set up an effective monitoring system.

In an upcoming post I will talk about compliance.
