TIBCO Enterprise Service Bus provides a complete set of ESB products, but these products still need to be combined into a solution architecture. For my current customer I designed and implemented the following monitoring and reporting solution, which resulted in a significant reduction of business process exceptions.
- OpsView (Enterprise IT Monitoring)
- TIBCO Hawk (monitor infrastructure behavior, metrics and failures)
- TIBCO Clever (monitor functional and technical errors)
- TIBCO Spotfire (reporting)
- Pentaho Data Integration (ETL)
- Esper (Complex Event Processing)
- Confluence (wiki-based knowledge base)
OpsView was selected as the enterprise-wide monitoring solution for IT components. It monitors all critical components, collects events and provides an end-to-end view at the infrastructure level. Because OpsView is a generic solution, it gives operators a high-level overview but cannot provide detailed information for each monitored sub-system, such as the TIBCO ESB. For that level of detail it relies on TIBCO-specific solutions such as TIBCO Hawk and Clever.
TIBCO Hawk is the software product selected for monitoring the components that support the TIBCO ESB:
- TIBCO BPM (business processes)
- TIBCO BusinessWorks (integration and business logic)
- TIBCO EMS (the messaging backbone)
- Servers (CPU / Memory / Disk)
For each of these components, monitoring rules have been developed that watch specific aspects such as component status, performance and resource usage. All collected events are stored in a database for viewing and reporting. When thresholds are violated, alarms go off and operators are notified by email. Alarms which are classified as Critical are forwarded to OpsView.
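The rule flow described above (store every event, email on any threshold violation, forward only Critical alarms) can be sketched in a few lines. This is plain Python for illustration only and is not Hawk rulebase syntax; the `Rule` class, the `handle_sample` function, the metric name and the threshold values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    """A hypothetical threshold rule for one metric (not a Hawk rulebase)."""
    metric: str       # e.g. "cpu_percent" (assumed metric name)
    warning: float    # value at or above which a Warning alarm is raised
    critical: float   # value at or above which a Critical alarm is raised


def evaluate(rule, value):
    """Return 'Critical', 'Warning', or None when no threshold is violated."""
    if value >= rule.critical:
        return "Critical"
    if value >= rule.warning:
        return "Warning"
    return None


def handle_sample(rule, component, value, event_store, send_email, forward):
    """Store the event, email operators on any alarm, forward Critical ones."""
    severity = evaluate(rule, value)
    # Every collected event is stored, whether or not it raises an alarm.
    event_store.append((component, rule.metric, value, severity))
    if severity is None:
        return None
    alarm = {"component": component, "metric": rule.metric,
             "value": value, "severity": severity}
    send_email(alarm)       # operators are notified of every alarm
    if severity == "Critical":
        forward(alarm)      # only Critical alarms reach the enterprise monitor
    return severity
```

Passing the store, mailer and forwarder in as arguments keeps the rule logic testable without a database, mail server or OpsView instance.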
- The component involved
- The activity executed when the failure occurred
- Context such as incoming message, OrderID, correlation IDs (link to business process)
- Details such as error codes, error message, stack dump
- Improved “Error classification mechanism” based on a known symptom lookup table
- New symptoms can be added in real-time (refreshed every 10 minutes)
- Symptoms can be exported and imported
- Symptom table can be used to generate and update documentation
- Throttling: reduction of repeating exceptions (reduces email notifications using Esper Complex Event Processing)
- Improved Email formatting (now includes a problem description and resolution)
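To illustrate the throttling idea from the list above, here is a minimal sketch that suppresses repeated notifications for the same symptom within a time window. The actual solution uses Esper for this; the code below is a plain Python stand-in and not Esper EPL. The `Throttle` class, its method names and the 600-second default window are assumptions made for the example.

```python
class Throttle:
    """Allow one notification per symptom key per time window (hypothetical)."""

    def __init__(self, window_seconds=600):
        self.window = window_seconds
        self.last_sent = {}    # symptom key -> time of last notification sent
        self.suppressed = {}   # symptom key -> repeats suppressed since then

    def should_notify(self, symptom_key, now):
        """Return True if an email should be sent for this occurrence.

        `now` is a timestamp in seconds, passed in by the caller.
        """
        last = self.last_sent.get(symptom_key)
        if last is not None and now - last < self.window:
            # Same symptom seen again inside the window: count it, send nothing.
            self.suppressed[symptom_key] = self.suppressed.get(symptom_key, 0) + 1
            return False
        # First occurrence, or the window has expired: notify and restart it.
        self.last_sent[symptom_key] = now
        self.suppressed[symptom_key] = 0
        return True
```

The suppressed counter could be included in the next email for that symptom, so operators still see how often the exception repeated.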
The classification system is based on a database table with known symptoms. The table includes the exception category, domain, type, a short description and a problem resolution. The classification mechanism enables fast exception recognition and makes the reports easier to use. After classification, Clever routes the classified exceptions to specific solving groups based on criteria such as category and severity.
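A symptom lookup with routing could look roughly like the sketch below. The column names follow the description above, but the `pattern` column, the regex matching, the sample rows, the group names and the routing table are assumptions for illustration; they are not taken from the Clever database.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Symptom:
    pattern: str       # assumed: regex matched against the error message
    category: str
    domain: str
    type: str
    description: str
    resolution: str


# Hypothetical rows standing in for the known-symptom table.
SYMPTOMS = [
    Symptom(r"ORA-\d+", "Technical", "Database", "DatabaseError",
            "The database driver reported an error",
            "Check database availability, then reprocess the message"),
    Symptom(r"unknown order", "Functional", "OrderManagement", "InvalidData",
            "The message refers to an order that does not exist",
            "Verify the OrderID with the sending system"),
]

# Hypothetical routing table: (category, severity) -> solving group.
ROUTES = {
    ("Technical", "Critical"): "infrastructure-team",
    ("Functional", "Critical"): "order-desk",
}
DEFAULT_GROUP = "esb-operations"


def classify(message):
    """Return the first symptom whose pattern matches, or None if unknown."""
    for symptom in SYMPTOMS:
        if re.search(symptom.pattern, message, re.IGNORECASE):
            return symptom
    return None


def route(symptom, severity):
    """Pick a solving group; unknown or unmapped exceptions use the default."""
    if symptom is None:
        return DEFAULT_GROUP
    return ROUTES.get((symptom.category, severity), DEFAULT_GROUP)
```

For example, `route(classify("ORA-12541: no listener"), "Critical")` selects the hypothetical `infrastructure-team` group, while an unrecognized message falls back to `esb-operations`.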
Finally, all exceptions are stored in the Clever database and can be used in Spotfire reports.
- Find failing components
- Determine root causes
- Determine impact of failures on availability
Events collected by TIBCO Hawk and Clever are stored in a database, and this data is used to generate reports. One such report shows the error distribution by component and type over a specified period of time. This particular report allowed us to find periodically recurring database problems.
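The aggregation behind such a report is simple to express. The sketch below counts errors per component and type within a period; the reports themselves are built in Spotfire, so this is only an illustration, and the event field names (`timestamp`, `component`, `type`) are assumed rather than taken from the Hawk or Clever schema.

```python
from collections import Counter
from datetime import datetime


def error_distribution(events, start, end):
    """Count events per (component, type) with start <= timestamp < end.

    `events` is an iterable of dicts with the assumed keys
    'timestamp' (datetime), 'component' and 'type'.
    """
    counts = Counter()
    for event in events:
        if start <= event["timestamp"] < end:
            counts[(event["component"], event["type"])] += 1
    return counts
```

Grouping the same events by hour of day instead of by component is one way to make periodic problems, such as a nightly database job, stand out.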
These reports are now used by operations managers and testers.
- Operations manager:
- System availability
- Number of critical alerts (indicator for quality and load on operators)
- Type and number of alerts per component (find areas for improvement)
- Find exceptions caused by bugs in deployed components
- Verify data quality of all logged events and exceptions
The TIBCO Spotfire product does not include data pre-processing capabilities such as ETL (Extraction, Transformation and Load). For these capabilities an open-source solution was chosen: Pentaho Data Integration (a.k.a. Kettle).
This article is part of a series on monitoring.
I work as a consultant and developer, building and managing microservices.