Getting started with Smart Monitoring


If you want to offer your customers IT services and promise them 99.9% availability, then service monitoring is essential. Smart Monitoring is a philosophy and way of working to achieve that goal. It enables you to understand your complex IT landscape and stay in control. In this article I will give you some guidelines and a basic plan for getting started.

This article is part of a series on monitoring.

[Image: smartmonitor_overview]

Let's try to aim for this:

The dashboard shows information on several services and how they conform to SLA-defined KPIs such as a response time target. You can see how these services perform, what results they return (success, functional and technical errors) and drill down by team.

Note: This is not the full Smart Monitoring solution; it is just a way to get started with monitoring and learn by doing it yourself. You can always contact me if you need advice.

Some guidelines:

  • Design and develop services with reliability in mind
  • Set up a monitoring solution to log events and monitor your services
  • Follow a step by step improvement approach
  • Analyse your monitoring data, learn and improve your services
  • Handle incidents and recover from failures

Because we focus on getting started, we will use the ELK stack, which is simple to use and can be extended to meet almost every requirement.

  • Elasticsearch (event storage and search functionality)
  • Logstash (data collection and processing)
  • Kibana (data visualization and analysis)

Start with a plan

Installing the ELK stack is easy, and for a proof of concept it can run on a single laptop if needed. At first sight Kibana enables you to create impressive dashboards. You could just fill Elasticsearch with tons of data and create all kinds of queries and dashboards. That would work for a while, as long as the system you monitor is not too complicated, you are the only one looking at the data and you have excellent query building skills.
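If you just want to check that a proof-of-concept node is alive before building anything on top of it, a minimal sketch with the Python elasticsearch client (assuming a node on localhost:9200) could look like this:

```python
# Quick sanity check that a local Elasticsearch node is reachable.
# Assumes Elasticsearch runs on localhost:9200 and the `elasticsearch`
# Python client is installed (pip install elasticsearch).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

if es.ping():
    print(es.info()["version"]["number"])  # print the cluster version
else:
    print("Elasticsearch is not reachable")
```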

[Image: ad hoc monitoring dashboards]

So if you have a small company, a single development team, just a handful of services and two ops engineers who know everything about the business, you don't need a plan. But in most cases you will have more demanding requirements:

  • The number of services is ever increasing
  • Services have complex relations with each other
  • Multiple teams are developing services
  • Teams must be able to manage all services, not just their own
  • Services have to be managed at the business process level
  • Dashboards should also be used by non-technical people (management)
  • You may want to know about your service availability

Things are getting out of hand now. Every team develops its own monitoring solution, and each dashboard only shows a specific part of the infrastructure. It is now impossible to get any overview.

It is up to you, but it usually pays off to create a plan. Monitoring can be seen as a kind of Business Intelligence (BI) application:

  • What do you want to achieve?
  • What questions do you want to answer?
  • What data do you need to answer those questions?
  • Where can you find the data?
  • How can you collect and process the data?

Collecting data

In the ELK stack, Logstash and Beats are the tools for collecting data and forwarding it to Elasticsearch.

[Image: beats-platform]

The Beats are open source data shippers that you install as agents on your servers to send different types of operational data to Elasticsearch. Beats can send data directly to Elasticsearch or send it to Elasticsearch via Logstash, which you can use to parse and transform the data.

With Logstash you can forward log file events to Elasticsearch in a basic logging format (timestamp, severity, message). Although useful, this is just forwarding the raw data and won't help you understand what is going on. It helps to collect events with richer content. You could create service call trace events that provide more structured information:

  • host
  • service
  • operation
  • error code
  • error reason
  • severity

With this (very simplified) structured call tracing event, fields like service and operation can be used to filter and drill down into your data. Structuring data requires some planning and coordination but will pay off in the end. You can try to read your log files with Logstash and build a complex parser to retrieve the structured fields, but it is much easier and more performant to develop a standard logging service and log events in a structured format.
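To make that concrete, here is a minimal sketch of such a standard logging helper in Python. The field names, log path and error code are assumptions for the example, not a prescribed format; the point is that every event is written as one self-describing JSON document, which a Logstash or Filebeat pipeline can ship without any complex parsing.

```python
# Minimal sketch of a structured "service call trace" logger.
# Field names, error codes and the log path are illustrative assumptions;
# adapt them to whatever standard your teams agree on.
import json
import socket
from datetime import datetime, timezone

LOG_PATH = "/var/log/services/trace.log"  # assumed location, picked up by Logstash/Filebeat

def log_trace(service, operation, error_code="OK", error_reason=None, severity="INFO"):
    event = {
        "@timestamp": datetime.now(timezone.utc).isoformat(),
        "type": "tracing",
        "host": socket.gethostname(),
        "service": service,
        "operation": operation,
        "error_code": error_code,
        "error_reason": error_reason,
        "severity": severity,
    }
    # One JSON document per line keeps the shipping pipeline trivial
    # (json codec, no grok parsing needed).
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(event) + "\n")

# Example: a failed call on the order service
log_trace("order-service", "createOrder",
          error_code="TECH-001", error_reason="database timeout", severity="ERROR")
```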

Another aspect is normalizing data. If your services and operations all use different error codes for the same kind of situation, it will be very hard to understand what they mean. It is better to define standards for all codes. The same goes for latency: every operation has its own latency KPI, so in order to compare different operations with each other you have to put them on the same scale by normalizing the measured values. This is essential, and it is usually missing from commercially available monitoring solutions.
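A small sketch of what latency normalization could look like: each operation gets its own target, and dividing the measured value by that target puts all operations on one comparable scale (1.0 means exactly on target). The KPI values below are invented for the example.

```python
# Normalize measured latencies against each operation's own KPI so that
# different operations can be compared on one scale (1.0 == exactly on target).
# The KPI targets below are made-up example values.
LATENCY_KPI_MS = {
    ("order-service", "createOrder"): 200,
    ("order-service", "getOrder"): 50,
    ("billing-service", "createInvoice"): 500,
}

def normalized_latency(service, operation, measured_ms):
    target_ms = LATENCY_KPI_MS[(service, operation)]
    return measured_ms / target_ms

print(normalized_latency("order-service", "createOrder", 300))      # 1.5 -> 50% over target
print(normalized_latency("billing-service", "createInvoice", 300))  # 0.6 -> well within target
```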

Storing and analyzing data

Quote: Elasticsearch is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases. As the heart of the Elastic Stack, it centrally stores your data so you can discover the expected and uncover the unexpected.

All data collected with Logstash and Beats is stored in an Elasticsearch cluster for further analysis and visualization. Data in Elasticsearch is organized and stored as documents in indexes.

Quote: An index is a collection of documents that have somewhat similar characteristics. For example, you can have an index for customer data, another index for a product catalog, and yet another index for order data. An index is identified by a name (that must be all lowercase) and this name is used to refer to the index when performing indexing, search, update, and delete operations against the documents in it.

For monitoring we use an index named logstash-yyyy.mm.dd and create a new index every day. Indexes are kept for 30 days and then dropped. Log events are stored with the log type and tracing events get the tracing type. We use index templates to specify how the fields should be interpreted.
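For illustration, a sketch of the daily index naming and a simple 30-day cleanup using the Python elasticsearch client (an 8.x-style API is assumed; in a real setup you would rather let ILM or Curator handle retention):

```python
# Sketch: daily logstash-yyyy.mm.dd index names plus a simple 30-day cleanup.
# Assumes the elasticsearch Python client (8.x-style API) and a node on
# localhost:9200; production setups should use ILM or Curator for retention.
from datetime import date, timedelta
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def index_for(day):
    return f"logstash-{day:%Y.%m.%d}"

# index a tracing event into today's index
es.index(index=index_for(date.today()), document={
    "type": "tracing",
    "service": "order-service",
    "operation": "createOrder",
    "error_code": "OK",
})

# drop the index that has just fallen outside the 30-day window
old_index = index_for(date.today() - timedelta(days=31))
es.indices.delete(index=old_index, ignore_unavailable=True)
```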

Visualizing data

[Image: smartmonitor_overview]

For visualization in this example I use Kibana and Grafana to define dashboards. In Kibana I am now experimenting with Kibi from Siren Solutions. Kibi allows me to drill down on a specific dashboard pane, select an area of interest and then switch to another dashboard to investigate further. The dashboard shows normalized response times and normalized error codes, so I can see which of my services misbehave.

[Image: grafana_overview]

I also created a Grafana dashboard. The gauges show service metrics stored in Elasticsearch:

  • Availability (percentage)
  • onTime (do we meet SLA)
  • Technical error (percentage)
  • Functional error (percentage)
  • Transactions per minute (absolute value, needs to be normalized)
  • Resource usage (percentage)

As you can see there are a lot of interesting gauges. Service health is determined by all of these values, but it also depends on other infrastructure components and their health: think about the network, message queues, other services and the servers they run on. So for every service you can find at least ten metrics, and there are complex relations between them. When transactions per minute goes up, resource usage will go up too. So what happens if you have thousands of microservices to manage? You will be overwhelmed with a sea of data and spend all of your time managing microservices.
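To make the gauges a bit more tangible, here is a sketch of how a few of them could be derived from event counts over a time window. The counts and the definition of availability are assumptions for the example; in practice the numbers come from Elasticsearch aggregations.

```python
# Sketch: derive a few of the gauge values from counted events over a time window.
# The input counts are made up; in practice they come from Elasticsearch
# aggregations, and "availability" may be defined differently in your SLA.
def service_gauges(total, functional_errors, technical_errors, on_time, window_minutes):
    return {
        "availability_pct": 100.0 * (total - technical_errors) / total if total else 100.0,
        "on_time_pct": 100.0 * on_time / total if total else 100.0,
        "functional_error_pct": 100.0 * functional_errors / total if total else 0.0,
        "technical_error_pct": 100.0 * technical_errors / total if total else 0.0,
        "transactions_per_minute": total / window_minutes,
    }

print(service_gauges(total=1200, functional_errors=80,
                     technical_errors=20, on_time=1150, window_minutes=60))
```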

To fight this problem I am working on a business rules engine that determines service health based on many factors, keeping track of dependencies and making use of expert knowledge. This is just one example of what Smart Monitoring is meant to do.
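To give an idea of the direction (this is not the actual engine), a tiny sketch of rule-based health evaluation that combines a service's own metrics with the health of its dependencies; the thresholds and the dependency map are invented for the example.

```python
# Illustrative sketch only: rule-based service health that combines a service's
# own metrics with the health of its dependencies. Thresholds and the
# dependency map are invented for the example.
THRESHOLDS = {"technical_error_pct": 1.0, "normalized_latency": 1.0}

DEPENDENCIES = {
    "order-service": ["billing-service", "order-db"],
    "billing-service": [],
    "order-db": [],
}

def health(service, metrics):
    own = metrics[service]
    # a service is unhealthy if it breaks a rule itself...
    if own["technical_error_pct"] > THRESHOLDS["technical_error_pct"]:
        return "unhealthy"
    if own["normalized_latency"] > THRESHOLDS["normalized_latency"]:
        return "degraded"
    # ...or degraded if any of its dependencies is not healthy
    for dep in DEPENDENCIES.get(service, []):
        if health(dep, metrics) != "healthy":
            return "degraded"
    return "healthy"

metrics = {
    "order-service":   {"technical_error_pct": 0.2, "normalized_latency": 0.8},
    "billing-service": {"technical_error_pct": 0.1, "normalized_latency": 0.9},
    "order-db":        {"technical_error_pct": 3.0, "normalized_latency": 2.5},
}
print(health("order-service", metrics))  # degraded, because order-db is unhealthy
```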

Next steps

So far we have collected and viewed normalized data. This will help us get a better view of the behaviour of our services. But there is more:

  • Daily operations:
    • I want to get a summary of current status instead of so many dashboards
    • I want to be informed when something goes out of hand
    • I want to understand what causes all my trouble
  • Long term planning:
    • I want to know what my infrastructure topology looks like
    • I want to verify if it is correctly setup
    • I want to do capacity planning
    • I want to optimize the number of servers

Currently I am working on using graph databases (such as Neo4j) for structural analysis:

[Image: confidentiality]
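As a hypothetical illustration of that direction, here is a small sketch that loads service dependencies into Neo4j and asks which services are transitively affected when one component fails. The connection details and the data model are assumptions for the example.

```python
# Hypothetical sketch: store the service dependency graph in Neo4j and ask
# which services are transitively affected when one component fails.
# Connection details and the data model are assumptions for this example.
from neo4j import GraphDatabase  # pip install neo4j

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

dependencies = [
    ("order-service", "billing-service"),
    ("order-service", "order-db"),
    ("billing-service", "billing-db"),
]

with driver.session() as session:
    for caller, callee in dependencies:
        session.run(
            "MERGE (a:Service {name: $caller}) "
            "MERGE (b:Service {name: $callee}) "
            "MERGE (a)-[:DEPENDS_ON]->(b)",
            caller=caller, callee=callee,
        )

    # Which services depend, directly or indirectly, on billing-db?
    result = session.run(
        "MATCH (s:Service)-[:DEPENDS_ON*1..]->(failed:Service {name: $name}) "
        "RETURN DISTINCT s.name AS affected",
        name="billing-db",
    )
    print([record["affected"] for record in result])

driver.close()
```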
