Last week I attended the ING Business Continuity conference in Amsterdam as a speaker. For two days, more than 140 participants from all over the world discussed how to improve the reliability of IT services, handle major incidents and recover from disasters. My presentation on Smart monitoring offers a practical approach to improving the reliability of IT services.
The idea behind “Smart monitoring” is simple and beautiful:
- Understand how your infrastructure works
- Detect weaknesses and prevent failures before they happen
- Monitor your services more efficiently and effectively
The idea is simple, but the real challenge lies in the implementation. The first and biggest hurdle is to bring this idea to the attention of everyone within the company and to call for action. As a subtitle I use “Situational awareness with Smart monitoring”, because my main challenge is to make people aware of this problem.
Nowadays DevOps teams are agile and self-steering, but they are not totally independent in setting priorities and deciding where to spend their time. The product owner and business stakeholders expect them to continuously deliver new services within weeks. The team gives this priority over everything else, including maintenance. Meanwhile, most companies still have large numbers of old/legacy services that no one cares about but that still deliver 99% of the business value.
As a result, service maintenance lags behind and the number of errors and failures increases, which decreases the quality of service to customers.
You might ask yourself: “If the number of failures and errors increases, surely someone must notice?”
The fact is that current service monitoring in many companies is mediocre at best: data is incomplete and of poor quality. Most of the time nothing is standardized and reports are hard to get. Many systems already produce so many errors and vague indicators that nobody wants to dig into this mess of data. This situation persists for years, and only really big incidents get noticed and handled. This is the reason we have so many failures.
So what can we do about that? Let's assume that someone is bold enough to take up the challenge of improving this situation. The plan is to understand how our organisation and infrastructure work, improve them before things get out of hand, and actively monitor their behavior. To be really effective, the plan has to be embedded in and accepted by the organisation.
Let's assume that someone thinks it might be a good plan and now wants you to start working on it. But he wants some quick results to verify that it's not another hot-air solution. Implementing a new way of working throughout the whole organisation is too much, but you could start with something smaller. For instance, set up a monitoring system that is able to show the dependencies between services. Once you have that, it could convince the organisation to take the next steps. I have created a road map to help me steer in the right direction.
Implementing smart monitoring
For this we need to know:
- Current situation of our systems
  - how our systems work
  - how they depend on each other
  - how they interact
- Current state of our services and components
Getting insight into a complex infrastructure is a major challenge, especially in a large company with hundreds of teams working independently on the same system. The approach I take is to collect data from the services themselves, clean it, and use it to automatically infer the dependencies between services and build an operational model. For this I use technology such as Neo4j graph databases and tools for streaming data, but also a Postgres database, QlikView and plain old “common sense”.
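To make the "collect, clean, infer" step a little more concrete, here is a minimal sketch in plain Python. It is not the actual implementation: the record format (a list of caller/callee pairs), the cleaning rules and the function names `clean` and `infer_dependencies` are assumptions made for illustration, and a plain dictionary stands in for the graph database.

```python
from collections import defaultdict


def clean(name):
    """Normalize a raw service name; return None if it is unusable."""
    if name is None:
        return None
    name = name.strip().lower()
    return name or None


def infer_dependencies(records):
    """Build {caller: {callee: call_count}} from raw (caller, callee) records.

    The record format is hypothetical; real data would come from logs,
    traces or message headers and would need its own cleaning rules.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for raw_caller, raw_callee in records:
        caller, callee = clean(raw_caller), clean(raw_callee)
        if caller is None or callee is None or caller == callee:
            continue  # skip incomplete records and self-calls
        graph[caller][callee] += 1
    # Convert the nested defaultdicts to plain dicts for easy inspection
    return {caller: dict(callees) for caller, callees in graph.items()}


# Made-up sample records, including dirty names and an incomplete record
records = [
    ("WebShop ", "payments"),
    ("webshop", "Payments"),
    ("payments", "ledger"),
    ("webshop", None),
    ("ledger", "ledger"),
]
dependencies = infer_dependencies(records)
print(dependencies)
```

The call counts are kept because they give a rough idea of how heavily one service relies on another. In a real setup the resulting edges could be loaded into a graph database instead of being kept in memory.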
The model can then be used to:
- analyze dependencies between components
- analyze business transactions
- find weaknesses in services already in production and repair them
- develop a better, more effective monitoring system
This may sound a bit theoretical, but I can assure you that it will help to make your system more reliable. It enabled us to develop better DevOps team dashboards and a real-time dashboard for determining the impact of a service failure on the system. And this is not the last result: I’m now working on integration with ServiceNow to process and handle alerts.
If you want to know more, click on the links in this article, read my blog on monitoring or contact me.
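As an illustration of what "determining the impact of a service failure" can mean in terms of a dependency model, the sketch below walks the dependency graph backwards from a failed service. It is only a sketch under assumptions: the service names and the `impacted_services` function are hypothetical, and the model is a simple dictionary rather than a graph database.

```python
from collections import deque


def impacted_services(dependencies, failed):
    """Return every service that directly or indirectly depends on `failed`.

    `dependencies` maps each service to the services it calls.
    """
    # Reverse the edges: callee -> set of callers
    dependents = {}
    for caller, callees in dependencies.items():
        for callee in callees:
            dependents.setdefault(callee, set()).add(caller)

    # Breadth-first search over the reversed edges
    impacted, queue = set(), deque([failed])
    while queue:
        service = queue.popleft()
        for caller in dependents.get(service, ()):
            if caller not in impacted and caller != failed:
                impacted.add(caller)
                queue.append(caller)
    return impacted


# Hypothetical dependency model
dependencies = {
    "webshop": ["payments", "catalog"],
    "mobile-app": ["payments"],
    "payments": ["ledger"],
    "catalog": [],
}
print(sorted(impacted_services(dependencies, "ledger")))
# prints ['mobile-app', 'payments', 'webshop']
```

A failure of `catalog` would only touch `webshop`, while a failure of `ledger` reaches every customer-facing service in this example. That difference is the kind of information an impact dashboard can surface.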
I challenge you to improve your services.
Tauvic Ritter, DevOps IT4IT Innovator. Also check out my LinkedIn profile: