It’s Time to Move beyond Reactive Event Management

A Brief History of Infrastructure Monitoring

Infrastructure monitoring can be simply defined as using software to watch computer systems and confirm that they are working properly. Monitoring software has been around for as long as business-critical tasks have been performed on computers.

Back before clouds, virtualization, containers, distributed applications, n-tier applications, and microservices, the IT architecture was simple. An application ran on one or more dedicated servers. Other than the operating system, the application was the only important thing running on that server.

Multiple monitoring tools were typically deployed, and the alerts from these tools, as well as the events from the entire underlying infrastructure, were generally fed into an event management system.

Today’s State of Affairs

In contrast to those simple one-to-one relationships between applications and infrastructure, modern applications and their supporting environments could hardly be more complex. As shown in the image below, the application stack is now highly distributed, diverse, rapidly changing, and often running in multiple clouds.

  • Shifts in application architecture – To evolve digital services more quickly and frequently, applications are being broken into microservices where each individual function runs in its own container. This creates thousands, and sometimes hundreds of thousands, of individual microservices, each running in its own container, all of which need to be monitored and managed.
  • Accelerated delivery of code in production – The process of getting code into production is being automated as much as possible with continuous integration/continuous delivery (CI/CD). This facilitates hundreds and sometimes thousands of code changes in production every day.
  • Many different languages – The variety of tasks that need to be implemented in code, and the need to do more with fewer skilled developers, has driven a proliferation of languages. It is not unusual for one application to be composed of components written in three or more different languages. It’s not just Java anymore.
  • Many different services and containers – The layer of services which applications rely upon has exploded in complexity and diversity. It used to be simple: a web server like Apache, a Java application server like Tomcat or JBoss, and a database server like Oracle or MySQL. Now one application might leverage separate application servers for each language, a message bus like Kafka, and multiple databases like MongoDB, Redis, Cassandra, and one or more SQL databases.
  • Proliferation of data architectures – Data architectures are not just the top three SQL databases anymore, but also include a wide variety of NoSQL databases as well as scale-out SQL options and graph datastores.
  • Everything virtualized – All of the resources (compute, memory, networking, and storage) are virtualized, which means that both the virtual and physical instances of these resources need to be monitored and managed.
  • Many clouds – There are many cloud options spanning private, hybrid, and public, with most enterprises using a blend of more than one.

Highly Distributed, Rapidly Changing, Complex Environments in Many Clouds

The modern application stack and its dynamic behavior, combined with the development process, create the following new challenges for teams building and supporting these applications in production:
  • Lots of objects to monitor – Modern apps are highly scaled out with many things to monitor (e.g. hundreds or thousands of microservices in production).
  • Rapid and increasing rate of change – Modern apps are highly dynamic, with a high rate of change in both scale and versions (e.g. multiple releases of new software into production every day).
  • Increased application and infrastructure diversity – Modern apps are very diverse, with many different languages and stacks, driven by the need for developer productivity.
  • It’s not a 100% microservices world – Business services are often composed not just of modern applications, but also of previously developed n-tier, monolithic, and commercial off-the-shelf (COTS) applications.
  • Cross-cloud complexity – As stated above, the environments spanning the on-premises private cloud and the public clouds are more complex and dynamic than ever.
  • Skills gap – Due to the above factors, modern apps are very complex, and addressing issues consumes time and expensive resources. In fact, Gartner predicts that by 2020, 75% of enterprises will experience visible business disruptions due to infrastructure and operations (I&O) skills gaps, an increase from less than 20% in 2016.
  • Business adoption challenges – Enterprises struggle to deploy APM tools broadly and pervasively due to their complexity and cost. In its 2019 Magic Quadrant for Application Performance Monitoring, Gartner states, “Enterprises will quadruple their application performance monitoring due to increasingly digitized processes from 2018 through 2021 to reach 20% of all business applications.” Quadrupling to reach 20% implies a starting point of about 5%, which means that when this note was published in March of 2019, only about 5% of business applications were monitored by an APM solution.

An Overview of Events

Event Management is the standard method by which IT Operations teams “manage by exception”:
  • The essence of IT Operations Management for the last 30 years (ever since the advent of distributed systems in the mid-1980s) has been to understand what is “normal” and what is “abnormal” and then to alert on the anomalies. Events are anomalies.
  • Events now come from an incredible variety of sources. Every element of the stack (disks, storage arrays, network devices, servers, load balancers, firewalls, systems software, middleware, services, and applications) is capable of sending events.
  • Events tend to come in two broad forms: hard faults and alarms related to failures in the environment (this disk drive has failed, this port on this switch has failed, this database server is down), and alerts that come from violations of thresholds set by humans on various monitoring systems.
However, the problems with event management systems are:
  • They are entirely reactive. The entire event management process does not even start until after a problem has occurred. Event management systems provide no way to be proactive or to prevent problems.
  • They are entirely reliant upon the quality of the events fed to them by the infrastructure and tools that are the source of the alarms and events.
  • Since many different humans are involved in setting up the criteria for events and alarms, the events and alarms sent to the event management system vary greatly in quality and relevance.
  • The humans who set the thresholds for the alarms in the monitoring systems tend to either set the thresholds too high (leading to lots of missed problems) or set the thresholds too low (leading to lots of false alarms).
  • The data fed into event management systems is completely lacking in topology and dependency information. This information has traditionally come from a CMDB, but in modern, dynamic systems the CMDB is always out of date.
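
The threshold problem described above can be illustrated with a minimal sketch. The CPU samples, the incident window, and the function names below are all invented for illustration and do not come from any particular monitoring product. The sketch simply counts missed problems and false alarms for three different human-chosen thresholds.

```python
# Hypothetical CPU utilization samples (percent), one per minute.
# Samples 10-12 are a real incident; sample 4 is a harmless brief spike.
samples = [41, 43, 40, 42, 78, 41, 39, 42, 40, 43, 88, 91, 90, 42, 41]
incident = {10, 11, 12}


def static_alerts(values, threshold):
    """Return the indexes of samples that exceed a fixed threshold."""
    return {i for i, v in enumerate(values) if v > threshold}


def score(alerts, truth):
    """Count missed problems and false alarms for a set of alert indexes."""
    missed = len(truth - alerts)
    false_alarms = len(alerts - truth)
    return missed, false_alarms


for threshold in (95, 85, 70):
    missed, false_alarms = score(static_alerts(samples, threshold), incident)
    print(f"threshold={threshold}: missed={missed}, false_alarms={false_alarms}")
```

With this invented data, a threshold of 95 misses all three incident samples, while a threshold of 70 catches the incident but also raises a false alarm on the harmless spike. Only a threshold that happens to sit between the two does both jobs, and in practice nobody knows that value in advance.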

An Overview of Metrics (Data)

The operation of any IT environment can also be characterized by metrics or data. There are thousands of metrics across any kind of complex hardware and software stack, and the important ones can be boiled down into the following categories:
  • Capacity – how much capacity of each type exists. This covers free storage capacity, free network bandwidth, available memory on servers, and available CPU resources across the environment.
  • Utilization – how much of your capacity of each type is being used at each point in time. Trends in utilization are important to understand when you will run out of each kind of capacity.
  • Contention – for which key resources are applications and processes “waiting in line”. CPU Ready in a VMware environment tells you what the contention is for virtual and physical CPU resources. Memory swapping can indicate contention for memory resources. I/O queues at the storage layer indicate that the storage devices may be saturated.
  • Performance – this is a crucial point. Performance in abstracted environments (virtualized and cloud-based environments) is NOT resource utilization. Performance is how long it takes to get things done. So performance equals response time at the transaction level and latency at the infrastructure level.
  • Throughput – these metrics measure how much work is being done per unit of time. Transactions per second at the transaction layer, and reads/writes per second at the network and storage layers are good examples of throughput metrics.
  • Error Rate – these metrics measure things like failed transactions and dropped network packets.
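
As a rough illustration of why utilization trends matter, the sketch below fits a straight line to daily utilization readings and estimates how many days remain before capacity is exhausted. The function name `days_until_full` and the sample readings are hypothetical, and a simple least-squares line is only one of many possible forecasting approaches; real workloads are rarely this linear.

```python
def days_until_full(used_pct_by_day, capacity_pct=100.0):
    """Estimate days until utilization reaches capacity using a least-squares line.

    used_pct_by_day: utilization percentages, one per day, oldest first.
    Returns None when there is too little data or utilization is flat or falling.
    """
    n = len(used_pct_by_day)
    if n < 2:
        return None
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(used_pct_by_day) / n
    slope_num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, used_pct_by_day))
    slope_den = sum((x - x_mean) ** 2 for x in xs)
    slope = slope_num / slope_den
    if slope <= 0:
        return None
    intercept = y_mean - slope * x_mean
    # Day index at which the fitted line crosses capacity,
    # expressed relative to the most recent observation.
    day_full = (capacity_pct - intercept) / slope
    return max(0.0, day_full - (n - 1))


# Hypothetical storage utilization growing two percentage points per day.
storage_used = [60.0, 62.0, 64.0, 66.0, 68.0]
print(days_until_full(storage_used))
```

The same calculation could be applied to any of the capacity types listed above (storage, network bandwidth, memory, or CPU), provided the readings are collected at a regular interval.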

How Can Metrics (Data) Help?

In this era of big data, it is possible to combine and mine the data that measures the performance, throughput, contention, utilization, and error rate across the stack and get the following types of insights:
  • Where are the current hotspots in the environments? Where are the sources of contention in key resources that are likely impacting transaction and applications performance?
  • What are the trends in contention? Where will there likely be an issue in the near future, and how can that issue be proactively avoided?
  • Can relationships between metrics help with root cause analysis? Advanced big data systems for IT Operations do not just capture metrics, but also capture the relationships between transactions and applications and where they run in the virtual and physical infrastructure.
  • Identifying zombie VMs and cloud images that are costing you money but not doing any useful work.
  • Communicating the service level status of crucial transactions and their supporting infrastructure to business constituents and application owners.
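
To make the idea of combining relationships with metrics concrete, here is a minimal sketch that walks a dependency map downward from a slow transaction and reports any component whose contention reading exceeds a limit. The topology, the component names, the contention values, and the 0.7 limit are all assumptions made up for this example; it is not a description of how any specific product performs root cause analysis.

```python
from collections import deque

# Hypothetical topology: each component maps to what it runs on or depends on.
depends_on = {
    "checkout-transaction": ["order-service"],
    "order-service": ["vm-17", "orders-db"],
    "orders-db": ["vm-22"],
    "vm-17": ["host-a"],
    "vm-22": ["host-b"],
    "host-a": ["datastore-1"],
    "host-b": ["datastore-1"],
    "datastore-1": [],
}

# Hypothetical contention readings normalized to 0..1 (higher means more waiting).
contention = {
    "order-service": 0.10, "orders-db": 0.20, "vm-17": 0.15, "vm-22": 0.30,
    "host-a": 0.20, "host-b": 0.25, "datastore-1": 0.85,
}


def contended_dependencies(start, limit=0.7):
    """Breadth-first walk from `start`; return contended components in visit order."""
    seen, queue, suspects = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        if contention.get(node, 0.0) > limit:
            suspects.append(node)
        for dep in depends_on.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return suspects


print(contended_dependencies("checkout-transaction"))
```

In this invented environment the walk points at the shared datastore, which sits beneath both hosts. Without the dependency map, the same contention reading would be just one more metric with no link back to the affected transaction.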

The Centerity AIOps Platform

Centerity is an AIOps platform that provides Dynamic Service Views into the business services that are composed of your new digitalization initiatives and the legacy systems that support those new initiatives. Metrics, logs, events and relationships are collected across your existing tools and platforms, analyzed for anomalies, and visualized in service level graphs. Key features include:
  • Deep and Broad Integrations: Centerity integrates with the tools and platforms that you rely upon.
  • Dynamic Service Views: Simple-to-understand gauges that show the service quality for each critical business service.
  • Real-time Analytics: Advanced management & tactical dashboards maintain SLAs for critical processes.
  • In Context: Constant alignment between IT data and business objectives.
  • Consistent User Experience: Detect user experience degradation before your users do.
  • Traffic Analysis: Analyze bandwidth consumption and data flow; filter by application, packets, protocols, etc.
  • AI-driven Anomaly Detection: Machine learning for digital services moves the performance discipline beyond thresholds.

Centerity Dynamic Service Views

Centerity’s unique value in the AIOps Platform market is the ability to combine the information about a business service into a Dynamic Service View (DSV), which reflects the quality of that service to the business constituents who are responsible for it.

When service levels degrade, Centerity provides a bird’s eye view of how the operation of each layer that supports the digital business service is impacting the overall service level.

Drill Down Into Each Layer of a Digital Business Service

Summary Recommendation

Today’s application systems are too dynamic and change too rapidly to be managed effectively for service quality with a reactive event management system. Only Centerity combines the required metrics and relationships with AI into an effective AIOps platform that can assure service quality in these modern environments and provide the business owners of these services with relevant views and diagnostics into their services.
