Best Practices in Performance Monitoring for Big Data Environments (SAP HANA, Hadoop, Enterprise Applications)
Use Case #3 – Reducing Mean-Time-to-Restore (MTTR)
- Dynamic Threshold Alerting
- Automated Remediation Actions
- Outage Graphics
Use Case Overview
Reduced downtime is the number one driver of customer satisfaction and higher Net Promoter Scores. To deliver it, IT departments must both prevent outages and restore service quickly when they occur.
First, smart static and dynamic thresholds are needed to trigger proactive, automated actions before warnings become critical. Static thresholds cover circumstances that can be anticipated in advance, while dynamic thresholds must be “learned” by observing what is normal for a given hour of the day or day of the week.
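As a rough illustration of how a dynamic threshold might be “learned”, the sketch below groups historical samples by day of the week and hour of the day, then flags values that stray too far above the norm for that slot. It is a minimal sketch and not the method of any particular product: the function names, the choice of mean plus a multiple of the standard deviation, and the 2-sigma and 3-sigma limits are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, stdev


def learn_baselines(samples):
    """Learn a (mean, stdev) baseline per (weekday, hour) slot.

    `samples` is an iterable of (datetime, value) pairs. Slots with fewer
    than two samples are skipped because a spread cannot be computed.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[(ts.weekday(), ts.hour)].append(value)
    return {
        slot: (mean(values), stdev(values))
        for slot, values in buckets.items()
        if len(values) >= 2
    }


def classify(baselines, ts, value, static_limit,
             warn_sigma=2.0, crit_sigma=3.0):
    """Return "ok", "warning", or "critical" for one new sample.

    The static limit is a fixed ceiling known in advance. The dynamic limits
    are derived from what was normal for this weekday and hour.
    The sigma multipliers are illustrative assumptions.
    """
    if value >= static_limit:
        return "critical"
    baseline = baselines.get((ts.weekday(), ts.hour))
    if baseline is None:
        return "ok"  # nothing learned yet for this slot
    avg, spread = baseline
    if value > avg + crit_sigma * spread:
        return "critical"
    if value > avg + warn_sigma * spread:
        return "warning"
    return "ok"


# Hypothetical history: CPU utilization (%) on four Mondays at 09:00.
history = [
    (datetime(2024, 1, 1, 9, 0), 50.0),
    (datetime(2024, 1, 8, 9, 0), 52.0),
    (datetime(2024, 1, 15, 9, 0), 48.0),
    (datetime(2024, 1, 22, 9, 0), 50.0),
]
baselines = learn_baselines(history)
status = classify(baselines, datetime(2024, 1, 29, 9, 30), 54.0,
                  static_limit=90.0)
print(status)
```

A "warning" result is the point at which an automated action could be triggered, before the value reaches the critical level.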
Second, once topology and dependency maps are established, outages can be resolved more quickly because alerts and graphical diagrams pinpoint the higher-order problem that is causing a cascade of events. Resolving the higher-order problem first is likely to be the fastest way to clear the lower-order ones.
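The idea of finding the higher-order problem can be sketched with a simple dependency map. The example below assumes the map is a dictionary from each component to the components it depends on, and treats an alerting component as a likely root cause when nothing it depends on, directly or indirectly, is also alerting. The component names and the `root_causes` helper are hypothetical, and the sketch assumes the dependency map contains no cycles.

```python
def root_causes(depends_on, alerting):
    """Return alerting components that have no alerting dependency.

    `depends_on` maps a component to the components it relies on.
    `alerting` is the set of components currently raising alerts.
    Assumes the dependency map is acyclic.
    """
    def has_alerting_dependency(node):
        seen = set()
        stack = list(depends_on.get(node, ()))
        while stack:
            dep = stack.pop()
            if dep in seen:
                continue
            seen.add(dep)
            if dep in alerting:
                return True
            stack.extend(depends_on.get(dep, ()))
        return False

    return sorted(n for n in alerting if not has_alerting_dependency(n))


# Hypothetical topology: each component lists what it depends on.
topology = {
    "web_portal": ["app_server"],
    "app_server": ["hana_db"],
    "reporting": ["hana_db"],
    "hana_db": ["storage_array"],
    "storage_array": [],
}

# Five alerts arrive in a cascade, but only one needs to be fixed first.
alerts = {"web_portal", "app_server", "reporting", "hana_db", "storage_array"}
first_to_fix = root_causes(topology, alerts)
print(first_to_fix)
```

Walking the dependencies transitively, rather than checking only direct neighbors, keeps the result sensible when an intermediate component is unmonitored and therefore silent.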
Third, a single console that shows an entire business service, with service levels from individual metrics up through the technology layers, highlights which component is degrading service and why. A single console also gives the entire organization one view of the truth on which to base efficient, coordinated action.
Product Screen 8: Graphical Outage Screen