What incidents are

Incidents are events that indicate I/O performance issues on a volume workload caused by contention on a cluster component.

Performance Manager monitors the I/O response time and operations for volumes on a cluster. When workloads overuse a cluster component, the component is in contention and cannot perform at an optimal level to meet workload demands. Other workloads that use the same component might be impacted, causing their response time to increase. If the response time crosses the performance threshold, Performance Manager triggers an incident alert to notify you.

Incident analysis

When the response time for a volume workload crosses the performance threshold of the expected range, Performance Manager analyzes the previous 15 days of performance statistics and does the following:
  • Identifies the victim workloads whose performance has decreased. Victims are workloads that perform more than 10 operations per second and have a response time greater than 5 milliseconds.
  • Identifies the cluster component in contention.
  • Identifies the bully workloads that are overusing the cluster component and causing it to be in contention.
  • Ranks the workloads involved, based on their deviation in utilization or activity of the cluster component, to determine which bullies have the highest change in usage of the component and which victims are the most impacted.
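
The victim criteria and the ranking step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Performance Manager's actual implementation: the Workload type, its field names, and the single deviation number per workload are hypothetical, and only the two thresholds (10 operations per second, 5 milliseconds) come from the text.

```python
from dataclasses import dataclass

# Thresholds quoted in the text above; the constant names are hypothetical.
MIN_OPS_PER_SEC = 10.0
MIN_RESPONSE_MS = 5.0


@dataclass
class Workload:
    """Hypothetical record for one volume workload."""
    name: str
    ops_per_sec: float
    response_ms: float
    deviation: float  # assumed: change in component usage or activity


def find_victims(workloads):
    """Keep workloads above both thresholds (more than 10 ops/s, more than 5 ms)."""
    return [
        w for w in workloads
        if w.ops_per_sec > MIN_OPS_PER_SEC and w.response_ms > MIN_RESPONSE_MS
    ]


def rank_by_deviation(workloads):
    """Order workloads from largest to smallest deviation."""
    return sorted(workloads, key=lambda w: w.deviation, reverse=True)


workloads = [
    Workload("vol_a", ops_per_sec=250.0, response_ms=12.0, deviation=0.8),
    Workload("vol_b", ops_per_sec=4.0, response_ms=30.0, deviation=0.9),
    Workload("vol_c", ops_per_sec=90.0, response_ms=7.5, deviation=1.4),
]
victims = rank_by_deviation(find_victims(workloads))
print([w.name for w in victims])
```

In this sketch vol_b is excluded because it performs fewer than 10 operations per second, even though its response time is high.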

An incident might occur only briefly and then correct itself once the cluster component involved is no longer in contention. A continuous incident is one that recurs for the same cluster component within a 5-minute interval and remains in the new state. Unresolved incidents, which have a state of new, can display different description messages as the workloads involved in the incident change.
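
One way to read the definition of a continuous incident is shown in the following Python sketch. It is an interpretation, not the product's logic: the function name, its parameters, and the idea of comparing consecutive detection times for one cluster component are assumptions. Only the 5-minute interval and the requirement that the incident remain in the new state come from the text.

```python
from datetime import datetime, timedelta

# Interval quoted in the text above; the constant name is hypothetical.
RECURRENCE_WINDOW = timedelta(minutes=5)


def is_continuous(detections, still_new, window=RECURRENCE_WINDOW):
    """Return True if the incident is still new and two consecutive
    detections for the same component fall within `window`.

    `detections` is an assumed list of datetime values, one per time the
    incident was detected for a single cluster component.
    """
    if not still_new:
        return False
    ordered = sorted(detections)
    return any(
        later - earlier <= window
        for earlier, later in zip(ordered, ordered[1:])
    )


t0 = datetime(2024, 1, 1, 12, 0)
print(is_continuous([t0, t0 + timedelta(minutes=4)], still_new=True))
print(is_continuous([t0, t0 + timedelta(minutes=9)], still_new=True))
```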

When an incident is resolved, it remains available in Performance Manager as part of the record of past performance issues for a volume. Each incident has a unique ID that identifies the incident type and the volumes, cluster, and cluster components involved.
Note: A single volume can be involved in more than one incident at the same time.

Incident state

Incidents can be in one of the following states.
  • New means that the incident has not corrected itself, or has not been resolved, and the I/O response time of the impacted workloads remains above the performance threshold of the expected range.
  • Obsolete means that the incident has corrected itself, or has been resolved, and the I/O response time of the impacted workloads is no longer above the performance threshold of the expected range. A user might have made a change to the cluster to resolve the incident, or the incident might have corrected itself when the response time returned to the expected range.
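
The two states can be expressed as a simple comparison against the threshold, as in the Python sketch below. It assumes that a single response-time value and a single threshold value are available for the impacted workloads. The enum, the function name, and the parameters are hypothetical and are not part of any Performance Manager interface.

```python
from enum import Enum


class IncidentState(Enum):
    """The two incident states described above."""
    NEW = "new"
    OBSOLETE = "obsolete"


def incident_state(response_ms, threshold_ms):
    """Return NEW while the response time is above the threshold,
    otherwise OBSOLETE (hypothetical helper)."""
    if response_ms > threshold_ms:
        return IncidentState.NEW
    return IncidentState.OBSOLETE


print(incident_state(response_ms=18.0, threshold_ms=12.0).value)
print(incident_state(response_ms=9.0, threshold_ms=12.0).value)
```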

Incident notification

Incident alerts are displayed on the Dashboard and the Volume Details page, and are sent to specified email addresses. If you have configured OnCommand Unified Manager to receive incident alerts from Performance Manager, the incidents are also displayed on the Unified Manager Dashboard. You can view detailed analysis information about an incident and get suggestions for resolving it on the Incident Details page.

Single incident on Response Time chart in Performance Manager

In the example above, an incident is indicated by a red dot (Performance Manager incident icon) on the Response Time chart on the Volume Details page. Pointing at the red dot displays a popup with more details about the incident and options for analyzing it. If an incident has remained in the new state, it is ongoing, or continuous, and the blue line for the actual response time displays red for the duration of the incident.