Biblio
Complex systems are prevalent in many fields such as finance, security and industry. A fundamental problem in system management is to perform diagnosis in case of system failure such that the causal anomalies, i.e., root causes, can be identified for system debugging and repair. Recently, invariant network has proven a powerful tool in characterizing complex system behaviors. In an invariant network, a node represents a system component, and an edge indicates a stable interaction between two components. Recent approaches have shown that by modeling fault propagation in the invariant network, causal anomalies can be effectively discovered. Despite their success, the existing methods have a major limitation: they typically assume there is only a single and global fault propagation in the entire network. However, in real-world large-scale complex systems, it's more common for multiple fault propagations to grow simultaneously and locally within different node clusters and jointly define the system failure status. Inspired by this key observation, we propose a two-phase framework to identify and rank causal anomalies. In the first phase, a probabilistic clustering is performed to uncover impaired node clusters in the invariant network. Then, in the second phase, a low-rank network diffusion model is designed to backtrack causal anomalies in different impaired clusters. Extensive experimental results on real-life datasets demonstrate the effectiveness of our method.