
Outlier Identification in Continuous Data Streams With Z-Score and Modified Z-Score in a Moving Window

How do you spot outliers in continuous data streams? Learn to catch network performance hiccups effortlessly using z-score and modified z-score.

By Maksim Kupriianov · Jul. 31, 24 · Analysis

My name is Maksim Kupriianov, and for the past few years, I have been actively involved in network monitoring. This means sending network probes from different locations within an organization’s network, analyzing the responses with regard to packet loss percentage and response times, and identifying places in the network where something has gone wrong. The probes are sent and the results are analyzed continuously, with new data coming in every second. The analytics are conducted on a sliding window of several tens of seconds.

During this process, local anomalies often arise. For instance, a server in one of the racks might start showing increased packet loss due to the server's high CPU utilization. On the one hand, this is already a problem that should be highlighted, but on the other hand, it is not caused by any network equipment failure and doesn’t affect network performance in general. A similar situation occurs with the response times: for various reasons, responses could be delayed even by seconds, whereas typically, the round-trip time of the requests does not exceed a few milliseconds.

How can we automatically detect such isolated anomalies and exclude them from the overall analysis? Many methods exist for this purpose, including Kalman filters, clustering, and machine learning algorithms. However, in this article, I want to focus on one of the methods of standardized evaluation: z-score filtering. The reason is that it is very simple to understand and implement, cheap in terms of additional CPU and memory load, and quite effective.

Z-Score

A z-score, also known as a standard score, is a metric that quantifies how far an observation deviates from the mean of the population. Essentially, it indicates the number of standard deviations below or above the mean that an observation falls. It is calculated for each observation using the formula:

z = (xᵢ − μ) / σ

Where:

  • xi is the individual data point
  • μ is the mean (average) of the dataset
  • σ is the standard deviation of the dataset

Let's look at an example. Consider a set of measured network latencies between two groups of servers located in the same building:

Latency, µs: 25, 26, 26, 26, 26, 26, 27, 27, 27, 28, 28, 28, 29, 30, 30, 32, 35, 40, 52, 97

If we calculate the standard scores for these values, we get the following results:

Latency, µs    Z-Score    Outlier?
25             -0.520     no
26             -0.457     no
26             -0.457     no
26             -0.457     no
26             -0.457     no
26             -0.457     no
27             -0.394     no
27             -0.394     no
27             -0.394     no
28             -0.331     no
28             -0.331     no
28             -0.331     no
29             -0.268     no
30             -0.205     no
30             -0.205     no
32             -0.079     no
35              0.110     no
40              0.425     no
52              1.182     no
97              4.018     YES

In this example, a typical threshold value of 3 was used to identify anomalies. Clearly, the method detected the anomaly only partially: just the single most extreme value was flagged. This happens because the distribution of latencies is heavily skewed to the right: a latency significantly below the median is physically impossible, while delays caused by network queues and similar issues, which produce much higher values, are common.
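To make this concrete, here is a minimal Python sketch (illustrative, not the author's code) that computes z-scores for the latency sample above and flags values beyond the threshold of 3. Note that the scores in the table correspond to the population standard deviation (`statistics.pstdev`).

```python
from statistics import mean, pstdev

def z_scores(values):
    """Return the z-score of every value in the sample."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation, as in the table
    return [(x - mu) / sigma for x in values]

latencies = [25, 26, 26, 26, 26, 26, 27, 27, 27, 28,
             28, 28, 29, 30, 30, 32, 35, 40, 52, 97]
zs = z_scores(latencies)
outliers = [x for x, z in zip(latencies, zs) if abs(z) > 3]
```

Only the 97 µs measurement exceeds the threshold; the 52 µs and 40 µs delays slip through.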

The advantage of the z-score technique lies in its effectiveness and exceptional convenience when used in sliding time windows, requiring only three parameters for calculation:

  • SumX = Σ xᵢ, the sum of all values in the window
  • SsqX = Σ xᵢ², the sum of the squares of all values in the window
  • N, the number of values in the window

Thus, as new data arrives, we simply increment these counters, and as old data becomes obsolete, we decrement them accordingly. The standard score for each element Xi can be calculated using the following formula:

z = (xᵢ − SumX/N) / √((SsqX − SumX²/N) / (N − 1))
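The incremental bookkeeping can be sketched as follows (an assumed implementation; the class and method names are my own, and the window is sized in samples for simplicity):

```python
import math
from collections import deque

class RollingZScore:
    """Score each incoming value against a moving window in O(1) time
    by maintaining SumX, SsqX and N incrementally."""

    def __init__(self, window_size):
        self.window = deque()
        self.window_size = window_size
        self.sum_x = 0.0   # SumX: sum of values currently in the window
        self.ssq_x = 0.0   # SsqX: sum of squared values in the window

    def score(self, x):
        """Add x to the window and return its z-score.
        Returns None until the window holds at least two values."""
        self.window.append(x)
        self.sum_x += x
        self.ssq_x += x * x
        if len(self.window) > self.window_size:
            old = self.window.popleft()  # expire the oldest value
            self.sum_x -= old
            self.ssq_x -= old * old
        n = len(self.window)
        if n < 2:
            return None
        mean = self.sum_x / n
        # Guard against tiny negative values from floating-point error.
        var = max(self.ssq_x - self.sum_x ** 2 / n, 0.0) / (n - 1)
        if var == 0:
            return 0.0
        return (x - mean) / math.sqrt(var)
```

Because this formula uses the sample (N − 1) variance, its scores differ slightly from the population-based values shown in the table above.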

Modified Z-Score

But can we do better? In our case, yes, and the modified z-score method will help with this. It performs better, especially on skewed datasets. The modified score is calculated as follows:

zmod = 0.6745 · (xᵢ − M) / MAD

Where

  • xᵢ, the original sample element
  • M, the sample median
  • MAD = median(|xᵢ − M|), the median absolute deviation of the dataset

In cases where the median absolute deviation equals zero, another formula is used.

zmod = (xᵢ − M) / (1.253314 · μAD)

Where

  • μAD = Σ|xᵢ − M| / N, the mean (or average) absolute deviation

Latency, µs    Modified Z-Score    Outlier?
25             -1.009              no
26             -0.673              no
26             -0.673              no
26             -0.673              no
26             -0.673              no
26             -0.673              no
27             -0.336              no
27             -0.336              no
27             -0.336              no
28              0.000              no
28              0.000              no
28              0.000              no
29              0.336              no
30              0.673              no
30              0.673              no
32              1.346              no
35              2.355              no
40              4.038              YES
52              8.075              YES
97             23.217              YES

A threshold value of 3.5 is recommended for use with the modified z-score. In our example, employing this threshold successfully identified all three anomalous results. However, in practice, it is advisable to initially use higher threshold values, such as 5.
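The modified score with its mean-absolute-deviation fallback can be sketched like this (illustrative Python; the exact scores may differ from the table in the final digits depending on the precision used for the 0.6745 constant):

```python
from statistics import median

def modified_z_scores(values):
    """Modified z-score of each value, falling back to the mean
    absolute deviation when the MAD is zero."""
    m = median(values)
    abs_dev = [abs(x - m) for x in values]
    mad = median(abs_dev)
    if mad != 0:
        return [0.6745 * (x - m) / mad for x in values]
    # MAD is zero when more than half the values equal the median:
    # use the mean absolute deviation instead.
    mean_ad = sum(abs_dev) / len(values)
    return [(x - m) / (1.253314 * mean_ad) for x in values]
```

Applied to the latency sample with a threshold of 3.5, this flags exactly the three largest values (40, 52, and 97 µs).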

One significant drawback of the modified z-score algorithm is its requirement to recalculate the median (and median absolute deviation) for each incoming value. This necessitates maintaining all values in the moving window in sorted order and performing calculations anew each time, which can be challenging and resource-intensive for large datasets.

This challenge can be mitigated by combining two kinds of windows. To distinguish between them:

  • Moving time window: the span of data being analyzed, covering a fixed duration such as ten seconds.
  • Sliding time window: the discrete step by which that span advances over the data, for example every five seconds.

[Figure: a ten-second moving time window advancing over the data stream in five-second sliding-time-window steps]
For instance, using a moving time window of ten seconds alongside a sliding window of five seconds allows for recalculating the median and its absolute deviation for the previous interval each time the sliding window advances. This approach enables analysis over the moving window while maintaining stability in the dataset over shorter periods of time.
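One way to sketch this combination (hypothetical code, not the author's production implementation; the slide interval is counted in samples rather than seconds for simplicity, and the MAD-is-zero fallback is omitted):

```python
from statistics import median

class WindowedModifiedZScore:
    """Score incoming values against a median and MAD that are
    recalculated only when the sliding window advances."""

    def __init__(self, slide_interval):
        self.slide_interval = slide_interval  # samples per slide step
        self.buffer = []   # samples collected since the last refresh
        self.m = None      # cached median of the previous interval
        self.mad = None    # cached median absolute deviation

    def score(self, x):
        """Return the modified z-score of x against the cached
        statistics; None until the first interval has been collected
        (or if the cached MAD is zero)."""
        z = None
        if self.m is not None and self.mad:
            z = 0.6745 * (x - self.m) / self.mad
        self.buffer.append(x)
        if len(self.buffer) >= self.slide_interval:
            # The window advanced: refresh the cached median and MAD.
            self.m = median(self.buffer)
            self.mad = median(abs(v - self.m) for v in self.buffer)
            self.buffer.clear()
        return z
```

Each value is scored against the statistics of the previous interval, so the expensive median/MAD recalculation happens once per slide step instead of once per sample.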

While this method may occasionally lead to false negatives and the exclusion of some trend-setting values, these effects are typically short-lived as subsequent sliding windows adjust accordingly.

Conclusion

The z-score method of standardized evaluation performs well at filtering occasional anomalous outliers out of a continuous stream of measurements, and it remains cheap even when applied within a sliding-window framework. However, its effectiveness suffers on datasets that are significantly skewed or otherwise depart from a normal distribution.

In such instances, the modified z-score method proves more suitable due to its capability to handle skewed data distributions effectively. Nevertheless, applying this method to continuous data streams presents challenges, as recalculating the median absolute deviation with each new sample necessitates accessing and potentially sorting all values. Despite these challenges, recalculating at smaller intervals may be feasible in scenarios where the sample median remains stable.

Tags: Machine learning, Network monitoring, Data (computing), Stream (computing), Algorithm

Opinions expressed by DZone contributors are their own.
