Anomaly Detection with NumPy Rolling Z-Score
In data stream engineering, sensor monitoring, and algorithmic trading, detecting sudden variations and spikes (anomalies) as they occur is vital. A static threshold fails when the overall baseline drifts. Here is how to implement dynamic anomaly detection using a rolling z-score in NumPy.
What is a Rolling Z-Score?
A standard z-score measures how many standard deviations a data point lies away from the global average. However, in live data streams, the average value changes over time.
By calculating a **rolling z-score**, we measure the deviation of the most recent element in a window relative to the local average and deviation of only that window. This makes it highly sensitive to local spikes and safe from global baseline shifts.
Mathematical Definition: For a sliding window of size W, the z-score of the last value is calculated as:
Where μlocal and σlocal are the rolling mean and standard deviation computed over the current window.
Python Implementation Guide
Here is a complete, copy-pasteable script using rollit.zscore to flag anomalies where the absolute z-score value exceeds a threshold of 2.5:
import numpy as np
import rollit
# 1. Generate a mock streaming dataset with a sudden spike (anomaly)
# index 6 (150.0) is a clear outlier compared to the local average
data = np.array([100.0, 101.0, 99.5, 100.2, 101.1, 100.8, 150.0, 100.5, 99.8])
# 2. Compute rolling z-score using a window size of 5
zscores = rollit.zscore(data, window=5)
# 3. Detect anomalies where |Z-Score| > 2.5
# We align the output to the original indices (offsetting by window - 1)
threshold = 2.5
anomalies = np.where(np.abs(zscores) > threshold)[0]
# Output the results
print("Raw Stream Data:", data)
print("Rolling Z-Scores:", np.round(zscores, 2))
print("Anomalies detected at indices:", anomalies + 4) # Adjust for window size - 1
# Output: Anomalies detected at indices: [6]Why choose rollit for live feeds?
In production applications listening to WebSockets or message brokers, low latency is critical. Pandas introduces latency during series initialization. rollit.zscore calculates results directly on flat arrays at C-speed, keeping your pipeline running under sub-millisecond response rates.