Gaining visibility into connected IoT devices is crucial for effective network segmentation and for assessing the cyber risks these assets pose. As a result, network operators and Internet Service Providers (ISPs) are working towards a real-time registry that identifies IoT device types and tracks their counts as they connect to the network.
The good news is that IoT devices typically exhibit predictable network traffic patterns, which makes it possible to identify them by analysing their traffic. In addition to deterministic methods, such as examining network services and connection endpoints, a well-established approach is to use Machine Learning (ML) models to infer device types from the traffic these devices generate. Various techniques operating on network traffic at different granularities have been explored, with promising results. However, several challenges still need to be addressed before a reliable traffic inference system can be achieved.
This post, based on our poster presented at SIGCOMM 2024, will look at one of these challenges.
Read: How to detect IoT devices in a network
You won't be fit to predict forever!
One significant challenge with ML-based network traffic inference is the gradual decline in model performance over time. This occurs due to changes in the underlying data distribution, a phenomenon referred to as concept drift in ML literature. As a result, the patterns learned from the training data become outdated and less effective.
If it is the staleness of training data at deployment time that causes performance degradation, why not periodically update the model using fresh data? The challenge is that retraining ML models typically requires labelled traffic data, which is time-consuming and operationally expensive to collect. A more practical approach is therefore preferred: determining when model updates are essential - or will soon become necessary - and acting only in those cases. Ideally, such decisions should rely minimally, if at all, on labelled data instances. This approach is called semi-supervised or unsupervised drift detection, depending on the availability of ground-truth labels.
How difficult is it to detect the change?
A natural approach to unsupervised concept drift detection is to monitor distributional changes in the model's input features. While this method can be useful, it also presents its own challenges that need to be addressed effectively.
Firstly, modelling the joint distribution of all network traffic features is extremely data-intensive due to the high dimensionality of the data. This issue, commonly referred to as the Curse of Dimensionality, can be mitigated through dimensionality reduction or feature selection techniques. In addition, the lack of ground-truth labels makes unsupervised drift detection inherently challenging. In some cases, significant changes in the distribution of unlabelled data may not affect the model's performance; in other cases, the distribution might appear unchanged while the model's performance is negatively impacted.
This concept is illustrated in Figure 1. Although the distributions in A and D are equivalent after masking the ground-truth labels (C), a drift between these settings would render the classifier (dashed line) completely ineffective. In contrast, B illustrates a drift from A that can be detected without ground-truth labels. However, this type of drift does not negatively impact the classifier's accuracy.
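As a hypothetical illustration of the dimensionality-reduction route mentioned above (not the approach used later in this post), the traffic features could be projected onto a few principal components and each component monitored with a simple two-sample test. In the Python sketch below, X_ref (a reference window of flow features), X_new (a recent window), the number of components, and the significance level are all assumed placeholders.

```python
from scipy.stats import ks_2samp
from sklearn.decomposition import PCA

def drifted_components(X_ref, X_new, n_components=3, alpha=0.01):
    """Flag principal components whose distribution differs between two windows."""
    # Reduce dimensionality by fitting PCA on the reference window only
    pca = PCA(n_components=n_components).fit(X_ref)
    ref_proj = pca.transform(X_ref)
    new_proj = pca.transform(X_new)
    # Two-sample Kolmogorov-Smirnov test per component as a simple drift signal
    return [
        i for i in range(n_components)
        if ks_2samp(ref_proj[:, i], new_proj[:, i]).pvalue < alpha
    ]
```

Note that, as discussed above, a positive signal from such a test does not necessarily mean the classifier's performance has degraded, and a negative signal does not guarantee it has not.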
A brief case study
By analysing a subset of a large IoT testbed dataset provided by a major ISP, we show that changes in the distribution of network traffic can - though do not always - lead to performance degradation in trained classifiers. The data we used consists of metadata from 200,000 outgoing TCP/80 flows of four different IoT devices, collected between September 2020 and December 2023. A Bagged Trees model with 100 trees was trained on data from September 2019 to August 2020. To reduce the risk of overfitting, we configured the model so that each tree leaf contains at least 1% of the training instances of the least frequent class.
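As a rough sketch only (not our exact training pipeline), a bagged-trees model with this leaf-size constraint could be set up in scikit-learn along the following lines; X_train and y_train are placeholder arrays of flow features and device labels.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

def train_bagged_trees(X_train, y_train, n_trees=100):
    # Minimum leaf size: 1% of the training instances of the least frequent class
    _, class_counts = np.unique(y_train, return_counts=True)
    min_leaf = max(1, int(0.01 * class_counts.min()))

    # Bagging over decision trees; the 'estimator' argument requires scikit-learn >= 1.2
    model = BaggingClassifier(
        estimator=DecisionTreeClassifier(min_samples_leaf=min_leaf),
        n_estimators=n_trees,
        random_state=0,
    )
    return model.fit(X_train, y_train)
```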
To study distributional changes in data, we selected the feature with the highest impurity-based importance - maxPacketSize, which represents the size of the largest payload carried by packets in a flow - from our 22 features. We estimate the Probability Density Function (PDF) of this feature over the entire training period, as well as in 1-month time windows, using Gaussian kernel density estimation.
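A minimal sketch of this per-window density estimation, using SciPy's Gaussian kernel density estimator, is given below; the DataFrame df and the column names 'timestamp' and 'maxPacketSize' are illustrative placeholders. The PDF over the entire training period is obtained the same way, by fitting the estimator to all training-period values of the feature.

```python
import pandas as pd
from scipy.stats import gaussian_kde

def monthly_pdfs(df, feature="maxPacketSize"):
    """Estimate one PDF per calendar month for the chosen feature."""
    pdfs = {}
    for month, group in df.groupby(pd.Grouper(key="timestamp", freq="MS")):
        values = group[feature].dropna().to_numpy()
        if len(values) > 1 and values.std() > 0:    # KDE needs non-degenerate data
            pdfs[month] = gaussian_kde(values)      # Gaussian kernel density estimate
    return pdfs
```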
Then, we use the Hellinger distance between the estimated PDFs as a measure of concept drift. Our goal is to investigate the relationship between concept drift and model performance, which we define in terms of prediction accuracy. Figure 2 and Figure 3 show the accuracy (green) and the measure of drift (red) for instances of two classes, which we denote by C1 and C2, respectively. For class C1 (Figure 2), a degradation in the model's prediction accuracy coincides with concept drift - performance drops in November 2022 as the Hellinger distance increases and recovers in October 2023 as it decreases. However, for class C2 (Figure 3), the performance does not degrade despite a significant increase in the Hellinger distance over time.
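For reference, the Hellinger distance between two density estimates can be computed numerically on a grid, for example as in the sketch below; the feature range (0 to 1,500 bytes) and grid size are illustrative assumptions. A value near 0 indicates near-identical distributions, while a value near 1 indicates little overlap between them.

```python
import numpy as np

def hellinger_distance(pdf_ref, pdf_win, lo=0.0, hi=1500.0, n=2000):
    """Hellinger distance between two PDFs (e.g. gaussian_kde objects), in [0, 1]."""
    x = np.linspace(lo, hi, n)
    dx = x[1] - x[0]
    p = np.clip(pdf_ref(x), 0.0, None)
    q = np.clip(pdf_win(x), 0.0, None)
    p /= p.sum() * dx                           # normalise over the grid
    q /= q.sum() * dx
    bc = np.sum(np.sqrt(p * q)) * dx            # Bhattacharyya coefficient
    return float(np.sqrt(max(0.0, 1.0 - bc)))   # H = sqrt(1 - BC)
```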
To explain this inconsistent behaviour, we examined the model's decision boundaries. Figure 4 shows the ranges of maxPacketSize values for the regions labelled as C1 (blue shade) and C2 (red shade), along with mean values of the feature for both classes, accompanied by 1-standard deviation error bars.
For class C1, the maxPacketSize values change between November 2022 and August 2023 so that they overlap with the decision region of the other class, leading to a drop in accuracy for C1. In contrast, for class C2, although the maxPacketSize values also change, they remain within the decision region of class C2, so the model's accuracy for this class is unaffected.
In conclusion
In this post, we briefly discussed concept drift in network traffic, with a focus on the IoT classification task. We highlighted some of the complexities involved in detecting concept drift, such as the Curse of Dimensionality and limited access to ground-truth labels. We also presented a case study illustrating changes in network traffic data and the varying impact these changes can have on classifier performance.
Shayan Azizi is a PhD Candidate at UNSW Sydney, researching ML-based network visibility under drifting concepts.
The views expressed by the authors of this blog are their own and do not necessarily reflect the views of APNIC. Please note a Code of Conduct applies to this blog.