CrowdStrike’s Approach to Better Machine Learning Evaluation Using Strategic Data Splitting

August 11, 2025

Engineering & Tech
  • “Leakage” in machine learning (ML) occurs when data that an ML model should not learn on is included at training time, often in unexpected ways.
  • This can cause overconfidence in ML model training results, producing cybersecurity ML models that fail to recognize threats.
  • CrowdStrike data scientists employ strategic data splitting during ML model training to prevent data leakage.

Since day one, CrowdStrike's mission has been to stop breaches. Our pioneering AI-native approach quickly set our platform apart from the landscape of legacy cybersecurity vendors that were heavily reliant on reactive, signature-based approaches for threat detection and response. 

Our use of patented models across the CrowdStrike Falcon® sensor and in the cloud enables us to quickly and proactively detect threats — even unknown or zero-day threats. This requires accurate threat prediction by the CrowdStrike Falcon® platform.

To achieve this critical requirement, CrowdStrike data scientists think carefully about how we train and evaluate our ML models. We train our models on datasets containing millions of cybersecurity events. These events can be structured in certain ways; they can have dependencies or similarities to one another. For example, we might collect multiple data points for a single malicious process tree and those data points will be closely related to one another, or we might collect malicious scripts that are extremely similar. 

Because of the kinds of relationships present in cybersecurity data, our domain requires us to carefully consider the ML concepts of train-test leakage and data splitting. When observations are not independent of one another, the data should be split in a way that does not cause overconfidence. Otherwise, we might believe our model handles malicious processes very well, even though it fails to recognize new threats when it encounters them.

In this post, we explain why CrowdStrike data scientists adopt strategic data splitting when training our ML models. Employing the strategic data splitting approach, as discussed below, will help to prevent train-test leakage in datasets with interdependent observations. This helps ensure more reliable model performance against novel threats in the wild.

From Random to Strategic Data Splitting

One tenet of ML is to split the data into train, validation, and test sets, or to perform cross-validation, where the data are partitioned into multiple iterations of training and testing. The model learns from the training data and is then evaluated on the validation/testing data. This allows us to have a reasonable expectation of real-world performance and select a winner from competing models.
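As a minimal illustration of this workflow, the sketch below uses scikit-learn with a synthetic dataset standing in for real telemetry. The dataset, model choice, and split sizes here are illustrative assumptions, not our production setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for a labeled dataset (not real security telemetry).
X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

# Hold out a test set, then estimate performance with 5-fold cross-validation
# on the training portion before confirming on the held-out data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(random_state=0)
cv_auc = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")

model.fit(X_train, y_train)
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"CV AUC: {cv_auc.mean():.3f}, held-out AUC: {test_auc:.3f}")
```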

A common statistical assumption is that observations in the data are independent. However, in real-world scenarios, data points often relate to each other. If we split non-independent data randomly, we can get train-test leakage, where the training data contains information it should not be expected to have. When correlated observations are mixed randomly into both the train and test sets, the model’s training data is dependent on the testing data in a way that may not be realistic for the model in production. Therefore, real-world performance for the ML model may not match what was seen in testing.

As an analogy for train-test leakage, imagine you’re evaluating a student in a class with a final test. In order to prepare them for the test, you give them a set of practice questions — the training set. If the practice questions are too closely related to the actual test questions (for example, only changing a few words in the question), the student might ace the test just by memorizing the practice questions. The student may have performed well, but we are overconfident in how much the student has learned because information leaked from the training set to the actual test — giving us an inflated view of their true knowledge.

Looking at this issue in a real-world, physical science scenario, data with these kinds of dependency structures can often be found in ecological data, as noted by Roberts et al. (2016). In this domain, it is common to observe autocorrelation in space or time, or dependency among observations from the same individuals or groups. 

For example, if data points are spatially autocorrelated (related by location), traditional random train-test splits can lead to misleading results. This happens because nearby locations share similar features, like climate, which can leak information between training and test sets. 

Therefore, a random split may inflate the performance estimate of the model. In fact, we may get data from an entirely new region at prediction time, but the data splitting method has led to overoptimism and overfitting. 

This problem is not limited to one domain. Kapoor and Narayanan (2023) describe how different types of data leakage have contributed to a reproducibility crisis in ML-based science across over 290 papers in 17 fields, due to the overoptimism that leakage produces.

Different modeling strategies for nonindependent data are possible — such as linear mixed models or time-series approaches — but many performant predictive models, including tree-based ensembles and neural networks, may not be designed to account for these dependency structures. 

We should then turn toward a more careful approach to data splitting. The Roberts et al. study recommends splitting the data into blocks, where each block groups together dependent data at some level. Each block is then assigned to a cross-validation fold. In the ecological data example, grouping nearby locations together as one block prevents data leakage and gives more accurate model performance estimates.
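A small sketch of this idea, using scikit-learn's GroupKFold (which we use again in our experiments below) on hypothetical toy data, shows how each block is kept intact within a single fold:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical toy data: 8 observations drawn from 4 blocks (e.g., locations).
X = np.arange(16).reshape(8, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])
blocks = np.array(["A", "A", "B", "B", "C", "C", "D", "D"])

# GroupKFold assigns each block to exactly one fold, so no block
# contributes observations to both the training and test sets.
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups=blocks):
    print("test block:", set(blocks[test_idx]), "train blocks:", set(blocks[train_idx]))
```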

There are also trade-offs here. It is possible that blocking the data may limit what is seen in predictor space — the possible feature values — which can decrease the model’s predictive capability. Some experiments below illustrate these concepts. 

CrowdStrike’s Solution to Data Leakage

One approach CrowdStrike takes to stop breaches is applying ML to detect malicious processes by their behaviors. However, observations from a process are correlated with other observations from that process — and with other processes from its process genealogy and machine of origin. We experimented with “blocking” by machine.

Figure 1. A machine has many different processes, and each process is part of its own genealogy. Processes in a machine are not independent, so we consider each machine a “block.”

Our experiments consisted of training tree-based ML models for binary classification. We used an experimental dataset containing observations of process behaviors, with each observation labeled as either malicious or non-malicious; the labels were reasonably balanced. Eighty percent of the data was used to run 1) blocked cross-validation (see scikit-learn’s GroupKFold) and 2) random cross-validation, each with five folds. The remaining 20% of the data was held out in a way we theorized to be realistic for the prediction context — across new blocks and with data later in time. We trained a final model on the original 80% of the data and evaluated it on the remaining 20% to get a realistic performance estimate. We used AUC (area under the ROC curve) as the performance metric, where a higher AUC is better. 
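The sketch below shows how such a comparison could be set up. The synthetic data, model, and details here are illustrative assumptions rather than our production pipeline; the key contrast is GroupKFold (blocked by machine) versus a shuffled KFold (random).

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import GroupKFold, KFold, cross_val_score

# Synthetic stand-in for process-behavior data: observations from the same
# machine share a common signal, so they are correlated within a block.
rng = np.random.default_rng(0)
machine_ids = np.repeat(np.arange(200), 10)
machine_effect = rng.normal(size=200)[machine_ids]
X = rng.normal(size=(machine_ids.size, 5)) + machine_effect[:, None]
y = (machine_effect + rng.normal(scale=0.5, size=machine_ids.size) > 0).astype(int)

model = HistGradientBoostingClassifier(random_state=0)

# Blocked CV: all observations from a machine stay in the same fold.
blocked_auc = cross_val_score(
    model, X, y, groups=machine_ids, cv=GroupKFold(n_splits=5), scoring="roc_auc"
)
# Random CV: a machine's observations can land in both train and test folds.
random_auc = cross_val_score(
    model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0), scoring="roc_auc"
)
print(f"blocked CV AUC: {blocked_auc.mean():.3f}, random CV AUC: {random_auc.mean():.3f}")
```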
Figure 2. AUC across five cross-validation folds plotted for the two split strategies. A final model was trained over all of the cross-validation data, tested on a more realistic test set, and shown as “Realistic AUC.” Points are jittered for clarity.

A few conclusions are apparent based on our results: 

  1. A purely random partition strategy overestimates performance
  2. Blocked cross-validation better estimates realistic performance
  3. Extrapolation across blocks is difficult

Our findings show how a blocked cross-validation approach can illuminate the structure of the data. If we used a random split, we would be overoptimistic about our ML model.

Along with overoptimism, there is also the potential of overfitting to the data. One method to avoid overfitting in some iterative ML models is early stopping, which attempts to stop model training before the point of overfitting. A validation-based early stopping rule halts the training process once performance stops improving on a validation set. Clearly, leakage into the validation set can cause problems here. 

With that in mind, we trained two iterative boosting models on our data with early stopping. Once the validation loss failed to improve on the minimum for 20 rounds, training ceased. Eighty percent of the data was used for training and 10% for validation, split either randomly or by block, as in the cross-validation procedure above. Otherwise, the models were identical.
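A hedged sketch of this comparison is shown below, using LightGBM's early stopping callback on synthetic grouped data. The library, synthetic data, and split fractions are illustrative assumptions (the final held-out test portion is omitted for brevity); only the 20-round patience mirrors the experiment described above.

```python
import lightgbm as lgb
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, train_test_split

# Synthetic grouped data: observations from the same machine share a signal.
rng = np.random.default_rng(0)
machine_ids = np.repeat(np.arange(300), 20)
machine_effect = rng.normal(size=300)[machine_ids]
X = rng.normal(size=(machine_ids.size, 10)) + machine_effect[:, None]
y = (machine_effect + rng.normal(scale=0.5, size=machine_ids.size) > 0).astype(int)

def fit_with_early_stopping(train_idx, val_idx):
    """Train a boosting model, stopping when validation loss stalls for 20 rounds."""
    model = lgb.LGBMClassifier(n_estimators=1000)
    model.fit(
        X[train_idx], y[train_idx],
        eval_set=[(X[val_idx], y[val_idx])],
        eval_metric="binary_logloss",
        callbacks=[lgb.early_stopping(stopping_rounds=20, verbose=False)],
    )
    return model.best_iteration_

# Blocked validation split: whole machines go to either training or validation.
blocked_train, blocked_val = next(
    GroupShuffleSplit(n_splits=1, test_size=0.1, random_state=0).split(X, y, groups=machine_ids)
)
# Random validation split: a machine's observations can land in both sets.
random_train, random_val = train_test_split(np.arange(y.size), test_size=0.1, random_state=0)

print("blocked split stops at iteration:", fit_with_early_stopping(blocked_train, blocked_val))
print("random split stops at iteration:", fit_with_early_stopping(random_train, random_val))
```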

Figure 3. Logistic loss for the validation set plotted over boosting iteration for the two split strategies. A lower loss is better. The blocked split early stops at iteration 198.

We observed that the model trained with a random split did not stop early before 1,000 rounds. This suggests the randomly split validation set is correlated with the training set, so the validation loss continues improving across iterations. It is possible the model trained this way was overfit to the training data. 

However, the randomly split model performed better on the final 10% of the data — a realistic test set — with AUC 0.966 vs. 0.948, so we might ultimately use a random split. One possibility for the better performance of the random split model is that blocking can limit what we see in predictor space while training, leading to worse overall performance. Perhaps the blocks are too dissimilar, and using a blocked split has actually underfit the model. These trade-offs should be taken into account in accordance with our goal: catching threats. 

Building Better Machine Learning Threat Predictions

In typical ML workflows, data scientists gather and clean data, conduct exploratory analyses, consider what approaches might work, then finally train and evaluate models. Each of these steps requires careful consideration of the underlying data. In particular, practitioners should be careful about their data partitioning and evaluation strategies. 

At CrowdStrike, we continuously evaluate our models carefully, in order to understand them and pick the best ones. By making our analyses rigorous, we get better threat predictions.
