How CrowdStrike Uses SHAP to Enhance Machine Learning Models


At CrowdStrike®, machine learning is a major tool for detecting new malware families and keeping our customers safe. We use gradient boosted trees with thousands of features to classify a file sample as malware or clean. This model offers strong predictive power and a high level of accuracy, but the tradeoff is that its complexity makes it challenging to understand how it arrives at a prediction.

CrowdStrike uses SHAP, a Python package that implements Shapley value theory, to enhance our machine learning technology and increase the effectiveness of the CrowdStrike Falcon® platform’s threat detection capabilities. The following outlines how this approach works and the benefits of using SHAP.

Holistic Approach to Value Theory

SHAP offers a holistic way to quantify, for every sample, how much each feature value moves the prediction away from the average prediction. In the context of malware detection at CrowdStrike, the Shapley value for a sample’s feature indicates the following:

  • Whether the feature makes the file “cleaner” (blue) or “dirtier” (red) depending on the sign of the SHAP value (- for clean and + for dirty)
  • How significant the contribution was (by the magnitude of the value)

Adding the Shapley values of all features to the average prediction recovers the model’s output for that sample, so you can see which side of the threshold the sample lies on: clean or dirty. This way, you are able to examine individual files and determine the dirty and clean forces that push the prediction one way or the other (Figure 1).


Figure 1. Force plot of Shapley values illustrating how each feature contributes to a model prediction.
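The additive behavior described above can be illustrated with a brute-force sketch. This is not the SHAP library's tree algorithm and not CrowdStrike's model: it enumerates every feature coalition for a tiny, made-up scoring function, which is only practical for a handful of features. The scorer `toy_score`, the three feature names, the baseline values and the threshold are all assumptions chosen for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, sample, baseline):
    """Exact Shapley values for one sample by enumerating all coalitions.

    A feature that is absent from a coalition takes its baseline value.
    """
    n = len(sample)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Classic Shapley weight for a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                present = set(coalition)
                without_i = [
                    sample[j] if j in present else baseline[j] for j in range(n)
                ]
                with_i = list(without_i)
                with_i[i] = sample[i]
                values[i] += weight * (predict(with_i) - predict(without_i))
    return values


# Hypothetical scorer standing in for a real model (higher = dirtier).
def toy_score(x):
    entropy, is_signed, suspicious_imports = x
    return 2.0 * entropy - 1.5 * is_signed + 0.5 * entropy * suspicious_imports


feature_names = ["entropy", "is_signed", "suspicious_imports"]  # made up
baseline = [0.5, 0.0, 0.0]  # assumed "average" file
sample = [0.9, 1.0, 4.0]    # assumed file under inspection
threshold = 2.0             # assumed decision threshold

base_value = toy_score(baseline)
phi = shapley_values(toy_score, sample, baseline)
total = base_value + sum(phi)

for name, value in zip(feature_names, phi):
    direction = "dirtier" if value > 0 else "cleaner"
    print(f"{name}: {value:+.2f} ({direction})")
print("verdict:", "dirty" if total > threshold else "clean")
```

In this toy case the signature feature pulls the score toward clean while the other two push it toward dirty, and the base value plus the sum of the contributions equals the scorer's output for the sample.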

SHAP Aids Feature Engineering

There are many useful applications of the SHAP tool for internal projects. For example, internal teams are looking into how SHAP can help with feature engineering. We want to make sure we have many layers of protection against all types of malware. One layer of defense is using expertise from our security analysts to engineer specific features when a new malware family emerges. Once these candidate features have been crafted, we want to make sure they capture the essence of the family.

After training the model with the new candidate features, you can use SHAP to find out how significant those new features are to a subset of files known to be from that family. This gives feature engineers immediate feedback on whether these features are working.


Figure 2. Bar charts of the top contributing features to the predictions of a subset of files using Shapley values as computed from the SHAP library.

The image on the left in Figure 2 shows the top features contributing to the predictions for a random collection of “AutoIt” family files, and two AutoIt features appear in that summary plot. This contrasts with the summary plot on the right, which covers a random subset of samples. There, a checksum feature is the most important feature, and no features correspond to a particular group or family (DotNet, AutoIt, etc.).

This highlights the purpose of subset analysis. From analyzing the summary plots of subsets, we can verify that candidate features are working, and also determine what other features are significant in the classification of those samples.
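One way to sketch this kind of subset comparison is to rank features by their mean absolute Shapley value within each subset. The example below assumes the Shapley values have already been computed (one row per file, one column per feature). The feature names and numbers are invented for illustration and are not taken from Figure 2.

```python
def top_features(shap_rows, feature_names, k=3):
    """Rank features by mean absolute Shapley value across a subset of files."""
    n_samples = len(shap_rows)
    mean_abs = [
        sum(abs(row[j]) for row in shap_rows) / n_samples
        for j in range(len(feature_names))
    ]
    ranked = sorted(
        zip(feature_names, mean_abs), key=lambda pair: pair[1], reverse=True
    )
    return ranked[:k]


# Hypothetical feature names and Shapley values, for illustration only.
feature_names = [
    "autoit_script_marker",
    "autoit_compiled_header",
    "checksum_valid",
    "section_entropy",
]
family_rows = [  # files assumed to belong to the family of interest
    [0.9, 0.7, 0.2, 0.1],
    [1.1, 0.5, -0.3, 0.2],
    [0.8, 0.6, 0.1, -0.1],
]
random_rows = [  # files drawn at random from the whole corpus
    [0.0, 0.0, 0.6, 0.3],
    [0.0, 0.0, -0.8, 0.2],
    [0.1, 0.0, 0.7, -0.4],
]

family_top = top_features(family_rows, feature_names, k=2)
random_top = top_features(random_rows, feature_names, k=2)
print("family subset:", [name for name, _ in family_top])
print("random subset:", [name for name, _ in random_top])
```

If the candidate features are doing their job, they should rank near the top for the family subset and fall away for the random subset, which is the pattern the two summary plots are compared for.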

Beyond confirming that new features are working, this analysis also lets us remove features that aren’t contributing, which speeds up our training process and helps us get updated models out faster. Streamlining the model update process and gaining actionable insight into our feature engineering process are key in our fight to detect and prevent the latest malware. More insight leads to faster update cycles, allowing us to push new protections to customers more rapidly.
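A minimal sketch of how non-contributing features might be flagged is shown below. The rule, the `min_peak` cutoff and the feature names are assumptions, not a description of CrowdStrike's pipeline. It keeps a feature if its absolute Shapley value is large for at least one file, so that a family-specific feature that matters for only a few samples is not dropped just because its average contribution is small.

```python
def split_by_contribution(shap_rows, feature_names, min_peak=0.05):
    """Split features into keep/drop lists by their peak absolute Shapley value."""
    keep, drop = [], []
    for j, name in enumerate(feature_names):
        peak = max(abs(row[j]) for row in shap_rows)
        if peak >= min_peak:
            keep.append(name)
        else:
            drop.append(name)
    return keep, drop


# Hypothetical feature names and Shapley values, for illustration only.
feature_names = ["checksum_valid", "autoit_script_marker", "legacy_padding_flag"]
shap_rows = [
    [0.6, 0.0, 0.01],
    [-0.8, 0.0, 0.00],
    [0.7, 0.9, -0.02],  # the only file where the family feature fires
]

keep, drop = split_by_contribution(shap_rows, feature_names)
print("keep:", keep)
print("drop:", drop)
```

Any feature flagged this way would still need a retraining run to confirm that removing it does not hurt detection.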

SHAP Provides Important Insights

Overall, using SHAP provides more insight into how the models arrive at their final decision and gives us confidence that the models are using intuitive, robust features for their predictions rather than focusing on arbitrary features of a file, which do not generalize. Generalization is the property that makes machine learning such a powerful tool for malware detection and prevention. SHAP is a perfect example of how our data science team combines open source tools with CrowdStrike’s vast, crowdsourced data streams and sources to protect our customers through the power of machine learning.
