BERT Embeddings: A Modern Machine-learning Approach for Detecting Malware from Command Lines (Part 1 of 2)
January 26, 2022 | Stefan-Bogdan Cocea | Endpoint & Cloud Security
- Suspicious command lines differ from common ones in their executable paths and in the unusual arguments passed to the executable
- Bidirectional Encoder Representations from Transformers (BERT) embeddings can successfully be used for feature extraction for command lines
- Outlier detectors on top of BERT embeddings can detect anomalous command lines without the need for data labeling
- Our BERT model assists detection in an unsupervised fashion, strengthening the protection of the CrowdStrike Falcon® platform
The large amounts of behavioral data being generated today necessitate accurate labels for machine learning classifiers. In an earlier blog post, Large-Scale Endpoint Security MOLD Remediation, we discussed how to remediate labeling noise. In this blog post, we experiment with an unsupervised approach that eliminates the need for learning from labeled data. We propose an unsupervised way of filtering anomalous command lines, from misconfigurations to malicious behavior, in our own telemetry data using BERT embeddings.
The command line is a quick, powerful text-based interface that developers use to start and interact with a wide range of processes running on the computer’s operating system. It is an important field that we capture in our telemetry data because, by itself, it can indicate suspicious activity. Suspicious command lines differ from common ones by having different executable paths or unusual arguments.
From a statistical point of view, we can treat suspicious command lines as outliers. Although command lines are not natural language, they can be modeled using natural language processing (NLP) techniques. In Kuppa et al., 2019, a word2vec-like approach is used for generating command2vec embeddings, on which adversarial autoencoders are employed. Tiberiu Boros and Andrei Cotaie use the term frequency-inverse document frequency (TF-IDF) and bilingual evaluation understudy (BLEU) scoring to generate command line encodings used as inputs for the local outlier factor (LOF) and autoencoders.
In our experiments, we use a pre-trained BERT model for the feature extraction step — a strategy for generating command line embeddings that wasn’t previously used in the literature. We compare a variety of anomaly detection algorithms, including principal component analysis (PCA), Isolation Forest (iForests), Copula-based Outlier Detection (COPOD), autoencoders (AE, pp. 237-263) and an ensemble of all of these algorithms. Most of these algorithms are implemented in the open-source PyOD library, which we also use in our implementation.
The data used for the anomaly detection model consists mainly of Windows command lines extracted from our telemetry. The command lines have the structure illustrated in Figure 1.
The command lines we used were sampled from 5 million generic firewall events. However, the model’s scope can easily be extended by including other kinds of command lines.
Learning Command Line Embeddings
Command lines can have different outcomes depending on the order of the tokens and the context in which they are used. To better capture this diversity of outcomes, we decided to learn a data representation from unlabeled command lines by pre-training a BERT model.
The first step in training BERT from scratch is having a command-line tokenizer. We trained a WordPiece tokenizer on over 45 million command lines to achieve this. Figure 2 shows an example of a tokenized command line.
BERT models trained for natural language modeling use tokenizers that split on space characters by default. In addition, we split on the file path delimiter “\\”.
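The splitting rule above can be sketched as a pre-tokenization step that runs before WordPiece breaks each piece into sub-word units. This is a minimal illustration, not the tokenizer described in this post: the helper name `pre_tokenize` is hypothetical, and dropping the delimiter (rather than keeping it as its own token) is an assumption.

```python
import re
from typing import List

# Assumption: split on runs of whitespace and on backslash path delimiters,
# discarding the delimiters themselves.
_SPLIT_PATTERN = re.compile(r"[\s\\]+")


def pre_tokenize(command_line: str) -> List[str]:
    """Split a command line into coarse pieces ahead of sub-word tokenization."""
    return [piece for piece in _SPLIT_PATTERN.split(command_line) if piece]


# Illustrative command line, not taken from telemetry.
example = r'C:\Windows\System32\cmd.exe /c "dir C:\Users"'
print(pre_tokenize(example))
```

Splitting on the path delimiter lets directory names and executable names become separate units, so a rare path can still share tokens with common ones.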
We trained a BERT model for the masked language modeling (MLM) task using the tokenizer. The obtained model has a hidden size of 768, and it was trained on 1.5 million command lines extracted from CrowdStrike’s event data. After the pre-training stage, the BERT model is used as a feature extraction tool for command lines.
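To make the MLM objective concrete, here is a simplified sketch of how training examples can be built: some tokens are hidden, and the model is trained to predict the originals. The 15% masking rate is the value from the original BERT recipe and is an assumption here, since this post does not state the rate used. The original recipe also sometimes substitutes a random token or leaves the chosen token unchanged; the sketch omits that. The function name `mask_tokens` is hypothetical.

```python
import random
from typing import Dict, List, Tuple

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens: List[str], mask_rate: float = 0.15,
                seed: int = 0) -> Tuple[List[str], Dict[int, str]]:
    """Return the masked sequence and a {position: original token} label map."""
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    # Mask at least one token so every sample yields a training signal.
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = rng.sample(range(len(tokens)), n_to_mask)
    masked = list(tokens)
    labels: Dict[int, str] = {}
    for pos in positions:
        labels[pos] = tokens[pos]   # what the model must predict
        masked[pos] = MASK_TOKEN    # what the model sees instead
    return masked, labels


# Illustrative tokens, not taken from telemetry.
tokens = ["cmd.exe", "/c", "dir", "C:", "Users", "/s", "/b"]
masked, labels = mask_tokens(tokens)
print(masked, labels)
```

Because the model must use the tokens on both sides of a masked position to recover it, the learned embeddings reflect the context in which each token appears.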
Detection time is a concern. Thus, on top of the contextualized embeddings, we use a few different outlier detection models with low computational overhead. In Figure 3, we show the general architecture of the anomaly detection pipeline.
Out of the anomaly detectors tested, the ones listed below performed best and were the most computationally efficient for our extensive high-dimensional data set:
- PCA: Computes the outlier score as the sum of the projected distance of a sample on all eigenvectors. In our experiments, we use PCA to reduce the number of features by 90%.
- COPOD: A parameter-free model that estimates tail probabilities using empirical copula.
- Isolation Forest: “Isolates” observations by randomly selecting a split value from the range of values of a feature. The outlier score is equal to the number of splits required to isolate a sample. An outlier will require fewer splits until it is isolated from the rest of the data.
- Autoencoder: A deep learning model used in our experiments to detect outliers, by reducing the feature space and then computing the reconstruction errors.
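The isolation idea in the list above can be illustrated with a small pure-Python sketch. It follows one point through repeated random splits and counts how many are needed until the point is alone, then averages over many random trials. This is a simplification under stated assumptions: real Isolation Forest implementations build trees on subsamples and normalize path lengths into a score, and the names `isolation_depth` and `mean_isolation_depth` as well as the toy 2-D data are hypothetical.

```python
import random
from typing import List, Sequence

Point = Sequence[float]


def isolation_depth(point: Point, data: List[Point],
                    rng: random.Random, max_depth: int = 64) -> int:
    """Count random splits until `point` (a member of `data`) is isolated."""
    depth = 0
    partition = list(data)
    while len(partition) > 1 and depth < max_depth:
        # Only features whose values differ inside the partition can be split.
        splittable = [f for f in range(len(point))
                      if min(r[f] for r in partition) < max(r[f] for r in partition)]
        if not splittable:
            break  # remaining rows are identical to the point
        feature = rng.choice(splittable)
        low = min(r[feature] for r in partition)
        high = max(r[feature] for r in partition)
        split = rng.uniform(low, high)
        # Keep the side of the split that contains the point.
        if point[feature] < split:
            partition = [r for r in partition if r[feature] < split]
        else:
            partition = [r for r in partition if r[feature] >= split]
        depth += 1
    return depth


def mean_isolation_depth(point: Point, data: List[Point],
                         n_trials: int = 200, seed: int = 0) -> float:
    """Average isolation depth over many random trials."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(n_trials)) / n_trials


# Toy data: a dense cluster around the origin plus one distant point.
data_rng = random.Random(42)
inliers = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(100)]
outlier = (8.0, 8.0)
data = inliers + [outlier]
typical = min(inliers, key=lambda p: p[0] ** 2 + p[1] ** 2)  # point nearest the center

print("outlier:", mean_isolation_depth(outlier, data))
print("typical:", mean_isolation_depth(typical, data))
```

The distant point is separated after only a few splits on average, while a point in the middle of the cluster needs many more, which is the property the detector exploits.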
The proposed models compute an anomaly score for each sample in the dataset. In Figure 4, we can inspect the distribution of the anomaly scores. Note that there are very few samples with an anomaly score greater than 3. Moreover, there is a significant difference in score between the top samples and the rest. Therefore, as a general rule, the N samples with the highest outlier scores are considered anomalies.
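The top-N rule can be sketched in a few lines. The function name `top_n_distinct` is hypothetical, and the command lines and scores below are invented purely for illustration. Duplicate command lines are collapsed first so that each distinct command line is counted once.

```python
from typing import Dict, Iterable, List, Tuple


def top_n_distinct(scored: Iterable[Tuple[str, float]], n: int) -> List[Tuple[str, float]]:
    """Return the n distinct command lines with the highest anomaly scores."""
    best: Dict[str, float] = {}
    for command_line, score in scored:
        # Keep the highest score seen for each distinct command line.
        if command_line not in best or score > best[command_line]:
            best[command_line] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Invented scores for illustration only.
scored = [
    ("svchost.exe -k netsvcs", 0.2),
    ("powershell.exe -enc AAAA", 3.4),
    ("svchost.exe -k netsvcs", 0.2),
    ("cmd.exe /c dir", 1.1),
    ("x.exe /", 3.9),
]
print(top_n_distinct(scored, 2))
```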
Finally, we experiment with an ensemble model, aggregating the outlier detectors described above: COPOD, PCA, iForest and AE. The anomaly score is computed as the average score of the four models. Our intuition is that the ensemble strategy should provide more consistent results, as it combines multiple approaches for anomaly detection.
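A minimal sketch of score averaging follows. One assumption goes beyond the text: each detector's scores are standardized (z-scored) before averaging, because raw scores from different detectors usually live on different scales and a plain mean would be dominated by the largest one. The helper names and all numbers are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List


def standardize(scores: List[float]) -> List[float]:
    """Rescale one detector's scores to zero mean and unit variance."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def ensemble_scores(per_model: Dict[str, List[float]]) -> List[float]:
    """Average the standardized scores of all detectors, sample by sample."""
    standardized = [standardize(scores) for scores in per_model.values()]
    return [mean(column) for column in zip(*standardized)]


# Invented scores for four samples; each list is one detector's output.
per_model = {
    "pca": [0.1, 0.2, 0.15, 0.9],
    "copod": [10.0, 12.0, 11.0, 30.0],
    "iforest": [0.45, 0.50, 0.62, 0.58],
    "ae": [1.0, 1.2, 0.9, 4.0],
}
combined = ensemble_scores(per_model)
print(combined)
```

In this toy example the last sample ranks highest in the combined score even though one detector prefers a different sample, which is the consistency the ensemble is meant to provide.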
Comparing the outlier models proved difficult, as there is no specific evaluation metric for the unsupervised setup. To better understand how the anomalies are selected, we created a visual representation of the dataset. To this end, we use PCA to reduce the feature space from 768 to two components, and we select the first 250 distinct command lines with the highest anomaly score. It is worth mentioning that the marked command lines are outliers, and not all of them are necessarily true anomalies.
The right-hand plot in Figure 5 displays in red the anomalies detected by iForest. The iForest algorithm differs the most from the other models because it identifies anomalies with a very low occurrence rate.
In Figure 6, we observe that PCA and AE have similar outcomes. This may be because these algorithms share a similar detection strategy: both PCA and AE project the data into a lower-dimensional space and detect outliers by computing the reconstruction errors. Another possible explanation is that the BERT embedding model draws a clear separation in the feature space between the outliers and the remaining observations. This is also supported by the COPOD model, whose detected anomalies are more than 80% similar to those of the other models (Figure 5). iForest has fewer detections in common, but this is expected given the probabilistic nature of the selection algorithm, which favors command lines with a lower rate of occurrence.
We should mention that we considered only the first 250 distinct command lines with the highest outlier score as anomalies for these measurements. To motivate the rationale behind the ensemble model, let’s suppose that an anomaly x is ranked 300 by one of the models considered and 100 by another one. With a cutoff of 250, only the second model would flag x. Such disagreement is expected, as each model tackles the outlier detection problem differently. The ensemble model averages the scores of each outlier detector to provide a robust prediction.
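The agreement between two detectors can be measured as the share of command lines that appear in both top-N lists. The sketch below assumes each ranking is a list of distinct command lines sorted by decreasing anomaly score. The function name `top_n_overlap` and the placeholder identifiers are hypothetical, and this is one plausible way to compute such a similarity rather than the exact measurement behind the figures.

```python
from typing import List


def top_n_overlap(ranking_a: List[str], ranking_b: List[str], n: int) -> float:
    """Fraction of the top-n entries that two rankings have in common."""
    if n <= 0:
        raise ValueError("n must be positive")
    top_a, top_b = set(ranking_a[:n]), set(ranking_b[:n])
    return len(top_a & top_b) / n


# Placeholder identifiers standing in for command lines.
ranking_model_a = ["c1", "c2", "c3", "c4", "c5"]
ranking_model_b = ["c2", "c1", "c9", "c3", "c8"]
print(top_n_overlap(ranking_model_a, ranking_model_b, 4))  # 3 of 4 shared -> 0.75
```

A command line ranked just outside the cutoff by one detector and well inside it by another lowers this overlap, even though both detectors consider it unusual.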
The neural network approach (AE) and PCA perform best out of the four standalone models. While PCA has lower training times than COPOD, AE and iForest, the autoencoder architecture is much more powerful than PCA because it employs non-linear transformations. The ensemble model combines the above four strategies to improve the individual performance of the anomaly detectors, albeit at the cost of having a slower processing time.
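Reconstruction error, the quantity both PCA and AE rely on, can be sketched with NumPy. The example projects data onto the top principal directions, maps it back, and scores each sample by how much was lost. It is an illustration under stated assumptions: the function name `pca_reconstruction_errors` and the synthetic 3-D data are hypothetical, and library implementations may compute the PCA outlier score differently. An autoencoder follows the same encode/decode pattern but replaces the two linear maps with non-linear networks.

```python
import numpy as np


def pca_reconstruction_errors(X: np.ndarray, n_components: int) -> np.ndarray:
    """Squared reconstruction error of each row after a rank-k PCA projection."""
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    encoded = centered @ components.T      # encode: d features -> k components
    reconstructed = encoded @ components   # decode: k components -> d features
    return np.sum((centered - reconstructed) ** 2, axis=1)


# Synthetic data: 200 points close to a line in 3-D, plus one point far off that line.
rng = np.random.default_rng(0)
t = rng.normal(size=(200, 1))
inliers = t @ np.array([[1.0, 2.0, 3.0]]) + 0.01 * rng.normal(size=(200, 3))
X = np.vstack([inliers, [[3.0, -3.0, 0.0]]])

errors = pca_reconstruction_errors(X, n_components=1)
print(int(np.argmax(errors)))  # index of the sample that is reconstructed worst
```

Samples that follow the dominant structure of the data are reconstructed almost perfectly, so a large error marks a sample that the low-dimensional representation cannot explain.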
We consider the results obtained in the research phase promising, given that the model detects commands that are different, to some extent, from the ones that we regularly see in our telemetry. For example, it identifies incomplete command lines inconsistent with the format explained in the approach section (i.e., potential misconfigurations).
The goal of this exercise was to describe a novel approach for using BERT models to detect anomalous command-line executions. We first pre-trained our BERT model in-house using targeted command lines from firewall events, benefiting from a large set of unlabeled telemetry data.
The BERT model is subsequently used to perform feature extraction on new samples, and the resulting command line embeddings are fed to a series of anomaly detectors. Finally, we compared the performance of the dedicated algorithms for outlier detection and discussed our modeling decisions, concluding that BERT embeddings can successfully be used for command-line feature extraction and as input for anomaly detection models.
This model assists detection in an unsupervised fashion by filtering suspicious command lines from large amounts of events. Thus, the experiments conducted in this research help strengthen the protection of the CrowdStrike Falcon® platform.