EMBER2024: Advancing the Training of Cybersecurity ML Models Against Evasive Malware

September 03, 2025


CrowdStrike data scientists are members of a team of cybersecurity researchers that recently released EMBER2024, an update to EMBER, the popular open source malware benchmark dataset originally released in 2018. 

The EMBER2024 dataset includes metadata, labels, and calculated features for over 3.2 million files from six different file formats. It provides data scientists conducting cybersecurity research with an extensive, modern dataset to support the training and evaluation of machine learning models for malware detection, including a collection of advanced malware that has demonstrated its ability to evade antivirus products. 

An academic paper, EMBER2024: A Benchmark Dataset for Holistic Evaluation of Malware Classifiers, details this new dataset and was presented at the SIGKDD Conference on Knowledge Discovery and Data Mining (KDD-2025) in Toronto in August 2025. The paper also includes 14 benchmark models trained on different subsets of the data and varying classification tasks. 

There are many barriers to releasing public datasets in the cybersecurity field, including preserving customer privacy and hiding defender capabilities from attackers. Because of this, CrowdStrike researchers were excited for the opportunity to help update this very popular dataset. In this post, researchers can learn more about what this dataset provides and the new research enabled by it.

Original EMBER Dataset (2018): An Influential Resource for Malware Classification

The original EMBER dataset was a labeled benchmark dataset for training machine learning models to statically detect malicious Windows portable executable (PE) files. Released in 2018, it was accompanied by an academic paper co-authored by a CrowdStrike data scientist who is part of the EMBER2024 team. The paper was subsequently updated the following year.

The goal of EMBER was to invigorate research in the field of malware classification, just as other benchmark datasets had done for image classification. It has helped to significantly advance malware detection in cybersecurity products, including the CrowdStrike Falcon® Platform. As of this writing, the paper has been cited in academic research over 700 times since its original publication in 2018, reflecting just how influential EMBER has been in the field of ML training for cybersecurity. Researchers have used EMBER to measure how quickly malware classifiers degrade over time, explore adversarial machine learning attacks and defenses, and build educational projects. Last year, CrowdStrike researchers augmented the data with tags and leaf similarity information to create EMBERSim, an effort to make it easier to build binary code similarity techniques using benign data.

EMBER2024 builds on the innovative and influential original, delivering a leap forward in capability.

EMBER2024: Updated to Help Train the Next Generation of Cybersecurity ML Researchers

With an ongoing industry shift to ML-based malware detection, the importance of innovative tools like EMBER has only increased.

A team of researchers from multiple organizations — including a member of the CrowdStrike Data Science team who co-created the original EMBER dataset — recently undertook the project of updating and improving EMBER. They had ambitious plans to expand and extend the original dataset in many directions, ending up with more than 3.2 million files across six file formats. Figure 1 shows how many files of each type are included in EMBER2024. The dataset features seven different types of labels and tags that support training classifiers on seven common tasks, including malicious/benign detection, malware family classification, and malware behavior identification. Source code is included that allows researchers to replicate the feature calculation, model training, and file collection techniques used to construct the dataset. A supplemental release also includes the raw bytes and disassembly for 16.3 million functions from malicious files, identified and compiled using the FLARE team’s capa tool.

Figure 1. File type stats for the EMBER2024 dataset

| File Type | Train | Test | Challenge | Total |
| --- | --- | --- | --- | --- |
| Win32 | 1,560,000 | 360,000 | 3,225 | 1,923,225 |
| Win64 | 520,000 | 120,000 | 814 | 640,814 |
| .NET | 260,000 | 60,000 | 805 | 320,805 |
| APK | 208,000 | 48,000 | 256 | 256,256 |
| PDF | 52,000 | 12,000 | 805 | 64,805 |
| ELF | 26,000 | 6,000 | 386 | 32,386 |

CrowdStrike’s contribution to the project was updating the original feature calculation code to make it easier to use. EMBER2018 features require version 0.9.0 of the LIEF library; updating that library produces features that may not match those calculated with 0.9.0. But LIEF 0.9.0 requires Python 3.6, which is long out of date and unsupported. One of EMBER’s main use cases is teaching students how to work with machine learning in cybersecurity, and this outdated dependency was introducing them to the pain of Python packaging and versioning instead of the subject at hand.

To solve this problem, the feature calculation code was updated to use the most recent version of the pefile library instead of LIEF. Because pefile is pure Python, a single pinned version of pefile is more likely to install cleanly on future Python releases. Future versions of pefile are also unlikely to introduce breaking changes to the calculated features, so pinning the required pefile version can be deferred as long as possible. While making this change, the repository also switched to more modern Python tooling (polars, uv, etc.).

In addition to the dependency update, EMBER2024 features now include information about a file’s richheader, authenticode, and any warnings that the pefile module outputs while attempting to read the PE file format. Figure 2 shows the categories of features that are calculated along with examples of all of the metadata included. A full description of all changes to the feature calculation can be found in the paper and the source code.
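Much of the pefile-derived metadata (imports, section names, warning strings, and so on) is variable-length, so feature calculation ultimately has to flatten it into fixed-length numeric vectors; hashed count features are one common device for this. A minimal sketch of the idea — the `hash_bucket` helper, the MD5 choice, and the 64-bucket dimension are illustrative assumptions, not EMBER2024's actual implementation:

```python
import hashlib

def hash_bucket(value: str, dim: int = 64) -> int:
    """Map a string feature (e.g., an imported DLL name or a
    pefile warning message) to a stable bucket index."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little") % dim

def hashed_features(strings, dim: int = 64) -> list:
    """Flatten a variable-length list of strings into a
    fixed-length count vector via the hashing trick."""
    vec = [0.0] * dim
    for s in strings:
        vec[hash_bucket(s, dim)] += 1.0
    return vec
```

Because the bucket index depends only on the string's bytes, the same file always produces the same vector, which is the property a reproducible benchmark needs.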

Figure 2. Categories of features calculated and examples of metadata included in EMBER2024

Beyond the updated features, two other aspects of the new dataset warrant mentioning: the inclusion of a challenge set and infrastructure code.

Adding a Challenge Set to Improve ML Training for Commercial Cybersecurity Solutions

The challenge set includes 6,315 files that were not initially detected as malicious by any of the AV products on VirusTotal. A file joins the challenge set if, at least 30 days later, it is detected as malicious by enough AV products to meet EMBER2024’s definition of malicious. To gather as many of these samples as possible, they are collected throughout both the train and test time periods of the dataset. They are then set aside from the train and test sets, and models are evaluated on them separately. Figure 1 shows the relative size of the challenge set for each file type collected.
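The membership rule can be sketched as a simple predicate over two VirusTotal scan snapshots. The field names and the detection threshold below are illustrative assumptions, not EMBER2024's exact labeling rule:

```python
from dataclasses import dataclass

@dataclass
class ScanResult:
    initial_detections: int  # AV engines flagging the file at first scan
    rescan_detections: int   # detections on a rescan >= 30 days later

def is_challenge_sample(scan: ScanResult, malicious_threshold: int = 5) -> bool:
    """A file joins the challenge set if no engine flagged it at first
    submission, but a later rescan crosses the maliciousness threshold.
    The threshold here is a placeholder, not EMBER2024's actual value."""
    return (scan.initial_detections == 0
            and scan.rescan_detections >= malicious_threshold)
```

A file that was caught by even one engine on day zero never qualifies, which is what makes the resulting set a measure of initially missed malware.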

One of the drawbacks of the original EMBER dataset is that it’s too “easy” to classify. The benchmark model from the very first release achieved a ROC AUC score of 0.99911 on the test set. This made it very difficult for researchers to publicly demonstrate that their novel techniques for classification would perform better. The dataset wasn’t large enough to reflect the difficulties of training and shipping a real commercial AV solution.
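For reference, ROC AUC can be read as the probability that a randomly chosen malicious sample scores higher than a randomly chosen benign one, which makes clear why 0.99911 leaves almost no headroom. A self-contained sketch of the rank-based computation (a real evaluation would use a library such as scikit-learn):

```python
def roc_auc(labels, scores) -> float:
    """Rank-based ROC AUC: the probability that a random positive
    outscores a random negative, with ties counting half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

At an AUC of 0.99911, only about 9 in 10,000 malicious/benign pairs are misranked, so a genuinely better technique has almost no measurable room to demonstrate its advantage.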

The challenge set takes a step toward solving this problem by highlighting the very hardest files to classify. Checking signatures and consulting allowlists and blocklists via cloud lookups makes it possible to identify “known bad” files. The promise of incorporating machine learning into an AV system is that it can better identify malicious files that nobody has seen before. Most of the AV products used to generate EMBER2024 labels already use ML to attempt this. And even then, the community of defenders sometimes fails. Collecting the files that weren’t initially identified as malicious highlights where existing solutions are struggling and creates a metric with room for improvement.

Infrastructure Code to Support Future Research

Another innovation in the EMBER2024 public release is that it includes the code used to construct the dataset itself. This includes retrieving VirusTotal reports, labeling the files contained in those reports, and selecting a pre-set number of files from a given time period while excluding near duplicates. This will allow researchers with access to VirusTotal to replicate the EMBER2024 construction process at some point in the future.
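The selection step can be pictured as a single pass over candidate files that filters by first-seen time, drops duplicates, and stops at a quota. The sketch below uses exact content hashes for deduplication purely for illustration; EMBER2024's near-duplicate exclusion is more sophisticated, and `select_window` is a hypothetical name:

```python
import hashlib

def select_window(samples, start, end, limit: int) -> list:
    """Pick up to `limit` files first seen within [start, end),
    skipping exact duplicates by content hash. `samples` is an
    iterable of (first_seen, raw_bytes) pairs."""
    seen, chosen = set(), []
    for first_seen, content in samples:
        if not (start <= first_seen < end):
            continue  # outside the dataset's time period
        digest = hashlib.sha256(content).hexdigest()
        if digest in seen:
            continue  # duplicate content, already selected
        seen.add(digest)
        chosen.append(digest)
        if len(chosen) == limit:
            break
    return chosen
```

Swapping in a different window, quota, or similarity measure is how the same loop could later assemble the larger longitudinal datasets described below.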

Given enough resources, future projects could use this code to put together a much larger dataset that would enable larger models or studies about the evolution of benign and malicious software over time. There’s no guarantee about the consistency of the files that get added to VirusTotal in any given time period, but there are still interesting questions about model degradation or other topics that can now be approached with this codebase.

EMBER2024 Exemplifies CrowdStrike’s Commitment to Research

The original EMBER dataset achieved its objective of boosting research in the field of malware classification, earning hundreds of citations in the academic literature in the years since it was first published. It has also been used to help teach the latest generations of cybersecurity researchers. Its popularity spawned related projects like EMBERSim and now EMBER2024. This effort, and the involvement of our data scientists, reflects CrowdStrike’s ongoing commitment to research in the cybersecurity industry. We believe when defenders collaborate and share knowledge, we collectively strengthen our position against the threat actors who benefit from operating in the shadows.

Open source initiatives like EMBER2024 represent the kind of industry-wide cooperation that helps to drive innovation and support continuous product improvement. Projects like these tilt the playing field toward defenders and ensure the AI-native CrowdStrike Falcon platform remains a leader in stopping breaches.

Additional Resources