How CrowdStrike Trains GenAI Models at Scale Using Distributed Computing

CrowdStrike researchers outline key concepts, tools, and techniques used in LLM training

Large language models (LLMs) have revolutionized artificial intelligence and are rapidly transforming the cybersecurity landscape. As these powerful models become commonly used among both attackers and defenders, developing specialized cybersecurity LLMs has become a strategic imperative. 

The CrowdStrike 2025 Global Threat Report highlights a concerning trend: Threat actors are increasingly enhancing social engineering and computer network operations campaigns with LLM capabilities. To counter these evolving threats, CrowdStrike is investing in custom LLMs purpose-built for cybersecurity applications, designed to understand and address the unique challenges of this domain.

Here, we discuss key concepts in LLM training and offer insights into CrowdStrike's approach to developing next-generation security-focused language models. A team of our data scientists gave a presentation on this topic as part of a Customer Voice segment at the Google Cloud Next 2025 conference, where CrowdStrike received the 2025 Google Cloud Security Partner of the Year Award for Workload Security.

Infrastructure for LLM Experimentation at Scale

Today's state-of-the-art LLMs, which excel at tasks from general knowledge to coding and reasoning, require immense computational resources. Training these models demands high-performance, distributed computing clusters at scale. To meet demand, major AI labs are announcing ever-larger infrastructure buildouts, with some even exploring nuclear energy to meet their power needs. Setting up such a training cluster is the essential first step in developing LLMs at scale.

CrowdStrike leverages the Google Cloud Vertex Training Platform, specifically Vertex Training Cluster, to streamline training cluster management. We start with a small number of multi-GPU instances (nodes) for testing and efficiently scale up to larger node counts as needed. Our infrastructure-as-code approach enables seamless operations, simplified maintenance, and continuous security updates. Our custom dashboards and automated alerts, built on Google Cloud’s real-time metrics, help ensure consistent cluster performance and reliability. Lastly, we are able to partition and spin up dedicated large clusters for higher-priority jobs, enabling multiple teams to efficiently run their large training workloads.

With the cluster infrastructure in place through the Vertex Training Cluster, we configured Slurm, the industry-standard workload manager for high-performance computing, to handle job scheduling and resource allocation. This allows our researchers to focus on innovation rather than infrastructure management, enabling concurrent work across multiple projects. We maintain dedicated Slurm partitions for both interactive and non-interactive workloads. This separation ensures there's always a pool of resources available for daily data exploration and small-scale processing tasks, while preventing longer training jobs from disrupting interactive work.
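As a simple illustration of how jobs can target these separate partitions, the sketch below wraps sbatch submission in Python. The partition names ("interactive", "batch") and script paths are hypothetical examples, not our actual cluster configuration.

```python
# Hypothetical illustration: submitting work to separate Slurm partitions for
# interactive vs. batch (non-interactive) workloads. Partition names and the
# script paths are examples only.
import subprocess

def submit(script_path: str, interactive: bool = False) -> str:
    partition = "interactive" if interactive else "batch"
    result = subprocess.run(
        ["sbatch", f"--partition={partition}", script_path],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()   # e.g. "Submitted batch job 12345"

# submit("train_llm.sbatch")                       # long-running training job
# submit("explore_data.sbatch", interactive=True)  # small interactive task
```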

Practical Considerations for Training LLMs

LLMs are trained with the causal language modeling objective, i.e., “predicting the next token in a sequence, given previous tokens.” While conceptually straightforward, implementation requires addressing several practical challenges. In this section, we discuss how CrowdStrike has tackled some of these challenges so far.
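To make the objective concrete, here is a minimal sketch of a next-token prediction loss in PyTorch; the model and token IDs are illustrative placeholders rather than our training code.

```python
# Minimal sketch of the causal language modeling objective in PyTorch: position t
# predicts token t+1. The model and token IDs are illustrative placeholders.
import torch
import torch.nn.functional as F

def causal_lm_loss(model, input_ids):
    # input_ids: [batch, seq_len] token IDs
    logits = model(input_ids)                      # [batch, seq_len, vocab_size]
    shift_logits = logits[:, :-1, :]               # predictions for positions 0..n-2
    shift_labels = input_ids[:, 1:]                # targets are the next tokens
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```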

Data Strategy

As with any machine learning application, high-quality, diverse datasets are fundamental to LLM performance. At CrowdStrike, we enhance training data through synthetic generation by guiding LLMs to generate, augment, reinterpret, or reword certain inputs, such as documents and code. Our researchers have produced more robust models after including synthetic data, especially for low-resource concepts, like domain-specific languages. Our optimized generation process implements a complex pipeline that leverages a feedback loop with validation, filtering, and regeneration steps. Parallelizing the workload and taking advantage of the hardware is critical for timely results and a tight feedback loop.
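The sketch below illustrates the general shape of such a generate-validate-regenerate loop; `generate_with_llm` and `is_valid` are hypothetical placeholders standing in for the actual generation and validation steps, not CrowdStrike's pipeline.

```python
# Hypothetical sketch of a synthetic-data feedback loop: generate in parallel,
# validate and filter, and send rejected prompts back for regeneration.
from concurrent.futures import ThreadPoolExecutor

def generate_with_llm(prompt: str) -> str:
    raise NotImplementedError("call an LLM endpoint to augment/reword the input")

def is_valid(sample: str) -> bool:
    raise NotImplementedError("schema checks, deduplication, quality filters, etc.")

def synthesize(prompts, max_rounds=3, workers=8):
    accepted, pending = [], list(prompts)
    for _ in range(max_rounds):
        if not pending:
            break
        # Parallelize generation to keep the feedback loop tight.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            samples = list(pool.map(generate_with_llm, pending))
        retry = []
        for prompt, sample in zip(pending, samples):
            if is_valid(sample):
                accepted.append(sample)       # keep validated synthetic samples
            else:
                retry.append(prompt)          # regenerate in the next round
        pending = retry
    return accepted
```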

CrowdStrike’s modular data pipelines assemble large-scale datasets by using recipes (specifications) and pulling data from multiple sources. For experimental cases where the dataset is not fixed and may grow over time, the data pipeline can effectively stream and construct training batches just-in-time for model training.
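As a rough sketch of the recipe-driven, just-in-time idea, the example below streams from multiple sources according to per-source weights; the recipe format and the `load_source` helper are hypothetical placeholders.

```python
# Hypothetical sketch of a recipe-driven streaming dataset that interleaves
# multiple sources by weight and builds batches just-in-time.
from torch.utils.data import DataLoader, IterableDataset

def load_source(name):
    raise NotImplementedError("return an iterator over records from one source")

class RecipeDataset(IterableDataset):
    def __init__(self, recipe):
        # recipe: e.g. [{"source": "docs", "weight": 2}, {"source": "code", "weight": 1}]
        self.recipe = recipe

    def __iter__(self):
        streams = [(load_source(r["source"]), r["weight"]) for r in self.recipe]
        while streams:
            for entry in list(streams):
                stream, weight = entry
                for _ in range(weight):            # weighted round-robin mixing
                    try:
                        yield next(stream)
                    except StopIteration:
                        streams.remove(entry)      # drop exhausted sources
                        break

# loader = DataLoader(RecipeDataset(recipe), batch_size=8)
```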

Distributed Computing for LLM Training

LLMs are trained on large-scale clusters using distributed computing. To make this possible, multiple parallelism algorithms have been devised, including:

  • Data parallelism (replicating models across nodes for parallel processing of multiple data subsets)
  • Tensor parallelism (sharding tensors)
  • Pipeline parallelism (sharding different model layers across nodes)

These approaches reduce memory requirements and enable training larger models with less hardware. Context parallelism and expert parallelism further help manage longer contexts and train mixture-of-experts architectures. Together, these five approaches constitute “5D parallelism,” with practitioners typically using a subset of them based on specific needs.
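To make the first of these strategies concrete, the sketch below uses PyTorch's DistributedDataParallel, where every process holds a full model replica and gradients are all-reduced after each backward pass. The model and data are placeholders, and a torchrun-style launch is assumed.

```python
# Sketch of data parallelism with PyTorch DistributedDataParallel: each process
# keeps a full model replica, processes its own slice of the data, and gradients
# are all-reduced during backward(). Assumes a launch such as
# `torchrun --nproc_per_node=<num_gpus> train.py`; model and data are placeholders.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(4096, 4096).cuda(), device_ids=[local_rank])
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):                                   # placeholder training steps
        batch = torch.randn(8, 4096, device="cuda")       # each rank sees different data
        loss = model(batch).pow(2).mean()
        loss.backward()                                   # gradients synchronized across replicas
        optim.step()
        optim.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```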

CrowdStrike researchers apply these techniques selectively, depending on project specifics like model size and available hardware. As the HuggingFace Ultrascale Playbook notes, “we can’t give a single unified recipe” since optimal approaches depend on specific hardware configurations. Similarly, many frameworks implement these training algorithms, such as PyTorch’s FSDP and Microsoft’s DeepSpeed, as well as HuggingFace’s Accelerate, which lets users interface easily with both.
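For instance, a training loop written against Accelerate stays largely unchanged whether it is launched under FSDP or DeepSpeed; the backend is chosen via the accelerate configuration at launch time. The model, optimizer, and data below are placeholders.

```python
# Sketch of a training loop built on HuggingFace Accelerate: the same code can
# run under FSDP or DeepSpeed depending on the `accelerate config` used at launch
# (`accelerate launch train.py`). Model, optimizer, and data are placeholders.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(4096, 4096)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataloader = DataLoader(TensorDataset(torch.randn(1024, 4096)), batch_size=8)

# prepare() wraps everything for the distributed backend selected at launch time.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for (batch,) in dataloader:
    loss = model(batch).pow(2).mean()
    accelerator.backward(loss)        # replaces loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```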

Hardware-specific Optimizations Given Training Configurations 

To illustrate the importance of tailoring solutions to specific hardware, we compared identical training parameters across different GPU architectures. We tested two attention mechanism implementations on NVIDIA's H100 and B200 GPUs, using single GCE A3 Mega and A4 nodes, respectively. While Flash Attention 2 consistently outperformed scaled dot-product attention (SDPA), the magnitude of the speedup varied significantly between architectures: one configuration showed only modest acceleration, while the other went from slowest to fastest performer once the attention mechanism was optimized.

These results, though specific to our experimental conditions (e.g., LLM architecture, attention mechanism, or context size requirements), highlight a key takeaway: Meaningful performance gains require optimizations that consider the complete software and hardware ecosystem.
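As an illustration of switching attention backends, the sketch below loads the same model with SDPA and with Flash Attention 2 using the HuggingFace transformers API; the checkpoint name is only an example, and Flash Attention 2 additionally requires the flash-attn package and a supported GPU.

```python
# Sketch: loading the same model with two attention backends to compare training
# speed. The checkpoint name is only an example; "flash_attention_2" requires the
# flash-attn package and a supported NVIDIA GPU.
import torch
from transformers import AutoModelForCausalLM

model_sdpa = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
)

model_fa2 = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```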

Figure 1. Shows the training speed for two GPU architectures in combination with two attention implementations. Flash Attention 2 brings a small improvement to the older GPU architecture, but it makes a significant difference for the newer one. With Flash Attention 2, the newer architecture goes from being the slowest option to being the fastest one.
Figure 2. Similar to Figure 1 but for a longer context length (64k tokens). The longer context length causes speeds to be slower in absolute terms, but all observations made in Figure 1 still hold.

Node Communication

For effective large-scale parallelism during training, cluster nodes need to communicate and synchronize with each other (e.g., when updating model weights in the backward pass). CrowdStrike leverages optimized operations from NVIDIA’s NCCL for fast inter-node communication. Through careful configuration, we’ve achieved accelerations of up to 6x over baseline performance.
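The right settings depend heavily on the network fabric, so the snippet below only illustrates the general pattern of tuning NCCL through environment variables before initializing the process group; the values shown are generic examples, not the configuration behind the 6x result.

```python
# Illustrative only: NCCL is tuned through environment variables that must be set
# before the process group is created. These values are generic examples, not the
# configuration behind the speedup mentioned above.
import os
import torch.distributed as dist

os.environ.setdefault("NCCL_DEBUG", "WARN")           # surface communicator warnings
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")   # pin the network interface (example)

dist.init_process_group(backend="nccl")
```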

Extended Context Handling 

At CrowdStrike, we experiment with LLMs for binary analysis (for example, Byte Back: Next-Generation Malware Classification Using Binary Transformers). This work requires significantly longer contexts than typical text applications to accommodate the tokenized representation of the byte data from the files.

Since the classic Transformer softmax attention has quadratic complexity in both time and space with respect to the sequence length, we employ context parallelism across multiple GPUs. Context parallelism comes in multiple forms, such as ring attention, DeepSpeed Ulysses, and Ulysses-Offload, and in multiple frameworks, including NVIDIA NeMo and DeepSpeed-Megatron. We have seen good results with DeepSpeed Ulysses sequence parallelism. This approach splits each input along the sequence dimension across multiple GPUs and uses all-to-all communication to compute attention in parallel along the head dimension.
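A much-simplified sketch of the Ulysses-style all-to-all re-sharding is shown below, assuming an initialized torch.distributed process group, that rank r holds the r-th contiguous sequence chunk, and that the head count divides evenly by the sequence-parallel group size. Real implementations (e.g., in DeepSpeed) also handle the reverse re-shard and many other details.

```python
# Much-simplified sketch of DeepSpeed Ulysses-style sequence parallelism: each
# rank starts with a contiguous chunk of the sequence and all heads, and a single
# all-to-all re-shards the tensors so each rank holds the full sequence for a
# subset of heads. Illustrative only.
import torch
import torch.distributed as dist
import torch.nn.functional as F

def seq_to_heads(x, sp_group):
    # x: [batch, local_seq, num_heads, head_dim] -> [batch, full_seq, heads/ws, head_dim]
    ws = dist.get_world_size(sp_group)
    b, s, h, d = x.shape
    x = x.reshape(b, s, ws, h // ws, d)
    send = [x[:, :, r].contiguous() for r in range(ws)]   # head group r goes to rank r
    recv = [torch.empty_like(send[0]) for _ in range(ws)]
    dist.all_to_all(recv, send, group=sp_group)
    return torch.cat(recv, dim=1)                         # stitch sequence chunks back together

def ulysses_attention(q, k, v, sp_group):
    q, k, v = (seq_to_heads(t, sp_group) for t in (q, k, v))
    # Full-sequence causal attention over the local subset of heads.
    out = F.scaled_dot_product_attention(
        q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2), is_causal=True
    ).transpose(1, 2)
    # A second all-to-all (omitted) would restore sequence sharding for later layers.
    return out
```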

Combining sequence parallelism with a high-performance attention implementation (such as Flash Attention) and a memory-optimized loss computation, either a fused cross-entropy loss or a loss computed over blocks of logits (as proposed by ProLong), yields effective training throughput in our distributed environment.
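As a conceptual illustration of the block-wise loss idea, the sketch below computes the causal LM loss over chunks of positions and recomputes each chunk's logits during the backward pass, so the full logits tensor is never materialized at once. It is a simplified stand-in, not the fused-kernel or ProLong implementation.

```python
# Conceptual sketch of computing the causal LM loss over blocks of logits so the
# full [batch, seq_len, vocab] tensor is never materialized at once; each chunk's
# logits are recomputed during backward via checkpointing.
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

def _chunk_loss(lm_head, h, y):
    logits = lm_head(h)                                # only this chunk's logits in memory
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), y.reshape(-1), reduction="sum"
    )

def chunked_causal_lm_loss(hidden, lm_head, labels, chunk_size=1024):
    # hidden: [batch, seq_len, d_model]; labels: [batch, seq_len] token IDs
    hidden, labels = hidden[:, :-1, :], labels[:, 1:]  # position t predicts token t+1
    total_loss, total_tokens = 0.0, 0
    for start in range(0, hidden.size(1), chunk_size):
        h = hidden[:, start:start + chunk_size, :]
        y = labels[:, start:start + chunk_size]
        total_loss = total_loss + checkpoint(_chunk_loss, lm_head, h, y, use_reentrant=False)
        total_tokens += y.numel()
    return total_loss / total_tokens
```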

Figure 3. Shows training throughput for Llama 3.1 8B using various long-context scenarios. Doubling the context size roughly halves the training throughput.

Training LLMs with long contexts remains a performance challenge, both because of the inherent complexity of the task and because software must constantly keep pace with hardware advances. In the Llama 3.1 8B example above, doubling the context size roughly halved training throughput. However, this ratio doesn’t necessarily hold for larger architectures. We are continually running experiments to optimize performance, also accounting for the GPU configurations.

Memory Management and Gradient Checkpointing 

Hardware constraints, particularly video RAM (VRAM) limitations, dictate feasible model architectures and configurations as well as training times. To fit larger models or batches into limited memory, we employ gradient checkpointing. During training, deep learning models run a forward pass, which stores intermediate activations, and a backward pass, which uses those activations to compute the gradients needed to update the model's parameters. Because parameters, gradients, optimizer state, and stored activations all compete for memory, the VRAM footprint during training is far larger than the model itself. Gradient checkpointing "forgets" a subset of the intermediate activations to reduce peak memory usage, recomputing them during the backward pass when needed. This technique trades additional computation for reduced memory requirements, a worthwhile compromise when VRAM is constrained.
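Enabling this is typically a one-line change; the sketch below uses the HuggingFace convenience method (the checkpoint name is only an example), and the same trade-off can be applied to custom modules with torch.utils.checkpoint.

```python
# Sketch: enabling gradient (activation) checkpointing on a HuggingFace model.
# The model name is only an example.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", torch_dtype=torch.bfloat16
)
model.gradient_checkpointing_enable()   # activations are recomputed in the backward pass

# For custom modules, torch.utils.checkpoint provides the same trade-off:
# from torch.utils.checkpoint import checkpoint
# hidden = checkpoint(transformer_block, hidden, use_reentrant=False)
```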

Figure 4. Shows how activating gradient checkpointing affects peak VRAM requirements for various distributed algorithms (FSDP and variants of DeepSpeed ZeRO). The reduction in peak VRAM is different for each algorithm, with DeepSpeed ZeRO 3 seeing the largest reduction (80%, from 31GB down to 6GB).
Figure 5. Shows how activating gradient checkpointing affects training time for the distributed algorithms in Figure 4. In all cases, gradient checkpointing slows down training. However, this slowdown is not prohibitive for this specific test workload.

Observability

Monitoring is essential for both small- and large-scale training runs to optimize configurations and troubleshoot issues. CrowdStrike utilizes PyTorch's profiler to analyze resource usage and maximize hardware efficiency, tracking metrics like SM Efficiency for NVIDIA GPUs, which estimates the fraction of active processors. We complement this with Google Cloud’s dashboards to monitor node status and detect potential hardware failures, an inevitable occurrence during extended training of large models.
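A minimal profiling pass with torch.profiler looks like the sketch below; the workload is a placeholder standing in for a few real training steps.

```python
# Sketch: profiling a few placeholder training steps with torch.profiler to
# inspect GPU kernel time and memory usage. Requires a CUDA device.
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(4096, 4096).cuda()     # placeholder workload
optim = torch.optim.AdamW(model.parameters())

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    profile_memory=True,
) as prof:
    for _ in range(3):
        loss = model(torch.randn(8, 4096, device="cuda")).pow(2).mean()
        loss.backward()
        optim.step()
        optim.zero_grad()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```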

Managing the Complexity of Training LLMs at Scale

Training LLMs at scale involves numerous complex, interconnected components. Based on our experience, we recommend:

  • Prioritize caching and reusability for both code and data to reduce duplication.
  • Maintain relevant metadata for datasets for reproducibility and auditing.
  • Validate at small scale with tight feedback loops before scaling up.
  • Log extensive metrics to improve visibility, debugging, and model selection.
  • Configure resource usage-based alerts to proactively prevent failures.
  • Invest in robust data pipelines and management to handle multiple sources, custom pre-processing, data mixing, and synthetic data generation.
  • Enforce best practices through testing and automation to build trust and efficiency.

LLM Research at CrowdStrike

At CrowdStrike, generative AI is one of our primary areas of research. Teams of data scientists study the development and usage of various types of language models and training techniques. The setup described in this article needs to support fine-tuning agentic models, training foundational models, fast distributed inference, and many other use cases. 

As our projects become more ambitious and our execution capabilities scale, we look forward to unlocking innovative ways of stopping the adversary. One example is the large byte model (LBM): building on insights gained from prior research into Binary Transformers for malware classification, CrowdStrike is taking the next step by developing a multi-modal LLM specifically designed for in-depth binary file analysis.

Additional Resources