Large language models (LLMs) have revolutionized artificial intelligence and are rapidly transforming the cybersecurity landscape. As these powerful models become commonly used among both attackers and defenders, developing specialized cybersecurity LLMs has become a strategic imperative.
The CrowdStrike 2025 Global Threat Report highlights a concerning trend: Threat actors are increasingly enhancing social engineering and computer network operations campaigns with LLM capabilities. To counter these evolving threats, CrowdStrike is investing in custom LLMs purpose-built for cybersecurity applications, designed to understand and address the unique challenges of this domain.
Here, we discuss key concepts in LLM training and offer insights into CrowdStrike's approach to developing next-generation security-focused language models. A team of our data scientists gave a presentation on this topic as part of a Customer Voice segment at the Google Cloud Next 2025 conference, where CrowdStrike received the 2025 Google Cloud Security Partner of the Year Award for Workload Security.
Infrastructure for LLM Experimentation at Scale
Today's state-of-the-art LLMs, which excel at tasks from general knowledge to coding and reasoning, require immense computational resources. Training these models demands high-performance, distributed computing clusters at scale. To meet this demand, major AI labs are announcing ever-larger infrastructure buildouts, with some even exploring nuclear power to meet their energy needs. Setting up such a training cluster is the essential first step in developing LLMs at scale.
CrowdStrike leverages the Google Cloud Vertex Training Platform, specifically the Vertex Training Cluster, to streamline training cluster management. We start with a small number of multi-GPU instances (nodes) for testing and efficiently scale up to larger node counts as needed. Our infrastructure-as-code approach enables seamless operations, simplified maintenance, and continuous security updates. Our custom dashboards and automated alerts, built on Google Cloud’s real-time metrics, help ensure consistent cluster performance and reliability. Lastly, we are able to partition and spin up dedicated large clusters for higher-priority jobs, enabling multiple teams to efficiently run their large training workloads.
With the cluster infrastructure in place through the Vertex Training Cluster, we configured Slurm, the industry-standard workload manager for high-performance computing, to handle job scheduling and resource allocation. This allows our researchers to focus on innovation rather than infrastructure management, enabling concurrent work across multiple projects. We maintain dedicated Slurm partitions for both interactive and non-interactive workloads. This separation ensures there's always a pool of resources available for daily data exploration and small-scale processing tasks, while preventing longer training jobs from disrupting interactive work.
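To make the split concrete, here is a minimal sketch of how jobs could be routed to separate partitions with a small Python wrapper around sbatch. The partition names, resource limits, and the `submit` helper are hypothetical, not our actual Slurm configuration.

```python
import subprocess

# Hypothetical partition names and limits; an actual cluster's configuration differs.
PARTITIONS = {
    "interactive": {"name": "interactive", "time": "04:00:00", "gpus": 1},
    "batch": {"name": "batch", "time": "72:00:00", "gpus": 8},
}

def submit(command: str, kind: str = "batch", job_name: str = "llm-job") -> str:
    """Submit a command via sbatch to the partition matching the workload type."""
    part = PARTITIONS[kind]
    sbatch_cmd = [
        "sbatch",
        f"--job-name={job_name}",
        f"--partition={part['name']}",
        f"--gres=gpu:{part['gpus']}",
        f"--time={part['time']}",
        f"--wrap={command}",
    ]
    # sbatch prints "Submitted batch job <id>" on success.
    result = subprocess.run(sbatch_cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()

# Short exploratory work goes to the interactive pool; long training runs go to batch.
print(submit("python explore_data.py", kind="interactive"))
print(submit("torchrun --nproc_per_node=8 train.py", kind="batch"))
```

Routing short, exploratory jobs to a capped interactive pool keeps GPUs available for daily work even while multi-day training runs occupy the batch partition.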
Practical Considerations for Training LLMs
LLMs are trained with the causal language modeling objective, i.e., “predicting the next token in a sequence, given previous tokens.” While conceptually straightforward, implementing this at scale requires addressing several practical challenges. In this section, we discuss how CrowdStrike has tackled some of these challenges so far.
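As a minimal PyTorch sketch of that objective (with a toy embedding-plus-linear stand-in for a real transformer), the labels are simply the input tokens shifted by one position, and the loss is cross-entropy over the vocabulary:

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len, batch_size = 1000, 16, 4

# Toy stand-in for a transformer: an embedding followed by a linear head.
embed = torch.nn.Embedding(vocab_size, 64)
head = torch.nn.Linear(64, vocab_size)

tokens = torch.randint(0, vocab_size, (batch_size, seq_len))

# Predict token t+1 from tokens up to t: inputs are positions 0..n-2, targets 1..n-1.
inputs, targets = tokens[:, :-1], tokens[:, 1:]
logits = head(embed(inputs))          # (batch, seq_len - 1, vocab_size)

loss = F.cross_entropy(
    logits.reshape(-1, vocab_size),   # flatten all positions
    targets.reshape(-1),              # next-token labels
)
loss.backward()
print(f"causal LM loss: {loss.item():.3f}")
```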
Data Strategy
As with any machine learning application, high-quality, diverse datasets are fundamental to LLM performance. At CrowdStrike, we enhance training data through synthetic generation by guiding LLMs to generate, augment, reinterpret, or reword certain inputs, such as documents and code. Our researchers have produced more robust models after including synthetic data, especially for low-resource concepts such as domain-specific languages. Our generation process is an optimized pipeline built around a feedback loop with validation, filtering, and regeneration steps. Parallelizing the workload and making full use of the available hardware are critical for timely results and a tight feedback loop.
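At a high level, that feedback loop can be sketched as follows. The `generate` and `is_valid` functions, the prompts, and the thread-pool parallelism are placeholders for whatever model endpoint, validation rules, and execution backend a team uses; they are illustrative, not our production pipeline.

```python
from concurrent.futures import ThreadPoolExecutor

def generate(prompt: str) -> str:
    """Placeholder for a call to an LLM endpoint that rewrites or augments input."""
    raise NotImplementedError

def is_valid(sample: str) -> bool:
    """Placeholder validation: schema checks, deduplication, safety filters, etc."""
    raise NotImplementedError

def synthesize(seed: str, max_attempts: int = 3) -> str | None:
    """Generate a synthetic variant of `seed`, regenerating until it passes validation."""
    prompt = f"Rewrite the following while preserving its meaning:\n{seed}"
    for _ in range(max_attempts):
        candidate = generate(prompt)
        if is_valid(candidate):
            return candidate
        # Feed the rejection back into the next attempt.
        prompt = f"The previous rewrite was rejected; try a different rewording of:\n{seed}"
    return None  # Drop seeds that never yield a valid sample.

def synthesize_corpus(seeds: list[str], workers: int = 16) -> list[str]:
    """Run generation across many seeds in parallel to keep the feedback loop fast."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(synthesize, seeds)
    return [r for r in results if r is not None]
```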
CrowdStrike’s modular data pipelines assemble large-scale datasets by using recipes (specifications) and pulling data from multiple sources. For experimental cases where the dataset is not fixed and may grow over time, the data pipeline can effectively stream and construct training batches just-in-time for model training.
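One way to sketch that just-in-time construction is with a PyTorch `IterableDataset` that walks a recipe, pulls records from each source, and tokenizes lazily at iteration time. The recipe contents, the `read_source` helper, and the tokenizer interface below are hypothetical.

```python
import torch
from torch.utils.data import IterableDataset, DataLoader

# Hypothetical recipe: which sources to pull from. A real recipe would also
# carry sampling weights, filters, and preprocessing options.
RECIPE = ["threat_reports", "technical_docs", "synthetic_dsl"]

def read_source(name: str):
    """Placeholder: yield raw text records from a (possibly still growing) source."""
    raise NotImplementedError

class RecipeStream(IterableDataset):
    """Streams records source by source and tokenizes just-in-time."""

    def __init__(self, recipe, tokenizer, max_len=2048):
        # tokenizer: any callable mapping text to a list of token IDs
        self.recipe, self.tokenizer, self.max_len = recipe, tokenizer, max_len

    def __iter__(self):
        for source in self.recipe:
            for text in read_source(source):
                ids = self.tokenizer(text)[: self.max_len]
                yield torch.tensor(ids, dtype=torch.long)

# Batches are assembled lazily at iteration time, so the underlying sources can
# keep growing while a training job is running.
# loader = DataLoader(RecipeStream(RECIPE, tokenizer=my_tokenizer), batch_size=8)
```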
Distributed Computing for LLM Training
LLMs are trained on large-scale clusters using distributed computing. To make this possible, multiple parallelism strategies have been devised, including:
- Data parallelism (replicating the model across nodes so each processes a different subset of the data in parallel)
- Tensor parallelism (sharding individual tensors, such as weight matrices, across devices)
- Pipeline parallelism (splitting the model's layers into stages that run on different nodes)
These approaches spread memory and compute across devices, making it possible to train models that would otherwise not fit on the available hardware. Context parallelism and expert parallelism further help manage longer contexts and train mixture-of-experts architectures. Together, these five approaches constitute “5D parallelism,” with practitioners typically using a subset of them based on specific needs.
CrowdStrike researchers apply these techniques selectively, depending on project specifics like model size and available hardware. As the HuggingFace Ultrascale Playbook notes, “we can’t give a single unified recipe” since optimal approaches depend on specific hardware configurations. Many frameworks implement these training techniques, such as PyTorch (FSDP), Microsoft’s DeepSpeed, and HuggingFace’s Accelerate, which provides a common interface to both FSDP and DeepSpeed.
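As a minimal illustration of how these frameworks are used, the sketch below wraps a toy model in PyTorch's FSDP so its parameters, gradients, and optimizer state are sharded across GPUs. The model, hyperparameters, and launch command are assumptions for illustration, not one of our training configurations.

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for each worker process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model standing in for a transformer; FSDP shards its parameters,
    # gradients, and optimizer state across all participating GPUs.
    model = torch.nn.Sequential(
        torch.nn.Linear(4096, 4096),
        torch.nn.GELU(),
        torch.nn.Linear(4096, 4096),
    ).cuda()
    model = FSDP(model)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # One illustrative training step on random data.
    x = torch.randn(8, 4096, device="cuda")
    loss = model(x).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    # Example launch: torchrun --nproc_per_node=8 fsdp_sketch.py
    main()
```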
Hardware-specific Optimizations Given Training Configurations
To illustrate the importance of tailoring solutions to specific hardware, we compared identical training parameters across different GPU architectures. We tested two attention mechanism implementations on NVIDIA's H100 and B200 GPUs, using single GCE A3 Mega and A4 nodes, respectively. While Flash Attention 2 consistently outperformed scaled dot-product attention (SDPA), the magnitude of the speedup varied significantly between architectures. One configuration showed only a modest acceleration, while the other went from slowest to fastest performer once the attention mechanism was optimized.
These results, though specific to our experimental conditions (e.g., LLM architecture, attention mechanism, or context size requirements), highlight a key takeaway: Meaningful performance gains require optimizations that consider the complete software and hardware ecosystem.
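For reference, a comparison along these lines can be set up with the `attn_implementation` switch in Hugging Face Transformers, as sketched below. The model name, sequence length, and timing loop are illustrative rather than our exact benchmark, and `flash_attention_2` additionally requires the separate flash-attn package and a supported GPU.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B"  # illustrative; any decoder-only model works

def time_forward(attn_impl: str, steps: int = 10, seq_len: int = 4096) -> float:
    """Average forward-pass time for a given attention implementation."""
    model = AutoModelForCausalLM.from_pretrained(
        MODEL,
        torch_dtype=torch.bfloat16,
        attn_implementation=attn_impl,  # "sdpa" or "flash_attention_2"
    ).cuda()
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    batch = torch.randint(0, tokenizer.vocab_size, (1, seq_len), device="cuda")

    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(steps):
            model(batch)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / steps

for impl in ("sdpa", "flash_attention_2"):
    print(f"{impl}: {time_forward(impl):.3f} s/step")
```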