CrowdStrike’s Journey in Customizing NVIDIA Nemotron Models for Peak Accuracy and Performance

How CrowdStrike data scientists collaborate with NVIDIA to evaluate the NVIDIA family of Nemotron models and fine-tune them for real SOC workflows.

Today’s security teams need AI models that can reason over massive telemetry and support autonomous actions. At CrowdStrike, we're working closely with NVIDIA to operationalize NVIDIA Nemotron open models[1], building on our existing integration of Nemotron on Amazon Bedrock within the CrowdStrike Falcon® platform. This collaboration enables us to rigorously test and adapt large language models (LLMs) for security-specific workloads while maintaining production-grade performance and security.

As part of this effort, we developed a natural language-to-CQL translation model powered by NVIDIA Nemotron. CrowdStrike Query Language (CQL) is the syntax analysts use to search and analyze security data in the Falcon platform. By leveraging CrowdStrike's unique advantage of millions of real-world queries written by our security professionals, combined with NVIDIA NeMo Data Designer for synthetic data generation and targeted fine-tuning of Llama Nemotron Super 49B, our model outperforms closed-source frontier alternatives. This work demonstrates how domain-specific adaptation can unlock significant gains in accuracy, reliability, and interpretability for security-focused AI systems.

In this post, we walk through the technical pipeline behind this capability, including data collection and processing, synthetic data generation, fine-tuning, and evaluation, and we share how these techniques translate advanced model research into practical, mission-ready outcomes for defenders.

Data Gathering and Processing

Collection

At CrowdStrike, our analysts write millions of CQL queries annually while investigating threats and hunting for adversaries. This gives us a unique advantage: access to real-world queries that reflect how security professionals actually use the language.

We collected millions of queries from our internal analysts. However, this raw data presented two significant challenges:

  1. First, we lacked natural language descriptions. While we had millions of queries, we didn't have the corresponding human-readable explanations of what each query does. To train a model that translates from natural language to CQL, we needed pairs like "Show me all DNS requests to suspicious domains" matched with their CQL equivalents.
  2. Second, many queries were semantically identical with only parameter values differing. Training on millions of these near-duplicate queries would be inefficient and could bias our model toward the most common patterns rather than teaching it the full breadth of CQL capabilities.

Deduplication

To address the duplication problem, we built a custom deduplication system based on Abstract Syntax Trees (ASTs). ASTs are structural representations of queries that capture their logical meaning rather than just their text.

Standard string-matching approaches would miss semantic duplicates. For example, these queries are functionally identical but look different as strings:

  • table([@timestamp, source.ip, message])
  • table([message, @timestamp, source.ip])

Similarly, queries with the same structure but different parameter values (like different IDs) are semantic duplicates that we wanted to identify.

Our AST-based approach analyzes the logical structure of each query: the functions used, their arguments, and how they're composed. This allowed us to identify queries that accomplish the same thing regardless of parameter order, formatting, or specific values. This significantly reduced our dataset to a diverse set of unique query patterns suitable for training and improved the model’s ability to generalize to novel analyst questions.
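To make the idea concrete, here is a minimal sketch of structural deduplication. It is a toy stand-in for our production system, which walks a full CQL AST: instead, regexes mask parameter values and sort field lists so that queries differing only in values or field order collapse to the same signature.

```python
import re

def canonical_signature(query: str) -> str:
    """Reduce a query to a structural signature: mask parameter values,
    then sort field lists so that order does not matter (toy CQL handling)."""
    # Mask string literals first so IPs/numbers inside quotes are covered too.
    sig = re.sub(r'"[^"]*"', '"<VAL>"', query)
    sig = re.sub(r'\b\d+(?:\.\d+){3}\b', '<IP>', sig)
    sig = re.sub(r'\b\d+\b', '<NUM>', sig)

    # Sort field lists inside table([...]) so column order is irrelevant.
    def sort_fields(m: re.Match) -> str:
        fields = sorted(f.strip() for f in m.group(1).split(','))
        return f"table([{', '.join(fields)}])"

    return re.sub(r'table\(\[([^\]]*)\]\)', sort_fields, sig)

def dedupe(queries: list[str]) -> list[str]:
    """Keep one representative query per structural signature."""
    seen, unique = set(), []
    for q in queries:
        sig = canonical_signature(q)
        if sig not in seen:
            seen.add(sig)
            unique.append(q)
    return unique
```

With this, the two `table(...)` examples above map to a single signature, as do queries that differ only in literal values.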

PII Scrubbing

Our internal queries might include information like IP addresses and hostnames. Although this data stays internal, it isn’t needed to teach the model CQL syntax and structure, so we removed it for privacy purposes.

We built a custom PII scrubbing pipeline that replaces sensitive information with realistic fake values while preserving query structure. IP addresses become fake IPs in private ranges, hostnames maintain their formatting patterns, and the queries remain syntactically valid. Our PII approach achieved an F1 score of 99.35%, significantly outperforming third-party alternatives, allowing us to create a privacy-safe training dataset without sacrificing quality.
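A minimal sketch of the value-replacement idea, assuming simple regex patterns rather than our full pipeline: each distinct real IP maps consistently to a fake private-range address, and hostnames are swapped for a placeholder, so the query stays syntactically valid.

```python
import re

def scrub(query: str) -> str:
    """Replace IPs and hostnames with realistic fake values while
    preserving query structure (toy patterns, lowercase hostnames only)."""
    mapping: dict[str, str] = {}

    def fake_ip(m: re.Match) -> str:
        ip = m.group(0)
        # Same real IP always maps to the same fake private-range IP.
        if ip not in mapping:
            mapping[ip] = f"10.0.0.{len(mapping) + 1}"
        return mapping[ip]

    scrubbed = re.sub(r'\b(?:\d{1,3}\.){3}\d{1,3}\b', fake_ip, query)
    # Crude hostname pattern; a real scrubber would use a broader TLD list.
    scrubbed = re.sub(r'\b[a-z0-9-]+(?:\.[a-z0-9-]+)*\.(?:com|net|org|local)\b',
                      'host.example.com', scrubbed)
    return scrubbed
```

Because replacements are format-preserving, the scrubbed queries remain valid training examples.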

Synthetic Data Generation and Filtering

After deduplication and PII scrubbing, we had a diverse corpus of clean CQL queries. But we still faced our original challenge: We had the queries but no natural language descriptions explaining what each query does. To train a natural-language-to-CQL model, we needed pairs of human-readable questions matched with their CQL translations.

Rather than manually annotating thousands of queries (an expensive and time-consuming process), we leveraged NVIDIA NeMo Data Designer to generate synthetic natural language descriptions. For each CQL query, we prompted two LLMs (Llama Nemotron Super 49B and gpt-oss 120B) to create a corresponding natural language question that captures the query's intent. We used a multi-stage generation process with a co-teacher model that evaluates quality and suggests improvements, ensuring each generated description accurately reflects the query's purpose.

To add diversity and realism, we sampled different analyst personas (security analyst, IT operations, DevOps engineer) and complexity levels (simple to expert). This ensured our synthetic data reflected how different users would naturally phrase their queries.

We then filtered the generated pairs based on relevance and clarity scores, keeping only high-quality examples where the natural language accurately captured the CQL query's intent. The result was thousands of diverse, high-quality training pairs ready for model fine-tuning.
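The persona sampling and score-based filtering steps can be sketched as follows. The prompt template and score thresholds here are illustrative placeholders, not our production NeMo Data Designer configuration.

```python
import random

PERSONAS = ["security analyst", "IT operations", "DevOps engineer"]
LEVELS = ["simple", "intermediate", "advanced", "expert"]

def build_prompt(cql: str) -> str:
    """Compose a generation prompt with a randomly sampled persona and
    complexity level (hypothetical template for illustration)."""
    persona, level = random.choice(PERSONAS), random.choice(LEVELS)
    return (f"You are a {persona}. Write a {level} natural-language question "
            f"that the following CQL query answers:\n{cql}")

def filter_pairs(pairs: list[dict], min_relevance: int = 4,
                 min_clarity: int = 4) -> list[dict]:
    """Keep only pairs whose judge-assigned relevance and clarity scores
    (1-5 scale assumed here) clear both thresholds."""
    return [p for p in pairs
            if p["relevance"] >= min_relevance and p["clarity"] >= min_clarity]
```

In practice the scores would come from the co-teacher judge model; the filter itself is just a threshold pass over its outputs.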

Training and Evaluation

With our synthetic training data ready, we moved to the core challenge: teaching a model to translate natural language into valid CQL queries.

We used Llama-3.3-Nemotron-Super-49B-v1.5 as our base model and fine-tuned it using LoRA (Low-Rank Adaptation). This allowed us to specialize the model for CQL generation while maintaining reasonable training costs.

During training, we taught the model to generate reasoning steps before producing the final query. This approach helped the model break down complex translations into logical components, making its outputs more interpretable and improving its ability to handle novel queries.

We evaluated our model using two complementary methods: syntax validation to check if generated queries were executable, and semantic correctness evaluation by comparing generated queries against manually curated reference queries and their execution results.
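The two evaluation axes can be sketched as below; `run_query` and `judge_score` are hypothetical callables standing in for Falcon query execution and a semantic-equivalence judge scoring against a curated reference.

```python
def evaluate(generated: str, reference: str, run_query, judge_score) -> dict:
    """Score one generated query on two axes: does it execute (syntax
    validity), and how close is it semantically to the reference (1-5)?"""
    try:
        run_query(generated)
        valid = True
    except Exception:
        valid = False
    # Invalid queries get the floor semantic score (assumed convention).
    semantic = judge_score(generated, reference) if valid else 1
    return {"valid": valid, "semantic": semantic}

def aggregate(records: list[dict]) -> dict:
    """Roll per-query results up into the two benchmark metrics."""
    n = len(records)
    return {
        "valid_query_accuracy": sum(r["valid"] for r in records) / n,
        "mean_semantic_score": sum(r["semantic"] for r in records) / n,
    }
```

These two aggregates correspond to the columns reported in the benchmark table below.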

We benchmarked our fine-tuned model against state-of-the-art alternatives. The results highlight both the importance of domain-specific fine-tuning and the value fine-tuned models provide on specialized security tasks.

Table 1. Our fine-tuned model achieved 96% valid query accuracy, outperforming all baselines including GPT-4o and Claude Sonnet 4.5.
| Model | Valid Queries Accuracy (max 1) | Semantic Score (max 5) |
| --- | --- | --- |
| GPT-4o | 0.61 | 1.51 |
| Claude Sonnet 4.5 | 0.94 | 2.35 |
| Llama Nemotron 49B | 0.58 | 1.46 |
| gpt-oss 120B | 0.72 | 1.88 |
| Llama Nemotron 49B (fine-tuned) | 0.96 | 2.50 |

Looking Ahead

Our collaboration with NVIDIA continues to evolve as we explore the next generation of capabilities with the NVIDIA Nemotron 3 family of open models. A core principle of our approach is matching the right model to the right security requirement, balancing accuracy, latency, cost, and scale for each specific workload.

The Right Model for the Right Security Workload

The Nemotron 3 family introduces a tiered architecture spanning highly efficient small models to large-scale reasoning engines. This range allows us to explore the most appropriate model for each security use case: Nemotron 3 Nano, Super, and Ultra are each of interest for different classes of security workloads.

Small language models (SLMs) are essential when speed is critical and workloads run at massive scale, as security operations demand. In CrowdStrike’s detection and response, we correlate threats across global telemetry and must keep pace with accelerating adversary speed. We'll be rigorously testing Nemotron 3 Nano’s performance on these massive-scale security workloads to see how it enables faster, more scalable operations without sacrificing accuracy.

For complex investigation workflows where multiple AI agents need to collaborate (correlating threat intelligence, analyzing attack chains, coordinating response actions), Nemotron 3 Super will be assessed for enhanced reasoning capabilities while maintaining production-grade performance.

For advanced threat modeling and deep investigative reasoning like comprehensive incident reconstruction and strategic threat hunting, Nemotron 3 Ultra will be evaluated for sophisticated analytical capabilities when inference speed matters less.

This tiered approach lets us optimize the balance between performance, cost, and capability across our security operations.

Exploring Safety Models for AIDR

We are also exploring specialized safety and content moderation models for our AI detection and response (AIDR) research. Specifically, we're evaluating:

  • Llama Nemotron Safety Guard 8B v3 for detecting adversarial prompts and unsafe AI interactions across multiple languages
  • Nemotron Content Safety Reasoning 4B for adhering to domain-specific safety policies with reasoning

These models represent promising avenues for building robust defenses and ensuring safe AI operations.

In Summary

By leveraging NVIDIA Nemotron models and NVIDIA NeMo Data Designer, and applying domain-specific fine-tuning techniques, we were able to significantly improve model performance on a real-world security task: translating natural language into executable CQL. The resulting natural-language-to-CQL model achieves high query validity and semantic accuracy, enabling analysts to focus on threat hunting and investigation rather than query syntax, reducing friction in day-to-day workflows and accelerating time-to-insight across the SOC.

More broadly, this work reflects the value of CrowdStrike’s ongoing collaboration with NVIDIA to advance agentic security. Fine-tuning Llama Nemotron Super 49B demonstrates how deep security expertise, high-quality synthetic data, and rigorous evaluation pipelines can turn powerful foundation models into trusted components of autonomous security workflows, helping customers detect threats faster, investigate incidents more efficiently, and operate at scale without increasing analyst burden.

Based on our successes, we are beginning early evaluations of the latest Nemotron 3 family of open models, which introduce deeper reasoning capabilities, improved long-context performance, and stronger log and code understanding. We’re excited to continue working with NVIDIA to explore how these advances and greater model selection flexibility can further strengthen agentic workflows to deliver higher detection fidelity, more reliable automation, and greater confidence in AI-driven security decisions for our customers across the Falcon platform.

Additional Resources

[1] While we are evaluating multiple models, the model we are working to operationalize in the context of this blog is Llama-3.3-Nemotron-Super-49B-v1.5.