Dealing with Out-of-memory Conditions in Rust

We recently integrated new functionality into our CrowdStrike Falcon sensor that was implemented in Rust. Rust is a relatively young language with several features focused on safety and security. Calling Rust from C++ was relatively straightforward, but one stumbling block we’ve run into is how Rust deals with out-of-memory (OOM) conditions.

Let’s start by defining what we mean by “out of memory”: Specifically, we mean that the underlying allocator returns NULL for an attempted allocation. You may have never seen malloc() return NULL in practice. On Linux in its default configuration, it’s nearly impossible, as noted by the man page for malloc:

By default, Linux follows an optimistic memory allocation strategy. This means that when malloc() returns non-NULL there is no guarantee that the memory really is available. In case it turns out that the system is out of memory, one or more processes will be killed by the OOM killer.

If the system is out of memory, malloc will still return a non-NULL pointer, but then the OOM killer will get involved and start terminating processes via SIGKILL. Working with Linux in this configuration can lull one into a false sense of security in terms of dealing with allocation errors. However, plenty of systems are not configured to overcommit memory like this, and on those it’s important to know what your application will do if an allocation attempt fails.

Rust OOM Behavior

Error handling in Rust is typically covered either by returning a Result, which forces the caller to handle the error in some way at an API level, or by panicking, which unwinds the stack and is primarily used for errors that aren’t meant to be recoverable (e.g., indexing past the end of an array or failing an assertion on some pre- or post-condition). Given those patterns, one might expect OOM events to result in a panic, but they don’t: Today, OOM events result in Rust immediately terminating the process without unwinding. This behavior can surprise people unfamiliar with it; I’ve seen that surprise internally, and it has been voiced publicly (e.g., by the author of libcurl in their investigations of Rust as a possible backend).
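To make the abort path more concrete, here is a minimal sketch of the two layers involved, using only std::alloc. The raw allocator API reports failure by returning a null pointer; it is the caller (the standard collections, for example) that turns that null into an abort by calling std::alloc::handle_alloc_error. The RawBuf type and its methods are hypothetical names used only for illustration.

```rust
use std::alloc::{alloc, dealloc, Layout};
use std::ptr::NonNull;

/// Hypothetical byte buffer that reports allocation failure to the
/// caller instead of aborting the process.
pub struct RawBuf {
    ptr: NonNull<u8>,
    layout: Layout,
}

impl RawBuf {
    /// Try to allocate `size` bytes. Returns `None` if the request is
    /// invalid or if the allocator reports failure.
    pub fn try_new(size: usize) -> Option<RawBuf> {
        // `alloc` must not be called with a zero-sized layout.
        if size == 0 {
            return None;
        }
        let layout = Layout::from_size_align(size, 1).ok()?;
        // SAFETY: `layout` has a non-zero size.
        let raw = unsafe { alloc(layout) };
        // A null pointer is how the allocator signals failure. At this
        // point the standard collections would instead call
        // `std::alloc::handle_alloc_error(layout)`, which by default
        // prints a message and aborts.
        let ptr = NonNull::new(raw)?;
        Some(RawBuf { ptr, layout })
    }

    /// Number of bytes owned by this buffer.
    pub fn size(&self) -> usize {
        self.layout.size()
    }
}

impl Drop for RawBuf {
    fn drop(&mut self) {
        // SAFETY: `ptr` was returned by `alloc` with exactly `layout`.
        unsafe { dealloc(self.ptr.as_ptr(), self.layout) }
    }
}
```

Calling RawBuf::try_new(64) should normally succeed, while an absurdly large request such as RawBuf::try_new(usize::MAX) comes back as None rather than terminating the process.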

I absolutely do not intend to cast aspersions on the Rust language or team. The choice to abort on OOM is certainly defensible, but the Motivation section of the approved RFC to add fallible allocation to the standard library collections notes that it is insufficient as a long-term solution:

Many collection methods may decide to allocate (push, insert, extend, entry, reserve, with_capacity, …) and those allocations may fail. Early on in Rust’s history we made a policy decision not to expose this fact at the API level, preferring to abort. This is because most developers aren’t prepared to handle it, or interested. Handling allocation failure haphazardly is likely to lead to many never-tested code paths and therefore bugs. We call this approach infallible collection allocation, because the developer model is that allocations just don’t fail.

Unfortunately, this stance is unsustainable in several of the contexts Rust is designed for. This RFC seeks to establish a basic fallible collection allocation API, which allows our users to handle allocation failures where desirable.

That RFC is a goldmine of background and forward-looking plans, and I highly recommend reading it in full. It goes on to outline several use cases and a plan to address them. Our use case is similar to the server use case described in the RFC. We want to run Rust-implemented components inside a userspace program that is executing multiple tasks, and want OOM events on a task to only result in the failure of that task; other tasks (in the same process) should continue unimpeded.

Unfortunately, the solutions described in the RFC are not yet available, at least on stable Rust (some, but not all, are available on nightly). Furthermore, implementing all of them may be challenging, particularly changing OOMs to panic and unwind instead of abort: There may be unsafe code in the standard library or published crates that assumes allocation never fails, and which becomes unsound (i.e., introduces undefined behavior) if allocation failures are allowed to unwind. If one wants to handle OOM events today, on stable Rust, the options include:

  1. Let OOM events abort the process. This is by far the simplest option as it requires no extra work: You can use the full Rust language, standard library and third-party crates. I think this is probably the right solution for the vast majority of applications. Any robust system needs to be able to restart due to external causes (hardware failure, crash in a system library, root user killing a process). However, for a cybersecurity company, it is a bitter pill to integrate a Rust component into a larger C++ program where the C++ pieces can all recover from OOM and the Rust component cannot, and it has an adverse effect on the automated tests we have, as we see crashes caused by OOM events inside the Rust component.
  2. Switch to a no_std environment. This is typically used for microcontroller work, but it can be used in any context. It disables large chunks of the Rust standard library, including those that allocate memory, and will also restrict what third-party crates are usable. Depending on how much code you’ve already written, this may be very costly, particularly if you’re using nontrivial crates that are not no_std-compatible.
  3. Use the try_* methods outlined in the RFC above to convert allocation failures into Results that can be handled at the API level. At the moment (Rust 1.48), these are still unstable and therefore only available on a nightly compiler, but there are also third-party crates that make them usable: fallible_collections extends many standard library types to add the proposed RFC methods, for example, and hashbrown (which is the standard library HashMap/HashSet implementation) exposes a try_reserve method on its HashMap and HashSet.
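As a rough illustration of what Option 3 looks like at a call site, here is a minimal sketch that uses only the standard library. It assumes a toolchain newer than the 1.48 used in this post: try_reserve on the standard collections was stabilized in Rust 1.57. On older stable compilers the same shape applies with the hashbrown crate instead. The function name insert_fallible is hypothetical.

```rust
use std::collections::{HashMap, TryReserveError};

/// Hypothetical helper: insert a key/value pair, reporting allocation
/// failure as an error instead of aborting the process.
/// Assumes Rust 1.57 or later, where `HashMap::try_reserve` is stable.
pub fn insert_fallible(
    map: &mut HashMap<u32, u32>,
    key: u32,
    value: u32,
) -> Result<Option<u32>, TryReserveError> {
    // Grow the table (if needed) through the fallible path first...
    map.try_reserve(1)?;
    // ...so that the insert below has room and does not need to grow.
    Ok(map.insert(key, value))
}
```

The caller now sees a Result and has to decide what an allocation failure means for the current task, which is exactly the API-level handling the RFC describes.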

Option 2 may be very difficult if you’ve already pulled in third-party dependencies that do not support no_std. The rest of this blog expands on Option 3. An immediate problem with Option 3 is that there isn’t a good way to know that you’ve found and updated every call site that might allocate. The above RFC mentions a desire for “some kind of system to prevent a function from ever infallibly allocating,” which it mentions could be implemented via some kind of lint. No such lint is available today, so instead we attempted to cover this by testing our Rust component with a custom global allocator that intentionally injects OOM events.

Testing Rust OOM Handling with a Global Allocator

Rust allows you to replace the global allocator with your own implementation. Typically, this is used to switch between the system allocator and jemalloc, but we can also use it for borderline-nefarious purposes: We’ll write a custom global allocator that intentionally fails some of the time. Of course, this would be a horrible thing to do in a real application, so we’ll restrict its use to a single test.

Let’s start by writing a simple library that allocates a couple of different ways. This example was written with the most recent stable version of Rust at the time of writing (1.48.0).

[~]% cargo new --lib oom-demo
     Created library `oom-demo` package
[~]% cd oom-demo
# … edit src/lib.rs ...
[~/oom-demo]% cat src/lib.rs
use std::collections::HashMap;

#[derive(Debug, Default)]
pub struct Counter {
    items: HashMap<u32, Vec<u32>>,
}

impl Counter {
    pub fn push_key_value(&mut self, key: u32, value: u32) {
        self.items.entry(key).or_default().push(value);
    }

    pub fn values_for_key(&self, key: u32) -> Option<&[u32]> {
        self.items.get(&key).map(Vec::as_slice)
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn it_works() {
        let mut c = Counter::default();
        c.push_key_value(0, 1);
        c.push_key_value(0, 2);
        c.push_key_value(5, 100);
        c.push_key_value(5, 4);
        assert_eq!(c.values_for_key(0).unwrap(), &[1, 2]);
        assert_eq!(c.values_for_key(5).unwrap(), &[100, 4]);
        assert_eq!(c.values_for_key(1), None);
    }
}

This is a trivial little library that accumulates values for a given key, but it allocates in a couple of different places, so it should be good enough to demonstrate an OOM-injecting global allocator. A global allocator is, unsurprisingly, global, which means that running tests with a custom allocator needs to be done with a little care: cargo test will run multiple test threads (which could be problematic for us), and replacing the global allocator will affect allocations made by the test framework itself. We’ll tackle the multithreading concern by putting our OOM injection test into its own test file with just a single #[test] inside.

Restricting our OOM injection to just call sites inside our library is a little trickier. We’ll use an AtomicBool to enable/disable OOM injection, and only turn it on while we’re calling into our library (yet another reason to restrict this to a single thread!). Make a tests directory, and put this into tests/oom-injection.rs:

use oom_demo::Counter;
use std::alloc::{GlobalAlloc, Layout, System};
use std::ptr;
use std::sync::atomic::{AtomicBool, Ordering};

struct OomAllocator {
    enable_oom_injection: AtomicBool,
}

impl OomAllocator {
    fn enable_oom_injection(&self) {
        self.enable_oom_injection.store(true, Ordering::Relaxed);
    }
    fn disable_oom_injection(&self) {
        self.enable_oom_injection.store(false,
            Ordering::Relaxed);
    }
    fn is_oom_injection_enabled(&self) -> bool {
        self.enable_oom_injection.load(Ordering::Relaxed)
    }
}

unsafe impl GlobalAlloc for OomAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        if self.is_oom_injection_enabled() {
            // OOM injection enabled - return NULL
            return ptr::null_mut();
        } else {
            // no OOM injection - defer to system allocator
            System.alloc(layout)
        }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static GLOBAL: OomAllocator = OomAllocator {
    enable_oom_injection: AtomicBool::new(false),
};

// NOTE: It is critical that we only have one #[test] in this
// file to avoid bad interactions between our OOM-injecting
// global allocator and `cargo test` multithreading!
#[test]
fn run_demo_with_oom_injection() {
    // For now, just repeat the test from lib.rs.
    let mut c = Counter::default();
    c.push_key_value(0, 1);
    c.push_key_value(0, 2);
    c.push_key_value(5, 100);
    c.push_key_value(5, 4);
    assert_eq!(c.values_for_key(0).unwrap(), &[1, 2]);
    assert_eq!(c.values_for_key(5).unwrap(), &[100, 4]);
    assert_eq!(c.values_for_key(1), None);
}

This gives us the skeleton of what we need:

  • It sets up a global allocator using our custom type.
  • The allocator contains an AtomicBool that toggles OOM injection on or off.
  • It runs the same unit test we had for our library; it’s important to cover all of the code paths we expect could possibly allocate.

But it’s still incomplete. We never actually enable OOM injection, and if we did, it would return NULL on every attempted allocation, which isn’t going to give us very much coverage. On top of that, when we inject an allocation failure we don’t know the source of the allocation attempt. It may be tempting to change that “return ptr::null_mut();” into a panic, but that is explicitly not allowed according to the documentation of the GlobalAlloc trait:

It’s undefined behavior if global allocators unwind. This restriction may be lifted in the future, but currently a panic from any of these functions may lead to memory unsafety.

There are many options for deciding when to actually return NULL; the simplest is to randomly return NULL some percentage of the time (while OOM injection is enabled). That may not be appropriate if the chance of hitting a particular allocation call site in your code is low. We’ll discuss some options for that later, but for this example it should be good enough. We’ll also pull in the backtrace crate to log where we are in the call stack when we inject an OOM. Add this to Cargo.toml:

[dev-dependencies]
backtrace = "0.3"
rand = "0.7"

And then update oom-injection.rs like so:

@@ -1,8 +1,11 @@
+use backtrace::Backtrace;
 use oom_demo::Counter;
 use std::alloc::{GlobalAlloc, Layout, System};
 use std::ptr;
 use std::sync::atomic::{AtomicBool, Ordering};

+const OOM_INJECTION_PROBABILITY: f32 = 0.1;
+
 struct OomAllocator {
     enable_oom_injection: AtomicBool,
 }
@@ -22,8 +25,14 @@ impl OomAllocator {

 unsafe impl GlobalAlloc for OomAllocator {
     unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
-        if self.is_oom_injection_enabled() {
-            // OOM injection enabled - return NULL
+        if self.is_oom_injection_enabled()
+            && rand::random::<f32>() < OOM_INJECTION_PROBABILITY
+        {
+            // Generating a backtrace will require allocation.
+            // Disable OOM injection while generating and
+            // printing it, then re-enable it. This will behave
+            // strangely if there are multiple threads 
+            // allocating while we run this test!
+            self.disable_oom_injection();
+            println!("injecting OOM from {:?}",
+                Backtrace::new());
+            self.enable_oom_injection();
             return ptr::null_mut();
         } else {
             // no OOM injection - defer to system allocator
@@ -42,12 +47,14 @@ static GLOBAL: OomAllocator = OomAllocator {
     enable_oom_injection: AtomicBool::new(false),
 };

 #[test]
 fn run_demo_with_oom_injection() {
-    // For now, just repeat the test from lib.rs.
+    // Enable random OOM injection; repeat test many times.
+    GLOBAL.enable_oom_injection();
+    for _ in 0..1_000 {
         let mut c = Counter::default();
         c.push_key_value(0, 1);
         c.push_key_value(0, 2);
@@ -56,5 +63,6 @@ fn run_demo_with_oom_injection() {
         assert_eq!(c.values_for_key(0).unwrap(), &[1, 2]);
         assert_eq!(c.values_for_key(5).unwrap(), &[100, 4]);
         assert_eq!(c.values_for_key(1), None);
+    }
+    GLOBAL.disable_oom_injection();
 }

Notes on the changes:

  • We only inject an OOM 10% of the time (when OOM injection is enabled).
  • We temporarily disable OOM injection while generating and printing a backtrace, which itself allocates memory. This is a concession to the fact that our allocator is indeed global: If we want to allocate memory from within our allocator itself, we ultimately end up recursing back into our own alloc method.
  • We run our test 1,000 times. This is certainly overkill for this demo and library. If we’re using randomness to decide when to inject an OOM, we want to set this high enough to have confidence that we’ll hit all the attempted allocations in our library.
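One optional refinement to the enable/disable calls above is to tie them to a scope, so that injection is switched off on every exit path. The following is a minimal sketch of that idea using a guard type; it is not part of the test as written. The static INJECT_OOM stands in for the allocator’s enable_oom_injection field, and the names OomInjectionGuard and is_injection_enabled are hypothetical.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Stand-in for the `enable_oom_injection` flag inside the allocator.
static INJECT_OOM: AtomicBool = AtomicBool::new(false);

/// Hypothetical guard: OOM injection is enabled for as long as a value
/// of this type is alive.
pub struct OomInjectionGuard(());

impl OomInjectionGuard {
    /// Turn injection on and return a guard that turns it off again.
    pub fn new() -> Self {
        INJECT_OOM.store(true, Ordering::Relaxed);
        OomInjectionGuard(())
    }
}

impl Drop for OomInjectionGuard {
    fn drop(&mut self) {
        // Runs when the guard goes out of scope, including while the
        // stack unwinds after a failed assertion.
        INJECT_OOM.store(false, Ordering::Relaxed);
    }
}

/// What the allocator would consult before deciding to return NULL.
pub fn is_injection_enabled() -> bool {
    INJECT_OOM.load(Ordering::Relaxed)
}
```

Note the limit of this approach: when an assertion fails, the panic message is formatted and printed before unwinding begins, so the guard does not protect that step from injected failures. It only ensures that code running after the scope ends, such as the test harness’s own reporting, sees a normal allocator again.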

Now if we run cargo test -- --nocapture, we should see a backtrace followed by something like this, although you may see a different memory allocation amount if you’re following along:

memory allocation of 16 bytes failed
error: test failed, to rerun pass '--test oom-injection'

This is progress! We printed a backtrace of where we were injecting an OOM, and then we got the spartan log message from Rust that it prints prior to aborting. If you rerun this a few times, you may see a different memory allocation amount in the log message and a different backtrace, because our library has two different points where it allocates memory.

The backtrace is large and noisy, but try keying in on the frames around oom_demo::Counter::push_key_value. With a few runs to account for randomness, you should see both of these subsets of backtraces:

  14: std::collections::hash::map::HashMap::entry
             at std/src/collections/hash/map.rs:704:19
  15: oom_demo::Counter::push_key_value
             at src/lib.rs:10:9

---

  14: alloc::vec::Vec::push
             at alloc/src/vec.rs:1210:13
  15: oom_demo::Counter::push_key_value
             at src/lib.rs:10:9

These are the two allocation call sites of our library: one where we ask for the entry of a hash map (which may have to reallocate to make space for the new key/value pair), and another where we attempt to push onto a vector.

We can now update our library and use fallible_collections and hashbrown to grow our containers in a way that we can catch allocation errors. Add fallible_collections = "0.3" and hashbrown = "0.9" to our library’s dependencies, then make these changes to src/lib.rs:

@@ -1,4 +1,5 @@
-use std::collections::HashMap;
+use fallible_collections::FallibleVec;
+use hashbrown::{HashMap, TryReserveError};

 #[derive(Debug, Default)]
 pub struct Counter {
@@ -6,8 +7,10 @@ pub struct Counter {
 }

 impl Counter {
-    pub fn push_key_value(&mut self, key: u32, value: u32) {
-        self.items.entry(key).or_default().push(value);
+    pub fn push_key_value(&mut self, key: u32, value: u32)
+        -> Result<(), TryReserveError>
+    {
+        // Make space for a new key - this is unnecessary if
+        // `key` is already present, but it's probably cheaper
+        // to do this every time (since it will only grow if
+        // needed) than to guard it with a lookup of `key`.
+        // In real code, profile to be sure!
+        self.items.try_reserve(1)?;
+        // `.entry()` should no longer reallocate here, and
+        // we replace .push() with .try_push() for the vector
+        self.items.entry(key).or_default().try_push(value)?;
+        Ok(())
     }

     pub fn values_for_key(&self, key: u32) -> Option<&[u32]> {
@@ -22,10 +25,10 @@ mod tests {
     #[test]
     fn it_works() {
         let mut c = Counter::default();
-        c.push_key_value(0, 1);
-        c.push_key_value(0, 2);
-        c.push_key_value(5, 100);
-        c.push_key_value(5, 4);
+        c.push_key_value(0, 1).unwrap();
+        c.push_key_value(0, 2).unwrap();
+        c.push_key_value(5, 100).unwrap();
+        c.push_key_value(5, 4).unwrap();
         assert_eq!(c.values_for_key(0).unwrap(), &[1, 2]);
         assert_eq!(c.values_for_key(5).unwrap(), &[100, 4]);
         assert_eq!(c.values_for_key(1), None);

We’ll also need to update our OOM injection tests both for the change to the function’s return type and because the assertions we have in the test may not be valid: If we try to push a new key/value pair but an allocation fails, that pair won’t be pushed. Make these changes to tests/oom-injection.rs:

@@ -61,12 +61,10 @@ fn run_demo_with_oom_injection() {
     GLOBAL.enable_oom_injection();
     for _ in 0..1_000 {
         let mut c = Counter::default();
-        c.push_key_value(0, 1);
-        c.push_key_value(0, 2);
-        c.push_key_value(5, 100);
-        c.push_key_value(5, 4);
-        assert_eq!(c.values_for_key(0).unwrap(), &[1, 2]);
-        assert_eq!(c.values_for_key(5).unwrap(), &[100, 4]);
+        let _ = c.push_key_value(0, 1);
+        let _ = c.push_key_value(0, 2);
+        let _ = c.push_key_value(5, 100);
+        let _ = c.push_key_value(5, 4);
         assert_eq!(c.values_for_key(1), None);
     }
     GLOBAL.disable_oom_injection();

We should now be able to run cargo test and see both the unit tests and the OOM injection test passing. Our library will no longer abort on an OOM event and will instead return an error.

Conclusion

This blog walked through one possible way to adapt a library to handle OOM events without aborting the process. In general, I would not recommend giving libraries this kind of treatment, as it has multiple real costs:

  • There are runtime costs with allocation checking. You need to be careful with how you convert allocation failures to errors. For example, calling try_push in a loop is probably a terrible idea, and instead you should try_reserve prior to looping. This is an extra dimension of concern that you don’t usually need to consider.
  • There are API and implementation complexity costs that will affect maintenance over time. You may be able to offset this somewhat by panicking instead of returning Result, if appropriate, although be aware that panicking itself allocates memory.
  • There may be better solutions in the future — e.g., full support for panicking on OOM in a future version of Rust. I’d like to keep our workarounds to a minimum in hope of switching to the language-supported techniques as soon as possible.
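The first cost above, reserving once instead of checking on every push, can be sketched as follows. This is an illustration under assumptions rather than code from our library: it uses the standard library’s Vec::try_reserve, which requires Rust 1.57 or later, and the function name extend_fallible is hypothetical.

```rust
use std::collections::TryReserveError;

/// Hypothetical helper: append all of `src` to `dst`, or fail without
/// modifying `dst` if the memory cannot be reserved.
/// Assumes Rust 1.57 or later, where `Vec::try_reserve` is stable.
pub fn extend_fallible(
    dst: &mut Vec<u32>,
    src: &[u32],
) -> Result<(), TryReserveError> {
    // One fallible reservation covers the whole batch...
    dst.try_reserve(src.len())?;
    // ...so the copy below stays within the reserved capacity and
    // does not reallocate.
    dst.extend_from_slice(src);
    Ok(())
}
```

Besides doing a single capacity check, this shape has an all-or-nothing property: if the reservation fails, dst is left exactly as it was, which is usually easier to reason about than a partially completed loop.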

However, if you weigh the costs and decide that trying to handle OOM events in stable Rust today is worthwhile, hopefully this blog gave you some ideas. If you want to pursue the OOM-injecting-global-allocator test plan sketched out above, consider that there are almost certainly more useful ways of deciding when to inject an OOM event: You could count allocations in an attempt to cover every one, you could scan the backtrace looking for particular symbols when deciding whether to return NULL, etc. There are a lot of options available, and it’s likely you’ll need to tweak things to apply this to any nontrivial application.
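As one example of the counting idea, here is a minimal sketch of an allocator that fails exactly one allocation, chosen by index, instead of failing at random. It is an assumption-laden illustration, not the allocator we use: the type FailNthAllocator and its methods are hypothetical names, and it is shown as a plain struct so it can be exercised directly; in a test file it would be registered with #[global_allocator] in the same way as OomAllocator.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::ptr;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical allocator that fails the allocation whose zero-based
/// index equals `fail_at`. `usize::MAX` means "never fail".
pub struct FailNthAllocator {
    seen: AtomicUsize,
    fail_at: AtomicUsize,
}

impl FailNthAllocator {
    pub const fn new() -> Self {
        FailNthAllocator {
            seen: AtomicUsize::new(0),
            fail_at: AtomicUsize::new(usize::MAX),
        }
    }

    /// Reset the counter and arm a failure for allocation number `n`.
    pub fn fail_at(&self, n: usize) {
        self.seen.store(0, Ordering::Relaxed);
        self.fail_at.store(n, Ordering::Relaxed);
    }

    /// Stop injecting failures.
    pub fn disarm(&self) {
        self.fail_at.store(usize::MAX, Ordering::Relaxed);
    }

    /// Number of allocation attempts since the last `fail_at` call.
    pub fn allocations_seen(&self) -> usize {
        self.seen.load(Ordering::Relaxed)
    }
}

unsafe impl GlobalAlloc for FailNthAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // Count every attempt; only atomics are touched here, so the
        // bookkeeping itself never allocates.
        let index = self.seen.fetch_add(1, Ordering::Relaxed);
        if index == self.fail_at.load(Ordering::Relaxed) {
            return ptr::null_mut();
        }
        // SAFETY: the caller upholds the `GlobalAlloc::alloc` contract.
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        // SAFETY: `ptr` came from `System.alloc` with this `layout`.
        unsafe { System.dealloc(ptr, layout) }
    }
}
```

A test built on this could call fail_at(0), run the scenario, then fail_at(1), and so on, stopping once allocations_seen() reports that the scenario finished without ever reaching the armed index. Unlike the random approach, each allocation site is then failed exactly once, and a failing run can be reproduced from its index. As with OomAllocator, this only behaves predictably when a single thread is allocating.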
