Oh No! My Data Science Is Getting Rust-y

Python is one of the most popular programming languages for data scientists, and for good reason. The Python Package Index (PyPI) hosts a vast array of impressive data science library packages, such as NumPy, SciPy, Natural Language Toolkit, Pandas and Matplotlib. The sheer number of available high-quality analytic libraries and its massive developer community make Python an easy choice for many data scientists.

Many of these libraries are implemented in C and C++ for performance reasons, but provide foreign function interfaces (FFIs) or Python bindings so you can call those functions from Python. These “lower-level” language implementations are used to mitigate some common criticisms of Python, specifically execution time and memory consumption. Bounding execution time and memory consumption simplifies scalability, which is critical for cost reduction. If we can write performant code to accomplish data science tasks, then integration with Python is a major advantage.

The intersection of data science and malware analysis requires not only fast execution time but also efficient use of shared resources for scalability. Scalability is critical for “big data” problems like efficiently processing data for millions of executables for multiple platforms. Getting good performance on modern processors requires parallelism, typically via multiple threads, but efficient execution time and memory usage are also necessary. Balancing local system resources can be difficult with this kind of problem, and correctly implementing multi-threaded systems is even more difficult. C and C++ inherently do not provide thread safety. While external platform-specific libraries exist, the onus is clearly on the developer to preserve thread safety.

Parsing malware is inherently dangerous. Malware often manipulates file format data structures in unanticipated ways to cause analysis utilities to fail. One relatively common Python parsing pitfall is caused by the lack of strong type safety. Python’s gratuitous acceptance of None values when a bytearray was expected can easily lead to general mayhem without littering the code with None checks. These assumptions related to “duck typing” often lead to failures.

Enter Rust. The Rust language makes many claims that align well with an ideal solution to the potential problems identified above: execution time and memory consumption comparable to C and C++, along with providing extensive thread safety. The Rust language offers additional beneficial features, such as strong memory-safety guarantees and no runtime overhead. No runtime overhead simplifies Rust code integration with other languages, including Python. In this blog, we take Rust for a short test drive to see if the hype is warranted.

Example Data Science Application

Data science is a very broad field with far too many applications to discuss in a single blog post. An example of a simple data science task is to compute information entropy for byte sequences. The general formula for computing entropy in bits is (see Wikipedia; information entropy):

$$H(X) = -\sum_i P_X(x_i) \log_2 P_X(x_i)$$

To compute the entropy for a random variable $X$, we first count the occurrences of each possible byte value $x_i$ and divide by the total number of occurrences to calculate the probability $P_X(x_i)$ of a particular value $x_i$ occurring. Then we calculate the negative of the sum, over all values, of each probability $P_X(x_i)$ multiplied by its base-2 logarithm $\log_2 P_X(x_i)$; the negated logarithm, $-\log_2 P_X(x_i)$, is the so-called self-information. Since we are computing entropy in bits, we use $\log_2$ (note base 2 for bits).

Let’s put Rust to the test and see how it performs on entropy calculations against pure Python and even some of the wildly popular Python libraries mentioned above. This is a simplistic assessment of how performant Rust can be for data science applications, not a criticism of Python or the excellent libraries available. In these tests, we will generate a custom C library from Rust code that we can import from Python. All tests were run on Ubuntu 18.04.
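As a quick sanity check on the entropy formula, the following minimal sketch evaluates it on three tiny inputs whose answers are easy to work out by hand. The helper name `shannon_entropy_bits` and the choice to return zero for empty input are assumptions made for illustration; this is not one of the benchmarked implementations.

```python
import math
from collections import Counter


def shannon_entropy_bits(data: bytes) -> float:
    """Return H(X) in bits for the byte values in `data` (hypothetical helper)."""
    total = len(data)
    if total == 0:
        return 0.0  # assumption: treat empty input as zero entropy
    entropy = 0.0
    for count in Counter(data).values():
        probability = count / total
        entropy -= probability * math.log2(probability)
    return entropy


constant = shannon_entropy_bits(b"aaaa")            # one value, P = 1
two_values = shannon_entropy_bits(b"aabb")          # two values, P = 1/2 each
uniform = shannon_entropy_bits(bytes(range(256)))   # 256 values, P = 1/256 each
print(constant, two_values, uniform)
```

For byte data the result always lies between 0 bits (a single repeated value) and 8 bits (all 256 values equally likely), which is a handy bound when checking any of the implementations discussed here.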

Pure Python

We start with a simple pure Python function (in entropy.py) that calculates the entropy of a bytearray using only the standard library math module. This function is not optimized and provides a baseline for modifications and performance measurements.

    import math


    def compute_entropy_pure_python(data):
        """Compute entropy on bytearray `data`."""
        counts = [0] * 256
        entropy = 0.0
        length = len(data)

        # count occurrences of each byte value
        for byte in data:
            counts[byte] += 1

        # accumulate -p * log2(p) for every byte value that occurs
        for count in counts:
            if count != 0:
                probability = float(count) / length
                entropy -= probability * math.log(probability, 2)

        return entropy

Python with NumPy and SciPy

As you might expect, SciPy provides a function to compute entropy. We will use NumPy’s bincount() function to compute the byte frequencies first. Comparing the performance of SciPy’s entropy function to the other implementations is a bit unfair, because the SciPy implementation has additional functionality to compute relative entropy (the Kullback-Leibler divergence). Again, we’re just going for a (hopefully not too) slow test drive to see the performance of Rust compiled libraries imported from Python. Following is a SciPy-based implementation included in our entropy.py script.

    import numpy as np
    from scipy.stats import entropy as scipy_entropy


    def compute_entropy_scipy_numpy(data):
        """Compute entropy on bytearray `data` with SciPy and NumPy."""
        # one count per possible byte value, including values that never occur
        counts = np.bincount(bytearray(data), minlength=256)
        return scipy_entropy(counts, base=2)
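For readers unfamiliar with the relative entropy mentioned above, here is a minimal standard-library sketch of the Kullback-Leibler divergence in bits, D(P || Q) = Σ p_i log2(p_i / q_i). It is not SciPy’s implementation; the helper name `kl_divergence_bits` and the input assumptions stated in its docstring are ours.

```python
import math


def kl_divergence_bits(p, q):
    """Relative entropy D(P || Q) in bits for two discrete distributions.

    Hypothetical helper; assumes `p` and `q` have equal length, each sums
    to 1, and q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


fair = [0.5, 0.5]
biased = [0.25, 0.75]
same = kl_divergence_bits(fair, fair)          # identical distributions
different = kl_divergence_bits(fair, biased)   # positive when they differ
print(same, different)
```

The divergence is zero only when the two distributions match, which is why it is useful for comparing an observed byte distribution against a reference one.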

Python with Rust

We now go into more depth about our Rust implementation than the previous implementations for both thoroughness and repeatability. We started with the default library package generated with Cargo. The following sections describe how we modified the Rust package.

cargo new --lib rust_entropy

Cargo.toml

Starting with the obligatory Cargo.toml manifest file, we define the Cargo package and the library name, rust_entropy_lib. We use the public cpython crate (v0.4.1) available on crates.io, the Rust Package Registry. We also use Rust v1.42.0, the latest stable release available at the time of writing.

    [package]
    name = "rust-entropy"
    version = "0.1.0"
    authors = ["Nobody <nobody@nowhere.com>"]
    edition = "2018"

    [lib]
    name = "rust_entropy_lib"
    crate-type = ["dylib"]

    [dependencies.cpython]
    version = "0.4.1"
    features = ["extension-module"]

lib.rs

The Rust library implementation is fairly straightforward. As we did with our pure Python implementation, we initialize an array of counts for each possible byte value and iterate over the data to populate the counts. To finish the computation, we compute and return the negative sum of probabilities multiplied by the log2 of the probabilities.

    use cpython::{py_fn, py_module_initializer, PyResult, Python};

    /// Compute entropy on byte array.
    fn compute_entropy_pure_rust(data: &[u8]) -> f64 {
        let mut counts = [0; 256];
        let mut entropy = 0_f64;
        let length = data.len() as f64;

        // collect byte counts
        for &byte in data.iter() {
            counts[usize::from(byte)] += 1;
        }

        // make entropy calculation
        for &count in counts.iter() {
            if count != 0 {
                let probability = f64::from(count) / length;
                entropy -= probability * probability.log2();
            }
        }

        entropy
    }

All that’s left now for lib.rs is the mechanism to call our pure Rust function from Python. We include in lib.rs a CPython-aware function (compute_entropy_cpython()) to call our “pure” Rust function (compute_entropy_pure_rust()). This design gives us the benefit of maintaining a single pure Rust implementation and also providing a CPython-friendly “wrapper.”

    /// Rust-CPython aware function
    fn compute_entropy_cpython(_: Python, data: &[u8]) -> PyResult<f64> {
        let _gil = Python::acquire_gil();
        let entropy = compute_entropy_pure_rust(data);
        Ok(entropy)
    }

    // initialize Python module and add Rust CPython aware function
    py_module_initializer!(
        librust_entropy_lib,
        initlibrust_entropy_lib,
        PyInit_rust_entropy_lib,
        |py, m| {
            m.add(py, "__doc__", "Entropy module implemented in Rust")?;
            m.add(
                py,
                "compute_entropy_cpython",
                py_fn!(py, compute_entropy_cpython(data: &[u8])),
            )?;
            Ok(())
        }
    );

Calling Our Rust Code from Python

Finally, we call the Rust implementation from Python (again, in entropy.py) by first importing our custom dynamic system library we compiled from Rust. Then we simply call the provided library function we specified earlier when we initialized the Python module with the py_module_initializer! macro in our Rust code. At this point, we have a single Python module (entropy.py) that includes functions to call all of our entropy calculation implementations.

    import rust_entropy_lib


    def compute_entropy_rust_from_python(data):
        """Compute entropy on bytearray `data` with Rust."""
        return rust_entropy_lib.compute_entropy_cpython(data)

We build the above Rust library package on Ubuntu 18.04 using Cargo. (This link may be helpful for OS X users.)

cargo build --release

Once built, we copy and rename the produced dynamic library to the directory where our Python modules are so we can import it from our Python scripts. The Cargo-produced library name is librust_entropy_lib.so, but will need to be renamed rust_entropy_lib.so to import successfully in these tests.
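The copy-and-rename step can be scripted. The sketch below is one way to do it from Python, assuming Cargo’s default release output directory (target/release) on Linux; the helper name `install_extension` and its parameters are hypothetical.

```python
import shutil
from pathlib import Path


def install_extension(project_dir, dest_dir):
    """Copy librust_entropy_lib.so into `dest_dir` as rust_entropy_lib.so.

    Hypothetical helper; assumes Cargo's default release output directory
    (target/release) under `project_dir` on Linux.
    """
    built = Path(project_dir) / "target" / "release" / "librust_entropy_lib.so"
    installed = Path(dest_dir) / "rust_entropy_lib.so"
    shutil.copyfile(built, installed)
    return installed
```

For example, `install_extension("rust_entropy", ".")` would place the renamed library next to entropy.py if the package was created in a rust_entropy directory below the current one.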

Performance Results

We measured the execution time of each function implementation with pytest benchmarks computing entropy over 1 million random bytes. All implementations were presented with the same data. The benchmark tests (also included in entropy.py) are shown below.

    # ### BENCHMARKS ###
    # generate some random bytes to test w/ NumPy
    NUM = 1000000
    VAL = np.random.randint(0, 256, size=(NUM, ), dtype=np.uint8)


    def test_pure_python(benchmark):
        """Test pure Python."""
        benchmark(compute_entropy_pure_python, VAL)


    def test_python_scipy_numpy(benchmark):
        """Test Python with SciPy and NumPy."""
        benchmark(compute_entropy_scipy_numpy, VAL)


    def test_rust(benchmark):
        """Test Rust implementation called from Python."""
        benchmark(compute_entropy_rust_from_python, VAL)

Finally, we made separate, simple driver scripts for each method for calculating entropy. Following is a representative driver script for testing the pure Python implementation. The testdata.bin file is 1,000,000 random bytes used for testing all methods. All methods repeat the calculations 100 times in order to simplify capturing memory usage data.

    import entropy

    with open('testdata.bin', 'rb') as f:
        DATA = f.read()

    for _ in range(100):
        entropy.compute_entropy_pure_python(DATA)
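The post does not say how testdata.bin was produced. As an assumption, a file of 1,000,000 random bytes could be generated with the standard library’s os.urandom, as in this minimal sketch; the helper name `write_random_bytes` is hypothetical.

```python
import os


def write_random_bytes(path, size=1_000_000):
    """Write `size` random bytes to `path` and return the number written.

    Hypothetical helper; os.urandom is an assumption, the original post
    does not say how testdata.bin was generated.
    """
    with open(path, "wb") as f:
        return f.write(os.urandom(size))
```

Calling `write_random_bytes("testdata.bin")` once before running the driver scripts would give every implementation the same input file.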
Both the SciPy/NumPy and Rust implementations exhibited strong performance, easily outperforming the unoptimized, pure Python implementation by more than a factor of 100. The Rust version was also roughly four times faster than SciPy/NumPy (584 µs versus 2,370 µs minimum execution time), and the results confirmed what we had already expected: pure Python is vastly slower than compiled languages, and extensions written in Rust can be extremely competitive with those written in C (even beating them in this microbenchmark).

Other methods to improve performance exist as well. We could have used the ctypes or cffi modules. We could have added type hints and used Cython to generate a library we could import from Python. All of these options have solution-specific trade-offs to consider.

| Function/Implementation            | Minimum Benchmark Execution Time (µs) |
|------------------------------------|---------------------------------------|
| compute_entropy_pure_python()      | 294,319                               |
| compute_entropy_scipy_numpy()      | 2,370                                 |
| compute_entropy_rust_from_python() | 584                                   |
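To make the relative speeds concrete, the following short sketch derives speed-up factors from the minimum execution times reported above. The dictionary keys are shorthand labels chosen for illustration; the numbers are copied from the table.

```python
# Minimum benchmark execution times in microseconds, copied from the table.
min_time_us = {
    "scipy_numpy": 2_370,
    "rust": 584,
}
pure_python_us = 294_319

# How many times faster each implementation is than the pure Python baseline.
speedup_vs_python = {name: pure_python_us / t for name, t in min_time_us.items()}
rust_vs_scipy = min_time_us["scipy_numpy"] / min_time_us["rust"]

for name, factor in speedup_vs_python.items():
    print(f"{name}: {factor:.1f}x faster than pure Python")
print(f"rust: {rust_vs_scipy:.1f}x faster than SciPy/NumPy")
```

These ratios come from a single microbenchmark on one machine, so they indicate scale rather than a general ranking.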

We also measured the memory usage of each function implementation with the GNU time application (not to be confused with the time built-in shell command). In particular, we measured the maximum resident set size. While the pure Python and Rust implementations have very similar maximum resident set sizes, the SciPy/NumPy implementation uses measurably more memory in this benchmark, presumably due to additional capabilities loaded into memory when the packages are imported. In any case, calling Rust code from Python does not appear to add a substantial amount of memory overhead.

| Function/Implementation            | Maximum Resident Set Size (KB) |
|------------------------------------|--------------------------------|
| compute_entropy_pure_python()      | 65,262                         |
| compute_entropy_scipy_numpy()      | 73,934                         |
| compute_entropy_rust_from_python() | 65,444                         |
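The figures above were collected with GNU time. As a swapped-in alternative for a quick in-process check (not the method used for the table), the standard library resource module on Unix-like systems exposes a similar peak figure. The helper name `max_rss_kb` is hypothetical, and the kilobyte unit is specific to Linux.

```python
import os
import resource


def max_rss_kb():
    """Return this process's peak resident set size as reported by getrusage.

    On Linux the value is in kilobytes; other platforms may use other units.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


before = max_rss_kb()
data = os.urandom(1_000_000)          # stand-in workload: 1,000,000 random bytes
histogram = [0] * 256
for byte in data:
    histogram[byte] += 1
after = max_rss_kb()                  # the peak can only stay equal or grow
print(before, after)
```

Because the value is a high-water mark for the whole process, it includes the interpreter and every imported module, which is consistent with the SciPy/NumPy driver reporting a larger figure.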

Summary

We were thoroughly impressed with the performance of calling Rust from Python. In our admittedly brief assessment, our Rust implementation performance was comparable to the underlying C code from SciPy and NumPy packages. Rust seems well suited for efficient processing at scale. Not only was Rust performant in execution time, but its additional memory overhead was also minimal in these tests. The execution time and memory utilization characteristics should prove ideal for scalability.

The performance of the SciPy and NumPy C FFI implementations is certainly comparable, but Rust provides additional benefits that C and C++ do not. Memory and thread safety guarantees are appealing advantages. While C provides similar runtime execution improvements, it does not inherently provide thread safety. External libraries exist to provide this functionality for C, but the onus of correctness is entirely on the developer. Rust checks for thread safety issues, such as race conditions, at compile time with its ownership model, and the standard library offers a suite of concurrency mechanisms, including channels, locks and reference counting smart pointers.

We are not advocating that anyone port SciPy or NumPy to Rust, because these are already heavily optimized packages with robust support communities. On the other hand, we would strongly consider porting to Rust any pure Python code whose functionality is not otherwise available in high-performance libraries. For data science applications in the security space, Rust seems like a compelling alternative given its speed and safety guarantees. And who would mind a little iron oxide on their data science in exchange for more speed and safety?

Additional Resources