Folo

Mechanisms for high-performance hardware-aware programming in Rust.

What it gives you

To take advantage of hardware awareness, we must first gain that awareness. many_cpus informs us about the nature of the system's processors and their arrangement relative to main memory, giving us control over which specific logic runs on which specific processors. No longer do we simply thread::spawn() blindly - now we spawn specific threads for specific processors or groups of processors.
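
For illustration, a minimal sketch of processor-targeted spawning, assuming a builder-style API; the names `ProcessorSet::builder()`, `same_memory_region()`, and `spawn_threads()` here are assumptions rather than verified signatures:

```rust
use many_cpus::ProcessorSet;

fn main() {
    // Hypothetical sketch: pick a set of processors that share a memory
    // region and pin one worker thread to each of them.
    let threads = ProcessorSet::builder()
        .same_memory_region()
        .take(4)
        .expect("could not find 4 processors in one memory region")
        .spawn_threads(|processor| {
            // Each closure runs on a thread bound to one specific processor.
            println!("worker running on processor {}", processor.id());
        });

    for thread in threads {
        thread.join().expect("worker thread panicked");
    }
}
```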

many_cpus_benchmarking provides a benchmark harness for exploring how the way work and data are spread between processors affects different algorithms. This lets you judge when placement matters and by how much.

The ability to target our workloads to specific processors and specific memory regions unlocks new optimization opportunities. region_local and region_cached provide something like a layer of caching between the processor and main memory, ensuring that even data sets that do not fit into processor caches still experience high data locality.
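
A minimal sketch of how region-local caching might be used, assuming a declaration macro and accessor along these lines (`region_cached!` and `with_cached()` are assumptions, not confirmed API):

```rust
use region_cached::region_cached;

// Hypothetical sketch: keep one copy of a read-mostly data set per memory
// region instead of one copy for the whole process.
region_cached! {
    static ROUTING_TABLE: Vec<u64> = (0..1_000_000_u64).collect();
}

fn lookup(index: usize) -> u64 {
    // Reads come from the copy in the current thread's memory region,
    // preserving locality even for data sets larger than processor caches.
    ROUTING_TABLE.with_cached(|table| table[index])
}

fn main() {
    println!("{}", lookup(42));
}
```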

Designing code for hardware-efficiency often benefits from a thread-isolated mindset, treating each thread as its own universe. linked provides valuable concepts, metaphors and mechanisms to enable objects to present a unique face to each thread, acting as separate objects on each thread while being connected through internal logic and only permitting opt-in transfers across thread boundaries. These are the building blocks used to implement region_local and region_cached.
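
As a conceptual sketch of that thread-isolated mindset, assuming the attribute-plus-constructor-macro style suggested by the description (`#[linked::object]` and `linked::new!` are assumptions here): each thread gets its own instance, while all instances stay connected through the state the constructor captures.

```rust
use std::sync::{Arc, Mutex};

// Hypothetical sketch: every thread sees its own `Recorder` instance, while
// all instances remain connected through the shared Arc captured below.
#[linked::object]
struct Recorder {
    shared_log: Arc<Mutex<Vec<String>>>,
}

impl Recorder {
    fn new() -> Self {
        let shared_log = Arc::new(Mutex::new(Vec::new()));

        // linked::new! defines how each per-thread instance of the same
        // family is constructed from the captured state.
        linked::new!(Self {
            shared_log: Arc::clone(&shared_log),
        })
    }

    fn record(&self, entry: &str) {
        self.shared_log.lock().unwrap().push(entry.to_string());
    }
}

fn main() {
    let recorder = Recorder::new();
    recorder.record("hello from the creating thread");
}
```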

Measuring the effects of hardware-aware programming sometimes requires benchmarks to be multi-threaded, which is not something you get out of the box with benchmark frameworks like Criterion. par_bench extends Criterion with a simple harness for multithreaded benchmarking, running your benchmark logic on a specific processor set obtained from many_cpus. It takes care of the dirty business of coordinating the threads and keeping test harness overhead out of the data.
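
A hedged sketch of what a multithreaded Criterion benchmark might look like; the `par_bench::ThreadPool` and `Run` names and their methods are assumptions, not verified API:

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use many_cpus::ProcessorSet;

// Hypothetical sketch: execute the benchmark body on every processor in a
// set, with par_bench coordinating the threads and the measurements.
fn bench(c: &mut Criterion) {
    let processors = ProcessorSet::default();
    let mut pool = par_bench::ThreadPool::new(&processors);

    let mut group = c.benchmark_group("multithreaded");

    par_bench::Run::new()
        .iter(|_args| {
            // Benchmark body; runs concurrently on each thread in the pool.
            std::hint::black_box(42_u64.wrapping_mul(7));
        })
        .execute_criterion_on(&mut pool, &mut group, "all_processors");

    group.finish();
}

criterion_group!(benches, bench);
criterion_main!(benches);
```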

Processor time statistics:

| Operation                   | Mean |
|-----------------------------|------|
| futures_oneshot_channel_mt  | 92ns |
| futures_oneshot_channel_st  | 76ns |
| local_once_event_managed    | 38ns |
| pooled_local_once_event_ptr | 27ns |
| pooled_local_once_event_rc  | 31ns |
| pooled_local_once_event_ref | 24ns |

When evaluating complex application logic, it can be important to take a holistic view - it matters not only how fast the benchmark logic runs but also how much energy (processor time) it uses. Perhaps the code runs logic on background threads, or perhaps it blocks some threads for a while on syscalls, costing wall clock time without costing processor time. These are all factors we must account for in complex scenarios. all_the_time allows us to track the processor time spent by the process, in addition to the wall clock time. It integrates well with Criterion and is natively supported by par_bench.
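
A hedged sketch of processor time tracking; the `Session`/`operation()`/`measure_process()` names are assumptions based on the description above:

```rust
use all_the_time::Session;

fn main() {
    // Hypothetical sketch: a session that tracks processor time per operation.
    let session = Session::new();

    {
        let operation = session.operation("sum_batch");
        // One span covering 1_000 iterations of the measured work.
        let _span = operation.measure_process().iterations(1_000);
        for _ in 0..1_000 {
            std::hint::black_box((0..10_000_u64).sum::<u64>());
        }
    }

    // Reports processor time per iteration alongside whatever wall clock
    // numbers your benchmark harness already collects.
    session.print_to_stdout();
}
```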

Allocation statistics:

| Operation                      | Mean bytes | Mean count |
|--------------------------------|------------|------------|
| futures_oneshot_channel        |        128 |          1 |
| local_once_event_managed       |         48 |          1 |
| pooled_local_once_event_ptr    |          0 |          0 |
| pooled_local_once_event_rc     |          0 |          0 |
| pooled_local_once_event_ref    |          0 |          0 |

Memory allocation is the root of all evil. The simplest and most effective way to make a typical application faster is to eliminate memory allocations from it - this can often multiply performance several times over. Before we can eliminate, we need to measure. alloc_tracker gives us the ability to measure exactly how much heap memory is allocated by a particular piece of code. It integrates well with Criterion and is natively supported by par_bench.
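
A hedged sketch of allocation measurement, assuming a wrapper around the system allocator plus a session/operation reporting API (all names here are assumptions):

```rust
use alloc_tracker::{Allocator, Session};

// Hypothetical sketch: wrap the system allocator so every heap allocation
// made by the process can be counted.
#[global_allocator]
static ALLOCATOR: Allocator<std::alloc::System> = Allocator::system();

fn main() {
    let session = Session::new();

    {
        let operation = session.operation("format_numbers");
        let _span = operation.measure_thread().iterations(100);
        for i in 0..100 {
            std::hint::black_box(i.to_string()); // one small allocation each
        }
    }

    // Reports mean bytes and mean allocation count per iteration, as in the
    // table above.
    session.print_to_stdout();
}
```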

Once we know how much memory we are allocating, we can start making a difference. The simplest approach is to change the algorithms so that no memory allocations are necessary, but sometimes that is impractical. The global Rust memory allocator (whichever one is in use) is a general-purpose mechanism, and it pays a performance price for that generality. If we are allocating a large number of objects of specific sizes, we can benefit from special-purpose allocators that keep memory around for reuse, so the next allocation is simple and fast.

While allocator APIs are still an unstable Rust feature, there are stable-API alternatives. Another term for special-purpose allocators is object pools, and infinity_pool offers several of them, from basic Vec<T>-style pinned object collections to type-agnostic object pools that can allocate any type of object. While the safe-API variants carry substantial overhead compared to the unsafe-API variants, both can surpass the efficiency of the global memory allocator under many conditions. Your mileage may vary - measure 100 times, cut 10 times.
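
A minimal sketch of the pooled-object pattern this describes; the `PinnedPool` type and its `insert`/`get`/`remove` methods are assumptions standing in for whichever concrete pool you pick:

```rust
use infinity_pool::PinnedPool;

fn main() {
    // Hypothetical sketch: a pool of fixed-size buffers. Slots are kept for
    // reuse, so repeated insert/remove cycles stop hitting the global
    // allocator once the pool has warmed up.
    let mut pool: PinnedPool<[u8; 1024]> = PinnedPool::new();

    let key = pool.insert([0_u8; 1024]);
    assert_eq!(pool.get(key).len(), 1024);

    // Removing returns the slot to the pool; the next insert of the same
    // shape reuses it instead of allocating.
    pool.remove(key);
    let _key = pool.insert([1_u8; 1024]);
}
```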

A surprising source of memory allocations in high-performance code can be signaling. We are used to thinking of oneshot channels as cheap and efficient things, and while this is true, they are still built upon shared memory allocated from the heap. Every signaling channel you create is a heap allocation, and they can add up fast! events provides pooled signaling channels that take advantage of infinity_pool to reuse memory allocations, as well as single-threaded and unsafe-code-managed events for lower overhead in specialized scenarios.
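
A hedged sketch of pooled signaling, with names (`LocalOnceEventPool`, `bind_by_ref()`) inferred from the benchmark labels above rather than taken from the real API; the receiver is awaited with the futures crate's executor:

```rust
use events::LocalOnceEventPool;

fn main() {
    // Hypothetical sketch: a pool of single-threaded one-shot events whose
    // backing memory is reused across signals.
    let pool = LocalOnceEventPool::<u32>::new();

    for i in 0..1_000 {
        // Endpoints borrow a pooled event; dropping both returns the slot
        // to the pool instead of freeing heap memory.
        let (sender, receiver) = pool.bind_by_ref();
        sender.send(i);
        let _value = futures::executor::block_on(receiver);
    }
}
```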

Example metrics report:

bagels_cooked_weight_grams: 2300; sum 744000; mean 323
value <=    0 [    0 ]: 
value <=  100 [    0 ]: 
value <=  200 [ 1300 ]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
value <=  300 [    0 ]: 
value <=  400 [    0 ]: 
value <=  500 [    0 ]: 
value <=  600 [ 1000 ]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
value <=  700 [    0 ]: 
value <=  800 [    0 ]: 
value <=  900 [    0 ]: 
value <= 1000 [    0 ]: 
value <= +inf [    0 ]: 

It is easy to think that performance and efficiency have been achieved once the benchmarks look good. Yet time makes fools of us all! Real-world data often shows surprising behaviors - where we thought we would spawn 500 tasks, a surprise implementation detail of the HTTP stack may end up spawning 500 million! Benchmarks and belief are not enough. nm provides a minimal, very high-performance metrics framework suitable for taking millions of measurements per second. Only with real data can we be assured that we have achieved real performance.
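
A hedged sketch of how a histogram like the one above might be recorded; the `Event` builder API is an assumption, though the bucket magnitudes mirror the report:

```rust
use nm::Event;

thread_local! {
    // Hypothetical sketch: a thread-local histogram metric; writes avoid
    // cross-processor synchronization on the hot path.
    static BAGEL_WEIGHT_GRAMS: Event = Event::builder()
        .name("bagels_cooked_weight_grams")
        .histogram(&[0, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000])
        .build();
}

fn record_bagel(weight_grams: i64) {
    BAGEL_WEIGHT_GRAMS.with(|event| event.observe(weight_grams));
}

fn main() {
    record_bagel(180);
    record_bagel(540);
}
```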

Many attempts to instrument high-performance logic are self-defeating because few people expect that time itself is slow. Measuring the time, that is! Instant::now() is a remarkably slow operation - never use it in high-performance code, as merely capturing the timestamp can massively degrade performance. Accurate timing information can only be gathered over large batches of iterations, so that one measurement span covers 100+ milliseconds of work. For situations where you can sacrifice precision but still want satisfactory performance, fast_time provides a clock that is much cheaper to query. It will not give you precise numbers, but it is safe to query tens of thousands of times per second.
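
A minimal sketch of the cheap-clock pattern, assuming a `Clock` type with a `now()` method (an assumption, not confirmed API):

```rust
use fast_time::Clock;

fn main() {
    // Hypothetical sketch: a coarse clock that is cheap to read in a loop.
    let mut clock = Clock::new();

    for _ in 0..10_000 {
        // Far cheaper than std::time::Instant::now(); precision is traded
        // for query speed, so per-item timing stays affordable.
        let _approximately_now = clock.now();
    }
}
```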

Extras

Auxiliary packages developed and published by this project:

  • cargo-detect-package - cargo subcommand to detect which package is used based on a provided path and to run another subcommand on that package.
  • cpulist - utilities for parsing and emitting Linux cpulist strings, used by many_cpus.
  • new_zealand - utilities for working with non-zero integers.

Packages present in the repo but not relevant to a general audience:

  • benchmarks - random pile of benchmarks to explore relevant scenarios and guide Folo development.
  • folo_ffi - utilities for working with FFI logic; exists for internal use in Folo packages; no stable API surface.
  • folo_utils - utilities for internal use in Folo packages; no stable API surface.
  • testing - private helpers for testing and examples in Folo packages.

Deprecated packages:

  • blind_pool - deprecated, use infinity_pool which offers a similar API but is internally structured in a more maintainable manner.
  • opaque_pool - deprecated, use infinity_pool which offers a similar API but is internally structured in a more maintainable manner.
  • pinned_pool - deprecated, use infinity_pool which offers a similar API but is internally structured in a more maintainable manner.

Development environment setup

See DEVELOPMENT.md.

Quality assurance

This project aims for high quality standards:

  • Comprehensive testing - All packages tested with extensive unit tests, integration tests, and doctests
  • Miri validation - All packages pass strict Rust memory safety validation via Miri
  • Mutation testing - Code quality verified through comprehensive mutation testing with cargo-mutants
  • High test coverage - Test coverage measured and maintained via cargo-llvm-cov
  • Zero warnings policy - All code must compile without any compiler or Clippy warnings
  • Extensive Clippy rules - 100+ custom Clippy lint rules enforced across the workspace
  • Cross-platform validation - All code tested on both Windows and Linux platforms
  • Automated CI/CD - Continuous integration runs the full validation suite on every commit
  • API documentation - Complete API documentation with inline examples for all public APIs
  • Dependency auditing - Regular security audits of all dependencies via cargo-audit
  • Semver compliance - API changes validated for semantic versioning compliance via cargo-semver-checks