From: Andreas Hindborg <[email protected]>
Hi All!
This is an early preview of a null block driver written in Rust. It is a
follow-up to my LSF topic proposal [1]. I am sending this series with the
intention of collecting feedback and comments at LSF/MM/BPF.
The patches apply on top of the `rust-next` tree (rust-6.4
ea76e08f4d901a450619831a255e9e0a4c0ed162) [4].
A null block driver is a good opportunity to evaluate Rust bindings for the
block layer. It is a small and simple driver and thus should be simple to reason
about. Further, the null block driver is not usually deployed in production
environments. Thus, it should be fairly straightforward to review, and any
potential issues are not going to bring down any production workloads.
Being small and simple, the null block driver is a good place to introduce the
Linux kernel storage community to Rust. This will help prepare the community for
future Rust projects and facilitate a better maintenance process for these
projects.
The statistics presented in my previous message [1] show that the C null block
driver has had a significant amount of memory safety related problems in the
past. 41% of fixes merged for the C null block driver are fixes for memory
safety issues. This makes the null block driver a good candidate for rewriting
in Rust.
Patches in this series prefixed with capitalized RUST are code taken from the
R4L `rust` tree [2]. This code is not yet in mainline Linux and has not yet been
submitted as patches for inclusion in Linux. I am not the original author of
these patches, except for small changes required to include them in this series.
The driver performs similarly to the C driver for the implemented features; see
the table below [5].
In this table, each cell shows the performance of the Rust driver relative to
the C driver, with the throughput of the C driver in parentheses:
`rel_read rel_write (read_miops,write_miops)`. For example, the Rust driver
performs 4.74 percent better than the C driver for fio randread with 2 jobs at
a 16 KiB block size.
Over the 432 benchmark configurations, the relative performance of the Rust
driver to the C driver (in terms of IOPS) is between 6.8 and -11.8 percent with
an average of 0.2 percent better for reads. For writes the span is 16.8 to -4.5
percent with an average of 0.9 percent better.
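
For readers who want to post-process the table, here is a minimal sketch of how a cell could be parsed and how the summary figures could be derived. It assumes that the relative numbers are computed as `(rust - c) / c * 100`; the exact formula is not stated above. The names `Cell`, `parse_cell`, `relative_percent` and `summarize` are hypothetical and not part of this series.

```rust
/// One table cell: `rel_read rel_write (read_miops,write_miops)`.
#[derive(Debug, PartialEq)]
pub struct Cell {
    /// Read IOPS of the Rust driver relative to the C driver, in percent.
    pub rel_read: f64,
    /// Write IOPS of the Rust driver relative to the C driver, in percent.
    pub rel_write: f64,
    /// Read throughput of the C driver, in millions of IOPS.
    pub c_read_miops: f64,
    /// Write throughput of the C driver, in millions of IOPS.
    pub c_write_miops: f64,
}

/// Parse a cell such as `4.74 0.00 (0.64,0.0)`.
/// Returns `None` if the input does not have the expected shape.
pub fn parse_cell(cell: &str) -> Option<Cell> {
    let (rel, abs) = cell.trim().split_once('(')?;
    let abs = abs.trim().strip_suffix(')')?;
    let mut rel = rel.split_whitespace();
    let rel_read: f64 = rel.next()?.parse().ok()?;
    let rel_write: f64 = rel.next()?.parse().ok()?;
    let (read, write) = abs.split_once(',')?;
    let c_read_miops: f64 = read.trim().parse().ok()?;
    let c_write_miops: f64 = write.trim().parse().ok()?;
    Some(Cell {
        rel_read,
        rel_write,
        c_read_miops,
        c_write_miops,
    })
}

/// Percent difference of the Rust driver relative to the C driver.
/// Assumed formula: (rust - c) / c * 100. Positive means Rust is faster.
pub fn relative_percent(rust_iops: f64, c_iops: f64) -> f64 {
    (rust_iops - c_iops) / c_iops * 100.0
}

/// Minimum, maximum and mean of a set of relative results.
/// Returns `None` for an empty slice.
pub fn summarize(rel: &[f64]) -> Option<(f64, f64, f64)> {
    if rel.is_empty() {
        return None;
    }
    let min = rel.iter().copied().fold(f64::INFINITY, f64::min);
    let max = rel.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    let mean = rel.iter().sum::<f64>() / rel.len() as f64;
    Some((min, max, mean))
}
```

Calling `parse_cell` on every cell and passing the collected `rel_read` (or `rel_write`) values to `summarize` would give the span and average for reads (or writes).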
For each measurement the drivers are loaded, a drive is configured with memory
backing and a size of 4 GiB. C null_blk is configured to match the implemented
modes of the Rust driver: `blocksize` is set to 4 KiB, `completion_nsec` to 0,
`irqmode` to 0 (IRQ_NONE), `queue_mode` to 2 (MQ), `hw_queue_depth` to 256 and
`memory_backed` to 1. For both the drivers, the queue scheduler is set to
`none`. These measurements are made using 30 second runs of `fio` with the
`PSYNC` IO engine with workers pinned to separate CPU cores. The measurements
are done inside a virtual machine (qemu/kvm) on an Intel Alder Lake workstation
(i5-12600).
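
To make the benchmark matrix easier to reproduce, the following sketch enumerates the configurations implied by the table: 6 workloads, 12 block sizes and 1 to 6 jobs, which gives the 432 runs mentioned above. The exact `fio` command line used for the measurements is not given in this message, so the option set below is an assumption based on the description (30 second time-based runs, the `psync` engine, one CPU core per worker). The function names and the device path passed in by the caller are hypothetical.

```rust
/// fio workloads that appear in the result table.
pub const WORKLOADS: [&str; 6] = [
    "randread",
    "randrw",
    "randwrite",
    "read",
    "readwrite",
    "write",
];

/// The table has one column per job count, 1 through 6.
pub const MAX_JOBS: u32 = 6;

/// Block sizes in KiB: 4, 8, ..., 8192 (twelve powers of two).
pub fn block_sizes_kib() -> Vec<u32> {
    (0..12).map(|shift| 4u32 << shift).collect()
}

/// Build an fio argument list for one configuration.
/// Assumption: this option set approximates the setup described above;
/// it is not the command line used for the published numbers.
pub fn fio_args(device: &str, workload: &str, bs_kib: u32, jobs: u32) -> Vec<String> {
    // Pin the workers to cores 0..jobs-1, one core per worker.
    let cpus = if jobs > 1 {
        format!("0-{}", jobs - 1)
    } else {
        "0".to_string()
    };
    vec![
        format!("--name={workload}-{bs_kib}k-{jobs}"),
        format!("--filename={device}"),
        format!("--rw={workload}"),
        format!("--bs={bs_kib}k"),
        format!("--numjobs={jobs}"),
        "--ioengine=psync".to_string(),
        "--runtime=30".to_string(),
        "--time_based".to_string(),
        format!("--cpus_allowed={cpus}"),
        "--cpus_allowed_policy=split".to_string(),
        "--group_reporting".to_string(),
    ]
}

/// Enumerate every (workload, block size, job count) combination.
pub fn benchmark_matrix(device: &str) -> Vec<Vec<String>> {
    let mut runs = Vec::new();
    for workload in WORKLOADS {
        for bs_kib in block_sizes_kib() {
            for jobs in 1..=MAX_JOBS {
                runs.push(fio_args(device, workload, bs_kib, jobs));
            }
        }
    }
    runs
}
```

Each inner vector can be handed to `std::process::Command::new("fio").args(...)` by a harness that also loads and configures the driver under test before each run.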
Given the relatively large spread in these numbers, it is possible that there is
quite a bit of noise in the results. If anybody is able to dedicate some CPU
time to run benchmarks in a more controlled and standardized environment, I
would appreciate it.
The feature list of the driver is relatively short for now:
- `blk-mq` support
- Direct completion of IO
- Read and write requests
- Optional memory backing
Features available in the C null_blk driver that are currently not implemented
in this work:
- Bio-based submission
- NUMA support
- Block size configuration
- Multiple devices
- Dynamic device creation/destruction
- Soft-IRQ and timer mode
- Queue depth configuration
- Queue count configuration
- Discard operation support
- Cache emulation
- Bandwidth throttling
- Per node hctx
- IO scheduler configuration
- Blocking submission mode
- Shared tags configuration (for >1 device)
- Zoned storage support
- Bad block simulation
- Poll queues
The driver is implemented entirely in safe Rust, with all unsafe code fully
contained in wrappers for C APIs.
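
The general pattern of confining `unsafe` to a wrapper can be illustrated outside the kernel. The sketch below is a userspace stand-in, not code from this series: `RawQueue` and `Queue` are hypothetical types, and `Box::into_raw()`/`Box::from_raw()` play the role that `ForeignOwnable::into_foreign()`/`from_foreign()` play in the patches.

```rust
use std::ffi::c_void;
use std::marker::PhantomData;

/// Stand-in for a C structure that stores an opaque driver pointer,
/// comparable in spirit to `queuedata` in `struct request_queue`.
struct RawQueue {
    queuedata: *mut c_void,
}

/// Safe owner of a `RawQueue`. All `unsafe` code is confined to this type.
///
/// # Invariants
///
/// `raw.queuedata` was produced by `Box::<T>::into_raw()` in `new()` and is
/// released exactly once, in `drop()`.
pub struct Queue<T> {
    raw: RawQueue,
    _marker: PhantomData<T>,
}

impl<T> Queue<T> {
    /// Hand ownership of `data` to the "C side" as an opaque pointer.
    pub fn new(data: T) -> Self {
        let ptr = Box::into_raw(Box::new(data)) as *mut c_void;
        Queue {
            raw: RawQueue { queuedata: ptr },
            _marker: PhantomData,
        }
    }

    /// Borrow the driver data without taking ownership back.
    pub fn data(&self) -> &T {
        // SAFETY: By the type invariant, `queuedata` points to a live `T`
        // that is not freed before `self` is dropped.
        unsafe { &*(self.raw.queuedata as *const T) }
    }
}

impl<T> Drop for Queue<T> {
    fn drop(&mut self) {
        // SAFETY: The pointer was created by `Box::into_raw()` in `new()`
        // and ownership is only reclaimed here.
        drop(unsafe { Box::from_raw(self.raw.queuedata as *mut T) });
    }
}

/// "Driver" code: uses only the safe interface and contains no `unsafe`.
pub fn name_len(queue: &Queue<String>) -> usize {
    queue.data().len()
}
```

Code written against `Queue<T>` cannot misuse the raw pointer, so a review for memory safety can concentrate on the wrapper and its stated invariant.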
Patches 07/11 and 08/11, which enable the equivalent of the C
`spin_lock_irqsave()`, appeared on the Rust list [6,3] but have not yet been
applied to the `rust-next` tree [4]. I am not the author of these patches, but
I include them here for convenience. Please direct any discussion of these
patches to their original posts.
The null block driver is provided in patch 10/11. Patches 03/11 and 04/11
introduce Rust abstractions for the block C APIs required by the driver.
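
As a rough picture of how such an abstraction connects a Rust trait to a C table of function pointers, here is a heavily simplified userspace sketch. It is not the kernel API: `RawOps`, `call_like_c`, `NullDriver`, `CAPACITY_SECTORS` and the error value `-5` are hypothetical. The real callbacks operate on `struct blk_mq_hw_ctx` and `struct blk_mq_queue_data` rather than on a bare pointer and a sector number.

```rust
use std::ffi::c_void;
use std::marker::PhantomData;
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical, much reduced stand-in for a C operations table.
#[repr(C)]
pub struct RawOps {
    pub queue_rq: unsafe extern "C" fn(driver_data: *mut c_void, sector: u64) -> i32,
}

/// What a driver implements: plain Rust, no raw pointers.
pub trait Operations: Sized {
    /// Per-queue driver data handed to every callback.
    type HwData;

    /// Queue one request; `Err` carries a negative errno-style value.
    fn queue_rq(hw_data: &Self::HwData, sector: u64) -> Result<(), i32>;
}

/// Generates the C-compatible entry points for a given `T`.
pub struct OperationsVtable<T: Operations>(PhantomData<T>);

impl<T: Operations> OperationsVtable<T> {
    /// # Safety
    ///
    /// `driver_data` must point to a live `T::HwData` for the duration of
    /// the call.
    unsafe extern "C" fn queue_rq_callback(driver_data: *mut c_void, sector: u64) -> i32 {
        // SAFETY: Guaranteed by the safety requirement of this function.
        let hw_data = unsafe { &*(driver_data as *const T::HwData) };
        match T::queue_rq(hw_data, sector) {
            Ok(()) => 0,
            Err(errno) => errno,
        }
    }

    /// The table that would be handed to the C side.
    pub const VTABLE: RawOps = RawOps {
        queue_rq: Self::queue_rq_callback,
    };
}

/// Example driver: counts accepted requests, rejects out-of-range sectors.
pub struct NullDriver;

/// Hypothetical device size in sectors.
pub const CAPACITY_SECTORS: u64 = 1024;

impl Operations for NullDriver {
    type HwData = AtomicU64;

    fn queue_rq(hw_data: &AtomicU64, sector: u64) -> Result<(), i32> {
        if sector >= CAPACITY_SECTORS {
            return Err(-5); // errno-style I/O error, chosen for illustration
        }
        hw_data.fetch_add(1, Ordering::Relaxed);
        Ok(())
    }
}

/// Simulates the C side calling through the function pointer table.
pub fn call_like_c<T: Operations>(hw_data: &T::HwData, sector: u64) -> i32 {
    let ops = OperationsVtable::<T>::VTABLE;
    let ptr = hw_data as *const T::HwData as *mut c_void;
    // SAFETY: `ptr` is derived from a live reference that outlives the call.
    unsafe { (ops.queue_rq)(ptr, sector) }
}
```

In this shape the driver author only writes the `impl Operations` block, and the pointer casts live in one generic place that can be audited once.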
[1] https://lore.kernel.org/all/[email protected]/
[2] https://github.com/Rust-for-Linux/linux/tree/rust
[3] https://lore.kernel.org/all/[email protected]/
[4] https://github.com/Rust-for-Linux/linux/tree/rust-next
[6] https://lore.kernel.org/all/[email protected]/
[5]:
+---------+-----------+-------------------------------+-----------------------------+-------------------------------+
| jobs/bs | workload | 1 | 2 | 3 |
+---------+-----------+-------------------------------+-----------------------------+-------------------------------+
| 4k | randread | 1.54 0.00 (0.7,0.0) | 0.35 0.00 (1.3,0.0) | -0.52 0.00 (1.3,0.0) |
| 8k | randread | 3.30 0.00 (0.57,0.0) | 6.79 0.00 (0.92,0.0) | 0.24 0.00 (0.97,0.0) |
| 16k | randread | 0.89 0.00 (0.42,0.0) | 4.74 0.00 (0.64,0.0) | 3.98 0.00 (0.65,0.0) |
| 32k | randread | 2.04 0.00 (0.27,0.0) | 3.26 0.00 (0.37,0.0) | 1.41 0.00 (0.37,0.0) |
| 64k | randread | 1.35 0.00 (0.16,0.0) | 0.71 0.00 (0.21,0.0) | 0.66 0.00 (0.22,0.0) |
| 128k | randread | 0.59 0.00 (0.09,0.0) | 0.13 0.00 (0.1,0.0) | -6.47 0.00 (0.11,0.0) |
| 256k | randread | 0.13 0.00 (0.048,0.0) | 1.00 0.00 (0.055,0.0) | 0.03 0.00 (0.059,0.0) |
| 512k | randread | 1.16 0.00 (0.023,0.0) | -1.82 0.00 (0.027,0.0) | -0.90 0.00 (0.028,0.0) |
| 1024k | randread | 0.14 0.00 (0.011,0.0) | 0.58 0.00 (0.012,0.0) | -1.14 0.00 (0.013,0.0) |
| 2048k | randread | 2.02 0.00 (0.005,0.0) | 0.13 0.00 (0.0062,0.0) | -0.37 0.00 (0.0062,0.0) |
| 4096k | randread | 1.21 0.00 (0.0026,0.0) | -5.68 0.00 (0.0031,0.0) | -2.39 0.00 (0.0029,0.0) |
| 8192k | randread | -1.02 0.00 (0.0013,0.0) | 0.66 0.00 (0.0013,0.0) | -1.72 0.00 (0.0013,0.0) |
| 4k | randrw | 2.82 2.82 (0.25,0.25) | 2.94 2.94 (0.3,0.3) | 3.19 3.20 (0.34,0.34) |
| 8k | randrw | 2.80 2.82 (0.21,0.21) | 2.32 2.32 (0.25,0.25) | 1.95 1.96 (0.28,0.28) |
| 16k | randrw | 4.54 4.52 (0.16,0.16) | 2.29 2.29 (0.2,0.2) | 2.27 2.30 (0.21,0.21) |
| 32k | randrw | 3.51 3.51 (0.11,0.11) | 2.43 2.40 (0.13,0.13) | 1.38 1.38 (0.14,0.14) |
| 64k | randrw | 2.40 2.42 (0.069,0.069) | 1.98 1.99 (0.08,0.08) | 0.51 0.51 (0.082,0.082) |
| 128k | randrw | 1.50 1.49 (0.039,0.039) | 1.28 1.27 (0.042,0.042) | 0.83 0.84 (0.043,0.043) |
| 256k | randrw | 2.02 2.06 (0.021,0.021) | 0.92 0.91 (0.022,0.022) | 0.70 0.75 (0.022,0.022) |
| 512k | randrw | 0.82 0.82 (0.011,0.011) | 0.69 0.70 (0.011,0.011) | 0.97 0.98 (0.011,0.011) |
| 1024k | randrw | 3.78 3.84 (0.0046,0.0046) | 0.98 0.98 (0.0053,0.0053) | 1.30 1.29 (0.0054,0.0054) |
| 2048k | randrw | 1.60 1.54 (0.0023,0.0023) | 1.76 1.69 (0.0026,0.0026) | 1.23 0.92 (0.0026,0.0026) |
| 4096k | randrw | 1.18 1.10 (0.0011,0.0012) | 1.32 1.20 (0.0014,0.0014) | 1.49 1.41 (0.0013,0.0013) |
| 8192k | randrw | -2.01 -1.89 (0.00057,0.00057) | 1.20 1.14 (0.00061,0.00061) | -0.74 -0.61 (0.0006,0.00062) |
| 4k | randwrite | 0.00 2.48 (0.0,0.41) | 0.00 3.61 (0.0,0.39) | 0.00 3.93 (0.0,0.44) |
| 8k | randwrite | 0.00 3.46 (0.0,0.35) | 0.00 1.90 (0.0,0.34) | 0.00 2.20 (0.0,0.37) |
| 16k | randwrite | 0.00 3.01 (0.0,0.28) | 0.00 2.86 (0.0,0.27) | 0.00 2.25 (0.0,0.29) |
| 32k | randwrite | 0.00 3.02 (0.0,0.19) | 0.00 2.68 (0.0,0.19) | 0.00 2.20 (0.0,0.21) |
| 64k | randwrite | 0.00 3.39 (0.0,0.12) | 0.00 3.75 (0.0,0.12) | 0.00 -0.79 (0.0,0.14) |
| 128k | randwrite | 0.00 3.81 (0.0,0.069) | 0.00 1.10 (0.0,0.072) | 0.00 5.48 (0.0,0.078) |
| 256k | randwrite | 0.00 4.28 (0.0,0.038) | 0.00 1.27 (0.0,0.037) | 0.00 -0.10 (0.0,0.041) |
| 512k | randwrite | 0.00 2.71 (0.0,0.019) | 0.00 0.46 (0.0,0.017) | 0.00 2.96 (0.0,0.019) |
| 1024k | randwrite | 0.00 2.67 (0.0,0.0082) | 0.00 2.54 (0.0,0.0087) | 0.00 0.20 (0.0,0.0093) |
| 2048k | randwrite | 0.00 2.40 (0.0,0.0041) | 0.00 1.90 (0.0,0.0041) | 0.00 3.44 (0.0,0.0043) |
| 4096k | randwrite | 0.00 3.38 (0.0,0.002) | 0.00 1.69 (0.0,0.0025) | 0.00 6.00 (0.0,0.0025) |
| 8192k | randwrite | 0.00 3.09 (0.0,0.00098) | 0.00 1.10 (0.0,0.0012) | 0.00 0.45 (0.0,0.0012) |
| 4k | read | 1.26 0.00 (1.0,0.0) | 1.65 0.00 (2.0,0.0) | -9.94 0.00 (2.8,0.0) |
| 8k | read | 1.86 0.00 (0.78,0.0) | 4.60 0.00 (1.5,0.0) | 0.47 0.00 (1.7,0.0) |
| 16k | read | 1.13 0.00 (0.54,0.0) | 4.00 0.00 (0.95,0.0) | -10.28 0.00 (0.97,0.0) |
| 32k | read | 0.64 0.00 (0.32,0.0) | 5.54 0.00 (0.46,0.0) | -0.67 0.00 (0.47,0.0) |
| 64k | read | 0.29 0.00 (0.18,0.0) | -0.08 0.00 (0.25,0.0) | -0.82 0.00 (0.25,0.0) |
| 128k | read | 0.11 0.00 (0.095,0.0) | 0.12 0.00 (0.11,0.0) | -4.99 0.00 (0.12,0.0) |
| 256k | read | 1.66 0.00 (0.05,0.0) | -0.41 0.00 (0.058,0.0) | -0.30 0.00 (0.061,0.0) |
| 512k | read | 1.06 0.00 (0.024,0.0) | -1.77 0.00 (0.029,0.0) | 0.81 0.00 (0.029,0.0) |
| 1024k | read | 0.41 0.00 (0.011,0.0) | -2.28 0.00 (0.013,0.0) | -0.33 0.00 (0.013,0.0) |
| 2048k | read | -1.18 0.00 (0.0053,0.0) | 1.41 0.00 (0.0063,0.0) | -0.86 0.00 (0.0063,0.0) |
| 4096k | read | -0.54 0.00 (0.0027,0.0) | -11.77 0.00 (0.0031,0.0) | 0.02 0.00 (0.003,0.0) |
| 8192k | read | -1.60 0.00 (0.0013,0.0) | -4.25 0.00 (0.0013,0.0) | -1.64 0.00 (0.0013,0.0) |
| 4k | readwrite | 1.01 1.01 (0.34,0.34) | 1.00 1.00 (0.42,0.42) | 1.30 1.29 (0.45,0.44) |
| 8k | readwrite | 1.54 1.54 (0.28,0.28) | 1.65 1.66 (0.32,0.32) | 1.18 1.18 (0.36,0.36) |
| 16k | readwrite | 1.07 1.08 (0.2,0.2) | 2.82 2.81 (0.24,0.24) | 0.40 0.39 (0.26,0.26) |
| 32k | readwrite | 0.56 0.56 (0.13,0.13) | -1.31 -1.30 (0.16,0.16) | 0.09 0.09 (0.16,0.16) |
| 64k | readwrite | 1.82 1.83 (0.077,0.077) | -0.96 -0.96 (0.091,0.091) | -0.99 -0.97 (0.092,0.092) |
| 128k | readwrite | 2.09 2.10 (0.041,0.041) | 1.07 1.06 (0.045,0.045) | 0.28 0.29 (0.045,0.045) |
| 256k | readwrite | 2.06 2.05 (0.022,0.022) | 1.84 1.85 (0.023,0.023) | 0.66 0.59 (0.023,0.023) |
| 512k | readwrite | 5.90 5.84 (0.01,0.01) | 1.20 1.17 (0.011,0.011) | 1.18 1.21 (0.011,0.011) |
| 1024k | readwrite | 1.94 1.89 (0.0047,0.0047) | 2.43 2.48 (0.0053,0.0053) | 2.18 2.21 (0.0054,0.0054) |
| 2048k | readwrite | -0.47 -0.47 (0.0023,0.0023) | 2.45 2.38 (0.0026,0.0026) | 1.76 1.78 (0.0026,0.0026) |
| 4096k | readwrite | 1.88 1.77 (0.0011,0.0012) | 1.37 1.17 (0.0014,0.0014) | 1.80 1.90 (0.0013,0.0013) |
| 8192k | readwrite | 1.27 1.42 (0.00057,0.00057) | 0.86 0.73 (0.00061,0.00062) | -0.99 -1.28 (0.00061,0.00062) |
| 4k | write | 0.00 1.53 (0.0,0.51) | 0.00 0.35 (0.0,0.51) | 0.00 0.92 (0.0,0.52) |
| 8k | write | 0.00 1.04 (0.0,0.42) | 0.00 1.01 (0.0,0.4) | 0.00 2.90 (0.0,0.44) |
| 16k | write | 0.00 0.86 (0.0,0.32) | 0.00 1.89 (0.0,0.35) | 0.00 -1.38 (0.0,0.36) |
| 32k | write | 0.00 0.47 (0.0,0.21) | 0.00 -0.52 (0.0,0.26) | 0.00 -1.40 (0.0,0.24) |
| 64k | write | 0.00 0.30 (0.0,0.13) | 0.00 13.34 (0.0,0.14) | 0.00 2.01 (0.0,0.15) |
| 128k | write | 0.00 2.24 (0.0,0.073) | 0.00 16.77 (0.0,0.076) | 0.00 -0.36 (0.0,0.085) |
| 256k | write | 0.00 3.49 (0.0,0.039) | 0.00 9.76 (0.0,0.039) | 0.00 -4.52 (0.0,0.043) |
| 512k | write | 0.00 4.08 (0.0,0.02) | 0.00 1.03 (0.0,0.017) | 0.00 1.05 (0.0,0.019) |
| 1024k | write | 0.00 2.72 (0.0,0.0083) | 0.00 2.89 (0.0,0.0086) | 0.00 2.98 (0.0,0.0095) |
| 2048k | write | 0.00 2.14 (0.0,0.0041) | 0.00 3.12 (0.0,0.0041) | 0.00 1.84 (0.0,0.0044) |
| 4096k | write | 0.00 1.76 (0.0,0.002) | 0.00 1.80 (0.0,0.0026) | 0.00 6.01 (0.0,0.0025) |
| 8192k | write | 0.00 1.41 (0.0,0.00099) | 0.00 0.98 (0.0,0.0012) | 0.00 0.13 (0.0,0.0012) |
+---------+-----------+-------------------------------+-----------------------------+-------------------------------+
+---------+-----------+-------------------------------+------------------------------+-------------------------------+
| jobs/bs | workload | 4 | 5 | 6 |
+---------+-----------+-------------------------------+------------------------------+-------------------------------+
| 4k | randread | 1.66 0.00 (1.3,0.0) | -0.81 0.00 (1.3,0.0) | -0.50 0.00 (1.2,0.0) |
| 8k | randread | 2.52 0.00 (0.98,0.0) | 2.54 0.00 (0.95,0.0) | 2.52 0.00 (0.92,0.0) |
| 16k | randread | 0.68 0.00 (0.65,0.0) | 3.43 0.00 (0.64,0.0) | 5.84 0.00 (0.63,0.0) |
| 32k | randread | 1.88 0.00 (0.38,0.0) | 0.46 0.00 (0.38,0.0) | 2.33 0.00 (0.37,0.0) |
| 64k | randread | 1.28 0.00 (0.21,0.0) | 0.64 0.00 (0.21,0.0) | -0.26 0.00 (0.21,0.0) |
| 128k | randread | -0.05 0.00 (0.11,0.0) | 0.10 0.00 (0.11,0.0) | 0.09 0.00 (0.11,0.0) |
| 256k | randread | -0.01 0.00 (0.059,0.0) | -1.01 0.00 (0.059,0.0) | -0.39 0.00 (0.059,0.0) |
| 512k | randread | -1.17 0.00 (0.029,0.0) | 0.39 0.00 (0.028,0.0) | -0.33 0.00 (0.028,0.0) |
| 1024k | randread | -0.20 0.00 (0.013,0.0) | -0.85 0.00 (0.013,0.0) | -0.57 0.00 (0.013,0.0) |
| 2048k | randread | -0.07 0.00 (0.0062,0.0) | -0.26 0.00 (0.0061,0.0) | -0.15 0.00 (0.006,0.0) |
| 4096k | randread | -0.25 0.00 (0.0027,0.0) | -2.70 0.00 (0.0027,0.0) | -2.35 0.00 (0.0026,0.0) |
| 8192k | randread | -2.94 0.00 (0.0013,0.0) | -3.01 0.00 (0.0012,0.0) | -2.09 0.00 (0.0012,0.0) |
| 4k | randrw | 2.83 2.84 (0.36,0.36) | 2.24 2.23 (0.37,0.37) | 3.49 3.50 (0.38,0.38) |
| 8k | randrw | 1.31 1.32 (0.29,0.29) | 2.01 2.01 (0.3,0.3) | 2.09 2.09 (0.3,0.3) |
| 16k | randrw | 1.96 1.95 (0.22,0.22) | 1.61 1.61 (0.22,0.22) | 1.99 2.00 (0.22,0.22) |
| 32k | randrw | 1.01 1.03 (0.14,0.14) | 0.58 0.59 (0.14,0.14) | -0.04 -0.05 (0.14,0.14) |
| 64k | randrw | -0.19 -0.18 (0.081,0.081) | -0.72 -0.72 (0.081,0.081) | -0.50 -0.50 (0.081,0.081) |
| 128k | randrw | 0.42 0.42 (0.043,0.043) | -0.33 -0.36 (0.043,0.043) | -0.62 -0.59 (0.043,0.043) |
| 256k | randrw | 0.38 0.41 (0.022,0.022) | -0.35 -0.35 (0.022,0.022) | 0.37 0.31 (0.022,0.022) |
| 512k | randrw | 0.40 0.41 (0.011,0.011) | -0.29 -0.26 (0.011,0.011) | 2.76 2.48 (0.01,0.01) |
| 1024k | randrw | 1.16 1.01 (0.0052,0.0052) | 0.74 0.72 (0.0051,0.0051) | 1.72 1.77 (0.005,0.005) |
| 2048k | randrw | 1.09 0.96 (0.0026,0.0026) | 1.19 1.57 (0.0026,0.0026) | 1.49 1.21 (0.0025,0.0025) |
| 4096k | randrw | -0.13 0.06 (0.0013,0.0013) | 0.69 0.77 (0.0013,0.0013) | -0.20 -0.13 (0.0012,0.0012) |
| 8192k | randrw | -0.71 -1.27 (0.00059,0.00061) | -1.78 -2.02 (0.00059,0.0006) | -1.60 -1.25 (0.00058,0.00059) |
| 4k | randwrite | 0.00 2.17 (0.0,0.44) | 0.00 4.08 (0.0,0.45) | 0.00 5.47 (0.0,0.46) |
| 8k | randwrite | 0.00 0.98 (0.0,0.38) | 0.00 2.39 (0.0,0.39) | 0.00 2.30 (0.0,0.4) |
| 16k | randwrite | 0.00 2.02 (0.0,0.3) | 0.00 2.63 (0.0,0.3) | 0.00 3.17 (0.0,0.3) |
| 32k | randwrite | 0.00 1.82 (0.0,0.21) | 0.00 1.12 (0.0,0.22) | 0.00 1.40 (0.0,0.21) |
| 64k | randwrite | 0.00 1.41 (0.0,0.14) | 0.00 1.92 (0.0,0.13) | 0.00 1.76 (0.0,0.13) |
| 128k | randwrite | 0.00 2.32 (0.0,0.075) | 0.00 4.61 (0.0,0.074) | 0.00 2.31 (0.0,0.074) |
| 256k | randwrite | 0.00 -0.77 (0.0,0.038) | 0.00 2.61 (0.0,0.036) | 0.00 2.48 (0.0,0.036) |
| 512k | randwrite | 0.00 1.83 (0.0,0.019) | 0.00 1.68 (0.0,0.018) | 0.00 2.80 (0.0,0.018) |
| 1024k | randwrite | 0.00 3.32 (0.0,0.0086) | 0.00 1.20 (0.0,0.0087) | 0.00 0.92 (0.0,0.0081) |
| 2048k | randwrite | 0.00 0.92 (0.0,0.0043) | 0.00 2.95 (0.0,0.0042) | 0.00 3.15 (0.0,0.0042) |
| 4096k | randwrite | 0.00 3.22 (0.0,0.0025) | 0.00 0.50 (0.0,0.0024) | 0.00 0.40 (0.0,0.0024) |
| 8192k | randwrite | 0.00 -0.57 (0.0,0.0012) | 0.00 -0.41 (0.0,0.0011) | 0.00 -0.36 (0.0,0.0011) |
| 4k | read | 4.34 0.00 (2.5,0.0) | -0.40 0.00 (2.6,0.0) | -8.45 0.00 (2.6,0.0) |
| 8k | read | -1.43 0.00 (1.7,0.0) | 0.01 0.00 (1.7,0.0) | -3.89 0.00 (1.7,0.0) |
| 16k | read | 1.68 0.00 (0.96,0.0) | -0.38 0.00 (0.94,0.0) | 0.47 0.00 (0.91,0.0) |
| 32k | read | -0.98 0.00 (0.48,0.0) | -0.09 0.00 (0.47,0.0) | -1.24 0.00 (0.47,0.0) |
| 64k | read | -0.42 0.00 (0.25,0.0) | -0.26 0.00 (0.25,0.0) | -0.94 0.00 (0.25,0.0) |
| 128k | read | -0.49 0.00 (0.12,0.0) | 0.06 0.00 (0.12,0.0) | -0.59 0.00 (0.12,0.0) |
| 256k | read | -0.89 0.00 (0.062,0.0) | -0.11 0.00 (0.062,0.0) | -0.77 0.00 (0.062,0.0) |
| 512k | read | -0.64 0.00 (0.03,0.0) | -0.11 0.00 (0.03,0.0) | 0.08 0.00 (0.03,0.0) |
| 1024k | read | 0.00 0.00 (0.013,0.0) | -0.05 0.00 (0.013,0.0) | 0.42 0.00 (0.013,0.0) |
| 2048k | read | -0.69 0.00 (0.0063,0.0) | -0.36 0.00 (0.0063,0.0) | 1.12 0.00 (0.006,0.0) |
| 4096k | read | -2.69 0.00 (0.0029,0.0) | -1.85 0.00 (0.0027,0.0) | -1.13 0.00 (0.0026,0.0) |
| 8192k | read | -2.15 0.00 (0.0013,0.0) | -1.66 0.00 (0.0013,0.0) | -2.90 0.00 (0.0012,0.0) |
| 4k | readwrite | 0.32 0.32 (0.46,0.46) | 0.68 0.69 (0.47,0.47) | -0.26 -0.26 (0.5,0.5) |
| 8k | readwrite | 0.18 0.18 (0.37,0.37) | 0.77 0.76 (0.38,0.38) | -0.47 -0.48 (0.39,0.39) |
| 16k | readwrite | 0.24 0.25 (0.27,0.27) | -0.17 -0.15 (0.27,0.27) | -0.51 -0.52 (0.27,0.27) |
| 32k | readwrite | -0.91 -0.91 (0.17,0.17) | -1.40 -1.37 (0.17,0.17) | -1.67 -1.70 (0.17,0.17) |
| 64k | readwrite | -1.22 -1.21 (0.09,0.09) | -1.33 -1.30 (0.09,0.09) | -2.49 -2.50 (0.091,0.091) |
| 128k | readwrite | -0.67 -0.69 (0.045,0.045) | -1.23 -1.24 (0.045,0.045) | -1.01 -0.96 (0.045,0.045) |
| 256k | readwrite | -0.31 -0.35 (0.023,0.023) | -0.28 -0.28 (0.023,0.023) | 0.25 0.08 (0.022,0.023) |
| 512k | readwrite | 0.77 0.72 (0.011,0.011) | 0.96 0.96 (0.011,0.011) | 0.62 1.16 (0.011,0.011) |
| 1024k | readwrite | 0.93 0.91 (0.0053,0.0053) | 0.72 0.78 (0.0051,0.0051) | -0.48 -0.10 (0.0051,0.0051) |
| 2048k | readwrite | 1.84 1.77 (0.0026,0.0026) | 0.86 0.71 (0.0026,0.0026) | 0.31 1.37 (0.0026,0.0026) |
| 4096k | readwrite | 2.71 2.43 (0.0013,0.0013) | 1.54 1.73 (0.0013,0.0013) | -1.18 -0.79 (0.0012,0.0013) |
| 8192k | readwrite | -2.61 -1.52 (0.0006,0.00061) | -1.83 -1.27 (0.00059,0.0006) | -2.25 -1.53 (0.00059,0.00059) |
| 4k | write | 0.00 -2.94 (0.0,0.53) | 0.00 -2.23 (0.0,0.57) | 0.00 0.93 (0.0,0.56) |
| 8k | write | 0.00 -1.78 (0.0,0.45) | 0.00 -1.14 (0.0,0.47) | 0.00 0.47 (0.0,0.47) |
| 16k | write | 0.00 -1.46 (0.0,0.35) | 0.00 -1.46 (0.0,0.36) | 0.00 0.25 (0.0,0.36) |
| 32k | write | 0.00 -1.05 (0.0,0.24) | 0.00 -1.53 (0.0,0.24) | 0.00 -0.74 (0.0,0.24) |
| 64k | write | 0.00 -2.78 (0.0,0.15) | 0.00 1.39 (0.0,0.15) | 0.00 -2.42 (0.0,0.15) |
| 128k | write | 0.00 -0.03 (0.0,0.082) | 0.00 0.32 (0.0,0.081) | 0.00 2.76 (0.0,0.079) |
| 256k | write | 0.00 -0.33 (0.0,0.039) | 0.00 -0.32 (0.0,0.038) | 0.00 -2.37 (0.0,0.038) |
| 512k | write | 0.00 4.00 (0.0,0.019) | 0.00 6.45 (0.0,0.02) | 0.00 2.94 (0.0,0.019) |
| 1024k | write | 0.00 1.22 (0.0,0.0088) | 0.00 1.57 (0.0,0.0088) | 0.00 -3.98 (0.0,0.0087) |
| 2048k | write | 0.00 0.05 (0.0,0.0043) | 0.00 3.33 (0.0,0.0044) | 0.00 8.69 (0.0,0.0044) |
| 4096k | write | 0.00 2.36 (0.0,0.0025) | 0.00 1.52 (0.0,0.0024) | 0.00 0.44 (0.0,0.0024) |
| 8192k | write | 0.00 -0.35 (0.0,0.0012) | 0.00 -0.45 (0.0,0.0011) | 0.00 -0.73 (0.0,0.0011) |
+---------+-----------+-------------------------------+------------------------------+-------------------------------+
Andreas Hindborg (9):
rust: add radix tree abstraction
rust: add `pages` module for handling page allocation
rust: block: introduce `kernel::block::mq` module
rust: block: introduce `kernel::block::bio` module
RUST: add `module_params` macro
rust: apply cache line padding for `SpinLock`
RUST: implement `ForeignOwnable` for `Pin`
rust: add null block driver
rust: inline a number of short functions
Wedson Almeida Filho (2):
rust: lock: add support for `Lock::lock_irqsave`
rust: lock: implement `IrqSaveBackend` for `SpinLock`
drivers/block/Kconfig | 4 +
drivers/block/Makefile | 4 +
drivers/block/rnull-helpers.c | 60 ++++
drivers/block/rnull.rs | 177 ++++++++++
rust/bindings/bindings_helper.h | 4 +
rust/bindings/lib.rs | 1 +
rust/helpers.c | 91 ++++++
rust/kernel/block.rs | 6 +
rust/kernel/block/bio.rs | 93 ++++++
rust/kernel/block/bio/vec.rs | 181 +++++++++++
rust/kernel/block/mq.rs | 15 +
rust/kernel/block/mq/gen_disk.rs | 133 ++++++++
rust/kernel/block/mq/operations.rs | 260 +++++++++++++++
rust/kernel/block/mq/raw_writer.rs | 30 ++
rust/kernel/block/mq/request.rs | 87 +++++
rust/kernel/block/mq/tag_set.rs | 92 ++++++
rust/kernel/cache_padded.rs | 33 ++
rust/kernel/error.rs | 4 +
rust/kernel/lib.rs | 13 +
rust/kernel/module_param.rs | 501 +++++++++++++++++++++++++++++
rust/kernel/pages.rs | 284 ++++++++++++++++
rust/kernel/radix_tree.rs | 156 +++++++++
rust/kernel/sync/lock.rs | 49 ++-
rust/kernel/sync/lock/mutex.rs | 2 +
rust/kernel/sync/lock/spinlock.rs | 50 ++-
rust/kernel/types.rs | 30 ++
rust/macros/helpers.rs | 46 ++-
rust/macros/module.rs | 402 +++++++++++++++++++++--
scripts/Makefile.build | 2 +-
29 files changed, 2766 insertions(+), 44 deletions(-)
create mode 100644 drivers/block/rnull-helpers.c
create mode 100644 drivers/block/rnull.rs
create mode 100644 rust/kernel/block.rs
create mode 100644 rust/kernel/block/bio.rs
create mode 100644 rust/kernel/block/bio/vec.rs
create mode 100644 rust/kernel/block/mq.rs
create mode 100644 rust/kernel/block/mq/gen_disk.rs
create mode 100644 rust/kernel/block/mq/operations.rs
create mode 100644 rust/kernel/block/mq/raw_writer.rs
create mode 100644 rust/kernel/block/mq/request.rs
create mode 100644 rust/kernel/block/mq/tag_set.rs
create mode 100644 rust/kernel/cache_padded.rs
create mode 100644 rust/kernel/module_param.rs
create mode 100644 rust/kernel/pages.rs
create mode 100644 rust/kernel/radix_tree.rs
base-commit: ea76e08f4d901a450619831a255e9e0a4c0ed162
--
2.40.0
From: Andreas Hindborg <[email protected]>
Add initial abstractions for working with blk-mq.
This patch is a maintained, refactored subset of code originally published by
Wedson Almeida Filho <[email protected]> [1].
[1] https://github.com/wedsonaf/linux/tree/f2cfd2fe0e2ca4e90994f96afe268bbd4382a891/rust/kernel/blk/mq.rs
Cc: Wedson Almeida Filho <[email protected]>
Signed-off-by: Andreas Hindborg <[email protected]>
---
rust/bindings/bindings_helper.h | 2 +
rust/helpers.c | 22 +++
rust/kernel/block.rs | 5 +
rust/kernel/block/mq.rs | 15 ++
rust/kernel/block/mq/gen_disk.rs | 133 +++++++++++++++
rust/kernel/block/mq/operations.rs | 260 +++++++++++++++++++++++++++++
rust/kernel/block/mq/raw_writer.rs | 30 ++++
rust/kernel/block/mq/request.rs | 71 ++++++++
rust/kernel/block/mq/tag_set.rs | 92 ++++++++++
rust/kernel/error.rs | 4 +
rust/kernel/lib.rs | 1 +
11 files changed, 635 insertions(+)
create mode 100644 rust/kernel/block.rs
create mode 100644 rust/kernel/block/mq.rs
create mode 100644 rust/kernel/block/mq/gen_disk.rs
create mode 100644 rust/kernel/block/mq/operations.rs
create mode 100644 rust/kernel/block/mq/raw_writer.rs
create mode 100644 rust/kernel/block/mq/request.rs
create mode 100644 rust/kernel/block/mq/tag_set.rs
diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
index 52834962b94d..86c07eeb1ba1 100644
--- a/rust/bindings/bindings_helper.h
+++ b/rust/bindings/bindings_helper.h
@@ -11,6 +11,8 @@
#include <linux/wait.h>
#include <linux/sched.h>
#include <linux/radix-tree.h>
+#include <linux/blk_types.h>
+#include <linux/blk-mq.h>
/* `bindgen` gets confused at certain things. */
const gfp_t BINDINGS_GFP_KERNEL = GFP_KERNEL;
diff --git a/rust/helpers.c b/rust/helpers.c
index 9bd9d95da951..a59341084774 100644
--- a/rust/helpers.c
+++ b/rust/helpers.c
@@ -18,6 +18,7 @@
* accidentally exposed.
*/
+#include <linux/bio.h>
#include <linux/bug.h>
#include <linux/build_bug.h>
#include <linux/err.h>
@@ -28,6 +29,8 @@
#include <linux/wait.h>
#include <linux/radix-tree.h>
#include <linux/highmem.h>
+#include <linux/blk-mq.h>
+#include <linux/blkdev.h>
__noreturn void rust_helper_BUG(void)
{
@@ -130,6 +133,25 @@ void rust_helper_put_task_struct(struct task_struct *t)
}
EXPORT_SYMBOL_GPL(rust_helper_put_task_struct);
+struct bio_vec rust_helper_req_bvec(struct request *rq)
+{
+ return req_bvec(rq);
+}
+EXPORT_SYMBOL_GPL(rust_helper_req_bvec);
+
+void *rust_helper_blk_mq_rq_to_pdu(struct request *rq)
+{
+ return blk_mq_rq_to_pdu(rq);
+}
+EXPORT_SYMBOL_GPL(rust_helper_blk_mq_rq_to_pdu);
+
+void rust_helper_bio_advance_iter_single(const struct bio *bio,
+ struct bvec_iter *iter,
+ unsigned int bytes) {
+ bio_advance_iter_single(bio, iter, bytes);
+}
+EXPORT_SYMBOL_GPL(rust_helper_bio_advance_iter_single);
+
void rust_helper_init_radix_tree(struct xarray *tree, gfp_t gfp_mask)
{
INIT_RADIX_TREE(tree, gfp_mask);
diff --git a/rust/kernel/block.rs b/rust/kernel/block.rs
new file mode 100644
index 000000000000..4c93317a568a
--- /dev/null
+++ b/rust/kernel/block.rs
@@ -0,0 +1,5 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! Types for working with the block layer
+
+pub mod mq;
diff --git a/rust/kernel/block/mq.rs b/rust/kernel/block/mq.rs
new file mode 100644
index 000000000000..5b40f6a73c0f
--- /dev/null
+++ b/rust/kernel/block/mq.rs
@@ -0,0 +1,15 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! This module provides types for implementing drivers that interface with
+//! the blk-mq subsystem
+
+mod gen_disk;
+mod operations;
+mod raw_writer;
+mod request;
+mod tag_set;
+
+pub use gen_disk::GenDisk;
+pub use operations::Operations;
+pub use request::Request;
+pub use tag_set::TagSet;
diff --git a/rust/kernel/block/mq/gen_disk.rs b/rust/kernel/block/mq/gen_disk.rs
new file mode 100644
index 000000000000..50496af15bbf
--- /dev/null
+++ b/rust/kernel/block/mq/gen_disk.rs
@@ -0,0 +1,133 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! GenDisk abstraction
+//!
+//! C header: [`include/linux/blkdev.h`](../../include/linux/blkdev.h)
+//! C header: [`include/linux/blk-mq.h`](../../include/linux/blk-mq.h)
+
+use crate::block::mq::{raw_writer::RawWriter, Operations, TagSet};
+use crate::{
+ bindings, error::from_err_ptr, error::Result, sync::Arc, types::ForeignOwnable,
+ types::ScopeGuard,
+};
+use core::fmt::{self, Write};
+
+/// A generic block device
+///
+/// # Invariants
+///
+/// - `gendisk` must always point to an initialized and valid `struct gendisk`.
+pub struct GenDisk<T: Operations> {
+ _tagset: Arc<TagSet<T>>,
+ gendisk: *mut bindings::gendisk,
+}
+
+// SAFETY: `GenDisk` is an owned pointer to a `struct gendisk` and an `Arc` to a
+// `TagSet`. It is safe to send this to other threads as long as `T` is `Send`.
+unsafe impl<T: Operations + Send> Send for GenDisk<T> {}
+
+impl<T: Operations> GenDisk<T> {
+ /// Try to create a new `GenDisk`
+ pub fn try_new(tagset: Arc<TagSet<T>>, queue_data: T::QueueData) -> Result<Self> {
+ let data = queue_data.into_foreign();
+ let recover_data = ScopeGuard::new(|| {
+ // SAFETY: T::QueueData was created by the call to `into_foreign()` above
+ unsafe { T::QueueData::from_foreign(data) };
+ });
+
+ let lock_class_key = crate::sync::LockClassKey::new();
+
+ // SAFETY: `tagset.raw_tag_set()` points to a valid and initialized tag set
+ let gendisk = from_err_ptr(unsafe {
+ bindings::__blk_mq_alloc_disk(tagset.raw_tag_set(), data as _, lock_class_key.as_ptr())
+ })?;
+
+ const TABLE: bindings::block_device_operations = bindings::block_device_operations {
+ submit_bio: None,
+ open: None,
+ release: None,
+ ioctl: None,
+ compat_ioctl: None,
+ check_events: None,
+ unlock_native_capacity: None,
+ getgeo: None,
+ set_read_only: None,
+ swap_slot_free_notify: None,
+ report_zones: None,
+ devnode: None,
+ alternative_gpt_sector: None,
+ get_unique_id: None,
+ owner: core::ptr::null_mut(),
+ pr_ops: core::ptr::null_mut(),
+ free_disk: None,
+ poll_bio: None,
+ };
+
+ // SAFETY: gendisk is a valid pointer as we initialized it above
+ unsafe { (*gendisk).fops = &TABLE };
+
+ recover_data.dismiss();
+ Ok(Self {
+ _tagset: tagset,
+ gendisk,
+ })
+ }
+
+ /// Set the name of the device
+ pub fn set_name(&self, args: fmt::Arguments<'_>) -> Result {
+ let mut raw_writer = RawWriter::from_array(unsafe { &mut (*self.gendisk).disk_name });
+ raw_writer.write_fmt(args)?;
+ raw_writer.write_char('\0')?;
+ Ok(())
+ }
+
+ /// Register the device with the kernel. When this function returns, the
+ /// device is accessible from VFS. The kernel may issue reads to the device
+ /// during registration to discover partition information.
+ pub fn add(&self) -> Result {
+ crate::error::to_result(unsafe {
+ bindings::device_add_disk(core::ptr::null_mut(), self.gendisk, core::ptr::null_mut())
+ })
+ }
+
+ /// Call to tell the block layer the capacity of the device
+ pub fn set_capacity(&self, sectors: u64) {
+ unsafe { bindings::set_capacity(self.gendisk, sectors) };
+ }
+
+ /// Set the logical block size of the device
+ pub fn set_queue_logical_block_size(&self, size: u32) {
+ unsafe { bindings::blk_queue_logical_block_size((*self.gendisk).queue, size) };
+ }
+
+ /// Set the physical block size of the device
+ pub fn set_queue_physical_block_size(&self, size: u32) {
+ unsafe { bindings::blk_queue_physical_block_size((*self.gendisk).queue, size) };
+ }
+
+ /// Set the rotational media attribute for the device
+ pub fn set_rotational(&self, rotational: bool) {
+ if !rotational {
+ unsafe {
+ bindings::blk_queue_flag_set(bindings::QUEUE_FLAG_NONROT, (*self.gendisk).queue)
+ };
+ } else {
+ unsafe {
+ bindings::blk_queue_flag_clear(bindings::QUEUE_FLAG_NONROT, (*self.gendisk).queue)
+ };
+ }
+ }
+}
+
+impl<T: Operations> Drop for GenDisk<T> {
+ fn drop(&mut self) {
+ let queue_data = unsafe { (*(*self.gendisk).queue).queuedata };
+
+ unsafe { bindings::del_gendisk(self.gendisk) };
+
+ // SAFETY: `queue.queuedata` was created by `GenDisk::try_new()` with a
+ // call to `ForeignOwnable::into_foreign()` to create `queuedata`.
+ // `ForeignOwnable::from_foreign()` is only called here.
+ let _queue_data = unsafe { T::QueueData::from_foreign(queue_data) };
+ }
+}
diff --git a/rust/kernel/block/mq/operations.rs b/rust/kernel/block/mq/operations.rs
new file mode 100644
index 000000000000..fb1ab707d1f0
--- /dev/null
+++ b/rust/kernel/block/mq/operations.rs
@@ -0,0 +1,260 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! This module provides an interface for blk-mq drivers to implement.
+//!
+//! C header: [`include/linux/blk-mq.h`](../../include/linux/blk-mq.h)
+
+use crate::{
+ bindings,
+ block::mq::{tag_set::TagSetRef, Request},
+ error::{from_result, Result},
+ types::ForeignOwnable,
+};
+use core::{marker::PhantomData, pin::Pin};
+
+/// Implement this trait to interface with blk-mq as a block device
+#[macros::vtable]
+pub trait Operations: Sized {
+ /// Data associated with a request. This data is located next to the request
+ /// structure.
+ type RequestData;
+
+ /// Data associated with the `struct request_queue` that is allocated for
+ /// the `GenDisk` associated with this `Operations` implementation.
+ type QueueData: ForeignOwnable;
+
+ /// Data associated with a dispatch queue. This is stored as a pointer in
+ /// `struct blk_mq_hw_ctx`.
+ type HwData: ForeignOwnable;
+
+ /// Data associated with a tag set. This is stored as a pointer in `struct
+ /// blk_mq_tag_set`.
+ type TagSetData: ForeignOwnable;
+
+ /// Called by the kernel to allocate a new `RequestData`. The structure will
+ /// eventually be pinned, so defer initialization to `init_request_data()`
+ fn new_request_data(
+ _tagset_data: <Self::TagSetData as ForeignOwnable>::Borrowed<'_>,
+ ) -> Result<Self::RequestData>;
+
+ /// Called by the kernel to initialize a previously allocated `RequestData`
+ fn init_request_data(
+ _tagset_data: <Self::TagSetData as ForeignOwnable>::Borrowed<'_>,
+ _data: Pin<&mut Self::RequestData>,
+ ) -> Result {
+ Ok(())
+ }
+
+ /// Called by the kernel to queue a request with the driver. If `is_last` is
+ /// `false`, the driver is allowed to defer committing the request.
+ fn queue_rq(
+ hw_data: <Self::HwData as ForeignOwnable>::Borrowed<'_>,
+ queue_data: <Self::QueueData as ForeignOwnable>::Borrowed<'_>,
+ rq: &Request<Self>,
+ is_last: bool,
+ ) -> Result;
+
+ /// Called by the kernel to indicate that queued requests should be submitted
+ fn commit_rqs(
+ hw_data: <Self::HwData as ForeignOwnable>::Borrowed<'_>,
+ queue_data: <Self::QueueData as ForeignOwnable>::Borrowed<'_>,
+ );
+
+ /// Called by the kernel when the request is completed
+ fn complete(_rq: &Request<Self>);
+
+ /// Called by the kernel to allocate and initialize a driver specific hardware context data
+ fn init_hctx(
+ tagset_data: <Self::TagSetData as ForeignOwnable>::Borrowed<'_>,
+ hctx_idx: u32,
+ ) -> Result<Self::HwData>;
+
+ /// Called by the kernel to poll the device for completed requests. Only used for poll queues.
+ fn poll(_hw_data: <Self::HwData as ForeignOwnable>::Borrowed<'_>) -> i32 {
+ unreachable!()
+ }
+
+ /// Called by the kernel to map submission queues to CPU cores.
+ fn map_queues(_tag_set: &TagSetRef) {
+ unreachable!()
+ }
+
+ // There is no need for exit_request() because `drop` will be called.
+}
+
+pub(crate) struct OperationsVtable<T: Operations>(PhantomData<T>);
+
+impl<T: Operations> OperationsVtable<T> {
+ // # Safety
+ //
+ // The caller of this function must ensure that `hctx` and `bd` are valid
+ // and initialized. The pointees must outlive this function. Further
+ // `hctx->driver_data` must be a pointer created by a call to
+ // `Self::init_hctx_callback()` and the pointee must outlive this function.
+ // This function must not be called with a `hctx` for which
+ // `Self::exit_hctx_callback()` has been called.
+ unsafe extern "C" fn queue_rq_callback(
+ hctx: *mut bindings::blk_mq_hw_ctx,
+ bd: *const bindings::blk_mq_queue_data,
+ ) -> bindings::blk_status_t {
+ // SAFETY: `bd` is valid as required by the safety requirement for this function.
+ let rq = unsafe { (*bd).rq };
+
+ // SAFETY: The safety requirement for this function ensures that
+ // `(*hctx).driver_data` was returned by a call to
+ // `Self::init_hctx_callback()`. That function uses
+ // `ForeignOwnable::into_foreign()` to create `driver_data`. Further,
+ // the returned value does not outlive this function and
+ // `from_foreign()` is not called until `Self::exit_hctx_callback()` is
+ // called. By the safety requirement of this function and contract with
+ // the `blk-mq` API, `queue_rq_callback()` will not be called after that
+ // point.
+ let hw_data = unsafe { T::HwData::borrow((*hctx).driver_data) };
+
+ // SAFETY: `hctx` is valid as required by this function.
+ let queue_data = unsafe { (*(*hctx).queue).queuedata };
+
+ // SAFETY: `queue.queuedata` was created by `GenDisk::try_new()` with a
+ // call to `ForeignOwnable::into_foreign()` to create `queuedata`.
+ // `ForeignOwnable::from_foreign()` is only called when the tagset is
+ // dropped, which happens after we are dropped.
+ let queue_data = unsafe { T::QueueData::borrow(queue_data) };
+
+ // SAFETY: `bd` is valid as required by the safety requirement for this function.
+ let ret = T::queue_rq(
+ hw_data,
+ queue_data,
+ &unsafe { Request::from_ptr(rq) },
+ unsafe { (*bd).last },
+ );
+ if let Err(e) = ret {
+ e.to_blk_status()
+ } else {
+ bindings::BLK_STS_OK as _
+ }
+ }
+
+ unsafe extern "C" fn commit_rqs_callback(hctx: *mut bindings::blk_mq_hw_ctx) {
+ let hw_data = unsafe { T::HwData::borrow((*hctx).driver_data) };
+
+ // SAFETY: `hctx` is valid as required by this function.
+ let queue_data = unsafe { (*(*hctx).queue).queuedata };
+
+ // SAFETY: `queue.queuedata` was created by `GenDisk::try_new()` with a
+ // call to `ForeignOwnable::into_foreign()` to create `queuedata`.
+ // `ForeignOwnable::from_foreign()` is only called when the tagset is
+ // dropped, which happens after we are dropped.
+ let queue_data = unsafe { T::QueueData::borrow(queue_data) };
+ T::commit_rqs(hw_data, queue_data)
+ }
+
+ unsafe extern "C" fn complete_callback(rq: *mut bindings::request) {
+ T::complete(&unsafe { Request::from_ptr(rq) });
+ }
+
+ unsafe extern "C" fn poll_callback(
+ hctx: *mut bindings::blk_mq_hw_ctx,
+ _iob: *mut bindings::io_comp_batch,
+ ) -> core::ffi::c_int {
+ let hw_data = unsafe { T::HwData::borrow((*hctx).driver_data) };
+ T::poll(hw_data)
+ }
+
+ unsafe extern "C" fn init_hctx_callback(
+ hctx: *mut bindings::blk_mq_hw_ctx,
+ tagset_data: *mut core::ffi::c_void,
+ hctx_idx: core::ffi::c_uint,
+ ) -> core::ffi::c_int {
+ from_result(|| {
+ let tagset_data = unsafe { T::TagSetData::borrow(tagset_data) };
+ let data = T::init_hctx(tagset_data, hctx_idx)?;
+ unsafe { (*hctx).driver_data = data.into_foreign() as _ };
+ Ok(0)
+ })
+ }
+
+ unsafe extern "C" fn exit_hctx_callback(
+ hctx: *mut bindings::blk_mq_hw_ctx,
+ _hctx_idx: core::ffi::c_uint,
+ ) {
+ let ptr = unsafe { (*hctx).driver_data };
+ unsafe { T::HwData::from_foreign(ptr) };
+ }
+
+ unsafe extern "C" fn init_request_callback(
+ set: *mut bindings::blk_mq_tag_set,
+ rq: *mut bindings::request,
+ _hctx_idx: core::ffi::c_uint,
+ _numa_node: core::ffi::c_uint,
+ ) -> core::ffi::c_int {
+ from_result(|| {
+ // SAFETY: The tagset invariants guarantee that all requests are allocated with extra memory
+ // for the request data.
+ let pdu = unsafe { bindings::blk_mq_rq_to_pdu(rq) } as *mut T::RequestData;
+ let tagset_data = unsafe { T::TagSetData::borrow((*set).driver_data) };
+
+ let v = T::new_request_data(tagset_data)?;
+
+ // SAFETY: `pdu` memory is valid, as it was allocated by the caller.
+ unsafe { pdu.write(v) };
+
+ let tagset_data = unsafe { T::TagSetData::borrow((*set).driver_data) };
+ // SAFETY: `pdu` memory is valid and properly initialised.
+ T::init_request_data(tagset_data, unsafe { Pin::new_unchecked(&mut *pdu) })?;
+
+ Ok(0)
+ })
+ }
+
+ unsafe extern "C" fn exit_request_callback(
+ _set: *mut bindings::blk_mq_tag_set,
+ rq: *mut bindings::request,
+ _hctx_idx: core::ffi::c_uint,
+ ) {
+ // SAFETY: The tagset invariants guarantee that all requests are allocated with extra memory
+ // for the request data.
+ let pdu = unsafe { bindings::blk_mq_rq_to_pdu(rq) } as *mut T::RequestData;
+
+ // SAFETY: `pdu` is valid for read and write and is properly initialised.
+ unsafe { core::ptr::drop_in_place(pdu) };
+ }
+
+ unsafe extern "C" fn map_queues_callback(tag_set_ptr: *mut bindings::blk_mq_tag_set) {
+ let tag_set = unsafe { TagSetRef::from_ptr(tag_set_ptr) };
+ T::map_queues(&tag_set);
+ }
+
+ const VTABLE: bindings::blk_mq_ops = bindings::blk_mq_ops {
+ queue_rq: Some(Self::queue_rq_callback),
+ queue_rqs: None,
+ commit_rqs: Some(Self::commit_rqs_callback),
+ get_budget: None,
+ put_budget: None,
+ set_rq_budget_token: None,
+ get_rq_budget_token: None,
+ timeout: None,
+ poll: if T::HAS_POLL {
+ Some(Self::poll_callback)
+ } else {
+ None
+ },
+ complete: Some(Self::complete_callback),
+ init_hctx: Some(Self::init_hctx_callback),
+ exit_hctx: Some(Self::exit_hctx_callback),
+ init_request: Some(Self::init_request_callback),
+ exit_request: Some(Self::exit_request_callback),
+ cleanup_rq: None,
+ busy: None,
+ map_queues: if T::HAS_MAP_QUEUES {
+ Some(Self::map_queues_callback)
+ } else {
+ None
+ },
+ #[cfg(CONFIG_BLK_DEBUG_FS)]
+ show_rq: None,
+ };
+
+ pub(crate) const unsafe fn build() -> &'static bindings::blk_mq_ops {
+ &Self::VTABLE
+ }
+}
diff --git a/rust/kernel/block/mq/raw_writer.rs b/rust/kernel/block/mq/raw_writer.rs
new file mode 100644
index 000000000000..25c16ee0b1f7
--- /dev/null
+++ b/rust/kernel/block/mq/raw_writer.rs
@@ -0,0 +1,30 @@
+use core::fmt::{self, Write};
+
+pub(crate) struct RawWriter {
+ ptr: *mut u8,
+ len: usize,
+}
+
+impl RawWriter {
+ unsafe fn new(ptr: *mut u8, len: usize) -> Self {
+ Self { ptr, len }
+ }
+
+ pub(crate) fn from_array<const N: usize>(a: &mut [core::ffi::c_char; N]) -> Self {
+ unsafe { Self::new(a.as_mut_ptr() as *mut u8, N) }
+ }
+}
+
+impl Write for RawWriter {
+ fn write_str(&mut self, s: &str) -> fmt::Result {
+ let bytes = s.as_bytes();
+ let len = bytes.len();
+ if len > self.len {
+ return Err(fmt::Error);
+ }
+ unsafe { core::ptr::copy_nonoverlapping(bytes.as_ptr(), self.ptr, len) };
+ self.ptr = unsafe { self.ptr.add(len) };
+ self.len -= len;
+ Ok(())
+ }
+}
diff --git a/rust/kernel/block/mq/request.rs b/rust/kernel/block/mq/request.rs
new file mode 100644
index 000000000000..e95ae3fd71ad
--- /dev/null
+++ b/rust/kernel/block/mq/request.rs
@@ -0,0 +1,71 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! This module provides a wrapper for the C `struct request` type.
+//!
+//! C header: [`include/linux/blk-mq.h`](../../include/linux/blk-mq.h)
+
+use crate::{
+ bindings,
+ block::mq::Operations,
+ error::{Error, Result},
+};
+use core::marker::PhantomData;
+
+/// A wrapper around a blk-mq `struct request`. This represents an IO request.
+pub struct Request<T: Operations> {
+ ptr: *mut bindings::request,
+ _p: PhantomData<T>,
+}
+
+impl<T: Operations> Request<T> {
+ pub(crate) unsafe fn from_ptr(ptr: *mut bindings::request) -> Self {
+ Self {
+ ptr,
+ _p: PhantomData,
+ }
+ }
+
+ /// Get the command identifier for the request
+ pub fn command(&self) -> u32 {
+ unsafe { (*self.ptr).cmd_flags & ((1 << bindings::REQ_OP_BITS) - 1) }
+ }
+
+ /// Call this to indicate to the kernel that the request has been issued by the driver
+ pub fn start(&self) {
+ unsafe { bindings::blk_mq_start_request(self.ptr) };
+ }
+
+ /// Call this to indicate to the kernel that the request has been completed without errors
+ // TODO: Consume rq so that we can't use it after ending it?
+ pub fn end_ok(&self) {
+ unsafe { bindings::blk_mq_end_request(self.ptr, bindings::BLK_STS_OK as _) };
+ }
+
+ /// Call this to indicate to the kernel that the request completed with an error
+ pub fn end_err(&self, err: Error) {
+ unsafe { bindings::blk_mq_end_request(self.ptr, err.to_blk_status()) };
+ }
+
+ /// Call this to indicate that the request completed with the status indicated by `status`
+ pub fn end(&self, status: Result) {
+ if let Err(e) = status {
+ self.end_err(e);
+ } else {
+ self.end_ok();
+ }
+ }
+
+ /// Call this to schedule deferred completion of the request
+ // TODO: Consume rq so that we can't use it after completing it?
+ pub fn complete(&self) {
+ if !unsafe { bindings::blk_mq_complete_request_remote(self.ptr) } {
+ T::complete(&unsafe { Self::from_ptr(self.ptr) });
+ }
+ }
+
+ /// Get the target sector for the request
+ #[inline(always)]
+ pub fn sector(&self) -> usize {
+ unsafe { (*self.ptr).__sector as usize }
+ }
+}
diff --git a/rust/kernel/block/mq/tag_set.rs b/rust/kernel/block/mq/tag_set.rs
new file mode 100644
index 000000000000..d122db7f6d0e
--- /dev/null
+++ b/rust/kernel/block/mq/tag_set.rs
@@ -0,0 +1,92 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! This module provides the `TagSet` struct to wrap the C `struct blk_mq_tag_set`.
+//!
+//! C header: [`include/linux/blk-mq.h`](../../include/linux/blk-mq.h)
+
+use crate::{
+ bindings,
+ block::mq::{operations::OperationsVtable, Operations},
+ error::{Error, Result},
+ sync::Arc,
+ types::ForeignOwnable,
+};
+use core::{cell::UnsafeCell, convert::TryInto, marker::PhantomData};
+
+/// A wrapper for the C `struct blk_mq_tag_set`
+pub struct TagSet<T: Operations> {
+ inner: UnsafeCell<bindings::blk_mq_tag_set>,
+ _p: PhantomData<T>,
+}
+
+impl<T: Operations> TagSet<T> {
+ /// Try to create a new tag set
+ pub fn try_new(
+ nr_hw_queues: u32,
+ tagset_data: T::TagSetData,
+ num_tags: u32,
+ num_maps: u32,
+ ) -> Result<Arc<Self>> {
+ let tagset = Arc::try_new(Self {
+ inner: UnsafeCell::new(bindings::blk_mq_tag_set::default()),
+ _p: PhantomData,
+ })?;
+
+ // SAFETY: We just allocated `tagset`, we know this is the only reference to it.
+ let inner = unsafe { &mut *tagset.inner.get() };
+
+ inner.ops = unsafe { OperationsVtable::<T>::build() };
+ inner.nr_hw_queues = nr_hw_queues;
+ inner.timeout = 0; // 0 means default which is 30 * HZ in C
+ inner.numa_node = bindings::NUMA_NO_NODE;
+ inner.queue_depth = num_tags;
+ inner.cmd_size = core::mem::size_of::<T::RequestData>().try_into()?;
+ inner.flags = bindings::BLK_MQ_F_SHOULD_MERGE;
+ inner.driver_data = tagset_data.into_foreign() as _;
+ inner.nr_maps = num_maps;
+
+ // SAFETY: `inner` points to valid and initialised memory.
+ let ret = unsafe { bindings::blk_mq_alloc_tag_set(inner) };
+ if ret < 0 {
+ // SAFETY: We created `driver_data` above with `into_foreign`
+ unsafe { T::TagSetData::from_foreign(inner.driver_data) };
+ return Err(Error::from_errno(ret));
+ }
+
+ Ok(tagset)
+ }
+
+ /// Return the pointer to the wrapped `struct blk_mq_tag_set`
+ pub(crate) fn raw_tag_set(&self) -> *mut bindings::blk_mq_tag_set {
+ self.inner.get()
+ }
+}
+
+impl<T: Operations> Drop for TagSet<T> {
+ fn drop(&mut self) {
+ let tagset_data = unsafe { (*self.inner.get()).driver_data };
+
+ // SAFETY: `inner` is valid and has been properly initialised during construction.
+ unsafe { bindings::blk_mq_free_tag_set(self.inner.get()) };
+
+ // SAFETY: `tagset_data` was created by a call to
+ // `ForeignOwnable::into_foreign` in `TagSet::try_new()`
+ unsafe { T::TagSetData::from_foreign(tagset_data) };
+ }
+}
+
+/// A tag set reference. Used to control lifetime and prevent drop of TagSet references passed to
+/// `Operations::map_queues()`
+pub struct TagSetRef {
+ ptr: *mut bindings::blk_mq_tag_set,
+}
+
+impl TagSetRef {
+ pub(crate) unsafe fn from_ptr(tagset: *mut bindings::blk_mq_tag_set) -> Self {
+ Self { ptr: tagset }
+ }
+
+ pub fn ptr(&self) -> *mut bindings::blk_mq_tag_set {
+ self.ptr
+ }
+}
diff --git a/rust/kernel/error.rs b/rust/kernel/error.rs
index 5f4114b30b94..421fef677321 100644
--- a/rust/kernel/error.rs
+++ b/rust/kernel/error.rs
@@ -107,6 +107,10 @@ impl Error {
self.0
}
+ pub(crate) fn to_blk_status(self) -> bindings::blk_status_t {
+ unsafe { bindings::errno_to_blk_status(self.0) }
+ }
+
/// Returns the error encoded as a pointer.
#[allow(dead_code)]
pub(crate) fn to_ptr<T>(self) -> *mut T {
diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs
index 8bef6686504b..cd798d12d97c 100644
--- a/rust/kernel/lib.rs
+++ b/rust/kernel/lib.rs
@@ -34,6 +34,7 @@ extern crate self as kernel;
#[cfg(not(test))]
#[cfg(not(testlib))]
mod allocator;
+pub mod block;
mod build_assert;
pub mod error;
pub mod init;
--
2.40.0
From: Andreas Hindborg <[email protected]>
The Rust compiler will not inline functions that live in vmlinux when
building modules. Add inline directives to these short functions to ensure that
they are inlined when building modules.
Signed-off-by: Andreas Hindborg <[email protected]>
---
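A minimal userspace sketch of the pattern this patch applies, for readers less
familiar with Rust inlining. The module `kernel_like` and the type `Ticket` are
hypothetical stand-ins and not part of this series. In a real build the module
would be a separate crate (playing the role of the `kernel` crate in vmlinux),
and the assumption here is that a small non-generic function without an inline
hint is normally reached through a function call from the downstream crate:

```rust
// Stand-in for the `kernel` crate. In the real build this would be a separate
// crate linked into vmlinux; a module is used here only to keep the sketch
// self-contained.
mod kernel_like {
    use core::sync::atomic::{AtomicU32, Ordering};

    /// Hypothetical type with a tiny, non-generic accessor.
    pub struct Ticket {
        next: AtomicU32,
    }

    impl Ticket {
        pub const fn new() -> Self {
            Self {
                next: AtomicU32::new(0),
            }
        }

        // The hint asks the compiler to make the body available to callers in
        // other crates and to inline it there.
        #[inline(always)]
        pub fn take(&self) -> u32 {
            self.next.fetch_add(1, Ordering::Relaxed)
        }
    }
}

/// Stand-in for module code calling into the `kernel` crate.
pub fn take_two(t: &kernel_like::Ticket) -> (u32, u32) {
    (t.take(), t.take())
}
```

The hint does not change behaviour, only code generation, so the effect has to
be confirmed by inspecting the generated module code or by benchmarking.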
rust/kernel/sync/lock.rs | 2 ++
rust/kernel/sync/lock/mutex.rs | 2 ++
rust/kernel/sync/lock/spinlock.rs | 2 ++
rust/kernel/types.rs | 6 ++++++
4 files changed, 12 insertions(+)
diff --git a/rust/kernel/sync/lock.rs b/rust/kernel/sync/lock.rs
index bb21af8a8377..4bfc2f5d9841 100644
--- a/rust/kernel/sync/lock.rs
+++ b/rust/kernel/sync/lock.rs
@@ -101,6 +101,7 @@ impl<T: ?Sized, B: IrqSaveBackend> Lock<T, B> {
/// Before acquiring the lock, it disables interrupts. When the guard is dropped, the interrupt
/// state (either enabled or disabled) is restored to its state before
/// [`lock_irqsave`](Self::lock_irqsave) was called.
+ #[inline(always)]
pub fn lock_irqsave(&self) -> Guard<'_, T, B> {
// SAFETY: The constructor of the type calls `init`, so the existence of the object proves
// that `init` was called.
@@ -210,6 +211,7 @@ impl<T: ?Sized, B: Backend> core::ops::DerefMut for Guard<'_, T, B> {
}
impl<T: ?Sized, B: Backend> Drop for Guard<'_, T, B> {
+ #[inline(always)]
fn drop(&mut self) {
// SAFETY: The caller owns the lock, so it is safe to unlock it.
unsafe { B::unlock(self.lock.state.get(), &self.state) };
diff --git a/rust/kernel/sync/lock/mutex.rs b/rust/kernel/sync/lock/mutex.rs
index 923472f04af4..5e8096811b98 100644
--- a/rust/kernel/sync/lock/mutex.rs
+++ b/rust/kernel/sync/lock/mutex.rs
@@ -104,12 +104,14 @@ unsafe impl super::Backend for MutexBackend {
unsafe { bindings::__mutex_init(ptr, name, key) }
}
+ #[inline(always)]
unsafe fn lock(ptr: *mut Self::State) -> Self::GuardState {
// SAFETY: The safety requirements of this function ensure that `ptr` points to valid
// memory, and that it has been initialised before.
unsafe { bindings::mutex_lock(ptr) };
}
+ #[inline(always)]
unsafe fn unlock(ptr: *mut Self::State, _guard_state: &Self::GuardState) {
// SAFETY: The safety requirements of this function ensure that `ptr` is valid and that the
// caller is the owner of the mutex.
diff --git a/rust/kernel/sync/lock/spinlock.rs b/rust/kernel/sync/lock/spinlock.rs
index 50b8775bb49d..23a973dab85c 100644
--- a/rust/kernel/sync/lock/spinlock.rs
+++ b/rust/kernel/sync/lock/spinlock.rs
@@ -122,6 +122,7 @@ unsafe impl super::Backend for SpinLockBackend {
None
}
+ #[inline(always)]
unsafe fn unlock(ptr: *mut Self::State, guard_state: &Self::GuardState) {
match guard_state {
// SAFETY: The safety requirements of this function ensure that `ptr` is valid and that
@@ -141,6 +142,7 @@ unsafe impl super::Backend for SpinLockBackend {
// interrupt state, and the `irqrestore` variant of the lock release functions to restore the state
// in `unlock` -- we use the guard context to determine which method was used to acquire the lock.
unsafe impl super::IrqSaveBackend for SpinLockBackend {
+ #[inline(always)]
unsafe fn lock_irqsave(ptr: *mut Self::State) -> Self::GuardState {
// SAFETY: The safety requirements of this function ensure that `ptr` points to valid
// memory, and that it has been initialised before.
diff --git a/rust/kernel/types.rs b/rust/kernel/types.rs
index 98e71e96a7fc..7be1f64bbde9 100644
--- a/rust/kernel/types.rs
+++ b/rust/kernel/types.rs
@@ -70,10 +70,12 @@ pub trait ForeignOwnable: Sized {
impl<T: 'static> ForeignOwnable for Box<T> {
type Borrowed<'a> = &'a T;
+ #[inline(always)]
fn into_foreign(self) -> *const core::ffi::c_void {
Box::into_raw(self) as _
}
+ #[inline(always)]
unsafe fn borrow<'a>(ptr: *const core::ffi::c_void) -> &'a T {
// SAFETY: The safety requirements for this function ensure that the object is still alive,
// so it is safe to dereference the raw pointer.
@@ -82,6 +84,7 @@ impl<T: 'static> ForeignOwnable for Box<T> {
unsafe { &*ptr.cast() }
}
+ #[inline(always)]
unsafe fn from_foreign(ptr: *const core::ffi::c_void) -> Self {
// SAFETY: The safety requirements of this function ensure that `ptr` comes from a previous
// call to `Self::into_foreign`.
@@ -92,12 +95,15 @@ impl<T: 'static> ForeignOwnable for Box<T> {
impl ForeignOwnable for () {
type Borrowed<'a> = ();
+ #[inline(always)]
fn into_foreign(self) -> *const core::ffi::c_void {
core::ptr::NonNull::dangling().as_ptr()
}
+ #[inline(always)]
unsafe fn borrow<'a>(_: *const core::ffi::c_void) -> Self::Borrowed<'a> {}
+ #[inline(always)]
unsafe fn from_foreign(_: *const core::ffi::c_void) -> Self {}
}
--
2.40.0
From: Andreas Hindborg <[email protected]>
This patch adds support for working with pages of order 0. Support for pages
with higher order is deferred. Page allocation flags are fixed in this patch.
Future work might allow the user to specify allocation flags.
This patch is a heavily modified version of code available in the rust tree [1],
primarily adding support for multiple page mapping strategies.
[1] https://github.com/rust-for-Linux/linux/tree/bc22545f38d74473cfef3e9fd65432733435b79f/rust/kernel/pages.rs
Signed-off-by: Andreas Hindborg <[email protected]>
---
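A simplified userspace sketch of the "multiple page mapping strategies" idea,
assuming nothing beyond the standard library. The names `Page`, `Mapping`,
`Normal` and `Atomic` are hypothetical, and the two traits of the patch
(`MappingInfo` and `MappingActions`) are collapsed into one for brevity. A
marker type selects the map/unmap pair at compile time, and a single generic
`Drop` impl dispatches through the trait, since `Drop` cannot be specialized
per marker type:

```rust
use std::cell::RefCell;
use std::marker::PhantomData;

/// Hypothetical stand-in for a page. It only records which calls were made.
pub struct Page {
    pub log: RefCell<Vec<&'static str>>,
}

/// Strategy trait: each marker type supplies its own map/unmap pair.
pub trait MappingInfo {
    fn map(page: &Page);
    fn unmap(page: &Page);
}

pub struct Normal;
pub struct Atomic;

impl MappingInfo for Normal {
    fn map(page: &Page) {
        page.log.borrow_mut().push("kmap");
    }
    fn unmap(page: &Page) {
        page.log.borrow_mut().push("kunmap");
    }
}

impl MappingInfo for Atomic {
    fn map(page: &Page) {
        page.log.borrow_mut().push("kmap_atomic");
    }
    fn unmap(page: &Page) {
        page.log.borrow_mut().push("kunmap_atomic");
    }
}

/// An owned mapping. Dropping it unmaps with the matching strategy.
pub struct Mapping<'a, I: MappingInfo> {
    page: &'a Page,
    _info: PhantomData<I>,
}

impl Page {
    pub fn new() -> Self {
        Page {
            log: RefCell::new(Vec::new()),
        }
    }

    pub fn map<I: MappingInfo>(&self) -> Mapping<'_, I> {
        I::map(self);
        Mapping {
            page: self,
            _info: PhantomData,
        }
    }
}

// One generic `Drop` impl that dispatches through the strategy trait.
impl<I: MappingInfo> Drop for Mapping<'_, I> {
    fn drop(&mut self) {
        I::unmap(self.page);
    }
}

/// Map and unmap once with each strategy and return the recorded calls.
pub fn demo() -> Vec<&'static str> {
    let page = Page::new();
    {
        let _m = page.map::<Normal>();
    }
    {
        let _m = page.map::<Atomic>();
    }
    page.log.into_inner()
}
```

Because the marker type is part of the mapping's type, an atomic mapping cannot
be released with the non-atomic unmap call by mistake.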
rust/helpers.c | 31 +++++
rust/kernel/lib.rs | 6 +
rust/kernel/pages.rs | 284 +++++++++++++++++++++++++++++++++++++++++++
3 files changed, 321 insertions(+)
create mode 100644 rust/kernel/pages.rs
diff --git a/rust/helpers.c b/rust/helpers.c
index 5dd5e325b7cc..9bd9d95da951 100644
--- a/rust/helpers.c
+++ b/rust/helpers.c
@@ -27,6 +27,7 @@
#include <linux/sched/signal.h>
#include <linux/wait.h>
#include <linux/radix-tree.h>
+#include <linux/highmem.h>
__noreturn void rust_helper_BUG(void)
{
@@ -150,6 +151,36 @@ void **rust_helper_radix_tree_next_slot(void **slot,
}
EXPORT_SYMBOL_GPL(rust_helper_radix_tree_next_slot);
+void *rust_helper_kmap(struct page *page)
+{
+ return kmap(page);
+}
+EXPORT_SYMBOL_GPL(rust_helper_kmap);
+
+void rust_helper_kunmap(struct page *page)
+{
+ kunmap(page);
+}
+EXPORT_SYMBOL_GPL(rust_helper_kunmap);
+
+void *rust_helper_kmap_atomic(struct page *page)
+{
+ return kmap_atomic(page);
+}
+EXPORT_SYMBOL_GPL(rust_helper_kmap_atomic);
+
+void rust_helper_kunmap_atomic(void *address)
+{
+ kunmap_atomic(address);
+}
+EXPORT_SYMBOL_GPL(rust_helper_kunmap_atomic);
+
+struct page *rust_helper_alloc_pages(gfp_t gfp_mask, unsigned int order)
+{
+ return alloc_pages(gfp_mask, order);
+}
+EXPORT_SYMBOL_GPL(rust_helper_alloc_pages);
+
/*
* We use `bindgen`'s `--size_t-is-usize` option to bind the C `size_t` type
* as the Rust `usize` type, so we can use it in contexts where Rust
diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs
index a85cb6aae8d6..8bef6686504b 100644
--- a/rust/kernel/lib.rs
+++ b/rust/kernel/lib.rs
@@ -38,6 +38,7 @@ mod build_assert;
pub mod error;
pub mod init;
pub mod ioctl;
+pub mod pages;
pub mod prelude;
pub mod print;
pub mod radix_tree;
@@ -57,6 +58,11 @@ pub use uapi;
#[doc(hidden)]
pub use build_error::build_error;
+/// Page size defined in terms of the `PAGE_SHIFT` macro from C.
+///
+/// [`PAGE_SHIFT`]: ../../../include/asm-generic/page.h
+pub const PAGE_SIZE: u32 = 1 << bindings::PAGE_SHIFT;
+
/// Prefix to appear before log messages printed from within the `kernel` crate.
const __LOG_PREFIX: &[u8] = b"rust_kernel\0";
diff --git a/rust/kernel/pages.rs b/rust/kernel/pages.rs
new file mode 100644
index 000000000000..ed51b053dd5d
--- /dev/null
+++ b/rust/kernel/pages.rs
@@ -0,0 +1,284 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! Kernel page allocation and management.
+//!
+//! This module currently provides limited support. It supports pages of order 0
+//! for most operations. Page allocation flags are fixed.
+
+use crate::{bindings, error::code::*, error::Result, PAGE_SIZE};
+use core::{marker::PhantomData, ptr};
+
+/// A set of physical pages.
+///
+/// `Pages` holds a reference to a set of pages of order `ORDER`. Having the order as a generic
+/// const allows the struct to have the same size as a pointer.
+///
+/// # Invariants
+///
+/// The pointer `Pages::pages` is valid and points to 2^ORDER pages.
+pub struct Pages<const ORDER: u32> {
+ pub(crate) pages: *mut bindings::page,
+}
+
+impl<const ORDER: u32> Pages<ORDER> {
+ /// Allocates a new set of contiguous pages.
+ pub fn new() -> Result<Self> {
+ let pages = unsafe {
+ bindings::alloc_pages(
+ bindings::GFP_KERNEL | bindings::__GFP_ZERO | bindings::___GFP_HIGHMEM,
+ ORDER,
+ )
+ };
+ if pages.is_null() {
+ return Err(ENOMEM);
+ }
+ // INVARIANTS: We checked that the allocation above succeeded.
+ // SAFETY: We allocated pages above
+ Ok(unsafe { Self::from_raw(pages) })
+ }
+
+ /// Create a `Pages` from a raw `struct page` pointer
+ ///
+ /// # Safety
+ ///
+ /// Caller must own the pages pointed to by `ptr` as these will be freed
+ /// when the returned `Pages` is dropped.
+ pub unsafe fn from_raw(ptr: *mut bindings::page) -> Self {
+ Self { pages: ptr }
+ }
+}
+
+impl Pages<0> {
+ #[inline(always)]
+ fn check_offset_and_map<I: MappingInfo>(
+ &self,
+ offset: usize,
+ len: usize,
+ ) -> Result<PageMapping<'_, I>>
+ where
+ Pages<0>: MappingActions<I>,
+ {
+ let end = offset.checked_add(len).ok_or(EINVAL)?;
+ if end > PAGE_SIZE as usize {
+ return Err(EINVAL);
+ }
+
+ let mapping = <Self as MappingActions<I>>::map(self);
+
+ Ok(mapping)
+ }
+
+ #[inline(always)]
+ unsafe fn read_internal<I: MappingInfo>(
+ &self,
+ dest: *mut u8,
+ offset: usize,
+ len: usize,
+ ) -> Result
+ where
+ Pages<0>: MappingActions<I>,
+ {
+ let mapping = self.check_offset_and_map::<I>(offset, len)?;
+
+ unsafe { ptr::copy_nonoverlapping((mapping.ptr as *mut u8).add(offset), dest, len) };
+ Ok(())
+ }
+
+ /// Maps the pages and reads from them into the given buffer.
+ ///
+ /// # Safety
+ ///
+ /// Callers must ensure that the destination buffer is valid for the given
+ /// length. Additionally, if the raw buffer is intended to be recast, they
+ /// must ensure that the data can be safely cast;
+ /// [`crate::io_buffer::ReadableFromBytes`] has more details about it.
+ /// `dest` may not point to the source page.
+ #[inline(always)]
+ pub unsafe fn read(&self, dest: *mut u8, offset: usize, len: usize) -> Result {
+ unsafe { self.read_internal::<NormalMappingInfo>(dest, offset, len) }
+ }
+
+ /// Maps the pages and reads from them into the given buffer. The page is
+ /// mapped atomically.
+ ///
+ /// # Safety
+ ///
+ /// Callers must ensure that the destination buffer is valid for the given
+ /// length. Additionally, if the raw buffer is intended to be recast, they
+ /// must ensure that the data can be safely cast;
+ /// [`crate::io_buffer::ReadableFromBytes`] has more details about it.
+ /// `dest` may not point to the source page.
+ #[inline(always)]
+ pub unsafe fn read_atomic(&self, dest: *mut u8, offset: usize, len: usize) -> Result {
+ unsafe { self.read_internal::<AtomicMappingInfo>(dest, offset, len) }
+ }
+
+ #[inline(always)]
+ unsafe fn write_internal<I: MappingInfo>(
+ &self,
+ src: *const u8,
+ offset: usize,
+ len: usize,
+ ) -> Result
+ where
+ Pages<0>: MappingActions<I>,
+ {
+ let mapping = self.check_offset_and_map::<I>(offset, len)?;
+
+ unsafe { ptr::copy_nonoverlapping(src, (mapping.ptr as *mut u8).add(offset), len) };
+ Ok(())
+ }
+
+ /// Maps the pages and writes into them from the given buffer.
+ ///
+ /// # Safety
+ ///
+ /// Callers must ensure that the buffer is valid for the given length.
+ /// Additionally, if the page is (or will be) mapped by userspace, they must
+ /// ensure that no kernel data is leaked through padding if it was cast from
+ /// another type; [`crate::io_buffer::WritableToBytes`] has more details
+ /// about it. `src` must not point to the destination page.
+ #[inline(always)]
+ pub unsafe fn write(&self, src: *const u8, offset: usize, len: usize) -> Result {
+ unsafe { self.write_internal::<NormalMappingInfo>(src, offset, len) }
+ }
+
+ /// Maps the pages and writes into them from the given buffer. The page is
+ /// mapped atomically.
+ ///
+ /// # Safety
+ ///
+ /// Callers must ensure that the buffer is valid for the given length.
+ /// Additionally, if the page is (or will be) mapped by userspace, they must
+ /// ensure that no kernel data is leaked through padding if it was cast from
+ /// another type; [`crate::io_buffer::WritableToBytes`] has more details
+ /// about it. `src` must not point to the destination page.
+ #[inline(always)]
+ pub unsafe fn write_atomic(&self, src: *const u8, offset: usize, len: usize) -> Result {
+ unsafe { self.write_internal::<AtomicMappingInfo>(src, offset, len) }
+ }
+
+ /// Maps the page at index 0.
+ #[inline(always)]
+ pub fn kmap(&self) -> PageMapping<'_, NormalMappingInfo> {
+ let ptr = unsafe { bindings::kmap(self.pages) };
+
+ PageMapping {
+ page: self.pages,
+ ptr,
+ _phantom: PhantomData,
+ _phantom2: PhantomData,
+ }
+ }
+
+ /// Atomically maps the page at index 0.
+ #[inline(always)]
+ pub fn kmap_atomic(&self) -> PageMapping<'_, AtomicMappingInfo> {
+ let ptr = unsafe { bindings::kmap_atomic(self.pages) };
+
+ PageMapping {
+ page: self.pages,
+ ptr,
+ _phantom: PhantomData,
+ _phantom2: PhantomData,
+ }
+ }
+}
+
+impl<const ORDER: u32> Drop for Pages<ORDER> {
+ fn drop(&mut self) {
+ // SAFETY: By the type invariants, we know the pages are allocated with the given order.
+ unsafe { bindings::__free_pages(self.pages, ORDER) };
+ }
+}
+
+/// Specifies the type of page mapping
+pub trait MappingInfo {}
+
+/// Encapsulates methods to map and unmap pages
+pub trait MappingActions<I: MappingInfo>
+where
+ Pages<0>: MappingActions<I>,
+{
+ /// Map a page into the kernel address space
+ fn map(pages: &Pages<0>) -> PageMapping<'_, I>;
+
+ /// Unmap a page specified by `mapping`
+ ///
+ /// # Safety
+ ///
+ /// Must only be called by `PageMapping::drop()`.
+ unsafe fn unmap(mapping: &PageMapping<'_, I>);
+}
+
+/// A type state indicating that pages were mapped atomically
+pub struct AtomicMappingInfo;
+impl MappingInfo for AtomicMappingInfo {}
+
+/// A type state indicating that pages were not mapped atomically
+pub struct NormalMappingInfo;
+impl MappingInfo for NormalMappingInfo {}
+
+impl MappingActions<AtomicMappingInfo> for Pages<0> {
+ #[inline(always)]
+ fn map(pages: &Pages<0>) -> PageMapping<'_, AtomicMappingInfo> {
+ pages.kmap_atomic()
+ }
+
+ #[inline(always)]
+ unsafe fn unmap(mapping: &PageMapping<'_, AtomicMappingInfo>) {
+ // SAFETY: An instance of `PageMapping` is created only when `kmap_atomic` succeeded for the given
+ // page, so it is safe to unmap it here.
+ unsafe { bindings::kunmap_atomic(mapping.ptr) };
+ }
+}
+
+impl MappingActions<NormalMappingInfo> for Pages<0> {
+ #[inline(always)]
+ fn map(pages: &Pages<0>) -> PageMapping<'_, NormalMappingInfo> {
+ pages.kmap()
+ }
+
+ #[inline(always)]
+ unsafe fn unmap(mapping: &PageMapping<'_, NormalMappingInfo>) {
+ // SAFETY: An instance of `PageMapping` is created only when `kmap` succeeded for the given
+ // page, so it is safe to unmap it here.
+ unsafe { bindings::kunmap(mapping.page) };
+ }
+}
+
+/// An owned page mapping. When this struct is dropped, the page is unmapped.
+pub struct PageMapping<'a, I: MappingInfo>
+where
+ Pages<0>: MappingActions<I>,
+{
+ page: *mut bindings::page,
+ ptr: *mut core::ffi::c_void,
+ _phantom: PhantomData<&'a i32>,
+ _phantom2: PhantomData<I>,
+}
+
+impl<'a, I: MappingInfo> PageMapping<'a, I>
+where
+ Pages<0>: MappingActions<I>,
+{
+ /// Return the kernel virtual address at which the page is mapped
+ #[inline(always)]
+ pub fn get_ptr(&self) -> *mut core::ffi::c_void {
+ self.ptr
+ }
+}
+
+// Because we do not have Drop specialization, we have to do this dance. Life
+// would be much simpler if we could have `impl Drop for PageMapping<'_,
+// AtomicMappingInfo>` and `impl Drop for PageMapping<'_, NormalMappingInfo>`.
+impl<I: MappingInfo> Drop for PageMapping<'_, I>
+where
+ Pages<0>: MappingActions<I>,
+{
+ #[inline(always)]
+ fn drop(&mut self) {
+ // SAFETY: We are OK to call this because we are `PageMapping::drop()`
+ unsafe { <Pages<0> as MappingActions<I>>::unmap(self) }
+ }
+}
--
2.40.0
From: Wedson Almeida Filho <[email protected]>
This allows locks like spinlocks and raw spinlocks to expose a
`lock_irqsave` variant in Rust that corresponds to the C version.
Reviewed-by: Martin Rodriguez Reboredo <[email protected]>
Signed-off-by: Wedson Almeida Filho <[email protected]>
---
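A userspace sketch of the save/restore part of the irq-save contract described
above. It models only the interrupt state, not the lock itself. `Cpu` and
`IrqSaveGuard` are hypothetical names, and the interrupt-enable flag is
simulated with a `Cell<bool>`. The previous state is kept in the guard, which
mirrors the way the patch suggests using `Backend::GuardState`:

```rust
use std::cell::Cell;

/// Simulated per-CPU interrupt-enable flag (hypothetical).
pub struct Cpu {
    irqs_enabled: Cell<bool>,
}

/// Guard that remembers the interrupt state from before it was created.
pub struct IrqSaveGuard<'a> {
    cpu: &'a Cpu,
    saved: bool,
}

impl Cpu {
    pub fn new() -> Self {
        Cpu {
            irqs_enabled: Cell::new(true),
        }
    }

    pub fn irqs_enabled(&self) -> bool {
        self.irqs_enabled.get()
    }

    /// Disable "interrupts" and hand the previous state to the guard.
    pub fn lock_irqsave(&self) -> IrqSaveGuard<'_> {
        let saved = self.irqs_enabled.replace(false);
        IrqSaveGuard { cpu: self, saved }
    }
}

impl Drop for IrqSaveGuard<'_> {
    fn drop(&mut self) {
        // Restore whatever the state was before, enabled or disabled.
        self.cpu.irqs_enabled.set(self.saved);
    }
}

/// Take two nested guards and report the flag at three points:
/// inside the outer guard, after dropping the inner one, and after both.
pub fn nested(cpu: &Cpu) -> (bool, bool, bool) {
    let outer = cpu.lock_irqsave();
    let inside_outer = cpu.irqs_enabled();
    let inner = cpu.lock_irqsave();
    drop(inner);
    let after_inner = cpu.irqs_enabled();
    drop(outer);
    (inside_outer, after_inner, cpu.irqs_enabled())
}
```

Dropping the inner guard leaves interrupts disabled because that was the state
it saved; only the outer guard re-enables them. This is why the state is
restored rather than unconditionally enabled.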
rust/kernel/sync/lock.rs | 38 ++++++++++++++++++++++++++++++++++++++
1 file changed, 38 insertions(+)
diff --git a/rust/kernel/sync/lock.rs b/rust/kernel/sync/lock.rs
index 1c584b1df30d..bb21af8a8377 100644
--- a/rust/kernel/sync/lock.rs
+++ b/rust/kernel/sync/lock.rs
@@ -72,6 +72,44 @@ pub unsafe trait Backend {
}
}
+/// The "backend" of a lock that supports the irq-save variant.
+///
+/// # Safety
+///
+/// The same requirements wrt mutual exclusion in [`Backend`] apply for acquiring the lock via
+/// [`IrqSaveBackend::lock_irqsave`].
+///
+/// Additionally, when [`IrqSaveBackend::lock_irqsave`] is used to acquire the lock, implementers
+/// must disable interrupts on lock, and restore interrupt state on unlock. Implementers may use
+/// [`Backend::GuardState`] to store state needed to keep track of the interrupt state.
+pub unsafe trait IrqSaveBackend: Backend {
+ /// Acquires the lock, making the caller its owner.
+ ///
+ /// Before acquiring the lock, it disables interrupts, and returns the previous interrupt state
+ /// as its guard state so that the guard can restore it when it is dropped.
+ ///
+ /// # Safety
+ ///
+ /// Callers must ensure that [`Backend::init`] has been previously called.
+ #[must_use]
+ unsafe fn lock_irqsave(ptr: *mut Self::State) -> Self::GuardState;
+}
+
+impl<T: ?Sized, B: IrqSaveBackend> Lock<T, B> {
+ /// Acquires the lock and gives the caller access to the data protected by it.
+ ///
+ /// Before acquiring the lock, it disables interrupts. When the guard is dropped, the interrupt
+ /// state (either enabled or disabled) is restored to its state before
+ /// [`lock_irqsave`](Self::lock_irqsave) was called.
+ pub fn lock_irqsave(&self) -> Guard<'_, T, B> {
+ // SAFETY: The constructor of the type calls `init`, so the existence of the object proves
+ // that `init` was called.
+ let state = unsafe { B::lock_irqsave(self.state.get()) };
+ // SAFETY: The lock was just acquired.
+ unsafe { Guard::new(self, state) }
+ }
+}
+
/// A mutual exclusion primitive.
///
/// Exposes one of the kernel locking primitives. Which one is exposed depends on the lock backend
--
2.40.0
From: Andreas Hindborg <[email protected]>
The kernel `struct spinlock` is 4 bytes on x86 when lockdep is not enabled. The
structure is not padded to fit a cache line. The effect of this for `SpinLock`
is that the lock variable and the value protected by the lock will share a cache
line, depending on the alignment requirements of the protected value. Aligning
the lock variable and the protected value to a cache line yields a 20%
performance increase for the Rust null block driver for sequential reads to
memory backed devices at 6 concurrent readers.
Signed-off-by: Andreas Hindborg <[email protected]>
---
rust/kernel/cache_padded.rs | 33 +++++++++++++++++++++++++++++++
rust/kernel/lib.rs | 2 ++
rust/kernel/sync/lock.rs | 9 ++++++---
rust/kernel/sync/lock/spinlock.rs | 13 ++++++++----
4 files changed, 50 insertions(+), 7 deletions(-)
create mode 100644 rust/kernel/cache_padded.rs
diff --git a/rust/kernel/cache_padded.rs b/rust/kernel/cache_padded.rs
new file mode 100644
index 000000000000..758678e71f50
--- /dev/null
+++ b/rust/kernel/cache_padded.rs
@@ -0,0 +1,33 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#[repr(align(64))]
+pub struct CachePadded<T: ?Sized> {
+ value: T,
+}
+
+unsafe impl<T: Send> Send for CachePadded<T> {}
+unsafe impl<T: Sync> Sync for CachePadded<T> {}
+
+impl<T> CachePadded<T> {
+ /// Pads and aligns a value to 64 bytes.
+ #[inline(always)]
+ pub(crate) const fn new(t: T) -> CachePadded<T> {
+ CachePadded::<T> { value: t }
+ }
+}
+
+impl<T: ?Sized> core::ops::Deref for CachePadded<T> {
+ type Target = T;
+
+ #[inline(always)]
+ fn deref(&self) -> &T {
+ &self.value
+ }
+}
+
+impl<T: ?Sized> core::ops::DerefMut for CachePadded<T> {
+ #[inline(always)]
+ fn deref_mut(&mut self) -> &mut T {
+ &mut self.value
+ }
+}
diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs
index a0bd0b0e2aef..426e2dea0da6 100644
--- a/rust/kernel/lib.rs
+++ b/rust/kernel/lib.rs
@@ -37,6 +37,7 @@ extern crate self as kernel;
mod allocator;
pub mod block;
mod build_assert;
+mod cache_padded;
pub mod error;
pub mod init;
pub mod ioctl;
@@ -56,6 +57,7 @@ pub mod types;
#[doc(hidden)]
pub use bindings;
+pub(crate) use cache_padded::CachePadded;
pub use macros;
pub use uapi;
diff --git a/rust/kernel/sync/lock.rs b/rust/kernel/sync/lock.rs
index a2216325632d..1c584b1df30d 100644
--- a/rust/kernel/sync/lock.rs
+++ b/rust/kernel/sync/lock.rs
@@ -6,7 +6,9 @@
//! spinlocks, raw spinlocks) to be provided with minimal effort.
use super::LockClassKey;
-use crate::{bindings, init::PinInit, pin_init, str::CStr, types::Opaque, types::ScopeGuard};
+use crate::{
+ bindings, init::PinInit, pin_init, str::CStr, types::Opaque, types::ScopeGuard, CachePadded,
+};
use core::{cell::UnsafeCell, marker::PhantomData, marker::PhantomPinned};
use macros::pin_data;
@@ -87,7 +89,7 @@ pub struct Lock<T: ?Sized, B: Backend> {
_pin: PhantomPinned,
/// The data protected by the lock.
- pub(crate) data: UnsafeCell<T>,
+ pub(crate) data: CachePadded<UnsafeCell<T>>,
}
// SAFETY: `Lock` can be transferred across thread boundaries iff the data it protects can.
@@ -102,7 +104,7 @@ impl<T, B: Backend> Lock<T, B> {
#[allow(clippy::new_ret_no_self)]
pub fn new(t: T, name: &'static CStr, key: &'static LockClassKey) -> impl PinInit<Self> {
pin_init!(Self {
- data: UnsafeCell::new(t),
+ data: CachePadded::new(UnsafeCell::new(t)),
_pin: PhantomPinned,
// SAFETY: `slot` is valid while the closure is called and both `name` and `key` have
// static lifetimes so they live indefinitely.
@@ -115,6 +117,7 @@ impl<T, B: Backend> Lock<T, B> {
impl<T: ?Sized, B: Backend> Lock<T, B> {
/// Acquires the lock and gives the caller access to the data protected by it.
+ #[inline(always)]
pub fn lock(&self) -> Guard<'_, T, B> {
// SAFETY: The constructor of the type calls `init`, so the existence of the object proves
// that `init` was called.
diff --git a/rust/kernel/sync/lock/spinlock.rs b/rust/kernel/sync/lock/spinlock.rs
index 979b56464a4e..e39142a8148c 100644
--- a/rust/kernel/sync/lock/spinlock.rs
+++ b/rust/kernel/sync/lock/spinlock.rs
@@ -4,7 +4,10 @@
//!
//! This module allows Rust code to use the kernel's `spinlock_t`.
+use core::ops::DerefMut;
+
use crate::bindings;
+use crate::CachePadded;
/// Creates a [`SpinLock`] initialiser with the given name and a newly-created lock class.
///
@@ -90,7 +93,7 @@ pub struct SpinLockBackend;
// SAFETY: The underlying kernel `spinlock_t` object ensures mutual exclusion. `relock` uses the
// default implementation that always calls the same locking method.
unsafe impl super::Backend for SpinLockBackend {
- type State = bindings::spinlock_t;
+ type State = CachePadded<bindings::spinlock_t>;
type GuardState = ();
unsafe fn init(
@@ -100,18 +103,20 @@ unsafe impl super::Backend for SpinLockBackend {
) {
// SAFETY: The safety requirements ensure that `ptr` is valid for writes, and `name` and
// `key` are valid for read indefinitely.
- unsafe { bindings::__spin_lock_init(ptr, name, key) }
+ unsafe { bindings::__spin_lock_init((&mut *ptr).deref_mut(), name, key) }
}
+ #[inline(always)]
unsafe fn lock(ptr: *mut Self::State) -> Self::GuardState {
// SAFETY: The safety requirements of this function ensure that `ptr` points to valid
// memory, and that it has been initialised before.
- unsafe { bindings::spin_lock(ptr) }
+ unsafe { bindings::spin_lock((&mut *ptr).deref_mut()) }
}
+ #[inline(always)]
unsafe fn unlock(ptr: *mut Self::State, _guard_state: &Self::GuardState) {
// SAFETY: The safety requirements of this function ensure that `ptr` is valid and that the
// caller is the owner of the mutex.
- unsafe { bindings::spin_unlock(ptr) }
+ unsafe { bindings::spin_unlock((&mut *ptr).deref_mut()) }
}
}
--
2.40.0
From: Andreas Hindborg <[email protected]>
Implement `ForeignOwnable for Pin<T> where T: ForeignOwnable + Deref`.
Imported from rust tree [1]
[1] https://github.com/Rust-for-Linux/linux/tree/bc22545f38d74473cfef3e9fd65432733435b79f
Cc: Wedson Almeida Filho <[email protected]>
---
rust/kernel/types.rs | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
diff --git a/rust/kernel/types.rs b/rust/kernel/types.rs
index 29db59d6119a..98e71e96a7fc 100644
--- a/rust/kernel/types.rs
+++ b/rust/kernel/types.rs
@@ -9,6 +9,7 @@ use core::{
marker::PhantomData,
mem::MaybeUninit,
ops::{Deref, DerefMut},
+ pin::Pin,
ptr::NonNull,
};
@@ -100,6 +101,29 @@ impl ForeignOwnable for () {
unsafe fn from_foreign(_: *const core::ffi::c_void) -> Self {}
}
+impl<T: ForeignOwnable + Deref> ForeignOwnable for Pin<T> {
+ type Borrowed<'a> = T::Borrowed<'a>;
+
+ fn into_foreign(self) -> *const core::ffi::c_void {
+ // SAFETY: We continue to treat the pointer as pinned by returning just a pointer to it to
+ // the caller.
+ let inner = unsafe { Pin::into_inner_unchecked(self) };
+ inner.into_foreign()
+ }
+
+ unsafe fn borrow<'a>(ptr: *const core::ffi::c_void) -> Self::Borrowed<'a> {
+ // SAFETY: The safety requirements for this function are the same as the ones for
+ // `T::borrow`.
+ unsafe { T::borrow(ptr) }
+ }
+
+ unsafe fn from_foreign(p: *const core::ffi::c_void) -> Self {
+ // SAFETY: The object was originally pinned.
+ // The passed pointer comes from a previous call to `T::into_foreign`.
+ unsafe { Pin::new_unchecked(T::from_foreign(p)) }
+ }
+}
+
/// Runs a cleanup function/closure when dropped.
///
/// The [`ScopeGuard::dismiss`] function prevents the cleanup function from running.
--
2.40.0
On Wed, May 03, 2023 at 11:06:57AM +0200, Andreas Hindborg wrote:
> From: Andreas Hindborg <[email protected]>
>
(cut)
>
> For each measurement the drivers are loaded, a drive is configured with memory
> backing and a size of 4 GiB. C null_blk is configured to match the implemented
> modes of the Rust driver: `blocksize` is set to 4 KiB, `completion_nsec` to 0,
> `irqmode` to 0 (IRQ_NONE), `queue_mode` to 2 (MQ), `hw_queue_depth` to 256 and
> `memory_backed` to 1. For both the drivers, the queue scheduler is set to
> `none`. These measurements are made using 30 second runs of `fio` with the
> `PSYNC` IO engine with workers pinned to separate CPU cores. The measurements
> are done inside a virtual machine (qemu/kvm) on an Intel Alder Lake workstation
> (i5-12600).
Hello Andreas,
I'm curious why you used psync ioengine for the benchmarks.
As psync is a sync ioengine, it means queue depth == 1.
Wouldn't it have been more interesting to see an async ioengine,
together with different queue depths?
You might want to explain your table a bit more.
It might be nice to see IOPS and average latencies.
As an example of a table that I find easier to interpret,
see e.g. the table on page 29 in the SPDK performance report:
https://ci.spdk.io/download/performance-reports/SPDK_nvme_bdev_perf_report_2301.pdf
Kind regards,
Niklas
On Wed, 3 May 2023 11:07:03 +0200, Andreas Hindborg <[email protected]> wrote:
> The kernel `struct spinlock` is 4 bytes on x86 when lockdep is not enabled. The
> structure is not padded to fit a cache line. The effect of this for `SpinLock`
> is that the lock variable and the value protected by the lock will share a cache
> line, depending on the alignment requirements of the protected value. Aligning
> the lock variable and the protected value to a cache line yields a 20%
> performance increase for the Rust null block driver for sequential reads to
> memory backed devices at 6 concurrent readers.
>
> Signed-off-by: Andreas Hindborg <[email protected]>
This applies the cacheline padding to all spinlocks unconditionally.
It's not clear to me that we want to do that. Instead, I suggest using
`SpinLock<CachePadded<T>>` in the null block driver to opt-in to the
cache padding there, and let other drivers choose whether or not they
want to cache pad their locks.
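A minimal userspace sketch of what the opt-in variant could look like. `std::sync::Mutex` stands in for the kernel `SpinLock`, and `QueueStats`, `new_padded_stats`, `record_completion` and `padded_layout` are hypothetical names used only for illustration. The 64-byte alignment is taken from the patch.

```rust
use std::mem::{align_of, size_of};
use std::ops::{Deref, DerefMut};
use std::sync::Mutex;

/// Userspace stand-in for the `CachePadded` type from the patch.
#[repr(align(64))]
pub struct CachePadded<T> {
    value: T,
}

impl<T> CachePadded<T> {
    /// Pads and aligns a value to 64 bytes.
    pub const fn new(value: T) -> Self {
        Self { value }
    }
}

impl<T> Deref for CachePadded<T> {
    type Target = T;

    fn deref(&self) -> &T {
        &self.value
    }
}

impl<T> DerefMut for CachePadded<T> {
    fn deref_mut(&mut self) -> &mut T {
        &mut self.value
    }
}

/// Hypothetical per-queue state that a driver wants on its own cache line.
pub struct QueueStats {
    pub completed: u64,
}

/// Only this lock opts in to padding; other locks keep their compact layout.
/// `std::sync::Mutex` stands in for the kernel `SpinLock` here.
pub fn new_padded_stats() -> Mutex<CachePadded<QueueStats>> {
    Mutex::new(CachePadded::new(QueueStats { completed: 0 }))
}

/// Takes the lock and bumps the counter through the `DerefMut` impls.
pub fn record_completion(stats: &Mutex<CachePadded<QueueStats>>) -> u64 {
    let mut guard = stats.lock().unwrap();
    guard.completed += 1;
    guard.completed
}

/// Returns `(size, alignment)` of the padded payload in bytes.
pub fn padded_layout() -> (usize, usize) {
    (
        size_of::<CachePadded<QueueStats>>(),
        align_of::<CachePadded<QueueStats>>(),
    )
}
```

Note that padding only the payload places the protected data on a 64-byte boundary. It does not pad the lock word itself, which the posted patch does through `Backend::State`, so the two approaches may not perform identically.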
On Wed, 3 May 2023 11:07:03 +0200, Andreas Hindborg <[email protected]> wrote:
> diff --git a/rust/kernel/cache_padded.rs b/rust/kernel/cache_padded.rs
> new file mode 100644
> index 000000000000..758678e71f50
> --- /dev/null
> +++ b/rust/kernel/cache_padded.rs
>
> +impl<T> CachePadded<T> {
> + /// Pads and aligns a value to 64 bytes.
> + #[inline(always)]
> + pub(crate) const fn new(t: T) -> CachePadded<T> {
> + CachePadded::<T> { value: t }
> + }
> +}
Please make this `pub` instead of just `pub(crate)`. Other drivers might
want to use this directly.
On Wed, 3 May 2023 11:07:03 +0200, Andreas Hindborg <[email protected]> wrote:
> diff --git a/rust/kernel/sync/lock/spinlock.rs b/rust/kernel/sync/lock/spinlock.rs
> index 979b56464a4e..e39142a8148c 100644
> --- a/rust/kernel/sync/lock/spinlock.rs
> +++ b/rust/kernel/sync/lock/spinlock.rs
> @@ -100,18 +103,20 @@ unsafe impl super::Backend for SpinLockBackend {
> ) {
> // SAFETY: The safety requirements ensure that `ptr` is valid for writes, and `name` and
> // `key` are valid for read indefinitely.
> - unsafe { bindings::__spin_lock_init(ptr, name, key) }
> + unsafe { bindings::__spin_lock_init((&mut *ptr).deref_mut(), name, key) }
> }
>
> + #[inline(always)]
> unsafe fn lock(ptr: *mut Self::State) -> Self::GuardState {
> // SAFETY: The safety requirements of this function ensure that `ptr` points to valid
> // memory, and that it has been initialised before.
> - unsafe { bindings::spin_lock(ptr) }
> + unsafe { bindings::spin_lock((&mut *ptr).deref_mut()) }
> }
>
> + #[inline(always)]
> unsafe fn unlock(ptr: *mut Self::State, _guard_state: &Self::GuardState) {
> // SAFETY: The safety requirements of this function ensure that `ptr` is valid and that the
> // caller is the owner of the mutex.
> - unsafe { bindings::spin_unlock(ptr) }
> + unsafe { bindings::spin_unlock((&mut *ptr).deref_mut()) }
> }
> }
I would prefer to remain in pointer-land for the above operations. I
think that this leads to code that is more obviously correct.
For example:
```
impl<T> CachePadded<T> {
pub const fn raw_get(ptr: *mut Self) -> *mut T {
core::ptr::addr_of_mut!((*ptr).value)
}
}
#[inline(always)]
unsafe fn unlock(ptr: *mut Self::State, _guard_state: &Self::GuardState) {
unsafe { bindings::spin_unlock(CachePadded::raw_get(ptr)) }
}
```
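For reference, a self-contained userspace variant of the same idea that should compile as written. `RawLock` and `raw_unlock` are hypothetical stand-ins for `bindings::spinlock_t` and `bindings::spin_unlock`. The place expression `(*ptr).value` dereferences a raw pointer and therefore has to sit inside an `unsafe` block, so this sketch marks `raw_get` as an `unsafe fn`; whether the real helper should be safe or `const` is left open.

```rust
use core::ptr::addr_of_mut;

/// Userspace stand-in for the `CachePadded` type from the patch.
#[repr(align(64))]
pub struct CachePadded<T> {
    value: T,
}

impl<T> CachePadded<T> {
    pub const fn new(value: T) -> Self {
        Self { value }
    }

    /// Projects a pointer to the wrapper into a pointer to the wrapped value
    /// without creating an intermediate reference.
    ///
    /// # Safety
    ///
    /// `ptr` must point into a live allocation of `CachePadded<T>`.
    pub unsafe fn raw_get(ptr: *mut Self) -> *mut T {
        // SAFETY: The caller guarantees that `ptr` points to an allocated
        // `CachePadded<T>`, so computing the field address stays in bounds.
        unsafe { addr_of_mut!((*ptr).value) }
    }
}

/// Hypothetical stand-in for `bindings::spinlock_t`.
pub struct RawLock {
    pub locked: bool,
}

/// Hypothetical stand-in for `bindings::spin_unlock`.
///
/// # Safety
///
/// `lock` must be valid for reads and writes.
pub unsafe fn raw_unlock(lock: *mut RawLock) {
    // SAFETY: The caller guarantees that `lock` is valid for writes.
    unsafe { (*lock).locked = false };
}

/// Mirrors the shape of `Backend::unlock`: the argument stays a raw pointer
/// from entry until it reaches the C-style function.
///
/// # Safety
///
/// `ptr` must point to a valid, initialised `CachePadded<RawLock>`.
pub unsafe fn unlock(ptr: *mut CachePadded<RawLock>) {
    // SAFETY: Forwarded from the caller's guarantees.
    unsafe { raw_unlock(CachePadded::raw_get(ptr)) }
}
```

Compared with `(&mut *ptr).deref_mut()`, this never materialises a `&mut CachePadded<_>`, so there are no reference validity or aliasing requirements to justify in the `SAFETY` comment.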
On 03.05.23 11:06, Andreas Hindborg wrote:
> From: Andreas Hindborg <[email protected]>
>
> This patch adds support for working with pages of order 0. Support for pages
> with higher order is deferred. Page allocation flags are fixed in this patch.
> Future work might allow the user to specify allocation flags.
>
> This patch is a heavily modified version of code available in the rust tree [1],
> primarily adding support for multiple page mapping strategies.
>
> [1] https://github.com/rust-for-Linux/linux/tree/bc22545f38d74473cfef3e9fd65432733435b79f/rust/kernel/pages.rs
>
> Signed-off-by: Andreas Hindborg <[email protected]>
> ---
> rust/helpers.c | 31 +++++
> rust/kernel/lib.rs | 6 +
> rust/kernel/pages.rs | 284 +++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 321 insertions(+)
> create mode 100644 rust/kernel/pages.rs
>
> diff --git a/rust/helpers.c b/rust/helpers.c
> index 5dd5e325b7cc..9bd9d95da951 100644
> --- a/rust/helpers.c
> +++ b/rust/helpers.c
> @@ -27,6 +27,7 @@
> #include <linux/sched/signal.h>
> #include <linux/wait.h>
> #include <linux/radix-tree.h>
> +#include <linux/highmem.h>
>
> __noreturn void rust_helper_BUG(void)
> {
> @@ -150,6 +151,36 @@ void **rust_helper_radix_tree_next_slot(void **slot,
> }
> EXPORT_SYMBOL_GPL(rust_helper_radix_tree_next_slot);
>
> +void *rust_helper_kmap(struct page *page)
> +{
> + return kmap(page);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_kmap);
> +
> +void rust_helper_kunmap(struct page *page)
> +{
> + return kunmap(page);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_kunmap);
> +
> +void *rust_helper_kmap_atomic(struct page *page)
> +{
> + return kmap_atomic(page);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_kmap_atomic);
> +
> +void rust_helper_kunmap_atomic(void *address)
> +{
> + kunmap_atomic(address);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_kunmap_atomic);
> +
> +struct page *rust_helper_alloc_pages(gfp_t gfp_mask, unsigned int order)
> +{
> + return alloc_pages(gfp_mask, order);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_alloc_pages);
> +
> /*
> * We use `bindgen`'s `--size_t-is-usize` option to bind the C `size_t` type
> * as the Rust `usize` type, so we can use it in contexts where Rust
> diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs
> index a85cb6aae8d6..8bef6686504b 100644
> --- a/rust/kernel/lib.rs
> +++ b/rust/kernel/lib.rs
> @@ -38,6 +38,7 @@ mod build_assert;
> pub mod error;
> pub mod init;
> pub mod ioctl;
> +pub mod pages;
> pub mod prelude;
> pub mod print;
> pub mod radix_tree;
> @@ -57,6 +58,11 @@ pub use uapi;
> #[doc(hidden)]
> pub use build_error::build_error;
>
> +/// Page size defined in terms of the `PAGE_SHIFT` macro from C.
> #@
> #@ `PAGE_SHIFT` is not using a doc-link.
> #@
> +///
> +/// [`PAGE_SHIFT`]: ../../../include/asm-generic/page.h
> +pub const PAGE_SIZE: u32 = 1 << bindings::PAGE_SHIFT;
> #@
> #@ This should be of type `usize`.
> #@
> +
> /// Prefix to appear before log messages printed from within the `kernel` crate.
> const __LOG_PREFIX: &[u8] = b"rust_kernel\0";
>
> diff --git a/rust/kernel/pages.rs b/rust/kernel/pages.rs
> new file mode 100644
> index 000000000000..ed51b053dd5d
> --- /dev/null
> +++ b/rust/kernel/pages.rs
> @@ -0,0 +1,284 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +//! Kernel page allocation and management.
> +//!
> +//! This module currently provides limited support. It supports pages of order 0
> +//! for most operations. Page allocation flags are fixed.
> +
> +use crate::{bindings, error::code::*, error::Result, PAGE_SIZE};
> +use core::{marker::PhantomData, ptr};
> +
> +/// A set of physical pages.
> +///
> +/// `Pages` holds a reference to a set of pages of order `ORDER`. Having the order as a generic
> +/// const allows the struct to have the same size as a pointer.
> #@
> #@ I would remove the 'Having the order as a...' sentence, since that is
> #@ just an implementation detail.
> #@
> +///
> +/// # Invariants
> +///
> +/// The pointer `Pages::pages` is valid and points to 2^ORDER pages.
> #@
> #@ `Pages::pages` -> `pages`.
> #@
> +pub struct Pages<const ORDER: u32> {
> + pub(crate) pages: *mut bindings::page,
> +}
> +
> +impl<const ORDER: u32> Pages<ORDER> {
> + /// Allocates a new set of contiguous pages.
> + pub fn new() -> Result<Self> {
> + let pages = unsafe {
> + bindings::alloc_pages(
> + bindings::GFP_KERNEL | bindings::__GFP_ZERO | bindings::___GFP_HIGHMEM,
> + ORDER,
> + )
> + };
> #@
> #@ Missing `SAFETY` comment.
> #@
> + if pages.is_null() {
> + return Err(ENOMEM);
> + }
> + // INVARIANTS: We checked that the allocation above succeeded.
> + // SAFETY: We allocated pages above
> + Ok(unsafe { Self::from_raw(pages) })
> + }
> +
> + /// Create a `Pages` from a raw `struct page` pointer
> + ///
> + /// # Safety
> + ///
> + /// Caller must own the pages pointed to by `ptr` as these will be freed
> + /// when the returned `Pages` is dropped.
> + pub unsafe fn from_raw(ptr: *mut bindings::page) -> Self {
> + Self { pages: ptr }
> + }
> +}
> +
> +impl Pages<0> {
> + #[inline(always)]
> #@
> #@ Is this really needed? I think this function should be inlined
> #@ automatically.
> #@
> + fn check_offset_and_map<I: MappingInfo>(
> + &self,
> + offset: usize,
> + len: usize,
> + ) -> Result<PageMapping<'_, I>>
> + where
> + Pages<0>: MappingActions<I>,
> #@
> #@ Why not use `Self: MappingActions<I>`?
> #@
> + {
> + let end = offset.checked_add(len).ok_or(EINVAL)?;
> + if end as u32 > PAGE_SIZE {
> #@
> #@ Remove the `as u32`, since `PAGE_SIZE` should be of type `usize`.
> #@
> + return Err(EINVAL);
> #@
> #@ I think it would make sense to create a more descriptive Rust error with
> #@ a `From` impl to turn it into an `Error`. It is always better to know from
> #@ the signature what exactly can go wrong when calling a function.
> #@
> + }
> +
> + let mapping = <Self as MappingActions<I>>::map(self);
> +
> + Ok(mapping)
> #@
> #@ I would merge these lines.
> #@
> + }
> +
> + #[inline(always)]
> + unsafe fn read_internal<I: MappingInfo>(
> #@
> #@ Missing `# Safety` section.
> #@
> + &self,
> + dest: *mut u8,
> + offset: usize,
> + len: usize,
> + ) -> Result
> + where
> + Pages<0>: MappingActions<I>,
> + {
> + let mapping = self.check_offset_and_map::<I>(offset, len)?;
> +
> + unsafe { ptr::copy_nonoverlapping((mapping.ptr as *mut u8).add(offset), dest, len) };
> #@
> #@ Missing `SAFETY` comment. Replace `as *mut u8` with `.cast::<u8>()`.
> #@
> + Ok(())
> + }
> +
> + /// Maps the pages and reads from them into the given buffer.
> + ///
> + /// # Safety
> + ///
> + /// Callers must ensure that the destination buffer is valid for the given
> + /// length. Additionally, if the raw buffer is intended to be recast, they
> + /// must ensure that the data can be safely cast;
> + /// [`crate::io_buffer::ReadableFromBytes`] has more details about it.
> + /// `dest` may not point to the source page.
> #@
> #@ - `dest` is valid for writes for `len`.
> #@ - What is meant by 'the raw buffer is intended to be recast'?
> #@ - `io_buffer` does not yet exist in `rust-next`.
> #@
> + #[inline(always)]
> + pub unsafe fn read(&self, dest: *mut u8, offset: usize, len: usize) -> Result {
> + unsafe { self.read_internal::<NormalMappingInfo>(dest, offset, len) }
> #@
> #@ Missing `SAFETY` comment.
> #@
> + }
> +
> + /// Maps the pages and reads from them into the given buffer. The page is
> + /// mapped atomically.
> + ///
> + /// # Safety
> + ///
> + /// Callers must ensure that the destination buffer is valid for the given
> + /// length. Additionally, if the raw buffer is intended to be recast, they
> + /// must ensure that the data can be safely cast;
> + /// [`crate::io_buffer::ReadableFromBytes`] has more details about it.
> + /// `dest` may not point to the source page.
> + #[inline(always)]
> + pub unsafe fn read_atomic(&self, dest: *mut u8, offset: usize, len: usize) -> Result {
> + unsafe { self.read_internal::<AtomicMappingInfo>(dest, offset, len) }
> #@
> #@ Missing `SAFETY` comment.
> #@
> + }
> +
> + #[inline(always)]
> + unsafe fn write_internal<I: MappingInfo>(
> #@
> #@ Missing `# Safety` section.
> #@
> + &self,
> + src: *const u8,
> + offset: usize,
> + len: usize,
> + ) -> Result
> + where
> + Pages<0>: MappingActions<I>,
> + {
> + let mapping = self.check_offset_and_map::<I>(offset, len)?;
> +
> + unsafe { ptr::copy_nonoverlapping(src, (mapping.ptr as *mut u8).add(offset), len) };
> #@
> #@ Missing `SAFETY` comment.
> #@
> + Ok(())
> + }
> +
> + /// Maps the pages and writes into them from the given buffer.
> + ///
> + /// # Safety
> + ///
> + /// Callers must ensure that the buffer is valid for the given length.
> + /// Additionally, if the page is (or will be) mapped by userspace, they must
> + /// ensure that no kernel data is leaked through padding if it was cast from
> + /// another type; [`crate::io_buffer::WritableToBytes`] has more details
> + /// about it. `src` must not point to the destination page.
> #@
> #@ `src` is valid for reads for `len`.
> #@
> + #[inline(always)]
> + pub unsafe fn write(&self, src: *const u8, offset: usize, len: usize) -> Result {
> + unsafe { self.write_internal::<NormalMappingInfo>(src, offset, len) }
> + }
> +
> + /// Maps the pages and writes into them from the given buffer. The page is
> + /// mapped atomically.
> + ///
> + /// # Safety
> + ///
> + /// Callers must ensure that the buffer is valid for the given length.
> + /// Additionally, if the page is (or will be) mapped by userspace, they must
> + /// ensure that no kernel data is leaked through padding if it was cast from
> + /// another type; [`crate::io_buffer::WritableToBytes`] has more details
> + /// about it. `src` must not point to the destination page.
> + #[inline(always)]
> + pub unsafe fn write_atomic(&self, src: *const u8, offset: usize, len: usize) -> Result {
> + unsafe { self.write_internal::<AtomicMappingInfo>(src, offset, len) }
> + }
> +
> + /// Maps the page at index 0.
> + #[inline(always)]
> + pub fn kmap(&self) -> PageMapping<'_, NormalMappingInfo> {
> + let ptr = unsafe { bindings::kmap(self.pages) };
> #@
> #@ Missing `SAFETY` comment.
> #@
> +
> + PageMapping {
> + page: self.pages,
> + ptr,
> + _phantom: PhantomData,
> + _phantom2: PhantomData,
> + }
> + }
> +
> + /// Atomically Maps the page at index 0.
> + #[inline(always)]
> + pub fn kmap_atomic(&self) -> PageMapping<'_, AtomicMappingInfo> {
> + let ptr = unsafe { bindings::kmap_atomic(self.pages) };
> #@
> #@ Missing `SAFETY` comment.
> #@
> +
> + PageMapping {
> + page: self.pages,
> + ptr,
> + _phantom: PhantomData,
> + _phantom2: PhantomData,
> + }
> + }
> +}
> +
> +impl<const ORDER: u32> Drop for Pages<ORDER> {
> + fn drop(&mut self) {
> + // SAFETY: By the type invariants, we know the pages are allocated with the given order.
> + unsafe { bindings::__free_pages(self.pages, ORDER) };
> + }
> +}
> +
> +/// Specifies the type of page mapping
> +pub trait MappingInfo {}
> +
> +/// Encapsulates methods to map and unmap pages
> +pub trait MappingActions<I: MappingInfo>
> +where
> + Pages<0>: MappingActions<I>,
> +{
> + /// Map a page into the kernel address scpace
> #@
> #@ Typo.
> #@
> + fn map(pages: &Pages<0>) -> PageMapping<'_, I>;
> +
> + /// Unmap a page specified by `mapping`
> + ///
> + /// # Safety
> + ///
> + /// Must only be called by `PageMapping::drop()`.
> + unsafe fn unmap(mapping: &PageMapping<'_, I>);
> +}
> +
> +/// A type state indicating that pages were mapped atomically
> +pub struct AtomicMappingInfo;
> +impl MappingInfo for AtomicMappingInfo {}
> +
> +/// A type state indicating that pages were not mapped atomically
> +pub struct NormalMappingInfo;
> +impl MappingInfo for NormalMappingInfo {}
> +
> +impl MappingActions<AtomicMappingInfo> for Pages<0> {
> + #[inline(always)]
> + fn map(pages: &Pages<0>) -> PageMapping<'_, AtomicMappingInfo> {
> + pages.kmap_atomic()
> + }
> +
> + #[inline(always)]
> + unsafe fn unmap(mapping: &PageMapping<'_, AtomicMappingInfo>) {
> + // SAFETY: An instance of `PageMapping` is created only when `kmap` succeeded for the given
> + // page, so it is safe to unmap it here.
> + unsafe { bindings::kunmap_atomic(mapping.ptr) };
> + }
> +}
> +
> +impl MappingActions<NormalMappingInfo> for Pages<0> {
> + #[inline(always)]
> + fn map(pages: &Pages<0>) -> PageMapping<'_, NormalMappingInfo> {
> + pages.kmap()
> + }
> +
> + #[inline(always)]
> + unsafe fn unmap(mapping: &PageMapping<'_, NormalMappingInfo>) {
> + // SAFETY: An instance of `PageMapping` is created only when `kmap` succeeded for the given
> + // page, so it is safe to unmap it here.
> + unsafe { bindings::kunmap(mapping.page) };
> + }
> +}
> #@
> #@ I am not sure this is the best implementation. Why do the `kmap` and
> #@ `kmap_atomic` functions exist? Would it not make sense to implement
> #@ them entirely in `MappingActions::map`?
> #@
> +
> +/// An owned page mapping. When this struct is dropped, the page is unmapped.
> +pub struct PageMapping<'a, I: MappingInfo>
> +where
> + Pages<0>: MappingActions<I>,
> +{
> + page: *mut bindings::page,
> + ptr: *mut core::ffi::c_void,
> + _phantom: PhantomData<&'a i32>,
> + _phantom2: PhantomData<I>,
> +}
> +
> +impl<'a, I: MappingInfo> PageMapping<'a, I>
> +where
> + Pages<0>: MappingActions<I>,
> +{
> + /// Return a pointer to the wrapped `struct page`
> + #[inline(always)]
> + pub fn get_ptr(&self) -> *mut core::ffi::c_void {
> + self.ptr
> + }
> +}
> +
> +// Because we do not have Drop specialization, we have to do this dance. Life
> +// would be much more simple if we could have `impl Drop for PageMapping<'_,
> +// Atomic>` and `impl Drop for PageMapping<'_, NotAtomic>`
> +impl<I: MappingInfo> Drop for PageMapping<'_, I>
> +where
> + Pages<0>: MappingActions<I>,
> +{
> + #[inline(always)]
> + fn drop(&mut self) {
> + // SAFETY: We are OK to call this because we are `PageMapping::drop()`
> + unsafe { <Pages<0> as MappingActions<I>>::unmap(self) }
> + }
> +}
> --
> 2.40.0
Here are some more general things:
- I think we could use this as an opportunity to add more docs about how
paging works, or at least add some links to the C documentation.
- Can we improve the paging API? I have not given it any thought yet, but
the current API looks very primitive.
- Documentation comments should form complete sentences (so end with `.`).
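
To make two of the inline comments concrete (`PAGE_SIZE` as `usize`, and a descriptive error with a `From` impl), here is a minimal userspace sketch. `Error` and `EINVAL` are simplified stand-ins for the kernel types, the 4 KiB page size is an assumption, and `RangeError`, `check_range` and `read_len` are hypothetical names. One reason the `usize` change matters: with `end as u32`, a 64-bit `end` of 2^32 truncates to 0 and passes the bounds check.

```rust
/// Assumed page size for this sketch; the kernel derives it from `PAGE_SHIFT`.
pub const PAGE_SIZE: usize = 1 << 12;

/// Simplified stand-in for `kernel::error::Error`, holding a negative errno.
#[derive(Debug, PartialEq, Eq)]
pub struct Error(pub i32);

/// Stand-in for the kernel's `EINVAL` error code.
pub const EINVAL: Error = Error(-22);

/// Hypothetical descriptive error for offset/length validation.
#[derive(Debug, PartialEq, Eq)]
pub enum RangeError {
    /// `offset + len` does not fit in a `usize`.
    Overflow,
    /// The requested range ends past the end of the page.
    OutOfBounds { end: usize },
}

impl From<RangeError> for Error {
    /// Both variants collapse to `EINVAL` when crossing into generic code.
    fn from(_: RangeError) -> Self {
        EINVAL
    }
}

/// Validates that `offset..offset + len` lies within one page and returns
/// the end of the range.
pub fn check_range(offset: usize, len: usize) -> Result<usize, RangeError> {
    let end = offset.checked_add(len).ok_or(RangeError::Overflow)?;
    // No cast is needed because `PAGE_SIZE` is a `usize`.
    if end > PAGE_SIZE {
        return Err(RangeError::OutOfBounds { end });
    }
    Ok(end)
}

/// A caller that returns the generic `Error`; `?` converts via `From`.
pub fn read_len(offset: usize, len: usize) -> Result<usize, Error> {
    check_range(offset, len)?;
    Ok(len)
}
```

The signature of `check_range` then documents exactly which failures are possible, while callers that only care about an errno keep using `?` unchanged.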
--
Cheers,
Benno
Sorry, forgot to replace `> #@` with nothing. Fixed here:
On 03.05.23 11:06, Andreas Hindborg wrote:
> From: Andreas Hindborg <[email protected]>
>
> This patch adds support for working with pages of order 0. Support for pages
> with higher order is deferred. Page allocation flags are fixed in this patch.
> Future work might allow the user to specify allocation flags.
>
> This patch is a heavily modified version of code available in the rust tree [1],
> primarily adding support for multiple page mapping strategies.
>
> [1] https://github.com/rust-for-Linux/linux/tree/bc22545f38d74473cfef3e9fd65432733435b79f/rust/kernel/pages.rs
>
> Signed-off-by: Andreas Hindborg <[email protected]>
> ---
> rust/helpers.c | 31 +++++
> rust/kernel/lib.rs | 6 +
> rust/kernel/pages.rs | 284 +++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 321 insertions(+)
> create mode 100644 rust/kernel/pages.rs
>
> diff --git a/rust/helpers.c b/rust/helpers.c
> index 5dd5e325b7cc..9bd9d95da951 100644
> --- a/rust/helpers.c
> +++ b/rust/helpers.c
> @@ -27,6 +27,7 @@
> #include <linux/sched/signal.h>
> #include <linux/wait.h>
> #include <linux/radix-tree.h>
> +#include <linux/highmem.h>
>
> __noreturn void rust_helper_BUG(void)
> {
> @@ -150,6 +151,36 @@ void **rust_helper_radix_tree_next_slot(void **slot,
> }
> EXPORT_SYMBOL_GPL(rust_helper_radix_tree_next_slot);
>
> +void *rust_helper_kmap(struct page *page)
> +{
> + return kmap(page);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_kmap);
> +
> +void rust_helper_kunmap(struct page *page)
> +{
> + return kunmap(page);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_kunmap);
> +
> +void *rust_helper_kmap_atomic(struct page *page)
> +{
> + return kmap_atomic(page);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_kmap_atomic);
> +
> +void rust_helper_kunmap_atomic(void *address)
> +{
> + kunmap_atomic(address);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_kunmap_atomic);
> +
> +struct page *rust_helper_alloc_pages(gfp_t gfp_mask, unsigned int order)
> +{
> + return alloc_pages(gfp_mask, order);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_alloc_pages);
> +
> /*
> * We use `bindgen`'s `--size_t-is-usize` option to bind the C `size_t` type
> * as the Rust `usize` type, so we can use it in contexts where Rust
> diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs
> index a85cb6aae8d6..8bef6686504b 100644
> --- a/rust/kernel/lib.rs
> +++ b/rust/kernel/lib.rs
> @@ -38,6 +38,7 @@ mod build_assert;
> pub mod error;
> pub mod init;
> pub mod ioctl;
> +pub mod pages;
> pub mod prelude;
> pub mod print;
> pub mod radix_tree;
> @@ -57,6 +58,11 @@ pub use uapi;
> #[doc(hidden)]
> pub use build_error::build_error;
>
> +/// Page size defined in terms of the `PAGE_SHIFT` macro from C.
`PAGE_SHIFT` is not using a doc-link.
> +///
> +/// [`PAGE_SHIFT`]: ../../../include/asm-generic/page.h
> +pub const PAGE_SIZE: u32 = 1 << bindings::PAGE_SHIFT;
This should be of type `usize`.
> +
> /// Prefix to appear before log messages printed from within the `kernel` crate.
> const __LOG_PREFIX: &[u8] = b"rust_kernel\0";
>
> diff --git a/rust/kernel/pages.rs b/rust/kernel/pages.rs
> new file mode 100644
> index 000000000000..ed51b053dd5d
> --- /dev/null
> +++ b/rust/kernel/pages.rs
> @@ -0,0 +1,284 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +//! Kernel page allocation and management.
> +//!
> +//! This module currently provides limited support. It supports pages of order 0
> +//! for most operations. Page allocation flags are fixed.
> +
> +use crate::{bindings, error::code::*, error::Result, PAGE_SIZE};
> +use core::{marker::PhantomData, ptr};
> +
> +/// A set of physical pages.
> +///
> +/// `Pages` holds a reference to a set of pages of order `ORDER`. Having the order as a generic
> +/// const allows the struct to have the same size as a pointer.
I would remove the 'Having the order as a...' sentence, since that is
just an implementation detail.
> +///
> +/// # Invariants
> +///
> +/// The pointer `Pages::pages` is valid and points to 2^ORDER pages.
`Pages::pages` -> `pages`.
> +pub struct Pages<const ORDER: u32> {
> + pub(crate) pages: *mut bindings::page,
> +}
> +
> +impl<const ORDER: u32> Pages<ORDER> {
> + /// Allocates a new set of contiguous pages.
> + pub fn new() -> Result<Self> {
> + let pages = unsafe {
> + bindings::alloc_pages(
> + bindings::GFP_KERNEL | bindings::__GFP_ZERO | bindings::___GFP_HIGHMEM,
> + ORDER,
> + )
> + };
Missing `SAFETY` comment.
> + if pages.is_null() {
> + return Err(ENOMEM);
> + }
> + // INVARIANTS: We checked that the allocation above succeeded.
> + // SAFETY: We allocated pages above
> + Ok(unsafe { Self::from_raw(pages) })
> + }
> +
> + /// Create a `Pages` from a raw `struct page` pointer
> + ///
> + /// # Safety
> + ///
> + /// Caller must own the pages pointed to by `ptr` as these will be freed
> + /// when the returned `Pages` is dropped.
> + pub unsafe fn from_raw(ptr: *mut bindings::page) -> Self {
> + Self { pages: ptr }
> + }
> +}
> +
> +impl Pages<0> {
> + #[inline(always)]
Is this really needed? I think this function should be inlined
automatically.
> + fn check_offset_and_map<I: MappingInfo>(
> + &self,
> + offset: usize,
> + len: usize,
> + ) -> Result<PageMapping<'_, I>>
> + where
> + Pages<0>: MappingActions<I>,
Why not use `Self: MappingActions<I>`?
> + {
> + let end = offset.checked_add(len).ok_or(EINVAL)?;
> + if end as u32 > PAGE_SIZE {
Remove the `as u32`, since `PAGE_SIZE` should be of type `usize`.
> + return Err(EINVAL);
I think it would make sense to create a more descriptive Rust error with
a `From` impl to turn it into an `Error`. It is always better to know from
the signature what exactly can go wrong when calling a function.
> + }
> +
> + let mapping = <Self as MappingActions<I>>::map(self);
> +
> + Ok(mapping)
I would merge these lines.
> + }
> +
> + #[inline(always)]
> + unsafe fn read_internal<I: MappingInfo>(
Missing `# Safety` section.
> + &self,
> + dest: *mut u8,
> + offset: usize,
> + len: usize,
> + ) -> Result
> + where
> + Pages<0>: MappingActions<I>,
> + {
> + let mapping = self.check_offset_and_map::<I>(offset, len)?;
> +
> + unsafe { ptr::copy_nonoverlapping((mapping.ptr as *mut u8).add(offset), dest, len) };
Missing `SAFETY` comment. Replace `as *mut u8` with `.cast::<u8>()`.
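For illustration, a standalone sketch of the copy with the cast written as
`.cast::<u8>()`, a `# Safety` section and a `SAFETY` comment. The function
name `copy_from_mapping` and the exact wording of the safety requirements
are assumptions made for this example, not text from the patch.

```rust
use core::ffi::c_void;
use core::ptr;

/// Copies `len` bytes starting at `offset` within `mapping` into `dest`.
///
/// # Safety
///
/// - `mapping` must be valid for reads of `offset + len` bytes.
/// - `dest` must be valid for writes of `len` bytes.
/// - The source and destination regions must not overlap.
pub unsafe fn copy_from_mapping(
    mapping: *mut c_void,
    dest: *mut u8,
    offset: usize,
    len: usize,
) {
    // `.cast::<u8>()` changes only the pointee type. Unlike `as`, it cannot
    // silently change mutability or convert the pointer to an integer.
    let src = mapping.cast::<u8>();
    // SAFETY: The caller guarantees that `mapping` is readable for
    // `offset + len` bytes, that `dest` is writable for `len` bytes, and
    // that the two regions do not overlap.
    unsafe { ptr::copy_nonoverlapping(src.add(offset), dest, len) };
}
```

The `SAFETY` comment restates which caller guarantee justifies each
requirement of `copy_nonoverlapping`, which is what the missing comments
in the patch would need to do.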
> + Ok(())
> + }
> +
> + /// Maps the pages and reads from them into the given buffer.
> + ///
> + /// # Safety
> + ///
> + /// Callers must ensure that the destination buffer is valid for the given
> + /// length. Additionally, if the raw buffer is intended to be recast, they
> + /// must ensure that the data can be safely cast;
> + /// [`crate::io_buffer::ReadableFromBytes`] has more details about it.
> + /// `dest` may not point to the source page.
- `dest` is valid for writes for `len`.
- What is meant by 'the raw buffer is intended to be recast'?
- `io_buffer` does not yet exist in `rust-next`.
> + #[inline(always)]
> + pub unsafe fn read(&self, dest: *mut u8, offset: usize, len: usize) -> Result {
> + unsafe { self.read_internal::<NormalMappingInfo>(dest, offset, len) }
Missing `SAFETY` comment.
> + }
> +
> + /// Maps the pages and reads from them into the given buffer. The page is
> + /// mapped atomically.
> + ///
> + /// # Safety
> + ///
> + /// Callers must ensure that the destination buffer is valid for the given
> + /// length. Additionally, if the raw buffer is intended to be recast, they
> + /// must ensure that the data can be safely cast;
> + /// [`crate::io_buffer::ReadableFromBytes`] has more details about it.
> + /// `dest` may not point to the source page.
> + #[inline(always)]
> + pub unsafe fn read_atomic(&self, dest: *mut u8, offset: usize, len: usize) -> Result {
> + unsafe { self.read_internal::<AtomicMappingInfo>(dest, offset, len) }
Missing `SAFETY` comment.
> + }
> +
> + #[inline(always)]
> + unsafe fn write_internal<I: MappingInfo>(
Missing `# Safety` section.
> + &self,
> + src: *const u8,
> + offset: usize,
> + len: usize,
> + ) -> Result
> + where
> + Pages<0>: MappingActions<I>,
> + {
> + let mapping = self.check_offset_and_map::<I>(offset, len)?;
> +
> + unsafe { ptr::copy_nonoverlapping(src, (mapping.ptr as *mut u8).add(offset), len) };
Missing `SAFETY` comment.
> + Ok(())
> + }
> +
> + /// Maps the pages and writes into them from the given buffer.
> + ///
> + /// # Safety
> + ///
> + /// Callers must ensure that the buffer is valid for the given length.
> + /// Additionally, if the page is (or will be) mapped by userspace, they must
> + /// ensure that no kernel data is leaked through padding if it was cast from
> + /// another type; [`crate::io_buffer::WritableToBytes`] has more details
> + /// about it. `src` must not point to the destination page.
`src` is valid for reads for `len`.
> + #[inline(always)]
> + pub unsafe fn write(&self, src: *const u8, offset: usize, len: usize) -> Result {
> + unsafe { self.write_internal::<NormalMappingInfo>(src, offset, len) }
> + }
> +
> + /// Maps the pages and writes into them from the given buffer. The page is
> + /// mapped atomically.
> + ///
> + /// # Safety
> + ///
> + /// Callers must ensure that the buffer is valid for the given length.
> + /// Additionally, if the page is (or will be) mapped by userspace, they must
> + /// ensure that no kernel data is leaked through padding if it was cast from
> + /// another type; [`crate::io_buffer::WritableToBytes`] has more details
> + /// about it. `src` must not point to the destination page.
> + #[inline(always)]
> + pub unsafe fn write_atomic(&self, src: *const u8, offset: usize, len: usize) -> Result {
> + unsafe { self.write_internal::<AtomicMappingInfo>(src, offset, len) }
> + }
> +
> + /// Maps the page at index 0.
> + #[inline(always)]
> + pub fn kmap(&self) -> PageMapping<'_, NormalMappingInfo> {
> + let ptr = unsafe { bindings::kmap(self.pages) };
Missing `SAFETY` comment.
> +
> + PageMapping {
> + page: self.pages,
> + ptr,
> + _phantom: PhantomData,
> + _phantom2: PhantomData,
> + }
> + }
> +
> + /// Atomically Maps the page at index 0.
> + #[inline(always)]
> + pub fn kmap_atomic(&self) -> PageMapping<'_, AtomicMappingInfo> {
> + let ptr = unsafe { bindings::kmap_atomic(self.pages) };
Missing `SAFETY` comment.
> +
> + PageMapping {
> + page: self.pages,
> + ptr,
> + _phantom: PhantomData,
> + _phantom2: PhantomData,
> + }
> + }
> +}
> +
> +impl<const ORDER: u32> Drop for Pages<ORDER> {
> + fn drop(&mut self) {
> + // SAFETY: By the type invariants, we know the pages are allocated with the given order.
> + unsafe { bindings::__free_pages(self.pages, ORDER) };
> + }
> +}
> +
> +/// Specifies the type of page mapping
> +pub trait MappingInfo {}
> +
> +/// Encapsulates methods to map and unmap pages
> +pub trait MappingActions<I: MappingInfo>
> +where
> + Pages<0>: MappingActions<I>,
> +{
> + /// Map a page into the kernel address scpace
Typo.
> + fn map(pages: &Pages<0>) -> PageMapping<'_, I>;
> +
> + /// Unmap a page specified by `mapping`
> + ///
> + /// # Safety
> + ///
> + /// Must only be called by `PageMapping::drop()`.
> + unsafe fn unmap(mapping: &PageMapping<'_, I>);
> +}
> +
> +/// A type state indicating that pages were mapped atomically
> +pub struct AtomicMappingInfo;
> +impl MappingInfo for AtomicMappingInfo {}
> +
> +/// A type state indicating that pages were not mapped atomically
> +pub struct NormalMappingInfo;
> +impl MappingInfo for NormalMappingInfo {}
> +
> +impl MappingActions<AtomicMappingInfo> for Pages<0> {
> + #[inline(always)]
> + fn map(pages: &Pages<0>) -> PageMapping<'_, AtomicMappingInfo> {
> + pages.kmap_atomic()
> + }
> +
> + #[inline(always)]
> + unsafe fn unmap(mapping: &PageMapping<'_, AtomicMappingInfo>) {
> + // SAFETY: An instance of `PageMapping` is created only when `kmap` succeeded for the given
> + // page, so it is safe to unmap it here.
> + unsafe { bindings::kunmap_atomic(mapping.ptr) };
> + }
> +}
> +
> +impl MappingActions<NormalMappingInfo> for Pages<0> {
> + #[inline(always)]
> + fn map(pages: &Pages<0>) -> PageMapping<'_, NormalMappingInfo> {
> + pages.kmap()
> + }
> +
> + #[inline(always)]
> + unsafe fn unmap(mapping: &PageMapping<'_, NormalMappingInfo>) {
> + // SAFETY: An instance of `PageMapping` is created only when `kmap` succeeded for the given
> + // page, so it is safe to unmap it here.
> + unsafe { bindings::kunmap(mapping.page) };
> + }
> +}
I am not sure if this is the best implementation, why do the `kmap` and
`kmap_atomic` functions exist? Would it not make sense to implement
them entirely in `MappingActions::map`?
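One possible shape, sketched outside the kernel with stand-in types: the
map and unmap logic lives directly in the implementations on the marker
types, and `Drop` dispatches through the trait. `Page`, `MappingKind`,
`Normal`, `Atomic` and `Mapping` are hypothetical names, and the calls to
`kmap`/`kunmap` are replaced by a call log so the example can run in
userspace. This is only one way to do it, not a proposal for the final API.

```rust
use std::cell::RefCell;
use std::marker::PhantomData;

/// Stand-in for `Pages<0>`; records calls instead of invoking `kmap` etc.
#[derive(Default)]
pub struct Page {
    log: RefCell<Vec<&'static str>>,
}

/// Type state that carries the map and unmap logic itself.
pub trait MappingKind {
    fn map(page: &Page);
    fn unmap(page: &Page);
}

/// Marker for a normal (sleeping) mapping.
pub struct Normal;
/// Marker for an atomic mapping.
pub struct Atomic;

impl MappingKind for Normal {
    fn map(page: &Page) {
        page.log.borrow_mut().push("kmap");
    }
    fn unmap(page: &Page) {
        page.log.borrow_mut().push("kunmap");
    }
}

impl MappingKind for Atomic {
    fn map(page: &Page) {
        page.log.borrow_mut().push("kmap_atomic");
    }
    fn unmap(page: &Page) {
        page.log.borrow_mut().push("kunmap_atomic");
    }
}

/// An owned mapping; the page is unmapped when this is dropped.
pub struct Mapping<'a, K: MappingKind> {
    page: &'a Page,
    _kind: PhantomData<K>,
}

impl Page {
    /// Maps the page using the strategy selected by `K`.
    pub fn map<K: MappingKind>(&self) -> Mapping<'_, K> {
        K::map(self);
        Mapping { page: self, _kind: PhantomData }
    }

    /// Returns the recorded calls, for inspection in this sketch only.
    pub fn calls(&self) -> Vec<&'static str> {
        self.log.borrow().clone()
    }
}

/// A single generic `Drop` is enough; no specialization is required.
impl<K: MappingKind> Drop for Mapping<'_, K> {
    fn drop(&mut self) {
        K::unmap(self.page);
    }
}
```

Because the bound is on the marker type rather than on `Pages<0>`, the
repeated `where Pages<0>: MappingActions<I>` clauses are not needed in
this variant, and separate `kmap`/`kmap_atomic` methods become optional
convenience wrappers around `map::<Normal>()` and `map::<Atomic>()`.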
> +
> +/// An owned page mapping. When this struct is dropped, the page is unmapped.
> +pub struct PageMapping<'a, I: MappingInfo>
> +where
> + Pages<0>: MappingActions<I>,
> +{
> + page: *mut bindings::page,
> + ptr: *mut core::ffi::c_void,
> + _phantom: PhantomData<&'a i32>,
> + _phantom2: PhantomData<I>,
> +}
> +
> +impl<'a, I: MappingInfo> PageMapping<'a, I>
> +where
> + Pages<0>: MappingActions<I>,
> +{
> + /// Return a pointer to the wrapped `struct page`
> + #[inline(always)]
> + pub fn get_ptr(&self) -> *mut core::ffi::c_void {
> + self.ptr
> + }
> +}
> +
> +// Because we do not have Drop specialization, we have to do this dance. Life
> +// would be much more simple if we could have `impl Drop for PageMapping<'_,
> +// Atomic>` and `impl Drop for PageMapping<'_, NotAtomic>`
> +impl<I: MappingInfo> Drop for PageMapping<'_, I>
> +where
> + Pages<0>: MappingActions<I>,
> +{
> + #[inline(always)]
> + fn drop(&mut self) {
> + // SAFETY: We are OK to call this because we are `PageMapping::drop()`
> + unsafe { <Pages<0> as MappingActions<I>>::unmap(self) }
> + }
> +}
> --
> 2.40.0
Here are some more general things:
- I think we could use this as an opportunity to add more docs about how
paging works, or at least add some links to the C documentation.
- Can we improve the paging API? I have not given it any thought yet, but
the current API looks very primitive.
- Documentation comments should form complete sentences (so end with `.`).
--
Cheers,
Benno
Hi Niklas,
Niklas Cassel <[email protected]> writes:
> On Wed, May 03, 2023 at 11:06:57AM +0200, Andreas Hindborg wrote:
>> From: Andreas Hindborg <[email protected]>
>>
>
> (cut)
>
>>
>> For each measurement the drivers are loaded, a drive is configured with memory
>> backing and a size of 4 GiB. C null_blk is configured to match the implemented
>> modes of the Rust driver: `blocksize` is set to 4 KiB, `completion_nsec` to 0,
>> `irqmode` to 0 (IRQ_NONE), `queue_mode` to 2 (MQ), `hw_queue_depth` to 256 and
>> `memory_backed` to 1. For both the drivers, the queue scheduler is set to
>> `none`. These measurements are made using 30 second runs of `fio` with the
>> `PSYNC` IO engine with workers pinned to separate CPU cores. The measurements
>> are done inside a virtual machine (qemu/kvm) on an Intel Alder Lake workstation
>> (i5-12600).
>
> Hello Andreas,
>
> I'm curious why you used psync ioengine for the benchmarks.
>
> As psync is a sync ioengine, it means queue depth == 1.
>
> Wouldn't it have been more interesting to see an async ioengine,
> together with different queue depths?
That would also be interesting. I was a bit constrained on CPU cycles so
I had to choose. I intend to produce the numbers you ask for. For now
here are two runs of random read using io_uring with queue depth 128
(same table style):
For iodepth_batch_submit=1, iodepth_batch_complete=1:
+---------+----------+---------------------+---------------------+
| jobs/bs | workload | 1 | 6 |
+---------+----------+---------------------+---------------------+
| 4k | randread | 2.97 0.00 (0.9,0.0) | 4.06 0.00 (1.8,0.0) |
+---------+----------+---------------------+---------------------+
For iodepth_batch_submit=16, iodepth_batch_complete=16:
+---------+----------+---------------------+---------------------+
| jobs/bs | workload | 1 | 6 |
+---------+----------+---------------------+---------------------+
| 4k | randread | 4.40 0.00 (1.1,0.0) | 4.87 0.00 (1.8,0.0) |
+---------+----------+---------------------+---------------------+
Above numbers are 60 second runs on bare metal, so not entirely
comparable with the ones in the cover letter.
> You might want to explain your table a bit more.
I understand that the table can be difficult to read. It is not easy to
convey all this information in ASCII email. The numbers in parentheses
in the cells _are_ IOPS x 10^6 (read,write). Referring to the second
table above, for 1 job at 4k bs the Rust driver performed 4.40 percent
more IOPS than the C driver. The C driver did 1.1M IOPS. I hope this
clarifies the table, otherwise let me know!
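To make the cell layout concrete, here is a small sketch of how one cell
could be computed. It assumes the relative figure is defined as
(rust - c) / c * 100 and that an idle direction (0 IOPS on the C side) is
reported as 0.00; both are assumptions about the reporting script, and the
function names are hypothetical. The input numbers in the usage note are
made-up round values, not measurements.

```rust
/// Relative difference of the Rust driver over the C driver, in percent.
/// Returns 0.0 when the C driver did no IO in that direction.
pub fn relative_percent(rust_iops: f64, c_iops: f64) -> f64 {
    if c_iops == 0.0 {
        return 0.0;
    }
    (rust_iops - c_iops) / c_iops * 100.0
}

/// Formats a cell as `rel_read rel_write (read_miops,write_miops)`, where
/// the values in parentheses are the C driver throughput in millions of IOPS.
pub fn format_cell(rust_read: f64, c_read: f64, rust_write: f64, c_write: f64) -> String {
    format!(
        "{:.2} {:.2} ({:.1},{:.1})",
        relative_percent(rust_read, c_read),
        relative_percent(rust_write, c_write),
        c_read / 1e6,
        c_write / 1e6,
    )
}
```

For example, `format_cell(1_050_000.0, 1_000_000.0, 0.0, 0.0)` yields
`5.00 0.00 (1.0,0.0)`: the Rust driver did 5 percent more read IOPS than
the C driver, which did 1.0M read IOPS, and there were no writes.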
> It might be nice to see IOPS and average latencies.
I did collect latency info as well, including completion latency
percentiles. It's just difficult to fit all that data in an email. I
have the fio json output, let me know if you want it and I will find a
way to get it to you. I am considering setting up some kind of CI that
will publish the performance results online automatically so that it
will be a link instead of an inline table.
>
> As an example of a table that I find easier to interpret,
> see e.g. the table on page 29 in the SPDK performance report:
> https://ci.spdk.io/download/performance-reports/SPDK_nvme_bdev_perf_report_2301.pdf
Thanks for the input, I will be sure to reference that next time. Just
for clarity, as you mentioned there is only one queue depth in play for
the numbers in the cover letter.
Best regards
Andreas
On 5/3/23 02:06, Andreas Hindborg wrote:
> This is an early preview of a null block driver written in Rust.
It is not clear to me why this effort was started? As far as I know the
null_blk driver is only used by kernel developers for testing kernel
changes so end users are not affected by bugs in this driver.
Additionally, performance of this driver is critical since this driver
is used to measure block layer performance. Does this mean that C is a
better choice than Rust for this driver?
Thanks,
Bart.
Hi Bart,
Bart Van Assche <[email protected]> writes:
> On 5/3/23 02:06, Andreas Hindborg wrote:
>> This is an early preview of a null block driver written in Rust.
>
> It is not clear to me why this effort was started? As far as I know the null_blk
> driver is only used by kernel developers for testing kernel changes so end users
> are not affected by bugs in this driver. Additionally, performance of this
> driver is critical since this driver is used to measure block layer performance.
> Does this mean that C is a better choice than Rust for this driver?
I take it you did not read the rest of the cover letter. Let me quote
some of it here:
> A null block driver is a good opportunity to evaluate Rust bindings for the
> block layer. It is a small and simple driver and thus should be simple to reason
> about. Further, the null block driver is not usually deployed in production
> environments. Thus, it should be fairly straight forward to review, and any
> potential issues are not going to bring down any production workloads.
>
> Being small and simple, the null block driver is a good place to introduce the
> Linux kernel storage community to Rust. This will help prepare the community for
> future Rust projects and facilitate a better maintenance process for these
> projects.
>
> The statistics presented in my previous message [1] show that the C null block
> driver has had a significant amount of memory safety related problems in the
> past. 41% of fixes merged for the C null block driver are fixes for memory
> safety issues. This makes the null block driver a good candidate for rewriting
> in Rust.
In relation to performance, it turns out that there is not much of a
difference. For memory safety bugs - I think we are better off without
them, no matter if we are user facing or not.
If it is still unclear to you why this effort was started, please do let
me know and I shall try to clarify further :)
Best regards,
Andreas
On Thu, May 04, 2023 at 11:36:01AM -0700, Bart Van Assche wrote:
> On 5/4/23 11:15, Andreas Hindborg wrote:
> > If it is still unclear to you why this effort was started, please do let
> > me know and I shall try to clarify further :)
>
> It seems like I was too polite in my previous email. What I meant is that
> rewriting code is useful if it provides a clear advantage to the users of
> a driver. For null_blk, the users are kernel developers. The code that has
> been posted is the start of a rewrite of the null_blk driver. The benefits
> of this rewrite (making low-level memory errors less likely) do not outweigh
> the risks that this effort will introduce functional or performance regressions.
Instead of replacing, would co-existing be okay? Of course as long as
there's no requirement to maintain feature parity between the two.
Actually, just call it "rust_blk" and declare it has no relationship to
null_blk, despite their functional similarities: it's a developer
reference implementation for a rust block driver.
On 5/4/23 11:15, Andreas Hindborg wrote:
> If it is still unclear to you why this effort was started, please do let
> me know and I shall try to clarify further :)
It seems like I was too polite in my previous email. What I meant is that
rewriting code is useful if it provides a clear advantage to the users of
a driver. For null_blk, the users are kernel developers. The code that has
been posted is the start of a rewrite of the null_blk driver. The benefits
of this rewrite (making low-level memory errors less likely) do not outweigh
the risks that this effort will introduce functional or performance regressions.
Bart.
Bart Van Assche <[email protected]> writes:
> On 5/4/23 11:15, Andreas Hindborg wrote:
>> If it is still unclear to you why this effort was started, please do let
>> me know and I shall try to clarify further :)
>
> It seems like I was too polite in my previous email. What I meant is that
> rewriting code is useful if it provides a clear advantage to the users of
> a driver. For null_blk, the users are kernel developers. The code that has
> been posted is the start of a rewrite of the null_blk driver. The benefits
> of this rewrite (making low-level memory errors less likely) do not outweigh
> the risks that this effort will introduce functional or performance regressions.
If this turns into a full rewrite instead of just a demonstrator, we
will be in the lucky situation that we have the existing C version to
verify performance and functionality against. Unnoticed regressions are
unlikely in this sense.
If we want to have Rust abstractions for the block layer in the kernel
(some people do), then having a simple driver in Rust to regression test
these abstractions with, is good value.
Best regards,
Andreas
On 5/4/23 12:52 PM, Keith Busch wrote:
> On Thu, May 04, 2023 at 11:36:01AM -0700, Bart Van Assche wrote:
>> On 5/4/23 11:15, Andreas Hindborg wrote:
>>> If it is still unclear to you why this effort was started, please do let
>>> me know and I shall try to clarify further :)
>>
>> It seems like I was too polite in my previous email. What I meant is that
>> rewriting code is useful if it provides a clear advantage to the users of
>> a driver. For null_blk, the users are kernel developers. The code that has
>> been posted is the start of a rewrite of the null_blk driver. The benefits
>> of this rewrite (making low-level memory errors less likely) do not outweigh
>> the risks that this effort will introduce functional or performance regressions.
>
> Instead of replacing, would co-existing be okay? Of course as long as
> there's no requirement to maintain feature parity between the two.
> Actually, just call it "rust_blk" and declare it has no relationship to
> null_blk, despite their functional similarities: it's a developer
> reference implementation for a rust block driver.
To me, the big discussion point isn't really whether we're doing
null_blk or not, it's more if we want to go down this path of
maintaining rust bindings for the block code in general. If the answer
to that is yes, then doing null_blk seems like a great choice as it's
not a critical piece of infrastructure. It might even be a good idea to
be able to run both, for performance purposes, as the bindings or core
changes.
But back to the real question... This is obviously extra burden on
maintainers, and that needs to be sorted out first. Block drivers in
general are not super security sensitive, as it's mostly privileged code
and there's not a whole lot of user visible API. And the stuff we do
have is reasonably basic. So what's the long term win of having rust
bindings? This is a legitimate question. I can see a lot of other more
user exposed subsystems being of higher interest here.
--
Jens Axboe
On Thu, May 4, 2023 at 9:02 PM Jens Axboe <[email protected]> wrote:
>
> But back to the real question... This is obviously extra burden on
> maintainers, and that needs to be sorted out first. Block drivers in
Regarding maintenance, something we have suggested in similar cases to
other subsystems is that the author gets involved as a maintainer of,
at least, the Rust abstractions/driver (possibly with a different
`MAINTAINERS` entry).
Of course, that is still work for the existing maintainer(s), i.e.
you, since coordination takes time. However, it can also be a nice way
to learn Rust on the side, meanwhile things are getting upstreamed and
discussed (I think Daniel, in Cc, is taking that approach).
And it may also be a way for you to get an extra
maintainer/reviewer/... later on for the C parts, too, even if Rust
does not succeed.
> general are not super security sensitive, as it's mostly privileged code
> and there's not a whole lot of user visible API. And the stuff we do
> have is reasonably basic. So what's the long term win of having rust
> bindings? This is a legitimate question. I can see a lot of other more
> user exposed subsystems being of higher interest here.
From the experience of other kernel maintainers/developers that are
making the move, the advantages seem to be well worth it, even
disregarding the security aspect, i.e. on the language side alone.
Cheers,
Miguel
Jens Axboe <[email protected]> writes:
> On 5/4/23 12:52 PM, Keith Busch wrote:
>> On Thu, May 04, 2023 at 11:36:01AM -0700, Bart Van Assche wrote:
>>> On 5/4/23 11:15, Andreas Hindborg wrote:
>>>> If it is still unclear to you why this effort was started, please do let
>>>> me know and I shall try to clarify further :)
>>>
>>> It seems like I was too polite in my previous email. What I meant is that
>>> rewriting code is useful if it provides a clear advantage to the users of
>>> a driver. For null_blk, the users are kernel developers. The code that has
>>> been posted is the start of a rewrite of the null_blk driver. The benefits
>>> of this rewrite (making low-level memory errors less likely) do not outweigh
>>> the risks that this effort will introduce functional or performance regressions.
>>
>> Instead of replacing, would co-existing be okay? Of course as long as
>> there's no requirement to maintain feature parity between the two.
>> Actually, just call it "rust_blk" and declare it has no relationship to
>> null_blk, despite their functional similarities: it's a developer
>> reference implementation for a rust block driver.
>
> To me, the big discussion point isn't really whether we're doing
> null_blk or not, it's more if we want to go down this path of
> maintaining rust bindings for the block code in general. If the answer
> to that is yes, then doing null_blk seems like a great choice as it's
> not a critical piece of infrastructure. It might even be a good idea to
> be able to run both, for performance purposes, as the bindings or core
> changes.
>
> But back to the real question... This is obviously extra burden on
> maintainers, and that needs to be sorted out first. Block drivers in
> general are not super security sensitive, as it's mostly privileged code
> and there's not a whole lot of user visible API. And the stuff we do
> have is reasonably basic. So what's the long term win of having rust
> bindings? This is a legitimate question. I can see a lot of other more
> user exposed subsystems being of higher interest here.
Even though the block layer is not usually exposed in the same way that
something like the USB stack is, absence of memory safety bugs is a very
useful property. If this is attainable without sacrificing performance,
it seems like a nice option to offer future block device driver
developers. Some would argue that it is worth offering even in the face
of performance regression.
While memory safety is the primary feature that Rust brings to the
table, it does come with other nice features as well. The type system,
language support for stackless coroutines, and language support for
error handling are all very useful.
Regarding maintenance of the bindings, it _is_ an amount of extra work. But
there is more than one way to structure that work. If Rust is accepted
into the block layer at some point, maintenance could be structured in
such a way that it does not get in the way of existing C maintenance
work. A "rust keeps up or it breaks" model. That could work for a while.
Best regards
Andreas
On 5/4/23 1:59 PM, Andreas Hindborg wrote:
>
> Jens Axboe <[email protected]> writes:
>
>> On 5/4/23 12:52 PM, Keith Busch wrote:
>>> On Thu, May 04, 2023 at 11:36:01AM -0700, Bart Van Assche wrote:
>>>> On 5/4/23 11:15, Andreas Hindborg wrote:
>>>>> If it is still unclear to you why this effort was started, please do let
>>>>> me know and I shall try to clarify further :)
>>>>
>>>> It seems like I was too polite in my previous email. What I meant is that
>>>> rewriting code is useful if it provides a clear advantage to the users of
>>>> a driver. For null_blk, the users are kernel developers. The code that has
>>>> been posted is the start of a rewrite of the null_blk driver. The benefits
>>>> of this rewrite (making low-level memory errors less likely) do not outweigh
>>>> the risks that this effort will introduce functional or performance regressions.
>>>
>>> Instead of replacing, would co-existing be okay? Of course as long as
>>> there's no requirement to maintain feature parity between the two.
>>> Actually, just call it "rust_blk" and declare it has no relationship to
>>> null_blk, despite their functional similarities: it's a developer
>>> reference implementation for a rust block driver.
>>
>> To me, the big discussion point isn't really whether we're doing
>> null_blk or not, it's more if we want to go down this path of
>> maintaining rust bindings for the block code in general. If the answer
>> to that is yes, then doing null_blk seems like a great choice as it's
>> not a critical piece of infrastructure. It might even be a good idea to
>> be able to run both, for performance purposes, as the bindings or core
>> changes.
>>
>> But back to the real question... This is obviously extra burden on
>> maintainers, and that needs to be sorted out first. Block drivers in
>> general are not super security sensitive, as it's mostly privileged code
>> and there's not a whole lot of user visible API. And the stuff we do
>> have is reasonably basic. So what's the long term win of having rust
>> bindings? This is a legitimate question. I can see a lot of other more
>> user exposed subsystems being of higher interest here.
>
> Even though the block layer is not usually exposed in the same way
> that something like the USB stack is, absence of memory safety bugs is
> a very useful property. If this is attainable without sacrificing
> performance, it seems like a nice option to offer future block device
> driver developers. Some would argue that it is worth offering even in
> the face of performance regression.
>
> While memory safety is the primary feature that Rust brings to the
> table, it does come with other nice features as well. The type system,
> language support for stackless coroutines, and language support for
> error handling are all very useful.
We're in violent agreement on this part, I don't think anyone sane would
argue that memory safety with the same performance [1] isn't something
you'd want. And the error handling with rust is so much better than the
C stuff drivers do now that I can't see anyone disagreeing on that being
a great thing as well.
The discussion point here is the price being paid in terms of people
time.
> Regarding maintenance of the bindings, it _is_ an amount of extra work. But
> there is more than one way to structure that work. If Rust is accepted
> into the block layer at some point, maintenance could be structured in
> such a way that it does not get in the way of existing C maintenance
> work. A "rust keeps up or it breaks" model. That could work for a while.
That potentially works for null_blk, but it would not work for anything
that people actually depend on. In other words, anything that isn't
null_blk. And I don't believe we'd be actively discussing these bindings
if just doing null_blk is the end goal, because that isn't useful by
itself, and at that point we'd all just be wasting our time. In the real
world, once we have just one actual driver using it, then we'd be
looking at "this driver regressed because of change X/Y/Z and that needs
to get sorted before the next release". And THAT is the real issue for
me. So a "rust keeps up or it breaks" model is a bit naive in my
opinion, it's just not a viable approach. In fact, even for null_blk,
this doesn't really fly as we rely on blktests to continually vet the
sanity of the IO stack, and null_blk is an integral part of that.
So I really don't think there's much to debate between "rust people vs
jens" here, as we agree on the benefits, but my end of the table has to
stomach the cons. And like I mentioned in an earlier email, that's not
just on me, there are other regular contributors and reviewers that are
relevant to this discussion. This is something we need to discuss.
[1] We obviously need to do real numbers here, the ones posted I don't
consider stable enough to be useful in saying "yeah it's fully on par".
If you have an updated rust nvme driver that uses these bindings I'd
be happy to run some testing that will definitively tell us if there's a
performance win, loss, or parity, and how much.
--
Jens Axboe
On 5/4/23 2:11 PM, Miguel Ojeda wrote:
> On Thu, May 4, 2023 at 9:02 PM Jens Axboe <[email protected]> wrote:
>>
>> But back to the real question... This is obviously extra burden on
>> maintainers, and that needs to be sorted out first. Block drivers in
>
> Regarding maintenance, something we have suggested in similar cases to
> other subsystems is that the author gets involved as a maintainer of,
> at least, the Rust abstractions/driver (possibly with a different
> `MAINTAINERS` entry).
Right, but that doesn't really solve the problem when the rust bindings
get in the way of changes that you are currently making. Or if you break
them inadvertently. I do see benefits to that approach, but it's no
panacea.
> Of course, that is still work for the existing maintainer(s), i.e.
> you, since coordination takes time. However, it can also be a nice way
> to learn Rust on the side, meanwhile things are getting upstreamed and
> discussed (I think Daniel, in Cc, is taking that approach).
This seems to assume that time is plentiful and we can just add more to
our plate, which isn't always true. While I'd love to do more rust and
get more familiar with it, the time still has to be there for that. I'm
actually typing this on a laptop with a rust gpu driver :-)
And this isn't just on me, there are other regular contributors and
reviewers that would need to be onboard with this.
> And it may also be a way for you to get an extra
> maintainer/reviewer/... later on for the C parts, too, even if Rust
> does not succeed.
That is certainly a win!
>> general are not super security sensitive, as it's mostly privileged code
>> and there's not a whole lot of user visible API. And the stuff we do
>> have is reasonably basic. So what's the long term win of having rust
>> bindings? This is a legitimate question. I can see a lot of other more
>> user exposed subsystems being of higher interest here.
>
> From the experience of other kernel maintainers/developers that are
> making the move, the advantages seem to be well worth it, even
> disregarding the security aspect, i.e. on the language side alone.
Each case is different though, different people and different schedules
and priorities. So while the above is promising, it's also just
anecdotal and doesn't necessarily apply to our case.
--
Jens Axboe
On Wed, May 03, 2023 at 11:06:59AM +0200, Andreas Hindborg wrote:
> From: Andreas Hindborg <[email protected]>
>
> This patch adds support for working with pages of order 0. Support for pages
> with higher order is deferred. Page allocation flags are fixed in this patch.
> Future work might allow the user to specify allocation flags.
>
> This patch is a heavily modified version of code available in the rust tree [1],
> primarily adding support for multiple page mapping strategies.
This also seems misaligned with the direction of Linux development.
Folios are the future, pages are legacy. Please, ask about what's
going on before wasting time on the past.
On Thu, May 04, 2023 at 01:02:19PM -0600, Jens Axboe wrote:
> null_blk or not, it's more if we want to go down this path of
> maintaining rust bindings for the block code in general. If the answer
> to that is yes, then doing null_blk seems like a great choice as it's
> not a critical piece of infrastructure. It might even be a good idea to
> be able to run both, for performance purposes, as the bindings or core
> changes.
Yes. And I'm not in favor of it, especially right now. There is
so much work we need to do that requires changes all over (e.g.
sorting out the request_queue vs gendisk, and making the bio_vec
folio or even better physical address based), and the last thing I
need is a binding to another language, one that happens to have
nice features but that also is really ugly.
Matthew Wilcox <[email protected]> writes:
> On Wed, May 03, 2023 at 11:06:59AM +0200, Andreas Hindborg wrote:
>> From: Andreas Hindborg <[email protected]>
>>
>> This patch adds support for working with pages of order 0. Support for pages
>> with higher order is deferred. Page allocation flags are fixed in this patch.
>> Future work might allow the user to specify allocation flags.
>>
>> This patch is a heavily modified version of code available in the rust tree [1],
>> primarily adding support for multiple page mapping strategies.
>
> This also seems misaligned with the direction of Linux development.
> Folios are the future, pages are legacy. Please, ask about what's
> going on before wasting time on the past.
I see, thanks for the heads up! In this case I wanted to do an
apples-apples comparison to the C null_blk driver. Since that is using
kmap I wanted to have that. But let's bind to the folio_* APIs in the
future, that would make sense.
Best regards
Andreas
Jens Axboe <[email protected]> writes:
> On 5/4/23 1:59 PM, Andreas Hindborg wrote:
>>
>> Jens Axboe <[email protected]> writes:
>>
>>> On 5/4/23 12:52 PM, Keith Busch wrote:
>>>> On Thu, May 04, 2023 at 11:36:01AM -0700, Bart Van Assche wrote:
>>>>> On 5/4/23 11:15, Andreas Hindborg wrote:
>>>>>> If it is still unclear to you why this effort was started, please do let
>>>>>> me know and I shall try to clarify further :)
>>>>>
>>>>> It seems like I was too polite in my previous email. What I meant is that
>>>>> rewriting code is useful if it provides a clear advantage to the users of
>>>>> a driver. For null_blk, the users are kernel developers. The code that has
>>>>> been posted is the start of a rewrite of the null_blk driver. The benefits
>>>>> of this rewrite (making low-level memory errors less likely) do not outweigh
>>>>> the risks that this effort will introduce functional or performance regressions.
>>>>
>>>> Instead of replacing, would co-existing be okay? Of course as long as
>>>> there's no requirement to maintain feature parity between the two.
>>>> Actually, just call it "rust_blk" and declare it has no relationship to
>>>> null_blk, despite their functional similarities: it's a developer
>>>> reference implementation for a rust block driver.
>>>
>>> To me, the big discussion point isn't really whether we're doing
>>> null_blk or not, it's more if we want to go down this path of
>>> maintaining rust bindings for the block code in general. If the answer
>>> to that is yes, then doing null_blk seems like a great choice as it's
>>> not a critical piece of infrastructure. It might even be a good idea to
>>> be able to run both, for performance purposes, as the bindings or core
>>> changes.
>>>
>>> But back to the real question... This is obviously extra burden on
>>> maintainers, and that needs to be sorted out first. Block drivers in
>>> general are not super security sensitive, as it's mostly privileged code
>>> and there's not a whole lot of user visible API. And the stuff we do
>>> have is reasonably basic. So what's the long term win of having rust
>>> bindings? This is a legitimate question. I can see a lot of other more
>>> user exposed subsystems being of higher interest here.
>>
>> Even though the block layer is not usually exposed in the same way
>> that something like the USB stack is, absence of memory safety bugs is
>> a very useful property. If this is attainable without sacrificing
>> performance, it seems like a nice option to offer future block device
>> driver developers. Some would argue that it is worth offering even in
>> the face of performance regression.
>>
>> While memory safety is the primary feature that Rust brings to the
>> table, it does come with other nice features as well. The type system,
>> language support for stackless coroutines, and language support for
>> error handling are all very useful.
>
> We're in violent agreement on this part, I don't think anyone sane would
> argue that memory safety with the same performance [1] isn't something
> you'd want. And the error handling with rust is so much better than the
> C stuff drivers do now that I can't see anyone disagreeing on that being
> a great thing as well.
>
> The discussion point here is the price being paid in terms of people
> time.
>
>> Regarding maintenance of the bindings, it _is_ an amount of extra work. But
>> there is more than one way to structure that work. If Rust is accepted
>> into the block layer at some point, maintenance could be structured in
>> such a way that it does not get in the way of existing C maintenance
>> work. A "rust keeps up or it breaks" model. That could work for a while.
>
> That potentially works for null_blk, but it would not work for anything
> that people actually depend on. In other words, anything that isn't
> null_blk. And I don't believe we'd be actively discussing these bindings
> if just doing null_blk is the end goal, because that isn't useful by
> itself, and at that point we'd all just be wasting our time. In the real
> world, once we have just one actual driver using it, then we'd be
> looking at "this driver regressed because of change X/Y/Z and that needs
> to get sorted before the next release". And THAT is the real issue for
> me. So a "rust keeps up or it breaks" model is a bit naive in my
> opinion, it's just not a viable approach. In fact, even for null_blk,
> this doesn't really fly as we rely on blktests to continually vet the
> sanity of the IO stack, and null_blk is an integral part of that.
Sure, once there are actual users, this model would not work. But during
an introduction period it might be a useful model. Having Rust around
without having to take care of it might give maintainers, reviewers, and
contributors a no-strings-attached opportunity to dabble with the
language in a domain they are familiar with.
>
> So I really don't think there's much to debate between "rust people vs
> jens" here, as we agree on the benefits, but my end of the table has to
> stomach the cons. And like I mentioned in an earlier email, that's not
> just on me, there are other regular contributors and reviewers that are
> relevant to this discussion. This is something we need to discuss.
>
> [1] We obviously need to do real numbers here, the ones posted I don't
> consider stable enough to be useful in saying "yeah it's fully on par".
> If you have an updated rust nvme driver that uses these bindings I'd
> be happy to run some testing that will definitively tell us if there's a
> performance win, loss, or parity, and how much.
I do plan to rebase the NVMe driver somewhere in the next few months.
I'll let you know when that work is done.
Best regards
Andreas
On 5/4/23 20:36, Bart Van Assche wrote:
> On 5/4/23 11:15, Andreas Hindborg wrote:
>> If it is still unclear to you why this effort was started, please do let
>> me know and I shall try to clarify further :)
>
> It seems like I was too polite in my previous email. What I meant is that
> rewriting code is useful if it provides a clear advantage to the users of
> a driver. For null_blk, the users are kernel developers. The code that has
> been posted is the start of a rewrite of the null_blk driver. The benefits
> of this rewrite (making low-level memory errors less likely) do not outweigh
> the risks that this effort will introduce functional or performance regressions.
>
I have to disagree here. While the null_blk driver in itself is
certainly not _that_ useful, it does provide a good sounding board for
whether all the design principles of the Linux block layer can be
adequately expressed in Rust.
And by posting this driver you just proved that, and we all have a
better understanding of what would be needed to convert old or create
new drivers.
But I guess we'll have a longer discussion at LSF :-)
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman
On Fri, May 05, 2023 at 06:42:02AM +0200, Andreas Hindborg wrote:
>
> Matthew Wilcox <[email protected]> writes:
>
> > On Wed, May 03, 2023 at 11:06:59AM +0200, Andreas Hindborg wrote:
> >> From: Andreas Hindborg <[email protected]>
> >>
> >> This patch adds support for working with pages of order 0. Support for pages
> >> with higher order is deferred. Page allocation flags are fixed in this patch.
> >> Future work might allow the user to specify allocation flags.
> >>
> >> This patch is a heavily modified version of code available in the rust tree [1],
> >> primarily adding support for multiple page mapping strategies.
> >
> > This also seems misaligned with the direction of Linux development.
> > Folios are the future, pages are legacy. Please, ask about what's
> > going on before wasting time on the past.
>
> I see, thanks for the heads up! In this case I wanted to do an
> apples-apples comparison to the C null_blk driver. Since that is using
> kmap I wanted to have that. But let's bind to the folio_* APIs in the
> future, that would make sense.
Well, kmap() is essentially a no-op on 64-bit systems, so it's not
terribly relevant to doing a comparison.
On Thu, May 4, 2023 at 10:22 PM Jens Axboe <[email protected]> wrote:
>
> Right, but that doesn't really solve the problem when the rust bindings
> get in the way of changes that you are currently making. Or if you break
> them inadvertently. I do see benefits to that approach, but it's no
> panacea.
>
> This seems to assume that time is plentiful and we can just add more to
> our plate, which isn't always true. While I'd love to do more rust and
> get more familiar with it, the time still has to be there for that. I'm
> actually typing this on a laptop with a rust gpu driver :-)
>
> And this isn't just on me, there are other regular contributors and
> reviewers that would need to be onboard with this.
Indeed -- I didn't mean to imply it wouldn't be time consuming, only
that it might be an alternative approach compared to having existing
maintainers do it. Of course, it depends on the dynamics of the
subsystem, how busy the subsystem is, whether there is good rapport,
etc.
> Each case is different though, different people and different schedules
> and priorities. So while the above is promising, it's also just
> anecdotal and doesn't necessarily apply to our case.
Definitely, in the end subsystems know best if there is enough time
available (from everybody) to pull it off. I only meant to say that
the security angle is not the only benefit.
For instance, like you said, the error handling, plus a bunch more
that people usually enjoy: stricter typing, more information on
signatures, sum types, pattern matching, privacy, closures, generics,
etc.
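As a small illustration of a few of these (sum types, pattern matching
and `?`-based error handling), here is a sketch in plain Rust. The
`Command` and `ParseError` types and the `parse`/`sectors_touched`
functions are made up for this example; they are not part of any kernel
API or of this series:

```rust
/// A sum type: a value is exactly one of these variants.
#[derive(Debug, PartialEq)]
pub enum Command {
    Read { sector: u64, count: u32 },
    Write { sector: u64, count: u32 },
    Flush,
}

/// Errors are ordinary values, and also a sum type.
#[derive(Debug, PartialEq)]
pub enum ParseError {
    Empty,
    UnknownOp(String),
    MissingField(&'static str),
    BadNumber(String),
}

/// Parse the next whitespace-separated field as a number.
fn field<T: core::str::FromStr>(
    it: &mut core::str::SplitWhitespace<'_>,
    name: &'static str,
) -> Result<T, ParseError> {
    // `?` returns early with the error; there is no status code to forget.
    let raw = it.next().ok_or(ParseError::MissingField(name))?;
    raw.parse().map_err(|_| ParseError::BadNumber(raw.to_string()))
}

/// Parse lines such as "read 8 4", "write 0 1" or "flush".
pub fn parse(line: &str) -> Result<Command, ParseError> {
    let mut it = line.split_whitespace();
    match it.next().ok_or(ParseError::Empty)? {
        "read" => Ok(Command::Read {
            sector: field(&mut it, "sector")?,
            count: field(&mut it, "count")?,
        }),
        "write" => Ok(Command::Write {
            sector: field(&mut it, "sector")?,
            count: field(&mut it, "count")?,
        }),
        "flush" => Ok(Command::Flush),
        other => Err(ParseError::UnknownOp(other.to_string())),
    }
}

/// The compiler rejects this `match` if a variant is left unhandled.
pub fn sectors_touched(cmd: &Command) -> u32 {
    match cmd {
        Command::Read { count, .. } | Command::Write { count, .. } => *count,
        Command::Flush => 0,
    }
}
```

Adding a new variant to `Command` later turns every non-exhaustive
`match` into a compile error, which is the kind of help I had in mind
beyond the security angle.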
Cheers,
Miguel
On Thu, May 4, 2023 at 10:55 PM Jens Axboe <[email protected]> wrote:
>
> That potentially works for null_blk, but it would not work for anything
> that people actually depend on. In other words, anything that isn't
> null_blk. And I don't believe we'd be actively discussing these bindings
> if just doing null_blk is the end goal, because that isn't useful by
> itself, and at that point we'd all just be wasting our time. In the real
> world, once we have just one actual driver using it, then we'd be
> looking at "this driver regressed because of change X/Y/Z and that needs
> to get sorted before the next release". And THAT is the real issue for
> me. So a "rust keeps up or it breaks" model is a bit naive in my
> opinion, it's just not a viable approach. In fact, even for null_blk,
> this doesn't really fly as we rely on blktests to continually vet the
> sanity of the IO stack, and null_blk is an integral part of that.
But `null_blk` shouldn't be affected, no? The Rust abstractions can be
behind an explicit "experimental" / "broken" / "compile-test only"
gate (or similar) in the beginning, as a way to test how much
maintenance it actually requires.
In such a setup, Andreas could be the one responsible for keeping them up
to date in the beginning. That is, in the worst case, a kernel release
could happen with the Rust side broken -- that way `null_blk` is not
impacted.
That is why a separate `MAINTAINERS` entry may be interesting if you
want to keep e.g. the `S:` level separate (though Andreas, I think,
may be able to do "Supported" at this time).
When the first real driver comes, a similar approach could be
repeated, to buy some more time too.
Cheers,
Miguel
On Fri, May 05, 2023 at 12:53:41PM +0200, Miguel Ojeda wrote:
> On Thu, May 4, 2023 at 10:22 PM Jens Axboe <[email protected]> wrote:
> >
> > Right, but that doesn't really solve the problem when the rust bindings
> > get in the way of changes that you are currently making. Or if you break
> > them inadvertently. I do see benefits to that approach, but it's no
> > panacea.
One thing I want to point out is: not having a block layer Rust API
doesn't keep the block layer away from Rust ;-) Rust will get in the way
as long as the block layer is used, directly or indirectly, in any Rust
code in the kernel.
Take the M1 GPU driver for example, it can totally be done without a drm
Rust API: Lina would have to directly call C functions in her GPU driver,
but it's possible. Or she can have her own drm Rust binding which is not
blessed by the drm maintainers. But as long as drm is used in a Rust
driver, a refactoring/improvement of drm will need to take the usage on
the Rust side into consideration. Unless of course, someone is willing to
write a C driver for the M1 GPU.
The Rust bindings are actually the way of communication between
subsystem maintainers and Rust driver writers, and can help reduce the
amount of work: You can have the abstraction the way you like.
Of course, there is always "don't do it until there are actually users",
and I totally agree with that. But what is a better way to design the
Rust binding for a subsystem?
* Sit down and use the wisdom of maintainers and active
developers, and really spend time on it right now? Or
* Let one future user drag the API/binding design to insaneness?
I'd prefer the first approach. Time spent is time saved.
Personally, my biggest fear is: RCU stalls/lockdep warnings in the Rust
code (or they don't happen because of incorrect bindings), and who is going
to fix them ;-) So I have to spend my time on making sure these bindings
are in good shape, which is not always a pleasant experience: the more you
use something, the more you hate it ;-) But I think it's worth it.
Of course, by no means do I want to force anyone to learn Rust, I totally
understand people who want to see zero Rust. I just want to say the
maintenance burden may exist anyway, and the Rust binding is actually the
thing to help here.
Regards,
Boqun
On Fri, May 05, 2023 at 05:24:56AM -0700, Boqun Feng wrote:
> On Fri, May 05, 2023 at 12:53:41PM +0200, Miguel Ojeda wrote:
> > On Thu, May 4, 2023 at 10:22 PM Jens Axboe <[email protected]> wrote:
> > >
> > > Right, but that doesn't really solve the problem when the rust bindings
> > > get in the way of changes that you are currently making. Or if you break
> > > them inadvertently. I do see benefits to that approach, but it's no
> > > panacea.
>
> One thing I want to point out is: not having a block layer Rust API
> doesn't keep the block layer away from Rust ;-) Rust will get in the way
> as long as block layer is used, directly or indirectly, in any Rust code
> in kernel.
>
> Take the M1 GPU driver for example, it can totally be done without a drm
> Rust API: Lina will have to directly call C functions in her GPU driver,
> but it's possible or she can have her own drm Rust binding which is not
> blessed by the drm maintainers. But as long as drm is used in a Rust
> driver, a refactoring/improvement of drm will need to take the usage of
> Rust side into consideration. Unless of course, someone is willing to
> write a C driver for M1 GPU.
>
> The Rust bindings are actually the way of communication between
> subsystem maintainers and Rust driver writers, and can help reduce the
> amount of work: You can have the abstraction the way you like.
>
> Of course, there is always "don't do it until there are actually users",
> and I totally agree with that. But what is a better way to design the
> Rust binding for a subsystem?
>
> * Sit down and use the wisdom of maintainers and active
> developers, and really spend time on it right now? Or
>
> * Let one future user drag the API/binding design to insaneness?
>
Ah, of course, I should add: this is not the usual case. Most of the
time, users (e.g. a real driver) can help the design. I was just trying
to say: without the wisdom of maintainers and active developers, a Rust
binding solely designed by one user could have some design issues. In
other words, the experience of maintaining the C side API is very
valuable for designing Rust bindings.
Regards,
Boqun
On 5/5/23 03:53, Miguel Ojeda wrote:
> Definitely, in the end subsystems know best if there is enough time
> available (from everybody) to pull it off. I only meant to say that
> the security angle is not the only benefit.
>
> For instance, like you said, the error handling, plus a bunch more
> that people usually enjoy: stricter typing, more information on
> signatures, sum types, pattern matching, privacy, closures, generics,
> etc.
These are all great advantages of Rust.
One potential cause of memory corruption in block drivers is
misprogramming the DMA engine of the storage controller. This is something
no borrow checker can protect against. Only an IOMMU can protect against
the storage controller accessing memory that it shouldn't access. This is
not a criticism of Rust - I'm bringing this up because I think this is
something that is important to realize.
Bart.
On Fri, May 05, 2023 at 05:24:56AM -0700, Boqun Feng wrote:
>
> The Rust bindings are actually the way of communication between
> subsystem maintainers and Rust driver writers, and can help reduce the
> amount of work: You can have the abstraction the way you like.
We don't have stable APIs or structures here, so let's be clear-eyed
about the maintenance burden these bindings create for linux-block
contributors. Not a hard "no" from me, but this isn't something to
handwave over.
On Fri, May 05, 2023 at 01:42:42PM -0600, Keith Busch wrote:
> On Fri, May 05, 2023 at 05:24:56AM -0700, Boqun Feng wrote:
> >
> > The Rust bindings are actually the way of communication between
> > subsystem maintainers and Rust driver writers, and can help reduce the
> > amount of work: You can have the abstraction the way you like.
>
> We don't have stable APIs or structures here, so let's be clear-eyed
Of course, but every API change needs to cover all in-tree users, right?
> about the maintenance burden these bindings create for linux-block
> contributors. Not a hard "no" from me, but this isn't something to
> handwave over.
I'm not trying to handwave over anything ;-) The fact IIUC is simply that
Rust drivers can call C functions, so say an in-tree Rust driver does
something as follows:
    struct Foo {
        ptr: *mut bio, // A pointer to bio.
        ...
    }
    impl Foo {
        pub fn bar(&self) {
            unsafe {
                // Calling a C function "do_sth_to_bio":
                // void do_sth_to_bio(struct bio *bio);
                bindings::do_sth_to_bio(self.ptr);
            }
        }
    }
That's a user of the block layer, and that user could exist even
without a Bio abstraction. And whenever a linux-block developer wants to
refactor "do_sth_to_bio" with slightly different semantics, that
user will need to be taken into consideration, meaning reading the Rust
code of Foo to understand the usage.
Now with a Bio abstraction, the immediate effect would be that no Rust
code is allowed to directly call block layer functions without using the
Bio abstraction. And hopefully the Bio abstraction, along with the other
bindings, is a good enough place to reason about semantic changes or
refactoring, so there is no need to read the code of Foo to understand
the usage. Of course, some C side changes may result in changes in the
Rust bindings as well, but that doesn't make things worse. (Without Rust
bindings, one would need to understand Foo in that case.)
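To make this a bit more concrete, here is a minimal sketch of what such
an abstraction could look like. It is only a guess at the shape, not the
API from this series: `bindings::bio`, `bindings::do_sth_to_bio` and the
`Bio` wrapper are hypothetical, and the bindings are mocked in plain Rust
so that the snippet builds outside the kernel. The point is only that the
`unsafe` call lives in one place, and a driver like Foo goes through the
safe method:

```rust
// Hypothetical stand-in for the bindgen-generated bindings. In the kernel
// these would come from the C headers; here they are mocked in plain Rust.
mod bindings {
    #[allow(non_camel_case_types)]
    #[repr(C)]
    pub struct bio {
        pub touched: u32,
    }

    /// Mock of the C function `void do_sth_to_bio(struct bio *bio);`.
    ///
    /// # Safety
    ///
    /// `bio` must point to a valid `bio` that nobody else is accessing.
    pub unsafe fn do_sth_to_bio(bio: *mut bio) {
        // SAFETY: validity and exclusivity are guaranteed by the caller.
        unsafe { (*bio).touched += 1 };
    }
}

use core::ptr::NonNull;

/// Hypothetical safe wrapper: the only code that calls into `bindings`.
///
/// Invariant: `ptr` points to a valid `bindings::bio` that this `Bio` has
/// exclusive access to for as long as the `Bio` exists.
pub struct Bio {
    ptr: NonNull<bindings::bio>,
}

impl Bio {
    /// # Safety
    ///
    /// `ptr` must be valid and exclusively owned while the `Bio` lives.
    pub unsafe fn from_raw(ptr: NonNull<bindings::bio>) -> Self {
        Self { ptr }
    }

    /// Safe entry point; semantic changes on the C side are handled here.
    pub fn do_sth(&mut self) {
        // SAFETY: `self.ptr` is valid and exclusive by the type invariant.
        unsafe { bindings::do_sth_to_bio(self.ptr.as_ptr()) };
    }

    /// Read back the mock's counter (only exists to make the sketch testable).
    pub fn touched(&self) -> u32 {
        // SAFETY: `self.ptr` is valid by the type invariant.
        unsafe { self.ptr.as_ref().touched }
    }
}

/// A driver that contains no `unsafe` and never sees `bindings`.
pub struct Foo {
    bio: Bio,
}

impl Foo {
    pub fn new(bio: Bio) -> Self {
        Self { bio }
    }

    pub fn bar(&mut self) {
        self.bio.do_sth();
    }

    pub fn touched(&self) -> u32 {
        self.bio.touched()
    }
}
```

With this shape, a change to "do_sth_to_bio" means reviewing
`Bio::do_sth` and its invariant, rather than every Foo in the tree.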
Of course, this is just my 2 cents. I could be completely wrong.
(Put the Rust-for-Linux hat on) Needless to say with or without the Rust
bindings for the block layer, we are (at least I'm) happy to help on any
Rust related questions/bugs/issues for linux-block ;-)
Regards,
Boqun
On Wed, May 03, 2023 at 11:06:57AM +0200, Andreas Hindborg wrote:
> The statistics presented in my previous message [1] show that the C null block
> driver has had a significant amount of memory safety related problems in the
> past. 41% of fixes merged for the C null block driver are fixes for memory
> safety issues. This makes the null block driver a good candidate for rewriting
> in Rust.
Curious, how long does it take to do an analysis like this? Are there efforts
to automate this a bit more? We have efforts to use machine learning to
evaluate stable candidate patches, we probably should be able to qualify
commits as fixing "memory safety", I figure.
Because what I'd love to see is if we could easily obtain similar
statistics for arbitrary parts of the kernel. The easiest way to break
this down might be by kconfig symbol for instance, and then based on
that gather more information about subsystems.
Then the rationale for considering adopting rust bindings for certain areas
of the kernel becomes a bit clearer.
I figured some of this work has already been done, but I just haven't seen it
yet.
Luis
Luis Chamberlain <[email protected]> writes:
> On Wed, May 03, 2023 at 11:06:57AM +0200, Andreas Hindborg wrote:
>> The statistics presented in my previous message [1] show that the C null block
>> driver has had a significant amount of memory safety related problems in the
>> past. 41% of fixes merged for the C null block driver are fixes for memory
>> safety issues. This makes the null block driver a good candidate for rewriting
>> in Rust.
>
> Curious, how long does it take to do an analysis like this? Are there efforts
> to automate this a bit more? We have efforts to use machine learning to
> evaluate stable candidate patches, we probably should be able to qualify
> commits as fixing "memory safety", I figure.
>
> Because what I'd love to see is if we could easily obtain similar
> statistics for arbitrary parts of the kernel. The easiest way to break
> this down might be by kconfig symbol for instance, and then based on
> that gather more information about subsystems.
>
I spent around 4 hours with a spreadsheet and git. It would be cool if
that work could be automated. It's not always clear from the commit
heading that a commit is a fix. When it is clear that it is a fix, it
might not be clear what is fixed. I had to look at the diff for quite a
few commits.
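A crude first pass could probably be automated along the following
lines. This is only a sketch of a keyword heuristic, not what I actually
did: the `classify` function, the `Class` enum and the keyword list are
assumptions made up for illustration, and anything it flags would still
need a look at the diff.

```rust
/// Coarse classification of a commit based on its message alone.
#[derive(Debug, PartialEq)]
pub enum Class {
    NotAFix,
    Fix,
    MemorySafetyFix,
}

/// Assumed keyword list; a real analysis would need a better one.
const MEMORY_KEYWORDS: [&str; 6] = [
    "use-after-free",
    "use after free",
    "null pointer",
    "out-of-bounds",
    "double free",
    "data race",
];

/// Classify a full commit message (subject line plus body).
pub fn classify(message: &str) -> Class {
    let lower = message.to_lowercase();

    // A commit counts as a fix if it carries a "Fixes:" tag or if the
    // subject line mentions "fix".
    let has_fixes_tag = lower.lines().any(|l| l.trim_start().starts_with("fixes:"));
    let subject_says_fix = lower.lines().next().map_or(false, |s| s.contains("fix"));
    if !has_fixes_tag && !subject_says_fix {
        return Class::NotAFix;
    }

    if MEMORY_KEYWORDS.iter().any(|k| lower.contains(k)) {
        Class::MemorySafetyFix
    } else {
        Class::Fix
    }
}
```

Feeding it the output of `git log` for a path would give a starting
point for a spreadsheet, with the unclear commits left for manual review.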
There is some work mentioning the ratio of memory safety issues fixed in
the kernel, but none of them go into details for specific subsystems as
far as I know. 20% of bugs fixed in stable Linux Kernel branches for
drivers are memory safety issues [1]. 65% of recent Linux kernel
vulnerabilities are memory safety issues [2].
> Then the rationale for considering adopting rust bindings for certain areas
> of the kernel becomes a bit clearer.
As mentioned elsewhere in this thread there are other benefits from
deploying Rust than provable absence of memory safety issues.
Best regards
Andreas
[1] http://dx.doi.org/10.15514/ISPRAS-2018-30(6)-8
[2] https://lssna19.sched.com/event/RHaT/writing-linux-kernel-modules-in-safe-rust-geoffrey-thomas-two-sigma-investments-alex-gaynor-alloy
I have left some comments below. Some of them are not really
suggestions, but rather I would like to know the rationale of the
design, as I am not familiar at all with the C side and have mostly
no idea what the called functions do.
On Wednesday, May 3rd, 2023 at 11:07, Andreas Hindborg <[email protected]> wrote:
> From: Andreas Hindborg <[email protected]>
>
> Add initial abstractions for working with blk-mq.
>
> This patch is a maintained, refactored subset of code originally published by
> Wedson Almeida Filho <[email protected]> [1].
>
> [1] https://github.com/wedsonaf/linux/tree/f2cfd2fe0e2ca4e90994f96afe268bbd4382a891/rust/kernel/blk/mq.rs
>
> Cc: Wedson Almeida Filho <[email protected]>
> Signed-off-by: Andreas Hindborg <[email protected]>
> ---
> rust/bindings/bindings_helper.h | 2 +
> rust/helpers.c | 22 +++
> rust/kernel/block.rs | 5 +
> rust/kernel/block/mq.rs | 15 ++
> rust/kernel/block/mq/gen_disk.rs | 133 +++++++++++++++
> rust/kernel/block/mq/operations.rs | 260 +++++++++++++++++++++++++++++
> rust/kernel/block/mq/raw_writer.rs | 30 ++++
> rust/kernel/block/mq/request.rs | 71 ++++++++
> rust/kernel/block/mq/tag_set.rs | 92 ++++++++++
> rust/kernel/error.rs | 4 +
> rust/kernel/lib.rs | 1 +
> 11 files changed, 635 insertions(+)
> create mode 100644 rust/kernel/block.rs
> create mode 100644 rust/kernel/block/mq.rs
> create mode 100644 rust/kernel/block/mq/gen_disk.rs
> create mode 100644 rust/kernel/block/mq/operations.rs
> create mode 100644 rust/kernel/block/mq/raw_writer.rs
> create mode 100644 rust/kernel/block/mq/request.rs
> create mode 100644 rust/kernel/block/mq/tag_set.rs
>
> diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
> index 52834962b94d..86c07eeb1ba1 100644
> --- a/rust/bindings/bindings_helper.h
> +++ b/rust/bindings/bindings_helper.h
> @@ -11,6 +11,8 @@
> #include <linux/wait.h>
> #include <linux/sched.h>
> #include <linux/radix-tree.h>
> +#include <linux/blk_types.h>
> +#include <linux/blk-mq.h>
>
> /* `bindgen` gets confused at certain things. */
> const gfp_t BINDINGS_GFP_KERNEL = GFP_KERNEL;
> diff --git a/rust/helpers.c b/rust/helpers.c
> index 9bd9d95da951..a59341084774 100644
> --- a/rust/helpers.c
> +++ b/rust/helpers.c
> @@ -18,6 +18,7 @@
> * accidentally exposed.
> */
>
> +#include <linux/bio.h>
> #include <linux/bug.h>
> #include <linux/build_bug.h>
> #include <linux/err.h>
> @@ -28,6 +29,8 @@
> #include <linux/wait.h>
> #include <linux/radix-tree.h>
> #include <linux/highmem.h>
> +#include <linux/blk-mq.h>
> +#include <linux/blkdev.h>
>
> __noreturn void rust_helper_BUG(void)
> {
> @@ -130,6 +133,25 @@ void rust_helper_put_task_struct(struct task_struct *t)
> }
> EXPORT_SYMBOL_GPL(rust_helper_put_task_struct);
>
> +struct bio_vec rust_helper_req_bvec(struct request *rq)
> +{
> + return req_bvec(rq);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_req_bvec);
> +
> +void *rust_helper_blk_mq_rq_to_pdu(struct request *rq)
> +{
> + return blk_mq_rq_to_pdu(rq);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_blk_mq_rq_to_pdu);
> +
> +void rust_helper_bio_advance_iter_single(const struct bio *bio,
> + struct bvec_iter *iter,
> + unsigned int bytes) {
> + bio_advance_iter_single(bio, iter, bytes);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_bio_advance_iter_single);
> +
> void rust_helper_init_radix_tree(struct xarray *tree, gfp_t gfp_mask)
> {
> INIT_RADIX_TREE(tree, gfp_mask);
> diff --git a/rust/kernel/block.rs b/rust/kernel/block.rs
> new file mode 100644
> index 000000000000..4c93317a568a
> --- /dev/null
> +++ b/rust/kernel/block.rs
> @@ -0,0 +1,5 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +//! Types for working with the block layer
> +
> +pub mod mq;
> diff --git a/rust/kernel/block/mq.rs b/rust/kernel/block/mq.rs
> new file mode 100644
> index 000000000000..5b40f6a73c0f
> --- /dev/null
> +++ b/rust/kernel/block/mq.rs
> @@ -0,0 +1,15 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +//! This module provides types for implementing drivers that interface the
> +//! blk-mq subsystem
> +
> +mod gen_disk;
> +mod operations;
> +mod raw_writer;
> +mod request;
> +mod tag_set;
> +
> +pub use gen_disk::GenDisk;
> +pub use operations::Operations;
> +pub use request::Request;
> +pub use tag_set::TagSet;
> diff --git a/rust/kernel/block/mq/gen_disk.rs b/rust/kernel/block/mq/gen_disk.rs
> new file mode 100644
> index 000000000000..50496af15bbf
> --- /dev/null
> +++ b/rust/kernel/block/mq/gen_disk.rs
> @@ -0,0 +1,133 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +//! GenDisk abstraction
> +//!
> +//! C header: [`include/linux/blkdev.h`](../../include/linux/blkdev.h)
> +//! C header: [`include/linux/blk-mq.h`](../../include/linux/blk-mq.h)
> +
> +use crate::block::mq::{raw_writer::RawWriter, Operations, TagSet};
> +use crate::{
> + bindings, error::from_err_ptr, error::Result, sync::Arc, types::ForeignOwnable,
> + types::ScopeGuard,
> +};
> +use core::fmt::{self, Write};
> +
> +/// A generic block device
> +///
> +/// # Invariants
> +///
> +/// - `gendisk` must always point to an initialized and valid `struct gendisk`.
> +pub struct GenDisk<T: Operations> {
> + _tagset: Arc<TagSet<T>>,
> + gendisk: *mut bindings::gendisk,
Why are these two fields not embedded? Shouldn't the user decide where
to allocate?
> +}
> +
> +// SAFETY: `GenDisk` is an owned pointer to a `struct gendisk` and an `Arc` to a
> +// `TagSet` It is safe to send this to other threads as long as T is Send.
> +unsafe impl<T: Operations + Send> Send for GenDisk<T> {}
> +
> +impl<T: Operations> GenDisk<T> {
> + /// Try to create a new `GenDisk`
> + pub fn try_new(tagset: Arc<TagSet<T>>, queue_data: T::QueueData) -> Result<Self> {
> + let data = queue_data.into_foreign();
> + let recover_data = ScopeGuard::new(|| {
> + // SAFETY: T::QueueData was created by the call to `into_foreign()` above
> + unsafe { T::QueueData::from_foreign(data) };
> + });
> +
> + let lock_class_key = crate::sync::LockClassKey::new();
> +
> + // SAFETY: `tagset.raw_tag_set()` points to a valid and initialized tag set
> + let gendisk = from_err_ptr(unsafe {
> + bindings::__blk_mq_alloc_disk(tagset.raw_tag_set(), data as _, lock_class_key.as_ptr())
Avoid `as _` casts.
> + })?;
> +
> + const TABLE: bindings::block_device_operations = bindings::block_device_operations {
> + submit_bio: None,
> + open: None,
> + release: None,
> + ioctl: None,
> + compat_ioctl: None,
> + check_events: None,
> + unlock_native_capacity: None,
> + getgeo: None,
> + set_read_only: None,
> + swap_slot_free_notify: None,
> + report_zones: None,
> + devnode: None,
> + alternative_gpt_sector: None,
> + get_unique_id: None,
> + owner: core::ptr::null_mut(),
> + pr_ops: core::ptr::null_mut(),
> + free_disk: None,
> + poll_bio: None,
> + };
> +
> + // SAFETY: gendisk is a valid pointer as we initialized it above
> + unsafe { (*gendisk).fops = &TABLE };
> +
> + recover_data.dismiss();
> + Ok(Self {
> + _tagset: tagset,
> + gendisk,
> + })
> + }
> +
> + /// Set the name of the device
> + pub fn set_name(&self, args: fmt::Arguments<'_>) -> Result {
> + let mut raw_writer = RawWriter::from_array(unsafe { &mut (*self.gendisk).disk_name });
Missing `SAFETY` comment; also see below.
> + raw_writer.write_fmt(args)?;
> + raw_writer.write_char('\0')?;
> + Ok(())
> + }
> +
> + /// Register the device with the kernel. When this function return, the
> + /// device is accessible from VFS. The kernel may issue reads to the device
> + /// during registration to discover partition infomation.
> + pub fn add(&self) -> Result {
> + crate::error::to_result(unsafe {
> + bindings::device_add_disk(core::ptr::null_mut(), self.gendisk, core::ptr::null_mut())
> + })
> + }
> +
> + /// Call to tell the block layer the capcacity of the device
> + pub fn set_capacity(&self, sectors: u64) {
> + unsafe { bindings::set_capacity(self.gendisk, sectors) };
> + }
> +
> + /// Set the logical block size of the device
> + pub fn set_queue_logical_block_size(&self, size: u32) {
> + unsafe { bindings::blk_queue_logical_block_size((*self.gendisk).queue, size) };
> + }
> +
> + /// Set the physical block size of the device
What does this *do*? I do not think the doc string gives any meaningful
information not present in the function name (this might just be
because I have no idea what this is, and anyone with just a little
more knowledge would know, but I still wanted to mention it).
> + pub fn set_queue_physical_block_size(&self, size: u32) {
> + unsafe { bindings::blk_queue_physical_block_size((*self.gendisk).queue, size) };
> + }
> +
> + /// Set the rotational media attribute for the device
> + pub fn set_rotational(&self, rotational: bool) {
> + if !rotational {
> + unsafe {
> + bindings::blk_queue_flag_set(bindings::QUEUE_FLAG_NONROT, (*self.gendisk).queue)
> + };
> + } else {
> + unsafe {
> + bindings::blk_queue_flag_clear(bindings::QUEUE_FLAG_NONROT, (*self.gendisk).queue)
> + };
> + }
> + }
> +}
> +
> +impl<T: Operations> Drop for GenDisk<T> {
> + fn drop(&mut self) {
> + let queue_data = unsafe { (*(*self.gendisk).queue).queuedata };
> +
> + unsafe { bindings::del_gendisk(self.gendisk) };
> +
> + // SAFETY: `queue.queuedata` was created by `GenDisk::try_new()` with a
> + // call to `ForeignOwnable::into_pointer()` to create `queuedata`.
> + // `ForeignOwnable::from_foreign()` is only called here.
> + let _queue_data = unsafe { T::QueueData::from_foreign(queue_data) };
> + }
> +}
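On the `as _` cast in `GenDisk::try_new()` above, a minimal user-space sketch of one alternative: `pointer::cast` names the target type explicitly, so the conversion cannot silently become something else if a signature changes. `take_queue_data` and `round_trip` are hypothetical stand-ins for illustration, not kernel APIs.

```rust
use core::ffi::c_void;

/// Hypothetical stand-in for a C API that accepts an opaque `void *`.
fn take_queue_data(data: *mut c_void) -> *mut c_void {
    data
}

/// Round-trips a boxed value through an opaque pointer using explicit casts.
fn round_trip(value: u32) -> u32 {
    // Owned raw pointer, analogous to what `into_foreign()` hands out.
    let raw: *mut u32 = Box::into_raw(Box::new(value));

    // `cast` spells out the target type. `raw as _` would let inference pick
    // it, and would keep compiling if the parameter became an integer type.
    let opaque = take_queue_data(raw.cast::<c_void>());

    // SAFETY: `opaque` came from `Box::into_raw` above and is reclaimed
    // exactly once.
    let boxed = unsafe { Box::from_raw(opaque.cast::<u32>()) };
    *boxed
}
```

Where the cast only changes constness, `cast_mut()` or `cast_const()` would state that intent in the same way.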
> diff --git a/rust/kernel/block/mq/operations.rs b/rust/kernel/block/mq/operations.rs
> new file mode 100644
> index 000000000000..fb1ab707d1f0
> --- /dev/null
> +++ b/rust/kernel/block/mq/operations.rs
> @@ -0,0 +1,260 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +//! This module provides an interface for blk-mq drivers to implement.
> +//!
> +//! C header: [`include/linux/blk-mq.h`](../../include/linux/blk-mq.h)
> +
> +use crate::{
> + bindings,
> + block::mq::{tag_set::TagSetRef, Request},
> + error::{from_result, Result},
> + types::ForeignOwnable,
> +};
> +use core::{marker::PhantomData, pin::Pin};
> +
> +/// Implement this trait to interface blk-mq as block devices
> +#[macros::vtable]
> +pub trait Operations: Sized {
Is this trait really safe? Are there **no** requirements for e.g.
`QueueData`? So could I use `Box<()>`?
> + /// Data associated with a request. This data is located next to the request
> + /// structure.
> + type RequestData;
> +
> + /// Data associated with the `struct request_queue` that is allocated for
> + /// the `GenDisk` associated with this `Operations` implementation.
> + type QueueData: ForeignOwnable;
> +
> + /// Data associated with a dispatch queue. This is stored as a pointer in
> + /// `struct blk_mq_hw_ctx`.
> + type HwData: ForeignOwnable;
> +
> + /// Data associated with a tag set. This is stored as a pointer in `struct
> + /// blk_mq_tag_set`.
> + type TagSetData: ForeignOwnable;
> +
> + /// Called by the kernel to allocate a new `RequestData`. The structure will
> + /// eventually be pinned, so defer initialization to `init_request_data()`
> + fn new_request_data(
> + _tagset_data: <Self::TagSetData as ForeignOwnable>::Borrowed<'_>,
> + ) -> Result<Self::RequestData>;
> +
> + /// Called by the kernel to initialize a previously allocated `RequestData`
> + fn init_request_data(
> + _tagset_data: <Self::TagSetData as ForeignOwnable>::Borrowed<'_>,
> + _data: Pin<&mut Self::RequestData>,
> + ) -> Result {
> + Ok(())
> + }
> +
> + /// Called by the kernel to queue a request with the driver. If `is_last` is
> + /// `false`, the driver is allowed to defer commiting the request.
> + fn queue_rq(
> + hw_data: <Self::HwData as ForeignOwnable>::Borrowed<'_>,
> + queue_data: <Self::QueueData as ForeignOwnable>::Borrowed<'_>,
> + rq: &Request<Self>,
> + is_last: bool,
> + ) -> Result;
> +
> + /// Called by the kernel to indicate that queued requests should be submitted
> + fn commit_rqs(
> + hw_data: <Self::HwData as ForeignOwnable>::Borrowed<'_>,
> + queue_data: <Self::QueueData as ForeignOwnable>::Borrowed<'_>,
> + );
> +
> + /// Called by the kernel when the request is completed
> + fn complete(_rq: &Request<Self>);
> +
> + /// Called by the kernel to allocate and initialize a driver specific hardware context data
> + fn init_hctx(
> + tagset_data: <Self::TagSetData as ForeignOwnable>::Borrowed<'_>,
> + hctx_idx: u32,
> + ) -> Result<Self::HwData>;
> +
> + /// Called by the kernel to poll the device for completed requests. Only used for poll queues.
> + fn poll(_hw_data: <Self::HwData as ForeignOwnable>::Borrowed<'_>) -> i32 {
> + unreachable!()
Why are these implemented this way? Should this really panic? Maybe
return an error? Why `i32` as the return type? If it can error it should
be `Result<u32>`.
> + }
> +
> + /// Called by the kernel to map submission queues to CPU cores.
> + fn map_queues(_tag_set: &TagSetRef) {
> + unreachable!()
> + }
> +
> + // There is no need for exit_request() because `drop` will be called.
> +}
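To make the `poll()` question above concrete, here is a rough sketch of a default method that returns an error instead of panicking. Everything in it is simplified and hypothetical: `PollError` stands in for the kernel error type, the trait is a cut-down `Operations`, and the `-1` in `poll_callback` is a placeholder rather than a real errno mapping.

```rust
/// Hypothetical error type standing in for `kernel::error::Error`.
#[derive(Debug, PartialEq)]
enum PollError {
    /// The driver does not implement polling.
    NotSupported,
}

/// Simplified stand-in for the `Operations` trait in the patch.
trait Operations {
    /// Polls for completed requests and returns how many were found.
    ///
    /// The default reports an error instead of panicking, so a stray call
    /// is visible to the caller rather than fatal.
    fn poll(&self) -> Result<u32, PollError> {
        Err(PollError::NotSupported)
    }
}

/// Driver without poll queues: inherits the default.
struct NoPoll;
impl Operations for NoPoll {}

/// Driver with poll queues: reports a completion count.
struct WithPoll {
    completed: u32,
}
impl Operations for WithPoll {
    fn poll(&self) -> Result<u32, PollError> {
        Ok(self.completed)
    }
}

/// Hypothetical mapping to a C `int`: the count on success (saturated to
/// `i32::MAX`), and a placeholder negative value on error.
fn poll_callback<T: Operations>(ops: &T) -> i32 {
    match ops.poll() {
        Ok(n) => n.min(i32::MAX as u32) as i32,
        Err(PollError::NotSupported) => -1,
    }
}
```

With a `Result<u32>` return type the count cannot be negative by construction, and the conversion to the C return type happens in exactly one place.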
> +
> +pub(crate) struct OperationsVtable<T: Operations>(PhantomData<T>);
> +
> +impl<T: Operations> OperationsVtable<T> {
> + // # Safety
> + //
> + // The caller of this function must ensure that `hctx` and `bd` are valid
> + // and initialized. The pointees must outlive this function. Further
> + // `hctx->driver_data` must be a pointer created by a call to
> + // `Self::init_hctx_callback()` and the pointee must outlive this function.
> + // This function must not be called with a `hctx` for which
> + // `Self::exit_hctx_callback()` has been called.
> + unsafe extern "C" fn queue_rq_callback(
> + hctx: *mut bindings::blk_mq_hw_ctx,
> + bd: *const bindings::blk_mq_queue_data,
> + ) -> bindings::blk_status_t {
> + // SAFETY: `bd` is valid as required by the safety requirement for this function.
> + let rq = unsafe { (*bd).rq };
> +
> + // SAFETY: The safety requirement for this function ensure that
> + // `(*hctx).driver_data` was returned by a call to
> + // `Self::init_hctx_callback()`. That function uses
> + // `PointerWrapper::into_pointer()` to create `driver_data`. Further,
> + // the returned value does not outlive this function and
> + // `from_foreign()` is not called until `Self::exit_hctx_callback()` is
> + // called. By the safety requirement of this function and contract with
> + // the `blk-mq` API, `queue_rq_callback()` will not be called after that
> + // point.
This safety section and the others here are rather long and mostly
repeat themselves. Is it possible to put this in its own module and
explain the safety invariants in that module and then in these safety
sections just refer to some labels from that section?
I think we should discuss this in our next meeting.
> + let hw_data = unsafe { T::HwData::borrow((*hctx).driver_data) };
> +
> + // SAFETY: `hctx` is valid as required by this function.
> + let queue_data = unsafe { (*(*hctx).queue).queuedata };
> +
> + // SAFETY: `queue.queuedata` was created by `GenDisk::try_new()` with a
> + // call to `ForeignOwnable::into_pointer()` to create `queuedata`.
> + // `ForeignOwnable::from_foreign()` is only called when the tagset is
> + // dropped, which happens after we are dropped.
> + let queue_data = unsafe { T::QueueData::borrow(queue_data) };
> +
> + // SAFETY: `bd` is valid as required by the safety requirement for this function.
> + let ret = T::queue_rq(
> + hw_data,
> + queue_data,
> + &unsafe { Request::from_ptr(rq) },
> + unsafe { (*bd).last },
> + );
> + if let Err(e) = ret {
> + e.to_blk_status()
> + } else {
> + bindings::BLK_STS_OK as _
> + }
> + }
> +
> + unsafe extern "C" fn commit_rqs_callback(hctx: *mut bindings::blk_mq_hw_ctx) {
> + let hw_data = unsafe { T::HwData::borrow((*hctx).driver_data) };
> +
> + // SAFETY: `hctx` is valid as required by this function.
> + let queue_data = unsafe { (*(*hctx).queue).queuedata };
> +
> + // SAFETY: `queue.queuedata` was created by `GenDisk::try_new()` with a
> + // call to `ForeignOwnable::into_pointer()` to create `queuedata`.
> + // `ForeignOwnable::from_foreign()` is only called when the tagset is
> + // dropped, which happens after we are dropped.
> + let queue_data = unsafe { T::QueueData::borrow(queue_data) };
> + T::commit_rqs(hw_data, queue_data)
> + }
> +
> + unsafe extern "C" fn complete_callback(rq: *mut bindings::request) {
> + T::complete(&unsafe { Request::from_ptr(rq) });
> + }
> +
> + unsafe extern "C" fn poll_callback(
> + hctx: *mut bindings::blk_mq_hw_ctx,
> + _iob: *mut bindings::io_comp_batch,
> + ) -> core::ffi::c_int {
> + let hw_data = unsafe { T::HwData::borrow((*hctx).driver_data) };
> + T::poll(hw_data)
> + }
> +
> + unsafe extern "C" fn init_hctx_callback(
> + hctx: *mut bindings::blk_mq_hw_ctx,
> + tagset_data: *mut core::ffi::c_void,
> + hctx_idx: core::ffi::c_uint,
> + ) -> core::ffi::c_int {
> + from_result(|| {
> + let tagset_data = unsafe { T::TagSetData::borrow(tagset_data) };
> + let data = T::init_hctx(tagset_data, hctx_idx)?;
> + unsafe { (*hctx).driver_data = data.into_foreign() as _ };
> + Ok(0)
> + })
> + }
> +
> + unsafe extern "C" fn exit_hctx_callback(
> + hctx: *mut bindings::blk_mq_hw_ctx,
> + _hctx_idx: core::ffi::c_uint,
> + ) {
> + let ptr = unsafe { (*hctx).driver_data };
> + unsafe { T::HwData::from_foreign(ptr) };
> + }
> +
> + unsafe extern "C" fn init_request_callback(
> + set: *mut bindings::blk_mq_tag_set,
> + rq: *mut bindings::request,
> + _hctx_idx: core::ffi::c_uint,
> + _numa_node: core::ffi::c_uint,
> + ) -> core::ffi::c_int {
> + from_result(|| {
> + // SAFETY: The tagset invariants guarantee that all requests are allocated with extra memory
> + // for the request data.
> + let pdu = unsafe { bindings::blk_mq_rq_to_pdu(rq) } as *mut T::RequestData;
> + let tagset_data = unsafe { T::TagSetData::borrow((*set).driver_data) };
> +
> + let v = T::new_request_data(tagset_data)?;
> +
> + // SAFETY: `pdu` memory is valid, as it was allocated by the caller.
> + unsafe { pdu.write(v) };
> +
> + let tagset_data = unsafe { T::TagSetData::borrow((*set).driver_data) };
> + // SAFETY: `pdu` memory is valid and properly initialised.
> + T::init_request_data(tagset_data, unsafe { Pin::new_unchecked(&mut *pdu) })?;
> +
> + Ok(0)
> + })
> + }
> +
> + unsafe extern "C" fn exit_request_callback(
> + _set: *mut bindings::blk_mq_tag_set,
> + rq: *mut bindings::request,
> + _hctx_idx: core::ffi::c_uint,
> + ) {
> + // SAFETY: The tagset invariants guarantee that all requests are allocated with extra memory
> + // for the request data.
> + let pdu = unsafe { bindings::blk_mq_rq_to_pdu(rq) } as *mut T::RequestData;
> +
> + // SAFETY: `pdu` is valid for read and write and is properly initialised.
> + unsafe { core::ptr::drop_in_place(pdu) };
> + }
> +
> + unsafe extern "C" fn map_queues_callback(tag_set_ptr: *mut bindings::blk_mq_tag_set) {
> + let tag_set = unsafe { TagSetRef::from_ptr(tag_set_ptr) };
> + T::map_queues(&tag_set);
> + }
> +
> + const VTABLE: bindings::blk_mq_ops = bindings::blk_mq_ops {
> + queue_rq: Some(Self::queue_rq_callback),
> + queue_rqs: None,
> + commit_rqs: Some(Self::commit_rqs_callback),
> + get_budget: None,
> + put_budget: None,
> + set_rq_budget_token: None,
> + get_rq_budget_token: None,
> + timeout: None,
> + poll: if T::HAS_POLL {
> + Some(Self::poll_callback)
> + } else {
> + None
> + },
> + complete: Some(Self::complete_callback),
> + init_hctx: Some(Self::init_hctx_callback),
> + exit_hctx: Some(Self::exit_hctx_callback),
> + init_request: Some(Self::init_request_callback),
> + exit_request: Some(Self::exit_request_callback),
> + cleanup_rq: None,
> + busy: None,
> + map_queues: if T::HAS_MAP_QUEUES {
> + Some(Self::map_queues_callback)
> + } else {
> + None
> + },
> + #[cfg(CONFIG_BLK_DEBUG_FS)]
> + show_rq: None,
> + };
> +
> + pub(crate) const unsafe fn build() -> &'static bindings::blk_mq_ops {
> + &Self::VTABLE
> + }
Why is this function `unsafe`?
> +}
Some `# Safety` sections and `SAFETY` comments are missing in this hunk.
> diff --git a/rust/kernel/block/mq/raw_writer.rs b/rust/kernel/block/mq/raw_writer.rs
> new file mode 100644
> index 000000000000..25c16ee0b1f7
> --- /dev/null
> +++ b/rust/kernel/block/mq/raw_writer.rs
> @@ -0,0 +1,30 @@
> +use core::fmt::{self, Write};
> +
> +pub(crate) struct RawWriter {
> + ptr: *mut u8,
> + len: usize,
> +}
> +
> +impl RawWriter {
> + unsafe fn new(ptr: *mut u8, len: usize) -> Self {
> + Self { ptr, len }
> + }
> +
> + pub(crate) fn from_array<const N: usize>(a: &mut [core::ffi::c_char; N]) -> Self {
> + unsafe { Self::new(&mut a[0] as *mut _ as _, N) }
> + }
This function needs to be `unsafe`, because it never captures the
lifetime of `a`. I can write:
let mut a = Box::new([0; 10]);
let mut writer = RawWriter::from_array(&mut *a);
drop(a);
writer.write_str("Abc"); // UAF
Alternatively add a lifetime to `RawWriter`.
> +}
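A minimal sketch of the lifetime variant, written as plain user-space Rust. It assumes a `u8` buffer instead of `c_char`, and the `write_name` helper and the `nullb` name are only for illustration. Because the writer borrows the buffer for `'a`, the `drop(a)` in the example above would be rejected by the borrow checker.

```rust
use core::fmt::{self, Write};

/// Writer over a borrowed byte buffer. The lifetime `'a` ties the writer to
/// the buffer, so the buffer cannot be dropped while the writer is alive.
struct RawWriter<'a> {
    buf: &'a mut [u8],
    pos: usize,
}

impl<'a> RawWriter<'a> {
    fn new(buf: &'a mut [u8]) -> Self {
        Self { buf, pos: 0 }
    }
}

impl Write for RawWriter<'_> {
    fn write_str(&mut self, s: &str) -> fmt::Result {
        let bytes = s.as_bytes();
        let end = self.pos.checked_add(bytes.len()).ok_or(fmt::Error)?;
        // Fail instead of writing past the end of the buffer.
        let dst = self.buf.get_mut(self.pos..end).ok_or(fmt::Error)?;
        dst.copy_from_slice(bytes);
        self.pos = end;
        Ok(())
    }
}

/// Formats a NUL-terminated name into a fixed-size buffer.
fn write_name(buf: &mut [u8], index: u32) -> fmt::Result {
    let mut writer = RawWriter::new(buf);
    write!(writer, "nullb{}", index)?;
    writer.write_char('\0')
}
```

Copying through a slice also handles an empty string, whereas indexing with `&bytes[0]` panics when `s` is empty.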
> +
> +impl Write for RawWriter {
> + fn write_str(&mut self, s: &str) -> fmt::Result {
> + let bytes = s.as_bytes();
> + let len = bytes.len();
> + if len > self.len {
> + return Err(fmt::Error);
> + }
> + unsafe { core::ptr::copy_nonoverlapping(&bytes[0], self.ptr, len) };
> + self.ptr = unsafe { self.ptr.add(len) };
> + self.len -= len;
> + Ok(())
> + }
> +}
> diff --git a/rust/kernel/block/mq/request.rs b/rust/kernel/block/mq/request.rs
> new file mode 100644
> index 000000000000..e95ae3fd71ad
> --- /dev/null
> +++ b/rust/kernel/block/mq/request.rs
> @@ -0,0 +1,71 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +//! This module provides a wrapper for the C `struct request` type.
> +//!
> +//! C header: [`include/linux/blk-mq.h`](../../include/linux/blk-mq.h)
> +
> +use crate::{
> + bindings,
> + block::mq::Operations,
> + error::{Error, Result},
> +};
> +use core::marker::PhantomData;
> +
> +/// A wrapper around a blk-mq `struct request`. This represents an IO request.
> +pub struct Request<T: Operations> {
> + ptr: *mut bindings::request,
Why is this not embedded?
> + _p: PhantomData<T>,
> +}
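If "embedded" is read as "the wrapper contains the C struct rather than a pointer to it", one possible shape is sketched below. `RawRequest` is a hypothetical stand-in for `bindings::request`, and `sector_of` exists only to exercise the wrapper; this illustrates the idea and is not a proposal for the final API.

```rust
use core::cell::UnsafeCell;

/// Hypothetical stand-in for the C `struct request`.
#[repr(C)]
struct RawRequest {
    sector: u64,
}

/// Wrapper that embeds the C struct instead of holding a pointer to it.
/// `repr(transparent)` gives it the same layout as `RawRequest`.
#[repr(transparent)]
struct Request(UnsafeCell<RawRequest>);

impl Request {
    /// Reinterprets a raw pointer as a borrowed wrapper.
    ///
    /// # Safety
    ///
    /// `ptr` must be non-null, aligned, and point to a live `RawRequest`
    /// for the whole lifetime `'a`.
    unsafe fn from_ptr<'a>(ptr: *mut RawRequest) -> &'a Self {
        // SAFETY: The layouts match because of `repr(transparent)`; validity
        // of `ptr` is the caller's obligation.
        unsafe { &*ptr.cast::<Self>() }
    }

    /// Reads the target sector of the request.
    fn sector(&self) -> u64 {
        // SAFETY: `self.0.get()` points to a live `RawRequest`.
        unsafe { (*self.0.get()).sector }
    }
}

/// Borrows a raw request as a `Request` and reads its sector.
fn sector_of(raw: &mut RawRequest) -> u64 {
    // SAFETY: `raw` is a live reference for the duration of the call.
    let rq = unsafe { Request::from_ptr(raw) };
    rq.sector()
}
```

With this shape the callbacks would receive `&Request` borrowed from the pointer the C side passes in, so no separate pointer-holding value needs to be constructed per call.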
> +
> +impl<T: Operations> Request<T> {
> + pub(crate) unsafe fn from_ptr(ptr: *mut bindings::request) -> Self {
> + Self {
> + ptr,
> + _p: PhantomData,
> + }
> + }
> +
> + /// Get the command identifier for the request
> + pub fn command(&self) -> u32 {
> + unsafe { (*self.ptr).cmd_flags & ((1 << bindings::REQ_OP_BITS) - 1) }
> + }
> +
> + /// Call this to indicate to the kernel that the request has been issued by the driver
> + pub fn start(&self) {
> + unsafe { bindings::blk_mq_start_request(self.ptr) };
> + }
> +
> + /// Call this to indicate to the kernel that the request has been completed without errors
> + // TODO: Consume rq so that we can't use it after ending it?
> + pub fn end_ok(&self) {
> + unsafe { bindings::blk_mq_end_request(self.ptr, bindings::BLK_STS_OK as _) };
> + }
> +
> + /// Call this to indicate to the kernel that the request completed with an error
> + pub fn end_err(&self, err: Error) {
> + unsafe { bindings::blk_mq_end_request(self.ptr, err.to_blk_status()) };
> + }
> +
> + /// Call this to indicate that the request completed with the status indicated by `status`
> + pub fn end(&self, status: Result) {
> + if let Err(e) = status {
> + self.end_err(e);
> + } else {
> + self.end_ok();
> + }
> + }
> +
> + /// Call this to schedule defered completion of the request
> + // TODO: Consume rq so that we can't use it after completing it?
> + pub fn complete(&self) {
> + if !unsafe { bindings::blk_mq_complete_request_remote(self.ptr) } {
> + T::complete(&unsafe { Self::from_ptr(self.ptr) });
> + }
> + }
> +
> + /// Get the target sector for the request
> + #[inline(always)]
Why is this `inline(always)`?
> + pub fn sector(&self) -> usize {
> + unsafe { (*self.ptr).__sector as usize }
> + }
> +}
> diff --git a/rust/kernel/block/mq/tag_set.rs b/rust/kernel/block/mq/tag_set.rs
> new file mode 100644
> index 000000000000..d122db7f6d0e
> --- /dev/null
> +++ b/rust/kernel/block/mq/tag_set.rs
> @@ -0,0 +1,92 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +//! This module provides the `TagSet` struct to wrap the C `struct blk_mq_tag_set`.
> +//!
> +//! C header: [`include/linux/blk-mq.h`](../../include/linux/blk-mq.h)
> +
> +use crate::{
> + bindings,
> + block::mq::{operations::OperationsVtable, Operations},
> + error::{Error, Result},
> + sync::Arc,
> + types::ForeignOwnable,
> +};
> +use core::{cell::UnsafeCell, convert::TryInto, marker::PhantomData};
> +
> +/// A wrapper for the C `struct blk_mq_tag_set`
> +pub struct TagSet<T: Operations> {
> + inner: UnsafeCell<bindings::blk_mq_tag_set>,
> + _p: PhantomData<T>,
> +}
> +
> +impl<T: Operations> TagSet<T> {
> + /// Try to create a new tag set
> + pub fn try_new(
> + nr_hw_queues: u32,
> + tagset_data: T::TagSetData,
> + num_tags: u32,
> + num_maps: u32,
> + ) -> Result<Arc<Self>> {
Why force the users to use `Arc`?
> + let tagset = Arc::try_new(Self {
> + inner: UnsafeCell::new(bindings::blk_mq_tag_set::default()),
> + _p: PhantomData,
> + })?;
> +
> + // SAFETY: We just allocated `tagset`, we know this is the only reference to it.
> + let inner = unsafe { &mut *tagset.inner.get() };
> +
> + inner.ops = unsafe { OperationsVtable::<T>::build() };
> + inner.nr_hw_queues = nr_hw_queues;
> + inner.timeout = 0; // 0 means default which is 30 * HZ in C
> + inner.numa_node = bindings::NUMA_NO_NODE;
> + inner.queue_depth = num_tags;
> + inner.cmd_size = core::mem::size_of::<T::RequestData>().try_into()?;
> + inner.flags = bindings::BLK_MQ_F_SHOULD_MERGE;
> + inner.driver_data = tagset_data.into_foreign() as _;
> + inner.nr_maps = num_maps;
> +
> + // SAFETY: `inner` points to valid and initialised memory.
> + let ret = unsafe { bindings::blk_mq_alloc_tag_set(inner) };
> + if ret < 0 {
> + // SAFETY: We created `driver_data` above with `into_foreign`
> + unsafe { T::TagSetData::from_foreign(inner.driver_data) };
> + return Err(Error::from_errno(ret));
> + }
> +
> + Ok(tagset)
> + }
> +
> + /// Return the pointer to the wrapped `struct blk_mq_tag_set`
> + pub(crate) fn raw_tag_set(&self) -> *mut bindings::blk_mq_tag_set {
> + self.inner.get()
> + }
> +}
> +
> +impl<T: Operations> Drop for TagSet<T> {
> + fn drop(&mut self) {
> + let tagset_data = unsafe { (*self.inner.get()).driver_data };
> +
> + // SAFETY: `inner` is valid and has been properly initialised during construction.
> + unsafe { bindings::blk_mq_free_tag_set(self.inner.get()) };
> +
> + // SAFETY: `tagset_data` was created by a call to
> + // `ForeignOwnable::into_foreign` in `TagSet::try_new()`
> + unsafe { T::TagSetData::from_foreign(tagset_data) };
> + }
> +}
> +
> +/// A tag set reference. Used to control lifetime and prevent drop of TagSet references passed to
> +/// `Operations::map_queues()`
> +pub struct TagSetRef {
> + ptr: *mut bindings::blk_mq_tag_set,
> +}
> +
> +impl TagSetRef {
> + pub(crate) unsafe fn from_ptr(tagset: *mut bindings::blk_mq_tag_set) -> Self {
> + Self { ptr: tagset }
> + }
> +
> + pub fn ptr(&self) -> *mut bindings::blk_mq_tag_set {
> + self.ptr
> + }
> +}
This is a **very** thin abstraction, why is it needed?
> diff --git a/rust/kernel/error.rs b/rust/kernel/error.rs
> index 5f4114b30b94..421fef677321 100644
> --- a/rust/kernel/error.rs
> +++ b/rust/kernel/error.rs
> @@ -107,6 +107,10 @@ impl Error {
> self.0
> }
>
> + pub(crate) fn to_blk_status(self) -> bindings::blk_status_t {
> + unsafe { bindings::errno_to_blk_status(self.0) }
> + }
> +
> /// Returns the error encoded as a pointer.
> #[allow(dead_code)]
> pub(crate) fn to_ptr<T>(self) -> *mut T {
> diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs
> index 8bef6686504b..cd798d12d97c 100644
> --- a/rust/kernel/lib.rs
> +++ b/rust/kernel/lib.rs
> @@ -34,6 +34,7 @@ extern crate self as kernel;
> #[cfg(not(test))]
> #[cfg(not(testlib))]
> mod allocator;
> +pub mod block;
> mod build_assert;
> pub mod error;
> pub mod init;
> --
> 2.40.0
>
--
Cheers,
Benno
+ /// Call to tell the block layer the capcacity of the device
+ pub fn set_capacity(&self, sectors: u64) {
+ unsafe { bindings::set_capacity(self.gendisk, sectors) };
+ }
Nit in the comment: capcacity -> capacity
Cheers,
Sergio
Hi All,
I apologize for the lengthy email, but I have a lot of things to cover.
As some of you know, a goal of mine is to make it possible to write blk-mq
device drivers in Rust. The RFC patches I have sent to this list are the first
steps of making that goal a reality. They are a sample of the work I am doing.
My current plan of action is to provide a Rust API that allows implementation of
blk-mq device drivers, along with a Rust implementation of null_blk to serve as a
reference implementation. This reference implementation will demonstrate how to
use the API.
I attended LSF in Vancouver a few weeks back where I led a discussion on the
topic. My goal for that session was to obtain input from the community on how to
upstream the work as it becomes more mature.
I received a lot of feedback during the session, in the hallway, and on
the mailing list. Ultimately, we did not achieve consensus on a path forward. I
will try to condense the key points raised by the community here. If anyone feels
their point is not contained below, please chime in.
Please note that I am paraphrasing the points below; they are not quotations.
1) "Block layer community does not speak Rust and thus cannot review Rust patches"
This work hinges on one of two things happening. Either block layer reviewers
and maintainers eventually become fluent in Rust, or they accept code in
their tree that is maintained by the "rust people". I would very much prefer
the first option.
I would suggest using this work to facilitate gradual adoption of Rust. I
understand that this will be a multi-year effort. By giving the community
access to Rust bindings specifically designed for the block layer, the block
layer community will have a helpful reference to consult when investigating
Rust.
While the block community is getting up to speed in Rust, the Rust for Linux
community is ready to conduct review of patches targeting the block layer.
Until such a time where Rust code can be reviewed by block layer experts, the
work could be gated behind an "EXPERIMENTAL" flag.
Selection of the null_blk driver for a reference implementation to drive the
Rust block API was not random. The null_blk driver is relatively simple and
thus makes for a good platform to demonstrate the Rust API without having to
deal with actual hardware.
The null_blk driver is a piece of testing infrastructure that is not usually
deployed in production environments, so people who are worried about Rust in
general will not have to worry about their production environments being
infested with Rust.
Finally, there have been suggestions to either replace or complement the
existing C null_blk driver with the Rust version. I would suggest
(eventually, not _now_) complementing the existing driver, since it can be
very useful to benchmark and test the two drivers side by side.
2) "Having Rust bindings for the block layer in-tree is a burden for the
maintainers"
I believe we can integrate the bindings in a way so that any potential
breakage in the Rust API does not impact current maintenance work.
Maintainers and reviewers that do not wish to bother with Rust should be able
to opt out. All Rust parts should be gated behind a default N kconfig option.
With this scheme there should be very little inconvenience for current
maintainers.
I will take necessary steps to make sure block layer Rust bindings are always
up to date with changes to kernel C API. I would run CI against
- for-next of https://git.kernel.dk/linux.git
- master of https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
- mainline releases including RCs
- stable and longterm kernels with queues applied
- stable and longterm releases including RCs
Samsung will provide resources to support this CI effort. Through this effort
I will aim to minimize any inconvenience for maintainers.
3) "How will you detect breakage in the Rust API caused by changes to C code?"
The way we call C code from Rust in the kernel guarantees that most changes
to C APIs that are called by Rust code will cause a compile failure when
building the kernel with Rust enabled. This includes changing C function
argument names or types, and struct field names or types. Thus, we do not need
to rely on symvers CRC calculation as suggested by James Bottomley at LSF.
However, if the semantics of a kernel C function are changed without changing
its name or signature, potential breakage will not be detected by the build
system. To detect breakage resulting from this kind of change, we have to
rely _on the same mechanics_ that maintainers of kernel C code are relying on
today:
- kunit tests
- blktests
- fstests
- staying in the loop wrt changes in general
We also have Rust support in Intel 0-day CI, although only compile tests for
now.
4) "How will you prevent breakage in C code resulting from changes to Rust code?"
The way the Rust API is designed, existing C code is not going to be reliant
on Rust code. If anything breaks, just disable Rust and no Rust code will be
built. Or disable block layer Rust code if you want to keep general Rust
support. If Rust is disabled by default, nothing in the kernel should break
because of Rust unless it is explicitly enabled.
5) "Block drivers in general are not security sensitive because they are mostly
privileged code and have limited user visible API"
There are probably easier ways to exploit a Linux system than to target the
block layer, although people are plugging in potentially malicious block
devices all the time in the form of USB Mass Storage devices or CF cards.
While memory safety is very relevant for preventing exploitable security
vulnerabilities, it is also incredibly useful in preventing memory safety
bugs in general. Fewer bugs means less risk of bugs leading to data
corruption. It means less time spent on tracking down and fixing bugs, and
less time spent reviewing bug fixes. It also means less time required to
review patches in general, because reviewers do not have to review for memory
safety issues.
So while Rust has high merit in exposed and historically exploited
subsystems, this does not mean that it has no merit in other subsystems.
6) "Other subsystems may benefit more from adopting Rust"
While this might be true, it does not prevent the block subsystem from
benefiting from adopting Rust (see 5).
7) "Do not waste time re-implementing null_blk, it is test infrastructure so
memory safety does not matter. Why don't you do loop instead?"
I strongly believe that memory safety is also relevant in test
infrastructure. We waste time and energy fixing memory safety issues in our
code, no matter if the code is test infrastructure or not. I refer to the
statistics I posted to the list at an earlier date [3].
Further, I think it is a benefit to all if the storage community can become
fluent in Rust before any critical infrastructure is deployed using Rust.
This is one reason that I switched my efforts to null_blk and that I am not
pushing Rust NVMe.
8) "Why don't you wait with this work until you have a driver for a new storage
standard"
Let's be proactive. I think it is important to iron out the details of the
Rust API before we implement any potential new driver. When we eventually
need to implement a driver for a future storage standard, the choice to do so
in Rust should be easy. By making the API available ahead of time, we will be
able to provide future developers with a stable implementation to choose
from.
9) "You are a new face in our community. How do we know you will not disappear?"
I recognize this consideration and I acknowledge that the community is trust
based. Trust takes time to build. I can do little more than state that I
intend to stay with my team at Samsung to take care of this project for many
years to come. Samsung is behind this particular effort. In general Google
and Microsoft are actively contributing to the wider Rust for Linux project.
Perhaps that can be an indication that the project in general is not going
away.
10) "How can I learn how to build the kernel with Rust enabled?"
We have a guide in `Documentation/rust/quick-start.rst`. If that guide does
not get you started, please reach out to us [1] and we will help you get
started (and fix the documentation since it must not be good enough then).
11) "What if something catches fire and you are out of office?"
If I am for some reason not responding to pings during a merge, please
contact the Rust subsystem maintainer and the Rust for Linux list [2]. There
are quite a few people capable of firefighting if it should ever become
necessary.
12) "These patches are not ready yet, we should not accept them"
They most definitely are _not_ ready, and I would not ask for them to be
included at all in their current state. The RFC is meant to give a sample of
the work that I am doing and to start this conversation. I would rather have
this conversation preemptively. I did not intend to give the impression that
the patches are in a finalized state at all.
With all this in mind I would suggest that we treat the Rust block layer API and
associated null block driver as an experiment. I would suggest that we merge it
in when it is ready, and we gate it behind an experimental kconfig option. If it
turns out that all your worst nightmares come true and it becomes an unbearable
load for maintainers, reviewers and contributors, it will be low effort to remove
it again. I very much doubt this will be the case though.
Jens, Keith, Christoph, Ming, I would kindly ask you to comment on my suggestion
for next steps, or perhaps suggest an alternate path. In general I would
appreciate any constructive feedback from the community.
[1] https://rust-for-linux.com/contact
[2] [email protected]
[3] https://lore.kernel.org/all/[email protected]/
Best regards,
Andreas Hindborg
On Tue, Jun 6, 2023 at 3:40 PM Andreas Hindborg (Samsung)
<[email protected]> wrote:
>
> Samsung will provide resources to support this CI effort. Through this effort
> I will aim to minimize any inconvenience for maintainers.
This is great.
> We also have Rust support in Intel 0-day CI, although only compile tests for
> now.
In addition, we also have initial Rust support in KernelCI (including
`linux-next` [1], `rust-next` and the old `rust` branch), which could
be expanded upon, especially with more resources (Cc'ing Guillaume).
Moreover, our plan is to replicate the simple CI we originally had for
the `rust` branch (which included QEMU boot tests under a matrix of
several archs/configs/compilers) to our `rust-next`, `rust-fixes` and
`rust-dev` branches, in order to complement the other CIs and to get
some early smoke testing (especially for `rust-dev`).
[1] https://github.com/kernelci/kernelci-core/blob/65b0900438e0ed20e7efe0ada681ab212dc8c774/config/core/build-configs.yaml#L1152-L1197
Cheers,
Miguel
Hi,
Yexuan Yang <[email protected]> writes:
>> Over the 432 benchmark configurations, the relative performance of the Rust
>> driver to the C driver (in terms of IOPS) is between 6.8 and -11.8 percent with
>> an average of 0.2 percent better for reads. For writes the span is 16.8 to -4.5
>> percent with an average of 0.9 percent better.
>>
>> For each measurement the drivers are loaded, a drive is configured with memory
>> backing and a size of 4 GiB. C null_blk is configured to match the implemented
>> modes of the Rust driver: `blocksize` is set to 4 KiB, `completion_nsec` to 0,
>> `irqmode` to 0 (IRQ_NONE), `queue_mode` to 2 (MQ), `hw_queue_depth` to 256 and
>> `memory_backed` to 1. For both the drivers, the queue scheduler is set to
>> `none`. These measurements are made using 30 second runs of `fio` with the
>> `PSYNC` IO engine with workers pinned to separate CPU cores. The measurements
>> are done inside a virtual machine (qemu/kvm) on an Intel Alder Lake workstation
>> (i5-12600).
>
> Hi All!
> I have some problems about your benchmark test.
> In Ubuntu 22.02, I compiled an RFL kernel with both C null_blk driver and Rust
> null_blk_driver as modules. For the C null_blk driver, I used the `modprobe`
> command to set relevant parameters, while for the Rust null_blk_driver, I simply
> started it. I used the following two commands to start the drivers:
>
> ```bash
> sudo modprobe null_blk \
> queue_mode=2 irqmode=0 hw_queue_depth=256 \
> memory_backed=1 bs=4096 completion_nsec=0 \
> no_sched=1 gb=4;
> sudo modprobe rnull_mod
> ```
>
> After that, I tested their performance in `randread` with the fio command, specifying the first parameter as 4 and the second parameter as 1:
>
> ```bash
> fio --filename=/dev/nullb0 --iodepth=64 --ioengine=psync --direct=1 --rw=randread --bs=$1k --numjobs=$2 --runtime=30 --group_reporting --name=test-rand-read --output=test_c_randread.log
> fio --filename=/dev/rnullb0 --iodepth=64 --ioengine=psync --direct=1 --rw=randread --bs=$1k --numjobs=$2 --runtime=30 --group_reporting --name=test-rand-read --output=test_rust_randread.log
> ```
>
> The test results showed a significant performance difference between the two
> drivers, which was drastically different from the data you tested in the
> community. Specifically, for `randread`, the C driver had a bandwidth of
> 487MiB/s and IOPS of 124.7k, while the Rust driver had a bandwidth of 264MiB/s
> and IOPS of 67.7k. However, for other I/O types, the performance of the C and
> Rust drivers was more similar. Could you please provide more information about
> the actual bandwidth and IOPS data from your tests, rather than just the
> performance difference between the C and Rust drivers? Additionally, if you
> could offer possible reasons for this abnormality, I would greatly appreciate
> it!
Thanks for trying out the code! I am not sure why you get these numbers.
I am currently out of office, but I will rerun the benchmarks next week
when I get back in. Maybe I can provide you with some scripts and my
kernel configuration. Hopefully we can figure out the difference in our
setups.
Best regards,
Andreas
Hi,
"Andreas Hindborg (Samsung)" <[email protected]> writes:
> Hi,
>
> Yexuan Yang <[email protected]> writes:
>
>>> Over the 432 benchmark configurations, the relative performance of the Rust
>>> driver to the C driver (in terms of IOPS) is between 6.8 and -11.8 percent with
>>> an average of 0.2 percent better for reads. For writes the span is 16.8 to -4.5
>>> percent with an average of 0.9 percent better.
>>>
>>> For each measurement the drivers are loaded, a drive is configured with memory
>>> backing and a size of 4 GiB. C null_blk is configured to match the implemented
>>> modes of the Rust driver: `blocksize` is set to 4 KiB, `completion_nsec` to 0,
>>> `irqmode` to 0 (IRQ_NONE), `queue_mode` to 2 (MQ), `hw_queue_depth` to 256 and
>>> `memory_backed` to 1. For both the drivers, the queue scheduler is set to
>>> `none`. These measurements are made using 30 second runs of `fio` with the
>>> `PSYNC` IO engine with workers pinned to separate CPU cores. The measurements
>>> are done inside a virtual machine (qemu/kvm) on an Intel Alder Lake workstation
>>> (i5-12600).
>>
>> Hi All!
>> I have some problems about your benchmark test.
>> In Ubuntu 22.02, I compiled an RFL kernel with both C null_blk driver and Rust
>> null_blk_driver as modules. For the C null_blk driver, I used the `modprobe`
>> command to set relevant parameters, while for the Rust null_blk_driver, I simply
>> started it. I used the following two commands to start the drivers:
>>
>> ```bash
>> sudo modprobe null_blk \
>> queue_mode=2 irqmode=0 hw_queue_depth=256 \
>> memory_backed=1 bs=4096 completion_nsec=0 \
>> no_sched=1 gb=4;
>> sudo modprobe rnull_mod
>> ```
>>
>> After that, I tested their performance in `randread` with the fio command, specifying the first parameter as 4 and the second parameter as 1:
>>
>> ```bash
>> fio --filename=/dev/nullb0 --iodepth=64 --ioengine=psync --direct=1 --rw=randread --bs=$1k --numjobs=$2 --runtime=30 --group_reporting --name=test-rand-read --output=test_c_randread.log
>> fio --filename=/dev/rnullb0 --iodepth=64 --ioengine=psync --direct=1 --rw=randread --bs=$1k --numjobs=$2 --runtime=30 --group_reporting --name=test-rand-read --output=test_rust_randread.log
>> ```
>>
>> The test results showed a significant performance difference between the two
>> drivers, which was drastically different from the data you tested in the
>> community. Specifically, for `randread`, the C driver had a bandwidth of
>> 487MiB/s and IOPS of 124.7k, while the Rust driver had a bandwidth of 264MiB/s
>> and IOPS of 67.7k. However, for other I/O types, the performance of the C and
>> Rust drivers was more similar. Could you please provide more information about
>> the actual bandwidth and IOPS data from your tests, rather than just the
>> performance difference between the C and Rust drivers? Additionally, if you
>> could offer possible reasons for this abnormality, I would greatly appreciate
>> it!
>
> Thanks for trying out the code! I am not sure why you get these numbers.
> I am currently out of office, but I will rerun the benchmarks next week
> when I get back in. Maybe I can provide you with some scripts and my
> kernel configuration. Hopefully we can figure out the difference in our
> setups.
>
I just ran the benchmarks for that configuration again. My setup is an
Intel Alder Lake CPU (i5-12600) with Debian Bullseye user space running
a 6.2.12 host kernel with a kernel config based on the stock Debian
Bullseye (debian config + make olddefconfig). 32 GB of DDR5 4800MHz
memory. The benchmarks (and patched kernel) run inside QEMU KVM (from
Debian repos Debian 1:5.2+dfsg-11+deb11u2). I use the following qemu
command to boot (edited a bit for clarity):
"qemu-system-x86_64" "-nographic" "-enable-kvm" "-m" "16G" "-cpu" "host"
"-M" "q35" "-smp" "6" "-kernel" "vmlinux" "-append" "console=ttyS0
nokaslr rdinit=/sbin/init root=/dev/vda1 null_blk.nr_devices=0"
I use a Debian Bullseye user space inside the virtual machine.
I think I used the stock Bullseye kernel for the host when I did the
numbers for the cover letter, so that is different for this run.
Here are the results:
+---------+----------+------+---------------------+
| jobs/bs | workload | prep | 4 |
+---------+----------+------+---------------------+
| 4 | randread | 1 | 0.28 0.00 (1.7,0.0) |
| 4 | randread | 0 | 5.75 0.00 (6.8,0.0) |
+---------+----------+------+---------------------+
I used the following parameters for both tests:
Block size: 4k
Run time: 120s
Workload: randread
Jobs: 4
This is the `fio` command I used:
['fio', '--group_reporting', '--name=default', '--filename=/dev/<snip>', '--time_based=1', '--runtime=120', '--rw=randread', '--output=<snip>', '--output-format=json', '--numjobs=4', '--cpus_allowed=0-3', '--cpus_allowed_policy=split', '--ioengine=psync', '--bs=4k', '--direct=1', '--norandommap', '--random_generator=lfsr']
For the line in the table where `prep` is 1 I filled up the target
device with data before running the benchmark. I use the following `fio`
job to do that:
['fio', '--name=prep', '--rw=write', '--filename=/dev/None', '--bs=4k', '--direct=1']
Did you prepare the volumes with data in your experiment?
How to read the data in the table: For the benchmark where the drive is
prepared, Rust null_blk performs 0.28 percent better than C null_blk.
For the case where the drive is empty, Rust null_blk performs 5.75
percent better than C null_blk. I calculated the relative performance
as "(r_iops - c_iops) / c_iops * 100".
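As a minimal sketch of that calculation (the function name is my own choice for illustration and is not part of the benchmark scripts):

```rust
/// Relative performance of the Rust driver versus the C driver, in percent.
/// A positive value means the Rust driver achieved more IOPS.
fn relative_performance_percent(r_iops: f64, c_iops: f64) -> f64 {
    (r_iops - c_iops) / c_iops * 100.0
}
```

With made-up inputs of 105 IOPS for Rust and 100 IOPS for C this yields 5 percent, and swapping in a lower Rust figure gives a negative number.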
The IOPS info you request is present in the tables from the cover
letter. As mentioned in the cover letter, the numbers in parentheses in
each cell are throughput in millions of IOPS for read and
write for the C null_blk driver. For the table above, C null_blk did
1,700,000 read IOPS in the case when the drive is prepared and 6,800,000
IOPS in the case where the drive is not prepared.
You can obtain bandwidth by multiplying IOPS by block size. I can also
provide the raw json output of `fio` if you want to have a look.
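A small sketch of that conversion, assuming the block size is given in bytes and the result is wanted in MiB/s (the function name is hypothetical):

```rust
/// Bandwidth in MiB/s for a given IOPS figure and block size in bytes.
fn bandwidth_mib_per_s(iops: f64, block_size_bytes: u64) -> f64 {
    iops * block_size_bytes as f64 / (1024.0 * 1024.0)
}
```

Applied to the numbers you reported, 124.7k IOPS at a 4 KiB block size gives about 487 MiB/s and 67.7k IOPS gives about 264 MiB/s, so your bandwidth and IOPS figures are consistent with each other.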
One last note is that I unload the module between each run and I do not
have both modules loaded at the same time. If you are exhausting your
memory pool this could maybe impact performance?
So things to check:
- Do you prepare the drive?
- Do you release drive memory after the test? (You can unload the module.)
- Use the lfsr random number generator
- Do not use the randommap
I am not sure what hardware you are running on, but the throughput
numbers you obtain seem _really_ low.
Best regards
Andreas Hindborg
Hi Benno,
Benno Lossin <[email protected]> writes:
<...>
>> diff --git a/rust/kernel/block/mq/gen_disk.rs b/rust/kernel/block/mq/gen_disk.rs
>> new file mode 100644
>> index 000000000000..50496af15bbf
>> --- /dev/null
>> +++ b/rust/kernel/block/mq/gen_disk.rs
>> @@ -0,0 +1,133 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +
>> +//! GenDisk abstraction
>> +//!
>> +//! C header: [`include/linux/blkdev.h`](../../include/linux/blkdev.h)
>> +//! C header: [`include/linux/blk_mq.h`](../../include/linux/blk_mq.h)
>> +
>> +use crate::block::mq::{raw_writer::RawWriter, Operations, TagSet};
>> +use crate::{
>> + bindings, error::from_err_ptr, error::Result, sync::Arc, types::ForeignOwnable,
>> + types::ScopeGuard,
>> +};
>> +use core::fmt::{self, Write};
>> +
>> +/// A generic block device
>> +///
>> +/// # Invariants
>> +///
>> +/// - `gendisk` must always point to an initialized and valid `struct gendisk`.
>> +pub struct GenDisk<T: Operations> {
>> + _tagset: Arc<TagSet<T>>,
>> + gendisk: *mut bindings::gendisk,
>
> Why are these two fields not embedded? Shouldn't the user decide where
> to allocate?
The `TagSet` can be shared between multiple `GenDisk`. Using an `Arc`
seems reasonable?
For the `gendisk` field, the allocation is done by C and the address
must be stable. We are owning the pointee and must drop it when it goes out
of scope. I could do this:
    #[repr(transparent)]
    struct GenDisk(Opaque<bindings::gendisk>);

    struct UniqueGenDiskRef {
        _tagset: Arc<TagSet<T>>,
        gendisk: Pin<&'static mut GenDisk>,
    }

but it seems pointless. `struct GenDisk` would not be pub in that case. What do you think?
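To illustrate the sharing argument, here is a userspace sketch with stand-in types. `TagSet`, `Disk` and `owners` below are hypothetical and only mimic the ownership shape, not the kernel API:

```rust
use std::sync::Arc;

/// Stand-in for the blk-mq tag set (hypothetical, userspace only).
struct TagSet {
    nr_hw_queues: u32,
}

/// Stand-in for a disk that keeps its tag set alive for as long as it exists.
struct Disk {
    name: String,
    tagset: Arc<TagSet>,
}

impl Disk {
    fn new(name: &str, tagset: Arc<TagSet>) -> Self {
        Self {
            name: name.to_string(),
            tagset,
        }
    }
}

/// Number of owners currently sharing the tag set.
fn owners(tagset: &Arc<TagSet>) -> usize {
    Arc::strong_count(tagset)
}
```

Each disk holds one reference, so the tag set is only freed after the last disk using it has been dropped. That is the property I want from the `Arc` in `GenDisk`.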
>
>> +}
>> +
>> +// SAFETY: `GenDisk` is an owned pointer to a `struct gendisk` and an `Arc` to a
>> +// `TagSet` It is safe to send this to other threads as long as T is Send.
>> +unsafe impl<T: Operations + Send> Send for GenDisk<T> {}
>> +
>> +impl<T: Operations> GenDisk<T> {
>> + /// Try to create a new `GenDisk`
>> + pub fn try_new(tagset: Arc<TagSet<T>>, queue_data: T::QueueData) -> Result<Self> {
>> + let data = queue_data.into_foreign();
>> + let recover_data = ScopeGuard::new(|| {
>> + // SAFETY: T::QueueData was created by the call to `into_foreign()` above
>> + unsafe { T::QueueData::from_foreign(data) };
>> + });
>> +
>> + let lock_class_key = crate::sync::LockClassKey::new();
>> +
>> + // SAFETY: `tagset.raw_tag_set()` points to a valid and initialized tag set
>> + let gendisk = from_err_ptr(unsafe {
>> + bindings::__blk_mq_alloc_disk(tagset.raw_tag_set(), data as _, lock_class_key.as_ptr())
>
> Avoid `as _` casts.
👍
>
>> + })?;
>> +
>> + const TABLE: bindings::block_device_operations = bindings::block_device_operations {
>> + submit_bio: None,
>> + open: None,
>> + release: None,
>> + ioctl: None,
>> + compat_ioctl: None,
>> + check_events: None,
>> + unlock_native_capacity: None,
>> + getgeo: None,
>> + set_read_only: None,
>> + swap_slot_free_notify: None,
>> + report_zones: None,
>> + devnode: None,
>> + alternative_gpt_sector: None,
>> + get_unique_id: None,
>> + owner: core::ptr::null_mut(),
>> + pr_ops: core::ptr::null_mut(),
>> + free_disk: None,
>> + poll_bio: None,
>> + };
>> +
>> + // SAFETY: gendisk is a valid pointer as we initialized it above
>> + unsafe { (*gendisk).fops = &TABLE };
>> +
>> + recover_data.dismiss();
>> + Ok(Self {
>> + _tagset: tagset,
>> + gendisk,
>> + })
>> + }
>> +
>> + /// Set the name of the device
>> + pub fn set_name(&self, args: fmt::Arguments<'_>) -> Result {
>> + let mut raw_writer = RawWriter::from_array(unsafe { &mut (*self.gendisk).disk_name });
>
> Missing `SAFETY` also see below.
Yes, I have a few of those. Will add for next version.
>
>> + raw_writer.write_fmt(args)?;
>> + raw_writer.write_char('\0')?;
>> + Ok(())
>> + }
>> +
>> + /// Register the device with the kernel. When this function return, the
>> + /// device is accessible from VFS. The kernel may issue reads to the device
>> + /// during registration to discover partition infomation.
>> + pub fn add(&self) -> Result {
>> + crate::error::to_result(unsafe {
>> + bindings::device_add_disk(core::ptr::null_mut(), self.gendisk, core::ptr::null_mut())
>> + })
>> + }
>> +
>> + /// Call to tell the block layer the capcacity of the device
>> + pub fn set_capacity(&self, sectors: u64) {
>> + unsafe { bindings::set_capacity(self.gendisk, sectors) };
>> + }
>> +
>> + /// Set the logical block size of the device
>> + pub fn set_queue_logical_block_size(&self, size: u32) {
>> + unsafe { bindings::blk_queue_logical_block_size((*self.gendisk)queue, size) };
>> + }
>> +
>> + /// Set the physical block size of the device
>
> What does this *do*? I do not think the doc string gives any meaningful
> information not present in the function name (this might just be,
> because I have no idea of what this is and anyone with just a little
> more knowledge would know, but I still wanted to mention it).
I'll add some more context.
>
>> + pub fn set_queue_physical_block_size(&self, size: u32) {
>> + unsafe { bindings::blk_queue_physical_block_size((*self.gendisk).queue, size) };
>> + }
>> +
>> + /// Set the rotational media attribute for the device
>> + pub fn set_rotational(&self, rotational: bool) {
>> + if !rotational {
>> + unsafe {
>> + bindings::blk_queue_flag_set(bindings::QUEUE_FLAG_NONROT, (*self.gendisk).queue)
>> + };
>> + } else {
>> + unsafe {
>> + bindings::blk_queue_flag_clear(bindings::QUEUE_FLAG_NONROT, (*self.gendisk).queue)
>> + };
>> + }
>> + }
>> +}
>> +
>> +impl<T: Operations> Drop for GenDisk<T> {
>> + fn drop(&mut self) {
>> + let queue_data = unsafe { (*(*self.gendisk).queue).queuedata };
>> +
>> + unsafe { bindings::del_gendisk(self.gendisk) };
>> +
>> + // SAFETY: `queue.queuedata` was created by `GenDisk::try_new()` with a
>> + // call to `ForeignOwnable::into_pointer()` to create `queuedata`.
>> + // `ForeignOwnable::from_foreign()` is only called here.
>> + let _queue_data = unsafe { T::QueueData::from_foreign(queue_data) };
>> + }
>> +}
>> diff --git a/rust/kernel/block/mq/operations.rs b/rust/kernel/block/mq/operations.rs
>> new file mode 100644
>> index 000000000000..fb1ab707d1f0
>> --- /dev/null
>> +++ b/rust/kernel/block/mq/operations.rs
>> @@ -0,0 +1,260 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +
>> +//! This module provides an interface for blk-mq drivers to implement.
>> +//!
>> +//! C header: [`include/linux/blk-mq.h`](../../include/linux/blk-mq.h)
>> +
>> +use crate::{
>> + bindings,
>> + block::mq::{tag_set::TagSetRef, Request},
>> + error::{from_result, Result},
>> + types::ForeignOwnable,
>> +};
>> +use core::{marker::PhantomData, pin::Pin};
>> +
>> +/// Implement this trait to interface blk-mq as block devices
>> +#[macros::vtable]
>> +pub trait Operations: Sized {
>
> Is this trait really safe? Are there **no** requirements for e.g.
> `QueueData`? So could I use `Box<()>`?
Yes, it is intended to be safe. `ForeignOwnable` covers safety
requirements for these associated data types.
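As a rough userspace sketch of the ownership round trip this relies on (the trait name `ForeignOwnableSketch` is hypothetical and heavily simplified; the kernel trait also has borrowing methods that are omitted here):

```rust
use std::ffi::c_void;

/// Hypothetical, simplified stand-in for the kernel's `ForeignOwnable` trait.
trait ForeignOwnableSketch: Sized {
    /// Hand ownership to foreign code as an untyped pointer.
    fn into_foreign(self) -> *const c_void;

    /// Take ownership back from foreign code.
    ///
    /// # Safety
    ///
    /// `ptr` must come from `into_foreign()` and must not be used again
    /// after this call.
    unsafe fn from_foreign(ptr: *const c_void) -> Self;
}

impl<T> ForeignOwnableSketch for Box<T> {
    fn into_foreign(self) -> *const c_void {
        Box::into_raw(self) as *const c_void
    }

    unsafe fn from_foreign(ptr: *const c_void) -> Self {
        // SAFETY: The caller guarantees `ptr` came from `into_foreign()`.
        unsafe { Box::from_raw(ptr as *mut T) }
    }
}
```

In this sketch the unsafe part sits in `from_foreign()`, not in the trait that names the associated type, so something like `Box<()>` round-trips without any extra requirement on the implementer.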
>
>> + /// Data associated with a request. This data is located next to the request
>> + /// structure.
>> + type RequestData;
>> +
>> + /// Data associated with the `struct request_queue` that is allocated for
>> + /// the `GenDisk` associated with this `Operations` implementation.
>> + type QueueData: ForeignOwnable;
>> +
>> + /// Data associated with a dispatch queue. This is stored as a pointer in
>> + /// `struct blk_mq_hw_ctx`.
>> + type HwData: ForeignOwnable;
>> +
>> + /// Data associated with a tag set. This is stored as a pointer in `struct
>> + /// blk_mq_tag_set`.
>> + type TagSetData: ForeignOwnable;
>> +
>> + /// Called by the kernel to allocate a new `RequestData`. The structure will
>> + /// eventually be pinned, so defer initialization to `init_request_data()`
>> + fn new_request_data(
>> + _tagset_data: <Self::TagSetData as ForeignOwnable>::Borrowed<'_>,
>> + ) -> Result<Self::RequestData>;
>> +
>> + /// Called by the kernel to initialize a previously allocated `RequestData`
>> + fn init_request_data(
>> + _tagset_data: <Self::TagSetData as ForeignOwnable>::Borrowed<'_>,
>> + _data: Pin<&mut Self::RequestData>,
>> + ) -> Result {
>> + Ok(())
>> + }
>> +
>> + /// Called by the kernel to queue a request with the driver. If `is_last` is
>> + /// `false`, the driver is allowed to defer commiting the request.
>> + fn queue_rq(
>> + hw_data: <Self::HwData as ForeignOwnable>::Borrowed<'_>,
>> + queue_data: <Self::QueueData as ForeignOwnable>::Borrowed<'_>,
>> + rq: &Request<Self>,
>> + is_last: bool,
>> + ) -> Result;
>> +
>> + /// Called by the kernel to indicate that queued requests should be submitted
>> + fn commit_rqs(
>> + hw_data: <Self::HwData as ForeignOwnable>::Borrowed<'_>,
>> + queue_data: <Self::QueueData as ForeignOwnable>::Borrowed<'_>,
>> + );
>> +
>> + /// Called by the kernel when the request is completed
>> + fn complete(_rq: &Request<Self>);
>> +
>> + /// Called by the kernel to allocate and initialize a driver specific hardware context data
>> + fn init_hctx(
>> + tagset_data: <Self::TagSetData as ForeignOwnable>::Borrowed<'_>,
>> + hctx_idx: u32,
>> + ) -> Result<Self::HwData>;
>> +
>> + /// Called by the kernel to poll the device for completed requests. Only used for poll queues.
>> + fn poll(_hw_data: <Self::HwData as ForeignOwnable>::Borrowed<'_>) -> i32 {
>> + unreachable!()
>
> Why are these implemented this way? Should this really panic? Maybe
> return an error? Why `i32` as the return type? If it can error it should
> be `Result<u32>`.
I will update in accordance with the new documentation for `#[vtable]`.
Return type should be `bool`, I will change it. It inherited the int
from `core::ffi::c_int`.
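Roughly what I have in mind, as a userspace sketch. The names `Ops`, `OpsTable` and `build_table` are made up for illustration, and the `HAS_POLL` flag is written by hand here, whereas in the kernel the `#[vtable]` macro generates such flags:

```rust
/// Hypothetical operations trait with one optional method.
trait Ops {
    /// Set to `true` by implementers that provide `poll()`.
    const HAS_POLL: bool = false;

    /// Returns `true` if any request was completed.
    fn poll() -> bool {
        unreachable!("poll() is only installed when HAS_POLL is true")
    }
}

/// Stand-in for a C-style operations table with a nullable callback slot.
struct OpsTable {
    poll: Option<fn() -> bool>,
}

fn build_table<T: Ops>() -> OpsTable {
    OpsTable {
        // Leave the slot empty unless the implementer opted in.
        poll: if T::HAS_POLL {
            Some(T::poll as fn() -> bool)
        } else {
            None
        },
    }
}

/// Driver without poll queues: the default body is never reachable.
struct NoPoll;
impl Ops for NoPoll {}

/// Driver with poll queues.
struct WithPoll;
impl Ops for WithPoll {
    const HAS_POLL: bool = true;

    fn poll() -> bool {
        true
    }
}
```

Since the callback slot stays empty for drivers that do not opt in, the default body is never called through the table.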
>
>> + }
>> +
>> + /// Called by the kernel to map submission queues to CPU cores.
>> + fn map_queues(_tag_set: &TagSetRef) {
>> + unreachable!()
>> + }
>> +
>> + // There is no need for exit_request() because `drop` will be called.
>> +}
>> +
>> +pub(crate) struct OperationsVtable<T: Operations>(PhantomData<T>);
>> +
>> +impl<T: Operations> OperationsVtable<T> {
>> + // # Safety
>> + //
>> + // The caller of this function must ensure that `hctx` and `bd` are valid
>> + // and initialized. The pointees must outlive this function. Further
>> + // `hctx->driver_data` must be a pointer created by a call to
>> + // `Self::init_hctx_callback()` and the pointee must outlive this function.
>> + // This function must not be called with a `hctx` for which
>> + // `Self::exit_hctx_callback()` has been called.
>> + unsafe extern "C" fn queue_rq_callback(
>> + hctx: *mut bindings::blk_mq_hw_ctx,
>> + bd: *const bindings::blk_mq_queue_data,
>> + ) -> bindings::blk_status_t {
>> + // SAFETY: `bd` is valid as required by the safety requirement for this function.
>> + let rq = unsafe { (*bd).rq };
>> +
>> + // SAFETY: The safety requirement for this function ensure that
>> + // `(*hctx).driver_data` was returned by a call to
>> + // `Self::init_hctx_callback()`. That function uses
>> + // `PointerWrapper::into_pointer()` to create `driver_data`. Further,
>> + // the returned value does not outlive this function and
>> + // `from_foreign()` is not called until `Self::exit_hctx_callback()` is
>> + // called. By the safety requirement of this function and contract with
>> + // the `blk-mq` API, `queue_rq_callback()` will not be called after that
>> + // point.
>
> This safety section and the others here are rather long and mostly
> repeat themselves. Is it possible to put this in its own module and
> explain the safety invariants in that module and then in these safety
> sections just refer to some labels from that section?
>
> I think we should discuss this in our next meeting.
Not sure about the best way to do this. Let's talk.
<...>
>> +
>> + pub(crate) const unsafe fn build() -> &'static bindings::blk_mq_ops {
>> + &Self::VTABLE
>> + }
>
> Why is this function `unsafe`?
I don't think it needs to be unsafe, thanks.
>
>> +}
>
> Some `# Safety` and `SAFETY` missing in this hunk.
>
>> diff --git a/rust/kernel/block/mq/raw_writer.rs b/rust/kernel/block/mq/raw_writer.rs
>> new file mode 100644
>> index 000000000000..25c16ee0b1f7
>> --- /dev/null
>> +++ b/rust/kernel/block/mq/raw_writer.rs
>> @@ -0,0 +1,30 @@
>> +use core::fmt::{self, Write};
>> +
>> +pub(crate) struct RawWriter {
>> + ptr: *mut u8,
>> + len: usize,
>> +}
>> +
>> +impl RawWriter {
>> + unsafe fn new(ptr: *mut u8, len: usize) -> Self {
>> + Self { ptr, len }
>> + }
>> +
>> + pub(crate) fn from_array<const N: usize>(a: &mut [core::ffi::c_char; N]) -> Self {
>> + unsafe { Self::new(&mut a[0] as *mut _ as _, N) }
>> + }
>
> This function needs to be `unsafe`, because it never captures the
> lifetime of `a`. I can write:
> let mut a = Box::new([0; 10]);
> let mut writer = RawWriter::from_array(&mut *a);
> drop(a);
> writer.write_str("Abc"); // UAF
> Alternatively add a lifetime to `RawWriter`.
Yes, a lifetime is missing in RawWriter, thanks.
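Something along these lines, as a userspace sketch only. It uses a `u8` buffer and a safe slice instead of the raw pointer and `c_char` array in the patch, so it is not the code I will send:

```rust
use std::fmt::{self, Write};

/// A writer over a borrowed byte buffer. The lifetime `'a` ties the writer
/// to the buffer, so the buffer cannot be dropped while the writer is alive.
struct RawWriter<'a> {
    buf: &'a mut [u8],
    pos: usize,
}

impl<'a> RawWriter<'a> {
    fn from_array<const N: usize>(a: &'a mut [u8; N]) -> Self {
        Self { buf: a, pos: 0 }
    }
}

impl<'a> Write for RawWriter<'a> {
    fn write_str(&mut self, s: &str) -> fmt::Result {
        let bytes = s.as_bytes();
        let end = self.pos.checked_add(bytes.len()).ok_or(fmt::Error)?;
        // Reject writes that would not fit in the remaining space.
        let dst = self.buf.get_mut(self.pos..end).ok_or(fmt::Error)?;
        dst.copy_from_slice(bytes);
        self.pos = end;
        Ok(())
    }
}
```

With the lifetime in place, your `drop(a)` example no longer compiles because `a` is still mutably borrowed by the writer. As a side effect, writing an empty string is also fine here, since nothing indexes `bytes[0]`.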
>
>> +}
>> +
>> +impl Write for RawWriter {
>> + fn write_str(&mut self, s: &str) -> fmt::Result {
>> + let bytes = s.as_bytes();
>> + let len = bytes.len();
>> + if len > self.len {
>> + return Err(fmt::Error);
>> + }
>> + unsafe { core::ptr::copy_nonoverlapping(&bytes[0], self.ptr, len) };
>> + self.ptr = unsafe { self.ptr.add(len) };
>> + self.len -= len;
>> + Ok(())
>> + }
>> +}
>> diff --git a/rust/kernel/block/mq/request.rs b/rust/kernel/block/mq/request.rs
>> new file mode 100644
>> index 000000000000..e95ae3fd71ad
>> --- /dev/null
>> +++ b/rust/kernel/block/mq/request.rs
>> @@ -0,0 +1,71 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +
>> +//! This module provides a wrapper for the C `struct request` type.
>> +//!
>> +//! C header: [`include/linux/blk-mq.h`](../../include/linux/blk-mq.h)
>> +
>> +use crate::{
>> + bindings,
>> + block::mq::Operations,
>> + error::{Error, Result},
>> +};
>> +use core::marker::PhantomData;
>> +
>> +/// A wrapper around a blk-mq `struct request`. This represents an IO request.
>> +pub struct Request<T: Operations> {
>> + ptr: *mut bindings::request,
>
> Why is this not embedded?
I have changed it to `struct Request(Opaque<bindings::request>)` for
next version.
>
>> + _p: PhantomData<T>,
>> +}
>> +
>> +impl<T: Operations> Request<T> {
>> + pub(crate) unsafe fn from_ptr(ptr: *mut bindings::request) -> Self {
>> + Self {
>> + ptr,
>> + _p: PhantomData,
>> + }
>> + }
>> +
>> + /// Get the command identifier for the request
>> + pub fn command(&self) -> u32 {
>> + unsafe { (*self.ptr).cmd_flags & ((1 << bindings::REQ_OP_BITS) - 1) }
>> + }
>> +
>> + /// Call this to indicate to the kernel that the request has been issued by the driver
>> + pub fn start(&self) {
>> + unsafe { bindings::blk_mq_start_request(self.ptr) };
>> + }
>> +
>> + /// Call this to indicate to the kernel that the request has been completed without errors
>> + // TODO: Consume rq so that we can't use it after ending it?
>> + pub fn end_ok(&self) {
>> + unsafe { bindings::blk_mq_end_request(self.ptr, bindings::BLK_STS_OK as _) };
>> + }
>> +
>> + /// Call this to indicate to the kernel that the request completed with an error
>> + pub fn end_err(&self, err: Error) {
>> + unsafe { bindings::blk_mq_end_request(self.ptr, err.to_blk_status()) };
>> + }
>> +
>> + /// Call this to indicate that the request completed with the status indicated by `status`
>> + pub fn end(&self, status: Result) {
>> + if let Err(e) = status {
>> + self.end_err(e);
>> + } else {
>> + self.end_ok();
>> + }
>> + }
>> +
>> + /// Call this to schedule defered completion of the request
>> + // TODO: Consume rq so that we can't use it after completing it?
>> + pub fn complete(&self) {
>> + if !unsafe { bindings::blk_mq_complete_request_remote(self.ptr) } {
>> + T::complete(&unsafe { Self::from_ptr(self.ptr) });
>> + }
>> + }
>> +
>> + /// Get the target sector for the request
>> + #[inline(always)]
>
> Why is this `inline(always)`?
The compiler would not inline from the kernel crate into modules without
this. I will check if this is still the case.
>
>> + pub fn sector(&self) -> usize {
>> + unsafe { (*self.ptr).__sector as usize }
>> + }
>> +}
>> diff --git a/rust/kernel/block/mq/tag_set.rs b/rust/kernel/block/mq/tag_set.rs
>> new file mode 100644
>> index 000000000000..d122db7f6d0e
>> --- /dev/null
>> +++ b/rust/kernel/block/mq/tag_set.rs
>> @@ -0,0 +1,92 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +
>> +//! This module provides the `TagSet` struct to wrap the C `struct blk_mq_tag_set`.
>> +//!
>> +//! C header: [`include/linux/blk-mq.h`](../../include/linux/blk-mq.h)
>> +
>> +use crate::{
>> + bindings,
>> + block::mq::{operations::OperationsVtable, Operations},
>> + error::{Error, Result},
>> + sync::Arc,
>> + types::ForeignOwnable,
>> +};
>> +use core::{cell::UnsafeCell, convert::TryInto, marker::PhantomData};
>> +
>> +/// A wrapper for the C `struct blk_mq_tag_set`
>> +pub struct TagSet<T: Operations> {
>> + inner: UnsafeCell<bindings::blk_mq_tag_set>,
>> + _p: PhantomData<T>,
>> +}
>> +
>> +impl<T: Operations> TagSet<T> {
>> + /// Try to create a new tag set
>> + pub fn try_new(
>> + nr_hw_queues: u32,
>> + tagset_data: T::TagSetData,
>> + num_tags: u32,
>> + num_maps: u32,
>> + ) -> Result<Arc<Self>> {
>
> Why force the users to use `Arc`?
Changed to return a `PinInit<TagSet>` for next version.
>
>> + let tagset = Arc::try_new(Self {
>> + inner: UnsafeCell::new(bindings::blk_mq_tag_set::default()),
>> + _p: PhantomData,
>> + })?;
>> +
>> + // SAFETY: We just allocated `tagset`, we know this is the only reference to it.
>> + let inner = unsafe { &mut *tagset.inner.get() };
>> +
>> + inner.ops = unsafe { OperationsVtable::<T>::build() };
>> + inner.nr_hw_queues = nr_hw_queues;
>> + inner.timeout = 0; // 0 means default which is 30 * HZ in C
>> + inner.numa_node = bindings::NUMA_NO_NODE;
>> + inner.queue_depth = num_tags;
>> + inner.cmd_size = core::mem::size_of::<T::RequestData>().try_into()?;
>> + inner.flags = bindings::BLK_MQ_F_SHOULD_MERGE;
>> + inner.driver_data = tagset_data.into_foreign() as _;
>> + inner.nr_maps = num_maps;
>> +
>> + // SAFETY: `inner` points to valid and initialised memory.
>> + let ret = unsafe { bindings::blk_mq_alloc_tag_set(inner) };
>> + if ret < 0 {
>> + // SAFETY: We created `driver_data` above with `into_foreign`
>> + unsafe { T::TagSetData::from_foreign(inner.driver_data) };
>> + return Err(Error::from_errno(ret));
>> + }
>> +
>> + Ok(tagset)
>> + }
>> +
>> + /// Return the pointer to the wrapped `struct blk_mq_tag_set`
>> + pub(crate) fn raw_tag_set(&self) -> *mut bindings::blk_mq_tag_set {
>> + self.inner.get()
>> + }
>> +}
>> +
>> +impl<T: Operations> Drop for TagSet<T> {
>> + fn drop(&mut self) {
>> + let tagset_data = unsafe { (*self.inner.get()).driver_data };
>> +
>> + // SAFETY: `inner` is valid and has been properly initialised during construction.
>> + unsafe { bindings::blk_mq_free_tag_set(self.inner.get()) };
>> +
>> + // SAFETY: `tagset_data` was created by a call to
>> + // `ForeignOwnable::into_foreign` in `TagSet::try_new()`
>> + unsafe { T::TagSetData::from_foreign(tagset_data) };
>> + }
>> +}
>> +
>> +/// A tag set reference. Used to control lifetime and prevent drop of TagSet references passed to
>> +/// `Operations::map_queues()`
>> +pub struct TagSetRef {
>> + ptr: *mut bindings::blk_mq_tag_set,
>> +}
>> +
>> +impl TagSetRef {
>> + pub(crate) unsafe fn from_ptr(tagset: *mut bindings::blk_mq_tag_set) -> Self {
>> + Self { ptr: tagset }
>> + }
>> +
>> + pub fn ptr(&self) -> *mut bindings::blk_mq_tag_set {
>> + self.ptr
>> + }
>> +}
>
> This is a **very** thin abstraction, why is it needed?
It is not. I changed it to `&TagSet`, thanks.
Thanks for the comments!
Best regards,
Andreas
Sergio González Collado <[email protected]> writes:
> + /// Call to tell the block layer the capcacity of the device
> + pub fn set_capacity(&self, sectors: u64) {
> + unsafe { bindings::set_capacity(self.gendisk, sectors) };
> + }
>
> Nit in the comment: capcacity -> capacity
Thanks!
BR Andreas
Hi Andreas,
just so you know, I received this email today, so it was very late,
since the send date is January 12.
On 12.01.24 10:18, Andreas Hindborg (Samsung) wrote:
>
> Hi Benno,
>
> Benno Lossin <[email protected]> writes:
>
> <...>
>
>>> diff --git a/rust/kernel/block/mq/gen_disk.rs b/rust/kernel/block/mq/gen_disk.rs
>>> new file mode 100644
>>> index 000000000000..50496af15bbf
>>> --- /dev/null
>>> +++ b/rust/kernel/block/mq/gen_disk.rs
>>> @@ -0,0 +1,133 @@
>>> +// SPDX-License-Identifier: GPL-2.0
>>> +
>>> +//! GenDisk abstraction
>>> +//!
>>> +//! C header: [`include/linux/blkdev.h`](../../include/linux/blkdev.h)
>>> +//! C header: [`include/linux/blk_mq.h`](../../include/linux/blk_mq.h)
>>> +
>>> +use crate::block::mq::{raw_writer::RawWriter, Operations, TagSet};
>>> +use crate::{
>>> + bindings, error::from_err_ptr, error::Result, sync::Arc, types::ForeignOwnable,
>>> + types::ScopeGuard,
>>> +};
>>> +use core::fmt::{self, Write};
>>> +
>>> +/// A generic block device
>>> +///
>>> +/// # Invariants
>>> +///
>>> +/// - `gendisk` must always point to an initialized and valid `struct gendisk`.
>>> +pub struct GenDisk<T: Operations> {
>>> + _tagset: Arc<TagSet<T>>,
>>> + gendisk: *mut bindings::gendisk,
>>
>> Why are these two fields not embedded? Shouldn't the user decide where
>> to allocate?
>
> The `TagSet` can be shared between multiple `GenDisk`. Using an `Arc`
> seems reasonable?
>
> For the `gendisk` field, the allocation is done by C and the address
> must be stable. We are owning the pointee and must drop it when it goes out
> of scope. I could do this:
>
> #[repr(transparent)]
> struct GenDisk(Opaque<bindings::gendisk>);
>
> struct UniqueGenDiskRef {
> _tagset: Arc<TagSet<T>>,
> gendisk: Pin<&'static mut GenDisk>,
>
> }
>
> but it seems pointless. `struct GenDisk` would not be pub in that case. What do you think?
Hmm, I am a bit confused as to how you usually use a `struct gendisk`.
You said that a `TagSet` might be shared between multiple `GenDisk`s,
but that is not facilitated by the C side?
Is it the case that on the C side you create a struct containing a
tagset and a gendisk for every block device you want to represent?
And you decided for the Rust abstractions that you want to have only a
single generic struct for any block device, distinguished by the generic
parameter?
I think these kinds of details would be nice to know. Not only for
reviewers, but also for veterans of the C APIs.
--
Cheers,
Benno
Benno Lossin <[email protected]> writes:
> Hi Andreas,
>
> just so you know, I received this email today, so it was very late,
> since the send date is January 12.
My mistake. I started drafting Jan 12, but did not get time to finish
the mail until today. I guess that is how mu4e does things, I should be
aware and fix up the date. Thanks for letting me know.
>
> On 12.01.24 10:18, Andreas Hindborg (Samsung) wrote:
>>
>> Hi Benno,
>>
>> Benno Lossin <[email protected]> writes:
>>
>> <...>
>>
>>>> diff --git a/rust/kernel/block/mq/gen_disk.rs b/rust/kernel/block/mq/gen_disk.rs
>>>> new file mode 100644
>>>> index 000000000000..50496af15bbf
>>>> --- /dev/null
>>>> +++ b/rust/kernel/block/mq/gen_disk.rs
>>>> @@ -0,0 +1,133 @@
>>>> +// SPDX-License-Identifier: GPL-2.0
>>>> +
>>>> +//! GenDisk abstraction
>>>> +//!
>>>> +//! C header: [`include/linux/blkdev.h`](../../include/linux/blkdev.h)
>>>> +//! C header: [`include/linux/blk_mq.h`](../../include/linux/blk_mq.h)
>>>> +
>>>> +use crate::block::mq::{raw_writer::RawWriter, Operations, TagSet};
>>>> +use crate::{
>>>> + bindings, error::from_err_ptr, error::Result, sync::Arc, types::ForeignOwnable,
>>>> + types::ScopeGuard,
>>>> +};
>>>> +use core::fmt::{self, Write};
>>>> +
>>>> +/// A generic block device
>>>> +///
>>>> +/// # Invariants
>>>> +///
>>>> +/// - `gendisk` must always point to an initialized and valid `struct gendisk`.
>>>> +pub struct GenDisk<T: Operations> {
>>>> + _tagset: Arc<TagSet<T>>,
>>>> + gendisk: *mut bindings::gendisk,
>>>
>>> Why are these two fields not embedded? Shouldn't the user decide where
>>> to allocate?
>>
>> The `TagSet` can be shared between multiple `GenDisk`. Using an `Arc`
>> seems reasonable?
>>
>> For the `gendisk` field, the allocation is done by C and the address
>> must be stable. We are owning the pointee and must drop it when it goes out
>> of scope. I could do this:
>>
>> #[repr(transparent)]
>> struct GenDisk(Opaque<bindings::gendisk>);
>>
>> struct UniqueGenDiskRef {
>> _tagset: Arc<TagSet<T>>,
>> gendisk: Pin<&'static mut GenDisk>,
>>
>> }
>>
>> but it seems pointless. `struct GenDisk` would not be pub in that case. What do you think?
>
> Hmm, I am a bit confused as to how you usually use a `struct gendisk`.
> You said that a `TagSet` might be shared between multiple `GenDisk`s,
> but that is not facilitated by the C side?
>
> Is it the case that on the C side you create a struct containing a
> tagset and a gendisk for every block device you want to represent?
Yes, but the `struct tag_set` can be shared between multiple `struct
gendisk`.
Let me try to elaborate:
In C you would first allocate a `struct tag_set` (`struct
blk_mq_tag_set`). The allocation can be dynamic, static or part of an
existing allocation. You would then partially initialize the structure
and finish the initialization by calling `blk_mq_alloc_tag_set()`. This
populates the rest of the structure, which includes more dynamic
allocations.
You then allocate a `struct gendisk` by calling `blk_mq_alloc_disk()`,
passing in a pointer to the `struct tag_set` you just created. This
function will return a pointer to a `struct gendisk` on success.
In the Rust abstractions, we allocate the `TagSet`:
    #[pin_data(PinnedDrop)]
    #[repr(transparent)]
    pub struct TagSet<T: Operations> {
        #[pin]
        inner: Opaque<bindings::blk_mq_tag_set>,
        _p: PhantomData<T>,
    }
with `PinInit` [^1]. The initializer will partially initialize the struct and
finish the initialization like C does by calling
`blk_mq_alloc_tag_set()`. We now need a place to point the initializer.
`Arc::pin_init()` is that place for now. It allows us to pass the
`TagSet` reference to multiple `GenDisk` if required. Maybe we could be
generic over `Deref<Target = TagSet>` in the future. Bottom line is that we need
to hold on to that `TagSet` reference until the `GenDisk` is dropped.
`struct tag_set` is not reference counted on the C side. C
implementations just take care to keep it alive, for instance by storing
it next to a pointer to `struct gendisk` that it is servicing.
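To make the ownership argument above concrete, here is a minimal userspace sketch. It is not the kernel abstraction: `TagSet` and `GenDisk` are hypothetical stand-ins, `std::sync::Arc` replaces the kernel `Arc`, and a counter replaces the call to `blk_mq_free_tag_set()`. It only shows that the tag set is freed when the last disk holding a reference goes away.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for the tag set; counts how often it is freed.
struct TagSet {
    freed: Arc<AtomicUsize>,
}

impl Drop for TagSet {
    fn drop(&mut self) {
        // The real abstraction would call `blk_mq_free_tag_set()` here.
        self.freed.fetch_add(1, Ordering::SeqCst);
    }
}

/// Hypothetical stand-in for a disk that is serviced by a tag set.
struct GenDisk {
    _tagset: Arc<TagSet>,
}

/// Creates one tag set shared by two disks, drops the owners one by one and
/// returns how many times the tag set had been freed after each step.
fn shared_tagset_demo() -> (usize, usize, usize) {
    let freed = Arc::new(AtomicUsize::new(0));
    let tagset = Arc::new(TagSet { freed: Arc::clone(&freed) });

    let disk_a = GenDisk { _tagset: Arc::clone(&tagset) };
    let disk_b = GenDisk { _tagset: Arc::clone(&tagset) };

    // The driver's own handle goes away first; both disks still hold a reference.
    drop(tagset);
    let after_handle = freed.load(Ordering::SeqCst);

    drop(disk_a);
    let after_first_disk = freed.load(Ordering::SeqCst);

    // Last owner: only now does `TagSet::drop` run.
    drop(disk_b);
    let after_last_disk = freed.load(Ordering::SeqCst);

    (after_handle, after_first_disk, after_last_disk)
}

fn main() {
    println!("{:?}", shared_tagset_demo());
}
```

The C side gets the same effect by convention only, which is why the Rust side has to encode it in the type.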
> And you decided for the Rust abstractions that you want to have only a
> single generic struct for any block device, distinguished by the generic
> parameter?
Yes, we have a single generic struct (`GenDisk`) representing the C
`struct gendisk`, and a single generic struct (`TagSet`) representing
the C `struct tag_set`. These are both generic over `T: Operations`.
`Operations` represents a C vtable (`struct blk_mq_ops`) attached to the
`struct tag_set`. This vtable is provided by the driver and holds
function pointers that allow the kernel to perform actions such as queuing
IO requests with the driver. A C driver can instantiate multiple `struct
gendisk` and service them with the same `struct tag_set` and thereby the
same vtable. Or it can use separate tag sets and the same vtable. Or a
separate tag_set and vtable for each gendisk.
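As a rough illustration of how a trait can stand in for a C vtable, consider the sketch below. Everything in it is an assumption made for the example: the `Operations` trait is reduced to a single hypothetical method, `BlkMqOps` is a one-entry stand-in for `struct blk_mq_ops`, and `OperationsVtable`, `NullDriver`, `EchoDriver` and `dispatch` are invented names. It shows one table of C ABI function pointers being generated per driver type.

```rust
use std::marker::PhantomData;

/// Hypothetical, heavily reduced driver operations trait.
trait Operations {
    fn queue_rq(request: u32) -> i32;
}

/// Stand-in for the C `struct blk_mq_ops`: a table of function pointers.
struct BlkMqOps {
    queue_rq: unsafe extern "C" fn(u32) -> i32,
}

/// Builds the C-compatible table for a given `T: Operations`.
struct OperationsVtable<T: Operations>(PhantomData<T>);

impl<T: Operations> OperationsVtable<T> {
    /// Trampoline with C ABI that forwards to the Rust trait method.
    unsafe extern "C" fn queue_rq_callback(request: u32) -> i32 {
        T::queue_rq(request)
    }

    /// One table per driver type; any number of tag sets can point at it.
    const VTABLE: BlkMqOps = BlkMqOps {
        queue_rq: Self::queue_rq_callback,
    };
}

/// A driver that completes every request successfully.
struct NullDriver;

impl Operations for NullDriver {
    fn queue_rq(_request: u32) -> i32 {
        0
    }
}

/// A second driver, to show that each type gets its own table.
struct EchoDriver;

impl Operations for EchoDriver {
    fn queue_rq(request: u32) -> i32 {
        request as i32
    }
}

/// Plays the role of the block layer: calls through the function pointer.
fn dispatch<T: Operations>(request: u32) -> i32 {
    let ops = &OperationsVtable::<T>::VTABLE;
    // SAFETY: The callback has no preconditions in this sketch.
    unsafe { (ops.queue_rq)(request) }
}

fn main() {
    println!("null: {}", dispatch::<NullDriver>(7));
    println!("echo: {}", dispatch::<EchoDriver>(7));
}
```

Making `TagSet` and `GenDisk` generic over `T` ties each of them to exactly one such table at compile time.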
> I think these kinds of details would be nice to know. Not only for
> reviewers, but also for veterans of the C APIs.
I should write some module level documentation clarifying the use of
these types. The null block driver is a simple example, but it is just
code. I will include more docs in the next version.
Best regards
Andreas
[^1]: This was not `PinInit` in the RFC, I changed this based on your
feedback. The main points are still the same though.
On 23.01.24 19:39, Andreas Hindborg (Samsung) wrote:
>>>>> +/// A generic block device
>>>>> +///
>>>>> +/// # Invariants
>>>>> +///
>>>>> +/// - `gendisk` must always point to an initialized and valid `struct gendisk`.
>>>>> +pub struct GenDisk<T: Operations> {
>>>>> + _tagset: Arc<TagSet<T>>,
>>>>> + gendisk: *mut bindings::gendisk,
>>>>
>>>> Why are these two fields not embedded? Shouldn't the user decide where
>>>> to allocate?
>>>
>>> The `TagSet` can be shared between multiple `GenDisk`. Using an `Arc`
>>> seems reasonable?
>>>
>>> For the `gendisk` field, the allocation is done by C and the address
>>> must be stable. We are owning the pointee and must drop it when it goes out
>>> of scope. I could do this:
>>>
>>> #[repr(transparent)]
>>> struct GenDisk(Opaque<bindings::gendisk>);
>>>
>>> struct UniqueGenDiskRef {
>>> _tagset: Arc<TagSet<T>>,
>>> gendisk: Pin<&'static mut GenDisk>,
>>>
>>> }
>>>
>>> but it seems pointless. `struct GenDisk` would not be pub in that case. What do you think?
>>
>> Hmm, I am a bit confused as to how you usually use a `struct gendisk`.
>> You said that a `TagSet` might be shared between multiple `GenDisk`s,
>> but that is not facilitated by the C side?
>>
>> Is it the case that on the C side you create a struct containing a
>> tagset and a gendisk for every block device you want to represent?
>
> Yes, but the `struct tag_set` can be shared between multiple `struct
> gendisk`.
>
> Let me try to elaborate:
>
> In C you would first allocate a `struct tag_set` and partially
> initialize it. The allocation can be dynamic, static or part of existing
> allocation. You would then partially initialize the structure and finish
> the initialization by calling `blk_mq_alloc_tag_set()`. This populates
> the rest of the structure which includes more dynamic allocations.
>
> You then allocate a `struct gendisk` by calling `blk_mq_alloc_disk()`,
> passing in a pointer to the `struct tag_set` you just created. This
> function will return a pointer to a `struct gendisk` on success.
>
> In the Rust abstractions, we allocate the `TagSet`:
>
> #[pin_data(PinnedDrop)]
> #[repr(transparent)]
> pub struct TagSet<T: Operations> {
> #[pin]
> inner: Opaque<bindings::blk_mq_tag_set>,
> _p: PhantomData<T>,
> }
>
> with `PinInit` [^1]. The initializer will partially initialize the struct and
> finish the initialization like C does by calling
> `blk_mq_alloc_tag_set()`. We now need a place to point the initializer.
> `Arc::pin_init()` is that place for now. It allows us to pass the
> `TagSet` reference to multiple `GenDisk` if required. Maybe we could be
> generic over `Deref<TagSet>` in the future. Bottom line is that we need
> to hold on to that `TagSet` reference until the `GenDisk` is dropped.
I see, thanks for the elaborate explanation! I now think that using `Arc`
makes sense.
> `struct tag_set` is not reference counted on the C side. C
> implementations just take care to keep it alive, for instance by storing
> it next to a pointer to `struct gendisk` that it is servicing.
This is interesting, is this also done in the case where it is shared
among multiple `struct gendisk`s?
Does this have some deeper reason? Or am I right to assume that creating
`Gendisk`/`TagSet` is done rarely (i.e. only at initialization of the
driver)?
>> And you decided for the Rust abstractions that you want to have only a
>> single generic struct for any block device, distinguished by the generic
>> parameter?
>
> Yes, we have a single generic struct (`GenDisk`) representing the C
> `struct gendisk`, and a single generic struct (`TagSet`) representing
> the C `struct tag_set`. These are both generic over `T: Operations`.
> `Operations` represent a C vtable (`struct blk_mq_ops`) attached to the
> `struct tag_set`. This vtable is provided by the driver and holds
> function pointers that allow the kernel to perform actions such as queue
> IO requests with the driver. A C driver can instantiate multiple `struct
> gendisk` and service them with the same `struct tag_set` and thereby the
> same vtable. Or it can use separate tag sets and the same vtable. Or a
> separate tag_set and vtable for each gendisk.
>
>> I think these kinds of details would be nice to know. Not only for
>> reviewers, but also for veterans of the C APIs.
>
> I should write some module level documentation clarifying the use of
> these types. The null block driver is a simple example, but it is just
> code. I will include more docs in the next version.
Thanks a lot for explaining!
--
Cheers,
Benno
Benno Lossin <[email protected]> writes:
> On 23.01.24 19:39, Andreas Hindborg (Samsung) wrote:
>>>>>> +/// A generic block device
>>>>>> +///
>>>>>> +/// # Invariants
>>>>>> +///
>>>>>> +/// - `gendisk` must always point to an initialized and valid `struct gendisk`.
>>>>>> +pub struct GenDisk<T: Operations> {
>>>>>> + _tagset: Arc<TagSet<T>>,
>>>>>> + gendisk: *mut bindings::gendisk,
>>>>>
>>>>> Why are these two fields not embedded? Shouldn't the user decide where
>>>>> to allocate?
>>>>
>>>> The `TagSet` can be shared between multiple `GenDisk`. Using an `Arc`
>>>> seems reasonable?
>>>>
>>>> For the `gendisk` field, the allocation is done by C and the address
>>>> must be stable. We are owning the pointee and must drop it when it goes out
>>>> of scope. I could do this:
>>>>
>>>> #[repr(transparent)]
>>>> struct GenDisk(Opaque<bindings::gendisk>);
>>>>
>>>> struct UniqueGenDiskRef {
>>>> _tagset: Arc<TagSet<T>>,
>>>> gendisk: Pin<&'static mut GenDisk>,
>>>>
>>>> }
>>>>
>>>> but it seems pointless. `struct GenDisk` would not be pub in that case. What do you think?
>>>
>>> Hmm, I am a bit confused as to how you usually use a `struct gendisk`.
>>> You said that a `TagSet` might be shared between multiple `GenDisk`s,
>>> but that is not facilitated by the C side?
>>>
>>> Is it the case that on the C side you create a struct containing a
>>> tagset and a gendisk for every block device you want to represent?
>>
>> Yes, but the `struct tag_set` can be shared between multiple `struct
>> gendisk`.
>>
>> Let me try to elaborate:
>>
>> In C you would first allocate a `struct tag_set` and partially
>> initialize it. The allocation can be dynamic, static or part of existing
>> allocation. You would then partially initialize the structure and finish
>> the initialization by calling `blk_mq_alloc_tag_set()`. This populates
>> the rest of the structure which includes more dynamic allocations.
>>
>> You then allocate a `struct gendisk` by calling `blk_mq_alloc_disk()`,
>> passing in a pointer to the `struct tag_set` you just created. This
>> function will return a pointer to a `struct gendisk` on success.
>>
>> In the Rust abstractions, we allocate the `TagSet`:
>>
>> #[pin_data(PinnedDrop)]
>> #[repr(transparent)]
>> pub struct TagSet<T: Operations> {
>> #[pin]
>> inner: Opaque<bindings::blk_mq_tag_set>,
>> _p: PhantomData<T>,
>> }
>>
>> with `PinInit` [^1]. The initializer will partially initialize the struct and
>> finish the initialization like C does by calling
>> `blk_mq_alloc_tag_set()`. We now need a place to point the initializer.
>> `Arc::pin_init()` is that place for now. It allows us to pass the
>> `TagSet` reference to multiple `GenDisk` if required. Maybe we could be
>> generic over `Deref<TagSet>` in the future. Bottom line is that we need
>> to hold on to that `TagSet` reference until the `GenDisk` is dropped.
>
> I see, thanks for the elaborate explanation! I now think that using `Arc`
> makes sense.
>
>> `struct tag_set` is not reference counted on the C side. C
>> implementations just take care to keep it alive, for instance by storing
>> it next to a pointer to `struct gendisk` that it is servicing.
>
> This is interesting, is this also done in the case where it is shared
> among multiple `struct gendisk`s?
Yes. The architecture is really quite flexible. For instance, the C NVMe
driver uses one tag set for the admin queue of an NVMe controller, and
one tag set shared by all IO queues of a controller. The admin queue tag
set is not actually attached to a `struct gendisk` and does not appear
as a block device to the user. The IO queue tag set serves a number of
`struct gendisk`s, one for each namespace of the controller.
> Does this have some deeper reason? Or am I right to assume that creating
> `Gendisk`/`TagSet` is done rarely (i.e. only at initialization of the
> driver)?
I am not sure. It could probably be reference counted on the C side.
Perhaps nobody felt the need for it, since its lifetime is not that
complex. And yes, it is relatively rare, as you would only create or
destroy this structure as part of initialization and teardown.
Best regards,
Andreas
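For readers who prefer code, the NVMe-style topology described above could be modelled roughly as follows. This is a userspace sketch with hypothetical types (`TagSet`, `GenDisk`, `Controller`, `build_controller`), not code from the NVMe driver or from this series. It only captures who shares what: one admin tag set with no disk attached, and one IO tag set shared by one disk per namespace.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a tag set.
struct TagSet {
    name: &'static str,
}

/// Hypothetical stand-in for a disk; one exists per namespace.
struct GenDisk {
    namespace_id: u32,
    tagset: Arc<TagSet>,
}

/// Hypothetical controller that owns both tag sets and all namespace disks.
struct Controller {
    /// Services the admin queue; never attached to a disk.
    admin_tagset: Arc<TagSet>,
    /// Shared by every namespace disk of this controller.
    io_tagset: Arc<TagSet>,
    namespaces: Vec<GenDisk>,
}

fn build_controller(namespace_count: u32) -> Controller {
    let admin_tagset = Arc::new(TagSet { name: "admin" });
    let io_tagset = Arc::new(TagSet { name: "io" });

    // Every namespace gets its own disk, but all disks share the IO tag set.
    let namespaces = (1..=namespace_count)
        .map(|namespace_id| GenDisk {
            namespace_id,
            tagset: Arc::clone(&io_tagset),
        })
        .collect();

    Controller {
        admin_tagset,
        io_tagset,
        namespaces,
    }
}

fn main() {
    let controller = build_controller(3);
    for disk in &controller.namespaces {
        println!(
            "namespace {} uses tag set '{}'",
            disk.namespace_id, disk.tagset.name
        );
    }
    println!(
        "'{}' owners: {}, '{}' owners: {}",
        controller.admin_tagset.name,
        Arc::strong_count(&controller.admin_tagset),
        controller.io_tagset.name,
        Arc::strong_count(&controller.io_tagset),
    );
}
```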
Hi Alice,
Alice Ryhl <[email protected]> writes:
> On Wed, 3 May 2023 11:07:03 +0200, Andreas Hindborg <[email protected]> wrote:
>> The kernel `struct spinlock` is 4 bytes on x86 when lockdep is not enabled. The
>> structure is not padded to fit a cache line. The effect of this for `SpinLock`
>> is that the lock variable and the value protected by the lock will share a cache
>> line, depending on the alignment requirements of the protected value. Aligning
>> the lock variable and the protected value to a cache line yields a 20%
>> performance increase for the Rust null block driver for sequential reads to
>> memory backed devices at 6 concurrent readers.
>>
>> Signed-off-by: Andreas Hindborg <[email protected]>
>
> This applies the cacheline padding to all spinlocks unconditionally.
> It's not clear to me that we want to do that. Instead, I suggest using
> `SpinLock<CachePadded<T>>` in the null block driver to opt-in to the
> cache padding there, and let other drivers choose whether or not they
> want to cache pad their locks.
I was going to write that this is not going to work because the compiler
is going to reorder the fields of `Lock` and put the `data` field first,
followed by the `state` field. But I checked the layout, and it seems
that I actually get the `state` field first (with an alignment of 4), 60
bytes of padding, and then the `data` field (with alignment 64).
I am wondering why the compiler is not reordering these fields? Am I
guaranteed that the fields will not be reordered? Looking at the
definition of `Lock` there does not seem to be anything that prevents
rustc from swapping `state` and `data`.
>
> On Wed, 3 May 2023 11:07:03 +0200, Andreas Hindborg <[email protected]> wrote:
>> diff --git a/rust/kernel/cache_padded.rs b/rust/kernel/cache_padded.rs
>> new file mode 100644
>> index 000000000000..758678e71f50
>> --- /dev/null
>> +++ b/rust/kernel/cache_padded.rs
>>
>> +impl<T> CachePadded<T> {
>> + /// Pads and aligns a value to 64 bytes.
>> + #[inline(always)]
>> + pub(crate) const fn new(t: T) -> CachePadded<T> {
>> + CachePadded::<T> { value: t }
>> + }
>> +}
>
> Please make this `pub` instead of just `pub(crate)`. Other drivers might
> want to use this directly.
Alright.
>
> On Wed, 3 May 2023 11:07:03 +0200, Andreas Hindborg <[email protected]> wrote:
>> diff --git a/rust/kernel/sync/lock/spinlock.rs b/rust/kernel/sync/lock/spinlock.rs
>> index 979b56464a4e..e39142a8148c 100644
>> --- a/rust/kernel/sync/lock/spinlock.rs
>> +++ b/rust/kernel/sync/lock/spinlock.rs
>> @@ -100,18 +103,20 @@ unsafe impl super::Backend for SpinLockBackend {
>> ) {
>> // SAFETY: The safety requirements ensure that `ptr` is valid for writes, and `name` and
>> // `key` are valid for read indefinitely.
>> - unsafe { bindings::__spin_lock_init(ptr, name, key) }
>> + unsafe { bindings::__spin_lock_init((&mut *ptr).deref_mut(), name, key) }
>> }
>>
>> + #[inline(always)]
>> unsafe fn lock(ptr: *mut Self::State) -> Self::GuardState {
>> // SAFETY: The safety requirements of this function ensure that `ptr` points to valid
>> // memory, and that it has been initialised before.
>> - unsafe { bindings::spin_lock(ptr) }
>> + unsafe { bindings::spin_lock((&mut *ptr).deref_mut()) }
>> }
>>
>> + #[inline(always)]
>> unsafe fn unlock(ptr: *mut Self::State, _guard_state: &Self::GuardState) {
>> // SAFETY: The safety requirements of this function ensure that `ptr` is valid and that the
>> // caller is the owner of the mutex.
>> - unsafe { bindings::spin_unlock(ptr) }
>> + unsafe { bindings::spin_unlock((&mut *ptr).deref_mut()) }
>> }
>> }
>
> I would prefer to remain in pointer-land for the above operations. I
> think that this leads to code that is more obviously correct.
>
> For example:
>
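> ```
> impl<T> CachePadded<T> {
>     /// # Safety
>     ///
>     /// `ptr` must point to a valid `CachePadded<T>`.
>     pub const unsafe fn raw_get(ptr: *mut Self) -> *mut T {
>         // SAFETY: By the safety requirements, `ptr` is valid, so the
>         // field projection stays in bounds. No reference is created.
>         unsafe { core::ptr::addr_of_mut!((*ptr).value) }
>     }
> }
>
> #[inline(always)]
> unsafe fn unlock(ptr: *mut Self::State, _guard_state: &Self::GuardState) {
>     unsafe { bindings::spin_unlock(CachePadded::raw_get(ptr)) }
> }
> ```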
Got it.
BR Andreas
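A self-contained sketch of the pointer-only approach suggested above, for anyone who wants to try it outside the kernel. The `CachePadded` definition here is an assumption modelled on the quoted patch (a single `value` field, 64-byte alignment), `raw_get` is marked `unsafe` in this version, and `c_style_increment` and `bump` are hypothetical helpers standing in for a C function such as `spin_lock()` and its caller.

```rust
use std::ptr::addr_of_mut;

/// Pads and aligns a value to 64 bytes (assumed layout, modelled on the patch).
#[repr(align(64))]
pub struct CachePadded<T> {
    value: T,
}

impl<T> CachePadded<T> {
    pub const fn new(value: T) -> Self {
        Self { value }
    }

    /// Projects a pointer to the wrapper into a pointer to the inner value
    /// without creating an intermediate reference.
    ///
    /// # Safety
    ///
    /// `ptr` must point to a valid `CachePadded<T>`.
    pub unsafe fn raw_get(ptr: *mut Self) -> *mut T {
        // SAFETY: The caller guarantees that `ptr` is valid, so the field
        // projection stays inside the same allocation.
        unsafe { addr_of_mut!((*ptr).value) }
    }
}

/// Hypothetical stand-in for a C function that takes a raw pointer.
///
/// # Safety
///
/// `counter` must be valid for reads and writes.
unsafe fn c_style_increment(counter: *mut u32) {
    // SAFETY: Guaranteed by the caller.
    unsafe { *counter += 1 };
}

/// Increments the padded counter through raw pointers only and returns it.
fn bump(padded: &mut CachePadded<u32>) -> u32 {
    let ptr: *mut CachePadded<u32> = padded;
    // SAFETY: `ptr` comes from a live exclusive reference, so it is valid for
    // reads and writes for the duration of this call.
    unsafe {
        c_style_increment(CachePadded::raw_get(ptr));
        *CachePadded::raw_get(ptr)
    }
}

fn main() {
    let mut padded = CachePadded::new(41u32);
    println!("{}", bump(&mut padded));
}
```

Compared with `(&mut *ptr).deref_mut()`, no `&mut` to the wrapper is ever materialised, so there is no question of whether creating that reference is valid at the call site.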
On Mon, Feb 26, 2024 at 10:02 AM Andreas Hindborg (Samsung)
<[email protected]> wrote:
>
>
> Hi Alice,
>
> Alice Ryhl <[email protected]> writes:
>
> > On Wed, 3 May 2023 11:07:03 +0200, Andreas Hindborg <[email protected]> wrote:
> >> The kernel `struct spinlock` is 4 bytes on x86 when lockdep is not enabled. The
> >> structure is not padded to fit a cache line. The effect of this for `SpinLock`
> >> is that the lock variable and the value protected by the lock will share a cache
> >> line, depending on the alignment requirements of the protected value. Aligning
> >> the lock variable and the protected value to a cache line yields a 20%
> >> performance increase for the Rust null block driver for sequential reads to
> >> memory backed devices at 6 concurrent readers.
> >>
> >> Signed-off-by: Andreas Hindborg <[email protected]>
> >
> > This applies the cacheline padding to all spinlocks unconditionally.
> > It's not clear to me that we want to do that. Instead, I suggest using
> > `SpinLock<CachePadded<T>>` in the null block driver to opt-in to the
> > cache padding there, and let other drivers choose whether or not they
> > want to cache pad their locks.
>
> I was going to write that this is not going to work because the compiler
> is going to reorder the fields of `Lock` and put the `data` field first,
> followed by the `state` field. But I checked the layout, and it seems
> that I actually get the `state` field first (with an alignment of 4), 60
> bytes of padding, and then the `data` field (with alignment 64).
>
> I am wondering why the compiler is not reordering these fields? Am I
> guaranteed that the fields will not be reordered? Looking at the
> definition of `Lock` there does not seem to be anything that prevents
> rustc from swapping `state` and `data`.
It's because `Lock` has `: ?Sized` on the `T` generic. Fields that
might not be Sized must always be last.
Alice
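The effect can be checked in userspace with a small sketch. `Lock` and `CachePadded` below are reduced, hypothetical stand-ins for the kernel types, and the offsets printed are what the compiler in use happens to produce; the default Rust representation makes few formal layout promises, so treat the numbers as an observation. The second half shows why a possibly unsized field has to stay last: `Lock<[u8; 4]>` can be viewed as `Lock<[u8]>` without moving anything.

```rust
use std::mem::{align_of, size_of};

/// Hypothetical 64-byte aligned wrapper, as in the patch under discussion.
#[repr(align(64))]
struct CachePadded<T>(T);

/// Reduced stand-in for the kernel `Lock`: `T` may be unsized.
struct Lock<T: ?Sized> {
    state: u32,
    data: T,
}

/// Returns the byte offsets of `state` and `data` inside the given lock.
fn field_offsets(lock: &Lock<CachePadded<u64>>) -> (usize, usize) {
    let base = lock as *const _ as usize;
    let state = &lock.state as *const _ as usize - base;
    let data = &lock.data as *const _ as usize - base;
    (state, data)
}

fn main() {
    let lock = Lock { state: 0, data: CachePadded(0u64) };
    let (state, data) = field_offsets(&lock);
    println!(
        "state at {state}, data at {data}, size {}, align {}",
        size_of::<Lock<CachePadded<u64>>>(),
        align_of::<Lock<CachePadded<u64>>>(),
    );

    // Unsizing coercion: the sized prefix (`state`) keeps its place and the
    // tail simply loses its statically known length.
    let sized = Lock { state: 0, data: [1u8, 2, 3, 4] };
    let unsized_ref: &Lock<[u8]> = &sized;
    println!("tail length: {}", unsized_ref.data.len());
}
```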