2024-03-13 17:43:01

by Andreas Hindborg

[permalink] [raw]
Subject: [RFC PATCH 0/5] Rust block device driver API and null block driver

From: Andreas Hindborg <[email protected]>

Hi All!

This is the second version of the Rust block device driver API and the Rust null
block driver. The context and motivation can be seen in the cover letter of the
RFC v1 [1]. If more context is required, a talk about this effort was recorded at
v1 [1]. If more context is required, a talk about this effort was recorded at
LPC [2]. I hope to be able to discuss this series at LSF this year [3].

The series is still in RFC state, as many dependencies have not yet been merged.

Changes from v1:
- Improved request lifetime tracking
- Changed rust null block driver name to `rnull`
- Added timer completion support to `rnull`
- Added softirq completion support to `rnull`
- Moved to `xarray` for memory backing
- Adopted the folio abstraction where applicable
- Added `SAFETY` comments to all unsafe blocks
- Improved documentation and added examples
- Fixed up `vtable` default method behavior
- Dropped `unsafe` from vtable builder
- Fixed a safety issue in `RawWriter`
- Dropped the `TagSetRef` abstraction
- Renamed `Bio::iter` -> `Bio::raw_iter`
- Updated `core::fmt::Display for Bio` for readability
- Simplified `BioIterator::next`
- Documented and refactored `bvec_iter` functions
- Moved cache alignment out of `Lock`
- Added MAINTAINER entry

Thanks for all the input on v1!


Performance
===========

Rather than include an abundance of performance numbers in this letter, I refer
to [4] for a comparison of performance to the C null block driver. Across all of
the benchmark configurations, the mean of the difference is -2.5%, with
individual configurations showing between -10% and +30% difference in IO/s.


Request lifetime modeling
=========================

While implementing timer completion, it became clear that the request lifetime
modeling applied in the v1 RFC was not adequate. With that modeling, combined
with the timer callback feature, a developer would be able to write safe Rust
code that violates the Rust invariants for references, thus leading to undefined
behavior. A similar situation would happen in the completion path when
developers write code to convert from a completion id to a `&Request`.

To make these situations safe, and thus free from undefined behavior, the
`Request` type now applies reference counting. The logic piggybacks on the
`atomic_t ref` field of the C `struct request`. In order to make this work, the
Rust code has to be able to free requests when the refcount reaches zero.
Therefore, the series exposes (EXPORT_SYMBOL_GPL) the internal blk-mq symbol
`__blk_mq_free_request`.
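As a rough sketch of the scheme (plain userspace Rust; `RawRequest` and the
free logic are invented for illustration, not the series' actual code), each
handle bumps a shared count, and dropping the last handle is what triggers the
free path that the kernel side would route to `__blk_mq_free_request`:

```rust
use std::sync::atomic::{fence, AtomicUsize, Ordering};

// Invented stand-in for the C `struct request`; in the kernel the count
// piggybacks on the existing `atomic_t ref` field.
struct RawRequest {
    refcount: AtomicUsize,
}

// A counted handle, loosely analogous to the series' `RequestRef`.
struct RequestRef {
    ptr: *mut RawRequest,
}

impl RequestRef {
    // Allocate a request with an initial refcount of 1.
    fn new() -> Self {
        let ptr = Box::into_raw(Box::new(RawRequest {
            refcount: AtomicUsize::new(1),
        }));
        RequestRef { ptr }
    }

    // Taking an extra handle increments the count.
    fn clone_ref(&self) -> Self {
        // SAFETY: `self.ptr` is valid while any handle exists.
        unsafe { (*self.ptr).refcount.fetch_add(1, Ordering::Relaxed) };
        RequestRef { ptr: self.ptr }
    }
}

impl Drop for RequestRef {
    fn drop(&mut self) {
        // When the count hits zero, free the request. In the series this is
        // where the exported `__blk_mq_free_request` would be called.
        // SAFETY: `self.ptr` is valid until the last handle is dropped.
        if unsafe { (*self.ptr).refcount.fetch_sub(1, Ordering::Release) } == 1 {
            fence(Ordering::Acquire);
            // SAFETY: the count reached zero, so no other handle remains.
            unsafe { drop(Box::from_raw(self.ptr)) };
        }
    }
}

// Returns the refcount observed while two handles are live.
fn live_count_with_two_handles() -> usize {
    let a = RequestRef::new();
    let b = a.clone_ref();
    // SAFETY: both handles are live, so the pointee is valid.
    let n = unsafe { (*b.ptr).refcount.load(Ordering::Relaxed) };
    drop(b);
    drop(a); // last drop frees the request
    n
}
```

In the real driver the count lives in the C `struct request` itself, so the C
and Rust sides observe the same refcount.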

I am curious what the community thinks of this approach and whether it is OK to
expose this symbol so that Rust can call it.

One consequence of these changes is that requests processed by a Rust block
device driver are freed at a different place in the completion path than
requests processed by C drivers. Requests processed by C drivers are typically
freed and recycled inline during the call to `blk_mq_complete_request`.
Requests processed by a Rust driver are recycled when the `RequestRef` is
dropped, which is usually some time after the request has been completed.

There does not seem to be any statistically significant effect on performance
for the Rust null block implementation due to this change.


Dependencies
============

This series is based on the branch `rnull-deps-v6.8` of [5]. This tree is based
on the upstream `v6.8` tag and has been prepared with dependencies for this
series:

- [17] rust: add `CacheAligned` for easy cache line alignment of values
- [16] rust: xarray: add mutable access through `Guard`
- [15] rust: hrtimer: introduce hrtimer support
- [14] rust: add `Opaque::try_ffi_init`
- [13] rust: make `ARef` repr transparent
- [12] rust: folio: add `with_slice_into_page()`
- [11] rust: page: add `from_raw()`
- [10] rust: page: add `with_slice_into_page()`
- [ 9] IM: implement `ForeignOwnable` for `Pin`
- [ 8] IM: add `module_params` macro
- [ 7] rust: xarray: fix example
- [ 6] LIST: rust: xarray: Add an abstraction for XArray
- [ 5] LIST: rust: types: add FOREIGN_ALIGN to ForeignOwnable
- [ 4] LIST: rust: lock: implement `IrqSaveBackend` for `SpinLock`
- [ 3] LIST: rust: lock: add support for `Lock::lock_irqsave`
- [ 2] LIST: rust: add improved version of `ForeignOwnable::borrow_mut`
- [ 1] LIST: rust: folio: introduce basic support for folios
- [ 0] LIST: rust: add abstraction for `struct page`

Dependencies 0-6 are patches that have appeared on the list but are not yet
merged. Dependencies 8-9 are imports from the `rust` branch [6,7] that have not
yet been submitted. Dependencies 10-17 I will submit independently in the near
future.

If the upstream maintainers agree, then when the dependencies are merged, I will
eventually start submitting PRs to Jens. I fully commit to maintain this code as
indicated in the MAINTAINERS entry.

Best regards,
Andreas Hindborg


[1] https://lore.kernel.org/rust-for-linux/[email protected]
[2] https://lpc.events/event/17/contributions/1425
[3] https://lore.kernel.org/rust-for-linux/[email protected]
[4] https://rust-for-linux.com/null-block-driver
[5] git https://github.com/metaspace/linux.git rnull-deps-v6.8
[6] git https://github.com/rust-for-Linux/linux.git rust
[7] https://rust-for-linux.com/branches#rust


Andreas Hindborg (5):
rust: block: introduce `kernel::block::mq` module
rust: block: introduce `kernel::block::bio` module
rust: block: allow `hrtimer::Timer` in `RequestData`
rust: block: add rnull, Rust null_blk implementation
MAINTAINERS: add entry for Rust block device driver API

MAINTAINERS | 14 ++
block/blk-mq.c | 3 +-
drivers/block/Kconfig | 4 +
drivers/block/Makefile | 3 +
drivers/block/rnull.rs | 323 +++++++++++++++++++++++++++
include/linux/blk-mq.h | 1 +
rust/bindings/bindings_helper.h | 2 +
rust/helpers.c | 46 ++++
rust/kernel/block.rs | 6 +
rust/kernel/block/bio.rs | 112 ++++++++++
rust/kernel/block/bio/vec.rs | 279 +++++++++++++++++++++++
rust/kernel/block/mq.rs | 131 +++++++++++
rust/kernel/block/mq/gen_disk.rs | 174 +++++++++++++++
rust/kernel/block/mq/operations.rs | 346 +++++++++++++++++++++++++++++
rust/kernel/block/mq/raw_writer.rs | 60 +++++
rust/kernel/block/mq/request.rs | 269 ++++++++++++++++++++++
rust/kernel/block/mq/tag_set.rs | 117 ++++++++++
rust/kernel/error.rs | 5 +
rust/kernel/lib.rs | 1 +
scripts/Makefile.build | 2 +-
20 files changed, 1896 insertions(+), 2 deletions(-)
create mode 100644 drivers/block/rnull.rs
create mode 100644 rust/kernel/block.rs
create mode 100644 rust/kernel/block/bio.rs
create mode 100644 rust/kernel/block/bio/vec.rs
create mode 100644 rust/kernel/block/mq.rs
create mode 100644 rust/kernel/block/mq/gen_disk.rs
create mode 100644 rust/kernel/block/mq/operations.rs
create mode 100644 rust/kernel/block/mq/raw_writer.rs
create mode 100644 rust/kernel/block/mq/request.rs
create mode 100644 rust/kernel/block/mq/tag_set.rs


base-commit: e8f897f4afef0031fe618a8e94127a0934896aba
prerequisite-patch-id: 299c2cc48c45c8149e7cb18c6146a0308e5d0a44
prerequisite-patch-id: 5153ee1c410dbdf22a6fd40667712943b4b89a97
prerequisite-patch-id: 4d025bab9bc9741aedecc5327ad18f88f9105271
prerequisite-patch-id: a5e932c86fa6c68234764aa3d7f314e5b534b1d9
prerequisite-patch-id: aef3042976c4c678b7aa96154fc280f9061ebaf7
prerequisite-patch-id: 8bf108ad0af2a3ec89acb8d99ee1b49ca2f51c69
prerequisite-patch-id: a803b221c3232db3258406a4075558e85acefd09
prerequisite-patch-id: 5e9cbcd0dc56a83353f0e4a3b5d4e8d5b51f3160
prerequisite-patch-id: 28bae4a7fe83b36afed9892515a6dde1ea51e98a
prerequisite-patch-id: 5b5ea2a21b37afb05fdf655396a6f74d83bb99c4
prerequisite-patch-id: dc53b6ce21f74726d5d13988398c2954da07bcb6
prerequisite-patch-id: b86d2b14f1770c1d4756ca10b93efaada643e560
prerequisite-patch-id: 6a155859eb9a18afcd22a5bda3350d45d92e2fc7
prerequisite-patch-id: c8ca075008f50d3cf1781c1ea1130a8ee735e7d2
prerequisite-patch-id: b000cd190fe80dea3f4dd9172ecf8787f23b72be
prerequisite-patch-id: b72f1fc3bd44b60911d9d91ecc5a45812a75eba3
prerequisite-patch-id: 167c7770e124b9afa44bead742f90a57ac73f2d7
prerequisite-patch-id: cc24a3135b4f258f4b1ea83ac91c2be4ffe31772
--
2.44.0



2024-03-13 19:03:07

by Bart Van Assche

Subject: Re: [RFC PATCH 0/5] Rust block device driver API and null block driver

On 3/13/24 04:05, Andreas Hindborg wrote:
> This is the second version of the Rust block device driver API and the Rust null
> block driver. The context and motivation can be seen in cover letter of the RFC
> v1 [1]. If more context is required, a talk about this effort was recorded at
> LPC [2]. I hope to be able to discuss this series at LSF this year [3].

Memory safety may land in C++ in the near future (see also
https://herbsutter.com/2024/03/). If memory-safe C++ or memory-safe C
would be adopted in the kernel, it would allow writing memory-safe
drivers without having to add complicated bindings between existing C
code and new Rust code. Please do not take this as personal criticism -
I appreciate the effort that has been spent on coming up with great
Rust bindings for the Linux kernel block layer.

Thanks,

Bart.


2024-03-13 19:08:36

by Boqun Feng

Subject: Re: [RFC PATCH 0/5] Rust block device driver API and null block driver

On Wed, Mar 13, 2024 at 11:02:23AM -0700, Bart Van Assche wrote:
> On 3/13/24 04:05, Andreas Hindborg wrote:
> > This is the second version of the Rust block device driver API and the Rust null
> > block driver. The context and motivation can be seen in cover letter of the RFC
> > v1 [1]. If more context is required, a talk about this effort was recorded at
> > LPC [2]. I hope to be able to discuss this series at LSF this year [3].
>
> Memory safety may land in C++ in the near future (see also
> https://herbsutter.com/2024/03/). If memory-safe C++ or memory-safe C
> would be adopted in the kernel, it would allow writing memory-safe
> drivers without having to add complicated bindings between existing C

I honestly doubt it, memory-safe is not free, basically you will still
want unsafe part for the performance reason (or interacting with
hardware), and provide a safe API for driver development. I don't think
that part will be gone with a memory-safe C++. So the complication still
exists. But I'm happy to be proved wrong ;-)

Regards,
Boqun

> code and new Rust code. Please do not take this as personal criticism -
> I appreciate the effort that has been spent on coming up with great
> Rust bindings for the Linux kernel block layer.
>
> Thanks,
>
> Bart.
>

2024-03-13 20:27:33

by Bart Van Assche

Subject: Re: [RFC PATCH 0/5] Rust block device driver API and null block driver


On 3/13/24 12:03, Andreas Hindborg wrote:
> I think it is great that people are starting to realize that bringing
> memory safety to other systems languages is a good idea.

Totally agree :-)

> But from one person blogging about it to things being ready for
> production is a long journey.

In case you weren't aware of this, Herb Sutter is not a random blogger -
he has been the chair of the ISO C++ standards committee since 2002.

Thanks,

Bart.


2024-03-13 21:12:47

by Andreas Hindborg

Subject: Re: [RFC PATCH 0/5] Rust block device driver API and null block driver

Boqun Feng <[email protected]> writes:

> On Wed, Mar 13, 2024 at 11:02:23AM -0700, Bart Van Assche wrote:
>> On 3/13/24 04:05, Andreas Hindborg wrote:
>> > This is the second version of the Rust block device driver API and the Rust null
>> > block driver. The context and motivation can be seen in cover letter of the RFC
>> > v1 [1]. If more context is required, a talk about this effort was recorded at
>> > LPC [2]. I hope to be able to discuss this series at LSF this year [3].
>>
>> Memory safety may land in C++ in the near future (see also
>> https://herbsutter.com/2024/03/). If memory-safe C++ or memory-safe C
>> would be adopted in the kernel, it would allow writing memory-safe
>> drivers without having to add complicated bindings between existing C
>
> I honestly doubt it, memory-safe is not free, basically you will still
> want unsafe part for the performance reason (or interacting with
> hardware), and provide a safe API for driver development. I don't think
> that part will be gone with a memory-safe C++. So the complication still
> exists. But I'm happy to be proved wrong ;-)

I think it is great that people are starting to realize that bringing
memory safety to other systems languages is a good idea. But from one
person blogging about it to things being ready for production is a long
journey. Language designers have to design ways to express the new
semantics, standards committees have to agree, compiler engineers have to
build and test their compilers. We probably need a few cycles of this to
get things right. At any rate, it is going to take a while.

Second, as Boqun is saying, interfacing the unsafe C part of the kernel
with memory safe C++ is going to require the same complicated
abstraction logic as the safe Rust APIs are bringing. The complexity in
bringing Rust to the kernel is not coming from interfacing a foreign
language. It stems from the inherent difficulty in designing memory safe
wrappers around unsafe C APIs.

Best regards,
Andreas

2024-03-13 21:17:48

by Matthew Wilcox

Subject: Re: [RFC PATCH 0/5] Rust block device driver API and null block driver

On Wed, Mar 13, 2024 at 11:02:23AM -0700, Bart Van Assche wrote:
> On 3/13/24 04:05, Andreas Hindborg wrote:
> > This is the second version of the Rust block device driver API and the Rust null
> > block driver. The context and motivation can be seen in cover letter of the RFC
> > v1 [1]. If more context is required, a talk about this effort was recorded at
> > LPC [2]. I hope to be able to discuss this series at LSF this year [3].
>
> Memory safety may land in C++ in the near future (see also
> https://herbsutter.com/2024/03/). If memory-safe C++ or memory-safe C
> would be adopted in the kernel, it would allow writing memory-safe
> drivers without having to add complicated bindings between existing C
> code and new Rust code. Please do not take this as personal criticism -
> I appreciate the effort that has been spent on coming up with great
> Rust bindings for the Linux kernel block layer.

You know, this reminds me of when I was at Intel working on NVMe.
We had a product all ready to ship and a manager suggested that we delay
shipping it until the SCSI-over-PCI standard was finished and ratified
so that the same piece of silicon could support both.

Fortunately, said manager was overruled, but it's a great example of
how this kind of FUD can actually work.

2024-03-14 12:14:43

by Philipp Stanner

Subject: Re: [RFC PATCH 0/5] Rust block device driver API and null block driver

On Wed, 2024-03-13 at 11:02 -0700, Bart Van Assche wrote:
> On 3/13/24 04:05, Andreas Hindborg wrote:
> > This is the second version of the Rust block device driver API and
> > the Rust null
> > block driver. The context and motivation can be seen in cover
> > letter of the RFC
> > v1 [1]. If more context is required, a talk about this effort was
> > recorded at
> > LPC [2]. I hope to be able to discuss this series at LSF this year
> > [3].
>
> Memory safety may land in C++ in the near future (see also
> https://herbsutter.com/2024/03/). If memory-safe C++ or memory-safe C
> would be adopted in the kernel, it would allow writing memory-safe
> drivers without having to add complicated bindings between existing C
> code and new Rust code.

Correct, but it would also most likely allow people to simply ignore the
"modern, memory safe C" (or C++) and write it the old way.

Those discussions are, below the surface, frequently in truth about the
question: Can you (and do you want to) _force_ people?

The Kernel's C already has more memory safety than standardized C:
There's devres, and since last year there's the __cleanup attribute.
– but the thing is, you can just ignore it and do it the old way.

Once you give people freedom (which is necessary frequently), you'll
get people who ignore "the right way to do it". You certainly get them
once thousands of people are participating in your project.
Actually, Rust in userspace has a similar problem: your fellow programmers
can call unwrap(), just ignoring those neat Result types and blowing up
your thread. So you have to review and reject that.

One of the stronger arguments behind the push for Rust is that the
language by design forces you to obey, because otherwise the compiler
will just reject building.
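As a small illustration of that trade-off (plain userspace Rust, not kernel
code), the compiler forces callers to acknowledge the error path of a `Result`
before they can use the value, while `unwrap()` remains a legal escape hatch
that only review can catch:

```rust
// A fallible operation returning a Result instead of a raw value.
fn parse_port(s: &str) -> Result<u16, std::num::ParseIntError> {
    s.parse::<u16>()
}

// Handling the Result explicitly: the compiler rejects any attempt to use
// the parsed value without first going through Ok/Err.
fn port_or_default(s: &str) -> u16 {
    match parse_port(s) {
        Ok(p) => p,
        Err(_) => 8080, // fall back instead of panicking
    }
}

// By contrast, `parse_port(s).unwrap()` compiles fine but panics on bad
// input - the "just ignore the Result" escape hatch mentioned above.
```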


P.


> Please do not take this as personal criticism -
> I appreciate the effort that has been spent on coming up with great
> Rust bindings for the Linux kernel block layer.
>
> Thanks,
>
> Bart.
>
>


2024-03-14 17:04:07

by Bart Van Assche

Subject: Re: [RFC PATCH 0/5] Rust block device driver API and null block driver

On 3/14/24 05:14, Philipp Stanner wrote:
> On Wed, 2024-03-13 at 11:02 -0700, Bart Van Assche wrote:
>> On 3/13/24 04:05, Andreas Hindborg wrote:
>>> This is the second version of the Rust block device driver API and
>>> the Rust null
>>> block driver. The context and motivation can be seen in cover
>>> letter of the RFC
>>> v1 [1]. If more context is required, a talk about this effort was
>>> recorded at
>>> LPC [2]. I hope to be able to discuss this series at LSF this year
>>> [3].
>>
>> Memory safety may land in C++ in the near future (see also
>> https://herbsutter.com/2024/03/). If memory-safe C++ or memory-safe C
>> would be adopted in the kernel, it would allow writing memory-safe
>> drivers without having to add complicated bindings between existing C
>> code and new Rust code.
>
> Correct, but it would also most likely allow to just arbitrarily ignore
> the "modern, memory safe C" (or C++) and write it the old way.

Linux kernel developers today have the choice between memory-unsafe C and
memory-safe Rust, so they already have the freedom to ignore memory safety
by choosing C. In my opinion, once memory-safe C or C++ is available, a
compiler option should be introduced that can be set per source file and
that makes the build fail for source files that unintentionally do not
follow the memory-safety rules.

> The Kernel's C already has more memory safety than standardized C:
> There's devres, and since last year there's the __cleanup attribute.
> – but the thing is, you can just ignore it and do it the old way.

devres is controversial - see also Laurent Pinchart, "Why is
devm_kzalloc() harmful and what can we do about it", LPC, 2022
(https://lpc.events/event/16/contributions/1227/).

> One of the stronger arguments behind the push for Rust is that the
> language by design forces you to obey, because otherwise the compiler
> will just reject building.

Rust has a very significant disadvantage that memory-safe C/C++ won't
have: supporting Rust means adding Rust bindings for all C functions
called from Rust code. This forces everyone who wants to change an
interface to also change the Rust bindings and hence will make it
harder to maintain the Linux kernel in its entirety.

Thanks,

Bart.



2024-03-14 17:17:00

by Conor Dooley

Subject: Re: [RFC PATCH 0/5] Rust block device driver API and null block driver

Just a passer-by here, but I noticed the link to Laurent's talk..

On Thu, Mar 14, 2024 at 10:03:28AM -0700, Bart Van Assche wrote:
> On 3/14/24 05:14, Philipp Stanner wrote:
>
> > The Kernel's C already has more memory safety than standardized C:
> > There's devres, and since last year there's the __cleanup attribute.
> > – but the thing is, you can just ignore it and do it the old way.
>
> devres is controversial - see also Laurent Pinchart, "Why is
> devm_kzalloc() harmful and what can we do about it", LPC, 2022
> (https://lpc.events/event/16/contributions/1227/).

I don't think that's a great thing to cite; that talk prompted a series
of others, with (AFAIK*) the most recent being from Bart at LPC this year:
https://lpc.events/event/17/contributions/16f
The TL;DR is that it's not actually a problem caused by devres.

* I think Wolfram also talked about it at an automotive conference, but
that seemed like a bit of a pitch for funding from the safety conscious.



2024-03-14 22:31:52

by Andreas Hindborg

Subject: Re: [RFC PATCH 0/5] Rust block device driver API and null block driver


Hi Bart,

Bart Van Assche <[email protected]> writes:
> On 3/14/24 05:14, Philipp Stanner wrote:
>> On Wed, 2024-03-13 at 11:02 -0700, Bart Van Assche wrote:

[...]

>> One of the stronger arguments behind the push for Rust is that the
>> language by design forces you to obey, because otherwise the compiler
>> will just reject building.
>
> Rust has a very significant disadvantage that memory-safe C/C++ won't
> have: supporting Rust means adding Rust bindings for all C functions
> called from Rust code. This forces everyone who wants to change an
> interface to also change the Rust bindings and hence will make it
> harder to maintain the Linux kernel in its entirety.

I think you might be missing a key point here. We actually generate Rust
bindings to the existing C kernel automatically. No hand editing is
required, except for some corner cases we currently have with static
methods and certain macros. If we just wanted to call the C APIs
directly, there would be no engineering required, and the main reason to
deploy Rust would also go away; we might as well stay in C.

The actual engineering effort goes into building memory safe versions of
the C APIs. This requirement will not magically go away, no matter what
memory safe language (or language extensions) you use to interface with
the existing unsafe C APIs.
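To illustrate the distinction (a toy userspace sketch; `bindings::c_fill` is
an invented stand-in for an auto-generated binding, not a real kernel symbol),
calling a generated binding is mechanical, while the engineering effort goes
into the safe wrapper that encodes the C contract in the type system:

```rust
// Invented mock of an auto-generated binding: bindgen produces `unsafe`
// declarations like this mechanically, with no hand-written code required.
mod bindings {
    /// Pretend C contract: `buf` must be valid for `len` writes.
    pub unsafe fn c_fill(buf: *mut u8, len: usize, value: u8) {
        for i in 0..len {
            // SAFETY: caller guarantees `buf` is valid for `len` writes.
            unsafe { *buf.add(i) = value };
        }
    }
}

// The actual engineering: a safe API that makes the C contract unbreakable
// from safe code. The slice type itself carries the validity invariant.
fn fill(buf: &mut [u8], value: u8) {
    // SAFETY: `buf` is a valid, exclusively borrowed slice of `buf.len()`
    // bytes, satisfying the (pretend) C contract.
    unsafe { bindings::c_fill(buf.as_mut_ptr(), buf.len(), value) };
}

fn demo() -> Vec<u8> {
    let mut v = vec![0u8; 4];
    fill(&mut v, 7);
    v
}
```

Safe callers can now only reach the binding through `fill`, so the unsafety
is contained in one audited spot instead of every call site.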

Best regards,
Andreas

2024-03-13 15:07:57

by Andreas Hindborg

Subject: [RFC PATCH 2/5] rust: block: introduce `kernel::block::bio` module

From: Andreas Hindborg <[email protected]>

Add abstractions for working with `struct bio`.

Signed-off-by: Andreas Hindborg <[email protected]>
---
rust/kernel/block.rs | 1 +
rust/kernel/block/bio.rs | 112 +++++++++++++
rust/kernel/block/bio/vec.rs | 279 ++++++++++++++++++++++++++++++++
rust/kernel/block/mq/request.rs | 22 +++
4 files changed, 414 insertions(+)
create mode 100644 rust/kernel/block/bio.rs
create mode 100644 rust/kernel/block/bio/vec.rs

diff --git a/rust/kernel/block.rs b/rust/kernel/block.rs
index 4c93317a568a..1797859551fd 100644
--- a/rust/kernel/block.rs
+++ b/rust/kernel/block.rs
@@ -2,4 +2,5 @@

//! Types for working with the block layer

+pub mod bio;
pub mod mq;
diff --git a/rust/kernel/block/bio.rs b/rust/kernel/block/bio.rs
new file mode 100644
index 000000000000..0d4336cbe9c1
--- /dev/null
+++ b/rust/kernel/block/bio.rs
@@ -0,0 +1,112 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! Types for working with the bio layer.
+//!
+//! C header: [`include/linux/blk_types.h`](../../include/linux/blk_types.h)
+
+use core::fmt;
+use core::ptr::NonNull;
+
+mod vec;
+
+pub use vec::BioSegmentIterator;
+pub use vec::Segment;
+
+use crate::types::Opaque;
+
+/// A block device IO descriptor (`struct bio`)
+///
+/// # Invariants
+///
+/// Instances of this type are always reference counted. A call to
+/// `bindings::bio_get()` ensures that the instance is valid for read at least
+/// until a matching call to `bindings::bio_put()`.
+#[repr(transparent)]
+pub struct Bio(Opaque<bindings::bio>);
+
+impl Bio {
+ /// Returns an iterator over segments in this `Bio`. Does not consider
+ /// segments of other bios in this bio chain.
+ #[inline(always)]
+ pub fn segment_iter(&self) -> BioSegmentIterator<'_> {
+ BioSegmentIterator::new(self)
+ }
+
+ /// Get slice referencing the `bio_vec` array of this bio
+ #[inline(always)]
+ fn io_vec(&self) -> &[bindings::bio_vec] {
+ let this = self.0.get();
+
+ // SAFETY: By the type invariant of `Bio` and existence of `&self`,
+ // `this` is valid for read.
+ let io_vec = unsafe { (*this).bi_io_vec };
+
+ // SAFETY: By the type invariant of `Bio` and existence of `&self`,
+ // `this` is valid for read.
+ let length = unsafe { (*this).bi_vcnt };
+
+ // SAFETY: By C API contract, `io_vec` points to `length` consecutive
+ // and properly initialized `bio_vec` values. The array is properly
+ // aligned because it is #[repr(C)]. By C API contract and safety
+ // requirement of `from_raw()`, the elements of the `io_vec` array are
+ // not modified for the duration of the lifetime of `&self`
+ unsafe { core::slice::from_raw_parts(io_vec, length as usize) }
+ }
+
+ /// Return a copy of the `bvec_iter` for this `Bio`. This iterator always
+ /// indexes to a valid `bio_vec` entry.
+ #[inline(always)]
+ fn raw_iter(&self) -> bindings::bvec_iter {
+ // SAFETY: By the type invariant of `Bio` and existence of `&self`,
+ // `self` is valid for read.
+ unsafe { (*self.0.get()).bi_iter }
+ }
+
+ /// Get the next `Bio` in the chain
+ #[inline(always)]
+ fn next(&self) -> Option<&Self> {
+ // SAFETY: By the type invariant of `Bio` and existence of `&self`,
+ // `self` is valid for read.
+ let next = unsafe { (*self.0.get()).bi_next };
+ // SAFETY: By C API contract `bi_next` has nonzero reference count if it
+ // is not null, for at least the duration of the lifetime of &self.
+ unsafe { Self::from_raw(next) }
+ }
+
+ /// Create an instance of `Bio` from a raw pointer.
+ ///
+ /// # Safety
+ ///
+ /// If `ptr` is not null, caller must ensure positive refcount for the
+ /// pointee and immutability for the duration of the returned lifetime.
+ #[inline(always)]
+ pub(crate) unsafe fn from_raw<'a>(ptr: *mut bindings::bio) -> Option<&'a Self> {
+ Some(
+ // SAFETY: By the safety requirement of this function, `ptr` is
+ // valid for read for the duration of the returned lifetime
+ unsafe { &*NonNull::new(ptr)?.as_ptr().cast::<Bio>() },
+ )
+ }
+}
+
+impl core::fmt::Display for Bio {
+ fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
+ write!(f, "Bio({:?})", self.0.get())
+ }
+}
+
+/// An iterator over `Bio`
+pub struct BioIterator<'a> {
+ pub(crate) bio: Option<&'a Bio>,
+}
+
+impl<'a> core::iter::Iterator for BioIterator<'a> {
+ type Item = &'a Bio;
+
+ #[inline(always)]
+ fn next(&mut self) -> Option<&'a Bio> {
+ let current = self.bio.take()?;
+ self.bio = current.next();
+ Some(current)
+ }
+}
diff --git a/rust/kernel/block/bio/vec.rs b/rust/kernel/block/bio/vec.rs
new file mode 100644
index 000000000000..b61380807f38
--- /dev/null
+++ b/rust/kernel/block/bio/vec.rs
@@ -0,0 +1,279 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! Types for working with `struct bio_vec` IO vectors
+//!
+//! C header: [`include/linux/bvec.h`](../../include/linux/bvec.h)
+
+use super::Bio;
+use crate::error::Result;
+use crate::folio::UniqueFolio;
+use crate::page::Page;
+use core::fmt;
+use core::mem::ManuallyDrop;
+
+/// A wrapper around a `struct bio_vec` - a contiguous range of physical memory addresses
+///
+/// # Invariants
+///
+/// `bio_vec` must always be initialized and valid for read and write
+pub struct Segment<'a> {
+ bio_vec: bindings::bio_vec,
+ _marker: core::marker::PhantomData<&'a ()>,
+}
+
+impl Segment<'_> {
+ /// Get the length of the segment in bytes
+ #[inline(always)]
+ pub fn len(&self) -> usize {
+ self.bio_vec.bv_len as usize
+ }
+
+ /// Returns true if the length of the segment is 0
+ #[inline(always)]
+ pub fn is_empty(&self) -> bool {
+ self.len() == 0
+ }
+
+ /// Get the offset field of the `bio_vec`
+ #[inline(always)]
+ pub fn offset(&self) -> usize {
+ self.bio_vec.bv_offset as usize
+ }
+
+ /// Copy data of this segment into `folio`.
+ ///
+ /// Note: Disregards `self.offset()`
+ #[inline(always)]
+ pub fn copy_to_folio(&self, dst_folio: &mut UniqueFolio) -> Result {
+ // SAFETY: self.bio_vec is valid and thus bv_page must be a valid
+ // pointer to a `struct page`. We do not own the page, but we prevent
+ // drop by wrapping the `Page` in `ManuallyDrop`.
+ let src_page = ManuallyDrop::new(unsafe { Page::from_raw(self.bio_vec.bv_page) });
+
+ src_page.with_slice_into_page(|src| {
+ dst_folio.with_slice_into_page_mut(0, |dst| {
+ dst.copy_from_slice(src);
+ Ok(())
+ })
+ })
+ }
+
+ /// Copy data to the page of this segment from `src`.
+ ///
+ /// Note: Disregards `self.offset()`
+ pub fn copy_from_folio(&mut self, src_folio: &UniqueFolio) -> Result {
+ // SAFETY: self.bio_vec is valid and thus bv_page must be a valid
+ // pointer to a `struct page`. We do not own the page, but we prevent
+ // drop by wrapping the `Page` in `ManuallyDrop`.
+ let mut dst_page = ManuallyDrop::new(unsafe { Page::from_raw(self.bio_vec.bv_page) });
+
+ dst_page.with_slice_into_page_mut(|dst| {
+ src_folio.with_slice_into_page(0, |src| {
+ dst.copy_from_slice(src);
+ Ok(())
+ })
+ })
+ }
+}
+
+impl core::fmt::Display for Segment<'_> {
+ fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
+ write!(
+ f,
+ "Segment {:?} len: {}",
+ self.bio_vec.bv_page, self.bio_vec.bv_len
+ )
+ }
+}
+
+/// An iterator over `Segment`
+///
+/// # Invariants
+///
+/// If `iter.bi_size` > 0, `iter` must always index a valid `bio_vec` in `bio.io_vec()`.
+pub struct BioSegmentIterator<'a> {
+ bio: &'a Bio,
+ iter: bindings::bvec_iter,
+}
+
+impl<'a> BioSegmentIterator<'a> {
+ /// Create a new segment iterator for iterating the segments of `bio`. The
+ /// iterator starts at the beginning of `bio`.
+ #[inline(always)]
+ pub(crate) fn new(bio: &'a Bio) -> BioSegmentIterator<'_> {
+ // SAFETY: `bio.raw_iter()` returns an index that indexes into a valid
+ // `bio_vec` in `bio.io_vec()`.
+ Self {
+ bio,
+ iter: bio.raw_iter(),
+ }
+ }
+
+ // The accessors in this implementation block are modelled after C side
+ // macros and static functions `bvec_iter_*` and `mp_bvec_iter_*` from
+ // bvec.h.
+
+ /// Construct a `bio_vec` from the current iterator state.
+ ///
+ /// This will return a `bio_vec` of size <= PAGE_SIZE
+ ///
+ /// # Safety
+ ///
+ /// Caller must ensure that `self.iter.bi_size` > 0 before calling this
+ /// method.
+ unsafe fn io_vec(&self) -> bindings::bio_vec {
+ // SAFETY: By safety requirement of this function `self.iter.bi_size` is
+ // greater than 0.
+ unsafe {
+ bindings::bio_vec {
+ bv_page: self.page(),
+ bv_len: self.len(),
+ bv_offset: self.offset(),
+ }
+ }
+ }
+
+ /// Get the currently indexed `bio_vec` entry.
+ ///
+ /// # Safety
+ ///
+ /// Caller must ensure that `self.iter.bi_size` > 0 before calling this
+ /// method.
+ #[inline(always)]
+ unsafe fn bvec(&self) -> &bindings::bio_vec {
+ // SAFETY: By the safety requirement of this function and the type
+ // invariant of `Self`, `self.iter.bi_idx` indexes into a valid
+ // `bio_vec`
+ unsafe { self.bio.io_vec().get_unchecked(self.iter.bi_idx as usize) }
+ }
+
+ /// Get the currently indexed page, indexing into pages of order > 0.
+ ///
+ /// # Safety
+ ///
+ /// Caller must ensure that `self.iter.bi_size` > 0 before calling this
+ /// method.
+ #[inline(always)]
+ unsafe fn page(&self) -> *mut bindings::page {
+ // SAFETY: By C API contract, the following offset cannot exceed pages
+ // allocated to this bio.
+ unsafe { self.mp_page().add(self.mp_page_idx()) }
+ }
+
+ /// Get the remaining bytes in the current page. Never more than PAGE_SIZE.
+ ///
+ /// # Safety
+ ///
+ /// Caller must ensure that `self.iter.bi_size` > 0 before calling this
+ /// method.
+ #[inline(always)]
+ unsafe fn len(&self) -> u32 {
+ // SAFETY: By safety requirement of this function `self.iter.bi_size` is
+ // greater than 0.
+ unsafe { self.mp_len().min((bindings::PAGE_SIZE as u32) - self.offset()) }
+ }
+
+ /// Get the offset from the last page boundary in the currently indexed
+ /// `bio_vec` entry. Never more than PAGE_SIZE.
+ ///
+ /// # Safety
+ ///
+ /// Caller must ensure that `self.iter.bi_size` > 0 before calling this
+ /// method.
+ #[inline(always)]
+ unsafe fn offset(&self) -> u32 {
+ // SAFETY: By safety requirement of this function `self.iter.bi_size` is
+ // greater than 0.
+ unsafe { self.mp_offset() % (bindings::PAGE_SIZE as u32) }
+ }
+
+ /// Return the first page of the currently indexed `bio_vec` entry. This
+ /// might be a multi-page entry, meaning that the page might have order > 0.
+ ///
+ /// # Safety
+ ///
+ /// Caller must ensure that `self.iter.bi_size` > 0 before calling this
+ /// method.
+ #[inline(always)]
+ unsafe fn mp_page(&self) -> *mut bindings::page {
+ // SAFETY: By safety requirement of this function `self.iter.bi_size` is
+ // greater than 0.
+ unsafe { self.bvec().bv_page }
+ }
+
+ /// Get the offset in whole pages into the currently indexed `bio_vec`. This
+ /// can be more than 0 if the page has order > 0.
+ ///
+ /// # Safety
+ ///
+ /// Caller must ensure that `self.iter.bi_size` > 0 before calling this
+ /// method.
+ #[inline(always)]
+ unsafe fn mp_page_idx(&self) -> usize {
+ // SAFETY: By safety requirement of this function `self.iter.bi_size` is
+ // greater than 0.
+ (unsafe { self.mp_offset() } / (bindings::PAGE_SIZE as u32)) as usize
+ }
+
+ /// Get the offset in the currently indexed `bio_vec` multi-page entry. This
+ /// can be more than `PAGE_SIZE` if the page has order > 0.
+ ///
+ /// # Safety
+ ///
+ /// Caller must ensure that `self.iter.bi_size` > 0 before calling this
+ /// method.
+ #[inline(always)]
+ unsafe fn mp_offset(&self) -> u32 {
+ // SAFETY: By safety requirement of this function `self.iter.bi_size` is
+ // greater than 0.
+ unsafe { self.bvec().bv_offset + self.iter.bi_bvec_done }
+ }
+
+ /// Get the number of remaining bytes for the currently indexed `bio_vec`
+ /// entry. Can be more than `PAGE_SIZE` for `bio_vec` entries with pages of
+ /// order > 0.
+ ///
+ /// # Safety
+ ///
+ /// Caller must ensure that `self.iter.bi_size` > 0 before calling this
+ /// method.
+ #[inline(always)]
+ unsafe fn mp_len(&self) -> u32 {
+ // SAFETY: By safety requirement of this function `self.iter.bi_size` is
+ // greater than 0.
+ self.iter
+ .bi_size
+ .min(unsafe { self.bvec().bv_len } - self.iter.bi_bvec_done)
+ }
+}
+
+impl<'a> core::iter::Iterator for BioSegmentIterator<'a> {
+ type Item = Segment<'a>;
+
+ #[inline(always)]
+ fn next(&mut self) -> Option<Self::Item> {
+ if self.iter.bi_size == 0 {
+ return None;
+ }
+
+ // SAFETY: We checked that `self.iter.bi_size` > 0 above.
+ let bio_vec_ret = unsafe { self.io_vec() };
+
+ // SAFETY: By the existence of the reference `&bio`, `bio.0` contains a
+ // valid `struct bio`. By the type invariant of `BioSegmentIterator`,
+ // `self.iter` indexes into a valid `bio_vec` entry. By C API contract,
+ // `bv_len` does not exceed the size of the bio.
+ unsafe {
+ bindings::bio_advance_iter_single(
+ self.bio.0.get(),
+ &mut self.iter as *mut bindings::bvec_iter,
+ bio_vec_ret.bv_len,
+ )
+ };
+
+ Some(Segment {
+ bio_vec: bio_vec_ret,
+ _marker: core::marker::PhantomData,
+ })
+ }
+}
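For reference, the page and offset arithmetic implemented by the `mp_*` helpers above can be modeled outside the kernel. The following is a standalone sketch, not kernel code: `PAGE_SIZE` is assumed to be 4096, and `bv_offset`, `bv_len`, `bi_bvec_done` and `bi_size` are plain parameters standing in for the fields of the same names.

```rust
// Standalone model of the segment arithmetic in `BioSegmentIterator`.
const PAGE_SIZE: u32 = 4096;

/// Byte offset into the (possibly multi-page) `bio_vec` entry.
fn mp_offset(bv_offset: u32, bi_bvec_done: u32) -> u32 {
    bv_offset + bi_bvec_done
}

/// Whole pages to skip into the entry. Non-zero only for pages of order > 0.
fn mp_page_idx(bv_offset: u32, bi_bvec_done: u32) -> usize {
    (mp_offset(bv_offset, bi_bvec_done) / PAGE_SIZE) as usize
}

/// Offset from the last page boundary. Always less than `PAGE_SIZE`.
fn offset(bv_offset: u32, bi_bvec_done: u32) -> u32 {
    mp_offset(bv_offset, bi_bvec_done) % PAGE_SIZE
}

/// Bytes left in the current page: the remaining bytes of the entry,
/// capped at the distance to the next page boundary.
fn len(bv_offset: u32, bi_bvec_done: u32, bv_len: u32, bi_size: u32) -> u32 {
    let mp_len = bi_size.min(bv_len - bi_bvec_done);
    mp_len.min(PAGE_SIZE - offset(bv_offset, bi_bvec_done))
}

fn main() {
    // An order-1 (two page, 8 KiB) entry with 5000 bytes already consumed:
    // the cursor sits 904 bytes into the second page, with 3192 bytes left.
    assert_eq!(mp_page_idx(0, 5000), 1);
    assert_eq!(offset(0, 5000), 904);
    assert_eq!(len(0, 5000, 8192, 8192 - 5000), 3192);
}
```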
diff --git a/rust/kernel/block/mq/request.rs b/rust/kernel/block/mq/request.rs
index b4dacac5e091..cccffde45981 100644
--- a/rust/kernel/block/mq/request.rs
+++ b/rust/kernel/block/mq/request.rs
@@ -12,6 +12,9 @@
};
use core::{ffi::c_void, marker::PhantomData, ops::Deref};

+use crate::block::bio::Bio;
+use crate::block::bio::BioIterator;
+
/// A wrapper around a blk-mq `struct request`. This represents an IO request.
///
/// # Invariants
@@ -84,6 +87,25 @@ pub fn complete(&self) {
}
}

+ /// Get a wrapper for the first Bio in this request
+ #[inline(always)]
+ pub fn bio(&self) -> Option<&Bio> {
+ // SAFETY: By type invariant of `Self`, `self.0` is valid and the deref
+ // is safe.
+ let ptr = unsafe { (*self.0.get()).bio };
+ // SAFETY: By C API contract, if `bio` is not null it will have a
+ // positive refcount at least for the duration of the lifetime of
+ // `&self`.
+ unsafe { Bio::from_raw(ptr) }
+ }
+
+ /// Get an iterator over all bio structures in this request
+ #[inline(always)]
+ pub fn bio_iter(&self) -> BioIterator<'_> {
+ BioIterator { bio: self.bio() }
+ }
+
+ // TODO: Check if inline is still required for cross language LTO inlining into module
/// Get the target sector for the request
#[inline(always)]
pub fn sector(&self) -> usize {
--
2.44.0
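The `bio()`/`bio_iter()` pattern in the hunk above, an optional head element plus an iterator that walks an intrusive chain, can be sketched in plain Rust. This is a standalone model, not the kernel types: the `Bio` struct and its `bi_next` field here only mimic the C layout, and the `sector` field is invented for illustration.

```rust
// Standalone model of `BioIterator`: walk an intrusive singly linked list
// of bios, where each node holds a raw pointer to the next one.
struct Bio {
    sector: u64,
    bi_next: *const Bio,
}

struct BioIterator<'a> {
    bio: Option<&'a Bio>,
}

impl<'a> Iterator for BioIterator<'a> {
    type Item = &'a Bio;

    fn next(&mut self) -> Option<&'a Bio> {
        let current = self.bio.take()?;
        // In the kernel, this dereference is justified by the request
        // keeping the whole bio chain alive for the lifetime of `&self`.
        self.bio = unsafe { current.bi_next.as_ref() };
        Some(current)
    }
}

fn main() {
    let third = Bio { sector: 24, bi_next: core::ptr::null() };
    let second = Bio { sector: 16, bi_next: &third };
    let first = Bio { sector: 8, bi_next: &second };

    let sectors: Vec<u64> = BioIterator { bio: Some(&first) }
        .map(|b| b.sector)
        .collect();
    assert_eq!(sectors, [8, 16, 24]);
}
```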


2024-03-13 15:23:26

by Andreas Hindborg

Subject: [RFC PATCH 5/5] MAINTAINERS: add entry for Rust block device driver API

From: Andreas Hindborg <[email protected]>

Signed-off-by: Andreas Hindborg <[email protected]>
---
MAINTAINERS | 14 ++++++++++++++
1 file changed, 14 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 1aabf1c15bb3..031198967782 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3623,6 +3623,20 @@ F: include/linux/blk*
F: kernel/trace/blktrace.c
F: lib/sbitmap.c

+BLOCK LAYER DEVICE DRIVER API [RUST]
+M: Andreas Hindborg <[email protected]>
+R: Boqun Feng <[email protected]>
+L: [email protected]
+L: [email protected]
+S: Supported
+W: https://rust-for-linux.com
+B: https://github.com/Rust-for-Linux/linux/issues
+C: https://rust-for-linux.zulipchat.com/#narrow/stream/Block
+T: git https://github.com/Rust-for-Linux/linux.git rust-block-next
+F: drivers/block/rnull.rs
+F: rust/kernel/block.rs
+F: rust/kernel/block/
+
BLOCK2MTD DRIVER
M: Joern Engel <[email protected]>
L: [email protected]
--
2.44.0


2024-03-13 17:43:02

by Andreas Hindborg

Subject: [RFC PATCH 1/5] rust: block: introduce `kernel::block::mq` module

From: Andreas Hindborg <[email protected]>

Add initial abstractions for working with blk-mq.

This patch is a maintained, refactored subset of code originally published by
Wedson Almeida Filho <[email protected]> [1].

[1] https://github.com/wedsonaf/linux/tree/f2cfd2fe0e2ca4e90994f96afe268bbd4382a891/rust/kernel/blk/mq.rs

Cc: Wedson Almeida Filho <[email protected]>
Signed-off-by: Andreas Hindborg <[email protected]>
---
block/blk-mq.c | 3 +-
include/linux/blk-mq.h | 1 +
rust/bindings/bindings_helper.h | 2 +
rust/helpers.c | 45 ++++
rust/kernel/block.rs | 5 +
rust/kernel/block/mq.rs | 131 +++++++++++
rust/kernel/block/mq/gen_disk.rs | 174 +++++++++++++++
rust/kernel/block/mq/operations.rs | 346 +++++++++++++++++++++++++++++
rust/kernel/block/mq/raw_writer.rs | 60 +++++
rust/kernel/block/mq/request.rs | 182 +++++++++++++++
rust/kernel/block/mq/tag_set.rs | 117 ++++++++++
rust/kernel/error.rs | 5 +
rust/kernel/lib.rs | 1 +
13 files changed, 1071 insertions(+), 1 deletion(-)
create mode 100644 rust/kernel/block.rs
create mode 100644 rust/kernel/block/mq.rs
create mode 100644 rust/kernel/block/mq/gen_disk.rs
create mode 100644 rust/kernel/block/mq/operations.rs
create mode 100644 rust/kernel/block/mq/raw_writer.rs
create mode 100644 rust/kernel/block/mq/request.rs
create mode 100644 rust/kernel/block/mq/tag_set.rs

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 2dc01551e27c..a531f664bee7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -702,7 +702,7 @@ static void blk_mq_finish_request(struct request *rq)
}
}

-static void __blk_mq_free_request(struct request *rq)
+void __blk_mq_free_request(struct request *rq)
{
struct request_queue *q = rq->q;
struct blk_mq_ctx *ctx = rq->mq_ctx;
@@ -722,6 +722,7 @@ static void __blk_mq_free_request(struct request *rq)
blk_mq_sched_restart(hctx);
blk_queue_exit(q);
}
+EXPORT_SYMBOL_GPL(__blk_mq_free_request);

void blk_mq_free_request(struct request *rq)
{
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 7a8150a5f051..842bb88e6e78 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -703,6 +703,7 @@ int blk_mq_alloc_sq_tag_set(struct blk_mq_tag_set *set,
unsigned int set_flags);
void blk_mq_free_tag_set(struct blk_mq_tag_set *set);

+void __blk_mq_free_request(struct request *rq);
void blk_mq_free_request(struct request *rq);
int blk_rq_poll(struct request *rq, struct io_comp_batch *iob,
unsigned int poll_flags);
diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
index f8e54d398c19..df18acb229d9 100644
--- a/rust/bindings/bindings_helper.h
+++ b/rust/bindings/bindings_helper.h
@@ -7,6 +7,8 @@
*/

#include <kunit/test.h>
+#include <linux/blk_types.h>
+#include <linux/blk-mq.h>
#include <linux/errname.h>
#include <linux/ethtool.h>
#include <linux/mdio.h>
diff --git a/rust/helpers.c b/rust/helpers.c
index 66411845536e..017fa90366e6 100644
--- a/rust/helpers.c
+++ b/rust/helpers.c
@@ -21,6 +21,9 @@
*/

#include <kunit/test-bug.h>
+#include <linux/bio.h>
+#include <linux/blk-mq.h>
+#include <linux/blkdev.h>
#include <linux/bug.h>
#include <linux/build_bug.h>
#include <linux/cacheflush.h>
@@ -242,6 +245,30 @@ void *rust_helper_kmap_local_folio(struct folio *folio, size_t offset)
}
EXPORT_SYMBOL_GPL(rust_helper_kmap_local_folio);

+struct bio_vec rust_helper_req_bvec(struct request *rq)
+{
+ return req_bvec(rq);
+}
+EXPORT_SYMBOL_GPL(rust_helper_req_bvec);
+
+void *rust_helper_blk_mq_rq_to_pdu(struct request *rq)
+{
+ return blk_mq_rq_to_pdu(rq);
+}
+EXPORT_SYMBOL_GPL(rust_helper_blk_mq_rq_to_pdu);
+
+struct request *rust_helper_blk_mq_rq_from_pdu(void *pdu) {
+ return blk_mq_rq_from_pdu(pdu);
+}
+EXPORT_SYMBOL_GPL(rust_helper_blk_mq_rq_from_pdu);
+
+void rust_helper_bio_advance_iter_single(const struct bio *bio,
+ struct bvec_iter *iter,
+ unsigned int bytes) {
+ bio_advance_iter_single(bio, iter, bytes);
+}
+EXPORT_SYMBOL_GPL(rust_helper_bio_advance_iter_single);
+
void *rust_helper_kmap(struct page *page)
{
return kmap(page);
@@ -306,6 +333,24 @@ int rust_helper_xa_err(void *entry)
}
EXPORT_SYMBOL_GPL(rust_helper_xa_err);

+bool rust_helper_req_ref_inc_not_zero(struct request *req)
+{
+ return atomic_inc_not_zero(&req->ref);
+}
+EXPORT_SYMBOL_GPL(rust_helper_req_ref_inc_not_zero);
+
+bool rust_helper_req_ref_put_and_test(struct request *req)
+{
+ return atomic_dec_and_test(&req->ref);
+}
+EXPORT_SYMBOL_GPL(rust_helper_req_ref_put_and_test);
+
+void rust_helper_blk_mq_free_request_internal(struct request *req)
+{
+ __blk_mq_free_request(req);
+}
+EXPORT_SYMBOL_GPL(rust_helper_blk_mq_free_request_internal);
+
/*
* `bindgen` binds the C `size_t` type as the Rust `usize` type, so we can
* use it in contexts where Rust expects a `usize` like slice (array) indices.
diff --git a/rust/kernel/block.rs b/rust/kernel/block.rs
new file mode 100644
index 000000000000..4c93317a568a
--- /dev/null
+++ b/rust/kernel/block.rs
@@ -0,0 +1,5 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! Types for working with the block layer
+
+pub mod mq;
diff --git a/rust/kernel/block/mq.rs b/rust/kernel/block/mq.rs
new file mode 100644
index 000000000000..08de1cc114ff
--- /dev/null
+++ b/rust/kernel/block/mq.rs
@@ -0,0 +1,131 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! This module provides types for implementing block drivers that interface
+//! with the blk-mq subsystem.
+//!
+//! To implement a block device driver, a Rust module must do the following:
+//!
+//! - Implement [`Operations`] for a type `T`
+//! - Create a [`TagSet<T>`]
+//! - Create a [`GenDisk<T>`], passing in the `TagSet` reference
+//! - Add the disk to the system by calling [`GenDisk::add`]
+//!
+//! The types available in this module that have direct C counterparts are:
+//!
+//! - The `TagSet` type that abstracts the C type `struct blk_mq_tag_set`.
+//! - The `GenDisk` type that abstracts the C type `struct gendisk`.
+//! - The `Request` type that abstracts the C type `struct request`.
+//!
+//! Many of the C types that this module abstracts allow a driver to carry
+//! private data, either embedded in the struct directly, or as a C `void*`. In
+//! these abstractions, this data is typed. The types of the data are defined by
+//! associated types in `Operations`, see [`Operations::RequestData`] for an
+//! example.
+//!
+//! The kernel will interface with the block device driver by calling the method
+//! implementations of the `Operations` trait.
+//!
+//! IO requests are passed to the driver as [`Request`] references. The
+//! `Request` type is a wrapper around the C `struct request`. The driver must
+//! mark start of request processing by calling [`Request::start`] and end of
+//! processing by calling one of the [`Request::end`] methods. Failure to do so
+//! can lead to IO failures.
+//!
+//! The `TagSet` is responsible for creating and maintaining a mapping between
+//! `Request`s and integer ids as well as carrying a pointer to the vtable
+//! generated by `Operations`. This mapping is useful for associating
+//! completions from hardware with the correct `Request` instance. The `TagSet`
+//! determines the maximum queue depth by setting the number of `Request`
+//! instances available to the driver, and it determines the number of queues to
+//! instantiate for the driver. If possible, a driver should allocate one queue
+//! per core, to keep queue data local to a core.
+//!
+//! One `TagSet` instance can be shared between multiple `GenDisk` instances.
+//! This can be useful when implementing drivers where one piece of hardware
+//! with one set of IO resources is represented to the user as multiple disks.
+//!
+//! One significant difference between block device drivers implemented with
+//! these Rust abstractions and drivers implemented in C, is that the Rust
+//! drivers have to own a reference count on the `Request` type when the IO is
+//! in flight. This is to ensure that the C `struct request` instances backing
+//! the Rust `Request` instances are live while the Rust driver holds a
+//! reference to the `Request`. In addition, the conversion of an integer tag to
+//! a `Request` via the `TagSet` would not be sound without this bookkeeping.
+//!
+//! # ⚠ Note
+//!
+//! For Rust block device drivers, the point in time where a request
+//! is freed and made available for recycling is usually at the point in time
+//! when the last `ARef<Request>` is dropped. For C drivers, this event usually
+//! occurs when `bindings::blk_mq_end_request` is called.
+//!
+//! # Example
+//!
+//! ```rust
+//! use kernel::{
+//! block::mq::*,
+//! new_mutex,
+//! prelude::*,
+//! sync::{Arc, Mutex},
+//! types::{ARef, ForeignOwnable},
+//! };
+//!
+//! struct MyBlkDevice;
+//!
+//! #[vtable]
+//! impl Operations for MyBlkDevice {
+//! type RequestData = ();
+//! type RequestDataInit = impl PinInit<()>;
+//! type QueueData = ();
+//! type HwData = ();
+//! type TagSetData = ();
+//!
+//! fn new_request_data(
+//! _tagset_data: <Self::TagSetData as ForeignOwnable>::Borrowed<'_>,
+//! ) -> Self::RequestDataInit {
+//! kernel::init::zeroed()
+//! }
+//!
+//! fn queue_rq(_hw_data: (), _queue_data: (), rq: ARef<Request<Self>>, _is_last: bool) -> Result {
+//! rq.start();
+//! rq.end_ok();
+//! Ok(())
+//! }
+//!
+//! fn commit_rqs(
+//! _hw_data: <Self::HwData as ForeignOwnable>::Borrowed<'_>,
+//! _queue_data: <Self::QueueData as ForeignOwnable>::Borrowed<'_>,
+//! ) {
+//! }
+//!
+//! fn complete(rq: &Request<Self>) {
+//! rq.end_ok();
+//! }
+//!
+//! fn init_hctx(
+//! _tagset_data: <Self::TagSetData as ForeignOwnable>::Borrowed<'_>,
+//! _hctx_idx: u32,
+//! ) -> Result<Self::HwData> {
+//! Ok(())
+//! }
+//! }
+//!
+//! let tagset: Arc<TagSet<MyBlkDevice>> = Arc::pin_init(TagSet::try_new(1, (), 256, 1))?;
+//! let mut disk = GenDisk::try_new(tagset, ())?;
+//! disk.set_name(format_args!("myblk"))?;
+//! disk.set_capacity_sectors(4096);
+//! disk.add()?;
+//!
+//! # Ok::<(), kernel::error::Error>(())
+//! ```
+
+mod gen_disk;
+mod operations;
+mod raw_writer;
+mod request;
+mod tag_set;
+
+pub use gen_disk::GenDisk;
+pub use operations::Operations;
+pub use request::{Request, RequestDataRef};
+pub use tag_set::TagSet;
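The refcount discipline described in the module documentation above (implemented in this series by the `req_ref_inc_not_zero` and `req_ref_put_and_test` helpers) follows the usual `atomic_inc_not_zero` pattern. A standalone sketch of the two operations, assuming a plain `AtomicU64` stands in for the `ref` field of `struct request`:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Take a reference only if the object is still live (refcount > 0).
/// Mirrors `atomic_inc_not_zero`: an object whose count already dropped
/// to zero is never resurrected.
fn ref_inc_not_zero(refcount: &AtomicU64) -> bool {
    refcount
        .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |v| {
            if v == 0 { None } else { Some(v + 1) }
        })
        .is_ok()
}

/// Drop a reference; returns true if this was the last one, in which
/// case the caller is responsible for freeing the object.
fn ref_put_and_test(refcount: &AtomicU64) -> bool {
    refcount.fetch_sub(1, Ordering::SeqCst) == 1
}

fn main() {
    let rc = AtomicU64::new(1);
    assert!(ref_inc_not_zero(&rc)); // 1 -> 2
    assert!(!ref_put_and_test(&rc)); // 2 -> 1, not the last reference
    assert!(ref_put_and_test(&rc)); // 1 -> 0, last reference: free it
    assert!(!ref_inc_not_zero(&rc)); // a dead object cannot be revived
}
```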
diff --git a/rust/kernel/block/mq/gen_disk.rs b/rust/kernel/block/mq/gen_disk.rs
new file mode 100644
index 000000000000..b7845fc9e39f
--- /dev/null
+++ b/rust/kernel/block/mq/gen_disk.rs
@@ -0,0 +1,174 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! Generic disk abstraction.
+//!
+//! C header: [`include/linux/blkdev.h`](srctree/include/linux/blkdev.h)
+//! C header: [`include/linux/blk_mq.h`](srctree/include/linux/blk_mq.h)
+
+use crate::block::mq::{raw_writer::RawWriter, Operations, TagSet};
+use crate::{
+ bindings, error::from_err_ptr, error::Result, sync::Arc, types::ForeignOwnable,
+ types::ScopeGuard,
+};
+use core::fmt::{self, Write};
+
+/// A generic block device
+///
+/// # Invariants
+///
+/// - `gendisk` must always point to an initialized and valid `struct gendisk`.
+pub struct GenDisk<T: Operations> {
+ _tagset: Arc<TagSet<T>>,
+ gendisk: *mut bindings::gendisk,
+}
+
+// SAFETY: `GenDisk` is an owned pointer to a `struct gendisk` and an `Arc` to a
+// `TagSet`. It is safe to send this to other threads as long as `T` is `Send`.
+unsafe impl<T: Operations + Send> Send for GenDisk<T> {}
+
+impl<T: Operations> GenDisk<T> {
+ /// Try to create a new `GenDisk`
+ pub fn try_new(tagset: Arc<TagSet<T>>, queue_data: T::QueueData) -> Result<Self> {
+ let data = queue_data.into_foreign();
+ let recover_data = ScopeGuard::new(|| {
+ // SAFETY: T::QueueData was created by the call to `into_foreign()` above
+ unsafe { T::QueueData::from_foreign(data) };
+ });
+
+ let lock_class_key = crate::sync::LockClassKey::new();
+
+ // SAFETY: `tagset.raw_tag_set()` points to a valid and initialized tag set
+ let gendisk = from_err_ptr(unsafe {
+ bindings::__blk_mq_alloc_disk(
+ tagset.raw_tag_set(),
+ data.cast_mut(),
+ lock_class_key.as_ptr(),
+ )
+ })?;
+
+ const TABLE: bindings::block_device_operations = bindings::block_device_operations {
+ submit_bio: None,
+ open: None,
+ release: None,
+ ioctl: None,
+ compat_ioctl: None,
+ check_events: None,
+ unlock_native_capacity: None,
+ getgeo: None,
+ set_read_only: None,
+ swap_slot_free_notify: None,
+ report_zones: None,
+ devnode: None,
+ alternative_gpt_sector: None,
+ get_unique_id: None,
+ // TODO: Set to THIS_MODULE. Waiting for const_refs_to_static feature to be merged
+ // https://github.com/rust-lang/rust/issues/119618
+ owner: core::ptr::null_mut(),
+ pr_ops: core::ptr::null_mut(),
+ free_disk: None,
+ poll_bio: None,
+ };
+
+ // SAFETY: gendisk is a valid pointer as we initialized it above
+ unsafe { (*gendisk).fops = &TABLE };
+
+ recover_data.dismiss();
+ Ok(Self {
+ _tagset: tagset,
+ gendisk,
+ })
+ }
+
+ /// Set the name of the device
+ pub fn set_name(&mut self, args: fmt::Arguments<'_>) -> Result {
+ let mut raw_writer = RawWriter::from_array(
+ // SAFETY: By type invariant `self.gendisk` points to a valid and initialized instance
+ unsafe { &mut (*self.gendisk).disk_name },
+ );
+ raw_writer.write_fmt(args)?;
+ raw_writer.write_char('\0')?;
+ Ok(())
+ }
+
+ /// Register the device with the kernel. When this function returns, the
+ /// device is accessible from VFS. The kernel may issue reads to the device
+ /// during registration to discover partition information.
+ pub fn add(&self) -> Result {
+ crate::error::to_result(
+ // SAFETY: By type invariant, `self.gendisk` points to a valid and
+ // initialized instance of `struct gendisk`
+ unsafe {
+ bindings::device_add_disk(
+ core::ptr::null_mut(),
+ self.gendisk,
+ core::ptr::null_mut(),
+ )
+ },
+ )
+ }
+
+ /// Call to tell the block layer the capacity of the device in sectors (512B)
+ pub fn set_capacity_sectors(&self, sectors: u64) {
+ // SAFETY: By type invariant, `self.gendisk` points to a valid and
+ // initialized instance of `struct gendisk`
+ unsafe { bindings::set_capacity(self.gendisk, sectors) };
+ }
+
+ /// Set the logical block size of the device.
+ ///
+ /// This is the smallest unit the storage device can address. It is
+ /// typically 512 bytes.
+ pub fn set_queue_logical_block_size(&self, size: u32) {
+ // SAFETY: By type invariant, `self.gendisk` points to a valid and
+ // initialized instance of `struct gendisk`
+ unsafe { bindings::blk_queue_logical_block_size((*self.gendisk).queue, size) };
+ }
+
+ /// Set the physical block size of the device.
+ ///
+ /// This is the smallest unit a physical storage device can write
+ /// atomically. It is usually the same as the logical block size but may be
+ /// bigger. One example is SATA drives with 4KB sectors that expose a
+ /// 512-byte logical block size to the operating system.
+ pub fn set_queue_physical_block_size(&self, size: u32) {
+ // SAFETY: By type invariant, `self.gendisk` points to a valid and
+ // initialized instance of `struct gendisk`
+ unsafe { bindings::blk_queue_physical_block_size((*self.gendisk).queue, size) };
+ }
+
+ /// Set the rotational media attribute for the device
+ pub fn set_rotational(&self, rotational: bool) {
+ if !rotational {
+ // SAFETY: By type invariant, `self.gendisk` points to a valid and
+ // initialized instance of `struct gendisk`
+ unsafe {
+ bindings::blk_queue_flag_set(bindings::QUEUE_FLAG_NONROT, (*self.gendisk).queue)
+ };
+ } else {
+ // SAFETY: By type invariant, `self.gendisk` points to a valid and
+ // initialized instance of `struct gendisk`
+ unsafe {
+ bindings::blk_queue_flag_clear(bindings::QUEUE_FLAG_NONROT, (*self.gendisk).queue)
+ };
+ }
+ }
+}
+
+impl<T: Operations> Drop for GenDisk<T> {
+ fn drop(&mut self) {
+ // SAFETY: By type invariant of `Self`, `self.gendisk` points to a valid
+ // and initialized instance of `struct gendisk`. As such, `queuedata`
+ // was initialized by the initializer returned by `try_new` with a call
+ // to `ForeignOwnable::into_foreign`.
+ let queue_data = unsafe { (*(*self.gendisk).queue).queuedata };
+
+ // SAFETY: By type invariant, `self.gendisk` points to a valid and
+ // initialized instance of `struct gendisk`
+ unsafe { bindings::del_gendisk(self.gendisk) };
+
+ // SAFETY: `queue.queuedata` was created by `GenDisk::try_new()` with a
+ // call to `ForeignOwnable::into_pointer()` to create `queuedata`.
+ // `ForeignOwnable::from_foreign()` is only called here.
+ let _queue_data = unsafe { T::QueueData::from_foreign(queue_data) };
+ }
+}
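`GenDisk::try_new` above uses a `ScopeGuard` to undo the `into_foreign()` conversion if disk allocation fails, and `dismiss()` disarms the guard on the success path. The pattern can be sketched standalone; this is a minimal model, not the kernel's `ScopeGuard` (which carries data and has a richer API):

```rust
/// Minimal scope guard: runs the cleanup closure on drop unless dismissed.
struct ScopeGuard<F: FnOnce()>(Option<F>);

impl<F: FnOnce()> ScopeGuard<F> {
    fn new(cleanup: F) -> Self {
        Self(Some(cleanup))
    }

    /// Success path: disarm the guard so the cleanup does not run.
    fn dismiss(mut self) {
        self.0 = None;
    }
}

impl<F: FnOnce()> Drop for ScopeGuard<F> {
    fn drop(&mut self) {
        if let Some(cleanup) = self.0.take() {
            cleanup();
        }
    }
}

fn main() {
    use std::cell::Cell;
    let cleaned = Cell::new(false);

    // Error path: the guard goes out of scope armed, so cleanup runs.
    {
        let _guard = ScopeGuard::new(|| cleaned.set(true));
    }
    assert!(cleaned.get());

    // Success path: `dismiss()` disarms the guard, cleanup is skipped.
    cleaned.set(false);
    ScopeGuard::new(|| cleaned.set(true)).dismiss();
    assert!(!cleaned.get());
}
```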
diff --git a/rust/kernel/block/mq/operations.rs b/rust/kernel/block/mq/operations.rs
new file mode 100644
index 000000000000..53c6ad663208
--- /dev/null
+++ b/rust/kernel/block/mq/operations.rs
@@ -0,0 +1,346 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! This module provides an interface for blk-mq drivers to implement.
+//!
+//! C header: [`include/linux/blk-mq.h`](srctree/include/linux/blk-mq.h)
+
+use crate::{
+ bindings,
+ block::mq::Request,
+ error::{from_result, Result},
+ init::PinInit,
+ types::{ARef, ForeignOwnable},
+};
+use core::{marker::PhantomData, ptr::NonNull};
+
+use super::TagSet;
+
+/// Implement this trait to interface blk-mq as block devices
+#[macros::vtable]
+pub trait Operations: Sized {
+ /// Data associated with a request. This data is located next to the request
+ /// structure.
+ ///
+ /// To be able to handle accessing this data from interrupt context, this
+ /// data must be `Sync`.
+ type RequestData: Sized + Sync;
+
+ /// Initializer for `Self::RequestData`. Used to initialize the private
+ /// data area when the request structure is allocated.
+ type RequestDataInit: PinInit<Self::RequestData>;
+
+ /// Data associated with the `struct request_queue` that is allocated for
+ /// the `GenDisk` associated with this `Operations` implementation.
+ type QueueData: ForeignOwnable;
+
+ /// Data associated with a dispatch queue. This is stored as a pointer in
+ /// the C `struct blk_mq_hw_ctx` that represents a hardware queue.
+ type HwData: ForeignOwnable;
+
+ /// Data associated with a `TagSet`. This is stored as a pointer in `struct
+ /// blk_mq_tag_set`.
+ type TagSetData: ForeignOwnable;
+
+ /// Called by the kernel to get an initializer for a `Pin<&mut RequestData>`.
+ fn new_request_data(
+ //rq: ARef<Request<Self>>,
+ tagset_data: <Self::TagSetData as ForeignOwnable>::Borrowed<'_>,
+ ) -> Self::RequestDataInit;
+
+ /// Called by the kernel to queue a request with the driver. If `is_last` is
+ /// `false`, the driver is allowed to defer committing the request.
+ fn queue_rq(
+ hw_data: <Self::HwData as ForeignOwnable>::Borrowed<'_>,
+ queue_data: <Self::QueueData as ForeignOwnable>::Borrowed<'_>,
+ rq: ARef<Request<Self>>,
+ is_last: bool,
+ ) -> Result;
+
+ /// Called by the kernel to indicate that queued requests should be submitted
+ fn commit_rqs(
+ hw_data: <Self::HwData as ForeignOwnable>::Borrowed<'_>,
+ queue_data: <Self::QueueData as ForeignOwnable>::Borrowed<'_>,
+ );
+
+ /// Called by the kernel when the request is completed
+ fn complete(_rq: &Request<Self>);
+
+ /// Called by the kernel to allocate and initialize a driver specific hardware context data
+ fn init_hctx(
+ tagset_data: <Self::TagSetData as ForeignOwnable>::Borrowed<'_>,
+ hctx_idx: u32,
+ ) -> Result<Self::HwData>;
+
+ /// Called by the kernel to poll the device for completed requests. Only
+ /// used for poll queues.
+ fn poll(_hw_data: <Self::HwData as ForeignOwnable>::Borrowed<'_>) -> bool {
+ crate::build_error(crate::error::VTABLE_DEFAULT_ERROR)
+ }
+
+ /// Called by the kernel to map submission queues to CPU cores.
+ fn map_queues(_tag_set: &TagSet<Self>) {
+ crate::build_error(crate::error::VTABLE_DEFAULT_ERROR)
+ }
+
+ // There is no need for exit_request() because `drop` will be called.
+}
+
+/// A vtable for blk-mq to interact with a block device driver.
+///
+/// A `bindings::blk_mq_ops` vtable is constructed from pointers to the `extern
+/// "C"` functions of this struct, exposed through the `OperationsVTable::VTABLE`.
+///
+/// For general documentation of these methods, see the kernel source
+/// documentation related to `struct blk_mq_ops` in
+/// [`include/linux/blk-mq.h`].
+///
+/// [`include/linux/blk-mq.h`]: srctree/include/linux/blk-mq.h
+pub(crate) struct OperationsVTable<T: Operations>(PhantomData<T>);
+
+impl<T: Operations> OperationsVTable<T> {
+ // # Safety
+ //
+ // - The caller of this function must ensure that `hctx` and `bd` are valid
+ // and initialized. The pointees must outlive this function.
+ // - `hctx->driver_data` must be a pointer created by a call to
+ // `Self::init_hctx_callback()` and the pointee must outlive this
+ // function.
+ // - This function must not be called with a `hctx` for which
+ // `Self::exit_hctx_callback()` has been called.
+ // - `(*bd).rq` must point to a valid `bindings::request` with a positive refcount in the `ref` field.
+ unsafe extern "C" fn queue_rq_callback(
+ hctx: *mut bindings::blk_mq_hw_ctx,
+ bd: *const bindings::blk_mq_queue_data,
+ ) -> bindings::blk_status_t {
+ // SAFETY: `bd` is valid as required by the safety requirement for this
+ // function.
+ let request_ptr = unsafe { (*bd).rq };
+
+ // SAFETY: By C API contract, the pointee of `request_ptr` is valid and has a positive refcount
+ #[cfg_attr(not(CONFIG_DEBUG_MISC), allow(unused_variables))]
+ let updated = unsafe { bindings::req_ref_inc_not_zero(request_ptr) };
+
+ #[cfg(CONFIG_DEBUG_MISC)]
+ if !updated {
+ crate::pr_err!("Request ref was zero at queue time\n");
+ }
+
+ let rq =
+ // SAFETY: We own a refcount that we took above. We pass that to
+ // `ARef`.
+ unsafe { ARef::from_raw(NonNull::new_unchecked(request_ptr.cast::<Request<T>>())) };
+
+ // SAFETY: The safety requirement for this function ensure that `hctx`
+ // is valid and that `driver_data` was produced by a call to
+ // `into_foreign` in `Self::init_hctx_callback`.
+ let hw_data = unsafe { T::HwData::borrow((*hctx).driver_data) };
+
+ // SAFETY: `hctx` is valid as required by this function.
+ let queue_data = unsafe { (*(*hctx).queue).queuedata };
+
+ // SAFETY: `queue.queuedata` was created by `GenDisk::try_new()` with a
+ // call to `ForeignOwnable::into_pointer()` to create `queuedata`.
+ // `ForeignOwnable::from_foreign()` is only called when the tagset is
+ // dropped, which happens after we are dropped.
+ let queue_data = unsafe { T::QueueData::borrow(queue_data) };
+
+ let ret = T::queue_rq(
+ hw_data,
+ queue_data,
+ rq,
+ // SAFETY: `bd` is valid as required by the safety requirement for this function.
+ unsafe { (*bd).last },
+ );
+ if let Err(e) = ret {
+ e.to_blk_status()
+ } else {
+ bindings::BLK_STS_OK as _
+ }
+ }
+
+ /// # Safety
+ ///
+ /// This function may only be called by blk-mq C infrastructure. The caller
+ /// must ensure that `hctx` is valid.
+ unsafe extern "C" fn commit_rqs_callback(hctx: *mut bindings::blk_mq_hw_ctx) {
+ // SAFETY: `driver_data` was installed by us in `init_hctx_callback` as
+ // the result of a call to `into_foreign`.
+ let hw_data = unsafe { T::HwData::borrow((*hctx).driver_data) };
+
+ // SAFETY: `hctx` is valid as required by this function.
+ let queue_data = unsafe { (*(*hctx).queue).queuedata };
+
+ // SAFETY: `queue.queuedata` was created by `GenDisk::try_new()` with a
+ // call to `ForeignOwnable::into_pointer()` to create `queuedata`.
+ // `ForeignOwnable::from_foreign()` is only called when the tagset is
+ // dropped, which happens after we are dropped.
+ let queue_data = unsafe { T::QueueData::borrow(queue_data) };
+ T::commit_rqs(hw_data, queue_data)
+ }
+
+ /// # Safety
+ ///
+ /// This function may only be called by blk-mq C infrastructure. `rq` must
+ /// point to a valid request that has been marked as completed. The pointee
+ /// of `rq` must be valid for write for the duration of this function.
+ unsafe extern "C" fn complete_callback(rq: *mut bindings::request) {
+ // SAFETY: By function safety requirement, `rq` is valid for write for the
+ // lifetime of the returned `Request`.
+ T::complete(unsafe { Request::from_ptr_mut(rq) });
+ }
+
+ /// # Safety
+ ///
+ /// This function may only be called by blk-mq C infrastructure. `hctx` must
+ /// be a pointer to a valid and aligned `struct blk_mq_hw_ctx` that was
+ /// previously initialized by a call to `init_hctx_callback`.
+ unsafe extern "C" fn poll_callback(
+ hctx: *mut bindings::blk_mq_hw_ctx,
+ _iob: *mut bindings::io_comp_batch,
+ ) -> core::ffi::c_int {
+ // SAFETY: By function safety requirement, `hctx` was initialized by
+ // `init_hctx_callback` and thus `driver_data` came from a call to
+ // `into_foreign`.
+ let hw_data = unsafe { T::HwData::borrow((*hctx).driver_data) };
+ T::poll(hw_data).into()
+ }
+
+ /// # Safety
+ ///
+ /// This function may only be called by blk-mq C infrastructure.
+ /// `tagset_data` must be initialized by the initializer returned by
+ /// `TagSet::try_new` as part of tag set initialization. `hctx` must be a
+ /// pointer to a valid `blk_mq_hw_ctx` where the `driver_data` field was not
+ /// yet initialized. This function may only be called once before
+ /// `exit_hctx_callback` is called for the same context.
+ unsafe extern "C" fn init_hctx_callback(
+ hctx: *mut bindings::blk_mq_hw_ctx,
+ tagset_data: *mut core::ffi::c_void,
+ hctx_idx: core::ffi::c_uint,
+ ) -> core::ffi::c_int {
+ from_result(|| {
+ // SAFETY: By the safety requirements of this function,
+ // `tagset_data` came from a call to `into_foreign` when the
+ // `TagSet` was initialized.
+ let tagset_data = unsafe { T::TagSetData::borrow(tagset_data) };
+ let data = T::init_hctx(tagset_data, hctx_idx)?;
+
+ // SAFETY: By the safety requirements of this function, `hctx` is
+ // valid for write.
+ unsafe { (*hctx).driver_data = data.into_foreign() as _ };
+ Ok(0)
+ })
+ }
+
+ /// # Safety
+ ///
+ /// This function may only be called by blk-mq C infrastructure. `hctx` must
+ /// be a valid pointer that was previously initialized by a call to
+ /// `init_hctx_callback`. This function may be called only once after
+ /// `init_hctx_callback` was called.
+ unsafe extern "C" fn exit_hctx_callback(
+ hctx: *mut bindings::blk_mq_hw_ctx,
+ _hctx_idx: core::ffi::c_uint,
+ ) {
+ // SAFETY: By the safety requirements of this function, `hctx` is valid for read.
+ let ptr = unsafe { (*hctx).driver_data };
+
+ // SAFETY: By the safety requirements of this function, `ptr` came from
+ // a call to `into_foreign` in `init_hctx_callback`
+ unsafe { T::HwData::from_foreign(ptr) };
+ }
+
+ /// # Safety
+ ///
+ /// This function may only be called by blk-mq C infrastructure. `set` must point to an initialized `TagSet<T>`.
+ unsafe extern "C" fn init_request_callback(
+ set: *mut bindings::blk_mq_tag_set,
+ rq: *mut bindings::request,
+ _hctx_idx: core::ffi::c_uint,
+ _numa_node: core::ffi::c_uint,
+ ) -> core::ffi::c_int {
+ from_result(|| {
+ // SAFETY: The tagset invariants guarantee that all requests are allocated with extra memory
+ // for the request data.
+ let pdu = unsafe { bindings::blk_mq_rq_to_pdu(rq) }.cast::<T::RequestData>();
+
+ // SAFETY: Because `set` is a `TagSet<T>`, `driver_data` comes from
+ // a call to `into_foreign` by the initializer returned by
+ // `TagSet::try_new`.
+ let tagset_data = unsafe { T::TagSetData::borrow((*set).driver_data) };
+
+ let initializer = T::new_request_data(tagset_data);
+
+ // SAFETY: `pdu` is a valid pointer as established above. We do not
+ // touch `pdu` if `__pinned_init` returns an error. We promise not to
+ // move the pointee of `pdu`.
+ unsafe { initializer.__pinned_init(pdu)? };
+
+ Ok(0)
+ })
+ }
+
+ /// # Safety
+ ///
+ /// This function may only be called by blk-mq C infrastructure. `rq` must
+ /// point to a request that was initialized by a call to
+ /// `Self::init_request_callback`.
+ unsafe extern "C" fn exit_request_callback(
+ _set: *mut bindings::blk_mq_tag_set,
+ rq: *mut bindings::request,
+ _hctx_idx: core::ffi::c_uint,
+ ) {
+ // SAFETY: The tagset invariants guarantee that all requests are allocated with extra memory
+ // for the request data.
+ let pdu = unsafe { bindings::blk_mq_rq_to_pdu(rq) }.cast::<T::RequestData>();
+
+ // SAFETY: `pdu` is valid for read and write and is properly initialised.
+ unsafe { core::ptr::drop_in_place(pdu) };
+ }
+
+ /// # Safety
+ ///
+ /// This function may only be called by blk-mq C infrastructure. `tag_set`
+ /// must be a pointer to a valid and initialized `TagSet<T>`. The pointee
+ /// must be valid for use as a reference for at least the duration of this call.
+ unsafe extern "C" fn map_queues_callback(tag_set: *mut bindings::blk_mq_tag_set) {
+ // SAFETY: The safety requirements of this function satisfy the
+ // requirements of `TagSet::from_ptr`.
+ let tag_set = unsafe { TagSet::from_ptr(tag_set) };
+ T::map_queues(tag_set);
+ }
+
+ const VTABLE: bindings::blk_mq_ops = bindings::blk_mq_ops {
+ queue_rq: Some(Self::queue_rq_callback),
+ queue_rqs: None,
+ commit_rqs: Some(Self::commit_rqs_callback),
+ get_budget: None,
+ put_budget: None,
+ set_rq_budget_token: None,
+ get_rq_budget_token: None,
+ timeout: None,
+ poll: if T::HAS_POLL {
+ Some(Self::poll_callback)
+ } else {
+ None
+ },
+ complete: Some(Self::complete_callback),
+ init_hctx: Some(Self::init_hctx_callback),
+ exit_hctx: Some(Self::exit_hctx_callback),
+ init_request: Some(Self::init_request_callback),
+ exit_request: Some(Self::exit_request_callback),
+ cleanup_rq: None,
+ busy: None,
+ map_queues: if T::HAS_MAP_QUEUES {
+ Some(Self::map_queues_callback)
+ } else {
+ None
+ },
+ #[cfg(CONFIG_BLK_DEBUG_FS)]
+ show_rq: None,
+ };
+
+ pub(crate) const fn build() -> &'static bindings::blk_mq_ops {
+ &Self::VTABLE
+ }
+}
diff --git a/rust/kernel/block/mq/raw_writer.rs b/rust/kernel/block/mq/raw_writer.rs
new file mode 100644
index 000000000000..f7857740af29
--- /dev/null
+++ b/rust/kernel/block/mq/raw_writer.rs
@@ -0,0 +1,60 @@
+use core::{
+ fmt::{self, Write},
+ marker::PhantomData,
+};
+
+/// A mutable reference to a byte buffer into which a string can be written
+///
+/// # Invariants
+///
+/// * `ptr` is not aliased and valid for read and write for `len` bytes
+///
+pub(crate) struct RawWriter<'a> {
+ ptr: *mut u8,
+ len: usize,
+ _p: PhantomData<&'a ()>,
+}
+
+impl<'a> RawWriter<'a> {
+ /// Create a new `RawWriter` instance.
+ ///
+ /// # Safety
+ ///
+ /// * `ptr` must be valid for read and write for `len` consecutive `u8` elements
+ /// * `ptr` must not be aliased
+ unsafe fn new(ptr: *mut u8, len: usize) -> RawWriter<'a> {
+ Self {
+ ptr,
+ len,
+ _p: PhantomData,
+ }
+ }
+
+ pub(crate) fn from_array<const N: usize>(a: &'a mut [core::ffi::c_char; N]) -> RawWriter<'a> {
+ // SAFETY: the buffer of `a` is valid for read and write for at least `N` bytes
+ unsafe { Self::new(a.as_mut_ptr().cast::<u8>(), N) }
+ }
+}
+
+impl Write for RawWriter<'_> {
+ fn write_str(&mut self, s: &str) -> fmt::Result {
+ let bytes = s.as_bytes();
+ let len = bytes.len();
+ if len > self.len {
+ return Err(fmt::Error);
+ }
+
+ // SAFETY:
+ // * `bytes` is valid for reads of `bytes.len()` size because we hold a shared reference to `s`
+ // * By type invariant `self.ptr` is valid for writes for at least `self.len` bytes
+ // * The regions are not overlapping as `ptr` is not aliased
+ unsafe { core::ptr::copy_nonoverlapping(&bytes[0], self.ptr, len) };
+
+ // SAFETY: By type invariant of `Self`, `ptr` is in bounds of an
+ // allocation. Also by type invariant, the pointer resulting from this
+ // addition is also in bounds.
+ self.ptr = unsafe { self.ptr.add(len) };
+ self.len -= len;
+ Ok(())
+ }
+}
diff --git a/rust/kernel/block/mq/request.rs b/rust/kernel/block/mq/request.rs
new file mode 100644
index 000000000000..b4dacac5e091
--- /dev/null
+++ b/rust/kernel/block/mq/request.rs
@@ -0,0 +1,182 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! This module provides a wrapper for the C `struct request` type.
+//!
+//! C header: [`include/linux/blk-mq.h`](srctree/include/linux/blk-mq.h)
+
+use crate::{
+ bindings,
+ block::mq::Operations,
+ error::{Error, Result},
+ types::{ARef, AlwaysRefCounted, Opaque},
+};
+use core::{ffi::c_void, marker::PhantomData, ops::Deref};
+
+/// A wrapper around a blk-mq `struct request`. This represents an IO request.
+///
+/// # Invariants
+///
+/// * `self.0` is a valid `struct request` created by the C portion of the kernel
+/// * `self` is reference counted. A call to `req_ref_inc_not_zero` keeps the
+/// instance alive at least until a matching call to `req_ref_put_and_test`
+///
+#[repr(transparent)]
+pub struct Request<T: Operations>(Opaque<bindings::request>, PhantomData<T>);
+
+impl<T: Operations> Request<T> {
+ /// Create a `&mut Request` from a `bindings::request` pointer
+ ///
+ /// # Safety
+ ///
+ /// * `ptr` must be aligned and point to a valid `bindings::request` instance
+ /// * Caller must ensure that the pointee of `ptr` is live and owned
+ /// exclusively by caller for at least `'a`
+ ///
+ pub(crate) unsafe fn from_ptr_mut<'a>(ptr: *mut bindings::request) -> &'a mut Self {
+ // SAFETY:
+ // * The cast is valid as `Self` is transparent.
+ // * By safety requirements of this function, the reference will be
+ // valid for 'a.
+ unsafe { &mut *(ptr.cast::<Self>()) }
+ }
+
+ /// Get the command identifier for the request
+ pub fn command(&self) -> u32 {
+ // SAFETY: By C API contract and type invariant, `cmd_flags` is valid for read
+ unsafe { (*self.0.get()).cmd_flags & ((1 << bindings::REQ_OP_BITS) - 1) }
+ }
+
+ /// Call this to indicate to the kernel that the request has been issued by the driver
+ pub fn start(&self) {
+ // SAFETY: By type invariant, `self.0` is a valid `struct request`.
+ unsafe { bindings::blk_mq_start_request(self.0.get()) };
+ }
+
+ /// Call this to indicate to the kernel that the request has been completed without errors
+ pub fn end_ok(&self) {
+ // SAFETY: By type invariant, `self.0` is a valid `struct request`.
+ unsafe { bindings::blk_mq_end_request(self.0.get(), bindings::BLK_STS_OK as _) };
+ }
+
+ /// Call this to indicate to the kernel that the request completed with an error
+ pub fn end_err(&self, err: Error) {
+ // SAFETY: By type invariant, `self.0` is a valid `struct request`.
+ unsafe { bindings::blk_mq_end_request(self.0.get(), err.to_blk_status()) };
+ }
+
+ /// Call this to indicate that the request completed with the status indicated by `status`
+ pub fn end(&self, status: Result) {
+ if let Err(e) = status {
+ self.end_err(e);
+ } else {
+ self.end_ok();
+ }
+ }
+
+ /// Call this to schedule deferred completion of the request
+ pub fn complete(&self) {
+ // SAFETY: By type invariant, `self.0` is a valid `struct request`
+ if !unsafe { bindings::blk_mq_complete_request_remote(self.0.get()) } {
+ T::complete(self);
+ }
+ }
+
+ /// Get the target sector for the request
+ #[inline(always)]
+ pub fn sector(&self) -> usize {
+ // SAFETY: By type invariant of `Self`, `self.0` is valid and live.
+ unsafe { (*self.0.get()).__sector as usize }
+ }
+
+ /// Returns an owned reference to the per-request data associated with this
+ /// request
+ pub fn owned_data_ref(request: ARef<Self>) -> RequestDataRef<T> {
+ RequestDataRef::new(request)
+ }
+
+ /// Returns a reference to the per-request data associated with this request
+ pub fn data_ref(&self) -> &T::RequestData {
+ let request_ptr = self.0.get().cast::<bindings::request>();
+
+ // SAFETY: By type invariant, `request_ptr` points to a valid
+ // `struct request`, so it is safe to call `blk_mq_rq_to_pdu` on it
+ let p: *mut c_void = unsafe { bindings::blk_mq_rq_to_pdu(request_ptr) };
+
+ let p = p.cast::<T::RequestData>();
+
+ // SAFETY: By C API contract, `p` is initialized by a call to
+ // `OperationsVTable::init_request_callback()`. By existence of `&self`
+ // it must be valid for use as a shared reference.
+ unsafe { &*p }
+ }
+}
+
+// SAFETY: It is impossible to obtain an owned or mutable `Request`, so we can
+// mark it `Send`.
+unsafe impl<T: Operations> Send for Request<T> {}
+
+// SAFETY: `Request` references can be shared across threads.
+unsafe impl<T: Operations> Sync for Request<T> {}
+
+/// An owned reference to the per-request data of a `Request<T>`
+#[repr(transparent)]
+pub struct RequestDataRef<T: Operations> {
+ request: ARef<Request<T>>,
+}
+
+impl<T> RequestDataRef<T>
+where
+ T: Operations,
+{
+ /// Create a new instance.
+ fn new(request: ARef<Request<T>>) -> Self {
+ Self { request }
+ }
+
+ /// Get a reference to the underlying request
+ pub fn request(&self) -> &Request<T> {
+ &self.request
+ }
+}
+
+impl<T> Deref for RequestDataRef<T>
+where
+ T: Operations,
+{
+ type Target = T::RequestData;
+
+ fn deref(&self) -> &Self::Target {
+ self.request.data_ref()
+ }
+}
+
+// SAFETY: All instances of `Request<T>` are reference counted. This
+// implementation of `AlwaysRefCounted` ensures that incrementing the ref count
+// keeps the object alive in memory at least until a matching reference count
+// decrement is executed.
+unsafe impl<T: Operations> AlwaysRefCounted for Request<T> {
+ fn inc_ref(&self) {
+ // SAFETY: By type invariant `self.0` is a valid `struct request`
+ #[cfg_attr(not(CONFIG_DEBUG_MISC), allow(unused_variables))]
+ let updated = unsafe { bindings::req_ref_inc_not_zero(self.0.get()) };
+ #[cfg(CONFIG_DEBUG_MISC)]
+ if !updated {
+ crate::pr_err!("Request refcount zero on clone");
+ }
+ }
+
+ unsafe fn dec_ref(obj: core::ptr::NonNull<Self>) {
+ // SAFETY: By type invariant, `obj` references a valid `struct request`
+ let zero = unsafe { bindings::req_ref_put_and_test(obj.as_ref().0.get()) };
+ if zero {
+ // SAFETY: Because `zero` is true, we held the last reference to
+ // `obj`, so it is safe to free it.
+ unsafe {
+ bindings::blk_mq_free_request_internal(obj.as_ptr().cast::<bindings::request>())
+ };
+ }
+ }
+}
diff --git a/rust/kernel/block/mq/tag_set.rs b/rust/kernel/block/mq/tag_set.rs
new file mode 100644
index 000000000000..7f463b7e288b
--- /dev/null
+++ b/rust/kernel/block/mq/tag_set.rs
@@ -0,0 +1,117 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! This module provides the `TagSet` struct to wrap the C `struct blk_mq_tag_set`.
+//!
+//! C header: [`include/linux/blk-mq.h`](srctree/include/linux/blk-mq.h)
+
+use core::pin::Pin;
+
+use crate::{
+ bindings,
+ block::mq::{operations::OperationsVTable, Operations},
+ error::{self, Error, Result},
+ prelude::PinInit,
+ try_pin_init,
+ types::{ForeignOwnable, Opaque},
+};
+use core::{convert::TryInto, marker::PhantomData};
+use macros::{pin_data, pinned_drop};
+
+/// A wrapper for the C `struct blk_mq_tag_set`.
+///
+/// `struct blk_mq_tag_set` contains a `struct list_head` and so must be pinned.
+#[pin_data(PinnedDrop)]
+#[repr(transparent)]
+pub struct TagSet<T: Operations> {
+ #[pin]
+ inner: Opaque<bindings::blk_mq_tag_set>,
+ _p: PhantomData<T>,
+}
+
+impl<T: Operations> TagSet<T> {
+ /// Try to create a new tag set
+ pub fn try_new(
+ nr_hw_queues: u32,
+ tagset_data: T::TagSetData,
+ num_tags: u32,
+ num_maps: u32,
+ ) -> impl PinInit<Self, error::Error> {
+ try_pin_init!( TagSet {
+ inner <- Opaque::try_ffi_init(move |place: *mut bindings::blk_mq_tag_set| -> Result<()> {
+
+ // SAFETY: try_ffi_init promises that `place` is writable, and
+ // all-zeroes is a valid bit pattern for this structure.
+ unsafe { core::ptr::write_bytes(place, 0, 1) };
+
+ /// For a raw pointer to a struct, write a struct field without
+ /// creating a reference to the field
+ macro_rules! write_ptr_field {
+ ($target:ident, $field:ident, $value:expr) => {
+ ::core::ptr::write(::core::ptr::addr_of_mut!((*$target).$field), $value)
+ };
+ }
+
+ // SAFETY: try_ffi_init promises that `place` is writable
+ unsafe {
+ write_ptr_field!(place, ops, OperationsVTable::<T>::build());
+ write_ptr_field!(place, nr_hw_queues, nr_hw_queues);
+ write_ptr_field!(place, timeout, 0); // 0 means default, which is 30 * HZ in C
+ write_ptr_field!(place, numa_node, bindings::NUMA_NO_NODE);
+ write_ptr_field!(place, queue_depth, num_tags);
+ write_ptr_field!(place, cmd_size, core::mem::size_of::<T::RequestData>().try_into()?);
+ write_ptr_field!(place, flags, bindings::BLK_MQ_F_SHOULD_MERGE);
+ write_ptr_field!(place, driver_data, tagset_data.into_foreign() as _);
+ write_ptr_field!(place, nr_maps, num_maps);
+ }
+
+ // SAFETY: Relevant fields of `place` are initialised above
+ let ret = unsafe { bindings::blk_mq_alloc_tag_set(place) };
+ if ret < 0 {
+ // SAFETY: We created `driver_data` above with `into_foreign`
+ unsafe { T::TagSetData::from_foreign((*place).driver_data) };
+ return Err(Error::from_errno(ret));
+ }
+
+ Ok(())
+ }),
+ _p: PhantomData,
+ })
+ }
+
+ /// Return the pointer to the wrapped `struct blk_mq_tag_set`
+ pub(crate) fn raw_tag_set(&self) -> *mut bindings::blk_mq_tag_set {
+ self.inner.get()
+ }
+
+ /// Create a `TagSet<T>` from a raw pointer.
+ ///
+ /// # Safety
+ ///
+ /// `ptr` must be a pointer to a valid and initialized `TagSet<T>`. There
+ /// may be no other mutable references to the tag set. The pointee must be
+ /// live and valid at least for the duration of the returned lifetime `'a`.
+ pub(crate) unsafe fn from_ptr<'a>(ptr: *mut bindings::blk_mq_tag_set) -> &'a Self {
+ // SAFETY: By the safety requirements of this function, `ptr` is valid
+ // for use as a reference for the duration of `'a`.
+ unsafe { &*(ptr.cast::<Self>()) }
+ }
+}
+
+#[pinned_drop]
+impl<T: Operations> PinnedDrop for TagSet<T> {
+ fn drop(self: Pin<&mut Self>) {
+ // SAFETY: We are not moving self below
+ let this = unsafe { Pin::into_inner_unchecked(self) };
+
+ // SAFETY: `this.inner.get()` points to a valid `blk_mq_tag_set` and
+ // thus is safe to dereference.
+ let tagset_data = unsafe { (*this.inner.get()).driver_data };
+
+ // SAFETY: `inner` is valid and has been properly initialised during construction.
+ unsafe { bindings::blk_mq_free_tag_set(this.inner.get()) };
+
+ // SAFETY: `tagset_data` was created by a call to
+ // `ForeignOwnable::into_foreign` in `TagSet::try_new()`
+ unsafe { T::TagSetData::from_foreign(tagset_data) };
+ }
+}
diff --git a/rust/kernel/error.rs b/rust/kernel/error.rs
index 4f0c1edd63b7..c947fd631416 100644
--- a/rust/kernel/error.rs
+++ b/rust/kernel/error.rs
@@ -130,6 +130,11 @@ pub fn to_errno(self) -> core::ffi::c_int {
self.0
}

+ pub(crate) fn to_blk_status(self) -> bindings::blk_status_t {
+ // SAFETY: `self.0` is a valid error due to its invariant.
+ unsafe { bindings::errno_to_blk_status(self.0) }
+ }
+
/// Returns the error encoded as a pointer.
#[allow(dead_code)]
pub(crate) fn to_ptr<T>(self) -> *mut T {
diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs
index 638a68af341a..9f02a8b352e0 100644
--- a/rust/kernel/lib.rs
+++ b/rust/kernel/lib.rs
@@ -34,6 +34,7 @@
#[cfg(not(test))]
#[cfg(not(testlib))]
mod allocator;
+pub mod block;
mod build_assert;
mod cache_aligned;
pub mod error;
--
2.44.0


2024-03-13 14:14:02

by Andreas Hindborg

[permalink] [raw]
Subject: [RFC PATCH 4/5] rust: block: add rnull, Rust null_blk implementation

From: Andreas Hindborg <[email protected]>

Signed-off-by: Andreas Hindborg <[email protected]>
---
drivers/block/Kconfig | 4 +
drivers/block/Makefile | 3 +
drivers/block/rnull.rs | 323 +++++++++++++++++++++++++++++++++++++++++
rust/helpers.c | 1 +
scripts/Makefile.build | 2 +-
5 files changed, 332 insertions(+), 1 deletion(-)
create mode 100644 drivers/block/rnull.rs

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 5b9d4aaebb81..fb877d4f8ddf 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -354,6 +354,10 @@ config VIRTIO_BLK
This is the virtual block driver for virtio. It can be used with
QEMU based VMMs (like KVM or Xen). Say Y or M.

+config BLK_DEV_RUST_NULL
+ tristate "Rust null block driver"
+ depends on RUST
+
config BLK_DEV_RBD
tristate "Rados block device (RBD)"
depends on INET && BLOCK
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index 101612cba303..1105a2d4fdcb 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -9,6 +9,9 @@
# needed for trace events
ccflags-y += -I$(src)

+obj-$(CONFIG_BLK_DEV_RUST_NULL) += rnull_mod.o
+rnull_mod-y := rnull.o
+
obj-$(CONFIG_MAC_FLOPPY) += swim3.o
obj-$(CONFIG_BLK_DEV_SWIM) += swim_mod.o
obj-$(CONFIG_BLK_DEV_FD) += floppy.o
diff --git a/drivers/block/rnull.rs b/drivers/block/rnull.rs
new file mode 100644
index 000000000000..05fef30e910c
--- /dev/null
+++ b/drivers/block/rnull.rs
@@ -0,0 +1,323 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! This is a Rust implementation of the C null block driver.
+//!
+//! Supported features:
+//!
+//! - optional memory backing
+//! - blk-mq interface
+//! - direct completion
+//! - softirq completion
+//! - timer completion
+//!
+//! The driver is configured at module load time by parameters
+//! `param_memory_backed`, `param_capacity_mib`, `param_irq_mode` and
+//! `param_completion_time_nsec`.
+
+use core::ops::Deref;
+
+use kernel::{
+ bindings,
+ block::{
+ bio::Segment,
+ mq::{self, GenDisk, Operations, RequestDataRef, TagSet},
+ },
+ error::Result,
+ folio::*,
+ hrtimer::{RawTimer, TimerCallback},
+ new_mutex, pr_info,
+ prelude::*,
+ sync::{Arc, Mutex},
+ types::{ARef, ForeignOwnable},
+ xarray::XArray,
+};
+
+use kernel::new_spinlock;
+use kernel::CacheAligned;
+use kernel::sync::SpinLock;
+
+module! {
+ type: NullBlkModule,
+ name: "rnull_mod",
+ author: "Andreas Hindborg",
+ license: "GPL v2",
+ params: {
+ param_memory_backed: bool {
+ default: true,
+ permissions: 0,
+ description: "Use memory backing",
+ },
+ // Read as `u8` here; converted to `IRQMode` via `TryFrom` in `add_disk`
+ param_irq_mode: u8 {
+ default: 0,
+ permissions: 0,
+ description: "IRQ Mode (0: None, 1: Soft, 2: Timer)",
+ },
+ param_capacity_mib: u64 {
+ default: 4096,
+ permissions: 0,
+ description: "Device capacity in MiB",
+ },
+ param_completion_time_nsec: u64 {
+ default: 1_000_000,
+ permissions: 0,
+ description: "Completion time in nano seconds for timer mode",
+ },
+ param_block_size: u16 {
+ default: 4096,
+ permissions: 0,
+ description: "Block size in bytes",
+ },
+ },
+}
+
+#[derive(Debug)]
+enum IRQMode {
+ None,
+ Soft,
+ Timer,
+}
+
+impl TryFrom<u8> for IRQMode {
+ type Error = kernel::error::Error;
+
+ fn try_from(value: u8) -> Result<Self> {
+ match value {
+ 0 => Ok(Self::None),
+ 1 => Ok(Self::Soft),
+ 2 => Ok(Self::Timer),
+ _ => Err(kernel::error::code::EINVAL),
+ }
+ }
+}
+
+struct NullBlkModule {
+ _disk: Pin<Box<Mutex<GenDisk<NullBlkDevice>>>>,
+}
+
+fn add_disk(tagset: Arc<TagSet<NullBlkDevice>>) -> Result<GenDisk<NullBlkDevice>> {
+ let block_size = *param_block_size.read();
+ if block_size % 512 != 0 || !(512..=4096).contains(&block_size) {
+ return Err(kernel::error::code::EINVAL);
+ }
+
+ let irq_mode = (*param_irq_mode.read()).try_into()?;
+
+ let queue_data = Box::pin_init(pin_init!(
+ QueueData {
+ tree <- TreeContainer::new(),
+ completion_time_nsec: *param_completion_time_nsec.read(),
+ irq_mode,
+ memory_backed: *param_memory_backed.read(),
+ block_size,
+ }
+ ))?;
+
+ let block_size = queue_data.block_size;
+
+ let mut disk = GenDisk::try_new(tagset, queue_data)?;
+ disk.set_name(format_args!("rnullb{}", 0))?;
+ disk.set_capacity_sectors(*param_capacity_mib.read() << 11);
+ disk.set_queue_logical_block_size(block_size.into());
+ disk.set_queue_physical_block_size(block_size.into());
+ disk.set_rotational(false);
+ Ok(disk)
+}
+
+impl kernel::Module for NullBlkModule {
+ fn init(_module: &'static ThisModule) -> Result<Self> {
+ pr_info!("Rust null_blk loaded\n");
+ let tagset = Arc::pin_init(TagSet::try_new(1, (), 256, 1))?;
+ let disk = Box::pin_init(new_mutex!(add_disk(tagset)?, "nullb:disk"))?;
+
+ disk.lock().add()?;
+
+ Ok(Self { _disk: disk })
+ }
+}
+
+impl Drop for NullBlkModule {
+ fn drop(&mut self) {
+ pr_info!("Dropping rnullb\n");
+ }
+}
+
+struct NullBlkDevice;
+
+type Tree = XArray<Box<UniqueFolio>>;
+type TreeRef<'a> = &'a Tree;
+
+#[pin_data]
+struct TreeContainer {
+ // `XArray` is safe to use without a lock, as it applies internal locking.
+ // However, there are two reasons to use an external lock: a) cache line
+ // contention and b) we don't want to take the lock for each page we
+ // process.
+ //
+ // A: The `XArray` lock (xa_lock) is located on the same cache line as the
+ // xarray data pointer (xa_head). The effect of this arrangement is that
+ // under heavy contention, we often get a cache miss when we try to follow
+ // the data pointer after acquiring the lock. We would rather have consumers
+ // spinning on another lock, so we do not get a miss on xa_head. This issue
+ // can potentially be fixed by padding the C `struct xarray`.
+ //
+ // B: The current `XArray` Rust API requires that we take the `xa_lock` for
+ // each `XArray` operation. This is very inefficient when the lock is
+ // contended and we have many operations to perform. Eventually we should
+ // update the `XArray` API to allow multiple tree operations under a single
+ // lock acquisition. For now, serialize tree access with an external lock.
+ #[pin]
+ tree: CacheAligned<Tree>,
+ #[pin]
+ lock: CacheAligned<SpinLock<()>>,
+}
+
+impl TreeContainer {
+ fn new() -> impl PinInit<Self> {
+ pin_init!(TreeContainer {
+ tree <- CacheAligned::new_initializer(XArray::new(0)),
+ lock <- CacheAligned::new_initializer(new_spinlock!((), "rnullb:mem")),
+ })
+ }
+}
+
+#[pin_data]
+struct QueueData {
+ #[pin]
+ tree: TreeContainer,
+ completion_time_nsec: u64,
+ irq_mode: IRQMode,
+ memory_backed: bool,
+ block_size: u16,
+}
+
+impl NullBlkDevice {
+ #[inline(always)]
+ fn write(tree: TreeRef<'_>, sector: usize, segment: &Segment<'_>) -> Result {
+ let idx = sector >> bindings::PAGE_SECTORS_SHIFT;
+
+ let mut folio = if let Some(page) = tree.get_locked(idx) {
+ page
+ } else {
+ tree.set(idx, Box::try_new(Folio::try_new(0)?)?)?;
+ tree.get_locked(idx).unwrap()
+ };
+
+ segment.copy_to_folio(&mut folio)?;
+
+ Ok(())
+ }
+
+ #[inline(always)]
+ fn read(tree: TreeRef<'_>, sector: usize, segment: &mut Segment<'_>) -> Result {
+ let idx = sector >> bindings::PAGE_SECTORS_SHIFT;
+
+ if let Some(folio) = tree.get_locked(idx) {
+ segment.copy_from_folio(folio.deref())?;
+ }
+
+ Ok(())
+ }
+
+ #[inline(never)]
+ fn transfer(
+ command: bindings::req_op,
+ tree: TreeRef<'_>,
+ sector: usize,
+ segment: &mut Segment<'_>,
+ ) -> Result {
+ match command {
+ bindings::req_op_REQ_OP_WRITE => Self::write(tree, sector, segment)?,
+ bindings::req_op_REQ_OP_READ => Self::read(tree, sector, segment)?,
+ _ => (),
+ }
+ Ok(())
+ }
+}
+
+#[pin_data]
+struct Pdu {
+ #[pin]
+ timer: kernel::hrtimer::Timer<Self>,
+}
+
+impl TimerCallback for Pdu {
+ type Receiver = RequestDataRef<NullBlkDevice>;
+
+ fn run(this: Self::Receiver) {
+ this.request().end_ok();
+ }
+}
+
+kernel::impl_has_timer! {
+ impl HasTimer<Self> for Pdu { self.timer }
+}
+
+#[vtable]
+impl Operations for NullBlkDevice {
+ type RequestData = Pdu;
+ type RequestDataInit = impl PinInit<Pdu>;
+ type QueueData = Pin<Box<QueueData>>;
+ type HwData = ();
+ type TagSetData = ();
+
+ fn new_request_data(
+ _tagset_data: <Self::TagSetData as ForeignOwnable>::Borrowed<'_>,
+ ) -> Self::RequestDataInit {
+ pin_init!( Pdu {
+ timer <- kernel::hrtimer::Timer::new(),
+ })
+ }
+
+ #[inline(always)]
+ fn queue_rq(
+ _hw_data: (),
+ queue_data: &QueueData,
+ rq: ARef<mq::Request<Self>>,
+ _is_last: bool,
+ ) -> Result {
+ rq.start();
+ if queue_data.memory_backed {
+ let guard = queue_data.tree.lock.lock();
+ let tree = queue_data.tree.tree.deref();
+
+ let mut sector = rq.sector();
+ for bio in rq.bio_iter() {
+ for mut segment in bio.segment_iter() {
+ Self::transfer(rq.command(), tree, sector, &mut segment)?;
+ sector += segment.len() >> bindings::SECTOR_SHIFT;
+ }
+ }
+
+ drop(guard);
+ }
+
+ match queue_data.irq_mode {
+ IRQMode::None => rq.end_ok(),
+ IRQMode::Soft => rq.complete(),
+ IRQMode::Timer => {
+ mq::Request::owned_data_ref(rq).schedule(queue_data.completion_time_nsec)
+ }
+ }
+
+ Ok(())
+ }
+
+ fn commit_rqs(
+ _hw_data: <Self::HwData as ForeignOwnable>::Borrowed<'_>,
+ _queue_data: <Self::QueueData as ForeignOwnable>::Borrowed<'_>,
+ ) {
+ }
+
+ fn complete(rq: &mq::Request<Self>) {
+ rq.end_ok();
+ }
+
+ fn init_hctx(
+ _tagset_data: <Self::TagSetData as ForeignOwnable>::Borrowed<'_>,
+ _hctx_idx: u32,
+ ) -> Result<Self::HwData> {
+ Ok(())
+ }
+}
diff --git a/rust/helpers.c b/rust/helpers.c
index 017fa90366e6..9c8976629e90 100644
--- a/rust/helpers.c
+++ b/rust/helpers.c
@@ -200,6 +200,7 @@ struct page *rust_helper_folio_page(struct folio *folio, size_t n)
{
return folio_page(folio, n);
}
+EXPORT_SYMBOL_GPL(rust_helper_folio_page);

loff_t rust_helper_folio_pos(struct folio *folio)
{
diff --git a/scripts/Makefile.build b/scripts/Makefile.build
index dae447a1ad30..f64be2310010 100644
--- a/scripts/Makefile.build
+++ b/scripts/Makefile.build
@@ -262,7 +262,7 @@ $(obj)/%.lst: $(src)/%.c FORCE
# Compile Rust sources (.rs)
# ---------------------------------------------------------------------------

-rust_allowed_features := new_uninit,offset_of
+rust_allowed_features := new_uninit,offset_of,allocator_api,impl_trait_in_assoc_type

# `--out-dir` is required to avoid temporaries being created by `rustc` in the
# current working directory, which may be not accessible in the out-of-tree
--
2.44.0


2024-03-13 11:07:59

by Andreas Hindborg

[permalink] [raw]
Subject: [RFC PATCH 3/5] rust: block: allow `hrtimer::Timer` in `RequestData`

From: Andreas Hindborg <[email protected]>

Signed-off-by: Andreas Hindborg <[email protected]>
---
rust/kernel/block/mq/request.rs | 67 ++++++++++++++++++++++++++++++++-
1 file changed, 66 insertions(+), 1 deletion(-)

diff --git a/rust/kernel/block/mq/request.rs b/rust/kernel/block/mq/request.rs
index cccffde45981..8b7f08f894be 100644
--- a/rust/kernel/block/mq/request.rs
+++ b/rust/kernel/block/mq/request.rs
@@ -4,13 +4,16 @@
//!
//! C header: [`include/linux/blk-mq.h`](srctree/include/linux/blk-mq.h)

+use kernel::hrtimer::RawTimer;
+
use crate::{
bindings,
block::mq::Operations,
error::{Error, Result},
+ hrtimer::{HasTimer, TimerCallback},
types::{ARef, AlwaysRefCounted, Opaque},
};
-use core::{ffi::c_void, marker::PhantomData, ops::Deref};
+use core::{ffi::c_void, marker::PhantomData, ops::Deref, ptr::NonNull};

use crate::block::bio::Bio;
use crate::block::bio::BioIterator;
@@ -175,6 +178,68 @@ fn deref(&self) -> &Self::Target {
}
}

+impl<T> RawTimer for RequestDataRef<T>
+where
+ T: Operations,
+ T::RequestData: HasTimer<T::RequestData>,
+ T::RequestData: Sync,
+{
+ fn schedule(self, expires: u64) {
+ let self_ptr = self.deref() as *const T::RequestData;
+ core::mem::forget(self);
+
+ // SAFETY: `self_ptr` is a valid pointer to a `T::RequestData`
+ let timer_ptr = unsafe { T::RequestData::raw_get_timer(self_ptr) };
+
+ // `Timer` is `repr(transparent)`
+ let c_timer_ptr = timer_ptr.cast::<bindings::hrtimer>();
+
+ // Schedule the timer - if it is already scheduled it is removed and
+ // inserted
+
+ // SAFETY: c_timer_ptr points to a valid hrtimer instance that was
+ // initialized by `hrtimer_init`
+ unsafe {
+ bindings::hrtimer_start_range_ns(
+ c_timer_ptr as *mut _,
+ expires as i64,
+ 0,
+ bindings::hrtimer_mode_HRTIMER_MODE_REL,
+ );
+ }
+ }
+}
+
+impl<T> kernel::hrtimer::RawTimerCallback for RequestDataRef<T>
+where
+ T: Operations,
+ T: Sync,
+ T::RequestData: HasTimer<T::RequestData>,
+ T::RequestData: TimerCallback<Receiver = Self>,
+{
+ unsafe extern "C" fn run(ptr: *mut bindings::hrtimer) -> bindings::hrtimer_restart {
+ // `Timer` is `repr(transparent)`
+ let timer_ptr = ptr.cast::<kernel::hrtimer::Timer<T::RequestData>>();
+
+ // SAFETY: By C API contract `ptr` is the pointer we passed when
+ // enqueing the timer, so it is a `Timer<T::RequestData>` embedded in a `T::RequestData`
+ let receiver_ptr = unsafe { T::RequestData::timer_container_of(timer_ptr) };
+
+ // SAFETY: The pointer was returned by `T::timer_container_of` so it
+ // points to a valid `T::RequestData`
+ let request_ptr = unsafe { bindings::blk_mq_rq_from_pdu(receiver_ptr.cast::<c_void>()) };
+
+ // SAFETY: We own a refcount that we leaked during `RawTimer::schedule()`
+ let dref = RequestDataRef::new(unsafe {
+ ARef::from_raw(NonNull::new_unchecked(request_ptr.cast::<Request<T>>()))
+ });
+
+ T::RequestData::run(dref);
+
+ bindings::hrtimer_restart_HRTIMER_NORESTART
+ }
+}
+
// SAFETY: All instances of `Request<T>` are reference counted. This
// implementation of `AlwaysRefCounted` ensures that incrementing the ref count
// keeps the object alive in memory at least until a matching reference count
--
2.44.0


2024-03-15 12:18:21

by Ming Lei

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] rust: block: introduce `kernel::block::mq` module

On Fri, Mar 15, 2024 at 08:52:46AM +0100, Andreas Hindborg wrote:
> Miguel Ojeda <[email protected]> writes:
>
> > On Thu, Mar 14, 2024 at 8:23 PM Andreas Hindborg <[email protected]> wrote:
> >>
> >> The way the current code compiles, <kernel::block::mq::Request as
> >> kernel::types::AlwaysRefCounted>::dec_ref` is inlined into the `rnull`
> >> module. A relocation for `rust_helper_blk_mq_free_request_internal`
> >> appears in `rnull_mod.ko`. I didn't test it yet, but if
> >> `__blk_mq_free_request` (or the helper) is not exported, I don't think
> >> this would be possible?
> >
> > Yeah, something needs to be exported since there is a generic
> > involved, but even if you want to go the route of exporting only a
> > different symbol, you would still want to put it in the C header so
> > that you don't get the C missing declaration warning and so that we
> > don't have to write the declaration manually in the helper.
>
> That is what I did:
>
> @@ -703,6 +703,7 @@ int blk_mq_alloc_sq_tag_set(struct blk_mq_tag_set *set,
> unsigned int set_flags);
> void blk_mq_free_tag_set(struct blk_mq_tag_set *set);
>
> +void __blk_mq_free_request(struct request *rq);
> void blk_mq_free_request(struct request *rq);
> int blk_rq_poll(struct request *rq, struct io_comp_batch *iob,
> unsigned int poll_flags);

Can you explain in detail why one block layer internal helper is
called into rnull driver directly? It never happens in C driver code.


Thanks,
Ming


2024-03-15 15:25:25

by Ming Lei

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] rust: block: introduce `kernel::block::mq` module

On Fri, Mar 15, 2024 at 01:46:30PM +0100, Andreas Hindborg wrote:
> Ming Lei <[email protected]> writes:
> > On Fri, Mar 15, 2024 at 08:52:46AM +0100, Andreas Hindborg wrote:
> >> Miguel Ojeda <[email protected]> writes:
> >>
> >> > On Thu, Mar 14, 2024 at 8:23 PM Andreas Hindborg <[email protected]> wrote:
> >> >>
> >> >> The way the current code compiles, <kernel::block::mq::Request as
> >> >> kernel::types::AlwaysRefCounted>::dec_ref` is inlined into the `rnull`
> >> >> module. A relocation for `rust_helper_blk_mq_free_request_internal`
> >> >> appears in `rnull_mod.ko`. I didn't test it yet, but if
> >> >> `__blk_mq_free_request` (or the helper) is not exported, I don't think
> >> >> this would be possible?
> >> >
> >> > Yeah, something needs to be exported since there is a generic
> >> > involved, but even if you want to go the route of exporting only a
> >> > different symbol, you would still want to put it in the C header so
> >> > that you don't get the C missing declaration warning and so that we
> >> > don't have to write the declaration manually in the helper.
> >>
> >> That is what I did:
> >>
> >> @@ -703,6 +703,7 @@ int blk_mq_alloc_sq_tag_set(struct blk_mq_tag_set *set,
> >> unsigned int set_flags);
> >> void blk_mq_free_tag_set(struct blk_mq_tag_set *set);
> >>
> >> +void __blk_mq_free_request(struct request *rq);
> >> void blk_mq_free_request(struct request *rq);
> >> int blk_rq_poll(struct request *rq, struct io_comp_batch *iob,
> >> unsigned int poll_flags);
> >
> > Can you explain in detail why one block layer internal helper is
> > called into rnull driver directly? It never happens in C driver code.
>
> It is not the rust null block driver that calls this symbol directly. It
> is called by the Rust block device driver API. But because of inlining,
> the symbol is referenced from the loadable object.

What is the exact Rust block device driver API? The key point is how
the body of an exported kernel C API (EXPORT_SYMBOL) becomes inlined
into the Rust driver.

>
> The reason we have to call this symbol directly is to ensure proper
> lifetime of the `struct request`. For example in C, when a driver

Sounds like the Rust API still calls into __blk_mq_free_request() directly, right?

If that is the case, the use case needs to be justified, and you need
to write a standalone patch with the exact story for exporting
__blk_mq_free_request().

> converts a tag to a request, the developer makes sure to only ask for
> requests which are outstanding in the driver. In Rust, for the API to be
> sound, we must ensure that the developer cannot write safe code that
> obtains a reference to a request that is not owned by the driver.
>
> A similar issue exists in the null block driver when timer completions
> are enabled. If the request is cancelled and the timer fires after the
> request has been recycled, there is a problem because the timer holds a
> reference to the request private data area.
>
> To that end, I use the `atomic_t ref` field of the C `struct request`
> and implement the `AlwaysRefCounted` Rust trait for the request type.
> This is a smart pointer that owns a reference to the pointee. In this
> way, the request is not freed and recycled until the smart pointer is
> dropped. But if the smart pointer holds the last reference when it is
> dropped, it must be able to free the request, and hence it has to call
> `__blk_mq_free_request`.

For the callbacks (queue_rq, timeout, complete) implemented by the driver,
the block layer core guarantees that the passed request reference is live.

So the driver needn't worry about request lifetime. The same holds for a
Rust driver, so I think a smart pointer isn't necessary for using a
request in a Rust driver.



Thanks,
Ming


2024-03-16 14:49:09

by Ming Lei

Subject: Re: [RFC PATCH 1/5] rust: block: introduce `kernel::block::mq` module

On Fri, Mar 15, 2024 at 06:49:39PM +0100, Andreas Hindborg wrote:
> Ming Lei <[email protected]> writes:
>
> > On Fri, Mar 15, 2024 at 01:46:30PM +0100, Andreas Hindborg wrote:
> >> Ming Lei <[email protected]> writes:
> >> > On Fri, Mar 15, 2024 at 08:52:46AM +0100, Andreas Hindborg wrote:
> >> >> Miguel Ojeda <[email protected]> writes:
> >> >>
> >> >> > On Thu, Mar 14, 2024 at 8:23 PM Andreas Hindborg <[email protected]> wrote:
> >> >> >>
> >> >> >> The way the current code compiles, <kernel::block::mq::Request as
> >> >> >> kernel::types::AlwaysRefCounted>::dec_ref` is inlined into the `rnull`
> >> >> >> module. A relocation for `rust_helper_blk_mq_free_request_internal`
> >> >> >> appears in `rnull_mod.ko`. I didn't test it yet, but if
> >> >> >> `__blk_mq_free_request` (or the helper) is not exported, I don't think
> >> >> >> this would be possible?
> >> >> >
> >> >> > Yeah, something needs to be exported since there is a generic
> >> >> > involved, but even if you want to go the route of exporting only a
> >> >> > different symbol, you would still want to put it in the C header so
> >> >> > that you don't get the C missing declaration warning and so that we
> >> >> > don't have to write the declaration manually in the helper.
> >> >>
> >> >> That is what I did:
> >> >>
> >> >> @@ -703,6 +703,7 @@ int blk_mq_alloc_sq_tag_set(struct blk_mq_tag_set *set,
> >> >> unsigned int set_flags);
> >> >> void blk_mq_free_tag_set(struct blk_mq_tag_set *set);
> >> >>
> >> >> +void __blk_mq_free_request(struct request *rq);
> >> >> void blk_mq_free_request(struct request *rq);
> >> >> int blk_rq_poll(struct request *rq, struct io_comp_batch *iob,
> >> >> unsigned int poll_flags);
> >> >
> >> > Can you explain in detail why one block layer internal helper is
> >> > called into rnull driver directly? It never happens in C driver code.
> >>
> >> It is not the rust null block driver that calls this symbol directly. It
> >> is called by the Rust block device driver API. But because of inlining,
> >> the symbol is referenced from the loadable object.
> >
> > What is the exact Rust block device driver API? The key point is that how
> > the body of one exported kernel C API(EXPORT_SYMBOL) becomes inlined
> > with Rust driver.
>
> This happens when `ARef<Request<_>>` is dropped. The drop method
> (destructor) of this smart pointer decrements the refcount and
> potentially calls `__blk_mq_free_request`.
>
> >>
> >> The reason we have to call this symbol directly is to ensure proper
> >> lifetime of the `struct request`. For example in C, when a driver
> >
> > Sounds Rust API still calls into __blk_mq_free_request() directly, right?
>
> Yes, the Rust block device driver API will call this function if an
> `ARef<Request<_>>` is dropped and the refcount goes to 0.
>
> > If that is the case, the usecase need to be justified, and you need
> > to write one standalone patch with the exact story for exporting
> > __blk_mq_free_request().
>
> Ok, I can do that.
>
> >
> >> converts a tag to a request, the developer makes sure to only ask for
> >> requests which are outstanding in the driver. In Rust, for the API to be
> >> sound, we must ensure that the developer cannot write safe code that
> >> obtains a reference to a request that is not owned by the driver.
> >>
> >> A similar issue exists in the null block driver when timer completions
> >> are enabled. If the request is cancelled and the timer fires after the
> >> request has been recycled, there is a problem because the timer holds a
> >> reference to the request private data area.
> >>
> >> To that end, I use the `atomic_t ref` field of the C `struct request`
> >> and implement the `AlwaysRefCounted` Rust trait for the request type.
> >> This is a smart pointer that owns a reference to the pointee. In this
> >> way, the request is not freed and recycled until the smart pointer is
> >> dropped. But if the smart pointer holds the last reference when it is
> >> dropped, it must be able to free the request, and hence it has to call
> >> `__blk_mq_free_request`.
> >
> > For callbacks(queue_rq, timeout, complete) implemented by driver, block
> > layer core guaranteed that the passed request reference is live.
> >
> > So driver needn't to worry about request lifetime, same with Rust
> > driver, I think smart pointer isn't necessary for using request in
> > Rust driver.
>
> Using the C API, there is nothing preventing a driver from using the
> request after the lifetime ends.

Yes, that is true for C. So will Rust-for-Linux need to add refcounting for
most exported kernel C structures, such as by implementing the
AlwaysRefCounted trait?

> With Rust, we have to make it
> impossible. Without the refcount and associated call to
> `__blk_mq_free_request`, it would be possible to write Rust code that
> accesses the request after the lifetime ends. This is not sound, and it is
> something we need to avoid in the Rust abstractions.
>
> One concrete way to write unsound code with a Rust API where lifetime
> is not tracked with refcount, is if the null block timer completion
> callback fires after the request is completed. Perhaps the driver
> cancels the request but forgets to cancel the timer. When the timer
> fires, it will access the request via the context pointer, but the
> request will be invalid.

The issue is less serious for a blk-mq request, which is pre-allocated:
a freed request just means it can be re-allocated for other IO in the
same queue, and the memory it points to won't really be freed.

Also, as I mentioned, inside the driver's ->timeout() the request is
guaranteed to be live by the block layer core (it won't be re-allocated to
other IO); the passed-in request is referenced already, please see bt_iter(),
which is called from blk_mq_timeout_work(). Here, the block layer core just
borrows the request, then passes the reference to ->timeout(), while the
request is actually owned by the driver.

I understand that a Rust block driver still needs to implement ->queue_rq(),
->timeout(), ..., just like a C driver, but maybe I am wrong? Or will the
Rust block driver re-implement part of the block layer core code, such as
taking an extra reference on a request even though the block core has
already done that?

> In C we have to write the driver code so this
> cannot happen. In Rust, the API must prevent this from happening. So any
> driver written in the safe subset of Rust using this API can never
> trigger this behavior.
>
> By using the refcount, we ensure that the request is alive until all
> users who hold a reference to it are dropped.

The block layer already provides such a guarantee if the Rust driver
follows the current block driver model.

>
> Another concrete example is when a driver calls `blk_mq_tag_to_rq` with
> an invalid tag. This can return a reference to an invalid request, if the
> driver is not implemented correctly. By using `req_ref_inc_not_zero` we
> can assert that the request is live before we create a Rust reference to
> it, and even if the driver code has bugs, it can never access an invalid
> request, and thus it can be memory safe.
>
> We move the responsibility of correctness, in relation to memory safety,
> from the driver implementation to the API implementation.

After queue_rq(req) is called, request ownership is actually transferred to
the driver, like Rust's move; the driver is then free to call
blk_mq_tag_to_rq(), and finally returns the request to the block core after
completing it.

The biggest question should be how the Rust block driver will be designed
and implemented. Will the Rust block driver follow the current C driver
model, such as implementing ->queue_rq(), ->timeout(), ->complete(), ...?



thanks,
Ming


2024-03-17 05:56:19

by Matthew Wilcox

Subject: Re: [RFC PATCH 0/5] Rust block device driver API and null block driver

On Wed, Mar 13, 2024 at 12:05:07PM +0100, Andreas Hindborg wrote:
> - Adopted the folio abstraction where applicable

I don't think this was the correct move. The memory you're managing
with folios is only used for storing the data being stored in the blkdev.
It's not mapped to userspace, it doesn't make use of the flags (locked,
uptodate, writeback, etc), it doesn't need an LRU list, a mapping,
an index, a refcount or memcg.

I think you should use pages instead of folios to keep a handle on
this memory. Apologies for the confusion; we're in the middle of a
years-long transition from the overly complex and overused struct page
to splitting it into different data types for different purposes.

More detail on this here, if you're interested:
https://kernelnewbies.org/MatthewWilcox/Memdescs

2024-03-17 08:26:20

by Andreas Hindborg

Subject: Re: [RFC PATCH 0/5] Rust block device driver API and null block driver

Matthew Wilcox <[email protected]> writes:

> On Wed, Mar 13, 2024 at 12:05:07PM +0100, Andreas Hindborg wrote:
>> - Adopted the folio abstraction where applicable
>
> I don't think this was the correct move. The memory you're managing
> with folios is only used for storing the data being stored in the blkdev.
> It's not mapped to userspace, it doesn't make use of the flags (locked,
> uptodate, writeback, etc), it doesn't need an LRU list, a mapping,
> an index, a refcount or memcg.
>
> I think you should use pages instead of folios to keep a handle on
> this memory. Apologies for the confusion; we're in the middle of a
> years-long transition from the overly complex and overused struct page
> to splitting it into different data types for different purposes.

Ok, thanks. I'll swap it back.

I was under the impression that using folios also gives the advantage
that handles are always head pages: no need to worry about head/tail
pages. If the driver moves to higher-order pages for larger block
sizes, would folios then make sense, or are pages still
preferred?

> More detail on this here, if you're interested:
> https://kernelnewbies.org/MatthewWilcox/Memdescs

Thanks, that is useful.

Best regards,
Andreas

2024-03-18 00:01:20

by Matthew Wilcox

Subject: Re: [RFC PATCH 0/5] Rust block device driver API and null block driver

On Sun, Mar 17, 2024 at 08:09:53AM +0100, Andreas Hindborg wrote:
> I was under the impression that using folios also gives the advantage
> that handles are always head pages: no need to worry about head/tail
> pages. If the driver moves to higher-order pages for larger block
> sizes, would folios then make sense, or are pages still
> preferred?

This is a good question.

If you do enhance this driver to support larger block sizes, I would
recommend indexing the xarray by sector number instead of page number.
You would allocate a compound page, but at no point would you need to
ask the page what order it is (you already know), you don't need to
change the page flags, etc, etc.
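In arithmetic form, the suggested scheme might look like this (a sketch assuming 512-byte sectors and 4 KiB pages; `sector_to_slot` is an illustrative name, not code from the series): the xarray is keyed by the first sector of each block, and the byte offset within the compound page falls out of the remaining sector bits.

```rust
const SECTOR_SHIFT: u32 = 9; // 512-byte sectors
const PAGE_SHIFT: u32 = 12; // 4 KiB pages (illustrative)

// For a compound page of the given order, map a sector number to
// the key used to index the xarray (the block's first sector) and
// the byte offset of that sector within the compound page.
fn sector_to_slot(sector: u64, order: u32) -> (u64, usize) {
    // sectors covered by one block = 2^order pages * 8 sectors/page
    let sectors_per_block = 1u64 << (order + PAGE_SHIFT - SECTOR_SHIFT);
    let key = sector & !(sectors_per_block - 1);
    let offset = ((sector - key) << SECTOR_SHIFT) as usize;
    (key, offset)
}

fn main() {
    // order-2 compound page: 16 KiB block = 32 sectors
    assert_eq!(sector_to_slot(0, 2), (0, 0));
    assert_eq!(sector_to_slot(33, 2), (32, 512));
    assert_eq!(sector_to_slot(31, 2), (0, 31 * 512));
    println!("ok");
}
```

At no point does the lookup need to ask the page what order it is; the driver already knows, as Willy notes above.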

The page cache is in a bit of a different spot. It has to support
multiple folio sizes for the same file; it has to support folios being
split, it has to support lookup from users who don't know what the folio
size is, etc, etc.

I don't think there's ever a point at which you'd need to call
compound_head() because you'd always look up the head page. You'd still
want an iterator of some kind to copy to a compound page (because you
can only map one at a time). And that might look a lot like the folio
copying code. Probably the right way to handle this is for the folio
copying code to call the page copying code since a folio is always
composed of pages.
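Such an iterator boils down to walking the destination in single-page windows, since only one subpage can be mapped at a time. A userspace sketch of the copy loop (the chunked copy stands in for per-page kmap_local/kunmap_local; names are illustrative):

```rust
const PAGE_SIZE: usize = 4096;

// Copy `src` into the compound buffer `dst` starting at `offset`,
// touching at most one page-sized window per iteration. In the
// kernel, each window would be a kmap_local of one subpage.
fn copy_to_compound(dst: &mut [u8], mut offset: usize, mut src: &[u8]) {
    while !src.is_empty() {
        // bytes left in the current page of `dst`
        let in_page = PAGE_SIZE - (offset % PAGE_SIZE);
        let n = in_page.min(src.len());
        dst[offset..offset + n].copy_from_slice(&src[..n]);
        offset += n; // advance into the next subpage window
        src = &src[n..];
    }
}

fn main() {
    let mut block = vec![0u8; 4 * PAGE_SIZE]; // order-2 "compound page"
    let data = vec![0xabu8; PAGE_SIZE + 100]; // crosses a page boundary
    copy_to_compound(&mut block, PAGE_SIZE - 50, &data);
    assert!(block[PAGE_SIZE - 50..2 * PAGE_SIZE + 50].iter().all(|&b| b == 0xab));
    assert_eq!(block[PAGE_SIZE - 51], 0);
    assert_eq!(block[2 * PAGE_SIZE + 50], 0);
    println!("ok");
}
```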

2024-03-16 19:56:19

by Andreas Hindborg

Subject: Re: [RFC PATCH 1/5] rust: block: introduce `kernel::block::mq` module

Ming Lei <[email protected]> writes:

> On Fri, Mar 15, 2024 at 06:49:39PM +0100, Andreas Hindborg wrote:
>> Ming Lei <[email protected]> writes:
>>
>> > On Fri, Mar 15, 2024 at 01:46:30PM +0100, Andreas Hindborg wrote:
>> >> Ming Lei <[email protected]> writes:
>> >> > On Fri, Mar 15, 2024 at 08:52:46AM +0100, Andreas Hindborg wrote:
>> >> >> Miguel Ojeda <[email protected]> writes:
>> >> >>
>> >> >> > On Thu, Mar 14, 2024 at 8:23 PM Andreas Hindborg <[email protected]> wrote:
>> >> >> >>
>> >> >> >> The way the current code compiles, <kernel::block::mq::Request as
>> >> >> >> kernel::types::AlwaysRefCounted>::dec_ref` is inlined into the `rnull`
>> >> >> >> module. A relocation for `rust_helper_blk_mq_free_request_internal`
>> >> >> >> appears in `rnull_mod.ko`. I didn't test it yet, but if
>> >> >> >> `__blk_mq_free_request` (or the helper) is not exported, I don't think
>> >> >> >> this would be possible?
>> >> >> >
>> >> >> > Yeah, something needs to be exported since there is a generic
>> >> >> > involved, but even if you want to go the route of exporting only a
>> >> >> > different symbol, you would still want to put it in the C header so
>> >> >> > that you don't get the C missing declaration warning and so that we
>> >> >> > don't have to write the declaration manually in the helper.
>> >> >>
>> >> >> That is what I did:
>> >> >>
>> >> >> @@ -703,6 +703,7 @@ int blk_mq_alloc_sq_tag_set(struct blk_mq_tag_set *set,
>> >> >> unsigned int set_flags);
>> >> >> void blk_mq_free_tag_set(struct blk_mq_tag_set *set);
>> >> >>
>> >> >> +void __blk_mq_free_request(struct request *rq);
>> >> >> void blk_mq_free_request(struct request *rq);
>> >> >> int blk_rq_poll(struct request *rq, struct io_comp_batch *iob,
>> >> >> unsigned int poll_flags);
>> >> >
>> >> > Can you explain in detail why one block layer internal helper is
>> >> > called into rnull driver directly? It never happens in C driver code.
>> >>
>> >> It is not the rust null block driver that calls this symbol directly. It
>> >> is called by the Rust block device driver API. But because of inlining,
>> >> the symbol is referenced from the loadable object.
>> >
>> > What is the exact Rust block device driver API? The key point is that how
>> > the body of one exported kernel C API(EXPORT_SYMBOL) becomes inlined
>> > with Rust driver.
>>
>> This happens when `ARef<Request<_>>` is dropped. The drop method
>> (destructor) of this smart pointer decrements the refcount and
>> potentially calls `__blk_mq_free_request`.
>>
>> >>
>> >> The reason we have to call this symbol directly is to ensure proper
>> >> lifetime of the `struct request`. For example in C, when a driver
>> >
>> > Sounds Rust API still calls into __blk_mq_free_request() directly, right?
>>
>> Yes, the Rust block device driver API will call this function if an
>> `ARef<Request<_>>` is dropped and the refcount goes to 0.
>>
>> > If that is the case, the usecase need to be justified, and you need
>> > to write one standalone patch with the exact story for exporting
>> > __blk_mq_free_request().
>>
>> Ok, I can do that.
>>
>> >
>> >> converts a tag to a request, the developer makes sure to only ask for
>> >> requests which are outstanding in the driver. In Rust, for the API to be
>> >> sound, we must ensure that the developer cannot write safe code that
>> >> obtains a reference to a request that is not owned by the driver.
>> >>
>> >> A similar issue exists in the null block driver when timer completions
>> >> are enabled. If the request is cancelled and the timer fires after the
>> >> request has been recycled, there is a problem because the timer holds a
>> >> reference to the request private data area.
>> >>
>> >> To that end, I use the `atomic_t ref` field of the C `struct request`
>> >> and implement the `AlwaysRefCounted` Rust trait for the request type.
>> >> This is a smart pointer that owns a reference to the pointee. In this
>> >> way, the request is not freed and recycled until the smart pointer is
>> >> dropped. But if the smart pointer holds the last reference when it is
>> >> dropped, it must be able to free the request, and hence it has to call
>> >> `__blk_mq_free_request`.
>> >
>> > For callbacks(queue_rq, timeout, complete) implemented by driver, block
>> > layer core guaranteed that the passed request reference is live.
>> >
>> > So driver needn't to worry about request lifetime, same with Rust
>> > driver, I think smart pointer isn't necessary for using request in
>> > Rust driver.
>>
>> Using the C API, there is nothing preventing a driver from using the
>> request after the lifetime ends.
>
> Yes, it is true for C, so will Rust-for-linux need to add refcount for
> most exported kernel C structure? such as by implementing AlwaysRefCounted
> traits?

Not for all types, and not all the time. For instance, the Rust block
device driver API does not always use refcounting for `struct request`;
only when it cannot determine the lifetime of a request reference at
compile time.

>
>> With Rust, we have to make it
>> impossible. Without the refcount and associated call to
>> `__blk_mq_free_request`, it would be possible to write Rust code that
>> accesses the request after the lifetime ends. This is not sound, and it is
>> something we need to avoid in the Rust abstractions.
>>
>> One concrete way to write unsound code with a Rust API where lifetime
>> is not tracked with refcount, is if the null block timer completion
>> callback fires after the request is completed. Perhaps the driver
>> cancels the request but forgets to cancel the timer. When the timer
>> fires, it will access the request via the context pointer, but the
>> request will be invalid.
>
> The issue is less serious for blk-mq request, which is pre-allocated,
> and one freed request just means it can be re-allocated for other IO
> in same queue, and the pointed memory won't be really freed.

The issue here is not so much use after free as it is aliasing of
mutable and shared references. This is illegal in Rust, and programs
that allow this may exhibit undefined behavior.

> Also as I mentioned, inside driver's ->timeout(), the request is
> guaranteed to be live by block layer core(won't be re-allocated to other IO),
> the passed-in request is referenced already, please see bt_iter() which
> is called from blk_mq_timeout_work(). Here, block layer core just
> borrows request, then passed the reference to ->timeout(), when
> request is owned by driver actually.

I understand. I am not referring to `blk_mq_ops.timeout`. The null block
driver has a feature where requests complete after a delay. This is
implemented via the `hrtimer` subsystem.

> I understand Rust block driver still need to implement ->queue_rq(),
> ->timeout(), ..., just like C driver, but maybe I am wrong? Or Rust block driver
> will re-implement part of block layer core code? such as, get one extra
> reference of request no matter block core has done that.

The Rust block driver API implements the `blk_mq_ops` vtable on behalf
of the driver. There is a little bit of glue code inserted between the C
symbols called by the kernel and the Rust functions that the block device
driver provides.

>> In C we have to write the driver code so this
>> cannot happen. In Rust, the API must prevent this from happening. So any
>> driver written in the safe subset of Rust using this API can never
>> trigger this behavior.
>>
>> By using the refcount, we ensure that the request is alive until all
>> users who hold a reference to it are dropped.
>
> block layer has provided such guarantee if Rust driver follows current
> block driver model.

I understand that the driver has exclusive ownership of the request
between `queue_rq()` and `complete()`. What we want to guarantee is that
the author of a block device driver is not able to access a request
after the block layer has regained ownership of the request (after completion).
The way we achieve this is by preventing the block layer from calling
`__blk_mq_free_request` if the Rust driver did not destroy all
references to the request.

I understand that this is not necessary for correct operation of a
block device driver. However, it is necessary to uphold invariants of
the Rust language for references.

>>
>> Another concrete example is when a driver calls `blk_mq_tag_to_rq` with
>> an invalid tag. This can return a reference to an invalid request, if the
>> driver is not implemented correctly. By using `req_ref_inc_not_zero` we
>> can assert that the request is live before we create a Rust reference to
>> it, and even if the driver code has bugs, it can never access an invalid
>> request, and thus it can be memory safe.
>>
>> We move the responsibility of correctness, in relation to memory safety,
>> from the driver implementation to the API implementation.
>
> After queue_rq(req) is called, request ownership is actually transferred to
> driver like Rust's move, then driver is free to call blk_mq_tag_to_rq(), and
> finally return request to block core after the request is completed by driver.

As I said, I understand and appreciate this design. But there is a
possibility for a buggy driver to not obey. The Rust abstractions must
prevent this buggy code from compiling at all. In Rust, it must be
impossible to call `blk_mq_tag_to_rq()` for a tag that is not owned by
the driver. `blk_mq_tag_to_rq` takes an integer, so the driver developer
can call this function with an invalid tag like
`blk_mq_tag_to_rq(tagset, 5)`. If 5 is not owned by the driver, this
will return a request pointer that aliases with mutable access. When
this pointer is converted to a Rust reference, the result is undefined
behavior. For Rust, we must prevent code like this from compiling in the
first place.
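A sketch of how the abstraction can enforce this (illustrative names, not the actual series code): the tag lookup only hands out a reference when a `req_ref_inc_not_zero`-style increment succeeds, so an invalid or unowned tag yields `None` rather than an aliasing pointer. In userspace-analogue form:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Stand-in for `struct request`; `refs == 0` means the request is
// not live (free or being recycled).
struct Request {
    refs: AtomicU32,
}

// Analogue of the C `req_ref_inc_not_zero`: take a reference only
// if the request is still live.
fn ref_inc_not_zero(rq: &Request) -> bool {
    let mut cur = rq.refs.load(Ordering::Relaxed);
    loop {
        if cur == 0 {
            return false;
        }
        match rq.refs.compare_exchange_weak(
            cur, cur + 1, Ordering::Acquire, Ordering::Relaxed,
        ) {
            Ok(_) => return true,
            Err(c) => cur = c,
        }
    }
}

// Sketch of a sound `tag_to_rq`: instead of handing out a raw
// pointer for any integer tag, it returns `Some` only for a live
// request, so safe driver code cannot conjure an aliasing reference.
fn tag_to_rq(tagset: &[Request], tag: usize) -> Option<&Request> {
    let rq = tagset.get(tag)?; // out-of-range tag -> None
    if ref_inc_not_zero(rq) {
        Some(rq) // caller now owns one reference
    } else {
        None // tag exists but the request is not live
    }
}

fn main() {
    let tagset = vec![
        Request { refs: AtomicU32::new(1) }, // outstanding in the driver
        Request { refs: AtomicU32::new(0) }, // not live
    ];
    assert!(tag_to_rq(&tagset, 0).is_some());
    assert!(tag_to_rq(&tagset, 1).is_none()); // live check fails
    assert!(tag_to_rq(&tagset, 5).is_none()); // invalid tag like the example
    println!("ok");
}
```

The buggy call `blk_mq_tag_to_rq(tagset, 5)` described above simply cannot produce a reference in this shape of API.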

> The biggest question should be how Rust block driver will be designed &
> implemented? Will rust block driver follow current C driver's model, such
> as implementing ->queue_rq(), ->timeout(), ->complete()...?

The Rust driver API does follow the design of the C driver model. It
does implement the `blk_mq_ops` vtable. Rust block drivers must
implement a set of Rust functions that very closely follow this vtable.
But when it is not possible to statically determine the lifetime of an
object, such that the compiler can refuse to compile code that violates
the object's lifetime, we must resort to reference counting.

I understand that a correctly implemented block device driver will not
violate the lifetime of the request structure. Incorrect driver code might
violate the lifetime though. With Rust, we must prevent incorrectly
implemented drivers from compiling at all. We need the refcount for
that. By only allowing access to the request through the refcounted
pointer in certain cases, we prevent the safe driver code from
misbehaving.

It might seem cumbersome to do this dance at first. After all, we can
just write the device drivers correctly without memory safety bugs. But
the reward for all this work is that if a device driver is implemented
in the safe subset of Rust, it _cannot_ exhibit memory safety related
bugs. We don't have to review a device driver implemented in safe Rust
for memory safety issues at all.

Best regards,
Andreas

2024-03-15 15:12:34

by Andreas Hindborg

Subject: Re: [RFC PATCH 1/5] rust: block: introduce `kernel::block::mq` module

Ming Lei <[email protected]> writes:
> On Fri, Mar 15, 2024 at 08:52:46AM +0100, Andreas Hindborg wrote:
>> Miguel Ojeda <[email protected]> writes:
>>
>> > On Thu, Mar 14, 2024 at 8:23 PM Andreas Hindborg <[email protected]> wrote:
>> >>
>> >> The way the current code compiles, <kernel::block::mq::Request as
>> >> kernel::types::AlwaysRefCounted>::dec_ref` is inlined into the `rnull`
>> >> module. A relocation for `rust_helper_blk_mq_free_request_internal`
>> >> appears in `rnull_mod.ko`. I didn't test it yet, but if
>> >> `__blk_mq_free_request` (or the helper) is not exported, I don't think
>> >> this would be possible?
>> >
>> > Yeah, something needs to be exported since there is a generic
>> > involved, but even if you want to go the route of exporting only a
>> > different symbol, you would still want to put it in the C header so
>> > that you don't get the C missing declaration warning and so that we
>> > don't have to write the declaration manually in the helper.
>>
>> That is what I did:
>>
>> @@ -703,6 +703,7 @@ int blk_mq_alloc_sq_tag_set(struct blk_mq_tag_set *set,
>> unsigned int set_flags);
>> void blk_mq_free_tag_set(struct blk_mq_tag_set *set);
>>
>> +void __blk_mq_free_request(struct request *rq);
>> void blk_mq_free_request(struct request *rq);
>> int blk_rq_poll(struct request *rq, struct io_comp_batch *iob,
>> unsigned int poll_flags);
>
> Can you explain in detail why one block layer internal helper is
> called into rnull driver directly? It never happens in C driver code.

It is not the rust null block driver that calls this symbol directly. It
is called by the Rust block device driver API. But because of inlining,
the symbol is referenced from the loadable object.

The reason we have to call this symbol directly is to ensure proper
lifetime of the `struct request`. For example in C, when a driver
converts a tag to a request, the developer makes sure to only ask for
requests which are outstanding in the driver. In Rust, for the API to be
sound, we must ensure that the developer cannot write safe code that
obtains a reference to a request that is not owned by the driver.

A similar issue exists in the null block driver when timer completions
are enabled. If the request is cancelled and the timer fires after the
request has been recycled, there is a problem because the timer holds a
reference to the request private data area.

To that end, I use the `atomic_t ref` field of the C `struct request`
and implement the `AlwaysRefCounted` Rust trait for the request type.
This is a smart pointer that owns a reference to the pointee. In this
way, the request is not freed and recycled until the smart pointer is
dropped. But if the smart pointer holds the last reference when it is
dropped, it must be able to free the request, and hence it has to call
`__blk_mq_free_request`.

We could tag the function with `#[inline(never)]`, but that would impact
performance.

Best regards,
Andreas

2024-03-15 17:50:17

by Andreas Hindborg

Subject: Re: [RFC PATCH 1/5] rust: block: introduce `kernel::block::mq` module

Ming Lei <[email protected]> writes:

> On Fri, Mar 15, 2024 at 01:46:30PM +0100, Andreas Hindborg wrote:
>> Ming Lei <[email protected]> writes:
>> > On Fri, Mar 15, 2024 at 08:52:46AM +0100, Andreas Hindborg wrote:
>> >> Miguel Ojeda <[email protected]> writes:
>> >>
>> >> > On Thu, Mar 14, 2024 at 8:23 PM Andreas Hindborg <[email protected]> wrote:
>> >> >>
>> >> >> The way the current code compiles, <kernel::block::mq::Request as
>> >> >> kernel::types::AlwaysRefCounted>::dec_ref` is inlined into the `rnull`
>> >> >> module. A relocation for `rust_helper_blk_mq_free_request_internal`
>> >> >> appears in `rnull_mod.ko`. I didn't test it yet, but if
>> >> >> `__blk_mq_free_request` (or the helper) is not exported, I don't think
>> >> >> this would be possible?
>> >> >
>> >> > Yeah, something needs to be exported since there is a generic
>> >> > involved, but even if you want to go the route of exporting only a
>> >> > different symbol, you would still want to put it in the C header so
>> >> > that you don't get the C missing declaration warning and so that we
>> >> > don't have to write the declaration manually in the helper.
>> >>
>> >> That is what I did:
>> >>
>> >> @@ -703,6 +703,7 @@ int blk_mq_alloc_sq_tag_set(struct blk_mq_tag_set *set,
>> >> unsigned int set_flags);
>> >> void blk_mq_free_tag_set(struct blk_mq_tag_set *set);
>> >>
>> >> +void __blk_mq_free_request(struct request *rq);
>> >> void blk_mq_free_request(struct request *rq);
>> >> int blk_rq_poll(struct request *rq, struct io_comp_batch *iob,
>> >> unsigned int poll_flags);
>> >
>> > Can you explain in detail why one block layer internal helper is
>> > called into rnull driver directly? It never happens in C driver code.
>>
>> It is not the rust null block driver that calls this symbol directly. It
>> is called by the Rust block device driver API. But because of inlining,
>> the symbol is referenced from the loadable object.
>
> What is the exact Rust block device driver API? The key point is that how
> the body of one exported kernel C API(EXPORT_SYMBOL) becomes inlined
> with Rust driver.

This happens when `ARef<Request<_>>` is dropped. The drop method
(destructor) of this smart pointer decrements the refcount and
potentially calls `__blk_mq_free_request`.

>>
>> The reason we have to call this symbol directly is to ensure proper
>> lifetime of the `struct request`. For example in C, when a driver
>
> Sounds Rust API still calls into __blk_mq_free_request() directly, right?

Yes, the Rust block device driver API will call this function if an
`ARef<Request<_>>` is dropped and the refcount goes to 0.

> If that is the case, the usecase need to be justified, and you need
> to write one standalone patch with the exact story for exporting
> __blk_mq_free_request().

Ok, I can do that.

>
>> converts a tag to a request, the developer makes sure to only ask for
>> requests which are outstanding in the driver. In Rust, for the API to be
>> sound, we must ensure that the developer cannot write safe code that
>> obtains a reference to a request that is not owned by the driver.
>>
>> A similar issue exists in the null block driver when timer completions
>> are enabled. If the request is cancelled and the timer fires after the
>> request has been recycled, there is a problem because the timer holds a
>> reference to the request private data area.
>>
>> To that end, I use the `atomic_t ref` field of the C `struct request`
>> and implement the `AlwaysRefCounted` Rust trait for the request type.
>> This is a smart pointer that owns a reference to the pointee. In this
>> way, the request is not freed and recycled until the smart pointer is
>> dropped. But if the smart pointer holds the last reference when it is
>> dropped, it must be able to free the request, and hence it has to call
>> `__blk_mq_free_request`.
>
> For callbacks(queue_rq, timeout, complete) implemented by driver, block
> layer core guaranteed that the passed request reference is live.
>
> So driver needn't to worry about request lifetime, same with Rust
> driver, I think smart pointer isn't necessary for using request in
> Rust driver.

Using the C API, there is nothing preventing a driver from using a
request after its lifetime ends. With Rust, we have to make that
impossible. Without the refcount and the associated call to
`__blk_mq_free_request`, it would be possible to write safe Rust code
that accesses a request after its lifetime ends. That is unsound, and it
is something we must prevent in the Rust abstractions.

One concrete way to write unsound code with a Rust API that does not
track request lifetimes with a refcount is if the null block timer
completion callback fires after the request is completed. Perhaps the
driver cancels the request but forgets to cancel the timer. When the
timer fires, it accesses the request via the context pointer, but the
request is no longer valid. In C, we have to write the driver code so
that this cannot happen. In Rust, the API must prevent it, so that any
driver written in the safe subset of Rust using this API can never
trigger this behavior.

By using the refcount, we ensure that the request is alive until all
users who hold a reference to it are dropped.
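As a rough userspace model of this contract (hypothetical names; the
kernel implementation uses the `atomic_t ref` field of `struct request`
together with `req_ref_inc_not_zero`/`req_ref_put_and_test`):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Stand-in for the C `struct request` with its `ref` field.
struct Request {
    refcount: AtomicUsize,
}

impl Request {
    // Models `req_ref_inc_not_zero`: only take a reference if the
    // request is still live (refcount non-zero).
    fn inc_not_zero(&self) -> bool {
        self.refcount
            .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |c| {
                if c == 0 { None } else { Some(c + 1) }
            })
            .is_ok()
    }

    // Models `req_ref_put_and_test`: returns true if this drop released
    // the last reference, in which case the request would be freed via
    // `__blk_mq_free_request`.
    fn put_and_test(&self) -> bool {
        self.refcount.fetch_sub(1, Ordering::SeqCst) == 1
    }
}

fn main() {
    let rq = Request { refcount: AtomicUsize::new(1) };
    assert!(rq.inc_not_zero());  // a second user takes a reference
    assert!(!rq.put_and_test()); // first drop: not the last reference
    assert!(rq.put_and_test());  // last drop: request would be freed here
    assert!(!rq.inc_not_zero()); // a freed request can no longer be revived
}
```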

Another concrete example is when a driver calls `blk_mq_tag_to_rq` with
an invalid tag. This can return a pointer to an invalid request, if the
driver is not implemented correctly. By using `req_ref_inc_not_zero` we
can assert that the request is live before we create a Rust reference to
it, so even if the driver code has bugs, it can never access an invalid
request, and thus it remains memory safe.
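A userspace sketch of that check (hypothetical names; the real code
combines `blk_mq_tag_to_rq` with `req_ref_inc_not_zero`) shows how a
stale or out-of-range tag yields `None` instead of a dangling reference:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Model of a request slot; refcount 0 means the slot is free/recycled.
struct RequestSlot {
    refcount: AtomicUsize,
}

struct TagSet {
    slots: Vec<RequestSlot>,
}

impl TagSet {
    // Models `blk_mq_tag_to_rq` followed by `req_ref_inc_not_zero`:
    // only hand out a reference if the tag is in range and the
    // refcount increment from non-zero succeeds.
    fn tag_to_rq(&self, tag: usize) -> Option<&RequestSlot> {
        let slot = self.slots.get(tag)?;
        slot.refcount
            .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |c| {
                if c == 0 { None } else { Some(c + 1) }
            })
            .ok()?;
        Some(slot)
    }
}

fn main() {
    let set = TagSet {
        slots: vec![
            RequestSlot { refcount: AtomicUsize::new(1) }, // in flight
            RequestSlot { refcount: AtomicUsize::new(0) }, // recycled
        ],
    };
    assert!(set.tag_to_rq(0).is_some()); // live request: reference handed out
    assert!(set.tag_to_rq(1).is_none()); // recycled request: lookup fails safely
    assert!(set.tag_to_rq(7).is_none()); // out-of-range tag: lookup fails safely
}
```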

In this way, we move the responsibility for correctness with respect to
memory safety from the driver implementation to the API implementation.

Best regards,
Andreas

2024-03-22 23:41:37

by Benno Lossin

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] rust: block: introduce `kernel::block::mq` module

On 3/13/24 12:05, Andreas Hindborg wrote:
> From: Andreas Hindborg <[email protected]>
>
> Add initial abstractions for working with blk-mq.
>
> This patch is a maintained, refactored subset of code originally published by
> Wedson Almeida Filho <[email protected]> [1].
>
> [1] https://github.com/wedsonaf/linux/tree/f2cfd2fe0e2ca4e90994f96afe268bbd4382a891/rust/kernel/blk/mq.rs
>
> Cc: Wedson Almeida Filho <[email protected]>
> Signed-off-by: Andreas Hindborg <[email protected]>
> ---
> block/blk-mq.c | 3 +-
> include/linux/blk-mq.h | 1 +
> rust/bindings/bindings_helper.h | 2 +
> rust/helpers.c | 45 ++++
> rust/kernel/block.rs | 5 +
> rust/kernel/block/mq.rs | 131 +++++++++++
> rust/kernel/block/mq/gen_disk.rs | 174 +++++++++++++++
> rust/kernel/block/mq/operations.rs | 346 +++++++++++++++++++++++++++++
> rust/kernel/block/mq/raw_writer.rs | 60 +++++
> rust/kernel/block/mq/request.rs | 182 +++++++++++++++
> rust/kernel/block/mq/tag_set.rs | 117 ++++++++++
> rust/kernel/error.rs | 5 +
> rust/kernel/lib.rs | 1 +
> 13 files changed, 1071 insertions(+), 1 deletion(-)

Do you think that it's possible to split this into smaller
patches? It would make review a lot easier.

> create mode 100644 rust/kernel/block.rs
> create mode 100644 rust/kernel/block/mq.rs
> create mode 100644 rust/kernel/block/mq/gen_disk.rs
> create mode 100644 rust/kernel/block/mq/operations.rs
> create mode 100644 rust/kernel/block/mq/raw_writer.rs
> create mode 100644 rust/kernel/block/mq/request.rs
> create mode 100644 rust/kernel/block/mq/tag_set.rs

[...]

> diff --git a/rust/kernel/block.rs b/rust/kernel/block.rs
> new file mode 100644
> index 000000000000..4c93317a568a
> --- /dev/null
> +++ b/rust/kernel/block.rs
> @@ -0,0 +1,5 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +//! Types for working with the block layer

Missing '.'.

> +
> +pub mod mq;
> diff --git a/rust/kernel/block/mq.rs b/rust/kernel/block/mq.rs
> new file mode 100644
> index 000000000000..08de1cc114ff
> --- /dev/null
> +++ b/rust/kernel/block/mq.rs
> @@ -0,0 +1,131 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +//! This module provides types for implementing block drivers that interface the
> +//! blk-mq subsystem.
> +//!
> +//! To implement a block device driver, a Rust module must do the following:
> +//!
> +//! - Implement [`Operations`] for a type `T`

I think it would be nicer to use `Driver` (or `MyBlkDevice`) instead of
`T`.

> +//! - Create a [`TagSet<T>`]
> +//! - Create a [`GenDisk<T>`], passing in the `TagSet` reference
> +//! - Add the disk to the system by calling [`GenDisk::add`]
> +//!
> +//! The types available in this module that have direct C counterparts are:
> +//!
> +//! - The `TagSet` type that abstracts the C type `struct tag_set`.
> +//! - The `GenDisk` type that abstracts the C type `struct gendisk`.
> +//! - The `Request` type that abstracts the C type `struct request`.
> +//!
> +//! Many of the C types that this module abstracts allow a driver to carry
> +//! private data, either embedded in the stuct directly, or as a C `void*`. In
> +//! these abstractions, this data is typed. The types of the data is defined by
> +//! associated types in `Operations`, see [`Operations::RequestData`] for an
> +//! example.
> +//!
> +//! The kernel will interface with the block evice driver by calling the method

Typo: "block evice driver"

> +//! implementations of the `Operations` trait.
> +//!
> +//! IO requests are passed to the driver as [`Request`] references. The
> +//! `Request` type is a wrapper around the C `struct request`. The driver must
> +//! mark start of request processing by calling [`Request::start`] and end of
> +//! processing by calling one of the [`Request::end`], methods. Failure to do so
> +//! can lead to IO failures.

I am unfamiliar with this, what are "IO failures"?
Do you think that it might be better to change the API to use a
callback? So instead of calling start and end, you would do

request.handle(|req| {
    // do the stuff that would be done between start and end
});

I took a quick look at the rnull driver and there you are calling
`Request::end_ok` from a different function. So my suggestion might not
be possible, since you really need the freedom.

Do you think that a guard approach might work better? ie `start` returns
a guard that when dropped will call `end` and you need the guard to
operate on the request.
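A rough sketch of that guard idea (made-up names, error paths omitted):
`start` would hand out a guard, and dropping the guard ends the request,
so forgetting to call `end` becomes impossible:

```rust
use std::cell::Cell;

// Model of a request; `ended` stands in for blk_mq_end_request having run.
struct Request {
    ended: Cell<bool>,
}

impl Request {
    // Models `Request::start`: the real code would call
    // blk_mq_start_request here and return a guard.
    fn start(&self) -> RequestGuard<'_> {
        RequestGuard { rq: self }
    }
}

// Guard tying "request in flight" to a value's lifetime.
struct RequestGuard<'a> {
    rq: &'a Request,
}

impl Drop for RequestGuard<'_> {
    fn drop(&mut self) {
        // The real code would call blk_mq_end_request here.
        self.rq.ended.set(true);
    }
}

fn main() {
    let rq = Request { ended: Cell::new(false) };
    {
        let _guard = rq.start(); // processing happens while the guard lives
    } // guard dropped: request ended automatically
    assert!(rq.ended.get());
}
```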

> +//!
> +//! The `TagSet` is responsible for creating and maintaining a mapping between
> +//! `Request`s and integer ids as well as carrying a pointer to the vtable
> +//! generated by `Operations`. This mapping is useful for associating
> +//! completions from hardware with the correct `Request` instance. The `TagSet`
> +//! determines the maximum queue depth by setting the number of `Request`
> +//! instances available to the driver, and it determines the number of queues to
> +//! instantiate for the driver. If possible, a driver should allocate one queue
> +//! per core, to keep queue data local to a core.
> +//!
> +//! One `TagSet` instance can be shared between multiple `GenDisk` instances.
> +//! This can be useful when implementing drivers where one piece of hardware
> +//! with one set of IO resources are represented to the user as multiple disks.
> +//!
> +//! One significant difference between block device drivers implemented with
> +//! these Rust abstractions and drivers implemented in C, is that the Rust
> +//! drivers have to own a reference count on the `Request` type when the IO is
> +//! in flight. This is to ensure that the C `struct request` instances backing
> +//! the Rust `Request` instances are live while the Rust driver holds a
> +//! reference to the `Request`. In addition, the conversion of an ineger tag to

Typo: "of an ineger tag"

> +//! a `Request` via the `TagSet` would not be sound without this bookkeeping.
> +//!
> +//! # ⚠ Note
> +//!
> +//! For Rust block device drivers, the point in time where a request
> +//! is freed and made available for recycling is usualy at the point in time
> +//! when the last `ARef<Request>` is dropped. For C drivers, this event usually
> +//! occurs when `bindings::blk_mq_end_request` is called.
> +//!
> +//! # Example
> +//!
> +//! ```rust
> +//! use kernel::{
> +//! block::mq::*,
> +//! new_mutex,
> +//! prelude::*,
> +//! sync::{Arc, Mutex},
> +//! types::{ARef, ForeignOwnable},
> +//! };
> +//!
> +//! struct MyBlkDevice;
> +//!
> +//! #[vtable]
> +//! impl Operations for MyBlkDevice {
> +//! type RequestData = ();
> +//! type RequestDataInit = impl PinInit<()>;
> +//! type QueueData = ();
> +//! type HwData = ();
> +//! type TagSetData = ();
> +//!
> +//! fn new_request_data(
> +//! _tagset_data: <Self::TagSetData as ForeignOwnable>::Borrowed<'_>,
> +//! ) -> Self::RequestDataInit {
> +//! kernel::init::zeroed()
> +//! }
> +//!
> +//! fn queue_rq(_hw_data: (), _queue_data: (), rq: ARef<Request<Self>>, _is_last: bool) -> Result {
> +//! rq.start();
> +//! rq.end_ok();
> +//! Ok(())
> +//! }
> +//!
> +//! fn commit_rqs(
> +//! _hw_data: <Self::HwData as ForeignOwnable>::Borrowed<'_>,
> +//! _queue_data: <Self::QueueData as ForeignOwnable>::Borrowed<'_>,
> +//! ) {
> +//! }
> +//!
> +//! fn complete(rq: &Request<Self>) {
> +//! rq.end_ok();
> +//! }
> +//!
> +//! fn init_hctx(
> +//! _tagset_data: <Self::TagSetData as ForeignOwnable>::Borrowed<'_>,
> +//! _hctx_idx: u32,
> +//! ) -> Result<Self::HwData> {
> +//! Ok(())
> +//! }
> +//! }
> +//!
> +//! let tagset: Arc<TagSet<MyBlkDevice>> = Arc::pin_init(TagSet::try_new(1, (), 256, 1))?;
> +//! let mut disk = GenDisk::try_new(tagset, ())?;
> +//! disk.set_name(format_args!("myblk"))?;
> +//! disk.set_capacity_sectors(4096);
> +//! disk.add()?;
> +//!
> +//! # Ok::<(), kernel::error::Error>(())
> +//! ```

This piece of documentation is **really** valuable, thanks a lot for
taking the time to write it.

> +
> +mod gen_disk;
> +mod operations;
> +mod raw_writer;
> +mod request;
> +mod tag_set;
> +
> +pub use gen_disk::GenDisk;
> +pub use operations::Operations;
> +pub use request::{Request, RequestDataRef};
> +pub use tag_set::TagSet;
> diff --git a/rust/kernel/block/mq/gen_disk.rs b/rust/kernel/block/mq/gen_disk.rs
> new file mode 100644
> index 000000000000..b7845fc9e39f
> --- /dev/null
> +++ b/rust/kernel/block/mq/gen_disk.rs
> @@ -0,0 +1,174 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +//! Generic disk abstraction.
> +//!
> +//! C header: [`include/linux/blkdev.h`](srctree/include/linux/blkdev.h)
> +//! C header: [`include/linux/blk_mq.h`](srctree/include/linux/blk_mq.h)
> +
> +use crate::block::mq::{raw_writer::RawWriter, Operations, TagSet};
> +use crate::{
> + bindings, error::from_err_ptr, error::Result, sync::Arc, types::ForeignOwnable,
> + types::ScopeGuard,
> +};
> +use core::fmt::{self, Write};
> +
> +/// A generic block device
> +///
> +/// # Invariants
> +///
> +/// - `gendisk` must always point to an initialized and valid `struct gendisk`.
> +pub struct GenDisk<T: Operations> {
> + _tagset: Arc<TagSet<T>>,
> + gendisk: *mut bindings::gendisk,
> +}
> +
> +// SAFETY: `GenDisk` is an owned pointer to a `struct gendisk` and an `Arc` to a
> +// `TagSet` It is safe to send this to other threads as long as T is Send.
> +unsafe impl<T: Operations + Send> Send for GenDisk<T> {}
> +
> +impl<T: Operations> GenDisk<T> {
> + /// Try to create a new `GenDisk`
> + pub fn try_new(tagset: Arc<TagSet<T>>, queue_data: T::QueueData) -> Result<Self> {
> + let data = queue_data.into_foreign();
> + let recover_data = ScopeGuard::new(|| {
> + // SAFETY: T::QueueData was created by the call to `into_foreign()` above
> + unsafe { T::QueueData::from_foreign(data) };
> + });
> +
> + let lock_class_key = crate::sync::LockClassKey::new();
> +
> + // SAFETY: `tagset.raw_tag_set()` points to a valid and initialized tag set
> + let gendisk = from_err_ptr(unsafe {
> + bindings::__blk_mq_alloc_disk(
> + tagset.raw_tag_set(),
> + data.cast_mut(),
> + lock_class_key.as_ptr(),
> + )
> + })?;
> +
> + const TABLE: bindings::block_device_operations = bindings::block_device_operations {
> + submit_bio: None,
> + open: None,
> + release: None,
> + ioctl: None,
> + compat_ioctl: None,
> + check_events: None,
> + unlock_native_capacity: None,
> + getgeo: None,
> + set_read_only: None,
> + swap_slot_free_notify: None,
> + report_zones: None,
> + devnode: None,
> + alternative_gpt_sector: None,
> + get_unique_id: None,
> + // TODO: Set to THIS_MODULE. Waiting for const_refs_to_static feature to be merged
> + // https://github.com/rust-lang/rust/issues/119618
> + owner: core::ptr::null_mut(),
> + pr_ops: core::ptr::null_mut(),
> + free_disk: None,
> + poll_bio: None,
> + };
> +
> + // SAFETY: gendisk is a valid pointer as we initialized it above
> + unsafe { (*gendisk).fops = &TABLE };
> +
> + recover_data.dismiss();
> + Ok(Self {
> + _tagset: tagset,
> + gendisk,

Missing INVARIANT comment.

> + })
> + }
> +
> + /// Set the name of the device

Missing '.'.

> + pub fn set_name(&mut self, args: fmt::Arguments<'_>) -> Result {
> + let mut raw_writer = RawWriter::from_array(
> + // SAFETY: By type invariant `self.gendisk` points to a valid and initialized instance
> + unsafe { &mut (*self.gendisk).disk_name },

To create a `&mut` reference, you need exclusive access, it should be
sufficient to add to the invariant that `gendisk` is owned/unique.

> + );
> + raw_writer.write_fmt(args)?;
> + raw_writer.write_char('\0')?;
> + Ok(())
> + }

[...]

> +impl<T: Operations> Drop for GenDisk<T> {
> + fn drop(&mut self) {
> + // SAFETY: By type invariant of `Self`, `self.gendisk` points to a valid
> + // and initialized instance of `struct gendisk`. As such, `queuedata`
> + // was initialized by the initializer returned by `try_new` with a call
> + // to `ForeignOwnable::into_foreign`.

This should also be an invariant of `GenDisk`.

> + let queue_data = unsafe { (*(*self.gendisk).queue).queuedata };
> +
> + // SAFETY: By type invariant, `self.gendisk` points to a valid and
> + // initialized instance of `struct gendisk`
> + unsafe { bindings::del_gendisk(self.gendisk) };
> +
> + // SAFETY: `queue.queuedata` was created by `GenDisk::try_new()` with a
> + // call to `ForeignOwnable::into_pointer()` to create `queuedata`.
> + // `ForeignOwnable::from_foreign()` is only called here.
> + let _queue_data = unsafe { T::QueueData::from_foreign(queue_data) };
> + }
> +}
> diff --git a/rust/kernel/block/mq/operations.rs b/rust/kernel/block/mq/operations.rs
> new file mode 100644
> index 000000000000..53c6ad663208
> --- /dev/null
> +++ b/rust/kernel/block/mq/operations.rs
> @@ -0,0 +1,346 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +//! This module provides an interface for blk-mq drivers to implement.
> +//!
> +//! C header: [`include/linux/blk-mq.h`](srctree/include/linux/blk-mq.h)
> +
> +use crate::{
> + bindings,
> + block::mq::Request,
> + error::{from_result, Result},
> + init::PinInit,
> + types::{ARef, ForeignOwnable},
> +};
> +use core::{marker::PhantomData, ptr::NonNull};
> +
> +use super::TagSet;
> +
> +/// Implement this trait to interface blk-mq as block devices
> +#[macros::vtable]
> +pub trait Operations: Sized {
> + /// Data associated with a request. This data is located next to the request
> + /// structure.
> + ///
> + /// To be able to handle accessing this data from interrupt context, this
> + /// data must be `Sync`.
> + type RequestData: Sized + Sync;
> +
> + /// Initializer for `Self::RequestDta`. Used to initialize private data area
> + /// when requst structure is allocated.
> + type RequestDataInit: PinInit<Self::RequestData>;

Just to let you know, this dance with the associated types is not needed
any longer. RPITIT (return position impl trait in trait) has been
stabilized in 1.75 and you should be able to just write this:

fn new_request_data(
    //rq: ARef<Request<Self>>,
    tagset_data: <Self::TagSetData as ForeignOwnable>::Borrowed<'_>,
) -> impl PinInit<Self::RequestData>;


> +
> + /// Data associated with the `struct request_queue` that is allocated for
> + /// the `GenDisk` associated with this `Operations` implementation.
> + type QueueData: ForeignOwnable;
> +
> + /// Data associated with a dispatch queue. This is stored as a pointer in
> + /// the C `struct blk_mq_hw_ctx` that represents a hardware queue.
> + type HwData: ForeignOwnable;
> +
> + /// Data associated with a `TagSet`. This is stored as a pointer in `struct
> + /// blk_mq_tag_set`.
> + type TagSetData: ForeignOwnable;
> +
> + /// Called by the kernel to get an initializer for a `Pin<&mut RequestData>`.
> + fn new_request_data(
> + //rq: ARef<Request<Self>>,
> + tagset_data: <Self::TagSetData as ForeignOwnable>::Borrowed<'_>,

Since you need to use this pattern a lot, it might be a good idea to
introduce a type alias to help with this:

type ForeignBorrowed<'a, T> = <T as ForeignOwnable>::Borrowed<'a>;

What do the others think?

The function would then become (with the RPITIT improvement as well):

fn new_request_data(
    //rq: ARef<Request<Self>>,
    tagset_data: ForeignBorrowed<'_, Self::TagSetData>,
) -> impl PinInit<Self::RequestData>;


> + ) -> Self::RequestDataInit;
> +
> + /// Called by the kernel to queue a request with the driver. If `is_last` is
> + /// `false`, the driver is allowed to defer commiting the request.
> + fn queue_rq(
> + hw_data: <Self::HwData as ForeignOwnable>::Borrowed<'_>,
> + queue_data: <Self::QueueData as ForeignOwnable>::Borrowed<'_>,
> + rq: ARef<Request<Self>>,
> + is_last: bool,
> + ) -> Result;
> +
> + /// Called by the kernel to indicate that queued requests should be submitted
> + fn commit_rqs(
> + hw_data: <Self::HwData as ForeignOwnable>::Borrowed<'_>,
> + queue_data: <Self::QueueData as ForeignOwnable>::Borrowed<'_>,
> + );
> +
> + /// Called by the kernel when the request is completed
> + fn complete(_rq: &Request<Self>);
> +
> + /// Called by the kernel to allocate and initialize a driver specific hardware context data
> + fn init_hctx(
> + tagset_data: <Self::TagSetData as ForeignOwnable>::Borrowed<'_>,
> + hctx_idx: u32,
> + ) -> Result<Self::HwData>;
> +
> + /// Called by the kernel to poll the device for completed requests. Only
> + /// used for poll queues.
> + fn poll(_hw_data: <Self::HwData as ForeignOwnable>::Borrowed<'_>) -> bool {
> + crate::build_error(crate::error::VTABLE_DEFAULT_ERROR)
> + }
> +
> + /// Called by the kernel to map submission queues to CPU cores.
> + fn map_queues(_tag_set: &TagSet<Self>) {
> + crate::build_error(crate::error::VTABLE_DEFAULT_ERROR)
> + }
> +
> + // There is no need for exit_request() because `drop` will be called.

I think it would be a good idea to mention this in the documentation of
the trait.

> +}

[...]

> diff --git a/rust/kernel/block/mq/raw_writer.rs b/rust/kernel/block/mq/raw_writer.rs
> new file mode 100644
> index 000000000000..f7857740af29
> --- /dev/null
> +++ b/rust/kernel/block/mq/raw_writer.rs
> @@ -0,0 +1,60 @@
> +use core::{
> + fmt::{self, Write},
> + marker::PhantomData,
> +};
> +
> +/// A mutable reference to a byte buffer where a string can be written into
> +///
> +/// # Invariants
> +///
> +/// * `ptr` is not aliased and valid for read and write for `len` bytes

You probably also want to add "for the duration of `'a`".

> +///
> +pub(crate) struct RawWriter<'a> {
> + ptr: *mut u8,
> + len: usize,
> + _p: PhantomData<&'a ()>,
> +}
> +
> +impl<'a> RawWriter<'a> {
> + /// Create a new `RawWriter` instance.
> + ///
> + /// # Safety
> + ///
> + /// * `ptr` must be valid for read and write for `len` consecutive `u8` elements
> + /// * `ptr` must not be aliased
> + unsafe fn new(ptr: *mut u8, len: usize) -> RawWriter<'a> {
> + Self {
> + ptr,
> + len,
> + _p: PhantomData,
> + }
> + }

Since this function is not used (except in the function below), what is
the reason for using a raw pointer?
I looked in your other patches, but did not find another user, so could
this be improved by using mutable references?

> +
> + pub(crate) fn from_array<const N: usize>(a: &'a mut [core::ffi::c_char; N]) -> RawWriter<'a> {
> + // SAFETY: the buffer of `a` is valid for read and write for at least `N` bytes
> + unsafe { Self::new(a.as_mut_ptr().cast::<u8>(), N) }
> + }
> +}
> +
> +impl Write for RawWriter<'_> {
> + fn write_str(&mut self, s: &str) -> fmt::Result {
> + let bytes = s.as_bytes();
> + let len = bytes.len();
> + if len > self.len {
> + return Err(fmt::Error);
> + }
> +
> + // SAFETY:
> + // * `bytes` is valid for reads of `bytes.len()` size because we hold a shared reference to `s`
> + // * By type invariant `self.ptr` is valid for writes for at lest `self.len` bytes
> + // * The regions are not overlapping as `ptr` is not aliased
> + unsafe { core::ptr::copy_nonoverlapping(&bytes[0], self.ptr, len) };
> +
> + // SAFETY: By type invariant of `Self`, `ptr` is in bounds of an
> + // allocation. Also by type invariant, the pointer resulting from this
> + // addition is also in bounds.
> + self.ptr = unsafe { self.ptr.add(len) };
> + self.len -= len;
> + Ok(())
> + }
> +}
> diff --git a/rust/kernel/block/mq/request.rs b/rust/kernel/block/mq/request.rs
> new file mode 100644
> index 000000000000..b4dacac5e091
> --- /dev/null
> +++ b/rust/kernel/block/mq/request.rs
> @@ -0,0 +1,182 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +//! This module provides a wrapper for the C `struct request` type.
> +//!
> +//! C header: [`include/linux/blk-mq.h`](srctree/include/linux/blk-mq.h)
> +
> +use crate::{
> + bindings,
> + block::mq::Operations,
> + error::{Error, Result},
> + types::{ARef, AlwaysRefCounted, Opaque},
> +};
> +use core::{ffi::c_void, marker::PhantomData, ops::Deref};
> +
> +/// A wrapper around a blk-mq `struct request`. This represents an IO request.
> +///
> +/// # Invariants
> +///
> +/// * `self.0` is a valid `struct request` created by the C portion of the kernel
> +/// * `self` is reference counted. a call to `req_ref_inc_not_zero` keeps the
> +/// instance alive at least until a matching call to `req_ref_put_and_test`
> +///
> +#[repr(transparent)]
> +pub struct Request<T: Operations>(Opaque<bindings::request>, PhantomData<T>);
> +
> +impl<T: Operations> Request<T> {
> + /// Create a `&mut Request` from a `bindings::request` pointer
> + ///
> + /// # Safety
> + ///
> + /// * `ptr` must be aligned and point to a valid `bindings::request` instance
> + /// * Caller must ensure that the pointee of `ptr` is live and owned
> + /// exclusively by caller for at least `'a`
> + ///
> + pub(crate) unsafe fn from_ptr_mut<'a>(ptr: *mut bindings::request) -> &'a mut Self {
> + // SAFETY:
> + // * The cast is valid as `Self` is transparent.
> + // * By safety requirements of this function, the reference will be
> + // valid for 'a.
> + unsafe { &mut *(ptr.cast::<Self>()) }
> + }
> +
> + /// Get the command identifier for the request
> + pub fn command(&self) -> u32 {
> + // SAFETY: By C API contract and type invariant, `cmd_flags` is valid for read
> + unsafe { (*self.0.get()).cmd_flags & ((1 << bindings::REQ_OP_BITS) - 1) }
> + }
> +
> + /// Call this to indicate to the kernel that the request has been issued by the driver

I am a bit confused, is this not supposed to signal that the processing
of the request is going to start now? cf C documentation:

/**
* blk_mq_start_request - Start processing a request
* @rq: Pointer to request to be started
*
* Function used by device drivers to notify the block layer that a request
* is going to be processed now, so blk layer can do proper initializations
* such as starting the timeout timer.
*/

> + pub fn start(&self) {
> + // SAFETY: By type invariant, `self.0` is a valid `struct request`. By
> + // existence of `&mut self` we have exclusive access.
> + unsafe { bindings::blk_mq_start_request(self.0.get()) };
> + }
> +
> + /// Call this to indicate to the kernel that the request has been completed without errors

I dislike the "Call this", it feels redundant, what about "Signal the
block layer that the request has been completed without errors.".

Also with the other functions below.

> + pub fn end_ok(&self) {
> + // SAFETY: By type invariant, `self.0` is a valid `struct request`. By
> + // existence of `&mut self` we have exclusive access.
> + unsafe { bindings::blk_mq_end_request(self.0.get(), bindings::BLK_STS_OK as _) };
> + }
> +
> + /// Call this to indicate to the kernel that the request completed with an error
> + pub fn end_err(&self, err: Error) {
> + // SAFETY: By type invariant, `self.0` is a valid `struct request`. By
> + // existence of `&mut self` we have exclusive access.
> + unsafe { bindings::blk_mq_end_request(self.0.get(), err.to_blk_status()) };
> + }
> +
> + /// Call this to indicate that the request completed with the status indicated by `status`
> + pub fn end(&self, status: Result) {
> + if let Err(e) = status {
> + self.end_err(e);
> + } else {
> + self.end_ok();
> + }
> + }
> +
> + /// Call this to schedule defered completion of the request
> + pub fn complete(&self) {
> + // SAFETY: By type invariant, `self.0` is a valid `struct request`
> + if !unsafe { bindings::blk_mq_complete_request_remote(self.0.get()) } {
> + T::complete(self);
> + }
> + }
> +
> + /// Get the target sector for the request
> + #[inline(always)]
> + pub fn sector(&self) -> usize {
> + // SAFETY: By type invariant of `Self`, `self.0` is valid and live.
> + unsafe { (*self.0.get()).__sector as usize }
> + }
> +
> + /// Returns an owned reference to the per-request data associated with this
> + /// request
> + pub fn owned_data_ref(request: ARef<Self>) -> RequestDataRef<T> {
> + RequestDataRef::new(request)
> + }
> +
> + /// Returns a reference to the oer-request data associated with this request

Typo: "the oer-request"

> + pub fn data_ref(&self) -> &T::RequestData {
> + let request_ptr = self.0.get().cast::<bindings::request>();
> +
> + // SAFETY: `request_ptr` is a valid `struct request` because `ARef` is
> + // `repr(transparent)`
> + let p: *mut c_void = unsafe { bindings::blk_mq_rq_to_pdu(request_ptr) };
> +
> + let p = p.cast::<T::RequestData>();
> +
> + // SAFETY: By C API contract, `p` is initialized by a call to
> + // `OperationsVTable::init_request_callback()`. By existence of `&self`
> + // it must be valid for use as a shared reference.
> + unsafe { &*p }
> + }
> +}
> +
> +// SAFETY: It is impossible to obtain an owned or mutable `Request`, so we can

What do you mean by "mutable `Request`"? There is the function to obtain
a `&mut Request`.
Also this should probably be an invariant if you use it as a
justification here.

> +// mark it `Send`.
> +unsafe impl<T: Operations> Send for Request<T> {}
> +
> +// SAFETY: `Request` references can be shared across threads.

Does not explain why that is the case.

> +unsafe impl<T: Operations> Sync for Request<T> {}
> +
> +/// An owned reference to a `Request<T>`
> +#[repr(transparent)]
> +pub struct RequestDataRef<T: Operations> {
> + request: ARef<Request<T>>,
> +}

Is this extra type really needed? I have not yet taken a look at patch 3,
which uses this. But would it hurt if you implemented the traits
there directly on `ARef<Request<T>>`?

> +
> +impl<T> RequestDataRef<T>
> +where
> + T: Operations,
> +{
> + /// Create a new instance.
> + fn new(request: ARef<Request<T>>) -> Self {
> + Self { request }
> + }
> +
> + /// Get a reference to the underlying request
> + pub fn request(&self) -> &Request<T> {
> + &self.request
> + }
> +}

I really like how you improved the safety comments and documentation. It
was a lot easier to wrap my head around this time (might also be me
getting more experienced, but I think it helped a lot).

--
Cheers,
Benno


2024-03-23 10:52:21

by Benno Lossin

[permalink] [raw]
Subject: Re: [RFC PATCH 3/5] rust: block: allow `hrtimer::Timer` in `RequestData`

On 3/13/24 12:05, Andreas Hindborg wrote:
> From: Andreas Hindborg <[email protected]>
>
> Signed-off-by: Andreas Hindborg <[email protected]>
> ---
> rust/kernel/block/mq/request.rs | 67 ++++++++++++++++++++++++++++++++-
> 1 file changed, 66 insertions(+), 1 deletion(-)
>
> diff --git a/rust/kernel/block/mq/request.rs b/rust/kernel/block/mq/request.rs
> index cccffde45981..8b7f08f894be 100644
> --- a/rust/kernel/block/mq/request.rs
> +++ b/rust/kernel/block/mq/request.rs
> @@ -4,13 +4,16 @@
> //!
> //! C header: [`include/linux/blk-mq.h`](srctree/include/linux/blk-mq.h)
>
> +use kernel::hrtimer::RawTimer;
> +
> use crate::{
> bindings,
> block::mq::Operations,
> error::{Error, Result},
> + hrtimer::{HasTimer, TimerCallback},
> types::{ARef, AlwaysRefCounted, Opaque},
> };
> -use core::{ffi::c_void, marker::PhantomData, ops::Deref};
> +use core::{ffi::c_void, marker::PhantomData, ops::Deref, ptr::NonNull};
>
> use crate::block::bio::Bio;
> use crate::block::bio::BioIterator;
> @@ -175,6 +178,68 @@ fn deref(&self) -> &Self::Target {
> }
> }
>
> +impl<T> RawTimer for RequestDataRef<T>
> +where
> + T: Operations,
> + T::RequestData: HasTimer<T::RequestData>,
> + T::RequestData: Sync,
> +{
> + fn schedule(self, expires: u64) {
> + let self_ptr = self.deref() as *const T::RequestData;
> + core::mem::forget(self);
> +
> + // SAFETY: `self_ptr` is a valid pointer to a `T::RequestData`
> + let timer_ptr = unsafe { T::RequestData::raw_get_timer(self_ptr) };
> +
> + // `Timer` is `repr(transparent)`
> + let c_timer_ptr = timer_ptr.cast::<bindings::hrtimer>();
> +
> + // Schedule the timer - if it is already scheduled it is removed and
> + // inserted
> +
> + // SAFETY: c_timer_ptr points to a valid hrtimer instance that was
> + // initialized by `hrtimer_init`
> + unsafe {
> + bindings::hrtimer_start_range_ns(
> + c_timer_ptr as *mut _,
> + expires as i64,
> + 0,
> + bindings::hrtimer_mode_HRTIMER_MODE_REL,
> + );
> + }
> + }
> +}
> +
> +impl<T> kernel::hrtimer::RawTimerCallback for RequestDataRef<T>
> +where
> + T: Operations,
> + T: Sync,

Why is this needed? Shouldn't this be `T::RequestData: Sync`?

Is the `run` function below executed on a different thread compared to
the `schedule` function above?
If yes, then `T::RequestData` probably also needs to be `Send`.
You also would need to adjust the bounds in the impl above.

--
Cheers,
Benno

> + T::RequestData: HasTimer<T::RequestData>,
> + T::RequestData: TimerCallback<Receiver = Self>,
> +{
> + unsafe extern "C" fn run(ptr: *mut bindings::hrtimer) -> bindings::hrtimer_restart {
> + // `Timer` is `repr(transparent)`
> + let timer_ptr = ptr.cast::<kernel::hrtimer::Timer<T::RequestData>>();
> +
> + // SAFETY: By C API contract `ptr` is the pointer we passed when
> + // enqueing the timer, so it is a `Timer<T::RequestData>` embedded in a `T::RequestData`
> + let receiver_ptr = unsafe { T::RequestData::timer_container_of(timer_ptr) };
> +
> + // SAFETY: The pointer was returned by `T::timer_container_of` so it
> + // points to a valid `T::RequestData`
> + let request_ptr = unsafe { bindings::blk_mq_rq_from_pdu(receiver_ptr.cast::<c_void>()) };
> +
> + // SAFETY: We own a refcount that we leaked during `RawTimer::schedule()`
> + let dref = RequestDataRef::new(unsafe {
> + ARef::from_raw(NonNull::new_unchecked(request_ptr.cast::<Request<T>>()))
> + });
> +
> + T::RequestData::run(dref);
> +
> + bindings::hrtimer_restart_HRTIMER_NORESTART
> + }
> +}
> +
> // SAFETY: All instances of `Request<T>` are reference counted. This
> // implementation of `AlwaysRefCounted` ensure that increments to the ref count
> // keeps the object alive in memory at least until a matching reference count
> --
> 2.44.0
>

2024-03-23 11:33:55

by Benno Lossin

Subject: Re: [RFC PATCH 4/5] rust: block: add rnull, Rust null_blk implementation

On 3/13/24 12:05, Andreas Hindborg wrote:
> +module! {
> + type: NullBlkModule,
> + name: "rnull_mod",
> + author: "Andreas Hindborg",
> + license: "GPL v2",
> + params: {
> + param_memory_backed: bool {
> + default: true,
> + permissions: 0,
> + description: "Use memory backing",
> + },
> + // Problems with pin_init when `irq_mode`

Can you elaborate?

--
Cheers,
Benno

> + param_irq_mode: u8 {
> + default: 0,
> + permissions: 0,
> + description: "IRQ Mode (0: None, 1: Soft, 2: Timer)",
> + },
> + param_capacity_mib: u64 {
> + default: 4096,
> + permissions: 0,
> + description: "Device capacity in MiB",
> + },
> + param_completion_time_nsec: u64 {
> + default: 1_000_000,
> + permissions: 0,
> + description: "Completion time in nano seconds for timer mode",
> + },
> + param_block_size: u16 {
> + default: 4096,
> + permissions: 0,
> + description: "Block size in bytes",
> + },
> + },
> +}

2024-04-02 12:53:31

by Andreas Hindborg

Subject: Re: [RFC PATCH 4/5] rust: block: add rnull, Rust null_blk implementation

Benno Lossin <[email protected]> writes:

> On 3/13/24 12:05, Andreas Hindborg wrote:
>> +module! {
>> + type: NullBlkModule,
>> + name: "rnull_mod",
>> + author: "Andreas Hindborg",
>> + license: "GPL v2",
>> + params: {
>> + param_memory_backed: bool {
>> + default: true,
>> + permissions: 0,
>> + description: "Use memory backing",
>> + },
>> + // Problems with pin_init when `irq_mode`
>
> Can you elaborate?

I think we discussed this before, but I do not recall what you decided
was the issue.

It is probably easier if you can apply the patches and try to build with
this on top:

diff --git a/drivers/block/rnull.rs b/drivers/block/rnull.rs
index 04bdb6668558..bd089c5e6e89 100644
--- a/drivers/block/rnull.rs
+++ b/drivers/block/rnull.rs
@@ -48,7 +48,7 @@
description: "Use memory backing",
},
// Problems with pin_init when `irq_mode`
- param_irq_mode: u8 {
+ irq_mode: u8 {
default: 0,
permissions: 0,
description: "IRQ Mode (0: None, 1: Soft, 2: Timer)",
@@ -101,7 +101,7 @@ fn add_disk(tagset: Arc<TagSet<NullBlkDevice>>) -> Result<GenDisk<NullBlkDevice>
return Err(kernel::error::code::EINVAL);
}

- let irq_mode = (*param_irq_mode.read()).try_into()?;
+ let irq_mode = (*irq_mode.read()).try_into()?;

let queue_data = Box::pin_init(pin_init!(
QueueData {

---

There is some kind of name clash issue when using `pin_init!` in the expression on
line 106:

let queue_data = Box::pin_init(pin_init!(
QueueData {
tree <- TreeContainer::new(),
completion_time_nsec: *param_completion_time_nsec.read(),
irq_mode,
memory_backed: *param_memory_backed.read(),
block_size,
}
))?;

I cannot immediately decipher the error message:

RUSTC [M] drivers/block/rnull.o
error[E0277]: the trait bound `__rnull_mod_irq_mode: From<u8>` is not satisfied
--> /home/aeh/src/linux-rust/linux/drivers/block/rnull.rs:104:39
|
104 | let irq_mode = (*irq_mode.read()).try_into()?;
| ^^^^^^^^ the trait `From<u8>` is not implemented for `__rnull_mod_irq_mode`
|
= note: required for `u8` to implement `Into<__rnull_mod_irq_mode>`
= note: required for `__rnull_mod_irq_mode` to implement `TryFrom<u8>`
= note: required for `u8` to implement `TryInto<__rnull_mod_irq_mode>`

error[E0308]: mismatched types
--> /home/aeh/src/linux-rust/linux/drivers/block/rnull.rs:106:36
|
106 | let queue_data = Box::pin_init(pin_init!(
| ____________________________________^
107 | | QueueData {
108 | | tree <- TreeContainer::new(),
109 | | completion_time_nsec: *param_completion_time_nsec.read(),
.. |
113 | | }
114 | | ))?;
| | ^
| | |
| |_____expected `IRQMode`, found `__rnull_mod_irq_mode`
| arguments to this function are incorrect
|
note: function defined here
--> /home/aeh/.rustup/toolchains/1.74.1-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:1365:21
|
1365 | pub const unsafe fn write<T>(dst: *mut T, src: T) {
| ^^^^^
= note: this error originates in the macro `$crate::__init_internal` which comes from the expansion of the macro `pin_init` (in Nightly builds, run with -Z macro-backtrace for more info)

error[E0308]: mismatched types
--> /home/aeh/src/linux-rust/linux/drivers/block/rnull.rs:106:36
|
39 | / module! {
40 | | type: NullBlkModule,
41 | | name: "rnull_mod",
42 | | author: "Andreas Hindborg",
.. |
71 | | },
72 | | }
| |_- constant defined here
..
106 | let queue_data = Box::pin_init(pin_init!(
| ____________________________________^
| |____________________________________|
| |____________________________________|
| |____________________________________|
| |
107 | | QueueData {
108 | | tree <- TreeContainer::new(),
109 | | completion_time_nsec: *param_completion_time_nsec.read(),
.. |
113 | | }
114 | | ))?;
| | ^
| |_____|
| |_____expected `DropGuard<IRQMode>`, found `__rnull_mod_irq_mode`
| |_____this expression has type `DropGuard<IRQMode>`
| |_____`irq_mode` is interpreted as a constant, not a new binding
| help: introduce a new binding instead: `other_irq_mode`
|
= note: expected struct `DropGuard<IRQMode>`
found struct `__rnull_mod_irq_mode`
= note: this error originates in the macro `$crate::__init_internal` which comes from the expansion of the macro `pin_init` (in Nightly builds, run with -Z macro-backtrace for more info)

error: aborting due to 3 previous errors

Some errors have detailed explanations: E0277, E0308.
For more information about an error, try `rustc --explain E0277`.
make[5]: *** [/home/aeh/src/linux-rust/linux/scripts/Makefile.build:293: drivers/block/rnull.o] Error 1
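If I read the note in that last error right ("`irq_mode` is interpreted as a constant, not a new binding"), the clash is that inside `pin_init!` the field shorthand `irq_mode` resolves to the constant the `module!` macro generates for the parameter, not to the local binding. A minimal userspace sketch of the same ambiguity (names made up, nothing here is kernel code):

```rust
// A lowercase `const`, similar to the ones the `module!` macro generates
// for module parameters (the generated name in the error above is
// `__rnull_mod_irq_mode`; this sketch is plain userspace Rust).
#[allow(non_upper_case_globals)]
const irq_mode: u8 = 0;

// In pattern position, `irq_mode` refers to the const above and matches
// by value; it does not introduce a new binding.
fn classify(value: u8) -> &'static str {
    match value {
        irq_mode => "matched the const's value",
        _ => "no match",
    }
}

fn main() {
    assert_eq!(classify(0), "matched the const's value");
    assert_eq!(classify(7), "no match");
}
```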

2024-04-02 16:32:10

by Andreas Hindborg

Subject: Re: [RFC PATCH 3/5] rust: block: allow `hrtimer::Timer` in `RequestData`

Benno Lossin <[email protected]> writes:

> On 3/13/24 12:05, Andreas Hindborg wrote:> From: Andreas Hindborg <[email protected]>
>>
>> Signed-off-by: Andreas Hindborg <[email protected]>
>> ---
>> rust/kernel/block/mq/request.rs | 67 ++++++++++++++++++++++++++++++++-
>> 1 file changed, 66 insertions(+), 1 deletion(-)
>>
>> diff --git a/rust/kernel/block/mq/request.rs b/rust/kernel/block/mq/request.rs
>> index cccffde45981..8b7f08f894be 100644
>> --- a/rust/kernel/block/mq/request.rs
>> +++ b/rust/kernel/block/mq/request.rs
>> @@ -4,13 +4,16 @@
>> //!
>> //! C header: [`include/linux/blk-mq.h`](srctree/include/linux/blk-mq.h)
>>
>> +use kernel::hrtimer::RawTimer;
>> +
>> use crate::{
>> bindings,
>> block::mq::Operations,
>> error::{Error, Result},
>> + hrtimer::{HasTimer, TimerCallback},
>> types::{ARef, AlwaysRefCounted, Opaque},
>> };
>> -use core::{ffi::c_void, marker::PhantomData, ops::Deref};
>> +use core::{ffi::c_void, marker::PhantomData, ops::Deref, ptr::NonNull};
>>
>> use crate::block::bio::Bio;
>> use crate::block::bio::BioIterator;
>> @@ -175,6 +178,68 @@ fn deref(&self) -> &Self::Target {
>> }
>> }
>>
>> +impl<T> RawTimer for RequestDataRef<T>
>> +where
>> + T: Operations,
>> + T::RequestData: HasTimer<T::RequestData>,
>> + T::RequestData: Sync,
>> +{
>> + fn schedule(self, expires: u64) {
>> + let self_ptr = self.deref() as *const T::RequestData;
>> + core::mem::forget(self);
>> +
>> + // SAFETY: `self_ptr` is a valid pointer to a `T::RequestData`
>> + let timer_ptr = unsafe { T::RequestData::raw_get_timer(self_ptr) };
>> +
>> + // `Timer` is `repr(transparent)`
>> + let c_timer_ptr = timer_ptr.cast::<bindings::hrtimer>();
>> +
>> + // Schedule the timer - if it is already scheduled it is removed and
>> + // inserted
>> +
>> + // SAFETY: c_timer_ptr points to a valid hrtimer instance that was
>> + // initialized by `hrtimer_init`
>> + unsafe {
>> + bindings::hrtimer_start_range_ns(
>> + c_timer_ptr as *mut _,
>> + expires as i64,
>> + 0,
>> + bindings::hrtimer_mode_HRTIMER_MODE_REL,
>> + );
>> + }
>> + }
>> +}
>> +
>> +impl<T> kernel::hrtimer::RawTimerCallback for RequestDataRef<T>
>> +where
>> + T: Operations,
>> + T: Sync,
>
> Why is this needed? Shouldn't this be `T::RequestData: Sync`?
>
> Is the `run` function below executed on a different thread compared to
> the `schedule` function above?
> If yes, then `T::RequestData` probably also needs to be `Send`.
> You also would need to adjust the bounds in the impl above.

It's a typo, thanks for spotting. It should be `T::RequestData: Sync`.
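Spelled out, the corrected bound checks out in a reduced userspace sketch (stand-in types, not the kernel ones):

```rust
use std::marker::PhantomData;

// Reduced stand-ins for the kernel types so the bound can be checked in
// isolation; the trait bodies are illustrative only.
trait Operations {
    type RequestData;
}

struct RequestDataRef<T: Operations>(PhantomData<T>);

trait RawTimerCallback {
    fn ready(&self) -> bool;
}

// The corrected bound: `T::RequestData: Sync` rather than `T: Sync`.
impl<T> RawTimerCallback for RequestDataRef<T>
where
    T: Operations,
    T::RequestData: Sync,
{
    fn ready(&self) -> bool {
        true
    }
}

struct MyDriver;

impl Operations for MyDriver {
    // `u32` is `Sync`, so the impl above applies.
    type RequestData = u32;
}

fn main() {
    let r = RequestDataRef::<MyDriver>(PhantomData);
    assert!(r.ready());
}
```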

BR Andreas

2024-04-02 18:39:27

by Andreas Hindborg

Subject: Re: [RFC PATCH 1/5] rust: block: introduce `kernel::block::mq` module

Benno Lossin <[email protected]> writes:

> On 3/13/24 12:05, Andreas Hindborg wrote:
>> From: Andreas Hindborg <[email protected]>
>>
>> Add initial abstractions for working with blk-mq.
>>
>> This patch is a maintained, refactored subset of code originally published by
>> Wedson Almeida Filho <[email protected]> [1].
>>
>> [1] https://github.com/wedsonaf/linux/tree/f2cfd2fe0e2ca4e90994f96afe268bbd4382a891/rust/kernel/blk/mq.rs
>>
>> Cc: Wedson Almeida Filho <[email protected]>
>> Signed-off-by: Andreas Hindborg <[email protected]>
>> ---
>> block/blk-mq.c | 3 +-
>> include/linux/blk-mq.h | 1 +
>> rust/bindings/bindings_helper.h | 2 +
>> rust/helpers.c | 45 ++++
>> rust/kernel/block.rs | 5 +
>> rust/kernel/block/mq.rs | 131 +++++++++++
>> rust/kernel/block/mq/gen_disk.rs | 174 +++++++++++++++
>> rust/kernel/block/mq/operations.rs | 346 +++++++++++++++++++++++++++++
>> rust/kernel/block/mq/raw_writer.rs | 60 +++++
>> rust/kernel/block/mq/request.rs | 182 +++++++++++++++
>> rust/kernel/block/mq/tag_set.rs | 117 ++++++++++
>> rust/kernel/error.rs | 5 +
>> rust/kernel/lib.rs | 1 +
>> 13 files changed, 1071 insertions(+), 1 deletion(-)
>
> Do you think that it's possible to split this into smaller
> patches? It would make review a lot easier.

Probably, I'll look into that.

>
>> create mode 100644 rust/kernel/block.rs
>> create mode 100644 rust/kernel/block/mq.rs
>> create mode 100644 rust/kernel/block/mq/gen_disk.rs
>> create mode 100644 rust/kernel/block/mq/operations.rs
>> create mode 100644 rust/kernel/block/mq/raw_writer.rs
>> create mode 100644 rust/kernel/block/mq/request.rs
>> create mode 100644 rust/kernel/block/mq/tag_set.rs
>
> [...]
>
>> diff --git a/rust/kernel/block.rs b/rust/kernel/block.rs
>> new file mode 100644
>> index 000000000000..4c93317a568a
>> --- /dev/null
>> +++ b/rust/kernel/block.rs
>> @@ -0,0 +1,5 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +
>> +//! Types for working with the block layer
>
> Missing '.'.

👍

>
>> +
>> +pub mod mq;
>> diff --git a/rust/kernel/block/mq.rs b/rust/kernel/block/mq.rs
>> new file mode 100644
>> index 000000000000..08de1cc114ff
>> --- /dev/null
>> +++ b/rust/kernel/block/mq.rs
>> @@ -0,0 +1,131 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +
>> +//! This module provides types for implementing block drivers that interface the
>> +//! blk-mq subsystem.
>> +//!
>> +//! To implement a block device driver, a Rust module must do the following:
>> +//!
>> +//! - Implement [`Operations`] for a type `T`
>
> I think it would be nicer to use `Driver` (or `MyBlkDevice`) instead of
> `T`.

I think I like `T` better, because it more clearly signals a placeholder.

>
>> +//! - Create a [`TagSet<T>`]
>> +//! - Create a [`GenDisk<T>`], passing in the `TagSet` reference
>> +//! - Add the disk to the system by calling [`GenDisk::add`]
>> +//!
>> +//! The types available in this module that have direct C counterparts are:
>> +//!
>> +//! - The `TagSet` type that abstracts the C type `struct tag_set`.
>> +//! - The `GenDisk` type that abstracts the C type `struct gendisk`.
>> +//! - The `Request` type that abstracts the C type `struct request`.
>> +//!
>> +//! Many of the C types that this module abstracts allow a driver to carry
>> +//! private data, either embedded in the stuct directly, or as a C `void*`. In
>> +//! these abstractions, this data is typed. The types of the data is defined by
>> +//! associated types in `Operations`, see [`Operations::RequestData`] for an
>> +//! example.
>> +//!
>> +//! The kernel will interface with the block evice driver by calling the method
>
> Typo: "block evice driver"

Thanks.

>
>> +//! implementations of the `Operations` trait.
>> +//!
>> +//! IO requests are passed to the driver as [`Request`] references. The
>> +//! `Request` type is a wrapper around the C `struct request`. The driver must
>> +//! mark start of request processing by calling [`Request::start`] and end of
>> +//! processing by calling one of the [`Request::end`], methods. Failure to do so
>> +//! can lead to IO failures.
>
> I am unfamiliar with this, what are "IO failures"?
> Do you think that it might be better to change the API to use a
> callback? So instead of calling start and end, you would do
>
> request.handle(|req| {
> // do the stuff that would be done between start and end
> });
>
> I took a quick look at the rnull driver and there you are calling
> `Request::end_ok` from a different function. So my suggestion might not
> be possible, since you really need the freedom.
>
> Do you think that a guard approach might work better? ie `start` returns
> a guard that when dropped will call `end` and you need the guard to
> operate on the request.

I don't think that would fit, since the driver might not complete the
request immediately. We might be able to call `start` on behalf of the
driver.

At any rate, since the request is reference counted now, we can
automatically fail a request when the last reference is dropped and it
was not marked successfully completed. I would need to measure the
performance implications of such a feature.

The comment needs updating too. Failure to complete requests will lead to
either deadlocks or timeout errors, depending on configuration.
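
To sketch the idea in plain userspace Rust (all names invented here, and with a counter standing in for ending the request with an error status):

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;

// Counts requests that were dropped without being completed. In the
// kernel, the drop handler would instead end the request with an error
// status; a counter keeps this userspace sketch observable.
static FAILED_REQUESTS: AtomicUsize = AtomicUsize::new(0);

struct RequestInner {
    completed: AtomicBool,
}

// Stand-in for the last `ARef<Request<T>>`; the names are made up for
// this sketch.
struct RequestRef(Arc<RequestInner>);

impl RequestRef {
    fn new() -> Self {
        RequestRef(Arc::new(RequestInner {
            completed: AtomicBool::new(false),
        }))
    }

    fn end_ok(&self) {
        self.0.completed.store(true, Ordering::Release);
    }
}

impl Drop for RequestRef {
    fn drop(&mut self) {
        // Only the holder of the last reference performs the check.
        if Arc::strong_count(&self.0) == 1
            && !self.0.completed.load(Ordering::Acquire)
        {
            FAILED_REQUESTS.fetch_add(1, Ordering::SeqCst);
        }
    }
}

fn main() {
    drop(RequestRef::new()); // never completed: counted as failed
    let rq = RequestRef::new();
    rq.end_ok();
    drop(rq); // completed normally: not counted
    assert_eq!(FAILED_REQUESTS.load(Ordering::SeqCst), 1);
}
```

The open question is what the atomic load in `drop` costs on the hot path, which is what I would want to measure.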

>
>> +//!
>> +//! The `TagSet` is responsible for creating and maintaining a mapping between
>> +//! `Request`s and integer ids as well as carrying a pointer to the vtable
>> +//! generated by `Operations`. This mapping is useful for associating
>> +//! completions from hardware with the correct `Request` instance. The `TagSet`
>> +//! determines the maximum queue depth by setting the number of `Request`
>> +//! instances available to the driver, and it determines the number of queues to
>> +//! instantiate for the driver. If possible, a driver should allocate one queue
>> +//! per core, to keep queue data local to a core.
>> +//!
>> +//! One `TagSet` instance can be shared between multiple `GenDisk` instances.
>> +//! This can be useful when implementing drivers where one piece of hardware
>> +//! with one set of IO resources are represented to the user as multiple disks.
>> +//!
>> +//! One significant difference between block device drivers implemented with
>> +//! these Rust abstractions and drivers implemented in C, is that the Rust
>> +//! drivers have to own a reference count on the `Request` type when the IO is
>> +//! in flight. This is to ensure that the C `struct request` instances backing
>> +//! the Rust `Request` instances are live while the Rust driver holds a
>> +//! reference to the `Request`. In addition, the conversion of an ineger tag to
>
> Typo: "of an ineger tag"

Thanks. Looks like I need to properly configure my editor to spell check
in comments.

>
>> +//! a `Request` via the `TagSet` would not be sound without this bookkeeping.
>> +//!
>> +//! # ⚠ Note
>> +//!
>> +//! For Rust block device drivers, the point in time where a request
>> +//! is freed and made available for recycling is usualy at the point in time
>> +//! when the last `ARef<Request>` is dropped. For C drivers, this event usually
>> +//! occurs when `bindings::blk_mq_end_request` is called.
>> +//!
>> +//! # Example
>> +//!
>> +//! ```rust
>> +//! use kernel::{
>> +//! block::mq::*,
>> +//! new_mutex,
>> +//! prelude::*,
>> +//! sync::{Arc, Mutex},
>> +//! types::{ARef, ForeignOwnable},
>> +//! };
>> +//!
>> +//! struct MyBlkDevice;
>> +//!
>> +//! #[vtable]
>> +//! impl Operations for MyBlkDevice {
>> +//! type RequestData = ();
>> +//! type RequestDataInit = impl PinInit<()>;
>> +//! type QueueData = ();
>> +//! type HwData = ();
>> +//! type TagSetData = ();
>> +//!
>> +//! fn new_request_data(
>> +//! _tagset_data: <Self::TagSetData as ForeignOwnable>::Borrowed<'_>,
>> +//! ) -> Self::RequestDataInit {
>> +//! kernel::init::zeroed()
>> +//! }
>> +//!
>> +//! fn queue_rq(_hw_data: (), _queue_data: (), rq: ARef<Request<Self>>, _is_last: bool) -> Result {
>> +//! rq.start();
>> +//! rq.end_ok();
>> +//! Ok(())
>> +//! }
>> +//!
>> +//! fn commit_rqs(
>> +//! _hw_data: <Self::HwData as ForeignOwnable>::Borrowed<'_>,
>> +//! _queue_data: <Self::QueueData as ForeignOwnable>::Borrowed<'_>,
>> +//! ) {
>> +//! }
>> +//!
>> +//! fn complete(rq: &Request<Self>) {
>> +//! rq.end_ok();
>> +//! }
>> +//!
>> +//! fn init_hctx(
>> +//! _tagset_data: <Self::TagSetData as ForeignOwnable>::Borrowed<'_>,
>> +//! _hctx_idx: u32,
>> +//! ) -> Result<Self::HwData> {
>> +//! Ok(())
>> +//! }
>> +//! }
>> +//!
>> +//! let tagset: Arc<TagSet<MyBlkDevice>> = Arc::pin_init(TagSet::try_new(1, (), 256, 1))?;
>> +//! let mut disk = GenDisk::try_new(tagset, ())?;
>> +//! disk.set_name(format_args!("myblk"))?;
>> +//! disk.set_capacity_sectors(4096);
>> +//! disk.add()?;
>> +//!
>> +//! # Ok::<(), kernel::error::Error>(())
>> +//! ```
>
> This piece of documentation is **really** valuable, thanks a lot for
> taking the time to write it.

Great!

>
>> +
>> +mod gen_disk;
>> +mod operations;
>> +mod raw_writer;
>> +mod request;
>> +mod tag_set;
>> +
>> +pub use gen_disk::GenDisk;
>> +pub use operations::Operations;
>> +pub use request::{Request, RequestDataRef};
>> +pub use tag_set::TagSet;
>> diff --git a/rust/kernel/block/mq/gen_disk.rs b/rust/kernel/block/mq/gen_disk.rs
>> new file mode 100644
>> index 000000000000..b7845fc9e39f
>> --- /dev/null
>> +++ b/rust/kernel/block/mq/gen_disk.rs
>> @@ -0,0 +1,174 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +
>> +//! Generic disk abstraction.
>> +//!
>> +//! C header: [`include/linux/blkdev.h`](srctree/include/linux/blkdev.h)
>> +//! C header: [`include/linux/blk_mq.h`](srctree/include/linux/blk_mq.h)
>> +
>> +use crate::block::mq::{raw_writer::RawWriter, Operations, TagSet};
>> +use crate::{
>> + bindings, error::from_err_ptr, error::Result, sync::Arc, types::ForeignOwnable,
>> + types::ScopeGuard,
>> +};
>> +use core::fmt::{self, Write};
>> +
>> +/// A generic block device
>> +///
>> +/// # Invariants
>> +///
>> +/// - `gendisk` must always point to an initialized and valid `struct gendisk`.
>> +pub struct GenDisk<T: Operations> {
>> + _tagset: Arc<TagSet<T>>,
>> + gendisk: *mut bindings::gendisk,
>> +}
>> +
>> +// SAFETY: `GenDisk` is an owned pointer to a `struct gendisk` and an `Arc` to a
>> +// `TagSet` It is safe to send this to other threads as long as T is Send.
>> +unsafe impl<T: Operations + Send> Send for GenDisk<T> {}
>> +
>> +impl<T: Operations> GenDisk<T> {
>> + /// Try to create a new `GenDisk`
>> + pub fn try_new(tagset: Arc<TagSet<T>>, queue_data: T::QueueData) -> Result<Self> {
>> + let data = queue_data.into_foreign();
>> + let recover_data = ScopeGuard::new(|| {
>> + // SAFETY: T::QueueData was created by the call to `into_foreign()` above
>> + unsafe { T::QueueData::from_foreign(data) };
>> + });
>> +
>> + let lock_class_key = crate::sync::LockClassKey::new();
>> +
>> + // SAFETY: `tagset.raw_tag_set()` points to a valid and initialized tag set
>> + let gendisk = from_err_ptr(unsafe {
>> + bindings::__blk_mq_alloc_disk(
>> + tagset.raw_tag_set(),
>> + data.cast_mut(),
>> + lock_class_key.as_ptr(),
>> + )
>> + })?;
>> +
>> + const TABLE: bindings::block_device_operations = bindings::block_device_operations {
>> + submit_bio: None,
>> + open: None,
>> + release: None,
>> + ioctl: None,
>> + compat_ioctl: None,
>> + check_events: None,
>> + unlock_native_capacity: None,
>> + getgeo: None,
>> + set_read_only: None,
>> + swap_slot_free_notify: None,
>> + report_zones: None,
>> + devnode: None,
>> + alternative_gpt_sector: None,
>> + get_unique_id: None,
>> + // TODO: Set to THIS_MODULE. Waiting for const_refs_to_static feature to be merged
>> + // https://github.com/rust-lang/rust/issues/119618
>> + owner: core::ptr::null_mut(),
>> + pr_ops: core::ptr::null_mut(),
>> + free_disk: None,
>> + poll_bio: None,
>> + };
>> +
>> + // SAFETY: gendisk is a valid pointer as we initialized it above
>> + unsafe { (*gendisk).fops = &TABLE };
>> +
>> + recover_data.dismiss();
>> + Ok(Self {
>> + _tagset: tagset,
>> + gendisk,
>
> Missing INVARIANT comment.

Will fix.

>
>> + })
>> + }
>> +
>> + /// Set the name of the device
>
> Missing '.'.

Thanks.

>
>> + pub fn set_name(&mut self, args: fmt::Arguments<'_>) -> Result {
>> + let mut raw_writer = RawWriter::from_array(
>> + // SAFETY: By type invariant `self.gendisk` points to a valid and initialized instance
>> + unsafe { &mut (*self.gendisk).disk_name },
>
> To create a `&mut` reference, you need exclusive access, it should be
> sufficient to add to the invariant that `gendisk` is owned/unique.

Hmm, we don't actually _always_ have unique ownership of this string
buffer. I will change the API to only allow configuration of the
instance before it is hooked in. Thanks for spotting this.
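
A rough sketch of what I mean, as a consuming builder in plain userspace Rust (names illustrative, not a proposed API):

```rust
// Configuration is only possible before the disk goes live: the builder
// owns the name buffer exclusively, and `add` consumes the builder, so
// no `&mut` access to the buffer can exist once the disk is visible to
// the rest of the system.
struct DiskConfig {
    name: String,
    capacity_sectors: u64,
}

struct DiskBuilder {
    config: DiskConfig,
}

struct AddedDisk {
    config: DiskConfig,
}

impl DiskBuilder {
    fn new() -> Self {
        DiskBuilder {
            config: DiskConfig {
                name: String::new(),
                capacity_sectors: 0,
            },
        }
    }

    // Mutation is only possible on the builder...
    fn set_name(mut self, name: &str) -> Self {
        self.config.name = name.to_string();
        self
    }

    fn set_capacity_sectors(mut self, sectors: u64) -> Self {
        self.config.capacity_sectors = sectors;
        self
    }

    // ...and `add` consumes it, returning the live, read-only disk.
    fn add(self) -> AddedDisk {
        AddedDisk { config: self.config }
    }
}

impl AddedDisk {
    fn name(&self) -> &str {
        &self.config.name
    }
}

fn main() {
    let disk = DiskBuilder::new()
        .set_name("myblk")
        .set_capacity_sectors(4096)
        .add();
    assert_eq!(disk.name(), "myblk");
}
```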

>
>> + );
>> + raw_writer.write_fmt(args)?;
>> + raw_writer.write_char('\0')?;
>> + Ok(())
>> + }
>
> [...]
>
>> +impl<T: Operations> Drop for GenDisk<T> {
>> + fn drop(&mut self) {
>> + // SAFETY: By type invariant of `Self`, `self.gendisk` points to a valid
>> + // and initialized instance of `struct gendisk`. As such, `queuedata`
>> + // was initialized by the initializer returned by `try_new` with a call
>> + // to `ForeignOwnable::into_foreign`.
>
> This should also be an invariant of `GenDisk`.

Ok.

>
>> + let queue_data = unsafe { (*(*self.gendisk).queue).queuedata };
>> +
>> + // SAFETY: By type invariant, `self.gendisk` points to a valid and
>> + // initialized instance of `struct gendisk`
>> + unsafe { bindings::del_gendisk(self.gendisk) };
>> +
>> + // SAFETY: `queue.queuedata` was created by `GenDisk::try_new()` with a
>> + // call to `ForeignOwnable::into_pointer()` to create `queuedata`.
>> + // `ForeignOwnable::from_foreign()` is only called here.
>> + let _queue_data = unsafe { T::QueueData::from_foreign(queue_data) };
>> + }
>> +}
>> diff --git a/rust/kernel/block/mq/operations.rs b/rust/kernel/block/mq/operations.rs
>> new file mode 100644
>> index 000000000000..53c6ad663208
>> --- /dev/null
>> +++ b/rust/kernel/block/mq/operations.rs
>> @@ -0,0 +1,346 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +
>> +//! This module provides an interface for blk-mq drivers to implement.
>> +//!
>> +//! C header: [`include/linux/blk-mq.h`](srctree/include/linux/blk-mq.h)
>> +
>> +use crate::{
>> + bindings,
>> + block::mq::Request,
>> + error::{from_result, Result},
>> + init::PinInit,
>> + types::{ARef, ForeignOwnable},
>> +};
>> +use core::{marker::PhantomData, ptr::NonNull};
>> +
>> +use super::TagSet;
>> +
>> +/// Implement this trait to interface blk-mq as block devices
>> +#[macros::vtable]
>> +pub trait Operations: Sized {
>> + /// Data associated with a request. This data is located next to the request
>> + /// structure.
>> + ///
>> + /// To be able to handle accessing this data from interrupt context, this
>> + /// data must be `Sync`.
>> + type RequestData: Sized + Sync;
>> +
>> + /// Initializer for `Self::RequestDta`. Used to initialize private data area
>> + /// when requst structure is allocated.
>> + type RequestDataInit: PinInit<Self::RequestData>;
>
> Just to let you know, this dance with the associated types is not needed
> any longer. RPITIT (return position impl trait in trait) has been
> stabilized in 1.75 and you should be able to just write this:
>
> fn new_request_data(
> //rq: ARef<Request<Self>>,
> tagset_data: <Self::TagSetData as ForeignOwnable>::Borrowed<'_>,
> ) -> impl PinInit<Self::RequestData>;
>

Thanks 👍

>
>> +
>> + /// Data associated with the `struct request_queue` that is allocated for
>> + /// the `GenDisk` associated with this `Operations` implementation.
>> + type QueueData: ForeignOwnable;
>> +
>> + /// Data associated with a dispatch queue. This is stored as a pointer in
>> + /// the C `struct blk_mq_hw_ctx` that represents a hardware queue.
>> + type HwData: ForeignOwnable;
>> +
>> + /// Data associated with a `TagSet`. This is stored as a pointer in `struct
>> + /// blk_mq_tag_set`.
>> + type TagSetData: ForeignOwnable;
>> +
>> + /// Called by the kernel to get an initializer for a `Pin<&mut RequestData>`.
>> + fn new_request_data(
>> + //rq: ARef<Request<Self>>,
>> + tagset_data: <Self::TagSetData as ForeignOwnable>::Borrowed<'_>,
>
> Since you need to use this pattern a lot, it might be a good idea to
> introduce a type alias to help with this:
>
> type ForeignBorrowed<'a, T> = <T as ForeignOwnable>::Borrowed<'a>;
>
> What do the others think?
>
> The function would then become (with the RPITIT improvement as well):
>
> fn new_request_data(
> //rq: ARef<Request<Self>>,
> tagset_data: ForeignBorrowed<'_, Self::TagSetData>,
> ) -> impl PinInit<Self::RequestData>;

A bit more concise, I think it is better. I'll go ahead and kill that
commented-out argument that sneaked into the patch as well :)

>
>
>> + ) -> Self::RequestDataInit;
>> +
>> + /// Called by the kernel to queue a request with the driver. If `is_last` is
>> + /// `false`, the driver is allowed to defer commiting the request.
>> + fn queue_rq(
>> + hw_data: <Self::HwData as ForeignOwnable>::Borrowed<'_>,
>> + queue_data: <Self::QueueData as ForeignOwnable>::Borrowed<'_>,
>> + rq: ARef<Request<Self>>,
>> + is_last: bool,
>> + ) -> Result;
>> +
>> + /// Called by the kernel to indicate that queued requests should be submitted
>> + fn commit_rqs(
>> + hw_data: <Self::HwData as ForeignOwnable>::Borrowed<'_>,
>> + queue_data: <Self::QueueData as ForeignOwnable>::Borrowed<'_>,
>> + );
>> +
>> + /// Called by the kernel when the request is completed
>> + fn complete(_rq: &Request<Self>);
>> +
>> + /// Called by the kernel to allocate and initialize a driver specific hardware context data
>> + fn init_hctx(
>> + tagset_data: <Self::TagSetData as ForeignOwnable>::Borrowed<'_>,
>> + hctx_idx: u32,
>> + ) -> Result<Self::HwData>;
>> +
>> + /// Called by the kernel to poll the device for completed requests. Only
>> + /// used for poll queues.
>> + fn poll(_hw_data: <Self::HwData as ForeignOwnable>::Borrowed<'_>) -> bool {
>> + crate::build_error(crate::error::VTABLE_DEFAULT_ERROR)
>> + }
>> +
>> + /// Called by the kernel to map submission queues to CPU cores.
>> + fn map_queues(_tag_set: &TagSet<Self>) {
>> + crate::build_error(crate::error::VTABLE_DEFAULT_ERROR)
>> + }
>> +
>> + // There is no need for exit_request() because `drop` will be called.
>
> I think it would be a good idea to mention this in the documentation of
> the trait.

Yes.

>
>> +}
>
> [...]
>
>> diff --git a/rust/kernel/block/mq/raw_writer.rs b/rust/kernel/block/mq/raw_writer.rs
>> new file mode 100644
>> index 000000000000..f7857740af29
>> --- /dev/null
>> +++ b/rust/kernel/block/mq/raw_writer.rs
>> @@ -0,0 +1,60 @@
>> +use core::{
>> + fmt::{self, Write},
>> + marker::PhantomData,
>> +};
>> +
>> +/// A mutable reference to a byte buffer where a string can be written into
>> +///
>> +/// # Invariants
>> +///
>> +/// * `ptr` is not aliased and valid for read and write for `len` bytes
>
> You probably also want to add "for the duration of `'a`".

👍

>
>> +///
>> +pub(crate) struct RawWriter<'a> {
>> + ptr: *mut u8,
>> + len: usize,
>> + _p: PhantomData<&'a ()>,
>> +}
>> +
>> +impl<'a> RawWriter<'a> {
>> + /// Create a new `RawWriter` instance.
>> + ///
>> + /// # Safety
>> + ///
>> + /// * `ptr` must be valid for read and write for `len` consecutive `u8` elements
>> + /// * `ptr` must not be aliased
>> + unsafe fn new(ptr: *mut u8, len: usize) -> RawWriter<'a> {
>> + Self {
>> + ptr,
>> + len,
>> + _p: PhantomData,
>> + }
>> + }
>
> Since this function is not used (except in the function below), what is
> the reason for using a raw pointer?
> I looked in your other patches, but did not find another user, so could
> this be improved by using mutable references?

I am not sure. But I think the code could benefit from being passed a
mutable slice instead, moving the safety comment to the call site. I will
try that.
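
Roughly what I have in mind, sketched in userspace Rust over `std::fmt::Write` (the kernel version would use `core::fmt::Write`):

```rust
use std::fmt::{self, Write};

// `RawWriter` rewritten over a mutable slice, as suggested above: no raw
// pointers, no unsafe, and the lifetime comes for free from the borrow.
struct RawWriter<'a> {
    buffer: &'a mut [u8],
    pos: usize,
}

impl<'a> RawWriter<'a> {
    fn new(buffer: &'a mut [u8]) -> Self {
        RawWriter { buffer, pos: 0 }
    }
}

impl Write for RawWriter<'_> {
    fn write_str(&mut self, s: &str) -> fmt::Result {
        let bytes = s.as_bytes();
        let end = self.pos.checked_add(bytes.len()).ok_or(fmt::Error)?;
        if end > self.buffer.len() {
            return Err(fmt::Error);
        }
        self.buffer[self.pos..end].copy_from_slice(bytes);
        self.pos = end;
        Ok(())
    }
}

fn main() {
    let mut buf = [0u8; 8];
    let mut w = RawWriter::new(&mut buf);
    w.write_str("myblk").unwrap();
    w.write_char('\0').unwrap(); // provided by the `Write` trait
    assert_eq!(&buf[..6], b"myblk\0");
}
```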

>
>> +
>> + pub(crate) fn from_array<const N: usize>(a: &'a mut [core::ffi::c_char; N]) -> RawWriter<'a> {
>> + // SAFETY: the buffer of `a` is valid for read and write for at least `N` bytes
>> + unsafe { Self::new(a.as_mut_ptr().cast::<u8>(), N) }
>> + }
>> +}
>> +
>> +impl Write for RawWriter<'_> {
>> + fn write_str(&mut self, s: &str) -> fmt::Result {
>> + let bytes = s.as_bytes();
>> + let len = bytes.len();
>> + if len > self.len {
>> + return Err(fmt::Error);
>> + }
>> +
>> + // SAFETY:
>> + // * `bytes` is valid for reads of `bytes.len()` size because we hold a shared reference to `s`
>> + // * By type invariant `self.ptr` is valid for writes for at lest `self.len` bytes
>> + // * The regions are not overlapping as `ptr` is not aliased
>> + unsafe { core::ptr::copy_nonoverlapping(&bytes[0], self.ptr, len) };
>> +
>> + // SAFETY: By type invariant of `Self`, `ptr` is in bounds of an
>> + // allocation. Also by type invariant, the pointer resulting from this
>> + // addition is also in bounds.
>> + self.ptr = unsafe { self.ptr.add(len) };
>> + self.len -= len;
>> + Ok(())
>> + }
>> +}
>> diff --git a/rust/kernel/block/mq/request.rs b/rust/kernel/block/mq/request.rs
>> new file mode 100644
>> index 000000000000..b4dacac5e091
>> --- /dev/null
>> +++ b/rust/kernel/block/mq/request.rs
>> @@ -0,0 +1,182 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +
>> +//! This module provides a wrapper for the C `struct request` type.
>> +//!
>> +//! C header: [`include/linux/blk-mq.h`](srctree/include/linux/blk-mq.h)
>> +
>> +use crate::{
>> + bindings,
>> + block::mq::Operations,
>> + error::{Error, Result},
>> + types::{ARef, AlwaysRefCounted, Opaque},
>> +};
>> +use core::{ffi::c_void, marker::PhantomData, ops::Deref};
>> +
>> +/// A wrapper around a blk-mq `struct request`. This represents an IO request.
>> +///
>> +/// # Invariants
>> +///
>> +/// * `self.0` is a valid `struct request` created by the C portion of the kernel
>> +/// * `self` is reference counted. a call to `req_ref_inc_not_zero` keeps the
>> +/// instance alive at least until a matching call to `req_ref_put_and_test`
>> +///
>> +#[repr(transparent)]
>> +pub struct Request<T: Operations>(Opaque<bindings::request>, PhantomData<T>);
>> +
>> +impl<T: Operations> Request<T> {
>> + /// Create a `&mut Request` from a `bindings::request` pointer
>> + ///
>> + /// # Safety
>> + ///
>> + /// * `ptr` must be aligned and point to a valid `bindings::request` instance
>> + /// * Caller must ensure that the pointee of `ptr` is live and owned
>> + /// exclusively by caller for at least `'a`
>> + ///
>> + pub(crate) unsafe fn from_ptr_mut<'a>(ptr: *mut bindings::request) -> &'a mut Self {
>> + // SAFETY:
>> + // * The cast is valid as `Self` is transparent.
>> + // * By safety requirements of this function, the reference will be
>> + // valid for 'a.
>> + unsafe { &mut *(ptr.cast::<Self>()) }
>> + }
>> +
>> + /// Get the command identifier for the request
>> + pub fn command(&self) -> u32 {
>> + // SAFETY: By C API contract and type invariant, `cmd_flags` is valid for read
>> + unsafe { (*self.0.get()).cmd_flags & ((1 << bindings::REQ_OP_BITS) - 1) }
>> + }
>> +
>> + /// Call this to indicate to the kernel that the request has been issued by the driver
>
> I am a bit confused, is this not supposed to signal that the processing
> of the request is going to start now? cf C documentation:
>
> /**
> * blk_mq_start_request - Start processing a request
> * @rq: Pointer to request to be started
> *
> * Function used by device drivers to notify the block layer that a request
> * is going to be processed now, so blk layer can do proper initializations
> * such as starting the timeout timer.
> */

Yes, it is to indicate that the request is now considered in-flight by
the driver. I'll change the wording a bit.

>
>> + pub fn start(&self) {
>> + // SAFETY: By type invariant, `self.0` is a valid `struct request`. By
>> + // existence of `&mut self` we have exclusive access.
>> + unsafe { bindings::blk_mq_start_request(self.0.get()) };
>> + }
>> +
>> + /// Call this to indicate to the kernel that the request has been completed without errors
>
> I dislike the "Call this", it feels redundant, what about "Signal the
> block layer that the request has been completed without errors.".
>
> Also with the other functions below.

I agree.

>
>> + pub fn end_ok(&self) {
>> + // SAFETY: By type invariant, `self.0` is a valid `struct request`. By
>> + // existence of `&mut self` we have exclusive access.
>> + unsafe { bindings::blk_mq_end_request(self.0.get(), bindings::BLK_STS_OK as _) };
>> + }
>> +
>> + /// Call this to indicate to the kernel that the request completed with an error
>> + pub fn end_err(&self, err: Error) {
>> + // SAFETY: By type invariant, `self.0` is a valid `struct request`. By
>> + // existence of `&mut self` we have exclusive access.
>> + unsafe { bindings::blk_mq_end_request(self.0.get(), err.to_blk_status()) };
>> + }
>> +
>> + /// Call this to indicate that the request completed with the status indicated by `status`
>> + pub fn end(&self, status: Result) {
>> + if let Err(e) = status {
>> + self.end_err(e);
>> + } else {
>> + self.end_ok();
>> + }
>> + }
>> +
>> + /// Call this to schedule deferred completion of the request
>> + pub fn complete(&self) {
>> + // SAFETY: By type invariant, `self.0` is a valid `struct request`
>> + if !unsafe { bindings::blk_mq_complete_request_remote(self.0.get()) } {
>> + T::complete(self);
>> + }
>> + }
>> +
>> + /// Get the target sector for the request
>> + #[inline(always)]
>> + pub fn sector(&self) -> usize {
>> + // SAFETY: By type invariant of `Self`, `self.0` is valid and live.
>> + unsafe { (*self.0.get()).__sector as usize }
>> + }
>> +
>> + /// Returns an owned reference to the per-request data associated with this
>> + /// request
>> + pub fn owned_data_ref(request: ARef<Self>) -> RequestDataRef<T> {
>> + RequestDataRef::new(request)
>> + }
>> +
>> + /// Returns a reference to the oer-request data associated with this request
>
> Typo: "the oer-request"

Thanks.

>
>> + pub fn data_ref(&self) -> &T::RequestData {
>> + let request_ptr = self.0.get().cast::<bindings::request>();
>> +
>> + // SAFETY: `request_ptr` is a valid `struct request` because `ARef` is
>> + // `repr(transparent)`
>> + let p: *mut c_void = unsafe { bindings::blk_mq_rq_to_pdu(request_ptr) };
>> +
>> + let p = p.cast::<T::RequestData>();
>> +
>> + // SAFETY: By C API contract, `p` is initialized by a call to
>> + // `OperationsVTable::init_request_callback()`. By existence of `&self`
>> + // it must be valid for use as a shared reference.
>> + unsafe { &*p }
>> + }
>> +}
>> +
>> +// SAFETY: It is impossible to obtain an owned or mutable `Request`, so we can
>
> What do you mean by "mutable `Request`"? There is the function to obtain
> a `&mut Request`.

The idea behind this comment is that it is not possible to have an owned
`Request` instance. You can only ever have something that will deref
(shared) to `Request`. Construction of the `Request` type is not
possible in safe driver code. At least that is the intention.

The `from_ptr_mut` is unsafe, and could be downgraded to
`from_ptr`, since `Operations::complete` takes a shared reference
anyway. Bottom line is that user code does not handle `&mut Request`.
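A rough shape of that downgrade, mocked up in userspace with a stand-in for `bindings::request` (purely illustrative, not the actual kernel bindings):

```rust
// Userspace mock: `bindings` here is a stand-in, not the real bindgen output.
#[allow(non_camel_case_types)]
mod bindings {
    #[repr(C)]
    pub struct request {
        pub tag: i32,
    }
}

#[repr(transparent)]
struct Request(bindings::request);

impl Request {
    /// # Safety
    ///
    /// `ptr` must be aligned, point to a valid `bindings::request`, and
    /// the pointee must remain live for at least `'a`.
    unsafe fn from_ptr<'a>(ptr: *mut bindings::request) -> &'a Self {
        // SAFETY: the cast is valid because `Self` is `repr(transparent)`;
        // validity for `'a` is guaranteed by the caller.
        unsafe { &*(ptr.cast::<Self>()) }
    }

    fn tag(&self) -> i32 {
        self.0.tag
    }
}

fn main() {
    let mut raw = bindings::request { tag: 42 };
    // SAFETY: `raw` is a valid, live `request` owned by this frame.
    let req = unsafe { Request::from_ptr(&mut raw) };
    println!("tag = {}", req.tag());
}
```

Since only a shared reference is ever handed out, no `&mut Request` can reach driver code.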

> Also this should probably be an invariant if you use it as a
> justification here.

Ok.

>
>> +// mark it `Send`.
>> +unsafe impl<T: Operations> Send for Request<T> {}
>> +
>> +// SAFETY: `Request` references can be shared across threads.
>
> Does not explain why that is the case.

Will add.

>
>> +unsafe impl<T: Operations> Sync for Request<T> {}
>> +
>> +/// An owned reference to a `Request<T>`
>> +#[repr(transparent)]
>> +pub struct RequestDataRef<T: Operations> {
>> + request: ARef<Request<T>>,
>> +}
>
> Is this extra type really needed? I have not yet taken a look at patch 3,
> which uses this. But would it hurt if you implemented the traits
> there directly on `ARef<Request<T>>`?

Yes, thanks :) Way better with just `ARef<Request<T>>`.

>
>> +
>> +impl<T> RequestDataRef<T>
>> +where
>> + T: Operations,
>> +{
>> + /// Create a new instance.
>> + fn new(request: ARef<Request<T>>) -> Self {
>> + Self { request }
>> + }
>> +
>> + /// Get a reference to the underlying request
>> + pub fn request(&self) -> &Request<T> {
>> + &self.request
>> + }
>> +}
>
> I really like how you improved the safety comments and documentation. It
> was a lot easier to wrap my head around this time (might also be me getting
> more experienced, but I think it helped a lot).

That's great!

Best regards,
Andreas


2024-04-02 22:35:57

by Benno Lossin

[permalink] [raw]
Subject: Re: [RFC PATCH 4/5] rust: block: add rnull, Rust null_blk implementation

On 02.04.24 14:52, Andreas Hindborg wrote:
> Benno Lossin <[email protected]> writes:
>
>> On 3/13/24 12:05, Andreas Hindborg wrote:
>>> +module! {
>>> + type: NullBlkModule,
>>> + name: "rnull_mod",
>>> + author: "Andreas Hindborg",
>>> + license: "GPL v2",
>>> + params: {
>>> + param_memory_backed: bool {
>>> + default: true,
>>> + permissions: 0,
>>> + description: "Use memory backing",
>>> + },
>>> + // Problems with pin_init when `irq_mode`
>>
>> Can you elaborate?
>
> I think we discussed this before, but I do not recall what you decided
> was the issue.

Ah I vaguely remember.

> It is probably easier if you can apply the patches and try to build with
> this on top:
>
> diff --git a/drivers/block/rnull.rs b/drivers/block/rnull.rs
> index 04bdb6668558..bd089c5e6e89 100644
> --- a/drivers/block/rnull.rs
> +++ b/drivers/block/rnull.rs
> @@ -48,7 +48,7 @@
> description: "Use memory backing",
> },
> // Problems with pin_init when `irq_mode`
> - param_irq_mode: u8 {
> + irq_mode: u8 {
> default: 0,
> permissions: 0,
> description: "IRQ Mode (0: None, 1: Soft, 2: Timer)",
> @@ -101,7 +101,7 @@ fn add_disk(tagset: Arc<TagSet<NullBlkDevice>>) -> Result<GenDisk<NullBlkDevice>
> return Err(kernel::error::code::EINVAL);
> }
>
> - let irq_mode = (*param_irq_mode.read()).try_into()?;
> + let irq_mode = (*irq_mode.read()).try_into()?;
>
> let queue_data = Box::pin_init(pin_init!(
> QueueData {
>
> ---
>
> There is some kind of name clash issue when using `pin_init!` in the expression on
> line 106:
>
> let queue_data = Box::pin_init(pin_init!(
> QueueData {
> tree <- TreeContainer::new(),
> completion_time_nsec: *param_completion_time_nsec.read(),
> irq_mode,
> memory_backed: *param_memory_backed.read(),
> block_size,
> }
> ))?;
>
> I cannot immediately decipher the error message:
>
> RUSTC [M] drivers/block/rnull.o
> error[E0277]: the trait bound `__rnull_mod_irq_mode: From<u8>` is not satisfied
> --> /home/aeh/src/linux-rust/linux/drivers/block/rnull.rs:104:39
> |
> 104 | let irq_mode = (*irq_mode.read()).try_into()?;
> | ^^^^^^^^ the trait `From<u8>` is not implemented for `__rnull_mod_irq_mode`
> |
> = note: required for `u8` to implement `Into<__rnull_mod_irq_mode>`
> = note: required for `__rnull_mod_irq_mode` to implement `TryFrom<u8>`
> = note: required for `u8` to implement `TryInto<__rnull_mod_irq_mode>`
>
> error[E0308]: mismatched types
> --> /home/aeh/src/linux-rust/linux/drivers/block/rnull.rs:106:36
> |
> 106 | let queue_data = Box::pin_init(pin_init!(
> | ____________________________________^
> 107 | | QueueData {
> 108 | | tree <- TreeContainer::new(),
> 109 | | completion_time_nsec: *param_completion_time_nsec.read(),
> ... |
> 113 | | }
> 114 | | ))?;
> | | ^
> | | |
> | |_____expected `IRQMode`, found `__rnull_mod_irq_mode`
> | arguments to this function are incorrect
> |
> note: function defined here
> --> /home/aeh/.rustup/toolchains/1.74.1-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:1365:21
> |
> 1365 | pub const unsafe fn write<T>(dst: *mut T, src: T) {
> | ^^^^^
> = note: this error originates in the macro `$crate::__init_internal` which comes from the expansion of the macro `pin_init` (in Nightly builds, run with -Z macro-backtrace for more info)
>
> error[E0308]: mismatched types
> --> /home/aeh/src/linux-rust/linux/drivers/block/rnull.rs:106:36
> |
> 39 | / module! {
> 40 | | type: NullBlkModule,
> 41 | | name: "rnull_mod",
> 42 | | author: "Andreas Hindborg",
> ... |
> 71 | | },
> 72 | | }
> | |_- constant defined here
> ...
> 106 | let queue_data = Box::pin_init(pin_init!(
> | ____________________________________^
> | |____________________________________|
> | |____________________________________|
> | |____________________________________|
> | |
> 107 | | QueueData {
> 108 | | tree <- TreeContainer::new(),
> 109 | | completion_time_nsec: *param_completion_time_nsec.read(),
> ... |
> 113 | | }
> 114 | | ))?;
> | | ^
> | |_____|
> | |_____expected `DropGuard<IRQMode>`, found `__rnull_mod_irq_mode`
> | |_____this expression has type `DropGuard<IRQMode>`
> | |_____`irq_mode` is interpreted as a constant, not a new binding
> | help: introduce a new binding instead: `other_irq_mode`
> |
> = note: expected struct `DropGuard<IRQMode>`
> found struct `__rnull_mod_irq_mode`
> = note: this error originates in the macro `$crate::__init_internal` which comes from the expansion of the macro `pin_init` (in Nightly builds, run with -Z macro-backtrace for more info)
>
> error: aborting due to 3 previous errors
>
> Some errors have detailed explanations: E0277, E0308.
> For more information about an error, try `rustc --explain E0277`.
> make[5]: *** [/home/aeh/src/linux-rust/linux/scripts/Makefile.build:293: drivers/block/rnull.o] Error 1


So I did some digging and there are multiple things at play. I am going
to explain the second error first, since that one might be a problem
with `pin_init`:
- the `params` extension of the `module!` macro creates constants with
snake case names.
- your `QueueData` struct has the same name as a field.
- `pin_init!` generates `let $field_name = ...` statements for each
field you initialize

Now when you define a constant in Rust, you are able to pattern-match
with that constant, eg:

const FOO: u8 = 0;

fn main() {
match 10 {
FOO => println!("foo"),
_ => {}
}
}

So when you do `let FOO = x;`, then it interprets `FOO` as the constant.
This is still true if the constant has a snake case name.
Since the expression in the `pin_init!` macro has type
`DropGuard<$field_type>`, we get the error "expected
`DropGuard<IRQMode>`, found `__rnull_mod_irq_mode`".
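The clash can be reproduced in a small standalone program (userspace, outside the kernel):

```rust
// A snake-case constant, like the ones the `module!` macro generates.
#[allow(non_upper_case_globals)]
const irq_mode: u8 = 0;

fn classify(x: u8) -> &'static str {
    match x {
        // `irq_mode` here is the constant `0`, NOT a fresh binding,
        // so this arm only matches when `x == 0`.
        irq_mode => "matched the constant",
        _ => "some other value",
    }
}

fn main() {
    // Likewise, `let irq_mode = 1u8;` would fail with E0005 (refutable
    // pattern in local binding) while the constant is in scope.
    println!("{}", classify(0)); // matched the constant
    println!("{}", classify(2)); // some other value
}
```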

Now to the first error, this is a problem with the parameter handling of
`module`. By the same argument above, your let binding in line 104:

let irq_mode = (*irq_mode.read()).try_into()?;

Tries to pattern-match the `irq_mode` constant with the right
expression. Since you use the `try_into` function, it tries to search
for a `TryInto` implementation for the type of `irq_mode` which is
generated by the `module!` macro. The type is named
__rnull_mod_irq_mode.


Now what to do about this. For the second error (the one related to
`pin_init`), I could create a patch that fixes it by adding the suffix
`_guard` to those let bindings, preventing the issue. Not sure if we
need that though, since it will not get rid of the first issue.

For the first issue, I think there is no other way than to use a
different name for either the field or the constant. Since constants are
usually named using screaming snake case, I think it should be renamed.
I believe your reason for using a snake case name is that these names
are used directly as the names for the parameters when loading the
module and there the convention is to use snake case, right?
In that case I think we could expect people to write the screaming snake
case name in rust and have it automatically be lower-cased by the
`module!` macro when it creates the names that the parameters are shown
with.

Hope this helps!

--
Cheers,
Benno


2024-04-02 23:10:08

by Benno Lossin

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] rust: block: introduce `kernel::block::mq` module

On 23.03.24 07:32, Andreas Hindborg wrote:
> Benno Lossin <[email protected]> writes:
>> On 3/13/24 12:05, Andreas Hindborg wrote:
>>> +//! implementations of the `Operations` trait.
>>> +//!
>>> +//! IO requests are passed to the driver as [`Request`] references. The
>>> +//! `Request` type is a wrapper around the C `struct request`. The driver must
>>> +//! mark start of request processing by calling [`Request::start`] and end of
>>> +//! processing by calling one of the [`Request::end`], methods. Failure to do so
>>> +//! can lead to IO failures.
>>
>> I am unfamiliar with this, what are "IO failures"?
>> Do you think that it might be better to change the API to use a
>> callback? So instead of calling start and end, you would do
>>
>> request.handle(|req| {
>> // do the stuff that would be done between start and end
>> });
>>
>> I took a quick look at the rnull driver and there you are calling
>> `Request::end_ok` from a different function. So my suggestion might not
>> be possible, since you really need the freedom.
>>
>> Do you think that a guard approach might work better? ie `start` returns
>> a guard that when dropped will call `end` and you need the guard to
>> operate on the request.
>
> I don't think that would fit, since the driver might not complete the
> request immediately. We might be able to call `start` on behalf of the
> driver.
>
> At any rate, since the request is reference counted now, we can
> automatically fail a request when the last reference is dropped and it
> was not marked successfully completed. I would need to measure the
> performance implications of such a feature.

Are there cases where you still need access to the request after you
have called `end`? If no, I think it would be better for the request to
be consumed by the `end` function.
This is a bit difficult with `ARef`, since the user can just clone it
though... Do you think that it might be necessary to clone requests?

Also what happens if I call `end_ok` and then `end_err` or vice versa?

>>> + pub fn data_ref(&self) -> &T::RequestData {
>>> + let request_ptr = self.0.get().cast::<bindings::request>();
>>> +
>>> + // SAFETY: `request_ptr` is a valid `struct request` because `ARef` is
>>> + // `repr(transparent)`
>>> + let p: *mut c_void = unsafe { bindings::blk_mq_rq_to_pdu(request_ptr) };
>>> +
>>> + let p = p.cast::<T::RequestData>();
>>> +
>>> + // SAFETY: By C API contract, `p` is initialized by a call to
>>> + // `OperationsVTable::init_request_callback()`. By existence of `&self`
>>> + // it must be valid for use as a shared reference.
>>> + unsafe { &*p }
>>> + }
>>> +}
>>> +
>>> +// SAFETY: It is impossible to obtain an owned or mutable `Request`, so we can
>>
>> What do you mean by "mutable `Request`"? There is the function to obtain
>> a `&mut Request`.
>
> The idea behind this comment is that it is not possible to have an owned
> `Request` instance. You can only ever have something that will deref
> (shared) to `Request`. Construction of the `Request` type is not
> possible in safe driver code. At least that is the intention.
>
> The `from_ptr_mut` is unsafe, and could be downgraded to
> `from_ptr`, since `Operations::complete` takes a shared reference
> anyway. Bottom line is that user code does not handle `&mut Request`.

Ah I see what you mean. But the user is able to have an `ARef<Request>`.
Which you own, if it is the only refcount currently held on that
request. When you drop it, you will run the destructor of the request.

A more suitable safety comment would be "SAFETY: A `struct request` may
be destroyed from any thread.".

--
Cheers,
Benno


2024-04-03 09:48:02

by Andreas Hindborg

[permalink] [raw]
Subject: Re: [RFC PATCH 4/5] rust: block: add rnull, Rust null_blk implementation

Benno Lossin <[email protected]> writes:

[...]

>
>
> So I did some digging and there are multiple things at play. I am going
> to explain the second error first, since that one might be a problem
> with `pin_init`:
> - the `params` extension of the `module!` macro creates constants with
> snake case names.
> - your `QueueData` struct has the same name as a field.
> - `pin_init!` generates `let $field_name = ...` statements for each
> field you initialize
>
> Now when you define a constant in Rust, you are able to pattern-match
> with that constant, eg:
>
> const FOO: u8 = 0;
>
> fn main() {
> match 10 {
> FOO => println!("foo"),
> _ => {}
> }
> }
>
> So when you do `let FOO = x;`, then it interprets `FOO` as the constant.
> This is still true if the constant has a snake case name.
> Since the expression in the `pin_init!` macro has type
> `DropGuard<$field_type>`, we get the error "expected
> `DropGuard<IRQMode>`, found `__rnull_mod_irq_mode`".

Thanks for the analysis!

So in this expanded code:

1 {
2 unsafe { ::core::ptr::write(&raw mut ((*slot).irq_mode), irq_mode) };
3 }
4 let irq_mode = unsafe {
5 $crate::init::__internal::DropGuard::new(&raw mut ((*slot).irq_mode))
6 };

the `irq_mode` on line 2 will refer to the correct thing, but the one on
line 6 will be a pattern match against a constant? That is really
surprising to me. Can we make the let binding in line 6 be `let
irq_mode_pin_init` or something similar?

>
> Now to the first error, this is a problem with the parameter handling of
> `module`. By the same argument above, your let binding in line 104:
>
> let irq_mode = (*irq_mode.read()).try_into()?;
>
> Tries to pattern-match the `irq_mode` constant with the right
> expression. Since you use the `try_into` function, it tries to search
> for a `TryInto` implementation for the type of `irq_mode` which is
> generated by the `module!` macro. The type is named
> __rnull_mod_irq_mode.
>
>
> Now what to do about this. For the second error (the one related to
> `pin_init`), I could create a patch that fixes it by adding the suffix
> `_guard` to those let bindings, preventing the issue. Not sure if we
> need that though, since it will not get rid of the first issue.

I think that is a good idea.

>
> For the first issue, I think there is no other way than to use a
> different name for either the field or the constant. Since constants are
> usually named using screaming snake case, I think it should be renamed.
> I believe your reason for using a snake case name is that these names
> are used directly as the names for the parameters when loading the
> module and there the convention is to use snake case, right?
> In that case I think we could expect people to write the screaming snake
> case name in rust and have it automatically be lower-cased by the
> `module!` macro when it creates the names that the parameters are shown
> with.

I was thinking about putting the parameters in a separate name space,
but making them screaming snake case is also a good idea. So it would
be `module_parameters::IRQ_MODE` to access the parameter with the name
`irq_mode` exposed towards the user. Developers can always opt in to bringing
the symbols into scope with a `use`.
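As a sketch (the `module_parameters` name and this layout are hypothetical, not what `module!` emits today):

```rust
// Hypothetical output of the `module!` macro: parameters live in their
// own module as SCREAMING_SNAKE_CASE constants, so local snake-case
// bindings no longer collide with them.
mod module_parameters {
    pub const IRQ_MODE: u8 = 0;
    pub const MEMORY_BACKED: bool = true;
}

fn main() {
    // A local `irq_mode` binding is now unambiguous.
    let irq_mode = module_parameters::IRQ_MODE;
    println!(
        "irq_mode={irq_mode} memory_backed={}",
        module_parameters::MEMORY_BACKED
    );
}
```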

Best regards,
Andreas

2024-04-03 11:09:55

by Benno Lossin

[permalink] [raw]
Subject: Re: [RFC PATCH 4/5] rust: block: add rnull, Rust null_blk implementation

On 03.04.24 11:47, Andreas Hindborg wrote:
> Benno Lossin <[email protected]> writes:
>
> [...]
>
>>
>>
>> So I did some digging and there are multiple things at play. I am going
>> to explain the second error first, since that one might be a problem
>> with `pin_init`:
>> - the `params` extension of the `module!` macro creates constants with
>> snake case names.
>> - your `QueueData` struct has the same name as a field.
>> - `pin_init!` generates `let $field_name = ...` statements for each
>> field you initialize
>>
>> Now when you define a constant in Rust, you are able to pattern-match
>> with that constant, eg:
>>
>> const FOO: u8 = 0;
>>
>> fn main() {
>> match 10 {
>> FOO => println!("foo"),
>> _ => {}
>> }
>> }
>>
>> So when you do `let FOO = x;`, then it interprets `FOO` as the constant.
>> This is still true if the constant has a snake case name.
>> Since the expression in the `pin_init!` macro has type
>> `DropGuard<$field_type>`, we get the error "expected
>> `DropGuard<IRQMode>`, found `__rnull_mod_irq_mode`".
>
> Thanks for the analysis!
>
> So in this expanded code:
>
> 1 {
> 2 unsafe { ::core::ptr::write(&raw mut ((*slot).irq_mode), irq_mode) };
> 3 }
> 4 let irq_mode = unsafe {
> 5 $crate::init::__internal::DropGuard::new(&raw mut ((*slot).irq_mode))
> 6 };
>
> the `irq_mode` on line 2 will refer to the correct thing, but the one on
> line 6 will be a pattern match against a constant? That is really
> surprising to me. Can we make the let binding in line 6 be `let
> irq_mode_pin_init` or something similar?

The first occurrence of `irq_mode` in line 2 will refer to the field of
`QueueData`. The second one in line 2 will refer to the constant.
The one in line 4 is a pattern-match of the constant (since the type of
the constant is a fieldless struct, only one value exists, making the
match irrefutable). The occurrence in line 5 is again referring to
the field of `QueueData`.

If you have a constant `FOO`, you are unable to create a local binding
with the name `FOO`. So by creating the `irq_mode` constant, you cannot
create (AFAIK) an `irq_mode` local variable (if the constant is in scope).

>
>>
>> Now to the first error, this is a problem with the parameter handling of
>> `module`. By the same argument above, your let binding in line 104:
>>
>> let irq_mode = (*irq_mode.read()).try_into()?;
>>
>> Tries to pattern-match the `irq_mode` constant with the right
>> expression. Since you use the `try_into` function, it tries to search
>> for a `TryInto` implementation for the type of `irq_mode` which is
>> generated by the `module!` macro. The type is named
>> __rnull_mod_irq_mode.
>>
>>
>> Now what to do about this. For the second error (the one related to
>> `pin_init`), I could create a patch that fixes it by adding the suffix
>> `_guard` to those let bindings, preventing the issue. Not sure if we
>> need that though, since it will not get rid of the first issue.
>
> I think that is a good idea.

Will do that.

>
>>
>> For the first issue, I think there is no other way than to use a
>> different name for either the field or the constant. Since constants are
>> usually named using screaming snake case, I think it should be renamed.
>> I believe your reason for using a snake case name is that these names
>> are used directly as the names for the parameters when loading the
>> module and there the convention is to use snake case, right?
>> In that case I think we could expect people to write the screaming snake
>> case name in rust and have it automatically be lower-cased by the
>> `module!` macro when it creates the names that the parameters are shown
>> with.
>
> I was thinking about putting the parameters in a separate name space,
> but making them screaming snake case is also a good idea. So it would
> be `module_parameters::IRQ_MODE` to access the parameter with the name
> `irq_mode` exposed towards the user. Developers can always opt in to bringing
> the symbols into scope with a `use`.

I really like the idea of putting them in a module. At the very first
glance at the usage, I thought "where does this come from?!". So having
`module_parameters` in front is really valuable.

--
Cheers,
Benno


2024-04-03 13:00:51

by Andreas Hindborg

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] rust: block: introduce `kernel::block::mq` module

Benno Lossin <[email protected]> writes:

> On 23.03.24 07:32, Andreas Hindborg wrote:
>> Benno Lossin <[email protected]> writes:
>>> On 3/13/24 12:05, Andreas Hindborg wrote:
>>>> +//! implementations of the `Operations` trait.
>>>> +//!
>>>> +//! IO requests are passed to the driver as [`Request`] references. The
>>>> +//! `Request` type is a wrapper around the C `struct request`. The driver must
>>>> +//! mark start of request processing by calling [`Request::start`] and end of
>>>> +//! processing by calling one of the [`Request::end`], methods. Failure to do so
>>>> +//! can lead to IO failures.
>>>
>>> I am unfamiliar with this, what are "IO failures"?
>>> Do you think that it might be better to change the API to use a
>>> callback? So instead of calling start and end, you would do
>>>
>>> request.handle(|req| {
>>> // do the stuff that would be done between start and end
>>> });
>>>
>>> I took a quick look at the rnull driver and there you are calling
>>> `Request::end_ok` from a different function. So my suggestion might not
>>> be possible, since you really need the freedom.
>>>
>>> Do you think that a guard approach might work better? ie `start` returns
>>> a guard that when dropped will call `end` and you need the guard to
>>> operate on the request.
>>
>> I don't think that would fit, since the driver might not complete the
>> request immediately. We might be able to call `start` on behalf of the
>> driver.
>>
>> At any rate, since the request is reference counted now, we can
>> automatically fail a request when the last reference is dropped and it
>> was not marked successfully completed. I would need to measure the
>> performance implications of such a feature.
>
> Are there cases where you still need access to the request after you
> have called `end`?

In general no, there is no need to handle the request after calling end.
C drivers are not allowed to, because this transfers ownership of the
request back to the block layer. This patch series defers the transfer of
ownership to the point when the ARef<Request> refcount goes to zero, so
there should be no danger associated with touching the `Request` after
end.

> If no, I think it would be better for the request to
> be consumed by the `end` function.
> This is a bit difficult with `ARef`, since the user can just clone it
> though... Do you think that it might be necessary to clone requests?

Looking into the details now I see that calling `Request::end` more than
once will trigger UAF, because C code decrements the refcount on the
request. When we have `ARef<Request>` around, that is a problem. It
probably also messes with other things in C land. Good catch.

I did implement `Request::end` to consume the request at one point
before I fell back on reference counting. It works fine for simple
drivers. However, most drivers will need to use the block layer tag set
service, which allows conversion of an integer id to a request pointer.
The abstraction for this feature is not part of this patch set. But the
block layer manages an integer-to-request mapping, and drivers
typically use this to identify the request that corresponds to
completion messages that arrive from hardware. When drivers are able to
turn integers into requests like this, consuming the request in the call
to `end` makes little sense (because we can just construct more).

What I do now is issue an `Option<ARef<Request>>` with
`bindings::req_ref_inc_not_zero(rq_ptr)`, to make sure that the request
is currently owned by the driver.

I guess we can check the absolute value of the refcount, and only issue
a request handle if the count matches what we expect. Then we can be certain
that the handle is unique, and we can require transfer of ownership of
the handle to `Request::end` to make sure it can never be called more
than once.

Another option is to error out in `Request::end` if the
refcount is not what we expect.
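One way to picture the "transfer ownership to `end`" variant in plain userspace Rust (a sketch, not the kernel API) is a by-value completion handle, which turns a double completion into a compile error:

```rust
struct Request {
    tag: u32,
}

// Exclusive completion handle; only one exists per in-flight request.
struct CompletionHandle(Request);

impl CompletionHandle {
    // Consumes the handle; the request can no longer be touched.
    fn end_ok(self) -> u32 {
        self.0.tag
    }

    #[allow(dead_code)]
    fn end_err(self, _err: i32) -> u32 {
        self.0.tag
    }
}

fn main() {
    let h = CompletionHandle(Request { tag: 7 });
    let tag = h.end_ok();
    // h.end_err(-5); // rejected by the compiler: `h` was moved by `end_ok`
    println!("ended request with tag {tag}");
}
```

The open question above is how to issue such a handle uniquely when the tag set can mint more references.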

>
> Also what happens if I call `end_ok` and then `end_err` or vice versa?

That would be similar to calling end twice.

>
>>>> + pub fn data_ref(&self) -> &T::RequestData {
>>>> + let request_ptr = self.0.get().cast::<bindings::request>();
>>>> +
>>>> + // SAFETY: `request_ptr` is a valid `struct request` because `ARef` is
>>>> + // `repr(transparent)`
>>>> + let p: *mut c_void = unsafe { bindings::blk_mq_rq_to_pdu(request_ptr) };
>>>> +
>>>> + let p = p.cast::<T::RequestData>();
>>>> +
>>>> + // SAFETY: By C API contract, `p` is initialized by a call to
>>>> + // `OperationsVTable::init_request_callback()`. By existence of `&self`
>>>> + // it must be valid for use as a shared reference.
>>>> + unsafe { &*p }
>>>> + }
>>>> +}
>>>> +
>>>> +// SAFETY: It is impossible to obtain an owned or mutable `Request`, so we can
>>>
>>> What do you mean by "mutable `Request`"? There is the function to obtain
>>> a `&mut Request`.
>>
>> The idea behind this comment is that it is not possible to have an owned
>> `Request` instance. You can only ever have something that will deref
>> (shared) to `Request`. Construction of the `Request` type is not
>> possible in safe driver code. At least that is the intention.
>>
>> The `from_ptr_mut` is unsafe, and could be downgraded to
>> `from_ptr`, since `Operations::complete` takes a shared reference
>> anyway. Bottom line is that user code does not handle `&mut Request`.
>
> Ah I see what you mean. But the user is able to have an `ARef<Request>`.
> Which you own, if it is the only refcount currently held on that
> request. When you drop it, you will run the destructor of the request.
>
> A more suitable safety comment would be "SAFETY: A `struct request` may
> be destroyed from any thread.".

I see, I will update the comment.


BR Andreas

2024-04-03 19:38:06

by Benno Lossin

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] rust: block: introduce `kernel::block::mq` module

On 03.04.24 10:46, Andreas Hindborg wrote:
> Benno Lossin <[email protected]> writes:
>
>> On 23.03.24 07:32, Andreas Hindborg wrote:
>>> Benno Lossin <[email protected]> writes:
>>>> On 3/13/24 12:05, Andreas Hindborg wrote:
>>>>> +//! implementations of the `Operations` trait.
>>>>> +//!
>>>>> +//! IO requests are passed to the driver as [`Request`] references. The
>>>>> +//! `Request` type is a wrapper around the C `struct request`. The driver must
>>>>> +//! mark start of request processing by calling [`Request::start`] and end of
>>>>> +//! processing by calling one of the [`Request::end`], methods. Failure to do so
>>>>> +//! can lead to IO failures.
>>>>
>>>> I am unfamiliar with this, what are "IO failures"?
>>>> Do you think that it might be better to change the API to use a
>>>> callback? So instead of calling start and end, you would do
>>>>
>>>> request.handle(|req| {
>>>> // do the stuff that would be done between start and end
>>>> });
>>>>
>>>> I took a quick look at the rnull driver and there you are calling
>>>> `Request::end_ok` from a different function. So my suggestion might not
>>>> be possible, since you really need the freedom.
>>>>
>>>> Do you think that a guard approach might work better? ie `start` returns
>>>> a guard that when dropped will call `end` and you need the guard to
>>>> operate on the request.
>>>
>>> I don't think that would fit, since the driver might not complete the
>>> request immediately. We might be able to call `start` on behalf of the
>>> driver.
>>>
>>> At any rate, since the request is reference counted now, we can
>>> automatically fail a request when the last reference is dropped and it
>>> was not marked successfully completed. I would need to measure the
>>> performance implications of such a feature.
>>
>> Are there cases where you still need access to the request after you
>> have called `end`?
>
> In general no, there is no need to handle the request after calling end.
> C drivers are not allowed to, because this transfers ownership of the
> request back to the block layer. This patch series defer the transfer of
> ownership to the point when the ARef<Request> refcount goes to zero, so
> there should be no danger associated with touching the `Request` after
> end.
>
>> If no, I think it would be better for the request to
>> be consumed by the `end` function.
>> This is a bit difficult with `ARef`, since the user can just clone it
>> though... Do you think that it might be necessary to clone requests?
>
> Looking into the details now I see that calling `Request::end` more than
> once will trigger UAF, because C code decrements the refcount on the
> request. When we have `ARef<Request>` around, that is a problem. It
> probably also messes with other things in C land. Good catch.
>
> I did implement `Request::end` to consume the request at one point
> before I fell back on reference counting. It works fine for simple
> drivers. However, most drivers will need to use the block layer tag set
> service, that allows conversion of an integer id to a request pointer.
> The abstraction for this feature is not part of this patch set. But the
> block layer manages a mapping of integer to request mapping, and drivers
> typically use this to identify the request that corresponds to
> completion messages that arrive from hardware. When drivers are able to
> turn integers into requests like this, consuming the request in the call
> to `end` makes little sense (because we can just construct more).

How do you ensure that this is fine?:

let r1 = tagset.get(0);
let r2 = tagset.get(0);
r1.end_ok();
r2.do_something_that_would_only_be_done_while_active();

One thing that comes to my mind would be to only give out `&Request`
from the tag set. And to destroy, you could have a separate operation
that also removes the request from the tag set. (I am thinking of a tag
set as a `HashMap<u64, Request>`.)

>
> What I do now is issue the an `Option<ARef<Request>>` with
> `bindings::req_ref_inc_not_zero(rq_ptr)`, to make sure that the request
> is currently owned by the driver.
>
> I guess we can check the absolute value of the refcount, and only issue
> a request handle if the count matches what we expect. Then we can be certain
> that the handle is unique, and we can require transfer of ownership of
> the handle to `Request::end` to make sure it can never be called more
> than once.
>
> Another option is to error out in `Request::end` if the
> refcount is not what we expect.

I am a bit confused, why does the refcount matter in this case? Can't
the user just have multiple `ARef`s?

I think it would be weird to use `ARef<Request>` if you expect the
refcount to be 1. Maybe the API should be different?
As I understand it, a request has the following life cycle (please
correct me if I am wrong):
1. A new request is created, it is given to the driver via `queue_rq`.
2. The driver can now decide what to do with it (theoretically it can
store it somewhere and later do something with it), but it should at
some point call `Request::start`.
3. Work happens and eventually the driver calls `Request::end`.

To me this does not seem like something where we need a refcount (we
still might need one for safety, but it does not need to be exposed to
the user).

--
Cheers,
Benno


2024-04-04 09:44:28

by Andreas Hindborg

Subject: Re: [RFC PATCH 1/5] rust: block: introduce `kernel::block::mq` module

Benno Lossin <[email protected]> writes:

> On 03.04.24 10:46, Andreas Hindborg wrote:
>> Benno Lossin <[email protected]> writes:
>>
>>> On 23.03.24 07:32, Andreas Hindborg wrote:
>>>> Benno Lossin <[email protected]> writes:
>>>>> On 3/13/24 12:05, Andreas Hindborg wrote:
>>>>>> +//! implementations of the `Operations` trait.
>>>>>> +//!
>>>>>> +//! IO requests are passed to the driver as [`Request`] references. The
>>>>>> +//! `Request` type is a wrapper around the C `struct request`. The driver must
>>>>>> +//! mark start of request processing by calling [`Request::start`] and end of
>>>>>> +//! processing by calling one of the [`Request::end`], methods. Failure to do so
>>>>>> +//! can lead to IO failures.
>>>>>
>>>>> I am unfamiliar with this, what are "IO failures"?
>>>>> Do you think that it might be better to change the API to use a
>>>>> callback? So instead of calling start and end, you would do
>>>>>
>>>>> request.handle(|req| {
>>>>> // do the stuff that would be done between start and end
>>>>> });
>>>>>
>>>>> I took a quick look at the rnull driver and there you are calling
>>>>> `Request::end_ok` from a different function. So my suggestion might not
>>>>> be possible, since you really need the freedom.
>>>>>
>>>>> Do you think that a guard approach might work better? ie `start` returns
>>>>> a guard that when dropped will call `end` and you need the guard to
>>>>> operate on the request.
>>>>
>>>> I don't think that would fit, since the driver might not complete the
>>>> request immediately. We might be able to call `start` on behalf of the
>>>> driver.
>>>>
>>>> At any rate, since the request is reference counted now, we can
>>>> automatically fail a request when the last reference is dropped and it
>>>> was not marked successfully completed. I would need to measure the
>>>> performance implications of such a feature.
>>>
>>> Are there cases where you still need access to the request after you
>>> have called `end`?
>>
>> In general no, there is no need to handle the request after calling end.
>> C drivers are not allowed to, because this transfers ownership of the
>> request back to the block layer. This patch series defer the transfer of
>> ownership to the point when the ARef<Request> refcount goes to zero, so
>> there should be no danger associated with touching the `Request` after
>> end.
>>
>>> If no, I think it would be better for the request to
>>> be consumed by the `end` function.
>>> This is a bit difficult with `ARef`, since the user can just clone it
>>> though... Do you think that it might be necessary to clone requests?
>>
>> Looking into the details now I see that calling `Request::end` more than
>> once will trigger UAF, because C code decrements the refcount on the
>> request. When we have `ARef<Request>` around, that is a problem. It
>> probably also messes with other things in C land. Good catch.
>>
>> I did implement `Request::end` to consume the request at one point
>> before I fell back on reference counting. It works fine for simple
>> drivers. However, most drivers will need to use the block layer tag set
>> service, that allows conversion of an integer id to a request pointer.
>> The abstraction for this feature is not part of this patch set. But the
>> block layer manages a mapping of integer to request mapping, and drivers
>> typically use this to identify the request that corresponds to
>> completion messages that arrive from hardware. When drivers are able to
>> turn integers into requests like this, consuming the request in the call
>> to `end` makes little sense (because we can just construct more).
>
> How do you ensure that this is fine?:
>
> let r1 = tagset.get(0);
> let r2 = tagset.get(0);
> r1.end_ok();
> r2.do_something_that_would_only_be_done_while_active();
>
> One thing that comes to my mind would be to only give out `&Request`
> from the tag set. And to destroy, you could have a separate operation
> that also removes the request from the tag set. (I am thinking of a tag
> set as a `HashMap<u64, Request>`.

This would be similar to

let r1 = tagset.get(0)?;
let r2 = r1.clone();
r1.end_ok();
r2.do_something_requires_active();

but it is not a problem because we do not implement any actions that are
illegal in that position (outside of `end` - that _is_ a problem).


>>
>> What I do now is issue the an `Option<ARef<Request>>` with
>> `bindings::req_ref_inc_not_zero(rq_ptr)`, to make sure that the request
>> is currently owned by the driver.
>>
>> I guess we can check the absolute value of the refcount, and only issue
>> a request handle if the count matches what we expect. Then we can be certain
>> that the handle is unique, and we can require transfer of ownership of
>> the handle to `Request::end` to make sure it can never be called more
>> than once.
>>
>> Another option is to error out in `Request::end` if the
>> refcount is not what we expect.
>
> I am a bit confused, why does the refcount matter in this case? Can't
> the user just have multiple `ARef`s?

Because we want to assert that we are consuming the last handle to the
request. After we do that, the user cannot call `Request::end` again.
`TagSet::get` will not issue a request reference if the request is not
in flight. Although there might be a race condition to watch out for.

When the block layer hands over ownership to Rust, the reference count
is 1. The first `ARef<Request>` we create increments the count to 2. To
complete the request, we must have ownership of all reference counts
above 1. The block layer takes the last reference count when it takes
back ownership of the request.

> I think it would be weird to use `ARef<Request>` if you expect the
> refcount to be 1.

Yes, that would require a custom smart pointer with a `try_into_unique`
method that succeeds when the refcount is exactly 2. It would consume
the instance and decrement the refcount to 1. But as I said, there is a
potential race with `TagSet::get` when the refcount is 1 that needs to
be handled.

> Maybe the API should be different?

It needs to change a little, yes.

> As I understand it, a request has the following life cycle (please
> correct me if I am wrong):
> 1. A new request is created, it is given to the driver via `queue_rq`.
> 2. The driver can now decide what to do with it (theoretically it can
> store it somewhere and later do something with it), but it should at
> some point call `Request::start`.
> 3. Work happens and eventually the driver calls `Request::end`.
>
> To me this does not seem like something where we need a refcount (we
> still might need one for safety, but it does not need to be exposed to
> the user).

It would not need to be exposed to the user, other than that either a)
ending a request can fail, or b) `TagSet::get` can fail.

a) would require that ending a request must be done with a unique
reference. This could be done by the user calling
`try_into_unique` or by making the `end` method fallible.

b) would make the reference handle `!Clone` and add a failure mode to
`TagSet::get`, so it fails to construct a `Request` handle if there is
already one in existence.

I gravitate towards a) because it allows the user to clone the Request
reference without adding an additional `Arc`.


BR Andreas

2024-04-04 13:47:15

by Benno Lossin

Subject: Re: [RFC PATCH 1/5] rust: block: introduce `kernel::block::mq` module

On 04.04.24 11:30, Andreas Hindborg wrote:
> Benno Lossin <[email protected]> writes:
>
>> On 04.04.24 07:44, Andreas Hindborg wrote:
>>> Benno Lossin <[email protected]> writes:
>>>
>>>> On 03.04.24 10:46, Andreas Hindborg wrote:
>>>>> Benno Lossin <[email protected]> writes:
>>>>>
>>>>>> On 23.03.24 07:32, Andreas Hindborg wrote:
>>>>>>> Benno Lossin <[email protected]> writes:
>>>>>>>> On 3/13/24 12:05, Andreas Hindborg wrote:
>>>>>>>>> +//! implementations of the `Operations` trait.
>>>>>>>>> +//!
>>>>>>>>> +//! IO requests are passed to the driver as [`Request`] references. The
>>>>>>>>> +//! `Request` type is a wrapper around the C `struct request`. The driver must
>>>>>>>>> +//! mark start of request processing by calling [`Request::start`] and end of
>>>>>>>>> +//! processing by calling one of the [`Request::end`], methods. Failure to do so
>>>>>>>>> +//! can lead to IO failures.
>>>>>>>>
>>>>>>>> I am unfamiliar with this, what are "IO failures"?
>>>>>>>> Do you think that it might be better to change the API to use a
>>>>>>>> callback? So instead of calling start and end, you would do
>>>>>>>>
>>>>>>>> request.handle(|req| {
>>>>>>>> // do the stuff that would be done between start and end
>>>>>>>> });
>>>>>>>>
>>>>>>>> I took a quick look at the rnull driver and there you are calling
>>>>>>>> `Request::end_ok` from a different function. So my suggestion might not
>>>>>>>> be possible, since you really need the freedom.
>>>>>>>>
>>>>>>>> Do you think that a guard approach might work better? ie `start` returns
>>>>>>>> a guard that when dropped will call `end` and you need the guard to
>>>>>>>> operate on the request.
>>>>>>>
>>>>>>> I don't think that would fit, since the driver might not complete the
>>>>>>> request immediately. We might be able to call `start` on behalf of the
>>>>>>> driver.
>>>>>>>
>>>>>>> At any rate, since the request is reference counted now, we can
>>>>>>> automatically fail a request when the last reference is dropped and it
>>>>>>> was not marked successfully completed. I would need to measure the
>>>>>>> performance implications of such a feature.
>>>>>>
>>>>>> Are there cases where you still need access to the request after you
>>>>>> have called `end`?
>>>>>
>>>>> In general no, there is no need to handle the request after calling end.
>>>>> C drivers are not allowed to, because this transfers ownership of the
>>>>> request back to the block layer. This patch series defer the transfer of
>>>>> ownership to the point when the ARef<Request> refcount goes to zero, so
>>>>> there should be no danger associated with touching the `Request` after
>>>>> end.
>>>>>
>>>>>> If no, I think it would be better for the request to
>>>>>> be consumed by the `end` function.
>>>>>> This is a bit difficult with `ARef`, since the user can just clone it
>>>>>> though... Do you think that it might be necessary to clone requests?
>>>>>
>>>>> Looking into the details now I see that calling `Request::end` more than
>>>>> once will trigger UAF, because C code decrements the refcount on the
>>>>> request. When we have `ARef<Request>` around, that is a problem. It
>>>>> probably also messes with other things in C land. Good catch.
>>>>>
>>>>> I did implement `Request::end` to consume the request at one point
>>>>> before I fell back on reference counting. It works fine for simple
>>>>> drivers. However, most drivers will need to use the block layer tag set
>>>>> service, that allows conversion of an integer id to a request pointer.
>>>>> The abstraction for this feature is not part of this patch set. But the
>>>>> block layer manages a mapping of integer to request mapping, and drivers
>>>>> typically use this to identify the request that corresponds to
>>>>> completion messages that arrive from hardware. When drivers are able to
>>>>> turn integers into requests like this, consuming the request in the call
>>>>> to `end` makes little sense (because we can just construct more).
>>>>
>>>> How do you ensure that this is fine?:
>>>>
>>>> let r1 = tagset.get(0);
>>>> let r2 = tagset.get(0);
>>>> r1.end_ok();
>>>> r2.do_something_that_would_only_be_done_while_active();
>>>>
>>>> One thing that comes to my mind would be to only give out `&Request`
>>>> from the tag set. And to destroy, you could have a separate operation
>>>> that also removes the request from the tag set. (I am thinking of a tag
>>>> set as a `HashMap<u64, Request>`.
>>>
>>> This would be similar to
>>>
>>> let r1 = tagset.get(0)?;
>>> ler r2 = r1.clone();
>>> r1.end_ok();
>>> r2.do_something_requires_active();
>>>
>>> but it is not a problem because we do not implement any actions that are
>>> illegal in that position (outside of `end` - that _is_ a problem).
>>
>> Makes sense, but I think it's a bit weird to still be able to access it
>> after `end`ing.
>
> Yes, that is true.
>
>>
>>>
>>>
>>>>>
>>>>> What I do now is issue the an `Option<ARef<Request>>` with
>>>>> `bindings::req_ref_inc_not_zero(rq_ptr)`, to make sure that the request
>>>>> is currently owned by the driver.
>>>>>
>>>>> I guess we can check the absolute value of the refcount, and only issue
>>>>> a request handle if the count matches what we expect. Then we can be certain
>>>>> that the handle is unique, and we can require transfer of ownership of
>>>>> the handle to `Request::end` to make sure it can never be called more
>>>>> than once.
>>>>>
>>>>> Another option is to error out in `Request::end` if the
>>>>> refcount is not what we expect.
>>>>
>>>> I am a bit confused, why does the refcount matter in this case? Can't
>>>> the user just have multiple `ARef`s?
>>>
>>> Because we want to assert that we are consuming the last handle to the
>>> request. After we do that, the user cannot call `Request::end` again.
>>> `TagSet::get` will not issue a request reference if the request is not
>>> in flight. Although there might be a race condition to watch out for.
>>>
>>> When the block layer hands over ownership to Rust, the reference count
>>> is 1. The first `ARef<Request>` we create increments the count to 2. To
>>> complete the request, we must have ownership of all reference counts
>>> above 1. The block layer takes the last reference count when it takes
>>> back ownership of the request.
>>>
>>>> I think it would be weird to use `ARef<Request>` if you expect the
>>>> refcount to be 1.
>>>
>>> Yes, that would require a custom smart pointer with a `try_into_unique`
>>> method that succeeds when the refcount is exactly 2. It would consume
>>> the instance and decrement the refcount to 1. But as I said, there is a
>>> potential race with `TagSet::get` when the refcount is 1 that needs to
>>> be handled.
>>>
>>>> Maybe the API should be different?
>>>
>>> I needs to change a little, yes.
>>>
>>>> As I understand it, a request has the following life cycle (please
>>>> correct me if I am wrong):
>>>> 1. A new request is created, it is given to the driver via `queue_rq`.
>>>> 2. The driver can now decide what to do with it (theoretically it can
>>>> store it somewhere and later do something with it), but it should at
>>>> some point call `Request::start`.
>>>> 3. Work happens and eventually the driver calls `Request::end`.
>>>>
>>>> To me this does not seem like something where we need a refcount (we
>>>> still might need one for safety, but it does not need to be exposed to
>>>> the user).
>>>
>>> It would not need to be exposed to the user, other than a) ending a request
>>> can fail OR b) `TagSet::get` can fail.
>>>
>>> a) would require that ending a request must be done with a unique
>>> reference. This could be done by the user by the user calling
>>> `try_into_unique` or by making the `end` method fallible.
>>>
>>> b) would make the reference handle `!Clone` and add a failure mode to
>>> `TagSet::get`, so it fails to construct a `Request` handle if there are
>>> already one in existence.
>>>
>>> I gravitate towards a) because it allows the user to clone the Request
>>> reference without adding an additional `Arc`.
>>
>> This confuses me a little, since I thought that `TagSet::get` returns
>> `Option<ARef<Request>>`.
>
> It does, but in the current implementation the failure mode returning
> `None` is triggered when the refcount is zero, meaning that the request
> corresponding to that tag is not currently owned by the driver. For
> solution b) we would change the type to be
> `Option<CustomSmartPointerHandleThing<Request>>`.
>
>> (I tried to find the abstractions in your
>> github, but I did not find them)
>
> It's here [1]. It was introduced in the `rnvme-v6.8` branch.

Thanks for the pointer.

>> I think that this could work: `queue_rq` takes a `OwnedRequest`, which
>> the user can store in a `TagSet`, transferring ownership. `TagSet::get`
>> returns `Option<&Request>` and you can call `TagSet::remove` to get
>> `Option<OwnedRequest>`. `OwnedRequest::end` consumes `self`.
>> With this pattern we also do not need to take an additional refcount.
>
> It would, but the `TagSet` is just a wrapper for the C block layer
> `strugt blk_mq_tag_set`. This is a highly optimized data structure and
> tag mapping is done before the driver sees the request. I would like to
> reuse that logic.
>
> We could implement what you suggest anyhow, but I would not want to that
> additional logic to the hot path.

I overlooked an important detail: the `TagSet` is always stored in an
`Arc` (IIRC since you want to be able to share it between different
`Gendisk`s). This probably makes my suggestion impossible, since you
can't mutably borrow the `TagSet` for removal of `Request`s.
Depending on how `Request`s are associated with a `TagSet`, there might be
a way around this: I saw the `qid` parameter to the `tag_to_rq`
function, is that a unique identifier for a queue? Because in that case
we might be able to have a unique `QueueTagSetRef` with

fn remove(&mut self, tag: u32) -> OwnedRequest;

fn get(&self, tag: u32) -> Option<&Request>;

The `TagSet` would still be shared, only the ability to "remove" (I
don't know if you do that manually in C, if not, then this would just
remove it in the abstraction, but keep it on the C side) is unique to
the `QueueTagSetRef` struct.

But feel free to use your proposed option a), it is simpler and we can
try to make this work when you send the `TagSet` abstractions.
I just think that we should try a bit harder to make it even better.

--
Cheers,
Benno


2024-04-04 14:17:56

by Andreas Hindborg

Subject: Re: [RFC PATCH 1/5] rust: block: introduce `kernel::block::mq` module

Benno Lossin <[email protected]> writes:

> On 04.04.24 07:44, Andreas Hindborg wrote:
>> Benno Lossin <[email protected]> writes:
>>
>>> On 03.04.24 10:46, Andreas Hindborg wrote:
>>>> Benno Lossin <[email protected]> writes:
>>>>
>>>>> On 23.03.24 07:32, Andreas Hindborg wrote:
>>>>>> Benno Lossin <[email protected]> writes:
>>>>>>> On 3/13/24 12:05, Andreas Hindborg wrote:
>>>>>>>> +//! implementations of the `Operations` trait.
>>>>>>>> +//!
>>>>>>>> +//! IO requests are passed to the driver as [`Request`] references. The
>>>>>>>> +//! `Request` type is a wrapper around the C `struct request`. The driver must
>>>>>>>> +//! mark start of request processing by calling [`Request::start`] and end of
>>>>>>>> +//! processing by calling one of the [`Request::end`], methods. Failure to do so
>>>>>>>> +//! can lead to IO failures.
>>>>>>>
>>>>>>> I am unfamiliar with this, what are "IO failures"?
>>>>>>> Do you think that it might be better to change the API to use a
>>>>>>> callback? So instead of calling start and end, you would do
>>>>>>>
>>>>>>> request.handle(|req| {
>>>>>>> // do the stuff that would be done between start and end
>>>>>>> });
>>>>>>>
>>>>>>> I took a quick look at the rnull driver and there you are calling
>>>>>>> `Request::end_ok` from a different function. So my suggestion might not
>>>>>>> be possible, since you really need the freedom.
>>>>>>>
>>>>>>> Do you think that a guard approach might work better? ie `start` returns
>>>>>>> a guard that when dropped will call `end` and you need the guard to
>>>>>>> operate on the request.
>>>>>>
>>>>>> I don't think that would fit, since the driver might not complete the
>>>>>> request immediately. We might be able to call `start` on behalf of the
>>>>>> driver.
>>>>>>
>>>>>> At any rate, since the request is reference counted now, we can
>>>>>> automatically fail a request when the last reference is dropped and it
>>>>>> was not marked successfully completed. I would need to measure the
>>>>>> performance implications of such a feature.
>>>>>
>>>>> Are there cases where you still need access to the request after you
>>>>> have called `end`?
>>>>
>>>> In general no, there is no need to handle the request after calling end.
>>>> C drivers are not allowed to, because this transfers ownership of the
>>>> request back to the block layer. This patch series defer the transfer of
>>>> ownership to the point when the ARef<Request> refcount goes to zero, so
>>>> there should be no danger associated with touching the `Request` after
>>>> end.
>>>>
>>>>> If no, I think it would be better for the request to
>>>>> be consumed by the `end` function.
>>>>> This is a bit difficult with `ARef`, since the user can just clone it
>>>>> though... Do you think that it might be necessary to clone requests?
>>>>
>>>> Looking into the details now I see that calling `Request::end` more than
>>>> once will trigger UAF, because C code decrements the refcount on the
>>>> request. When we have `ARef<Request>` around, that is a problem. It
>>>> probably also messes with other things in C land. Good catch.
>>>>
>>>> I did implement `Request::end` to consume the request at one point
>>>> before I fell back on reference counting. It works fine for simple
>>>> drivers. However, most drivers will need to use the block layer tag set
>>>> service, that allows conversion of an integer id to a request pointer.
>>>> The abstraction for this feature is not part of this patch set. But the
>>>> block layer manages a mapping of integer to request mapping, and drivers
>>>> typically use this to identify the request that corresponds to
>>>> completion messages that arrive from hardware. When drivers are able to
>>>> turn integers into requests like this, consuming the request in the call
>>>> to `end` makes little sense (because we can just construct more).
>>>
>>> How do you ensure that this is fine?:
>>>
>>> let r1 = tagset.get(0);
>>> let r2 = tagset.get(0);
>>> r1.end_ok();
>>> r2.do_something_that_would_only_be_done_while_active();
>>>
>>> One thing that comes to my mind would be to only give out `&Request`
>>> from the tag set. And to destroy, you could have a separate operation
>>> that also removes the request from the tag set. (I am thinking of a tag
>>> set as a `HashMap<u64, Request>`.
>>
>> This would be similar to
>>
>> let r1 = tagset.get(0)?;
>> ler r2 = r1.clone();
>> r1.end_ok();
>> r2.do_something_requires_active();
>>
>> but it is not a problem because we do not implement any actions that are
>> illegal in that position (outside of `end` - that _is_ a problem).
>
> Makes sense, but I think it's a bit weird to still be able to access it
> after `end`ing.

Yes, that is true.

>
>>
>>
>>>>
>>>> What I do now is issue the an `Option<ARef<Request>>` with
>>>> `bindings::req_ref_inc_not_zero(rq_ptr)`, to make sure that the request
>>>> is currently owned by the driver.
>>>>
>>>> I guess we can check the absolute value of the refcount, and only issue
>>>> a request handle if the count matches what we expect. Then we can be certain
>>>> that the handle is unique, and we can require transfer of ownership of
>>>> the handle to `Request::end` to make sure it can never be called more
>>>> than once.
>>>>
>>>> Another option is to error out in `Request::end` if the
>>>> refcount is not what we expect.
>>>
>>> I am a bit confused, why does the refcount matter in this case? Can't
>>> the user just have multiple `ARef`s?
>>
>> Because we want to assert that we are consuming the last handle to the
>> request. After we do that, the user cannot call `Request::end` again.
>> `TagSet::get` will not issue a request reference if the request is not
>> in flight. Although there might be a race condition to watch out for.
>>
>> When the block layer hands over ownership to Rust, the reference count
>> is 1. The first `ARef<Request>` we create increments the count to 2. To
>> complete the request, we must have ownership of all reference counts
>> above 1. The block layer takes the last reference count when it takes
>> back ownership of the request.
>>
>>> I think it would be weird to use `ARef<Request>` if you expect the
>>> refcount to be 1.
>>
>> Yes, that would require a custom smart pointer with a `try_into_unique`
>> method that succeeds when the refcount is exactly 2. It would consume
>> the instance and decrement the refcount to 1. But as I said, there is a
>> potential race with `TagSet::get` when the refcount is 1 that needs to
>> be handled.
>>
>>> Maybe the API should be different?
>>
>> I needs to change a little, yes.
>>
>>> As I understand it, a request has the following life cycle (please
>>> correct me if I am wrong):
>>> 1. A new request is created, it is given to the driver via `queue_rq`.
>>> 2. The driver can now decide what to do with it (theoretically it can
>>> store it somewhere and later do something with it), but it should at
>>> some point call `Request::start`.
>>> 3. Work happens and eventually the driver calls `Request::end`.
>>>
>>> To me this does not seem like something where we need a refcount (we
>>> still might need one for safety, but it does not need to be exposed to
>>> the user).
>>
>> It would not need to be exposed to the user, other than a) ending a request
>> can fail OR b) `TagSet::get` can fail.
>>
>> a) would require that ending a request must be done with a unique
>> reference. This could be done by the user by the user calling
>> `try_into_unique` or by making the `end` method fallible.
>>
>> b) would make the reference handle `!Clone` and add a failure mode to
>> `TagSet::get`, so it fails to construct a `Request` handle if there are
>> already one in existence.
>>
>> I gravitate towards a) because it allows the user to clone the Request
>> reference without adding an additional `Arc`.
>
> This confuses me a little, since I thought that `TagSet::get` returns
> `Option<ARef<Request>>`.

It does, but in the current implementation the failure mode returning
`None` is triggered when the refcount is zero, meaning that the request
corresponding to that tag is not currently owned by the driver. For
solution b) we would change the type to be
`Option<CustomSmartPointerHandleThing<Request>>`.

> (I tried to find the abstractions in your
> github, but I did not find them)

It's here [1]. It was introduced in the `rnvme-v6.8` branch.

> I think that this could work: `queue_rq` takes a `OwnedRequest`, which
> the user can store in a `TagSet`, transferring ownership. `TagSet::get`
> returns `Option<&Request>` and you can call `TagSet::remove` to get
> `Option<OwnedRequest>`. `OwnedRequest::end` consumes `self`.
> With this pattern we also do not need to take an additional refcount.

It would, but the `TagSet` is just a wrapper for the C block layer
`struct blk_mq_tag_set`. This is a highly optimized data structure and
tag mapping is done before the driver sees the request. I would like to
reuse that logic.

We could implement what you suggest anyhow, but I would not want to add
that additional logic to the hot path.


BR Andreas


[1] https://github.com/metaspace/linux/blob/0976c869fbfae13f6d475a287ade776c730cc029/rust/kernel/block/mq/tag_set.rs#L102

2024-04-04 09:04:33

by Benno Lossin

Subject: Re: [RFC PATCH 1/5] rust: block: introduce `kernel::block::mq` module

On 04.04.24 07:44, Andreas Hindborg wrote:
> Benno Lossin <[email protected]> writes:
>
>> On 03.04.24 10:46, Andreas Hindborg wrote:
>>> Benno Lossin <[email protected]> writes:
>>>
>>>> On 23.03.24 07:32, Andreas Hindborg wrote:
>>>>> Benno Lossin <[email protected]> writes:
>>>>>> On 3/13/24 12:05, Andreas Hindborg wrote:
>>>>>>> +//! implementations of the `Operations` trait.
>>>>>>> +//!
>>>>>>> +//! IO requests are passed to the driver as [`Request`] references The
>>>>>>> +//! `Request` type is a wrapper around the C `struct request`. The driver must
>>>>>>> +//! mark start of request processing by calling [`Request::start`] and end of
>>>>>>> +//! processing by calling one of the [`Request::end`], methods. Failure to do so
>>>>>>> +//! can lead to IO failures.
>>>>>>
>>>>>> I am unfamiliar with this, what are "IO failures"?
>>>>>> Do you think that it might be better to change the API to use a
>>>>>> callback? So instead of calling start and end, you would do
>>>>>>
>>>>>> request.handle(|req| {
>>>>>> // do the stuff that would be done between start and end
>>>>>> });
>>>>>>
>>>>>> I took a quick look at the rnull driver and there you are calling
>>>>>> `Request::end_ok` from a different function. So my suggestion might not
>>>>>> be possible, since you really need the freedom.
>>>>>>
>>>>>> Do you think that a guard approach might work better? ie `start` returns
>>>>>> a guard that when dropped will call `end` and you need the guard to
>>>>>> operate on the request.
>>>>>
>>>>> I don't think that would fit, since the driver might not complete the
>>>>> request immediately. We might be able to call `start` on behalf of the
>>>>> driver.
>>>>>
>>>>> At any rate, since the request is reference counted now, we can
>>>>> automatically fail a request when the last reference is dropped and it
>>>>> was not marked successfully completed. I would need to measure the
>>>>> performance implications of such a feature.
>>>>
>>>> Are there cases where you still need access to the request after you
>>>> have called `end`?
>>>
>>> In general no, there is no need to handle the request after calling end.
>>> C drivers are not allowed to, because this transfers ownership of the
>>> request back to the block layer. This patch series defers the transfer of
>>> ownership to the point when the ARef<Request> refcount goes to zero, so
>>> there should be no danger associated with touching the `Request` after
>>> end.
>>>
>>>> If no, I think it would be better for the request to
>>>> be consumed by the `end` function.
>>>> This is a bit difficult with `ARef`, since the user can just clone it
>>>> though... Do you think that it might be necessary to clone requests?
>>>
>>> Looking into the details now I see that calling `Request::end` more than
>>> once will trigger UAF, because C code decrements the refcount on the
>>> request. When we have `ARef<Request>` around, that is a problem. It
>>> probably also messes with other things in C land. Good catch.
>>>
>>> I did implement `Request::end` to consume the request at one point
>>> before I fell back on reference counting. It works fine for simple
>>> drivers. However, most drivers will need to use the block layer tag set
>>> service, that allows conversion of an integer id to a request pointer.
>>> The abstraction for this feature is not part of this patch set. But the
>>> block layer manages an integer-to-request mapping, and drivers
>>> typically use this to identify the request that corresponds to
>>> completion messages that arrive from hardware. When drivers are able to
>>> turn integers into requests like this, consuming the request in the call
>>> to `end` makes little sense (because we can just construct more).
>>
>> How do you ensure that this is fine?:
>>
>> let r1 = tagset.get(0);
>> let r2 = tagset.get(0);
>> r1.end_ok();
>> r2.do_something_that_would_only_be_done_while_active();
>>
>> One thing that comes to my mind would be to only give out `&Request`
>> from the tag set. And to destroy, you could have a separate operation
>> that also removes the request from the tag set. (I am thinking of a tag
>> set as a `HashMap<u64, Request>`.)
>
> This would be similar to
>
> let r1 = tagset.get(0)?;
> let r2 = r1.clone();
> r1.end_ok();
> r2.do_something_requires_active();
>
> but it is not a problem because we do not implement any actions that are
> illegal in that position (outside of `end` - that _is_ a problem).

Makes sense, but I think it's a bit weird to still be able to access it
after `end`ing.

>
>
>>>
>>> What I do now is issue an `Option<ARef<Request>>` with
>>> `bindings::req_ref_inc_not_zero(rq_ptr)`, to make sure that the request
>>> is currently owned by the driver.
>>>
>>> I guess we can check the absolute value of the refcount, and only issue
>>> a request handle if the count matches what we expect. Then we can be certain
>>> that the handle is unique, and we can require transfer of ownership of
>>> the handle to `Request::end` to make sure it can never be called more
>>> than once.
>>>
>>> Another option is to error out in `Request::end` if the
>>> refcount is not what we expect.
>>
>> I am a bit confused, why does the refcount matter in this case? Can't
>> the user just have multiple `ARef`s?
>
> Because we want to assert that we are consuming the last handle to the
> request. After we do that, the user cannot call `Request::end` again.
> `TagSet::get` will not issue a request reference if the request is not
> in flight. Although there might be a race condition to watch out for.
>
> When the block layer hands over ownership to Rust, the reference count
> is 1. The first `ARef<Request>` we create increments the count to 2. To
> complete the request, we must have ownership of all reference counts
> above 1. The block layer takes the last reference count when it takes
> back ownership of the request.
>
>> I think it would be weird to use `ARef<Request>` if you expect the
>> refcount to be 1.
>
> Yes, that would require a custom smart pointer with a `try_into_unique`
> method that succeeds when the refcount is exactly 2. It would consume
> the instance and decrement the refcount to 1. But as I said, there is a
> potential race with `TagSet::get` when the refcount is 1 that needs to
> be handled.
>
>> Maybe the API should be different?
>
> It needs to change a little, yes.
>
>> As I understand it, a request has the following life cycle (please
>> correct me if I am wrong):
>> 1. A new request is created, it is given to the driver via `queue_rq`.
>> 2. The driver can now decide what to do with it (theoretically it can
>> store it somewhere and later do something with it), but it should at
>> some point call `Request::start`.
>> 3. Work happens and eventually the driver calls `Request::end`.
>>
>> To me this does not seem like something where we need a refcount (we
>> still might need one for safety, but it does not need to be exposed to
>> the user).
>
> It would not need to be exposed to the user, other than a) ending a request
> can fail OR b) `TagSet::get` can fail.
>
> a) would require that ending a request must be done with a unique
> reference. This could be done by the user calling
> `try_into_unique` or by making the `end` method fallible.
>
> b) would make the reference handle `!Clone` and add a failure mode to
> `TagSet::get`, so it fails to construct a `Request` handle if there is
> already one in existence.
>
> I gravitate towards a) because it allows the user to clone the Request
> reference without adding an additional `Arc`.

This confuses me a little, since I thought that `TagSet::get` returns
`Option<ARef<Request>>`. (I tried to find the abstractions in your
github, but I did not find them)

I think that this could work: `queue_rq` takes a `OwnedRequest`, which
the user can store in a `TagSet`, transferring ownership. `TagSet::get`
returns `Option<&Request>` and you can call `TagSet::remove` to get
`Option<OwnedRequest>`. `OwnedRequest::end` consumes `self`.
With this pattern we also do not need to take an additional refcount.
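A user-space sketch of the ownership shape I have in mind (all names
hypothetical; the real kernel `Request`/`TagSet` types wrap C structures and
look nothing like this):

```rust
use std::collections::HashMap;

// Hypothetical stand-in for the kernel request; only the tag matters here.
struct Request {
    tag: u32,
}

// An owned, non-clonable handle. `end_ok` consumes `self`, so completing
// a request twice is a compile-time error rather than a refcount bug.
struct OwnedRequest(Request);

impl OwnedRequest {
    fn end_ok(self) -> u32 {
        // Completion logic would run here; `self` is dropped afterwards.
        self.0.tag
    }
}

// Model tag set: a map from tag to in-flight owned request.
struct TagSet {
    inflight: HashMap<u32, OwnedRequest>,
}

impl TagSet {
    fn new() -> Self {
        TagSet { inflight: HashMap::new() }
    }

    // `queue_rq` path: the driver parks the owned request in the tag set.
    fn insert(&mut self, req: OwnedRequest) {
        self.inflight.insert(req.0.tag, req);
    }

    // Shared lookup: a borrow only, so `end_ok` cannot be called through it.
    fn get(&self, tag: u32) -> Option<&Request> {
        self.inflight.get(&tag).map(|o| &o.0)
    }

    // Removal transfers ownership back, enabling exactly one `end_ok`.
    fn remove(&mut self, tag: u32) -> Option<OwnedRequest> {
        self.inflight.remove(&tag)
    }
}
```

With this shape the racy `r1.end_ok(); r2...` sequence from earlier in the
thread cannot even be written, because only the value returned by `remove`
can end the request.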

--
Cheers,
Benno


2024-04-05 08:44:14

by Andreas Hindborg

Subject: Re: [RFC PATCH 1/5] rust: block: introduce `kernel::block::mq` module

Benno Lossin <[email protected]> writes:

> On 04.04.24 11:30, Andreas Hindborg wrote:
>> Benno Lossin <[email protected]> writes:
>>
>>> On 04.04.24 07:44, Andreas Hindborg wrote:
>>>> Benno Lossin <[email protected]> writes:
>>>>
>>>>> On 03.04.24 10:46, Andreas Hindborg wrote:
>>>>>> Benno Lossin <[email protected]> writes:
>>>>>>
>>>>>>> On 23.03.24 07:32, Andreas Hindborg wrote:
>>>>>>>> Benno Lossin <[email protected]> writes:
>>>>>>>>> On 3/13/24 12:05, Andreas Hindborg wrote:
>>>>>>>>>> +//! implementations of the `Operations` trait.
>>>>>>>>>> +//!
>>>>>>>>>> +//! IO requests are passed to the driver as [`Request`] references. The
>>>>>>>>>> +//! `Request` type is a wrapper around the C `struct request`. The driver must
>>>>>>>>>> +//! mark start of request processing by calling [`Request::start`] and end of
>>>>>>>>>> +//! processing by calling one of the [`Request::end`], methods. Failure to do so
>>>>>>>>>> +//! can lead to IO failures.
>>>>>>>>>
>>>>>>>>> I am unfamiliar with this, what are "IO failures"?
>>>>>>>>> Do you think that it might be better to change the API to use a
>>>>>>>>> callback? So instead of calling start and end, you would do
>>>>>>>>>
>>>>>>>>> request.handle(|req| {
>>>>>>>>> // do the stuff that would be done between start and end
>>>>>>>>> });
>>>>>>>>>
>>>>>>>>> I took a quick look at the rnull driver and there you are calling
>>>>>>>>> `Request::end_ok` from a different function. So my suggestion might not
>>>>>>>>> be possible, since you really need the freedom.
>>>>>>>>>
>>>>>>>>> Do you think that a guard approach might work better? ie `start` returns
>>>>>>>>> a guard that when dropped will call `end` and you need the guard to
>>>>>>>>> operate on the request.
>>>>>>>>
>>>>>>>> I don't think that would fit, since the driver might not complete the
>>>>>>>> request immediately. We might be able to call `start` on behalf of the
>>>>>>>> driver.
>>>>>>>>
>>>>>>>> At any rate, since the request is reference counted now, we can
>>>>>>>> automatically fail a request when the last reference is dropped and it
>>>>>>>> was not marked successfully completed. I would need to measure the
>>>>>>>> performance implications of such a feature.
>>>>>>>
>>>>>>> Are there cases where you still need access to the request after you
>>>>>>> have called `end`?
>>>>>>
>>>>>> In general no, there is no need to handle the request after calling end.
>>>>>> C drivers are not allowed to, because this transfers ownership of the
>>>>>> request back to the block layer. This patch series defers the transfer of
>>>>>> ownership to the point when the ARef<Request> refcount goes to zero, so
>>>>>> there should be no danger associated with touching the `Request` after
>>>>>> end.
>>>>>>
>>>>>>> If no, I think it would be better for the request to
>>>>>>> be consumed by the `end` function.
>>>>>>> This is a bit difficult with `ARef`, since the user can just clone it
>>>>>>> though... Do you think that it might be necessary to clone requests?
>>>>>>
>>>>>> Looking into the details now I see that calling `Request::end` more than
>>>>>> once will trigger UAF, because C code decrements the refcount on the
>>>>>> request. When we have `ARef<Request>` around, that is a problem. It
>>>>>> probably also messes with other things in C land. Good catch.
>>>>>>
>>>>>> I did implement `Request::end` to consume the request at one point
>>>>>> before I fell back on reference counting. It works fine for simple
>>>>>> drivers. However, most drivers will need to use the block layer tag set
>>>>>> service, that allows conversion of an integer id to a request pointer.
>>>>>> The abstraction for this feature is not part of this patch set. But the
>>>>>> block layer manages an integer-to-request mapping, and drivers
>>>>>> typically use this to identify the request that corresponds to
>>>>>> completion messages that arrive from hardware. When drivers are able to
>>>>>> turn integers into requests like this, consuming the request in the call
>>>>>> to `end` makes little sense (because we can just construct more).
>>>>>
>>>>> How do you ensure that this is fine?:
>>>>>
>>>>> let r1 = tagset.get(0);
>>>>> let r2 = tagset.get(0);
>>>>> r1.end_ok();
>>>>> r2.do_something_that_would_only_be_done_while_active();
>>>>>
>>>>> One thing that comes to my mind would be to only give out `&Request`
>>>>> from the tag set. And to destroy, you could have a separate operation
>>>>> that also removes the request from the tag set. (I am thinking of a tag
>>>>> set as a `HashMap<u64, Request>`.)
>>>>
>>>> This would be similar to
>>>>
>>>> let r1 = tagset.get(0)?;
>>>> let r2 = r1.clone();
>>>> r1.end_ok();
>>>> r2.do_something_requires_active();
>>>>
>>>> but it is not a problem because we do not implement any actions that are
>>>> illegal in that position (outside of `end` - that _is_ a problem).
>>>
>>> Makes sense, but I think it's a bit weird to still be able to access it
>>> after `end`ing.
>>
>> Yes, that is true.
>>
>>>
>>>>
>>>>
>>>>>>
>>>>>> What I do now is issue an `Option<ARef<Request>>` with
>>>>>> `bindings::req_ref_inc_not_zero(rq_ptr)`, to make sure that the request
>>>>>> is currently owned by the driver.
>>>>>>
>>>>>> I guess we can check the absolute value of the refcount, and only issue
>>>>>> a request handle if the count matches what we expect. Then we can be certain
>>>>>> that the handle is unique, and we can require transfer of ownership of
>>>>>> the handle to `Request::end` to make sure it can never be called more
>>>>>> than once.
>>>>>>
>>>>>> Another option is to error out in `Request::end` if the
>>>>>> refcount is not what we expect.
>>>>>
>>>>> I am a bit confused, why does the refcount matter in this case? Can't
>>>>> the user just have multiple `ARef`s?
>>>>
>>>> Because we want to assert that we are consuming the last handle to the
>>>> request. After we do that, the user cannot call `Request::end` again.
>>>> `TagSet::get` will not issue a request reference if the request is not
>>>> in flight. Although there might be a race condition to watch out for.
>>>>
>>>> When the block layer hands over ownership to Rust, the reference count
>>>> is 1. The first `ARef<Request>` we create increments the count to 2. To
>>>> complete the request, we must have ownership of all reference counts
>>>> above 1. The block layer takes the last reference count when it takes
>>>> back ownership of the request.
>>>>
>>>>> I think it would be weird to use `ARef<Request>` if you expect the
>>>>> refcount to be 1.
>>>>
>>>> Yes, that would require a custom smart pointer with a `try_into_unique`
>>>> method that succeeds when the refcount is exactly 2. It would consume
>>>> the instance and decrement the refcount to 1. But as I said, there is a
>>>> potential race with `TagSet::get` when the refcount is 1 that needs to
>>>> be handled.
>>>>
>>>>> Maybe the API should be different?
>>>>
>>>> It needs to change a little, yes.
>>>>
>>>>> As I understand it, a request has the following life cycle (please
>>>>> correct me if I am wrong):
>>>>> 1. A new request is created, it is given to the driver via `queue_rq`.
>>>>> 2. The driver can now decide what to do with it (theoretically it can
>>>>> store it somewhere and later do something with it), but it should at
>>>>> some point call `Request::start`.
>>>>> 3. Work happens and eventually the driver calls `Request::end`.
>>>>>
>>>>> To me this does not seem like something where we need a refcount (we
>>>>> still might need one for safety, but it does not need to be exposed to
>>>>> the user).
>>>>
>>>> It would not need to be exposed to the user, other than a) ending a request
>>>> can fail OR b) `TagSet::get` can fail.
>>>>
>>>> a) would require that ending a request must be done with a unique
>>>> reference. This could be done by the user calling
>>>> `try_into_unique` or by making the `end` method fallible.
>>>>
>>>> b) would make the reference handle `!Clone` and add a failure mode to
>>>> `TagSet::get`, so it fails to construct a `Request` handle if there is
>>>> already one in existence.
>>>>
>>>> I gravitate towards a) because it allows the user to clone the Request
>>>> reference without adding an additional `Arc`.
>>>
>>> This confuses me a little, since I thought that `TagSet::get` returns
>>> `Option<ARef<Request>>`.
>>
>> It does, but in the current implementation the failure mode returning
>> `None` is triggered when the refcount is zero, meaning that the request
>> corresponding to that tag is not currently owned by the driver. For
>> solution b) we would change the type to be
>> `Option<CustomSmartPointerHandleThing<Request>>`.
>>
>>> (I tried to find the abstractions in your
>>> github, but I did not find them)
>>
>> It's here [1]. It was introduced in the `rnvme-v6.8` branch.
>
> Thanks for the pointer.
>
>>> I think that this could work: `queue_rq` takes a `OwnedRequest`, which
>>> the user can store in a `TagSet`, transferring ownership. `TagSet::get`
>>> returns `Option<&Request>` and you can call `TagSet::remove` to get
>>> `Option<OwnedRequest>`. `OwnedRequest::end` consumes `self`.
>>> With this pattern we also do not need to take an additional refcount.
>>
>> It would, but the `TagSet` is just a wrapper for the C block layer
>> `struct blk_mq_tag_set`. This is a highly optimized data structure and
>> tag mapping is done before the driver sees the request. I would like to
>> reuse that logic.
>>
>> We could implement what you suggest anyway, but I would not want to add
>> that additional logic to the hot path.
>
> I overlooked an important detail: the `TagSet` is always stored in an
> `Arc` (IIRC since you want to be able to share it between different
> `Gendisk`s). This probably makes my suggestion impossible, since you
> can't mutably borrow the `TagSet` for removal of `Request`s.
> Depending on how `Request`s are associated to a `TagSet`, there might be
> a way around this: I saw the `qid` parameter to the `tag_to_rq`
> function, is that a unique identifier for a queue?

A tag set services a number of request queues. Each queue has a number
used to identify it within the tag set. It is unique within the tag set.

> Because in that case
> we might be able to have a unique `QueueTagSetRef` with
>
> fn remove(&mut self, tag: u32) -> OwnedRequest;

We would not need exclusive access. Tag set removals are synchronized
internally with some fancy atomic bit flipping [1].
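For intuition, a tiny user-space model of that kind of synchronization
(hypothetical, nothing like the real sbitmap code): a shared atomic word
where a single read-modify-write decides the winner, so concurrent removers
going through `&self` cannot both claim the same tag.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// One word of tag state: bit `tag` set means the tag is in flight.
struct TagBits(AtomicU64);

impl TagBits {
    fn new() -> Self {
        TagBits(AtomicU64::new(0))
    }

    // Mark a tag in flight; returns false if it already was.
    fn set(&self, tag: u32) -> bool {
        let mask = 1u64 << tag;
        self.0.fetch_or(mask, Ordering::AcqRel) & mask == 0
    }

    // Atomically claim (clear) a tag. `fetch_and` is a single
    // read-modify-write, so exactly one of several racing callers
    // observes the bit set and wins, all through a shared `&self`.
    fn clear(&self, tag: u32) -> bool {
        let mask = 1u64 << tag;
        self.0.fetch_and(!mask, Ordering::AcqRel) & mask != 0
    }
}
```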

>
> fn get(&self, tag: u32) -> Option<&Request>;
>
> The `TagSet` would still be shared, only the ability to "remove" (I
> don't know if you do that manually in C, if not, then this would just
> remove it in the abstraction, but keep it on the C side) is unique to
> the `QueueTagSetRef` struct.

I would not advise removing tag->request associations from the driver. I
understand your point and from the perspective of these patches it makes
sense. But it would be a major layer violation of the current block
layer architecture, as far as I can tell.

I am having trouble enough trying to justify deferred free of the
request structure as it is.

> But feel free to use your proposed option a), it is simpler and we can
> try to make this work when you send the `TagSet` abstractions.
> I just think that we should try a bit harder to make it even better.

I'll code up option a) and see how it looks (and what it costs in
performance) :)

BR Andreas


[1] https://github.com/metaspace/linux/blob/bf25426ad5652319528c6f87b74dd026fbedb9e8/lib/sbitmap.c#L638

2024-04-05 09:47:49

by Benno Lossin

Subject: Re: [RFC PATCH 1/5] rust: block: introduce `kernel::block::mq` module

On 05.04.24 10:43, Andreas Hindborg wrote:
> Benno Lossin <[email protected]> writes:
>
>> On 04.04.24 11:30, Andreas Hindborg wrote:
>>> Benno Lossin <[email protected]> writes:
>>>
>>>> On 04.04.24 07:44, Andreas Hindborg wrote:
>>>>> Benno Lossin <[email protected]> writes:
>>>>>
>>>>>> On 03.04.24 10:46, Andreas Hindborg wrote:
>>>>>>> Benno Lossin <[email protected]> writes:
>>>>>>>
>>>>>>>> On 23.03.24 07:32, Andreas Hindborg wrote:
>>>>>>>>> Benno Lossin <[email protected]> writes:
>>>>>>>>>> On 3/13/24 12:05, Andreas Hindborg wrote:
>>>>>>>>>>> +//! implementations of the `Operations` trait.
>>>>>>>>>>> +//!
>>>>>>>>>>> +//! IO requests are passed to the driver as [`Request`] references. The
>>>>>>>>>>> +//! `Request` type is a wrapper around the C `struct request`. The driver must
>>>>>>>>>>> +//! mark start of request processing by calling [`Request::start`] and end of
>>>>>>>>>>> +//! processing by calling one of the [`Request::end`], methods. Failure to do so
>>>>>>>>>>> +//! can lead to IO failures.
>>>>>>>>>>
>>>>>>>>>> I am unfamiliar with this, what are "IO failures"?
>>>>>>>>>> Do you think that it might be better to change the API to use a
>>>>>>>>>> callback? So instead of calling start and end, you would do
>>>>>>>>>>
>>>>>>>>>> request.handle(|req| {
>>>>>>>>>> // do the stuff that would be done between start and end
>>>>>>>>>> });
>>>>>>>>>>
>>>>>>>>>> I took a quick look at the rnull driver and there you are calling
>>>>>>>>>> `Request::end_ok` from a different function. So my suggestion might not
>>>>>>>>>> be possible, since you really need the freedom.
>>>>>>>>>>
>>>>>>>>>> Do you think that a guard approach might work better? ie `start` returns
>>>>>>>>>> a guard that when dropped will call `end` and you need the guard to
>>>>>>>>>> operate on the request.
>>>>>>>>>
>>>>>>>>> I don't think that would fit, since the driver might not complete the
>>>>>>>>> request immediately. We might be able to call `start` on behalf of the
>>>>>>>>> driver.
>>>>>>>>>
>>>>>>>>> At any rate, since the request is reference counted now, we can
>>>>>>>>> automatically fail a request when the last reference is dropped and it
>>>>>>>>> was not marked successfully completed. I would need to measure the
>>>>>>>>> performance implications of such a feature.
>>>>>>>>
>>>>>>>> Are there cases where you still need access to the request after you
>>>>>>>> have called `end`?
>>>>>>>
>>>>>>> In general no, there is no need to handle the request after calling end.
>>>>>>> C drivers are not allowed to, because this transfers ownership of the
>>>>>>> request back to the block layer. This patch series defers the transfer of
>>>>>>> ownership to the point when the ARef<Request> refcount goes to zero, so
>>>>>>> there should be no danger associated with touching the `Request` after
>>>>>>> end.
>>>>>>>
>>>>>>>> If no, I think it would be better for the request to
>>>>>>>> be consumed by the `end` function.
>>>>>>>> This is a bit difficult with `ARef`, since the user can just clone it
>>>>>>>> though... Do you think that it might be necessary to clone requests?
>>>>>>>
>>>>>>> Looking into the details now I see that calling `Request::end` more than
>>>>>>> once will trigger UAF, because C code decrements the refcount on the
>>>>>>> request. When we have `ARef<Request>` around, that is a problem. It
>>>>>>> probably also messes with other things in C land. Good catch.
>>>>>>>
>>>>>>> I did implement `Request::end` to consume the request at one point
>>>>>>> before I fell back on reference counting. It works fine for simple
>>>>>>> drivers. However, most drivers will need to use the block layer tag set
>>>>>>> service, that allows conversion of an integer id to a request pointer.
>>>>>>> The abstraction for this feature is not part of this patch set. But the
>>>>>>> block layer manages an integer-to-request mapping, and drivers
>>>>>>> typically use this to identify the request that corresponds to
>>>>>>> completion messages that arrive from hardware. When drivers are able to
>>>>>>> turn integers into requests like this, consuming the request in the call
>>>>>>> to `end` makes little sense (because we can just construct more).
>>>>>>
>>>>>> How do you ensure that this is fine?:
>>>>>>
>>>>>> let r1 = tagset.get(0);
>>>>>> let r2 = tagset.get(0);
>>>>>> r1.end_ok();
>>>>>> r2.do_something_that_would_only_be_done_while_active();
>>>>>>
>>>>>> One thing that comes to my mind would be to only give out `&Request`
>>>>>> from the tag set. And to destroy, you could have a separate operation
>>>>>> that also removes the request from the tag set. (I am thinking of a tag
>>>>>> set as a `HashMap<u64, Request>`.)
>>>>>
>>>>> This would be similar to
>>>>>
>>>>> let r1 = tagset.get(0)?;
>>>>> let r2 = r1.clone();
>>>>> r1.end_ok();
>>>>> r2.do_something_requires_active();
>>>>>
>>>>> but it is not a problem because we do not implement any actions that are
>>>>> illegal in that position (outside of `end` - that _is_ a problem).
>>>>
>>>> Makes sense, but I think it's a bit weird to still be able to access it
>>>> after `end`ing.
>>>
>>> Yes, that is true.
>>>
>>>>
>>>>>
>>>>>
>>>>>>>
>>>>>>> What I do now is issue an `Option<ARef<Request>>` with
>>>>>>> `bindings::req_ref_inc_not_zero(rq_ptr)`, to make sure that the request
>>>>>>> is currently owned by the driver.
>>>>>>>
>>>>>>> I guess we can check the absolute value of the refcount, and only issue
>>>>>>> a request handle if the count matches what we expect. Then we can be certain
>>>>>>> that the handle is unique, and we can require transfer of ownership of
>>>>>>> the handle to `Request::end` to make sure it can never be called more
>>>>>>> than once.
>>>>>>>
>>>>>>> Another option is to error out in `Request::end` if the
>>>>>>> refcount is not what we expect.
>>>>>>
>>>>>> I am a bit confused, why does the refcount matter in this case? Can't
>>>>>> the user just have multiple `ARef`s?
>>>>>
>>>>> Because we want to assert that we are consuming the last handle to the
>>>>> request. After we do that, the user cannot call `Request::end` again.
>>>>> `TagSet::get` will not issue a request reference if the request is not
>>>>> in flight. Although there might be a race condition to watch out for.
>>>>>
>>>>> When the block layer hands over ownership to Rust, the reference count
>>>>> is 1. The first `ARef<Request>` we create increments the count to 2. To
>>>>> complete the request, we must have ownership of all reference counts
>>>>> above 1. The block layer takes the last reference count when it takes
>>>>> back ownership of the request.
>>>>>
>>>>>> I think it would be weird to use `ARef<Request>` if you expect the
>>>>>> refcount to be 1.
>>>>>
>>>>> Yes, that would require a custom smart pointer with a `try_into_unique`
>>>>> method that succeeds when the refcount is exactly 2. It would consume
>>>>> the instance and decrement the refcount to 1. But as I said, there is a
>>>>> potential race with `TagSet::get` when the refcount is 1 that needs to
>>>>> be handled.
>>>>>
>>>>>> Maybe the API should be different?
>>>>>
>>>>> It needs to change a little, yes.
>>>>>
>>>>>> As I understand it, a request has the following life cycle (please
>>>>>> correct me if I am wrong):
>>>>>> 1. A new request is created, it is given to the driver via `queue_rq`.
>>>>>> 2. The driver can now decide what to do with it (theoretically it can
>>>>>> store it somewhere and later do something with it), but it should at
>>>>>> some point call `Request::start`.
>>>>>> 3. Work happens and eventually the driver calls `Request::end`.
>>>>>>
>>>>>> To me this does not seem like something where we need a refcount (we
>>>>>> still might need one for safety, but it does not need to be exposed to
>>>>>> the user).
>>>>>
>>>>> It would not need to be exposed to the user, other than a) ending a request
>>>>> can fail OR b) `TagSet::get` can fail.
>>>>>
>>>>> a) would require that ending a request must be done with a unique
>>>>> reference. This could be done by the user calling
>>>>> `try_into_unique` or by making the `end` method fallible.
>>>>>
>>>>> b) would make the reference handle `!Clone` and add a failure mode to
>>>>> `TagSet::get`, so it fails to construct a `Request` handle if there is
>>>>> already one in existence.
>>>>>
>>>>> I gravitate towards a) because it allows the user to clone the Request
>>>>> reference without adding an additional `Arc`.
>>>>
>>>> This confuses me a little, since I thought that `TagSet::get` returns
>>>> `Option<ARef<Request>>`.
>>>
>>> It does, but in the current implementation the failure mode returning
>>> `None` is triggered when the refcount is zero, meaning that the request
>>> corresponding to that tag is not currently owned by the driver. For
>>> solution b) we would change the type to be
>>> `Option<CustomSmartPointerHandleThing<Request>>`.
>>>
>>>> (I tried to find the abstractions in your
>>>> github, but I did not find them)
>>>
>>> It's here [1]. It was introduced in the `rnvme-v6.8` branch.
>>
>> Thanks for the pointer.
>>
>>>> I think that this could work: `queue_rq` takes a `OwnedRequest`, which
>>>> the user can store in a `TagSet`, transferring ownership. `TagSet::get`
>>>> returns `Option<&Request>` and you can call `TagSet::remove` to get
>>>> `Option<OwnedRequest>`. `OwnedRequest::end` consumes `self`.
>>>> With this pattern we also do not need to take an additional refcount.
>>>
>>> It would, but the `TagSet` is just a wrapper for the C block layer
>>> `struct blk_mq_tag_set`. This is a highly optimized data structure and
>>> tag mapping is done before the driver sees the request. I would like to
>>> reuse that logic.
>>>
>>> We could implement what you suggest anyway, but I would not want to add
>>> that additional logic to the hot path.
>>
>> I overlooked an important detail: the `TagSet` is always stored in an
>> `Arc` (IIRC since you want to be able to share it between different
>> `Gendisk`s). This probably makes my suggestion impossible, since you
>> can't mutably borrow the `TagSet` for removal of `Request`s.
>> Depending on how `Request`s are associated to a `TagSet`, there might be
>> a way around this: I saw the `qid` parameter to the `tag_to_rq`
>> function, is that a unique identifier for a queue?
>
> A tag set services a number of request queues. Each queue has a number
> used to identify it within the tag set. It is unique within the tag set.
>
>> Because in that case
>> we might be able to have a unique `QueueTagSetRef` with
>>
>> fn remove(&mut self, tag: u32) -> OwnedRequest;
>
> We would not need exclusive access. Tag set removals are synchronized
> internally with some fancy atomic bit flipping [1].

If we bind the ability to call `Request::end` to `OwnedRequest` and
require exclusive access to the `QueueTagSetRef`, then we could ensure
that the `end` function is only called once.
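A rough user-space model of what I mean (hypothetical names; per-queue
state standing in for the shared C tag set):

```rust
// Hypothetical stand-in; the real `Request` wraps the C `struct request`.
struct Request {
    tag: u32,
}

// Non-clonable owned handle; `end_ok` consumes it, so it runs at most once.
struct OwnedRequest(Request);

impl OwnedRequest {
    fn end_ok(self) -> u32 {
        self.0.tag
    }
}

// One of these per queue, held uniquely by that queue's driver context,
// while the underlying tag set itself stays shared.
struct QueueTagSetRef {
    inflight: Vec<Option<OwnedRequest>>,
}

impl QueueTagSetRef {
    // Shared lookup: a borrow only, which cannot end the request.
    fn get(&self, tag: u32) -> Option<&Request> {
        self.inflight.get(tag as usize)?.as_ref().map(|o| &o.0)
    }

    // Exclusive access: taking the request out requires `&mut self`,
    // so two contexts cannot both obtain the `OwnedRequest` for a tag.
    fn remove(&mut self, tag: u32) -> Option<OwnedRequest> {
        self.inflight.get_mut(tag as usize)?.take()
    }
}
```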

>>
>> fn get(&self, tag: u32) -> Option<&Request>;
>>
>> The `TagSet` would still be shared, only the ability to "remove" (I
>> don't know if you do that manually in C, if not, then this would just
>> remove it in the abstraction, but keep it on the C side) is unique to
>> the `QueueTagSetRef` struct.
>
> I would not advise removing tag->request associations from the driver. I
> understand your point and from the perspective of these patches it makes
> sense. But it would be a major layer violation of the current block
> layer architecture, as far as I can tell.

Ah, I should have specified this better: we don't remove the request from
the C side, only from the `TagSet` Rust abstraction. Maybe a better name
would be `end_request` (the function would then return bool to indicate
if there was a request with that tag).

> I am having trouble enough trying to justify deferred free of the
> request structure as it is.

Using this approach, there also would not be a deferred free, as we
would call `end` immediately, right?

>> But feel free to use your proposed option a), it is simpler and we can
>> try to make this work when you send the `TagSet` abstractions.
>> I just think that we should try a bit harder to make it even better.
>
> I'll code up option a) and see how it looks (and what it costs in
> performance) :)

Sure.

We can also speak about this in the meeting, I have the feeling that
that would be easier than trying via mail :)

--
Cheers,
Benno