2023-12-06 12:01:02

by Alice Ryhl

[permalink] [raw]
Subject: [PATCH v2 0/7] File abstractions needed by Rust Binder

This patchset contains the file abstractions needed by the Rust
implementation of the Binder driver.

Please see the Rust Binder RFC for usage examples:
https://lore.kernel.org/rust-for-linux/[email protected]/

Users of "rust: file: add Rust abstraction for `struct file`":
[PATCH RFC 02/20] rust_binder: add binderfs support to Rust binder
[PATCH RFC 03/20] rust_binder: add threading support

Users of "rust: cred: add Rust abstraction for `struct cred`":
[PATCH RFC 05/20] rust_binder: add nodes and context managers
[PATCH RFC 06/20] rust_binder: add oneway transactions
[PATCH RFC 11/20] rust_binder: send nodes in transaction
[PATCH RFC 13/20] rust_binder: add BINDER_TYPE_FD support

Users of "rust: security: add abstraction for security_secid_to_secctx":
[PATCH RFC 06/20] rust_binder: add oneway transactions

Users of "rust: file: add `FileDescriptorReservation`":
[PATCH RFC 13/20] rust_binder: add BINDER_TYPE_FD support
[PATCH RFC 14/20] rust_binder: add BINDER_TYPE_FDA support

Users of "rust: file: add kuid getters":
[PATCH RFC 05/20] rust_binder: add nodes and context managers
[PATCH RFC 06/20] rust_binder: add oneway transactions

Users of "rust: file: add `DeferredFdCloser`":
[PATCH RFC 14/20] rust_binder: add BINDER_TYPE_FDA support

Users of "rust: file: add abstraction for `poll_table`":
[PATCH RFC 07/20] rust_binder: add epoll support

This patchset has some uses of read_volatile in place of READ_ONCE.
Please see the following rfc for context on this:
https://lore.kernel.org/all/[email protected]/

Signed-off-by: Alice Ryhl <[email protected]>
---
Changes in v2:
- Update various docs and safety comments.
- Rename method names to match the C name.
- Use ordinary read instead of READ_ONCE in File::cred.
- Changed null check in secctx.
- Add type alias for PhantomData in FileDescriptorReservation.
- Use Kuid::from_raw in Kuid::current_euid.
- Make DeferredFdCloser fallible if it is unable to schedule a task
work. And also schedule the task work *before* closing the file.
- Moved PollCondVar to rust/kernel/sync.
- Updated PollCondVar to use wake_up_pollfree.
- Link to v1: https://lore.kernel.org/all/[email protected]/

Link to RFC:
https://lore.kernel.org/all/[email protected]/

---
Alice Ryhl (4):
rust: security: add abstraction for secctx
rust: file: add `Kuid` wrapper
rust: file: add `DeferredFdCloser`
rust: file: add abstraction for `poll_table`

Wedson Almeida Filho (3):
rust: file: add Rust abstraction for `struct file`
rust: cred: add Rust abstraction for `struct cred`
rust: file: add `FileDescriptorReservation`

rust/bindings/bindings_helper.h | 9 +
rust/bindings/lib.rs | 1 +
rust/helpers.c | 94 +++++++++
rust/kernel/cred.rs | 73 +++++++
rust/kernel/file.rs | 431 ++++++++++++++++++++++++++++++++++++++++
rust/kernel/lib.rs | 3 +
rust/kernel/security.rs | 79 ++++++++
rust/kernel/sync.rs | 1 +
rust/kernel/sync/poll.rs | 103 ++++++++++
rust/kernel/task.rs | 68 ++++++-
rust/kernel/types.rs | 10 +
11 files changed, 871 insertions(+), 1 deletion(-)
---
base-commit: 98b1cc82c4affc16f5598d4fa14b1858671b2263
change-id: 20231123-alice-file-525b98e8a724

Best regards,
--
Alice Ryhl <[email protected]>


2023-12-06 12:01:18

by Alice Ryhl

[permalink] [raw]
Subject: [PATCH v2 1/7] rust: file: add Rust abstraction for `struct file`

From: Wedson Almeida Filho <[email protected]>

This abstraction makes it possible to manipulate the open files for a
process. The new `File` struct wraps the C `struct file`. When accessing
it using the smart pointer `ARef<File>`, the pointer will own a
reference count to the file. When accessing it as `&File`, then the
reference does not own a refcount, but the borrow checker will ensure
that the reference count does not hit zero while the `&File` is live.

Since this is intended to manipulate the open files of a process, we
introduce a `from_fd` constructor that corresponds to the C `fget`
method. In future patches, it will become possible to create a new fd in
a process and bind it to a `File`. Rust Binder will use these to send
fds from one process to another.

We also provide a method for accessing the file's flags. Rust Binder
will use this to access the flags of the Binder fd to check whether the
non-blocking flag is set, which affects what the Binder ioctl does.

This introduces a struct for the EBADF error type, rather than just
using the Error type directly. This has two advantages:
* `File::from_fd` returns a `Result<ARef<File>, BadFdError>`, which the
compiler will represent as a single pointer, with null being an error.
This is possible because the compiler understands that `BadFdError`
has only one possible value, and it also understands that the
`ARef<File>` smart pointer is guaranteed non-null.
* Additionally, we promise to users of the method that the method can
only fail with EBADF, which means that they can rely on this promise
without having to inspect its implementation.
That said, there are also two disadvantages:
* Defining additional error types involves boilerplate.
* The question mark operator will only utilize the `From` trait once,
which prevents you from using the question mark operator on
`BadFdError` in methods that return some third error type that the
kernel `Error` is convertible into. (However, it works fine in methods
that return `Error`.)

Signed-off-by: Wedson Almeida Filho <[email protected]>
Co-developed-by: Daniel Xu <[email protected]>
Signed-off-by: Daniel Xu <[email protected]>
Co-developed-by: Alice Ryhl <[email protected]>
Signed-off-by: Alice Ryhl <[email protected]>
---
rust/bindings/bindings_helper.h | 2 +
rust/helpers.c | 7 ++
rust/kernel/file.rs | 196 ++++++++++++++++++++++++++++++++++++++++
rust/kernel/lib.rs | 1 +
4 files changed, 206 insertions(+)

diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
index 85f013ed4ca4..beed3ef1fbc3 100644
--- a/rust/bindings/bindings_helper.h
+++ b/rust/bindings/bindings_helper.h
@@ -8,6 +8,8 @@

#include <kunit/test.h>
#include <linux/errname.h>
+#include <linux/file.h>
+#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/refcount.h>
#include <linux/wait.h>
diff --git a/rust/helpers.c b/rust/helpers.c
index 70e59efd92bc..03141a3608a4 100644
--- a/rust/helpers.c
+++ b/rust/helpers.c
@@ -25,6 +25,7 @@
#include <linux/build_bug.h>
#include <linux/err.h>
#include <linux/errname.h>
+#include <linux/fs.h>
#include <linux/mutex.h>
#include <linux/refcount.h>
#include <linux/sched/signal.h>
@@ -157,6 +158,12 @@ void rust_helper_init_work_with_key(struct work_struct *work, work_func_t func,
}
EXPORT_SYMBOL_GPL(rust_helper_init_work_with_key);

+struct file *rust_helper_get_file(struct file *f)
+{
+ return get_file(f);
+}
+EXPORT_SYMBOL_GPL(rust_helper_get_file);
+
/*
* `bindgen` binds the C `size_t` type as the Rust `usize` type, so we can
* use it in contexts where Rust expects a `usize` like slice (array) indices.
diff --git a/rust/kernel/file.rs b/rust/kernel/file.rs
new file mode 100644
index 000000000000..29e1aacacd06
--- /dev/null
+++ b/rust/kernel/file.rs
@@ -0,0 +1,196 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! Files and file descriptors.
+//!
+//! C headers: [`include/linux/fs.h`](../../../../include/linux/fs.h) and
+//! [`include/linux/file.h`](../../../../include/linux/file.h)
+
+use crate::{
+ bindings,
+ error::{code::*, Error, Result},
+ types::{ARef, AlwaysRefCounted, Opaque},
+};
+use core::ptr;
+
+/// Flags associated with a [`File`].
+pub mod flags {
+ /// File is opened in append mode.
+ pub const O_APPEND: u32 = bindings::O_APPEND;
+
+ /// Signal-driven I/O is enabled.
+ pub const O_ASYNC: u32 = bindings::FASYNC;
+
+ /// Close-on-exec flag is set.
+ pub const O_CLOEXEC: u32 = bindings::O_CLOEXEC;
+
+ /// File was created if it didn't already exist.
+ pub const O_CREAT: u32 = bindings::O_CREAT;
+
+ /// Direct I/O is enabled for this file.
+ pub const O_DIRECT: u32 = bindings::O_DIRECT;
+
+ /// File must be a directory.
+ pub const O_DIRECTORY: u32 = bindings::O_DIRECTORY;
+
+ /// Like [`O_SYNC`] except metadata is not synced.
+ pub const O_DSYNC: u32 = bindings::O_DSYNC;
+
+ /// Ensure that this file is created with the `open(2)` call.
+ pub const O_EXCL: u32 = bindings::O_EXCL;
+
+ /// Large file size enabled (`off64_t` over `off_t`).
+ pub const O_LARGEFILE: u32 = bindings::O_LARGEFILE;
+
+ /// Do not update the file last access time.
+ pub const O_NOATIME: u32 = bindings::O_NOATIME;
+
+ /// File should not be used as process's controlling terminal.
+ pub const O_NOCTTY: u32 = bindings::O_NOCTTY;
+
+ /// If basename of path is a symbolic link, fail open.
+ pub const O_NOFOLLOW: u32 = bindings::O_NOFOLLOW;
+
+ /// File is using nonblocking I/O.
+ pub const O_NONBLOCK: u32 = bindings::O_NONBLOCK;
+
+ /// Also known as `O_NDELAY`.
+ ///
+ /// This is effectively the same flag as [`O_NONBLOCK`] on all architectures
+ /// except SPARC64.
+ pub const O_NDELAY: u32 = bindings::O_NDELAY;
+
+ /// Used to obtain a path file descriptor.
+ pub const O_PATH: u32 = bindings::O_PATH;
+
+ /// Write operations on this file will flush data and metadata.
+ pub const O_SYNC: u32 = bindings::O_SYNC;
+
+ /// This file is an unnamed temporary regular file.
+ pub const O_TMPFILE: u32 = bindings::O_TMPFILE;
+
+ /// File should be truncated to length 0.
+ pub const O_TRUNC: u32 = bindings::O_TRUNC;
+
+ /// Bitmask for access mode flags.
+ ///
+ /// # Examples
+ ///
+ /// ```
+ /// use kernel::file;
+ /// # fn do_something() {}
+ /// # let flags = 0;
+ /// if (flags & file::flags::O_ACCMODE) == file::flags::O_RDONLY {
+ /// do_something();
+ /// }
+ /// ```
+ pub const O_ACCMODE: u32 = bindings::O_ACCMODE;
+
+ /// File is read only.
+ pub const O_RDONLY: u32 = bindings::O_RDONLY;
+
+ /// File is write only.
+ pub const O_WRONLY: u32 = bindings::O_WRONLY;
+
+ /// File can be both read and written.
+ pub const O_RDWR: u32 = bindings::O_RDWR;
+}
+
+/// Wraps the kernel's `struct file`.
+///
+/// # Invariants
+///
+/// Instances of this type are always ref-counted, that is, a call to `get_file` ensures that the
+/// allocation remains valid at least until the matching call to `fput`.
+#[repr(transparent)]
+pub struct File(Opaque<bindings::file>);
+
+// SAFETY: By design, the only way to access a `File` is via an immutable reference or an `ARef`.
+// This means that the only situation in which a `File` can be accessed mutably is when the
+// refcount drops to zero and the destructor runs. It is safe for that to happen on any thread, so
+// it is ok for this type to be `Send`.
+unsafe impl Send for File {}
+
+// SAFETY: All methods defined on `File` that take `&self` are safe to call even if other threads
+// are concurrently accessing the same `struct file`, because those methods either access immutable
+// properties or have proper synchronization to ensure that such accesses are safe.
+unsafe impl Sync for File {}
+
+impl File {
+ /// Constructs a new `struct file` wrapper from a file descriptor.
+ ///
+ /// The file descriptor belongs to the current process.
+ pub fn fget(fd: u32) -> Result<ARef<Self>, BadFdError> {
+ // SAFETY: FFI call, there are no requirements on `fd`.
+ let ptr = ptr::NonNull::new(unsafe { bindings::fget(fd) }).ok_or(BadFdError)?;
+
+ // SAFETY: `fget` either returns null or a valid pointer to a file, and we checked for null
+ // above.
+ //
+ // INVARIANT: `fget` increments the refcount before returning.
+ Ok(unsafe { ARef::from_raw(ptr.cast()) })
+ }
+
+ /// Creates a reference to a [`File`] from a valid pointer.
+ ///
+ /// # Safety
+ ///
+ /// The caller must ensure that `ptr` points at a valid file and that its refcount does not
+ /// reach zero during the lifetime 'a.
+ pub unsafe fn from_ptr<'a>(ptr: *const bindings::file) -> &'a File {
+ // SAFETY: The caller guarantees that the pointer is not dangling and stays valid for the
+ // duration of 'a. The cast is okay because `File` is `repr(transparent)`.
+ //
+ // INVARIANT: The safety requirements guarantee that the refcount does not hit zero during
+ // 'a.
+ unsafe { &*ptr.cast() }
+ }
+
+ /// Returns a raw pointer to the inner C struct.
+ #[inline]
+ pub fn as_ptr(&self) -> *mut bindings::file {
+ self.0.get()
+ }
+
+ /// Returns the flags associated with the file.
+ ///
+ /// The flags are a combination of the constants in [`flags`].
+ pub fn flags(&self) -> u32 {
+ // This `read_volatile` is intended to correspond to a READ_ONCE call.
+ //
+ // SAFETY: The file is valid because the shared reference guarantees a nonzero refcount.
+ //
+ // TODO: Replace with `read_once` when available on the Rust side.
+ unsafe { core::ptr::addr_of!((*self.as_ptr()).f_flags).read_volatile() }
+ }
+}
+
+// SAFETY: The type invariants guarantee that `File` is always ref-counted.
+unsafe impl AlwaysRefCounted for File {
+ fn inc_ref(&self) {
+ // SAFETY: The existence of a shared reference means that the refcount is nonzero.
+ unsafe { bindings::get_file(self.as_ptr()) };
+ }
+
+ unsafe fn dec_ref(obj: ptr::NonNull<Self>) {
+ // SAFETY: The safety requirements guarantee that the refcount is nonzero.
+ unsafe { bindings::fput(obj.cast().as_ptr()) }
+ }
+}
+
+/// Represents the `EBADF` error code.
+///
+/// Used for methods that can only fail with `EBADF`.
+#[derive(Copy, Clone, Eq, PartialEq)]
+pub struct BadFdError;
+
+impl From<BadFdError> for Error {
+ fn from(_: BadFdError) -> Error {
+ EBADF
+ }
+}
+
+impl core::fmt::Debug for BadFdError {
+ fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result {
+ f.pad("EBADF")
+ }
+}
diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs
index e6aff80b521f..ce9abceab784 100644
--- a/rust/kernel/lib.rs
+++ b/rust/kernel/lib.rs
@@ -34,6 +34,7 @@
mod allocator;
mod build_assert;
pub mod error;
+pub mod file;
pub mod init;
pub mod ioctl;
#[cfg(CONFIG_KUNIT)]

--
2.43.0.rc2.451.g8631bc7472-goog

2023-12-06 12:01:36

by Alice Ryhl

[permalink] [raw]
Subject: [PATCH v2 3/7] rust: security: add abstraction for secctx

Adds an abstraction for viewing the string representation of a security
context.

This is needed by Rust Binder because it has feature where a process can
view the string representation of the security context for incoming
transactions. The process can use that to authenticate incoming
transactions, and since the feature is provided by the kernel, the
process can trust that the security context is legitimate.

Signed-off-by: Alice Ryhl <[email protected]>
---
rust/bindings/bindings_helper.h | 1 +
rust/helpers.c | 21 +++++++++++
rust/kernel/cred.rs | 8 +++++
rust/kernel/lib.rs | 1 +
rust/kernel/security.rs | 79 +++++++++++++++++++++++++++++++++++++++++
5 files changed, 110 insertions(+)

diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
index 6d1bd2229aab..81b13a953eae 100644
--- a/rust/bindings/bindings_helper.h
+++ b/rust/bindings/bindings_helper.h
@@ -11,6 +11,7 @@
#include <linux/errname.h>
#include <linux/file.h>
#include <linux/fs.h>
+#include <linux/security.h>
#include <linux/slab.h>
#include <linux/refcount.h>
#include <linux/wait.h>
diff --git a/rust/helpers.c b/rust/helpers.c
index 10ed69f76424..fd633d9db79a 100644
--- a/rust/helpers.c
+++ b/rust/helpers.c
@@ -30,6 +30,7 @@
#include <linux/mutex.h>
#include <linux/refcount.h>
#include <linux/sched/signal.h>
+#include <linux/security.h>
#include <linux/spinlock.h>
#include <linux/wait.h>
#include <linux/workqueue.h>
@@ -177,6 +178,26 @@ void rust_helper_put_cred(const struct cred *cred)
}
EXPORT_SYMBOL_GPL(rust_helper_put_cred);

+#ifndef CONFIG_SECURITY
+void rust_helper_security_cred_getsecid(const struct cred *c, u32 *secid)
+{
+ security_cred_getsecid(c, secid);
+}
+EXPORT_SYMBOL_GPL(rust_helper_security_cred_getsecid);
+
+int rust_helper_security_secid_to_secctx(u32 secid, char **secdata, u32 *seclen)
+{
+ return security_secid_to_secctx(secid, secdata, seclen);
+}
+EXPORT_SYMBOL_GPL(rust_helper_security_secid_to_secctx);
+
+void rust_helper_security_release_secctx(char *secdata, u32 seclen)
+{
+ security_release_secctx(secdata, seclen);
+}
+EXPORT_SYMBOL_GPL(rust_helper_security_release_secctx);
+#endif
+
/*
* `bindgen` binds the C `size_t` type as the Rust `usize` type, so we can
* use it in contexts where Rust expects a `usize` like slice (array) indices.
diff --git a/rust/kernel/cred.rs b/rust/kernel/cred.rs
index 497058ec89bb..3794937b5294 100644
--- a/rust/kernel/cred.rs
+++ b/rust/kernel/cred.rs
@@ -43,6 +43,14 @@ pub unsafe fn from_ptr<'a>(ptr: *const bindings::cred) -> &'a Credential {
unsafe { &*ptr.cast() }
}

+ /// Get the id for this security context.
+ pub fn get_secid(&self) -> u32 {
+ let mut secid = 0;
+ // SAFETY: The invariants of this type ensures that the pointer is valid.
+ unsafe { bindings::security_cred_getsecid(self.0.get(), &mut secid) };
+ secid
+ }
+
/// Returns the effective UID of the given credential.
pub fn euid(&self) -> bindings::kuid_t {
// SAFETY: By the type invariant, we know that `self.0` is valid.
diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs
index 097fe9bb93ed..342cb02c495a 100644
--- a/rust/kernel/lib.rs
+++ b/rust/kernel/lib.rs
@@ -42,6 +42,7 @@
pub mod kunit;
pub mod prelude;
pub mod print;
+pub mod security;
mod static_assert;
#[doc(hidden)]
pub mod std_vendor;
diff --git a/rust/kernel/security.rs b/rust/kernel/security.rs
new file mode 100644
index 000000000000..6545bfa2fd72
--- /dev/null
+++ b/rust/kernel/security.rs
@@ -0,0 +1,79 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! Linux Security Modules (LSM).
+//!
+//! C header: [`include/linux/security.h`](../../../../include/linux/security.h).
+
+use crate::{
+ bindings,
+ error::{to_result, Result},
+};
+
+/// A security context string.
+///
+/// # Invariants
+///
+/// The `secdata` and `seclen` fields correspond to a valid security context as returned by a
+/// successful call to `security_secid_to_secctx`, that has not yet been destroyed by calling
+/// `security_release_secctx`.
+pub struct SecurityCtx {
+ secdata: *mut core::ffi::c_char,
+ seclen: usize,
+}
+
+impl SecurityCtx {
+ /// Get the security context given its id.
+ pub fn from_secid(secid: u32) -> Result<Self> {
+ let mut secdata = core::ptr::null_mut();
+ let mut seclen = 0u32;
+ // SAFETY: Just a C FFI call. The pointers are valid for writes.
+ unsafe {
+ to_result(bindings::security_secid_to_secctx(
+ secid,
+ &mut secdata,
+ &mut seclen,
+ ))?;
+ }
+
+ // INVARIANT: If the above call did not fail, then we have a valid security context.
+ Ok(Self {
+ secdata,
+ seclen: seclen as usize,
+ })
+ }
+
+ /// Returns whether the security context is empty.
+ pub fn is_empty(&self) -> bool {
+ self.seclen == 0
+ }
+
+ /// Returns the length of this security context.
+ pub fn len(&self) -> usize {
+ self.seclen
+ }
+
+ /// Returns the bytes for this security context.
+ pub fn as_bytes(&self) -> &[u8] {
+ let ptr = self.secdata;
+ if ptr.is_null() {
+ // We can't pass a null pointer to `slice::from_raw_parts` even if the length is zero.
+ debug_assert_eq!(self.seclen, 0);
+ return &[];
+ }
+
+ // SAFETY: The call to `security_secid_to_secctx` guarantees that the pointer is valid for
+ // `seclen` bytes. Furthermore, if the length is zero, then we have ensured that the
+ // pointer is not null.
+ unsafe { core::slice::from_raw_parts(ptr.cast(), self.seclen) }
+ }
+}
+
+impl Drop for SecurityCtx {
+ fn drop(&mut self) {
+ // SAFETY: This frees a pointer that came from a successful call to
+ // `security_secid_to_secctx` and has not yet been destroyed by `security_release_secctx`.
+ unsafe {
+ bindings::security_release_secctx(self.secdata, self.seclen as u32);
+ }
+ }
+}

--
2.43.0.rc2.451.g8631bc7472-goog

2023-12-06 12:01:37

by Alice Ryhl

[permalink] [raw]
Subject: [PATCH v2 4/7] rust: file: add `FileDescriptorReservation`

From: Wedson Almeida Filho <[email protected]>

Allow for the creation of a file descriptor in two steps: first, we
reserve a slot for it, then we commit or drop the reservation. The first
step may fail (e.g., the current process ran out of available slots),
but commit and drop never fail (and are mutually exclusive).

This is needed by Rust Binder when fds are sent from one process to
another. It has to be a two-step process to properly handle the case
where multiple fds are sent: The operation must fail or succeed
atomically, which we achieve by first reserving the fds we need, and
only installing the files once we have reserved enough fds to send the
files.

Fd reservations assume that the value of `current` does not change
between the call to get_unused_fd_flags and the call to fd_install (or
put_unused_fd). By not implementing the Send trait, this abstraction
ensures that the `FileDescriptorReservation` cannot be moved into a
different process.

Signed-off-by: Wedson Almeida Filho <[email protected]>
Co-developed-by: Alice Ryhl <[email protected]>
Signed-off-by: Alice Ryhl <[email protected]>
---
rust/kernel/file.rs | 68 ++++++++++++++++++++++++++++++++++++++++++++++++++--
rust/kernel/types.rs | 10 ++++++++
2 files changed, 76 insertions(+), 2 deletions(-)

diff --git a/rust/kernel/file.rs b/rust/kernel/file.rs
index a88140794a8d..2d036d4636a0 100644
--- a/rust/kernel/file.rs
+++ b/rust/kernel/file.rs
@@ -9,9 +9,9 @@
bindings,
cred::Credential,
error::{code::*, Error, Result},
- types::{ARef, AlwaysRefCounted, Opaque},
+ types::{ARef, AlwaysRefCounted, NotThreadSafe, Opaque},
};
-use core::ptr;
+use core::{marker::PhantomData, ptr};

/// Flags associated with a [`File`].
pub mod flags {
@@ -193,6 +193,70 @@ unsafe fn dec_ref(obj: ptr::NonNull<Self>) {
}
}

+/// A file descriptor reservation.
+///
+/// This allows the creation of a file descriptor in two steps: first, we reserve a slot for it,
+/// then we commit or drop the reservation. The first step may fail (e.g., the current process ran
+/// out of available slots), but commit and drop never fail (and are mutually exclusive).
+///
+/// Dropping the reservation happens in the destructor of this type.
+///
+/// # Invariants
+///
+/// The fd stored in this struct must correspond to a reserved file descriptor of the current task.
+pub struct FileDescriptorReservation {
+ fd: u32,
+ /// Prevent values of this type from being moved to a different task.
+ ///
+ /// The `fd_install` and `put_unused_fd` functions assume that the value of `current` is
+ /// unchanged since the call to `get_unused_fd_flags`. By adding this marker to this type, we
+ /// prevent it from being moved across task boundaries, which ensures that `current` does not
+ /// change while this value exists.
+ _not_send: NotThreadSafe,
+}
+
+impl FileDescriptorReservation {
+ /// Creates a new file descriptor reservation.
+ pub fn get_unused_fd_flags(flags: u32) -> Result<Self> {
+ // SAFETY: FFI call, there are no safety requirements on `flags`.
+ let fd: i32 = unsafe { bindings::get_unused_fd_flags(flags) };
+ if fd < 0 {
+ return Err(Error::from_errno(fd));
+ }
+ Ok(Self {
+ fd: fd as u32,
+ _not_send: PhantomData,
+ })
+ }
+
+ /// Returns the file descriptor number that was reserved.
+ pub fn reserved_fd(&self) -> u32 {
+ self.fd
+ }
+
+ /// Commits the reservation.
+ ///
+ /// The previously reserved file descriptor is bound to `file`. This method consumes the
+ /// [`FileDescriptorReservation`], so it will not be usable after this call.
+ pub fn fd_install(self, file: ARef<File>) {
+ // SAFETY: `self.fd` was previously returned by `get_unused_fd_flags`, and `file.ptr` is
+ // guaranteed to have an owned ref count by its type invariants.
+ unsafe { bindings::fd_install(self.fd, file.0.get()) };
+
+ // `fd_install` consumes both the file descriptor and the file reference, so we cannot run
+ // the destructors.
+ core::mem::forget(self);
+ core::mem::forget(file);
+ }
+}
+
+impl Drop for FileDescriptorReservation {
+ fn drop(&mut self) {
+ // SAFETY: `self.fd` was returned by a previous call to `get_unused_fd_flags`.
+ unsafe { bindings::put_unused_fd(self.fd) };
+ }
+}
+
/// Represents the `EBADF` error code.
///
/// Used for methods that can only fail with `EBADF`.
diff --git a/rust/kernel/types.rs b/rust/kernel/types.rs
index fdb778e65d79..a4584d6b26c0 100644
--- a/rust/kernel/types.rs
+++ b/rust/kernel/types.rs
@@ -387,3 +387,13 @@ pub enum Either<L, R> {
/// Constructs an instance of [`Either`] containing a value of type `R`.
Right(R),
}
+
+/// Zero-sized type to mark types not [`Send`].
+///
+/// Add this type as a field to your struct if your type should not be sent to a different task.
+/// Since [`Send`] is an auto trait, adding a single field that is `!Send` will ensure that the
+/// whole type is `!Send`.
+///
+/// If a type is `!Send` it is impossible to give control over an instance of the type to another
+/// task. This is useful when a type stores task-local information for example file descriptors.
+pub type NotThreadSafe = PhantomData<*mut ()>;

--
2.43.0.rc2.451.g8631bc7472-goog

2023-12-06 12:01:46

by Alice Ryhl

[permalink] [raw]
Subject: [PATCH v2 5/7] rust: file: add `Kuid` wrapper

Adds a wrapper around `kuid_t` called `Kuid`. This allows us to define
various operations on kuids such as equality and current_euid. It also
lets us provide conversions from kuid into userspace values.

Rust Binder needs these operations because it needs to compare kuids for
equality, and it needs to tell userspace about the pid and uid of
incoming transactions.

Signed-off-by: Alice Ryhl <[email protected]>
---
rust/bindings/bindings_helper.h | 1 +
rust/helpers.c | 45 +++++++++++++++++++++++++++
rust/kernel/cred.rs | 5 +--
rust/kernel/task.rs | 68 ++++++++++++++++++++++++++++++++++++++++-
4 files changed, 116 insertions(+), 3 deletions(-)

diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
index 81b13a953eae..700f01840188 100644
--- a/rust/bindings/bindings_helper.h
+++ b/rust/bindings/bindings_helper.h
@@ -11,6 +11,7 @@
#include <linux/errname.h>
#include <linux/file.h>
#include <linux/fs.h>
+#include <linux/pid_namespace.h>
#include <linux/security.h>
#include <linux/slab.h>
#include <linux/refcount.h>
diff --git a/rust/helpers.c b/rust/helpers.c
index fd633d9db79a..58e3a9dff349 100644
--- a/rust/helpers.c
+++ b/rust/helpers.c
@@ -142,6 +142,51 @@ void rust_helper_put_task_struct(struct task_struct *t)
}
EXPORT_SYMBOL_GPL(rust_helper_put_task_struct);

+kuid_t rust_helper_task_uid(struct task_struct *task)
+{
+ return task_uid(task);
+}
+EXPORT_SYMBOL_GPL(rust_helper_task_uid);
+
+kuid_t rust_helper_task_euid(struct task_struct *task)
+{
+ return task_euid(task);
+}
+EXPORT_SYMBOL_GPL(rust_helper_task_euid);
+
+#ifndef CONFIG_USER_NS
+uid_t rust_helper_from_kuid(struct user_namespace *to, kuid_t uid)
+{
+ return from_kuid(to, uid);
+}
+EXPORT_SYMBOL_GPL(rust_helper_from_kuid);
+#endif /* CONFIG_USER_NS */
+
+bool rust_helper_uid_eq(kuid_t left, kuid_t right)
+{
+ return uid_eq(left, right);
+}
+EXPORT_SYMBOL_GPL(rust_helper_uid_eq);
+
+kuid_t rust_helper_current_euid(void)
+{
+ return current_euid();
+}
+EXPORT_SYMBOL_GPL(rust_helper_current_euid);
+
+struct user_namespace *rust_helper_current_user_ns(void)
+{
+ return current_user_ns();
+}
+EXPORT_SYMBOL_GPL(rust_helper_current_user_ns);
+
+pid_t rust_helper_task_tgid_nr_ns(struct task_struct *tsk,
+ struct pid_namespace *ns)
+{
+ return task_tgid_nr_ns(tsk, ns);
+}
+EXPORT_SYMBOL_GPL(rust_helper_task_tgid_nr_ns);
+
struct kunit *rust_helper_kunit_get_current_test(void)
{
return kunit_get_current_test();
diff --git a/rust/kernel/cred.rs b/rust/kernel/cred.rs
index 3794937b5294..fbc749788bfa 100644
--- a/rust/kernel/cred.rs
+++ b/rust/kernel/cred.rs
@@ -8,6 +8,7 @@

use crate::{
bindings,
+ task::Kuid,
types::{AlwaysRefCounted, Opaque},
};

@@ -52,9 +53,9 @@ pub fn get_secid(&self) -> u32 {
}

/// Returns the effective UID of the given credential.
- pub fn euid(&self) -> bindings::kuid_t {
+ pub fn euid(&self) -> Kuid {
// SAFETY: By the type invariant, we know that `self.0` is valid.
- unsafe { (*self.0.get()).euid }
+ Kuid::from_raw(unsafe { (*self.0.get()).euid })
}
}

diff --git a/rust/kernel/task.rs b/rust/kernel/task.rs
index b2299bc7ac1f..7a3a07660af7 100644
--- a/rust/kernel/task.rs
+++ b/rust/kernel/task.rs
@@ -5,7 +5,12 @@
//! C header: [`include/linux/sched.h`](../../../../include/linux/sched.h).

use crate::{bindings, types::Opaque};
-use core::{marker::PhantomData, ops::Deref, ptr};
+use core::{
+ cmp::{Eq, PartialEq},
+ marker::PhantomData,
+ ops::Deref,
+ ptr,
+};

/// Returns the currently running task.
#[macro_export]
@@ -78,6 +83,12 @@ unsafe impl Sync for Task {}
/// The type of process identifiers (PIDs).
type Pid = bindings::pid_t;

+/// The type of user identifiers (UIDs).
+#[derive(Copy, Clone)]
+pub struct Kuid {
+ kuid: bindings::kuid_t,
+}
+
impl Task {
/// Returns a task reference for the currently executing task/thread.
///
@@ -132,12 +143,32 @@ pub fn pid(&self) -> Pid {
unsafe { *ptr::addr_of!((*self.0.get()).pid) }
}

+ /// Returns the UID of the given task.
+ pub fn uid(&self) -> Kuid {
+ // SAFETY: By the type invariant, we know that `self.0` is valid.
+ Kuid::from_raw(unsafe { bindings::task_uid(self.0.get()) })
+ }
+
+ /// Returns the effective UID of the given task.
+ pub fn euid(&self) -> Kuid {
+ // SAFETY: By the type invariant, we know that `self.0` is valid.
+ Kuid::from_raw(unsafe { bindings::task_euid(self.0.get()) })
+ }
+
/// Determines whether the given task has pending signals.
pub fn signal_pending(&self) -> bool {
// SAFETY: By the type invariant, we know that `self.0` is valid.
unsafe { bindings::signal_pending(self.0.get()) != 0 }
}

+ /// Returns the given task's pid in the current pid namespace.
+ pub fn pid_in_current_ns(&self) -> Pid {
+ // SAFETY: Calling `task_active_pid_ns` with the current task is always safe.
+ let namespace = unsafe { bindings::task_active_pid_ns(bindings::get_current()) };
+ // SAFETY: We know that `self.0.get()` is valid by the type invariant.
+ unsafe { bindings::task_tgid_nr_ns(self.0.get(), namespace) }
+ }
+
/// Wakes up the task.
pub fn wake_up(&self) {
// SAFETY: By the type invariant, we know that `self.0.get()` is non-null and valid.
@@ -147,6 +178,41 @@ pub fn wake_up(&self) {
}
}

+impl Kuid {
+ /// Get the current euid.
+ pub fn current_euid() -> Kuid {
+ // SAFETY: Just an FFI call.
+ Self::from_raw(unsafe { bindings::current_euid() })
+ }
+
+ /// Create a `Kuid` given the raw C type.
+ pub fn from_raw(kuid: bindings::kuid_t) -> Self {
+ Self { kuid }
+ }
+
+ /// Turn this kuid into the raw C type.
+ pub fn into_raw(self) -> bindings::kuid_t {
+ self.kuid
+ }
+
+ /// Converts this kernel UID into a userspace UID.
+ ///
+ /// Uses the namespace of the current task.
+ pub fn into_uid_in_current_ns(self) -> bindings::uid_t {
+ // SAFETY: Just an FFI call.
+ unsafe { bindings::from_kuid(bindings::current_user_ns(), self.kuid) }
+ }
+}
+
+impl PartialEq for Kuid {
+ fn eq(&self, other: &Kuid) -> bool {
+ // SAFETY: Just an FFI call.
+ unsafe { bindings::uid_eq(self.kuid, other.kuid) }
+ }
+}
+
+impl Eq for Kuid {}
+
// SAFETY: The type invariants guarantee that `Task` is always ref-counted.
unsafe impl crate::types::AlwaysRefCounted for Task {
fn inc_ref(&self) {

--
2.43.0.rc2.451.g8631bc7472-goog

2023-12-06 12:01:54

by Alice Ryhl

[permalink] [raw]
Subject: [PATCH v2 7/7] rust: file: add abstraction for `poll_table`

The existing `CondVar` abstraction is a wrapper around `wait_list`, but
it does not support all use-cases of the C `wait_list` type. To be
specific, a `CondVar` cannot be registered with a `struct poll_table`.
This limitation has the advantage that you do not need to call
`synchronize_rcu` when destroying a `CondVar`.

However, we need the ability to register a `poll_table` with a
`wait_list` in Rust Binder. To enable this, introduce a type called
`PollCondVar`, which is like `CondVar` except that you can register a
`poll_table`. We also introduce `PollTable`, which is a safe wrapper
around `poll_table` that is intended to be used with `PollCondVar`.

The destructor of `PollCondVar` unconditionally calls `synchronize_rcu`
to ensure that the removal of epoll waiters has fully completed before
the `wait_list` is destroyed.

That said, `synchronize_rcu` is rather expensive and is not needed in
all cases: If we have never registered a `poll_table` with the
`wait_list`, then we don't need to call `synchronize_rcu`. (And this is
a common case in Binder - not all processes use Binder with epoll.) The
current implementation does not account for this, but if we find that it
is necessary to improve this, a future patch could change store a
boolean next to the `wait_list` to keep track of whether a `poll_table`
has ever been registered.

Signed-off-by: Alice Ryhl <[email protected]>
---
rust/bindings/bindings_helper.h | 2 +
rust/bindings/lib.rs | 1 +
rust/kernel/sync.rs | 1 +
rust/kernel/sync/poll.rs | 103 ++++++++++++++++++++++++++++++++++++++++
4 files changed, 107 insertions(+)

diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
index c8daee341df6..14f84aeef62d 100644
--- a/rust/bindings/bindings_helper.h
+++ b/rust/bindings/bindings_helper.h
@@ -13,6 +13,7 @@
#include <linux/file.h>
#include <linux/fs.h>
#include <linux/pid_namespace.h>
+#include <linux/poll.h>
#include <linux/security.h>
#include <linux/slab.h>
#include <linux/refcount.h>
@@ -25,3 +26,4 @@
const size_t BINDINGS_ARCH_SLAB_MINALIGN = ARCH_SLAB_MINALIGN;
const gfp_t BINDINGS_GFP_KERNEL = GFP_KERNEL;
const gfp_t BINDINGS___GFP_ZERO = __GFP_ZERO;
+const __poll_t BINDINGS_POLLFREE = POLLFREE;
diff --git a/rust/bindings/lib.rs b/rust/bindings/lib.rs
index 9bcbea04dac3..eeb291cc60db 100644
--- a/rust/bindings/lib.rs
+++ b/rust/bindings/lib.rs
@@ -51,3 +51,4 @@ mod bindings_helper {

pub const GFP_KERNEL: gfp_t = BINDINGS_GFP_KERNEL;
pub const __GFP_ZERO: gfp_t = BINDINGS___GFP_ZERO;
+pub const POLLFREE: __poll_t = BINDINGS_POLLFREE;
diff --git a/rust/kernel/sync.rs b/rust/kernel/sync.rs
index d219ee518eff..84726f80c406 100644
--- a/rust/kernel/sync.rs
+++ b/rust/kernel/sync.rs
@@ -11,6 +11,7 @@
mod condvar;
pub mod lock;
mod locked_by;
+pub mod poll;

pub use arc::{Arc, ArcBorrow, UniqueArc};
pub use condvar::CondVar;
diff --git a/rust/kernel/sync/poll.rs b/rust/kernel/sync/poll.rs
new file mode 100644
index 000000000000..e1dded9b7b9d
--- /dev/null
+++ b/rust/kernel/sync/poll.rs
@@ -0,0 +1,103 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! Utilities for working with `struct poll_table`.
+
+use crate::{
+ bindings,
+ file::File,
+ prelude::*,
+ sync::{CondVar, LockClassKey},
+ types::Opaque,
+};
+use core::ops::Deref;
+
+/// Creates a [`PollCondVar`] initialiser with the given name and a newly-created lock class.
+#[macro_export]
+macro_rules! new_poll_condvar {
+ ($($name:literal)?) => {
+ $crate::file::PollCondVar::new($crate::optional_name!($($name)?), $crate::static_lock_class!())
+ };
+}
+
+/// Wraps the kernel's `struct poll_table`.
+#[repr(transparent)]
+pub struct PollTable(Opaque<bindings::poll_table>);
+
+impl PollTable {
+ /// Creates a reference to a [`PollTable`] from a valid pointer.
+ ///
+ /// # Safety
+ ///
+ /// The caller must ensure that for the duration of 'a, the pointer will point at a valid poll
+ /// table, and that it is only accessed via the returned reference.
+ pub unsafe fn from_ptr<'a>(ptr: *mut bindings::poll_table) -> &'a mut PollTable {
+ // SAFETY: The safety requirements guarantee the validity of the dereference, while the
+ // `PollTable` type being transparent makes the cast ok.
+ unsafe { &mut *ptr.cast() }
+ }
+
+ fn get_qproc(&self) -> bindings::poll_queue_proc {
+ let ptr = self.0.get();
+ // SAFETY: The `ptr` is valid because it originates from a reference, and the `_qproc`
+ // field is not modified concurrently with this call since we have an immutable reference.
+ unsafe { (*ptr)._qproc }
+ }
+
+ /// Register this [`PollTable`] with the provided [`PollCondVar`], so that it can be notified
+ /// using the condition variable.
+ pub fn register_wait(&mut self, file: &File, cv: &PollCondVar) {
+ if let Some(qproc) = self.get_qproc() {
+ // SAFETY: The pointers to `self` and `file` are valid because they are references.
+ //
+ // Before the wait list is destroyed, the destructor of `PollCondVar` will clear
+ // everything in the wait list, so the wait list is not used after it is freed.
+ unsafe { qproc(file.as_ptr() as _, cv.wait_list.get(), self.0.get()) };
+ }
+ }
+}
+
+/// A wrapper around [`CondVar`] that makes it usable with [`PollTable`].
+///
+/// # Invariant
+///
+/// If `needs_synchronize_rcu` is false, then there is nothing registered with `register_wait`.
+///
+/// [`CondVar`]: crate::sync::CondVar
+#[pin_data(PinnedDrop)]
+pub struct PollCondVar {
+ #[pin]
+ inner: CondVar,
+}
+
+impl PollCondVar {
+ /// Constructs a new condvar initialiser.
+ pub fn new(name: &'static CStr, key: &'static LockClassKey) -> impl PinInit<Self> {
+ pin_init!(Self {
+ inner <- CondVar::new(name, key),
+ })
+ }
+}
+
+// Make the `CondVar` methods callable on `PollCondVar`.
+impl Deref for PollCondVar {
+ type Target = CondVar;
+
+ fn deref(&self) -> &CondVar {
+ &self.inner
+ }
+}
+
+#[pinned_drop]
+impl PinnedDrop for PollCondVar {
+ fn drop(self: Pin<&mut Self>) {
+ // Clear anything registered using `register_wait`.
+ //
+ // SAFETY: The pointer points at a valid wait list.
+ unsafe { bindings::__wake_up_pollfree(self.inner.wait_list.get()) };
+
+ // Wait for epoll items to be properly removed.
+ //
+ // SAFETY: Just an FFI call.
+ unsafe { bindings::synchronize_rcu() };
+ }
+}

--
2.43.0.rc2.451.g8631bc7472-goog

2023-12-06 12:02:06

by Alice Ryhl

[permalink] [raw]
Subject: [PATCH v2 6/7] rust: file: add `DeferredFdCloser`

To close an fd from kernel space, we could call `ksys_close`. However,
if we do this to an fd that is held using `fdget`, then we may trigger a
use-after-free. Introduce a helper that can be used to close an fd even
if the fd is currently held with `fdget`. This is done by grabbing an
extra refcount to the file and dropping it in a task work once we return
to userspace.

This is necessary for Rust Binder because otherwise the user might try
to have Binder close its fd for /dev/binder, which would cause problems
as this happens inside an ioctl on /dev/binder, and ioctls hold the fd
using `fdget`.

Additional motivation can be found in commit 80cd795630d6 ("binder: fix
use-after-free due to ksys_close() during fdget()") and in the comments
on `binder_do_fd_close`.

If there is some way to detect whether an fd is currently held with
`fdget`, then this could be optimized to skip the allocation and task
work when this is not the case. Another possible optimization would be
to combine several fds into a single task work, since this is used with
fd arrays that might hold several fds.

That said, it might not be necessary to optimize it, because Rust Binder
has two ways to send fds: BINDER_TYPE_FD and BINDER_TYPE_FDA. With
BINDER_TYPE_FD, it is userspace's responsibility to close the fd, so
this mechanism is used only by BINDER_TYPE_FDA, but fd arrays are used
rarely these days.

Signed-off-by: Alice Ryhl <[email protected]>
---
rust/bindings/bindings_helper.h | 2 +
rust/helpers.c | 8 ++
rust/kernel/file.rs | 157 +++++++++++++++++++++++++++++++++++++++-
3 files changed, 166 insertions(+), 1 deletion(-)

diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
index 700f01840188..c8daee341df6 100644
--- a/rust/bindings/bindings_helper.h
+++ b/rust/bindings/bindings_helper.h
@@ -9,6 +9,7 @@
#include <kunit/test.h>
#include <linux/cred.h>
#include <linux/errname.h>
+#include <linux/fdtable.h>
#include <linux/file.h>
#include <linux/fs.h>
#include <linux/pid_namespace.h>
@@ -17,6 +18,7 @@
#include <linux/refcount.h>
#include <linux/wait.h>
#include <linux/sched.h>
+#include <linux/task_work.h>
#include <linux/workqueue.h>

/* `bindgen` gets confused at certain things. */
diff --git a/rust/helpers.c b/rust/helpers.c
index 58e3a9dff349..d146bbf25aec 100644
--- a/rust/helpers.c
+++ b/rust/helpers.c
@@ -32,6 +32,7 @@
#include <linux/sched/signal.h>
#include <linux/security.h>
#include <linux/spinlock.h>
+#include <linux/task_work.h>
#include <linux/wait.h>
#include <linux/workqueue.h>

@@ -243,6 +244,13 @@ void rust_helper_security_release_secctx(char *secdata, u32 seclen)
EXPORT_SYMBOL_GPL(rust_helper_security_release_secctx);
#endif

+void rust_helper_init_task_work(struct callback_head *twork,
+ task_work_func_t func)
+{
+ init_task_work(twork, func);
+}
+EXPORT_SYMBOL_GPL(rust_helper_init_task_work);
+
/*
* `bindgen` binds the C `size_t` type as the Rust `usize` type, so we can
* use it in contexts where Rust expects a `usize` like slice (array) indices.
diff --git a/rust/kernel/file.rs b/rust/kernel/file.rs
index 2d036d4636a0..eba96af4bdf7 100644
--- a/rust/kernel/file.rs
+++ b/rust/kernel/file.rs
@@ -11,7 +11,8 @@
error::{code::*, Error, Result},
types::{ARef, AlwaysRefCounted, NotThreadSafe, Opaque},
};
-use core::{marker::PhantomData, ptr};
+use alloc::boxed::Box;
+use core::{alloc::AllocError, marker::PhantomData, mem, ptr};

/// Flags associated with a [`File`].
pub mod flags {
@@ -257,6 +258,160 @@ fn drop(&mut self) {
}
}

+/// Helper used for closing file descriptors in a way that is safe even if the file is currently
+/// held using `fdget`.
+///
+/// Additional motivation can be found in commit 80cd795630d6 ("binder: fix use-after-free due to
+/// ksys_close() during fdget()") and in the comments on `binder_do_fd_close`.
+pub struct DeferredFdCloser {
+ inner: Box<DeferredFdCloserInner>,
+}
+
+/// SAFETY: This just holds an allocation with no real content, so there's no safety issue with
+/// moving it across threads.
+unsafe impl Send for DeferredFdCloser {}
+unsafe impl Sync for DeferredFdCloser {}
+
+#[repr(C)]
+struct DeferredFdCloserInner {
+ twork: mem::MaybeUninit<bindings::callback_head>,
+ file: *mut bindings::file,
+}
+
+impl DeferredFdCloser {
+ /// Create a new [`DeferredFdCloser`].
+ pub fn new() -> Result<Self, AllocError> {
+ Ok(Self {
+ inner: Box::try_new(DeferredFdCloserInner {
+ twork: mem::MaybeUninit::uninit(),
+ file: core::ptr::null_mut(),
+ })?,
+ })
+ }
+
+ /// Schedule a task work that closes the file descriptor when this task returns to userspace.
+ ///
+ /// Fails if this is called from a context where we cannot run work when returning to
+ /// userspace. (E.g., from a kthread.)
+ pub fn close_fd(self, fd: u32) -> Result<(), DeferredFdCloseError> {
+ use bindings::task_work_notify_mode_TWA_RESUME as TWA_RESUME;
+
+ // In this method, we schedule the task work before closing the file. This is because
+ // scheduling a task work is fallible, and we need to know whether it will fail before we
+ // attempt to close the file.
+
+ // SAFETY: Getting a pointer to current is always safe.
+ let current = unsafe { bindings::get_current() };
+
+ // SAFETY: Accessing the `flags` field of `current` is always safe.
+ let is_kthread = (unsafe { (*current).flags } & bindings::PF_KTHREAD) != 0;
+ if is_kthread {
+ return Err(DeferredFdCloseError::TaskWorkUnavailable);
+ }
+
+ // This disables the destructor of the box, so the allocation is not cleaned up
+ // automatically below.
+ let inner = Box::into_raw(self.inner);
+
+ // The `callback_head` field is first in the struct, so this cast correctly gives us a
+ // pointer to the field.
+ let callback_head = inner.cast::<bindings::callback_head>();
+ // SAFETY: This pointer offset operation does not go out-of-bounds.
+ let file_field = unsafe { core::ptr::addr_of_mut!((*inner).file) };
+
+ // SAFETY: The `callback_head` pointer is compatible with the `do_close_fd` method.
+ unsafe { bindings::init_task_work(callback_head, Some(Self::do_close_fd)) };
+ // SAFETY: The `callback_head` pointer points at a valid and fully initialized task work
+ // that is ready to be scheduled.
+ //
+ // If the task work gets scheduled as-is, then it will be a no-op. However, we will update
+ // the file pointer below to tell it which file to fput.
+ let res = unsafe { bindings::task_work_add(current, callback_head, TWA_RESUME) };
+
+ if res != 0 {
+ // SAFETY: Scheduling the task work failed, so we still have ownership of the box, so
+ // we may destroy it.
+ unsafe { drop(Box::from_raw(inner)) };
+
+ return Err(DeferredFdCloseError::TaskWorkUnavailable);
+ }
+
+ // SAFETY: Just an FFI call. This is safe no matter what `fd` is.
+ let file = unsafe { bindings::close_fd_get_file(fd) };
+ if file.is_null() {
+ // We don't clean up the task work since that might be expensive if the task work queue
+ // is long. Just let it execute and let it clean up for itself.
+ return Err(DeferredFdCloseError::BadFd);
+ }
+
+ // SAFETY: The `file` pointer points at a valid file.
+ unsafe { bindings::get_file(file) };
+
+ // SAFETY: Due to the above `get_file`, even if the current task holds an `fdget` to
+ // this file right now, the refcount will not drop to zero until after it is released
+ // with `fdput`. This is because when using `fdget`, you must always use `fdput` before
+ // returning to userspace, and our task work runs after any `fdget` users have returned
+ // to userspace.
+ //
+ // Note: fl_owner_t is currently a void pointer.
+ unsafe { bindings::filp_close(file, (*current).files as bindings::fl_owner_t) };
+
+ // We update the file pointer that the task work is supposed to fput.
+ //
+ // SAFETY: Task works are executed on the current thread once we return to userspace, so
+ // this write is guaranteed to happen before `do_close_fd` is called, which means that a
+ // race is not possible here.
+ //
+ // It's okay to pass this pointer to the task work, since we just acquired a refcount with
+ // the previous call to `get_file`. Furthermore, the refcount will not drop to zero during
+ // an `fdget` call, since we defer the `fput` until after returning to userspace.
+ unsafe { *file_field = file };
+
+ Ok(())
+ }
+
+ // SAFETY: This function is an implementation detail of `close_fd`, so its safety comments
+ // should be read in extension of that method.
+ unsafe extern "C" fn do_close_fd(inner: *mut bindings::callback_head) {
+ // SAFETY: In `close_fd` we use this method together with a pointer that originates from a
+ // `Box<DeferredFdCloserInner>`, and we have just been given ownership of that allocation.
+ let inner = unsafe { Box::from_raw(inner as *mut DeferredFdCloserInner) };
+ if !inner.file.is_null() {
+ // SAFETY: This drops a refcount we acquired in `close_fd`. Since this callback runs in
+ // a task work after we return to userspace, it is guaranteed that the current thread
+ // doesn't hold this file with `fdget`, as `fdget` must be released before returning to
+ // userspace.
+ unsafe { bindings::fput(inner.file) };
+ }
+ // Free the allocation.
+ drop(inner);
+ }
+}
+
+/// Represents a failure to close an fd in a deferred manner.
+#[derive(Copy, Clone, Eq, PartialEq)]
+pub enum DeferredFdCloseError {
+ /// Closing the fd failed because we were unable to schedule a task work.
+ TaskWorkUnavailable,
+ /// Closing the fd failed because the fd does not exist.
+ BadFd,
+}
+
+impl From<DeferredFdCloseError> for Error {
+ fn from(err: DeferredFdCloseError) -> Error {
+ match err {
+ DeferredFdCloseError::TaskWorkUnavailable => ESRCH,
+ DeferredFdCloseError::BadFd => EBADF,
+ }
+ }
+}
+
+impl core::fmt::Debug for DeferredFdCloseError {
+ fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result {
+ Error::from(*self).fmt(f)
+ }
+}
+
/// Represents the `EBADF` error code.
///
/// Used for methods that can only fail with `EBADF`.

--
2.43.0.rc2.451.g8631bc7472-goog

2023-12-06 12:34:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 5/7] rust: file: add `Kuid` wrapper

On Wed, Dec 06, 2023 at 11:59:50AM +0000, Alice Ryhl wrote:

> diff --git a/rust/helpers.c b/rust/helpers.c
> index fd633d9db79a..58e3a9dff349 100644
> --- a/rust/helpers.c
> +++ b/rust/helpers.c
> @@ -142,6 +142,51 @@ void rust_helper_put_task_struct(struct task_struct *t)
> }
> EXPORT_SYMBOL_GPL(rust_helper_put_task_struct);
>
> +kuid_t rust_helper_task_uid(struct task_struct *task)
> +{
> + return task_uid(task);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_task_uid);
> +
> +kuid_t rust_helper_task_euid(struct task_struct *task)
> +{
> + return task_euid(task);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_task_euid);

So I still object to these on the ground that they're obvious and
trivial speculation gadgets.

We should not have (exported) functions that are basically a single
dereference of a pointer argument.

And I do not appreciate my feedback on the previous round being ignored.

2023-12-06 12:58:13

by Alice Ryhl

[permalink] [raw]
Subject: Re: [PATCH v2 5/7] rust: file: add `Kuid` wrapper

On Wed, Dec 6, 2023 at 1:34 PM Peter Zijlstra <[email protected]> wrote:
>
> On Wed, Dec 06, 2023 at 11:59:50AM +0000, Alice Ryhl wrote:
>
> > diff --git a/rust/helpers.c b/rust/helpers.c
> > index fd633d9db79a..58e3a9dff349 100644
> > --- a/rust/helpers.c
> > +++ b/rust/helpers.c
> > @@ -142,6 +142,51 @@ void rust_helper_put_task_struct(struct task_struct *t)
> > }
> > EXPORT_SYMBOL_GPL(rust_helper_put_task_struct);
> >
> > +kuid_t rust_helper_task_uid(struct task_struct *task)
> > +{
> > + return task_uid(task);
> > +}
> > +EXPORT_SYMBOL_GPL(rust_helper_task_uid);
> > +
> > +kuid_t rust_helper_task_euid(struct task_struct *task)
> > +{
> > + return task_euid(task);
> > +}
> > +EXPORT_SYMBOL_GPL(rust_helper_task_euid);
>
> So I still object to these on the ground that they're obvious and
> trivial speculation gadgets.
>
> We should not have (exported) functions that are basically a single
> dereference of a pointer argument.
>
> And I do not appreciate my feedback on the previous round being ignored.

I'm sorry about that. I barely know what speculation gadgets are, so I
didn't really know what to respond. But I should have responded by
saying that.

I can reimplement these specific functions as inline Rust functions,
but I don't think I can give you a general solution to the
rust_helper_* problem in this patch series.

Alice

2023-12-06 13:41:30

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 5/7] rust: file: add `Kuid` wrapper

On Wed, Dec 06, 2023 at 01:57:52PM +0100, Alice Ryhl wrote:
> On Wed, Dec 6, 2023 at 1:34 PM Peter Zijlstra <[email protected]> wrote:
> >
> > On Wed, Dec 06, 2023 at 11:59:50AM +0000, Alice Ryhl wrote:
> >
> > > diff --git a/rust/helpers.c b/rust/helpers.c
> > > index fd633d9db79a..58e3a9dff349 100644
> > > --- a/rust/helpers.c
> > > +++ b/rust/helpers.c
> > > @@ -142,6 +142,51 @@ void rust_helper_put_task_struct(struct task_struct *t)
> > > }
> > > EXPORT_SYMBOL_GPL(rust_helper_put_task_struct);
> > >
> > > +kuid_t rust_helper_task_uid(struct task_struct *task)
> > > +{
> > > + return task_uid(task);
> > > +}
> > > +EXPORT_SYMBOL_GPL(rust_helper_task_uid);
> > > +
> > > +kuid_t rust_helper_task_euid(struct task_struct *task)
> > > +{
> > > + return task_euid(task);
> > > +}
> > > +EXPORT_SYMBOL_GPL(rust_helper_task_euid);
> >
> > So I still object to these on the ground that they're obvious and
> > trivial speculation gadgets.
> >
> > We should not have (exported) functions that are basically a single
> > dereference of a pointer argument.
> >
> > And I do not appreciate my feedback on the previous round being ignored.
>
> I'm sorry about that. I barely know what speculation gadgets are, so I
> didn't really know what to respond. But I should have responded by
> saying that.

Sigh... how many years since Meltdown are we now? Oh well, the basic
idea is to abuse the basic speculative execution of the CPU to make it
disclose secrets. The way this is done is by training the various branch
predictors such that you gain control over the speculative control flow.

You then use this control to cause micro-architectural side-effects to
observe state, eg. cache state.

In the case above, if you train the indirect branch predictor to jump to
the above functions after you've loaded a secret into %rdi (first
argument) then it will dereference your pointer + offset.

This causes the CPU to populate the cacheline, even if the actual
execution path never does this.

So by wiping the cache, doing your tricky thing, and then probing the
cache to find out which line is present, you can infer the secret value.

Anywhoo, the longer a function is, the harder it becomes, since you need
to deal with everything a function does and consider the specuation
window length. So trivial functions like the above that do an immediate
dereference and are (and must be) a valid indirect target (because
EXPORT) are ideal.

> I can reimplement these specific functions as inline Rust functions,

That would be good, but how are you going to do that without duplicating
the horror that is struct task_struct ?

> but I don't think I can give you a general solution to the
> rust_helper_* problem in this patch series.

Well, I really wish the Rust community would address the C
interoperability in a hurry. Basically make it a requirement for
in-kernel Rust.

I mean, how hard can it be to have clang parse the C headers and inject
them into the Rust IR as if they're external FFI things.


2023-12-06 13:50:43

by Alice Ryhl

[permalink] [raw]
Subject: Re: [PATCH v2 5/7] rust: file: add `Kuid` wrapper

On Wed, Dec 6, 2023 at 2:40 PM Peter Zijlstra <[email protected]> wrote:
> > I can reimplement these specific functions as inline Rust functions,
>
> That would be good, but how are you going to do that without duplicating
> the horror that is struct task_struct ?

That shouldn't be an issue. The bindgen tool takes care of generating
a Rust version of struct task_struct for us at build time. The only
thing it can't handle is inline functions and #defines.

I'll respond to the other things later. But thank you for the thorough
explanation!

Alice

2023-12-06 16:50:07

by Nick Desaulniers

[permalink] [raw]
Subject: Re: [PATCH v2 5/7] rust: file: add `Kuid` wrapper

On Wed, Dec 06, 2023 at 02:40:41PM +0100, Peter Zijlstra wrote:
> On Wed, Dec 06, 2023 at 01:57:52PM +0100, Alice Ryhl wrote:
> > On Wed, Dec 6, 2023 at 1:34 PM Peter Zijlstra <[email protected]> wrote:
> > I can reimplement these specific functions as inline Rust functions,
>
> That would be good, but how are you going to do that without duplicating
> the horror that is struct task_struct ?
>
> > but I don't think I can give you a general solution to the
> > rust_helper_* problem in this patch series.
>
> Well, I really wish the Rust community would address the C
> interoperability in a hurry. Basically make it a requirement for
> in-kernel Rust.
>
> I mean, how hard can it be to have clang parse the C headers and inject
> them into the Rust IR as if they're external FFI things.

That's pretty much how Swift and Carbon are doing C (and C++) interop.

Carbon: https://youtu.be/1ZTJ9omXOQ0?si=yiuLHn6o8RMezEZj
Swift: https://youtu.be/lgivCGdmFrw?si=-x9Uva-_Y2x-JNBe

The swift talk doesn't allude much to the codegen interop they're doing (still
an excellent talk), but the carbon folks have adopted a model from Swift of
doing interop at the IR layer.

Those compilers link against clang to provide IR interop.
Rust's bindgen crate links against clang to generate Rust stubs.

At some point, someone on the Rust side will notice what Swift and Carbon are
up to, realize they're probably already linking against libclang for C/C++
interop, and try just linking libclang into rustc itself.

2023-12-08 07:37:26

by Benno Lossin

[permalink] [raw]
Subject: Re: [PATCH v2 4/7] rust: file: add `FileDescriptorReservation`

On 12/6/23 12:59, Alice Ryhl wrote:
> + /// Commits the reservation.
> + ///
> + /// The previously reserved file descriptor is bound to `file`. This method consumes the
> + /// [`FileDescriptorReservation`], so it will not be usable after this call.
> + pub fn fd_install(self, file: ARef<File>) {
> + // SAFETY: `self.fd` was previously returned by `get_unused_fd_flags`, and `file.ptr` is
> + // guaranteed to have an owned ref count by its type invariants.

There is no mention of the requirement that `current` has not changed
(you do mention it on the `_not_send` field, but I think it should also
be in the safety comment here).

> + unsafe { bindings::fd_install(self.fd, file.0.get()) };
> +
> + // `fd_install` consumes both the file descriptor and the file reference, so we cannot run
> + // the destructors.
> + core::mem::forget(self);
> + core::mem::forget(file);
> + }
> +}
> +
> +impl Drop for FileDescriptorReservation {
> + fn drop(&mut self) {
> + // SAFETY: `self.fd` was returned by a previous call to `get_unused_fd_flags`.

Ditto.

> + unsafe { bindings::put_unused_fd(self.fd) };
> + }
> +}
> +
> /// Represents the `EBADF` error code.
> ///
> /// Used for methods that can only fail with `EBADF`.
> diff --git a/rust/kernel/types.rs b/rust/kernel/types.rs
> index fdb778e65d79..a4584d6b26c0 100644
> --- a/rust/kernel/types.rs
> +++ b/rust/kernel/types.rs
> @@ -387,3 +387,13 @@ pub enum Either<L, R> {
> /// Constructs an instance of [`Either`] containing a value of type `R`.
> Right(R),
> }
> +
> +/// Zero-sized type to mark types not [`Send`].
> +///
> +/// Add this type as a field to your struct if your type should not be sent to a different task.
> +/// Since [`Send`] is an auto trait, adding a single field that is `!Send` will ensure that the
> +/// whole type is `!Send`.
> +///
> +/// If a type is `!Send` it is impossible to give control over an instance of the type to another
> +/// task. This is useful when a type stores task-local information for example file descriptors.
> +pub type NotThreadSafe = PhantomData<*mut ()>;

This should be in its own commit.

Then you can also change the usages of `PhantomData<*mut ()>` in
`Guard` and `TaskRef`.

It would be nice to use `NotThreadSafe` as the value instead of
`PhantomData`, since it is a bit confusing...
I think we might be able to also have a constant with the same name
that is just `pub const NotThreadSafe: NotThreadSafe = PhantomData;`.

--
Cheers,
Benno

2023-12-08 07:43:55

by Alice Ryhl

[permalink] [raw]
Subject: Re: [PATCH v2 4/7] rust: file: add `FileDescriptorReservation`

On Fri, Dec 8, 2023 at 8:37 AM Benno Lossin <[email protected]> wrote:
> > +/// Zero-sized type to mark types not [`Send`].
> > +///
> > +/// Add this type as a field to your struct if your type should not be sent to a different task.
> > +/// Since [`Send`] is an auto trait, adding a single field that is `!Send` will ensure that the
> > +/// whole type is `!Send`.
> > +///
> > +/// If a type is `!Send` it is impossible to give control over an instance of the type to another
> > +/// task. This is useful when a type stores task-local information for example file descriptors.
> > +pub type NotThreadSafe = PhantomData<*mut ()>;
>
> This should be in its own commit.

Will do.

> Then you can also change the usages of `PhantomData<*mut ()>` in
> `Guard` and `TaskRef`.

Will do.

> It would be nice to use `NotThreadSafe` as the value instead of
> `PhantomData`, since it is a bit confusing...

That doesn't compile, unfortunately.

> I think we might be able to also have a constant with the same name
> that is just `pub const NotThreadSafe: NotThreadSafe = PhantomData;`.

I can try to get this to work, but I worry that they will shadow each other.

Alice

2023-12-08 09:50:33

by Benno Lossin

[permalink] [raw]
Subject: Re: [PATCH v2 1/7] rust: file: add Rust abstraction for `struct file`

On 12/6/23 12:59, Alice Ryhl wrote:
> +impl File {
> + /// Constructs a new `struct file` wrapper from a file descriptor.
> + ///
> + /// The file descriptor belongs to the current process.
> + pub fn fget(fd: u32) -> Result<ARef<Self>, BadFdError> {
> + // SAFETY: FFI call, there are no requirements on `fd`.
> + let ptr = ptr::NonNull::new(unsafe { bindings::fget(fd) }).ok_or(BadFdError)?;
> +
> + // SAFETY: `fget` either returns null or a valid pointer to a file, and we checked for null
> + // above.

Since now both the Rust and C functions are called `fget`, I think you
should refer to `bindings::fget`.

> + //
> + // INVARIANT: `fget` increments the refcount before returning.
> + Ok(unsafe { ARef::from_raw(ptr.cast()) })
> + }

[...]

> +// SAFETY: The type invariants guarantee that `File` is always ref-counted.
> +unsafe impl AlwaysRefCounted for File {
> + fn inc_ref(&self) {
> + // SAFETY: The existence of a shared reference means that the refcount is nonzero.
> + unsafe { bindings::get_file(self.as_ptr()) };
> + }
> +
> + unsafe fn dec_ref(obj: ptr::NonNull<Self>) {
> + // SAFETY: The safety requirements guarantee that the refcount is nonzero.
> + unsafe { bindings::fput(obj.cast().as_ptr()) }

The comment should also justify the cast.

--
Cheers,
Benno

> + }
> +}

2023-12-08 16:23:18

by Benno Lossin

[permalink] [raw]
Subject: Re: [PATCH v2 3/7] rust: security: add abstraction for secctx

On 12/6/23 12:59, Alice Ryhl wrote:
> +impl SecurityCtx {
> + /// Get the security context given its id.
> + pub fn from_secid(secid: u32) -> Result<Self> {
> + let mut secdata = core::ptr::null_mut();
> + let mut seclen = 0u32;
> + // SAFETY: Just a C FFI call. The pointers are valid for writes.
> + unsafe {
> + to_result(bindings::security_secid_to_secctx(
> + secid,
> + &mut secdata,
> + &mut seclen,
> + ))?;
> + }

Can you move the `unsafe` block inside of the `to_result` call? That way
we only have the unsafe operation in the unsafe block. Additionally, on
my side it fits perfectly into 100 characters.

> + // INVARIANT: If the above call did not fail, then we have a valid security context.
> + Ok(Self {
> + secdata,
> + seclen: seclen as usize,
> + })
> + }

[...]

> + /// Returns the bytes for this security context.
> + pub fn as_bytes(&self) -> &[u8] {
> + let ptr = self.secdata;
> + if ptr.is_null() {
> + // We can't pass a null pointer to `slice::from_raw_parts` even if the length is zero.
> + debug_assert_eq!(self.seclen, 0);

Would this be interesting enough to emit some kind of log message when
this fails?

> + return &[];
> + }
> +
> + // SAFETY: The call to `security_secid_to_secctx` guarantees that the pointer is valid for
> + // `seclen` bytes. Furthermore, if the length is zero, then we have ensured that the
> + // pointer is not null.
> + unsafe { core::slice::from_raw_parts(ptr.cast(), self.seclen) }
> + }
> +}
> +
> +impl Drop for SecurityCtx {
> + fn drop(&mut self) {
> + // SAFETY: This frees a pointer that came from a successful call to
> + // `security_secid_to_secctx` and has not yet been destroyed by `security_release_secctx`.
> + unsafe {
> + bindings::security_release_secctx(self.secdata, self.seclen as u32);
> + }

If you move the `;` to the outside of the `unsafe` block this also fits
on a single line.

--
Cheers,
Benno

> + }
> +}

2023-12-08 16:32:19

by Miguel Ojeda

[permalink] [raw]
Subject: Re: [PATCH v2 5/7] rust: file: add `Kuid` wrapper

On Wed, Dec 6, 2023 at 2:41 PM Peter Zijlstra <[email protected]> wrote:
>
> Anywhoo, the longer a function is, the harder it becomes, since you need
> to deal with everything a function does and consider the specuation
> window length. So trivial functions like the above that do an immediate
> dereference and are (and must be) a valid indirect target (because
> EXPORT) are ideal.

We discussed this in our weekly meeting, and we would like to ask a
few questions:

- Could you please describe an example attack that you are thinking
of? (i.e. a "full" attack, rather than just Spectre itself). For
instance, would it rely on other vulnerabilities?

- Is this a kernel rule everybody should follow now? i.e. "no (new?)
short, exported symbols that just dereference their pointer args". If
so, could this please be documented? Or is it already somewhere?

- Are we checking for this in some way already, e.g. via `objtool`?
Especially if this is a rule, then it would be nice to have a way to
double-check if we are getting rid of (most of) these "dangerous"
symbols (or at least not introduce new ones, and not just in Rust but
C too).

Thanks Peter!

> That would be good, but how are you going to do that without duplicating
> the horror that is struct task_struct ?

As Alice pointed out, `bindgen` "solves" that, but it is nevertheless
extra maintenance effort.

> Well, I really wish the Rust community would address the C
> interoperability in a hurry. Basically make it a requirement for
> in-kernel Rust.

Yeah, some of us have advocated for more integrated C support within
Rust (or within `rustc` at least).

> I mean, how hard can it be to have clang parse the C headers and inject
> them into the Rust IR as if they're external FFI things.

That is what `bindgen` does (it uses Clang as a library), except it
does not create Rust IR, it outputs normal Rust code, i.e. similar to
C declarations.

But note that using Clang does not solve the issue of `#define`s in
the general case. That is why we would still need "helpers" like these
so that the compiler knows how to expand the macro in a C context,
which then can be inlined as LLVM IR or similar (which is what I
suspect you were actually thinking about, rather than "Rust IR"?).

That "mix the LLVM IRs from Clang and `rustc`" ("local LTO hack")
approach is something we have been discussing in the past for
performance reasons (i.e. to inline these small C functions that Rust
needs, cross-language, even in non-LTO builds). And if it helps to
avoid certain attacks around speculation, then even better. So if the
LLVM folks do not have any major concerns about it, then I think we
should go ahead with that (please see also my reply to comex).

GCC is still a question mark, though.

Cheers,
Miguel

2023-12-08 16:40:56

by Benno Lossin

[permalink] [raw]
Subject: Re: [PATCH v2 5/7] rust: file: add `Kuid` wrapper

On 12/6/23 12:59, Alice Ryhl wrote:
> + /// Returns the given task's pid in the current pid namespace.
> + pub fn pid_in_current_ns(&self) -> Pid {
> + // SAFETY: Calling `task_active_pid_ns` with the current task is always safe.
> + let namespace = unsafe { bindings::task_active_pid_ns(bindings::get_current()) };

Why not create a safe wrapper for `bindings::get_current()`?
This patch series has three occurrences of `get_current`, so I think it
should be ok to add a wrapper.
I would also prefer to move the call to `bindings::get_current()` out of
the `unsafe` block.

> + // SAFETY: We know that `self.0.get()` is valid by the type invariant.

What about `namespace`?

--
Cheers,
Benno

> + unsafe { bindings::task_tgid_nr_ns(self.0.get(), namespace) }
> + }

2023-12-08 16:43:43

by Boqun Feng

[permalink] [raw]
Subject: Re: [PATCH v2 5/7] rust: file: add `Kuid` wrapper

On Fri, Dec 08, 2023 at 04:40:09PM +0000, Benno Lossin wrote:
> On 12/6/23 12:59, Alice Ryhl wrote:
> > + /// Returns the given task's pid in the current pid namespace.
> > + pub fn pid_in_current_ns(&self) -> Pid {
> > + // SAFETY: Calling `task_active_pid_ns` with the current task is always safe.
> > + let namespace = unsafe { bindings::task_active_pid_ns(bindings::get_current()) };
>
> Why not create a safe wrapper for `bindings::get_current()`?
> This patch series has three occurrences of `get_current`, so I think it
> should be ok to add a wrapper.
> I would also prefer to move the call to `bindings::get_current()` out of
> the `unsafe` block.

FWIW, we have a current!() macro, we should use it here.

Regards,
Boqun

>
> > + // SAFETY: We know that `self.0.get()` is valid by the type invariant.
>
> What about `namespace`?
>
> --
> Cheers,
> Benno
>
> > + unsafe { bindings::task_tgid_nr_ns(self.0.get(), namespace) }
> > + }

2023-12-08 16:57:57

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 5/7] rust: file: add `Kuid` wrapper

On Fri, Dec 08, 2023 at 05:31:59PM +0100, Miguel Ojeda wrote:
> On Wed, Dec 6, 2023 at 2:41 PM Peter Zijlstra <[email protected]> wrote:
> >
> > Anywhoo, the longer a function is, the harder it becomes, since you need
> > to deal with everything a function does and consider the specuation
> > window length. So trivial functions like the above that do an immediate
> > dereference and are (and must be) a valid indirect target (because
> > EXPORT) are ideal.
>
> We discussed this in our weekly meeting, and we would like to ask a
> few questions:
>
> - Could you please describe an example attack that you are thinking
> of? (i.e. a "full" attack, rather than just Spectre itself). For
> instance, would it rely on other vulnerabilities?

There's a fairly large amount of that on github, google spectre poc and
stuff like that.

> - Is this a kernel rule everybody should follow now? i.e. "no (new?)
> short, exported symbols that just dereference their pointer args". If
> so, could this please be documented? Or is it already somewhere?

Gadget scanners are ever evolving and I think there's a bunch of groups
running them against Linux, but most of us don't have spare time beyond
trying to plug the latest hole.

> - Are we checking for this in some way already, e.g. via `objtool`?
> Especially if this is a rule, then it would be nice to have a way to
> double-check if we are getting rid of (most of) these "dangerous"
> symbols (or at least not introduce new ones, and not just in Rust but
> C too).

Objtool does not look for these (gadget scanners are quite complicated
by now and I don't think I want to go there with objtool). At the moment
I'm simply fixing whatever gets reported from 3rd parties when and where
time permits.

The thing at hand was just me eyeballing it.

> > That would be good, but how are you going to do that without duplicating
> > the horror that is struct task_struct ?
>
> As Alice pointed out, `bindgen` "solves" that, but it is nevertheless
> extra maintenance effort.
>
> > Well, I really wish the Rust community would address the C
> > interoperability in a hurry. Basically make it a requirement for
> > in-kernel Rust.
>
> Yeah, some of us have advocated for more integrated C support within
> Rust (or within `rustc` at least).

\o/

> > I mean, how hard can it be to have clang parse the C headers and inject
> > them into the Rust IR as if they're external FFI things.
>
> That is what `bindgen` does (it uses Clang as a library), except it
> does not create Rust IR, it outputs normal Rust code, i.e. similar to
> C declarations.

Right, but then you get into the problem that Rust simply cannot express
a fair amount of the things we already do, like asm-goto, or even simple
asm with memops apparently.

> But note that using Clang does not solve the issue of `#define`s in
> the general case. That is why we would still need "helpers" like these
> so that the compiler knows how to expand the macro in a C context,
> which then can be inlined as LLVM IR or similar (which is what I
> suspect you were actually thinking about, rather than "Rust IR"?).

Yeah, LLVM-IR. And urgh yeah, CPP, this is another down-side of Rust not
being in the C language family, you can't sanely run CPP on it. Someone
really should do a Rust like language in the C family, then perhaps it
will stop looking like line noise to me :-)

I suppose converting things to enum and inline functions where possible
might help a bit with that, but things like tracepoints, which are built
from a giant pile of CPP are just not going to be happy :/

Anyway, I think it would be a giant step forwards from where we are
today.

> That "mix the LLVM IRs from Clang and `rustc`" ("local LTO hack")
> approach is something we have been discussing in the past for
> performance reasons (i.e. to inline these small C functions that Rust
> needs, cross-language, even in non-LTO builds). And if it helps to
> avoid certain attacks around speculation, then even better. So if the
> LLVM folks do not have any major concerns about it, then I think we
> should go ahead with that (please see also my reply to comex).

But does LTO make any guarantees about inlining? The thing is, with
actual LLVM-IR you can express the __always_inline attribute and
inlining becomes guaranteed, I don't think you can rely on LTO for the
same level of guarantees.

And you still need to create these C functions by hand in this
local-LTO scenario, which is less than ideal.

2023-12-08 17:39:31

by Benno Lossin

[permalink] [raw]
Subject: Re: [PATCH v2 6/7] rust: file: add `DeferredFdCloser`

On 12/6/23 12:59, Alice Ryhl wrote:
> + /// Schedule a task work that closes the file descriptor when this task returns to userspace.
> + ///
> + /// Fails if this is called from a context where we cannot run work when returning to
> + /// userspace. (E.g., from a kthread.)
> + pub fn close_fd(self, fd: u32) -> Result<(), DeferredFdCloseError> {
> + use bindings::task_work_notify_mode_TWA_RESUME as TWA_RESUME;
> +
> + // In this method, we schedule the task work before closing the file. This is because
> + // scheduling a task work is fallible, and we need to know whether it will fail before we
> + // attempt to close the file.
> +
> + // SAFETY: Getting a pointer to current is always safe.
> + let current = unsafe { bindings::get_current() };
> +
> + // SAFETY: Accessing the `flags` field of `current` is always safe.
> + let is_kthread = (unsafe { (*current).flags } & bindings::PF_KTHREAD) != 0;

Since Boqun brought to my attention that we already have a wrapper for
`get_current()`, how about you use it here as well?

> + if is_kthread {
> + return Err(DeferredFdCloseError::TaskWorkUnavailable);
> + }
> +
> + // This disables the destructor of the box, so the allocation is not cleaned up
> + // automatically below.
> + let inner = Box::into_raw(self.inner);

Importantly this also lifts the uniqueness requirement (maybe add this
to the comment?).

> +
> + // The `callback_head` field is first in the struct, so this cast correctly gives us a
> + // pointer to the field.
> + let callback_head = inner.cast::<bindings::callback_head>();
> + // SAFETY: This pointer offset operation does not go out-of-bounds.
> + let file_field = unsafe { core::ptr::addr_of_mut!((*inner).file) };
> +
> + // SAFETY: The `callback_head` pointer is compatible with the `do_close_fd` method.

Also, `callback_head` is valid, since it is derived from...

> + unsafe { bindings::init_task_work(callback_head, Some(Self::do_close_fd)) };
> + // SAFETY: The `callback_head` pointer points at a valid and fully initialized task work
> + // that is ready to be scheduled.
> + //
> + // If the task work gets scheduled as-is, then it will be a no-op. However, we will update
> + // the file pointer below to tell it which file to fput.
> + let res = unsafe { bindings::task_work_add(current, callback_head, TWA_RESUME) };
> +
> + if res != 0 {
> + // SAFETY: Scheduling the task work failed, so we still have ownership of the box, so
> + // we may destroy it.
> + unsafe { drop(Box::from_raw(inner)) };
> +
> + return Err(DeferredFdCloseError::TaskWorkUnavailable);

Just curious, what could make the `task_work_add` call fail? I imagine
an OOM situation, but is that all?

> + }
> +
> + // SAFETY: Just an FFI call. This is safe no matter what `fd` is.

I took a look at the C code and there I found this comment:

/*
* variant of close_fd that gets a ref on the file for later fput.
* The caller must ensure that filp_close() called on the file.
*/

And while you do call `filp_close` later, this seems like a safety
requirement to me.
Also, you do not call it when `file` is null, which I imagine to be
fine, but I do not know that since the C comment does not cover that
case.

> + let file = unsafe { bindings::close_fd_get_file(fd) };
> + if file.is_null() {
> + // We don't clean up the task work since that might be expensive if the task work queue
> + // is long. Just let it execute and let it clean up for itself.
> + return Err(DeferredFdCloseError::BadFd);
> + }
> +
> + // SAFETY: The `file` pointer points at a valid file.
> + unsafe { bindings::get_file(file) };
> +
> + // SAFETY: Due to the above `get_file`, even if the current task holds an `fdget` to
> + // this file right now, the refcount will not drop to zero until after it is released
> + // with `fdput`. This is because when using `fdget`, you must always use `fdput` before

Shouldn't this be "the refcount will not drop to zero until after it is
released with `fput`."?

Why is this the SAFETY comment for `filp_close`? I am not understanding
the requirement on that function that needs this. This seems more a
justification for accessing `file` inside `do_close_fd`. In which case I
think it would be better to make it a type invariant of
`DeferredFdCloserInner`.

> + // returning to userspace, and our task work runs after any `fdget` users have returned
> + // to userspace.
> + //
> + // Note: fl_owner_t is currently a void pointer.
> + unsafe { bindings::filp_close(file, (*current).files as bindings::fl_owner_t) };
> +
> + // We update the file pointer that the task work is supposed to fput.
> + //
> + // SAFETY: Task works are executed on the current thread once we return to userspace, so
> + // this write is guaranteed to happen before `do_close_fd` is called, which means that a
> + // race is not possible here.
> + //
> + // It's okay to pass this pointer to the task work, since we just acquired a refcount with
> + // the previous call to `get_file`. Furthermore, the refcount will not drop to zero during
> + // an `fdget` call, since we defer the `fput` until after returning to userspace.
> + unsafe { *file_field = file };

A synchronization question: who guarantees that this write is actually
available to the cpu that executes `do_close_fd`? Is there some
synchronization run when returning to userspace?

> +
> + Ok(())
> + }
> +
> + // SAFETY: This function is an implementation detail of `close_fd`, so its safety comments
> + // should be read in extension of that method.

Why not use this?:
- `inner` is a valid pointer to the `callback_head` field of a valid
`DeferredFdCloserInner`.
- `inner` has exclusive access to the pointee and owns the allocation.
- `inner` originates from a call to `Box::into_raw`.

> + unsafe extern "C" fn do_close_fd(inner: *mut bindings::callback_head) {
> + // SAFETY: In `close_fd` we use this method together with a pointer that originates from a
> + // `Box<DeferredFdCloserInner>`, and we have just been given ownership of that allocation.
> + let inner = unsafe { Box::from_raw(inner as *mut DeferredFdCloserInner) };

Use `.cast()`.

> + if !inner.file.is_null() {
> + // SAFETY: This drops a refcount we acquired in `close_fd`. Since this callback runs in
> + // a task work after we return to userspace, it is guaranteed that the current thread
> + // doesn't hold this file with `fdget`, as `fdget` must be released before returning to
> + // userspace.
> + unsafe { bindings::fput(inner.file) };
> + }
> + // Free the allocation.
> + drop(inner);
> + }
> +}
> +
> +/// Represents a failure to close an fd in a deferred manner.
> +#[derive(Copy, Clone, Eq, PartialEq)]
> +pub enum DeferredFdCloseError {
> + /// Closing the fd failed because we were unable to schedule a task work.
> + TaskWorkUnavailable,
> + /// Closing the fd failed because the fd does not exist.
> + BadFd,
> +}
> +
> +impl From<DeferredFdCloseError> for Error {
> + fn from(err: DeferredFdCloseError) -> Error {
> + match err {
> + DeferredFdCloseError::TaskWorkUnavailable => ESRCH,

This error reads "No such process", I am not sure if that is the best
way to express the problem in that situation. I took a quick look at the
other error codes, but could not find a better fit. Do you have any
better ideas? Or is this the error that C binder uses?

--
Cheers,
Benno

> + DeferredFdCloseError::BadFd => EBADF,
> + }
> + }
> +}

2023-12-08 17:53:50

by Benno Lossin

[permalink] [raw]
Subject: Re: [PATCH v2 7/7] rust: file: add abstraction for `poll_table`

On 12/6/23 12:59, Alice Ryhl wrote:
> diff --git a/rust/bindings/lib.rs b/rust/bindings/lib.rs
> index 9bcbea04dac3..eeb291cc60db 100644
> --- a/rust/bindings/lib.rs
> +++ b/rust/bindings/lib.rs
> @@ -51,3 +51,4 @@ mod bindings_helper {
>
> pub const GFP_KERNEL: gfp_t = BINDINGS_GFP_KERNEL;
> pub const __GFP_ZERO: gfp_t = BINDINGS___GFP_ZERO;
> +pub const POLLFREE: __poll_t = BINDINGS_POLLFREE;

You are no longer using this constant, should this still exist?

[...]

> + fn get_qproc(&self) -> bindings::poll_queue_proc {
> + let ptr = self.0.get();
> + // SAFETY: The `ptr` is valid because it originates from a reference, and the `_qproc`
> + // field is not modified concurrently with this call since we have an immutable reference.

This needs an invariant on `PollTable` (i.e. `self.0` is valid).

> + unsafe { (*ptr)._qproc }
> + }
> +
> + /// Register this [`PollTable`] with the provided [`PollCondVar`], so that it can be notified
> + /// using the condition variable.
> + pub fn register_wait(&mut self, file: &File, cv: &PollCondVar) {
> + if let Some(qproc) = self.get_qproc() {
> + // SAFETY: The pointers to `self` and `file` are valid because they are references.

What about cv.wait_list...

> + //
> + // Before the wait list is destroyed, the destructor of `PollCondVar` will clear
> + // everything in the wait list, so the wait list is not used after it is freed.
> + unsafe { qproc(file.as_ptr() as _, cv.wait_list.get(), self.0.get()) };
> + }
> + }
> +}
> +
> +/// A wrapper around [`CondVar`] that makes it usable with [`PollTable`].
> +///
> +/// # Invariant
> +///
> +/// If `needs_synchronize_rcu` is false, then there is nothing registered with `register_wait`.

Not able to find `needs_synchronize_rcu` anywhere else, should this be
here?

> +///
> +/// [`CondVar`]: crate::sync::CondVar
> +#[pin_data(PinnedDrop)]
> +pub struct PollCondVar {
> + #[pin]
> + inner: CondVar,
> +}

[..]

> +#[pinned_drop]
> +impl PinnedDrop for PollCondVar {
> + fn drop(self: Pin<&mut Self>) {
> + // Clear anything registered using `register_wait`.
> + //
> + // SAFETY: The pointer points at a valid wait list.

I was a bit confused by "wait list", since the C type is named
`wait_queue_head`, maybe just use the type name?

--
Cheers,
Benno

> + unsafe { bindings::__wake_up_pollfree(self.inner.wait_list.get()) };
> +
> + // Wait for epoll items to be properly removed.
> + //
> + // SAFETY: Just an FFI call.
> + unsafe { bindings::synchronize_rcu() };
> + }
> +}

2023-12-08 18:19:00

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v2 5/7] rust: file: add `Kuid` wrapper

On Fri, Dec 08, 2023 at 05:57:02PM +0100, Peter Zijlstra wrote:
> On Fri, Dec 08, 2023 at 05:31:59PM +0100, Miguel Ojeda wrote:
> > On Wed, Dec 6, 2023 at 2:41 PM Peter Zijlstra <[email protected]> wrote:
> > >
> > > Anywhoo, the longer a function is, the harder it becomes, since you need
> > > to deal with everything a function does and consider the specuation
> > > window length. So trivial functions like the above that do an immediate
> > > dereference and are (and must be) a valid indirect target (because
> > > EXPORT) are ideal.
> >
> > We discussed this in our weekly meeting, and we would like to ask a
> > few questions:
> >
> > - Could you please describe an example attack that you are thinking
> > of? (i.e. a "full" attack, rather than just Spectre itself). For
> > instance, would it rely on other vulnerabilities?
>
> There's a fairly large amount of that on github, google spectre poc and
> stuff like that.

tl;dr: I don't think the introduction of speculation gadgets is a
sufficient reason to block Rust interfaces like this.

Long version:

I think the question here is "what is the threat model?" If I break down
the objection, I understand it as:

1) The trivial wrappers-of-inlines are speculation gadgets.
2) They're exported, so callable by anything.

If the threat model is "something can call these to trigger
speculation", I think this is pretty strongly mitigated already;

1) These aren't syscall definitions, so their "reachability" is pretty
limited. In fact, they're already going to be used in places, logically,
where the inline would be used, so the speculation window is going to be
same (or longer, given the addition of the direct call and return).

2) If an attacker is in a position to directly call these helpers,
they're not going to use them: if an attacker already has arbitrary
execution, they're not going to bother with speculation.

Fundamentally I don't see added risk here. From the security hardening
perspective we have two goals: kill bug classes and block exploitation
techniques, and the former is a much more powerful defensive strategy
since without the bugs, there's no chance to perform an exploit.

In general, I think we should prioritize bug class elimination over
exploit technique foiling. In this case, we're adding a potential weakness
to the image of the kernel of fairly limited scope in support of stronger
bug elimination goals.

Even if we look at the prerequisites for mounting an attack here, we've
already got things in place to help mitigate arbitrary code execution
(KCFI, BTI, etc). Nothing is perfect, but speculation gadgets are
pretty far down on the list of concerns, IMO. We have no real x86 ROP
defense right now in the kernel, so that's a much lower hanging fruit
for attackers.

As another comparison, on x86 there are so many direct execution gadgets
present in middle-of-instruction code patterns that worrying about a
speculation gadget seems silly to me.

> [...]
> The thing at hand was just me eyeballing it.

I can understand the point you and Greg have both expressed here: "this
is known to be an anti-pattern, we need to do something else". I generally
agree with this, but in this case, I don't think it's the right call. This
is an area we'll naturally see improvement from on the Rust side since
these calls are a _performance_ concern too, so it's not like this will be
"forgotten" about. But blocking it until there is a complete and perfect
solution feels like we're letting perfect be the enemy of good.

All of our development is evolutionary, so I think we'd be in a much
better position to take these (currently ugly) work-arounds (visible
only in Rust builds) so that we gain the ability to evolve towards more
memory safe code.

-Kees

--
Kees Cook

2023-12-08 20:46:08

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 5/7] rust: file: add `Kuid` wrapper

On Fri, Dec 08, 2023 at 10:18:47AM -0800, Kees Cook wrote:

> Even if we look at the prerequisites for mounting an attack here, we've
> already got things in place to help mitigate arbitrary code execution
> (KCFI, BTI, etc). Nothing is perfect, but speculation gadgets are
> pretty far down on the list of concerns, IMO. We have no real x86 ROP
> defense right now in the kernel, so that's a much lower hanging fruit
> for attackers.

Supervisor shadow stacks, as they exist today, just can't work on Linux.
Should get fixed with FRED, but yeah, this is all somewhat unfortunate.

> As another comparison, on x86 there are so many direct execution gadgets
> present in middle-of-instruction code patterns that worrying about a
> speculation gadget seems silly to me.

FineIBT (or even IBT) limits the middle of function gadgets
significantly.

2023-12-08 20:57:53

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v2 5/7] rust: file: add `Kuid` wrapper

On Fri, Dec 08, 2023 at 09:45:01PM +0100, Peter Zijlstra wrote:
> On Fri, Dec 08, 2023 at 10:18:47AM -0800, Kees Cook wrote:
>
> > Even if we look at the prerequisites for mounting an attack here, we've
> > already got things in place to help mitigate arbitrary code execution
> > (KCFI, BTI, etc). Nothing is perfect, but speculation gadgets are
> > pretty far down on the list of concerns, IMO. We have no real x86 ROP
> > defense right now in the kernel, so that's a much lower hanging fruit
> > for attackers.
>
> Supervisor shadow stacks, as they exist today, just can't work on Linux.

Yeah, totally agreed. I still wonder if we can extend KCFI to cover
return paths (i.e. emitting cookies for return destinations and doing
pre-return cookie checking for return destinations).

> Should get fixed with FRED, but yeah, this is all somewhat unfortunate.

Agreed.

> > As another comparison, on x86 there are so many direct execution gadgets
> > present in middle-of-instruction code patterns that worrying about a
> > speculation gadget seems silly to me.
>
> FineIBT (or even IBT) limits the middle of function gadgets
> significantly.

Right -- for indirect calls we are at least able to restrict to
same-prototype (KCFI) or "actual function" (IBT).

Regardless, for the case at hand, it seems like the Rust wrappers are
still not "reachable" since we do BTB stuffing to defang these kinds of
speculation gadgets.

-Kees

--
Kees Cook

2023-12-11 15:34:59

by Alice Ryhl

[permalink] [raw]
Subject: Re: [PATCH v2 3/7] rust: security: add abstraction for secctx

Benno Lossin <[email protected]> writes:
> On 12/6/23 12:59, Alice Ryhl wrote:
> > +impl SecurityCtx {
> > + /// Get the security context given its id.
> > + pub fn from_secid(secid: u32) -> Result<Self> {
> > + let mut secdata = core::ptr::null_mut();
> > + let mut seclen = 0u32;
> > + // SAFETY: Just a C FFI call. The pointers are valid for writes.
> > + unsafe {
> > + to_result(bindings::security_secid_to_secctx(
> > + secid,
> > + &mut secdata,
> > + &mut seclen,
> > + ))?;
> > + }
>
> Can you move the `unsafe` block inside of the `to_result` call? That way
> we only have the unsafe operation in the unsafe block. Additionally, on
> my side it fits perfectly into 100 characters.

Will do.

> > + /// Returns the bytes for this security context.
> > + pub fn as_bytes(&self) -> &[u8] {
> > + let ptr = self.secdata;
> > + if ptr.is_null() {
> > + // We can't pass a null pointer to `slice::from_raw_parts` even if the length is zero.
> > + debug_assert_eq!(self.seclen, 0);
>
> Would this be interesting enough to emit some kind of log message when
> this fails?

I'm not convinced that makes sense. I'm pretty sure that if this API
returns a null pointer under any circumstances, then we're in some sort
of context where security contexts don't exist at all, and then they
would be hard-coded to use a length zero as well.

> > + return &[];
> > + }
> > +
> > + // SAFETY: The call to `security_secid_to_secctx` guarantees that the pointer is valid for
> > + // `seclen` bytes. Furthermore, if the length is zero, then we have ensured that the
> > + // pointer is not null.
> > + unsafe { core::slice::from_raw_parts(ptr.cast(), self.seclen) }
> > + }
> > +}
> > +
> > +impl Drop for SecurityCtx {
> > + fn drop(&mut self) {
> > + // SAFETY: This frees a pointer that came from a successful call to
> > + // `security_secid_to_secctx` and has not yet been destroyed by `security_release_secctx`.
> > + unsafe {
> > + bindings::security_release_secctx(self.secdata, self.seclen as u32);
> > + }
>
> If you move the `;` to the outside of the `unsafe` block this also fits
> on a single line.

Will do.

Alice

2023-12-11 15:34:58

by Alice Ryhl

[permalink] [raw]
Subject: Re: [PATCH v2 1/7] rust: file: add Rust abstraction for `struct file`

Benno Lossin <[email protected]> wrutes:
> > + // SAFETY: `fget` either returns null or a valid pointer to a file, and we checked for null
> > + // above.
>
> Since now both the Rust and C functions are called `fget`, I think you
> should refer to `bindings::fget`.
> [...]
> > + unsafe fn dec_ref(obj: ptr::NonNull<Self>) {
> > + // SAFETY: The safety requirements guarantee that the refcount is nonzero.
> > + unsafe { bindings::fput(obj.cast().as_ptr()) }
>
> The comment should also justify the cast.

I'll make both changes.

Alice

2023-12-11 15:35:02

by Alice Ryhl

[permalink] [raw]
Subject: Re: [PATCH v2 6/7] rust: file: add `DeferredFdCloser`

Benno Lossin <[email protected]> writes:
> On 12/6/23 12:59, Alice Ryhl wrote:
> > + /// Schedule a task work that closes the file descriptor when this task returns to userspace.
> > + ///
> > + /// Fails if this is called from a context where we cannot run work when returning to
> > + /// userspace. (E.g., from a kthread.)
> > + pub fn close_fd(self, fd: u32) -> Result<(), DeferredFdCloseError> {
> > + use bindings::task_work_notify_mode_TWA_RESUME as TWA_RESUME;
> > +
> > + // In this method, we schedule the task work before closing the file. This is because
> > + // scheduling a task work is fallible, and we need to know whether it will fail before we
> > + // attempt to close the file.
> > +
> > + // SAFETY: Getting a pointer to current is always safe.
> > + let current = unsafe { bindings::get_current() };
> > +
> > + // SAFETY: Accessing the `flags` field of `current` is always safe.
> > + let is_kthread = (unsafe { (*current).flags } & bindings::PF_KTHREAD) != 0;
>
> Since Boqun brought to my attention that we already have a wrapper for
> `get_current()`, how about you use it here as well?

I can use the wrapper, but it seems simpler to not go through a
reference when we just need a raw pointer.

Perhaps we should have a safe `Task::current_raw` function that just
returns a raw pointer? It can still be safe.

> > + if is_kthread {
> > + return Err(DeferredFdCloseError::TaskWorkUnavailable);
> > + }
> > +
> > + // This disables the destructor of the box, so the allocation is not cleaned up
> > + // automatically below.
> > + let inner = Box::into_raw(self.inner);
>
> Importantly this also lifts the uniqueness requirement (maybe add this
> to the comment?).

I can update the comment.

> > +
> > + // The `callback_head` field is first in the struct, so this cast correctly gives us a
> > + // pointer to the field.
> > + let callback_head = inner.cast::<bindings::callback_head>();
> > + // SAFETY: This pointer offset operation does not go out-of-bounds.
> > + let file_field = unsafe { core::ptr::addr_of_mut!((*inner).file) };
> > +
> > + // SAFETY: The `callback_head` pointer is compatible with the `do_close_fd` method.
>
> Also, `callback_head` is valid, since it is derived from...

I can update the comment.

> > + unsafe { bindings::init_task_work(callback_head, Some(Self::do_close_fd)) };
> > + // SAFETY: The `callback_head` pointer points at a valid and fully initialized task work
> > + // that is ready to be scheduled.
> > + //
> > + // If the task work gets scheduled as-is, then it will be a no-op. However, we will update
> > + // the file pointer below to tell it which file to fput.
> > + let res = unsafe { bindings::task_work_add(current, callback_head, TWA_RESUME) };
> > +
> > + if res != 0 {
> > + // SAFETY: Scheduling the task work failed, so we still have ownership of the box, so
> > + // we may destroy it.
> > + unsafe { drop(Box::from_raw(inner)) };
> > +
> > + return Err(DeferredFdCloseError::TaskWorkUnavailable);
>
> Just curious, what could make the `task_work_add` call fail? I imagine
> an OOM situation, but is that all?

Actually, no, this doesn't fail in OOM situations since we pass it an
allocation for its linked list. It fails only when the current task is
exiting and wont run task work again.

> > + }
> > +
> > + // SAFETY: Just an FFI call. This is safe no matter what `fd` is.
>
> I took a look at the C code and there I found this comment:
>
> /*
> * variant of close_fd that gets a ref on the file for later fput.
> * The caller must ensure that filp_close() called on the file.
> */
>
> And while you do call `filp_close` later, this seems like a safety
> requirement to me.

I'll mention this.

> Also, you do not call it when `file` is null, which I imagine to be
> fine, but I do not know that since the C comment does not cover that
> case.

Null pointer means that the fd doesn't exist, and it's correct to do
nothing in that case.

> > + let file = unsafe { bindings::close_fd_get_file(fd) };
> > + if file.is_null() {
> > + // We don't clean up the task work since that might be expensive if the task work queue
> > + // is long. Just let it execute and let it clean up for itself.
> > + return Err(DeferredFdCloseError::BadFd);
> > + }
> > +
> > + // SAFETY: The `file` pointer points at a valid file.
> > + unsafe { bindings::get_file(file) };
> > +
> > + // SAFETY: Due to the above `get_file`, even if the current task holds an `fdget` to
> > + // this file right now, the refcount will not drop to zero until after it is released
> > + // with `fdput`. This is because when using `fdget`, you must always use `fdput` before
>
> Shouldn't this be "the refcount will not drop to zero until after it is
> released with `fput`."?
>
> Why is this the SAFETY comment for `filp_close`? I am not understanding
> the requirement on that function that needs this. This seems more a
> justification for accessing `file` inside `do_close_fd`. In which case I
> think it would be better to make it a type invariant of
> `DeferredFdCloserInner`.

It's because `filp_close` decreases the refcount for the file, and doing
that is not always safe even if you have a refcount to the file. To drop
the refcount, at least one of the two following must be the case:

* If the refcount decreases to a non-zero value, then it is okay.
* If there are no users of `fdget` on the file, then it is okay.

In this case, we are in the first case, as we own two refcounts.

I'll clarify the safety comment in v3.

> > + // returning to userspace, and our task work runs after any `fdget` users have returned
> > + // to userspace.
> > + //
> > + // Note: fl_owner_t is currently a void pointer.
> > + unsafe { bindings::filp_close(file, (*current).files as bindings::fl_owner_t) };
> > +
> > + // We update the file pointer that the task work is supposed to fput.
> > + //
> > + // SAFETY: Task works are executed on the current thread once we return to userspace, so
> > + // this write is guaranteed to happen before `do_close_fd` is called, which means that a
> > + // race is not possible here.
> > + //
> > + // It's okay to pass this pointer to the task work, since we just acquired a refcount with
> > + // the previous call to `get_file`. Furthermore, the refcount will not drop to zero during
> > + // an `fdget` call, since we defer the `fput` until after returning to userspace.
> > + unsafe { *file_field = file };
>
> A synchronization question: who guarantees that this write is actually
> available to the cpu that executes `do_close_fd`? Is there some
> synchronization run when returning to userspace?

It's on the same thread, so it's just a sequenced-before relation.

It's not like an interrupt. It runs after the syscall invocation has
exited, but before it does the actual return-to-userspace stuff.

> > +
> > + Ok(())
> > + }
> > +
> > + // SAFETY: This function is an implementation detail of `close_fd`, so its safety comments
> > + // should be read in extension of that method.
>
> Why not use this?:
> - `inner` is a valid pointer to the `callback_head` field of a valid
> `DeferredFdCloserInner`.
> - `inner` has exclusive access to the pointee and owns the allocation.
> - `inner` originates from a call to `Box::into_raw`.

I'll update this together with the clarification of `filp_close`.

> > + unsafe extern "C" fn do_close_fd(inner: *mut bindings::callback_head) {
> > + // SAFETY: In `close_fd` we use this method together with a pointer that originates from a
> > + // `Box<DeferredFdCloserInner>`, and we have just been given ownership of that allocation.
> > + let inner = unsafe { Box::from_raw(inner as *mut DeferredFdCloserInner) };
>
> Use `.cast()`.

Ok.

> > + if !inner.file.is_null() {
> > + // SAFETY: This drops a refcount we acquired in `close_fd`. Since this callback runs in
> > + // a task work after we return to userspace, it is guaranteed that the current thread
> > + // doesn't hold this file with `fdget`, as `fdget` must be released before returning to
> > + // userspace.
> > + unsafe { bindings::fput(inner.file) };
> > + }
> > + // Free the allocation.
> > + drop(inner);
> > + }
> > +}
> > +
> > +/// Represents a failure to close an fd in a deferred manner.
> > +#[derive(Copy, Clone, Eq, PartialEq)]
> > +pub enum DeferredFdCloseError {
> > + /// Closing the fd failed because we were unable to schedule a task work.
> > + TaskWorkUnavailable,
> > + /// Closing the fd failed because the fd does not exist.
> > + BadFd,
> > +}
> > +
> > +impl From<DeferredFdCloseError> for Error {
> > + fn from(err: DeferredFdCloseError) -> Error {
> > + match err {
> > + DeferredFdCloseError::TaskWorkUnavailable => ESRCH,
>
> This error reads "No such process", I am not sure if that is the best
> way to express the problem in that situation. I took a quick look at the
> other error codes, but could not find a better fit. Do you have any
> better ideas? Or is this the error that C binder uses?

This is the error code that task_work_add returns. (It can't happen in
Binder.)

And I do think that it is a reasonable choice, because the error only
happens if you're calling the method from a context that has no
userspace process associated with it.

Alice

2023-12-11 15:35:08

by Alice Ryhl

[permalink] [raw]
Subject: Re: [PATCH v2 4/7] rust: file: add `FileDescriptorReservation`

Benno Lossin <[email protected]> writes:
> On 12/6/23 12:59, Alice Ryhl wrote:
> > + /// Commits the reservation.
> > + ///
> > + /// The previously reserved file descriptor is bound to `file`. This method consumes the
> > + /// [`FileDescriptorReservation`], so it will not be usable after this call.
> > + pub fn fd_install(self, file: ARef<File>) {
> > + // SAFETY: `self.fd` was previously returned by `get_unused_fd_flags`, and `file.ptr` is
> > + // guaranteed to have an owned ref count by its type invariants.
>
> There is no mention of the requirement that `current` has not changed
> (you do mention it on the `_not_send` field, but I think it should also
> be in the safety comment here).
>
> [...]
> > +impl Drop for FileDescriptorReservation {
> > + fn drop(&mut self) {
> > + // SAFETY: `self.fd` was returned by a previous call to `get_unused_fd_flags`.
>
> Ditto.

I'll update this.

> > +/// Zero-sized type to mark types not [`Send`].
> > +///
> > +/// Add this type as a field to your struct if your type should not be sent to a different task.
> > +/// Since [`Send`] is an auto trait, adding a single field that is `!Send` will ensure that the
> > +/// whole type is `!Send`.
> > +///
> > +/// If a type is `!Send` it is impossible to give control over an instance of the type to another
> > +/// task. This is useful when a type stores task-local information for example file descriptors.
> > +pub type NotThreadSafe = PhantomData<*mut ()>;
>
> This should be in its own commit.
>
> Then you can also change the usages of `PhantomData<*mut ()>` in
> `Guard` and `TaskRef`.
>
> It would be nice to use `NotThreadSafe` as the value instead of
> `PhantomData`, since it is a bit confusing...
> I think we might be able to also have a constant with the same name
> that is just `pub const NotThreadSafe: NotThreadSafe = PhantomData;`.

I was able to get this to work with a `const`, so I will use that.

Alice

2023-12-11 15:35:10

by Alice Ryhl

[permalink] [raw]
Subject: Re: [PATCH v2 5/7] rust: file: add `Kuid` wrapper

Benno Lossin <[email protected]> writes:
> On 12/6/23 12:59, Alice Ryhl wrote:
> > + /// Returns the given task's pid in the current pid namespace.
> > + pub fn pid_in_current_ns(&self) -> Pid {
> > + // SAFETY: Calling `task_active_pid_ns` with the current task is always safe.
> > + let namespace = unsafe { bindings::task_active_pid_ns(bindings::get_current()) };
>
> Why not create a safe wrapper for `bindings::get_current()`?
> This patch series has three occurrences of `get_current`, so I think it
> should be ok to add a wrapper.
> I would also prefer to move the call to `bindings::get_current()` out of
> the `unsafe` block.

See thread on patch 6.

> > + // SAFETY: We know that `self.0.get()` is valid by the type invariant.
>
> What about `namespace`?

I'll update the comment.

Alice

2023-12-11 16:02:08

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH v2 5/7] rust: file: add `Kuid` wrapper

On Fri, Dec 08, 2023 at 08:43:19AM -0800, Boqun Feng wrote:
> On Fri, Dec 08, 2023 at 04:40:09PM +0000, Benno Lossin wrote:
> > On 12/6/23 12:59, Alice Ryhl wrote:
> > > + /// Returns the given task's pid in the current pid namespace.
> > > + pub fn pid_in_current_ns(&self) -> Pid {
> > > + // SAFETY: Calling `task_active_pid_ns` with the current task is always safe.
> > > + let namespace = unsafe { bindings::task_active_pid_ns(bindings::get_current()) };
> >
> > Why not create a safe wrapper for `bindings::get_current()`?
> > This patch series has three occurrences of `get_current`, so I think it
> > should be ok to add a wrapper.
> > I would also prefer to move the call to `bindings::get_current()` out of
> > the `unsafe` block.
>
> FWIW, we have a current!() macro, we should use it here.

Why does it need to be a macro?

2023-12-11 17:06:01

by Benno Lossin

[permalink] [raw]
Subject: Re: [PATCH v2 5/7] rust: file: add `Kuid` wrapper

On 12/11/23 16:58, Kent Overstreet wrote:
> On Fri, Dec 08, 2023 at 08:43:19AM -0800, Boqun Feng wrote:
>> On Fri, Dec 08, 2023 at 04:40:09PM +0000, Benno Lossin wrote:
>>> On 12/6/23 12:59, Alice Ryhl wrote:
>>>> + /// Returns the given task's pid in the current pid namespace.
>>>> + pub fn pid_in_current_ns(&self) -> Pid {
>>>> + // SAFETY: Calling `task_active_pid_ns` with the current task is always safe.
>>>> + let namespace = unsafe { bindings::task_active_pid_ns(bindings::get_current()) };
>>>
>>> Why not create a safe wrapper for `bindings::get_current()`?
>>> This patch series has three occurrences of `get_current`, so I think it
>>> should be ok to add a wrapper.
>>> I would also prefer to move the call to `bindings::get_current()` out of
>>> the `unsafe` block.
>>
>> FWIW, we have a current!() macro, we should use it here.
>
> Why does it need to be a macro?

This is a very interesting question. A `Task` is `AlwaysRefCounted`, so
if you have a `&'a Task`, someone above you owns a refcount on that
task. But the `current` task will never go away as long as you stay in
the same task. So you actually do not need to own a refcount as long as
there are no context switches.

We use this to our advantage and the `current!()` macro returns
something that acts like `&'a Task` but additionally is `!Send` (so it
cannot be sent over to a different task). This means that we do not need
to take a refcount on the current task to get a reference to it.

But just having a function that returns a `&'a Task` like thing that is
`!Send` is not enough, since there are no constraints on 'a. This is
because the function `Task::current` would take no arguments and there
is nothing the lifetime could even bind to.
Since there are no constraints, you could just choose 'a = 'static which
is obviously wrong, since there are tasks that end. For this reason the
`Task::current` function is `unsafe` and the macro `current!` ensures
that the lifetime 'a ends early enough. It is implemented like this:

macro_rules! current {
() => {
// SAFETY: Deref + addr-of below create a temporary `TaskRef` that cannot outlive the
// caller.
unsafe { &*$crate::task::Task::current() }
};
}

Note the `&*`. This ensures that the thing returned by `Task::current`
is temporary and cannot outlive the current function thus preventing the
creation of static references to it.

If you want to read more, see here for Gary's original discovery of the
issue:
https://lore.kernel.org/rust-for-linux/[email protected]/

--
Cheers,
Benno

2023-12-11 17:23:23

by Benno Lossin

[permalink] [raw]
Subject: Re: [PATCH v2 6/7] rust: file: add `DeferredFdCloser`

On 12/11/23 16:34, Alice Ryhl wrote:
> Benno Lossin <[email protected]> writes:
>> On 12/6/23 12:59, Alice Ryhl wrote:
>>> + /// Schedule a task work that closes the file descriptor when this task returns to userspace.
>>> + ///
>>> + /// Fails if this is called from a context where we cannot run work when returning to
>>> + /// userspace. (E.g., from a kthread.)
>>> + pub fn close_fd(self, fd: u32) -> Result<(), DeferredFdCloseError> {
>>> + use bindings::task_work_notify_mode_TWA_RESUME as TWA_RESUME;
>>> +
>>> + // In this method, we schedule the task work before closing the file. This is because
>>> + // scheduling a task work is fallible, and we need to know whether it will fail before we
>>> + // attempt to close the file.
>>> +
>>> + // SAFETY: Getting a pointer to current is always safe.
>>> + let current = unsafe { bindings::get_current() };
>>> +
>>> + // SAFETY: Accessing the `flags` field of `current` is always safe.
>>> + let is_kthread = (unsafe { (*current).flags } & bindings::PF_KTHREAD) != 0;
>>
>> Since Boqun brought to my attention that we already have a wrapper for
>> `get_current()`, how about you use it here as well?
>
> I can use the wrapper, but it seems simpler to not go through a
> reference when we just need a raw pointer.
>
> Perhaps we should have a safe `Task::current_raw` function that just
> returns a raw pointer? It can still be safe.

Sure, that would work.

Just thought that it is annoying having to write the safety
requirements of that again and again.

[...]

>>> + unsafe { bindings::init_task_work(callback_head, Some(Self::do_close_fd)) };
>>> + // SAFETY: The `callback_head` pointer points at a valid and fully initialized task work
>>> + // that is ready to be scheduled.
>>> + //
>>> + // If the task work gets scheduled as-is, then it will be a no-op. However, we will update
>>> + // the file pointer below to tell it which file to fput.
>>> + let res = unsafe { bindings::task_work_add(current, callback_head, TWA_RESUME) };
>>> +
>>> + if res != 0 {
>>> + // SAFETY: Scheduling the task work failed, so we still have ownership of the box, so
>>> + // we may destroy it.
>>> + unsafe { drop(Box::from_raw(inner)) };
>>> +
>>> + return Err(DeferredFdCloseError::TaskWorkUnavailable);
>>
>> Just curious, what could make the `task_work_add` call fail? I imagine
>> an OOM situation, but is that all?
>
> Actually, no, this doesn't fail in OOM situations since we pass it an
> allocation for its linked list. It fails only when the current task is
> exiting and wont run task work again.

Interesting, is there some way to check for this aside from calling
`task_work_add`?

>>> + }
>>> +
>>> + // SAFETY: Just an FFI call. This is safe no matter what `fd` is.
>>
>> I took a look at the C code and there I found this comment:
>>
>> /*
>> * variant of close_fd that gets a ref on the file for later fput.
>> * The caller must ensure that filp_close() called on the file.
>> */
>>
>> And while you do call `filp_close` later, this seems like a safety
>> requirement to me.
>
> I'll mention this.
>
>> Also, you do not call it when `file` is null, which I imagine to be
>> fine, but I do not know that since the C comment does not cover that
>> case.
>
> Null pointer means that the fd doesn't exist, and it's correct to do
> nothing in that case.

I would also mention that in a comment (or the SAFETY comment).

>>> + let file = unsafe { bindings::close_fd_get_file(fd) };
>>> + if file.is_null() {
>>> + // We don't clean up the task work since that might be expensive if the task work queue
>>> + // is long. Just let it execute and let it clean up for itself.
>>> + return Err(DeferredFdCloseError::BadFd);
>>> + }
>>> +
>>> + // SAFETY: The `file` pointer points at a valid file.
>>> + unsafe { bindings::get_file(file) };
>>> +
>>> + // SAFETY: Due to the above `get_file`, even if the current task holds an `fdget` to
>>> + // this file right now, the refcount will not drop to zero until after it is released
>>> + // with `fdput`. This is because when using `fdget`, you must always use `fdput` before
>>
>> Shouldn't this be "the refcount will not drop to zero until after it is
>> released with `fput`."?
>>
>> Why is this the SAFETY comment for `filp_close`? I am not understanding
>> the requirement on that function that needs this. This seems more a
>> justification for accessing `file` inside `do_close_fd`. In which case I
>> think it would be better to make it a type invariant of
>> `DeferredFdCloserInner`.
>
> It's because `filp_close` decreases the refcount for the file, and doing
> that is not always safe even if you have a refcount to the file. To drop
> the refcount, at least one of the two following must be the case:
>
> * If the refcount decreases to a non-zero value, then it is okay.
> * If there are no users of `fdget` on the file, then it is okay.

I see, that makes sense. Is this written down somewhere? Or how does one
know about this?

> In this case, we are in the first case, as we own two refcounts.
>
> I'll clarify the safety comment in v3.
>
>>> + // returning to userspace, and our task work runs after any `fdget` users have returned
>>> + // to userspace.
>>> + //
>>> + // Note: fl_owner_t is currently a void pointer.
>>> + unsafe { bindings::filp_close(file, (*current).files as bindings::fl_owner_t) };
>>> +
>>> + // We update the file pointer that the task work is supposed to fput.
>>> + //
>>> + // SAFETY: Task works are executed on the current thread once we return to userspace, so
>>> + // this write is guaranteed to happen before `do_close_fd` is called, which means that a
>>> + // race is not possible here.
>>> + //
>>> + // It's okay to pass this pointer to the task work, since we just acquired a refcount with
>>> + // the previous call to `get_file`. Furthermore, the refcount will not drop to zero during
>>> + // an `fdget` call, since we defer the `fput` until after returning to userspace.
>>> + unsafe { *file_field = file };
>>
>> A synchronization question: who guarantees that this write is actually
>> available to the cpu that executes `do_close_fd`? Is there some
>> synchronization run when returning to userspace?
>
> It's on the same thread, so it's just a sequenced-before relation.
>
> It's not like an interrupt. It runs after the syscall invocation has
> exited, but before it does the actual return-to-userspace stuff.

Reasonable, can you also put this in a comment?

[...]

>>> + if !inner.file.is_null() {
>>> + // SAFETY: This drops a refcount we acquired in `close_fd`. Since this callback runs in
>>> + // a task work after we return to userspace, it is guaranteed that the current thread
>>> + // doesn't hold this file with `fdget`, as `fdget` must be released before returning to
>>> + // userspace.
>>> + unsafe { bindings::fput(inner.file) };
>>> + }
>>> + // Free the allocation.
>>> + drop(inner);
>>> + }
>>> +}
>>> +
>>> +/// Represents a failure to close an fd in a deferred manner.
>>> +#[derive(Copy, Clone, Eq, PartialEq)]
>>> +pub enum DeferredFdCloseError {
>>> + /// Closing the fd failed because we were unable to schedule a task work.
>>> + TaskWorkUnavailable,
>>> + /// Closing the fd failed because the fd does not exist.
>>> + BadFd,
>>> +}
>>> +
>>> +impl From<DeferredFdCloseError> for Error {
>>> + fn from(err: DeferredFdCloseError) -> Error {
>>> + match err {
>>> + DeferredFdCloseError::TaskWorkUnavailable => ESRCH,
>>
>> This error reads "No such process", I am not sure if that is the best
>> way to express the problem in that situation. I took a quick look at the
>> other error codes, but could not find a better fit. Do you have any
>> better ideas? Or is this the error that C binder uses?
>
> This is the error code that task_work_add returns. (It can't happen in
> Binder.)
>
> And I do think that it is a reasonable choice, because the error only
> happens if you're calling the method from a context that has no
> userspace process associated with it.

I see.

What do you think of making the Rust error more descriptive? So instead
of implementing `Debug` like you currently do, you print

$error ($variant)

where $error = Error::from(*self) and $variant is the name of the
variant?

This is more of a general suggestion, I don't think that this error type
in particular warrants this. But in general with Rust we do have the
option to have good error messages for every error while maintaining
efficient error values.

--
Cheers,
Benno

2023-12-11 18:56:59

by Boqun Feng

[permalink] [raw]
Subject: Re: [PATCH v2 6/7] rust: file: add `DeferredFdCloser`

On Mon, Dec 11, 2023 at 03:34:40PM +0000, Alice Ryhl wrote:
> Benno Lossin <[email protected]> writes:
> > On 12/6/23 12:59, Alice Ryhl wrote:
> > > + /// Schedule a task work that closes the file descriptor when this task returns to userspace.
> > > + ///
> > > + /// Fails if this is called from a context where we cannot run work when returning to
> > > + /// userspace. (E.g., from a kthread.)
> > > + pub fn close_fd(self, fd: u32) -> Result<(), DeferredFdCloseError> {
> > > + use bindings::task_work_notify_mode_TWA_RESUME as TWA_RESUME;
> > > +
> > > + // In this method, we schedule the task work before closing the file. This is because
> > > + // scheduling a task work is fallible, and we need to know whether it will fail before we
> > > + // attempt to close the file.
> > > +
> > > + // SAFETY: Getting a pointer to current is always safe.
> > > + let current = unsafe { bindings::get_current() };
> > > +
> > > + // SAFETY: Accessing the `flags` field of `current` is always safe.
> > > + let is_kthread = (unsafe { (*current).flags } & bindings::PF_KTHREAD) != 0;
> >
> > Since Boqun brought to my attention that we already have a wrapper for
> > `get_current()`, how about you use it here as well?
>
> I can use the wrapper, but it seems simpler to not go through a
> reference when we just need a raw pointer.
>
> Perhaps we should have a safe `Task::current_raw` function that just
> returns a raw pointer? It can still be safe.
>

I think we can have a `as_ptr` function for `Task`?

impl Task {
pub fn as_ptr(&self) -> *mut bindings::task_struct {
self.0.get()
}
}

Regards,
Boqun

[...]

2023-12-11 21:13:42

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH v2 5/7] rust: file: add `Kuid` wrapper

On Fri, Dec 08, 2023 at 10:18:47AM -0800, Kees Cook wrote:
> On Fri, Dec 08, 2023 at 05:57:02PM +0100, Peter Zijlstra wrote:
> > On Fri, Dec 08, 2023 at 05:31:59PM +0100, Miguel Ojeda wrote:
> > > On Wed, Dec 6, 2023 at 2:41 PM Peter Zijlstra <[email protected]> wrote:
> > > >
> > > > Anywhoo, the longer a function is, the harder it becomes, since you need
> > > > to deal with everything a function does and consider the specuation
> > > > window length. So trivial functions like the above that do an immediate
> > > > dereference and are (and must be) a valid indirect target (because
> > > > EXPORT) are ideal.
> > >
> > > We discussed this in our weekly meeting, and we would like to ask a
> > > few questions:
> > >
> > > - Could you please describe an example attack that you are thinking
> > > of? (i.e. a "full" attack, rather than just Spectre itself). For
> > > instance, would it rely on other vulnerabilities?
> >
> > There's a fairly large amount of that on github, google spectre poc and
> > stuff like that.
>
> tl;dr: I don't think the introduction of speculation gadgets is a
> sufficient reason to block Rust interfaces like this.
>
> Long version:
>
> I think the question here is "what is the threat model?" If I break down
> the objection, I understand it as:
>
> 1) The trivial wrappers-of-inlines are speculation gadgets.
> 2) They're exported, so callable by anything.
>
> If the threat model is "something can call these to trigger
> speculation", I think this is pretty strongly mitigated already;
>
> 1) These aren't syscall definitions, so their "reachability" is pretty
> limited. In fact, they're already going to be used in places, logically,
> where the inline would be used, so the speculation window is going to be
> same (or longer, given the addition of the direct call and return).
>
> 2) If an attacker is in a position to directly call these helpers,
> they're not going to use them: if an attacker already has arbitrary
> execution, they're not going to bother with speculation.
>
> Fundamentally I don't see added risk here. From the security hardening
> perspective we have two goals: kill bug classes and block exploitation
> techniques, and the former is a much more powerful defensive strategy
> since without the bugs, there's no chance to perform an exploit.
>
> In general, I think we should prioritize bug class elimination over
> exploit technique foiling. In this case, we're adding a potential weakness
> to the image of the kernel of fairly limited scope in support of stronger
> bug elimination goals.
>
> Even if we look at the prerequisites for mounting an attack here, we've
> already got things in place to help mitigate arbitrary code execution
> (KCFI, BTI, etc). Nothing is perfect, but speculation gadgets are
> pretty far down on the list of concerns, IMO. We have no real x86 ROP
> defense right now in the kernel, so that's a much lower hanging fruit
> for attackers.
>
> As another comparison, on x86 there are so many direct execution gadgets
> present in middle-of-instruction code patterns that worrying about a
> speculation gadget seems silly to me.
>
> > [...]
> > The thing at hand was just me eyeballing it.
>
> I can understand the point you and Greg have both expressed here: "this
> is known to be an anti-pattern, we need to do something else". I generally
> agree with this, but in this case, I don't think it's the right call. This
> is an area we'll naturally see improvement from on the Rust side since
> these calls are a _performance_ concern too, so it's not like this will be
> "forgotten" about. But blocking it until there is a complete and perfect
> solution feels like we're letting perfect be the enemy of good.
>
> All of our development is evolutionary, so I think we'd be in a much
> better position to take these (currently ugly) work-arounds (visible
> only in Rust builds) so that we gain the ability to evolve towards more
> memory safe code.

Well said.

More than that, I think this is a situation where we really need to
consider what our priorities are.

Where are most of our real world vulnerabilities coming from? Are they
spectre issues, or are they memory safety issues, and the kinds of
logic/resource issues we know Rust's type system is going to help with?
I think we all know the answer to that question.

More than that, as kernel programmers, we need to be focusing primarily
on the issues that are our responsibility and under our control. At the
point when the inability of the CPU manufacturers to get their shit
together is causing us to have concerns over _generated code for little
helper functions_, the world has gone crazy.

I remember when the spectre stuff first hit, and everyone was saying
"this is just a temporary workaround; CPU manufacturers are going to
sort this out in a couple releases".

They haven't. That's batshit. And I doubt they ever will if we keep
pretending we can just work around everything in software.

2023-12-12 01:27:26

by Boqun Feng

[permalink] [raw]
Subject: Re: [PATCH v2 6/7] rust: file: add `DeferredFdCloser`

On Mon, Dec 11, 2023 at 09:41:28AM -0800, Boqun Feng wrote:
> On Mon, Dec 11, 2023 at 03:34:40PM +0000, Alice Ryhl wrote:
> > Benno Lossin <[email protected]> writes:
> > > On 12/6/23 12:59, Alice Ryhl wrote:
> > > > + /// Schedule a task work that closes the file descriptor when this task returns to userspace.
> > > > + ///
> > > > + /// Fails if this is called from a context where we cannot run work when returning to
> > > > + /// userspace. (E.g., from a kthread.)
> > > > + pub fn close_fd(self, fd: u32) -> Result<(), DeferredFdCloseError> {
> > > > + use bindings::task_work_notify_mode_TWA_RESUME as TWA_RESUME;
> > > > +
> > > > + // In this method, we schedule the task work before closing the file. This is because
> > > > + // scheduling a task work is fallible, and we need to know whether it will fail before we
> > > > + // attempt to close the file.
> > > > +
> > > > + // SAFETY: Getting a pointer to current is always safe.
> > > > + let current = unsafe { bindings::get_current() };
> > > > +
> > > > + // SAFETY: Accessing the `flags` field of `current` is always safe.
> > > > + let is_kthread = (unsafe { (*current).flags } & bindings::PF_KTHREAD) != 0;
> > >
> > > Since Boqun brought to my attention that we already have a wrapper for
> > > `get_current()`, how about you use it here as well?
> >
> > I can use the wrapper, but it seems simpler to not go through a
> > reference when we just need a raw pointer.
> >
> > Perhaps we should have a safe `Task::current_raw` function that just
> > returns a raw pointer? It can still be safe.
> >
>
> I think we can have a `as_ptr` function for `Task`?
>
> impl Task {
> pub fn as_ptr(&self) -> *mut bindings::task_struct {
> self.0.get()
> }
> }

Forgot mention, yes a ptr->ref->ptr trip may not be ideal, but I think
eventually we will have a task work wrapper, in that case maybe
Task::as_ptr() is still needed somehow.

Regards,
Boqun

2023-12-12 09:35:32

by Alice Ryhl

[permalink] [raw]
Subject: Re: [PATCH v2 6/7] rust: file: add `DeferredFdCloser`

On Mon, Dec 11, 2023 at 6:23 PM Benno Lossin <[email protected]> wrote:
>
> >>> + unsafe { bindings::init_task_work(callback_head, Some(Self::do_close_fd)) };
> >>> + // SAFETY: The `callback_head` pointer points at a valid and fully initialized task work
> >>> + // that is ready to be scheduled.
> >>> + //
> >>> + // If the task work gets scheduled as-is, then it will be a no-op. However, we will update
> >>> + // the file pointer below to tell it which file to fput.
> >>> + let res = unsafe { bindings::task_work_add(current, callback_head, TWA_RESUME) };
> >>> +
> >>> + if res != 0 {
> >>> + // SAFETY: Scheduling the task work failed, so we still have ownership of the box, so
> >>> + // we may destroy it.
> >>> + unsafe { drop(Box::from_raw(inner)) };
> >>> +
> >>> + return Err(DeferredFdCloseError::TaskWorkUnavailable);
> >>
> >> Just curious, what could make the `task_work_add` call fail? I imagine
> >> an OOM situation, but is that all?
> >
> > Actually, no, this doesn't fail in OOM situations since we pass it an
> > allocation for its linked list. It fails only when the current task is
> > exiting and wont run task work again.
>
> Interesting, is there some way to check for this aside from calling
> `task_work_add`?

I don't think so. We would need to access the `work_exited` constant
in `kernel/task_work.c` to do that, but it is not exported.

> >> Also, you do not call it when `file` is null, which I imagine to be
> >> fine, but I do not know that since the C comment does not cover that
> >> case.
> >
> > Null pointer means that the fd doesn't exist, and it's correct to do
> > nothing in that case.
>
> I would also mention that in a comment (or the SAFETY comment).

Okay.

> >>> + let file = unsafe { bindings::close_fd_get_file(fd) };
> >>> + if file.is_null() {
> >>> + // We don't clean up the task work since that might be expensive if the task work queue
> >>> + // is long. Just let it execute and let it clean up for itself.
> >>> + return Err(DeferredFdCloseError::BadFd);
> >>> + }
> >>> +
> >>> + // SAFETY: The `file` pointer points at a valid file.
> >>> + unsafe { bindings::get_file(file) };
> >>> +
> >>> + // SAFETY: Due to the above `get_file`, even if the current task holds an `fdget` to
> >>> + // this file right now, the refcount will not drop to zero until after it is released
> >>> + // with `fdput`. This is because when using `fdget`, you must always use `fdput` before
> >>
> >> Shouldn't this be "the refcount will not drop to zero until after it is
> >> released with `fput`."?
> >>
> >> Why is this the SAFETY comment for `filp_close`? I am not understanding
> >> the requirement on that function that needs this. This seems more a
> >> justification for accessing `file` inside `do_close_fd`. In which case I
> >> think it would be better to make it a type invariant of
> >> `DeferredFdCloserInner`.
> >
> > It's because `filp_close` decreases the refcount for the file, and doing
> > that is not always safe even if you have a refcount to the file. To drop
> > the refcount, at least one of the two following must be the case:
> >
> > * If the refcount decreases to a non-zero value, then it is okay.
> > * If there are no users of `fdget` on the file, then it is okay.
>
> I see, that makes sense. Is this written down somewhere? Or how does one
> know about this?

I don't think there's a single place to read about this. The comments
on __fget_light allude to something similar, but it makes the blanket
statement that you can't call filp_close while an fdget reference
exists, even though the reality is a bit more nuanced.

> >>> + // We update the file pointer that the task work is supposed to fput.
> >>> + //
> >>> + // SAFETY: Task works are executed on the current thread once we return to userspace, so
> >>> + // this write is guaranteed to happen before `do_close_fd` is called, which means that a
> >>> + // race is not possible here.
> >>> + //
> >>> + // It's okay to pass this pointer to the task work, since we just acquired a refcount with
> >>> + // the previous call to `get_file`. Furthermore, the refcount will not drop to zero during
> >>> + // an `fdget` call, since we defer the `fput` until after returning to userspace.
> >>> + unsafe { *file_field = file };
> >>
> >> A synchronization question: who guarantees that this write is actually
> >> available to the cpu that executes `do_close_fd`? Is there some
> >> synchronization run when returning to userspace?
> >
> > It's on the same thread, so it's just a sequenced-before relation.
> >
> > It's not like an interrupt. It runs after the syscall invocation has
> > exited, but before it does the actual return-to-userspace stuff.
>
> Reasonable, can you also put this in a comment?

What do you want me to add? I already say that it will be executed on
the same thread.

> >>> +/// Represents a failure to close an fd in a deferred manner.
> >>> +#[derive(Copy, Clone, Eq, PartialEq)]
> >>> +pub enum DeferredFdCloseError {
> >>> + /// Closing the fd failed because we were unable to schedule a task work.
> >>> + TaskWorkUnavailable,
> >>> + /// Closing the fd failed because the fd does not exist.
> >>> + BadFd,
> >>> +}
> >>> +
> >>> +impl From<DeferredFdCloseError> for Error {
> >>> + fn from(err: DeferredFdCloseError) -> Error {
> >>> + match err {
> >>> + DeferredFdCloseError::TaskWorkUnavailable => ESRCH,
> >>
> >> This error reads "No such process", I am not sure if that is the best
> >> way to express the problem in that situation. I took a quick look at the
> >> other error codes, but could not find a better fit. Do you have any
> >> better ideas? Or is this the error that C binder uses?
> >
> > This is the error code that task_work_add returns. (It can't happen in
> > Binder.)
> >
> > And I do think that it is a reasonable choice, because the error only
> > happens if you're calling the method from a context that has no
> > userspace process associated with it.
>
> I see.
>
> What do you think of making the Rust error more descriptive? So instead
> of implementing `Debug` like you currently do, you print
>
> $error ($variant)
>
> where $error = Error::from(*self) and $variant is the name of the
> variant?
>
> This is more of a general suggestion, I don't think that this error type
> in particular warrants this. But in general with Rust we do have the
> option to have good error messages for every error while maintaining
> efficient error values.

I can #[derive(Debug)] instead, I guess?

Alice

2023-12-12 10:00:14

by Alice Ryhl

[permalink] [raw]
Subject: Re: [PATCH v2 7/7] rust: file: add abstraction for `poll_table`

On Fri, Dec 8, 2023 at 6:53 PM Benno Lossin <[email protected]> wrote:
>
> On 12/6/23 12:59, Alice Ryhl wrote:
> > diff --git a/rust/bindings/lib.rs b/rust/bindings/lib.rs
> > index 9bcbea04dac3..eeb291cc60db 100644
> > --- a/rust/bindings/lib.rs
> > +++ b/rust/bindings/lib.rs
> > @@ -51,3 +51,4 @@ mod bindings_helper {
> >
> > pub const GFP_KERNEL: gfp_t = BINDINGS_GFP_KERNEL;
> > pub const __GFP_ZERO: gfp_t = BINDINGS___GFP_ZERO;
> > +pub const POLLFREE: __poll_t = BINDINGS_POLLFREE;
>
> You are no longer using this constant, should this still exist?

Nice catch, thanks!

> > + fn get_qproc(&self) -> bindings::poll_queue_proc {
> > + let ptr = self.0.get();
> > + // SAFETY: The `ptr` is valid because it originates from a reference, and the `_qproc`
> > + // field is not modified concurrently with this call since we have an immutable reference.
>
> This needs an invariant on `PollTable` (i.e. `self.0` is valid).

How would you phrase it?

> > + unsafe { (*ptr)._qproc }
> > + }
> > +
> > + /// Register this [`PollTable`] with the provided [`PollCondVar`], so that it can be notified
> > + /// using the condition variable.
> > + pub fn register_wait(&mut self, file: &File, cv: &PollCondVar) {
> > + if let Some(qproc) = self.get_qproc() {
> > + // SAFETY: The pointers to `self` and `file` are valid because they are references.
>
> What about cv.wait_list...

I can add it to the list of things that are valid due to references.

> > + //
> > + // Before the wait list is destroyed, the destructor of `PollCondVar` will clear
> > + // everything in the wait list, so the wait list is not used after it is freed.
> > + unsafe { qproc(file.as_ptr() as _, cv.wait_list.get(), self.0.get()) };
> > + }
> > + }
> > +}
> > +
> > +/// A wrapper around [`CondVar`] that makes it usable with [`PollTable`].
> > +///
> > +/// # Invariant
> > +///
> > +/// If `needs_synchronize_rcu` is false, then there is nothing registered with `register_wait`.
>
> Not able to find `needs_synchronize_rcu` anywhere else, should this be
> here?

Sorry, this shouldn't be there. It was something I experimented with,
but gave up on.

> > +#[pinned_drop]
> > +impl PinnedDrop for PollCondVar {
> > + fn drop(self: Pin<&mut Self>) {
> > + // Clear anything registered using `register_wait`.
> > + //
> > + // SAFETY: The pointer points at a valid wait list.
>
> I was a bit confused by "wait list", since the C type is named
> `wait_queue_head`, maybe just use the type name?

I will update all instances of "wait list" to "wait_queue_head". It's
because I incorrectly remembered the C type name to be "wait_list".

Alice

2023-12-12 16:51:08

by Benno Lossin

[permalink] [raw]
Subject: Re: [PATCH v2 6/7] rust: file: add `DeferredFdCloser`

On 12/12/23 10:35, Alice Ryhl wrote:
> On Mon, Dec 11, 2023 at 6:23 PM Benno Lossin <[email protected]> wrote:
>>
>>>>> + // We update the file pointer that the task work is supposed to fput.
>>>>> + //
>>>>> + // SAFETY: Task works are executed on the current thread once we return to userspace, so
>>>>> + // this write is guaranteed to happen before `do_close_fd` is called, which means that a
>>>>> + // race is not possible here.
>>>>> + //
>>>>> + // It's okay to pass this pointer to the task work, since we just acquired a refcount with
>>>>> + // the previous call to `get_file`. Furthermore, the refcount will not drop to zero during
>>>>> + // an `fdget` call, since we defer the `fput` until after returning to userspace.
>>>>> + unsafe { *file_field = file };
>>>>
>>>> A synchronization question: who guarantees that this write is actually
>>>> available to the cpu that executes `do_close_fd`? Is there some
>>>> synchronization run when returning to userspace?
>>>
>>> It's on the same thread, so it's just a sequenced-before relation.
>>>
>>> It's not like an interrupt. It runs after the syscall invocation has
>>> exited, but before it does the actual return-to-userspace stuff.
>>
>> Reasonable, can you also put this in a comment?
>
> What do you want me to add? I already say that it will be executed on
> the same thread.

Seems I missed that, then no need to add anything.

>>>>> +/// Represents a failure to close an fd in a deferred manner.
>>>>> +#[derive(Copy, Clone, Eq, PartialEq)]
>>>>> +pub enum DeferredFdCloseError {
>>>>> + /// Closing the fd failed because we were unable to schedule a task work.
>>>>> + TaskWorkUnavailable,
>>>>> + /// Closing the fd failed because the fd does not exist.
>>>>> + BadFd,
>>>>> +}
>>>>> +
>>>>> +impl From<DeferredFdCloseError> for Error {
>>>>> + fn from(err: DeferredFdCloseError) -> Error {
>>>>> + match err {
>>>>> + DeferredFdCloseError::TaskWorkUnavailable => ESRCH,
>>>>
>>>> This error reads "No such process", I am not sure if that is the best
>>>> way to express the problem in that situation. I took a quick look at the
>>>> other error codes, but could not find a better fit. Do you have any
>>>> better ideas? Or is this the error that C binder uses?
>>>
>>> This is the error code that task_work_add returns. (It can't happen in
>>> Binder.)
>>>
>>> And I do think that it is a reasonable choice, because the error only
>>> happens if you're calling the method from a context that has no
>>> userspace process associated with it.
>>
>> I see.
>>
>> What do you think of making the Rust error more descriptive? So instead
>> of implementing `Debug` like you currently do, you print
>>
>> $error ($variant)
>>
>> where $error = Error::from(*self) and $variant is the name of the
>> variant?
>>
>> This is more of a general suggestion, I don't think that this error type
>> in particular warrants this. But in general with Rust we do have the
>> option to have good error messages for every error while maintaining
>> efficient error values.
>
> I can #[derive(Debug)] instead, I guess?

Hmm I thought that might not be ideal, since then you would not have the
error code, only `TaskWorkUnavailable` or `BadFd`.
But if that is also acceptable, then I would go with the derived debug.

--
Cheers,
Benno

2023-12-12 17:01:53

by Benno Lossin

[permalink] [raw]
Subject: Re: [PATCH v2 7/7] rust: file: add abstraction for `poll_table`

On 12/12/23 10:59, Alice Ryhl wrote:
> On Fri, Dec 8, 2023 at 6:53 PM Benno Lossin <[email protected]> wrote:
>> On 12/6/23 12:59, Alice Ryhl wrote:
>>> + fn get_qproc(&self) -> bindings::poll_queue_proc {
>>> + let ptr = self.0.get();
>>> + // SAFETY: The `ptr` is valid because it originates from a reference, and the `_qproc`
>>> + // field is not modified concurrently with this call since we have an immutable reference.
>>
>> This needs an invariant on `PollTable` (i.e. `self.0` is valid).
>
> How would you phrase it?

- `self.0` contains a valid `bindings::poll_table`.
- `self.0` is only modified via references to `Self`.

>>> + unsafe { (*ptr)._qproc }
>>> + }
>>> +
>>> + /// Register this [`PollTable`] with the provided [`PollCondVar`], so that it can be notified
>>> + /// using the condition variable.
>>> + pub fn register_wait(&mut self, file: &File, cv: &PollCondVar) {
>>> + if let Some(qproc) = self.get_qproc() {
>>> + // SAFETY: The pointers to `self` and `file` are valid because they are references.
>>
>> What about cv.wait_list...
>
> I can add it to the list of things that are valid due to references.

Yes this is getting a bit tedious.

What if we create a newtype wrapping `Opaque<T>` with the invariant
that it contains a valid value? Then we could have a specially named
getter for which we would always assume that the returned pointer is
valid. And thus permit you to not mention it in the SAFETY comment?

[...]

>>> +#[pinned_drop]
>>> +impl PinnedDrop for PollCondVar {
>>> + fn drop(self: Pin<&mut Self>) {
>>> + // Clear anything registered using `register_wait`.
>>> + //
>>> + // SAFETY: The pointer points at a valid wait list.
>>
>> I was a bit confused by "wait list", since the C type is named
>> `wait_queue_head`, maybe just use the type name?
>
> I will update all instances of "wait list" to "wait_queue_head". It's
> because I incorrectly remembered the C type name to be "wait_list".

Maybe we should also change the name of the field on `CondVar`?

If you guys agree, I can open a good-first-issue, since it is a very
simple change.

--
Cheers,
Benno

2023-12-12 20:57:37

by Boqun Feng

[permalink] [raw]
Subject: Re: [PATCH v2 6/7] rust: file: add `DeferredFdCloser`

On Mon, Dec 11, 2023 at 05:25:18PM -0800, Boqun Feng wrote:
> On Mon, Dec 11, 2023 at 09:41:28AM -0800, Boqun Feng wrote:
> > On Mon, Dec 11, 2023 at 03:34:40PM +0000, Alice Ryhl wrote:
> > > Benno Lossin <[email protected]> writes:
> > > > On 12/6/23 12:59, Alice Ryhl wrote:
> > > > > + /// Schedule a task work that closes the file descriptor when this task returns to userspace.
> > > > > + ///
> > > > > + /// Fails if this is called from a context where we cannot run work when returning to
> > > > > + /// userspace. (E.g., from a kthread.)
> > > > > + pub fn close_fd(self, fd: u32) -> Result<(), DeferredFdCloseError> {
> > > > > + use bindings::task_work_notify_mode_TWA_RESUME as TWA_RESUME;
> > > > > +
> > > > > + // In this method, we schedule the task work before closing the file. This is because
> > > > > + // scheduling a task work is fallible, and we need to know whether it will fail before we
> > > > > + // attempt to close the file.
> > > > > +
> > > > > + // SAFETY: Getting a pointer to current is always safe.
> > > > > + let current = unsafe { bindings::get_current() };
> > > > > +
> > > > > + // SAFETY: Accessing the `flags` field of `current` is always safe.
> > > > > + let is_kthread = (unsafe { (*current).flags } & bindings::PF_KTHREAD) != 0;
> > > >
> > > > Since Boqun brought to my attention that we already have a wrapper for
> > > > `get_current()`, how about you use it here as well?
> > >
> > > I can use the wrapper, but it seems simpler to not go through a
> > > reference when we just need a raw pointer.
> > >
> > > Perhaps we should have a safe `Task::current_raw` function that just
> > > returns a raw pointer? It can still be safe.
> > >
> >
> > I think we can have a `as_ptr` function for `Task`?
> >
> > impl Task {
> > pub fn as_ptr(&self) -> *mut bindings::task_struct {
> > self.0.get()
> > }
> > }
>
> Forgot mention, yes a ptr->ref->ptr trip may not be ideal, but I think
> eventually we will have a task work wrapper, in that case maybe
> Task::as_ptr() is still needed somehow.
>

After some more thoughts, I agree `Task::current_raw` may be better for
the current usage, since we can also use it to wrap a
`current_is_kthread` method like:

impl Task {
pub fn current_is_kthread() -> bool {
let current = Self::current_raw();

unsafe { (*current).flags & bindings::PF_KTHREAD != 0 }
}
}

Regards,
Boqun

2023-12-13 01:35:45

by Boqun Feng

[permalink] [raw]
Subject: Re: [PATCH v2 7/7] rust: file: add abstraction for `poll_table`

On Tue, Dec 12, 2023 at 05:01:28PM +0000, Benno Lossin wrote:
> On 12/12/23 10:59, Alice Ryhl wrote:
> > On Fri, Dec 8, 2023 at 6:53 PM Benno Lossin <[email protected]> wrote:
> >> On 12/6/23 12:59, Alice Ryhl wrote:
> >>> + fn get_qproc(&self) -> bindings::poll_queue_proc {
> >>> + let ptr = self.0.get();
> >>> + // SAFETY: The `ptr` is valid because it originates from a reference, and the `_qproc`
> >>> + // field is not modified concurrently with this call since we have an immutable reference.
> >>
> >> This needs an invariant on `PollTable` (i.e. `self.0` is valid).
> >
> > How would you phrase it?
>
> - `self.0` contains a valid `bindings::poll_table`.
> - `self.0` is only modified via references to `Self`.
>
> >>> + unsafe { (*ptr)._qproc }
> >>> + }
> >>> +
> >>> + /// Register this [`PollTable`] with the provided [`PollCondVar`], so that it can be notified
> >>> + /// using the condition variable.
> >>> + pub fn register_wait(&mut self, file: &File, cv: &PollCondVar) {
> >>> + if let Some(qproc) = self.get_qproc() {
> >>> + // SAFETY: The pointers to `self` and `file` are valid because they are references.
> >>
> >> What about cv.wait_list...
> >
> > I can add it to the list of things that are valid due to references.
>

Actually, there is an implied safety requirement here, it's about how
qproc is implemented. As we can see, PollCondVar::drop() will wait for a
RCU grace period, that means the waiter (a file or something) has to use
RCU to access the cv.wait_list, otherwise, the synchronize_rcu() in
PollCondVar::drop() won't help.

To phrase it, it's more like:

(in the safety requirement of `PollTable::from_ptr` and the type
invariant of `PollTable`):

", further, if the qproc function in poll_table publishs the pointer of
the wait_queue_head, it must publish it in a way that reads on the
published pointer have to be in an RCU read-side critical section."

and here we can said,

"per type invariant, `qproc` cannot publish `cv.wait_list` without
proper RCU protection, so it's safe to use `cv.wait_list` here, and with
the synchronize_rcu() in PollCondVar::drop(), free of the wait_list will
be delayed until all usages are done."

I know, this is quite verbose, but just imagine some one removes the
rcu_read_lock() and rcu_read_unlock() in ep_remove_wait_queue(), the
poll table from epoll (using ep_ptable_queue_proc()) is still valid one
according to the current safety requirement, but now there is a
use-after-free in the following case:

CPU 0 CPU1
ep_remove_wait_queue():
struct wait_queue_head *whead;
whead = smp_load_acquire(...);
if (whead) { // not null
PollCondVar::drop():
__wake_pollfree();
synchronize_rcu(); // no current RCU readers, yay.
<free the wait_queue_head>
remove_wait_queue(whead, ...); // BOOM, use-after-free

Regards,
Boqun

2023-12-13 09:13:10

by Benno Lossin

[permalink] [raw]
Subject: Re: [PATCH v2 7/7] rust: file: add abstraction for `poll_table`

On 12/13/23 02:35, Boqun Feng wrote:
> On Tue, Dec 12, 2023 at 05:01:28PM +0000, Benno Lossin wrote:
>> On 12/12/23 10:59, Alice Ryhl wrote:
>>> On Fri, Dec 8, 2023 at 6:53 PM Benno Lossin <[email protected]> wrote:
>>>> On 12/6/23 12:59, Alice Ryhl wrote:
>>>>> + fn get_qproc(&self) -> bindings::poll_queue_proc {
>>>>> + let ptr = self.0.get();
>>>>> + // SAFETY: The `ptr` is valid because it originates from a reference, and the `_qproc`
>>>>> + // field is not modified concurrently with this call since we have an immutable reference.
>>>>
>>>> This needs an invariant on `PollTable` (i.e. `self.0` is valid).
>>>
>>> How would you phrase it?
>>
>> - `self.0` contains a valid `bindings::poll_table`.
>> - `self.0` is only modified via references to `Self`.
>>
>>>>> + unsafe { (*ptr)._qproc }
>>>>> + }
>>>>> +
>>>>> + /// Register this [`PollTable`] with the provided [`PollCondVar`], so that it can be notified
>>>>> + /// using the condition variable.
>>>>> + pub fn register_wait(&mut self, file: &File, cv: &PollCondVar) {
>>>>> + if let Some(qproc) = self.get_qproc() {
>>>>> + // SAFETY: The pointers to `self` and `file` are valid because they are references.
>>>>
>>>> What about cv.wait_list...
>>>
>>> I can add it to the list of things that are valid due to references.
>>
>
> Actually, there is an implied safety requirement here, it's about how
> qproc is implemented. As we can see, PollCondVar::drop() will wait for a
> RCU grace period, that means the waiter (a file or something) has to use
> RCU to access the cv.wait_list, otherwise, the synchronize_rcu() in
> PollCondVar::drop() won't help.

Good catch, this is rather important. I did not find the implementation
of `qproc`, since it is a function pointer. Since this pattern is
common, what is the way to find the implementation of those in general?

I imagine that the pattern is used to enable dynamic selection of the
concrete implementation, but there must be some general specification of
what the function does, is this documented somewhere?

> To phrase it, it's more like:
>
> (in the safety requirement of `PollTable::from_ptr` and the type
> invariant of `PollTable`):
>
> ", further, if the qproc function in poll_table publishs the pointer of
> the wait_queue_head, it must publish it in a way that reads on the
> published pointer have to be in an RCU read-side critical section."

What do you mean by `publish`?

> and here we can said,
>
> "per type invariant, `qproc` cannot publish `cv.wait_list` without
> proper RCU protection, so it's safe to use `cv.wait_list` here, and with
> the synchronize_rcu() in PollCondVar::drop(), free of the wait_list will
> be delayed until all usages are done."

I think I am missing how the call to `__wake_up_pollfree` ensures that
nobody uses the `PollCondVar` any longer. How is it removed from the
table?

--
Cheers,
Benno

> I know, this is quite verbose, but just imagine some one removes the
> rcu_read_lock() and rcu_read_unlock() in ep_remove_wait_queue(), the
> poll table from epoll (using ep_ptable_queue_proc()) is still valid one
> according to the current safety requirement, but now there is a
> use-after-free in the following case:
>
> CPU 0 CPU1
> ep_remove_wait_queue():
> struct wait_queue_head *whead;
> whead = smp_load_acquire(...);
> if (whead) { // not null
> PollCondVar::drop():
> __wake_pollfree();
> synchronize_rcu(); // no current RCU readers, yay.
> <free the wait_queue_head>
> remove_wait_queue(whead, ...); // BOOM, use-after-free

2023-12-13 10:09:32

by Alice Ryhl

[permalink] [raw]
Subject: Re: [PATCH v2 7/7] rust: file: add abstraction for `poll_table`

Benno Lossin <[email protected]> writes:
>> and here we can said,
>>
>> "per type invariant, `qproc` cannot publish `cv.wait_list` without
>> proper RCU protection, so it's safe to use `cv.wait_list` here, and with
>> the synchronize_rcu() in PollCondVar::drop(), free of the wait_list will
>> be delayed until all usages are done."
>
> I think I am missing how the call to `__wake_up_pollfree` ensures that
> nobody uses the `PollCondVar` any longer. How is it removed from the
> table?

The __wake_up_pollfree function clears the queue. Here is its
documentation:

/**
* wake_up_pollfree - signal that a polled waitqueue is going away
* @wq_head: the wait queue head
*
* In the very rare cases where a ->poll() implementation uses a waitqueue whose
* lifetime is tied to a task rather than to the 'struct file' being polled,
* this function must be called before the waitqueue is freed so that
* non-blocking polls (e.g. epoll) are notified that the queue is going away.
*
* The caller must also RCU-delay the freeing of the wait_queue_head, e.g. via
* an explicit synchronize_rcu() or call_rcu(), or via SLAB_TYPESAFE_BY_RCU.
*/

The only way for another thread to touch the queue after it has been
cleared is if they are concurrently removing themselves from the queue
under RCU. Because of that, we have to wait for an RCU grace period
after the call to __wake_up_pollfree to ensure that any such concurrent
users have gone away.

Alice

2023-12-13 11:03:03

by Alice Ryhl

[permalink] [raw]
Subject: Re: [PATCH v2 7/7] rust: file: add abstraction for `poll_table`

Benno Lossin <[email protected]> writes:
>>>> +#[pinned_drop]
>>>> +impl PinnedDrop for PollCondVar {
>>>> + fn drop(self: Pin<&mut Self>) {
>>>> + // Clear anything registered using `register_wait`.
>>>> + //
>>>> + // SAFETY: The pointer points at a valid wait list.
>>>
>>> I was a bit confused by "wait list", since the C type is named
>>> `wait_queue_head`, maybe just use the type name?
>>
>> I will update all instances of "wait list" to "wait_queue_head". It's
>> because I incorrectly remembered the C type name to be "wait_list".
>
> Maybe we should also change the name of the field on `CondVar`?
>
> If you guys agree, I can open a good-first-issue, since it is a very
> simple change.

I think that change is fine, but let's not add it to this patchset,
since it would need to be an eight patch. I'll let you open an issue for
it.

Alice

2023-12-13 11:04:53

by Alice Ryhl

[permalink] [raw]
Subject: Re: [PATCH v2 6/7] rust: file: add `DeferredFdCloser`

On Tue, Dec 12, 2023 at 9:57 PM Boqun Feng <[email protected]> wrote:
>
> On Mon, Dec 11, 2023 at 05:25:18PM -0800, Boqun Feng wrote:
> > On Mon, Dec 11, 2023 at 09:41:28AM -0800, Boqun Feng wrote:
> > > On Mon, Dec 11, 2023 at 03:34:40PM +0000, Alice Ryhl wrote:
> > > > Benno Lossin <[email protected]> writes:
> > > > > On 12/6/23 12:59, Alice Ryhl wrote:
> > > > > > + /// Schedule a task work that closes the file descriptor when this task returns to userspace.
> > > > > > + ///
> > > > > > + /// Fails if this is called from a context where we cannot run work when returning to
> > > > > > + /// userspace. (E.g., from a kthread.)
> > > > > > + pub fn close_fd(self, fd: u32) -> Result<(), DeferredFdCloseError> {
> > > > > > + use bindings::task_work_notify_mode_TWA_RESUME as TWA_RESUME;
> > > > > > +
> > > > > > + // In this method, we schedule the task work before closing the file. This is because
> > > > > > + // scheduling a task work is fallible, and we need to know whether it will fail before we
> > > > > > + // attempt to close the file.
> > > > > > +
> > > > > > + // SAFETY: Getting a pointer to current is always safe.
> > > > > > + let current = unsafe { bindings::get_current() };
> > > > > > +
> > > > > > + // SAFETY: Accessing the `flags` field of `current` is always safe.
> > > > > > + let is_kthread = (unsafe { (*current).flags } & bindings::PF_KTHREAD) != 0;
> > > > >
> > > > > Since Boqun brought to my attention that we already have a wrapper for
> > > > > `get_current()`, how about you use it here as well?
> > > >
> > > > I can use the wrapper, but it seems simpler to not go through a
> > > > reference when we just need a raw pointer.
> > > >
> > > > Perhaps we should have a safe `Task::current_raw` function that just
> > > > returns a raw pointer? It can still be safe.
> > > >
> > >
> > > I think we can have a `as_ptr` function for `Task`?
> > >
> > > impl Task {
> > > pub fn as_ptr(&self) -> *mut bindings::task_struct {
> > > self.0.get()
> > > }
> > > }
> >
> > Forgot mention, yes a ptr->ref->ptr trip may not be ideal, but I think
> > eventually we will have a task work wrapper, in that case maybe
> > Task::as_ptr() is still needed somehow.
> >
>
> After some more thoughts, I agree `Task::current_raw` may be better for
> the current usage, since we can also use it to wrap a
> `current_is_kthread` method like:
>
> impl Task {
> pub fn current_is_kthread() -> bool {
> let current = Self::current_raw();
>
> unsafe { (*current).flags & bindings::PF_KTHREAD != 0 }
> }
> }

I'll introduce a current_raw, then.

Alice

2023-12-13 17:06:16

by Boqun Feng

[permalink] [raw]
Subject: Re: [PATCH v2 7/7] rust: file: add abstraction for `poll_table`

On Wed, Dec 13, 2023 at 09:12:45AM +0000, Benno Lossin wrote:
[...]
> >
> > Actually, there is an implied safety requirement here, it's about how
> > qproc is implemented. As we can see, PollCondVar::drop() will wait for a
> > RCU grace period, that means the waiter (a file or something) has to use
> > RCU to access the cv.wait_list, otherwise, the synchronize_rcu() in
> > PollCondVar::drop() won't help.
>
> Good catch, this is rather important. I did not find the implementation
> of `qproc`, since it is a function pointer. Since this pattern is
> common, what is the way to find the implementation of those in general?
>

Actually I don't find any. Ping vfs ;-)

Personally, it took me a while to get a rough understanding of the API:
it's similar to `Future::poll` (or at least the registering waker part),
it basically should registers a waiter, so that when an event happens
later, the waiter gets notified. Also the waiter registration can have a
(optional?) cancel mechanism (like an async drop of Future ;-)), and
that's what gives us headache here: cancellation needs to remove the
waiter from the wait_queue_head, which means wait_queue_head must be
valid during the removal, and that means the kfree of wait_queue_head
must be delayed to a state where no one can access it in waiter removal.

> I imagine that the pattern is used to enable dynamic selection of the
> concrete implementation, but there must be some general specification of
> what the function does, is this documented somewhere?
>
> > To phrase it, it's more like:
> >
> > (in the safety requirement of `PollTable::from_ptr` and the type
> > invariant of `PollTable`):
> >
> > ", further, if the qproc function in poll_table publishs the pointer of
> > the wait_queue_head, it must publish it in a way that reads on the
> > published pointer have to be in an RCU read-side critical section."
>
> What do you mean by `publish`?
>

Publishing a pointer is like `Send`ing a `&T` (or put pointer in a
global variable), so that other threads can access it. Note that since
the cancel mechanism is optional (best to my knowledge), so a qproc call
may not pushlish the pointer.

Regards,
Boqun