This patchset contains the file abstractions needed by the Rust
implementation of the Binder driver.
Please see the Rust Binder RFC for usage examples:
https://lore.kernel.org/rust-for-linux/[email protected]/
Users of "rust: types: add `NotThreadSafe`":
[PATCH 5/9] rust: file: add `FileDescriptorReservation`
Users of "rust: task: add `Task::current_raw`":
[PATCH 7/9] rust: file: add `Kuid` wrapper
[PATCH 8/9] rust: file: add `DeferredFdCloser`
Users of "rust: file: add Rust abstraction for `struct file`":
[PATCH RFC 02/20] rust_binder: add binderfs support to Rust binder
[PATCH RFC 03/20] rust_binder: add threading support
Users of "rust: cred: add Rust abstraction for `struct cred`":
[PATCH RFC 05/20] rust_binder: add nodes and context managers
[PATCH RFC 06/20] rust_binder: add oneway transactions
[PATCH RFC 11/20] rust_binder: send nodes in transaction
[PATCH RFC 13/20] rust_binder: add BINDER_TYPE_FD support
Users of "rust: security: add abstraction for secctx":
[PATCH RFC 06/20] rust_binder: add oneway transactions
Users of "rust: file: add `FileDescriptorReservation`":
[PATCH RFC 13/20] rust_binder: add BINDER_TYPE_FD support
[PATCH RFC 14/20] rust_binder: add BINDER_TYPE_FDA support
Users of "rust: file: add `Kuid` wrapper":
[PATCH RFC 05/20] rust_binder: add nodes and context managers
[PATCH RFC 06/20] rust_binder: add oneway transactions
Users of "rust: file: add abstraction for `poll_table`":
[PATCH RFC 07/20] rust_binder: add epoll support
This patchset has some uses of read_volatile in place of READ_ONCE.
Please see the following rfc for context on this:
https://lore.kernel.org/all/[email protected]/
Signed-off-by: Alice Ryhl <[email protected]>
---
Changes in v6:
- Introduce file sharing modes.
- Rewrite most documentation for `struct file` wrapper.
- Drop `DeferredFdCloser`. It will be sent later when it can be placed
somewhere where only Rust Binder can use it.
- Rebase on top of rust-next: 97ab3e8eec0c ("rust: alloc: fix dangling pointer in VecExt<T>::reserve()")
- Link to v5: https://lore.kernel.org/r/[email protected]
Changes in v5:
- Pass a null pointer to task_tgid_nr_ns.
- Fix some typos and other formatting issues.
- Add Reviewed-by where appropriate.
- Link to v4: https://lore.kernel.org/r/[email protected]
Changes in v4:
- Moved the two really simple patches to the beginning of the patchset.
- Update Send safety comments.
- Use srctree relative links.
- Mention that `Credential::euid` is immutable.
- Update some safety comments to mention the invariant on Self.
- Use new name for close_fd_get_file.
- Move safety comments on DeferredFdCloser around and be more explicit
about how many refcounts we own.
- Reword safety comments related to _qproc.
- Add Reviewed-by where appropriate.
- Link to v3: https://lore.kernel.org/r/[email protected]
Changes in v3:
- Completely rewrite comments about refcounting in the first patch.
- And add a note to the documentation in fs/file.c.
- Discuss speculation gadgets in commit message for the Kuid wrapper.
- Introduce NotThreadSafe and Task::current_raw patches and use them in
later patches.
- Improve safety comments in DeferredFdCloser.
- Some other minor changes.
- Link to v2: https://lore.kernel.org/r/[email protected]
Changes in v2:
- Update various docs and safety comments.
- Rename method names to match the C name.
- Use ordinary read instead of READ_ONCE in File::cred.
- Changed null check in secctx.
- Add type alias for PhantomData in FileDescriptorReservation.
- Use Kuid::from_raw in Kuid::current_euid.
- Make DeferredFdCloser fallible if it is unable to schedule a task
work. And also schedule the task work *before* closing the file.
- Moved PollCondVar to rust/kernel/sync.
- Updated PollCondVar to use wake_up_pollfree.
- Link to v1: https://lore.kernel.org/all/[email protected]/
Link to RFC:
https://lore.kernel.org/all/[email protected]/
---
Alice Ryhl (5):
rust: types: add `NotThreadSafe`
rust: task: add `Task::current_raw`
rust: security: add abstraction for secctx
rust: file: add `Kuid` wrapper
rust: file: add abstraction for `poll_table`
Wedson Almeida Filho (3):
rust: file: add Rust abstraction for `struct file`
rust: cred: add Rust abstraction for `struct cred`
rust: file: add `FileDescriptorReservation`
fs/file.c | 7 +
rust/bindings/bindings_helper.h | 6 +
rust/helpers.c | 86 +++++++++
rust/kernel/cred.rs | 83 ++++++++
rust/kernel/file.rs | 416 ++++++++++++++++++++++++++++++++++++++++
rust/kernel/lib.rs | 3 +
rust/kernel/security.rs | 72 +++++++
rust/kernel/sync.rs | 1 +
rust/kernel/sync/lock.rs | 13 +-
rust/kernel/sync/poll.rs | 119 ++++++++++++
rust/kernel/task.rs | 91 ++++++++-
rust/kernel/types.rs | 26 +++
12 files changed, 911 insertions(+), 12 deletions(-)
---
base-commit: 97ab3e8eec0ce79d9e265e6c9e4c480492180409
change-id: 20231123-alice-file-525b98e8a724
Best regards,
--
Alice Ryhl <[email protected]>
From: Wedson Almeida Filho <[email protected]>
This abstraction makes it possible to manipulate the open files for a
process. The new `File` struct wraps the C `struct file`. When accessing
it using the smart pointer `ARef<File>`, the pointer will own a
reference count to the file. When accessing it as `&File`, then the
reference does not own a refcount, but the borrow checker will ensure
that the reference count does not hit zero while the `&File` is live.
Since this is intended to manipulate the open files of a process, we
introduce an `fget` constructor that corresponds to the C `fget`
method. In future patches, it will become possible to create a new fd in
a process and bind it to a `File`. Rust Binder will use these to send
fds from one process to another.
We also provide a method for accessing the file's flags. Rust Binder
will use this to access the flags of the Binder fd to check whether the
non-blocking flag is set, which affects what the Binder ioctl does.
This introduces a struct for the EBADF error type, rather than just
using the Error type directly. This has two advantages:
* `File::fget` returns a `Result<ARef<File>, BadFdError>`, which the
compiler will represent as a single pointer, with null being an error.
This is possible because the compiler understands that `BadFdError`
has only one possible value, and it also understands that the
`ARef<File>` smart pointer is guaranteed non-null.
* Additionally, we promise to users of the method that the method can
only fail with EBADF, which means that they can rely on this promise
without having to inspect its implementation.
That said, there are also two disadvantages:
* Defining additional error types involves boilerplate.
* The question mark operator will only utilize the `From` trait once,
which prevents you from using the question mark operator on
`BadFdError` in methods that return some third error type that the
kernel `Error` is convertible into. (However, it works fine in methods
that return `Error`.)
Signed-off-by: Wedson Almeida Filho <[email protected]>
Co-developed-by: Daniel Xu <[email protected]>
Signed-off-by: Daniel Xu <[email protected]>
Co-developed-by: Alice Ryhl <[email protected]>
Reviewed-by: Martin Rodriguez Reboredo <[email protected]>
Reviewed-by: Trevor Gross <[email protected]>
Reviewed-by: Benno Lossin <[email protected]>
Signed-off-by: Alice Ryhl <[email protected]>
---
fs/file.c | 7 +
rust/bindings/bindings_helper.h | 2 +
rust/helpers.c | 7 +
rust/kernel/file.rs | 330 ++++++++++++++++++++++++++++++++++++++++
rust/kernel/lib.rs | 1 +
rust/kernel/types.rs | 8 +
6 files changed, 355 insertions(+)
diff --git a/fs/file.c b/fs/file.c
index 3b683b9101d8..f2eab5fcb87f 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -1127,6 +1127,13 @@ EXPORT_SYMBOL(task_lookup_next_fdget_rcu);
*
* The fput_needed flag returned by fget_light should be passed to the
* corresponding fput_light.
+ *
+ * (As an exception to rule 2, you can call filp_close between fget_light and
+ * fput_light provided that you capture a real refcount with get_file before
+ * the call to filp_close, and ensure that this real refcount is fput *after*
+ * the fput_light call.)
+ *
+ * See also the documentation in rust/kernel/file.rs.
*/
static unsigned long __fget_light(unsigned int fd, fmode_t mask)
{
diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
index ddb5644d4fd9..541afef7ddc4 100644
--- a/rust/bindings/bindings_helper.h
+++ b/rust/bindings/bindings_helper.h
@@ -9,6 +9,8 @@
#include <kunit/test.h>
#include <linux/errname.h>
#include <linux/ethtool.h>
+#include <linux/file.h>
+#include <linux/fs.h>
#include <linux/jiffies.h>
#include <linux/mdio.h>
#include <linux/phy.h>
diff --git a/rust/helpers.c b/rust/helpers.c
index 4c8b7b92a4f4..5545a00560d1 100644
--- a/rust/helpers.c
+++ b/rust/helpers.c
@@ -25,6 +25,7 @@
#include <linux/build_bug.h>
#include <linux/err.h>
#include <linux/errname.h>
+#include <linux/fs.h>
#include <linux/mutex.h>
#include <linux/refcount.h>
#include <linux/sched/signal.h>
@@ -157,6 +158,12 @@ void rust_helper_init_work_with_key(struct work_struct *work, work_func_t func,
}
EXPORT_SYMBOL_GPL(rust_helper_init_work_with_key);
+struct file *rust_helper_get_file(struct file *f)
+{
+ return get_file(f);
+}
+EXPORT_SYMBOL_GPL(rust_helper_get_file);
+
/*
* `bindgen` binds the C `size_t` type as the Rust `usize` type, so we can
* use it in contexts where Rust expects a `usize` like slice (array) indices.
diff --git a/rust/kernel/file.rs b/rust/kernel/file.rs
new file mode 100644
index 000000000000..ad881e67084c
--- /dev/null
+++ b/rust/kernel/file.rs
@@ -0,0 +1,330 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! Files and file descriptors.
+//!
+//! C headers: [`include/linux/fs.h`](srctree/include/linux/fs.h) and
+//! [`include/linux/file.h`](srctree/include/linux/file.h)
+
+use crate::{
+ bindings,
+ error::{code::*, Error, Result},
+ types::{ARef, AlwaysRefCounted, Opaque},
+};
+use core::{marker::PhantomData, ptr};
+
+/// Flags associated with a [`File`].
+pub mod flags {
+ /// File is opened in append mode.
+ pub const O_APPEND: u32 = bindings::O_APPEND;
+
+ /// Signal-driven I/O is enabled.
+ pub const O_ASYNC: u32 = bindings::FASYNC;
+
+ /// Close-on-exec flag is set.
+ pub const O_CLOEXEC: u32 = bindings::O_CLOEXEC;
+
+ /// File was created if it didn't already exist.
+ pub const O_CREAT: u32 = bindings::O_CREAT;
+
+ /// Direct I/O is enabled for this file.
+ pub const O_DIRECT: u32 = bindings::O_DIRECT;
+
+ /// File must be a directory.
+ pub const O_DIRECTORY: u32 = bindings::O_DIRECTORY;
+
+ /// Like [`O_SYNC`] except metadata is not synced.
+ pub const O_DSYNC: u32 = bindings::O_DSYNC;
+
+ /// Ensure that this file is created with the `open(2)` call.
+ pub const O_EXCL: u32 = bindings::O_EXCL;
+
+ /// Large file size enabled (`off64_t` over `off_t`).
+ pub const O_LARGEFILE: u32 = bindings::O_LARGEFILE;
+
+ /// Do not update the file last access time.
+ pub const O_NOATIME: u32 = bindings::O_NOATIME;
+
+ /// File should not be used as process's controlling terminal.
+ pub const O_NOCTTY: u32 = bindings::O_NOCTTY;
+
+ /// If basename of path is a symbolic link, fail open.
+ pub const O_NOFOLLOW: u32 = bindings::O_NOFOLLOW;
+
+ /// File is using nonblocking I/O.
+ pub const O_NONBLOCK: u32 = bindings::O_NONBLOCK;
+
+ /// File is using nonblocking I/O.
+ ///
+ /// This is effectively the same flag as [`O_NONBLOCK`] on all architectures
+ /// except SPARC64.
+ pub const O_NDELAY: u32 = bindings::O_NDELAY;
+
+ /// Used to obtain a path file descriptor.
+ pub const O_PATH: u32 = bindings::O_PATH;
+
+ /// Write operations on this file will flush data and metadata.
+ pub const O_SYNC: u32 = bindings::O_SYNC;
+
+ /// This file is an unnamed temporary regular file.
+ pub const O_TMPFILE: u32 = bindings::O_TMPFILE;
+
+ /// File should be truncated to length 0.
+ pub const O_TRUNC: u32 = bindings::O_TRUNC;
+
+ /// Bitmask for access mode flags.
+ ///
+ /// # Examples
+ ///
+ /// ```
+ /// use kernel::file;
+ /// # fn do_something() {}
+ /// # let flags = 0;
+ /// if (flags & file::flags::O_ACCMODE) == file::flags::O_RDONLY {
+ /// do_something();
+ /// }
+ /// ```
+ pub const O_ACCMODE: u32 = bindings::O_ACCMODE;
+
+ /// File is read only.
+ pub const O_RDONLY: u32 = bindings::O_RDONLY;
+
+ /// File is write only.
+ pub const O_WRONLY: u32 = bindings::O_WRONLY;
+
+ /// File can be both read and written.
+ pub const O_RDWR: u32 = bindings::O_RDWR;
+}
+
+/// Compile-time information for keeping track of how a [`File`] is shared.
+///
+/// The `fdget_pos` method can be used to access the file's position without taking `f_pos_lock`,
+/// as long as the file is not shared with any other threads. During such calls to `fdget_pos`, the
+/// file must remain non-shared, so it must not be possible to move the file to another thread. For
+/// example, if the file is moved to another thread, then it could be passed to `fd_install`, at
+/// which point the remote process could touch the file position.
+///
+/// The share mode only keeps track of whether there are active `fdget_pos` calls that did not take
+/// the `f_pos_lock`, and does not keep track of `fdget` calls. This is okay because `fdget` does
+/// not care about the refcount of the underlying `struct file`; as long as the entry in the
+/// current thread's fd table does not get removed, it's okay to share the file. For example,
+/// `fd_install`ing the `struct file` into another process is okay during an `fdget` call, because
+/// the other process can't touch the fd table of the original process.
+mod share_mode {
+ /// Trait implemented by the two sharing modes that a file might have.
+ pub trait FileShareMode {}
+
+ /// Represents a file for which there might be an active call to `fdget_pos` that did not take
+ /// the `f_pos_lock` lock.
+ pub enum MaybeFdgetPos {}
+
+ /// Represents a file for which it is known that all active calls to `fdget_pos` (if any) took
+ /// the `f_pos_lock` lock.
+ pub enum NoFdgetPos {}
+
+ impl FileShareMode for MaybeFdgetPos {}
+ impl FileShareMode for NoFdgetPos {}
+}
+pub use self::share_mode::{FileShareMode, MaybeFdgetPos, NoFdgetPos};
+
+/// Wraps the kernel's `struct file`.
+///
+/// This represents an open file rather than a file on a filesystem. Processes generally reference
+/// open files using file descriptors. However, file descriptors are not the same as files. A file
+/// descriptor is just an integer that corresponds to a file, and a single file may be referenced
+/// by multiple file descriptors.
+///
+/// # Refcounting
+///
+/// Instances of this type are reference-counted. The reference count is incremented by the
+/// `fget`/`get_file` functions and decremented by `fput`. The Rust type `ARef<File>` represents a
+/// pointer that owns a reference count on the file.
+///
+/// Whenever a process opens a file descriptor (fd), it stores a pointer to the file in its fd
+/// table (`struct files_struct`). This pointer owns a reference count to the file, ensuring the
+/// file isn't prematurely deleted while the file descriptor is open. In Rust terminology, the
+/// pointers in `struct files_struct` are `ARef<File>` pointers.
+///
+/// ## Light refcounts
+///
+/// Whenever a process has an fd to a file, it may use something called a "light refcount" as a
+/// performance optimization. Light refcounts are acquired by calling `fdget` and released with
+/// `fdput`. The idea behind light refcounts is that if the fd is not closed between the calls to
+/// `fdget` and `fdput`, then the refcount cannot hit zero during that time, as the `struct
+/// files_struct` holds a reference until the fd is closed. This means that it's safe to access the
+/// file even if `fdget` does not increment the refcount.
+///
+/// The requirement that the fd is not closed during a light refcount applies globally across all
+/// threads - not just on the thread using the light refcount. For this reason, light refcounts are
+/// only used when the `struct files_struct` is not shared with other threads, since this ensures
+/// that other unrelated threads cannot suddenly start using the fd and close it. Therefore,
+/// calling `fdget` on a shared `struct files_struct` creates a normal refcount instead of a light
+/// refcount.
+///
+/// Light reference counts must be released with `fdput` before the system call returns to
+/// userspace. This means that if you wait until the current system call returns to userspace, then
+/// all light refcounts that existed at the time have gone away.
+///
+/// ### The file position
+///
+/// Each `struct file` has a position integer, which is protected by the `f_pos_lock` mutex.
+/// However, if the `struct file` is not shared, then the kernel may avoid taking the lock as a
+/// performance optimization.
+///
+/// The condition for avoiding the `f_pos_lock` mutex is different from the condition for using
+/// `fdget`. With `fdget`, you may avoid incrementing the refcount as long as the current fd table
+/// is not shared; it is okay if there are other fd tables that also reference the same `struct
+/// file`. However, `fdget_pos` can only avoid taking the `f_pos_lock` if the entire `struct file`
+/// is not shared, as different processes with an fd to the same `struct file` share the same
+/// position.
+///
+/// ## Rust references
+///
+/// The reference type `&File` is similar to light refcounts:
+///
+/// * `&File` references don't own a reference count. They can only exist as long as the reference
+/// count stays positive, and can only be created when there is some mechanism in place to ensure
+/// this.
+///
+/// * The Rust borrow-checker normally ensures this by enforcing that the `ARef<File>` from which
+/// a `&File` is created outlives the `&File`.
+///
+/// * Using the unsafe [`File::from_ptr`] means that it is up to the caller to ensure that the
+/// `&File` only exists while the reference count is positive.
+///
+/// * You can think of `fdget` as using an fd to look up an `ARef<File>` in the `struct
+/// files_struct` and create an `&File` from it. The "fd cannot be closed" rule is like the Rust
+/// rule "the `ARef<File>` must outlive the `&File`".
+///
+/// # Invariants
+///
+/// * All instances of this type are refcounted using the `f_count` field.
+/// * If the file sharing mode is `MaybeFdgetPos`, then all active calls to `fdget_pos` that did
+/// not take the `f_pos_lock` mutex must be on the same thread as this `File`.
+/// * If the file sharing mode is `NoFdgetPos`, then there must not be active calls to `fdget_pos`
+/// that did not take the `f_pos_lock` mutex.
+#[repr(transparent)]
+pub struct File<S: FileShareMode> {
+ inner: Opaque<bindings::file>,
+ _share_mode: PhantomData<S>,
+}
+
+// SAFETY: This file is known to not have any local active `fdget_pos` calls, so it is safe to
+// transfer it between threads.
+unsafe impl Send for File<NoFdgetPos> {}
+
+// SAFETY: This file is known to not have any local active `fdget_pos` calls, so it is safe to
+// access its methods from several threads in parallel.
+unsafe impl Sync for File<NoFdgetPos> {}
+
+/// File methods that only exist under the [`MaybeFdgetPos`] sharing mode.
+impl File<MaybeFdgetPos> {
+ /// Constructs a new `struct file` wrapper from a file descriptor.
+ ///
+ /// The file descriptor belongs to the current process, and there might be active local calls
+ /// to `fdget_pos`.
+ pub fn fget(fd: u32) -> Result<ARef<File<MaybeFdgetPos>>, BadFdError> {
+ // SAFETY: FFI call, there are no requirements on `fd`.
+ let ptr = ptr::NonNull::new(unsafe { bindings::fget(fd) }).ok_or(BadFdError)?;
+
+ // SAFETY: `bindings::fget` created a refcount, and we pass ownership of it to the `ARef`.
+ //
+ // INVARIANT: This file is in the fd table on this thread, so either all `fdget_pos` calls
+ // are on this thread, or the file is shared, in which case `fdget_pos` calls took the
+ // `f_pos_lock` mutex.
+ Ok(unsafe { ARef::from_raw(ptr.cast()) })
+ }
+
+ /// Assume that there are no active `fdget_pos` calls that prevent us from sharing this file.
+ ///
+ /// This makes it safe to transfer this file to other threads. No checks are performed, and
+ /// using it incorrectly may lead to a data race on the file position if the file is shared
+ /// with another thread.
+ ///
+ /// This method is intended to be used together with [`File::fget`] when the caller knows
+ /// statically that there are no `fdget_pos` calls on the current thread. For example, you
+ /// might use it when calling `fget` from an ioctl, since ioctls usually do not touch the file
+ /// position.
+ ///
+ /// # Safety
+ ///
+ /// There must not be any active `fdget_pos` calls on the current thread.
+ pub unsafe fn assume_no_fdget_pos(me: ARef<Self>) -> ARef<File<NoFdgetPos>> {
+ // INVARIANT: There are no `fdget_pos` calls on the current thread, and by the type
+ // invariants, if there is a `fdget_pos` call on another thread, then it took the
+ // `f_pos_lock` mutex.
+ //
+ // SAFETY: `File<MaybeFdgetPos>` and `File<NoFdgetPos>` have the same layout.
+ unsafe { ARef::from_raw(ARef::into_raw(me).cast()) }
+ }
+}
+
+/// File methods that exist under all sharing modes.
+impl<S: FileShareMode> File<S> {
+ /// Creates a reference to a [`File`] from a valid pointer.
+ ///
+ /// # Safety
+ ///
+ /// * The caller must ensure that `ptr` points at a valid file and that the file's refcount is
+ /// positive for the duration of 'a.
+ /// * The caller must ensure that the requirements for using the chosen file sharing mode are
+ /// upheld.
+ pub unsafe fn from_ptr<'a>(ptr: *const bindings::file) -> &'a File<S> {
+ // SAFETY: The caller guarantees that the pointer is not dangling and stays valid for the
+ // duration of 'a. The cast is okay because `File` is `repr(transparent)`.
+ //
+ // INVARIANT: The safety requirements guarantee that the refcount does not hit zero during
+ // 'a. The caller guarantees to uphold the requirements for the chosen sharing mode.
+ unsafe { &*ptr.cast() }
+ }
+
+ /// Returns a raw pointer to the inner C struct.
+ #[inline]
+ pub fn as_ptr(&self) -> *mut bindings::file {
+ self.inner.get()
+ }
+
+ /// Returns the flags associated with the file.
+ ///
+ /// The flags are a combination of the constants in [`flags`].
+ pub fn flags(&self) -> u32 {
+ // This `read_volatile` is intended to correspond to a READ_ONCE call.
+ //
+ // SAFETY: The file is valid because the shared reference guarantees a nonzero refcount.
+ //
+ // FIXME(read_once): Replace with `read_once` when available on the Rust side.
+ unsafe { core::ptr::addr_of!((*self.as_ptr()).f_flags).read_volatile() }
+ }
+}
+
+// SAFETY: The type invariants guarantee that `File` is always ref-counted. This implementation
+// makes `ARef<File>` own a normal refcount.
+unsafe impl<S: FileShareMode> AlwaysRefCounted for File<S> {
+ fn inc_ref(&self) {
+ // SAFETY: The existence of a shared reference means that the refcount is nonzero.
+ unsafe { bindings::get_file(self.as_ptr()) };
+ }
+
+ unsafe fn dec_ref(obj: ptr::NonNull<File<S>>) {
+ // SAFETY: To call this method, the caller passes us ownership of a normal refcount, so we
+ // may drop it. The cast is okay since `File` has the same representation as `struct file`.
+ unsafe { bindings::fput(obj.cast().as_ptr()) }
+ }
+}
+
+/// Represents the `EBADF` error code.
+///
+/// Used for methods that can only fail with `EBADF`.
+#[derive(Copy, Clone, Eq, PartialEq)]
+pub struct BadFdError;
+
+impl From<BadFdError> for Error {
+ fn from(_: BadFdError) -> Error {
+ EBADF
+ }
+}
+
+impl core::fmt::Debug for BadFdError {
+ fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result {
+ f.pad("EBADF")
+ }
+}
diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs
index 9a943d99c71a..c583fd27736d 100644
--- a/rust/kernel/lib.rs
+++ b/rust/kernel/lib.rs
@@ -29,6 +29,7 @@
pub mod alloc;
mod build_assert;
pub mod error;
+pub mod file;
pub mod init;
pub mod ioctl;
#[cfg(CONFIG_KUNIT)]
diff --git a/rust/kernel/types.rs b/rust/kernel/types.rs
index 93734677cfe7..3ec2b12afbee 100644
--- a/rust/kernel/types.rs
+++ b/rust/kernel/types.rs
@@ -366,6 +366,14 @@ pub unsafe fn from_raw(ptr: NonNull<T>) -> Self {
_p: PhantomData,
}
}
+
+ /// Convert this [`ARef`] into a raw pointer.
+ ///
+ /// The caller retains ownership of the refcount that this `ARef` used to own.
+ pub fn into_raw(me: Self) -> NonNull<T> {
+ let me = core::mem::ManuallyDrop::new(me);
+ me.ptr
+ }
}
impl<T: AlwaysRefCounted> Clone for ARef<T> {
--
2.45.0.rc1.225.g2a3ae87e7f-goog
From: Wedson Almeida Filho <[email protected]>
Add a wrapper around `struct cred` called `Credential`, and provide
functionality to get the `Credential` associated with a `File`.
Rust Binder must check the credentials of processes when they attempt to
perform various operations, and these checks usually take a
`&Credential` as parameter. The security_binder_set_context_mgr function
would be one example. This patch is necessary to access these security_*
methods from Rust.
Signed-off-by: Wedson Almeida Filho <[email protected]>
Co-developed-by: Alice Ryhl <[email protected]>
Reviewed-by: Trevor Gross <[email protected]>
Reviewed-by: Benno Lossin <[email protected]>
Reviewed-by: Martin Rodriguez Reboredo <[email protected]>
Signed-off-by: Alice Ryhl <[email protected]>
---
rust/bindings/bindings_helper.h | 1 +
rust/helpers.c | 13 ++++++++
rust/kernel/cred.rs | 74 +++++++++++++++++++++++++++++++++++++++++
rust/kernel/file.rs | 13 ++++++++
rust/kernel/lib.rs | 1 +
5 files changed, 102 insertions(+)
diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
index 541afef7ddc4..94091cb337e9 100644
--- a/rust/bindings/bindings_helper.h
+++ b/rust/bindings/bindings_helper.h
@@ -7,6 +7,7 @@
*/
#include <kunit/test.h>
+#include <linux/cred.h>
#include <linux/errname.h>
#include <linux/ethtool.h>
#include <linux/file.h>
diff --git a/rust/helpers.c b/rust/helpers.c
index 5545a00560d1..56415bce8af0 100644
--- a/rust/helpers.c
+++ b/rust/helpers.c
@@ -23,6 +23,7 @@
#include <kunit/test-bug.h>
#include <linux/bug.h>
#include <linux/build_bug.h>
+#include <linux/cred.h>
#include <linux/err.h>
#include <linux/errname.h>
#include <linux/fs.h>
@@ -164,6 +165,18 @@ struct file *rust_helper_get_file(struct file *f)
}
EXPORT_SYMBOL_GPL(rust_helper_get_file);
+const struct cred *rust_helper_get_cred(const struct cred *cred)
+{
+ return get_cred(cred);
+}
+EXPORT_SYMBOL_GPL(rust_helper_get_cred);
+
+void rust_helper_put_cred(const struct cred *cred)
+{
+ put_cred(cred);
+}
+EXPORT_SYMBOL_GPL(rust_helper_put_cred);
+
/*
* `bindgen` binds the C `size_t` type as the Rust `usize` type, so we can
* use it in contexts where Rust expects a `usize` like slice (array) indices.
diff --git a/rust/kernel/cred.rs b/rust/kernel/cred.rs
new file mode 100644
index 000000000000..360d6fdbe5e7
--- /dev/null
+++ b/rust/kernel/cred.rs
@@ -0,0 +1,74 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! Credentials management.
+//!
+//! C header: [`include/linux/cred.h`](srctree/include/linux/cred.h).
+//!
+//! Reference: <https://www.kernel.org/doc/html/latest/security/credentials.html>
+
+use crate::{
+ bindings,
+ types::{AlwaysRefCounted, Opaque},
+};
+
+/// Wraps the kernel's `struct cred`.
+///
+/// Credentials are used for various security checks in the kernel.
+///
+/// Most fields of credentials are immutable. When things have their credentials changed, that
+/// happens by replacing the credential instead of changing an existing credential. See the [kernel
+/// documentation][ref] for more info on this.
+///
+/// # Invariants
+///
+/// Instances of this type are always ref-counted, that is, a call to `get_cred` ensures that the
+/// allocation remains valid at least until the matching call to `put_cred`.
+///
+/// [ref]: https://www.kernel.org/doc/html/latest/security/credentials.html
+#[repr(transparent)]
+pub struct Credential(Opaque<bindings::cred>);
+
+// SAFETY:
+// - `Credential::dec_ref` can be called from any thread.
+// - It is okay to send ownership of `Credential` across thread boundaries.
+unsafe impl Send for Credential {}
+
+// SAFETY: It's OK to access `Credential` through shared references from other threads because
+// we're either accessing properties that don't change or that are properly synchronised by C code.
+unsafe impl Sync for Credential {}
+
+impl Credential {
+ /// Creates a reference to a [`Credential`] from a valid pointer.
+ ///
+ /// # Safety
+ ///
+ /// The caller must ensure that `ptr` is valid and remains valid for the lifetime of the
+ /// returned [`Credential`] reference.
+ pub unsafe fn from_ptr<'a>(ptr: *const bindings::cred) -> &'a Credential {
+ // SAFETY: The safety requirements guarantee the validity of the dereference, while the
+ // `Credential` type being transparent makes the cast ok.
+ unsafe { &*ptr.cast() }
+ }
+
+ /// Returns the effective UID of the given credential.
+ pub fn euid(&self) -> bindings::kuid_t {
+ // SAFETY: By the type invariant, we know that `self.0` is valid. Furthermore, the `euid`
+ // field of a credential is never changed after initialization, so there is no potential
+ // for data races.
+ unsafe { (*self.0.get()).euid }
+ }
+}
+
+// SAFETY: The type invariants guarantee that `Credential` is always ref-counted.
+unsafe impl AlwaysRefCounted for Credential {
+ fn inc_ref(&self) {
+ // SAFETY: The existence of a shared reference means that the refcount is nonzero.
+ unsafe { bindings::get_cred(self.0.get()) };
+ }
+
+ unsafe fn dec_ref(obj: core::ptr::NonNull<Credential>) {
+ // SAFETY: The safety requirements guarantee that the refcount is nonzero. The cast is okay
+ // because `Credential` has the same representation as `struct cred`.
+ unsafe { bindings::put_cred(obj.cast().as_ptr()) };
+ }
+}
diff --git a/rust/kernel/file.rs b/rust/kernel/file.rs
index ad881e67084c..755816eb38ef 100644
--- a/rust/kernel/file.rs
+++ b/rust/kernel/file.rs
@@ -7,6 +7,7 @@
use crate::{
bindings,
+ cred::Credential,
error::{code::*, Error, Result},
types::{ARef, AlwaysRefCounted, Opaque},
};
@@ -283,6 +284,18 @@ pub fn as_ptr(&self) -> *mut bindings::file {
self.inner.get()
}
+ /// Returns the credentials of the task that originally opened the file.
+ pub fn cred(&self) -> &Credential {
+ // SAFETY: It's okay to read the `f_cred` field without synchronization because `f_cred` is
+ // never changed after initialization of the file.
+ let ptr = unsafe { (*self.as_ptr()).f_cred };
+
+ // SAFETY: The signature of this function ensures that the caller will only access the
+ // returned credential while the file is still valid, and the C side ensures that the
+ // credential stays valid at least as long as the file.
+ unsafe { Credential::from_ptr(ptr) }
+ }
+
/// Returns the flags associated with the file.
///
/// The flags are a combination of the constants in [`flags`].
diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs
index c583fd27736d..cc629a74137f 100644
--- a/rust/kernel/lib.rs
+++ b/rust/kernel/lib.rs
@@ -28,6 +28,7 @@
pub mod alloc;
mod build_assert;
+pub mod cred;
pub mod error;
pub mod file;
pub mod init;
--
2.45.0.rc1.225.g2a3ae87e7f-goog
Add an abstraction for viewing the string representation of a security
context.
This is needed by Rust Binder because it has a feature where a process
can view the string representation of the security context for incoming
transactions. The process can use that to authenticate incoming
transactions, and since the feature is provided by the kernel, the
process can trust that the security context is legitimate.
Reviewed-by: Benno Lossin <[email protected]>
Reviewed-by: Martin Rodriguez Reboredo <[email protected]>
Reviewed-by: Trevor Gross <[email protected]>
Signed-off-by: Alice Ryhl <[email protected]>
---
rust/bindings/bindings_helper.h | 1 +
rust/helpers.c | 21 ++++++++++++
rust/kernel/cred.rs | 8 +++++
rust/kernel/lib.rs | 1 +
rust/kernel/security.rs | 72 +++++++++++++++++++++++++++++++++++++++++
5 files changed, 103 insertions(+)
diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
index 94091cb337e9..cd2aaaaf9214 100644
--- a/rust/bindings/bindings_helper.h
+++ b/rust/bindings/bindings_helper.h
@@ -17,6 +17,7 @@
#include <linux/phy.h>
#include <linux/refcount.h>
#include <linux/sched.h>
+#include <linux/security.h>
#include <linux/slab.h>
#include <linux/wait.h>
#include <linux/workqueue.h>
diff --git a/rust/helpers.c b/rust/helpers.c
index 56415bce8af0..766e368bd0d8 100644
--- a/rust/helpers.c
+++ b/rust/helpers.c
@@ -30,6 +30,7 @@
#include <linux/mutex.h>
#include <linux/refcount.h>
#include <linux/sched/signal.h>
+#include <linux/security.h>
#include <linux/spinlock.h>
#include <linux/wait.h>
#include <linux/workqueue.h>
@@ -177,6 +178,26 @@ void rust_helper_put_cred(const struct cred *cred)
}
EXPORT_SYMBOL_GPL(rust_helper_put_cred);
+#ifndef CONFIG_SECURITY
+void rust_helper_security_cred_getsecid(const struct cred *c, u32 *secid)
+{
+ security_cred_getsecid(c, secid);
+}
+EXPORT_SYMBOL_GPL(rust_helper_security_cred_getsecid);
+
+int rust_helper_security_secid_to_secctx(u32 secid, char **secdata, u32 *seclen)
+{
+ return security_secid_to_secctx(secid, secdata, seclen);
+}
+EXPORT_SYMBOL_GPL(rust_helper_security_secid_to_secctx);
+
+void rust_helper_security_release_secctx(char *secdata, u32 seclen)
+{
+ security_release_secctx(secdata, seclen);
+}
+EXPORT_SYMBOL_GPL(rust_helper_security_release_secctx);
+#endif
+
/*
* `bindgen` binds the C `size_t` type as the Rust `usize` type, so we can
* use it in contexts where Rust expects a `usize` like slice (array) indices.
diff --git a/rust/kernel/cred.rs b/rust/kernel/cred.rs
index 360d6fdbe5e7..fdd899040098 100644
--- a/rust/kernel/cred.rs
+++ b/rust/kernel/cred.rs
@@ -50,6 +50,14 @@ pub unsafe fn from_ptr<'a>(ptr: *const bindings::cred) -> &'a Credential {
unsafe { &*ptr.cast() }
}
+ /// Get the id for this security context.
+ pub fn get_secid(&self) -> u32 {
+ let mut secid = 0;
+ // SAFETY: The invariants of this type ensures that the pointer is valid.
+ unsafe { bindings::security_cred_getsecid(self.0.get(), &mut secid) };
+ secid
+ }
+
/// Returns the effective UID of the given credential.
pub fn euid(&self) -> bindings::kuid_t {
// SAFETY: By the type invariant, we know that `self.0` is valid. Furthermore, the `euid`
diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs
index cc629a74137f..ade5889c76b4 100644
--- a/rust/kernel/lib.rs
+++ b/rust/kernel/lib.rs
@@ -39,6 +39,7 @@
pub mod net;
pub mod prelude;
pub mod print;
+pub mod security;
mod static_assert;
#[doc(hidden)]
pub mod std_vendor;
diff --git a/rust/kernel/security.rs b/rust/kernel/security.rs
new file mode 100644
index 000000000000..ee2ef0385bae
--- /dev/null
+++ b/rust/kernel/security.rs
@@ -0,0 +1,72 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! Linux Security Modules (LSM).
+//!
+//! C header: [`include/linux/security.h`](srctree/include/linux/security.h).
+
+use crate::{
+ bindings,
+ error::{to_result, Result},
+};
+
+/// A security context string.
+///
+/// # Invariants
+///
+/// The `secdata` and `seclen` fields correspond to a valid security context as returned by a
+/// successful call to `security_secid_to_secctx`, that has not yet been destroyed by calling
+/// `security_release_secctx`.
+pub struct SecurityCtx {
+ secdata: *mut core::ffi::c_char,
+ seclen: usize,
+}
+
+impl SecurityCtx {
+ /// Get the security context given its id.
+ pub fn from_secid(secid: u32) -> Result<Self> {
+ let mut secdata = core::ptr::null_mut();
+ let mut seclen = 0u32;
+ // SAFETY: Just a C FFI call. The pointers are valid for writes.
+ to_result(unsafe { bindings::security_secid_to_secctx(secid, &mut secdata, &mut seclen) })?;
+
+ // INVARIANT: If the above call did not fail, then we have a valid security context.
+ Ok(Self {
+ secdata,
+ seclen: seclen as usize,
+ })
+ }
+
+ /// Returns whether the security context is empty.
+ pub fn is_empty(&self) -> bool {
+ self.seclen == 0
+ }
+
+ /// Returns the length of this security context.
+ pub fn len(&self) -> usize {
+ self.seclen
+ }
+
+ /// Returns the bytes for this security context.
+ pub fn as_bytes(&self) -> &[u8] {
+ let ptr = self.secdata;
+ if ptr.is_null() {
+ debug_assert_eq!(self.seclen, 0);
+ // We can't pass a null pointer to `slice::from_raw_parts` even if the length is zero.
+ return &[];
+ }
+
+ // SAFETY: The call to `security_secid_to_secctx` guarantees that the pointer is valid for
+ // `seclen` bytes. Furthermore, if the length is zero, then we have ensured that the
+ // pointer is not null.
+ unsafe { core::slice::from_raw_parts(ptr.cast(), self.seclen) }
+ }
+}
+
+impl Drop for SecurityCtx {
+ fn drop(&mut self) {
+ // SAFETY: By the invariant of `Self`, this frees a pointer that came from a successful
+ // call to `security_secid_to_secctx` and has not yet been destroyed by
+ // `security_release_secctx`.
+ unsafe { bindings::security_release_secctx(self.secdata, self.seclen as u32) };
+ }
+}
--
2.45.0.rc1.225.g2a3ae87e7f-goog
From: Wedson Almeida Filho <[email protected]>
Allow for the creation of a file descriptor in two steps: first, we
reserve a slot for it, then we commit or drop the reservation. The first
step may fail (e.g., the current process ran out of available slots),
but commit and drop never fail (and are mutually exclusive).
This is needed by Rust Binder when fds are sent from one process to
another. It has to be a two-step process to properly handle the case
where multiple fds are sent: The operation must fail or succeed
atomically, which we achieve by first reserving the fds we need, and
only installing the files once we have reserved enough fds to send the
files.
Fd reservations assume that the value of `current` does not change
between the call to get_unused_fd_flags and the call to fd_install (or
put_unused_fd). By not implementing the Send trait, this abstraction
ensures that the `FileDescriptorReservation` cannot be moved into a
different process.
Signed-off-by: Wedson Almeida Filho <[email protected]>
Co-developed-by: Alice Ryhl <[email protected]>
Reviewed-by: Benno Lossin <[email protected]>
Reviewed-by: Martin Rodriguez Reboredo <[email protected]>
Reviewed-by: Trevor Gross <[email protected]>
Signed-off-by: Alice Ryhl <[email protected]>
---
rust/kernel/file.rs | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 74 insertions(+), 1 deletion(-)
diff --git a/rust/kernel/file.rs b/rust/kernel/file.rs
index 755816eb38ef..d3d70df993c9 100644
--- a/rust/kernel/file.rs
+++ b/rust/kernel/file.rs
@@ -9,7 +9,7 @@
bindings,
cred::Credential,
error::{code::*, Error, Result},
- types::{ARef, AlwaysRefCounted, Opaque},
+ types::{ARef, AlwaysRefCounted, NotThreadSafe, Opaque},
};
use core::{marker::PhantomData, ptr};
@@ -324,6 +324,79 @@ unsafe fn dec_ref(obj: ptr::NonNull<File<S>>) {
}
}
+/// A file descriptor reservation.
+///
+/// This allows the creation of a file descriptor in two steps: first, we reserve a slot for it,
+/// then we commit or drop the reservation. The first step may fail (e.g., the current process ran
+/// out of available slots), but commit and drop never fail (and are mutually exclusive).
+///
+/// Dropping the reservation happens in the destructor of this type.
+///
+/// # Invariants
+///
+/// The fd stored in this struct must correspond to a reserved file descriptor of the current task.
+pub struct FileDescriptorReservation {
+ fd: u32,
+ /// Prevent values of this type from being moved to a different task.
+ ///
+ /// The `fd_install` and `put_unused_fd` functions assume that the value of `current` is
+ /// unchanged since the call to `get_unused_fd_flags`. By adding this marker to this type, we
+ /// prevent it from being moved across task boundaries, which ensures that `current` does not
+ /// change while this value exists.
+ _not_send: NotThreadSafe,
+}
+
+impl FileDescriptorReservation {
+ /// Creates a new file descriptor reservation.
+ pub fn get_unused_fd_flags(flags: u32) -> Result<Self> {
+ // SAFETY: FFI call, there are no safety requirements on `flags`.
+ let fd: i32 = unsafe { bindings::get_unused_fd_flags(flags) };
+ if fd < 0 {
+ return Err(Error::from_errno(fd));
+ }
+ Ok(Self {
+ fd: fd as u32,
+ _not_send: NotThreadSafe,
+ })
+ }
+
+ /// Returns the file descriptor number that was reserved.
+ pub fn reserved_fd(&self) -> u32 {
+ self.fd
+ }
+
+ /// Commits the reservation.
+ ///
+ /// The previously reserved file descriptor is bound to `file`. This method consumes the
+ /// [`FileDescriptorReservation`], so it will not be usable after this call.
+ pub fn fd_install(self, file: ARef<File<NoFdgetPos>>) {
+ // SAFETY: `self.fd` was previously returned by `get_unused_fd_flags`. We have not yet used
+ // the fd, so it is still valid, and `current` still refers to the same task, as this type
+ // cannot be moved across task boundaries.
+ //
+ // Furthermore, the file pointer is guaranteed to own a refcount by its type invariants,
+ // and we take ownership of that refcount by not running the destructor below.
+ // Additionally, the file is known to not have any non-shared `fdget_pos` calls, so even if
+ // this process starts using the file position, this will not result in a data race on the
+ // file position.
+ unsafe { bindings::fd_install(self.fd, file.as_ptr()) };
+
+ // `fd_install` consumes both the file descriptor and the file reference, so we cannot run
+ // the destructors.
+ core::mem::forget(self);
+ core::mem::forget(file);
+ }
+}
+
+impl Drop for FileDescriptorReservation {
+ fn drop(&mut self) {
+ // SAFETY: By the type invariants of this type, `self.fd` was previously returned by
+ // `get_unused_fd_flags`. We have not yet used the fd, so it is still valid, and `current`
+ // still refers to the same task, as this type cannot be moved across task boundaries.
+ unsafe { bindings::put_unused_fd(self.fd) };
+ }
+}
+
/// Represents the `EBADF` error code.
///
/// Used for methods that can only fail with `EBADF`.
--
2.45.0.rc1.225.g2a3ae87e7f-goog
Adds a wrapper around `kuid_t` called `Kuid`. This allows us to define
various operations on kuids such as equality and current_euid. It also
lets us provide conversions from kuid into userspace values.
Rust Binder needs these operations because it needs to compare kuids for
equality, and it needs to tell userspace about the pid and uid of
incoming transactions.
To read kuids from a `struct task_struct`, you must currently use
various #defines that perform the appropriate field access under an RCU
read lock. Currently, we do not have a Rust wrapper for rcu_read_lock,
which means that for this patch, there are two ways forward:
1. Inline the methods into Rust code, and use __rcu_read_lock directly
rather than the rcu_read_lock wrapper. This gives up lockdep for
these usages of RCU.
2. Wrap the various #defines in helpers and call the helpers from Rust.
This patch uses the second option. One possible disadvantage of the
second option is the possible introduction of speculation gadgets, but
as discussed in [1], the risk appears to be acceptable.
Of course, once a wrapper for rcu_read_lock is available, it is
preferable to use that over either of the two above approaches.
Link: https://lore.kernel.org/all/202312080947.674CD2DC7@keescook/ [1]
Reviewed-by: Benno Lossin <[email protected]>
Reviewed-by: Martin Rodriguez Reboredo <[email protected]>
Reviewed-by: Trevor Gross <[email protected]>
Signed-off-by: Alice Ryhl <[email protected]>
---
rust/bindings/bindings_helper.h | 1 +
rust/helpers.c | 45 ++++++++++++++++++++++++++++
rust/kernel/cred.rs | 5 ++--
rust/kernel/task.rs | 66 +++++++++++++++++++++++++++++++++++++++++
4 files changed, 115 insertions(+), 2 deletions(-)
diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
index cd2aaaaf9214..2a758930fc74 100644
--- a/rust/bindings/bindings_helper.h
+++ b/rust/bindings/bindings_helper.h
@@ -15,6 +15,7 @@
#include <linux/jiffies.h>
#include <linux/mdio.h>
#include <linux/phy.h>
+#include <linux/pid_namespace.h>
#include <linux/refcount.h>
#include <linux/sched.h>
#include <linux/security.h>
diff --git a/rust/helpers.c b/rust/helpers.c
index 766e368bd0d8..81ac2c994c71 100644
--- a/rust/helpers.c
+++ b/rust/helpers.c
@@ -142,6 +142,51 @@ void rust_helper_put_task_struct(struct task_struct *t)
}
EXPORT_SYMBOL_GPL(rust_helper_put_task_struct);
+kuid_t rust_helper_task_uid(struct task_struct *task)
+{
+ return task_uid(task);
+}
+EXPORT_SYMBOL_GPL(rust_helper_task_uid);
+
+kuid_t rust_helper_task_euid(struct task_struct *task)
+{
+ return task_euid(task);
+}
+EXPORT_SYMBOL_GPL(rust_helper_task_euid);
+
+#ifndef CONFIG_USER_NS
+uid_t rust_helper_from_kuid(struct user_namespace *to, kuid_t uid)
+{
+ return from_kuid(to, uid);
+}
+EXPORT_SYMBOL_GPL(rust_helper_from_kuid);
+#endif /* CONFIG_USER_NS */
+
+bool rust_helper_uid_eq(kuid_t left, kuid_t right)
+{
+ return uid_eq(left, right);
+}
+EXPORT_SYMBOL_GPL(rust_helper_uid_eq);
+
+kuid_t rust_helper_current_euid(void)
+{
+ return current_euid();
+}
+EXPORT_SYMBOL_GPL(rust_helper_current_euid);
+
+struct user_namespace *rust_helper_current_user_ns(void)
+{
+ return current_user_ns();
+}
+EXPORT_SYMBOL_GPL(rust_helper_current_user_ns);
+
+pid_t rust_helper_task_tgid_nr_ns(struct task_struct *tsk,
+ struct pid_namespace *ns)
+{
+ return task_tgid_nr_ns(tsk, ns);
+}
+EXPORT_SYMBOL_GPL(rust_helper_task_tgid_nr_ns);
+
struct kunit *rust_helper_kunit_get_current_test(void)
{
return kunit_get_current_test();
diff --git a/rust/kernel/cred.rs b/rust/kernel/cred.rs
index fdd899040098..961e94b6a657 100644
--- a/rust/kernel/cred.rs
+++ b/rust/kernel/cred.rs
@@ -8,6 +8,7 @@
use crate::{
bindings,
+ task::Kuid,
types::{AlwaysRefCounted, Opaque},
};
@@ -59,11 +60,11 @@ pub fn get_secid(&self) -> u32 {
}
/// Returns the effective UID of the given credential.
- pub fn euid(&self) -> bindings::kuid_t {
+ pub fn euid(&self) -> Kuid {
// SAFETY: By the type invariant, we know that `self.0` is valid. Furthermore, the `euid`
// field of a credential is never changed after initialization, so there is no potential
// for data races.
- unsafe { (*self.0.get()).euid }
+ Kuid::from_raw(unsafe { (*self.0.get()).euid })
}
}
diff --git a/rust/kernel/task.rs b/rust/kernel/task.rs
index 367b4bbddd9f..1a36a9f19368 100644
--- a/rust/kernel/task.rs
+++ b/rust/kernel/task.rs
@@ -9,6 +9,7 @@
types::{NotThreadSafe, Opaque},
};
use core::{
+ cmp::{Eq, PartialEq},
ffi::{c_int, c_long, c_uint},
ops::Deref,
ptr,
@@ -96,6 +97,12 @@ unsafe impl Sync for Task {}
/// The type of process identifiers (PIDs).
type Pid = bindings::pid_t;
+/// The type of user identifiers (UIDs).
+#[derive(Copy, Clone)]
+pub struct Kuid {
+ kuid: bindings::kuid_t,
+}
+
impl Task {
/// Returns a raw pointer to the current task.
///
@@ -157,12 +164,31 @@ pub fn pid(&self) -> Pid {
unsafe { *ptr::addr_of!((*self.0.get()).pid) }
}
+ /// Returns the UID of the given task.
+ pub fn uid(&self) -> Kuid {
+ // SAFETY: By the type invariant, we know that `self.0` is valid.
+ Kuid::from_raw(unsafe { bindings::task_uid(self.0.get()) })
+ }
+
+ /// Returns the effective UID of the given task.
+ pub fn euid(&self) -> Kuid {
+ // SAFETY: By the type invariant, we know that `self.0` is valid.
+ Kuid::from_raw(unsafe { bindings::task_euid(self.0.get()) })
+ }
+
/// Determines whether the given task has pending signals.
pub fn signal_pending(&self) -> bool {
// SAFETY: By the type invariant, we know that `self.0` is valid.
unsafe { bindings::signal_pending(self.0.get()) != 0 }
}
+ /// Returns the given task's pid in the current pid namespace.
+ pub fn pid_in_current_ns(&self) -> Pid {
+ // SAFETY: We know that `self.0.get()` is valid by the type invariant, and passing a null
+ // pointer as the namespace is correct for using the current namespace.
+ unsafe { bindings::task_tgid_nr_ns(self.0.get(), ptr::null_mut()) }
+ }
+
/// Wakes up the task.
pub fn wake_up(&self) {
// SAFETY: By the type invariant, we know that `self.0.get()` is non-null and valid.
@@ -184,3 +210,43 @@ unsafe fn dec_ref(obj: ptr::NonNull<Self>) {
unsafe { bindings::put_task_struct(obj.cast().as_ptr()) }
}
}
+
+impl Kuid {
+ /// Get the current euid.
+ #[inline]
+ pub fn current_euid() -> Kuid {
+ // SAFETY: Just an FFI call.
+ Self::from_raw(unsafe { bindings::current_euid() })
+ }
+
+ /// Create a `Kuid` given the raw C type.
+ #[inline]
+ pub fn from_raw(kuid: bindings::kuid_t) -> Self {
+ Self { kuid }
+ }
+
+ /// Turn this kuid into the raw C type.
+ #[inline]
+ pub fn into_raw(self) -> bindings::kuid_t {
+ self.kuid
+ }
+
+ /// Converts this kernel UID into a userspace UID.
+ ///
+ /// Uses the namespace of the current task.
+ #[inline]
+ pub fn into_uid_in_current_ns(self) -> bindings::uid_t {
+ // SAFETY: Just an FFI call.
+ unsafe { bindings::from_kuid(bindings::current_user_ns(), self.kuid) }
+ }
+}
+
+impl PartialEq for Kuid {
+ #[inline]
+ fn eq(&self, other: &Kuid) -> bool {
+ // SAFETY: Just an FFI call.
+ unsafe { bindings::uid_eq(self.kuid, other.kuid) }
+ }
+}
+
+impl Eq for Kuid {}
--
2.45.0.rc1.225.g2a3ae87e7f-goog
The existing `CondVar` abstraction is a wrapper around
`wait_queue_head`, but it does not support all use-cases of the C
`wait_queue_head` type. To be specific, a `CondVar` cannot be registered
with a `struct poll_table`. This limitation has the advantage that you
do not need to call `synchronize_rcu` when destroying a `CondVar`.
However, we need the ability to register a `poll_table` with a
`wait_queue_head` in Rust Binder. To enable this, introduce a type
called `PollCondVar`, which is like `CondVar` except that you can
register a `poll_table`. We also introduce `PollTable`, which is a safe
wrapper around `poll_table` that is intended to be used with
`PollCondVar`.
The destructor of `PollCondVar` unconditionally calls `synchronize_rcu`
to ensure that the removal of epoll waiters has fully completed before
the `wait_queue_head` is destroyed.
That said, `synchronize_rcu` is rather expensive and is not needed in
all cases: If we have never registered a `poll_table` with the
`wait_queue_head`, then we don't need to call `synchronize_rcu`. (And
this is a common case in Binder - not all processes use Binder with
epoll.) The current implementation does not account for this, but if we
find that it is necessary to improve this, a future patch could store a
boolean next to the `wait_queue_head` to keep track of whether a
`poll_table` has ever been registered.
Reviewed-by: Benno Lossin <[email protected]>
Reviewed-by: Martin Rodriguez Reboredo <[email protected]>
Reviewed-by: Trevor Gross <[email protected]>
Signed-off-by: Alice Ryhl <[email protected]>
---
rust/bindings/bindings_helper.h | 1 +
rust/kernel/sync.rs | 1 +
rust/kernel/sync/poll.rs | 119 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 121 insertions(+)
diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
index 2a758930fc74..b423d5cb6826 100644
--- a/rust/bindings/bindings_helper.h
+++ b/rust/bindings/bindings_helper.h
@@ -16,6 +16,7 @@
#include <linux/mdio.h>
#include <linux/phy.h>
#include <linux/pid_namespace.h>
+#include <linux/poll.h>
#include <linux/refcount.h>
#include <linux/sched.h>
#include <linux/security.h>
diff --git a/rust/kernel/sync.rs b/rust/kernel/sync.rs
index 0ab20975a3b5..bae4a5179c72 100644
--- a/rust/kernel/sync.rs
+++ b/rust/kernel/sync.rs
@@ -11,6 +11,7 @@
mod condvar;
pub mod lock;
mod locked_by;
+pub mod poll;
pub use arc::{Arc, ArcBorrow, UniqueArc};
pub use condvar::{new_condvar, CondVar, CondVarTimeoutResult};
diff --git a/rust/kernel/sync/poll.rs b/rust/kernel/sync/poll.rs
new file mode 100644
index 000000000000..20b23d985cd4
--- /dev/null
+++ b/rust/kernel/sync/poll.rs
@@ -0,0 +1,119 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! Utilities for working with `struct poll_table`.
+
+use crate::{
+ bindings,
+ file::{File, NoFdgetPos},
+ prelude::*,
+ sync::{CondVar, LockClassKey},
+ types::Opaque,
+};
+use core::ops::Deref;
+
+/// Creates a [`PollCondVar`] initialiser with the given name and a newly-created lock class.
+#[macro_export]
+macro_rules! new_poll_condvar {
+ ($($name:literal)?) => {
+ $crate::sync::poll::PollCondVar::new(
+ $crate::optional_name!($($name)?), $crate::static_lock_class!()
+ )
+ };
+}
+
+/// Wraps the kernel's `struct poll_table`.
+///
+/// # Invariants
+///
+/// This struct contains a valid `struct poll_table`.
+///
+/// For a `struct poll_table` to be valid, its `_qproc` function must follow the safety
+/// requirements of `_qproc` functions:
+///
+/// * The `_qproc` function is given permission to enqueue a waiter to the provided `poll_table`
+/// during the call. Once the waiter is removed and an rcu grace period has passed, it must no
+/// longer access the `wait_queue_head`.
+#[repr(transparent)]
+pub struct PollTable(Opaque<bindings::poll_table>);
+
+impl PollTable {
+ /// Creates a reference to a [`PollTable`] from a valid pointer.
+ ///
+ /// # Safety
+ ///
+ /// The caller must ensure that for the duration of 'a, the pointer will point at a valid poll
+ /// table (as defined in the type invariants).
+ ///
+ /// The caller must also ensure that the `poll_table` is only accessed via the returned
+ /// reference for the duration of 'a.
+ pub unsafe fn from_ptr<'a>(ptr: *mut bindings::poll_table) -> &'a mut PollTable {
+ // SAFETY: The safety requirements guarantee the validity of the dereference, while the
+ // `PollTable` type being transparent makes the cast ok.
+ unsafe { &mut *ptr.cast() }
+ }
+
+ fn get_qproc(&self) -> bindings::poll_queue_proc {
+ let ptr = self.0.get();
+ // SAFETY: The `ptr` is valid because it originates from a reference, and the `_qproc`
+ // field is not modified concurrently with this call since we have an immutable reference.
+ unsafe { (*ptr)._qproc }
+ }
+
+ /// Register this [`PollTable`] with the provided [`PollCondVar`], so that it can be notified
+ /// using the condition variable.
+ pub fn register_wait(&mut self, file: &File<NoFdgetPos>, cv: &PollCondVar) {
+ if let Some(qproc) = self.get_qproc() {
+ // SAFETY: The pointers to `file` and `self` need to be valid for the duration of this
+ // call to `qproc`, which they are because they are references.
+ //
+ // The `cv.wait_queue_head` pointer must be valid until an rcu grace period after the
+ // waiter is removed. The `PollCondVar` is pinned, so before `cv.wait_queue_head` can
+ // be destroyed, the destructor must run. That destructor first removes all waiters,
+ // and then waits for an rcu grace period. Therefore, `cv.wait_queue_head` is valid for
+ // long enough.
+ unsafe { qproc(file.as_ptr() as _, cv.wait_queue_head.get(), self.0.get()) };
+ }
+ }
+}
+
+/// A wrapper around [`CondVar`] that makes it usable with [`PollTable`].
+///
+/// [`CondVar`]: crate::sync::CondVar
+#[pin_data(PinnedDrop)]
+pub struct PollCondVar {
+ #[pin]
+ inner: CondVar,
+}
+
+impl PollCondVar {
+ /// Constructs a new condvar initialiser.
+ pub fn new(name: &'static CStr, key: &'static LockClassKey) -> impl PinInit<Self> {
+ pin_init!(Self {
+ inner <- CondVar::new(name, key),
+ })
+ }
+}
+
+// Make the `CondVar` methods callable on `PollCondVar`.
+impl Deref for PollCondVar {
+ type Target = CondVar;
+
+ fn deref(&self) -> &CondVar {
+ &self.inner
+ }
+}
+
+#[pinned_drop]
+impl PinnedDrop for PollCondVar {
+ fn drop(self: Pin<&mut Self>) {
+ // Clear anything registered using `register_wait`.
+ //
+ // SAFETY: The pointer points at a valid `wait_queue_head`.
+ unsafe { bindings::__wake_up_pollfree(self.inner.wait_queue_head.get()) };
+
+ // Wait for epoll items to be properly removed.
+ //
+ // SAFETY: Just an FFI call.
+ unsafe { bindings::synchronize_rcu() };
+ }
+}
--
2.45.0.rc1.225.g2a3ae87e7f-goog
On Fri, May 17, 2024 at 09:30:36AM +0000, Alice Ryhl wrote:
> From: Wedson Almeida Filho <[email protected]>
>
> This abstraction makes it possible to manipulate the open files for a
> process. The new `File` struct wraps the C `struct file`. When accessing
> it using the smart pointer `ARef<File>`, the pointer will own a
> reference count to the file. When accessing it as `&File`, then the
> reference does not own a refcount, but the borrow checker will ensure
> that the reference count does not hit zero while the `&File` is live.
>
> Since this is intended to manipulate the open files of a process, we
> introduce an `fget` constructor that corresponds to the C `fget`
> method. In future patches, it will become possible to create a new fd in
> a process and bind it to a `File`. Rust Binder will use these to send
> fds from one process to another.
>
> We also provide a method for accessing the file's flags. Rust Binder
> will use this to access the flags of the Binder fd to check whether the
> non-blocking flag is set, which affects what the Binder ioctl does.
>
> This introduces a struct for the EBADF error type, rather than just
> using the Error type directly. This has two advantages:
> * `File::fget` returns a `Result<ARef<File>, BadFdError>`, which the
> compiler will represent as a single pointer, with null being an error.
> This is possible because the compiler understands that `BadFdError`
> has only one possible value, and it also understands that the
> `ARef<File>` smart pointer is guaranteed non-null.
> * Additionally, we promise to users of the method that the method can
> only fail with EBADF, which means that they can rely on this promise
> without having to inspect its implementation.
> That said, there are also two disadvantages:
> * Defining additional error types involves boilerplate.
> * The question mark operator will only utilize the `From` trait once,
> which prevents you from using the question mark operator on
> `BadFdError` in methods that return some third error type that the
> kernel `Error` is convertible into. (However, it works fine in methods
> that return `Error`.)
>
> Signed-off-by: Wedson Almeida Filho <[email protected]>
> Co-developed-by: Daniel Xu <[email protected]>
> Signed-off-by: Daniel Xu <[email protected]>
> Co-developed-by: Alice Ryhl <[email protected]>
> Reviewed-by: Martin Rodriguez Reboredo <[email protected]>
> Reviewed-by: Trevor Gross <[email protected]>
> Reviewed-by: Benno Lossin <[email protected]>
> Signed-off-by: Alice Ryhl <[email protected]>
> ---
> fs/file.c | 7 +
> rust/bindings/bindings_helper.h | 2 +
> rust/helpers.c | 7 +
> rust/kernel/file.rs | 330 ++++++++++++++++++++++++++++++++++++++++
> rust/kernel/lib.rs | 1 +
> rust/kernel/types.rs | 8 +
> 6 files changed, 355 insertions(+)
>
> diff --git a/fs/file.c b/fs/file.c
> index 3b683b9101d8..f2eab5fcb87f 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -1127,6 +1127,13 @@ EXPORT_SYMBOL(task_lookup_next_fdget_rcu);
> *
> * The fput_needed flag returned by fget_light should be passed to the
> * corresponding fput_light.
> + *
> + * (As an exception to rule 2, you can call filp_close between fget_light and
> + * fput_light provided that you capture a real refcount with get_file before
> + * the call to filp_close, and ensure that this real refcount is fput *after*
> + * the fput_light call.)
> + *
> + * See also the documentation in rust/kernel/file.rs.
> */
> static unsigned long __fget_light(unsigned int fd, fmode_t mask)
> {
> diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
> index ddb5644d4fd9..541afef7ddc4 100644
> --- a/rust/bindings/bindings_helper.h
> +++ b/rust/bindings/bindings_helper.h
> @@ -9,6 +9,8 @@
> #include <kunit/test.h>
> #include <linux/errname.h>
> #include <linux/ethtool.h>
> +#include <linux/file.h>
> +#include <linux/fs.h>
> #include <linux/jiffies.h>
> #include <linux/mdio.h>
> #include <linux/phy.h>
> diff --git a/rust/helpers.c b/rust/helpers.c
> index 4c8b7b92a4f4..5545a00560d1 100644
> --- a/rust/helpers.c
> +++ b/rust/helpers.c
> @@ -25,6 +25,7 @@
> #include <linux/build_bug.h>
> #include <linux/err.h>
> #include <linux/errname.h>
> +#include <linux/fs.h>
> #include <linux/mutex.h>
> #include <linux/refcount.h>
> #include <linux/sched/signal.h>
> @@ -157,6 +158,12 @@ void rust_helper_init_work_with_key(struct work_struct *work, work_func_t func,
> }
> EXPORT_SYMBOL_GPL(rust_helper_init_work_with_key);
>
> +struct file *rust_helper_get_file(struct file *f)
> +{
> + return get_file(f);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_get_file);
> +
> /*
> * `bindgen` binds the C `size_t` type as the Rust `usize` type, so we can
> * use it in contexts where Rust expects a `usize` like slice (array) indices.
> diff --git a/rust/kernel/file.rs b/rust/kernel/file.rs
> new file mode 100644
> index 000000000000..ad881e67084c
> --- /dev/null
> +++ b/rust/kernel/file.rs
> @@ -0,0 +1,330 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +//! Files and file descriptors.
> +//!
> +//! C headers: [`include/linux/fs.h`](srctree/include/linux/fs.h) and
> +//! [`include/linux/file.h`](srctree/include/linux/file.h)
> +
> +use crate::{
> + bindings,
> + error::{code::*, Error, Result},
> + types::{ARef, AlwaysRefCounted, Opaque},
> +};
> +use core::{marker::PhantomData, ptr};
> +
> +/// Flags associated with a [`File`].
> +pub mod flags {
> + /// File is opened in append mode.
> + pub const O_APPEND: u32 = bindings::O_APPEND;
> +
> + /// Signal-driven I/O is enabled.
> + pub const O_ASYNC: u32 = bindings::FASYNC;
> +
> + /// Close-on-exec flag is set.
> + pub const O_CLOEXEC: u32 = bindings::O_CLOEXEC;
> +
> + /// File was created if it didn't already exist.
> + pub const O_CREAT: u32 = bindings::O_CREAT;
> +
> + /// Direct I/O is enabled for this file.
> + pub const O_DIRECT: u32 = bindings::O_DIRECT;
> +
> + /// File must be a directory.
> + pub const O_DIRECTORY: u32 = bindings::O_DIRECTORY;
> +
> + /// Like [`O_SYNC`] except metadata is not synced.
> + pub const O_DSYNC: u32 = bindings::O_DSYNC;
> +
> + /// Ensure that this file is created with the `open(2)` call.
> + pub const O_EXCL: u32 = bindings::O_EXCL;
> +
> + /// Large file size enabled (`off64_t` over `off_t`).
> + pub const O_LARGEFILE: u32 = bindings::O_LARGEFILE;
> +
> + /// Do not update the file last access time.
> + pub const O_NOATIME: u32 = bindings::O_NOATIME;
> +
> + /// File should not be used as process's controlling terminal.
> + pub const O_NOCTTY: u32 = bindings::O_NOCTTY;
> +
> + /// If basename of path is a symbolic link, fail open.
> + pub const O_NOFOLLOW: u32 = bindings::O_NOFOLLOW;
> +
> + /// File is using nonblocking I/O.
> + pub const O_NONBLOCK: u32 = bindings::O_NONBLOCK;
> +
> + /// File is using nonblocking I/O.
> + ///
> + /// This is effectively the same flag as [`O_NONBLOCK`] on all architectures
> + /// except SPARC64.
> + pub const O_NDELAY: u32 = bindings::O_NDELAY;
> +
> + /// Used to obtain a path file descriptor.
> + pub const O_PATH: u32 = bindings::O_PATH;
> +
> + /// Write operations on this file will flush data and metadata.
> + pub const O_SYNC: u32 = bindings::O_SYNC;
> +
> + /// This file is an unnamed temporary regular file.
> + pub const O_TMPFILE: u32 = bindings::O_TMPFILE;
> +
> + /// File should be truncated to length 0.
> + pub const O_TRUNC: u32 = bindings::O_TRUNC;
> +
> + /// Bitmask for access mode flags.
> + ///
> + /// # Examples
> + ///
> + /// ```
> + /// use kernel::file;
> + /// # fn do_something() {}
> + /// # let flags = 0;
> + /// if (flags & file::flags::O_ACCMODE) == file::flags::O_RDONLY {
> + /// do_something();
> + /// }
> + /// ```
> + pub const O_ACCMODE: u32 = bindings::O_ACCMODE;
> +
> + /// File is read only.
> + pub const O_RDONLY: u32 = bindings::O_RDONLY;
> +
> + /// File is write only.
> + pub const O_WRONLY: u32 = bindings::O_WRONLY;
> +
> + /// File can be both read and written.
> + pub const O_RDWR: u32 = bindings::O_RDWR;
> +}
> +
> +/// Compile-time information for keeping track of how a [`File`] is shared.
> +///
> +/// The `fdget_pos` method can be used to access the file's position without taking `f_pos_lock`,
> +/// as long as the file is not shared with any other threads. During such calls to `fdget_pos`, the
> +/// file must remain non-shared, so it must not be possible to move the file to another thread. For
> +/// example, if the file is moved to another thread, then it could be passed to `fd_install`, at
> +/// which point the remote process could touch the file position.
> +///
> +/// The share mode only keeps track of whether there are active `fdget_pos` calls that did not take
> +/// the `f_pos_lock`, and does not keep track of `fdget` calls. This is okay because `fdget` does
> +/// not care about the refcount of the underlying `struct file`; as long as the entry in the
> +/// current thread's fd table does not get removed, it's okay to share the file. For example,
> +/// `fd_install`ing the `struct file` into another process is okay during an `fdget` call, because
> +/// the other process can't touch the fd table of the original process.
> +mod share_mode {
> + /// Trait implemented by the two sharing modes that a file might have.
> + pub trait FileShareMode {}
> +
> + /// Represents a file for which there might be an active call to `fdget_pos` that did not take
> + /// the `f_pos_lock` lock.
> + pub enum MaybeFdgetPos {}
> +
> + /// Represents a file for which it is known that all active calls to `fdget_pos` (if any) took
> + /// the `f_pos_lock` lock.
> + pub enum NoFdgetPos {}
> +
> + impl FileShareMode for MaybeFdgetPos {}
> + impl FileShareMode for NoFdgetPos {}
> +}
> +pub use self::share_mode::{FileShareMode, MaybeFdgetPos, NoFdgetPos};
> +
> +/// Wraps the kernel's `struct file`.
> +///
> +/// This represents an open file rather than a file on a filesystem. Processes generally reference
> +/// open files using file descriptors. However, file descriptors are not the same as files. A file
> +/// descriptor is just an integer that corresponds to a file, and a single file may be referenced
> +/// by multiple file descriptors.
> +///
> +/// # Refcounting
> +///
> +/// Instances of this type are reference-counted. The reference count is incremented by the
> +/// `fget`/`get_file` functions and decremented by `fput`. The Rust type `ARef<File>` represents a
> +/// pointer that owns a reference count on the file.
> +///
> +/// Whenever a process opens a file descriptor (fd), it stores a pointer to the file in its fd
> +/// table (`struct files_struct`). This pointer owns a reference count to the file, ensuring the
> +/// file isn't prematurely deleted while the file descriptor is open. In Rust terminology, the
> +/// pointers in `struct files_struct` are `ARef<File>` pointers.
> +///
> +/// ## Light refcounts
> +///
> +/// Whenever a process has an fd to a file, it may use something called a "light refcount" as a
> +/// performance optimization. Light refcounts are acquired by calling `fdget` and released with
> +/// `fdput`. The idea behind light refcounts is that if the fd is not closed between the calls to
> +/// `fdget` and `fdput`, then the refcount cannot hit zero during that time, as the `struct
> +/// files_struct` holds a reference until the fd is closed. This means that it's safe to access the
> +/// file even if `fdget` does not increment the refcount.
> +///
> +/// The requirement that the fd is not closed during a light refcount applies globally across all
> +/// threads - not just on the thread using the light refcount. For this reason, light refcounts are
> +/// only used when the `struct files_struct` is not shared with other threads, since this ensures
> +/// that other unrelated threads cannot suddenly start using the fd and close it. Therefore,
> +/// calling `fdget` on a shared `struct files_struct` creates a normal refcount instead of a light
> +/// refcount.
> +///
> +/// Light reference counts must be released with `fdput` before the system call returns to
> +/// userspace. This means that if you wait until the current system call returns to userspace, then
> +/// all light refcounts that existed at the time have gone away.
You obviously are aware of this but I'm just spelling it out. Iirc,
there will practically only ever be one light refcount per file.
For a light refcount to be used we know that the file descriptor table
isn't shared with any other task. So there are no threads that could
concurrently access the file descriptor table. We also know that the
file descriptor table cannot become shared while we're in system call
context because the caller can't create new threads and they can't
unshare the file descriptor table.
So there's only one fdget() caller (Yes, they could call fdget()
multiple times and then have to do fdput() multiple times but that's a
level of weirdness that we don't need to worry about.).
> +///
> +/// ### The file position
> +///
> +/// Each `struct file` has a position integer, which is protected by the `f_pos_lock` mutex.
> +/// However, if the `struct file` is not shared, then the kernel may avoid taking the lock as a
> +/// performance optimization.
> +///
> +/// The condition for avoiding the `f_pos_lock` mutex is different from the condition for using
> +/// `fdget`. With `fdget`, you may avoid incrementing the refcount as long as the current fd table
> +/// is not shared; it is okay if there are other fd tables that also reference the same `struct
> +/// file`. However, `fdget_pos` can only avoid taking the `f_pos_lock` if the entire `struct file`
> +/// is not shared, as different processes with an fd to the same `struct file` share the same
> +/// position.
> +///
> +/// ## Rust references
> +///
> +/// The reference type `&File` is similar to light refcounts:
> +///
> +/// * `&File` references don't own a reference count. They can only exist as long as the reference
> +/// count stays positive, and can only be created when there is some mechanism in place to ensure
> +/// this.
> +///
> +/// * The Rust borrow-checker normally ensures this by enforcing that the `ARef<File>` from which
> +/// a `&File` is created outlives the `&File`.
> +///
> +/// * Using the unsafe [`File::from_ptr`] means that it is up to the caller to ensure that the
> +/// `&File` only exists while the reference count is positive.
> +///
> +/// * You can think of `fdget` as using an fd to look up an `ARef<File>` in the `struct
> +/// files_struct` and create an `&File` from it. The "fd cannot be closed" rule is like the Rust
> +/// rule "the `ARef<File>` must outlive the `&File`".
> +///
> +/// # Invariants
> +///
> +/// * All instances of this type are refcounted using the `f_count` field.
> +/// * If the file sharing mode is `MaybeFdgetPos`, then all active calls to `fdget_pos` that did
> +/// not take the `f_pos_lock` mutex must be on the same thread as this `File`.
> +/// * If the file sharing mode is `NoFdgetPos`, then there must not be active calls to `fdget_pos`
> +/// that did not take the `f_pos_lock` mutex.
> +#[repr(transparent)]
> +pub struct File<S: FileShareMode> {
> + inner: Opaque<bindings::file>,
> + _share_mode: PhantomData<S>,
> +}
> +
> +// SAFETY: This file is known to not have any local active `fdget_pos` calls, so it is safe to
> +// transfer it between threads.
> +unsafe impl Send for File<NoFdgetPos> {}
> +
> +// SAFETY: This file is known to not have any local active `fdget_pos` calls, so it is safe to
> +// access its methods from several threads in parallel.
> +unsafe impl Sync for File<NoFdgetPos> {}
> +
> +/// File methods that only exist under the [`MaybeFdgetPos`] sharing mode.
> +impl File<MaybeFdgetPos> {
> + /// Constructs a new `struct file` wrapper from a file descriptor.
> + ///
> + /// The file descriptor belongs to the current process, and there might be active local calls
> + /// to `fdget_pos`.
> + pub fn fget(fd: u32) -> Result<ARef<File<MaybeFdgetPos>>, BadFdError> {
> + // SAFETY: FFI call, there are no requirements on `fd`.
> + let ptr = ptr::NonNull::new(unsafe { bindings::fget(fd) }).ok_or(BadFdError)?;
> +
> + // SAFETY: `bindings::fget` created a refcount, and we pass ownership of it to the `ARef`.
> + //
> + // INVARIANT: This file is in the fd table on this thread, so either all `fdget_pos` calls
> + // are on this thread, or the file is shared, in which case `fdget_pos` calls took the
> + // `f_pos_lock` mutex.
> + Ok(unsafe { ARef::from_raw(ptr.cast()) })
> + }
I'm a little unclear how this is supposed to be used. What I
specifically struggle with is what function does one have to call to
translate from a file descriptor to a file? IOW, where are the actual
entry points for turning fds into files? That's what I want to see and
that's what we need to make this interface usable generically.
Because naively, what I'm looking for is a Rust version of fdget() and
fdget_pos() that give me back a File<MaybeFdget> or a
File<MaybeFdgetPos>.
And then those both implement a get_file() method so the caller can take
an explicit long-term reference to the file.
The fget() above is really confusing to me because it always takes a
reference on the file that's pointed to by the fd and then it returns a
MaybeFdgetPos because presumably you want to indicate that the file
descriptor may refer to a file that may or may not be referenced by
another thread via fdget()/fdget_pos() already.
So I've _skimmed_ the binder RFC and looked at:
[email protected]
which states:
Add support for sending fds over binder.
Unlike the other object types, file descriptors are not translated until
the transaction is actually received by the recipient. Until that
happens, we store `u32::MAX` as the fd.
Translating fds is done in a two-phase process. First, the file
descriptors are allocated and written to the allocation. Then, once we
have allocated all of them, we commit them to the files in question.
Using this strategy, we are able to guarantee that we either send all of
the fds, or none of them.
So I'm curious. How does the binder fd sending work exactly? Because I
feel that this is crucial to understand here. Some specific questions:
* When file descriptors are passed the reference to these files via
fget() are taken _synchronously_, i.e., by the sending task, not the
receiver? IOW, is binder_translate_fd() called in the context of the
sender or the receiver. I assume it must be the sender because
otherwise the sender and receiver must share a file descriptor table
in order for the receiver to call fget().
* The receiving task then allocates new file descriptors and installs
the received files into its file descriptor table?
And so basically, what I'm after here is that the binder_translate_fd()
that calls fget() is done in the context of the sender and we _know_
that the fds can't have light references. Because if they did it could
only be by the calling task but they don't since the calling task uses
fget() on them. And if the calling task is multi-threaded and another
thread has called fdget() or fdget_pos() we know that they have taken
their own reference because the file descriptor table is shared.
So why is that fget() in here returning a File<MaybeFdgetPos>? This
doesn't make sense to me at first glance.
> +
> + /// Assume that there are no active `fdget_pos` calls that prevent us from sharing this file.
> + ///
> + /// This makes it safe to transfer this file to other threads. No checks are performed, and
> + /// using it incorrectly may lead to a data race on the file position if the file is shared
> + /// with another thread.
> + ///
> + /// This method is intended to be used together with [`File::fget`] when the caller knows
> + /// statically that there are no `fdget_pos` calls on the current thread. For example, you
> + /// might use it when calling `fget` from an ioctl, since ioctls usually do not touch the file
> + /// position.
> + ///
> + /// # Safety
> + ///
> + /// There must not be any active `fdget_pos` calls on the current thread.
> + pub unsafe fn assume_no_fdget_pos(me: ARef<Self>) -> ARef<File<NoFdgetPos>> {
> + // INVARIANT: There are no `fdget_pos` calls on the current thread, and by the type
> + // invariants, if there is a `fdget_pos` call on another thread, then it took the
> + // `f_pos_lock` mutex.
> + //
> + // SAFETY: `File<MaybeFdgetPos>` and `File<NoFdgetPos>` have the same layout.
> + unsafe { ARef::from_raw(ARef::into_raw(me).cast()) }
> + }
> +}
> +
> +/// File methods that exist under all sharing modes.
> +impl<S: FileShareMode> File<S> {
> + /// Creates a reference to a [`File`] from a valid pointer.
> + ///
> + /// # Safety
> + ///
> + /// * The caller must ensure that `ptr` points at a valid file and that the file's refcount is
> + /// positive for the duration of 'a.
> + /// * The caller must ensure that the requirements for using the chosen file sharing mode are
> + /// upheld.
> + pub unsafe fn from_ptr<'a>(ptr: *const bindings::file) -> &'a File<S> {
I think I requested from_raw_file() in the last revision?
> + // SAFETY: The caller guarantees that the pointer is not dangling and stays valid for the
> + // duration of 'a. The cast is okay because `File` is `repr(transparent)`.
> + //
> + // INVARIANT: The safety requirements guarantee that the refcount does not hit zero during
> + // 'a. The caller guarantees to uphold the requirements for the chosen sharing mode.
> + unsafe { &*ptr.cast() }
> + }
> +
> + /// Returns a raw pointer to the inner C struct.
> + #[inline]
> + pub fn as_ptr(&self) -> *mut bindings::file {
And that was supposed to be into_raw_file() or as_raw_file()?
> + self.inner.get()
> + }
> +
> + /// Returns the flags associated with the file.
> + ///
> + /// The flags are a combination of the constants in [`flags`].
> + pub fn flags(&self) -> u32 {
> + // This `read_volatile` is intended to correspond to a READ_ONCE call.
> + //
> + // SAFETY: The file is valid because the shared reference guarantees a nonzero refcount.
> + //
> + // FIXME(read_once): Replace with `read_once` when available on the Rust side.
> + unsafe { core::ptr::addr_of!((*self.as_ptr()).f_flags).read_volatile() }
> + }
> +}
> +
> +// SAFETY: The type invariants guarantee that `File` is always ref-counted. This implementation
> +// makes `ARef<File>` own a normal refcount.
> +unsafe impl<S: FileShareMode> AlwaysRefCounted for File<S> {
> + fn inc_ref(&self) {
> + // SAFETY: The existence of a shared reference means that the refcount is nonzero.
> + unsafe { bindings::get_file(self.as_ptr()) };
> + }
> +
> + unsafe fn dec_ref(obj: ptr::NonNull<File<S>>) {
> + // SAFETY: To call this method, the caller passes us ownership of a normal refcount, so we
> + // may drop it. The cast is okay since `File` has the same representation as `struct file`.
> + unsafe { bindings::fput(obj.cast().as_ptr()) }
> + }
> +}
> +
> +/// Represents the `EBADF` error code.
> +///
> +/// Used for methods that can only fail with `EBADF`.
> +#[derive(Copy, Clone, Eq, PartialEq)]
> +pub struct BadFdError;
> +
> +impl From<BadFdError> for Error {
> + fn from(_: BadFdError) -> Error {
> + EBADF
> + }
> +}
> +
> +impl core::fmt::Debug for BadFdError {
> + fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result {
> + f.pad("EBADF")
> + }
> +}
> diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs
> index 9a943d99c71a..c583fd27736d 100644
> --- a/rust/kernel/lib.rs
> +++ b/rust/kernel/lib.rs
> @@ -29,6 +29,7 @@
> pub mod alloc;
> mod build_assert;
> pub mod error;
> +pub mod file;
> pub mod init;
> pub mod ioctl;
> #[cfg(CONFIG_KUNIT)]
> diff --git a/rust/kernel/types.rs b/rust/kernel/types.rs
> index 93734677cfe7..3ec2b12afbee 100644
> --- a/rust/kernel/types.rs
> +++ b/rust/kernel/types.rs
> @@ -366,6 +366,14 @@ pub unsafe fn from_raw(ptr: NonNull<T>) -> Self {
> _p: PhantomData,
> }
> }
> +
> + /// Convert this [`ARef`] into a raw pointer.
> + ///
> + /// The caller retains ownership of the refcount that this `ARef` used to own.
> + pub fn into_raw(me: Self) -> NonNull<T> {
> + let me = core::mem::ManuallyDrop::new(me);
> + me.ptr
> + }
> }
>
> impl<T: AlwaysRefCounted> Clone for ARef<T> {
>
> --
> 2.45.0.rc1.225.g2a3ae87e7f-goog
>
On Fri, May 24, 2024 at 07:17:13PM +0000, Alice Ryhl wrote:
> > You obviously are aware of this but I'm just spelling it out. Iirc,
> > there will practically only ever be one light refcount per file.
> >
> > For a light refcount to be used we know that the file descriptor table
> > isn't shared with any other task. So there are no threads that could
> > concurrently access the file descriptor table. We also know that the
> > file descriptor table cannot become shared while we're in system call
> > context because the caller can't create new threads and they can't
> > unshare the file descriptor table.
> >
> > So there's only one fdget() caller (Yes, they could call fdget()
> > multiple times and then have to do fdput() multiple times but that's a
> > level of weirdness that we don't need to worry about.).
>
> Hmm. Is it not the case that different processes with different file
> descriptor tables could reference the same underlying `struct file` and
> both use light refcounts to do so, as long as each fd table is not
> shared? So there could be multiple light refcounts to the same `struct
> file` at the same time on different threads.
Relevant rules:
* Each file pointer in any descriptor table contributes to refcount
of file.
* All assignments to task->files are done by the task itself or,
during task creation, by its parent The latter happens before the task
runs for the first time. The former is done with task_lock(current)
held.
* current->files is always stable. The object it points to
is guaranteed to stay alive at least until you explicitly change
current->files.
* task->files is stable while you are holding task_lock(task).
The object it points to is guaranteed to stay alive until you release
task_lock(task).
* task->files MAY be fetched (racily) without either of the
above, but it should not be dereferenced - the memory may be freed
and reused right after you've fetched the pointer.
* descriptor tables are refcounted by table->count.
* descriptor table is created with ->count equal to 1 and
destroyed when its ->count reaches 0.
* each task with task->files == table contributes to table->count.
* before the task dies, its ->files becomes NULL (see exit_files()).
* when task is born (see copy_process() and copy_files())) the parent
is responsible for setting the value of task->files and making sure that
refcounts are correct; that's the only case where one is allowed to acquire
an extra reference to existing table (handling of clone(2) with COPY_FILES).
* the only descriptor table one may modify is that pointed to
by current->files. Any access to other threads' descriptor tables is
read-only.
* struct fd is fundamentally thread-local. It should never be
passed around, put into shared data structures, etc.
* if you have done fdget(N), the matching fdput() MUST be done
before the caller modifies the Nth slot of its descriptor table,
spawns children that would share the descriptor table.
* fdget() MAY borrow a reference from caller's descriptor table.
That can be done if current->files->count is equal to 1.
In that case we can be certain that the file reference we fetched from
our descriptor table will remain unchanged (and thus contributing to refcount
of file) until fdput(). Indeed,
+ at the time of fdget() no other thread has task->files pointing
to our table (otherwise ->count would be greater than 1).
+ our thread will remain the sole owner of descriptor table at
least until fdput(). Indeed, the first additional thread with task->files
pointing to our table would have to have been spawned by us and we are
forbidden to do that (rules for fdget() use)
+ no other thread could modify our descriptor table (they would
have to share it first).
+ we are allowed to modify our table, but we are forbidden to touch
the slot we'd copied from (rules for fdget() use).
In other words, if current->files->count is equal to 1 at fdget() time
we can skip incrementing refcount. Matching fdput() would need to
skip decrement, of course. Note that we must record that (borrowed
vs. cloned) in struct fd - the condition cannot be rechecked at fdput()
time, since the table that had been shared at fdget() time might no longer
be shared by the time of fdput().
> And this does *not* apply to `fdget_pos`, which checks the refcount of
> the `struct file` instead of the refcount of the fd table.
False. fdget_pos() is identical to fdget() as far as file refcount
handling goes. The part that is different is that grabbing ->f_pos_lock
is sensitive to file refcount in some cases. This is orthogonal to
"does this struct fd contribute to file refcount".
Again, "light" references are tied to thread; they can only be created
if we are guaranteed that descriptor table's slot they came from will
remain unchanged for as long as the reference is used.
And yes, there may be several light references to the same file - both
in different processes that do not share descriptor table *and* in the
same thread, if e.g. sendfile(in_fd, out_fd, ...) is called with
in_fd == out_fd.
On Fri, May 24, 2024 at 07:17:13PM +0000, Alice Ryhl wrote:
> > And then those both implement a get_file() method so the caller can take
> > an explicit long-term reference to the file.
>
> Even if you call `get_file` to get a long-term reference from something
> you have an fdget_pos reference to, that doesn't necessarily mean that
> you can share that long-term reference with other threads. You would
> need to release the fdget_pos reference first. For that reason, the
> long-term reference returned by `get_file` would still need to have the
> `File<MaybeFdgetPos>` type.
Why would you want such a bizarre requirement?
> Note that since it forgets which fd and fd table it came from, calls to
> `fdget` are actually not a problem for sending our long-term references
> across threads. The `fdget` requirements only care about things that
> touch the entry in the file descriptor table, such as closing the fd.
> The `ARef<File>` type does not provide any methods that could lead to
> that happening, so sharing it across threads is okay *even if* there is
> an light reference. That's why I have an `MaybeFdgetPos` but no
> `MaybeFdget`.
Huh?
What is "the entry in the file descriptor table"? Which one and in which one?
> let file = File::fget(my_fd)?;
> // SAFETY: We know that there are no active `fdget_pos` calls on
> // the current thread, since this is an ioctl and we have not
> // called `fdget_pos` inside the Binder driver.
> let thread_safe_file = unsafe { file.assume_no_fdget_pos() };
>
> (search for File::from_fd in the RFC to find where this would go)
>
> The `assume_no_fdget_pos` call has no effect at runtime - it is purely a
> compile-time thing to force the user to use unsafe to "promise" that
> there aren't any `fdget_pos` calls on the same fd.
Why does fdget_pos() even matter? The above makes no sense...
Again, cloning a reference and sending it to another thread is perfectly
fine. And what's so special about fdget_pos()/fdput_pos() compared to
fdget()/fdput()?
_What_ memory safety issues are you talking about?
On Fri, May 24, 2024 at 11:56:40PM +0100, Al Viro wrote:
> > Even if you call `get_file` to get a long-term reference from something
> > you have an fdget_pos reference to, that doesn't necessarily mean that
> > you can share that long-term reference with other threads. You would
> > need to release the fdget_pos reference first. For that reason, the
> > long-term reference returned by `get_file` would still need to have the
> > `File<MaybeFdgetPos>` type.
>
> Why would you want such a bizarre requirement?
FWIW, fdget()...fdput() form a scope. The file reference _in_ that
struct fd is just a normal file reference, period.
You can pass it to a function as an argument, etc. You certainly can
clone it (with get_file()).
The rules are basically "you can't spawn threads with CLONE_FILES inside
the scope and you can't remove reference in your descriptor table while
in scope". The value in fd.file is guaranteed to stay with positive
refcount through the entire scope, just as if you had
{
struct file *f = fget(n);
if (!f)
return -EBADF;
...
fput(f);
}
The rules for access are exactly the same - you can pass f to a function
called from the scope, you can use it while in scope, you can clone it
and store it somewhere, etc.
As far as the type system goes, fd::file is a normal counting reference.
Depending upon the descriptor table sharing it might or might not have
had to increment ->f_count, but that's not something users of the value
need to be concerned about. It *is* a concern for fdput() in the end
of scope, but that's it.
You can't do
fd = fdget(n);
...
fput(fd.file);
but then you can't call unbalanced put on a member of object and
expect that to work - if nothing else,
fd = fdget(n);
...
fput(fd.file);
foo(fd);
would be a clear bug - the thing you pass to foo() is obviously not
a normal struct fd instance.
fdget_pos()...fdput_pos() is a red herring; we could've just as well
done it as
fd = fdget(n);
maybe_lock(&fd);
...
maybe_unlock(fd);
fdput(fd);
It's a convenience helper; again, fd.file is the usual reference, with
guaranteed positive ->f_count through the entire scope.
For the sake of completeness, here's possible maybe_lock()/maybe_unlock()
implementation:
static inline void maybe_lock(struct fd *fd)
{
if (fd->file && file_needs_f_pos_lock(fd->file)) {
fd->flags |= FDPUT_POS_UNLOCK;
mutex_lock(&fd->file->f_pos_lock);
}
}
static inline void maybe_unlock(struct fd fd)
{
if (fd.flags & FDPUT_POS_UNLOCK)
mutex_unlock(&fd.file->f_pos_lock);
}
That's it. Sure, we need to pair fdput_pos() with fdget_pos(), but that's
not a matter of memory safety of any sort. And we certainly can
pass get_file(fd.file) anywhere we want without waiting for fdput_pos()
(or maybe_unlock(), if we went that way). You'd get a cloned reference
to file, that would stay valid until the matching fput(). Sure, that
file will have ->f_pos_lock held until fdput_pos(); what's the problem?
IDGI... Was that the fact that current file_needs_f_pos_lock() happens
to look at ->f_count in some cases? The reasons are completely unrelated
and we do *NOT* assume anything about state at the end of scope - that's
why we store that in fd.flags.
On Fri, May 24, 2024 at 07:17:13PM +0000, Alice Ryhl wrote:
> On Fri, May 24, 2024 at 6:12 PM Christian Brauner <[email protected]> wrote:
> >
> > On Fri, May 17, 2024 at 09:30:36AM +0000, Alice Ryhl wrote:
> > > From: Wedson Almeida Filho <[email protected]>
> > >
> > > This abstraction makes it possible to manipulate the open files for a
> > > process. The new `File` struct wraps the C `struct file`. When accessing
> > > it using the smart pointer `ARef<File>`, the pointer will own a
> > > reference count to the file. When accessing it as `&File`, then the
> > > reference does not own a refcount, but the borrow checker will ensure
> > > that the reference count does not hit zero while the `&File` is live.
> > >
> > > Since this is intended to manipulate the open files of a process, we
> > > introduce an `fget` constructor that corresponds to the C `fget`
> > > method. In future patches, it will become possible to create a new fd in
> > > a process and bind it to a `File`. Rust Binder will use these to send
> > > fds from one process to another.
> > >
> > > We also provide a method for accessing the file's flags. Rust Binder
> > > will use this to access the flags of the Binder fd to check whether the
> > > non-blocking flag is set, which affects what the Binder ioctl does.
> > >
> > > This introduces a struct for the EBADF error type, rather than just
> > > using the Error type directly. This has two advantages:
> > > * `File::fget` returns a `Result<ARef<File>, BadFdError>`, which the
> > > compiler will represent as a single pointer, with null being an error.
> > > This is possible because the compiler understands that `BadFdError`
> > > has only one possible value, and it also understands that the
> > > `ARef<File>` smart pointer is guaranteed non-null.
> > > * Additionally, we promise to users of the method that the method can
> > > only fail with EBADF, which means that they can rely on this promise
> > > without having to inspect its implementation.
> > > That said, there are also two disadvantages:
> > > * Defining additional error types involves boilerplate.
> > > * The question mark operator will only utilize the `From` trait once,
> > > which prevents you from using the question mark operator on
> > > `BadFdError` in methods that return some third error type that the
> > > kernel `Error` is convertible into. (However, it works fine in methods
> > > that return `Error`.)
> > >
> > > Signed-off-by: Wedson Almeida Filho <[email protected]>
> > > Co-developed-by: Daniel Xu <[email protected]>
> > > Signed-off-by: Daniel Xu <[email protected]>
> > > Co-developed-by: Alice Ryhl <[email protected]>
> > > Reviewed-by: Martin Rodriguez Reboredo <[email protected]>
> > > Reviewed-by: Trevor Gross <[email protected]>
> > > Reviewed-by: Benno Lossin <[email protected]>
> > > Signed-off-by: Alice Ryhl <[email protected]>
> > > ---
> > > fs/file.c | 7 +
> > > rust/bindings/bindings_helper.h | 2 +
> > > rust/helpers.c | 7 +
> > > rust/kernel/file.rs | 330 ++++++++++++++++++++++++++++++++++++++++
> > > rust/kernel/lib.rs | 1 +
> > > rust/kernel/types.rs | 8 +
> > > 6 files changed, 355 insertions(+)
> > >
> > > diff --git a/fs/file.c b/fs/file.c
> > > index 3b683b9101d8..f2eab5fcb87f 100644
> > > --- a/fs/file.c
> > > +++ b/fs/file.c
> > > @@ -1127,6 +1127,13 @@ EXPORT_SYMBOL(task_lookup_next_fdget_rcu);
> > > *
> > > * The fput_needed flag returned by fget_light should be passed to the
> > > * corresponding fput_light.
> > > + *
> > > + * (As an exception to rule 2, you can call filp_close between fget_light and
> > > + * fput_light provided that you capture a real refcount with get_file before
> > > + * the call to filp_close, and ensure that this real refcount is fput *after*
> > > + * the fput_light call.)
> > > + *
> > > + * See also the documentation in rust/kernel/file.rs.
> > > */
> > > static unsigned long __fget_light(unsigned int fd, fmode_t mask)
> > > {
> > > diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
> > > index ddb5644d4fd9..541afef7ddc4 100644
> > > --- a/rust/bindings/bindings_helper.h
> > > +++ b/rust/bindings/bindings_helper.h
> > > @@ -9,6 +9,8 @@
> > > #include <kunit/test.h>
> > > #include <linux/errname.h>
> > > #include <linux/ethtool.h>
> > > +#include <linux/file.h>
> > > +#include <linux/fs.h>
> > > #include <linux/jiffies.h>
> > > #include <linux/mdio.h>
> > > #include <linux/phy.h>
> > > diff --git a/rust/helpers.c b/rust/helpers.c
> > > index 4c8b7b92a4f4..5545a00560d1 100644
> > > --- a/rust/helpers.c
> > > +++ b/rust/helpers.c
> > > @@ -25,6 +25,7 @@
> > > #include <linux/build_bug.h>
> > > #include <linux/err.h>
> > > #include <linux/errname.h>
> > > +#include <linux/fs.h>
> > > #include <linux/mutex.h>
> > > #include <linux/refcount.h>
> > > #include <linux/sched/signal.h>
> > > @@ -157,6 +158,12 @@ void rust_helper_init_work_with_key(struct work_struct *work, work_func_t func,
> > > }
> > > EXPORT_SYMBOL_GPL(rust_helper_init_work_with_key);
> > >
> > > +struct file *rust_helper_get_file(struct file *f)
> > > +{
> > > + return get_file(f);
> > > +}
> > > +EXPORT_SYMBOL_GPL(rust_helper_get_file);
> > > +
> > > /*
> > > * `bindgen` binds the C `size_t` type as the Rust `usize` type, so we can
> > > * use it in contexts where Rust expects a `usize` like slice (array) indices.
> > > diff --git a/rust/kernel/file.rs b/rust/kernel/file.rs
> > > new file mode 100644
> > > index 000000000000..ad881e67084c
> > > --- /dev/null
> > > +++ b/rust/kernel/file.rs
> > > @@ -0,0 +1,330 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +
> > > +//! Files and file descriptors.
> > > +//!
> > > +//! C headers: [`include/linux/fs.h`](srctree/include/linux/fs.h) and
> > > +//! [`include/linux/file.h`](srctree/include/linux/file.h)
> > > +
> > > +use crate::{
> > > + bindings,
> > > + error::{code::*, Error, Result},
> > > + types::{ARef, AlwaysRefCounted, Opaque},
> > > +};
> > > +use core::{marker::PhantomData, ptr};
> > > +
> > > +/// Flags associated with a [`File`].
> > > +pub mod flags {
> > > + /// File is opened in append mode.
> > > + pub const O_APPEND: u32 = bindings::O_APPEND;
> > > +
> > > + /// Signal-driven I/O is enabled.
> > > + pub const O_ASYNC: u32 = bindings::FASYNC;
> > > +
> > > + /// Close-on-exec flag is set.
> > > + pub const O_CLOEXEC: u32 = bindings::O_CLOEXEC;
> > > +
> > > + /// File was created if it didn't already exist.
> > > + pub const O_CREAT: u32 = bindings::O_CREAT;
> > > +
> > > + /// Direct I/O is enabled for this file.
> > > + pub const O_DIRECT: u32 = bindings::O_DIRECT;
> > > +
> > > + /// File must be a directory.
> > > + pub const O_DIRECTORY: u32 = bindings::O_DIRECTORY;
> > > +
> > > + /// Like [`O_SYNC`] except metadata is not synced.
> > > + pub const O_DSYNC: u32 = bindings::O_DSYNC;
> > > +
> > > + /// Ensure that this file is created with the `open(2)` call.
> > > + pub const O_EXCL: u32 = bindings::O_EXCL;
> > > +
> > > + /// Large file size enabled (`off64_t` over `off_t`).
> > > + pub const O_LARGEFILE: u32 = bindings::O_LARGEFILE;
> > > +
> > > + /// Do not update the file last access time.
> > > + pub const O_NOATIME: u32 = bindings::O_NOATIME;
> > > +
> > > + /// File should not be used as process's controlling terminal.
> > > + pub const O_NOCTTY: u32 = bindings::O_NOCTTY;
> > > +
> > > + /// If basename of path is a symbolic link, fail open.
> > > + pub const O_NOFOLLOW: u32 = bindings::O_NOFOLLOW;
> > > +
> > > + /// File is using nonblocking I/O.
> > > + pub const O_NONBLOCK: u32 = bindings::O_NONBLOCK;
> > > +
> > > + /// File is using nonblocking I/O.
> > > + ///
> > > + /// This is effectively the same flag as [`O_NONBLOCK`] on all architectures
> > > + /// except SPARC64.
> > > + pub const O_NDELAY: u32 = bindings::O_NDELAY;
> > > +
> > > + /// Used to obtain a path file descriptor.
> > > + pub const O_PATH: u32 = bindings::O_PATH;
> > > +
> > > + /// Write operations on this file will flush data and metadata.
> > > + pub const O_SYNC: u32 = bindings::O_SYNC;
> > > +
> > > + /// This file is an unnamed temporary regular file.
> > > + pub const O_TMPFILE: u32 = bindings::O_TMPFILE;
> > > +
> > > + /// File should be truncated to length 0.
> > > + pub const O_TRUNC: u32 = bindings::O_TRUNC;
> > > +
> > > + /// Bitmask for access mode flags.
> > > + ///
> > > + /// # Examples
> > > + ///
> > > + /// ```
> > > + /// use kernel::file;
> > > + /// # fn do_something() {}
> > > + /// # let flags = 0;
> > > + /// if (flags & file::flags::O_ACCMODE) == file::flags::O_RDONLY {
> > > + /// do_something();
> > > + /// }
> > > + /// ```
> > > + pub const O_ACCMODE: u32 = bindings::O_ACCMODE;
> > > +
> > > + /// File is read only.
> > > + pub const O_RDONLY: u32 = bindings::O_RDONLY;
> > > +
> > > + /// File is write only.
> > > + pub const O_WRONLY: u32 = bindings::O_WRONLY;
> > > +
> > > + /// File can be both read and written.
> > > + pub const O_RDWR: u32 = bindings::O_RDWR;
> > > +}
> > > +
> > > +/// Compile-time information for keeping track of how a [`File`] is shared.
> > > +///
> > > +/// The `fdget_pos` method can be used to access the file's position without taking `f_pos_lock`,
> > > +/// as long as the file is not shared with any other threads. During such calls to `fdget_pos`, the
> > > +/// file must remain non-shared, so it must not be possible to move the file to another thread. For
> > > +/// example, if the file is moved to another thread, then it could be passed to `fd_install`, at
> > > +/// which point the remote process could touch the file position.
> > > +///
> > > +/// The share mode only keeps track of whether there are active `fdget_pos` calls that did not take
> > > +/// the `f_pos_lock`, and does not keep track of `fdget` calls. This is okay because `fdget` does
> > > +/// not care about the refcount of the underlying `struct file`; as long as the entry in the
> > > +/// current thread's fd table does not get removed, it's okay to share the file. For example,
> > > +/// `fd_install`ing the `struct file` into another process is okay during an `fdget` call, because
> > > +/// the other process can't touch the fd table of the original process.
> > > +mod share_mode {
> > > + /// Trait implemented by the two sharing modes that a file might have.
> > > + pub trait FileShareMode {}
> > > +
> > > + /// Represents a file for which there might be an active call to `fdget_pos` that did not take
> > > + /// the `f_pos_lock` lock.
> > > + pub enum MaybeFdgetPos {}
> > > +
> > > + /// Represents a file for which it is known that all active calls to `fdget_pos` (if any) took
> > > + /// the `f_pos_lock` lock.
> > > + pub enum NoFdgetPos {}
> > > +
> > > + impl FileShareMode for MaybeFdgetPos {}
> > > + impl FileShareMode for NoFdgetPos {}
> > > +}
> > > +pub use self::share_mode::{FileShareMode, MaybeFdgetPos, NoFdgetPos};
> > > +
> > > +/// Wraps the kernel's `struct file`.
> > > +///
> > > +/// This represents an open file rather than a file on a filesystem. Processes generally reference
> > > +/// open files using file descriptors. However, file descriptors are not the same as files. A file
> > > +/// descriptor is just an integer that corresponds to a file, and a single file may be referenced
> > > +/// by multiple file descriptors.
> > > +///
> > > +/// # Refcounting
> > > +///
> > > +/// Instances of this type are reference-counted. The reference count is incremented by the
> > > +/// `fget`/`get_file` functions and decremented by `fput`. The Rust type `ARef<File>` represents a
> > > +/// pointer that owns a reference count on the file.
> > > +///
> > > +/// Whenever a process opens a file descriptor (fd), it stores a pointer to the file in its fd
> > > +/// table (`struct files_struct`). This pointer owns a reference count to the file, ensuring the
> > > +/// file isn't prematurely deleted while the file descriptor is open. In Rust terminology, the
> > > +/// pointers in `struct files_struct` are `ARef<File>` pointers.
> > > +///
> > > +/// ## Light refcounts
> > > +///
> > > +/// Whenever a process has an fd to a file, it may use something called a "light refcount" as a
> > > +/// performance optimization. Light refcounts are acquired by calling `fdget` and released with
> > > +/// `fdput`. The idea behind light refcounts is that if the fd is not closed between the calls to
> > > +/// `fdget` and `fdput`, then the refcount cannot hit zero during that time, as the `struct
> > > +/// files_struct` holds a reference until the fd is closed. This means that it's safe to access the
> > > +/// file even if `fdget` does not increment the refcount.
> > > +///
> > > +/// The requirement that the fd is not closed during a light refcount applies globally across all
> > > +/// threads - not just on the thread using the light refcount. For this reason, light refcounts are
> > > +/// only used when the `struct files_struct` is not shared with other threads, since this ensures
> > > +/// that other unrelated threads cannot suddenly start using the fd and close it. Therefore,
> > > +/// calling `fdget` on a shared `struct files_struct` creates a normal refcount instead of a light
> > > +/// refcount.
> > > +///
> > > +/// Light reference counts must be released with `fdput` before the system call returns to
> > > +/// userspace. This means that if you wait until the current system call returns to userspace, then
> > > +/// all light refcounts that existed at the time have gone away.
> >
> > You obviously are aware of this but I'm just spelling it out. Iirc,
> > there will practically only ever be one light refcount per file.
> >
> > For a light refcount to be used we know that the file descriptor table
> > isn't shared with any other task. So there are no threads that could
> > concurrently access the file descriptor table. We also know that the
> > file descriptor table cannot become shared while we're in system call
> > context because the caller can't create new threads and they can't
> > unshare the file descriptor table.
> >
> > So there's only one fdget() caller (Yes, they could call fdget()
> > multiple times and then have to do fdput() multiple times but that's a
> > level of weirdness that we don't need to worry about.).
>
> Hmm. Is it not the case that different processes with different file
> descriptor tables could reference the same underlying `struct file` and
> both use light refcounts to do so, as long as each fd table is not
> shared? So there could be multiple light refcounts to the same `struct
> file` at the same time on different threads.
That's correct.
But I misread what you were trying to say then. I thought you were
talking about multiple light references from the same thread which is
rather rare and should only happen in system calls that take two fds.
But what you're talking about is the same struct file being present in
separate file descriptor tables and referenced multiple times from
different threads so in that sense we have multiple light references.
And that's obviously correct.
>
> And this does *not* apply to `fdget_pos`, which checks the refcount of
> the `struct file` instead of the refcount of the fd table.
I have only skimmed the replies down thread so far but I saw that this
has mostly been clarified. The reference counting between fdget() and
fdget_pos() is identical. fdget_pos() calls fdget() after all.
What's different is that if the file is already shared among different
processes then even if a light reference was taken by the caller because
it doesn't share the file descriptor table fdget_pos() may still acquire
the f_pos_lock because the struct file is referenced by multiple
processes.
IOW, you could have the same struct file in the file descriptor tables
of 10 processes. So the f_count would be 10. Assume all of them
concurrently call read(), then none of them will bump f_count because
fdget() sees that the file descriptor tables aren't shared.
But all 10 of them will serialize on f_pos_lock. So that's really
separate from light refcounting. If you have to acquire f_pos_lock from
within the same thread then fdget*() always take normal reference
counts.
>
> > > +///
> > > +/// ### The file position
> > > +///
> > > +/// Each `struct file` has a position integer, which is protected by the `f_pos_lock` mutex.
> > > +/// However, if the `struct file` is not shared, then the kernel may avoid taking the lock as a
> > > +/// performance optimization.
> > > +///
> > > +/// The condition for avoiding the `f_pos_lock` mutex is different from the condition for using
> > > +/// `fdget`. With `fdget`, you may avoid incrementing the refcount as long as the current fd table
> > > +/// is not shared; it is okay if there are other fd tables that also reference the same `struct
> > > +/// file`. However, `fdget_pos` can only avoid taking the `f_pos_lock` if the entire `struct file`
> > > +/// is not shared, as different processes with an fd to the same `struct file` share the same
> > > +/// position.
> > > +///
> > > +/// ## Rust references
> > > +///
> > > +/// The reference type `&File` is similar to light refcounts:
> > > +///
> > > +/// * `&File` references don't own a reference count. They can only exist as long as the reference
> > > +/// count stays positive, and can only be created when there is some mechanism in place to ensure
> > > +/// this.
> > > +///
> > > +/// * The Rust borrow-checker normally ensures this by enforcing that the `ARef<File>` from which
> > > +/// a `&File` is created outlives the `&File`.
> > > +///
> > > +/// * Using the unsafe [`File::from_ptr`] means that it is up to the caller to ensure that the
> > > +/// `&File` only exists while the reference count is positive.
> > > +///
> > > +/// * You can think of `fdget` as using an fd to look up an `ARef<File>` in the `struct
> > > +/// files_struct` and create an `&File` from it. The "fd cannot be closed" rule is like the Rust
> > > +/// rule "the `ARef<File>` must outlive the `&File`".
> > > +///
> > > +/// # Invariants
> > > +///
> > > +/// * All instances of this type are refcounted using the `f_count` field.
> > > +/// * If the file sharing mode is `MaybeFdgetPos`, then all active calls to `fdget_pos` that did
> > > +/// not take the `f_pos_lock` mutex must be on the same thread as this `File`.
> > > +/// * If the file sharing mode is `NoFdgetPos`, then there must not be active calls to `fdget_pos`
> > > +/// that did not take the `f_pos_lock` mutex.
> > > +#[repr(transparent)]
> > > +pub struct File<S: FileShareMode> {
> > > + inner: Opaque<bindings::file>,
> > > + _share_mode: PhantomData<S>,
> > > +}
> > > +
> > > +// SAFETY: This file is known to not have any local active `fdget_pos` calls, so it is safe to
> > > +// transfer it between threads.
> > > +unsafe impl Send for File<NoFdgetPos> {}
> > > +
> > > +// SAFETY: This file is known to not have any local active `fdget_pos` calls, so it is safe to
> > > +// access its methods from several threads in parallel.
> > > +unsafe impl Sync for File<NoFdgetPos> {}
> > > +
> > > +/// File methods that only exist under the [`MaybeFdgetPos`] sharing mode.
> > > +impl File<MaybeFdgetPos> {
> > > + /// Constructs a new `struct file` wrapper from a file descriptor.
> > > + ///
> > > + /// The file descriptor belongs to the current process, and there might be active local calls
> > > + /// to `fdget_pos`.
> > > + pub fn fget(fd: u32) -> Result<ARef<File<MaybeFdgetPos>>, BadFdError> {
> > > + // SAFETY: FFI call, there are no requirements on `fd`.
> > > + let ptr = ptr::NonNull::new(unsafe { bindings::fget(fd) }).ok_or(BadFdError)?;
> > > +
> > > + // SAFETY: `bindings::fget` created a refcount, and we pass ownership of it to the `ARef`.
> > > + //
> > > + // INVARIANT: This file is in the fd table on this thread, so either all `fdget_pos` calls
> > > + // are on this thread, or the file is shared, in which case `fdget_pos` calls took the
> > > + // `f_pos_lock` mutex.
> > > + Ok(unsafe { ARef::from_raw(ptr.cast()) })
> > > + }
> >
> > I'm a little unclear how this is supposed to be used. What I
> > specifically struggle with is what function does one have to call to
> > translate from a file descriptor to a file? IOW, where are the actual
> > entry points for turning fds into files? That's what I want to see and
> > that's what we need to make this interface usable generically.
>
> That is File::fget. It takes a file descriptor and returns a long-term
> reference to a file.
Ok.
>
> > Because naively, what I'm looking for is a Rust version of fdget() and
> > fdget_pos() that give me back a File<MaybeFdget> or a
> > File<MaybeFdgetPos>.
> >
> > And then those both implement a get_file() method so the caller can take
> > an explicit long-term reference to the file.
>
> Even if you call `get_file` to get a long-term reference from something
> you have an fdget_pos reference to, that doesn't necessarily mean that
> you can share that long-term reference with other threads. You would
> need to release the fdget_pos reference first. For that reason, the
> long-term reference returned by `get_file` would still need to have the
> `File<MaybeFdgetPos>` type.
So what you're getting at seems to be that some process has a private
file descriptor table and an just opened @fd to a @file that isn't
shared.
fd = open("/my/file");
and then let's say has a random ioctl(fd, SOMETHING_SOMETHING) that
somehow does:
struct fd fd = fdget_pos();
if (!fd.file)
return -EBADF;
We know that process has used a light reference count and that it didn't
acquire f_pos_lock.
Your whole approach seems to assume that after something like this has
happened the same process now offloads that struct file to another
process that somehow ends up doing some other operation on the file that
would also require f_pos_lock to be taken but it doesn't like a read or
something.
To share a file between multiple processes would normally always require
that the process sends that file to another process. That process then
install that fd into its file descriptor table and then later accesses
that file via fdget() again. That's the standard way of doing it -
binder does it that way too. And that's all perfectly fine.
What you would need for this to be a problem is for a process be sent a
struct file from a process that is in the middle of an f_pos_lock scope
and for the receiving process to immediately start doing stuff that
would normally require f_pos_lock.
Like, idk vfs_read(file, ...).
If that's what this is about then yes, there's a close-to-zero but
non-zero possibility that some really stupid code could end up doing
something like this.
Basically, that scenario doesn't exist (as I've mentioned before)
currently. It only exists in a single instance and that's when
pidfd_getfd() is used to steal a file from another task while that task
is in the middle of an f_pos_lock section (I said it before we don't
care about that because non-POSIX interface anyway and we have ptrace
rights anyway. And iiuc that wouldn't even be preventable in your
current scheme because you would need to have the information available
that the struct file you're about to steal from the file descriptor
table is currently within an f_pos_lock section.).
Is it honestly worth encoding all that complexity into rust's file
implementation itself right now? It's barely understandable to
non-rust experts as it is right now. :)
Imho, it would seem a lot more prudent to just have something simpler
for now.
>
> (But you could convert it to a `File<NoFdgetPos>` afterwards. The
> `assume_no_fdget_pos` method performs that conversion.)
>
> As a sidenote, the reason that this patchset does not implement `fdget`
> or `fdget_pos` is that Rust Binder does not use them. Like C Binder, it
Yes, you mentioned.
> just uses `fget` to immediately obtain a long-term reference. I was told
Right and that's why I'm confused why that whole shared_state
machinery is needed in the first place. Because binder does do it
correctly:
* sender registers a bunch of fds to use and takes fget() reference
All other processes that use the same file in their fdtable and rely
on fdget_pos() will see the elevated reference count and acquire
f_pos_lock.
* receiver installs stuff into their fdtable
Receiver can now use fdget_pos() to do reads/writes. Everything's in
order as well.
> that as an exception, Rust code can be merged *before* its user, but
> that we couldn't merge Rust code with no upcoming user. However, I can
> include implementations of `fdget` and `fdget_pos` in the next version
> if you prefer that. After all, it seems rather likely that we will
> eventually have a user for fdget.
>
> > The fget() above is really confusing to me because it always takes a
> > reference on the file that's pointed to by the fd and then it returns a
> > MaybeFdgetPos because presumably you want to indicate that the file
> > descriptor may refer to a file that may or may not be referenced by
> > another thread via fdget()/fdget_pos() already.
>
> No, not another thread. It is because it may or may not be referenced by
> fdget_pos by *the same* thread already.
>
> Here's how I think of it: The `fget` method takes a file descriptor and
> returns a long term reference (an ARef) to a `struct file`. It does not
> return a file descriptor, since it doesn't store anywhere which fd it
> came from.
>
> The `File::fget` method returns a `File<MaybeFdgetPos>` in case *the
> same thread* is also using `fdget_pos` on the same file descriptor. It's
> okay if other threads are using `fdget_pos` because in that case the
> file is already shared, so those other `fdget_pos` calls necessarily the
> f_pos_lock mutex.
>
> Note that since it forgets which fd and fd table it came from, calls to
> `fdget` are actually not a problem for sending our long-term references
> across threads. The `fdget` requirements only care about things that
> touch the entry in the file descriptor table, such as closing the fd.
> The `ARef<File>` type does not provide any methods that could lead to
> that happening, so sharing it across threads is okay *even if* there is
> an light reference. That's why I have an `MaybeFdgetPos` but no
> `MaybeFdget`.
>
> > So I've _skimmed_ the binder RFC and looked at:
> > [email protected]
> > which states:
> >
> > Add support for sending fds over binder.
> >
> > Unlike the other object types, file descriptors are not translated until
> > the transaction is actually received by the recipient. Until that
> > happens, we store `u32::MAX` as the fd.
> >
> > Translating fds is done in a two-phase process. First, the file
> > descriptors are allocated and written to the allocation. Then, once we
> > have allocated all of them, we commit them to the files in question.
> > Using this strategy, we are able to guarantee that we either send all of
> > the fds, or none of them.
> >
> > So I'm curious. How does the binder fd sending work exactly? Because I
> > feel that this is crucial to understand here. Some specific questions:
> >
> > * When file descriptors are passed the reference to these files via
> > fget() are taken _synchronously_, i.e., by the sending task, not the
> > receiver? IOW, is binder_translate_fd() called in the context of the
> > sender or the receiver. I assume it must be the sender because
> > otherwise the sender and receiver must share a file descriptor table
> > in order for the receiver to call fget().
>
> binder_translate_fd is called in the context of the sender.
>
> > * The receiving task then allocates new file descriptors and installs
> > the received files into its file descriptor table?
>
> That happens in binder_apply_fd_fixups, which is called in the context
> of the receiver.
Yes, that's what I thought.
>
> I can see how the sentence "Until that happens, we store `u32::MAX` as
> the fd." is really confusing here. What happens when you send a fd is
> this:
>
> In the sender's ioctl:
> 1. The sender wishes to send a byte array to the recipient. The sender
> tells the kernel that at specific offsets in this array, there are
> some file descriptors that it wishes to send.
>
> 2. The kernel copies the byte array directly into the recipient's
> address space. The offsets in the byte array with file descriptors
> are not copied - instead u32::MAX is written temporarily at those
> offsets.
>
> 3. For each fd being sent, the kernel uses fget to obtain a reference to
> the underlying `struct file`. These pointers are stored in an array.
>
> In the receiver's ioctl:
> 1. Go through the list of `struct file` pointers and create a
> `FileDescriptorReservation` for each.
>
> 2. Go through the list of `struct file` pointers again and `fd_install`
> them into the current thread's fd table. This is infallible due to
> the reservations we just made.
>
> 3. Finally, overwrite the u32::MAX values in the byte array with the
> actual file descriptors that the files were assigned.
>
> This is the same as how it works in C Binder.
Yes, that all seems fine.
>
> > And so basically, what I'm after here is that the binder_translate_fd()
> > that calls fget() is done in the context of the sender and we _know_
> > that the fds can't have light references. Because if they did it could
> > only be by the calling task but they don't since the calling task uses
> > fget() on them. And if the calling task is multi-threaded and another
> > thread has called fdget() or fdget_pos() we know that they have taken
> > their own reference because the file descriptor table is shared.
> >
> > So why is that fget() in here returning a File<MaybeFdgetPos>? This
> > doesn't make sense to me at first glance.
>
> Because when you call `File::fget`, then there could also be a different
> call to `fdget_pos` on the same thread on the same file descriptor.
>
> fdget_pos(my_fd);
> let my_file = File::fget(my_fd)?;
> // oh no!
> send_to_another_thread(my_file);
Ok, that's basically my above example.
>
> In the above code, the file becomes shared even though `fdget_pos` might
> not have taken the `f_pos_lock` mutex. That's not okay. We could end up
> with a data race on the file position.
But a race on f_pos isn't a memory safety issue it's just a POSIX
ordering requirement.
>
> One of the primary design principles of Rust is that, if the user of our
> API has *any* way of using it that could trigger a memory safety
> problem, then we must be able to point at an unsafe block *in the user's
> code* that is at fault. This must be the case no matter how contrived
> the use of the API is.
>
> As a corollary, if the user can trigger memory safety problems with our
> API without using any unsafe blocks, then that is a bug in the API.
> We cannot assign the blame to an unsafe block in the user's code, so the
> blame *must* lie with an unsafe block inside the API.
>
> So, to follow that design principle, I have designed the API in a way
> that prevents the above data race. Concretely, because the
> `File<MaybeFdgetPos>` type is not thread safe (or in Rust terms "is not
> Send"), it's not possible to send values of that type across thread
> boundaries. E.g., our `send_to_another_thread` would have a requirement
> in its signature saying that it can only be called with types that are
> thread safe, so calling it with a type that isn't results in a type
> error.
>
>
>
> Now, what if you *want* to send it to another thread? Let's consider
> Rust Binder, which needs to do exactly that. The relevant code in Rust
> Binder would need to be updated to look like this:
>
> let file = File::fget(my_fd)?;
> // SAFETY: We know that there are no active `fdget_pos` calls on
> // the current thread, since this is an ioctl and we have not
> // called `fdget_pos` inside the Binder driver.
> let thread_safe_file = unsafe { file.assume_no_fdget_pos() };
>
> (search for File::from_fd in the RFC to find where this would go)
>
> The `assume_no_fdget_pos` call has no effect at runtime - it is purely a
> compile-time thing to force the user to use unsafe to "promise" that
> there aren't any `fdget_pos` calls on the same fd.
>
> If Rust Binder uses `assume_no_fdget_pos` and ends up triggering memory
> unsafety because it sent a file to another thread, then we can point to
> the unsafe block that calls `assume_no_fdget_pos` and say "that unsafe
> block is at fault because it assumed that there was no `fdget_pos` call,
> but that assumption was false."
>
> > > + /// Assume that there are no active `fdget_pos` calls that prevent us from sharing this file.
> > > + ///
> > > + /// This makes it safe to transfer this file to other threads. No checks are performed, and
> > > + /// using it incorrectly may lead to a data race on the file position if the file is shared
> > > + /// with another thread.
> > > + ///
> > > + /// This method is intended to be used together with [`File::fget`] when the caller knows
> > > + /// statically that there are no `fdget_pos` calls on the current thread. For example, you
> > > + /// might use it when calling `fget` from an ioctl, since ioctls usually do not touch the file
> > > + /// position.
> > > + ///
> > > + /// # Safety
> > > + ///
> > > + /// There must not be any active `fdget_pos` calls on the current thread.
> > > + pub unsafe fn assume_no_fdget_pos(me: ARef<Self>) -> ARef<File<NoFdgetPos>> {
> > > + // INVARIANT: There are no `fdget_pos` calls on the current thread, and by the type
> > > + // invariants, if there is a `fdget_pos` call on another thread, then it took the
> > > + // `f_pos_lock` mutex.
> > > + //
> > > + // SAFETY: `File<MaybeFdgetPos>` and `File<NoFdgetPos>` have the same layout.
> > > + unsafe { ARef::from_raw(ARef::into_raw(me).cast()) }
> > > + }
> > > +}
> > > +
> > > +/// File methods that exist under all sharing modes.
> > > +impl<S: FileShareMode> File<S> {
> > > + /// Creates a reference to a [`File`] from a valid pointer.
> > > + ///
> > > + /// # Safety
> > > + ///
> > > + /// * The caller must ensure that `ptr` points at a valid file and that the file's refcount is
> > > + /// positive for the duration of 'a.
> > > + /// * The caller must ensure that the requirements for using the chosen file sharing mode are
> > > + /// upheld.
> > > + pub unsafe fn from_ptr<'a>(ptr: *const bindings::file) -> &'a File<S> {
> >
> > I think I requested from_raw_file() in the last revision?
>
> Ah, yeah, I totally forgot about this. I'll make the change in the next
> version.
>
> > > + /// Returns a raw pointer to the inner C struct.
> > > + #[inline]
> > > + pub fn as_ptr(&self) -> *mut bindings::file {
> >
> > And that was supposed to be into_raw_file() or as_raw_file()?
>
> Per the Rust API guidelines [1], this should be `as_raw_file`. The
> `into_*` prefix is for conversions that destroy the object you call it
> on. (E.g., because it takes ownership of the underlying allocation or
> refcount.)
>
> As always, thank you for the very detailed questions!
>
> Alice
>
> [1]: https://rust-lang.github.io/api-guidelines/naming.html#ad-hoc-conversions-follow-as_-to_-into_-conventions-c-conv
On Sat, May 25, 2024 at 01:33:05AM +0100, Al Viro wrote:
> FWIW, fdget()...fdput() form a scope. The file reference _in_ that
> struct fd is just a normal file reference, period.
>
> You can pass it to a function as an argument, etc. You certainly can
> clone it (with get_file()).
>
> The rules are basically "you can't spawn threads with CLONE_FILES inside
> the scope and you can't remove reference in your descriptor table while
> in scope". The value in fd.file is guaranteed to stay with positive
> refcount through the entire scope, just as if you had
>
> {
> struct file *f = fget(n);
>
> if (!f)
> return -EBADF;
>
> ...
>
> fput(f);
> }
>
> The rules for access are exactly the same - you can pass f to a function
> called from the scope, you can use it while in scope, you can clone it
> and store it somewhere, etc.
If anything, fd = fdget(N) is "clone or borrow the file reference from the
Nth slot of your descriptor table into fd.file and record whether it had
been cloned or borrowed into fd.flags".
The rules for what can't be done in the scope of fd are there to guarantee
that the reference _can_ be borrowed in the common case when descriptor table
is not shared.
The tricks with calling conventions of __fdget() (it returns both the
reference and flags in single unsigned long, with flags in the lowest
bits) are implementation details; those should stay hidden from anyone
who uses struct fd.
Incidentally, there's no "light refcount" - "light references" are simply
the ones that had been borrowed from the descriptor table rather than
having them cloned.
One more thing: we never get fd.file == NULL && fd.flags != 0 - that
combination is never generated (NULL couldn't have been cloned).
As the result, if fd.file is NULL, fdput(fd) is a no-op. Most of the
places where we use struct fd are making use of that -
fd = fdget(n);
if (fd.file) {
do something
fdput(fd);
}
is equivalent to
fd = fdget(n);
if (fd.file)
do something
fdput(fd);
and the former is more common way to spell it. In particular,
fd = fdget(n);
if (!fd.file)
return -EBADF;
error = do_something(fd.file);
fdput(fd);
is often convenient and very common. We could go for
CLASS(fd, fd)(n);
if (!fd.file)
return -EBADF;
return do_something(fd.file);
and let the compiler add fdput(fd) whenever it goes out of scope,
but that gets clumsy - we'd end up with a plenty of declarations
in the middle of blocks that way, and for C that looks wrong.
There are very few places where struct fd does not come from
fdget()/fdget_raw()/fdget_pos(). One variety is "initialize
it to NULL,0, then possibly replace with fdget() result;
unconditional fdput() will be safe either way" (a couple of places).
Another is overlayfs ovl_real_fdget() and ovl_real_fdget_meta().
Those two are arguably too clever for their readers' good.
It's still "borrow or clone, and remember which one had it been",
but that's mixed with returning errors:
err = ovl_read_fdget(file, &fd);
may construct and store a junk value in fd if err is non-zero.
Not a bug, but only because all callers remember to ignore that
value in such case.
Another inconvenient bit is this:
static int vmsplice_type(struct fd f, int *type)
{
if (!f.file)
return -EBADF;
if (f.file->f_mode & FMODE_WRITE) {
*type = ITER_SOURCE;
} else if (f.file->f_mode & FMODE_READ) {
*type = ITER_DEST;
} else {
fdput(f);
return -EBADF;
}
return 0;
}
It's a move if an error had been returned and borrow otherwise.
There's a couple of other examples of the same sort.
It might be better to lift fdput() into the callers (or, in this
case, just fold the entire sucker into its sole caller).
Finally, there's a non-obvious thing in net/socket.c -
sockfd_lookup_light()..fput_light(). What happens is that
if sock_from_file(f) is non-NULL, we are guaranteed that
sock_from_file(f)->file == f (and that reference does not
contribute to refcount of f). So we do the same clone-or-borrow
thing, but we only keep the socket for the rest of operation;
when it comes to undoing the clone-or-borrow, we do that manually
and use sock->file to reconstruct the file pointer.
It's still an equivalent of fdget()/fdput() pair, but it slightly
reduces a register pressure in some very hot paths; hell knows
if it's warranted these days. Avoiding an unconditional clone
in there is a really important part, but not keeping the struct
file reference in a local variable through the syscall...
probably not so much. It is very localized - nothing of that
sort is done outside of net/socket.c (we probably ought to
move fput_light() over there - no other users).
That's about it as far as struct fd is concerned...
Christian Brauner <[email protected]> writes:
> On Fri, May 24, 2024 at 07:17:13PM +0000, Alice Ryhl wrote:
> > On Fri, May 24, 2024 at 6:12 PM Christian Brauner <[email protected]> wrote:
> > >
> > > On Fri, May 17, 2024 at 09:30:36AM +0000, Alice Ryhl wrote:
> > > > From: Wedson Almeida Filho <[email protected]>
> > > >
> > > > This abstraction makes it possible to manipulate the open files for a
> > > > process. The new `File` struct wraps the C `struct file`. When accessing
> > > > it using the smart pointer `ARef<File>`, the pointer will own a
> > > > reference count to the file. When accessing it as `&File`, then the
> > > > reference does not own a refcount, but the borrow checker will ensure
> > > > that the reference count does not hit zero while the `&File` is live.
> > > >
> > > > Since this is intended to manipulate the open files of a process, we
> > > > introduce an `fget` constructor that corresponds to the C `fget`
> > > > method. In future patches, it will become possible to create a new fd in
> > > > a process and bind it to a `File`. Rust Binder will use these to send
> > > > fds from one process to another.
> > > >
> > > > We also provide a method for accessing the file's flags. Rust Binder
> > > > will use this to access the flags of the Binder fd to check whether the
> > > > non-blocking flag is set, which affects what the Binder ioctl does.
> > > >
> > > > This introduces a struct for the EBADF error type, rather than just
> > > > using the Error type directly. This has two advantages:
> > > > * `File::fget` returns a `Result<ARef<File>, BadFdError>`, which the
> > > > compiler will represent as a single pointer, with null being an error.
> > > > This is possible because the compiler understands that `BadFdError`
> > > > has only one possible value, and it also understands that the
> > > > `ARef<File>` smart pointer is guaranteed non-null.
> > > > * Additionally, we promise to users of the method that the method can
> > > > only fail with EBADF, which means that they can rely on this promise
> > > > without having to inspect its implementation.
> > > > That said, there are also two disadvantages:
> > > > * Defining additional error types involves boilerplate.
> > > > * The question mark operator will only utilize the `From` trait once,
> > > > which prevents you from using the question mark operator on
> > > > `BadFdError` in methods that return some third error type that the
> > > > kernel `Error` is convertible into. (However, it works fine in methods
> > > > that return `Error`.)
> > > >
> > > > Signed-off-by: Wedson Almeida Filho <[email protected]>
> > > > Co-developed-by: Daniel Xu <[email protected]>
> > > > Signed-off-by: Daniel Xu <[email protected]>
> > > > Co-developed-by: Alice Ryhl <[email protected]>
> > > > Reviewed-by: Martin Rodriguez Reboredo <[email protected]>
> > > > Reviewed-by: Trevor Gross <[email protected]>
> > > > Reviewed-by: Benno Lossin <[email protected]>
> > > > Signed-off-by: Alice Ryhl <[email protected]>
> > > > ---
> > > > fs/file.c | 7 +
> > > > rust/bindings/bindings_helper.h | 2 +
> > > > rust/helpers.c | 7 +
> > > > rust/kernel/file.rs | 330 ++++++++++++++++++++++++++++++++++++++++
> > > > rust/kernel/lib.rs | 1 +
> > > > rust/kernel/types.rs | 8 +
> > > > 6 files changed, 355 insertions(+)
> > > >
> > > > diff --git a/fs/file.c b/fs/file.c
> > > > index 3b683b9101d8..f2eab5fcb87f 100644
> > > > --- a/fs/file.c
> > > > +++ b/fs/file.c
> > > > @@ -1127,6 +1127,13 @@ EXPORT_SYMBOL(task_lookup_next_fdget_rcu);
> > > > *
> > > > * The fput_needed flag returned by fget_light should be passed to the
> > > > * corresponding fput_light.
> > > > + *
> > > > + * (As an exception to rule 2, you can call filp_close between fget_light and
> > > > + * fput_light provided that you capture a real refcount with get_file before
> > > > + * the call to filp_close, and ensure that this real refcount is fput *after*
> > > > + * the fput_light call.)
> > > > + *
> > > > + * See also the documentation in rust/kernel/file.rs.
> > > > */
> > > > static unsigned long __fget_light(unsigned int fd, fmode_t mask)
> > > > {
> > > > diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
> > > > index ddb5644d4fd9..541afef7ddc4 100644
> > > > --- a/rust/bindings/bindings_helper.h
> > > > +++ b/rust/bindings/bindings_helper.h
> > > > @@ -9,6 +9,8 @@
> > > > #include <kunit/test.h>
> > > > #include <linux/errname.h>
> > > > #include <linux/ethtool.h>
> > > > +#include <linux/file.h>
> > > > +#include <linux/fs.h>
> > > > #include <linux/jiffies.h>
> > > > #include <linux/mdio.h>
> > > > #include <linux/phy.h>
> > > > diff --git a/rust/helpers.c b/rust/helpers.c
> > > > index 4c8b7b92a4f4..5545a00560d1 100644
> > > > --- a/rust/helpers.c
> > > > +++ b/rust/helpers.c
> > > > @@ -25,6 +25,7 @@
> > > > #include <linux/build_bug.h>
> > > > #include <linux/err.h>
> > > > #include <linux/errname.h>
> > > > +#include <linux/fs.h>
> > > > #include <linux/mutex.h>
> > > > #include <linux/refcount.h>
> > > > #include <linux/sched/signal.h>
> > > > @@ -157,6 +158,12 @@ void rust_helper_init_work_with_key(struct work_struct *work, work_func_t func,
> > > > }
> > > > EXPORT_SYMBOL_GPL(rust_helper_init_work_with_key);
> > > >
> > > > +struct file *rust_helper_get_file(struct file *f)
> > > > +{
> > > > + return get_file(f);
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(rust_helper_get_file);
> > > > +
> > > > /*
> > > > * `bindgen` binds the C `size_t` type as the Rust `usize` type, so we can
> > > > * use it in contexts where Rust expects a `usize` like slice (array) indices.
> > > > diff --git a/rust/kernel/file.rs b/rust/kernel/file.rs
> > > > new file mode 100644
> > > > index 000000000000..ad881e67084c
> > > > --- /dev/null
> > > > +++ b/rust/kernel/file.rs
> > > > @@ -0,0 +1,330 @@
> > > > +// SPDX-License-Identifier: GPL-2.0
> > > > +
> > > > +//! Files and file descriptors.
> > > > +//!
> > > > +//! C headers: [`include/linux/fs.h`](srctree/include/linux/fs.h) and
> > > > +//! [`include/linux/file.h`](srctree/include/linux/file.h)
> > > > +
> > > > +use crate::{
> > > > + bindings,
> > > > + error::{code::*, Error, Result},
> > > > + types::{ARef, AlwaysRefCounted, Opaque},
> > > > +};
> > > > +use core::{marker::PhantomData, ptr};
> > > > +
> > > > +/// Flags associated with a [`File`].
> > > > +pub mod flags {
> > > > + /// File is opened in append mode.
> > > > + pub const O_APPEND: u32 = bindings::O_APPEND;
> > > > +
> > > > + /// Signal-driven I/O is enabled.
> > > > + pub const O_ASYNC: u32 = bindings::FASYNC;
> > > > +
> > > > + /// Close-on-exec flag is set.
> > > > + pub const O_CLOEXEC: u32 = bindings::O_CLOEXEC;
> > > > +
> > > > + /// File was created if it didn't already exist.
> > > > + pub const O_CREAT: u32 = bindings::O_CREAT;
> > > > +
> > > > + /// Direct I/O is enabled for this file.
> > > > + pub const O_DIRECT: u32 = bindings::O_DIRECT;
> > > > +
> > > > + /// File must be a directory.
> > > > + pub const O_DIRECTORY: u32 = bindings::O_DIRECTORY;
> > > > +
> > > > + /// Like [`O_SYNC`] except metadata is not synced.
> > > > + pub const O_DSYNC: u32 = bindings::O_DSYNC;
> > > > +
> > > > + /// Ensure that this file is created with the `open(2)` call.
> > > > + pub const O_EXCL: u32 = bindings::O_EXCL;
> > > > +
> > > > + /// Large file size enabled (`off64_t` over `off_t`).
> > > > + pub const O_LARGEFILE: u32 = bindings::O_LARGEFILE;
> > > > +
> > > > + /// Do not update the file last access time.
> > > > + pub const O_NOATIME: u32 = bindings::O_NOATIME;
> > > > +
> > > > + /// File should not be used as process's controlling terminal.
> > > > + pub const O_NOCTTY: u32 = bindings::O_NOCTTY;
> > > > +
> > > > + /// If basename of path is a symbolic link, fail open.
> > > > + pub const O_NOFOLLOW: u32 = bindings::O_NOFOLLOW;
> > > > +
> > > > + /// File is using nonblocking I/O.
> > > > + pub const O_NONBLOCK: u32 = bindings::O_NONBLOCK;
> > > > +
> > > > + /// File is using nonblocking I/O.
> > > > + ///
> > > > + /// This is effectively the same flag as [`O_NONBLOCK`] on all architectures
> > > > + /// except SPARC64.
> > > > + pub const O_NDELAY: u32 = bindings::O_NDELAY;
> > > > +
> > > > + /// Used to obtain a path file descriptor.
> > > > + pub const O_PATH: u32 = bindings::O_PATH;
> > > > +
> > > > + /// Write operations on this file will flush data and metadata.
> > > > + pub const O_SYNC: u32 = bindings::O_SYNC;
> > > > +
> > > > + /// This file is an unnamed temporary regular file.
> > > > + pub const O_TMPFILE: u32 = bindings::O_TMPFILE;
> > > > +
> > > > + /// File should be truncated to length 0.
> > > > + pub const O_TRUNC: u32 = bindings::O_TRUNC;
> > > > +
> > > > + /// Bitmask for access mode flags.
> > > > + ///
> > > > + /// # Examples
> > > > + ///
> > > > + /// ```
> > > > + /// use kernel::file;
> > > > + /// # fn do_something() {}
> > > > + /// # let flags = 0;
> > > > + /// if (flags & file::flags::O_ACCMODE) == file::flags::O_RDONLY {
> > > > + /// do_something();
> > > > + /// }
> > > > + /// ```
> > > > + pub const O_ACCMODE: u32 = bindings::O_ACCMODE;
> > > > +
> > > > + /// File is read only.
> > > > + pub const O_RDONLY: u32 = bindings::O_RDONLY;
> > > > +
> > > > + /// File is write only.
> > > > + pub const O_WRONLY: u32 = bindings::O_WRONLY;
> > > > +
> > > > + /// File can be both read and written.
> > > > + pub const O_RDWR: u32 = bindings::O_RDWR;
> > > > +}
> > > > +
> > > > +/// Compile-time information for keeping track of how a [`File`] is shared.
> > > > +///
> > > > +/// The `fdget_pos` method can be used to access the file's position without taking `f_pos_lock`,
> > > > +/// as long as the file is not shared with any other threads. During such calls to `fdget_pos`, the
> > > > +/// file must remain non-shared, so it must not be possible to move the file to another thread. For
> > > > +/// example, if the file is moved to another thread, then it could be passed to `fd_install`, at
> > > > +/// which point the remote process could touch the file position.
> > > > +///
> > > > +/// The share mode only keeps track of whether there are active `fdget_pos` calls that did not take
> > > > +/// the `f_pos_lock`, and does not keep track of `fdget` calls. This is okay because `fdget` does
> > > > +/// not care about the refcount of the underlying `struct file`; as long as the entry in the
> > > > +/// current thread's fd table does not get removed, it's okay to share the file. For example,
> > > > +/// `fd_install`ing the `struct file` into another process is okay during an `fdget` call, because
> > > > +/// the other process can't touch the fd table of the original process.
> > > > +mod share_mode {
> > > > + /// Trait implemented by the two sharing modes that a file might have.
> > > > + pub trait FileShareMode {}
> > > > +
> > > > + /// Represents a file for which there might be an active call to `fdget_pos` that did not take
> > > > + /// the `f_pos_lock` lock.
> > > > + pub enum MaybeFdgetPos {}
> > > > +
> > > > + /// Represents a file for which it is known that all active calls to `fdget_pos` (if any) took
> > > > + /// the `f_pos_lock` lock.
> > > > + pub enum NoFdgetPos {}
> > > > +
> > > > + impl FileShareMode for MaybeFdgetPos {}
> > > > + impl FileShareMode for NoFdgetPos {}
> > > > +}
> > > > +pub use self::share_mode::{FileShareMode, MaybeFdgetPos, NoFdgetPos};
> > > > +
> > > > +/// Wraps the kernel's `struct file`.
> > > > +///
> > > > +/// This represents an open file rather than a file on a filesystem. Processes generally reference
> > > > +/// open files using file descriptors. However, file descriptors are not the same as files. A file
> > > > +/// descriptor is just an integer that corresponds to a file, and a single file may be referenced
> > > > +/// by multiple file descriptors.
> > > > +///
> > > > +/// # Refcounting
> > > > +///
> > > > +/// Instances of this type are reference-counted. The reference count is incremented by the
> > > > +/// `fget`/`get_file` functions and decremented by `fput`. The Rust type `ARef<File>` represents a
> > > > +/// pointer that owns a reference count on the file.
> > > > +///
> > > > +/// Whenever a process opens a file descriptor (fd), it stores a pointer to the file in its fd
> > > > +/// table (`struct files_struct`). This pointer owns a reference count to the file, ensuring the
> > > > +/// file isn't prematurely deleted while the file descriptor is open. In Rust terminology, the
> > > > +/// pointers in `struct files_struct` are `ARef<File>` pointers.
> > > > +///
> > > > +/// ## Light refcounts
> > > > +///
> > > > +/// Whenever a process has an fd to a file, it may use something called a "light refcount" as a
> > > > +/// performance optimization. Light refcounts are acquired by calling `fdget` and released with
> > > > +/// `fdput`. The idea behind light refcounts is that if the fd is not closed between the calls to
> > > > +/// `fdget` and `fdput`, then the refcount cannot hit zero during that time, as the `struct
> > > > +/// files_struct` holds a reference until the fd is closed. This means that it's safe to access the
> > > > +/// file even if `fdget` does not increment the refcount.
> > > > +///
> > > > +/// The requirement that the fd is not closed during a light refcount applies globally across all
> > > > +/// threads - not just on the thread using the light refcount. For this reason, light refcounts are
> > > > +/// only used when the `struct files_struct` is not shared with other threads, since this ensures
> > > > +/// that other unrelated threads cannot suddenly start using the fd and close it. Therefore,
> > > > +/// calling `fdget` on a shared `struct files_struct` creates a normal refcount instead of a light
> > > > +/// refcount.
> > > > +///
> > > > +/// Light reference counts must be released with `fdput` before the system call returns to
> > > > +/// userspace. This means that if you wait until the current system call returns to userspace, then
> > > > +/// all light refcounts that existed at the time have gone away.
> > >
> > > You obviously are aware of this but I'm just spelling it out. Iirc,
> > > there will practically only ever be one light refcount per file.
> > >
> > > For a light refcount to be used we know that the file descriptor table
> > > isn't shared with any other task. So there are no threads that could
> > > concurrently access the file descriptor table. We also know that the
> > > file descriptor table cannot become shared while we're in system call
> > > context because the caller can't create new threads and they can't
> > > unshare the file descriptor table.
> > >
> > > So there's only one fdget() caller (Yes, they could call fdget()
> > > multiple times and then have to do fdput() multiple times but that's a
> > > level of weirdness that we don't need to worry about.).
> >
> > Hmm. Is it not the case that different processes with different file
> > descriptor tables could reference the same underlying `struct file` and
> > both use light refcounts to do so, as long as each fd table is not
> > shared? So there could be multiple light refcounts to the same `struct
> > file` at the same time on different threads.
>
> That's correct.
> But I misread what you were trying to say then. I thought you were
> talking about multiple light references from the same thread which is
> rather rare and should only happen in system calls that take two fds.
>
> But what you're talking about is the same struct file being present in
> separate file descriptor tables and referenced multiple times from
> different threads so in that sense we have multiple light references.
> And that's obviously correct.
>
> >
> > And this does *not* apply to `fdget_pos`, which checks the refcount of
> > the `struct file` instead of the refcount of the fd table.
>
> I have only skimmed the replies down thread so far but I saw that this
> has mostly been clarified. The reference counting between fdget() and
> fdget_pos() is identical. fdget_pos() calls fdget() after all.
>
> What's different is that if the file is already shared among different
> processes then even if a light reference was taken by the caller because
> it doesn't share the file descriptor table fdget_pos() may still acquire
> the f_pos_lock because the struct file is referenced by multiple
> processes.
>
> IOW, you could have the same struct file in the file descriptor tables
> of 10 processes. So the f_count would be 10. Assume all of them
> concurrently call read(), then none of them will bump f_count because
> fdget() sees that the file descriptor tables aren't shared.
>
> But all 10 of them will serialize on f_pos_lock. So that's really
> separate from light refcounting. If you have to acquire f_pos_lock from
> within the same thread then fdget*() always take normal reference
> counts.
See my reply to Al Viro.
https://lore.kernel.org/r/[email protected]
> > > > +///
> > > > +/// ### The file position
> > > > +///
> > > > +/// Each `struct file` has a position integer, which is protected by the `f_pos_lock` mutex.
> > > > +/// However, if the `struct file` is not shared, then the kernel may avoid taking the lock as a
> > > > +/// performance optimization.
> > > > +///
> > > > +/// The condition for avoiding the `f_pos_lock` mutex is different from the condition for using
> > > > +/// `fdget`. With `fdget`, you may avoid incrementing the refcount as long as the current fd table
> > > > +/// is not shared; it is okay if there are other fd tables that also reference the same `struct
> > > > +/// file`. However, `fdget_pos` can only avoid taking the `f_pos_lock` if the entire `struct file`
> > > > +/// is not shared, as different processes with an fd to the same `struct file` share the same
> > > > +/// position.
> > > > +///
> > > > +/// ## Rust references
> > > > +///
> > > > +/// The reference type `&File` is similar to light refcounts:
> > > > +///
> > > > +/// * `&File` references don't own a reference count. They can only exist as long as the reference
> > > > +/// count stays positive, and can only be created when there is some mechanism in place to ensure
> > > > +/// this.
> > > > +///
> > > > +/// * The Rust borrow-checker normally ensures this by enforcing that the `ARef<File>` from which
> > > > +/// a `&File` is created outlives the `&File`.
> > > > +///
> > > > +/// * Using the unsafe [`File::from_ptr`] means that it is up to the caller to ensure that the
> > > > +/// `&File` only exists while the reference count is positive.
> > > > +///
> > > > +/// * You can think of `fdget` as using an fd to look up an `ARef<File>` in the `struct
> > > > +/// files_struct` and create an `&File` from it. The "fd cannot be closed" rule is like the Rust
> > > > +/// rule "the `ARef<File>` must outlive the `&File`".
> > > > +///
> > > > +/// # Invariants
> > > > +///
> > > > +/// * All instances of this type are refcounted using the `f_count` field.
> > > > +/// * If the file sharing mode is `MaybeFdgetPos`, then all active calls to `fdget_pos` that did
> > > > +/// not take the `f_pos_lock` mutex must be on the same thread as this `File`.
> > > > +/// * If the file sharing mode is `NoFdgetPos`, then there must not be active calls to `fdget_pos`
> > > > +/// that did not take the `f_pos_lock` mutex.
> > > > +#[repr(transparent)]
> > > > +pub struct File<S: FileShareMode> {
> > > > + inner: Opaque<bindings::file>,
> > > > + _share_mode: PhantomData<S>,
> > > > +}
> > > > +
> > > > +// SAFETY: This file is known to not have any local active `fdget_pos` calls, so it is safe to
> > > > +// transfer it between threads.
> > > > +unsafe impl Send for File<NoFdgetPos> {}
> > > > +
> > > > +// SAFETY: This file is known to not have any local active `fdget_pos` calls, so it is safe to
> > > > +// access its methods from several threads in parallel.
> > > > +unsafe impl Sync for File<NoFdgetPos> {}
> > > > +
> > > > +/// File methods that only exist under the [`MaybeFdgetPos`] sharing mode.
> > > > +impl File<MaybeFdgetPos> {
> > > > + /// Constructs a new `struct file` wrapper from a file descriptor.
> > > > + ///
> > > > + /// The file descriptor belongs to the current process, and there might be active local calls
> > > > + /// to `fdget_pos`.
> > > > + pub fn fget(fd: u32) -> Result<ARef<File<MaybeFdgetPos>>, BadFdError> {
> > > > + // SAFETY: FFI call, there are no requirements on `fd`.
> > > > + let ptr = ptr::NonNull::new(unsafe { bindings::fget(fd) }).ok_or(BadFdError)?;
> > > > +
> > > > + // SAFETY: `bindings::fget` created a refcount, and we pass ownership of it to the `ARef`.
> > > > + //
> > > > + // INVARIANT: This file is in the fd table on this thread, so either all `fdget_pos` calls
> > > > + // are on this thread, or the file is shared, in which case `fdget_pos` calls took the
> > > > + // `f_pos_lock` mutex.
> > > > + Ok(unsafe { ARef::from_raw(ptr.cast()) })
> > > > + }
> > >
> > > I'm a little unclear how this is supposed to be used. What I
> > > specifically struggle with is what function does one have to call to
> > > translate from a file descriptor to a file? IOW, where are the actual
> > > entry points for turning fds into files? That's what I want to see and
> > > that's what we need to make this interface usable generically.
> >
> > That is File::fget. It takes a file descriptor and returns a long-term
> > reference to a file.
>
> Ok.
>
> >
> > > Because naively, what I'm looking for is a Rust version of fdget() and
> > > fdget_pos() that give me back a File<MaybeFdget> or a
> > > File<MaybeFdgetPos>.
> > >
> > > And then those both implement a get_file() method so the caller can take
> > > an explicit long-term reference to the file.
> >
> > Even if you call `get_file` to get a long-term reference from something
> > you have an fdget_pos reference to, that doesn't necessarily mean that
> > you can share that long-term reference with other threads. You would
> > need to release the fdget_pos reference first. For that reason, the
> > long-term reference returned by `get_file` would still need to have the
> > `File<MaybeFdgetPos>` type.
>
> So what you're getting at seems to be that some process has a private
> file descriptor table and an just opened @fd to a @file that isn't
> shared.
>
> fd = open("/my/file");
>
> and then let's say has a random ioctl(fd, SOMETHING_SOMETHING) that
> somehow does:
>
> struct fd fd = fdget_pos();
> if (!fd.file)
> return -EBADF;
>
> We know that process has used a light reference count and that it didn't
> acquire f_pos_lock.
>
> Your whole approach seems to assume that after something like this has
> happened the same process now offloads that struct file to another
> process that somehow ends up doing some other operation on the file that
> would also require f_pos_lock to be taken but it doesn't like a read or
> something.
Can we not have a data race even if the other process *does* take the
f_pos_lock mutex? The current thread did not take the mutex, so if the
current thread touches the file position after sending the file
reference, then that could race with the other process even if the
other process takes f_pos_lock.
> To share a file between multiple processes would normally always require
> that the process sends that file to another process. That process then
> install that fd into its file descriptor table and then later accesses
> that file via fdget() again. That's the standard way of doing it -
> binder does it that way too. And that's all perfectly fine.
And similarly, if the remote process installs the file, returns to
userspace, and then userspace calls back into the kernel, which enters
an fdget_pos scope and modifies the file position. Then this can also
race on the file position if the original process changes the file
position it after sending the file reference.
*That's* the data race that this is trying to prevent.
> What you would need for this to be a problem is for a process be sent a
> struct file from a process that is in the middle of an f_pos_lock scope
> and for the receiving process to immediately start doing stuff that
> would normally require f_pos_lock.
>
> Like, idk vfs_read(file, ...).
>
> If that's what this is about then yes, there's a close-to-zero but
> non-zero possibility that some really stupid code could end up doing
> something like this.
>
> Basically, that scenario doesn't exist (as I've mentioned before)
> currently. It only exists in a single instance and that's when
> pidfd_getfd() is used to steal a file from another task while that task
> is in the middle of an f_pos_lock section (I said it before we don't
> care about that because non-POSIX interface anyway and we have ptrace
> rights anyway. And iiuc that wouldn't even be preventable in your
> current scheme because you would need to have the information available
> that the struct file you're about to steal from the file descriptor
> table is currently within an f_pos_lock section.).
>
> Is it honestly worth encoding all that complexity into rust's file
> implementation itself right now? It's barely understandable to
> non-rust experts as it is right now. :)
>
> Imho, it would seem a lot more prudent to just have something simpler
> for now.
The purpose of the changes I've made are to prevent data races on the
file position. If we go back to what we had before, then the API does
not make it impossible for users of the API to cause such data races.
That is the tradeoff.
> > (But you could convert it to a `File<NoFdgetPos>` afterwards. The
> > `assume_no_fdget_pos` method performs that conversion.)
> >
> > As a sidenote, the reason that this patchset does not implement `fdget`
> > or `fdget_pos` is that Rust Binder does not use them. Like C Binder, it
>
> Yes, you mentioned.
>
> > just uses `fget` to immediately obtain a long-term reference. I was told
>
> Right and that's why I'm confused why that whole shared_state
> machinery is needed in the first place. Because binder does do it
> correctly:
>
> * sender registers a bunch of fds to use and takes fget() reference
> All other processes that use the same file in their fdtable and rely
> on fdget_pos() will see the elevated reference count and acquire
> f_pos_lock.
> * receiver installs stuff into their fdtable
> Receiver can now use fdget_pos() to do reads/writes. Everything's in
> order as well.
>
> > that as an exception, Rust code can be merged *before* its user, but
> > that we couldn't merge Rust code with no upcoming user. However, I can
> > include implementations of `fdget` and `fdget_pos` in the next version
> > if you prefer that. After all, it seems rather likely that we will
> > eventually have a user for fdget.
> >
> > > The fget() above is really confusing to me because it always takes a
> > > reference on the file that's pointed to by the fd and then it returns a
> > > MaybeFdgetPos because presumably you want to indicate that the file
> > > descriptor may refer to a file that may or may not be referenced by
> > > another thread via fdget()/fdget_pos() already.
> >
> > No, not another thread. It is because it may or may not be referenced by
> > fdget_pos by *the same* thread already.
> >
> > Here's how I think of it: The `fget` method takes a file descriptor and
> > returns a long term reference (an ARef) to a `struct file`. It does not
> > return a file descriptor, since it doesn't store anywhere which fd it
> > came from.
> >
> > The `File::fget` method returns a `File<MaybeFdgetPos>` in case *the
> > same thread* is also using `fdget_pos` on the same file descriptor. It's
> > okay if other threads are using `fdget_pos` because in that case the
> > file is already shared, so those other `fdget_pos` calls necessarily the
> > f_pos_lock mutex.
> >
> > Note that since it forgets which fd and fd table it came from, calls to
> > `fdget` are actually not a problem for sending our long-term references
> > across threads. The `fdget` requirements only care about things that
> > touch the entry in the file descriptor table, such as closing the fd.
> > The `ARef<File>` type does not provide any methods that could lead to
> > that happening, so sharing it across threads is okay *even if* there is
> > an light reference. That's why I have an `MaybeFdgetPos` but no
> > `MaybeFdget`.
> >
> > > So I've _skimmed_ the binder RFC and looked at:
> > > [email protected]
> > > which states:
> > >
> > > Add support for sending fds over binder.
> > >
> > > Unlike the other object types, file descriptors are not translated until
> > > the transaction is actually received by the recipient. Until that
> > > happens, we store `u32::MAX` as the fd.
> > >
> > > Translating fds is done in a two-phase process. First, the file
> > > descriptors are allocated and written to the allocation. Then, once we
> > > have allocated all of them, we commit them to the files in question.
> > > Using this strategy, we are able to guarantee that we either send all of
> > > the fds, or none of them.
> > >
> > > So I'm curious. How does the binder fd sending work exactly? Because I
> > > feel that this is crucial to understand here. Some specific questions:
> > >
> > > * When file descriptors are passed the reference to these files via
> > > fget() are taken _synchronously_, i.e., by the sending task, not the
> > > receiver? IOW, is binder_translate_fd() called in the context of the
> > > sender or the receiver. I assume it must be the sender because
> > > otherwise the sender and receiver must share a file descriptor table
> > > in order for the receiver to call fget().
> >
> > binder_translate_fd is called in the context of the sender.
> >
> > > * The receiving task then allocates new file descriptors and installs
> > > the received files into its file descriptor table?
> >
> > That happens in binder_apply_fd_fixups, which is called in the context
> > of the receiver.
>
> Yes, that's what I thought.
>
> >
> > I can see how the sentence "Until that happens, we store `u32::MAX` as
> > the fd." is really confusing here. What happens when you send a fd is
> > this:
> >
> > In the sender's ioctl:
> > 1. The sender wishes to send a byte array to the recipient. The sender
> > tells the kernel that at specific offsets in this array, there are
> > some file descriptors that it wishes to send.
> >
> > 2. The kernel copies the byte array directly into the recipient's
> > address space. The offsets in the byte array with file descriptors
> > are not copied - instead u32::MAX is written temporarily at those
> > offsets.
> >
> > 3. For each fd being sent, the kernel uses fget to obtain a reference to
> > the underlying `struct file`. These pointers are stored in an array.
> >
> > In the receiver's ioctl:
> > 1. Go through the list of `struct file` pointers and create a
> > `FileDescriptorReservation` for each.
> >
> > 2. Go through the list of `struct file` pointers again and `fd_install`
> > them into the current thread's fd table. This is infallible due to
> > the reservations we just made.
> >
> > 3. Finally, overwrite the u32::MAX values in the byte array with the
> > actual file descriptors that the files were assigned.
> >
> > This is the same as how it works in C Binder.
>
> Yes, that all seems fine.
>
> >
> > > And so basically, what I'm after here is that the binder_translate_fd()
> > > that calls fget() is done in the context of the sender and we _know_
> > > that the fds can't have light references. Because if they did it could
> > > only be by the calling task but they don't since the calling task uses
> > > fget() on them. And if the calling task is multi-threaded and another
> > > thread has called fdget() or fdget_pos() we know that they have taken
> > > their own reference because the file descriptor table is shared.
> > >
> > > So why is that fget() in here returning a File<MaybeFdgetPos>? This
> > > doesn't make sense to me at first glance.
> >
> > Because when you call `File::fget`, then there could also be a different
> > call to `fdget_pos` on the same thread on the same file descriptor.
> >
> > fdget_pos(my_fd);
> > let my_file = File::fget(my_fd)?;
> > // oh no!
> > send_to_another_thread(my_file);
>
> Ok, that's basically my above example.
>
> >
> > In the above code, the file becomes shared even though `fdget_pos` might
> > not have taken the `f_pos_lock` mutex. That's not okay. We could end up
> > with a data race on the file position.
>
> But a race on f_pos isn't a memory safety issue it's just a POSIX
> ordering requirement.
Memory safety may be the wrong word, but data races *are* on the list of
things that Rust tries to prevent.
Alice
> > One of the primary design principles of Rust is that, if the user of our
> > API has *any* way of using it that could trigger a memory safety
> > problem, then we must be able to point at an unsafe block *in the user's
> > code* that is at fault. This must be the case no matter how contrived
> > the use of the API is.
> >
> > As a corollary, if the user can trigger memory safety problems with our
> > API without using any unsafe blocks, then that is a bug in the API.
> > We cannot assign the blame to an unsafe block in the user's code, so the
> > blame *must* lie with an unsafe block inside the API.
> >
> > So, to follow that design principle, I have designed the API in a way
> > that prevents the above data race. Concretely, because the
> > `File<MaybeFdgetPos>` type is not thread safe (or in Rust terms "is not
> > Send"), it's not possible to send values of that type across thread
> > boundaries. E.g., our `send_to_another_thread` would have a requirement
> > in its signature saying that it can only be called with types that are
> > thread safe, so calling it with a type that isn't results in a type
> > error.
> >
> >
> >
> > Now, what if you *want* to send it to another thread? Let's consider
> > Rust Binder, which needs to do exactly that. The relevant code in Rust
> > Binder would need to be updated to look like this:
> >
> > let file = File::fget(my_fd)?;
> > // SAFETY: We know that there are no active `fdget_pos` calls on
> > // the current thread, since this is an ioctl and we have not
> > // called `fdget_pos` inside the Binder driver.
> > let thread_safe_file = unsafe { file.assume_no_fdget_pos() };
> >
> > (search for File::from_fd in the RFC to find where this would go)
> >
> > The `assume_no_fdget_pos` call has no effect at runtime - it is purely a
> > compile-time thing to force the user to use unsafe to "promise" that
> > there aren't any `fdget_pos` calls on the same fd.
> >
> > If Rust Binder uses `assume_no_fdget_pos` and ends up triggering memory
> > unsafety because it sent a file to another thread, then we can point to
> > the unsafe block that calls `assume_no_fdget_pos` and say "that unsafe
> > block is at fault because it assumed that there was no `fdget_pos` call,
> > but that assumption was false."
> >
> > > > + /// Assume that there are no active `fdget_pos` calls that prevent us from sharing this file.
> > > > + ///
> > > > + /// This makes it safe to transfer this file to other threads. No checks are performed, and
> > > > + /// using it incorrectly may lead to a data race on the file position if the file is shared
> > > > + /// with another thread.
> > > > + ///
> > > > + /// This method is intended to be used together with [`File::fget`] when the caller knows
> > > > + /// statically that there are no `fdget_pos` calls on the current thread. For example, you
> > > > + /// might use it when calling `fget` from an ioctl, since ioctls usually do not touch the file
> > > > + /// position.
> > > > + ///
> > > > + /// # Safety
> > > > + ///
> > > > + /// There must not be any active `fdget_pos` calls on the current thread.
> > > > + pub unsafe fn assume_no_fdget_pos(me: ARef<Self>) -> ARef<File<NoFdgetPos>> {
> > > > + // INVARIANT: There are no `fdget_pos` calls on the current thread, and by the type
> > > > + // invariants, if there is a `fdget_pos` call on another thread, then it took the
> > > > + // `f_pos_lock` mutex.
> > > > + //
> > > > + // SAFETY: `File<MaybeFdgetPos>` and `File<NoFdgetPos>` have the same layout.
> > > > + unsafe { ARef::from_raw(ARef::into_raw(me).cast()) }
> > > > + }
> > > > +}
> > > > +
> > > > +/// File methods that exist under all sharing modes.
> > > > +impl<S: FileShareMode> File<S> {
> > > > + /// Creates a reference to a [`File`] from a valid pointer.
> > > > + ///
> > > > + /// # Safety
> > > > + ///
> > > > + /// * The caller must ensure that `ptr` points at a valid file and that the file's refcount is
> > > > + /// positive for the duration of 'a.
> > > > + /// * The caller must ensure that the requirements for using the chosen file sharing mode are
> > > > + /// upheld.
> > > > + pub unsafe fn from_ptr<'a>(ptr: *const bindings::file) -> &'a File<S> {
> > >
> > > I think I requested from_raw_file() in the last revision?
> >
> > Ah, yeah, I totally forgot about this. I'll make the change in the next
> > version.
> >
> > > > + /// Returns a raw pointer to the inner C struct.
> > > > + #[inline]
> > > > + pub fn as_ptr(&self) -> *mut bindings::file {
> > >
> > > And that was supposed to be into_raw_file() or as_raw_file()?
> >
> > Per the Rust API guidelines [1], this should be `as_raw_file`. The
> > `into_*` prefix is for conversions that destroy the object you call it
> > on. (E.g., because it takes ownership of the underlying allocation or
> > refcount.)
> >
> > As always, thank you for the very detailed questions!
> >
> > Alice
> >
> > [1]: https://rust-lang.github.io/api-guidelines/naming.html#ad-hoc-conversions-follow-as_-to_-into_-conventions-c-conv
Al Viro <[email protected]> writes:
> > > You obviously are aware of this but I'm just spelling it out. Iirc,
> > > there will practically only ever be one light refcount per file.
> > >
> > > For a light refcount to be used we know that the file descriptor table
> > > isn't shared with any other task. So there are no threads that could
> > > concurrently access the file descriptor table. We also know that the
> > > file descriptor table cannot become shared while we're in system call
> > > context because the caller can't create new threads and they can't
> > > unshare the file descriptor table.
> > >
> > > So there's only one fdget() caller (Yes, they could call fdget()
> > > multiple times and then have to do fdput() multiple times but that's a
> > > level of weirdness that we don't need to worry about.).
> >
> > Hmm. Is it not the case that different processes with different file
> > descriptor tables could reference the same underlying `struct file` and
> > both use light refcounts to do so, as long as each fd table is not
> > shared? So there could be multiple light refcounts to the same `struct
> > file` at the same time on different threads.
>
> Relevant rules:
>
> * Each file pointer in any descriptor table contributes to refcount
> of file.
>
> * All assignments to task->files are done by the task itself or,
> during task creation, by its parent The latter happens before the task
> runs for the first time. The former is done with task_lock(current)
> held.
>
> * current->files is always stable. The object it points to
> is guaranteed to stay alive at least until you explicitly change
> current->files.
> * task->files is stable while you are holding task_lock(task).
> The object it points to is guaranteed to stay alive until you release
> task_lock(task).
> * task->files MAY be fetched (racily) without either of the
> above, but it should not be dereferenced - the memory may be freed
> and reused right after you've fetched the pointer.
>
> * descriptor tables are refcounted by table->count.
> * descriptor table is created with ->count equal to 1 and
> destroyed when its ->count reaches 0.
> * each task with task->files == table contributes to table->count.
> * before the task dies, its ->files becomes NULL (see exit_files()).
> * when task is born (see copy_process() and copy_files())) the parent
> is responsible for setting the value of task->files and making sure that
> refcounts are correct; that's the only case where one is allowed to acquire
> an extra reference to existing table (handling of clone(2) with COPY_FILES).
>
> * the only descriptor table one may modify is that pointed to
> by current->files. Any access to other threads' descriptor tables is
> read-only.
>
> * struct fd is fundamentally thread-local. It should never be
> passed around, put into shared data structures, etc.
>
> * if you have done fdget(N), the matching fdput() MUST be done
> before the caller modifies the Nth slot of its descriptor table,
> spawns children that would share the descriptor table.
>
> * fdget() MAY borrow a reference from caller's descriptor table.
> That can be done if current->files->count is equal to 1.
> In that case we can be certain that the file reference we fetched from
> our descriptor table will remain unchanged (and thus contributing to refcount
> of file) until fdput(). Indeed,
> + at the time of fdget() no other thread has task->files pointing
> to our table (otherwise ->count would be greater than 1).
> + our thread will remain the sole owner of descriptor table at
> least until fdput(). Indeed, the first additional thread with task->files
> pointing to our table would have to have been spawned by us and we are
> forbidden to do that (rules for fdget() use)
> + no other thread could modify our descriptor table (they would
> have to share it first).
> + we are allowed to modify our table, but we are forbidden to touch
> the slot we'd copied from (rules for fdget() use).
>
> In other words, if current->files->count is equal to 1 at fdget() time
> we can skip incrementing refcount. Matching fdput() would need to
> skip decrement, of course. Note that we must record that (borrowed
> vs. cloned) in struct fd - the condition cannot be rechecked at fdput()
> time, since the table that had been shared at fdget() time might no longer
> be shared by the time of fdput().
This is great! It matches my understanding. I didn't know the details
about current->files and task->files.
You should copy this to the kernel documentation somewhere. :)
> > And this does *not* apply to `fdget_pos`, which checks the refcount of
> > the `struct file` instead of the refcount of the fd table.
>
> False. fdget_pos() is identical to fdget() as far as file refcount
> handling goes. The part that is different is that grabbing ->f_pos_lock
> is sensitive to file refcount in some cases. This is orthogonal to
> "does this struct fd contribute to file refcount".
Sorry, I see now that I didn't phrase that quite right. What I meant is
that there are ways of sharing a `struct file` reference during an fdget
scope that are not dangerous, but where it *would be* dangerous if it
was an fdget_pos scope instead. Specifically, the reason they are
dangerous is that they can lead to a data race on the file position if
the fdget_pos scope did not take the f_pos_lock mutex.
For example, during an `fdget(N)` scope, you can always do a `get_file`
and then send it to another process and `fd_install` it into that other
process. There's no way that this could result in the deletion of the
Nth entry of `current->files`.
However, during an `fdget_pos(N)` scope, then it is *not* the case that
it's always okay to send a `get_file` reference to another thread and
`fd_install` it. Because after the remote process returns from the
syscall in which we `fd_install`ed the file, the remote process could
proceed to call another syscall that in turn modifies the file position.
And if the original `fdget_pos(N)` scope modifies the file position
after sending the `get_file` reference, then that could be a data race
on f_pos.
> Again, "light" references are tied to thread; they can only be created
> if we are guaranteed that descriptor table's slot they came from will
> remain unchanged for as long as the reference is used.
>
> And yes, there may be several light references to the same file - both
> in different processes that do not share descriptor table *and* in the
> same thread, if e.g. sendfile(in_fd, out_fd, ...) is called with
> in_fd == out_fd.
Thanks for confirming this!
I hope this reply along with my reply to Christian Brauner also
addresses your other thread. Let me know if it doesn't.
Alice
On Tue, May 28, 2024 at 9:36 PM Al Viro <[email protected]> wrote:
>
> On Mon, May 27, 2024 at 04:03:56PM +0000, Alice Ryhl wrote:
>
> > > In other words, if current->files->count is equal to 1 at fdget() time
> > > we can skip incrementing refcount. Matching fdput() would need to
> > > skip decrement, of course. Note that we must record that (borrowed
> > > vs. cloned) in struct fd - the condition cannot be rechecked at fdput()
> > > time, since the table that had been shared at fdget() time might no longer
> > > be shared by the time of fdput().
> >
> > This is great! It matches my understanding. I didn't know the details
> > about current->files and task->files.
> >
> > You should copy this to the kernel documentation somewhere. :)
>
> Probably, after it's turned into something more coherent - and after
> the description of struct fd scope rules is corrected ;-/
>
> Correction in question: you _are_ allowed to move reference from
> descriptor table while in scope of struct fd; what you are not allowed
> is dropping that reference until the end of scope.
The patch you are commenting on contains a change to fs/file.c with
exactly that correction. I'm not sure if you noticed it - I should
probably have put it in its own commit to make it more obvious.
> That's what binder_do_fd_close() is about - binder_deferred_fd_close()
> is called in a struct fd scope (anything called from ->unlocked_ioctl()
> instances is). It *does* remove a struct file reference from
> descriptor table:
> twcb->file = file_close_fd(fd);
> moves that reference to twcb->file, and subsequent
> get_file(twcb->file);
> filp_close(twcb->file, current->files);
> completes detaching it from the table[*] and the reference itself
> is dropped via task_work, which is going to be executed on the
> way out to userland, definitely after we leave the scope of
> struct fd.
Yeah. If you look at previous versions of this patchset, it contains
Rust code for performing exactly that dance. I was asked to drop it
from the patch series, though.
> Incidentally, I'm very tempted to unexport close_fd(); in addition to
> being a source of bugs when called from ioctl on user-supplied descriptor
> it encourages racy crap - just look at e.g. 1819200166ce
> "drm/amdkfd: Export DMABufs from KFD using GEM handles", where we
> call drm_gem_prime_handle_to_fd(), immediately followed by
> dmabuf = dma_buf_get(fd);
> close_fd(fd);
> dup2() from another thread with guessed descriptor number as target and
> you've got a problem... It's not a violation of fdget() use rules
> (it is called from ioctl, but descriptor is guaranteed to be different
> from the one passed to ioctl(2)), but it's still wrong. Would take
> some work, though...
Wait, what's going on there? It adds the fd and then immediately
removes it again, or?
> "Detaching from the table" bit also needs documenting, BTW. If you look
> at that thing, you'll see that current->files is converted to fl_owner_t,
> which is an opaque pointer. What happens is that dnotify and POSIX lock
> use current->files as opaque tags (->dn_owner and ->flc_owner resp.) and
> filp_close() (well, filp_flush() these days) needs to be called to
> purge all of those associated with given struct file and given tag value.
Ah, yes, fl_owner_t being a void pointer rather than having a proper
type caused a bug in an early version of Rust Binder ...
Alice
> That needs to be done between removal of file reference from table and
> destruction of the table itself and it guarantees that those opaque references
> won't outlive the table (more importantly, don't survive until a different
> files_struct instance is allocated at the same address).
>
> [*] NB: might make sense to export filp_flush(), since that's what this
> sequence boils down to. We really need a better identifier, though -
> "filp" part is a leftover from OSDI, AFAICT; that's a hungarism for
> "file pointer" and it makes no sense. file_flush() would be better,
> IMO - or flush_file(), for that matter.
On Mon, May 27, 2024 at 04:03:56PM +0000, Alice Ryhl wrote:
> > In other words, if current->files->count is equal to 1 at fdget() time
> > we can skip incrementing refcount. Matching fdput() would need to
> > skip decrement, of course. Note that we must record that (borrowed
> > vs. cloned) in struct fd - the condition cannot be rechecked at fdput()
> > time, since the table that had been shared at fdget() time might no longer
> > be shared by the time of fdput().
>
> This is great! It matches my understanding. I didn't know the details
> about current->files and task->files.
>
> You should copy this to the kernel documentation somewhere. :)
Probably, after it's turned into something more coherent - and after
the description of struct fd scope rules is corrected ;-/
Correction in question: you _are_ allowed to move reference from
descriptor table while in scope of struct fd; what you are not allowed
is dropping that reference until the end of scope.
That's what binder_do_fd_close() is about - binder_deferred_fd_close()
is called in a struct fd scope (anything called from ->unlocked_ioctl()
instances is). It *does* remove a struct file reference from
descriptor table:
twcb->file = file_close_fd(fd);
moves that reference to twcb->file, and subsequent
get_file(twcb->file);
filp_close(twcb->file, current->files);
completes detaching it from the table[*] and the reference itself
is dropped via task_work, which is going to be executed on the
way out to userland, definitely after we leave the scope of
struct fd.
Incidentally, I'm very tempted to unexport close_fd(); in addition to
being a source of bugs when called from ioctl on user-supplied descriptor
it encourages racy crap - just look at e.g. 1819200166ce
"drm/amdkfd: Export DMABufs from KFD using GEM handles", where we
call drm_gem_prime_handle_to_fd(), immediately followed by
dmabuf = dma_buf_get(fd);
close_fd(fd);
dup2() from another thread with guessed descriptor number as target and
you've got a problem... It's not a violation of fdget() use rules
(it is called from ioctl, but descriptor is guaranteed to be different
from the one passed to ioctl(2)), but it's still wrong. Would take
some work, though...
"Detaching from the table" bit also needs documenting, BTW. If you look
at that thing, you'll see that current->files is converted to fl_owner_t,
which is an opaque pointer. What happens is that dnotify and POSIX lock
use current->files as opaque tags (->dn_owner and ->flc_owner resp.) and
filp_close() (well, filp_flush() these days) needs to be called to
purge all of those associated with given struct file and given tag value.
That needs to be done between removal of file reference from table and
destruction of the table itself and it guarantees that those opaque references
won't outlive the table (more importantly, don't survive until a different
files_struct instance is allocated at the same address).
[*] NB: might make sense to export filp_flush(), since that's what this
sequence boils down to. We really need a better identifier, though -
"filp" part is a leftover from OSDI, AFAICT; that's a hungarism for
"file pointer" and it makes no sense. file_flush() would be better,
IMO - or flush_file(), for that matter.
On Tue, May 28, 2024 at 10:29:03PM +0200, Alice Ryhl wrote:
> > Incidentally, I'm very tempted to unexport close_fd(); in addition to
> > being a source of bugs when called from ioctl on user-supplied descriptor
> > it encourages racy crap - just look at e.g. 1819200166ce
> > "drm/amdkfd: Export DMABufs from KFD using GEM handles", where we
> > call drm_gem_prime_handle_to_fd(), immediately followed by
> > dmabuf = dma_buf_get(fd);
> > close_fd(fd);
> > dup2() from another thread with guessed descriptor number as target and
> > you've got a problem... It's not a violation of fdget() use rules
> > (it is called from ioctl, but descriptor is guaranteed to be different
> > from the one passed to ioctl(2)), but it's still wrong. Would take
> > some work, though...
>
> Wait, what's going on there? It adds the fd and then immediately
> removes it again, or?
It creates an object and associated struct file, using a primitive that
shoves the reference to that new struct file into descriptor table and
returns the slot number. Then it looks the file up by the returned
descriptor, tries to pick the object out of it and closes the descriptor.
If that descriptor table is shared, well... pray the descriptor still
refers to the same file by the time you try to look the file up.
It's bogus; the song and dance with putting it into descriptor table
makes sense for the primary user (ioctl that returns the descriptor
number to userland), but here it's just plain wrong. What they need
is to cut that sucker in two functions - one that returns dmabuf,
with wrapper doing dma_buf_fd() on the result (or allocating a descriptor
first, then calling the primitives that gets their dmabuf, then doing
fd_install()).
This caller should use the new primitive without messing with descriptor
table. In general, new descriptors are fit only for one thing - returning
them to userland. As soon as file reference is in descriptor table
it might get closed right under you - file argument of fd_install()
is moved, not borrowed. You might find something on lookup by that
descritor, but it's not guaranteed to have anything to do with what
you'd just put there.
That's why we have anon_inode_getfile(), with anon_inode_getfd() being
only a convenience helper, for example...
> > So what you're getting at seems to be that some process has a private
> > file descriptor table and an just opened @fd to a @file that isn't
> > shared.
> >
> > fd = open("/my/file");
> >
> > and then let's say has a random ioctl(fd, SOMETHING_SOMETHING) that
> > somehow does:
> >
> > struct fd fd = fdget_pos();
> > if (!fd.file)
> > return -EBADF;
> >
> > We know that process has used a light reference count and that it didn't
> > acquire f_pos_lock.
> >
> > Your whole approach seems to assume that after something like this has
> > happened the same process now offloads that struct file to another
> > process that somehow ends up doing some other operation on the file that
> > would also require f_pos_lock to be taken but it doesn't like a read or
> > something.
>
> Can we not have a data race even if the other process *does* take the
> f_pos_lock mutex? The current thread did not take the mutex, so if the
> current thread touches the file position after sending the file
> reference, then that could race with the other process even if the
> other process takes f_pos_lock.
Yes, for that the original sender would have to have taken a light
reference, and the file mustn't have been in any other file descriptor
table and then the original opener must've sent it while in fdget_pos
scope and remain in that scope while the receiver installs the fd into
their fd table and then leaves and reenters the kernel to e.g., read.
> > To share a file between multiple processes would normally always require
> > that the process sends that file to another process. That process then
> > install that fd into its file descriptor table and then later accesses
> > that file via fdget() again. That's the standard way of doing it -
> > binder does it that way too. And that's all perfectly fine.
>
> And similarly, if the remote process installs the file, returns to
> userspace, and then userspace calls back into the kernel, which enters
> an fdget_pos scope and modifies the file position. Then this can also
> race on the file position if the original process changes the file
> position it after sending the file reference.
If that process were to send it from within an fdget_pos scope then yes.
> *That's* the data race that this is trying to prevent.
>
> > What you would need for this to be a problem is for a process be sent a
> > struct file from a process that is in the middle of an f_pos_lock scope
> > and for the receiving process to immediately start doing stuff that
> > would normally require f_pos_lock.
> >
> > Like, idk vfs_read(file, ...).
> >
> > If that's what this is about then yes, there's a close-to-zero but
> > non-zero possibility that some really stupid code could end up doing
> > something like this.
> >
> > Basically, that scenario doesn't exist (as I've mentioned before)
> > currently. It only exists in a single instance and that's when
> > pidfd_getfd() is used to steal a file from another task while that task
> > is in the middle of an f_pos_lock section (I said it before we don't
> > care about that because non-POSIX interface anyway and we have ptrace
> > rights anyway. And iiuc that wouldn't even be preventable in your
> > current scheme because you would need to have the information available
> > that the struct file you're about to steal from the file descriptor
> > table is currently within an f_pos_lock section.).
> >
> > Is it honestly worth encoding all that complexity into rust's file
> > implementation itself right now? It's barely understandable to
> > non-rust experts as it is right now. :)
> >
> > Imho, it would seem a lot more prudent to just have something simpler
> > for now.
>
> The purpose of the changes I've made are to prevent data races on the
> file position. If we go back to what we had before, then the API does
> not make it impossible for users of the API to cause such data races.
>
> That is the tradeoff.
Right. Sorry, there's some back and forth here. But we're all navigating
this new territory here and it's not always trivial to see what the
correct approach is.
I wonder what's better for now. It seems that the binder code isn't
really subject to the races we discussed. So maybe we should start with
the simpler approach for now to not get bogged down in encoding all
subtle details into rust's file wrapper just yet?
On Wed, May 29, 2024 at 10:17 AM Christian Brauner <[email protected]> wrote:
> > > Is it honestly worth encoding all that complexity into rust's file
> > > implementation itself right now? It's barely understandable to
> > > non-rust experts as it is right now. :)
> > >
> > > Imho, it would seem a lot more prudent to just have something simpler
> > > for now.
> >
> > The purpose of the changes I've made are to prevent data races on the
> > file position. If we go back to what we had before, then the API does
> > not make it impossible for users of the API to cause such data races.
> >
> > That is the tradeoff.
>
> Right. Sorry, there's some back and forth here. But we're all navigating
> this new territory here and it's not always trivial to see what the
> correct approach is.
Yeah of course. You've been very helpful in that regard, and I'm
grateful for that.
> I wonder what's better for now. It seems that the binder code isn't
> really subject to the races we discussed. So maybe we should start with
> the simpler approach for now to not get bogged down in encoding all
> subtle details into rust's file wrapper just yet?
Yeah, maybe. But I think that if we can accurately represent the
requirements of the interface, then that would be preferable. Perhaps we
can tweak it to make it easier to understand, without giving up
accuracy?
One of the reasons that the current API is confusing is that the types
are called `File<NoFdgetPos>` and `File<MaybeFdgetPos>`. These names
_sound_ like their purpose is to keep track of whether or not the file
came from an `fdget_pos` call or not, but that is not the case.
Instead, let's call them something else.
We can have two files: File and LocalFile.
This name accurately conveys the main difference between them. File can
be transferred across thread boundaries. LocalFile cannot.
Now, it is still the case that `fget` will return a `LocalFile`, which
may be confusing. But we can document it like this:
1. On `fget`'s docs, we explain that to get a `File`, you need to
convert it using the `assume_not_in_fdget_pos_scope` function. We do
not explain why in the docs for `fget`.
2. We can put an explanation of why in the docs for the function
`assume_not_in_fdget_pos_scope`.
I think it's possible to design an API like this where the complexities
about `fdget_pos` are only relevant in a few places. In the rest of the
implementation, we simplify the situation to "file is threadsafe" or
"file is not threadsafe", and that distinction should be easier to
understand than nuances related to `fdget_pos`.
Alice