2024-02-09 11:19:21

by Alice Ryhl

[permalink] [raw]
Subject: [PATCH v5 0/9] File abstractions needed by Rust Binder

This patchset contains the file abstractions needed by the Rust
implementation of the Binder driver.

Please see the Rust Binder RFC for usage examples:
https://lore.kernel.org/rust-for-linux/[email protected]/

Users of "rust: types: add `NotThreadSafe`":
[PATCH 5/9] rust: file: add `FileDescriptorReservation`

Users of "rust: task: add `Task::current_raw`":
[PATCH 7/9] rust: file: add `Kuid` wrapper
[PATCH 8/9] rust: file: add `DeferredFdCloser`

Users of "rust: file: add Rust abstraction for `struct file`":
[PATCH RFC 02/20] rust_binder: add binderfs support to Rust binder
[PATCH RFC 03/20] rust_binder: add threading support

Users of "rust: cred: add Rust abstraction for `struct cred`":
[PATCH RFC 05/20] rust_binder: add nodes and context managers
[PATCH RFC 06/20] rust_binder: add oneway transactions
[PATCH RFC 11/20] rust_binder: send nodes in transaction
[PATCH RFC 13/20] rust_binder: add BINDER_TYPE_FD support

Users of "rust: security: add abstraction for secctx":
[PATCH RFC 06/20] rust_binder: add oneway transactions

Users of "rust: file: add `FileDescriptorReservation`":
[PATCH RFC 13/20] rust_binder: add BINDER_TYPE_FD support
[PATCH RFC 14/20] rust_binder: add BINDER_TYPE_FDA support

Users of "rust: file: add `Kuid` wrapper":
[PATCH RFC 05/20] rust_binder: add nodes and context managers
[PATCH RFC 06/20] rust_binder: add oneway transactions

Users of "rust: file: add `DeferredFdCloser`":
[PATCH RFC 14/20] rust_binder: add BINDER_TYPE_FDA support

Users of "rust: file: add abstraction for `poll_table`":
[PATCH RFC 07/20] rust_binder: add epoll support

This patchset is based on rust-next, which means that it doesn't compile
without the following patch, which has not yet made it further upstream
than rust-next.
https://lore.kernel.org/all/[email protected]/

This patchset has some uses of read_volatile in place of READ_ONCE.
Please see the following rfc for context on this:
https://lore.kernel.org/all/[email protected]/

Signed-off-by: Alice Ryhl <[email protected]>
---
Changes in v5:
- Pass a null pointer to task_tgid_nr_ns.
- Fix some typos and other formatting issues.
- Add Reviewed-by where appropriate.
- Link to v4: https://lore.kernel.org/r/[email protected]

Changes in v4:
- Moved the two really simple patches to the beginning of the patchset.
- Update Send safety comments.
- Use srctree relative links.
- Mention that `Credential::euid` is immutable.
- Update some safety comments to mention the invariant on Self.
- Use new name for close_fd_get_file.
- Move safety comments on DeferredFdCloser around and be more explicit
about how many refcounts we own.
- Reword safety comments related to _qproc.
- Add Reviewed-by where appropriate.
- Link to v3: https://lore.kernel.org/r/[email protected]

Changes in v3:
- Completely rewrite comments about refcounting in the first patch.
- And add a note to the documentation in fs/file.c.
- Discuss speculation gadgets in commit message for the Kuid wrapper.
- Introduce NotThreadSafe and Task::current_raw patches and use them in
later patches.
- Improve safety comments in DeferredFdCloser.
- Some other minor changes.
- Link to v2: https://lore.kernel.org/r/[email protected]

Changes in v2:
- Update various docs and safety comments.
- Rename method names to match the C name.
- Use ordinary read instead of READ_ONCE in File::cred.
- Changed null check in secctx.
- Add type alias for PhantomData in FileDescriptorReservation.
- Use Kuid::from_raw in Kuid::current_euid.
- Make DeferredFdCloser fallible if it is unable to schedule a task
work. And also schedule the task work *before* closing the file.
- Moved PollCondVar to rust/kernel/sync.
- Updated PollCondVar to use wake_up_pollfree.
- Link to v1: https://lore.kernel.org/all/[email protected]/

Link to RFC:
https://lore.kernel.org/all/[email protected]/

---
Alice Ryhl (6):
rust: types: add `NotThreadSafe`
rust: task: add `Task::current_raw`
rust: security: add abstraction for secctx
rust: file: add `Kuid` wrapper
rust: file: add `DeferredFdCloser`
rust: file: add abstraction for `poll_table`

Wedson Almeida Filho (3):
rust: file: add Rust abstraction for `struct file`
rust: cred: add Rust abstraction for `struct cred`
rust: file: add `FileDescriptorReservation`

fs/file.c | 7 +
rust/bindings/bindings_helper.h | 8 +
rust/helpers.c | 94 ++++++++
rust/kernel/cred.rs | 83 +++++++
rust/kernel/file.rs | 519 ++++++++++++++++++++++++++++++++++++++++
rust/kernel/lib.rs | 3 +
rust/kernel/security.rs | 72 ++++++
rust/kernel/sync.rs | 1 +
rust/kernel/sync/lock.rs | 14 +-
rust/kernel/sync/poll.rs | 117 +++++++++
rust/kernel/task.rs | 105 +++++++-
rust/kernel/types.rs | 18 ++
12 files changed, 1029 insertions(+), 12 deletions(-)
---
base-commit: f090f0d0eea9666a96702b29bc9a64cbabee85c5
change-id: 20231123-alice-file-525b98e8a724

Best regards,
--
Alice Ryhl <[email protected]>



2024-02-09 11:19:31

by Alice Ryhl

[permalink] [raw]
Subject: [PATCH v5 1/9] rust: types: add `NotThreadSafe`

This introduces a new marker type for types that shouldn't be thread
safe. By adding a field of this type to a struct, it becomes non-Send
and non-Sync, which means that it cannot be accessed in any way from
threads other than the one it was created on.

This is useful for APIs that require globals such as `current` to remain
constant while the value exists.

We update two existing users in the Kernel to use this helper:

* `Task::current()` - moving the return type of this value to a
different thread would not be safe as you can no longer be guaranteed
that the `current` pointer remains valid.
* Lock guards. Mutexes and spinlocks should be unlocked on the same
thread as where they were locked, so we enforce this using the Send
trait.

There are also additional users in later patches of this patchset. See
[1] and [2] for the discussion that led to the introduction of this
patch.

Link: https://lore.kernel.org/all/nFDPJFnzE9Q5cqY7FwSMByRH2OAn_BpI4H53NQfWIlN6I2qfmAqnkp2wRqn0XjMO65OyZY4h6P4K2nAGKJpAOSzksYXaiAK_FoH_8QbgBI4=@proton.me/ [1]
Link: https://lore.kernel.org/all/nFDPJFnzE9Q5cqY7FwSMByRH2OAn_BpI4H53NQfWIlN6I2qfmAqnkp2wRqn0XjMO65OyZY4h6P4K2nAGKJpAOSzksYXaiAK_FoH_8QbgBI4=@proton.me/ [2]
Suggested-by: Benno Lossin <[email protected]>
Reviewed-by: Benno Lossin <[email protected]>
Reviewed-by: Trevor Gross <[email protected]>
Reviewed-by: Martin Rodriguez Reboredo <[email protected]>
Signed-off-by: Alice Ryhl <[email protected]>
---
rust/kernel/sync/lock.rs | 14 ++++++++++----
rust/kernel/task.rs | 10 ++++++----
rust/kernel/types.rs | 18 ++++++++++++++++++
3 files changed, 34 insertions(+), 8 deletions(-)

diff --git a/rust/kernel/sync/lock.rs b/rust/kernel/sync/lock.rs
index 149a5259d431..090b9ad63dc6 100644
--- a/rust/kernel/sync/lock.rs
+++ b/rust/kernel/sync/lock.rs
@@ -6,8 +6,14 @@
//! spinlocks, raw spinlocks) to be provided with minimal effort.

use super::LockClassKey;
-use crate::{bindings, init::PinInit, pin_init, str::CStr, types::Opaque, types::ScopeGuard};
-use core::{cell::UnsafeCell, marker::PhantomData, marker::PhantomPinned};
+use crate::{
+ bindings,
+ init::PinInit,
+ pin_init,
+ str::CStr,
+ types::{NotThreadSafe, ScopeGuard, Opaque},
+};
+use core::{cell::UnsafeCell, marker::PhantomPinned};
use macros::pin_data;

pub mod mutex;
@@ -132,7 +138,7 @@ pub fn lock(&self) -> Guard<'_, T, B> {
pub struct Guard<'a, T: ?Sized, B: Backend> {
pub(crate) lock: &'a Lock<T, B>,
pub(crate) state: B::GuardState,
- _not_send: PhantomData<*mut ()>,
+ _not_send: NotThreadSafe,
}

// SAFETY: `Guard` is sync when the data protected by the lock is also sync.
@@ -184,7 +190,7 @@ pub(crate) unsafe fn new(lock: &'a Lock<T, B>, state: B::GuardState) -> Self {
Self {
lock,
state,
- _not_send: PhantomData,
+ _not_send: NotThreadSafe,
}
}
}
diff --git a/rust/kernel/task.rs b/rust/kernel/task.rs
index a3a4007db682..148a4f4eb7a8 100644
--- a/rust/kernel/task.rs
+++ b/rust/kernel/task.rs
@@ -4,10 +4,12 @@
//!
//! C header: [`include/linux/sched.h`](srctree/include/linux/sched.h).

-use crate::{bindings, types::Opaque};
+use crate::{
+ bindings,
+ types::{NotThreadSafe, Opaque},
+};
use core::{
ffi::{c_int, c_long, c_uint},
- marker::PhantomData,
ops::Deref,
ptr,
};
@@ -106,7 +108,7 @@ impl Task {
pub unsafe fn current() -> impl Deref<Target = Task> {
struct TaskRef<'a> {
task: &'a Task,
- _not_send: PhantomData<*mut ()>,
+ _not_send: NotThreadSafe,
}

impl Deref for TaskRef<'_> {
@@ -125,7 +127,7 @@ fn deref(&self) -> &Self::Target {
// that `TaskRef` is not `Send`, we know it cannot be transferred to another thread
// (where it could potentially outlive the caller).
task: unsafe { &*ptr.cast() },
- _not_send: PhantomData,
+ _not_send: NotThreadSafe,
}
}

diff --git a/rust/kernel/types.rs b/rust/kernel/types.rs
index fdb778e65d79..ee1375d47df0 100644
--- a/rust/kernel/types.rs
+++ b/rust/kernel/types.rs
@@ -387,3 +387,21 @@ pub enum Either<L, R> {
/// Constructs an instance of [`Either`] containing a value of type `R`.
Right(R),
}
+
+/// Zero-sized type to mark types not [`Send`].
+///
+/// Add this type as a field to your struct if your type should not be sent to a different task.
+/// Since [`Send`] is an auto trait, adding a single field that is `!Send` will ensure that the
+/// whole type is `!Send`.
+///
+/// If a type is `!Send` it is impossible to give control over an instance of the type to another
+/// task. This is useful to include in types that store or reference task-local information. A file
+/// descriptor is an example of such task-local information.
+pub type NotThreadSafe = PhantomData<*mut ()>;
+
+/// Used to construct instances of type [`NotThreadSafe`] similar to how `PhantomData` is
+/// constructed.
+///
+/// [`NotThreadSafe`]: type@NotThreadSafe
+#[allow(non_upper_case_globals)]
+pub const NotThreadSafe: NotThreadSafe = PhantomData;

--
2.43.0.687.g38aa6559b0-goog


2024-02-09 11:20:03

by Alice Ryhl

[permalink] [raw]
Subject: [PATCH v5 2/9] rust: task: add `Task::current_raw`

Introduces a safe function for getting a raw pointer to the current
task.

When writing bindings that need to access the current task, it is often
more convenient to call a method that directly returns a raw pointer
than to use the existing `Task::current` method. However, the only way
to do that is `bindings::get_current()` which is unsafe since it calls
into C. By introducing `Task::current_raw()`, it becomes possible to
obtain a pointer to the current task without using unsafe.

Link: https://lore.kernel.org/all/CAH5fLgjT48X-zYtidv31mox3C4_Ogoo_2cBOCmX0Ang3tAgGHA@mail.gmail.com/
Reviewed-by: Benno Lossin <[email protected]>
Reviewed-by: Martin Rodriguez Reboredo <[email protected]>
Reviewed-by: Trevor Gross <[email protected]>
Signed-off-by: Alice Ryhl <[email protected]>
---
rust/kernel/task.rs | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/rust/kernel/task.rs b/rust/kernel/task.rs
index 148a4f4eb7a8..b579367fb923 100644
--- a/rust/kernel/task.rs
+++ b/rust/kernel/task.rs
@@ -97,6 +97,15 @@ unsafe impl Sync for Task {}
type Pid = bindings::pid_t;

impl Task {
+ /// Returns a raw pointer to the current task.
+ ///
+ /// It is up to the user to use the pointer correctly.
+ #[inline]
+ pub fn current_raw() -> *mut bindings::task_struct {
+ // SAFETY: Getting the current pointer is always safe.
+ unsafe { bindings::get_current() }
+ }
+
/// Returns a task reference for the currently executing task/thread.
///
/// The recommended way to get the current task/thread is to use the
@@ -119,14 +128,12 @@ fn deref(&self) -> &Self::Target {
}
}

- // SAFETY: Just an FFI call with no additional safety requirements.
- let ptr = unsafe { bindings::get_current() };
-
+ let current = Task::current_raw();
TaskRef {
// SAFETY: If the current thread is still running, the current task is valid. Given
// that `TaskRef` is not `Send`, we know it cannot be transferred to another thread
// (where it could potentially outlive the caller).
- task: unsafe { &*ptr.cast() },
+ task: unsafe { &*current.cast() },
_not_send: NotThreadSafe,
}
}

--
2.43.0.687.g38aa6559b0-goog


2024-02-09 11:21:00

by Alice Ryhl

[permalink] [raw]
Subject: [PATCH v5 4/9] rust: cred: add Rust abstraction for `struct cred`

From: Wedson Almeida Filho <[email protected]>

Add a wrapper around `struct cred` called `Credential`, and provide
functionality to get the `Credential` associated with a `File`.

Rust Binder must check the credentials of processes when they attempt to
perform various operations, and these checks usually take a
`&Credential` as parameter. The security_binder_set_context_mgr function
would be one example. This patch is necessary to access these security_*
methods from Rust.

Signed-off-by: Wedson Almeida Filho <[email protected]>
Co-developed-by: Alice Ryhl <[email protected]>
Reviewed-by: Trevor Gross <[email protected]>
Reviewed-by: Benno Lossin <[email protected]>
Reviewed-by: Martin Rodriguez Reboredo <[email protected]>
Signed-off-by: Alice Ryhl <[email protected]>
---
rust/bindings/bindings_helper.h | 1 +
rust/helpers.c | 13 ++++++++
rust/kernel/cred.rs | 74 +++++++++++++++++++++++++++++++++++++++++
rust/kernel/file.rs | 13 ++++++++
rust/kernel/lib.rs | 1 +
5 files changed, 102 insertions(+)

diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
index 41fcd2905ed4..84a56e8b6b67 100644
--- a/rust/bindings/bindings_helper.h
+++ b/rust/bindings/bindings_helper.h
@@ -7,6 +7,7 @@
*/

#include <kunit/test.h>
+#include <linux/cred.h>
#include <linux/errname.h>
#include <linux/ethtool.h>
#include <linux/file.h>
diff --git a/rust/helpers.c b/rust/helpers.c
index 03141a3608a4..10ed69f76424 100644
--- a/rust/helpers.c
+++ b/rust/helpers.c
@@ -23,6 +23,7 @@
#include <kunit/test-bug.h>
#include <linux/bug.h>
#include <linux/build_bug.h>
+#include <linux/cred.h>
#include <linux/err.h>
#include <linux/errname.h>
#include <linux/fs.h>
@@ -164,6 +165,18 @@ struct file *rust_helper_get_file(struct file *f)
}
EXPORT_SYMBOL_GPL(rust_helper_get_file);

+const struct cred *rust_helper_get_cred(const struct cred *cred)
+{
+ return get_cred(cred);
+}
+EXPORT_SYMBOL_GPL(rust_helper_get_cred);
+
+void rust_helper_put_cred(const struct cred *cred)
+{
+ put_cred(cred);
+}
+EXPORT_SYMBOL_GPL(rust_helper_put_cred);
+
/*
* `bindgen` binds the C `size_t` type as the Rust `usize` type, so we can
* use it in contexts where Rust expects a `usize` like slice (array) indices.
diff --git a/rust/kernel/cred.rs b/rust/kernel/cred.rs
new file mode 100644
index 000000000000..360d6fdbe5e7
--- /dev/null
+++ b/rust/kernel/cred.rs
@@ -0,0 +1,74 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! Credentials management.
+//!
+//! C header: [`include/linux/cred.h`](srctree/include/linux/cred.h).
+//!
+//! Reference: <https://www.kernel.org/doc/html/latest/security/credentials.html>
+
+use crate::{
+ bindings,
+ types::{AlwaysRefCounted, Opaque},
+};
+
+/// Wraps the kernel's `struct cred`.
+///
+/// Credentials are used for various security checks in the kernel.
+///
+/// Most fields of credentials are immutable. When things have their credentials changed, that
+/// happens by replacing the credential instead of changing an existing credential. See the [kernel
+/// documentation][ref] for more info on this.
+///
+/// # Invariants
+///
+/// Instances of this type are always ref-counted, that is, a call to `get_cred` ensures that the
+/// allocation remains valid at least until the matching call to `put_cred`.
+///
+/// [ref]: https://www.kernel.org/doc/html/latest/security/credentials.html
+#[repr(transparent)]
+pub struct Credential(Opaque<bindings::cred>);
+
+// SAFETY:
+// - `Credential::dec_ref` can be called from any thread.
+// - It is okay to send ownership of `Credential` across thread boundaries.
+unsafe impl Send for Credential {}
+
+// SAFETY: It's OK to access `Credential` through shared references from other threads because
+// we're either accessing properties that don't change or that are properly synchronised by C code.
+unsafe impl Sync for Credential {}
+
+impl Credential {
+ /// Creates a reference to a [`Credential`] from a valid pointer.
+ ///
+ /// # Safety
+ ///
+ /// The caller must ensure that `ptr` is valid and remains valid for the lifetime of the
+ /// returned [`Credential`] reference.
+ pub unsafe fn from_ptr<'a>(ptr: *const bindings::cred) -> &'a Credential {
+ // SAFETY: The safety requirements guarantee the validity of the dereference, while the
+ // `Credential` type being transparent makes the cast ok.
+ unsafe { &*ptr.cast() }
+ }
+
+ /// Returns the effective UID of the given credential.
+ pub fn euid(&self) -> bindings::kuid_t {
+ // SAFETY: By the type invariant, we know that `self.0` is valid. Furthermore, the `euid`
+ // field of a credential is never changed after initialization, so there is no potential
+ // for data races.
+ unsafe { (*self.0.get()).euid }
+ }
+}
+
+// SAFETY: The type invariants guarantee that `Credential` is always ref-counted.
+unsafe impl AlwaysRefCounted for Credential {
+ fn inc_ref(&self) {
+ // SAFETY: The existence of a shared reference means that the refcount is nonzero.
+ unsafe { bindings::get_cred(self.0.get()) };
+ }
+
+ unsafe fn dec_ref(obj: core::ptr::NonNull<Credential>) {
+ // SAFETY: The safety requirements guarantee that the refcount is nonzero. The cast is okay
+ // because `Credential` has the same representation as `struct cred`.
+ unsafe { bindings::put_cred(obj.cast().as_ptr()) };
+ }
+}
diff --git a/rust/kernel/file.rs b/rust/kernel/file.rs
index cf8ebf619379..3a64c5022941 100644
--- a/rust/kernel/file.rs
+++ b/rust/kernel/file.rs
@@ -7,6 +7,7 @@

use crate::{
bindings,
+ cred::Credential,
error::{code::*, Error, Result},
types::{ARef, AlwaysRefCounted, Opaque},
};
@@ -207,6 +208,18 @@ pub fn as_ptr(&self) -> *mut bindings::file {
self.0.get()
}

+ /// Returns the credentials of the task that originally opened the file.
+ pub fn cred(&self) -> &Credential {
+ // SAFETY: It's okay to read the `f_cred` field without synchronization because `f_cred` is
+ // never changed after initialization of the file.
+ let ptr = unsafe { (*self.as_ptr()).f_cred };
+
+ // SAFETY: The signature of this function ensures that the caller will only access the
+ // returned credential while the file is still valid, and the C side ensures that the
+ // credential stays valid at least as long as the file.
+ unsafe { Credential::from_ptr(ptr) }
+ }
+
/// Returns the flags associated with the file.
///
/// The flags are a combination of the constants in [`flags`].
diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs
index 9353dd713a20..f65e5978f807 100644
--- a/rust/kernel/lib.rs
+++ b/rust/kernel/lib.rs
@@ -33,6 +33,7 @@
#[cfg(not(testlib))]
mod allocator;
mod build_assert;
+pub mod cred;
pub mod error;
pub mod file;
pub mod init;

--
2.43.0.687.g38aa6559b0-goog


2024-02-09 11:21:51

by Alice Ryhl

[permalink] [raw]
Subject: [PATCH v5 6/9] rust: file: add `FileDescriptorReservation`

From: Wedson Almeida Filho <[email protected]>

Allow for the creation of a file descriptor in two steps: first, we
reserve a slot for it, then we commit or drop the reservation. The first
step may fail (e.g., the current process ran out of available slots),
but commit and drop never fail (and are mutually exclusive).

This is needed by Rust Binder when fds are sent from one process to
another. It has to be a two-step process to properly handle the case
where multiple fds are sent: The operation must fail or succeed
atomically, which we achieve by first reserving the fds we need, and
only installing the files once we have reserved enough fds to send the
files.

Fd reservations assume that the value of `current` does not change
between the call to get_unused_fd_flags and the call to fd_install (or
put_unused_fd). By not implementing the Send trait, this abstraction
ensures that the `FileDescriptorReservation` cannot be moved into a
different process.

Signed-off-by: Wedson Almeida Filho <[email protected]>
Co-developed-by: Alice Ryhl <[email protected]>
Reviewed-by: Benno Lossin <[email protected]>
Reviewed-by: Martin Rodriguez Reboredo <[email protected]>
Signed-off-by: Alice Ryhl <[email protected]>
---
rust/kernel/file.rs | 72 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 71 insertions(+), 1 deletion(-)

diff --git a/rust/kernel/file.rs b/rust/kernel/file.rs
index 3a64c5022941..fb903b7f23fe 100644
--- a/rust/kernel/file.rs
+++ b/rust/kernel/file.rs
@@ -9,7 +9,7 @@
bindings,
cred::Credential,
error::{code::*, Error, Result},
- types::{ARef, AlwaysRefCounted, Opaque},
+ types::{ARef, AlwaysRefCounted, NotThreadSafe, Opaque},
};
use core::ptr;

@@ -248,6 +248,76 @@ unsafe fn dec_ref(obj: ptr::NonNull<File>) {
}
}

+/// A file descriptor reservation.
+///
+/// This allows the creation of a file descriptor in two steps: first, we reserve a slot for it,
+/// then we commit or drop the reservation. The first step may fail (e.g., the current process ran
+/// out of available slots), but commit and drop never fail (and are mutually exclusive).
+///
+/// Dropping the reservation happens in the destructor of this type.
+///
+/// # Invariants
+///
+/// The fd stored in this struct must correspond to a reserved file descriptor of the current task.
+pub struct FileDescriptorReservation {
+ fd: u32,
+ /// Prevent values of this type from being moved to a different task.
+ ///
+ /// The `fd_install` and `put_unused_fd` functions assume that the value of `current` is
+ /// unchanged since the call to `get_unused_fd_flags`. By adding this marker to this type, we
+ /// prevent it from being moved across task boundaries, which ensures that `current` does not
+ /// change while this value exists.
+ _not_send: NotThreadSafe,
+}
+
+impl FileDescriptorReservation {
+ /// Creates a new file descriptor reservation.
+ pub fn get_unused_fd_flags(flags: u32) -> Result<Self> {
+ // SAFETY: FFI call, there are no safety requirements on `flags`.
+ let fd: i32 = unsafe { bindings::get_unused_fd_flags(flags) };
+ if fd < 0 {
+ return Err(Error::from_errno(fd));
+ }
+ Ok(Self {
+ fd: fd as u32,
+ _not_send: NotThreadSafe,
+ })
+ }
+
+ /// Returns the file descriptor number that was reserved.
+ pub fn reserved_fd(&self) -> u32 {
+ self.fd
+ }
+
+ /// Commits the reservation.
+ ///
+ /// The previously reserved file descriptor is bound to `file`. This method consumes the
+ /// [`FileDescriptorReservation`], so it will not be usable after this call.
+ pub fn fd_install(self, file: ARef<File>) {
+ // SAFETY: `self.fd` was previously returned by `get_unused_fd_flags`. We have not yet used
+ // the fd, so it is still valid, and `current` still refers to the same task, as this type
+ // cannot be moved across task boundaries.
+ //
+ // Furthermore, the file pointer is guaranteed to own a refcount by its type invariants,
+ // and we take ownership of that refcount by not running the destructor below.
+ unsafe { bindings::fd_install(self.fd, file.as_ptr()) };
+
+ // `fd_install` consumes both the file descriptor and the file reference, so we cannot run
+ // the destructors.
+ core::mem::forget(self);
+ core::mem::forget(file);
+ }
+}
+
+impl Drop for FileDescriptorReservation {
+ fn drop(&mut self) {
+ // SAFETY: By the type invariants of this type, `self.fd` was previously returned by
+ // `get_unused_fd_flags`. We have not yet used the fd, so it is still valid, and `current`
+ // still refers to the same task, as this type cannot be moved across task boundaries.
+ unsafe { bindings::put_unused_fd(self.fd) };
+ }
+}
+
/// Represents the `EBADF` error code.
///
/// Used for methods that can only fail with `EBADF`.

--
2.43.0.687.g38aa6559b0-goog


2024-02-09 11:22:39

by Alice Ryhl

[permalink] [raw]
Subject: [PATCH v5 5/9] rust: security: add abstraction for secctx

Add an abstraction for viewing the string representation of a security
context.

This is needed by Rust Binder because it has a feature where a process
can view the string representation of the security context for incoming
transactions. The process can use that to authenticate incoming
transactions, and since the feature is provided by the kernel, the
process can trust that the security context is legitimate.

Reviewed-by: Benno Lossin <[email protected]>
Reviewed-by: Martin Rodriguez Reboredo <[email protected]>
Reviewed-by: Trevor Gross <[email protected]>
Signed-off-by: Alice Ryhl <[email protected]>
---
rust/bindings/bindings_helper.h | 1 +
rust/helpers.c | 21 ++++++++++++
rust/kernel/cred.rs | 8 +++++
rust/kernel/lib.rs | 1 +
rust/kernel/security.rs | 72 +++++++++++++++++++++++++++++++++++++++++
5 files changed, 103 insertions(+)

diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
index 84a56e8b6b67..5ca497d786f0 100644
--- a/rust/bindings/bindings_helper.h
+++ b/rust/bindings/bindings_helper.h
@@ -15,6 +15,7 @@
#include <linux/jiffies.h>
#include <linux/mdio.h>
#include <linux/phy.h>
+#include <linux/security.h>
#include <linux/slab.h>
#include <linux/refcount.h>
#include <linux/wait.h>
diff --git a/rust/helpers.c b/rust/helpers.c
index 10ed69f76424..fd633d9db79a 100644
--- a/rust/helpers.c
+++ b/rust/helpers.c
@@ -30,6 +30,7 @@
#include <linux/mutex.h>
#include <linux/refcount.h>
#include <linux/sched/signal.h>
+#include <linux/security.h>
#include <linux/spinlock.h>
#include <linux/wait.h>
#include <linux/workqueue.h>
@@ -177,6 +178,26 @@ void rust_helper_put_cred(const struct cred *cred)
}
EXPORT_SYMBOL_GPL(rust_helper_put_cred);

+#ifndef CONFIG_SECURITY
+void rust_helper_security_cred_getsecid(const struct cred *c, u32 *secid)
+{
+ security_cred_getsecid(c, secid);
+}
+EXPORT_SYMBOL_GPL(rust_helper_security_cred_getsecid);
+
+int rust_helper_security_secid_to_secctx(u32 secid, char **secdata, u32 *seclen)
+{
+ return security_secid_to_secctx(secid, secdata, seclen);
+}
+EXPORT_SYMBOL_GPL(rust_helper_security_secid_to_secctx);
+
+void rust_helper_security_release_secctx(char *secdata, u32 seclen)
+{
+ security_release_secctx(secdata, seclen);
+}
+EXPORT_SYMBOL_GPL(rust_helper_security_release_secctx);
+#endif
+
/*
* `bindgen` binds the C `size_t` type as the Rust `usize` type, so we can
* use it in contexts where Rust expects a `usize` like slice (array) indices.
diff --git a/rust/kernel/cred.rs b/rust/kernel/cred.rs
index 360d6fdbe5e7..fdd899040098 100644
--- a/rust/kernel/cred.rs
+++ b/rust/kernel/cred.rs
@@ -50,6 +50,14 @@ pub unsafe fn from_ptr<'a>(ptr: *const bindings::cred) -> &'a Credential {
unsafe { &*ptr.cast() }
}

+ /// Get the id for this security context.
+ pub fn get_secid(&self) -> u32 {
+ let mut secid = 0;
+ // SAFETY: The invariants of this type ensures that the pointer is valid.
+ unsafe { bindings::security_cred_getsecid(self.0.get(), &mut secid) };
+ secid
+ }
+
/// Returns the effective UID of the given credential.
pub fn euid(&self) -> bindings::kuid_t {
// SAFETY: By the type invariant, we know that `self.0` is valid. Furthermore, the `euid`
diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs
index f65e5978f807..d55d065642f0 100644
--- a/rust/kernel/lib.rs
+++ b/rust/kernel/lib.rs
@@ -44,6 +44,7 @@
pub mod net;
pub mod prelude;
pub mod print;
+pub mod security;
mod static_assert;
#[doc(hidden)]
pub mod std_vendor;
diff --git a/rust/kernel/security.rs b/rust/kernel/security.rs
new file mode 100644
index 000000000000..ee2ef0385bae
--- /dev/null
+++ b/rust/kernel/security.rs
@@ -0,0 +1,72 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! Linux Security Modules (LSM).
+//!
+//! C header: [`include/linux/security.h`](srctree/include/linux/security.h).
+
+use crate::{
+ bindings,
+ error::{to_result, Result},
+};
+
+/// A security context string.
+///
+/// # Invariants
+///
+/// The `secdata` and `seclen` fields correspond to a valid security context as returned by a
+/// successful call to `security_secid_to_secctx`, that has not yet been destroyed by calling
+/// `security_release_secctx`.
+pub struct SecurityCtx {
+ secdata: *mut core::ffi::c_char,
+ seclen: usize,
+}
+
+impl SecurityCtx {
+ /// Get the security context given its id.
+ pub fn from_secid(secid: u32) -> Result<Self> {
+ let mut secdata = core::ptr::null_mut();
+ let mut seclen = 0u32;
+ // SAFETY: Just a C FFI call. The pointers are valid for writes.
+ to_result(unsafe { bindings::security_secid_to_secctx(secid, &mut secdata, &mut seclen) })?;
+
+ // INVARIANT: If the above call did not fail, then we have a valid security context.
+ Ok(Self {
+ secdata,
+ seclen: seclen as usize,
+ })
+ }
+
+ /// Returns whether the security context is empty.
+ pub fn is_empty(&self) -> bool {
+ self.seclen == 0
+ }
+
+ /// Returns the length of this security context.
+ pub fn len(&self) -> usize {
+ self.seclen
+ }
+
+ /// Returns the bytes for this security context.
+ pub fn as_bytes(&self) -> &[u8] {
+ let ptr = self.secdata;
+ if ptr.is_null() {
+ debug_assert_eq!(self.seclen, 0);
+ // We can't pass a null pointer to `slice::from_raw_parts` even if the length is zero.
+ return &[];
+ }
+
+ // SAFETY: The call to `security_secid_to_secctx` guarantees that the pointer is valid for
+ // `seclen` bytes. Furthermore, if the length is zero, then we have ensured that the
+ // pointer is not null.
+ unsafe { core::slice::from_raw_parts(ptr.cast(), self.seclen) }
+ }
+}
+
+impl Drop for SecurityCtx {
+ fn drop(&mut self) {
+ // SAFETY: By the invariant of `Self`, this frees a pointer that came from a successful
+ // call to `security_secid_to_secctx` and has not yet been destroyed by
+ // `security_release_secctx`.
+ unsafe { bindings::security_release_secctx(self.secdata, self.seclen as u32) };
+ }
+}

--
2.43.0.687.g38aa6559b0-goog


2024-02-09 11:22:40

by Alice Ryhl

[permalink] [raw]
Subject: [PATCH v5 8/9] rust: file: add `DeferredFdCloser`

To close an fd from kernel space, we could call `ksys_close`. However,
if we do this to an fd that is held using `fdget`, then we may trigger a
use-after-free. Introduce a helper that can be used to close an fd even
if the fd is currently held with `fdget`. This is done by grabbing an
extra refcount to the file and dropping it in a task work once we return
to userspace.

This is necessary for Rust Binder because otherwise the user might try
to have Binder close its fd for /dev/binder, which would cause problems
as this happens inside an ioctl on /dev/binder, and ioctls hold the fd
using `fdget`.

Additional motivation can be found in commit 80cd795630d6 ("binder: fix
use-after-free due to ksys_close() during fdget()") and in the comments
on `binder_do_fd_close`.

If there is some way to detect whether an fd is currently held with
`fdget`, then this could be optimized to skip the allocation and task
work when this is not the case. Another possible optimization would be
to combine several fds into a single task work, since this is used with
fd arrays that might hold several fds.

That said, it might not be necessary to optimize it, because Rust Binder
has two ways to send fds: BINDER_TYPE_FD and BINDER_TYPE_FDA. With
BINDER_TYPE_FD, it is userspace's responsibility to close the fd, so
this mechanism is used only by BINDER_TYPE_FDA, but fd arrays are used
rarely these days.

Reviewed-by: Benno Lossin <[email protected]>
Reviewed-by: Martin Rodriguez Reboredo <[email protected]>
Signed-off-by: Alice Ryhl <[email protected]>
---
rust/bindings/bindings_helper.h | 2 +
rust/helpers.c | 8 ++
rust/kernel/file.rs | 184 +++++++++++++++++++++++++++++++++++++++-
rust/kernel/task.rs | 14 +++
4 files changed, 207 insertions(+), 1 deletion(-)

diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
index 4194b057ef6b..f4d9d04333c0 100644
--- a/rust/bindings/bindings_helper.h
+++ b/rust/bindings/bindings_helper.h
@@ -10,6 +10,7 @@
#include <linux/cred.h>
#include <linux/errname.h>
#include <linux/ethtool.h>
+#include <linux/fdtable.h>
#include <linux/file.h>
#include <linux/fs.h>
#include <linux/jiffies.h>
@@ -21,6 +22,7 @@
#include <linux/refcount.h>
#include <linux/wait.h>
#include <linux/sched.h>
+#include <linux/task_work.h>
#include <linux/workqueue.h>

/* `bindgen` gets confused at certain things. */
diff --git a/rust/helpers.c b/rust/helpers.c
index 58e3a9dff349..d146bbf25aec 100644
--- a/rust/helpers.c
+++ b/rust/helpers.c
@@ -32,6 +32,7 @@
#include <linux/sched/signal.h>
#include <linux/security.h>
#include <linux/spinlock.h>
+#include <linux/task_work.h>
#include <linux/wait.h>
#include <linux/workqueue.h>

@@ -243,6 +244,13 @@ void rust_helper_security_release_secctx(char *secdata, u32 seclen)
EXPORT_SYMBOL_GPL(rust_helper_security_release_secctx);
#endif

+void rust_helper_init_task_work(struct callback_head *twork,
+ task_work_func_t func)
+{
+ init_task_work(twork, func);
+}
+EXPORT_SYMBOL_GPL(rust_helper_init_task_work);
+
/*
* `bindgen` binds the C `size_t` type as the Rust `usize` type, so we can
* use it in contexts where Rust expects a `usize` like slice (array) indices.
diff --git a/rust/kernel/file.rs b/rust/kernel/file.rs
index fb903b7f23fe..8902f490ccc8 100644
--- a/rust/kernel/file.rs
+++ b/rust/kernel/file.rs
@@ -11,7 +11,8 @@
error::{code::*, Error, Result},
types::{ARef, AlwaysRefCounted, NotThreadSafe, Opaque},
};
-use core::ptr;
+use alloc::boxed::Box;
+use core::{alloc::AllocError, mem, ptr};

/// Flags associated with a [`File`].
pub mod flags {
@@ -318,6 +319,187 @@ fn drop(&mut self) {
}
}

+/// Helper used for closing file descriptors in a way that is safe even if the file is currently
+/// held using `fdget`.
+///
+/// Additional motivation can be found in commit 80cd795630d6 ("binder: fix use-after-free due to
+/// ksys_close() during fdget()") and in the comments on `binder_do_fd_close`.
+pub struct DeferredFdCloser {
+ inner: Box<DeferredFdCloserInner>,
+}
+
+/// SAFETY: This just holds an allocation with no real content, so there's no safety issue with
+/// moving it across threads.
+unsafe impl Send for DeferredFdCloser {}
+unsafe impl Sync for DeferredFdCloser {}
+
+/// # Invariants
+///
+/// If the `file` pointer is non-null, then it points at a `struct file` and owns a refcount to
+/// that file.
+#[repr(C)]
+struct DeferredFdCloserInner {
+ twork: mem::MaybeUninit<bindings::callback_head>,
+ file: *mut bindings::file,
+}
+
+impl DeferredFdCloser {
+ /// Create a new [`DeferredFdCloser`].
+ pub fn new() -> Result<Self, AllocError> {
+ Ok(Self {
+ // INVARIANT: The `file` pointer is null, so the type invariant does not apply.
+ inner: Box::try_new(DeferredFdCloserInner {
+ twork: mem::MaybeUninit::uninit(),
+ file: core::ptr::null_mut(),
+ })?,
+ })
+ }
+
+ /// Schedule a task work that closes the file descriptor when this task returns to userspace.
+ ///
+ /// Fails if this is called from a context where we cannot run work when returning to
+ /// userspace. (E.g., from a kthread.)
+ pub fn close_fd(self, fd: u32) -> Result<(), DeferredFdCloseError> {
+ use bindings::task_work_notify_mode_TWA_RESUME as TWA_RESUME;
+
+ // In this method, we schedule the task work before closing the file. This is because
+ // scheduling a task work is fallible, and we need to know whether it will fail before we
+ // attempt to close the file.
+
+ // Task works are not available on kthreads.
+ let current = crate::current!();
+ if current.is_kthread() {
+ return Err(DeferredFdCloseError::TaskWorkUnavailable);
+ }
+
+ // Transfer ownership of the box's allocation to a raw pointer. This disables the
+ // destructor, so we must manually convert it back to a Box to drop it.
+ //
+ // Until we convert it back to a `Box`, there are no aliasing requirements on this
+ // pointer.
+ let inner = Box::into_raw(self.inner);
+
+ // The `callback_head` field is first in the struct, so this cast correctly gives us a
+ // pointer to the field.
+ let callback_head = inner.cast::<bindings::callback_head>();
+ // SAFETY: This pointer offset operation does not go out-of-bounds.
+ let file_field = unsafe { core::ptr::addr_of_mut!((*inner).file) };
+
+ let current = current.as_raw();
+
+ // SAFETY: This function currently has exclusive access to the `DeferredFdCloserInner`, so
+ // it is okay for us to perform unsynchronized writes to its `callback_head` field.
+ unsafe { bindings::init_task_work(callback_head, Some(Self::do_close_fd)) };
+
+ // SAFETY: This inserts the `DeferredFdCloserInner` into the task workqueue for the current
+ // task. If this operation is successful, then this transfers exclusive ownership of the
+ // `callback_head` field to the C side until it calls `do_close_fd`, and we don't touch or
+ // invalidate the field during that time.
+ //
+ // When the C side calls `do_close_fd`, the safety requirements of that method are
+ // satisfied because when a task work is executed, the callback is given ownership of the
+ // pointer.
+ //
+ // The file pointer is currently null. If it is changed to be non-null before `do_close_fd`
+ // is called, then that change happens due to the write at the end of this function, and
+ // that write has a safety comment that explains why the refcount can be dropped when
+ // `do_close_fd` runs.
+ let res = unsafe { bindings::task_work_add(current, callback_head, TWA_RESUME) };
+
+ if res != 0 {
+ // SAFETY: Scheduling the task work failed, so we still have ownership of the box, so
+ // we may destroy it.
+ unsafe { drop(Box::from_raw(inner)) };
+
+ return Err(DeferredFdCloseError::TaskWorkUnavailable);
+ }
+
+ // This removes the fd from the fd table in `current`. The file is not fully closed until
+ // `filp_close` is called. We are given ownership of one refcount to the file.
+ //
+ // SAFETY: This is safe no matter what `fd` is. If the `fd` is valid (that is, if the
+ // pointer is non-null), then we call `filp_close` on the returned pointer as required by
+ // `file_close_fd`.
+ let file = unsafe { bindings::file_close_fd(fd) };
+ if file.is_null() {
+ // We don't clean up the task work since that might be expensive if the task work queue
+ // is long. Just let it execute and let it clean up for itself.
+ return Err(DeferredFdCloseError::BadFd);
+ }
+
+ // Acquire a second refcount to the file.
+ //
+ // SAFETY: The `file` pointer points at a file with a non-zero refcount.
+ unsafe { bindings::get_file(file) };
+
+ // This method closes the fd, consuming one of our two refcounts. There could be active
+ // light refcounts created from that fd, so we must ensure that the file has a positive
+ // refcount for the duration of those active light refcounts. We do that by holding on to
+ // the second refcount until the current task returns to userspace.
+ //
+ // SAFETY: The `file` pointer is valid. Passing `current->files` as the file table to close
+ // it in is correct, since we just got the `fd` from `file_close_fd` which also uses
+ // `current->files`.
+ //
+ // Note: fl_owner_t is currently a void pointer.
+ unsafe { bindings::filp_close(file, (*current).files as bindings::fl_owner_t) };
+
+ // We update the file pointer that the task work is supposed to fput. This transfers
+ // ownership of our last refcount.
+ //
+ // INVARIANT: This changes the `file` field of a `DeferredFdCloserInner` from null to
+ // non-null. This doesn't break the type invariant for `DeferredFdCloserInner` because we
+ // still own a refcount to the file, so we can pass ownership of that refcount to the
+ // `DeferredFdCloserInner`.
+ //
+ // When `do_close_fd` runs, it must be safe for it to `fput` the refcount. However, this is
+ // the case because all light refcounts that are associated with the fd we closed
+ // previously must be dropped when `do_close_fd`, since light refcounts must be dropped
+ // before returning to userspace.
+ //
+ // SAFETY: Task works are executed on the current thread right before we return to
+ // userspace, so this write is guaranteed to happen before `do_close_fd` is called, which
+ // means that a race is not possible here.
+ unsafe { *file_field = file };
+
+ Ok(())
+ }
+
+ /// # Safety
+ ///
+ /// The provided pointer must point at the `twork` field of a `DeferredFdCloserInner` stored in
+ /// a `Box`, and the caller must pass exclusive ownership of that `Box`. Furthermore, if the
+ /// file pointer is non-null, then it must be okay to release the refcount by calling `fput`.
+ unsafe extern "C" fn do_close_fd(inner: *mut bindings::callback_head) {
+ // SAFETY: The caller just passed us ownership of this box.
+ let inner = unsafe { Box::from_raw(inner.cast::<DeferredFdCloserInner>()) };
+ if !inner.file.is_null() {
+ // SAFETY: By the type invariants, we own a refcount to this file, and the caller
+ // guarantees that dropping the refcount now is okay.
+ unsafe { bindings::fput(inner.file) };
+ }
+ // The allocation is freed when `inner` goes out of scope.
+ }
+}
+
+/// Represents a failure to close an fd in a deferred manner.
+#[derive(Copy, Clone, Debug, Eq, PartialEq)]
+pub enum DeferredFdCloseError {
+ /// Closing the fd failed because we were unable to schedule a task work.
+ TaskWorkUnavailable,
+ /// Closing the fd failed because the fd does not exist.
+ BadFd,
+}
+
+impl From<DeferredFdCloseError> for Error {
+ fn from(err: DeferredFdCloseError) -> Error {
+ match err {
+ DeferredFdCloseError::TaskWorkUnavailable => ESRCH,
+ DeferredFdCloseError::BadFd => EBADF,
+ }
+ }
+}
+
/// Represents the `EBADF` error code.
///
/// Used for methods that can only fail with `EBADF`.
diff --git a/rust/kernel/task.rs b/rust/kernel/task.rs
index f46ea3ba9e8e..6adcd8ffcfde 100644
--- a/rust/kernel/task.rs
+++ b/rust/kernel/task.rs
@@ -145,6 +145,12 @@ fn deref(&self) -> &Self::Target {
}
}

+ /// Returns a raw pointer to the task.
+ #[inline]
+ pub fn as_raw(&self) -> *mut bindings::task_struct {
+ self.0.get()
+ }
+
/// Returns the group leader of the given task.
pub fn group_leader(&self) -> &Task {
// SAFETY: By the type invariant, we know that `self.0` is a valid task. Valid tasks always
@@ -189,6 +195,14 @@ pub fn pid_in_current_ns(&self) -> Pid {
unsafe { bindings::task_tgid_nr_ns(self.0.get(), ptr::null_mut()) }
}

+ /// Returns whether this task corresponds to a kernel thread.
+ pub fn is_kthread(&self) -> bool {
+ // SAFETY: By the type invariant, we know that `self.0.get()` is non-null and valid. There
+ // are no further requirements to read the task's flags.
+ let flags = unsafe { (*self.0.get()).flags };
+ (flags & bindings::PF_KTHREAD) != 0
+ }
+
/// Wakes up the task.
pub fn wake_up(&self) {
// SAFETY: By the type invariant, we know that `self.0.get()` is non-null and valid.

--
2.43.0.687.g38aa6559b0-goog


2024-02-09 11:22:57

by Alice Ryhl

[permalink] [raw]
Subject: [PATCH v5 7/9] rust: file: add `Kuid` wrapper

Adds a wrapper around `kuid_t` called `Kuid`. This allows us to define
various operations on kuids such as equality and current_euid. It also
lets us provide conversions from kuid into userspace values.

Rust Binder needs these operations because it needs to compare kuids for
equality, and it needs to tell userspace about the pid and uid of
incoming transactions.

To read kuids from a `struct task_struct`, you must currently use
various #defines that perform the appropriate field access under an RCU
read lock. Currently, we do not have a Rust wrapper for rcu_read_lock,
which means that for this patch, there are two ways forward:

1. Inline the methods into Rust code, and use __rcu_read_lock directly
rather than the rcu_read_lock wrapper. This gives up lockdep for
these usages of RCU.

2. Wrap the various #defines in helpers and call the helpers from Rust.

This patch uses the second option. One possible disadvantage of the
second option is the possible introduction of speculation gadgets, but
as discussed in [1], the risk appears to be acceptable.

Of course, once a wrapper for rcu_read_lock is available, it is
preferable to use that over either of the two above approaches.

Link: https://lore.kernel.org/all/202312080947.674CD2DC7@keescook/ [1]
Reviewed-by: Benno Lossin <[email protected]>
Reviewed-by: Martin Rodriguez Reboredo <[email protected]>
Signed-off-by: Alice Ryhl <[email protected]>
---
rust/bindings/bindings_helper.h | 1 +
rust/helpers.c | 45 ++++++++++++++++++++++++++++
rust/kernel/cred.rs | 5 ++--
rust/kernel/task.rs | 66 +++++++++++++++++++++++++++++++++++++++++
4 files changed, 115 insertions(+), 2 deletions(-)

diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
index 5ca497d786f0..4194b057ef6b 100644
--- a/rust/bindings/bindings_helper.h
+++ b/rust/bindings/bindings_helper.h
@@ -15,6 +15,7 @@
#include <linux/jiffies.h>
#include <linux/mdio.h>
#include <linux/phy.h>
+#include <linux/pid_namespace.h>
#include <linux/security.h>
#include <linux/slab.h>
#include <linux/refcount.h>
diff --git a/rust/helpers.c b/rust/helpers.c
index fd633d9db79a..58e3a9dff349 100644
--- a/rust/helpers.c
+++ b/rust/helpers.c
@@ -142,6 +142,51 @@ void rust_helper_put_task_struct(struct task_struct *t)
}
EXPORT_SYMBOL_GPL(rust_helper_put_task_struct);

+kuid_t rust_helper_task_uid(struct task_struct *task)
+{
+ return task_uid(task);
+}
+EXPORT_SYMBOL_GPL(rust_helper_task_uid);
+
+kuid_t rust_helper_task_euid(struct task_struct *task)
+{
+ return task_euid(task);
+}
+EXPORT_SYMBOL_GPL(rust_helper_task_euid);
+
+#ifndef CONFIG_USER_NS
+uid_t rust_helper_from_kuid(struct user_namespace *to, kuid_t uid)
+{
+ return from_kuid(to, uid);
+}
+EXPORT_SYMBOL_GPL(rust_helper_from_kuid);
+#endif /* CONFIG_USER_NS */
+
+bool rust_helper_uid_eq(kuid_t left, kuid_t right)
+{
+ return uid_eq(left, right);
+}
+EXPORT_SYMBOL_GPL(rust_helper_uid_eq);
+
+kuid_t rust_helper_current_euid(void)
+{
+ return current_euid();
+}
+EXPORT_SYMBOL_GPL(rust_helper_current_euid);
+
+struct user_namespace *rust_helper_current_user_ns(void)
+{
+ return current_user_ns();
+}
+EXPORT_SYMBOL_GPL(rust_helper_current_user_ns);
+
+pid_t rust_helper_task_tgid_nr_ns(struct task_struct *tsk,
+ struct pid_namespace *ns)
+{
+ return task_tgid_nr_ns(tsk, ns);
+}
+EXPORT_SYMBOL_GPL(rust_helper_task_tgid_nr_ns);
+
struct kunit *rust_helper_kunit_get_current_test(void)
{
return kunit_get_current_test();
diff --git a/rust/kernel/cred.rs b/rust/kernel/cred.rs
index fdd899040098..961e94b6a657 100644
--- a/rust/kernel/cred.rs
+++ b/rust/kernel/cred.rs
@@ -8,6 +8,7 @@

use crate::{
bindings,
+ task::Kuid,
types::{AlwaysRefCounted, Opaque},
};

@@ -59,11 +60,11 @@ pub fn get_secid(&self) -> u32 {
}

/// Returns the effective UID of the given credential.
- pub fn euid(&self) -> bindings::kuid_t {
+ pub fn euid(&self) -> Kuid {
// SAFETY: By the type invariant, we know that `self.0` is valid. Furthermore, the `euid`
// field of a credential is never changed after initialization, so there is no potential
// for data races.
- unsafe { (*self.0.get()).euid }
+ Kuid::from_raw(unsafe { (*self.0.get()).euid })
}
}

diff --git a/rust/kernel/task.rs b/rust/kernel/task.rs
index b579367fb923..f46ea3ba9e8e 100644
--- a/rust/kernel/task.rs
+++ b/rust/kernel/task.rs
@@ -9,6 +9,7 @@
types::{NotThreadSafe, Opaque},
};
use core::{
+ cmp::{Eq, PartialEq},
ffi::{c_int, c_long, c_uint},
ops::Deref,
ptr,
@@ -96,6 +97,12 @@ unsafe impl Sync for Task {}
/// The type of process identifiers (PIDs).
type Pid = bindings::pid_t;

+/// The type of user identifiers (UIDs).
+#[derive(Copy, Clone)]
+pub struct Kuid {
+ kuid: bindings::kuid_t,
+}
+
impl Task {
/// Returns a raw pointer to the current task.
///
@@ -157,12 +164,31 @@ pub fn pid(&self) -> Pid {
unsafe { *ptr::addr_of!((*self.0.get()).pid) }
}

+ /// Returns the UID of the given task.
+ pub fn uid(&self) -> Kuid {
+ // SAFETY: By the type invariant, we know that `self.0` is valid.
+ Kuid::from_raw(unsafe { bindings::task_uid(self.0.get()) })
+ }
+
+ /// Returns the effective UID of the given task.
+ pub fn euid(&self) -> Kuid {
+ // SAFETY: By the type invariant, we know that `self.0` is valid.
+ Kuid::from_raw(unsafe { bindings::task_euid(self.0.get()) })
+ }
+
/// Determines whether the given task has pending signals.
pub fn signal_pending(&self) -> bool {
// SAFETY: By the type invariant, we know that `self.0` is valid.
unsafe { bindings::signal_pending(self.0.get()) != 0 }
}

+ /// Returns the given task's pid in the current pid namespace.
+ pub fn pid_in_current_ns(&self) -> Pid {
+ // SAFETY: We know that `self.0.get()` is valid by the type invariant, and passing a null
+ // pointer as the namespace is correct for using the current namespace.
+ unsafe { bindings::task_tgid_nr_ns(self.0.get(), ptr::null_mut()) }
+ }
+
/// Wakes up the task.
pub fn wake_up(&self) {
// SAFETY: By the type invariant, we know that `self.0.get()` is non-null and valid.
@@ -172,6 +198,46 @@ pub fn wake_up(&self) {
}
}

+impl Kuid {
+ /// Get the current euid.
+ #[inline]
+ pub fn current_euid() -> Kuid {
+ // SAFETY: Just an FFI call.
+ Self::from_raw(unsafe { bindings::current_euid() })
+ }
+
+ /// Create a `Kuid` given the raw C type.
+ #[inline]
+ pub fn from_raw(kuid: bindings::kuid_t) -> Self {
+ Self { kuid }
+ }
+
+ /// Turn this kuid into the raw C type.
+ #[inline]
+ pub fn into_raw(self) -> bindings::kuid_t {
+ self.kuid
+ }
+
+ /// Converts this kernel UID into a userspace UID.
+ ///
+ /// Uses the namespace of the current task.
+ #[inline]
+ pub fn into_uid_in_current_ns(self) -> bindings::uid_t {
+ // SAFETY: Just an FFI call.
+ unsafe { bindings::from_kuid(bindings::current_user_ns(), self.kuid) }
+ }
+}
+
+impl PartialEq for Kuid {
+ #[inline]
+ fn eq(&self, other: &Kuid) -> bool {
+ // SAFETY: Just an FFI call.
+ unsafe { bindings::uid_eq(self.kuid, other.kuid) }
+ }
+}
+
+impl Eq for Kuid {}
+
// SAFETY: The type invariants guarantee that `Task` is always ref-counted.
unsafe impl crate::types::AlwaysRefCounted for Task {
fn inc_ref(&self) {

--
2.43.0.687.g38aa6559b0-goog


2024-02-09 11:41:52

by Alice Ryhl

[permalink] [raw]
Subject: [PATCH v5 3/9] rust: file: add Rust abstraction for `struct file`

From: Wedson Almeida Filho <[email protected]>

This abstraction makes it possible to manipulate the open files for a
process. The new `File` struct wraps the C `struct file`. When accessing
it using the smart pointer `ARef<File>`, the pointer will own a
reference count to the file. When accessing it as `&File`, then the
reference does not own a refcount, but the borrow checker will ensure
that the reference count does not hit zero while the `&File` is live.

Since this is intended to manipulate the open files of a process, we
introduce an `fget` constructor that corresponds to the C `fget`
method. In future patches, it will become possible to create a new fd in
a process and bind it to a `File`. Rust Binder will use these to send
fds from one process to another.

We also provide a method for accessing the file's flags. Rust Binder
will use this to access the flags of the Binder fd to check whether the
non-blocking flag is set, which affects what the Binder ioctl does.

This introduces a struct for the EBADF error type, rather than just
using the Error type directly. This has two advantages:
* `File::from_fd` returns a `Result<ARef<File>, BadFdError>`, which the
compiler will represent as a single pointer, with null being an error.
This is possible because the compiler understands that `BadFdError`
has only one possible value, and it also understands that the
`ARef<File>` smart pointer is guaranteed non-null.
* Additionally, we promise to users of the method that the method can
only fail with EBADF, which means that they can rely on this promise
without having to inspect its implementation.
That said, there are also two disadvantages:
* Defining additional error types involves boilerplate.
* The question mark operator will only utilize the `From` trait once,
which prevents you from using the question mark operator on
`BadFdError` in methods that return some third error type that the
kernel `Error` is convertible into. (However, it works fine in methods
that return `Error`.)

Signed-off-by: Wedson Almeida Filho <[email protected]>
Co-developed-by: Daniel Xu <[email protected]>
Signed-off-by: Daniel Xu <[email protected]>
Co-developed-by: Alice Ryhl <[email protected]>
Reviewed-by: Martin Rodriguez Reboredo <[email protected]>
Reviewed-by: Trevor Gross <[email protected]>
Reviewed-by: Benno Lossin <[email protected]>
Signed-off-by: Alice Ryhl <[email protected]>
---
fs/file.c | 7 ++
rust/bindings/bindings_helper.h | 2 +
rust/helpers.c | 7 ++
rust/kernel/file.rs | 254 ++++++++++++++++++++++++++++++++++++++++
rust/kernel/lib.rs | 1 +
5 files changed, 271 insertions(+)

diff --git a/fs/file.c b/fs/file.c
index 3b683b9101d8..f2eab5fcb87f 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -1127,6 +1127,13 @@ EXPORT_SYMBOL(task_lookup_next_fdget_rcu);
*
* The fput_needed flag returned by fget_light should be passed to the
* corresponding fput_light.
+ *
+ * (As an exception to rule 2, you can call filp_close between fget_light and
+ * fput_light provided that you capture a real refcount with get_file before
+ * the call to filp_close, and ensure that this real refcount is fput *after*
+ * the fput_light call.)
+ *
+ * See also the documentation in rust/kernel/file.rs.
*/
static unsigned long __fget_light(unsigned int fd, fmode_t mask)
{
diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
index 936651110c39..41fcd2905ed4 100644
--- a/rust/bindings/bindings_helper.h
+++ b/rust/bindings/bindings_helper.h
@@ -9,6 +9,8 @@
#include <kunit/test.h>
#include <linux/errname.h>
#include <linux/ethtool.h>
+#include <linux/file.h>
+#include <linux/fs.h>
#include <linux/jiffies.h>
#include <linux/mdio.h>
#include <linux/phy.h>
diff --git a/rust/helpers.c b/rust/helpers.c
index 70e59efd92bc..03141a3608a4 100644
--- a/rust/helpers.c
+++ b/rust/helpers.c
@@ -25,6 +25,7 @@
#include <linux/build_bug.h>
#include <linux/err.h>
#include <linux/errname.h>
+#include <linux/fs.h>
#include <linux/mutex.h>
#include <linux/refcount.h>
#include <linux/sched/signal.h>
@@ -157,6 +158,12 @@ void rust_helper_init_work_with_key(struct work_struct *work, work_func_t func,
}
EXPORT_SYMBOL_GPL(rust_helper_init_work_with_key);

+struct file *rust_helper_get_file(struct file *f)
+{
+ return get_file(f);
+}
+EXPORT_SYMBOL_GPL(rust_helper_get_file);
+
/*
* `bindgen` binds the C `size_t` type as the Rust `usize` type, so we can
* use it in contexts where Rust expects a `usize` like slice (array) indices.
diff --git a/rust/kernel/file.rs b/rust/kernel/file.rs
new file mode 100644
index 000000000000..cf8ebf619379
--- /dev/null
+++ b/rust/kernel/file.rs
@@ -0,0 +1,254 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! Files and file descriptors.
+//!
+//! C headers: [`include/linux/fs.h`](srctree/include/linux/fs.h) and
+//! [`include/linux/file.h`](srctree/include/linux/file.h)
+
+use crate::{
+ bindings,
+ error::{code::*, Error, Result},
+ types::{ARef, AlwaysRefCounted, Opaque},
+};
+use core::ptr;
+
+/// Flags associated with a [`File`].
+pub mod flags {
+ /// File is opened in append mode.
+ pub const O_APPEND: u32 = bindings::O_APPEND;
+
+ /// Signal-driven I/O is enabled.
+ pub const O_ASYNC: u32 = bindings::FASYNC;
+
+ /// Close-on-exec flag is set.
+ pub const O_CLOEXEC: u32 = bindings::O_CLOEXEC;
+
+ /// File was created if it didn't already exist.
+ pub const O_CREAT: u32 = bindings::O_CREAT;
+
+ /// Direct I/O is enabled for this file.
+ pub const O_DIRECT: u32 = bindings::O_DIRECT;
+
+ /// File must be a directory.
+ pub const O_DIRECTORY: u32 = bindings::O_DIRECTORY;
+
+ /// Like [`O_SYNC`] except metadata is not synced.
+ pub const O_DSYNC: u32 = bindings::O_DSYNC;
+
+ /// Ensure that this file is created with the `open(2)` call.
+ pub const O_EXCL: u32 = bindings::O_EXCL;
+
+ /// Large file size enabled (`off64_t` over `off_t`).
+ pub const O_LARGEFILE: u32 = bindings::O_LARGEFILE;
+
+ /// Do not update the file last access time.
+ pub const O_NOATIME: u32 = bindings::O_NOATIME;
+
+ /// File should not be used as process's controlling terminal.
+ pub const O_NOCTTY: u32 = bindings::O_NOCTTY;
+
+ /// If basename of path is a symbolic link, fail open.
+ pub const O_NOFOLLOW: u32 = bindings::O_NOFOLLOW;
+
+ /// File is using nonblocking I/O.
+ pub const O_NONBLOCK: u32 = bindings::O_NONBLOCK;
+
+ /// File is using nonblocking I/O.
+ ///
+ /// This is effectively the same flag as [`O_NONBLOCK`] on all architectures
+ /// except SPARC64.
+ pub const O_NDELAY: u32 = bindings::O_NDELAY;
+
+ /// Used to obtain a path file descriptor.
+ pub const O_PATH: u32 = bindings::O_PATH;
+
+ /// Write operations on this file will flush data and metadata.
+ pub const O_SYNC: u32 = bindings::O_SYNC;
+
+ /// This file is an unnamed temporary regular file.
+ pub const O_TMPFILE: u32 = bindings::O_TMPFILE;
+
+ /// File should be truncated to length 0.
+ pub const O_TRUNC: u32 = bindings::O_TRUNC;
+
+ /// Bitmask for access mode flags.
+ ///
+ /// # Examples
+ ///
+ /// ```
+ /// use kernel::file;
+ /// # fn do_something() {}
+ /// # let flags = 0;
+ /// if (flags & file::flags::O_ACCMODE) == file::flags::O_RDONLY {
+ /// do_something();
+ /// }
+ /// ```
+ pub const O_ACCMODE: u32 = bindings::O_ACCMODE;
+
+ /// File is read only.
+ pub const O_RDONLY: u32 = bindings::O_RDONLY;
+
+ /// File is write only.
+ pub const O_WRONLY: u32 = bindings::O_WRONLY;
+
+ /// File can be both read and written.
+ pub const O_RDWR: u32 = bindings::O_RDWR;
+}
+
+/// Wraps the kernel's `struct file`.
+///
+/// This represents an open file rather than a file on a filesystem. Processes generally reference
+/// open files using file descriptors. However, file descriptors are not the same as files. A file
+/// descriptor is just an integer that corresponds to a file, and a single file may be referenced
+/// by multiple file descriptors.
+///
+/// # Refcounting
+///
+/// Instances of this type are reference-counted. The reference count is incremented by the
+/// `fget`/`get_file` functions and decremented by `fput`. The Rust type `ARef<File>` represents a
+/// pointer that owns a reference count on the file.
+///
+/// Whenever a process opens a file descriptor (fd), it stores a pointer to the file in its `struct
+/// files_struct`. This pointer owns a reference count to the file, ensuring the file isn't
+/// prematurely deleted while the file descriptor is open. In Rust terminology, the pointers in
+/// `struct files_struct` are `ARef<File>` pointers.
+///
+/// ## Light refcounts
+///
+/// Whenever a process has an fd to a file, it may use something called a "light refcount" as a
+/// performance optimization. Light refcounts are acquired by calling `fdget` and released with
+/// `fdput`. The idea behind light refcounts is that if the fd is not closed between the calls to
+/// `fdget` and `fdput`, then the refcount cannot hit zero during that time, as the `struct
+/// files_struct` holds a reference until the fd is closed. This means that it's safe to access the
+/// file even if `fdget` does not increment the refcount.
+///
+/// The requirement that the fd is not closed during a light refcount applies globally across all
+/// threads - not just on the thread using the light refcount. For this reason, light refcounts are
+/// only used when the `struct files_struct` is not shared with other threads, since this ensures
+/// that other unrelated threads cannot suddenly start using the fd and close it. Therefore,
+/// calling `fdget` on a shared `struct files_struct` creates a normal refcount instead of a light
+/// refcount.
+///
+/// Light reference counts must be released with `fdput` before the system call returns to
+/// userspace. This means that if you wait until the current system call returns to userspace, then
+/// all light refcounts that existed at the time have gone away.
+///
+/// ## Rust references
+///
+/// The reference type `&File` is similar to light refcounts:
+///
+/// * `&File` references don't own a reference count. They can only exist as long as the reference
+/// count stays positive, and can only be created when there is some mechanism in place to ensure
+/// this.
+///
+/// * The Rust borrow-checker normally ensures this by enforcing that the `ARef<File>` from which
+/// a `&File` is created outlives the `&File`.
+///
+/// * Using the unsafe [`File::from_ptr`] means that it is up to the caller to ensure that the
+/// `&File` only exists while the reference count is positive.
+///
+/// * You can think of `fdget` as using an fd to look up an `ARef<File>` in the `struct
+/// files_struct` and create an `&File` from it. The "fd cannot be closed" rule is like the Rust
+/// rule "the `ARef<File>` must outlive the `&File`".
+///
+/// # Invariants
+///
+/// * Instances of this type are refcounted using the `f_count` field.
+/// * If an fd with active light refcounts is closed, then it must be the case that the file
+/// refcount is positive until all light refcounts of the fd have been dropped.
+/// * A light refcount must be dropped before returning to userspace.
+#[repr(transparent)]
+pub struct File(Opaque<bindings::file>);
+
+// SAFETY:
+// - `File::dec_ref` can be called from any thread.
+// - It is okay to send ownership of `struct file` across thread boundaries.
+unsafe impl Send for File {}
+
+// SAFETY: All methods defined on `File` that take `&self` are safe to call even if other threads
+// are concurrently accessing the same `struct file`, because those methods either access immutable
+// properties or have proper synchronization to ensure that such accesses are safe.
+unsafe impl Sync for File {}
+
+impl File {
+ /// Constructs a new `struct file` wrapper from a file descriptor.
+ ///
+ /// The file descriptor belongs to the current process.
+ pub fn fget(fd: u32) -> Result<ARef<Self>, BadFdError> {
+ // SAFETY: FFI call, there are no requirements on `fd`.
+ let ptr = ptr::NonNull::new(unsafe { bindings::fget(fd) }).ok_or(BadFdError)?;
+
+ // SAFETY: `bindings::fget` either returns null or a valid pointer to a file, and we
+ // checked for null above.
+ //
+ // INVARIANT: `bindings::fget` creates a refcount, and we pass ownership of the refcount to
+ // the new `ARef<File>`.
+ Ok(unsafe { ARef::from_raw(ptr.cast()) })
+ }
+
+ /// Creates a reference to a [`File`] from a valid pointer.
+ ///
+ /// # Safety
+ ///
+ /// The caller must ensure that `ptr` points at a valid file and that the file's refcount is
+ /// positive for the duration of 'a.
+ pub unsafe fn from_ptr<'a>(ptr: *const bindings::file) -> &'a File {
+ // SAFETY: The caller guarantees that the pointer is not dangling and stays valid for the
+ // duration of 'a. The cast is okay because `File` is `repr(transparent)`.
+ //
+ // INVARIANT: The safety requirements guarantee that the refcount does not hit zero during
+ // 'a.
+ unsafe { &*ptr.cast() }
+ }
+
+ /// Returns a raw pointer to the inner C struct.
+ #[inline]
+ pub fn as_ptr(&self) -> *mut bindings::file {
+ self.0.get()
+ }
+
+ /// Returns the flags associated with the file.
+ ///
+ /// The flags are a combination of the constants in [`flags`].
+ pub fn flags(&self) -> u32 {
+ // This `read_volatile` is intended to correspond to a READ_ONCE call.
+ //
+ // SAFETY: The file is valid because the shared reference guarantees a nonzero refcount.
+ //
+ // FIXME(read_once): Replace with `read_once` when available on the Rust side.
+ unsafe { core::ptr::addr_of!((*self.as_ptr()).f_flags).read_volatile() }
+ }
+}
+
+// SAFETY: The type invariants guarantee that `File` is always ref-counted. This implementation
+// makes `ARef<File>` own a normal refcount.
+unsafe impl AlwaysRefCounted for File {
+ fn inc_ref(&self) {
+ // SAFETY: The existence of a shared reference means that the refcount is nonzero.
+ unsafe { bindings::get_file(self.as_ptr()) };
+ }
+
+ unsafe fn dec_ref(obj: ptr::NonNull<File>) {
+ // SAFETY: To call this method, the caller passes us ownership of a normal refcount, so we
+ // may drop it. The cast is okay since `File` has the same representation as `struct file`.
+ unsafe { bindings::fput(obj.cast().as_ptr()) }
+ }
+}
+
+/// Represents the `EBADF` error code.
+///
+/// Used for methods that can only fail with `EBADF`.
+#[derive(Copy, Clone, Eq, PartialEq)]
+pub struct BadFdError;
+
+impl From<BadFdError> for Error {
+ fn from(_: BadFdError) -> Error {
+ EBADF
+ }
+}
+
+impl core::fmt::Debug for BadFdError {
+ fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result {
+ f.pad("EBADF")
+ }
+}
diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs
index b89ecf4e97a0..9353dd713a20 100644
--- a/rust/kernel/lib.rs
+++ b/rust/kernel/lib.rs
@@ -34,6 +34,7 @@
mod allocator;
mod build_assert;
pub mod error;
+pub mod file;
pub mod init;
pub mod ioctl;
#[cfg(CONFIG_KUNIT)]

--
2.43.0.687.g38aa6559b0-goog


2024-02-09 11:44:56

by Alice Ryhl

[permalink] [raw]
Subject: [PATCH v5 9/9] rust: file: add abstraction for `poll_table`

The existing `CondVar` abstraction is a wrapper around
`wait_queue_head`, but it does not support all use-cases of the C
`wait_queue_head` type. To be specific, a `CondVar` cannot be registered
with a `struct poll_table`. This limitation has the advantage that you
do not need to call `synchronize_rcu` when destroying a `CondVar`.

However, we need the ability to register a `poll_table` with a
`wait_queue_head` in Rust Binder. To enable this, introduce a type
called `PollCondVar`, which is like `CondVar` except that you can
register a `poll_table`. We also introduce `PollTable`, which is a safe
wrapper around `poll_table` that is intended to be used with
`PollCondVar`.

The destructor of `PollCondVar` unconditionally calls `synchronize_rcu`
to ensure that the removal of epoll waiters has fully completed before
the `wait_queue_head` is destroyed.

That said, `synchronize_rcu` is rather expensive and is not needed in
all cases: If we have never registered a `poll_table` with the
`wait_queue_head`, then we don't need to call `synchronize_rcu`. (And
this is a common case in Binder - not all processes use Binder with
epoll.) The current implementation does not account for this, but if we
find that it is necessary to improve this, a future patch could store a
boolean next to the `wait_queue_head` to keep track of whether a
`poll_table` has ever been registered.

Reviewed-by: Benno Lossin <[email protected]>
Reviewed-by: Martin Rodriguez Reboredo <[email protected]>
Signed-off-by: Alice Ryhl <[email protected]>
---
rust/bindings/bindings_helper.h | 1 +
rust/kernel/sync.rs | 1 +
rust/kernel/sync/poll.rs | 117 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 119 insertions(+)

diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
index f4d9d04333c0..c651d38e5dd6 100644
--- a/rust/bindings/bindings_helper.h
+++ b/rust/bindings/bindings_helper.h
@@ -17,6 +17,7 @@
#include <linux/mdio.h>
#include <linux/phy.h>
#include <linux/pid_namespace.h>
+#include <linux/poll.h>
#include <linux/security.h>
#include <linux/slab.h>
#include <linux/refcount.h>
diff --git a/rust/kernel/sync.rs b/rust/kernel/sync.rs
index c1fb10fc64f4..84b69e337a55 100644
--- a/rust/kernel/sync.rs
+++ b/rust/kernel/sync.rs
@@ -11,6 +11,7 @@
mod condvar;
pub mod lock;
mod locked_by;
+pub mod poll;

pub use arc::{Arc, ArcBorrow, UniqueArc};
pub use condvar::{CondVar, CondVarTimeoutResult};
diff --git a/rust/kernel/sync/poll.rs b/rust/kernel/sync/poll.rs
new file mode 100644
index 000000000000..a0e4f3de109a
--- /dev/null
+++ b/rust/kernel/sync/poll.rs
@@ -0,0 +1,117 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! Utilities for working with `struct poll_table`.
+
+use crate::{
+ bindings,
+ file::File,
+ prelude::*,
+ sync::{CondVar, LockClassKey},
+ types::Opaque,
+};
+use core::ops::Deref;
+
+/// Creates a [`PollCondVar`] initialiser with the given name and a newly-created lock class.
+#[macro_export]
+macro_rules! new_poll_condvar {
+ ($($name:literal)?) => {
+ $crate::sync::poll::PollCondVar::new($crate::optional_name!($($name)?), $crate::static_lock_class!())
+ };
+}
+
+/// Wraps the kernel's `struct poll_table`.
+///
+/// # Invariants
+///
+/// This struct contains a valid `struct poll_table`.
+///
+/// For a `struct poll_table` to be valid, its `_qproc` function must follow the safety
+/// requirements of `_qproc` functions:
+///
+/// * The `_qproc` function is given permission to enqueue a waiter to the provided `poll_table`
+/// during the call. Once the waiter is removed and an rcu grace period has passed, it must no
+/// longer access the `wait_queue_head`.
+#[repr(transparent)]
+pub struct PollTable(Opaque<bindings::poll_table>);
+
+impl PollTable {
+ /// Creates a reference to a [`PollTable`] from a valid pointer.
+ ///
+ /// # Safety
+ ///
+ /// The caller must ensure that for the duration of 'a, the pointer will point at a valid poll
+ /// table (as defined in the type invariants).
+ ///
+ /// The caller must also ensure that the `poll_table` is only accessed via the returned
+ /// reference for the duration of 'a.
+ pub unsafe fn from_ptr<'a>(ptr: *mut bindings::poll_table) -> &'a mut PollTable {
+ // SAFETY: The safety requirements guarantee the validity of the dereference, while the
+ // `PollTable` type being transparent makes the cast ok.
+ unsafe { &mut *ptr.cast() }
+ }
+
+ fn get_qproc(&self) -> bindings::poll_queue_proc {
+ let ptr = self.0.get();
+ // SAFETY: The `ptr` is valid because it originates from a reference, and the `_qproc`
+ // field is not modified concurrently with this call since we have an immutable reference.
+ unsafe { (*ptr)._qproc }
+ }
+
+ /// Register this [`PollTable`] with the provided [`PollCondVar`], so that it can be notified
+ /// using the condition variable.
+ pub fn register_wait(&mut self, file: &File, cv: &PollCondVar) {
+ if let Some(qproc) = self.get_qproc() {
+ // SAFETY: The pointers to `file` and `self` need to be valid for the duration of this
+ // call to `qproc`, which they are because they are references.
+ //
+ // The `cv.wait_queue_head` pointer must be valid until an rcu grace period after the
+ // waiter is removed. The `PollCondVar` is pinned, so before `cv.wait_queue_head` can
+ // be destroyed, the destructor must run. That destructor first removes all waiters,
+ // and then waits for an rcu grace period. Therefore, `cv.wait_queue_head` is valid for
+ // long enough.
+ unsafe { qproc(file.as_ptr() as _, cv.wait_queue_head.get(), self.0.get()) };
+ }
+ }
+}
+
+/// A wrapper around [`CondVar`] that makes it usable with [`PollTable`].
+///
+/// [`CondVar`]: crate::sync::CondVar
+#[pin_data(PinnedDrop)]
+pub struct PollCondVar {
+ #[pin]
+ inner: CondVar,
+}
+
+impl PollCondVar {
+ /// Constructs a new condvar initialiser.
+ pub fn new(name: &'static CStr, key: &'static LockClassKey) -> impl PinInit<Self> {
+ pin_init!(Self {
+ inner <- CondVar::new(name, key),
+ })
+ }
+}
+
+// Make the `CondVar` methods callable on `PollCondVar`.
+impl Deref for PollCondVar {
+ type Target = CondVar;
+
+ fn deref(&self) -> &CondVar {
+ &self.inner
+ }
+}
+
+#[pinned_drop]
+impl PinnedDrop for PollCondVar {
+ fn drop(self: Pin<&mut Self>) {
+ // Clear anything registered using `register_wait`.
+ //
+ // SAFETY: The pointer points at a valid `wait_queue_head`.
+ unsafe { bindings::__wake_up_pollfree(self.inner.wait_queue_head.get()) };
+
+ // Wait for epoll items to be properly removed.
+ //
+ // SAFETY: Just an FFI call.
+ unsafe { bindings::synchronize_rcu() };
+ }
+}

--
2.43.0.687.g38aa6559b0-goog


2024-02-09 12:11:28

by Alice Ryhl

[permalink] [raw]
Subject: Re: [PATCH v5 1/9] rust: types: add `NotThreadSafe`

On Fri, Feb 9, 2024 at 12:18 PM Alice Ryhl <[email protected]> wrote:
> + types::{NotThreadSafe, ScopeGuard, Opaque},

Oops. This one doesn't pass rustfmt. Embarrassing.

I was hoping that this would be the last version. Maybe Miguel can fix
the ordering here when he takes it? (Assuming I don't need to send a
v6.)

It's supposed to be:
types::{NotThreadSafe, Opaque, ScopeGuard},

There shouldn't be any other issues in this patchset.

Alice

2024-02-10 07:41:51

by Trevor Gross

[permalink] [raw]
Subject: Re: [PATCH v5 6/9] rust: file: add `FileDescriptorReservation`

On Fri, Feb 9, 2024 at 5:21 AM Alice Ryhl <[email protected]> wrote:
>
> From: Wedson Almeida Filho <[email protected]>
>
> Allow for the creation of a file descriptor in two steps: first, we
> reserve a slot for it, then we commit or drop the reservation. The first
> step may fail (e.g., the current process ran out of available slots),
> but commit and drop never fail (and are mutually exclusive).
>
> This is needed by Rust Binder when fds are sent from one process to
> another. It has to be a two-step process to properly handle the case
> where multiple fds are sent: The operation must fail or succeed
> atomically, which we achieve by first reserving the fds we need, and
> only installing the files once we have reserved enough fds to send the
> files.
>
> Fd reservations assume that the value of `current` does not change
> between the call to get_unused_fd_flags and the call to fd_install (or
> put_unused_fd). By not implementing the Send trait, this abstraction
> ensures that the `FileDescriptorReservation` cannot be moved into a
> different process.
>
> Signed-off-by: Wedson Almeida Filho <[email protected]>
> Co-developed-by: Alice Ryhl <[email protected]>
> Reviewed-by: Benno Lossin <[email protected]>
> Reviewed-by: Martin Rodriguez Reboredo <[email protected]>
> Signed-off-by: Alice Ryhl <[email protected]>

Reviewed-by: Trevor Gross <[email protected]>

2024-02-10 07:43:41

by Trevor Gross

[permalink] [raw]
Subject: Re: [PATCH v5 7/9] rust: file: add `Kuid` wrapper

On Fri, Feb 9, 2024 at 5:22 AM Alice Ryhl <[email protected]> wrote:
>
> Adds a wrapper around `kuid_t` called `Kuid`. This allows us to define
> various operations on kuids such as equality and current_euid. It also
> lets us provide conversions from kuid into userspace values.
>
> Rust Binder needs these operations because it needs to compare kuids for
> equality, and it needs to tell userspace about the pid and uid of
> incoming transactions.
>
> To read kuids from a `struct task_struct`, you must currently use
> various #defines that perform the appropriate field access under an RCU
> read lock. Currently, we do not have a Rust wrapper for rcu_read_lock,
> which means that for this patch, there are two ways forward:
>
> 1. Inline the methods into Rust code, and use __rcu_read_lock directly
> rather than the rcu_read_lock wrapper. This gives up lockdep for
> these usages of RCU.
>
> 2. Wrap the various #defines in helpers and call the helpers from Rust.
>
> This patch uses the second option. One possible disadvantage of the
> second option is the possible introduction of speculation gadgets, but
> as discussed in [1], the risk appears to be acceptable.
>
> Of course, once a wrapper for rcu_read_lock is available, it is
> preferable to use that over either of the two above approaches.

Is this worth a FIXME?

> Link: https://lore.kernel.org/all/202312080947.674CD2DC7@keescook/ [1]
> Reviewed-by: Benno Lossin <[email protected]>
> Reviewed-by: Martin Rodriguez Reboredo <[email protected]>
> Signed-off-by: Alice Ryhl <[email protected]>

Reviewed-by: Trevor Gross <[email protected]>

2024-02-10 07:48:08

by Trevor Gross

[permalink] [raw]
Subject: Re: [PATCH v5 8/9] rust: file: add `DeferredFdCloser`

On Fri, Feb 9, 2024 at 5:22 AM Alice Ryhl <[email protected]> wrote:
>
> To close an fd from kernel space, we could call `ksys_close`. However,
> if we do this to an fd that is held using `fdget`, then we may trigger a
> use-after-free. Introduce a helper that can be used to close an fd even
> if the fd is currently held with `fdget`. This is done by grabbing an
> extra refcount to the file and dropping it in a task work once we return
> to userspace.
>
> This is necessary for Rust Binder because otherwise the user might try
> to have Binder close its fd for /dev/binder, which would cause problems
> as this happens inside an ioctl on /dev/binder, and ioctls hold the fd
> using `fdget`.
>
> Additional motivation can be found in commit 80cd795630d6 ("binder: fix
> use-after-free due to ksys_close() during fdget()") and in the comments
> on `binder_do_fd_close`.
>
> If there is some way to detect whether an fd is currently held with
> `fdget`, then this could be optimized to skip the allocation and task
> work when this is not the case. Another possible optimization would be
> to combine several fds into a single task work, since this is used with
> fd arrays that might hold several fds.
>
> That said, it might not be necessary to optimize it, because Rust Binder
> has two ways to send fds: BINDER_TYPE_FD and BINDER_TYPE_FDA. With
> BINDER_TYPE_FD, it is userspace's responsibility to close the fd, so
> this mechanism is used only by BINDER_TYPE_FDA, but fd arrays are used
> rarely these days.
>
> Reviewed-by: Benno Lossin <[email protected]>
> Reviewed-by: Martin Rodriguez Reboredo <[email protected]>
> Signed-off-by: Alice Ryhl <[email protected]>
> ---

Reviewed-by: Trevor Gross <[email protected]>

> + /// Schedule a task work that closes the file descriptor when this task returns to userspace.
> + ///
> + /// Fails if this is called from a context where we cannot run work when returning to
> + /// userspace. (E.g., from a kthread.)
> + pub fn close_fd(self, fd: u32) -> Result<(), DeferredFdCloseError> {
> + use bindings::task_work_notify_mode_TWA_RESUME as TWA_RESUME;
> +
> + // In this method, we schedule the task work before closing the file. This is because
> + // scheduling a task work is fallible, and we need to know whether it will fail before we
> + // attempt to close the file.
> +
> + // Task works are not available on kthreads.
> + let current = crate::current!();
> + if current.is_kthread() {
> + return Err(DeferredFdCloseError::TaskWorkUnavailable);
> + }
> +
> + // Transfer ownership of the box's allocation to a raw pointer. This disables the
> + // destructor, so we must manually convert it back to a Box to drop it.
> + //
> + // Until we convert it back to a `Box`, there are no aliasing requirements on this
> + // pointer.
> + let inner = Box::into_raw(self.inner);
> +
> + // The `callback_head` field is first in the struct, so this cast correctly gives us a
> + // pointer to the field.
> + let callback_head = inner.cast::<bindings::callback_head>();
> + // SAFETY: This pointer offset operation does not go out-of-bounds.
> + let file_field = unsafe { core::ptr::addr_of_mut!((*inner).file) };
> +
> + let current = current.as_raw();
> +
> + // SAFETY: This function currently has exclusive access to the `DeferredFdCloserInner`, so
> + // it is okay for us to perform unsynchronized writes to its `callback_head` field.
> + unsafe { bindings::init_task_work(callback_head, Some(Self::do_close_fd)) };
> +
> + // SAFETY: This inserts the `DeferredFdCloserInner` into the task workqueue for the current
> + // task. If this operation is successful, then this transfers exclusive ownership of the
> + // `callback_head` field to the C side until it calls `do_close_fd`, and we don't touch or
> + // invalidate the field during that time.
> + //
> + // When the C side calls `do_close_fd`, the safety requirements of that method are
> + // satisfied because when a task work is executed, the callback is given ownership of the
> + // pointer.
> + //
> + // The file pointer is currently null. If it is changed to be non-null before `do_close_fd`
> + // is called, then that change happens due to the write at the end of this function, and
> + // that write has a safety comment that explains why the refcount can be dropped when
> + // `do_close_fd` runs.
> + let res = unsafe { bindings::task_work_add(current, callback_head, TWA_RESUME) };
> +
> + if res != 0 {
> + // SAFETY: Scheduling the task work failed, so we still have ownership of the box, so
> + // we may destroy it.
> + unsafe { drop(Box::from_raw(inner)) };
> +
> + return Err(DeferredFdCloseError::TaskWorkUnavailable);
> + }
> +
> + // This removes the fd from the fd table in `current`. The file is not fully closed until
> + // `filp_close` is called. We are given ownership of one refcount to the file.
> + //
> + // SAFETY: This is safe no matter what `fd` is. If the `fd` is valid (that is, if the
> + // pointer is non-null), then we call `filp_close` on the returned pointer as required by
> + // `file_close_fd`.
> + let file = unsafe { bindings::file_close_fd(fd) };
> + if file.is_null() {
> + // We don't clean up the task work since that might be expensive if the task work queue
> + // is long. Just let it execute and let it clean up for itself.
> + return Err(DeferredFdCloseError::BadFd);
> + }
> +
> + // Acquire a second refcount to the file.
> + //
> + // SAFETY: The `file` pointer points at a file with a non-zero refcount.
> + unsafe { bindings::get_file(file) };
> +
> + // This method closes the fd, consuming one of our two refcounts There could be active
> + // light refcounts created from that fd, so we must ensure that the file has a positive
> + // refcount for the duration of those active light refcounts. We do that by holding on to
> + // the second refcount until the current task returns to userspace.
> + //
> + // SAFETY: The `file` pointer is valid. Passing `current->files` as the file table to close
> + // it in is correct, since we just got the `fd` from `file_close_fd` which also uses
> + // `current->files`.
> + //
> + // Note: fl_owner_t is currently a void pointer.
> + unsafe { bindings::filp_close(file, (*current).files as bindings::fl_owner_t) };
> +
> + // We update the file pointer that the task work is supposed to fput. This transfers
> + // ownership of our last refcount.
> + //
> + // INVARIANT: This changes the `file` field of a `DeferredFdCloserInner` from null to
> + // non-null. This doesn't break the type invariant for `DeferredFdCloserInner` because we
> + // still own a refcount to the file, so we can pass ownership of that refcount to the
> + // `DeferredFdCloserInner`.
> + //
> + // When `do_close_fd` runs, it must be safe for it to `fput` the refcount. However, this is
> + // the case because all light refcounts that are associated with the fd we closed
> + // previously must be dropped when `do_close_fd`, since light refcounts must be dropped
> + // before returning to userspace.
> + //
> + // SAFETY: Task works are executed on the current thread right before we return to
> + // userspace, so this write is guaranteed to happen before `do_close_fd` is called, which
> + // means that a race is not possible here.
> + unsafe { *file_field = file };
> +
> + Ok(())
> + }

This documentation is quite a work of art :)

2024-02-10 07:52:41

by Trevor Gross

[permalink] [raw]
Subject: Re: [PATCH v5 9/9] rust: file: add abstraction for `poll_table`

On Fri, Feb 9, 2024 at 5:22 AM Alice Ryhl <[email protected]> wrote:
>
> The existing `CondVar` abstraction is a wrapper around
> `wait_queue_head`, but it does not support all use-cases of the C
> `wait_queue_head` type. To be specific, a `CondVar` cannot be registered
> with a `struct poll_table`. This limitation has the advantage that you
> do not need to call `synchronize_rcu` when destroying a `CondVar`.
>
> However, we need the ability to register a `poll_table` with a
> `wait_queue_head` in Rust Binder. To enable this, introduce a type
> called `PollCondVar`, which is like `CondVar` except that you can
> register a `poll_table`. We also introduce `PollTable`, which is a safe
> wrapper around `poll_table` that is intended to be used with
> `PollCondVar`.
>
> The destructor of `PollCondVar` unconditionally calls `synchronize_rcu`
> to ensure that the removal of epoll waiters has fully completed before
> the `wait_queue_head` is destroyed.
>
> That said, `synchronize_rcu` is rather expensive and is not needed in
> all cases: If we have never registered a `poll_table` with the
> `wait_queue_head`, then we don't need to call `synchronize_rcu`. (And
> this is a common case in Binder - not all processes use Binder with
> epoll.) The current implementation does not account for this, but if we
> find that it is necessary to improve this, a future patch could store a
> boolean next to the `wait_queue_head` to keep track of whether a
> `poll_table` has ever been registered.
>
> Reviewed-by: Benno Lossin <[email protected]>
> Reviewed-by: Martin Rodriguez Reboredo <[email protected]>
> Signed-off-by: Alice Ryhl <[email protected]>

One nit below

Reviewed-by: Trevor Gross <[email protected]>

> +/// Creates a [`PollCondVar`] initialiser with the given name and a newly-created lock class.
> +#[macro_export]
> +macro_rules! new_poll_condvar {
> + ($($name:literal)?) => {
> + $crate::sync::poll::PollCondVar::new($crate::optional_name!($($name)?), $crate::static_lock_class!())
> + };
> +}

Length > 100, this could wrap:

macro_rules! new_poll_condvar {
($($name:literal)?) => {
$crate::sync::poll::PollCondVar::new(
$crate::optional_name!($($name)?), $crate::static_lock_class!()
)
};
}

2024-02-12 10:12:29

by Alice Ryhl

[permalink] [raw]
Subject: Re: [PATCH v5 7/9] rust: file: add `Kuid` wrapper

On Sat, Feb 10, 2024 at 8:43 AM Trevor Gross <[email protected]> wrote:
>
> On Fri, Feb 9, 2024 at 5:22 AM Alice Ryhl <[email protected]> wrote:
> >
> > Of course, once a wrapper for rcu_read_lock is available, it is
> > preferable to use that over either of the two above approaches.
>
> Is this worth a FIXME?

Shrug. I think a patch to introduce rcu_read_lock would go through the
helpers as a matter of course either way. But it also doesn't hurt.

Alice

2024-02-12 16:01:45

by Boqun Feng

[permalink] [raw]
Subject: Re: [PATCH v5 7/9] rust: file: add `Kuid` wrapper

On Mon, Feb 12, 2024 at 11:04:47AM +0100, Alice Ryhl wrote:
> On Sat, Feb 10, 2024 at 8:43 AM Trevor Gross <[email protected]> wrote:
> >
> > On Fri, Feb 9, 2024 at 5:22 AM Alice Ryhl <[email protected]> wrote:
> > >
> > > Of course, once a wrapper for rcu_read_lock is available, it is
> > > preferable to use that over either of the two above approaches.
> >
> > Is this worth a FIXME?
>
> Shrug. I think a patch to introduce rcu_read_lock would go through the
> helpers as a matter of course either way. But it also doesn't hurt.
>

Right. And if I understand correctly, we actually need more than RCU
wrappers to "improve" the cases here: we also need the RCU interface to
be inline, plus the extra maintainship of Rust version of kuids getters.
These are all outside the scope of this patchset, and we may need to
revisit later.

The commit log here basically says: what's done in the patch is OK and
probably the best way to proceed. I think it's fine.

Regards,
Boqun

> Alice

2024-03-20 14:25:20

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v5 8/9] rust: file: add `DeferredFdCloser`

On Fri, Feb 09, 2024 at 11:18:21AM +0000, Alice Ryhl wrote:
> To close an fd from kernel space, we could call `ksys_close`. However,
> if we do this to an fd that is held using `fdget`, then we may trigger a
> use-after-free. Introduce a helper that can be used to close an fd even
> if the fd is currently held with `fdget`. This is done by grabbing an
> extra refcount to the file and dropping it in a task work once we return
> to userspace.
>
> This is necessary for Rust Binder because otherwise the user might try
> to have Binder close its fd for /dev/binder, which would cause problems
> as this happens inside an ioctl on /dev/binder, and ioctls hold the fd
> using `fdget`.
>
> Additional motivation can be found in commit 80cd795630d6 ("binder: fix
> use-after-free due to ksys_close() during fdget()") and in the comments
> on `binder_do_fd_close`.
>
> If there is some way to detect whether an fd is currently held with
> `fdget`, then this could be optimized to skip the allocation and task
> work when this is not the case. Another possible optimization would be
> to combine several fds into a single task work, since this is used with
> fd arrays that might hold several fds.
>
> That said, it might not be necessary to optimize it, because Rust Binder
> has two ways to send fds: BINDER_TYPE_FD and BINDER_TYPE_FDA. With
> BINDER_TYPE_FD, it is userspace's responsibility to close the fd, so
> this mechanism is used only by BINDER_TYPE_FDA, but fd arrays are used
> rarely these days.
>
> Reviewed-by: Benno Lossin <[email protected]>
> Reviewed-by: Martin Rodriguez Reboredo <[email protected]>
> Signed-off-by: Alice Ryhl <[email protected]>
> ---
> rust/bindings/bindings_helper.h | 2 +
> rust/helpers.c | 8 ++
> rust/kernel/file.rs | 184 +++++++++++++++++++++++++++++++++++++++-
> rust/kernel/task.rs | 14 +++
> 4 files changed, 207 insertions(+), 1 deletion(-)
>
> diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
> index 4194b057ef6b..f4d9d04333c0 100644
> --- a/rust/bindings/bindings_helper.h
> +++ b/rust/bindings/bindings_helper.h
> @@ -10,6 +10,7 @@
> #include <linux/cred.h>
> #include <linux/errname.h>
> #include <linux/ethtool.h>
> +#include <linux/fdtable.h>
> #include <linux/file.h>
> #include <linux/fs.h>
> #include <linux/jiffies.h>
> @@ -21,6 +22,7 @@
> #include <linux/refcount.h>
> #include <linux/wait.h>
> #include <linux/sched.h>
> +#include <linux/task_work.h>
> #include <linux/workqueue.h>
>
> /* `bindgen` gets confused at certain things. */
> diff --git a/rust/helpers.c b/rust/helpers.c
> index 58e3a9dff349..d146bbf25aec 100644
> --- a/rust/helpers.c
> +++ b/rust/helpers.c
> @@ -32,6 +32,7 @@
> #include <linux/sched/signal.h>
> #include <linux/security.h>
> #include <linux/spinlock.h>
> +#include <linux/task_work.h>
> #include <linux/wait.h>
> #include <linux/workqueue.h>
>
> @@ -243,6 +244,13 @@ void rust_helper_security_release_secctx(char *secdata, u32 seclen)
> EXPORT_SYMBOL_GPL(rust_helper_security_release_secctx);
> #endif
>
> +void rust_helper_init_task_work(struct callback_head *twork,
> + task_work_func_t func)
> +{
> + init_task_work(twork, func);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_init_task_work);
> +
> /*
> * `bindgen` binds the C `size_t` type as the Rust `usize` type, so we can
> * use it in contexts where Rust expects a `usize` like slice (array) indices.
> diff --git a/rust/kernel/file.rs b/rust/kernel/file.rs
> index fb903b7f23fe..8902f490ccc8 100644
> --- a/rust/kernel/file.rs
> +++ b/rust/kernel/file.rs
> @@ -11,7 +11,8 @@
> error::{code::*, Error, Result},
> types::{ARef, AlwaysRefCounted, NotThreadSafe, Opaque},
> };
> -use core::ptr;
> +use alloc::boxed::Box;
> +use core::{alloc::AllocError, mem, ptr};
>
> /// Flags associated with a [`File`].
> pub mod flags {
> @@ -318,6 +319,187 @@ fn drop(&mut self) {
> }
> }
>
> +/// Helper used for closing file descriptors in a way that is safe even if the file is currently
> +/// held using `fdget`.
> +///
> +/// Additional motivation can be found in commit 80cd795630d6 ("binder: fix use-after-free due to
> +/// ksys_close() during fdget()") and in the comments on `binder_do_fd_close`.
> +pub struct DeferredFdCloser {
> + inner: Box<DeferredFdCloserInner>,
> +}
> +
> +/// SAFETY: This just holds an allocation with no real content, so there's no safety issue with
> +/// moving it across threads.
> +unsafe impl Send for DeferredFdCloser {}
> +unsafe impl Sync for DeferredFdCloser {}
> +
> +/// # Invariants
> +///
> +/// If the `file` pointer is non-null, then it points at a `struct file` and owns a refcount to
> +/// that file.
> +#[repr(C)]
> +struct DeferredFdCloserInner {
> + twork: mem::MaybeUninit<bindings::callback_head>,
> + file: *mut bindings::file,
> +}
> +
> +impl DeferredFdCloser {

So the explicitly deferred close is due to how binder works so it's not
much of a general purpose interface as I don't recall having other
codepaths with similar problems. So this should live in the binder
specific rust code imo.

> + /// Create a new [`DeferredFdCloser`].
> + pub fn new() -> Result<Self, AllocError> {
> + Ok(Self {
> + // INVARIANT: The `file` pointer is null, so the type invariant does not apply.
> + inner: Box::try_new(DeferredFdCloserInner {
> + twork: mem::MaybeUninit::uninit(),
> + file: core::ptr::null_mut(),
> + })?,
> + })
> + }
> +
> + /// Schedule a task work that closes the file descriptor when this task returns to userspace.
> + ///
> + /// Fails if this is called from a context where we cannot run work when returning to
> + /// userspace. (E.g., from a kthread.)
> + pub fn close_fd(self, fd: u32) -> Result<(), DeferredFdCloseError> {
> + use bindings::task_work_notify_mode_TWA_RESUME as TWA_RESUME;
> +
> + // In this method, we schedule the task work before closing the file. This is because
> + // scheduling a task work is fallible, and we need to know whether it will fail before we
> + // attempt to close the file.
> +
> + // Task works are not available on kthreads.
> + let current = crate::current!();
> + if current.is_kthread() {
> + return Err(DeferredFdCloseError::TaskWorkUnavailable);
> + }
> +
> + // Transfer ownership of the box's allocation to a raw pointer. This disables the
> + // destructor, so we must manually convert it back to a Box to drop it.
> + //
> + // Until we convert it back to a `Box`, there are no aliasing requirements on this
> + // pointer.
> + let inner = Box::into_raw(self.inner);
> +
> + // The `callback_head` field is first in the struct, so this cast correctly gives us a
> + // pointer to the field.
> + let callback_head = inner.cast::<bindings::callback_head>();
> + // SAFETY: This pointer offset operation does not go out-of-bounds.
> + let file_field = unsafe { core::ptr::addr_of_mut!((*inner).file) };
> +
> + let current = current.as_raw();
> +
> + // SAFETY: This function currently has exclusive access to the `DeferredFdCloserInner`, so
> + // it is okay for us to perform unsynchronized writes to its `callback_head` field.
> + unsafe { bindings::init_task_work(callback_head, Some(Self::do_close_fd)) };
> +
> + // SAFETY: This inserts the `DeferredFdCloserInner` into the task workqueue for the current
> + // task. If this operation is successful, then this transfers exclusive ownership of the
> + // `callback_head` field to the C side until it calls `do_close_fd`, and we don't touch or
> + // invalidate the field during that time.
> + //
> + // When the C side calls `do_close_fd`, the safety requirements of that method are
> + // satisfied because when a task work is executed, the callback is given ownership of the
> + // pointer.
> + //
> + // The file pointer is currently null. If it is changed to be non-null before `do_close_fd`
> + // is called, then that change happens due to the write at the end of this function, and
> + // that write has a safety comment that explains why the refcount can be dropped when
> + // `do_close_fd` runs.
> + let res = unsafe { bindings::task_work_add(current, callback_head, TWA_RESUME) };
> +
> + if res != 0 {
> + // SAFETY: Scheduling the task work failed, so we still have ownership of the box, so
> + // we may destroy it.
> + unsafe { drop(Box::from_raw(inner)) };
> +
> + return Err(DeferredFdCloseError::TaskWorkUnavailable);
> + }
> +
> + // This removes the fd from the fd table in `current`. The file is not fully closed until
> + // `filp_close` is called. We are given ownership of one refcount to the file.
> + //
> + // SAFETY: This is safe no matter what `fd` is. If the `fd` is valid (that is, if the
> + // pointer is non-null), then we call `filp_close` on the returned pointer as required by
> + // `file_close_fd`.
> + let file = unsafe { bindings::file_close_fd(fd) };
> + if file.is_null() {
> + // We don't clean up the task work since that might be expensive if the task work queue
> + // is long. Just let it execute and let it clean up for itself.
> + return Err(DeferredFdCloseError::BadFd);
> + }
> +
> + // Acquire a second refcount to the file.
> + //
> + // SAFETY: The `file` pointer points at a file with a non-zero refcount.
> + unsafe { bindings::get_file(file) };
> +
> + // This method closes the fd, consuming one of our two refcounts. There could be active
> + // light refcounts created from that fd, so we must ensure that the file has a positive
> + // refcount for the duration of those active light refcounts. We do that by holding on to
> + // the second refcount until the current task returns to userspace.
> + //
> + // SAFETY: The `file` pointer is valid. Passing `current->files` as the file table to close
> + // it in is correct, since we just got the `fd` from `file_close_fd` which also uses
> + // `current->files`.
> + //
> + // Note: fl_owner_t is currently a void pointer.
> + unsafe { bindings::filp_close(file, (*current).files as bindings::fl_owner_t) };
> +
> + // We update the file pointer that the task work is supposed to fput. This transfers
> + // ownership of our last refcount.
> + //
> + // INVARIANT: This changes the `file` field of a `DeferredFdCloserInner` from null to
> + // non-null. This doesn't break the type invariant for `DeferredFdCloserInner` because we
> + // still own a refcount to the file, so we can pass ownership of that refcount to the
> + // `DeferredFdCloserInner`.
> + //
> + // When `do_close_fd` runs, it must be safe for it to `fput` the refcount. However, this is
> + // the case because all light refcounts that are associated with the fd we closed
> + // previously must be dropped when `do_close_fd`, since light refcounts must be dropped
> + // before returning to userspace.
> + //
> + // SAFETY: Task works are executed on the current thread right before we return to
> + // userspace, so this write is guaranteed to happen before `do_close_fd` is called, which
> + // means that a race is not possible here.
> + unsafe { *file_field = file };
> +
> + Ok(())
> + }
> +
> + /// # Safety
> + ///
> + /// The provided pointer must point at the `twork` field of a `DeferredFdCloserInner` stored in
> + /// a `Box`, and the caller must pass exclusive ownership of that `Box`. Furthermore, if the
> + /// file pointer is non-null, then it must be okay to release the refcount by calling `fput`.
> + unsafe extern "C" fn do_close_fd(inner: *mut bindings::callback_head) {
> + // SAFETY: The caller just passed us ownership of this box.
> + let inner = unsafe { Box::from_raw(inner.cast::<DeferredFdCloserInner>()) };
> + if !inner.file.is_null() {
> + // SAFETY: By the type invariants, we own a refcount to this file, and the caller
> + // guarantees that dropping the refcount now is okay.
> + unsafe { bindings::fput(inner.file) };
> + }
> + // The allocation is freed when `inner` goes out of scope.
> + }
> +}
> +
> +/// Represents a failure to close an fd in a deferred manner.
> +#[derive(Copy, Clone, Debug, Eq, PartialEq)]
> +pub enum DeferredFdCloseError {
> + /// Closing the fd failed because we were unable to schedule a task work.
> + TaskWorkUnavailable,
> + /// Closing the fd failed because the fd does not exist.
> + BadFd,
> +}
> +
> +impl From<DeferredFdCloseError> for Error {
> + fn from(err: DeferredFdCloseError) -> Error {
> + match err {
> + DeferredFdCloseError::TaskWorkUnavailable => ESRCH,
> + DeferredFdCloseError::BadFd => EBADF,
> + }
> + }
> +}
> +
> /// Represents the `EBADF` error code.
> ///
> /// Used for methods that can only fail with `EBADF`.
> diff --git a/rust/kernel/task.rs b/rust/kernel/task.rs
> index f46ea3ba9e8e..6adcd8ffcfde 100644
> --- a/rust/kernel/task.rs
> +++ b/rust/kernel/task.rs
> @@ -145,6 +145,12 @@ fn deref(&self) -> &Self::Target {
> }
> }
>
> + /// Returns a raw pointer to the task.
> + #[inline]
> + pub fn as_raw(&self) -> *mut bindings::task_struct {
> + self.0.get()
> + }
> +
> /// Returns the group leader of the given task.
> pub fn group_leader(&self) -> &Task {
> // SAFETY: By the type invariant, we know that `self.0` is a valid task. Valid tasks always
> @@ -189,6 +195,14 @@ pub fn pid_in_current_ns(&self) -> Pid {
> unsafe { bindings::task_tgid_nr_ns(self.0.get(), ptr::null_mut()) }
> }
>
> + /// Returns whether this task corresponds to a kernel thread.
> + pub fn is_kthread(&self) -> bool {
> + // SAFETY: By the type invariant, we know that `self.0.get()` is non-null and valid. There
> + // are no further requirements to read the task's flags.
> + let flags = unsafe { (*self.0.get()).flags };
> + (flags & bindings::PF_KTHREAD) != 0
> + }
> +
> /// Wakes up the task.
> pub fn wake_up(&self) {
> // SAFETY: By the type invariant, we know that `self.0.get()` is non-null and valid.
>
> --
> 2.43.0.687.g38aa6559b0-goog
>

2024-03-20 15:21:00

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v5 3/9] rust: file: add Rust abstraction for `struct file`

On Fri, Feb 09, 2024 at 11:18:16AM +0000, Alice Ryhl wrote:
> From: Wedson Almeida Filho <[email protected]>
>
> This abstraction makes it possible to manipulate the open files for a
> process. The new `File` struct wraps the C `struct file`. When accessing
> it using the smart pointer `ARef<File>`, the pointer will own a
> reference count to the file. When accessing it as `&File`, then the
> reference does not own a refcount, but the borrow checker will ensure
> that the reference count does not hit zero while the `&File` is live.
>
> Since this is intended to manipulate the open files of a process, we
> introduce an `fget` constructor that corresponds to the C `fget`
> method. In future patches, it will become possible to create a new fd in
> a process and bind it to a `File`. Rust Binder will use these to send
> fds from one process to another.
>
> We also provide a method for accessing the file's flags. Rust Binder
> will use this to access the flags of the Binder fd to check whether the
> non-blocking flag is set, which affects what the Binder ioctl does.
>
> This introduces a struct for the EBADF error type, rather than just
> using the Error type directly. This has two advantages:
> * `File::from_fd` returns a `Result<ARef<File>, BadFdError>`, which the

Sorry, where's that method?

> compiler will represent as a single pointer, with null being an error.
> This is possible because the compiler understands that `BadFdError`
> has only one possible value, and it also understands that the
> `ARef<File>` smart pointer is guaranteed non-null.
> * Additionally, we promise to users of the method that the method can
> only fail with EBADF, which means that they can rely on this promise
> without having to inspect its implementation.
> That said, there are also two disadvantages:
> * Defining additional error types involves boilerplate.
> * The question mark operator will only utilize the `From` trait once,
> which prevents you from using the question mark operator on
> `BadFdError` in methods that return some third error type that the
> kernel `Error` is convertible into. (However, it works fine in methods
> that return `Error`.)
>
> Signed-off-by: Wedson Almeida Filho <[email protected]>
> Co-developed-by: Daniel Xu <[email protected]>
> Signed-off-by: Daniel Xu <[email protected]>
> Co-developed-by: Alice Ryhl <[email protected]>
> Reviewed-by: Martin Rodriguez Reboredo <[email protected]>
> Reviewed-by: Trevor Gross <[email protected]>
> Reviewed-by: Benno Lossin <[email protected]>
> Signed-off-by: Alice Ryhl <[email protected]>
> ---
> fs/file.c | 7 ++
> rust/bindings/bindings_helper.h | 2 +
> rust/helpers.c | 7 ++
> rust/kernel/file.rs | 254 ++++++++++++++++++++++++++++++++++++++++
> rust/kernel/lib.rs | 1 +
> 5 files changed, 271 insertions(+)
>
> diff --git a/fs/file.c b/fs/file.c
> index 3b683b9101d8..f2eab5fcb87f 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -1127,6 +1127,13 @@ EXPORT_SYMBOL(task_lookup_next_fdget_rcu);
> *
> * The fput_needed flag returned by fget_light should be passed to the
> * corresponding fput_light.
> + *
> + * (As an exception to rule 2, you can call filp_close between fget_light and
> + * fput_light provided that you capture a real refcount with get_file before
> + * the call to filp_close, and ensure that this real refcount is fput *after*
> + * the fput_light call.)
> + *
> + * See also the documentation in rust/kernel/file.rs.
> */
> static unsigned long __fget_light(unsigned int fd, fmode_t mask)
> {
> diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
> index 936651110c39..41fcd2905ed4 100644
> --- a/rust/bindings/bindings_helper.h
> +++ b/rust/bindings/bindings_helper.h
> @@ -9,6 +9,8 @@
> #include <kunit/test.h>
> #include <linux/errname.h>
> #include <linux/ethtool.h>
> +#include <linux/file.h>
> +#include <linux/fs.h>
> #include <linux/jiffies.h>
> #include <linux/mdio.h>
> #include <linux/phy.h>
> diff --git a/rust/helpers.c b/rust/helpers.c
> index 70e59efd92bc..03141a3608a4 100644
> --- a/rust/helpers.c
> +++ b/rust/helpers.c
> @@ -25,6 +25,7 @@
> #include <linux/build_bug.h>
> #include <linux/err.h>
> #include <linux/errname.h>
> +#include <linux/fs.h>
> #include <linux/mutex.h>
> #include <linux/refcount.h>
> #include <linux/sched/signal.h>
> @@ -157,6 +158,12 @@ void rust_helper_init_work_with_key(struct work_struct *work, work_func_t func,
> }
> EXPORT_SYMBOL_GPL(rust_helper_init_work_with_key);
>
> +struct file *rust_helper_get_file(struct file *f)
> +{
> + return get_file(f);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_get_file);
> +
> /*
> * `bindgen` binds the C `size_t` type as the Rust `usize` type, so we can
> * use it in contexts where Rust expects a `usize` like slice (array) indices.
> diff --git a/rust/kernel/file.rs b/rust/kernel/file.rs
> new file mode 100644
> index 000000000000..cf8ebf619379
> --- /dev/null
> +++ b/rust/kernel/file.rs
> @@ -0,0 +1,254 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +//! Files and file descriptors.
> +//!
> +//! C headers: [`include/linux/fs.h`](srctree/include/linux/fs.h) and
> +//! [`include/linux/file.h`](srctree/include/linux/file.h)
> +
> +use crate::{
> + bindings,
> + error::{code::*, Error, Result},
> + types::{ARef, AlwaysRefCounted, Opaque},
> +};
> +use core::ptr;
> +
> +/// Flags associated with a [`File`].
> +pub mod flags {
> + /// File is opened in append mode.
> + pub const O_APPEND: u32 = bindings::O_APPEND;
> +
> + /// Signal-driven I/O is enabled.
> + pub const O_ASYNC: u32 = bindings::FASYNC;
> +
> + /// Close-on-exec flag is set.
> + pub const O_CLOEXEC: u32 = bindings::O_CLOEXEC;
> +
> + /// File was created if it didn't already exist.
> + pub const O_CREAT: u32 = bindings::O_CREAT;
> +
> + /// Direct I/O is enabled for this file.
> + pub const O_DIRECT: u32 = bindings::O_DIRECT;
> +
> + /// File must be a directory.
> + pub const O_DIRECTORY: u32 = bindings::O_DIRECTORY;
> +
> + /// Like [`O_SYNC`] except metadata is not synced.
> + pub const O_DSYNC: u32 = bindings::O_DSYNC;
> +
> + /// Ensure that this file is created with the `open(2)` call.
> + pub const O_EXCL: u32 = bindings::O_EXCL;
> +
> + /// Large file size enabled (`off64_t` over `off_t`).
> + pub const O_LARGEFILE: u32 = bindings::O_LARGEFILE;
> +
> + /// Do not update the file last access time.
> + pub const O_NOATIME: u32 = bindings::O_NOATIME;
> +
> + /// File should not be used as process's controlling terminal.
> + pub const O_NOCTTY: u32 = bindings::O_NOCTTY;
> +
> + /// If basename of path is a symbolic link, fail open.
> + pub const O_NOFOLLOW: u32 = bindings::O_NOFOLLOW;
> +
> + /// File is using nonblocking I/O.
> + pub const O_NONBLOCK: u32 = bindings::O_NONBLOCK;
> +
> + /// File is using nonblocking I/O.
> + ///
> + /// This is effectively the same flag as [`O_NONBLOCK`] on all architectures
> + /// except SPARC64.
> + pub const O_NDELAY: u32 = bindings::O_NDELAY;
> +
> + /// Used to obtain a path file descriptor.
> + pub const O_PATH: u32 = bindings::O_PATH;
> +
> + /// Write operations on this file will flush data and metadata.
> + pub const O_SYNC: u32 = bindings::O_SYNC;
> +
> + /// This file is an unnamed temporary regular file.
> + pub const O_TMPFILE: u32 = bindings::O_TMPFILE;
> +
> + /// File should be truncated to length 0.
> + pub const O_TRUNC: u32 = bindings::O_TRUNC;
> +
> + /// Bitmask for access mode flags.
> + ///
> + /// # Examples
> + ///
> + /// ```
> + /// use kernel::file;
> + /// # fn do_something() {}
> + /// # let flags = 0;
> + /// if (flags & file::flags::O_ACCMODE) == file::flags::O_RDONLY {
> + /// do_something();
> + /// }
> + /// ```
> + pub const O_ACCMODE: u32 = bindings::O_ACCMODE;
> +
> + /// File is read only.
> + pub const O_RDONLY: u32 = bindings::O_RDONLY;
> +
> + /// File is write only.
> + pub const O_WRONLY: u32 = bindings::O_WRONLY;
> +
> + /// File can be both read and written.
> + pub const O_RDWR: u32 = bindings::O_RDWR;
> +}
> +
> +/// Wraps the kernel's `struct file`.
> +///
> +/// This represents an open file rather than a file on a filesystem. Processes generally reference
> +/// open files using file descriptors. However, file descriptors are not the same as files. A file
> +/// descriptor is just an integer that corresponds to a file, and a single file may be referenced
> +/// by multiple file descriptors.
> +///
> +/// # Refcounting
> +///
> +/// Instances of this type are reference-counted. The reference count is incremented by the
> +/// `fget`/`get_file` functions and decremented by `fput`. The Rust type `ARef<File>` represents a
> +/// pointer that owns a reference count on the file.
> +///
> +/// Whenever a process opens a file descriptor (fd), it stores a pointer to the file in its `struct
> +/// files_struct`. This pointer owns a reference count to the file, ensuring the file isn't
> +/// prematurely deleted while the file descriptor is open. In Rust terminology, the pointers in
> +/// `struct files_struct` are `ARef<File>` pointers.
> +///
> +/// ## Light refcounts
> +///
> +/// Whenever a process has an fd to a file, it may use something called a "light refcount" as a
> +/// performance optimization. Light refcounts are acquired by calling `fdget` and released with
> +/// `fdput`. The idea behind light refcounts is that if the fd is not closed between the calls to
> +/// `fdget` and `fdput`, then the refcount cannot hit zero during that time, as the `struct
> +/// files_struct` holds a reference until the fd is closed. This means that it's safe to access the
> +/// file even if `fdget` does not increment the refcount.
> +///
> +/// The requirement that the fd is not closed during a light refcount applies globally across all
> +/// threads - not just on the thread using the light refcount. For this reason, light refcounts are
> +/// only used when the `struct files_struct` is not shared with other threads, since this ensures
> +/// that other unrelated threads cannot suddenly start using the fd and close it. Therefore,
> +/// calling `fdget` on a shared `struct files_struct` creates a normal refcount instead of a light
> +/// refcount.

When the fdget() calling task doesn't have a shared file descriptor
table fdget() will not increment the reference count, yes. This
also implies that you cannot have task A use fdget() and then pass
f.file to task B that holds on to it while A returns to userspace. It's
irrelevant that task A won't drop the reference count or that task B
won't drop the reference count. Because task A could return back to
userspace and immediately close the fd via a regular close() system call
at which point task B has a UAF. In other words a file that has been
gotten via fdget() can't be Send to another task without the Send
implying taking a reference to it.

> +///
> +/// Light reference counts must be released with `fdput` before the system call returns to
> +/// userspace. This means that if you wait until the current system call returns to userspace, then
> +/// all light refcounts that existed at the time have gone away.
> +///
> +/// ## Rust references
> +///
> +/// The reference type `&File` is similar to light refcounts:
> +///
> +/// * `&File` references don't own a reference count. They can only exist as long as the reference
> +/// count stays positive, and can only be created when there is some mechanism in place to ensure
> +/// this.
> +///
> +/// * The Rust borrow-checker normally ensures this by enforcing that the `ARef<File>` from which
> +/// a `&File` is created outlives the `&File`.

The section confuses me a little: Does the borrow-checker always ensure
that a &File stays valid or are there circumstances where it doesn't or
are you saying it doesn't enforce it?

> +///
> +/// * Using the unsafe [`File::from_ptr`] means that it is up to the caller to ensure that the
> +/// `&File` only exists while the reference count is positive.

What is this used for in binder? If it's not used don't add it.

> +///
> +/// * You can think of `fdget` as using an fd to look up an `ARef<File>` in the `struct

Could you explain why there isn't an explicit fdget() then and you have
that from_ptr() method?

> +/// files_struct` and create an `&File` from it. The "fd cannot be closed" rule is like the Rust
> +/// rule "the `ARef<File>` must outlive the `&File`".
> +///
> +/// # Invariants
> +///
> +/// * Instances of this type are refcounted using the `f_count` field.
> +/// * If an fd with active light refcounts is closed, then it must be the case that the file
> +/// refcount is positive until all light refcounts of the fd have been dropped.
> +/// * A light refcount must be dropped before returning to userspace.
> +#[repr(transparent)]
> +pub struct File(Opaque<bindings::file>);
> +
> +// SAFETY:
> +// - `File::dec_ref` can be called from any thread.
> +// - It is okay to send ownership of `struct file` across thread boundaries.
> +unsafe impl Send for File {}
> +
> +// SAFETY: All methods defined on `File` that take `&self` are safe to call even if other threads
> +// are concurrently accessing the same `struct file`, because those methods either access immutable
> +// properties or have proper synchronization to ensure that such accesses are safe.
> +unsafe impl Sync for File {}
> +
> +impl File {
> + /// Constructs a new `struct file` wrapper from a file descriptor.
> + ///
> + /// The file descriptor belongs to the current process.
> + pub fn fget(fd: u32) -> Result<ARef<Self>, BadFdError> {
> + // SAFETY: FFI call, there are no requirements on `fd`.
> + let ptr = ptr::NonNull::new(unsafe { bindings::fget(fd) }).ok_or(BadFdError)?;
> +
> + // SAFETY: `bindings::fget` either returns null or a valid pointer to a file, and we
> + // checked for null above.
> + //
> + // INVARIANT: `bindings::fget` creates a refcount, and we pass ownership of the refcount to
> + // the new `ARef<File>`.
> + Ok(unsafe { ARef::from_raw(ptr.cast()) })
> + }
> +
> + /// Creates a reference to a [`File`] from a valid pointer.
> + ///
> + /// # Safety
> + ///
> + /// The caller must ensure that `ptr` points at a valid file and that the file's refcount is
> + /// positive for the duration of 'a.
> + pub unsafe fn from_ptr<'a>(ptr: *const bindings::file) -> &'a File {
> + // SAFETY: The caller guarantees that the pointer is not dangling and stays valid for the
> + // duration of 'a. The cast is okay because `File` is `repr(transparent)`.
> + //
> + // INVARIANT: The safety requirements guarantee that the refcount does not hit zero during
> + // 'a.
> + unsafe { &*ptr.cast() }
> + }
> +
> + /// Returns a raw pointer to the inner C struct.
> + #[inline]
> + pub fn as_ptr(&self) -> *mut bindings::file {
> + self.0.get()
> + }
> +
> + /// Returns the flags associated with the file.
> + ///
> + /// The flags are a combination of the constants in [`flags`].
> + pub fn flags(&self) -> u32 {
> + // This `read_volatile` is intended to correspond to a READ_ONCE call.
> + //
> + // SAFETY: The file is valid because the shared reference guarantees a nonzero refcount.
> + //
> + // FIXME(read_once): Replace with `read_once` when available on the Rust side.
> + unsafe { core::ptr::addr_of!((*self.as_ptr()).f_flags).read_volatile() }
> + }
> +}
> +
> +// SAFETY: The type invariants guarantee that `File` is always ref-counted. This implementation
> +// makes `ARef<File>` own a normal refcount.
> +unsafe impl AlwaysRefCounted for File {
> + fn inc_ref(&self) {
> + // SAFETY: The existence of a shared reference means that the refcount is nonzero.
> + unsafe { bindings::get_file(self.as_ptr()) };
> + }
> +
> + unsafe fn dec_ref(obj: ptr::NonNull<File>) {
> + // SAFETY: To call this method, the caller passes us ownership of a normal refcount, so we
> + // may drop it. The cast is okay since `File` has the same representation as `struct file`.
> + unsafe { bindings::fput(obj.cast().as_ptr()) }
> + }
> +}
> +
> +/// Represents the `EBADF` error code.
> +///
> +/// Used for methods that can only fail with `EBADF`.
> +#[derive(Copy, Clone, Eq, PartialEq)]
> +pub struct BadFdError;
> +
> +impl From<BadFdError> for Error {
> + fn from(_: BadFdError) -> Error {
> + EBADF
> + }
> +}
> +
> +impl core::fmt::Debug for BadFdError {
> + fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result {
> + f.pad("EBADF")
> + }
> +}
> diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs
> index b89ecf4e97a0..9353dd713a20 100644
> --- a/rust/kernel/lib.rs
> +++ b/rust/kernel/lib.rs
> @@ -34,6 +34,7 @@
> mod allocator;
> mod build_assert;
> pub mod error;
> +pub mod file;
> pub mod init;
> pub mod ioctl;
> #[cfg(CONFIG_KUNIT)]
>
> --
> 2.43.0.687.g38aa6559b0-goog
>

2024-03-20 15:33:34

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v5 6/9] rust: file: add `FileDescriptorReservation`

On Fri, Feb 09, 2024 at 11:18:19AM +0000, Alice Ryhl wrote:
> From: Wedson Almeida Filho <[email protected]>
>
> Allow for the creation of a file descriptor in two steps: first, we
> reserve a slot for it, then we commit or drop the reservation. The first
> step may fail (e.g., the current process ran out of available slots),
> but commit and drop never fail (and are mutually exclusive).
>
> This is needed by Rust Binder when fds are sent from one process to
> another. It has to be a two-step process to properly handle the case
> where multiple fds are sent: The operation must fail or succeed
> atomically, which we achieve by first reserving the fds we need, and
> only installing the files once we have reserved enough fds to send the
> files.
>
> Fd reservations assume that the value of `current` does not change
> between the call to get_unused_fd_flags and the call to fd_install (or
> put_unused_fd). By not implementing the Send trait, this abstraction
> ensures that the `FileDescriptorReservation` cannot be moved into a
> different process.
>
> Signed-off-by: Wedson Almeida Filho <[email protected]>
> Co-developed-by: Alice Ryhl <[email protected]>
> Reviewed-by: Benno Lossin <[email protected]>
> Reviewed-by: Martin Rodriguez Reboredo <[email protected]>
> Signed-off-by: Alice Ryhl <[email protected]>
> ---

This looks ok to me now.

2024-03-20 18:17:09

by Alice Ryhl

[permalink] [raw]
Subject: Re: [PATCH v5 3/9] rust: file: add Rust abstraction for `struct file`

Christian Brauner <[email protected]> wrote:
> On Fri, Feb 09, 2024 at 11:18:16AM +0000, Alice Ryhl wrote:
>> From: Wedson Almeida Filho <[email protected]>
>>
>> This abstraction makes it possible to manipulate the open files for a
>> process. The new `File` struct wraps the C `struct file`. When accessing
>> it using the smart pointer `ARef<File>`, the pointer will own a
>> reference count to the file. When accessing it as `&File`, then the
>> reference does not own a refcount, but the borrow checker will ensure
>> that the reference count does not hit zero while the `&File` is live.
>>
>> Since this is intended to manipulate the open files of a process, we
>> introduce an `fget` constructor that corresponds to the C `fget`
>> method. In future patches, it will become possible to create a new fd in
>> a process and bind it to a `File`. Rust Binder will use these to send
>> fds from one process to another.
>>
>> We also provide a method for accessing the file's flags. Rust Binder
>> will use this to access the flags of the Binder fd to check whether the
>> non-blocking flag is set, which affects what the Binder ioctl does.
>>
>> This introduces a struct for the EBADF error type, rather than just
>> using the Error type directly. This has two advantages:
>> * `File::from_fd` returns a `Result<ARef<File>, BadFdError>`, which the
>
> Sorry, where's that method?

Sorry, this is supposed to say `File::fget`.

>> +/// Wraps the kernel's `struct file`.
>> +///
>> +/// This represents an open file rather than a file on a filesystem. Processes generally reference
>> +/// open files using file descriptors. However, file descriptors are not the same as files. A file
>> +/// descriptor is just an integer that corresponds to a file, and a single file may be referenced
>> +/// by multiple file descriptors.
>> +///
>> +/// # Refcounting
>> +///
>> +/// Instances of this type are reference-counted. The reference count is incremented by the
>> +/// `fget`/`get_file` functions and decremented by `fput`. The Rust type `ARef<File>` represents a
>> +/// pointer that owns a reference count on the file.
>> +///
>> +/// Whenever a process opens a file descriptor (fd), it stores a pointer to the file in its `struct
>> +/// files_struct`. This pointer owns a reference count to the file, ensuring the file isn't
>> +/// prematurely deleted while the file descriptor is open. In Rust terminology, the pointers in
>> +/// `struct files_struct` are `ARef<File>` pointers.
>> +///
>> +/// ## Light refcounts
>> +///
>> +/// Whenever a process has an fd to a file, it may use something called a "light refcount" as a
>> +/// performance optimization. Light refcounts are acquired by calling `fdget` and released with
>> +/// `fdput`. The idea behind light refcounts is that if the fd is not closed between the calls to
>> +/// `fdget` and `fdput`, then the refcount cannot hit zero during that time, as the `struct
>> +/// files_struct` holds a reference until the fd is closed. This means that it's safe to access the
>> +/// file even if `fdget` does not increment the refcount.
>> +///
>> +/// The requirement that the fd is not closed during a light refcount applies globally across all
>> +/// threads - not just on the thread using the light refcount. For this reason, light refcounts are
>> +/// only used when the `struct files_struct` is not shared with other threads, since this ensures
>> +/// that other unrelated threads cannot suddenly start using the fd and close it. Therefore,
>> +/// calling `fdget` on a shared `struct files_struct` creates a normal refcount instead of a light
>> +/// refcount.
>
> When the fdget() calling task doesn't have a shared file descriptor
> table fdget() will not increment the reference count, yes. This
> also implies that you cannot have task A use fdget() and then pass
> f.file to task B that holds on to it while A returns to userspace. It's
> irrelevant that task A won't drop the reference count or that task B
> won't drop the reference count. Because task A could return back to
> userspace and immediately close the fd via a regular close() system call
> at which point task B has a UAF. In other words a file that has been
> gotten via fdget() can't be Send to another task without the Send
> implying taking a reference to it.

That matches my understanding.

I suppose that technically you can still send it to another thread *if* you
ensure that the current thread waits until that other thread stops using the
file before returning to userspace.

>> +///
>> +/// Light reference counts must be released with `fdput` before the system call returns to
>> +/// userspace. This means that if you wait until the current system call returns to userspace, then
>> +/// all light refcounts that existed at the time have gone away.
>> +///
>> +/// ## Rust references
>> +///
>> +/// The reference type `&File` is similar to light refcounts:
>> +///
>> +/// * `&File` references don't own a reference count. They can only exist as long as the reference
>> +/// count stays positive, and can only be created when there is some mechanism in place to ensure
>> +/// this.
>> +///
>> +/// * The Rust borrow-checker normally ensures this by enforcing that the `ARef<File>` from which
>> +/// a `&File` is created outlives the `&File`.
>
> The section confuses me a little: Does the borrow-checker always ensure
> that a &File stays valid or are there circumstances where it doesn't or
> are you saying it doesn't enforce it?

The borrow-checker always ensures it.

A &File is actually short-hand for &'a File, where 'a is some
unspecified lifetime. We say that &'a File is annotated with 'a. The
borrow-checker rejects any code that tries to use a reference after the
end of the lifetime annotated on it.

So as long as you annotate the reference with a sufficiently short
lifetime, the borrow checker will prevent UAF. And outside of cases like
from_ptr, the borrow-checker also takes care of ensuring that the
lifetimes are sufficiently short.

(Technically &'a File and &'b File are two completely different types,
so &File is technically a class of types and not a single type. Rust
will automatically convert &'long File to &'short File.)

>> +///
>> +/// * Using the unsafe [`File::from_ptr`] means that it is up to the caller to ensure that the
>> +/// `&File` only exists while the reference count is positive.
>
> What is this used for in binder? If it's not used don't add it.

This is used on the boundary between the Rust part of Binder and the
binderfs component that is implemented in C. For example:

unsafe extern "C" fn rust_binder_open(
inode: *mut bindings::inode,
file_ptr: *mut bindings::file,
) -> core::ffi::c_int {
// SAFETY: The `rust_binderfs.c` file ensures that `i_private` is set to the return value of a
// successful call to `rust_binder_new_device`.
let ctx = unsafe { Arc::<Context>::borrow((*inode).i_private) };

// SAFETY: The caller provides a valid file pointer to a new `struct file`.
let file = unsafe { File::from_ptr(file_ptr) };
let process = match Process::open(ctx, file) {
Ok(process) => process,
Err(err) => return err.to_errno(),
};
// SAFETY: This file is associated with Rust binder, so we own the `private_data` field.
unsafe {
(*file_ptr).private_data = process.into_foreign().cast_mut();
}
0
}

Here, rust_binder_open is the open function in a struct file_operations
vtable. In this case, file_ptr is guaranteed by the caller to be valid
for the duration of the call to rust_binder_open. Binder uses from_ptr
to get a &File from the raw pointer.

As far as I understand, the caller of rust_binder_open uses fdget to
ensure that file_ptr is valid for the duration of the call, but the Rust
code doesn't assume that it does this with fdget. As long as file_ptr is
valid for the duration of the rust_binder_open call, this use of
from_ptr is okay. It will continue to work even if the caller is changed
to use fget.

As for how this code ensures that `file` ends up annotated with a
sufficiently short lifetime, well, that has to do with the signature of
Process::open. Here it is:

impl Process {
pub(crate) fn open(ctx: ArcBorrow<'_, Context>, file: &File) -> Result<Arc<Process>> {
Self::new(ctx.into(), ARef::from(file.cred()))
}
}

In this case, &File is used without specifying a lifetime. It's a
function argument, so this means that the lifetime annotated on the
`file` argument will be exactly the duration in which Process::open is
called. So any attempt to use `file` after the end of the call to
Process::open will be rejected by the borrow-checker. (E.g., if
Process::open tried to schedule something on the workqueue using `file`,
then that would not compile. Storing it in a global variable would not
compile either.)

This means that the borrow-checker will not catch mistakes in
rust_binder_open, but it *will* catch mistakes in Process::open, and
anything called by Process::open.

These examples come from patch 2 of the Binder RFC:
https://lore.kernel.org/rust-for-linux/[email protected]/

>> +///
>> +/// * You can think of `fdget` as using an fd to look up an `ARef<File>` in the `struct
>
> Could you explain why there isn't an explicit fdget() then and you have
> that from_ptr() method?

I don't provide an fdget implementation because Binder never calls it.
However, if you would like, I would be happy to add one to the patchset.

As for from_ptr, see above.

Alice

2024-03-21 13:28:40

by Alice Ryhl

[permalink] [raw]
Subject: Re: [PATCH v5 8/9] rust: file: add `DeferredFdCloser`

On Wed, Mar 20, 2024 at 3:22 PM Christian Brauner <[email protected]> wrote:
>
> On Fri, Feb 09, 2024 at 11:18:21AM +0000, Alice Ryhl wrote:
> >
> > +/// Helper used for closing file descriptors in a way that is safe even if the file is currently
> > +/// held using `fdget`.
> > +///
> > +/// Additional motivation can be found in commit 80cd795630d6 ("binder: fix use-after-free due to
> > +/// ksys_close() during fdget()") and in the comments on `binder_do_fd_close`.
> > +pub struct DeferredFdCloser {
> > + inner: Box<DeferredFdCloserInner>,
> > +}
> > +
> > +/// SAFETY: This just holds an allocation with no real content, so there's no safety issue with
> > +/// moving it across threads.
> > +unsafe impl Send for DeferredFdCloser {}
> > +unsafe impl Sync for DeferredFdCloser {}
> > +
> > +/// # Invariants
> > +///
> > +/// If the `file` pointer is non-null, then it points at a `struct file` and owns a refcount to
> > +/// that file.
> > +#[repr(C)]
> > +struct DeferredFdCloserInner {
> > + twork: mem::MaybeUninit<bindings::callback_head>,
> > + file: *mut bindings::file,
> > +}
> > +
> > +impl DeferredFdCloser {
>
> So the explicitly deferred close is due to how binder works so it's not
> much of a general purpose interface as I don't recall having other
> codepaths with similar problems. So this should live in the binder
> specific rust code imo.

Hmm. Are there really no other ioctls that call ksys_close on a
user-provided fd?

As far as I can tell, this kind of deferred API is the only way for us
to provide a fully safe Rust api for closing an fd. Directly calling
ksys_close must unsafely assert that the fd does not have an active
fdget call. So it makes sense to me as an API that others might want
to use.

Still I can move it elsewhere if necessary.

Alice

2024-03-31 10:27:31

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v5 8/9] rust: file: add `DeferredFdCloser`

On Thu, Mar 21, 2024 at 02:28:15PM +0100, Alice Ryhl wrote:
> On Wed, Mar 20, 2024 at 3:22 PM Christian Brauner <[email protected]> wrote:
> >
> > On Fri, Feb 09, 2024 at 11:18:21AM +0000, Alice Ryhl wrote:
> > >
> > > +/// Helper used for closing file descriptors in a way that is safe even if the file is currently
> > > +/// held using `fdget`.
> > > +///
> > > +/// Additional motivation can be found in commit 80cd795630d6 ("binder: fix use-after-free due to
> > > +/// ksys_close() during fdget()") and in the comments on `binder_do_fd_close`.
> > > +pub struct DeferredFdCloser {
> > > + inner: Box<DeferredFdCloserInner>,
> > > +}
> > > +
> > > +/// SAFETY: This just holds an allocation with no real content, so there's no safety issue with
> > > +/// moving it across threads.
> > > +unsafe impl Send for DeferredFdCloser {}
> > > +unsafe impl Sync for DeferredFdCloser {}
> > > +
> > > +/// # Invariants
> > > +///
> > > +/// If the `file` pointer is non-null, then it points at a `struct file` and owns a refcount to
> > > +/// that file.
> > > +#[repr(C)]
> > > +struct DeferredFdCloserInner {
> > > + twork: mem::MaybeUninit<bindings::callback_head>,
> > > + file: *mut bindings::file,
> > > +}
> > > +
> > > +impl DeferredFdCloser {
> >
> > So the explicitly deferred close is due to how binder works so it's not
> > much of a general purpose interface as I don't recall having other
> > codepaths with similar problems. So this should live in the binder
> > specific rust code imo.
>
> Hmm. Are there really no other ioctls that call ksys_close on a
> user-provided fd?

No, I don't think there are otherwise they would have the same bug that
binder had. io_uring comes closest but they have their own task work
and deferred close implementation.

>
> As far as I can tell, this kind of deferred API is the only way for us
> to provide a fully safe Rust api for closing an fd. Directly calling
> ksys_close must unsafely assert that the fd does not have an active
> fdget call. So it makes sense to me as an API that others might want
> to use.

I'm sorry, I don't quite understand you here.

What binder is doing iirc is that it's performing an ioctl() using a fd
to /dev/binderfs/$device-node or similar. The fatal flaw is that binder
now allows that fd to be closed during that ioctl - and by accident at
that. It's effectively closing a file it's relying on to not go away.
That's the original nonsense/creativity that necessitates this whole
manual task work shenanigans binder is doing. Unless I misremember.

But that's really frowned upon generally and absolutely not encouraged
by providing a generic interface for this stuff.

Sure, we have some users in the kernel that do stuff like call
close_fd() on a file descriptor they just installed into their file
descriptor table. That's usually bad design with maybe a few valid
exceptions. One example where that's still done and should be fixed is
e.g., cachefiles in fs/cachefiles/ondemand.c. The point is that they at
least close a file descriptor that they just installed and that they
don't rely on being valid.

So really, I need more details why without this interface it would
prevent Rust from implementing safe file descriptor closing because
right now all I see is a design flaw in binder being promoted to a
generic interface. And unless there's detailed reasons for it's
existence we're not going to do it.

2024-03-31 12:56:00

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v5 3/9] rust: file: add Rust abstraction for `struct file`

On Wed, Mar 20, 2024 at 06:09:05PM +0000, Alice Ryhl wrote:
> Christian Brauner <[email protected]> wrote:
> > On Fri, Feb 09, 2024 at 11:18:16AM +0000, Alice Ryhl wrote:
> >> From: Wedson Almeida Filho <[email protected]>
> >>
> >> This abstraction makes it possible to manipulate the open files for a
> >> process. The new `File` struct wraps the C `struct file`. When accessing
> >> it using the smart pointer `ARef<File>`, the pointer will own a
> >> reference count to the file. When accessing it as `&File`, then the
> >> reference does not own a refcount, but the borrow checker will ensure
> >> that the reference count does not hit zero while the `&File` is live.
> >>
> >> Since this is intended to manipulate the open files of a process, we
> >> introduce an `fget` constructor that corresponds to the C `fget`
> >> method. In future patches, it will become possible to create a new fd in
> >> a process and bind it to a `File`. Rust Binder will use these to send
> >> fds from one process to another.
> >>
> >> We also provide a method for accessing the file's flags. Rust Binder
> >> will use this to access the flags of the Binder fd to check whether the
> >> non-blocking flag is set, which affects what the Binder ioctl does.
> >>
> >> This introduces a struct for the EBADF error type, rather than just
> >> using the Error type directly. This has two advantages:
> >> * `File::from_fd` returns a `Result<ARef<File>, BadFdError>`, which the
> >
> > Sorry, where's that method?
>
> Sorry, this is supposed to say `File::fget`.
>
> >> +/// Wraps the kernel's `struct file`.
> >> +///
> >> +/// This represents an open file rather than a file on a filesystem. Processes generally reference
> >> +/// open files using file descriptors. However, file descriptors are not the same as files. A file
> >> +/// descriptor is just an integer that corresponds to a file, and a single file may be referenced
> >> +/// by multiple file descriptors.
> >> +///
> >> +/// # Refcounting
> >> +///
> >> +/// Instances of this type are reference-counted. The reference count is incremented by the
> >> +/// `fget`/`get_file` functions and decremented by `fput`. The Rust type `ARef<File>` represents a
> >> +/// pointer that owns a reference count on the file.
> >> +///
> >> +/// Whenever a process opens a file descriptor (fd), it stores a pointer to the file in its `struct
> >> +/// files_struct`. This pointer owns a reference count to the file, ensuring the file isn't
> >> +/// prematurely deleted while the file descriptor is open. In Rust terminology, the pointers in
> >> +/// `struct files_struct` are `ARef<File>` pointers.
> >> +///
> >> +/// ## Light refcounts
> >> +///
> >> +/// Whenever a process has an fd to a file, it may use something called a "light refcount" as a
> >> +/// performance optimization. Light refcounts are acquired by calling `fdget` and released with
> >> +/// `fdput`. The idea behind light refcounts is that if the fd is not closed between the calls to
> >> +/// `fdget` and `fdput`, then the refcount cannot hit zero during that time, as the `struct
> >> +/// files_struct` holds a reference until the fd is closed. This means that it's safe to access the
> >> +/// file even if `fdget` does not increment the refcount.
> >> +///
> >> +/// The requirement that the fd is not closed during a light refcount applies globally across all
> >> +/// threads - not just on the thread using the light refcount. For this reason, light refcounts are
> >> +/// only used when the `struct files_struct` is not shared with other threads, since this ensures
> >> +/// that other unrelated threads cannot suddenly start using the fd and close it. Therefore,
> >> +/// calling `fdget` on a shared `struct files_struct` creates a normal refcount instead of a light
> >> +/// refcount.
> >
> > When the fdget() calling task doesn't have a shared file descriptor
> > table fdget() will not increment the reference count, yes. This
> > also implies that you cannot have task A use fdget() and then pass
> > f.file to task B that holds on to it while A returns to userspace. It's
> > irrelevant that task A won't drop the reference count or that task B
> > won't drop the reference count. Because task A could return back to
> > userspace and immediately close the fd via a regular close() system call
> > at which point task B has a UAF. In other words a file that has been
> > gotten via fdget() can't be Send to another task without the Send
> > implying taking a reference to it.
>
> That matches my understanding.
>
> I suppose that technically you can still send it to another thread *if* you
> ensure that the current thread waits until that other thread stops using the
> file before returning to userspace.

_Technically_ yes, but it would be brittle as hell. The problem is that
fdget() _relies_ on being single-threaded for the time that fd is used
until fdput(). There's locking assumptions that build on that e.g., for
concurrent read/write. So no, that shouldn't be allowed.

Look at how this broke our back when we introduced pidfd_getfd() where
we steal an fd from another task. I have a lengthy explanation how that
can be used to violate our elided-locking which is based on assuming
that we're always single-threaded and the file can't be suddenly shared
with another task. So maybe doable but it would make the semantics even
more intricate.

> >> +///
> >> +/// Light reference counts must be released with `fdput` before the system call returns to
> >> +/// userspace. This means that if you wait until the current system call returns to userspace, then
> >> +/// all light refcounts that existed at the time have gone away.
> >> +///
> >> +/// ## Rust references
> >> +///
> >> +/// The reference type `&File` is similar to light refcounts:
> >> +///
> >> +/// * `&File` references don't own a reference count. They can only exist as long as the reference
> >> +/// count stays positive, and can only be created when there is some mechanism in place to ensure
> >> +/// this.
> >> +///
> >> +/// * The Rust borrow-checker normally ensures this by enforcing that the `ARef<File>` from which
> >> +/// a `&File` is created outlives the `&File`.
> >
> > The section confuses me a little: Does the borrow-checker always ensure
> > that a &File stays valid or are there circumstances where it doesn't or
> > are you saying it doesn't enforce it?
>
> The borrow-checker always ensures it.

Ok, thanks.

>
> A &File is actually short-hand for &'a File, where 'a is some
> unspecified lifetime. We say that &'a File is annotated with 'a. The
> borrow-checker rejects any code that tries to use a reference after the
> end of the lifetime annotated on it.

Thanks for the explanation.

>
> So as long as you annotate the reference with a sufficiently short
> lifetime, the borrow checker will prevent UAF. And outside of cases like

Sorry, but can you explain "sufficiently short lifetime"?

> from_ptr, the borrow-checker also takes care of ensuring that the
> lifetimes are sufficiently short.
>
> (Technically &'a File and &'b File are two completely different types,
> so &File is technically a class of types and not a single type. Rust
> will automatically convert &'long File to &'short File.)
>
> >> +///
> >> +/// * Using the unsafe [`File::from_ptr`] means that it is up to the caller to ensure that the
> >> +/// `&File` only exists while the reference count is positive.
> >
> > What is this used for in binder? If it's not used don't add it.
>
> This is used on the boundary between the Rust part of Binder and the
> binderfs component that is implemented in C. For example:

I see, I'm being foiled by my own code...

>
> unsafe extern "C" fn rust_binder_open(
> inode: *mut bindings::inode,
> file_ptr: *mut bindings::file,
> ) -> core::ffi::c_int {
> // SAFETY: The `rust_binderfs.c` file ensures that `i_private` is set to the return value of a
> // successful call to `rust_binder_new_device`.
> let ctx = unsafe { Arc::<Context>::borrow((*inode).i_private) };
>
> // SAFETY: The caller provides a valid file pointer to a new `struct file`.
> let file = unsafe { File::from_ptr(file_ptr) };

We need a better name for this helper than from_ptr() imho. I think
from_ptr() and as_ptr() is odd for C. How weird would it be to call
that from_raw_file() and as_raw_file()?

But bigger picture I somewhat struggle with the semantics of this
because this is not an interface that we have in C and this is really
about a low-level contract between C and Rust. Specifically this states
that this pointer is _somehow_ guaranteed valid. And really, this seems
a bit of a hack.

Naively, I think this should probably not be necessary if
file_operations are properly wrapped. Or it should at least be demotable
to a purely internal method that can't be called directly or something.

So what I mean is. fdget() may or may not take a reference. The C
interface internally knows whether a reference is owned or not by
abusing the lower two bits in a pointer to keep track of that. Naively,
I would expect the same information to be available to rust so that it's
clear to Rust wheter it's dealing with an explicitly referenced file or
an elided-reference file. Maybe that's not possible and I'm not
well-versed enough to see that yet.

> let process = match Process::open(ctx, file) {
> Ok(process) => process,
> Err(err) => return err.to_errno(),
> };
> // SAFETY: This file is associated with Rust binder, so we own the `private_data` field.
> unsafe {
> (*file_ptr).private_data = process.into_foreign().cast_mut();
> }
> 0
> }
>
> Here, rust_binder_open is the open function in a struct file_operations
> vtable. In this case, file_ptr is guaranteed by the caller to be valid

Where's the code that wraps struct file_operations?

> for the duration of the call to rust_binder_open. Binder uses from_ptr
> to get a &File from the raw pointer.
>
> As far as I understand, the caller of rust_binder_open uses fdget to
> ensure that file_ptr is valid for the duration of the call, but the Rust
> code doesn't assume that it does this with fdget. As long as file_ptr is
> valid for the duration of the rust_binder_open call, this use of
> from_ptr is okay. It will continue to work even if the caller is changed
> to use fget.

Ok.

>
> As for how this code ensures that `file` ends up annotated with a
> sufficiently short lifetime, well, that has to do with the signature of
> Process::open. Here it is:
>
> impl Process {
> pub(crate) fn open(ctx: ArcBorrow<'_, Context>, file: &File) -> Result<Arc<Process>> {
> Self::new(ctx.into(), ARef::from(file.cred()))
> }
> }
>
> In this case, &File is used without specifying a lifetime. It's a
> function argument, so this means that the lifetime annotated on the
> `file` argument will be exactly the duration in which Process::open is
> called. So any attempt to use `file` after the end of the call to
> Process::open will be rejected by the borrow-checker. (E.g., if
> Process::open tried to schedule something on the workqueue using `file`,
> then that would not compile. Storing it in a global variable would not
> compile either.)
>
> This means that the borrow-checker will not catch mistakes in
> rust_binder_open, but it *will* catch mistakes in Process::open, and
> anything called by Process::open.
>
> These examples come from patch 2 of the Binder RFC:
> https://lore.kernel.org/rust-for-linux/[email protected]/
>
> >> +///
> >> +/// * You can think of `fdget` as using an fd to look up an `ARef<File>` in the `struct
> >
> > Could you explain why there isn't an explicit fdget() then and you have
> > that from_ptr() method?
>
> I don't provide an fdget implementation because Binder never calls it.
> However, if you would like, I would be happy to add one to the patchset.
>
> As for from_ptr, see above.
>
> Alice

2024-04-01 15:27:02

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v5 3/9] rust: file: add Rust abstraction for `struct file`

On Mon, Apr 01, 2024 at 12:09:08PM +0000, Alice Ryhl wrote:
> Christian Brauner <[email protected]> wrote:
> > On Wed, Mar 20, 2024 at 06:09:05PM +0000, Alice Ryhl wrote:
> >> Christian Brauner <[email protected]> wrote:
> >>> On Fri, Feb 09, 2024 at 11:18:16AM +0000, Alice Ryhl wrote:
> >>>> +/// Wraps the kernel's `struct file`.
> >>>> +///
> >>>> +/// This represents an open file rather than a file on a filesystem. Processes generally reference
> >>>> +/// open files using file descriptors. However, file descriptors are not the same as files. A file
> >>>> +/// descriptor is just an integer that corresponds to a file, and a single file may be referenced
> >>>> +/// by multiple file descriptors.
> >>>> +///
> >>>> +/// # Refcounting
> >>>> +///
> >>>> +/// Instances of this type are reference-counted. The reference count is incremented by the
> >>>> +/// `fget`/`get_file` functions and decremented by `fput`. The Rust type `ARef<File>` represents a
> >>>> +/// pointer that owns a reference count on the file.
> >>>> +///
> >>>> +/// Whenever a process opens a file descriptor (fd), it stores a pointer to the file in its `struct
> >>>> +/// files_struct`. This pointer owns a reference count to the file, ensuring the file isn't
> >>>> +/// prematurely deleted while the file descriptor is open. In Rust terminology, the pointers in
> >>>> +/// `struct files_struct` are `ARef<File>` pointers.
> >>>> +///
> >>>> +/// ## Light refcounts
> >>>> +///
> >>>> +/// Whenever a process has an fd to a file, it may use something called a "light refcount" as a
> >>>> +/// performance optimization. Light refcounts are acquired by calling `fdget` and released with
> >>>> +/// `fdput`. The idea behind light refcounts is that if the fd is not closed between the calls to
> >>>> +/// `fdget` and `fdput`, then the refcount cannot hit zero during that time, as the `struct
> >>>> +/// files_struct` holds a reference until the fd is closed. This means that it's safe to access the
> >>>> +/// file even if `fdget` does not increment the refcount.
> >>>> +///
> >>>> +/// The requirement that the fd is not closed during a light refcount applies globally across all
> >>>> +/// threads - not just on the thread using the light refcount. For this reason, light refcounts are
> >>>> +/// only used when the `struct files_struct` is not shared with other threads, since this ensures
> >>>> +/// that other unrelated threads cannot suddenly start using the fd and close it. Therefore,
> >>>> +/// calling `fdget` on a shared `struct files_struct` creates a normal refcount instead of a light
> >>>> +/// refcount.
> >>>
> >>> When the fdget() calling task doesn't have a shared file descriptor
> >>> table fdget() will not increment the reference count, yes. This
> >>> also implies that you cannot have task A use fdget() and then pass
> >>> f.file to task B that holds on to it while A returns to userspace. It's
> >>> irrelevant that task A won't drop the reference count or that task B
> >>> won't drop the reference count. Because task A could return back to
> >>> userspace and immediately close the fd via a regular close() system call
> >>> at which point task B has a UAF. In other words a file that has been
> >>> gotten via fdget() can't be Send to another task without the Send
> >>> implying taking a reference to it.
> >>
> >> That matches my understanding.
> >>
> >> I suppose that technically you can still send it to another thread *if* you
> >> ensure that the current thread waits until that other thread stops using the
> >> file before returning to userspace.
> >
> > _Technically_ yes, but it would be brittle as hell. The problem is that
> > fdget() _relies_ on being single-threaded for the time that fd is used
> > until fdput(). There's locking assumptions that build on that e.g., for
> > concurrent read/write. So no, that shouldn't be allowed.
> >
> > Look at how this broke our back when we introduced pidfd_getfd() where
> > we steal an fd from another task. I have a lengthy explanation how that
> > can be used to violate our elided-locking which is based on assuming
> > that we're always single-threaded and the file can't be suddenly shared
> > with another task. So maybe doable but it would make the semantics even
> > more intricate.
>
> Hmm, the part about elided locking is surprising to me, and may be an
> issue. Can you give more details on that? Because the current

So what I referred to was that we do have fdget_pos(). Roughly, if
there's more than one reference on the file then we need to acquire a
mutex but if it's only a single reference then we can avoid taking the
mutex because we know that we're the only one that has a reference to
that file and no one else can acquire one. Whether or not that mutex was
taken is taken track of in struct fd.

So you can't share a file after fdget_pos() has been called on it and
you haven't taken the position mutex. So let's say you had:

* Tread A that calls fdget_pos() on file1 and the reference count is
one. So Thread A doesn't acquire the file position mutex for file1.
* Now somehow that file1 becomes shared, e.g., Thread B calls fget() on
it and now Thread B does some operation that requires the file
position mutex.
=> Thread A and Thread B race on the file position.

So just because you have a reference to a file from somewhere it doesn't
mean you can just share it with another thread.

So if yo have an arbitrary reference to a file in Rust and that somehow
can be shared with another thread you risk races here.

> abstractions here *do* actually allow what I described, since we
> implement Sync for File.
>
> I'm not familiar with the pidfd_getfd discussion you are referring to.
> Do you have a link?

https://lore.kernel.org/linux-fsdevel/[email protected]

pidfd_getfd() can be used to steal a file descriptor from another task.
It's like a non-cooperative SCM_RIGHTS. That means you can have exactly
the scenario described above where a file assumed to be non-shared is
suddenly shared and you have racing reads/writes.

For readdir we nowadays always take the file position mutex because of
the pidfd_getfd() business because that might corrupt internal state.

>
> I'm thinking that we may have to provide two different `struct file`
> wrappers to accurately model this API in Rust. Perhaps they could be
> called File and LocalFile, where one is marked as thread safe and the
> other isn't. I can make all LocalFile methods available on File to avoid
> having to duplicate methods that are available on both.

But isn't that just struct file and struct fd? Ideally we'd stay close
to something like this.

>
> But it's not clear to me that this is even enough. Even if we give you a
> &LocalFile to prevent you from moving it across threads, you can just
> call File::fget to get an ARef<File> to the same file and then move
> *that* across threads.

Yes, absolutely.

>
> This kind of global requirement is not so easy to model. Maybe klint [1]
> could do it ... atomic context violations are a similar kind of global
> check. But having klint do it would be far out.
>
> Or maybe File::fget should also return a LocalFile?
>
> But this raises a different question to me. Let's say process A uses
> Binder to send its own fd to process B, and the following things happen:
>
> 1. Process A enters the ioctl and takes fdget on the fd.
> 2. Process A calls fget on the same fd to send it to another process.
> 3. Process A goes to sleep, waiting for process B to respond.
> 4. Process B receives the message, installs the fd, and returns to userspace.
> 5. Process B responds to the transaction, but does not close the fd.

The fd just installed in 4. and the fd you're referring to in 5. are
identical, right? IOW, we're not talking about two different fd (dup)
for the same file, right?

> 6a. Process A finishes sleeping, and returns to userspace from the ioctl.
> 6b. Process B tries to do an operation (e.g. read) on the fd.
>
> Let's say that 6a and 6b run in parallel.
>
> Could this potentially result in a data race between step 6a and 6b? I'm
> guessing that step 6a probably doesn't touch any of the code that has
> elided locking assumptions, so in practice I guess there's not a problem
> ... but if you make any sort of elided locking assumption as you exit
> from the ioctl (before reaching the fdput), then it seems to me that you
> have a problem.

Yes, 6a doesn't touch any code that has elided locking assumptions.

1'. Process A enters the ioctl and takes fdget() on the fd. Process A
holds the only reference to that file and the file descriptor table
isn't shared. Therefore, f_count is left untouched and remains at 1.
2'. Process A calls fget() which unconditionally bumps f_count bringing
it to 2 and sending it another process (Presumably you intend to
imply that this reference is now owned by the second process.).
3'. [as 3.]
4'. Process B installs the file into it's file descriptor table
consuming that reference from 2'. The f_count remains at 2 with the
reference from 2' now being owned by Process B.
5'. Since Process B isn't closing the fd and has just called
fd_install() it returns to userspace with f_count untouched and
still at 2.
6'a. Process A finishes sleeping and returns to userspace calling
fdput(). Since the original fdget() was done without bumping the
reference count the fdput() of Process A will not decrement the
reference count. So f_count remains at 2.
6'b. Process B performs a read/write syscall and calls fdget_pos().
fdget_pos() sees that this file has f_count > 1 and takes the
file position mutex.

So this isn't a problem. The problem is when a file becomes shared
implicitly without the original owner of the file knowing.

>
> >>>> +///
> >>>> +/// Light reference counts must be released with `fdput` before the system call returns to
> >>>> +/// userspace. This means that if you wait until the current system call returns to userspace, then
> >>>> +/// all light refcounts that existed at the time have gone away.
> >>>> +///
> >>>> +/// ## Rust references
> >>>> +///
> >>>> +/// The reference type `&File` is similar to light refcounts:
> >>>> +///
> >>>> +/// * `&File` references don't own a reference count. They can only exist as long as the reference
> >>>> +/// count stays positive, and can only be created when there is some mechanism in place to ensure
> >>>> +/// this.
> >>>> +///
> >>>> +/// * The Rust borrow-checker normally ensures this by enforcing that the `ARef<File>` from which
> >>>> +/// a `&File` is created outlives the `&File`.
> >>>
> >>> The section confuses me a little: Does the borrow-checker always ensure
> >>> that a &File stays valid or are there circumstances where it doesn't or
> >>> are you saying it doesn't enforce it?
> >>
> >> The borrow-checker always ensures it.
> >
> > Ok, thanks.
> >
> >>
> >> A &File is actually short-hand for &'a File, where 'a is some
> >> unspecified lifetime. We say that &'a File is annotated with 'a. The
> >> borrow-checker rejects any code that tries to use a reference after the
> >> end of the lifetime annotated on it.
> >
> > Thanks for the explanation.
> >
> >>
> >> So as long as you annotate the reference with a sufficiently short
> >> lifetime, the borrow checker will prevent UAF. And outside of cases like
> >
> > Sorry, but can you explain "sufficiently short lifetime"?
>
> By "sufficiently short lifetime" I mean "lifetime that ends before the
> object is destroyed".

Ah, ok. It sounded like it was a specific concept that Rust is
implementing in contrast to long-term lifetime or sm. Thanks!

>
> Idea being that if the lifetime ends before the object is freed, and the
> borrow-checker rejects attempts to use it after the lifetime ends, then
> it follows that the borrow-checker prevents use-after-frees.
>
> >> from_ptr, the borrow-checker also takes care of ensuring that the
> >> lifetimes are sufficiently short.
> >>
> >> (Technically &'a File and &'b File are two completely different types,
> >> so &File is technically a class of types and not a single type. Rust
> >> will automatically convert &'long File to &'short File.)
> >>
> >>>> +///
> >>>> +/// * Using the unsafe [`File::from_ptr`] means that it is up to the caller to ensure that the
> >>>> +/// `&File` only exists while the reference count is positive.
> >>>
> >>> What is this used for in binder? If it's not used don't add it.
> >>
> >> This is used on the boundary between the Rust part of Binder and the
> >> binderfs component that is implemented in C. For example:
> >
> > I see, I'm being foiled by my own code...
> >
> >>
> >> unsafe extern "C" fn rust_binder_open(
> >> inode: *mut bindings::inode,
> >> file_ptr: *mut bindings::file,
> >> ) -> core::ffi::c_int {
> >> // SAFETY: The `rust_binderfs.c` file ensures that `i_private` is set to the return value of a
> >> // successful call to `rust_binder_new_device`.
> >> let ctx = unsafe { Arc::<Context>::borrow((*inode).i_private) };
> >>
> >> // SAFETY: The caller provides a valid file pointer to a new `struct file`.
> >> let file = unsafe { File::from_ptr(file_ptr) };
> >
> > We need a better name for this helper than from_ptr() imho. I think
> > from_ptr() and as_ptr() is odd for C. How weird would it be to call
> > that from_raw_file() and as_raw_file()?
>
> That's a reasonable name. I would be happy to rename to that, and I
> don't think it is unidiomatic.

Thanks!

>
> > But bigger picture I somewhat struggle with the semantics of this
> > because this is not an interface that we have in C and this is really
> > about a low-level contract between C and Rust. Specifically this states
> > that this pointer is _somehow_ guaranteed valid. And really, this seems
> > a bit of a hack.
>
> Indeed ... but I think it's a quite common hack. After all, any time you
> dereference a raw pointer in Rust, you are making the same assumption.
>
> > Naively, I think this should probably not be necessary if
> > file_operations are properly wrapped. Or it should at least be demotable
> > to a purely internal method that can't be called directly or something.
>
> Yes, the usage here of File::from_ptr could probably be hidden inside a
> suitably designed file_operations wrapper. The thing is, Rust Binder
> doesn't currently use such a wrapper at all. It just exports a global of
> type file_operations and the C code in binderfs then references that
> global.

Yeah.

>
> Rust Binder used to use such an abstraction, but I ripped it out before
> sending the Rust Binder RFC because it didn't actually help. It was
> designed for cases where the file system is also implemented in Rust, so
> to get it to expose a file_operations global to the C code in binderfs,
> I had to reach inside its internal implementation. It did not save me
> from doing stuff such as using File::from_ptr from Binder.
>
> Now, you could most likely modify those file_operations abstractions to
> support my use-case better. But calling into C is already unsafe, so
> unless we get multiple drivers that have a similar C/Rust split, it's
> not clear that it's useful to extract the logic from Binder. I would
> prefer to wait for the file_operations abstraction to get upstreamed by
> the people working on VFS bindings, and then we can decide whether we
> should rewrite binderfs into Rust and get rid of the logic, or whether
> it's worth to expand the file_operations abstraction to support split
> C/Rust drivers like the current binderfs.
>
> > So what I mean is. fdget() may or may not take a reference. The C
> > interface internally knows whether a reference is owned or not by
> > abusing the lower two bits in a pointer to keep track of that. Naively,
> > I would expect the same information to be available to rust so that it's
> > clear to Rust wheter it's dealing with an explicitly referenced file or
> > an elided-reference file. Maybe that's not possible and I'm not
> > well-versed enough to see that yet.
>
> I'm sure Rust can access the same information, but I don't think I'm
> currently doing anything that cares about the distinction?

Ok. My main goal is that we end up with an almost 1:1 correspondence
between the Rust and C interface so it's easy for current maintainers
and developers that don't want to care about Rust to continue to do so
and also just somewhat verify that changes they do are sane.

>
> >> let process = match Process::open(ctx, file) {
> >> Ok(process) => process,
> >> Err(err) => return err.to_errno(),
> >> };
> >> // SAFETY: This file is associated with Rust binder, so we own the `private_data` field.
> >> unsafe {
> >> (*file_ptr).private_data = process.into_foreign().cast_mut();
> >> }
> >> 0
> >> }
> >>
> >> Here, rust_binder_open is the open function in a struct file_operations
> >> vtable. In this case, file_ptr is guaranteed by the caller to be valid
> >
> > Where's the code that wraps struct file_operations?
>
> Please see drivers/android/rust_binder.rs in the binderfs patch [2] in
> the Rust Binder RFC.

Ok, I need to just find time to do that...:)

>
> >> for the duration of the call to rust_binder_open. Binder uses from_ptr
> >> to get a &File from the raw pointer.
> >>
> >> As far as I understand, the caller of rust_binder_open uses fdget to
> >> ensure that file_ptr is valid for the duration of the call, but the Rust
> >> code doesn't assume that it does this with fdget. As long as file_ptr is
> >> valid for the duration of the rust_binder_open call, this use of
> >> from_ptr is okay. It will continue to work even if the caller is changed
> >> to use fget.
> >
> > Ok.
> >
>
> [1]: https://rust-for-linux.com/klint
> [2]: https://lore.kernel.org/rust-for-linux/[email protected]/
>
> Alice

2024-04-01 12:09:40

by Alice Ryhl

[permalink] [raw]
Subject: Re: [PATCH v5 3/9] rust: file: add Rust abstraction for `struct file`

Christian Brauner <[email protected]> wrote:
> On Wed, Mar 20, 2024 at 06:09:05PM +0000, Alice Ryhl wrote:
>> Christian Brauner <[email protected]> wrote:
>>> On Fri, Feb 09, 2024 at 11:18:16AM +0000, Alice Ryhl wrote:
>>>> +/// Wraps the kernel's `struct file`.
>>>> +///
>>>> +/// This represents an open file rather than a file on a filesystem. Processes generally reference
>>>> +/// open files using file descriptors. However, file descriptors are not the same as files. A file
>>>> +/// descriptor is just an integer that corresponds to a file, and a single file may be referenced
>>>> +/// by multiple file descriptors.
>>>> +///
>>>> +/// # Refcounting
>>>> +///
>>>> +/// Instances of this type are reference-counted. The reference count is incremented by the
>>>> +/// `fget`/`get_file` functions and decremented by `fput`. The Rust type `ARef<File>` represents a
>>>> +/// pointer that owns a reference count on the file.
>>>> +///
>>>> +/// Whenever a process opens a file descriptor (fd), it stores a pointer to the file in its `struct
>>>> +/// files_struct`. This pointer owns a reference count to the file, ensuring the file isn't
>>>> +/// prematurely deleted while the file descriptor is open. In Rust terminology, the pointers in
>>>> +/// `struct files_struct` are `ARef<File>` pointers.
>>>> +///
>>>> +/// ## Light refcounts
>>>> +///
>>>> +/// Whenever a process has an fd to a file, it may use something called a "light refcount" as a
>>>> +/// performance optimization. Light refcounts are acquired by calling `fdget` and released with
>>>> +/// `fdput`. The idea behind light refcounts is that if the fd is not closed between the calls to
>>>> +/// `fdget` and `fdput`, then the refcount cannot hit zero during that time, as the `struct
>>>> +/// files_struct` holds a reference until the fd is closed. This means that it's safe to access the
>>>> +/// file even if `fdget` does not increment the refcount.
>>>> +///
>>>> +/// The requirement that the fd is not closed during a light refcount applies globally across all
>>>> +/// threads - not just on the thread using the light refcount. For this reason, light refcounts are
>>>> +/// only used when the `struct files_struct` is not shared with other threads, since this ensures
>>>> +/// that other unrelated threads cannot suddenly start using the fd and close it. Therefore,
>>>> +/// calling `fdget` on a shared `struct files_struct` creates a normal refcount instead of a light
>>>> +/// refcount.
>>>
>>> When the fdget() calling task doesn't have a shared file descriptor
>>> table fdget() will not increment the reference count, yes. This
>>> also implies that you cannot have task A use fdget() and then pass
>>> f.file to task B that holds on to it while A returns to userspace. It's
>>> irrelevant that task A won't drop the reference count or that task B
>>> won't drop the reference count. Because task A could return back to
>>> userspace and immediately close the fd via a regular close() system call
>>> at which point task B has a UAF. In other words a file that has been
>>> gotten via fdget() can't be Send to another task without the Send
>>> implying taking a reference to it.
>>
>> That matches my understanding.
>>
>> I suppose that technically you can still send it to another thread *if* you
>> ensure that the current thread waits until that other thread stops using the
>> file before returning to userspace.
>
> _Technically_ yes, but it would be brittle as hell. The problem is that
> fdget() _relies_ on being single-threaded for the time that fd is used
> until fdput(). There's locking assumptions that build on that e.g., for
> concurrent read/write. So no, that shouldn't be allowed.
>
> Look at how this broke our back when we introduced pidfd_getfd() where
> we steal an fd from another task. I have a lengthy explanation how that
> can be used to violate our elided-locking which is based on assuming
> that we're always single-threaded and the file can't be suddenly shared
> with another task. So maybe doable but it would make the semantics even
> more intricate.

Hmm, the part about elided locking is surprising to me, and may be an
issue. Can you give more details on that? Because the current
abstractions here *do* actually allow what I described, since we
implement Sync for File.

I'm not familiar with the pidfd_getfd discussion you are referring to.
Do you have a link?

I'm thinking that we may have to provide two different `struct file`
wrappers to accurately model this API in Rust. Perhaps they could be
called File and LocalFile, where one is marked as thread safe and the
other isn't. I can make all LocalFile methods available on File to avoid
having to duplicate methods that are available on both.

But it's not clear to me that this is even enough. Even if we give you a
&LocalFile to prevent you from moving it across threads, you can just
call File::fget to get an ARef<File> to the same file and then move
*that* across threads.

This kind of global requirement is not so easy to model. Maybe klint [1]
could do it ... atomic context violations are a similar kind of global
check. But having klint do it would be far out.

Or maybe File::fget should also return a LocalFile?

But this raises a different question to me. Let's say process A uses
Binder to send its own fd to process B, and the following things happen:

1. Process A enters the ioctl and takes fdget on the fd.
2. Process A calls fget on the same fd to send it to another process.
3. Process A goes to sleep, waiting for process B to respond.
4. Process B receives the message, installs the fd, and returns to userspace.
5. Process B responds to the transaction, but does not close the fd.
6a. Process A finishes sleeping, and returns to userspace from the ioctl.
6b. Process B tries to do an operation (e.g. read) on the fd.

Let's say that 6a and 6b run in parallel.

Could this potentially result in a data race between step 6a and 6b? I'm
guessing that step 6a probably doesn't touch any of the code that has
elided locking assumptions, so in practice I guess there's not a problem
.. but if you make any sort of elided locking assumption as you exit
from the ioctl (before reaching the fdput), then it seems to me that you
have a problem.

>>>> +///
>>>> +/// Light reference counts must be released with `fdput` before the system call returns to
>>>> +/// userspace. This means that if you wait until the current system call returns to userspace, then
>>>> +/// all light refcounts that existed at the time have gone away.
>>>> +///
>>>> +/// ## Rust references
>>>> +///
>>>> +/// The reference type `&File` is similar to light refcounts:
>>>> +///
>>>> +/// * `&File` references don't own a reference count. They can only exist as long as the reference
>>>> +/// count stays positive, and can only be created when there is some mechanism in place to ensure
>>>> +/// this.
>>>> +///
>>>> +/// * The Rust borrow-checker normally ensures this by enforcing that the `ARef<File>` from which
>>>> +/// a `&File` is created outlives the `&File`.
>>>
>>> The section confuses me a little: Does the borrow-checker always ensure
>>> that a &File stays valid or are there circumstances where it doesn't or
>>> are you saying it doesn't enforce it?
>>
>> The borrow-checker always ensures it.
>
> Ok, thanks.
>
>>
>> A &File is actually short-hand for &'a File, where 'a is some
>> unspecified lifetime. We say that &'a File is annotated with 'a. The
>> borrow-checker rejects any code that tries to use a reference after the
>> end of the lifetime annotated on it.
>
> Thanks for the explanation.
>
>>
>> So as long as you annotate the reference with a sufficiently short
>> lifetime, the borrow checker will prevent UAF. And outside of cases like
>
> Sorry, but can you explain "sufficiently short lifetime"?

By "sufficiently short lifetime" I mean "lifetime that ends before the
object is destroyed".

Idea being that if the lifetime ends before the object is freed, and the
borrow-checker rejects attempts to use it after the lifetime ends, then
it follows that the borrow-checker prevents use-after-frees.

>> from_ptr, the borrow-checker also takes care of ensuring that the
>> lifetimes are sufficiently short.
>>
>> (Technically &'a File and &'b File are two completely different types,
>> so &File is technically a class of types and not a single type. Rust
>> will automatically convert &'long File to &'short File.)
>>
>>>> +///
>>>> +/// * Using the unsafe [`File::from_ptr`] means that it is up to the caller to ensure that the
>>>> +/// `&File` only exists while the reference count is positive.
>>>
>>> What is this used for in binder? If it's not used don't add it.
>>
>> This is used on the boundary between the Rust part of Binder and the
>> binderfs component that is implemented in C. For example:
>
> I see, I'm being foiled by my own code...
>
>>
>> unsafe extern "C" fn rust_binder_open(
>> inode: *mut bindings::inode,
>> file_ptr: *mut bindings::file,
>> ) -> core::ffi::c_int {
>> // SAFETY: The `rust_binderfs.c` file ensures that `i_private` is set to the return value of a
>> // successful call to `rust_binder_new_device`.
>> let ctx = unsafe { Arc::<Context>::borrow((*inode).i_private) };
>>
>> // SAFETY: The caller provides a valid file pointer to a new `struct file`.
>> let file = unsafe { File::from_ptr(file_ptr) };
>
> We need a better name for this helper than from_ptr() imho. I think
> from_ptr() and as_ptr() is odd for C. How weird would it be to call
> that from_raw_file() and as_raw_file()?

That's a reasonable name. I would be happy to rename to that, and I
don't think it is unidiomatic.

> But bigger picture I somewhat struggle with the semantics of this
> because this is not an interface that we have in C and this is really
> about a low-level contract between C and Rust. Specifically this states
> that this pointer is _somehow_ guaranteed valid. And really, this seems
> a bit of a hack.

Indeed ... but I think it's a quite common hack. After all, any time you
dereference a raw pointer in Rust, you are making the same assumption.

> Naively, I think this should probably not be necessary if
> file_operations are properly wrapped. Or it should at least be demotable
> to a purely internal method that can't be called directly or something.

Yes, the usage here of File::from_ptr could probably be hidden inside a
suitably designed file_operations wrapper. The thing is, Rust Binder
doesn't currently use such a wrapper at all. It just exports a global of
type file_operations and the C code in binderfs then references that
global.

Rust Binder used to use such an abstraction, but I ripped it out before
sending the Rust Binder RFC because it didn't actually help. It was
designed for cases where the file system is also implemented in Rust, so
to get it to expose a file_operations global to the C code in binderfs,
I had to reach inside its internal implementation. It did not save me
from doing stuff such as using File::from_ptr from Binder.

Now, you could most likely modify those file_operations abstractions to
support my use-case better. But calling into C is already unsafe, so
unless we get multiple drivers that have a similar C/Rust split, it's
not clear that it's useful to extract the logic from Binder. I would
prefer to wait for the file_operations abstraction to get upstreamed by
the people working on VFS bindings, and then we can decide whether we
should rewrite binderfs into Rust and get rid of the logic, or whether
it's worth to expand the file_operations abstraction to support split
C/Rust drivers like the current binderfs.

> So what I mean is. fdget() may or may not take a reference. The C
> interface internally knows whether a reference is owned or not by
> abusing the lower two bits in a pointer to keep track of that. Naively,
> I would expect the same information to be available to rust so that it's
> clear to Rust wheter it's dealing with an explicitly referenced file or
> an elided-reference file. Maybe that's not possible and I'm not
> well-versed enough to see that yet.

I'm sure Rust can access the same information, but I don't think I'm
currently doing anything that cares about the distinction?

>> let process = match Process::open(ctx, file) {
>> Ok(process) => process,
>> Err(err) => return err.to_errno(),
>> };
>> // SAFETY: This file is associated with Rust binder, so we own the `private_data` field.
>> unsafe {
>> (*file_ptr).private_data = process.into_foreign().cast_mut();
>> }
>> 0
>> }
>>
>> Here, rust_binder_open is the open function in a struct file_operations
>> vtable. In this case, file_ptr is guaranteed by the caller to be valid
>
> Where's the code that wraps struct file_operations?

Please see drivers/android/rust_binder.rs in the binderfs patch [2] in
the Rust Binder RFC.

>> for the duration of the call to rust_binder_open. Binder uses from_ptr
>> to get a &File from the raw pointer.
>>
>> As far as I understand, the caller of rust_binder_open uses fdget to
>> ensure that file_ptr is valid for the duration of the call, but the Rust
>> code doesn't assume that it does this with fdget. As long as file_ptr is
>> valid for the duration of the rust_binder_open call, this use of
>> from_ptr is okay. It will continue to work even if the caller is changed
>> to use fget.
>
> Ok.
>

[1]: https://rust-for-linux.com/klint
[2]: https://lore.kernel.org/rust-for-linux/[email protected]/

Alice

2024-04-02 09:40:19

by Alice Ryhl

[permalink] [raw]
Subject: Re: [PATCH v5 3/9] rust: file: add Rust abstraction for `struct file`

Christian Brauner <[email protected]> wrote:
> On Mon, Apr 01, 2024 at 12:09:08PM +0000, Alice Ryhl wrote:
>> Christian Brauner <[email protected]> wrote:
>>> On Wed, Mar 20, 2024 at 06:09:05PM +0000, Alice Ryhl wrote:
>>>> Christian Brauner <[email protected]> wrote:
>>>>> On Fri, Feb 09, 2024 at 11:18:16AM +0000, Alice Ryhl wrote:
>>>>>> +/// Wraps the kernel's `struct file`.
>>>>>> +///
>>>>>> +/// This represents an open file rather than a file on a filesystem. Processes generally reference
>>>>>> +/// open files using file descriptors. However, file descriptors are not the same as files. A file
>>>>>> +/// descriptor is just an integer that corresponds to a file, and a single file may be referenced
>>>>>> +/// by multiple file descriptors.
>>>>>> +///
>>>>>> +/// # Refcounting
>>>>>> +///
>>>>>> +/// Instances of this type are reference-counted. The reference count is incremented by the
>>>>>> +/// `fget`/`get_file` functions and decremented by `fput`. The Rust type `ARef<File>` represents a
>>>>>> +/// pointer that owns a reference count on the file.
>>>>>> +///
>>>>>> +/// Whenever a process opens a file descriptor (fd), it stores a pointer to the file in its `struct
>>>>>> +/// files_struct`. This pointer owns a reference count to the file, ensuring the file isn't
>>>>>> +/// prematurely deleted while the file descriptor is open. In Rust terminology, the pointers in
>>>>>> +/// `struct files_struct` are `ARef<File>` pointers.
>>>>>> +///
>>>>>> +/// ## Light refcounts
>>>>>> +///
>>>>>> +/// Whenever a process has an fd to a file, it may use something called a "light refcount" as a
>>>>>> +/// performance optimization. Light refcounts are acquired by calling `fdget` and released with
>>>>>> +/// `fdput`. The idea behind light refcounts is that if the fd is not closed between the calls to
>>>>>> +/// `fdget` and `fdput`, then the refcount cannot hit zero during that time, as the `struct
>>>>>> +/// files_struct` holds a reference until the fd is closed. This means that it's safe to access the
>>>>>> +/// file even if `fdget` does not increment the refcount.
>>>>>> +///
>>>>>> +/// The requirement that the fd is not closed during a light refcount applies globally across all
>>>>>> +/// threads - not just on the thread using the light refcount. For this reason, light refcounts are
>>>>>> +/// only used when the `struct files_struct` is not shared with other threads, since this ensures
>>>>>> +/// that other unrelated threads cannot suddenly start using the fd and close it. Therefore,
>>>>>> +/// calling `fdget` on a shared `struct files_struct` creates a normal refcount instead of a light
>>>>>> +/// refcount.
>>>>>
>>>>> When the fdget() calling task doesn't have a shared file descriptor
>>>>> table fdget() will not increment the reference count, yes. This
>>>>> also implies that you cannot have task A use fdget() and then pass
>>>>> f.file to task B that holds on to it while A returns to userspace. It's
>>>>> irrelevant that task A won't drop the reference count or that task B
>>>>> won't drop the reference count. Because task A could return back to
>>>>> userspace and immediately close the fd via a regular close() system call
>>>>> at which point task B has a UAF. In other words a file that has been
>>>>> gotten via fdget() can't be Send to another task without the Send
>>>>> implying taking a reference to it.
>>>>
>>>> That matches my understanding.
>>>>
>>>> I suppose that technically you can still send it to another thread *if* you
>>>> ensure that the current thread waits until that other thread stops using the
>>>> file before returning to userspace.
>>>
>>> _Technically_ yes, but it would be brittle as hell. The problem is that
>>> fdget() _relies_ on being single-threaded for the time that fd is used
>>> until fdput(). There's locking assumptions that build on that e.g., for
>>> concurrent read/write. So no, that shouldn't be allowed.
>>>
>>> Look at how this broke our back when we introduced pidfd_getfd() where
>>> we steal an fd from another task. I have a lengthy explanation how that
>>> can be used to violate our elided-locking which is based on assuming
>>> that we're always single-threaded and the file can't be suddenly shared
>>> with another task. So maybe doable but it would make the semantics even
>>> more intricate.
>>
>> Hmm, the part about elided locking is surprising to me, and may be an
>> issue. Can you give more details on that? Because the current
>
> So what I referred to was that we do have fdget_pos(). Roughly, if
> there's more than one reference on the file then we need to acquire a
> mutex but if it's only a single reference then we can avoid taking the
> mutex because we know that we're the only one that has a reference to
> that file and no one else can acquire one. Whether or not that mutex was
> taken is taken track of in struct fd.
>
> So you can't share a file after fdget_pos() has been called on it and
> you haven't taken the position mutex. So let's say you had:
>
> * Tread A that calls fdget_pos() on file1 and the reference count is
> one. So Thread A doesn't acquire the file position mutex for file1.
> * Now somehow that file1 becomes shared, e.g., Thread B calls fget() on
> it and now Thread B does some operation that requires the file
> position mutex.
> => Thread A and Thread B race on the file position.
>
> So just because you have a reference to a file from somewhere it doesn't
> mean you can just share it with another thread.
>
> So if yo have an arbitrary reference to a file in Rust and that somehow
> can be shared with another thread you risk races here.
>
>> abstractions here *do* actually allow what I described, since we
>> implement Sync for File.
>>
>> I'm not familiar with the pidfd_getfd discussion you are referring to.
>> Do you have a link?
>
> https://lore.kernel.org/linux-fsdevel/[email protected]
>
> pidfd_getfd() can be used to steal a file descriptor from another task.
> It's like a non-cooperative SCM_RIGHTS. That means you can have exactly
> the scenario described above where a file assumed to be non-shared is
> suddenly shared and you have racing reads/writes.
>
> For readdir we nowadays always take the file position mutex because of
> the pidfd_getfd() business because that might corrupt internal state.
>
>>
>> I'm thinking that we may have to provide two different `struct file`
>> wrappers to accurately model this API in Rust. Perhaps they could be
>> called File and LocalFile, where one is marked as thread safe and the
>> other isn't. I can make all LocalFile methods available on File to avoid
>> having to duplicate methods that are available on both.
>
> But isn't that just struct file and struct fd? Ideally we'd stay close
> to something like this.

Right, that kind of naming seems sensible. But I still need to
understand the details a bit better. See below on fdget_pos.

>> But it's not clear to me that this is even enough. Even if we give you a
>> &LocalFile to prevent you from moving it across threads, you can just
>> call File::fget to get an ARef<File> to the same file and then move
>> *that* across threads.
>
> Yes, absolutely.

One of my challenges is that Binder wants to call File::fget,
immediately move it to another thread, and then call fd_install. And
it would be pretty unfortunate if that requires unsafe. But like I argue
below, it seems hard to design a safe API for this in the face of
fdget_pos.

>> This kind of global requirement is not so easy to model. Maybe klint [1]
>> could do it ... atomic context violations are a similar kind of global
>> check. But having klint do it would be far out.
>>
>> Or maybe File::fget should also return a LocalFile?
>>
>> But this raises a different question to me. Let's say process A uses
>> Binder to send its own fd to process B, and the following things happen:
>>
>> 1. Process A enters the ioctl and takes fdget on the fd.
>> 2. Process A calls fget on the same fd to send it to another process.
>> 3. Process A goes to sleep, waiting for process B to respond.
>> 4. Process B receives the message, installs the fd, and returns to userspace.
>> 5. Process B responds to the transaction, but does not close the fd.
>
> The fd just installed in 4. and the fd you're referring to in 5. are
> identical, right? IOW, we're not talking about two different fd (dup)
> for the same file, right?

I'm referring to whatever fd_install does given the `struct file` I got
from fget in step 2.

>> 6a. Process A finishes sleeping, and returns to userspace from the ioctl.
>> 6b. Process B tries to do an operation (e.g. read) on the fd.
>>
>> Let's say that 6a and 6b run in parallel.
>>
>> Could this potentially result in a data race between step 6a and 6b? I'm
>> guessing that step 6a probably doesn't touch any of the code that has
>> elided locking assumptions, so in practice I guess there's not a problem
>> ... but if you make any sort of elided locking assumption as you exit
>> from the ioctl (before reaching the fdput), then it seems to me that you
>> have a problem.
>
> Yes, 6a doesn't touch any code that has elided locking assumptions.
>
> 1'. Process A enters the ioctl and takes fdget() on the fd. Process A
> holds the only reference to that file and the file descriptor table
> isn't shared. Therefore, f_count is left untouched and remains at 1.
> 2'. Process A calls fget() which unconditionally bumps f_count bringing
> it to 2 and sending it another process (Presumably you intend to
> imply that this reference is now owned by the second process.).
> 3'. [as 3.]
> 4'. Process B installs the file into it's file descriptor table
> consuming that reference from 2'. The f_count remains at 2 with the
> reference from 2' now being owned by Process B.
> 5'. Since Process B isn't closing the fd and has just called
> fd_install() it returns to userspace with f_count untouched and
> still at 2.
> 6'a. Process A finishes sleeping and returns to userspace calling
> fdput(). Since the original fdget() was done without bumping the
> reference count the fdput() of Process A will not decrement the
> reference count. So f_count remains at 2.
> 6'b. Process B performs a read/write syscall and calls fdget_pos().
> fdget_pos() sees that this file has f_count > 1 and takes the
> file position mutex.
>
> So this isn't a problem. The problem is when a file becomes shared
> implicitly without the original owner of the file knowing.

Hmm. Yes, but the ioctl code that called fdget doesn't really know that
the ioctl shared the file? So why is it okay?

It really seems like there are two different things going on here. When
it comes to fdget, we only really care about operations that could
remove it from the local file descriptor table, since fdget relies on
the refcount in that table remaining valid until fdput.

On the other hand, for fdget_pos it also matters whether it gets
installed in other file descriptor tables. Threads that reference it
through a different fd table will still access the same position.

And so this means that between fdget/fdput, there's never any problem
with installing the `struct file` into another file descriptor table.
Nothing you can do from that other fd table could cause the local fd
table to drop its refcount on the file. Whereas such an install can be
a problem between fdget_pos/fdput_pos, since that could introduce a race
on the position.

Is this correct?


I was thinking that if we have some sort of File/LocalFile distinction
(or File/Fd), then we may be able to get it to work by limiting what a
File can do. For example, let's say that the only thing you can do with
a File is install it into fd tables, then by the previous logic, there's
no problem with it being safe to move across threads even if there's an
active fdget.

But the fdget_pos kind of throws a wrench into that, because now I can
no longer say "it's always safe to do File::fget, move it to another
thread, and install it into the remote fd table", since that could cause
races on the position if there's an active fdget_pos when we call
File::fget.

>>>>>> +///
>>>>>> +/// Light reference counts must be released with `fdput` before the system call returns to
>>>>>> +/// userspace. This means that if you wait until the current system call returns to userspace, then
>>>>>> +/// all light refcounts that existed at the time have gone away.
>>>>>> +///
>>>>>> +/// ## Rust references
>>>>>> +///
>>>>>> +/// The reference type `&File` is similar to light refcounts:
>>>>>> +///
>>>>>> +/// * `&File` references don't own a reference count. They can only exist as long as the reference
>>>>>> +/// count stays positive, and can only be created when there is some mechanism in place to ensure
>>>>>> +/// this.
>>>>>> +///
>>>>>> +/// * The Rust borrow-checker normally ensures this by enforcing that the `ARef<File>` from which
>>>>>> +/// a `&File` is created outlives the `&File`.
>>>>>
>>>>> The section confuses me a little: Does the borrow-checker always ensure
>>>>> that a &File stays valid or are there circumstances where it doesn't or
>>>>> are you saying it doesn't enforce it?
>>>>
>>>> The borrow-checker always ensures it.
>>>
>>> Ok, thanks.
>>>
>>>>
>>>> A &File is actually short-hand for &'a File, where 'a is some
>>>> unspecified lifetime. We say that &'a File is annotated with 'a. The
>>>> borrow-checker rejects any code that tries to use a reference after the
>>>> end of the lifetime annotated on it.
>>>
>>> Thanks for the explanation.
>>>
>>>>
>>>> So as long as you annotate the reference with a sufficiently short
>>>> lifetime, the borrow checker will prevent UAF. And outside of cases like
>>>
>>> Sorry, but can you explain "sufficiently short lifetime"?
>>
>> By "sufficiently short lifetime" I mean "lifetime that ends before the
>> object is destroyed".
>
> Ah, ok. It sounded like it was a specific concept that Rust is
> implementing in contrast to long-term lifetime or sm. Thanks!
>
>>
>> Idea being that if the lifetime ends before the object is freed, and the
>> borrow-checker rejects attempts to use it after the lifetime ends, then
>> it follows that the borrow-checker prevents use-after-frees.
>>
>>>> from_ptr, the borrow-checker also takes care of ensuring that the
>>>> lifetimes are sufficiently short.
>>>>
>>>> (Technically &'a File and &'b File are two completely different types,
>>>> so &File is technically a class of types and not a single type. Rust
>>>> will automatically convert &'long File to &'short File.)
>>>>
>>>>>> +///
>>>>>> +/// * Using the unsafe [`File::from_ptr`] means that it is up to the caller to ensure that the
>>>>>> +/// `&File` only exists while the reference count is positive.
>>>>>
>>>>> What is this used for in binder? If it's not used don't add it.
>>>>
>>>> This is used on the boundary between the Rust part of Binder and the
>>>> binderfs component that is implemented in C. For example:
>>>
>>> I see, I'm being foiled by my own code...
>>>
>>>>
>>>> unsafe extern "C" fn rust_binder_open(
>>>> inode: *mut bindings::inode,
>>>> file_ptr: *mut bindings::file,
>>>> ) -> core::ffi::c_int {
>>>> // SAFETY: The `rust_binderfs.c` file ensures that `i_private` is set to the return value of a
>>>> // successful call to `rust_binder_new_device`.
>>>> let ctx = unsafe { Arc::<Context>::borrow((*inode).i_private) };
>>>>
>>>> // SAFETY: The caller provides a valid file pointer to a new `struct file`.
>>>> let file = unsafe { File::from_ptr(file_ptr) };
>>>
>>> We need a better name for this helper than from_ptr() imho. I think
>>> from_ptr() and as_ptr() is odd for C. How weird would it be to call
>>> that from_raw_file() and as_raw_file()?
>>
>> That's a reasonable name. I would be happy to rename to that, and I
>> don't think it is unidiomatic.
>
> Thanks!
>
>>
>>> But bigger picture I somewhat struggle with the semantics of this
>>> because this is not an interface that we have in C and this is really
>>> about a low-level contract between C and Rust. Specifically this states
>>> that this pointer is _somehow_ guaranteed valid. And really, this seems
>>> a bit of a hack.
>>
>> Indeed ... but I think it's a quite common hack. After all, any time you
>> dereference a raw pointer in Rust, you are making the same assumption.
>>
>>> Naively, I think this should probably not be necessary if
>>> file_operations are properly wrapped. Or it should at least be demotable
>>> to a purely internal method that can't be called directly or something.
>>
>> Yes, the usage here of File::from_ptr could probably be hidden inside a
>> suitably designed file_operations wrapper. The thing is, Rust Binder
>> doesn't currently use such a wrapper at all. It just exports a global of
>> type file_operations and the C code in binderfs then references that
>> global.
>
> Yeah.
>
>>
>> Rust Binder used to use such an abstraction, but I ripped it out before
>> sending the Rust Binder RFC because it didn't actually help. It was
>> designed for cases where the file system is also implemented in Rust, so
>> to get it to expose a file_operations global to the C code in binderfs,
>> I had to reach inside its internal implementation. It did not save me
>> from doing stuff such as using File::from_ptr from Binder.
>>
>> Now, you could most likely modify those file_operations abstractions to
>> support my use-case better. But calling into C is already unsafe, so
>> unless we get multiple drivers that have a similar C/Rust split, it's
>> not clear that it's useful to extract the logic from Binder. I would
>> prefer to wait for the file_operations abstraction to get upstreamed by
>> the people working on VFS bindings, and then we can decide whether we
>> should rewrite binderfs into Rust and get rid of the logic, or whether
>> it's worth to expand the file_operations abstraction to support split
>> C/Rust drivers like the current binderfs.
>>
>>> So what I mean is. fdget() may or may not take a reference. The C
>>> interface internally knows whether a reference is owned or not by
>>> abusing the lower two bits in a pointer to keep track of that. Naively,
>>> I would expect the same information to be available to rust so that it's
>>> clear to Rust wheter it's dealing with an explicitly referenced file or
>>> an elided-reference file. Maybe that's not possible and I'm not
>>> well-versed enough to see that yet.
>>
>> I'm sure Rust can access the same information, but I don't think I'm
>> currently doing anything that cares about the distinction?
>
> Ok. My main goal is that we end up with an almost 1:1 correspondence
> between the Rust and C interface so it's easy for current maintainers
> and developers that don't want to care about Rust to continue to do so
> and also just somewhat verify that changes they do are sane.

Sure, that goal makes total sense to me.

Alice

2024-04-03 06:02:50

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v5 3/9] rust: file: add Rust abstraction for `struct file`

On Tue, Apr 02, 2024 at 09:39:57AM +0000, Alice Ryhl wrote:
> Christian Brauner <[email protected]> wrote:
> > On Mon, Apr 01, 2024 at 12:09:08PM +0000, Alice Ryhl wrote:
> >> Christian Brauner <[email protected]> wrote:
> >>> On Wed, Mar 20, 2024 at 06:09:05PM +0000, Alice Ryhl wrote:
> >>>> Christian Brauner <[email protected]> wrote:
> >>>>> On Fri, Feb 09, 2024 at 11:18:16AM +0000, Alice Ryhl wrote:
> >>>>>> +/// Wraps the kernel's `struct file`.
> >>>>>> +///
> >>>>>> +/// This represents an open file rather than a file on a filesystem. Processes generally reference
> >>>>>> +/// open files using file descriptors. However, file descriptors are not the same as files. A file
> >>>>>> +/// descriptor is just an integer that corresponds to a file, and a single file may be referenced
> >>>>>> +/// by multiple file descriptors.
> >>>>>> +///
> >>>>>> +/// # Refcounting
> >>>>>> +///
> >>>>>> +/// Instances of this type are reference-counted. The reference count is incremented by the
> >>>>>> +/// `fget`/`get_file` functions and decremented by `fput`. The Rust type `ARef<File>` represents a
> >>>>>> +/// pointer that owns a reference count on the file.
> >>>>>> +///
> >>>>>> +/// Whenever a process opens a file descriptor (fd), it stores a pointer to the file in its `struct
> >>>>>> +/// files_struct`. This pointer owns a reference count to the file, ensuring the file isn't
> >>>>>> +/// prematurely deleted while the file descriptor is open. In Rust terminology, the pointers in
> >>>>>> +/// `struct files_struct` are `ARef<File>` pointers.
> >>>>>> +///
> >>>>>> +/// ## Light refcounts
> >>>>>> +///
> >>>>>> +/// Whenever a process has an fd to a file, it may use something called a "light refcount" as a
> >>>>>> +/// performance optimization. Light refcounts are acquired by calling `fdget` and released with
> >>>>>> +/// `fdput`. The idea behind light refcounts is that if the fd is not closed between the calls to
> >>>>>> +/// `fdget` and `fdput`, then the refcount cannot hit zero during that time, as the `struct
> >>>>>> +/// files_struct` holds a reference until the fd is closed. This means that it's safe to access the
> >>>>>> +/// file even if `fdget` does not increment the refcount.
> >>>>>> +///
> >>>>>> +/// The requirement that the fd is not closed during a light refcount applies globally across all
> >>>>>> +/// threads - not just on the thread using the light refcount. For this reason, light refcounts are
> >>>>>> +/// only used when the `struct files_struct` is not shared with other threads, since this ensures
> >>>>>> +/// that other unrelated threads cannot suddenly start using the fd and close it. Therefore,
> >>>>>> +/// calling `fdget` on a shared `struct files_struct` creates a normal refcount instead of a light
> >>>>>> +/// refcount.
> >>>>>
> >>>>> When the fdget() calling task doesn't have a shared file descriptor
> >>>>> table fdget() will not increment the reference count, yes. This
> >>>>> also implies that you cannot have task A use fdget() and then pass
> >>>>> f.file to task B that holds on to it while A returns to userspace. It's
> >>>>> irrelevant that task A won't drop the reference count or that task B
> >>>>> won't drop the reference count. Because task A could return back to
> >>>>> userspace and immediately close the fd via a regular close() system call
> >>>>> at which point task B has a UAF. In other words a file that has been
> >>>>> gotten via fdget() can't be Send to another task without the Send
> >>>>> implying taking a reference to it.
> >>>>
> >>>> That matches my understanding.
> >>>>
> >>>> I suppose that technically you can still send it to another thread *if* you
> >>>> ensure that the current thread waits until that other thread stops using the
> >>>> file before returning to userspace.
> >>>
> >>> _Technically_ yes, but it would be brittle as hell. The problem is that
> >>> fdget() _relies_ on being single-threaded for the time that fd is used
> >>> until fdput(). There's locking assumptions that build on that e.g., for
> >>> concurrent read/write. So no, that shouldn't be allowed.
> >>>
> >>> Look at how this broke our back when we introduced pidfd_getfd() where
> >>> we steal an fd from another task. I have a lengthy explanation how that
> >>> can be used to violate our elided-locking which is based on assuming
> >>> that we're always single-threaded and the file can't be suddenly shared
> >>> with another task. So maybe doable but it would make the semantics even
> >>> more intricate.
> >>
> >> Hmm, the part about elided locking is surprising to me, and may be an
> >> issue. Can you give more details on that? Because the current
> >
> > So what I referred to was that we do have fdget_pos(). Roughly, if
> > there's more than one reference on the file then we need to acquire a
> > mutex but if it's only a single reference then we can avoid taking the
> > mutex because we know that we're the only one that has a reference to
> > that file and no one else can acquire one. Whether or not that mutex was
> > taken is taken track of in struct fd.
> >
> > So you can't share a file after fdget_pos() has been called on it and
> > you haven't taken the position mutex. So let's say you had:
> >
> > * Tread A that calls fdget_pos() on file1 and the reference count is
> > one. So Thread A doesn't acquire the file position mutex for file1.
> > * Now somehow that file1 becomes shared, e.g., Thread B calls fget() on
> > it and now Thread B does some operation that requires the file
> > position mutex.
> > => Thread A and Thread B race on the file position.
> >
> > So just because you have a reference to a file from somewhere it doesn't
> > mean you can just share it with another thread.
> >
> > So if yo have an arbitrary reference to a file in Rust and that somehow
> > can be shared with another thread you risk races here.
> >
> >> abstractions here *do* actually allow what I described, since we
> >> implement Sync for File.
> >>
> >> I'm not familiar with the pidfd_getfd discussion you are referring to.
> >> Do you have a link?
> >
> > https://lore.kernel.org/linux-fsdevel/[email protected]
> >
> > pidfd_getfd() can be used to steal a file descriptor from another task.
> > It's like a non-cooperative SCM_RIGHTS. That means you can have exactly
> > the scenario described above where a file assumed to be non-shared is
> > suddenly shared and you have racing reads/writes.
> >
> > For readdir we nowadays always take the file position mutex because of
> > the pidfd_getfd() business because that might corrupt internal state.
> >
> >>
> >> I'm thinking that we may have to provide two different `struct file`
> >> wrappers to accurately model this API in Rust. Perhaps they could be
> >> called File and LocalFile, where one is marked as thread safe and the
> >> other isn't. I can make all LocalFile methods available on File to avoid
> >> having to duplicate methods that are available on both.
> >
> > But isn't that just struct file and struct fd? Ideally we'd stay close
> > to something like this.
>
> Right, that kind of naming seems sensible. But I still need to
> understand the details a bit better. See below on fdget_pos.
>
> >> But it's not clear to me that this is even enough. Even if we give you a
> >> &LocalFile to prevent you from moving it across threads, you can just
> >> call File::fget to get an ARef<File> to the same file and then move
> >> *that* across threads.
> >
> > Yes, absolutely.
>
> One of my challenges is that Binder wants to call File::fget,
> immediately move it to another thread, and then call fd_install. And
> it would be pretty unfortunate if that requires unsafe. But like I argue
> below, it seems hard to design a safe API for this in the face of
> fdget_pos.
>
> >> This kind of global requirement is not so easy to model. Maybe klint [1]
> >> could do it ... atomic context violations are a similar kind of global
> >> check. But having klint do it would be far out.
> >>
> >> Or maybe File::fget should also return a LocalFile?
> >>
> >> But this raises a different question to me. Let's say process A uses
> >> Binder to send its own fd to process B, and the following things happen:
> >>
> >> 1. Process A enters the ioctl and takes fdget on the fd.
> >> 2. Process A calls fget on the same fd to send it to another process.
> >> 3. Process A goes to sleep, waiting for process B to respond.
> >> 4. Process B receives the message, installs the fd, and returns to userspace.
> >> 5. Process B responds to the transaction, but does not close the fd.
> >
> > The fd just installed in 4. and the fd you're referring to in 5. are
> > identical, right? IOW, we're not talking about two different fd (dup)
> > for the same file, right?
>
> I'm referring to whatever fd_install does given the `struct file` I got
> from fget in step 2.
>
> >> 6a. Process A finishes sleeping, and returns to userspace from the ioctl.
> >> 6b. Process B tries to do an operation (e.g. read) on the fd.
> >>
> >> Let's say that 6a and 6b run in parallel.
> >>
> >> Could this potentially result in a data race between step 6a and 6b? I'm
> >> guessing that step 6a probably doesn't touch any of the code that has
> >> elided locking assumptions, so in practice I guess there's not a problem
> >> ... but if you make any sort of elided locking assumption as you exit
> >> from the ioctl (before reaching the fdput), then it seems to me that you
> >> have a problem.
> >
> > Yes, 6a doesn't touch any code that has elided locking assumptions.
> >
> > 1'. Process A enters the ioctl and takes fdget() on the fd. Process A
> > holds the only reference to that file and the file descriptor table
> > isn't shared. Therefore, f_count is left untouched and remains at 1.
> > 2'. Process A calls fget() which unconditionally bumps f_count bringing
> > it to 2 and sending it another process (Presumably you intend to
> > imply that this reference is now owned by the second process.).
> > 3'. [as 3.]
> > 4'. Process B installs the file into it's file descriptor table
> > consuming that reference from 2'. The f_count remains at 2 with the
> > reference from 2' now being owned by Process B.
> > 5'. Since Process B isn't closing the fd and has just called
> > fd_install() it returns to userspace with f_count untouched and
> > still at 2.
> > 6'a. Process A finishes sleeping and returns to userspace calling
> > fdput(). Since the original fdget() was done without bumping the
> > reference count the fdput() of Process A will not decrement the
> > reference count. So f_count remains at 2.
> > 6'b. Process B performs a read/write syscall and calls fdget_pos().
> > fdget_pos() sees that this file has f_count > 1 and takes the
> > file position mutex.
> >
> > So this isn't a problem. The problem is when a file becomes shared
> > implicitly without the original owner of the file knowing.
>
> Hmm. Yes, but the ioctl code that called fdget doesn't really know that
> the ioctl shared the file? So why is it okay?

Why does it matter to the ioctl() code itself? The ioctl() code never
calls fdget_pos().

>
> It really seems like there are two different things going on here. When
> it comes to fdget, we only really care about operations that could
> remove it from the local file descriptor table, since fdget relies on
> the refcount in that table remaining valid until fdput.

Yes.

>
> On the other hand, for fdget_pos it also matters whether it gets
> installed in other file descriptor tables. Threads that reference it
> through a different fd table will still access the same position.

Yes, they operate on the same f_pos.

>
> And so this means that between fdget/fdput, there's never any problem
> with installing the `struct file` into another file descriptor table.
> Nothing you can do from that other fd table could cause the local fd
> table to drop its refcount on the file. Whereas such an install can be
> a problem between fdget_pos/fdput_pos, since that could introduce a race
> on the position.
>
> Is this correct?

Yes, but that would imply you're sharing and installing a file into a
file descriptor table from a read/write/seek codepath. I don't see how
this can happen without something like e.g., pidfd_getfd(). And the
fd_install()ing task would then have to go back to userspace and issue a
concurrent read/write/seek system call while the other thread is still
reading/writing.

Overall, we really only care about f_pos consistency because posix
requires atomicity between reads/writes/seeks. For pidfd_getfd() where
such sharing can happen non-cooperatively we just don't care as we've
just declared this to be an instance where we're outside of posix
guarantees. And for readdir() we unconditionally acquire the mutex.

I think io_uring is racing on f_pos as well under certain circumstances
(REQ_F_CUR_POS?) as they don't use fdget_pos() at all. But iirc Jens
dislikes that they ever allowed that.

>
> I was thinking that if we have some sort of File/LocalFile distinction
> (or File/Fd), then we may be able to get it to work by limiting what a
> File can do. For example, let's say that the only thing you can do with
> a File is install it into fd tables, then by the previous logic, there's
> no problem with it being safe to move across threads even if there's an
> active fdget.
>
> But the fdget_pos kind of throws a wrench into that, because now I can
> no longer say "it's always safe to do File::fget, move it to another
> thread, and install it into the remote fd table", since that could cause
> races on the position if there's an active fdget_pos when we call
> File::fget.

I think I understand why that's a problem for you but let me try and
spell it out so I can understand where you're coming from:

You want the Rust compiler to reject a file becoming shared implicitly
from any codepath that is beneath fdget_pos() even though no such code
yet exists (ignoring pidfd_getfd()). So it's a correctness issue to you.

>
> >>>>>> +///
> >>>>>> +/// Light reference counts must be released with `fdput` before the system call returns to
> >>>>>> +/// userspace. This means that if you wait until the current system call returns to userspace, then
> >>>>>> +/// all light refcounts that existed at the time have gone away.
> >>>>>> +///
> >>>>>> +/// ## Rust references
> >>>>>> +///
> >>>>>> +/// The reference type `&File` is similar to light refcounts:
> >>>>>> +///
> >>>>>> +/// * `&File` references don't own a reference count. They can only exist as long as the reference
> >>>>>> +/// count stays positive, and can only be created when there is some mechanism in place to ensure
> >>>>>> +/// this.
> >>>>>> +///
> >>>>>> +/// * The Rust borrow-checker normally ensures this by enforcing that the `ARef<File>` from which
> >>>>>> +/// a `&File` is created outlives the `&File`.
> >>>>>
> >>>>> The section confuses me a little: Does the borrow-checker always ensure
> >>>>> that a &File stays valid or are there circumstances where it doesn't or
> >>>>> are you saying it doesn't enforce it?
> >>>>
> >>>> The borrow-checker always ensures it.
> >>>
> >>> Ok, thanks.
> >>>
> >>>>
> >>>> A &File is actually short-hand for &'a File, where 'a is some
> >>>> unspecified lifetime. We say that &'a File is annotated with 'a. The
> >>>> borrow-checker rejects any code that tries to use a reference after the
> >>>> end of the lifetime annotated on it.
> >>>
> >>> Thanks for the explanation.
> >>>
> >>>>
> >>>> So as long as you annotate the reference with a sufficiently short
> >>>> lifetime, the borrow checker will prevent UAF. And outside of cases like
> >>>
> >>> Sorry, but can you explain "sufficiently short lifetime"?
> >>
> >> By "sufficiently short lifetime" I mean "lifetime that ends before the
> >> object is destroyed".
> >
> > Ah, ok. It sounded like it was a specific concept that Rust is
> > implementing in contrast to long-term lifetime or sm. Thanks!
> >
> >>
> >> Idea being that if the lifetime ends before the object is freed, and the
> >> borrow-checker rejects attempts to use it after the lifetime ends, then
> >> it follows that the borrow-checker prevents use-after-frees.
> >>
> >>>> from_ptr, the borrow-checker also takes care of ensuring that the
> >>>> lifetimes are sufficiently short.
> >>>>
> >>>> (Technically &'a File and &'b File are two completely different types,
> >>>> so &File is technically a class of types and not a single type. Rust
> >>>> will automatically convert &'long File to &'short File.)
> >>>>
> >>>>>> +///
> >>>>>> +/// * Using the unsafe [`File::from_ptr`] means that it is up to the caller to ensure that the
> >>>>>> +/// `&File` only exists while the reference count is positive.
> >>>>>
> >>>>> What is this used for in binder? If it's not used don't add it.
> >>>>
> >>>> This is used on the boundary between the Rust part of Binder and the
> >>>> binderfs component that is implemented in C. For example:
> >>>
> >>> I see, I'm being foiled by my own code...
> >>>
> >>>>
> >>>> unsafe extern "C" fn rust_binder_open(
> >>>> inode: *mut bindings::inode,
> >>>> file_ptr: *mut bindings::file,
> >>>> ) -> core::ffi::c_int {
> >>>> // SAFETY: The `rust_binderfs.c` file ensures that `i_private` is set to the return value of a
> >>>> // successful call to `rust_binder_new_device`.
> >>>> let ctx = unsafe { Arc::<Context>::borrow((*inode).i_private) };
> >>>>
> >>>> // SAFETY: The caller provides a valid file pointer to a new `struct file`.
> >>>> let file = unsafe { File::from_ptr(file_ptr) };
> >>>
> >>> We need a better name for this helper than from_ptr() imho. I think
> >>> from_ptr() and as_ptr() is odd for C. How weird would it be to call
> >>> that from_raw_file() and as_raw_file()?
> >>
> >> That's a reasonable name. I would be happy to rename to that, and I
> >> don't think it is unidiomatic.
> >
> > Thanks!
> >
> >>
> >>> But bigger picture I somewhat struggle with the semantics of this
> >>> because this is not an interface that we have in C and this is really
> >>> about a low-level contract between C and Rust. Specifically this states
> >>> that this pointer is _somehow_ guaranteed valid. And really, this seems
> >>> a bit of a hack.
> >>
> >> Indeed ... but I think it's a quite common hack. After all, any time you
> >> dereference a raw pointer in Rust, you are making the same assumption.
> >>
> >>> Naively, I think this should probably not be necessary if
> >>> file_operations are properly wrapped. Or it should at least be demotable
> >>> to a purely internal method that can't be called directly or something.
> >>
> >> Yes, the usage here of File::from_ptr could probably be hidden inside a
> >> suitably designed file_operations wrapper. The thing is, Rust Binder
> >> doesn't currently use such a wrapper at all. It just exports a global of
> >> type file_operations and the C code in binderfs then references that
> >> global.
> >
> > Yeah.
> >
> >>
> >> Rust Binder used to use such an abstraction, but I ripped it out before
> >> sending the Rust Binder RFC because it didn't actually help. It was
> >> designed for cases where the file system is also implemented in Rust, so
> >> to get it to expose a file_operations global to the C code in binderfs,
> >> I had to reach inside its internal implementation. It did not save me
> >> from doing stuff such as using File::from_ptr from Binder.
> >>
> >> Now, you could most likely modify those file_operations abstractions to
> >> support my use-case better. But calling into C is already unsafe, so
> >> unless we get multiple drivers that have a similar C/Rust split, it's
> >> not clear that it's useful to extract the logic from Binder. I would
> >> prefer to wait for the file_operations abstraction to get upstreamed by
> >> the people working on VFS bindings, and then we can decide whether we
> >> should rewrite binderfs into Rust and get rid of the logic, or whether
> >> it's worth to expand the file_operations abstraction to support split
> >> C/Rust drivers like the current binderfs.
> >>
> >>> So what I mean is. fdget() may or may not take a reference. The C
> >>> interface internally knows whether a reference is owned or not by
> >>> abusing the lower two bits in a pointer to keep track of that. Naively,
> >>> I would expect the same information to be available to rust so that it's
> >>> clear to Rust wheter it's dealing with an explicitly referenced file or
> >>> an elided-reference file. Maybe that's not possible and I'm not
> >>> well-versed enough to see that yet.
> >>
> >> I'm sure Rust can access the same information, but I don't think I'm
> >> currently doing anything that cares about the distinction?
> >
> > Ok. My main goal is that we end up with an almost 1:1 correspondence
> > between the Rust and C interface so it's easy for current maintainers
> > and developers that don't want to care about Rust to continue to do so
> > and also just somewhat verify that changes they do are sane.
>
> Sure, that goal makes total sense to me.
>
> Alice

2024-04-03 11:34:03

by Alice Ryhl

[permalink] [raw]
Subject: Re: [PATCH v5 3/9] rust: file: add Rust abstraction for `struct file`

On Wed, Apr 3, 2024 at 8:02 AM Christian Brauner <[email protected]> wrote:
>
> On Tue, Apr 02, 2024 at 09:39:57AM +0000, Alice Ryhl wrote:
> > Christian Brauner <[email protected]> wrote:
> > > On Mon, Apr 01, 2024 at 12:09:08PM +0000, Alice Ryhl wrote:
> > >> Christian Brauner <[email protected]> wrote:
> > >>> On Wed, Mar 20, 2024 at 06:09:05PM +0000, Alice Ryhl wrote:
> > >>>> Christian Brauner <[email protected]> wrote:
> > >>>>> On Fri, Feb 09, 2024 at 11:18:16AM +0000, Alice Ryhl wrote:
> > >>>>>> +/// Wraps the kernel's `struct file`.
> > >>>>>> +///
> > >>>>>> +/// This represents an open file rather than a file on a filesystem. Processes generally reference
> > >>>>>> +/// open files using file descriptors. However, file descriptors are not the same as files. A file
> > >>>>>> +/// descriptor is just an integer that corresponds to a file, and a single file may be referenced
> > >>>>>> +/// by multiple file descriptors.
> > >>>>>> +///
> > >>>>>> +/// # Refcounting
> > >>>>>> +///
> > >>>>>> +/// Instances of this type are reference-counted. The reference count is incremented by the
> > >>>>>> +/// `fget`/`get_file` functions and decremented by `fput`. The Rust type `ARef<File>` represents a
> > >>>>>> +/// pointer that owns a reference count on the file.
> > >>>>>> +///
> > >>>>>> +/// Whenever a process opens a file descriptor (fd), it stores a pointer to the file in its `struct
> > >>>>>> +/// files_struct`. This pointer owns a reference count to the file, ensuring the file isn't
> > >>>>>> +/// prematurely deleted while the file descriptor is open. In Rust terminology, the pointers in
> > >>>>>> +/// `struct files_struct` are `ARef<File>` pointers.
> > >>>>>> +///
> > >>>>>> +/// ## Light refcounts
> > >>>>>> +///
> > >>>>>> +/// Whenever a process has an fd to a file, it may use something called a "light refcount" as a
> > >>>>>> +/// performance optimization. Light refcounts are acquired by calling `fdget` and released with
> > >>>>>> +/// `fdput`. The idea behind light refcounts is that if the fd is not closed between the calls to
> > >>>>>> +/// `fdget` and `fdput`, then the refcount cannot hit zero during that time, as the `struct
> > >>>>>> +/// files_struct` holds a reference until the fd is closed. This means that it's safe to access the
> > >>>>>> +/// file even if `fdget` does not increment the refcount.
> > >>>>>> +///
> > >>>>>> +/// The requirement that the fd is not closed during a light refcount applies globally across all
> > >>>>>> +/// threads - not just on the thread using the light refcount. For this reason, light refcounts are
> > >>>>>> +/// only used when the `struct files_struct` is not shared with other threads, since this ensures
> > >>>>>> +/// that other unrelated threads cannot suddenly start using the fd and close it. Therefore,
> > >>>>>> +/// calling `fdget` on a shared `struct files_struct` creates a normal refcount instead of a light
> > >>>>>> +/// refcount.
> > >>>>>
> > >>>>> When the fdget() calling task doesn't have a shared file descriptor
> > >>>>> table fdget() will not increment the reference count, yes. This
> > >>>>> also implies that you cannot have task A use fdget() and then pass
> > >>>>> f.file to task B that holds on to it while A returns to userspace It's
> > >>>>> irrelevant that task A won't drop the reference count or that task B
> > >>>>> won't drop the reference count. Because task A could return back to
> > >>>>> userspace and immediately close the fd via a regular close() system call
> > >>>>> at which point task B has a UAF. In other words a file that has been
> > >>>>> gotten via fdget() can't be Send to another task without the Send
> > >>>>> implying taking a reference to it.
> > >>>>
> > >>>> That matches my understanding.
> > >>>>
> > >>>> I suppose that technically you can still send it to another thread *if* you
> > >>>> ensure that the current thread waits until that other thread stops using the
> > >>>> file before returning to userspace.
> > >>>
> > >>> _Technically_ yes, but it would be brittle as hell. The problem is that
> > >>> fdget() _relies_ on being single-threaded for the time that fd is used
> > >>> until fdput(). There's locking assumptions that build on that e.g., for
> > >>> concurrent read/write. So no, that shouldn't be allowed.
> > >>>
> > >>> Look at how this broke our back when we introduced pidfd_getfd() where
> > >>> we steal an fd from another task. I have a lengthy explanation how that
> > >>> can be used to violate our elided-locking which is based on assuming
> > >>> that we're always single-threaded and the file can't be suddenly shared
> > >>> with another task. So maybe doable but it would make the semantics even
> > >>> more intricate.
> > >>
> > >> Hmm, the part about elided locking is surprising to me, and may be an
> > >> issue. Can you give more details on that? Because the current
> > >
> > > So what I referred to was that we do have fdget_pos(). Roughly, if
> > > there's more than one reference on the file then we need to acquire a
> > > mutex but if it's only a single reference then we can avoid taking the
> > > mutex because we know that we're the only one that has a reference to
> > > that file and no one else can acquire one. Whether or not that mutex was
> > > taken is taken track of in struct fd.
> > >
> > > So you can't share a file after fdget_pos() has been called on it and
> > > you haven't taken the position mutex. So let's say you had:
> > >
> > > * Tread A that calls fdget_pos() on file1 and the reference count is
> > > one. So Thread A doesn't acquire the file position mutex for file1.
> > > * Now somehow that file1 becomes shared, e.g., Thread B calls fget() on
> > > it and now Thread B does some operation that requires the file
> > > position mutex.
> > > => Thread A and Thread B race on the file position.
> > >
> > > So just because you have a reference to a file from somewhere it doesn't
> > > mean you can just share it with another thread.
> > >
> > > So if yo have an arbitrary reference to a file in Rust and that somehow
> > > can be shared with another thread you risk races here.
> > >
> > >> abstractions here *do* actually allow what I described, since we
> > >> implement Sync for File.
> > >>
> > >> I'm not familiar with the pidfd_getfd discussion you are referring to.
> > >> Do you have a link?
> > >
> > > https://lore.kernel.org/linux-fsdevel/[email protected]
> > >
> > > pidfd_getfd() can be used to steal a file descriptor from another task.
> > > It's like a non-cooperative SCM_RIGHTS. That means you can have exactly
> > > the scenario described above where a file assumed to be non-shared is
> > > suddenly shared and you have racing reads/writes.
> > >
> > > For readdir we nowadays always take the file position mutex because of
> > > the pidfd_getfd() business because that might corrupt internal state.
> > >
> > >>
> > >> I'm thinking that we may have to provide two different `struct file`
> > >> wrappers to accurately model this API in Rust. Perhaps they could be
> > >> called File and LocalFile, where one is marked as thread safe and the
> > >> other isn't. I can make all LocalFile methods available on File to avoid
> > >> having to duplicate methods that are available on both.
> > >
> > > But isn't that just struct file and struct fd? Ideally we'd stay close
> > > to something like this.
> >
> > Right, that kind of naming seems sensible. But I still need to
> > understand the details a bit better. See below on fdget_pos.
> >
> > >> But it's not clear to me that this is even enough. Even if we give you a
> > >> &LocalFile to prevent you from moving it across threads, you can just
> > >> call File::fget to get an ARef<File> to the same file and then move
> > >> *that* across threads.
> > >
> > > Yes, absolutely.
> >
> > One of my challenges is that Binder wants to call File::fget,
> > immediately move it to another thread, and then call fd_install. And
> > it would be pretty unfortunate if that requires unsafe. But like I argue
> > below, it seems hard to design a safe API for this in the face of
> > fdget_pos.
> >
> > >> This kind of global requirement is not so easy to model. Maybe klint [1]
> > >> could do it ... atomic context violations are a similar kind of global
> > >> check. But having klint do it would be far out.
> > >>
> > >> Or maybe File::fget should also return a LocalFile?
> > >>
> > >> But this raises a different question to me. Let's say process A uses
> > >> Binder to send its own fd to process B, and the following things happen:
> > >>
> > >> 1. Process A enters the ioctl and takes fdget on the fd.
> > >> 2. Process A calls fget on the same fd to send it to another process.
> > >> 3. Process A goes to sleep, waiting for process B to respond.
> > >> 4. Process B receives the message, installs the fd, and returns to userspace.
> > >> 5. Process B responds to the transaction, but does not close the fd.
> > >
> > > The fd just installed in 4. and the fd you're referring to in 5. are
> > > identical, right? IOW, we're not talking about two different fd (dup)
> > > for the same file, right?
> >
> > I'm referring to whatever fd_install does given the `struct file` I got
> > from fget in step 2.
> >
> > >> 6a. Process A finishes sleeping, and returns to userspace from the ioctl.
> > >> 6b. Process B tries to do an operation (e.g. read) on the fd.
> > >>
> > >> Let's say that 6a and 6b run in parallel.
> > >>
> > >> Could this potentially result in a data race between step 6a and 6b? I'm
> > >> guessing that step 6a probably doesn't touch any of the code that has
> > >> elided locking assumptions, so in practice I guess there's not a problem
> > >> ... but if you make any sort of elided locking assumption as you exit
> > >> from the ioctl (before reaching the fdput), then it seems to me that you
> > >> have a problem.
> > >
> > > Yes, 6a doesn't touch any code that has elided locking assumptions.
> > >
> > > 1'. Process A enters the ioctl and takes fdget() on the fd. Process A
> > > holds the only reference to that file and the file descriptor table
> > > isn't shared. Therefore, f_count is left untouched and remains at 1.
> > > 2'. Process A calls fget() which unconditionally bumps f_count bringing
> > > it to 2 and sending it another process (Presumably you intend to
> > > imply that this reference is now owned by the second process.).
> > > 3'. [as 3.]
> > > 4'. Process B installs the file into it's file descriptor table
> > > consuming that reference from 2'. The f_count remains at 2 with the
> > > reference from 2' now being owned by Process B.
> > > 5'. Since Process B isn't closing the fd and has just called
> > > fd_install() it returns to userspace with f_count untouched and
> > > still at 2.
> > > 6'a. Process A finishes sleeping and returns to userspace calling
> > > fdput(). Since the original fdget() was done without bumping the
> > > reference count the fdput() of Process A will not decrement the
> > > reference count. So f_count remains at 2.
> > > 6'b. Process B performs a read/write syscall and calls fdget_pos().
> > > fdget_pos() sees that this file has f_count > 1 and takes the
> > > file position mutex.
> > >
> > > So this isn't a problem. The problem is when a file becomes shared
> > > implicitly without the original owner of the file knowing.
> >
> > Hmm. Yes, but the ioctl code that called fdget doesn't really know that
> > the ioctl shared the file? So why is it okay?
>
> Why does it matter to the ioctl() code itself? The ioctl() code never
> calls fdget_pos().
>
> >
> > It really seems like there are two different things going on here. When
> > it comes to fdget, we only really care about operations that could
> > remove it from the local file descriptor table, since fdget relies on
> > the refcount in that table remaining valid until fdput.
>
> Yes.
>
> >
> > On the other hand, for fdget_pos it also matters whether it gets
> > installed in other file descriptor tables. Threads that reference it
> > through a different fd table will still access the same position.
>
> Yes, they operate on the same f_pos.
>
> >
> > And so this means that between fdget/fdput, there's never any problem
> > with installing the `struct file` into another file descriptor table.
> > Nothing you can do from that other fd table could cause the local fd
> > table to drop its refcount on the file. Whereas such an install can be
> > a problem between fdget_pos/fdput_pos, since that could introduce a race
> > on the position.
> >
> > Is this correct?
>
> Yes, but that would imply you're sharing and installing a file into a
> file descriptor table from a read/write/seek codepath. I don't see how
> this can happen without something like e.g., pidfd_getfd(). And the
> fd_install()ing task would then have to go back to userspace and issue a
> concurrent read/write/seek system call while the other thread is still
> reading/writing.
>
> Overall, we really only care about f_pos consistency because posix
> requires atomicity between reads/writes/seeks. For pidfd_getfd() where
> such sharing can happen non-cooperatively we just don't care as we've
> just declared this to be an instance where we're outside of posix
> guarantees. And for readdir() we unconditionally acquire the mutex.
>
> I think io_uring is racing on f_pos as well under certain circumstances
> (REQ_F_CUR_POS?) as they don't use fdget_pos() at all. But iirc Jens
> dislikes that they ever allowed that.
>
> >
> > I was thinking that if we have some sort of File/LocalFile distinction
> > (or File/Fd), then we may be able to get it to work by limiting what a
> > File can do. For example, let's say that the only thing you can do with
> > a File is install it into fd tables, then by the previous logic, there's
> > no problem with it being safe to move across threads even if there's an
> > active fdget.
> >
> > But the fdget_pos kind of throws a wrench into that, because now I can
> > no longer say "it's always safe to do File::fget, move it to another
> > thread, and install it into the remote fd table", since that could cause
> > races on the position if there's an active fdget_pos when we call
> > File::fget.
>
> I think I understand why that's a problem for you but let me try and
> spell it out so I can understand where you're coming from:
>
> You want the Rust compiler to reject a file becoming shared implicitly
> from any codepath that is beneath fdget_pos() even though no such code
> yet exists (ignoring pidfd_getfd()). So it's a correctness issue to you.

Yes, exactly. One of the design principles behind Rust is that if an
API allows you to trigger memory unsafety, then it *must* be the case
that you called an unsafe function somewhere, and that this call
violated the documented safety requirements for that unsafe function.
So if the File API provides a safe interface that you can use to
trigger memory unsafety, then that's a correctness issue for me.

Thanks for asking this. I should have clarified this previously.

> > >>>>>> +///
> > >>>>>> +/// Light reference counts must be released with `fdput` before the system call returns to
> > >>>>>> +/// userspace. This means that if you wait until the current system call returns to userspace, then
> > >>>>>> +/// all light refcounts that existed at the time have gone away.
> > >>>>>> +///
> > >>>>>> +/// ## Rust references
> > >>>>>> +///
> > >>>>>> +/// The reference type `&File` is similar to light refcounts:
> > >>>>>> +///
> > >>>>>> +/// * `&File` references don't own a reference count. They can only exist as long as the reference
> > >>>>>> +/// count stays positive, and can only be created when there is some mechanism in place to ensure
> > >>>>>> +/// this.
> > >>>>>> +///
> > >>>>>> +/// * The Rust borrow-checker normally ensures this by enforcing that the `ARef<File>` from which
> > >>>>>> +/// a `&File` is created outlives the `&File`.
> > >>>>>
> > >>>>> The section confuses me a little: Does the borrow-checker always ensure
> > >>>>> that a &File stays valid or are there circumstances where it doesn't or
> > >>>>> are you saying it doesn't enforce it?
> > >>>>
> > >>>> The borrow-checker always ensures it.
> > >>>
> > >>> Ok, thanks.
> > >>>
> > >>>>
> > >>>> A &File is actually short-hand for &'a File, where 'a is some
> > >>>> unspecified lifetime. We say that &'a File is annotated with 'a. The
> > >>>> borrow-checker rejects any code that tries to use a reference after the
> > >>>> end of the lifetime annotated on it.
> > >>>
> > >>> Thanks for the explanation.
> > >>>
> > >>>>
> > >>>> So as long as you annotate the reference with a sufficiently short
> > >>>> lifetime, the borrow checker will prevent UAF. And outside of cases like
> > >>>
> > >>> Sorry, but can you explain "sufficiently short lifetime"?
> > >>
> > >> By "sufficiently short lifetime" I mean "lifetime that ends before the
> > >> object is destroyed".
> > >
> > > Ah, ok. It sounded like it was a specific concept that Rust is
> > > implementing in contrast to long-term lifetime or sm. Thanks!
> > >
> > >>
> > >> Idea being that if the lifetime ends before the object is freed, and the
> > >> borrow-checker rejects attempts to use it after the lifetime ends, then
> > >> it follows that the borrow-checker prevents use-after-frees.
> > >>
> > >>>> from_ptr, the borrow-checker also takes care of ensuring that the
> > >>>> lifetimes are sufficiently short.
> > >>>>
> > >>>> (Technically &'a File and &'b File are two completely different types,
> > >>>> so &File is technically a class of types and not a single type. Rust
> > >>>> will automatically convert &'long File to &'short File.)
> > >>>>
> > >>>>>> +///
> > >>>>>> +/// * Using the unsafe [`File::from_ptr`] means that it is up to the caller to ensure that the
> > >>>>>> +/// `&File` only exists while the reference count is positive.
> > >>>>>
> > >>>>> What is this used for in binder? If it's not used don't add it.
> > >>>>
> > >>>> This is used on the boundary between the Rust part of Binder and the
> > >>>> binderfs component that is implemented in C. For example:
> > >>>
> > >>> I see, I'm being foiled by my own code...
> > >>>
> > >>>>
> > >>>> unsafe extern "C" fn rust_binder_open(
> > >>>> inode: *mut bindings::inode,
> > >>>> file_ptr: *mut bindings::file,
> > >>>> ) -> core::ffi::c_int {
> > >>>> // SAFETY: The `rust_binderfs.c` file ensures that `i_private` is set to the return value of a
> > >>>> // successful call to `rust_binder_new_device`.
> > >>>> let ctx = unsafe { Arc::<Context>::borrow((*inode).i_private) };
> > >>>>
> > >>>> // SAFETY: The caller provides a valid file pointer to a new `struct file`.
> > >>>> let file = unsafe { File::from_ptr(file_ptr) };
> > >>>
> > >>> We need a better name for this helper than from_ptr() imho. I think
> > >>> from_ptr() and as_ptr() is odd for C. How weird would it be to call
> > >>> that from_raw_file() and as_raw_file()?
> > >>
> > >> That's a reasonable name. I would be happy to rename to that, and I
> > >> don't think it is unidiomatic.
> > >
> > > Thanks!
> > >
> > >>
> > >>> But bigger picture I somewhat struggle with the semantics of this
> > >>> because this is not an interface that we have in C and this is really
> > >>> about a low-level contract between C and Rust. Specifically this states
> > >>> that this pointer is _somehow_ guaranteed valid. And really, this seems
> > >>> a bit of a hack.
> > >>
> > >> Indeed ... but I think it's a quite common hack. After all, any time you
> > >> dereference a raw pointer in Rust, you are making the same assumption.
> > >>
> > >>> Naively, I think this should probably not be necessary if
> > >>> file_operations are properly wrapped. Or it should at least be demotable
> > >>> to a purely internal method that can't be called directly or something.
> > >>
> > >> Yes, the usage here of File::from_ptr could probably be hidden inside a
> > >> suitably designed file_operations wrapper. The thing is, Rust Binder
> > >> doesn't currently use such a wrapper at all. It just exports a global of
> > >> type file_operations and the C code in binderfs then references that
> > >> global.
> > >
> > > Yeah.
> > >
> > >>
> > >> Rust Binder used to use such an abstraction, but I ripped it out before
> > >> sending the Rust Binder RFC because it didn't actually help. It was
> > >> designed for cases where the file system is also implemented in Rust, so
> > >> to get it to expose a file_operations global to the C code in binderfs,
> > >> I had to reach inside its internal implementation. It did not save me
> > >> from doing stuff such as using File::from_ptr from Binder.
> > >>
> > >> Now, you could most likely modify those file_operations abstractions to
> > >> support my use-case better. But calling into C is already unsafe, so
> > >> unless we get multiple drivers that have a similar C/Rust split, it's
> > >> not clear that it's useful to extract the logic from Binder. I would
> > >> prefer to wait for the file_operations abstraction to get upstreamed by
> > >> the people working on VFS bindings, and then we can decide whether we
> > >> should rewrite binderfs into Rust and get rid of the logic, or whether
> > >> it's worth to expand the file_operations abstraction to support split
> > >> C/Rust drivers like the current binderfs.
> > >>
> > >>> So what I mean is. fdget() may or may not take a reference. The C
> > >>> interface internally knows whether a reference is owned or not by
> > >>> abusing the lower two bits in a pointer to keep track of that. Naively,
> > >>> I would expect the same information to be available to rust so that it's
> > >>> clear to Rust wheter it's dealing with an explicitly referenced file or
> > >>> an elided-reference file. Maybe that's not possible and I'm not
> > >>> well-versed enough to see that yet.
> > >>
> > >> I'm sure Rust can access the same information, but I don't think I'm
> > >> currently doing anything that cares about the distinction?
> > >
> > > Ok. My main goal is that we end up with an almost 1:1 correspondence
> > > between the Rust and C interface so it's easy for current maintainers
> > > and developers that don't want to care about Rust to continue to do so
> > > and also just somewhat verify that changes they do are sane.
> >
> > Sure, that goal makes total sense to me.
> >
> > Alice

2024-04-08 07:48:24

by Alice Ryhl

[permalink] [raw]
Subject: Re: [PATCH v5 3/9] rust: file: add Rust abstraction for `struct file`

Christian Brauner <[email protected]> wrote:
> On Tue, Apr 02, 2024 at 09:39:57AM +0000, Alice Ryhl wrote:
>> Christian Brauner <[email protected]> wrote:
>>> On Mon, Apr 01, 2024 at 12:09:08PM +0000, Alice Ryhl wrote:
>>>> Christian Brauner <[email protected]> wrote:
>>>>> On Wed, Mar 20, 2024 at 06:09:05PM +0000, Alice Ryhl wrote:
>>>>>> Christian Brauner <[email protected]> wrote:
>>>>>>> On Fri, Feb 09, 2024 at 11:18:16AM +0000, Alice Ryhl wrote:
>>>>>>>> +/// Wraps the kernel's `struct file`.
>>>>>>>> +///
>>>>>>>> +/// This represents an open file rather than a file on a filesystem. Processes generally reference
>>>>>>>> +/// open files using file descriptors. However, file descriptors are not the same as files. A file
>>>>>>>> +/// descriptor is just an integer that corresponds to a file, and a single file may be referenced
>>>>>>>> +/// by multiple file descriptors.
>>>>>>>> +///
>>>>>>>> +/// # Refcounting
>>>>>>>> +///
>>>>>>>> +/// Instances of this type are reference-counted. The reference count is incremented by the
>>>>>>>> +/// `fget`/`get_file` functions and decremented by `fput`. The Rust type `ARef<File>` represents a
>>>>>>>> +/// pointer that owns a reference count on the file.
>>>>>>>> +///
>>>>>>>> +/// Whenever a process opens a file descriptor (fd), it stores a pointer to the file in its `struct
>>>>>>>> +/// files_struct`. This pointer owns a reference count to the file, ensuring the file isn't
>>>>>>>> +/// prematurely deleted while the file descriptor is open. In Rust terminology, the pointers in
>>>>>>>> +/// `struct files_struct` are `ARef<File>` pointers.
>>>>>>>> +///
>>>>>>>> +/// ## Light refcounts
>>>>>>>> +///
>>>>>>>> +/// Whenever a process has an fd to a file, it may use something called a "light refcount" as a
>>>>>>>> +/// performance optimization. Light refcounts are acquired by calling `fdget` and released with
>>>>>>>> +/// `fdput`. The idea behind light refcounts is that if the fd is not closed between the calls to
>>>>>>>> +/// `fdget` and `fdput`, then the refcount cannot hit zero during that time, as the `struct
>>>>>>>> +/// files_struct` holds a reference until the fd is closed. This means that it's safe to access the
>>>>>>>> +/// file even if `fdget` does not increment the refcount.
>>>>>>>> +///
>>>>>>>> +/// The requirement that the fd is not closed during a light refcount applies globally across all
>>>>>>>> +/// threads - not just on the thread using the light refcount. For this reason, light refcounts are
>>>>>>>> +/// only used when the `struct files_struct` is not shared with other threads, since this ensures
>>>>>>>> +/// that other unrelated threads cannot suddenly start using the fd and close it. Therefore,
>>>>>>>> +/// calling `fdget` on a shared `struct files_struct` creates a normal refcount instead of a light
>>>>>>>> +/// refcount.
>>>>>>>
>>>>>>> When the fdget() calling task doesn't have a shared file descriptor
>>>>>>> table fdget() will not increment the reference count, yes. This
>>>>>>> also implies that you cannot have task A use fdget() and then pass
>>>>>>> f.file to task B that holds on to it while A returns to userspace. It's
>>>>>>> irrelevant that task A won't drop the reference count or that task B
>>>>>>> won't drop the reference count. Because task A could return back to
>>>>>>> userspace and immediately close the fd via a regular close() system call
>>>>>>> at which point task B has a UAF. In other words a file that has been
>>>>>>> gotten via fdget() can't be Send to another task without the Send
>>>>>>> implying taking a reference to it.
>>>>>>
>>>>>> That matches my understanding.
>>>>>>
>>>>>> I suppose that technically you can still send it to another thread *if* you
>>>>>> ensure that the current thread waits until that other thread stops using the
>>>>>> file before returning to userspace.
>>>>>
>>>>> _Technically_ yes, but it would be brittle as hell. The problem is that
>>>>> fdget() _relies_ on being single-threaded for the time that fd is used
>>>>> until fdput(). There's locking assumptions that build on that e.g., for
>>>>> concurrent read/write. So no, that shouldn't be allowed.
>>>>>
>>>>> Look at how this broke our back when we introduced pidfd_getfd() where
>>>>> we steal an fd from another task. I have a lengthy explanation how that
>>>>> can be used to violate our elided-locking which is based on assuming
>>>>> that we're always single-threaded and the file can't be suddenly shared
>>>>> with another task. So maybe doable but it would make the semantics even
>>>>> more intricate.
>>>>
>>>> Hmm, the part about elided locking is surprising to me, and may be an
>>>> issue. Can you give more details on that? Because the current
>>>
>>> So what I referred to was that we do have fdget_pos(). Roughly, if
>>> there's more than one reference on the file then we need to acquire a
>>> mutex but if it's only a single reference then we can avoid taking the
>>> mutex because we know that we're the only one that has a reference to
>>> that file and no one else can acquire one. Whether or not that mutex was
>>> taken is taken track of in struct fd.
>>>
>>> So you can't share a file after fdget_pos() has been called on it and
>>> you haven't taken the position mutex. So let's say you had:
>>>
>>> * Tread A that calls fdget_pos() on file1 and the reference count is
>>> one. So Thread A doesn't acquire the file position mutex for file1.
>>> * Now somehow that file1 becomes shared, e.g., Thread B calls fget() on
>>> it and now Thread B does some operation that requires the file
>>> position mutex.
>>> => Thread A and Thread B race on the file position.
>>>
>>> So just because you have a reference to a file from somewhere it doesn't
>>> mean you can just share it with another thread.
>>>
>>> So if yo have an arbitrary reference to a file in Rust and that somehow
>>> can be shared with another thread you risk races here.
>>>
>>>> abstractions here *do* actually allow what I described, since we
>>>> implement Sync for File.
>>>>
>>>> I'm not familiar with the pidfd_getfd discussion you are referring to.
>>>> Do you have a link?
>>>
>>> https://lore.kernel.org/linux-fsdevel/[email protected]
>>>
>>> pidfd_getfd() can be used to steal a file descriptor from another task.
>>> It's like a non-cooperative SCM_RIGHTS. That means you can have exactly
>>> the scenario described above where a file assumed to be non-shared is
>>> suddenly shared and you have racing reads/writes.
>>>
>>> For readdir we nowadays always take the file position mutex because of
>>> the pidfd_getfd() business because that might corrupt internal state.
>>>
>>>>
>>>> I'm thinking that we may have to provide two different `struct file`
>>>> wrappers to accurately model this API in Rust. Perhaps they could be
>>>> called File and LocalFile, where one is marked as thread safe and the
>>>> other isn't. I can make all LocalFile methods available on File to avoid
>>>> having to duplicate methods that are available on both.
>>>
>>> But isn't that just struct file and struct fd? Ideally we'd stay close
>>> to something like this.
>>
>> Right, that kind of naming seems sensible. But I still need to
>> understand the details a bit better. See below on fdget_pos.
>>
>>>> But it's not clear to me that this is even enough. Even if we give you a
>>>> &LocalFile to prevent you from moving it across threads, you can just
>>>> call File::fget to get an ARef<File> to the same file and then move
>>>> *that* across threads.
>>>
>>> Yes, absolutely.
>>
>> One of my challenges is that Binder wants to call File::fget,
>> immediately move it to another thread, and then call fd_install. And
>> it would be pretty unfortunate if that requires unsafe. But like I argue
>> below, it seems hard to design a safe API for this in the face of
>> fdget_pos.
>>
>>>> This kind of global requirement is not so easy to model. Maybe klint [1]
>>>> could do it ... atomic context violations are a similar kind of global
>>>> check. But having klint do it would be far out.
>>>>
>>>> Or maybe File::fget should also return a LocalFile?
>>>>
>>>> But this raises a different question to me. Let's say process A uses
>>>> Binder to send its own fd to process B, and the following things happen:
>>>>
>>>> 1. Process A enters the ioctl and takes fdget on the fd.
>>>> 2. Process A calls fget on the same fd to send it to another process.
>>>> 3. Process A goes to sleep, waiting for process B to respond.
>>>> 4. Process B receives the message, installs the fd, and returns to userspace.
>>>> 5. Process B responds to the transaction, but does not close the fd.
>>>
>>> The fd just installed in 4. and the fd you're referring to in 5. are
>>> identical, right? IOW, we're not talking about two different fd (dup)
>>> for the same file, right?
>>
>> I'm referring to whatever fd_install does given the `struct file` I got
>> from fget in step 2.
>>
>>>> 6a. Process A finishes sleeping, and returns to userspace from the ioctl.
>>>> 6b. Process B tries to do an operation (e.g. read) on the fd.
>>>>
>>>> Let's say that 6a and 6b run in parallel.
>>>>
>>>> Could this potentially result in a data race between step 6a and 6b? I'm
>>>> guessing that step 6a probably doesn't touch any of the code that has
>>>> elided locking assumptions, so in practice I guess there's not a problem
>>>> ... but if you make any sort of elided locking assumption as you exit
>>>> from the ioctl (before reaching the fdput), then it seems to me that you
>>>> have a problem.
>>>
>>> Yes, 6a doesn't touch any code that has elided locking assumptions.
>>>
>>> 1'. Process A enters the ioctl and takes fdget() on the fd. Process A
>>> holds the only reference to that file and the file descriptor table
>>> isn't shared. Therefore, f_count is left untouched and remains at 1.
>>> 2'. Process A calls fget() which unconditionally bumps f_count bringing
>>> it to 2 and sending it another process (Presumably you intend to
>>> imply that this reference is now owned by the second process.).
>>> 3'. [as 3.]
>>> 4'. Process B installs the file into it's file descriptor table
>>> consuming that reference from 2'. The f_count remains at 2 with the
>>> reference from 2' now being owned by Process B.
>>> 5'. Since Process B isn't closing the fd and has just called
>>> fd_install() it returns to userspace with f_count untouched and
>>> still at 2.
>>> 6'a. Process A finishes sleeping and returns to userspace calling
>>> fdput(). Since the original fdget() was done without bumping the
>>> reference count the fdput() of Process A will not decrement the
>>> reference count. So f_count remains at 2.
>>> 6'b. Process B performs a read/write syscall and calls fdget_pos().
>>> fdget_pos() sees that this file has f_count > 1 and takes the
>>> file position mutex.
>>>
>>> So this isn't a problem. The problem is when a file becomes shared
>>> implicitly without the original owner of the file knowing.
>>
>> Hmm. Yes, but the ioctl code that called fdget doesn't really know that
>> the ioctl shared the file? So why is it okay?
>
> Why does it matter to the ioctl() code itself? The ioctl() code never
> calls fdget_pos().
>
>>
>> It really seems like there are two different things going on here. When
>> it comes to fdget, we only really care about operations that could
>> remove it from the local file descriptor table, since fdget relies on
>> the refcount in that table remaining valid until fdput.
>
> Yes.
>
>>
>> On the other hand, for fdget_pos it also matters whether it gets
>> installed in other file descriptor tables. Threads that reference it
>> through a different fd table will still access the same position.
>
> Yes, they operate on the same f_pos.
>
>>
>> And so this means that between fdget/fdput, there's never any problem
>> with installing the `struct file` into another file descriptor table.
>> Nothing you can do from that other fd table could cause the local fd
>> table to drop its refcount on the file. Whereas such an install can be
>> a problem between fdget_pos/fdput_pos, since that could introduce a race
>> on the position.
>>
>> Is this correct?
>
> Yes, but that would imply you're sharing and installing a file into a
> file descriptor table from a read/write/seek codepath. I don't see how
> this can happen without something like e.g., pidfd_getfd(). And the
> fd_install()ing task would then have to go back to userspace and issue a
> concurrent read/write/seek system call while the other thread is still
> reading/writing.
>
> Overall, we really only care about f_pos consistency because posix
> requires atomicity between reads/writes/seeks. For pidfd_getfd() where
> such sharing can happen non-cooperatively we just don't care as we've
> just declared this to be an instance where we're outside of posix
> guarantees. And for readdir() we unconditionally acquire the mutex.
>
> I think io_uring is racing on f_pos as well under certain circumstances
> (REQ_F_CUR_POS?) as they don't use fdget_pos() at all. But iirc Jens
> dislikes that they ever allowed that.
>
>>
>> I was thinking that if we have some sort of File/LocalFile distinction
>> (or File/Fd), then we may be able to get it to work by limiting what a
>> File can do. For example, let's say that the only thing you can do with
>> a File is install it into fd tables, then by the previous logic, there's
>> no problem with it being safe to move across threads even if there's an
>> active fdget.
>>
>> But the fdget_pos kind of throws a wrench into that, because now I can
>> no longer say "it's always safe to do File::fget, move it to another
>> thread, and install it into the remote fd table", since that could cause
>> races on the position if there's an active fdget_pos when we call
>> File::fget.
>
> I think I understand why that's a problem for you but let me try and
> spell it out so I can understand where you're coming from:
>
> You want the Rust compiler to reject a file becoming shared implicitly
> from any codepath that is beneath fdget_pos() even though no such code
> yet exists (ignoring pidfd_getfd()). So it's a correctness issue to you.

Hi Christian,

I thought some more about this. Here's an idea of how we can encode this
knowledge in the API. It seems like there are a few different things we
might know about a pointer:

1. The pointer is fully shared, that is, all active fdget / fdget_pos
references (if any) actually incremented the refcount / took the
lock.

2. The pointer may have active fdget references that didn't increment
the refcount, but all fdget_pos references (if any) incremented the
refcount and took the lock.

3. The pointer may have active fdget or fdget_pos references that didn't
increment the refcount / didn't take the lock.

And orthogonal from that, all of the fdget / fdget_pos references are on
a single thread, so we either know that we are on the same thread as
them, or we don't know that. (Though we don't care about this
distinction in the fully shared case.)

I might encode this information with five different "states":

* FullyShared - fully shared
* MaybeFdget - might have fdget, not necessarily on this thread
* MaybeLocalFdget - might have fdget, but if so, it's on this thread
* MaybeFdgetPos - might have fdget_pos, not necessarily on this thread
* MaybeLocalFdgetPos - might have fdget_pos, but if so, it's on this thread

And you can make this a parameter on File, so you get pointer types like
these:

* ARef<File<MaybeFdget>> - A pointer to a file that owns a refcount and
might have an active fdget, possibly on a different thread.

* &File<FullyShared> - A reference to a file that doesn't own a
refcount, and which is fully shared.

* ARef<File<MaybeLocalFdgetPos>> - A pointer to a file that owns a
refcount, and may have both active fdget and fdget_pos calls. If it
does, they're on the same thread as this reference. This is the return
type of fget.

And if we introduce an Fdget smart pointer too, then it would probably
be typed Fdget<MaybeLocalFdget> or Fdget<MaybeLocalFdgetPos>.

You could do different things with these. For example, a MaybeFdget
reference can be used with fd_install, but a MaybeFdgetPos reference
can't. Similarly, you can close an fd using a FullyShared reference, but
not using any of the weaker kinds.

As for how you get each kind, well, the fget method would return an
MaybeLocalFdgetPos reference, since we know that we're on the right
thread if any fdget/fdget_pos references exist, but we don't otherwise
know anything.

And you could use something similar to the DeferredFdCloser to take a
MaybeLocalFdgetPos or MaybeLocalFdget file reference any upgrade it to a
FullyShared one by waiting until we return to userspace. (But you can't
do this with the non-local variants.)

When we write a Rust wrapper around file_operations that internally uses
File::from_raw_file, then we can have it pass a &File<MaybeLocalFdget>
to the ioctl handler, and a &File<MaybeLocalFdgetPos> to the
read/write/seek handlers. That way, the ioctl handler can do things with
the provided file that don't work when there's an fdget_pos, but the
read/write/seek handlers can't.

Or we can have unsafe methods that do upgrades. For example, the
codepath in Binder that sends an fd from one process to another will
most likely just call fget and then use unsafe to say "I know there are
no active fdget_pos references since I'm in an ioctl" to upgrade the
MaybeLocalFdgetPos it got from fget into an MaybeFdget (not FullyShared
since the ioctl uses fdget), that it sends to the other process and then
fd_installs.

Does this sound sane to you?

Alice