by Alice Ryhl

[permalink] [raw]

Subject: Re: [PATCH v5 3/9] rust: file: add Rust abstraction for `struct file`

On Wed, Apr 3, 2024 at 8:02 AM Christian Brauner <[email protected]> wrote:
>
> On Tue, Apr 02, 2024 at 09:39:57AM +0000, Alice Ryhl wrote:
> > Christian Brauner <[email protected]> wrote:
> > > On Mon, Apr 01, 2024 at 12:09:08PM +0000, Alice Ryhl wrote:
> > >> Christian Brauner <[email protected]> wrote:
> > >>> On Wed, Mar 20, 2024 at 06:09:05PM +0000, Alice Ryhl wrote:
> > >>>> Christian Brauner <[email protected]> wrote:
> > >>>>> On Fri, Feb 09, 2024 at 11:18:16AM +0000, Alice Ryhl wrote:
> > >>>>>> +/// Wraps the kernel's `struct file`.
> > >>>>>> +///
> > >>>>>> +/// This represents an open file rather than a file on a filesystem. Processes generally reference
> > >>>>>> +/// open files using file descriptors. However, file descriptors are not the same as files. A file
> > >>>>>> +/// descriptor is just an integer that corresponds to a file, and a single file may be referenced
> > >>>>>> +/// by multiple file descriptors.
> > >>>>>> +///
> > >>>>>> +/// # Refcounting
> > >>>>>> +///
> > >>>>>> +/// Instances of this type are reference-counted. The reference count is incremented by the
> > >>>>>> +/// `fget`/`get_file` functions and decremented by `fput`. The Rust type `ARef<File>` represents a
> > >>>>>> +/// pointer that owns a reference count on the file.
> > >>>>>> +///
> > >>>>>> +/// Whenever a process opens a file descriptor (fd), it stores a pointer to the file in its `struct
> > >>>>>> +/// files_struct`. This pointer owns a reference count to the file, ensuring the file isn't
> > >>>>>> +/// prematurely deleted while the file descriptor is open. In Rust terminology, the pointers in
> > >>>>>> +/// `struct files_struct` are `ARef<File>` pointers.
> > >>>>>> +///
> > >>>>>> +/// ## Light refcounts
> > >>>>>> +///
> > >>>>>> +/// Whenever a process has an fd to a file, it may use something called a "light refcount" as a
> > >>>>>> +/// performance optimization. Light refcounts are acquired by calling `fdget` and released with
> > >>>>>> +/// `fdput`. The idea behind light refcounts is that if the fd is not closed between the calls to
> > >>>>>> +/// `fdget` and `fdput`, then the refcount cannot hit zero during that time, as the `struct
> > >>>>>> +/// files_struct` holds a reference until the fd is closed. This means that it's safe to access the
> > >>>>>> +/// file even if `fdget` does not increment the refcount.
> > >>>>>> +///
> > >>>>>> +/// The requirement that the fd is not closed during a light refcount applies globally across all
> > >>>>>> +/// threads - not just on the thread using the light refcount. For this reason, light refcounts are
> > >>>>>> +/// only used when the `struct files_struct` is not shared with other threads, since this ensures
> > >>>>>> +/// that other unrelated threads cannot suddenly start using the fd and close it. Therefore,
> > >>>>>> +/// calling `fdget` on a shared `struct files_struct` creates a normal refcount instead of a light
> > >>>>>> +/// refcount.
> > >>>>>
> > >>>>> When the fdget() calling task doesn't have a shared file descriptor
> > >>>>> table fdget() will not increment the reference count, yes. This
> > >>>>> also implies that you cannot have task A use fdget() and then pass
> > >>>>> f.file to task B that holds on to it while A returns to userspace It's
> > >>>>> irrelevant that task A won't drop the reference count or that task B
> > >>>>> won't drop the reference count. Because task A could return back to
> > >>>>> userspace and immediately close the fd via a regular close() system call
> > >>>>> at which point task B has a UAF. In other words a file that has been
> > >>>>> gotten via fdget() can't be Send to another task without the Send
> > >>>>> implying taking a reference to it.
> > >>>>
> > >>>> That matches my understanding.
> > >>>>
> > >>>> I suppose that technically you can still send it to another thread *if* you
> > >>>> ensure that the current thread waits until that other thread stops using the
> > >>>> file before returning to userspace.
> > >>>
> > >>> _Technically_ yes, but it would be brittle as hell. The problem is that
> > >>> fdget() _relies_ on being single-threaded for the time that fd is used
> > >>> until fdput(). There's locking assumptions that build on that e.g., for
> > >>> concurrent read/write. So no, that shouldn't be allowed.
> > >>>
> > >>> Look at how this broke our back when we introduced pidfd_getfd() where
> > >>> we steal an fd from another task. I have a lengthy explanation how that
> > >>> can be used to violate our elided-locking which is based on assuming
> > >>> that we're always single-threaded and the file can't be suddenly shared
> > >>> with another task. So maybe doable but it would make the semantics even
> > >>> more intricate.
> > >>
> > >> Hmm, the part about elided locking is surprising to me, and may be an
> > >> issue. Can you give more details on that? Because the current
> > >
> > > So what I referred to was that we do have fdget_pos(). Roughly, if
> > > there's more than one reference on the file then we need to acquire a
> > > mutex but if it's only a single reference then we can avoid taking the
> > > mutex because we know that we're the only one that has a reference to
> > > that file and no one else can acquire one. Whether or not that mutex was
> > > taken is taken track of in struct fd.
> > >
> > > So you can't share a file after fdget_pos() has been called on it and
> > > you haven't taken the position mutex. So let's say you had:
> > >
> > > * Tread A that calls fdget_pos() on file1 and the reference count is
> > > one. So Thread A doesn't acquire the file position mutex for file1.
> > > * Now somehow that file1 becomes shared, e.g., Thread B calls fget() on
> > > it and now Thread B does some operation that requires the file
> > > position mutex.
> > > => Thread A and Thread B race on the file position.
> > >
> > > So just because you have a reference to a file from somewhere it doesn't
> > > mean you can just share it with another thread.
> > >
> > > So if yo have an arbitrary reference to a file in Rust and that somehow
> > > can be shared with another thread you risk races here.
> > >
> > >> abstractions here *do* actually allow what I described, since we
> > >> implement Sync for File.
> > >>
> > >> I'm not familiar with the pidfd_getfd discussion you are referring to.
> > >> Do you have a link?
> > >
> > > https://lore.kernel.org/linux-fsdevel/[email protected]
> > >
> > > pidfd_getfd() can be used to steal a file descriptor from another task.
> > > It's like a non-cooperative SCM_RIGHTS. That means you can have exactly
> > > the scenario described above where a file assumed to be non-shared is
> > > suddenly shared and you have racing reads/writes.
> > >
> > > For readdir we nowadays always take the file position mutex because of
> > > the pidfd_getfd() business because that might corrupt internal state.
> > >
> > >>
> > >> I'm thinking that we may have to provide two different `struct file`
> > >> wrappers to accurately model this API in Rust. Perhaps they could be
> > >> called File and LocalFile, where one is marked as thread safe and the
> > >> other isn't. I can make all LocalFile methods available on File to avoid
> > >> having to duplicate methods that are available on both.
> > >
> > > But isn't that just struct file and struct fd? Ideally we'd stay close
> > > to something like this.
> >
> > Right, that kind of naming seems sensible. But I still need to
> > understand the details a bit better. See below on fdget_pos.
> >
> > >> But it's not clear to me that this is even enough. Even if we give you a
> > >> &LocalFile to prevent you from moving it across threads, you can just
> > >> call File::fget to get an ARef<File> to the same file and then move
> > >> *that* across threads.
> > >
> > > Yes, absolutely.
> >
> > One of my challenges is that Binder wants to call File::fget,
> > immediately move it to another thread, and then call fd_install. And
> > it would be pretty unfortunate if that requires unsafe. But like I argue
> > below, it seems hard to design a safe API for this in the face of
> > fdget_pos.
> >
> > >> This kind of global requirement is not so easy to model. Maybe klint [1]
> > >> could do it ... atomic context violations are a similar kind of global
> > >> check. But having klint do it would be far out.
> > >>
> > >> Or maybe File::fget should also return a LocalFile?
> > >>
> > >> But this raises a different question to me. Let's say process A uses
> > >> Binder to send its own fd to process B, and the following things happen:
> > >>
> > >> 1. Process A enters the ioctl and takes fdget on the fd.
> > >> 2. Process A calls fget on the same fd to send it to another process.
> > >> 3. Process A goes to sleep, waiting for process B to respond.
> > >> 4. Process B receives the message, installs the fd, and returns to userspace.
> > >> 5. Process B responds to the transaction, but does not close the fd.
> > >
> > > The fd just installed in 4. and the fd you're referring to in 5. are
> > > identical, right? IOW, we're not talking about two different fd (dup)
> > > for the same file, right?
> >
> > I'm referring to whatever fd_install does given the `struct file` I got
> > from fget in step 2.
> >
> > >> 6a. Process A finishes sleeping, and returns to userspace from the ioctl.
> > >> 6b. Process B tries to do an operation (e.g. read) on the fd.
> > >>
> > >> Let's say that 6a and 6b run in parallel.
> > >>
> > >> Could this potentially result in a data race between step 6a and 6b? I'm
> > >> guessing that step 6a probably doesn't touch any of the code that has
> > >> elided locking assumptions, so in practice I guess there's not a problem
> > >> ... but if you make any sort of elided locking assumption as you exit
> > >> from the ioctl (before reaching the fdput), then it seems to me that you
> > >> have a problem.
> > >
> > > Yes, 6a doesn't touch any code that has elided locking assumptions.
> > >
> > > 1'. Process A enters the ioctl and takes fdget() on the fd. Process A
> > > holds the only reference to that file and the file descriptor table
> > > isn't shared. Therefore, f_count is left untouched and remains at 1.
> > > 2'. Process A calls fget() which unconditionally bumps f_count bringing
> > > it to 2 and sending it another process (Presumably you intend to
> > > imply that this reference is now owned by the second process.).
> > > 3'. [as 3.]
> > > 4'. Process B installs the file into it's file descriptor table
> > > consuming that reference from 2'. The f_count remains at 2 with the
> > > reference from 2' now being owned by Process B.
> > > 5'. Since Process B isn't closing the fd and has just called
> > > fd_install() it returns to userspace with f_count untouched and
> > > still at 2.
> > > 6'a. Process A finishes sleeping and returns to userspace calling
> > > fdput(). Since the original fdget() was done without bumping the
> > > reference count the fdput() of Process A will not decrement the
> > > reference count. So f_count remains at 2.
> > > 6'b. Process B performs a read/write syscall and calls fdget_pos().
> > > fdget_pos() sees that this file has f_count > 1 and takes the
> > > file position mutex.
> > >
> > > So this isn't a problem. The problem is when a file becomes shared
> > > implicitly without the original owner of the file knowing.
> >
> > Hmm. Yes, but the ioctl code that called fdget doesn't really know that
> > the ioctl shared the file? So why is it okay?
>
> Why does it matter to the ioctl() code itself? The ioctl() code never
> calls fdget_pos().
>
> >
> > It really seems like there are two different things going on here. When
> > it comes to fdget, we only really care about operations that could
> > remove it from the local file descriptor table, since fdget relies on
> > the refcount in that table remaining valid until fdput.
>
> Yes.
>
> >
> > On the other hand, for fdget_pos it also matters whether it gets
> > installed in other file descriptor tables. Threads that reference it
> > through a different fd table will still access the same position.
>
> Yes, they operate on the same f_pos.
>
> >
> > And so this means that between fdget/fdput, there's never any problem
> > with installing the `struct file` into another file descriptor table.
> > Nothing you can do from that other fd table could cause the local fd
> > table to drop its refcount on the file. Whereas such an install can be
> > a problem between fdget_pos/fdput_pos, since that could introduce a race
> > on the position.
> >
> > Is this correct?
>
> Yes, but that would imply you're sharing and installing a file into a
> file descriptor table from a read/write/seek codepath. I don't see how
> this can happen without something like e.g., pidfd_getfd(). And the
> fd_install()ing task would then have to go back to userspace and issue a
> concurrent read/write/seek system call while the other thread is still
> reading/writing.
>
> Overall, we really only care about f_pos consistency because posix
> requires atomicity between reads/writes/seeks. For pidfd_getfd() where
> such sharing can happen non-cooperatively we just don't care as we've
> just declared this to be an instance where we're outside of posix
> guarantees. And for readdir() we unconditionally acquire the mutex.
>
> I think io_uring is racing on f_pos as well under certain circumstances
> (REQ_F_CUR_POS?) as they don't use fdget_pos() at all. But iirc Jens
> dislikes that they ever allowed that.
>
> >
> > I was thinking that if we have some sort of File/LocalFile distinction
> > (or File/Fd), then we may be able to get it to work by limiting what a
> > File can do. For example, let's say that the only thing you can do with
> > a File is install it into fd tables, then by the previous logic, there's
> > no problem with it being safe to move across threads even if there's an
> > active fdget.
> >
> > But the fdget_pos kind of throws a wrench into that, because now I can
> > no longer say "it's always safe to do File::fget, move it to another
> > thread, and install it into the remote fd table", since that could cause
> > races on the position if there's an active fdget_pos when we call
> > File::fget.
>
> I think I understand why that's a problem for you but let me try and
> spell it out so I can understand where you're coming from:
>
> You want the Rust compiler to reject a file becoming shared implicitly
> from any codepath that is beneath fdget_pos() even though no such code
> yet exists (ignoring pidfd_getfd()). So it's a correctness issue to you.

Yes, exactly. One of the design principles behind Rust is that if an
API allows you to trigger memory unsafety, then it *must* be the case
that you called an unsafe function somewhere, and that this call
violated the documented safety requirements for that unsafe function.
So if the File API provides a safe interface that you can use to
trigger memory unsafety, then that's a correctness issue for me.

Thanks for asking this. I should have clarified this previously.

> > >>>>>> +///
> > >>>>>> +/// Light reference counts must be released with `fdput` before the system call returns to
> > >>>>>> +/// userspace. This means that if you wait until the current system call returns to userspace, then
> > >>>>>> +/// all light refcounts that existed at the time have gone away.
> > >>>>>> +///
> > >>>>>> +/// ## Rust references
> > >>>>>> +///
> > >>>>>> +/// The reference type `&File` is similar to light refcounts:
> > >>>>>> +///
> > >>>>>> +/// * `&File` references don't own a reference count. They can only exist as long as the reference
> > >>>>>> +/// count stays positive, and can only be created when there is some mechanism in place to ensure
> > >>>>>> +/// this.
> > >>>>>> +///
> > >>>>>> +/// * The Rust borrow-checker normally ensures this by enforcing that the `ARef<File>` from which
> > >>>>>> +/// a `&File` is created outlives the `&File`.
> > >>>>>
> > >>>>> The section confuses me a little: Does the borrow-checker always ensure
> > >>>>> that a &File stays valid or are there circumstances where it doesn't or
> > >>>>> are you saying it doesn't enforce it?
> > >>>>
> > >>>> The borrow-checker always ensures it.
> > >>>
> > >>> Ok, thanks.
> > >>>
> > >>>>
> > >>>> A &File is actually short-hand for &'a File, where 'a is some
> > >>>> unspecified lifetime. We say that &'a File is annotated with 'a. The
> > >>>> borrow-checker rejects any code that tries to use a reference after the
> > >>>> end of the lifetime annotated on it.
> > >>>
> > >>> Thanks for the explanation.
> > >>>
> > >>>>
> > >>>> So as long as you annotate the reference with a sufficiently short
> > >>>> lifetime, the borrow checker will prevent UAF. And outside of cases like
> > >>>
> > >>> Sorry, but can you explain "sufficiently short lifetime"?
> > >>
> > >> By "sufficiently short lifetime" I mean "lifetime that ends before the
> > >> object is destroyed".
> > >
> > > Ah, ok. It sounded like it was a specific concept that Rust is
> > > implementing in contrast to long-term lifetime or sm. Thanks!
> > >
> > >>
> > >> Idea being that if the lifetime ends before the object is freed, and the
> > >> borrow-checker rejects attempts to use it after the lifetime ends, then
> > >> it follows that the borrow-checker prevents use-after-frees.
> > >>
> > >>>> from_ptr, the borrow-checker also takes care of ensuring that the
> > >>>> lifetimes are sufficiently short.
> > >>>>
> > >>>> (Technically &'a File and &'b File are two completely different types,
> > >>>> so &File is technically a class of types and not a single type. Rust
> > >>>> will automatically convert &'long File to &'short File.)
> > >>>>
> > >>>>>> +///
> > >>>>>> +/// * Using the unsafe [`File::from_ptr`] means that it is up to the caller to ensure that the
> > >>>>>> +/// `&File` only exists while the reference count is positive.
> > >>>>>
> > >>>>> What is this used for in binder? If it's not used don't add it.
> > >>>>
> > >>>> This is used on the boundary between the Rust part of Binder and the
> > >>>> binderfs component that is implemented in C. For example:
> > >>>
> > >>> I see, I'm being foiled by my own code...
> > >>>
> > >>>>
> > >>>> unsafe extern "C" fn rust_binder_open(
> > >>>> inode: *mut bindings::inode,
> > >>>> file_ptr: *mut bindings::file,
> > >>>> ) -> core::ffi::c_int {
> > >>>> // SAFETY: The `rust_binderfs.c` file ensures that `i_private` is set to the return value of a
> > >>>> // successful call to `rust_binder_new_device`.
> > >>>> let ctx = unsafe { Arc::<Context>::borrow((*inode).i_private) };
> > >>>>
> > >>>> // SAFETY: The caller provides a valid file pointer to a new `struct file`.
> > >>>> let file = unsafe { File::from_ptr(file_ptr) };
> > >>>
> > >>> We need a better name for this helper than from_ptr() imho. I think
> > >>> from_ptr() and as_ptr() is odd for C. How weird would it be to call
> > >>> that from_raw_file() and as_raw_file()?
> > >>
> > >> That's a reasonable name. I would be happy to rename to that, and I
> > >> don't think it is unidiomatic.
> > >
> > > Thanks!
> > >
> > >>
> > >>> But bigger picture I somewhat struggle with the semantics of this
> > >>> because this is not an interface that we have in C and this is really
> > >>> about a low-level contract between C and Rust. Specifically this states
> > >>> that this pointer is _somehow_ guaranteed valid. And really, this seems
> > >>> a bit of a hack.
> > >>
> > >> Indeed ... but I think it's a quite common hack. After all, any time you
> > >> dereference a raw pointer in Rust, you are making the same assumption.
> > >>
> > >>> Naively, I think this should probably not be necessary if
> > >>> file_operations are properly wrapped. Or it should at least be demotable
> > >>> to a purely internal method that can't be called directly or something.
> > >>
> > >> Yes, the usage here of File::from_ptr could probably be hidden inside a
> > >> suitably designed file_operations wrapper. The thing is, Rust Binder
> > >> doesn't currently use such a wrapper at all. It just exports a global of
> > >> type file_operations and the C code in binderfs then references that
> > >> global.
> > >
> > > Yeah.
> > >
> > >>
> > >> Rust Binder used to use such an abstraction, but I ripped it out before
> > >> sending the Rust Binder RFC because it didn't actually help. It was
> > >> designed for cases where the file system is also implemented in Rust, so
> > >> to get it to expose a file_operations global to the C code in binderfs,
> > >> I had to reach inside its internal implementation. It did not save me
> > >> from doing stuff such as using File::from_ptr from Binder.
> > >>
> > >> Now, you could most likely modify those file_operations abstractions to
> > >> support my use-case better. But calling into C is already unsafe, so
> > >> unless we get multiple drivers that have a similar C/Rust split, it's
> > >> not clear that it's useful to extract the logic from Binder. I would
> > >> prefer to wait for the file_operations abstraction to get upstreamed by
> > >> the people working on VFS bindings, and then we can decide whether we
> > >> should rewrite binderfs into Rust and get rid of the logic, or whether
> > >> it's worth to expand the file_operations abstraction to support split
> > >> C/Rust drivers like the current binderfs.
> > >>
> > >>> So what I mean is. fdget() may or may not take a reference. The C
> > >>> interface internally knows whether a reference is owned or not by
> > >>> abusing the lower two bits in a pointer to keep track of that. Naively,
> > >>> I would expect the same information to be available to rust so that it's
> > >>> clear to Rust wheter it's dealing with an explicitly referenced file or
> > >>> an elided-reference file. Maybe that's not possible and I'm not
> > >>> well-versed enough to see that yet.
> > >>
> > >> I'm sure Rust can access the same information, but I don't think I'm
> > >> currently doing anything that cares about the distinction?
> > >
> > > Ok. My main goal is that we end up with an almost 1:1 correspondence
> > > between the Rust and C interface so it's easy for current maintainers
> > > and developers that don't want to care about Rust to continue to do so
> > > and also just somewhat verify that changes they do are sane.
> >
> > Sure, that goal makes total sense to me.
> >
> > Alice

2024-04-08 07:48:24

by Alice Ryhl

[permalink] [raw]

Subject: Re: [PATCH v5 3/9] rust: file: add Rust abstraction for `struct file`

Christian Brauner <[email protected]> wrote:
> On Tue, Apr 02, 2024 at 09:39:57AM +0000, Alice Ryhl wrote:
>> Christian Brauner <[email protected]> wrote:
>>> On Mon, Apr 01, 2024 at 12:09:08PM +0000, Alice Ryhl wrote:
>>>> Christian Brauner <[email protected]> wrote:
>>>>> On Wed, Mar 20, 2024 at 06:09:05PM +0000, Alice Ryhl wrote:
>>>>>> Christian Brauner <[email protected]> wrote:
>>>>>>> On Fri, Feb 09, 2024 at 11:18:16AM +0000, Alice Ryhl wrote:
>>>>>>>> +/// Wraps the kernel's `struct file`.
>>>>>>>> +///
>>>>>>>> +/// This represents an open file rather than a file on a filesystem. Processes generally reference
>>>>>>>> +/// open files using file descriptors. However, file descriptors are not the same as files. A file
>>>>>>>> +/// descriptor is just an integer that corresponds to a file, and a single file may be referenced
>>>>>>>> +/// by multiple file descriptors.
>>>>>>>> +///
>>>>>>>> +/// # Refcounting
>>>>>>>> +///
>>>>>>>> +/// Instances of this type are reference-counted. The reference count is incremented by the
>>>>>>>> +/// `fget`/`get_file` functions and decremented by `fput`. The Rust type `ARef<File>` represents a
>>>>>>>> +/// pointer that owns a reference count on the file.
>>>>>>>> +///
>>>>>>>> +/// Whenever a process opens a file descriptor (fd), it stores a pointer to the file in its `struct
>>>>>>>> +/// files_struct`. This pointer owns a reference count to the file, ensuring the file isn't
>>>>>>>> +/// prematurely deleted while the file descriptor is open. In Rust terminology, the pointers in
>>>>>>>> +/// `struct files_struct` are `ARef<File>` pointers.
>>>>>>>> +///
>>>>>>>> +/// ## Light refcounts
>>>>>>>> +///
>>>>>>>> +/// Whenever a process has an fd to a file, it may use something called a "light refcount" as a
>>>>>>>> +/// performance optimization. Light refcounts are acquired by calling `fdget` and released with
>>>>>>>> +/// `fdput`. The idea behind light refcounts is that if the fd is not closed between the calls to
>>>>>>>> +/// `fdget` and `fdput`, then the refcount cannot hit zero during that time, as the `struct
>>>>>>>> +/// files_struct` holds a reference until the fd is closed. This means that it's safe to access the
>>>>>>>> +/// file even if `fdget` does not increment the refcount.
>>>>>>>> +///
>>>>>>>> +/// The requirement that the fd is not closed during a light refcount applies globally across all
>>>>>>>> +/// threads - not just on the thread using the light refcount. For this reason, light refcounts are
>>>>>>>> +/// only used when the `struct files_struct` is not shared with other threads, since this ensures
>>>>>>>> +/// that other unrelated threads cannot suddenly start using the fd and close it. Therefore,
>>>>>>>> +/// calling `fdget` on a shared `struct files_struct` creates a normal refcount instead of a light
>>>>>>>> +/// refcount.
>>>>>>>
>>>>>>> When the fdget() calling task doesn't have a shared file descriptor
>>>>>>> table fdget() will not increment the reference count, yes. This
>>>>>>> also implies that you cannot have task A use fdget() and then pass
>>>>>>> f.file to task B that holds on to it while A returns to userspace. It's
>>>>>>> irrelevant that task A won't drop the reference count or that task B
>>>>>>> won't drop the reference count. Because task A could return back to
>>>>>>> userspace and immediately close the fd via a regular close() system call
>>>>>>> at which point task B has a UAF. In other words a file that has been
>>>>>>> gotten via fdget() can't be Send to another task without the Send
>>>>>>> implying taking a reference to it.
>>>>>>
>>>>>> That matches my understanding.
>>>>>>
>>>>>> I suppose that technically you can still send it to another thread *if* you
>>>>>> ensure that the current thread waits until that other thread stops using the
>>>>>> file before returning to userspace.
>>>>>
>>>>> _Technically_ yes, but it would be brittle as hell. The problem is that
>>>>> fdget() _relies_ on being single-threaded for the time that fd is used
>>>>> until fdput(). There's locking assumptions that build on that e.g., for
>>>>> concurrent read/write. So no, that shouldn't be allowed.
>>>>>
>>>>> Look at how this broke our back when we introduced pidfd_getfd() where
>>>>> we steal an fd from another task. I have a lengthy explanation how that
>>>>> can be used to violate our elided-locking which is based on assuming
>>>>> that we're always single-threaded and the file can't be suddenly shared
>>>>> with another task. So maybe doable but it would make the semantics even
>>>>> more intricate.
>>>>
>>>> Hmm, the part about elided locking is surprising to me, and may be an
>>>> issue. Can you give more details on that? Because the current
>>>
>>> So what I referred to was that we do have fdget_pos(). Roughly, if
>>> there's more than one reference on the file then we need to acquire a
>>> mutex but if it's only a single reference then we can avoid taking the
>>> mutex because we know that we're the only one that has a reference to
>>> that file and no one else can acquire one. Whether or not that mutex was
>>> taken is taken track of in struct fd.
>>>
>>> So you can't share a file after fdget_pos() has been called on it and
>>> you haven't taken the position mutex. So let's say you had:
>>>
>>> * Tread A that calls fdget_pos() on file1 and the reference count is
>>> one. So Thread A doesn't acquire the file position mutex for file1.
>>> * Now somehow that file1 becomes shared, e.g., Thread B calls fget() on
>>> it and now Thread B does some operation that requires the file
>>> position mutex.
>>> => Thread A and Thread B race on the file position.
>>>
>>> So just because you have a reference to a file from somewhere it doesn't
>>> mean you can just share it with another thread.
>>>
>>> So if yo have an arbitrary reference to a file in Rust and that somehow
>>> can be shared with another thread you risk races here.
>>>
>>>> abstractions here *do* actually allow what I described, since we
>>>> implement Sync for File.
>>>>
>>>> I'm not familiar with the pidfd_getfd discussion you are referring to.
>>>> Do you have a link?
>>>
>>> https://lore.kernel.org/linux-fsdevel/[email protected]
>>>
>>> pidfd_getfd() can be used to steal a file descriptor from another task.
>>> It's like a non-cooperative SCM_RIGHTS. That means you can have exactly
>>> the scenario described above where a file assumed to be non-shared is
>>> suddenly shared and you have racing reads/writes.
>>>
>>> For readdir we nowadays always take the file position mutex because of
>>> the pidfd_getfd() business because that might corrupt internal state.
>>>
>>>>
>>>> I'm thinking that we may have to provide two different `struct file`
>>>> wrappers to accurately model this API in Rust. Perhaps they could be
>>>> called File and LocalFile, where one is marked as thread safe and the
>>>> other isn't. I can make all LocalFile methods available on File to avoid
>>>> having to duplicate methods that are available on both.
>>>
>>> But isn't that just struct file and struct fd? Ideally we'd stay close
>>> to something like this.
>>
>> Right, that kind of naming seems sensible. But I still need to
>> understand the details a bit better. See below on fdget_pos.
>>
>>>> But it's not clear to me that this is even enough. Even if we give you a
>>>> &LocalFile to prevent you from moving it across threads, you can just
>>>> call File::fget to get an ARef<File> to the same file and then move
>>>> *that* across threads.
>>>
>>> Yes, absolutely.
>>
>> One of my challenges is that Binder wants to call File::fget,
>> immediately move it to another thread, and then call fd_install. And
>> it would be pretty unfortunate if that requires unsafe. But like I argue
>> below, it seems hard to design a safe API for this in the face of
>> fdget_pos.
>>
>>>> This kind of global requirement is not so easy to model. Maybe klint [1]
>>>> could do it ... atomic context violations are a similar kind of global
>>>> check. But having klint do it would be far out.
>>>>
>>>> Or maybe File::fget should also return a LocalFile?
>>>>
>>>> But this raises a different question to me. Let's say process A uses
>>>> Binder to send its own fd to process B, and the following things happen:
>>>>
>>>> 1. Process A enters the ioctl and takes fdget on the fd.
>>>> 2. Process A calls fget on the same fd to send it to another process.
>>>> 3. Process A goes to sleep, waiting for process B to respond.
>>>> 4. Process B receives the message, installs the fd, and returns to userspace.
>>>> 5. Process B responds to the transaction, but does not close the fd.
>>>
>>> The fd just installed in 4. and the fd you're referring to in 5. are
>>> identical, right? IOW, we're not talking about two different fd (dup)
>>> for the same file, right?
>>
>> I'm referring to whatever fd_install does given the `struct file` I got
>> from fget in step 2.
>>
>>>> 6a. Process A finishes sleeping, and returns to userspace from the ioctl.
>>>> 6b. Process B tries to do an operation (e.g. read) on the fd.
>>>>
>>>> Let's say that 6a and 6b run in parallel.
>>>>
>>>> Could this potentially result in a data race between step 6a and 6b? I'm
>>>> guessing that step 6a probably doesn't touch any of the code that has
>>>> elided locking assumptions, so in practice I guess there's not a problem
>>>> ... but if you make any sort of elided locking assumption as you exit
>>>> from the ioctl (before reaching the fdput), then it seems to me that you
>>>> have a problem.
>>>
>>> Yes, 6a doesn't touch any code that has elided locking assumptions.
>>>
>>> 1'. Process A enters the ioctl and takes fdget() on the fd. Process A
>>> holds the only reference to that file and the file descriptor table
>>> isn't shared. Therefore, f_count is left untouched and remains at 1.
>>> 2'. Process A calls fget() which unconditionally bumps f_count bringing
>>> it to 2 and sending it another process (Presumably you intend to
>>> imply that this reference is now owned by the second process.).
>>> 3'. [as 3.]
>>> 4'. Process B installs the file into it's file descriptor table
>>> consuming that reference from 2'. The f_count remains at 2 with the
>>> reference from 2' now being owned by Process B.
>>> 5'. Since Process B isn't closing the fd and has just called
>>> fd_install() it returns to userspace with f_count untouched and
>>> still at 2.
>>> 6'a. Process A finishes sleeping and returns to userspace calling
>>> fdput(). Since the original fdget() was done without bumping the
>>> reference count the fdput() of Process A will not decrement the
>>> reference count. So f_count remains at 2.
>>> 6'b. Process B performs a read/write syscall and calls fdget_pos().
>>> fdget_pos() sees that this file has f_count > 1 and takes the
>>> file position mutex.
>>>
>>> So this isn't a problem. The problem is when a file becomes shared
>>> implicitly without the original owner of the file knowing.
>>
>> Hmm. Yes, but the ioctl code that called fdget doesn't really know that
>> the ioctl shared the file? So why is it okay?
>
> Why does it matter to the ioctl() code itself? The ioctl() code never
> calls fdget_pos().
>
>>
>> It really seems like there are two different things going on here. When
>> it comes to fdget, we only really care about operations that could
>> remove it from the local file descriptor table, since fdget relies on
>> the refcount in that table remaining valid until fdput.
>
> Yes.
>
>>
>> On the other hand, for fdget_pos it also matters whether it gets
>> installed in other file descriptor tables. Threads that reference it
>> through a different fd table will still access the same position.
>
> Yes, they operate on the same f_pos.
>
>>
>> And so this means that between fdget/fdput, there's never any problem
>> with installing the `struct file` into another file descriptor table.
>> Nothing you can do from that other fd table could cause the local fd
>> table to drop its refcount on the file. Whereas such an install can be
>> a problem between fdget_pos/fdput_pos, since that could introduce a race
>> on the position.
>>
>> Is this correct?
>
> Yes, but that would imply you're sharing and installing a file into a
> file descriptor table from a read/write/seek codepath. I don't see how
> this can happen without something like e.g., pidfd_getfd(). And the
> fd_install()ing task would then have to go back to userspace and issue a
> concurrent read/write/seek system call while the other thread is still
> reading/writing.
>
> Overall, we really only care about f_pos consistency because posix
> requires atomicity between reads/writes/seeks. For pidfd_getfd() where
> such sharing can happen non-cooperatively we just don't care as we've
> just declared this to be an instance where we're outside of posix
> guarantees. And for readdir() we unconditionally acquire the mutex.
>
> I think io_uring is racing on f_pos as well under certain circumstances
> (REQ_F_CUR_POS?) as they don't use fdget_pos() at all. But iirc Jens
> dislikes that they ever allowed that.
>
>>
>> I was thinking that if we have some sort of File/LocalFile distinction
>> (or File/Fd), then we may be able to get it to work by limiting what a
>> File can do. For example, let's say that the only thing you can do with
>> a File is install it into fd tables, then by the previous logic, there's
>> no problem with it being safe to move across threads even if there's an
>> active fdget.
>>
>> But the fdget_pos kind of throws a wrench into that, because now I can
>> no longer say "it's always safe to do File::fget, move it to another
>> thread, and install it into the remote fd table", since that could cause
>> races on the position if there's an active fdget_pos when we call
>> File::fget.
>
> I think I understand why that's a problem for you but let me try and
> spell it out so I can understand where you're coming from:
>
> You want the Rust compiler to reject a file becoming shared implicitly
> from any codepath that is beneath fdget_pos() even though no such code
> yet exists (ignoring pidfd_getfd()). So it's a correctness issue to you.

Hi Christian,

I thought some more about this. Here's an idea of how we can encode this
knowledge in the API. It seems like there are a few different things we
might know about a pointer:

1. The pointer is fully shared, that is, all active fdget / fdget_pos
references (if any) actually incremented the refcount / took the
lock.

2. The pointer may have active fdget references that didn't increment
the refcount, but all fdget_pos references (if any) incremented the
refcount and took the lock.

3. The pointer may have active fdget or fdget_pos references that didn't
increment the refcount / didn't take the lock.

And orthogonal from that, all of the fdget / fdget_pos references are on
a single thread, so we either know that we are on the same thread as
them, or we don't know that. (Though we don't care about this
distinction in the fully shared case.)

I might encode this information with five different "states":

* FullyShared - fully shared
* MaybeFdget - might have fdget, not necessarily on this thread
* MaybeLocalFdget - might have fdget, but if so, it's on this thread
* MaybeFdgetPos - might have fdget_pos, not necessarily on this thread
* MaybeLocalFdgetPos - might have fdget_pos, but if so, it's on this thread

And you can make this a parameter on File, so you get pointer types like
these:

* ARef<File<MaybeFdget>> - A pointer to a file that owns a refcount and
might have an active fdget, possibly on a different thread.

* &File<FullyShared> - A reference to a file that doesn't own a
refcount, and which is fully shared.

* ARef<File<MaybeLocalFdgetPos>> - A pointer to a file that owns a
refcount, and may have both active fdget and fdget_pos calls. If it
does, they're on the same thread as this reference. This is the return
type of fget.

And if we introduce an Fdget smart pointer too, then it would probably
be typed Fdget<MaybeLocalFdget> or Fdget<MaybeLocalFdgetPos>.

You could do different things with these. For example, a MaybeFdget
reference can be used with fd_install, but a MaybeFdgetPos reference
can't. Similarly, you can close an fd using a FullyShared reference, but
not using any of the weaker kinds.

As for how you get each kind, well, the fget method would return an
MaybeLocalFdgetPos reference, since we know that we're on the right
thread if any fdget/fdget_pos references exist, but we don't otherwise
know anything.

And you could use something similar to the DeferredFdCloser to take a
MaybeLocalFdgetPos or MaybeLocalFdget file reference any upgrade it to a
FullyShared one by waiting until we return to userspace. (But you can't
do this with the non-local variants.)

When we write a Rust wrapper around file_operations that internally uses
File::from_raw_file, then we can have it pass a &File<MaybeLocalFdget>
to the ioctl handler, and a &File<MaybeLocalFdgetPos> to the
read/write/seek handlers. That way, the ioctl handler can do things with
the provided file that don't work when there's an fdget_pos, but the
read/write/seek handlers can't.

Or we can have unsafe methods that do upgrades. For example, the
codepath in Binder that sends an fd from one process to another will
most likely just call fget and then use unsafe to say "I know there are
no active fdget_pos references since I'm in an ioctl" to upgrade the
MaybeLocalFdgetPos it got from fget into an MaybeFdget (not FullyShared
since the ioctl uses fdget), that it sends to the other process and then
fd_installs.

Does this sound sane to you?

Alice