2019-08-24 00:13:07

by Dave Chinner

[permalink] [raw]
Subject: Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:
> On Fri, Aug 23, 2019 at 01:23:45PM +1000, Dave Chinner wrote:
>
> > > But the fact that RDMA, and potentially others, can "pass the
> > > pins" to other processes is something I spent a lot of time trying to work out.
> >
> > There's nothing in file layout lease architecture that says you
> > can't "pass the pins" to another process. All the file layout lease
> > requirements say is that if you are going to pass a resource for
> > which the layout lease guarantees access for to another process,
> > then the destination process already have a valid, active layout
> > lease that covers the range of the pins being passed to it via the
> > RDMA handle.
>
> How would the kernel detect and enforce this? There are many ways to
> pass a FD.

AFAIC, that's not really a kernel problem. It's more of an
application design constraint than anything else. i.e. if the app
passes the IB context to another process without a lease, then the
original process is still responsible for recalling the lease and
has to tell that other process to release the IB handle and it's
resources.

> IMHO it is wrong to try and create a model where the file lease exists
> independently from the kernel object relying on it. In other words the
> IB MR object itself should hold a reference to the lease it relies
> upon to function properly.

That still doesn't work. Leases are not individually trackable or
reference counted objects objects - they are attached to a struct
file bUt, in reality, they are far more restricted than a struct
file.

That is, a lease specifically tracks the pid and the _open fd_ it
was obtained for, so it is essentially owned by a specific process
context. Hence a lease is not able to be passed to a separate
process context and have it still work correctly for lease break
notifications. i.e. the layout break signal gets delivered to
original process that created the struct file, if it still exists
and has the original fd still open. It does not get sent to the
process that currently holds a reference to the IB context.

So while a struct file passed to another process might still have
an active lease, and you can change the owner of the struct file
via fcntl(F_SETOWN), you can't associate the existing lease with a
the new fd in the new process and so layout break signals can't be
directed at the lease fd....

This really means that a lease can only be owned by a single process
context - it can't be shared across multiple processes (so I was
wrong about dup/pass as being a possible way of passing them)
because there's only one process that can "own" a struct file, and
that where signals are sent when the lease needs to be broken.

So, fundamentally, if you want to pass a resource that pins a file
layout between processes, both processes need to hold a layout lease
on that file range. And that means exclusive leases and passing
layouts between processes are fundamentally incompatible because you
can't hold two exclusive leases on the same file range....

Cheers,

Dave.
--
Dave Chinner
[email protected]


2019-08-24 05:08:57

by Ira Weiny

[permalink] [raw]
Subject: Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

On Sat, Aug 24, 2019 at 10:11:24AM +1000, Dave Chinner wrote:
> On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:
> > On Fri, Aug 23, 2019 at 01:23:45PM +1000, Dave Chinner wrote:
> >
> > > > But the fact that RDMA, and potentially others, can "pass the
> > > > pins" to other processes is something I spent a lot of time trying to work out.
> > >
> > > There's nothing in file layout lease architecture that says you
> > > can't "pass the pins" to another process. All the file layout lease
> > > requirements say is that if you are going to pass a resource for
> > > which the layout lease guarantees access for to another process,
> > > then the destination process already have a valid, active layout
> > > lease that covers the range of the pins being passed to it via the
> > > RDMA handle.
> >
> > How would the kernel detect and enforce this? There are many ways to
> > pass a FD.
>
> AFAIC, that's not really a kernel problem. It's more of an
> application design constraint than anything else. i.e. if the app
> passes the IB context to another process without a lease, then the
> original process is still responsible for recalling the lease and
> has to tell that other process to release the IB handle and it's
> resources.
>
> > IMHO it is wrong to try and create a model where the file lease exists
> > independently from the kernel object relying on it. In other words the
> > IB MR object itself should hold a reference to the lease it relies
> > upon to function properly.
>
> That still doesn't work. Leases are not individually trackable or
> reference counted objects objects - they are attached to a struct
> file bUt, in reality, they are far more restricted than a struct
> file.
>
> That is, a lease specifically tracks the pid and the _open fd_ it
> was obtained for, so it is essentially owned by a specific process
> context. Hence a lease is not able to be passed to a separate
> process context and have it still work correctly for lease break
> notifications. i.e. the layout break signal gets delivered to
> original process that created the struct file, if it still exists
> and has the original fd still open. It does not get sent to the
> process that currently holds a reference to the IB context.
>

The fcntl man page says:

"Leases are associated with an open file description (see open(2)). This means
that duplicate file descriptors (created by, for example, fork(2) or dup(2))
refer to the same lease, and this lease may be modified or released using any
of these descriptors. Furthermore, the lease is released by either an
explicit F_UNLCK operation on any of these duplicate file descriptors, or when
all such file descriptors have been closed."

From this I took it that the child process FD would have the lease as well
_and_ could release it. I _assumed_ that applied to SCM_RIGHTS but it does not
seem to work the same way as dup() so I'm not so sure.

Ira

>
> So while a struct file passed to another process might still have
> an active lease, and you can change the owner of the struct file
> via fcntl(F_SETOWN), you can't associate the existing lease with a
> the new fd in the new process and so layout break signals can't be
> directed at the lease fd....
>
> This really means that a lease can only be owned by a single process
> context - it can't be shared across multiple processes (so I was
> wrong about dup/pass as being a possible way of passing them)
> because there's only one process that can "own" a struct file, and
> that where signals are sent when the lease needs to be broken.
>
> So, fundamentally, if you want to pass a resource that pins a file
> layout between processes, both processes need to hold a layout lease
> on that file range. And that means exclusive leases and passing
> layouts between processes are fundamentally incompatible because you
> can't hold two exclusive leases on the same file range....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]

2019-08-25 19:43:40

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

On Sat, Aug 24, 2019 at 10:11:24AM +1000, Dave Chinner wrote:
> On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:
> > On Fri, Aug 23, 2019 at 01:23:45PM +1000, Dave Chinner wrote:
> >
> > > > But the fact that RDMA, and potentially others, can "pass the
> > > > pins" to other processes is something I spent a lot of time trying to work out.
> > >
> > > There's nothing in file layout lease architecture that says you
> > > can't "pass the pins" to another process. All the file layout lease
> > > requirements say is that if you are going to pass a resource for
> > > which the layout lease guarantees access for to another process,
> > > then the destination process already have a valid, active layout
> > > lease that covers the range of the pins being passed to it via the
> > > RDMA handle.
> >
> > How would the kernel detect and enforce this? There are many ways to
> > pass a FD.
>
> AFAIC, that's not really a kernel problem. It's more of an
> application design constraint than anything else. i.e. if the app
> passes the IB context to another process without a lease, then the
> original process is still responsible for recalling the lease and
> has to tell that other process to release the IB handle and it's
> resources.

It is a kernel problem, the MR exists and is doing DMA. That relies on
the lease to prevent data corruption.

The sanest outcome I could suggest is that when the kernel detects the
MR has outlived the lease it needs then we forcibly abort the entire
RDMA state. Ie the application has malfunctioned and gets wacked with
a very big hammer.

> That still doesn't work. Leases are not individually trackable or
> reference counted objects objects - they are attached to a struct
> file bUt, in reality, they are far more restricted than a struct
> file.

This is the problem. How to link something that is not refcounted to
the refcounted world of file descriptors does not seem very obvious.

There are too many places where struct file relies on its refcounting
to try to and plug them.

Jason

2019-08-26 05:55:44

by Dave Chinner

[permalink] [raw]
Subject: Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

On Fri, Aug 23, 2019 at 10:08:36PM -0700, Ira Weiny wrote:
> On Sat, Aug 24, 2019 at 10:11:24AM +1000, Dave Chinner wrote:
> > On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:
> > > On Fri, Aug 23, 2019 at 01:23:45PM +1000, Dave Chinner wrote:
> > >
> > > > > But the fact that RDMA, and potentially others, can "pass the
> > > > > pins" to other processes is something I spent a lot of time trying to work out.
> > > >
> > > > There's nothing in file layout lease architecture that says you
> > > > can't "pass the pins" to another process. All the file layout lease
> > > > requirements say is that if you are going to pass a resource for
> > > > which the layout lease guarantees access for to another process,
> > > > then the destination process already have a valid, active layout
> > > > lease that covers the range of the pins being passed to it via the
> > > > RDMA handle.
> > >
> > > How would the kernel detect and enforce this? There are many ways to
> > > pass a FD.
> >
> > AFAIC, that's not really a kernel problem. It's more of an
> > application design constraint than anything else. i.e. if the app
> > passes the IB context to another process without a lease, then the
> > original process is still responsible for recalling the lease and
> > has to tell that other process to release the IB handle and it's
> > resources.
> >
> > > IMHO it is wrong to try and create a model where the file lease exists
> > > independently from the kernel object relying on it. In other words the
> > > IB MR object itself should hold a reference to the lease it relies
> > > upon to function properly.
> >
> > That still doesn't work. Leases are not individually trackable or
> > reference counted objects objects - they are attached to a struct
> > file bUt, in reality, they are far more restricted than a struct
> > file.
> >
> > That is, a lease specifically tracks the pid and the _open fd_ it
> > was obtained for, so it is essentially owned by a specific process
> > context. Hence a lease is not able to be passed to a separate
> > process context and have it still work correctly for lease break
> > notifications. i.e. the layout break signal gets delivered to
> > original process that created the struct file, if it still exists
> > and has the original fd still open. It does not get sent to the
> > process that currently holds a reference to the IB context.
> >
>
> The fcntl man page says:
>
> "Leases are associated with an open file description (see open(2)). This means
> that duplicate file descriptors (created by, for example, fork(2) or dup(2))
> refer to the same lease, and this lease may be modified or released using any
> of these descriptors. Furthermore, the lease is released by either an
> explicit F_UNLCK operation on any of these duplicate file descriptors, or when
> all such file descriptors have been closed."

Right, the lease is attached to the struct file, so it follows
where-ever the struct file goes. That doesn't mean it's actually
useful when the struct file is duplicated and/or passed to another
process. :/

AFAICT, the problem is that when we take another reference to the
struct file, or when the struct file is passed to a different
process, nothing updates the lease or lease state attached to that
struct file.

> From this I took it that the child process FD would have the lease as well
> _and_ could release it. I _assumed_ that applied to SCM_RIGHTS but it does not
> seem to work the same way as dup() so I'm not so sure.

Sure, that part works because the struct file is passed. It doesn't
end up with the same fd number in the other process, though.

The issue is that layout leases need to notify userspace when they
are broken by the kernel, so a lease stores the owner pid/tid in the
file->f_owner field via __f_setown(). It also keeps a struct fasync
attached to the file_lock that records the fd that the lease was
created on. When a signal needs to be sent to userspace for that
lease, we call kill_fasync() and that walks the list of fasync
structures on the lease and calls:

send_sigio(fown, fa->fa_fd, band);

And it does for every fasync struct attached to a lease. Yes, a
lease can track multiple fds, but it can only track them in a single
process context. The moment the struct file is shared with another
process, the lease is no longer capable of sending notifications to
all the lease holders.

Yes, you can change the owning process via F_SETOWNER, but that's
still only a single process context, and you can't change the fd in
the fasync list. You can add new fd to an existing lease by calling
F_SETLEASE on the new fd, but you still only have a single process
owner context for signal delivery.

As such, leases that require callbacks to userspace are currently
only valid within the process context the lease was taken in.
Indeed, even closing the fd the lease was taken on without
F_UNLCKing it first doesn't mean the lease has been torn down if
there is some other reference to the struct file. That means the
original lease owner will still get SIGIO delivered to that fd on a
lease break regardless of whether it is open or not. ANd if we
implement "layout lease not released within SIGIO response timeout"
then that process will get killed, despite the fact it may not even
have a reference to that file anymore.

So, AFAICT, leases that require userspace callbacks only work within
their original process context while they original fd is still open.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-08-29 02:03:02

by Ira Weiny

[permalink] [raw]
Subject: Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

On Mon, Aug 26, 2019 at 03:55:10PM +1000, Dave Chinner wrote:
> On Fri, Aug 23, 2019 at 10:08:36PM -0700, Ira Weiny wrote:
> > On Sat, Aug 24, 2019 at 10:11:24AM +1000, Dave Chinner wrote:
> > > On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:
> > >
> > > > IMHO it is wrong to try and create a model where the file lease exists
> > > > independently from the kernel object relying on it. In other words the
> > > > IB MR object itself should hold a reference to the lease it relies
> > > > upon to function properly.
> > >
> > > That still doesn't work. Leases are not individually trackable or
> > > reference counted objects objects - they are attached to a struct
> > > file bUt, in reality, they are far more restricted than a struct
> > > file.
> > >
> > > That is, a lease specifically tracks the pid and the _open fd_ it
> > > was obtained for, so it is essentially owned by a specific process
> > > context. Hence a lease is not able to be passed to a separate
> > > process context and have it still work correctly for lease break
> > > notifications. i.e. the layout break signal gets delivered to
> > > original process that created the struct file, if it still exists
> > > and has the original fd still open. It does not get sent to the
> > > process that currently holds a reference to the IB context.

But this is an exclusive layout lease which does not send a signal. There is
no way to break it.

> > >
> >
> > The fcntl man page says:
> >
> > "Leases are associated with an open file description (see open(2)). This means
> > that duplicate file descriptors (created by, for example, fork(2) or dup(2))
> > refer to the same lease, and this lease may be modified or released using any
> > of these descriptors. Furthermore, the lease is released by either an
> > explicit F_UNLCK operation on any of these duplicate file descriptors, or when
> > all such file descriptors have been closed."
>
> Right, the lease is attached to the struct file, so it follows
> where-ever the struct file goes. That doesn't mean it's actually
> useful when the struct file is duplicated and/or passed to another
> process. :/
>
> AFAICT, the problem is that when we take another reference to the
> struct file, or when the struct file is passed to a different
> process, nothing updates the lease or lease state attached to that
> struct file.

Ok, I probably should have made this more clear in the cover letter but _only_
the process which took the lease can actually pin memory.

That pinned memory _can_ be passed to another process but those sub-process' can
_not_ use the original lease to pin _more_ of the file. They would need to
take their own lease to do that.

Sorry for not being clear on that.

>
> > From this I took it that the child process FD would have the lease as well
> > _and_ could release it. I _assumed_ that applied to SCM_RIGHTS but it does not
> > seem to work the same way as dup() so I'm not so sure.
>
> Sure, that part works because the struct file is passed. It doesn't
> end up with the same fd number in the other process, though.
>
> The issue is that layout leases need to notify userspace when they
> are broken by the kernel, so a lease stores the owner pid/tid in the
> file->f_owner field via __f_setown(). It also keeps a struct fasync
> attached to the file_lock that records the fd that the lease was
> created on. When a signal needs to be sent to userspace for that
> lease, we call kill_fasync() and that walks the list of fasync
> structures on the lease and calls:
>
> send_sigio(fown, fa->fa_fd, band);
>
> And it does for every fasync struct attached to a lease. Yes, a
> lease can track multiple fds, but it can only track them in a single
> process context. The moment the struct file is shared with another
> process, the lease is no longer capable of sending notifications to
> all the lease holders.
>
> Yes, you can change the owning process via F_SETOWNER, but that's
> still only a single process context, and you can't change the fd in
> the fasync list. You can add new fd to an existing lease by calling
> F_SETLEASE on the new fd, but you still only have a single process
> owner context for signal delivery.
>
> As such, leases that require callbacks to userspace are currently
> only valid within the process context the lease was taken in.

But for long term pins we are not requiring callbacks.

> Indeed, even closing the fd the lease was taken on without
> F_UNLCKing it first doesn't mean the lease has been torn down if
> there is some other reference to the struct file. That means the
> original lease owner will still get SIGIO delivered to that fd on a
> lease break regardless of whether it is open or not. ANd if we
> implement "layout lease not released within SIGIO response timeout"
> then that process will get killed, despite the fact it may not even
> have a reference to that file anymore.

I'm not seeing that as a problem. This is all a result of the application
failing to do the right thing. The code here is simply keeping the kernel
consistent and safe so that an admin or the user themselves can unwind the
badness without damage to the file system.

>
> So, AFAICT, leases that require userspace callbacks only work within
> their original process context while they original fd is still open.

But they _work_ IFF the application actually expects to do something with the
SIGIO. The application could just as well chose to ignore the SIGIO without
closing the FD which would do the same thing.

If the application expected to do something with the SIGIO but closed the FD
then it's really just the applications fault.

So after thinking on this for a day I don't think we have a serious issue.

Even the "zombie" lease is just an application error and it is already possible
to get something like this. If the application passes the FD to another
process and closes their FD then SIGIO's don't get delivered but there is a
lease hanging off the struct file until it is destroyed. No harm, no foul.

In the case of close it is _not_ true that users don't have a way to release
the lease. It is just that they can't call F_UNLCK to do so. Once they have
"zombie'ed" the lease (again an application error) the only recourse is to
unpin the file through the subsystem which pinned the page. Probably through
killing the process.

Ira

2019-08-29 03:29:58

by John Hubbard

[permalink] [raw]
Subject: Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

On 8/28/19 7:02 PM, Ira Weiny wrote:
> On Mon, Aug 26, 2019 at 03:55:10PM +1000, Dave Chinner wrote:
>> On Fri, Aug 23, 2019 at 10:08:36PM -0700, Ira Weiny wrote:
>>> On Sat, Aug 24, 2019 at 10:11:24AM +1000, Dave Chinner wrote:
>>>> On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:
...
>>
>> Sure, that part works because the struct file is passed. It doesn't
>> end up with the same fd number in the other process, though.
>>
>> The issue is that layout leases need to notify userspace when they
>> are broken by the kernel, so a lease stores the owner pid/tid in the
>> file->f_owner field via __f_setown(). It also keeps a struct fasync
>> attached to the file_lock that records the fd that the lease was
>> created on. When a signal needs to be sent to userspace for that
>> lease, we call kill_fasync() and that walks the list of fasync
>> structures on the lease and calls:
>>
>> send_sigio(fown, fa->fa_fd, band);
>>
>> And it does for every fasync struct attached to a lease. Yes, a
>> lease can track multiple fds, but it can only track them in a single
>> process context. The moment the struct file is shared with another
>> process, the lease is no longer capable of sending notifications to
>> all the lease holders.
>>
>> Yes, you can change the owning process via F_SETOWNER, but that's
>> still only a single process context, and you can't change the fd in
>> the fasync list. You can add new fd to an existing lease by calling
>> F_SETLEASE on the new fd, but you still only have a single process
>> owner context for signal delivery.
>>
>> As such, leases that require callbacks to userspace are currently
>> only valid within the process context the lease was taken in.
>
> But for long term pins we are not requiring callbacks.
>

Hi Ira,

If "require callbacks to userspace" means sending SIGIO, then actually
FOLL_LONGTERM *does* require those callbacks. Because we've been, so
far, equating FOLL_LONGTERM with the vaddr_pin struct and with a lease.

What am I missing here?

thanks,
--
John Hubbard
NVIDIA