2019-08-20 11:55:50

by Jason Gunthorpe

Subject: Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote:
> On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> >
> > > So that leaves just the normal close() syscall exit case, where the
> > > application has full control of the order in which resources are
> > > released. We've already established that we can block in this
> > > context. Blocking in an interruptible state will allow fatal signal
> > > delivery to wake us, and then we fall into the
> > > fatal_signal_pending() case if we get a SIGKILL while blocking.
> >
> > The major problem with RDMA is that it doesn't always wait on close() for the
> > MR holding the page pins to be destroyed. This is done to avoid a
> > deadlock of the form:
> >
> > uverbs_destroy_ufile_hw()
> > mutex_lock()
> > [..]
> > mmput()
> > exit_mmap()
> > remove_vma()
> > fput();
> > file_operations->release()
>
> I think this is wrong, and I'm pretty sure it's an example of why
> the final __fput() call is moved out of line.

Yes, I think so too, all I can say is this *used* to happen, as we
have special code avoiding it, which is the code that is messing up
Ira's lifetime model.

Ira, you could try unraveling the special locking, that solves your
lifetime issues?

Jason


2019-08-21 18:02:28

by Ira Weiny

Subject: Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

On Tue, Aug 20, 2019 at 08:55:15AM -0300, Jason Gunthorpe wrote:
> On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote:
> > On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> > > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> > >
> > > > So that leaves just the normal close() syscall exit case, where the
> > > > application has full control of the order in which resources are
> > > > released. We've already established that we can block in this
> > > > context. Blocking in an interruptible state will allow fatal signal
> > > > delivery to wake us, and then we fall into the
> > > > fatal_signal_pending() case if we get a SIGKILL while blocking.
> > >
> > > The major problem with RDMA is that it doesn't always wait on close() for the
> > > MR holding the page pins to be destroyed. This is done to avoid a
> > > deadlock of the form:
> > >
> > > uverbs_destroy_ufile_hw()
> > > mutex_lock()
> > > [..]
> > > mmput()
> > > exit_mmap()
> > > remove_vma()
> > > fput();
> > > file_operations->release()
> >
> > I think this is wrong, and I'm pretty sure it's an example of why
> > the final __fput() call is moved out of line.
>
> Yes, I think so too, all I can say is this *used* to happen, as we
> have special code avoiding it, which is the code that is messing up
> Ira's lifetime model.
>
> Ira, you could try unraveling the special locking, that solves your
> lifetime issues?

Yes I will try to prove this out... But I'm still not sure this fully solves
the problem.

This only ensures that the process which has the RDMA context (RDMA FD) is safe
with regard to hanging the close for the "data file FD" (the file which has
pinned pages) in that _same_ process. But what about the following scenario?

Process A has the RDMA context FD and data file FD (with lease) open.

Process A uses SCM_RIGHTS to pass the RDMA context FD to Process B.

Process A attempts to exit (hangs because data file FD is pinned).

Admin kills process A. kill works because we have allowed for it...

Process B _still_ has the RDMA context FD open _and_ therefore still holds the
file pins.

Truncation still fails.

Admin does not know which process is holding the pin.

What am I missing?
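
For reference, the SCM_RIGHTS pass above is just the standard Unix
FD-passing dance, and nothing in that path knows anything about the
layout lease on the data file.  A minimal userspace sketch (error
handling omitted, all names invented):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define FDS_MAX 4       /* enough for these examples */

/* Pass nfds (<= FDS_MAX) open FDs over a connected AF_UNIX socket. */
static int send_fds(int unix_sock, const int *fds, int nfds)
{
        char dummy = 'x';
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        union {
                char buf[CMSG_SPACE(FDS_MAX * sizeof(int))];
                struct cmsghdr align;
        } u;
        struct msghdr msg = {
                .msg_iov = &iov, .msg_iovlen = 1,
                .msg_control = u.buf,
                .msg_controllen = CMSG_SPACE(nfds * sizeof(int)),
        };
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;   /* receiver gets dup'd FDs */
        cmsg->cmsg_len = CMSG_LEN(nfds * sizeof(int));
        memcpy(CMSG_DATA(cmsg), fds, nfds * sizeof(int));

        return sendmsg(unix_sock, &msg, 0);
}

/*
 * Process A: hand only the RDMA context FD to B.  B now keeps the
 * uverbs file (and therefore the MR and its page pins) alive even
 * after A exits or is killed:
 *
 *      send_fds(sock_to_b, &uverbs_fd, 1);
 */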

Ira

2019-08-21 18:35:16

by Jason Gunthorpe

Subject: Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

On Wed, Aug 21, 2019 at 11:02:00AM -0700, Ira Weiny wrote:
> On Tue, Aug 20, 2019 at 08:55:15AM -0300, Jason Gunthorpe wrote:
> > On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote:
> > > On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> > > > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> > > >
> > > > > So that leaves just the normal close() syscall exit case, where the
> > > > > application has full control of the order in which resources are
> > > > > released. We've already established that we can block in this
> > > > > context. Blocking in an interruptible state will allow fatal signal
> > > > > delivery to wake us, and then we fall into the
> > > > > fatal_signal_pending() case if we get a SIGKILL while blocking.
> > > >
> > > > The major problem with RDMA is that it doesn't always wait on close() for the
> > > > MR holding the page pins to be destroyed. This is done to avoid a
> > > > deadlock of the form:
> > > >
> > > > uverbs_destroy_ufile_hw()
> > > > mutex_lock()
> > > > [..]
> > > > mmput()
> > > > exit_mmap()
> > > > remove_vma()
> > > > fput();
> > > > file_operations->release()
> > >
> > > I think this is wrong, and I'm pretty sure it's an example of why
> > > the final __fput() call is moved out of line.
> >
> > Yes, I think so too, all I can say is this *used* to happen, as we
> > have special code avoiding it, which is the code that is messing up
> > Ira's lifetime model.
> >
> > Ira, you could try unraveling the special locking, that solves your
> > lifetime issues?
>
> Yes I will try to prove this out... But I'm still not sure this fully solves
> the problem.
>
> This only ensures that the process which has the RDMA context (RDMA FD) is safe
> with regard to hanging the close for the "data file FD" (the file which has
> pinned pages) in that _same_ process. But what about the scenario.

Oh, I didn't think we were talking about that. Hanging the close of
the datafile fd contingent on some other FD's closure is a recipe for
deadlock..

IMHO the pin refcnt is held by the driver char dev FD, that is the
object you need to make it visible against.

Why not just have a single table someplace of all the layout leases
with the file they are held on and the FD/socket/etc that is holding
the pin? Make it independent of processes and FDs?
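
Something along these lines, purely as a sketch of the shape (nothing
below is existing kernel code, all names are invented):

#include <linux/fs.h>
#include <linux/list.h>
#include <linux/spinlock.h>

/* One entry per layout lease / long term pin holder. */
struct layout_pin_entry {
        struct list_head node;
        struct file *leased_file;       /* file whose layout is pinned */
        struct file *pin_owner;         /* uverbs fd, XDP socket, ... */
};

static LIST_HEAD(layout_pin_table);
static DEFINE_SPINLOCK(layout_pin_lock);

/* Pin holders would add themselves when taking the pin and remove
 * themselves when the owning fd/socket is finally released, no matter
 * which process happens to hold that fd at the time. */
static void layout_pin_register(struct layout_pin_entry *lpe)
{
        spin_lock(&layout_pin_lock);
        list_add(&lpe->node, &layout_pin_table);
        spin_unlock(&layout_pin_lock);
}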

Jason

2019-08-21 18:36:12

by John Hubbard

Subject: Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

On 8/21/19 11:13 AM, Jason Gunthorpe wrote:
> On Wed, Aug 21, 2019 at 11:02:00AM -0700, Ira Weiny wrote:
>> On Tue, Aug 20, 2019 at 08:55:15AM -0300, Jason Gunthorpe wrote:
>>> On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote:
>>>> On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
>>>>> On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
>>>>>
>>>>>> So that leaves just the normal close() syscall exit case, where the
>>>>>> application has full control of the order in which resources are
>>>>>> released. We've already established that we can block in this
>>>>>> context. Blocking in an interruptible state will allow fatal signal
>>>>>> delivery to wake us, and then we fall into the
>>>>>> fatal_signal_pending() case if we get a SIGKILL while blocking.
>>>>>
>>>>> The major problem with RDMA is that it doesn't always wait on close() for the
>>>>> MR holding the page pins to be destroyed. This is done to avoid a
>>>>> deadlock of the form:
>>>>>
>>>>> uverbs_destroy_ufile_hw()
>>>>> mutex_lock()
>>>>> [..]
>>>>> mmput()
>>>>> exit_mmap()
>>>>> remove_vma()
>>>>> fput();
>>>>> file_operations->release()
>>>>
>>>> I think this is wrong, and I'm pretty sure it's an example of why
>>>> the final __fput() call is moved out of line.
>>>
>>> Yes, I think so too, all I can say is this *used* to happen, as we
>>> have special code avoiding it, which is the code that is messing up
>>> Ira's lifetime model.
>>>
>>> Ira, you could try unraveling the special locking, that solves your
>>> lifetime issues?
>>
>> Yes I will try to prove this out... But I'm still not sure this fully solves
>> the problem.
>>
>> This only ensures that the process which has the RDMA context (RDMA FD) is safe
>> with regard to hanging the close for the "data file FD" (the file which has
>> pinned pages) in that _same_ process. But what about the scenario.
>
> Oh, I didn't think we were talking about that. Hanging the close of
> the datafile fd contingent on some other FD's closure is a recipe for
> deadlock..
>
> IMHO the pin refcnt is held by the driver char dev FD, that is the
> object you need to make it visible against.


If you do that, it might make it a lot simpler to add lease support
to drivers like XDP, which is otherwise looking pretty difficult to
set up with an fd. (It's socket-based, and not immediately clear where
to connect up the fd.)


thanks,
--
John Hubbard
NVIDIA

>
> Why not just have a single table someplace of all the layout leases
> with the file they are held on and the FD/socket/etc that is holding
> the pin? Make it independent of processes and FDs?
>
> Jason
>

2019-08-21 18:58:09

by Ira Weiny

Subject: Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

On Wed, Aug 21, 2019 at 03:13:43PM -0300, Jason Gunthorpe wrote:
> On Wed, Aug 21, 2019 at 11:02:00AM -0700, Ira Weiny wrote:
> > On Tue, Aug 20, 2019 at 08:55:15AM -0300, Jason Gunthorpe wrote:
> > > On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote:
> > > > On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> > > > > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> > > > >
> > > > > > So that leaves just the normal close() syscall exit case, where the
> > > > > > application has full control of the order in which resources are
> > > > > > released. We've already established that we can block in this
> > > > > > context. Blocking in an interruptible state will allow fatal signal
> > > > > > delivery to wake us, and then we fall into the
> > > > > > fatal_signal_pending() case if we get a SIGKILL while blocking.
> > > > >
> > > > > The major problem with RDMA is that it doesn't always wait on close() for the
> > > > > MR holding the page pins to be destroyed. This is done to avoid a
> > > > > deadlock of the form:
> > > > >
> > > > > uverbs_destroy_ufile_hw()
> > > > > mutex_lock()
> > > > > [..]
> > > > > mmput()
> > > > > exit_mmap()
> > > > > remove_vma()
> > > > > fput();
> > > > > file_operations->release()
> > > >
> > > > I think this is wrong, and I'm pretty sure it's an example of why
> > > > the final __fput() call is moved out of line.
> > >
> > > Yes, I think so too, all I can say is this *used* to happen, as we
> > > have special code avoiding it, which is the code that is messing up
> > > Ira's lifetime model.
> > >
> > > Ira, you could try unraveling the special locking, that solves your
> > > lifetime issues?
> >
> > Yes I will try to prove this out... But I'm still not sure this fully solves
> > the problem.
> >
> > This only ensures that the process which has the RDMA context (RDMA FD) is safe
> > with regard to hanging the close for the "data file FD" (the file which has
> > pinned pages) in that _same_ process. But what about the scenario.
>
> Oh, I didn't think we were talking about that. Hanging the close of
> the datafile fd contingent on some other FD's closure is a recipe for
> deadlock..

The discussion between Jan and Dave was concerning what happens when a user
calls

fd = open()
fcntl(...getlease...)
addr = mmap(fd...)
ib_reg_mr() <pin>
munmap(addr...)
close(fd)

Dave suggested:

"I'm of a mind to make the last close() on a file block if there's an
active layout lease to prevent processes from zombie-ing layout
leases like this. i.e. you can't close the fd until resources that
pin the lease have been released."

-- Dave https://lkml.org/lkml/2019/8/16/994
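
Spelled out a little more concretely, that sequence is roughly the
following. This is only a hedged sketch using libibverbs, assuming an
already set up uverbs context/PD; the layout lease fcntl() is shown
schematically since its exact form is what this RFC defines, and all
error handling is omitted:

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <infiniband/verbs.h>

static struct ibv_mr *pin_file_pages(struct ibv_pd *pd, const char *path,
                                     size_t len)
{
        int fd = open(path, O_RDWR);
        /* fcntl(fd, <take layout lease>, ...);  -- per this proposal */
        void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                          fd, 0);
        struct ibv_mr *mr = ibv_reg_mr(pd, addr, len,
                                       IBV_ACCESS_LOCAL_WRITE); /* <pin> */

        munmap(addr, len);      /* the pin outlives the mapping... */
        close(fd);              /* ...and, here, the data file fd too */
        return mr;              /* pages stay pinned until ibv_dereg_mr()
                                 * or the uverbs fd is finally released */
}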

>
> IMHO the pin refcnt is held by the driver char dev FD, that is the
> object you need to make it visible against.

I'm sorry but what do you mean by "driver char dev FD"?

>
> Why not just have a single table someplace of all the layout leases
> with the file they are held on and the FD/socket/etc that is holding
> the pin? Make it independent of processes and FDs?

If it is independent of processes, how will we know which process is blocking
the truncate? Using a global table is an interesting idea, but I still believe
users are going to want to track this back to specific processes. It's not
clear to me how that would be done with a global table.

I agree the XDP/socket case is bothersome... I was thinking that somewhere the
fd of the socket could be hooked up in this case. But taking a look at it
reveals that it is not going to be easy. And I assume XDP has the same issue WRT
SCM_RIGHTS and the ability to share the xdp context?

Ira

>
> Jason

2019-08-21 19:08:00

by Ira Weiny

Subject: Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

On Wed, Aug 21, 2019 at 11:57:03AM -0700, 'Ira Weiny' wrote:
> On Wed, Aug 21, 2019 at 03:13:43PM -0300, Jason Gunthorpe wrote:
> > On Wed, Aug 21, 2019 at 11:02:00AM -0700, Ira Weiny wrote:
> > > On Tue, Aug 20, 2019 at 08:55:15AM -0300, Jason Gunthorpe wrote:
> > > > On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote:
> > > > > On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> > > > > > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> > > > > >
> > > > > > > So that leaves just the normal close() syscall exit case, where the
> > > > > > > application has full control of the order in which resources are
> > > > > > > released. We've already established that we can block in this
> > > > > > > context. Blocking in an interruptible state will allow fatal signal
> > > > > > > delivery to wake us, and then we fall into the
> > > > > > > fatal_signal_pending() case if we get a SIGKILL while blocking.
> > > > > >
> > > > > > The major problem with RDMA is that it doesn't always wait on close() for the
> > > > > > MR holding the page pins to be destroyed. This is done to avoid a
> > > > > > deadlock of the form:
> > > > > >
> > > > > > uverbs_destroy_ufile_hw()
> > > > > > mutex_lock()
> > > > > > [..]
> > > > > > mmput()
> > > > > > exit_mmap()
> > > > > > remove_vma()
> > > > > > fput();
> > > > > > file_operations->release()
> > > > >
> > > > > I think this is wrong, and I'm pretty sure it's an example of why
> > > > > the final __fput() call is moved out of line.
> > > >
> > > > Yes, I think so too, all I can say is this *used* to happen, as we
> > > > have special code avoiding it, which is the code that is messing up
> > > > Ira's lifetime model.
> > > >
> > > > Ira, you could try unraveling the special locking, that solves your
> > > > lifetime issues?
> > >
> > > Yes I will try to prove this out... But I'm still not sure this fully solves
> > > the problem.
> > >
> > > This only ensures that the process which has the RDMA context (RDMA FD) is safe
> > > with regard to hanging the close for the "data file FD" (the file which has
> > > pinned pages) in that _same_ process. But what about the scenario.
> >
> > Oh, I didn't think we were talking about that. Hanging the close of
> > the datafile fd contingent on some other FD's closure is a recipe for
> > deadlock..
>
> The discussion between Jan and Dave was concerning what happens when a user
> calls
>
> fd = open()
> fcntl(...getlease...)
> addr = mmap(fd...)
> ib_reg_mr() <pin>
> munmap(addr...)
> close(fd)
>
> Dave suggested:
>
> "I'm of a mind to make the last close() on a file block if there's an
> active layout lease to prevent processes from zombie-ing layout
> leases like this. i.e. you can't close the fd until resources that
> pin the lease have been released."
>
> -- Dave https://lkml.org/lkml/2019/8/16/994

I think this may all be easier if there were a way to block a dup() when it
comes from SCM_RIGHTS. Does anyone know if that is easy? I assume it would
just mean passing some flag through the dup() call chain.

Jason, if we did that would it break RDMA use cases? I personally don't know
of any. We could pass data back from vaddr_pin indicating that a file pin has
been taken and predicate the blocking of SCM_RIGHTS on that?

Of course if the user called:

fd = open()
fcntl(...getlease...)
addr = mmap(fd...)
ib_reg_mr() <pin>
munmap(addr...)
close(fd)
fork() <=== in another thread because close is hanging

Would that dup() "fd" above into the child? Or maybe that would be part of the
work to make close() hang? Ensure the fd/file is still in the FD table so it
gets dupped???

Ira


>
> >
> > IMHO the pin refcnt is held by the driver char dev FD, that is the
> > object you need to make it visible against.
>
> I'm sorry but what do you mean by "driver char dev FD"?
>
> >
> > Why not just have a single table someplace of all the layout leases
> > with the file they are held on and the FD/socket/etc that is holding
> > the pin? Make it independent of processes and FDs?
>
> If it is independent of processes how will we know which process is blocking
> the truncate? Using a global table is an interesting idea but I still believe
> the users are going to want to track this to specific processes. It's not
> clear to me how that would be done with a global table.
>
> I agree the XDP/socket case is bothersome... I was thinking that somewhere the
> fd of the socket could be hooked up in this case. But taking a look at it
> reveals that is not going to be easy. And I assume XDP has the same issue WRT
> SCM_RIGHTS and the ability to share the xdp context?
>
> Ira
>
> >
> > Jason

2019-08-21 19:49:27

by Jason Gunthorpe

Subject: Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

On Wed, Aug 21, 2019 at 11:57:03AM -0700, Ira Weiny wrote:

> > Oh, I didn't think we were talking about that. Hanging the close of
> > the datafile fd contingent on some other FD's closure is a recipe for
> > deadlock..
>
> The discussion between Jan and Dave was concerning what happens when a user
> calls
>
> fd = open()
> fcntl(...getlease...)
> addr = mmap(fd...)
> ib_reg_mr() <pin>
> munmap(addr...)
> close(fd)

I don't see how blocking close(fd) could work. Write it like this:

fd = open()
uverbs = open(/dev/uverbs)
fcntl(...getlease...)
addr = mmap(fd...)
ib_reg_mr() <pin>
munmap(addr...)
<sigkill>

The order FD's are closed during sigkill is not deterministic, so when
all the fputs happen during a kill'd exit we could end up blocking in
close(fd) as close(uverbs) will come after in the close
list. close(uverbs) is the thing that does the dereg_mr and releases
the pin.

We don't need complexity with dup to create problems.

Jason

2019-08-21 20:45:31

by Ira Weiny

Subject: Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

On Wed, Aug 21, 2019 at 04:48:10PM -0300, Jason Gunthorpe wrote:
> On Wed, Aug 21, 2019 at 11:57:03AM -0700, Ira Weiny wrote:
>
> > > Oh, I didn't think we were talking about that. Hanging the close of
> > > the datafile fd contingent on some other FD's closure is a recipe for
> > > deadlock..
> >
> > The discussion between Jan and Dave was concerning what happens when a user
> > calls
> >
> > fd = open()
> > fcntl(...getlease...)
> > addr = mmap(fd...)
> > ib_reg_mr() <pin>
> > munmap(addr...)
> > close(fd)
>
> I don't see how blocking close(fd) could work.

Well Dave was saying this _could_ work. FWIW I'm not 100% sure it will but I
can't prove it won't.. Maybe we are all just touching a different part of this
elephant[1] but the above scenario or one without munmap is very reasonably
something a user would do. So we can either allow the close to complete (my
current patches) or try to make it block like Dave is suggesting.

I don't disagree with Dave about the semantics being nice and clean for the
filesystem. But the fact that RDMA, and potentially others, can "pass the
pins" to other processes is something I spent a lot of time trying to work out.

>
> Write it like this:
>
> fd = open()
> uverbs = open(/dev/uverbs)
> fcntl(...getlease...)
> addr = mmap(fd...)
> ib_reg_mr() <pin>
> munmap(addr...)
> <sigkill>
>
> The order FD's are closed during sigkill is not deterministic, so when
> all the fputs happen during a kill'd exit we could end up blocking in
> close(fd) as close(uverbs) will come after in the close
> list. close(uverbs) is the thing that does the dereg_mr and releases
> the pin.

Of course, that is a different scenario which needs to be fixed in my patch
set. Now that my servers are back up I can hopefully make progress. (Power
was down for them yesterday).

>
> We don't need complexity with dup to create problems.

No but that complexity _will_ come unless we "zombie" layout leases.

Ira

[1] https://en.wikipedia.org/wiki/Blind_men_and_an_elephant

>
> Jason
>

2019-08-22 00:09:45

by Jason Gunthorpe

Subject: Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

On Wed, Aug 21, 2019 at 01:44:21PM -0700, Ira Weiny wrote:

> > The order FD's are closed during sigkill is not deterministic, so when
> > all the fputs happen during a kill'd exit we could end up blocking in
> > close(fd) as close(uverbs) will come after in the close
> > list. close(uverbs) is the thing that does the dereg_mr and releases
> > the pin.
>
> Of course, that is a different scenario which needs to be fixed in my patch
> set. Now that my servers are back up I can hopefully make progress. (Power
> was down for them yesterday).

It isn't really a different scenario; the problem is that the
filesystem fd must be closable independently of fencing the MR, to
avoid deadlock cycles. Once you resolve that, the issue of the uverbs
FD outliving it won't matter one bit whether it is in the same process
or another.

Jason

2019-08-23 09:14:39

by Dave Chinner

Subject: Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

On Wed, Aug 21, 2019 at 11:02:00AM -0700, Ira Weiny wrote:
> On Tue, Aug 20, 2019 at 08:55:15AM -0300, Jason Gunthorpe wrote:
> > On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote:
> > > On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> > > > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> > > >
> > > > > So that leaves just the normal close() syscall exit case, where the
> > > > > application has full control of the order in which resources are
> > > > > released. We've already established that we can block in this
> > > > > context. Blocking in an interruptible state will allow fatal signal
> > > > > delivery to wake us, and then we fall into the
> > > > > fatal_signal_pending() case if we get a SIGKILL while blocking.
> > > >
> > > > The major problem with RDMA is that it doesn't always wait on close() for the
> > > > MR holding the page pins to be destroyed. This is done to avoid a
> > > > deadlock of the form:
> > > >
> > > > uverbs_destroy_ufile_hw()
> > > > mutex_lock()
> > > > [..]
> > > > mmput()
> > > > exit_mmap()
> > > > remove_vma()
> > > > fput();
> > > > file_operations->release()
> > >
> > > I think this is wrong, and I'm pretty sure it's an example of why
> > > the final __fput() call is moved out of line.
> >
> > Yes, I think so too, all I can say is this *used* to happen, as we
> > have special code avoiding it, which is the code that is messing up
> > Ira's lifetime model.
> >
> > Ira, you could try unraveling the special locking, that solves your
> > lifetime issues?
>
> Yes I will try to prove this out... But I'm still not sure this fully solves
> the problem.
>
> This only ensures that the process which has the RDMA context (RDMA FD) is safe
> with regard to hanging the close for the "data file FD" (the file which has
> pinned pages) in that _same_ process. But what about the scenario.
>
> Process A has the RDMA context FD and data file FD (with lease) open.
>
> Process A uses SCM_RIGHTS to pass the RDMA context FD to Process B.

Passing the RDMA context dependent on a file layout lease to another
process that doesn't have a file layout lease or a reference to the
original lease should be considered a violation of the layout lease.
Process B does not have an active layout lease, and so by the rules
of layout leases, it is not allowed to pin the layout of the file.

> Process A attempts to exit (hangs because data file FD is pinned).
>
> Admin kills process A. kill works because we have allowed for it...
>
> Process B _still_ has the RDMA context FD open _and_ therefore still holds the
> file pins.
>
> Truncation still fails.
>
> Admin does not know which process is holding the pin.
>
> What am I missing?

Application does not hold the correct file layout lease references.
Passing the fd via SCM_RIGHTS to a process without a layout lease
is equivalent to not using layout leases in the first place.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-08-23 09:54:28

by Dave Chinner

Subject: Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

On Wed, Aug 21, 2019 at 01:44:21PM -0700, Ira Weiny wrote:
> On Wed, Aug 21, 2019 at 04:48:10PM -0300, Jason Gunthorpe wrote:
> > On Wed, Aug 21, 2019 at 11:57:03AM -0700, Ira Weiny wrote:
> >
> > > > Oh, I didn't think we were talking about that. Hanging the close of
> > > > the datafile fd contingent on some other FD's closure is a recipe for
> > > > deadlock..
> > >
> > > The discussion between Jan and Dave was concerning what happens when a user
> > > calls
> > >
> > > fd = open()
> > > fcntl(...getlease...)
> > > addr = mmap(fd...)
> > > ib_reg_mr() <pin>
> > > munmap(addr...)
> > > close(fd)
> >
> > I don't see how blocking close(fd) could work.
>
> Well Dave was saying this _could_ work. FWIW I'm not 100% sure it will but I
> can't prove it won't..

Right, I proposed it as a possible way of making sure application
developers don't do this. It _could_ be made to work (e.g. recording
longterm page pins on the vma->file), but this is tangential to
the discussion of requiring active references to all resources
covered by the layout lease.

I think allowing applications to behave like the above is simply
poor system level design, regardless of the interaction with
filesystems and layout leases.

> Maybe we are all just touching a different part of this
> elephant[1] but the above scenario or one without munmap is very reasonably
> something a user would do. So we can either allow the close to complete (my
> current patches) or try to make it block like Dave is suggesting.
>
> I don't disagree with Dave with the semantics being nice and clean for the
> filesystem.

I'm not trying to make it "nice and clean for the filesystem".

The problem is not just RDMA/DAX - anything that is directly
accessing the block device under the filesystem has the same set of
issues. That is, the filesystem controls the life cycle of the
blocks in the block device, so direct access to the blocks by any
means needs to be co-ordinated with the filesystem. Pinning direct
access to a file via page pins attached to a hardware context that
the filesystem knows nothing about is not an access model that the
filesystems can support.

IOWs, anyone looking at this problem just from the RDMA POV of page
pins is not seeing all the other direct storage access mechanisms
that we need to support in the filesystems. RDMA on DAX is just one
of them. pNFS is another. Remote access via NVMeOF is another. XDP
-> DAX (direct file data placement from the network hardware) is
another. There are /lots/ of different direct storage access
mechanisms that filesystems need to support and we sure as hell do
not want to have to support special case semantics for every single
one of them.

Hence if we don't start with a sane model for arbitrating direct
access to the storage at the filesystem level we'll never get this
stuff to work reliably, let alone work together coherently. An
application that wants a direct data path to storage should have a
single API that enables them to safely access the storage,
regardless of how they are accessing the storage.

From that perspective, what we are talking about here with RDMA
doing "mmap, page pin, unmap, close" and "pass page pins via
SCM_RIGHTS" are fundamentally unworkable from the filesystem
perspective. They are use-after-free situations from the filesystem
perspective - they do not hold direct references to anything in the
filesystem, and so the filesystem is completely unaware of them.

The filesystem needs to be aware of /all users/ of its resources if
it's going to manage them sanely. It needs to be able to correctly
coordinate modifications to ownership of the underlying storage with
all the users directly accessing that physical storage regardless of
the mechanism being used to access the storage. IOWs, access
control must be independent of the mechanism used to gain access to
the storage hardware.

That's what file layout leases are defining - the filesystem support
model for allowing direct storage access from userspace. It's not
defining "RDMA/FSDAX" access rules, it's defining a generic direct
access model. And one of the rules in this model is "if you don't
have an active reference to the file layout, you are not allowed to
directly access the layout.".

Anything else is unsupportable from the filesystem perspective -
designing an access mechanism that allows userspace to retain access
indefinitely by relying on hidden details of kernel subsystem
implementations is a terrible architecture. Not only does it bleed
kernel implementation into the API and the behavioural model, it
means we can't ever change that internal kernel behaviour because
userspace may be dependent on it. I shouldn't be having to point out
how bad this is from a system design perspective.

That's where the "nice and clean" semantics come from - starting
from "what can we actually support?", "what exactly do all the
different direct access mechanisms actually require?", "does it work
for future technologies not yet on our radar?" and working from
there. So I'm not just looking at what we are doing right now, I'm
looking at 15 years down the track when we still have to support
layout leases and we've got hardware we haven't dreamed of yet. If
the model is clean, simple, robust, implementation independent and
has well defined semantics, then it should stand the test of time.
i.e. the "nice and clean" semantics have nothign to do with the
filesystem per se, but everything to do with ensuring the mechanism
is generic and workable for direct storage access for a long time
into the future.

We can't force people to use layout leases - at all, let alone
correctly - but if you want filesystems and enterprise distros to
support direct access to filesystem controlled storage, then direct
access applications need to follow a sane set of rules that are
supportable at the filesystem level.

> But the fact that RDMA, and potentially others, can "pass the
> pins" to other processes is something I spent a lot of time trying to work out.

There's nothing in file layout lease architecture that says you
can't "pass the pins" to another process. All the file layout lease
requirements say is that if you are going to pass a resource, for
which the layout lease guarantees access, to another process, then
the destination process must already have a valid, active layout
lease that covers the range of the pins being passed to it via the
RDMA handle.

i.e. as the pins pass from one process to another, they pass from
the protection of the lease process A holds to the protection that
the lease process B holds. This can probably even be done by
duplicating the lease fd and passing it by SCM_RIGHTS first.....
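
In terms of the send_fds() sketch earlier in the thread, the sending
side would look something like this (hand-waving the kernel side
enforcement, which is the open question):

        /* the lease reference travels in the same SCM_RIGHTS message
         * as the fd holding the pins, so whoever receives the pins
         * also receives the lease that protects them */
        int fds[2] = { dup(lease_fd), uverbs_fd };

        send_fds(sock_to_b, fds, 2);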

Cheers,

Dave.

--
Dave Chinner
[email protected]

2019-08-23 23:33:18

by Ira Weiny

Subject: Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

On Fri, Aug 23, 2019 at 10:59:14AM +1000, Dave Chinner wrote:
> On Wed, Aug 21, 2019 at 11:02:00AM -0700, Ira Weiny wrote:
> > On Tue, Aug 20, 2019 at 08:55:15AM -0300, Jason Gunthorpe wrote:
> > > On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote:
> > > > On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> > > > > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> > > > >
> > > > > > So that leaves just the normal close() syscall exit case, where the
> > > > > > application has full control of the order in which resources are
> > > > > > released. We've already established that we can block in this
> > > > > > context. Blocking in an interruptible state will allow fatal signal
> > > > > > delivery to wake us, and then we fall into the
> > > > > > fatal_signal_pending() case if we get a SIGKILL while blocking.
> > > > >
> > > > > The major problem with RDMA is that it doesn't always wait on close() for the
> > > > > MR holding the page pins to be destroyed. This is done to avoid a
> > > > > deadlock of the form:
> > > > >
> > > > > uverbs_destroy_ufile_hw()
> > > > > mutex_lock()
> > > > > [..]
> > > > > mmput()
> > > > > exit_mmap()
> > > > > remove_vma()
> > > > > fput();
> > > > > file_operations->release()
> > > >
> > > > I think this is wrong, and I'm pretty sure it's an example of why
> > > > the final __fput() call is moved out of line.
> > >
> > > Yes, I think so too, all I can say is this *used* to happen, as we
> > > have special code avoiding it, which is the code that is messing up
> > > Ira's lifetime model.
> > >
> > > Ira, you could try unraveling the special locking, that solves your
> > > lifetime issues?
> >
> > Yes I will try to prove this out... But I'm still not sure this fully solves
> > the problem.
> >
> > This only ensures that the process which has the RDMA context (RDMA FD) is safe
> > with regard to hanging the close for the "data file FD" (the file which has
> > pinned pages) in that _same_ process. But what about the scenario.
> >
> > Process A has the RDMA context FD and data file FD (with lease) open.
> >
> > Process A uses SCM_RIGHTS to pass the RDMA context FD to Process B.
>
> Passing the RDMA context dependent on a file layout lease to another
> process that doesn't have a file layout lease or a reference to the
> original lease should be considered a violation of the layout lease.
> Process B does not have an active layout lease, and so by the rules
> of layout leases, it is not allowed to pin the layout of the file.
>

I don't disagree with the semantics of this. I just don't know how to enforce
it.

> > Process A attempts to exit (hangs because data file FD is pinned).
> >
> > Admin kills process A. kill works because we have allowed for it...
> >
> > Process B _still_ has the RDMA context FD open _and_ therefore still holds the
> > file pins.
> >
> > Truncation still fails.
> >
> > Admin does not know which process is holding the pin.
> >
> > What am I missing?
>
> Application does not hold the correct file layout lease references.
> Passing the fd via SCM_RIGHTS to a process without a layout lease
> is equivalent to not using layout leases in the first place.

OK, so if I understand you correctly, you would support a failure of SCM_RIGHTS
in this case? I'm OK with that but not sure how to implement it right now.

To that end, I would like to simplify this slightly because I'm not convinced
that SCM_RIGHTS is a problem we need to solve right now, i.e. I don't know of a
user who wants to do this.

Right now duplication via SCM_RIGHTS could fail if _any_ file pins (and by
definition leases) exist underneath the "RDMA FD" (or other direct access FD,
like XDP etc) being duplicated. Later, if this becomes a use case we will need
to code up the proper checks, potentially within each of the subsystems. This
is because, with RDMA at least, there are potentially large numbers of MR's and
file leases which may have to be checked.
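
Very roughly, I am imagining the check having a shape like this, where
file_has_file_pins() is only a stand-in for whatever per-file pin
accounting we end up with (nothing here is an existing kernel
interface):

#include <linux/fs.h>
#include <linux/types.h>

/* Hypothetical: called by the SCM_RIGHTS attach path for each file
 * being passed; rejecting the pass here fails the duplication. */
static bool scm_may_pass_file(struct file *file)
{
        if (file_has_file_pins(file))   /* invented predicate */
                return false;           /* fail the SCM_RIGHTS dup */
        return true;
}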

Ira

2019-08-24 04:51:59

by Ira Weiny

Subject: Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

On Fri, Aug 23, 2019 at 01:23:45PM +1000, Dave Chinner wrote:
> On Wed, Aug 21, 2019 at 01:44:21PM -0700, Ira Weiny wrote:
> > On Wed, Aug 21, 2019 at 04:48:10PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Aug 21, 2019 at 11:57:03AM -0700, Ira Weiny wrote:
> > >
> > > > > Oh, I didn't think we were talking about that. Hanging the close of
> > > > > the datafile fd contingent on some other FD's closure is a recipe for
> > > > > deadlock..
> > > >
> > > > The discussion between Jan and Dave was concerning what happens when a user
> > > > calls
> > > >
> > > > fd = open()
> > > > fcntl(...getlease...)
> > > > addr = mmap(fd...)
> > > > ib_reg_mr() <pin>
> > > > munmap(addr...)
> > > > close(fd)
> > >
> > > I don't see how blocking close(fd) could work.
> >
> > Well Dave was saying this _could_ work. FWIW I'm not 100% sure it will but I
> > can't prove it won't..
>
> Right, I proposed it as a possible way of making sure application
> developers don't do this. It _could_ be made to work (e.g. recording
> longterm page pins on the vma->file), but this is tangential to
> the discussion of requiring active references to all resources
> covered by the layout lease.
>
> I think allowing applications to behave like the above is simply
> poor system level design, regardless of the interaction with
> filesystems and layout leases.
>
> > Maybe we are all just touching a different part of this
> > elephant[1] but the above scenario or one without munmap is very reasonably
> > something a user would do. So we can either allow the close to complete (my
> > current patches) or try to make it block like Dave is suggesting.

My belief when writing the current series was that hanging the close would
cause deadlock. But it seems I was wrong because of the delayed __fput().

So far, I have not been able to get RDMA to have an issue like Jason suggested
would happen (or used to happen). So from that perspective it may be ok to
hang the close.

> >
> > I don't disagree with Dave with the semantics being nice and clean for the
> > filesystem.
>
> I'm not trying to make it "nice and clean for the filesystem".
>
> The problem is not just RDMA/DAX - anything that is directly
> accessing the block device under the filesystem has the same set of
> issues. That is, the filesystem controls the life cycle of the
> blocks in the block device, so direct access to the blocks by any
> means needs to be co-ordinated with the filesystem. Pinning direct
> access to a file via page pins attached to a hardware context that
> the filesystem knows nothing about is not an access model that the
> filesystems can support.
>
> IOWs, anyone looking at this problem just from the RDMA POV of page
> pins is not seeing all the other direct storage access mechanisms
> that we need to support in the filesystems. RDMA on DAX is just one
> of them. pNFS is another. Remote access via NVMeOF is another. XDP
> -> DAX (direct file data placement from the network hardware) is
> another. There are /lots/ of different direct storage access
> mechanisms that filesystems need to support and we sure as hell do
> not want to have to support special case semantics for every single
> one of them.

My use of struct file was based on the fact that FDs are a primary interface
for Linux, and my thought was that they would be more universal than having file
pin information stored in an RDMA-specific structure.

XDP is not as direct; it uses sockets. But sockets also have a struct file
which I believe could be used in a similar manner. I'm not 100% sure of the
xdp_umem lifetime yet but it seems that my choice of using struct file was a
good one in this respect.

>
> Hence if we don't start with a sane model for arbitrating direct
> access to the storage at the filesystem level we'll never get this
> stuff to work reliably, let alone work together coherently. An
> application that wants a direct data path to storage should have a
> single API that enables them to safely access the storage,
> regardless of how they are accessing the storage.
>
> From that perspective, what we are talking about here with RDMA
> doing "mmap, page pin, unmap, close" and "pass page pins via
> SCM_RIGHTS" are fundamentally unworkable from the filesystem
> perspective. They are use-after-free situations from the filesystem
> perspective - they do not hold direct references to anything in the
> filesystem, and so the filesystem is completely unaware of them.

I see your point of view, but looking at it from a different angle I
don't see this as a "use after free".

The user has explicitly registered this memory (and layout) with another direct
access subsystem (RDMA for example) so why do they need to keep the FD around?

>
> The filesystem needs to be aware of /all users/ of its resources if
> it's going to manage them sanely.

From the way I look at it the underlying filesystem _is_ aware of the leases
with my patch set. And so too is the user. It is just not through the original
"data file fd".

And the owner of the lease becomes the subsystem object ("RDMA FD" in this
case) which is holding the pins. Furthermore, the lease is maintained and
transferred automatically through the normal FD processing.

(Furthermore, tracking of these pins is available for whatever subsystem by
tracking them with struct file; _not_ just RDMA). When those subsystem objects
are released the "data file lease" will be released as well. That was the
design.
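
As a hedged sketch of that ownership model (names invented, only
loosely modeled on the vaddr_pin idea in this series, not its actual
definition):

#include <linux/fs.h>
#include <linux/slab.h>

/* The pin records both the leased data file and the subsystem file
 * ("RDMA FD", XDP socket, ...) that owns it. */
struct file_pin {
        struct file *data_file; /* file whose layout lease is held */
        struct file *owner;     /* e.g. the uverbs char dev's struct file */
};

/* Called from the owner's file_operations->release(): when the
 * subsystem object goes away, the lease and the data file reference
 * go away with it. */
static void file_pin_release(struct file_pin *pin)
{
        /* drop the layout lease taken on data_file at pin time here */
        fput(pin->data_file);
        kfree(pin);
}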

>
> > But the fact that RDMA, and potentially others, can "pass the
> > pins" to other processes is something I spent a lot of time trying to work out.
>
> There's nothing in file layout lease architecture that says you
> can't "pass the pins" to another process. All the file layout lease
> requirements say is that if you are going to pass a resource, for
> which the layout lease guarantees access, to another process, then
> the destination process must already have a valid, active layout
> lease that covers the range of the pins being passed to it via the
> RDMA handle.
>
> i.e. as the pins pass from one process to another, they pass from
> the protection of the lease process A holds to the protection that
> the lease process B holds. This can probably even be done by
> duplicating the lease fd and passing it by SCM_RIGHTS first.....

My worry with this is how to enforce it. As I said in the other thread I think
we could potentially block SCM_RIGHTS use in the short term. But I'm not sure
about blocking every call which may "dup()" an FD to random processes.

Ira

2019-08-25 19:44:56

by Jason Gunthorpe

Subject: Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

On Fri, Aug 23, 2019 at 09:49:12PM -0700, Ira Weiny wrote:

> So far, I have not been able to get RDMA to have an issue like Jason suggested
> would happen (or used to happen). So from that perspective it may be ok to
> hang the close.

No, it is not OK to hang the close. You will deadlock on process
destruction when the 'lease fd' hangs waiting for the 'uverbs fd'
which is later in the single threaded destruction sequence.

This is different from the uverbs deadlock I outlined.

Jason