2022-06-23 19:12:03

by Tycho Andersen

Subject: strange interaction between fuse + pidns

Hi all,

I'm seeing some weird interactions with fuse and the pid namespace. I have a
small reproducer here: https://github.com/tych0/kernel-utils/tree/master/fuse2

fuse has the concept of "forcing" a request, which means (among other
things) that it does an unkillable wait in request_wait_answer(). fuse
flushes files when they are closed, using this unkillable wait:

$ sudo cat /proc/1544574/stack
[<0>] request_wait_answer+0x12f/0x210
[<0>] fuse_simple_request+0x109/0x2c0
[<0>] fuse_flush+0x16f/0x1b0
[<0>] filp_close+0x27/0x70
[<0>] put_files_struct+0x6b/0xc0
[<0>] do_exit+0x360/0xb80
[<0>] do_group_exit+0x3a/0xa0
[<0>] get_signal+0x140/0x870
[<0>] arch_do_signal_or_restart+0xae/0x7c0
[<0>] exit_to_user_mode_prepare+0x10f/0x1c0
[<0>] syscall_exit_to_user_mode+0x26/0x40
[<0>] do_syscall_64+0x46/0xb0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
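
For reference, the flush path is what sets force on the request. Roughly (a
paraphrased sketch of fs/fuse/file.c and fs/fuse/dev.c, not verbatim source):

static int fuse_flush(struct file *file, fl_owner_t id)
{
	...
	args.opcode = FUSE_FLUSH;
	args.force = true;	/* becomes FR_FORCE on the request */
	err = fuse_simple_request(fm, &args);
	...
}

/* and in fuse_simple_request(): */
	if (args->force) {
		req = fuse_request_alloc(fm, GFP_KERNEL | __GFP_NOFAIL);
		__set_bit(FR_FORCE, &req->flags);
	}

request_wait_answer() then skips the killable wait for FR_FORCE requests and
falls through to the plain wait_event() removed in the patch below.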

Generally, this is OK, since the fuse_dev_release() -> fuse_abort_conn()
wakes up this code when a fuse dev goes away (i.e. a fuse daemon is killed
or unmounted or whatever). However, there's a problem when the fuse daemon
itself spawns a thread that does a flush: since the thread has a copy of
the fd table with an fd pointing to the same fuse device, the reference
count isn't decremented to zero in fuse_dev_release(), and the task hangs
forever.

The connection can be aborted via fusectl's abort file (which unsticks these
tasks), so all is not lost. However, this does wreak havoc in containers
that have a fuse filesystem mounted in this state. If the init pid exits
(or crashes), the kernel tries to clean up the pidns:

$ sudo cat /proc/1528591/stack
[<0>] do_wait+0x156/0x2f0
[<0>] kernel_wait4+0x8d/0x140
[<0>] zap_pid_ns_processes+0x104/0x180
[<0>] do_exit+0xa41/0xb80
[<0>] do_group_exit+0x3a/0xa0
[<0>] __x64_sys_exit_group+0x14/0x20
[<0>] do_syscall_64+0x37/0xb0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae

but hangs forever. This unkillable wait seems unfortunate, so I tried the
obvious thing of changing it to a killable wait:

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 0e537e580dc1..c604dfcaec26 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -297,7 +297,6 @@ void fuse_request_end(struct fuse_req *req)
 		spin_unlock(&fiq->lock);
 	}
 	WARN_ON(test_bit(FR_PENDING, &req->flags));
-	WARN_ON(test_bit(FR_SENT, &req->flags));
 	if (test_bit(FR_BACKGROUND, &req->flags)) {
 		spin_lock(&fc->bg_lock);
 		clear_bit(FR_BACKGROUND, &req->flags);
@@ -381,30 +380,33 @@ static void request_wait_answer(struct fuse_req *req)
 			queue_interrupt(req);
 	}

-	if (!test_bit(FR_FORCE, &req->flags)) {
-		/* Only fatal signals may interrupt this */
-		err = wait_event_killable(req->waitq,
-					test_bit(FR_FINISHED, &req->flags));
-		if (!err)
-			return;
+	/* Only fatal signals may interrupt this */
+	err = wait_event_killable(req->waitq,
+				  test_bit(FR_FINISHED, &req->flags));
+	if (!err)
+		return;

-		spin_lock(&fiq->lock);
-		/* Request is not yet in userspace, bail out */
-		if (test_bit(FR_PENDING, &req->flags)) {
-			list_del(&req->list);
-			spin_unlock(&fiq->lock);
-			__fuse_put_request(req);
-			req->out.h.error = -EINTR;
-			return;
-		}
+	spin_lock(&fiq->lock);
+	/* Request is not yet in userspace, bail out */
+	if (test_bit(FR_PENDING, &req->flags)) {
+		list_del(&req->list);
 		spin_unlock(&fiq->lock);
+		__fuse_put_request(req);
+		req->out.h.error = -EINTR;
+		return;
 	}
+	spin_unlock(&fiq->lock);

 	/*
-	 * Either request is already in userspace, or it was forced.
-	 * Wait it out.
+	 * Womp womp. We sent a request to userspace and now we're getting
+	 * killed.
 	 */
-	wait_event(req->waitq, test_bit(FR_FINISHED, &req->flags));
+	set_bit(FR_INTERRUPTED, &req->flags);
+	/* matches barrier in fuse_dev_do_read() */
+	smp_mb__after_atomic();
+	/* request *must* be FR_SENT here, because we ignored FR_PENDING before */
+	WARN_ON(!test_bit(FR_SENT, &req->flags));
+	queue_interrupt(req);
 }

 static void __fuse_request_send(struct fuse_req *req)

available as a full patch here:
https://github.com/tych0/linux/commit/81b9ff4c8c1af24f6544945da808dbf69a1293f7

but now things are even weirder. Tasks are stuck at the killable wait, but with
a SIGKILL pending for the thread group.

root@(none):/# cat /proc/187/stack
[<0>] fuse_simple_request+0x8d9/0x10f0 [fuse]
[<0>] fuse_flush+0x42f/0x630 [fuse]
[<0>] filp_close+0x96/0x120
[<0>] put_files_struct+0x15c/0x2c0
[<0>] do_exit+0xa00/0x2450
[<0>] do_group_exit+0xb2/0x2a0
[<0>] __x64_sys_exit_group+0x35/0x40
[<0>] do_syscall_64+0x40/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
root@(none):/# cat /proc/187/status
Name: main
Umask: 0022
State: S (sleeping)
Tgid: 187
Ngid: 0
Pid: 187
PPid: 185
TracerPid: 0
Uid: 0 0 0 0
Gid: 0 0 0 0
FDSize: 0
Groups:
NStgid: 187 3
NSpid: 187 3
NSpgid: 171 0
NSsid: 160 0
Threads: 1
SigQ: 0/6706
SigPnd: 0000000000000000
ShdPnd: 0000000000000100
SigBlk: 0000000000000000
SigIgn: 0000000180004002
SigCgt: 0000000000000000
CapInh: 0000000000000000
CapPrm: 000001ffffffffff
CapEff: 000001ffffffffff
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000
NoNewPrivs: 0
Seccomp: 0
Seccomp_filters: 0
Speculation_Store_Bypass: thread vulnerable
SpeculationIndirectBranch: conditional enabled
Cpus_allowed: f
Cpus_allowed_list: 0-3
Mems_allowed: 00000000,00000001
Mems_allowed_list: 0
voluntary_ctxt_switches: 6
nonvoluntary_ctxt_switches: 1
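
(For reference, the ShdPnd mask above decodes to SIGKILL; a minimal
userspace sketch of the decoding, using nothing beyond standard libc:)

#include <stdio.h>
#include <string.h>

int main(void)
{
	unsigned long long shdpnd = 0x100;	/* ShdPnd value from the status above */

	/* bit (sig - 1) is set for each pending signal, so 0x100 -> signal 9 */
	for (int sig = 1; sig <= 64; sig++)
		if (shdpnd & (1ULL << (sig - 1)))
			printf("pending: %d (%s)\n", sig, strsignal(sig));
	return 0;
}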

Any ideas what's going on here? It also seems I'm not the first person to
wonder about this:
https://sourceforge.net/p/fuse/mailman/fuse-devel/thread/CAMp4zn9zTA_A2GJiYo5AD9V5HpeXbzzMP%3DnF0WtwbxRbV3koNA%40mail.gmail.com/#msg36598753

Thanks,

Tycho


2022-06-23 22:38:33

by Vivek Goyal

Subject: Re: strange interaction between fuse + pidns

On Thu, Jun 23, 2022 at 11:21:25AM -0600, Tycho Andersen wrote:
> Hi all,
>
> I'm seeing some weird interactions with fuse and the pid namespace. I have a
> small reproducer here: https://github.com/tych0/kernel-utils/tree/master/fuse2
>
> fuse has the concept of "forcing" a request, which means (among other
> things) that it does an unkillable wait in request_wait_answer(). fuse
> flushes files when they are closed with this unkillable wait:
>
> $ sudo cat /proc/1544574/stack
> [<0>] request_wait_answer+0x12f/0x210
> [<0>] fuse_simple_request+0x109/0x2c0
> [<0>] fuse_flush+0x16f/0x1b0
> [<0>] filp_close+0x27/0x70
> [<0>] put_files_struct+0x6b/0xc0
> [<0>] do_exit+0x360/0xb80
> [<0>] do_group_exit+0x3a/0xa0
> [<0>] get_signal+0x140/0x870
> [<0>] arch_do_signal_or_restart+0xae/0x7c0
> [<0>] exit_to_user_mode_prepare+0x10f/0x1c0
> [<0>] syscall_exit_to_user_mode+0x26/0x40
> [<0>] do_syscall_64+0x46/0xb0
> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
>
> Generally, this is OK, since the fuse_dev_release() -> fuse_abort_conn()
> wakes up this code when a fuse dev goes away (i.e. a fuse daemon is killed
> or unmounted or whatever). However, there's a problem when the fuse daemon
> itself spawns a thread that does a flush:

So in this case a single process is both the client and the server? IOW, one
thread is the fuse server servicing fuse requests and another thread is a
fuse client accessing the fuse filesystem?

> since the thread has a copy of
> the fd table with an fd pointing to the same fuse device, the reference
> count isn't decremented to zero in fuse_dev_release(), and the task hangs
> forever.

So why did the fuse server thread stop responding to fuse messages? Why
did it not complete the flush?

Is it something to do with the init process dying in the pid namespace
and killing that fuse server thread, while the other thread could not be
killed because it is in an unkillable wait?

>
> Tasks can be aborted via fusectl's abort file, so all is not lost. However,
> this does wreak havoc in containers which mounted a fuse filesystem with
> this state. If the init pid exits (or crashes), the kernel tries to clean
> up the pidns:
>
> $ sudo cat /proc/1528591/stack
> [<0>] do_wait+0x156/0x2f0
> [<0>] kernel_wait4+0x8d/0x140
> [<0>] zap_pid_ns_processes+0x104/0x180
> [<0>] do_exit+0xa41/0xb80
> [<0>] do_group_exit+0x3a/0xa0
> [<0>] __x64_sys_exit_group+0x14/0x20
> [<0>] do_syscall_64+0x37/0xb0
> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
>
> but hangs forever. This unkillable wait seems unfortunate, so I tried the
> obvious thing of changing it to a killable wait:

BTW, the unkillable wait happens only if fc->no_interrupt = 1. And this
seems to be set only if the server returned -ENOSYS to some previous
interrupt request:

fuse_dev_do_write() {
	else if (oh.error == -ENOSYS)
		fc->no_interrupt = 1;
}

So a simple workaround might be for the server to implement support for
interrupting requests.
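
A rough sketch of what that could look like with the libfuse low-level API
(the handler names here are illustrative only, not from any real server):

#include <fuse_lowlevel.h>
#include <errno.h>

/* The low-level API answers FUSE_INTERRUPT generically (it never replies
 * -ENOSYS, so the kernel does not set fc->no_interrupt); registering a
 * per-request callback additionally lets the filesystem abort its work. */
static void my_interrupt(fuse_req_t req, void *data)
{
	*(volatile int *)data = 1;	/* tell the worker doing the flush to give up */
}

static void my_flush(fuse_req_t req, fuse_ino_t ino, struct fuse_file_info *fi)
{
	int interrupted = 0;

	fuse_req_interrupt_func(req, my_interrupt, &interrupted);
	/* ... do the actual flush work, checking 'interrupted' periodically ... */
	fuse_reply_err(req, interrupted ? EINTR : 0);
}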

Having said that, this does sound like a problem and probably should
be fixed at the kernel level.

>
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 0e537e580dc1..c604dfcaec26 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -297,7 +297,6 @@ void fuse_request_end(struct fuse_req *req)
> spin_unlock(&fiq->lock);
> }
> WARN_ON(test_bit(FR_PENDING, &req->flags));
> - WARN_ON(test_bit(FR_SENT, &req->flags));
> if (test_bit(FR_BACKGROUND, &req->flags)) {
> spin_lock(&fc->bg_lock);
> clear_bit(FR_BACKGROUND, &req->flags);
> @@ -381,30 +380,33 @@ static void request_wait_answer(struct fuse_req *req)
> queue_interrupt(req);
> }
>
> - if (!test_bit(FR_FORCE, &req->flags)) {
> - /* Only fatal signals may interrupt this */
> - err = wait_event_killable(req->waitq,
> - test_bit(FR_FINISHED, &req->flags));
> - if (!err)
> - return;
> + /* Only fatal signals may interrupt this */
> + err = wait_event_killable(req->waitq,
> + test_bit(FR_FINISHED, &req->flags));

Trying to do a fatal-signal killable wait sounds reasonable. But I am
not sure about the history:

- Why can't FORCE requests do a killable wait?
- Why does flush need to have the FORCE flag set?

> + if (!err)
> + return;
>
> - spin_lock(&fiq->lock);
> - /* Request is not yet in userspace, bail out */
> - if (test_bit(FR_PENDING, &req->flags)) {
> - list_del(&req->list);
> - spin_unlock(&fiq->lock);
> - __fuse_put_request(req);
> - req->out.h.error = -EINTR;
> - return;
> - }
> + spin_lock(&fiq->lock);
> + /* Request is not yet in userspace, bail out */
> + if (test_bit(FR_PENDING, &req->flags)) {
> + list_del(&req->list);
> spin_unlock(&fiq->lock);
> + __fuse_put_request(req);
> + req->out.h.error = -EINTR;
> + return;
> }
> + spin_unlock(&fiq->lock);
>
> /*
> - * Either request is already in userspace, or it was forced.
> - * Wait it out.
> + * Womp womp. We sent a request to userspace and now we're getting
> + * killed.
> */
> - wait_event(req->waitq, test_bit(FR_FINISHED, &req->flags));
> + set_bit(FR_INTERRUPTED, &req->flags);
> + /* matches barrier in fuse_dev_do_read() */
> + smp_mb__after_atomic();
> + /* request *must* be FR_SENT here, because we ignored FR_PENDING before */
> + WARN_ON(!test_bit(FR_SENT, &req->flags));
> + queue_interrupt(req);
> }
>
> static void __fuse_request_send(struct fuse_req *req)
>
> avaialble as a full patch here:
> https://github.com/tych0/linux/commit/81b9ff4c8c1af24f6544945da808dbf69a1293f7
>
> but now things are even weirder. Tasks are stuck at the killable wait, but with
> a SIGKILL pending for the thread group.

That's strange. No idea what's going on.

Thanks
Vivek
>
> root@(none):/# cat /proc/187/stack
> [<0>] fuse_simple_request+0x8d9/0x10f0 [fuse]
> [<0>] fuse_flush+0x42f/0x630 [fuse]
> [<0>] filp_close+0x96/0x120
> [<0>] put_files_struct+0x15c/0x2c0
> [<0>] do_exit+0xa00/0x2450
> [<0>] do_group_exit+0xb2/0x2a0
> [<0>] __x64_sys_exit_group+0x35/0x40
> [<0>] do_syscall_64+0x40/0x90
> [<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
> root@(none):/# cat /proc/187/status
> Name: main
> Umask: 0022
> State: S (sleeping)
> Tgid: 187
> Ngid: 0
> Pid: 187
> PPid: 185
> TracerPid: 0
> Uid: 0 0 0 0
> Gid: 0 0 0 0
> FDSize: 0
> Groups:
> NStgid: 187 3
> NSpid: 187 3
> NSpgid: 171 0
> NSsid: 160 0
> Threads: 1
> SigQ: 0/6706
> SigPnd: 0000000000000000
> ShdPnd: 0000000000000100
> SigBlk: 0000000000000000
> SigIgn: 0000000180004002
> SigCgt: 0000000000000000
> CapInh: 0000000000000000
> CapPrm: 000001ffffffffff
> CapEff: 000001ffffffffff
> CapBnd: 000001ffffffffff
> CapAmb: 0000000000000000
> NoNewPrivs: 0
> Seccomp: 0
> Seccomp_filters: 0
> Speculation_Store_Bypass: thread vulnerable
> SpeculationIndirectBranch: conditional enabled
> Cpus_allowed: f
> Cpus_allowed_list: 0-3
> Mems_allowed: 00000000,00000001
> Mems_allowed_list: 0
> voluntary_ctxt_switches: 6
> nonvoluntary_ctxt_switches: 1
>
> Any ideas what's going on here? It also seems I'm not the first person to
> wonder about this:
> https://sourceforge.net/p/fuse/mailman/fuse-devel/thread/CAMp4zn9zTA_A2GJiYo5AD9V5HpeXbzzMP%3DnF0WtwbxRbV3koNA%40mail.gmail.com/#msg36598753
>
> Thanks,
>
> Tycho
>

2022-06-23 23:46:59

by Tycho Andersen

Subject: Re: strange interaction between fuse + pidns

On Thu, Jun 23, 2022 at 05:55:20PM -0400, Vivek Goyal wrote:
> So in this case single process is client as well as server. IOW, one
> thread is fuse server servicing fuse requests and other thread is fuse
> client accessing fuse filesystem?

Yes. Probably an abuse of the API and something people Should Not Do,
but as you say the kernel still shouldn't lock up like this.

> > since the thread has a copy of
> > the fd table with an fd pointing to the same fuse device, the reference
> > count isn't decremented to zero in fuse_dev_release(), and the task hangs
> > forever.
>
> So why did fuse server thread stop responding to fuse messages. Why
> did it not complete flush.

In this particular case I think it's because the application crashed
for unrelated reasons and tried to exit the pidns, hitting this
problem.

> BTW, unkillable wait happens on ly fc->no_interrupt = 1. And this seems
> to be set only if server probably some previous interrupt request
> returned -ENOSYS.
>
> fuse_dev_do_write() {
> else if (oh.error == -ENOSYS)
> fc->no_interrupt = 1;
> }
>
> So a simple workaround might be for server to implement support for
> interrupting requests.

Yes, but that is the libfuse default IIUC.

> Having said that, this does sounds like a problem and probably should
> be fixed at kernel level.
>
> >
> > diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> > index 0e537e580dc1..c604dfcaec26 100644
> > --- a/fs/fuse/dev.c
> > +++ b/fs/fuse/dev.c
> > @@ -297,7 +297,6 @@ void fuse_request_end(struct fuse_req *req)
> > spin_unlock(&fiq->lock);
> > }
> > WARN_ON(test_bit(FR_PENDING, &req->flags));
> > - WARN_ON(test_bit(FR_SENT, &req->flags));
> > if (test_bit(FR_BACKGROUND, &req->flags)) {
> > spin_lock(&fc->bg_lock);
> > clear_bit(FR_BACKGROUND, &req->flags);
> > @@ -381,30 +380,33 @@ static void request_wait_answer(struct fuse_req *req)
> > queue_interrupt(req);
> > }
> >
> > - if (!test_bit(FR_FORCE, &req->flags)) {
> > - /* Only fatal signals may interrupt this */
> > - err = wait_event_killable(req->waitq,
> > - test_bit(FR_FINISHED, &req->flags));
> > - if (!err)
> > - return;
> > + /* Only fatal signals may interrupt this */
> > + err = wait_event_killable(req->waitq,
> > + test_bit(FR_FINISHED, &req->flags));
>
> Trying to do a fatal signal killable wait sounds reasonable. But I am
> not sure about the history.
>
> - Why FORCE requests can't do killable wait.
> - Why flush needs to have FORCE flag set.

args->force implies a few other things besides this killable wait in
fuse_simple_request(), most notably:

req = fuse_request_alloc(fm, GFP_KERNEL | __GFP_NOFAIL);

and

__set_bit(FR_WAITING, &req->flags);

so it seems like it can probably be invoked from some non-user/atomic
context somehow?

> > + if (!err)
> > + return;
> >
> > - spin_lock(&fiq->lock);
> > - /* Request is not yet in userspace, bail out */
> > - if (test_bit(FR_PENDING, &req->flags)) {
> > - list_del(&req->list);
> > - spin_unlock(&fiq->lock);
> > - __fuse_put_request(req);
> > - req->out.h.error = -EINTR;
> > - return;
> > - }
> > + spin_lock(&fiq->lock);
> > + /* Request is not yet in userspace, bail out */
> > + if (test_bit(FR_PENDING, &req->flags)) {
> > + list_del(&req->list);
> > spin_unlock(&fiq->lock);
> > + __fuse_put_request(req);
> > + req->out.h.error = -EINTR;
> > + return;
> > }
> > + spin_unlock(&fiq->lock);
> >
> > /*
> > - * Either request is already in userspace, or it was forced.
> > - * Wait it out.
> > + * Womp womp. We sent a request to userspace and now we're getting
> > + * killed.
> > */
> > - wait_event(req->waitq, test_bit(FR_FINISHED, &req->flags));
> > + set_bit(FR_INTERRUPTED, &req->flags);
> > + /* matches barrier in fuse_dev_do_read() */
> > + smp_mb__after_atomic();
> > + /* request *must* be FR_SENT here, because we ignored FR_PENDING before */
> > + WARN_ON(!test_bit(FR_SENT, &req->flags));
> > + queue_interrupt(req);
> > }
> >
> > static void __fuse_request_send(struct fuse_req *req)
> >
> > avaialble as a full patch here:
> > https://github.com/tych0/linux/commit/81b9ff4c8c1af24f6544945da808dbf69a1293f7
> >
> > but now things are even weirder. Tasks are stuck at the killable wait, but with
> > a SIGKILL pending for the thread group.
>
> That's strange. No idea what's going on.

Thanks for taking a look. This is where it falls apart for me. In
principle the patch seems simple, but this sleeping behavior is beyond
my understanding.

Tycho

2022-06-24 17:55:17

by Vivek Goyal

Subject: Re: strange interaction between fuse + pidns

On Thu, Jun 23, 2022 at 05:41:17PM -0600, Tycho Andersen wrote:
> On Thu, Jun 23, 2022 at 05:55:20PM -0400, Vivek Goyal wrote:
> > So in this case single process is client as well as server. IOW, one
> > thread is fuse server servicing fuse requests and other thread is fuse
> > client accessing fuse filesystem?
>
> Yes. Probably an abuse of the API and something people Should Not Do,
> but as you say the kernel still shouldn't lock up like this.
>
> > > since the thread has a copy of
> > > the fd table with an fd pointing to the same fuse device, the reference
> > > count isn't decremented to zero in fuse_dev_release(), and the task hangs
> > > forever.
> >
> > So why did fuse server thread stop responding to fuse messages. Why
> > did it not complete flush.
>
> In this particular case I think it's because the application crashed
> for unrelated reasons and tried to exit the pidns, hitting this
> problem.
>
> > BTW, unkillable wait happens on ly fc->no_interrupt = 1. And this seems
> > to be set only if server probably some previous interrupt request
> > returned -ENOSYS.
> >
> > fuse_dev_do_write() {
> > else if (oh.error == -ENOSYS)
> > fc->no_interrupt = 1;
> > }
> >
> > So a simple workaround might be for server to implement support for
> > interrupting requests.
>
> Yes, but that is the libfuse default IIUC.

Looking at the libfuse code, I understand the low-level API interface, and
for that it looks like the generic code itself will take care of this
(without needing support from the filesystem):

libfuse/lib/fuse_lowlevel.c

do_interrupt().

>
> > Having said that, this does sounds like a problem and probably should
> > be fixed at kernel level.
> >
> > >
> > > diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> > > index 0e537e580dc1..c604dfcaec26 100644
> > > --- a/fs/fuse/dev.c
> > > +++ b/fs/fuse/dev.c
> > > @@ -297,7 +297,6 @@ void fuse_request_end(struct fuse_req *req)
> > > spin_unlock(&fiq->lock);
> > > }
> > > WARN_ON(test_bit(FR_PENDING, &req->flags));
> > > - WARN_ON(test_bit(FR_SENT, &req->flags));
> > > if (test_bit(FR_BACKGROUND, &req->flags)) {
> > > spin_lock(&fc->bg_lock);
> > > clear_bit(FR_BACKGROUND, &req->flags);
> > > @@ -381,30 +380,33 @@ static void request_wait_answer(struct fuse_req *req)
> > > queue_interrupt(req);
> > > }
> > >
> > > - if (!test_bit(FR_FORCE, &req->flags)) {
> > > - /* Only fatal signals may interrupt this */
> > > - err = wait_event_killable(req->waitq,
> > > - test_bit(FR_FINISHED, &req->flags));
> > > - if (!err)
> > > - return;
> > > + /* Only fatal signals may interrupt this */
> > > + err = wait_event_killable(req->waitq,
> > > + test_bit(FR_FINISHED, &req->flags));
> >
> > Trying to do a fatal signal killable wait sounds reasonable. But I am
> > not sure about the history.
> >
> > - Why FORCE requests can't do killable wait.
> > - Why flush needs to have FORCE flag set.
>
> args->force implies a few other things besides this killable wait in
> fuse_simple_request(), most notably:
>
> req = fuse_request_alloc(fm, GFP_KERNEL | __GFP_NOFAIL);
>
> and
>
> __set_bit(FR_WAITING, &req->flags);

The FR_WAITING stuff is common between both types of requests. We set it
in fuse_get_req() as well, which is called for non-force requests.

So there seem to be only two key differences:

- We allocate the request with the __GFP_NOFAIL flag for force requests,
so the memory allocation can't fail.

- And this special casing of the non-killable wait.

Miklos probably will have more thoughts on this.

Thanks
Vivek

>
> seems like it probably can be invoked from some non-user/atomic
> context somehow?
>
> > > + if (!err)
> > > + return;
> > >
> > > - spin_lock(&fiq->lock);
> > > - /* Request is not yet in userspace, bail out */
> > > - if (test_bit(FR_PENDING, &req->flags)) {
> > > - list_del(&req->list);
> > > - spin_unlock(&fiq->lock);
> > > - __fuse_put_request(req);
> > > - req->out.h.error = -EINTR;
> > > - return;
> > > - }
> > > + spin_lock(&fiq->lock);
> > > + /* Request is not yet in userspace, bail out */
> > > + if (test_bit(FR_PENDING, &req->flags)) {
> > > + list_del(&req->list);
> > > spin_unlock(&fiq->lock);
> > > + __fuse_put_request(req);
> > > + req->out.h.error = -EINTR;
> > > + return;
> > > }
> > > + spin_unlock(&fiq->lock);
> > >
> > > /*
> > > - * Either request is already in userspace, or it was forced.
> > > - * Wait it out.
> > > + * Womp womp. We sent a request to userspace and now we're getting
> > > + * killed.
> > > */
> > > - wait_event(req->waitq, test_bit(FR_FINISHED, &req->flags));
> > > + set_bit(FR_INTERRUPTED, &req->flags);
> > > + /* matches barrier in fuse_dev_do_read() */
> > > + smp_mb__after_atomic();
> > > + /* request *must* be FR_SENT here, because we ignored FR_PENDING before */
> > > + WARN_ON(!test_bit(FR_SENT, &req->flags));
> > > + queue_interrupt(req);
> > > }
> > >
> > > static void __fuse_request_send(struct fuse_req *req)
> > >
> > > avaialble as a full patch here:
> > > https://github.com/tych0/linux/commit/81b9ff4c8c1af24f6544945da808dbf69a1293f7
> > >
> > > but now things are even weirder. Tasks are stuck at the killable wait, but with
> > > a SIGKILL pending for the thread group.
> >
> > That's strange. No idea what's going on.
>
> Thanks for taking a look. This is where it falls apart for me. In
> principle the patch seems simple, but this sleeping behavior is beyond
> my understanding.
>
> Tycho
>

2022-07-11 11:41:41

by Miklos Szeredi

Subject: Re: strange interaction between fuse + pidns

On Thu, 23 Jun 2022 at 19:21, Tycho Andersen <[email protected]> wrote:

> /*
> - * Either request is already in userspace, or it was forced.
> - * Wait it out.
> + * Womp womp. We sent a request to userspace and now we're getting
> + * killed.
> */
> - wait_event(req->waitq, test_bit(FR_FINISHED, &req->flags));

You can't remove this, it's a crucial part of fuse request handling.
Yes, it causes pain, but making *sent* requests killable is a lot more
work.

For one, we'd need to duplicate the caller's locking state (i_rwsem, ...) and
move the request onto a background queue instead of just finishing it
off immediately, so that the shadow locking can be torn down when the
reply actually arrives. This affects a lot of requests.

Or we could special case FUSE_FLUSH, which doesn't have any locking.

The reason force=true is needed for FUSE_FLUSH is that it affects
posix lock state. Not waiting for the reply if the task is killed
could have observable consequences, but my guess is that it's an
uninteresting corner case and would not cause regressions in real
life.
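
(For context, FUSE_FLUSH carries the closing task's lock owner id, which is
why its reply can matter for posix lock state; the request body is, from
include/uapi/linux/fuse.h:)

struct fuse_flush_in {
	uint64_t	fh;
	uint32_t	unused;
	uint32_t	padding;
	uint64_t	lock_owner;
};

A server that tracks posix locks typically drops the locks held by
lock_owner when it handles the flush, so presumably the concern is that,
without waiting, the lock state is not guaranteed to be settled by the time
the killed task is gone.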

Can you try the attached untested patch?

Thanks,
Miklos


Attachments:
fuse-allow-flush-to-be-killed.patch (1.35 kB)

2022-07-11 14:10:21

by Miklos Szeredi

Subject: Re: strange interaction between fuse + pidns

On Mon, 11 Jul 2022 at 12:35, Miklos Szeredi <[email protected]> wrote:
>
> Can you try the attached untested patch?

Updated patch to avoid use after free on req->args.

Still mostly untested.

Thanks,
Miklos


Attachments:
fuse-allow-flush-to-be-killed-v2.patch (2.72 kB)

2022-07-11 21:08:11

by Tycho Andersen

Subject: Re: strange interaction between fuse + pidns

Hi all,

On Mon, Jul 11, 2022 at 03:59:15PM +0200, Miklos Szeredi wrote:
> On Mon, 11 Jul 2022 at 12:35, Miklos Szeredi <[email protected]> wrote:
> >
> > Can you try the attached untested patch?
>
> Updated patch to avoid use after free on req->args.
>
> Still mostly untested.

Thanks, when I applied your patch, I still ended up with tasks stuck
waiting with a SIGKILL pending. So I looked into that and came up with
the patch below. With both your patch and mine, my testcase exits
cleanly.

Eric (or Christian, or anyone), can you comment on the patch below? I
have no idea what this will break. Maybe instead a better approach is
some additional special case in __send_signal_locked()?

Tycho

From b7ea26adcf3546be5745063cc86658acb5ed37e9 Mon Sep 17 00:00:00 2001
From: Tycho Andersen <[email protected]>
Date: Mon, 11 Jul 2022 11:26:58 -0600
Subject: [PATCH] sched: __fatal_signal_pending() should also check shared
signals

The wait_* code uses signal_pending_state() to test whether a thread has
been interrupted, which ultimately uses __fatal_signal_pending() to detect
if there is a fatal signal.

When a pid ns dies, in zap_pid_ns_processes() it does:

group_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_MAX);

for all the tasks in the pid ns. That calls through:

group_send_sig_info() ->
do_send_sig_info() ->
send_signal_locked() ->
__send_signal_locked()

which does:

pending = (type != PIDTYPE_PID) ? &t->signal->shared_pending : &t->pending;

which puts sigkill in the set of shared signals, but not the individual
pending ones. If tasks are stuck in a killable wait (e.g. a fuse flush
operation), they won't see this shared signal, and will hang forever, since
TIF_SIGPENDING is set, but the fatal signal can't be detected.
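
(For reference, a paraphrased sketch of the wait-side check, i.e. roughly
what signal_pending_state() does; not verbatim kernel source:)

static inline int signal_pending_state(unsigned int state, struct task_struct *p)
{
	if (!(state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
		return 0;
	if (!signal_pending(p))
		return 0;

	/* for a killable wait only a *fatal* pending signal counts, and
	 * __fatal_signal_pending() only looks at the per-thread set */
	return (state & TASK_INTERRUPTIBLE) || __fatal_signal_pending(p);
}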

Signed-off-by: Tycho Andersen <[email protected]>
---
include/linux/sched/signal.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index cafbe03eed01..a033ccb0a729 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -402,7 +402,8 @@ static inline int signal_pending(struct task_struct *p)

 static inline int __fatal_signal_pending(struct task_struct *p)
 {
-	return unlikely(sigismember(&p->pending.signal, SIGKILL));
+	return unlikely(sigismember(&p->pending.signal, SIGKILL) ||
+			sigismember(&p->signal->shared_pending.signal, SIGKILL));
 }

 static inline int fatal_signal_pending(struct task_struct *p)

base-commit: 32346491ddf24599decca06190ebca03ff9de7f8
--
2.34.1

2022-07-11 22:08:30

by Eric W. Biederman

Subject: Re: strange interaction between fuse + pidns

Tycho Andersen <[email protected]> writes:

> Hi all,
>
> On Mon, Jul 11, 2022 at 03:59:15PM +0200, Miklos Szeredi wrote:
>> On Mon, 11 Jul 2022 at 12:35, Miklos Szeredi <[email protected]> wrote:
>> >
>> > Can you try the attached untested patch?
>>
>> Updated patch to avoid use after free on req->args.
>>
>> Still mostly untested.
>
> Thanks, when I applied your patch, I still ended up with tasks stuck
> waiting with a SIGKILL pending. So I looked into that and came up with
> the patch below. With both your patch and mine, my testcase exits
> cleanly.
>
> Eric (or Christian, or anyone), can you comment on the patch below? I
> have no idea what this will break. Maybe instead a better approach is
> some additional special case in __send_signal_locked()?
>
> Tycho
>
> From b7ea26adcf3546be5745063cc86658acb5ed37e9 Mon Sep 17 00:00:00 2001
> From: Tycho Andersen <[email protected]>
> Date: Mon, 11 Jul 2022 11:26:58 -0600
> Subject: [PATCH] sched: __fatal_signal_pending() should also check shared
> signals
>
> The wait_* code uses signal_pending_state() to test whether a thread has
> been interrupted, which ultimately uses __fatal_signal_pending() to detect
> if there is a fatal signal.
>
> When a pid ns dies, in zap_pid_ns_processes() it does:
>
> group_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_MAX);
>
> for all the tasks in the pid ns. That calls through:
>
> group_send_sig_info() ->
> do_send_sig_info() ->
> send_signal_locked() ->
> __send_signal_locked()
>
> which does:
>
> pending = (type != PIDTYPE_PID) ? &t->signal->shared_pending : &t->pending;
>
> which puts sigkill in the set of shared signals, but not the individual
> pending ones. If tasks are stuck in a killable wait (e.g. a fuse flush
> operation), they won't see this shared signal, and will hang forever, since
> TIF_SIGPENDING is set, but the fatal signal can't be detected.

Hmm.

That is perplexing.

__send_signal_locked calls complete_signal. Then if any of the tasks of
the process can receive the signal, complete_signal will loop through
all of the tasks of the process and set the per thread SIGKILL. Pretty
much by definition tasks can always receive SIGKILL.
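
(Roughly, from kernel/signal.c:complete_signal(), trimmed and paraphrased:)

	if (wants_signal(sig, p))
		t = p;
	else if ((type == PIDTYPE_PID) || thread_group_empty(p))
		/* just one thread, and it does not need to be woken */
		return;
	else {
		/* ... otherwise hunt for some thread that wants the signal ... */
	}

	if (sig_fatal(p, sig) && ...) {
		/* short-circuit delivery: give every thread a per-thread SIGKILL */
		t = p;
		do {
			task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
			sigaddset(&t->pending.signal, SIGKILL);
			signal_wake_up(t, 1);
		} while_each_thread(p, t);
		return;
	}
	signal_wake_up(t, sig == SIGKILL);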

Is complete_signal not able to do that?

The patch below really should not be necessary, and if I can push my
pending work over the finish line it won't even make sense.

As it is, using the per-thread SIGKILL to indicate that a fatal signal
has been short-circuit delivered is an abuse. That abuse, as well as
being unclean, tends to confuse people reading the code.

Eric

> Signed-off-by: Tycho Andersen <[email protected]>
> ---
> include/linux/sched/signal.h | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> index cafbe03eed01..a033ccb0a729 100644
> --- a/include/linux/sched/signal.h
> +++ b/include/linux/sched/signal.h
> @@ -402,7 +402,8 @@ static inline int signal_pending(struct task_struct *p)
>
> static inline int __fatal_signal_pending(struct task_struct *p)
> {
> - return unlikely(sigismember(&p->pending.signal, SIGKILL));
> + return unlikely(sigismember(&p->pending.signal, SIGKILL) ||
> + sigismember(&p->signal->shared_pending.signal, SIGKILL));
> }
>
> static inline int fatal_signal_pending(struct task_struct *p)
>
> base-commit: 32346491ddf24599decca06190ebca03ff9de7f8

2022-07-11 23:36:11

by Tycho Andersen

Subject: Re: strange interaction between fuse + pidns

On Mon, Jul 11, 2022 at 04:37:12PM -0500, Eric W. Biederman wrote:
> Tycho Andersen <[email protected]> writes:
>
> > Hi all,
> >
> > On Mon, Jul 11, 2022 at 03:59:15PM +0200, Miklos Szeredi wrote:
> >> On Mon, 11 Jul 2022 at 12:35, Miklos Szeredi <[email protected]> wrote:
> >> >
> >> > Can you try the attached untested patch?
> >>
> >> Updated patch to avoid use after free on req->args.
> >>
> >> Still mostly untested.
> >
> > Thanks, when I applied your patch, I still ended up with tasks stuck
> > waiting with a SIGKILL pending. So I looked into that and came up with
> > the patch below. With both your patch and mine, my testcase exits
> > cleanly.
> >
> > Eric (or Christian, or anyone), can you comment on the patch below? I
> > have no idea what this will break. Maybe instead a better approach is
> > some additional special case in __send_signal_locked()?
> >
> > Tycho
> >
> > From b7ea26adcf3546be5745063cc86658acb5ed37e9 Mon Sep 17 00:00:00 2001
> > From: Tycho Andersen <[email protected]>
> > Date: Mon, 11 Jul 2022 11:26:58 -0600
> > Subject: [PATCH] sched: __fatal_signal_pending() should also check shared
> > signals
> >
> > The wait_* code uses signal_pending_state() to test whether a thread has
> > been interrupted, which ultimately uses __fatal_signal_pending() to detect
> > if there is a fatal signal.
> >
> > When a pid ns dies, in zap_pid_ns_processes() it does:
> >
> > group_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_MAX);
> >
> > for all the tasks in the pid ns. That calls through:
> >
> > group_send_sig_info() ->
> > do_send_sig_info() ->
> > send_signal_locked() ->
> > __send_signal_locked()
> >
> > which does:
> >
> > pending = (type != PIDTYPE_PID) ? &t->signal->shared_pending : &t->pending;
> >
> > which puts sigkill in the set of shared signals, but not the individual
> > pending ones. If tasks are stuck in a killable wait (e.g. a fuse flush
> > operation), they won't see this shared signal, and will hang forever, since
> > TIF_SIGPENDING is set, but the fatal signal can't be detected.
>
> Hmm.
>
> That is perplexing.

Thanks for taking a look.

> __send_signal_locked calls complete_signal. Then if any of the tasks of
> the process can receive the signal, complete_signal will loop through
> all of the tasks of the process and set the per thread SIGKILL. Pretty
> much by definition tasks can always receive SIGKILL.
>
> Is complete_signal not being able to do that?

In my specific case it was because my testcase was already trying to
exit and had set PF_EXITING when the signal was delivered, so
complete_signal() was indeed not able to do that, since PF_EXITING is
checked before SIGKILL in wants_signal().

But I changed my testcase to sleep instead of exit, and I get the
same hang behavior, even though complete_signal() does add SIGKILL to
the set. So there's something else going on there...

> The patch below really should not be necessary, and I have pending work
> that if I can push over the finish line won't even make sense.
>
> As it is currently an abuse to use the per thread SIGKILL to indicate
> that a fatal signal has been short circuit delivered. That abuse as
> well as being unclean tends to confuse people reading the code.

How close is your work? I'm wondering if it's worth investigating the
non-PF_EXITING case further, or if we should just land this since it
fixes the PF_EXITING case as well. Or maybe just do something like
this in addition:

diff --git a/kernel/signal.c b/kernel/signal.c
index 6f86fda5e432..0f71dfb1c3d2 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -982,12 +982,12 @@ static inline bool wants_signal(int sig, struct task_struct *p)
 	if (sigismember(&p->blocked, sig))
 		return false;

-	if (p->flags & PF_EXITING)
-		return false;
-
 	if (sig == SIGKILL)
 		return true;

+	if (p->flags & PF_EXITING)
+		return false;
+
 	if (task_is_stopped_or_traced(p))
 		return false;

?

Tycho

2022-07-11 23:39:50

by Eric W. Biederman

Subject: Re: strange interaction between fuse + pidns

Tycho Andersen <[email protected]> writes:

> On Mon, Jul 11, 2022 at 04:37:12PM -0500, Eric W. Biederman wrote:
>> Tycho Andersen <[email protected]> writes:
>>
>> > Hi all,
>> >
>> > On Mon, Jul 11, 2022 at 03:59:15PM +0200, Miklos Szeredi wrote:
>> >> On Mon, 11 Jul 2022 at 12:35, Miklos Szeredi <[email protected]> wrote:
>> >> >
>> >> > Can you try the attached untested patch?
>> >>
>> >> Updated patch to avoid use after free on req->args.
>> >>
>> >> Still mostly untested.
>> >
>> > Thanks, when I applied your patch, I still ended up with tasks stuck
>> > waiting with a SIGKILL pending. So I looked into that and came up with
>> > the patch below. With both your patch and mine, my testcase exits
>> > cleanly.
>> >
>> > Eric (or Christian, or anyone), can you comment on the patch below? I
>> > have no idea what this will break. Maybe instead a better approach is
>> > some additional special case in __send_signal_locked()?
>> >
>> > Tycho
>> >
>> > From b7ea26adcf3546be5745063cc86658acb5ed37e9 Mon Sep 17 00:00:00 2001
>> > From: Tycho Andersen <[email protected]>
>> > Date: Mon, 11 Jul 2022 11:26:58 -0600
>> > Subject: [PATCH] sched: __fatal_signal_pending() should also check shared
>> > signals
>> >
>> > The wait_* code uses signal_pending_state() to test whether a thread has
>> > been interrupted, which ultimately uses __fatal_signal_pending() to detect
>> > if there is a fatal signal.
>> >
>> > When a pid ns dies, in zap_pid_ns_processes() it does:
>> >
>> > group_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_MAX);
>> >
>> > for all the tasks in the pid ns. That calls through:
>> >
>> > group_send_sig_info() ->
>> > do_send_sig_info() ->
>> > send_signal_locked() ->
>> > __send_signal_locked()
>> >
>> > which does:
>> >
>> > pending = (type != PIDTYPE_PID) ? &t->signal->shared_pending : &t->pending;
>> >
>> > which puts sigkill in the set of shared signals, but not the individual
>> > pending ones. If tasks are stuck in a killable wait (e.g. a fuse flush
>> > operation), they won't see this shared signal, and will hang forever, since
>> > TIF_SIGPENDING is set, but the fatal signal can't be detected.
>>
>> Hmm.
>>
>> That is perplexing.
>
> Thanks for taking a look.
>
>> __send_signal_locked calls complete_signal. Then if any of the tasks of
>> the process can receive the signal, complete_signal will loop through
>> all of the tasks of the process and set the per thread SIGKILL. Pretty
>> much by definition tasks can always receive SIGKILL.
>>
>> Is complete_signal not being able to do that?
>
> In my specific case it was because my testcase was already trying to
> exit and had set PF_EXITING when the signal is delivered, so
> complete_signal() was indeed not able to do that since PF_EXITING is
> checked before SIGKILL in wants_signal().
>
> But I changed my testacase to sleep instead of exit, and I get the
> same hang behavior, even though complete_signal() does add SIGKILL to
> the set. So there's something else going on there...
>
>> The patch below really should not be necessary, and I have pending work
>> that if I can push over the finish line won't even make sense.
>>
>> As it is currently an abuse to use the per thread SIGKILL to indicate
>> that a fatal signal has been short circuit delivered. That abuse as
>> well as being unclean tends to confuse people reading the code.
>
> How close is your work? I'm wondering if it's worth investigating the
> non-PF_EXITING case further, or if we should just land this since it
> fixes the PF_EXITING case as well. Or maybe just do something like
> this in addition:

It is not different enough to change the semantics. What I am aiming
for is having a dedicated flag indicating a task will exit, that
fatal_signal_pending can check. And I intend to make that flag one way
so that once it is set it will never be cleared.

Which sort of argues against your patch below.

It most definitely does not sort out the sleep case.


The other thing I have played with that might be relevant was removing
the explicit wait in zap_pid_ns_processes and simply not allowing wait
to reap the pid namespace init until all its children had been reaped.
Essentially how we deal with the thread group leader for ordinary
processes. Does that sound like it might help in the fuse case?

I am not certain how far out such a change would be (it has been a couple
of years since I played with implementing it) but it can be made much
sooner if it demonstrably breaks some deadlocks, and generally makes
the kernel easier to maintain.

Eric

2022-07-12 14:00:53

by Tycho Andersen

Subject: Re: strange interaction between fuse + pidns

On Mon, Jul 11, 2022 at 06:06:21PM -0500, Eric W. Biederman wrote:
> Tycho Andersen <[email protected]> writes:
> It is not different enough to change the semantics. What I am aiming
> for is having a dedicated flag indicating a task will exit, that
> fatal_signal_pending can check. And I intend to make that flag one way
> so that once it is set it will never be cleared.

Ok - how far out is that? I'd like to try to convince Miklos to land
the fuse part of this fix now, but without the "look at shared signals
too" patch, that fix is useless. I'm not married to my patch, but I
would like to get this fixed somehow soon.

> The other thing I have played with that might be relevant was removing
> the explicit wait in zap_pid_ns_processes and simply not allowing wait
> to reap the pid namespace init until all it's children had been reaped.
> Essentially how we deal with the thread group leader for ordinary
> processes. Does that sound like it might help in the fuse case?

No, the problem is that the wait code doesn't know to look in the
right place, so waiting later still won't help.

Tycho

2022-07-12 15:12:29

by Eric W. Biederman

Subject: Re: strange interaction between fuse + pidns

Tycho Andersen <[email protected]> writes:

> On Mon, Jul 11, 2022 at 06:06:21PM -0500, Eric W. Biederman wrote:
>> Tycho Andersen <[email protected]> writes:
>> It is not different enough to change the semantics. What I am aiming
>> for is having a dedicated flag indicating a task will exit, that
>> fatal_signal_pending can check. And I intend to make that flag one way
>> so that once it is set it will never be cleared.
>
> Ok - how far out is that? I'd like to try to convince Miklos to land
> the fuse part of this fix now, but without the "look at shared signals
> too" patch, that fix is useless. I'm not married to my patch, but I
> would like to get this fixed somehow soon.

My point is that we need to figure out why you need the "look at shared
signals" change.

If I can get everything reviewed my changes will be in the next merge
window (it unfortunately always takes longer to get the code reviewed
than I would like).

However when my changes land does not matter. What you are trying to
solve is orthogonal of my on-going work.

The problem is that looking at shared signals is fundamentally broken.
A case in point is that kernel threads can have a pending SIGKILL that
is not a fatal signal, as kernel threads are allowed to ignore or even
handle SIGKILL.
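
(For illustration, a minimal sketch of such a kernel thread; my_kthread is
hypothetical:)

#include <linux/kthread.h>
#include <linux/sched/signal.h>
#include <linux/delay.h>

static int my_kthread(void *data)
{
	allow_signal(SIGKILL);	/* kthreads ignore signals unless they opt in */

	while (!kthread_should_stop()) {
		if (signal_pending(current)) {
			/* SIGKILL is queued, but this thread decides what to
			 * do about it; nothing forces it to die */
			flush_signals(current);
		}
		msleep(100);
	}
	return 0;
}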

If you want to change fatal_signal_pending to include PF_EXITING I would
need to double-check the implications, but I think that would work, and
it would not have the problems of including the shared pending state of
SIGKILL.

>> The other thing I have played with that might be relevant was removing
>> the explicit wait in zap_pid_ns_processes and simply not allowing wait
>> to reap the pid namespace init until all it's children had been reaped.
>> Essentially how we deal with the thread group leader for ordinary
>> processes. Does that sound like it might help in the fuse case?
>
> No, the problem is that the wait code doesn't know to look in the
> right place, so waiting later still won't help.

I was suggesting to modify the kernel so that zap_pid_ns_processes would
not wait for the zapped processes. Instead I was proposing that
delay_group_leader called from wait_consider_task would simply refuse to
allow the init process of a pid namespace to be reaped until every other
process of that pid namespace had exited.

You can prototype how that would affect the deadlock by simply removing
the waiting from zap_pid_ns_processes.

I suggest that simply because that has the potential to remove some of
the strange pid namespace cases.

I don't understand the problematic interaction between pid namespace
shutdown and the fuse daemon, so I am merely suggesting a possibility
that I know can simplify pid namespace shutdown.

Something like:

diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index f4f8cb0435b4..d22a30b0b0cf 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -207,47 +207,6 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
 	read_unlock(&tasklist_lock);
 	rcu_read_unlock();

-	/*
-	 * Reap the EXIT_ZOMBIE children we had before we ignored SIGCHLD.
-	 * kernel_wait4() will also block until our children traced from the
-	 * parent namespace are detached and become EXIT_DEAD.
-	 */
-	do {
-		clear_thread_flag(TIF_SIGPENDING);
-		rc = kernel_wait4(-1, NULL, __WALL, NULL);
-	} while (rc != -ECHILD);
-
-	/*
-	 * kernel_wait4() misses EXIT_DEAD children, and EXIT_ZOMBIE
-	 * process whose parents processes are outside of the pid
-	 * namespace. Such processes are created with setns()+fork().
-	 *
-	 * If those EXIT_ZOMBIE processes are not reaped by their
-	 * parents before their parents exit, they will be reparented
-	 * to pid_ns->child_reaper. Thus pidns->child_reaper needs to
-	 * stay valid until they all go away.
-	 *
-	 * The code relies on the pid_ns->child_reaper ignoring
-	 * SIGCHILD to cause those EXIT_ZOMBIE processes to be
-	 * autoreaped if reparented.
-	 *
-	 * Semantically it is also desirable to wait for EXIT_ZOMBIE
-	 * processes before allowing the child_reaper to be reaped, as
-	 * that gives the invariant that when the init process of a
-	 * pid namespace is reaped all of the processes in the pid
-	 * namespace are gone.
-	 *
-	 * Once all of the other tasks are gone from the pid_namespace
-	 * free_pid() will awaken this task.
-	 */
-	for (;;) {
-		set_current_state(TASK_INTERRUPTIBLE);
-		if (pid_ns->pid_allocated == init_pids)
-			break;
-		schedule();
-	}
-	__set_current_state(TASK_RUNNING);
-
 	if (pid_ns->reboot)
 		current->signal->group_exit_code = pid_ns->reboot;


Eric

2022-07-12 15:28:33

by Tycho Andersen

Subject: Re: strange interaction between fuse + pidns

On Tue, Jul 12, 2022 at 09:34:50AM -0500, Eric W. Biederman wrote:
> Tycho Andersen <[email protected]> writes:
>
> > On Mon, Jul 11, 2022 at 06:06:21PM -0500, Eric W. Biederman wrote:
> >> Tycho Andersen <[email protected]> writes:
> >> It is not different enough to change the semantics. What I am aiming
> >> for is having a dedicated flag indicating a task will exit, that
> >> fatal_signal_pending can check. And I intend to make that flag one way
> >> so that once it is set it will never be cleared.
> >
> > Ok - how far out is that? I'd like to try to convince Miklos to land
> > the fuse part of this fix now, but without the "look at shared signals
> > too" patch, that fix is useless. I'm not married to my patch, but I
> > would like to get this fixed somehow soon.
>
> My point is that we need to figure out why you need the look at shared
> signals.

At least in the case where the task was already exiting, it's because
complete_signal() never wakes it up.

> If I can get everything reviewed my changes will be in the next merge
> window (it unfortunately always takes longer to get the code reviewed
> than I would like).
>
> However when my changes land does not matter. What you are trying to
> solve is orthogonal of my on-going work.
>
> The problem is that looking at shared signals is fundamentally broken.
> A case in point is that kernel threads can have a pending SIGKILL that
> is not a fatal signal. As kernel threads are allowed to ignore or even
> handle SIGKILL.
>
> If you want to change fatal_signal_pending to include PF_EXITING I would
> need to double check the implications but I think that would work, and
> would not have the problems including the shared pending state of
> SIGKILL.

I think that would work. I'll test it out, thanks.

> >> The other thing I have played with that might be relevant was removing
> >> the explicit wait in zap_pid_ns_processes and simply not allowing wait
> >> to reap the pid namespace init until all it's children had been reaped.
> >> Essentially how we deal with the thread group leader for ordinary
> >> processes. Does that sound like it might help in the fuse case?
> >
> > No, the problem is that the wait code doesn't know to look in the
> > right place, so waiting later still won't help.
>
> I was suggesting to modify the kernel so that zap_pid_ns_processes would
> not wait for the zapped processes. Instead I was proposing that
> delay_group_leader called from wait_consider_task would simply refuse to
> allow the init process of a pid namespace to be reaped until every other
> process of that pid namespace had exited.
>
> You can prototype how that would affect the deadlock by simply removing
> the waiting from zap_pid_ns_processes.
>
> I suggest that simply because that has the potential to remove some of
> the strange pid namespace cases.
>
> I don't understand the problematic interaction between pid namespace
> shutdown and the fuse daemon, so I am merely suggesting a possibility
> that I know can simplify pid namespace shutdown.
>
> Something like:
>
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index f4f8cb0435b4..d22a30b0b0cf 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -207,47 +207,6 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
> read_unlock(&tasklist_lock);
> rcu_read_unlock();
>
> - /*
> - * Reap the EXIT_ZOMBIE children we had before we ignored SIGCHLD.
> - * kernel_wait4() will also block until our children traced from the
> - * parent namespace are detached and become EXIT_DEAD.
> - */
> - do {
> - clear_thread_flag(TIF_SIGPENDING);
> - rc = kernel_wait4(-1, NULL, __WALL, NULL);
> - } while (rc != -ECHILD);
> -
> - /*
> - * kernel_wait4() misses EXIT_DEAD children, and EXIT_ZOMBIE
> - * process whose parents processes are outside of the pid
> - * namespace. Such processes are created with setns()+fork().
> - *
> - * If those EXIT_ZOMBIE processes are not reaped by their
> - * parents before their parents exit, they will be reparented
> - * to pid_ns->child_reaper. Thus pidns->child_reaper needs to
> - * stay valid until they all go away.
> - *
> - * The code relies on the pid_ns->child_reaper ignoring
> - * SIGCHILD to cause those EXIT_ZOMBIE processes to be
> - * autoreaped if reparented.
> - *
> - * Semantically it is also desirable to wait for EXIT_ZOMBIE
> - * processes before allowing the child_reaper to be reaped, as
> - * that gives the invariant that when the init process of a
> - * pid namespace is reaped all of the processes in the pid
> - * namespace are gone.
> - *
> - * Once all of the other tasks are gone from the pid_namespace
> - * free_pid() will awaken this task.
> - */
> - for (;;) {
> - set_current_state(TASK_INTERRUPTIBLE);
> - if (pid_ns->pid_allocated == init_pids)
> - break;
> - schedule();
> - }
> - __set_current_state(TASK_RUNNING);
> -
> if (pid_ns->reboot)
> current->signal->group_exit_code = pid_ns->reboot;

Yes, but we need to add the wait to delay_group_leader(), and if the
tasks are stuck indefinitely looking at the wrong condition, I don't
see how moving it will help resolve things.

Thanks,

Tycho

2022-07-13 18:28:05

by Tycho Andersen

Subject: [PATCH] sched: __fatal_signal_pending() should also check PF_EXITING

The wait_* code uses signal_pending_state() to test whether a thread has
been interrupted, which ultimately uses __fatal_signal_pending() to detect
if there is a fatal signal.

When a pid ns dies, it does:

group_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_MAX);

for all the tasks in the pid ns. That calls through:

group_send_sig_info() ->
do_send_sig_info() ->
send_signal_locked() ->
__send_signal_locked()

which does:

pending = (type != PIDTYPE_PID) ? &t->signal->shared_pending : &t->pending;

which puts sigkill in the set of shared signals, but not the individual
pending ones. When complete_signal() is called at the end of
__send_signal_locked(), if the task already had PF_EXITING (i.e. was
already waiting on something in its fd closing path like a fuse flush),
complete_signal() will not wake up the thread, since wants_signal() checks
PF_EXITING before testing for SIGKILL.

If tasks are stuck in a killable wait (e.g. a fuse flush operation), they
won't see this shared signal, and will hang forever, since TIF_SIGPENDING
is set, but the fatal signal can't be detected. So, let's also look for
PF_EXITING in __fatal_signal_pending().

Signed-off-by: Tycho Andersen <[email protected]>
---
include/linux/sched/signal.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index cafbe03eed01..c20b7e1d89ef 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -402,7 +402,8 @@ static inline int signal_pending(struct task_struct *p)

 static inline int __fatal_signal_pending(struct task_struct *p)
 {
-	return unlikely(sigismember(&p->pending.signal, SIGKILL));
+	return unlikely(sigismember(&p->pending.signal, SIGKILL) ||
+			p->flags & PF_EXITING);
 }

 static inline int fatal_signal_pending(struct task_struct *p)

base-commit: 32346491ddf24599decca06190ebca03ff9de7f8
--
2.34.1

2022-07-20 15:39:01

by Serge E. Hallyn

Subject: Re: [PATCH] sched: __fatal_signal_pending() should also check PF_EXITING

On Wed, Jul 13, 2022 at 11:53:05AM -0600, Tycho Andersen wrote:
> The wait_* code uses signal_pending_state() to test whether a thread has
> been interrupted, which ultimately uses __fatal_signal_pending() to detect
> if there is a fatal signal.
>
> When a pid ns dies, it does:
>
> group_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_MAX);
>
> for all the tasks in the pid ns. That calls through:
>
> group_send_sig_info() ->
> do_send_sig_info() ->
> send_signal_locked() ->
> __send_signal_locked()
>
> which does:
>
> pending = (type != PIDTYPE_PID) ? &t->signal->shared_pending : &t->pending;
>
> which puts sigkill in the set of shared signals, but not the individual
> pending ones. When complete_signal() is called at the end of
> __send_signal_locked(), if the task already had PF_EXITING (i.e. was
> already waiting on something in its fd closing path like a fuse flush),
> complete_signal() will not wake up the thread, since wants_signal() checks
> PF_EXITING before testing for SIGKILL.
>
> If tasks are stuck in a killable wait (e.g. a fuse flush operation), they
> won't see this shared signal, and will hang forever, since TIF_SIGPENDING
> is set, but the fatal signal can't be detected. So, let's also look for
> PF_EXITING in __fatal_signal_pending().
>
> Signed-off-by: Tycho Andersen <[email protected]>

Cool, thanks for nailing this down!

I assume you've been running this on some boxes with no weird effects?

> ---
> include/linux/sched/signal.h | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> index cafbe03eed01..c20b7e1d89ef 100644
> --- a/include/linux/sched/signal.h
> +++ b/include/linux/sched/signal.h
> @@ -402,7 +402,8 @@ static inline int signal_pending(struct task_struct *p)
>
> static inline int __fatal_signal_pending(struct task_struct *p)
> {
> - return unlikely(sigismember(&p->pending.signal, SIGKILL));
> + return unlikely(sigismember(&p->pending.signal, SIGKILL) ||
> + p->flags & PF_EXITING);

Looking around at the callers this does seem safe, but the name does
now seem misleading. Should this be renamed to something like
exiting_or_fatal_signal_pending()?

> }
>
> static inline int fatal_signal_pending(struct task_struct *p)
>
> base-commit: 32346491ddf24599decca06190ebca03ff9de7f8
> --
> 2.34.1
>

2022-07-20 21:44:50

by Tycho Andersen

Subject: Re: [PATCH] sched: __fatal_signal_pending() should also check PF_EXITING

On Wed, Jul 20, 2022 at 10:03:28AM -0500, Serge E. Hallyn wrote:
> On Wed, Jul 13, 2022 at 11:53:05AM -0600, Tycho Andersen wrote:
> > The wait_* code uses signal_pending_state() to test whether a thread has
> > been interrupted, which ultimately uses __fatal_signal_pending() to detect
> > if there is a fatal signal.
> >
> > When a pid ns dies, it does:
> >
> > group_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_MAX);
> >
> > for all the tasks in the pid ns. That calls through:
> >
> > group_send_sig_info() ->
> > do_send_sig_info() ->
> > send_signal_locked() ->
> > __send_signal_locked()
> >
> > which does:
> >
> > pending = (type != PIDTYPE_PID) ? &t->signal->shared_pending : &t->pending;
> >
> > which puts sigkill in the set of shared signals, but not the individual
> > pending ones. When complete_signal() is called at the end of
> > __send_signal_locked(), if the task already had PF_EXITING (i.e. was
> > already waiting on something in its fd closing path like a fuse flush),
> > complete_signal() will not wake up the thread, since wants_signal() checks
> > PF_EXITING before testing for SIGKILL.
> >
> > If tasks are stuck in a killable wait (e.g. a fuse flush operation), they
> > won't see this shared signal, and will hang forever, since TIF_SIGPENDING
> > is set, but the fatal signal can't be detected. So, let's also look for
> > PF_EXITING in __fatal_signal_pending().
> >
> > Signed-off-by: Tycho Andersen <[email protected]>
>
> Cool, thanks for nailing this down!
>
> I assume you've been running this on some boxes with no weird effects?

Yes, but I haven't tested all the paths.

> > ---
> > include/linux/sched/signal.h | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> > index cafbe03eed01..c20b7e1d89ef 100644
> > --- a/include/linux/sched/signal.h
> > +++ b/include/linux/sched/signal.h
> > @@ -402,7 +402,8 @@ static inline int signal_pending(struct task_struct *p)
> >
> > static inline int __fatal_signal_pending(struct task_struct *p)
> > {
> > - return unlikely(sigismember(&p->pending.signal, SIGKILL));
> > + return unlikely(sigismember(&p->pending.signal, SIGKILL) ||
> > + p->flags & PF_EXITING);
>
> Looking around at the callers this does seem safe, but the name does
> now seem misleading. Should this be renamed to something like
> exiting_or_fatal_signal_pending()?

This is why I like my original patch better: it is just expanding the
set of signals to include the shared signals, which are indeed still
fatal pending signals for the task. I don't really understand Eric's
argument about kernel threads ignoring SIGKILL, since kernel threads
can still ignore SIGKILL just fine after this patch.

But yes, assuming Eric is ok with this version, I can send a v2 with
the name change as you suggest.

Thanks for looking.

Tycho

2022-07-21 02:08:58

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCH] sched: __fatal_signal_pending() should also check PF_EXITING

On Wed, Jul 20, 2022 at 02:58:42PM -0600, Tycho Andersen wrote:
> On Wed, Jul 20, 2022 at 10:03:28AM -0500, Serge E. Hallyn wrote:
> > On Wed, Jul 13, 2022 at 11:53:05AM -0600, Tycho Andersen wrote:
> > > The wait_* code uses signal_pending_state() to test whether a thread has
> > > been interrupted, which ultimately uses __fatal_signal_pending() to detect
> > > if there is a fatal signal.
> > >
> > > When a pid ns dies, it does:
> > >
> > > group_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_MAX);
> > >
> > > for all the tasks in the pid ns. That calls through:
> > >
> > > group_send_sig_info() ->
> > > do_send_sig_info() ->
> > > send_signal_locked() ->
> > > __send_signal_locked()
> > >
> > > which does:
> > >
> > > pending = (type != PIDTYPE_PID) ? &t->signal->shared_pending : &t->pending;
> > >
> > > which puts sigkill in the set of shared signals, but not the individual
> > > pending ones. When complete_signal() is called at the end of
> > > __send_signal_locked(), if the task already had PF_EXITING (i.e. was
> > > already waiting on something in its fd closing path like a fuse flush),
> > > complete_signal() will not wake up the thread, since wants_signal() checks
> > > PF_EXITING before testing for SIGKILL.
> > >
> > > If tasks are stuck in a killable wait (e.g. a fuse flush operation), they
> > > won't see this shared signal, and will hang forever, since TIF_SIGPENDING
> > > is set, but the fatal signal can't be detected. So, let's also look for
> > > PF_EXITING in __fatal_signal_pending().
> > >
> > > Signed-off-by: Tycho Andersen <[email protected]>
> >
> > Cool, thanks for nailing this down!
> >
> > I assume you've been running this on some boxes with no weird effects?
>
> Yes, but I haven't tested all the paths.
>
> > > ---
> > > include/linux/sched/signal.h | 3 ++-
> > > 1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> > > index cafbe03eed01..c20b7e1d89ef 100644
> > > --- a/include/linux/sched/signal.h
> > > +++ b/include/linux/sched/signal.h
> > > @@ -402,7 +402,8 @@ static inline int signal_pending(struct task_struct *p)
> > >
> > > static inline int __fatal_signal_pending(struct task_struct *p)
> > > {
> > > - return unlikely(sigismember(&p->pending.signal, SIGKILL));
> > > + return unlikely(sigismember(&p->pending.signal, SIGKILL) ||
> > > + p->flags & PF_EXITING);
> >
> > Looking around at the callers this does seem safe, but the name does
> > now seem misleading. Should this be renamed to something like
> > exiting_or_fatal_signal_pending()?
>
> This is why I like my original patch better: it is just expanding the
> set of signals to include the shared signals, which are indeed still
> fatal pending signals for the task. I don't really understand Eric's
> argument about kernel threads ignoring SIGKILL, since kernel threads

Oh - I didn't either - checking the sigkill in shared signals *seems*
legit if they can be put there - but since you posted the new patch I
assumed his reasoning was clear to you. I know Eric's busy, cc:ing Oleg
for his interpretation too.

> can still ignore SIGKILL just fine after this patch.
>
> But yes, assuming Eric is ok with this version, I can send a v2 with
> the name change as you suggest.
>
> Thanks for looking.
>
> Tycho

2022-07-27 16:20:11

by Tycho Andersen

[permalink] [raw]
Subject: Re: [PATCH] sched: __fatal_signal_pending() should also check PF_EXITING

Hi all,

On Wed, Jul 20, 2022 at 08:54:59PM -0500, Serge E. Hallyn wrote:
> Oh - I didn't either - checking the sigkill in shared signals *seems*
> legit if they can be put there - but since you posted the new patch I
> assumed his reasoning was clear to you. I know Eric's busy, cc:ing Oleg
> for his interpretation too.

Any thoughts on this?

Thanks,

Tycho

2022-07-27 18:20:24

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] sched: __fatal_signal_pending() should also check PF_EXITING

Tycho Andersen <[email protected]> writes:

> Hi all,
>
> On Wed, Jul 20, 2022 at 08:54:59PM -0500, Serge E. Hallyn wrote:
>> Oh - I didn't either - checking the sigkill in shared signals *seems*
>> legit if they can be put there - but since you posted the new patch I
>> assumed his reasoning was clear to you. I know Eric's busy, cc:ing Oleg
>> for his interpretation too.
>
> Any thoughts on this?

Having __fatal_signal_pending check SIGKILL in shared signals is
completely and utterly wrong.

What __fatal_signal_pending reports is if a signal has gone through
short circuit delivery after determining that the delivery of the
signal will terminate the process.

Using "sigismember(&tsk->pending.signal, SIGKILL)" to report that a
fatal signal has experienced short circuit delivery is a bit of an
abuse, but essentially harmless as tkill of SIGKILL to a thread will
result in every thread in the process experiencing short circuit
delivery of the fatal SIGKILL. So a pending SIGKILL can't really mean
anything else.



After having looked at the code a little more I can unfortunately also
say that testing PF_EXITING in __fatal_signal_pending will cause
kernel_wait4 in zap_pid_ns_processes to not sleep, and instead to return
0. Which will cause zap_pid_ns_processes to busy wait. That seems very
unfortunate.

I hadn't realized it at the time I wrote zap_pid_ns_processes but I
think anything called from do_exit that cares about signal pending state
is pretty much broken and needs to be fixed.



So the question is how do we fix the problem in fuse that shows up
during a pid namespace exit without having interruptible sleeps we need
to wake up?

What are the code paths that experience the problem?

Will refactoring zap_pid_ns_processes as I have proposed so that it does
not use kernel_wait4 help sort this out? AKA make it work something
like the thread group leader of a process and not allow wait to reap the
init process of a pid namespace until all of the processes in the pid
namespace are gone. Now that I see the problem in using
kernel_wait4, it looks like zap_pid_ns_processes needs to stop calling
kernel_wait4 regardless of the fuse problem.

Eric




2022-07-27 19:05:00

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] sched: __fatal_signal_pending() should also check PF_EXITING

On 07/27, Tycho Andersen wrote:
>
> Hi all,
>
> On Wed, Jul 20, 2022 at 08:54:59PM -0500, Serge E. Hallyn wrote:
> > Oh - I didn't either - checking the sigkill in shared signals *seems*
> > legit if they can be put there - but since you posted the new patch I
> > assumed his reasoning was clear to you. I know Eric's busy, cc:ing Oleg
> > for his interpretation too.
>
> Any thoughts on this?

Cough... I don't know what I can say except I personally dislike this
patch no matter what ;)

And I do not understand how this patch can help. OK, a single-threaded
PF_EXITING task sleeps in TASK_KILLABLE. send_signal_locked() won't
wake it up anyway?

I must have missed something.

Oleg.

2022-07-27 19:27:17

by Tycho Andersen

[permalink] [raw]
Subject: Re: [PATCH] sched: __fatal_signal_pending() should also check PF_EXITING

On Wed, Jul 27, 2022 at 07:55:39PM +0200, Oleg Nesterov wrote:
> On 07/27, Tycho Andersen wrote:
> >
> > Hi all,
> >
> > On Wed, Jul 20, 2022 at 08:54:59PM -0500, Serge E. Hallyn wrote:
> > > Oh - I didn't either - checking the sigkill in shared signals *seems*
> > > legit if they can be put there - but since you posted the new patch I
> > > assumed his reasoning was clear to you. I know Eric's busy, cc:ing Oleg
> > > for his interpretation too.
> >
> > Any thoughts on this?
>
> Cough... I don't know what can I say except I personally dislike this
> patch no matter what ;)
>
> And I do not understand how can this patch help. OK, a single-threaded
> PF_EXITING task sleeps in TASK_KILLABLE. send_signal_locked() won't
> wake it up anyway?
>
> I must have missed something.

What do you think of the patch in
https://lore.kernel.org/all/YsyHMVLuT5U6mm+I@netflix/ ? Hopefully that
has an explanation that makes more sense.

Tycho

2022-07-27 19:27:32

by Tycho Andersen

[permalink] [raw]
Subject: Re: [PATCH] sched: __fatal_signal_pending() should also check PF_EXITING

On Wed, Jul 27, 2022 at 11:32:08AM -0500, Eric W. Biederman wrote:
> Tycho Andersen <[email protected]> writes:
>
> > Hi all,
> >
> > On Wed, Jul 20, 2022 at 08:54:59PM -0500, Serge E. Hallyn wrote:
> >> Oh - I didn't either - checking the sigkill in shared signals *seems*
> >> legit if they can be put there - but since you posted the new patch I
> >> assumed his reasoning was clear to you. I know Eric's busy, cc:ing Oleg
> >> for his interpretation too.
> >
> > Any thoughts on this?
>
> Having __fatal_signal_pending check SIGKILL in shared signals is
> completely and utterly wrong.
>
> What __fatal_signal_pending reports is if a signal has gone through
> short circuit delivery after determining that the delivery of the
> signal will terminate the process.

This short-circuiting you're talking about happens in __send_signal()?
The problem here is that __send_signal() will add things to the shared
queue:

pending = (type != PIDTYPE_PID) ? &t->signal->shared_pending : &t->pending;

and indeed we add it to the shared set because of the way
zap_pid_ns_processes() calls it:

group_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_MAX);

> Using "sigismember(&tsk->pending.signal, SIGKILL)" to report that a
> fatal signal has experienced short circuit delivery is a bit of an
> abuse, but essentially harmless as tkill of SIGKILL to a thread will
> result in every thread in the process experiencing short circuit
> delivery of the fatal SIGKILL. So a pending SIGKILL can't really mean
> anything else.

This is the part I don't follow. If it's ok to send a signal to this
set, why is it not ok to also look there (other than that it was a
slight hack in the first place)? Maybe it will short circuit
more threads, but that seems ok.

> After having looked at the code a little more I can unfortunately also
> say that testing PF_EXITING in __fatal_signal_pending will cause
> kernel_wait4 in zap_pid_ns_processes to not sleep, and instead to return
> 0. Which will cause zap_pid_ns_processes to busy wait. That seems very
> unfortunate.
>
> I hadn't realized it at the time I wrote zap_pid_ns_processes but I
> think anything called from do_exit that cares about signal pending state
> is pretty much broken and needs to be fixed.

> So the question is how do we fix the problem in fuse that shows up
> during a pid namespace exit without having interruptible sleeps we need
> to wake up?
>
> What are the code paths that experience the problem?

[<0>] request_wait_answer+0x282/0x710 [fuse]
[<0>] fuse_simple_request+0x502/0xc10 [fuse]
[<0>] fuse_flush+0x431/0x630 [fuse]
[<0>] filp_close+0x96/0x120
[<0>] put_files_struct+0x15c/0x2c0
[<0>] do_exit+0xa00/0x2450
[<0>] do_group_exit+0xb2/0x2a0
[<0>] get_signal+0x1eed/0x2090
[<0>] arch_do_signal_or_restart+0x89/0x1bc0
[<0>] exit_to_user_mode_prepare+0x11d/0x1b0
[<0>] syscall_exit_to_user_mode+0x19/0x50
[<0>] do_syscall_64+0x50/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0

is the full call stack, I have a reproducer here (make check will run
it): https://github.com/tych0/kernel-utils/tree/master/fuse2

In addition to fuse, it looks like nfs_file_flush() eventually ends up
in __fatal_signal_pending(), and probably a few others that want to
synchronize with something outside the local kernel.

> Will refactoring zap_pid_ns_processes as I have proposed so that it does
> not use kernel_wait4 help sort this out? AKA make it work something
> like thread group leader of a process and not allow wait to reap the
> init process of a pid namespace until all of the processes in a pid
> namespaces have been gone. Not that I see the problem in using
> kernel_wait4 it looks like zap_pid_ns_processes needs to stop calling
> kernel_wait4 regardless of the fuse problem.

I can look at this, but I really don't think it will help: in this
brave new world, what wakes up tasks stuck like the above? They're
still looking at the wrong signal set.

Tycho

2022-07-27 19:32:15

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] sched: __fatal_signal_pending() should also check PF_EXITING

On 07/27, Tycho Andersen wrote:
>
> On Wed, Jul 27, 2022 at 07:55:39PM +0200, Oleg Nesterov wrote:
> > On 07/27, Tycho Andersen wrote:
> > >
> > > Hi all,
> > >
> > > On Wed, Jul 20, 2022 at 08:54:59PM -0500, Serge E. Hallyn wrote:
> > > > Oh - I didn't either - checking the sigkill in shared signals *seems*
> > > > legit if they can be put there - but since you posted the new patch I
> > > > assumed his reasoning was clear to you. I know Eric's busy, cc:ing Oleg
> > > > for his interpretation too.
> > >
> > > Any thoughts on this?
> >
> > Cough... I don't know what can I say except I personally dislike this
> > patch no matter what ;)
> >
> > And I do not understand how can this patch help. OK, a single-threaded
> > PF_EXITING task sleeps in TASK_KILLABLE. send_signal_locked() won't
> > wake it up anyway?
> >
> > I must have missed something.
>
> What do you think of the patch in
> https://lore.kernel.org/all/YsyHMVLuT5U6mm+I@netflix/ ? Hopefully that
> has an explanation that makes more sense.

Sorry, I still do not follow. Again, I can easily miss something. But how
can ANY change in __fatal_signal_pending() ensure that SIGKILL will wake up
a PF_EXITING task which already sleeps in TASK_KILLABLE state? Or even set
TIF_SIGPENDING as the changelog states?

Oleg.

2022-07-27 20:31:42

by Tycho Andersen

[permalink] [raw]
Subject: Re: [PATCH] sched: __fatal_signal_pending() should also check PF_EXITING

On Wed, Jul 27, 2022 at 09:19:50PM +0200, Oleg Nesterov wrote:
> On 07/27, Tycho Andersen wrote:
> >
> > On Wed, Jul 27, 2022 at 07:55:39PM +0200, Oleg Nesterov wrote:
> > > On 07/27, Tycho Andersen wrote:
> > > >
> > > > Hi all,
> > > >
> > > > On Wed, Jul 20, 2022 at 08:54:59PM -0500, Serge E. Hallyn wrote:
> > > > > Oh - I didn't either - checking the sigkill in shared signals *seems*
> > > > > legit if they can be put there - but since you posted the new patch I
> > > > > assumed his reasoning was clear to you. I know Eric's busy, cc:ing Oleg
> > > > > for his interpretation too.
> > > >
> > > > Any thoughts on this?
> > >
> > > Cough... I don't know what can I say except I personally dislike this
> > > patch no matter what ;)
> > >
> > > And I do not understand how can this patch help. OK, a single-threaded
> > > PF_EXITING task sleeps in TASK_KILLABLE. send_signal_locked() won't
> > > wake it up anyway?
> > >
> > > I must have missed something.
> >
> > What do you think of the patch in
> > https://lore.kernel.org/all/YsyHMVLuT5U6mm+I@netflix/ ? Hopefully that
> > has an explanation that makes more sense.
>
> Sorry, I still do not follow. Again, I can easily miss something. But how
> can ANY change in __fatal_signal_pending() ensure that SIGKILL will wakeup
> a PF_EXITING task which already sleeps in TASK_KILLABLE state? or even set
> TIF_SIGPENDING as the changelog states?

__fatal_signal_pending() just checks the non-shared set:

sigismember(&p->pending.signal, SIGKILL)

When init in a pid namespace dies, it calls zap_pid_ns_processes(),
which does:

group_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_MAX);

that eventually gets to __send_signal_locked() which does:

pending = (type != PIDTYPE_PID) ? &t->signal->shared_pending : &t->pending;

i.e. it decides to put the signal in the shared set, instead of the individual
set. If we change __fatal_signal_pending() to look in the shared set too, it
will exit all the wait code in this case.
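
For reference, a rough sketch of that idea (not my exact patch, just to
illustrate what "look in the shared set too" would mean):

	static inline int __fatal_signal_pending(struct task_struct *p)
	{
		return unlikely(sigismember(&p->pending.signal, SIGKILL) ||
				sigismember(&p->signal->shared_pending.signal,
					    SIGKILL));
	}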

Maybe it should be fixed somehow by complete_signal(), but that doesn't work if
the thread is already PF_EXITING, because wants_signal() will cause it to
ignore the task, so it remains stuck forever.

Does that make sense? Maybe it's me who is missing something. I have a
reproducer here:
https://github.com/tych0/kernel-utils/tree/master/fuse2

Tycho

2022-07-28 09:42:27

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] sched: __fatal_signal_pending() should also check PF_EXITING

On 07/27, Tycho Andersen wrote:
>
> On Wed, Jul 27, 2022 at 09:19:50PM +0200, Oleg Nesterov wrote:
> >
> > Sorry, I still do not follow. Again, I can easily miss something. But how
> > can ANY change in __fatal_signal_pending() ensure that SIGKILL will wakeup
> > a PF_EXITING task which already sleeps in TASK_KILLABLE state? or even set
> > TIF_SIGPENDING as the changelog states?
>
> __fatal_signal_pending() just checks the non-shared set:
>
> sigismember(&p->pending.signal, SIGKILL)
>
> When init in a pid namespace dies, it calls zap_pid_ns_processes(),
> which does:
>
> group_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_MAX);
>
> that eventually gets to __send_signal_locked() which does:
>
> pending = (type != PIDTYPE_PID) ? &t->signal->shared_pending : &t->pending;
>
> i.e. it decides to put the signal in the shared set, instead of the individual
> set. If we change __fatal_signal_pending() to look in the shared set too, it
> will exit all the wait code in this case.

This is clear, but it seems you do not understand me. Let me try again
to explain and please correct me if I am wrong.

To simplify, let's suppose we have a single-thread task T which simply
does

	__set_current_state(TASK_KILLABLE);
	schedule();

in the do_exit() paths after exit_signals() which sets PF_EXITING. Btw,
note that it even documents that this thread is not "visible" for the
group-wide signals, see below.

Now, suppose that this task is running and you send SIGKILL. T will
dequeue SIGKILL from T->penging and call do_exit(). However, it won't
remove SIGKILL from T->signal.shared_pending(), and this means that
signal_pending(T) is still true.

Now. If we add a PF_EXITING or sigismember(shared_pending, SIGKILL) check
into __fatal_signal_pending(), then yes, T won't block in schedule(),
schedule()->signal_pending_state() will return true.

But what if T exits on its own? It will block in schedule() forever.
schedule()->signal_pending_state() will not even check __fatal_signal_pending(),
signal_pending() == F.

Now if you send SIGKILL to this task, SIGKILL won't wake it up or even
set TIF_SIGPENDING, complete_signal() will do nothing.

See?

I agree, we should probably clean up this logic and define how exactly
the exiting task should react to signals (not only fatal signals). But
your patch certainly doesn't look good to me and it is not enough.
Maybe we can change get_signal() to not remove SIGKILL from t->pending
for the start... not sure, this needs another discussion.


Finally, if fuse_flush() wants __fatal_signal_pending() == T when the
caller exits, perhaps it can do it itself? Something like

	if (current->flags & PF_EXITING) {
		spin_lock_irq(siglock);
		set_thread_flag(TIF_SIGPENDING);
		sigaddset(&current->pending.signal, SIGKILL);
		spin_unlock_irq(siglock);
	}

Sure, this is ugly as hell. But perhaps this can serve as a workaround?

Oleg.

2022-07-28 19:33:13

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] sched: __fatal_signal_pending() should also check PF_EXITING

Tycho Andersen <[email protected]> writes:

> On Wed, Jul 27, 2022 at 11:32:08AM -0500, Eric W. Biederman wrote:
>> Tycho Andersen <[email protected]> writes:
>>
>> > Hi all,
>> >
>> > On Wed, Jul 20, 2022 at 08:54:59PM -0500, Serge E. Hallyn wrote:
>> >> Oh - I didn't either - checking the sigkill in shared signals *seems*
>> >> legit if they can be put there - but since you posted the new patch I
>> >> assumed his reasoning was clear to you. I know Eric's busy, cc:ing Oleg
>> >> for his interpretation too.
>> >
>> > Any thoughts on this?
>>
>> Having __fatal_signal_pending check SIGKILL in shared signals is
>> completely and utterly wrong.
>>
>> What __fatal_signal_pending reports is if a signal has gone through
>> short circuit delivery after determining that the delivery of the
>> signal will terminate the process.
>
> This short-circuiting you're talking about happens in __send_signal()?
> The problem here is that __send_signal() will add things to the shared
> queue:
>
> pending = (type != PIDTYPE_PID) ? &t->signal->shared_pending : &t->pending;
>
> and indeed we add it to the shared set because of the way
> zap_pid_ns_processes() calls it:
>
> group_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_MAX);
>
>> Using "sigismember(&tsk->pending.signal, SIGKILL)" to report that a
>> fatal signal has experienced short circuit delivery is a bit of an
>> abuse, but essentially harmless as tkill of SIGKILL to a thread will
>> result in every thread in the process experiencing short circuit
>> delivery of the fatal SIGKILL. So a pending SIGKILL can't really mean
>> anything else.
>
> This is the part I don't follow. If it's ok to send a signal to this
> set, why is it not ok to also look there (other than that it was a
> slight hack in the first place)? Maybe it will short circuit
> more threads, but that seems ok.

Let me see if I can help, now that I have the code and the backtrace
the details are becoming clearer.

In zap_pid_ns_processes group_send_sig_info sets some signal bits.
Then the process notices those bits and deals with them in get_signal.
Then get_signal calls do_exit.
Then do_exit flushes the file descriptors.


So the reason __fatal_signal_pending fails is that the signal has been
dealt with by calling do_exit and it is no longer pending.

Frankly that there are some left over SIGKILL bits in the pending mask
is a misfeature, and it is definitely not something you should count on.


>> After having looked at the code a little more I can unfortunately also
>> say that testing PF_EXITING in __fatal_signal_pending will cause
>> kernel_wait4 in zap_pid_ns_processes to not sleep, and instead to return
>> 0. Which will cause zap_pid_ns_processes to busy wait. That seems very
>> unfortunate.
>>
>> I hadn't realized it at the time I wrote zap_pid_ns_processes but I
>> think anything called from do_exit that cares about signal pending state
>> is pretty much broken and needs to be fixed.
>
>> So the question is how do we fix the problem in fuse that shows up
>> during a pid namespace exit without having interruptible sleeps we need
>> to wake up?
>>
>> What are the code paths that experience the problem?
>
> [<0>] request_wait_answer+0x282/0x710 [fuse]
> [<0>] fuse_simple_request+0x502/0xc10 [fuse]
> [<0>] fuse_flush+0x431/0x630 [fuse]
> [<0>] filp_close+0x96/0x120
> [<0>] put_files_struct+0x15c/0x2c0
> [<0>] do_exit+0xa00/0x2450
> [<0>] do_group_exit+0xb2/0x2a0
> [<0>] get_signal+0x1eed/0x2090
> [<0>] arch_do_signal_or_restart+0x89/0x1bc0
> [<0>] exit_to_user_mode_prepare+0x11d/0x1b0
> [<0>] syscall_exit_to_user_mode+0x19/0x50
> [<0>] do_syscall_64+0x50/0x90
> [<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0

This is half of the hang, thank you.

> is the full call stack, I have a reproducer here (make check will run
> it): https://github.com/tych0/kernel-utils/tree/master/fuse2


Thank you for this, I can almost trace what is going on.

There is a server and it is accessing the filesystem it serves.
Plus the process staying alive keeps the filesystem mounted and
active because it is accessing the filesystem it serves.


The filesystem access hanging appears to be because fc->connected != 0.

If fuse_abort_conn could run, that should successfully
shut everything down.

The server when it is killed should call fuse_abort_conn from
fuse_dev_release, but the code appears to never get there because it is
blocked in fuse_flush from closing the file descriptor open to the fuse
file system.

So this looks like one process hanging and waiting on itself.


I suspect the solution is to use something like task_wake in the
fuse_flush calls to allow the server file descriptor to be closed,
resulting in fuse_dev_release calling fuse_abort_conn.


>
> In addition to fuse, it looks like nfs_file_flush() eventually ends up
> in __fatal_signal_pending(), and probably a few others that want to
> synchronize with something outside the local kernel.

Yes, my thinking there about not calling __fatal_signal_pending during
do_exit seems to have been confused. Rather what can't be done is
depending upon __fatal_signal_pending to ever return true.

What needs to happen is to make certain we don't have the loops where
one part of the shutdown depends upon another. I don't know if there
is a way to shut down the nfs connection while a client is waiting or not.

>> Will refactoring zap_pid_ns_processes as I have proposed so that it does
>> not use kernel_wait4 help sort this out? AKA make it work something
>> like thread group leader of a process and not allow wait to reap the
>> init process of a pid namespace until all of the processes in a pid
>> namespaces have been gone. Not that I see the problem in using
>> kernel_wait4 it looks like zap_pid_ns_processes needs to stop calling
>> kernel_wait4 regardless of the fuse problem.
>
> I can look at this, but I really don't think it will help: in this
> brave new world, what wakes up tasks stuck like the above? They're
> still looking at the wrong signal set.

This is probably a wild goose chase. There are some funny namespace
bits like the pid namespace closing that we need to rule out of
the picture, but overall I don't think the pid namespace has
anything to do with this.

I was thinking removing the kernel_wait would simplify things by
removing one place where a process waits on another process during
shutdown. With the right problem it probably will help. In this case
since everything is in a single process I don't see that happening.

I spent a little time and tried to get this to reproduce without the
pid namespace and initially it looked like it was working, but
after a bunch of time those attempts exited on their own. So I haven't
mastered all of what is going on in your fuse test case yet.

Eric

2022-07-28 21:25:26

by Tycho Andersen

[permalink] [raw]
Subject: Re: [PATCH] sched: __fatal_signal_pending() should also check PF_EXITING

On Thu, Jul 28, 2022 at 11:12:20AM +0200, Oleg Nesterov wrote:
> This is clear, but it seems you do not understand me. Let me try again
> to explain and please correct me if I am wrong.
>
> To simplify, lets suppose we have a single-thread task T which simply
> does
> __set_current_state(TASK_KILLABLE);
> schedule();
>
> in the do_exit() paths after exit_signals() which sets PF_EXITING. Btw,
> note that it even documents that this thread is not "visible" for the
> group-wide signals, see below.
>
> Now, suppose that this task is running and you send SIGKILL. T will
> dequeue SIGKILL from T->pending and call do_exit(). However, it won't
> remove SIGKILL from T->signal.shared_pending(), and this means that
> signal_pending(T) is still true.
>
> Now. If we add a PF_EXITING or sigismember(shared_pending, SIGKILL) check
> into __fatal_signal_pending(), then yes, T won't block in schedule(),
> schedule()->signal_pending_state() will return true.
>
> But what if T exits on its own? It will block in schedule() forever.
> schedule()->signal_pending_state() will not even check __fatal_signal_pending(),
> signal_pending() == F.
>
> Now if you send SIGKILL to this task, SIGKILL won't wake it up or even
> set TIF_SIGPENDING, complete_signal() will do nothing.
>
> See?
>
> I agree, we should probably cleanup this logic and define how exactly
> the exiting task should react to signals (not only fatal signals). But
> your patch certainly doesn't look good to me and it is not enough.
> May be we can change get_signal() to not remove SIGKILL from t->pending
> for the start... not sure, this needs another discussion.

Thank you for this! Between that and Eric's line about:

> Frankly that there are some left over SIGKILL bits in the pending mask
> is a misfeature, and it is definitely not something you should count on.

I think I finally maybe understand the objections.

Is it fair to say that a task with PF_EXITING should never wait? I'm
wondering if a solution would be to patch the wait code to look for
PF_EXITING, in addition to checking the signal state.
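
Something like this, maybe (a very rough sketch, just to show what I mean,
based on the current signal_pending_state()):

	static inline int signal_pending_state(unsigned int state,
					       struct task_struct *p)
	{
		if (!(state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
			return 0;
		if (!signal_pending(p) && !(p->flags & PF_EXITING))
			return 0;

		return (state & TASK_INTERRUPTIBLE) ||
		       __fatal_signal_pending(p) ||
		       (p->flags & PF_EXITING);
	}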

> Finally. if fuse_flush() wants __fatal_signal_pending() == T when the
> caller exits, perhaps it can do it itself? Something like
>
> if (current->flags & PF_EXITING) {
> spin_lock_irq(siglock);
> set_thread_flag(TIF_SIGPENDING);
> sigaddset(&current->pending.signal, SIGKILL);
> spin_unlock_irq(siglock);
> }
>
> Sure, this is ugly as hell. But perhaps this can serve as a workaround?

or even just

	if (current->flags & PF_EXITING)
		return;

since we don't have anyone to send the result of the flush to anyway.
If we don't end up converging on a fix here, I'll just send that
patch. Thanks for the suggestion.

Tycho

2022-07-29 05:21:41

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] sched: __fatal_signal_pending() should also check PF_EXITING

Tycho Andersen <[email protected]> writes:

> On Thu, Jul 28, 2022 at 11:12:20AM +0200, Oleg Nesterov wrote:
>> This is clear, but it seems you do not understand me. Let me try again
>> to explain and please correct me if I am wrong.
>>
>> To simplify, lets suppose we have a single-thread task T which simply
>> does
>> __set_current_state(TASK_KILLABLE);
>> schedule();
>>
>> in the do_exit() paths after exit_signals() which sets PF_EXITING. Btw,
>> note that it even documents that this thread is not "visible" for the
>> group-wide signals, see below.
>>
>> Now, suppose that this task is running and you send SIGKILL. T will
>> dequeue SIGKILL from T->pending and call do_exit(). However, it won't
>> remove SIGKILL from T->signal.shared_pending(), and this means that
>> signal_pending(T) is still true.
>>
>> Now. If we add a PF_EXITING or sigismember(shared_pending, SIGKILL) check
>> into __fatal_signal_pending(), then yes, T won't block in schedule(),
>> schedule()->signal_pending_state() will return true.
>>
>> But what if T exits on its own? It will block in schedule() forever.
>> schedule()->signal_pending_state() will not even check __fatal_signal_pending(),
>> signal_pending() == F.
>>
>> Now if you send SIGKILL to this task, SIGKILL won't wake it up or even
>> set TIF_SIGPENDING, complete_signal() will do nothing.
>>
>> See?
>>
>> I agree, we should probably cleanup this logic and define how exactly
>> the exiting task should react to signals (not only fatal signals). But
>> your patch certainly doesn't look good to me and it is not enough.
>> May be we can change get_signal() to not remove SIGKILL from t->pending
>> for the start... not sure, this needs another discussion.
>
> Thank you for this! Between that and Eric's line about:
>
>> Frankly that there are some left over SIGKILL bits in the pending mask
>> is a misfeature, and it is definitely not something you should count on.
>
> I think I finally maybe understand the objections.
>
> Is it fair to say that a task with PF_EXITING should never wait? I'm
> wondering if a solution would be to patch the wait code to look for
> PF_EXITING, in addition to checking the signal state.

That will at a minimum change zap_pid_ns_processes to busy wait
instead of sleeping while it waits for children to die.

So we would need to survey the waits that can happen when closing file
descriptors and any other place on the exit path to see how much impact
such a change would have.


It might be possible to allow an extra SIGKILL to terminate such waits.
We do something like that for coredumps. But that is incredibly subtle
and a pain to maintain so I want to avoid that if we can.


>> Finally. if fuse_flush() wants __fatal_signal_pending() == T when the
>> caller exits, perhaps it can do it itself? Something like
>>
>> if (current->flags & PF_EXITING) {
>> spin_lock_irq(siglock);
>> set_thread_flag(TIF_SIGPENDING);
>> sigaddset(&current->pending.signal, SIGKILL);
>> spin_unlock_irq(siglock);
>> }
>>
>> Sure, this is ugly as hell. But perhaps this can serve as a workaround?
>
> or even just
>
> if (current->flags & PF_EXITING)
> return;
>
> since we don't have anyone to send the result of the flush to anyway.
> If we don't end up converging on a fix here, I'll just send that
> patch. Thanks for the suggestion.

If that was limited to the case you care about that would be reasonable.

That will have an effect on any time a process that opens files on a
fuse filesystem exits and depends upon the exit path to close its file
descriptors to the fuse filesystem.


I do see a plausible solution along those lines.

In fuse_flush instead of using fuse_simple_request call an equivalent
function that when PF_EXITING is true skips calling request_wait_answer.
Or perhaps when PF_EXITING is set uses schedule_work to call the
request_wait_answer.

That will allow everything to work as it does today. It will optimize
fuse when file descriptors are closed on the exit path. It will
avoid the hang by removing an indefinite wait on userspace.

This should even generalize into the vfs. I looked and nfs also looks
like it has the potential to optimize out the wait for the result of the
flush. A correctly implemented flush method looks to flush any
write-back data when the file is closed and to return any errors from
that flush to the caller of close. For .flush called from the exit path
aka exit_files aka close_files there is no place to return an
error status to, so there is no need to wait for the flush to complete.

That said, I think it makes sense to solve the problem for fuse
first, and then we can figure out support for other filesystems.

Eric

2022-07-29 14:40:48

by Tycho Andersen

[permalink] [raw]
Subject: Re: [PATCH] sched: __fatal_signal_pending() should also check PF_EXITING

On Fri, Jul 29, 2022 at 12:04:17AM -0500, Eric W. Biederman wrote:
> Tycho Andersen <[email protected]> writes:
>
> > On Thu, Jul 28, 2022 at 11:12:20AM +0200, Oleg Nesterov wrote:
> >> This is clear, but it seems you do not understand me. Let me try again
> >> to explain and please correct me if I am wrong.
> >>
> >> To simplify, lets suppose we have a single-thread task T which simply
> >> does
> >> __set_current_state(TASK_KILLABLE);
> >> schedule();
> >>
> >> in the do_exit() paths after exit_signals() which sets PF_EXITING. Btw,
> >> note that it even documents that this thread is not "visible" for the
> >> group-wide signals, see below.
> >>
> >> Now, suppose that this task is running and you send SIGKILL. T will
> >> dequeue SIGKILL from T->pending and call do_exit(). However, it won't
> >> remove SIGKILL from T->signal.shared_pending(), and this means that
> >> signal_pending(T) is still true.
> >>
> >> Now. If we add a PF_EXITING or sigismember(shared_pending, SIGKILL) check
> >> into __fatal_signal_pending(), then yes, T won't block in schedule(),
> >> schedule()->signal_pending_state() will return true.
> >>
> >> But what if T exits on its own? It will block in schedule() forever.
> >> schedule()->signal_pending_state() will not even check __fatal_signal_pending(),
> >> signal_pending() == F.
> >>
> >> Now if you send SIGKILL to this task, SIGKILL won't wake it up or even
> >> set TIF_SIGPENDING, complete_signal() will do nothing.
> >>
> >> See?
> >>
> >> I agree, we should probably cleanup this logic and define how exactly
> >> the exiting task should react to signals (not only fatal signals). But
> >> your patch certainly doesn't look good to me and it is not enough.
> >> May be we can change get_signal() to not remove SIGKILL from t->pending
> >> for the start... not sure, this needs another discussion.
> >
> > Thank you for this! Between that and Eric's line about:
> >
> >> Frankly that there are some left over SIGKILL bits in the pending mask
> >> is a misfeature, and it is definitely not something you should count on.
> >
> > I think I finally maybe understand the objections.
> >
> > Is it fair to say that a task with PF_EXITING should never wait? I'm
> > wondering if a solution would be to patch the wait code to look for
> > PF_EXITING, in addition to checking the signal state.
>
> That will at a minimum change zap_pid_ns_processes to busy wait
> instead of sleeping while it waits for children to die.
>
> So we would need to survey the waits that can happen when closing file
> descriptors and any other place on the exit path to see how much impact
> a such a change would do.

Oh, yes, of course.

> It might be possible to allow an extra SIGKILL to terminate such waits.
> We do something like that for coredumps. But that is incredibly subtle
> and a pain to maintain so I want to avoid that if we can.

Yeah, it feels better to clean up these waits. If we thought we got
them all we could maybe even stick a WARN() in the wait code.

> >> Finally. if fuse_flush() wants __fatal_signal_pending() == T when the
> >> caller exits, perhaps it can do it itself? Something like
> >>
> >> if (current->flags & PF_EXITING) {
> >> spin_lock_irq(siglock);
> >> set_thread_flag(TIF_SIGPENDING);
> >> sigaddset(&current->pending.signal, SIGKILL);
> >> spin_unlock_irq(siglock);
> >> }
> >>
> >> Sure, this is ugly as hell. But perhaps this can serve as a workaround?
> >
> > or even just
> >
> > if (current->flags & PF_EXITING)
> > return;
> >
> > since we don't have anyone to send the result of the flush to anyway.
> > If we don't end up converging on a fix here, I'll just send that
> > patch. Thanks for the suggestion.
>
> If that was limited to the case you care about that would be reasonable.
>
> That will have an effect on any time a process that opens files on a
> fuse filesystem exits and depends upon the exit path to close it's file
> descriptors to the fuse filesystem.
>
>
> I do see a plausible solution along those lines.
>
> In fuse_flush instead of using fuse_simple_request call an equivalent
> function that when PF_EXITING is true skips calling request_wait_answer.
> Or perhaps when PF_EXITING is set uses schedule_work to call the
> request_wait_answer.

I don't see why this is any different than what I proposed. It changes
the semantics to flush happening out-of-order with task exit, instead
of strictly before, which you point out might be a problem. What am I
missing?

Tycho

2022-07-29 16:20:53

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] sched: __fatal_signal_pending() should also check PF_EXITING

Tycho Andersen <[email protected]> writes:

> On Fri, Jul 29, 2022 at 12:04:17AM -0500, Eric W. Biederman wrote:
>> Tycho Andersen <[email protected]> writes:
>>
>> > On Thu, Jul 28, 2022 at 11:12:20AM +0200, Oleg Nesterov wrote:
>
>> >> Finally. if fuse_flush() wants __fatal_signal_pending() == T when the
>> >> caller exits, perhaps it can do it itself? Something like
>> >>
>> >> if (current->flags & PF_EXITING) {
>> >> spin_lock_irq(siglock);
>> >> set_thread_flag(TIF_SIGPENDING);
>> >> sigaddset(&current->pending.signal, SIGKILL);
>> >> spin_unlock_irq(siglock);
>> >> }
>> >>
>> >> Sure, this is ugly as hell. But perhaps this can serve as a workaround?
>> >
>> > or even just
>> >
>> > if (current->flags & PF_EXITING)
>> > return;
>> >
>> > since we don't have anyone to send the result of the flush to anyway.
>> > If we don't end up converging on a fix here, I'll just send that
>> > patch. Thanks for the suggestion.
>>
>> If that was limited to the case you care about that would be reasonable.
>>
>> That will have an effect on any time a process that opens files on a
>> fuse filesystem exits and depends upon the exit path to close it's file
>> descriptors to the fuse filesystem.
>>
>>
>> I do see a plausible solution along those lines.
>>
>> In fuse_flush instead of using fuse_simple_request call an equivalent
>> function that when PF_EXITING is true skips calling request_wait_answer.
>> Or perhaps when PF_EXITING is set uses schedule_work to call the
>> request_wait_answer.
>
> I don't see why this is any different than what I proposed. It changes
> the semantics to flush happening out-of-order with task exit, instead
> of strictly before, which you point out might be a problem. What am I
> missing?

What you proposed skips the flush operation entirely. Which means
that a fuse server that tracks opens and closes of a file descriptor
will see more opens than closes and will have a reference counting
problem (probably resulting in things not being freed).

Simply skipping the wait for the result from the fuse server means
the fuse server sees what it has always seen. The kernel simply won't
block until that result has been returned. Which means the other file
descriptors can be closed.


For the specific case you are looking at with the server being killed
and server's file descriptors not yet being closed, the difference
does not matter. In the ordinary case of a process exit closing file
descriptors to a fuse filesystem where the server continues to live
and function, not waiting for the response from the server simply
winds up being an optimization in exit. The key part is the fuse
server continues to see the same traffic. In particular the open
requests and the flush requests continue to balance, so reference
counting in the fuse server is not broken.

Eric

2022-07-29 16:57:53

by Tycho Andersen

[permalink] [raw]
Subject: Re: [PATCH] sched: __fatal_signal_pending() should also check PF_EXITING

On Fri, Jul 29, 2022 at 11:15:28AM -0500, Eric W. Biederman wrote:
> Tycho Andersen <[email protected]> writes:
>
> > On Fri, Jul 29, 2022 at 12:04:17AM -0500, Eric W. Biederman wrote:
> >> Tycho Andersen <[email protected]> writes:
> >>
> >> > On Thu, Jul 28, 2022 at 11:12:20AM +0200, Oleg Nesterov wrote:
> >
> >> >> Finally. if fuse_flush() wants __fatal_signal_pending() == T when the
> >> >> caller exits, perhaps it can do it itself? Something like
> >> >>
> >> >> if (current->flags & PF_EXITING) {
> >> >> spin_lock_irq(siglock);
> >> >> set_thread_flag(TIF_SIGPENDING);
> >> >> sigaddset(&current->pending.signal, SIGKILL);
> >> >> spin_unlock_irq(siglock);
> >> >> }
> >> >>
> >> >> Sure, this is ugly as hell. But perhaps this can serve as a workaround?
> >> >
> >> > or even just
> >> >
> >> > if (current->flags & PF_EXITING)
> >> > return;
> >> >
> >> > since we don't have anyone to send the result of the flush to anyway.
> >> > If we don't end up converging on a fix here, I'll just send that
> >> > patch. Thanks for the suggestion.
> >>
> >> If that was limited to the case you care about that would be reasonable.
> >>
> >> That will have an effect on any time a process that opens files on a
> >> fuse filesystem exits and depends upon the exit path to close it's file
> >> descriptors to the fuse filesystem.
> >>
> >>
> >> I do see a plausible solution along those lines.
> >>
> >> In fuse_flush instead of using fuse_simple_request call an equivalent
> >> function that when PF_EXITING is true skips calling request_wait_answer.
> >> Or perhaps when PF_EXITING is set uses schedule_work to call the
> >> request_wait_answer.
> >
> > I don't see why this is any different than what I proposed. It changes
> > the semantics to flush happening out-of-order with task exit, instead
> > of strictly before, which you point out might be a problem. What am I
> > missing?
>
> What you proposed skips the flush operation entirely.

Sorry, I wasn't clear. I was thinking of something roughly similar to what you
were, returning from request_wait_answer() early if we have
PF_EXITING. Sounds like we agree that it shouldn't be an issue. I'll
give it a test and send out a patch Monday.

Tycho

2022-07-29 18:14:43

by Eric W. Biederman

[permalink] [raw]
Subject: [RFC][PATCH] fuse: In fuse_flush only wait if someone wants the return code


In my very light testing this resolves a hang where a thread of the fuse
server was accessing the fuse filesystem (the one the fuse server is
serving up) when the fuse server is killed.

The practical problem is that the fuse server file descriptor was being
closed after the file descriptor into the fuse filesystem, so that the
fuse filesystem operations were being blocked on instead of being
aborted. Simply skipping the unnecessary wait resolves this issue.

This is just a proof of concept and someone should look to see if the
fuse max_background limit could cause a problem with this approach.

Additionally testing PF_EXITING is a very crude way to tell if someone
wants the return code from the vfs flush operation. As such in the long
run it probably makes sense to get some direct vfs support for knowing
if flush needs to block until all of the flushing is complete and a
status/return code can be returned.

Unless I have missed something this is a generic optimization that can
apply to many network filesystems.

Al, vfs folks?

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/fuse/file.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 59 insertions(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 05caa2b9272e..a4fccd859495 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -464,6 +464,62 @@ static void fuse_sync_writes(struct inode *inode)
fuse_release_nowrite(inode);
}

+struct fuse_flush_args {
+ struct fuse_args args;
+ struct fuse_flush_in inarg;
+ struct inode *inode;
+};
+
+static void fuse_flush_end(struct fuse_mount *fm, struct fuse_args *args, int err)
+{
+ struct fuse_flush_args *fa = container_of(args, typeof(*fa), args);
+
+ if (err == -ENOSYS) {
+ fm->fc->no_flush = 1;
+ err = 0;
+ }
+
+ /*
+ * In memory i_blocks is not maintained by fuse, if writeback cache is
+ * enabled, i_blocks from cached attr may not be accurate.
+ */
+ if (!err && fm->fc->writeback_cache)
+ fuse_invalidate_attr_mask(fa->inode, STATX_BLOCKS);
+
+ kfree(fa);
+}
+
+static int fuse_flush_async(struct file *file, fl_owner_t id)
+{
+ struct inode *inode = file_inode(file);
+ struct fuse_mount *fm = get_fuse_mount(inode);
+ struct fuse_file *ff = file->private_data;
+ struct fuse_flush_args *fa;
+ int err;
+
+ fa = kzalloc(sizeof(*fa), GFP_KERNEL);
+ if (!fa)
+ return -ENOMEM;
+
+ fa->inarg.fh = ff->fh;
+ fa->inarg.lock_owner = fuse_lock_owner_id(fm->fc, id);
+ fa->args.opcode = FUSE_FLUSH;
+ fa->args.nodeid = get_node_id(inode);
+ fa->args.in_numargs = 1;
+ fa->args.in_args[0].size = sizeof(fa->inarg);
+ fa->args.in_args[0].value = &fa->inarg;
+ fa->args.force = true;
+ fa->args.end = fuse_flush_end;
+ fa->inode = inode;
+ __iget(inode);
+
+ err = fuse_simple_background(fm, &fa->args, GFP_KERNEL);
+ if (err)
+ fuse_flush_end(fm, &fa->args, err);
+
+ return err;
+}
+
static int fuse_flush(struct file *file, fl_owner_t id)
{
struct inode *inode = file_inode(file);
@@ -495,6 +551,9 @@ static int fuse_flush(struct file *file, fl_owner_t id)
if (fm->fc->no_flush)
goto inval_attr_out;

+ if (current->flags & PF_EXITING)
+ return fuse_flush_async(file, id);
+
memset(&inarg, 0, sizeof(inarg));
inarg.fh = ff->fh;
inarg.lock_owner = fuse_lock_owner_id(fm->fc, id);
--
2.35.3

2022-07-29 20:51:20

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [RFC][PATCH] fuse: In fuse_flush only wait if someone wants the return code

On 07/29, Eric W. Biederman wrote:
>
> +static int fuse_flush_async(struct file *file, fl_owner_t id)
> +{
> + struct inode *inode = file_inode(file);
> + struct fuse_mount *fm = get_fuse_mount(inode);
> + struct fuse_file *ff = file->private_data;
> + struct fuse_flush_args *fa;
> + int err;
> +
> + fa = kzalloc(sizeof(*fa), GFP_KERNEL);
> + if (!fa)
> + return -ENOMEM;
> +
> + fa->inarg.fh = ff->fh;
> + fa->inarg.lock_owner = fuse_lock_owner_id(fm->fc, id);
> + fa->args.opcode = FUSE_FLUSH;
> + fa->args.nodeid = get_node_id(inode);
> + fa->args.in_numargs = 1;
> + fa->args.in_args[0].size = sizeof(fa->inarg);
> + fa->args.in_args[0].value = &fa->inarg;
> + fa->args.force = true;
> + fa->args.end = fuse_flush_end;
> + fa->inode = inode;
> + __iget(inode);

Hmm... who does iput() ?

Oleg.

2022-07-30 00:22:07

by Al Viro

[permalink] [raw]
Subject: Re: [RFC][PATCH] fuse: In fuse_flush only wait if someone wants the return code

On Fri, Jul 29, 2022 at 10:47:32PM +0200, Oleg Nesterov wrote:
> On 07/29, Eric W. Biederman wrote:
> >
> > +static int fuse_flush_async(struct file *file, fl_owner_t id)
> > +{
> > + struct inode *inode = file_inode(file);
> > + struct fuse_mount *fm = get_fuse_mount(inode);
> > + struct fuse_file *ff = file->private_data;
> > + struct fuse_flush_args *fa;
> > + int err;
> > +
> > + fa = kzalloc(sizeof(*fa), GFP_KERNEL);
> > + if (!fa)
> > + return -ENOMEM;
> > +
> > + fa->inarg.fh = ff->fh;
> > + fa->inarg.lock_owner = fuse_lock_owner_id(fm->fc, id);
> > + fa->args.opcode = FUSE_FLUSH;
> > + fa->args.nodeid = get_node_id(inode);
> > + fa->args.in_numargs = 1;
> > + fa->args.in_args[0].size = sizeof(fa->inarg);
> > + fa->args.in_args[0].value = &fa->inarg;
> > + fa->args.force = true;
> > + fa->args.end = fuse_flush_end;
> > + fa->inode = inode;
> > + __iget(inode);
>
> Hmm... who does iput() ?

... or holds ->i_lock as expected by __iget()...

2022-07-30 05:50:46

by Eric W. Biederman

[permalink] [raw]
Subject: [RFC][PATCH v2] fuse: In fuse_flush only wait if someone wants the return code


In my very light testing this resolves a hang where a thread of the
fuse server was accessing the fuse filesystem (the one the fuse server
is serving up) when the fuse server is killed.

The practical problem is that the fuse server file descriptor was
being closed after the file descriptor into the fuse filesystem, so
that the fuse filesystem operations were being blocked on instead of
being aborted. Simply skipping the unnecessary wait resolves this
issue.

This is just a proof of concept and someone should look to see if the
fuse max_background limit could cause a problem with this approach.

Additionally testing PF_EXITING is a very crude way to tell if someone
wants the return code from the vfs flush operation. As such in the
long run it probably makes sense to get some direct vfs support for
knowing if flush needs to block until all of the flushing is complete
and a status/return code can be returned.

Unless I have missed something this is a generic optimization that can
apply to many network filesystems.

Al, vfs folks? (igrab/iput sorted so as not to be distractions).

Perhaps a .flush_async method without a return code and a
filp_close_async function without a return code to take advantage of
this in the general sense.

Waiting potentially indefinitely for user space in do_exit seems like a
bad idea. Especially when all that the wait is for is to get a return
code that will never be examined.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/fuse/file.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 59 insertions(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 05caa2b9272e..2bd94acd761f 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -464,6 +464,62 @@ static void fuse_sync_writes(struct inode *inode)
fuse_release_nowrite(inode);
}

+struct fuse_flush_args {
+ struct fuse_args args;
+ struct fuse_flush_in inarg;
+ struct inode *inode;
+};
+
+static void fuse_flush_end(struct fuse_mount *fm, struct fuse_args *args, int err)
+{
+ struct fuse_flush_args *fa = container_of(args, typeof(*fa), args);
+
+ if (err == -ENOSYS) {
+ fm->fc->no_flush = 1;
+ err = 0;
+ }
+
+ /*
+ * In memory i_blocks is not maintained by fuse, if writeback cache is
+ * enabled, i_blocks from cached attr may not be accurate.
+ */
+ if (!err && fm->fc->writeback_cache)
+ fuse_invalidate_attr_mask(fa->inode, STATX_BLOCKS);
+
+ iput(fa->inode);
+ kfree(fa);
+}
+
+static int fuse_flush_async(struct file *file, fl_owner_t id)
+{
+ struct inode *inode = file_inode(file);
+ struct fuse_mount *fm = get_fuse_mount(inode);
+ struct fuse_file *ff = file->private_data;
+ struct fuse_flush_args *fa;
+ int err;
+
+ fa = kzalloc(sizeof(*fa), GFP_KERNEL);
+ if (!fa)
+ return -ENOMEM;
+
+ fa->inarg.fh = ff->fh;
+ fa->inarg.lock_owner = fuse_lock_owner_id(fm->fc, id);
+ fa->args.opcode = FUSE_FLUSH;
+ fa->args.nodeid = get_node_id(inode);
+ fa->args.in_numargs = 1;
+ fa->args.in_args[0].size = sizeof(fa->inarg);
+ fa->args.in_args[0].value = &fa->inarg;
+ fa->args.force = true;
+ fa->args.end = fuse_flush_end;
+ fa->inode = igrab(inode);
+
+ err = fuse_simple_background(fm, &fa->args, GFP_KERNEL);
+ if (err)
+ fuse_flush_end(fm, &fa->args, err);
+
+ return err;
+}
+
static int fuse_flush(struct file *file, fl_owner_t id)
{
struct inode *inode = file_inode(file);
@@ -495,6 +551,9 @@ static int fuse_flush(struct file *file, fl_owner_t id)
if (fm->fc->no_flush)
goto inval_attr_out;

+ if (current->flags & PF_EXITING)
+ return fuse_flush_async(file, id);
+
memset(&inarg, 0, sizeof(inarg));
inarg.fh = ff->fh;
inarg.lock_owner = fuse_lock_owner_id(fm->fc, id);
--
2.35.3

2022-08-01 15:21:44

by Tycho Andersen

[permalink] [raw]
Subject: Re: [RFC][PATCH v2] fuse: In fuse_flush only wait if someone wants the return code

On Sat, Jul 30, 2022 at 12:10:33AM -0500, Eric W. Biederman wrote:
> +static int fuse_flush_async(struct file *file, fl_owner_t id)
> +{
> + struct inode *inode = file_inode(file);
> + struct fuse_mount *fm = get_fuse_mount(inode);
> + struct fuse_file *ff = file->private_data;
> + struct fuse_flush_args *fa;
> + int err;
> +
> + fa = kzalloc(sizeof(*fa), GFP_KERNEL);
> + if (!fa)
> + return -ENOMEM;
> +
> + fa->inarg.fh = ff->fh;
> + fa->inarg.lock_owner = fuse_lock_owner_id(fm->fc, id);
> + fa->args.opcode = FUSE_FLUSH;
> + fa->args.nodeid = get_node_id(inode);
> + fa->args.in_numargs = 1;
> + fa->args.in_args[0].size = sizeof(fa->inarg);
> + fa->args.in_args[0].value = &fa->inarg;
> + fa->args.force = true;

Seems like you need a

fa->args.nocreds = true;

here or you'll hit the WARN() in fuse_simple_background().

Tycho

2022-08-02 12:58:44

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [RFC][PATCH v2] fuse: In fuse_flush only wait if someone wants the return code

On Sat, 30 Jul 2022 at 07:11, Eric W. Biederman <[email protected]> wrote:
>
>
> In my very light testing this resolves a hang where a thread of the
> fuse server was accessing the fuse filesystem (the fuse server is
> serving up), when the fuse server is killed.
>
> The practical problem is that the fuse server file descriptor was
> being closed after the file descriptor into the fuse filesystem so
> that the fuse filesystem operations were being blocked for instead of
> being aborted. Simply skipping the unnecessary wait resolves this
> issue.
>
> This is just a proof of concept and someone should look to see if the
> fuse max_background limit could cause a problem with this approach.

max_background just throttles the number of background requests that
the userspace filesystem can *unqueue*. It doesn't affect queuing in
any way.

> Additionally testing PF_EXITING is a very crude way to tell if someone
> wants the return code from the vfs flush operation. As such in the
> long run it probably makes sense to get some direct vfs support for
> knowing if flush needs to block until all of the flushing is complete
> and a status/return code can be returned.
>
> Unless I have missed something this is a generic optimization that can
> apply to many network filesystems.
>
> Al, vfs folks? (igrab/iput sorted so as not to be distractions).
>
> Perhaps a .flush_async method without a return code and a
> filp_close_async function without a return code to take advantage of
> this in the general sense.
>
> Waiting potentially indefinitely for user space in do_exit seems like a
> bad idea. Especially when all that the wait is for is to get a return
> code that will never be examined.

The wait is for posix locks to get unlocked. But "remote" posix locks
are almost never used due to problems like this, so I think it's safe
to do this.

>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
> fs/fuse/file.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 59 insertions(+)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 05caa2b9272e..2bd94acd761f 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -464,6 +464,62 @@ static void fuse_sync_writes(struct inode *inode)
> fuse_release_nowrite(inode);
> }
>
> +struct fuse_flush_args {
> + struct fuse_args args;
> + struct fuse_flush_in inarg;
> + struct inode *inode;
> +};
> +
> +static void fuse_flush_end(struct fuse_mount *fm, struct fuse_args *args, int err)
> +{
> + struct fuse_flush_args *fa = container_of(args, typeof(*fa), args);
> +
> + if (err == -ENOSYS) {
> + fm->fc->no_flush = 1;
> + err = 0;
> + }
> +
> + /*
> + * In memory i_blocks is not maintained by fuse, if writeback cache is
> + * enabled, i_blocks from cached attr may not be accurate.
> + */
> + if (!err && fm->fc->writeback_cache)
> + fuse_invalidate_attr_mask(fa->inode, STATX_BLOCKS);
> +
> + iput(fa->inode);

Filesystems might expect not just the inode but also the file not to be
destroyed, so do what other file operations do and keep a ref on ff:

fuse_file_put(fa->ff, false, false);

> + kfree(fa);
> +}
> +
> +static int fuse_flush_async(struct file *file, fl_owner_t id)
> +{
> + struct inode *inode = file_inode(file);
> + struct fuse_mount *fm = get_fuse_mount(inode);
> + struct fuse_file *ff = file->private_data;
> + struct fuse_flush_args *fa;
> + int err;
> +
> + fa = kzalloc(sizeof(*fa), GFP_KERNEL);
> + if (!fa)
> + return -ENOMEM;
> +
> + fa->inarg.fh = ff->fh;
> + fa->inarg.lock_owner = fuse_lock_owner_id(fm->fc, id);
> + fa->args.opcode = FUSE_FLUSH;
> + fa->args.nodeid = get_node_id(inode);
> + fa->args.in_numargs = 1;
> + fa->args.in_args[0].size = sizeof(fa->inarg);
> + fa->args.in_args[0].value = &fa->inarg;
> + fa->args.force = true;
> + fa->args.end = fuse_flush_end;
> + fa->inode = igrab(inode);

fa->ff = fuse_file_get(ff);

> +
> + err = fuse_simple_background(fm, &fa->args, GFP_KERNEL);
> + if (err)
> + fuse_flush_end(fm, &fa->args, err);
> +
> + return err;
> +}
> +
> static int fuse_flush(struct file *file, fl_owner_t id)
> {
> struct inode *inode = file_inode(file);
> @@ -495,6 +551,9 @@ static int fuse_flush(struct file *file, fl_owner_t id)
> if (fm->fc->no_flush)
> goto inval_attr_out;
>
> + if (current->flags & PF_EXITING)
> + return fuse_flush_async(file, id);
> +
> memset(&inarg, 0, sizeof(inarg));
> inarg.fh = ff->fh;
> inarg.lock_owner = fuse_lock_owner_id(fm->fc, id);
> --
> 2.35.3
>

2022-08-15 14:06:10

by Tycho Andersen

[permalink] [raw]
Subject: Re: [RFC][PATCH v2] fuse: In fuse_flush only wait if someone wants the return code

Hi,

On Sat, Jul 30, 2022 at 12:10:33AM -0500, Eric W. Biederman wrote:
> Al, vfs folks? (igrab/iput sorted so as not to be distractions).

Any movement on this? Can you resend (or I can) the patch with the
fixes for fuse at the very least?

Thanks,

2022-08-15 18:39:54

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [RFC][PATCH v2] fuse: In fuse_flush only wait if someone wants the return code

On Mon, Aug 15, 2022 at 07:59:08AM -0600, Tycho Andersen wrote:
> Hi,
>
> On Sat, Jul 30, 2022 at 12:10:33AM -0500, Eric W. Biederman wrote:
> > Al, vfs folks? (igrab/iput sorted so as not to be distractions).
>
> Any movement on this? Can you resend (or I can) the patch with the
> fixes for fuse at the very least?
>
> Thanks,

If you resend with the fixes, I'd like to do a bit of testing with it.

2022-09-01 15:08:58

by Tycho Andersen

[permalink] [raw]
Subject: [PATCH] fuse: In fuse_flush only wait if someone wants the return code

From: "Eric W. Biederman" <[email protected]>

In my very light testing this resolves a hang where a thread of the
fuse server was accessing the fuse filesystem (the one the fuse server
is serving up) when the fuse server is killed.

The practical problem is that the fuse server file descriptor was
being closed after the file descriptor into the fuse filesystem, so
the fuse filesystem operations were being waited on instead of being
aborted. Simply skipping the unnecessary wait resolves this issue.

This is just a proof of concept and someone should look to see if the
fuse max_background limit could cause a problem with this approach.

Additionally testing PF_EXITING is a very crude way to tell if someone
wants the return code from the vfs flush operation. As such in the
long run it probably makes sense to get some direct vfs support for
knowing if flush needs to block until all of the flushing is complete
and a status/return code can be returned.

Unless I have missed something this is a generic optimization that can
apply to many network filesystems.

Al, vfs folks? (igrab/iput sorted so as not to be distractions).

Perhaps a .flush_async method without a return code and a
filp_close_async function without a return code to take advantage of
this in the general sense.

Waiting potentially indefinitely for user space in do_exit seems like a
bad idea. Especially when all that the wait is for is to get a return
code that will never be examined.

Signed-off-by: "Eric W. Biederman" <[email protected]>
[tycho: small fixups for releasing fuse file + nocred flag]
Signed-off-by: Tycho Andersen <[email protected]>
Reported-by: Tycho Andersen <[email protected]>
Tested-by: "Serge E. Hallyn" <[email protected]>
---
fs/fuse/file.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 64 insertions(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 05caa2b9272e..da45fb2dd740 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -464,6 +464,67 @@ static void fuse_sync_writes(struct inode *inode)
fuse_release_nowrite(inode);
}

+struct fuse_flush_args {
+ struct fuse_args args;
+ struct fuse_flush_in inarg;
+ struct inode *inode;
+ struct fuse_file *ff;
+};
+
+static void fuse_flush_end(struct fuse_mount *fm, struct fuse_args *args, int err)
+{
+ struct fuse_flush_args *fa = container_of(args, typeof(*fa), args);
+
+ if (err == -ENOSYS) {
+ fm->fc->no_flush = 1;
+ err = 0;
+ }
+
+ /*
+ * In memory i_blocks is not maintained by fuse, if writeback cache is
+ * enabled, i_blocks from cached attr may not be accurate.
+ */
+ if (!err && fm->fc->writeback_cache)
+ fuse_invalidate_attr_mask(fa->inode, STATX_BLOCKS);
+
+
+ iput(fa->inode);
+ fuse_file_put(fa->ff, false, false);
+ kfree(fa);
+}
+
+static int fuse_flush_async(struct file *file, fl_owner_t id)
+{
+ struct inode *inode = file_inode(file);
+ struct fuse_mount *fm = get_fuse_mount(inode);
+ struct fuse_file *ff = file->private_data;
+ struct fuse_flush_args *fa;
+ int err;
+
+ fa = kzalloc(sizeof(*fa), GFP_KERNEL);
+ if (!fa)
+ return -ENOMEM;
+
+ fa->inarg.fh = ff->fh;
+ fa->inarg.lock_owner = fuse_lock_owner_id(fm->fc, id);
+ fa->args.opcode = FUSE_FLUSH;
+ fa->args.nodeid = get_node_id(inode);
+ fa->args.in_numargs = 1;
+ fa->args.in_args[0].size = sizeof(fa->inarg);
+ fa->args.in_args[0].value = &fa->inarg;
+ fa->args.force = true;
+ fa->args.nocreds = true;
+ fa->args.end = fuse_flush_end;
+ fa->inode = igrab(inode);
+ fa->ff = fuse_file_get(ff);
+
+ err = fuse_simple_background(fm, &fa->args, GFP_KERNEL);
+ if (err)
+ fuse_flush_end(fm, &fa->args, err);
+
+ return err;
+}
+
static int fuse_flush(struct file *file, fl_owner_t id)
{
struct inode *inode = file_inode(file);
@@ -495,6 +556,9 @@ static int fuse_flush(struct file *file, fl_owner_t id)
if (fm->fc->no_flush)
goto inval_attr_out;

+ if (current->flags & PF_EXITING)
+ return fuse_flush_async(file, id);
+
memset(&inarg, 0, sizeof(inarg));
inarg.fh = ff->fh;
inarg.lock_owner = fuse_lock_owner_id(fm->fc, id);

base-commit: 3d7cb6b04c3f3115719235cc6866b10326de34cd
--
2.34.1

2022-09-19 15:06:28

by Tycho Andersen

[permalink] [raw]
Subject: Re: [PATCH] fuse: In fuse_flush only wait if someone wants the return code

Hi Miklos,

On Thu, Sep 01, 2022 at 08:06:47AM -0600, Tycho Andersen wrote:
> From: "Eric W. Biederman" <[email protected]>
>
> In my very light testing this resolves a hang where a thread of the
> fuse server was accessing the fuse filesystem (the fuse server is
> serving up), when the fuse server is killed.
>
> The practical problem is that the fuse server file descriptor was
> being closed after the file descriptor into the fuse filesystem so
> that the fuse filesystem operations were being blocked for instead of
> being aborted. Simply skipping the unnecessary wait resolves this
> issue.
>
> This is just a proof of concept and someone should look to see if the
> fuse max_background limit could cause a problem with this approach.
>
> Additionally testing PF_EXITING is a very crude way to tell if someone
> wants the return code from the vfs flush operation. As such in the
> long run it probably makes sense to get some direct vfs support for
> knowing if flush needs to block until all of the flushing is complete
> and a status/return code can be returned.
>
> Unless I have missed something this is a generic optimization that can
> apply to many network filesystems.
>
> Al, vfs folks? (igrab/iput sorted so as not to be distractions).
>
> Perhaps a .flush_async method without a return code and a
> filp_close_async function without a return code to take advantage of
> this in the general sense.
>
> Waiting potentially indefinitely for user space in do_exit seems like a
> bad idea. Especially when all that the wait is for is to get a return
> code that will never be examined.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> [tycho: small fixups for releasing fuse file + nocred flag]
> Signed-off-by: Tycho Andersen <[email protected]>
> Reported-by: Tycho Andersen <[email protected]>
> Tested-by: "Serge E. Hallyn" <[email protected]>

Any chance you're willing to take this patch? We're still seeing this
a lot and it would be great to get it fixed.

Thanks.

Tycho

2022-09-19 16:02:32

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][PATCH v2] fuse: In fuse_flush only wait if someone wants the return code

Tycho Andersen <[email protected]> writes:

> Hi,
>
> On Sat, Jul 30, 2022 at 12:10:33AM -0500, Eric W. Biederman wrote:
>> Al, vfs folks? (igrab/iput sorted so as not to be distractions).
>
> Any movement on this? Can you resend (or I can) the patch with the
> fixes for fuse at the very least?

Sorry for not replying earlier. Thank you for taking this.

I had really meant to suggest something like that. At the moment I have
a bit too much on my plate, so I am glad to see this moving forward.

I am a bit sad that I didn't succeed in starting a general vfs
discussion about this. Oh well. As long as we get weird bugs
like this fixed.

Eric

2022-09-20 18:14:55

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCH] fuse: In fuse_flush only wait if someone wants the return code

On Mon, Sep 19, 2022 at 09:03:41AM -0600, Tycho Andersen wrote:
> Hi Miklos,
>
> On Thu, Sep 01, 2022 at 08:06:47AM -0600, Tycho Andersen wrote:
> > From: "Eric W. Biederman" <[email protected]>
> >
> > In my very light testing this resolves a hang where a thread of the
> > fuse server was accessing the fuse filesystem (the fuse server is
> > serving up), when the fuse server is killed.
> >
> > The practical problem is that the fuse server file descriptor was
> > being closed after the file descriptor into the fuse filesystem so
> > that the fuse filesystem operations were being blocked for instead of
> > being aborted. Simply skipping the unnecessary wait resolves this
> > issue.
> >
> > This is just a proof of concept and someone should look to see if the
> > fuse max_background limit could cause a problem with this approach.

I tried to track this down last week, but it looks to me like since
the max_background is per-connection, this should work as expected
and not affect any other connections.
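
For reference, the relevant counters all live in struct fuse_conn, i.e. per
connection (roughly, excerpted from fs/fuse/fuse_i.h):

        struct fuse_conn {
                ...
                /** Maximum number of outstanding background requests */
                unsigned max_background;

                /** Number of requests currently in the background */
                unsigned num_background;

                /** Number of background requests currently queued for userspace */
                unsigned active_background;
                ...
        };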

> > Additionally testing PF_EXITING is a very crude way to tell if someone
> > wants the return code from the vfs flush operation. As such in the
> > long run it probably makes sense to get some direct vfs support for
> > knowing if flush needs to block until all of the flushing is complete
> > and a status/return code can be returned.
> >
> > Unless I have missed something this is a generic optimization that can
> > apply to many network filesystems.
> >
> > Al, vfs folks? (igrab/iput sorted so as not to be distractions).
> >
> > Perhaps a .flush_async method without a return code and a
> > filp_close_async function without a return code to take advantage of
> > this in the general sense.
> >
> > Waiting potentially indefinitely for user space in do_exit seems like a
> > bad idea. Especially when all that the wait is for is to get a return
> > code that will never be examined.
> >
> > Signed-off-by: "Eric W. Biederman" <[email protected]>
> > [tycho: small fixups for releasing fuse file + nocred flag]
> > Signed-off-by: Tycho Andersen <[email protected]>
> > Reported-by: Tycho Andersen <[email protected]>
> > Tested-by: "Serge E. Hallyn" <[email protected]>
>
> Any chance you're willing to take this patch? We're still seeing this
> a lot and it would be great to get it fixed.
>
> Thanks.
>
> Tycho

2022-09-26 15:59:09

by Tycho Andersen

[permalink] [raw]
Subject: Re: [PATCH] fuse: In fuse_flush only wait if someone wants the return code

Hi,

On Mon, Sep 19, 2022 at 09:03:47AM -0600, Tycho Andersen wrote:
> Hi Miklos,
>
> On Thu, Sep 01, 2022 at 08:06:47AM -0600, Tycho Andersen wrote:
> > From: "Eric W. Biederman" <[email protected]>
> >
> > In my very light testing this resolves a hang where a thread of the
> > fuse server was accessing the fuse filesystem (the fuse server is
> > serving up), when the fuse server is killed.
> >
> > The practical problem is that the fuse server file descriptor was
> > being closed after the file descriptor into the fuse filesystem so
> > that the fuse filesystem operations were being blocked for instead of
> > being aborted. Simply skipping the unnecessary wait resolves this
> > issue.
> >
> > This is just a proof of concept and someone should look to see if the
> > fuse max_background limit could cause a problem with this approach.
> >
> > Additionally testing PF_EXITING is a very crude way to tell if someone
> > wants the return code from the vfs flush operation. As such in the
> > long run it probably makes sense to get some direct vfs support for
> > knowing if flush needs to block until all of the flushing is complete
> > and a status/return code can be returned.
> >
> > Unless I have missed something this is a generic optimization that can
> > apply to many network filesystems.
> >
> > Al, vfs folks? (igrab/iput sorted so as not to be distractions).
> >
> > Perhaps a .flush_async method without a return code and a
> > filp_close_async function without a return code to take advantage of
> > this in the general sense.
> >
> > Waiting potentially indefinitely for user space in do_exit seems like a
> > bad idea. Especially when all that the wait is for is to get a return
> > code that will never be examined.
> >
> > Signed-off-by: "Eric W. Biederman" <[email protected]>
> > [tycho: small fixups for releasing fuse file + nocred flag]
> > Signed-off-by: Tycho Andersen <[email protected]>
> > Reported-by: Tycho Andersen <[email protected]>
> > Tested-by: "Serge E. Hallyn" <[email protected]>
>
> Any chance you're willing to take this patch? We're still seeing this
> a lot and it would be great to get it fixed.

Another ping here, can someone take this? Miklos?

Thanks,

Tycho

2022-09-27 10:04:09

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] fuse: In fuse_flush only wait if someone wants the return code

On Thu, 1 Sept 2022 at 16:07, Tycho Andersen <[email protected]> wrote:
>
> From: "Eric W. Biederman" <[email protected]>
>
> In my very light testing this resolves a hang where a thread of the
> fuse server was accessing the fuse filesystem (the fuse server is
> serving up), when the fuse server is killed.
>
> The practical problem is that the fuse server file descriptor was
> being closed after the file descriptor into the fuse filesystem so
> that the fuse filesystem operations were being blocked for instead of
> being aborted. Simply skipping the unnecessary wait resolves this
> issue.
>
> This is just a proof of concept and someone should look to see if the
> fuse max_background limit could cause a problem with this approach.

Maybe you missed my comments here:

https://lore.kernel.org/all/CAJfpegsTmiO-sKaBLgoVT4WxDXBkRES=HF1YmQN1ES7gfJEJ+w@mail.gmail.com/

I'm generally okay with this, but please write a proper changelog for
the patch, also mentioning the issues related to posix locks.

> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -464,6 +464,67 @@ static void fuse_sync_writes(struct inode *inode)
> fuse_release_nowrite(inode);
> }
>
> +struct fuse_flush_args {
> + struct fuse_args args;
> + struct fuse_flush_in inarg;
> + struct inode *inode;
> + struct fuse_file *ff;
> +};
> +
> +static void fuse_flush_end(struct fuse_mount *fm, struct fuse_args *args, int err)
> +{
> + struct fuse_flush_args *fa = container_of(args, typeof(*fa), args);
> +
> + if (err == -ENOSYS) {
> + fm->fc->no_flush = 1;
> + err = 0;
> + }
> +
> + /*
> + * In memory i_blocks is not maintained by fuse, if writeback cache is
> + * enabled, i_blocks from cached attr may not be accurate.
> + */
> + if (!err && fm->fc->writeback_cache)
> + fuse_invalidate_attr_mask(fa->inode, STATX_BLOCKS);
> +
> +
> + iput(fa->inode);
> + fuse_file_put(fa->ff, false, false);
> + kfree(fa);
> +}
> +
> +static int fuse_flush_async(struct file *file, fl_owner_t id)
> +{
> + struct inode *inode = file_inode(file);
> + struct fuse_mount *fm = get_fuse_mount(inode);
> + struct fuse_file *ff = file->private_data;
> + struct fuse_flush_args *fa;
> + int err;
> +
> + fa = kzalloc(sizeof(*fa), GFP_KERNEL);
> + if (!fa)
> + return -ENOMEM;
> +
> + fa->inarg.fh = ff->fh;
> + fa->inarg.lock_owner = fuse_lock_owner_id(fm->fc, id);
> + fa->args.opcode = FUSE_FLUSH;
> + fa->args.nodeid = get_node_id(inode);
> + fa->args.in_numargs = 1;
> + fa->args.in_args[0].size = sizeof(fa->inarg);
> + fa->args.in_args[0].value = &fa->inarg;
> + fa->args.force = true;
> + fa->args.nocreds = true;
> + fa->args.end = fuse_flush_end;
> + fa->inode = igrab(inode);

Grabbing the inode should already be taken care of by fuse_file_release().

Also please try to reduce duplication in both the above functions.

Thanks,
Miklos

2022-09-29 15:05:35

by Stef Bon

[permalink] [raw]
Subject: Re: [fuse-devel] [PATCH] fuse: In fuse_flush only wait if someone wants the return code

Hi,

I have some ideas about the cause of the error.
In the first message about this:

"However, there's a problem when the fuse daemon
itself spawns a thread that does a flush: since the thread has a copy of
the fd table with an fd pointing to the same fuse device, the reference
count isn't decremented to zero in fuse_dev_release(), and the task hangs
forever."

If the kernel starts to abort the filesystem (because the daemon in
userspace is terminated), but cannot complete that because a file
handle is still open due to a flush, resulting in a hang, then maybe
the decision to stop/abort the filesystem is wrong in the first place.
The kernel should look at the fuse device fd (which is duplicated
after spawning), find that there is still one fd open, and not go
ahead with aborting the fs.

I hope this helps,

Stef Bon
the Netherlands

On Tue, 27 Sep 2022 at 11:48, Miklos Szeredi via fuse-devel
<[email protected]> wrote:


>
> On Thu, 1 Sept 2022 at 16:07, Tycho Andersen <[email protected]> wrote:
> >
> > From: "Eric W. Biederman" <[email protected]>
> >
> > In my very light testing this resolves a hang where a thread of the
> > fuse server was accessing the fuse filesystem (the fuse server is
> > serving up), when the fuse server is killed.
> >
> > The practical problem is that the fuse server file descriptor was
> > being closed after the file descriptor into the fuse filesystem so
> > that the fuse filesystem operations were being blocked for instead of
> > being aborted. Simply skipping the unnecessary wait resolves this
> > issue.
> >
> > This is just a proof of concept and someone should look to see if the
> > fuse max_background limit could cause a problem with this approach.
>
> Maybe you missed my comments here:
>
> https://lore.kernel.org/all/CAJfpegsTmiO-sKaBLgoVT4WxDXBkRES=HF1YmQN1ES7gfJEJ+w@mail.gmail.com/
>
> I'm generally okay with this, but please write a proper changelog for
> the patch, also mentioning the issues related to posix locks.
>
> > --- a/fs/fuse/file.c
> > +++ b/fs/fuse/file.c
> > @@ -464,6 +464,67 @@ static void fuse_sync_writes(struct inode *inode)
> > fuse_release_nowrite(inode);
> > }
> >
> > +struct fuse_flush_args {
> > + struct fuse_args args;
> > + struct fuse_flush_in inarg;
> > + struct inode *inode;
> > + struct fuse_file *ff;
> > +};
> > +
> > +static void fuse_flush_end(struct fuse_mount *fm, struct fuse_args *args, int err)
> > +{
> > + struct fuse_flush_args *fa = container_of(args, typeof(*fa), args);
> > +
> > + if (err == -ENOSYS) {
> > + fm->fc->no_flush = 1;
> > + err = 0;
> > + }
> > +
> > + /*
> > + * In memory i_blocks is not maintained by fuse, if writeback cache is
> > + * enabled, i_blocks from cached attr may not be accurate.
> > + */
> > + if (!err && fm->fc->writeback_cache)
> > + fuse_invalidate_attr_mask(fa->inode, STATX_BLOCKS);
> > +
> > +
> > + iput(fa->inode);
> > + fuse_file_put(fa->ff, false, false);
> > + kfree(fa);
> > +}
> > +
> > +static int fuse_flush_async(struct file *file, fl_owner_t id)
> > +{
> > + struct inode *inode = file_inode(file);
> > + struct fuse_mount *fm = get_fuse_mount(inode);
> > + struct fuse_file *ff = file->private_data;
> > + struct fuse_flush_args *fa;
> > + int err;
> > +
> > + fa = kzalloc(sizeof(*fa), GFP_KERNEL);
> > + if (!fa)
> > + return -ENOMEM;
> > +
> > + fa->inarg.fh = ff->fh;
> > + fa->inarg.lock_owner = fuse_lock_owner_id(fm->fc, id);
> > + fa->args.opcode = FUSE_FLUSH;
> > + fa->args.nodeid = get_node_id(inode);
> > + fa->args.in_numargs = 1;
> > + fa->args.in_args[0].size = sizeof(fa->inarg);
> > + fa->args.in_args[0].value = &fa->inarg;
> > + fa->args.force = true;
> > + fa->args.nocreds = true;
> > + fa->args.end = fuse_flush_end;
> > + fa->inode = igrab(inode);
>
> Grabbing the inode should already taken care of by fuse_file_release().
>
> Also please try to reduce duplication in both the above functions.
>
> Thanks,
> Miklos
>
>

2022-09-29 16:47:09

by Tycho Andersen

[permalink] [raw]
Subject: [PATCH v2] fuse: In fuse_flush only wait if someone wants the return code

If a fuse filesystem is mounted inside a container, there is a problem
during pid namespace destruction. The scenario is:

1. task (a thread in the fuse server, with a fuse file open) starts
exiting, does exit_signals(), goes into fuse_flush() -> wait
2. fuse daemon gets killed, tries to wake everyone up
3. task from 1 is stuck because complete_signal() doesn't wake it up, since
it has PF_EXITING.

The result is that the thread will never be woken up, and pid namespace
destruction will block indefinitely.

To add insult to injury, nobody is waiting for these return codes, since
the pid namespace is being destroyed.

To fix this, let's not block on flush operations when the current task has
PF_EXITING.

This does change the semantics slightly: the wait here is for posix locks
to be unlocked, so the task will exit before things are unlocked. To quote
Miklos: https://lore.kernel.org/all/CAJfpegsTmiO-sKaBLgoVT4WxDXBkRES=HF1YmQN1ES7gfJEJ+w@mail.gmail.com/

> "remote" posix locks are almost never used due to problems like this,
> so I think it's safe to do this.

Signed-off-by: "Eric W. Biederman" <[email protected]>
Signed-off-by: Tycho Andersen <[email protected]>
Link: https://lore.kernel.org/all/YrShFXRLtRt6T%2Fj+@risky/
---
v2: drop the fuse_flush_async() function and just re-use the already
prepared args; add a description of the problem+note about posix locks
---
fs/fuse/file.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 50 insertions(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 05caa2b9272e..20bbe3e1afc7 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -464,6 +464,34 @@ static void fuse_sync_writes(struct inode *inode)
fuse_release_nowrite(inode);
}

+struct fuse_flush_args {
+ struct fuse_args args;
+ struct fuse_flush_in inarg;
+ struct inode *inode;
+ struct fuse_file *ff;
+};
+
+static void fuse_flush_end(struct fuse_mount *fm, struct fuse_args *args, int err)
+{
+ struct fuse_flush_args *fa = container_of(args, typeof(*fa), args);
+
+ if (err == -ENOSYS) {
+ fm->fc->no_flush = 1;
+ err = 0;
+ }
+
+ /*
+ * In memory i_blocks is not maintained by fuse, if writeback cache is
+ * enabled, i_blocks from cached attr may not be accurate.
+ */
+ if (!err && fm->fc->writeback_cache)
+ fuse_invalidate_attr_mask(fa->inode, STATX_BLOCKS);
+
+
+ fuse_file_put(fa->ff, false, false);
+ kfree(fa);
+}
+
static int fuse_flush(struct file *file, fl_owner_t id)
{
struct inode *inode = file_inode(file);
@@ -505,6 +533,28 @@ static int fuse_flush(struct file *file, fl_owner_t id)
args.in_args[0].value = &inarg;
args.force = true;

+ if (current->flags & PF_EXITING) {
+ struct fuse_flush_args *fa;
+
+ err = -ENOMEM;
+ fa = kzalloc(sizeof(*fa), GFP_KERNEL);
+ if (!fa)
+ goto inval_attr_out;
+
+ memcpy(&fa->args, &args, sizeof(args));
+ memcpy(&fa->inarg, &inarg, sizeof(inarg));
+ fa->args.nocreds = true;
+ fa->args.end = fuse_flush_end;
+ fa->ff = fuse_file_get(ff);
+ fa->inode = inode;
+
+ err = fuse_simple_background(fm, &fa->args, GFP_KERNEL);
+ if (err)
+ fuse_flush_end(fm, &fa->args, err);
+
+ return err;
+ }
+
err = fuse_simple_request(fm, &args);
if (err == -ENOSYS) {
fm->fc->no_flush = 1;

base-commit: 3d7cb6b04c3f3115719235cc6866b10326de34cd
--
2.34.1

2022-09-30 13:59:13

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH v2] fuse: In fuse_flush only wait if someone wants the return code

On Thu, 29 Sept 2022 at 18:40, Tycho Andersen <[email protected]> wrote:
>
> If a fuse filesystem is mounted inside a container, there is a problem
> during pid namespace destruction. The scenario is:
>
> 1. task (a thread in the fuse server, with a fuse file open) starts
> exiting, does exit_signals(), goes into fuse_flush() -> wait

Can't the same happen through

fuse_flush -> fuse_sync_writes -> fuse_set_nowrite -> wait

?
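
For context, that path ends in an uninterruptible wait too; fuse_set_nowrite()
looks roughly like this (paraphrased from fs/fuse/dir.c, locking details differ
between kernel versions):

        void fuse_set_nowrite(struct inode *inode)
        {
                struct fuse_inode *fi = get_fuse_inode(inode);

                BUG_ON(!inode_is_locked(inode));

                spin_lock(&fi->lock);
                BUG_ON(fi->writectr < 0);
                fi->writectr += FUSE_NOWRITE;
                spin_unlock(&fi->lock);
                /* plain wait_event(), not killable: waits for all in-flight
                 * writepage requests to complete */
                wait_event(fi->page_waitq, fi->writectr == FUSE_NOWRITE);
        }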


> 2. fuse daemon gets killed, tries to wake everyone up
> 3. task from 1 is stuck because complete_signal() doesn't wake it up, since
> it has PF_EXITING.
>
> The result is that the thread will never be woken up, and pid namespace
> destruction will block indefinitely.
>
> To add insult to injury, nobody is waiting for these return codes, since
> the pid namespace is being destroyed.
>
> To fix this, let's not block on flush operations when the current task has
> PF_EXITING.
>
> This does change the semantics slightly: the wait here is for posix locks
> to be unlocked, so the task will exit before things are unlocked. To quote
> Miklos: https://lore.kernel.org/all/CAJfpegsTmiO-sKaBLgoVT4WxDXBkRES=HF1YmQN1ES7gfJEJ+w@mail.gmail.com/
>
> > "remote" posix locks are almost never used due to problems like this,
> > so I think it's safe to do this.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> Signed-off-by: Tycho Andersen <[email protected]>
> Link: https://lore.kernel.org/all/YrShFXRLtRt6T%2Fj+@risky/
> ---
> v2: drop the fuse_flush_async() function and just re-use the already
> prepared args; add a description of the problem+note about posix locks
> ---
> fs/fuse/file.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 50 insertions(+)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 05caa2b9272e..20bbe3e1afc7 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -464,6 +464,34 @@ static void fuse_sync_writes(struct inode *inode)
> fuse_release_nowrite(inode);
> }
>
> +struct fuse_flush_args {
> + struct fuse_args args;
> + struct fuse_flush_in inarg;
> + struct inode *inode;
> + struct fuse_file *ff;
> +};
> +
> +static void fuse_flush_end(struct fuse_mount *fm, struct fuse_args *args, int err)
> +{
> + struct fuse_flush_args *fa = container_of(args, typeof(*fa), args);
> +
> + if (err == -ENOSYS) {
> + fm->fc->no_flush = 1;
> + err = 0;
> + }
> +
> + /*
> + * In memory i_blocks is not maintained by fuse, if writeback cache is
> + * enabled, i_blocks from cached attr may not be accurate.
> + */
> + if (!err && fm->fc->writeback_cache)
> + fuse_invalidate_attr_mask(fa->inode, STATX_BLOCKS);

This is still duplicating code, can you please create a helper?

Thanks,
Miklos

2022-09-30 14:36:06

by Tycho Andersen

[permalink] [raw]
Subject: Re: [PATCH v2] fuse: In fuse_flush only wait if someone wants the return code

On Fri, Sep 30, 2022 at 03:35:16PM +0200, Miklos Szeredi wrote:
> On Thu, 29 Sept 2022 at 18:40, Tycho Andersen <[email protected]> wrote:
> >
> > If a fuse filesystem is mounted inside a container, there is a problem
> > during pid namespace destruction. The scenario is:
> >
> > 1. task (a thread in the fuse server, with a fuse file open) starts
> > exiting, does exit_signals(), goes into fuse_flush() -> wait
>
> Can't the same happen through
>
> fuse_flush -> fuse_sync_writes -> fuse_set_nowrite -> wait
>
> ?

Looks like yes, though I haven't seen this in the wild; I guess
because there aren't multiple writers most of the time in the user
code that causes this.

I'm not exactly sure how to fix this. Reading through 3be5a52b30aa
("fuse: support writable mmap"), we don't want to allow multiple
writes since that may do allocations, which could cause deadlocks. But
in this case we have no reliable way to wait (besides a busy loop, I
suppose).

Maybe just a check for PF_EXITING and a pr_warn() with "echo 1 >
/sys/fs/fuse/connections/$N/abort" or something?
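
Something along these lines, as a purely hypothetical sketch (it assumes fc->dev
is what fusectl names the connection directory after):

        if (current->flags & PF_EXITING)
                pr_warn_ratelimited("fuse: exiting task %d blocked in flush, "
                                    "abort via /sys/fs/fuse/connections/%u/abort\n",
                                    task_pid_nr(current), fm->fc->dev);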

> > + /*
> > + * In memory i_blocks is not maintained by fuse, if writeback cache is
> > + * enabled, i_blocks from cached attr may not be accurate.
> > + */
> > + if (!err && fm->fc->writeback_cache)
> > + fuse_invalidate_attr_mask(fa->inode, STATX_BLOCKS);
>
> This is still duplicating code, can you please create a helper?

Yep, will do, pending the outcome of the above discussion.

Tycho

2022-09-30 14:45:36

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH v2] fuse: In fuse_flush only wait if someone wants the return code

On Fri, 30 Sept 2022 at 16:01, Tycho Andersen <[email protected]> wrote:
>
> On Fri, Sep 30, 2022 at 03:35:16PM +0200, Miklos Szeredi wrote:
> > On Thu, 29 Sept 2022 at 18:40, Tycho Andersen <[email protected]> wrote:
> > >
> > > If a fuse filesystem is mounted inside a container, there is a problem
> > > during pid namespace destruction. The scenario is:
> > >
> > > 1. task (a thread in the fuse server, with a fuse file open) starts
> > > exiting, does exit_signals(), goes into fuse_flush() -> wait
> >
> > Can't the same happen through
> >
> > fuse_flush -> fuse_sync_writes -> fuse_set_nowrite -> wait
> >
> > ?
>
> Looks like yes, though I haven't seen this in the wild, I guess
> because there aren't multiple writers most of the time the user code
> that causes this.
>
> I'm not exactly sure how to fix this. Reading through 3be5a52b30aa
> ("fuse: support writable mmap"), we don't want to allow multiple
> writes since that may do allocations, which could cause deadlocks. But
> in this case we have no reliable way to wait (besides a busy loop, I
> suppose).
>
> Maybe just a check for PF_EXITING and a pr_warn() with "echo 1 >
> /sys/fs/fuse/connections/$N/abort" or something?

AFAICS it should be perfectly normal (and trivial to trigger) for an
exiting process to have its dirty pages flushed through fuse_flush().

We could do that asynchronously as well; generally there are no
promises about dirty pages being synced as part of the process
exiting. But the ordering between dirty page flushing and sending the
FUSE_FLUSH request should be kept, which needs more complexity,
unfortunately.

Thanks,
Miklos

2022-09-30 16:35:46

by Tycho Andersen

[permalink] [raw]
Subject: Re: [PATCH v2] fuse: In fuse_flush only wait if someone wants the return code

On Fri, Sep 30, 2022 at 04:41:37PM +0200, Miklos Szeredi wrote:
> On Fri, 30 Sept 2022 at 16:01, Tycho Andersen <[email protected]> wrote:
> >
> > On Fri, Sep 30, 2022 at 03:35:16PM +0200, Miklos Szeredi wrote:
> > > On Thu, 29 Sept 2022 at 18:40, Tycho Andersen <[email protected]> wrote:
> > > >
> > > > If a fuse filesystem is mounted inside a container, there is a problem
> > > > during pid namespace destruction. The scenario is:
> > > >
> > > > 1. task (a thread in the fuse server, with a fuse file open) starts
> > > > exiting, does exit_signals(), goes into fuse_flush() -> wait
> > >
> > > Can't the same happen through
> > >
> > > fuse_flush -> fuse_sync_writes -> fuse_set_nowrite -> wait
> > >
> > > ?
> >
> > Looks like yes, though I haven't seen this in the wild, I guess
> > because there aren't multiple writers most of the time the user code
> > that causes this.
> >
> > I'm not exactly sure how to fix this. Reading through 3be5a52b30aa
> > ("fuse: support writable mmap"), we don't want to allow multiple
> > writes since that may do allocations, which could cause deadlocks. But
> > in this case we have no reliable way to wait (besides a busy loop, I
> > suppose).
> >
> > Maybe just a check for PF_EXITING and a pr_warn() with "echo 1 >
> > /sys/fs/fuse/connections/$N/abort" or something?
>
> AFAICS it should be perfectly normal (and trivial to trigger) for an
> exiting process to have its dirty pages flushed through fuse_flush().

Agreed.

> We could do that asynchronously as well, generally there are no
> promises about dirty pages being synced as part of the process exiting
> . But ordering between dirty page flushing and sending the FUSE_FLUSH
> request should be kept. Which needs more complexity, unfortunately.

How can we wait in fuse_set_nowrite()? Or are you suggesting we just
do a fuse_flush_writepages() in the async part and hope for the best?

Thanks,

Tycho

2022-09-30 20:26:33

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCH] fuse: In fuse_flush only wait if someone wants the return code

On Tue, Sep 27, 2022 at 11:46:44AM +0200, Miklos Szeredi wrote:
> On Thu, 1 Sept 2022 at 16:07, Tycho Andersen <[email protected]> wrote:
> >
> > From: "Eric W. Biederman" <[email protected]>
> >
> > In my very light testing this resolves a hang where a thread of the
> > fuse server was accessing the fuse filesystem (the fuse server is
> > serving up), when the fuse server is killed.
> >
> > The practical problem is that the fuse server file descriptor was
> > being closed after the file descriptor into the fuse filesystem so
> > that the fuse filesystem operations were being blocked for instead of
> > being aborted. Simply skipping the unnecessary wait resolves this
> > issue.
> >
> > This is just a proof of concept and someone should look to see if the
> > fuse max_background limit could cause a problem with this approach.
>
> Maybe you missed my comments here:
>
> https://lore.kernel.org/all/CAJfpegsTmiO-sKaBLgoVT4WxDXBkRES=HF1YmQN1ES7gfJEJ+w@mail.gmail.com/

That's odd - fwiw I too had completely missed that reply, sorry.

> I'm generally okay with this, but please write a proper changelog for
> the patch, also mentioning the issues related to posix locks.
>
> > --- a/fs/fuse/file.c
> > +++ b/fs/fuse/file.c
> > @@ -464,6 +464,67 @@ static void fuse_sync_writes(struct inode *inode)
> > fuse_release_nowrite(inode);
> > }
> >
> > +struct fuse_flush_args {
> > + struct fuse_args args;
> > + struct fuse_flush_in inarg;
> > + struct inode *inode;
> > + struct fuse_file *ff;
> > +};
> > +
> > +static void fuse_flush_end(struct fuse_mount *fm, struct fuse_args *args, int err)
> > +{
> > + struct fuse_flush_args *fa = container_of(args, typeof(*fa), args);
> > +
> > + if (err == -ENOSYS) {
> > + fm->fc->no_flush = 1;
> > + err = 0;
> > + }
> > +
> > + /*
> > + * In memory i_blocks is not maintained by fuse, if writeback cache is
> > + * enabled, i_blocks from cached attr may not be accurate.
> > + */
> > + if (!err && fm->fc->writeback_cache)
> > + fuse_invalidate_attr_mask(fa->inode, STATX_BLOCKS);
> > +
> > +
> > + iput(fa->inode);
> > + fuse_file_put(fa->ff, false, false);
> > + kfree(fa);
> > +}
> > +
> > +static int fuse_flush_async(struct file *file, fl_owner_t id)
> > +{
> > + struct inode *inode = file_inode(file);
> > + struct fuse_mount *fm = get_fuse_mount(inode);
> > + struct fuse_file *ff = file->private_data;
> > + struct fuse_flush_args *fa;
> > + int err;
> > +
> > + fa = kzalloc(sizeof(*fa), GFP_KERNEL);
> > + if (!fa)
> > + return -ENOMEM;
> > +
> > + fa->inarg.fh = ff->fh;
> > + fa->inarg.lock_owner = fuse_lock_owner_id(fm->fc, id);
> > + fa->args.opcode = FUSE_FLUSH;
> > + fa->args.nodeid = get_node_id(inode);
> > + fa->args.in_numargs = 1;
> > + fa->args.in_args[0].size = sizeof(fa->inarg);
> > + fa->args.in_args[0].value = &fa->inarg;
> > + fa->args.force = true;
> > + fa->args.nocreds = true;
> > + fa->args.end = fuse_flush_end;
> > + fa->inode = igrab(inode);
>
> Grabbing the inode should already taken care of by fuse_file_release().
>
> Also please try to reduce duplication in both the above functions.
>
> Thanks,
> Miklos

2022-10-26 09:15:22

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH v2] fuse: In fuse_flush only wait if someone wants the return code

On Fri, 30 Sept 2022 at 18:10, Tycho Andersen <[email protected]> wrote:
>
> On Fri, Sep 30, 2022 at 04:41:37PM +0200, Miklos Szeredi wrote:
> > On Fri, 30 Sept 2022 at 16:01, Tycho Andersen <[email protected]> wrote:
> > >
> > > On Fri, Sep 30, 2022 at 03:35:16PM +0200, Miklos Szeredi wrote:
> > > > On Thu, 29 Sept 2022 at 18:40, Tycho Andersen <[email protected]> wrote:
> > > > >
> > > > > If a fuse filesystem is mounted inside a container, there is a problem
> > > > > during pid namespace destruction. The scenario is:
> > > > >
> > > > > 1. task (a thread in the fuse server, with a fuse file open) starts
> > > > > exiting, does exit_signals(), goes into fuse_flush() -> wait
> > > >
> > > > Can't the same happen through
> > > >
> > > > fuse_flush -> fuse_sync_writes -> fuse_set_nowrite -> wait
> > > >
> > > > ?
> > >
> > > Looks like yes, though I haven't seen this in the wild, I guess
> > > because there aren't multiple writers most of the time the user code
> > > that causes this.
> > >
> > > I'm not exactly sure how to fix this. Reading through 3be5a52b30aa
> > > ("fuse: support writable mmap"), we don't want to allow multiple
> > > writes since that may do allocations, which could cause deadlocks. But
> > > in this case we have no reliable way to wait (besides a busy loop, I
> > > suppose).
> > >
> > > Maybe just a check for PF_EXITING and a pr_warn() with "echo 1 >
> > > /sys/fs/fuse/connections/$N/abort" or something?
> >
> > AFAICS it should be perfectly normal (and trivial to trigger) for an
> > exiting process to have its dirty pages flushed through fuse_flush().
>
> Agreed.
>
> > We could do that asynchronously as well, generally there are no
> > promises about dirty pages being synced as part of the process exiting
> > . But ordering between dirty page flushing and sending the FUSE_FLUSH
> > request should be kept. Which needs more complexity, unfortunately.
>
> How can we wait in fuse_set_nowrite()? Or are you suggesting we just
> do a fuse_flush_writepages() in the async part and hope for the best?

I was thinking along the lines of calling schedule_work() in the
exiting case to do the flush.

Thanks,
Miklos

2022-11-14 16:20:56

by Tycho Andersen

[permalink] [raw]
Subject: [PATCH v3] fuse: In fuse_flush only wait if someone wants the return code

If a fuse filesystem is mounted inside a container, there is a problem
during pid namespace destruction. The scenario is:

1. task (a thread in the fuse server, with a fuse file open) starts
exiting, does exit_signals(), goes into fuse_flush() -> wait
2. fuse daemon gets killed, tries to wake everyone up
3. task from 1 is stuck because complete_signal() doesn't wake it up, since
it has PF_EXITING.

The result is that the thread will never be woken up, and pid namespace
destruction will block indefinitely.

To add insult to injury, nobody is waiting for these return codes, since
the pid namespace is being destroyed.

To fix this, let's not block on flush operations when the current task has
PF_EXITING.

This does change the semantics slightly: the wait here is for posix locks
to be unlocked, so the task will exit before things are unlocked. To quote
Miklos: https://lore.kernel.org/all/CAJfpegsTmiO-sKaBLgoVT4WxDXBkRES=HF1YmQN1ES7gfJEJ+w@mail.gmail.com/

> "remote" posix locks are almost never used due to problems like this,
> so I think it's safe to do this.

Signed-off-by: "Eric W. Biederman" <[email protected]>
Signed-off-by: Tycho Andersen <[email protected]>
Link: https://lore.kernel.org/all/YrShFXRLtRt6T%2Fj+@risky/
---
v2: drop the fuse_flush_async() function and just re-use the already
prepared args; add a description of the problem+note about posix locks
v3: use schedule_work() to avoid other sleeps in write_inode_now() and
fuse_sync_writes(). Fix a UAF of the stack-based inarg.
---
fs/fuse/file.c | 106 +++++++++++++++++++++++++++++++++++++++----------
1 file changed, 84 insertions(+), 22 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 71bfb663aac5..10173b0e74b7 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -18,6 +18,7 @@
#include <linux/falloc.h>
#include <linux/uio.h>
#include <linux/fs.h>
+#include <linux/file.h>

static int fuse_send_open(struct fuse_mount *fm, u64 nodeid,
unsigned int open_flags, int opcode,
@@ -477,20 +478,20 @@ static void fuse_sync_writes(struct inode *inode)
fuse_release_nowrite(inode);
}

-static int fuse_flush(struct file *file, fl_owner_t id)
+static void fuse_invalidate_attrs(struct fuse_mount *fm, int err, struct inode *inode)
{
- struct inode *inode = file_inode(file);
- struct fuse_mount *fm = get_fuse_mount(inode);
- struct fuse_file *ff = file->private_data;
- struct fuse_flush_in inarg;
- FUSE_ARGS(args);
- int err;
-
- if (fuse_is_bad(inode))
- return -EIO;
+ /*
+ * In memory i_blocks is not maintained by fuse, if writeback cache is
+ * enabled, i_blocks from cached attr may not be accurate.
+ */
+ if (!err && fm->fc->writeback_cache)
+ fuse_invalidate_attr_mask(inode, STATX_BLOCKS);
+}

- if (ff->open_flags & FOPEN_NOFLUSH && !fm->fc->writeback_cache)
- return 0;
+static int do_fuse_flush(struct fuse_mount *fm, struct inode *inode,
+ struct file *file, struct fuse_args *args)
+{
+ int err;

err = write_inode_now(inode, 1);
if (err)
@@ -504,6 +505,53 @@ static int fuse_flush(struct file *file, fl_owner_t id)
if (err)
return err;

+ err = fuse_simple_request(fm, args);
+ if (err == -ENOSYS) {
+ fm->fc->no_flush = 1;
+ err = 0;
+ }
+
+ return err;
+}
+
+struct fuse_flush_args {
+ struct fuse_args args;
+ struct fuse_flush_in inarg;
+ struct inode *inode;
+ struct fuse_file *ff;
+ struct work_struct work;
+ struct file *file;
+};
+
+static void fuse_flush_async(struct work_struct *work)
+{
+ struct fuse_flush_args *fa = container_of(work, typeof(*fa), work);
+ struct fuse_mount *fm = get_fuse_mount(fa->inode);
+ int err;
+
+ err = do_fuse_flush(fm, fa->inode, fa->file, &fa->args);
+ if (err < 0)
+ fuse_invalidate_attrs(fm, err, fa->inode);
+ fuse_file_put(fa->ff, false, false);
+ fput(fa->file);
+ kfree(fa);
+}
+
+static int fuse_flush(struct file *file, fl_owner_t id)
+{
+ struct inode *inode = file_inode(file);
+ struct fuse_mount *fm = get_fuse_mount(inode);
+ struct fuse_file *ff = file->private_data;
+ struct fuse_flush_in inarg;
+ FUSE_ARGS(args);
+ int err;
+
+ if (fuse_is_bad(inode))
+ return -EIO;
+
+ if (ff->open_flags & FOPEN_NOFLUSH && !fm->fc->writeback_cache)
+ return 0;
+
err = 0;
if (fm->fc->no_flush)
goto inval_attr_out;
@@ -518,19 +566,33 @@ static int fuse_flush(struct file *file, fl_owner_t id)
args.in_args[0].value = &inarg;
args.force = true;

- err = fuse_simple_request(fm, &args);
- if (err == -ENOSYS) {
- fm->fc->no_flush = 1;
- err = 0;
+ if (current->flags & PF_EXITING) {
+ struct fuse_flush_args *fa;
+
+ err = -ENOMEM;
+ fa = kzalloc(sizeof(*fa), GFP_KERNEL);
+ if (!fa)
+ goto inval_attr_out;
+
+ memcpy(&fa->args, &args, sizeof(args));
+ memcpy(&fa->inarg, &inarg, sizeof(inarg));
+ fa->args.in_args[0].value = &fa->inarg;
+ fa->args.nocreds = true;
+ fa->ff = fuse_file_get(ff);
+ fa->inode = inode;
+ fa->file = get_file(file);
+
+ INIT_WORK(&fa->work, fuse_flush_async);
+ schedule_work(&fa->work);
+ return 0;
}

+ err = do_fuse_flush(fm, inode, file, &args);
+ if (!err)
+ return 0;
+
inval_attr_out:
- /*
- * In memory i_blocks is not maintained by fuse, if writeback cache is
- * enabled, i_blocks from cached attr may not be accurate.
- */
- if (!err && fm->fc->writeback_cache)
- fuse_invalidate_attr_mask(inode, STATX_BLOCKS);
+ fuse_invalidate_attrs(fm, err, inode);
return err;
}


base-commit: f0c4d9fc9cc9462659728d168387191387e903cc
--
2.34.1


2022-11-28 15:33:05

by Tycho Andersen

[permalink] [raw]
Subject: Re: [PATCH v3] fuse: In fuse_flush only wait if someone wants the return code

Hi Miklos,

On Mon, Nov 14, 2022 at 09:02:09AM -0700, Tycho Andersen wrote:
> v3: use schedule_work() to avoid other sleeps in inode_write_now() and
> fuse_sync_writes(). Fix a UAF of the stack-based inarg.

Thoughts on this version?

Thanks,

Tycho

2022-12-08 14:42:56

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH v3] fuse: In fuse_flush only wait if someone wants the return code

On Mon, 28 Nov 2022 at 16:01, Tycho Andersen <[email protected]> wrote:
>
> Hi Milkos,
>
> On Mon, Nov 14, 2022 at 09:02:09AM -0700, Tycho Andersen wrote:
> > v3: use schedule_work() to avoid other sleeps in inode_write_now() and
> > fuse_sync_writes(). Fix a UAF of the stack-based inarg.
>
> Thoughts on this version?

Skipping attr invalidation on success is wrong. And there's still too
much duplication, IMO.

How about the attached (untested) patch?

Thanks,
Miklos


Attachments:
fuse-flush-async-if-exiting.patch (3.29 kB)

2022-12-08 18:06:45

by Tycho Andersen

[permalink] [raw]
Subject: Re: [PATCH v3] fuse: In fuse_flush only wait if someone wants the return code

On Thu, Dec 08, 2022 at 03:26:19PM +0100, Miklos Szeredi wrote:
> On Mon, 28 Nov 2022 at 16:01, Tycho Andersen <[email protected]> wrote:
> >
> > Hi Milkos,
> >
> > On Mon, Nov 14, 2022 at 09:02:09AM -0700, Tycho Andersen wrote:
> > > v3: use schedule_work() to avoid other sleeps in inode_write_now() and
> > > fuse_sync_writes(). Fix a UAF of the stack-based inarg.
> >
> > Thoughts on this version?
>
> Skipping attr invalidation on success is wrong.

Agreed, that looks like my mistake.

> How about the attached (untested) patch?

It passes my reproducer with no warnings or anything. Feel free to
add:

Tested-by: Tycho Andersen <[email protected]>

if you want to commit it.

Tycho

2022-12-19 19:38:12

by Tycho Andersen

[permalink] [raw]
Subject: Re: [PATCH v3] fuse: In fuse_flush only wait if someone wants the return code

On Thu, Dec 08, 2022 at 10:49:30AM -0700, Tycho Andersen wrote:
> On Thu, Dec 08, 2022 at 03:26:19PM +0100, Miklos Szeredi wrote:
> > On Mon, 28 Nov 2022 at 16:01, Tycho Andersen <[email protected]> wrote:
> > >
> > > Hi Milkos,
> > >
> > > On Mon, Nov 14, 2022 at 09:02:09AM -0700, Tycho Andersen wrote:
> > > > v3: use schedule_work() to avoid other sleeps in inode_write_now() and
> > > > fuse_sync_writes(). Fix a UAF of the stack-based inarg.
> > >
> > > Thoughts on this version?
> >
> > Skipping attr invalidation on success is wrong.
>
> Agreed, that looks like my mistake.
>
> > How about the attached (untested) patch?
>
> It passes my reproducer with no warnings or anything. Feel free to
> add:
>
> Tested-by: Tycho Andersen <[email protected]>
>
> if you want to commit it.

Ping, thoughts on landing this?

Thanks,

Tycho

2023-01-03 14:56:47

by Tycho Andersen

[permalink] [raw]
Subject: Re: [PATCH v3] fuse: In fuse_flush only wait if someone wants the return code

On Mon, Dec 19, 2022 at 12:16:50PM -0700, Tycho Andersen wrote:
> On Thu, Dec 08, 2022 at 10:49:30AM -0700, Tycho Andersen wrote:
> > On Thu, Dec 08, 2022 at 03:26:19PM +0100, Miklos Szeredi wrote:
> > > On Mon, 28 Nov 2022 at 16:01, Tycho Andersen <[email protected]> wrote:
> > > >
> > > > Hi Milkos,
> > > >
> > > > On Mon, Nov 14, 2022 at 09:02:09AM -0700, Tycho Andersen wrote:
> > > > > v3: use schedule_work() to avoid other sleeps in inode_write_now() and
> > > > > fuse_sync_writes(). Fix a UAF of the stack-based inarg.
> > > >
> > > > Thoughts on this version?
> > >
> > > Skipping attr invalidation on success is wrong.
> >
> > Agreed, that looks like my mistake.
> >
> > > How about the attached (untested) patch?
> >
> > It passes my reproducer with no warnings or anything. Feel free to
> > add:
> >
> > Tested-by: Tycho Andersen <[email protected]>
> >
> > if you want to commit it.
>
> Ping, thoughts on landing this?

Happy new year all. Any update here?

Thanks,

Tycho

2023-01-05 16:30:40

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCH v3] fuse: In fuse_flush only wait if someone wants the return code

On Tue, Jan 03, 2023 at 07:51:22AM -0700, Tycho Andersen wrote:
> On Mon, Dec 19, 2022 at 12:16:50PM -0700, Tycho Andersen wrote:
> > On Thu, Dec 08, 2022 at 10:49:30AM -0700, Tycho Andersen wrote:
> > > On Thu, Dec 08, 2022 at 03:26:19PM +0100, Miklos Szeredi wrote:
> > > > On Mon, 28 Nov 2022 at 16:01, Tycho Andersen <[email protected]> wrote:
> > > > >
> > > > > Hi Milkos,
> > > > >
> > > > > On Mon, Nov 14, 2022 at 09:02:09AM -0700, Tycho Andersen wrote:
> > > > > > v3: use schedule_work() to avoid other sleeps in inode_write_now() and
> > > > > > fuse_sync_writes(). Fix a UAF of the stack-based inarg.
> > > > >
> > > > > Thoughts on this version?
> > > >
> > > > Skipping attr invalidation on success is wrong.
> > >
> > > Agreed, that looks like my mistake.
> > >
> > > > How about the attached (untested) patch?
> > >
> > > It passes my reproducer with no warnings or anything. Feel free to
> > > add:
> > >
> > > Tested-by: Tycho Andersen <[email protected]>
> > >
> > > if you want to commit it.
> >
> > Ping, thoughts on landing this?
>
> Happy new year all. Any update here?
>
> Thanks,
>
> Tycho

Thanks for pushing on this, Tycho. I'd suggest sending a clean new version
incorporating Miklos' fix.

-serge

2023-01-26 14:13:03

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH v3] fuse: In fuse_flush only wait if someone wants the return code

On Tue, 3 Jan 2023 at 15:51, Tycho Andersen <[email protected]> wrote:

> Happy new year all. Any update here?

Applied, thanks.

Miklos