Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20;
From:   "Eric W. Biederman" <ebiederm@xmission.com>
To:     Tycho Andersen <tycho@tycho.pizza>
Cc:     Oleg Nesterov <oleg@redhat.com>,
        "Serge E. Hallyn" <serge@hallyn.com>,
        Miklos Szeredi <miklos@szeredi.hu>,
        linux-kernel@vger.kernel.org
References: <20220713175305.1327649-1-tycho@tycho.pizza>
        <20220720150328.GA30749@mail.hallyn.com>
        <YthsgqAZYnwHZLn+@tycho.pizza> <20220721015459.GA4297@mail.hallyn.com>
        <YuFdUj5X4qckC/6g@tycho.pizza> <20220727175538.GC18822@redhat.com>
        <YuGBXnqb5rPwAlYk@tycho.pizza> <20220727191949.GD18822@redhat.com>
        <YuGUyayVWDB7R89i@tycho.pizza> <20220728091220.GA11207@redhat.com>
        <YuL9uc8WfiYlb2Hw@tycho.pizza>
Date:   Fri, 29 Jul 2022 00:04:17 -0500
In-Reply-To: <YuL9uc8WfiYlb2Hw@tycho.pizza> (Tycho Andersen's message of "Thu,
        28 Jul 2022 15:20:57 -0600")
Message-ID: <87pmhofr1q.fsf@email.froward.int.ebiederm.org>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.1 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain
Subject: Re: [PATCH] sched: __fatal_signal_pending() should also check
 PF_EXITING
Precedence: bulk

Tycho Andersen <tycho@tycho.pizza> writes:

> On Thu, Jul 28, 2022 at 11:12:20AM +0200, Oleg Nesterov wrote:
>> This is clear, but it seems you do not understand me. Let me try again
>> to explain and please correct me if I am wrong.
>> 
>> To simplify, lets suppose we have a single-thread task T which simply
>> does
>> 	__set_current_state(TASK_KILLABLE);
>> 	schedule();
>> 
>> in the do_exit() paths after exit_signals() which sets PF_EXITING. Btw,
>> note that it even documents that this thread is not "visible" for the
>> group-wide signals, see below.
>> 
>> Now, suppose that this task is running and you send SIGKILL. T will
>> dequeue SIGKILL from T->pending and call do_exit(). However, it won't
>> remove SIGKILL from T->signal.shared_pending(), and this means that
>> signal_pending(T) is still true.
>> 
>> Now. If we add a PF_EXITING or sigismember(shared_pending, SIGKILL) check
>> into __fatal_signal_pending(), then yes, T won't block in schedule(),
>> schedule()->signal_pending_state() will return true.
>> 
>> But what if T exits on its own? It will block in schedule() forever.
>> schedule()->signal_pending_state() will not even check __fatal_signal_pending(),
>> signal_pending() == F.
>> 
>> Now if you send SIGKILL to this task, SIGKILL won't wake it up or even
>> set TIF_SIGPENDING, complete_signal() will do nothing.
>> 
>> See?
>> 
>> I agree, we should probably cleanup this logic and define how exactly
>> the exiting task should react to signals (not only fatal signals). But
>> your patch certainly doesn't look good to me and it is not enough.
>> Maybe we can change get_signal() to not remove SIGKILL from t->pending
>> for the start... not sure, this needs another discussion.
>
> Thank you for this! Between that and Eric's line about:
>
>> Frankly that there are some left over SIGKILL bits in the pending mask
>> is a misfeature, and it is definitely not something you should count on.
>
> I think I finally maybe understand the objections.
>
> Is it fair to say that a task with PF_EXITING should never wait? I'm
> wondering if a solution would be to patch the wait code to look for
> PF_EXITING, in addition to checking the signal state.

That will at a minimum change zap_pid_ns_processes to busy wait
instead of sleeping while it waits for children to die.
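
To make the busy wait concrete, here is a toy userspace model of the
check made before a task is put to sleep.  This is not kernel code: the
struct, the TOY_* bit values and the check_exiting knob are all made up
for illustration, and where exactly a PF_EXITING test would go is my
assumption about what such a patch would look like.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative bit values only, not the kernel's. */
#define TOY_TASK_INTERRUPTIBLE 0x1u
#define TOY_TASK_WAKEKILL      0x2u
#define TOY_PF_EXITING         0x4u

struct toy_task {
	unsigned int flags;    /* TOY_PF_* bits */
	bool sigpending;       /* stands in for TIF_SIGPENDING */
	bool fatal_pending;    /* stands in for __fatal_signal_pending() */
};

/*
 * Model of the check made before a task sleeps: true means "do not
 * block".  check_exiting enables the hypothetical PF_EXITING test.
 */
bool toy_skip_sleep(unsigned int state, const struct toy_task *t,
		    bool check_exiting)
{
	assert(t != NULL);
	if (!(state & (TOY_TASK_INTERRUPTIBLE | TOY_TASK_WAKEKILL)))
		return false;
	if (check_exiting && (t->flags & TOY_PF_EXITING))
		return true;
	if (!t->sigpending)
		return false;
	return (state & TOY_TASK_INTERRUPTIBLE) || t->fatal_pending;
}

/*
 * An exit-path wait loop whose condition stays false for `rounds`
 * iterations.  Returns how many iterations came straight back from
 * the sleep, i.e. spun instead of blocking.
 */
int toy_spins(const struct toy_task *t, bool check_exiting, int rounds)
{
	int spins = 0;

	for (int i = 0; i < rounds; i++)
		if (toy_skip_sleep(TOY_TASK_INTERRUPTIBLE, t, check_exiting))
			spins++;
	return spins;
}
```

For an exiting task with no signal pending, toy_spins() reports no
spins without the check and one spin per round with it.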

So we would need to survey the waits that can happen when closing file
descriptors and any other place on the exit path to see how much impact
such a change would have.


It might be possible to allow an extra SIGKILL to terminate such waits.
We do something like that for coredumps.  But that is incredibly subtle
and a pain to maintain so I want to avoid that if we can.


>> Finally. if fuse_flush() wants __fatal_signal_pending() == T when the
>> caller exits, perhaps it can do it itself? Something like
>> 
>> 	if (current->flags & PF_EXITING) {
>> 		spin_lock_irq(siglock);
>> 		set_thread_flag(TIF_SIGPENDING);
>> 		sigaddset(&current->pending.signal, SIGKILL);
>> 		spin_unlock_irq(siglock);
>> 	}
>> 
>> Sure, this is ugly as hell. But perhaps this can serve as a workaround?
>
> or even just
>
>     if (current->flags & PF_EXITING)
>         return;
>
> since we don't have anyone to send the result of the flush to anyway.
> If we don't end up converging on a fix here, I'll just send that
> patch. Thanks for the suggestion.

If that was limited to the case you care about that would be reasonable.

That will have an effect any time a process that has opened files on a
fuse filesystem exits and depends upon the exit path to close its file
descriptors to the fuse filesystem.


I do see a plausible solution along those lines.

In fuse_flush, instead of using fuse_simple_request, call an equivalent
function that skips calling request_wait_answer when PF_EXITING is set.
Or perhaps, when PF_EXITING is set, use schedule_work to call
request_wait_answer.

That will allow everything to work as it does today.  It will optimize
fuse when file descriptors are closed on the exit path.  It will
avoid the hang by removing an indefinite wait on userspace.
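
Roughly, as a toy userspace model of the shape I have in mind.  None of
these names exist in fuse: toy_send and toy_wait_answer merely stand in
for queueing the request and for request_wait_answer, and the
daemon_error argument is a made-up way to fake the reply.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_req {
	bool sent;     /* request was handed to the (pretend) daemon */
	bool waited;   /* caller blocked for the reply */
	int error;     /* reply status, valid only if waited */
};

/* Stand-in for queueing the flush request to userspace. */
void toy_send(struct toy_req *req)
{
	assert(req != NULL);
	req->sent = true;
}

/*
 * Stand-in for request_wait_answer().  The real wait can last as long
 * as userspace likes; here the reply is simply passed in.
 */
int toy_wait_answer(struct toy_req *req, int daemon_error)
{
	assert(req != NULL);
	req->waited = true;
	req->error = daemon_error;
	return req->error;
}

/*
 * Flush that still sends the request, but skips the wait when the
 * caller is exiting, since there is nobody to return the error to.
 */
int toy_flush(struct toy_req *req, bool exiting, int daemon_error)
{
	toy_send(req);
	if (exiting)
		return 0;
	return toy_wait_answer(req, daemon_error);
}
```

A normal close still sees whatever error the daemon reports; only the
exit path stops waiting for it.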

This should even generalize into the vfs.  I looked and nfs also looks
like it has the potential to optimize out the wait for the result of the
flush.  A correctly implemented flush method looks to flush any
write-back data when the file is closed and to return any errors from
that flush to the caller of close.  For .flush called from the exit path
aka exit_files aka close_files there is no place to return an
error status to, so there is no need to wait for the flush to complete.

That said, I think it makes sense to solve the problem for fuse
first, and then we can figure out support for other filesystems.

Eric