Hi Linus,
The following changes since commit b787f68c36d49bb1d9236f403813641efa74a031:
Linux 4.1-rc1 (2015-04-26 17:59:10 -0700)
are available in the git repository at:
git://github.com/awilliam/linux-vfio.git tags/vfio-v4.1-rc2
for you to fetch changes up to 82a0eaab980a3af92d46e93eaf299d6c1428c52b:
vfio-pci: Log device requests more verbosely (2015-04-28 10:23:30 -0600)
----------------------------------------------------------------
Fix some undesirable behavior with the vfio device request interface:
- Flush signals on interrupted wait to retain polling interval (Alex Williamson)
- Increase verbosity of device request channel (Alex Williamson)
----------------------------------------------------------------
Alex Williamson (2):
vfio: Flush signals on device request interruption
vfio-pci: Log device requests more verbosely
drivers/vfio/pci/vfio_pci.c | 8 +++++++-
drivers/vfio/vfio.c | 13 ++++++++++---
2 files changed, 17 insertions(+), 4 deletions(-)
On Fri, May 1, 2015 at 10:40 AM, Alex Williamson
<[email protected]> wrote:
>
> - Flush signals on interrupted wait to retain polling interval (Alex Williamson)
This cannot *possibly* be right. If I read this patch right, you're
randomly just getting rid of signals. No way in hell is that correct.
"flush_signals()" is only for kernel threads, where it's a hacky
alternative to actually handling them (since kernel threads never
rreturn to user space and cannot really "handle" a signal). But you're
doing it in the ->remove handler for the device, which can be called
by arbitrary system processes. This is not a kernel thread thing, as
far as I can see.
If you cannot handle signals, you damn well shouldn't be using
"wait_event_interruptible_timeout()" to begin with. Get rid of the
"interruptible", since it apparently *isn't* interruptible.
So I'm not pulling this.
Now I'm worried that other drivers do insane things like this. I
wonder if we should add some sanity test to flush_signals() to make
sure that it can only ever get called from a kernel thread.
Oleg?
Linus
On Fri, 2015-05-01 at 11:37 -0700, Linus Torvalds wrote:
> On Fri, May 1, 2015 at 10:40 AM, Alex Williamson
> <[email protected]> wrote:
> >
> > - Flush signals on interrupted wait to retain polling interval (Alex Williamson)
>
> This cannot *possibly* be right. If I read this patch right, you're
> randomly just getting rid of signals. No way in hell is that correct.
>
> "flush_signals()" is only for kernel threads, where it's a hacky
> alternative to actually handling them (since kernel threads never
> rreturn to user space and cannot really "handle" a signal). But you're
> doing it in the ->remove handler for the device, which can be called
> by arbitrary system processes. This is not a kernel thread thing, as
> far as I can see.
>
> If you cannot handle signals, you damn well shouldn't be using
> "wait_event_interruptible_timeout()" to begin with. Get rid of the
> "interruptible", since it apparently *isn't* interruptible.
>
> So I'm not pulling this.
Ok. It seemed like useful behavior to be able to provide some response
to the user in the event that a ->remove handler is blocked by a device
in-use and the user attempts to abort the action. Thanks for reviewing,
Alex
* Linus Torvalds <[email protected]> wrote:
> On Fri, May 1, 2015 at 10:40 AM, Alex Williamson
> <[email protected]> wrote:
> >
> > - Flush signals on interrupted wait to retain polling interval (Alex Williamson)
>
> This cannot *possibly* be right. If I read this patch right, you're
> randomly just getting rid of signals. No way in hell is that correct.
>
> "flush_signals()" is only for kernel threads, where it's a hacky
> alternative to actually handling them (since kernel threads never
> rreturn to user space and cannot really "handle" a signal). But you're
> doing it in the ->remove handler for the device, which can be called
> by arbitrary system processes. This is not a kernel thread thing, as
> far as I can see.
>
> If you cannot handle signals, you damn well shouldn't be using
> "wait_event_interruptible_timeout()" to begin with. Get rid of the
> "interruptible", since it apparently *isn't* interruptible.
>
> So I'm not pulling this.
>
> Now I'm worried that other drivers do insane things like this. I
> wonder if we should add some sanity test to flush_signals() to make
> sure that it can only ever get called from a kernel thread.
>
> Oleg?
So there are these uses:
triton:~/tip> git grep -lw flush_signals
arch/arm/common/bL_switcher.c
Looks safe: used within the bL_switcher_thread() kthread.
drivers/block/drbd/drbd_main.c
drivers/block/drbd/drbd_nl.c
drivers/block/drbd/drbd_receiver.c
drivers/block/drbd/drbd_worker.c
Couldn't convince myself it's safe, but it appears to be. (Call chains
are obfuscated in various ways that makes it hard to tell where a
given function execute.)
drivers/md/md.c
drivers/md/raid1.c
drivers/md/raid5.c
Hm, so I'm not super sure about the flush_signals() in
raid1.c:make_request() AFAICS we can do direct RAID1 writes in
raid1_unplug(). That looks unsafe ... I've Cc:-ed Neil.
raid5.c seems safe: raid5_unplug() doesn't create requests directly,
leaves it all for the mddev kthread.
drivers/scsi/bnx2fc/bnx2fc_fcoe.c
drivers/scsi/bnx2fc/bnx2fc_tgt.c
drivers/scsi/bnx2i/bnx2i_iscsi.c
drivers/scsi/libiscsi.c
drivers/target/iscsi/iscsi_target_login.c
drivers/target/iscsi/iscsi_target_nego.c
Couldn't fully check it due to excessive complexity, but seemed safe.
drivers/staging/rtl8188eu/core/rtw_cmd.c
drivers/staging/rtl8712/osdep_service.h
Looks safe: done in RTW_CMD_THREAD and 'padapter' kthreads.
drivers/w1/w1_family.c
drivers/w1/w1_int.c
Looks unsafe: called from various module exit handlers in:
drivers/w1/slaves/w1_bq27000.c
drivers/w1/slaves/w1_ds2406.c
drivers/w1/slaves/w1_ds2408.c
drivers/w1/slaves/w1_ds2413.c
drivers/w1/slaves/w1_ds2423.c
drivers/w1/slaves/w1_ds2431.c
drivers/w1/slaves/w1_ds2433.c
drivers/w1/slaves/w1_ds2760.c
drivers/w1/slaves/w1_ds2780.c
drivers/w1/slaves/w1_ds2781.c
drivers/w1/slaves/w1_ds28e04.c
drivers/w1/slaves/w1_smem.c
drivers/w1/slaves/w1_therm.c
which would be executed in rmmod context, losing signals.
Cc:-ed Evgeniy.
fs/lockd/svc.c
fs/nfs/callback.c
fs/nfsd/nfssvc.c
Looks safe: lockd, nfsd plus nfsv4.%u-svc kthreads.
kernel/locking/rtmutex-tester.c
Looks safe: used within a kthread.
include/linux/sched.h
kernel/signal.c
Both safe ;-)
I also found a __flush_signals() use in:
security/selinux/hooks.c
Now that's selinux_bprm_committed_creds(), apparently executed on
exec(). Also does stuff like:
memset(&itimer, 0, sizeof itimer);
for (i = 0; i < 3; i++)
do_setitimer(i, &itimer, NULL);
and unblocks signals as well:
sigemptyset(¤t->blocked);
but this appears to be kind of legit: the task failed to get the
required permissions, and guns go off.
In any case, it seems to me that the patch below would be justified?
Totally untested and so. __flush_signals() not affected.
Thanks,
Ingo
---
kernel/signal.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/kernel/signal.c b/kernel/signal.c
index d51c5ddd855c..100e30afe5d2 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -427,6 +427,10 @@ void flush_signals(struct task_struct *t)
{
unsigned long flags;
+ /* Only kthreads are allowed to destroy signals: */
+ if (WARN_ON_ONCE(!(current->flags & PF_KTHREAD)))
+ return;
+
spin_lock_irqsave(&t->sighand->siglock, flags);
__flush_signals(t);
spin_unlock_irqrestore(&t->sighand->siglock, flags);
On Fri, May 1, 2015 at 8:37 PM, Linus Torvalds
<[email protected]> wrote:
> "flush_signals()" is only for kernel threads, where it's a hacky
> alternative to actually handling them (since kernel threads never
> rreturn to user space and cannot really "handle" a signal). But you're
> doing it in the ->remove handler for the device, which can be called
> by arbitrary system processes. This is not a kernel thread thing, as
> far as I can see.
>
> If you cannot handle signals, you damn well shouldn't be using
> "wait_event_interruptible_timeout()" to begin with. Get rid of the
> "interruptible", since it apparently *isn't* interruptible.
>
> So I'm not pulling this.
>
> Now I'm worried that other drivers do insane things like this. I
> wonder if we should add some sanity test to flush_signals() to make
> sure that it can only ever get called from a kernel thread.
Hmm, a quick grep exposes some questionable users.
At least w1 looks fishy.
drivers/w1/w1_family.c:w1_unregister_family
drivers/w1/w1_int.c:__w1_remove_master_device
What do you think about a WARN_ON like:
diff --git a/kernel/signal.c b/kernel/signal.c
index d51c5dd..b4079c3 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -427,6 +427,8 @@ void flush_signals(struct task_struct *t)
{
unsigned long flags;
+ WARN_ON((current->flags & PF_KTHREAD) == 0);
+
spin_lock_irqsave(&t->sighand->siglock, flags);
__flush_signals(t);
spin_unlock_irqrestore(&t->sighand->siglock, flags);
--
Thanks,
//richard
On Fri, May 1, 2015 at 11:48 AM, Alex Williamson
<[email protected]> wrote:
>
> Ok. It seemed like useful behavior to be able to provide some response
> to the user in the event that a ->remove handler is blocked by a device
> in-use and the user attempts to abort the action.
Well, that kind of notification *might* be useful, but at the cost of
saying "somebody tried to send you a signal, so I am now telling you
about it, and then deleting that signal, and you'll never know what it
actually was"?
That's not useful, that's just wrong.
Now, what might in theory be useful - but I haven't actually seen
anybody do anything like that - is to start out with an interruptible
sleep, warn if you get interrupted, and then continue with an
un-interruptible sleep (leaving the signal active).
But even that sounds like a very special case, and I don't think
anything has ever done that.
In general, our signal handling falls into three distinct categories:
(a) interruptible (and you can cancel the operation and return "try again")
(b) killable (you can cancel the operation, knowing that the
requester will be killed and won't try again)
(c) uninterruptible
where that (b) tends to be a special case of an operation that
technically isn't really interruptible (because the ABI doesn't allow
for retrying or error returns), but knowing that the caller will never
see the error case because it's killed means that you can do it. The
classic example of that is an NFS mount that is mounted "nointr" - you
can't return EINTR for a read or a write (because that invalidates
POSIX) but you want to let SIGKILL just kill the process in the middle
when the network is hung.
Linus
...of course I meant t-> and not current->
On 5/1/15, Richard Weinberger <[email protected]> wrote:
> On Fri, May 1, 2015 at 8:37 PM, Linus Torvalds
> <[email protected]> wrote:
>> "flush_signals()" is only for kernel threads, where it's a hacky
>> alternative to actually handling them (since kernel threads never
>> rreturn to user space and cannot really "handle" a signal). But you're
>> doing it in the ->remove handler for the device, which can be called
>> by arbitrary system processes. This is not a kernel thread thing, as
>> far as I can see.
>>
>> If you cannot handle signals, you damn well shouldn't be using
>> "wait_event_interruptible_timeout()" to begin with. Get rid of the
>> "interruptible", since it apparently *isn't* interruptible.
>>
>> So I'm not pulling this.
>>
>> Now I'm worried that other drivers do insane things like this. I
>> wonder if we should add some sanity test to flush_signals() to make
>> sure that it can only ever get called from a kernel thread.
>
> Hmm, a quick grep exposes some questionable users.
> At least w1 looks fishy.
> drivers/w1/w1_family.c:w1_unregister_family
> drivers/w1/w1_int.c:__w1_remove_master_device
>
> What do you think about a WARN_ON like:
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index d51c5dd..b4079c3 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -427,6 +427,8 @@ void flush_signals(struct task_struct *t)
> {
> unsigned long flags;
>
> + WARN_ON((current->flags & PF_KTHREAD) == 0);
> +
> spin_lock_irqsave(&t->sighand->siglock, flags);
> __flush_signals(t);
> spin_unlock_irqrestore(&t->sighand->siglock, flags);
>
> --
> Thanks,
> //richard
>
--
Thanks,
//richard
On Fri, 2015-05-01 at 13:23 -0700, Linus Torvalds wrote:
> On Fri, May 1, 2015 at 11:48 AM, Alex Williamson
> <[email protected]> wrote:
> >
> > Ok. It seemed like useful behavior to be able to provide some response
> > to the user in the event that a ->remove handler is blocked by a device
> > in-use and the user attempts to abort the action.
>
> Well, that kind of notification *might* be useful, but at the cost of
> saying "somebody tried to send you a signal, so I am now telling you
> about it, and then deleting that signal, and you'll never know what it
> actually was"?
>
> That's not useful, that's just wrong.
Yep, it was a bad idea.
> Now, what might in theory be useful - but I haven't actually seen
> anybody do anything like that - is to start out with an interruptible
> sleep, warn if you get interrupted, and then continue with an
> un-interruptible sleep (leaving the signal active).
I was considering doing exactly this.
> But even that sounds like a very special case, and I don't think
> anything has ever done that.
>
> In general, our signal handling falls into three distinct categories:
>
> (a) interruptible (and you can cancel the operation and return "try again")
>
> (b) killable (you can cancel the operation, knowing that the
> requester will be killed and won't try again)
>
> (c) uninterruptible
>
> where that (b) tends to be a special case of an operation that
> technically isn't really interruptible (because the ABI doesn't allow
> for retrying or error returns), but knowing that the caller will never
> see the error case because it's killed means that you can do it. The
> classic example of that is an NFS mount that is mounted "nointr" - you
> can't return EINTR for a read or a write (because that invalidates
> POSIX) but you want to let SIGKILL just kill the process in the middle
> when the network is hung.
I think we're in that (c) case unless we want to change our driver API
to allow driver remove callbacks to return error. Killing the caller
doesn't really help the situation without being able to back out of the
remove path. Killing the task with the device open would help, but
seems rather harsh. I expect we eventually want to be able to escalate
to revoking the device from the user, but currently we only have a
notifier to request the device from cooperative users. In the event of
an uncooperative user, we block, which can be difficult to figure out,
especially when we're dealing with SR-IOV devices and a PF unbind
implicitly induces a VF unbind. The interruptible component here is
simply a logging mechanism which should have turned into an
"interruptible_once" rather than a signal flush.
I try to avoid vfio being a special case, but maybe in this instance
it's worthwhile. If you have other suggestions, please let me know.
Thanks,
Alex
On Fri, 1 May 2015 21:38:13 +0200 Ingo Molnar <[email protected]> wrote:
> drivers/md/md.c
> drivers/md/raid1.c
> drivers/md/raid5.c
>
> Hm, so I'm not super sure about the flush_signals() in
> raid1.c:make_request() AFAICS we can do direct RAID1 writes in
> raid1_unplug(). That looks unsafe ... I've Cc:-ed Neil.
>
> raid5.c seems safe: raid5_unplug() doesn't create requests directly,
> leaves it all for the mddev kthread.
Both raid1.c and raid5.c call flush_signals() in the make_request function
(in unusual circumstances).
I wanted a uninterruptible wait which didn't add to load-average. That
approach works in kernel threads...
All the calls in md.c are in a kernel thread so safe, but I'd rather have an
explicit "uninterruptible, but no load-average" wait....
I should probably change the make_request code to queue the request
somewhere rather than wait for it to be serviceable.
I'll look into that...
> In any case, it seems to me that the patch below would be justified?
> Totally untested and so. __flush_signals() not affected.
Fine by me - does seem justified.
Thanks,
NeilBrown
>
> Thanks,
>
> Ingo
>
> ---
> kernel/signal.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index d51c5ddd855c..100e30afe5d2 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -427,6 +427,10 @@ void flush_signals(struct task_struct *t)
> {
> unsigned long flags;
>
> + /* Only kthreads are allowed to destroy signals: */
> + if (WARN_ON_ONCE(!(current->flags & PF_KTHREAD)))
> + return;
> +
> spin_lock_irqsave(&t->sighand->siglock, flags);
> __flush_signals(t);
> spin_unlock_irqrestore(&t->sighand->siglock, flags);
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
Hi Ingo
01.05.2015, 22:38, "Ingo Molnar" <[email protected]>:
> * Linus Torvalds <[email protected]> wrote:
>> ?On Fri, May 1, 2015 at 10:40 AM, Alex Williamson
>> ?<[email protected]> wrote:
>>> ?- Flush signals on interrupted wait to retain polling interval (Alex Williamson)
>> ?This cannot *possibly* be right. If I read this patch right, you're
>> ?randomly just getting rid of signals. No way in hell is that correct.
>>
>> ?"flush_signals()" is only for kernel threads, where it's a hacky
>> ?alternative to actually handling them (since kernel threads never
>> ?rreturn to user space and cannot really "handle" a signal). But you're
>> ?doing it in the ->remove handler for the device, which can be called
>> ?by arbitrary system processes. This is not a kernel thread thing, as
>> ?far as I can see.
>>
>> ?If you cannot handle signals, you damn well shouldn't be using
>> ?"wait_event_interruptible_timeout()" to begin with. Get rid of the
>> ?"interruptible", since it apparently *isn't* interruptible.
>>
>> ?So I'm not pulling this.
>>
>> ?Now I'm worried that other drivers do insane things like this. I
>> ?wonder if we should add some sanity test to flush_signals() to make
>> ?sure that it can only ever get called from a kernel thread.
> Looks unsafe: called from various module exit handlers in:
>
> ??drivers/w1/slaves/w1_bq27000.c
> ??drivers/w1/slaves/w1_ds2406.c
> ??drivers/w1/slaves/w1_ds2408.c
> ??drivers/w1/slaves/w1_ds2413.c
> ??drivers/w1/slaves/w1_ds2423.c
> ??drivers/w1/slaves/w1_ds2431.c
> ??drivers/w1/slaves/w1_ds2433.c
> ??drivers/w1/slaves/w1_ds2760.c
> ??drivers/w1/slaves/w1_ds2780.c
> ??drivers/w1/slaves/w1_ds2781.c
> ??drivers/w1/slaves/w1_ds28e04.c
> ??drivers/w1/slaves/w1_smem.c
> ??drivers/w1/slaves/w1_therm.c
>
> which would be executed in rmmod context, losing signals.
> Cc:-ed Evgeniy.
w1 uses a little bit strange refcnt logic, basically every w1 module waits
for all its users to release their w1 resources and doesn't exit until its safe.
To wait for resources to be freed at module exit time it checks its refcnt to drop to zero periodically
and sleeps between the checks for a second. w1 flushes signals between these
checks for no particular reason, it can be safely removed from w1_unregister_family()
and interruptible sleep replaced with the normal one.
On Sat, May 2, 2015 at 1:30 AM, NeilBrown <[email protected]> wrote:
>
> All the calls in md.c are in a kernel thread so safe, but I'd rather have an
> explicit "uninterruptible, but no load-average" wait....
Hmm. Our task state is a bitmask anyway, so we could probably just add a
#define __TASK_NOLOAD 16
(and move the EXIT_xyz defines *away* from the list that is actually
the task state), and teach our load average thing to not count those
kinds of waits. Then you could just use
TASK_UNINTERRUPTIBLE | __TASK_NOLOAD
to make processes not count towards the load.
Or - probably preferably - we could really clean things up, and make
things much more like the bitmask it *should* be, and have explicit
bits for
- SLEEPING/STOPPED/EXITING ("why not running?")
- LOADAVG (accounted towards load)
- WAKESIG (ie "interruptible")
- WAKEKILL (this we already have)
and just make the rule be that we use "__TASK_xyz" for the actual
individual bits, and "TASK_xyz" for the helper combinations. So then
we'd have
#define TASK_UNINTERRUPTIBLE (__TASK_SLEEPING | __TASK_LOADAVG)
#define TASK_INTERRUPTIBLE (__TASK_SLEEPING | __TASK_WAKESIG)
#define TASK_KILLABLE (__TASK_SLEEPING | __TASK_WAKEKILL)
#define TASK_NOLOADAVG (__TASK_SLEEPING)
which is almost certainly how this *should* have been done, but isn't,
because of historical use.
Cleaning up like that *should* be fairly simple, but I'd be a bit
nervous about getting all the state comparisons right (we have an
unholy mix of "check this bit" and "check this whole state", and we'd
need to make sure we get those cases all right).
Ingo, what do you think? This is mostly a scheduler interface issue..
Linus
On Fri, May 1, 2015 at 9:38 PM, Ingo Molnar <[email protected]> wrote:
> diff --git a/kernel/signal.c b/kernel/signal.c
> index d51c5ddd855c..100e30afe5d2 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -427,6 +427,10 @@ void flush_signals(struct task_struct *t)
> {
> unsigned long flags;
>
> + /* Only kthreads are allowed to destroy signals: */
> + if (WARN_ON_ONCE(!(current->flags & PF_KTHREAD)))
Shouldn't this be t->flags?
--
Thanks,
//richard
On 05/01, Ingo Molnar wrote:
>
> * Linus Torvalds <[email protected]> wrote:
>
> > On Fri, May 1, 2015 at 10:40 AM, Alex Williamson
> > <[email protected]> wrote:
> > >
> > > - Flush signals on interrupted wait to retain polling interval (Alex Williamson)
> >
> > This cannot *possibly* be right. If I read this patch right, you're
> > randomly just getting rid of signals. No way in hell is that correct.
> >
> > "flush_signals()" is only for kernel threads, where it's a hacky
> > alternative to actually handling them (since kernel threads never
> > rreturn to user space and cannot really "handle" a signal). But you're
> > doing it in the ->remove handler for the device, which can be called
> > by arbitrary system processes. This is not a kernel thread thing, as
> > far as I can see.
> >
> > If you cannot handle signals, you damn well shouldn't be using
> > "wait_event_interruptible_timeout()" to begin with. Get rid of the
> > "interruptible", since it apparently *isn't* interruptible.
> >
> > So I'm not pulling this.
> >
> > Now I'm worried that other drivers do insane things like this. I
> > wonder if we should add some sanity test to flush_signals() to make
> > sure that it can only ever get called from a kernel thread.
> >
> > Oleg?
>
> So there are these uses:
>
> triton:~/tip> git grep -lw flush_signals
> arch/arm/common/bL_switcher.c
>
>
> Looks safe: used within the bL_switcher_thread() kthread.
safe but wrong at first glance... I mean, it is pointless. Looks like,
bL_switcher_thread() wrongly thinks that wait_event_interruptible() can
lead to signal_pending().
> drivers/block/drbd/drbd_main.c
> drivers/block/drbd/drbd_nl.c
> drivers/block/drbd/drbd_receiver.c
> drivers/block/drbd/drbd_worker.c
Oh, I didn't know this helper is abused so much. I'll try to recheck
tomorrow, but it seems that it should die...
> drivers/md/md.c
I can't understand this code... The comment says:
/* We need to wait INTERRUPTIBLE so that
* we don't add to the load-average.
* That means we need to be sure no signals are
* pending
*/
if (signal_pending(current))
flush_signals(current);
and this is wrong. However, signal_pending() can be true because of
allow_signal(SIGKILL) above. But why it does allow_signal() ?
> fs/lockd/svc.c
> fs/nfs/callback.c
> fs/nfsd/nfssvc.c
>
> Looks safe: lockd, nfsd plus nfsv4.%u-svc kthreads.
Yes, this case looks fine. But perhaps it makes sense to add another
helper... Or not, I'll try to think more.
> I also found a __flush_signals() use in:
>
> security/selinux/hooks.c
>
> Now that's selinux_bprm_committed_creds(), apparently executed on
> exec(). Also does stuff like:
>
> memset(&itimer, 0, sizeof itimer);
> for (i = 0; i < 3; i++)
> do_setitimer(i, &itimer, NULL);
>
> and unblocks signals as well:
>
> sigemptyset(¤t->blocked);
>
> but this appears to be kind of legit: the task failed to get the
> required permissions, and guns go off.
and I simply can't understand this code... but I feel that it can't
be correct ;) Will try to re-read tomorrow.
> In any case, it seems to me that the patch below would be justified?
> Totally untested and so. __flush_signals() not affected.
Yes, agreed, it can't be right if the caller is not kthread.
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -427,6 +427,10 @@ void flush_signals(struct task_struct *t)
> {
> unsigned long flags;
>
> + /* Only kthreads are allowed to destroy signals: */
> + if (WARN_ON_ONCE(!(current->flags & PF_KTHREAD)))
> + return;
> +
But I am not sure this can't make some buggy driver even more buggy.
Just suppose it does something
do {
if (signal_pending())
flush_signals();
} while (wait_event_interruptible(...));
and this change will turn this into busy-wait loop.
So perhaps another change which just adds WARN_ON_RATELIMIT() without
"return" will be safer?
Oleg.
On 05/03, Oleg Nesterov wrote:
>
> On 05/01, Ingo Molnar wrote:
> >
>
> > I also found a __flush_signals() use in:
> >
> > security/selinux/hooks.c
> >
> and I simply can't understand this code... but I feel that it can't
> be correct ;) Will try to re-read tomorrow.
Yes, this doesn't look right. Lets kill __flush_signals() first, there
are no other users.
And I am not sure it is fine to flush SIGSTOP... do we really want this?
Oleg.
selinux_bprm_committed_creds()->__flush_signals() is not right, we
shouldn't clear TIF_SIGPENDING unconditionally. There can be other
reasons for signal_pending(): freezing(), JOBCTL_PENDING_MASK, and
potentially more.
Also change this code to check fatal_signal_pending() rather than
SIGNAL_GROUP_EXIT, it looks a bit better.
Now we can kill __flush_signals() before it finds another buggy user.
Note: this code looks racy, we can flush a signal which was sent after
the task SID has been updated.
Signed-off-by: Oleg Nesterov <[email protected]>
---
include/linux/sched.h | 1 -
kernel/signal.c | 13 ++++---------
security/selinux/hooks.c | 6 ++++--
3 files changed, 8 insertions(+), 12 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8db31ef..eb1ac84 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2345,7 +2345,6 @@ extern void sched_dead(struct task_struct *p);
extern void proc_caches_init(void);
extern void flush_signals(struct task_struct *);
-extern void __flush_signals(struct task_struct *);
extern void ignore_signals(struct task_struct *);
extern void flush_signal_handlers(struct task_struct *, int force_default);
extern int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info);
diff --git a/kernel/signal.c b/kernel/signal.c
index 16a3052..837ca7d 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -414,21 +414,16 @@ void flush_sigqueue(struct sigpending *queue)
}
/*
- * Flush all pending signals for a task.
+ * Flush all pending signals for this kthread.
*/
-void __flush_signals(struct task_struct *t)
-{
- clear_tsk_thread_flag(t, TIF_SIGPENDING);
- flush_sigqueue(&t->pending);
- flush_sigqueue(&t->signal->shared_pending);
-}
-
void flush_signals(struct task_struct *t)
{
unsigned long flags;
spin_lock_irqsave(&t->sighand->siglock, flags);
- __flush_signals(t);
+ clear_tsk_thread_flag(t, TIF_SIGPENDING);
+ flush_sigqueue(&t->pending);
+ flush_sigqueue(&t->signal->shared_pending);
spin_unlock_irqrestore(&t->sighand->siglock, flags);
}
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 6da7532..6907d11 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2397,10 +2397,12 @@ static void selinux_bprm_committed_creds(struct linux_binprm *bprm)
for (i = 0; i < 3; i++)
do_setitimer(i, &itimer, NULL);
spin_lock_irq(¤t->sighand->siglock);
- if (!(current->signal->flags & SIGNAL_GROUP_EXIT)) {
- __flush_signals(current);
+ if (!fatal_signal_pending(current)) {
+ flush_sigqueue(¤t->pending);
+ flush_sigqueue(¤t->signal->shared_pending);
flush_signal_handlers(current, 1);
sigemptyset(¤t->blocked);
+ recalc_sigpending();
}
spin_unlock_irq(¤t->sighand->siglock);
}
--
1.5.5.1
On 05/02, NeilBrown wrote:
>
> All the calls in md.c are in a kernel thread so safe, but I'd rather have an
> explicit "uninterruptible, but no load-average" wait....
Could you please explain why md_thread() does allow_signal(SIGKILL) ?
I am just curious. It looks as if we want to allow user-space to "call"
thread->run(), and this looks strange.
Oleg.
On Monday, May 04, 2015 06:45:58 PM Oleg Nesterov wrote:
> selinux_bprm_committed_creds()->__flush_signals() is not right, we
> shouldn't clear TIF_SIGPENDING unconditionally. There can be other
> reasons for signal_pending(): freezing(), JOBCTL_PENDING_MASK, and
> potentially more.
>
> Also change this code to check fatal_signal_pending() rather than
> SIGNAL_GROUP_EXIT, it looks a bit better.
>
> Now we can kill __flush_signals() before it finds another buggy user.
[NOTE: Added the SELinux list to the CC line]
This looks reasonable to me, I'm going to apply it to selinux#next today.
> Note: this code looks racy, we can flush a signal which was sent after
> the task SID has been updated.
The whole signal flush thread has started some discussions about how we are
currently handling this, and if it still makes sense. Like many things, it
seemed like a good idea at the time, but after several years we're debating if
that is still the case. I expect we'll be changing this code soon.
> Signed-off-by: Oleg Nesterov <[email protected]>
> ---
> include/linux/sched.h | 1 -
> kernel/signal.c | 13 ++++---------
> security/selinux/hooks.c | 6 ++++--
> 3 files changed, 8 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 8db31ef..eb1ac84 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2345,7 +2345,6 @@ extern void sched_dead(struct task_struct *p);
>
> extern void proc_caches_init(void);
> extern void flush_signals(struct task_struct *);
> -extern void __flush_signals(struct task_struct *);
> extern void ignore_signals(struct task_struct *);
> extern void flush_signal_handlers(struct task_struct *, int force_default);
> extern int dequeue_signal(struct task_struct *tsk, sigset_t *mask,
> siginfo_t *info); diff --git a/kernel/signal.c b/kernel/signal.c
> index 16a3052..837ca7d 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -414,21 +414,16 @@ void flush_sigqueue(struct sigpending *queue)
> }
>
> /*
> - * Flush all pending signals for a task.
> + * Flush all pending signals for this kthread.
> */
> -void __flush_signals(struct task_struct *t)
> -{
> - clear_tsk_thread_flag(t, TIF_SIGPENDING);
> - flush_sigqueue(&t->pending);
> - flush_sigqueue(&t->signal->shared_pending);
> -}
> -
> void flush_signals(struct task_struct *t)
> {
> unsigned long flags;
>
> spin_lock_irqsave(&t->sighand->siglock, flags);
> - __flush_signals(t);
> + clear_tsk_thread_flag(t, TIF_SIGPENDING);
> + flush_sigqueue(&t->pending);
> + flush_sigqueue(&t->signal->shared_pending);
> spin_unlock_irqrestore(&t->sighand->siglock, flags);
> }
>
> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> index 6da7532..6907d11 100644
> --- a/security/selinux/hooks.c
> +++ b/security/selinux/hooks.c
> @@ -2397,10 +2397,12 @@ static void selinux_bprm_committed_creds(struct
> linux_binprm *bprm) for (i = 0; i < 3; i++)
> do_setitimer(i, &itimer, NULL);
> spin_lock_irq(¤t->sighand->siglock);
> - if (!(current->signal->flags & SIGNAL_GROUP_EXIT)) {
> - __flush_signals(current);
> + if (!fatal_signal_pending(current)) {
> + flush_sigqueue(¤t->pending);
> + flush_sigqueue(¤t->signal->shared_pending);
> flush_signal_handlers(current, 1);
> sigemptyset(¤t->blocked);
> + recalc_sigpending();
> }
> spin_unlock_irq(¤t->sighand->siglock);
> }
--
paul moore
http://www.paul-moore.com
* Oleg Nesterov <[email protected]> wrote:
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -427,6 +427,10 @@ void flush_signals(struct task_struct *t)
> > {
> > unsigned long flags;
> >
> > + /* Only kthreads are allowed to destroy signals: */
> > + if (WARN_ON_ONCE(!(current->flags & PF_KTHREAD)))
> > + return;
> > +
>
> But I am not sure this can't make some buggy driver even more buggy.
> Just suppose it does something
>
> do {
> if (signal_pending())
> flush_signals();
> } while (wait_event_interruptible(...));
>
> and this change will turn this into busy-wait loop.
>
> So perhaps another change which just adds WARN_ON_RATELIMIT()
> without "return" will be safer?
Yeah, absolutely.
Thanks,
Ingo
On Sat, May 02, 2015 at 09:27:49AM -0700, Linus Torvalds wrote:
> On Sat, May 2, 2015 at 1:30 AM, NeilBrown <[email protected]> wrote:
> >
> > All the calls in md.c are in a kernel thread so safe, but I'd rather have an
> > explicit "uninterruptible, but no load-average" wait....
>
> Hmm. Our task state is a bitmask anyway, so we could probably just add a
>
> #define __TASK_NOLOAD 16
>
> (and move the EXIT_xyz defines *away* from the list that is actually
> the task state), and teach our load average thing to not count those
> kinds of waits. Then you could just use
>
> TASK_UNINTERRUPTIBLE | __TASK_NOLOAD
>
> to make processes not count towards the load.
>
> Or - probably preferably - we could really clean things up, and make
> things much more like the bitmask it *should* be, and have explicit
> bits for
>
> - SLEEPING/STOPPED/EXITING ("why not running?")
> - LOADAVG (accounted towards load)
> - WAKESIG (ie "interruptible")
> - WAKEKILL (this we already have)
>
> and just make the rule be that we use "__TASK_xyz" for the actual
> individual bits, and "TASK_xyz" for the helper combinations. So then
> we'd have
>
> #define TASK_UNINTERRUPTIBLE (__TASK_SLEEPING | __TASK_LOADAVG)
> #define TASK_INTERRUPTIBLE (__TASK_SLEEPING | __TASK_WAKESIG)
> #define TASK_KILLABLE (__TASK_SLEEPING | __TASK_WAKEKILL)
> #define TASK_NOLOADAVG (__TASK_SLEEPING)
>
> which is almost certainly how this *should* have been done, but isn't,
> because of historical use.
>
> Cleaning up like that *should* be fairly simple, but I'd be a bit
> nervous about getting all the state comparisons right (we have an
> unholy mix of "check this bit" and "check this whole state", and we'd
> need to make sure we get those cases all right).
>
> Ingo, what do you think? This is mostly a scheduler interface issue..
Hehe, a little something like this:
https://lkml.org/lkml/2013/11/12/710
Lemme go clean that up and finish it :-)
On Mon, 4 May 2015, Oleg Nesterov wrote:
> > All the calls in md.c are in a kernel thread so safe, but I'd rather have an
> > explicit "uninterruptible, but no load-average" wait....
>
> Could you please explain why md_thread() does allow_signal(SIGKILL) ?
>
> I am just curious. It looks as if we want to allow user-space to "call"
> thread->run(), and this looks strange.
One would think that this is because md wants to be notified when system
is going to be halted/rebooted, and userspace init (whatever that is)
decides to do 'kill -9 -1' to perform the final shutdown of md (the
question is why it really should be needed, becasue all filesystems should
be R/O by that time anyway).
--
Jiri Kosina
SUSE Labs
On Thu, 7 May 2015 15:33:53 +0200 (CEST) Jiri Kosina <[email protected]> wrote:
> On Mon, 4 May 2015, Oleg Nesterov wrote:
>
> > > All the calls in md.c are in a kernel thread so safe, but I'd rather have an
> > > explicit "uninterruptible, but no load-average" wait....
> >
> > Could you please explain why md_thread() does allow_signal(SIGKILL) ?
> >
> > I am just curious. It looks as if we want to allow user-space to "call"
> > thread->run(), and this looks strange.
>
> One would think that this is because md wants to be notified when system
> is going to be halted/rebooted, and userspace init (whatever that is)
> decides to do 'kill -9 -1' to perform the final shutdown of md (the
> question is why it really should be needed, becasue all filesystems should
> be R/O by that time anyway).
>
Something like that. It is historical strangeness that might have seemed
like a good idea at the time, but is hard to justify.
There is an alt-sysrq that will remount filesystems read-only, but no
alt-sysrq to switch RAID arrays to "idle/clean". So I leverages alt-sysrq-K,
which does "kill -9 -1" (I think).
Since md gained the ability to switch to idle/clean after a short timeout,
and also the ability to use a write-intent-bitmap so only little bits of the
array are ever non idle/clean, this all became much less interesting.
NeilBrown