2017-06-07 23:05:38

by Mikulas Patocka

[permalink] [raw]
Subject: [PATCH] md: don't use flush_signals in userspace processes

The function flush_signals clears all pending signals for the process. It
may be used by kernel threads when we need to prepare a kernel thread for
responding to signals. However using this function for an userspaces
processes is incorrect - clearing signals without the program expecting it
can cause misbehavior.

The raid1 and raid5 code uses flush_signals in its request routine because
it wants to prepare for an interruptible wait. This patch drops
flush_signals and uses sigprocmask instead to block all signals (including
SIGKILL) around the schedule() call. The signals are not lost, but the
schedule() call won't respond to them.

Signed-off-by: Mikulas Patocka <[email protected]>
Cc: [email protected]

---
drivers/md/raid1.c | 5 ++++-
drivers/md/raid5.c | 5 ++++-
2 files changed, 8 insertions(+), 2 deletions(-)

Index: linux-4.12-rc4/drivers/md/raid1.c
===================================================================
--- linux-4.12-rc4.orig/drivers/md/raid1.c
+++ linux-4.12-rc4/drivers/md/raid1.c
@@ -1335,7 +1335,7 @@ static void raid1_write_request(struct m
*/
DEFINE_WAIT(w);
for (;;) {
- flush_signals(current);
+ sigset_t full, old;
prepare_to_wait(&conf->wait_barrier,
&w, TASK_INTERRUPTIBLE);
if (bio_end_sector(bio) <= mddev->suspend_lo ||
@@ -1345,7 +1345,10 @@ static void raid1_write_request(struct m
bio->bi_iter.bi_sector,
bio_end_sector(bio))))
break;
+ sigfillset(&full);
+ sigprocmask(SIG_BLOCK, &full, &old);
schedule();
+ sigprocmask(SIG_SETMASK, &old, NULL);
}
finish_wait(&conf->wait_barrier, &w);
}
Index: linux-4.12-rc4/drivers/md/raid5.c
===================================================================
--- linux-4.12-rc4.orig/drivers/md/raid5.c
+++ linux-4.12-rc4/drivers/md/raid5.c
@@ -5693,12 +5693,15 @@ static void raid5_make_request(struct md
* userspace, we want an interruptible
* wait.
*/
- flush_signals(current);
prepare_to_wait(&conf->wait_for_overlap,
&w, TASK_INTERRUPTIBLE);
if (logical_sector >= mddev->suspend_lo &&
logical_sector < mddev->suspend_hi) {
+ sigset_t full, old;
+ sigfillset(&full);
+ sigprocmask(SIG_BLOCK, &full, &old);
schedule();
+ sigprocmask(SIG_SETMASK, &old, NULL);
do_prepare = true;
}
goto retry;


2017-06-08 06:59:15

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH] md: don't use flush_signals in userspace processes

On Wed, Jun 07 2017, Mikulas Patocka wrote:

> The function flush_signals clears all pending signals for the process. It
> may be used by kernel threads when we need to prepare a kernel thread for
> responding to signals. However using this function for an userspaces
> processes is incorrect - clearing signals without the program expecting it
> can cause misbehavior.
>
> The raid1 and raid5 code uses flush_signals in its request routine because
> it wants to prepare for an interruptible wait. This patch drops
> flush_signals and uses sigprocmask instead to block all signals (including
> SIGKILL) around the schedule() call. The signals are not lost, but the
> schedule() call won't respond to them.
>
> Signed-off-by: Mikulas Patocka <[email protected]>
> Cc: [email protected]

Thanks for catching that!

Acked-by: NeilBrown <[email protected]>

NeilBrown


>
> ---
> drivers/md/raid1.c | 5 ++++-
> drivers/md/raid5.c | 5 ++++-
> 2 files changed, 8 insertions(+), 2 deletions(-)
>
> Index: linux-4.12-rc4/drivers/md/raid1.c
> ===================================================================
> --- linux-4.12-rc4.orig/drivers/md/raid1.c
> +++ linux-4.12-rc4/drivers/md/raid1.c
> @@ -1335,7 +1335,7 @@ static void raid1_write_request(struct m
> */
> DEFINE_WAIT(w);
> for (;;) {
> - flush_signals(current);
> + sigset_t full, old;
> prepare_to_wait(&conf->wait_barrier,
> &w, TASK_INTERRUPTIBLE);
> if (bio_end_sector(bio) <= mddev->suspend_lo ||
> @@ -1345,7 +1345,10 @@ static void raid1_write_request(struct m
> bio->bi_iter.bi_sector,
> bio_end_sector(bio))))
> break;
> + sigfillset(&full);
> + sigprocmask(SIG_BLOCK, &full, &old);
> schedule();
> + sigprocmask(SIG_SETMASK, &old, NULL);
> }
> finish_wait(&conf->wait_barrier, &w);
> }
> Index: linux-4.12-rc4/drivers/md/raid5.c
> ===================================================================
> --- linux-4.12-rc4.orig/drivers/md/raid5.c
> +++ linux-4.12-rc4/drivers/md/raid5.c
> @@ -5693,12 +5693,15 @@ static void raid5_make_request(struct md
> * userspace, we want an interruptible
> * wait.
> */
> - flush_signals(current);
> prepare_to_wait(&conf->wait_for_overlap,
> &w, TASK_INTERRUPTIBLE);
> if (logical_sector >= mddev->suspend_lo &&
> logical_sector < mddev->suspend_hi) {
> + sigset_t full, old;
> + sigfillset(&full);
> + sigprocmask(SIG_BLOCK, &full, &old);
> schedule();
> + sigprocmask(SIG_SETMASK, &old, NULL);
> do_prepare = true;
> }
> goto retry;


Attachments:
signature.asc (832.00 B)

2017-06-08 20:51:27

by Mikulas Patocka

[permalink] [raw]
Subject: Re: [PATCH] md: don't use flush_signals in userspace processes



On Thu, 8 Jun 2017, Shaohua Li wrote:

> On Thu, Jun 08, 2017 at 04:59:03PM +1000, Neil Brown wrote:
> > On Wed, Jun 07 2017, Mikulas Patocka wrote:
> >
> > > The function flush_signals clears all pending signals for the process. It
> > > may be used by kernel threads when we need to prepare a kernel thread for
> > > responding to signals. However using this function for an userspaces
> > > processes is incorrect - clearing signals without the program expecting it
> > > can cause misbehavior.
> > >
> > > The raid1 and raid5 code uses flush_signals in its request routine because
> > > it wants to prepare for an interruptible wait. This patch drops
> > > flush_signals and uses sigprocmask instead to block all signals (including
> > > SIGKILL) around the schedule() call. The signals are not lost, but the
> > > schedule() call won't respond to them.
> > >
> > > Signed-off-by: Mikulas Patocka <[email protected]>
> > > Cc: [email protected]
> >
> > Thanks for catching that!
> >
> > Acked-by: NeilBrown <[email protected]>
>
> Applied, thanks!
>
> Neil,
> Not about the patch itself. I had question about that part of code. Dropped
> others since this is raid related. I didn't get the point why it's a
> TASK_INTERRUPTIBLE sleep. It seems suggesting the thread will bail out if a
> signal is sent. But I didn't see we check the signal and exit the loop. What's
> the correct behavior here? Since the suspend range is controlled by userspace,

As I understand the code - the purpose is to have an UNINTERRUPTIBLE sleep
that isn't accounted in load average and that doesn't trigger the hung
task warning.

There should really be something like TASK_UNINTERRUPTIBLE_LONG for this
purpose.

> I think the correct behavior is if user kills the thread, we exit the loop. So
> it seems like to be we check if there is fatal signal pending, exit the loop,
> and return IO error. Not sure if we should return IO error though.

No, this is not correct - if we report an I/O error for the affected bio,
it could corrupt filesystem or confuse other device mapper targets that
could be on the top of MD. It is not right to corrupt filesystem if the
user kills a process.

> Thanks,
> Shaohua

Mikulas

2017-06-08 21:24:42

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH] md: don't use flush_signals in userspace processes

On Thu, Jun 08 2017, Mikulas Patocka wrote:

> On Thu, 8 Jun 2017, Shaohua Li wrote:
>
>> On Thu, Jun 08, 2017 at 04:59:03PM +1000, Neil Brown wrote:
>> > On Wed, Jun 07 2017, Mikulas Patocka wrote:
>> >
>> > > The function flush_signals clears all pending signals for the process. It
>> > > may be used by kernel threads when we need to prepare a kernel thread for
>> > > responding to signals. However using this function for an userspaces
>> > > processes is incorrect - clearing signals without the program expecting it
>> > > can cause misbehavior.
>> > >
>> > > The raid1 and raid5 code uses flush_signals in its request routine because
>> > > it wants to prepare for an interruptible wait. This patch drops
>> > > flush_signals and uses sigprocmask instead to block all signals (including
>> > > SIGKILL) around the schedule() call. The signals are not lost, but the
>> > > schedule() call won't respond to them.
>> > >
>> > > Signed-off-by: Mikulas Patocka <[email protected]>
>> > > Cc: [email protected]
>> >
>> > Thanks for catching that!
>> >
>> > Acked-by: NeilBrown <[email protected]>
>>
>> Applied, thanks!
>>
>> Neil,
>> Not about the patch itself. I had question about that part of code. Dropped
>> others since this is raid related. I didn't get the point why it's a
>> TASK_INTERRUPTIBLE sleep. It seems suggesting the thread will bail out if a
>> signal is sent. But I didn't see we check the signal and exit the loop. What's
>> the correct behavior here? Since the suspend range is controlled by userspace,
>
> As I understand the code - the purpose is to have an UNINTERRUPTIBLE sleep
> that isn't accounted in load average and that doesn't trigger the hung
> task warning.

Exactly my reason - yes.

>
> There should really be something like TASK_UNINTERRUPTIBLE_LONG for this
> purpose.

That would be nice.

>
>> I think the correct behavior is if user kills the thread, we exit the loop. So
>> it seems like to be we check if there is fatal signal pending, exit the loop,
>> and return IO error. Not sure if we should return IO error though.
>
> No, this is not correct - if we report an I/O error for the affected bio,
> it could corrupt filesystem or confuse other device mapper targets that
> could be on the top of MD. It is not right to corrupt filesystem if the
> user kills a process.

Yes, we are too deep to even return something like ERESTARTSYS.
Blocking is the only option.

Thanks,
NeilBrown


>
>> Thanks,
>> Shaohua
>
> Mikulas


Attachments:
signature.asc (832.00 B)

2017-06-08 22:52:58

by Mikulas Patocka

[permalink] [raw]
Subject: Re: [PATCH] md: don't use flush_signals in userspace processes



On Thu, 8 Jun 2017, NeilBrown wrote:

> On Wed, Jun 07 2017, Mikulas Patocka wrote:
>
> > The function flush_signals clears all pending signals for the process. It
> > may be used by kernel threads when we need to prepare a kernel thread for
> > responding to signals. However using this function for an userspaces
> > processes is incorrect - clearing signals without the program expecting it
> > can cause misbehavior.
> >
> > The raid1 and raid5 code uses flush_signals in its request routine because
> > it wants to prepare for an interruptible wait. This patch drops
> > flush_signals and uses sigprocmask instead to block all signals (including
> > SIGKILL) around the schedule() call. The signals are not lost, but the
> > schedule() call won't respond to them.
> >
> > Signed-off-by: Mikulas Patocka <[email protected]>
> > Cc: [email protected]
>
> Thanks for catching that!
>
> Acked-by: NeilBrown <[email protected]>
>
> NeilBrown

BTW. why does md_thread do "allow_signal(SIGKILL)" and then
"if (signal_pending(current)) flush_signals(current)"?

Does userspace really send SIGKILL to MD kernel threads? The SIGKILL will
be lost when flush_signals is called, so it looks quite dubious.

Mikulas

2017-06-08 22:59:03

by Shaohua Li

[permalink] [raw]
Subject: Re: [PATCH] md: don't use flush_signals in userspace processes

On Fri, Jun 09, 2017 at 07:24:29AM +1000, Neil Brown wrote:
> On Thu, Jun 08 2017, Mikulas Patocka wrote:
>
> > On Thu, 8 Jun 2017, Shaohua Li wrote:
> >
> >> On Thu, Jun 08, 2017 at 04:59:03PM +1000, Neil Brown wrote:
> >> > On Wed, Jun 07 2017, Mikulas Patocka wrote:
> >> >
> >> > > The function flush_signals clears all pending signals for the process. It
> >> > > may be used by kernel threads when we need to prepare a kernel thread for
> >> > > responding to signals. However using this function for an userspaces
> >> > > processes is incorrect - clearing signals without the program expecting it
> >> > > can cause misbehavior.
> >> > >
> >> > > The raid1 and raid5 code uses flush_signals in its request routine because
> >> > > it wants to prepare for an interruptible wait. This patch drops
> >> > > flush_signals and uses sigprocmask instead to block all signals (including
> >> > > SIGKILL) around the schedule() call. The signals are not lost, but the
> >> > > schedule() call won't respond to them.
> >> > >
> >> > > Signed-off-by: Mikulas Patocka <[email protected]>
> >> > > Cc: [email protected]
> >> >
> >> > Thanks for catching that!
> >> >
> >> > Acked-by: NeilBrown <[email protected]>
> >>
> >> Applied, thanks!
> >>
> >> Neil,
> >> Not about the patch itself. I had question about that part of code. Dropped
> >> others since this is raid related. I didn't get the point why it's a
> >> TASK_INTERRUPTIBLE sleep. It seems suggesting the thread will bail out if a
> >> signal is sent. But I didn't see we check the signal and exit the loop. What's
> >> the correct behavior here? Since the suspend range is controlled by userspace,
> >
> > As I understand the code - the purpose is to have an UNINTERRUPTIBLE sleep
> > that isn't accounted in load average and that doesn't trigger the hung
> > task warning.
>
> Exactly my reason - yes.
>
> >
> > There should really be something like TASK_UNINTERRUPTIBLE_LONG for this
> > purpose.
>
> That would be nice.
>
> >
> >> I think the correct behavior is if user kills the thread, we exit the loop. So
> >> it seems like to be we check if there is fatal signal pending, exit the loop,
> >> and return IO error. Not sure if we should return IO error though.
> >
> > No, this is not correct - if we report an I/O error for the affected bio,
> > it could corrupt filesystem or confuse other device mapper targets that
> > could be on the top of MD. It is not right to corrupt filesystem if the
> > user kills a process.
>
> Yes, we are too deep to even return something like ERESTARTSYS.
> Blocking is the only option.

My concern is if the app controlling the suspend range dies, other threads
will block in the kernel side forever. We can't even force kill them. This
is an unfortunate behavior. Would adding a timeout here make sense? The app
controlling the suspend range looks part of the disk firmware now. If the
firmware doesn't respond, returning IO timeout is normal.

Thanks,
Shaohua

2017-06-09 01:50:02

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH] md: don't use flush_signals in userspace processes

On Thu, Jun 08 2017, Shaohua Li wrote:

> On Fri, Jun 09, 2017 at 07:24:29AM +1000, Neil Brown wrote:
>> On Thu, Jun 08 2017, Mikulas Patocka wrote:
>>
>> > On Thu, 8 Jun 2017, Shaohua Li wrote:
>> >
>> >> On Thu, Jun 08, 2017 at 04:59:03PM +1000, Neil Brown wrote:
>> >> > On Wed, Jun 07 2017, Mikulas Patocka wrote:
>> >> >
>> >> > > The function flush_signals clears all pending signals for the process. It
>> >> > > may be used by kernel threads when we need to prepare a kernel thread for
>> >> > > responding to signals. However using this function for an userspaces
>> >> > > processes is incorrect - clearing signals without the program expecting it
>> >> > > can cause misbehavior.
>> >> > >
>> >> > > The raid1 and raid5 code uses flush_signals in its request routine because
>> >> > > it wants to prepare for an interruptible wait. This patch drops
>> >> > > flush_signals and uses sigprocmask instead to block all signals (including
>> >> > > SIGKILL) around the schedule() call. The signals are not lost, but the
>> >> > > schedule() call won't respond to them.
>> >> > >
>> >> > > Signed-off-by: Mikulas Patocka <[email protected]>
>> >> > > Cc: [email protected]
>> >> >
>> >> > Thanks for catching that!
>> >> >
>> >> > Acked-by: NeilBrown <[email protected]>
>> >>
>> >> Applied, thanks!
>> >>
>> >> Neil,
>> >> Not about the patch itself. I had question about that part of code. Dropped
>> >> others since this is raid related. I didn't get the point why it's a
>> >> TASK_INTERRUPTIBLE sleep. It seems suggesting the thread will bail out if a
>> >> signal is sent. But I didn't see we check the signal and exit the loop. What's
>> >> the correct behavior here? Since the suspend range is controlled by userspace,
>> >
>> > As I understand the code - the purpose is to have an UNINTERRUPTIBLE sleep
>> > that isn't accounted in load average and that doesn't trigger the hung
>> > task warning.
>>
>> Exactly my reason - yes.
>>
>> >
>> > There should really be something like TASK_UNINTERRUPTIBLE_LONG for this
>> > purpose.
>>
>> That would be nice.
>>
>> >
>> >> I think the correct behavior is if user kills the thread, we exit the loop. So
>> >> it seems like to be we check if there is fatal signal pending, exit the loop,
>> >> and return IO error. Not sure if we should return IO error though.
>> >
>> > No, this is not correct - if we report an I/O error for the affected bio,
>> > it could corrupt filesystem or confuse other device mapper targets that
>> > could be on the top of MD. It is not right to corrupt filesystem if the
>> > user kills a process.
>>
>> Yes, we are too deep to even return something like ERESTARTSYS.
>> Blocking is the only option.
>
> My concern is if the app controlling the suspend range dies, other threads
> will block in the kernel side forever. We can't even force kill them. This
> is an unfortunate behavior. Would adding a timeout here make sense? The app
> controlling the suspend range looks part of the disk firmware now. If the
> firmware doesn't respond, returning IO timeout is normal.

Yes, this happens.
You can write to /sys/block/mdXX/md/suspend_lo to unblock it, but most
people don't know this.

A timeout might be appropriate, but I would want it to be quite a long
one. Several minutes at least...
Though if I found I wanted to do some careful raid6repair surgery on an
array, and so suspending the problematic region and started work, I
might get annoyed if I/O started getting errors, instead of just waiting
like I asked.... but maybe that is a rather far-fetched scenario.

Thanks,
NeilBrown


>
> Thanks,
> Shaohua


Attachments:
signature.asc (832.00 B)

2017-06-09 01:55:14

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH] md: don't use flush_signals in userspace processes

On Thu, Jun 08 2017, Mikulas Patocka wrote:

> On Thu, 8 Jun 2017, NeilBrown wrote:
>
>> On Wed, Jun 07 2017, Mikulas Patocka wrote:
>>
>> > The function flush_signals clears all pending signals for the process. It
>> > may be used by kernel threads when we need to prepare a kernel thread for
>> > responding to signals. However using this function for an userspaces
>> > processes is incorrect - clearing signals without the program expecting it
>> > can cause misbehavior.
>> >
>> > The raid1 and raid5 code uses flush_signals in its request routine because
>> > it wants to prepare for an interruptible wait. This patch drops
>> > flush_signals and uses sigprocmask instead to block all signals (including
>> > SIGKILL) around the schedule() call. The signals are not lost, but the
>> > schedule() call won't respond to them.
>> >
>> > Signed-off-by: Mikulas Patocka <[email protected]>
>> > Cc: [email protected]
>>
>> Thanks for catching that!
>>
>> Acked-by: NeilBrown <[email protected]>
>>
>> NeilBrown
>
> BTW. why does md_thread do "allow_signal(SIGKILL)" and then
> "if (signal_pending(current)) flush_signals(current)"?
>
> Does userspace really send SIGKILL to MD kernel threads? The SIGKILL will
> be lost when flush_signals is called, so it looks quite dubious.
>

This is for md_check_recovery() which does do something on a signal.
Chances are good that it will get to handle the signal before
md_thread() flushed them, but not guaranteed. I could be improved I
guess.

Or maybe it could be discarded - the md_check_recovery() thing.
The idea was that if you alt-sysrq-K to kill all processes, md arrays
would go into immediate-safe-mode where the metadata is marked clean
immediately after writes finish, rather than waiting a few seconds. The
chance of having a clean array after shutdown is hopefully improved.

I've never actually used this though, and I doubt many people know about
it. And bitmaps make it fairly pointless.
So I wouldn't object much if
allow_signal(SIGKILL);
and
if (signal_pending(current)) {
if (mddev->pers->sync_request && !mddev->external) {
pr_debug("md: %s in immediate safe mode\n",
mdname(mddev));
mddev->safemode = 2;
}
flush_signals(current);
}

were removed.

Thanks,
NeilBrown


Attachments:
signature.asc (832.00 B)
Subject: Re: [PATCH] md: don't use flush_signals in userspace processes

On Fri, 09 Jun 2017, NeilBrown wrote:
> Or maybe it could be discarded - the md_check_recovery() thing.
> The idea was that if you alt-sysrq-K to kill all processes, md arrays
> would go into immediate-safe-mode where the metadata is marked clean
> immediately after writes finish, rather than waiting a few seconds. The
> chance of having a clean array after shutdown is hopefully improved.
>
> I've never actually used this though, and I doubt many people know about
> it. And bitmaps make it fairly pointless.

Hmm, I have, although I had no idea this was why my arrays were getting
far less frazzled than expected... It is really useful behavior, now
that I know it can do that.

If you can teach SysRq+S, and especially SysRq+U, to force all arrays
into safe-mode *after* they carried their current meanings
(sync/umount), that would be more useful though, and it would help a lot
of people to avoid dirty arrays without them even knowing why...

--
Henrique Holschuh