2024-04-17 17:49:36

by Mikulas Patocka

Subject: [PATCH 1/2] completion: move blk_wait_io to kernel/sched/completion.c

The block layer has a function blk_wait_io. It works like
wait_for_completion_io, except that it doesn't trigger the hung task
warning if the wait takes too long. This commit renames the function to
wait_for_completion_long_io and moves it to kernel/sched/completion.c so
that other kernel subsystems can use it. It will be needed by the dm-io
subsystem.

Signed-off-by: Mikulas Patocka <[email protected]>

---
block/bio.c | 2 +-
block/blk-mq.c | 2 +-
block/blk.h | 12 ------------
include/linux/completion.h | 1 +
kernel/sched/completion.c | 20 ++++++++++++++++++++
5 files changed, 23 insertions(+), 14 deletions(-)

Index: linux-2.6/block/blk.h
===================================================================
--- linux-2.6.orig/block/blk.h 2024-04-17 19:41:14.000000000 +0200
+++ linux-2.6/block/blk.h 2024-04-17 19:41:14.000000000 +0200
@@ -72,18 +72,6 @@ static inline int bio_queue_enter(struct
 	return __bio_queue_enter(q, bio);
 }
 
-static inline void blk_wait_io(struct completion *done)
-{
-	/* Prevent hang_check timer from firing at us during very long I/O */
-	unsigned long timeout = sysctl_hung_task_timeout_secs * HZ / 2;
-
-	if (timeout)
-		while (!wait_for_completion_io_timeout(done, timeout))
-			;
-	else
-		wait_for_completion_io(done);
-}
-
 #define BIO_INLINE_VECS 4
 struct bio_vec *bvec_alloc(mempool_t *pool, unsigned short *nr_vecs,
 		gfp_t gfp_mask);
Index: linux-2.6/include/linux/completion.h
===================================================================
--- linux-2.6.orig/include/linux/completion.h 2024-04-17 19:41:14.000000000 +0200
+++ linux-2.6/include/linux/completion.h 2024-04-17 19:41:14.000000000 +0200
@@ -112,6 +112,7 @@ extern long wait_for_completion_interrup
 	struct completion *x, unsigned long timeout);
 extern long wait_for_completion_killable_timeout(
 	struct completion *x, unsigned long timeout);
+extern void wait_for_completion_long_io(struct completion *x);
 extern bool try_wait_for_completion(struct completion *x);
 extern bool completion_done(struct completion *x);

Index: linux-2.6/block/bio.c
===================================================================
--- linux-2.6.orig/block/bio.c 2024-04-17 19:41:14.000000000 +0200
+++ linux-2.6/block/bio.c 2024-04-17 19:41:14.000000000 +0200
@@ -1378,7 +1378,7 @@ int submit_bio_wait(struct bio *bio)
 	bio->bi_end_io = submit_bio_wait_endio;
 	bio->bi_opf |= REQ_SYNC;
 	submit_bio(bio);
-	blk_wait_io(&done);
+	wait_for_completion_long_io(&done);
 
 	return blk_status_to_errno(bio->bi_status);
 }
Index: linux-2.6/block/blk-mq.c
===================================================================
--- linux-2.6.orig/block/blk-mq.c 2024-04-17 19:41:14.000000000 +0200
+++ linux-2.6/block/blk-mq.c 2024-04-17 19:41:14.000000000 +0200
@@ -1407,7 +1407,7 @@ blk_status_t blk_execute_rq(struct reque
 	if (blk_rq_is_poll(rq))
 		blk_rq_poll_completion(rq, &wait.done);
 	else
-		blk_wait_io(&wait.done);
+		wait_for_completion_long_io(&wait.done);
 
 	return wait.ret;
 }
Index: linux-2.6/kernel/sched/completion.c
===================================================================
--- linux-2.6.orig/kernel/sched/completion.c 2024-04-17 19:41:14.000000000 +0200
+++ linux-2.6/kernel/sched/completion.c 2024-04-17 19:41:14.000000000 +0200
@@ -290,6 +290,26 @@ wait_for_completion_killable_timeout(str
 EXPORT_SYMBOL(wait_for_completion_killable_timeout);
 
 /**
+ * wait_for_completion_long_io - waits for completion of a task
+ * @x: holds the state of this particular completion
+ *
+ * This is like wait_for_completion_io, but it doesn't warn if the wait takes
+ * too long.
+ */
+void wait_for_completion_long_io(struct completion *x)
+{
+	/* Prevent hang_check timer from firing at us during very long I/O */
+	unsigned long timeout = sysctl_hung_task_timeout_secs * HZ / 2;
+
+	if (timeout)
+		while (!wait_for_completion_io_timeout(x, timeout))
+			;
+	else
+		wait_for_completion_io(x);
+}
+EXPORT_SYMBOL(wait_for_completion_long_io);
+
+/**
  * try_wait_for_completion - try to decrement a completion without blocking
  * @x: completion structure
  *



2024-04-17 18:12:01

by Mikulas Patocka

Subject: Re: [PATCH 1/2] completion: move blk_wait_io to kernel/sched/completion.c



On Wed, 17 Apr 2024, Peter Zijlstra wrote:

> On Wed, Apr 17, 2024 at 07:49:17PM +0200, Mikulas Patocka wrote:
> > Index: linux-2.6/kernel/sched/completion.c
> > ===================================================================
> > --- linux-2.6.orig/kernel/sched/completion.c 2024-04-17 19:41:14.000000000 +0200
> > +++ linux-2.6/kernel/sched/completion.c 2024-04-17 19:41:14.000000000 +0200
> > @@ -290,6 +290,26 @@ wait_for_completion_killable_timeout(str
> > EXPORT_SYMBOL(wait_for_completion_killable_timeout);
> >
> > /**
> > + * wait_for_completion_long_io - waits for completion of a task
> > + * @x: holds the state of this particular completion
> > + *
> > + * This is like wait_for_completion_io, but it doesn't warn if the wait takes
> > + * too long.
> > + */
> > +void wait_for_completion_long_io(struct completion *x)
> > +{
> > +	/* Prevent hang_check timer from firing at us during very long I/O */
> > +	unsigned long timeout = sysctl_hung_task_timeout_secs * HZ / 2;
> > +
> > +	if (timeout)
> > +		while (!wait_for_completion_io_timeout(x, timeout))
> > +			;
> > +	else
> > +		wait_for_completion_io(x);
> > +}
> > +EXPORT_SYMBOL(wait_for_completion_long_io);
>
> Urgh, why is it a sane thing to circumvent the hang check timer?

The block layer already does it - the bios can have arbitrary size, so
waiting for them takes arbitrary time.

Mikulas
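
Why waking up at half the hung-task timeout keeps the detector quiet can be
illustrated with a small userspace model. This is only a sketch and not kernel
code: the struct, the function names and the one-second tick are hypothetical,
and it assumes the detector compares a task's context-switch count between
scans, which is how kernel/hung_task.c uses nvcsw + nivcsw. The values 120
seconds and HZ = 250 in the usage note are example settings.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified view of one sleeping task. */
struct sim_task {
	unsigned long switch_count;      /* bumped each time the task wakes and sleeps again */
	unsigned long last_switch_count; /* value the detector saw on its previous scan */
	unsigned long last_switch_time;  /* time (seconds) of the last change the detector saw */
};

/* Same arithmetic as the patch: half the hung-task timeout, in jiffies. */
unsigned long half_hung_timeout(unsigned long timeout_secs, unsigned long hz)
{
	return timeout_secs * hz / 2;
}

/* One detector scan at time `now`; returns true if it would warn. */
static bool sim_check_hung(struct sim_task *t, unsigned long now,
			   unsigned long timeout)
{
	if (t->switch_count != t->last_switch_count) {
		/* The task ran since the last scan, so restart the clock. */
		t->last_switch_count = t->switch_count;
		t->last_switch_time = now;
		return false;
	}
	return now - t->last_switch_time >= timeout;
}

/*
 * Model an I/O wait lasting io_secs seconds. The task wakes every
 * wake_period seconds (0 means it never wakes before completion) and the
 * detector scans every `timeout` seconds. Returns the number of warnings.
 */
unsigned int sim_wait(unsigned long io_secs, unsigned long wake_period,
		      unsigned long timeout)
{
	struct sim_task t = { 0, 0, 0 };
	unsigned int warnings = 0;

	assert(timeout > 0);
	for (unsigned long now = 1; now <= io_secs; now++) {
		if (wake_period && now % wake_period == 0)
			t.switch_count++;
		if (now % timeout == 0 && sim_check_hung(&t, now, timeout))
			warnings++;
	}
	return warnings;
}
```

In this model, sim_wait(600, 0, 120) counts a warning on each of the five
scans, while sim_wait(600, 60, 120) counts none, because every scan sees a
changed switch count. half_hung_timeout(120, 250) gives 15000 jiffies, and a
timeout of 0 (detector disabled) gives 0, which is the case where the patch
falls back to a plain wait_for_completion_io.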


2024-04-17 22:42:10

by Peter Zijlstra

Subject: Re: [PATCH 1/2] completion: move blk_wait_io to kernel/sched/completion.c

On Wed, Apr 17, 2024 at 07:49:17PM +0200, Mikulas Patocka wrote:
> Index: linux-2.6/kernel/sched/completion.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched/completion.c 2024-04-17 19:41:14.000000000 +0200
> +++ linux-2.6/kernel/sched/completion.c 2024-04-17 19:41:14.000000000 +0200
> @@ -290,6 +290,26 @@ wait_for_completion_killable_timeout(str
> EXPORT_SYMBOL(wait_for_completion_killable_timeout);
>
> /**
> + * wait_for_completion_long_io - waits for completion of a task
> + * @x: holds the state of this particular completion
> + *
> + * This is like wait_for_completion_io, but it doesn't warn if the wait takes
> + * too long.
> + */
> +void wait_for_completion_long_io(struct completion *x)
> +{
> +	/* Prevent hang_check timer from firing at us during very long I/O */
> +	unsigned long timeout = sysctl_hung_task_timeout_secs * HZ / 2;
> +
> +	if (timeout)
> +		while (!wait_for_completion_io_timeout(x, timeout))
> +			;
> +	else
> +		wait_for_completion_io(x);
> +}
> +EXPORT_SYMBOL(wait_for_completion_long_io);

Urgh, why is it a sane thing to circumvent the hang check timer?

2024-04-18 04:57:18

by Christoph Hellwig

Subject: Re: [PATCH 1/2] completion: move blk_wait_io to kernel/sched/completion.c

On Wed, Apr 17, 2024 at 08:00:22PM +0200, Mikulas Patocka wrote:
> > > +EXPORT_SYMBOL(wait_for_completion_long_io);
> >
> > Urgh, why is it a sane thing to circumvent the hang check timer?
>
> The block layer already does it - the bios can have arbitrary size, so
> waiting for them takes arbitrary time.

And as mentioned the last few times around, I think we want a task
state to say that task can sleep long or even forever and not propagate
this hack even further.


2024-04-18 14:41:53

by Jens Axboe

Subject: Re: [PATCH 1/2] completion: move blk_wait_io to kernel/sched/completion.c

On 4/17/24 10:57 PM, Christoph Hellwig wrote:
> On Wed, Apr 17, 2024 at 08:00:22PM +0200, Mikulas Patocka wrote:
>>>> +EXPORT_SYMBOL(wait_for_completion_long_io);
>>>
>>> Urgh, why is it a sane thing to circumvent the hang check timer?
>>
>> The block layer already does it - the bios can have arbitrary size, so
>> waiting for them takes arbitrary time.
>
> And as mentioned the last few times around, I think we want a task
> state to say that task can sleep long or even forever and not propagate
> this hack even further.

It certainly is a hack/work-around, but unless there are a lot more that
should be using something like this, I don't think adding extra core
complexity in terms of a special task state (or per-task flag, at least
that would be easier) is really warranted.

--
Jens Axboe


2024-04-18 14:46:38

by Christoph Hellwig

Subject: Re: [PATCH 1/2] completion: move blk_wait_io to kernel/sched/completion.c

On Thu, Apr 18, 2024 at 08:30:14AM -0600, Jens Axboe wrote:
> It certainly is a hack/work-around, but unless there are a lot more that
> should be using something like this, I don't think adding extra core
> complexity in terms of a special task state (or per-task flag, at least
> that would be easier) is really warranted.

Basically any kernel thread doing on-demand work has the same problem.
It just has an easier workaround hack, as the kernel threads can simply
claim to do an interruptible sleep to not trigger the softlockup
warnings.


2024-04-18 15:09:22

by Jens Axboe

Subject: Re: [PATCH 1/2] completion: move blk_wait_io to kernel/sched/completion.c

On 4/18/24 8:46 AM, Christoph Hellwig wrote:
> On Thu, Apr 18, 2024 at 08:30:14AM -0600, Jens Axboe wrote:
>> It certainly is a hack/work-around, but unless there are a lot more that
>> should be using something like this, I don't think adding extra core
>> complexity in terms of a special task state (or per-task flag, at least
>> that would be easier) is really warranted.
>
> Basically any kernel thread doing on-demand work has the same problem.
> It just has an easier workaround hack, as the kernel threads can simply
> claim to do an interruptible sleep to not trigger the softlockup
> warnings.

A kernel thread can just use TASK_INTERRUPTIBLE, as it doesn't take
signals anyway. But yeah, I guess you could view that as a work-around
as well.

Outside of that, mostly only a block problem, where our sleep is always
uninterruptible. Unless there are similar hacks elsewhere in the kernel
that I'm not aware of?

--
Jens Axboe


2024-04-22 11:00:16

by Peter Zijlstra

Subject: Re: [PATCH 1/2] completion: move blk_wait_io to kernel/sched/completion.c

On Wed, Apr 17, 2024 at 09:57:04PM -0700, Christoph Hellwig wrote:
> On Wed, Apr 17, 2024 at 08:00:22PM +0200, Mikulas Patocka wrote:
> > > > +EXPORT_SYMBOL(wait_for_completion_long_io);
> > >
> > > Urgh, why is it a sane thing to circumvent the hang check timer?
> >
> > The block layer already does it - the bios can have arbitrary size, so
> > waiting for them takes arbitrary time.
>
> And as mentioned the last few times around, I think we want a task
> state to say that task can sleep long or even forever and not propagate
> this hack even further.

A bit like TASK_NOLOAD (which is used to make TASK_IDLE work), but
different I suppose.

TASK_NOHUNG would be trivial to add ofc. But is it worth it?

Anyway, as per the other email, anything like this needs to come with a
big fat warning. You get to keep the pieces etc..

---
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3c2abbc587b4..83b25327c233 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -112,7 +112,8 @@ struct user_event_mm;
#define TASK_FREEZABLE 0x00002000
#define __TASK_FREEZABLE_UNSAFE (0x00004000 * IS_ENABLED(CONFIG_LOCKDEP))
#define TASK_FROZEN 0x00008000
-#define TASK_STATE_MAX 0x00010000
+#define TASK_NOHUNG 0x00010000
+#define TASK_STATE_MAX 0x00020000

#define TASK_ANY (TASK_STATE_MAX-1)

diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index b2fc2727d654..126fac835e5e 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -210,7 +210,8 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
 		state = READ_ONCE(t->__state);
 		if ((state & TASK_UNINTERRUPTIBLE) &&
 		    !(state & TASK_WAKEKILL) &&
-		    !(state & TASK_NOLOAD))
+		    !(state & TASK_NOLOAD) &&
+		    !(state & TASK_NOHUNG))
 			check_hung_task(t, timeout);
 	}
  unlock:

2024-04-22 13:05:29

by Peter Zijlstra

Subject: Re: [PATCH 1/2] completion: move blk_wait_io to kernel/sched/completion.c

On Wed, Apr 17, 2024 at 08:00:22PM +0200, Mikulas Patocka wrote:
>
>
> On Wed, 17 Apr 2024, Peter Zijlstra wrote:
>
> > On Wed, Apr 17, 2024 at 07:49:17PM +0200, Mikulas Patocka wrote:
> > > Index: linux-2.6/kernel/sched/completion.c
> > > ===================================================================
> > > --- linux-2.6.orig/kernel/sched/completion.c 2024-04-17 19:41:14.000000000 +0200
> > > +++ linux-2.6/kernel/sched/completion.c 2024-04-17 19:41:14.000000000 +0200
> > > @@ -290,6 +290,26 @@ wait_for_completion_killable_timeout(str
> > > EXPORT_SYMBOL(wait_for_completion_killable_timeout);
> > >
> > > /**
> > > + * wait_for_completion_long_io - waits for completion of a task
> > > + * @x: holds the state of this particular completion
> > > + *
> > > + * This is like wait_for_completion_io, but it doesn't warn if the wait takes
> > > + * too long.
> > > + */
> > > +void wait_for_completion_long_io(struct completion *x)
> > > +{
> > > +	/* Prevent hang_check timer from firing at us during very long I/O */
> > > +	unsigned long timeout = sysctl_hung_task_timeout_secs * HZ / 2;
> > > +
> > > +	if (timeout)
> > > +		while (!wait_for_completion_io_timeout(x, timeout))
> > > +			;
> > > +	else
> > > +		wait_for_completion_io(x);
> > > +}
> > > +EXPORT_SYMBOL(wait_for_completion_long_io);
> >
> > Urgh, why is it a sane thing to circumvent the hang check timer?
>
> The block layer already does it - the bios can have arbitrary size, so
> waiting for them takes arbitrary time.

Yeah, but now you make it generic and your comment doesn't warn people
away, it makes them think this is a sane thing to do.

2024-04-23 12:51:09

by Mikulas Patocka

Subject: Re: [PATCH 1/2] completion: move blk_wait_io to kernel/sched/completion.c



On Mon, 22 Apr 2024, Peter Zijlstra wrote:

> On Wed, Apr 17, 2024 at 09:57:04PM -0700, Christoph Hellwig wrote:
> > On Wed, Apr 17, 2024 at 08:00:22PM +0200, Mikulas Patocka wrote:
> > > > > +EXPORT_SYMBOL(wait_for_completion_long_io);
> > > >
> > > > Urgh, why is it a sane thing to circumvent the hang check timer?
> > >
> > > The block layer already does it - the bios can have arbitrary size, so
> > > waiting for them takes arbitrary time.
> >
> > And as mentioned the last few times around, I think we want a task
> > state to say that task can sleep long or even forever and not propagate
> > this hack even further.
>
> A bit like TASK_NOLOAD (which is used to make TASK_IDLE work), but
> different I suppose.
>
> TASK_NOHUNG would be trivial to add ofc. But is it worth it?
>
> Anyway, as per the other email, anything like this needs to come with a
> big fat warning. You get to keep the pieces etc..

This seems better than the blk_wait_io hack.

Reviewed-by: Mikulas Patocka <[email protected]>

> ---
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 3c2abbc587b4..83b25327c233 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -112,7 +112,8 @@ struct user_event_mm;
> #define TASK_FREEZABLE 0x00002000
> #define __TASK_FREEZABLE_UNSAFE (0x00004000 * IS_ENABLED(CONFIG_LOCKDEP))
> #define TASK_FROZEN 0x00008000
> -#define TASK_STATE_MAX 0x00010000
> +#define TASK_NOHUNG 0x00010000
> +#define TASK_STATE_MAX 0x00020000
>
> #define TASK_ANY (TASK_STATE_MAX-1)
>
> diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> index b2fc2727d654..126fac835e5e 100644
> --- a/kernel/hung_task.c
> +++ b/kernel/hung_task.c
> @@ -210,7 +210,8 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
>  		state = READ_ONCE(t->__state);
>  		if ((state & TASK_UNINTERRUPTIBLE) &&
>  		    !(state & TASK_WAKEKILL) &&
> -		    !(state & TASK_NOLOAD))
> +		    !(state & TASK_NOLOAD) &&
> +		    !(state & TASK_NOHUNG))
>  			check_hung_task(t, timeout);
>  	}
>   unlock:
>


2024-04-26 07:22:35

by Christoph Hellwig

Subject: Re: [PATCH 1/2] completion: move blk_wait_io to kernel/sched/completion.c

On Mon, Apr 22, 2024 at 12:59:56PM +0200, Peter Zijlstra wrote:
> A bit like TASK_NOLOAD (which is used to make TASK_IDLE work), but
> different I suppose.
>
> TASK_NOHUNG would be trivial to add ofc. But is it worth it?

Yes. And it would allow us to kill the horrible existing block hack.
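
The state test that the TASK_NOHUNG proposal extends can be tried out in
userspace. The sketch below is an illustration under assumptions: TASK_NOHUNG
is only the value proposed in this thread, the other flag values are copied
from include/linux/sched.h as it stood around the time of the thread and
should be checked against the tree in use, and hung_check_applies and
hung_check_examples are hypothetical names, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values, mirroring include/linux/sched.h at the time of the thread. */
#define TASK_INTERRUPTIBLE	0x00000001
#define TASK_UNINTERRUPTIBLE	0x00000002
#define TASK_WAKEKILL		0x00000100
#define TASK_NOLOAD		0x00000400
/* Proposed in this thread only. */
#define TASK_NOHUNG		0x00010000

#define TASK_KILLABLE	(TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)
#define TASK_IDLE	(TASK_UNINTERRUPTIBLE | TASK_NOLOAD)

/* Mirrors the condition in the proposed check_hung_uninterruptible_tasks(). */
bool hung_check_applies(unsigned int state)
{
	return (state & TASK_UNINTERRUPTIBLE) &&
	       !(state & TASK_WAKEKILL) &&
	       !(state & TASK_NOLOAD) &&
	       !(state & TASK_NOHUNG);
}

/* A few cases discussed in the thread. */
void hung_check_examples(void)
{
	/* Plain uninterruptible sleep, e.g. wait_for_completion_io(): checked. */
	assert(hung_check_applies(TASK_UNINTERRUPTIBLE));
	/* Same sleep with the proposed flag: skipped. */
	assert(!hung_check_applies(TASK_UNINTERRUPTIBLE | TASK_NOHUNG));
	/* Kernel thread sleeping interruptibly: never checked. */
	assert(!hung_check_applies(TASK_INTERRUPTIBLE));
	/* TASK_IDLE and TASK_KILLABLE sleepers are already exempt. */
	assert(!hung_check_applies(TASK_IDLE));
	assert(!hung_check_applies(TASK_KILLABLE));
}
```

The TASK_INTERRUPTIBLE case is the existing workaround for kernel threads
mentioned earlier in the thread. The TASK_NOHUNG case is what would let the
block layer drop the timeout loop while still sleeping uninterruptibly.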