2022-09-18 21:27:54

by Hugh Dickins

Subject: [PATCH next] sbitmap: fix lockup while swapping

Commit 4acb83417cad ("sbitmap: fix batched wait_cnt accounting")
is a big improvement: without it, I had to revert to before commit
040b83fcecfb ("sbitmap: fix possible io hung due to lost wakeup")
to avoid the high system time and freezes which that had introduced.

Now okay on the NVME laptop, but 4acb83417cad is a disaster for heavy
swapping (kernel builds in low memory) on another: soon locking up in
sbitmap_queue_wake_up() (into which __sbq_wake_up() is inlined), cycling
around with waitqueue_active() but wait_cnt 0. Here is a backtrace,
showing the common pattern of outer sbitmap_queue_wake_up() interrupted
before setting wait_cnt 0 back to wake_batch (in some cases other CPUs
are idle, in other cases they're spinning for a lock in dd_bio_merge()):

sbitmap_queue_wake_up < sbitmap_queue_clear < blk_mq_put_tag <
__blk_mq_free_request < blk_mq_free_request < __blk_mq_end_request <
scsi_end_request < scsi_io_completion < scsi_finish_command <
scsi_complete < blk_complete_reqs < blk_done_softirq < __do_softirq <
__irq_exit_rcu < irq_exit_rcu < common_interrupt < asm_common_interrupt <
_raw_spin_unlock_irqrestore < __wake_up_common_lock < __wake_up <
sbitmap_queue_wake_up < sbitmap_queue_clear < blk_mq_put_tag <
__blk_mq_free_request < blk_mq_free_request < dd_bio_merge <
blk_mq_sched_bio_merge < blk_mq_attempt_bio_merge < blk_mq_submit_bio <
__submit_bio < submit_bio_noacct_nocheck < submit_bio_noacct <
submit_bio < __swap_writepage < swap_writepage < pageout <
shrink_folio_list < evict_folios < lru_gen_shrink_lruvec <
shrink_lruvec < shrink_node < do_try_to_free_pages < try_to_free_pages <
__alloc_pages_slowpath < __alloc_pages < folio_alloc < vma_alloc_folio <
do_anonymous_page < __handle_mm_fault < handle_mm_fault <
do_user_addr_fault < exc_page_fault < asm_exc_page_fault

I have almost no grasp of all the possible sbitmap races, and their
consequences: but using the same !waitqueue_active() check as used
elsewhere, fixes the lockup and shows no adverse consequence for me.

Fixes: 4acb83417cad ("sbitmap: fix batched wait_cnt accounting")
Signed-off-by: Hugh Dickins <[email protected]>
---

lib/sbitmap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -620,7 +620,7 @@ static bool __sbq_wake_up(struct sbitmap
 		 * function again to wakeup a new batch on a different 'ws'.
 		 */
 		if (cur == 0)
-			return true;
+			return !waitqueue_active(&ws->wait);
 		sub = min(*nr, cur);
 		wait_cnt = cur - sub;
 	} while (!atomic_try_cmpxchg(&ws->wait_cnt, &cur, wait_cnt));


2022-09-19 21:56:29

by Keith Busch

Subject: Re: [PATCH next] sbitmap: fix lockup while swapping

On Sun, Sep 18, 2022 at 02:10:51PM -0700, Hugh Dickins wrote:
> I have almost no grasp of all the possible sbitmap races, and their
> consequences: but using the same !waitqueue_active() check as used
> elsewhere, fixes the lockup and shows no adverse consequence for me.


> Fixes: 4acb83417cad ("sbitmap: fix batched wait_cnt accounting")
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
>
> lib/sbitmap.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> --- a/lib/sbitmap.c
> +++ b/lib/sbitmap.c
> @@ -620,7 +620,7 @@ static bool __sbq_wake_up(struct sbitmap
> * function again to wakeup a new batch on a different 'ws'.
> */
> if (cur == 0)
> - return true;
> + return !waitqueue_active(&ws->wait);

If it's 0, that is supposed to mean another thread is about to make it not zero
as well as increment the wakestate index. That should be happening after patch
48c033314f37 was included, at least.
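
To spell out the sequence I mean (paraphrasing the current __sbq_wake_up()
from memory, so read it as a sketch rather than a quotation): once a
caller's cmpxchg has taken wait_cnt down to 0, that same caller is expected
to finish up with roughly

	wake_up_nr(&ws->wait, wake_batch);	/* wake the batch it just accounted */
	smp_mb__before_atomic();
	sbq_index_atomic_inc(&sbq->wake_index);	/* move everyone on to the next ws */
	atomic_set(&ws->wait_cnt, wake_batch);	/* only now is wait_cnt non-zero again */

so anyone reading 0 should only ever be waiting on that final atomic_set().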

Prior to 4acb83417cad, the code would also return 'true' if the count was
already zero, and this batched code wasn't supposed to behave any different in
that condition.

Anyway, I don't think the waitqueue_active criteria of the current waitstate is
correct either. The next waitstate may have waiters too, so we still may need
to account for this batch's count in order to wake them.

2022-09-19 23:20:38

by Hugh Dickins

Subject: Re: [PATCH next] sbitmap: fix lockup while swapping

On Mon, 19 Sep 2022, Keith Busch wrote:
> On Sun, Sep 18, 2022 at 02:10:51PM -0700, Hugh Dickins wrote:
> > I have almost no grasp of all the possible sbitmap races, and their
> > consequences: but using the same !waitqueue_active() check as used
> > elsewhere, fixes the lockup and shows no adverse consequence for me.
>
>
> > Fixes: 4acb83417cad ("sbitmap: fix batched wait_cnt accounting")
> > Signed-off-by: Hugh Dickins <[email protected]>
> > ---
> >
> > lib/sbitmap.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > --- a/lib/sbitmap.c
> > +++ b/lib/sbitmap.c
> > @@ -620,7 +620,7 @@ static bool __sbq_wake_up(struct sbitmap
> > * function again to wakeup a new batch on a different 'ws'.
> > */
> > if (cur == 0)
> > - return true;
> > + return !waitqueue_active(&ws->wait);
>
> If it's 0, that is supposed to mean another thread is about to make it not zero
> as well as increment the wakestate index. That should be happening after patch
> 48c033314f37 was included, at least.

I believe that the thread about to make wait_cnt not zero (and increment the
wakestate index) is precisely this interrupted thread: the backtrace shows
that it had just done its wakeups, so has not yet reached making wait_cnt
not zero; and I suppose that either its wakeups did not empty the waitqueue
completely, or another waiter got added as soon as it dropped the spinlock.

>
> Prior to 4acb83417cad, the code would also return 'true' if the count was
> already zero, and this batched code wasn't supposed to behave any different in
> that condition.

In principle yes, but in practice no. Prior to 4acb83417cad, the swapping
load would run okayish for a few minutes, before freezing up mysteriously
(presumably due to missed wakeups). The "ish" in "okayish" because the
system time was abnormally high, and occasionally there was an odd message
from systemd about killing its journal or something - 4acb83417cad saved
me from having to investigate that further.

Prior to 4acb83417cad, it never locked up looping on wait_cnt < 0;
after 4acb83417cad, it would lock up on wait_cnt 0 in a few seconds.

But in writing that, and remembering the abnormal systime, I begin to
suspect that it might have often been in a tight loop on wait_cnt < 0,
but with the batch accounting sufficiently wrong that it always got rescued
by an unrelated wakeup (shifting wakestate index), before any lockup
ever got observed and reported. Or something like that.

(And I'm trying to avoid making a fool of myself with the arithmetic:
how quickly would wait_cnt 0 have got decremented to positive before?)

I won't mind Jens deleting that "Fixes: 4acb83417cad" if it's unfair.

>
> Anyway, I don't think the waitqueue_active criteria of the current waitstate is
> correct either. The next waitstate may have waiters too, so we still may need
> to account for this batch's count in order to wake them.

I cannot usefully comment on that, it's all rather too subtle for me.

But I did wonder if each of those !waitqueue_active()s would be better
replaced just by "false"s: we only get that far into __sbq_wake_up() if
waitqueue_active(), so the !waitqueue_active() case reflects a race:
a possible race, yes, but a race that wants precise accounting at a
few imprecise locations?

Hugh

2022-09-21 16:51:15

by Jan Kara

Subject: Re: [PATCH next] sbitmap: fix lockup while swapping

On Mon 19-09-22 16:01:39, Hugh Dickins wrote:
> On Mon, 19 Sep 2022, Keith Busch wrote:
> > On Sun, Sep 18, 2022 at 02:10:51PM -0700, Hugh Dickins wrote:
> > > I have almost no grasp of all the possible sbitmap races, and their
> > > consequences: but using the same !waitqueue_active() check as used
> > > elsewhere, fixes the lockup and shows no adverse consequence for me.
> >
> >
> > > Fixes: 4acb83417cad ("sbitmap: fix batched wait_cnt accounting")
> > > Signed-off-by: Hugh Dickins <[email protected]>
> > > ---
> > >
> > > lib/sbitmap.c | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > --- a/lib/sbitmap.c
> > > +++ b/lib/sbitmap.c
> > > @@ -620,7 +620,7 @@ static bool __sbq_wake_up(struct sbitmap
> > > * function again to wakeup a new batch on a different 'ws'.
> > > */
> > > if (cur == 0)
> > > - return true;
> > > + return !waitqueue_active(&ws->wait);
> >
> > If it's 0, that is supposed to mean another thread is about to make it not zero
> > as well as increment the wakestate index. That should be happening after patch
> > 48c033314f37 was included, at least.
>
> I believe that the thread about to make wait_cnt not zero (and increment the
> wakestate index) is precisely this interrupted thread: the backtrace shows
> that it had just done its wakeups, so has not yet reached making wait_cnt
> not zero; and I suppose that either its wakeups did not empty the waitqueue
> completely, or another waiter got added as soon as it dropped the spinlock.
>
> >
> > Prior to 4acb83417cad, the code would also return 'true' if the count was
> > already zero, and this batched code wasn't supposed to behave any different in
> > that condition.
>
> In principle yes, but in practice no. Prior to 4acb83417cad, the swapping
> load would run okayish for a few minutes, before freezing up mysteriously
> (presumably due to missed wakeups). The "ish" in "okayish" because the
> system time was abnormally high, and occasionally there was an odd message
> from systemd about killing its journal or something - 4acb83417cad saved
> me from having to investigate that further.
>
> Prior to 4acb83417cad, it never locked up looping on wait_cnt < 0;
> after 4acb83417cad, it would lock up on wait_cnt 0 in a few seconds.
>
> But in writing that, and remembering the abnormal systime, I begin to
> suspect that it might have often been in a tight loop on wait_cnt < 0,
> but the batch accounting sufficiently wrong that it always got rescued
> by an unrelated wakeup (shifting wakestate index), before any lockup
> ever got observed and reported. Or something like that.
>
> (And I'm trying to avoid making a fool of myself with the arithmetic:
> how quickly would wait_cnt 0 have got decremented to positive before?)
>
> I won't mind Jens deleting that "Fixes: 4acb83417cad" if it's unfair.
>
> >
> > Anyway, I don't think the waitqueue_active criteria of the current waitstate is
> > correct either. The next waitstate may have waiters too, so we still may need
> > to account for this batch's count in order to wake them.
>
> I cannot usefully comment on that, it's all rather too subtle for me.
>
> But I did wonder if each of those !waitqueue_active()s would be better
> replaced just by "false"s: we only get that far into __sbq_wake_up() if
> waitqueue_active(), so the !waitqueue_active() case reflects a race:
> a possible race, yes, but a race that wants precise accounting at a
> few imprecise locations?

I think the code is actually too subtle and complex for anybody to sensibly
reason about it (as the continuous stream of bugs demonstrates ;)). That
being said, I agree with Keith that your "fix" looks suspicious: it is
not how things are expected to work, and we can possibly lose wakeups with
your change. So we need to understand better how the code can be looping on
your system. I'll think more about it tomorrow...

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-09-23 14:54:31

by Jan Kara

Subject: Re: [PATCH next] sbitmap: fix lockup while swapping

On Wed 21-09-22 18:40:12, Jan Kara wrote:
> On Mon 19-09-22 16:01:39, Hugh Dickins wrote:
> > On Mon, 19 Sep 2022, Keith Busch wrote:
> > > On Sun, Sep 18, 2022 at 02:10:51PM -0700, Hugh Dickins wrote:
> > > > I have almost no grasp of all the possible sbitmap races, and their
> > > > consequences: but using the same !waitqueue_active() check as used
> > > > elsewhere, fixes the lockup and shows no adverse consequence for me.
> > >
> > >
> > > > Fixes: 4acb83417cad ("sbitmap: fix batched wait_cnt accounting")
> > > > Signed-off-by: Hugh Dickins <[email protected]>
> > > > ---
> > > >
> > > > lib/sbitmap.c | 2 +-
> > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > --- a/lib/sbitmap.c
> > > > +++ b/lib/sbitmap.c
> > > > @@ -620,7 +620,7 @@ static bool __sbq_wake_up(struct sbitmap
> > > > * function again to wakeup a new batch on a different 'ws'.
> > > > */
> > > > if (cur == 0)
> > > > - return true;
> > > > + return !waitqueue_active(&ws->wait);
> > >
> > > If it's 0, that is supposed to mean another thread is about to make it not zero
> > > as well as increment the wakestate index. That should be happening after patch
> > > 48c033314f37 was included, at least.
> >
> > I believe that the thread about to make wait_cnt not zero (and increment the
> > wakestate index) is precisely this interrupted thread: the backtrace shows
> > that it had just done its wakeups, so has not yet reached making wait_cnt
> > not zero; and I suppose that either its wakeups did not empty the waitqueue
> > completely, or another waiter got added as soon as it dropped the spinlock.

I was trying to wrap my head around this but I am failing to see how we
could have wait_cnt == 0 for long enough to cause any kind of stall let
alone a lockup in sbitmap_queue_wake_up() as you describe. I can understand
we have:

CPU1                                    CPU2

sbitmap_queue_wake_up()
  ws = sbq_wake_ptr(sbq);
  cur = atomic_read(&ws->wait_cnt);
  do {
    ...
    wait_cnt = cur - sub;   /* this will be 0 */
  } while (!atomic_try_cmpxchg(&ws->wait_cnt, &cur, wait_cnt));
  ...
                                        /* Gets the same waitqueue */
                                        ws = sbq_wake_ptr(sbq);
                                        cur = atomic_read(&ws->wait_cnt);
                                        do {
                                          if (cur == 0)
                                            return true;   /* loop */
  wake_up_nr(&ws->wait, wake_batch);
  smp_mb__before_atomic();
  sbq_index_atomic_inc(&sbq->wake_index);
  atomic_set(&ws->wait_cnt, wake_batch);  /* This stops looping on CPU2 */

So until CPU1 reaches the atomic_set(), CPU2 can be looping. But how come
this takes so long that it causes a hang as you describe? Hum... So either
CPU1 takes really long to get to atomic_set():
- can CPU1 get preempted? Likely not at least in the context you show in
your message
- can CPU1 spend so long in wake_up_nr()? Maybe the waitqueue lock is
contended but still...

or CPU2 somehow sees cur==0 for longer than it should. The whole sequence
executed in a loop on CPU2 does not contain anything that would force CPU2
to refresh its cache and get new ws->wait_cnt value so we are at the mercy
of CPU cache coherency mechanisms to stage the write on CPU1 and propagate
it to other CPUs. But still I would not expect that to take significantly
long. Any other ideas?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-09-23 15:24:37

by Keith Busch

Subject: Re: [PATCH next] sbitmap: fix lockup while swapping

On Fri, Sep 23, 2022 at 04:43:03PM +0200, Jan Kara wrote:
> On Wed 21-09-22 18:40:12, Jan Kara wrote:
> > On Mon 19-09-22 16:01:39, Hugh Dickins wrote:
> > > On Mon, 19 Sep 2022, Keith Busch wrote:
> > > > On Sun, Sep 18, 2022 at 02:10:51PM -0700, Hugh Dickins wrote:
> > > > > I have almost no grasp of all the possible sbitmap races, and their
> > > > > consequences: but using the same !waitqueue_active() check as used
> > > > > elsewhere, fixes the lockup and shows no adverse consequence for me.
> > > >
> > > >
> > > > > Fixes: 4acb83417cad ("sbitmap: fix batched wait_cnt accounting")
> > > > > Signed-off-by: Hugh Dickins <[email protected]>
> > > > > ---
> > > > >
> > > > > lib/sbitmap.c | 2 +-
> > > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > > >
> > > > > --- a/lib/sbitmap.c
> > > > > +++ b/lib/sbitmap.c
> > > > > @@ -620,7 +620,7 @@ static bool __sbq_wake_up(struct sbitmap
> > > > > * function again to wakeup a new batch on a different 'ws'.
> > > > > */
> > > > > if (cur == 0)
> > > > > - return true;
> > > > > + return !waitqueue_active(&ws->wait);
> > > >
> > > > If it's 0, that is supposed to mean another thread is about to make it not zero
> > > > as well as increment the wakestate index. That should be happening after patch
> > > > 48c033314f37 was included, at least.
> > >
> > > I believe that the thread about to make wait_cnt not zero (and increment the
> > > wakestate index) is precisely this interrupted thread: the backtrace shows
> > > that it had just done its wakeups, so has not yet reached making wait_cnt
> > > not zero; and I suppose that either its wakeups did not empty the waitqueue
> > > completely, or another waiter got added as soon as it dropped the spinlock.
>
> I was trying to wrap my head around this but I am failing to see how we
> could have wait_cnt == 0 for long enough to cause any kind of stall let
> alone a lockup in sbitmap_queue_wake_up() as you describe. I can understand
> we have:
>
> CPU1                                    CPU2
>
> sbitmap_queue_wake_up()
>   ws = sbq_wake_ptr(sbq);
>   cur = atomic_read(&ws->wait_cnt);
>   do {
>     ...
>     wait_cnt = cur - sub;   /* this will be 0 */
>   } while (!atomic_try_cmpxchg(&ws->wait_cnt, &cur, wait_cnt));
>   ...
>                                         /* Gets the same waitqueue */
>                                         ws = sbq_wake_ptr(sbq);
>                                         cur = atomic_read(&ws->wait_cnt);
>                                         do {
>                                           if (cur == 0)
>                                             return true;   /* loop */
>   wake_up_nr(&ws->wait, wake_batch);
>   smp_mb__before_atomic();
>   sbq_index_atomic_inc(&sbq->wake_index);
>   atomic_set(&ws->wait_cnt, wake_batch);  /* This stops looping on CPU2 */
>
> So until CPU1 reaches the atomic_set(), CPU2 can be looping. But how come
> this takes so long that is causes a hang as you describe? Hum... So either
> CPU1 takes really long to get to atomic_set():
> - can CPU1 get preempted? Likely not at least in the context you show in
> your message
> - can CPU1 spend so long in wake_up_nr()? Maybe the waitqueue lock is
> contended but still...
>
> or CPU2 somehow sees cur==0 for longer than it should. The whole sequence
> executed in a loop on CPU2 does not contain anything that would force CPU2
> to refresh its cache and get new ws->wait_cnt value so we are at the mercy
> of CPU cache coherency mechanisms to stage the write on CPU1 and propagate
> it to other CPUs. But still I would not expect that to take significantly
> long. Any other ideas?

Thank you for the analysis. I arrived at the same conclusions.

If this is a preempt enabled context, and there's just one CPU, I suppose the
2nd task could spin in the while(), blocking the 1st task from resetting the
wait_cnt. I doubt that's happening though, at least for nvme where we call this
function in irq context.

2022-09-23 16:21:51

by Hugh Dickins

Subject: Re: [PATCH next] sbitmap: fix lockup while swapping

On Fri, 23 Sep 2022, Jan Kara wrote:
> On Wed 21-09-22 18:40:12, Jan Kara wrote:
> > On Mon 19-09-22 16:01:39, Hugh Dickins wrote:
> > > On Mon, 19 Sep 2022, Keith Busch wrote:
> > > > On Sun, Sep 18, 2022 at 02:10:51PM -0700, Hugh Dickins wrote:
> > > > > I have almost no grasp of all the possible sbitmap races, and their
> > > > > consequences: but using the same !waitqueue_active() check as used
> > > > > elsewhere, fixes the lockup and shows no adverse consequence for me.
> > > >
> > > >
> > > > > Fixes: 4acb83417cad ("sbitmap: fix batched wait_cnt accounting")
> > > > > Signed-off-by: Hugh Dickins <[email protected]>
> > > > > ---
> > > > >
> > > > > lib/sbitmap.c | 2 +-
> > > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > > >
> > > > > --- a/lib/sbitmap.c
> > > > > +++ b/lib/sbitmap.c
> > > > > @@ -620,7 +620,7 @@ static bool __sbq_wake_up(struct sbitmap
> > > > > * function again to wakeup a new batch on a different 'ws'.
> > > > > */
> > > > > if (cur == 0)
> > > > > - return true;
> > > > > + return !waitqueue_active(&ws->wait);
> > > >
> > > > If it's 0, that is supposed to mean another thread is about to make it not zero
> > > > as well as increment the wakestate index. That should be happening after patch
> > > > 48c033314f37 was included, at least.
> > >
> > > I believe that the thread about to make wait_cnt not zero (and increment the
> > > wakestate index) is precisely this interrupted thread: the backtrace shows
> > > that it had just done its wakeups, so has not yet reached making wait_cnt
> > > not zero; and I suppose that either its wakeups did not empty the waitqueue
> > > completely, or another waiter got added as soon as it dropped the spinlock.
>
> I was trying to wrap my head around this but I am failing to see how we
> could have wait_cnt == 0 for long enough to cause any kind of stall let
> alone a lockup in sbitmap_queue_wake_up() as you describe. I can understand
> we have:
>
> CPU1                                    CPU2
>
> sbitmap_queue_wake_up()
>   ws = sbq_wake_ptr(sbq);
>   cur = atomic_read(&ws->wait_cnt);
>   do {
>     ...
>     wait_cnt = cur - sub;   /* this will be 0 */
>   } while (!atomic_try_cmpxchg(&ws->wait_cnt, &cur, wait_cnt));
>   ...
>                                         /* Gets the same waitqueue */
>                                         ws = sbq_wake_ptr(sbq);
>                                         cur = atomic_read(&ws->wait_cnt);
>                                         do {
>                                           if (cur == 0)
>                                             return true;   /* loop */
>   wake_up_nr(&ws->wait, wake_batch);
>   smp_mb__before_atomic();
>   sbq_index_atomic_inc(&sbq->wake_index);
>   atomic_set(&ws->wait_cnt, wake_batch);  /* This stops looping on CPU2 */
>
> So until CPU1 reaches the atomic_set(), CPU2 can be looping. But how come
> this takes so long that is causes a hang as you describe? Hum... So either
> CPU1 takes really long to get to atomic_set():
> - can CPU1 get preempted? Likely not at least in the context you show in
> your message
> - can CPU1 spend so long in wake_up_nr()? Maybe the waitqueue lock is
> contended but still...
>
> or CPU2 somehow sees cur==0 for longer than it should. The whole sequence
> executed in a loop on CPU2 does not contain anything that would force CPU2
> to refresh its cache and get new ws->wait_cnt value so we are at the mercy
> of CPU cache coherency mechanisms to stage the write on CPU1 and propagate
> it to other CPUs. But still I would not expect that to take significantly
> long. Any other ideas?

Rushed reply (hoping) to help your thinking,
I'll read you more closely two hours later.

You're writing of CPU1 and CPU2, but in my case it was just the one CPU
interrupted - see again sbitmap_queue_wake_up twice in the backtrace:

sbitmap_queue_wake_up < sbitmap_queue_clear < blk_mq_put_tag <
__blk_mq_free_request < blk_mq_free_request < __blk_mq_end_request <
scsi_end_request < scsi_io_completion < scsi_finish_command <
scsi_complete < blk_complete_reqs < blk_done_softirq < __do_softirq <
__irq_exit_rcu < irq_exit_rcu < common_interrupt < asm_common_interrupt <
_raw_spin_unlock_irqrestore < __wake_up_common_lock < __wake_up <
sbitmap_queue_wake_up < sbitmap_queue_clear < blk_mq_put_tag <
__blk_mq_free_request < blk_mq_free_request < dd_bio_merge <
blk_mq_sched_bio_merge < blk_mq_attempt_bio_merge < blk_mq_submit_bio <
__submit_bio < submit_bio_noacct_nocheck < submit_bio_noacct <
submit_bio < __swap_writepage < swap_writepage < pageout <
shrink_folio_list < evict_folios < lru_gen_shrink_lruvec <
shrink_lruvec < shrink_node < do_try_to_free_pages < try_to_free_pages <
__alloc_pages_slowpath < __alloc_pages < folio_alloc < vma_alloc_folio <
do_anonymous_page < __handle_mm_fault < handle_mm_fault <
do_user_addr_fault < exc_page_fault < asm_exc_page_fault

And in one case I did study stack contents carefully, confirming
the same sbq pointer used in upper and lower sbitmap_queue_wake_up.

And on Keith's points: yes, I do have preemption enabled; but no, this
machine does not have nvme, and I never hit this on the nvme laptop.

Thanks,
Hugh

2022-09-23 19:48:28

by Keith Busch

Subject: Re: [PATCH next] sbitmap: fix lockup while swapping

Does the following fix the observation? Rationale being that there's no reason
to spin on the current wait state that is already under handling; let
subsequent clearings proceed to the next inevitable wait state immediately.

---
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 624fa7f118d1..47bf7882210b 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -634,6 +634,13 @@ static bool __sbq_wake_up(struct sbitmap_queue *sbq, int *nr)
 
 	*nr -= sub;
 
+	/*
+	 * Increase wake_index before updating wait_cnt, otherwise concurrent
+	 * callers can see valid wait_cnt in old waitqueue, which can cause
+	 * invalid wakeup on the old waitqueue.
+	 */
+	sbq_index_atomic_inc(&sbq->wake_index);
+
 	/*
 	 * When wait_cnt == 0, we have to be particularly careful as we are
 	 * responsible to reset wait_cnt regardless whether we've actually
@@ -660,13 +667,6 @@ static bool __sbq_wake_up(struct sbitmap_queue *sbq, int *nr)
 	 * of atomic_set().
 	 */
 	smp_mb__before_atomic();
-
-	/*
-	 * Increase wake_index before updating wait_cnt, otherwise concurrent
-	 * callers can see valid wait_cnt in old waitqueue, which can cause
-	 * invalid wakeup on the old waitqueue.
-	 */
-	sbq_index_atomic_inc(&sbq->wake_index);
 	atomic_set(&ws->wait_cnt, wake_batch);
 
 	return ret || *nr;
--

2022-09-23 22:16:41

by Hugh Dickins

Subject: Re: [PATCH next] sbitmap: fix lockup while swapping

On Fri, 23 Sep 2022, Keith Busch wrote:

> Does the following fix the observation? Rational being that there's no reason
> to spin on the current wait state that is already under handling; let
> subsequent clearings proceed to the next inevitable wait state immediately.

It's running fine without lockup so far; but doesn't this change merely
narrow the window? If this is interrupted in between atomic_try_cmpxchg()
setting wait_cnt to 0 and sbq_index_atomic_inc() advancing wake_index,
don't we run the same risk as before, of sbitmap_queue_wake_up() from
the interrupt handler getting stuck on that wait_cnt 0?

>
> ---
> diff --git a/lib/sbitmap.c b/lib/sbitmap.c
> index 624fa7f118d1..47bf7882210b 100644
> --- a/lib/sbitmap.c
> +++ b/lib/sbitmap.c
> @@ -634,6 +634,13 @@ static bool __sbq_wake_up(struct sbitmap_queue *sbq, int *nr)
>
> *nr -= sub;
>
> + /*
> + * Increase wake_index before updating wait_cnt, otherwise concurrent
> + * callers can see valid wait_cnt in old waitqueue, which can cause
> + * invalid wakeup on the old waitqueue.
> + */
> + sbq_index_atomic_inc(&sbq->wake_index);
> +
> /*
> * When wait_cnt == 0, we have to be particularly careful as we are
> * responsible to reset wait_cnt regardless whether we've actually
> @@ -660,13 +667,6 @@ static bool __sbq_wake_up(struct sbitmap_queue *sbq, int *nr)
> * of atomic_set().
> */
> smp_mb__before_atomic();
> -
> - /*
> - * Increase wake_index before updating wait_cnt, otherwise concurrent
> - * callers can see valid wait_cnt in old waitqueue, which can cause
> - * invalid wakeup on the old waitqueue.
> - */
> - sbq_index_atomic_inc(&sbq->wake_index);
> atomic_set(&ws->wait_cnt, wake_batch);
>
> return ret || *nr;
> --

2022-09-24 00:06:32

by Hugh Dickins

Subject: Re: [PATCH next] sbitmap: fix lockup while swapping

On Fri, 23 Sep 2022, Hugh Dickins wrote:
> On Fri, 23 Sep 2022, Keith Busch wrote:
>
> > Does the following fix the observation? Rational being that there's no reason
> > to spin on the current wait state that is already under handling; let
> > subsequent clearings proceed to the next inevitable wait state immediately.
>
> It's running fine without lockup so far; but doesn't this change merely
> narrow the window? If this is interrupted in between atomic_try_cmpxchg()
> setting wait_cnt to 0 and sbq_index_atomic_inc() advancing wake_index,
> don't we run the same risk as before, of sbitmap_queue_wake_up() from
> the interrupt handler getting stuck on that wait_cnt 0?

Yes, it ran successfully for 50 minutes, then an interrupt came in
immediately after the cmpxchg, and it locked up just as before.

Easily dealt with by disabling interrupts, no doubt, but I assume it's a
badge of honour not to disable interrupts here (except perhaps in waking).

Some clever way to make the wait_cnt and wake_index adjustments atomic?

Or is this sbitmap_queue_wake_up() interrupting sbitmap_queue_wake_up()
just supposed never to happen, the counts preventing it: but some
misaccounting letting it happen by mistake?

>
> >
> > ---
> > diff --git a/lib/sbitmap.c b/lib/sbitmap.c
> > index 624fa7f118d1..47bf7882210b 100644
> > --- a/lib/sbitmap.c
> > +++ b/lib/sbitmap.c
> > @@ -634,6 +634,13 @@ static bool __sbq_wake_up(struct sbitmap_queue *sbq, int *nr)
> >
> > *nr -= sub;
> >
> > + /*
> > + * Increase wake_index before updating wait_cnt, otherwise concurrent
> > + * callers can see valid wait_cnt in old waitqueue, which can cause
> > + * invalid wakeup on the old waitqueue.
> > + */
> > + sbq_index_atomic_inc(&sbq->wake_index);
> > +
> > /*
> > * When wait_cnt == 0, we have to be particularly careful as we are
> > * responsible to reset wait_cnt regardless whether we've actually
> > @@ -660,13 +667,6 @@ static bool __sbq_wake_up(struct sbitmap_queue *sbq, int *nr)
> > * of atomic_set().
> > */
> > smp_mb__before_atomic();
> > -
> > - /*
> > - * Increase wake_index before updating wait_cnt, otherwise concurrent
> > - * callers can see valid wait_cnt in old waitqueue, which can cause
> > - * invalid wakeup on the old waitqueue.
> > - */
> > - sbq_index_atomic_inc(&sbq->wake_index);
> > atomic_set(&ws->wait_cnt, wake_batch);
> >
> > return ret || *nr;
> > --
>

2022-09-26 14:10:06

by Jan Kara

Subject: Re: [PATCH next] sbitmap: fix lockup while swapping

On Fri 23-09-22 16:15:29, Hugh Dickins wrote:
> On Fri, 23 Sep 2022, Hugh Dickins wrote:
> > On Fri, 23 Sep 2022, Keith Busch wrote:
> >
> > > Does the following fix the observation? Rational being that there's no reason
> > > to spin on the current wait state that is already under handling; let
> > > subsequent clearings proceed to the next inevitable wait state immediately.
> >
> > It's running fine without lockup so far; but doesn't this change merely
> > narrow the window? If this is interrupted in between atomic_try_cmpxchg()
> > setting wait_cnt to 0 and sbq_index_atomic_inc() advancing wake_index,
> > don't we run the same risk as before, of sbitmap_queue_wake_up() from
> > the interrupt handler getting stuck on that wait_cnt 0?
>
> Yes, it ran successfully for 50 minutes, then an interrupt came in
> immediately after the cmpxchg, and it locked up just as before.
>
> Easily dealt with by disabling interrupts, no doubt, but I assume it's a
> badge of honour not to disable interrupts here (except perhaps in waking).

I don't think any magic with sbq_index_atomic_inc() is going to reliably
fix this. After all the current waitqueue may be the only one that has active
waiters so sbq_wake_ptr() will always end up returning this waitqueue
regardless of the current value of sbq->wake_index.

Honestly, this whole code needs a serious redesign. I have some
simplifications in mind but it will take some thinking and benchmarking so
we need some fix for the interim. I was pondering for quite some time about
some band aid to the problem you've found but didn't find anything
satisfactory.

In the end I see two options:

1) Take your patch (as wrong as it is ;). Yes, it can lead to lost wakeups
but we were living with those for a relatively long time so probably we can
live with them for some longer.

2) Revert Yu Kuai's original fix 040b83fcecfb8 ("sbitmap: fix possible io
hung due to lost wakeup") and my fixup 48c033314f37 ("sbitmap: Avoid leaving
waitqueue in invalid state in __sbq_wake_up()"). But then Keith would have
to redo his batched accounting patches on top.

> Some clever way to make the wait_cnt and wake_index adjustments atomic?
>
> Or is this sbitmap_queue_wake_up() interrupting sbitmap_queue_wake_up()
> just supposed never to happen, the counts preventing it: but some
> misaccounting letting it happen by mistake?

No, I think that is in principle a situation that we have to accommodate.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-09-26 16:33:40

by Yu Kuai

Subject: Re: [PATCH next] sbitmap: fix lockup while swapping

Hi,

On 2022/09/26 19:44, Jan Kara wrote:
> On Fri 23-09-22 16:15:29, Hugh Dickins wrote:
>> On Fri, 23 Sep 2022, Hugh Dickins wrote:
>>> On Fri, 23 Sep 2022, Keith Busch wrote:
>>>
>>>> Does the following fix the observation? Rational being that there's no reason
>>>> to spin on the current wait state that is already under handling; let
>>>> subsequent clearings proceed to the next inevitable wait state immediately.
>>>
>>> It's running fine without lockup so far; but doesn't this change merely
>>> narrow the window? If this is interrupted in between atomic_try_cmpxchg()
>>> setting wait_cnt to 0 and sbq_index_atomic_inc() advancing wake_index,
>>> don't we run the same risk as before, of sbitmap_queue_wake_up() from
>>> the interrupt handler getting stuck on that wait_cnt 0?
>>
>> Yes, it ran successfully for 50 minutes, then an interrupt came in
>> immediately after the cmpxchg, and it locked up just as before.
>>
>> Easily dealt with by disabling interrupts, no doubt, but I assume it's a
>> badge of honour not to disable interrupts here (except perhaps in waking).
>
> I don't think any magic with sbq_index_atomic_inc() is going to reliably
> fix this. After all the current waitqueue may be the only one that has active
> waiters so sbq_wake_ptr() will always end up returning this waitqueue
> regardless of the current value of sbq->wake_index.
>
> Honestly, this whole code needs a serious redesign. I have some
> simplifications in mind but it will take some thinking and benchmarking so
> we need some fix for the interim. I was pondering for quite some time about
> some band aid to the problem you've found but didn't find anything
> satisfactory.
>
> In the end I see two options:
>
> 1) Take your patch (as wrong as it is ;). Yes, it can lead to lost wakeups
> but we were living with those for a relatively long time so probably we can
> live with them for some longer.
>
> 2) Revert Yu Kuai's original fix 040b83fcecfb8 ("sbitmap: fix possible io
> hung due to lost wakeup") and my fixup 48c033314f37 ("sbitmap: Avoid leaving
> waitqueue in invalid state in __sbq_wake_up()"). But then Keith would have
> to redo his batched accounting patches on top.
>
>> Some clever way to make the wait_cnt and wake_index adjustments atomic?

I'm thinking about a hacky way to make the updates of wake_cnt and
wake_index atomic; however, a redesign of sbitmap_queue is probably
better.

There are only 8 wait queues and wake_batch is 8 at most, thus
3 * 9 = 27 bits are enough, and a single atomic value can hold them all:

- bits 0-2 represent ws[0].wake_cnt
- bits 3-5 represent ws[1].wake_cnt
- ...
- bits 21-23 represent ws[7].wake_cnt
- bits 24-26 represent sbq->wake_index

For example, assume the atomic value is:

0B 111 111 111 111 111 111 111 111 000,

which means wake_index is 7 and ws[0].wake_cnt is 0;
if we try to inc wake_index and reset wake_cnt together:

atomic_add(..., 0B 001 000 000 000 000 000 000 000 111)
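
A minimal sketch of what I mean (all names and helpers below are made up
just to illustrate the layout, this is not a real patch):

	#define SBQ_CNT_BITS	3
	#define SBQ_CNT_MASK	((1U << SBQ_CNT_BITS) - 1)
	#define SBQ_INDEX_SHIFT	(SBQ_WAIT_QUEUES * SBQ_CNT_BITS)	/* 8 * 3 = 24 */

	/* hypothetical accessors into one packed atomic_t */
	static inline unsigned int sbq_packed_cnt(unsigned int packed, unsigned int i)
	{
		return (packed >> (i * SBQ_CNT_BITS)) & SBQ_CNT_MASK;
	}

	static inline unsigned int sbq_packed_index(unsigned int packed)
	{
		return (packed >> SBQ_INDEX_SHIFT) & SBQ_CNT_MASK;
	}

	/*
	 * Bump wake_index and refill ws[i].wake_cnt in a single atomic_add();
	 * only valid while ws[i].wake_cnt is currently 0, as in the example.
	 */
	static inline void sbq_packed_advance(atomic_t *packed, unsigned int i,
					      unsigned int refill)
	{
		atomic_add((1U << SBQ_INDEX_SHIFT) | (refill << (i * SBQ_CNT_BITS)),
			   packed);
	}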

Thanks,
Kuai

>>
>> Or is this sbitmap_queue_wake_up() interrupting sbitmap_queue_wake_up()
>> just supposed never to happen, the counts preventing it: but some
>> misaccounting letting it happen by mistake?
>
> No, I think that is in principle a situation that we have to accommodate.
>
> Honza
>

2022-09-27 04:27:05

by Hugh Dickins

Subject: Re: [PATCH next] sbitmap: fix lockup while swapping

On Sat, 24 Sep 2022, Hillf Danton wrote:
>
> I think the lockup can be avoided by
> a) either advancing wake_index as early as I can [1],
> b) or doing wakeup in case of zero wait_cnt to kill all cases of waitqueue_active().
>
> Only for thoughts now.

Thanks Hillf: I gave your __sbq_wake_up() patch below several tries,
and as far as I could tell, it works just as well as my one-liner.

But I don't think it's what we would want to do: doesn't it increment
wake_index on every call to __sbq_wake_up()? whereas I thought it was
intended to be incremented only after wake_batch calls (thinking in
terms of nr 1).

I'll not be surprised if your advance-wake_index-earlier idea ends
up as a part of the solution: but mainly I agree with Jan that the
whole code needs a serious redesign (or perhaps the whole design
needs a serious recode). So I didn't give your version more thought.

Hugh

>
> Hillf
>
> [1] https://lore.kernel.org/lkml/[email protected]/
>
> +++ b/lib/sbitmap.c
> @@ -613,6 +613,16 @@ static bool __sbq_wake_up(struct sbitmap
> if (!ws)
> return false;
>
> + do {
> + /* open code sbq_index_atomic_inc(&sbq->wake_index) to avoid race */
> + int old = atomic_read(&sbq->wake_index);
> + int new = sbq_index_inc(old);
> +
> + /* try another ws if someone else takes care of this one */
> + if (old != atomic_cmpxchg(&sbq->wake_index, old, new))
> + return true;
> + } while (0);
> +
> cur = atomic_read(&ws->wait_cnt);
> do {
> /*
> @@ -620,7 +630,7 @@ static bool __sbq_wake_up(struct sbitmap
> * function again to wakeup a new batch on a different 'ws'.
> */
> if (cur == 0)
> - return true;
> + goto out;
> sub = min(*nr, cur);
> wait_cnt = cur - sub;
> } while (!atomic_try_cmpxchg(&ws->wait_cnt, &cur, wait_cnt));
> @@ -634,6 +644,7 @@ static bool __sbq_wake_up(struct sbitmap
>
> *nr -= sub;
>
> +out:
> /*
> * When wait_cnt == 0, we have to be particularly careful as we are
> * responsible to reset wait_cnt regardless whether we've actually
> @@ -661,12 +672,6 @@ static bool __sbq_wake_up(struct sbitmap
> */
> smp_mb__before_atomic();
>
> - /*
> - * Increase wake_index before updating wait_cnt, otherwise concurrent
> - * callers can see valid wait_cnt in old waitqueue, which can cause
> - * invalid wakeup on the old waitqueue.
> - */
> - sbq_index_atomic_inc(&sbq->wake_index);
> atomic_set(&ws->wait_cnt, wake_batch);
>
> return ret || *nr;

2022-09-27 04:30:10

by Hugh Dickins

Subject: Re: [PATCH next] sbitmap: fix lockup while swapping

On Mon, 26 Sep 2022, Jan Kara wrote:
> On Fri 23-09-22 16:15:29, Hugh Dickins wrote:
>
> I don't think any magic with sbq_index_atomic_inc() is going to reliably
> fix this. After all the current waitqueue may be the only one that has active
> waiters so sbq_wake_ptr() will always end up returning this waitqueue
> regardless of the current value of sbq->wake_index.
>
> Honestly, this whole code needs a serious redesign.

I was pleased to see you say so, Jan: I do agree.

> I have some
> simplifications in mind but it will take some thinking and benchmarking

I'm definitely not the right person to take it on, and glad if you can.
But I did have some thoughts and experiments over the weekend, and would
like to throw a couple of suggestions into the pot.

One, not a big issue, but I think sbq_index_atomic_inc() is misconceived.
It's unhelpful for multiple racers to be adjusting sbq->wake_index, and
	wake_index = ws - sbq->ws;
	atomic_cmpxchg(&sbq->wake_index, wake_index, sbq_index_inc(wake_index));
seems to me a better way for __sbq_wake_up() to increment it.
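
(Spelled out as a helper, this is the minimal sketch I have in mind -
untested, and the function name is invented:)

	static void sbq_update_wake_index(struct sbitmap_queue *sbq,
					  struct sbq_wait_state *ws)
	{
		int old = ws - sbq->ws;	/* the ws we have just handled */

		/* lose the race quietly rather than have every racer bump it */
		atomic_cmpxchg(&sbq->wake_index, old, sbq_index_inc(old));
	}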

Two, and here the depths of my naivete and incomprehension may be on
display, but: I get the impression that __sbq_wake_up() is intended
to accumulate wake_batch-1 wakeups while doing nothing, then on the
wake_batch'th it hopes to do all those wake_batch wakeups. I assume
someone in the past has established that that's a safe way to proceed
here, though it's not obviously safe to me.

Now, those !waitqueue_active(&ws->wait) checks are good for catching
when the hoped-for procedure has gone so "wrong" that there's actually
nothing to be woken on this ws (so proceed to the next); but they give
no clue as to when there are some but not enough wakeups done.

It is very easy to add a wake_up_nr_return() to kernel/sched/wait.c,
which returns the nr_exclusive still not woken (__wake_up_common_lock()
merely has to return nr_exclusive itself); and then __sbq_wake_up() can
be recalled until wake_batch have been woken (or all queues empty).
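
Roughly what I mean on the caller side (only a sketch: wake_up_nr_return()
is the hypothetical variant described above, returning how many of the
requested exclusive wakeups were not performed):

	unsigned int left = wake_batch;
	int i;

	/* bounded at SBQ_WAIT_QUEUES passes just to keep this sketch finite */
	for (i = 0; i < SBQ_WAIT_QUEUES && left; i++) {
		struct sbq_wait_state *ws = sbq_wake_ptr(sbq);

		if (!ws)
			break;		/* every waitqueue is empty */
		left = wake_up_nr_return(&ws->wait, left);
	}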

I do have an experiment running that way: but my testing is much too
limited to draw serious conclusions from, and I've already admitted
that I may just be misunderstanding the whole thing. But, just maybe,
a wake_up_nr_return() might be useful. End of those suggestions.

> so
> we need some fix for the interim. I was pondering for quite some time about
> some band aid to the problem you've found but didn't find anything
> satisfactory.
>
> In the end I see two options:
>
> 1) Take your patch (as wrong as it is ;). Yes, it can lead to lost wakeups
> but we were living with those for a relatively long time so probably we can
> live with them for some longer.

In getting that experiment above going, I did have to make this change
below: and it looks to me now as better than my original patch - since
this one does try all SBQ_WAIT_QUEUES before giving up, whereas my first
patch immediately gave up on the waitqueue_active !wait_cnt case.

--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -587,7 +587,7 @@ static struct sbq_wait_state *sbq_wake_p
 	for (i = 0; i < SBQ_WAIT_QUEUES; i++) {
 		struct sbq_wait_state *ws = &sbq->ws[wake_index];
 
-		if (waitqueue_active(&ws->wait)) {
+		if (waitqueue_active(&ws->wait) && atomic_read(&ws->wait_cnt)) {
 			if (wake_index != atomic_read(&sbq->wake_index))
 				atomic_set(&sbq->wake_index, wake_index);
 			return ws;

TBH I have not tested this one outside of that experiment: would you
prefer this patch to my first one? If so, I'll test and sign it off and send.

>
> 2) Revert Yu Kuai's original fix 040b83fcecfb8 ("sbitmap: fix possible io
> hung due to lost wakeup") and my fixup 48c033314f37 ("sbitmap: Avoid leaving
> waitqueue in invalid state in __sbq_wake_up()"). But then Keith would have
> to redo his batched accounting patches on top.

I know much too little to help make that choice.

Hugh

2022-09-27 10:46:38

by Jan Kara

Subject: Re: [PATCH next] sbitmap: fix lockup while swapping

On Mon 26-09-22 20:39:03, Hugh Dickins wrote:
> On Mon, 26 Sep 2022, Jan Kara wrote:
> > On Fri 23-09-22 16:15:29, Hugh Dickins wrote:
> >
> > I don't think any magic with sbq_index_atomic_inc() is going to reliably
> > fix this. After all the current waitqueue may be the only one that has active
> > waiters so sbq_wake_ptr() will always end up returning this waitqueue
> > regardless of the current value of sbq->wake_index.
> >
> > Honestly, this whole code needs a serious redesign.
>
> I was pleased to see you say so, Jan: I do agree.
>
> > I have some
> > simplifications in mind but it will take some thinking and benchmarking
>
> I'm definitely not the right person to take it on, and glad if you can.
> But I did have some thoughts and experiments over the weekend, and would
> like to throw a couple of suggestions into the pot.
>
> One, not a big issue, but I think sbq_index_atomic_inc() is misconceived.
> It's unhelpful for multiple racers to be adjusting sbq->wake_index, and
> wake_index = ws - sbq->ws;
> atomic_cmpxchg(&sbq->wake_index, wake_index, sbq_index_inc(wake_index));
> seems to me a better way for __sbq_wake_up() to increment it.

So my thinking was that instead of having multiple counters, we'd have just
two - one counting completions and the other one counting wakeups - and if
completions - wakeups > batch, we search for waiters in the wait queues and
wake them up so that the 'wakeups' counter catches up. That also kind of
alleviates the 'wake_index' issue because racing updates to it will lead to
reordering of wakeups but not to lost wakeups, retries, or anything.
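
In code terms I imagine something like this (a very rough sketch, the names
are invented and I'm ignoring barriers and the resize path for now):

	/* two new fields in struct sbitmap_queue */
	atomic_t completion_cnt;	/* bits freed so far */
	atomic_t wakeup_cnt;		/* waiters actually woken so far */

	static void sbq_wake(struct sbitmap_queue *sbq)
	{
		unsigned int done, woken;

		atomic_inc(&sbq->completion_cnt);
		done = atomic_read(&sbq->completion_cnt);
		woken = atomic_read(&sbq->wakeup_cnt);

		if (done - woken >= READ_ONCE(sbq->wake_batch)) {
			/*
			 * Walk sbq->ws[], wake up to (done - woken) waiters,
			 * and advance wakeup_cnt by however many were woken.
			 */
		}
	}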

I also agree with your wake_up_nr_return() idea below; that is part of this
solution (reliably waking a given number of waiters) and in fact I already
coded that yesterday while thinking about the problem ;)

> Two, and here the depths of my naivete and incomprehension may be on
> display, but: I get the impression that __sbq_wake_up() is intended
> to accumulate wake_batch-1 wakeups while doing nothing, then on the
> wake_batch'th it hopes to do all those wake_batch wakeups.

Correct. That is the idea of this code as far as I understand it as well
(but bear in mind that I'm digging into this code for only about 50 days
longer than you ;).

> I assume someone in the past has established that that's a safe way to
> procede here, though it's not obviously safe to me.

Yes, it is not obvious and even not universally safe. wake_batch has to be
suitably tuned based on the number of available bits in the bitmap to avoid
deadlocks (freeing of a bit is what generates a wakeup). For numbers of
bits smaller than the number of waitqueues we have, we are using wake_batch
== 1, which is obviously safe. As the number of bits grows larger we can
set wake batch as wake_batch =~ bits/waitqueues. That makes sure all the
waitqueues will be woken up and because new waiters are added only when all
bits are used, this even makes sure all waiters will eventually wake up as
the bits get used / freed. I won't comment on fairness, that has apparently
not been a design consideration.
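
For reference, the existing tuning is essentially (paraphrasing
sbq_calc_wake_batch() in lib/sbitmap.c and ignoring its shallow-depth
adjustment):

	/*
	 * One waitqueue is woken per batch, so the full depth must be enough
	 * to wake all SBQ_WAIT_QUEUES of them; SBQ_WAKE_BATCH caps it at 8.
	 */
	wake_batch = clamp_t(unsigned int, depth / SBQ_WAIT_QUEUES,
			     1, SBQ_WAKE_BATCH);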

> Now, those !waitqueue_active(&ws->wait) checks are good for catching
> when the hoped-for procedure has gone so "wrong" that there's actually
> nothing to be woken on this ws (so proceed to the next); but they give
> no clue as to when there are some but not enough wakeups done.
>
> It is very easy to add a wake_up_nr_return() to kernel/sched/wait.c,
> which returns the nr_exclusive still not woken (__wake_up_common_lock()
> merely has to return nr_exclusive itself); and then __sbq_wake_up() can
> be recalled until wake_batch have been woken (or all queues empty).

Fully agreed about this. I'd just note that the waitqueue_active() checks
are enough for the above reasoning to guarantee waking all the wait queues
in principle. They just break down if multiple processes want to wakeup on
the same waitqueue because of TTCTTU races. And that was the last straw
that made me go with wake_up_nr_return() as you describe it.

> I do have an experiment running that way: but my testing is much too
> limited to draw serious conclusions from, and I've already admitted
> that I may just be misunderstanding the whole thing. But, just maybe,
> a wake_up_nr_return() might be useful. End of those suggestions.
>
> > so
> > we need some fix for the interim. I was pondering for quite some time about
> > some band aid to the problem you've found but didn't find anything
> > satisfactory.
> >
> > In the end I see two options:
> >
> > 1) Take your patch (as wrong as it is ;). Yes, it can lead to lost wakeups
> > but we were living with those for a relatively long time so probably we can
> > live with them for some longer.
>
> In getting that experiment above going, I did have to make this change
> below: and it looks to me now as better than my original patch - since
> this one does try all SBQ_WAIT_QUEUES before giving up, whereas my first
> patch immediately gave up on the waitqueue_active !wait_cnt case.
>
> --- a/lib/sbitmap.c
> +++ b/lib/sbitmap.c
> @@ -587,7 +587,7 @@ static struct sbq_wait_state *sbq_wake_p
> for (i = 0; i < SBQ_WAIT_QUEUES; i++) {
> struct sbq_wait_state *ws = &sbq->ws[wake_index];
>
> - if (waitqueue_active(&ws->wait)) {
> + if (waitqueue_active(&ws->wait) && atomic_read(&ws->wait_cnt)) {
> if (wake_index != atomic_read(&sbq->wake_index))
> atomic_set(&sbq->wake_index, wake_index);
> return ws;
>
> TBH I have not tested this one outside of that experiment: would you
> prefer this patch to my first one, I test and sign this off and send?

Yes, actually this is an elegant solution. It has the same inherent
raciness as your waitqueue_active() patch so wakeups could be lost even
though some waiters need them but that seems pretty unlikely. So yes, if
you can submit this, I guess this is a good band aid for the coming merge
window.

> > 2) Revert Yu Kuai's original fix 040b83fcecfb8 ("sbitmap: fix possible io
> > hung due to lost wakeup") and my fixup 48c033314f37 ("sbitmap: Avoid leaving
> > waitqueue in invalid state in __sbq_wake_up()"). But then Keith would have
> > to redo his batched accounting patches on top.
>
> I know much too little to help make that choice.

Yeah, I guess it is Jens' call in the end. I'm fine with both options.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-09-28 04:02:10

by Hugh Dickins

Subject: Re: [PATCH next] sbitmap: fix lockup while swapping

On Tue, 27 Sep 2022, Jan Kara wrote:
> On Mon 26-09-22 20:39:03, Hugh Dickins wrote:
>
> So my thinking was that instead of having multiple counters, we'd have just
> two - one counting completions and the other one counting wakeups and if
> completions - wakeups > batch, we search for waiters in the wait queues,
> wake them up so that 'wakeups' counter catches up. That also kind of
> alleviates the 'wake_index' issue because racing updates to it will lead to
> reordering of wakeups but not to lost wakeups, retries, or anything.
>
> I also agree with your wake_up_nr_return() idea below, that is part of this
> solution (reliably waking given number of waiters) and in fact I have
> already coded that yesterday while thinking about the problem ;)

Great - I'm pleasantly surprised to have been not so far off,
and we seem to be much in accord.

(What I called wake_up_nr_return() can perfectly well be wake_up_nr()
itself: I had just been temporarily avoiding a void to int change in
a header file, recompiling the world.)

Many thanks for your detailed elucidation of the batch safety,
in particular: I won't pretend to have absorbed it completely yet,
but it's there in your mail for me and all of us to refer back to.

> > TBH I have not tested this one outside of that experiment: would you
> > prefer this patch to my first one, I test and sign this off and send?
>
> Yes, actually this is an elegant solution. It has the same inherent
> raciness as your waitqueue_active() patch so wakeups could be lost even
> though some waiters need them but that seems pretty unlikely. So yes, if
> you can submit this, I guess this is a good band aid for the coming merge
> window.

No problem in the testing; the v2 patch follows now.

>
> > > 2) Revert Yu Kuai's original fix 040b83fcecfb8 ("sbitmap: fix possible io
> > > hung due to lost wakeup") and my fixup 48c033314f37 ("sbitmap: Avoid leaving
> > > waitqueue in invalid state in __sbq_wake_up()"). But then Keith would have
> > > to redo his batched accounting patches on top.
> >
> > I know much too little to help make that choice.
>
> Yeah, I guess it is Jens' call in the end. I'm fine with both options.
>
> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

2022-09-28 04:26:39

by Hugh Dickins

Subject: [PATCH next v2] sbitmap: fix lockup while swapping

Commit 4acb83417cad ("sbitmap: fix batched wait_cnt accounting")
is a big improvement: without it, I had to revert to before commit
040b83fcecfb ("sbitmap: fix possible io hung due to lost wakeup")
to avoid the high system time and freezes which that had introduced.

Now okay on the NVME laptop, but 4acb83417cad is a disaster for heavy
swapping (kernel builds in low memory) on another: soon locking up in
sbitmap_queue_wake_up() (into which __sbq_wake_up() is inlined), cycling
around with waitqueue_active() but wait_cnt 0. Here is a backtrace,
showing the common pattern of outer sbitmap_queue_wake_up() interrupted
before setting wait_cnt 0 back to wake_batch (in some cases other CPUs
are idle, in other cases they're spinning for a lock in dd_bio_merge()):

sbitmap_queue_wake_up < sbitmap_queue_clear < blk_mq_put_tag <
__blk_mq_free_request < blk_mq_free_request < __blk_mq_end_request <
scsi_end_request < scsi_io_completion < scsi_finish_command <
scsi_complete < blk_complete_reqs < blk_done_softirq < __do_softirq <
__irq_exit_rcu < irq_exit_rcu < common_interrupt < asm_common_interrupt <
_raw_spin_unlock_irqrestore < __wake_up_common_lock < __wake_up <
sbitmap_queue_wake_up < sbitmap_queue_clear < blk_mq_put_tag <
__blk_mq_free_request < blk_mq_free_request < dd_bio_merge <
blk_mq_sched_bio_merge < blk_mq_attempt_bio_merge < blk_mq_submit_bio <
__submit_bio < submit_bio_noacct_nocheck < submit_bio_noacct <
submit_bio < __swap_writepage < swap_writepage < pageout <
shrink_folio_list < evict_folios < lru_gen_shrink_lruvec <
shrink_lruvec < shrink_node < do_try_to_free_pages < try_to_free_pages <
__alloc_pages_slowpath < __alloc_pages < folio_alloc < vma_alloc_folio <
do_anonymous_page < __handle_mm_fault < handle_mm_fault <
do_user_addr_fault < exc_page_fault < asm_exc_page_fault

See how the process-context sbitmap_queue_wake_up() has been interrupted,
after bringing wait_cnt down to 0 (and in this example, after doing its
wakeups), before advancing wake_index and refilling wait_cnt: an
interrupt-context sbitmap_queue_wake_up() of the same sbq gets stuck.

I have almost no grasp of all the possible sbitmap races, and their
consequences: but __sbq_wake_up() can do nothing useful while wait_cnt 0,
so it is better if sbq_wake_ptr() skips on to the next ws in that case:
which fixes the lockup and shows no adverse consequence for me.

Signed-off-by: Hugh Dickins <[email protected]>
---
v2: - v1 to __sbq_wake_up() broke out when this happens, but
v2 to sbq_wake_ptr() does better by skipping on to the next.
- added more comment and deleted dubious Fixes attribution.
- and apologies to Mr Axboe and all for my axbod typo

lib/sbitmap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -587,7 +587,7 @@ static struct sbq_wait_state *sbq_wake_p
 	for (i = 0; i < SBQ_WAIT_QUEUES; i++) {
 		struct sbq_wait_state *ws = &sbq->ws[wake_index];
 
-		if (waitqueue_active(&ws->wait)) {
+		if (waitqueue_active(&ws->wait) && atomic_read(&ws->wait_cnt)) {
 			if (wake_index != atomic_read(&sbq->wake_index))
 				atomic_set(&sbq->wake_index, wake_index);
 			return ws;

2022-09-29 09:00:30

by Jan Kara

Subject: Re: [PATCH next v2] sbitmap: fix lockup while swapping

On Tue 27-09-22 21:07:46, Hugh Dickins wrote:
> Commit 4acb83417cad ("sbitmap: fix batched wait_cnt accounting")
> is a big improvement: without it, I had to revert to before commit
> 040b83fcecfb ("sbitmap: fix possible io hung due to lost wakeup")
> to avoid the high system time and freezes which that had introduced.
>
> Now okay on the NVME laptop, but 4acb83417cad is a disaster for heavy
> swapping (kernel builds in low memory) on another: soon locking up in
> sbitmap_queue_wake_up() (into which __sbq_wake_up() is inlined), cycling
> around with waitqueue_active() but wait_cnt 0 . Here is a backtrace,
> showing the common pattern of outer sbitmap_queue_wake_up() interrupted
> before setting wait_cnt 0 back to wake_batch (in some cases other CPUs
> are idle, in other cases they're spinning for a lock in dd_bio_merge()):
>
> sbitmap_queue_wake_up < sbitmap_queue_clear < blk_mq_put_tag <
> __blk_mq_free_request < blk_mq_free_request < __blk_mq_end_request <
> scsi_end_request < scsi_io_completion < scsi_finish_command <
> scsi_complete < blk_complete_reqs < blk_done_softirq < __do_softirq <
> __irq_exit_rcu < irq_exit_rcu < common_interrupt < asm_common_interrupt <
> _raw_spin_unlock_irqrestore < __wake_up_common_lock < __wake_up <
> sbitmap_queue_wake_up < sbitmap_queue_clear < blk_mq_put_tag <
> __blk_mq_free_request < blk_mq_free_request < dd_bio_merge <
> blk_mq_sched_bio_merge < blk_mq_attempt_bio_merge < blk_mq_submit_bio <
> __submit_bio < submit_bio_noacct_nocheck < submit_bio_noacct <
> submit_bio < __swap_writepage < swap_writepage < pageout <
> shrink_folio_list < evict_folios < lru_gen_shrink_lruvec <
> shrink_lruvec < shrink_node < do_try_to_free_pages < try_to_free_pages <
> __alloc_pages_slowpath < __alloc_pages < folio_alloc < vma_alloc_folio <
> do_anonymous_page < __handle_mm_fault < handle_mm_fault <
> do_user_addr_fault < exc_page_fault < asm_exc_page_fault
>
> See how the process-context sbitmap_queue_wake_up() has been interrupted,
> after bringing wait_cnt down to 0 (and in this example, after doing its
> wakeups), before advancing wake_index and refilling wake_cnt: an
> interrupt-context sbitmap_queue_wake_up() of the same sbq gets stuck.
>
> I have almost no grasp of all the possible sbitmap races, and their
> consequences: but __sbq_wake_up() can do nothing useful while wait_cnt 0,
> so it is better if sbq_wake_ptr() skips on to the next ws in that case:
> which fixes the lockup and shows no adverse consequence for me.
>
> Signed-off-by: Hugh Dickins <[email protected]>

Perhaps we could add a note here like: "The check for wait_cnt being 0 is
obviously racy, and ultimately can lead to lost wakeups, for example when
there is only a single waitqueue with waiters. However, in these cases lost
wakeups are unlikely to matter, and a proper fix requires redesign (and
benchmarking) of the batched wakeup code, so let's plug the hole with this
band aid for now."

Otherwise feel free to add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> v2: - v1 to __sbq_wake_up() broke out when this happens, but
> v2 to sbq_wake_ptr() does better by skipping on to the next.
> - added more comment and deleted dubious Fixes attribution.
> - and apologies to Mr Axboe and all for my axbod typo
>
> lib/sbitmap.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> --- a/lib/sbitmap.c
> +++ b/lib/sbitmap.c
> @@ -587,7 +587,7 @@ static struct sbq_wait_state *sbq_wake_p
> for (i = 0; i < SBQ_WAIT_QUEUES; i++) {
> struct sbq_wait_state *ws = &sbq->ws[wake_index];
>
> - if (waitqueue_active(&ws->wait)) {
> + if (waitqueue_active(&ws->wait) && atomic_read(&ws->wait_cnt)) {
> if (wake_index != atomic_read(&sbq->wake_index))
> atomic_set(&sbq->wake_index, wake_index);
> return ws;
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-09-29 20:18:52

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH next v3] sbitmap: fix lockup while swapping

Commit 4acb83417cad ("sbitmap: fix batched wait_cnt accounting")
is a big improvement: without it, I had to revert to before commit
040b83fcecfb ("sbitmap: fix possible io hung due to lost wakeup")
to avoid the high system time and freezes which that had introduced.

Now okay on the NVME laptop, but 4acb83417cad is a disaster for heavy
swapping (kernel builds in low memory) on another: soon locking up in
sbitmap_queue_wake_up() (into which __sbq_wake_up() is inlined), cycling
around with waitqueue_active() but wait_cnt 0. Here is a backtrace,
showing the common pattern of outer sbitmap_queue_wake_up() interrupted
before setting wait_cnt 0 back to wake_batch (in some cases other CPUs
are idle, in other cases they're spinning for a lock in dd_bio_merge()):

sbitmap_queue_wake_up < sbitmap_queue_clear < blk_mq_put_tag <
__blk_mq_free_request < blk_mq_free_request < __blk_mq_end_request <
scsi_end_request < scsi_io_completion < scsi_finish_command <
scsi_complete < blk_complete_reqs < blk_done_softirq < __do_softirq <
__irq_exit_rcu < irq_exit_rcu < common_interrupt < asm_common_interrupt <
_raw_spin_unlock_irqrestore < __wake_up_common_lock < __wake_up <
sbitmap_queue_wake_up < sbitmap_queue_clear < blk_mq_put_tag <
__blk_mq_free_request < blk_mq_free_request < dd_bio_merge <
blk_mq_sched_bio_merge < blk_mq_attempt_bio_merge < blk_mq_submit_bio <
__submit_bio < submit_bio_noacct_nocheck < submit_bio_noacct <
submit_bio < __swap_writepage < swap_writepage < pageout <
shrink_folio_list < evict_folios < lru_gen_shrink_lruvec <
shrink_lruvec < shrink_node < do_try_to_free_pages < try_to_free_pages <
__alloc_pages_slowpath < __alloc_pages < folio_alloc < vma_alloc_folio <
do_anonymous_page < __handle_mm_fault < handle_mm_fault <
do_user_addr_fault < exc_page_fault < asm_exc_page_fault

See how the process-context sbitmap_queue_wake_up() has been interrupted,
after bringing wait_cnt down to 0 (and in this example, after doing its
wakeups), before advancing wake_index and refilling wait_cnt: an
interrupt-context sbitmap_queue_wake_up() of the same sbq gets stuck.

I have almost no grasp of all the possible sbitmap races, and their
consequences: but __sbq_wake_up() can do nothing useful while wait_cnt 0,
so it is better if sbq_wake_ptr() skips on to the next ws in that case:
which fixes the lockup and shows no adverse consequence for me.

The check for wait_cnt being 0 is obviously racy, and ultimately can lead
to lost wakeups: for example, when there is only a single waitqueue with
waiters. However, lost wakeups are unlikely to matter in these cases,
and a proper fix requires redesign (and benchmarking) of the batched
wakeup code: so let's plug the hole with this bandaid for now.

Signed-off-by: Hugh Dickins <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
v2: - v1 to __sbq_wake_up() broke out when this happens, but
v2 to sbq_wake_ptr() does better by skipping on to the next.
- added more comment and deleted dubious Fixes attribution.
- and apologies to Mr Axboe and all for my axbod typo
v3: add paragraph of comment and Reviewed-by Jan Kara: thanks.

lib/sbitmap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -587,7 +587,7 @@ static struct sbq_wait_state *sbq_wake_p
for (i = 0; i < SBQ_WAIT_QUEUES; i++) {
struct sbq_wait_state *ws = &sbq->ws[wake_index];

- if (waitqueue_active(&ws->wait)) {
+ if (waitqueue_active(&ws->wait) && atomic_read(&ws->wait_cnt)) {
if (wake_index != atomic_read(&sbq->wake_index))
atomic_set(&sbq->wake_index, wake_index);
return ws;

2022-09-30 00:16:52

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH next v3] sbitmap: fix lockup while swapping

On Thu, 29 Sep 2022 12:50:12 -0700 (PDT), Hugh Dickins wrote:
> Commit 4acb83417cad ("sbitmap: fix batched wait_cnt accounting")
> is a big improvement: without it, I had to revert to before commit
> 040b83fcecfb ("sbitmap: fix possible io hung due to lost wakeup")
> to avoid the high system time and freezes which that had introduced.
>
> Now okay on the NVME laptop, but 4acb83417cad is a disaster for heavy
> swapping (kernel builds in low memory) on another: soon locking up in
> sbitmap_queue_wake_up() (into which __sbq_wake_up() is inlined), cycling
> around with waitqueue_active() but wait_cnt 0 . Here is a backtrace,
> showing the common pattern of outer sbitmap_queue_wake_up() interrupted
> before setting wait_cnt 0 back to wake_batch (in some cases other CPUs
> are idle, in other cases they're spinning for a lock in dd_bio_merge()):
>
> [...]

Applied, thanks!

[1/1] sbitmap: fix lockup while swapping
commit: 30514bd2dd4e86a3ecfd6a93a3eadf7b9ea164a0

Best regards,
--
Jens Axboe