2022-02-22 05:50:31

by CGEL

Subject: [PATCH linux-next] mm: swap: get rid of deadloop in swapin readahead

From: Guo Ziliang <[email protected]>

In our testing, a deadloop task was found. Sysrq stack dumps showed the
same stack every time, as follows:
__swap_duplicate+0x58/0x1a0
swapcache_prepare+0x24/0x30
__read_swap_cache_async+0xac/0x220
read_swap_cache_async+0x58/0xa0
swapin_readahead+0x24c/0x628
do_swap_page+0x374/0x8a0
__handle_mm_fault+0x598/0xd60
handle_mm_fault+0x114/0x200
do_page_fault+0x148/0x4d0
do_translation_fault+0xb0/0xd4
do_mem_abort+0x50/0xb0

The reason for the deadloop is that swapcache_prepare() always returns
-EEXIST, indicating that SWAP_HAS_CACHE has not been cleared, so the
task can never break out of the loop. We suspect that the task that
clears the SWAP_HAS_CACHE flag never gets a chance to run. As a test,
we lowered the priority of the task stuck in the deadloop so that the
task that clears the SWAP_HAS_CACHE flag could run; the system then
returned to normal.

In our testing, multiple real-time tasks are bound to the same core,
and the task in the deadloop is the highest priority task of the
core, so the deadloop task cannot be preempted.

Although cond_resched() is called in __read_swap_cache_async(), it is
an empty function on a preemptive kernel and cannot release the CPU.
A high-priority task cannot release the CPU unless preempted by an
even higher-priority task, and when this task is already the
highest-priority task on its core, no other task can be scheduled.
So we replace cond_resched() with schedule_timeout_uninterruptible(1),
which calls set_current_state() first to set the task state. The task
is thereby removed from the run queue, giving up the CPU and
preventing the task from running in kernel mode for too long.

Reported-by: Zeal Robot <[email protected]>
Reviewed-by: Ran Xiaokai <[email protected]>
Reviewed-by: Jiang Xuexin <[email protected]>
Reviewed-by: Yang Yang <[email protected]>
Signed-off-by: Guo Ziliang <[email protected]>
---
mm/swap_state.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 8d4104242100..ee67164531c0 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -478,7 +478,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
* __read_swap_cache_async(), which has set SWAP_HAS_CACHE
* in swap_map, but not yet added its page to swap cache.
*/
- cond_resched();
+ schedule_timeout_uninterruptible(1);
}

/*
--
2.15.2



2022-02-26 02:44:14

by Andrew Morton

Subject: Re: [PATCH linux-next] mm: swap: get rid of deadloop in swapin readahead

On Mon, 21 Feb 2022 11:17:49 +0000 [email protected] wrote:

> From: Guo Ziliang <[email protected]>
>
> In our testing, a deadloop task was found. Through sysrq printing, same
> stack was found every time, as follows:
> __swap_duplicate+0x58/0x1a0
> swapcache_prepare+0x24/0x30
> __read_swap_cache_async+0xac/0x220
> read_swap_cache_async+0x58/0xa0
> swapin_readahead+0x24c/0x628
> do_swap_page+0x374/0x8a0
> __handle_mm_fault+0x598/0xd60
> handle_mm_fault+0x114/0x200
> do_page_fault+0x148/0x4d0
> do_translation_fault+0xb0/0xd4
> do_mem_abort+0x50/0xb0
>
> The reason for the deadloop is that swapcache_prepare() always returns
> EEXIST, indicating that SWAP_HAS_CACHE has not been cleared, so that
> it cannot jump out of the loop. We suspect that the task that clears
> the SWAP_HAS_CACHE flag never gets a chance to run. We try to lower
> the priority of the task stuck in a deadloop so that the task that
> clears the SWAP_HAS_CACHE flag will run. The results show that the
> system returns to normal after the priority is lowered.
>
> In our testing, multiple real-time tasks are bound to the same core,
> and the task in the deadloop is the highest priority task of the
> core, so the deadloop task cannot be preempted.
>
> Although cond_resched() is used by __read_swap_cache_async, it is an
> empty function in the preemptive system and cannot achieve the purpose
> of releasing the CPU. A high-priority task cannot release the CPU
> unless preempted by a higher-priority task. But when this task
> is already the highest priority task on this core, other tasks
> will not be able to be scheduled. So we think we should replace
> cond_resched() with schedule_timeout_uninterruptible(1),
> schedule_timeout_interruptible will call set_current_state
> first to set the task state, so the task will be removed
> from the running queue, so as to achieve the purpose of
> giving up the CPU and prevent it from running in kernel
> mode for too long.
>
> ...
>
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -478,7 +478,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
> * in swap_map, but not yet added its page to swap cache.
> */
> - cond_resched();
> + schedule_timeout_uninterruptible(1);
> }
>
> /*

Sigh. I guess yes, we should do this, at least in a short-term,
backportable-to-stable way.

But busy-waiting while hoping that someone else will save us isn't an
attractive design. Hugh, have you ever thought about something more
deterministic in there?

Thanks.

2022-02-28 12:08:34

by Michal Hocko

Subject: Re: [PATCH linux-next] mm: swap: get rid of deadloop in swapin readahead

On Mon 21-02-22 11:17:49, [email protected] wrote:
> From: Guo Ziliang <[email protected]>
>
> In our testing, a deadloop task was found. Through sysrq printing, same
> stack was found every time, as follows:
> __swap_duplicate+0x58/0x1a0
> swapcache_prepare+0x24/0x30
> __read_swap_cache_async+0xac/0x220
> read_swap_cache_async+0x58/0xa0
> swapin_readahead+0x24c/0x628
> do_swap_page+0x374/0x8a0
> __handle_mm_fault+0x598/0xd60
> handle_mm_fault+0x114/0x200
> do_page_fault+0x148/0x4d0
> do_translation_fault+0xb0/0xd4
> do_mem_abort+0x50/0xb0
>
> The reason for the deadloop is that swapcache_prepare() always returns
> EEXIST, indicating that SWAP_HAS_CACHE has not been cleared, so that
> it cannot jump out of the loop. We suspect that the task that clears
> the SWAP_HAS_CACHE flag never gets a chance to run. We try to lower
> the priority of the task stuck in a deadloop so that the task that
> clears the SWAP_HAS_CACHE flag will run. The results show that the
> system returns to normal after the priority is lowered.
>
> In our testing, multiple real-time tasks are bound to the same core,
> and the task in the deadloop is the highest priority task of the
> core, so the deadloop task cannot be preempted.
>
> Although cond_resched() is used by __read_swap_cache_async, it is an
> empty function in the preemptive system and cannot achieve the purpose
> of releasing the CPU. A high-priority task cannot release the CPU
> unless preempted by a higher-priority task. But when this task
> is already the highest priority task on this core, other tasks
> will not be able to be scheduled. So we think we should replace
> cond_resched() with schedule_timeout_uninterruptible(1),
> schedule_timeout_interruptible will call set_current_state
> first to set the task state, so the task will be removed
> from the running queue, so as to achieve the purpose of
> giving up the CPU and prevent it from running in kernel
> mode for too long.

I am sorry, but I really do not see how this case is any different
from any other kernel code path being hogged by an RT task. We surely
shouldn't put sleeps into all the random paths which are doing
cond_resched at the moment.

> Reported-by: Zeal Robot <[email protected]>
> Reviewed-by: Ran Xiaokai <[email protected]>
> Reviewed-by: Jiang Xuexin <[email protected]>
> Reviewed-by: Yang Yang <[email protected]>
> Signed-off-by: Guo Ziliang <[email protected]>
> ---
> mm/swap_state.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 8d4104242100..ee67164531c0 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -478,7 +478,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
> * in swap_map, but not yet added its page to swap cache.
> */
> - cond_resched();
> + schedule_timeout_uninterruptible(1);
> }
>
> /*
> --
> 2.15.2
>

--
Michal Hocko
SUSE Labs

2022-02-28 16:24:34

by Andrew Morton

Subject: Re: [PATCH linux-next] mm: swap: get rid of deadloop in swapin readahead

On Mon, 28 Feb 2022 08:57:49 +0100 Michal Hocko <[email protected]> wrote:

> On Mon 21-02-22 11:17:49, [email protected] wrote:
> > From: Guo Ziliang <[email protected]>
> >
> > In our testing, a deadloop task was found. Through sysrq printing, same
> > stack was found every time, as follows:
> > __swap_duplicate+0x58/0x1a0
> > swapcache_prepare+0x24/0x30
> > __read_swap_cache_async+0xac/0x220
> > read_swap_cache_async+0x58/0xa0
> > swapin_readahead+0x24c/0x628
> > do_swap_page+0x374/0x8a0
> > __handle_mm_fault+0x598/0xd60
> > handle_mm_fault+0x114/0x200
> > do_page_fault+0x148/0x4d0
> > do_translation_fault+0xb0/0xd4
> > do_mem_abort+0x50/0xb0
> >
> > The reason for the deadloop is that swapcache_prepare() always returns
> > EEXIST, indicating that SWAP_HAS_CACHE has not been cleared, so that
> > it cannot jump out of the loop. We suspect that the task that clears
> > the SWAP_HAS_CACHE flag never gets a chance to run. We try to lower
> > the priority of the task stuck in a deadloop so that the task that
> > clears the SWAP_HAS_CACHE flag will run. The results show that the
> > system returns to normal after the priority is lowered.
> >
> > In our testing, multiple real-time tasks are bound to the same core,
> > and the task in the deadloop is the highest priority task of the
> > core, so the deadloop task cannot be preempted.
> >
> > Although cond_resched() is used by __read_swap_cache_async, it is an
> > empty function in the preemptive system and cannot achieve the purpose
> > of releasing the CPU. A high-priority task cannot release the CPU
> > unless preempted by a higher-priority task. But when this task
> > is already the highest priority task on this core, other tasks
> > will not be able to be scheduled. So we think we should replace
> > cond_resched() with schedule_timeout_uninterruptible(1),
> > schedule_timeout_interruptible will call set_current_state
> > first to set the task state, so the task will be removed
> > from the running queue, so as to achieve the purpose of
> > giving up the CPU and prevent it from running in kernel
> > mode for too long.
>
> I am sorry but I really do not see how this case is any different from
> any other kernel code path being hogged by a RT task. We surely
> shouldn't put sleeps into all random paths which are doing cond_resched
> at the moment.

But this cond_resched() is different from most. This one is attempting
to yield the CPU so this task can make progress. And cond_resched()
simply isn't an appropriate way of doing this because under this fairly
common situation, it's a no-op.

2022-03-01 05:44:43

by Hugh Dickins

Subject: Re: [PATCH linux-next] mm: swap: get rid of deadloop in swapin readahead

On Fri, 25 Feb 2022, Andrew Morton wrote:
> On Mon, 21 Feb 2022 11:17:49 +0000 [email protected] wrote:
> > From: Guo Ziliang <[email protected]>
> >
> > In our testing, a deadloop task was found. Through sysrq printing, same
> > stack was found every time, as follows:
> > __swap_duplicate+0x58/0x1a0
> > swapcache_prepare+0x24/0x30
> > __read_swap_cache_async+0xac/0x220
> > read_swap_cache_async+0x58/0xa0
> > swapin_readahead+0x24c/0x628
> > do_swap_page+0x374/0x8a0
> > __handle_mm_fault+0x598/0xd60
> > handle_mm_fault+0x114/0x200
> > do_page_fault+0x148/0x4d0
> > do_translation_fault+0xb0/0xd4
> > do_mem_abort+0x50/0xb0
> >
> > The reason for the deadloop is that swapcache_prepare() always returns
> > EEXIST, indicating that SWAP_HAS_CACHE has not been cleared, so that
> > it cannot jump out of the loop. We suspect that the task that clears
> > the SWAP_HAS_CACHE flag never gets a chance to run. We try to lower
> > the priority of the task stuck in a deadloop so that the task that
> > clears the SWAP_HAS_CACHE flag will run. The results show that the
> > system returns to normal after the priority is lowered.
> >
> > In our testing, multiple real-time tasks are bound to the same core,
> > and the task in the deadloop is the highest priority task of the
> > core, so the deadloop task cannot be preempted.
> >
> > Although cond_resched() is used by __read_swap_cache_async, it is an
> > empty function in the preemptive system and cannot achieve the purpose
> > of releasing the CPU. A high-priority task cannot release the CPU
> > unless preempted by a higher-priority task. But when this task
> > is already the highest priority task on this core, other tasks
> > will not be able to be scheduled. So we think we should replace
> > cond_resched() with schedule_timeout_uninterruptible(1),
> > schedule_timeout_interruptible will call set_current_state
> > first to set the task state, so the task will be removed
> > from the running queue, so as to achieve the purpose of
> > giving up the CPU and prevent it from running in kernel
> > mode for too long.
> >
> > ...
> >
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -478,7 +478,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> > * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
> > * in swap_map, but not yet added its page to swap cache.
> > */
> > - cond_resched();
> > + schedule_timeout_uninterruptible(1);
> > }
> >
> > /*
>
> Sigh. I guess yes, we should do this, at least in a short-term,
> backportable-to-stable way.
>
> But busy-waiting while hoping that someone else will save us isn't an
> attractive design. Hugh, have you ever thought about something more
> deterministic in there?

Not something more deterministic, no: I think that would entail
heavier locking, perhaps slowing down hotter paths, just to avoid
this swap oddity.

This loop was written long before there was a preemptive kernel:
it was appropriate then, and almost never needed more than one retry
to complete; but preemption changed the story without us realizing.

Sigh here too. I commend the thread on it from July 2018:
https://lore.kernel.org/linux-mm/[email protected]/

There the 4.9-stable user proposed preempt_disable(), I agreed but
found the patch provided insufficient, and offered another 4.9 patch
further down the thread. Your preference at the time was msleep(1).

I was working on a similar patch for 4.18, but have not completed it
yet ;) and don't remember how satisfied or not I was with that one;
and wonder if I'm any more likely to get it finished by 2026. It's
clear that I put much more thought into it back then than just now.

Maybe someone else would have a go: my 4.9 patch in that thread
shows most of it, but might need a lot of work to update to 5.17.

And it also gathered some temporary debug stats on how often this
happens: I'm not conscious of using RT at all, but was disturbed to see
how long an ordinary preemptive kernel was sometimes spinning there.
So I think I agree with you more than Michal on that: RT just makes
the bad behaviour more obvious.

Hugh

2022-03-02 08:52:10

by Andrew Morton

Subject: Re: [PATCH linux-next] mm: swap: get rid of deadloop in swapin readahead

On Mon, 28 Feb 2022 20:07:33 -0800 (PST) Hugh Dickins <[email protected]> wrote:

> > > --- a/mm/swap_state.c
> > > +++ b/mm/swap_state.c
> > > @@ -478,7 +478,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> > > * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
> > > * in swap_map, but not yet added its page to swap cache.
> > > */
> > > - cond_resched();
> > > + schedule_timeout_uninterruptible(1);
> > > }
> > >
> > > /*
> >
> > Sigh. I guess yes, we should do this, at least in a short-term,
> > backportable-to-stable way.
> >
> > But busy-waiting while hoping that someone else will save us isn't an
> > attractive design. Hugh, have you ever thought about something more
> > deterministic in there?
>
> Not something more deterministic, no: I think that would entail
> heavier locking, perhaps slowing down hotter paths, just to avoid
> this swap oddity.
>
> This loop was written long before there was a preemptive kernel:
> it was appropriate then, and almost never needed more than one retry
> to complete; but preemption changed the story without us realizing.
>
> Sigh here too. I commend the thread on it from July 2018:
> https://lore.kernel.org/linux-mm/[email protected]/
>
> There the 4.9-stable user proposed preempt_disable(), I agreed but
> found the patch provided insufficient, and offered another 4.9 patch
> further down the thread. Your preference at the time was msleep(1).
>
> I was working on a similar patch for 4.18, but have not completed it
> yet ;) and don't remember how satisfied or not I was with that one;
> and wonder if I'm any more likely to get it finished by 2026. It's
> clear that I put much more thought into it back then than just now.
>
> Maybe someone else would have a go: my 4.9 patch in that thread
> shows most of it, but might need a lot of work to update to 5.17.
>
> And it also gathered some temporary debug stats on how often this
> happens: I'm not conscious of using RT at all, but was disturbed to see
> how long an ordinary preemptive kernel was sometimes spinning there.
> So I think I agree with you more than Michal on that: RT just makes
> the bad behaviour more obvious.

Thanks as always.

Using msleep() seems pretty pointless, so I plan to go ahead with the
patch as-is, with a cc:stable. None of it is pretty, but it's better
than what we have now, yes?

2022-03-02 11:15:44

by Michal Hocko

Subject: Re: [PATCH linux-next] mm: swap: get rid of deadloop in swapin readahead

On Mon 28-02-22 07:33:15, Andrew Morton wrote:
> On Mon, 28 Feb 2022 08:57:49 +0100 Michal Hocko <[email protected]> wrote:
>
> > On Mon 21-02-22 11:17:49, [email protected] wrote:
> > > From: Guo Ziliang <[email protected]>
> > >
> > > In our testing, a deadloop task was found. Through sysrq printing, same
> > > stack was found every time, as follows:
> > > __swap_duplicate+0x58/0x1a0
> > > swapcache_prepare+0x24/0x30
> > > __read_swap_cache_async+0xac/0x220
> > > read_swap_cache_async+0x58/0xa0
> > > swapin_readahead+0x24c/0x628
> > > do_swap_page+0x374/0x8a0
> > > __handle_mm_fault+0x598/0xd60
> > > handle_mm_fault+0x114/0x200
> > > do_page_fault+0x148/0x4d0
> > > do_translation_fault+0xb0/0xd4
> > > do_mem_abort+0x50/0xb0
> > >
> > > The reason for the deadloop is that swapcache_prepare() always returns
> > > EEXIST, indicating that SWAP_HAS_CACHE has not been cleared, so that
> > > it cannot jump out of the loop. We suspect that the task that clears
> > > the SWAP_HAS_CACHE flag never gets a chance to run. We try to lower
> > > the priority of the task stuck in a deadloop so that the task that
> > > clears the SWAP_HAS_CACHE flag will run. The results show that the
> > > system returns to normal after the priority is lowered.
> > >
> > > In our testing, multiple real-time tasks are bound to the same core,
> > > and the task in the deadloop is the highest priority task of the
> > > core, so the deadloop task cannot be preempted.
> > >
> > > Although cond_resched() is used by __read_swap_cache_async, it is an
> > > empty function in the preemptive system and cannot achieve the purpose
> > > of releasing the CPU. A high-priority task cannot release the CPU
> > > unless preempted by a higher-priority task. But when this task
> > > is already the highest priority task on this core, other tasks
> > > will not be able to be scheduled. So we think we should replace
> > > cond_resched() with schedule_timeout_uninterruptible(1),
> > > schedule_timeout_interruptible will call set_current_state
> > > first to set the task state, so the task will be removed
> > > from the running queue, so as to achieve the purpose of
> > > giving up the CPU and prevent it from running in kernel
> > > mode for too long.
> >
> > I am sorry but I really do not see how this case is any different from
> > any other kernel code path being hogged by a RT task. We surely
> > shouldn't put sleeps into all random paths which are doing cond_resched
> > at the moment.
>
> But this cond_resched() is different from most. This one is attempting
> to yield the CPU so this task can make progress. And cond_resched()
> simply isn't an appropriate way of doing this because under this fairly
> common situation, it's a no-op.

I might be really missing something, but I really do not see how this
is any different from the page allocator path, which also only does
cond_resched (well, except for throttling, but that might just not
trigger). Or other paths which just do cond_resched while waiting for
progress somewhere else.

Not that I like this situation, but a !PREEMPT kernel with RT priority
tasks is rather limited and full of potential problems IMHO.
--
Michal Hocko
SUSE Labs

2022-03-02 20:32:32

by Hugh Dickins

Subject: Re: [PATCH linux-next] mm: swap: get rid of deadloop in swapin readahead

On Tue, 1 Mar 2022, Andrew Morton wrote:
>
> Using msleep() seems pretty pointless so I plan to go ahead with patch
> as-is, with a cc:stable. None of it is pretty, but it's better than
> what we have now, yes?

Yes, I've no objection to that.

Hugh

2022-03-03 00:26:33

by Hugh Dickins

Subject: Re: [PATCH linux-next] mm: swap: get rid of deadloop in swapin readahead

On Wed, 2 Mar 2022, Michal Hocko wrote:
>
> I might be really missing something but I really do not see how is this
> any different from the page allocator path which only does cond_resched
> as well (well, except for throttling but that might just not trigger).
> Or other paths which just do cond_resched while waiting for a progress
> somewhere else.
>
> Not that I like this situation but !PREEMPT kernel with RT priority
> tasks is rather limited and full of potential priblems IMHO.

As I said in previous mail, I have really not given this as much
thought this time as I did in the 2018 mail thread linked there;
but have seen that it behaves more badly than I had imagined, in
any preemptive kernel - no need for RT. We just don't have the
stats to show when this code here spins waiting on code elsewhere
that is sleeping. I think the difference from most cond_resched()
places is that swapin is trying to collect together several factors
with minimal locking, and we should have added preempt_disable()s
when preemption was invented. But it's only swap so we didn't notice.

Hugh