Recently kswapd has been modified to give up after MAX_RECLAIM_RETRIES
number of unsucessful iterations. Before going to sleep, kswapd thread
will unconditionally wakeup all threads sleeping on pfmemalloc_wait.
However the awoken threads will recheck the watermarks and wake the
kswapd thread and sleep again on pfmemalloc_wait. There is a chance
of continuous back and forth between kswapd and direct reclaiming
threads if the kswapd keep failing and thus defeat the purpose of
adding backoff mechanism to kswapd. So, add kswapd_failures check
on the throttle_direct_reclaim condition.
Signed-off-by: Shakeel Butt <[email protected]>
---
mm/vmscan.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bae698484e8e..b2d24cc7a161 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2819,6 +2819,12 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
return wmark_ok;
}
+static bool should_throttle_direct_reclaim(pg_data_t *pgdat)
+{
+ return (pgdat->kswapd_failures < MAX_RECLAIM_RETRIES &&
+ !pfmemalloc_watermark_ok(pgdat));
+}
+
/*
* Throttle direct reclaimers if backing storage is backed by the network
* and the PFMEMALLOC reserve for the preferred node is getting dangerously
@@ -2873,7 +2879,7 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
/* Throttle based on the first usable node */
pgdat = zone->zone_pgdat;
- if (pfmemalloc_watermark_ok(pgdat))
+ if (!should_throttle_direct_reclaim(pgdat))
goto out;
break;
}
@@ -2895,14 +2901,14 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
*/
if (!(gfp_mask & __GFP_FS)) {
wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
- pfmemalloc_watermark_ok(pgdat), HZ);
+ !should_throttle_direct_reclaim(pgdat), HZ);
goto check_pending;
}
/* Throttle until kswapd wakes the process */
wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
- pfmemalloc_watermark_ok(pgdat));
+ !should_throttle_direct_reclaim(pgdat));
check_pending:
if (fatal_signal_pending(current))
--
2.12.0.246.ga2ecc84866-goog
On Fri 10-03-17 11:46:20, Shakeel Butt wrote:
> Recently kswapd has been modified to give up after MAX_RECLAIM_RETRIES
> number of unsucessful iterations. Before going to sleep, kswapd thread
> will unconditionally wakeup all threads sleeping on pfmemalloc_wait.
> However the awoken threads will recheck the watermarks and wake the
> kswapd thread and sleep again on pfmemalloc_wait. There is a chance
> of continuous back and forth between kswapd and direct reclaiming
> threads if the kswapd keep failing and thus defeat the purpose of
> adding backoff mechanism to kswapd. So, add kswapd_failures check
> on the throttle_direct_reclaim condition.
I have to say I really do not like this. kswapd_failures shouldn't
really be checked outside of the kswapd context. The
pfmemalloc_watermark_ok/throttle_direct_reclaim is quite complex even
without putting another variable into it. I wish we rather replace this
throttling by something else. Johannes had an idea to throttle by the
number of reclaimers.
Anyway, I am wondering whether we can hit this issue in
practice? Have you seen it happening or is this a result of the code
review? I would assume that that !zone_reclaimable_pages check in
pfmemalloc_watermark_ok should help to some degree.
> Signed-off-by: Shakeel Butt <[email protected]>
> ---
> mm/vmscan.c | 12 +++++++++---
> 1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index bae698484e8e..b2d24cc7a161 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2819,6 +2819,12 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
> return wmark_ok;
> }
>
> +static bool should_throttle_direct_reclaim(pg_data_t *pgdat)
> +{
> + return (pgdat->kswapd_failures < MAX_RECLAIM_RETRIES &&
> + !pfmemalloc_watermark_ok(pgdat));
> +}
> +
> /*
> * Throttle direct reclaimers if backing storage is backed by the network
> * and the PFMEMALLOC reserve for the preferred node is getting dangerously
> @@ -2873,7 +2879,7 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
>
> /* Throttle based on the first usable node */
> pgdat = zone->zone_pgdat;
> - if (pfmemalloc_watermark_ok(pgdat))
> + if (!should_throttle_direct_reclaim(pgdat))
> goto out;
> break;
> }
> @@ -2895,14 +2901,14 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
> */
> if (!(gfp_mask & __GFP_FS)) {
> wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
> - pfmemalloc_watermark_ok(pgdat), HZ);
> + !should_throttle_direct_reclaim(pgdat), HZ);
>
> goto check_pending;
> }
>
> /* Throttle until kswapd wakes the process */
> wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
> - pfmemalloc_watermark_ok(pgdat));
> + !should_throttle_direct_reclaim(pgdat));
>
> check_pending:
> if (fatal_signal_pending(current))
> --
> 2.12.0.246.ga2ecc84866-goog
>
--
Michal Hocko
SUSE Labs
On Mon, Mar 13, 2017 at 2:02 AM, Michal Hocko <[email protected]> wrote:
> On Fri 10-03-17 11:46:20, Shakeel Butt wrote:
>> Recently kswapd has been modified to give up after MAX_RECLAIM_RETRIES
>> number of unsucessful iterations. Before going to sleep, kswapd thread
>> will unconditionally wakeup all threads sleeping on pfmemalloc_wait.
>> However the awoken threads will recheck the watermarks and wake the
>> kswapd thread and sleep again on pfmemalloc_wait. There is a chance
>> of continuous back and forth between kswapd and direct reclaiming
>> threads if the kswapd keep failing and thus defeat the purpose of
>> adding backoff mechanism to kswapd. So, add kswapd_failures check
>> on the throttle_direct_reclaim condition.
>
> I have to say I really do not like this. kswapd_failures shouldn't
> really be checked outside of the kswapd context. The
> pfmemalloc_watermark_ok/throttle_direct_reclaim is quite complex even
> without putting another variable into it. I wish we rather replace this
> throttling by something else. Johannes had an idea to throttle by the
> number of reclaimers.
>
Do you suspect race in accessing kswapd_failures in non-kswapd
context? Please do let me know more about replacing this throttling.
> Anyway, I am wondering whether we can hit this issue in
> practice? Have you seen it happening or is this a result of the code
> review? I would assume that that !zone_reclaimable_pages check in
> pfmemalloc_watermark_ok should help to some degree.
>
Yes, I have seen this issue going on for more than one hour on my
test. It was a simple test where the number of processes, in the
presence of swap, try to allocate memory more than RAM. The number of
processes are equal to the number of cores and are pinned to each
individual core. I am suspecting that !zone_reclaimable_pages() check
did not help.
>> Signed-off-by: Shakeel Butt <[email protected]>
>> ---
>> mm/vmscan.c | 12 +++++++++---
>> 1 file changed, 9 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index bae698484e8e..b2d24cc7a161 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -2819,6 +2819,12 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
>> return wmark_ok;
>> }
>>
>> +static bool should_throttle_direct_reclaim(pg_data_t *pgdat)
>> +{
>> + return (pgdat->kswapd_failures < MAX_RECLAIM_RETRIES &&
>> + !pfmemalloc_watermark_ok(pgdat));
>> +}
>> +
>> /*
>> * Throttle direct reclaimers if backing storage is backed by the network
>> * and the PFMEMALLOC reserve for the preferred node is getting dangerously
>> @@ -2873,7 +2879,7 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
>>
>> /* Throttle based on the first usable node */
>> pgdat = zone->zone_pgdat;
>> - if (pfmemalloc_watermark_ok(pgdat))
>> + if (!should_throttle_direct_reclaim(pgdat))
>> goto out;
>> break;
>> }
>> @@ -2895,14 +2901,14 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
>> */
>> if (!(gfp_mask & __GFP_FS)) {
>> wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
>> - pfmemalloc_watermark_ok(pgdat), HZ);
>> + !should_throttle_direct_reclaim(pgdat), HZ);
>>
>> goto check_pending;
>> }
>>
>> /* Throttle until kswapd wakes the process */
>> wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
>> - pfmemalloc_watermark_ok(pgdat));
>> + !should_throttle_direct_reclaim(pgdat));
>>
>> check_pending:
>> if (fatal_signal_pending(current))
>> --
>> 2.12.0.246.ga2ecc84866-goog
>>
>
> --
> Michal Hocko
> SUSE Labs
On Mon 13-03-17 08:07:15, Shakeel Butt wrote:
> On Mon, Mar 13, 2017 at 2:02 AM, Michal Hocko <[email protected]> wrote:
> > On Fri 10-03-17 11:46:20, Shakeel Butt wrote:
> >> Recently kswapd has been modified to give up after MAX_RECLAIM_RETRIES
> >> number of unsucessful iterations. Before going to sleep, kswapd thread
> >> will unconditionally wakeup all threads sleeping on pfmemalloc_wait.
> >> However the awoken threads will recheck the watermarks and wake the
> >> kswapd thread and sleep again on pfmemalloc_wait. There is a chance
> >> of continuous back and forth between kswapd and direct reclaiming
> >> threads if the kswapd keep failing and thus defeat the purpose of
> >> adding backoff mechanism to kswapd. So, add kswapd_failures check
> >> on the throttle_direct_reclaim condition.
> >
> > I have to say I really do not like this. kswapd_failures shouldn't
> > really be checked outside of the kswapd context. The
> > pfmemalloc_watermark_ok/throttle_direct_reclaim is quite complex even
> > without putting another variable into it. I wish we rather replace this
> > throttling by something else. Johannes had an idea to throttle by the
> > number of reclaimers.
> >
>
> Do you suspect race in accessing kswapd_failures in non-kswapd
> context?
No, this is not about race conditions. It is more about the logic of the
code. kswapd_failures is the private thing to the kswapd daemon. Direct
reclaimers shouldn't have any business in it - well except resetting it.
> Please do let me know more about replacing this throttling.
The idea behind a different throttling would be to not allow too many
direct reclaimers on the same set of nodes/zones. Johannes would tell
you more.
> > Anyway, I am wondering whether we can hit this issue in
> > practice? Have you seen it happening or is this a result of the code
> > review? I would assume that that !zone_reclaimable_pages check in
> > pfmemalloc_watermark_ok should help to some degree.
> >
> Yes, I have seen this issue going on for more than one hour on my
> test. It was a simple test where the number of processes, in the
> presence of swap, try to allocate memory more than RAM.
this is an anonymous memory, right?
> The number of
> processes are equal to the number of cores and are pinned to each
> individual core. I am suspecting that !zone_reclaimable_pages() check
> did not help.
Hmm, interesting! I would expect the OOM killer triggering but I guess
I see what is going on. kswapd couldn't reclaim a single page and ran
out of its kswapd_failures attempts while no direct reclaimers could
reclaim a single page either until we reached the throttling point when
we are basically livelocked because neither kswapd nor _all_ direct
reclaimers can make a forward progress. Although this sounds quite
unlikely I think it is quite possible to happen. So we cannot really
throttle _all_ direct reclaimers when the kswapd is out of game which I
haven't fully realized when reviewing "mm: fix 100% CPU kswapd busyloop
on unreclaimable nodes".
The simplest thing to do would be something like you have proposed and
do not throttle if kswapd is out of game.
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bae698484e8e..d34b1afc781a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2791,6 +2791,9 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
int i;
bool wmark_ok;
+ if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
+ return true;
+
for (i = 0; i <= ZONE_NORMAL; i++) {
zone = &pgdat->node_zones[i];
if (!managed_zone(zone))
I do not like this as I've already said but it would allow to merge
"mm: fix 100% CPU kswapd busyloop on unreclaimable nodes" without too
many additional changes.
Another option would be to cap the waiting time same as we do for
GFP_NOFS. Not ideal either because I suspect we would just get herds
of direct reclaimers that way.
The best option would be to rethink the throttling and move it out of
the direct reclaim path somehow.
Thanks and sorry for not spotting the potential lockup previously.
--
Michal Hocko
SUSE Labs
On Mon, Mar 13, 2017 at 8:46 AM, Michal Hocko <[email protected]> wrote:
> On Mon 13-03-17 08:07:15, Shakeel Butt wrote:
>> On Mon, Mar 13, 2017 at 2:02 AM, Michal Hocko <[email protected]> wrote:
>> > On Fri 10-03-17 11:46:20, Shakeel Butt wrote:
>> >> Recently kswapd has been modified to give up after MAX_RECLAIM_RETRIES
>> >> number of unsucessful iterations. Before going to sleep, kswapd thread
>> >> will unconditionally wakeup all threads sleeping on pfmemalloc_wait.
>> >> However the awoken threads will recheck the watermarks and wake the
>> >> kswapd thread and sleep again on pfmemalloc_wait. There is a chance
>> >> of continuous back and forth between kswapd and direct reclaiming
>> >> threads if the kswapd keep failing and thus defeat the purpose of
>> >> adding backoff mechanism to kswapd. So, add kswapd_failures check
>> >> on the throttle_direct_reclaim condition.
>> >
>> > I have to say I really do not like this. kswapd_failures shouldn't
>> > really be checked outside of the kswapd context. The
>> > pfmemalloc_watermark_ok/throttle_direct_reclaim is quite complex even
>> > without putting another variable into it. I wish we rather replace this
>> > throttling by something else. Johannes had an idea to throttle by the
>> > number of reclaimers.
>> >
>>
>> Do you suspect race in accessing kswapd_failures in non-kswapd
>> context?
>
> No, this is not about race conditions. It is more about the logic of the
> code. kswapd_failures is the private thing to the kswapd daemon. Direct
> reclaimers shouldn't have any business in it - well except resetting it.
>
>> Please do let me know more about replacing this throttling.
>
> The idea behind a different throttling would be to not allow too many
> direct reclaimers on the same set of nodes/zones. Johannes would tell
> you more.
>
>> > Anyway, I am wondering whether we can hit this issue in
>> > practice? Have you seen it happening or is this a result of the code
>> > review? I would assume that that !zone_reclaimable_pages check in
>> > pfmemalloc_watermark_ok should help to some degree.
>> >
>> Yes, I have seen this issue going on for more than one hour on my
>> test. It was a simple test where the number of processes, in the
>> presence of swap, try to allocate memory more than RAM.
>
> this is an anonymous memory, right?
>
Yes.
>> The number of
>> processes are equal to the number of cores and are pinned to each
>> individual core. I am suspecting that !zone_reclaimable_pages() check
>> did not help.
>
> Hmm, interesting! I would expect the OOM killer triggering but I guess
> I see what is going on. kswapd couldn't reclaim a single page and ran
> out of its kswapd_failures attempts while no direct reclaimers could
> reclaim a single page either until we reached the throttling point when
> we are basically livelocked because neither kswapd nor _all_ direct
> reclaimers can make a forward progress. Although this sounds quite
> unlikely I think it is quite possible to happen. So we cannot really
> throttle _all_ direct reclaimers when the kswapd is out of game which I
> haven't fully realized when reviewing "mm: fix 100% CPU kswapd busyloop
> on unreclaimable nodes".
>
> The simplest thing to do would be something like you have proposed and
> do not throttle if kswapd is out of game.
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index bae698484e8e..d34b1afc781a 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2791,6 +2791,9 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
> int i;
> bool wmark_ok;
>
> + if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
> + return true;
> +
> for (i = 0; i <= ZONE_NORMAL; i++) {
> zone = &pgdat->node_zones[i];
> if (!managed_zone(zone))
>
> I do not like this as I've already said but it would allow to merge
> "mm: fix 100% CPU kswapd busyloop on unreclaimable nodes" without too
> many additional changes.
>
> Another option would be to cap the waiting time same as we do for
> GFP_NOFS. Not ideal either because I suspect we would just get herds
> of direct reclaimers that way.
>
> The best option would be to rethink the throttling and move it out of
> the direct reclaim path somehow.
>
Agreed.
> Thanks and sorry for not spotting the potential lockup previously.
> --
> Michal Hocko
> SUSE Labs
Hi Shakeel,
On Fri, Mar 10, 2017 at 11:46:20AM -0800, Shakeel Butt wrote:
> Recently kswapd has been modified to give up after MAX_RECLAIM_RETRIES
> number of unsucessful iterations. Before going to sleep, kswapd thread
> will unconditionally wakeup all threads sleeping on pfmemalloc_wait.
> However the awoken threads will recheck the watermarks and wake the
> kswapd thread and sleep again on pfmemalloc_wait. There is a chance
> of continuous back and forth between kswapd and direct reclaiming
> threads if the kswapd keep failing and thus defeat the purpose of
> adding backoff mechanism to kswapd. So, add kswapd_failures check
> on the throttle_direct_reclaim condition.
>
> Signed-off-by: Shakeel Butt <[email protected]>
You're right, the way it works right now is kind of lame. Did you
observe continued kswapd spinning because of the wakeup ping-pong?
> +static bool should_throttle_direct_reclaim(pg_data_t *pgdat)
> +{
> + return (pgdat->kswapd_failures < MAX_RECLAIM_RETRIES &&
> + !pfmemalloc_watermark_ok(pgdat));
> +}
> +
> /*
> * Throttle direct reclaimers if backing storage is backed by the network
> * and the PFMEMALLOC reserve for the preferred node is getting dangerously
> @@ -2873,7 +2879,7 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
>
> /* Throttle based on the first usable node */
> pgdat = zone->zone_pgdat;
> - if (pfmemalloc_watermark_ok(pgdat))
> + if (!should_throttle_direct_reclaim(pgdat))
> goto out;
Instead of a second helper function, could you rename
pfmemalloc_watermark_ok() and add the kswapd_failure check at the very
beginning of that function?
Because that check fits nicely with the comment about kswapd having to
be awake, too. We need kswapd operational when throttling reclaimers.
Thanks
On Mon, Mar 13, 2017 at 12:58 PM, Johannes Weiner <[email protected]> wrote:
> Hi Shakeel,
>
> On Fri, Mar 10, 2017 at 11:46:20AM -0800, Shakeel Butt wrote:
>> Recently kswapd has been modified to give up after MAX_RECLAIM_RETRIES
>> number of unsucessful iterations. Before going to sleep, kswapd thread
>> will unconditionally wakeup all threads sleeping on pfmemalloc_wait.
>> However the awoken threads will recheck the watermarks and wake the
>> kswapd thread and sleep again on pfmemalloc_wait. There is a chance
>> of continuous back and forth between kswapd and direct reclaiming
>> threads if the kswapd keep failing and thus defeat the purpose of
>> adding backoff mechanism to kswapd. So, add kswapd_failures check
>> on the throttle_direct_reclaim condition.
>>
>> Signed-off-by: Shakeel Butt <[email protected]>
>
> You're right, the way it works right now is kind of lame. Did you
> observe continued kswapd spinning because of the wakeup ping-pong?
>
Yes, I did observe kswapd spinning for more than an hour.
>> +static bool should_throttle_direct_reclaim(pg_data_t *pgdat)
>> +{
>> + return (pgdat->kswapd_failures < MAX_RECLAIM_RETRIES &&
>> + !pfmemalloc_watermark_ok(pgdat));
>> +}
>> +
>> /*
>> * Throttle direct reclaimers if backing storage is backed by the network
>> * and the PFMEMALLOC reserve for the preferred node is getting dangerously
>> @@ -2873,7 +2879,7 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
>>
>> /* Throttle based on the first usable node */
>> pgdat = zone->zone_pgdat;
>> - if (pfmemalloc_watermark_ok(pgdat))
>> + if (!should_throttle_direct_reclaim(pgdat))
>> goto out;
>
> Instead of a second helper function, could you rename
> pfmemalloc_watermark_ok() and add the kswapd_failure check at the very
> beginning of that function?
>
Sure, Michal also suggested the same.
> Because that check fits nicely with the comment about kswapd having to
> be awake, too. We need kswapd operational when throttling reclaimers.
>
> Thanks