2015-04-03 17:44:06

by Nishanth Aravamudan

Subject: Re: [PATCH v2] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable pages

On 31.03.2015 [11:48:29 +0200], Michal Hocko wrote:
> On Fri 27-03-15 15:23:50, Nishanth Aravamudan wrote:
> > On 27.03.2015 [13:17:59 -0700], Dave Hansen wrote:
> > > On 03/27/2015 12:28 PM, Nishanth Aravamudan wrote:
> > > > @@ -2585,7 +2585,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
> > > >
> > > >      for (i = 0; i <= ZONE_NORMAL; i++) {
> > > >          zone = &pgdat->node_zones[i];
> > > > -        if (!populated_zone(zone))
> > > > +        if (!populated_zone(zone) || !zone_reclaimable(zone))
> > > >              continue;
> > > >
> > > >          pfmemalloc_reserve += min_wmark_pages(zone);
> > >
> > > Do you really want zone_reclaimable()? Or do you want something more
> > > direct like "zone_reclaimable_pages(zone) == 0"?
> >
> > Yeah, I guess in my testing this worked out to be the same, since
> > zone_reclaimable_pages(zone) is 0 and so zone_reclaimable(zone) will
> > always be false. Thanks!
> >
> > Based upon 675becce15 ("mm: vmscan: do not throttle based on pfmemalloc
> > reserves if node has no ZONE_NORMAL") from Mel.
> >
> > We have a system with the following topology:
> >
> > # numactl -H
> > available: 3 nodes (0,2-3)
> > node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
> > 23 24 25 26 27 28 29 30 31
> > node 0 size: 28273 MB
> > node 0 free: 27323 MB
> > node 2 cpus:
> > node 2 size: 16384 MB
> > node 2 free: 0 MB
> > node 3 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
> > node 3 size: 30533 MB
> > node 3 free: 13273 MB
> > node distances:
> > node   0   2   3
> >   0:  10  20  20
> >   2:  20  10  20
> >   3:  20  20  10
> >
> > Node 2 has no free memory, because:
> > # cat /sys/devices/system/node/node2/hugepages/hugepages-16777216kB/nr_hugepages
> > 1
> >
> > This leads to the following zoneinfo:
> >
> > Node 2, zone      DMA
> >   pages free     0
> >         min      1840
> >         low      2300
> >         high     2760
> >         scanned  0
> >         spanned  262144
> >         present  262144
> >         managed  262144
> > ...
> >   all_unreclaimable: 1
>
> Blee, this is a weird configuration.

Yep, super gross. It's relatively rare in the field, thankfully. But 16G
pages definitely make it pretty likely to hit (as in, I've seen it once
before) :)

> > If one then attempts to allocate some normal 16M hugepages via
> >
> > echo 37 > /proc/sys/vm/nr_hugepages
> >
> > The echo never returns and kswapd2 consumes CPU cycles.
> >
> > This is because throttle_direct_reclaim ends up calling
> > wait_event(pfmemalloc_wait, pfmemalloc_watermark_ok...).
> > pfmemalloc_watermark_ok() in turn checks all zones on the node if there
> > are any reserves, and if so, then indicates the watermarks are ok, by
> > seeing if there are sufficient free pages.
> >
> > 675becce15 added a condition already for memoryless nodes. In this case,
> > though, the node has memory, it is just all consumed (and not
> > reclaimable). Effectively, though, the result is the same on this call
> > to pfmemalloc_watermark_ok() and thus seems like a reasonable additional
> > condition.
> >
> > With this change, the aforementioned 16M hugepage allocation attempt
> > succeeds and correctly round-robins between Nodes 0 and 3.
>
> I am just wondering whether this is the right/complete fix. Don't we
> need a similar treatment at more places?

This almost certainly needs an audit. Fully exhausted nodes are tough
for me to reproduce easily.
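
For anyone auditing, here is a rough userspace model of the check
described in the quoted changelog. It is only a sketch: struct
model_zone, model_pfmemalloc_watermark_ok() and the skip_unreclaimable
flag are hypothetical names made up for illustration, and the real
function in mm/vmscan.c does more than this (it also deals with waking
kswapd). The numbers are the Node 2 values from the zoneinfo above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's struct zone. */
struct model_zone {
    unsigned long present_pages;     /* 0 means the zone is not populated */
    unsigned long free_pages;        /* models NR_FREE_PAGES */
    unsigned long min_wmark;         /* models min_wmark_pages(zone) */
    unsigned long reclaimable_pages; /* models zone_reclaimable_pages(zone) */
};

/*
 * Models the watermark check discussed in this thread. With
 * skip_unreclaimable == false only unpopulated zones are skipped
 * (the behaviour before the patch); with true, zones that have no
 * reclaimable pages are skipped as well.
 */
static bool model_pfmemalloc_watermark_ok(const struct model_zone *zones,
                                          size_t nr_zones,
                                          bool skip_unreclaimable)
{
    unsigned long reserve = 0;
    unsigned long free_pages = 0;
    size_t i;

    for (i = 0; i < nr_zones; i++) {
        const struct model_zone *z = &zones[i];

        if (z->present_pages == 0)
            continue;
        if (skip_unreclaimable && z->reclaimable_pages == 0)
            continue;

        reserve += z->min_wmark;
        free_pages += z->free_pages;
    }

    /* No reserves were counted, so there is nothing to throttle on. */
    if (reserve == 0)
        return true;

    return free_pages > reserve / 2;
}

/* Node 2 from the zoneinfo above: fully consumed by one 16G hugepage. */
static const struct model_zone model_node2[] = {
    { .present_pages = 262144, .free_pages = 0,
      .min_wmark = 1840, .reclaimable_pages = 0 },
};
```

Calling model_pfmemalloc_watermark_ok(model_node2, 1, false) models the
old behaviour: the reserve is 1840 pages, there are 0 free pages, and
the check fails, so the caller would keep waiting. With the flag set to
true the only zone is skipped, no reserve is counted, and the check
passes.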

> I would expect kswapd would be looping endlessly because the zone
> wouldn't be balanced obviously. But I would be wrong... because
> pgdat_balanced is doing this:
>     /*
>      * A special case here:
>      *
>      * balance_pgdat() skips over all_unreclaimable after
>      * DEF_PRIORITY. Effectively, it considers them balanced so
>      * they must be considered balanced here as well!
>      */
>     if (!zone_reclaimable(zone)) {
>         balanced_pages += zone->managed_pages;
>         continue;
>     }
>
> and zone_reclaimable is false for you as you didn't have any
> zone_reclaimable_pages(). But wakeup_kswapd doesn't do this check so it
> would see !zone_balanced() AFAICS (build_zonelists doesn't ignore those
> zones right?) and so the kswapd would be woken up easily. So it looks
> like a mess.

My understanding, and I could easily be wrong, is that kswapd2 (node 2
is the exhausted one) spins endlessly, because the reclaim logic sees
that we are reclaiming from somewhere but the allocation request for
node 2 (which is __GFP_THISNODE for hugepages, not GFP_THISNODE) will
never complete, so we just continue to reclaim.
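
To make the mismatch between the two paths concrete, here is a small
userspace model of the order-0 case. It is a sketch under assumptions,
not the kernel code: struct kzone and all of the function names are
hypothetical, the "scanned < reclaimable * 6" test is how I remember
zone_reclaimable() from kernels of this era, and the balance test is
reduced to "free pages above the high watermark". The wakeup model also
ignores the other conditions the real wakeup_kswapd() checks.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's struct zone. */
struct kzone {
    unsigned long managed_pages;     /* 0 models an unpopulated zone */
    unsigned long free_pages;
    unsigned long high_wmark;
    unsigned long pages_scanned;     /* models NR_PAGES_SCANNED */
    unsigned long reclaimable_pages; /* models zone_reclaimable_pages() */
};

/* Assumed form of zone_reclaimable(): false when nothing is reclaimable. */
static bool kzone_reclaimable(const struct kzone *z)
{
    return z->pages_scanned < z->reclaimable_pages * 6;
}

/* Crude model of zone_balanced(): free pages above the high watermark. */
static bool kzone_balanced(const struct kzone *z)
{
    return z->free_pages > z->high_wmark;
}

/*
 * Models the order-0 path of pgdat_balanced(): zones that are not
 * reclaimable are treated as balanced, as in the quoted special case.
 */
static bool knode_balanced(const struct kzone *zones, size_t nr_zones)
{
    size_t i;

    for (i = 0; i < nr_zones; i++) {
        const struct kzone *z = &zones[i];

        if (z->managed_pages == 0)
            continue;
        if (!kzone_reclaimable(z))
            continue; /* considered balanced */
        if (!kzone_balanced(z))
            return false;
    }
    return true;
}

/* Models the wakeup side, which has no reclaimability check. */
static bool kswapd_wakeup_wanted(const struct kzone *z)
{
    return z->managed_pages != 0 && !kzone_balanced(z);
}

/* Node 2 from the zoneinfo earlier in the thread. */
static const struct kzone exhausted_node2[] = {
    { .managed_pages = 262144, .free_pages = 0, .high_wmark = 2760,
      .pages_scanned = 0, .reclaimable_pages = 0 },
};
```

For exhausted_node2, knode_balanced() reports the node as balanced, so
kswapd2 has nothing to do, while kswapd_wakeup_wanted() is true for the
same zone, so every allocation attempt against that node would wake it
again.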

> There are possibly other places which rely on populated_zone or
> for_each_populated_zone without checking reclaimability. Are those
> working as expected?

Not yet verified, admittedly.

> That being said. I am not objecting to this patch. I am just trying to
> wrap my head around possible issues from such a weird configuration and
> all the consequences.

Yeah, there are almost certainly more. Luckily, our test organization is
hammering this configuration, so hopefully I'll get reports about
further issues soon, if there are any, with the patch applied.

Thanks,
Nish


2015-04-03 18:24:50

by Michal Hocko

Subject: Re: [PATCH v2] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable pages

On Fri 03-04-15 10:43:57, Nishanth Aravamudan wrote:
> On 31.03.2015 [11:48:29 +0200], Michal Hocko wrote:
[...]
> > I would expect kswapd would be looping endlessly because the zone
> > wouldn't be balanced obviously. But I would be wrong... because
> > pgdat_balanced is doing this:
> >     /*
> >      * A special case here:
> >      *
> >      * balance_pgdat() skips over all_unreclaimable after
> >      * DEF_PRIORITY. Effectively, it considers them balanced so
> >      * they must be considered balanced here as well!
> >      */
> >     if (!zone_reclaimable(zone)) {
> >         balanced_pages += zone->managed_pages;
> >         continue;
> >     }
> >
> > and zone_reclaimable is false for you as you didn't have any
> > zone_reclaimable_pages(). But wakeup_kswapd doesn't do this check so it
> > would see !zone_balanced() AFAICS (build_zonelists doesn't ignore those
> > zones right?) and so the kswapd would be woken up easily. So it looks
> > like a mess.
>
> My understanding, and I could easily be wrong, is that kswapd2 (node 2
> is the exhausted one) spins endlessly, because the reclaim logic sees
> that we are reclaiming from somewhere but the allocation request for
> node 2 (which is __GFP_THISNODE for hugepages, not GFP_THISNODE) will
> never complete, so we just continue to reclaim.

__GFP_THISNODE would be waking up kswapd2 again and again, that is true.
I am just wondering whether we will have any __GFP_THISNODE allocations
for a node without CPUs (numa_node_id() shouldn't return such a node
AFAICS). Maybe if somebody is bound to Node2 explicitly but I would
consider this as a misconfiguration.

--
Michal Hocko
SUSE Labs

2015-04-03 18:50:50

by Nishanth Aravamudan

Subject: Re: [PATCH v2] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable pages

On 03.04.2015 [20:24:45 +0200], Michal Hocko wrote:
> On Fri 03-04-15 10:43:57, Nishanth Aravamudan wrote:
> > On 31.03.2015 [11:48:29 +0200], Michal Hocko wrote:
> [...]
> > > I would expect kswapd would be looping endlessly because the zone
> > > wouldn't be balanced obviously. But I would be wrong... because
> > > pgdat_balanced is doing this:
> > >     /*
> > >      * A special case here:
> > >      *
> > >      * balance_pgdat() skips over all_unreclaimable after
> > >      * DEF_PRIORITY. Effectively, it considers them balanced so
> > >      * they must be considered balanced here as well!
> > >      */
> > >     if (!zone_reclaimable(zone)) {
> > >         balanced_pages += zone->managed_pages;
> > >         continue;
> > >     }
> > >
> > > and zone_reclaimable is false for you as you didn't have any
> > > zone_reclaimable_pages(). But wakeup_kswapd doesn't do this check so it
> > > would see !zone_balanced() AFAICS (build_zonelists doesn't ignore those
> > > zones right?) and so the kswapd would be woken up easily. So it looks
> > > like a mess.
> >
> > My understanding, and I could easily be wrong, is that kswapd2 (node 2
> > is the exhausted one) spins endlessly, because the reclaim logic sees
> > that we are reclaiming from somewhere but the allocation request for
> > node 2 (which is __GFP_THISNODE for hugepages, not GFP_THISNODE) will
> > never complete, so we just continue to reclaim.
>
> __GFP_THISNODE would be waking up kswapd2 again and again, that is true.

Right, one idea I had for this was ensuring that we perform reclaim with
some knowledge of __GFP_THISNODE -- that is, it needs to be somewhat
targeted in order to actually help satisfy the current allocation. But
it got pretty hairy fast and I didn't want to break the world :)

> I am just wondering whether we will have any __GFP_THISNODE allocations
> for a node without CPUs (numa_node_id() shouldn't return such a node
> AFAICS). Maybe if somebody is bound to Node2 explicitly but I would
> consider this as a misconfiguration.

Right, I'd need to check what happens if, in our setup, you taskset to
node2 and try to force memory to be local -- I think you'd either be
killed immediately, or the kernel would just disagree with your binding
since it's invalid (e.g., that is what happens if you try to bind to a
memoryless node, I think).

Keep in mind that although in my config node2 had no CPUs, that's not a
hard and fast requirement. I do believe that in a previous iteration of
this bug, the exhausted node had no free memory but did have CPUs
assigned to it.

-Nish