2013-07-19 20:55:44

by Johannes Weiner

Subject: [patch 0/3] mm: improve page aging fairness between zones/nodes

The way the page allocator interacts with kswapd creates aging
imbalances, where the amount of time a userspace page gets in memory
under reclaim pressure is dependent on which zone, which node the
allocator took the page frame from.

#1 fixes missed kswapd wakeups on NUMA systems, which lead to some
nodes falling behind for a full reclaim cycle relative to the other
nodes in the system

#3 fixes an interaction where kswapd and a continuous stream of page
allocations keep the preferred zone of a task between the high and
low watermark (allocations succeed + kswapd does not go to sleep)
indefinitely, completely underutilizing the lower zones and
thrashing on the preferred zone

These patches are the aging fairness part of the thrash-detection
based file LRU balancing. Andrea recommended to submit them
separately as they are bugfixes in their own right.

The following test ran a foreground workload (memcachetest) with
background IO of various sizes on a 4 node 8G system (similar results
were observed with single-node 4G systems):

parallelio
BASE FAIRALLOC
Ops memcachetest-0M 5170.00 ( 0.00%) 5283.00 ( 2.19%)
Ops memcachetest-791M 4740.00 ( 0.00%) 5293.00 ( 11.67%)
Ops memcachetest-2639M 2551.00 ( 0.00%) 4950.00 ( 94.04%)
Ops memcachetest-4487M 2606.00 ( 0.00%) 3922.00 ( 50.50%)
Ops io-duration-0M 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops io-duration-791M 55.00 ( 0.00%) 18.00 ( 67.27%)
Ops io-duration-2639M 235.00 ( 0.00%) 103.00 ( 56.17%)
Ops io-duration-4487M 278.00 ( 0.00%) 173.00 ( 37.77%)
Ops swaptotal-0M 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swaptotal-791M 245184.00 ( 0.00%) 0.00 ( 0.00%)
Ops swaptotal-2639M 468069.00 ( 0.00%) 108778.00 ( 76.76%)
Ops swaptotal-4487M 452529.00 ( 0.00%) 76623.00 ( 83.07%)
Ops swapin-0M 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swapin-791M 108297.00 ( 0.00%) 0.00 ( 0.00%)
Ops swapin-2639M 169537.00 ( 0.00%) 50031.00 ( 70.49%)
Ops swapin-4487M 167435.00 ( 0.00%) 34178.00 ( 79.59%)
Ops minorfaults-0M 1518666.00 ( 0.00%) 1503993.00 ( 0.97%)
Ops minorfaults-791M 1676963.00 ( 0.00%) 1520115.00 ( 9.35%)
Ops minorfaults-2639M 1606035.00 ( 0.00%) 1799717.00 (-12.06%)
Ops minorfaults-4487M 1612118.00 ( 0.00%) 1583825.00 ( 1.76%)
Ops majorfaults-0M 6.00 ( 0.00%) 0.00 ( 0.00%)
Ops majorfaults-791M 13836.00 ( 0.00%) 10.00 ( 99.93%)
Ops majorfaults-2639M 22307.00 ( 0.00%) 6490.00 ( 70.91%)
Ops majorfaults-4487M 21631.00 ( 0.00%) 4380.00 ( 79.75%)

BASE FAIRALLOC
User 287.78 460.97
System 2151.67 3142.51
Elapsed 9737.00 8879.34

BASE FAIRALLOC
Minor Faults 53721925 57188551
Major Faults 392195 15157
Swap Ins 2994854 112770
Swap Outs 4907092 134982
Direct pages scanned 0 41824
Kswapd pages scanned 32975063 8128269
Kswapd pages reclaimed 6323069 7093495
Direct pages reclaimed 0 41824
Kswapd efficiency 19% 87%
Kswapd velocity 3386.573 915.414
Direct efficiency 100% 100%
Direct velocity 0.000 4.710
Percentage direct scans 0% 0%
Zone normal velocity 2011.338 550.661
Zone dma32 velocity 1365.623 369.221
Zone dma velocity 9.612 0.242
Page writes by reclaim 18732404.000 614807.000
Page writes file 13825312 479825
Page writes anon 4907092 134982
Page reclaim immediate 85490 5647
Sector Reads 12080532 483244
Sector Writes 88740508 65438876
Page rescued immediate 0 0
Slabs scanned 82560 12160
Direct inode steals 0 0
Kswapd inode steals 24401 40013
Kswapd skipped wait 0 0
THP fault alloc 6 8
THP collapse alloc 5481 5812
THP splits 75 22
THP fault fallback 0 0
THP collapse fail 0 0
Compaction stalls 0 54
Compaction success 0 45
Compaction failures 0 9
Page migrate success 881492 82278
Page migrate failure 0 0
Compaction pages isolated 0 60334
Compaction migrate scanned 0 53505
Compaction free scanned 0 1537605
Compaction cost 914 86
NUMA PTE updates 46738231 41988419
NUMA hint faults 31175564 24213387
NUMA hint local faults 10427393 6411593
NUMA pages migrated 881492 55344
AutoNUMA cost 156221 121361

The overall runtime was reduced, throughput for both the foreground
workload and the background IO improved, major faults, swapping and
reclaim activity shrank significantly, and reclaim efficiency more
than quadrupled.

include/linux/mmzone.h | 1 +
mm/page_alloc.c | 55 +++++++++++++++++++++++++++++++++++++------------------
mm/vmscan.c | 2 +-
3 files changed, 39 insertions(+), 19 deletions(-)


2013-07-19 20:55:47

by Johannes Weiner

Subject: [patch 1/3] mm: vmscan: fix numa reclaim balance problem in kswapd

When the page allocator fails to get a page from all zones in its
given zonelist, it wakes up the per-node kswapds for all zones that
are at their low watermark.

However, with a system under load and the free page counters being
per-cpu approximations, the observed counter value in a zone can
fluctuate enough that the allocation fails but the kswapd wakeup is
also skipped while the zone is still really close to the low
watermark.

When one node misses a wakeup like this, it won't be aged before all
the other nodes' zones are down to their low watermarks again. And
skipping a full aging cycle is an obvious fairness problem.

Kswapd runs until the high watermarks are restored, so it should also
be woken when the high watermarks are not met. This ages nodes more
equally and creates a safety margin for the page counter fluctuation.

By using zone_balanced(), the wakeup check now also considers, in
addition to the watermark, whether compaction requires more order-0
pages to create a higher-order page.
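
For reference, zone_balanced() at the time looked roughly like this
(paraphrased from mm/vmscan.c of that era, shown only for context; the
comments are added here):

static bool zone_balanced(struct zone *zone, int order,
                          unsigned long balance_gap, int classzone_idx)
{
        /* The zone must be above its high watermark (drift-safe check) */
        if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone) +
                                    balance_gap, classzone_idx, 0))
                return false;

        /* For order > 0, compaction needs enough order-0 pages to work with */
        if (IS_ENABLED(CONFIG_COMPACTION) && order &&
            !compaction_suitable(zone, order))
                return false;

        return true;
}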

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/vmscan.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e364542..bccc6d3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3277,7 +3277,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
}
if (!waitqueue_active(&pgdat->kswapd_wait))
return;
- if (zone_watermark_ok_safe(zone, order, low_wmark_pages(zone), 0, 0))
+ if (zone_balanced(zone, order, 0, 0))
return;

trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
--
1.8.3.2

2013-07-19 20:55:51

by Johannes Weiner

Subject: [patch 3/3] mm: page_alloc: fair zone allocator policy

Each zone that holds userspace pages of one workload must be aged at a
speed proportional to the zone size. Otherwise, the time an
individual page gets to stay in memory depends on the zone it happened
to be allocated in. Asymmetry in the zone aging creates rather
unpredictable aging behavior and results in the wrong pages being
reclaimed, activated etc.

But exactly this happens right now because of the way the page
allocator and kswapd interact. The page allocator uses per-node lists
of all zones in the system, ordered by preference, when allocating a
new page. When the first iteration does not yield any results, kswapd
is woken up and the allocator retries. Because kswapd reclaims zones
back up to the high watermark, while a zone can be allocated from as
soon as it is above the low watermark, the allocator may keep kswapd
running while kswapd reclaim in turn ensures that the page allocator
can keep allocating from the first zone in the zonelist for extended
periods of time. Meanwhile the other zones rarely see new allocations
and thus get aged much more slowly in comparison.

The result is that the occasional page placed in lower zones gets
relatively more time in memory, and may even get promoted to the
active list after its peers have long been evicted. Meanwhile, the
bulk of the
working set may be thrashing on the preferred zone even though there
may be significant amounts of memory available in the lower zones.

Even the most basic test -- repeatedly reading a file slightly bigger
than memory -- shows how broken the zone aging is. In this scenario,
no single page should be able to stay in memory long enough to get
referenced twice and activated, but activation happens in spades:

$ grep active_file /proc/zoneinfo
nr_inactive_file 0
nr_active_file 0
nr_inactive_file 0
nr_active_file 8
nr_inactive_file 1582
nr_active_file 11994
$ cat data data data data >/dev/null
$ grep active_file /proc/zoneinfo
nr_inactive_file 0
nr_active_file 70
nr_inactive_file 258753
nr_active_file 443214
nr_inactive_file 149793
nr_active_file 12021

Fix this with a very simple round robin allocator. Each zone is
allowed a batch of allocations that is proportional to the zone's
size, after which it is treated as full. The batch counters are reset
when all zones have been tried and the allocator enters the slowpath
and kicks off kswapd reclaim:

$ grep active_file /proc/zoneinfo
nr_inactive_file 0
nr_active_file 0
nr_inactive_file 174
nr_active_file 4865
nr_inactive_file 53
nr_active_file 860
$ cat data data data data >/dev/null
$ grep active_file /proc/zoneinfo
nr_inactive_file 0
nr_active_file 0
nr_inactive_file 666622
nr_active_file 4988
nr_inactive_file 190969
nr_active_file 937
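
To make the batch accounting described above concrete, below is a toy
userspace model of the policy (illustration only: the zone sizes, the
1/64 batch factor and the simple retry are made up for the example; in
the patch itself the batch is the gap between the high and low
watermarks):

/* Toy model of per-zone allocation batches (not kernel code). */
#include <stdio.h>

#define NR_ZONES 2

struct zone { const char *name; long managed_pages; long alloc_batch; };

static void reset_batches(struct zone *zones)
{
        for (int i = 0; i < NR_ZONES; i++)
                /* stand-in for high_wmark - low_wmark, which scales with zone size */
                zones[i].alloc_batch = zones[i].managed_pages / 64;
}

static struct zone *alloc_page(struct zone *zones)
{
        for (int i = 0; i < NR_ZONES; i++) {    /* preferred zone first */
                if (zones[i].alloc_batch <= 0)
                        continue;               /* batch used up: treat as full */
                zones[i].alloc_batch--;
                return &zones[i];
        }
        /* all batches exhausted: the slowpath would wake kswapd and reset */
        reset_batches(zones);
        return alloc_page(zones);
}

int main(void)
{
        struct zone zones[NR_ZONES] = {
                { "Normal", 6L << 18, 0 },      /* ~6G worth of 4k pages */
                { "DMA32",  2L << 18, 0 },      /* ~2G worth of 4k pages */
        };
        long hits[NR_ZONES] = { 0, 0 };

        reset_batches(zones);
        for (long i = 0; i < 1L << 20; i++)
                hits[alloc_page(zones) - zones]++;

        /* prints a ~3:1 split, i.e. proportional to zone size */
        for (int i = 0; i < NR_ZONES; i++)
                printf("%-6s %ld allocations\n", zones[i].name, hits[i]);
        return 0;
}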

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/mmzone.h | 1 +
mm/page_alloc.c | 39 +++++++++++++++++++++++++++++----------
2 files changed, 30 insertions(+), 10 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index af4a3b7..0c41d59 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -352,6 +352,7 @@ struct zone {
* free areas of different sizes
*/
spinlock_t lock;
+ atomic_t alloc_batch;
int all_unreclaimable; /* All pages pinned */
#if defined CONFIG_COMPACTION || defined CONFIG_CMA
/* Set to true when the PG_migrate_skip bits should be cleared */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index af1d956b..d938b67 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1879,6 +1879,14 @@ zonelist_scan:
if (alloc_flags & ALLOC_NO_WATERMARKS)
goto try_this_zone;
/*
+ * Distribute pages in proportion to the individual
+ * zone size to ensure fair page aging. The zone a
+ * page was allocated in should have no effect on the
+ * time the page has in memory before being reclaimed.
+ */
+ if (atomic_read(&zone->alloc_batch) <= 0)
+ continue;
+ /*
* When allocating a page cache page for writing, we
* want to get it from a zone that is within its dirty
* limit, such that no single zone holds more than its
@@ -1984,7 +1992,8 @@ this_zone_full:
goto zonelist_scan;
}

- if (page)
+ if (page) {
+ atomic_sub(1U << order, &zone->alloc_batch);
/*
* page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
* necessary to allocate the page. The expectation is
@@ -1993,6 +2002,7 @@ this_zone_full:
* for !PFMEMALLOC purposes.
*/
page->pfmemalloc = !!(alloc_flags & ALLOC_NO_WATERMARKS);
+ }

return page;
}
@@ -2342,16 +2352,20 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
return page;
}

-static inline
-void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
- enum zone_type high_zoneidx,
- enum zone_type classzone_idx)
+static void prepare_slowpath(gfp_t gfp_mask, unsigned int order,
+ struct zonelist *zonelist,
+ enum zone_type high_zoneidx,
+ enum zone_type classzone_idx)
{
struct zoneref *z;
struct zone *zone;

- for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
- wakeup_kswapd(zone, order, classzone_idx);
+ for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+ atomic_set(&zone->alloc_batch,
+ high_wmark_pages(zone) - low_wmark_pages(zone));
+ if (!(gfp_mask & __GFP_NO_KSWAPD))
+ wakeup_kswapd(zone, order, classzone_idx);
+ }
}

static inline int
@@ -2447,9 +2461,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
goto nopage;

restart:
- if (!(gfp_mask & __GFP_NO_KSWAPD))
- wake_all_kswapd(order, zonelist, high_zoneidx,
- zone_idx(preferred_zone));
+ prepare_slowpath(gfp_mask, order, zonelist,
+ high_zoneidx, zone_idx(preferred_zone));

/*
* OK, we're below the kswapd watermark and have kicked background
@@ -4758,6 +4771,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
zone_seqlock_init(zone);
zone->zone_pgdat = pgdat;

+ /* For bootup, initialized properly in watermark setup */
+ atomic_set(&zone->alloc_batch, zone->managed_pages);
+
zone_pcp_init(zone);
lruvec_init(&zone->lruvec);
if (!size)
@@ -5533,6 +5549,9 @@ static void __setup_per_zone_wmarks(void)
zone->watermark[WMARK_LOW] = min_wmark_pages(zone) + (tmp >> 2);
zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + (tmp >> 1);

+ atomic_set(&zone->alloc_batch,
+ high_wmark_pages(zone) - low_wmark_pages(zone));
+
setup_zone_migrate_reserve(zone);
spin_unlock_irqrestore(&zone->lock, flags);
}
--
1.8.3.2

2013-07-19 20:56:35

by Johannes Weiner

[permalink] [raw]
Subject: [patch 2/3] mm: page_alloc: rearrange watermark checking in get_page_from_freelist

Allocations that do not have to respect the watermarks are rare
high-priority events. Reorder the code such that per-zone dirty
limits and future checks important only to regular page allocations
are ignored in these extraordinary situations.

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/page_alloc.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d9df57d..af1d956b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1867,12 +1867,17 @@ zonelist_scan:
*/
for_each_zone_zonelist_nodemask(zone, z, zonelist,
high_zoneidx, nodemask) {
+ unsigned long mark;
+
if (IS_ENABLED(CONFIG_NUMA) && zlc_active &&
!zlc_zone_worth_trying(zonelist, z, allowednodes))
continue;
if ((alloc_flags & ALLOC_CPUSET) &&
!cpuset_zone_allowed_softwall(zone, gfp_mask))
continue;
+ BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
+ if (alloc_flags & ALLOC_NO_WATERMARKS)
+ goto try_this_zone;
/*
* When allocating a page cache page for writing, we
* want to get it from a zone that is within its dirty
@@ -1903,16 +1908,11 @@ zonelist_scan:
(gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone))
goto this_zone_full;

- BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
- if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
- unsigned long mark;
+ mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
+ if (!zone_watermark_ok(zone, order, mark,
+ classzone_idx, alloc_flags)) {
int ret;

- mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
- if (zone_watermark_ok(zone, order, mark,
- classzone_idx, alloc_flags))
- goto try_this_zone;
-
if (IS_ENABLED(CONFIG_NUMA) &&
!did_zlc_setup && nr_online_nodes > 1) {
/*
--
1.8.3.2

2013-07-22 17:01:25

by Johannes Weiner

Subject: Re: [patch 0/3] mm: improve page aging fairness between zones/nodes

Hi Zlatko,

On Mon, Jul 22, 2013 at 06:48:52PM +0200, Zlatko Calusic wrote:
> On 19.07.2013 22:55, Johannes Weiner wrote:
> >The way the page allocator interacts with kswapd creates aging
> >imbalances, where the amount of time a userspace page gets in memory
> >under reclaim pressure is dependent on which zone, which node the
> >allocator took the page frame from.
> >
> >#1 fixes missed kswapd wakeups on NUMA systems, which lead to some
> > nodes falling behind for a full reclaim cycle relative to the other
> > nodes in the system
> >
> >#3 fixes an interaction where kswapd and a continuous stream of page
> > allocations keep the preferred zone of a task between the high and
> > low watermark (allocations succeed + kswapd does not go to sleep)
> > indefinitely, completely underutilizing the lower zones and
> > thrashing on the preferred zone
> >
> >These patches are the aging fairness part of the thrash-detection
> >based file LRU balancing. Andrea recommended to submit them
> >separately as they are bugfixes in their own right.
> >
>
> I have the patch applied and under testing. So far, so good. It
> looks like it could finally fix the bug that I was chasing few
> months ago (nicely described in your bullet #3). But, few more days
> of testing will be needed before I can reach a quality verdict.

I should have remembered that you talked about this problem... Thanks
a lot for testing!

May I ask for the zone layout of your test machine(s)? I.e. how many
nodes if NUMA, how big Normal and DMA32 (on Node 0) are.

2013-07-22 17:02:03

by Zlatko Calusic

Subject: Re: [patch 0/3] mm: improve page aging fairness between zones/nodes

On 19.07.2013 22:55, Johannes Weiner wrote:
> The way the page allocator interacts with kswapd creates aging
> imbalances, where the amount of time a userspace page gets in memory
> under reclaim pressure is dependent on which zone, which node the
> allocator took the page frame from.
>
> #1 fixes missed kswapd wakeups on NUMA systems, which lead to some
> nodes falling behind for a full reclaim cycle relative to the other
> nodes in the system
>
> #3 fixes an interaction where kswapd and a continuous stream of page
> allocations keep the preferred zone of a task between the high and
> low watermark (allocations succeed + kswapd does not go to sleep)
> indefinitely, completely underutilizing the lower zones and
> thrashing on the preferred zone
>
> These patches are the aging fairness part of the thrash-detection
> based file LRU balancing. Andrea recommended to submit them
> separately as they are bugfixes in their own right.
>

I have the patch applied and under testing. So far, so good. It looks
like it could finally fix the bug that I was chasing a few months ago
(nicely described in your bullet #3). But a few more days of testing
will be needed before I can reach a quality verdict.

Good job!
--
Zlatko

2013-07-22 17:14:26

by Zlatko Calusic

Subject: Re: [patch 0/3] mm: improve page aging fairness between zones/nodes

On 22.07.2013 19:01, Johannes Weiner wrote:
> Hi Zlatko,
>
> On Mon, Jul 22, 2013 at 06:48:52PM +0200, Zlatko Calusic wrote:
>> On 19.07.2013 22:55, Johannes Weiner wrote:
>>> The way the page allocator interacts with kswapd creates aging
>>> imbalances, where the amount of time a userspace page gets in memory
>>> under reclaim pressure is dependent on which zone, which node the
>>> allocator took the page frame from.
>>>
>>> #1 fixes missed kswapd wakeups on NUMA systems, which lead to some
>>> nodes falling behind for a full reclaim cycle relative to the other
>>> nodes in the system
>>>
>>> #3 fixes an interaction where kswapd and a continuous stream of page
>>> allocations keep the preferred zone of a task between the high and
>>> low watermark (allocations succeed + kswapd does not go to sleep)
>>> indefinitely, completely underutilizing the lower zones and
>>> thrashing on the preferred zone
>>>
>>> These patches are the aging fairness part of the thrash-detection
>>> based file LRU balancing. Andrea recommended to submit them
>>> separately as they are bugfixes in their own right.
>>>
>>
>> I have the patch applied and under testing. So far, so good. It
>> looks like it could finally fix the bug that I was chasing few
>> months ago (nicely described in your bullet #3). But, few more days
>> of testing will be needed before I can reach a quality verdict.
>
> I should have remembered that you talked about this problem... Thanks
> a lot for testing!
>
> May I ask for the zone layout of your test machine(s)? I.e. how many
> nodes if NUMA, how big Normal and DMA32 (on Node 0) are.
>

I have been reading about NUMA hw for at least a decade, but I guess
another one will pass before I actually see one. ;) Find /proc/zoneinfo
attached.

If your patchset fails my case, then nr_{in,}active_file in the Normal
zone will drop close to zero in a matter of days. If it fixes this
particular imbalance, and I have faith it will, then those two counters
will stay in relative balance with nr_{in,}active_anon in the same
zone. I also applied Konstantin's excellent
lru-milestones-timestamps-and-ages, along with some graphing of the
interesting numbers on top of that, which is why I already have faith
in your patchset. I can see a much better balance between zones
already. But let's give it some more time...

--
Zlatko


Attachments:
zoneinfo (3.88 kB)

2013-07-22 19:47:14

by Rik van Riel

Subject: Re: [patch 1/3] mm: vmscan: fix numa reclaim balance problem in kswapd

On 07/19/2013 04:55 PM, Johannes Weiner wrote:
> When the page allocator fails to get a page from all zones in its
> given zonelist, it wakes up the per-node kswapds for all zones that
> are at their low watermark.
>
> However, with a system under load and the free page counters being
> per-cpu approximations, the observed counter value in a zone can
> fluctuate enough that the allocation fails but the kswapd wakeup is
> also skipped while the zone is still really close to the low
> watermark.
>
> When one node misses a wakeup like this, it won't be aged before all
> the other node's zones are down to their low watermarks again. And
> skipping a full aging cycle is an obvious fairness problem.
>
> Kswapd runs until the high watermarks are restored, so it should also
> be woken when the high watermarks are not met. This ages nodes more
> equally and creates a safety margin for the page counter fluctuation.
>
> By using zone_balanced(), it will now check, in addition to the
> watermark, if compaction requires more order-0 pages to create a
> higher order page.
>
> Signed-off-by: Johannes Weiner <[email protected]>

This patch alone looks like it could have the effect of increasing
the pressure on the first zone in the zonelist, keeping its free
memory above the low watermark essentially forever, without having
the allocator fall back to other zones.

However, your third patch fixes that problem, and missed wakeups
would still hurt, so...

Reviewed-by: Rik van Riel <[email protected]>

--
All rights reversed

2013-07-22 19:51:36

by Rik van Riel

Subject: Re: [patch 2/3] mm: page_alloc: rearrange watermark checking in get_page_from_freelist

On 07/19/2013 04:55 PM, Johannes Weiner wrote:
> Allocations that do not have to respect the watermarks are rare
> high-priority events. Reorder the code such that per-zone dirty
> limits and future checks important only to regular page allocations
> are ignored in these extraordinary situations.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Reviewed-by: Rik van Riel <[email protected]>


--
All rights reversed

2013-07-22 20:14:52

by Johannes Weiner

Subject: Re: [patch 1/3] mm: vmscan: fix numa reclaim balance problem in kswapd

On Mon, Jul 22, 2013 at 03:47:03PM -0400, Rik van Riel wrote:
> On 07/19/2013 04:55 PM, Johannes Weiner wrote:
> >When the page allocator fails to get a page from all zones in its
> >given zonelist, it wakes up the per-node kswapds for all zones that
> >are at their low watermark.
> >
> >However, with a system under load and the free page counters being
> >per-cpu approximations, the observed counter value in a zone can
> >fluctuate enough that the allocation fails but the kswapd wakeup is
> >also skipped while the zone is still really close to the low
> >watermark.
> >
> >When one node misses a wakeup like this, it won't be aged before all
> >the other node's zones are down to their low watermarks again. And
> >skipping a full aging cycle is an obvious fairness problem.
> >
> >Kswapd runs until the high watermarks are restored, so it should also
> >be woken when the high watermarks are not met. This ages nodes more
> >equally and creates a safety margin for the page counter fluctuation.
> >
> >By using zone_balanced(), it will now check, in addition to the
> >watermark, if compaction requires more order-0 pages to create a
> >higher order page.
> >
> >Signed-off-by: Johannes Weiner <[email protected]>
>
> This patch alone looks like it could have the effect of increasing
> the pressure on the first zone in the zonelist, keeping its free
> memory above the low watermark essentially forever, without having
> the allocator fall back to other zones.
>
> However, your third patch fixes that problem, and missed wakeups
> would still hurt, so...

The kswapd wakeups happen in the slowpath, after the fastpath tried
all zones in the zonelist, not just the first one.

With the problem fixed in #3, the slowpath is rarely entered (even
when kswapds should be woken). From that point of view, the effects
of #1 are further improved by #3, but #1 on its own does not worsen
the situation.

> Reviewed-by: Rik van Riel <[email protected]>

Thanks!

2013-07-22 20:21:16

by Rik van Riel

Subject: Re: [patch 3/3] mm: page_alloc: fair zone allocator policy

On 07/19/2013 04:55 PM, Johannes Weiner wrote:

> @@ -1984,7 +1992,8 @@ this_zone_full:
> goto zonelist_scan;
> }
>
> - if (page)
> + if (page) {
> + atomic_sub(1U << order, &zone->alloc_batch);
> /*
> * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
> * necessary to allocate the page. The expectation is

Could this be moved into the slow path in buffered_rmqueue and
rmqueue_bulk, or would the effect of ignoring the pcp buffers be
too detrimental to keeping the balance between zones?

It would be kind of nice to not have this atomic operation on every
page allocation...

As a side benefit, higher-order buffered_rmqueue and rmqueue_bulk
both happen under the zone->lock, so moving this accounting down
to that layer might allow you to get rid of the atomics altogether.

I like the overall approach though. This is something Linux has needed
for a long time, and could be extremely useful to automatic NUMA
balancing as well...

--
All rights reversed

2013-07-22 21:04:34

by Johannes Weiner

Subject: Re: [patch 3/3] mm: page_alloc: fair zone allocator policy

On Mon, Jul 22, 2013 at 04:21:07PM -0400, Rik van Riel wrote:
> On 07/19/2013 04:55 PM, Johannes Weiner wrote:
>
> >@@ -1984,7 +1992,8 @@ this_zone_full:
> > goto zonelist_scan;
> > }
> >
> >- if (page)
> >+ if (page) {
> >+ atomic_sub(1U << order, &zone->alloc_batch);
> > /*
> > * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
> > * necessary to allocate the page. The expectation is
>
> Could this be moved into the slow path in buffered_rmqueue and
> rmqueue_bulk, or would the effect of ignoring the pcp buffers be
> too detrimental to keeping the balance between zones?

What I'm worried about is not the inaccuracy that comes from the buffer
size but the fact that there are no guaranteed buffer empty+refill
cycles. The reclaimer could end up feeding the pcp list that the
allocator is using indefinitely, which brings us back to the original
problem. If you have >= NR_CPU jobs running, the kswapds are bound to
share CPUs with the allocating tasks, so the scenario is not unlikely.

> It would be kind of nice to not have this atomic operation on every
> page allocation...

No argument there ;)

> I like the overall approach though. This is something Linux has needed
> for a long time, and could be extremely useful to automatic NUMA
> balancing as well...
>
> --
> All rights reversed

2013-07-22 22:49:10

by Rik van Riel

Subject: Re: [patch 3/3] mm: page_alloc: fair zone allocator policy

On 07/22/2013 05:04 PM, Johannes Weiner wrote:
> On Mon, Jul 22, 2013 at 04:21:07PM -0400, Rik van Riel wrote:
>> On 07/19/2013 04:55 PM, Johannes Weiner wrote:
>>
>>> @@ -1984,7 +1992,8 @@ this_zone_full:
>>> goto zonelist_scan;
>>> }
>>>
>>> - if (page)
>>> + if (page) {
>>> + atomic_sub(1U << order, &zone->alloc_batch);
>>> /*
>>> * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
>>> * necessary to allocate the page. The expectation is
>>
>> Could this be moved into the slow path in buffered_rmqueue and
>> rmqueue_bulk, or would the effect of ignoring the pcp buffers be
>> too detrimental to keeping the balance between zones?
>
> What I'm worried about is not the inaccury that comes from the buffer
> size but the fact that there are no guaranteed buffer empty+refill
> cycles. The reclaimer could end up feeding the pcp list that the
> allocator is using indefinitely, which brings us back to the original
> problem. If you have >= NR_CPU jobs running, the kswapds are bound to
> share CPUs with the allocating tasks, so the scenario is not unlikely.

You are absolutely right. Thinking about it some more,
I cannot think of a better way to do this than your patch.

Reviewed-by: Rik van Riel <[email protected]>

--
All rights reversed

2013-07-24 11:18:40

by Zlatko Calusic

Subject: Re: [patch 0/3] mm: improve page aging fairness between zones/nodes

On 22.07.2013 18:48, Zlatko Calusic wrote:
> On 19.07.2013 22:55, Johannes Weiner wrote:
>> The way the page allocator interacts with kswapd creates aging
>> imbalances, where the amount of time a userspace page gets in memory
>> under reclaim pressure is dependent on which zone, which node the
>> allocator took the page frame from.
>>
>> #1 fixes missed kswapd wakeups on NUMA systems, which lead to some
>> nodes falling behind for a full reclaim cycle relative to the other
>> nodes in the system
>>
>> #3 fixes an interaction where kswapd and a continuous stream of page
>> allocations keep the preferred zone of a task between the high and
>> low watermark (allocations succeed + kswapd does not go to sleep)
>> indefinitely, completely underutilizing the lower zones and
>> thrashing on the preferred zone
>>
>> These patches are the aging fairness part of the thrash-detection
>> based file LRU balancing. Andrea recommended to submit them
>> separately as they are bugfixes in their own right.
>>
>
> I have the patch applied and under testing. So far, so good. It looks
> like it could finally fix the bug that I was chasing few months ago
> (nicely described in your bullet #3). But, few more days of testing will
> be needed before I can reach a quality verdict.
>

Well, only 2 days later it's already obvious that the patch is perfect! :)

In the attached image, the left column shows the graphs covering the
last day and a half. It can be observed that the zones are really
balanced and that aging is practically perfect. The graphs in the
right column cover the last 10-day period, and the left side of the
upper graph shows how it would look with the stock kernel after about
20 days of uptime (although only a few days is enough to reach such an
imbalance). File pages in the Normal zone are an extinct species (red)
and the zone is chock-full of anon pages (blue). Having seen a lot of
these graphs, I'm certain that it won't happen anymore with your patch
applied. The balance is restored! Thank you for your work. Feel free
to add:

Tested-by: Zlatko Calusic <[email protected]>

Regards,
--
Zlatko


Attachments:
screenshot1.png (77.71 kB)

2013-07-24 12:47:10

by Hush Bensen

[permalink] [raw]
Subject: Re: [patch 0/3] mm: improve page aging fairness between zones/nodes

On 2013/7/24 19:18, Zlatko Calusic wrote:
> On 22.07.2013 18:48, Zlatko Calusic wrote:
>> On 19.07.2013 22:55, Johannes Weiner wrote:
>>> The way the page allocator interacts with kswapd creates aging
>>> imbalances, where the amount of time a userspace page gets in memory
>>> under reclaim pressure is dependent on which zone, which node the
>>> allocator took the page frame from.
>>>
>>> #1 fixes missed kswapd wakeups on NUMA systems, which lead to some
>>> nodes falling behind for a full reclaim cycle relative to the other
>>> nodes in the system
>>>
>>> #3 fixes an interaction where kswapd and a continuous stream of page
>>> allocations keep the preferred zone of a task between the high and
>>> low watermark (allocations succeed + kswapd does not go to sleep)
>>> indefinitely, completely underutilizing the lower zones and
>>> thrashing on the preferred zone
>>>
>>> These patches are the aging fairness part of the thrash-detection
>>> based file LRU balancing. Andrea recommended to submit them
>>> separately as they are bugfixes in their own right.
>>>
>>
>> I have the patch applied and under testing. So far, so good. It looks
>> like it could finally fix the bug that I was chasing few months ago
>> (nicely described in your bullet #3). But, few more days of testing will
>> be needed before I can reach a quality verdict.
>>
>
> Well, only 2 days later it's already obvious that the patch is
> perfect! :)
>
> In the attached image, in the left column are the graphs covering last
> day and a half. It can be observed that zones are really balanced, and
> that aging is practically perfect. Graphs on the right column cover
> last 10 day period, and the left side of the upper graph shows how it
> would look with the stock kernel after about 20 day uptime (although
> only a few days is enough to reach such imbalance). File pages in the
> Normal zone are extinct species (red) and the zone is choke full of
> anon pages (blue). Having seen a lot of this graphs, I'm certain that
> it won't happen anymore with your patch applied. The balance is
> restored! Thank you for your work. Feel free to add:
>
> Tested-by: Zlatko Calusic <[email protected]>

Thanks for your testing, Zlatko. Could you tell me which benchmark or
workload you are using? Btw, which tool did you use to draw these nice
pictures? ;-)

>
> Regards,

2013-07-24 14:41:25

by Zlatko Calusic

Subject: Re: [patch 0/3] mm: improve page aging fairness between zones/nodes

On 24.07.2013 14:46, Hush Bensen wrote:
> On 2013/7/24 19:18, Zlatko Calusic wrote:
>> On 22.07.2013 18:48, Zlatko Calusic wrote:
>>> On 19.07.2013 22:55, Johannes Weiner wrote:
>>>> The way the page allocator interacts with kswapd creates aging
>>>> imbalances, where the amount of time a userspace page gets in memory
>>>> under reclaim pressure is dependent on which zone, which node the
>>>> allocator took the page frame from.
>>>>
>>>> #1 fixes missed kswapd wakeups on NUMA systems, which lead to some
>>>> nodes falling behind for a full reclaim cycle relative to the other
>>>> nodes in the system
>>>>
>>>> #3 fixes an interaction where kswapd and a continuous stream of page
>>>> allocations keep the preferred zone of a task between the high and
>>>> low watermark (allocations succeed + kswapd does not go to sleep)
>>>> indefinitely, completely underutilizing the lower zones and
>>>> thrashing on the preferred zone
>>>>
>>>> These patches are the aging fairness part of the thrash-detection
>>>> based file LRU balancing. Andrea recommended to submit them
>>>> separately as they are bugfixes in their own right.
>>>>
>>>
>>> I have the patch applied and under testing. So far, so good. It looks
>>> like it could finally fix the bug that I was chasing few months ago
>>> (nicely described in your bullet #3). But, few more days of testing will
>>> be needed before I can reach a quality verdict.
>>>
>>
>> Well, only 2 days later it's already obvious that the patch is
>> perfect! :)
>>
>> In the attached image, in the left column are the graphs covering last
>> day and a half. It can be observed that zones are really balanced, and
>> that aging is practically perfect. Graphs on the right column cover
>> last 10 day period, and the left side of the upper graph shows how it
>> would look with the stock kernel after about 20 day uptime (although
>> only a few days is enough to reach such imbalance). File pages in the
>> Normal zone are extinct species (red) and the zone is choke full of
>> anon pages (blue). Having seen a lot of this graphs, I'm certain that
>> it won't happen anymore with your patch applied. The balance is
>> restored! Thank you for your work. Feel free to add:
>>
>> Tested-by: Zlatko Calusic <[email protected]>
>
> Thanks for your testing Zlatko, could you tell me which benchmark or
> workload you are using? Btw, which tool is used to draw these nice
> pictures? ;-)
>

The workload is mixed (various services, light load). What creates the
biggest I/O load is a backup procedure that runs every evening. The
graphs are home-made: a little bit of rrd, a little bit of perl,
nothing too complex. I'm actually slowly getting rid of these extra
graphs, because I used them only for debugging this specific problem,
which is now fixed thanks to Johannes.

--
Zlatko

2013-07-25 06:51:18

by Paul Bolle

Subject: Re: [patch 3/3] mm: page_alloc: fair zone allocator policy

On 07/23/2013 04:21 AM, Rik van Riel wrote:
> On 07/19/2013 04:55 PM, Johannes Weiner wrote:
>
>> @@ -1984,7 +1992,8 @@ this_zone_full:
>> goto zonelist_scan;
>> }
>>
>> - if (page)
>> + if (page) {
>> + atomic_sub(1U << order, &zone->alloc_batch);
>> /*
>> * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
>> * necessary to allocate the page. The expectation is
>
> Could this be moved into the slow path in buffered_rmqueue and
> rmqueue_bulk, or would the effect of ignoring the pcp buffers be
> too detrimental to keeping the balance between zones?
>
> It would be kind of nice to not have this atomic operation on every
> page allocation...

Will an atomic operation lock the cache line or the memory bus? And
will cmpxchg lock the cache line or the memory bus? ;-)

>
> As a side benefit, higher-order buffered_rmqueue and rmqueue_bulk
> both happen under the zone->lock, so moving this accounting down
> to that layer might allow you to get rid of the atomics alltogether.
>
> I like the overall approach though. This is something Linux has needed
> for a long time, and could be extremely useful to automatic NUMA
> balancing as well...
>

2013-07-25 15:10:59

by Johannes Weiner

Subject: Re: [patch 3/3] mm: page_alloc: fair zone allocator policy

Hi Paul Bolle^W^W Sam Ben^W^W Hush Bensen^W^W Mtrr Patt^W^W Ric
Mason^W^W Will Huck^W^W Simon Jeons^W^W Jaeguk Hanse^W^W Ni zhan
Chen^W^W^W Wanpeng Li

[ I Cc'd Paul Bolle at [email protected] as well, his English was
better from there ]

On Thu, Jul 25, 2013 at 02:50:54PM +0800, Paul Bolle wrote:
> On 07/23/2013 04:21 AM, Rik van Riel wrote:
> >On 07/19/2013 04:55 PM, Johannes Weiner wrote:
> >
> >>@@ -1984,7 +1992,8 @@ this_zone_full:
> >> goto zonelist_scan;
> >> }
> >>
> >>- if (page)
> >>+ if (page) {
> >>+ atomic_sub(1U << order, &zone->alloc_batch);
> >> /*
> >> * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
> >> * necessary to allocate the page. The expectation is
> >
> >Could this be moved into the slow path in buffered_rmqueue and
> >rmqueue_bulk, or would the effect of ignoring the pcp buffers be
> >too detrimental to keeping the balance between zones?
> >
> >It would be kind of nice to not have this atomic operation on every
> >page allocation...
>
> atomic operation will lock cache line or memory bus? And cmpxchg
> will lock cache line or memory bus? ;-)

Sure, why not ;-)

2013-07-25 15:20:56

by Paul Bolle

Subject: Re: [patch 3/3] mm: page_alloc: fair zone allocator policy

On Thu, 2013-07-25 at 11:10 -0400, Johannes Weiner wrote:
> [ I Cc'd Paul Bolle at [email protected] as well, his English was
> better from there ]

Namespace collision! That's not me!

Could my double perhaps use another name, like say, Paul Bollee? Or
something even less confusing?

Thanks!


Paul Bolle

2013-07-26 22:45:39

by Andrew Morton

Subject: Re: [patch 0/3] mm: improve page aging fairness between zones/nodes

On Fri, 19 Jul 2013 16:55:22 -0400 Johannes Weiner <[email protected]> wrote:

> The way the page allocator interacts with kswapd creates aging
> imbalances, where the amount of time a userspace page gets in memory
> under reclaim pressure is dependent on which zone, which node the
> allocator took the page frame from.
>
> #1 fixes missed kswapd wakeups on NUMA systems, which lead to some
> nodes falling behind for a full reclaim cycle relative to the other
> nodes in the system
>
> #3 fixes an interaction where kswapd and a continuous stream of page
> allocations keep the preferred zone of a task between the high and
> low watermark (allocations succeed + kswapd does not go to sleep)
> indefinitely, completely underutilizing the lower zones and
> thrashing on the preferred zone
>
> These patches are the aging fairness part of the thrash-detection
> based file LRU balancing. Andrea recommended to submit them
> separately as they are bugfixes in their own right.
>
> The following test ran a foreground workload (memcachetest) with
> background IO of various sizes on a 4 node 8G system (similar results
> were observed with single-node 4G systems):
>
> parallelio
> BASE FAIRALLOC
> Ops memcachetest-0M 5170.00 ( 0.00%) 5283.00 ( 2.19%)
> Ops memcachetest-791M 4740.00 ( 0.00%) 5293.00 ( 11.67%)
> Ops memcachetest-2639M 2551.00 ( 0.00%) 4950.00 ( 94.04%)
> Ops memcachetest-4487M 2606.00 ( 0.00%) 3922.00 ( 50.50%)
> Ops io-duration-0M 0.00 ( 0.00%) 0.00 ( 0.00%)
> Ops io-duration-791M 55.00 ( 0.00%) 18.00 ( 67.27%)
> Ops io-duration-2639M 235.00 ( 0.00%) 103.00 ( 56.17%)
> Ops io-duration-4487M 278.00 ( 0.00%) 173.00 ( 37.77%)
> Ops swaptotal-0M 0.00 ( 0.00%) 0.00 ( 0.00%)
> Ops swaptotal-791M 245184.00 ( 0.00%) 0.00 ( 0.00%)
> Ops swaptotal-2639M 468069.00 ( 0.00%) 108778.00 ( 76.76%)
> Ops swaptotal-4487M 452529.00 ( 0.00%) 76623.00 ( 83.07%)
> Ops swapin-0M 0.00 ( 0.00%) 0.00 ( 0.00%)
> Ops swapin-791M 108297.00 ( 0.00%) 0.00 ( 0.00%)
> Ops swapin-2639M 169537.00 ( 0.00%) 50031.00 ( 70.49%)
> Ops swapin-4487M 167435.00 ( 0.00%) 34178.00 ( 79.59%)
> Ops minorfaults-0M 1518666.00 ( 0.00%) 1503993.00 ( 0.97%)
> Ops minorfaults-791M 1676963.00 ( 0.00%) 1520115.00 ( 9.35%)
> Ops minorfaults-2639M 1606035.00 ( 0.00%) 1799717.00 (-12.06%)
> Ops minorfaults-4487M 1612118.00 ( 0.00%) 1583825.00 ( 1.76%)
> Ops majorfaults-0M 6.00 ( 0.00%) 0.00 ( 0.00%)
> Ops majorfaults-791M 13836.00 ( 0.00%) 10.00 ( 99.93%)
> Ops majorfaults-2639M 22307.00 ( 0.00%) 6490.00 ( 70.91%)
> Ops majorfaults-4487M 21631.00 ( 0.00%) 4380.00 ( 79.75%)

A reminder whether positive numbers are good or bad would be useful ;)

> BASE FAIRALLOC
> User 287.78 460.97
> System 2151.67 3142.51
> Elapsed 9737.00 8879.34

Confused. Why would the amount of user time increase so much?

And that's a tremendous increase in system time. Am I interpreting
this correctly?

> BASE FAIRALLOC
> Minor Faults 53721925 57188551
> Major Faults 392195 15157
> Swap Ins 2994854 112770
> Swap Outs 4907092 134982
> Direct pages scanned 0 41824
> Kswapd pages scanned 32975063 8128269
> Kswapd pages reclaimed 6323069 7093495
> Direct pages reclaimed 0 41824
> Kswapd efficiency 19% 87%
> Kswapd velocity 3386.573 915.414
> Direct efficiency 100% 100%
> Direct velocity 0.000 4.710
> Percentage direct scans 0% 0%
> Zone normal velocity 2011.338 550.661
> Zone dma32 velocity 1365.623 369.221
> Zone dma velocity 9.612 0.242
> Page writes by reclaim 18732404.000 614807.000
> Page writes file 13825312 479825
> Page writes anon 4907092 134982
> Page reclaim immediate 85490 5647
> Sector Reads 12080532 483244
> Sector Writes 88740508 65438876
> Page rescued immediate 0 0
> Slabs scanned 82560 12160
> Direct inode steals 0 0
> Kswapd inode steals 24401 40013
> Kswapd skipped wait 0 0
> THP fault alloc 6 8
> THP collapse alloc 5481 5812
> THP splits 75 22
> THP fault fallback 0 0
> THP collapse fail 0 0
> Compaction stalls 0 54
> Compaction success 0 45
> Compaction failures 0 9
> Page migrate success 881492 82278
> Page migrate failure 0 0
> Compaction pages isolated 0 60334
> Compaction migrate scanned 0 53505
> Compaction free scanned 0 1537605
> Compaction cost 914 86
> NUMA PTE updates 46738231 41988419
> NUMA hint faults 31175564 24213387
> NUMA hint local faults 10427393 6411593
> NUMA pages migrated 881492 55344
> AutoNUMA cost 156221 121361

Some nice numbers there.

> The overall runtime was reduced, throughput for both the foreground
> workload as well as the background IO improved, major faults, swapping
> and reclaim activity shrunk significantly, reclaim efficiency more
> than quadrupled.

2013-07-26 22:53:21

by Andrew Morton

Subject: Re: [patch 1/3] mm: vmscan: fix numa reclaim balance problem in kswapd

On Fri, 19 Jul 2013 16:55:23 -0400 Johannes Weiner <[email protected]> wrote:

> When the page allocator fails to get a page from all zones in its
> given zonelist, it wakes up the per-node kswapds for all zones that
> are at their low watermark.
>
> However, with a system under load and the free page counters being
> per-cpu approximations, the observed counter value in a zone can
> fluctuate enough that the allocation fails but the kswapd wakeup is
> also skipped while the zone is still really close to the low
> watermark.
>
> When one node misses a wakeup like this, it won't be aged before all
> the other node's zones are down to their low watermarks again. And
> skipping a full aging cycle is an obvious fairness problem.
>
> Kswapd runs until the high watermarks are restored, so it should also
> be woken when the high watermarks are not met. This ages nodes more
> equally and creates a safety margin for the page counter fluctuation.

Well yes, but what guarantee is there that the per-cpu counter error
problem is reliably fixed? AFAICT this patch "fixes" it because the
gap between the low and high watermarks happens to be larger than the
per-cpu counter fluctuation, yes? If so, there are surely all sorts of
situations where it will break again.

To fix this reliably, we should be looking at constraining counter
batch sizes or performing a counter summation to get the more accurate
estimate?
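
For reference, the counter summation mentioned above would presumably
be something along the lines of zone_page_state_snapshot(), which
folds the unflushed per-cpu deltas back into the zone counter
(paraphrased from include/linux/vmstat.h of that era):

static inline unsigned long zone_page_state_snapshot(struct zone *zone,
                                        enum zone_stat_item item)
{
        long x = atomic_long_read(&zone->vm_stat[item]);
#ifdef CONFIG_SMP
        int cpu;

        /* fold in the per-cpu deltas that have not been flushed yet */
        for_each_online_cpu(cpu)
                x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];

        if (x < 0)
                x = 0;
#endif
        return x;
}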

2013-07-26 23:14:56

by Johannes Weiner

Subject: Re: [patch 0/3] mm: improve page aging fairness between zones/nodes

On Fri, Jul 26, 2013 at 03:45:33PM -0700, Andrew Morton wrote:
> On Fri, 19 Jul 2013 16:55:22 -0400 Johannes Weiner <[email protected]> wrote:
>
> > The way the page allocator interacts with kswapd creates aging
> > imbalances, where the amount of time a userspace page gets in memory
> > under reclaim pressure is dependent on which zone, which node the
> > allocator took the page frame from.
> >
> > #1 fixes missed kswapd wakeups on NUMA systems, which lead to some
> > nodes falling behind for a full reclaim cycle relative to the other
> > nodes in the system
> >
> > #3 fixes an interaction where kswapd and a continuous stream of page
> > allocations keep the preferred zone of a task between the high and
> > low watermark (allocations succeed + kswapd does not go to sleep)
> > indefinitely, completely underutilizing the lower zones and
> > thrashing on the preferred zone
> >
> > These patches are the aging fairness part of the thrash-detection
> > based file LRU balancing. Andrea recommended to submit them
> > separately as they are bugfixes in their own right.
> >
> > The following test ran a foreground workload (memcachetest) with
> > background IO of various sizes on a 4 node 8G system (similar results
> > were observed with single-node 4G systems):
> >
> > parallelio
> > BASE FAIRALLOC
> > Ops memcachetest-0M 5170.00 ( 0.00%) 5283.00 ( 2.19%)
> > Ops memcachetest-791M 4740.00 ( 0.00%) 5293.00 ( 11.67%)
> > Ops memcachetest-2639M 2551.00 ( 0.00%) 4950.00 ( 94.04%)
> > Ops memcachetest-4487M 2606.00 ( 0.00%) 3922.00 ( 50.50%)
> > Ops io-duration-0M 0.00 ( 0.00%) 0.00 ( 0.00%)
> > Ops io-duration-791M 55.00 ( 0.00%) 18.00 ( 67.27%)
> > Ops io-duration-2639M 235.00 ( 0.00%) 103.00 ( 56.17%)
> > Ops io-duration-4487M 278.00 ( 0.00%) 173.00 ( 37.77%)
> > Ops swaptotal-0M 0.00 ( 0.00%) 0.00 ( 0.00%)
> > Ops swaptotal-791M 245184.00 ( 0.00%) 0.00 ( 0.00%)
> > Ops swaptotal-2639M 468069.00 ( 0.00%) 108778.00 ( 76.76%)
> > Ops swaptotal-4487M 452529.00 ( 0.00%) 76623.00 ( 83.07%)
> > Ops swapin-0M 0.00 ( 0.00%) 0.00 ( 0.00%)
> > Ops swapin-791M 108297.00 ( 0.00%) 0.00 ( 0.00%)
> > Ops swapin-2639M 169537.00 ( 0.00%) 50031.00 ( 70.49%)
> > Ops swapin-4487M 167435.00 ( 0.00%) 34178.00 ( 79.59%)
> > Ops minorfaults-0M 1518666.00 ( 0.00%) 1503993.00 ( 0.97%)
> > Ops minorfaults-791M 1676963.00 ( 0.00%) 1520115.00 ( 9.35%)
> > Ops minorfaults-2639M 1606035.00 ( 0.00%) 1799717.00 (-12.06%)
> > Ops minorfaults-4487M 1612118.00 ( 0.00%) 1583825.00 ( 1.76%)
> > Ops majorfaults-0M 6.00 ( 0.00%) 0.00 ( 0.00%)
> > Ops majorfaults-791M 13836.00 ( 0.00%) 10.00 ( 99.93%)
> > Ops majorfaults-2639M 22307.00 ( 0.00%) 6490.00 ( 70.91%)
> > Ops majorfaults-4487M 21631.00 ( 0.00%) 4380.00 ( 79.75%)
>
> A reminder whether positive numbers are good or bad would be useful ;)

It depends on the datapoint, but a positive percentage number is an
improvement, a negative one a regression.

> > BASE FAIRALLOC
> > User 287.78 460.97
> > System 2151.67 3142.51
> > Elapsed 9737.00 8879.34
>
> Confused. Why would the amount of user time increase so much?
>
> And that's a tremendous increase in system time. Am I interpreting
> this correctly?

It is because each memcachetest is running for a fixed duration (only
the background IO is fixed in size). The time memcachetest previously
spent waiting on major faults is now spent doing actual work (more
user time, more syscalls). The number of operations memcachetest
could actually perform increased.

2013-07-29 17:48:30

by Andrea Arcangeli

Subject: Re: [patch 3/3] mm: page_alloc: fair zone allocator policy

Hi Johannes,

On Fri, Jul 19, 2013 at 04:55:25PM -0400, Johannes Weiner wrote:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index af1d956b..d938b67 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1879,6 +1879,14 @@ zonelist_scan:
> if (alloc_flags & ALLOC_NO_WATERMARKS)
> goto try_this_zone;
> /*
> + * Distribute pages in proportion to the individual
> + * zone size to ensure fair page aging. The zone a
> + * page was allocated in should have no effect on the
> + * time the page has in memory before being reclaimed.
> + */
> + if (atomic_read(&zone->alloc_batch) <= 0)
> + continue;
> + /*
> * When allocating a page cache page for writing, we
> * want to get it from a zone that is within its dirty
> * limit, such that no single zone holds more than its

I rebased the zone_reclaim_mode and compaction fixes on top of the
zone fair allocator (it applied without rejects, lucky), but the above
breaks zone_reclaim_mode (it regresses for pagecache too, which
currently works), so in turn my THP/compaction tests break too.

zone_reclaim_mode isn't LRU-fair, and cannot be... (even migrating
cache around nodes to try to keep LRU fairness would not be worth it,
especially with SSDs). But we can still increase the fairness within
the zones of the current node (for those nodes that have more than 1
zone).

I think to fix it we need an additional first pass of the fast path:
if alloc_batch is <= 0 for any zone in the current node, we then
forbid allocating from the zones not in the current node (even if
alloc_batch would allow it) during the first pass, only if
zone_reclaim_mode is enabled. If the first pass fails, we need to
reset alloc_batch for all zones in the current node (and only in the
current node), goto zonelist_scan and continue as we do now.

2013-07-29 22:24:49

by Johannes Weiner

Subject: Re: [patch 3/3] mm: page_alloc: fair zone allocator policy

On Mon, Jul 29, 2013 at 07:48:21PM +0200, Andrea Arcangeli wrote:
> Hi Johannes,
>
> On Fri, Jul 19, 2013 at 04:55:25PM -0400, Johannes Weiner wrote:
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index af1d956b..d938b67 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1879,6 +1879,14 @@ zonelist_scan:
> > if (alloc_flags & ALLOC_NO_WATERMARKS)
> > goto try_this_zone;
> > /*
> > + * Distribute pages in proportion to the individual
> > + * zone size to ensure fair page aging. The zone a
> > + * page was allocated in should have no effect on the
> > + * time the page has in memory before being reclaimed.
> > + */
> > + if (atomic_read(&zone->alloc_batch) <= 0)
> > + continue;
> > + /*
> > * When allocating a page cache page for writing, we
> > * want to get it from a zone that is within its dirty
> > * limit, such that no single zone holds more than its
>
> I rebased the zone_reclaim_mode and compaction fixes on top of the
> zone fair allocator (it applied without rejects, lucky) but the above
> breaks zone_reclaim_mode (it regress for pagecache too, which
> currently works), so then in turn my THP/compaction tests break too.

Ah yeah, we spill too eagerly to other nodes when we should try to
reclaim the local one first.

> zone_reclaim_mode isn't LRU-fair, and cannot be... (even migrating
> cache around nodes to try to keep LRU fariness would not be worth it,
> especially with ssds). But we can still increase the fairness within
> the zones of the current node (for those nodes that have more than 1
> zone).
>
> I think to fix it we need an additional first pass of the fast path,
> and if alloc_batch is <= 0 for any zone in the current node, we then
> forbid allocating from the zones not in the current node (even if
> alloc_batch would allow it) during the first pass, only if
> zone_reclaim_mode is enabled. If first pass fails, we need to reset
> alloc_batch for all zones in the current node (and only in the current
> zone), goto zonelist_scan and continue as we do now.

How sensible are the various settings of zone_reclaim_mode and the way
zone reclaim is invoked right now?

zone_reclaim_mode == 1 tries to reclaim clean page cache in the
preferred zone before falling back to other zones. Great, kswapd also
tries clean cache first and avoids writeout and swapping as long as
possible. And if zone_reclaim() fails, kswapd has to be woken up
anyway because filling remote zones without reclaiming them is hardly
sustainable. Using kswapd would have the advantage that it reclaims
the whole local node and not just the first zone in it, which would
make much more sense in the first place.

Same for zone_reclaim_mode at higher settings.

So could we cut all this short and just restrict any allocation with
zone_reclaim_mode != 0 to zones in reclaim distance, and if that
fails, wake kswapd and enter the slowpath? (ALLOC_WMARK_LOW marks the
fast path)

It would be even cooler to remove the zone_reclaim() call further
down, but that might be too much reliance on kswapd...

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f03d2f2..8ddf9ac 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1875,6 +1875,10 @@ zonelist_scan:
BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
if (alloc_flags & ALLOC_NO_WATERMARKS)
goto try_this_zone;
+ if ((alloc_flags & ALLOC_WMARK_LOW) &&
+ zone_reclaim_mode &&
+ !zone_allows_reclaim(preferred_zone, zone))
+ continue;
/*
* Distribute pages in proportion to the individual
* zone size to ensure fair page aging. The zone a

2013-07-30 17:45:49

by Johannes Weiner

Subject: Re: [patch 1/3] mm: vmscan: fix numa reclaim balance problem in kswapd

On Fri, Jul 26, 2013 at 03:53:19PM -0700, Andrew Morton wrote:
> On Fri, 19 Jul 2013 16:55:23 -0400 Johannes Weiner <[email protected]> wrote:
>
> > When the page allocator fails to get a page from all zones in its
> > given zonelist, it wakes up the per-node kswapds for all zones that
> > are at their low watermark.
> >
> > However, with a system under load and the free page counters being
> > per-cpu approximations, the observed counter value in a zone can
> > fluctuate enough that the allocation fails but the kswapd wakeup is
> > also skipped while the zone is still really close to the low
> > watermark.
> >
> > When one node misses a wakeup like this, it won't be aged before all
> > the other node's zones are down to their low watermarks again. And
> > skipping a full aging cycle is an obvious fairness problem.
> >
> > Kswapd runs until the high watermarks are restored, so it should also
> > be woken when the high watermarks are not met. This ages nodes more
> > equally and creates a safety margin for the page counter fluctuation.
>
> Well yes, but what guarantee is there that the per-cpu counter error
> problem is reliably fixed? AFAICT this patch "fixes" it because the
> gap between the low and high watermarks happens to be larger than the
> per-cpu counter fluctuation, yes? If so, there are surely all sorts of
> situations where it will break again.
>
> To fix this reliably, we should be looking at constraining counter
> batch sizes or performing a counter summation to get the more accurate
> estimate?

Thinking about this some more, the per-cpu fluctuation appears to be a
red herring in this case. Kswapd wakeup uses the safe version of the
watermark checking code. A percpu inaccuracy would result in the
allocator waking kswapd a little too early or too late, but once it
decides to wake it, we are getting an accurate picture on all zones.
The safe watermark checks are designed to catch percpu inaccuracies
between the low and the min watermark; the distance between low and
high is twice as big, so we should be good.
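
For reference, the safe check only pays for the expensive per-cpu
summation when the zone is close enough to its watermarks for the
drift to matter; roughly (paraphrased from mm/page_alloc.c of that
era):

bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
                            int classzone_idx, int alloc_flags)
{
        long free_pages = zone_page_state(z, NR_FREE_PAGES);

        /* near the watermarks, fall back to an accurate per-cpu summation */
        if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
                free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);

        return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
                                   free_pages);
}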

The fluctuation that makes individual zones miss full aging cycles
comes from true free page variation under load when the NOT OK
threshold for the allocator is the same value as the OK threshold for
skipping kswapd wakeups.

2013-07-31 09:33:44

by Zlatko Calusic

Subject: Re: [patch 0/3] mm: improve page aging fairness between zones/nodes

On 24.07.2013 13:18, Zlatko Calusic wrote:
> On 22.07.2013 18:48, Zlatko Calusic wrote:
>> On 19.07.2013 22:55, Johannes Weiner wrote:
>>> The way the page allocator interacts with kswapd creates aging
>>> imbalances, where the amount of time a userspace page gets in memory
>>> under reclaim pressure is dependent on which zone, which node the
>>> allocator took the page frame from.
>>>
>>> #1 fixes missed kswapd wakeups on NUMA systems, which lead to some
>>> nodes falling behind for a full reclaim cycle relative to the other
>>> nodes in the system
>>>
>>> #3 fixes an interaction where kswapd and a continuous stream of page
>>> allocations keep the preferred zone of a task between the high and
>>> low watermark (allocations succeed + kswapd does not go to sleep)
>>> indefinitely, completely underutilizing the lower zones and
>>> thrashing on the preferred zone
>>>
>>> These patches are the aging fairness part of the thrash-detection
>>> based file LRU balancing. Andrea recommended to submit them
>>> separately as they are bugfixes in their own right.
>>>
>>
>> I have the patch applied and under testing. So far, so good. It looks
>> like it could finally fix the bug that I was chasing a few months ago
>> (nicely described in your bullet #3). But a few more days of testing will
>> be needed before I can reach a quality verdict.
>>
>
> Well, only 2 days later it's already obvious that the patch is perfect! :)
>

Additionally, on the patched kernel kswapd burns 30% fewer CPU cycles.
It's nice to see that the restored balance also eases kswapd's job, but
that was to be expected. Measured on the real workload, twice, to be sure.

Regards,
--
Zlatko

2013-07-31 12:44:00

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 1/3] mm: vmscan: fix numa reclaim balance problem in kswapd

On Tue, Jul 30, 2013 at 01:45:39PM -0400, Johannes Weiner wrote:
> On Fri, Jul 26, 2013 at 03:53:19PM -0700, Andrew Morton wrote:
> > On Fri, 19 Jul 2013 16:55:23 -0400 Johannes Weiner <[email protected]> wrote:
> >
> > > When the page allocator fails to get a page from all zones in its
> > > given zonelist, it wakes up the per-node kswapds for all zones that
> > > are at their low watermark.
> > >
> > > However, with a system under load and the free page counters being
> > > per-cpu approximations, the observed counter value in a zone can
> > > fluctuate enough that the allocation fails but the kswapd wakeup is
> > > also skipped while the zone is still really close to the low
> > > watermark.
> > >
> > > When one node misses a wakeup like this, it won't be aged before all
> > > the other nodes' zones are down to their low watermarks again. And
> > > skipping a full aging cycle is an obvious fairness problem.
> > >
> > > Kswapd runs until the high watermarks are restored, so it should also
> > > be woken when the high watermarks are not met. This ages nodes more
> > > equally and creates a safety margin for the page counter fluctuation.
> >
> > Well yes, but what guarantee is there that the per-cpu counter error
> > problem is reliably fixed? AFAICT this patch "fixes" it because the
> > gap between the low and high watermarks happens to be larger than the
> > per-cpu counter fluctuation, yes? If so, there are surely all sorts of
> > situations where it will break again.
> >
> > To fix this reliably, we should be looking at constraining counter
> > batch sizes or performing a counter summation to get the more accurate
> > estimate?
>
> Thinking about this some more, the per-cpu fluctuation appears to be a
> red herring in this case. Kswapd wakeup uses the safe version of the
> watermark checking code. A percpu inaccuracy would result in the
> allocator waking kswapd a little too early or too late, but once it
> decides to wake it, we are getting an accurate picture on all zones.
> The safe watermark checks are designed to catch percpu inaccuracies
> between the low and the min watermark; the distance between low and
> high is twice as big, so we should be good.
>
> The fluctuation that makes individual zones miss full aging cycles
> comes from true free page variation under load when the NOT OK
> threshold for the allocator is the same value as the OK threshold for
> skipping kswapd wakeups.

Andrew, could you please update the patch to this new changelog with
the percpu counter drivel removed? Thanks!

---
When the page allocator fails to get a page from all zones in its
given zonelist, it wakes up the per-node kswapds for all zones that
are at their low watermark.

However, with a system under load the free pages in a zone can
fluctuate enough that the allocation fails but the kswapd wakeup is
also skipped while the zone is still really close to the low
watermark.

When one node misses a wakeup like this, it won't be aged before all
the other nodes' zones are down to their low watermarks again. And
skipping a full aging cycle is an obvious fairness problem.

Kswapd runs until the high watermarks are restored, so it should also
be woken when the high watermarks are not met. This ages nodes more
equally and creates a safety margin for the page counter fluctuation.

By using zone_balanced(), the wakeup check now also considers, in
addition to the watermark, whether compaction requires more order-0
pages to create a higher-order page.

Signed-off-by: Johannes Weiner <[email protected]>
---
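
As an aside, the zone_balanced() gate amounts to roughly the following
(illustrative model only, not the actual mm/vmscan.c code, which works
on the zone and uses the safe, summed per-cpu counters):

static bool zone_balanced_model(unsigned long free_pages,
                                unsigned long wmark_high,
                                int order, bool compaction_has_enough_order0)
{
        if (free_pages < wmark_high)
                return false;   /* below the high watermark: keep kswapd going */
        if (order > 0 && !compaction_has_enough_order0)
                return false;   /* compaction still needs more base pages */
        return true;
}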

2013-08-01 02:56:15

by Minchan Kim

[permalink] [raw]
Subject: Re: [patch 3/3] mm: page_alloc: fair zone allocator policy

Hi Hannes,

On Fri, Jul 19, 2013 at 04:55:25PM -0400, Johannes Weiner wrote:
> Each zone that holds userspace pages of one workload must be aged at a
> speed proportional to the zone size. Otherwise, the time an
> individual page gets to stay in memory depends on the zone it happened
> to be allocated in. Asymmetry in the zone aging creates rather
> unpredictable aging behavior and results in the wrong pages being
> reclaimed, activated etc.
>
> But exactly this happens right now because of the way the page
> allocator and kswapd interact. The page allocator uses per-node lists
> of all zones in the system, ordered by preference, when allocating a
> new page. When the first iteration does not yield any results, kswapd
> is woken up and the allocator retries. Due to the way kswapd reclaims
> zones below the high watermark while a zone can be allocated from when
> it is above the low watermark, the allocator may keep kswapd running
> while kswapd reclaim ensures that the page allocator can keep
> allocating from the first zone in the zonelist for extended periods of
> time. Meanwhile the other zones rarely see new allocations and thus
> get aged much slower in comparison.
>
> The result is that the occasional page placed in lower zones gets
> relatively more time in memory, and may even get promoted to the active
> list after its peers have long been evicted. Meanwhile, the bulk of the
> working set may be thrashing on the preferred zone even though there
> may be significant amounts of memory available in the lower zones.
>
> Even the most basic test -- repeatedly reading a file slightly bigger
> than memory -- shows how broken the zone aging is. In this scenario,
> no single page should be able to stay in memory long enough to get
> referenced twice and activated, but activation happens in spades:
>
> $ grep active_file /proc/zoneinfo
> nr_inactive_file 0
> nr_active_file 0
> nr_inactive_file 0
> nr_active_file 8
> nr_inactive_file 1582
> nr_active_file 11994
> $ cat data data data data >/dev/null
> $ grep active_file /proc/zoneinfo
> nr_inactive_file 0
> nr_active_file 70
> nr_inactive_file 258753
> nr_active_file 443214
> nr_inactive_file 149793
> nr_active_file 12021
>
> Fix this with a very simple round robin allocator. Each zone is
> allowed a batch of allocations that is proportional to the zone's
> size, after which it is treated as full. The batch counters are reset
> when all zones have been tried and the allocator enters the slowpath
> and kicks off kswapd reclaim:
>
> $ grep active_file /proc/zoneinfo
> nr_inactive_file 0
> nr_active_file 0
> nr_inactive_file 174
> nr_active_file 4865
> nr_inactive_file 53
> nr_active_file 860
> $ cat data data data data >/dev/null
> $ grep active_file /proc/zoneinfo
> nr_inactive_file 0
> nr_active_file 0
> nr_inactive_file 666622
> nr_active_file 4988
> nr_inactive_file 190969
> nr_active_file 937

First of all, thank you for this great work!
It's impressive, and I saw that Zlatko proved it improves real workloads.
Thanks, Zlatko, too!

I don't want to hold up merging, but I think we should at least
discuss a few issues.

The concern I have is that it could aggravate lowmem pinning problems
like mlock. I don't actually have a workload that pins lots of pages,
but that is why we introduced lowmem_reserve_ratio, as you know well,
so we should at least cover this issue.

My other concern is the overhead added to the fast path. We are
usually reluctant to add even a simple "if" condition there, but this
adds an atomic op on every page allocation and enters the slowpath
whenever all of the given zones' batch counts reach zero. Granted, it
is not really a slow path, because prepare_slowpath just resets the
batch counts and we return to normal operation without calling any
expensive functions.

I think it's a tradeoff, and I am biased towards your approach even
if we lose a little performance, because fair aging should more than
make up for the fastpath overhead. But who knows? Someone may have a
concern.

So we should at least mention these issues.

>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> include/linux/mmzone.h | 1 +
> mm/page_alloc.c | 39 +++++++++++++++++++++++++++++----------
> 2 files changed, 30 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index af4a3b7..0c41d59 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -352,6 +352,7 @@ struct zone {
> * free areas of different sizes
> */
> spinlock_t lock;
> + atomic_t alloc_batch;
> int all_unreclaimable; /* All pages pinned */
> #if defined CONFIG_COMPACTION || defined CONFIG_CMA
> /* Set to true when the PG_migrate_skip bits should be cleared */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index af1d956b..d938b67 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1879,6 +1879,14 @@ zonelist_scan:
> if (alloc_flags & ALLOC_NO_WATERMARKS)
> goto try_this_zone;
> /*
> + * Distribute pages in proportion to the individual
> + * zone size to ensure fair page aging. The zone a
> + * page was allocated in should have no effect on the
> + * time the page has in memory before being reclaimed.
> + */
> + if (atomic_read(&zone->alloc_batch) <= 0)
> + continue;
> + /*
> * When allocating a page cache page for writing, we
> * want to get it from a zone that is within its dirty
> * limit, such that no single zone holds more than its
> @@ -1984,7 +1992,8 @@ this_zone_full:
> goto zonelist_scan;
> }
>
> - if (page)
> + if (page) {
> + atomic_sub(1U << order, &zone->alloc_batch);
> /*
> * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
> * necessary to allocate the page. The expectation is
> @@ -1993,6 +2002,7 @@ this_zone_full:
> * for !PFMEMALLOC purposes.
> */
> page->pfmemalloc = !!(alloc_flags & ALLOC_NO_WATERMARKS);
> + }
>
> return page;
> }
> @@ -2342,16 +2352,20 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
> return page;
> }
>
> -static inline
> -void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
> - enum zone_type high_zoneidx,
> - enum zone_type classzone_idx)
> +static void prepare_slowpath(gfp_t gfp_mask, unsigned int order,
> + struct zonelist *zonelist,
> + enum zone_type high_zoneidx,
> + enum zone_type classzone_idx)
> {
> struct zoneref *z;
> struct zone *zone;
>
> - for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> - wakeup_kswapd(zone, order, classzone_idx);
> + for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
> + atomic_set(&zone->alloc_batch,
> + high_wmark_pages(zone) - low_wmark_pages(zone));
> + if (!(gfp_mask & __GFP_NO_KSWAPD))
> + wakeup_kswapd(zone, order, classzone_idx);
> + }
> }
>
> static inline int
> @@ -2447,9 +2461,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> goto nopage;
>
> restart:
> - if (!(gfp_mask & __GFP_NO_KSWAPD))
> - wake_all_kswapd(order, zonelist, high_zoneidx,
> - zone_idx(preferred_zone));
> + prepare_slowpath(gfp_mask, order, zonelist,
> + high_zoneidx, zone_idx(preferred_zone));
>
> /*
> * OK, we're below the kswapd watermark and have kicked background
> @@ -4758,6 +4771,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
> zone_seqlock_init(zone);
> zone->zone_pgdat = pgdat;
>
> + /* For bootup, initialized properly in watermark setup */
> + atomic_set(&zone->alloc_batch, zone->managed_pages);
> +
> zone_pcp_init(zone);
> lruvec_init(&zone->lruvec);
> if (!size)
> @@ -5533,6 +5549,9 @@ static void __setup_per_zone_wmarks(void)
> zone->watermark[WMARK_LOW] = min_wmark_pages(zone) + (tmp >> 2);
> zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + (tmp >> 1);
>
> + atomic_set(&zone->alloc_batch,
> + high_wmark_pages(zone) - low_wmark_pages(zone));
> +
> setup_zone_migrate_reserve(zone);
> spin_unlock_irqrestore(&zone->lock, flags);
> }
> --
> 1.8.3.2
>

--
Kind regards,
Minchan Kim

2013-08-01 04:31:56

by Rik van Riel

[permalink] [raw]
Subject: Re: [patch 3/3] mm: page_alloc: fair zone allocator policy

On 07/31/2013 10:56 PM, Minchan Kim wrote:

> Granted, it is not really a slow path, because prepare_slowpath just
> resets the batch counts and we return to normal operation without
> calling any expensive functions.
>
> I think it's a tradeoff, and I am biased towards your approach even
> if we lose a little performance, because fair aging should more than
> make up for the fastpath overhead. But who knows? Someone may have a
> concern.
>
> So we should at least mention these issues.

If the atomic operation in the fast path turns out to be a problem,
I suspect we may be able to fix it by using per-cpu counters, and
consolidating those every once in a while.

However, it may be good to see whether there is a problem in the
first place, before adding complexity.
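
As a rough illustration of that idea (a self-contained toy model, not
kernel code; a real implementation would use the kernel's percpu
counter machinery):

#include <stdbool.h>

#define NR_CPUS_MODEL   64
#define FOLD_THRESHOLD  32      /* fold per-cpu deltas after this many pages */

struct zone_batch_model {
        long    global;                         /* zone-wide allocation budget */
        long    pcpu_delta[NR_CPUS_MODEL];      /* per-cpu, touched without atomics */
};

/* fast path on @cpu: account @nr_pages and say whether the zone is still "open" */
static bool batch_account(struct zone_batch_model *b, int cpu, int nr_pages)
{
        b->pcpu_delta[cpu] += nr_pages;
        if (b->pcpu_delta[cpu] >= FOLD_THRESHOLD) {
                /* only the occasional fold touches the shared counter; in a
                   real implementation this would be the one atomic operation */
                b->global -= b->pcpu_delta[cpu];
                b->pcpu_delta[cpu] = 0;
        }
        /* approximate: may overshoot by up to NR_CPUS_MODEL * FOLD_THRESHOLD pages */
        return b->global > 0;
}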

--
All rights reversed

2013-08-01 15:51:27

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [patch 3/3] mm: page_alloc: fair zone allocator policy

On Thu, Aug 01, 2013 at 12:31:34AM -0400, Rik van Riel wrote:
> On 07/31/2013 10:56 PM, Minchan Kim wrote:
>
> > Granted, it is not really a slow path, because prepare_slowpath just
> > resets the batch counts and we return to normal operation without
> > calling any expensive functions.
> >
> > I think it's a tradeoff, and I am biased towards your approach even
> > if we lose a little performance, because fair aging should more than
> > make up for the fastpath overhead. But who knows? Someone may have a
> > concern.
> >
> > So we should at least mention these issues.
>
> If the atomic operation in the fast path turns out to be a problem,
> I suspect we may be able to fix it by using per-cpu counters, and
> consolidating those every once in a while.
>
> However, it may be good to see whether there is a problem in the
> first place, before adding complexity.

prepare_slowpath is racy anyway, so I don't see the point in wanting
to do atomic ops in the first place.

It would be much better to do the decrement in assembly in a single
insn to reduce the window, but there's no point in using the lock
prefix when atomic_set() will clobber the counter anyway while somebody
else is doing atomic_dec() underneath it: the reset is a plain store,
so it overwrites even a perfectly accurately accounted atomic_dec().

I think it should be done in C without atomic_dec, or by introducing an
__atomic_dec that doesn't use the lock prefix and just tries to do the
decrement in a single asm insn (or as few as possible).
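
A minimal sketch of that, assuming we accept the occasional lost
update (illustrative only; not the form a real __atomic_dec would
take):

/*
 * Plain, non-locked read-modify-write on the batch counter.  Concurrent
 * allocators can occasionally lose an update, which is tolerable here
 * because the counter is only a fairness hint and prepare_slowpath()
 * rewrites it wholesale anyway.  On x86 this can compile down to a
 * single memory-operand sub without the lock prefix.
 */
static inline void alloc_batch_sub(long *batch, unsigned int nr_pages)
{
        *batch -= nr_pages;
}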

And because the whole thing is racy, after prepare_slowpath (which I
think is too generic a name and should be renamed
reset_zonelist_alloc_batch), this last attempt before invoking
reclaim:

/* This is the last chance, in general, before the goto nopage. */
page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
preferred_zone, migratetype);

which is trying again with the MIN wmark, should also be altered so
that we do one pass totally disregarding alloc_batch, just before
declaring MIN-wmark failure for the allocation and invoking direct
reclaim.

Either we add ALLOC_NO_ALLOC_BATCH to the alloc_flags in the above
call, or we run a second get_page_from_freelist with
ALLOC_NO_ALLOC_BATCH set if the first one fails.

Otherwise, because of the inherently racy behavior of the alloc_batch
logic, we will eventually have enough hugepage allocations on enough
CPUs to hit the alloc_batch limit again between prepare_slowpath and
the above get_page_from_freelist, which will lead to spurious
direct reclaim invocations when it was really just a false negative of
alloc_batch.
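
The second option would look roughly like this sketch
(ALLOC_NO_ALLOC_BATCH is only a proposed flag here; it does not exist
in the tree):

        /* last attempt at the MIN watermark, still honoring alloc_batch */
        page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
                        high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
                        preferred_zone, migratetype);
        if (!page)
                /*
                 * Retry once ignoring the (racy) allocation batches, so a
                 * false negative of alloc_batch alone can never push us
                 * into direct reclaim.
                 */
                page = get_page_from_freelist(gfp_mask, nodemask, order,
                                zonelist, high_zoneidx,
                                (alloc_flags & ~ALLOC_NO_WATERMARKS) |
                                        ALLOC_NO_ALLOC_BATCH,
                                preferred_zone, migratetype);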

2013-08-01 19:58:37

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 3/3] mm: page_alloc: fair zone allocator policy

On Thu, Aug 01, 2013 at 05:51:11PM +0200, Andrea Arcangeli wrote:
> On Thu, Aug 01, 2013 at 12:31:34AM -0400, Rik van Riel wrote:
> > On 07/31/2013 10:56 PM, Minchan Kim wrote:
> >
> > > Granted, it is not really a slow path, because prepare_slowpath just
> > > resets the batch counts and we return to normal operation without
> > > calling any expensive functions.
> > >
> > > I think it's a tradeoff, and I am biased towards your approach even
> > > if we lose a little performance, because fair aging should more than
> > > make up for the fastpath overhead. But who knows? Someone may have a
> > > concern.
> > >
> > > So we should at least mention these issues.
> >
> > If the atomic operation in the fast path turns out to be a problem,
> > I suspect we may be able to fix it by using per-cpu counters, and
> > consolidating those every once in a while.
> >
> > However, it may be good to see whether there is a problem in the
> > first place, before adding complexity.
>
> prepare_slowpath is racy anyway, so I don't see the point in wanting
> to do atomic ops in the first place.
>
> Much better to do the increment in assembly in a single insn to reduce
> the window, but there's no point in using the lock prefix when
> atomic_set will screw it over while somebody does atomic_dec under it
> anyway. atomic_set cannot be atomic, so it'll screw over any
> perfectly accurate accounted atomic_dec anyway.
>
> I think it should be done in C without atomic_dec or by introducing an
> __atomic_dec that isn't using the lock prefix and just tries to do the
> dec in a single (or as few as possible) asm insn.

Well, the reset is not synchronized with the atomic_dec(), but we are
guaranteed that after n page allocations we wake up kswapd, whereas
without atomic ops we might miss a couple of allocations when
allocators race during the RMW.

But we might be able to get away with a small error.

> And because the whole thing is racy, after prepare_slowpath (which I
> think is a too generic name and should be renamed
> reset_zonelist_alloc_batch), this last attempt before invoking
> reclaim:
>
> /* This is the last chance, in general, before the goto nopage. */
> page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
> high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
> preferred_zone, migratetype);
>
> which is trying again with the MIN wmark, should be also altered so
> that we do one pass totally disregarding alloc_batch, just before
> declaring MIN-wmark failure in the allocation, and invoking direct
> reclaim.
>
> Either we add a (alloc_flags | ALLOC_NO_ALLOC_BATCH) to the above
> alloc_flags in the above call, or we run a second
> get_page_from_freelist if the above one fails with
> ALLOC_NO_ALLOC_BATCH set.
>
> Otherwise because of the inherent racy behavior of the alloc_batch
> logic, what will happen is that we'll eventually have enough hugepage
> allocations in enough CPUs to hit the alloc_batch limit again in
> between the prepare_slowpath and the above get_page_from_freelist,
> that will lead us to call spurious direct_reclaim invocations, when it
> was just a false negative of alloc_batch.

So the discussion diverged between on-list and off-list. I suggested
ignoring the alloc_batch in the slowpath completely (since allocations
are very unlikely to fail due to the batches immediately after they
were reset, and the occasional glitch does not matter). This should
solve the problem of spurious direct reclaim invocations. The
zone_reclaim_mode allocation sequence would basically be:

1. Try zones in the local node with low watermark while maintaining
fairness with the allocation batch and invoking zone reclaim if
necessary

2. If that fails, wake up kswapd and reset the alloc batches on the
local node

3. Try zones in order of preference with the min watermark, invoking
zone reclaim if necessary, ignoring allocation batches.

So in the fastpath we try to be local and fair, in the slow path we
are likely to be local and fair, unless we have to spill into remote
nodes. But then there is nothing we can do anyway.
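
In pseudocode, that sequence would look something like the sketch
below (all helper names are made up for illustration):

struct page *zone_reclaim_mode_alloc_sketch(void)
{
        struct page *page;

        /* 1. fast path: local zones only, low watermark, batches enforced,
           zone reclaim invoked as needed */
        page = try_local_zones(WMARK_LOW, /* enforce_batch */ true);
        if (page)
                return page;

        /* 2. wake kswapd and reset the allocation batches on the local node */
        wake_kswapd_and_reset_local_batches();

        /* 3. slow path: all zones in preference order, min watermark, batches
           ignored so a racy batch reset cannot force spurious direct reclaim */
        return try_all_zones(WMARK_MIN, /* enforce_batch */ false);
}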

2013-08-01 22:16:41

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [patch 3/3] mm: page_alloc: fair zone allocator policy

On Thu, Aug 01, 2013 at 03:58:23PM -0400, Johannes Weiner wrote:
> But we might be able to get away with a small error.

The idea is that there's a small error anyway, because multiple CPUs
can reset the counter at the same time while another CPU is decreasing
it, so a decrement may sometimes get lost regardless of whether it is
atomic or not. How much worse the error becomes I don't know for sure,
but I would expect it to stay at noise level.

Removing the locked op won't help much when multiple processes allocate
from the same node, but it will speed things up when just one CPU is
running.

> So the discussion diverged between on-list and off-list. I suggested
> ignoring the alloc_batch in the slowpath completely (since allocations
> are very unlikely to fail due to the batches immediately after they
> were reset, and the occasional glitch does not matter). This should
> solve the problem of spurious direct reclaim invocations. The

I'm fine with this solution of ignoring alloc_batch in the slowpath;
it definitely removes the risk of spurious direct reclaim and it
contributes to solving zone_reclaim_mode too.

> zone_reclaim_mode allocation sequence would basically be:
>
> 1. Try zones in the local node with low watermark while maintaining
> fairness with the allocation batch and invoking zone reclaim if
> necessary
>
> 2. If that fails, wake up kswapd and reset the alloc batches on the
> local node
>
> 3. Try zones in order of preference with the min watermark, invoking
> zone reclaim if necessary, ignoring allocation batches.

One difference is that kswapd wasn't involved before. But maybe it's
beneficial.

> So in the fastpath we try to be local and fair, in the slow path we
> are likely to be local and fair, unless we have to spill into remote
> nodes. But then there is nothing we can do anyway.

I think it should work well.

And with these changes, alloc_batch will help zone_reclaim_mode
tremendously too (not just avoid breaking it): the allocations will be
spread evenly over multiple zones of the same node, instead of
insisting on calling zone_reclaim on the highest zone of the node until
zone_reclaim finally fails on it (with the current behavior, if
zone_reclaim_mode is enabled, only the highest zone's LRU is rotated
until it is full). And it's common to have at least one node with two
zones (the pci32 zone).

Skipping any non-local zone in the fast path when zone_reclaim_mode
is enabled will prevent CPUs on different nodes from stepping on each
other's toes, and with the zone_local check introduced in
prepare_slowpath, alloc_batch will never be reset by remote CPUs.

Changing prepare_slowpath to reset only the local node will also
reduce cross-node interconnect traffic on very big NUMA machines, where
zone_reclaim_mode is enabled by default, so it will speed up
prepare_slowpath too.
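
Concretely, that would be a small change on top of the patch's
prepare_slowpath(), along these lines (a sketch only; zone_local() is
assumed here to be a simple same-node check, and prepare_slowpath would
need the preferred zone passed in):

static void prepare_slowpath(gfp_t gfp_mask, unsigned int order,
                             struct zonelist *zonelist,
                             enum zone_type high_zoneidx,
                             struct zone *preferred_zone,
                             enum zone_type classzone_idx)
{
        struct zoneref *z;
        struct zone *zone;

        for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
                /* only reset batches and wake kswapd on the local node */
                if (!zone_local(preferred_zone, zone))
                        continue;
                atomic_set(&zone->alloc_batch,
                           high_wmark_pages(zone) - low_wmark_pages(zone));
                if (!(gfp_mask & __GFP_NO_KSWAPD))
                        wakeup_kswapd(zone, order, classzone_idx);
        }
}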

Can you submit the fixes to Andrew? I'm very happy with your patch, so
then I can submit the compaction zone_reclaim_mode fixes on top of
it.

Changing topic to my zone_reclaim compaction fixes: I hope to get more
review on those, especially the high-order watermark algorithm
improvements. I'll also try to get a specsfs run with jumbo frames on
them to see if there's any measurable benefit from the watermark
changes and the substantial improvement in compaction reliability.
(The increase of the low watermark level for high-order pages is
needed to get more accuracy out of zone_reclaim_mode hugepage
placement, and the decrease of the min watermark level saves us from
wasting precious CPU generating a zillion high-order pages that cannot
possibly be useful to PF_MEMALLOC. We currently have a ton of 8k pages
that any GFP_ATOMIC or GFP_KERNEL allocation is forbidden to use, yet
we still generate them for nothing, and over time we split them right
away at the first 4k allocation so they don't even stick.)

Thanks!
Andrea

2013-08-02 06:22:24

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 3/3] mm: page_alloc: fair zone allocator policy

On Thu, Aug 01, 2013 at 11:56:36AM +0900, Minchan Kim wrote:
> Hi Hannes,
>
> On Fri, Jul 19, 2013 at 04:55:25PM -0400, Johannes Weiner wrote:
> > Each zone that holds userspace pages of one workload must be aged at a
> > speed proportional to the zone size. Otherwise, the time an
> > individual page gets to stay in memory depends on the zone it happened
> > to be allocated in. Asymmetry in the zone aging creates rather
> > unpredictable aging behavior and results in the wrong pages being
> > reclaimed, activated etc.
> >
> > But exactly this happens right now because of the way the page
> > allocator and kswapd interact. The page allocator uses per-node lists
> > of all zones in the system, ordered by preference, when allocating a
> > new page. When the first iteration does not yield any results, kswapd
> > is woken up and the allocator retries. Due to the way kswapd reclaims
> > zones below the high watermark while a zone can be allocated from when
> > it is above the low watermark, the allocator may keep kswapd running
> > while kswapd reclaim ensures that the page allocator can keep
> > allocating from the first zone in the zonelist for extended periods of
> > time. Meanwhile the other zones rarely see new allocations and thus
> > get aged much slower in comparison.
> >
> > The result is that the occasional page placed in lower zones gets
> > relatively more time in memory, and may even get promoted to the active
> > list after its peers have long been evicted. Meanwhile, the bulk of the
> > working set may be thrashing on the preferred zone even though there
> > may be significant amounts of memory available in the lower zones.
> >
> > Even the most basic test -- repeatedly reading a file slightly bigger
> > than memory -- shows how broken the zone aging is. In this scenario,
> > no single page should be able to stay in memory long enough to get
> > referenced twice and activated, but activation happens in spades:
> >
> > $ grep active_file /proc/zoneinfo
> > nr_inactive_file 0
> > nr_active_file 0
> > nr_inactive_file 0
> > nr_active_file 8
> > nr_inactive_file 1582
> > nr_active_file 11994
> > $ cat data data data data >/dev/null
> > $ grep active_file /proc/zoneinfo
> > nr_inactive_file 0
> > nr_active_file 70
> > nr_inactive_file 258753
> > nr_active_file 443214
> > nr_inactive_file 149793
> > nr_active_file 12021
> >
> > Fix this with a very simple round robin allocator. Each zone is
> > allowed a batch of allocations that is proportional to the zone's
> > size, after which it is treated as full. The batch counters are reset
> > when all zones have been tried and the allocator enters the slowpath
> > and kicks off kswapd reclaim:
> >
> > $ grep active_file /proc/zoneinfo
> > nr_inactive_file 0
> > nr_active_file 0
> > nr_inactive_file 174
> > nr_active_file 4865
> > nr_inactive_file 53
> > nr_active_file 860
> > $ cat data data data data >/dev/null
> > $ grep active_file /proc/zoneinfo
> > nr_inactive_file 0
> > nr_active_file 0
> > nr_inactive_file 666622
> > nr_active_file 4988
> > nr_inactive_file 190969
> > nr_active_file 937
>
> First of all, thank you for this great work!
> It's impressive, and I saw that Zlatko proved it improves real workloads.
> Thanks, Zlatko, too!
>
> I don't want to hold up merging, but I think we should at least
> discuss a few issues.
>
> The concern I have is that it could aggravate lowmem pinning problems
> like mlock. I don't actually have a workload that pins lots of pages,
> but that is why we introduced lowmem_reserve_ratio, as you know well,
> so we should at least cover this issue.

We are not actually using the lower zones more in terms of number of
pages at any given time, we are just using them more in terms of
allocation and reclaim rate.

Consider how the page allocator works: it will first fill up the
preferred zone, then the next best zone, etc. and then when all of
them are full, it will wake up kswapd, which will only reclaim them
back to the high watermark + balance gap.

After my change, all zones will be at the high watermarks + balance
gap during idle and between the low and high marks under load. The
change is marginal.

The lowmem reserve makes sure, before and after, that a sizable
portion of low memory is reserved for non-user allocations.
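
For reference, the lowmem reserve feeds into the watermark check
roughly like this (a simplified model, not the exact zone_watermark_ok()
code):

/*
 * Simplified model: an allocation whose preferred class is a higher zone
 * must leave zone->lowmem_reserve[classzone_idx] pages untouched in this
 * lower zone, on top of the watermark itself.
 */
static bool lower_zone_usable(unsigned long free_pages,
                              unsigned long watermark,
                              unsigned long lowmem_reserve_for_class)
{
        return free_pages > watermark + lowmem_reserve_for_class;
}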

> My other concern is the overhead added to the fast path. We are
> usually reluctant to add even a simple "if" condition there, but this
> adds an atomic op on every page allocation and enters the slowpath
> whenever all of the given zones' batch counts reach zero. Granted, it
> is not really a slow path, because prepare_slowpath just resets the
> batch counts and we return to normal operation without calling any
> expensive functions.

Yes, no direct reclaim should occur. I know that it comes at a cost,
but the alternative is broken page reclaim, page cache thrashing, etc.
I don't see a way around it...

Thanks for your input, Minchan!

2013-08-02 07:31:40

by Minchan Kim

[permalink] [raw]
Subject: Re: [patch 3/3] mm: page_alloc: fair zone allocator policy

On Fri, Aug 02, 2013 at 02:22:08AM -0400, Johannes Weiner wrote:
> On Thu, Aug 01, 2013 at 11:56:36AM +0900, Minchan Kim wrote:
> > Hi Hannes,
> >
> > On Fri, Jul 19, 2013 at 04:55:25PM -0400, Johannes Weiner wrote:
> > > Each zone that holds userspace pages of one workload must be aged at a
> > > speed proportional to the zone size. Otherwise, the time an
> > > individual page gets to stay in memory depends on the zone it happened
> > > to be allocated in. Asymmetry in the zone aging creates rather
> > > unpredictable aging behavior and results in the wrong pages being
> > > reclaimed, activated etc.
> > >
> > > But exactly this happens right now because of the way the page
> > > allocator and kswapd interact. The page allocator uses per-node lists
> > > of all zones in the system, ordered by preference, when allocating a
> > > new page. When the first iteration does not yield any results, kswapd
> > > is woken up and the allocator retries. Due to the way kswapd reclaims
> > > zones below the high watermark while a zone can be allocated from when
> > > it is above the low watermark, the allocator may keep kswapd running
> > > while kswapd reclaim ensures that the page allocator can keep
> > > allocating from the first zone in the zonelist for extended periods of
> > > time. Meanwhile the other zones rarely see new allocations and thus
> > > get aged much slower in comparison.
> > >
> > > The result is that the occasional page placed in lower zones gets
> > > relatively more time in memory, and may even get promoted to the active
> > > list after its peers have long been evicted. Meanwhile, the bulk of the
> > > working set may be thrashing on the preferred zone even though there
> > > may be significant amounts of memory available in the lower zones.
> > >
> > > Even the most basic test -- repeatedly reading a file slightly bigger
> > > than memory -- shows how broken the zone aging is. In this scenario,
> > > no single page should be able to stay in memory long enough to get
> > > referenced twice and activated, but activation happens in spades:
> > >
> > > $ grep active_file /proc/zoneinfo
> > > nr_inactive_file 0
> > > nr_active_file 0
> > > nr_inactive_file 0
> > > nr_active_file 8
> > > nr_inactive_file 1582
> > > nr_active_file 11994
> > > $ cat data data data data >/dev/null
> > > $ grep active_file /proc/zoneinfo
> > > nr_inactive_file 0
> > > nr_active_file 70
> > > nr_inactive_file 258753
> > > nr_active_file 443214
> > > nr_inactive_file 149793
> > > nr_active_file 12021
> > >
> > > Fix this with a very simple round robin allocator. Each zone is
> > > allowed a batch of allocations that is proportional to the zone's
> > > size, after which it is treated as full. The batch counters are reset
> > > when all zones have been tried and the allocator enters the slowpath
> > > and kicks off kswapd reclaim:
> > >
> > > $ grep active_file /proc/zoneinfo
> > > nr_inactive_file 0
> > > nr_active_file 0
> > > nr_inactive_file 174
> > > nr_active_file 4865
> > > nr_inactive_file 53
> > > nr_active_file 860
> > > $ cat data data data data >/dev/null
> > > $ grep active_file /proc/zoneinfo
> > > nr_inactive_file 0
> > > nr_active_file 0
> > > nr_inactive_file 666622
> > > nr_active_file 4988
> > > nr_inactive_file 190969
> > > nr_active_file 937
> >
> > First of all, thank you for this great work!
> > It's impressive, and I saw that Zlatko proved it improves real workloads.
> > Thanks, Zlatko, too!
> >
> > I don't want to hold up merging, but I think we should at least
> > discuss a few issues.
> >
> > The concern I have is that it could aggravate lowmem pinning problems
> > like mlock. I don't actually have a workload that pins lots of pages,
> > but that is why we introduced lowmem_reserve_ratio, as you know well,
> > so we should at least cover this issue.
>
> We are not actually using the lower zones more in terms of number of
> pages at any given time, we are just using them more in terms of
> allocation and reclaim rate.
>
> Consider how the page allocator works: it will first fill up the
> preferred zone, then the next best zone, etc. and then when all of
> them are full, it will wake up kswapd, which will only reclaim them
> back to the high watermark + balance gap.
>
> After my change, all zones will be at the high watermarks + balance
> gap during idle and between the low and high marks under load. The
> change is marginal.

Sorry, but I couldn't follow. The old logic uses the lower zones only
after the higher zones are full, so it is unlikely to end up with
mlocked pages in the lower zones. With your change, the lower zones are
used even though the higher zones aren't full, so the chance of mlocked
pages sitting in the lower zones is higher than before.

Yes, if we think long term, where all zones are full and the reclaimer
keeps reclaiming until all zones reach the high watermark + balance
gap, the change would be marginal, as you said. But that's not to say
we can ignore the former case, I think.

>
> The lowmem reserve makes sure, before and after, that a sizable
> portion of low memory is reserved for non-user allocations.
>
> > My other concern is the overhead added to the fast path. We are
> > usually reluctant to add even a simple "if" condition there, but this
> > adds an atomic op on every page allocation and enters the slowpath
> > whenever all of the given zones' batch counts reach zero. Granted, it
> > is not really a slow path, because prepare_slowpath just resets the
> > batch counts and we return to normal operation without calling any
> > expensive functions.
>
> Yes, no direct reclaim should occur. I know that it comes at a cost,
> but the alternative is broken page reclaim, page cache thrashing, etc.
> I don't see a way around it...

That's why I support your solution. :)

--
Kind regards,
Minchan Kim