Subject: [RFC][PATCH] make global bitlock waitqueues per-node
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, Dave Hansen <dave.hansen@linux.intel.com>,
        agruenba@redhat.com, rpeterso@redhat.com, mgorman@techsingularity.net,
        peterz@infradead.org, luto@kernel.org, swhiteho@redhat.com,
        torvalds@linux-foundation.org
From: Dave Hansen <dave.hansen@linux.intel.com>
Date: Mon, 19 Dec 2016 14:58:26 -0800
Message-Id: <20161219225826.F8CB356F@viggo.jf.intel.com>
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 7621
Lines: 204


I saw a 4.8->4.9 regression (details below) that I attributed to:

	9dcb8b685f mm: remove per-zone hashtable of bitlock waitqueues

That commit took the bitlock waitqueues from being dynamically-allocated
per-zone to being statically allocated and global.  As suggested by
Linus, this makes them per-node, but keeps them statically-allocated.

It leaves us with more waitqueues than the global approach, inherently
scales it up as we gain nodes, and avoids generating code for
page_zone() which was evidently quite ugly.  The patch is pretty darn
tiny too.

This turns what was a ~40% 4.8->4.9 regression into a 17% gain over
what on 4.8 did.  That gain is a _bit_ surprising, but not entirely
unexpected since we now get much simpler code from no page_zone() and a
fixed-size array for which we don't have to follow a pointer (and get to
do power-of-2 math).

This boots in a small VM and on a multi-node NUMA system, but has not
been tested widely.  It definitely needs to be validated to make sure
it properly initialzes the new structure in the various node hotplug
cases before anybody goes applying it.

Original report below (sent privately by accident).

 include/linux/mmzone.h |    5 +++++
 kernel/sched/core.c    |   16 ----------------
 mm/filemap.c           |   22 +++++++++++++++++++++-
 mm/page_alloc.c        |    5 +++++
 4 files changed, 31 insertions(+), 17 deletions(-)

---

I'm seeing a 4.8->4.9 regression:

		https://www.sr71.net/~dave/intel/2016-12-19-page_fault_3.processes.png

This is on a 160-thread 8-socket system.  The workloads is a bunch of
processes faulting in pages to separate MAP_SHARED mappings:

	https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault3.c

The smoking gun in the profiles is that __wake_up_bit() (called via
unlock_page()) goes from ~1% to 4% in the profiles.

The workload is pretty maniacally touching random parts of the
waitqueue, and missing the cache heavily.  Now that it is shared, I
suspect we're transferring cachelines across node boundaries in a way
that we were not with the per-zone waitqueues.

This is *only* showing up with MAP_SHARED pages, not with anonymous
pages.  I think we do a lock_page()/unlock_page() pair in
do_shared_fault(), which we avoid in the anonymous case.  Reverting:

	9dcb8b685f mm: remove per-zone hashtable of bitlock waitqueues

restores things to 4.8 behavior.  The fact that automated testing didn't
catch this probably means that it's pretty rare to find hardware that
actually shows the problem, so I don't think it's worth reverting
anything in mainline.

ut, the commit says:

>     As part of that earlier discussion, we had a much better solution for
>     the NUMA scalability issue - by just making the page lock have a
>     separate contention bit, the waitqueue doesn't even have to be looked at
>     for the normal case.

So, maybe we should do that moving forward since we at least found one
case that's pretty aversely affected.

Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>

---

 b/include/linux/mmzone.h |    5 +++++
 b/kernel/sched/core.c    |   16 ----------------
 b/mm/filemap.c           |   22 +++++++++++++++++++++-
 b/mm/page_alloc.c        |    5 +++++
 4 files changed, 31 insertions(+), 17 deletions(-)

diff -puN include/linux/mmzone.h~static-per-zone-waitqueue include/linux/mmzone.h
--- a/include/linux/mmzone.h~static-per-zone-waitqueue	2016-12-19 11:35:12.210823059 -0800
+++ b/include/linux/mmzone.h	2016-12-19 13:11:53.335170271 -0800
@@ -27,6 +27,9 @@
 #endif
 #define MAX_ORDER_NR_PAGES (1 << (MAX_ORDER - 1))
 
+#define WAIT_TABLE_BITS 8
+#define WAIT_TABLE_SIZE (1 << WAIT_TABLE_BITS)
+
 /*
  * PAGE_ALLOC_COSTLY_ORDER is the order at which allocations are deemed
  * costly to service.  That is between allocation orders which should
@@ -662,6 +665,8 @@ typedef struct pglist_data {
 	unsigned long		min_slab_pages;
 #endif /* CONFIG_NUMA */
 
+	wait_queue_head_t	wait_table[WAIT_TABLE_SIZE];
+
 	/* Write-intensive fields used by page reclaim */
 	ZONE_PADDING(_pad1_)
 	spinlock_t		lru_lock;
diff -puN kernel/sched/core.c~static-per-zone-waitqueue kernel/sched/core.c
--- a/kernel/sched/core.c~static-per-zone-waitqueue	2016-12-19 11:35:12.212823149 -0800
+++ b/kernel/sched/core.c	2016-12-19 11:35:12.225823738 -0800
@@ -7509,27 +7509,11 @@ static struct kmem_cache *task_group_cac
 DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
 DECLARE_PER_CPU(cpumask_var_t, select_idle_mask);
 
-#define WAIT_TABLE_BITS 8
-#define WAIT_TABLE_SIZE (1 << WAIT_TABLE_BITS)
-static wait_queue_head_t bit_wait_table[WAIT_TABLE_SIZE] __cacheline_aligned;
-
-wait_queue_head_t *bit_waitqueue(void *word, int bit)
-{
-	const int shift = BITS_PER_LONG == 32 ? 5 : 6;
-	unsigned long val = (unsigned long)word << shift | bit;
-
-	return bit_wait_table + hash_long(val, WAIT_TABLE_BITS);
-}
-EXPORT_SYMBOL(bit_waitqueue);
-
 void __init sched_init(void)
 {
 	int i, j;
 	unsigned long alloc_size = 0, ptr;
 
-	for (i = 0; i < WAIT_TABLE_SIZE; i++)
-		init_waitqueue_head(bit_wait_table + i);
-
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	alloc_size += 2 * nr_cpu_ids * sizeof(void **);
 #endif
diff -puN mm/filemap.c~static-per-zone-waitqueue mm/filemap.c
--- a/mm/filemap.c~static-per-zone-waitqueue	2016-12-19 11:35:12.215823285 -0800
+++ b/mm/filemap.c	2016-12-19 14:51:23.881814379 -0800
@@ -779,6 +779,26 @@ EXPORT_SYMBOL(__page_cache_alloc);
 #endif
 
 /*
+ * We need 'nid' because page_waitqueue() needs to get the waitqueue
+ * for memory where virt_to_page() does not work, like highmem.
+ */
+static wait_queue_head_t *__bit_waitqueue(void *word, int bit, int nid)
+{
+	const int shift = BITS_PER_LONG == 32 ? 5 : 6;
+	unsigned long val = (unsigned long)word << shift | bit;
+
+	return &NODE_DATA(nid)->wait_table[hash_long(val, WAIT_TABLE_BITS)];
+}
+
+wait_queue_head_t *bit_waitqueue(void *word, int bit)
+{
+	const int __maybe_unused nid = page_to_nid(virt_to_page(word));
+
+	return __bit_waitqueue(word, bit, nid);
+}
+EXPORT_SYMBOL(bit_waitqueue);
+
+/*
  * In order to wait for pages to become available there must be
  * waitqueues associated with pages. By using a hash table of
  * waitqueues where the bucket discipline is to maintain all
@@ -790,7 +810,7 @@ EXPORT_SYMBOL(__page_cache_alloc);
  */
 wait_queue_head_t *page_waitqueue(struct page *page)
 {
-	return bit_waitqueue(page, 0);
+	return __bit_waitqueue(page, 0, page_to_nid(page));
 }
 EXPORT_SYMBOL(page_waitqueue);
 
diff -puN mm/page_alloc.c~static-per-zone-waitqueue mm/page_alloc.c
--- a/mm/page_alloc.c~static-per-zone-waitqueue	2016-12-19 11:35:12.219823466 -0800
+++ b/mm/page_alloc.c	2016-12-19 13:10:56.587613213 -0800
@@ -5872,6 +5872,7 @@ void __paginginit free_area_init_node(in
 	pg_data_t *pgdat = NODE_DATA(nid);
 	unsigned long start_pfn = 0;
 	unsigned long end_pfn = 0;
+	int i;
 
 	/* pg_data_t should be reset to zero when it's allocated */
 	WARN_ON(pgdat->nr_zones || pgdat->kswapd_classzone_idx);
@@ -5892,6 +5893,10 @@ void __paginginit free_area_init_node(in
 				  zones_size, zholes_size);
 
 	alloc_node_mem_map(pgdat);
+
+	/* per-node page waitqueue initialization: */
+	for (i = 0; i < WAIT_TABLE_SIZE; i++)
+		init_waitqueue_head(&pgdat->wait_table[i]);
 #ifdef CONFIG_FLAT_NODE_MEM_MAP
 	printk(KERN_DEBUG "free_area_init_node: node %d, pgdat %08lx, node_mem_map %08lx\n",
 		nid, (unsigned long)pgdat,
_