2024-01-02 18:46:48

by Uladzislau Rezki (Sony)

Subject: [PATCH v3 00/11] Mitigate a vmap lock contention v3

This is v3. It is based on the 6.7.0-rc8.

1. Motivation

- Offload the global vmap locks so that they scale with the number of CPUs;
- If possible and there is an agreement, we can remove the "Per cpu kva allocator"
to make the vmap code simpler;
- There were complaints from XFS folks that vmalloc can be contended
on their workloads.

2. Design (high-level overview)

We introduce an effective vmap node logic. A node behaves as an independent
entity and serves an allocation request directly (if possible) from its pool.
That way it bypasses the global vmap space, which is protected by its own lock.
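
Below is a minimal sketch of the alloc fast path, based on node_alloc()
from patch 07 (the real code also checks that the request targets the
vmalloc range before consulting a pool):

/*
 * Try the current CPU's node pool first; fall back to the global
 * free space only if the pool cannot serve the request.
 */
vn_id = raw_smp_processor_id() % nr_vmap_nodes;
va = node_pool_del_va(id_to_node(vn_id), size, align, vstart, vend);

if (!va)
	addr = __alloc_vmap_area(&free_vmap_area_root, &free_vmap_area_list,
			size, align, vstart, vend);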

Access to the pools is serialized per CPU. The number of nodes is equal to
the number of CPUs in a system. Please note the upper bound is 128 nodes.

Pools are size-segregated and populated based on system demand. The maximum
alloc request that can be stored in the segregated storage is 256 pages. The
lazy drain path first decays a pool by 25% and then repopulates it with fresh
freed VAs for reuse, instead of returning them to the global space.
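
The decay step, trimmed down from decay_va_pool_node() in patch 07 (the
full version also handles a complete decay and re-attaches the remainder
of the pool), looks as follows:

/* Decay one size class of a node pool by ~25% of the left objects. */
n_decay = vn->pool[i].len >> 2;

list_for_each_entry_safe(va, nva, &tmp_list, list) {
	list_del_init(&va->list);

	/* Collected VAs are merged back into the global free space later. */
	merge_or_add_vmap_area(va, &decay_root, &decay_list);

	WRITE_ONCE(vn->pool[i].len, vn->pool[i].len - 1);
	if (!--n_decay)
		break;
}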

When a VA is obtained (alloc path), it is stored in a per-node structure. The
va->va_start address is converted into the node where the VA should be placed
and reside. Doing so balances VAs across the nodes; as a result, access
becomes scalable. The addr_to_node() function performs the address-to-node
conversion.

The vmap space is divided into segments of a fixed size of 16 pages, so any
address can be associated with a segment number. The number of segments is
equal to num_possible_cpus(), but not greater than 128. Numbering starts
from 0. See below how an address is converted:

static inline unsigned int
addr_to_node_id(unsigned long addr)
{
	return (addr / zone_size) % nr_nodes;
}
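
The node count itself boils down to one node per CPU with the upper
bound mentioned above. A sketch of that assumption (the exact setup is
done in the "mm: vmalloc: Set nr_nodes based on CPUs in a system"
patch, whose code is not quoted in this cover letter):

nr_nodes = min_t(unsigned int, num_possible_cpus(), 128);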

On the free path, a VA is easily found by converting its "va_start" address
to the node where it resides. It is moved from the "busy" to the "lazy" data
structure. Later on, as noted earlier, the lazy kworker decays each node pool
and populates it with fresh incoming VAs. Please note, a VA is returned to
the node that issued the alloc request.
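
The relevant part of the free path (see free_vmap_area_noflush() in
patch 07) boils down to:

/* Return the VA to the node that did the alloc request, if it is known. */
vn = is_vn_id_valid(vn_id) ?
	id_to_node(vn_id) : addr_to_node(va->va_start);

spin_lock(&vn->lazy.lock);
insert_vmap_area(va, &vn->lazy.root, &vn->lazy.head);
spin_unlock(&vn->lazy.lock);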

3. Test on AMD Ryzen Threadripper 3970X 32-Core Processor

sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64

<default perf>
94.41% 0.89% [kernel] [k] _raw_spin_lock
93.35% 93.07% [kernel] [k] native_queued_spin_lock_slowpath
76.13% 0.28% [kernel] [k] __vmalloc_node_range
72.96% 0.81% [kernel] [k] alloc_vmap_area
56.94% 0.00% [kernel] [k] __get_vm_area_node
41.95% 0.00% [kernel] [k] vmalloc
37.15% 0.01% [test_vmalloc] [k] full_fit_alloc_test
35.17% 0.00% [kernel] [k] ret_from_fork_asm
35.17% 0.00% [kernel] [k] ret_from_fork
35.17% 0.00% [kernel] [k] kthread
35.08% 0.00% [test_vmalloc] [k] test_func
34.45% 0.00% [test_vmalloc] [k] fix_size_alloc_test
28.09% 0.01% [test_vmalloc] [k] long_busy_list_alloc_test
23.53% 0.25% [kernel] [k] vfree.part.0
21.72% 0.00% [kernel] [k] remove_vm_area
20.08% 0.21% [kernel] [k] find_unlink_vmap_area
2.34% 0.61% [kernel] [k] free_vmap_area_noflush
<default perf>
vs
<patch-series perf>
82.32% 0.22% [test_vmalloc] [k] long_busy_list_alloc_test
63.36% 0.02% [kernel] [k] vmalloc
63.34% 2.64% [kernel] [k] __vmalloc_node_range
30.42% 4.46% [kernel] [k] vfree.part.0
28.98% 2.51% [kernel] [k] __alloc_pages_bulk
27.28% 0.19% [kernel] [k] __get_vm_area_node
26.13% 1.50% [kernel] [k] alloc_vmap_area
21.72% 21.67% [kernel] [k] clear_page_rep
19.51% 2.43% [kernel] [k] _raw_spin_lock
16.61% 16.51% [kernel] [k] native_queued_spin_lock_slowpath
13.40% 2.07% [kernel] [k] free_unref_page
10.62% 0.01% [kernel] [k] remove_vm_area
9.02% 8.73% [kernel] [k] insert_vmap_area
8.94% 0.00% [kernel] [k] ret_from_fork_asm
8.94% 0.00% [kernel] [k] ret_from_fork
8.94% 0.00% [kernel] [k] kthread
8.29% 0.00% [test_vmalloc] [k] test_func
7.81% 0.05% [test_vmalloc] [k] full_fit_alloc_test
5.30% 4.73% [kernel] [k] purge_vmap_node
4.47% 2.65% [kernel] [k] free_vmap_area_noflush
<patch-series perf>

This confirms that native_queued_spin_lock_slowpath goes down from 93.07%
to 16.51%.

The throughput is ~12x higher:

urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
Run the test with following parameters: run_test_mask=7 nr_threads=64
Done.
Check the kernel ring buffer to see the summary.

real 10m51.271s
user 0m0.013s
sys 0m0.187s
urezki@pc638:~$

urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
Run the test with following parameters: run_test_mask=7 nr_threads=64
Done.
Check the kernel ring buffer to see the summary.

real 0m51.301s
user 0m0.015s
sys 0m0.040s
urezki@pc638:~$

4. Changelog

v1: https://lore.kernel.org/linux-mm/ZIAqojPKjChJTssg@pc636/T/
v2: https://lore.kernel.org/lkml/[email protected]/

Delta v2 -> v3:
- address comments from the v2 feedback;
- switch from the pre-fetch chunk logic to less complex size-based pools.

Baoquan He (1):
mm/vmalloc: remove vmap_area_list

Uladzislau Rezki (Sony) (10):
mm: vmalloc: Add va_alloc() helper
mm: vmalloc: Rename adjust_va_to_fit_type() function
mm: vmalloc: Move vmap_init_free_space() down in vmalloc.c
mm: vmalloc: Remove global vmap_area_root rb-tree
mm: vmalloc: Remove global purge_vmap_area_root rb-tree
mm: vmalloc: Offload free_vmap_area_lock lock
mm: vmalloc: Support multiple nodes in vread_iter
mm: vmalloc: Support multiple nodes in vmallocinfo
mm: vmalloc: Set nr_nodes based on CPUs in a system
mm: vmalloc: Add a shrinker to drain vmap pools

.../admin-guide/kdump/vmcoreinfo.rst | 8 +-
arch/arm64/kernel/crash_core.c | 1 -
arch/riscv/kernel/crash_core.c | 1 -
include/linux/vmalloc.h | 1 -
kernel/crash_core.c | 4 +-
kernel/kallsyms_selftest.c | 1 -
mm/nommu.c | 2 -
mm/vmalloc.c | 1049 ++++++++++++-----
8 files changed, 786 insertions(+), 281 deletions(-)

--
2.39.2



2024-01-02 18:46:55

by Uladzislau Rezki (Sony)

Subject: [PATCH v3 01/11] mm: vmalloc: Add va_alloc() helper

Currently the __alloc_vmap_area() function contains open-coded logic that
finds and adjusts a VA based on the allocation request.

Introduce a va_alloc() helper that only adjusts a found VA. There is no
functional change as a result of this patch.

Reviewed-by: Baoquan He <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
---
mm/vmalloc.c | 41 ++++++++++++++++++++++++++++-------------
1 file changed, 28 insertions(+), 13 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d12a17fc0c17..739401a9eafc 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1481,6 +1481,32 @@ adjust_va_to_fit_type(struct rb_root *root, struct list_head *head,
return 0;
}

+static unsigned long
+va_alloc(struct vmap_area *va,
+ struct rb_root *root, struct list_head *head,
+ unsigned long size, unsigned long align,
+ unsigned long vstart, unsigned long vend)
+{
+ unsigned long nva_start_addr;
+ int ret;
+
+ if (va->va_start > vstart)
+ nva_start_addr = ALIGN(va->va_start, align);
+ else
+ nva_start_addr = ALIGN(vstart, align);
+
+ /* Check the "vend" restriction. */
+ if (nva_start_addr + size > vend)
+ return vend;
+
+ /* Update the free vmap_area. */
+ ret = adjust_va_to_fit_type(root, head, va, nva_start_addr, size);
+ if (WARN_ON_ONCE(ret))
+ return vend;
+
+ return nva_start_addr;
+}
+
/*
* Returns a start address of the newly allocated area, if success.
* Otherwise a vend is returned that indicates failure.
@@ -1493,7 +1519,6 @@ __alloc_vmap_area(struct rb_root *root, struct list_head *head,
bool adjust_search_size = true;
unsigned long nva_start_addr;
struct vmap_area *va;
- int ret;

/*
* Do not adjust when:
@@ -1511,18 +1536,8 @@ __alloc_vmap_area(struct rb_root *root, struct list_head *head,
if (unlikely(!va))
return vend;

- if (va->va_start > vstart)
- nva_start_addr = ALIGN(va->va_start, align);
- else
- nva_start_addr = ALIGN(vstart, align);
-
- /* Check the "vend" restriction. */
- if (nva_start_addr + size > vend)
- return vend;
-
- /* Update the free vmap_area. */
- ret = adjust_va_to_fit_type(root, head, va, nva_start_addr, size);
- if (WARN_ON_ONCE(ret))
+ nva_start_addr = va_alloc(va, root, head, size, align, vstart, vend);
+ if (nva_start_addr == vend)
return vend;

#if DEBUG_AUGMENT_LOWEST_MATCH_CHECK
--
2.39.2


2024-01-02 18:47:10

by Uladzislau Rezki (Sony)

Subject: [PATCH v3 02/11] mm: vmalloc: Rename adjust_va_to_fit_type() function

This patch renames the adjust_va_to_fit_type() function to va_clip(),
which is shorter and more expressive.

There is no functional change as a result of this patch.

Reviewed-by: Baoquan He <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
---
mm/vmalloc.c | 13 ++++++-------
1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 739401a9eafc..10f289e86512 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1382,9 +1382,9 @@ classify_va_fit_type(struct vmap_area *va,
}

static __always_inline int
-adjust_va_to_fit_type(struct rb_root *root, struct list_head *head,
- struct vmap_area *va, unsigned long nva_start_addr,
- unsigned long size)
+va_clip(struct rb_root *root, struct list_head *head,
+ struct vmap_area *va, unsigned long nva_start_addr,
+ unsigned long size)
{
struct vmap_area *lva = NULL;
enum fit_type type = classify_va_fit_type(va, nva_start_addr, size);
@@ -1500,7 +1500,7 @@ va_alloc(struct vmap_area *va,
return vend;

/* Update the free vmap_area. */
- ret = adjust_va_to_fit_type(root, head, va, nva_start_addr, size);
+ ret = va_clip(root, head, va, nva_start_addr, size);
if (WARN_ON_ONCE(ret))
return vend;

@@ -4155,9 +4155,8 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
/* It is a BUG(), but trigger recovery instead. */
goto recovery;

- ret = adjust_va_to_fit_type(&free_vmap_area_root,
- &free_vmap_area_list,
- va, start, size);
+ ret = va_clip(&free_vmap_area_root,
+ &free_vmap_area_list, va, start, size);
if (WARN_ON_ONCE(unlikely(ret)))
/* It is a BUG(), but trigger recovery instead. */
goto recovery;
--
2.39.2


2024-01-02 18:47:20

by Uladzislau Rezki (Sony)

Subject: [PATCH v3 03/11] mm: vmalloc: Move vmap_init_free_space() down in vmalloc.c

vmap_init_free_space() is a function that sets up the vmap space and is
considered part of the initialization phase. Since the main entry point,
vmalloc_init(), has been moved down in vmalloc.c, it makes sense to follow
the pattern.

There is no functional change as a result of this patch.

Reviewed-by: Baoquan He <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
---
mm/vmalloc.c | 82 ++++++++++++++++++++++++++--------------------------
1 file changed, 41 insertions(+), 41 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 10f289e86512..06bd843d18ae 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2512,47 +2512,6 @@ void __init vm_area_register_early(struct vm_struct *vm, size_t align)
kasan_populate_early_vm_area_shadow(vm->addr, vm->size);
}

-static void vmap_init_free_space(void)
-{
- unsigned long vmap_start = 1;
- const unsigned long vmap_end = ULONG_MAX;
- struct vmap_area *busy, *free;
-
- /*
- * B F B B B F
- * -|-----|.....|-----|-----|-----|.....|-
- * | The KVA space |
- * |<--------------------------------->|
- */
- list_for_each_entry(busy, &vmap_area_list, list) {
- if (busy->va_start - vmap_start > 0) {
- free = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
- if (!WARN_ON_ONCE(!free)) {
- free->va_start = vmap_start;
- free->va_end = busy->va_start;
-
- insert_vmap_area_augment(free, NULL,
- &free_vmap_area_root,
- &free_vmap_area_list);
- }
- }
-
- vmap_start = busy->va_end;
- }
-
- if (vmap_end - vmap_start > 0) {
- free = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
- if (!WARN_ON_ONCE(!free)) {
- free->va_start = vmap_start;
- free->va_end = vmap_end;
-
- insert_vmap_area_augment(free, NULL,
- &free_vmap_area_root,
- &free_vmap_area_list);
- }
- }
-}
-
static inline void setup_vmalloc_vm_locked(struct vm_struct *vm,
struct vmap_area *va, unsigned long flags, const void *caller)
{
@@ -4465,6 +4424,47 @@ module_init(proc_vmalloc_init);

#endif

+static void vmap_init_free_space(void)
+{
+ unsigned long vmap_start = 1;
+ const unsigned long vmap_end = ULONG_MAX;
+ struct vmap_area *busy, *free;
+
+ /*
+ * B F B B B F
+ * -|-----|.....|-----|-----|-----|.....|-
+ * | The KVA space |
+ * |<--------------------------------->|
+ */
+ list_for_each_entry(busy, &vmap_area_list, list) {
+ if (busy->va_start - vmap_start > 0) {
+ free = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
+ if (!WARN_ON_ONCE(!free)) {
+ free->va_start = vmap_start;
+ free->va_end = busy->va_start;
+
+ insert_vmap_area_augment(free, NULL,
+ &free_vmap_area_root,
+ &free_vmap_area_list);
+ }
+ }
+
+ vmap_start = busy->va_end;
+ }
+
+ if (vmap_end - vmap_start > 0) {
+ free = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
+ if (!WARN_ON_ONCE(!free)) {
+ free->va_start = vmap_start;
+ free->va_end = vmap_end;
+
+ insert_vmap_area_augment(free, NULL,
+ &free_vmap_area_root,
+ &free_vmap_area_list);
+ }
+ }
+}
+
void __init vmalloc_init(void)
{
struct vmap_area *va;
--
2.39.2


2024-01-02 18:47:39

by Uladzislau Rezki (Sony)

Subject: [PATCH v3 04/11] mm: vmalloc: Remove global vmap_area_root rb-tree

Store allocated objects in separate nodes. A va->va_start address is
converted into the node where the VA should be placed and reside. The
addr_to_node() function performs the address conversion to determine the
node that contains a VA.

Such an approach balances VAs across nodes; as a result, access becomes
scalable. The number of nodes in a system depends on the number of CPUs.

Please note:

1. As of now allocated VAs are bound to node-0, so this patch does not
change anything compared to the current behavior;

2. The global vmap_area_lock and vmap_area_root are removed as they are
no longer needed. The vmap_area_list is still kept and is _empty_. It is
exported for kexec only;

3. The vmallocinfo and vread() interfaces have to be reworked to be able
to handle multiple nodes.

Reviewed-by: Baoquan He <[email protected]>
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
---
mm/vmalloc.c | 240 +++++++++++++++++++++++++++++++++++++--------------
1 file changed, 173 insertions(+), 67 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 06bd843d18ae..786ecb18ae22 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -728,11 +728,9 @@ EXPORT_SYMBOL(vmalloc_to_pfn);
#define DEBUG_AUGMENT_LOWEST_MATCH_CHECK 0


-static DEFINE_SPINLOCK(vmap_area_lock);
static DEFINE_SPINLOCK(free_vmap_area_lock);
/* Export for kexec only */
LIST_HEAD(vmap_area_list);
-static struct rb_root vmap_area_root = RB_ROOT;
static bool vmap_initialized __read_mostly;

static struct rb_root purge_vmap_area_root = RB_ROOT;
@@ -772,6 +770,38 @@ static struct rb_root free_vmap_area_root = RB_ROOT;
*/
static DEFINE_PER_CPU(struct vmap_area *, ne_fit_preload_node);

+/*
+ * An effective vmap-node logic. Users make use of nodes instead
+ * of a global heap. It allows to balance an access and mitigate
+ * contention.
+ */
+struct rb_list {
+ struct rb_root root;
+ struct list_head head;
+ spinlock_t lock;
+};
+
+static struct vmap_node {
+ /* Bookkeeping data of this node. */
+ struct rb_list busy;
+} single;
+
+static struct vmap_node *vmap_nodes = &single;
+static __read_mostly unsigned int nr_vmap_nodes = 1;
+static __read_mostly unsigned int vmap_zone_size = 1;
+
+static inline unsigned int
+addr_to_node_id(unsigned long addr)
+{
+ return (addr / vmap_zone_size) % nr_vmap_nodes;
+}
+
+static inline struct vmap_node *
+addr_to_node(unsigned long addr)
+{
+ return &vmap_nodes[addr_to_node_id(addr)];
+}
+
static __always_inline unsigned long
va_size(struct vmap_area *va)
{
@@ -803,10 +833,11 @@ unsigned long vmalloc_nr_pages(void)
}

/* Look up the first VA which satisfies addr < va_end, NULL if none. */
-static struct vmap_area *find_vmap_area_exceed_addr(unsigned long addr)
+static struct vmap_area *
+find_vmap_area_exceed_addr(unsigned long addr, struct rb_root *root)
{
struct vmap_area *va = NULL;
- struct rb_node *n = vmap_area_root.rb_node;
+ struct rb_node *n = root->rb_node;

addr = (unsigned long)kasan_reset_tag((void *)addr);

@@ -1552,12 +1583,14 @@ __alloc_vmap_area(struct rb_root *root, struct list_head *head,
*/
static void free_vmap_area(struct vmap_area *va)
{
+ struct vmap_node *vn = addr_to_node(va->va_start);
+
/*
* Remove from the busy tree/list.
*/
- spin_lock(&vmap_area_lock);
- unlink_va(va, &vmap_area_root);
- spin_unlock(&vmap_area_lock);
+ spin_lock(&vn->busy.lock);
+ unlink_va(va, &vn->busy.root);
+ spin_unlock(&vn->busy.lock);

/*
* Insert/Merge it back to the free tree/list.
@@ -1600,6 +1633,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
int node, gfp_t gfp_mask,
unsigned long va_flags)
{
+ struct vmap_node *vn;
struct vmap_area *va;
unsigned long freed;
unsigned long addr;
@@ -1645,9 +1679,11 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
va->vm = NULL;
va->flags = va_flags;

- spin_lock(&vmap_area_lock);
- insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
- spin_unlock(&vmap_area_lock);
+ vn = addr_to_node(va->va_start);
+
+ spin_lock(&vn->busy.lock);
+ insert_vmap_area(va, &vn->busy.root, &vn->busy.head);
+ spin_unlock(&vn->busy.lock);

BUG_ON(!IS_ALIGNED(va->va_start, align));
BUG_ON(va->va_start < vstart);
@@ -1871,26 +1907,61 @@ static void free_unmap_vmap_area(struct vmap_area *va)

struct vmap_area *find_vmap_area(unsigned long addr)
{
+ struct vmap_node *vn;
struct vmap_area *va;
+ int i, j;

- spin_lock(&vmap_area_lock);
- va = __find_vmap_area(addr, &vmap_area_root);
- spin_unlock(&vmap_area_lock);
+ /*
+ * An addr_to_node_id(addr) converts an address to a node index
+ * where a VA is located. If VA spans several zones and passed
+ * addr is not the same as va->va_start, what is not common, we
+ * may need to scan an extra nodes. See an example:
+ *
+ * <--va-->
+ * -|-----|-----|-----|-----|-
+ * 1 2 0 1
+ *
+ * VA resides in node 1 whereas it spans 1 and 2. If passed
+ * addr is within a second node we should do extra work. We
+ * should mention that it is rare and is a corner case from
+ * the other hand it has to be covered.
+ */
+ i = j = addr_to_node_id(addr);
+ do {
+ vn = &vmap_nodes[i];

- return va;
+ spin_lock(&vn->busy.lock);
+ va = __find_vmap_area(addr, &vn->busy.root);
+ spin_unlock(&vn->busy.lock);
+
+ if (va)
+ return va;
+ } while ((i = (i + 1) % nr_vmap_nodes) != j);
+
+ return NULL;
}

static struct vmap_area *find_unlink_vmap_area(unsigned long addr)
{
+ struct vmap_node *vn;
struct vmap_area *va;
+ int i, j;

- spin_lock(&vmap_area_lock);
- va = __find_vmap_area(addr, &vmap_area_root);
- if (va)
- unlink_va(va, &vmap_area_root);
- spin_unlock(&vmap_area_lock);
+ i = j = addr_to_node_id(addr);
+ do {
+ vn = &vmap_nodes[i];

- return va;
+ spin_lock(&vn->busy.lock);
+ va = __find_vmap_area(addr, &vn->busy.root);
+ if (va)
+ unlink_va(va, &vn->busy.root);
+ spin_unlock(&vn->busy.lock);
+
+ if (va)
+ return va;
+ } while ((i = (i + 1) % nr_vmap_nodes) != j);
+
+ return NULL;
}

/*** Per cpu kva allocator ***/
@@ -2092,6 +2163,7 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)

static void free_vmap_block(struct vmap_block *vb)
{
+ struct vmap_node *vn;
struct vmap_block *tmp;
struct xarray *xa;

@@ -2099,9 +2171,10 @@ static void free_vmap_block(struct vmap_block *vb)
tmp = xa_erase(xa, addr_to_vb_idx(vb->va->va_start));
BUG_ON(tmp != vb);

- spin_lock(&vmap_area_lock);
- unlink_va(vb->va, &vmap_area_root);
- spin_unlock(&vmap_area_lock);
+ vn = addr_to_node(vb->va->va_start);
+ spin_lock(&vn->busy.lock);
+ unlink_va(vb->va, &vn->busy.root);
+ spin_unlock(&vn->busy.lock);

free_vmap_area_noflush(vb->va);
kfree_rcu(vb, rcu_head);
@@ -2525,9 +2598,11 @@ static inline void setup_vmalloc_vm_locked(struct vm_struct *vm,
static void setup_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va,
unsigned long flags, const void *caller)
{
- spin_lock(&vmap_area_lock);
+ struct vmap_node *vn = addr_to_node(va->va_start);
+
+ spin_lock(&vn->busy.lock);
setup_vmalloc_vm_locked(vm, va, flags, caller);
- spin_unlock(&vmap_area_lock);
+ spin_unlock(&vn->busy.lock);
}

static void clear_vm_uninitialized_flag(struct vm_struct *vm)
@@ -3715,6 +3790,7 @@ static size_t vmap_ram_vread_iter(struct iov_iter *iter, const char *addr,
*/
long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
{
+ struct vmap_node *vn;
struct vmap_area *va;
struct vm_struct *vm;
char *vaddr;
@@ -3728,8 +3804,11 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)

remains = count;

- spin_lock(&vmap_area_lock);
- va = find_vmap_area_exceed_addr((unsigned long)addr);
+ /* Hooked to node_0 so far. */
+ vn = addr_to_node(0);
+ spin_lock(&vn->busy.lock);
+
+ va = find_vmap_area_exceed_addr((unsigned long)addr, &vn->busy.root);
if (!va)
goto finished_zero;

@@ -3737,7 +3816,7 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
if ((unsigned long)addr + remains <= va->va_start)
goto finished_zero;

- list_for_each_entry_from(va, &vmap_area_list, list) {
+ list_for_each_entry_from(va, &vn->busy.head, list) {
size_t copied;

if (remains == 0)
@@ -3796,12 +3875,12 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
}

finished_zero:
- spin_unlock(&vmap_area_lock);
+ spin_unlock(&vn->busy.lock);
/* zero-fill memory holes */
return count - remains + zero_iter(iter, remains);
finished:
/* Nothing remains, or We couldn't copy/zero everything. */
- spin_unlock(&vmap_area_lock);
+ spin_unlock(&vn->busy.lock);

return count - remains;
}
@@ -4135,14 +4214,15 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
}

/* insert all vm's */
- spin_lock(&vmap_area_lock);
for (area = 0; area < nr_vms; area++) {
- insert_vmap_area(vas[area], &vmap_area_root, &vmap_area_list);
+ struct vmap_node *vn = addr_to_node(vas[area]->va_start);

+ spin_lock(&vn->busy.lock);
+ insert_vmap_area(vas[area], &vn->busy.root, &vn->busy.head);
setup_vmalloc_vm_locked(vms[area], vas[area], VM_ALLOC,
pcpu_get_vm_areas);
+ spin_unlock(&vn->busy.lock);
}
- spin_unlock(&vmap_area_lock);

/*
* Mark allocated areas as accessible. Do it now as a best-effort
@@ -4253,55 +4333,57 @@ bool vmalloc_dump_obj(void *object)
{
void *objp = (void *)PAGE_ALIGN((unsigned long)object);
const void *caller;
- struct vm_struct *vm;
struct vmap_area *va;
+ struct vmap_node *vn;
unsigned long addr;
unsigned int nr_pages;
+ bool success = false;

- if (!spin_trylock(&vmap_area_lock))
- return false;
- va = __find_vmap_area((unsigned long)objp, &vmap_area_root);
- if (!va) {
- spin_unlock(&vmap_area_lock);
- return false;
- }
+ vn = addr_to_node((unsigned long)objp);

- vm = va->vm;
- if (!vm) {
- spin_unlock(&vmap_area_lock);
- return false;
+ if (spin_trylock(&vn->busy.lock)) {
+ va = __find_vmap_area(addr, &vn->busy.root);
+
+ if (va && va->vm) {
+ addr = (unsigned long)va->vm->addr;
+ caller = va->vm->caller;
+ nr_pages = va->vm->nr_pages;
+ success = true;
+ }
+
+ spin_unlock(&vn->busy.lock);
}
- addr = (unsigned long)vm->addr;
- caller = vm->caller;
- nr_pages = vm->nr_pages;
- spin_unlock(&vmap_area_lock);
- pr_cont(" %u-page vmalloc region starting at %#lx allocated at %pS\n",
- nr_pages, addr, caller);
- return true;
+
+ if (success)
+ pr_cont(" %u-page vmalloc region starting at %#lx allocated at %pS\n",
+ nr_pages, addr, caller);
+
+ return success;
}
#endif

#ifdef CONFIG_PROC_FS
static void *s_start(struct seq_file *m, loff_t *pos)
- __acquires(&vmap_purge_lock)
- __acquires(&vmap_area_lock)
{
+ struct vmap_node *vn = addr_to_node(0);
+
mutex_lock(&vmap_purge_lock);
- spin_lock(&vmap_area_lock);
+ spin_lock(&vn->busy.lock);

- return seq_list_start(&vmap_area_list, *pos);
+ return seq_list_start(&vn->busy.head, *pos);
}

static void *s_next(struct seq_file *m, void *p, loff_t *pos)
{
- return seq_list_next(p, &vmap_area_list, pos);
+ struct vmap_node *vn = addr_to_node(0);
+ return seq_list_next(p, &vn->busy.head, pos);
}

static void s_stop(struct seq_file *m, void *p)
- __releases(&vmap_area_lock)
- __releases(&vmap_purge_lock)
{
- spin_unlock(&vmap_area_lock);
+ struct vmap_node *vn = addr_to_node(0);
+
+ spin_unlock(&vn->busy.lock);
mutex_unlock(&vmap_purge_lock);
}

@@ -4344,9 +4426,11 @@ static void show_purge_info(struct seq_file *m)

static int s_show(struct seq_file *m, void *p)
{
+ struct vmap_node *vn;
struct vmap_area *va;
struct vm_struct *v;

+ vn = addr_to_node(0);
va = list_entry(p, struct vmap_area, list);

if (!va->vm) {
@@ -4397,7 +4481,7 @@ static int s_show(struct seq_file *m, void *p)
* As a final step, dump "unpurged" areas.
*/
final:
- if (list_is_last(&va->list, &vmap_area_list))
+ if (list_is_last(&va->list, &vn->busy.head))
show_purge_info(m);

return 0;
@@ -4428,7 +4512,8 @@ static void vmap_init_free_space(void)
{
unsigned long vmap_start = 1;
const unsigned long vmap_end = ULONG_MAX;
- struct vmap_area *busy, *free;
+ struct vmap_area *free;
+ struct vm_struct *busy;

/*
* B F B B B F
@@ -4436,12 +4521,12 @@ static void vmap_init_free_space(void)
* | The KVA space |
* |<--------------------------------->|
*/
- list_for_each_entry(busy, &vmap_area_list, list) {
- if (busy->va_start - vmap_start > 0) {
+ for (busy = vmlist; busy; busy = busy->next) {
+ if ((unsigned long) busy->addr - vmap_start > 0) {
free = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
if (!WARN_ON_ONCE(!free)) {
free->va_start = vmap_start;
- free->va_end = busy->va_start;
+ free->va_end = (unsigned long) busy->addr;

insert_vmap_area_augment(free, NULL,
&free_vmap_area_root,
@@ -4449,7 +4534,7 @@ static void vmap_init_free_space(void)
}
}

- vmap_start = busy->va_end;
+ vmap_start = (unsigned long) busy->addr + busy->size;
}

if (vmap_end - vmap_start > 0) {
@@ -4465,9 +4550,23 @@ static void vmap_init_free_space(void)
}
}

+static void vmap_init_nodes(void)
+{
+ struct vmap_node *vn;
+ int i;
+
+ for (i = 0; i < nr_vmap_nodes; i++) {
+ vn = &vmap_nodes[i];
+ vn->busy.root = RB_ROOT;
+ INIT_LIST_HEAD(&vn->busy.head);
+ spin_lock_init(&vn->busy.lock);
+ }
+}
+
void __init vmalloc_init(void)
{
struct vmap_area *va;
+ struct vmap_node *vn;
struct vm_struct *tmp;
int i;

@@ -4489,6 +4588,11 @@ void __init vmalloc_init(void)
xa_init(&vbq->vmap_blocks);
}

+ /*
+ * Setup nodes before importing vmlist.
+ */
+ vmap_init_nodes();
+
/* Import existing vmlist entries. */
for (tmp = vmlist; tmp; tmp = tmp->next) {
va = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
@@ -4498,7 +4602,9 @@ void __init vmalloc_init(void)
va->va_start = (unsigned long)tmp->addr;
va->va_end = va->va_start + tmp->size;
va->vm = tmp;
- insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
+
+ vn = addr_to_node(va->va_start);
+ insert_vmap_area(va, &vn->busy.root, &vn->busy.head);
}

/*
--
2.39.2


2024-01-02 18:47:59

by Uladzislau Rezki (Sony)

Subject: [PATCH v3 06/11] mm: vmalloc: Remove global purge_vmap_area_root rb-tree

Similar to busy VAs, a lazily-freed area is stored in the node it belongs
to. Such an approach does not require any global locking primitive;
instead, access becomes scalable, which mitigates contention.

This patch removes the global purge lock, the global purge tree and the
global purge list.

Reviewed-by: Baoquan He <[email protected]>
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
---
mm/vmalloc.c | 135 +++++++++++++++++++++++++++++++--------------------
1 file changed, 82 insertions(+), 53 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 8c01f2225ef7..9b2f1b0cac9d 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -731,10 +731,6 @@ EXPORT_SYMBOL(vmalloc_to_pfn);
static DEFINE_SPINLOCK(free_vmap_area_lock);
static bool vmap_initialized __read_mostly;

-static struct rb_root purge_vmap_area_root = RB_ROOT;
-static LIST_HEAD(purge_vmap_area_list);
-static DEFINE_SPINLOCK(purge_vmap_area_lock);
-
/*
* This kmem_cache is used for vmap_area objects. Instead of
* allocating from slab we reuse an object from this cache to
@@ -782,6 +778,12 @@ struct rb_list {
static struct vmap_node {
/* Bookkeeping data of this node. */
struct rb_list busy;
+ struct rb_list lazy;
+
+ /*
+ * Ready-to-free areas.
+ */
+ struct list_head purge_list;
} single;

static struct vmap_node *vmap_nodes = &single;
@@ -1766,40 +1768,22 @@ static DEFINE_MUTEX(vmap_purge_lock);

/* for per-CPU blocks */
static void purge_fragmented_blocks_allcpus(void);
+static cpumask_t purge_nodes;

/*
* Purges all lazily-freed vmap areas.
*/
-static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
+static unsigned long
+purge_vmap_node(struct vmap_node *vn)
{
- unsigned long resched_threshold;
- unsigned int num_purged_areas = 0;
- struct list_head local_purge_list;
+ unsigned long num_purged_areas = 0;
struct vmap_area *va, *n_va;

- lockdep_assert_held(&vmap_purge_lock);
-
- spin_lock(&purge_vmap_area_lock);
- purge_vmap_area_root = RB_ROOT;
- list_replace_init(&purge_vmap_area_list, &local_purge_list);
- spin_unlock(&purge_vmap_area_lock);
-
- if (unlikely(list_empty(&local_purge_list)))
- goto out;
-
- start = min(start,
- list_first_entry(&local_purge_list,
- struct vmap_area, list)->va_start);
-
- end = max(end,
- list_last_entry(&local_purge_list,
- struct vmap_area, list)->va_end);
-
- flush_tlb_kernel_range(start, end);
- resched_threshold = lazy_max_pages() << 1;
+ if (list_empty(&vn->purge_list))
+ return 0;

spin_lock(&free_vmap_area_lock);
- list_for_each_entry_safe(va, n_va, &local_purge_list, list) {
+ list_for_each_entry_safe(va, n_va, &vn->purge_list, list) {
unsigned long nr = (va->va_end - va->va_start) >> PAGE_SHIFT;
unsigned long orig_start = va->va_start;
unsigned long orig_end = va->va_end;
@@ -1821,13 +1805,55 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)

atomic_long_sub(nr, &vmap_lazy_nr);
num_purged_areas++;
-
- if (atomic_long_read(&vmap_lazy_nr) < resched_threshold)
- cond_resched_lock(&free_vmap_area_lock);
}
spin_unlock(&free_vmap_area_lock);

-out:
+ return num_purged_areas;
+}
+
+/*
+ * Purges all lazily-freed vmap areas.
+ */
+static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
+{
+ unsigned long num_purged_areas = 0;
+ struct vmap_node *vn;
+ int i;
+
+ lockdep_assert_held(&vmap_purge_lock);
+ purge_nodes = CPU_MASK_NONE;
+
+ for (i = 0; i < nr_vmap_nodes; i++) {
+ vn = &vmap_nodes[i];
+
+ INIT_LIST_HEAD(&vn->purge_list);
+
+ if (RB_EMPTY_ROOT(&vn->lazy.root))
+ continue;
+
+ spin_lock(&vn->lazy.lock);
+ WRITE_ONCE(vn->lazy.root.rb_node, NULL);
+ list_replace_init(&vn->lazy.head, &vn->purge_list);
+ spin_unlock(&vn->lazy.lock);
+
+ start = min(start, list_first_entry(&vn->purge_list,
+ struct vmap_area, list)->va_start);
+
+ end = max(end, list_last_entry(&vn->purge_list,
+ struct vmap_area, list)->va_end);
+
+ cpumask_set_cpu(i, &purge_nodes);
+ }
+
+ if (cpumask_weight(&purge_nodes) > 0) {
+ flush_tlb_kernel_range(start, end);
+
+ for_each_cpu(i, &purge_nodes) {
+ vn = &nodes[i];
+ num_purged_areas += purge_vmap_node(vn);
+ }
+ }
+
trace_purge_vmap_area_lazy(start, end, num_purged_areas);
return num_purged_areas > 0;
}
@@ -1846,16 +1872,9 @@ static void reclaim_and_purge_vmap_areas(void)

static void drain_vmap_area_work(struct work_struct *work)
{
- unsigned long nr_lazy;
-
- do {
- mutex_lock(&vmap_purge_lock);
- __purge_vmap_area_lazy(ULONG_MAX, 0);
- mutex_unlock(&vmap_purge_lock);
-
- /* Recheck if further work is required. */
- nr_lazy = atomic_long_read(&vmap_lazy_nr);
- } while (nr_lazy > lazy_max_pages());
+ mutex_lock(&vmap_purge_lock);
+ __purge_vmap_area_lazy(ULONG_MAX, 0);
+ mutex_unlock(&vmap_purge_lock);
}

/*
@@ -1865,6 +1884,7 @@ static void drain_vmap_area_work(struct work_struct *work)
*/
static void free_vmap_area_noflush(struct vmap_area *va)
{
+ struct vmap_node *vn = addr_to_node(va->va_start);
unsigned long nr_lazy_max = lazy_max_pages();
unsigned long va_start = va->va_start;
unsigned long nr_lazy;
@@ -1878,10 +1898,9 @@ static void free_vmap_area_noflush(struct vmap_area *va)
/*
* Merge or place it to the purge tree/list.
*/
- spin_lock(&purge_vmap_area_lock);
- merge_or_add_vmap_area(va,
- &purge_vmap_area_root, &purge_vmap_area_list);
- spin_unlock(&purge_vmap_area_lock);
+ spin_lock(&vn->lazy.lock);
+ merge_or_add_vmap_area(va, &vn->lazy.root, &vn->lazy.head);
+ spin_unlock(&vn->lazy.lock);

trace_free_vmap_area_noflush(va_start, nr_lazy, nr_lazy_max);

@@ -4411,15 +4430,21 @@ static void show_numa_info(struct seq_file *m, struct vm_struct *v)

static void show_purge_info(struct seq_file *m)
{
+ struct vmap_node *vn;
struct vmap_area *va;
+ int i;

- spin_lock(&purge_vmap_area_lock);
- list_for_each_entry(va, &purge_vmap_area_list, list) {
- seq_printf(m, "0x%pK-0x%pK %7ld unpurged vm_area\n",
- (void *)va->va_start, (void *)va->va_end,
- va->va_end - va->va_start);
+ for (i = 0; i < nr_vmap_nodes; i++) {
+ vn = &vmap_nodes[i];
+
+ spin_lock(&vn->lazy.lock);
+ list_for_each_entry(va, &vn->lazy.head, list) {
+ seq_printf(m, "0x%pK-0x%pK %7ld unpurged vm_area\n",
+ (void *)va->va_start, (void *)va->va_end,
+ va->va_end - va->va_start);
+ }
+ spin_unlock(&vn->lazy.lock);
}
- spin_unlock(&purge_vmap_area_lock);
}

static int s_show(struct seq_file *m, void *p)
@@ -4558,6 +4583,10 @@ static void vmap_init_nodes(void)
vn->busy.root = RB_ROOT;
INIT_LIST_HEAD(&vn->busy.head);
spin_lock_init(&vn->busy.lock);
+
+ vn->lazy.root = RB_ROOT;
+ INIT_LIST_HEAD(&vn->lazy.head);
+ spin_lock_init(&vn->lazy.lock);
}
}

--
2.39.2


2024-01-02 18:48:03

by Uladzislau Rezki (Sony)

Subject: [PATCH v3 05/11] mm/vmalloc: remove vmap_area_list

From: Baoquan He <[email protected]>

Earlier, vmap_area_list was exported to vmcoreinfo so that makedumpfile
could get the base address of the vmalloc area. Now vmap_area_list is
empty, so export VMALLOC_START to vmcoreinfo instead, and remove
vmap_area_list.

Signed-off-by: Baoquan He <[email protected]>
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
---
Documentation/admin-guide/kdump/vmcoreinfo.rst | 8 ++++----
arch/arm64/kernel/crash_core.c | 1 -
arch/riscv/kernel/crash_core.c | 1 -
include/linux/vmalloc.h | 1 -
kernel/crash_core.c | 4 +---
kernel/kallsyms_selftest.c | 1 -
mm/nommu.c | 2 --
mm/vmalloc.c | 2 --
8 files changed, 5 insertions(+), 15 deletions(-)

diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst b/Documentation/admin-guide/kdump/vmcoreinfo.rst
index 78e4d2e7ba14..df54fbeaaa16 100644
--- a/Documentation/admin-guide/kdump/vmcoreinfo.rst
+++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst
@@ -65,11 +65,11 @@ Defines the beginning of the text section. In general, _stext indicates
the kernel start address. Used to convert a virtual address from the
direct kernel map to a physical address.

-vmap_area_list
---------------
+VMALLOC_START
+-------------

-Stores the virtual area list. makedumpfile gets the vmalloc start value
-from this variable and its value is necessary for vmalloc translation.
+Stores the base address of vmalloc area. makedumpfile gets this value
+since is necessary for vmalloc translation.

mem_map
-------
diff --git a/arch/arm64/kernel/crash_core.c b/arch/arm64/kernel/crash_core.c
index 66cde752cd74..2a24199a9b81 100644
--- a/arch/arm64/kernel/crash_core.c
+++ b/arch/arm64/kernel/crash_core.c
@@ -23,7 +23,6 @@ void arch_crash_save_vmcoreinfo(void)
/* Please note VMCOREINFO_NUMBER() uses "%d", not "%x" */
vmcoreinfo_append_str("NUMBER(MODULES_VADDR)=0x%lx\n", MODULES_VADDR);
vmcoreinfo_append_str("NUMBER(MODULES_END)=0x%lx\n", MODULES_END);
- vmcoreinfo_append_str("NUMBER(VMALLOC_START)=0x%lx\n", VMALLOC_START);
vmcoreinfo_append_str("NUMBER(VMALLOC_END)=0x%lx\n", VMALLOC_END);
vmcoreinfo_append_str("NUMBER(VMEMMAP_START)=0x%lx\n", VMEMMAP_START);
vmcoreinfo_append_str("NUMBER(VMEMMAP_END)=0x%lx\n", VMEMMAP_END);
diff --git a/arch/riscv/kernel/crash_core.c b/arch/riscv/kernel/crash_core.c
index 8706736fd4e2..d18d529fd9b9 100644
--- a/arch/riscv/kernel/crash_core.c
+++ b/arch/riscv/kernel/crash_core.c
@@ -8,7 +8,6 @@ void arch_crash_save_vmcoreinfo(void)
VMCOREINFO_NUMBER(phys_ram_base);

vmcoreinfo_append_str("NUMBER(PAGE_OFFSET)=0x%lx\n", PAGE_OFFSET);
- vmcoreinfo_append_str("NUMBER(VMALLOC_START)=0x%lx\n", VMALLOC_START);
vmcoreinfo_append_str("NUMBER(VMALLOC_END)=0x%lx\n", VMALLOC_END);
#ifdef CONFIG_MMU
VMCOREINFO_NUMBER(VA_BITS);
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index c720be70c8dd..91810b4e9510 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -253,7 +253,6 @@ extern long vread_iter(struct iov_iter *iter, const char *addr, size_t count);
/*
* Internals. Don't use..
*/
-extern struct list_head vmap_area_list;
extern __init void vm_area_add_early(struct vm_struct *vm);
extern __init void vm_area_register_early(struct vm_struct *vm, size_t align);

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index d4313b53837e..b427f4a3b156 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -759,7 +759,7 @@ static int __init crash_save_vmcoreinfo_init(void)
VMCOREINFO_SYMBOL_ARRAY(swapper_pg_dir);
#endif
VMCOREINFO_SYMBOL(_stext);
- VMCOREINFO_SYMBOL(vmap_area_list);
+ vmcoreinfo_append_str("NUMBER(VMALLOC_START)=0x%lx\n", VMALLOC_START);

#ifndef CONFIG_NUMA
VMCOREINFO_SYMBOL(mem_map);
@@ -800,8 +800,6 @@ static int __init crash_save_vmcoreinfo_init(void)
VMCOREINFO_OFFSET(free_area, free_list);
VMCOREINFO_OFFSET(list_head, next);
VMCOREINFO_OFFSET(list_head, prev);
- VMCOREINFO_OFFSET(vmap_area, va_start);
- VMCOREINFO_OFFSET(vmap_area, list);
VMCOREINFO_LENGTH(zone.free_area, MAX_ORDER + 1);
log_buf_vmcoreinfo_setup();
VMCOREINFO_LENGTH(free_area.free_list, MIGRATE_TYPES);
diff --git a/kernel/kallsyms_selftest.c b/kernel/kallsyms_selftest.c
index b4cac76ea5e9..8a689b4ff4f9 100644
--- a/kernel/kallsyms_selftest.c
+++ b/kernel/kallsyms_selftest.c
@@ -89,7 +89,6 @@ static struct test_item test_items[] = {
ITEM_DATA(kallsyms_test_var_data_static),
ITEM_DATA(kallsyms_test_var_bss),
ITEM_DATA(kallsyms_test_var_data),
- ITEM_DATA(vmap_area_list),
#endif
};

diff --git a/mm/nommu.c b/mm/nommu.c
index b6dc558d3144..5ec8f44e7ce9 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -131,8 +131,6 @@ int follow_pfn(struct vm_area_struct *vma, unsigned long address,
}
EXPORT_SYMBOL(follow_pfn);

-LIST_HEAD(vmap_area_list);
-
void vfree(const void *addr)
{
kfree(addr);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 786ecb18ae22..8c01f2225ef7 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -729,8 +729,6 @@ EXPORT_SYMBOL(vmalloc_to_pfn);


static DEFINE_SPINLOCK(free_vmap_area_lock);
-/* Export for kexec only */
-LIST_HEAD(vmap_area_list);
static bool vmap_initialized __read_mostly;

static struct rb_root purge_vmap_area_root = RB_ROOT;
--
2.39.2


2024-01-02 18:48:35

by Uladzislau Rezki (Sony)

Subject: [PATCH v3 07/11] mm: vmalloc: Offload free_vmap_area_lock lock

Concurrent access to the global vmap space is a bottleneck. We can
simulate high contention by running a vmalloc test suite.

To address it, introduce an effective vmap node logic. Each node behaves
as an independent entity. When a node is accessed, it serves a request
directly (if possible) from its pool.

This model provides size-based pools for requests, i.e. pools are
serialized and populated based on object size and real demand. The
maximum object size a pool can handle is set to 256 pages.

This technique reduces the pressure on the global vmap lock.
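
As an illustration of the size-class lookup introduced below: with 4K
pages, a three-page request lands in pool index 2.

/* idx = (size - 1) / PAGE_SIZE, so a 3-page VA maps to pool[2]. */
vp = size_to_va_pool(vn, 3 * PAGE_SIZE);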

Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
---
mm/vmalloc.c | 387 +++++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 342 insertions(+), 45 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 9b2f1b0cac9d..fa4ab2bbbc5b 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -775,7 +775,22 @@ struct rb_list {
spinlock_t lock;
};

+struct vmap_pool {
+ struct list_head head;
+ unsigned long len;
+};
+
+/*
+ * A fast size storage contains VAs up to 1M size.
+ */
+#define MAX_VA_SIZE_PAGES 256
+
static struct vmap_node {
+ /* Simple size segregated storage. */
+ struct vmap_pool pool[MAX_VA_SIZE_PAGES];
+ spinlock_t pool_lock;
+ bool skip_populate;
+
/* Bookkeeping data of this node. */
struct rb_list busy;
struct rb_list lazy;
@@ -784,6 +799,8 @@ static struct vmap_node {
* Ready-to-free areas.
*/
struct list_head purge_list;
+ struct work_struct purge_work;
+ unsigned long nr_purged;
} single;

static struct vmap_node *vmap_nodes = &single;
@@ -802,6 +819,61 @@ addr_to_node(unsigned long addr)
return &vmap_nodes[addr_to_node_id(addr)];
}

+static inline struct vmap_node *
+id_to_node(unsigned int id)
+{
+ return &vmap_nodes[id % nr_vmap_nodes];
+}
+
+/*
+ * We use the value 0 to represent "no node", that is why
+ * an encoded value will be the node-id incremented by 1.
+ * It is always greater then 0. A valid node_id which can
+ * be encoded is [0:nr_vmap_nodes - 1]. If a passed node_id
+ * is not valid 0 is returned.
+ */
+static unsigned int
+encode_vn_id(unsigned int node_id)
+{
+ /* Can store U8_MAX [0:254] nodes. */
+ if (node_id < nr_vmap_nodes)
+ return (node_id + 1) << BITS_PER_BYTE;
+
+ /* Warn and no node encoded. */
+ WARN_ONCE(1, "Encode wrong node id (%u)\n", node_id);
+ return 0;
+}
+
+/*
+ * Returns an encoded node-id, the valid range is within
+ * [0:nr_vmap_nodes-1] values. Otherwise nr_vmap_nodes is
+ * returned if extracted data is wrong.
+ */
+static unsigned int
+decode_vn_id(unsigned int val)
+{
+ unsigned int node_id = (val >> BITS_PER_BYTE) - 1;
+
+ /* Can store U8_MAX [0:254] nodes. */
+ if (node_id < nr_vmap_nodes)
+ return node_id;
+
+ /* If it was _not_ zero, warn. */
+ WARN_ONCE(node_id != UINT_MAX,
+ "Decode wrong node id (%d)\n", node_id);
+
+ return nr_vmap_nodes;
+}
+
+static bool
+is_vn_id_valid(unsigned int node_id)
+{
+ if (node_id < nr_vmap_nodes)
+ return true;
+
+ return false;
+}
+
static __always_inline unsigned long
va_size(struct vmap_area *va)
{
@@ -1623,6 +1695,104 @@ preload_this_cpu_lock(spinlock_t *lock, gfp_t gfp_mask, int node)
kmem_cache_free(vmap_area_cachep, va);
}

+static struct vmap_pool *
+size_to_va_pool(struct vmap_node *vn, unsigned long size)
+{
+ unsigned int idx = (size - 1) / PAGE_SIZE;
+
+ if (idx < MAX_VA_SIZE_PAGES)
+ return &vn->pool[idx];
+
+ return NULL;
+}
+
+static bool
+node_pool_add_va(struct vmap_node *n, struct vmap_area *va)
+{
+ struct vmap_pool *vp;
+
+ vp = size_to_va_pool(n, va_size(va));
+ if (!vp)
+ return false;
+
+ spin_lock(&n->pool_lock);
+ list_add(&va->list, &vp->head);
+ WRITE_ONCE(vp->len, vp->len + 1);
+ spin_unlock(&n->pool_lock);
+
+ return true;
+}
+
+static struct vmap_area *
+node_pool_del_va(struct vmap_node *vn, unsigned long size,
+ unsigned long align, unsigned long vstart,
+ unsigned long vend)
+{
+ struct vmap_area *va = NULL;
+ struct vmap_pool *vp;
+ int err = 0;
+
+ vp = size_to_va_pool(vn, size);
+ if (!vp || list_empty(&vp->head))
+ return NULL;
+
+ spin_lock(&vn->pool_lock);
+ if (!list_empty(&vp->head)) {
+ va = list_first_entry(&vp->head, struct vmap_area, list);
+
+ if (IS_ALIGNED(va->va_start, align)) {
+ /*
+ * Do some sanity check and emit a warning
+ * if one of below checks detects an error.
+ */
+ err |= (va_size(va) != size);
+ err |= (va->va_start < vstart);
+ err |= (va->va_end > vend);
+
+ if (!WARN_ON_ONCE(err)) {
+ list_del_init(&va->list);
+ WRITE_ONCE(vp->len, vp->len - 1);
+ } else {
+ va = NULL;
+ }
+ } else {
+ list_move_tail(&va->list, &vp->head);
+ va = NULL;
+ }
+ }
+ spin_unlock(&vn->pool_lock);
+
+ return va;
+}
+
+static struct vmap_area *
+node_alloc(unsigned long size, unsigned long align,
+ unsigned long vstart, unsigned long vend,
+ unsigned long *addr, unsigned int *vn_id)
+{
+ struct vmap_area *va;
+
+ *vn_id = 0;
+ *addr = vend;
+
+ /*
+ * Fallback to a global heap if not vmalloc or there
+ * is only one node.
+ */
+ if (vstart != VMALLOC_START || vend != VMALLOC_END ||
+ nr_vmap_nodes == 1)
+ return NULL;
+
+ *vn_id = raw_smp_processor_id() % nr_vmap_nodes;
+ va = node_pool_del_va(id_to_node(*vn_id), size, align, vstart, vend);
+ *vn_id = encode_vn_id(*vn_id);
+
+ if (va)
+ *addr = va->va_start;
+
+ return va;
+}
+
/*
* Allocate a region of KVA of the specified size and alignment, within the
* vstart and vend.
@@ -1637,6 +1807,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
struct vmap_area *va;
unsigned long freed;
unsigned long addr;
+ unsigned int vn_id;
int purged = 0;
int ret;

@@ -1647,11 +1818,23 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
return ERR_PTR(-EBUSY);

might_sleep();
- gfp_mask = gfp_mask & GFP_RECLAIM_MASK;

- va = kmem_cache_alloc_node(vmap_area_cachep, gfp_mask, node);
- if (unlikely(!va))
- return ERR_PTR(-ENOMEM);
+ /*
+ * If a VA is obtained from a global heap(if it fails here)
+ * it is anyway marked with this "vn_id" so it is returned
+ * to this pool's node later. Such way gives a possibility
+ * to populate pools based on users demand.
+ *
+ * On success a ready to go VA is returned.
+ */
+ va = node_alloc(size, align, vstart, vend, &addr, &vn_id);
+ if (!va) {
+ gfp_mask = gfp_mask & GFP_RECLAIM_MASK;
+
+ va = kmem_cache_alloc_node(vmap_area_cachep, gfp_mask, node);
+ if (unlikely(!va))
+ return ERR_PTR(-ENOMEM);
+ }

/*
* Only scan the relevant parts containing pointers to other objects
@@ -1660,10 +1843,12 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
kmemleak_scan_area(&va->rb_node, SIZE_MAX, gfp_mask);

retry:
- preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, node);
- addr = __alloc_vmap_area(&free_vmap_area_root, &free_vmap_area_list,
- size, align, vstart, vend);
- spin_unlock(&free_vmap_area_lock);
+ if (addr == vend) {
+ preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, node);
+ addr = __alloc_vmap_area(&free_vmap_area_root, &free_vmap_area_list,
+ size, align, vstart, vend);
+ spin_unlock(&free_vmap_area_lock);
+ }

trace_alloc_vmap_area(addr, size, align, vstart, vend, addr == vend);

@@ -1677,7 +1862,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
va->va_start = addr;
va->va_end = addr + size;
va->vm = NULL;
- va->flags = va_flags;
+ va->flags = (va_flags | vn_id);

vn = addr_to_node(va->va_start);

@@ -1770,63 +1955,135 @@ static DEFINE_MUTEX(vmap_purge_lock);
static void purge_fragmented_blocks_allcpus(void);
static cpumask_t purge_nodes;

-/*
- * Purges all lazily-freed vmap areas.
- */
-static unsigned long
-purge_vmap_node(struct vmap_node *vn)
+static void
+reclaim_list_global(struct list_head *head)
{
- unsigned long num_purged_areas = 0;
- struct vmap_area *va, *n_va;
+ struct vmap_area *va, *n;

- if (list_empty(&vn->purge_list))
- return 0;
+ if (list_empty(head))
+ return;

spin_lock(&free_vmap_area_lock);
+ list_for_each_entry_safe(va, n, head, list)
+ merge_or_add_vmap_area_augment(va,
+ &free_vmap_area_root, &free_vmap_area_list);
+ spin_unlock(&free_vmap_area_lock);
+}
+
+static void
+decay_va_pool_node(struct vmap_node *vn, bool full_decay)
+{
+ struct vmap_area *va, *nva;
+ struct list_head decay_list;
+ struct rb_root decay_root;
+ unsigned long n_decay;
+ int i;
+
+ decay_root = RB_ROOT;
+ INIT_LIST_HEAD(&decay_list);
+
+ for (i = 0; i < MAX_VA_SIZE_PAGES; i++) {
+ struct list_head tmp_list;
+
+ if (list_empty(&vn->pool[i].head))
+ continue;
+
+ INIT_LIST_HEAD(&tmp_list);
+
+ /* Detach the pool, so no-one can access it. */
+ spin_lock(&vn->pool_lock);
+ list_replace_init(&vn->pool[i].head, &tmp_list);
+ spin_unlock(&vn->pool_lock);
+
+ if (full_decay)
+ WRITE_ONCE(vn->pool[i].len, 0);
+
+ /* Decay a pool by ~25% out of left objects. */
+ n_decay = vn->pool[i].len >> 2;
+
+ list_for_each_entry_safe(va, nva, &tmp_list, list) {
+ list_del_init(&va->list);
+ merge_or_add_vmap_area(va, &decay_root, &decay_list);
+
+ if (!full_decay) {
+ WRITE_ONCE(vn->pool[i].len, vn->pool[i].len - 1);
+
+ if (!--n_decay)
+ break;
+ }
+ }
+
+ /* Attach the pool back if it has been partly decayed. */
+ if (!full_decay && !list_empty(&tmp_list)) {
+ spin_lock(&vn->pool_lock);
+ list_replace_init(&tmp_list, &vn->pool[i].head);
+ spin_unlock(&vn->pool_lock);
+ }
+ }
+
+ reclaim_list_global(&decay_list);
+}
+
+static void purge_vmap_node(struct work_struct *work)
+{
+ struct vmap_node *vn = container_of(work,
+ struct vmap_node, purge_work);
+ struct vmap_area *va, *n_va;
+ LIST_HEAD(local_list);
+
+ vn->nr_purged = 0;
+
list_for_each_entry_safe(va, n_va, &vn->purge_list, list) {
unsigned long nr = (va->va_end - va->va_start) >> PAGE_SHIFT;
unsigned long orig_start = va->va_start;
unsigned long orig_end = va->va_end;
+ unsigned int vn_id = decode_vn_id(va->flags);

- /*
- * Finally insert or merge lazily-freed area. It is
- * detached and there is no need to "unlink" it from
- * anything.
- */
- va = merge_or_add_vmap_area_augment(va, &free_vmap_area_root,
- &free_vmap_area_list);
-
- if (!va)
- continue;
+ list_del_init(&va->list);

if (is_vmalloc_or_module_addr((void *)orig_start))
kasan_release_vmalloc(orig_start, orig_end,
va->va_start, va->va_end);

atomic_long_sub(nr, &vmap_lazy_nr);
- num_purged_areas++;
+ vn->nr_purged++;
+
+ if (is_vn_id_valid(vn_id) && !vn->skip_populate)
+ if (node_pool_add_va(vn, va))
+ continue;
+
+ /* Go back to global. */
+ list_add(&va->list, &local_list);
}
- spin_unlock(&free_vmap_area_lock);

- return num_purged_areas;
+ reclaim_list_global(&local_list);
}

/*
* Purges all lazily-freed vmap areas.
*/
-static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
+static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end,
+ bool full_pool_decay)
{
- unsigned long num_purged_areas = 0;
+ unsigned long nr_purged_areas = 0;
+ unsigned int nr_purge_helpers;
+ unsigned int nr_purge_nodes;
struct vmap_node *vn;
int i;

lockdep_assert_held(&vmap_purge_lock);
+
+ /*
+ * Use cpumask to mark which node has to be processed.
+ */
purge_nodes = CPU_MASK_NONE;

for (i = 0; i < nr_vmap_nodes; i++) {
vn = &vmap_nodes[i];

INIT_LIST_HEAD(&vn->purge_list);
+ vn->skip_populate = full_pool_decay;
+ decay_va_pool_node(vn, full_pool_decay);

if (RB_EMPTY_ROOT(&vn->lazy.root))
continue;
@@ -1845,17 +2102,45 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
cpumask_set_cpu(i, &purge_nodes);
}

- if (cpumask_weight(&purge_nodes) > 0) {
+ nr_purge_nodes = cpumask_weight(&purge_nodes);
+ if (nr_purge_nodes > 0) {
flush_tlb_kernel_range(start, end);

+ /* One extra worker is per a lazy_max_pages() full set minus one. */
+ nr_purge_helpers = atomic_long_read(&vmap_lazy_nr) / lazy_max_pages();
+ nr_purge_helpers = clamp(nr_purge_helpers, 1U, nr_purge_nodes) - 1;
+
for_each_cpu(i, &purge_nodes) {
- vn = &nodes[i];
- num_purged_areas += purge_vmap_node(vn);
+ vn = &vmap_nodes[i];
+
+ if (nr_purge_helpers > 0) {
+ INIT_WORK(&vn->purge_work, purge_vmap_node);
+
+ if (cpumask_test_cpu(i, cpu_online_mask))
+ schedule_work_on(i, &vn->purge_work);
+ else
+ schedule_work(&vn->purge_work);
+
+ nr_purge_helpers--;
+ } else {
+ vn->purge_work.func = NULL;
+ purge_vmap_node(&vn->purge_work);
+ nr_purged_areas += vn->nr_purged;
+ }
+ }
+
+ for_each_cpu(i, &purge_nodes) {
+ vn = &vmap_nodes[i];
+
+ if (vn->purge_work.func) {
+ flush_work(&vn->purge_work);
+ nr_purged_areas += vn->nr_purged;
+ }
}
}

- trace_purge_vmap_area_lazy(start, end, num_purged_areas);
- return num_purged_areas > 0;
+ trace_purge_vmap_area_lazy(start, end, nr_purged_areas);
+ return nr_purged_areas > 0;
}

/*
@@ -1866,14 +2151,14 @@ static void reclaim_and_purge_vmap_areas(void)
{
mutex_lock(&vmap_purge_lock);
purge_fragmented_blocks_allcpus();
- __purge_vmap_area_lazy(ULONG_MAX, 0);
+ __purge_vmap_area_lazy(ULONG_MAX, 0, true);
mutex_unlock(&vmap_purge_lock);
}

static void drain_vmap_area_work(struct work_struct *work)
{
mutex_lock(&vmap_purge_lock);
- __purge_vmap_area_lazy(ULONG_MAX, 0);
+ __purge_vmap_area_lazy(ULONG_MAX, 0, false);
mutex_unlock(&vmap_purge_lock);
}

@@ -1884,9 +2169,10 @@ static void drain_vmap_area_work(struct work_struct *work)
*/
static void free_vmap_area_noflush(struct vmap_area *va)
{
- struct vmap_node *vn = addr_to_node(va->va_start);
unsigned long nr_lazy_max = lazy_max_pages();
unsigned long va_start = va->va_start;
+ unsigned int vn_id = decode_vn_id(va->flags);
+ struct vmap_node *vn;
unsigned long nr_lazy;

if (WARN_ON_ONCE(!list_empty(&va->list)))
@@ -1896,10 +2182,14 @@ static void free_vmap_area_noflush(struct vmap_area *va)
PAGE_SHIFT, &vmap_lazy_nr);

/*
- * Merge or place it to the purge tree/list.
+ * If it was request by a certain node we would like to
+ * return it to that node, i.e. its pool for later reuse.
*/
+ vn = is_vn_id_valid(vn_id) ?
+ id_to_node(vn_id):addr_to_node(va->va_start);
+
spin_lock(&vn->lazy.lock);
- merge_or_add_vmap_area(va, &vn->lazy.root, &vn->lazy.head);
+ insert_vmap_area(va, &vn->lazy.root, &vn->lazy.head);
spin_unlock(&vn->lazy.lock);

trace_free_vmap_area_noflush(va_start, nr_lazy, nr_lazy_max);
@@ -2408,7 +2698,7 @@ static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush)
}
free_purged_blocks(&purge_list);

- if (!__purge_vmap_area_lazy(start, end) && flush)
+ if (!__purge_vmap_area_lazy(start, end, false) && flush)
flush_tlb_kernel_range(start, end);
mutex_unlock(&vmap_purge_lock);
}
@@ -4576,7 +4866,7 @@ static void vmap_init_free_space(void)
static void vmap_init_nodes(void)
{
struct vmap_node *vn;
- int i;
+ int i, j;

for (i = 0; i < nr_vmap_nodes; i++) {
vn = &vmap_nodes[i];
@@ -4587,6 +4877,13 @@ static void vmap_init_nodes(void)
vn->lazy.root = RB_ROOT;
INIT_LIST_HEAD(&vn->lazy.head);
spin_lock_init(&vn->lazy.lock);
+
+ for (j = 0; j < MAX_VA_SIZE_PAGES; j++) {
+ INIT_LIST_HEAD(&vn->pool[j].head);
+ WRITE_ONCE(vn->pool[j].len, 0);
+ }
+
+ spin_lock_init(&vn->pool_lock);
}
}

--
2.39.2


2024-01-02 18:48:37

by Uladzislau Rezki (Sony)

Subject: [PATCH v3 08/11] mm: vmalloc: Support multiple nodes in vread_iter

Extend vread_iter() to be able to perform sequential reading of VAs which
are spread among multiple nodes, so that a data read over /dev/kmem
correctly reflects the vmalloc memory layout.

Reviewed-by: Baoquan He <[email protected]>
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
---
mm/vmalloc.c | 67 +++++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 53 insertions(+), 14 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index fa4ab2bbbc5b..594ed003d44d 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -906,7 +906,7 @@ unsigned long vmalloc_nr_pages(void)

/* Look up the first VA which satisfies addr < va_end, NULL if none. */
static struct vmap_area *
-find_vmap_area_exceed_addr(unsigned long addr, struct rb_root *root)
+__find_vmap_area_exceed_addr(unsigned long addr, struct rb_root *root)
{
struct vmap_area *va = NULL;
struct rb_node *n = root->rb_node;
@@ -930,6 +930,41 @@ find_vmap_area_exceed_addr(unsigned long addr, struct rb_root *root)
return va;
}

+/*
+ * Returns a node where a first VA, that satisfies addr < va_end, resides.
+ * If success, a node is locked. A user is responsible to unlock it when a
+ * VA is no longer needed to be accessed.
+ *
+ * Returns NULL if nothing found.
+ */
+static struct vmap_node *
+find_vmap_area_exceed_addr_lock(unsigned long addr, struct vmap_area **va)
+{
+ struct vmap_node *vn, *va_node = NULL;
+ struct vmap_area *va_lowest;
+ int i;
+
+ for (i = 0; i < nr_vmap_nodes; i++) {
+ vn = &vmap_nodes[i];
+
+ spin_lock(&vn->busy.lock);
+ va_lowest = __find_vmap_area_exceed_addr(addr, &vn->busy.root);
+ if (va_lowest) {
+ if (!va_node || va_lowest->va_start < (*va)->va_start) {
+ if (va_node)
+ spin_unlock(&va_node->busy.lock);
+
+ *va = va_lowest;
+ va_node = vn;
+ continue;
+ }
+ }
+ spin_unlock(&vn->busy.lock);
+ }
+
+ return va_node;
+}
+
static struct vmap_area *__find_vmap_area(unsigned long addr, struct rb_root *root)
{
struct rb_node *n = root->rb_node;
@@ -4102,6 +4137,7 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
struct vm_struct *vm;
char *vaddr;
size_t n, size, flags, remains;
+ unsigned long next;

addr = kasan_reset_tag(addr);

@@ -4111,19 +4147,15 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)

remains = count;

- /* Hooked to node_0 so far. */
- vn = addr_to_node(0);
- spin_lock(&vn->busy.lock);
-
- va = find_vmap_area_exceed_addr((unsigned long)addr, &vn->busy.root);
- if (!va)
+ vn = find_vmap_area_exceed_addr_lock((unsigned long) addr, &va);
+ if (!vn)
goto finished_zero;

/* no intersects with alive vmap_area */
if ((unsigned long)addr + remains <= va->va_start)
goto finished_zero;

- list_for_each_entry_from(va, &vn->busy.head, list) {
+ do {
size_t copied;

if (remains == 0)
@@ -4138,10 +4170,10 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
WARN_ON(flags == VMAP_BLOCK);

if (!vm && !flags)
- continue;
+ goto next_va;

if (vm && (vm->flags & VM_UNINITIALIZED))
- continue;
+ goto next_va;

/* Pair with smp_wmb() in clear_vm_uninitialized_flag() */
smp_rmb();
@@ -4150,7 +4182,7 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
size = vm ? get_vm_area_size(vm) : va_size(va);

if (addr >= vaddr + size)
- continue;
+ goto next_va;

if (addr < vaddr) {
size_t to_zero = min_t(size_t, vaddr - addr, remains);
@@ -4179,15 +4211,22 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)

if (copied != n)
goto finished;
- }
+
+ next_va:
+ next = va->va_end;
+ spin_unlock(&vn->busy.lock);
+ } while ((vn = find_vmap_area_exceed_addr_lock(next, &va)));

finished_zero:
- spin_unlock(&vn->busy.lock);
+ if (vn)
+ spin_unlock(&vn->busy.lock);
+
/* zero-fill memory holes */
return count - remains + zero_iter(iter, remains);
finished:
/* Nothing remains, or We couldn't copy/zero everything. */
- spin_unlock(&vn->busy.lock);
+ if (vn)
+ spin_unlock(&vn->busy.lock);

return count - remains;
}
--
2.39.2


2024-01-02 18:48:52

by Uladzislau Rezki (Sony)

[permalink] [raw]
Subject: [PATCH v3 09/11] mm: vmalloc: Support multiple nodes in vmallocinfo

Allocated areas are spread among nodes, which implies that the
scanning has to be performed for each node individually in order
to dump all existing VAs.

Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
---
mm/vmalloc.c | 120 ++++++++++++++++++++-------------------------------
1 file changed, 47 insertions(+), 73 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 594ed003d44d..0c671cb96151 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -4709,30 +4709,6 @@ bool vmalloc_dump_obj(void *object)
#endif

#ifdef CONFIG_PROC_FS
-static void *s_start(struct seq_file *m, loff_t *pos)
-{
- struct vmap_node *vn = addr_to_node(0);
-
- mutex_lock(&vmap_purge_lock);
- spin_lock(&vn->busy.lock);
-
- return seq_list_start(&vn->busy.head, *pos);
-}
-
-static void *s_next(struct seq_file *m, void *p, loff_t *pos)
-{
- struct vmap_node *vn = addr_to_node(0);
- return seq_list_next(p, &vn->busy.head, pos);
-}
-
-static void s_stop(struct seq_file *m, void *p)
-{
- struct vmap_node *vn = addr_to_node(0);
-
- spin_unlock(&vn->busy.lock);
- mutex_unlock(&vmap_purge_lock);
-}
-
static void show_numa_info(struct seq_file *m, struct vm_struct *v)
{
if (IS_ENABLED(CONFIG_NUMA)) {
@@ -4776,84 +4752,82 @@ static void show_purge_info(struct seq_file *m)
}
}

-static int s_show(struct seq_file *m, void *p)
+static int vmalloc_info_show(struct seq_file *m, void *p)
{
struct vmap_node *vn;
struct vmap_area *va;
struct vm_struct *v;
+ int i;

- vn = addr_to_node(0);
- va = list_entry(p, struct vmap_area, list);
+ for (i = 0; i < nr_vmap_nodes; i++) {
+ vn = &vmap_nodes[i];

- if (!va->vm) {
- if (va->flags & VMAP_RAM)
- seq_printf(m, "0x%pK-0x%pK %7ld vm_map_ram\n",
- (void *)va->va_start, (void *)va->va_end,
- va->va_end - va->va_start);
+ spin_lock(&vn->busy.lock);
+ list_for_each_entry(va, &vn->busy.head, list) {
+ if (!va->vm) {
+ if (va->flags & VMAP_RAM)
+ seq_printf(m, "0x%pK-0x%pK %7ld vm_map_ram\n",
+ (void *)va->va_start, (void *)va->va_end,
+ va->va_end - va->va_start);

- goto final;
- }
+ continue;
+ }

- v = va->vm;
+ v = va->vm;

- seq_printf(m, "0x%pK-0x%pK %7ld",
- v->addr, v->addr + v->size, v->size);
+ seq_printf(m, "0x%pK-0x%pK %7ld",
+ v->addr, v->addr + v->size, v->size);

- if (v->caller)
- seq_printf(m, " %pS", v->caller);
+ if (v->caller)
+ seq_printf(m, " %pS", v->caller);

- if (v->nr_pages)
- seq_printf(m, " pages=%d", v->nr_pages);
+ if (v->nr_pages)
+ seq_printf(m, " pages=%d", v->nr_pages);

- if (v->phys_addr)
- seq_printf(m, " phys=%pa", &v->phys_addr);
+ if (v->phys_addr)
+ seq_printf(m, " phys=%pa", &v->phys_addr);

- if (v->flags & VM_IOREMAP)
- seq_puts(m, " ioremap");
+ if (v->flags & VM_IOREMAP)
+ seq_puts(m, " ioremap");

- if (v->flags & VM_ALLOC)
- seq_puts(m, " vmalloc");
+ if (v->flags & VM_ALLOC)
+ seq_puts(m, " vmalloc");

- if (v->flags & VM_MAP)
- seq_puts(m, " vmap");
+ if (v->flags & VM_MAP)
+ seq_puts(m, " vmap");

- if (v->flags & VM_USERMAP)
- seq_puts(m, " user");
+ if (v->flags & VM_USERMAP)
+ seq_puts(m, " user");

- if (v->flags & VM_DMA_COHERENT)
- seq_puts(m, " dma-coherent");
+ if (v->flags & VM_DMA_COHERENT)
+ seq_puts(m, " dma-coherent");

- if (is_vmalloc_addr(v->pages))
- seq_puts(m, " vpages");
+ if (is_vmalloc_addr(v->pages))
+ seq_puts(m, " vpages");

- show_numa_info(m, v);
- seq_putc(m, '\n');
+ show_numa_info(m, v);
+ seq_putc(m, '\n');
+ }
+ spin_unlock(&vn->busy.lock);
+ }

/*
* As a final step, dump "unpurged" areas.
*/
-final:
- if (list_is_last(&va->list, &vn->busy.head))
- show_purge_info(m);
-
+ show_purge_info(m);
return 0;
}

-static const struct seq_operations vmalloc_op = {
- .start = s_start,
- .next = s_next,
- .stop = s_stop,
- .show = s_show,
-};
-
static int __init proc_vmalloc_init(void)
{
+ void *priv_data = NULL;
+
if (IS_ENABLED(CONFIG_NUMA))
- proc_create_seq_private("vmallocinfo", 0400, NULL,
- &vmalloc_op,
- nr_node_ids * sizeof(unsigned int), NULL);
- else
- proc_create_seq("vmallocinfo", 0400, NULL, &vmalloc_op);
+ priv_data = kmalloc(nr_node_ids * sizeof(unsigned int), GFP_KERNEL);
+
+ proc_create_single_data("vmallocinfo",
+ 0400, NULL, vmalloc_info_show, priv_data);
+
return 0;
}
module_init(proc_vmalloc_init);
--
2.39.2


2024-01-02 18:49:17

by Uladzislau Rezki (Sony)

[permalink] [raw]
Subject: [PATCH v3 10/11] mm: vmalloc: Set nr_nodes based on CPUs in a system

The number of nodes used in the alloc/free paths is set based on
num_possible_cpus() in a system. Please note the high threshold
is fixed and corresponds to 128 nodes.

For 32-bit or single-core systems an access to the global vmap
heap is not balanced. Such small systems do not suffer from lock
contention due to the low number of CPUs. In that case nr_nodes
is set to 1.

Test on AMD Ryzen Threadripper 3970X 32-Core Processor:
sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64

<default perf>
94.41% 0.89% [kernel] [k] _raw_spin_lock
93.35% 93.07% [kernel] [k] native_queued_spin_lock_slowpath
76.13% 0.28% [kernel] [k] __vmalloc_node_range
72.96% 0.81% [kernel] [k] alloc_vmap_area
56.94% 0.00% [kernel] [k] __get_vm_area_node
41.95% 0.00% [kernel] [k] vmalloc
37.15% 0.01% [test_vmalloc] [k] full_fit_alloc_test
35.17% 0.00% [kernel] [k] ret_from_fork_asm
35.17% 0.00% [kernel] [k] ret_from_fork
35.17% 0.00% [kernel] [k] kthread
35.08% 0.00% [test_vmalloc] [k] test_func
34.45% 0.00% [test_vmalloc] [k] fix_size_alloc_test
28.09% 0.01% [test_vmalloc] [k] long_busy_list_alloc_test
23.53% 0.25% [kernel] [k] vfree.part.0
21.72% 0.00% [kernel] [k] remove_vm_area
20.08% 0.21% [kernel] [k] find_unlink_vmap_area
2.34% 0.61% [kernel] [k] free_vmap_area_noflush
<default perf>
vs
<patch-series perf>
82.32% 0.22% [test_vmalloc] [k] long_busy_list_alloc_test
63.36% 0.02% [kernel] [k] vmalloc
63.34% 2.64% [kernel] [k] __vmalloc_node_range
30.42% 4.46% [kernel] [k] vfree.part.0
28.98% 2.51% [kernel] [k] __alloc_pages_bulk
27.28% 0.19% [kernel] [k] __get_vm_area_node
26.13% 1.50% [kernel] [k] alloc_vmap_area
21.72% 21.67% [kernel] [k] clear_page_rep
19.51% 2.43% [kernel] [k] _raw_spin_lock
16.61% 16.51% [kernel] [k] native_queued_spin_lock_slowpath
13.40% 2.07% [kernel] [k] free_unref_page
10.62% 0.01% [kernel] [k] remove_vm_area
9.02% 8.73% [kernel] [k] insert_vmap_area
8.94% 0.00% [kernel] [k] ret_from_fork_asm
8.94% 0.00% [kernel] [k] ret_from_fork
8.94% 0.00% [kernel] [k] kthread
8.29% 0.00% [test_vmalloc] [k] test_func
7.81% 0.05% [test_vmalloc] [k] full_fit_alloc_test
5.30% 4.73% [kernel] [k] purge_vmap_node
4.47% 2.65% [kernel] [k] free_vmap_area_noflush
<patch-series perf>

This confirms that native_queued_spin_lock_slowpath goes down from
93.07% to 16.51%.

The throughput is ~12x higher (the real time drops from ~651 s to ~51 s):

urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
Run the test with following parameters: run_test_mask=7 nr_threads=64
Done.
Check the kernel ring buffer to see the summary.

real 10m51.271s
user 0m0.013s
sys 0m0.187s
urezki@pc638:~$

urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
Run the test with following parameters: run_test_mask=7 nr_threads=64
Done.
Check the kernel ring buffer to see the summary.

real 0m51.301s
user 0m0.015s
sys 0m0.040s
urezki@pc638:~$

Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
---
mm/vmalloc.c | 29 +++++++++++++++++++++++------
1 file changed, 23 insertions(+), 6 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 0c671cb96151..ef534c76daef 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -4879,10 +4879,27 @@ static void vmap_init_free_space(void)
static void vmap_init_nodes(void)
{
struct vmap_node *vn;
- int i, j;
+ int i, n;
+
+#if BITS_PER_LONG == 64
+ /* A high threshold of max nodes is fixed and bound to 128. */
+ n = clamp_t(unsigned int, num_possible_cpus(), 1, 128);
+
+ if (n > 1) {
+ vn = kmalloc_array(n, sizeof(*vn), GFP_NOWAIT | __GFP_NOWARN);
+ if (vn) {
+ /* Node partition is 16 pages. */
+ vmap_zone_size = (1 << 4) * PAGE_SIZE;
+ nr_vmap_nodes = n;
+ vmap_nodes = vn;
+ } else {
+ pr_err("Failed to allocate an array. Disable a node layer\n");
+ }
+ }
+#endif

- for (i = 0; i < nr_vmap_nodes; i++) {
- vn = &vmap_nodes[i];
+ for (n = 0; n < nr_vmap_nodes; n++) {
+ vn = &vmap_nodes[n];
vn->busy.root = RB_ROOT;
INIT_LIST_HEAD(&vn->busy.head);
spin_lock_init(&vn->busy.lock);
@@ -4891,9 +4908,9 @@ static void vmap_init_nodes(void)
INIT_LIST_HEAD(&vn->lazy.head);
spin_lock_init(&vn->lazy.lock);

- for (j = 0; j < MAX_VA_SIZE_PAGES; j++) {
- INIT_LIST_HEAD(&vn->pool[j].head);
- WRITE_ONCE(vn->pool[j].len, 0);
+ for (i = 0; i < MAX_VA_SIZE_PAGES; i++) {
+ INIT_LIST_HEAD(&vn->pool[i].head);
+ WRITE_ONCE(vn->pool[i].len, 0);
}

spin_lock_init(&vn->pool_lock);
--
2.39.2


2024-01-02 18:49:26

by Uladzislau Rezki (Sony)

[permalink] [raw]
Subject: [PATCH v3 11/11] mm: vmalloc: Add a shrinker to drain vmap pools

The added shrinker is used to return currently cached VAs back
into the global vmap space when a system enters a low-memory mode.

Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
---
mm/vmalloc.c | 39 +++++++++++++++++++++++++++++++++++++++
1 file changed, 39 insertions(+)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ef534c76daef..e30dabf68263 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -4917,8 +4917,37 @@ static void vmap_init_nodes(void)
}
}

+static unsigned long
+vmap_node_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
+{
+ unsigned long count;
+ struct vmap_node *vn;
+ int i, j;
+
+ for (count = 0, i = 0; i < nr_vmap_nodes; i++) {
+ vn = &vmap_nodes[i];
+
+ for (j = 0; j < MAX_VA_SIZE_PAGES; j++)
+ count += READ_ONCE(vn->pool[j].len);
+ }
+
+ return count ? count : SHRINK_EMPTY;
+}
+
+static unsigned long
+vmap_node_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
+{
+ int i;
+
+ for (i = 0; i < nr_vmap_nodes; i++)
+ decay_va_pool_node(&vmap_nodes[i], true);
+
+ return SHRINK_STOP;
+}
+
void __init vmalloc_init(void)
{
+ struct shrinker *vmap_node_shrinker;
struct vmap_area *va;
struct vmap_node *vn;
struct vm_struct *tmp;
@@ -4966,4 +4995,14 @@ void __init vmalloc_init(void)
*/
vmap_init_free_space();
vmap_initialized = true;
+
+ vmap_node_shrinker = shrinker_alloc(0, "vmap-node");
+ if (!vmap_node_shrinker) {
+ pr_err("Failed to allocate vmap-node shrinker!\n");
+ return;
+ }
+
+ vmap_node_shrinker->count_objects = vmap_node_shrink_count;
+ vmap_node_shrinker->scan_objects = vmap_node_shrink_scan;
+ shrinker_register(vmap_node_shrinker);
}
--
2.39.2


2024-01-03 11:09:05

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCH v3 07/11] mm: vmalloc: Offload free_vmap_area_lock lock

On Tue, 2 Jan 2024 19:46:29 +0100 Uladzislau Rezki <[email protected]>
> +static void
> +decay_va_pool_node(struct vmap_node *vn, bool full_decay)
> +{
> + struct vmap_area *va, *nva;
> + struct list_head decay_list;
> + struct rb_root decay_root;
> + unsigned long n_decay;
> + int i;
> +
> + decay_root = RB_ROOT;
> + INIT_LIST_HEAD(&decay_list);
> +
> + for (i = 0; i < MAX_VA_SIZE_PAGES; i++) {
> + struct list_head tmp_list;
> +
> + if (list_empty(&vn->pool[i].head))
> + continue;
> +
> + INIT_LIST_HEAD(&tmp_list);
> +
> + /* Detach the pool, so no-one can access it. */
> + spin_lock(&vn->pool_lock);
> + list_replace_init(&vn->pool[i].head, &tmp_list);
> + spin_unlock(&vn->pool_lock);
> +
> + if (full_decay)
> + WRITE_ONCE(vn->pool[i].len, 0);
> +
> + /* Decay a pool by ~25% out of left objects. */
> + n_decay = vn->pool[i].len >> 2;
> +
> + list_for_each_entry_safe(va, nva, &tmp_list, list) {
> + list_del_init(&va->list);
> + merge_or_add_vmap_area(va, &decay_root, &decay_list);
> +
> + if (!full_decay) {
> + WRITE_ONCE(vn->pool[i].len, vn->pool[i].len - 1);
> +
> + if (!--n_decay)
> + break;
> + }
> + }
> +
> + /* Attach the pool back if it has been partly decayed. */
> + if (!full_decay && !list_empty(&tmp_list)) {
> + spin_lock(&vn->pool_lock);
> + list_replace_init(&tmp_list, &vn->pool[i].head);
> + spin_unlock(&vn->pool_lock);
> + }

Failure of working out why list_splice() was not used here in case of
non-empty vn->pool[i].head, after staring ten minutes.
> + }
> +
> + reclaim_list_global(&decay_list);
> +}

2024-01-03 15:48:09

by Uladzislau Rezki (Sony)

[permalink] [raw]
Subject: Re: [PATCH v3 07/11] mm: vmalloc: Offload free_vmap_area_lock lock

On Wed, Jan 03, 2024 at 07:08:32PM +0800, Hillf Danton wrote:
> On Tue, 2 Jan 2024 19:46:29 +0100 Uladzislau Rezki <[email protected]>
> > +static void
> > +decay_va_pool_node(struct vmap_node *vn, bool full_decay)
> > +{
> > + struct vmap_area *va, *nva;
> > + struct list_head decay_list;
> > + struct rb_root decay_root;
> > + unsigned long n_decay;
> > + int i;
> > +
> > + decay_root = RB_ROOT;
> > + INIT_LIST_HEAD(&decay_list);
> > +
> > + for (i = 0; i < MAX_VA_SIZE_PAGES; i++) {
> > + struct list_head tmp_list;
> > +
> > + if (list_empty(&vn->pool[i].head))
> > + continue;
> > +
> > + INIT_LIST_HEAD(&tmp_list);
> > +
> > + /* Detach the pool, so no-one can access it. */
> > + spin_lock(&vn->pool_lock);
> > + list_replace_init(&vn->pool[i].head, &tmp_list);
> > + spin_unlock(&vn->pool_lock);
> > +
> > + if (full_decay)
> > + WRITE_ONCE(vn->pool[i].len, 0);
> > +
> > + /* Decay a pool by ~25% out of left objects. */
> > + n_decay = vn->pool[i].len >> 2;
> > +
> > + list_for_each_entry_safe(va, nva, &tmp_list, list) {
> > + list_del_init(&va->list);
> > + merge_or_add_vmap_area(va, &decay_root, &decay_list);
> > +
> > + if (!full_decay) {
> > + WRITE_ONCE(vn->pool[i].len, vn->pool[i].len - 1);
> > +
> > + if (!--n_decay)
> > + break;
> > + }
> > + }
> > +
> > + /* Attach the pool back if it has been partly decayed. */
> > + if (!full_decay && !list_empty(&tmp_list)) {
> > + spin_lock(&vn->pool_lock);
> > + list_replace_init(&tmp_list, &vn->pool[i].head);
> > + spin_unlock(&vn->pool_lock);
> > + }
>
> Failure of working out why list_splice() was not used here in case of
> non-empty vn->pool[i].head, after staring ten minutes.
>
The vn->pool[i].head is always empty here because we have detached it above
and re-initialized it. A concurrent decay and populate is also not possible
because both are done by only one context.
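
To make that explicit, here is a condensed view of the flow (just a
sketch mirroring the hunk quoted above, with the invariant spelled out
in the comments):

<snip>
spin_lock(&vn->pool_lock);
/* pool[i].head is re-initialized to empty, tmp_list takes its entries. */
list_replace_init(&vn->pool[i].head, &tmp_list);
spin_unlock(&vn->pool_lock);

/* ... decay ~25% of tmp_list into decay_list ... */

/*
 * Only this context repopulates pool[i], so pool[i].head is still the
 * empty list initialized above. Replacing an empty head is equivalent
 * to splicing the remainder into an empty list, hence list_replace_init()
 * instead of list_splice().
 */
spin_lock(&vn->pool_lock);
list_replace_init(&tmp_list, &vn->pool[i].head);
spin_unlock(&vn->pool_lock);
<snip>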

--
Uladzislau Rezki

2024-01-05 08:11:14

by Wen Gu

[permalink] [raw]
Subject: Re: [PATCH v3 04/11] mm: vmalloc: Remove global vmap_area_root rb-tree


On 2024/01/03 02:46, Uladzislau Rezki wrote:

> Store allocated objects in a separate nodes. A va->va_start
> address is converted into a correct node where it should
> be placed and resided. An addr_to_node() function is used
> to do a proper address conversion to determine a node that
> contains a VA.
>
> Such approach balances VAs across nodes as a result an access
> becomes scalable. Number of nodes in a system depends on number
> of CPUs.
>
> Please note:
>
> 1. As of now allocated VAs are bound to a node-0. It means the
> patch does not give any difference comparing with a current
> behavior;
>
> 2. The global vmap_area_lock, vmap_area_root are removed as there
> is no need in it anymore. The vmap_area_list is still kept and
> is _empty_. It is exported for a kexec only;
>
> 3. The vmallocinfo and vread() have to be reworked to be able to
> handle multiple nodes.
>
> Reviewed-by: Baoquan He <[email protected]>
> Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
> ---

<...>

> struct vmap_area *find_vmap_area(unsigned long addr)
> {
> + struct vmap_node *vn;
> struct vmap_area *va;
> + int i, j;
>
> - spin_lock(&vmap_area_lock);
> - va = __find_vmap_area(addr, &vmap_area_root);
> - spin_unlock(&vmap_area_lock);
> + /*
> + * An addr_to_node_id(addr) converts an address to a node index
> + * where a VA is located. If VA spans several zones and passed
> + * addr is not the same as va->va_start, what is not common, we
> + * may need to scan an extra nodes. See an example:
> + *
> + * <--va-->
> + * -|-----|-----|-----|-----|-
> + * 1 2 0 1
> + *
> + * VA resides in node 1 whereas it spans 1 and 2. If passed
> + * addr is within a second node we should do extra work. We
> + * should mention that it is rare and is a corner case from
> + * the other hand it has to be covered.
> + */
> + i = j = addr_to_node_id(addr);
> + do {
> + vn = &vmap_nodes[i];
>
> - return va;
> + spin_lock(&vn->busy.lock);
> + va = __find_vmap_area(addr, &vn->busy.root);
> + spin_unlock(&vn->busy.lock);
> +
> + if (va)
> + return va;
> + } while ((i = (i + 1) % nr_vmap_nodes) != j);
> +
> + return NULL;
> }
>

Hi Uladzislau Rezki,

I really like your work, it is great and helpful!

Currently, I am working on using shared memory communication (SMC [1])
to transparently accelerate TCP communication between two peers within
the same OS instance[2].

In this scenario, a vzalloced kernel buffer acts as shared memory and
will be simultaneously read or written by two SMC sockets, thus forming
an SMC connection.


  socket1                         socket2
     |                              ^
     |                              |   userspace
  ---- write ---------------------- read ------
     |     +-----------------+      |   kernel
     +---->|  shared memory  |------+
           | (vzalloced now) |
           +-----------------+

Then I encountered the performance regression caused by lock contention
in find_vmap_area() when multiple threads transfer data through multiple
SMC connections on machines with many CPUs[3].

According to perf, the performance bottleneck is caused by the global
vmap_area_lock contention[4]:

- writer:

smc_tx_sendmsg
-> memcpy_from_msg
-> copy_from_iter
-> check_copy_size
-> check_object_size
-> if (CONFIG_HARDENED_USERCOPY is set) check_heap_object
-> if(vm) find_vmap_area
-> try to hold vmap_area_lock lock
- reader:

smc_rx_recvmsg
-> memcpy_to_msg
-> copy_to_iter
-> check_copy_size
-> check_object_size
-> if (CONFIG_HARDENED_USERCOPY is set) check_heap_object
-> if(vm) find_vmap_area
-> try to hold vmap_area_lock lock

Fortunately, thanks to this patch set, the global vmap_area_lock was
removed and a per-node lock vn->busy.lock was introduced. It is really helpful:

In a 48-CPU qemu environment, the Requests/sec increased by 5 times:
- nginx
- wrk -c 1000 -t 96 -d 30 http://127.0.0.1:80

                  vzalloced shmem    vzalloced shmem(with this patch set)
Requests/sec      113536.56          583729.93


But it also has some overhead, compared to using kzalloced shared memory
or unsetting CONFIG_HARDENED_USERCOPY, which won't involve finding vmap area:

                  kzalloced shmem    vzalloced shmem(unset CONFIG_HARDENED_USERCOPY)
Requests/sec      831950.39          805164.78


So, as a newbie in Linux-mm, I would like to ask for some suggestions:

Is it possible to further eliminate the overhead caused by lock contention
in find_vmap_area() in this scenario (maybe this is asking too much), or is
the only way out not setting CONFIG_HARDENED_USERCOPY, or not using a
vzalloced buffer in situations where concurrent kernel-userspace copies happen?

Any feedback will be appreciated. Thanks again for your time.


[1] Shared Memory Communications (SMC) enables two SMC capable peers to
communicate by using memory buffers that each peer allocates for the
partner's use. It improves throughput, lowers latency and cost, and
maintains existing functions. See details in https://www.ibm.com/support/pages/node/7009315

[2] https://lore.kernel.org/netdev/[email protected]/

[3] issues: https://lore.kernel.org/all/[email protected]/
analysis: https://lore.kernel.org/all/[email protected]/

[4] Some flamegraphs are attached,
- SMC using vzalloced buffer: vzalloc_t96.svg
- SMC using vzalloced buffer and with this patchset: vzalloc_t96_improve.svg
- SMC using vzalloced buffer and unset CONFIG_HARDENED_USERCOPY: vzalloc_t96_nocheck.svg
- SMC using kzalloced buffer: kzalloc_t96.svg


Best regards,
Wen Gu


Attachments:
vzalloc_t96.svg (197.44 kB)
vzalloc_t96_improve.svg (269.85 kB)
vzalloc_t96_nocheck.svg (277.13 kB)
kzalloc_t96.svg (289.89 kB)
Download all attachments

2024-01-05 10:50:22

by Uladzislau Rezki (Sony)

[permalink] [raw]
Subject: Re: [PATCH v3 04/11] mm: vmalloc: Remove global vmap_area_root rb-tree

Hello, Wen Gu.

>
> Hi Uladzislau Rezki,
>
> I really like your work, it is great and helpful!
>
> Currently, I am working on using shared memory communication (SMC [1])
> to transparently accelerate TCP communication between two peers within
> the same OS instance[2].
>
> In this scenario, a vzalloced kernel buffer acts as a shared memory and
> will be simultaneous read or written by two SMC sockets, thus forming an
> SMC connection.
>
>
> socket1 socket2
> | ^
> | | userspace
> ---- write -------------------- read ------
> | +-----------------+ | kernel
> +--->| shared memory |---+
> | (vzalloced now) |
> +-----------------+
>
> Then I encountered the performance regression caused by lock contention
> in find_vmap_area() when multiple threads transfer data through multiple
> SMC connections on machines with many CPUs[3].
>
> According to perf, the performance bottleneck is caused by the global
> vmap_area_lock contention[4]:
>
> - writer:
>
> smc_tx_sendmsg
> -> memcpy_from_msg
> -> copy_from_iter
> -> check_copy_size
> -> check_object_size
> -> if (CONFIG_HARDENED_USERCOPY is set) check_heap_object
> -> if(vm) find_vmap_area
> -> try to hold vmap_area_lock lock
> - reader:
>
> smc_rx_recvmsg
> -> memcpy_to_msg
> -> copy_to_iter
> -> check_copy_size
> -> check_object_size
> -> if (CONFIG_HARDENED_USERCOPY is set) check_heap_object
> -> if(vm) find_vmap_area
> -> try to hold vmap_area_lock lock
>
> Fortunately, thank you for this patch set, the global vmap_area_lock was
> removed and per node lock vn->busy.lock is introduced. it is really helpful:
>
> In 48 CPUs qemu environment, the Requests/s increased by 5 times:
> - nginx
> - wrk -c 1000 -t 96 -d 30 http://127.0.0.1:80
>
> vzalloced shmem vzalloced shmem(with this patch set)
> Requests/sec 113536.56 583729.93
>
>
Thank you for the confirmation that your workload is improved. The "nginx"
is 5 times better!

> But it also has some overhead, compared to using kzalloced shared memory
> or unsetting CONFIG_HARDENED_USERCOPY, which won't involve finding vmap area:
>
> kzalloced shmem vzalloced shmem(unset CONFIG_HARDENED_USERCOPY)
> Requests/sec 831950.39 805164.78
>
>
CONFIG_HARDENED_USERCOPY prevents copying "wrong" memory regions. That is
why, for vmalloced memory, it wants to make sure the region is really a
live vmalloc object; if not, the user-copy is aborted.

So there is extra work that involves finding the VA associated with an address.
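
For reference, the vmalloc branch of that check looks roughly like the
sketch below (modeled on check_heap_object() in mm/usercopy.c; other
object classes and details are left out, so take it as an illustration
rather than the exact implementation):

<snip>
/* Simplified sketch of the vmalloc case in check_heap_object(). */
unsigned long addr = (unsigned long)ptr;

if (is_vmalloc_addr(ptr)) {
	struct vmap_area *area = find_vmap_area(addr);

	/* No live VA backs this address - abort the copy. */
	if (!area)
		usercopy_abort("vmalloc", "no area", to_user, 0, n);

	/* The copy would run past the end of the VA - abort as well. */
	if (n > area->va_end - addr)
		usercopy_abort("vmalloc", NULL, to_user,
			       addr - area->va_start, n);
	return;
}
<snip>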

> So, as a newbie in Linux-mm, I would like to ask for some suggestions:
>
> Is it possible to further eliminate the overhead caused by lock contention
> in find_vmap_area() in this scenario (maybe this is asking too much), or the
> only way out is not setting CONFIG_HARDENED_USERCOPY or not using vzalloced
> buffer in the situation where cocurrent kernel-userspace-copy happens?
>
Could you please try the patch below, to check if it improves this series
further? Just in case:

<snip>
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index e30dabf68263..40acf53cadfb 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -772,7 +772,7 @@ static DEFINE_PER_CPU(struct vmap_area *, ne_fit_preload_node);
struct rb_list {
struct rb_root root;
struct list_head head;
- spinlock_t lock;
+ rwlock_t lock;
};

struct vmap_pool {
@@ -947,19 +947,19 @@ find_vmap_area_exceed_addr_lock(unsigned long addr, struct vmap_area **va)
for (i = 0; i < nr_vmap_nodes; i++) {
vn = &vmap_nodes[i];

- spin_lock(&vn->busy.lock);
+ read_lock(&vn->busy.lock);
va_lowest = __find_vmap_area_exceed_addr(addr, &vn->busy.root);
if (va_lowest) {
if (!va_node || va_lowest->va_start < (*va)->va_start) {
if (va_node)
- spin_unlock(&va_node->busy.lock);
+ read_unlock(&va_node->busy.lock);

*va = va_lowest;
va_node = vn;
continue;
}
}
- spin_unlock(&vn->busy.lock);
+ read_unlock(&vn->busy.lock);
}

return va_node;
@@ -1695,9 +1695,9 @@ static void free_vmap_area(struct vmap_area *va)
/*
* Remove from the busy tree/list.
*/
- spin_lock(&vn->busy.lock);
+ write_lock(&vn->busy.lock);
unlink_va(va, &vn->busy.root);
- spin_unlock(&vn->busy.lock);
+ write_unlock(&vn->busy.lock);

/*
* Insert/Merge it back to the free tree/list.
@@ -1901,9 +1901,9 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,

vn = addr_to_node(va->va_start);

- spin_lock(&vn->busy.lock);
+ write_lock(&vn->busy.lock);
insert_vmap_area(va, &vn->busy.root, &vn->busy.head);
- spin_unlock(&vn->busy.lock);
+ write_unlock(&vn->busy.lock);

BUG_ON(!IS_ALIGNED(va->va_start, align));
BUG_ON(va->va_start < vstart);
@@ -2123,10 +2123,10 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end,
if (RB_EMPTY_ROOT(&vn->lazy.root))
continue;

- spin_lock(&vn->lazy.lock);
+ write_lock(&vn->lazy.lock);
WRITE_ONCE(vn->lazy.root.rb_node, NULL);
list_replace_init(&vn->lazy.head, &vn->purge_list);
- spin_unlock(&vn->lazy.lock);
+ write_unlock(&vn->lazy.lock);

start = min(start, list_first_entry(&vn->purge_list,
struct vmap_area, list)->va_start);
@@ -2223,9 +2223,9 @@ static void free_vmap_area_noflush(struct vmap_area *va)
vn = is_vn_id_valid(vn_id) ?
id_to_node(vn_id):addr_to_node(va->va_start);

- spin_lock(&vn->lazy.lock);
+ write_lock(&vn->lazy.lock);
insert_vmap_area(va, &vn->lazy.root, &vn->lazy.head);
- spin_unlock(&vn->lazy.lock);
+ write_unlock(&vn->lazy.lock);

trace_free_vmap_area_noflush(va_start, nr_lazy, nr_lazy_max);

@@ -2272,9 +2272,9 @@ struct vmap_area *find_vmap_area(unsigned long addr)
do {
vn = &vmap_nodes[i];

- spin_lock(&vn->busy.lock);
+ read_lock(&vn->busy.lock);
va = __find_vmap_area(addr, &vn->busy.root);
- spin_unlock(&vn->busy.lock);
+ read_unlock(&vn->busy.lock);

if (va)
return va;
@@ -2293,11 +2293,11 @@ static struct vmap_area *find_unlink_vmap_area(unsigned long addr)
do {
vn = &vmap_nodes[i];

- spin_lock(&vn->busy.lock);
+ write_lock(&vn->busy.lock);
va = __find_vmap_area(addr, &vn->busy.root);
if (va)
unlink_va(va, &vn->busy.root);
- spin_unlock(&vn->busy.lock);
+ write_unlock(&vn->busy.lock);

if (va)
return va;
@@ -2514,9 +2514,9 @@ static void free_vmap_block(struct vmap_block *vb)
BUG_ON(tmp != vb);

vn = addr_to_node(vb->va->va_start);
- spin_lock(&vn->busy.lock);
+ write_lock(&vn->busy.lock);
unlink_va(vb->va, &vn->busy.root);
- spin_unlock(&vn->busy.lock);
+ write_unlock(&vn->busy.lock);

free_vmap_area_noflush(vb->va);
kfree_rcu(vb, rcu_head);
@@ -2942,9 +2942,9 @@ static void setup_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va,
{
struct vmap_node *vn = addr_to_node(va->va_start);

- spin_lock(&vn->busy.lock);
+ read_lock(&vn->busy.lock);
setup_vmalloc_vm_locked(vm, va, flags, caller);
- spin_unlock(&vn->busy.lock);
+ read_unlock(&vn->busy.lock);
}

static void clear_vm_uninitialized_flag(struct vm_struct *vm)
@@ -4214,19 +4214,19 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)

next_va:
next = va->va_end;
- spin_unlock(&vn->busy.lock);
+ read_unlock(&vn->busy.lock);
} while ((vn = find_vmap_area_exceed_addr_lock(next, &va)));

finished_zero:
if (vn)
- spin_unlock(&vn->busy.lock);
+ read_unlock(&vn->busy.lock);

/* zero-fill memory holes */
return count - remains + zero_iter(iter, remains);
finished:
/* Nothing remains, or We couldn't copy/zero everything. */
if (vn)
- spin_unlock(&vn->busy.lock);
+ read_unlock(&vn->busy.lock);

return count - remains;
}
@@ -4563,11 +4563,11 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
for (area = 0; area < nr_vms; area++) {
struct vmap_node *vn = addr_to_node(vas[area]->va_start);

- spin_lock(&vn->busy.lock);
+ write_lock(&vn->busy.lock);
insert_vmap_area(vas[area], &vn->busy.root, &vn->busy.head);
setup_vmalloc_vm_locked(vms[area], vas[area], VM_ALLOC,
pcpu_get_vm_areas);
- spin_unlock(&vn->busy.lock);
+ write_unlock(&vn->busy.lock);
}

/*
@@ -4687,7 +4687,7 @@ bool vmalloc_dump_obj(void *object)

vn = addr_to_node((unsigned long)objp);

- if (spin_trylock(&vn->busy.lock)) {
+ if (read_trylock(&vn->busy.lock)) {
va = __find_vmap_area(addr, &vn->busy.root);

if (va && va->vm) {
@@ -4697,7 +4697,7 @@ bool vmalloc_dump_obj(void *object)
success = true;
}

- spin_unlock(&vn->busy.lock);
+ read_unlock(&vn->busy.lock);
}

if (success)
@@ -4742,13 +4742,13 @@ static void show_purge_info(struct seq_file *m)
for (i = 0; i < nr_vmap_nodes; i++) {
vn = &vmap_nodes[i];

- spin_lock(&vn->lazy.lock);
+ read_lock(&vn->lazy.lock);
list_for_each_entry(va, &vn->lazy.head, list) {
seq_printf(m, "0x%pK-0x%pK %7ld unpurged vm_area\n",
(void *)va->va_start, (void *)va->va_end,
va->va_end - va->va_start);
}
- spin_unlock(&vn->lazy.lock);
+ read_unlock(&vn->lazy.lock);
}
}

@@ -4762,7 +4762,7 @@ static int vmalloc_info_show(struct seq_file *m, void *p)
for (i = 0; i < nr_vmap_nodes; i++) {
vn = &vmap_nodes[i];

- spin_lock(&vn->busy.lock);
+ read_lock(&vn->busy.lock);
list_for_each_entry(va, &vn->busy.head, list) {
if (!va->vm) {
if (va->flags & VMAP_RAM)
@@ -4808,7 +4808,7 @@ static int vmalloc_info_show(struct seq_file *m, void *p)
show_numa_info(m, v);
seq_putc(m, '\n');
}
- spin_unlock(&vn->busy.lock);
+ read_unlock(&vn->busy.lock);
}

/*
@@ -4902,11 +4902,11 @@ static void vmap_init_nodes(void)
vn = &vmap_nodes[n];
vn->busy.root = RB_ROOT;
INIT_LIST_HEAD(&vn->busy.head);
- spin_lock_init(&vn->busy.lock);
+ rwlock_init(&vn->busy.lock);

vn->lazy.root = RB_ROOT;
INIT_LIST_HEAD(&vn->lazy.head);
- spin_lock_init(&vn->lazy.lock);
+ rwlock_init(&vn->lazy.lock);

for (i = 0; i < MAX_VA_SIZE_PAGES; i++) {
INIT_LIST_HEAD(&vn->pool[i].head);
<snip>

Thank you!

--
Uladzislau Rezki

2024-01-06 09:18:13

by Wen Gu

[permalink] [raw]
Subject: Re: [PATCH v3 04/11] mm: vmalloc: Remove global vmap_area_root rb-tree


On 2024/1/5 18:50, Uladzislau Rezki wrote:

> Hello, Wen Gu.
>
>>
>> Hi Uladzislau Rezki,
>>

<...>

>> Fortunately, thank you for this patch set, the global vmap_area_lock was
>> removed and per node lock vn->busy.lock is introduced. it is really helpful:
>>
>> In 48 CPUs qemu environment, the Requests/s increased by 5 times:
>> - nginx
>> - wrk -c 1000 -t 96 -d 30 http://127.0.0.1:80
>>
>> vzalloced shmem vzalloced shmem(with this patch set)
>> Requests/sec 113536.56 583729.93
>>
>>
> Thank you for the confirmation that your workload is improved. The "nginx"
> is 5 times better!
>

Yes, thank you very much for the improvement!

>> But it also has some overhead, compared to using kzalloced shared memory
>> or unsetting CONFIG_HARDENED_USERCOPY, which won't involve finding vmap area:
>>
>> kzalloced shmem vzalloced shmem(unset CONFIG_HARDENED_USERCOPY)
>> Requests/sec 831950.39 805164.78
>>
>>
> The CONFIG_HARDENED_USERCOPY prevents coping "wrong" memory regions. That is
> why if it is a vmalloced memory it wants to make sure it is really true,
> if not user-copy is aborted.
>
> So there is an extra work that involves finding a VA associated with an address.
>

Yes, and lock contention in finding VA is likely to be a performance bottleneck,
which is mitigated a lot by your work.

>> So, as a newbie in Linux-mm, I would like to ask for some suggestions:
>>
>> Is it possible to further eliminate the overhead caused by lock contention
>> in find_vmap_area() in this scenario (maybe this is asking too much), or the
>> only way out is not setting CONFIG_HARDENED_USERCOPY or not using vzalloced
>> buffer in the situation where cocurrent kernel-userspace-copy happens?
>>
> Could you please try below patch, if it improves this series further?
> Just in case:
>

Thank you! I tried the patch, and it seems that the wait on the rwlock_t
still exists, about as much as with the spinlock_t. (The flamegraph is
attached. Not sure why the read_lock waits so long, given that there is no
frequent write_lock competition)

                  vzalloced shmem(spinlock_t)    vzalloced shmem(rwlock_t)
Requests/sec      583729.93                      460007.44

So I guess the overhead in finding vmap area is inevitable here and the
original spin_lock is fine in this series.

Thanks again for your help!

Best regards,
Wen Gu

> <snip>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index e30dabf68263..40acf53cadfb 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -772,7 +772,7 @@ static DEFINE_PER_CPU(struct vmap_area *, ne_fit_preload_node);
> struct rb_list {
> struct rb_root root;
> struct list_head head;
> - spinlock_t lock;
> + rwlock_t lock;
> };
>
> struct vmap_pool {
> @@ -947,19 +947,19 @@ find_vmap_area_exceed_addr_lock(unsigned long addr, struct vmap_area **va)
> for (i = 0; i < nr_vmap_nodes; i++) {
> vn = &vmap_nodes[i];
>
> - spin_lock(&vn->busy.lock);
> + read_lock(&vn->busy.lock);
> va_lowest = __find_vmap_area_exceed_addr(addr, &vn->busy.root);
> if (va_lowest) {
> if (!va_node || va_lowest->va_start < (*va)->va_start) {
> if (va_node)
> - spin_unlock(&va_node->busy.lock);
> + read_unlock(&va_node->busy.lock);
>
> *va = va_lowest;
> va_node = vn;
> continue;
> }
> }
> - spin_unlock(&vn->busy.lock);
> + read_unlock(&vn->busy.lock);
> }
>
> return va_node;
> @@ -1695,9 +1695,9 @@ static void free_vmap_area(struct vmap_area *va)
> /*
> * Remove from the busy tree/list.
> */
> - spin_lock(&vn->busy.lock);
> + write_lock(&vn->busy.lock);
> unlink_va(va, &vn->busy.root);
> - spin_unlock(&vn->busy.lock);
> + write_unlock(&vn->busy.lock);
>
> /*
> * Insert/Merge it back to the free tree/list.
> @@ -1901,9 +1901,9 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
>
> vn = addr_to_node(va->va_start);
>
> - spin_lock(&vn->busy.lock);
> + write_lock(&vn->busy.lock);
> insert_vmap_area(va, &vn->busy.root, &vn->busy.head);
> - spin_unlock(&vn->busy.lock);
> + write_unlock(&vn->busy.lock);
>
> BUG_ON(!IS_ALIGNED(va->va_start, align));
> BUG_ON(va->va_start < vstart);
> @@ -2123,10 +2123,10 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end,
> if (RB_EMPTY_ROOT(&vn->lazy.root))
> continue;
>
> - spin_lock(&vn->lazy.lock);
> + write_lock(&vn->lazy.lock);
> WRITE_ONCE(vn->lazy.root.rb_node, NULL);
> list_replace_init(&vn->lazy.head, &vn->purge_list);
> - spin_unlock(&vn->lazy.lock);
> + write_unlock(&vn->lazy.lock);
>
> start = min(start, list_first_entry(&vn->purge_list,
> struct vmap_area, list)->va_start);
> @@ -2223,9 +2223,9 @@ static void free_vmap_area_noflush(struct vmap_area *va)
> vn = is_vn_id_valid(vn_id) ?
> id_to_node(vn_id):addr_to_node(va->va_start);
>
> - spin_lock(&vn->lazy.lock);
> + write_lock(&vn->lazy.lock);
> insert_vmap_area(va, &vn->lazy.root, &vn->lazy.head);
> - spin_unlock(&vn->lazy.lock);
> + write_unlock(&vn->lazy.lock);
>
> trace_free_vmap_area_noflush(va_start, nr_lazy, nr_lazy_max);
>
> @@ -2272,9 +2272,9 @@ struct vmap_area *find_vmap_area(unsigned long addr)
> do {
> vn = &vmap_nodes[i];
>
> - spin_lock(&vn->busy.lock);
> + read_lock(&vn->busy.lock);
> va = __find_vmap_area(addr, &vn->busy.root);
> - spin_unlock(&vn->busy.lock);
> + read_unlock(&vn->busy.lock);
>
> if (va)
> return va;
> @@ -2293,11 +2293,11 @@ static struct vmap_area *find_unlink_vmap_area(unsigned long addr)
> do {
> vn = &vmap_nodes[i];
>
> - spin_lock(&vn->busy.lock);
> + write_lock(&vn->busy.lock);
> va = __find_vmap_area(addr, &vn->busy.root);
> if (va)
> unlink_va(va, &vn->busy.root);
> - spin_unlock(&vn->busy.lock);
> + write_unlock(&vn->busy.lock);
>
> if (va)
> return va;
> @@ -2514,9 +2514,9 @@ static void free_vmap_block(struct vmap_block *vb)
> BUG_ON(tmp != vb);
>
> vn = addr_to_node(vb->va->va_start);
> - spin_lock(&vn->busy.lock);
> + write_lock(&vn->busy.lock);
> unlink_va(vb->va, &vn->busy.root);
> - spin_unlock(&vn->busy.lock);
> + write_unlock(&vn->busy.lock);
>
> free_vmap_area_noflush(vb->va);
> kfree_rcu(vb, rcu_head);
> @@ -2942,9 +2942,9 @@ static void setup_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va,
> {
> struct vmap_node *vn = addr_to_node(va->va_start);
>
> - spin_lock(&vn->busy.lock);
> + read_lock(&vn->busy.lock);
> setup_vmalloc_vm_locked(vm, va, flags, caller);
> - spin_unlock(&vn->busy.lock);
> + read_unlock(&vn->busy.lock);
> }
>
> static void clear_vm_uninitialized_flag(struct vm_struct *vm)
> @@ -4214,19 +4214,19 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
>
> next_va:
> next = va->va_end;
> - spin_unlock(&vn->busy.lock);
> + read_unlock(&vn->busy.lock);
> } while ((vn = find_vmap_area_exceed_addr_lock(next, &va)));
>
> finished_zero:
> if (vn)
> - spin_unlock(&vn->busy.lock);
> + read_unlock(&vn->busy.lock);
>
> /* zero-fill memory holes */
> return count - remains + zero_iter(iter, remains);
> finished:
> /* Nothing remains, or We couldn't copy/zero everything. */
> if (vn)
> - spin_unlock(&vn->busy.lock);
> + read_unlock(&vn->busy.lock);
>
> return count - remains;
> }
> @@ -4563,11 +4563,11 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
> for (area = 0; area < nr_vms; area++) {
> struct vmap_node *vn = addr_to_node(vas[area]->va_start);
>
> - spin_lock(&vn->busy.lock);
> + write_lock(&vn->busy.lock);
> insert_vmap_area(vas[area], &vn->busy.root, &vn->busy.head);
> setup_vmalloc_vm_locked(vms[area], vas[area], VM_ALLOC,
> pcpu_get_vm_areas);
> - spin_unlock(&vn->busy.lock);
> + write_unlock(&vn->busy.lock);
> }
>
> /*
> @@ -4687,7 +4687,7 @@ bool vmalloc_dump_obj(void *object)
>
> vn = addr_to_node((unsigned long)objp);
>
> - if (spin_trylock(&vn->busy.lock)) {
> + if (read_trylock(&vn->busy.lock)) {
> va = __find_vmap_area(addr, &vn->busy.root);
>
> if (va && va->vm) {
> @@ -4697,7 +4697,7 @@ bool vmalloc_dump_obj(void *object)
> success = true;
> }
>
> - spin_unlock(&vn->busy.lock);
> + read_unlock(&vn->busy.lock);
> }
>
> if (success)
> @@ -4742,13 +4742,13 @@ static void show_purge_info(struct seq_file *m)
> for (i = 0; i < nr_vmap_nodes; i++) {
> vn = &vmap_nodes[i];
>
> - spin_lock(&vn->lazy.lock);
> + read_lock(&vn->lazy.lock);
> list_for_each_entry(va, &vn->lazy.head, list) {
> seq_printf(m, "0x%pK-0x%pK %7ld unpurged vm_area\n",
> (void *)va->va_start, (void *)va->va_end,
> va->va_end - va->va_start);
> }
> - spin_unlock(&vn->lazy.lock);
> + read_unlock(&vn->lazy.lock);
> }
> }
>
> @@ -4762,7 +4762,7 @@ static int vmalloc_info_show(struct seq_file *m, void *p)
> for (i = 0; i < nr_vmap_nodes; i++) {
> vn = &vmap_nodes[i];
>
> - spin_lock(&vn->busy.lock);
> + read_lock(&vn->busy.lock);
> list_for_each_entry(va, &vn->busy.head, list) {
> if (!va->vm) {
> if (va->flags & VMAP_RAM)
> @@ -4808,7 +4808,7 @@ static int vmalloc_info_show(struct seq_file *m, void *p)
> show_numa_info(m, v);
> seq_putc(m, '\n');
> }
> - spin_unlock(&vn->busy.lock);
> + read_unlock(&vn->busy.lock);
> }
>
> /*
> @@ -4902,11 +4902,11 @@ static void vmap_init_nodes(void)
> vn = &vmap_nodes[n];
> vn->busy.root = RB_ROOT;
> INIT_LIST_HEAD(&vn->busy.head);
> - spin_lock_init(&vn->busy.lock);
> + rwlock_init(&vn->busy.lock);
>
> vn->lazy.root = RB_ROOT;
> INIT_LIST_HEAD(&vn->lazy.head);
> - spin_lock_init(&vn->lazy.lock);
> + rwlock_init(&vn->lazy.lock);
>
> for (i = 0; i < MAX_VA_SIZE_PAGES; i++) {
> INIT_LIST_HEAD(&vn->pool[i].head);
> <snip>
>
> Thank you!
>
> --
> Uladzislau Rezki


Attachments:
vzalloc_t96_improve_rwlock.svg (246.88 kB)

2024-01-06 16:36:43

by Uladzislau Rezki (Sony)

[permalink] [raw]
Subject: Re: [PATCH v3 04/11] mm: vmalloc: Remove global vmap_area_root rb-tree

>
> On 2024/1/5 18:50, Uladzislau Rezki wrote:
>
> > Hello, Wen Gu.
> >
> > >
> > > Hi Uladzislau Rezki,
> > >
>
> <...>
>
> > > Fortunately, thank you for this patch set, the global vmap_area_lock was
> > > removed and per node lock vn->busy.lock is introduced. it is really helpful:
> > >
> > > In 48 CPUs qemu environment, the Requests/s increased by 5 times:
> > > - nginx
> > > - wrk -c 1000 -t 96 -d 30 http://127.0.0.1:80
> > >
> > > vzalloced shmem vzalloced shmem(with this patch set)
> > > Requests/sec 113536.56 583729.93
> > >
> > >
> > Thank you for the confirmation that your workload is improved. The "nginx"
> > is 5 times better!
> >
>
> Yes, thank you very much for the improvement!
>
> > > But it also has some overhead, compared to using kzalloced shared memory
> > > or unsetting CONFIG_HARDENED_USERCOPY, which won't involve finding vmap area:
> > >
> > > kzalloced shmem vzalloced shmem(unset CONFIG_HARDENED_USERCOPY)
> > > Requests/sec 831950.39 805164.78
> > >
> > >
> > The CONFIG_HARDENED_USERCOPY prevents coping "wrong" memory regions. That is
> > why if it is a vmalloced memory it wants to make sure it is really true,
> > if not user-copy is aborted.
> >
> > So there is an extra work that involves finding a VA associated with an address.
> >
>
> Yes, and lock contention in finding VA is likely to be a performance bottleneck,
> which is mitigated a lot by your work.
>
> > > So, as a newbie in Linux-mm, I would like to ask for some suggestions:
> > >
> > > Is it possible to further eliminate the overhead caused by lock contention
> > > in find_vmap_area() in this scenario (maybe this is asking too much), or the
> > > only way out is not setting CONFIG_HARDENED_USERCOPY or not using vzalloced
> > > buffer in the situation where cocurrent kernel-userspace-copy happens?
> > >
> > Could you please try below patch, if it improves this series further?
> > Just in case:
> >
>
> Thank you! I tried the patch, and it seems that the wait for rwlock_t
> also exists, as much as using spinlock_t. (The flamegraph is attached.
> Not sure why the read_lock waits so long, given that there is no frequent
> write_lock competition)
>
> vzalloced shmem(spinlock_t) vzalloced shmem(rwlock_t)
> Requests/sec 583729.93 460007.44
>
> So I guess the overhead in finding vmap area is inevitable here and the
> original spin_lock is fine in this series.
>
I have also noticed a performance difference between rwlock and spinlock.
So, yes. This is the extra work we need to do if CONFIG_HARDENED_USERCOPY
is set, i.e. find a VA.

--
Uladzislau Rezki

2024-01-07 07:00:17

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCH v3 04/11] mm: vmalloc: Remove global vmap_area_root rb-tree

On Sat, 6 Jan 2024 17:36:23 +0100 Uladzislau Rezki <[email protected]>
> >
> > Thank you! I tried the patch, and it seems that the wait for rwlock_t
> > also exists, as much as using spinlock_t. (The flamegraph is attached.
> > Not sure why the read_lock waits so long, given that there is no frequent
> > write_lock competition)
> >
> > vzalloced shmem(spinlock_t) vzalloced shmem(rwlock_t)
> > Requests/sec 583729.93 460007.44
> >
> > So I guess the overhead in finding vmap area is inevitable here and the
> > original spin_lock is fine in this series.
> >
> I have also noticed a erformance difference between rwlock and spinlock.
> So, yes. This is what we need to do extra if CONFIG_HARDENED_USERCOPY is
> set, i.e. find a VA.

See if read bias helps to understand the gap between spinlock and rwlock.

--- x/kernel/locking/qrwlock.c
+++ y/kernel/locking/qrwlock.c
@@ -23,7 +23,7 @@ void __lockfunc queued_read_lock_slowpat
/*
* Readers come here when they cannot get the lock without waiting
*/
- if (unlikely(in_interrupt())) {
+ if (1) {
/*
* Readers in interrupt context will get the lock immediately
* if the writer is just waiting (not holding the lock yet),

2024-01-08 07:45:45

by Wen Gu

[permalink] [raw]
Subject: Re: [PATCH v3 04/11] mm: vmalloc: Remove global vmap_area_root rb-tree



On 2024/1/7 14:59, Hillf Danton wrote:
> On Sat, 6 Jan 2024 17:36:23 +0100 Uladzislau Rezki <[email protected]>
>>>
>>> Thank you! I tried the patch, and it seems that the wait for rwlock_t
>>> also exists, as much as using spinlock_t. (The flamegraph is attached.
>>> Not sure why the read_lock waits so long, given that there is no frequent
>>> write_lock competition)
>>>
>>> vzalloced shmem(spinlock_t) vzalloced shmem(rwlock_t)
>>> Requests/sec 583729.93 460007.44
>>>
>>> So I guess the overhead in finding vmap area is inevitable here and the
>>> original spin_lock is fine in this series.
>>>
>> I have also noticed a erformance difference between rwlock and spinlock.
>> So, yes. This is what we need to do extra if CONFIG_HARDENED_USERCOPY is
>> set, i.e. find a VA.
>
> See if read bias helps to understand the gap between spinlock and rwlock.
>
> --- x/kernel/locking/qrwlock.c
> +++ y/kernel/locking/qrwlock.c
> @@ -23,7 +23,7 @@ void __lockfunc queued_read_lock_slowpat
> /*
> * Readers come here when they cannot get the lock without waiting
> */
> - if (unlikely(in_interrupt())) {
> + if (1) {
> /*
> * Readers in interrupt context will get the lock immediately
> * if the writer is just waiting (not holding the lock yet),

Thank you for the idea, Hillf!

IIUC, the change makes read operations more likely to acquire the lock
and modifies the fairness to favor reading.

The test in my scenario shows:

vzalloced shmem with    spinlock_t    rwlock_t     rwlock_t(with above change)
Requests/sec            564961.29     442534.33    439733.31

In addition to read bias, there seem to be other factors that cause the
gap, but I haven't figured them out yet.

Thanks,
Wen Gu

2024-01-08 18:38:53

by Uladzislau Rezki (Sony)

[permalink] [raw]
Subject: Re: [PATCH v3 04/11] mm: vmalloc: Remove global vmap_area_root rb-tree

On Mon, Jan 08, 2024 at 03:45:18PM +0800, Wen Gu wrote:
>
>
> On 2024/1/7 14:59, Hillf Danton wrote:
> > On Sat, 6 Jan 2024 17:36:23 +0100 Uladzislau Rezki <[email protected]>
> > > >
> > > > Thank you! I tried the patch, and it seems that the wait for rwlock_t
> > > > also exists, as much as using spinlock_t. (The flamegraph is attached.
> > > > Not sure why the read_lock waits so long, given that there is no frequent
> > > > write_lock competition)
> > > >
> > > > vzalloced shmem(spinlock_t) vzalloced shmem(rwlock_t)
> > > > Requests/sec 583729.93 460007.44
> > > >
> > > > So I guess the overhead in finding vmap area is inevitable here and the
> > > > original spin_lock is fine in this series.
> > > >
> > > I have also noticed a erformance difference between rwlock and spinlock.
> > > So, yes. This is what we need to do extra if CONFIG_HARDENED_USERCOPY is
> > > set, i.e. find a VA.
> >
> > See if read bias helps to understand the gap between spinlock and rwlock.
> >
> > --- x/kernel/locking/qrwlock.c
> > +++ y/kernel/locking/qrwlock.c
> > @@ -23,7 +23,7 @@ void __lockfunc queued_read_lock_slowpat
> > /*
> > * Readers come here when they cannot get the lock without waiting
> > */
> > - if (unlikely(in_interrupt())) {
> > + if (1) {
> > /*
> > * Readers in interrupt context will get the lock immediately
> > * if the writer is just waiting (not holding the lock yet),
>
> Thank you for the idea! Hillf.
>
> IIUC, the change makes read operations more likely to acquire lock and
> modified fairness to favor reading.
>
> The test in my scenario shows:
>
> vzalloced shmem with spinlock_t rwlock_t rwlock_t(with above change)
> Requests/sec 564961.29 442534.33 439733.31
>
> In addition to read bias, there seems to be other factors that cause the
> gap, but I haven't figured it out yet..
>
<snip>
urezki@pc638:~$ cat locktorture.sh
#!/bin/sh

# available lock types: spin_lock, rw_lock
torture_type=$1
test_time=$2

echo "Start..."

modprobe locktorture $torture_type nreaders_stress=0 > /dev/null 2>&1
sleep $test_time
rmmod locktorture > /dev/null 2>&1

echo "Done."
urezki@pc638:~$
<snip>

Out:

# sudo ./locktorture.sh rw_lock 30
[12107.327566] Writes: Total: 53304415 Max/Min: 1620715/3225 ??? Fail: 0
[12107.327898] spin_lock-torture: lock_torture_stats is stopping
[12107.328192] Writes: Total: 53304415 Max/Min: 1620715/3225 ??? Fail: 0
[12107.328368] spin_lock-torture:--- End of test: SUCCESS: acq_writer_lim=0 bind_readers=0-63 bind_writers=0-63 call_rcu_chains=0 long_hold=100 nested_locks=0 nreaders_stress=0 nwriters_stress=128 onoff_holdoff=0 onoff_interval=0 rt_boost=2 rt_boost_factor=50 shuffle_interval=3 shutdown_secs=0 stat_interval=60 stutter=5 verbose=1 writer_fifo=0

# sudo ./locktorture.sh spin_lock 30
[12051.968992] Writes: Total: 47843400 Max/Min: 1335320/5942 ??? Fail: 0
[12051.969236] spin_lock-torture: lock_torture_stats is stopping
[12051.969507] Writes: Total: 47843400 Max/Min: 1335320/5942 ??? Fail: 0
[12051.969813] spin_lock-torture:--- End of test: SUCCESS: acq_writer_lim=0 bind_readers=0-63 bind_writers=0-63 call_rcu_chains=0 long_hold=100 nested_locks=0 nreaders_stress=0 nwriters_stress=128 onoff_holdoff=0 onoff_interval=0 rt_boost=2 rt_boost_factor=50 shuffle_interval=3 shutdown_secs=0 stat_interval=60 stutter=5 verbose=1 writer_fifo=0

I do not see a big difference between spin_lock and rw_lock. In fact,
the locktorture.ko test shows that a spin_lock is slightly worse, but it
might be something that I have missed, or the test is not accurate enough.

As for the vmap test-suite and the difference between rw_lock and
spin_lock, I have not spent much time figuring out the difference.
At first glance it could be that the cache-miss rate is higher when
switching to rw_lock, or that the rw_lock requires more atomics.

--
Uladzislau Rezki

2024-01-11 15:55:59

by Uladzislau Rezki (Sony)

[permalink] [raw]
Subject: Re: [PATCH v3 07/11] mm: vmalloc: Offload free_vmap_area_lock lock

On Thu, Jan 11, 2024 at 08:02:16PM +1100, Dave Chinner wrote:
> On Tue, Jan 02, 2024 at 07:46:29PM +0100, Uladzislau Rezki (Sony) wrote:
> > Concurrent access to a global vmap space is a bottle-neck.
> > We can simulate a high contention by running a vmalloc test
> > suite.
> >
> > To address it, introduce an effective vmap node logic. Each
> > node behaves as independent entity. When a node is accessed
> > it serves a request directly(if possible) from its pool.
> >
> > This model has a size based pool for requests, i.e. pools are
> > serialized and populated based on object size and real demand.
> > A maximum object size that pool can handle is set to 256 pages.
> >
> > This technique reduces a pressure on the global vmap lock.
> >
> > Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
>
> Why not use a llist for this? That gets rid of the need for a
> new pool_lock altogether...
>
Initially I used an llist. I changed it because I keep track of the number
of objects per pool in order to decay it later. I did not find these locks
to be contended, therefore I did not think much about it.

Anyway, I will have a look at this to see whether an llist is easy to go
with or not. If so, I will send out a separate patch.

Thanks!

--
Uladzislau Rezki

2024-01-12 12:18:42

by Uladzislau Rezki (Sony)

[permalink] [raw]
Subject: Re: [PATCH v3 07/11] mm: vmalloc: Offload free_vmap_area_lock lock

On Fri, Jan 12, 2024 at 07:37:36AM +1100, Dave Chinner wrote:
> On Thu, Jan 11, 2024 at 04:54:48PM +0100, Uladzislau Rezki wrote:
> > On Thu, Jan 11, 2024 at 08:02:16PM +1100, Dave Chinner wrote:
> > > On Tue, Jan 02, 2024 at 07:46:29PM +0100, Uladzislau Rezki (Sony) wrote:
> > > > Concurrent access to a global vmap space is a bottle-neck.
> > > > We can simulate a high contention by running a vmalloc test
> > > > suite.
> > > >
> > > > To address it, introduce an effective vmap node logic. Each
> > > > node behaves as independent entity. When a node is accessed
> > > > it serves a request directly(if possible) from its pool.
> > > >
> > > > This model has a size based pool for requests, i.e. pools are
> > > > serialized and populated based on object size and real demand.
> > > > A maximum object size that pool can handle is set to 256 pages.
> > > >
> > > > This technique reduces a pressure on the global vmap lock.
> > > >
> > > > Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
> > >
> > > Why not use a llist for this? That gets rid of the need for a
> > > new pool_lock altogether...
> > >
> > Initially i used the llist. I have changed it because i keep track
> > of objects per a pool to decay it later. I do not find these locks
> > as contented one therefore i did not think much.
>
> Ok. I've used llist and an atomic counter to track the list length
> in the past.
>
> But is the list length even necessary? It seems to me that it is
> only used by the shrinker to determine how many objects are on the
> lists for scanning, and I'm not sure that's entirely necessary given
> the way the current global shrinker works (i.e. completely unfair to
> low numbered nodes due to scan loop start bias).
>
I use the length to decay pools by a certain percentage, currently 25%,
so I need to know the number of objects. It is done in the purge path.
As for the shrinker, once it hits us we drain the pools entirely.

> > Anyway, i will have a look at this to see if llist is easy to go with
> > or not. If so i will send out a separate patch.
>
> Sounds good, it was just something that crossed my mind given the
> pattern of "producer adds single items, consumer detaches entire
> list, processes it and reattaches remainder" is a perfect match for
> the llist structure.
>
The llist_del_first() has to be serialized. For this purpose a per-cpu
pool would work, or some kind of "in_use" atomic that protects against
concurrent removal.

If we detach the entire llist, then we need to keep track of the last
node in order to add it back later as a "batch" to the already
existing/populated list.
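
Just to illustrate what is being discussed, below is a rough sketch of
how an llist-based pool could look. It is purely hypothetical and not
part of this series; it assumes a new va_llist member in vmap_area:

<snip>
/*
 * Hypothetical llist-based pool. Producers use the lock-less
 * llist_add(); llist_del_first() still has to be serialized,
 * here via an "in_use" atomic.
 */
struct vmap_llist_pool {
	struct llist_head head;
	atomic_t len;
	atomic_t in_use;
};

static void pool_put(struct vmap_llist_pool *p, struct vmap_area *va)
{
	llist_add(&va->va_llist, &p->head);	/* assumed new member */
	atomic_inc(&p->len);
}

static struct vmap_area *pool_get(struct vmap_llist_pool *p)
{
	struct llist_node *n = NULL;

	/* Serialize removal, as llist_del_first() requires. */
	if (atomic_cmpxchg(&p->in_use, 0, 1) == 0) {
		n = llist_del_first(&p->head);
		atomic_set(&p->in_use, 0);
	}

	if (!n)
		return NULL;

	atomic_dec(&p->len);
	return llist_entry(n, struct vmap_area, va_llist);
}
<snip>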

Thanks for looking!

--
Uladzislau Rezki

2024-01-15 19:10:28

by Uladzislau Rezki (Sony)

[permalink] [raw]
Subject: Re: [PATCH v3 10/11] mm: vmalloc: Set nr_nodes based on CPUs in a system

> On Tue, Jan 02, 2024 at 07:46:32PM +0100, Uladzislau Rezki (Sony) wrote:
> > A number of nodes which are used in the alloc/free paths is
> > set based on num_possible_cpus() in a system. Please note a
> > high limit threshold though is fixed and corresponds to 128
> > nodes.
>
> Large CPU count machines are NUMA machines. ALl of the allocation
> and reclaim is NUMA node based i.e. a pgdat per NUMA node.
>
> Shrinkers are also able to be run in a NUMA aware mode so that
> per-node structures can be reclaimed similar to how per-node LRU
> lists are scanned for reclaim.
>
> Hence I'm left to wonder if it would be better to have a vmalloc
> area per pgdat (or sub-node cluster) rather than just base the
> number on CPU count and then have an arbitrary maximum number when
> we get to 128 CPU cores. We can have 128 CPU cores in a
> single socket these days, so not being able to scale the vmalloc
> areas beyond a single socket seems like a bit of a limitation.
>
> Scaling out the vmalloc areas in a NUMA aware fashion allows the
> shrinker to be run in numa aware mode, which gets rid of the need
> for the global shrinker to loop over every single vmap area in every
> shrinker invocation. Only the vm areas on the node that has a memory
> shortage need to be scanned and reclaimed, it doesn't need reclaim
> everything globally when a single node runs out of memory.
>
> Yes, this may not give quite as good microbenchmark scalability
> results, but being able to locate each vm area in node local memory
> and have operation on them largely isolated to node-local tasks and
> vmalloc area reclaim will work much better on large multi-socket
> NUMA machines.
>
Currently I fix the maximum number of nodes to 128. This is because I do
not have access to such big NUMA systems, whereas I do have access to
systems with around 128 CPUs. That is why I have decided to stop at that
number for now.

We can easily set nr_nodes to num_possible_cpus() and let it scale for
everyone. But before doing that, I would like to give this a try as a
first step, because I have not tested it well on really big NUMA systems.

Thanks for your NUMA-aware input.

--
Uladzislau Rezki

2024-01-16 23:28:11

by Lorenzo Stoakes

[permalink] [raw]
Subject: Re: [PATCH v3 04/11] mm: vmalloc: Remove global vmap_area_root rb-tree

On Tue, Jan 02, 2024 at 07:46:26PM +0100, Uladzislau Rezki (Sony) wrote:
> Store allocated objects in a separate nodes. A va->va_start
> address is converted into a correct node where it should
> be placed and resided. An addr_to_node() function is used
> to do a proper address conversion to determine a node that
> contains a VA.
>
> Such approach balances VAs across nodes as a result an access
> becomes scalable. Number of nodes in a system depends on number
> of CPUs.
>
> Please note:
>
> 1. As of now allocated VAs are bound to a node-0. It means the
> patch does not give any difference comparing with a current
> behavior;
>
> 2. The global vmap_area_lock, vmap_area_root are removed as there
> is no need in it anymore. The vmap_area_list is still kept and
> is _empty_. It is exported for a kexec only;
>
> 3. The vmallocinfo and vread() have to be reworked to be able to
> handle multiple nodes.
>
> Reviewed-by: Baoquan He <[email protected]>
> Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
> ---
> mm/vmalloc.c | 240 +++++++++++++++++++++++++++++++++++++--------------
> 1 file changed, 173 insertions(+), 67 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 06bd843d18ae..786ecb18ae22 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -728,11 +728,9 @@ EXPORT_SYMBOL(vmalloc_to_pfn);
> #define DEBUG_AUGMENT_LOWEST_MATCH_CHECK 0
>
>
> -static DEFINE_SPINLOCK(vmap_area_lock);
> static DEFINE_SPINLOCK(free_vmap_area_lock);
> /* Export for kexec only */
> LIST_HEAD(vmap_area_list);
> -static struct rb_root vmap_area_root = RB_ROOT;
> static bool vmap_initialized __read_mostly;
>
> static struct rb_root purge_vmap_area_root = RB_ROOT;
> @@ -772,6 +770,38 @@ static struct rb_root free_vmap_area_root = RB_ROOT;
> */
> static DEFINE_PER_CPU(struct vmap_area *, ne_fit_preload_node);
>
> +/*
> + * An effective vmap-node logic. Users make use of nodes instead
> + * of a global heap. It allows to balance an access and mitigate
> + * contention.
> + */
> +struct rb_list {

I'm not sure this name is very instructive - this contains a red/black tree
root node, a list head and a lock, but the meaning of it is that it stores
a red/black tree and a list of vmap_area objects and has a lock to protect
access.

A raw 'list_head' without a comment is always hard to parse, maybe some
comments/embed in vmap_node?

At the very least if you wanted to keep the name generic it should be
something like ordered_rb_tree or something like this.

> + struct rb_root root;
> + struct list_head head;
> + spinlock_t lock;
> +};
> +
> +static struct vmap_node {
> + /* Bookkeeping data of this node. */
> + struct rb_list busy;
> +} single;

This may be a thing about encapsulation/naming or similar, but I'm a little
confused as to why the rb_list type is maintained as a field rather than
its fields embedded?

> +
> +static struct vmap_node *vmap_nodes = &single;
> +static __read_mostly unsigned int nr_vmap_nodes = 1;
> +static __read_mostly unsigned int vmap_zone_size = 1;

It might be worth adding a comment here explaining that we're binding to a
single node for now to maintain existing behaviour (and a brief description
of what these values mean - for instance what unit vmap_zone_size is
expressed in?)

> +
> +static inline unsigned int
> +addr_to_node_id(unsigned long addr)
> +{
> + return (addr / vmap_zone_size) % nr_vmap_nodes;
> +}
> +
> +static inline struct vmap_node *
> +addr_to_node(unsigned long addr)
> +{
> + return &vmap_nodes[addr_to_node_id(addr)];
> +}
> +
> static __always_inline unsigned long
> va_size(struct vmap_area *va)
> {
> @@ -803,10 +833,11 @@ unsigned long vmalloc_nr_pages(void)
> }
>
> /* Look up the first VA which satisfies addr < va_end, NULL if none. */
> -static struct vmap_area *find_vmap_area_exceed_addr(unsigned long addr)
> +static struct vmap_area *
> +find_vmap_area_exceed_addr(unsigned long addr, struct rb_root *root)
> {
> struct vmap_area *va = NULL;
> - struct rb_node *n = vmap_area_root.rb_node;
> + struct rb_node *n = root->rb_node;
>
> addr = (unsigned long)kasan_reset_tag((void *)addr);
>
> @@ -1552,12 +1583,14 @@ __alloc_vmap_area(struct rb_root *root, struct list_head *head,
> */
> static void free_vmap_area(struct vmap_area *va)
> {
> + struct vmap_node *vn = addr_to_node(va->va_start);
> +

I'm being nitty here, and while I know it's a vmalloc convention to use
'va' and 'vm', perhaps we can break away from the super short variable name
convention and use 'vnode' or something for these values?

I feel people might get confused between 'vm' and 'vn' for instance.

> /*
> * Remove from the busy tree/list.
> */
> - spin_lock(&vmap_area_lock);
> - unlink_va(va, &vmap_area_root);
> - spin_unlock(&vmap_area_lock);
> + spin_lock(&vn->busy.lock);
> + unlink_va(va, &vn->busy.root);
> + spin_unlock(&vn->busy.lock);
>
> /*
> * Insert/Merge it back to the free tree/list.
> @@ -1600,6 +1633,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
> int node, gfp_t gfp_mask,
> unsigned long va_flags)
> {
> + struct vmap_node *vn;
> struct vmap_area *va;
> unsigned long freed;
> unsigned long addr;
> @@ -1645,9 +1679,11 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
> va->vm = NULL;
> va->flags = va_flags;
>
> - spin_lock(&vmap_area_lock);
> - insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
> - spin_unlock(&vmap_area_lock);
> + vn = addr_to_node(va->va_start);
> +
> + spin_lock(&vn->busy.lock);
> + insert_vmap_area(va, &vn->busy.root, &vn->busy.head);
> + spin_unlock(&vn->busy.lock);
>
> BUG_ON(!IS_ALIGNED(va->va_start, align));
> BUG_ON(va->va_start < vstart);
> @@ -1871,26 +1907,61 @@ static void free_unmap_vmap_area(struct vmap_area *va)
>
> struct vmap_area *find_vmap_area(unsigned long addr)
> {
> + struct vmap_node *vn;
> struct vmap_area *va;
> + int i, j;
>
> - spin_lock(&vmap_area_lock);
> - va = __find_vmap_area(addr, &vmap_area_root);
> - spin_unlock(&vmap_area_lock);
> + /*
> + * An addr_to_node_id(addr) converts an address to a node index
> + * where a VA is located. If VA spans several zones and passed
> + * addr is not the same as va->va_start, what is not common, we
> + * may need to scan an extra nodes. See an example:

For my understanding, when you say 'scan an extra nodes' do you mean scan
just 1 extra node, or multiple? If the former I'd replace this with 'may
need to scan an extra node', if the latter then 'may need to scan extra
nodes'.

It's a nitty language thing, but also potentially changes the meaning of
this!

> + *
> + * <--va-->
> + * -|-----|-----|-----|-----|-
> + * 1 2 0 1
> + *
> + * VA resides in node 1 whereas it spans 1 and 2. If passed
> + * addr is within a second node we should do extra work. We
> + * should mention that it is rare and is a corner case from
> + * the other hand it has to be covered.

A very minor language style nit, but you've already said this is not
common, I don't think you need this 'We should mention...' bit. It's not a
big deal however!

> + */
> + i = j = addr_to_node_id(addr);
> + do {
> + vn = &vmap_nodes[i];
>
> - return va;
> + spin_lock(&vn->busy.lock);
> + va = __find_vmap_area(addr, &vn->busy.root);
> + spin_unlock(&vn->busy.lock);
> +
> + if (va)
> + return va;
> + } while ((i = (i + 1) % nr_vmap_nodes) != j);

If your comment above suggests that only 1 extra node might need to be
scanned, should we stop after one iteration?

> +
> + return NULL;
> }
>
> static struct vmap_area *find_unlink_vmap_area(unsigned long addr)
> {
> + struct vmap_node *vn;
> struct vmap_area *va;
> + int i, j;
>
> - spin_lock(&vmap_area_lock);
> - va = __find_vmap_area(addr, &vmap_area_root);
> - if (va)
> - unlink_va(va, &vmap_area_root);
> - spin_unlock(&vmap_area_lock);
> + i = j = addr_to_node_id(addr);
> + do {
> + vn = &vmap_nodes[i];
>
> - return va;
> + spin_lock(&vn->busy.lock);
> + va = __find_vmap_area(addr, &vn->busy.root);
> + if (va)
> + unlink_va(va, &vn->busy.root);
> + spin_unlock(&vn->busy.lock);
> +
> + if (va)
> + return va;
> + } while ((i = (i + 1) % nr_vmap_nodes) != j);

Maybe worth adding a comment saying to refer to the comment in
find_vmap_area() to see why this loop is necessary.

> +
> + return NULL;
> }
>
> /*** Per cpu kva allocator ***/
> @@ -2092,6 +2163,7 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
>
> static void free_vmap_block(struct vmap_block *vb)
> {
> + struct vmap_node *vn;
> struct vmap_block *tmp;
> struct xarray *xa;
>
> @@ -2099,9 +2171,10 @@ static void free_vmap_block(struct vmap_block *vb)
> tmp = xa_erase(xa, addr_to_vb_idx(vb->va->va_start));
> BUG_ON(tmp != vb);
>
> - spin_lock(&vmap_area_lock);
> - unlink_va(vb->va, &vmap_area_root);
> - spin_unlock(&vmap_area_lock);
> + vn = addr_to_node(vb->va->va_start);
> + spin_lock(&vn->busy.lock);
> + unlink_va(vb->va, &vn->busy.root);
> + spin_unlock(&vn->busy.lock);
>
> free_vmap_area_noflush(vb->va);
> kfree_rcu(vb, rcu_head);
> @@ -2525,9 +2598,11 @@ static inline void setup_vmalloc_vm_locked(struct vm_struct *vm,
> static void setup_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va,
> unsigned long flags, const void *caller)
> {
> - spin_lock(&vmap_area_lock);
> + struct vmap_node *vn = addr_to_node(va->va_start);
> +
> + spin_lock(&vn->busy.lock);
> setup_vmalloc_vm_locked(vm, va, flags, caller);
> - spin_unlock(&vmap_area_lock);
> + spin_unlock(&vn->busy.lock);
> }
>
> static void clear_vm_uninitialized_flag(struct vm_struct *vm)
> @@ -3715,6 +3790,7 @@ static size_t vmap_ram_vread_iter(struct iov_iter *iter, const char *addr,
> */
> long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
> {
> + struct vmap_node *vn;
> struct vmap_area *va;
> struct vm_struct *vm;
> char *vaddr;
> @@ -3728,8 +3804,11 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)

Unrelated to your change but makes me feel a little unwell to see 'const
char *addr'! Can we change this at some point? Or maybe I can :)

>
> remains = count;
>
> - spin_lock(&vmap_area_lock);
> - va = find_vmap_area_exceed_addr((unsigned long)addr);
> + /* Hooked to node_0 so far. */
> + vn = addr_to_node(0);

Why can't we use addr for this call? We already enforce the node-0 only
thing by setting nr_vmap_nodes to 1 right? And won't this be potentially
subtly wrong when we later increase this?

> + spin_lock(&vn->busy.lock);
> +
> + va = find_vmap_area_exceed_addr((unsigned long)addr, &vn->busy.root);
> if (!va)
> goto finished_zero;
>
> @@ -3737,7 +3816,7 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
> if ((unsigned long)addr + remains <= va->va_start)
> goto finished_zero;
>
> - list_for_each_entry_from(va, &vmap_area_list, list) {
> + list_for_each_entry_from(va, &vn->busy.head, list) {
> size_t copied;
>
> if (remains == 0)
> @@ -3796,12 +3875,12 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
> }
>
> finished_zero:
> - spin_unlock(&vmap_area_lock);
> + spin_unlock(&vn->busy.lock);
> /* zero-fill memory holes */
> return count - remains + zero_iter(iter, remains);
> finished:
> /* Nothing remains, or We couldn't copy/zero everything. */
> - spin_unlock(&vmap_area_lock);
> + spin_unlock(&vn->busy.lock);
>
> return count - remains;
> }
> @@ -4135,14 +4214,15 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
> }
>
> /* insert all vm's */
> - spin_lock(&vmap_area_lock);
> for (area = 0; area < nr_vms; area++) {
> - insert_vmap_area(vas[area], &vmap_area_root, &vmap_area_list);
> + struct vmap_node *vn = addr_to_node(vas[area]->va_start);
>
> + spin_lock(&vn->busy.lock);
> + insert_vmap_area(vas[area], &vn->busy.root, &vn->busy.head);
> setup_vmalloc_vm_locked(vms[area], vas[area], VM_ALLOC,
> pcpu_get_vm_areas);
> + spin_unlock(&vn->busy.lock);

Hmm, before we were locking/unlocking once before the loop, now we're
locking on each iteration, this seems inefficient.

Seems like we need logic like:

/* ... something to check nr_vms > 0 ... */
struct vmap_node *last_node = NULL;

for (...) {
struct vmap_node *vnode = addr_to_node(vas[area]->va_start);

if (vnode != last_node) {
spin_unlock(last_node->busy.lock);
spin_lock(vnode->busy.lock);
last_node = vnode;
}

...
}

if (last_node)
spin_unlock(last_node->busy.lock);

To minimise the lock twiddling. What do you think?

> }
> - spin_unlock(&vmap_area_lock);
>
> /*
> * Mark allocated areas as accessible. Do it now as a best-effort
> @@ -4253,55 +4333,57 @@ bool vmalloc_dump_obj(void *object)
> {
> void *objp = (void *)PAGE_ALIGN((unsigned long)object);
> const void *caller;
> - struct vm_struct *vm;
> struct vmap_area *va;
> + struct vmap_node *vn;
> unsigned long addr;
> unsigned int nr_pages;
> + bool success = false;
>
> - if (!spin_trylock(&vmap_area_lock))
> - return false;

Nitpick on style for this, I really don't know why you are removing this
early exit? It's far neater to have a guard clause than to nest a whole
bunch of code below.

> - va = __find_vmap_area((unsigned long)objp, &vmap_area_root);
> - if (!va) {
> - spin_unlock(&vmap_area_lock);
> - return false;
> - }
> + vn = addr_to_node((unsigned long)objp);

Later in the patch where you fix build bot issues with the below
__find_vmap_area() invocation, you move from addr to (unsigned long)objp.

However since you're already referring to that here, why not change what
addr refers to and use that in both instances, e.g.

unsigned long addr = (unsigned long)objp;

Then update things that refer to the objp value as necessary.

>
> - vm = va->vm;
> - if (!vm) {
> - spin_unlock(&vmap_area_lock);
> - return false;
> + if (spin_trylock(&vn->busy.lock)) {
> + va = __find_vmap_area(addr, &vn->busy.root);
> +
> + if (va && va->vm) {
> + addr = (unsigned long)va->vm->addr;
> + caller = va->vm->caller;
> + nr_pages = va->vm->nr_pages;

Again it feels like you're undoing some good here, now you're referencing
va->vm over and over when you could simply assign vm = va->vm as the
original code did.

Also again it'd be nicer to use an early exit/guard clause approach.

> + success = true;
> + }
> +
> + spin_unlock(&vn->busy.lock);
> }
> - addr = (unsigned long)vm->addr;
> - caller = vm->caller;
> - nr_pages = vm->nr_pages;
> - spin_unlock(&vmap_area_lock);
> - pr_cont(" %u-page vmalloc region starting at %#lx allocated at %pS\n",
> - nr_pages, addr, caller);
> - return true;
> +
> + if (success)
> + pr_cont(" %u-page vmalloc region starting at %#lx allocated at %pS\n",
> + nr_pages, addr, caller);

With the redefinition of addr, you could then simply put va->vm->addr here.

> +
> + return success;
> }
> #endif

These are all essentially style nits (the actual bug was fixed by your
follow up patch for the build bots) :)

>
> #ifdef CONFIG_PROC_FS
> static void *s_start(struct seq_file *m, loff_t *pos)
> - __acquires(&vmap_purge_lock)
> - __acquires(&vmap_area_lock)

Do we want to replace these __acquires() directives? I suppose we simply
cannot now we need to look up the node.

> {
> + struct vmap_node *vn = addr_to_node(0);

Hm does the procfs operation span only one node? I guess we can start from
the initial node for an iteration, but I wonder if '&vmap_nodes[0]' here is
a more 'honest' thing to do rather than to assume that address 0 gets
translated to node zero here?

I think a comment like:

/* We start from node 0 */

Would be useful here at any rate.

> +
> mutex_lock(&vmap_purge_lock);
> - spin_lock(&vmap_area_lock);
> + spin_lock(&vn->busy.lock);
>
> - return seq_list_start(&vmap_area_list, *pos);
> + return seq_list_start(&vn->busy.head, *pos);
> }
>
> static void *s_next(struct seq_file *m, void *p, loff_t *pos)
> {
> - return seq_list_next(p, &vmap_area_list, pos);
> + struct vmap_node *vn = addr_to_node(0);

This one I'm a little more uncertain of, obviously comments above still
apply but actually shouldn't we add a check to see if we're at the end of
the list and should look at the next node?

Even if we only have one for now, I don't like the idea of leaving in
hardcoded things that might get missed when we move to nr_vmap_nodes > 1.

For instance right now if you increased this above 1 it'd break things
right?

I'd say better to write logic assuming nr_vmap_nodes _could_ be > 1 even
if, to start, it won't be.

> + return seq_list_next(p, &vn->busy.head, pos);
> }
>
> static void s_stop(struct seq_file *m, void *p)
> - __releases(&vmap_area_lock)
> - __releases(&vmap_purge_lock)
> {
> - spin_unlock(&vmap_area_lock);
> + struct vmap_node *vn = addr_to_node(0);

See comments above about use of addr_to_node(0).

> +
> + spin_unlock(&vn->busy.lock);
> mutex_unlock(&vmap_purge_lock);
> }
>
> @@ -4344,9 +4426,11 @@ static void show_purge_info(struct seq_file *m)
>
> static int s_show(struct seq_file *m, void *p)
> {
> + struct vmap_node *vn;
> struct vmap_area *va;
> struct vm_struct *v;
>
> + vn = addr_to_node(0);

This one is really quite icky, should we make it easy for a vmap_area to
know its vmap_node? How is this going to work once nr_vmap_nodes > 1?

> va = list_entry(p, struct vmap_area, list);
>
> if (!va->vm) {
> @@ -4397,7 +4481,7 @@ static int s_show(struct seq_file *m, void *p)
> * As a final step, dump "unpurged" areas.
> */
> final:
> - if (list_is_last(&va->list, &vmap_area_list))
> + if (list_is_last(&va->list, &vn->busy.head))
> show_purge_info(m);
>
> return 0;
> @@ -4428,7 +4512,8 @@ static void vmap_init_free_space(void)
> {
> unsigned long vmap_start = 1;
> const unsigned long vmap_end = ULONG_MAX;
> - struct vmap_area *busy, *free;
> + struct vmap_area *free;
> + struct vm_struct *busy;
>
> /*
> * B F B B B F
> @@ -4436,12 +4521,12 @@ static void vmap_init_free_space(void)
> * | The KVA space |
> * |<--------------------------------->|
> */
> - list_for_each_entry(busy, &vmap_area_list, list) {
> - if (busy->va_start - vmap_start > 0) {
> + for (busy = vmlist; busy; busy = busy->next) {
> + if ((unsigned long) busy->addr - vmap_start > 0) {
> free = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
> if (!WARN_ON_ONCE(!free)) {
> free->va_start = vmap_start;
> - free->va_end = busy->va_start;
> + free->va_end = (unsigned long) busy->addr;
>
> insert_vmap_area_augment(free, NULL,
> &free_vmap_area_root,
> @@ -4449,7 +4534,7 @@ static void vmap_init_free_space(void)
> }
> }
>
> - vmap_start = busy->va_end;
> + vmap_start = (unsigned long) busy->addr + busy->size;
> }
>
> if (vmap_end - vmap_start > 0) {
> @@ -4465,9 +4550,23 @@ static void vmap_init_free_space(void)
> }
> }
>
> +static void vmap_init_nodes(void)
> +{
> + struct vmap_node *vn;
> + int i;
> +
> + for (i = 0; i < nr_vmap_nodes; i++) {
> + vn = &vmap_nodes[i];
> + vn->busy.root = RB_ROOT;
> + INIT_LIST_HEAD(&vn->busy.head);
> + spin_lock_init(&vn->busy.lock);
> + }
> +}
> +
> void __init vmalloc_init(void)
> {
> struct vmap_area *va;
> + struct vmap_node *vn;
> struct vm_struct *tmp;
> int i;
>
> @@ -4489,6 +4588,11 @@ void __init vmalloc_init(void)
> xa_init(&vbq->vmap_blocks);
> }
>
> + /*
> + * Setup nodes before importing vmlist.
> + */
> + vmap_init_nodes();
> +
> /* Import existing vmlist entries. */
> for (tmp = vmlist; tmp; tmp = tmp->next) {
> va = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
> @@ -4498,7 +4602,9 @@ void __init vmalloc_init(void)
> va->va_start = (unsigned long)tmp->addr;
> va->va_end = va->va_start + tmp->size;
> va->vm = tmp;
> - insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
> +
> + vn = addr_to_node(va->va_start);
> + insert_vmap_area(va, &vn->busy.root, &vn->busy.head);
> }
>
> /*
> --
> 2.39.2
>

2024-01-16 23:39:06

by Lorenzo Stoakes

[permalink] [raw]
Subject: Re: [PATCH v3 05/11] mm/vmalloc: remove vmap_area_list

On Tue, Jan 02, 2024 at 07:46:27PM +0100, Uladzislau Rezki (Sony) wrote:
> From: Baoquan He <[email protected]>
>
> Earlier, vmap_area_list is exported to vmcoreinfo so that makedumpfile
> get the base address of vmalloc area. Now, vmap_area_list is empty, so
> export VMALLOC_START to vmcoreinfo instead, and remove vmap_area_list.
>
> Signed-off-by: Baoquan He <[email protected]>
> Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
> ---
> Documentation/admin-guide/kdump/vmcoreinfo.rst | 8 ++++----
> arch/arm64/kernel/crash_core.c | 1 -
> arch/riscv/kernel/crash_core.c | 1 -
> include/linux/vmalloc.h | 1 -
> kernel/crash_core.c | 4 +---
> kernel/kallsyms_selftest.c | 1 -
> mm/nommu.c | 2 --
> mm/vmalloc.c | 2 --
> 8 files changed, 5 insertions(+), 15 deletions(-)
>
> diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst b/Documentation/admin-guide/kdump/vmcoreinfo.rst
> index 78e4d2e7ba14..df54fbeaaa16 100644
> --- a/Documentation/admin-guide/kdump/vmcoreinfo.rst
> +++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst
> @@ -65,11 +65,11 @@ Defines the beginning of the text section. In general, _stext indicates
> the kernel start address. Used to convert a virtual address from the
> direct kernel map to a physical address.
>
> -vmap_area_list
> ---------------
> +VMALLOC_START
> +-------------
>
> -Stores the virtual area list. makedumpfile gets the vmalloc start value
> -from this variable and its value is necessary for vmalloc translation.
> +Stores the base address of vmalloc area. makedumpfile gets this value
> +since is necessary for vmalloc translation.
>
> mem_map
> -------
> diff --git a/arch/arm64/kernel/crash_core.c b/arch/arm64/kernel/crash_core.c
> index 66cde752cd74..2a24199a9b81 100644
> --- a/arch/arm64/kernel/crash_core.c
> +++ b/arch/arm64/kernel/crash_core.c
> @@ -23,7 +23,6 @@ void arch_crash_save_vmcoreinfo(void)
> /* Please note VMCOREINFO_NUMBER() uses "%d", not "%x" */
> vmcoreinfo_append_str("NUMBER(MODULES_VADDR)=0x%lx\n", MODULES_VADDR);
> vmcoreinfo_append_str("NUMBER(MODULES_END)=0x%lx\n", MODULES_END);
> - vmcoreinfo_append_str("NUMBER(VMALLOC_START)=0x%lx\n", VMALLOC_START);
> vmcoreinfo_append_str("NUMBER(VMALLOC_END)=0x%lx\n", VMALLOC_END);
> vmcoreinfo_append_str("NUMBER(VMEMMAP_START)=0x%lx\n", VMEMMAP_START);
> vmcoreinfo_append_str("NUMBER(VMEMMAP_END)=0x%lx\n", VMEMMAP_END);
> diff --git a/arch/riscv/kernel/crash_core.c b/arch/riscv/kernel/crash_core.c
> index 8706736fd4e2..d18d529fd9b9 100644
> --- a/arch/riscv/kernel/crash_core.c
> +++ b/arch/riscv/kernel/crash_core.c
> @@ -8,7 +8,6 @@ void arch_crash_save_vmcoreinfo(void)
> VMCOREINFO_NUMBER(phys_ram_base);
>
> vmcoreinfo_append_str("NUMBER(PAGE_OFFSET)=0x%lx\n", PAGE_OFFSET);
> - vmcoreinfo_append_str("NUMBER(VMALLOC_START)=0x%lx\n", VMALLOC_START);
> vmcoreinfo_append_str("NUMBER(VMALLOC_END)=0x%lx\n", VMALLOC_END);
> #ifdef CONFIG_MMU
> VMCOREINFO_NUMBER(VA_BITS);
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index c720be70c8dd..91810b4e9510 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -253,7 +253,6 @@ extern long vread_iter(struct iov_iter *iter, const char *addr, size_t count);
> /*
> * Internals. Don't use..
> */
> -extern struct list_head vmap_area_list;
> extern __init void vm_area_add_early(struct vm_struct *vm);
> extern __init void vm_area_register_early(struct vm_struct *vm, size_t align);
>
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index d4313b53837e..b427f4a3b156 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -759,7 +759,7 @@ static int __init crash_save_vmcoreinfo_init(void)
> VMCOREINFO_SYMBOL_ARRAY(swapper_pg_dir);
> #endif
> VMCOREINFO_SYMBOL(_stext);
> - VMCOREINFO_SYMBOL(vmap_area_list);
> + vmcoreinfo_append_str("NUMBER(VMALLOC_START)=0x%lx\n", VMALLOC_START);
>
> #ifndef CONFIG_NUMA
> VMCOREINFO_SYMBOL(mem_map);
> @@ -800,8 +800,6 @@ static int __init crash_save_vmcoreinfo_init(void)
> VMCOREINFO_OFFSET(free_area, free_list);
> VMCOREINFO_OFFSET(list_head, next);
> VMCOREINFO_OFFSET(list_head, prev);
> - VMCOREINFO_OFFSET(vmap_area, va_start);
> - VMCOREINFO_OFFSET(vmap_area, list);
> VMCOREINFO_LENGTH(zone.free_area, MAX_ORDER + 1);
> log_buf_vmcoreinfo_setup();
> VMCOREINFO_LENGTH(free_area.free_list, MIGRATE_TYPES);
> diff --git a/kernel/kallsyms_selftest.c b/kernel/kallsyms_selftest.c
> index b4cac76ea5e9..8a689b4ff4f9 100644
> --- a/kernel/kallsyms_selftest.c
> +++ b/kernel/kallsyms_selftest.c
> @@ -89,7 +89,6 @@ static struct test_item test_items[] = {
> ITEM_DATA(kallsyms_test_var_data_static),
> ITEM_DATA(kallsyms_test_var_bss),
> ITEM_DATA(kallsyms_test_var_data),
> - ITEM_DATA(vmap_area_list),
> #endif
> };
>
> diff --git a/mm/nommu.c b/mm/nommu.c
> index b6dc558d3144..5ec8f44e7ce9 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -131,8 +131,6 @@ int follow_pfn(struct vm_area_struct *vma, unsigned long address,
> }
> EXPORT_SYMBOL(follow_pfn);
>
> -LIST_HEAD(vmap_area_list);
> -
> void vfree(const void *addr)
> {
> kfree(addr);
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 786ecb18ae22..8c01f2225ef7 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -729,8 +729,6 @@ EXPORT_SYMBOL(vmalloc_to_pfn);
>
>
> static DEFINE_SPINLOCK(free_vmap_area_lock);
> -/* Export for kexec only */
> -LIST_HEAD(vmap_area_list);
> static bool vmap_initialized __read_mostly;
>
> static struct rb_root purge_vmap_area_root = RB_ROOT;
> --
> 2.39.2
>

Looks good to me, I'm not _hugely_ familiar with this crash core stuff so:

Acked-by: Lorenzo Stoakes <[email protected]>

2024-01-18 13:17:49

by Uladzislau Rezki (Sony)

[permalink] [raw]
Subject: Re: [PATCH v3 04/11] mm: vmalloc: Remove global vmap_area_root rb-tree

On Tue, Jan 16, 2024 at 11:25:43PM +0000, Lorenzo Stoakes wrote:
> On Tue, Jan 02, 2024 at 07:46:26PM +0100, Uladzislau Rezki (Sony) wrote:
> > Store allocated objects in a separate nodes. A va->va_start
> > address is converted into a correct node where it should
> > be placed and resided. An addr_to_node() function is used
> > to do a proper address conversion to determine a node that
> > contains a VA.
> >
> > Such approach balances VAs across nodes as a result an access
> > becomes scalable. Number of nodes in a system depends on number
> > of CPUs.
> >
> > Please note:
> >
> > 1. As of now allocated VAs are bound to a node-0. It means the
> > patch does not give any difference comparing with a current
> > behavior;
> >
> > 2. The global vmap_area_lock, vmap_area_root are removed as there
> > is no need in it anymore. The vmap_area_list is still kept and
> > is _empty_. It is exported for a kexec only;
> >
> > 3. The vmallocinfo and vread() have to be reworked to be able to
> > handle multiple nodes.
> >
> > Reviewed-by: Baoquan He <[email protected]>
> > Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
> > ---
> > mm/vmalloc.c | 240 +++++++++++++++++++++++++++++++++++++--------------
> > 1 file changed, 173 insertions(+), 67 deletions(-)
> >
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 06bd843d18ae..786ecb18ae22 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -728,11 +728,9 @@ EXPORT_SYMBOL(vmalloc_to_pfn);
> > #define DEBUG_AUGMENT_LOWEST_MATCH_CHECK 0
> >
> >
> > -static DEFINE_SPINLOCK(vmap_area_lock);
> > static DEFINE_SPINLOCK(free_vmap_area_lock);
> > /* Export for kexec only */
> > LIST_HEAD(vmap_area_list);
> > -static struct rb_root vmap_area_root = RB_ROOT;
> > static bool vmap_initialized __read_mostly;
> >
> > static struct rb_root purge_vmap_area_root = RB_ROOT;
> > @@ -772,6 +770,38 @@ static struct rb_root free_vmap_area_root = RB_ROOT;
> > */
> > static DEFINE_PER_CPU(struct vmap_area *, ne_fit_preload_node);
> >
> > +/*
> > + * An effective vmap-node logic. Users make use of nodes instead
> > + * of a global heap. It allows to balance an access and mitigate
> > + * contention.
> > + */
> > +struct rb_list {
>
> I'm not sure this name is very instructive - this contains a red/black tree
> root node, a list head and a lock, but the meaning of it is that it stores
> a red/black tree and a list of vmap_area objects and has a lock to protect
> access.
>
> A raw 'list_head' without a comment is always hard to parse, maybe some
> comments/embed in vmap_node?
>
> At the very least if you wanted to keep the name generic it should be
> something like ordered_rb_tree or something like this.
>
I can add a comment in the vmap_node. rb_list describes a single, solid
data structure where a list and an rb-tree are part of one entity
protected by a lock. It is similar to a B+tree, where leaf nodes are
linked to each other to allow a fast sequential traversal.
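
Purely as an illustration of that point, the two access patterns served
by the combined structure under a single lock could look like this. The
helper names below are made up; the fields and the lookup/iteration
calls are the ones from the quoted patch:

/* Address lookups go through the rb-tree part of the pair... */
static struct vmap_area *
node_lookup(struct vmap_node *vn, unsigned long addr)
{
        struct vmap_area *va;

        spin_lock(&vn->busy.lock);
        va = __find_vmap_area(addr, &vn->busy.root);
        spin_unlock(&vn->busy.lock);

        return va;
}

/* ...while a sequential traversal walks the linked-list part. */
static void node_walk(struct vmap_node *vn)
{
        struct vmap_area *va;

        spin_lock(&vn->busy.lock);
        list_for_each_entry(va, &vn->busy.head, list) {
                /* e.g. dump or account each VA here */
        }
        spin_unlock(&vn->busy.lock);
}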

> > + struct rb_root root;
> > + struct list_head head;
> > + spinlock_t lock;
> > +};
> > +
> > +static struct vmap_node {
> > + /* Bookkeeping data of this node. */
> > + struct rb_list busy;
> > +} single;
>
> This may be a thing about encapsulation/naming or similar, but I'm a little
> confused as to why the rb_list type is maintained as a field rather than
> its fields embedded?
>
The "struct vmap_node" will be extended by the following patches in the
series.

> > +
> > +static struct vmap_node *vmap_nodes = &single;
> > +static __read_mostly unsigned int nr_vmap_nodes = 1;
> > +static __read_mostly unsigned int vmap_zone_size = 1;
>
> It might be worth adding a comment here explaining that we're binding to a
> single node for now to maintain existing behaviour (and a brief description
> of what these values mean - for instance what unit vmap_zone_size is
> expressed in?)
>
Right. Agree on it :)

> > +
> > +static inline unsigned int
> > +addr_to_node_id(unsigned long addr)
> > +{
> > + return (addr / vmap_zone_size) % nr_vmap_nodes;
> > +}
> > +
> > +static inline struct vmap_node *
> > +addr_to_node(unsigned long addr)
> > +{
> > + return &vmap_nodes[addr_to_node_id(addr)];
> > +}
> > +
> > static __always_inline unsigned long
> > va_size(struct vmap_area *va)
> > {
> > @@ -803,10 +833,11 @@ unsigned long vmalloc_nr_pages(void)
> > }
> >
> > /* Look up the first VA which satisfies addr < va_end, NULL if none. */
> > -static struct vmap_area *find_vmap_area_exceed_addr(unsigned long addr)
> > +static struct vmap_area *
> > +find_vmap_area_exceed_addr(unsigned long addr, struct rb_root *root)
> > {
> > struct vmap_area *va = NULL;
> > - struct rb_node *n = vmap_area_root.rb_node;
> > + struct rb_node *n = root->rb_node;
> >
> > addr = (unsigned long)kasan_reset_tag((void *)addr);
> >
> > @@ -1552,12 +1583,14 @@ __alloc_vmap_area(struct rb_root *root, struct list_head *head,
> > */
> > static void free_vmap_area(struct vmap_area *va)
> > {
> > + struct vmap_node *vn = addr_to_node(va->va_start);
> > +
>
> I'm being nitty here, and while I know it's a vmalloc convention to use
> 'va' and 'vm', perhaps we can break away from the super short variable name
> convention and use 'vnode' or something for these values?
>
> I feel people might get confused between 'vm' and 'vn' for instance.
>
vnode, varea?

> > /*
> > * Remove from the busy tree/list.
> > */
> > - spin_lock(&vmap_area_lock);
> > - unlink_va(va, &vmap_area_root);
> > - spin_unlock(&vmap_area_lock);
> > + spin_lock(&vn->busy.lock);
> > + unlink_va(va, &vn->busy.root);
> > + spin_unlock(&vn->busy.lock);
> >
> > /*
> > * Insert/Merge it back to the free tree/list.
> > @@ -1600,6 +1633,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
> > int node, gfp_t gfp_mask,
> > unsigned long va_flags)
> > {
> > + struct vmap_node *vn;
> > struct vmap_area *va;
> > unsigned long freed;
> > unsigned long addr;
> > @@ -1645,9 +1679,11 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
> > va->vm = NULL;
> > va->flags = va_flags;
> >
> > - spin_lock(&vmap_area_lock);
> > - insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
> > - spin_unlock(&vmap_area_lock);
> > + vn = addr_to_node(va->va_start);
> > +
> > + spin_lock(&vn->busy.lock);
> > + insert_vmap_area(va, &vn->busy.root, &vn->busy.head);
> > + spin_unlock(&vn->busy.lock);
> >
> > BUG_ON(!IS_ALIGNED(va->va_start, align));
> > BUG_ON(va->va_start < vstart);
> > @@ -1871,26 +1907,61 @@ static void free_unmap_vmap_area(struct vmap_area *va)
> >
> > struct vmap_area *find_vmap_area(unsigned long addr)
> > {
> > + struct vmap_node *vn;
> > struct vmap_area *va;
> > + int i, j;
> >
> > - spin_lock(&vmap_area_lock);
> > - va = __find_vmap_area(addr, &vmap_area_root);
> > - spin_unlock(&vmap_area_lock);
> > + /*
> > + * An addr_to_node_id(addr) converts an address to a node index
> > + * where a VA is located. If VA spans several zones and passed
> > + * addr is not the same as va->va_start, what is not common, we
> > + * may need to scan an extra nodes. See an example:
>
> For my understanding, when you say 'scan an extra nodes' do you mean scan
> just 1 extra node, or multiple? If the former I'd replace this with 'may
> need to scan an extra node', if the latter then 'may need to scan extra
> nodes'.
>
> It's a nitty language thing, but also potentially changes the meaning of
> this!
>
Typo, I should replace it with: scan extra nodes.

> > + *
> > + * <--va-->
> > + * -|-----|-----|-----|-----|-
> > + * 1 2 0 1
> > + *
> > + * VA resides in node 1 whereas it spans 1 and 2. If passed
> > + * addr is within a second node we should do extra work. We
> > + * should mention that it is rare and is a corner case from
> > + * the other hand it has to be covered.
>
> A very minor language style nit, but you've already said this is not
> common, I don't think you need this 'We should mention...' bit. It's not a
> big deal however!
>
No problem. We can remove it!

> > + */
> > + i = j = addr_to_node_id(addr);
> > + do {
> > + vn = &vmap_nodes[i];
> >
> > - return va;
> > + spin_lock(&vn->busy.lock);
> > + va = __find_vmap_area(addr, &vn->busy.root);
> > + spin_unlock(&vn->busy.lock);
> > +
> > + if (va)
> > + return va;
> > + } while ((i = (i + 1) % nr_vmap_nodes) != j);
>
> If your comment above suggests that only 1 extra node might need to be
> scanned, should we stop after one iteration?
>
Not really. Though we can improve it further to scan backward.

> > +
> > + return NULL;
> > }
> >
> > static struct vmap_area *find_unlink_vmap_area(unsigned long addr)
> > {
> > + struct vmap_node *vn;
> > struct vmap_area *va;
> > + int i, j;
> >
> > - spin_lock(&vmap_area_lock);
> > - va = __find_vmap_area(addr, &vmap_area_root);
> > - if (va)
> > - unlink_va(va, &vmap_area_root);
> > - spin_unlock(&vmap_area_lock);
> > + i = j = addr_to_node_id(addr);
> > + do {
> > + vn = &vmap_nodes[i];
> >
> > - return va;
> > + spin_lock(&vn->busy.lock);
> > + va = __find_vmap_area(addr, &vn->busy.root);
> > + if (va)
> > + unlink_va(va, &vn->busy.root);
> > + spin_unlock(&vn->busy.lock);
> > +
> > + if (va)
> > + return va;
> > + } while ((i = (i + 1) % nr_vmap_nodes) != j);
>
> Maybe worth adding a comment saying to refer to the comment in
> find_vmap_area() to see why this loop is necessary.
>
OK. We can do that to make it easier to read.

> > +
> > + return NULL;
> > }
> >
> > /*** Per cpu kva allocator ***/
> > @@ -2092,6 +2163,7 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
> >
> > static void free_vmap_block(struct vmap_block *vb)
> > {
> > + struct vmap_node *vn;
> > struct vmap_block *tmp;
> > struct xarray *xa;
> >
> > @@ -2099,9 +2171,10 @@ static void free_vmap_block(struct vmap_block *vb)
> > tmp = xa_erase(xa, addr_to_vb_idx(vb->va->va_start));
> > BUG_ON(tmp != vb);
> >
> > - spin_lock(&vmap_area_lock);
> > - unlink_va(vb->va, &vmap_area_root);
> > - spin_unlock(&vmap_area_lock);
> > + vn = addr_to_node(vb->va->va_start);
> > + spin_lock(&vn->busy.lock);
> > + unlink_va(vb->va, &vn->busy.root);
> > + spin_unlock(&vn->busy.lock);
> >
> > free_vmap_area_noflush(vb->va);
> > kfree_rcu(vb, rcu_head);
> > @@ -2525,9 +2598,11 @@ static inline void setup_vmalloc_vm_locked(struct vm_struct *vm,
> > static void setup_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va,
> > unsigned long flags, const void *caller)
> > {
> > - spin_lock(&vmap_area_lock);
> > + struct vmap_node *vn = addr_to_node(va->va_start);
> > +
> > + spin_lock(&vn->busy.lock);
> > setup_vmalloc_vm_locked(vm, va, flags, caller);
> > - spin_unlock(&vmap_area_lock);
> > + spin_unlock(&vn->busy.lock);
> > }
> >
> > static void clear_vm_uninitialized_flag(struct vm_struct *vm)
> > @@ -3715,6 +3790,7 @@ static size_t vmap_ram_vread_iter(struct iov_iter *iter, const char *addr,
> > */
> > long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
> > {
> > + struct vmap_node *vn;
> > struct vmap_area *va;
> > struct vm_struct *vm;
> > char *vaddr;
> > @@ -3728,8 +3804,11 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
>
> Unrelated to your change but makes me feel a little unwell to see 'const
> char *addr'! Can we change this at some point? Or maybe I can :)
>
You are welcome :)

> >
> > remains = count;
> >
> > - spin_lock(&vmap_area_lock);
> > - va = find_vmap_area_exceed_addr((unsigned long)addr);
> > + /* Hooked to node_0 so far. */
> > + vn = addr_to_node(0);
>
> Why can't we use addr for this call? We already enforce the node-0 only
> thing by setting nr_vmap_nodes to 1 right? And won't this be potentially
> subtly wrong when we later increase this?
>
I used to have 0 here. But please note, it is changed by the next patch in
this series.

> > + spin_lock(&vn->busy.lock);
> > +
> > + va = find_vmap_area_exceed_addr((unsigned long)addr, &vn->busy.root);
> > if (!va)
> > goto finished_zero;
> >
> > @@ -3737,7 +3816,7 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
> > if ((unsigned long)addr + remains <= va->va_start)
> > goto finished_zero;
> >
> > - list_for_each_entry_from(va, &vmap_area_list, list) {
> > + list_for_each_entry_from(va, &vn->busy.head, list) {
> > size_t copied;
> >
> > if (remains == 0)
> > @@ -3796,12 +3875,12 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
> > }
> >
> > finished_zero:
> > - spin_unlock(&vmap_area_lock);
> > + spin_unlock(&vn->busy.lock);
> > /* zero-fill memory holes */
> > return count - remains + zero_iter(iter, remains);
> > finished:
> > /* Nothing remains, or We couldn't copy/zero everything. */
> > - spin_unlock(&vmap_area_lock);
> > + spin_unlock(&vn->busy.lock);
> >
> > return count - remains;
> > }
> > @@ -4135,14 +4214,15 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
> > }
> >
> > /* insert all vm's */
> > - spin_lock(&vmap_area_lock);
> > for (area = 0; area < nr_vms; area++) {
> > - insert_vmap_area(vas[area], &vmap_area_root, &vmap_area_list);
> > + struct vmap_node *vn = addr_to_node(vas[area]->va_start);
> >
> > + spin_lock(&vn->busy.lock);
> > + insert_vmap_area(vas[area], &vn->busy.root, &vn->busy.head);
> > setup_vmalloc_vm_locked(vms[area], vas[area], VM_ALLOC,
> > pcpu_get_vm_areas);
> > + spin_unlock(&vn->busy.lock);
>
> Hmm, before we were locking/unlocking once before the loop, now we're
> locking on each iteration, this seems inefficient.
>
> Seems like we need logic like:
>
> /* ... something to check nr_vms > 0 ... */
> struct vmap_node *last_node = NULL;
>
> for (...) {
> struct vmap_node *vnode = addr_to_node(vas[area]->va_start);
>
> if (vnode != last_node) {
> spin_unlock(last_node->busy.lock);
> spin_lock(vnode->busy.lock);
> last_node = vnode;
> }
>
> ...
> }
>
> if (last_node)
> spin_unlock(last_node->busy.lock);
>
> To minimise the lock twiddling. What do you think?
>
This per-cpu allocator prefetches several VA units per CPU. I do not
consider it critical because this is not a hot path for the per-cpu
allocator: it only does an extra prefetch when its buffers are
exhausted, so it is not frequent.

>
> > }
> > - spin_unlock(&vmap_area_lock);
> >
> > /*
> > * Mark allocated areas as accessible. Do it now as a best-effort
> > @@ -4253,55 +4333,57 @@ bool vmalloc_dump_obj(void *object)
> > {
> > void *objp = (void *)PAGE_ALIGN((unsigned long)object);
> > const void *caller;
> > - struct vm_struct *vm;
> > struct vmap_area *va;
> > + struct vmap_node *vn;
> > unsigned long addr;
> > unsigned int nr_pages;
> > + bool success = false;
> >
> > - if (!spin_trylock(&vmap_area_lock))
> > - return false;
>
> Nitpick on style for this, I really don't know why you are removing this
> early exit? It's far neater to have a guard clause than to nest a whole
> bunch of code below.
>
Hm... I can put it back the way it used to be. I do not have a strong opinion here.

> > - va = __find_vmap_area((unsigned long)objp, &vmap_area_root);
> > - if (!va) {
> > - spin_unlock(&vmap_area_lock);
> > - return false;
> > - }
> > + vn = addr_to_node((unsigned long)objp);
>
> Later in the patch where you fix build bot issues with the below
> __find_vmap_area() invocation, you move from addr to (unsigned long)objp.
>
> However since you're already referring to that here, why not change what
> addr refers to and use that in both instances, e.g.
>
> unsigned long addr = (unsigned long)objp;
>
> Then update things that refer to the objp value as necessary.
>
That is what I was thinking of. We can send a separate patch for this.

> >
> > - vm = va->vm;
> > - if (!vm) {
> > - spin_unlock(&vmap_area_lock);
> > - return false;
> > + if (spin_trylock(&vn->busy.lock)) {
> > + va = __find_vmap_area(addr, &vn->busy.root);
> > +
> > + if (va && va->vm) {
> > + addr = (unsigned long)va->vm->addr;
> > + caller = va->vm->caller;
> > + nr_pages = va->vm->nr_pages;
>
> Again it feels like you're undoing some good here, now you're referencing
> va->vm over and over when you could simply assign vm = va->vm as the
> original code did.
>
> Also again it'd be nicer to use an early exit/guard clause approach.
>
We can change it in a separate patch.

> > + success = true;
> > + }
> > +
> > + spin_unlock(&vn->busy.lock);
> > }
> > - addr = (unsigned long)vm->addr;
> > - caller = vm->caller;
> > - nr_pages = vm->nr_pages;
> > - spin_unlock(&vmap_area_lock);
> > - pr_cont(" %u-page vmalloc region starting at %#lx allocated at %pS\n",
> > - nr_pages, addr, caller);
> > - return true;
> > +
> > + if (success)
> > + pr_cont(" %u-page vmalloc region starting at %#lx allocated at %pS\n",
> > + nr_pages, addr, caller);
>
> With the redefinition of addr, you could then simply put va->vm->addr here.
>
> > +
> > + return success;
> > }
> > #endif
>
> These are all essentially style nits (the actual bug was fixed by your
> follow up patch for the build bots) :)
>
Right :)

> >
> > #ifdef CONFIG_PROC_FS
> > static void *s_start(struct seq_file *m, loff_t *pos)
> > - __acquires(&vmap_purge_lock)
> > - __acquires(&vmap_area_lock)
>
> Do we want to replace these __acquires() directives? I suppose we simply
> cannot now we need to look up the node.
>
Yep. It is reworked anyway in another patch.

> > {
> > + struct vmap_node *vn = addr_to_node(0);
>
> Hm does the procfs operation span only one node? I guess we can start from
> the initial node for an iteration, but I wonder if '&vmap_nodes[0]' here is
> a more 'honest' thing to do rather than to assume that address 0 gets
> translated to node zero here?
>
> I think a comment like:
>
> /* We start from node 0 */
>
> Would be useful here at any rate.
>
It works since nr_nodes is set to 1. Later patches in this series
make this part complete.

> > +
> > mutex_lock(&vmap_purge_lock);
> > - spin_lock(&vmap_area_lock);
> > + spin_lock(&vn->busy.lock);
> >
> > - return seq_list_start(&vmap_area_list, *pos);
> > + return seq_list_start(&vn->busy.head, *pos);
> > }
> >
> > static void *s_next(struct seq_file *m, void *p, loff_t *pos)
> > {
> > - return seq_list_next(p, &vmap_area_list, pos);
> > + struct vmap_node *vn = addr_to_node(0);
>
> This one I'm a little more uncertain of, obviously comments above still
> apply but actually shouldn't we add a check to see if we're at the end of
> the list and should look at the next node?
>
> Even if we only have one for now, I don't like the idea of leaving in
> hardcoded things that might get missed when we move to nr_vmap_nodes > 1.
>
> For instance right now if you increased this above 1 it'd break things
> right?
>
> I'd say better to write logic assuming nr_vmap_nodes _could_ be > 1 even
> if, to start, it won't be.
>
Same as above. It is incomplete and sticks to a single node. Further
patches solve this fully.

> > + return seq_list_next(p, &vn->busy.head, pos);
> > }
> >
> > static void s_stop(struct seq_file *m, void *p)
> > - __releases(&vmap_area_lock)
> > - __releases(&vmap_purge_lock)
> > {
> > - spin_unlock(&vmap_area_lock);
> > + struct vmap_node *vn = addr_to_node(0);
>
> See comments above about use of addr_to_node(0).
>
All of these s_start/s_next/etc. functions are removed and reworked by
the following patches.

> > +
> > + spin_unlock(&vn->busy.lock);
> > mutex_unlock(&vmap_purge_lock);
> > }
> >
> > @@ -4344,9 +4426,11 @@ static void show_purge_info(struct seq_file *m)
> >
> > static int s_show(struct seq_file *m, void *p)
> > {
> > + struct vmap_node *vn;
> > struct vmap_area *va;
> > struct vm_struct *v;
> >
> > + vn = addr_to_node(0);
>
> This one is really quite icky, should we make it easy for a vmap_area to
> know its vmap_node? How is this going to work once nr_vmap_nodes > 1?
>
Same as above.

Thank you for the review! I can address the comments in separate patches
if there are no objections.

--
Uladzislau Rezki

2024-01-18 18:15:42

by Uladzislau Rezki (Sony)

[permalink] [raw]
Subject: Re: [PATCH v3 07/11] mm: vmalloc: Offload free_vmap_area_lock lock

On Wed, Jan 17, 2024 at 09:12:26AM +1100, Dave Chinner wrote:
> On Fri, Jan 12, 2024 at 01:18:27PM +0100, Uladzislau Rezki wrote:
> > On Fri, Jan 12, 2024 at 07:37:36AM +1100, Dave Chinner wrote:
> > > On Thu, Jan 11, 2024 at 04:54:48PM +0100, Uladzislau Rezki wrote:
> > > > On Thu, Jan 11, 2024 at 08:02:16PM +1100, Dave Chinner wrote:
> > > > > On Tue, Jan 02, 2024 at 07:46:29PM +0100, Uladzislau Rezki (Sony) wrote:
> > > > > > Concurrent access to a global vmap space is a bottle-neck.
> > > > > > We can simulate a high contention by running a vmalloc test
> > > > > > suite.
> > > > > >
> > > > > > To address it, introduce an effective vmap node logic. Each
> > > > > > node behaves as independent entity. When a node is accessed
> > > > > > it serves a request directly(if possible) from its pool.
> > > > > >
> > > > > > This model has a size based pool for requests, i.e. pools are
> > > > > > serialized and populated based on object size and real demand.
> > > > > > A maximum object size that pool can handle is set to 256 pages.
> > > > > >
> > > > > > This technique reduces a pressure on the global vmap lock.
> > > > > >
> > > > > > Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
> > > > >
> > > > > Why not use a llist for this? That gets rid of the need for a
> > > > > new pool_lock altogether...
> > > > >
> > > > Initially i used the llist. I have changed it because i keep track
> > > > of objects per a pool to decay it later. I do not find these locks
> > > > as contented one therefore i did not think much.
> > >
> > > Ok. I've used llist and an atomic counter to track the list length
> > > in the past.
> > >
> > > But is the list length even necessary? It seems to me that it is
> > > only used by the shrinker to determine how many objects are on the
> > > lists for scanning, and I'm not sure that's entirely necessary given
> > > the way the current global shrinker works (i.e. completely unfair to
> > > low numbered nodes due to scan loop start bias).
> > >
> > I use the length to decay pools by a certain percentage, currently
> > 25%, so I need to know the number of objects. This is done in the purge
> > path. As for the shrinker, once it hits us we drain the pools entirely.
>
> Why does purge need to be different to shrinking?
>
> But, regardless, you can still use llist with an atomic counter to
> do this - there is no need for a spin lock at all.
>
As I pointed out earlier, I will have a look at it.

> > > > Anyway, i will have a look at this to see if llist is easy to go with
> > > > or not. If so i will send out a separate patch.
> > >
> > > Sounds good, it was just something that crossed my mind given the
> > > pattern of "producer adds single items, consumer detaches entire
> > > list, processes it and reattaches remainder" is a perfect match for
> > > the llist structure.
> > >
> > The llist_del_first() has to be serialized. For this purpose a per-CPU
> > pool would work, or some kind of "in_use" atomic that protects against
> > concurrent removal.
>
> So don't use llist_del_first().
>
> > If we detach the entire llist, then we need to keep track of the last
> > node so it can be added back later as a "batch" to the already
> > existing/populated list.
>
> Why? I haven't see any need for ordering these lists which would
> requiring strict tail-add ordered semantics.
>
I mean the following:

1. first = llist_del_all(&example);
2. last = llist_reverse_order(first);

4. va = __llist_del_first(first);

/*
* "example" might not be empty, use the batch. Otherwise
 * we lose the entries "example" pointed to.
*/
3. llist_add_batch(first, last, &example);
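
Written out as a self-contained sketch (illustrative only, the helper
name is made up and the entries are kept as raw llist_node pointers),
the above sequence could be:

static struct llist_node *pool_pop_batch(struct llist_head *example)
{
        struct llist_node *first, *last, *va;

        first = llist_del_all(example);
        if (!first)
                return NULL;

        /* Restore FIFO order and remember the tail for the batch add. */
        first = llist_reverse_order(first);
        for (last = first; last->next; last = last->next)
                ;

        /* Take one entry for the caller. */
        va = first;
        first = first->next;

        /*
         * "example" might have been repopulated meanwhile, so the
         * remainder is re-attached as a batch instead of overwriting it.
         */
        if (first)
                llist_add_batch(first, last, example);

        return va;
}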

--
Uladzislau Rezki

2024-01-18 18:24:05

by Uladzislau Rezki (Sony)

[permalink] [raw]
Subject: Re: [PATCH v3 10/11] mm: vmalloc: Set nr_nodes based on CPUs in a system

On Wed, Jan 17, 2024 at 09:06:02AM +1100, Dave Chinner wrote:
> On Mon, Jan 15, 2024 at 08:09:29PM +0100, Uladzislau Rezki wrote:
> > > On Tue, Jan 02, 2024 at 07:46:32PM +0100, Uladzislau Rezki (Sony) wrote:
> > > > A number of nodes which are used in the alloc/free paths is
> > > > set based on num_possible_cpus() in a system. Please note a
> > > > high limit threshold though is fixed and corresponds to 128
> > > > nodes.
> > >
> > > Large CPU count machines are NUMA machines. ALl of the allocation
> > > and reclaim is NUMA node based i.e. a pgdat per NUMA node.
> > >
> > > Shrinkers are also able to be run in a NUMA aware mode so that
> > > per-node structures can be reclaimed similar to how per-node LRU
> > > lists are scanned for reclaim.
> > >
> > > Hence I'm left to wonder if it would be better to have a vmalloc
> > > area per pgdat (or sub-node cluster) rather than just base the
> > > number on CPU count and then have an arbitrary maximum number when
> > > we get to 128 CPU cores. We can have 128 CPU cores in a
> > > single socket these days, so not being able to scale the vmalloc
> > > areas beyond a single socket seems like a bit of a limitation.
> > >
> > > Scaling out the vmalloc areas in a NUMA aware fashion allows the
> > > shrinker to be run in numa aware mode, which gets rid of the need
> > > for the global shrinker to loop over every single vmap area in every
> > > shrinker invocation. Only the vm areas on the node that has a memory
> > > shortage need to be scanned and reclaimed, it doesn't need reclaim
> > > everything globally when a single node runs out of memory.
> > >
> > > Yes, this may not give quite as good microbenchmark scalability
> > > results, but being able to locate each vm area in node local memory
> > > and have operation on them largely isolated to node-local tasks and
> > > vmalloc area reclaim will work much better on large multi-socket
> > > NUMA machines.
> > >
> > Currently I cap the maximum number of nodes at 128. This is because I do
> > not have access to such big NUMA systems, whereas I do have access to
> > systems with around 128 CPUs. That is why I have decided to stop at that
> > number for now.
>
> I suspect you are confusing number of CPUs with number of NUMA nodes.
>
I do not think so :)

>
> A NUMA system with 128 nodes is a large NUMA system that will have
> thousands of CPU cores, whilst above you talk about basing the
> count on CPU cores and that a single socket can have 128 cores?
>
> > We can easily set nr_nodes to num_possible_cpus() and let it scale for
> > anyone. But before doing this, I would like to give it a try as a first
> > step, because I have not tested it well on really big NUMA systems.
>
> I don't think you need to have large NUMA systems to test it. We
> have the "fakenuma" feature for a reason. Essentially, once you
> have enough CPU cores that catastrophic lock contention can be
> generated in a fast path (can take as few as 4-5 CPU cores), then
> you can effectively test NUMA scalability with fakenuma by creating
> nodes with >=8 CPUs each.
>
> This is how I've done testing of numa aware algorithms (like
> shrinkers!) for the past decade - I haven't had direct access to a
> big NUMA machine since 2008, yet it's relatively trivial to test
> NUMA based scalability algorithms without them these days.
>
I see your point. NUMA-aware scalability requires a rework, adding an
extra layer that allows such scaling.

If the socket has 256 CPUs, how do we scale VAs inside that node among
those CPUs?

--
Uladzislau Rezki

2024-01-19 10:32:15

by Uladzislau Rezki (Sony)

[permalink] [raw]
Subject: Re: [PATCH v3 10/11] mm: vmalloc: Set nr_nodes based on CPUs in a system

On Fri, Jan 19, 2024 at 08:28:05AM +1100, Dave Chinner wrote:
> On Thu, Jan 18, 2024 at 07:23:47PM +0100, Uladzislau Rezki wrote:
> > On Wed, Jan 17, 2024 at 09:06:02AM +1100, Dave Chinner wrote:
> > > On Mon, Jan 15, 2024 at 08:09:29PM +0100, Uladzislau Rezki wrote:
> > > > We can easily set nr_nodes to num_possible_cpus() and let it scale for
> > > > anyone. But before doing this, I would like to give it a try as a first
> > > > step, because I have not tested it well on really big NUMA systems.
> > >
> > > I don't think you need to have large NUMA systems to test it. We
> > > have the "fakenuma" feature for a reason. Essentially, once you
> > > have enough CPU cores that catastrophic lock contention can be
> > > generated in a fast path (can take as few as 4-5 CPU cores), then
> > > you can effectively test NUMA scalability with fakenuma by creating
> > > nodes with >=8 CPUs each.
> > >
> > > This is how I've done testing of numa aware algorithms (like
> > > shrinkers!) for the past decade - I haven't had direct access to a
> > > big NUMA machine since 2008, yet it's relatively trivial to test
> > > NUMA based scalability algorithms without them these days.
> > >
> > I see your point. NUMA-aware scalability requires a rework, adding an
> > extra layer that allows such scaling.
> >
> > If the socket has 256 CPUs, how do we scale VAs inside that node among
> > those CPUs?
>
> It's called "sub-numa clustering" and is a bios option that presents
> large core count CPU packages as multiple NUMA nodes. See:
>
> https://www.intel.com/content/www/us/en/developer/articles/technical/fourth-generation-xeon-scalable-family-overview.html
>
> Essentially, large core count CPUs are a cluster of smaller core
> groups with their own resources and memory controllers. This is how
> they are laid out either on a single die (intel) or as a collection
> of smaller dies (AMD compute complexes) that are tied together by
> the interconnect between the LLCs and memory controllers. They only
> appear as a "unified" CPU because they are configured that way by
> the bios, but can also be configured to actually expose their inner
> non-uniform memory access topology for operating systems and
> application stacks that are NUMA aware (like Linux).
>
> This means a "256 core" CPU would probably present as 16 smaller 16
> core CPUs each with their own L1/2/3 caches and memory controllers.
> IOWs, a single socket appears to the kernel as a 16 node NUMA system
> with 16 cores per node. Most NUMA aware scalability algorithms will
> work just fine with this sort setup - it's just another set of
> numbers in the NUMA distance table...
>
Thank you for your input. I will go through it to see what we can
do in terms of NUMA awareness with thousands of CPUs in total.

Thanks!

--
Uladzislau Rezki


2024-01-20 12:57:37

by Lorenzo Stoakes

[permalink] [raw]
Subject: Re: [PATCH v3 04/11] mm: vmalloc: Remove global vmap_area_root rb-tree

On Thu, Jan 18, 2024 at 02:15:31PM +0100, Uladzislau Rezki wrote:

[snip]

>
> > > + struct rb_root root;
> > > + struct list_head head;
> > > + spinlock_t lock;
> > > +};
> > > +
> > > +static struct vmap_node {
> > > + /* Bookkeeping data of this node. */
> > > + struct rb_list busy;
> > > +} single;
> >
> > This may be a thing about encapsulation/naming or similar, but I'm a little
> > confused as to why the rb_list type is maintained as a field rather than
> > its fields embedded?
> >
> The "struct vmap_node" will be extended by the following patches in the
> series.
>

Yeah sorry I missed this, only realising after I sent...!

> > > +
> > > +static struct vmap_node *vmap_nodes = &single;
> > > +static __read_mostly unsigned int nr_vmap_nodes = 1;
> > > +static __read_mostly unsigned int vmap_zone_size = 1;
> >
> > It might be worth adding a comment here explaining that we're binding to a
> > single node for now to maintain existing behaviour (and a brief description
> > of what these values mean - for instance what unit vmap_zone_size is
> > expressed in?)
> >
> Right. Agree on it :)
>

Indeed :)

[snip]

> > > /* Look up the first VA which satisfies addr < va_end, NULL if none. */
> > > -static struct vmap_area *find_vmap_area_exceed_addr(unsigned long addr)
> > > +static struct vmap_area *
> > > +find_vmap_area_exceed_addr(unsigned long addr, struct rb_root *root)
> > > {
> > > struct vmap_area *va = NULL;
> > > - struct rb_node *n = vmap_area_root.rb_node;
> > > + struct rb_node *n = root->rb_node;
> > >
> > > addr = (unsigned long)kasan_reset_tag((void *)addr);
> > >
> > > @@ -1552,12 +1583,14 @@ __alloc_vmap_area(struct rb_root *root, struct list_head *head,
> > > */
> > > static void free_vmap_area(struct vmap_area *va)
> > > {
> > > + struct vmap_node *vn = addr_to_node(va->va_start);
> > > +
> >
> > I'm being nitty here, and while I know it's a vmalloc convention to use
> > 'va' and 'vm', perhaps we can break away from the super short variable name
> > convention and use 'vnode' or something for these values?
> >
> > I feel people might get confused between 'vm' and 'vn' for instance.
> >
> vnode, varea?

I think 'vm' and 'va' are fine, just scanning through easy to mistake 'vn'
and 'vm'. Obviously a little nitpicky! You could replace all but a bit
churny, so I think vn -> vnode works best imo.

[snip]

> > > struct vmap_area *find_vmap_area(unsigned long addr)
> > > {
> > > + struct vmap_node *vn;
> > > struct vmap_area *va;
> > > + int i, j;
> > >
> > > - spin_lock(&vmap_area_lock);
> > > - va = __find_vmap_area(addr, &vmap_area_root);
> > > - spin_unlock(&vmap_area_lock);
> > > + /*
> > > + * An addr_to_node_id(addr) converts an address to a node index
> > > + * where a VA is located. If VA spans several zones and passed
> > > + * addr is not the same as va->va_start, what is not common, we
> > > + * may need to scan an extra nodes. See an example:
> >
> > For my understanding, when you say 'scan an extra nodes' do you mean scan
> > just 1 extra node, or multiple? If the former I'd replace this with 'may
> > need to scan an extra node', if the latter then 'may need to scan extra
> > nodes'.
> >
> > It's a nitty language thing, but also potentially changes the meaning of
> > this!
> >
> Typo, i should replace it to: scan extra nodes.

Thanks.

>
> > > + *
> > > + * <--va-->
> > > + * -|-----|-----|-----|-----|-
> > > + * 1 2 0 1
> > > + *
> > > + * VA resides in node 1 whereas it spans 1 and 2. If passed
> > > + * addr is within a second node we should do extra work. We
> > > + * should mention that it is rare and is a corner case from
> > > + * the other hand it has to be covered.
> >
> > A very minor language style nit, but you've already said this is not
> > common, I don't think you need this 'We should mention...' bit. It's not a
> > big deal however!
> >
> No problem. We can remove it!

Thanks.

>
> > > + */
> > > + i = j = addr_to_node_id(addr);
> > > + do {
> > > + vn = &vmap_nodes[i];
> > >
> > > - return va;
> > > + spin_lock(&vn->busy.lock);
> > > + va = __find_vmap_area(addr, &vn->busy.root);
> > > + spin_unlock(&vn->busy.lock);
> > > +
> > > + if (va)
> > > + return va;
> > > + } while ((i = (i + 1) % nr_vmap_nodes) != j);
> >
> > If you comment above suggests that only 1 extra node might need to be
> > scanned, should we stop after one iteration?
> >
> Not really. Though we can improve it further to scan backward.

I think it'd be good to clarify in the comment above that the VA could span
more than 1 node then, as the diagram seems to imply only 1 (I think just
simply because of the example you were showing).

[snip]

> > > static struct vmap_area *find_unlink_vmap_area(unsigned long addr)
> > > {
> > > + struct vmap_node *vn;
> > > struct vmap_area *va;
> > > + int i, j;
> > >
> > > - spin_lock(&vmap_area_lock);
> > > - va = __find_vmap_area(addr, &vmap_area_root);
> > > - if (va)
> > > - unlink_va(va, &vmap_area_root);
> > > - spin_unlock(&vmap_area_lock);
> > > + i = j = addr_to_node_id(addr);
> > > + do {
> > > + vn = &vmap_nodes[i];
> > >
> > > - return va;
> > > + spin_lock(&vn->busy.lock);
> > > + va = __find_vmap_area(addr, &vn->busy.root);
> > > + if (va)
> > > + unlink_va(va, &vn->busy.root);
> > > + spin_unlock(&vn->busy.lock);
> > > +
> > > + if (va)
> > > + return va;
> > > + } while ((i = (i + 1) % nr_vmap_nodes) != j);
> >
> > Maybe worth adding a comment saying to refer to the comment in
> > find_vmap_area() to see why this loop is necessary.
> >
> OK. We can do it to make it better for reading.

Thanks!

[snip]

> > > @@ -3728,8 +3804,11 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
> >
> > Unrelated to your change but makes me feel a little unwell to see 'const
> > char *addr'! Can we change this at some point? Or maybe I can :)
> >
> You are welcome :)

Haha ;) yes I think I might tbh, I have noted it down.

>
> > >
> > > remains = count;
> > >
> > > - spin_lock(&vmap_area_lock);
> > > - va = find_vmap_area_exceed_addr((unsigned long)addr);
> > > + /* Hooked to node_0 so far. */
> > > + vn = addr_to_node(0);
> >
> > Why can't we use addr for this call? We already enforce the node-0 only
> > thing by setting nr_vmap_nodes to 1 right? And won't this be potentially
> > subtly wrong when we later increase this?
> >
> I used to have 0 here. But please note, it is changed by the next patch in
> this series.

Yeah sorry, again hadn't noticed this.

[snip]

> > > + spin_lock(&vn->busy.lock);
> > > + insert_vmap_area(vas[area], &vn->busy.root, &vn->busy.head);
> > > setup_vmalloc_vm_locked(vms[area], vas[area], VM_ALLOC,
> > > pcpu_get_vm_areas);
> > > + spin_unlock(&vn->busy.lock);
> >
> > Hmm, before we were locking/unlocking once before the loop, now we're
> > locking on each iteration, this seems inefficient.
> >
> > Seems like we need logic like:
> >
> > /* ... something to check nr_vms > 0 ... */
> > struct vmap_node *last_node = NULL;
> >
> > for (...) {
> > struct vmap_node *vnode = addr_to_node(vas[area]->va_start);
> >
> > if (vnode != last_node) {
> > spin_unlock(last_node->busy.lock);
> > spin_lock(vnode->busy.lock);
> > last_node = vnode;
> > }
> >
> > ...
> > }
> >
> > if (last_node)
> > spin_unlock(last_node->busy.lock);
> >
> > To minimise the lock twiddling. What do you think?
> >
> This per-cpu-allocator prefetches several VA units per-cpu. I do not
> find it as critical because it is not a hot path for the per-cpu allocator.
> When its buffers are exhausted it does an extra prefetch. So it is not
> frequent.

OK, sure I mean this is simpler and more readable so if not a huge perf
concern then not a big deal.

>
> >
> > > }
> > > - spin_unlock(&vmap_area_lock);
> > >
> > > /*
> > > * Mark allocated areas as accessible. Do it now as a best-effort
> > > @@ -4253,55 +4333,57 @@ bool vmalloc_dump_obj(void *object)
> > > {
> > > void *objp = (void *)PAGE_ALIGN((unsigned long)object);
> > > const void *caller;
> > > - struct vm_struct *vm;
> > > struct vmap_area *va;
> > > + struct vmap_node *vn;
> > > unsigned long addr;
> > > unsigned int nr_pages;
> > > + bool success = false;
> > >
> > > - if (!spin_trylock(&vmap_area_lock))
> > > - return false;
> >
> > Nitpick on style for this, I really don't know why you are removing this
> > early exit? It's far neater to have a guard clause than to nest a whole
> > bunch of code below.
> >
> Hm... I can return back as it used to be. I do not have a strong opinion here.

Yeah that'd be ideal just for readability.

[snip the rest as broadly fairly trivial comment stuff on which we agree]

>
> Thank you for the review! I can fix the comments as separate patches if
> no objections.

Yes, overall it's style/comment improvement stuff nothing major, feel free
to send as follow-up patches.

I don't want to hold anything up here so for the rest, feel free to add:

Reviewed-by: Lorenzo Stoakes <[email protected]>

>
> --
> Uladzislau Rezki

2024-01-22 18:26:18

by Uladzislau Rezki (Sony)

[permalink] [raw]
Subject: Re: [PATCH v3 04/11] mm: vmalloc: Remove global vmap_area_root rb-tree

On Sat, Jan 20, 2024 at 12:55:10PM +0000, Lorenzo Stoakes wrote:
> On Thu, Jan 18, 2024 at 02:15:31PM +0100, Uladzislau Rezki wrote:
>
> [snip]
>
> >
> > > > + struct rb_root root;
> > > > + struct list_head head;
> > > > + spinlock_t lock;
> > > > +};
> > > > +
> > > > +static struct vmap_node {
> > > > + /* Bookkeeping data of this node. */
> > > > + struct rb_list busy;
> > > > +} single;
> > >
> > > This may be a thing about encapsulation/naming or similar, but I'm a little
> > > confused as to why the rb_list type is maintained as a field rather than
> > > its fields embedded?
> > >
> > The "struct vmap_node" will be extended by the following patches in the
> > series.
> >
>
> Yeah sorry I missed this, only realising after I sent...!
>
> > > > +
> > > > +static struct vmap_node *vmap_nodes = &single;
> > > > +static __read_mostly unsigned int nr_vmap_nodes = 1;
> > > > +static __read_mostly unsigned int vmap_zone_size = 1;
> > >
> > > It might be worth adding a comment here explaining that we're binding to a
> > > single node for now to maintain existing behaviour (and a brief description
> > > of what these values mean - for instance what unit vmap_zone_size is
> > > expressed in?)
> > >
> > Right. Agree on it :)
> >
>
> Indeed :)
>
> [snip]
>
> > > > /* Look up the first VA which satisfies addr < va_end, NULL if none. */
> > > > -static struct vmap_area *find_vmap_area_exceed_addr(unsigned long addr)
> > > > +static struct vmap_area *
> > > > +find_vmap_area_exceed_addr(unsigned long addr, struct rb_root *root)
> > > > {
> > > > struct vmap_area *va = NULL;
> > > > - struct rb_node *n = vmap_area_root.rb_node;
> > > > + struct rb_node *n = root->rb_node;
> > > >
> > > > addr = (unsigned long)kasan_reset_tag((void *)addr);
> > > >
> > > > @@ -1552,12 +1583,14 @@ __alloc_vmap_area(struct rb_root *root, struct list_head *head,
> > > > */
> > > > static void free_vmap_area(struct vmap_area *va)
> > > > {
> > > > + struct vmap_node *vn = addr_to_node(va->va_start);
> > > > +
> > >
> > > I'm being nitty here, and while I know it's a vmalloc convention to use
> > > 'va' and 'vm', perhaps we can break away from the super short variable name
> > > convention and use 'vnode' or something for these values?
> > >
> > > I feel people might get confused between 'vm' and 'vn' for instance.
> > >
> > vnode, varea?
>
> I think 'vm' and 'va' are fine, just scanning through easy to mistake 'vn'
> and 'vm'. Obviously a little nitpicky! You could replace all but a bit
> churny, so I think vn -> vnode works best imo.
>
> [snip]
>
> > > > struct vmap_area *find_vmap_area(unsigned long addr)
> > > > {
> > > > + struct vmap_node *vn;
> > > > struct vmap_area *va;
> > > > + int i, j;
> > > >
> > > > - spin_lock(&vmap_area_lock);
> > > > - va = __find_vmap_area(addr, &vmap_area_root);
> > > > - spin_unlock(&vmap_area_lock);
> > > > + /*
> > > > + * An addr_to_node_id(addr) converts an address to a node index
> > > > + * where a VA is located. If VA spans several zones and passed
> > > > + * addr is not the same as va->va_start, what is not common, we
> > > > + * may need to scan an extra nodes. See an example:
> > >
> > > For my understanding, when you say 'scan an extra nodes' do you mean scan
> > > just 1 extra node, or multiple? If the former I'd replace this with 'may
> > > need to scan an extra node', if the latter then 'may need to scan extra
> > > nodes'.
> > >
> > > It's a nitty language thing, but also potentially changes the meaning of
> > > this!
> > >
> > Typo, i should replace it to: scan extra nodes.
>
> Thanks.
>
> >
> > > > + *
> > > > + * <--va-->
> > > > + * -|-----|-----|-----|-----|-
> > > > + * 1 2 0 1
> > > > + *
> > > > + * VA resides in node 1 whereas it spans 1 and 2. If passed
> > > > + * addr is within a second node we should do extra work. We
> > > > + * should mention that it is rare and is a corner case from
> > > > + * the other hand it has to be covered.
> > >
> > > A very minor language style nit, but you've already said this is not
> > > common, I don't think you need this 'We should mention...' bit. It's not a
> > > big deal however!
> > >
> > No problem. We can remove it!
>
> Thanks.
>
> >
> > > > + */
> > > > + i = j = addr_to_node_id(addr);
> > > > + do {
> > > > + vn = &vmap_nodes[i];
> > > >
> > > > - return va;
> > > > + spin_lock(&vn->busy.lock);
> > > > + va = __find_vmap_area(addr, &vn->busy.root);
> > > > + spin_unlock(&vn->busy.lock);
> > > > +
> > > > + if (va)
> > > > + return va;
> > > > + } while ((i = (i + 1) % nr_vmap_nodes) != j);
> > >
> > > If you comment above suggests that only 1 extra node might need to be
> > > scanned, should we stop after one iteration?
> > >
> > Not really. Though we can improve it further to scan backward.
>
> I think it'd be good to clarify in the comment above that the VA could span
> more than 1 node then, as the diagram seems to imply only 1 (I think just
> simply because of the example you were showing).
>
> [snip]
>
> > > > static struct vmap_area *find_unlink_vmap_area(unsigned long addr)
> > > > {
> > > > + struct vmap_node *vn;
> > > > struct vmap_area *va;
> > > > + int i, j;
> > > >
> > > > - spin_lock(&vmap_area_lock);
> > > > - va = __find_vmap_area(addr, &vmap_area_root);
> > > > - if (va)
> > > > - unlink_va(va, &vmap_area_root);
> > > > - spin_unlock(&vmap_area_lock);
> > > > + i = j = addr_to_node_id(addr);
> > > > + do {
> > > > + vn = &vmap_nodes[i];
> > > >
> > > > - return va;
> > > > + spin_lock(&vn->busy.lock);
> > > > + va = __find_vmap_area(addr, &vn->busy.root);
> > > > + if (va)
> > > > + unlink_va(va, &vn->busy.root);
> > > > + spin_unlock(&vn->busy.lock);
> > > > +
> > > > + if (va)
> > > > + return va;
> > > > + } while ((i = (i + 1) % nr_vmap_nodes) != j);
> > >
> > > Maybe worth adding a comment saying to refer to the comment in
> > > find_vmap_area() to see why this loop is necessary.
> > >
> > OK. We can do it to make it better for reading.
>
> Thanks!
>
> [snip]
>
> > > > @@ -3728,8 +3804,11 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
> > >
> > > Unrelated to your change but makes me feel a little unwell to see 'const
> > > char *addr'! Can we change this at some point? Or maybe I can :)
> > >
> > You are welcome :)
>
> Haha ;) yes I think I might tbh, I have noted it down.
>
> >
> > > >
> > > > remains = count;
> > > >
> > > > - spin_lock(&vmap_area_lock);
> > > > - va = find_vmap_area_exceed_addr((unsigned long)addr);
> > > > + /* Hooked to node_0 so far. */
> > > > + vn = addr_to_node(0);
> > >
> > > Why can't we use addr for this call? We already enforce the node-0 only
> > > thing by setting nr_vmap_nodes to 1 right? And won't this be potentially
> > > subtly wrong when we later increase this?
> > >
> > I used to have 0 here. But please note, it is changed by the next patch in
> > this series.
>
> Yeah sorry, again hadn't noticed this.
>
> [snip]
>
> > > > + spin_lock(&vn->busy.lock);
> > > > + insert_vmap_area(vas[area], &vn->busy.root, &vn->busy.head);
> > > > setup_vmalloc_vm_locked(vms[area], vas[area], VM_ALLOC,
> > > > pcpu_get_vm_areas);
> > > > + spin_unlock(&vn->busy.lock);
> > >
> > > Hmm, before we were locking/unlocking once before the loop, now we're
> > > locking on each iteration, this seems inefficient.
> > >
> > > Seems like we need logic like:
> > >
> > > /* ... something to check nr_vms > 0 ... */
> > > struct vmap_node *last_node = NULL;
> > >
> > > for (...) {
> > > struct vmap_node *vnode = addr_to_node(vas[area]->va_start);
> > >
> > > if (vnode != last_node) {
> > > spin_unlock(last_node->busy.lock);
> > > spin_lock(vnode->busy.lock);
> > > last_node = vnode;
> > > }
> > >
> > > ...
> > > }
> > >
> > > if (last_node)
> > > spin_unlock(last_node->busy.lock);
> > >
> > > To minimise the lock twiddling. What do you think?
> > >
> > This per-cpu-allocator prefetches several VA units per-cpu. I do not
> > find it as critical because it is not a hot path for the per-cpu allocator.
> > When its buffers are exhausted it does an extra prefetch. So it is not
> > frequent.
>
> OK, sure I mean this is simpler and more readable so if not a huge perf
> concern then not a big deal.
>
> >
> > >
> > > > }
> > > > - spin_unlock(&vmap_area_lock);
> > > >
> > > > /*
> > > > * Mark allocated areas as accessible. Do it now as a best-effort
> > > > @@ -4253,55 +4333,57 @@ bool vmalloc_dump_obj(void *object)
> > > > {
> > > > void *objp = (void *)PAGE_ALIGN((unsigned long)object);
> > > > const void *caller;
> > > > - struct vm_struct *vm;
> > > > struct vmap_area *va;
> > > > + struct vmap_node *vn;
> > > > unsigned long addr;
> > > > unsigned int nr_pages;
> > > > + bool success = false;
> > > >
> > > > - if (!spin_trylock(&vmap_area_lock))
> > > > - return false;
> > >
> > > Nitpick on style for this, I really don't know why you are removing this
> > > early exit? It's far neater to have a guard clause than to nest a whole
> > > bunch of code below.
> > >
> > Hm... I can return back as it used to be. I do not have a strong opinion here.
>
> Yeah that'd be ideal just for readability.
>
> [snip the rest as broadly fairly trivial comment stuff on which we agree]
>
> >
> > Thank you for the review! I can fix the comments as separate patches if
> > no objections.
>
> Yes, overall it's style/comment improvement stuff nothing major, feel free
> to send as follow-up patches.
>
> I don't want to hold anything up here so for the rest, feel free to add:
>
> Reviewed-by: Lorenzo Stoakes <[email protected]>
>
Appreciated! I will go through it again and send out a patch that adds
the more detailed explanations requested in this review.

Again, thank you!

--
Uladzislau Rezki

2024-02-08 13:58:23

by Uladzislau Rezki (Sony)

[permalink] [raw]
Subject: Re: [PATCH v3 07/11] mm: vmalloc: Offload free_vmap_area_lock lock

On Thu, Feb 08, 2024 at 08:25:23AM +0800, Baoquan He wrote:
> On 01/02/24 at 07:46pm, Uladzislau Rezki (Sony) wrote:
> ......
> > +static struct vmap_area *
> > +node_alloc(unsigned long size, unsigned long align,
> > + unsigned long vstart, unsigned long vend,
> > + unsigned long *addr, unsigned int *vn_id)
> > +{
> > + struct vmap_area *va;
> > +
> > + *vn_id = 0;
> > + *addr = vend;
> > +
> > + /*
> > + * Fallback to a global heap if not vmalloc or there
> > + * is only one node.
> > + */
> > + if (vstart != VMALLOC_START || vend != VMALLOC_END ||
> > + nr_vmap_nodes == 1)
> > + return NULL;
> > +
> > + *vn_id = raw_smp_processor_id() % nr_vmap_nodes;
> > + va = node_pool_del_va(id_to_node(*vn_id), size, align, vstart, vend);
> > + *vn_id = encode_vn_id(*vn_id);
> > +
> > + if (va)
> > + *addr = va->va_start;
> > +
> > + return va;
> > +}
> > +
> > /*
> > * Allocate a region of KVA of the specified size and alignment, within the
> > * vstart and vend.
> > @@ -1637,6 +1807,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
> > struct vmap_area *va;
> > unsigned long freed;
> > unsigned long addr;
> > + unsigned int vn_id;
> > int purged = 0;
> > int ret;
> >
> > @@ -1647,11 +1818,23 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
> > return ERR_PTR(-EBUSY);
> >
> > might_sleep();
> > - gfp_mask = gfp_mask & GFP_RECLAIM_MASK;
> >
> > - va = kmem_cache_alloc_node(vmap_area_cachep, gfp_mask, node);
> > - if (unlikely(!va))
> > - return ERR_PTR(-ENOMEM);
> > + /*
> > + * If a VA is obtained from a global heap(if it fails here)
> > + * it is anyway marked with this "vn_id" so it is returned
> > + * to this pool's node later. Such way gives a possibility
> > + * to populate pools based on users demand.
> > + *
> > + * On success a ready to go VA is returned.
> > + */
> > + va = node_alloc(size, align, vstart, vend, &addr, &vn_id);
>
> Sorry for late checking.
>
No problem :)

> Here, if no available va is obtained, e.g. an empty vp, we will still get an
> effective vn_id based on the current cpu_id for a VMALLOC region allocation
> request.
>
> > + if (!va) {
> > + gfp_mask = gfp_mask & GFP_RECLAIM_MASK;
> > +
> > + va = kmem_cache_alloc_node(vmap_area_cachep, gfp_mask, node);
> > + if (unlikely(!va))
> > + return ERR_PTR(-ENOMEM);
> > + }
> >
> > /*
> > * Only scan the relevant parts containing pointers to other objects
> > @@ -1660,10 +1843,12 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
> > kmemleak_scan_area(&va->rb_node, SIZE_MAX, gfp_mask);
> >
> > retry:
> > - preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, node);
> > - addr = __alloc_vmap_area(&free_vmap_area_root, &free_vmap_area_list,
> > - size, align, vstart, vend);
> > - spin_unlock(&free_vmap_area_lock);
> > + if (addr == vend) {
> > + preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, node);
> > + addr = __alloc_vmap_area(&free_vmap_area_root, &free_vmap_area_list,
> > + size, align, vstart, vend);
>
> Then, here, we will get an available va from a random location, but its
> vn_id is from the current cpu.
>
> Then in purge_vmap_node(), we will decode the vn_id stored in va->flags,
> and add the relevant va into vn->pool[] according to the vn_id. The
> worst case could be that most of the VAs in vn->pool[] do not correspond to
> the vmap_nodes they belong to. Does that matter?
>
We do not do any up-front population; instead it behaves like a cache
miss, where you need to access main memory to do a load and then keep
the data in a cache.

Same here. As a first step it is always a miss for a CPU, thus a VA is
obtained from the global heap and marked with the CPU that made the
alloc attempt. Later on, that CPU/node is populated by the marked VA,
so a second alloc on the same CPU goes via the fast path.

Pools are populated on demand, and only on those nodes that actually do allocations.
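
Roughly, as a sketch of the free side (the helper below is hypothetical;
decode_vn_id(), id_to_node() and node_pool_add_va() are taken from elsewhere
in the series, and in the series itself this work happens in purge_vmap_node()):

/*
 * Sketch only: how a VA that was allocated from the global heap,
 * but marked with a vn_id, ends up warming the pool of the node
 * that requested it.
 */
static void return_marked_va(struct vmap_area *va)
{
	/* vn_id was encoded into va->flags on the alloc path. */
	struct vmap_node *vn = id_to_node(decode_vn_id(va->flags));

	/*
	 * Put the VA into that node's size-segregated pool, so the
	 * next allocation from the same CPU hits the fast path and
	 * bypasses the global free-space structures.
	 */
	node_pool_add_va(vn, va);
}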

> Should we adjust the code assigning vn_id in node_alloc(), or did I miss anything?
Right now it is open-coded; some further refactoring should be done. Agree.

--
Uladzislau Rezki

2024-02-22 08:35:55

by Uladzislau Rezki (Sony)

[permalink] [raw]
Subject: Re: [PATCH v3 00/11] Mitigate a vmap lock contention v3

Hello, Folk!

> This is v3. It is based on the 6.7.0-rc8.
>
> 1. Motivation
>
> - Offload global vmap locks making it scaled to number of CPUS;
> - If possible and there is an agreement, we can remove the "Per cpu kva allocator"
> to make the vmap code to be more simple;
> - There were complains from XFS folk that a vmalloc might be contented
> on the their workloads.
>
> 2. Design(high level overview)
>
> We introduce an effective vmap node logic. A node behaves as independent
> entity to serve an allocation request directly(if possible) from its pool.
> That way it bypasses a global vmap space that is protected by its own lock.
>
> An access to pools are serialized by CPUs. Number of nodes are equal to
> number of CPUs in a system. Please note the high threshold is bound to
> 128 nodes.
>
> Pools are size segregated and populated based on system demand. The maximum
> alloc request that can be stored into a segregated storage is 256 pages. The
> lazily drain path decays a pool by 25% as a first step and as second populates
> it by fresh freed VAs for reuse instead of returning them into a global space.
>
> When a VA is obtained(alloc path), it is stored in separate nodes. A va->va_start
> address is converted into a correct node where it should be placed and resided.
> Doing so we balance VAs across the nodes as a result an access becomes scalable.
> The addr_to_node() function does a proper address conversion to a correct node.
>
> A vmap space is divided on segments with fixed size, it is 16 pages. That way
> any address can be associated with a segment number. Number of segments are
> equal to num_possible_cpus() but not grater then 128. The numeration starts
> from 0. See below how it is converted:
>
> static inline unsigned int
> addr_to_node_id(unsigned long addr)
> {
> return (addr / zone_size) % nr_nodes;
> }
>
> On a free path, a VA can be easily found by converting its "va_start" address
> to a certain node it resides. It is moved from "busy" data to "lazy" data structure.
> Later on, as noted earlier, the lazy kworker decays each node pool and populates it
> by fresh incoming VAs. Please note, a VA is returned to a node that did an alloc
> request.
>
> 3. Test on AMD Ryzen Threadripper 3970X 32-Core Processor
>
> sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
>
> <default perf>
> 94.41% 0.89% [kernel] [k] _raw_spin_lock
> 93.35% 93.07% [kernel] [k] native_queued_spin_lock_slowpath
> 76.13% 0.28% [kernel] [k] __vmalloc_node_range
> 72.96% 0.81% [kernel] [k] alloc_vmap_area
> 56.94% 0.00% [kernel] [k] __get_vm_area_node
> 41.95% 0.00% [kernel] [k] vmalloc
> 37.15% 0.01% [test_vmalloc] [k] full_fit_alloc_test
> 35.17% 0.00% [kernel] [k] ret_from_fork_asm
> 35.17% 0.00% [kernel] [k] ret_from_fork
> 35.17% 0.00% [kernel] [k] kthread
> 35.08% 0.00% [test_vmalloc] [k] test_func
> 34.45% 0.00% [test_vmalloc] [k] fix_size_alloc_test
> 28.09% 0.01% [test_vmalloc] [k] long_busy_list_alloc_test
> 23.53% 0.25% [kernel] [k] vfree.part.0
> 21.72% 0.00% [kernel] [k] remove_vm_area
> 20.08% 0.21% [kernel] [k] find_unlink_vmap_area
> 2.34% 0.61% [kernel] [k] free_vmap_area_noflush
> <default perf>
> vs
> <patch-series perf>
> 82.32% 0.22% [test_vmalloc] [k] long_busy_list_alloc_test
> 63.36% 0.02% [kernel] [k] vmalloc
> 63.34% 2.64% [kernel] [k] __vmalloc_node_range
> 30.42% 4.46% [kernel] [k] vfree.part.0
> 28.98% 2.51% [kernel] [k] __alloc_pages_bulk
> 27.28% 0.19% [kernel] [k] __get_vm_area_node
> 26.13% 1.50% [kernel] [k] alloc_vmap_area
> 21.72% 21.67% [kernel] [k] clear_page_rep
> 19.51% 2.43% [kernel] [k] _raw_spin_lock
> 16.61% 16.51% [kernel] [k] native_queued_spin_lock_slowpath
> 13.40% 2.07% [kernel] [k] free_unref_page
> 10.62% 0.01% [kernel] [k] remove_vm_area
> 9.02% 8.73% [kernel] [k] insert_vmap_area
> 8.94% 0.00% [kernel] [k] ret_from_fork_asm
> 8.94% 0.00% [kernel] [k] ret_from_fork
> 8.94% 0.00% [kernel] [k] kthread
> 8.29% 0.00% [test_vmalloc] [k] test_func
> 7.81% 0.05% [test_vmalloc] [k] full_fit_alloc_test
> 5.30% 4.73% [kernel] [k] purge_vmap_node
> 4.47% 2.65% [kernel] [k] free_vmap_area_noflush
> <patch-series perf>
>
> confirms that native_queued_spin_lock_slowpath goes down to
> 16.51% from 93.07%.
>
> The throughput is ~12x higher:
>
> urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
> Run the test with following parameters: run_test_mask=7 nr_threads=64
> Done.
> Check the kernel ring buffer to see the summary.
>
> real 10m51.271s
> user 0m0.013s
> sys 0m0.187s
> urezki@pc638:~$
>
> urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
> Run the test with following parameters: run_test_mask=7 nr_threads=64
> Done.
> Check the kernel ring buffer to see the summary.
>
> real 0m51.301s
> user 0m0.015s
> sys 0m0.040s
> urezki@pc638:~$
>
> 4. Changelog
>
> v1: https://lore.kernel.org/linux-mm/ZIAqojPKjChJTssg@pc636/T/
> v2: https://lore.kernel.org/lkml/[email protected]/
>
> Delta v2 -> v3:
> - fix comments from v2 feedback;
> - switch from pre-fetch chunk logic to a less complex size based pools.
>
> Baoquan He (1):
> mm/vmalloc: remove vmap_area_list
>
> Uladzislau Rezki (Sony) (10):
> mm: vmalloc: Add va_alloc() helper
> mm: vmalloc: Rename adjust_va_to_fit_type() function
> mm: vmalloc: Move vmap_init_free_space() down in vmalloc.c
> mm: vmalloc: Remove global vmap_area_root rb-tree
> mm: vmalloc: Remove global purge_vmap_area_root rb-tree
> mm: vmalloc: Offload free_vmap_area_lock lock
> mm: vmalloc: Support multiple nodes in vread_iter
> mm: vmalloc: Support multiple nodes in vmallocinfo
> mm: vmalloc: Set nr_nodes based on CPUs in a system
> mm: vmalloc: Add a shrinker to drain vmap pools
>
> .../admin-guide/kdump/vmcoreinfo.rst | 8 +-
> arch/arm64/kernel/crash_core.c | 1 -
> arch/riscv/kernel/crash_core.c | 1 -
> include/linux/vmalloc.h | 1 -
> kernel/crash_core.c | 4 +-
> kernel/kallsyms_selftest.c | 1 -
> mm/nommu.c | 2 -
> mm/vmalloc.c | 1049 ++++++++++++-----
> 8 files changed, 786 insertions(+), 281 deletions(-)
>
> --
> 2.39.2
>
There is one thing that i have to clarify and which is still open for me.

Test machine:
qemu x86_64 system
64 CPUs
64G of memory

test suite:
test_vmalloc.sh

environment:
mm-unstable, branch: next-20240220 where this series
is located. On top of it i locally added Suren Baghdasaryan's
Memory allocation profiling v3 for a better understanding of memory
usage.

Before running the test, the state is as below:

urezki@pc638:~$ sort -h /proc/allocinfo
27.2MiB 6970 mm/memory.c:1122 module:memory func:folio_prealloc
79.1MiB 20245 mm/readahead.c:247 module:readahead func:page_cache_ra_unbounded
112MiB 8689 mm/slub.c:2202 module:slub func:alloc_slab_page
122MiB 31168 mm/page_ext.c:270 module:page_ext func:alloc_page_ext
urezki@pc638:~$ free -m
total used free shared buff/cache available
Mem: 64172 936 63618 0 134 63236
Swap: 0 0 0
urezki@pc638:~$

The test suite stresses the vmap/vmalloc layer by creating workers which
do alloc/free in a tight loop, i.e. it is considered extreme. Below, three
identical tests were done with only one difference: 64, 128 and 256 kworkers:

1) sudo tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=64

urezki@pc638:~$ sort -h /proc/allocinfo
80.1MiB 20518 mm/readahead.c:247 module:readahead func:page_cache_ra_unbounded
122MiB 31168 mm/page_ext.c:270 module:page_ext func:alloc_page_ext
153MiB 39048 mm/filemap.c:1919 module:filemap func:__filemap_get_folio
178MiB 13259 mm/slub.c:2202 module:slub func:alloc_slab_page
350MiB 89656 include/linux/mm.h:2848 module:memory func:pagetable_alloc
urezki@pc638:~$ free -m
total used free shared buff/cache available
Mem: 64172 1417 63054 0 298 62755
Swap: 0 0 0
urezki@pc638:~$

2) sudo tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=128

urezki@pc638:~$ sort -h /proc/allocinfo
122MiB 31168 mm/page_ext.c:270 module:page_ext func:alloc_page_ext
154MiB 39440 mm/filemap.c:1919 module:filemap func:__filemap_get_folio
196MiB 14038 mm/slub.c:2202 module:slub func:alloc_slab_page
1.20GiB 315655 include/linux/mm.h:2848 module:memory func:pagetable_alloc
urezki@pc638:~$ free -m
total used free shared buff/cache available
Mem: 64172 2556 61914 0 302 61616
Swap: 0 0 0
urezki@pc638:~$

3) sudo tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=256

urezki@pc638:~$ sort -h /proc/allocinfo
127MiB 32565 mm/readahead.c:247 module:readahead func:page_cache_ra_unbounded
197MiB 50506 mm/filemap.c:1919 module:filemap func:__filemap_get_folio
278MiB 18519 mm/slub.c:2202 module:slub func:alloc_slab_page
5.36GiB 1405072 include/linux/mm.h:2848 module:memory func:pagetable_alloc
urezki@pc638:~$ free -m
total used free shared buff/cache available
Mem: 64172 6741 57652 0 394 57431
Swap: 0 0 0
urezki@pc638:~$

pagetable_alloc - increases as soon as higher pressure is applied by
increasing the number of workers. Running the same number of jobs on a
subsequent run does not increase it further; it stays at the same level as before.

/**
* pagetable_alloc - Allocate pagetables
* @gfp: GFP flags
* @order: desired pagetable order
*
* pagetable_alloc allocates memory for page tables as well as a page table
* descriptor to describe that memory.
*
* Return: The ptdesc describing the allocated page tables.
*/
static inline struct ptdesc *pagetable_alloc(gfp_t gfp, unsigned int order)
{
	struct page *page = alloc_pages(gfp | __GFP_COMP, order);

	return page_ptdesc(page);
}

Could you please comment on it? Or do you have any thoughts? Is it expected?
Does a page table ever shrink?

/proc/slabinfo does not show any unusually high "active" or "number" of
objects for any cache.

/proc/meminfo - "VmallocUsed" stays low after those 3 tests.
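
For a rough sense of scale (back-of-the-envelope only, assuming 4K pages and
that most of the ptdesc pages above are PTE-level pages, each covering 2MiB of
virtual space): 1,405,072 pages * 2MiB mapped per PTE page is on the order of
2.7TiB of vmalloc address space having had page tables populated at some point
during the run, even though "VmallocUsed" is low once the test finishes.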

I have checked it with KASAN and KMEMLEAK and i do not see any issues.

Thank you for the help!

--
Uladzislau Rezki

2024-02-22 23:16:19

by Pedro Falcato

[permalink] [raw]
Subject: Re: [PATCH v3 00/11] Mitigate a vmap lock contention v3

Hi,

On Thu, Feb 22, 2024 at 8:35 AM Uladzislau Rezki <[email protected]> wrote:
>
> Hello, Folk!
>
>[...]
> pagetable_alloc - gets increased as soon as a higher pressure is applied by
> increasing number of workers. Running same number of jobs on a next run
> does not increase it and stays on same level as on previous.
>
> /**
> * pagetable_alloc - Allocate pagetables
> * @gfp: GFP flags
> * @order: desired pagetable order
> *
> * pagetable_alloc allocates memory for page tables as well as a page table
> * descriptor to describe that memory.
> *
> * Return: The ptdesc describing the allocated page tables.
> */
> static inline struct ptdesc *pagetable_alloc(gfp_t gfp, unsigned int order)
> {
> struct page *page = alloc_pages(gfp | __GFP_COMP, order);
>
> return page_ptdesc(page);
> }
>
> Could you please comment on it? Or do you have any thought? Is it expected?
> Is a page-table ever shrink?

It's my understanding that the vunmap_range helpers don't actively
free page tables, they just clear PTEs. munmap does free them in
mmap.c:free_pgtables, maybe something could be worked up for vmalloc
too.
I would not be surprised if the memory increase you're seeing is more
or less correlated to the maximum vmalloc footprint throughout the
whole test.

--
Pedro

2024-02-23 09:38:01

by Uladzislau Rezki (Sony)

[permalink] [raw]
Subject: Re: [PATCH v3 00/11] Mitigate a vmap lock contention v3

On Thu, Feb 22, 2024 at 11:15:59PM +0000, Pedro Falcato wrote:
> Hi,
>
> On Thu, Feb 22, 2024 at 8:35 AM Uladzislau Rezki <[email protected]> wrote:
> >
> > Hello, Folk!
> >
> >[...]
> > pagetable_alloc - gets increased as soon as a higher pressure is applied by
> > increasing number of workers. Running same number of jobs on a next run
> > does not increase it and stays on same level as on previous.
> >
> > /**
> > * pagetable_alloc - Allocate pagetables
> > * @gfp: GFP flags
> > * @order: desired pagetable order
> > *
> > * pagetable_alloc allocates memory for page tables as well as a page table
> > * descriptor to describe that memory.
> > *
> > * Return: The ptdesc describing the allocated page tables.
> > */
> > static inline struct ptdesc *pagetable_alloc(gfp_t gfp, unsigned int order)
> > {
> > struct page *page = alloc_pages(gfp | __GFP_COMP, order);
> >
> > return page_ptdesc(page);
> > }
> >
> > Could you please comment on it? Or do you have any thought? Is it expected?
> > Is a page-table ever shrink?
>
> It's my understanding that the vunmap_range helpers don't actively
> free page tables, they just clear PTEs. munmap does free them in
> mmap.c:free_pgtables, maybe something could be worked up for vmalloc
> too.
>
Right. I see that for user space, pgtables are removed. There was
work on it.

>
> I would not be surprised if the memory increase you're seeing is more
> or less correlated to the maximum vmalloc footprint throughout the
> whole test.
>
Yes, the vmalloc footprint follows the memory usage. Some use cases
map a lot of memory.

Thanks for the input!

--
Uladzislau Rezki

2024-02-23 10:27:06

by Baoquan He

[permalink] [raw]
Subject: Re: [PATCH v3 00/11] Mitigate a vmap lock contention v3

On 02/23/24 at 10:34am, Uladzislau Rezki wrote:
> On Thu, Feb 22, 2024 at 11:15:59PM +0000, Pedro Falcato wrote:
> > Hi,
> >
> > On Thu, Feb 22, 2024 at 8:35 AM Uladzislau Rezki <[email protected]> wrote:
> > >
> > > Hello, Folk!
> > >
> > >[...]
> > > pagetable_alloc - gets increased as soon as a higher pressure is applied by
> > > increasing number of workers. Running same number of jobs on a next run
> > > does not increase it and stays on same level as on previous.
> > >
> > > /**
> > > * pagetable_alloc - Allocate pagetables
> > > * @gfp: GFP flags
> > > * @order: desired pagetable order
> > > *
> > > * pagetable_alloc allocates memory for page tables as well as a page table
> > > * descriptor to describe that memory.
> > > *
> > > * Return: The ptdesc describing the allocated page tables.
> > > */
> > > static inline struct ptdesc *pagetable_alloc(gfp_t gfp, unsigned int order)
> > > {
> > > struct page *page = alloc_pages(gfp | __GFP_COMP, order);
> > >
> > > return page_ptdesc(page);
> > > }
> > >
> > > Could you please comment on it? Or do you have any thought? Is it expected?
> > > Is a page-table ever shrink?
> >
> > It's my understanding that the vunmap_range helpers don't actively
> > free page tables, they just clear PTEs. munmap does free them in
> > mmap.c:free_pgtables, maybe something could be worked up for vmalloc
> > too.
> >
> Right. I see that for a user space, pgtables are removed. There was a
> work on it.
>
> >
> > I would not be surprised if the memory increase you're seeing is more
> > or less correlated to the maximum vmalloc footprint throughout the
> > whole test.
> >
> Yes, the vmalloc footprint follows the memory usage. Some uses cases
> map lot of memory.

The 'nr_threads=256' testing may be too radical. I ran the test on
a bare metal machine as below; it is still running and seems hung after
30 minutes. I did this right after system boot. I am looking for other
machines with more processors.

[root@dell-r640-068 ~]# nproc
64
[root@dell-r640-068 ~]# free -h
total used free shared buff/cache available
Mem: 187Gi 18Gi 169Gi 12Mi 262Mi 168Gi
Swap: 4.0Gi 0B 4.0Gi
[root@dell-r640-068 ~]#

[root@dell-r640-068 linux]# tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=256
Run the test with following parameters: run_test_mask=127 nr_threads=256


2024-02-23 11:35:51

by Uladzislau Rezki (Sony)

[permalink] [raw]
Subject: Re: [PATCH v3 00/11] Mitigate a vmap lock contention v3

> On 02/23/24 at 10:34am, Uladzislau Rezki wrote:
> > On Thu, Feb 22, 2024 at 11:15:59PM +0000, Pedro Falcato wrote:
> > > Hi,
> > >
> > > On Thu, Feb 22, 2024 at 8:35 AM Uladzislau Rezki <[email protected]> wrote:
> > > >
> > > > Hello, Folk!
> > > >
> > > >[...]
> > > > pagetable_alloc - gets increased as soon as a higher pressure is applied by
> > > > increasing number of workers. Running same number of jobs on a next run
> > > > does not increase it and stays on same level as on previous.
> > > >
> > > > /**
> > > > * pagetable_alloc - Allocate pagetables
> > > > * @gfp: GFP flags
> > > > * @order: desired pagetable order
> > > > *
> > > > * pagetable_alloc allocates memory for page tables as well as a page table
> > > > * descriptor to describe that memory.
> > > > *
> > > > * Return: The ptdesc describing the allocated page tables.
> > > > */
> > > > static inline struct ptdesc *pagetable_alloc(gfp_t gfp, unsigned int order)
> > > > {
> > > > struct page *page = alloc_pages(gfp | __GFP_COMP, order);
> > > >
> > > > return page_ptdesc(page);
> > > > }
> > > >
> > > > Could you please comment on it? Or do you have any thought? Is it expected?
> > > > Is a page-table ever shrink?
> > >
> > > It's my understanding that the vunmap_range helpers don't actively
> > > free page tables, they just clear PTEs. munmap does free them in
> > > mmap.c:free_pgtables, maybe something could be worked up for vmalloc
> > > too.
> > >
> > Right. I see that for a user space, pgtables are removed. There was a
> > work on it.
> >
> > >
> > > I would not be surprised if the memory increase you're seeing is more
> > > or less correlated to the maximum vmalloc footprint throughout the
> > > whole test.
> > >
> > Yes, the vmalloc footprint follows the memory usage. Some uses cases
> > map lot of memory.
>
> The 'nr_threads=256' testing may be too radical. I took the test on
> a bare metal machine as below, it's still running and hang there after
> 30 minutes. I did this after system boot. I am looking for other
> machines with more processors.
>
> [root@dell-r640-068 ~]# nproc
> 64
> [root@dell-r640-068 ~]# free -h
> total used free shared buff/cache available
> Mem: 187Gi 18Gi 169Gi 12Mi 262Mi 168Gi
> Swap: 4.0Gi 0B 4.0Gi
> [root@dell-r640-068 ~]#
>
> [root@dell-r640-068 linux]# tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=256
> Run the test with following parameters: run_test_mask=127 nr_threads=256
>
Agree, nr_threads=256 is way too radical :) Mine took 50 minutes to
complete. So wait a bit more :)


--
Uladzislau Rezki

2024-02-23 15:57:47

by Baoquan He

[permalink] [raw]
Subject: Re: [PATCH v3 00/11] Mitigate a vmap lock contention v3

On 02/23/24 at 12:06pm, Uladzislau Rezki wrote:
> > On 02/23/24 at 10:34am, Uladzislau Rezki wrote:
> > > On Thu, Feb 22, 2024 at 11:15:59PM +0000, Pedro Falcato wrote:
> > > > Hi,
> > > >
> > > > On Thu, Feb 22, 2024 at 8:35 AM Uladzislau Rezki <[email protected]> wrote:
> > > > >
> > > > > Hello, Folk!
> > > > >
> > > > >[...]
> > > > > pagetable_alloc - gets increased as soon as a higher pressure is applied by
> > > > > increasing number of workers. Running same number of jobs on a next run
> > > > > does not increase it and stays on same level as on previous.
> > > > >
> > > > > /**
> > > > > * pagetable_alloc - Allocate pagetables
> > > > > * @gfp: GFP flags
> > > > > * @order: desired pagetable order
> > > > > *
> > > > > * pagetable_alloc allocates memory for page tables as well as a page table
> > > > > * descriptor to describe that memory.
> > > > > *
> > > > > * Return: The ptdesc describing the allocated page tables.
> > > > > */
> > > > > static inline struct ptdesc *pagetable_alloc(gfp_t gfp, unsigned int order)
> > > > > {
> > > > > struct page *page = alloc_pages(gfp | __GFP_COMP, order);
> > > > >
> > > > > return page_ptdesc(page);
> > > > > }
> > > > >
> > > > > Could you please comment on it? Or do you have any thought? Is it expected?
> > > > > Is a page-table ever shrink?
> > > >
> > > > It's my understanding that the vunmap_range helpers don't actively
> > > > free page tables, they just clear PTEs. munmap does free them in
> > > > mmap.c:free_pgtables, maybe something could be worked up for vmalloc
> > > > too.
> > > >
> > > Right. I see that for a user space, pgtables are removed. There was a
> > > work on it.
> > >
> > > >
> > > > I would not be surprised if the memory increase you're seeing is more
> > > > or less correlated to the maximum vmalloc footprint throughout the
> > > > whole test.
> > > >
> > > Yes, the vmalloc footprint follows the memory usage. Some uses cases
> > > map lot of memory.
> >
> > The 'nr_threads=256' testing may be too radical. I took the test on
> > a bare metal machine as below, it's still running and hang there after
> > 30 minutes. I did this after system boot. I am looking for other
> > machines with more processors.
> >
> > [root@dell-r640-068 ~]# nproc
> > 64
> > [root@dell-r640-068 ~]# free -h
> > total used free shared buff/cache available
> > Mem: 187Gi 18Gi 169Gi 12Mi 262Mi 168Gi
> > Swap: 4.0Gi 0B 4.0Gi
> > [root@dell-r640-068 ~]#
> >
> > [root@dell-r640-068 linux]# tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=256
> > Run the test with following parameters: run_test_mask=127 nr_threads=256
> >
> Agree, nr_threads=256 is a way radical :) Mine took 50 minutes to
> complete. So wait more :)

Right, mine could take a similar time to finish. I got a machine
with 288 cpus; let's see if I can get some clues. When going through the code
flow, I suddenly realized it could be drain_vmap_area_work which is the
bottleneck and causes the tremendous page-table page cost.

On your system there are 64 cpus, so

nr_lazy_max = lazy_max_pages() = 7*32M = 224M;

So with nr_threads=128 or 256 it very easily reaches nr_lazy_max
and triggers drain_vmap_work(). When cpu resource is very limited, the
lazy vmap purging will be very slow, while the alloc/free in lib/test_vmalloc.c
goes far faster and more easily than the vmap reclaiming. If an old va is not
reused, a new va is allocated and the used space keeps extending, so new page
tables surely need to be created to cover it.
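
For reference, the 224M figure above follows from the lazy_max_pages()
heuristic; a minimal sketch, assuming the mainline implementation of 32M
worth of pages per log2 step of online CPUs:

static unsigned long lazy_max_pages(void)
{
	unsigned int log;

	/* fls(64) == 7 on a 64-CPU system. */
	log = fls(num_online_cpus());

	/* 7 * (32M / PAGE_SIZE) pages, i.e. ~224M worth of lazily freed VAs. */
	return log * (32UL * 1024 * 1024 / PAGE_SIZE);
}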

I will run the test on the system with 288 cpus and will update when testing
is done.


2024-02-23 18:55:30

by Uladzislau Rezki (Sony)

[permalink] [raw]
Subject: Re: [PATCH v3 00/11] Mitigate a vmap lock contention v3

On Fri, Feb 23, 2024 at 11:57:25PM +0800, Baoquan He wrote:
> On 02/23/24 at 12:06pm, Uladzislau Rezki wrote:
> > > On 02/23/24 at 10:34am, Uladzislau Rezki wrote:
> > > > On Thu, Feb 22, 2024 at 11:15:59PM +0000, Pedro Falcato wrote:
> > > > > Hi,
> > > > >
> > > > > On Thu, Feb 22, 2024 at 8:35 AM Uladzislau Rezki <[email protected]> wrote:
> > > > > >
> > > > > > Hello, Folk!
> > > > > >
> > > > > >[...]
> > > > > > pagetable_alloc - gets increased as soon as a higher pressure is applied by
> > > > > > increasing number of workers. Running same number of jobs on a next run
> > > > > > does not increase it and stays on same level as on previous.
> > > > > >
> > > > > > /**
> > > > > > * pagetable_alloc - Allocate pagetables
> > > > > > * @gfp: GFP flags
> > > > > > * @order: desired pagetable order
> > > > > > *
> > > > > > * pagetable_alloc allocates memory for page tables as well as a page table
> > > > > > * descriptor to describe that memory.
> > > > > > *
> > > > > > * Return: The ptdesc describing the allocated page tables.
> > > > > > */
> > > > > > static inline struct ptdesc *pagetable_alloc(gfp_t gfp, unsigned int order)
> > > > > > {
> > > > > > struct page *page = alloc_pages(gfp | __GFP_COMP, order);
> > > > > >
> > > > > > return page_ptdesc(page);
> > > > > > }
> > > > > >
> > > > > > Could you please comment on it? Or do you have any thought? Is it expected?
> > > > > > Is a page-table ever shrink?
> > > > >
> > > > > It's my understanding that the vunmap_range helpers don't actively
> > > > > free page tables, they just clear PTEs. munmap does free them in
> > > > > mmap.c:free_pgtables, maybe something could be worked up for vmalloc
> > > > > too.
> > > > >
> > > > Right. I see that for a user space, pgtables are removed. There was a
> > > > work on it.
> > > >
> > > > >
> > > > > I would not be surprised if the memory increase you're seeing is more
> > > > > or less correlated to the maximum vmalloc footprint throughout the
> > > > > whole test.
> > > > >
> > > > Yes, the vmalloc footprint follows the memory usage. Some uses cases
> > > > map lot of memory.
> > >
> > > The 'nr_threads=256' testing may be too radical. I took the test on
> > > a bare metal machine as below, it's still running and hang there after
> > > 30 minutes. I did this after system boot. I am looking for other
> > > machines with more processors.
> > >
> > > [root@dell-r640-068 ~]# nproc
> > > 64
> > > [root@dell-r640-068 ~]# free -h
> > > total used free shared buff/cache available
> > > Mem: 187Gi 18Gi 169Gi 12Mi 262Mi 168Gi
> > > Swap: 4.0Gi 0B 4.0Gi
> > > [root@dell-r640-068 ~]#
> > >
> > > [root@dell-r640-068 linux]# tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=256
> > > Run the test with following parameters: run_test_mask=127 nr_threads=256
> > >
> > Agree, nr_threads=256 is a way radical :) Mine took 50 minutes to
> > complete. So wait more :)
>
> Right, mine could take the similar time to finish that. I got a machine
> with 288 cpus, see if I can get some clues. When I go through the code
> flow, suddenly realized it could be drain_vmap_area_work which is the
> bottle neck and cause the tremendous page table pages costing.
>
> On your system, there's 64 cpus. then
>
> nr_lazy_max = lazy_max_pages() = 7*32M = 224M;
>
> So with nr_threads=128 or 256, it's so easily getting to the nr_lazy_max
> and triggering drain_vmap_work(). When cpu resouce is very limited, the
> lazy vmap purging will be very slow. While the alloc/free in lib/tet_vmalloc.c
> are going far faster and more easily then vmap reclaiming. If old va is not
> reused, new va is allocated and keep extending, the new page table surely
> need be created to cover them.
>
> I will take testing on the system with 288 cpus, will update if testing
> is done.
>
<snip>
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 12caa794abd4..a90c5393d85f 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1754,6 +1754,8 @@ size_to_va_pool(struct vmap_node *vn, unsigned long size)
return NULL;
}

+static unsigned long lazy_max_pages(void);
+
static bool
node_pool_add_va(struct vmap_node *n, struct vmap_area *va)
{
@@ -1763,6 +1765,9 @@ node_pool_add_va(struct vmap_node *n, struct vmap_area *va)
if (!vp)
return false;

+ if (READ_ONCE(vp->len) > lazy_max_pages())
+ return false;
+
spin_lock(&n->pool_lock);
list_add(&va->list, &vp->head);
WRITE_ONCE(vp->len, vp->len + 1);
@@ -2170,9 +2175,9 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end,
INIT_WORK(&vn->purge_work, purge_vmap_node);

if (cpumask_test_cpu(i, cpu_online_mask))
- schedule_work_on(i, &vn->purge_work);
+ queue_work_on(i, system_highpri_wq, &vn->purge_work);
else
- schedule_work(&vn->purge_work);
+ queue_work(system_highpri_wq, &vn->purge_work);

nr_purge_helpers--;
} else {
<snip>

We need this. It settles things back to normal PTE usage. Tomorrow i
will check whether the cache length should be limited. I tested on my 64-CPU
system with the radical 256 kworkers. It looks good.

--
Uladzislau Rezki

2024-02-28 09:27:28

by Baoquan He

[permalink] [raw]
Subject: Re: [PATCH v3 00/11] Mitigate a vmap lock contention v3

On 02/23/24 at 07:55pm, Uladzislau Rezki wrote:
> On Fri, Feb 23, 2024 at 11:57:25PM +0800, Baoquan He wrote:
> > On 02/23/24 at 12:06pm, Uladzislau Rezki wrote:
> > > > On 02/23/24 at 10:34am, Uladzislau Rezki wrote:
> > > > > On Thu, Feb 22, 2024 at 11:15:59PM +0000, Pedro Falcato wrote:
> > > > > > Hi,
> > > > > >
> > > > > > On Thu, Feb 22, 2024 at 8:35 AM Uladzislau Rezki <[email protected]> wrote:
> > > > > > >
> > > > > > > Hello, Folk!
> > > > > > >
> > > > > > >[...]
> > > > > > > pagetable_alloc - gets increased as soon as a higher pressure is applied by
> > > > > > > increasing number of workers. Running same number of jobs on a next run
> > > > > > > does not increase it and stays on same level as on previous.
> > > > > > >
> > > > > > > /**
> > > > > > > * pagetable_alloc - Allocate pagetables
> > > > > > > * @gfp: GFP flags
> > > > > > > * @order: desired pagetable order
> > > > > > > *
> > > > > > > * pagetable_alloc allocates memory for page tables as well as a page table
> > > > > > > * descriptor to describe that memory.
> > > > > > > *
> > > > > > > * Return: The ptdesc describing the allocated page tables.
> > > > > > > */
> > > > > > > static inline struct ptdesc *pagetable_alloc(gfp_t gfp, unsigned int order)
> > > > > > > {
> > > > > > > struct page *page = alloc_pages(gfp | __GFP_COMP, order);
> > > > > > >
> > > > > > > return page_ptdesc(page);
> > > > > > > }
> > > > > > >
> > > > > > > Could you please comment on it? Or do you have any thought? Is it expected?
> > > > > > > Is a page-table ever shrink?
> > > > > >
> > > > > > It's my understanding that the vunmap_range helpers don't actively
> > > > > > free page tables, they just clear PTEs. munmap does free them in
> > > > > > mmap.c:free_pgtables, maybe something could be worked up for vmalloc
> > > > > > too.
> > > > > >
> > > > > Right. I see that for a user space, pgtables are removed. There was a
> > > > > work on it.
> > > > >
> > > > > >
> > > > > > I would not be surprised if the memory increase you're seeing is more
> > > > > > or less correlated to the maximum vmalloc footprint throughout the
> > > > > > whole test.
> > > > > >
> > > > > Yes, the vmalloc footprint follows the memory usage. Some uses cases
> > > > > map lot of memory.
> > > >
> > > > The 'nr_threads=256' testing may be too radical. I took the test on
> > > > a bare metal machine as below, it's still running and hang there after
> > > > 30 minutes. I did this after system boot. I am looking for other
> > > > machines with more processors.
> > > >
> > > > [root@dell-r640-068 ~]# nproc
> > > > 64
> > > > [root@dell-r640-068 ~]# free -h
> > > > total used free shared buff/cache available
> > > > Mem: 187Gi 18Gi 169Gi 12Mi 262Mi 168Gi
> > > > Swap: 4.0Gi 0B 4.0Gi
> > > > [root@dell-r640-068 ~]#
> > > >
> > > > [root@dell-r640-068 linux]# tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=256
> > > > Run the test with following parameters: run_test_mask=127 nr_threads=256
> > > >
> > > Agree, nr_threads=256 is a way radical :) Mine took 50 minutes to
> > > complete. So wait more :)
> >
> > Right, mine could take the similar time to finish that. I got a machine
> > with 288 cpus, see if I can get some clues. When I go through the code
> > flow, suddenly realized it could be drain_vmap_area_work which is the
> > bottle neck and cause the tremendous page table pages costing.
> >
> > On your system, there's 64 cpus. then
> >
> > nr_lazy_max = lazy_max_pages() = 7*32M = 224M;
> >
> > So with nr_threads=128 or 256, it's so easily getting to the nr_lazy_max
> > and triggering drain_vmap_work(). When cpu resouce is very limited, the
> > lazy vmap purging will be very slow. While the alloc/free in lib/tet_vmalloc.c
> > are going far faster and more easily then vmap reclaiming. If old va is not
> > reused, new va is allocated and keep extending, the new page table surely
> > need be created to cover them.
> >
> > I will take testing on the system with 288 cpus, will update if testing
> > is done.
> >
> <snip>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 12caa794abd4..a90c5393d85f 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1754,6 +1754,8 @@ size_to_va_pool(struct vmap_node *vn, unsigned long size)
> return NULL;
> }
>
> +static unsigned long lazy_max_pages(void);
> +
> static bool
> node_pool_add_va(struct vmap_node *n, struct vmap_area *va)
> {
> @@ -1763,6 +1765,9 @@ node_pool_add_va(struct vmap_node *n, struct vmap_area *va)
> if (!vp)
> return false;
>
> + if (READ_ONCE(vp->len) > lazy_max_pages())
> + return false;
> +
> spin_lock(&n->pool_lock);
> list_add(&va->list, &vp->head);
> WRITE_ONCE(vp->len, vp->len + 1);
> @@ -2170,9 +2175,9 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end,
> INIT_WORK(&vn->purge_work, purge_vmap_node);
>
> if (cpumask_test_cpu(i, cpu_online_mask))
> - schedule_work_on(i, &vn->purge_work);
> + queue_work_on(i, system_highpri_wq, &vn->purge_work);
> else
> - schedule_work(&vn->purge_work);
> + queue_work(system_highpri_wq, &vn->purge_work);
>
> nr_purge_helpers--;
> } else {
> <snip>
>
> We need this. This settles it back to a normal PTE-usage. Tomorrow i
> will check if cache-len should be limited. I tested on my 64 CPUs
> system with radical 256 kworkers. It looks good.

I finally finished the testing w/o and with your above improvement
patch. Testing was done on a system with 128 cpus. The system with 288
cpus is not available because of a console connection issue. The log is
attached here. In some tests after rebooting I found it could take more than 30
minutes; I am not sure if that was caused by my messy code changes. I finally
cleaned them all up, took a clean linux-next to test, and then applied
your above draft code.


Attachments:
vmalloc_node.log (7.70 kB)

2024-02-28 10:39:46

by Uladzislau Rezki (Sony)

[permalink] [raw]
Subject: Re: [PATCH v3 07/11] mm: vmalloc: Offload free_vmap_area_lock lock

On Wed, Feb 28, 2024 at 05:48:53PM +0800, Baoquan He wrote:
> On 01/02/24 at 07:46pm, Uladzislau Rezki (Sony) wrote:
> .....snip...
> > +static void
> > +decay_va_pool_node(struct vmap_node *vn, bool full_decay)
> > +{
> > + struct vmap_area *va, *nva;
> > + struct list_head decay_list;
> > + struct rb_root decay_root;
> > + unsigned long n_decay;
> > + int i;
> > +
> > + decay_root = RB_ROOT;
> > + INIT_LIST_HEAD(&decay_list);
> > +
> > + for (i = 0; i < MAX_VA_SIZE_PAGES; i++) {
> > + struct list_head tmp_list;
> > +
> > + if (list_empty(&vn->pool[i].head))
> > + continue;
> > +
> > + INIT_LIST_HEAD(&tmp_list);
> > +
> > + /* Detach the pool, so no-one can access it. */
> > + spin_lock(&vn->pool_lock);
> > + list_replace_init(&vn->pool[i].head, &tmp_list);
> > + spin_unlock(&vn->pool_lock);
> > +
> > + if (full_decay)
> > + WRITE_ONCE(vn->pool[i].len, 0);
> > +
> > + /* Decay a pool by ~25% out of left objects. */
>
> This isn't true if the pool has less than 4 objects. If there are 3
> objects, n_decay = 0.
>
This is expectable.

> > + n_decay = vn->pool[i].len >> 2;
> > +
> > + list_for_each_entry_safe(va, nva, &tmp_list, list) {
> > + list_del_init(&va->list);
> > + merge_or_add_vmap_area(va, &decay_root, &decay_list);
> > +
> > + if (!full_decay) {
> > + WRITE_ONCE(vn->pool[i].len, vn->pool[i].len - 1);
> > +
> > + if (!--n_decay)
> > + break;
>
> Here, --n_decay will make n_decay 0xffffffffffffffff,
> then all left objects are reclaimed.
Right. Last three objects do not play a big game.
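
To spell the corner case out with concrete numbers (a sketch of the values
only, following the quoted decay logic above):

/*
 * With vn->pool[i].len == 3:
 *
 *   n_decay = 3 >> 2;   // n_decay == 0
 *   ...
 *   if (!--n_decay)     // 0 - 1 wraps to ULONG_MAX, so the check is false
 *           break;      // never taken -> all three VAs are decayed
 *
 * i.e. pools holding fewer than four objects are drained completely,
 * which is the "last three objects" behaviour referred to above.
 */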

--
Uladzislau Rezki

2024-02-28 12:38:31

by Baoquan He

[permalink] [raw]
Subject: Re: [PATCH v3 07/11] mm: vmalloc: Offload free_vmap_area_lock lock

On 02/28/24 at 11:39am, Uladzislau Rezki wrote:
> On Wed, Feb 28, 2024 at 05:48:53PM +0800, Baoquan He wrote:
> > On 01/02/24 at 07:46pm, Uladzislau Rezki (Sony) wrote:
> > .....snip...
> > > +static void
> > > +decay_va_pool_node(struct vmap_node *vn, bool full_decay)
> > > +{
> > > + struct vmap_area *va, *nva;
> > > + struct list_head decay_list;
> > > + struct rb_root decay_root;
> > > + unsigned long n_decay;
> > > + int i;
> > > +
> > > + decay_root = RB_ROOT;
> > > + INIT_LIST_HEAD(&decay_list);
> > > +
> > > + for (i = 0; i < MAX_VA_SIZE_PAGES; i++) {
> > > + struct list_head tmp_list;
> > > +
> > > + if (list_empty(&vn->pool[i].head))
> > > + continue;
> > > +
> > > + INIT_LIST_HEAD(&tmp_list);
> > > +
> > > + /* Detach the pool, so no-one can access it. */
> > > + spin_lock(&vn->pool_lock);
> > > + list_replace_init(&vn->pool[i].head, &tmp_list);
> > > + spin_unlock(&vn->pool_lock);
> > > +
> > > + if (full_decay)
> > > + WRITE_ONCE(vn->pool[i].len, 0);
> > > +
> > > + /* Decay a pool by ~25% out of left objects. */
> >
> > This isn't true if the pool has less than 4 objects. If there are 3
> > objects, n_decay = 0.
> >
> This is expected.


>
> > > + n_decay = vn->pool[i].len >> 2;
> > > +
> > > + list_for_each_entry_safe(va, nva, &tmp_list, list) {
> > > + list_del_init(&va->list);
> > > + merge_or_add_vmap_area(va, &decay_root, &decay_list);
> > > +
> > > + if (!full_decay) {
> > > + WRITE_ONCE(vn->pool[i].len, vn->pool[i].len - 1);
> > > +
> > > + if (!--n_decay)
> > > + break;
> >
> > Here, when n_decay is 0, --n_decay wraps it around to 0xffffffffffffffff,
> > so all remaining objects are reclaimed.
> Right. The last three objects do not matter much.

See it now, thanks.


2024-02-29 10:38:24

by Uladzislau Rezki (Sony)

[permalink] [raw]
Subject: Re: [PATCH v3 00/11] Mitigate a vmap lock contention v3

>
> I finally finished the testing without and with your improvement patch
> above. Testing was done on a system with 128 CPUs; the system with 288
> CPUs is not available because of a console connection issue. The log is
> attached. In some test runs after rebooting, booting could take more than
> 30 minutes, and I am not sure whether that was caused by my messy code
> changes. I finally cleaned them all up, took a clean linux-next tree to
> test, and then applied your draft code above.

> [root@dell-per6515-03 linux]# nproc
> 128
> [root@dell-per6515-03 linux]# free -h
>                total        used        free      shared  buff/cache   available
> Mem:           124Gi       2.6Gi       122Gi        21Mi       402Mi       122Gi
> Swap:          4.0Gi          0B       4.0Gi
>
> 1)linux-next kernel w/o improving code from Uladzislau
> -------------------------------------------------------
> [root@dell-per6515-03 linux]# time tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=64
> Run the test with following parameters: run_test_mask=127 nr_threads=64
> Done.
> Check the kernel ring buffer to see the summary.
>
> real 4m28.018s
> user 0m0.015s
> sys 0m4.712s
> [root@dell-per6515-03 ~]# sort -h /proc/allocinfo | tail -10
> 21405696 5226 mm/memory.c:1122 func:folio_prealloc
> 26199936 7980 kernel/fork.c:309 func:alloc_thread_stack_node
> 29822976 7281 mm/readahead.c:247 func:page_cache_ra_unbounded
> 99090432 96768 drivers/iommu/iova.c:604 func:iova_magazine_alloc
> 107638784 6320 mm/readahead.c:468 func:ra_alloc_folio
> 120560528 29439 mm/mm_init.c:2521 func:alloc_large_system_hash
> 134742016 32896 mm/percpu-vm.c:95 func:pcpu_alloc_pages
> 263192576 64256 mm/page_ext.c:270 func:alloc_page_ext
> 266797056 65136 include/linux/mm.h:2848 func:pagetable_alloc
> 507617280 32796 mm/slub.c:2305 func:alloc_slab_page
> [root@dell-per6515-03 ~]#
> [root@dell-per6515-03 ~]#
> [root@dell-per6515-03 linux]# time tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=128
> Run the test with following parameters: run_test_mask=127 nr_threads=128
> Done.
> Check the kernel ring buffer to see the summary.
>
> real 6m19.328s
> user 0m0.005s
> sys 0m9.476s
> [root@dell-per6515-03 ~]# sort -h /proc/allocinfo | tail -10
> 21405696 5226 mm/memory.c:1122 func:folio_prealloc
> 26889408 8190 kernel/fork.c:309 func:alloc_thread_stack_node
> 29822976 7281 mm/readahead.c:247 func:page_cache_ra_unbounded
> 99090432 96768 drivers/iommu/iova.c:604 func:iova_magazine_alloc
> 107638784 6320 mm/readahead.c:468 func:ra_alloc_folio
> 120560528 29439 mm/mm_init.c:2521 func:alloc_large_system_hash
> 134742016 32896 mm/percpu-vm.c:95 func:pcpu_alloc_pages
> 263192576 64256 mm/page_ext.c:270 func:alloc_page_ext
> 550068224 34086 mm/slub.c:2305 func:alloc_slab_page
> 664535040 162240 include/linux/mm.h:2848 func:pagetable_alloc
> [root@dell-per6515-03 ~]#
> [root@dell-per6515-03 ~]#
> [root@dell-per6515-03 linux]# time tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=256
> Run the test with following parameters: run_test_mask=127 nr_threads=256
> Done.
> Check the kernel ring buffer to see the summary.
>
> real 19m10.657s
> user 0m0.015s
> sys 0m20.959s
> [root@dell-per6515-03 ~]# sort -h /proc/allocinfo | tail -10
> 22441984 5479 mm/shmem.c:1634 func:shmem_alloc_folio
> 26758080 8150 kernel/fork.c:309 func:alloc_thread_stack_node
> 35880960 8760 mm/readahead.c:247 func:page_cache_ra_unbounded
> 99090432 96768 drivers/iommu/iova.c:604 func:iova_magazine_alloc
> 120560528 29439 mm/mm_init.c:2521 func:alloc_large_system_hash
> 122355712 7852 mm/readahead.c:468 func:ra_alloc_folio
> 134742016 32896 mm/percpu-vm.c:95 func:pcpu_alloc_pages
> 263192576 64256 mm/page_ext.c:270 func:alloc_page_ext
> 708231168 50309 mm/slub.c:2305 func:alloc_slab_page
> 1107296256 270336 include/linux/mm.h:2848 func:pagetable_alloc
> [root@dell-per6515-03 ~]#
>
> 2)linux-next kernel with improving code from Uladzislau
> -----------------------------------------------------
> [root@dell-per6515-03 linux]# time tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=64
> Run the test with following parameters: run_test_mask=127 nr_threads=64
> Done.
> Check the kernel ring buffer to see the summary.
>
> real 4m27.226s
> user 0m0.006s
> sys 0m4.709s
> [root@dell-per6515-03 linux]# sort -h /proc/allocinfo | tail -10
> 38023168 9283 mm/readahead.c:247 func:page_cache_ra_unbounded
> 72228864 17634 fs/xfs/xfs_buf.c:390 [xfs] func:xfs_buf_alloc_pages
> 99090432 96768 drivers/iommu/iova.c:604 func:iova_magazine_alloc
> 99863552 97523 fs/xfs/xfs_icache.c:81 [xfs] func:xfs_inode_alloc
> 120560528 29439 mm/mm_init.c:2521 func:alloc_large_system_hash
> 136314880 33280 mm/percpu-vm.c:95 func:pcpu_alloc_pages
> 184176640 10684 mm/readahead.c:468 func:ra_alloc_folio
> 263192576 64256 mm/page_ext.c:270 func:alloc_page_ext
> 284700672 69507 include/linux/mm.h:2848 func:pagetable_alloc
> 601427968 36377 mm/slub.c:2305 func:alloc_slab_page
> [root@dell-per6515-03 linux]# time tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=128
> Run the test with following parameters: run_test_mask=127 nr_threads=128
> Done.
> Check the kernel ring buffer to see the summary.
>
> real 6m16.960s
> user 0m0.007s
> sys 0m9.465s
> [root@dell-per6515-03 linux]# sort -h /proc/allocinfo | tail -10
> 38158336 9316 mm/readahead.c:247 func:page_cache_ra_unbounded
> 72220672 17632 fs/xfs/xfs_buf.c:390 [xfs] func:xfs_buf_alloc_pages
> 99090432 96768 drivers/iommu/iova.c:604 func:iova_magazine_alloc
> 99863552 97523 fs/xfs/xfs_icache.c:81 [xfs] func:xfs_inode_alloc
> 120560528 29439 mm/mm_init.c:2521 func:alloc_large_system_hash
> 136314880 33280 mm/percpu-vm.c:95 func:pcpu_alloc_pages
> 184504320 10710 mm/readahead.c:468 func:ra_alloc_folio
> 263192576 64256 mm/page_ext.c:270 func:alloc_page_ext
> 427884544 104464 include/linux/mm.h:2848 func:pagetable_alloc
> 697311232 45159 mm/slub.c:2305 func:alloc_slab_page
> [root@dell-per6515-03 linux]# time tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=256
> Run the test with following parameters: run_test_mask=127 nr_threads=256
> Done.
> Check the kernel ring buffer to see the summary.
>
> real 21m15.673s
> user 0m0.008s
> sys 0m20.259s
> [root@dell-per6515-03 linux]# sort -h /proc/allocinfo | tail -10
> 38158336 9316 mm/readahead.c:247 func:page_cache_ra_unbounded
> 72224768 17633 fs/xfs/xfs_buf.c:390 [xfs] func:xfs_buf_alloc_pages
> 99090432 96768 drivers/iommu/iova.c:604 func:iova_magazine_alloc
> 99863552 97523 fs/xfs/xfs_icache.c:81 [xfs] func:xfs_inode_alloc
> 120560528 29439 mm/mm_init.c:2521 func:alloc_large_system_hash
> 136314880 33280 mm/percpu-vm.c:95 func:pcpu_alloc_pages
> 184504320 10710 mm/readahead.c:468 func:ra_alloc_folio
> 263192576 64256 mm/page_ext.c:270 func:alloc_page_ext
> 506974208 123773 include/linux/mm.h:2848 func:pagetable_alloc
> 809504768 53621 mm/slub.c:2305 func:alloc_slab_page
> [root@dell-per6515-03 linux]# time tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=256
> Run the test with following parameters: run_test_mask=127 nr_threads=256
> Done.
> Check the kernel ring buffer to see the summary.
>
> real 21m36.580s
> user 0m0.012s
> sys 0m19.912s
> [root@dell-per6515-03 linux]# sort -h /proc/allocinfo | tail -10
> 38977536 9516 mm/readahead.c:247 func:page_cache_ra_unbounded
> 72273920 17645 fs/xfs/xfs_buf.c:390 [xfs] func:xfs_buf_alloc_pages
> 99090432 96768 drivers/iommu/iova.c:604 func:iova_magazine_alloc
> 99895296 97554 fs/xfs/xfs_icache.c:81 [xfs] func:xfs_inode_alloc
> 120560528 29439 mm/mm_init.c:2521 func:alloc_large_system_hash
> 141033472 34432 mm/percpu-vm.c:95 func:pcpu_alloc_pages
> 186064896 10841 mm/readahead.c:468 func:ra_alloc_folio
> 263192576 64256 mm/page_ext.c:270 func:alloc_page_ext
> 541237248 132138 include/linux/mm.h:2848 func:pagetable_alloc
> 694718464 41216 mm/slub.c:2305 func:alloc_slab_page
>
>
Thank you for testing this. So ~132mb with the patch. I think it looks
good, but I might change the draft version and send out a new version.

Thank you again!

--
Uladzislau Rezki

2024-02-28 09:49:14

by Baoquan He

[permalink] [raw]
Subject: Re: [PATCH v3 07/11] mm: vmalloc: Offload free_vmap_area_lock lock

On 01/02/24 at 07:46pm, Uladzislau Rezki (Sony) wrote:
....snip...
> +static void
> +decay_va_pool_node(struct vmap_node *vn, bool full_decay)
> +{
> + struct vmap_area *va, *nva;
> + struct list_head decay_list;
> + struct rb_root decay_root;
> + unsigned long n_decay;
> + int i;
> +
> + decay_root = RB_ROOT;
> + INIT_LIST_HEAD(&decay_list);
> +
> + for (i = 0; i < MAX_VA_SIZE_PAGES; i++) {
> + struct list_head tmp_list;
> +
> + if (list_empty(&vn->pool[i].head))
> + continue;
> +
> + INIT_LIST_HEAD(&tmp_list);
> +
> + /* Detach the pool, so no-one can access it. */
> + spin_lock(&vn->pool_lock);
> + list_replace_init(&vn->pool[i].head, &tmp_list);
> + spin_unlock(&vn->pool_lock);
> +
> + if (full_decay)
> + WRITE_ONCE(vn->pool[i].len, 0);
> +
> + /* Decay a pool by ~25% out of left objects. */

This isn't true if the pool has less than 4 objects. If there are 3
objects, n_decay = 0.

> + n_decay = vn->pool[i].len >> 2;
> +
> + list_for_each_entry_safe(va, nva, &tmp_list, list) {
> + list_del_init(&va->list);
> + merge_or_add_vmap_area(va, &decay_root, &decay_list);
> +
> + if (!full_decay) {
> + WRITE_ONCE(vn->pool[i].len, vn->pool[i].len - 1);
> +
> + if (!--n_decay)
> + break;

Here, when n_decay is 0, --n_decay wraps it around to 0xffffffffffffffff,
so all remaining objects are reclaimed.
> + }
> + }
> +
> + /* Attach the pool back if it has been partly decayed. */
> + if (!full_decay && !list_empty(&tmp_list)) {
> + spin_lock(&vn->pool_lock);
> + list_replace_init(&tmp_list, &vn->pool[i].head);
> + spin_unlock(&vn->pool_lock);
> + }
> + }
> +
> + reclaim_list_global(&decay_list);
> +}
.....snip


2024-03-22 18:21:16

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH v3 07/11] mm: vmalloc: Offload free_vmap_area_lock lock

Hi,

On Tue, Jan 02, 2024 at 07:46:29PM +0100, Uladzislau Rezki (Sony) wrote:
> Concurrent access to a global vmap space is a bottleneck.
> We can simulate high contention by running a vmalloc test
> suite.
>
> To address it, introduce an effective vmap node logic. Each
> node behaves as an independent entity. When a node is accessed,
> it serves a request directly (if possible) from its pool.
>
> This model has a size-based pool for requests, i.e. pools are
> serialized and populated based on object size and real demand.
> The maximum object size that a pool can handle is set to 256 pages.
>
> This technique reduces the pressure on the global vmap lock.
>
> Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>

This patch results in a persistent "spinlock bad magic" message
when booting s390 images with spinlock debugging enabled.

[ 0.465445] BUG: spinlock bad magic on CPU#0, swapper/0
[ 0.465490] lock: single+0x1860/0x1958, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
[ 0.466067] CPU: 0 PID: 0 Comm: swapper Not tainted 6.8.0-12955-g8e938e398669 #1
[ 0.466188] Hardware name: QEMU 8561 QEMU (KVM/Linux)
[ 0.466270] Call Trace:
[ 0.466470] [<00000000011f26c8>] dump_stack_lvl+0x98/0xd8
[ 0.466516] [<00000000001dcc6a>] do_raw_spin_lock+0x8a/0x108
[ 0.466545] [<000000000042146c>] find_vmap_area+0x6c/0x108
[ 0.466572] [<000000000042175a>] find_vm_area+0x22/0x40
[ 0.466597] [<000000000012f152>] __set_memory+0x132/0x150
[ 0.466624] [<0000000001cc0398>] vmem_map_init+0x40/0x118
[ 0.466651] [<0000000001cc0092>] paging_init+0x22/0x68
[ 0.466677] [<0000000001cbbed2>] setup_arch+0x52a/0x708
[ 0.466702] [<0000000001cb6140>] start_kernel+0x80/0x5c8
[ 0.466727] [<0000000000100036>] startup_continue+0x36/0x40

Bisect results and decoded stacktrace below.

The uninitialized spinlock is &vn->busy.lock.
Debugging shows that this lock is actually never initialized.

[ 0.464684] ####### locking 0000000002280fb8
[ 0.464862] BUG: spinlock bad magic on CPU#0, swapper/0
..
[ 0.464684] ####### locking 0000000002280fb8
[ 0.477479] ####### locking 0000000002280fb8
[ 0.478166] ####### locking 0000000002280fb8
[ 0.478218] ####### locking 0000000002280fb8
..
[ 0.718250] #### busy lock init 0000000002871860
[ 0.718328] #### busy lock init 00000000028731b8

Only the initialized locks are used after the call to vmap_init_nodes().
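
(The "####### locking" and "#### busy lock init" markers above come from
ad-hoc printk instrumentation, roughly of the following shape; this is a
reconstruction, not the actual debug diff.)

<snip>
/* In find_vmap_area(), just before taking the per-node busy lock: */
pr_info("####### locking %px\n", &vn->busy.lock);
spin_lock(&vn->busy.lock);

/* In vmap_init_nodes(), right after each lock is set up: */
spin_lock_init(&vn->busy.lock);
pr_info("#### busy lock init %px\n", &vn->busy.lock);
<snip>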

Guenter

---
# bad: [8e938e39866920ddc266898e6ae1fffc5c8f51aa] Merge tag '6.9-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6
# good: [e8f897f4afef0031fe618a8e94127a0934896aba] Linux 6.8
git bisect start 'HEAD' 'v6.8'
# good: [e56bc745fa1de77abc2ad8debc4b1b83e0426c49] smb311: additional compression flag defined in updated protocol spec
git bisect good e56bc745fa1de77abc2ad8debc4b1b83e0426c49
# bad: [902861e34c401696ed9ad17a54c8790e7e8e3069] Merge tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
git bisect bad 902861e34c401696ed9ad17a54c8790e7e8e3069
# good: [480e035fc4c714fb5536e64ab9db04fedc89e910] Merge tag 'drm-next-2024-03-13' of https://gitlab.freedesktop.org/drm/kernel
git bisect good 480e035fc4c714fb5536e64ab9db04fedc89e910
# good: [fe46a7dd189e25604716c03576d05ac8a5209743] Merge tag 'sound-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
git bisect good fe46a7dd189e25604716c03576d05ac8a5209743
# bad: [435a75548109f19e5b5b14ae35b9acb063c084e9] mm: use folio more widely in __split_huge_page
git bisect bad 435a75548109f19e5b5b14ae35b9acb063c084e9
# good: [4d5bf0b6183f79ea361dd506365d2a471270735c] mm/mmu_gather: add tlb_remove_tlb_entries()
git bisect good 4d5bf0b6183f79ea361dd506365d2a471270735c
# bad: [4daacfe8f99f4b4cef562649d56c48642981f46e] mm/damon/sysfs-schemes: support PSI-based quota auto-tune
git bisect bad 4daacfe8f99f4b4cef562649d56c48642981f46e
# good: [217b2119b9e260609958db413876f211038f00ee] mm,page_owner: implement the tracking of the stacks count
git bisect good 217b2119b9e260609958db413876f211038f00ee
# bad: [40254101d87870b2e5ac3ddc28af40aa04c48486] arm64, crash: wrap crash dumping code into crash related ifdefs
git bisect bad 40254101d87870b2e5ac3ddc28af40aa04c48486
# bad: [53becf32aec1c8049b854f0c31a11df5ed75df6f] mm: vmalloc: support multiple nodes in vread_iter
git bisect bad 53becf32aec1c8049b854f0c31a11df5ed75df6f
# good: [7fa8cee003166ef6db0bba70d610dbf173543811] mm: vmalloc: move vmap_init_free_space() down in vmalloc.c
git bisect good 7fa8cee003166ef6db0bba70d610dbf173543811
# good: [282631cb2447318e2a55b41a665dbe8571c46d70] mm: vmalloc: remove global purge_vmap_area_root rb-tree
git bisect good 282631cb2447318e2a55b41a665dbe8571c46d70
# bad: [96aa8437d169b8e030a98e2b74fd9a8ee9d3be7e] mm: vmalloc: add a scan area of VA only once
git bisect bad 96aa8437d169b8e030a98e2b74fd9a8ee9d3be7e
# bad: [72210662c5a2b6005f6daea7fe293a0dc573e1a5] mm: vmalloc: offload free_vmap_area_lock lock
git bisect bad 72210662c5a2b6005f6daea7fe293a0dc573e1a5
# first bad commit: [72210662c5a2b6005f6daea7fe293a0dc573e1a5] mm: vmalloc: offload free_vmap_area_lock lock

---
[ 0.465490] lock: single+0x1860/0x1958, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
[ 0.466067] CPU: 0 PID: 0 Comm: swapper Not tainted 6.8.0-12955-g8e938e398669 #1
[ 0.466188] Hardware name: QEMU 8561 QEMU (KVM/Linux)
[ 0.466270] Call Trace:
[ 0.466470] dump_stack_lvl (lib/dump_stack.c:117)
[ 0.466516] do_raw_spin_lock (kernel/locking/spinlock_debug.c:87 kernel/locking/spinlock_debug.c:115)
[ 0.466545] find_vmap_area (mm/vmalloc.c:1059 mm/vmalloc.c:2364)
[ 0.466572] find_vm_area (mm/vmalloc.c:3150)
[ 0.466597] __set_memory (arch/s390/mm/pageattr.c:360 arch/s390/mm/pageattr.c:393)
[ 0.466624] vmem_map_init (./arch/s390/include/asm/set_memory.h:55 arch/s390/mm/vmem.c:660)
[ 0.466651] paging_init (arch/s390/mm/init.c:97)
[ 0.466677] setup_arch (arch/s390/kernel/setup.c:972)
[ 0.466702] start_kernel (init/main.c:899)
[ 0.466727] startup_continue (arch/s390/kernel/head64.S:35)
[ 0.466811] INFO: lockdep is turned off.


2024-03-22 19:03:22

by Uladzislau Rezki (Sony)

[permalink] [raw]
Subject: Re: [PATCH v3 07/11] mm: vmalloc: Offload free_vmap_area_lock lock

On Fri, Mar 22, 2024 at 11:21:02AM -0700, Guenter Roeck wrote:
> Hi,
>
> On Tue, Jan 02, 2024 at 07:46:29PM +0100, Uladzislau Rezki (Sony) wrote:
> > Concurrent access to a global vmap space is a bottleneck.
> > We can simulate high contention by running a vmalloc test
> > suite.
> >
> > To address it, introduce an effective vmap node logic. Each
> > node behaves as an independent entity. When a node is accessed,
> > it serves a request directly (if possible) from its pool.
> >
> > This model has a size-based pool for requests, i.e. pools are
> > serialized and populated based on object size and real demand.
> > The maximum object size that a pool can handle is set to 256 pages.
> >
> > This technique reduces the pressure on the global vmap lock.
> >
> > Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
>
> This patch results in a persistent "spinlock bad magic" message
> when booting s390 images with spinlock debugging enabled.
>
> [ 0.465445] BUG: spinlock bad magic on CPU#0, swapper/0
> [ 0.465490] lock: single+0x1860/0x1958, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
> [ 0.466067] CPU: 0 PID: 0 Comm: swapper Not tainted 6.8.0-12955-g8e938e398669 #1
> [ 0.466188] Hardware name: QEMU 8561 QEMU (KVM/Linux)
> [ 0.466270] Call Trace:
> [ 0.466470] [<00000000011f26c8>] dump_stack_lvl+0x98/0xd8
> [ 0.466516] [<00000000001dcc6a>] do_raw_spin_lock+0x8a/0x108
> [ 0.466545] [<000000000042146c>] find_vmap_area+0x6c/0x108
> [ 0.466572] [<000000000042175a>] find_vm_area+0x22/0x40
> [ 0.466597] [<000000000012f152>] __set_memory+0x132/0x150
> [ 0.466624] [<0000000001cc0398>] vmem_map_init+0x40/0x118
> [ 0.466651] [<0000000001cc0092>] paging_init+0x22/0x68
> [ 0.466677] [<0000000001cbbed2>] setup_arch+0x52a/0x708
> [ 0.466702] [<0000000001cb6140>] start_kernel+0x80/0x5c8
> [ 0.466727] [<0000000000100036>] startup_continue+0x36/0x40
>
> Bisect results and decoded stacktrace below.
>
> The uninitialized spinlock is &vn->busy.lock.
> Debugging shows that this lock is actually never initialized.
>
It is, once the vmalloc_init() "main entry" function is called from:

<snip>
start_kernel()
mm_core_init()
vmalloc_init()
<snip>

> [ 0.464684] ####### locking 0000000002280fb8
> [ 0.464862] BUG: spinlock bad magic on CPU#0, swapper/0
> ...
> [ 0.464684] ####### locking 0000000002280fb8
> [ 0.477479] ####### locking 0000000002280fb8
> [ 0.478166] ####### locking 0000000002280fb8
> [ 0.478218] ####### locking 0000000002280fb8
> ...
> [ 0.718250] #### busy lock init 0000000002871860
> [ 0.718328] #### busy lock init 00000000028731b8
>
> Only the initialized locks are used after the call to vmap_init_nodes().
>
Right, once the vmap space and vmalloc are initialized.
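
For context, the per-node locks only get their magic set once
vmap_init_nodes() runs from vmalloc_init(). Roughly, the per-node setup looks
like this (a fragment assumed to live in mm/vmalloc.c; the array, field and
helper names are assumptions based on this series, not a verbatim copy):

<snip>
static void vmap_init_nodes(void)
{
	int i;

	for (i = 0; i < nr_vmap_nodes; i++) {
		struct vmap_node *vn = &vmap_nodes[i];

		/* The lock the s390 splat complains about. */
		spin_lock_init(&vn->busy.lock);
		vn->busy.root = RB_ROOT;
		INIT_LIST_HEAD(&vn->busy.head);

		spin_lock_init(&vn->lazy.lock);
		vn->lazy.root = RB_ROOT;
		INIT_LIST_HEAD(&vn->lazy.head);

		spin_lock_init(&vn->pool_lock);
	}
}
<snip>

Any find_vmap_area() call issued before this point, as in the early s390
__set_memory() path, therefore touches a spinlock with zero magic.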

> Guenter
>
> ---
> # bad: [8e938e39866920ddc266898e6ae1fffc5c8f51aa] Merge tag '6.9-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6
> # good: [e8f897f4afef0031fe618a8e94127a0934896aba] Linux 6.8
> git bisect start 'HEAD' 'v6.8'
> # good: [e56bc745fa1de77abc2ad8debc4b1b83e0426c49] smb311: additional compression flag defined in updated protocol spec
> git bisect good e56bc745fa1de77abc2ad8debc4b1b83e0426c49
> # bad: [902861e34c401696ed9ad17a54c8790e7e8e3069] Merge tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> git bisect bad 902861e34c401696ed9ad17a54c8790e7e8e3069
> # good: [480e035fc4c714fb5536e64ab9db04fedc89e910] Merge tag 'drm-next-2024-03-13' of https://gitlab.freedesktop.org/drm/kernel
> git bisect good 480e035fc4c714fb5536e64ab9db04fedc89e910
> # good: [fe46a7dd189e25604716c03576d05ac8a5209743] Merge tag 'sound-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
> git bisect good fe46a7dd189e25604716c03576d05ac8a5209743
> # bad: [435a75548109f19e5b5b14ae35b9acb063c084e9] mm: use folio more widely in __split_huge_page
> git bisect bad 435a75548109f19e5b5b14ae35b9acb063c084e9
> # good: [4d5bf0b6183f79ea361dd506365d2a471270735c] mm/mmu_gather: add tlb_remove_tlb_entries()
> git bisect good 4d5bf0b6183f79ea361dd506365d2a471270735c
> # bad: [4daacfe8f99f4b4cef562649d56c48642981f46e] mm/damon/sysfs-schemes: support PSI-based quota auto-tune
> git bisect bad 4daacfe8f99f4b4cef562649d56c48642981f46e
> # good: [217b2119b9e260609958db413876f211038f00ee] mm,page_owner: implement the tracking of the stacks count
> git bisect good 217b2119b9e260609958db413876f211038f00ee
> # bad: [40254101d87870b2e5ac3ddc28af40aa04c48486] arm64, crash: wrap crash dumping code into crash related ifdefs
> git bisect bad 40254101d87870b2e5ac3ddc28af40aa04c48486
> # bad: [53becf32aec1c8049b854f0c31a11df5ed75df6f] mm: vmalloc: support multiple nodes in vread_iter
> git bisect bad 53becf32aec1c8049b854f0c31a11df5ed75df6f
> # good: [7fa8cee003166ef6db0bba70d610dbf173543811] mm: vmalloc: move vmap_init_free_space() down in vmalloc.c
> git bisect good 7fa8cee003166ef6db0bba70d610dbf173543811
> # good: [282631cb2447318e2a55b41a665dbe8571c46d70] mm: vmalloc: remove global purge_vmap_area_root rb-tree
> git bisect good 282631cb2447318e2a55b41a665dbe8571c46d70
> # bad: [96aa8437d169b8e030a98e2b74fd9a8ee9d3be7e] mm: vmalloc: add a scan area of VA only once
> git bisect bad 96aa8437d169b8e030a98e2b74fd9a8ee9d3be7e
> # bad: [72210662c5a2b6005f6daea7fe293a0dc573e1a5] mm: vmalloc: offload free_vmap_area_lock lock
> git bisect bad 72210662c5a2b6005f6daea7fe293a0dc573e1a5
> # first bad commit: [72210662c5a2b6005f6daea7fe293a0dc573e1a5] mm: vmalloc: offload free_vmap_area_lock lock
>
> ---
> [ 0.465490] lock: single+0x1860/0x1958, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
> [ 0.466067] CPU: 0 PID: 0 Comm: swapper Not tainted 6.8.0-12955-g8e938e398669 #1
> [ 0.466188] Hardware name: QEMU 8561 QEMU (KVM/Linux)
> [ 0.466270] Call Trace:
> [ 0.466470] dump_stack_lvl (lib/dump_stack.c:117)
> [ 0.466516] do_raw_spin_lock (kernel/locking/spinlock_debug.c:87 kernel/locking/spinlock_debug.c:115)
> [ 0.466545] find_vmap_area (mm/vmalloc.c:1059 mm/vmalloc.c:2364)
> [ 0.466572] find_vm_area (mm/vmalloc.c:3150)
> [ 0.466597] __set_memory (arch/s390/mm/pageattr.c:360 arch/s390/mm/pageattr.c:393)
> [ 0.466624] vmem_map_init (./arch/s390/include/asm/set_memory.h:55 arch/s390/mm/vmem.c:660)
> [ 0.466651] paging_init (arch/s390/mm/init.c:97)
> [ 0.466677] setup_arch (arch/s390/kernel/setup.c:972)
> [ 0.466702] start_kernel (init/main.c:899)
> [ 0.466727] startup_continue (arch/s390/kernel/head64.S:35)
> [ 0.466811] INFO: lockdep is turned off.
>
<snip>
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 22aa63f4ef63..0d77d171b5d9 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2343,6 +2343,9 @@ struct vmap_area *find_vmap_area(unsigned long addr)
struct vmap_area *va;
int i, j;

+ if (unlikely(!vmap_initialized))
+ return NULL;
+
/*
* An addr_to_node_id(addr) converts an address to a node index
* where a VA is located. If VA spans several zones and passed
<snip>

Could you please test it?

--
Uladzislau Rezki

2024-03-22 20:53:24

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH v3 07/11] mm: vmalloc: Offload free_vmap_area_lock lock

On 3/22/24 12:03, Uladzislau Rezki wrote:
[ ... ]

> <snip>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 22aa63f4ef63..0d77d171b5d9 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2343,6 +2343,9 @@ struct vmap_area *find_vmap_area(unsigned long addr)
> struct vmap_area *va;
> int i, j;
>
> + if (unlikely(!vmap_initialized))
> + return NULL;
> +
> /*
> * An addr_to_node_id(addr) converts an address to a node index
> * where a VA is located. If VA spans several zones and passed
> <snip>
>
> Could you please test it?
>

That fixes the problem.

Thanks,
Guenter