2023-05-22 11:16:06

by Uladzislau Rezki (Sony)

Subject: [PATCH 0/9] Mitigate a vmap lock contention

Hello, folks.

1. This is a follow-up of the vmap topic that was highlighted at the LSFMMBPF-2023
conference. This small series attempts to mitigate the lock contention across the
vmap/vmalloc code. The problem is described here:

wget ftp://vps418301.ovh.net/incoming/Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf

The material is tagged as a v2 version. It contains extra slides about throughput
testing, the steps involved and a comparison with the current approach.

2. Motivation.

- The vmap code does not scale with the number of CPUs and this should be fixed;
- XFS folks have complained several times that vmalloc might be contended on
their workloads:

<snip>
commit 8dc9384b7d75012856b02ff44c37566a55fc2abf
Author: Dave Chinner <[email protected]>
Date: Tue Jan 4 17:22:18 2022 -0800

xfs: reduce kvmalloc overhead for CIL shadow buffers

Oh, let me count the ways that the kvmalloc API sucks dog eggs.

The problem is when we are logging lots of large objects, we hit
kvmalloc really damn hard with costly order allocations, and
behaviour utterly sucks:
...
<snip>

- If we can match the per-cpu allocator in terms of performance, we can
remove it to simplify the vmap code. Currently we have three allocators.
See:

<snip>
/*** Per cpu kva allocator ***/

/*
* vmap space is limited especially on 32 bit architectures. Ensure there is
* room for at least 16 percpu vmap blocks per CPU.
*/
/*
* If we had a constant VMALLOC_START and VMALLOC_END, we'd like to be able
* to #define VMALLOC_SPACE (VMALLOC_END-VMALLOC_START). Guess
* instead (we just need a rough idea)
*/
<snip>

3. Test

On my AMD Ryzen Threadripper 3970X 32-Core Processor, I have the figures below:

jobs 1-page 1-page-this-patch
1 0.576131 vs 0.555889
2 2.68376 vs 1.07895
3 4.26502 vs 1.01739
4 6.04306 vs 1.28924
5 8.04786 vs 1.57616
6 9.38844 vs 1.78142
7 9.53481 vs 2.00172
8 10.4609 vs 2.15964
9 10.6089 vs 2.484
10 11.7372 vs 2.40443
11 11.5969 vs 2.71635
12 13.053 vs 2.6162
13 12.2973 vs 2.843
14 13.1429 vs 2.85714
15 13.7348 vs 2.90691
16 14.3687 vs 3.0285
17 14.8823 vs 3.05717
18 14.9571 vs 2.98018
19 14.9127 vs 3.0951
20 16.0786 vs 3.19521
21 15.8055 vs 3.24915
22 16.8087 vs 3.2521
23 16.7298 vs 3.25698
24 17.244 vs 3.36586
25 17.8416 vs 3.39384
26 18.8008 vs 3.40907
27 18.5591 vs 3.5254
28 19.761 vs 3.55468
29 20.06 vs 3.59869
30 20.4353 vs 3.6991
31 20.9082 vs 3.73028
32 21.0865 vs 3.82904

1..32 is the number of jobs. The results are in usec and represent the
vmalloc()/vfree() pair throughput.

The series is based on the v6.3 tag and is considered a beta version. Please note
that it does not support the vread() functionality yet, so it is not fully
complete.

Any input/thoughts are welcome.

Uladzislau Rezki (Sony) (9):
mm: vmalloc: Add va_alloc() helper
mm: vmalloc: Rename adjust_va_to_fit_type() function
mm: vmalloc: Move vmap_init_free_space() down in vmalloc.c
mm: vmalloc: Add a per-CPU-zone infrastructure
mm: vmalloc: Insert busy-VA per-cpu zone
mm: vmalloc: Support multiple zones in vmallocinfo
mm: vmalloc: Insert lazy-VA per-cpu zone
mm: vmalloc: Offload free_vmap_area_lock global lock
mm: vmalloc: Scale and activate cvz_size

mm/vmalloc.c | 641 +++++++++++++++++++++++++++++++++++----------------
1 file changed, 448 insertions(+), 193 deletions(-)

--
2.30.2



2023-05-22 11:16:14

by Uladzislau Rezki (Sony)

Subject: [PATCH 4/9] mm: vmalloc: Add a per-CPU-zone infrastructure

Introduce a basic skeleton that defines a cache entity for
a CPU. It consists of bookkeeping data which belongs to a CPU
zone. A CPU zone defines a virtual address range of a fixed
size.

This approach makes it possible to map any address to the
particular CPU zone it belongs to. Below is a high-level
description of how it is used in the alloc and free paths:

alloc-path:
a pre-fetch occurs up front if the cache is empty. A chunk is
placed into the free-tree of this CPU. Finally this CPU cache
is clipped in order to satisfy the request. A fresh VA then
resides in the busy-tree of the specific CPU zone it is bound
to.

free-path:
va->va_start is converted to a zone, so the lookup occurs in
the bookkeeping zone it belongs to. Once it gets removed, a VA
is moved to the lazy-tree. It is freed after a TLB flush.

A "cvz_size", that reflects a zone size, is set to ULONG_MAX
by this patch. The reason is we would like to follow an old
behavior until all dependent changes are in place.

There is no functional change as a result of this patch.
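
To make the mapping concrete, below is a minimal userspace sketch of the
addr_to_cpu() logic introduced further down in this patch. It assumes a
64-bit build; the nr_cpus value and the sample address are made up purely
for illustration, whereas the kernel uses num_possible_cpus() and real
vmalloc-range addresses.

<snip>
#include <stdio.h>
#include <limits.h>

static unsigned long cvz_size = ULONG_MAX; /* per-CPU zone size, as in this patch */
static unsigned int nr_cpus = 4;           /* stand-in for num_possible_cpus() */

static unsigned int addr_to_cpu(unsigned long addr)
{
	return (addr / cvz_size) % nr_cpus;
}

int main(void)
{
	unsigned long addr = 0xffffc90000400000UL; /* made-up vmalloc-range address */

	/* With cvz_size == ULONG_MAX every address lands in zone 0. */
	printf("zone: %u\n", addr_to_cpu(addr));

	/* Once cvz_size is scaled (see the last patch), addresses spread out. */
	cvz_size = 4UL << 20;
	printf("zone: %u\n", addr_to_cpu(addr));

	return 0;
}
<snip>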

Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
---
mm/vmalloc.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 55 insertions(+)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 19dcffb0d563..f6da2590b4de 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -773,6 +773,61 @@ static struct rb_root free_vmap_area_root = RB_ROOT;
*/
static DEFINE_PER_CPU(struct vmap_area *, ne_fit_preload_node);

+/*
+ * A per-cpu effective vmap cache logic. It allows to pre-fetch
+ * a memory vmap chunk with a specified size. A bookkeeping data
+ * is stored and accessed per-CPU.
+ */
+enum { FREE = 0, BUSY, LAZY, NFBL };
+
+#define fbl(z, i, m) z->fbl[i].m
+#define fbl_root(z, i) fbl(z, i, root)
+#define fbl_head(z, i) fbl(z, i, head)
+
+#define fbl_lock(z, i) spin_lock(&fbl(z, i, lock))
+#define fbl_unlock(z, i) spin_unlock(&fbl(z, i, lock))
+
+struct cpu_vmap_zone {
+ /*
+ * FREE, BUSY, LAZY bookkeeping data of this CPU zone.
+ */
+ struct {
+ struct rb_root root;
+ struct list_head head;
+ spinlock_t lock;
+ } fbl[NFBL];
+
+ /*
+ * A list of outstanding lazily-freed areas, ready to
+ * be drained, which belong(address space) to this CPU
+ * zone.
+ */
+ struct list_head purge_list;
+
+ /*
+ * It controls that only one CPU can pre-fetch this
+ * CPU zone.
+ */
+ atomic_t fill_in_progress;
+};
+
+static DEFINE_PER_CPU(struct cpu_vmap_zone, cpu_vmap_zone);
+
+/* Disable a per-cpu caching. */
+static __read_mostly unsigned long cvz_size = ULONG_MAX;
+
+static inline unsigned int
+addr_to_cpu(unsigned long addr)
+{
+ return (addr / cvz_size) % num_possible_cpus();
+}
+
+static inline struct cpu_vmap_zone *
+__maybe_unused addr_to_cvz(unsigned long addr)
+{
+ return &per_cpu(cpu_vmap_zone, addr_to_cpu(addr));
+}
+
static __always_inline unsigned long
va_size(struct vmap_area *va)
{
--
2.30.2


2023-05-22 11:17:31

by Uladzislau Rezki (Sony)

Subject: [PATCH 9/9] mm: vmalloc: Scale and activate cvz_size

Scale the zone size to the number of CPUs. Each power of two
adds an extra 100 * PAGE_SIZE bytes to the zone size.

The maximum zone size is limited to 4M; it is reached when the
number of CPUs is 1024 and PAGE_SIZE is 4096 bytes.

For 32-bit or single-core systems cvz_size is left at
ULONG_MAX, which means that the pre-loading technique is
deactivated.
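
As an illustration, below is a minimal userspace sketch of this scaling
rule. PAGE_SIZE=4096 and the CPU counts are assumptions used only to show
the resulting zone sizes; the patch itself relies on the kernel's fls()
and PAGE_SIZE.

<snip>
#include <stdio.h>

#define PAGE_SIZE	4096UL
#define SZ_4M		(4UL << 20)

/* Find-last-set, 1-based, mirroring the kernel's fls() semantics. */
static int fls_ul(unsigned long x)
{
	int r = 0;

	while (x) {
		x >>= 1;
		r++;
	}
	return r;
}

int main(void)
{
	unsigned int cpus[] = { 2, 8, 64, 1024 };
	unsigned int i;

	for (i = 0; i < sizeof(cpus) / sizeof(cpus[0]); i++) {
		unsigned long size = fls_ul(cpus[i]) * (100 * PAGE_SIZE);

		if (size > SZ_4M)
			size = SZ_4M;

		/* e.g. 1024 CPUs hit the 4M cap. */
		printf("%4u CPUs -> cvz_size = %lu bytes\n", cpus[i], size);
	}

	return 0;
}
<snip>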

Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
---
mm/vmalloc.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 8054b8bf6c18..a5e6956a8da4 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -806,8 +806,6 @@ struct cpu_vmap_zone {
};

static DEFINE_PER_CPU(struct cpu_vmap_zone, cpu_vmap_zone);
-
-/* Disable a per-cpu caching. */
static __read_mostly unsigned long cvz_size = ULONG_MAX;

static inline unsigned int
@@ -4490,6 +4488,19 @@ static void vmap_init_pcpu_zones(void)
spin_lock_init(&z->fbl[j].lock);
}
}
+
+ /*
+ * Scale a zone size to number of CPUs. Each power of two
+ * value adds extra 100 pages to a zone size. A high-threshold
+ * limit is set to 4M. It can be reached if number of CPUs=1024
+ * whereas a PAGE_SIZE is 4096 bytes.
+ */
+#if BITS_PER_LONG == 64
+ if (i > 1) {
+ cvz_size = fls(i) * (100 * PAGE_SIZE);
+ cvz_size = min(cvz_size, (unsigned long) SZ_4M);
+ }
+#endif
}

void __init vmalloc_init(void)
--
2.30.2


2023-05-22 11:22:55

by Uladzislau Rezki (Sony)

Subject: [PATCH 2/9] mm: vmalloc: Rename adjust_va_to_fit_type() function

This patch renames the adjust_va_to_fit_type() function
to a shorter and more expressive variant. The new name
is va_clip().

There is no functional change as a result of this patch.

Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
---
mm/vmalloc.c | 13 ++++++-------
1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 409285b68a67..5f900efec6a9 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1383,9 +1383,9 @@ classify_va_fit_type(struct vmap_area *va,
}

static __always_inline int
-adjust_va_to_fit_type(struct rb_root *root, struct list_head *head,
- struct vmap_area *va, unsigned long nva_start_addr,
- unsigned long size)
+va_clip(struct rb_root *root, struct list_head *head,
+ struct vmap_area *va, unsigned long nva_start_addr,
+ unsigned long size)
{
struct vmap_area *lva = NULL;
enum fit_type type = classify_va_fit_type(va, nva_start_addr, size);
@@ -1501,7 +1501,7 @@ va_alloc(struct vmap_area *va,
return vend;

/* Update the free vmap_area. */
- ret = adjust_va_to_fit_type(root, head, va, nva_start_addr, size);
+ ret = va_clip(root, head, va, nva_start_addr, size);
if (WARN_ON_ONCE(ret))
return vend;

@@ -3979,9 +3979,8 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
/* It is a BUG(), but trigger recovery instead. */
goto recovery;

- ret = adjust_va_to_fit_type(&free_vmap_area_root,
- &free_vmap_area_list,
- va, start, size);
+ ret = va_clip(&free_vmap_area_root,
+ &free_vmap_area_list, va, start, size);
if (WARN_ON_ONCE(unlikely(ret)))
/* It is a BUG(), but trigger recovery instead. */
goto recovery;
--
2.30.2


2023-05-22 11:28:26

by Uladzislau Rezki (Sony)

Subject: [PATCH 5/9] mm: vmalloc: Insert busy-VA per-cpu zone

Store busy-VA objects per CPU zone. A va->va_start address is
converted into the correct zone where it is placed and resides.
The addr_to_cvz() function is used to do the address conversion.

Such an approach balances VAs across CPUs, which is why access
becomes scalable with the number of CPUs in a system.

Please note:

Since the zone size is set to ULONG_MAX, any address maps to
zone 0, i.e. everything is still bound to and accessed via CPU_0
so far, thus this patch does not change the current behavior.

The global vmap_area_lock and vmap_area_root are removed as there
is no need for them anymore. The vmap_area_list is still kept and
is _empty_. It is exported for kexec only.

vmallocinfo and vread() have to be reworked to be able to
handle multiple zones. As of this patch they can handle only
one zone, i.e. the case when the cache is disabled.

Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
---
mm/vmalloc.c | 127 ++++++++++++++++++++++++++++++++++-----------------
1 file changed, 84 insertions(+), 43 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index f6da2590b4de..a9170fe19909 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -729,11 +729,9 @@ EXPORT_SYMBOL(vmalloc_to_pfn);
#define DEBUG_AUGMENT_LOWEST_MATCH_CHECK 0


-static DEFINE_SPINLOCK(vmap_area_lock);
static DEFINE_SPINLOCK(free_vmap_area_lock);
/* Export for kexec only */
LIST_HEAD(vmap_area_list);
-static struct rb_root vmap_area_root = RB_ROOT;
static bool vmap_initialized __read_mostly;

static struct rb_root purge_vmap_area_root = RB_ROOT;
@@ -823,7 +821,7 @@ addr_to_cpu(unsigned long addr)
}

static inline struct cpu_vmap_zone *
-__maybe_unused addr_to_cvz(unsigned long addr)
+addr_to_cvz(unsigned long addr)
{
return &per_cpu(cpu_vmap_zone, addr_to_cpu(addr));
}
@@ -859,10 +857,10 @@ unsigned long vmalloc_nr_pages(void)
}

/* Look up the first VA which satisfies addr < va_end, NULL if none. */
-static struct vmap_area *find_vmap_area_exceed_addr(unsigned long addr)
+static struct vmap_area *find_vmap_area_exceed_addr(unsigned long addr, struct rb_root *root)
{
struct vmap_area *va = NULL;
- struct rb_node *n = vmap_area_root.rb_node;
+ struct rb_node *n = root->rb_node;

addr = (unsigned long)kasan_reset_tag((void *)addr);

@@ -1608,12 +1606,14 @@ __alloc_vmap_area(struct rb_root *root, struct list_head *head,
*/
static void free_vmap_area(struct vmap_area *va)
{
+ struct cpu_vmap_zone *z = addr_to_cvz(va->va_start);
+
/*
* Remove from the busy tree/list.
*/
- spin_lock(&vmap_area_lock);
- unlink_va(va, &vmap_area_root);
- spin_unlock(&vmap_area_lock);
+ fbl_lock(z, BUSY);
+ unlink_va(va, &fbl_root(z, BUSY));
+ fbl_unlock(z, BUSY);

/*
* Insert/Merge it back to the free tree/list.
@@ -1656,6 +1656,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
int node, gfp_t gfp_mask,
unsigned long va_flags)
{
+ struct cpu_vmap_zone *z;
struct vmap_area *va;
unsigned long freed;
unsigned long addr;
@@ -1701,9 +1702,11 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
va->vm = NULL;
va->flags = va_flags;

- spin_lock(&vmap_area_lock);
- insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
- spin_unlock(&vmap_area_lock);
+ z = addr_to_cvz(va->va_start);
+
+ fbl_lock(z, BUSY);
+ insert_vmap_area(va, &fbl_root(z, BUSY), &fbl_head(z, BUSY));
+ fbl_unlock(z, BUSY);

BUG_ON(!IS_ALIGNED(va->va_start, align));
BUG_ON(va->va_start < vstart);
@@ -1926,24 +1929,26 @@ static void free_unmap_vmap_area(struct vmap_area *va)

struct vmap_area *find_vmap_area(unsigned long addr)
{
+ struct cpu_vmap_zone *z = addr_to_cvz(addr);
struct vmap_area *va;

- spin_lock(&vmap_area_lock);
- va = __find_vmap_area(addr, &vmap_area_root);
- spin_unlock(&vmap_area_lock);
+ fbl_lock(z, BUSY);
+ va = __find_vmap_area(addr, &fbl_root(z, BUSY));
+ fbl_unlock(z, BUSY);

return va;
}

static struct vmap_area *find_unlink_vmap_area(unsigned long addr)
{
+ struct cpu_vmap_zone *z = addr_to_cvz(addr);
struct vmap_area *va;

- spin_lock(&vmap_area_lock);
- va = __find_vmap_area(addr, &vmap_area_root);
+ fbl_lock(z, BUSY);
+ va = __find_vmap_area(addr, &fbl_root(z, BUSY));
if (va)
- unlink_va(va, &vmap_area_root);
- spin_unlock(&vmap_area_lock);
+ unlink_va(va, &fbl_root(z, BUSY));
+ fbl_unlock(z, BUSY);

return va;
}
@@ -2095,14 +2100,17 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)

static void free_vmap_block(struct vmap_block *vb)
{
+ struct cpu_vmap_zone *z;
struct vmap_block *tmp;

tmp = xa_erase(&vmap_blocks, addr_to_vb_idx(vb->va->va_start));
BUG_ON(tmp != vb);

- spin_lock(&vmap_area_lock);
- unlink_va(vb->va, &vmap_area_root);
- spin_unlock(&vmap_area_lock);
+ z = addr_to_cvz(vb->va->va_start);
+
+ fbl_lock(z, BUSY);
+ unlink_va(vb->va, &fbl_root(z, BUSY));
+ fbl_unlock(z, BUSY);

free_vmap_area_noflush(vb->va);
kfree_rcu(vb, rcu_head);
@@ -2484,9 +2492,11 @@ static inline void setup_vmalloc_vm_locked(struct vm_struct *vm,
static void setup_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va,
unsigned long flags, const void *caller)
{
- spin_lock(&vmap_area_lock);
+ struct cpu_vmap_zone *z = addr_to_cvz(va->va_start);
+
+ fbl_lock(z, BUSY);
setup_vmalloc_vm_locked(vm, va, flags, caller);
- spin_unlock(&vmap_area_lock);
+ fbl_unlock(z, BUSY);
}

static void clear_vm_uninitialized_flag(struct vm_struct *vm)
@@ -3605,6 +3615,7 @@ static void vmap_ram_vread(char *buf, char *addr, int count, unsigned long flags
*/
long vread(char *buf, char *addr, unsigned long count)
{
+ struct cpu_vmap_zone *z;
struct vmap_area *va;
struct vm_struct *vm;
char *vaddr, *buf_start = buf;
@@ -3617,8 +3628,11 @@ long vread(char *buf, char *addr, unsigned long count)
if ((unsigned long) addr + count < count)
count = -(unsigned long) addr;

- spin_lock(&vmap_area_lock);
- va = find_vmap_area_exceed_addr((unsigned long)addr);
+ /* Hooked to CPU0 because a cache is not activated. */
+ z = &per_cpu(cpu_vmap_zone, 0);
+ fbl_lock(z, BUSY);
+
+ va = find_vmap_area_exceed_addr((unsigned long)addr, &fbl_root(z, BUSY));
if (!va)
goto finished;

@@ -3626,7 +3640,7 @@ long vread(char *buf, char *addr, unsigned long count)
if ((unsigned long)addr + count <= va->va_start)
goto finished;

- list_for_each_entry_from(va, &vmap_area_list, list) {
+ list_for_each_entry_from(va, &fbl_head(z, BUSY), list) {
if (!count)
break;

@@ -3674,7 +3688,7 @@ long vread(char *buf, char *addr, unsigned long count)
count -= n;
}
finished:
- spin_unlock(&vmap_area_lock);
+ fbl_unlock(z, BUSY);

if (buf == buf_start)
return 0;
@@ -4014,14 +4028,15 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
}

/* insert all vm's */
- spin_lock(&vmap_area_lock);
for (area = 0; area < nr_vms; area++) {
- insert_vmap_area(vas[area], &vmap_area_root, &vmap_area_list);
+ struct cpu_vmap_zone *z = addr_to_cvz(vas[area]->va_start);

+ fbl_lock(z, BUSY);
+ insert_vmap_area(vas[area], &fbl_root(z, BUSY), &fbl_head(z, BUSY));
setup_vmalloc_vm_locked(vms[area], vas[area], VM_ALLOC,
pcpu_get_vm_areas);
+ fbl_unlock(z, BUSY);
}
- spin_unlock(&vmap_area_lock);

/*
* Mark allocated areas as accessible. Do it now as a best-effort
@@ -4145,24 +4160,24 @@ bool vmalloc_dump_obj(void *object)
#ifdef CONFIG_PROC_FS
static void *s_start(struct seq_file *m, loff_t *pos)
__acquires(&vmap_purge_lock)
- __acquires(&vmap_area_lock)
+ __acquires(&fbl(&per_cpu(cpu_vmap_zone, 0), BUSY, lock))
{
mutex_lock(&vmap_purge_lock);
- spin_lock(&vmap_area_lock);
+ fbl_lock((&per_cpu(cpu_vmap_zone, 0)), BUSY);

- return seq_list_start(&vmap_area_list, *pos);
+ return seq_list_start(&fbl_head((&per_cpu(cpu_vmap_zone, 0)), BUSY), *pos);
}

static void *s_next(struct seq_file *m, void *p, loff_t *pos)
{
- return seq_list_next(p, &vmap_area_list, pos);
+ return seq_list_next(p, &fbl_head((&per_cpu(cpu_vmap_zone, 0)), BUSY), pos);
}

static void s_stop(struct seq_file *m, void *p)
- __releases(&vmap_area_lock)
+ __releases(&fbl(&per_cpu(cpu_vmap_zone, 0), BUSY, lock))
__releases(&vmap_purge_lock)
{
- spin_unlock(&vmap_area_lock);
+ fbl_unlock((&per_cpu(cpu_vmap_zone, 0)), BUSY);
mutex_unlock(&vmap_purge_lock);
}

@@ -4258,7 +4273,7 @@ static int s_show(struct seq_file *m, void *p)
* As a final step, dump "unpurged" areas.
*/
final:
- if (list_is_last(&va->list, &vmap_area_list))
+ if (list_is_last(&va->list, &fbl_head((&per_cpu(cpu_vmap_zone, 0)), BUSY)))
show_purge_info(m);

return 0;
@@ -4289,7 +4304,8 @@ static void vmap_init_free_space(void)
{
unsigned long vmap_start = 1;
const unsigned long vmap_end = ULONG_MAX;
- struct vmap_area *busy, *free;
+ struct vmap_area *free;
+ struct vm_struct *busy;

/*
* B F B B B F
@@ -4297,12 +4313,12 @@ static void vmap_init_free_space(void)
* | The KVA space |
* |<--------------------------------->|
*/
- list_for_each_entry(busy, &vmap_area_list, list) {
- if (busy->va_start - vmap_start > 0) {
+ for (busy = vmlist; busy; busy = busy->next) {
+ if (busy->addr - vmap_start > 0) {
free = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
if (!WARN_ON_ONCE(!free)) {
free->va_start = vmap_start;
- free->va_end = busy->va_start;
+ free->va_end = (unsigned long) busy->addr;

insert_vmap_area_augment(free, NULL,
&free_vmap_area_root,
@@ -4310,7 +4326,7 @@ static void vmap_init_free_space(void)
}
}

- vmap_start = busy->va_end;
+ vmap_start = (unsigned long) busy->addr + busy->size;
}

if (vmap_end - vmap_start > 0) {
@@ -4326,6 +4342,22 @@ static void vmap_init_free_space(void)
}
}

+static void vmap_init_pcpu_zones(void)
+{
+ struct cpu_vmap_zone *z;
+ int i, j;
+
+ for_each_possible_cpu(i) {
+ z = per_cpu_ptr(&cpu_vmap_zone, i);
+
+ for (j = 0; j < ARRAY_SIZE(z->fbl); j++) {
+ INIT_LIST_HEAD(&z->fbl[j].head);
+ z->fbl[j].root = RB_ROOT;
+ spin_lock_init(&z->fbl[j].lock);
+ }
+ }
+}
+
void __init vmalloc_init(void)
{
struct vmap_area *va;
@@ -4349,8 +4381,15 @@ void __init vmalloc_init(void)
INIT_WORK(&p->wq, delayed_vfree_work);
}

+ /*
+ * Setup per-cpu data before importing vmlist.
+ */
+ vmap_init_pcpu_zones();
+
/* Import existing vmlist entries. */
for (tmp = vmlist; tmp; tmp = tmp->next) {
+ struct cpu_vmap_zone *z;
+
va = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
if (WARN_ON_ONCE(!va))
continue;
@@ -4358,7 +4397,9 @@ void __init vmalloc_init(void)
va->va_start = (unsigned long)tmp->addr;
va->va_end = va->va_start + tmp->size;
va->vm = tmp;
- insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
+
+ z = addr_to_cvz(va->va_start);
+ insert_vmap_area(va, &fbl_root(z, BUSY), &fbl_head(z, BUSY));
}

/*
--
2.30.2


2023-05-22 11:29:21

by Uladzislau Rezki (Sony)

Subject: [PATCH 7/9] mm: vmalloc: Insert lazy-VA per-cpu zone

Similar to busy VAs, lazy ones are also stored per CPU zone.
A freed address is converted into the correct zone it belongs
to and resides there for further handling.

Such an approach does not require any global locking primitive;
instead, access becomes scalable with the number of CPUs.

This patch removes the global purge lock, the global purge tree
and the purge list.

Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
---
mm/vmalloc.c | 127 ++++++++++++++++++++++++++++-----------------------
1 file changed, 71 insertions(+), 56 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index dd83deb5ef4f..fe993c0561dd 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -734,10 +734,6 @@ static DEFINE_SPINLOCK(free_vmap_area_lock);
LIST_HEAD(vmap_area_list);
static bool vmap_initialized __read_mostly;

-static struct rb_root purge_vmap_area_root = RB_ROOT;
-static LIST_HEAD(purge_vmap_area_list);
-static DEFINE_SPINLOCK(purge_vmap_area_lock);
-
/*
* This kmem_cache is used for vmap_area objects. Instead of
* allocating from slab we reuse an object from this cache to
@@ -1792,39 +1788,17 @@ static DEFINE_MUTEX(vmap_purge_lock);
/* for per-CPU blocks */
static void purge_fragmented_blocks_allcpus(void);

-/*
- * Purges all lazily-freed vmap areas.
- */
-static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
+static unsigned long
+purge_cpu_vmap_zone(struct cpu_vmap_zone *z)
{
- unsigned long resched_threshold;
- unsigned int num_purged_areas = 0;
- struct list_head local_purge_list;
+ unsigned long num_purged_areas = 0;
struct vmap_area *va, *n_va;

- lockdep_assert_held(&vmap_purge_lock);
-
- spin_lock(&purge_vmap_area_lock);
- purge_vmap_area_root = RB_ROOT;
- list_replace_init(&purge_vmap_area_list, &local_purge_list);
- spin_unlock(&purge_vmap_area_lock);
-
- if (unlikely(list_empty(&local_purge_list)))
+ if (list_empty(&z->purge_list))
goto out;

- start = min(start,
- list_first_entry(&local_purge_list,
- struct vmap_area, list)->va_start);
-
- end = max(end,
- list_last_entry(&local_purge_list,
- struct vmap_area, list)->va_end);
-
- flush_tlb_kernel_range(start, end);
- resched_threshold = lazy_max_pages() << 1;
-
spin_lock(&free_vmap_area_lock);
- list_for_each_entry_safe(va, n_va, &local_purge_list, list) {
+ list_for_each_entry_safe(va, n_va, &z->purge_list, list) {
unsigned long nr = (va->va_end - va->va_start) >> PAGE_SHIFT;
unsigned long orig_start = va->va_start;
unsigned long orig_end = va->va_end;
@@ -1846,13 +1820,57 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)

atomic_long_sub(nr, &vmap_lazy_nr);
num_purged_areas++;
-
- if (atomic_long_read(&vmap_lazy_nr) < resched_threshold)
- cond_resched_lock(&free_vmap_area_lock);
}
spin_unlock(&free_vmap_area_lock);

out:
+ return num_purged_areas;
+}
+
+/*
+ * Purges all lazily-freed vmap areas.
+ */
+static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
+{
+ unsigned long num_purged_areas = 0;
+ struct cpu_vmap_zone *z;
+ int need_purge = 0;
+ int i;
+
+ lockdep_assert_held(&vmap_purge_lock);
+
+ for_each_possible_cpu(i) {
+ z = per_cpu_ptr(&cpu_vmap_zone, i);
+ INIT_LIST_HEAD(&z->purge_list);
+
+ if (RB_EMPTY_ROOT(&fbl_root(z, LAZY)))
+ continue;
+
+ fbl_lock(z, LAZY);
+ WRITE_ONCE(fbl(z, LAZY, root.rb_node), NULL);
+ list_replace_init(&fbl_head(z, LAZY), &z->purge_list);
+ fbl_unlock(z, LAZY);
+
+ start = min(start,
+ list_first_entry(&z->purge_list,
+ struct vmap_area, list)->va_start);
+
+ end = max(end,
+ list_last_entry(&z->purge_list,
+ struct vmap_area, list)->va_end);
+
+ need_purge++;
+ }
+
+ if (need_purge) {
+ flush_tlb_kernel_range(start, end);
+
+ for_each_possible_cpu(i) {
+ z = per_cpu_ptr(&cpu_vmap_zone, i);
+ num_purged_areas += purge_cpu_vmap_zone(z);
+ }
+ }
+
trace_purge_vmap_area_lazy(start, end, num_purged_areas);
return num_purged_areas > 0;
}
@@ -1870,16 +1888,9 @@ static void purge_vmap_area_lazy(void)

static void drain_vmap_area_work(struct work_struct *work)
{
- unsigned long nr_lazy;
-
- do {
- mutex_lock(&vmap_purge_lock);
- __purge_vmap_area_lazy(ULONG_MAX, 0);
- mutex_unlock(&vmap_purge_lock);
-
- /* Recheck if further work is required. */
- nr_lazy = atomic_long_read(&vmap_lazy_nr);
- } while (nr_lazy > lazy_max_pages());
+ mutex_lock(&vmap_purge_lock);
+ __purge_vmap_area_lazy(ULONG_MAX, 0);
+ mutex_unlock(&vmap_purge_lock);
}

/*
@@ -1889,6 +1900,7 @@ static void drain_vmap_area_work(struct work_struct *work)
*/
static void free_vmap_area_noflush(struct vmap_area *va)
{
+ struct cpu_vmap_zone *z = addr_to_cvz(va->va_start);
unsigned long nr_lazy_max = lazy_max_pages();
unsigned long va_start = va->va_start;
unsigned long nr_lazy;
@@ -1902,10 +1914,9 @@ static void free_vmap_area_noflush(struct vmap_area *va)
/*
* Merge or place it to the purge tree/list.
*/
- spin_lock(&purge_vmap_area_lock);
- merge_or_add_vmap_area(va,
- &purge_vmap_area_root, &purge_vmap_area_list);
- spin_unlock(&purge_vmap_area_lock);
+ fbl_lock(z, LAZY);
+ merge_or_add_vmap_area(va, &fbl_root(z, LAZY), &fbl_head(z, LAZY));
+ fbl_unlock(z, LAZY);

trace_free_vmap_area_noflush(va_start, nr_lazy, nr_lazy_max);

@@ -4199,17 +4210,21 @@ static void show_numa_info(struct seq_file *m, struct vm_struct *v)

static void show_purge_info(struct seq_file *m)
{
+ struct cpu_vmap_zone *z;
struct vmap_area *va;
+ int i;

- mutex_lock(&vmap_purge_lock);
- spin_lock(&purge_vmap_area_lock);
- list_for_each_entry(va, &purge_vmap_area_list, list) {
- seq_printf(m, "0x%pK-0x%pK %7ld unpurged vm_area\n",
- (void *)va->va_start, (void *)va->va_end,
- va->va_end - va->va_start);
+ for_each_possible_cpu(i) {
+ z = per_cpu_ptr(&cpu_vmap_zone, i);
+
+ fbl_lock(z, LAZY);
+ list_for_each_entry(va, &fbl_head(z, LAZY), list) {
+ seq_printf(m, "0x%pK-0x%pK %7ld unpurged vm_area\n",
+ (void *)va->va_start, (void *)va->va_end,
+ va->va_end - va->va_start);
+ }
+ fbl_unlock(z, LAZY);
}
- spin_unlock(&purge_vmap_area_lock);
- mutex_unlock(&vmap_purge_lock);
}

static int s_show(struct seq_file *m, void *p)
--
2.30.2


2023-05-22 11:29:38

by Uladzislau Rezki (Sony)

Subject: [PATCH 3/9] mm: vmalloc: Move vmap_init_free_space() down in vmalloc.c

vmap_init_free_space() is a function that sets up the vmap space
and is considered part of the initialization phase. Since the
main entry point, vmalloc_init(), has been moved down in
vmalloc.c, it makes sense to follow that pattern.

There is no functional change as a result of this patch.

Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
---
mm/vmalloc.c | 82 ++++++++++++++++++++++++++--------------------------
1 file changed, 41 insertions(+), 41 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 5f900efec6a9..19dcffb0d563 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2416,47 +2416,6 @@ void __init vm_area_register_early(struct vm_struct *vm, size_t align)
kasan_populate_early_vm_area_shadow(vm->addr, vm->size);
}

-static void vmap_init_free_space(void)
-{
- unsigned long vmap_start = 1;
- const unsigned long vmap_end = ULONG_MAX;
- struct vmap_area *busy, *free;
-
- /*
- * B F B B B F
- * -|-----|.....|-----|-----|-----|.....|-
- * | The KVA space |
- * |<--------------------------------->|
- */
- list_for_each_entry(busy, &vmap_area_list, list) {
- if (busy->va_start - vmap_start > 0) {
- free = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
- if (!WARN_ON_ONCE(!free)) {
- free->va_start = vmap_start;
- free->va_end = busy->va_start;
-
- insert_vmap_area_augment(free, NULL,
- &free_vmap_area_root,
- &free_vmap_area_list);
- }
- }
-
- vmap_start = busy->va_end;
- }
-
- if (vmap_end - vmap_start > 0) {
- free = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
- if (!WARN_ON_ONCE(!free)) {
- free->va_start = vmap_start;
- free->va_end = vmap_end;
-
- insert_vmap_area_augment(free, NULL,
- &free_vmap_area_root,
- &free_vmap_area_list);
- }
- }
-}
-
static inline void setup_vmalloc_vm_locked(struct vm_struct *vm,
struct vmap_area *va, unsigned long flags, const void *caller)
{
@@ -4271,6 +4230,47 @@ module_init(proc_vmalloc_init);

#endif

+static void vmap_init_free_space(void)
+{
+ unsigned long vmap_start = 1;
+ const unsigned long vmap_end = ULONG_MAX;
+ struct vmap_area *busy, *free;
+
+ /*
+ * B F B B B F
+ * -|-----|.....|-----|-----|-----|.....|-
+ * | The KVA space |
+ * |<--------------------------------->|
+ */
+ list_for_each_entry(busy, &vmap_area_list, list) {
+ if (busy->va_start - vmap_start > 0) {
+ free = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
+ if (!WARN_ON_ONCE(!free)) {
+ free->va_start = vmap_start;
+ free->va_end = busy->va_start;
+
+ insert_vmap_area_augment(free, NULL,
+ &free_vmap_area_root,
+ &free_vmap_area_list);
+ }
+ }
+
+ vmap_start = busy->va_end;
+ }
+
+ if (vmap_end - vmap_start > 0) {
+ free = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
+ if (!WARN_ON_ONCE(!free)) {
+ free->va_start = vmap_start;
+ free->va_end = vmap_end;
+
+ insert_vmap_area_augment(free, NULL,
+ &free_vmap_area_root,
+ &free_vmap_area_list);
+ }
+ }
+}
+
void __init vmalloc_init(void)
{
struct vmap_area *va;
--
2.30.2


2023-05-22 11:30:27

by Uladzislau Rezki (Sony)

Subject: [PATCH 6/9] mm: vmalloc: Support multiple zones in vmallocinfo

The global vmap area busy tree has been removed and replaced by
multiple per-cpu trees/lists, therefore we need to traverse all
per-cpu busy-lists in order to dump all allocated objects.

Please note that after this patch, dumped addresses of allocated
areas are not sequential. See the example below:

0 1 2 0 1 2
|---|---|---|---|---|---|-> vmap space

With 3 CPUs, dumping is done per CPU zone; as you can see, an
address in zone_0 can be ahead of addresses residing in zone_1
or zone_2.

Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
---
mm/vmalloc.c | 90 ++++++++++++++++++++++++++--------------------------
1 file changed, 45 insertions(+), 45 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index a9170fe19909..dd83deb5ef4f 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -4159,26 +4159,18 @@ bool vmalloc_dump_obj(void *object)

#ifdef CONFIG_PROC_FS
static void *s_start(struct seq_file *m, loff_t *pos)
- __acquires(&vmap_purge_lock)
- __acquires(&fbl(&per_cpu(cpu_vmap_zone, 0), BUSY, lock))
{
- mutex_lock(&vmap_purge_lock);
- fbl_lock((&per_cpu(cpu_vmap_zone, 0)), BUSY);
-
- return seq_list_start(&fbl_head((&per_cpu(cpu_vmap_zone, 0)), BUSY), *pos);
+ return *pos < 1 ? (void *)1 : NULL;
}

static void *s_next(struct seq_file *m, void *p, loff_t *pos)
{
- return seq_list_next(p, &fbl_head((&per_cpu(cpu_vmap_zone, 0)), BUSY), pos);
+ ++*pos;
+ return NULL;
}

static void s_stop(struct seq_file *m, void *p)
- __releases(&fbl(&per_cpu(cpu_vmap_zone, 0), BUSY, lock))
- __releases(&vmap_purge_lock)
{
- fbl_unlock((&per_cpu(cpu_vmap_zone, 0)), BUSY);
- mutex_unlock(&vmap_purge_lock);
}

static void show_numa_info(struct seq_file *m, struct vm_struct *v)
@@ -4209,6 +4201,7 @@ static void show_purge_info(struct seq_file *m)
{
struct vmap_area *va;

+ mutex_lock(&vmap_purge_lock);
spin_lock(&purge_vmap_area_lock);
list_for_each_entry(va, &purge_vmap_area_list, list) {
seq_printf(m, "0x%pK-0x%pK %7ld unpurged vm_area\n",
@@ -4216,65 +4209,72 @@ static void show_purge_info(struct seq_file *m)
va->va_end - va->va_start);
}
spin_unlock(&purge_vmap_area_lock);
+ mutex_unlock(&vmap_purge_lock);
}

static int s_show(struct seq_file *m, void *p)
{
+ struct cpu_vmap_zone *z;
struct vmap_area *va;
struct vm_struct *v;
+ int i;

- va = list_entry(p, struct vmap_area, list);
+ for_each_possible_cpu(i) {
+ z = per_cpu_ptr(&cpu_vmap_zone, i);

- if (!va->vm) {
- if (va->flags & VMAP_RAM)
- seq_printf(m, "0x%pK-0x%pK %7ld vm_map_ram\n",
- (void *)va->va_start, (void *)va->va_end,
- va->va_end - va->va_start);
+ fbl_lock(z, BUSY);
+ list_for_each_entry(va, &fbl_head(z, BUSY), list) {
+ if (!va->vm) {
+ if (va->flags & VMAP_RAM)
+ seq_printf(m, "0x%pK-0x%pK %7ld vm_map_ram\n",
+ (void *)va->va_start, (void *)va->va_end,
+ va->va_end - va->va_start);

- goto final;
- }
+ continue;
+ }

- v = va->vm;
+ v = va->vm;

- seq_printf(m, "0x%pK-0x%pK %7ld",
- v->addr, v->addr + v->size, v->size);
+ seq_printf(m, "0x%pK-0x%pK %7ld",
+ v->addr, v->addr + v->size, v->size);

- if (v->caller)
- seq_printf(m, " %pS", v->caller);
+ if (v->caller)
+ seq_printf(m, " %pS", v->caller);

- if (v->nr_pages)
- seq_printf(m, " pages=%d", v->nr_pages);
+ if (v->nr_pages)
+ seq_printf(m, " pages=%d", v->nr_pages);

- if (v->phys_addr)
- seq_printf(m, " phys=%pa", &v->phys_addr);
+ if (v->phys_addr)
+ seq_printf(m, " phys=%pa", &v->phys_addr);

- if (v->flags & VM_IOREMAP)
- seq_puts(m, " ioremap");
+ if (v->flags & VM_IOREMAP)
+ seq_puts(m, " ioremap");

- if (v->flags & VM_ALLOC)
- seq_puts(m, " vmalloc");
+ if (v->flags & VM_ALLOC)
+ seq_puts(m, " vmalloc");

- if (v->flags & VM_MAP)
- seq_puts(m, " vmap");
+ if (v->flags & VM_MAP)
+ seq_puts(m, " vmap");

- if (v->flags & VM_USERMAP)
- seq_puts(m, " user");
+ if (v->flags & VM_USERMAP)
+ seq_puts(m, " user");

- if (v->flags & VM_DMA_COHERENT)
- seq_puts(m, " dma-coherent");
+ if (v->flags & VM_DMA_COHERENT)
+ seq_puts(m, " dma-coherent");

- if (is_vmalloc_addr(v->pages))
- seq_puts(m, " vpages");
+ if (is_vmalloc_addr(v->pages))
+ seq_puts(m, " vpages");

- show_numa_info(m, v);
- seq_putc(m, '\n');
+ show_numa_info(m, v);
+ seq_putc(m, '\n');
+ }
+ fbl_unlock(z, BUSY);
+ }

/*
* As a final step, dump "unpurged" areas.
*/
-final:
- if (list_is_last(&va->list, &fbl_head((&per_cpu(cpu_vmap_zone, 0)), BUSY)))
- show_purge_info(m);
+ show_purge_info(m);

return 0;
}
--
2.30.2


2023-05-22 11:31:41

by Uladzislau Rezki (Sony)

Subject: [PATCH 8/9] mm: vmalloc: Offload free_vmap_area_lock global lock

Introduce a fast path in the allocation sequence. It consists of
a per-cpu path and a fallback mechanism which is used when a
request cannot be satisfied by the fast track.

The fast track pre-loads a chunk from the global vmap heap
directly into its per-cpu zone, followed by clipping the chunk
based on the allocation request.

This technique offloads the global free_vmap_area_lock, so the
allocation path is serialized per CPU zone rather than globally,
i.e. it scales with the number of CPUs in a system.
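
A simplified sketch of that flow is shown below. It condenses the
alloc_vmap_area() change from this patch into pseudocode-style C:
the retry logic, the gfp/node plumbing and the vmap_area pre-loading
are omitted, and GFP_KERNEL/NUMA_NO_NODE stand in for the caller
provided values, so it is an illustration rather than the actual code.

<snip>
/* Illustration only: condensed from the alloc_vmap_area() hunk below. */
static unsigned long alloc_kva(unsigned long size, unsigned long align,
			       unsigned long vstart, unsigned long vend)
{
	unsigned long addr = vend;

	/* Fast path: a per-cpu zone serves requests for the whole vmalloc range. */
	if (vstart == VMALLOC_START && vend == VMALLOC_END)
		addr = this_cpu_zone_alloc(size, align, GFP_KERNEL, NUMA_NO_NODE);

	/* Fallback: the global heap protected by free_vmap_area_lock. */
	if (addr == vend) {
		spin_lock(&free_vmap_area_lock);
		addr = __alloc_vmap_area(&free_vmap_area_root, &free_vmap_area_list,
					 size, align, vstart, vend);
		spin_unlock(&free_vmap_area_lock);
	}

	return addr;
}
<snip>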

Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
---
mm/vmalloc.c | 127 +++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 123 insertions(+), 4 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index fe993c0561dd..8054b8bf6c18 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1642,6 +1642,93 @@ preload_this_cpu_lock(spinlock_t *lock, gfp_t gfp_mask, int node)
kmem_cache_free(vmap_area_cachep, va);
}

+static unsigned long
+this_cpu_zone_alloc_fill(struct cpu_vmap_zone *z,
+ unsigned long size, unsigned long align,
+ gfp_t gfp_mask, int node)
+{
+ unsigned long addr = VMALLOC_END;
+ struct vmap_area *va;
+
+ /*
+ * It still can race. One task sets a progress to
+ * 1 a second one gets preempted on entry, the first
+ * zeroed the progress flag and second proceed with
+ * an extra prefetch.
+ */
+ if (atomic_xchg(&z->fill_in_progress, 1))
+ return addr;
+
+ va = kmem_cache_alloc_node(vmap_area_cachep, gfp_mask, node);
+ if (unlikely(!va))
+ goto out;
+
+ spin_lock(&free_vmap_area_lock);
+ addr = __alloc_vmap_area(&free_vmap_area_root, &free_vmap_area_list,
+ cvz_size, 1, VMALLOC_START, VMALLOC_END);
+ spin_unlock(&free_vmap_area_lock);
+
+ if (addr == VMALLOC_END) {
+ kmem_cache_free(vmap_area_cachep, va);
+ goto out;
+ }
+
+ va->va_start = addr;
+ va->va_end = addr + cvz_size;
+
+ fbl_lock(z, FREE);
+ va = merge_or_add_vmap_area_augment(va,
+ &fbl_root(z, FREE), &fbl_head(z, FREE));
+ addr = va_alloc(va, &fbl_root(z, FREE), &fbl_head(z, FREE),
+ size, align, VMALLOC_START, VMALLOC_END);
+ fbl_unlock(z, FREE);
+
+out:
+ atomic_set(&z->fill_in_progress, 0);
+ return addr;
+}
+
+static unsigned long
+this_cpu_zone_alloc(unsigned long size, unsigned long align, gfp_t gfp_mask, int node)
+{
+ struct cpu_vmap_zone *z = raw_cpu_ptr(&cpu_vmap_zone);
+ unsigned long extra = align > PAGE_SIZE ? align : 0;
+ unsigned long addr = VMALLOC_END, left = 0;
+
+ /*
+ * It is disabled, fallback to a global heap.
+ */
+ if (cvz_size == ULONG_MAX)
+ return addr;
+
+ /*
+ * Any allocation bigger/equal than one half of
+ * a zone-size will fallback to a global heap.
+ */
+ if (cvz_size / (size + extra) < 3)
+ return addr;
+
+ if (RB_EMPTY_ROOT(&fbl_root(z, FREE)))
+ goto fill;
+
+ fbl_lock(z, FREE);
+ addr = __alloc_vmap_area(&fbl_root(z, FREE), &fbl_head(z, FREE),
+ size, align, VMALLOC_START, VMALLOC_END);
+
+ if (addr == VMALLOC_END)
+ left = get_subtree_max_size(fbl_root(z, FREE).rb_node);
+ fbl_unlock(z, FREE);
+
+fill:
+ /*
+ * A low watermark is 3 pages.
+ */
+ if (addr == VMALLOC_END && left < 4 * PAGE_SIZE)
+ addr = this_cpu_zone_alloc_fill(z, size, align, gfp_mask, node);
+
+ return addr;
+}
+
/*
* Allocate a region of KVA of the specified size and alignment, within the
* vstart and vend.
@@ -1678,11 +1765,21 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
*/
kmemleak_scan_area(&va->rb_node, SIZE_MAX, gfp_mask);

+ /*
+ * Fast path allocation, start with it.
+ */
+ if (vstart == VMALLOC_START && vend == VMALLOC_END)
+ addr = this_cpu_zone_alloc(size, align, gfp_mask, node);
+ else
+ addr = vend;
+
retry:
- preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, node);
- addr = __alloc_vmap_area(&free_vmap_area_root, &free_vmap_area_list,
- size, align, vstart, vend);
- spin_unlock(&free_vmap_area_lock);
+ if (addr == vend) {
+ preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, node);
+ addr = __alloc_vmap_area(&free_vmap_area_root, &free_vmap_area_list,
+ size, align, vstart, vend);
+ spin_unlock(&free_vmap_area_lock);
+ }

trace_alloc_vmap_area(addr, size, align, vstart, vend, addr == vend);

@@ -1827,6 +1924,27 @@ purge_cpu_vmap_zone(struct cpu_vmap_zone *z)
return num_purged_areas;
}

+static void
+drop_cpu_vmap_cache(struct cpu_vmap_zone *z)
+{
+ struct vmap_area *va, *n_va;
+ LIST_HEAD(free_head);
+
+ if (RB_EMPTY_ROOT(&fbl_root(z, FREE)))
+ return;
+
+ fbl_lock(z, FREE);
+ WRITE_ONCE(fbl(z, FREE, root.rb_node), NULL);
+ list_replace_init(&fbl_head(z, FREE), &free_head);
+ fbl_unlock(z, FREE);
+
+ spin_lock(&free_vmap_area_lock);
+ list_for_each_entry_safe(va, n_va, &free_head, list)
+ merge_or_add_vmap_area_augment(va,
+ &free_vmap_area_root, &free_vmap_area_list);
+ spin_unlock(&free_vmap_area_lock);
+}
+
/*
* Purges all lazily-freed vmap areas.
*/
@@ -1868,6 +1986,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
for_each_possible_cpu(i) {
z = per_cpu_ptr(&cpu_vmap_zone, i);
num_purged_areas += purge_cpu_vmap_zone(z);
+ drop_cpu_vmap_cache(z);
}
}

--
2.30.2


2023-05-23 06:14:19

by Christoph Hellwig

Subject: Re: [PATCH 2/9] mm: vmalloc: Rename adjust_va_to_fit_type() function

On Mon, May 22, 2023 at 01:08:42PM +0200, Uladzislau Rezki (Sony) wrote:
> static __always_inline int
> -adjust_va_to_fit_type(struct rb_root *root, struct list_head *head,
> - struct vmap_area *va, unsigned long nva_start_addr,
> - unsigned long size)
> +va_clip(struct rb_root *root, struct list_head *head,
> + struct vmap_area *va, unsigned long nva_start_addr,
> + unsigned long size)

Same comment as before, not going to repeat it for further reviews.

Otherwise looks good:

Reviewed-by: Christoph Hellwig <[email protected]>

2023-05-23 06:16:52

by Christoph Hellwig

Subject: Re: [PATCH 5/9] mm: vmalloc: Insert busy-VA per-cpu zone

> /* Look up the first VA which satisfies addr < va_end, NULL if none. */
> -static struct vmap_area *find_vmap_area_exceed_addr(unsigned long addr)
> +static struct vmap_area *find_vmap_area_exceed_addr(unsigned long addr, struct rb_root *root)

Please avoid the overly long line.

> + struct cpu_vmap_zone *z = addr_to_cvz(va->va_start);
> +
> /*
> * Remove from the busy tree/list.
> */
> - spin_lock(&vmap_area_lock);
> - unlink_va(va, &vmap_area_root);
> - spin_unlock(&vmap_area_lock);
> + fbl_lock(z, BUSY);
> + unlink_va(va, &fbl_root(z, BUSY));
> + fbl_unlock(z, BUSY);

I find the BUSY magic here very confusing, and would prefer to
just spell the actual lock reference out.

2023-05-23 06:17:33

by Christoph Hellwig

Subject: Re: [PATCH 4/9] mm: vmalloc: Add a per-CPU-zone infrastructure

On Mon, May 22, 2023 at 01:08:44PM +0200, Uladzislau Rezki (Sony) wrote:
> +#define fbl(z, i, m) z->fbl[i].m
> +#define fbl_root(z, i) fbl(z, i, root)
> +#define fbl_head(z, i) fbl(z, i, head)
> +
> +#define fbl_lock(z, i) spin_lock(&fbl(z, i, lock))
> +#define fbl_unlock(z, i) spin_unlock(&fbl(z, i, lock))

Even if it is just temporary, I don't think adding these wrappers
make much sense.

> +struct cpu_vmap_zone {
> + /*
> + * FREE, BUSY, LAZY bookkeeping data of this CPU zone.
> + */
> + struct {
> + struct rb_root root;
> + struct list_head head;
> + spinlock_t lock;
> + } fbl[NFBL];

Maybe replace NFBL with something longer and more descriptive?

But also in general it feels like this should be folded into a patch
doing real work. As-is it doesn't look very useful.

2023-05-23 10:35:39

by Uladzislau Rezki (Sony)

Subject: Re: [PATCH 2/9] mm: vmalloc: Rename adjust_va_to_fit_type() function

On Mon, May 22, 2023 at 11:06:31PM -0700, Christoph Hellwig wrote:
> On Mon, May 22, 2023 at 01:08:42PM +0200, Uladzislau Rezki (Sony) wrote:
> > static __always_inline int
> > -adjust_va_to_fit_type(struct rb_root *root, struct list_head *head,
> > - struct vmap_area *va, unsigned long nva_start_addr,
> > - unsigned long size)
> > +va_clip(struct rb_root *root, struct list_head *head,
> > + struct vmap_area *va, unsigned long nva_start_addr,
> > + unsigned long size)
>
> Same comment as before, not going to repeat it for futher reviews.
>
Fixed!


> Otherwise looks good:
>
> Reviewed-by: Christoph Hellwig <[email protected]>
>
Thanks!

--
Uladzislau Rezki

2023-05-23 12:19:19

by Hyeonggon Yoo

Subject: Re: [PATCH 0/9] Mitigate a vmap lock contention

On Mon, May 22, 2023 at 01:08:40PM +0200, Uladzislau Rezki (Sony) wrote:
> Hello, folk.
>
> 1. This is a followup of the vmap topic that was highlighted at the LSFMMBPF-2023
> conference. This small serial attempts to mitigate the contention across the
> vmap/vmalloc code. The problem is described here:
>

Hello Uladzislau, thank you for the work!

> wget ftp://vps418301.ovh.net/incoming/Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf

I ran exactly the same command but couldn't download the file; did I
miss something?

$ wget ftp://vps418301.ovh.net/incoming/Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf
[...]
==> PASV ... done. ==> RETR Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf ...
No such file `Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf'.

> The material is tagged as a v2 version. It contains extra slides about testing
> the throughput, steps and comparison with a current approach.
>
> 2. Motivation.
>
> - The vmap code is not scalled to number of CPUs and this should be fixed;
> - XFS folk has complained several times that vmalloc might be contented on
> their workloads:
>
> <snip>
> commit 8dc9384b7d75012856b02ff44c37566a55fc2abf
> Author: Dave Chinner <[email protected]>
> Date: Tue Jan 4 17:22:18 2022 -0800
>
> xfs: reduce kvmalloc overhead for CIL shadow buffers
>
> Oh, let me count the ways that the kvmalloc API sucks dog eggs.
>
> The problem is when we are logging lots of large objects, we hit
> kvmalloc really damn hard with costly order allocations, and
> behaviour utterly sucks:

Based on the commit, I guess the reason xfs uses vmalloc/kvmalloc is that
it allocates large buffers; how large could they be?

> 3. Test
>
> On my: AMD Ryzen Threadripper 3970X 32-Core Processor, i have below figures:
>
> 1-page 1-page-this-patch
> 1 0.576131 vs 0.555889
> 2 2.68376 vs 1.07895
> 3 4.26502 vs 1.01739
> 4 6.04306 vs 1.28924
> 5 8.04786 vs 1.57616
> 6 9.38844 vs 1.78142

<snip>

> 29 20.06 vs 3.59869
> 30 20.4353 vs 3.6991
> 31 20.9082 vs 3.73028
> 32 21.0865 vs 3.82904
>
> 1..32 - is a number of jobs. The results are in usec and is a vmallco()/vfree()
> pair throughput.

I would be more interested in real numbers than in synthetic benchmarks.
Maybe the XFS folks could help by performing profiling similar to commit
8dc9384b7d750 with and without this patchset?

By the way looking at the commit, teaching __p?d_alloc() about gfp
context (that I'm _slowly_ working on...) could be nice for allowing
non-GFP_KERNEL kvmalloc allocations, as Matthew mentioned. [1]

Thanks!

[1] https://lore.kernel.org/linux-mm/Y%[email protected]

--
Hyeonggon Yoo

Doing kernel stuff as a hobby
Undergraduate | Chungnam National University
Dept. Computer Science & Engineering

2023-05-23 15:13:14

by Uladzislau Rezki (Sony)

Subject: Re: [PATCH 5/9] mm: vmalloc: Insert busy-VA per-cpu zone

On Mon, May 22, 2023 at 11:12:08PM -0700, Christoph Hellwig wrote:
> > /* Look up the first VA which satisfies addr < va_end, NULL if none. */
> > -static struct vmap_area *find_vmap_area_exceed_addr(unsigned long addr)
> > +static struct vmap_area *find_vmap_area_exceed_addr(unsigned long addr, struct rb_root *root)
>
> Please avoid the overly long line.
>
Will fix it.

> > + struct cpu_vmap_zone *z = addr_to_cvz(va->va_start);
> > +
> > /*
> > * Remove from the busy tree/list.
> > */
> > - spin_lock(&vmap_area_lock);
> > - unlink_va(va, &vmap_area_root);
> > - spin_unlock(&vmap_area_lock);
> > + fbl_lock(z, BUSY);
> > + unlink_va(va, &fbl_root(z, BUSY));
> > + fbl_unlock(z, BUSY);
>
> I find the BUSY magic here very confusing, and would prefer to
> just spell the actual lock reference out.
>
No problem. I can make it open-coded. What about the access to the tree/list?

>> unlink_va(va, &fbl_root(z, BUSY));
I mean this one.

--
Uladzislau Rezki

2023-05-23 15:13:48

by Uladzislau Rezki (Sony)

Subject: Re: [PATCH 4/9] mm: vmalloc: Add a per-CPU-zone infrastructure

On Mon, May 22, 2023 at 11:08:38PM -0700, Christoph Hellwig wrote:
> On Mon, May 22, 2023 at 01:08:44PM +0200, Uladzislau Rezki (Sony) wrote:
> > +#define fbl(z, i, m) z->fbl[i].m
> > +#define fbl_root(z, i) fbl(z, i, root)
> > +#define fbl_head(z, i) fbl(z, i, head)
> > +
> > +#define fbl_lock(z, i) spin_lock(&fbl(z, i, lock))
> > +#define fbl_unlock(z, i) spin_unlock(&fbl(z, i, lock))
>
> Even if it is just temporary, I don't think adding these wrappers
> make much sense.
>
If open-coded, it looks like:

spin_lock(&z->fbl[BUSY].lock);
fbl_lock(z, BUSY);

The reason for adding such helpers is to make the names shorter.

> > +struct cpu_vmap_zone {
> > + /*
> > + * FREE, BUSY, LAZY bookkeeping data of this CPU zone.
> > + */
> > + struct {
> > + struct rb_root root;
> > + struct list_head head;
> > + spinlock_t lock;
> > + } fbl[NFBL];
>
> Maybe replace NFBL with something longer and more descriptive?
>
> But also in general it feels like this should be folded into a patch
> doing real work. As-is it doesn't look very useful.
>
I split it out for easier review, but I can fold it in.

--
Uladzislau Rezki

2023-05-23 15:17:06

by Christoph Hellwig

Subject: Re: [PATCH 4/9] mm: vmalloc: Add a per-CPU-zone infrastructure

On Tue, May 23, 2023 at 04:53:25PM +0200, Uladzislau Rezki wrote:
> > > +#define fbl_lock(z, i) spin_lock(&fbl(z, i, lock))
> > > +#define fbl_unlock(z, i) spin_unlock(&fbl(z, i, lock))
> >
> > Even if it is just temporary, I don't think adding these wrappers
> > make much sense.
> >
> If open-coded, it looks like:
>
> spin_lock(&z->fbl[BUSY].lock);

Give the fbl structure a name and you can have a local variable for it,
which will make all this a lot more readable. And then unless there is
a really good reason to iterate over this as an array just have three
of these structs embedded named free, busy and lazy.

2023-05-23 15:22:08

by Uladzislau Rezki (Sony)

Subject: Re: [PATCH 0/9] Mitigate a vmap lock contention

On Tue, May 23, 2023 at 08:59:05PM +0900, Hyeonggon Yoo wrote:
> On Mon, May 22, 2023 at 01:08:40PM +0200, Uladzislau Rezki (Sony) wrote:
> > Hello, folk.
> >
> > 1. This is a followup of the vmap topic that was highlighted at the LSFMMBPF-2023
> > conference. This small serial attempts to mitigate the contention across the
> > vmap/vmalloc code. The problem is described here:
> >
>
> Hello Uladzislau, thank you for the work!
>
> > wget ftp://vps418301.ovh.net/incoming/Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf
>
> I ran the exactly same command but couldn't download the file, did I
> miss something?
>
wget ftp://vps418301.ovh.net/incoming/Mitigate_a_vmalloc_lock_contention_in_SMP_env_v2.pdf

Sorry, I renamed the file :)

> $ wget ftp://vps418301.ovh.net/incoming/Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf
> [...]
> ==> PASV ... done. ==> RETR Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf ...
> No such file `Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf'.
>
> > The material is tagged as a v2 version. It contains extra slides about testing
> > the throughput, steps and comparison with a current approach.
> >
> > 2. Motivation.
> >
> > - The vmap code is not scalled to number of CPUs and this should be fixed;
> > - XFS folk has complained several times that vmalloc might be contented on
> > their workloads:
> >
> > <snip>
> > commit 8dc9384b7d75012856b02ff44c37566a55fc2abf
> > Author: Dave Chinner <[email protected]>
> > Date: Tue Jan 4 17:22:18 2022 -0800
> >
> > xfs: reduce kvmalloc overhead for CIL shadow buffers
> >
> > Oh, let me count the ways that the kvmalloc API sucks dog eggs.
> >
> > The problem is when we are logging lots of large objects, we hit
> > kvmalloc really damn hard with costly order allocations, and
> > behaviour utterly sucks:
>
> based on the commit I guess xfs should use vmalloc/kvmalloc is because
> it allocates large buffers, how large could it be?
>
They use kvmalloc(). When the page allocator is not able to serve a
request, it falls back to vmalloc. From what I see, the sizes are:

from 73728 up to 1048576 bytes, i.e. from 18 up to 256 pages.

> > 3. Test
> >
> > On my: AMD Ryzen Threadripper 3970X 32-Core Processor, i have below figures:
> >
> > 1-page 1-page-this-patch
> > 1 0.576131 vs 0.555889
> > 2 2.68376 vs 1.07895
> > 3 4.26502 vs 1.01739
> > 4 6.04306 vs 1.28924
> > 5 8.04786 vs 1.57616
> > 6 9.38844 vs 1.78142
>
> <snip>
>
> > 29 20.06 vs 3.59869
> > 30 20.4353 vs 3.6991
> > 31 20.9082 vs 3.73028
> > 32 21.0865 vs 3.82904
> >
> > 1..32 - is a number of jobs. The results are in usec and is a vmallco()/vfree()
> > pair throughput.
>
> I would be more interested in real numbers than synthetic benchmarks,
> Maybe XFS folks could help performing profiling similar to commit 8dc9384b7d750
> with and without this patchset?
>
I added Dave Chinner <[email protected]> to this thread. But the
contention exists. Apart from that, the per-cpu KVA allocator can go
away if we make this approach generic instead.

> By the way looking at the commit, teaching __p?d_alloc() about gfp
> context (that I'm _slowly_ working on...) could be nice for allowing
> non-GFP_KERNEL kvmalloc allocations, as Matthew mentioned. [1]
>
> Thanks!
>
> [1] https://lore.kernel.org/linux-mm/Y%[email protected]
>

Thanks!

--
Uladzislau Rezki

2023-05-23 15:51:42

by Uladzislau Rezki (Sony)

Subject: Re: [PATCH 4/9] mm: vmalloc: Add a per-CPU-zone infrastructure

On Tue, May 23, 2023 at 08:13:47AM -0700, Christoph Hellwig wrote:
> On Tue, May 23, 2023 at 04:53:25PM +0200, Uladzislau Rezki wrote:
> > > > +#define fbl_lock(z, i) spin_lock(&fbl(z, i, lock))
> > > > +#define fbl_unlock(z, i) spin_unlock(&fbl(z, i, lock))
> > >
> > > Even if it is just temporary, I don't think adding these wrappers
> > > make much sense.
> > >
> > If open-coded, it looks like:
> >
> > spin_lock(&z->fbl[BUSY].lock);
>
> Give the fbl structure a name and you can have a local variable for it,
> which will make all this a lot more readable. And then unless there is
> a really good reason to iterate over this as an array just have three
> of these structs embedded named free, busy and lazy.
>
OK. I can go that way.

--
Uladzislau Rezki

2023-05-23 17:30:46

by Liam R. Howlett

Subject: Re: [PATCH 2/9] mm: vmalloc: Rename adjust_va_to_fit_type() function

* Uladzislau Rezki (Sony) <[email protected]> [230522 07:09]:
> This patch renames the adjust_va_to_fit_type() function
> to a shorter variant and more expressive. A new name is
> va_clip().
>
> There is no a functional change as a result of this patch.
>
> Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
> ---
> mm/vmalloc.c | 13 ++++++-------
> 1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 409285b68a67..5f900efec6a9 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1383,9 +1383,9 @@ classify_va_fit_type(struct vmap_area *va,
> }
>
> static __always_inline int
> -adjust_va_to_fit_type(struct rb_root *root, struct list_head *head,
> - struct vmap_area *va, unsigned long nva_start_addr,
> - unsigned long size)
> +va_clip(struct rb_root *root, struct list_head *head,
> + struct vmap_area *va, unsigned long nva_start_addr,
> + unsigned long size)
> {
> struct vmap_area *lva = NULL;
> enum fit_type type = classify_va_fit_type(va, nva_start_addr, size);
> @@ -1501,7 +1501,7 @@ va_alloc(struct vmap_area *va,
> return vend;
>
> /* Update the free vmap_area. */
> - ret = adjust_va_to_fit_type(root, head, va, nva_start_addr, size);
> + ret = va_clip(root, head, va, nva_start_addr, size);
> if (WARN_ON_ONCE(ret))

ret is only used in this WARN_ON_ONCE() check (from patch 1). Since you
have shortened adjust_va_to_fit_type(), you can probably drop the ret
variable altogether here?

> return vend;
>
> @@ -3979,9 +3979,8 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
> /* It is a BUG(), but trigger recovery instead. */
> goto recovery;
>
> - ret = adjust_va_to_fit_type(&free_vmap_area_root,
> - &free_vmap_area_list,
> - va, start, size);
> + ret = va_clip(&free_vmap_area_root,
> + &free_vmap_area_list, va, start, size);
> if (WARN_ON_ONCE(unlikely(ret)))
> /* It is a BUG(), but trigger recovery instead. */
> goto recovery;
> --
> 2.30.2
>

2023-05-23 18:25:11

by Hyeonggon Yoo

Subject: Re: [PATCH 0/9] Mitigate a vmap lock contention

On Tue, May 23, 2023 at 05:12:30PM +0200, Uladzislau Rezki wrote:
> > > 2. Motivation.
> > >
> > > - The vmap code is not scalled to number of CPUs and this should be fixed;
> > > - XFS folk has complained several times that vmalloc might be contented on
> > > their workloads:
> > >
> > > <snip>
> > > commit 8dc9384b7d75012856b02ff44c37566a55fc2abf
> > > Author: Dave Chinner <[email protected]>
> > > Date: Tue Jan 4 17:22:18 2022 -0800
> > >
> > > xfs: reduce kvmalloc overhead for CIL shadow buffers
> > >
> > > Oh, let me count the ways that the kvmalloc API sucks dog eggs.
> > >
> > > The problem is when we are logging lots of large objects, we hit
> > > kvmalloc really damn hard with costly order allocations, and
> > > behaviour utterly sucks:
> >
> > based on the commit I guess xfs should use vmalloc/kvmalloc is because
> > it allocates large buffers, how large could it be?
> >
> They use kvmalloc(). When the page allocator is not able to serve a
> request they fallback to vmalloc. At least what i see, the sizes are:
>
> from 73728 up to 1048576, i.e. 18 pages up to 256 pages.
>
> > > 3. Test
> > >
> > > On my: AMD Ryzen Threadripper 3970X 32-Core Processor, i have below figures:
> > >
> > > 1-page 1-page-this-patch
> > > 1 0.576131 vs 0.555889
> > > 2 2.68376 vs 1.07895
> > > 3 4.26502 vs 1.01739
> > > 4 6.04306 vs 1.28924
> > > 5 8.04786 vs 1.57616
> > > 6 9.38844 vs 1.78142
> >
> > <snip>
> >
> > > 29 20.06 vs 3.59869
> > > 30 20.4353 vs 3.6991
> > > 31 20.9082 vs 3.73028
> > > 32 21.0865 vs 3.82904
> > >
> > > 1..32 - is a number of jobs. The results are in usec and is a vmallco()/vfree()
> > > pair throughput.
> >
> > I would be more interested in real numbers than synthetic benchmarks,
> > Maybe XFS folks could help performing profiling similar to commit 8dc9384b7d750
> > with and without this patchset?
> >
> I added Dave Chinner <[email protected]> to this thread.

Oh, I missed that, and it would be better to [+Cc linux-xfs]

> But. The contention exists.

I think "theoretically can be contended" doesn't necessarily mean it's actually
contended in the real world.

Also I find it difficult to imagine vmalloc being highly contended because it was
historically considered slow and thus discouraged when performance is important.

IOW vmalloc would not be contended when the allocation size is small, because we
have the kmalloc/buddy APIs, and therefore I wonder which workloads allocate very
large buffers very frequently and are thus performance-sensitive.

I am not against this series, but wondering which workloads would benefit ;)

> Apart of that per-cpu-KVA allocator can go away if we make it generic instead.

Not sure I understand your point, can you elaborate please?

And I would like to ask some side questions:

1. Is the vm_[un]map_ram() API still worthwhile with this patchset?

2. How does this patchset deal with 32-bit machines, where the
vmalloc address space is limited?

Thanks!

> > By the way looking at the commit, teaching __p?d_alloc() about gfp
> > context (that I'm _slowly_ working on...) could be nice for allowing
> > non-GFP_KERNEL kvmalloc allocations, as Matthew mentioned. [1]
> >
> > Thanks!
> >
> > [1] https://lore.kernel.org/linux-mm/Y%[email protected]
> >

--
Hyeonggon Yoo

Doing kernel stuff as a hobby
Undergraduate | Chungnam National University
Dept. Computer Science & Engineering

2023-05-24 01:45:34

by Hyeonggon Yoo

[permalink] [raw]
Subject: Re: [PATCH 0/9] Mitigate a vmap lock contention

On Wed, May 24, 2023 at 07:43:38AM +1000, Dave Chinner wrote:
> On Wed, May 24, 2023 at 03:04:28AM +0900, Hyeonggon Yoo wrote:
> > On Tue, May 23, 2023 at 05:12:30PM +0200, Uladzislau Rezki wrote:
> > > > > 2. Motivation.
> > > > >
> > > > > - The vmap code is not scalled to number of CPUs and this should be fixed;
> > > > > - XFS folk has complained several times that vmalloc might be contented on
> > > > > their workloads:
> > > > >
> > > > > <snip>
> > > > > commit 8dc9384b7d75012856b02ff44c37566a55fc2abf
> > > > > Author: Dave Chinner <[email protected]>
> > > > > Date: Tue Jan 4 17:22:18 2022 -0800
> > > > >
> > > > > xfs: reduce kvmalloc overhead for CIL shadow buffers
> > > > >
> > > > > Oh, let me count the ways that the kvmalloc API sucks dog eggs.
> > > > >
> > > > > The problem is when we are logging lots of large objects, we hit
> > > > > kvmalloc really damn hard with costly order allocations, and
> > > > > behaviour utterly sucks:
> > > >
> > > > based on the commit I guess xfs should use vmalloc/kvmalloc is because
> > > > it allocates large buffers, how large could it be?
> > > >
> > > They use kvmalloc(). When the page allocator is not able to serve a
> > > request they fallback to vmalloc. At least what i see, the sizes are:
> > >
> > > from 73728 up to 1048576, i.e. 18 pages up to 256 pages.
> > >
> > > > > 3. Test
> > > > >
> > > > > On my: AMD Ryzen Threadripper 3970X 32-Core Processor, i have below figures:
> > > > >
> > > > > 1-page 1-page-this-patch
> > > > > 1 0.576131 vs 0.555889
> > > > > 2 2.68376 vs 1.07895
> > > > > 3 4.26502 vs 1.01739
> > > > > 4 6.04306 vs 1.28924
> > > > > 5 8.04786 vs 1.57616
> > > > > 6 9.38844 vs 1.78142
> > > >
> > > > <snip>
> > > >
> > > > > 29 20.06 vs 3.59869
> > > > > 30 20.4353 vs 3.6991
> > > > > 31 20.9082 vs 3.73028
> > > > > 32 21.0865 vs 3.82904
> > > > >
> > > > > 1..32 - is a number of jobs. The results are in usec and is a vmallco()/vfree()
> > > > > pair throughput.
> > > >
> > > > I would be more interested in real numbers than synthetic benchmarks,
> > > > Maybe XFS folks could help performing profiling similar to commit 8dc9384b7d750
> > > > with and without this patchset?
> > > >
> > > I added Dave Chinner <[email protected]> to this thread.
> >
> > Oh, I missed that, and it would be better to [+Cc linux-xfs]
> >
> > > But. The contention exists.
> >
> > I think "theoretically can be contended" doesn't necessarily mean it's actually
> > contended in the real world.
>
> Did you not read the commit message for the XFS commit documented
> above? vmalloc lock contention most certainly does exist in the
> real world and the profiles in commit 8dc9384b7d75 ("xfs: reduce
> kvmalloc overhead for CIL shadow buffers") document it clearly.
>
> > Also I find it difficult to imagine vmalloc being highly contended because it was
> > historically considered slow and thus discouraged when performance is important.
>
> Read the above XFS commit.
>
> We use vmalloc in critical high performance fast paths that cannot
> tolerate high order memory allocation failure. XFS runs this
> fast path millions of times a second, and will call into
> vmalloc() several hundred thousands times a second with machine wide
> concurrency under certain types of workloads.
>
> > IOW vmalloc would not be contended when allocation size is small because we have
> > kmalloc/buddy API, and therefore I wonder which workloads are allocating very large
> > buffers and at the same time allocating very frequently, thus performance-sensitive.
> >
> > I am not against this series, but wondering which workloads would benefit ;)
>
> Yup, you need to read the XFS commit message. If you understand what
> is in that commit message, then you wouldn't be doubting that
> vmalloc contention is real and that it is used in high performance
> fast paths that are traversed millions of times a second....

Oh, I did read the commit, but it seems to have slipped my mind while reading it.
Sorry for such a dumb question; now I get it, and thank you so much. In any case
I didn't mean to offend, and I should have read and thought more before asking.

>
> > > Apart of that per-cpu-KVA allocator can go away if we make it generic instead.
> >
> > Not sure I understand your point, can you elaborate please?
> >
> > And I would like to ask some side questions:
> >
> > 1. Is vm_[un]map_ram() API still worth with this patchset?
>
> XFS also uses this interface for mapping multi-page buffers in the
> XFS buffer cache. These are the items that also require the high
> order costly kvmalloc allocations in the transaction commit path
> when they are modified.
>
> So, yes, we need these mapping interfaces to scale just as well as
> vmalloc itself....

I mean, even before this series, vm_[un]map_ram() caches vmap_blocks
per CPU, but it has a limitation on the size that can be cached per CPU.

But now that vmap() itself becomes scalable with this series, I wonder
whether they are still worthwhile; why not replace them with v[un]map()?
>
> > 2. How does this patchset deals with 32-bit machines where
> > vmalloc address space is limited?
>
> From the XFS side, we just don't care about 32 bit machines at all.
> XFS is aimed at server and HPC environments which have been entirely
> 64 bit for a long, long time now...

But Linux still supports 32-bit machines and is not going to drop
support for them anytime soon, so I think there should be at least a way to
disable this feature.

Thanks!

--
Hyeonggon Yoo

Doing kernel stuff as a hobby
Undergraduate | Chungnam National University
Dept. Computer Science & Engineering

2023-05-24 10:04:02

by Uladzislau Rezki (Sony)

[permalink] [raw]
Subject: Re: [PATCH 0/9] Mitigate a vmap lock contention

On Wed, May 24, 2023 at 03:04:28AM +0900, Hyeonggon Yoo wrote:
> On Tue, May 23, 2023 at 05:12:30PM +0200, Uladzislau Rezki wrote:
> > > > 2. Motivation.
> > > >
> > > > - The vmap code is not scalled to number of CPUs and this should be fixed;
> > > > - XFS folk has complained several times that vmalloc might be contented on
> > > > their workloads:
> > > >
> > > > <snip>
> > > > commit 8dc9384b7d75012856b02ff44c37566a55fc2abf
> > > > Author: Dave Chinner <[email protected]>
> > > > Date: Tue Jan 4 17:22:18 2022 -0800
> > > >
> > > > xfs: reduce kvmalloc overhead for CIL shadow buffers
> > > >
> > > > Oh, let me count the ways that the kvmalloc API sucks dog eggs.
> > > >
> > > > The problem is when we are logging lots of large objects, we hit
> > > > kvmalloc really damn hard with costly order allocations, and
> > > > behaviour utterly sucks:
> > >
> > > based on the commit I guess xfs should use vmalloc/kvmalloc is because
> > > it allocates large buffers, how large could it be?
> > >
> > They use kvmalloc(). When the page allocator is not able to serve a
> > request they fallback to vmalloc. At least what i see, the sizes are:
> >
> > from 73728 up to 1048576, i.e. 18 pages up to 256 pages.
> >
> > > > 3. Test
> > > >
> > > > On my: AMD Ryzen Threadripper 3970X 32-Core Processor, i have below figures:
> > > >
> > > > 1-page 1-page-this-patch
> > > > 1 0.576131 vs 0.555889
> > > > 2 2.68376 vs 1.07895
> > > > 3 4.26502 vs 1.01739
> > > > 4 6.04306 vs 1.28924
> > > > 5 8.04786 vs 1.57616
> > > > 6 9.38844 vs 1.78142
> > >
> > > <snip>
> > >
> > > > 29 20.06 vs 3.59869
> > > > 30 20.4353 vs 3.6991
> > > > 31 20.9082 vs 3.73028
> > > > 32 21.0865 vs 3.82904
> > > >
> > > > 1..32 - is a number of jobs. The results are in usec and is a vmallco()/vfree()
> > > > pair throughput.
> > >
> > > I would be more interested in real numbers than synthetic benchmarks,
> > > Maybe XFS folks could help performing profiling similar to commit 8dc9384b7d750
> > > with and without this patchset?
> > >
> > I added Dave Chinner <[email protected]> to this thread.
>
> Oh, I missed that, and it would be better to [+Cc linux-xfs]
>
> > But. The contention exists.
>
> I think "theoretically can be contended" doesn't necessarily mean it's actually
> contended in the real world.
>
> Also I find it difficult to imagine vmalloc being highly contended because it was
> historically considered slow and thus discouraged when performance is important.
>
> IOW vmalloc would not be contended when allocation size is small because we have
> kmalloc/buddy API, and therefore I wonder which workloads are allocating very large
> buffers and at the same time allocating very frequently, thus performance-sensitive.
>
> I am not against this series, but wondering which workloads would benefit ;)
>
> > Apart of that per-cpu-KVA allocator can go away if we make it generic instead.
>
> Not sure I understand your point, can you elaborate please?
>
> And I would like to ask some side questions:
>
> 1. Is vm_[un]map_ram() API still worth with this patchset?
>
It is up to the community to decide. As I see it, XFS needs it as well. Maybe in
the future it can be removed (who knows), if the vmalloc code itself can
deliver the same performance as the vm_map* APIs.

>
> 2. How does this patchset deals with 32-bit machines where
> vmalloc address space is limited?
>
It can handle them without any problems, though I am not sure it is needed for
32-bit systems. The reason is that the vmalloc code used to be a bit slow when
it comes to lookup time: it used to be O(n) and was later improved to O(log n).

The vm_map_ram() and friends interface was added because of those vmalloc drawbacks.
I am not sure there are 32-bit systems with 10/20/30... CPUs on board; only in
that case would it be worth caring about contention.

--
Uladzislau Rezki

2023-05-24 12:00:59

by Uladzislau Rezki (Sony)

[permalink] [raw]
Subject: Re: [PATCH 2/9] mm: vmalloc: Rename adjust_va_to_fit_type() function

On Tue, May 23, 2023 at 01:24:09PM -0400, Liam R. Howlett wrote:
> * Uladzislau Rezki (Sony) <[email protected]> [230522 07:09]:
> > This patch renames the adjust_va_to_fit_type() function
> > to a shorter variant and more expressive. A new name is
> > va_clip().
> >
> > There is no a functional change as a result of this patch.
> >
> > Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
> > ---
> > mm/vmalloc.c | 13 ++++++-------
> > 1 file changed, 6 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 409285b68a67..5f900efec6a9 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -1383,9 +1383,9 @@ classify_va_fit_type(struct vmap_area *va,
> > }
> >
> > static __always_inline int
> > -adjust_va_to_fit_type(struct rb_root *root, struct list_head *head,
> > - struct vmap_area *va, unsigned long nva_start_addr,
> > - unsigned long size)
> > +va_clip(struct rb_root *root, struct list_head *head,
> > + struct vmap_area *va, unsigned long nva_start_addr,
> > + unsigned long size)
> > {
> > struct vmap_area *lva = NULL;
> > enum fit_type type = classify_va_fit_type(va, nva_start_addr, size);
> > @@ -1501,7 +1501,7 @@ va_alloc(struct vmap_area *va,
> > return vend;
> >
> > /* Update the free vmap_area. */
> > - ret = adjust_va_to_fit_type(root, head, va, nva_start_addr, size);
> > + ret = va_clip(root, head, va, nva_start_addr, size);
> > if (WARN_ON_ONCE(ret))
>
> ret is only used in this WARN_ON_ONCE() check (from patch 1), since you
> have shortened adjust_va_to_fit_type(), you can probably drop the ret
> variable all together here?
>
I can move the WARN_ON_ONCE() check into va_clip(), so that makes sense.

As for the return value, we should still check it, at least here:

	ret = va_clip(&free_vmap_area_root,
		      &free_vmap_area_list, va, start, size);
	if (WARN_ON_ONCE(unlikely(ret)))
		/* It is a BUG(), but trigger recovery instead. */
		goto recovery;



It can fail, though the probability of that is close to zero.
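For illustration only, moving the warning into va_clip() could look roughly
like the sketch below (not the actual follow-up change; __va_clip() is just a
placeholder name for the current clipping body):

	static __always_inline int
	va_clip(struct rb_root *root, struct list_head *head,
		struct vmap_area *va, unsigned long nva_start_addr,
		unsigned long size)
	{
		/* __va_clip() stands in for the existing clipping logic. */
		int ret = __va_clip(root, head, va, nva_start_addr, size);

		/* A failure is a BUG(), but let callers trigger recovery. */
		WARN_ON_ONCE(ret);
		return ret;
	}

Callers such as pcpu_get_vm_areas() would then only check the return value
and jump to their recovery path.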

Thank you for looking at it, Liam!

--
Uladzislau Rezki

2023-05-25 08:20:29

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 0/9] Mitigate a vmap lock contention

On Thu, May 25, 2023 at 07:56:56AM +1000, Dave Chinner wrote:
> > It is up to community to decide. As i see XFS needs it also. Maybe in
> > the future it can be removed(who knows). If the vmalloc code itself can
> > deliver such performance as vm_map* APIs.
>
> vm_map* APIs cannot be replaced with vmalloc, they cover a very
> different use case. i.e. vmalloc allocates mapped memory,
> vm_map_ram() maps allocated memory....
>
> > vm_map_ram() and friends interface was added because of vmalloc drawbacks.
>
> No. vm_map*() were scalability improvements added in 2009 to replace
> on vmap() and vunmap() to avoid global lock contention in the vmap
> allocator that XFS had been working around for years with it's own
> internal vmap cache....

All of that is true. At the same time XFS could very much switch to
vmalloc for !XBF_UNMAPPED && size > PAGE_SIZE buffers IFF that provided
an advantage. The need for vmap and then vm_map_* initially came from
the fact that we were using the page cache to back the xfs_buf (or
page_buf back then). With your work on getting rid of the pagecache
usage we could just use vmalloc now if we wanted to and it improves
performance. Or at some point we could even look into just using large
folios for that with the infrastructure for that improving a lot (but
I suspect we're not quite there yet).

But there are other uses of vm_map_* that can't do that, and they will
benefit from the additional scalability as well even IFF just using
vmalloc was fine for XFS now (I don't know, I haven't actually looked
into it).
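
To make the vmalloc() vs vm_map_ram() distinction above concrete, here is a
minimal usage sketch (illustrative kernel-style code, not taken from XFS or
from this series; error handling elided):

	/* vmalloc() allocates the pages and maps them in one call. */
	void *buf = vmalloc(16 * PAGE_SIZE);
	/* ... use buf ... */
	vfree(buf);

	/* vm_map_ram() only maps pages the caller has already allocated. */
	struct page *pages[16];
	void *addr;
	int i;

	for (i = 0; i < 16; i++)
		pages[i] = alloc_page(GFP_KERNEL);

	addr = vm_map_ram(pages, 16, NUMA_NO_NODE);
	/* ... use addr ... */
	vm_unmap_ram(addr, 16);

	for (i = 0; i < 16; i++)
		__free_page(pages[i]);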

2023-05-25 10:30:51

by Uladzislau Rezki (Sony)

[permalink] [raw]
Subject: Re: [PATCH 0/9] Mitigate a vmap lock contention

On Thu, May 25, 2023 at 07:56:56AM +1000, Dave Chinner wrote:
> On Wed, May 24, 2023 at 11:50:12AM +0200, Uladzislau Rezki wrote:
> > On Wed, May 24, 2023 at 03:04:28AM +0900, Hyeonggon Yoo wrote:
> > > On Tue, May 23, 2023 at 05:12:30PM +0200, Uladzislau Rezki wrote:
> > > And I would like to ask some side questions:
> > >
> > > 1. Is vm_[un]map_ram() API still worth with this patchset?
> > >
> > It is up to community to decide. As i see XFS needs it also. Maybe in
> > the future it can be removed(who knows). If the vmalloc code itself can
> > deliver such performance as vm_map* APIs.
>
> vm_map* APIs cannot be replaced with vmalloc, they cover a very
> different use case. i.e. vmalloc allocates mapped memory,
> vm_map_ram() maps allocated memory....
>
> > vm_map_ram() and friends interface was added because of vmalloc drawbacks.
>
> No. vm_map*() were scalability improvements added in 2009 to replace
> on vmap() and vunmap() to avoid global lock contention in the vmap
> allocator that XFS had been working around for years with it's own
> internal vmap cache....
>
> commit 95f8e302c04c0b0c6de35ab399a5551605eeb006
> Author: Nicholas Piggin <[email protected]>
> Date: Tue Jan 6 14:43:09 2009 +1100
>
> [XFS] use scalable vmap API
>
> Implement XFS's large buffer support with the new vmap APIs. See the vmap
> rewrite (db64fe02) for some numbers. The biggest improvement that comes from
> using the new APIs is avoiding the global KVA allocation lock on every call.
>
> Signed-off-by: Nick Piggin <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Signed-off-by: Lachlan McIlroy <[email protected]>
>
> vmap/vunmap() themselves were introduce in 2.5.32 (2002) and before
> that XFS was using remap_page_array() and vfree() in exactly the
> same way it uses vm_map_ram() and vm_unmap_ram() today....
>
> XFS has a long, long history of causing virtual memory allocator
> scalability and contention problems. As you can see, this isn't our
> first rodeo...
>
Let me be more specific; sorry, it looks like there is a misunderstanding.
I am talking about removing the vb_alloc()/vb_free() per-CPU stuff, if
alloc_vmap_area() gives the same performance:

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d50c551592fc..a1687bbdad30 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2503,12 +2503,6 @@ void vm_unmap_ram(const void *mem, unsigned int count)

kasan_poison_vmalloc(mem, size);

- if (likely(count <= VMAP_MAX_ALLOC)) {
- debug_check_no_locks_freed(mem, size);
- vb_free(addr, size);
- return;
- }
-
va = find_unlink_vmap_area(addr);
if (WARN_ON_ONCE(!va))
return;
@@ -2539,12 +2533,6 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node)
unsigned long addr;
void *mem;

- if (likely(count <= VMAP_MAX_ALLOC)) {
- mem = vb_alloc(size, GFP_KERNEL);
- if (IS_ERR(mem))
- return NULL;
- addr = (unsigned long)mem;
- } else {
struct vmap_area *va;
va = alloc_vmap_area(size, PAGE_SIZE,
VMALLOC_START, VMALLOC_END,
@@ -2554,7 +2542,6 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node)

addr = va->va_start;
mem = (void *)addr;
- }

if (vmap_pages_range(addr, addr + size, PAGE_KERNEL,
pages, PAGE_SHIFT) < 0) {


+ other related parts.

--
Uladzislau Rezki

2023-05-27 22:01:51

by Lorenzo Stoakes

[permalink] [raw]
Subject: Re: [PATCH 2/9] mm: vmalloc: Rename adjust_va_to_fit_type() function

On Mon, May 22, 2023 at 01:08:42PM +0200, Uladzislau Rezki (Sony) wrote:
> This patch renames the adjust_va_to_fit_type() function
> to a shorter variant and more expressive. A new name is
> va_clip().
>
> There is no a functional change as a result of this patch.

Small nit - I think:-

'This patch renames the adjust_va_to_fit_type() function to va_clip() which
is shorter and more expressive.'

Reads better here.

>
> Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
> ---
> mm/vmalloc.c | 13 ++++++-------
> 1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 409285b68a67..5f900efec6a9 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1383,9 +1383,9 @@ classify_va_fit_type(struct vmap_area *va,
> }
>
> static __always_inline int
> -adjust_va_to_fit_type(struct rb_root *root, struct list_head *head,
> - struct vmap_area *va, unsigned long nva_start_addr,
> - unsigned long size)
> +va_clip(struct rb_root *root, struct list_head *head,
> + struct vmap_area *va, unsigned long nva_start_addr,
> + unsigned long size)
> {
> struct vmap_area *lva = NULL;
> enum fit_type type = classify_va_fit_type(va, nva_start_addr, size);
> @@ -1501,7 +1501,7 @@ va_alloc(struct vmap_area *va,
> return vend;
>
> /* Update the free vmap_area. */
> - ret = adjust_va_to_fit_type(root, head, va, nva_start_addr, size);
> + ret = va_clip(root, head, va, nva_start_addr, size);
> if (WARN_ON_ONCE(ret))
> return vend;
>
> @@ -3979,9 +3979,8 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
> /* It is a BUG(), but trigger recovery instead. */
> goto recovery;
>
> - ret = adjust_va_to_fit_type(&free_vmap_area_root,
> - &free_vmap_area_list,
> - va, start, size);
> + ret = va_clip(&free_vmap_area_root,
> + &free_vmap_area_list, va, start, size);
> if (WARN_ON_ONCE(unlikely(ret)))
> /* It is a BUG(), but trigger recovery instead. */
> goto recovery;
> --
> 2.30.2
>

Otherwise,

Reviewed-by: Lorenzo Stoakes <[email protected]>

2023-05-27 22:36:15

by Lorenzo Stoakes

[permalink] [raw]
Subject: Re: [PATCH 3/9] mm: vmalloc: Move vmap_init_free_space() down in vmalloc.c

On Mon, May 22, 2023 at 01:08:43PM +0200, Uladzislau Rezki (Sony) wrote:
> A vmap_init_free_space() is a function that setups a vmap space
> and is considered as part of initialization phase. Since a main
> entry which is vmalloc_init(), has been moved down in vmalloc.c
> it makes sense to follow the pattern.
>
> There is no a functional change as a result of this patch.
>
> Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
> ---
> mm/vmalloc.c | 82 ++++++++++++++++++++++++++--------------------------
> 1 file changed, 41 insertions(+), 41 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 5f900efec6a9..19dcffb0d563 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2416,47 +2416,6 @@ void __init vm_area_register_early(struct vm_struct *vm, size_t align)
> kasan_populate_early_vm_area_shadow(vm->addr, vm->size);
> }
>
> -static void vmap_init_free_space(void)
> -{
> - unsigned long vmap_start = 1;
> - const unsigned long vmap_end = ULONG_MAX;
> - struct vmap_area *busy, *free;
> -
> - /*
> - * B F B B B F
> - * -|-----|.....|-----|-----|-----|.....|-
> - * | The KVA space |
> - * |<--------------------------------->|
> - */
> - list_for_each_entry(busy, &vmap_area_list, list) {
> - if (busy->va_start - vmap_start > 0) {
> - free = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
> - if (!WARN_ON_ONCE(!free)) {
> - free->va_start = vmap_start;
> - free->va_end = busy->va_start;
> -
> - insert_vmap_area_augment(free, NULL,
> - &free_vmap_area_root,
> - &free_vmap_area_list);
> - }
> - }
> -
> - vmap_start = busy->va_end;
> - }
> -
> - if (vmap_end - vmap_start > 0) {
> - free = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
> - if (!WARN_ON_ONCE(!free)) {
> - free->va_start = vmap_start;
> - free->va_end = vmap_end;
> -
> - insert_vmap_area_augment(free, NULL,
> - &free_vmap_area_root,
> - &free_vmap_area_list);
> - }
> - }
> -}
> -
> static inline void setup_vmalloc_vm_locked(struct vm_struct *vm,
> struct vmap_area *va, unsigned long flags, const void *caller)
> {
> @@ -4271,6 +4230,47 @@ module_init(proc_vmalloc_init);
>
> #endif
>
> +static void vmap_init_free_space(void)
> +{
> + unsigned long vmap_start = 1;
> + const unsigned long vmap_end = ULONG_MAX;
> + struct vmap_area *busy, *free;
> +
> + /*
> + * B F B B B F
> + * -|-----|.....|-----|-----|-----|.....|-
> + * | The KVA space |
> + * |<--------------------------------->|
> + */
> + list_for_each_entry(busy, &vmap_area_list, list) {
> + if (busy->va_start - vmap_start > 0) {
> + free = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
> + if (!WARN_ON_ONCE(!free)) {
> + free->va_start = vmap_start;
> + free->va_end = busy->va_start;
> +
> + insert_vmap_area_augment(free, NULL,
> + &free_vmap_area_root,
> + &free_vmap_area_list);
> + }
> + }
> +
> + vmap_start = busy->va_end;
> + }
> +
> + if (vmap_end - vmap_start > 0) {
> + free = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
> + if (!WARN_ON_ONCE(!free)) {
> + free->va_start = vmap_start;
> + free->va_end = vmap_end;
> +
> + insert_vmap_area_augment(free, NULL,
> + &free_vmap_area_root,
> + &free_vmap_area_list);
> + }
> + }
> +}
> +
> void __init vmalloc_init(void)
> {
> struct vmap_area *va;
> --
> 2.30.2
>

LGTM,

Reviewed-by: Lorenzo Stoakes <[email protected]>

2023-05-29 20:45:18

by Uladzislau Rezki (Sony)

[permalink] [raw]
Subject: Re: [PATCH 2/9] mm: vmalloc: Rename adjust_va_to_fit_type() function

On Sat, May 27, 2023 at 10:50:22PM +0100, Lorenzo Stoakes wrote:
> On Mon, May 22, 2023 at 01:08:42PM +0200, Uladzislau Rezki (Sony) wrote:
> > This patch renames the adjust_va_to_fit_type() function
> > to a shorter variant and more expressive. A new name is
> > va_clip().
> >
> > There is no a functional change as a result of this patch.
>
> Small nit - I think:-
>
> 'This patch renames the adjust_va_to_fit_type() function to va_clip() which
> is shorter and more expressive.'
>
> Reads better here.
>
I will update.

Thanks!

--
Uladzislau Rezki

2023-06-05 01:18:39

by Baoquan He

[permalink] [raw]
Subject: Re: [PATCH 8/9] mm: vmalloc: Offload free_vmap_area_lock global lock

On 05/22/23 at 01:08pm, Uladzislau Rezki (Sony) wrote:
......
> +static unsigned long
> +this_cpu_zone_alloc_fill(struct cpu_vmap_zone *z,
> + unsigned long size, unsigned long align,
> + gfp_t gfp_mask, int node)
> +{
> + unsigned long addr = VMALLOC_END;
> + struct vmap_area *va;
> +
> + /*
> + * It still can race. One task sets a progress to
> + * 1 a second one gets preempted on entry, the first
> + * zeroed the progress flag and second proceed with
> + * an extra prefetch.
> + */
> + if (atomic_xchg(&z->fill_in_progress, 1))
> + return addr;
> +
> + va = kmem_cache_alloc_node(vmap_area_cachep, gfp_mask, node);
> + if (unlikely(!va))
> + goto out;
> +
> + spin_lock(&free_vmap_area_lock);
> + addr = __alloc_vmap_area(&free_vmap_area_root, &free_vmap_area_list,
> + cvz_size, 1, VMALLOC_START, VMALLOC_END);
> + spin_unlock(&free_vmap_area_lock);

The 'z' is passed in from this_cpu_zone_alloc(), and it is obtained with
raw_cpu_ptr(&cpu_vmap_zone). Here, when we try to get a chunk of cvz_size
from free_vmap_area_root/free_vmap_area_list, how can we guarantee that it
belongs to the 'z' zone? To my understanding, __alloc_vmap_area()
will pick an address range sequentially, bottom up, from
free_vmap_area_root. Please correct me if I am wrong.

static unsigned long
this_cpu_zone_alloc(unsigned long size, unsigned long align, gfp_t gfp_mask, int node)
{
	struct cpu_vmap_zone *z = raw_cpu_ptr(&cpu_vmap_zone);
	......
	if (addr == VMALLOC_END && left < 4 * PAGE_SIZE)
		addr = this_cpu_zone_alloc_fill(z, size, align, gfp_mask, node);
}

> +
> + if (addr == VMALLOC_END) {
> + kmem_cache_free(vmap_area_cachep, va);
> + goto out;
> + }
> +
> + va->va_start = addr;
> + va->va_end = addr + cvz_size;
> +
> + fbl_lock(z, FREE);
> + va = merge_or_add_vmap_area_augment(va,
> + &fbl_root(z, FREE), &fbl_head(z, FREE));
> + addr = va_alloc(va, &fbl_root(z, FREE), &fbl_head(z, FREE),
> + size, align, VMALLOC_START, VMALLOC_END);
> + fbl_unlock(z, FREE);
> +
> +out:
> + atomic_set(&z->fill_in_progress, 0);
> + return addr;
> +}
> +
> +static unsigned long
> +this_cpu_zone_alloc(unsigned long size, unsigned long align, gfp_t gfp_mask, int node)
> +{
> + struct cpu_vmap_zone *z = raw_cpu_ptr(&cpu_vmap_zone);
> + unsigned long extra = align > PAGE_SIZE ? align : 0;
> + unsigned long addr = VMALLOC_END, left = 0;
> +
> + /*
> + * It is disabled, fallback to a global heap.
> + */
> + if (cvz_size == ULONG_MAX)
> + return addr;
> +
> + /*
> + * Any allocation bigger/equal than one half of
~~~~~~typo~~~~~~ bigger than/equal to
> + * a zone-size will fallback to a global heap.
> + */
> + if (cvz_size / (size + extra) < 3)
> + return addr;
> +
> + if (RB_EMPTY_ROOT(&fbl_root(z, FREE)))
> + goto fill;
> +
> + fbl_lock(z, FREE);
> + addr = __alloc_vmap_area(&fbl_root(z, FREE), &fbl_head(z, FREE),
> + size, align, VMALLOC_START, VMALLOC_END);
> +
> + if (addr == VMALLOC_END)
> + left = get_subtree_max_size(fbl_root(z, FREE).rb_node);
> + fbl_unlock(z, FREE);
> +
> +fill:
> + /*
> + * A low watermark is 3 pages.
> + */
> + if (addr == VMALLOC_END && left < 4 * PAGE_SIZE)
> + addr = this_cpu_zone_alloc_fill(z, size, align, gfp_mask, node);
> +
> + return addr;
> +}
> +
> /*
> * Allocate a region of KVA of the specified size and alignment, within the
> * vstart and vend.
> @@ -1678,11 +1765,21 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
> */
> kmemleak_scan_area(&va->rb_node, SIZE_MAX, gfp_mask);
>
> + /*
> + * Fast path allocation, start with it.
> + */
> + if (vstart == VMALLOC_START && vend == VMALLOC_END)
> + addr = this_cpu_zone_alloc(size, align, gfp_mask, node);
> + else
> + addr = vend;
> +
> retry:
> - preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, node);
> - addr = __alloc_vmap_area(&free_vmap_area_root, &free_vmap_area_list,
> - size, align, vstart, vend);
> - spin_unlock(&free_vmap_area_lock);
> + if (addr == vend) {
> + preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, node);
> + addr = __alloc_vmap_area(&free_vmap_area_root, &free_vmap_area_list,
> + size, align, vstart, vend);
> + spin_unlock(&free_vmap_area_lock);
> + }
>
> trace_alloc_vmap_area(addr, size, align, vstart, vend, addr == vend);
>
> @@ -1827,6 +1924,27 @@ purge_cpu_vmap_zone(struct cpu_vmap_zone *z)
> return num_purged_areas;
> }
>
> +static void
> +drop_cpu_vmap_cache(struct cpu_vmap_zone *z)
> +{
> + struct vmap_area *va, *n_va;
> + LIST_HEAD(free_head);
> +
> + if (RB_EMPTY_ROOT(&fbl_root(z, FREE)))
> + return;
> +
> + fbl_lock(z, FREE);
> + WRITE_ONCE(fbl(z, FREE, root.rb_node), NULL);
> + list_replace_init(&fbl_head(z, FREE), &free_head);
> + fbl_unlock(z, FREE);
> +
> + spin_lock(&free_vmap_area_lock);
> + list_for_each_entry_safe(va, n_va, &free_head, list)
> + merge_or_add_vmap_area_augment(va,
> + &free_vmap_area_root, &free_vmap_area_list);
> + spin_unlock(&free_vmap_area_lock);
> +}
> +
> /*
> * Purges all lazily-freed vmap areas.
> */
> @@ -1868,6 +1986,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
> for_each_possible_cpu(i) {
> z = per_cpu_ptr(&cpu_vmap_zone, i);
> num_purged_areas += purge_cpu_vmap_zone(z);
> + drop_cpu_vmap_cache(z);
> }
> }
>
> --
> 2.30.2
>


2023-06-06 09:24:58

by Uladzislau Rezki (Sony)

[permalink] [raw]
Subject: Re: [PATCH 8/9] mm: vmalloc: Offload free_vmap_area_lock global lock

On Mon, Jun 05, 2023 at 08:43:39AM +0800, Baoquan He wrote:
> On 05/22/23 at 01:08pm, Uladzislau Rezki (Sony) wrote:
> ......
> > +static unsigned long
> > +this_cpu_zone_alloc_fill(struct cpu_vmap_zone *z,
> > + unsigned long size, unsigned long align,
> > + gfp_t gfp_mask, int node)
> > +{
> > + unsigned long addr = VMALLOC_END;
> > + struct vmap_area *va;
> > +
> > + /*
> > + * It still can race. One task sets a progress to
> > + * 1 a second one gets preempted on entry, the first
> > + * zeroed the progress flag and second proceed with
> > + * an extra prefetch.
> > + */
> > + if (atomic_xchg(&z->fill_in_progress, 1))
> > + return addr;
> > +
> > + va = kmem_cache_alloc_node(vmap_area_cachep, gfp_mask, node);
> > + if (unlikely(!va))
> > + goto out;
> > +
> > + spin_lock(&free_vmap_area_lock);
> > + addr = __alloc_vmap_area(&free_vmap_area_root, &free_vmap_area_list,
> > + cvz_size, 1, VMALLOC_START, VMALLOC_END);
> > + spin_unlock(&free_vmap_area_lock);
>
> The 'z' is passed in from this_cpu_zone_alloc(), and it's got with
> raw_cpu_ptr(&cpu_vmap_zone). Here when we try to get chunk of cvz_size
> from free_vmap_area_root/free_vmap_area_list, how can we guarantee it
> must belong to the 'z' zone? With my understanding, __alloc_vmap_area()
> will get efficient address range sequentially bottom up from
> free_vmap_area_root. Please correct me if I am wrong.
>
We do not guarantee that, and it is not worth it. The most important point is:

If we searched for a zone that exactly matches the CPU id, the usage of the global
vmap space would become wider, i.e. spread toward higher addresses. This is
not good because we could affect other users which allocate within a specific
range; on a big system it might be a problem. Therefore a pre-fetch is done
sequentially, on demand.

Secondly, I do not see much difference in performance if we follow
the CPU-zone id exactly.

> static unsigned long
> this_cpu_zone_alloc(unsigned long size, unsigned long align, gfp_t gfp_mask, int node)
> {
> struct cpu_vmap_zone *z = raw_cpu_ptr(&cpu_vmap_zone);
> ......
> if (addr == VMALLOC_END && left < 4 * PAGE_SIZE)
> addr = this_cpu_zone_alloc_fill(z, size, align, gfp_mask, node);
> }
>
> > +
> > + if (addr == VMALLOC_END) {
> > + kmem_cache_free(vmap_area_cachep, va);
> > + goto out;
> > + }
> > +
> > + va->va_start = addr;
> > + va->va_end = addr + cvz_size;
> > +
> > + fbl_lock(z, FREE);
> > + va = merge_or_add_vmap_area_augment(va,
> > + &fbl_root(z, FREE), &fbl_head(z, FREE));
> > + addr = va_alloc(va, &fbl_root(z, FREE), &fbl_head(z, FREE),
> > + size, align, VMALLOC_START, VMALLOC_END);
> > + fbl_unlock(z, FREE);
> > +
> > +out:
> > + atomic_set(&z->fill_in_progress, 0);
> > + return addr;
> > +}
> > +
> > +static unsigned long
> > +this_cpu_zone_alloc(unsigned long size, unsigned long align, gfp_t gfp_mask, int node)
> > +{
> > + struct cpu_vmap_zone *z = raw_cpu_ptr(&cpu_vmap_zone);
> > + unsigned long extra = align > PAGE_SIZE ? align : 0;
> > + unsigned long addr = VMALLOC_END, left = 0;
> > +
> > + /*
> > + * It is disabled, fallback to a global heap.
> > + */
> > + if (cvz_size == ULONG_MAX)
> > + return addr;
> > +
> > + /*
> > + * Any allocation bigger/equal than one half of
> ~~~~~~typo~~~~~~ bigger than/equal to
I will rework it!

--
Uladzislau Rezki

2023-06-06 12:28:11

by Baoquan He

[permalink] [raw]
Subject: Re: [PATCH 8/9] mm: vmalloc: Offload free_vmap_area_lock global lock

On 06/06/23 at 11:01am, Uladzislau Rezki wrote:
> On Mon, Jun 05, 2023 at 08:43:39AM +0800, Baoquan He wrote:
> > On 05/22/23 at 01:08pm, Uladzislau Rezki (Sony) wrote:
> > ......
> > > +static unsigned long
> > > +this_cpu_zone_alloc_fill(struct cpu_vmap_zone *z,
> > > + unsigned long size, unsigned long align,
> > > + gfp_t gfp_mask, int node)
> > > +{
> > > + unsigned long addr = VMALLOC_END;
> > > + struct vmap_area *va;
> > > +
> > > + /*
> > > + * It still can race. One task sets a progress to
> > > + * 1 a second one gets preempted on entry, the first
> > > + * zeroed the progress flag and second proceed with
> > > + * an extra prefetch.
> > > + */
> > > + if (atomic_xchg(&z->fill_in_progress, 1))
> > > + return addr;
> > > +
> > > + va = kmem_cache_alloc_node(vmap_area_cachep, gfp_mask, node);
> > > + if (unlikely(!va))
> > > + goto out;
> > > +
> > > + spin_lock(&free_vmap_area_lock);
> > > + addr = __alloc_vmap_area(&free_vmap_area_root, &free_vmap_area_list,
> > > + cvz_size, 1, VMALLOC_START, VMALLOC_END);
> > > + spin_unlock(&free_vmap_area_lock);
> >
> > The 'z' is passed in from this_cpu_zone_alloc(), and it's got with
> > raw_cpu_ptr(&cpu_vmap_zone). Here when we try to get chunk of cvz_size
> > from free_vmap_area_root/free_vmap_area_list, how can we guarantee it
> > must belong to the 'z' zone? With my understanding, __alloc_vmap_area()
> > will get efficient address range sequentially bottom up from
> > free_vmap_area_root. Please correct me if I am wrong.
> >
> We do not guarantee that and it does not worth it. The most important is:
>
> If we search a zone that exactly match a CPU-id the usage of a global
> vmap space becomes more wider, i.e. toward a high address space. This is
> not good because we can affect other users which allocate within a specific
> range. On a big system it might be a problem. Therefore a pre-fetch is done
> sequentially on demand.
>
> Secondly, i do not see much difference in performance if we follow
> exactly CPU-zone-id.

Ah, I see: the allocated range will be put into the appropriate zone's
busy tree by calculating its zone via addr_to_cvz(va->va_start), and the
cvz->free tree is only a per-cpu pre-fetch cache. This is smart, thanks a
lot for the explanation.

>
> > static unsigned long
> > this_cpu_zone_alloc(unsigned long size, unsigned long align, gfp_t gfp_mask, int node)
> > {
> > struct cpu_vmap_zone *z = raw_cpu_ptr(&cpu_vmap_zone);
> > ......
> > if (addr == VMALLOC_END && left < 4 * PAGE_SIZE)
> > addr = this_cpu_zone_alloc_fill(z, size, align, gfp_mask, node);
> > }
> >
> > > +
> > > + if (addr == VMALLOC_END) {
> > > + kmem_cache_free(vmap_area_cachep, va);
> > > + goto out;
> > > + }
> > > +
> > > + va->va_start = addr;
> > > + va->va_end = addr + cvz_size;
> > > +
> > > + fbl_lock(z, FREE);
> > > + va = merge_or_add_vmap_area_augment(va,
> > > + &fbl_root(z, FREE), &fbl_head(z, FREE));
> > > + addr = va_alloc(va, &fbl_root(z, FREE), &fbl_head(z, FREE),
> > > + size, align, VMALLOC_START, VMALLOC_END);
> > > + fbl_unlock(z, FREE);
> > > +
> > > +out:
> > > + atomic_set(&z->fill_in_progress, 0);
> > > + return addr;
> > > +}
> > > +
> > > +static unsigned long
> > > +this_cpu_zone_alloc(unsigned long size, unsigned long align, gfp_t gfp_mask, int node)
> > > +{
> > > + struct cpu_vmap_zone *z = raw_cpu_ptr(&cpu_vmap_zone);
> > > + unsigned long extra = align > PAGE_SIZE ? align : 0;
> > > + unsigned long addr = VMALLOC_END, left = 0;
> > > +
> > > + /*
> > > + * It is disabled, fallback to a global heap.
> > > + */
> > > + if (cvz_size == ULONG_MAX)
> > > + return addr;
> > > +
> > > + /*
> > > + * Any allocation bigger/equal than one half of
> > ~~~~~~typo~~~~~~ bigger than/equal to
> I will rework it!
>
> --
> Uladzislau Rezki
>


2023-06-07 07:24:17

by Uladzislau Rezki (Sony)

[permalink] [raw]
Subject: Re: [PATCH 8/9] mm: vmalloc: Offload free_vmap_area_lock global lock

On Tue, Jun 06, 2023 at 08:11:04PM +0800, Baoquan He wrote:
> On 06/06/23 at 11:01am, Uladzislau Rezki wrote:
> > On Mon, Jun 05, 2023 at 08:43:39AM +0800, Baoquan He wrote:
> > > On 05/22/23 at 01:08pm, Uladzislau Rezki (Sony) wrote:
> > > ......
> > > > +static unsigned long
> > > > +this_cpu_zone_alloc_fill(struct cpu_vmap_zone *z,
> > > > + unsigned long size, unsigned long align,
> > > > + gfp_t gfp_mask, int node)
> > > > +{
> > > > + unsigned long addr = VMALLOC_END;
> > > > + struct vmap_area *va;
> > > > +
> > > > + /*
> > > > + * It still can race. One task sets a progress to
> > > > + * 1 a second one gets preempted on entry, the first
> > > > + * zeroed the progress flag and second proceed with
> > > > + * an extra prefetch.
> > > > + */
> > > > + if (atomic_xchg(&z->fill_in_progress, 1))
> > > > + return addr;
> > > > +
> > > > + va = kmem_cache_alloc_node(vmap_area_cachep, gfp_mask, node);
> > > > + if (unlikely(!va))
> > > > + goto out;
> > > > +
> > > > + spin_lock(&free_vmap_area_lock);
> > > > + addr = __alloc_vmap_area(&free_vmap_area_root, &free_vmap_area_list,
> > > > + cvz_size, 1, VMALLOC_START, VMALLOC_END);
> > > > + spin_unlock(&free_vmap_area_lock);
> > >
> > > The 'z' is passed in from this_cpu_zone_alloc(), and it's got with
> > > raw_cpu_ptr(&cpu_vmap_zone). Here when we try to get chunk of cvz_size
> > > from free_vmap_area_root/free_vmap_area_list, how can we guarantee it
> > > must belong to the 'z' zone? With my understanding, __alloc_vmap_area()
> > > will get efficient address range sequentially bottom up from
> > > free_vmap_area_root. Please correct me if I am wrong.
> > >
> > We do not guarantee that and it does not worth it. The most important is:
> >
> > If we search a zone that exactly match a CPU-id the usage of a global
> > vmap space becomes more wider, i.e. toward a high address space. This is
> > not good because we can affect other users which allocate within a specific
> > range. On a big system it might be a problem. Therefore a pre-fetch is done
> > sequentially on demand.
> >
> > Secondly, i do not see much difference in performance if we follow
> > exactly CPU-zone-id.
>
> Ah, I see, the allocated range will be put into appropriate zone's
> busy tree by calculating its zone via addr_to_cvz(va->va_start). The
> cvz->free tree is only a percpu pre-fetch cache. This is smart, thanks a
> lot for explanation.
>
Yes. The busy/lazy VAs are placed into a per-cpu zone based on their address
(using addr_to_cvz(addr)), whereas the chunk itself is pre-fetched on the
current CPU.
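
For reference, the address-to-zone step described here could conceptually look
like the following (a simplified, hypothetical sketch; the actual addr_to_cvz()
helper in the series may be defined differently):

	/*
	 * Hypothetical sketch: the zone index is derived from the VA's address,
	 * not from the CPU that performed the allocation, so a chunk pre-fetched
	 * on one CPU can still end up in another zone's busy/lazy trees.
	 */
	static struct cpu_vmap_zone *addr_to_cvz_sketch(unsigned long addr)
	{
		unsigned int idx = ((addr - VMALLOC_START) / cvz_size) % num_possible_cpus();

		return &per_cpu(cpu_vmap_zone, idx);
	}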

Thanks!

--
Uladzislau Rezki