While reviewing Baoquan's recent changes to permit vread() access to
vm_map_ram regions of vmalloc allocations, Willy pointed out [1] that it
would be nice to refactor vread() as a whole, since its only user is
read_kcore() and the existing form of vread() necessitates the use of a
bounce buffer.
This patch series does exactly that, and also adjusts how we read the kernel
text section so that a bounce buffer is not needed there either.
This patch series necessarily changes the locking used in vmalloc; however,
tests indicate that this has very little impact on allocation performance
(test results are shown in the relevant patch).
This has been tested against the test case which motivated Baoquan's changes
in the first place [2], which continues to function correctly, as do the
vmalloc self-tests.
[1] https://lore.kernel.org/all/Y8WfDSRkc%[email protected]/
[2] https://lore.kernel.org/all/[email protected]/T/#u
v2:
- Fix ordering of vread_iter() parameters
- Fix nommu vread() -> vread_iter()
v1:
https://lore.kernel.org/all/[email protected]/
Lorenzo Stoakes (4):
fs/proc/kcore: Avoid bounce buffer for ktext data
mm: vmalloc: use rwsem, mutex for vmap_area_lock and vmap_block->lock
fs/proc/kcore: convert read_kcore() to read_kcore_iter()
mm: vmalloc: convert vread() to vread_iter()
fs/proc/kcore.c | 84 +++++++------------
include/linux/vmalloc.h | 3 +-
mm/nommu.c | 10 +--
mm/vmalloc.c | 178 +++++++++++++++++++++-------------------
4 files changed, 130 insertions(+), 145 deletions(-)
base-commit: 4018ab1f7cec061b8425737328edefebdc0ab832
--
2.39.2
Commit df04abfd181a ("fs/proc/kcore.c: Add bounce buffer for ktext data")
introduced the use of a bounce buffer to retrieve kernel text data for
/proc/kcore in order to avoid failures arising from hardened user copies
enabled by CONFIG_HARDENED_USERCOPY in check_kernel_text_object().
We can avoid doing this if, instead of copy_to_user(), we use _copy_to_user(),
which bypasses the hardening check. This is more efficient than using a
bounce buffer and simplifies the code.
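For reference, a simplified sketch of the relationship between the two, based
on the generic include/linux/uaccess.h helpers (exact details vary by
architecture and kernel version):

static __always_inline unsigned long __must_check
copy_to_user(void __user *to, const void *from, unsigned long n)
{
	/* The hardened usercopy check, which rejects reads of kernel text. */
	if (check_copy_size(from, n, true))
		n = _copy_to_user(to, from, n);
	return n;
}

Calling _copy_to_user() directly therefore skips only that check; the actual
copy path is unchanged.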
We do so as part of an overall effort to eliminate bounce buffer usage in the
function, with an eye to converting it to an iterator read.
Signed-off-by: Lorenzo Stoakes <[email protected]>
---
fs/proc/kcore.c | 17 +++++------------
1 file changed, 5 insertions(+), 12 deletions(-)
diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index 71157ee35c1a..556f310d6aa4 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -541,19 +541,12 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
case KCORE_VMEMMAP:
case KCORE_TEXT:
/*
- * Using bounce buffer to bypass the
- * hardened user copy kernel text checks.
+ * We use _copy_to_user() to bypass usermode hardening
+ * which would otherwise prevent this operation.
*/
- if (copy_from_kernel_nofault(buf, (void *)start, tsz)) {
- if (clear_user(buffer, tsz)) {
- ret = -EFAULT;
- goto out;
- }
- } else {
- if (copy_to_user(buffer, buf, tsz)) {
- ret = -EFAULT;
- goto out;
- }
+ if (_copy_to_user(buffer, (char *)start, tsz)) {
+ ret = -EFAULT;
+ goto out;
}
break;
default:
--
2.39.2
vmalloc() is, by design, not permitted to be used in atomic context and
already contains components which may sleep, so avoiding spin locks is not
a problem from the perspective of atomic context.
The global vmap_area_lock is held when the red/black tree rooted in
vmap_area_root is accessed and thus is rather long-held and under
potentially high contention. It is likely to be under contention for reads
rather than writes, so replace it with an rwsem.
Each individual vmap_block->lock is likely to be held for less time but
under low contention, so a mutex is not an outrageous choice here.
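As an illustration of the intended usage pattern (mirroring the hunks below),
lookups take the lock shared while tree modifications take it exclusive:

/* Readers, e.g. find_vmap_area(), may now run concurrently: */
down_read(&vmap_area_lock);
va = __find_vmap_area(addr, &vmap_area_root);
up_read(&vmap_area_lock);

/* Writers, e.g. alloc_vmap_area() inserting into the busy tree: */
down_write(&vmap_area_lock);
insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
up_write(&vmap_area_lock);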
A subset of test_vmalloc.sh performance results:-
fix_size_alloc_test 0.40%
full_fit_alloc_test 2.08%
long_busy_list_alloc_test 0.34%
random_size_alloc_test -0.25%
random_size_align_alloc_test 0.06%
...
all tests cycles 0.2%
This represents a tiny reduction in performance that sits barely above
noise.
The reason for making this change is to build a basis for vread() to be
usable asynchronously, thus eliminating the need for a bounce buffer when
copying data to userland in read_kcore() and allowing that to be converted
to an iterator form.
Signed-off-by: Lorenzo Stoakes <[email protected]>
---
mm/vmalloc.c | 77 +++++++++++++++++++++++++++-------------------------
1 file changed, 40 insertions(+), 37 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 978194dc2bb8..c24b27664a97 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -40,6 +40,7 @@
#include <linux/uaccess.h>
#include <linux/hugetlb.h>
#include <linux/sched/mm.h>
+#include <linux/rwsem.h>
#include <asm/tlbflush.h>
#include <asm/shmparam.h>
@@ -725,7 +726,7 @@ EXPORT_SYMBOL(vmalloc_to_pfn);
#define DEBUG_AUGMENT_LOWEST_MATCH_CHECK 0
-static DEFINE_SPINLOCK(vmap_area_lock);
+static DECLARE_RWSEM(vmap_area_lock);
static DEFINE_SPINLOCK(free_vmap_area_lock);
/* Export for kexec only */
LIST_HEAD(vmap_area_list);
@@ -1537,9 +1538,9 @@ static void free_vmap_area(struct vmap_area *va)
/*
* Remove from the busy tree/list.
*/
- spin_lock(&vmap_area_lock);
+ down_write(&vmap_area_lock);
unlink_va(va, &vmap_area_root);
- spin_unlock(&vmap_area_lock);
+ up_write(&vmap_area_lock);
/*
* Insert/Merge it back to the free tree/list.
@@ -1627,9 +1628,9 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
va->vm = NULL;
va->flags = va_flags;
- spin_lock(&vmap_area_lock);
+ down_write(&vmap_area_lock);
insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
- spin_unlock(&vmap_area_lock);
+ up_write(&vmap_area_lock);
BUG_ON(!IS_ALIGNED(va->va_start, align));
BUG_ON(va->va_start < vstart);
@@ -1854,9 +1855,9 @@ struct vmap_area *find_vmap_area(unsigned long addr)
{
struct vmap_area *va;
- spin_lock(&vmap_area_lock);
+ down_read(&vmap_area_lock);
va = __find_vmap_area(addr, &vmap_area_root);
- spin_unlock(&vmap_area_lock);
+ up_read(&vmap_area_lock);
return va;
}
@@ -1865,11 +1866,11 @@ static struct vmap_area *find_unlink_vmap_area(unsigned long addr)
{
struct vmap_area *va;
- spin_lock(&vmap_area_lock);
+ down_write(&vmap_area_lock);
va = __find_vmap_area(addr, &vmap_area_root);
if (va)
unlink_va(va, &vmap_area_root);
- spin_unlock(&vmap_area_lock);
+ up_write(&vmap_area_lock);
return va;
}
@@ -1914,7 +1915,7 @@ struct vmap_block_queue {
};
struct vmap_block {
- spinlock_t lock;
+ struct mutex lock;
struct vmap_area *va;
unsigned long free, dirty;
DECLARE_BITMAP(used_map, VMAP_BBMAP_BITS);
@@ -1991,7 +1992,7 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
}
vaddr = vmap_block_vaddr(va->va_start, 0);
- spin_lock_init(&vb->lock);
+ mutex_init(&vb->lock);
vb->va = va;
/* At least something should be left free */
BUG_ON(VMAP_BBMAP_BITS <= (1UL << order));
@@ -2026,9 +2027,9 @@ static void free_vmap_block(struct vmap_block *vb)
tmp = xa_erase(&vmap_blocks, addr_to_vb_idx(vb->va->va_start));
BUG_ON(tmp != vb);
- spin_lock(&vmap_area_lock);
+ down_write(&vmap_area_lock);
unlink_va(vb->va, &vmap_area_root);
- spin_unlock(&vmap_area_lock);
+ up_write(&vmap_area_lock);
free_vmap_area_noflush(vb->va);
kfree_rcu(vb, rcu_head);
@@ -2047,7 +2048,7 @@ static void purge_fragmented_blocks(int cpu)
if (!(vb->free + vb->dirty == VMAP_BBMAP_BITS && vb->dirty != VMAP_BBMAP_BITS))
continue;
- spin_lock(&vb->lock);
+ mutex_lock(&vb->lock);
if (vb->free + vb->dirty == VMAP_BBMAP_BITS && vb->dirty != VMAP_BBMAP_BITS) {
vb->free = 0; /* prevent further allocs after releasing lock */
vb->dirty = VMAP_BBMAP_BITS; /* prevent purging it again */
@@ -2056,10 +2057,10 @@ static void purge_fragmented_blocks(int cpu)
spin_lock(&vbq->lock);
list_del_rcu(&vb->free_list);
spin_unlock(&vbq->lock);
- spin_unlock(&vb->lock);
+ mutex_unlock(&vb->lock);
list_add_tail(&vb->purge, &purge);
} else
- spin_unlock(&vb->lock);
+ mutex_unlock(&vb->lock);
}
rcu_read_unlock();
@@ -2101,9 +2102,9 @@ static void *vb_alloc(unsigned long size, gfp_t gfp_mask)
list_for_each_entry_rcu(vb, &vbq->free, free_list) {
unsigned long pages_off;
- spin_lock(&vb->lock);
+ mutex_lock(&vb->lock);
if (vb->free < (1UL << order)) {
- spin_unlock(&vb->lock);
+ mutex_unlock(&vb->lock);
continue;
}
@@ -2117,7 +2118,7 @@ static void *vb_alloc(unsigned long size, gfp_t gfp_mask)
spin_unlock(&vbq->lock);
}
- spin_unlock(&vb->lock);
+ mutex_unlock(&vb->lock);
break;
}
@@ -2144,16 +2145,16 @@ static void vb_free(unsigned long addr, unsigned long size)
order = get_order(size);
offset = (addr & (VMAP_BLOCK_SIZE - 1)) >> PAGE_SHIFT;
vb = xa_load(&vmap_blocks, addr_to_vb_idx(addr));
- spin_lock(&vb->lock);
+ mutex_lock(&vb->lock);
bitmap_clear(vb->used_map, offset, (1UL << order));
- spin_unlock(&vb->lock);
+ mutex_unlock(&vb->lock);
vunmap_range_noflush(addr, addr + size);
if (debug_pagealloc_enabled_static())
flush_tlb_kernel_range(addr, addr + size);
- spin_lock(&vb->lock);
+ mutex_lock(&vb->lock);
/* Expand dirty range */
vb->dirty_min = min(vb->dirty_min, offset);
@@ -2162,10 +2163,10 @@ static void vb_free(unsigned long addr, unsigned long size)
vb->dirty += 1UL << order;
if (vb->dirty == VMAP_BBMAP_BITS) {
BUG_ON(vb->free);
- spin_unlock(&vb->lock);
+ mutex_unlock(&vb->lock);
free_vmap_block(vb);
} else
- spin_unlock(&vb->lock);
+ mutex_unlock(&vb->lock);
}
static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush)
@@ -2183,7 +2184,7 @@ static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush)
rcu_read_lock();
list_for_each_entry_rcu(vb, &vbq->free, free_list) {
- spin_lock(&vb->lock);
+ mutex_lock(&vb->lock);
if (vb->dirty && vb->dirty != VMAP_BBMAP_BITS) {
unsigned long va_start = vb->va->va_start;
unsigned long s, e;
@@ -2196,7 +2197,7 @@ static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush)
flush = 1;
}
- spin_unlock(&vb->lock);
+ mutex_unlock(&vb->lock);
}
rcu_read_unlock();
}
@@ -2451,9 +2452,9 @@ static inline void setup_vmalloc_vm_locked(struct vm_struct *vm,
static void setup_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va,
unsigned long flags, const void *caller)
{
- spin_lock(&vmap_area_lock);
+ down_write(&vmap_area_lock);
setup_vmalloc_vm_locked(vm, va, flags, caller);
- spin_unlock(&vmap_area_lock);
+ up_write(&vmap_area_lock);
}
static void clear_vm_uninitialized_flag(struct vm_struct *vm)
@@ -3507,9 +3508,9 @@ static void vmap_ram_vread(char *buf, char *addr, int count, unsigned long flags
if (!vb)
goto finished;
- spin_lock(&vb->lock);
+ mutex_lock(&vb->lock);
if (bitmap_empty(vb->used_map, VMAP_BBMAP_BITS)) {
- spin_unlock(&vb->lock);
+ mutex_unlock(&vb->lock);
goto finished;
}
for_each_set_bitrange(rs, re, vb->used_map, VMAP_BBMAP_BITS) {
@@ -3536,7 +3537,7 @@ static void vmap_ram_vread(char *buf, char *addr, int count, unsigned long flags
count -= n;
}
unlock:
- spin_unlock(&vb->lock);
+ mutex_unlock(&vb->lock);
finished:
/* zero-fill the left dirty or free regions */
@@ -3576,13 +3577,15 @@ long vread(char *buf, char *addr, unsigned long count)
unsigned long buflen = count;
unsigned long n, size, flags;
+ might_sleep();
+
addr = kasan_reset_tag(addr);
/* Don't allow overflow */
if ((unsigned long) addr + count < count)
count = -(unsigned long) addr;
- spin_lock(&vmap_area_lock);
+ down_read(&vmap_area_lock);
va = find_vmap_area_exceed_addr((unsigned long)addr);
if (!va)
goto finished;
@@ -3639,7 +3642,7 @@ long vread(char *buf, char *addr, unsigned long count)
count -= n;
}
finished:
- spin_unlock(&vmap_area_lock);
+ up_read(&vmap_area_lock);
if (buf == buf_start)
return 0;
@@ -3980,14 +3983,14 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
}
/* insert all vm's */
- spin_lock(&vmap_area_lock);
+ down_write(&vmap_area_lock);
for (area = 0; area < nr_vms; area++) {
insert_vmap_area(vas[area], &vmap_area_root, &vmap_area_list);
setup_vmalloc_vm_locked(vms[area], vas[area], VM_ALLOC,
pcpu_get_vm_areas);
}
- spin_unlock(&vmap_area_lock);
+ up_write(&vmap_area_lock);
/*
* Mark allocated areas as accessible. Do it now as a best-effort
@@ -4114,7 +4117,7 @@ static void *s_start(struct seq_file *m, loff_t *pos)
__acquires(&vmap_area_lock)
{
mutex_lock(&vmap_purge_lock);
- spin_lock(&vmap_area_lock);
+ down_read(&vmap_area_lock);
return seq_list_start(&vmap_area_list, *pos);
}
@@ -4128,7 +4131,7 @@ static void s_stop(struct seq_file *m, void *p)
__releases(&vmap_area_lock)
__releases(&vmap_purge_lock)
{
- spin_unlock(&vmap_area_lock);
+ up_read(&vmap_area_lock);
mutex_unlock(&vmap_purge_lock);
}
--
2.39.2
Now that we have eliminated spinlocks from the vread() case, convert
read_kcore() to read_kcore_iter().
For the time being we still use a bounce buffer for vread(); however, in the
next patch we will convert this to interact directly with the iterator and
eliminate the bounce buffer altogether.
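For illustration, the key semantic difference when converting the copies
(buffer, src, tsz and iter below are placeholders mirroring the hunks that
follow) is that copy_to_user() returns the number of bytes left uncopied,
whereas copy_to_iter() returns the number of bytes copied, hence the != tsz
checks:

/* Before: any non-zero return (bytes not copied) is a failure. */
if (copy_to_user(buffer, src, tsz))
	return -EFAULT;

/* After: a short copy (fewer than tsz bytes written to the iterator) is a failure. */
if (copy_to_iter(src, tsz, iter) != tsz)
	return -EFAULT;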
Signed-off-by: Lorenzo Stoakes <[email protected]>
---
fs/proc/kcore.c | 58 ++++++++++++++++++++++++-------------------------
1 file changed, 29 insertions(+), 29 deletions(-)
diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index 556f310d6aa4..25e0eeb8d498 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -24,7 +24,7 @@
#include <linux/memblock.h>
#include <linux/init.h>
#include <linux/slab.h>
-#include <linux/uaccess.h>
+#include <linux/uio.h>
#include <asm/io.h>
#include <linux/list.h>
#include <linux/ioport.h>
@@ -308,9 +308,12 @@ static void append_kcore_note(char *notes, size_t *i, const char *name,
}
static ssize_t
-read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
+read_kcore_iter(struct kiocb *iocb, struct iov_iter *iter)
{
+ struct file *file = iocb->ki_filp;
char *buf = file->private_data;
+ loff_t *ppos = &iocb->ki_pos;
+
size_t phdrs_offset, notes_offset, data_offset;
size_t page_offline_frozen = 1;
size_t phdrs_len, notes_len;
@@ -318,6 +321,7 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
size_t tsz;
int nphdr;
unsigned long start;
+ size_t buflen = iov_iter_count(iter);
size_t orig_buflen = buflen;
int ret = 0;
@@ -333,7 +337,7 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
notes_offset = phdrs_offset + phdrs_len;
/* ELF file header. */
- if (buflen && *fpos < sizeof(struct elfhdr)) {
+ if (buflen && *ppos < sizeof(struct elfhdr)) {
struct elfhdr ehdr = {
.e_ident = {
[EI_MAG0] = ELFMAG0,
@@ -355,19 +359,18 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
.e_phnum = nphdr,
};
- tsz = min_t(size_t, buflen, sizeof(struct elfhdr) - *fpos);
- if (copy_to_user(buffer, (char *)&ehdr + *fpos, tsz)) {
+ tsz = min_t(size_t, buflen, sizeof(struct elfhdr) - *ppos);
+ if (copy_to_iter((char *)&ehdr + *ppos, tsz, iter) != tsz) {
ret = -EFAULT;
goto out;
}
- buffer += tsz;
buflen -= tsz;
- *fpos += tsz;
+ *ppos += tsz;
}
/* ELF program headers. */
- if (buflen && *fpos < phdrs_offset + phdrs_len) {
+ if (buflen && *ppos < phdrs_offset + phdrs_len) {
struct elf_phdr *phdrs, *phdr;
phdrs = kzalloc(phdrs_len, GFP_KERNEL);
@@ -397,22 +400,21 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
phdr++;
}
- tsz = min_t(size_t, buflen, phdrs_offset + phdrs_len - *fpos);
- if (copy_to_user(buffer, (char *)phdrs + *fpos - phdrs_offset,
- tsz)) {
+ tsz = min_t(size_t, buflen, phdrs_offset + phdrs_len - *ppos);
+ if (copy_to_iter((char *)phdrs + *ppos - phdrs_offset, tsz,
+ iter) != tsz) {
kfree(phdrs);
ret = -EFAULT;
goto out;
}
kfree(phdrs);
- buffer += tsz;
buflen -= tsz;
- *fpos += tsz;
+ *ppos += tsz;
}
/* ELF note segment. */
- if (buflen && *fpos < notes_offset + notes_len) {
+ if (buflen && *ppos < notes_offset + notes_len) {
struct elf_prstatus prstatus = {};
struct elf_prpsinfo prpsinfo = {
.pr_sname = 'R',
@@ -447,24 +449,23 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
vmcoreinfo_data,
min(vmcoreinfo_size, notes_len - i));
- tsz = min_t(size_t, buflen, notes_offset + notes_len - *fpos);
- if (copy_to_user(buffer, notes + *fpos - notes_offset, tsz)) {
+ tsz = min_t(size_t, buflen, notes_offset + notes_len - *ppos);
+ if (copy_to_iter(notes + *ppos - notes_offset, tsz, iter) != tsz) {
kfree(notes);
ret = -EFAULT;
goto out;
}
kfree(notes);
- buffer += tsz;
buflen -= tsz;
- *fpos += tsz;
+ *ppos += tsz;
}
/*
* Check to see if our file offset matches with any of
* the addresses in the elf_phdr on our list.
*/
- start = kc_offset_to_vaddr(*fpos - data_offset);
+ start = kc_offset_to_vaddr(*ppos - data_offset);
if ((tsz = (PAGE_SIZE - (start & ~PAGE_MASK))) > buflen)
tsz = buflen;
@@ -497,7 +498,7 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
}
if (!m) {
- if (clear_user(buffer, tsz)) {
+ if (iov_iter_zero(tsz, iter) != tsz) {
ret = -EFAULT;
goto out;
}
@@ -508,14 +509,14 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
case KCORE_VMALLOC:
vread(buf, (char *)start, tsz);
/* we have to zero-fill user buffer even if no read */
- if (copy_to_user(buffer, buf, tsz)) {
+ if (copy_to_iter(buf, tsz, iter) != tsz) {
ret = -EFAULT;
goto out;
}
break;
case KCORE_USER:
/* User page is handled prior to normal kernel page: */
- if (copy_to_user(buffer, (char *)start, tsz)) {
+ if (copy_to_iter((char *)start, tsz, iter) != tsz) {
ret = -EFAULT;
goto out;
}
@@ -531,7 +532,7 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
*/
if (!page || PageOffline(page) ||
is_page_hwpoison(page) || !pfn_is_ram(pfn)) {
- if (clear_user(buffer, tsz)) {
+ if (iov_iter_zero(tsz, iter) != tsz) {
ret = -EFAULT;
goto out;
}
@@ -541,25 +542,24 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
case KCORE_VMEMMAP:
case KCORE_TEXT:
/*
- * We use _copy_to_user() to bypass usermode hardening
+ * We use _copy_to_iter() to bypass usermode hardening
* which would otherwise prevent this operation.
*/
- if (_copy_to_user(buffer, (char *)start, tsz)) {
+ if (_copy_to_iter((char *)start, tsz, iter) != tsz) {
ret = -EFAULT;
goto out;
}
break;
default:
pr_warn_once("Unhandled KCORE type: %d\n", m->type);
- if (clear_user(buffer, tsz)) {
+ if (iov_iter_zero(tsz, iter) != tsz) {
ret = -EFAULT;
goto out;
}
}
skip:
buflen -= tsz;
- *fpos += tsz;
- buffer += tsz;
+ *ppos += tsz;
start += tsz;
tsz = (buflen > PAGE_SIZE ? PAGE_SIZE : buflen);
}
@@ -603,7 +603,7 @@ static int release_kcore(struct inode *inode, struct file *file)
}
static const struct proc_ops kcore_proc_ops = {
- .proc_read = read_kcore,
+ .proc_read_iter = read_kcore_iter,
.proc_open = open_kcore,
.proc_release = release_kcore,
.proc_lseek = default_llseek,
--
2.39.2
Having previously laid the foundation for converting vread() to an iterator
function, pull the trigger and do so.
This patch attempts to provide minimal refactoring and to reflect the
existing logic as best we can, with the exception of aligned_vread_iter(),
which drops the use of the deprecated kmap_atomic() in favour of
kmap_local_page().
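A condensed view of the new copy path in aligned_vread_iter(), mirroring the
hunk below: kmap_atomic() disables page faults, which would conflict with
copying directly into (possibly unfaulted) user memory, whereas
kmap_local_page() does not:

size_t copied = 0;
struct page *page = vmalloc_to_page(addr);

if (page) {
	/* kmap_local_page() leaves page faults enabled, unlike kmap_atomic(). */
	void *map = kmap_local_page(page);

	copied = copy_to_iter(map + offset, length, iter);
	kunmap_local(map);
}
/* Zero whatever portion could not be read or copied. */
if (copied < length)
	iov_iter_zero(length - copied, iter);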
All existing logic to zero portions of memory not read remains, and there
should be no functional difference other than a performance improvement in
/proc/kcore access to vmalloc regions.
Now that we have dispensed with the need for a bounce buffer in
read_kcore_iter() altogether, we remove the one allocated there.
Signed-off-by: Lorenzo Stoakes <[email protected]>
---
fs/proc/kcore.c | 21 +--------
include/linux/vmalloc.h | 3 +-
mm/nommu.c | 10 ++--
mm/vmalloc.c | 101 +++++++++++++++++++++-------------------
4 files changed, 62 insertions(+), 73 deletions(-)
diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index 25e0eeb8d498..a0ed3ca35cce 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -307,13 +307,9 @@ static void append_kcore_note(char *notes, size_t *i, const char *name,
*i = ALIGN(*i + descsz, 4);
}
-static ssize_t
-read_kcore_iter(struct kiocb *iocb, struct iov_iter *iter)
+static ssize_t read_kcore_iter(struct kiocb *iocb, struct iov_iter *iter)
{
- struct file *file = iocb->ki_filp;
- char *buf = file->private_data;
loff_t *ppos = &iocb->ki_pos;
-
size_t phdrs_offset, notes_offset, data_offset;
size_t page_offline_frozen = 1;
size_t phdrs_len, notes_len;
@@ -507,9 +503,7 @@ read_kcore_iter(struct kiocb *iocb, struct iov_iter *iter)
switch (m->type) {
case KCORE_VMALLOC:
- vread(buf, (char *)start, tsz);
- /* we have to zero-fill user buffer even if no read */
- if (copy_to_iter(buf, tsz, iter) != tsz) {
+ if (vread_iter(iter, (char *)start, tsz) != tsz) {
ret = -EFAULT;
goto out;
}
@@ -582,10 +576,6 @@ static int open_kcore(struct inode *inode, struct file *filp)
if (ret)
return ret;
- filp->private_data = kmalloc(PAGE_SIZE, GFP_KERNEL);
- if (!filp->private_data)
- return -ENOMEM;
-
if (kcore_need_update)
kcore_update_ram();
if (i_size_read(inode) != proc_root_kcore->size) {
@@ -596,16 +586,9 @@ static int open_kcore(struct inode *inode, struct file *filp)
return 0;
}
-static int release_kcore(struct inode *inode, struct file *file)
-{
- kfree(file->private_data);
- return 0;
-}
-
static const struct proc_ops kcore_proc_ops = {
.proc_read_iter = read_kcore_iter,
.proc_open = open_kcore,
- .proc_release = release_kcore,
.proc_lseek = default_llseek,
};
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 69250efa03d1..6beb2ace6a7a 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -9,6 +9,7 @@
#include <asm/page.h> /* pgprot_t */
#include <linux/rbtree.h>
#include <linux/overflow.h>
+#include <linux/uio.h>
#include <asm/vmalloc.h>
@@ -251,7 +252,7 @@ static inline void set_vm_flush_reset_perms(void *addr)
#endif
/* for /proc/kcore */
-extern long vread(char *buf, char *addr, unsigned long count);
+extern long vread_iter(struct iov_iter *iter, char *addr, size_t count);
/*
* Internals. Don't use..
diff --git a/mm/nommu.c b/mm/nommu.c
index 57ba243c6a37..e0fcd948096e 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -36,6 +36,7 @@
#include <linux/printk.h>
#include <linux/uaccess.h>
+#include <linux/uio.h>
#include <asm/tlb.h>
#include <asm/tlbflush.h>
#include <asm/mmu_context.h>
@@ -198,14 +199,13 @@ unsigned long vmalloc_to_pfn(const void *addr)
}
EXPORT_SYMBOL(vmalloc_to_pfn);
-long vread(char *buf, char *addr, unsigned long count)
+long vread_iter(struct iov_iter *iter, char *addr, size_t count)
{
/* Don't allow overflow */
- if ((unsigned long) buf + count < count)
- count = -(unsigned long) buf;
+ if ((unsigned long) addr + count < count)
+ count = -(unsigned long) addr;
- memcpy(buf, addr, count);
- return count;
+ return copy_to_iter(addr, count, iter);
}
/*
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index c24b27664a97..f19509a6eef4 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -37,7 +37,6 @@
#include <linux/rbtree_augmented.h>
#include <linux/overflow.h>
#include <linux/pgtable.h>
-#include <linux/uaccess.h>
#include <linux/hugetlb.h>
#include <linux/sched/mm.h>
#include <linux/rwsem.h>
@@ -3446,20 +3445,20 @@ EXPORT_SYMBOL(vmalloc_32_user);
* small helper routine , copy contents to buf from addr.
* If the page is not present, fill zero.
*/
-
-static int aligned_vread(char *buf, char *addr, unsigned long count)
+static void aligned_vread_iter(struct iov_iter *iter,
+ char *addr, size_t count)
{
- struct page *p;
- int copied = 0;
+ struct page *page;
- while (count) {
+ while (count > 0) {
unsigned long offset, length;
+ size_t copied = 0;
offset = offset_in_page(addr);
length = PAGE_SIZE - offset;
if (length > count)
length = count;
- p = vmalloc_to_page(addr);
+ page = vmalloc_to_page(addr);
/*
* To do safe access to this _mapped_ area, we need
* lock. But adding lock here means that we need to add
@@ -3467,23 +3466,24 @@ static int aligned_vread(char *buf, char *addr, unsigned long count)
* interface, rarely used. Instead of that, we'll use
* kmap() and get small overhead in this access function.
*/
- if (p) {
+ if (page) {
/* We can expect USER0 is not used -- see vread() */
- void *map = kmap_atomic(p);
- memcpy(buf, map + offset, length);
- kunmap_atomic(map);
- } else
- memset(buf, 0, length);
+ void *map = kmap_local_page(page);
+
+ copied = copy_to_iter(map + offset, length, iter);
+ kunmap_local(map);
+ }
+
+ if (copied < length)
+ iov_iter_zero(length - copied, iter);
addr += length;
- buf += length;
- copied += length;
count -= length;
}
- return copied;
}
-static void vmap_ram_vread(char *buf, char *addr, int count, unsigned long flags)
+static void vmap_ram_vread_iter(struct iov_iter *iter, char *addr, int count,
+ unsigned long flags)
{
char *start;
struct vmap_block *vb;
@@ -3496,7 +3496,7 @@ static void vmap_ram_vread(char *buf, char *addr, int count, unsigned long flags
* handle it here.
*/
if (!(flags & VMAP_BLOCK)) {
- aligned_vread(buf, addr, count);
+ aligned_vread_iter(iter, addr, count);
return;
}
@@ -3517,22 +3517,24 @@ static void vmap_ram_vread(char *buf, char *addr, int count, unsigned long flags
if (!count)
break;
start = vmap_block_vaddr(vb->va->va_start, rs);
- while (addr < start) {
+
+ if (addr < start) {
+ size_t to_zero = min_t(size_t, start - addr, count);
+
+ iov_iter_zero(to_zero, iter);
+ addr += to_zero;
+ count -= (int)to_zero;
if (count == 0)
goto unlock;
- *buf = '\0';
- buf++;
- addr++;
- count--;
}
+
/*it could start reading from the middle of used region*/
offset = offset_in_page(addr);
n = ((re - rs + 1) << PAGE_SHIFT) - offset;
if (n > count)
n = count;
- aligned_vread(buf, start+offset, n);
+ aligned_vread_iter(iter, start + offset, n);
- buf += n;
addr += n;
count -= n;
}
@@ -3541,15 +3543,15 @@ static void vmap_ram_vread(char *buf, char *addr, int count, unsigned long flags
finished:
/* zero-fill the left dirty or free regions */
- if (count)
- memset(buf, 0, count);
+ if (count > 0)
+ iov_iter_zero(count, iter);
}
/**
- * vread() - read vmalloc area in a safe way.
- * @buf: buffer for reading data
- * @addr: vm address.
- * @count: number of bytes to be read.
+ * vread_iter() - read vmalloc area in a safe way to an iterator.
+ * @iter: the iterator to which data should be written.
+ * @addr: vm address.
+ * @count: number of bytes to be read.
*
* This function checks that addr is a valid vmalloc'ed area, and
* copy data from that area to a given buffer. If the given memory range
@@ -3569,13 +3571,13 @@ static void vmap_ram_vread(char *buf, char *addr, int count, unsigned long flags
* (same number as @count) or %0 if [addr...addr+count) doesn't
* include any intersection with valid vmalloc area
*/
-long vread(char *buf, char *addr, unsigned long count)
+long vread_iter(struct iov_iter *iter, char *addr, size_t count)
{
struct vmap_area *va;
struct vm_struct *vm;
- char *vaddr, *buf_start = buf;
- unsigned long buflen = count;
- unsigned long n, size, flags;
+ char *vaddr;
+ size_t buflen = count;
+ size_t n, size, flags;
might_sleep();
@@ -3595,7 +3597,7 @@ long vread(char *buf, char *addr, unsigned long count)
goto finished;
list_for_each_entry_from(va, &vmap_area_list, list) {
- if (!count)
+ if (count == 0)
break;
vm = va->vm;
@@ -3619,36 +3621,39 @@ long vread(char *buf, char *addr, unsigned long count)
if (addr >= vaddr + size)
continue;
- while (addr < vaddr) {
+
+ if (addr < vaddr) {
+ size_t to_zero = min_t(size_t, vaddr - addr, count);
+
+ iov_iter_zero(to_zero, iter);
+ addr += to_zero;
+ count -= to_zero;
if (count == 0)
goto finished;
- *buf = '\0';
- buf++;
- addr++;
- count--;
}
+
n = vaddr + size - addr;
if (n > count)
n = count;
if (flags & VMAP_RAM)
- vmap_ram_vread(buf, addr, n, flags);
+ vmap_ram_vread_iter(iter, addr, n, flags);
else if (!(vm->flags & VM_IOREMAP))
- aligned_vread(buf, addr, n);
+ aligned_vread_iter(iter, addr, n);
else /* IOREMAP area is treated as memory hole */
- memset(buf, 0, n);
- buf += n;
+ iov_iter_zero(n, iter);
+
addr += n;
count -= n;
}
finished:
up_read(&vmap_area_lock);
- if (buf == buf_start)
+ if (count == buflen)
return 0;
/* zero-fill memory holes */
- if (buf != buf_start + buflen)
- memset(buf, 0, buflen - (buf - buf_start));
+ if (count > 0)
+ iov_iter_zero(count, iter);
return buflen;
}
--
2.39.2
On Sun, 19 Mar 2023 07:09:31 +0000 Lorenzo Stoakes <[email protected]> wrote:
> vmalloc() is, by design, not permitted to be used in atomic context and
> already contains components which may sleep, so avoiding spin locks is not
> a problem from the perspective of atomic context.
>
> The global vmap_area_lock is held when the red/black tree rooted in
> vmap_are_root is accessed and thus is rather long-held and under
> potentially high contention. It is likely to be under contention for reads
> rather than write, so replace it with a rwsem.
>
> Each individual vmap_block->lock is likely to be held for less time but
> under low contention, so a mutex is not an outrageous choice here.
>
> A subset of test_vmalloc.sh performance results:-
>
> fix_size_alloc_test 0.40%
> full_fit_alloc_test 2.08%
> long_busy_list_alloc_test 0.34%
> random_size_alloc_test -0.25%
> random_size_align_alloc_test 0.06%
> ...
> all tests cycles 0.2%
>
> This represents a tiny reduction in performance that sits barely above
> noise.
>
> The reason for making this change is to build a basis for vread() to be
> usable asynchronously, this eliminating the need for a bounce buffer when
> copying data to userland in read_kcore() and allowing that to be converted
> to an iterator form.
>
I'm not understanding the final paragraph. How and where is vread()
used "asynchronously"?
On Sun, Mar 19, 2023 at 01:10:47PM -0700, Andrew Morton wrote:
> On Sun, 19 Mar 2023 07:09:31 +0000 Lorenzo Stoakes <[email protected]> wrote:
>
> > vmalloc() is, by design, not permitted to be used in atomic context and
> > already contains components which may sleep, so avoiding spin locks is not
> > a problem from the perspective of atomic context.
> >
> > The global vmap_area_lock is held when the red/black tree rooted in
> > vmap_are_root is accessed and thus is rather long-held and under
> > potentially high contention. It is likely to be under contention for reads
> > rather than write, so replace it with a rwsem.
> >
> > Each individual vmap_block->lock is likely to be held for less time but
> > under low contention, so a mutex is not an outrageous choice here.
> >
> > A subset of test_vmalloc.sh performance results:-
> >
> > fix_size_alloc_test 0.40%
> > full_fit_alloc_test 2.08%
> > long_busy_list_alloc_test 0.34%
> > random_size_alloc_test -0.25%
> > random_size_align_alloc_test 0.06%
> > ...
> > all tests cycles 0.2%
> >
> > This represents a tiny reduction in performance that sits barely above
> > noise.
> >
> > The reason for making this change is to build a basis for vread() to be
> > usable asynchronously, this eliminating the need for a bounce buffer when
> > copying data to userland in read_kcore() and allowing that to be converted
> > to an iterator form.
> >
>
> I'm not understanding the final paragraph. How and where is vread()
> used "asynchronously"?
The basis for saying asynchronous was Documentation/filesystems/vfs.rst,
which describes read_iter() as 'possibly asynchronous read with iov_iter as
destination', and read_iter() is what is (now) invoked when accessing
/proc/kcore.
However, I agree this is vague, and it is clearer to refer to the fact that we
are now writing directly to user memory and thus wish to avoid spinlocks, as we
may need to fault in user memory in doing so.
Would it be ok for you to go ahead and replace that final paragraph with the
below?:-
The reason for making this change is to build a basis for vread() to write
to user memory directly via an iterator; as a result we may cause page
faults during which we must not hold a spinlock. Doing this eliminates the
need for a bounce buffer in read_kcore() and thus permits that to be
converted to also use an iterator, as a read_iter() handler.
On Sun, Mar 19, 2023 at 08:29:16PM +0000, Lorenzo Stoakes wrote:
> The basis for saying asynchronous was based on Documentation/filesystems/vfs.rst
> describing read_iter() as 'possibly asynchronous read with iov_iter as
> destination', and read_iter() is what is (now) invoked when accessing
> /proc/kcore.
>
> However I agree this is vague and it is clearer to refer to the fact that we are
> now directly writing to user memory and thus wish to avoid spinlocks as we may
> need to fault in user memory in doing so.
>
> Would it be ok for you to go ahead and replace that final paragraph with the
> below?:-
>
> The reason for making this change is to build a basis for vread() to write
> to user memory directly via an iterator; as a result we may cause page
> faults during which we must not hold a spinlock. Doing this eliminates the
> need for a bounce buffer in read_kcore() and thus permits that to be
> converted to also use an iterator, as a read_iter() handler.
I'd say the purpose of the iterator is to abstract whether we're
accessing user memory, kernel memory or a pipe, so I'd suggest:
The reason for making this change is to build a basis for vread() to
write to memory via an iterator; as a result we may cause page faults
during which we must not hold a spinlock. Doing this eliminates the
need for a bounce buffer in read_kcore() and thus permits that to be
converted to also use an iterator, as a read_iter() handler.
I'm still undecided whether this change is really a good thing. I
think we have line-of-sight to making vmalloc (and thus kvmalloc)
usable from interrupt context, and this destroys that possibility.
I wonder if we can't do something like prefaulting the page before
taking the spinlock, then use copy_page_to_iter_atomic()
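A rough sketch of that prefault-then-atomic-copy idea, purely for illustration
(fault_in_iov_iter_writeable() is an existing iov_iter helper;
copy_page_to_iter_atomic() is the suggested helper and is assumed here rather
than taken from an existing API):

/* Fault in the destination pages before taking the spinlock... */
if (fault_in_iov_iter_writeable(iter, count))
	return 0;	/* destination inaccessible */

spin_lock(&vmap_area_lock);
/*
 * ...then copy with page faults disabled so the lock is never held
 * across a fault; a short copy would mean a prefaulted page was
 * reclaimed in the meantime and the caller would need to retry.
 */
copied = copy_page_to_iter_atomic(page, offset, length, iter);
spin_unlock(&vmap_area_lock);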
On Sun, Mar 19, 2023 at 08:47:14PM +0000, Matthew Wilcox wrote:
> On Sun, Mar 19, 2023 at 08:29:16PM +0000, Lorenzo Stoakes wrote:
> > The basis for saying asynchronous was based on Documentation/filesystems/vfs.rst
> > describing read_iter() as 'possibly asynchronous read with iov_iter as
> > destination', and read_iter() is what is (now) invoked when accessing
> > /proc/kcore.
> >
> > However I agree this is vague and it is clearer to refer to the fact that we are
> > now directly writing to user memory and thus wish to avoid spinlocks as we may
> > need to fault in user memory in doing so.
> >
> > Would it be ok for you to go ahead and replace that final paragraph with the
> > below?:-
> >
> > The reason for making this change is to build a basis for vread() to write
> > to user memory directly via an iterator; as a result we may cause page
> > faults during which we must not hold a spinlock. Doing this eliminates the
> > need for a bounce buffer in read_kcore() and thus permits that to be
> > converted to also use an iterator, as a read_iter() handler.
>
> I'd say the purpose of the iterator is to abstract whether we're
> accessing user memory, kernel memory or a pipe, so I'd suggest:
>
> The reason for making this change is to build a basis for vread() to
> write to memory via an iterator; as a result we may cause page faults
> during which we must not hold a spinlock. Doing this eliminates the
> need for a bounce buffer in read_kcore() and thus permits that to be
> converted to also use an iterator, as a read_iter() handler.
>
Thanks, sorry I missed the detail about iterators abstracting the three
different targets there, that is definitely better!
> I'm still undecided whether this change is really a good thing. I
> think we have line-of-sight to making vmalloc (and thus kvmalloc)
> usable from interrupt context, and this destroys that possibility.
>
> I wonder if we can't do something like prefaulting the page before
> taking the spinlock, then use copy_page_to_iter_atomic()
There are a number of aspects of vmalloc that are not atomic-safe,
e.g. alloc_vmap_area() and vmap_range_noflush() are marked
might_sleep(), as is vfree().
So I feel that making it safe for atomic context requires a bit more of a
general rework. Given we would be able to revisit lock types at the point
we do that (something that would fit very solidly into the context of any
such change), and given that this patch series establishes that we use an
iterator, I think it is useful to keep this as-is and defer that change
until later.
> vmalloc() is, by design, not permitted to be used in atomic context and
> already contains components which may sleep, so avoiding spin locks is not
> a problem from the perspective of atomic context.
>
> The global vmap_area_lock is held when the red/black tree rooted in
> vmap_are_root is accessed and thus is rather long-held and under
> potentially high contention. It is likely to be under contention for reads
> rather than write, so replace it with a rwsem.
>
> Each individual vmap_block->lock is likely to be held for less time but
> under low contention, so a mutex is not an outrageous choice here.
>
> A subset of test_vmalloc.sh performance results:-
>
> fix_size_alloc_test 0.40%
> full_fit_alloc_test 2.08%
> long_busy_list_alloc_test 0.34%
> random_size_alloc_test -0.25%
> random_size_align_alloc_test 0.06%
> ...
> all tests cycles 0.2%
>
> This represents a tiny reduction in performance that sits barely above
> noise.
>
How important to have many simultaneous users of vread()? I do not see a
big reason to switch into mutexes due to performance impact and making it
less atomic.
So, how important for you to have this change?
--
Uladzislau Rezki
On Mon, Mar 20, 2023 at 08:54:33AM +0100, Uladzislau Rezki wrote:
> > vmalloc() is, by design, not permitted to be used in atomic context and
> > already contains components which may sleep, so avoiding spin locks is not
> > a problem from the perspective of atomic context.
> >
> > The global vmap_area_lock is held when the red/black tree rooted in
> > vmap_are_root is accessed and thus is rather long-held and under
> > potentially high contention. It is likely to be under contention for reads
> > rather than write, so replace it with a rwsem.
> >
> > Each individual vmap_block->lock is likely to be held for less time but
> > under low contention, so a mutex is not an outrageous choice here.
> >
> > A subset of test_vmalloc.sh performance results:-
> >
> > fix_size_alloc_test 0.40%
> > full_fit_alloc_test 2.08%
> > long_busy_list_alloc_test 0.34%
> > random_size_alloc_test -0.25%
> > random_size_align_alloc_test 0.06%
> > ...
> > all tests cycles 0.2%
> >
> > This represents a tiny reduction in performance that sits barely above
> > noise.
> >
> How important to have many simultaneous users of vread()? I do not see a
> big reason to switch into mutexes due to performance impact and making it
> less atomic.
It's less about simultaneous users of vread() and more about being able to write
direct to user memory rather than via a bounce buffer and not hold a spinlock
over possible page faults.
The performance impact is barely above noise (I got fairly widely varying
results), so I don't think it's really much of a cost at all. I can't imagine
there are many users critically dependent on a sub-single digit % reduction in
speed in vmalloc() allocation.
As I was saying to Willy, the code is already not atomic, or rather needs rework
to become atomic-safe (there are a smattering of might_sleep()'s throughout)
However, given your objection alongside Willy's, let me examine Willy's
suggestion that we instead of doing this, prefault the user memory in advance of
the vread call.
>
> So, how important for you to have this change?
>
Personally, always very important :)
> --
> Uladzislau Rezki
On Mon, Mar 20, 2023 at 08:25:32AM +0000, Lorenzo Stoakes wrote:
> On Mon, Mar 20, 2023 at 08:54:33AM +0100, Uladzislau Rezki wrote:
> > > vmalloc() is, by design, not permitted to be used in atomic context and
> > > already contains components which may sleep, so avoiding spin locks is not
> > > a problem from the perspective of atomic context.
> > >
> > > The global vmap_area_lock is held when the red/black tree rooted in
> > > vmap_are_root is accessed and thus is rather long-held and under
> > > potentially high contention. It is likely to be under contention for reads
> > > rather than write, so replace it with a rwsem.
> > >
> > > Each individual vmap_block->lock is likely to be held for less time but
> > > under low contention, so a mutex is not an outrageous choice here.
> > >
> > > A subset of test_vmalloc.sh performance results:-
> > >
> > > fix_size_alloc_test 0.40%
> > > full_fit_alloc_test 2.08%
> > > long_busy_list_alloc_test 0.34%
> > > random_size_alloc_test -0.25%
> > > random_size_align_alloc_test 0.06%
> > > ...
> > > all tests cycles 0.2%
> > >
> > > This represents a tiny reduction in performance that sits barely above
> > > noise.
> > >
> > How important to have many simultaneous users of vread()? I do not see a
> > big reason to switch into mutexes due to performance impact and making it
> > less atomic.
>
> It's less about simultaneous users of vread() and more about being able to write
> direct to user memory rather than via a bounce buffer and not hold a spinlock
> over possible page faults.
>
> The performance impact is barely above noise (I got fairly widely varying
> results), so I don't think it's really much of a cost at all. I can't imagine
> there are many users critically dependent on a sub-single digit % reduction in
> speed in vmalloc() allocation.
>
> As I was saying to Willy, the code is already not atomic, or rather needs rework
> to become atomic-safe (there are a smattering of might_sleep()'s throughout)
>
> However, given your objection alongside Willy's, let me examine Willy's
> suggestion that we instead of doing this, prefault the user memory in advance of
> the vread call.
>
Just a quick perf tests shows regression around 6%. 10 workers test_mask is 31:
# default
[ 140.349731] All test took worker0=485061693537 cycles
[ 140.386065] All test took worker1=486504572954 cycles
[ 140.418452] All test took worker2=467204082542 cycles
[ 140.435895] All test took worker3=512591010219 cycles
[ 140.458316] All test took worker4=448583324125 cycles
[ 140.494244] All test took worker5=501018129647 cycles
[ 140.518144] All test took worker6=516224787767 cycles
[ 140.535472] All test took worker7=442025617137 cycles
[ 140.558249] All test took worker8=503337286539 cycles
[ 140.590571] All test took worker9=494369561574 cycles
# patch
[ 144.464916] All test took worker0=530373399067 cycles
[ 144.492904] All test took worker1=522641540924 cycles
[ 144.528999] All test took worker2=529711158267 cycles
[ 144.552963] All test took worker3=527389011775 cycles
[ 144.592951] All test took worker4=529583252449 cycles
[ 144.610286] All test took worker5=523605706016 cycles
[ 144.627690] All test took worker6=531494777011 cycles
[ 144.653046] All test took worker7=527150114726 cycles
[ 144.669818] All test took worker8=526599712235 cycles
[ 144.693428] All test took worker9=526057490851 cycles
> >
> > So, how important for you to have this change?
> >
>
> Personally, always very important :)
>
This is good. Personal opinion always wins :)
--
Uladzislau Rezki
On Mon, Mar 20, 2023 at 09:32:06AM +0100, Uladzislau Rezki wrote:
> On Mon, Mar 20, 2023 at 08:25:32AM +0000, Lorenzo Stoakes wrote:
> > On Mon, Mar 20, 2023 at 08:54:33AM +0100, Uladzislau Rezki wrote:
> > > > vmalloc() is, by design, not permitted to be used in atomic context and
> > > > already contains components which may sleep, so avoiding spin locks is not
> > > > a problem from the perspective of atomic context.
> > > >
> > > > The global vmap_area_lock is held when the red/black tree rooted in
> > > > vmap_are_root is accessed and thus is rather long-held and under
> > > > potentially high contention. It is likely to be under contention for reads
> > > > rather than write, so replace it with a rwsem.
> > > >
> > > > Each individual vmap_block->lock is likely to be held for less time but
> > > > under low contention, so a mutex is not an outrageous choice here.
> > > >
> > > > A subset of test_vmalloc.sh performance results:-
> > > >
> > > > fix_size_alloc_test 0.40%
> > > > full_fit_alloc_test 2.08%
> > > > long_busy_list_alloc_test 0.34%
> > > > random_size_alloc_test -0.25%
> > > > random_size_align_alloc_test 0.06%
> > > > ...
> > > > all tests cycles 0.2%
> > > >
> > > > This represents a tiny reduction in performance that sits barely above
> > > > noise.
> > > >
> > > How important to have many simultaneous users of vread()? I do not see a
> > > big reason to switch into mutexes due to performance impact and making it
> > > less atomic.
> >
> > It's less about simultaneous users of vread() and more about being able to write
> > direct to user memory rather than via a bounce buffer and not hold a spinlock
> > over possible page faults.
> >
> > The performance impact is barely above noise (I got fairly widely varying
> > results), so I don't think it's really much of a cost at all. I can't imagine
> > there are many users critically dependent on a sub-single digit % reduction in
> > speed in vmalloc() allocation.
> >
> > As I was saying to Willy, the code is already not atomic, or rather needs rework
> > to become atomic-safe (there are a smattering of might_sleep()'s throughout)
> >
> > However, given your objection alongside Willy's, let me examine Willy's
> > suggestion that we instead of doing this, prefault the user memory in advance of
> > the vread call.
> >
> Just a quick perf tests shows regression around 6%. 10 workers test_mask is 31:
>
> # default
> [ 140.349731] All test took worker0=485061693537 cycles
> [ 140.386065] All test took worker1=486504572954 cycles
> [ 140.418452] All test took worker2=467204082542 cycles
> [ 140.435895] All test took worker3=512591010219 cycles
> [ 140.458316] All test took worker4=448583324125 cycles
> [ 140.494244] All test took worker5=501018129647 cycles
> [ 140.518144] All test took worker6=516224787767 cycles
> [ 140.535472] All test took worker7=442025617137 cycles
> [ 140.558249] All test took worker8=503337286539 cycles
> [ 140.590571] All test took worker9=494369561574 cycles
>
> # patch
> [ 144.464916] All test took worker0=530373399067 cycles
> [ 144.492904] All test took worker1=522641540924 cycles
> [ 144.528999] All test took worker2=529711158267 cycles
> [ 144.552963] All test took worker3=527389011775 cycles
> [ 144.592951] All test took worker4=529583252449 cycles
> [ 144.610286] All test took worker5=523605706016 cycles
> [ 144.627690] All test took worker6=531494777011 cycles
> [ 144.653046] All test took worker7=527150114726 cycles
> [ 144.669818] All test took worker8=526599712235 cycles
> [ 144.693428] All test took worker9=526057490851 cycles
>
OK ouch, that's worse than I observed! Let me try this prefault approach and
then we can revert back to spinlocks.
> > >
> > > So, how important for you to have this change?
> > >
> >
> > Personally, always very important :)
> >
> This is good. Personal opinion always wins :)
>
The heart always wins ;) well, an adaption here can make everybody's hearts
happy I think.
> --
> Uladzislau Rezki
On Sun, Mar 19, 2023 at 09:16:18PM +0000, Lorenzo Stoakes wrote:
> On Sun, Mar 19, 2023 at 08:47:14PM +0000, Matthew Wilcox wrote:
> > I wonder if we can't do something like prefaulting the page before
> > taking the spinlock, then use copy_page_to_iter_atomic()
>
<snip>
On second thoughts, and after discussing with Uladzislau, this seems like a
more sensible approach.
I'm going to respin with prefaulting and revert to the previous locking.
On 19.03.23 08:09, Lorenzo Stoakes wrote:
> Commit df04abfd181a ("fs/proc/kcore.c: Add bounce buffer for ktext data")
> introduced the use of a bounce buffer to retrieve kernel text data for
> /proc/kcore in order to avoid failures arising from hardened user copies
> enabled by CONFIG_HARDENED_USERCOPY in check_kernel_text_object().
>
> We can avoid doing this if instead of copy_to_user() we use _copy_to_user()
> which bypasses the hardening check. This is more efficient than using a
> bounce buffer and simplifies the code.
>
> We do so as part an overall effort to eliminate bounce buffer usage in the
> function with an eye to converting it an iterator read.
>
> Signed-off-by: Lorenzo Stoakes <[email protected]>
> ---
> fs/proc/kcore.c | 17 +++++------------
> 1 file changed, 5 insertions(+), 12 deletions(-)
>
> diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
> index 71157ee35c1a..556f310d6aa4 100644
> --- a/fs/proc/kcore.c
> +++ b/fs/proc/kcore.c
> @@ -541,19 +541,12 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
> case KCORE_VMEMMAP:
> case KCORE_TEXT:
> /*
> - * Using bounce buffer to bypass the
> - * hardened user copy kernel text checks.
> + * We use _copy_to_user() to bypass usermode hardening
> + * which would otherwise prevent this operation.
> */
> - if (copy_from_kernel_nofault(buf, (void *)start, tsz)) {
> - if (clear_user(buffer, tsz)) {
> - ret = -EFAULT;
> - goto out;
> - }
> - } else {
> - if (copy_to_user(buffer, buf, tsz)) {
> - ret = -EFAULT;
> - goto out;
> - }
> + if (_copy_to_user(buffer, (char *)start, tsz)) {
> + ret = -EFAULT;
> + goto out;
> }
> break;
> default:
Looks correct to me. The only difference between copy_to_user() and
_copy_to_user() is the check_copy_size() check, which ends up calling
check_kernel_text_object().
Reviewed-by: David Hildenbrand <[email protected]>
--
Thanks,
David / dhildenb
On Mon, Mar 20, 2023 at 08:35:11AM +0000, Lorenzo Stoakes wrote:
> On Mon, Mar 20, 2023 at 09:32:06AM +0100, Uladzislau Rezki wrote:
> > On Mon, Mar 20, 2023 at 08:25:32AM +0000, Lorenzo Stoakes wrote:
> > > On Mon, Mar 20, 2023 at 08:54:33AM +0100, Uladzislau Rezki wrote:
> > > > > vmalloc() is, by design, not permitted to be used in atomic context and
> > > > > already contains components which may sleep, so avoiding spin locks is not
> > > > > a problem from the perspective of atomic context.
> > > > >
> > > > > The global vmap_area_lock is held when the red/black tree rooted in
> > > > > vmap_are_root is accessed and thus is rather long-held and under
> > > > > potentially high contention. It is likely to be under contention for reads
> > > > > rather than write, so replace it with a rwsem.
> > > > >
> > > > > Each individual vmap_block->lock is likely to be held for less time but
> > > > > under low contention, so a mutex is not an outrageous choice here.
> > > > >
> > > > > A subset of test_vmalloc.sh performance results:-
> > > > >
> > > > > fix_size_alloc_test 0.40%
> > > > > full_fit_alloc_test 2.08%
> > > > > long_busy_list_alloc_test 0.34%
> > > > > random_size_alloc_test -0.25%
> > > > > random_size_align_alloc_test 0.06%
> > > > > ...
> > > > > all tests cycles 0.2%
> > > > >
> > > > > This represents a tiny reduction in performance that sits barely above
> > > > > noise.
> > > > >
> > > > How important to have many simultaneous users of vread()? I do not see a
> > > > big reason to switch into mutexes due to performance impact and making it
> > > > less atomic.
> > >
> > > It's less about simultaneous users of vread() and more about being able to write
> > > direct to user memory rather than via a bounce buffer and not hold a spinlock
> > > over possible page faults.
> > >
> > > The performance impact is barely above noise (I got fairly widely varying
> > > results), so I don't think it's really much of a cost at all. I can't imagine
> > > there are many users critically dependent on a sub-single digit % reduction in
> > > speed in vmalloc() allocation.
> > >
> > > As I was saying to Willy, the code is already not atomic, or rather needs rework
> > > to become atomic-safe (there are a smattering of might_sleep()'s throughout)
> > >
> > > However, given your objection alongside Willy's, let me examine Willy's
> > > suggestion that we instead of doing this, prefault the user memory in advance of
> > > the vread call.
> > >
> > Just a quick perf tests shows regression around 6%. 10 workers test_mask is 31:
> >
> > # default
> > [ 140.349731] All test took worker0=485061693537 cycles
> > [ 140.386065] All test took worker1=486504572954 cycles
> > [ 140.418452] All test took worker2=467204082542 cycles
> > [ 140.435895] All test took worker3=512591010219 cycles
> > [ 140.458316] All test took worker4=448583324125 cycles
> > [ 140.494244] All test took worker5=501018129647 cycles
> > [ 140.518144] All test took worker6=516224787767 cycles
> > [ 140.535472] All test took worker7=442025617137 cycles
> > [ 140.558249] All test took worker8=503337286539 cycles
> > [ 140.590571] All test took worker9=494369561574 cycles
> >
> > # patch
> > [ 144.464916] All test took worker0=530373399067 cycles
> > [ 144.492904] All test took worker1=522641540924 cycles
> > [ 144.528999] All test took worker2=529711158267 cycles
> > [ 144.552963] All test took worker3=527389011775 cycles
> > [ 144.592951] All test took worker4=529583252449 cycles
> > [ 144.610286] All test took worker5=523605706016 cycles
> > [ 144.627690] All test took worker6=531494777011 cycles
> > [ 144.653046] All test took worker7=527150114726 cycles
> > [ 144.669818] All test took worker8=526599712235 cycles
> > [ 144.693428] All test took worker9=526057490851 cycles
> >
>
> OK ouch, that's worse than I observed! Let me try this prefault approach and
> then we can revert back to spinlocks.
>
> > > >
> > > > So, how important for you to have this change?
> > > >
> > >
> > > Personally, always very important :)
> > >
> > This is good. Personal opinion always wins :)
> >
>
> The heart always wins ;) well, an adaption here can make everybody's hearts
> happy I think.
>
Totally agree :)
--
Uladzislau Rezki
On Sun, Mar 19, 2023 at 07:09:31AM +0000, Lorenzo Stoakes wrote:
> vmalloc() is, by design, not permitted to be used in atomic context and
> already contains components which may sleep, so avoiding spin locks is not
> a problem from the perspective of atomic context.
>
> The global vmap_area_lock is held when the red/black tree rooted in
> vmap_are_root is accessed and thus is rather long-held and under
> potentially high contention. It is likely to be under contention for reads
> rather than write, so replace it with a rwsem.
>
> Each individual vmap_block->lock is likely to be held for less time but
> under low contention, so a mutex is not an outrageous choice here.
>
> A subset of test_vmalloc.sh performance results:-
>
> fix_size_alloc_test 0.40%
> full_fit_alloc_test 2.08%
> long_busy_list_alloc_test 0.34%
> random_size_alloc_test -0.25%
> random_size_align_alloc_test 0.06%
> ...
> all tests cycles 0.2%
>
> This represents a tiny reduction in performance that sits barely above
> noise.
I'm travelling right now, but give me a few days and I'll test this
against the XFS workloads that hammer the global vmalloc spin lock
really, really badly. XFS can use vm_map_ram and vmalloc really
heavily for metadata buffers and hit the global spin lock from every
CPU in the system at the same time (i.e. highly concurrent
workloads). vmalloc is also heavily used in the hottest path
through the journal where we process and calculate delta changes to
several million items every second, again spread across every CPU in
the system at the same time.
We really need the global spinlock to go away completely, but in the
mean time a shared read lock should help a little bit....
-Dave.
--
Dave Chinner
[email protected]
On Tue, Mar 21, 2023 at 12:09:12PM +1100, Dave Chinner wrote:
> On Sun, Mar 19, 2023 at 07:09:31AM +0000, Lorenzo Stoakes wrote:
> > vmalloc() is, by design, not permitted to be used in atomic context and
> > already contains components which may sleep, so avoiding spin locks is not
> > a problem from the perspective of atomic context.
> >
> > The global vmap_area_lock is held when the red/black tree rooted in
> > vmap_are_root is accessed and thus is rather long-held and under
> > potentially high contention. It is likely to be under contention for reads
> > rather than write, so replace it with a rwsem.
> >
> > Each individual vmap_block->lock is likely to be held for less time but
> > under low contention, so a mutex is not an outrageous choice here.
> >
> > A subset of test_vmalloc.sh performance results:-
> >
> > fix_size_alloc_test 0.40%
> > full_fit_alloc_test 2.08%
> > long_busy_list_alloc_test 0.34%
> > random_size_alloc_test -0.25%
> > random_size_align_alloc_test 0.06%
> > ...
> > all tests cycles 0.2%
> >
> > This represents a tiny reduction in performance that sits barely above
> > noise.
>
> I'm travelling right now, but give me a few days and I'll test this
> against the XFS workloads that hammer the global vmalloc spin lock
> really, really badly. XFS can use vm_map_ram and vmalloc really
> heavily for metadata buffers and hit the global spin lock from every
> CPU in the system at the same time (i.e. highly concurrent
> workloads). vmalloc is also heavily used in the hottest path
> throught the journal where we process and calculate delta changes to
> several million items every second, again spread across every CPU in
> the system at the same time.
>
> We really need the global spinlock to go away completely, but in the
> mean time a shared read lock should help a little bit....
>
I am working on it. I have submitted a proposal for how to eliminate it:
<snip>
Hello, LSF.
Title: Introduce a per-cpu-vmap-cache to eliminate a vmap lock contention
Description:
Currently the vmap code does not scale with the number of CPU cores in a
system because the global vmap space is protected by a single spinlock.
Such an approach has a clear bottleneck when many CPUs simultaneously
access one resource.
In this talk I would like to describe the drawback, show some data related
to contention and the places in the code where it occurs. Apart from that I
would like to share ideas on how to eliminate it, presenting a few
approaches and comparing them.
Requirements:
* It should be a per-cpu approach;
* Searching for freed ptrs should not interfere with other freeing (as much as we can);
* Offload allocated areas (busy ones) per-cpu;
* Cache ready-sized objects or merge them into one big per-cpu space (split on demand);
* Lazily-freed areas either drained per-cpu individually or by one CPU for all;
* Prefetch a fixed size in front and allocate per-cpu.
Goals:
* Implement a per-cpu way of allocation to eliminate contention.
Thanks!
<snip>
--
Uladzislau Rezki
On Tue, Mar 21, 2023 at 06:23:39AM +0100, Uladzislau Rezki wrote:
> On Tue, Mar 21, 2023 at 12:09:12PM +1100, Dave Chinner wrote:
> > On Sun, Mar 19, 2023 at 07:09:31AM +0000, Lorenzo Stoakes wrote:
> > > vmalloc() is, by design, not permitted to be used in atomic context and
> > > already contains components which may sleep, so avoiding spin locks is not
> > > a problem from the perspective of atomic context.
> > >
> > > The global vmap_area_lock is held when the red/black tree rooted in
> > > vmap_are_root is accessed and thus is rather long-held and under
> > > potentially high contention. It is likely to be under contention for reads
> > > rather than write, so replace it with a rwsem.
> > >
> > > Each individual vmap_block->lock is likely to be held for less time but
> > > under low contention, so a mutex is not an outrageous choice here.
> > >
> > > A subset of test_vmalloc.sh performance results:-
> > >
> > > fix_size_alloc_test 0.40%
> > > full_fit_alloc_test 2.08%
> > > long_busy_list_alloc_test 0.34%
> > > random_size_alloc_test -0.25%
> > > random_size_align_alloc_test 0.06%
> > > ...
> > > all tests cycles 0.2%
> > >
> > > This represents a tiny reduction in performance that sits barely above
> > > noise.
> >
> > I'm travelling right now, but give me a few days and I'll test this
> > against the XFS workloads that hammer the global vmalloc spin lock
> > really, really badly. XFS can use vm_map_ram and vmalloc really
> > heavily for metadata buffers and hit the global spin lock from every
> > CPU in the system at the same time (i.e. highly concurrent
> > workloads). vmalloc is also heavily used in the hottest path
> > throught the journal where we process and calculate delta changes to
> > several million items every second, again spread across every CPU in
> > the system at the same time.
> >
> > We really need the global spinlock to go away completely, but in the
> > mean time a shared read lock should help a little bit....
> >
Hugely appreciated, Dave; however, I must disappoint on the rwsem, as I have
now reworked my patch set to use the original locks in order to satisfy
Willy's desire to make vmalloc atomic in future, and Uladzislau's desire not
to take a ~6% performance hit -
https://lore.kernel.org/all/[email protected]/
> I am working on it. I submitted a proposal how to eliminate it:
>
>
> <snip>
> Hello, LSF.
>
> Title: Introduce a per-cpu-vmap-cache to eliminate a vmap lock contention
>
> Description:
> Currently the vmap code is not scaled to number of CPU cores in a system
> because a global vmap space is protected by a single spinlock. Such approach
> has a clear bottleneck if many CPUs simultaneously access to one resource.
>
> In this talk i would like to describe a drawback, show some data related
> to contentions and places where those occur in a code. Apart of that i
> would like to share ideas how to eliminate it providing a few approaches
> and compare them.
>
> Requirements:
> * It should be a per-cpu approach;
> * Search of freed ptrs should not interfere with other freeing(as much as we can);
> * - offload allocated areas(buzy ones) per-cpu;
> * Cache ready sized objects or merge them into one big per-cpu-space(split on demand);
> * Lazily-freed areas either drained per-cpu individually or by one CPU for all;
> * Prefetch a fixed size in front and allocate per-cpu
>
> Goals:
> * Implement a per-cpu way of allocation to eliminate a contention.
>
> Thanks!
> <snip>
>
> --
> Uladzislau Rezki
>
That's really awesome! I will come to that talk at LSF/MM :) Being able to
hold the lock in atomic context also seems to be an important aspect going
forward.
On Tue, Mar 21, 2023 at 07:45:56AM +0000, Lorenzo Stoakes wrote:
> On Tue, Mar 21, 2023 at 06:23:39AM +0100, Uladzislau Rezki wrote:
> > On Tue, Mar 21, 2023 at 12:09:12PM +1100, Dave Chinner wrote:
> > > On Sun, Mar 19, 2023 at 07:09:31AM +0000, Lorenzo Stoakes wrote:
> > > > vmalloc() is, by design, not permitted to be used in atomic context and
> > > > already contains components which may sleep, so avoiding spin locks is not
> > > > a problem from the perspective of atomic context.
> > > >
> > > > The global vmap_area_lock is held when the red/black tree rooted in
> > > > vmap_are_root is accessed and thus is rather long-held and under
> > > > potentially high contention. It is likely to be under contention for reads
> > > > rather than write, so replace it with a rwsem.
> > > >
> > > > Each individual vmap_block->lock is likely to be held for less time but
> > > > under low contention, so a mutex is not an outrageous choice here.
> > > >
> > > > A subset of test_vmalloc.sh performance results:-
> > > >
> > > > fix_size_alloc_test 0.40%
> > > > full_fit_alloc_test 2.08%
> > > > long_busy_list_alloc_test 0.34%
> > > > random_size_alloc_test -0.25%
> > > > random_size_align_alloc_test 0.06%
> > > > ...
> > > > all tests cycles 0.2%
> > > >
> > > > This represents a tiny reduction in performance that sits barely above
> > > > noise.
> > >
> > > I'm travelling right now, but give me a few days and I'll test this
> > > against the XFS workloads that hammer the global vmalloc spin lock
> > > really, really badly. XFS can use vm_map_ram and vmalloc really
> > > heavily for metadata buffers and hit the global spin lock from every
> > > CPU in the system at the same time (i.e. highly concurrent
> > > workloads). vmalloc is also heavily used in the hottest path
> > > throught the journal where we process and calculate delta changes to
> > > several million items every second, again spread across every CPU in
> > > the system at the same time.
> > >
> > > We really need the global spinlock to go away completely, but in the
> > > mean time a shared read lock should help a little bit....
> > >
>
> Hugely appreciated Dave, however I must disappoint on the rwsem as I have now
> reworked my patch set to use the original locks in order to satisfy Willy's
> desire to make vmalloc atomic in future, and Uladzislau's desire to not have a
> ~6% performance hit -
> https://lore.kernel.org/all/[email protected]/
>
> > I am working on it. I submitted a proposal how to eliminate it:
> >
> >
> > <snip>
> > Hello, LSF.
> >
> > Title: Introduce a per-cpu-vmap-cache to eliminate a vmap lock contention
> >
> > Description:
> > Currently the vmap code is not scaled to number of CPU cores in a system
> > because a global vmap space is protected by a single spinlock. Such approach
> > has a clear bottleneck if many CPUs simultaneously access to one resource.
> >
> > In this talk i would like to describe a drawback, show some data related
> > to contentions and places where those occur in a code. Apart of that i
> > would like to share ideas how to eliminate it providing a few approaches
> > and compare them.
> >
> > Requirements:
> > * It should be a per-cpu approach;
> > * Search of freed ptrs should not interfere with other freeing(as much as we can);
> > * - offload allocated areas(buzy ones) per-cpu;
> > * Cache ready sized objects or merge them into one big per-cpu-space(split on demand);
> > * Lazily-freed areas either drained per-cpu individually or by one CPU for all;
> > * Prefetch a fixed size in front and allocate per-cpu
> >
> > Goals:
> > * Implement a per-cpu way of allocation to eliminate a contention.
> >
> > Thanks!
> > <snip>
> >
> > --
> > Uladzislau Rezki
> >
>
> That's really awesome! I will come to that talk at LSF/MM :) being able to
> sustain the lock in atomic context seems to be an aspect that is important going
> forward also.
>
Uhh... So I need to prepare then :)))
--
Uladzislau Rezki
On Tue, Mar 21, 2023 at 07:45:56AM +0000, Lorenzo Stoakes wrote:
> On Tue, Mar 21, 2023 at 06:23:39AM +0100, Uladzislau Rezki wrote:
> > On Tue, Mar 21, 2023 at 12:09:12PM +1100, Dave Chinner wrote:
> > > On Sun, Mar 19, 2023 at 07:09:31AM +0000, Lorenzo Stoakes wrote:
> > > > vmalloc() is, by design, not permitted to be used in atomic context and
> > > > already contains components which may sleep, so avoiding spin locks is not
> > > > a problem from the perspective of atomic context.
> > > >
> > > > The global vmap_area_lock is held when the red/black tree rooted in
> > > > vmap_are_root is accessed and thus is rather long-held and under
> > > > potentially high contention. It is likely to be under contention for reads
> > > > rather than write, so replace it with a rwsem.
> > > >
> > > > Each individual vmap_block->lock is likely to be held for less time but
> > > > under low contention, so a mutex is not an outrageous choice here.
> > > >
> > > > A subset of test_vmalloc.sh performance results:-
> > > >
> > > > fix_size_alloc_test 0.40%
> > > > full_fit_alloc_test 2.08%
> > > > long_busy_list_alloc_test 0.34%
> > > > random_size_alloc_test -0.25%
> > > > random_size_align_alloc_test 0.06%
> > > > ...
> > > > all tests cycles 0.2%
> > > >
> > > > This represents a tiny reduction in performance that sits barely above
> > > > noise.
> > >
> > > I'm travelling right now, but give me a few days and I'll test this
> > > against the XFS workloads that hammer the global vmalloc spin lock
> > > really, really badly. XFS can use vm_map_ram and vmalloc really
> > > heavily for metadata buffers and hit the global spin lock from every
> > > CPU in the system at the same time (i.e. highly concurrent
> > > workloads). vmalloc is also heavily used in the hottest path
> > > throught the journal where we process and calculate delta changes to
> > > several million items every second, again spread across every CPU in
> > > the system at the same time.
> > >
> > > We really need the global spinlock to go away completely, but in the
> > > mean time a shared read lock should help a little bit....
> > >
>
> Hugely appreciated Dave, however I must disappoint on the rwsem as I have now
> reworked my patch set to use the original locks in order to satisfy Willy's
> desire to make vmalloc atomic in future, and Uladzislau's desire to not have a
> ~6% performance hit -
> https://lore.kernel.org/all/[email protected]/
Yeah, I'd already read that.
What I want to do, though, is to determine whether the problem is
shared access contention or exclusive access contention. If it's
exclusive access contention, then an rwsem will do nothing to
alleviate the problem, and that's kinda critical to know before any
fix for the contention problems is worked out...
> > I am working on it. I submitted a proposal how to eliminate it:
> >
> >
> > <snip>
> > Hello, LSF.
> >
> > Title: Introduce a per-cpu-vmap-cache to eliminate a vmap lock contention
> >
> > Description:
> > Currently the vmap code is not scaled to number of CPU cores in a system
> > because a global vmap space is protected by a single spinlock. Such approach
> > has a clear bottleneck if many CPUs simultaneously access to one resource.
> >
> > In this talk i would like to describe a drawback, show some data related
> > to contentions and places where those occur in a code. Apart of that i
> > would like to share ideas how to eliminate it providing a few approaches
> > and compare them.
If you want data about contention problems with vmalloc
> > Requirements:
> > * It should be a per-cpu approach;
Hmmmm. My 2c worth on this: That is not a requirement.
That's a -solution-.
The requirement is that independent concurrent vmalloc/vfree
operations do not severely contend with each other.
Yes, the solution will probably involve sharding the resource space
across multiple independent structures (as we do in filesystems with
block groups, allocation groups, etc.) but that does not necessarily
need the structures to be per-cpu.
e.g. per-node vmalloc arenas might be sufficient and allow more
expensive but more efficient indexing structures to be used, because
we don't have to care about the explosion of memory that
fine-grained per-cpu indexing generally entails. This may also fit
into the existing per-node structure of the memory reclaim
infrastructure to manage things like compaction, balancing, etc. of
vmalloc space assigned to the given node.
Hence I think saying "per-cpu is a requirement" kinda prevents
exploration of other novel solutions that may have advantages other
than "just solves the concurrency problem"...
> > * Search of freed ptrs should not interfere with other freeing(as much as we can);
> > * - offload allocated areas(buzy ones) per-cpu;
> > * Cache ready sized objects or merge them into one big per-cpu-space(split on demand);
> > * Lazily-freed areas either drained per-cpu individually or by one CPU for all;
> > * Prefetch a fixed size in front and allocate per-cpu
I'd call these desired traits and/or potential optimisations, not
hard requirements.
> > Goals:
> > * Implement a per-cpu way of allocation to eliminate a contention.
The goal should be to "allow contention-free vmalloc operations", not
that we implement a specific solution.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Tue, Mar 21, 2023 at 09:05:26PM +1100, Dave Chinner wrote:
> On Tue, Mar 21, 2023 at 07:45:56AM +0000, Lorenzo Stoakes wrote:
> > On Tue, Mar 21, 2023 at 06:23:39AM +0100, Uladzislau Rezki wrote:
> > > On Tue, Mar 21, 2023 at 12:09:12PM +1100, Dave Chinner wrote:
> > > > On Sun, Mar 19, 2023 at 07:09:31AM +0000, Lorenzo Stoakes wrote:
> > > > > vmalloc() is, by design, not permitted to be used in atomic context and
> > > > > already contains components which may sleep, so avoiding spin locks is not
> > > > > a problem from the perspective of atomic context.
> > > > >
> > > > > The global vmap_area_lock is held when the red/black tree rooted in
> > > > > vmap_are_root is accessed and thus is rather long-held and under
> > > > > potentially high contention. It is likely to be under contention for reads
> > > > > rather than write, so replace it with a rwsem.
> > > > >
> > > > > Each individual vmap_block->lock is likely to be held for less time but
> > > > > under low contention, so a mutex is not an outrageous choice here.
> > > > >
> > > > > A subset of test_vmalloc.sh performance results:-
> > > > >
> > > > > fix_size_alloc_test 0.40%
> > > > > full_fit_alloc_test 2.08%
> > > > > long_busy_list_alloc_test 0.34%
> > > > > random_size_alloc_test -0.25%
> > > > > random_size_align_alloc_test 0.06%
> > > > > ...
> > > > > all tests cycles 0.2%
> > > > >
> > > > > This represents a tiny reduction in performance that sits barely above
> > > > > noise.
> > > >
> > > > I'm travelling right now, but give me a few days and I'll test this
> > > > against the XFS workloads that hammer the global vmalloc spin lock
> > > > really, really badly. XFS can use vm_map_ram and vmalloc really
> > > > heavily for metadata buffers and hit the global spin lock from every
> > > > CPU in the system at the same time (i.e. highly concurrent
> > > > workloads). vmalloc is also heavily used in the hottest path
> > > > throught the journal where we process and calculate delta changes to
> > > > several million items every second, again spread across every CPU in
> > > > the system at the same time.
> > > >
> > > > We really need the global spinlock to go away completely, but in the
> > > > mean time a shared read lock should help a little bit....
> > > >
> >
> > Hugely appreciated Dave, however I must disappoint on the rwsem as I have now
> > reworked my patch set to use the original locks in order to satisfy Willy's
> > desire to make vmalloc atomic in future, and Uladzislau's desire to not have a
> > ~6% performance hit -
> > https://lore.kernel.org/all/[email protected]/
>
> Yeah, I'd already read that.
>
> What I want to do, though, is to determine whether the problem
> shared access contention or exclusive access contention. If it's
> exclusive access contention, then an rwsem will do nothing to
> alleviate the problem, and that's kinda critical to know before any
> fix for the contention problems are worked out...
>
> > > I am working on it. I submitted a proposal how to eliminate it:
> > >
> > >
> > > <snip>
> > > Hello, LSF.
> > >
> > > Title: Introduce a per-cpu-vmap-cache to eliminate a vmap lock contention
> > >
> > > Description:
> > > Currently the vmap code is not scaled to number of CPU cores in a system
> > > because a global vmap space is protected by a single spinlock. Such approach
> > > has a clear bottleneck if many CPUs simultaneously access to one resource.
> > >
> > > In this talk i would like to describe a drawback, show some data related
> > > to contentions and places where those occur in a code. Apart of that i
> > > would like to share ideas how to eliminate it providing a few approaches
> > > and compare them.
>
> If you want data about contention problems with vmalloc
>
> > > Requirements:
> > > * It should be a per-cpu approach;
>
> Hmmmm. My 2c worth on this: That is not a requirement.
>
> That's a -solution-.
>
> The requirement is that independent concurrent vmalloc/vfree
> operations do not severely contend with each other.
>
> Yes, the solution will probably involve sharding the resource space
> across mulitple independent structures (as we do in filesystems with
> block groups, allocations groups, etc) but that does not necessarily
> need the structures to be per-cpu.
>
> e.g per-node vmalloc arenas might be sufficient and allow more
> expensive but more efficient indexing structures to be used because
> we don't have to care about the explosion of memory that
> fine-grained per-cpu indexing generally entails. This may also fit
> in to the existing per-node structure of the memory reclaim
> infrastructure to manage things like compaction, balancing, etc of
> vmalloc space assigned to the given node.
>
> Hence I think saying "per-cpu is a requirement" kinda prevents
> exploration of other novel solutions that may have advantages other
> than "just solves the concurrency problem"...
>
> > > * Search of freed ptrs should not interfere with other freeing(as much as we can);
> > > * - offload allocated areas(buzy ones) per-cpu;
> > > * Cache ready sized objects or merge them into one big per-cpu-space(split on demand);
> > > * Lazily-freed areas either drained per-cpu individually or by one CPU for all;
> > > * Prefetch a fixed size in front and allocate per-cpu
>
> I'd call these desired traits and/or potential optimisations, not
> hard requirements.
>
> > > Goals:
> > > * Implement a per-cpu way of allocation to eliminate a contention.
>
> The goal should be to "allow contention-free vmalloc operations", not
> that we implement a specific solution.
>
I think we are on the same page; I do not see that we disagree on anything.
Probably I was a bit more specific in the requirements, but that is how I
personally see it, based on various experiments with this code.
Thank you for your 2c!
--
Uladzislau Rezki
Hello, Dave.
>
> I'm travelling right now, but give me a few days and I'll test this
> against the XFS workloads that hammer the global vmalloc spin lock
> really, really badly. XFS can use vm_map_ram and vmalloc really
> heavily for metadata buffers and hit the global spin lock from every
> CPU in the system at the same time (i.e. highly concurrent
> workloads). vmalloc is also heavily used in the hottest path
> throught the journal where we process and calculate delta changes to
> several million items every second, again spread across every CPU in
> the system at the same time.
>
> We really need the global spinlock to go away completely, but in the
> mean time a shared read lock should help a little bit....
>
Could you please share some steps for how to run your workloads so that
they exercise the vmalloc() code? I would like to have a look at them in
more detail, just to understand the workloads.
Meanwhile, my grep against xfs shows:
<snip>
urezki@pc638:~/data/raid0/coding/linux-rcu.git/fs/xfs$ grep -rn vmalloc ./
./xfs_log_priv.h:675: * Log vector and shadow buffers can be large, so we need to use kvmalloc() here
./xfs_log_priv.h:676: * to ensure success. Unfortunately, kvmalloc() only allows GFP_KERNEL contexts
./xfs_log_priv.h:677: * to fall back to vmalloc, so we can't actually do anything useful with gfp
./xfs_log_priv.h:678: * flags to control the kmalloc() behaviour within kvmalloc(). Hence kmalloc()
./xfs_log_priv.h:681: * vmalloc if it can't get somethign straight away from the free lists or
./xfs_log_priv.h:682: * buddy allocator. Hence we have to open code kvmalloc outselves here.
./xfs_log_priv.h:686: * allocations. This is actually the only way to make vmalloc() do GFP_NOFS
./xfs_log_priv.h:691:xlog_kvmalloc(
./xfs_log_priv.h:702: p = vmalloc(buf_size);
./xfs_bio_io.c:21: unsigned int is_vmalloc = is_vmalloc_addr(data);
./xfs_bio_io.c:26: if (is_vmalloc && op == REQ_OP_WRITE)
./xfs_bio_io.c:56: if (is_vmalloc && op == REQ_OP_READ)
./xfs_log.c:1976: if (is_vmalloc_addr(iclog->ic_data))
./xfs_log_cil.c:338: lv = xlog_kvmalloc(buf_size);
./libxfs/xfs_attr_leaf.c:522: args->value = kvmalloc(valuelen, GFP_KERNEL | __GFP_NOLOCKDEP);
./kmem.h:12:#include <linux/vmalloc.h>
./kmem.h:78: if (is_vmalloc_addr(addr))
./kmem.h:79: return vmalloc_to_page(addr);
./xfs_attr_item.c:84: * This could be over 64kB in length, so we have to use kvmalloc() for
./xfs_attr_item.c:85: * this. But kvmalloc() utterly sucks, so we use our own version.
./xfs_attr_item.c:87: nv = xlog_kvmalloc(sizeof(struct xfs_attri_log_nameval) +
./scrub/attr.c:60: ab = kvmalloc(sizeof(*ab) + sz, flags);
urezki@pc638:~/data/raid0/coding/linux-rcu.git/fs/xfs$
<snip>
Thanks!
--
Uladzislau Rezki
On Wed, Mar 22, 2023 at 02:18:19PM +0100, Uladzislau Rezki wrote:
> Hello, Dave.
>
> >
> > I'm travelling right now, but give me a few days and I'll test this
> > against the XFS workloads that hammer the global vmalloc spin lock
> > really, really badly. XFS can use vm_map_ram and vmalloc really
> > heavily for metadata buffers and hit the global spin lock from every
> > CPU in the system at the same time (i.e. highly concurrent
> > workloads). vmalloc is also heavily used in the hottest path
> > throught the journal where we process and calculate delta changes to
> > several million items every second, again spread across every CPU in
> > the system at the same time.
> >
> > We really need the global spinlock to go away completely, but in the
> > mean time a shared read lock should help a little bit....
> >
> Could you please share some steps how to run your workloads in order to
> touch vmalloc() code. I would like to have a look at it in more detail
> just for understanding the workloads.
>
> Meanwhile my grep agains xfs shows:
>
> <snip>
> urezki@pc638:~/data/raid0/coding/linux-rcu.git/fs/xfs$ grep -rn vmalloc ./
You're missing:
fs/xfs/xfs_buf.c: bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
which I suspect is the majority of Dave's workload. That will almost
certainly take the vb_alloc() path.
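Roughly, the pattern behind that call is mapping a buffer's backing pages
into one contiguous kernel address. The sketch below is a simplified, hedged
illustration; struct fs_buf and its helpers are hypothetical, not the actual
xfs_buf.c code, which carries more error handling and retry logic.

<snip>
/*
 * Hedged sketch of the buffer-mapping pattern behind the xfs_buf.c call
 * above. struct fs_buf, fs_buf_map() and fs_buf_unmap() are hypothetical.
 */
#include <linux/errno.h>
#include <linux/mm_types.h>
#include <linux/numa.h>
#include <linux/vmalloc.h>

struct fs_buf {
	struct page	**b_pages;
	unsigned int	b_page_count;
	void		*b_addr;
};

static int fs_buf_map(struct fs_buf *bp)
{
	/*
	 * Small page counts take the per-cpu vmap block (vb_alloc()) path;
	 * counts above VMAP_MAX_ALLOC fall back to alloc_vmap_area() and
	 * hence the global vmap_area_lock.
	 */
	bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count, NUMA_NO_NODE);
	return bp->b_addr ? 0 : -ENOMEM;
}

static void fs_buf_unmap(struct fs_buf *bp)
{
	vm_unmap_ram(bp->b_addr, bp->b_page_count);
	bp->b_addr = NULL;
}
<snip>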
On Wed, Mar 22, 2023 at 05:47:28PM +0000, Matthew Wilcox wrote:
> On Wed, Mar 22, 2023 at 02:18:19PM +0100, Uladzislau Rezki wrote:
> > Hello, Dave.
> >
> > >
> > > I'm travelling right now, but give me a few days and I'll test this
> > > against the XFS workloads that hammer the global vmalloc spin lock
> > > really, really badly. XFS can use vm_map_ram and vmalloc really
> > > heavily for metadata buffers and hit the global spin lock from every
> > > CPU in the system at the same time (i.e. highly concurrent
> > > workloads). vmalloc is also heavily used in the hottest path
> > > throught the journal where we process and calculate delta changes to
> > > several million items every second, again spread across every CPU in
> > > the system at the same time.
> > >
> > > We really need the global spinlock to go away completely, but in the
> > > mean time a shared read lock should help a little bit....
> > >
> > Could you please share some steps how to run your workloads in order to
> > touch vmalloc() code. I would like to have a look at it in more detail
> > just for understanding the workloads.
> >
> > Meanwhile my grep agains xfs shows:
> >
> > <snip>
> > urezki@pc638:~/data/raid0/coding/linux-rcu.git/fs/xfs$ grep -rn vmalloc ./
>
> You're missing:
>
> fs/xfs/xfs_buf.c: bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
>
> which i suspect is the majority of Dave's workload. That will almost
> certainly take the vb_alloc() path.
>
Then it has nothing to do with vmalloc contention (I mean the global KVA allocator), IMHO.
Unless:
<snip>
void *vm_map_ram(struct page **pages, unsigned int count, int node)
{
	unsigned long size = (unsigned long)count << PAGE_SHIFT;
	unsigned long addr;
	void *mem;

	if (likely(count <= VMAP_MAX_ALLOC)) {
		mem = vb_alloc(size, GFP_KERNEL);
		if (IS_ERR(mem))
			return NULL;
		addr = (unsigned long)mem;
	} else {
		struct vmap_area *va;
		va = alloc_vmap_area(size, PAGE_SIZE,
				VMALLOC_START, VMALLOC_END, node, GFP_KERNEL);
		if (IS_ERR(va))
			return NULL;
<snip>
i.e. unless the number of pages is > VMAP_MAX_ALLOC.
That is why I have asked about the workloads: I would like to understand
where the "problem" is. vm_map_ram() does access the global vmap space, but
only when a new vmap block is required, and I also think that is not a
problem. But who knows; therefore it makes sense to have a look at the
workload.
--
Uladzislau Rezki
On Wed, Mar 22, 2023 at 07:01:59PM +0100, Uladzislau Rezki wrote:
> On Wed, Mar 22, 2023 at 05:47:28PM +0000, Matthew Wilcox wrote:
> > On Wed, Mar 22, 2023 at 02:18:19PM +0100, Uladzislau Rezki wrote:
> > > Hello, Dave.
> > >
> > > >
> > > > I'm travelling right now, but give me a few days and I'll test this
> > > > against the XFS workloads that hammer the global vmalloc spin lock
> > > > really, really badly. XFS can use vm_map_ram and vmalloc really
> > > > heavily for metadata buffers and hit the global spin lock from every
> > > > CPU in the system at the same time (i.e. highly concurrent
> > > > workloads). vmalloc is also heavily used in the hottest path
> > > > throught the journal where we process and calculate delta changes to
> > > > several million items every second, again spread across every CPU in
> > > > the system at the same time.
> > > >
> > > > We really need the global spinlock to go away completely, but in the
> > > > mean time a shared read lock should help a little bit....
> > > >
> > > Could you please share some steps how to run your workloads in order to
> > > touch vmalloc() code. I would like to have a look at it in more detail
> > > just for understanding the workloads.
> > >
> > > Meanwhile my grep agains xfs shows:
> > >
> > > <snip>
> > > urezki@pc638:~/data/raid0/coding/linux-rcu.git/fs/xfs$ grep -rn vmalloc ./
> >
> > You're missing:
> >
> > fs/xfs/xfs_buf.c: bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
> >
> > which i suspect is the majority of Dave's workload. That will almost
> > certainly take the vb_alloc() path.
> >
> Then it has nothing to do with vmalloc contention(i mean global KVA allocator), IMHO.
> Unless:
>
> <snip>
> void *vm_map_ram(struct page **pages, unsigned int count, int node)
> {
> unsigned long size = (unsigned long)count << PAGE_SHIFT;
> unsigned long addr;
> void *mem;
>
> if (likely(count <= VMAP_MAX_ALLOC)) {
> mem = vb_alloc(size, GFP_KERNEL);
> if (IS_ERR(mem))
> return NULL;
> addr = (unsigned long)mem;
> } else {
> struct vmap_area *va;
> va = alloc_vmap_area(size, PAGE_SIZE,
> VMALLOC_START, VMALLOC_END, node, GFP_KERNEL);
> if (IS_ERR(va))
> return NULL;
> <snip>
>
> number of pages > VMAP_MAX_ALLOC.
>
> That is why i have asked about workloads because i would like to understand
> where a "problem" is. A vm_map_ram() access the global vmap space but it happens
> when a new vmap block is required and i also think it is not a problem.
>
> But who knows, therefore it makes sense to have a lock at workload.
>
Here are lock-stat statistics for a vm_map_ram()/vm_unmap_ram() test.
I ran it on a 64-CPU system with 64 threads doing mapping/unmapping
of 1 page. Each thread does 10 000 000 mappings + unmappings in a loop:
<snip>
root@pc638:/home/urezki# cat /proc/lock_stat
lock_stat version 0.4
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
vmap_area_lock: 2554079 2554276 0.06 213.61 11719647.67 4.59 2986513 3005712 0.05 67.02 3573323.37 1.19
--------------
vmap_area_lock 1297948 [<00000000dd41cbaa>] alloc_vmap_area+0x1c7/0x910
vmap_area_lock 1256330 [<000000009d927bf3>] free_vmap_block+0x4a/0xe0
vmap_area_lock 1 [<00000000c95c05a7>] find_vm_area+0x16/0x70
--------------
vmap_area_lock 1738590 [<00000000dd41cbaa>] alloc_vmap_area+0x1c7/0x910
vmap_area_lock 815688 [<000000009d927bf3>] free_vmap_block+0x4a/0xe0
vmap_area_lock 1 [<00000000c1d619d7>] __get_vm_area_node+0xd2/0x170
.....................................................................................................................................................................................................
vmap_blocks.xa_lock: 862689 862698 0.05 77.74 849325.39 0.98 3005156 3005709 0.12 31.11 1920242.82 0.64
-------------------
vmap_blocks.xa_lock 378418 [<00000000625a5626>] vm_map_ram+0x359/0x4a0
vmap_blocks.xa_lock 484280 [<00000000caa2ef03>] xa_erase+0xe/0x30
-------------------
vmap_blocks.xa_lock 576226 [<00000000caa2ef03>] xa_erase+0xe/0x30
vmap_blocks.xa_lock 286472 [<00000000625a5626>] vm_map_ram+0x359/0x4a0
....................................................................................................................................................................................................
free_vmap_area_lock: 394960 394961 0.05 124.78 448241.23 1.13 1514508 1515077 0.12 30.48 1179167.01 0.78
-------------------
free_vmap_area_lock 385970 [<00000000955bd641>] alloc_vmap_area+0xe5/0x910
free_vmap_area_lock 4692 [<00000000230abf7e>] __purge_vmap_area_lazy+0x10a/0x7d0
free_vmap_area_lock 4299 [<00000000eed9ff9e>] alloc_vmap_area+0x497/0x910
-------------------
free_vmap_area_lock 371734 [<00000000955bd641>] alloc_vmap_area+0xe5/0x910
free_vmap_area_lock 17007 [<00000000230abf7e>] __purge_vmap_area_lazy+0x10a/0x7d0
free_vmap_area_lock 6220 [<00000000eed9ff9e>] alloc_vmap_area+0x497/0x910
.....................................................................................................................................................................................................
purge_vmap_area_lock: 169307 169312 0.05 31.08 81655.21 0.48 1514794 1515078 0.05 67.73 912391.12 0.60
--------------------
purge_vmap_area_lock 166409 [<0000000050938075>] free_vmap_area_noflush+0x65/0x370
purge_vmap_area_lock 2903 [<00000000fb8b57f7>] __purge_vmap_area_lazy+0x47/0x7d0
--------------------
purge_vmap_area_lock 167511 [<0000000050938075>] free_vmap_area_noflush+0x65/0x370
purge_vmap_area_lock 1801 [<00000000fb8b57f7>] __purge_vmap_area_lazy+0x47/0x7d0
<snip>
alloc_vmap_area is at the top and the second one is the xa_lock. But the
test I have done is a pretty highly concurrent scenario.
--
Uladzislau Rezki
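For reference, each worker in the test Uladzislau describes above might look
roughly like the sketch below; it is a hedged illustration of the described
loop, not the exact test code, and vm_map_ram_stress() is a made-up name.
Sixty-four such kthreads would be launched, e.g. via kthread_run().

<snip>
/*
 * Hedged sketch of the per-thread stress loop described above: one page is
 * mapped and unmapped 10 000 000 times per thread, 64 threads in total.
 * Illustrative only; not the exact test code used for the numbers.
 */
#include <linux/gfp.h>
#include <linux/kthread.h>
#include <linux/numa.h>
#include <linux/vmalloc.h>

static int vm_map_ram_stress(void *unused)
{
	struct page *page = alloc_page(GFP_KERNEL);
	void *vaddr;
	int i;

	if (!page)
		return -ENOMEM;

	for (i = 0; i < 10000000; i++) {
		/* count == 1 <= VMAP_MAX_ALLOC, so this is the vb_alloc() path */
		vaddr = vm_map_ram(&page, 1, NUMA_NO_NODE);
		if (!vaddr)
			break;
		vm_unmap_ram(vaddr, 1);
	}

	__free_page(page);
	return 0;
}
<snip>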
On Wed, Mar 22, 2023 at 08:15:09PM +0100, Uladzislau Rezki wrote:
> On Wed, Mar 22, 2023 at 07:01:59PM +0100, Uladzislau Rezki wrote:
> > On Wed, Mar 22, 2023 at 05:47:28PM +0000, Matthew Wilcox wrote:
> > > On Wed, Mar 22, 2023 at 02:18:19PM +0100, Uladzislau Rezki wrote:
> > > > Hello, Dave.
> > > >
> > > > >
> > > > > I'm travelling right now, but give me a few days and I'll test this
> > > > > against the XFS workloads that hammer the global vmalloc spin lock
> > > > > really, really badly. XFS can use vm_map_ram and vmalloc really
> > > > > heavily for metadata buffers and hit the global spin lock from every
> > > > > CPU in the system at the same time (i.e. highly concurrent
> > > > > workloads). vmalloc is also heavily used in the hottest path
> > > > > throught the journal where we process and calculate delta changes to
> > > > > several million items every second, again spread across every CPU in
> > > > > the system at the same time.
> > > > >
> > > > > We really need the global spinlock to go away completely, but in the
> > > > > mean time a shared read lock should help a little bit....
> > > > >
> > > > Could you please share some steps how to run your workloads in order to
> > > > touch vmalloc() code. I would like to have a look at it in more detail
> > > > just for understanding the workloads.
> > > >
> > > > Meanwhile my grep agains xfs shows:
> > > >
> > > > <snip>
> > > > urezki@pc638:~/data/raid0/coding/linux-rcu.git/fs/xfs$ grep -rn vmalloc ./
> > >
> > > You're missing:
> > >
> > > fs/xfs/xfs_buf.c: bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
> > >
> > > which i suspect is the majority of Dave's workload. That will almost
> > > certainly take the vb_alloc() path.
> > >
> > Then it has nothing to do with vmalloc contention(i mean global KVA allocator), IMHO.
> > Unless:
> >
> > <snip>
> > void *vm_map_ram(struct page **pages, unsigned int count, int node)
> > {
> > unsigned long size = (unsigned long)count << PAGE_SHIFT;
> > unsigned long addr;
> > void *mem;
> >
> > if (likely(count <= VMAP_MAX_ALLOC)) {
> > mem = vb_alloc(size, GFP_KERNEL);
> > if (IS_ERR(mem))
> > return NULL;
> > addr = (unsigned long)mem;
> > } else {
> > struct vmap_area *va;
> > va = alloc_vmap_area(size, PAGE_SIZE,
> > VMALLOC_START, VMALLOC_END, node, GFP_KERNEL);
> > if (IS_ERR(va))
> > return NULL;
> > <snip>
> >
> > number of pages > VMAP_MAX_ALLOC.
> >
> > That is why i have asked about workloads because i would like to understand
> > where a "problem" is. A vm_map_ram() access the global vmap space but it happens
> > when a new vmap block is required and i also think it is not a problem.
> >
> > But who knows, therefore it makes sense to have a lock at workload.
> >
> There is a lock-stat statistics for vm_map_ram()/vm_unmap_ram() test.
> I did it on 64 CPUs system with running 64 threads doing mapping/unmapping
> of 1 page. Each thread does 10 000 000 mapping + unmapping in a loop:
>
> <snip>
> root@pc638:/home/urezki# cat /proc/lock_stat
> lock_stat version 0.4
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> vmap_area_lock: 2554079 2554276 0.06 213.61 11719647.67 4.59 2986513 3005712 0.05 67.02 3573323.37 1.19
> --------------
> vmap_area_lock 1297948 [<00000000dd41cbaa>] alloc_vmap_area+0x1c7/0x910
> vmap_area_lock 1256330 [<000000009d927bf3>] free_vmap_block+0x4a/0xe0
> vmap_area_lock 1 [<00000000c95c05a7>] find_vm_area+0x16/0x70
> --------------
> vmap_area_lock 1738590 [<00000000dd41cbaa>] alloc_vmap_area+0x1c7/0x910
> vmap_area_lock 815688 [<000000009d927bf3>] free_vmap_block+0x4a/0xe0
> vmap_area_lock 1 [<00000000c1d619d7>] __get_vm_area_node+0xd2/0x170
>
> .....................................................................................................................................................................................................
>
> vmap_blocks.xa_lock: 862689 862698 0.05 77.74 849325.39 0.98 3005156 3005709 0.12 31.11 1920242.82 0.64
> -------------------
> vmap_blocks.xa_lock 378418 [<00000000625a5626>] vm_map_ram+0x359/0x4a0
> vmap_blocks.xa_lock 484280 [<00000000caa2ef03>] xa_erase+0xe/0x30
> -------------------
> vmap_blocks.xa_lock 576226 [<00000000caa2ef03>] xa_erase+0xe/0x30
> vmap_blocks.xa_lock 286472 [<00000000625a5626>] vm_map_ram+0x359/0x4a0
>
> ....................................................................................................................................................................................................
>
> free_vmap_area_lock: 394960 394961 0.05 124.78 448241.23 1.13 1514508 1515077 0.12 30.48 1179167.01 0.78
> -------------------
> free_vmap_area_lock 385970 [<00000000955bd641>] alloc_vmap_area+0xe5/0x910
> free_vmap_area_lock 4692 [<00000000230abf7e>] __purge_vmap_area_lazy+0x10a/0x7d0
> free_vmap_area_lock 4299 [<00000000eed9ff9e>] alloc_vmap_area+0x497/0x910
> -------------------
> free_vmap_area_lock 371734 [<00000000955bd641>] alloc_vmap_area+0xe5/0x910
> free_vmap_area_lock 17007 [<00000000230abf7e>] __purge_vmap_area_lazy+0x10a/0x7d0
> free_vmap_area_lock 6220 [<00000000eed9ff9e>] alloc_vmap_area+0x497/0x910
>
> .....................................................................................................................................................................................................
>
> purge_vmap_area_lock: 169307 169312 0.05 31.08 81655.21 0.48 1514794 1515078 0.05 67.73 912391.12 0.60
> --------------------
> purge_vmap_area_lock 166409 [<0000000050938075>] free_vmap_area_noflush+0x65/0x370
> purge_vmap_area_lock 2903 [<00000000fb8b57f7>] __purge_vmap_area_lazy+0x47/0x7d0
> --------------------
> purge_vmap_area_lock 167511 [<0000000050938075>] free_vmap_area_noflush+0x65/0x370
> purge_vmap_area_lock 1801 [<00000000fb8b57f7>] __purge_vmap_area_lazy+0x47/0x7d0
> <snip>
>
> alloc_vmap_area is a top and second one is xa_lock. But the test i have
> done is pretty high concurrent scenario.
>
<snip>
From 32c38d239c6de3f1d3accf97d9d4944ecaa4bccd Mon Sep 17 00:00:00 2001
From: "Uladzislau Rezki (Sony)" <[email protected]>
Date: Thu, 23 Mar 2023 13:07:27 +0100
Subject: [PATCH] mm: vmalloc: Remove global vmap_blocks xarray
The global vmap_blocks xarray can be contended under heavy
usage of the vm_map_ram()/vm_unmap_ram() APIs. Under a stress
test, lock-stat shows that the "vmap_blocks.xa_lock" lock is
second in the list when it comes to contention:
<snip>
----------------------------------------
class name con-bounces contentions ...
----------------------------------------
...
vmap_blocks.xa_lock: 862689 862698 ...
-------------------
vmap_blocks.xa_lock 378418 [<00000000625a5626>] vm_map_ram+0x359/0x4a0
vmap_blocks.xa_lock 484280 [<00000000caa2ef03>] xa_erase+0xe/0x30
-------------------
vmap_blocks.xa_lock 576226 [<00000000caa2ef03>] xa_erase+0xe/0x30
vmap_blocks.xa_lock 286472 [<00000000625a5626>] vm_map_ram+0x359/0x4a0
...
<snip>
that is a result of running vm_map_ram()/vm_unmap_ram() in
a loop. The test creates 64 threads (on a 64-CPU system) and
each one maps/unmaps 1 page.
After this change the xa_lock is considered noise under the
same test conditions:
<snip>
...
&xa->xa_lock#1: 10333 10394 ...
--------------
&xa->xa_lock#1 5349 [<00000000bbbc9751>] xa_erase+0xe/0x30
&xa->xa_lock#1 5045 [<0000000018def45d>] vm_map_ram+0x3a4/0x4f0
--------------
&xa->xa_lock#1 7326 [<0000000018def45d>] vm_map_ram+0x3a4/0x4f0
&xa->xa_lock#1 3068 [<00000000bbbc9751>] xa_erase+0xe/0x30
...
<snip>
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
---
mm/vmalloc.c | 54 +++++++++++++++++++++++++++++-----------------------
1 file changed, 30 insertions(+), 24 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 978194dc2bb8..b1e549d152b2 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1911,6 +1911,7 @@ static struct vmap_area *find_unlink_vmap_area(unsigned long addr)
struct vmap_block_queue {
spinlock_t lock;
struct list_head free;
+ struct xarray vmap_blocks;
};
struct vmap_block {
@@ -1927,25 +1928,22 @@ struct vmap_block {
/* Queue of free and dirty vmap blocks, for allocation and flushing purposes */
static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
-/*
- * XArray of vmap blocks, indexed by address, to quickly find a vmap block
- * in the free path. Could get rid of this if we change the API to return a
- * "cookie" from alloc, to be passed to free. But no big deal yet.
- */
-static DEFINE_XARRAY(vmap_blocks);
-
-/*
- * We should probably have a fallback mechanism to allocate virtual memory
- * out of partially filled vmap blocks. However vmap block sizing should be
- * fairly reasonable according to the vmalloc size, so it shouldn't be a
- * big problem.
- */
+static struct vmap_block_queue *
+addr_to_vbq(unsigned long addr)
+{
+ int cpu = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
+ return &per_cpu(vmap_block_queue, cpu);
+}
-static unsigned long addr_to_vb_idx(unsigned long addr)
+static unsigned long
+addr_to_vb_va_start(unsigned long addr)
{
- addr -= VMALLOC_START & ~(VMAP_BLOCK_SIZE-1);
- addr /= VMAP_BLOCK_SIZE;
- return addr;
+ /* Check if aligned. */
+ if (IS_ALIGNED(addr, VMAP_BLOCK_SIZE))
+ return addr;
+
+ /* A start address of block an address belongs to. */
+ return rounddown(addr, VMAP_BLOCK_SIZE);
}
static void *vmap_block_vaddr(unsigned long va_start, unsigned long pages_off)
@@ -1953,7 +1951,7 @@ static void *vmap_block_vaddr(unsigned long va_start, unsigned long pages_off)
unsigned long addr;
addr = va_start + (pages_off << PAGE_SHIFT);
- BUG_ON(addr_to_vb_idx(addr) != addr_to_vb_idx(va_start));
+ BUG_ON(addr_to_vb_va_start(addr) != addr_to_vb_va_start(va_start));
return (void *)addr;
}
@@ -1970,7 +1968,6 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
struct vmap_block_queue *vbq;
struct vmap_block *vb;
struct vmap_area *va;
- unsigned long vb_idx;
int node, err;
void *vaddr;
@@ -2003,8 +2000,8 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
bitmap_set(vb->used_map, 0, (1UL << order));
INIT_LIST_HEAD(&vb->free_list);
- vb_idx = addr_to_vb_idx(va->va_start);
- err = xa_insert(&vmap_blocks, vb_idx, vb, gfp_mask);
+ vbq = addr_to_vbq(va->va_start);
+ err = xa_insert(&vbq->vmap_blocks, va->va_start, vb, gfp_mask);
if (err) {
kfree(vb);
free_vmap_area(va);
@@ -2021,9 +2018,11 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
static void free_vmap_block(struct vmap_block *vb)
{
+ struct vmap_block_queue *vbq;
struct vmap_block *tmp;
- tmp = xa_erase(&vmap_blocks, addr_to_vb_idx(vb->va->va_start));
+ vbq = addr_to_vbq(vb->va->va_start);
+ tmp = xa_erase(&vbq->vmap_blocks, vb->va->va_start);
BUG_ON(tmp != vb);
spin_lock(&vmap_area_lock);
@@ -2135,6 +2134,7 @@ static void vb_free(unsigned long addr, unsigned long size)
unsigned long offset;
unsigned int order;
struct vmap_block *vb;
+ struct vmap_block_queue *vbq;
BUG_ON(offset_in_page(size));
BUG_ON(size > PAGE_SIZE*VMAP_MAX_ALLOC);
@@ -2143,7 +2143,10 @@ static void vb_free(unsigned long addr, unsigned long size)
order = get_order(size);
offset = (addr & (VMAP_BLOCK_SIZE - 1)) >> PAGE_SHIFT;
- vb = xa_load(&vmap_blocks, addr_to_vb_idx(addr));
+
+ vbq = addr_to_vbq(addr);
+ vb = xa_load(&vbq->vmap_blocks, addr_to_vb_va_start(addr));
+
spin_lock(&vb->lock);
bitmap_clear(vb->used_map, offset, (1UL << order));
spin_unlock(&vb->lock);
@@ -3486,6 +3489,7 @@ static void vmap_ram_vread(char *buf, char *addr, int count, unsigned long flags
{
char *start;
struct vmap_block *vb;
+ struct vmap_block_queue *vbq;
unsigned long offset;
unsigned int rs, re, n;
@@ -3503,7 +3507,8 @@ static void vmap_ram_vread(char *buf, char *addr, int count, unsigned long flags
* Area is split into regions and tracked with vmap_block, read out
* each region and zero fill the hole between regions.
*/
- vb = xa_load(&vmap_blocks, addr_to_vb_idx((unsigned long)addr));
+ vbq = addr_to_vbq((unsigned long) addr);
+ vb = xa_load(&vbq->vmap_blocks, addr_to_vb_va_start((unsigned long) addr));
if (!vb)
goto finished;
@@ -4272,6 +4277,7 @@ void __init vmalloc_init(void)
p = &per_cpu(vfree_deferred, i);
init_llist_head(&p->list);
INIT_WORK(&p->wq, delayed_vfree_work);
+ xa_init(&vbq->vmap_blocks);
}
/* Import existing vmlist entries. */
--
2.30.2
<snip>
Any thoughts on the patch?
I do not consider it a big improvement in performance, but it completely
removes the contention on the "xa_lock" and slightly refactors the per-cpu
allocator. XFS workloads can be improved by it, though.
--
Uladzislau Rezki
On Fri, Mar 24, 2023 at 04:25:39PM +1100, Dave Chinner wrote:
> Did you read the comment above this function? I mean, it's all about
> how poorly kvmalloc() works for the highly concurrent, fail-fast
> context that occurs in the journal commit fast path, and how we open
> code it with kmalloc and vmalloc to work "ok" in this path.
>
> Then if you go look at the commits related to it, you might find
> that XFS developers tend to write properly useful changelogs to
> document things like "it's better, but vmalloc will soon have lock
> contention problems if we hit it any harder"....
The problem with writing whinges like this is that mm developers don't
read XFS changelogs. I certainly had no idea this was a problem, and
I doubt anybody else who could make a start at fixing this problem had
any idea either. Why go to all this effort instead of sending an email
to linux-mm?
> So, this patch open codes the kvmalloc() in the commit path to have
> the above described behaviour. The result is we more than halve the
> CPU time spend doing kvmalloc() in this path and transaction commits
> with 64kB objects in them more than doubles. i.e. we get ~5x
> reduction in CPU usage per costly-sized kvmalloc() invocation and
> the profile looks like this:
>
> - 37.60% xlog_cil_commit
> 16.01% memcpy_erms
> - 8.45% __kmalloc
> - 8.04% kmalloc_order_trace
> - 8.03% kmalloc_order
> - 7.93% alloc_pages
> - 7.90% __alloc_pages
> - 4.05% __alloc_pages_slowpath.constprop.0
> - 2.18% get_page_from_freelist
> - 1.77% wake_all_kswapds
> ....
> - __wake_up_common_lock
> - 0.94% _raw_spin_lock_irqsave
> - 3.72% get_page_from_freelist
> - 2.43% _raw_spin_lock_irqsave
> - 5.72% vmalloc
> - 5.72% __vmalloc_node_range
> - 4.81% __get_vm_area_node.constprop.0
> - 3.26% alloc_vmap_area
> - 2.52% _raw_spin_lock
> - 1.46% _raw_spin_lock
> 0.56% __alloc_pages_bulk
> - 4.66% kvfree
> - 3.25% vfree
OK, I see. I tried to use fs_mark in different configurations. For
example:
<snip>
time fs_mark -D 10000 -S0 -n 100000 -s 0 -L 32 -d ./scratch/0 -d ./scratch/1 -d ./scratch/2 \
-d ./scratch/3 -d ./scratch/4 -d ./scratch/5 -d ./scratch/6 -d ./scratch/7 -d ./scratch/8 \
-d ./scratch/9 -d ./scratch/10 -d ./scratch/11 -d ./scratch/12 -d ./scratch/13 \
-d ./scratch/14 -d ./scratch/15 -t 64 -F
<snip>
But I did not manage to trigger xlog_cil_commit() to fall back to the
vmalloc code. I think I should reduce the amount of memory on my kvm-pc and
repeat the tests!
--
Uladzislau Rezki
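For readers without the XFS source at hand, the open-coded kmalloc-then-
vmalloc fallback discussed in the quoted text above looks roughly like the
hedged sketch below; fast_kvmalloc() is a made-up name, not the verbatim
xlog_kvmalloc() from fs/xfs/xfs_log_priv.h.

<snip>
/*
 * Hedged sketch of an open-coded kvmalloc()-style fallback: try a
 * fail-fast kmalloc() first, then fall back to vmalloc() while forcing
 * NOFS semantics via the task flag. fast_kvmalloc() is a hypothetical
 * name, not the actual XFS helper.
 */
#include <linux/sched/mm.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>

static void *fast_kvmalloc(size_t size)
{
	unsigned int nofs_flag;
	void *p;

	/* Fail fast: no retries, no warnings, no OOM killer involvement. */
	p = kmalloc(size, GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);
	if (p)
		return p;

	/*
	 * vmalloc() only honours GFP_KERNEL-style contexts, so GFP_NOFS
	 * behaviour has to be imposed through the task flags instead.
	 */
	nofs_flag = memalloc_nofs_save();
	p = vmalloc(size);
	memalloc_nofs_restore(nofs_flag);

	return p;
}
<snip>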
On Tue, Mar 28, 2023 at 01:53:27PM +1100, Dave Chinner wrote:
> On Mon, Mar 27, 2023 at 07:22:44PM +0200, Uladzislau Rezki wrote:
> > > So, this patch open codes the kvmalloc() in the commit path to have
> > > the above described behaviour. The result is we more than halve the
> > > CPU time spend doing kvmalloc() in this path and transaction commits
> > > with 64kB objects in them more than doubles. i.e. we get ~5x
> > > reduction in CPU usage per costly-sized kvmalloc() invocation and
> > > the profile looks like this:
> > >
> > > - 37.60% xlog_cil_commit
> > > 16.01% memcpy_erms
> > > - 8.45% __kmalloc
> > > - 8.04% kmalloc_order_trace
> > > - 8.03% kmalloc_order
> > > - 7.93% alloc_pages
> > > - 7.90% __alloc_pages
> > > - 4.05% __alloc_pages_slowpath.constprop.0
> > > - 2.18% get_page_from_freelist
> > > - 1.77% wake_all_kswapds
> > > ....
> > > - __wake_up_common_lock
> > > - 0.94% _raw_spin_lock_irqsave
> > > - 3.72% get_page_from_freelist
> > > - 2.43% _raw_spin_lock_irqsave
> > > - 5.72% vmalloc
> > > - 5.72% __vmalloc_node_range
> > > - 4.81% __get_vm_area_node.constprop.0
> > > - 3.26% alloc_vmap_area
> > > - 2.52% _raw_spin_lock
> > > - 1.46% _raw_spin_lock
> > > 0.56% __alloc_pages_bulk
> > > - 4.66% kvfree
> > > - 3.25% vfree
> > OK, i see. I tried to use the fs_mark in different configurations. For
> > example:
> >
> > <snip>
> > time fs_mark -D 10000 -S0 -n 100000 -s 0 -L 32 -d ./scratch/0 -d ./scratch/1 -d ./scratch/2 \
> > -d ./scratch/3 -d ./scratch/4 -d ./scratch/5 -d ./scratch/6 -d ./scratch/7 -d ./scratch/8 \
> > -d ./scratch/9 -d ./scratch/10 -d ./scratch/11 -d ./scratch/12 -d ./scratch/13 \
> > -d ./scratch/14 -d ./scratch/15 -t 64 -F
> > <snip>
> >
> > But i did not manage to trigger xlog_cil_commit() to fallback to vmalloc
> > code. I think i should reduce an amount of memory on my kvm-pc and
> > repeat the tests!
>
> Simple way of doing is to use directory blocks that are larger than
> page size:
>
> mkfs.xfs -n size=64k ....
>
> We can hit that path in other ways - large attributes will hit it in
> the attr buffer allocation path, enabling the new attribute
> intent-based logging mode will hit it in the xlog_cil_commit path as
> well. IIRC, the above profile comes from the latter case, creating
> lots of zero length files with 64kB xattrs attached via fsmark.
>
Good. Thank you, that is useful.
--
Uladzislau Rezki