Problem:
***
Stephen reported that vread() skips vm_map_ram areas when /proc/kcore
is read out with the drgn utility. Please see the link below for more
details.
/proc/kcore reads 0's for vmap_block
https://lore.kernel.org/all/[email protected]/T/#u
Root cause:
***
The normal vmalloc APIs use struct vmap_area to manage the allocated
kernel virtual area, and associate a vm_struct with it to store more
information and pass it out. However, an area reserved through the
vm_map_ram() interface has no vm_struct associated with it. So the
current code in vread() skips vm_map_ram areas via the 'if (!va->vm)'
check.
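For reference, a simplified sketch of the current vread() loop in
mm/vmalloc.c shows where the skip happens; a vm_map_ram area never has
va->vm set, so it always falls into the 'continue':

	spin_lock(&vmap_area_lock);
	list_for_each_entry(va, &vmap_area_list, list) {
		if (!count)
			break;
		/* vm_map_ram areas have no vm_struct, so they are skipped */
		if (!va->vm)
			continue;
		/* ... copy the area out with aligned_vread() ... */
	}
	spin_unlock(&vmap_area_lock);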
Solution:
***
To mark areas reserved through the vm_map_ram() interface, add a 'flags'
field to struct vmap_area. Bit 0 indicates a vm_map_ram area created
through the vm_map_ram() interface; bit 1 marks the sub-type of
vm_map_ram area that uses a vmap_block to manage split regions via
vb_alloc/free(). Also add a bitmap field 'used_map' to struct vmap_block
to mark the further subdivided regions currently in use, so that they
can be differentiated from the dirty and free regions in the vmap_block.
With the help of the above vmap_area->flags and vmap_block->used_map, we
can recognize and handle vm_map_ram areas correctly. All of this is done
in patches 1~3, sketched briefly below.
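The new fields, in outline (a condensed sketch; the patch diffs below
contain the real code):

	/* patch 2: mark the area type at allocation time */
	struct vmap_area {
		...
		unsigned long flags;	/* mark type of vm_map_ram area */
	};

	#define VMAP_RAM	0x1	/* indicates vm_map_ram area */
	#define VMAP_BLOCK	0x2	/* vmap_block sub-type of vm_map_ram area */

	/* patch 1: track the small regions handed out by vb_alloc() */
	struct vmap_block {
		...
		DECLARE_BITMAP(used_map, VMAP_BBMAP_BITS);
		...
	};

vb_alloc() sets the corresponding bits in used_map and vb_free() clears
them, so a reader holding vb->lock can tell used regions apart from
dirty and free ones.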
Meanwhile, make some improvements to code related to vm_map_ram areas
in patches 4 and 5. Also set the VM_IOREMAP flag on the vmalloc areas
in patches 6 and 7, because that shows them as 'ioremap' in
/proc/vmallocinfo and excludes them from /proc/kcore.
Testing:
***
- I did basic testing in a kvm guest, running the command below to
access the kcore file and gather statistics:
makedumpfile --mem-usage /proc/kcore
- Stephen helped test v4 and confirmed that the problem he spotted and
reported is resolved with the v4 patchset.
Changelog:
***
v4->v5:
- Add an 'unlock' label and a goto in the while loop of the
vmap_ram_vread() function; this improves the code flow. This is
done in patch 3.
- Add Stephen's Reported-by and Tested-by tags in patch 3.
v3->v4:
- Fix a typo in patch 2, caught by Lorenzo.
- Add WARN_ON(flags == VMAP_BLOCK) in vread() to address Dan's concern
that VMAP_BLOCK could be set alone in va->flags.
- Add a check on the value returned from xa_load() in vmap_ram_vread();
Uladzislau commented on the risk at this place.
- Fix a bug in the s_show() change made in patch 5. The old change
mishandled the case where va->vm is NULL but the VMAP_RAM flag is not
set. Please see the link below for details.
https://lore.kernel.org/all/Y8aAmuUY9OxrYlLm@kili/T/#u
- Add Uladzislau's and Lorenzo's Reviewed-by tags.
v2->v3:
- Benefiting from find_unlink_vmap_area() introduced by Uladzislau, we
no longer need to worry about va->vm and va->flags being reset during
freeing.
- Change to identify vm_map_ram areas by checking VMAP_RAM in va->flags
in vread().
- Rename the old vb_vread() to vmap_ram_vread().
- Handle the two kinds of vm_map_ram area reading separately in
vmap_ram_vread().
- Fix a bug in the while loop of the vmap_block reading code, found by
Lorenzo.
- Improve the conditional checks related to vm_map_ram areas, as
suggested by Lorenzo.
v1->v2:
- Change alloc_vmap_area() to take a va_flags argument so that
vmap_area->flags can be set for vm_map_ram areas at allocation time.
With this, no extra vmap_area_lock acquisition is needed to set
vmap_area->flags. Uladzislau reviewed v1 and pointed out the issue.
- Improve vb_vread() to cover the case where reading starts from a
dirty or free region.
RFC->v1:
- Add a new 'flags' field to vmap_area to mark vm_map_ram areas. It
could be risky to reuse the vm union in vmap_area as the RFC did. I
will consider reusing the union in vmap_area to save memory later;
for now, take the simpler way so we can focus on resolving the main
problem.
- Add patches 4~7 for optimization.
Baoquan He (7):
mm/vmalloc.c: add used_map into vmap_block to track space of
vmap_block
mm/vmalloc.c: add flags to mark vm_map_ram area
mm/vmalloc.c: allow vread() to read out vm_map_ram areas
mm/vmalloc: explicitly identify vm_map_ram area when shown in
/proc/vmallocinfo
mm/vmalloc: skip the uninitialized vmalloc areas
powerpc: mm: add VM_IOREMAP flag to the vmalloc area
sh: mm: set VM_IOREMAP flag to the vmalloc area
arch/powerpc/kernel/pci_64.c | 2 +-
arch/sh/kernel/cpu/sh4/sq.c | 2 +-
include/linux/vmalloc.h | 1 +
mm/vmalloc.c | 127 ++++++++++++++++++++++++++++++-----
4 files changed, 112 insertions(+), 20 deletions(-)
--
2.34.1
Through the vmalloc API, a kernel virtual area is reserved for physical
address mapping, and a vmap_area is used to track it, while a vm_struct
is allocated and associated with the vmap_area to store more
information and pass it out.
However, an area reserved via vm_map_ram() is an exception. It has no
vm_struct associated with the vmap_area. And we can't recognize a
vmap_area with '->vm == NULL' as a vm_map_ram() area, because the
normal freeing path sets va->vm to NULL before unmapping; please see
remove_vm_area().
Meanwhile, there are two kinds of handling for vm_map_ram areas. In
one, the whole vmap_area is reserved and mapped at one time through the
vm_map_ram() interface; in the other, the whole vmap_area is reserved
with VMAP_BLOCK_SIZE size, but mapped into split regions of smaller
size via vb_alloc().
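For reference, the two paths look roughly like this (condensed from
the current vm_map_ram()):

	if (likely(count <= VMAP_MAX_ALLOC)) {
		/* a sub-region of a vmap_block, carved out by vb_alloc() */
		mem = vb_alloc(size, GFP_KERNEL);
		...
	} else {
		/* the whole area, reserved and mapped at one time */
		va = alloc_vmap_area(size, PAGE_SIZE,
				VMALLOC_START, VMALLOC_END,
				node, GFP_KERNEL);
		...
	}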
To mark areas reserved through vm_map_ram(), add a 'flags' field to
struct vmap_area. Bit 0 indicates a vm_map_ram area created through
the vm_map_ram() interface, while bit 1 marks the sub-type of
vm_map_ram area that uses a vmap_block to manage split regions via
vb_alloc/free().
This is a preparation for later use.
Signed-off-by: Baoquan He <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
---
include/linux/vmalloc.h | 1 +
mm/vmalloc.c | 16 ++++++++++++----
2 files changed, 13 insertions(+), 4 deletions(-)
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 096d48aa3437..69250efa03d1 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -76,6 +76,7 @@ struct vmap_area {
unsigned long subtree_max_size; /* in "free" tree */
struct vm_struct *vm; /* in "busy" tree */
};
+ unsigned long flags; /* mark type of vm_map_ram area */
};
/* archs that select HAVE_ARCH_HUGE_VMAP should override one or more of these */
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d6ff058ef4d0..ab4825050b5c 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1589,7 +1589,8 @@ preload_this_cpu_lock(spinlock_t *lock, gfp_t gfp_mask, int node)
static struct vmap_area *alloc_vmap_area(unsigned long size,
unsigned long align,
unsigned long vstart, unsigned long vend,
- int node, gfp_t gfp_mask)
+ int node, gfp_t gfp_mask,
+ unsigned long va_flags)
{
struct vmap_area *va;
unsigned long freed;
@@ -1635,6 +1636,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
va->va_start = addr;
va->va_end = addr + size;
va->vm = NULL;
+ va->flags = va_flags;
spin_lock(&vmap_area_lock);
insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
@@ -1913,6 +1915,10 @@ static struct vmap_area *find_unlink_vmap_area(unsigned long addr)
#define VMAP_BLOCK_SIZE (VMAP_BBMAP_BITS * PAGE_SIZE)
+#define VMAP_RAM 0x1 /* indicates vm_map_ram area*/
+#define VMAP_BLOCK 0x2 /* mark out the vmap_block sub-type*/
+#define VMAP_FLAGS_MASK 0x3
+
struct vmap_block_queue {
spinlock_t lock;
struct list_head free;
@@ -1988,7 +1994,8 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
va = alloc_vmap_area(VMAP_BLOCK_SIZE, VMAP_BLOCK_SIZE,
VMALLOC_START, VMALLOC_END,
- node, gfp_mask);
+ node, gfp_mask,
+ VMAP_RAM|VMAP_BLOCK);
if (IS_ERR(va)) {
kfree(vb);
return ERR_CAST(va);
@@ -2297,7 +2304,8 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node)
} else {
struct vmap_area *va;
va = alloc_vmap_area(size, PAGE_SIZE,
- VMALLOC_START, VMALLOC_END, node, GFP_KERNEL);
+ VMALLOC_START, VMALLOC_END,
+ node, GFP_KERNEL, VMAP_RAM);
if (IS_ERR(va))
return NULL;
@@ -2537,7 +2545,7 @@ static struct vm_struct *__get_vm_area_node(unsigned long size,
if (!(flags & VM_NO_GUARD))
size += PAGE_SIZE;
- va = alloc_vmap_area(size, align, start, end, node, gfp_mask);
+ va = alloc_vmap_area(size, align, start, end, node, gfp_mask, 0);
if (IS_ERR(va)) {
kfree(area);
return NULL;
--
2.34.1
For areas allocated via the vmalloc_xxx() APIs, an unmapped area is
searched for and reserved, and new pages are allocated and mapped into
it; please see __vmalloc_node_range(). During this process, the flag
VM_UNINITIALIZED is set in vm->flags to indicate that page allocation
and mapping haven't finished yet, until clear_vm_uninitialized_flag()
is called to clear it.
For this kind of area, if VM_UNINITIALIZED is still set, ignore it in
vread(), because the pages newly allocated and being mapped in that
area contain only zero data; reading them out with aligned_vread() is
a waste of time.
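For reference, the writer side already provides the pairing barrier;
roughly (condensed from clear_vm_uninitialized_flag() in mm/vmalloc.c):

	static void clear_vm_uninitialized_flag(struct vm_struct *vm)
	{
		/*
		 * Make sure the vm_struct's fields are visible before
		 * the flag is cleared; pairs with a reader-side
		 * smp_rmb(), like the one added to vread() below.
		 */
		smp_wmb();
		vm->flags &= ~VM_UNINITIALIZED;
	}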
Signed-off-by: Baoquan He <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
---
mm/vmalloc.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index dea76e73e57c..8037527774db 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3668,6 +3668,11 @@ long vread(char *buf, char *addr, unsigned long count)
if (!vm && !flags)
continue;
+ if (vm && (vm->flags & VM_UNINITIALIZED))
+ continue;
+ /* Pair with smp_wmb() in clear_vm_uninitialized_flag() */
+ smp_rmb();
+
vaddr = (char *) va->va_start;
size = vm ? get_vm_area_size(vm) : va_size(va);
--
2.34.1
Now, by marking VMAP_RAM in vmap_area->flags for vm_map_ram areas, we
can clearly differentiate them from other vmalloc areas. So identify
vm_map_ram areas by checking VMAP_RAM in vmap_area->flags when they
are shown in /proc/vmallocinfo.
Meanwhile, the code comment above the vm_map_ram area check in s_show()
is no longer needed; remove it here.
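With this change, a vm_map_ram entry keeps the same format as before,
but is printed only for areas that really are vm_map_ram, e.g.
(hypothetical addresses, hashed by %pK):

	0x00000000a8f52c12-0x00000000b2e40981 4194304 vm_map_ram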
Signed-off-by: Baoquan He <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
---
mm/vmalloc.c | 11 ++++-------
1 file changed, 4 insertions(+), 7 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 4bb78ebd70f6..dea76e73e57c 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -4233,14 +4233,11 @@ static int s_show(struct seq_file *m, void *p)
va = list_entry(p, struct vmap_area, list);
- /*
- * s_show can encounter race with remove_vm_area, !vm on behalf
- * of vmap area is being tear down or vm_map_ram allocation.
- */
if (!va->vm) {
- seq_printf(m, "0x%pK-0x%pK %7ld vm_map_ram\n",
- (void *)va->va_start, (void *)va->va_end,
- va->va_end - va->va_start);
+ if (va->flags & VMAP_RAM)
+ seq_printf(m, "0x%pK-0x%pK %7ld vm_map_ram\n",
+ (void *)va->va_start, (void *)va->va_end,
+ va->va_end - va->va_start);
goto final;
}
--
2.34.1
Currently, for vmalloc areas with the VM_IOREMAP flag set, apart from
the specific alignment clamping in __get_vm_area_node(), they will be
1) Shown as ioremap in /proc/vmallocinfo;
2) Ignored by /proc/kcore reading via vread().
So for the io mapping in ioremap_phb() of powerpc, we should set
VM_IOREMAP in the flags so that it is handled correctly as above.
Signed-off-by: Baoquan He <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
---
arch/powerpc/kernel/pci_64.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/powerpc/kernel/pci_64.c b/arch/powerpc/kernel/pci_64.c
index 0c7cfb9fab04..fd42059ae2a5 100644
--- a/arch/powerpc/kernel/pci_64.c
+++ b/arch/powerpc/kernel/pci_64.c
@@ -132,7 +132,7 @@ void __iomem *ioremap_phb(phys_addr_t paddr, unsigned long size)
* address decoding but I'd rather not deal with those outside of the
* reserved 64K legacy region.
*/
- area = __get_vm_area_caller(size, 0, PHB_IO_BASE, PHB_IO_END,
+ area = __get_vm_area_caller(size, VM_IOREMAP, PHB_IO_BASE, PHB_IO_END,
__builtin_return_address(0));
if (!area)
return NULL;
--
2.34.1
Currently, vread() can read out vmalloc areas which are associated with
a vm_struct, but this doesn't work for areas created by the
vm_map_ram() interface because they have no associated vm_struct, so
vread() skips them all.
Here, add a new function vmap_ram_vread() to read out vm_map_ram areas.
An area created directly with the vm_map_ram() interface can be handled
like the other normal vmap areas with aligned_vread(), while areas
that are further subdivided and managed with a vmap_block need their
page-aligned small regions carefully read out and the holes zero-filled.
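Schematically, a read that covers a whole vmap_block looks like this
(a rough illustration; regions as tracked by used_map):

	|<------------------- vmap_block area ------------------->|
	| used | free | used  used |    dirty    | used |  free   |
	   |       |                      |
	   |       +-- zero-filled        +-- zero-filled
	   +-- copied out with aligned_vread()

A read may also begin in the middle of a used region, which is why the
in-page offset of the start address is applied before copying.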
Reported-by: Stephen Brennan <[email protected]>
Signed-off-by: Baoquan He <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Tested-by: Stephen Brennan <[email protected]>
---
mm/vmalloc.c | 88 +++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 81 insertions(+), 7 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ab4825050b5c..4bb78ebd70f6 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3544,6 +3544,68 @@ static int aligned_vread(char *buf, char *addr, unsigned long count)
return copied;
}
+static void vmap_ram_vread(char *buf, char *addr, int count, unsigned long flags)
+{
+ char *start;
+ struct vmap_block *vb;
+ unsigned long offset;
+ unsigned int rs, re, n;
+
+ /*
+ * If it's area created by vm_map_ram() interface directly, but
+ * not further subdividing and delegating management to vmap_block,
+ * handle it here.
+ */
+ if (!(flags & VMAP_BLOCK)) {
+ aligned_vread(buf, addr, count);
+ return;
+ }
+
+ /*
+ * Area is split into regions and tracked with vmap_block, read out
+ * each region and zero fill the hole between regions.
+ */
+ vb = xa_load(&vmap_blocks, addr_to_vb_idx((unsigned long)addr));
+ if (!vb)
+ goto finished;
+
+ spin_lock(&vb->lock);
+ if (bitmap_empty(vb->used_map, VMAP_BBMAP_BITS)) {
+ spin_unlock(&vb->lock);
+ goto finished;
+ }
+ for_each_set_bitrange(rs, re, vb->used_map, VMAP_BBMAP_BITS) {
+ if (!count)
+ break;
+ start = vmap_block_vaddr(vb->va->va_start, rs);
+ while (addr < start) {
+ if (count == 0)
+ goto unlock;
+ *buf = '\0';
+ buf++;
+ addr++;
+ count--;
+ }
+ /*it could start reading from the middle of used region*/
+ offset = offset_in_page(addr);
+ n = ((re - rs + 1) << PAGE_SHIFT) - offset;
+ if (n > count)
+ n = count;
+ aligned_vread(buf, start+offset, n);
+
+ buf += n;
+ addr += n;
+ count -= n;
+ }
+unlock:
+ spin_unlock(&vb->lock);
+
+finished:
+ /* zero-fill the left dirty or free regions */
+ if (count)
+ memset(buf, 0, count);
+}
+
/**
* vread() - read vmalloc area in a safe way.
* @buf: buffer for reading data
@@ -3574,7 +3636,7 @@ long vread(char *buf, char *addr, unsigned long count)
struct vm_struct *vm;
char *vaddr, *buf_start = buf;
unsigned long buflen = count;
- unsigned long n;
+ unsigned long n, size, flags;
addr = kasan_reset_tag(addr);
@@ -3595,12 +3657,21 @@ long vread(char *buf, char *addr, unsigned long count)
if (!count)
break;
- if (!va->vm)
+ vm = va->vm;
+ flags = va->flags & VMAP_FLAGS_MASK;
+ /*
+ * VMAP_BLOCK indicates a sub-type of vm_map_ram area, need
+ * be set together with VMAP_RAM.
+ */
+ WARN_ON(flags == VMAP_BLOCK);
+
+ if (!vm && !flags)
continue;
- vm = va->vm;
- vaddr = (char *) vm->addr;
- if (addr >= vaddr + get_vm_area_size(vm))
+ vaddr = (char *) va->va_start;
+ size = vm ? get_vm_area_size(vm) : va_size(va);
+
+ if (addr >= vaddr + size)
continue;
while (addr < vaddr) {
if (count == 0)
@@ -3610,10 +3681,13 @@ long vread(char *buf, char *addr, unsigned long count)
addr++;
count--;
}
- n = vaddr + get_vm_area_size(vm) - addr;
+ n = vaddr + size - addr;
if (n > count)
n = count;
- if (!(vm->flags & VM_IOREMAP))
+
+ if (flags & VMAP_RAM)
+ vmap_ram_vread(buf, addr, n, flags);
+ else if (!(vm->flags & VM_IOREMAP))
aligned_vread(buf, addr, n);
else /* IOREMAP area is treated as memory hole */
memset(buf, 0, n);
--
2.34.1
Currently, for vmalloc areas with the VM_IOREMAP flag set, apart from
the specific alignment clamping in __get_vm_area_node(), they will be
1) Shown as ioremap in /proc/vmallocinfo;
2) Ignored by /proc/kcore reading via vread().
So for the ioremap in __sq_remap() of sh, we should set VM_IOREMAP in
the flags so that it is handled correctly as above.
Signed-off-by: Baoquan He <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
---
arch/sh/kernel/cpu/sh4/sq.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/sh/kernel/cpu/sh4/sq.c b/arch/sh/kernel/cpu/sh4/sq.c
index a76b94e41e91..27f2e3da5aa2 100644
--- a/arch/sh/kernel/cpu/sh4/sq.c
+++ b/arch/sh/kernel/cpu/sh4/sq.c
@@ -103,7 +103,7 @@ static int __sq_remap(struct sq_mapping *map, pgprot_t prot)
#if defined(CONFIG_MMU)
struct vm_struct *vma;
- vma = __get_vm_area_caller(map->size, VM_ALLOC, map->sq_addr,
+ vma = __get_vm_area_caller(map->size, VM_IOREMAP, map->sq_addr,
SQ_ADDRMAX, __builtin_return_address(0));
if (!vma)
return -ENOMEM;
--
2.34.1