Problem:
***
Stephen reported that vread() skips vm_map_ram areas when reading out
/proc/kcore with the drgn utility. Please see the link below for more
details.
/proc/kcore reads 0's for vmap_block
https://lore.kernel.org/all/[email protected]/T/#u
Root cause:
***
The normal vmalloc API uses struct vmap_area to manage the allocated
virtual kernel area, and associates a vm_struct with it to store more
information and pass it out. However, an area reserved through the
vm_map_ram() interface has no vm_struct associated with it, so the
current code in vread() skips vm_map_ram areas via the 'if (!va->vm)'
check.
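For reference, this is the check in question, in vread()'s walk over the
vmap_area_list (simplified excerpt of the code before this series):

        list_for_each_entry_from(va, &vmap_area_list, list) {
                if (!count)
                        break;

                /* vm_map_ram areas have no vm_struct, so they are skipped */
                if (!va->vm)
                        continue;
                ...
        }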
Solution:
***
To mark areas reserved through the vm_map_ram() interface, add a 'flags'
field to struct vmap_area. Bit 0 indicates a vm_map_ram area created
through the vm_map_ram() interface; bit 1 marks out the sub-type of
vm_map_ram area which uses vmap_block to manage split regions via
vb_alloc/free().
Also add a bitmap field 'used_map' to struct vmap_block to mark the
further subdivided regions being used, so they can be differentiated
from the dirty and free regions in a vmap_block.
With the help of the above vmap_area->flags and vmap_block->used_map, we
can recognize and handle vm_map_ram areas successfully. All of this is
done in patches 1~3.
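Concretely, the flag bits added by patch 2 are:

#define VMAP_RAM                0x1     /* indicates vm_map_ram area */
#define VMAP_BLOCK              0x2     /* mark out the vmap_block sub-type */
#define VMAP_FLAGS_MASK         0x3

A vmap_block-managed area carries both bits (VMAP_RAM|VMAP_BLOCK);
VMAP_BLOCK is never valid on its own.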
Meanwhile, make some improvements related to vm_map_ram areas in
patches 4 and 5. Also change the area flag from VM_ALLOC to VM_IOREMAP
in patches 6 and 7, because this shows those areas as 'ioremap' in
/proc/vmallocinfo and excludes them from /proc/kcore.
Testing
***
Only did basic testing on a kvm guest, running the below command to
access the kcore file and collect statistics:
makedumpfile --mem-usage /proc/kcore
Changelog
***
v3->v4:
- Fix typo in patch 2 caught by Lorenzo.
- Add WARN_ON(flags == VMAP_BLOCK) in vread() to address Dan's concern
that VMAP_BLOCK could be set alone in vmap_area->flags.
- Add a check on the value returned from xa_load() in vmap_ram_vread();
Uladzislau commented on the risk in this place.
- Fix a bug in s_show() which was changed in patch 5. The old change
caused the 'va->vm is NULL but the VMAP_RAM flag is not set' case to be
handled wrongly. Please see the link below for details.
- https://lore.kernel.org/all/Y8aAmuUY9OxrYlLm@kili/T/#u
- Add Uladzislau and Lorenzo's Reviewed-by.
v2->v3:
- Benefiting from find_unlink_vmap_area() introduced by Uladzislau, we no
longer need to worry about va->vm and va->flags being reset during
freeing.
- Change to identify vm_map_ram areas by VMAP_RAM in vmap_area->flags in
vread().
- Rename the old vb_vread() to vmap_ram_vread().
- Handle the two kinds of vm_map_ram area reading in vmap_ram_vread()
respectively.
- Fix a bug in the while loop code block of the vmap_block reading, found
by Lorenzo.
- Improve the conditional checks related to vm_map_ram areas, as
suggested by Lorenzo.
v1->v2:
- Change alloc_vmap_area() to take a va_flags argument so that we can
pass in and set vmap_area->flags for vm_map_ram areas. With this, no
extra action needs to be taken to acquire vmap_area_lock when setting
vmap_area->flags. Uladzislau reviewed v1 and pointed out the issue.
- Improve vb_vread() to cover the case where reading starts from a
dirty or free region.
RFC->v1:
- Add a new field 'flags' in vmap_area to mark vm_map_ram areas. It could
be risky to reuse the vm union in vmap_area as the RFC did. I will
consider reusing the union in vmap_area to save memory later; for now,
take the simpler way so we can focus on resolving the main problem.
- Add patches 4~7 for optimization.
Baoquan He (7):
mm/vmalloc.c: add used_map into vmap_block to track space of
vmap_block
mm/vmalloc.c: add flags to mark vm_map_ram area
mm/vmalloc.c: allow vread() to read out vm_map_ram areas
mm/vmalloc: explicitly identify vm_map_ram area when shown in
/proc/vmallocinfo
mm/vmalloc: skip the uninitialized vmalloc areas
powerpc: mm: add VM_IOREMAP flag to the vmalloc area
sh: mm: set VM_IOREMAP flag to the vmalloc area
arch/powerpc/kernel/pci_64.c | 2 +-
arch/sh/kernel/cpu/sh4/sq.c | 2 +-
include/linux/vmalloc.h | 1 +
mm/vmalloc.c | 126 ++++++++++++++++++++++++++++++-----
4 files changed, 111 insertions(+), 20 deletions(-)
--
2.34.1
In one vmap_block area, there can be three types of regions: used
regions allocated through vb_alloc(), dirty regions freed via vb_free(),
and free regions. Among them, only the used regions contain valid data,
yet there is currently no way to track them.
Here, add the bitmap field used_map to vmap_block, and set/clear it when
allocating or freeing regions of a vmap_block area.
This is a preparation for later use.
Signed-off-by: Baoquan He <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
---
mm/vmalloc.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 428e0bee5c9c..d6ff058ef4d0 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1922,6 +1922,7 @@ struct vmap_block {
spinlock_t lock;
struct vmap_area *va;
unsigned long free, dirty;
+ DECLARE_BITMAP(used_map, VMAP_BBMAP_BITS);
unsigned long dirty_min, dirty_max; /*< dirty range */
struct list_head free_list;
struct rcu_head rcu_head;
@@ -1998,10 +1999,12 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
vb->va = va;
/* At least something should be left free */
BUG_ON(VMAP_BBMAP_BITS <= (1UL << order));
+ bitmap_zero(vb->used_map, VMAP_BBMAP_BITS);
vb->free = VMAP_BBMAP_BITS - (1UL << order);
vb->dirty = 0;
vb->dirty_min = VMAP_BBMAP_BITS;
vb->dirty_max = 0;
+ bitmap_set(vb->used_map, 0, (1UL << order));
INIT_LIST_HEAD(&vb->free_list);
vb_idx = addr_to_vb_idx(va->va_start);
@@ -2111,6 +2114,7 @@ static void *vb_alloc(unsigned long size, gfp_t gfp_mask)
pages_off = VMAP_BBMAP_BITS - vb->free;
vaddr = vmap_block_vaddr(vb->va->va_start, pages_off);
vb->free -= 1UL << order;
+ bitmap_set(vb->used_map, pages_off, (1UL << order));
if (vb->free == 0) {
spin_lock(&vbq->lock);
list_del_rcu(&vb->free_list);
@@ -2144,6 +2148,9 @@ static void vb_free(unsigned long addr, unsigned long size)
order = get_order(size);
offset = (addr & (VMAP_BLOCK_SIZE - 1)) >> PAGE_SHIFT;
vb = xa_load(&vmap_blocks, addr_to_vb_idx(addr));
+ spin_lock(&vb->lock);
+ bitmap_clear(vb->used_map, offset, (1UL << order));
+ spin_unlock(&vb->lock);
vunmap_range_noflush(addr, addr + size);
--
2.34.1
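As an illustration of what used_map enables, a helper like the
hypothetical one below (not part of this series) could report how many
pages of a vmap_block are currently handed out via vb_alloc():

/*
 * Hypothetical helper, for illustration only: count the pages of a
 * vmap_block currently allocated through vb_alloc(). The caller must
 * hold vb->lock, since vb_alloc()/vb_free() update used_map under it.
 */
static unsigned int vb_used_pages(struct vmap_block *vb)
{
        return bitmap_weight(vb->used_map, VMAP_BBMAP_BITS);
}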
Through the vmalloc API, a virtual kernel area is reserved for physical
address mapping; a vmap_area is used to track it, while a vm_struct is
allocated and associated with the vmap_area to store more information
and pass it out.
However, an area reserved via vm_map_ram() is an exception: it has no
vm_struct associated with its vmap_area. And we can't recognize a
vmap_area with '->vm == NULL' as a vm_map_ram() area, because the normal
freeing path sets va->vm = NULL before unmapping; please see
remove_vm_area().
Meanwhile, there are two kinds of handling for vm_map_ram areas. In one,
the whole vmap_area is reserved and mapped at one time through the
vm_map_ram() interface; in the other, the whole vmap_area of
VMAP_BLOCK_SIZE is reserved, then mapped into split regions of smaller
size via vb_alloc().
To mark areas reserved through vm_map_ram(), add a flags field to struct
vmap_area. Bit 0 indicates a vm_map_ram area created through the
vm_map_ram() interface, while bit 1 marks out the sub-type of vm_map_ram
area which uses vmap_block to manage split regions via vb_alloc/free().
This is a preparation for later use.
Signed-off-by: Baoquan He <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
---
include/linux/vmalloc.h | 1 +
mm/vmalloc.c | 16 ++++++++++++----
2 files changed, 13 insertions(+), 4 deletions(-)
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 096d48aa3437..69250efa03d1 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -76,6 +76,7 @@ struct vmap_area {
unsigned long subtree_max_size; /* in "free" tree */
struct vm_struct *vm; /* in "busy" tree */
};
+ unsigned long flags; /* mark type of vm_map_ram area */
};
/* archs that select HAVE_ARCH_HUGE_VMAP should override one or more of these */
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d6ff058ef4d0..ab4825050b5c 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1589,7 +1589,8 @@ preload_this_cpu_lock(spinlock_t *lock, gfp_t gfp_mask, int node)
static struct vmap_area *alloc_vmap_area(unsigned long size,
unsigned long align,
unsigned long vstart, unsigned long vend,
- int node, gfp_t gfp_mask)
+ int node, gfp_t gfp_mask,
+ unsigned long va_flags)
{
struct vmap_area *va;
unsigned long freed;
@@ -1635,6 +1636,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
va->va_start = addr;
va->va_end = addr + size;
va->vm = NULL;
+ va->flags = va_flags;
spin_lock(&vmap_area_lock);
insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
@@ -1913,6 +1915,10 @@ static struct vmap_area *find_unlink_vmap_area(unsigned long addr)
#define VMAP_BLOCK_SIZE (VMAP_BBMAP_BITS * PAGE_SIZE)
+#define VMAP_RAM 0x1 /* indicates vm_map_ram area*/
+#define VMAP_BLOCK 0x2 /* mark out the vmap_block sub-type*/
+#define VMAP_FLAGS_MASK 0x3
+
struct vmap_block_queue {
spinlock_t lock;
struct list_head free;
@@ -1988,7 +1994,8 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
va = alloc_vmap_area(VMAP_BLOCK_SIZE, VMAP_BLOCK_SIZE,
VMALLOC_START, VMALLOC_END,
- node, gfp_mask);
+ node, gfp_mask,
+ VMAP_RAM|VMAP_BLOCK);
if (IS_ERR(va)) {
kfree(vb);
return ERR_CAST(va);
@@ -2297,7 +2304,8 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node)
} else {
struct vmap_area *va;
va = alloc_vmap_area(size, PAGE_SIZE,
- VMALLOC_START, VMALLOC_END, node, GFP_KERNEL);
+ VMALLOC_START, VMALLOC_END,
+ node, GFP_KERNEL, VMAP_RAM);
if (IS_ERR(va))
return NULL;
@@ -2537,7 +2545,7 @@ static struct vm_struct *__get_vm_area_node(unsigned long size,
if (!(flags & VM_NO_GUARD))
size += PAGE_SIZE;
- va = alloc_vmap_area(size, align, start, end, node, gfp_mask);
+ va = alloc_vmap_area(size, align, start, end, node, gfp_mask, 0);
if (IS_ERR(va)) {
kfree(area);
return NULL;
--
2.34.1
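To make the bit semantics concrete, hypothetical predicates (for
illustration only, not part of the series) would look like:

/* A vm_map_ram area, whether or not it is vmap_block-managed. */
static bool va_is_vm_map_ram(const struct vmap_area *va)
{
        return !!(va->flags & VMAP_RAM);
}

/* A vmap_block-managed area; VMAP_BLOCK is never set without VMAP_RAM. */
static bool va_is_vmap_block(const struct vmap_area *va)
{
        return (va->flags & VMAP_FLAGS_MASK) == (VMAP_RAM | VMAP_BLOCK);
}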
Now, with VMAP_RAM marked in vmap_area->flags for vm_map_ram areas, we
can clearly differentiate them from other vmalloc areas. So identify a
vm_map_ram area by checking VMAP_RAM in vmap_area->flags when it is
shown in /proc/vmallocinfo.
Meanwhile, the code comment above the vm_map_ram area check in s_show()
is no longer needed; remove it here.
Signed-off-by: Baoquan He <[email protected]>
---
mm/vmalloc.c | 11 ++++-------
1 file changed, 4 insertions(+), 7 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 5a3ea6cb7ec2..e515dbacb0cb 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -4232,14 +4232,11 @@ static int s_show(struct seq_file *m, void *p)
va = list_entry(p, struct vmap_area, list);
- /*
- * s_show can encounter race with remove_vm_area, !vm on behalf
- * of vmap area is being tear down or vm_map_ram allocation.
- */
if (!va->vm) {
- seq_printf(m, "0x%pK-0x%pK %7ld vm_map_ram\n",
- (void *)va->va_start, (void *)va->va_end,
- va->va_end - va->va_start);
+ if (va->flags & VMAP_RAM)
+ seq_printf(m, "0x%pK-0x%pK %7ld vm_map_ram\n",
+ (void *)va->va_start, (void *)va->va_end,
+ va->va_end - va->va_start);
goto final;
}
--
2.34.1
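With this change, a vm_map_ram area that has no vm_struct still shows up
in /proc/vmallocinfo with its type tagged, along the lines of the
illustrative line below (addresses and size are made up for the example;
a vmap_block-backed area would typically span VMAP_BLOCK_SIZE):

0xffffad3ac2d00000-0xffffad3ac3100000 4194304 vm_map_ram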
Currently, vread() can read out vmalloc areas which are associated with
a vm_struct. However, this doesn't work for areas created via the
vm_map_ram() interface, because they have no associated vm_struct; in
vread(), these areas are all skipped.
Here, add a new function vmap_ram_vread() to read out vm_map_ram areas.
An area created with the vm_map_ram() interface directly can be handled
like the other normal vmap areas with aligned_vread(), while areas
further subdivided and managed with vmap_block need their page-aligned
used regions read out carefully and the holes zero-filled.
Signed-off-by: Baoquan He <[email protected]>
---
mm/vmalloc.c | 87 +++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 80 insertions(+), 7 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ab4825050b5c..5a3ea6cb7ec2 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3544,6 +3544,67 @@ static int aligned_vread(char *buf, char *addr, unsigned long count)
return copied;
}
+static void vmap_ram_vread(char *buf, char *addr, int count, unsigned long flags)
+{
+ char *start;
+ struct vmap_block *vb;
+ unsigned long offset;
+ unsigned int rs, re, n;
+
+ /*
+ * If it's area created by vm_map_ram() interface directly, but
+ * not further subdividing and delegating management to vmap_block,
+ * handle it here.
+ */
+ if (!(flags & VMAP_BLOCK)) {
+ aligned_vread(buf, addr, count);
+ return;
+ }
+
+ /*
+ * Area is split into regions and tracked with vmap_block, read out
+ * each region and zero fill the hole between regions.
+ */
+ vb = xa_load(&vmap_blocks, addr_to_vb_idx((unsigned long)addr));
+ if (!vb)
+ goto finished;
+
+ spin_lock(&vb->lock);
+ if (bitmap_empty(vb->used_map, VMAP_BBMAP_BITS)) {
+ spin_unlock(&vb->lock);
+ goto finished;
+ }
+ for_each_set_bitrange(rs, re, vb->used_map, VMAP_BBMAP_BITS) {
+ if (!count)
+ break;
+ start = vmap_block_vaddr(vb->va->va_start, rs);
+ while (addr < start) {
+ if (count == 0)
+ break;
+ *buf = '\0';
+ buf++;
+ addr++;
+ count--;
+ }
+ /*it could start reading from the middle of used region*/
+ offset = offset_in_page(addr);
+ n = ((re - rs + 1) << PAGE_SHIFT) - offset;
+ if (n > count)
+ n = count;
+ aligned_vread(buf, start+offset, n);
+
+ buf += n;
+ addr += n;
+ count -= n;
+ }
+ spin_unlock(&vb->lock);
+
+finished:
+ /* zero-fill the left dirty or free regions */
+ if (count)
+ memset(buf, 0, count);
+}
+
/**
* vread() - read vmalloc area in a safe way.
* @buf: buffer for reading data
@@ -3574,7 +3635,7 @@ long vread(char *buf, char *addr, unsigned long count)
struct vm_struct *vm;
char *vaddr, *buf_start = buf;
unsigned long buflen = count;
- unsigned long n;
+ unsigned long n, size, flags;
addr = kasan_reset_tag(addr);
@@ -3595,12 +3656,21 @@ long vread(char *buf, char *addr, unsigned long count)
if (!count)
break;
- if (!va->vm)
+ vm = va->vm;
+ flags = va->flags & VMAP_FLAGS_MASK;
+ /*
+ * VMAP_BLOCK indicates a sub-type of vm_map_ram area, need
+ * be set together with VMAP_RAM.
+ */
+ WARN_ON(flags == VMAP_BLOCK);
+
+ if (!vm && !flags)
continue;
- vm = va->vm;
- vaddr = (char *) vm->addr;
- if (addr >= vaddr + get_vm_area_size(vm))
+ vaddr = (char *) va->va_start;
+ size = vm ? get_vm_area_size(vm) : va_size(va);
+
+ if (addr >= vaddr + size)
continue;
while (addr < vaddr) {
if (count == 0)
@@ -3610,10 +3680,13 @@ long vread(char *buf, char *addr, unsigned long count)
addr++;
count--;
}
- n = vaddr + get_vm_area_size(vm) - addr;
+ n = vaddr + size - addr;
if (n > count)
n = count;
- if (!(vm->flags & VM_IOREMAP))
+
+ if (flags & VMAP_RAM)
+ vmap_ram_vread(buf, addr, n, flags);
+ else if (!(vm->flags & VM_IOREMAP))
aligned_vread(buf, addr, n);
else /* IOREMAP area is treated as memory hole */
memset(buf, 0, n);
--
2.34.1
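For context, the consumer that motivated this series is /proc/kcore,
whose vmalloc path boils down to roughly the following (simplified
sketch of fs/proc/kcore.c, shown only to illustrate how vread() is
used):

        /*
         * KCORE_VMALLOC segment: vread() copies out what is mapped and
         * zero-fills holes itself, so buf is fully initialized afterwards.
         */
        vread(buf, (char *)start, tsz);
        if (copy_to_user(buffer, buf, tsz))
                return -EFAULT;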
For areas allocated via the vmalloc_xxx() APIs, an unmapped area is
searched for and reserved, and new pages are allocated and mapped into
it; please see __vmalloc_node_range(). During the process, the flag
VM_UNINITIALIZED is set in vm->flags to indicate that page allocation
and mapping are not done yet, until clear_vm_uninitialized_flag() is
called to clear VM_UNINITIALIZED.
For this kind of area, if VM_UNINITIALIZED is still set, let's ignore it
in vread(), because the pages newly allocated and being mapped in that
area contain only zero data; reading them out via aligned_vread() is a
waste of time.
Signed-off-by: Baoquan He <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
---
mm/vmalloc.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index e515dbacb0cb..504b63606613 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3667,6 +3667,11 @@ long vread(char *buf, char *addr, unsigned long count)
if (!vm && !flags)
continue;
+ if (vm && (vm->flags & VM_UNINITIALIZED))
+ continue;
+ /* Pair with smp_wmb() in clear_vm_uninitialized_flag() */
+ smp_rmb();
+
vaddr = (char *) va->va_start;
size = vm ? get_vm_area_size(vm) : va_size(va);
--
2.34.1
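For reference, the write side that the added smp_rmb() pairs with is
clear_vm_uninitialized_flag() in mm/vmalloc.c, which looks roughly like
this (abbreviated):

static void clear_vm_uninitialized_flag(struct vm_struct *vm)
{
        /*
         * Before removing VM_UNINITIALIZED, make sure the vm_struct
         * fields already hold proper values for readers.
         */
        smp_wmb();
        vm->flags &= ~VM_UNINITIALIZED;
}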
Currently, vmalloc areas with the flag VM_IOREMAP set, apart from
receiving the specific alignment clamping in __get_vm_area_node(), will
be
1) Shown as ioremap in /proc/vmallocinfo;
2) Ignored by /proc/kcore reading via vread().
So for the ioremap in __sq_remap() of sh, we should set VM_IOREMAP in
the flags so that it is handled correctly as above.
Signed-off-by: Baoquan He <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
---
arch/sh/kernel/cpu/sh4/sq.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/sh/kernel/cpu/sh4/sq.c b/arch/sh/kernel/cpu/sh4/sq.c
index a76b94e41e91..27f2e3da5aa2 100644
--- a/arch/sh/kernel/cpu/sh4/sq.c
+++ b/arch/sh/kernel/cpu/sh4/sq.c
@@ -103,7 +103,7 @@ static int __sq_remap(struct sq_mapping *map, pgprot_t prot)
#if defined(CONFIG_MMU)
struct vm_struct *vma;
- vma = __get_vm_area_caller(map->size, VM_ALLOC, map->sq_addr,
+ vma = __get_vm_area_caller(map->size, VM_IOREMAP, map->sq_addr,
SQ_ADDRMAX, __builtin_return_address(0));
if (!vma)
return -ENOMEM;
--
2.34.1
Currently, vmalloc areas with the flag VM_IOREMAP set, apart from
receiving the specific alignment clamping in __get_vm_area_node(), will
be
1) Shown as ioremap in /proc/vmallocinfo;
2) Ignored by /proc/kcore reading via vread().
So for the io mapping in ioremap_phb() of powerpc, we should set
VM_IOREMAP in the flags so that it is handled correctly as above.
Signed-off-by: Baoquan He <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
---
arch/powerpc/kernel/pci_64.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/powerpc/kernel/pci_64.c b/arch/powerpc/kernel/pci_64.c
index 0c7cfb9fab04..fd42059ae2a5 100644
--- a/arch/powerpc/kernel/pci_64.c
+++ b/arch/powerpc/kernel/pci_64.c
@@ -132,7 +132,7 @@ void __iomem *ioremap_phb(phys_addr_t paddr, unsigned long size)
* address decoding but I'd rather not deal with those outside of the
* reserved 64K legacy region.
*/
- area = __get_vm_area_caller(size, 0, PHB_IO_BASE, PHB_IO_END,
+ area = __get_vm_area_caller(size, VM_IOREMAP, PHB_IO_BASE, PHB_IO_END,
__builtin_return_address(0));
if (!area)
return NULL;
--
2.34.1
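The 'ioremap' tagging both patches rely on comes from s_show() in
mm/vmalloc.c, roughly:

        if (v->flags & VM_IOREMAP)
                seq_puts(m, " ioremap");

and the /proc/kcore exclusion comes from the vread() path, which treats
VM_IOREMAP areas as memory holes (see the '!(vm->flags & VM_IOREMAP)'
branch in patch 3).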
On Wed, Feb 01, 2023 at 05:13:35PM +0800, Baoquan He wrote:
> Currently, vread() can read out vmalloc areas which are associated with
> a vm_struct. However, this doesn't work for areas created via the
> vm_map_ram() interface, because they have no associated vm_struct; in
> vread(), these areas are all skipped.
>
> Here, add a new function vmap_ram_vread() to read out vm_map_ram areas.
> An area created with the vm_map_ram() interface directly can be handled
> like the other normal vmap areas with aligned_vread(), while areas
> further subdivided and managed with vmap_block need their page-aligned
> used regions read out carefully and the holes zero-filled.
>
> Signed-off-by: Baoquan He <[email protected]>
> ---
> mm/vmalloc.c | 87 +++++++++++++++++++++++++++++++++++++++++++++++-----
> 1 file changed, 80 insertions(+), 7 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index ab4825050b5c..5a3ea6cb7ec2 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3544,6 +3544,67 @@ static int aligned_vread(char *buf, char *addr, unsigned long count)
> return copied;
> }
>
> +static void vmap_ram_vread(char *buf, char *addr, int count, unsigned long flags)
> +{
> + char *start;
> + struct vmap_block *vb;
> + unsigned long offset;
> + unsigned int rs, re, n;
> +
> + /*
> + * If it's area created by vm_map_ram() interface directly, but
> + * not further subdividing and delegating management to vmap_block,
> + * handle it here.
> + */
> + if (!(flags & VMAP_BLOCK)) {
> + aligned_vread(buf, addr, count);
> + return;
> + }
> +
> + /*
> + * Area is split into regions and tracked with vmap_block, read out
> + * each region and zero fill the hole between regions.
> + */
> + vb = xa_load(&vmap_blocks, addr_to_vb_idx((unsigned long)addr));
> + if (!vb)
> + goto finished;
> +
> + spin_lock(&vb->lock);
> + if (bitmap_empty(vb->used_map, VMAP_BBMAP_BITS)) {
> + spin_unlock(&vb->lock);
> + goto finished;
> + }
> + for_each_set_bitrange(rs, re, vb->used_map, VMAP_BBMAP_BITS) {
> + if (!count)
> + break;
> + start = vmap_block_vaddr(vb->va->va_start, rs);
> + while (addr < start) {
> + if (count == 0)
> + break;
Bit pedantic, but you're using the `if (!count)` form of checking whether it's
zero above, but here you explicitly check it, would be good to keep both consistent.
Given you're checking here, perhaps you could simply drop the previous check?
> + *buf = '\0';
> + buf++;
> + addr++;
> + count--;
> + }
> + /*it could start reading from the middle of used region*/
> + offset = offset_in_page(addr);
> + n = ((re - rs + 1) << PAGE_SHIFT) - offset;
> + if (n > count)
> + n = count;
> + aligned_vread(buf, start+offset, n);
> +
> + buf += n;
> + addr += n;
> + count -= n;
> + }
> + spin_unlock(&vb->lock);
> +
> +finished:
> + /* zero-fill the left dirty or free regions */
> + if (count)
> + memset(buf, 0, count);
> +}
> +
> /**
> * vread() - read vmalloc area in a safe way.
> * @buf: buffer for reading data
> @@ -3574,7 +3635,7 @@ long vread(char *buf, char *addr, unsigned long count)
> struct vm_struct *vm;
> char *vaddr, *buf_start = buf;
> unsigned long buflen = count;
> - unsigned long n;
> + unsigned long n, size, flags;
>
> addr = kasan_reset_tag(addr);
>
> @@ -3595,12 +3656,21 @@ long vread(char *buf, char *addr, unsigned long count)
> if (!count)
> break;
>
> - if (!va->vm)
> + vm = va->vm;
> + flags = va->flags & VMAP_FLAGS_MASK;
> + /*
> + * VMAP_BLOCK indicates a sub-type of vm_map_ram area, need
> + * be set together with VMAP_RAM.
> + */
> + WARN_ON(flags == VMAP_BLOCK);
> +
> + if (!vm && !flags)
> continue;
>
> - vm = va->vm;
> - vaddr = (char *) vm->addr;
> - if (addr >= vaddr + get_vm_area_size(vm))
> + vaddr = (char *) va->va_start;
> + size = vm ? get_vm_area_size(vm) : va_size(va);
> +
> + if (addr >= vaddr + size)
> continue;
> while (addr < vaddr) {
> if (count == 0)
> @@ -3610,10 +3680,13 @@ long vread(char *buf, char *addr, unsigned long count)
> addr++;
> count--;
> }
> - n = vaddr + get_vm_area_size(vm) - addr;
> + n = vaddr + size - addr;
> if (n > count)
> n = count;
> - if (!(vm->flags & VM_IOREMAP))
> +
> + if (flags & VMAP_RAM)
> + vmap_ram_vread(buf, addr, n, flags);
> + else if (!(vm->flags & VM_IOREMAP))
> aligned_vread(buf, addr, n);
> else /* IOREMAP area is treated as memory hole */
> memset(buf, 0, n);
> --
> 2.34.1
>
Other than the nit, feel free to add:-
Reviewed-by: Lorenzo Stoakes <[email protected]>
On Wed, Feb 01, 2023 at 05:13:36PM +0800, Baoquan He wrote:
> Now, with VMAP_RAM marked in vmap_area->flags for vm_map_ram areas, we
> can clearly differentiate them from other vmalloc areas. So identify a
> vm_map_ram area by checking VMAP_RAM in vmap_area->flags when it is
> shown in /proc/vmallocinfo.
>
> Meanwhile, the code comment above the vm_map_ram area check in s_show()
> is no longer needed; remove it here.
>
> Signed-off-by: Baoquan He <[email protected]>
> ---
> mm/vmalloc.c | 11 ++++-------
> 1 file changed, 4 insertions(+), 7 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 5a3ea6cb7ec2..e515dbacb0cb 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -4232,14 +4232,11 @@ static int s_show(struct seq_file *m, void *p)
>
> va = list_entry(p, struct vmap_area, list);
>
> - /*
> - * s_show can encounter race with remove_vm_area, !vm on behalf
> - * of vmap area is being tear down or vm_map_ram allocation.
> - */
> if (!va->vm) {
> - seq_printf(m, "0x%pK-0x%pK %7ld vm_map_ram\n",
> - (void *)va->va_start, (void *)va->va_end,
> - va->va_end - va->va_start);
> + if (va->flags & VMAP_RAM)
> + seq_printf(m, "0x%pK-0x%pK %7ld vm_map_ram\n",
> + (void *)va->va_start, (void *)va->va_end,
> + va->va_end - va->va_start);
>
> goto final;
> }
> --
> 2.34.1
>
Reviewed-by: Lorenzo Stoakes <[email protected]>
On 02/01/23 at 08:16pm, Lorenzo Stoakes wrote:
> On Wed, Feb 01, 2023 at 05:13:35PM +0800, Baoquan He wrote:
> > Currently, vread() can read out vmalloc areas which are associated with
> > a vm_struct. However, this doesn't work for areas created via the
> > vm_map_ram() interface, because they have no associated vm_struct; in
> > vread(), these areas are all skipped.
> >
> > Here, add a new function vmap_ram_vread() to read out vm_map_ram areas.
> > An area created with the vm_map_ram() interface directly can be handled
> > like the other normal vmap areas with aligned_vread(), while areas
> > further subdivided and managed with vmap_block need their page-aligned
> > used regions read out carefully and the holes zero-filled.
> >
> > Signed-off-by: Baoquan He <[email protected]>
> > ---
> > mm/vmalloc.c | 87 +++++++++++++++++++++++++++++++++++++++++++++++-----
> > 1 file changed, 80 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index ab4825050b5c..5a3ea6cb7ec2 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -3544,6 +3544,67 @@ static int aligned_vread(char *buf, char *addr, unsigned long count)
> > return copied;
> > }
> >
> > +static void vmap_ram_vread(char *buf, char *addr, int count, unsigned long flags)
> > +{
> > + char *start;
> > + struct vmap_block *vb;
> > + unsigned long offset;
> > + unsigned int rs, re, n;
> > +
> > + /*
> > + * If it's area created by vm_map_ram() interface directly, but
> > + * not further subdividing and delegating management to vmap_block,
> > + * handle it here.
> > + */
> > + if (!(flags & VMAP_BLOCK)) {
> > + aligned_vread(buf, addr, count);
> > + return;
> > + }
> > +
> > + /*
> > + * Area is split into regions and tracked with vmap_block, read out
> > + * each region and zero fill the hole between regions.
> > + */
> > + vb = xa_load(&vmap_blocks, addr_to_vb_idx((unsigned long)addr));
> > + if (!vb)
> > + goto finished;
> > +
> > + spin_lock(&vb->lock);
> > + if (bitmap_empty(vb->used_map, VMAP_BBMAP_BITS)) {
> > + spin_unlock(&vb->lock);
> > + goto finished;
> > + }
> > + for_each_set_bitrange(rs, re, vb->used_map, VMAP_BBMAP_BITS) {
> > + if (!count)
> > + break;
> > + start = vmap_block_vaddr(vb->va->va_start, rs);
> > + while (addr < start) {
> > + if (count == 0)
> > + break;
>
> Bit pedantic, but you're using the `if (!count)` form of checking whether it's
> zero above, but here you explicitly check it, would be good to keep both consistent.
Yeah, sounds good. Will change.
>
> Given you're checking here, perhaps you could simply drop the previous check?
Well, maybe no. The previous "if (!count)" is checking if count is 0
after the 'count -=n;' line at the end of the for_each loop. While this
"if (count == 0)" is checking if count is 0 after 'count--;' at the end
of while loop. Not sure if I got your point.
>
> > + *buf = '\0';
> > + buf++;
> > + addr++;
> > + count--;
> > + }
> > + /*it could start reading from the middle of used region*/
> > + offset = offset_in_page(addr);
> > + n = ((re - rs + 1) << PAGE_SHIFT) - offset;
> > + if (n > count)
> > + n = count;
> > + aligned_vread(buf, start+offset, n);
> > +
> > + buf += n;
> > + addr += n;
> > + count -= n;
> > + }
> > + spin_unlock(&vb->lock);
> > +
> > +finished:
> > + /* zero-fill the left dirty or free regions */
> > + if (count)
> > + memset(buf, 0, count);
> > +}
> > +
> > /**
> > * vread() - read vmalloc area in a safe way.
> > * @buf: buffer for reading data
> > @@ -3574,7 +3635,7 @@ long vread(char *buf, char *addr, unsigned long count)
> > struct vm_struct *vm;
> > char *vaddr, *buf_start = buf;
> > unsigned long buflen = count;
> > - unsigned long n;
> > + unsigned long n, size, flags;
> >
> > addr = kasan_reset_tag(addr);
> >
> > @@ -3595,12 +3656,21 @@ long vread(char *buf, char *addr, unsigned long count)
> > if (!count)
> > break;
> >
> > - if (!va->vm)
> > + vm = va->vm;
> > + flags = va->flags & VMAP_FLAGS_MASK;
> > + /*
> > + * VMAP_BLOCK indicates a sub-type of vm_map_ram area, need
> > + * be set together with VMAP_RAM.
> > + */
> > + WARN_ON(flags == VMAP_BLOCK);
> > +
> > + if (!vm && !flags)
> > continue;
> >
> > - vm = va->vm;
> > - vaddr = (char *) vm->addr;
> > - if (addr >= vaddr + get_vm_area_size(vm))
> > + vaddr = (char *) va->va_start;
> > + size = vm ? get_vm_area_size(vm) : va_size(va);
> > +
> > + if (addr >= vaddr + size)
> > continue;
> > while (addr < vaddr) {
> > if (count == 0)
> > @@ -3610,10 +3680,13 @@ long vread(char *buf, char *addr, unsigned long count)
> > addr++;
> > count--;
> > }
> > - n = vaddr + get_vm_area_size(vm) - addr;
> > + n = vaddr + size - addr;
> > if (n > count)
> > n = count;
> > - if (!(vm->flags & VM_IOREMAP))
> > +
> > + if (flags & VMAP_RAM)
> > + vmap_ram_vread(buf, addr, n, flags);
> > + else if (!(vm->flags & VM_IOREMAP))
> > aligned_vread(buf, addr, n);
> > else /* IOREMAP area is treated as memory hole */
> > memset(buf, 0, n);
> > --
> > 2.34.1
> >
>
> Other than the nit, feel free to add:-
>
> Reviewed-by: Lorenzo Stoakes <[email protected]>
>
On Thu, Feb 02, 2023 at 11:20:07AM +0800, Baoquan He wrote:
[snip]
> > > + for_each_set_bitrange(rs, re, vb->used_map, VMAP_BBMAP_BITS) {
> > > + if (!count)
> > > + break;
> > > + start = vmap_block_vaddr(vb->va->va_start, rs);
> > > + while (addr < start) {
> > > + if (count == 0)
> > > + break;
> >
> > Bit pedantic, but you're using the `if (!count)` form of checking whether it's
> > zero above, but here you explicitly check it, would be good to keep both consistent.
>
> Yeah, sounds good. Will change.
>
> >
> > Given you're checking here, perhaps you could simply drop the previous check?
>
> Well, maybe no. The previous "if (!count)" is checking if count is 0
> after the 'count -=n;' line at the end of the for_each loop. While this
> "if (count == 0)" is checking if count is 0 after 'count--;' at the end
> of while loop. Not sure if I got your point.
You're right, sorry each break is for a different loop :) and I guess the inner
check is feeding the outer one so we're all good.
On 02/02/23 at 07:17am, Lorenzo Stoakes wrote:
> On Thu, Feb 02, 2023 at 11:20:07AM +0800, Baoquan He wrote:
>
> [snip]
>
> > > > + for_each_set_bitrange(rs, re, vb->used_map, VMAP_BBMAP_BITS) {
> > > > + if (!count)
> > > > + break;
> > > > + start = vmap_block_vaddr(vb->va->va_start, rs);
> > > > + while (addr < start) {
> > > > + if (count == 0)
> > > > + break;
> > >
> > > Bit pedantic, but you're using the `if (!count)` form of checking whether it's
> > > zero above, but here you explicitly check it, would be good to keep both consistent.
> >
> > Yeah, sounds good. Will change.
> >
> > >
> > > Given you're checking here, perhaps you could simply drop the previous check?
> >
> > Well, maybe no. The previous "if (!count)" is checking if count is 0
> > after the 'count -=n;' line at the end of the for_each loop. While this
> > "if (count == 0)" is checking if count is 0 after 'count--;' at the end
> > of while loop. Not sure if I got your point.
>
> You're right, sorry each break is for a different loop :) and I guess the inner
> check is feeding the outer one so we're all good.
Oh, the inner check and break only terminate the while loop, but they
should jump to the 'spin_unlock(&vb->lock);' line too, as the outer
break does. I will fix this.
Baoquan He <[email protected]> writes:
> Problem:
> ***
> Stephen reported that vread() skips vm_map_ram areas when reading out
> /proc/kcore with the drgn utility. Please see the link below for more
> details.
>
> /proc/kcore reads 0's for vmap_block
> https://lore.kernel.org/all/[email protected]/T/#u
>
> Root cause:
> ***
> The normal vmalloc API uses struct vmap_area to manage the allocated
> virtual kernel area, and associates a vm_struct with it to store more
> information and pass it out. However, an area reserved through the
> vm_map_ram() interface has no vm_struct associated with it, so the
> current code in vread() skips vm_map_ram areas via the 'if (!va->vm)'
> check.
>
> Solution:
> ***
> To mark areas reserved through the vm_map_ram() interface, add a 'flags'
> field to struct vmap_area. Bit 0 indicates a vm_map_ram area created
> through the vm_map_ram() interface; bit 1 marks out the sub-type of
> vm_map_ram area which uses vmap_block to manage split regions via
> vb_alloc/free().
>
> Also add a bitmap field 'used_map' to struct vmap_block to mark the
> further subdivided regions being used, so they can be differentiated
> from the dirty and free regions in a vmap_block.
>
> With the help of the above vmap_area->flags and vmap_block->used_map, we
> can recognize and handle vm_map_ram areas successfully. All of this is
> done in patches 1~3.
>
> Meanwhile, make some improvements related to vm_map_ram areas in
> patches 4 and 5. Also change the area flag from VM_ALLOC to VM_IOREMAP
> in patches 6 and 7, because this shows those areas as 'ioremap' in
> /proc/vmallocinfo and excludes them from /proc/kcore.
>
> Testing
> ***
> Only did basic testing on a kvm guest, running the below command to
> access the kcore file and collect statistics:
>
> makedumpfile --mem-usage /proc/kcore
Hi Baoquan,
Sorry I haven't commented with testing info or review on each revision:
I'm not really familiar with the details necessary for review. However,
it looks like this is getting close to ready, so I did another test:
[opc@stepbren-ol8-1 drgn_vmalloc_test]$ sudo insmod drgn_vmalloc_test.ko
[opc@stepbren-ol8-1 drgn_vmalloc_test]$ sudo dmesg | tail -n 5
[ 20.763310] missing module BTF, cannot register kfuncs
[ 20.840200] missing module BTF, cannot register kfuncs
[ 91.475814] drgn_vmalloc_test: loading out-of-tree module taints kernel.
[ 91.479913] drgn_vmalloc_test: module verification failed: signature and/or required key missing - tainting kernel
[ 91.484926] drgn_vmalloc_test: 0xffffa51ac2d00000
[opc@stepbren-ol8-1 drgn_vmalloc_test]$ sudo drgn
drgn 0.0.22 (using Python 3.6.8, elfutils 0.186, with libkdumpfile)
For help, type help(drgn).
>>> import drgn
>>> from drgn import NULL, Object, cast, container_of, execscript, offsetof, reinterpret, sizeof
>>> from drgn.helpers.common import *
>>> from drgn.helpers.linux import *
warning: could not get debugging information for:
drgn_vmalloc_test (could not find module in depmod)
>>> prog.read(0xffffa51ac2d00000, 64)
b'\x00\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00\x05\x00\x00\x00\x06\x00\x00\x00\x07\x00\x00\x00\x08\x00\x00\x00\t\x00\x00\x00\n\x00\x00\x00\x0b\x00\x00\x00\x0c\x00\x00\x00\r\x00\x00\x00\x0e\x00\x00\x00\x0f\x00\x00\x00'
>>>
So this definitely still resolves the originally reported issue. Feel
free to add, if you want:
Tested-by: Stephen Brennan <[email protected]>
Thanks for all the work here,
Stephen
>
> Changelog
> ***
> v3->v4:
> - Fix typo in patch 2 caught by Lorenzo.
> - Add WARN_ON(flags == VMAP_BLOCK) in vread() to address Dan's concern
> that VMAP_BLOCK could be set alone in vmap_area->flags.
> - Add a check on the value returned from xa_load() in vmap_ram_vread();
> Uladzislau commented on the risk in this place.
> - Fix a bug in s_show() which was changed in patch 5. The old change
> caused the 'va->vm is NULL but the VMAP_RAM flag is not set' case to be
> handled wrongly. Please see the link below for details.
> - https://lore.kernel.org/all/Y8aAmuUY9OxrYlLm@kili/T/#u
> - Add Uladzislau and Lorenzo's Reviewed-by.
>
> v2->v3:
> - Benefiting from find_unlink_vmap_area() introduced by Uladzislau, we
> no longer need to worry about va->vm and va->flags being reset during
> freeing.
> - Change to identify vm_map_ram areas by VMAP_RAM in vmap_area->flags in
> vread().
> - Rename the old vb_vread() to vmap_ram_vread().
> - Handle the two kinds of vm_map_ram area reading in vmap_ram_vread()
> respectively.
> - Fix a bug in the while loop code block of the vmap_block reading,
> found by Lorenzo.
> - Improve the conditional checks related to vm_map_ram areas, as
> suggested by Lorenzo.
>
> v1->v2:
> - Change alloc_vmap_area() to take a va_flags argument so that we can
> pass in and set vmap_area->flags for vm_map_ram areas. With this, no
> extra action needs to be taken to acquire vmap_area_lock when setting
> vmap_area->flags. Uladzislau reviewed v1 and pointed out the issue.
> - Improve vb_vread() to cover the case where reading starts from a
> dirty or free region.
>
> RFC->v1:
> - Add a new field 'flags' in vmap_area to mark vm_map_ram areas. It
> could be risky to reuse the vm union in vmap_area as the RFC did. I
> will consider reusing the union in vmap_area to save memory later; for
> now, take the simpler way so we can focus on resolving the main problem.
> - Add patches 4~7 for optimization.
>
> Baoquan He (7):
> mm/vmalloc.c: add used_map into vmap_block to track space of
> vmap_block
> mm/vmalloc.c: add flags to mark vm_map_ram area
> mm/vmalloc.c: allow vread() to read out vm_map_ram areas
> mm/vmalloc: explicitly identify vm_map_ram area when shown in
> /proc/vmallocinfo
> mm/vmalloc: skip the uninitialized vmalloc areas
> powerpc: mm: add VM_IOREMAP flag to the vmalloc area
> sh: mm: set VM_IOREMAP flag to the vmalloc area
>
> arch/powerpc/kernel/pci_64.c | 2 +-
> arch/sh/kernel/cpu/sh4/sq.c | 2 +-
> include/linux/vmalloc.h | 1 +
> mm/vmalloc.c | 126 ++++++++++++++++++++++++++++++-----
> 4 files changed, 111 insertions(+), 20 deletions(-)
>
> --
> 2.34.1
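The drgn_vmalloc_test module itself is not posted in this thread; a
minimal module along those lines might look like the hypothetical sketch
below, which matches the observed output (an incrementing u32 pattern
behind a vm_map_ram() mapping whose address is printed for drgn to read
back):

#include <linux/mm.h>
#include <linux/module.h>
#include <linux/vmalloc.h>

#define TEST_NR_PAGES   16

static struct page *pages[TEST_NR_PAGES];
static void *vaddr;

static int __init drgn_vmalloc_test_init(void)
{
        u32 *p;
        int i;

        /* Error unwinding elided for brevity. */
        for (i = 0; i < TEST_NR_PAGES; i++) {
                pages[i] = alloc_page(GFP_KERNEL);
                if (!pages[i])
                        return -ENOMEM;
        }

        vaddr = vm_map_ram(pages, TEST_NR_PAGES, NUMA_NO_NODE);
        if (!vaddr)
                return -ENOMEM;

        /* Fill the mapping with 0, 1, 2, ... as 32-bit values. */
        for (p = vaddr, i = 0; i < TEST_NR_PAGES * PAGE_SIZE / sizeof(u32); i++)
                p[i] = i;

        pr_info("drgn_vmalloc_test: 0x%px\n", vaddr);
        return 0;
}

static void __exit drgn_vmalloc_test_exit(void)
{
        int i;

        vm_unmap_ram(vaddr, TEST_NR_PAGES);
        for (i = 0; i < TEST_NR_PAGES; i++)
                __free_page(pages[i]);
}

module_init(drgn_vmalloc_test_init);
module_exit(drgn_vmalloc_test_exit);
MODULE_LICENSE("GPL");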
On 02/02/23 at 09:47am, Stephen Brennan wrote:
......snip...
> > Testing
> > ***
> > Only did the basic testing on kvm guest, and run below commands to
> > access kcore file to do statistics:
> >
> > makedumpfile --mem-usage /proc/kcore
>
> Hi Baoquan,
>
> Sorry I haven't commented with testing info or review on each revision:
> I'm not really familiar with the details necessary for review. However,
> it looks like this is getting close to ready, so I did another test:
That's OK, and your testing is very helpful, because I don't know how to
create a vm_map_ram() area to test the patches; I just did basic testing.
>
> [opc@stepbren-ol8-1 drgn_vmalloc_test]$ sudo insmod drgn_vmalloc_test.ko
> [opc@stepbren-ol8-1 drgn_vmalloc_test]$ sudo dmesg | tail -n 5
> [ 20.763310] missing module BTF, cannot register kfuncs
> [ 20.840200] missing module BTF, cannot register kfuncs
> [ 91.475814] drgn_vmalloc_test: loading out-of-tree module taints kernel.
> [ 91.479913] drgn_vmalloc_test: module verification failed: signature and/or required key missing - tainting kernel
> [ 91.484926] drgn_vmalloc_test: 0xffffa51ac2d00000
> [opc@stepbren-ol8-1 drgn_vmalloc_test]$ sudo drgn
> drgn 0.0.22 (using Python 3.6.8, elfutils 0.186, with libkdumpfile)
> For help, type help(drgn).
> >>> import drgn
> >>> from drgn import NULL, Object, cast, container_of, execscript, offsetof, reinterpret, sizeof
> >>> from drgn.helpers.common import *
> >>> from drgn.helpers.linux import *
> warning: could not get debugging information for:
> drgn_vmalloc_test (could not find module in depmod)
> >>> prog.read(0xffffa51ac2d00000, 64)
> b'\x00\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00\x05\x00\x00\x00\x06\x00\x00\x00\x07\x00\x00\x00\x08\x00\x00\x00\t\x00\x00\x00\n\x00\x00\x00\x0b\x00\x00\x00\x0c\x00\x00\x00\r\x00\x00\x00\x0e\x00\x00\x00\x0f\x00\x00\x00'
> >>>
>
> So this definitely still resolves the originally reported issue. Feel
> free to add, if you want:
>
> Tested-by: Stephen Brennan <[email protected]>
I noticed Andrew had picked this v4 into his mm tree, maybe Andrew can
help add this Tested-by tag.
On Sat, 4 Feb 2023 12:12:08 +0800 Baoquan He <[email protected]> wrote:
> > So this definitely still resolves the originally reported issue. Feel
> > free to add, if you want:
> >
> > Tested-by: Stephen Brennan <[email protected]>
>
> I noticed Andrew had picked this v4 into his mm tree, maybe Andrew can
> help add this Tested-by tag.
I dropped this series and am now unable to locate a fix which addressed
the issue which Stephen hit.
Please send a v5?
On 02/04/23 at 03:13pm, Andrew Morton wrote:
> On Sat, 4 Feb 2023 12:12:08 +0800 Baoquan He <[email protected]> wrote:
>
> > > So this definitely still resolves the originally reported issue. Feel
> > > free to add, if you want:
> > >
> > > Tested-by: Stephen Brennan <[email protected]>
> >
> > I noticed Andrew had picked this v4 into his mm tree, maybe Andrew can
> > help add this Tested-by tag.
>
> I dropped this series and am now unable to locate a fix which addressed
> the issue which Stephen hit.
Stephen wanted to read out vm_map_ram areas, which can't be done without
this patchset. Patch 3 solves his problem.
>
> Please send a v5?
I will add Stephen's Reported-by and Tested-by to patch 3 and send v5.