For analysis purposes, it is useful to have NUMA node information
corresponding to the mapped virtual address ranges of a process. Currently,
the file /proc/<pid>/numa_maps provides a list of the NUMA nodes from which
pages are allocated, per VMA of a process. This is not useful if a user
needs to determine from which NUMA node the mapped pages are allocated for
a particular address range. It would help if the NUMA node information
presented in /proc/<pid>/numa_maps were broken down by VA ranges, showing
the exact NUMA node from which the pages have been allocated.
The format of the /proc/<pid>/numa_maps file content depends on the
/proc/<pid>/maps file content, as mentioned in the manpage, i.e., one line
entry for every VMA corresponding to the entries in the /proc/<pid>/maps
file. Therefore, changing the output of /proc/<pid>/numa_maps may not be
possible.
This patch set introduces the file /proc/<pid>/numa_vamaps, which provides
a proper breakdown of VA ranges by the NUMA node id from which the mapped
pages are allocated. For address ranges not having any pages mapped, a '-'
is printed instead of a NUMA node id.
The file supports lseek, allowing a reader to seek to a specific process
virtual address (VA), from which point the address range to NUMA node
information can be read.
Access to the new file /proc/<pid>/numa_vamaps will be governed by the
ptrace access mode PTRACE_MODE_READ_REALCREDS.
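
For illustration, the intended user-space usage might look like the
following sketch (hypothetical consumer code based on the semantics
described above; the pid and start VA are example values and error
handling is omitted):

    /* Sketch: dump VA-range-to-node lines starting at a chosen VA. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            char buf[4096];
            ssize_t n;
            int fd = open("/proc/1234/numa_vamaps", O_RDONLY);

            /* The lseek offset is interpreted as a process VA. */
            lseek(fd, 0x00400000UL, SEEK_SET);

            while ((n = read(fd, buf, sizeof(buf) - 1)) > 0) {
                    buf[n] = '\0';
                    fputs(buf, stdout); /* "start-end Nn" or "start-end -" */
            }
            close(fd);
            return 0;
    }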
See the following for previous discussion of this proposal:
https://marc.info/?t=152524073400001&r=1&w=2
Prakash Sangappa (6):
Add check to match numa node id when gathering pte stats
Add /proc/<pid>/numa_vamaps file for numa node information
Provide process address range to numa node id mapping
Add support to lseek /proc/<pid>/numa_vamaps file
File /proc/<pid>/numa_vamaps access needs PTRACE_MODE_READ_REALCREDS
check
/proc/pid/numa_vamaps: document in Documentation/filesystems/proc.txt
Documentation/filesystems/proc.txt | 21 +++
fs/proc/base.c | 6 +-
fs/proc/internal.h | 1 +
fs/proc/task_mmu.c | 265 ++++++++++++++++++++++++++++++++++++-
4 files changed, 285 insertions(+), 8 deletions(-)
--
2.7.4
Allow lseeking to a process virtual address (VA), from which the address
range to NUMA node information can be read. The lseek offset is
interpreted as the process virtual address.
Signed-off-by: Prakash Sangappa <[email protected]>
Reviewed-by: Steve Sistare <[email protected]>
---
fs/proc/task_mmu.c | 23 ++++++++++++++++++++++-
1 file changed, 22 insertions(+), 1 deletion(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 1371e379..93dce46 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1866,6 +1866,27 @@ static int gather_hole_info_vamap(unsigned long start, unsigned long end,
return 0;
}
+static loff_t numa_vamaps_llseek(struct file *file, loff_t offset, int orig)
+{
+ struct numa_vamaps_private *nvm = file->private_data;
+
+ if (orig == SEEK_CUR && offset < 0 && nvm->vm_start < -offset)
+ return -EINVAL;
+
+ switch (orig) {
+ case SEEK_SET:
+ nvm->vm_start = offset & PAGE_MASK;
+ break;
+ case SEEK_CUR:
+ nvm->vm_start += offset;
+ nvm->vm_start = nvm->vm_start & PAGE_MASK;
+ break;
+ default:
+ return -EINVAL;
+ }
+ return nvm->vm_start;
+}
+
static int vamap_vprintf(struct numa_vamaps_private *nvm, const char *f, ...)
{
va_list args;
@@ -2052,7 +2073,7 @@ const struct file_operations proc_pid_numa_maps_operations = {
const struct file_operations proc_numa_vamaps_operations = {
.open = numa_vamaps_open,
.read = numa_vamaps_read,
- .llseek = noop_llseek,
+ .llseek = numa_vamaps_llseek,
.release = numa_vamaps_release,
};
#endif /* CONFIG_NUMA */
--
2.7.4
Access to the /proc/<pid>/numa_vamaps file should be governed by the
PTRACE_MODE_READ_REALCREDS check, to restrict who can obtain a specific
VA range to NUMA node mapping information.
Signed-off-by: Prakash Sangappa <[email protected]>
Reviewed-by: Steve Sistare <[email protected]>
---
fs/proc/base.c | 4 +++-
fs/proc/task_mmu.c | 2 +-
2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 1af99ae..3c19a55 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -745,7 +745,9 @@ struct mm_struct *proc_mem_open(struct inode *inode, unsigned int mode)
struct mm_struct *mm = ERR_PTR(-ESRCH);
if (task) {
- mm = mm_access(task, mode | PTRACE_MODE_FSCREDS);
+ if (!(mode & PTRACE_MODE_REALCREDS))
+ mode |= PTRACE_MODE_FSCREDS;
+ mm = mm_access(task, mode);
put_task_struct(task);
if (!IS_ERR_OR_NULL(mm)) {
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 93dce46..30b29d2 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -2043,7 +2043,7 @@ static int numa_vamaps_open(struct inode *inode, struct file *file)
if (!nvm)
return -ENOMEM;
- mm = proc_mem_open(inode, PTRACE_MODE_READ);
+ mm = proc_mem_open(inode, PTRACE_MODE_READ | PTRACE_MODE_REALCREDS);
if (IS_ERR(mm)) {
kfree(nvm);
return PTR_ERR(mm);
--
2.7.4
Add a check for whether the NUMA node id matches when gathering pte
stats; this will be used by later patches.
Signed-off-by: Prakash Sangappa <[email protected]>
Reviewed-by: Steve Sistare <[email protected]>
---
fs/proc/task_mmu.c | 44 +++++++++++++++++++++++++++++++++++++-------
1 file changed, 37 insertions(+), 7 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 5ea1d64..0e2095c 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1569,9 +1569,15 @@ struct numa_maps {
unsigned long mapcount_max;
unsigned long dirty;
unsigned long swapcache;
+ unsigned long nextaddr;
+ long nid;
+ long isvamaps;
unsigned long node[MAX_NUMNODES];
};
+#define NUMA_VAMAPS_NID_NOPAGES (-1)
+#define NUMA_VAMAPS_NID_NONE (-2)
+
struct numa_maps_private {
struct proc_maps_private proc_maps;
struct numa_maps md;
@@ -1653,6 +1659,20 @@ static struct page *can_gather_numa_stats_pmd(pmd_t pmd,
}
#endif
+static bool
+vamap_match_nid(struct numa_maps *md, unsigned long addr, struct page *page)
+{
+ long target = (page ? page_to_nid(page) : NUMA_VAMAPS_NID_NOPAGES);
+
+ if (md->nid == NUMA_VAMAPS_NID_NONE)
+ md->nid = target;
+ if (md->nid == target)
+ return 0;
+ /* did not match */
+ md->nextaddr = addr;
+ return 1;
+}
+
static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
unsigned long end, struct mm_walk *walk)
{
@@ -1661,6 +1681,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
spinlock_t *ptl;
pte_t *orig_pte;
pte_t *pte;
+ int ret = 0;
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
ptl = pmd_trans_huge_lock(pmd, vma);
@@ -1668,11 +1689,13 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
struct page *page;
page = can_gather_numa_stats_pmd(*pmd, vma, addr);
- if (page)
+ if (md->isvamaps)
+ ret = vamap_match_nid(md, addr, page);
+ if (page && !ret)
gather_stats(page, md, pmd_dirty(*pmd),
HPAGE_PMD_SIZE/PAGE_SIZE);
spin_unlock(ptl);
- return 0;
+ return ret;
}
if (pmd_trans_unstable(pmd))
@@ -1681,6 +1704,10 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
do {
struct page *page = can_gather_numa_stats(*pte, vma, addr);
+ if (md->isvamaps && vamap_match_nid(md, addr, page)) {
+ ret = 1;
+ break;
+ }
if (!page)
continue;
gather_stats(page, md, pte_dirty(*pte), 1);
@@ -1688,7 +1715,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
} while (pte++, addr += PAGE_SIZE, addr != end);
pte_unmap_unlock(orig_pte, ptl);
cond_resched();
- return 0;
+ return ret;
}
#ifdef CONFIG_HUGETLB_PAGE
static int gather_hugetlb_stats(pte_t *pte, unsigned long hmask,
@@ -1697,15 +1724,18 @@ static int gather_hugetlb_stats(pte_t *pte, unsigned long hmask,
pte_t huge_pte = huge_ptep_get(pte);
struct numa_maps *md;
struct page *page;
+ int ret = 0;
+ md = walk->private;
if (!pte_present(huge_pte))
- return 0;
+ return (md->isvamaps ? vamap_match_nid(md, addr, NULL) : 0);
page = pte_page(huge_pte);
- if (!page)
- return 0;
+ if (md->isvamaps)
+ ret = vamap_match_nid(md, addr, page);
+ if (!page || ret)
+ return ret;
- md = walk->private;
gather_stats(page, md, pte_dirty(huge_pte), 1);
return 0;
}
--
2.7.4
Add documentation for /proc/<pid>/numa_vamaps in
Documentation/filesystems/proc.txt
Signed-off-by: Prakash Sangappa <[email protected]>
Reviewed-by: Steve Sistare <[email protected]>
---
Documentation/filesystems/proc.txt | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 22b4b00..7095216 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -150,6 +150,9 @@ Table 1-1: Process specific entries in /proc
each mapping and flags associated with it
numa_maps an extension based on maps, showing the memory locality and
binding policy as well as mem usage (in pages) of each mapping.
+ numa_vamaps Shows the NUMA node from which the physical memory backing
+ each mapped address range is allocated.
+
..............................................................................
For example, to get the status information of a process, all you have to do is
@@ -571,6 +574,24 @@ Where:
node locality page counters (N0 == node0, N1 == node1, ...) and the kernel page
size, in KB, that is backing the mapping up.
+The /proc/pid/numa_vamaps file shows, for each mapped address range, the
+NUMA node id from which the physical pages are allocated. For address
+ranges not having any pages mapped, a '-' is shown instead of the node id.
+Each line in the file shows one address range and its NUMA node.
+
+address-range numa-node-id
+
+00400000-00410000 N1
+00410000-0047f000 N0
+0047f000-00480000 N2
+00480000-00481000 -
+00481000-004a0000 N0
+004a0000-004a2000 -
+004a2000-004aa000 N2
+004aa000-004ad000 N0
+004ad000-004ae000 -
+..
+
1.2 Kernel data
---------------
--
2.7.4
This patch provides process address range to NUMA node information
through the /proc/<pid>/numa_vamaps file. For address ranges not having
any pages mapped, a '-' is printed instead of the NUMA node id.
The following is a sample of the file format:
00400000-00410000 N1
00410000-0047f000 N0
0047f000-00480000 N2
00480000-00481000 -
00481000-004a0000 N0
004a0000-004a2000 -
004a2000-004aa000 N2
004aa000-004ad000 N0
004ad000-004ae000 -
..
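
For illustration, a consumer could parse each such line with something like
the sketch below (hypothetical code; it assumes only the "start-end node"
layout shown above, not fixed field widths):

    #include <stdio.h>

    /* Sketch: parse one numa_vamaps line, e.g. "00400000-00410000 N1". */
    static void parse_line(const char *line)
    {
            unsigned long start, end;
            char node[16];

            if (sscanf(line, "%lx-%lx %15s", &start, &end, node) != 3)
                    return;
            if (node[0] == 'N')     /* e.g. "N1" */
                    printf("%lx-%lx on node %s\n", start, end, node + 1);
            else                    /* "-": no pages mapped in this range */
                    printf("%lx-%lx unpopulated\n", start, end);
    }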
Signed-off-by: Prakash Sangappa <[email protected]>
Reviewed-by: Steve Sistare <[email protected]>
---
fs/proc/task_mmu.c | 158 +++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 158 insertions(+)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 02b553c..1371e379 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1845,6 +1845,162 @@ static int show_numa_map(struct seq_file *m, void *v)
return 0;
}
+static int gather_hole_info_vamap(unsigned long start, unsigned long end,
+ struct mm_walk *walk)
+{
+ struct numa_maps *md = walk->private;
+ struct vm_area_struct *vma = walk->vma;
+
+ /*
+ * If in a nid, end walk at hole start.
+ * If no nid and vma changes, end walk at next vma start.
+ */
+ if (md->nid >= 0 || vma != find_vma(walk->mm, start)) {
+ md->nextaddr = start;
+ return 1;
+ }
+
+ if (md->nid == NUMA_VAMAPS_NID_NONE)
+ md->nid = NUMA_VAMAPS_NID_NOPAGES;
+
+ return 0;
+}
+
+static int vamap_vprintf(struct numa_vamaps_private *nvm, const char *f, ...)
+{
+ va_list args;
+ int len, space;
+
+ space = NUMA_VAMAPS_BUFSZ - nvm->count;
+ va_start(args, f);
+ len = vsnprintf(nvm->buf + nvm->count, space, f, args);
+ va_end(args);
+ if (len < space) {
+ nvm->count += len;
+ return 0;
+ }
+ return 1;
+}
+
+/*
+ * Display va-range to numa node info via /proc
+ */
+static ssize_t numa_vamaps_read(struct file *file, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct numa_vamaps_private *nvm = file->private_data;
+ struct vm_area_struct *vma, *tailvma;
+ struct numa_maps *md = &nvm->md;
+ struct mm_struct *mm = nvm->mm;
+ u64 vm_start = nvm->vm_start;
+ size_t ucount;
+ struct mm_walk walk = {
+ .hugetlb_entry = gather_hugetlb_stats,
+ .pmd_entry = gather_pte_stats,
+ .pte_hole = gather_hole_info_vamap,
+ .private = md,
+ .mm = mm,
+ };
+ int ret = 0, copied = 0, done = 0;
+
+ if (!mm || !mmget_not_zero(mm))
+ return 0;
+
+ if (count <= 0)
+ goto out_mm;
+
+ /* First copy leftover contents in buffer */
+ if (nvm->from)
+ goto docopy;
+
+repeat:
+ down_read(&mm->mmap_sem);
+ vma = find_vma(mm, vm_start);
+ if (!vma) {
+ done = 1;
+ goto out;
+ }
+
+ if (vma->vm_start > vm_start)
+ vm_start = vma->vm_start;
+
+ while (nvm->count < count) {
+ u64 vm_end;
+
+ /* Ensure we start with an empty numa_maps statistics */
+ memset(md, 0, sizeof(*md));
+ md->nid = NUMA_VAMAPS_NID_NONE; /* invalid nodeid at start */
+ md->nextaddr = 0;
+ md->isvamaps = 1;
+
+ if (walk_page_range(vm_start, vma->vm_end, &walk) < 0)
+ break;
+
+ /* nextaddr ends the range. if 0, reached the vma end */
+ vm_end = (md->nextaddr ? md->nextaddr : vma->vm_end);
+
+ /* break if buffer full */
+ if (md->nid >= 0 && md->node[md->nid]) {
+ if (vamap_vprintf(nvm, "%08lx-%08lx N%ld\n", vm_start,
+ vm_end, md->nid))
+ break;
+ } else if (vamap_vprintf(nvm, "%08lx-%08lx - \n", vm_start,
+ vm_end)) {
+ break;
+ }
+
+ /* advance to next VA */
+ vm_start = vm_end;
+ if (vm_end == vma->vm_end) {
+ vma = vma->vm_next;
+ if (!vma) {
+ done = 1;
+ break;
+ }
+ vm_start = vma->vm_start;
+ }
+ }
+out:
+ /* last, add gate vma details */
+ if (!vma && (tailvma = get_gate_vma(mm)) != NULL &&
+ vm_start < tailvma->vm_end) {
+ done = 0;
+ if (!vamap_vprintf(nvm, "%08lx-%08lx - \n",
+ tailvma->vm_start, tailvma->vm_end)) {
+ done = 1;
+ vm_start = tailvma->vm_end;
+ }
+ }
+
+ up_read(&mm->mmap_sem);
+docopy:
+ ucount = min(count, nvm->count);
+ if (ucount && copy_to_user(buf, nvm->buf + nvm->from, ucount)) {
+ ret = -EFAULT;
+ goto out_mm;
+ }
+ copied += ucount;
+ count -= ucount;
+ nvm->count -= ucount;
+ buf += ucount;
+ if (!done && count) {
+ nvm->from = 0;
+ goto repeat;
+ }
+ /* something left in the buffer */
+ if (nvm->count)
+ nvm->from += ucount;
+ else
+ nvm->from = 0;
+
+ nvm->vm_start = vm_start;
+ ret = copied;
+ *ppos += copied;
+out_mm:
+ mmput(mm);
+ return ret;
+}
+
static const struct seq_operations proc_pid_numa_maps_op = {
.start = m_start,
.next = m_next,
@@ -1895,6 +2051,8 @@ const struct file_operations proc_pid_numa_maps_operations = {
const struct file_operations proc_numa_vamaps_operations = {
.open = numa_vamaps_open,
+ .read = numa_vamaps_read,
+ .llseek = noop_llseek,
.release = numa_vamaps_release,
};
#endif /* CONFIG_NUMA */
--
2.7.4
Introduce the supporting data structures and file operations. A later
patch will provide the changes for generating the file content.
Signed-off-by: Prakash Sangappa <[email protected]>
Reviewed-by: Steve Sistare <[email protected]>
---
fs/proc/base.c | 2 ++
fs/proc/internal.h | 1 +
fs/proc/task_mmu.c | 42 ++++++++++++++++++++++++++++++++++++++++++
3 files changed, 45 insertions(+)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index ccf86f1..1af99ae 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2927,6 +2927,7 @@ static const struct pid_entry tgid_base_stuff[] = {
REG("maps", S_IRUGO, proc_pid_maps_operations),
#ifdef CONFIG_NUMA
REG("numa_maps", S_IRUGO, proc_pid_numa_maps_operations),
+ REG("numa_vamaps", S_IRUGO, proc_numa_vamaps_operations),
#endif
REG("mem", S_IRUSR|S_IWUSR, proc_mem_operations),
LNK("cwd", proc_cwd_link),
@@ -3313,6 +3314,7 @@ static const struct pid_entry tid_base_stuff[] = {
#endif
#ifdef CONFIG_NUMA
REG("numa_maps", S_IRUGO, proc_pid_numa_maps_operations),
+ REG("numa_vamaps", S_IRUGO, proc_numa_vamaps_operations),
#endif
REG("mem", S_IRUSR|S_IWUSR, proc_mem_operations),
LNK("cwd", proc_cwd_link),
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 5185d7f..994c7fd 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -298,6 +298,7 @@ extern const struct file_operations proc_pid_smaps_operations;
extern const struct file_operations proc_pid_smaps_rollup_operations;
extern const struct file_operations proc_clear_refs_operations;
extern const struct file_operations proc_pagemap_operations;
+extern const struct file_operations proc_numa_vamaps_operations;
extern unsigned long task_vsize(struct mm_struct *);
extern unsigned long task_statm(struct mm_struct *,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 0e2095c..02b553c 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1583,6 +1583,16 @@ struct numa_maps_private {
struct numa_maps md;
};
+#define NUMA_VAMAPS_BUFSZ 1024
+struct numa_vamaps_private {
+ struct mm_struct *mm;
+ struct numa_maps md;
+ u64 vm_start;
+ size_t from;
+ size_t count; /* residual bytes in buf at offset 'from' */
+ char buf[NUMA_VAMAPS_BUFSZ]; /* buffer */
+};
+
static void gather_stats(struct page *page, struct numa_maps *md, int pte_dirty,
unsigned long nr_pages)
{
@@ -1848,6 +1858,34 @@ static int pid_numa_maps_open(struct inode *inode, struct file *file)
sizeof(struct numa_maps_private));
}
+static int numa_vamaps_open(struct inode *inode, struct file *file)
+{
+ struct mm_struct *mm;
+ struct numa_vamaps_private *nvm;
+ nvm = kzalloc(sizeof(struct numa_vamaps_private), GFP_KERNEL);
+ if (!nvm)
+ return -ENOMEM;
+
+ mm = proc_mem_open(inode, PTRACE_MODE_READ);
+ if (IS_ERR(mm)) {
+ kfree(nvm);
+ return PTR_ERR(mm);
+ }
+ nvm->mm = mm;
+ file->private_data = nvm;
+ return 0;
+}
+
+static int numa_vamaps_release(struct inode *inode, struct file *file)
+{
+ struct numa_vamaps_private *nvm = file->private_data;
+
+ if (nvm->mm)
+ mmdrop(nvm->mm);
+ kfree(nvm);
+ return 0;
+}
+
const struct file_operations proc_pid_numa_maps_operations = {
.open = pid_numa_maps_open,
.read = seq_read,
@@ -1855,4 +1893,8 @@ const struct file_operations proc_pid_numa_maps_operations = {
.release = proc_map_release,
};
+const struct file_operations proc_numa_vamaps_operations = {
+ .open = numa_vamaps_open,
+ .release = numa_vamaps_release,
+};
#endif /* CONFIG_NUMA */
--
2.7.4
On Wed 12-09-18 13:23:58, Prakash Sangappa wrote:
> [...]
> See the following for previous discussion of this proposal:
>
> https://marc.info/?t=152524073400001&r=1&w=2
It would be really great to give a short summary of the previous
discussion. E.g. why do we need a proc interface in the first place when
we already have an API to query for the information you are proposing to
export [1]
[1] http://lkml.kernel.org/r/[email protected]
--
Michal Hocko
SUSE Labs
On 09/13/2018 01:40 AM, Michal Hocko wrote:
> On Wed 12-09-18 13:23:58, Prakash Sangappa wrote:
>> [...]
>> See the following for previous discussion of this proposal:
>>
>> https://marc.info/?t=152524073400001&r=1&w=2
> It would be really great to give a short summary of the previous
> discussion. E.g. why do we need a proc interface in the first place when
> we already have an API to query for the information you are proposing to
> export [1]
>
> [1] http://lkml.kernel.org/r/[email protected]
The proc interface provides an efficient way to export address range
to NUMA node id mapping information compared to using the API.

For example, for sparsely populated mappings, if a VMA has large portions
not having any physical pages mapped, the page walk done through the /proc
file interface can skip over non-existent PMDs / PTEs, whereas using the
API the application would have to scan the entire VMA in page-size units.

Also, VMAs having THP pages can have a mix of 4k pages and hugepages.
The page walk is efficient in scanning and determining if it is a THP
hugepage and stepping over it, whereas using the API, the application
would not know what page size mapping is used for a given VA and so would
have to again scan the VMA in units of the 4k page size.
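
To make the comparison concrete, querying node ids through the existing API
looks roughly like the sketch below (move_pages(2) in query mode, i.e.
nodes == NULL; the batch size is illustrative and error handling is
omitted):

    #include <numaif.h>             /* move_pages(); link with -lnuma */
    #include <unistd.h>

    #define BATCH 1024

    /* Sketch: fetch the node id of every small page in [addr, addr+len). */
    static void query_nodes(int pid, char *addr, size_t len)
    {
            void *pages[BATCH];
            int status[BATCH];      /* receives a node id or -errno per page */
            long pagesz = sysconf(_SC_PAGESIZE);
            size_t off = 0;

            while (off < len) {
                    unsigned long n = 0;

                    while (n < BATCH && off < len) {
                            pages[n++] = addr + off;
                            off += pagesz;
                    }
                    /* nodes == NULL means query-only; nothing is moved */
                    move_pages(pid, n, pages, NULL, status, 0);
                    /* consume status[0..n-1] here */
            }
    }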
If this sounds reasonable, I can add it to the commit / patch description.
-Prakash.
On Thu, 13 Sep 2018 15:32:25 -0700 "prakash.sangappa" <[email protected]> wrote:
> >> https://marc.info/?t=152524073400001&r=1&w=2
> > It would be really great to give a short summary of the previous
> > discussion. E.g. why do we need a proc interface in the first place when
> > we already have an API to query for the information you are proposing to
> > export [1]
> >
> > [1] http://lkml.kernel.org/r/[email protected]
>
> The proc interface provides an efficient way to export address range
> to NUMA node id mapping information compared to using the API.
>
> For example, for sparsely populated mappings, if a VMA has large portions
> not having any physical pages mapped, the page walk done through the /proc
> file interface can skip over non-existent PMDs / PTEs, whereas using the
> API the application would have to scan the entire VMA in page-size units.
>
> Also, VMAs having THP pages can have a mix of 4k pages and hugepages.
> The page walk is efficient in scanning and determining if it is a THP
> hugepage and stepping over it, whereas using the API, the application
> would not know what page size mapping is used for a given VA and so would
> have to again scan the VMA in units of the 4k page size.
>
> If this sounds reasonable, I can add it to the commit / patch description.
Preferably with some runtime measurements, please. How much faster is
this interface in real-world situations? And why does that performance
matter?
It would also be useful to see more details on how this info helps
operators understand/tune/etc their applications and workloads. In
other words, I'm trying to get an understanding of how useful this code
might be to our users in general.
On 09/13/2018 05:10 PM, Andrew Morton wrote:
>> Also, VMAs having THP pages can have a mix of 4k pages and hugepages.
>> The page walk is efficient in scanning and determining if it is a THP
>> hugepage and stepping over it, whereas using the API, the application
>> would not know what page size mapping is used for a given VA and so would
>> have to again scan the VMA in units of the 4k page size.
>>
>> If this sounds reasonable, I can add it to the commit / patch description.
As we are judging whether this is a "good" interface, can you tell us a
bit about its scalability? For instance, let's say someone has a 1TB
VMA that's populated with interleaved 4k pages. How much data comes
out? How long does it take to parse? Will we effectively deadlock the
system if someone accidentally cat's the wrong /proc file?
/proc seems like a really simple way to implement this, but it seems a
*really* odd choice for something that needs to collect a large amount
of data. The lseek() stuff is a nice addition, but I wonder if it's
unwieldy to use in practice. For instance, if you want to read data for
the VMA at 0x1000000, you lseek(fd, 0x1000000, SEEK_SET), right? You read
~20 bytes of data and then the fd is at 0x1000020. But, you're getting
data out at the next read() for (at least) the next page, which is also
available at 0x1001000. Seems funky. Do other /proc files behave this way?
On Thu 13-09-18 15:32:25, prakash.sangappa wrote:
>
>
> On 09/13/2018 01:40 AM, Michal Hocko wrote:
> > On Wed 12-09-18 13:23:58, Prakash Sangappa wrote:
> > > [...]
> > > See the following for previous discussion of this proposal:
> > >
> > > https://marc.info/?t=152524073400001&r=1&w=2
> > It would be really great to give a short summary of the previous
> > discussion. E.g. why do we need a proc interface in the first place when
> > we already have an API to query for the information you are proposing to
> > export [1]
> >
> > [1] http://lkml.kernel.org/r/[email protected]
>
> The proc interface provides an efficient way to export address range
> to NUMA node id mapping information compared to using the API.
Do you have any numbers?
> For example, for sparsely populated mappings, if a VMA has large portions
> not having any physical pages mapped, the page walk done through the /proc
> file interface can skip over non-existent PMDs / PTEs, whereas using the
> API the application would have to scan the entire VMA in page-size units.
What prevents you from pre-filtering by reading /proc/$pid/maps to get
ranges of interest?
> Also, VMAs having THP pages can have a mix of 4k pages and hugepages.
> The page walk is efficient in scanning and determining if it is a THP
> hugepage and stepping over it, whereas using the API, the application
> would not know what page size mapping is used for a given VA and so would
> have to again scan the VMA in units of the 4k page size.
Why does this matter for something that is for analysis purposes?
Reading the file for the whole address space is far from a free
operation. Is the page walk optimization really essential for usability?
Moreover, what prevents the move_pages implementation from being clever
about the page walk itself? In other words, why would we want to add a
new API rather than make the existing one faster for everybody?
> If this sounds reasonable, I can add it to the commit / patch description.
This all is absolutely _essential_ for any new API proposed. Remember that
once we add a new user interface, we have to maintain it forever. We
used to be too relaxed when adding new proc files in the past and it
backfired many times already.
--
Michal Hocko
SUSE Labs
On 9/14/2018 1:56 AM, Michal Hocko wrote:
> On Thu 13-09-18 15:32:25, prakash.sangappa wrote:
>> On 09/13/2018 01:40 AM, Michal Hocko wrote:
>>> On Wed 12-09-18 13:23:58, Prakash Sangappa wrote:
>>>> [...]
>>>> See the following for previous discussion of this proposal:
>>>>
>>>> https://marc.info/?t=152524073400001&r=1&w=2
>>> It would be really great to give a short summary of the previous
>>> discussion. E.g. why do we need a proc interface in the first place when
>>> we already have an API to query for the information you are proposing to
>>> export [1]
>>>
>>> [1] http://lkml.kernel.org/r/[email protected]
>>
>> The proc interface provides an efficient way to export address range
>> to NUMA node id mapping information compared to using the API.
>
> Do you have any numbers?
>
>> For example, for sparsely populated mappings, if a VMA has large portions
>> not having any physical pages mapped, the page walk done through the /proc
>> file interface can skip over non-existent PMDs / PTEs, whereas using the
>> API the application would have to scan the entire VMA in page-size units.
>
> What prevents you from pre-filtering by reading /proc/$pid/maps to get
> ranges of interest?
That works for skipping holes, but not for skipping huge pages. I did a
quick experiment to time move_pages on a 3 GHz Xeon and a 4.18 kernel.
Allocate 128 GB and touch every small page. Call move_pages with nodes=NULL
to get the node id for all pages, passing 512 consecutive small pages per
call to move_pages. The total move_pages time is 1.85 secs, about 55 nsec
per page. Extrapolating to a 1 TB range, it would take 15 sec to retrieve
the NUMA node for every small page in the range. That is not terrible, but
it is not interactive, and it becomes terrible for multiple TB.
>> Also, VMAs having THP pages can have a mix of 4k pages and hugepages.
>> The page walk is efficient in scanning and determining if it is a THP
>> hugepage and stepping over it, whereas using the API, the application
>> would not know what page size mapping is used for a given VA and so would
>> have to again scan the VMA in units of the 4k page size.
>
> Why does this matter for something that is for analysis purposes?
> Reading the file for the whole address space is far from a free
> operation. Is the page walk optimization really essential for usability?
> Moreover, what prevents the move_pages implementation from being clever
> about the page walk itself? In other words, why would we want to add a
> new API rather than make the existing one faster for everybody?
One could optimize move_pages. If the caller passes a consecutive range
of small pages, and the page walk sees that a VA is mapped by a huge page,
then it can return the same NUMA node for each of the following VAs that
fall into the huge page range. It would be faster than 55 nsec per small
page, but it is hard to say how much faster, and the cost is still driven
by the number of small pages.
>> If this sounds reasonable, I can add it to the commit / patch description.
>
> This all is absolutely _essential_ for any new API proposed. Remember that
> once we add a new user interface, we have to maintain it forever. We
> used to be too relaxed when adding new proc files in the past and it
> backfired many times already.
An offhand idea -- we could extend /proc/pid/numa_maps in a
backward-compatible way by providing a control interface that is poked via
write() or ioctl(). Provide one control, "do-not-combine". If
do-not-combine has been set, then the read() function returns a separate
line for each range of memory mapped on the same NUMA node, in the
existing format.
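
From user space that might look something like the sketch below (entirely
hypothetical; neither the "do-not-combine" control string nor write()
support on numa_maps exists today):

    /* Hypothetical: ask the existing file for per-range output. */
    int fd = open("/proc/1234/numa_maps", O_RDWR);

    write(fd, "do-not-combine", 14);    /* proposed control knob */
    /* subsequent read()s would then return one line per range of
     * memory mapped on the same NUMA node, in the existing format */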
- Steve
On 9/14/18 9:01 AM, Steven Sistare wrote:
> On 9/14/2018 1:56 AM, Michal Hocko wrote:
>> On Thu 13-09-18 15:32:25, prakash.sangappa wrote:
>>>
>>> The proc interface provides an efficient way to export address range
>>> to NUMA node id mapping information compared to using the API.
>> Do you have any numbers?
>>
>>> For example, for sparsely populated mappings, if a VMA has large portions
>>> not having any physical pages mapped, the page walk done through the /proc
>>> file interface can skip over non-existent PMDs / PTEs, whereas using the
>>> API the application would have to scan the entire VMA in page-size units.
>> What prevents you from pre-filtering by reading /proc/$pid/maps to get
>> ranges of interest?
> That works for skipping holes, but not for skipping huge pages. I did a
> quick experiment to time move_pages on a 3 GHz Xeon and a 4.18 kernel.
> Allocate 128 GB and touch every small page. Call move_pages with nodes=NULL
> to get the node id for all pages, passing 512 consecutive small pages per
> call to move_pages. The total move_pages time is 1.85 secs, about 55 nsec
> per page. Extrapolating to a 1 TB range, it would take 15 sec to retrieve
> the NUMA node for every small page in the range. That is not terrible, but
> it is not interactive, and it becomes terrible for multiple TB.
>
Also, for valid VMAs in the 'maps' file, if the VMA is sparsely populated
with physical pages, the page walk can skip over non-existent page table
entries (PMDs) and so can be faster.

For example, reading the VA range of a 400GB VMA which has a few pages
mapped at the beginning and a few pages at the end, with no pages in the
rest of the VMA, takes 0.001s using the /proc interface. Whereas with the
move_pages() API, passing 1024 consecutive small-page addresses per call,
it takes about 2.4 secs. This is on a similar system running a 4.19 kernel.
On 09/14/2018 11:04 AM, Prakash Sangappa wrote:
> Also, for valid VMAs in the 'maps' file, if the VMA is sparsely populated
> with physical pages, the page walk can skip over non-existent page table
> entries (PMDs) and so can be faster.
Note that this only works for things that were _never_ populated. They
might be sparse after once being populated and then being reclaimed or
discarded. Those will still have all the page tables allocated.
On 9/13/2018 5:25 PM, Dave Hansen wrote:
> On 09/13/2018 05:10 PM, Andrew Morton wrote:
>>> Also, VMAs having THP pages can have a mix of 4k pages and hugepages.
>>> The page walk is efficient in scanning and determining if it is a THP
>>> hugepage and stepping over it, whereas using the API, the application
>>> would not know what page size mapping is used for a given VA and so would
>>> have to again scan the VMA in units of the 4k page size.
>>>
>>> If this sounds reasonable, I can add it to the commit / patch description.
> As we are judging whether this is a "good" interface, can you tell us a
> bit about its scalability? For instance, let's say someone has a 1TB
> VMA that's populated with interleaved 4k pages. How much data comes
> out? How long does it take to parse? Will we effectively deadlock the
> system if someone accidentally cat's the wrong /proc file?
For the worst-case scenario you describe, it would be one line (range) for
each 4k page, which would be similar to what you get with /proc/*/pagemap.
The amount of data copied out at a time is based on the buffer size used
in the kernel, which is 1024 bytes. If one line (one range) printed is
about 40 bytes (chars), that means about 25 lines per copy-out.

The main concern would be holding the mmap_sem lock, which can cause
hangs. When the 1024-byte buffer gets filled, the mmap_sem is dropped and
the buffer content is copied out to the user buffer. Then the mmap_sem
lock is reacquired and the page walk continues, until the specified user
buffer is filled or the end of the process address space is reached.

One potential issue could be a large VA range with all pages populated
from one NUMA node; then the page walk could take longer while holding the
mmap_sem lock. This can be addressed by dropping and reacquiring the
mmap_sem lock after a certain number of pages have been walked (say 512,
which is what happens in the /proc/*/pagemap case).
>
> /proc seems like a really simple way to implement this, but it seems a
> *really* odd choice for something that needs to collect a large amount
> of data. The lseek() stuff is a nice addition, but I wonder if it's
> unwieldy to use in practice. For instance, if you want to read data for
> the VMA at 0x1000000, you lseek(fd, 0x1000000, SEEK_SET), right? You read
> ~20 bytes of data and then the fd is at 0x1000020. But, you're getting
> data out at the next read() for (at least) the next page, which is also
> available at 0x1001000. Seems funky. Do other /proc files behave this way?
>
Yes, SEEK_SET to the VA. The lseek offset is the process VA, so it is not
going to be different from reading a normal text file, except that /proc
files are special. E.g., in the /proc/*/pagemap case, read enforces that
the seek/file offset and the user buffer size passed in be multiples of
the pagemap_entry_t size, or else the read fails. The usage for the
numa_vamaps file will be to SEEK_SET to the VA from which the VA range to
NUMA node information needs to be read.

The 'fd' offset is not taken into consideration here, just the VA. Say
each VA range to NUMA node id line printed is about 40 bytes (chars). Now
if the read only reads 20 bytes, it will have read part of the line. A
subsequent read will read the remaining bytes of the line, which are
stored in the kernel buffer.
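
Concretely, the behavior described above might look like this (a sketch,
assuming a line such as "00400000-00410000 N1\n" at the seek target):

    int fd = open("/proc/1234/numa_vamaps", O_RDONLY);
    char part[32];

    lseek(fd, 0x00400000UL, SEEK_SET);  /* the offset is the process VA */
    read(fd, part, 20);  /* may return only part of a line...          */
    read(fd, part, 20);  /* ...the next read returns the rest of that
                            line, served from the kernel-side buffer   */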
On Fri 14-09-18 12:01:18, Steven Sistare wrote:
> On 9/14/2018 1:56 AM, Michal Hocko wrote:
[...]
> > Why does this matter for something that is for analysis purposes?
> > Reading the file for the whole address space is far from a free
> > operation. Is the page walk optimization really essential for usability?
> > Moreover, what prevents the move_pages implementation from being clever
> > about the page walk itself? In other words, why would we want to add a
> > new API rather than make the existing one faster for everybody?
>
> One could optimize move_pages. If the caller passes a consecutive range
> of small pages, and the page walk sees that a VA is mapped by a huge page,
> then it can return the same NUMA node for each of the following VAs that
> fall into the huge page range. It would be faster than 55 nsec per small
> page, but it is hard to say how much faster, and the cost is still driven
> by the number of small pages.
This is exactly what I was arguing for. There is some room for
improvement in the existing interface. I have yet to hear an explicit
usecase which would require even better performance than can be
achieved with the existing API.
--
Michal Hocko
SUSE Labs
On 9/24/18 10:14 AM, Michal Hocko wrote:
> On Fri 14-09-18 12:01:18, Steven Sistare wrote:
>> On 9/14/2018 1:56 AM, Michal Hocko wrote:
> [...]
>>> [...]
>> One could optimize move_pages. If the caller passes a consecutive range
>> of small pages, and the page walk sees that a VA is mapped by a huge page,
>> then it can return the same NUMA node for each of the following VAs that
>> fall into the huge page range. It would be faster than 55 nsec per small
>> page, but it is hard to say how much faster, and the cost is still driven
>> by the number of small pages.
> This is exactly what I was arguing for. There is some room for
> improvement in the existing interface. I have yet to hear an explicit
> usecase which would require even better performance than can be
> achieved with the existing API.
>
The above-mentioned optimization to the move_pages() API helps when
scanning mapped huge pages, but does not help if there are large sparse
mappings with few pages mapped. Otherwise, consider adding page walk
support to the move_pages() implementation and enhancing the API (a new
flag?) to return address range to NUMA node information. The page walk
optimization would certainly make a difference for usability.

We can have applications (like the Oracle DB) with processes that have
large sparse mappings (in TBs), with only some areas of these mapped
address ranges being accessed; large portions basically do not have page
tables backing them. This can become more prevalent on newer systems with
multiple TBs of memory.
Here is some data from pmap using the move_pages() API with the
optimization. The following table compares the time pmap takes to print
the address mappings of a large process, with NUMA node information,
using the move_pages() API vs. pmap using the /proc numa_vamaps file.

Running the pmap command on a process with 1.3 TB of address space, with
sparse mappings:

                        ~1.3 TB sparse   250G dense segment with hugepages
 move_pages             8.33s            3.14s
 optimized move_pages   6.29s            0.92s
 /proc numa_vamaps      0.08s            0.04s

The second column of times is the pmap time on a 250G address range of
this process, which maps hugepages (THP & hugetlb).
On 11/9/2018 11:48 PM, Prakash Sangappa wrote:
> [...]
> Here is some data from pmap using the move_pages() API with the
> optimization. The following table compares the time pmap takes to print
> the address mappings of a large process, with NUMA node information,
> using the move_pages() API vs. pmap using the /proc numa_vamaps file.
>
> Running the pmap command on a process with 1.3 TB of address space, with
> sparse mappings:
>
>                         ~1.3 TB sparse   250G dense segment with hugepages
>  move_pages             8.33s            3.14s
>  optimized move_pages   6.29s            0.92s
>  /proc numa_vamaps      0.08s            0.04s
>
> The second column of times is the pmap time on a 250G address range of
> this process, which maps hugepages (THP & hugetlb).
The data look compelling to me. numa_vamaps provides a much smoother user
experience for the analyst who is casting a wide net looking for the root
of a performance issue. Almost no waiting to see the data.
- Steve
On Tue 18-12-18 15:46:45, prakash.sangappa wrote:
[...]
> Dave Hansen asked how it would scale, with respect to reading this file
> from a large process. The answer is: the file contents are generated
> using a page table walk and copied to the user buffer. The mmap_sem lock
> is dropped and re-acquired in the process of walking the page table and
> copying the file content. The kernel buffer size used determines how long
> the lock is held, which can be further improved by dropping the lock and
> re-acquiring it after a fixed number (512) of pages are walked.
I guess you are still missing the point here. Have you tried a larger
mapping with an interleaved memory policy? I would bet my hat that you are
going to spend a large part of the time just pushing the output to
userspace... Not to mention the parsing on the consumer side.

Also, you keep failing (IMO) to explain _who_ is going to be the consumer
of the file. What kind of analysis will need such optimized data
collection, and what can you do about it?

This is really _essential_ when adding a new interface to provide data
that is already available by other means. In other words, tell us the
specific usecase that is hitting a bottleneck that cannot be handled by
the existing API, and we can start considering a new one.
--
Michal Hocko
SUSE Labs