For analysis purposes it is useful to have numa node information
corresponding to the mapped address ranges of a process. Currently
/proc/<pid>/numa_maps provides a list of the numa nodes from which pages
are allocated, per VMA of the process. This is not useful if a user needs
to determine which numa node the mapped pages are allocated from for a
particular address range. It would help if the numa node information
presented in /proc/<pid>/numa_maps were broken down by VA ranges, showing
the exact numa node from which the pages have been allocated.

The format of the /proc/<pid>/numa_maps file content depends on the
/proc/<pid>/maps file content, as mentioned in the man page, i.e. one line
entry for every VMA corresponding to the entries in the /proc/<pid>/maps
file. Therefore changing the output of /proc/<pid>/numa_maps may not be
possible.

Hence, this patch proposes adding a file, /proc/<pid>/numa_vamaps, which
provides a proper breakdown of VA ranges by the numa node id from which the
mapped pages are allocated. For address ranges not having any pages mapped,
a '-' is printed instead of a numa node id. In addition, this file includes
most of the other information currently presented in /proc/<pid>/numa_maps.
The additional information is included for convenience. If this is not
preferred, the patch could be modified to provide just the VA range to numa
node information, as the rest of the information is already available
through the /proc/<pid>/numa_maps file.

Since the VA range to numa node information does not include the page's
PFN, reading this file will not be restricted (i.e. will not require
CAP_SYS_ADMIN).

Here is a snippet of the new file content showing the format:
00400000-00401000 N0=1 kernelpagesize_kB=4 mapped=1 file=/tmp/hmap2
00600000-00601000 N0=1 kernelpagesize_kB=4 anon=1 dirty=1 file=/tmp/hmap2
00601000-00602000 N0=1 kernelpagesize_kB=4 anon=1 dirty=1 file=/tmp/hmap2
7f0215600000-7f0215800000 N0=1 kernelpagesize_kB=2048 dirty=1 file=/mnt/f1
7f0215800000-7f0215c00000 - file=/mnt/f1
7f0215c00000-7f0215e00000 N0=1 kernelpagesize_kB=2048 dirty=1 file=/mnt/f1
7f0215e00000-7f0216200000 - file=/mnt/f1
..
7f0217ecb000-7f0217f20000 N0=85 kernelpagesize_kB=4 mapped=85 mapmax=51 file=/usr/lib64/libc-2.17.so
7f0217f20000-7f0217f30000 - file=/usr/lib64/libc-2.17.so
7f0217f30000-7f0217f90000 N0=96 kernelpagesize_kB=4 mapped=96 mapmax=51 file=/usr/lib64/libc-2.17.so
7f0217f90000-7f0217fb0000 - file=/usr/lib64/libc-2.17.so
..
The 'pmap' command can be enhanced to include an option to show numa node
information, which it can read from this new proc file. This will be a
follow-on proposal.
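As an illustration of the kind of consumer this enables, here is a minimal
user-space sketch (not part of the patch; the tool name and arguments are
hypothetical) that parses the format shown above and reports the node and
page count for ranges overlapping a given address window:

#include <stdio.h>
#include <stdlib.h>

/* Usage (hypothetical): ./vamaps <pid> <start-hex> <end-hex> */
int main(int argc, char **argv)
{
	char path[64], line[1024];
	unsigned long lo, hi;
	FILE *fp;

	if (argc != 4)
		return 1;
	lo = strtoul(argv[2], NULL, 16);
	hi = strtoul(argv[3], NULL, 16);
	snprintf(path, sizeof(path), "/proc/%s/numa_vamaps", argv[1]);
	fp = fopen(path, "r");
	if (!fp)
		return 1;
	while (fgets(line, sizeof(line), fp)) {
		unsigned long start, end, pages;
		int nid;

		/* ranges with no pages mapped carry '-' and will not match */
		if (sscanf(line, "%lx-%lx N%d=%lu", &start, &end, &nid, &pages) != 4)
			continue;
		if (end <= lo || start >= hi)
			continue;
		printf("%lx-%lx node %d pages %lu\n", start, end, nid, pages);
	}
	fclose(fp);
	return 0;
}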
There have been a couple of previous patch proposals to provide numa node
information based on the PFN or physical address. They do not seem to have
made progress. Also, it would appear that reading numa node information
based on the PFN or physical address would require privileges
(CAP_SYS_ADMIN), similar to reading PFN info from /proc/<pid>/pagemap.
See
https://marc.info/?t=139630938200001&r=1&w=2
https://marc.info/?t=139718724400001&r=1&w=2
Signed-off-by: Prakash Sangappa <[email protected]>
---
fs/proc/base.c | 2 +
fs/proc/internal.h | 3 +
fs/proc/task_mmu.c | 299 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 303 insertions(+), 1 deletion(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 1b2ede6..8fd7cc5 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2960,6 +2960,7 @@ static const struct pid_entry tgid_base_stuff[] = {
REG("maps", S_IRUGO, proc_pid_maps_operations),
#ifdef CONFIG_NUMA
REG("numa_maps", S_IRUGO, proc_pid_numa_maps_operations),
+ REG("numa_vamaps", S_IRUGO, proc_pid_numa_vamaps_operations),
#endif
REG("mem", S_IRUSR|S_IWUSR, proc_mem_operations),
LNK("cwd", proc_cwd_link),
@@ -3352,6 +3353,7 @@ static const struct pid_entry tid_base_stuff[] = {
#endif
#ifdef CONFIG_NUMA
REG("numa_maps", S_IRUGO, proc_tid_numa_maps_operations),
+ REG("numa_vamaps", S_IRUGO, proc_tid_numa_vamaps_operations),
#endif
REG("mem", S_IRUSR|S_IWUSR, proc_mem_operations),
LNK("cwd", proc_cwd_link),
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 0f1692e..9a3ff80 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -273,6 +273,7 @@ struct proc_maps_private {
#ifdef CONFIG_NUMA
struct mempolicy *task_mempolicy;
#endif
+ u64 vma_off;
} __randomize_layout;
struct mm_struct *proc_mem_open(struct inode *inode, unsigned int mode);
@@ -280,7 +281,9 @@ struct mm_struct *proc_mem_open(struct inode *inode, unsigned int mode);
extern const struct file_operations proc_pid_maps_operations;
extern const struct file_operations proc_tid_maps_operations;
extern const struct file_operations proc_pid_numa_maps_operations;
+extern const struct file_operations proc_pid_numa_vamaps_operations;
extern const struct file_operations proc_tid_numa_maps_operations;
+extern const struct file_operations proc_tid_numa_vamaps_operations;
extern const struct file_operations proc_pid_smaps_operations;
extern const struct file_operations proc_pid_smaps_rollup_operations;
extern const struct file_operations proc_tid_smaps_operations;
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index c486ad4..e3c7d65 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -169,6 +169,13 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
hold_task_mempolicy(priv);
priv->tail_vma = get_gate_vma(mm);
+ if (priv->vma_off) {
+ vma = find_vma(mm, priv->vma_off);
+ if (vma)
+ return vma;
+ }
+
+
if (last_addr) {
vma = find_vma(mm, last_addr - 1);
if (vma && vma->vm_start <= last_addr)
@@ -197,7 +204,18 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
static void *m_next(struct seq_file *m, void *v, loff_t *pos)
{
struct proc_maps_private *priv = m->private;
- struct vm_area_struct *next;
+ struct vm_area_struct *next, *vma = v;
+
+ if (priv->vma_off) {
+ if (vma && vma->vm_start <= priv->vma_off &&
+ priv->vma_off < vma->vm_end)
+ next = vma;
+ else
+ next = find_vma(priv->mm, priv->vma_off);
+
+ if (next)
+ return next;
+ }
(*pos)++;
next = m_next_vma(priv, v);
@@ -1568,9 +1586,14 @@ struct numa_maps {
unsigned long mapcount_max;
unsigned long dirty;
unsigned long swapcache;
+ unsigned long nextaddr;
+ long nid;
unsigned long node[MAX_NUMNODES];
};
+#define NUMA_VAMAPS_NID_NOPAGES (-1)
+#define NUMA_VAMAPS_NID_NONE (-2)
+
struct numa_maps_private {
struct proc_maps_private proc_maps;
struct numa_maps md;
@@ -1804,6 +1827,232 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid)
return 0;
}
+/* match node id of page to previous node id, return 0 on match */
+static int vamap_match_nid(struct numa_maps *md, unsigned long addr,
+ struct page *page)
+{
+ if (page) {
+ if (md->nid == NUMA_VAMAPS_NID_NONE ||
+ page_to_nid(page) == md->nid) {
+ if (md->nid == NUMA_VAMAPS_NID_NONE)
+ md->nid = page_to_nid(page);
+ return 0;
+ }
+ } else {
+ if (md->nid == NUMA_VAMAPS_NID_NONE ||
+ md->nid == NUMA_VAMAPS_NID_NOPAGES ) {
+ if (md->nid == NUMA_VAMAPS_NID_NONE)
+ md->nid = NUMA_VAMAPS_NID_NOPAGES;
+ return 0;
+ }
+ }
+ /* Did not match */
+ md->nextaddr = addr;
+ return 1;
+}
+
+static int gather_pte_stats_vamap(pmd_t *pmd, unsigned long addr,
+ unsigned long end, struct mm_walk *walk)
+{
+ struct numa_maps *md = walk->private;
+ struct vm_area_struct *vma = walk->vma;
+ spinlock_t *ptl;
+ pte_t *orig_pte;
+ pte_t *pte;
+ int ret = 0;
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ ptl = pmd_trans_huge_lock(pmd, vma);
+ if (ptl) {
+ struct page *page;
+
+ page = can_gather_numa_stats_pmd(*pmd, vma, addr);
+ ret = vamap_match_nid(md, addr, page);
+ if (page && !ret)
+ gather_stats(page, md, pmd_dirty(*pmd),
+ HPAGE_PMD_SIZE/PAGE_SIZE);
+ spin_unlock(ptl);
+ return ret;
+ }
+
+ if (pmd_trans_unstable(pmd))
+ return 0;
+#endif
+ orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+ do {
+ struct page *page = can_gather_numa_stats(*pte, vma, addr);
+ ret = vamap_match_nid(md, addr, page);
+ if (ret)
+ break;
+ if (page)
+ gather_stats(page, md, pte_dirty(*pte), 1);
+
+ } while (pte++, addr += PAGE_SIZE, addr != end);
+ pte_unmap_unlock(orig_pte, ptl);
+ cond_resched();
+ return ret;
+}
+#ifdef CONFIG_HUGETLB_PAGE
+static int gather_hugetlb_stats_vamap(pte_t *pte, unsigned long hmask,
+ unsigned long addr, unsigned long end, struct mm_walk *walk)
+{
+ pte_t huge_pte = huge_ptep_get(pte);
+ struct numa_maps *md;
+ struct page *page;
+
+ md = walk->private;
+ if (!pte_present(huge_pte))
+ return (vamap_match_nid(md, addr, NULL));
+
+ page = pte_page(huge_pte);
+ if (!page)
+ return (vamap_match_nid(md, addr, page));
+
+ if (vamap_match_nid(md, addr, page))
+ return 1;
+ gather_stats(page, md, pte_dirty(huge_pte), 1);
+ return 0;
+}
+
+#else
+static int gather_hugetlb_stats_vamap(pte_t *pte, unsigned long hmask,
+ unsigned long addr, unsigned long end, struct mm_walk *walk)
+{
+ return 0;
+}
+#endif
+
+
+static int gather_hole_info_vamap(unsigned long start, unsigned long end,
+ struct mm_walk *walk)
+{
+ struct numa_maps *md = walk->private;
+ struct vm_area_struct *vma = walk->vma;
+
+ /*
+ * check if we are still tracking a hole or end the walk.
+ */
+ if ((md->nid != NUMA_VAMAPS_NID_NOPAGES &&
+ md->nid != NUMA_VAMAPS_NID_NONE) ||
+ vma != find_vma(walk->mm, start)) {
+ md->nextaddr = start;
+ return 1;
+ }
+
+ if (md->nid == NUMA_VAMAPS_NID_NONE)
+ md->nid = NUMA_VAMAPS_NID_NOPAGES;
+
+ return 0;
+}
+
+/*
+ * Display pages allocated per node via /proc.
+ */
+static int show_numa_vamap(struct seq_file *m, void *v, int is_pid)
+{
+ struct numa_maps_private *numa_priv = m->private;
+ struct proc_maps_private *proc_priv = &numa_priv->proc_maps;
+ struct vm_area_struct *vma = v;
+ struct numa_maps *md = &numa_priv->md;
+ struct file *file = vma->vm_file;
+ struct mm_struct *mm = vma->vm_mm;
+ struct mm_walk walk = {
+ .hugetlb_entry = gather_hugetlb_stats_vamap,
+ .pmd_entry = gather_pte_stats_vamap,
+ .pte_hole = gather_hole_info_vamap,
+ .private = md,
+ .mm = mm,
+ };
+ unsigned long start_va, next_va;
+
+ if (!mm)
+ return 0;
+
+ start_va = proc_priv->vma_off;
+ if (!start_va)
+ start_va = vma->vm_start;
+
+ if (start_va < vma->vm_end) {
+
+ /* Ensure we start with an empty numa_maps statistics */
+ memset(md, 0, sizeof(*md));
+ md->nid = NUMA_VAMAPS_NID_NONE; /* invalid nodeid at start */
+ md->nextaddr = 0;
+
+ /* mmap_sem is held by m_start() */
+ if (walk_page_range(start_va, vma->vm_end, &walk) < 0)
+ goto out;
+
+ /*
+ * If we reached the end of this vma.
+ */
+ if (md->nextaddr == 0)
+ md->nextaddr = vma->vm_end;
+
+ next_va = md->nextaddr;
+ seq_printf(m, "%08lx-%08lx", start_va, next_va);
+ start_va = next_va;
+
+ if (md->nid != NUMA_VAMAPS_NID_NONE &&
+ md->nid != NUMA_VAMAPS_NID_NOPAGES && md->node[md->nid]) {
+ seq_printf(m, " N%ld=%lu", md->nid, md->node[md->nid]);
+
+ seq_printf(m, " kernelpagesize_kB=%lu",
+ vma_kernel_pagesize(vma) >> 10);
+ } else {
+ seq_printf(m, " - ");
+ }
+
+ if (md->anon)
+ seq_printf(m, " anon=%lu", md->anon);
+
+ if (md->dirty)
+ seq_printf(m, " dirty=%lu", md->dirty);
+
+ if (md->pages != md->anon && md->pages != md->dirty)
+ seq_printf(m, " mapped=%lu", md->pages);
+
+ if (md->mapcount_max > 1)
+ seq_printf(m, " mapmax=%lu", md->mapcount_max);
+
+ if (md->swapcache)
+ seq_printf(m, " swapcache=%lu", md->swapcache);
+
+ if (md->active < md->pages && !is_vm_hugetlb_page(vma))
+ seq_printf(m, " active=%lu", md->active);
+
+ if (md->writeback)
+ seq_printf(m, " writeback=%lu", md->writeback);
+
+ if (file) {
+ seq_puts(m, " file=");
+ seq_file_path(m, file, "\n\t= ");
+ } else if (vma->vm_start <= mm->brk &&
+ vma->vm_end >= mm->start_brk) {
+ seq_puts(m, " heap");
+ } else if (is_stack(vma)) {
+ seq_puts(m, " stack");
+
+ }
+
+ seq_putc(m, '\n');
+ }
+
+ /*
+ * If buffer has not overflowed update vma_off, otherwise preserve
+	 * previous offset as it will be retried.
+ */
+ if (!seq_has_overflowed(m)) {
+ if (md->nextaddr < vma->vm_end)
+ proc_priv->vma_off = md->nextaddr;
+ else
+ proc_priv->vma_off = 0;
+ }
+out:
+ m_cache_vma(m, vma);
+ return 0;
+}
+
static int show_pid_numa_map(struct seq_file *m, void *v)
{
return show_numa_map(m, v, 1);
@@ -1814,6 +2063,16 @@ static int show_tid_numa_map(struct seq_file *m, void *v)
return show_numa_map(m, v, 0);
}
+static int show_pid_numa_vamap(struct seq_file *m, void *v)
+{
+ return show_numa_vamap(m, v, 1);
+}
+
+static int show_tid_numa_vamap(struct seq_file *m, void *v)
+{
+ return show_numa_vamap(m, v, 0);
+}
+
static const struct seq_operations proc_pid_numa_maps_op = {
.start = m_start,
.next = m_next,
@@ -1828,6 +2087,20 @@ static const struct seq_operations proc_tid_numa_maps_op = {
.show = show_tid_numa_map,
};
+static const struct seq_operations proc_pid_numa_vamaps_op = {
+ .start = m_start,
+ .next = m_next,
+ .stop = m_stop,
+ .show = show_pid_numa_vamap,
+};
+
+static const struct seq_operations proc_tid_numa_vamaps_op = {
+ .start = m_start,
+ .next = m_next,
+ .stop = m_stop,
+ .show = show_tid_numa_vamap,
+};
+
static int numa_maps_open(struct inode *inode, struct file *file,
const struct seq_operations *ops)
{
@@ -1845,6 +2118,16 @@ static int tid_numa_maps_open(struct inode *inode, struct file *file)
return numa_maps_open(inode, file, &proc_tid_numa_maps_op);
}
+static int pid_numa_vamaps_open(struct inode *inode, struct file *file)
+{
+ return numa_maps_open(inode, file, &proc_pid_numa_vamaps_op);
+}
+
+static int tid_numa_vamaps_open(struct inode *inode, struct file *file)
+{
+ return numa_maps_open(inode, file, &proc_tid_numa_vamaps_op);
+}
+
const struct file_operations proc_pid_numa_maps_operations = {
.open = pid_numa_maps_open,
.read = seq_read,
@@ -1858,4 +2141,18 @@ const struct file_operations proc_tid_numa_maps_operations = {
.llseek = seq_lseek,
.release = proc_map_release,
};
+
+const struct file_operations proc_pid_numa_vamaps_operations = {
+ .open = pid_numa_vamaps_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = proc_map_release,
+};
+
+const struct file_operations proc_tid_numa_vamaps_operations = {
+ .open = tid_numa_vamaps_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = proc_map_release,
+};
#endif /* CONFIG_NUMA */
--
2.7.4
On Tue, 1 May 2018 22:58:06 -0700 Prakash Sangappa <[email protected]> wrote:
> For analysis purpose it is useful to have numa node information
> corresponding mapped address ranges of the process. Currently
> /proc/<pid>/numa_maps provides list of numa nodes from where pages are
> allocated per VMA of the process. This is not useful if an user needs to
> determine which numa node the mapped pages are allocated from for a
> particular address range. It would have helped if the numa node information
> presented in /proc/<pid>/numa_maps was broken down by VA ranges showing the
> exact numa node from where the pages have been allocated.
>
> The format of /proc/<pid>/numa_maps file content is dependent on
> /proc/<pid>/maps file content as mentioned in the manpage. i.e one line
> entry for every VMA corresponding to entries in /proc/<pids>/maps file.
> Therefore changing the output of /proc/<pid>/numa_maps may not be possible.
>
> Hence, this patch proposes adding file /proc/<pid>/numa_vamaps which will
> provide proper break down of VA ranges by numa node id from where the mapped
> pages are allocated. For Address ranges not having any pages mapped, a '-'
> is printed instead of numa node id. In addition, this file will include most
> of the other information currently presented in /proc/<pid>/numa_maps. The
> additional information included is for convenience. If this is not
> preferred, the patch could be modified to just provide VA range to numa node
> information as the rest of the information is already available thru
> /proc/<pid>/numa_maps file.
>
> Since the VA range to numa node information does not include page's PFN,
> reading this file will not be restricted(i.e requiring CAP_SYS_ADMIN).
>
> Here is the snippet from the new file content showing the format.
>
> 00400000-00401000 N0=1 kernelpagesize_kB=4 mapped=1 file=/tmp/hmap2
> 00600000-00601000 N0=1 kernelpagesize_kB=4 anon=1 dirty=1 file=/tmp/hmap2
> 00601000-00602000 N0=1 kernelpagesize_kB=4 anon=1 dirty=1 file=/tmp/hmap2
> 7f0215600000-7f0215800000 N0=1 kernelpagesize_kB=2048 dirty=1 file=/mnt/f1
> 7f0215800000-7f0215c00000 - file=/mnt/f1
> 7f0215c00000-7f0215e00000 N0=1 kernelpagesize_kB=2048 dirty=1 file=/mnt/f1
> 7f0215e00000-7f0216200000 - file=/mnt/f1
> ..
> 7f0217ecb000-7f0217f20000 N0=85 kernelpagesize_kB=4 mapped=85 mapmax=51
> file=/usr/lib64/libc-2.17.so
> 7f0217f20000-7f0217f30000 - file=/usr/lib64/libc-2.17.so
> 7f0217f30000-7f0217f90000 N0=96 kernelpagesize_kB=4 mapped=96 mapmax=51
> file=/usr/lib64/libc-2.17.so
> 7f0217f90000-7f0217fb0000 - file=/usr/lib64/libc-2.17.so
> ..
>
> The 'pmap' command can be enhanced to include an option to show numa node
> information which it can read from this new proc file. This will be a
> follow on proposal.
I'd like to hear rather more about the use-cases for this new
interface. Why do people need it, what is the end-user benefit, etc?
> There have been couple of previous patch proposals to provide numa node
> information based on pfn or physical address. They seem to have not made
> progress. Also it would appear reading numa node information based on PFN
> or physical address will require privileges(CAP_SYS_ADMIN) similar to
> reading PFN info from /proc/<pid>/pagemap.
>
> See
> https://marc.info/?t=139630938200001&r=1&w=2
>
> https://marc.info/?t=139718724400001&r=1&w=2
OK, let's hope that these people will be able to provide their review,
feedback, testing, etc. You missed a couple (Dave, Naoya).
> fs/proc/base.c | 2 +
> fs/proc/internal.h | 3 +
> fs/proc/task_mmu.c | 299 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
Some Documentation/ updates seem appropriate. I suggest you grep the
directory for "numa_maps" to find suitable locations.
And a quick build check shows that `size fs/proc/task_mmu.o' gets quite
a bit larger when CONFIG_SMP=n and CONFIG_NUMA=n. That seems wrong -
please see if you can eliminate the bloat from systems which don't need
this feature.
On 05/02/2018 02:33 PM, Andrew Morton wrote:
> On Tue, 1 May 2018 22:58:06 -0700 Prakash Sangappa <[email protected]> wrote:
>> For analysis purpose it is useful to have numa node information
>> corresponding mapped address ranges of the process. Currently
>> /proc/<pid>/numa_maps provides list of numa nodes from where pages are
>> allocated per VMA of the process. This is not useful if an user needs to
>> determine which numa node the mapped pages are allocated from for a
>> particular address range. It would have helped if the numa node information
>> presented in /proc/<pid>/numa_maps was broken down by VA ranges showing the
>> exact numa node from where the pages have been allocated.
I'm finding myself a little lost in figuring out what this does. Today,
numa_maps might tell us that a 3-page VMA has 1 page from Node 0 and 2 pages
from Node 1. We group *entirely* by VMA:
1000-4000 N0=1 N1=2
We don't want that. We want to tell exactly where each node's memory is,
regardless of whether the pages are in the same VMA, like this:
1000-2000 N1=1
2000-3000 N0=1
3000-4000 N1=1
So that no line of output ever has more than one node's memory. It
*appears* in this new file as if each contiguous range of memory from a
given node has its own VMA. Right?
This sounds interesting, but I've never found myself wanting this
information a single time that I can recall. I'd love to hear more.
Is this for debugging? Are apps actually going to *parse* this file?
How hard did you try to share code with numa_maps? Are you sure we
can't just replace numa_maps? VMAs are a kernel-internal thing and we
never promised to represent them 1:1 in our ABI.
Are we going to continue creating new files in /proc every time a tiny
new niche pops up? :)
On 05/02/2018 03:28 PM, Dave Hansen wrote:
> On 05/02/2018 02:33 PM, Andrew Morton wrote:
>> On Tue, 1 May 2018 22:58:06 -0700 Prakash Sangappa <[email protected]> wrote:
>>> For analysis purpose it is useful to have numa node information
>>> corresponding mapped address ranges of the process. Currently
>>> /proc/<pid>/numa_maps provides list of numa nodes from where pages are
>>> allocated per VMA of the process. This is not useful if an user needs to
>>> determine which numa node the mapped pages are allocated from for a
>>> particular address range. It would have helped if the numa node information
>>> presented in /proc/<pid>/numa_maps was broken down by VA ranges showing the
>>> exact numa node from where the pages have been allocated.
> I'm finding myself a little lost in figuring out what this does. Today,
> numa_maps might us that a 3-page VMA has 1 page from Node 0 and 2 pages
> from Node 1. We group *entirely* by VMA:
>
> 1000-4000 N0=1 N1=2
Yes
>
> We don't want that. We want to tell exactly where each node's memory is
> despite if they are in the same VMA, like this:
>
> 1000-2000 N1=1
> 2000-3000 N0=1
> 3000-4000 N1=1
>
> So that no line of output ever has more than one node's memory. It
Yes, that is exactly what this patch will provide. It may not have
been clear from the sample output I had included.
Here is another snippet from a process.
..
006dc000-006dd000 N1=1 kernelpagesize_kB=4 anon=1 dirty=1 file=/usr/bin/bash
006dd000-006de000 N0=1 kernelpagesize_kB=4 anon=1 dirty=1 file=/usr/bin/bash
006de000-006e0000 N1=2 kernelpagesize_kB=4 anon=2 dirty=2 file=/usr/bin/bash
006e0000-006e6000 N0=6 kernelpagesize_kB=4 anon=6 dirty=6 file=/usr/bin/bash
006e6000-006eb000 N0=5 kernelpagesize_kB=4 anon=5 dirty=5
006eb000-006ec000 N1=1 kernelpagesize_kB=4 anon=1 dirty=1
007f9000-007fa000 N1=1 kernelpagesize_kB=4 anon=1 dirty=1 heap
007fa000-00965000 N0=363 kernelpagesize_kB=4 anon=363 dirty=363 heap
00965000-0096c000 - heap
0096c000-0096d000 N0=1 kernelpagesize_kB=4 anon=1 dirty=1 heap
0096d000-00984000 - heap
..
> *appears* in this new file as if each contiguous range of memory from a
> given node has its own VMA. Right?
No. It just breaks down each VMA of the process into address ranges
whose pages come from a single numa node, i.e. each line will
indicate memory from one numa node only.
>
> This sounds interesting, but I've never found myself wanting this
> information a single time that I can recall. I'd love to hear more.
>
> Is this for debugging? Are apps actually going to *parse* this file?
Yes, mainly for debugging/performance analysis. Users doing analysis can look
at this file. The Oracle Database team will be using this information.
>
> How hard did you try to share code with numa_maps? Are you sure we
> can't just replace numa_maps? VMAs are a kernel-internal thing and we
> never promised to represent them 1:1 in our ABI.
I was inclined to just modify numa_maps. However, the man page
documents the numa_maps format as correlating with the 'maps' file.
I was wondering if apps/scripts would break if we changed the output
of 'numa_maps', so I decided to add a new file instead.
I could try to share the code with numa_maps.
>
> Are we going to continue creating new files in /proc every time a tiny
> new niche pops up? :)
Wish we could just enhance the existing files.
On 05/02/2018 02:33 PM, Andrew Morton wrote:
> On Tue, 1 May 2018 22:58:06 -0700 Prakash Sangappa <[email protected]> wrote:
>
>> For analysis purpose it is useful to have numa node information
>> corresponding mapped address ranges of the process. Currently
>> /proc/<pid>/numa_maps provides list of numa nodes from where pages are
>> allocated per VMA of the process. This is not useful if an user needs to
>> determine which numa node the mapped pages are allocated from for a
>> particular address range. It would have helped if the numa node information
>> presented in /proc/<pid>/numa_maps was broken down by VA ranges showing the
>> exact numa node from where the pages have been allocated.
>>
>> The format of /proc/<pid>/numa_maps file content is dependent on
>> /proc/<pid>/maps file content as mentioned in the manpage. i.e one line
>> entry for every VMA corresponding to entries in /proc/<pids>/maps file.
>> Therefore changing the output of /proc/<pid>/numa_maps may not be possible.
>>
>> Hence, this patch proposes adding file /proc/<pid>/numa_vamaps which will
>> provide proper break down of VA ranges by numa node id from where the mapped
>> pages are allocated. For Address ranges not having any pages mapped, a '-'
>> is printed instead of numa node id. In addition, this file will include most
>> of the other information currently presented in /proc/<pid>/numa_maps. The
>> additional information included is for convenience. If this is not
>> preferred, the patch could be modified to just provide VA range to numa node
>> information as the rest of the information is already available thru
>> /proc/<pid>/numa_maps file.
>>
>> Since the VA range to numa node information does not include page's PFN,
>> reading this file will not be restricted(i.e requiring CAP_SYS_ADMIN).
>>
>> Here is the snippet from the new file content showing the format.
>>
>> 00400000-00401000 N0=1 kernelpagesize_kB=4 mapped=1 file=/tmp/hmap2
>> 00600000-00601000 N0=1 kernelpagesize_kB=4 anon=1 dirty=1 file=/tmp/hmap2
>> 00601000-00602000 N0=1 kernelpagesize_kB=4 anon=1 dirty=1 file=/tmp/hmap2
>> 7f0215600000-7f0215800000 N0=1 kernelpagesize_kB=2048 dirty=1 file=/mnt/f1
>> 7f0215800000-7f0215c00000 - file=/mnt/f1
>> 7f0215c00000-7f0215e00000 N0=1 kernelpagesize_kB=2048 dirty=1 file=/mnt/f1
>> 7f0215e00000-7f0216200000 - file=/mnt/f1
>> ..
>> 7f0217ecb000-7f0217f20000 N0=85 kernelpagesize_kB=4 mapped=85 mapmax=51
>> file=/usr/lib64/libc-2.17.so
>> 7f0217f20000-7f0217f30000 - file=/usr/lib64/libc-2.17.so
>> 7f0217f30000-7f0217f90000 N0=96 kernelpagesize_kB=4 mapped=96 mapmax=51
>> file=/usr/lib64/libc-2.17.so
>> 7f0217f90000-7f0217fb0000 - file=/usr/lib64/libc-2.17.so
>> ..
>>
>> The 'pmap' command can be enhanced to include an option to show numa node
>> information which it can read from this new proc file. This will be a
>> follow on proposal.
> I'd like to hear rather more about the use-cases for this new
> interface. Why do people need it, what is the end-user benefit, etc?
This is mainly for debugging / performance analysis. The Oracle Database
team is looking to use this information.
>> There have been couple of previous patch proposals to provide numa node
>> information based on pfn or physical address. They seem to have not made
>> progress. Also it would appear reading numa node information based on PFN
>> or physical address will require privileges(CAP_SYS_ADMIN) similar to
>> reading PFN info from /proc/<pid>/pagemap.
>>
>> See
>> https://marc.info/?t=139630938200001&r=1&w=2
>>
>> https://marc.info/?t=139718724400001&r=1&w=2
> OK, let's hope that these people will be able to provide their review,
> feedback, testing, etc. You missed a couple (Dave, Naoya).
>
>> fs/proc/base.c | 2 +
>> fs/proc/internal.h | 3 +
>> fs/proc/task_mmu.c | 299 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
> Some Documentation/ updates seem appropriate. I suggest you grep the
> directory for "numa_maps" to find suitable locations.
Sure, I can update Documentation/filesystems/proc.txt file which is
where 'numa_maps' is documented.
>
> And a quick build check shows that `size fs/proc/task_mmu.o' gets quite
> a bit larger when CONFIG_SMP=n and CONFIG_NUMA=n. That seems wrong -
> please see if you can eliminate the bloat from systems which don't need
> this feature.
>
>
Ok will take a look.
On 05/03/2018 03:58 AM, Dave Hansen wrote:
> On 05/02/2018 02:33 PM, Andrew Morton wrote:
>> On Tue, 1 May 2018 22:58:06 -0700 Prakash Sangappa <[email protected]> wrote:
>>> For analysis purpose it is useful to have numa node information
>>> corresponding mapped address ranges of the process. Currently
>>> /proc/<pid>/numa_maps provides list of numa nodes from where pages are
>>> allocated per VMA of the process. This is not useful if an user needs to
>>> determine which numa node the mapped pages are allocated from for a
>>> particular address range. It would have helped if the numa node information
>>> presented in /proc/<pid>/numa_maps was broken down by VA ranges showing the
>>> exact numa node from where the pages have been allocated.
>
> I'm finding myself a little lost in figuring out what this does. Today,
> numa_maps might us that a 3-page VMA has 1 page from Node 0 and 2 pages
> from Node 1. We group *entirely* by VMA:
>
> 1000-4000 N0=1 N1=2
>
> We don't want that. We want to tell exactly where each node's memory is
> despite if they are in the same VMA, like this:
>
> 1000-2000 N1=1
> 2000-3000 N0=1
> 3000-4000 N1=1
I am wondering, on a big-memory system, how many lines of output
we might have for a large VMA (consuming, let's say, 80% of system RAM)
with an interleave policy. Isn't that a problem?
On Tue, 1 May 2018, Prakash Sangappa wrote:
> For analysis purpose it is useful to have numa node information
> corresponding mapped address ranges of the process. Currently
> /proc/<pid>/numa_maps provides list of numa nodes from where pages are
> allocated per VMA of the process. This is not useful if an user needs to
> determine which numa node the mapped pages are allocated from for a
> particular address range. It would have helped if the numa node information
> presented in /proc/<pid>/numa_maps was broken down by VA ranges showing the
> exact numa node from where the pages have been allocated.
Can't you write a small script that scans the information in numa_maps and
then displays the total pages per NUMA node, and then a list of which
ranges have how many pages on a particular node?
> reading this file will not be restricted(i.e requiring CAP_SYS_ADMIN).
So a prime motivator here is security restricted access to numa_maps?
On 05/03/2018 03:27 PM, prakash.sangappa wrote:
>>
> If each consecutive page comes from different node, yes in
> the extreme case is this file will have a lot of lines. All the lines
> are generated at the time file is read. The amount of data read will be
> limited to the user read buffer size used in the read.
>
> /proc/<pid>/pagemap also has kind of similar issue. There is 1 64
> bit value for each user page.
But nobody reads it sequentially. Everybody lseek()s because it has a
fixed block size. You can't do that in text.
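For illustration, a minimal sketch of that fixed-record pagemap access
pattern (not part of any patch here; the target address is arbitrary). Each
page has one 64-bit entry, so a reader can pread() the entry for any virtual
address directly:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	unsigned long vaddr = 0x00400000;	/* arbitrary address of interest */
	long pagesize = sysconf(_SC_PAGESIZE);
	uint64_t entry;
	int fd = open("/proc/self/pagemap", O_RDONLY);

	if (fd < 0)
		return 1;
	/* entry for page N lives at byte offset N * 8 */
	if (pread(fd, &entry, sizeof(entry), (off_t)(vaddr / pagesize) * 8) != sizeof(entry))
		return 1;
	/* bit 63: page present, bit 62: swapped (PFN bits need CAP_SYS_ADMIN) */
	printf("present=%d swapped=%d\n",
	       (int)((entry >> 63) & 1), (int)((entry >> 62) & 1));
	close(fd);
	return 0;
}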
On 05/03/2018 01:46 AM, Anshuman Khandual wrote:
> On 05/03/2018 03:58 AM, Dave Hansen wrote:
>> On 05/02/2018 02:33 PM, Andrew Morton wrote:
>>> On Tue, 1 May 2018 22:58:06 -0700 Prakash Sangappa <[email protected]> wrote:
>>>> For analysis purpose it is useful to have numa node information
>>>> corresponding mapped address ranges of the process. Currently
>>>> /proc/<pid>/numa_maps provides list of numa nodes from where pages are
>>>> allocated per VMA of the process. This is not useful if an user needs to
>>>> determine which numa node the mapped pages are allocated from for a
>>>> particular address range. It would have helped if the numa node information
>>>> presented in /proc/<pid>/numa_maps was broken down by VA ranges showing the
>>>> exact numa node from where the pages have been allocated.
>> I'm finding myself a little lost in figuring out what this does. Today,
>> numa_maps might us that a 3-page VMA has 1 page from Node 0 and 2 pages
>> from Node 1. We group *entirely* by VMA:
>>
>> 1000-4000 N0=1 N1=2
>>
>> We don't want that. We want to tell exactly where each node's memory is
>> despite if they are in the same VMA, like this:
>>
>> 1000-2000 N1=1
>> 2000-3000 N0=1
>> 3000-4000 N1=1
> I am kind of wondering on a big memory system how many lines of output
> we might have for a large (consuming lets say 80 % of system RAM) VMA
> in interleave policy. Is not that a problem ?
>
If each consecutive page comes from a different node, then yes, in
the extreme case this file will have a lot of lines. All the lines
are generated at the time the file is read. The amount of data read will be
limited to the user read buffer size used in the read.
/proc/<pid>/pagemap also has a similar issue; there is one 64-bit
value for each user page.
On 05/03/2018 01:57 AM, Michal Hocko wrote:
> On Wed 02-05-18 16:43:58, prakash.sangappa wrote:
>>
>> On 05/02/2018 02:33 PM, Andrew Morton wrote:
>>> On Tue, 1 May 2018 22:58:06 -0700 Prakash Sangappa <[email protected]> wrote:
>>>
>>>> For analysis purpose it is useful to have numa node information
>>>> corresponding mapped address ranges of the process. Currently
>>>> /proc/<pid>/numa_maps provides list of numa nodes from where pages are
>>>> allocated per VMA of the process. This is not useful if an user needs to
>>>> determine which numa node the mapped pages are allocated from for a
>>>> particular address range. It would have helped if the numa node information
>>>> presented in /proc/<pid>/numa_maps was broken down by VA ranges showing the
>>>> exact numa node from where the pages have been allocated.
>>>>
>>>> The format of /proc/<pid>/numa_maps file content is dependent on
>>>> /proc/<pid>/maps file content as mentioned in the manpage. i.e one line
>>>> entry for every VMA corresponding to entries in /proc/<pids>/maps file.
>>>> Therefore changing the output of /proc/<pid>/numa_maps may not be possible.
>>>>
>>>> Hence, this patch proposes adding file /proc/<pid>/numa_vamaps which will
>>>> provide proper break down of VA ranges by numa node id from where the mapped
>>>> pages are allocated. For Address ranges not having any pages mapped, a '-'
>>>> is printed instead of numa node id. In addition, this file will include most
>>>> of the other information currently presented in /proc/<pid>/numa_maps. The
>>>> additional information included is for convenience. If this is not
>>>> preferred, the patch could be modified to just provide VA range to numa node
>>>> information as the rest of the information is already available thru
>>>> /proc/<pid>/numa_maps file.
>>>>
>>>> Since the VA range to numa node information does not include page's PFN,
>>>> reading this file will not be restricted(i.e requiring CAP_SYS_ADMIN).
>>>>
>>>> Here is the snippet from the new file content showing the format.
>>>>
>>>> 00400000-00401000 N0=1 kernelpagesize_kB=4 mapped=1 file=/tmp/hmap2
>>>> 00600000-00601000 N0=1 kernelpagesize_kB=4 anon=1 dirty=1 file=/tmp/hmap2
>>>> 00601000-00602000 N0=1 kernelpagesize_kB=4 anon=1 dirty=1 file=/tmp/hmap2
>>>> 7f0215600000-7f0215800000 N0=1 kernelpagesize_kB=2048 dirty=1 file=/mnt/f1
>>>> 7f0215800000-7f0215c00000 - file=/mnt/f1
>>>> 7f0215c00000-7f0215e00000 N0=1 kernelpagesize_kB=2048 dirty=1 file=/mnt/f1
>>>> 7f0215e00000-7f0216200000 - file=/mnt/f1
>>>> ..
>>>> 7f0217ecb000-7f0217f20000 N0=85 kernelpagesize_kB=4 mapped=85 mapmax=51
>>>> file=/usr/lib64/libc-2.17.so
>>>> 7f0217f20000-7f0217f30000 - file=/usr/lib64/libc-2.17.so
>>>> 7f0217f30000-7f0217f90000 N0=96 kernelpagesize_kB=4 mapped=96 mapmax=51
>>>> file=/usr/lib64/libc-2.17.so
>>>> 7f0217f90000-7f0217fb0000 - file=/usr/lib64/libc-2.17.so
>>>> ..
>>>>
>>>> The 'pmap' command can be enhanced to include an option to show numa node
>>>> information which it can read from this new proc file. This will be a
>>>> follow on proposal.
>>> I'd like to hear rather more about the use-cases for this new
>>> interface. Why do people need it, what is the end-user benefit, etc?
>> This is mainly for debugging / performance analysis. Oracle Database
>> team is looking to use this information.
> But we do have an interface to query (e.g. move_pages) that your
> application can use. I am really worried that the broken out per node
> data can be really large (just take a large vma with interleaved policy
> as an example). So is this really worth adding as a general purpose proc
> interface?
I guess move_pages could be useful. There would need to be a tool or
command which reads the numa node information using move_pages
in order to observe another process.
From an observability point of view, one of the uses of the proposed
new file 'numa_vamaps' was to modify the 'pmap' command to display numa
node information broken down by address ranges. Would having pmap
show numa node information be useful?
On 05/03/2018 11:03 AM, Christopher Lameter wrote:
> On Tue, 1 May 2018, Prakash Sangappa wrote:
>
>> For analysis purpose it is useful to have numa node information
>> corresponding mapped address ranges of the process. Currently
>> /proc/<pid>/numa_maps provides list of numa nodes from where pages are
>> allocated per VMA of the process. This is not useful if an user needs to
>> determine which numa node the mapped pages are allocated from for a
>> particular address range. It would have helped if the numa node information
>> presented in /proc/<pid>/numa_maps was broken down by VA ranges showing the
>> exact numa node from where the pages have been allocated.
> Cant you write a small script that scans the information in numa_maps and
> then displays the total pages per NUMA node and then a list of which
> ranges have how many pages on a particular node?
I don't think we can determine which numa node a given user process
address range has pages from, based on the existing 'numa_maps' file.
>> reading this file will not be restricted(i.e requiring CAP_SYS_ADMIN).
> So a prime motivator here is security restricted access to numa_maps?
No, it is the opposite. A regular user should be able to determine
numa node information.
On Thu 03-05-18 15:37:39, prakash.sangappa wrote:
>
>
> On 05/03/2018 01:57 AM, Michal Hocko wrote:
> > On Wed 02-05-18 16:43:58, prakash.sangappa wrote:
> > >
> > > On 05/02/2018 02:33 PM, Andrew Morton wrote:
> > > > On Tue, 1 May 2018 22:58:06 -0700 Prakash Sangappa <[email protected]> wrote:
> > > >
> > > > > For analysis purpose it is useful to have numa node information
> > > > > corresponding mapped address ranges of the process. Currently
> > > > > /proc/<pid>/numa_maps provides list of numa nodes from where pages are
> > > > > allocated per VMA of the process. This is not useful if an user needs to
> > > > > determine which numa node the mapped pages are allocated from for a
> > > > > particular address range. It would have helped if the numa node information
> > > > > presented in /proc/<pid>/numa_maps was broken down by VA ranges showing the
> > > > > exact numa node from where the pages have been allocated.
> > > > >
> > > > > The format of /proc/<pid>/numa_maps file content is dependent on
> > > > > /proc/<pid>/maps file content as mentioned in the manpage. i.e one line
> > > > > entry for every VMA corresponding to entries in /proc/<pids>/maps file.
> > > > > Therefore changing the output of /proc/<pid>/numa_maps may not be possible.
> > > > >
> > > > > Hence, this patch proposes adding file /proc/<pid>/numa_vamaps which will
> > > > > provide proper break down of VA ranges by numa node id from where the mapped
> > > > > pages are allocated. For Address ranges not having any pages mapped, a '-'
> > > > > is printed instead of numa node id. In addition, this file will include most
> > > > > of the other information currently presented in /proc/<pid>/numa_maps. The
> > > > > additional information included is for convenience. If this is not
> > > > > preferred, the patch could be modified to just provide VA range to numa node
> > > > > information as the rest of the information is already available thru
> > > > > /proc/<pid>/numa_maps file.
> > > > >
> > > > > Since the VA range to numa node information does not include page's PFN,
> > > > > reading this file will not be restricted(i.e requiring CAP_SYS_ADMIN).
> > > > >
> > > > > Here is the snippet from the new file content showing the format.
> > > > >
> > > > > 00400000-00401000 N0=1 kernelpagesize_kB=4 mapped=1 file=/tmp/hmap2
> > > > > 00600000-00601000 N0=1 kernelpagesize_kB=4 anon=1 dirty=1 file=/tmp/hmap2
> > > > > 00601000-00602000 N0=1 kernelpagesize_kB=4 anon=1 dirty=1 file=/tmp/hmap2
> > > > > 7f0215600000-7f0215800000 N0=1 kernelpagesize_kB=2048 dirty=1 file=/mnt/f1
> > > > > 7f0215800000-7f0215c00000 - file=/mnt/f1
> > > > > 7f0215c00000-7f0215e00000 N0=1 kernelpagesize_kB=2048 dirty=1 file=/mnt/f1
> > > > > 7f0215e00000-7f0216200000 - file=/mnt/f1
> > > > > ..
> > > > > 7f0217ecb000-7f0217f20000 N0=85 kernelpagesize_kB=4 mapped=85 mapmax=51
> > > > > file=/usr/lib64/libc-2.17.so
> > > > > 7f0217f20000-7f0217f30000 - file=/usr/lib64/libc-2.17.so
> > > > > 7f0217f30000-7f0217f90000 N0=96 kernelpagesize_kB=4 mapped=96 mapmax=51
> > > > > file=/usr/lib64/libc-2.17.so
> > > > > 7f0217f90000-7f0217fb0000 - file=/usr/lib64/libc-2.17.so
> > > > > ..
> > > > >
> > > > > The 'pmap' command can be enhanced to include an option to show numa node
> > > > > information which it can read from this new proc file. This will be a
> > > > > follow on proposal.
> > > > I'd like to hear rather more about the use-cases for this new
> > > > interface. Why do people need it, what is the end-user benefit, etc?
> > > This is mainly for debugging / performance analysis. Oracle Database
> > > team is looking to use this information.
> > But we do have an interface to query (e.g. move_pages) that your
> > application can use. I am really worried that the broken out per node
> > data can be really large (just take a large vma with interleaved policy
> > as an example). So is this really worth adding as a general purpose proc
> > interface?
>
> I guess move_pages could be useful. There needs to be a tool or
> command which can read the numa node information using move_pages
> to be used to observe another process.
That should be trivial. You can get vma ranges of interest from /proc/maps
and then use move_pages to get more detailed information.
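A sketch of that approach (illustrative only; the pid and address range
would come from /proc/<pid>/maps in a real tool). With a NULL nodes array,
move_pages(2) does not move anything and instead reports the node of each
page in status[]:

#include <numaif.h>		/* move_pages(); link with -lnuma */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	/* Usage (hypothetical): ./whereis <pid> <start-hex> <end-hex> */
	int pid;
	unsigned long start, end, count, i;
	long pagesize = sysconf(_SC_PAGESIZE);
	void **pages;
	int *status;

	if (argc != 4)
		return 1;
	pid = atoi(argv[1]);
	start = strtoul(argv[2], NULL, 16);
	end = strtoul(argv[3], NULL, 16);
	count = (end - start) / pagesize;
	pages = calloc(count, sizeof(*pages));
	status = calloc(count, sizeof(*status));

	for (i = 0; i < count; i++)
		pages[i] = (void *)(start + i * pagesize);

	/* nodes == NULL: just report the node of each page in status[].
	 * Querying another process needs appropriate ptrace-level permission. */
	if (move_pages(pid, count, pages, NULL, status, 0) < 0) {
		perror("move_pages");
		return 1;
	}
	/* status[i] is a node id, or a negative errno (e.g. -ENOENT) if not present */
	for (i = 0; i < count; i++)
		printf("%lx node=%d\n", start + i * pagesize, status[i]);
	return 0;
}

Note that this reports one entry per page rather than per address range,
which is the contrast drawn later in the thread.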
> From an observability point of view, one of the use of the proposed
> new file 'numa_vamaps' was to modify 'pmap' command to display numa
> node information broken down by address ranges. Would having pmap
> show numa node information be useful?
I do not have a usecase for that.
--
Michal Hocko
SUSE Labs
On Thu 03-05-18 15:39:49, prakash.sangappa wrote:
>
>
> On 05/03/2018 11:03 AM, Christopher Lameter wrote:
> > On Tue, 1 May 2018, Prakash Sangappa wrote:
> >
> > > For analysis purpose it is useful to have numa node information
> > > corresponding mapped address ranges of the process. Currently
> > > /proc/<pid>/numa_maps provides list of numa nodes from where pages are
> > > allocated per VMA of the process. This is not useful if an user needs to
> > > determine which numa node the mapped pages are allocated from for a
> > > particular address range. It would have helped if the numa node information
> > > presented in /proc/<pid>/numa_maps was broken down by VA ranges showing the
> > > exact numa node from where the pages have been allocated.
> > Cant you write a small script that scans the information in numa_maps and
> > then displays the total pages per NUMA node and then a list of which
> > ranges have how many pages on a particular node?
>
> Don't think we can determine which numa node a given user process
> address range has pages from, based on the existing 'numa_maps' file.
yes we have. See move_pages...
> > > reading this file will not be restricted(i.e requiring CAP_SYS_ADMIN).
> > So a prime motivator here is security restricted access to numa_maps?
> No it is the opposite. A regular user should be able to determine
> numa node information.
Well, that breaks the layout randomization, doesn't it?
--
Michal Hocko
SUSE Labs
On Thu, 3 May 2018, prakash.sangappa wrote:
> > > exact numa node from where the pages have been allocated.
> > Cant you write a small script that scans the information in numa_maps and
> > then displays the total pages per NUMA node and then a list of which
> > ranges have how many pages on a particular node?
>
> Don't think we can determine which numa node a given user process
> address range has pages from, based on the existing 'numa_maps' file.
Well the information is contained in numa_maps I thought. What is missing?
> > > reading this file will not be restricted(i.e requiring CAP_SYS_ADMIN).
> > So a prime motivator here is security restricted access to numa_maps?
> No it is the opposite. A regular user should be able to determine
> numa node information.
That used to be the case until changes were made to the permissions for
reading numa_maps.
On 5/4/18 4:12 AM, Michal Hocko wrote:
> On Thu 03-05-18 15:39:49, prakash.sangappa wrote:
>>
>> On 05/03/2018 11:03 AM, Christopher Lameter wrote:
>>> On Tue, 1 May 2018, Prakash Sangappa wrote:
>>>
>>>> For analysis purpose it is useful to have numa node information
>>>> corresponding mapped address ranges of the process. Currently
>>>> /proc/<pid>/numa_maps provides list of numa nodes from where pages are
>>>> allocated per VMA of the process. This is not useful if an user needs to
>>>> determine which numa node the mapped pages are allocated from for a
>>>> particular address range. It would have helped if the numa node information
>>>> presented in /proc/<pid>/numa_maps was broken down by VA ranges showing the
>>>> exact numa node from where the pages have been allocated.
>>> Cant you write a small script that scans the information in numa_maps and
>>> then displays the total pages per NUMA node and then a list of which
>>> ranges have how many pages on a particular node?
>> Don't think we can determine which numa node a given user process
>> address range has pages from, based on the existing 'numa_maps' file.
> yes we have. See move_pages...
Sure, using move_pages, but not based on just 'numa_maps'.
>
>>>> reading this file will not be restricted(i.e requiring CAP_SYS_ADMIN).
>>> So a prime motivator here is security restricted access to numa_maps?
>> No it is the opposite. A regular user should be able to determine
>> numa node information.
> Well, that breaks the layout randomization, doesn't it?
Exposing numa node information itself should not break randomization, right?
It would be up to the application. In the case of randomization, the application
could generate address range traces of interest for debugging, and then,
using the numa node information, one could determine where the memory is laid
out for analysis.
On 5/4/18 7:57 AM, Christopher Lameter wrote:
> On Thu, 3 May 2018, prakash.sangappa wrote:
>
>>>> exact numa node from where the pages have been allocated.
>>> Cant you write a small script that scans the information in numa_maps and
>>> then displays the total pages per NUMA node and then a list of which
>>> ranges have how many pages on a particular node?
>> Don't think we can determine which numa node a given user process
>> address range has pages from, based on the existing 'numa_maps' file.
> Well the information is contained in numa_maps I thought. What is missing?
Currently 'numa_maps' gives a list of the numa nodes from which memory is
allocated, per VMA.
For example, we get something like this from numa_maps:
04000 N0=1,N2=2 kernelpagesize_KB=4
The first field is the start address of a VMA. This VMA could be much larger
than these 3 4k pages.
It does not say which addresses in the VMA have the pages mapped.
>
>>>> reading this file will not be restricted(i.e requiring CAP_SYS_ADMIN).
>>> So a prime motivator here is security restricted access to numa_maps?
>> No it is the opposite. A regular user should be able to determine
>> numa node information.
> That used to be the case until changes were made to the permissions for
> reading numa_maps.
>
On Fri, 4 May 2018, Prakash Sangappa wrote:
> Currently 'numa_maps' gives a list of numa nodes, memory is allocated per
> VMA.
> Ex. we get something like from numa_maps.
>
> 04000 N0=1,N2=2 kernelpagesize_KB=4
>
> First is the start address of a VMA. This VMA could be much larger then 3 4k
> pages.
> It does not say which address in the VMA has the pages mapped.
Not precise. First, the address is there, as you already said; that is the
virtual address of the beginning of the VMA. What is missing? Do you need
each address for each page? The length of the VMA segment?
The physical address?
On 05/07/2018 07:47 AM, Christopher Lameter wrote:
> On Fri, 4 May 2018, Prakash Sangappa wrote:
>> Currently 'numa_maps' gives a list of numa nodes, memory is allocated per
>> VMA.
>> Ex. we get something like from numa_maps.
>>
>> 04000 N0=1,N2=2 kernelpagesize_KB=4
>>
>> First is the start address of a VMA. This VMA could be much larger then 3 4k
>> pages.
>> It does not say which address in the VMA has the pages mapped.
> Not precise. First the address is there as you already said. That is the
> virtual address of the beginning of the VMA. What is missing? You need
Yes,
> each address for each page? Length of the VMA segment?
> Physical address?
We need numa node information for each virtual address that has pages mapped.
There is no need for the physical address.
On 05/03/2018 03:26 PM, Dave Hansen wrote:
> On 05/03/2018 03:27 PM, prakash.sangappa wrote:
>> If each consecutive page comes from different node, yes in
>> the extreme case is this file will have a lot of lines. All the lines
>> are generated at the time file is read. The amount of data read will be
>> limited to the user read buffer size used in the read.
>>
>> /proc/<pid>/pagemap also has kind of similar issue. There is 1 64
>> bit value for each user page.
> But nobody reads it sequentially. Everybody lseek()s because it has a
> fixed block size. You can't do that in text.
The current text-based files in /proc do allow seeking, but that does not
help to seek to a specific VA (VMA) to start from, as the seek offset is the
offset into the text. This is the case when using the 'seq_file' interface
in the kernel to generate the /proc file content.
However, with the proposed new file, we could allow seeking to a specified
virtual address. The lseek offset in this case would represent a virtual
address of the process. A subsequent read from the file would provide VA
range to numa node information starting from that VA. In case the VA seeked
to is invalid, it would start from the next valid mapped VA of the process.
The implementation would not be based on seq_file.
For example, consider getting numa node information for a process that has
the following VMAs mapped, starting from '006dc000':
00400000-004dd000
006dc000-006dd000
006dd000-006e6000
One can seek to VA 006dc000 and start reading; it would get the following:
006dc000-006dd000 N1=1 kernelpagesize_kB=4 anon=1 dirty=1 file=/usr/bin/bash
006dd000-006de000 N0=1 kernelpagesize_kB=4 anon=1 dirty=1 file=/usr/bin/bash
006de000-006e0000 N1=2 kernelpagesize_kB=4 anon=2 dirty=2 file=/usr/bin/bash
006e0000-006e6000 N0=6 kernelpagesize_kB=4 anon=6 dirty=6 file=/usr/bin/bash
..
One advantage of getting numa node information from this /proc file, versus
say using the 'move_pages()' API, is that the /proc file will be able to
provide address range to numa node information, not one page at a time.
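A sketch of what the proposed seek semantics would look like from user space
(hypothetical; this interface does not exist upstream, and the starting
address is the one from the example above):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	ssize_t n;
	off_t vaddr = 0x006dc000;	/* virtual address to start from */
	int fd = open("/proc/self/numa_vamaps", O_RDONLY);	/* proposed file */

	if (fd < 0)
		return 1;
	/* under the proposed semantics, the file offset is a virtual address */
	if (lseek(fd, vaddr, SEEK_SET) == (off_t)-1)
		return 1;
	while ((n = read(fd, buf, sizeof(buf) - 1)) > 0) {
		buf[n] = '\0';
		fputs(buf, stdout);
	}
	close(fd);
	return 0;
}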
On 05/07/2018 04:22 PM, prakash.sangappa wrote:
> However, with the proposed new file, we could allow seeking to
> specified virtual address. The lseek offset in this case would
> represent the virtual address of the process. Subsequent read from
> the file would provide VA range to numa node information starting
> from that VA. In case the VA seek'ed to is invalid, it will start
> from the next valid mapped VA of the process. The implementation
> would not be based on seq_file.
So you're proposing a new /proc/<pid> file that appears next to, and is
named very similarly to, the existing /proc/<pid> file, but which has entirely
different behavior?
On 05/07/2018 05:05 PM, Dave Hansen wrote:
> On 05/07/2018 04:22 PM, prakash.sangappa wrote:
>> However, with the proposed new file, we could allow seeking to
>> specified virtual address. The lseek offset in this case would
>> represent the virtual address of the process. Subsequent read from
>> the file would provide VA range to numa node information starting
>> from that VA. In case the VA seek'ed to is invalid, it will start
>> from the next valid mapped VA of the process. The implementation
>> would not be based on seq_file.
> So you're proposing a new /proc/<pid> file that appears next to and is
> named very similarly to the exiting /proc/<pid>, but which has entirely
> different behavior?
It will be /proc/<pid>/numa_vamaps. Yes, the behavior will be
different with respect to seeking. The output will still be text and
the format will be the same.
I want to get feedback on this approach.
On Mon, 7 May 2018, prakash.sangappa wrote:
> > each address for each page? Length of the VMA segment?
> > Physical address?
>
> Need numa node information for each virtual address with pages mapped.
> No need of physical address.
You need per page information? Note that there can only be one page
per virtual address. Or are we talking about address ranges?
https://www.kernel.org/doc/Documentation/vm/pagemap.txt ?
Also the move_pages syscall has the ability to determine the location of
individual pages.
On 05/07/2018 06:16 PM, prakash.sangappa wrote:
> It will be /proc/<pid>/numa_vamaps. Yes, the behavior will be
> different with respect to seeking. Output will still be text and
> the format will be same.
>
> I want to get feedback on this approach.
I think it would be really great if you can write down a list of the
things you actually want to accomplish. Dare I say: you need a
requirements list.
The numa_vamaps approach continues down the path of an ever-growing list
of highly-specialized /proc/<pid> files. I don't think that is
sustainable, even if it has been our trajectory for many years.
Pagemap wasn't exactly a shining example of us getting new ABIs right,
but it sounds like something along those lines is what we need.
On Fri 04-05-18 09:18:11, Prakash Sangappa wrote:
>
>
> On 5/4/18 4:12 AM, Michal Hocko wrote:
> > On Thu 03-05-18 15:39:49, prakash.sangappa wrote:
> > >
> > > On 05/03/2018 11:03 AM, Christopher Lameter wrote:
> > > > On Tue, 1 May 2018, Prakash Sangappa wrote:
> > > >
> > > > > For analysis purpose it is useful to have numa node information
> > > > > corresponding mapped address ranges of the process. Currently
> > > > > /proc/<pid>/numa_maps provides list of numa nodes from where pages are
> > > > > allocated per VMA of the process. This is not useful if an user needs to
> > > > > determine which numa node the mapped pages are allocated from for a
> > > > > particular address range. It would have helped if the numa node information
> > > > > presented in /proc/<pid>/numa_maps was broken down by VA ranges showing the
> > > > > exact numa node from where the pages have been allocated.
> > > > Cant you write a small script that scans the information in numa_maps and
> > > > then displays the total pages per NUMA node and then a list of which
> > > > ranges have how many pages on a particular node?
> > > Don't think we can determine which numa node a given user process
> > > address range has pages from, based on the existing 'numa_maps' file.
> > yes we have. See move_pages...
>
> Sure using move_pages, not based on just 'numa_maps'.
>
> > > > > reading this file will not be restricted(i.e requiring CAP_SYS_ADMIN).
> > > > So a prime motivator here is security restricted access to numa_maps?
> > > No it is the opposite. A regular user should be able to determine
> > > numa node information.
> > Well, that breaks the layout randomization, doesn't it?
>
> Exposing numa node information itself should not break randomization right?
I thought you planned to expose address ranges for each numa node as
well. /me confused.
> It would be upto the application. In case of randomization, the application
> could generate address range traces of interest for debugging and then
> using numa node information one could determine where the memory is laid
> out for analysis.
... even more confused
--
Michal Hocko
SUSE Labs
On 5/10/18 12:42 AM, Michal Hocko wrote:
> On Fri 04-05-18 09:18:11, Prakash Sangappa wrote:
>>
>> On 5/4/18 4:12 AM, Michal Hocko wrote:
>>> On Thu 03-05-18 15:39:49, prakash.sangappa wrote:
>>>> On 05/03/2018 11:03 AM, Christopher Lameter wrote:
>>>>> On Tue, 1 May 2018, Prakash Sangappa wrote:
>>>>>
>>>>>> For analysis purpose it is useful to have numa node information
>>>>>> corresponding mapped address ranges of the process. Currently
>>>>>> /proc/<pid>/numa_maps provides list of numa nodes from where pages are
>>>>>> allocated per VMA of the process. This is not useful if an user needs to
>>>>>> determine which numa node the mapped pages are allocated from for a
>>>>>> particular address range. It would have helped if the numa node information
>>>>>> presented in /proc/<pid>/numa_maps was broken down by VA ranges showing the
>>>>>> exact numa node from where the pages have been allocated.
>>>>> Cant you write a small script that scans the information in numa_maps and
>>>>> then displays the total pages per NUMA node and then a list of which
>>>>> ranges have how many pages on a particular node?
>>>> Don't think we can determine which numa node a given user process
>>>> address range has pages from, based on the existing 'numa_maps' file.
>>> yes we have. See move_pages...
>> Sure using move_pages, not based on just 'numa_maps'.
>>
>>>>>> reading this file will not be restricted(i.e requiring CAP_SYS_ADMIN).
>>>>> So a prime motivator here is security restricted access to numa_maps?
>>>> No it is the opposite. A regular user should be able to determine
>>>> numa node information.
>>> Well, that breaks the layout randomization, doesn't it?
>> Exposing numa node information itself should not break randomization right?
> I thought you planned to expose address ranges for each numa node as
> well. /me confused.
Yes, are you suggesting this information should not be available to a
regular user?
Is it not possible to get that same information using the move_pages()
API as a regular user, although only one page or set of pages at a time?
>> It would be up to the application. In case of randomization, the application
>> could generate address range traces of interest for debugging and then
>> using numa node information one could determine where the memory is laid
>> out for analysis.
> ... even more confused
>
On Thu 10-05-18 09:00:24, Prakash Sangappa wrote:
>
>
> On 5/10/18 12:42 AM, Michal Hocko wrote:
> > On Fri 04-05-18 09:18:11, Prakash Sangappa wrote:
> > >
> > > On 5/4/18 4:12 AM, Michal Hocko wrote:
> > > > On Thu 03-05-18 15:39:49, prakash.sangappa wrote:
> > > > > On 05/03/2018 11:03 AM, Christopher Lameter wrote:
> > > > > > On Tue, 1 May 2018, Prakash Sangappa wrote:
> > > > > >
> > > > > > > For analysis purpose it is useful to have numa node information
> > > > > > > corresponding mapped address ranges of the process. Currently
> > > > > > > /proc/<pid>/numa_maps provides list of numa nodes from where pages are
> > > > > > > allocated per VMA of the process. This is not useful if an user needs to
> > > > > > > determine which numa node the mapped pages are allocated from for a
> > > > > > > particular address range. It would have helped if the numa node information
> > > > > > > presented in /proc/<pid>/numa_maps was broken down by VA ranges showing the
> > > > > > > exact numa node from where the pages have been allocated.
> > > > > > Cant you write a small script that scans the information in numa_maps and
> > > > > > then displays the total pages per NUMA node and then a list of which
> > > > > > ranges have how many pages on a particular node?
> > > > > Don't think we can determine which numa node a given user process
> > > > > address range has pages from, based on the existing 'numa_maps' file.
> > > > yes we have. See move_pages...
> > > Sure using move_pages, not based on just 'numa_maps'.
> > >
> > > > > > > reading this file will not be restricted(i.e requiring CAP_SYS_ADMIN).
> > > > > > So a prime motivator here is security restricted access to numa_maps?
> > > > > No it is the opposite. A regular user should be able to determine
> > > > > numa node information.
> > > > Well, that breaks the layout randomization, doesn't it?
> > > Exposing numa node information itself should not break randomization right?
> > I thought you planned to expose address ranges for each numa node as
> > well. /me confused.
>
> Yes, are you suggesting this information should not be available to a
> regular user?
absolutely. We do check ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)
to make sure the access is authorised.
>
> Is it not possible to get that same information using the move_pages() API
> as a regular user, although only one page or set of pages at a time?
No, see PTRACE_MODE_READ_REALCREDS check in move_pages.
--
Michal Hocko
SUSE Labs
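For reference, a minimal userspace sketch of the move_pages(2) usage discussed
above; it is not part of the patch. Passing nodes == NULL turns the call into a
query: status[] is filled with the numa node id of each page, or a negative
errno such as -ENOENT when nothing is mapped there. The buffer and page count
are made up for illustration; the move_pages() wrapper comes from libnuma, so
build with -lnuma.

/* Sketch only: query (not move) the numa node of a few pages of the
 * current process with move_pages(2), nodes == NULL.
 */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	enum { NPAGES = 4 };
	char *buf;
	void *pages[NPAGES];
	int status[NPAGES];
	int i;

	buf = malloc(NPAGES * psz);
	if (!buf)
		return 1;

	for (i = 0; i < NPAGES; i++) {
		buf[i * psz] = 1;		/* touch the page so it gets allocated */
		pages[i] = buf + i * psz;
	}

	/* pid 0 == current process, nodes == NULL == only report node ids */
	if (move_pages(0, NPAGES, pages, NULL, status, 0) < 0) {
		perror("move_pages");
		return 1;
	}

	for (i = 0; i < NPAGES; i++)
		printf("%p is on node %d\n", pages[i], status[i]);
	return 0;
}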
On 05/09/2018 04:31 PM, Dave Hansen wrote:
> On 05/07/2018 06:16 PM, prakash.sangappa wrote:
>> It will be /proc/<pid>/numa_vamaps. Yes, the behavior will be
>> different with respect to seeking. Output will still be text and
>> the format will be same.
>>
>> I want to get feedback on this approach.
> I think it would be really great if you can write down a list of the
> things you actually want to accomplish. Dare I say: you need a
> requirements list.
>
> The numa_vamaps approach continues down the path of an ever-growing list
> of highly-specialized /proc/<pid> files. I don't think that is
> sustainable, even if it has been our trajectory for many years.
>
> Pagemap wasn't exactly a shining example of us getting new ABIs right,
> but it sounds like something along those is what we need.
Just sent out a V2 patch. This patch simplifies the file content. It
only provides VA range to numa node id information.
The requirement is basically observability for performance analysis.
- Need to be able to determine VA range to numa node id information,
which also gives an idea of which ranges have memory allocated.
- The proc file /proc/<pid>/numa_vamaps is in text so it is easy to
directly view.
The V2 patch supports seeking to a particular process VA from where
the application could read the VA to numa node id information.
Also added the 'PTRACE_MODE_READ_REALCREDS' check when opening the
/proc file, as was indicated by Michal Hacko.
The VA range to numa node information from this file can be used by pmap.
Here is a sample from a prototype change to pmap(procps) showing
numa node information, gathered from the new 'numa_vamaps' file.
$ ./rpmap -L -A 00000000006f8000,00007f5f730fe000 31423|more
31423: bash
00000000006f8000 16K N1 rw--- bash
00000000006fc000 4K N0 rw--- bash
00000000006fd000 4K N0 rw--- [ anon ]
00000000006fe000 8K N1 rw--- [ anon ]
0000000000700000 4K N0 rw--- [ anon ]
0000000000701000 4K N1 rw--- [ anon ]
0000000000702000 4K N0 rw--- [ anon ]
0000000000ce8000 52K N0 rw--- [ anon ]
0000000000cf5000 4K N1 rw--- [ anon ]
0000000000cf6000 28K N0 rw--- [ anon ]
0000000000cfd000 4K N1 rw--- [ anon ]
0000000000cfe000 28K N0 rw--- [ anon ]
0000000000d05000 504K N1 rw--- [ anon ]
0000000000d83000 8K N0 rw--- [ anon ]
0000000000d85000 932K N1 rw--- [ anon ]
0000000000e6e000 4K - rw--- [ anon ]
0000000000e6f000 168K N1 rw--- [ anon ]
00007f5f72ef4000 4K N2 r-x-- libnss_files-2.23.so
00007f5f72ef5000 40K N0 r-x-- libnss_files-2.23.so
00007f5f72eff000 2044K - ----- libnss_files-2.23.so
00007f5f730fe000 4K N0 r---- libnss_files-2.23.so
total 3868K
-Prakash.
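To make the intended usage above concrete, here is a rough sketch (not from
the patch, nor from procps) of how a pmap-like tool might read the proposed
file, assuming the V2 behaviour described earlier where the file offset is
interpreted as the virtual address to start reading from. The path and the
seek semantics are those of the proposal, not an existing kernel interface.

/* Sketch only: dump /proc/<pid>/numa_vamaps starting at a given VA,
 * assuming the proposed seek-to-VA behaviour.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char path[64], buf[4096];
	unsigned long va;
	ssize_t n;
	int fd;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <pid> <hex-va>\n", argv[0]);
		return 1;
	}
	va = strtoul(argv[2], NULL, 16);
	snprintf(path, sizeof(path), "/proc/%s/numa_vamaps", argv[1]);

	fd = open(path, O_RDONLY);
	if (fd < 0) {
		perror(path);
		return 1;
	}
	/* per the proposal, the offset positions the read at the given VA */
	if (lseek(fd, va, SEEK_SET) == (off_t)-1) {
		perror("lseek");
		return 1;
	}
	while ((n = read(fd, buf, sizeof(buf) - 1)) > 0) {
		buf[n] = '\0';
		fputs(buf, stdout);
	}
	close(fd);
	return 0;
}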
On 09/12/2018 01:42 PM, prakash.sangappa wrote:
>
>
> On 05/09/2018 04:31 PM, Dave Hansen wrote:
>> On 05/07/2018 06:16 PM, prakash.sangappa wrote:
>>> It will be /proc/<pid>/numa_vamaps. Yes, the behavior will be
>>> different with respect to seeking. Output will still be text and
>>> the format will be same.
>>>
>>> I want to get feedback on this approach.
>> I think it would be really great if you can write down a list of the
>> things you actually want to accomplish. Dare I say: you need a
>> requirements list.
>>
>> The numa_vamaps approach continues down the path of an ever-growing list
>> of highly-specialized /proc/<pid> files. I don't think that is
>> sustainable, even if it has been our trajectory for many years.
>>
>> Pagemap wasn't exactly a shining example of us getting new ABIs right,
>> but it sounds like something along those is what we need.
>
> Just sent out a V2 patch. This patch simplifies the file content. It
> only provides VA range to numa node id information.
>
> The requirement is basically observability for performance analysis.
>
> - Need to be able to determine VA range to numa node id information.
> Which also gives an idea of which range has memory allocated.
>
> - The proc file /proc/<pid>/numa_vamaps is in text so it is easy to
> directly view.
>
> The V2 patch supports seeking to a particular process VA from where
> the application could read the VA to numa node id information.
>
> Also added the 'PTRACE_MODE_READ_REALCREDS' check when opening the
> file /proc file as was indicated by Michal Hacko
>
I meant Michal Hocko.
Sorry, I misspelled the name.
> The VA range to numa node information from this file can be used by pmap.
>
> Here is a sample from a prototype change to pmap(procps) showing
> numa node information, gathered from the new 'numa_vamaps' file.
>
> $ ./rpmap -L -A 00000000006f8000,00007f5f730fe000 31423|more
> 31423: bash
> 00000000006f8000 16K N1 rw--- bash
> 00000000006fc000 4K N0 rw--- bash
> 00000000006fd000 4K N0 rw--- [ anon ]
> 00000000006fe000 8K N1 rw--- [ anon ]
> 0000000000700000 4K N0 rw--- [ anon ]
> 0000000000701000 4K N1 rw--- [ anon ]
> 0000000000702000 4K N0 rw--- [ anon ]
> 0000000000ce8000 52K N0 rw--- [ anon ]
> 0000000000cf5000 4K N1 rw--- [ anon ]
> 0000000000cf6000 28K N0 rw--- [ anon ]
> 0000000000cfd000 4K N1 rw--- [ anon ]
> 0000000000cfe000 28K N0 rw--- [ anon ]
> 0000000000d05000 504K N1 rw--- [ anon ]
> 0000000000d83000 8K N0 rw--- [ anon ]
> 0000000000d85000 932K N1 rw--- [ anon ]
> 0000000000e6e000 4K - rw--- [ anon ]
> 0000000000e6f000 168K N1 rw--- [ anon ]
> 00007f5f72ef4000 4K N2 r-x-- libnss_files-2.23.so
> 00007f5f72ef5000 40K N0 r-x-- libnss_files-2.23.so
> 00007f5f72eff000 2044K - ----- libnss_files-2.23.so
> 00007f5f730fe000 4K N0 r---- libnss_files-2.23.so
> total 3868K
>
> -Prakash.
> The /proc/pid/numa_vamaps shows mapped address ranges to numa node id
> from where the physical pages are allocated.
All these files make the problem with useless dentry and /proc/*/* inode
instantiations worse (unlike top level /proc/* files which are
tolerable).
> +address-range numa-node-id
> +
> +00400000-00410000 N1
> +00410000-0047f000 N0
> +0047f000-00480000 N2
> +00480000-00481000 -
> +00481000-004a0000 N0
> +004a0000-004a2000 -
> +004a2000-004aa000 N2
> +004aa000-004ad000 N0
> +004ad000-004ae000 -
'N' is useless data.
Parsing with awk won't work because field #3 is separated with space
but field #2 with '-'.
%08lx-%08lx kind of sucks: 32-bit get aligned data so parsing can be
faster by pointing to &p[8+1] but not on 64-bit.
If scanf("%lx-%lx") is used then leading zeroes are useless.
Text is harder than it looks.
Please in the name of everything holy add new honest system call.
On 09/12/2018 04:02 PM, Alexey Dobriyan wrote:
>> The /proc/pid/numa_vamaps shows mapped address ranges to numa node id
>> from where the physical pages are allocated.
> All these files make the problem with useless dentry and /proc/*/* inode
> instantiations worse (unlike top level /proc/* files which are
> tolerable).
>
>> +address-range numa-node-id
>> +
>> +00400000-00410000 N1
>> +00410000-0047f000 N0
>> +0047f000-00480000 N2
>> +00480000-00481000 -
>> +00481000-004a0000 N0
>> +004a0000-004a2000 -
>> +004a2000-004aa000 N2
>> +004aa000-004ad000 N0
>> +004ad000-004ae000 -
> 'N' is useless data.
'N' could be dropped.
>
> Parsing with awk won't work because field #3 is separated with space
> but field #2 with '-'.
>
> %08lx-%08lx kind of sucks: 32-bit get aligned data so parsing can be
> faster by pointing to &p[8+1] but not on 64-bit.
> If scanf("%lx-%lx") is used then leading zeroes are useless.
This is similar to how the '/proc/*/maps' file presents the address range.
However, we could separate the start and end address of the range with a
space, if that would be preferred.
> Text is harder than it looks.
We could make each line a fixed length. It could be "%016lx %016lx %04d",
i.e. 4 digits for the node number?
> Please in the name of everything holy add new honest system call.
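As a small illustration of the fixed-length idea above (again only a sketch,
and a hypothetical format rather than what the posted patch prints): with
every record the same length a reader can index straight to record i, plain
sscanf works, and so does awk's default whitespace splitting.

/* Sketch only: build and parse one record of the hypothetical
 * fixed-length "%016lx %016lx %04d" format.
 */
#include <stdio.h>

#define RECLEN (16 + 1 + 16 + 1 + 4 + 1)	/* "%016lx %016lx %04d\n" */

int main(void)
{
	char line[RECLEN + 1];
	unsigned long start, end;
	int nid;

	snprintf(line, sizeof(line), "%016lx %016lx %04d\n",
		 0x400000UL, 0x410000UL, 1);

	if (sscanf(line, "%lx %lx %d", &start, &end, &nid) == 3)
		printf("%lx-%lx -> node %d (record length %d)\n",
		       start, end, nid, RECLEN);
	return 0;
}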
On Wed, Sep 12, 2018 at 10:43 PM prakash.sangappa
<[email protected]> wrote:
> On 05/09/2018 04:31 PM, Dave Hansen wrote:
> > On 05/07/2018 06:16 PM, prakash.sangappa wrote:
> >> It will be /proc/<pid>/numa_vamaps. Yes, the behavior will be
> >> different with respect to seeking. Output will still be text and
> >> the format will be same.
> >>
> >> I want to get feedback on this approach.
> > I think it would be really great if you can write down a list of the
> > things you actually want to accomplish. Dare I say: you need a
> > requirements list.
> >
> > The numa_vamaps approach continues down the path of an ever-growing list
> > of highly-specialized /proc/<pid> files. I don't think that is
> > sustainable, even if it has been our trajectory for many years.
> >
> > Pagemap wasn't exactly a shining example of us getting new ABIs right,
> > but it sounds like something along those is what we need.
>
> Just sent out a V2 patch. This patch simplifies the file content. It
> only provides VA range to numa node id information.
>
> The requirement is basically observability for performance analysis.
>
> - Need to be able to determine VA range to numa node id information.
> Which also gives an idea of which range has memory allocated.
>
> - The proc file /proc/<pid>/numa_vamaps is in text so it is easy to
> directly view.
>
> The V2 patch supports seeking to a particular process VA from where
> the application could read the VA to numa node id information.
>
> Also added the 'PTRACE_MODE_READ_REALCREDS' check when opening the
> file /proc file as was indicated by Michal Hacko
procfs files should use PTRACE_MODE_*_FSCREDS, not PTRACE_MODE_*_REALCREDS.
On Fri 14-09-18 03:33:28, Jann Horn wrote:
> On Wed, Sep 12, 2018 at 10:43 PM prakash.sangappa
> <[email protected]> wrote:
> > On 05/09/2018 04:31 PM, Dave Hansen wrote:
> > > On 05/07/2018 06:16 PM, prakash.sangappa wrote:
> > >> It will be /proc/<pid>/numa_vamaps. Yes, the behavior will be
> > >> different with respect to seeking. Output will still be text and
> > >> the format will be same.
> > >>
> > >> I want to get feedback on this approach.
> > > I think it would be really great if you can write down a list of the
> > > things you actually want to accomplish. Dare I say: you need a
> > > requirements list.
> > >
> > > The numa_vamaps approach continues down the path of an ever-growing list
> > > of highly-specialized /proc/<pid> files. I don't think that is
> > > sustainable, even if it has been our trajectory for many years.
> > >
> > > Pagemap wasn't exactly a shining example of us getting new ABIs right,
> > > but it sounds like something along those is what we need.
> >
> > Just sent out a V2 patch. This patch simplifies the file content. It
> > only provides VA range to numa node id information.
> >
> > The requirement is basically observability for performance analysis.
> >
> > - Need to be able to determine VA range to numa node id information.
> > Which also gives an idea of which range has memory allocated.
> >
> > - The proc file /proc/<pid>/numa_vamaps is in text so it is easy to
> > directly view.
> >
> > The V2 patch supports seeking to a particular process VA from where
> > the application could read the VA to numa node id information.
> >
> > Also added the 'PTRACE_MODE_READ_REALCREDS' check when opening the
> > file /proc file as was indicated by Michal Hacko
>
> procfs files should use PTRACE_MODE_*_FSCREDS, not PTRACE_MODE_*_REALCREDS.
Out of my curiosity, what is the semantic difference? At least
kernel_move_pages uses PTRACE_MODE_READ_REALCREDS. Is this a bug?
--
Michal Hocko
SUSE Labs
On Fri, Sep 14, 2018 at 8:21 AM Michal Hocko <[email protected]> wrote:
> On Fri 14-09-18 03:33:28, Jann Horn wrote:
> > On Wed, Sep 12, 2018 at 10:43 PM prakash.sangappa
> > <[email protected]> wrote:
> > > On 05/09/2018 04:31 PM, Dave Hansen wrote:
> > > > On 05/07/2018 06:16 PM, prakash.sangappa wrote:
> > > >> It will be /proc/<pid>/numa_vamaps. Yes, the behavior will be
> > > >> different with respect to seeking. Output will still be text and
> > > >> the format will be same.
> > > >>
> > > >> I want to get feedback on this approach.
> > > > I think it would be really great if you can write down a list of the
> > > > things you actually want to accomplish. Dare I say: you need a
> > > > requirements list.
> > > >
> > > > The numa_vamaps approach continues down the path of an ever-growing list
> > > > of highly-specialized /proc/<pid> files. I don't think that is
> > > > sustainable, even if it has been our trajectory for many years.
> > > >
> > > > Pagemap wasn't exactly a shining example of us getting new ABIs right,
> > > > but it sounds like something along those is what we need.
> > >
> > > Just sent out a V2 patch. This patch simplifies the file content. It
> > > only provides VA range to numa node id information.
> > >
> > > The requirement is basically observability for performance analysis.
> > >
> > > - Need to be able to determine VA range to numa node id information.
> > > Which also gives an idea of which range has memory allocated.
> > >
> > > - The proc file /proc/<pid>/numa_vamaps is in text so it is easy to
> > > directly view.
> > >
> > > The V2 patch supports seeking to a particular process VA from where
> > > the application could read the VA to numa node id information.
> > >
> > > Also added the 'PTRACE_MODE_READ_REALCREDS' check when opening the
> > > file /proc file as was indicated by Michal Hacko
> >
> > procfs files should use PTRACE_MODE_*_FSCREDS, not PTRACE_MODE_*_REALCREDS.
>
> Out of my curiosity, what is the semantic difference? At least
> kernel_move_pages uses PTRACE_MODE_READ_REALCREDS. Is this a bug?
No, that's fine. REALCREDS basically means "look at the caller's real
UID for the access check", while FSCREDS means "look at the caller's
filesystem UID". The ptrace access check has historically been using
the real UID, which is sorta weird, but normally works fine. Given
that this is documented, I didn't see any reason to change it for most
things that do ptrace access checks, even if the EUID would IMO be
more appropriate. But things that capture caller credentials at points
like open() really shouldn't look at the real UID; instead, they
should use the filesystem UID (which in practice is basically the same
as the EUID).
So in short, it depends on the interface you're coming through: Direct
syscalls use REALCREDS, things that go through the VFS layer use
FSCREDS.
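For illustration, this is roughly what the FSCREDS variant of the check would
look like in the proc file's open handler. It is a sketch in the style of
other /proc/<pid> files, not a hunk from the posted patch, and
show_numa_vamaps is just a placeholder name for the seq_file show routine.

/* Sketch only: do the ptrace access check with the opener's fs creds
 * at open() time, the way a procfs file should.
 */
static int numa_vamaps_open(struct inode *inode, struct file *file)
{
	struct task_struct *task;
	bool allowed;

	task = get_proc_task(inode);
	if (!task)
		return -ESRCH;

	/* gate on the opener's filesystem creds, not the real UID */
	allowed = ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);
	put_task_struct(task);
	if (!allowed)
		return -EACCES;

	/* show_numa_vamaps is a placeholder for the actual show op */
	return single_open(file, show_numa_vamaps, inode);
}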
On Fri 14-09-18 14:49:10, Jann Horn wrote:
> On Fri, Sep 14, 2018 at 8:21 AM Michal Hocko <[email protected]> wrote:
> > On Fri 14-09-18 03:33:28, Jann Horn wrote:
> > > On Wed, Sep 12, 2018 at 10:43 PM prakash.sangappa
> > > <[email protected]> wrote:
> > > > On 05/09/2018 04:31 PM, Dave Hansen wrote:
> > > > > On 05/07/2018 06:16 PM, prakash.sangappa wrote:
> > > > >> It will be /proc/<pid>/numa_vamaps. Yes, the behavior will be
> > > > >> different with respect to seeking. Output will still be text and
> > > > >> the format will be same.
> > > > >>
> > > > >> I want to get feedback on this approach.
> > > > > I think it would be really great if you can write down a list of the
> > > > > things you actually want to accomplish. Dare I say: you need a
> > > > > requirements list.
> > > > >
> > > > > The numa_vamaps approach continues down the path of an ever-growing list
> > > > > of highly-specialized /proc/<pid> files. I don't think that is
> > > > > sustainable, even if it has been our trajectory for many years.
> > > > >
> > > > > Pagemap wasn't exactly a shining example of us getting new ABIs right,
> > > > > but it sounds like something along those is what we need.
> > > >
> > > > Just sent out a V2 patch. This patch simplifies the file content. It
> > > > only provides VA range to numa node id information.
> > > >
> > > > The requirement is basically observability for performance analysis.
> > > >
> > > > - Need to be able to determine VA range to numa node id information.
> > > > Which also gives an idea of which range has memory allocated.
> > > >
> > > > - The proc file /proc/<pid>/numa_vamaps is in text so it is easy to
> > > > directly view.
> > > >
> > > > The V2 patch supports seeking to a particular process VA from where
> > > > the application could read the VA to numa node id information.
> > > >
> > > > Also added the 'PTRACE_MODE_READ_REALCREDS' check when opening the
> > > > file /proc file as was indicated by Michal Hacko
> > >
> > > procfs files should use PTRACE_MODE_*_FSCREDS, not PTRACE_MODE_*_REALCREDS.
> >
> > Out of my curiosity, what is the semantic difference? At least
> > kernel_move_pages uses PTRACE_MODE_READ_REALCREDS. Is this a bug?
>
> No, that's fine. REALCREDS basically means "look at the caller's real
> UID for the access check", while FSCREDS means "look at the caller's
> filesystem UID". The ptrace access check has historically been using
> the real UID, which is sorta weird, but normally works fine. Given
> that this is documented, I didn't see any reason to change it for most
> things that do ptrace access checks, even if the EUID would IMO be
> more appropriate. But things that capture caller credentials at points
> like open() really shouldn't look at the real UID; instead, they
> should use the filesystem UID (which in practice is basically the same
> as the EUID).
>
> So in short, it depends on the interface you're coming through: Direct
> syscalls use REALCREDS, things that go through the VFS layer use
> FSCREDS.
Ahh, OK, I see. Thanks for the clarification! Then I agree that the proc
interface should use FSCREDS.
--
Michal Hocko
SUSE Labs
On 9/14/18 5:49 AM, Jann Horn wrote:
> On Fri, Sep 14, 2018 at 8:21 AM Michal Hocko <[email protected]> wrote:
>> On Fri 14-09-18 03:33:28, Jann Horn wrote:
>>> On Wed, Sep 12, 2018 at 10:43 PM prakash.sangappa
>>> <[email protected]> wrote:
>>>> On 05/09/2018 04:31 PM, Dave Hansen wrote:
>>>>> On 05/07/2018 06:16 PM, prakash.sangappa wrote:
>>>>>> It will be /proc/<pid>/numa_vamaps. Yes, the behavior will be
>>>>>> different with respect to seeking. Output will still be text and
>>>>>> the format will be same.
>>>>>>
>>>>>> I want to get feedback on this approach.
>>>>> I think it would be really great if you can write down a list of the
>>>>> things you actually want to accomplish. Dare I say: you need a
>>>>> requirements list.
>>>>>
>>>>> The numa_vamaps approach continues down the path of an ever-growing list
>>>>> of highly-specialized /proc/<pid> files. I don't think that is
>>>>> sustainable, even if it has been our trajectory for many years.
>>>>>
>>>>> Pagemap wasn't exactly a shining example of us getting new ABIs right,
>>>>> but it sounds like something along those is what we need.
>>>> Just sent out a V2 patch. This patch simplifies the file content. It
>>>> only provides VA range to numa node id information.
>>>>
>>>> The requirement is basically observability for performance analysis.
>>>>
>>>> - Need to be able to determine VA range to numa node id information.
>>>> Which also gives an idea of which range has memory allocated.
>>>>
>>>> - The proc file /proc/<pid>/numa_vamaps is in text so it is easy to
>>>> directly view.
>>>>
>>>> The V2 patch supports seeking to a particular process VA from where
>>>> the application could read the VA to numa node id information.
>>>>
>>>> Also added the 'PTRACE_MODE_READ_REALCREDS' check when opening the
>>>> file /proc file as was indicated by Michal Hacko
>>> procfs files should use PTRACE_MODE_*_FSCREDS, not PTRACE_MODE_*_REALCREDS.
>> Out of my curiosity, what is the semantic difference? At least
>> kernel_move_pages uses PTRACE_MODE_READ_REALCREDS. Is this a bug?
> No, that's fine. REALCREDS basically means "look at the caller's real
> UID for the access check", while FSCREDS means "look at the caller's
> filesystem UID". The ptrace access check has historically been using
> the real UID, which is sorta weird, but normally works fine. Given
> that this is documented, I didn't see any reason to change it for most
> things that do ptrace access checks, even if the EUID would IMO be
> more appropriate. But things that capture caller credentials at points
> like open() really shouldn't look at the real UID; instead, they
> should use the filesystem UID (which in practice is basically the same
> as the EUID).
>
> So in short, it depends on the interface you're coming through: Direct
> syscalls use REALCREDS, things that go through the VFS layer use
> FSCREDS.
So in this case, can the REALCREDS check be done in the read() system call
when reading the /proc file, instead of in the open() call?
On Fri, Sep 14, 2018 at 8:08 PM Prakash Sangappa
<[email protected]> wrote:
> On 9/14/18 5:49 AM, Jann Horn wrote:
> > On Fri, Sep 14, 2018 at 8:21 AM Michal Hocko <[email protected]> wrote:
> >> On Fri 14-09-18 03:33:28, Jann Horn wrote:
> >>> On Wed, Sep 12, 2018 at 10:43 PM prakash.sangappa
> >>> <[email protected]> wrote:
> >>>> On 05/09/2018 04:31 PM, Dave Hansen wrote:
> >>>>> On 05/07/2018 06:16 PM, prakash.sangappa wrote:
> >>>>>> It will be /proc/<pid>/numa_vamaps. Yes, the behavior will be
> >>>>>> different with respect to seeking. Output will still be text and
> >>>>>> the format will be same.
> >>>>>>
> >>>>>> I want to get feedback on this approach.
> >>>>> I think it would be really great if you can write down a list of the
> >>>>> things you actually want to accomplish. Dare I say: you need a
> >>>>> requirements list.
> >>>>>
> >>>>> The numa_vamaps approach continues down the path of an ever-growing list
> >>>>> of highly-specialized /proc/<pid> files. I don't think that is
> >>>>> sustainable, even if it has been our trajectory for many years.
> >>>>>
> >>>>> Pagemap wasn't exactly a shining example of us getting new ABIs right,
> >>>>> but it sounds like something along those is what we need.
> >>>> Just sent out a V2 patch. This patch simplifies the file content. It
> >>>> only provides VA range to numa node id information.
> >>>>
> >>>> The requirement is basically observability for performance analysis.
> >>>>
> >>>> - Need to be able to determine VA range to numa node id information.
> >>>> Which also gives an idea of which range has memory allocated.
> >>>>
> >>>> - The proc file /proc/<pid>/numa_vamaps is in text so it is easy to
> >>>> directly view.
> >>>>
> >>>> The V2 patch supports seeking to a particular process VA from where
> >>>> the application could read the VA to numa node id information.
> >>>>
> >>>> Also added the 'PTRACE_MODE_READ_REALCREDS' check when opening the
> >>>> file /proc file as was indicated by Michal Hacko
> >>> procfs files should use PTRACE_MODE_*_FSCREDS, not PTRACE_MODE_*_REALCREDS.
> >> Out of my curiosity, what is the semantic difference? At least
> >> kernel_move_pages uses PTRACE_MODE_READ_REALCREDS. Is this a bug?
> > No, that's fine. REALCREDS basically means "look at the caller's real
> > UID for the access check", while FSCREDS means "look at the caller's
> > filesystem UID". The ptrace access check has historically been using
> > the real UID, which is sorta weird, but normally works fine. Given
> > that this is documented, I didn't see any reason to change it for most
> > things that do ptrace access checks, even if the EUID would IMO be
> > more appropriate. But things that capture caller credentials at points
> > like open() really shouldn't look at the real UID; instead, they
> > should use the filesystem UID (which in practice is basically the same
> > as the EUID).
> >
> > So in short, it depends on the interface you're coming through: Direct
> > syscalls use REALCREDS, things that go through the VFS layer use
> > FSCREDS.
>
> So in this case can the REALCREDS check be done in the read() system call
> when reading the /proc file instead of the open call?
No, REALCREDS shouldn't be used in open() and shouldn't be used in read().
FSCREDS can be used in open(); in theory, using ptrace_may_access() in
any way in read() is currently unsafe, but in practice, it's used that
way anyway. I have plans to clean that up eventually...