2007-08-16 22:11:00

by Wu Fengguang

Subject: [PATCH 4/4] maps: /proc/<pid>/pmaps interface - memory maps in granularity of pages

Show a process's page-by-page address space information in /proc/<pid>/pmaps.
It helps to analyze applications' memory footprints in a comprehensive way.

Pages that share the same state are grouped into a page range.
For each page range, the following fields are exported:
- first page index
- number of pages in the range
- well known page/pte flags
- number of mmap users

Only page flags not expected to disappear in the near future are exported:

Y:young R:referenced A:active U:uptodate P:ptedirty D:dirty W:writeback

Here is a sample output:

# cat /proc/$$/pmaps
08048000-080c9000 r-xp 08048000 00:00 0
8048 81 Y_A_P__ 1
080c9000-080f8000 rwxp 080c9000 00:00 0 [heap]
80c9 2f Y_A_P__ 1
f7e1c000-f7e25000 r-xp 00000000 03:00 176633 /lib/libnss_files-2.3.6.so
0 1 Y_AU___ 1
1 1 YR_U___ 1
5 1 YR_U___ 1
8 1 YR_U___ 1
f7e25000-f7e27000 rwxp 00008000 03:00 176633 /lib/libnss_files-2.3.6.so
8 2 Y_A_P__ 1
f7e27000-f7e2f000 r-xp 00000000 03:00 176635 /lib/libnss_nis-2.3.6.so
0 1 Y_AU___ 1
1 1 YR_U___ 1
4 1 YR_U___ 1
7 1 Y_AU___ 1
f7e2f000-f7e31000 rwxp 00007000 03:00 176635 /lib/libnss_nis-2.3.6.so
7 2 Y_A_P__ 1
f7e31000-f7e43000 r-xp 00000000 03:00 176630 /lib/libnsl-2.3.6.so
0 1 Y_AU___ 1
1 3 YR_U___ 1
10 1 YR_U___ 1
f7e43000-f7e45000 rwxp 00011000 03:00 176630 /lib/libnsl-2.3.6.so
11 2 Y_A_P__ 1
f7e45000-f7e47000 rwxp f7e45000 00:00 0
f7e47000-f7e4f000 r-xp 00000000 03:00 176631 /lib/libnss_compat-2.3.6.so
0 1 Y_AU___ 1
1 3 YR_U___ 1
7 1 Y_AU___ 1
f7e4f000-f7e51000 rwxp 00007000 03:00 176631 /lib/libnss_compat-2.3.6.so
7 2 Y_A_P__ 1
f7e51000-f7f79000 r-xp 00000000 03:00 176359 /lib/libc-2.3.6.so
0 16 YRAU___ 2
19 1 YR_U___ 1
1f 1 YRAU___ 2
21 1 YRAU___ 1
22 2 YRAU___ 2
24 1 YRAU___ 1
26 1 YRAU___ 2
[...]
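
For illustration, a user-space consumer could decode the range lines
with something like the sketch below (the struct and helper names are
invented for this example; they are not part of the patch):

	#include <stdio.h>
	#include <string.h>

	/* one decoded page range; field names invented for this sketch */
	struct pmaps_range {
		unsigned long offset;	/* first page index (hex) */
		unsigned long len;	/* number of pages (hex) */
		char flags[8];		/* e.g. "Y_A_P__" */
		unsigned int mapcount;	/* number of mmap users (hex) */
	};

	/*
	 * Returns 1 for a page range line, 0 for anything else. The VMA
	 * header lines keep the usual /proc/<pid>/maps format, so they
	 * fail the flag-string check below.
	 */
	static int parse_range_line(const char *line, struct pmaps_range *r)
	{
		if (sscanf(line, "%lx %lx %7s %x", &r->offset, &r->len,
			   r->flags, &r->mapcount) != 4)
			return 0;
		return strlen(r->flags) == 7 &&
			strspn(r->flags, "YRAUPDW_") == 7;
	}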


Matt Mackall's pagemap/kpagemap and John Berthels's exmap can achieve the same goals,
and probably more. But this text-based pmaps interface should be easier to use.

The concern about dataset size is addressed by working in a sparse way:

1) It will only generate output for resident pages, which is normally
much smaller than the mapped size. Take my shell for example: the
(size:rss) ratio is (7:1)!

wfg ~% cat /proc/$$/smaps |grep Size|sum
sum 50552.000
avg 777.723

wfg ~% cat /proc/$$/smaps |grep Rss|sum
sum 7604.000
avg 116.985

2) The page range trick suppresses more output.
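
("sum" above appears to be a local helper that totals and averages the
numbers on its input; a rough C stand-in, assuming smaps-style
"Label: <n> kB" lines, would be:)

	#include <stdio.h>

	int main(void)
	{
		char label[64];
		double v, total = 0;
		long n = 0;

		/* each input line looks like "Size:    84 kB" */
		while (scanf("%63s %lf kB", label, &v) == 2) {
			total += v;
			n++;
		}
		printf("sum %.3f\navg %.3f\n", total, n ? total / n : 0);
		return 0;
	}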

It's interesting to see that the seq_file interface demands some
extra programming effort, yet provides real flexibility in return.

Cc: Jeremy Fitzhardinge <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Matt Mackall <[email protected]>
Cc: John Berthels <[email protected]>
Cc: Nick Piggin <[email protected]>
Signed-off-by: Fengguang Wu <[email protected]>
---
fs/proc/base.c | 7 +
fs/proc/internal.h | 1
fs/proc/task_mmu.c | 171 +++++++++++++++++++++++++++++++++++++++++++
3 files changed, 179 insertions(+)

--- linux-2.6.23-rc2-mm2.orig/fs/proc/task_mmu.c
+++ linux-2.6.23-rc2-mm2/fs/proc/task_mmu.c
@@ -754,3 +754,174 @@ const struct file_operations proc_numa_m
.release = seq_release_private,
};
#endif /* CONFIG_NUMA */
+
+struct pmaps_private {
+ struct proc_maps_private pmp;
+ struct vm_area_struct *vma;
+ struct seq_file *m;
+ /* page range attrs */
+ unsigned long offset;
+ unsigned long len;
+ unsigned long flags;
+ int mapcount;
+};
+
+#define PMAPS_BUF_SIZE (64<<10) /* 64K */
+#define PMAPS_BATCH_SIZE (16<<20) /* 16M */
+
+#define PG_YOUNG PG_readahead /* reuse any non-relevant flag */
+#define PG_DIRTY PG_lru /* ditto */
+
+static unsigned long page_mask;
+
+static struct {
+ unsigned long mask;
+ const char *name;
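+ /* faked flags come from the pte bits, not from page->flags */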
+ bool faked;
+} page_flags[] = {
+ {1 << PG_YOUNG, "Y:pteyoung", 1},
+ {1 << PG_referenced, "R:referenced", 0},
+ {1 << PG_active, "A:active", 0},
+
+ {1 << PG_uptodate, "U:uptodate", 0},
+ {1 << PG_DIRTY, "P:ptedirty", 1},
+ {1 << PG_dirty, "D:dirty", 0},
+ {1 << PG_writeback, "W:writeback", 0},
+};
+
+static unsigned long pte_page_flags(pte_t ptent, struct page *page)
+{
+ unsigned long flags;
+
+ flags = page->flags & page_mask;
+
+ if (pte_young(ptent))
+ flags |= (1 << PG_YOUNG);
+
+ if (pte_dirty(ptent))
+ flags |= (1 << PG_DIRTY);
+
+ return flags;
+}
+
+static int pmaps_show_range(struct pmaps_private *pp)
+{
+ int i;
+
+ if (!pp->len)
+ return 0;
+
+ seq_printf(pp->m, "%lx\t%lx\t", pp->offset, pp->len);
+
+ for (i = 0; i < ARRAY_SIZE(page_flags); i++)
+ seq_putc(pp->m, (pp->flags & page_flags[i].mask) ?
+ page_flags[i].name[0] : '_');
+
+ return seq_printf(pp->m, "\t%x\n", pp->mapcount);
+}
+
+static int pmaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
+ void *private)
+{
+ struct pmaps_private *pp = private;
+ struct vm_area_struct *vma = pp->vma;
+ pte_t *pte, *apte, ptent;
+ spinlock_t *ptl;
+ struct page *page;
+ unsigned long offset;
+ unsigned long flags;
+ int mapcount;
+ int ret = 0;
+
+ apte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ for (; addr != end; pte++, addr += PAGE_SIZE) {
+ ptent = *pte;
+ if (!pte_present(ptent))
+ continue;
+
+ page = vm_normal_page(vma, addr, ptent);
+ if (!page)
+ continue;
+
+ /* test page similarity, then grow the range or show it */
+ offset = page_index(page);
+ mapcount = page_mapcount(page);
+ flags = pte_page_flags(ptent, page);
+ if (offset == pp->offset + pp->len &&
+ mapcount == pp->mapcount &&
+ flags == pp->flags) {
+ pp->len++;
+ } else {
+ ret = pmaps_show_range(pp);
+ if (ret)
+ break;
+ pp->offset = offset;
+ pp->len = 1;
+ pp->mapcount = mapcount;
+ pp->flags = flags;
+ }
+ }
+ pte_unmap_unlock(apte, ptl);
+ cond_resched();
+ return ret;
+}
+
+static struct mm_walk pmaps_walk = { .pmd_entry = pmaps_pte_range };
+static int show_pmaps(struct seq_file *m, void *v)
+{
+ struct vm_area_struct *vma = v;
+ struct pmaps_private *pp = m->private;
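+ /* m->version carries the address at which to resume between read batches */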
+ unsigned long addr = m->version;
+ unsigned long end;
+ int ret;
+
+ if (addr == vma->vm_start) {
+ ret = show_map(m, vma);
+ if (ret)
+ return ret;
+ }
+
+ end = vma->vm_end;
+ if (end - addr > PMAPS_BATCH_SIZE)
+ end = addr + PMAPS_BATCH_SIZE;
+
+ pp->m = m;
+ pp->vma = vma;
+ pp->len = 0;
+ walk_page_range(vma->vm_mm, addr, end, &pmaps_walk, pp);
+ pmaps_show_range(pp);
+
+ return 0;
+}
+
+static struct seq_operations proc_pid_pmaps_op = {
+ .start = m_start,
+ .next = m_next,
+ .stop = m_stop,
+ .show = show_pmaps
+};
+
+static int pmaps_open(struct inode *inode, struct file *file)
+{
+ return generic_maps_open(inode, file, &proc_pid_pmaps_op,
+ PMAPS_BATCH_SIZE, PMAPS_BUF_SIZE,
+ sizeof(struct pmaps_private));
+}
+
+const struct file_operations proc_pmaps_operations = {
+ .open = pmaps_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release_private,
+};
+
+static __init int task_mmu_init(void)
+{
+ int i;
+ for (page_mask = 0, i = 0; i < ARRAY_SIZE(page_flags); i++)
+ if (!page_flags[i].faked)
+ page_mask |= page_flags[i].mask;
+ return 0;
+}
+
+pure_initcall(task_mmu_init);
--- linux-2.6.23-rc2-mm2.orig/fs/proc/base.c
+++ linux-2.6.23-rc2-mm2/fs/proc/base.c
@@ -45,6 +45,11 @@
*
* Paul Mundt <[email protected]>:
* Overall revision about smaps.
+ *
+ * ChangeLog:
+ * 15-Aug-2007
+ * Fengguang Wu <[email protected]>:
+ * Page granularity mapping info in pmaps.
*/

#include <asm/uaccess.h>
@@ -2044,6 +2049,7 @@ static const struct pid_entry tgid_base_
#ifdef CONFIG_PROC_PAGE_MONITOR
REG("clear_refs", S_IWUSR, clear_refs),
REG("smaps", S_IRUGO, smaps),
+ REG("pmaps", S_IRUSR, pmaps),
REG("pagemap", S_IRUSR, pagemap),
#endif
#endif
@@ -2336,6 +2342,7 @@ static const struct pid_entry tid_base_s
#ifdef CONFIG_PROC_PAGE_MONITOR
REG("clear_refs", S_IWUSR, clear_refs),
REG("smaps", S_IRUGO, smaps),
+ REG("pmaps", S_IRUSR, pmaps),
REG("pagemap", S_IRUSR, pagemap),
#endif
#endif
--- linux-2.6.23-rc2-mm2.orig/fs/proc/internal.h
+++ linux-2.6.23-rc2-mm2/fs/proc/internal.h
@@ -50,6 +50,7 @@ extern loff_t mem_lseek(struct file * fi
extern const struct file_operations proc_maps_operations;
extern const struct file_operations proc_numa_maps_operations;
extern const struct file_operations proc_smaps_operations;
+extern const struct file_operations proc_pmaps_operations;
extern const struct file_operations proc_clear_refs_operations;
extern const struct file_operations proc_pagemap_operations;


--


2007-08-17 02:37:54

by Matt Mackall

Subject: Re: [PATCH 4/4] maps: /proc/<pid>/pmaps interface - memory maps in granularity of pages

On Fri, Aug 17, 2007 at 06:05:20AM +0800, Fengguang Wu wrote:
> Show a process's page-by-page address space information in /proc/<pid>/pmaps.
> It helps to analyze applications' memory footprints in a comprehensive way.
>
> Pages that share the same state are grouped into a page range.
> For each page range, the following fields are exported:
> - first page index
> - number of pages in the range
> - well known page/pte flags
> - number of mmap users
>
> Only page flags not expected to disappear in the near future are exported:
>
> Y:young R:referenced A:active U:uptodate P:ptedirty D:dirty W:writeback
...

> The concern about dataset size is addressed by working in a sparse way:
>
> 1) It will only generate output for resident pages, which is normally
> much smaller than the mapped size. Take my shell for example: the
> (size:rss) ratio is (7:1)!
>
> wfg ~% cat /proc/$$/smaps |grep Size|sum
> sum 50552.000
> avg 777.723
>
> wfg ~% cat /proc/$$/smaps |grep Rss|sum
> sum 7604.000
> avg 116.985
>
> 2) The page range trick suppresses more output.
>
> It's interesting to see that the seq_file interface demands some
> extra programming effort, yet provides real flexibility in return.

I'm so-so on this.

On the downside:

- requires lots of parsing
- isn't random-access
- probably significantly slower than pagemap

--
Mathematics is the supreme nostalgia of our time.

2007-08-17 03:45:00

by Wu Fengguang

Subject: Re: [PATCH 4/4] maps: /proc/<pid>/pmaps interface - memory maps in granularity of pages

On Thu, Aug 16, 2007 at 09:38:46PM -0500, Matt Mackall wrote:
> On Fri, Aug 17, 2007 at 06:05:20AM +0800, Fengguang Wu wrote:
> > Show a process's page-by-page address space information in /proc/<pid>/pmaps.
> > It helps to analyze applications' memory footprints in a comprehensive way.
> >
> > Pages that share the same state are grouped into a page range.
> > For each page range, the following fields are exported:
> > - first page index
> > - number of pages in the range
> > - well known page/pte flags
> > - number of mmap users
> >
> > Only page flags not expected to disappear in the near future are exported:
> >
> > Y:young R:referenced A:active U:uptodate P:ptedirty D:dirty W:writeback
> ...
>
> > The concern about dataset size is addressed by working in a sparse way:
> >
> > 1) It will only generate output for resident pages, which is normally
> > much smaller than the mapped size. Take my shell for example: the
> > (size:rss) ratio is (7:1)!
> >
> > wfg ~% cat /proc/$$/smaps |grep Size|sum
> > sum 50552.000
> > avg 777.723
> >
> > wfg ~% cat /proc/$$/smaps |grep Rss|sum
> > sum 7604.000
> > avg 116.985
> >
> > 2) The page range trick suppresses more output.
> >
> > It's interesting to see that the seq_file interface demands some
> > extra programming effort, yet provides real flexibility in return.
>
> I'm so-so on this.

Not that way! It's a good thing that people have different experiences
and hence viewpoints. Maybe the concept of PFN sharing is
straightforward to you, while I have been playing with seq_file a lot.

> On the downside:
>
> - requires lots of parsing
> - isn't random-access
> - probably significantly slower than pagemap

That could be true. Maybe some users with huge datasets will give us
an idea about the performance. I don't know; maybe it's
application-dependent.

Anyway I don't think it's fair to merge a binary interface without the
challenge from a textual one ;)

Thank you,
Fengguang

2007-08-17 03:55:56

by Matt Mackall

Subject: Re: [PATCH 4/4] maps: /proc/<pid>/pmaps interface - memory maps in granularity of pages

On Fri, Aug 17, 2007 at 11:44:37AM +0800, Fengguang Wu wrote:
> > I'm so-so on this.
>
> Not that way! It's a good thing that people have different experiences
> and hence viewpoints. Maybe the concept of PFN sharing is
> straightforward to you, while I have been playing with seq_file a lot.
>
> > On the downside:
> >
> > - requires lots of parsing
> > - isn't random-access
> > - probably significantly slower than pagemap
>
> That could be true. Maybe some users with huge datasets will give us
> an idea about the performance. I don't know; maybe it's
> application-dependent.
>
> Anyway I don't think it's fair to merge a binary interface without the
> challenge from a textual one ;)

Yes, that's why I didn't say I hated it.

--
Mathematics is the supreme nostalgia of our time.

2007-08-17 06:47:40

by Wu Fengguang

Subject: Re: [PATCH 4/4] maps: /proc/<pid>/pmaps interface - memory maps in granularity of pages

Matt,

It's not easy to do direct performance comparisons between pmaps and
pagemap/kpagemap. However, some close analyses are still possible :)

1) code size
pmaps ~200 LOC
pagemap/kpagemap ~300 LOC

2) dataset size
take for example my running firefox on Intel Core 2:
VSZ 400 MB
RSS 64 MB, or 16k pages
pmaps 64 KB, wc shows 2k lines, i.e. as many page ranges
pagemap 800 KB, could be heavily optimized by returning partial data
kpagemap 256 KB

3) runtime overheads
pmaps 2k lines of string processing (encode/decode)
kpagemap 16k seek()/read()s, and context switches (could be
optimized somewhat by doing a PFN sort first, but
that's also non-trivial overhead)

So pmaps seems to be a clear winner :)

Thank you,
Fengguang

2007-08-17 16:57:46

by Matt Mackall

Subject: Re: [PATCH 4/4] maps: /proc/<pid>/pmaps interface - memory maps in granularity of pages

On Fri, Aug 17, 2007 at 02:47:27PM +0800, Fengguang Wu wrote:
> Matt,
>
> It's not easy to do direct performance comparisons between pmaps and
> pagemap/kpagemap. However, some close analyses are still possible :)
>
> 1) code size
> pmaps ~200 LOC
> pagemap/kpagemap ~300 LOC
>
> 2) dataset size
> take for example my running firefox on Intel Core 2:
> VSZ 400 MB
> RSS 64 MB, or 16k pages
> pmaps 64 KB, wc shows 2k lines, i.e. as many page ranges
> pagemap 800 KB, could be heavily optimized by returning partial data

I take it you're in 64-bit mode?

You're right, this data compresses well in many circumstances. I
suspect it will suffer under memory pressure though. That will
fragment the ranges in-memory and also fragment the active bits. The
worst case here is huge, of course, but realistically I'd expect
something like 2x-4x.

But there are still the downsides I have mentioned:

- you don't get page frame numbers
- you can't do random access

And how long does it take to pull the data out? My benchmarks show
greater than 50MB/s (and that's with the version in -mm that's doing
double buffering), so that 800K would take < .016s.

> kpagemap 256 KB
>
> 3) runtime overheads
> > pmaps 2k lines of string processing (encode/decode)
> kpagemap 16k seek()/read()s, and context switches (could be
> > optimized somewhat by doing a PFN sort first, but
> > that's also non-trivial overhead)

You can do anywhere between 16k small reads or 1 large read. Depends
what data you're trying to get. Right now, kpagemap is fast enough
that I can do realtime displays of the whole of memory in my desktop
in a GUI written in Python. And Python is fairly horrible for drawing
bitmaps and such.

http://www.selenic.com/Screenshot-kpagemap.png

> So pmaps seems to be a clear winner :)

Except that it's only providing a subset of the data.

--
Mathematics is the supreme nostalgia of our time.

2007-08-18 02:48:46

by Wu Fengguang

Subject: Re: [PATCH 4/4] maps: /proc/<pid>/pmaps interface - memory maps in granularity of pages

Matt,

On Fri, Aug 17, 2007 at 11:58:08AM -0500, Matt Mackall wrote:
> On Fri, Aug 17, 2007 at 02:47:27PM +0800, Fengguang Wu wrote:
> > It's not easy to do direct performance comparisons between pmaps and
> > pagemap/kpagemap. However, some close analyses are still possible :)
> >
> > 1) code size
> > pmaps ~200 LOC
> > pagemap/kpagemap ~300 LOC
> >
> > 2) dataset size
> > take for example my running firefox on Intel Core 2:
> > VSZ 400 MB
> > RSS 64 MB, or 16k pages
> > pmaps 64 KB, wc shows 2k lines, i.e. as many page ranges
> > pagemap 800 KB, could be heavily optimized by returning partial data
>
> I take it you're in 64-bit mode?

Yes. That will be the common case.

> You're right, this data compresses well in many circumstances. I
> suspect it will suffer under memory pressure though. That will
> fragment the ranges in-memory and also fragment the active bits. The
> worst case here is huge, of course, but realistically I'd expect
> something like 2x-4x.

Not likely to degrade even under memory pressure ;)

The compress_ratio = (VSZ:RSS) * (RSS:page_ranges).
- On fresh startup with no memory pressure,
- the VSZ:RSS ratio of ALL processes is 4516796KB:457048KB ~= 10:1.
- the firefox case shows a (RSS:page_ranges) of 16k:2k ~= 8:1.
- On memory pressure,
- as VSZ goes up, RSS will be bounded by physical memory.
So VSZ:RSS ratio actually goes up with memory pressure.
- page ranges are a good unit of locality. They are more likely to be
reclaimed as a whole. So (RSS:page_ranges) wouldn't degrade as much.
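
Plugging in the measured numbers above as a worked example:

	compress_ratio = (VSZ:RSS) * (RSS:page_ranges) ~= 10 * 8 = 80

i.e. one line of pmaps output covers on the order of 80 mapped pages.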

> But there are still the downsides I have mentioned:
>
> - you don't get page frame numbers

True. I guess PFNs are meaningless to a normal user?

> - you can't do random access

Not for now.

It would be trivial to support seek-by-address semantics: the seq_file
operations already iterate by address. We just cannot do it via the
regular read/pread/seek interfaces, which have different semantics
for fpos. However, tricks like ioctl(begin_addr, end_addr) can be
employed if necessary.
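
A purely hypothetical sketch of that ioctl trick (neither the command
number nor the struct exist anywhere; this only shows the shape such
an interface could take):

	#include <sys/ioctl.h>

	struct pmaps_window {
		unsigned long begin_addr;
		unsigned long end_addr;
	};

	/* hypothetical command number, for illustration only */
	#define PMAPS_IOC_SET_WINDOW _IOW('x', 1, struct pmaps_window)

	static int pmaps_set_window(int fd, unsigned long begin,
				    unsigned long end)
	{
		struct pmaps_window w = { begin, end };

		/* subsequent read()s would only return ranges in [begin, end) */
		return ioctl(fd, PMAPS_IOC_SET_WINDOW, &w);
	}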

> And how long does it take to pull the data out? My benchmarks show
> greater than 50MB/s (and that's with the version in -mm that's doing
> double buffering), so that 800K would take < .016s.

You are right :)

> > kpagemap 256 KB
> >
> > 3) runtime overheads
> > pmaps 2k lines of string processing (encode/decode)
> > kpagemap 16k seek()/read()s, and context switches (could be
> > optimized somewhat by doing a PFN sort first, but
> > that's also non-trivial overhead)
>
> You can do anywhere between 16k small reads or 1 large read. Depends

No way to avoid the seeks if PFNs are discontinuous. Too bad the
memory gets fragmented with uptime, at least for the current kernel.

But sure, sequential reads are viable when doing whole system memory
analysis, or for memory hog processes.

> what data you're trying to get. Right now, kpagemap is fast enough
> that I can do realtime displays of the whole of memory in my desktop
> in a GUI written in Python. And Python is fairly horrible for drawing
> bitmaps and such.
>
> http://www.selenic.com/Screenshot-kpagemap.png
>
> > So pmaps seems to be a clear winner :)
>
> Except that it's only providing a subset of the data.

Yes, and it's a nice graph :)

2007-08-18 06:39:50

by Matt Mackall

Subject: Re: [PATCH 4/4] maps: /proc/<pid>/pmaps interface - memory maps in granularity of pages

On Sat, Aug 18, 2007 at 10:48:31AM +0800, Fengguang Wu wrote:
> Matt,
>
> On Fri, Aug 17, 2007 at 11:58:08AM -0500, Matt Mackall wrote:
> > On Fri, Aug 17, 2007 at 02:47:27PM +0800, Fengguang Wu wrote:
> > > It's not easy to do direct performance comparisons between pmaps and
> > > pagemap/kpagemap. However, some close analyses are still possible :)
> > >
> > > 1) code size
> > > pmaps ~200 LOC
> > > pagemap/kpagemap ~300 LOC
> > >
> > > 2) dataset size
> > > take for example my running firefox on Intel Core 2:
> > > VSZ 400 MB
> > > RSS 64 MB, or 16k pages
> > > pmaps 64 KB, wc shows 2k lines, i.e. as many page ranges
> > > pagemap 800 KB, could be heavily optimized by returning partial data
> >
> > I take it you're in 64-bit mode?
>
> Yes. That will be the common case.
>
> > You're right, this data compresses well in many circumstances. I
> > suspect it will suffer under memory pressure though. That will
> > fragment the ranges in-memory and also fragment the active bits. The
> > worst case here is huge, of course, but realistically I'd expect
> > something like 2x-4x.
>
> Not likely to degrade even under memory pressure ;)
>
> The compress_ratio = (VSZ:RSS) * (RSS:page_ranges).
> - On fresh startup with no memory pressure,
> - the VSZ:RSS ratio of ALL processes is 4516796KB:457048KB ~= 10:1.
> - the firefox case shows a (RSS:page_ranges) of 16k:2k ~= 8:1.

Yes.

> - On memory pressure,
> - as VSZ goes up, RSS will be bounded by physical memory.
> So VSZ:RSS ratio actually goes up with memory pressure.

And yes.

But that's not what I'm talking about. You're likely to have more
holes in your ranges with memory pressure as things that aren't active
get paged or swapped out and back in. And because we're walking the
LRU more rapidly, we'll flip over a lot of the active bits more often
which will mean more output.

> - page ranges are a good unit of locality. They are more likely to be
> reclaimed as a whole. So (RSS:page_ranges) wouldn't degrade as much.

There is that. The relative magnitude of the different effects is
unclear. But it is clear that the worst case for pmap is much worse
than pagemap (two lines per page of RSS?).

> > But there are still the downsides I have mentioned:
> >
> > - you don't get page frame numbers
>
> True. I guess PFNs are meaningless to a normal user?

They're useful for anyone who's trying to look at the system as a
whole.

> > - you can't do random access
>
> Not for now.
>
> It would be trivial to support seek-by-address semantics: the seq_file
> operations already iterate by address. We just cannot do it via the
> regular read/pread/seek interfaces, which have different semantics
> for fpos. However, tricks like ioctl(begin_addr, end_addr) can be
> employed if necessary.

I suppose. But if you're willing to stomach that sort of thing, you
might as well use a simple binary interface.

--
Mathematics is the supreme nostalgia of our time.

2007-08-18 08:45:46

by Wu Fengguang

Subject: Re: [PATCH 4/4] maps: /proc/<pid>/pmaps interface - memory maps in granularity of pages

Matt,

On Sat, Aug 18, 2007 at 01:40:42AM -0500, Matt Mackall wrote:
> > - On memory pressure,
> > - as VSZ goes up, RSS will be bounded by physical memory.
> > So VSZ:RSS ratio actually goes up with memory pressure.
>
> And yes.
>
> But that's not what I'm talking about. You're likely to have more
> holes in your ranges with memory pressure as things that aren't active
> get paged or swapped out and back in. And because we're walking the
> LRU more rapidly, we'll flip over a lot of the active bits more often
> which will mean more output.
>
> > - page ranges are a good unit of locality. They are more likely to be
> > reclaimed as a whole. So (RSS:page_ranges) wouldn't degrade as much.
>
> There is that. The relative magnitude of the different effects is
> unclear. But it is clear that the worst case for pmap is much worse

> than pagemap (two lines per page of RSS?).
It's one line per page. No sane app will make VMAs proliferate.

So let's talk about the worst case.

pagemap's data set size is determined by VSZ.
4GB VSZ means 1M PFNs, hence 8MB pagemap data.

pmaps' data set size is bounded by RSS, hence by physical memory.
4GB RSS means up to 1M page ranges, hence ~20MB of pmaps data.
Not too bad :)

> > > But there are still the downsides I have mentioned:
> > >
> > > - you don't get page frame numbers
> >
> > True. I guess PFNs are meaningless to a normal user?
>
> They're useful for anyone who's trying to look at the system as a
> whole.
>
> > > - you can't do random access
> >
> > Not for now.
> >
> > It would be trivial to support seek-by-address semantics: the seq_file
> > operations already iterate by address. We just cannot do it via the
> > regular read/pread/seek interfaces, which have different semantics
> > for fpos. However, tricks like ioctl(begin_addr, end_addr) can be
> > employed if necessary.
>
> I suppose. But if you're willing to stomach that sort of thing, you
> might as well use a simple binary interface.

Python can do ioctl() :)

Anyway it's already a special interface.

2007-08-18 10:31:59

by Wu Fengguang

Subject: Re: [PATCH 4/4] maps: /proc/<pid>/pmaps interface - memory maps in granularity of pages

On Sat, Aug 18, 2007 at 01:40:42AM -0500, Matt Mackall wrote:
> > > - you don't get page frame numbers
> >
> > True. I guess PFNs are meaningless to a normal user?
>
> They're useful for anyone who's trying to look at the system as a
> whole.

To answer the question "who is sharing this page with me?":
PFNs are not the only option. The dev/ino/offset tuple can also
uniquely identify the shared page :)
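
Sketched as a lookup key (the struct is invented here, just to make
the idea concrete):

	#include <sys/types.h>

	/* identifies a shared file page without exposing its PFN */
	struct shared_page_key {
		dev_t dev;		/* device, from the maps header line */
		ino_t ino;		/* inode, from the maps header line */
		unsigned long pgoff;	/* page index within the file */
	};

Two mappings refer to the same physical page when their keys compare
equal (for file-backed, linearly mapped pages, at least).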

2007-08-18 17:21:35

by Matt Mackall

Subject: Re: [PATCH 4/4] maps: /proc/<pid>/pmaps interface - memory maps in granularity of pages

On Sat, Aug 18, 2007 at 04:45:31PM +0800, Fengguang Wu wrote:
> Matt,
>
> On Sat, Aug 18, 2007 at 01:40:42AM -0500, Matt Mackall wrote:
> > > - On memory pressure,
> > > - as VSZ goes up, RSS will be bounded by physical memory.
> > > So VSZ:RSS ratio actually goes up with memory pressure.
> >
> > And yes.
> >
> > But that's not what I'm talking about. You're likely to have more
> > holes in your ranges with memory pressure as things that aren't active
> > get paged or swapped out and back in. And because we're walking the
> > LRU more rapidly, we'll flip over a lot of the active bits more often
> > which will mean more output.
> >
> > > - page ranges are a good unit of locality. They are more likely to be
> > > reclaimed as a whole. So (RSS:page_ranges) wouldn't degrade as much.
> >
> > There is that. The relative magnitude of the different effects is
> > unclear. But it is clear that the worst case for pmap is much worse
>
> > than pagemap (two lines per page of RSS?).
> It's one line per page. No sane app will make VMAs proliferate.

Sane apps are few and far between.

> So let's talk about the worst case.
>
> pagemap's data set size is determined by VSZ.
> 4GB VSZ means 1M PFNs, hence 8MB pagemap data.
>
> pmaps' data set size is bounded by RSS, hence by physical memory.
> 4GB RSS means up to 1M page ranges, hence ~20MB of pmaps data.
> Not too bad :)

Hmmm, I've been misreading the output.

What does it do with nonlinear VMAs?

--
Mathematics is the supreme nostalgia of our time.

2007-08-19 00:40:29

by Wu Fengguang

Subject: Re: [PATCH 4/4] maps: /proc/<pid>/pmaps interface - memory maps in granularity of pages

On Sat, Aug 18, 2007 at 12:22:26PM -0500, Matt Mackall wrote:
> > > > So VSZ:RSS ratio actually goes up with memory pressure.
> > >
> > > And yes.
> > >
> > > But that's not what I'm talking about. You're likely to have more
> > > holes in your ranges with memory pressure as things that aren't active
> > > get paged or swapped out and back in. And because we're walking the
> > > LRU more rapidly, we'll flip over a lot of the active bits more often
> > > which will mean more output.
> > >
> > > > - page ranges are a good unit of locality. They are more likely to be
> > > > reclaimed as a whole. So (RSS:page_ranges) wouldn't degrade as much.
> > >
> > > There is that. The relative magnitude of the different effects is
> > > unclear. But it is clear that the worst case for pmap is much worse
> >
> > > than pagemap (two lines per page of RSS?).
> > It's one line per page. No sane app will make VMAs proliferate.
>
> Sane apps are few and far between.

Very likely, and they will bloat maps/smaps/pmaps alike :(

> > So let's talk about the worst case.
> >
> > pagemap's data set size is determined by VSZ.
> > 4GB VSZ means 1M PFNs, hence 8MB pagemap data.
> >
> > pmaps' data set size is bounded by RSS, hence by physical memory.
> > 4GB RSS means up to 1M page ranges, hence ~20MB of pmaps data.
> > Not too bad :)
>
> Hmmm, I've been misreading the output.
>
> What does it do with nonlinear VMAs?

The implementation gets the offset from page_index(page), so it works
the same way for linear and nonlinear VMAs. Depending on how one does
the remap_file_pages() calls, the output lines may not be strictly
ordered by offset, may overlap, or may contain small page ranges.
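
To make that concrete, here is a minimal sketch of a nonlinear mapping
that would produce out-of-order offsets (illustrative only, error
handling omitted):

	#define _GNU_SOURCE
	#include <sys/mman.h>

	void make_nonlinear(int fd)
	{
		size_t pg = 4096;
		char *p = mmap(NULL, 4 * pg, PROT_READ, MAP_SHARED, fd, 0);

		/*
		 * Put file page 3 at the VMA's first virtual page: pmaps
		 * would then report offset 3 ahead of offsets 1 and 2.
		 */
		remap_file_pages(p, pg, 0, 3, 0);
	}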