2005-11-30 18:57:03

by Badari Pulavarty

Subject: Better pagecache statistics ?

Hi,

Are there any efforts or patches underway to provide better pagecache
statistics?

Basically, I am interested in a detailed breakdown of the
cached pages ("Cached" in /proc/meminfo).

Out of these "cached pages":

- How much is just file system cache (regular file data)?
- How much is shared memory pages?
- How much is mmap()ed stuff?
- How much is for text, data, bss, heap, malloc?

What is the right way of getting this kind of data?
I was trying to add tags when we do add_to_page_cache()
and it quickly got ugly :(

Thanks,
Badari


2005-12-01 02:35:54

by Hareesh Nagarajan

Subject: Re: Better pagecache statistics ?

Hi,

Badari Pulavarty wrote:
> - How much is just file system cache (regular file data) ?

This is just a thought of mine:
/proc/slabinfo?

> - How much is shared memory pages ?
> - How much is mmaped() stuff ?

cat /proc/vmstat | grep nr_mapped
nr_mapped 77105

But yes, this doesn't give you a detailed account.

> - How much is for text, data, bss, heap, malloc ?

Again, this is just a thought of mine: couldn't you get this information
from /proc/<pid>/maps, or from the nicer and easier-to-parse procps
tool, pmap <pid>?

Thanks,

Hareesh

2005-12-01 15:20:37

by Marcelo Tosatti

Subject: Re: Better pagecache statistics ?


Hi Badari,

On Wed, Nov 30, 2005 at 10:57:09AM -0800, Badari Pulavarty wrote:
> Hi,
>
> Is there a effort/patches underway to provide better pagecache
> statistics ?
>
> Basically, I am interested in finding detailed break out of
> cached pages. ("Cached" in /proc/meminfo)
>
> Out of this "cached pages"
>
> - How much is just file system cache (regular file data) ?
> - How much is shared memory pages ?

You could probably do that from userspace, by doing some math
on all processes' statistics versus the global stats, but it does not
seem very practical.

> - How much is mmaped() stuff ?

That would be "nr_mapped".

> - How much is for text, data, bss, heap, malloc ?

Hum, the core pagecache code does not deal with such details,
so adding (and maintaining) accounting there does not seem very
practical either.

You could walk /proc/<pid>/{maps,smaps} and account for different
types of pages.

$ cat /proc/self/smaps

bf8df000-bf8f4000 rw-p bf8df000 00:00 0 [stack]
Size: 84 kB
Rss: 8 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 8 kB

0975b000-0977c000 rw-p 0975b000 00:00 0 [heap]
Size: 132 kB
Rss: 4 kB
Shared_Clean: 0 kB
Shared_Dirty: 4 kB
Private_Clean: 0 kB
Private_Dirty: 0 kB

But doing it from userspace does not guarantee much precision
since the state can change while walking the proc stats.

> What is the right way of getting this kind of data ?
> I was trying to add tags when we do add_to_page_cache()
> and quickly got ugly :(

The problem is that any kind of information may be valuable,
depending on what you're trying to do.

For example, one might want to break statistics in /proc/vmstat
and /proc/meminfo on a per-zone basis (for instance there is no
per-zone "locked" accounting at the moment), per-uid basis,
per-process basis, or whatever.

Other than the pagecache stats you mention, there is a
general lack of numbers in the MM code.

I think that SystemTap suits the requirement for creating
detailed MM statistics, allowing hooks to be created outside the
kernel in an easy manner. Hooks can be inserted on demand.

I just started playing with SystemTap yesterday. The first
thing I want to record is "what is the latency of
direct reclaim".

2005-12-01 15:59:42

by Badari Pulavarty

Subject: Re: Better pagecache statistics ?

On Thu, 2005-12-01 at 13:20 -0200, Marcelo Tosatti wrote:
> Hi Badari,
>
> On Wed, Nov 30, 2005 at 10:57:09AM -0800, Badari Pulavarty wrote:
> > Hi,
> >
> > Is there a effort/patches underway to provide better pagecache
> > statistics ?
> >
> > Basically, I am interested in finding detailed break out of
> > cached pages. ("Cached" in /proc/meminfo)
> >
> > Out of this "cached pages"
> >
> > - How much is just file system cache (regular file data) ?
> > - How much is shared memory pages ?
>
> You could do that from userspace probably, by doing some math
> on all processes statistics versus global stats, but does not
> seem very practical.
>
> > - How much is mmaped() stuff ?
>
> That would be "nr_mapped".
>
> > - How much is for text, data, bss, heap, malloc ?
>
> Hum, the core pagecache code does not deal with such details,
> so adding (and maintaining) accounting there does not seem very
> practical either.
>
> You could walk /proc/<pid>/{maps,smaps} and account for different
> types of pages.
>
> $ cat /proc/self/smaps
>
> bf8df000-bf8f4000 rw-p bf8df000 00:00 0 [stack]
> Size: 84 kB
> Rss: 8 kB
> Shared_Clean: 0 kB
> Shared_Dirty: 0 kB
> Private_Clean: 0 kB
> Private_Dirty: 8 kB
>
> 0975b000-0977c000 rw-p 0975b000 00:00 0 [heap]
> Size: 132 kB
> Rss: 4 kB
> Shared_Clean: 0 kB
> Shared_Dirty: 4 kB
> Private_Clean: 0 kB
> Private_Dirty: 0 kB
>
> But doing it from userspace does not guarantee much precision
> since the state can change while walking the proc stats.
>
> > What is the right way of getting this kind of data ?
> > I was trying to add tags when we do add_to_page_cache()
> > and quickly got ugly :(
>
> Problem is that any kind of information maybe be valuable,
> depending on what you're trying to do.
>
> For example, one might want to break statistics in /proc/vmstat
> and /proc/meminfo on a per-zone basis (for instance there is no
> per-zone "locked" accounting at the moment), per-uid basis,
> per-process basis, or whatever.
>
> Other than the pagecache stats you mention, there is a
> general lack of numbers in the MM code.
>
> I think that SystemTap suits the requirement for creation
> of detailed MM statistics, allowing creation of hooks outside the
> kernel in an easy manner. Hooks can be inserted on demand.
>
> I just started playing with SystemTap yesterday. First
> thing I want to record is "what is the latency of
> direct reclaim".
>
>

Hi Marcelo,

Let me give you background on why I am looking at this.

I have been involved in various database customer situations.
Most times, the machine is either extremely sluggish or dying.
The only hints we get from /proc/meminfo, /proc/slabinfo, vmstat
etc. are that there is lots of stuff in "Cached" and the system is
heavily swapping. I want to find out what's getting swapped out,
what's eating up all the pagecache, what's getting into the cache,
what's getting out of the cache, etc. I find no easy way to get
this kind of information.

Database folks complain that the file cache causes them the most
trouble. Even when they use DIO on their tables and such, random apps
(ftp, scp, tar, etc.) bloat the pagecache and kick out database
pools, shared memory, malloc'd memory, etc., causing lots of trouble for them.

I want to understand more before I try to fix it. The first step would
be to get better stats from the pagecache and evaluate what's happening,
to get a better handle on the problem.

BTW, I am very familiar with kprobes/jprobes and SystemTap.
I have been playing with them for at least 8 months :) There is
no easy way to do this unless the stats are already maintained in the kernel.

My final goal is to get stats like ..

Out of "Cached" value - to get details like

<mmap> - xxx KB
<shared mem> - xxx KB
<text, data, bss, malloc, heap, stacks> - xxx KB
<filecache pages total> -- xxx KB
(filename1 or <dev>, <ino>) -- #of pages
(filename2 or <dev>, <ino>) -- #of pages

This would be really powerful for understanding the system better.

Don't you think ?


Thanks,
Badari

2005-12-01 16:00:54

by Marcelo Tosatti

Subject: Re: Better pagecache statistics ?

On Thu, Dec 01, 2005 at 01:20:29PM -0200, Marcelo Tosatti wrote:
>
> Hi Badari,
>
> On Wed, Nov 30, 2005 at 10:57:09AM -0800, Badari Pulavarty wrote:
> > Hi,
> >
> > Is there a effort/patches underway to provide better pagecache
> > statistics ?
> >
> > Basically, I am interested in finding detailed break out of
> > cached pages. ("Cached" in /proc/meminfo)
> >
> > Out of this "cached pages"
> >
> > - How much is just file system cache (regular file data) ?
> > - How much is shared memory pages ?
>
> You could do that from userspace probably, by doing some math
> on all processes statistics versus global stats, but does not
> seem very practical.

Actually, SysRQ-M reports "N pages shared".

2005-12-01 16:10:26

by Arjan van de Ven

Subject: Re: Better pagecache statistics ?


> Out of "Cached" value - to get details like
>
> <mmap> - xxx KB
> <shared mem> - xxx KB
> <text, data, bss, malloc, heap, stacks> - xxx KB
> <filecache pages total> -- xxx KB
> (filename1 or <dev>, <ino>) -- #of pages
> (filename2 or <dev>, <ino>) -- #of pages
>
> This would be really powerful on understanding system better.

To some extent it might be useful.
I have a few concerns though:
1) If we make these stats into an ABI then it becomes harder to change
the architecture of the VM radically since such concepts may not even
exist in the new architecture. As long as this is some sort of advisory,
humans-only file I think this isn't too much of a big deal though.

2) Not all the concepts you mention really exist as far as the kernel is
concerned. I mean... an mmap'ed file is file cache, etc.
malloc/heap/stacks are also not differentiated much and are mostly
userspace policy (especially thread stacks).

A split in
* non-file backed
- mapped once
- mapped more than once
* file backed
- mapped at least once
- not mapped
I can see as being meaningful. Assigning meaning to it beyond this is
dangerous; that is more an interpretation of the policy userspace
happens to use for things and I think coding that into the kernel is a
mistake.

Knowing which files are in memory, and how much of each, is potentially
quite useful as a debug feature for VM hackers to see how well the various VM
algorithms work. I'm concerned about the performance impact (e.g. you can
do it only once a day or so, not every 10 seconds) and about how to get
this data out in a consistent way (after all, spewing this amount of
debug info will in itself impact the VM balances).

Greetings,
Arjan van de Ven

2005-12-01 16:23:40

by Badari Pulavarty

Subject: Re: Better pagecache statistics ?

On Thu, 2005-12-01 at 17:10 +0100, Arjan van de Ven wrote:
> > Out of "Cached" value - to get details like
> >
> > <mmap> - xxx KB
> > <shared mem> - xxx KB
> > <text, data, bss, malloc, heap, stacks> - xxx KB
> > <filecache pages total> -- xxx KB
> > (filename1 or <dev>, <ino>) -- #of pages
> > (filename2 or <dev>, <ino>) -- #of pages
> >
> > This would be really powerful on understanding system better.
>
> to some extend it might be useful.
> I have a few concerns though
> 1) If we make these stats into an ABI then it becomes harder to change
> the architecture of the VM radically since such concepts may not even
> exist in the new architecture. As long as this is some sort of advisory,
> humans-only file I think this isn't too much of a big deal though.

ABI or API? Yuck. I am thinking more of a /proc/cachedetail kind of
thing (mostly for debug).

>
> 2) not all the concepts you mention really exist as far as the kernel is
> concerned. I mean.. a mmap file is file cache is .. etc.
> malloc/heap/stacks are also not differentiated too much and are mostly
> userspace policy (especially thread stacks).
>
> A split in
> * non-file backed
> - mapped once
> - mapped more than once
> * file backed
> - mapped at least once
> - not mapped
> I can see as being meaningful. Assigning meaning to it beyond this is
> dangerous; that is more an interpretation of the policy userspace
> happens to use for things and I think coding that into the kernel is a
> mistake.

You are right. I think the kernel should NOT try to distinguish more than
that. This would at least separate out most of the stuff anyway (with a
few exceptions).

>
> Knowing which files are in memory how much is, as debug feature,
> potentially quite useful for VM hackers to see how well the various VM
> algorithms work. I'm concerned about the performance impact (eg you can
> do it only once a day or so, not every 10 seconds) and about how to get
> this data out in a consistent way (after all, spewing this amount of
> debug info will in itself impact the vm balances)

Yes, I am worried about that too. We do have "mapping->nrpages" to
tell how many of a file's pages are in the cache. But getting to all mappings
(especially the ones we can't reach from any process - closed files)
and printing them is quite expensive and needs locking :(

Thanks,
Badari

2005-12-01 17:09:17

by Marcelo Tosatti

Subject: Re: Better pagecache statistics ?

On Thu, Dec 01, 2005 at 05:10:11PM +0100, Arjan van de Ven wrote:
> > Out of "Cached" value - to get details like
> >
> > <mmap> - xxx KB
> > <shared mem> - xxx KB
> > <text, data, bss, malloc, heap, stacks> - xxx KB
> > <filecache pages total> -- xxx KB
> > (filename1 or <dev>, <ino>) -- #of pages
> > (filename2 or <dev>, <ino>) -- #of pages
> >
> > This would be really powerful on understanding system better.
>
> to some extend it might be useful.
> I have a few concerns though
> 1) If we make these stats into an ABI then it becomes harder to change
> the architecture of the VM radically since such concepts may not even
> exist in the new architecture. As long as this is some sort of advisory,
> humans-only file I think this isn't too much of a big deal though.
>
> 2) not all the concepts you mention really exist as far as the kernel is
> concerned. I mean.. a mmap file is file cache is .. etc.
> malloc/heap/stacks are also not differentiated too much and are mostly
> userspace policy (especially thread stacks).
>
> A split in
> * non-file backed
> - mapped once
> - mapped more than once
> * file backed
> - mapped at least once
> - not mapped
> I can see as being meaningful. Assigning meaning to it beyond this is
> dangerous; that is more an interpretation of the policy userspace
> happens to use for things and I think coding that into the kernel is a
> mistake.
>
> Knowing which files are in memory how much is, as debug feature,
> potentially quite useful for VM hackers to see how well the various VM
> algorithms work. I'm concerned about the performance impact (eg you can
> do it only once a day or so, not every 10 seconds) and about how to get
> this data out in a consistent way (after all, spewing this amount of
> debug info will in itself impact the vm balances)

Most of the issues you mention are null if you move the stats
maintenance burden to userspace.

The performance impact is also minimized since the hooks
(read: overhead) can be loaded on-demand as needed.

2005-12-01 17:15:08

by Badari Pulavarty

Subject: Re: Better pagecache statistics ?

On Thu, 2005-12-01 at 15:08 -0200, Marcelo Tosatti wrote:
> On Thu, Dec 01, 2005 at 05:10:11PM +0100, Arjan van de Ven wrote:
> > > Out of "Cached" value - to get details like
> > >
> > > <mmap> - xxx KB
> > > <shared mem> - xxx KB
> > > <text, data, bss, malloc, heap, stacks> - xxx KB
> > > <filecache pages total> -- xxx KB
> > > (filename1 or <dev>, <ino>) -- #of pages
> > > (filename2 or <dev>, <ino>) -- #of pages
> > >
> > > This would be really powerful on understanding system better.
> >
> > to some extend it might be useful.
> > I have a few concerns though
> > 1) If we make these stats into an ABI then it becomes harder to change
> > the architecture of the VM radically since such concepts may not even
> > exist in the new architecture. As long as this is some sort of advisory,
> > humans-only file I think this isn't too much of a big deal though.
> >
> > 2) not all the concepts you mention really exist as far as the kernel is
> > concerned. I mean.. a mmap file is file cache is .. etc.
> > malloc/heap/stacks are also not differentiated too much and are mostly
> > userspace policy (especially thread stacks).
> >
> > A split in
> > * non-file backed
> > - mapped once
> > - mapped more than once
> > * file backed
> > - mapped at least once
> > - not mapped
> > I can see as being meaningful. Assigning meaning to it beyond this is
> > dangerous; that is more an interpretation of the policy userspace
> > happens to use for things and I think coding that into the kernel is a
> > mistake.
> >
> > Knowing which files are in memory how much is, as debug feature,
> > potentially quite useful for VM hackers to see how well the various VM
> > algorithms work. I'm concerned about the performance impact (eg you can
> > do it only once a day or so, not every 10 seconds) and about how to get
> > this data out in a consistent way (after all, spewing this amount of
> > debug info will in itself impact the vm balances)
>
> Most of the issues you mention are null if you move the stats
> maintenance burden to userspace.
>
> The performance impact is also minimized since the hooks
> (read: overhead) can be loaded on-demand as needed.
>

The overhead is going through each mapping/inode in the system
and dumping out "nrpages" to get per-file statistics. This is
going to be expensive, needs locking, and there is no single list
we can traverse to get it. I am not sure how to do this.

Thanks,
Badari

2005-12-01 17:19:56

by Marcelo Tosatti

Subject: Re: Better pagecache statistics ?


> Hi Marcelo,
>
> Let me give you background on why I am looking at this.
>
> I have been involved in various database customer situations.
> Most times, machine is either extreemly sluggish or dying.
> Only hints we get from /proc/meminfo, /proc/slabinfo, vmstat
> etc is - lots of stuff in "Cache" and system is heavily swapping.
> I want to find out whats getting swapped out and whats eating up
> all the pagecache., whats getting into cache, whats getting out
> of cache etc.. I find no easy way to get this kind of information.

Someone recently wrote a patch to record such information (pagecache
insertion/eviction, etc.); I don't remember who did it, though. Rik?

> Database folks complain that filecache causes them most trouble.
> Even when they use DIO on their tables & stuff, random apps (ftp,
> scp, tar etc..) bloats the pagecache and kicks out database
> pools, shared mem, malloc etc - causing lots of trouble for them.

The LRU lacks frequency information, which is crucial for avoiding
that kind of problem.

http://www.linux-mm.org/AdvancedPageReplacement

Peter Zijlstra is working on implementing CLOCK-Pro, which uses the
inter-reference distance between accesses to a page, instead of a "least
recently used" metric, for page replacement decisions. He just published
results of the "mdb" (mini-db) benchmark at http://www.linux-mm.org/PeterZClockPro2.

Read more about the "mdb" benchmark at
http://www.linux-mm.org/PageReplacementTesting.

But that's off-topic :)

> I want to understand more before I try to fix it. First step would
> be to get better stats from pagecache and evaluate whats happening
> to get a better handle on the problem.
>
> BTW, I am very well familiar with kprobes/jprobes & systemtap.
> I have been playing with them for at least 8 months :) There is
> no easy way to do this, unless stats are already in the kernel.

I thought that it would be easy to use SystemTap for such
a purpose?

The sys_read/sys_write example at
http://www.redhat.com/magazine/011sep05/features/systemtap/ sounds
interesting.

What am I missing?

> My final goal is to get stats like ..
>
> Out of "Cached" value - to get details like
>
> <mmap> - xxx KB
> <shared mem> - xxx KB
> <text, data, bss, malloc, heap, stacks> - xxx KB
> <filecache pages total> -- xxx KB
> (filename1 or <dev>, <ino>) -- #of pages
> (filename2 or <dev>, <ino>) -- #of pages
>
> This would be really powerful on understanding system better.
>
> Don't you think ?

Yep... /proc/<pid>/smaps provides that information on a per-process
basis already.


2005-12-01 17:21:49

by Arjan van de Ven

Subject: Re: Better pagecache statistics ?

On Thu, 2005-12-01 at 09:15 -0800, Badari Pulavarty wrote:
> > Most of the issues you mention are null if you move the stats
> > maintenance burden to userspace.
> >
> > The performance impact is also minimized since the hooks
> > (read: overhead) can be loaded on-demand as needed.
> >
>
> The overhead is - going through each mapping/inode in the system
> and dumping out "nrpages" - to get per-file statistics. This is
> going to be expensive, need locking and there is no single list
> we can traverse to get it. I am not sure how to do this.

And worse... you're going to need memory to store the results, either in
the kernel or in userspace, and you don't know how much until you're done.
That memory is going to need to be allocated, which in turn changes the
VM state...


2005-12-01 17:31:44

by Badari Pulavarty

Subject: Re: Better pagecache statistics ?

On Thu, 2005-12-01 at 15:19 -0200, Marcelo Tosatti wrote:
> > Hi Marcelo,
> >
> > Let me give you background on why I am looking at this.
> >
> > I have been involved in various database customer situations.
> > Most times, machine is either extreemly sluggish or dying.
> > Only hints we get from /proc/meminfo, /proc/slabinfo, vmstat
> > etc is - lots of stuff in "Cache" and system is heavily swapping.
> > I want to find out whats getting swapped out and whats eating up
> > all the pagecache., whats getting into cache, whats getting out
> > of cache etc.. I find no easy way to get this kind of information.
>
> Someone recently wrote a patch to record such information (pagecache
> insertion/eviction, etc), don't remember who did though. Rik?
>
> > Database folks complain that filecache causes them most trouble.
> > Even when they use DIO on their tables & stuff, random apps (ftp,
> > scp, tar etc..) bloats the pagecache and kicks out database
> > pools, shared mem, malloc etc - causing lots of trouble for them.
>
> LRU lacks frequency information, which is crucial for avoiding
> such kind of problems.
>
> http://www.linux-mm.org/AdvancedPageReplacement
>
> Peter Zijlstra is working on implementing CLOCK-Pro, which uses
> inter reference distance between accesses to a page instead of "least
> recently used" metric for page replacement decision. He just published
> results of "mdb" (mini-db) benchmark at http://www.linux-mm.org/PeterZClockPro2.
>
> Read more about the "mdb" benchmark at
> http://www.linux-mm.org/PageReplacementTesting.
>
> But thats offtopic :)
>
> > I want to understand more before I try to fix it. First step would
> > be to get better stats from pagecache and evaluate whats happening
> > to get a better handle on the problem.
> >
> > BTW, I am very well familiar with kprobes/jprobes & systemtap.
> > I have been playing with them for at least 8 months :) There is
> > no easy way to do this, unless stats are already in the kernel.
>
> I thought that it would be easy to use SystemTap for a such
> a purpose?
>
> The sys_read/sys_write example at
> http://www.redhat.com/magazine/011sep05/features/systemtap/ sounds
> interesting.
>
> What I'm I missing?

Well, a few things:

1) We have to have those probes present in the system all the time,
collecting the information when reads/writes happen, maintaining it
and spitting it out. Since it's a kernel probe, all this data will be
kept in the kernel.

2) If we want to do this accounting and we don't have those probes
installed already, we can't capture what happened earlier.

3) Probing sys_read()/sys_write() is going to tell you how much
data a process read or wrote, but it's not going to tell you
how much is in the cache (now or 10 minutes later).

>
> > My final goal is to get stats like ..
> >
> > Out of "Cached" value - to get details like
> >
> > <mmap> - xxx KB
> > <shared mem> - xxx KB
> > <text, data, bss, malloc, heap, stacks> - xxx KB
> > <filecache pages total> -- xxx KB
> > (filename1 or <dev>, <ino>) -- #of pages
> > (filename2 or <dev>, <ino>) -- #of pages
> >
> > This would be really powerful on understanding system better.
> >
> > Don't you think ?
>
> Yep... /proc/<pid>/smaps provides that information on a per-process
> basis already.

/proc/<pid>/smaps will give me information about text, data, shared libs,
malloc, etc. - not the filecache information about the files the process
opened, i.e. the pages it read/wrote that are currently in the pagecache.
Isn't that right?

Thanks,
Badari

2005-12-01 17:57:33

by Marcelo Tosatti

Subject: Re: Better pagecache statistics ?

On Thu, Dec 01, 2005 at 06:21:39PM +0100, Arjan van de Ven wrote:
> On Thu, 2005-12-01 at 09:15 -0800, Badari Pulavarty wrote:
> > > Most of the issues you mention are null if you move the stats
> > > maintenance burden to userspace.
> > >
> > > The performance impact is also minimized since the hooks
> > > (read: overhead) can be loaded on-demand as needed.
> > >
> >
> > The overhead is - going through each mapping/inode in the system
> > and dumping out "nrpages" - to get per-file statistics. This is
> > going to be expensive, need locking and there is no single list
> > we can traverse to get it. I am not sure how to do this.

Can't you add hooks to add_to_page_cache/remove_from_page_cache
to record pagecache activity ?

> and worse... you're going to need memory to store the results, either in
> kernel or in userspace, and you don't know how much until you're done.
> That memory is going to need to be allocated, which in turn changes the
> vm state..

Indeed - you need to pre-allocate some sensible amount of memory to store the
results (relayfs does that for you).

2005-12-01 18:15:52

by Marcelo Tosatti

Subject: Re: Better pagecache statistics ?


> > I thought that it would be easy to use SystemTap for a such
> > a purpose?
> >
> > The sys_read/sys_write example at
> > http://www.redhat.com/magazine/011sep05/features/systemtap/ sounds
> > interesting.
> >
> > What I'm I missing?
>
> Well, Few things:
>
> 1) We have to have those probes present in the system all the time
> collecting the information when read/write happens, maintaining it
> and spitting it out. Since its kernel probe, all this data will be
> in the kernel.

Yeah, there is some overhead.

> 2) If we want to do this accounting (and you don't have those probes
> installed already) - we can't capture what happened earlier.

I suppose that the vast majority of situations where such information is
needed are special anyway?

Why do you need it around all the time?

> 3) probing sys_read/sys_write() are going to tell you how much
> a data a process did read or wrote - but its not going to tell you
> how much is in the cache (now or 10 minutes later).

Sure, that was just an example - you need to insert probes
in the correct places.

> > > My final goal is to get stats like ..
> > >
> > > Out of "Cached" value - to get details like
> > >
> > > <mmap> - xxx KB
> > > <shared mem> - xxx KB
> > > <text, data, bss, malloc, heap, stacks> - xxx KB
> > > <filecache pages total> -- xxx KB
> > > (filename1 or <dev>, <ino>) -- #of pages
> > > (filename2 or <dev>, <ino>) -- #of pages
> > >
> > > This would be really powerful on understanding system better.
> > >
> > > Don't you think ?
> >
> > Yep... /proc/<pid>/smaps provides that information on a per-process
> > basis already.
>
> /proc/pid/smaps will give me information about text,data,shared libs,
> malloc etc. Not the filecache information about files process opened,
> pages read/wrote currently in the pagecache. Isn't it ?

Right.

2005-12-01 18:20:09

by Badari Pulavarty

Subject: Re: Better pagecache statistics ?

On Thu, 2005-12-01 at 15:57 -0200, Marcelo Tosatti wrote:
> On Thu, Dec 01, 2005 at 06:21:39PM +0100, Arjan van de Ven wrote:
> > On Thu, 2005-12-01 at 09:15 -0800, Badari Pulavarty wrote:
> > > > Most of the issues you mention are null if you move the stats
> > > > maintenance burden to userspace.
> > > >
> > > > The performance impact is also minimized since the hooks
> > > > (read: overhead) can be loaded on-demand as needed.
> > > >
> > >
> > > The overhead is - going through each mapping/inode in the system
> > > and dumping out "nrpages" - to get per-file statistics. This is
> > > going to be expensive, need locking and there is no single list
> > > we can traverse to get it. I am not sure how to do this.
>
> Can't you add hooks to add_to_page_cache/remove_from_page_cache
> to record pagecache activity ?

In theory, yes. We already maintain info in "mapping->nrpages".
The trick would be to collect all of them and send them to user space.

Thanks,
Badari

2005-12-01 18:24:09

by Badari Pulavarty

Subject: Re: Better pagecache statistics ?

On Thu, 2005-12-01 at 15:57 -0200, Marcelo Tosatti wrote:
> On Thu, Dec 01, 2005 at 06:21:39PM +0100, Arjan van de Ven wrote:
> > On Thu, 2005-12-01 at 09:15 -0800, Badari Pulavarty wrote:
> > > > Most of the issues you mention are null if you move the stats
> > > > maintenance burden to userspace.
> > > >
> > > > The performance impact is also minimized since the hooks
> > > > (read: overhead) can be loaded on-demand as needed.
> > > >
> > >
> > > The overhead is - going through each mapping/inode in the system
> > > and dumping out "nrpages" - to get per-file statistics. This is
> > > going to be expensive, need locking and there is no single list
> > > we can traverse to get it. I am not sure how to do this.
>
> Can't you add hooks to add_to_page_cache/remove_from_page_cache
> to record pagecache activity ?

BTW, the hook can't be dynamically loaded (using kprobes), since
we would miss what happened until then.

Thanks,
Badari

2005-12-01 18:25:50

by Badari Pulavarty

Subject: Re: Better pagecache statistics ?

On Thu, 2005-12-01 at 16:15 -0200, Marcelo Tosatti wrote:
> > > I thought that it would be easy to use SystemTap for a such
> > > a purpose?
> > >
> > > The sys_read/sys_write example at
> > > http://www.redhat.com/magazine/011sep05/features/systemtap/ sounds
> > > interesting.
> > >
> > > What I'm I missing?
> >
> > Well, Few things:
> >
> > 1) We have to have those probes present in the system all the time
> > collecting the information when read/write happens, maintaining it
> > and spitting it out. Since its kernel probe, all this data will be
> > in the kernel.
>
> Yeah, there is some overhead.
>
> > 2) If we want to do this accounting (and you don't have those probes
> > installed already) - we can't capture what happened earlier.
>
> I suppose that the vast majority of situations where such information is
> needed are special anyway?
>
> Why do you need it around all the time?

Otherwise, we need to insert hooks and ask the customer to reproduce
the problem :(

> > 3) probing sys_read/sys_write() are going to tell you how much
> > a data a process did read or wrote - but its not going to tell you
> > how much is in the cache (now or 10 minutes later).
>
> Sure, that was just an example - need to insert probes
> on the correct places.


Okay, I misunderstood.


Thanks,
Badari

2005-12-01 21:17:42

by Christoph Lameter

Subject: Re: Better pagecache statistics ?

We are actually looking at having better pagecache statistics, and I have
been trying out a series of approaches. The direct need right now is to
have some statistics on the size of the pagecache and the number of
unmapped file-backed pages per node.

With those numbers one would be able to do local page eviction if memory
on a node runs low. Per-cpu counters exist for some of these, but they are
only meaningful if they are summed up (despite drivers/base/node.c
seemingly allowing access to per-node information).

One pathological case that we frequently encounter is that an application
does lots of file I/O that saturates a node and then terminates. At that
point a high number of unmapped pagecache pages exist that are not
reclaimed because other nodes still have enough free memory. If a counter
were available per node, then we could check whether the number of unmapped
pagecache pages is high and, if that is the case, run kswapd on that
specific node.

In my various attempts to get some form of statistics for that purpose I
encountered the problem that I need to modify critical code paths in the
VM.

One solution would be to add atomic counters to the zone for the
number of mapped and the number of pagecache pages. However, this would mean
that these counters have to be incremented and decremented for every page
added to and removed from the pagecache.

Another one would be to have a node-based per-cpu array that is summed up
at regular intervals to create true per-node statistics. However, the numbers
are then not current, and it's not feasible to add them up for every check.

2005-12-02 00:13:18

by Badari Pulavarty

Subject: Re: Better pagecache statistics ?

On Thu, 2005-12-01 at 13:16 -0800, Christoph Lameter wrote:
> We are actually looking at have better pagecache statistics and I have
> been trying out a series of approaches. The direct need right now is to
> have some statistics on the size of the pagecache and the number of
> unmapped file backed pages per node.
>

Cool. I would be interested in it. Where are you collecting the
statistics ?

Thanks,
Badari

2005-12-02 22:16:15

by Frank Ch. Eigler

Subject: Re: Better pagecache statistics ?


Badari Pulavarty <[email protected]> writes:

> > Can't you add hooks to add_to_page_cache/remove_from_page_cache
> > to record pagecache activity ?
>
> In theory, yes. We already maintain info in "mapping->nrpages".
> Trick would be to collect all of them, send them to user space.

If you happened to have a copy of systemtap built, you might run this
script instead of inserting static hooks into your kernel. (The tool
has come some way since the OLS '2005 demo.)

#! stap
probe kernel.function("add_to_page_cache") {
        printf("pid %d added pages (%d)\n", pid(), $mapping->nrpages)
}
probe kernel.function("__remove_from_page_cache") {
        printf("pid %d removed pages (%d)\n", pid(), $page->mapping->nrpages)
}

- FChE

2005-12-02 22:31:47

by Badari Pulavarty

Subject: Re: Better pagecache statistics ?

On Fri, 2005-12-02 at 17:15 -0500, Frank Ch. Eigler wrote:
> Badari Pulavarty <[email protected]> writes:
>
> > > Can't you add hooks to add_to_page_cache/remove_from_page_cache
> > > to record pagecache activity ?
> >
> > In theory, yes. We already maintain info in "mapping->nrpages".
> > Trick would be to collect all of them, send them to user space.
>
> If you happened to have a copy of systemtap built, you might run this
> script instead of inserting static hooks into your kernel. (The tool
> has come some way since the OLS '2005 demo.)
>
> #! stap
> probe kernel.function("add_to_page_cache") {
> printf("pid %d added pages (%d)\n", pid(), $mapping->nrpages)
> }
> probe kernel.function("__remove_from_page_cache") {
> printf("pid %d removed pages (%d)\n", pid(), $page->mapping->nrpages)
> }

Yes. This is what I also did earlier to test. But unfortunately,
we need more than this.

Having it on a per-pid basis is not good enough. I need it collected on a
per-file/mapping basis and sent to user space on demand. Is SystemTap
hooked up to relayfs to send data across to user land? printf() is
not an option. And also, I need to have this probe installed
from boot time, collecting all the information - so I can
access it when I need it - which means this bloats kernel memory.
Doesn't it?


Thanks,
Badari

2005-12-02 22:47:27

by Frank Ch. Eigler

Subject: Re: Better pagecache statistics ?

Hi -

On Fri, Dec 02, 2005 at 02:31:56PM -0800, Badari Pulavarty wrote:
> On Fri, 2005-12-02 at 17:15 -0500, Frank Ch. Eigler wrote:
> [...]
> > #! stap
> > probe kernel.function("add_to_page_cache") {
> > printf("pid %d added pages (%d)\n", pid(), $mapping->nrpages)
> > }
> > probe kernel.function("__remove_from_page_cache") {
> > printf("pid %d removed pages (%d)\n", pid(), $page->mapping->nrpages)
> > }
>
> [...] Having by "pid" basis is not good enough. I need per
> file/mapping basis collected and sent to user-space on-demand.

If you can characterize all your data needs in terms of points to
insert hooks (breakpoint addresses) and expressions to sample there,
systemtap scripts can probably track the relationships. (We have
associative arrays, looping, etc.)
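
For illustration, the kind of script this makes possible might look roughly
like the following (current SystemTap syntax; the probe points are the same
ones used earlier in the thread, everything else is a hypothetical sketch
rather than code anyone posted):

global nrpages

# remember the last observed nrpages value per mapping, keyed by the
# address_space pointer
probe kernel.function("add_to_page_cache") {
        nrpages[$mapping] = $mapping->nrpages
}
probe kernel.function("__remove_from_page_cache") {
        nrpages[$page->mapping] = $page->mapping->nrpages
}

# at session end, print the 20 largest mappings seen
probe end {
        foreach (m in nrpages- limit 20)
                printf("mapping 0x%x: %d pages\n", m, nrpages[m])
}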

> Is systemtap hooked to relayfs to send data across to user-land ?
> printf() is not an option.

systemtap can optionally use relayfs. The printf you see here does
not relate to/invoke the kernel printk, if that's what you're worried
about.

> And also, I need to have this probe, installed from the boot time
> and collecting all the information - so I can access it when I need
> it

We haven't done much work yet to address on-demand kind of interaction
with a systemtap probe session. However, one could fake it by
associating data-printing operations with events that are triggered
purposely from userspace, like running a particular system call from a
particularly named process.
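
As a hypothetical illustration of that trick (the process name "dumpcache"
and the choice of sys_sync() as the trigger are made up, and "nrpages" is an
array filled elsewhere, e.g. by probes like the ones in the sketch above):

global nrpages   # populated elsewhere, e.g. by add/remove probes

probe kernel.function("sys_sync") {
        # a userspace-triggered, on-demand dump: print the table whenever
        # a process named "dumpcache" calls sync()
        if (execname() == "dumpcache") {
                foreach (m in nrpages- limit 20)
                        printf("mapping 0x%x: %d pages\n", m, nrpages[m])
        }
}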

> which means this bloats kernel memory. [...]

The degree of bloat is under the operator's control: systemtap only
uses initialization-time memory allocation, so its arrays can fill up.


- FChE

2005-12-02 23:46:38

by Badari Pulavarty

Subject: Re: Better pagecache statistics ?

On Fri, 2005-12-02 at 17:46 -0500, Frank Ch. Eigler wrote:
> Hi -
>
> On Fri, Dec 02, 2005 at 02:31:56PM -0800, Badari Pulavarty wrote:
> > On Fri, 2005-12-02 at 17:15 -0500, Frank Ch. Eigler wrote:
> > [...]
> > > #! stap
> > > probe kernel.function("add_to_page_cache") {
> > > printf("pid %d added pages (%d)\n", pid(), $mapping->nrpages)
> > > }
> > > probe kernel.function("__remove_from_page_cache") {
> > > printf("pid %d removed pages (%d)\n", pid(), $page->mapping->nrpages)
> > > }
> >
> > [...] Having by "pid" basis is not good enough. I need per
> > file/mapping basis collected and sent to user-space on-demand.
>
> If you can characterize all your data needs in terms of points to
> insert hooks (breakpoint addresses) and expressions to sample there,
> systemtap scripts can probably track the relationships. (We have
> associative arrays, looping, etc.)
>
> > Is systemtap hooked to relayfs to send data across to user-land ?
> > printf() is not an option.
>
> systemtap can optionally use relayfs. The printf you see here does
> not relate to/invoke the kernel printk, if that's what you're worried
> about.

Hmm. You are right.

Is there a way for another user-level program/utility to access some of the
data maintained in those arrays?

>
> > And also, I need to have this probe, installed from the boot time
> > and collecting all the information - so I can access it when I need
> > it
>
> We haven't done much work yet to address on-demand kind of interaction
> with a systemtap probe session. However, one could fake it by
> associating data-printing operations with events that are triggered
> purposely from userspace, like running a particular system call from a
> particularly named process.
>
> > which means this bloats kernel memory. [...]
>
> The degree of bloat is under the operator's control: systemtap only
> uses initialization-time memory allocation, so its arrays can fill up.

Does this mean that I can do something like

page_cache[0xffff8100c4c6b298] = $mapping->nrpages ?

And this won't generate bloated arrays ?

Here is what I wrote earlier to capture some of the pagecache data.
Unfortunately, I can't capture whatever happened before inserting the
probe. So it won't give me information about everything that's in the
pagecache.

BTW, if you prefer, we can move the discussion to the systemtap list.
(I have a few questions/issues on return probes & accessibility of
arguments, since I want to do this on return.)

Thanks,
Badari





Attachments:
pagecache.stp (502.00 B)

2005-12-04 18:48:46

by Martin Bligh

Subject: Re: Better pagecache statistics ?

>> > > Out of "Cached" value - to get details like
>> > >
>> > > <mmap> - xxx KB
>> > > <shared mem> - xxx KB
>> > > <text, data, bss, malloc, heap, stacks> - xxx KB
>> > > <filecache pages total> -- xxx KB
>> > > (filename1 or <dev>, <ino>) -- #of pages
>> > > (filename2 or <dev>, <ino>) -- #of pages
>> > >
>> > > This would be really powerful on understanding system better.
>> >
>> > to some extend it might be useful.
>> > I have a few concerns though
>> > 1) If we make these stats into an ABI then it becomes harder to change
>> > the architecture of the VM radically since such concepts may not even
>> > exist in the new architecture. As long as this is some sort of advisory,
>> > humans-only file I think this isn't too much of a big deal though.
>> >
>> > 2) not all the concepts you mention really exist as far as the kernel is
>> > concerned. I mean.. a mmap file is file cache is .. etc.
>> > malloc/heap/stacks are also not differentiated too much and are mostly
>> > userspace policy (especially thread stacks).
>> >
>> > A split in
>> > * non-file backed
>> > - mapped once
>> > - mapped more than once
>> > * file backed
>> > - mapped at least once
>> > - not mapped
>> > I can see as being meaningful. Assigning meaning to it beyond this is
>> > dangerous; that is more an interpretation of the policy userspace
>> > happens to use for things and I think coding that into the kernel is a
>> > mistake.
>> >
>> > Knowing which files are in memory how much is, as debug feature,
>> > potentially quite useful for VM hackers to see how well the various VM
>> > algorithms work. I'm concerned about the performance impact (eg you can
>> > do it only once a day or so, not every 10 seconds) and about how to get
>> > this data out in a consistent way (after all, spewing this amount of
>> > debug info will in itself impact the vm balances)
>>
>> Most of the issues you mention are null if you move the stats
>> maintenance burden to userspace.
>>
>> The performance impact is also minimized since the hooks
>> (read: overhead) can be loaded on-demand as needed.
>>
>
> The overhead is - going through each mapping/inode in the system
> and dumping out "nrpages" - to get per-file statistics. This is
> going to be expensive, need locking and there is no single list
> we can traverse to get it. I am not sure how to do this.

I made something idiotic to just walk the mem_map array and gather
stats on every page in the system. Not exactly pretty ... but useful.
Can't lay my hands on it at the moment, but Badari can ask Janet
for it, I think ;-)

M.

2005-12-28 06:13:34

by Marcelo Tosatti

Subject: Re: Better pagecache statistics ?


Badari, any improvements on the {add_to,remove_from}_page_cache hooks?

> I just started playing with SystemTap yesterday. First
> thing I want to record is "what is the latency of
> direct reclaim".

I've come up with something which works, though it's pretty dumb and
inefficient.

I'm facing three problems, maybe someone has a clue on how to improve
the situation.

a) nanosecond timekeeping

Since the systemtap language does not support "struct" abstraction, but
simply "long/string/array" types, there is no way to easily return more
than one value from a function. Is it possible to pass references down
to functions so as to return more than one value?

I failed to find any way to do that.

For nanosecond timekeeping one needs a second/nanosecond tuple (struct
timespec).

b) ERROR: MAXACTION exceeded near identifier 'log' at ttfp_delay.stp:49:3

The array size is capped at a maximum. Is there any way to configure
SystemTap to periodically dump-and-zero the arrays? This would make a lot of
sense for any statistics-gathering code.

c) Hash tables

It would be better to store the log entries in a hash table; the present
script uses the "current" pointer as a key into a pair of arrays,
incrementing the key until a free slot is found (which can be very
inefficient).

A hash table would be much more efficient, but allocating memory inside
the scripts is tricky. A pre-allocated, pre-sized pool of memory could
work well for this purpose. The "dump-array-entries-to-userspace" action
could be used to free them.

So both b) and c) could be fixed with the same logic:

- dump entries to userspace if memory pool is getting short
on free entries.
- periodically dump entries to userspace (akin to "bdflush").

And finally, there seems to be a bug which results in _very_ large
(several seconds) delays being reported - delays that seem unlikely to
really be happening.
Thoughts?

/*
 * ttfp_delay - measure direct reclaim latency
 */

global count_try_to_free_pages
global count_exit_try_to_free_pages

global entry_array_us
global exit_array_us

global entry_array_ms
global exit_array_ms

function get_currentpointer:long () %{
        THIS->__retvalue = (long) current;
%}

probe kernel.function("try_to_free_pages")
{
        current_p = get_currentpointer();
        ++count_try_to_free_pages;
        while (entry_array_us[current_p])
                ++current_p;

        entry_array_us[current_p] = gettimeofday_us();
        entry_array_ms[current_p] = gettimeofday_ms();
}

probe kernel.function("try_to_free_pages").return
{
        current_p = get_currentpointer();
        ++count_exit_try_to_free_pages;
        while (exit_array_us[current_p])
                ++current_p;

        exit_array_us[current_p] = gettimeofday_us();
        exit_array_ms[current_p] = gettimeofday_ms();
}

probe begin { log("starting probe") }

probe end
{
        log("ending probe")
        log("calls to try_to_free_pages: " . string(count_try_to_free_pages));
        log("returns from try_to_free_pages: " . string(count_exit_try_to_free_pages));
        foreach (var in entry_array_us) {
                pos++;
                log("try_to_free_pages (" . string(pos) . ") delta: "
                    . string(exit_array_us[var] - entry_array_us[var]) . "us "
                    . string(exit_array_ms[var] - entry_array_ms[var]) . "ms ");
        }
}


Example output, running an 800MB "dd" copy in the background:

[root@dmt examples]# stap -g ttfp_delay.stp
starting probe
ending probe
calls to try_to_free_pages: 387
returns from try_to_free_pages: 373
try_to_free_pages (1) delta: 15028us 15ms
try_to_free_pages (2) delta: 47677211us 47677ms
try_to_free_pages (3) delta: 39us 0ms
try_to_free_pages (4) delta: 35us 0ms
try_to_free_pages (5) delta: 152us 0ms
try_to_free_pages (6) delta: 104us 0ms
try_to_free_pages (7) delta: 353us 0ms
try_to_free_pages (8) delta: 61us 0ms
try_to_free_pages (9) delta: 187us 0ms
try_to_free_pages (10) delta: 55us 0ms
try_to_free_pages (11) delta: 50us 0ms
try_to_free_pages (12) delta: 30us 0ms
try_to_free_pages (13) delta: 31us 0ms
try_to_free_pages (14) delta: 42us 0ms
try_to_free_pages (15) delta: 37us 0ms
try_to_free_pages (16) delta: 178us 0ms
try_to_free_pages (17) delta: 34us 0ms
try_to_free_pages (18) delta: 37us 0ms
try_to_free_pages (19) delta: 35us 0ms
try_to_free_pages (20) delta: 34us 0ms
try_to_free_pages (21) delta: 65us 0ms
...
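
As an aside, the free-slot scan and the pair of unbounded arrays above can be
avoided by keying each in-flight call by tid() and keeping only the deltas in
a statistics aggregate. A minimal sketch in current SystemTap syntax, not
part of the original script:

global start, delays

probe kernel.function("try_to_free_pages") {
        start[tid()] = gettimeofday_us()
}

probe kernel.function("try_to_free_pages").return {
        t = start[tid()]
        if (t) {
                # record this call's direct reclaim latency
                delays <<< gettimeofday_us() - t
                delete start[tid()]
        }
}

probe end {
        if (@count(delays))
                printf("direct reclaim: %d calls, avg %d us, max %d us\n",
                       @count(delays), @avg(delays), @max(delays))
}

This trades the per-call log for aggregate numbers, which is also what keeps
the memory use bounded.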

2005-12-28 19:06:57

by Tom Zanussi

Subject: Re: Better pagecache statistics ?

Marcelo Tosatti writes:

[...]

>
> b) ERROR: MAXACTION exceeded near identifier 'log' at ttfp_delay.stp:49:3
>
> The array size is capped to a maximum. Is there any way to configure
> SystemTap to periodically dump-and-zero the arrays? This makes lots of
> sense to any statistical gathering code.
>
> c) Hash tables
>
> It would be better to store the log entries in a hash table, the present
> script uses the "current" pointer as a key into a pair of arrays,
> incrementing the key until a free one is found (which can be very
> inefficient).
>
> A hash table would be much more efficient, but allocating memory inside
> the scripts is tricky. A pre-allocated, pre-sized pool of memory could
> work well for this purpose. The "dump-array-entries-to-userspace" action
> could be used to free them.
>
> So both b) and c) could be fixed with the same logic:
>
> - dump entries to userspace if memory pool is getting short
> on free entries.
> - periodically dump entries to userspace (akin to "bdflush").

Hi,

There's a systemtap example that does something similar to what you're
describing - see the kmalloc-stacks/kmalloc-top examples in the
testsuite:

systemtap/tests/systemtap.samples/kmalloc-stacks.stp
systemtap/tests/systemtap.samples/kmalloc-top

Basically, the kmalloc-stacks.stp script hashes data in a systemtap
hash and periodically formats the current contents of the hash table
into a convenient form and writes it to userspace, then clears the
hash for the next go-round. kmalloc-top is a companion Perl script
'daemon' that sits around in userspace waiting for new batches of hash
data, which it then adds to a continuously accumulating Perl hash in
the user-side script. There's a bit more detail about the script(s)
here:

http://sourceware.org/ml/systemtap/2005-q3/msg00550.html
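
The core of that dump-and-zero pattern looks roughly like this (a minimal
sketch in current SystemTap syntax, not the actual kmalloc-stacks.stp; the
probe point and the 5-second interval are arbitrary):

global hits

probe kernel.function("add_to_page_cache") {
        hits[execname()]++      # count pagecache insertions per program
}

probe timer.ms(5000) {
        # write the current batch to userspace, then clear the hash so it
        # stays small between dumps
        foreach (name in hits- limit 20)
                printf("%s: %d pagecache insertions\n", name, hits[name])
        delete hits
}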

HTH,

Tom