2009-01-26 15:50:15

by Chris Friesen

Subject: marching through all physical memory in software

Someone is asking me about the feasibility of "scrubbing" system memory
by accessing each page and handling the ECC faults.

The range of PAGE_OFFSET to "high_memory" should get me all of the
kernel memory area, but what about all the memory set aside for
userspace (which may not be contiguous)? Is there any straightforward
way to march through this memory?

I suppose I'm looking for something like walk_page_range(), but for
physical memory rather than virtual.

Thanks,

Chris


2009-01-26 16:00:12

by Arjan van de Ven

Subject: Re: marching through all physical memory in software

On Mon, 26 Jan 2009 09:38:13 -0600
"Chris Friesen" <[email protected]> wrote:

> Someone is asking me about the feasability of "scrubbing" system
> memory by accessing each page and handling the ECC faults.
>

Hi,

I would suggest that you look at the "edac" subsystem, which tries to
do exactly this....



--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2009-01-27 19:00:40

by Chris Friesen

Subject: Re: marching through all physical memory in software

Arjan van de Ven wrote:
> On Mon, 26 Jan 2009 09:38:13 -0600
> "Chris Friesen" <[email protected]> wrote:
>
>> Someone is asking me about the feasability of "scrubbing" system
>> memory by accessing each page and handling the ECC faults.
>>
>
> Hi,
>
> I would suggest that you look at the "edac" subsystem, which tries to
> do exactly this....

Looking at the current -git code, there appears to be an option for
memory controllers to do this (the set_sdram_scrub_rate() routine), but
there don't appear to be any controllers that can actually do it.

edac appears to currently be able to scrub the specific page where the
fault occurred. This is a useful building block, but doesn't provide
the ability to march through all of physical memory.

Chris

2009-01-27 20:16:57

by Eric W. Biederman

Subject: Re: marching through all physical memory in software

"Chris Friesen" <[email protected]> writes:

> Arjan van de Ven wrote:
>> On Mon, 26 Jan 2009 09:38:13 -0600
>> "Chris Friesen" <[email protected]> wrote:
>>
>>> Someone is asking me about the feasability of "scrubbing" system
>>> memory by accessing each page and handling the ECC faults.
>>>
>>
>> Hi,
>>
>> I would suggest that you look at the "edac" subsystem, which tries to
>> do exactly this....


> edac appears to currently be able to scrub the specific page where the fault
> occurred. This is a useful building block, but doesn't provide the ability to
> march through all of physical memory.

Well, that is the tricky part. The rest is simply finding which physical
addresses are valid, either by querying the memory controller or by looking
at the ranges the BIOS gave us.

That part should not be too hard. I think it simply has not been implemented
yet as most ECC chipsets implement this in hardware today.

Eric
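
One way to picture "looking at the range the BIOS gave us" from userspace is to read the firmware-provided map as the kernel exports it in /proc/iomem. The sketch below is only an illustration, not kernel code: it parses text in the /proc/iomem format and keeps the top-level "System RAM" entries. The sample addresses are made up, and on recent kernels an unprivileged read of the real file may show zeroed addresses.

```python
import re

# Sample text in the format of /proc/iomem (addresses are invented for illustration).
SAMPLE_IOMEM = """\
00000000-00000fff : reserved
00001000-0009fbff : System RAM
0009fc00-0009ffff : reserved
00100000-bffdffff : System RAM
  01000000-015fffff : Kernel code
bffe0000-bfffffff : ACPI Tables
100000000-23fffffff : System RAM
"""

LINE_RE = re.compile(r"^([0-9a-f]+)-([0-9a-f]+) : (.+)$")


def system_ram_ranges(iomem_text):
    """Return inclusive (start, end) physical address pairs for top-level System RAM."""
    ranges = []
    for line in iomem_text.splitlines():
        if line.startswith(" "):
            continue  # indented lines describe sub-regions of the entry above
        m = LINE_RE.match(line)
        if m and m.group(3) == "System RAM":
            ranges.append((int(m.group(1), 16), int(m.group(2), 16)))
    return ranges


def total_bytes(ranges):
    """Sum the sizes of inclusive (start, end) ranges."""
    return sum(end - start + 1 for start, end in ranges)


ranges = system_ram_ranges(SAMPLE_IOMEM)
```

The holes between the returned ranges (reserved areas, ACPI tables) are exactly the parts a first-pass scrubber would skip.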

2009-01-30 08:57:38

by Pavel Machek

Subject: Re: marching through all physical memory in software

On Tue 2009-01-27 12:16:52, Eric W. Biederman wrote:
> "Chris Friesen" <[email protected]> writes:
>
> > Arjan van de Ven wrote:
> >> On Mon, 26 Jan 2009 09:38:13 -0600
> >> "Chris Friesen" <[email protected]> wrote:
> >>
> >>> Someone is asking me about the feasability of "scrubbing" system
> >>> memory by accessing each page and handling the ECC faults.
> >>>
> >>
> >> Hi,
> >>
> >> I would suggest that you look at the "edac" subsystem, which tries to
> >> do exactly this....
>
>
> > edac appears to currently be able to scrub the specific page where the fault
> > occurred. This is a useful building block, but doesn't provide the ability to
> > march through all of physical memory.
>
> Well that is the tricky part. The rest is simply finding which physical
> addresses are valid. Either by querying the memory controller or looking
> at the range the BIOS gave us.
>
> That part should not be too hard. I think it simply has not been implemented
> yet as most ECC chipsets implement this in hardware today.

You can do the scrubbing today by echo reboot > /sys/power/disk; echo
disk > /sys/power/state :-)... or using uswsusp APIs.

Take a look at hibernation code for 'walk all memory' examples...

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-01-30 09:05:29

by Nigel Cunningham

Subject: Re: marching through all physical memory in software

Hi.

On Wed, 2009-01-28 at 20:38 +0100, Pavel Machek wrote:
> You can do the scrubbing today by echo reboot > /sys/power/disk; echo
> disk > /sys/power/state :-)... or using uswsusp APIs.

That won't work. The RAM retains its contents across a reboot, and even
for a little while after powering off.

Regards,

Nigel

2009-01-30 09:13:26

by Pavel Machek

Subject: Re: marching through all physical memory in software

> Hi.
>
> On Wed, 2009-01-28 at 20:38 +0100, Pavel Machek wrote:
> > You can do the scrubbing today by echo reboot > /sys/power/disk; echo
> > disk > /sys/power/state :-)... or using uswsusp APIs.
>
> That won't work. The RAM retains it's contents across a reboot, and even
> for a little while after powering off.

Yes, and the original goal was to rewrite all the memory with the same
contents so that parity errors don't accumulate. So scrubbing here !=
trying to clear it.


--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-01-30 13:00:40

by Nigel Cunningham

Subject: Re: marching through all physical memory in software

Hi again.

On Fri, 2009-01-30 at 10:13 +0100, Pavel Machek wrote:
> > Hi.
> >
> > On Wed, 2009-01-28 at 20:38 +0100, Pavel Machek wrote:
> > > You can do the scrubbing today by echo reboot > /sys/power/disk; echo
> > > disk > /sys/power/state :-)... or using uswsusp APIs.
> >
> > That won't work. The RAM retains its contents across a reboot, and even
> > for a little while after powering off.
>
> Yes, and the original goal was to rewrite all the memory with same
> contents so that parity errors don't accumulate. SO scrubbing here !=
> trying to clear it.

Sorry - I think I missed something.

AFAICS, hibernating is going to be a noop as far as doing anything to
memory that's not touched by the process of hibernating goes. It won't
clear it or scrub it or anything else.

Regards,

Nigel

2009-01-30 19:32:41

by Eric W. Biederman

Subject: Re: marching through all physical memory in software

Doug Thompson <[email protected]> writes:

> Nigel Cunningham <[email protected]> wrote:
>
> Hi again.
>
> On Fri, 2009-01-30 at 10:13 +0100, Pavel Machek wrote:
> > > Hi.
> > >
> > > On Wed, 2009-01-28 at 20:38 +0100, Pavel Machek wrote:
> > > > You can do the scrubbing today by echo reboot > /sys/power/disk; echo
> > > > disk > /sys/power/state :-)... or using uswsusp APIs.
> > >
> > > That won't work. The RAM retains its contents across a reboot, and even
> > > for a little while after powering off.
> >
> > Yes, and the original goal was to rewrite all the memory with same
> > contents so that parity errors don't accumulate. SO scrubbing here !=
> > trying to clear it.
>
> Sorry - I think I missed something.
>
> AFAICS, hibernating is going to be a noop as far as doing anything to
> memory that's not touched by the process of hibernating goes. It won't
> clear it or scrub it or anything else.

A background software scrubber simply has the job of rewriting memory
to its current content so that the data and the ECC check bits are
guaranteed to be in sync, keeping correctable ECC errors caused by
environmental factors from accumulating.

Pavel's original comment was that the hibernation code has to walk all
of memory to save it to disk, so it would be a good place to look to
figure out how to walk all of memory. And incidentally, hibernation
would serve as a crude way of rewriting all of memory.


> Even if hibernating worked, it does not touch the issue of scrubbing memory
> that doesn't have hardware support AND the requirement of thousands of nodes in
> cluster who MUST remain operational for days on end.

But it may still serve as an example of how to walk through all of memory.

> Sicortex's MIPS based system fits that exactly. When I did their EDAC driver
> they wanted to have a software scrubber at a NICE run level to scan memory and
> do this operation without shutting down the system.
>
> We never got to it, but it would be a nice for some to have a background
> software scrubber. But I would need help from the memory guys on a proper
> interface.
>
> The goal would be have a "loose" target of attempting to all most memory if not
> all. Sometime of iteration over the memory set.

Thinking about it, we only care about memory the kernel is using, so the memory
maps the BIOS supplies to the kernel should be sufficient. We have weird corner
cases like ACPI, but not handling those in the first pass and getting
something working should be fine.

There are other silly things such as wanting to only scrub memory on its native
NUMA node (if possible) for both performance and scalability.

Eric
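
The NUMA point above can be pictured as keeping one work list per node, so that a worker bound to a node only touches that node's memory. The following is a minimal sketch under stated assumptions: the input tuples (start PFN, exclusive end PFN, node id), the chunk size and the function name are all hypothetical, and nothing here is an existing kernel interface.

```python
from collections import defaultdict


def per_node_chunks(ranges, chunk_pages=256):
    """Split (start_pfn, end_pfn_exclusive, node) ranges into per-node chunk lists.

    Holes between ranges are never visited because only the listed ranges
    are walked.
    """
    work = defaultdict(list)
    for start, end, node in ranges:
        pfn = start
        while pfn < end:
            stop = min(pfn + chunk_pages, end)
            work[node].append((pfn, stop))
            pfn = stop
    return dict(work)


# Hypothetical layout: two nodes, with a hole between the first two ranges.
layout = [(0x100, 0x300, 0), (0x400, 0x500, 0), (0x1000, 0x1280, 1)]
work = per_node_chunks(layout)
```

Small chunks give a scrubber natural points at which to sleep or yield between pieces of work.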

2009-01-30 21:03:14

by Tim Small

Subject: Re: marching through all physical memory in software

Eric W. Biederman wrote:
> A background software scrubber simply has the job of rewritting memory
> to it's current content so that the data and the ecc check bits are
> guaranteed to be in sync

Don't you just need to READ memory? The memory controller hardware
takes care of the rest in the vast majority of cases.

You only need to rewrite RAM if a correctable error occurs, and the
chipset doesn't support automatic write-back of the corrected value (a
different problem altogether...). The actual memory bits themselves are
refreshed by the hardware quite frequently (max of every 64ms for DDR2,
I believe)...

Cheers,

Tim.

2009-01-30 21:10:46

by Nigel Cunningham

Subject: Re: marching through all physical memory in software

Hi.

On Fri, 2009-01-30 at 11:32 -0800, Eric W. Biederman wrote:
> Doug Thompson <[email protected]> writes:
>
> > Nigel Cunningham <[email protected]> wrote:
> >
> > Hi again.
> >
> > On Fri, 2009-01-30 at 10:13 +0100, Pavel Machek wrote:
> > > > Hi.
> > > >
> > > > On Wed, 2009-01-28 at 20:38 +0100, Pavel Machek wrote:
> > > > > You can do the scrubbing today by echo reboot > /sys/power/disk; echo
> > > > > disk > /sys/power/state :-)... or using uswsusp APIs.
> > > >
> > > > That won't work. The RAM retains its contents across a reboot, and even
> > > > for a little while after powering off.
> > >
> > > Yes, and the original goal was to rewrite all the memory with same
> > > contents so that parity errors don't accumulate. SO scrubbing here !=
> > > trying to clear it.
> >
> > Sorry - I think I missed something.
> >
> > AFAICS, hibernating is going to be a noop as far as doing anything to
> > memory that's not touched by the process of hibernating goes. It won't
> > clear it or scrub it or anything else.
>
> A background software scrubber simply has the job of rewritting memory
> to it's current content so that the data and the ecc check bits are
> guaranteed to be in sync keeping correctable ecc errors caused by
> environmental factors from accumulating.
>
> Pavel's original comment was that the hibernation code has to walk all
> of memory to save it to disk so it would be a good place to look to
> figure out how to walk all of memory. And incidentally hibernation
> would serve as a crud way of rewritting all of memory.

Thanks. Now I get it :)

Nigel

2009-01-31 03:54:18

by Eric W. Biederman

Subject: Re: marching through all physical memory in software

Tim Small <[email protected]> writes:

> Eric W. Biederman wrote:
>> A background software scrubber simply has the job of rewritting memory
>> to it's current content so that the data and the ecc check bits are
>> guaranteed to be in sync
>
> Don't you just need to READ memory? The memory controller hardware takes care
> of the rest in the vast majority of cases.
>
> You only need to rewrite RAM if a correctable error occurs, and the chipset
> doesn't support automatic write-back of the corrected value (a different problem
> altogether...). The actual memory bits themselves are refreshed by the hardware
> quite frequently (max of every 64ms for DDR2, I believe)...

Once we are talking about software scrubbing, it makes sense to assume
a least-common-denominator memory controller, one that does not do automatic
write-back of the corrected value, since all of the recent memory controllers
do scrubbing in hardware.

Once you handle the stupidest hardware, all other cases are just software
optimizations on that. We already have the tricky code that does a
read-modify-write without changing the contents of memory, which guarantees
that everything it touches will be written back.

Eric
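
To see why the rewrite matters on a controller that corrects on read but does not write back, here is a toy simulation. It uses a Hamming(7,4) code as a stand-in for real ECC (actual DIMMs typically use a wider SECDED code), and every name in it is hypothetical. A read returns corrected data while the stored word stays damaged, so a second flipped bit later defeats the code unless software has rewritten the word in between.

```python
def encode(nibble):
    """Encode 4 data bits as a Hamming(7,4) codeword; index 0 is unused."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    return code


def syndrome(code):
    """Return 0 for a clean word, else the position of a single flipped bit."""
    s1 = code[1] ^ code[3] ^ code[5] ^ code[7]
    s2 = code[2] ^ code[3] ^ code[6] ^ code[7]
    s4 = code[4] ^ code[5] ^ code[6] ^ code[7]
    return s1 | (s2 << 1) | (s4 << 2)


def read_corrected(code):
    """Model a controller without write-back: return a corrected copy only."""
    fixed = list(code)
    s = syndrome(fixed)
    if s:
        fixed[s] ^= 1
    return fixed


def data_of(code):
    """Extract the 4 data bits from a codeword."""
    return code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)


def scrub(memory, addr):
    """Software scrub: read the corrected value and store it back unchanged."""
    memory[addr] = read_corrected(memory[addr])


memory = [encode(0xB)]
memory[0][5] ^= 1                                 # first bit flip
first_read = data_of(read_corrected(memory[0]))   # data is still correct...
still_damaged = syndrome(memory[0]) != 0          # ...but the stored word is not
scrub(memory, 0)                                  # rewrite, check bits back in sync
memory[0][6] ^= 1                                 # a later, second bit flip
second_read = data_of(read_corrected(memory[0]))  # still correctable after the scrub
```

Without the `scrub` call the two flips would coexist in one word, and the single-error correction would then "fix" the wrong bit.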

2009-01-31 12:48:52

by Tim Small

Subject: Re: marching through all physical memory in software

Eric W. Biederman wrote:
> At the point we are talking about software scrubbing it makes sense to assume
> a least common denominator memory controller, one that does not do automatic
> write-back of the corrected value, as all of the recent memory controllers
> do scrubbing in hardware.
>

I was just trying to clarify the distinction between the two processes
which have similar names, but aren't (IMO) actually that similar:

"Software Scrubbing"

Triggering a read, and subsequent rewrite of a particular RAM location
which has suffered a correctable ECC error(s) i.e. hardware detects an
error, then the OS takes care of the rewrite to "scrub" the error in the
case that the hardware doesn't handle this automatically.

This should be a very-occasional error-path process, and performance is
probably not critical.


"Background Scrubbing"

This is a poor name, IMO (scrub implies some kind of write to me),
which applies to a process whereby you ensure that the ECC check-bits
are verified periodically for the whole of physical RAM, so that single
bit errors in a given ECC block don't accumulate and turn into
uncorrectable errors. It may also lead to improved data collection for
some failure modes. Again, many memory controllers implement this
feature in hardware, so we shouldn't do it twice where this is supported.

There is (AFAIK) no need to do any writes here, and in fact doing so is
only likely to hurt performance, I think.... The design which springs
to mind is of a background thread which (possibly at idle priority)
reads RAM at a user-configurable rate (e.g. consume a max of n% of
memory bandwidth, or read all of RAM at least once every x minutes).
Possible design issues:

. There will be some trade off between reducing impact on the system as
a whole, and making firm guarantees about how often memory is checked.
Difficult to know what the default would be, but probably
no-firm-guarantee of minimum time (idle processing only) is likely to
cause least problems for most users.
. An eye will need to be kept on the impact that this reading has on the
performance of the rest of the system (e.g. cache pollution, and NUMA,
as you previously mentioned), but my gut feeling is that for the
majority of systems it shouldn't be significant. If practical
mechanisms are available on some CPUs to read RAM without populating the
CPU cache, we should use those (but I've no idea if they exist or not).

Perhaps a good default would be to benchmark memory read bandwidth when
the feature is turned on, and then operate at (e.g.) 0.5% of that bandwidth.


Cheers,

Tim.
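
The two rate knobs suggested above (a cap of n% of memory bandwidth, or a full pass every x minutes) are two views of the same arithmetic. The sketch below works it through; the function names, the 1 MiB chunk size and the assumption that each chunk is read at the full benchmarked bandwidth are illustrative choices, not part of any existing interface.

```python
def scrub_schedule(ram_bytes, read_bandwidth, fraction=0.005, chunk_bytes=1 << 20):
    """Return (seconds per full pass, sleep seconds after each chunk).

    read_bandwidth is the benchmarked read rate in bytes/second and
    fraction is the share of it the scrubber may consume.
    """
    budget = read_bandwidth * fraction        # bytes/second allowed for scrubbing
    pass_seconds = ram_bytes / budget         # time to cover all of RAM once
    chunk_period = chunk_bytes / budget       # start one chunk every this often
    read_time = chunk_bytes / read_bandwidth  # time spent actually reading it
    return pass_seconds, chunk_period - read_time


def fraction_for_interval(ram_bytes, read_bandwidth, interval_seconds):
    """Bandwidth share needed to read all of RAM once per interval_seconds."""
    return ram_bytes / (read_bandwidth * interval_seconds)


GIB = 2 ** 30
pass_seconds, sleep_seconds = scrub_schedule(16 * GIB, 8 * GIB)
```

For a machine with 16 GiB of RAM and 8 GiB/s of read bandwidth, the 0.5% default works out to one full pass roughly every 400 seconds, so even a very small bandwidth share covers memory far more often than once a day.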

Subject: Re: marching through all physical memory in software

On Sat, 31 Jan 2009, Tim Small wrote:
> Eric W. Biederman wrote:
> > At the point we are talking about software scrubbing it makes sense to assume
> > a least common denominator memory controller, one that does not do automatic
> > write-back of the corrected value, as all of the recent memory controllers
> > do scrubbing in hardware.
> >
>
> I was just trying to clarify the distinction between the two processes
> which have similar names, but aren't (IMO) actually that similar:
>
> "Software Scrubbing"
>
> Triggering a read, and subsequent rewrite of a particular RAM location
> which has suffered a correctable ECC error(s) i.e. hardware detects an
> error, then the OS takes care of the rewrite to "scrub" the error in the
> case that the hardware doesn't handle this automatically.
>
> This should be a very-occasional error-path process, and performance is
> probably not critical..
>
>
> "Background Scrubbing"
>
> . This is a poor name, IMO (scrub infers some kind of write to me),
> which applies to a process whereby you ensure that the ECC check-bits
> are verified periodically for the whole of physical RAM, so that single
> bit errors in a given ECC block don't accumulate and turn into
> uncorrectable errors. It may also lead to improved data collection for
> some failure modes. Again, many memory controllers implement this
> feature in hardware, so we shouldn't do it twice where this is supported.

It is implied in background scrubbing that if a background scrub
page read causes an ECC correctable error to be flagged, the normal
"fix through scrub" behaviour of the memory controller will be
triggered (possibly, the software scrubbing described above).

And if an uncorrectable error is detected during the scrub, we have to
do something about it as well. And that won't be that easy: locate
whatever process is using that page, and do something smart to it...
or do some emergency evasive actions if it is one of the kernel's data
structures, etc.

So, as you said, "background scrubbing" and "software scrubbing" really are
very different things, and one has to expect that background scrubbing will
eventually trigger software scrubbing, major system emergency handling
(uncorrectable errors in kernel memory) or minor system emergency
handling (uncorrectable errors in process memory).

> There is (AFAIK) no need to do any writes here, and in fact doing so is

One might want the possibility of doing unconditional writes, because
it helps with memory bitrot on crappy hardware where the refresh
cycles aren't enough to avoid bitrot. But you definitely won't want
it most of the time.

You can also implement software-based ECC using a background scrubber
and setting aside pages to store the ECC information. Now, THAT is
probably not worth bothering with due to the performance impact, but
who knows...

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh
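
The outcomes listed above for a single background-scrub read can be written down as a small decision function. This is only a restatement of the message in code form; the enum, the function name and the string arguments are hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    NOTHING = "nothing to do"
    SOFTWARE_SCRUB = "rewrite the corrected value"
    MINOR_EMERGENCY = "uncorrectable error in process memory"
    MAJOR_EMERGENCY = "uncorrectable error in kernel memory"


def triage(error, owner, hw_writes_back=False):
    """Classify one scrub read.

    error is None, "correctable" or "uncorrectable"; owner is "kernel" or
    "process"; hw_writes_back says whether the controller fixes RAM itself.
    """
    if error is None:
        return Outcome.NOTHING
    if error == "correctable":
        return Outcome.NOTHING if hw_writes_back else Outcome.SOFTWARE_SCRUB
    if error == "uncorrectable":
        if owner == "kernel":
            return Outcome.MAJOR_EMERGENCY
        return Outcome.MINOR_EMERGENCY
    raise ValueError(f"unknown error kind: {error!r}")
```

The hard part described in the message, working out who owns the page and what "something smart" means for that owner, is deliberately left out of this sketch.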

2009-01-31 21:28:21

by Pavel Machek

Subject: Re: marching through all physical memory in software

Hi!

> And if an uncorretable error is detected during the scrub, we have to
> do something about it as well. And that won't be that easy: locate
> whatever process is using that page, and so something smart to it...
> or do some emergency evasive actions if it is one of the kernel's data
> scructures, etc.
>
> So, as you said, "background scrubbing" and "software scrubbing" really are
> very different things, and one has to expect that background scrubbing will
> eventually trigger software scrubbing, major system emergency handling
> (uncorrectable errors in kernel memory) or minor system emergency
> handling (uncorrectable errors in process memory).
>
> > There is (AFAIK) no need to do any writes here, and in fact doing so is
>
> One might want the possibility of doing inconditional writes, because
> it helps with memory bitrot on crappy hardware where the refresh
> cycles aren't enough to avoid bitrot. But you definately won't want
> it most of the time.
>
> You can also implement software-based ECC using a background scrubber
> and setting aside pages to store the ECC information. Now, THAT is
> probably not worth bothering with due to the performance impact, but
> who knows...

Actually, that would be quite cool. a) I suspect memory in my zaurus
bitrots and b) bitrotting memory over s2ram is apparently quite common.

Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Subject: Re: marching through all physical memory in software

On Sat, 31 Jan 2009, Pavel Machek wrote:
> > You can also implement software-based ECC using a background scrubber
> > and setting aside pages to store the ECC information. Now, THAT is
> > probably not worth bothering with due to the performance impact, but
> > who knows...
>
> Actually, that would be quite cool. a) I suspect memory in my zaurus
> bitrots and b) bitroting memory over s2ram is apprently quite common.

Well, software-based ECC for s2ram (calculate right before s2ram,
check-and-fix right after waking up) is certainly doable and a LOT
easier than my crazy idea of software-based generic ECC (which requires
some sort of trick to detect pages that were written to, so that you
can update their ECC information)...

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh
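
A rough userspace picture of "calculate right before s2ram, check-and-fix right after waking up" is sketched below. It is not a real ECC: it swaps in a simpler, RAID-5-like scheme, with a CRC32 per page to find a damaged page and one XOR parity page per group to rebuild it. It assumes equal-sized pages, at most one damaged page per group, and that the CRCs and the parity page themselves survive. All names are hypothetical.

```python
import zlib


def protect(pages):
    """Before suspend: return per-page CRC32 values and one XOR parity page."""
    crcs = [zlib.crc32(p) for p in pages]
    parity = bytearray(len(pages[0]))
    for p in pages:
        for i, b in enumerate(p):
            parity[i] ^= b
    return crcs, bytes(parity)


def check_and_fix(pages, crcs, parity):
    """After resume: rebuild a single damaged page in place.

    Returns the list of page indexes that failed their CRC check.
    """
    bad = [i for i, p in enumerate(pages) if zlib.crc32(p) != crcs[i]]
    if len(bad) > 1:
        raise RuntimeError("more than one damaged page in the group; cannot repair")
    if bad:
        rebuilt = bytearray(parity)
        for i, p in enumerate(pages):
            if i != bad[0]:
                for j, b in enumerate(p):
                    rebuilt[j] ^= b
        pages[bad[0]] = bytes(rebuilt)
    return bad


# Four tiny 16-byte "pages" stand in for real memory pages.
pages = [bytes([n] * 16) for n in (1, 2, 3, 4)]
original = list(pages)
crcs, parity = protect(pages)

damaged = bytearray(pages[2])
damaged[5] ^= 0x10                    # simulate a bit flipped during suspend
pages[2] = bytes(damaged)
repaired = check_and_fix(pages, crcs, parity)
```

This also shows why the generic, always-on variant is so much harder: the checksums are only valid while nothing writes to the pages, which holds during suspend but not during normal operation.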

2009-03-05 22:13:29

by Larry Woodman

Subject: [Patch] mm tracepoints

I've implemented several mm tracepoints to track page allocation and
freeing, various types of pagefaults and unmaps, and critical page
reclamation routines. This is useful for debugging memory allocation
issues and system performance problems under heavy memory loads:

# tracer: mm
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
pdflush-624 [004] 184.293169: wb_kupdate: (mm_pdflush_kupdate) count=3e48
pdflush-624 [004] 184.293439: get_page_from_freelist: (mm_page_allocation) pfn=447c27 zone_free=1940910
events/6-33 [006] 184.962879: free_hot_cold_page: (mm_page_free) pfn=44bba9
irqbalance-8313 [001] 188.042951: unmap_vmas: (mm_anon_userfree) mm=ffff88044a7300c0 address=7f9a2eb70000 pfn=24c29a
cat-9122 [005] 191.141173: filemap_fault: (mm_filemap_fault) primary fault: mm=ffff88024c9d8f40 address=3cea2dd000 pfn=44d68e
cat-9122 [001] 191.143036: handle_mm_fault: (mm_anon_fault) mm=ffff88024c8beb40 address=7fffbde99f94 pfn=24ce22
...



Signed-off-by: Larry Woodman <[email protected]>


Attachments:
mm_tracepoints.diff (21.66 kB)
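
For readers who want to post-process output like the sample above, here is a small sketch of a parser. It assumes each event sits on one line in the form "TASK-PID [CPU] TIMESTAMP: function: (event) key=value ...", which is inferred from the sample in this message rather than from any documented format, and it leaves field values as strings because the sample does not say which ones are hexadecimal.

```python
import re
from collections import Counter

# Three events copied from the sample output, one event per line.
SAMPLE_TRACE = """\
# tracer: mm
#
pdflush-624 [004] 184.293439: get_page_from_freelist: (mm_page_allocation) pfn=447c27 zone_free=1940910
events/6-33 [006] 184.962879: free_hot_cold_page: (mm_page_free) pfn=44bba9
cat-9122 [005] 191.141173: filemap_fault: (mm_filemap_fault) primary fault: mm=ffff88024c9d8f40 address=3cea2dd000 pfn=44d68e
...
"""

EVENT_RE = re.compile(
    r"^\s*(?P<task>.+)-(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]\s+"
    r"(?P<ts>\d+\.\d+):\s+(?P<func>\w+):\s+\((?P<event>\w+)\)\s*(?P<rest>.*)$"
)


def parse_trace(text):
    """Return one dict per event line; comment and unrecognised lines are skipped."""
    events = []
    for line in text.splitlines():
        if line.startswith("#"):
            continue
        m = EVENT_RE.match(line)
        if not m:
            continue
        events.append({
            "task": m.group("task"),
            "pid": int(m.group("pid")),
            "cpu": int(m.group("cpu")),
            "timestamp": float(m.group("ts")),
            "function": m.group("func"),
            "event": m.group("event"),
            "fields": dict(re.findall(r"(\w+)=(\w+)", m.group("rest"))),
        })
    return events


events = parse_trace(SAMPLE_TRACE)
per_event = Counter(e["event"] for e in events)
```

Counting by event name, or grouping by the `pfn` field, is the kind of summary that makes a long trace usable when chasing an allocation problem.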

2009-03-06 02:11:55

by KOSAKI Motohiro

Subject: Re: [Patch] mm tracepoints

> I've implemented several mm tracepoints to track page allocation and
> freeing, various types of pagefaults and unmaps, and critical page
> reclamation routines. This is useful for debugging memory allocation
> issues and system performance problems under heavy memory loads:
>
> # tracer: mm
> #
> # TASK-PID CPU# TIMESTAMP FUNCTION
> # | | | | |
> pdflush-624 [004] 184.293169: wb_kupdate:
> (mm_pdflush_kupdate) count=3e48
> pdflush-624 [004] 184.293439: get_page_from_freelist:
> (mm_page_allocation) pfn=447c27 zone_free=1940910
> events/6-33 [006] 184.962879: free_hot_cold_page:
> (mm_page_free) pfn=44bba9
> irqbalance-8313 [001] 188.042951: unmap_vmas:
> (mm_anon_userfree) mm=ffff88044a7300c0 address=7f9a2eb70000 pfn=24c29a
> cat-9122 [005] 191.141173: filemap_fault:
> (mm_filemap_fault) primary fault: mm=ffff88024c9d8f40 address=3cea2dd000
> pfn=44d68e
> cat-9122 [001] 191.143036: handle_mm_fault:
> (mm_anon_fault) mm=ffff88024c8beb40 address=7fffbde99f94 pfn=24ce22
> ...

Hi Larry,

I've started to evaluate your patch.

Firstly, this patch doesn't apply to tip/master.
Secondly, I don't think the address of the mm_struct and the pfn help with analysis.
An administrator doesn't know which file's cache the page belongs to.


2009-03-06 02:26:25

by Steven Rostedt

Subject: Re: [Patch] mm tracepoints


On Fri, 6 Mar 2009, KOSAKI Motohiro wrote:

> > I've implemented several mm tracepoints to track page allocation and
> > freeing, various types of pagefaults and unmaps, and critical page
> > reclamation routines. This is useful for debugging memory allocation
> > issues and system performance problems under heavy memory loads:
> >
> > # tracer: mm
> > #
> > # TASK-PID CPU# TIMESTAMP FUNCTION
> > # | | | | |
> > pdflush-624 [004] 184.293169: wb_kupdate:
> > (mm_pdflush_kupdate) count=3e48
> > pdflush-624 [004] 184.293439: get_page_from_freelist:
> > (mm_page_allocation) pfn=447c27 zone_free=1940910
> > events/6-33 [006] 184.962879: free_hot_cold_page:
> > (mm_page_free) pfn=44bba9
> > irqbalance-8313 [001] 188.042951: unmap_vmas:
> > (mm_anon_userfree) mm=ffff88044a7300c0 address=7f9a2eb70000 pfn=24c29a
> > cat-9122 [005] 191.141173: filemap_fault:
> > (mm_filemap_fault) primary fault: mm=ffff88024c9d8f40 address=3cea2dd000
> > pfn=44d68e
> > cat-9122 [001] 191.143036: handle_mm_fault:
> > (mm_anon_fault) mm=ffff88024c8beb40 address=7fffbde99f94 pfn=24ce22
> > ...
>
> Hi Larry,
>
> I've started to evaluate your patch.
>
> firstly, this patch can't apply tip/master.
> secondly, I don't think the address of mm_struct and pfn help to analysis.
> administrator don't know the page is which file's cache.

The mm_struct may not be helpful since there should be a 1 to 1 mapping
between user tasks and the mm struct. Hmm, maybe not, due to threads?

But the pfn is helpful since it is a unique identifier for what physical
page was mapped.

-- Steve

2009-03-06 11:04:46

by Ingo Molnar

Subject: Re: [Patch] mm tracepoints


* Steven Rostedt <[email protected]> wrote:

>
> On Fri, 6 Mar 2009, KOSAKI Motohiro wrote:
>
> > > I've implemented several mm tracepoints to track page allocation and
> > > freeing, various types of pagefaults and unmaps, and critical page
> > > reclamation routines. This is useful for debugging memory allocation
> > > issues and system performance problems under heavy memory loads:
> > >
> > > # tracer: mm
> > > #
> > > # TASK-PID CPU# TIMESTAMP FUNCTION
> > > # | | | | |
> > > pdflush-624 [004] 184.293169: wb_kupdate:
> > > (mm_pdflush_kupdate) count=3e48
> > > pdflush-624 [004] 184.293439: get_page_from_freelist:
> > > (mm_page_allocation) pfn=447c27 zone_free=1940910
> > > events/6-33 [006] 184.962879: free_hot_cold_page:
> > > (mm_page_free) pfn=44bba9
> > > irqbalance-8313 [001] 188.042951: unmap_vmas:
> > > (mm_anon_userfree) mm=ffff88044a7300c0 address=7f9a2eb70000 pfn=24c29a
> > > cat-9122 [005] 191.141173: filemap_fault:
> > > (mm_filemap_fault) primary fault: mm=ffff88024c9d8f40 address=3cea2dd000
> > > pfn=44d68e
> > > cat-9122 [001] 191.143036: handle_mm_fault:
> > > (mm_anon_fault) mm=ffff88024c8beb40 address=7fffbde99f94 pfn=24ce22
> > > ...
> >
> > Hi Larry,
> >
> > I've started to evaluate your patch.
> >
> > firstly, this patch can't apply tip/master.

yeah, would be nice to have a patch against:

http://people.redhat.com/mingo/tip.git/README

> > secondly, I don't think the address of mm_struct and pfn
> > help to analysis. administrator don't know the page is which
> > file's cache.
>
> The mm_struct may not be helpful since there should be a 1 to
> 1 mapping between user tasks and the mm struct. Hmm, maybe
> not, due to threads?

Correct - so the mm ID looks useful.

> But the pfn is helpful since it is a unique identifier for
> what physical page was mapped.

Yeah. Nevertheless some sort of filename:offset indicator would
be nice too. (as an add-on)

Ingo

2009-03-06 12:35:18

by Larry Woodman

Subject: Re: [Patch] mm tracepoints

On Fri, 2009-03-06 at 12:04 +0100, Ingo Molnar wrote:
> * Steven Rostedt <[email protected]> wrote:
>
> >
> > On Fri, 6 Mar 2009, KOSAKI Motohiro wrote:
> >
> > > > I've implemented several mm tracepoints to track page allocation and
> > > > freeing, various types of pagefaults and unmaps, and critical page
> > > > reclamation routines. This is useful for debugging memory allocation
> > > > issues and system performance problems under heavy memory loads:
> > > >
> > > > # tracer: mm
> > > > #
> > > > # TASK-PID CPU# TIMESTAMP FUNCTION
> > > > # | | | | |
> > > > pdflush-624 [004] 184.293169: wb_kupdate:
> > > > (mm_pdflush_kupdate) count=3e48
> > > > pdflush-624 [004] 184.293439: get_page_from_freelist:
> > > > (mm_page_allocation) pfn=447c27 zone_free=1940910
> > > > events/6-33 [006] 184.962879: free_hot_cold_page:
> > > > (mm_page_free) pfn=44bba9
> > > > irqbalance-8313 [001] 188.042951: unmap_vmas:
> > > > (mm_anon_userfree) mm=ffff88044a7300c0 address=7f9a2eb70000 pfn=24c29a
> > > > cat-9122 [005] 191.141173: filemap_fault:
> > > > (mm_filemap_fault) primary fault: mm=ffff88024c9d8f40 address=3cea2dd000
> > > > pfn=44d68e
> > > > cat-9122 [001] 191.143036: handle_mm_fault:
> > > > (mm_anon_fault) mm=ffff88024c8beb40 address=7fffbde99f94 pfn=24ce22
> > > > ...
> > >
> > > Hi Larry,
> > >
> > > I've started to evaluate your patch.
> > >
> > > firstly, this patch can't apply tip/master.
>
> yeah, would be nice to have a patch against:
>
> http://people.redhat.com/mingo/tip.git/README

Yeah I'll fix that, it is a moving target.
>
> > > secondly, I don't think the address of mm_struct and pfn
> > > help to analysis. administrator don't know the page is which
> > > file's cache.
> >
> > The mm_struct may not be helpful since there should be a 1 to
> > 1 mapping between user tasks and the mm struct. Hmm, maybe
> > not, due to threads?
>
> Correct - so the mm ID looks useful.
>
> > But the pfn is helpful since it is a unique identifier for
> > what physical page was mapped.
>
> Yeah. Nevertheless some sort of filename:offset indicator would
> be nice too. (as an add-on)

You mean in the filemap pagefault case???

Thanks, Larry

>
> Ingo

2009-03-06 13:56:18

by Ingo Molnar

Subject: Re: [Patch] mm tracepoints


* Larry Woodman <[email protected]> wrote:

> On Fri, 2009-03-06 at 12:04 +0100, Ingo Molnar wrote:
> > * Steven Rostedt <[email protected]> wrote:
> >
> > >
> > > On Fri, 6 Mar 2009, KOSAKI Motohiro wrote:
> > >
> > > > > I've implemented several mm tracepoints to track page allocation and
> > > > > freeing, various types of pagefaults and unmaps, and critical page
> > > > > reclamation routines. This is useful for debugging memory allocation
> > > > > issues and system performance problems under heavy memory loads:
> > > > >
> > > > > # tracer: mm
> > > > > #
> > > > > # TASK-PID CPU# TIMESTAMP FUNCTION
> > > > > # | | | | |
> > > > > pdflush-624 [004] 184.293169: wb_kupdate:
> > > > > (mm_pdflush_kupdate) count=3e48
> > > > > pdflush-624 [004] 184.293439: get_page_from_freelist:
> > > > > (mm_page_allocation) pfn=447c27 zone_free=1940910
> > > > > events/6-33 [006] 184.962879: free_hot_cold_page:
> > > > > (mm_page_free) pfn=44bba9
> > > > > irqbalance-8313 [001] 188.042951: unmap_vmas:
> > > > > (mm_anon_userfree) mm=ffff88044a7300c0 address=7f9a2eb70000 pfn=24c29a
> > > > > cat-9122 [005] 191.141173: filemap_fault:
> > > > > (mm_filemap_fault) primary fault: mm=ffff88024c9d8f40 address=3cea2dd000
> > > > > pfn=44d68e
> > > > > cat-9122 [001] 191.143036: handle_mm_fault:
> > > > > (mm_anon_fault) mm=ffff88024c8beb40 address=7fffbde99f94 pfn=24ce22
> > > > > ...
> > > >
> > > > Hi Larry,
> > > >
> > > > I've started to evaluate your patch.
> > > >
> > > > firstly, this patch can't apply tip/master.
> >
> > yeah, would be nice to have a patch against:
> >
> > http://people.redhat.com/mingo/tip.git/README
>
> Yeah I'll fix that, it is a moving target.
> >
> > > > secondly, I don't think the address of mm_struct and pfn
> > > > help to analysis. administrator don't know the page is which
> > > > file's cache.
> > >
> > > The mm_struct may not be helpful since there should be a 1 to
> > > 1 mapping between user tasks and the mm struct. Hmm, maybe
> > > not, due to threads?
> >
> > Correct - so the mm ID looks useful.
> >
> > > But the pfn is helpful since it is a unique identifier for
> > > what physical page was mapped.
> >
> > Yeah. Nevertheless some sort of filename:offset indicator
> > would be nice too. (as an add-on)
>
> You mean in the filemap pagefault case???

Would that be useless or controversial? We know from
vma->mapping which inode it maps to. Knowing which file is
faulting in can be useful - especially when addresses are a
moving target such as under PIE or with dlopen(), etc.

Ingo

2009-03-06 16:54:24

by Larry Woodman

Subject: Re: [Patch] mm tracepoints

On Fri, 2009-03-06 at 14:55 +0100, Ingo Molnar wrote:
> * Larry Woodman <[email protected]> wrote:
>
> > On Fri, 2009-03-06 at 12:04 +0100, Ingo Molnar wrote:
> > > * Steven Rostedt <[email protected]> wrote:
> > >
> > > >
> > > > On Fri, 6 Mar 2009, KOSAKI Motohiro wrote:
> > > >
> > > > > > I've implemented several mm tracepoints to track page allocation and
> > > > > > freeing, various types of pagefaults and unmaps, and critical page
> > > > > > reclamation routines. This is useful for debugging memory allocation
> > > > > > issues and system performance problems under heavy memory loads:
> > > > > >
> > > > > > # tracer: mm
> > > > > > #
> > > > > > # TASK-PID CPU# TIMESTAMP FUNCTION
> > > > > > # | | | | |
> > > > > > pdflush-624 [004] 184.293169: wb_kupdate:
> > > > > > (mm_pdflush_kupdate) count=3e48
> > > > > > pdflush-624 [004] 184.293439: get_page_from_freelist:
> > > > > > (mm_page_allocation) pfn=447c27 zone_free=1940910
> > > > > > events/6-33 [006] 184.962879: free_hot_cold_page:
> > > > > > (mm_page_free) pfn=44bba9
> > > > > > irqbalance-8313 [001] 188.042951: unmap_vmas:
> > > > > > (mm_anon_userfree) mm=ffff88044a7300c0 address=7f9a2eb70000 pfn=24c29a
> > > > > > cat-9122 [005] 191.141173: filemap_fault:
> > > > > > (mm_filemap_fault) primary fault: mm=ffff88024c9d8f40 address=3cea2dd000
> > > > > > pfn=44d68e
> > > > > > cat-9122 [001] 191.143036: handle_mm_fault:
> > > > > > (mm_anon_fault) mm=ffff88024c8beb40 address=7fffbde99f94 pfn=24ce22
> > > > > > ...
> > > > >
> > > > > Hi Larry,
> > > > >
> > > > > I've started to evaluate your patch.
> > > > >
> > > > > firstly, this patch can't apply tip/master.
> > >
> > > yeah, would be nice to have a patch against:
> > >
> > > http://people.redhat.com/mingo/tip.git/README
> >
> > Yeah I'll fix that, it is a moving target.
> > >
> > > > > secondly, I don't think the address of mm_struct and pfn
> > > > > help to analysis. administrator don't know the page is which
> > > > > file's cache.
> > > >
> > > > The mm_struct may not be helpful since there should be a 1 to
> > > > 1 mapping between user tasks and the mm struct. Hmm, maybe
> > > > not, due to threads?
> > >
> > > Correct - so the mm ID looks useful.
> > >
> > > > But the pfn is helpful since it is a unique identifier for
> > > > what physical page was mapped.
> > >
> > > Yeah. Nevertheless some sort of filename:offset indicator
> > > would be nice too. (as an add-on)
> >
> > You mean in the filemap pagefault case???
>
> Would that be useless or controversial? We know from
> vma->mapping which inode it maps to. Knowing which file is
> faulting in can be useful - especially when addresses are a
> moving target such as under PIE or with dlopen(), etc.
>
> Ingo

Attached is the updated patch that applies and builds correctly (sorry I
missed the lockdep tracepoints that were added at the last minute). As
far as the filename:offset is concerned, I am working on that. It's not
as simple as it looks because we have to follow a variable list of
structs that can be NULL-terminated in several places along the way.

Larry



Attachments:
mm_tracepoints.diff (21.72 kB)

2009-03-06 17:10:47

by Ingo Molnar

Subject: Re: [Patch] mm tracepoints


* Larry Woodman <[email protected]> wrote:

> > Would that be useless or controversial? We know from
> > vma->mapping which inode it maps to. Knowing which file is
> > faulting in can be useful - especially when addresses are a
> > moving target such as under PIE or with dlopen(), etc.
> >
> > Ingo
>
> Attached is the updated patch that applies and builds
> correctly (sorry I missed the lockdep tracepoints that were
> added at the last minute). [...]

Looks pretty good and useful to me. I've Cc:-ed more mm folks,
it would be nice to hear their opinion about these tracepoints.

Andrew, Nick, Peter, what do you think?

About the motivation of these tracepoints: i suspect these
tracepoints reflect your years-long experience in dealing with
various MM regressions in the enterprise space and these
tracepoints would help understand such regressions
faster/easier?

> [...] As far as the filename:offset is concerned I am working
> on that. Its not as simple as it looks because we have to
> follow a variable list of structs that can be null terminated
> several places along the way.

It's definitely not simple! I don't think it should be in this
base patch at all - it should be an add-on.

Ingo

2009-03-06 17:38:55

by Peter Zijlstra

Subject: Re: [Patch] mm tracepoints

On Fri, 2009-03-06 at 18:10 +0100, Ingo Molnar wrote:
> Looks pretty good and useful to me. I've Cc:-ed more mm folks,
> it would be nice to hear their opinion about these tracepoints.
>
> Andrew, Nick, Peter, what do you think?

Bit sad we use the struct mm_struct * as mm identifier (little %lx vs %p
confusion there too), but I suppose there simply isn't anything better.

Exposing kernel pointers like that might upset some of the security
folks, not sure if I care though.

I'm missing the fault_filemap_read counterpart of fault_anon_pgin.

Once you have anon/filemap symmetric, you might consider folding these
and doing the anon argument thing you do elsewhere.

Initially I was thinking we lacked the kswapd vs direct reclaim
information on the pgout data, but since we log the pid:comm for each
event...

Which brings us to mm_pdflush_*, we can already see it's pdflush from
pid:comm, then again, it fits the naming style. Same for
mm_directreclaim*() - we already know it's direct, since it's not kswapd
doing it.

Finally, we have page_free, but not page_alloc? Oh, it is there, just
not in the obvious place.


Things missing, we trace unmap, but not mmap, mprotect, mlock?

pagelock perhaps?


2009-03-06 17:47:19

by Ingo Molnar

Subject: Re: [Patch] mm tracepoints


* Peter Zijlstra <[email protected]> wrote:

> On Fri, 2009-03-06 at 18:10 +0100, Ingo Molnar wrote:
> > Looks pretty good and useful to me. I've Cc:-ed more mm folks,
> > it would be nice to hear their opinion about these tracepoints.
> >
> > Andrew, Nick, Peter, what do you think?
>
> Bit sad we use the struct mm_struct * as mm identifier (little
> %lx vs %p confusion there too), but I suppose there simply
> isn't anything better.

the other option would be to trace the pgd physical pfn value.
The physical address of the pagetable is a pretty fundamental
thing so that abstraction is unlikely to change.

> Exposing kernel pointers like that might upset some of the
> security folks, not sure if I care though.

it's admin-only.

> I'm missing the fault_filemap_read counterpart of
> fault_anon_pgin.
>
> Once you have anon/filemap symmetric, you might consider
> folding these and doing the anon argument thing you do
> elsewhere.
>
> Initially I was thinking we lacked the kswapd vs direct
> reclaim information on the pgout data, but since we log the
> pid:comm for each event...
>
> Which brings us to mm_pdflush_*, we can already see its
> pdflush from pid:comm, then again, it fits the naming style.
> Same for mm_directreclaim*() - we already know its direct,
> since its not kswapd doing it.
>
> Finally, we have page_free, but not page_alloc? Oh, it is
> there, just not in the obvious place.
>
> Things missing, we trace unmap, but not mmap, mprotect, mlock?
>
> pagelock perhaps?

yeah, pagelock would be nice. In a similar way to lockdep
tracing. Maybe it should be part of lock tracing?

Ingo

2009-03-06 17:57:20

by Peter Zijlstra

Subject: Re: [Patch] mm tracepoints

On Fri, 2009-03-06 at 18:38 +0100, Peter Zijlstra wrote:
> On Fri, 2009-03-06 at 18:10 +0100, Ingo Molnar wrote:
> > Looks pretty good and useful to me. I've Cc:-ed more mm folks,
> > it would be nice to hear their opinion about these tracepoints.
> >
> > Andrew, Nick, Peter, what do you think?
>
> Bit sad we use the struct mm_struct * as mm identifier (little %lx vs %p
> confusion there too), but I suppose there simply isn't anything better.

> Things missing,

Why only anon and filemap? That misses out on all the funky driver
->fault() handlers.

2009-03-06 18:02:25

by Ingo Molnar

Subject: Re: [Patch] mm tracepoints


* Peter Zijlstra <[email protected]> wrote:

> On Fri, 2009-03-06 at 18:38 +0100, Peter Zijlstra wrote:
> > On Fri, 2009-03-06 at 18:10 +0100, Ingo Molnar wrote:
> > > Looks pretty good and useful to me. I've Cc:-ed more mm folks,
> > > it would be nice to hear their opinion about these tracepoints.
> > >
> > > Andrew, Nick, Peter, what do you think?
> >
> > Bit sad we use the struct mm_struct * as mm identifier (little %lx vs %p
> > confusion there too), but I suppose there simply isn't anything better.
>
> > Things missing,
>
> Why only anon and filemap, that misses out on all the funky
> driver ->fault() handlers.

btw., does it include shm faults? I think all of this would be
handled if the tracepoint was at handle_mm_fault(), right?

Ingo

2009-03-06 18:21:30

by Peter Zijlstra

Subject: Re: [Patch] mm tracepoints

On Fri, 2009-03-06 at 19:01 +0100, Ingo Molnar wrote:
> * Peter Zijlstra <[email protected]> wrote:
>
> > On Fri, 2009-03-06 at 18:38 +0100, Peter Zijlstra wrote:
> > > On Fri, 2009-03-06 at 18:10 +0100, Ingo Molnar wrote:
> > > > Looks pretty good and useful to me. I've Cc:-ed more mm folks,
> > > > it would be nice to hear their opinion about these tracepoints.
> > > >
> > > > Andrew, Nick, Peter, what do you think?
> > >
> > > Bit sad we use the struct mm_struct * as mm identifier (little %lx vs %p
> > > confusion there too), but I suppose there simply isn't anything better.
> >
> > > Things missing,
> >
> > Why only anon and filemap, that misses out on all the funky
> > driver ->fault() handlers.
>
> btw., does it include shm faults? I think all of this would be
> handled if the tracepoint was at handle_mm_fault(), right?

Partially, you wouldn't be able to do the file:offset thing you asked
for.

But yeah, also hugetlb seems to be missing.

2009-03-06 18:25:42

by Ingo Molnar

Subject: Re: [Patch] mm tracepoints


* Peter Zijlstra <[email protected]> wrote:

> On Fri, 2009-03-06 at 19:01 +0100, Ingo Molnar wrote:
> > * Peter Zijlstra <[email protected]> wrote:
> >
> > > On Fri, 2009-03-06 at 18:38 +0100, Peter Zijlstra wrote:
> > > > On Fri, 2009-03-06 at 18:10 +0100, Ingo Molnar wrote:
> > > > > Looks pretty good and useful to me. I've Cc:-ed more mm folks,
> > > > > it would be nice to hear their opinion about these tracepoints.
> > > > >
> > > > > Andrew, Nick, Peter, what do you think?
> > > >
> > > > Bit sad we use the struct mm_struct * as mm identifier (little %lx vs %p
> > > > confusion there too), but I suppose there simply isn't anything better.
> > >
> > > > Things missing,
> > >
> > > Why only anon and filemap, that misses out on all the funky
> > > driver ->fault() handlers.
> >
> > btw., does it include shm faults? I think all of this would
> > be handled if the tracepoint was at handle_mm_fault(),
> > right?
>
> Partially, you wouldn't be able to do the file:offset thing
> you asked for.

That could be done further down in filemap_fault(). I.e. have an
all-encompassing tracepoint for all things [user-] page faults,
and a few opt-in places for more interesting specific fault
types.

> But yeah, also hugetlb seems to be missing.

Probably not that huge of an issue, given how rare those faults
are ;-)

Ingo

2009-03-06 19:03:21

by Larry Woodman

Subject: Re: [Patch] mm tracepoints

On Fri, 2009-03-06 at 18:38 +0100, Peter Zijlstra wrote:
> On Fri, 2009-03-06 at 18:10 +0100, Ingo Molnar wrote:
> > Looks pretty good and useful to me. I've Cc:-ed more mm folks,
> > it would be nice to hear their opinion about these tracepoints.
> >
> > Andrew, Nick, Peter, what do you think?
>
> Bit sad we use the struct mm_struct * as mm identifier (little %lx vs %p
> confusion there too), but I suppose there simply isn't anything better.
>
> Exposing kernel pointers like that might upset some of the security
> folks, not sure if I care though.
>
> I'm missing the fault_filemap_read counterpart of fault_anon_pgin.

filemap_fault handles both the initial fault when the pte is zero and
pagein when the page has been reclaimed. It was impossible to implement
them as separate handlers in __do_fault() without changing the
underlying MM code.

>
> Once you have anon/filemap symmetric, you might consider folding these
> and doing the anon argument thing you do elsewhere.
>
> Initially I was thinking we lacked the kswapd vs direct reclaim
> information on the pgout data, but since we log the pid:comm for each
> event...

They are separate, trace_mm_kswapd_runs() and
trace_mm_directreclaim_reclaimall().
trace_mm_directreclaim_reclaimzone() is for the zone_reclaim path where
we do local zone reclamation rather than falling off to the next zone in
the zone list.

>
> Which brings us to mm_pdflush_*, we can already see its pdflush from
> pid:comm, then again, it fits the naming style. Same for
> mm_directreclaim*() - we already know its direct, since its not kswapd
> doing it.
>

Like I said above there are 2 direct reclaim paths: one is the call to
try_to_free_pages() out of __alloc_pages_internal() and the other is the
call to shrink_zone() out of __zone_reclaim(). I made a distinction
between these because the first calls shrink_zone() for each zone in the
zone list when memory is really low (below min), whereas the second calls
shrink_zone() for the local zone to prevent memory allocation from a
remote node.

> Finally, we have page_free, but not page_alloc? Oh, it is there, just
> not in the obvious place.

In order to get the zone free information it has to be down in
get_page_from_freelist().

>
>
> Things missing, we trace unmap, but not mmap, mprotect, mlock?
>
I was concentrating more on the operations that traced a page moving
throughout the system. mmap and mprotect operate on the virtual address
space instead of the pages mapped in that address space.


Larry

2009-03-06 19:19:23

by Larry Woodman

Subject: Re: [Patch] mm tracepoints

On Fri, 2009-03-06 at 18:10 +0100, Ingo Molnar wrote:
> * Larry Woodman <[email protected]> wrote:
>
> > > Would that be useless or controversial? We know from
> > > vma->mapping which inode it maps to. Knowing which file is
> > > faulting in can be useful - especially when addresses are a
> > > moving target such as under PIE or with dlopen(), etc.
> > >
> > > Ingo
> >
> > Attached is the updated patch that applies and builds
> > correctly (sorry I missed the lockdep tracepoints that were
> > added at the last minute). [...]
>
> Looks pretty good and useful to me. I've Cc:-ed more mm folks,
> it would be nice to hear their opinion about these tracepoints.
>
> Andrew, Nick, Peter, what do you think?
>
> About the motivation of these tracepoints: i suspect these
> tracepoints reflect your years-long experience in dealing with
> various MM regressions in the enterprise space and these
> tracepoints would help understand such regressions
> faster/easier?

Exactly, and without running some "debug enhanced kernel".

>
> > [...] As far as the filename:offset is concerned I am working
> > on that. Its not as simple as it looks because we have to
> > follow a variable list of structs that can be null terminated
> > several places along the way.
>
> It's definitely not simple! I dont think it should be in this
> base patch at all - it should be an add-on.
>
> Ingo

2009-03-06 19:58:28

by Larry Woodman

Subject: Re: [Patch] mm tracepoints

On Fri, 2009-03-06 at 19:01 +0100, Ingo Molnar wrote:
> * Peter Zijlstra <[email protected]> wrote:
>
> > On Fri, 2009-03-06 at 18:38 +0100, Peter Zijlstra wrote:
> > > On Fri, 2009-03-06 at 18:10 +0100, Ingo Molnar wrote:
> > > > Looks pretty good and useful to me. I've Cc:-ed more mm folks,
> > > > it would be nice to hear their opinion about these tracepoints.
> > > >
> > > > Andrew, Nick, Peter, what do you think?
> > >
> > > Bit sad we use the struct mm_struct * as mm identifier (little %lx vs %p
> > > confusion there too), but I suppose there simply isn't anything better.
> >
> > > Things missing,
> >
> > Why only anon and filemap, that misses out on all the funky
> > driver ->fault() handlers.
>
> btw., does it include shm faults? I think all of this would be
> handled if the tracepoint was at handle_mm_fault(), right?

The problem with this approach is you can't tell what kind of fault is
being encountered and how it will be handled until you are way down in
the functions that I added the tracepoints in...

The value of these tracepoints is the data you get from where they are
currently located.

Larry

>
> Ingo

2009-03-06 21:17:37

by Andrew Morton

Subject: Re: [Patch] mm tracepoints

On Thu, 05 Mar 2009 17:16:40 -0500
Larry Woodman <[email protected]> wrote:

> I've implemented several mm tracepoints to track page allocation and
> freeing, various types of pagefaults and unmaps, and critical page
> reclamation routines. This is useful for debugging memory allocation
> issues and system performance problems under heavy memory loads:
>
> # tracer: mm
> #
> # TASK-PID CPU# TIMESTAMP FUNCTION
> # | | | | |
> pdflush-624 [004] 184.293169: wb_kupdate:
> (mm_pdflush_kupdate) count=3e48
> pdflush-624 [004] 184.293439: get_page_from_freelist:
> (mm_page_allocation) pfn=447c27 zone_free=1940910
> events/6-33 [006] 184.962879: free_hot_cold_page:
> (mm_page_free) pfn=44bba9
> irqbalance-8313 [001] 188.042951: unmap_vmas:
> (mm_anon_userfree) mm=ffff88044a7300c0 address=7f9a2eb70000 pfn=24c29a
> cat-9122 [005] 191.141173: filemap_fault:
> (mm_filemap_fault) primary fault: mm=ffff88024c9d8f40 address=3cea2dd000
> pfn=44d68e
> cat-9122 [001] 191.143036: handle_mm_fault:
> (mm_anon_fault) mm=ffff88024c8beb40 address=7fffbde99f94 pfn=24ce22
> ...

I'm struggling to think of any memory management problems which this
facility would have helped us solve. Single-page tracing like this
isn't very interesting or useful.

What we generally are looking for when resolving MM
performance/correctness problems is a representation/visualisation of
aggregated results over a period of time. That means synchronous or
downstream processing of large amounts of bulk data.

Now, possibly the above information could be used to generate the
needed information. But the above rather random-looking and chaotic
data output would make it very hard to develop the needed
aggregation/representation tools.

And unless someone actually develops those tools (which is a lot of
work), there isn't much point in adding the kernel infrastructure to
generate the data for the non-existing tool.

I haven't looked at LTT in a while. What sort of information does it
extract from the MM system? Is it useful to MM developers? If so, can
this newly-proposed facility do the same thing?


How about a test case - how could this patch help us (and our testers)
make some progress with the infamous
http://bugzilla.kernel.org/show_bug.cgi?id=12309 ?


Then again, maybe I'm wrong! Maybe MM developers _do_ believe that
this tool would assist them in their work. Given that MM developers
are the target market for this feature, it would be sensible to cc the
linux-mm list, methinks?
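
As a rough illustration of the downstream aggregation described above,
here is a minimal Python sketch. It assumes the trace has been captured
as text with one event per line, in the layout of the sample output
quoted earlier in this message (wrapped lines joined). The names
aggregate(), parse_events() and the 5-second bucket size are
hypothetical choices for the example, not part of the patch or of any
existing tool.

```python
import re
from collections import Counter, defaultdict

# Sample events copied from the trace output quoted in the thread,
# with each wrapped event joined back onto a single line.
SAMPLE = """\
# tracer: mm
pdflush-624 [004] 184.293169: wb_kupdate: (mm_pdflush_kupdate) count=3e48
pdflush-624 [004] 184.293439: get_page_from_freelist: (mm_page_allocation) pfn=447c27 zone_free=1940910
events/6-33 [006] 184.962879: free_hot_cold_page: (mm_page_free) pfn=44bba9
irqbalance-8313 [001] 188.042951: unmap_vmas: (mm_anon_userfree) mm=ffff88044a7300c0 address=7f9a2eb70000 pfn=24c29a
cat-9122 [005] 191.141173: filemap_fault: (mm_filemap_fault) primary fault: mm=ffff88024c9d8f40 address=3cea2dd000 pfn=44d68e
cat-9122 [001] 191.143036: handle_mm_fault: (mm_anon_fault) mm=ffff88024c8beb40 address=7fffbde99f94 pfn=24ce22
"""

# comm-pid [cpu] timestamp: function: (event) ...
EVENT_RE = re.compile(
    r"^\s*(?P<comm>.+)-(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]\s+"
    r"(?P<ts>\d+\.\d+):\s+(?P<func>\w+):\s+\((?P<event>\w+)\)"
)


def parse_events(text):
    """Yield (comm, pid, timestamp, event) for every line that parses."""
    for line in text.splitlines():
        m = EVENT_RE.match(line)
        if m:  # '#' header lines and unparseable lines are skipped
            yield m["comm"], int(m["pid"]), float(m["ts"]), m["event"]


def aggregate(text, bucket_seconds=5.0):
    """Count events per time bucket and per task.

    Returns (per_bucket, per_task): per_bucket maps the bucket start
    time to a Counter of event names; per_task maps (comm, pid) to the
    number of events logged by that task.
    """
    per_bucket = defaultdict(Counter)
    per_task = Counter()
    for comm, pid, ts, event in parse_events(text):
        bucket = int(ts // bucket_seconds) * bucket_seconds
        per_bucket[bucket][event] += 1
        per_task[(comm, pid)] += 1
    return per_bucket, per_task


if __name__ == "__main__":
    buckets, tasks = aggregate(SAMPLE)
    for start in sorted(buckets):
        print(f"{start:8.1f}s", dict(buckets[start]))
    for (comm, pid), count in tasks.most_common():
        print(f"{comm}-{pid}: {count} events")
```

Counting per time bucket and per task is only one possible reduction;
whether such summaries would actually help with a problem like the one
in the bugzilla entry above is exactly the open question raised here.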

2009-03-06 21:54:20

by Chris Friesen

Subject: Re: [Patch] mm tracepoints

Peter Zijlstra wrote:
> On Fri, 2009-03-06 at 18:10 +0100, Ingo Molnar wrote:
>> Looks pretty good and useful to me. I've Cc:-ed more mm folks,
>> it would be nice to hear their opinion about these tracepoints.
>>
>> Andrew, Nick, Peter, what do you think?
>
> Bit sad we use the struct mm_struct * as mm identifier (little %lx vs %p
> confusion there too), but I suppose there simply isn't anything better.

Could we use the tgid as an mm identifier? Or does the possibility of
CLONE_VM & !CLONE_THREAD preclude this?

Chris

2009-03-25 18:05:55

by Larry Woodman

Subject: Latest mm tracepoints patch merged to your tip tree

Ingo, attached is the latest mm tracepoints patch I sent to lkml
yesterday merged up to your latest tip tree in
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git

--------------------------------------------------------------------

I've implemented several mm tracepoints to track page allocation and
freeing, various types of pagefaults and unmaps, and critical page
reclamation routines. This is useful for debugging memory allocation
issues and system performance problems under heavy memory loads.

I have also addressed Rik van Riel's comments:

>It looks mostly good.
>
>I believe that the vmscan.c tracepoints could be a little
>more verbose though, it would be useful to know whether we
>are scanning anon or file pages and whether or not we're
>doing lumpy reclaim. Possibly the priority level, too.

----------------------------------------------------------------------


# tracer: mm
#
#       TASK-PID  CPU#   TIMESTAMP    FUNCTION
#          | |      |        |            |
     pdflush-624  [004]  184.293169:  wb_kupdate: (mm_pdflush_kupdate) count=3e48
     pdflush-624  [004]  184.293439:  get_page_from_freelist: (mm_page_allocation) pfn=447c27 zone_free=1940910
     events/6-33  [006]  184.962879:  free_hot_cold_page: (mm_page_free) pfn=44bba9
 irqbalance-8313  [001]  188.042951:  unmap_vmas: (mm_anon_userfree) mm=ffff88044a7300c0 address=7f9a2eb70000 pfn=24c29a
        cat-9122  [005]  191.141173:  filemap_fault: (mm_filemap_fault) primary fault: mm=ffff88024c9d8f40 address=3cea2dd000 pfn=44d68e
        cat-9122  [001]  191.143036:  handle_mm_fault: (mm_anon_fault) mm=ffff88024c8beb40 address=7fffbde99f94 pfn=24ce22
...



Signed-off-by: Larry Woodman <[email protected]>


Attachments:
mm_tracepoints.patch (23.66 kB)