2007-11-13 19:41:35

by Mathieu Desnoyers

[permalink] [raw]
Subject: [RFC 5/7] LTTng instrumentation mm

Memory management core events.

Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: [email protected]
---
mm/filemap.c | 4 ++++
mm/memory.c | 34 +++++++++++++++++++++++++---------
mm/page_alloc.c | 5 +++++
mm/page_io.c | 1 +
4 files changed, 35 insertions(+), 9 deletions(-)

Index: linux-2.6-lttng/mm/filemap.c
===================================================================
--- linux-2.6-lttng.orig/mm/filemap.c 2007-11-13 09:25:26.000000000 -0500
+++ linux-2.6-lttng/mm/filemap.c 2007-11-13 09:49:35.000000000 -0500
@@ -514,9 +514,13 @@ void fastcall wait_on_page_bit(struct pa
{
DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);

+ trace_mark(mm_filemap_wait_start, "address %p", page_address(page));
+
if (test_bit(bit_nr, &page->flags))
__wait_on_bit(page_waitqueue(page), &wait, sync_page,
TASK_UNINTERRUPTIBLE);
+
+ trace_mark(mm_filemap_wait_end, "address %p", page_address(page));
}
EXPORT_SYMBOL(wait_on_page_bit);

Index: linux-2.6-lttng/mm/memory.c
===================================================================
--- linux-2.6-lttng.orig/mm/memory.c 2007-11-13 09:45:41.000000000 -0500
+++ linux-2.6-lttng/mm/memory.c 2007-11-13 09:49:35.000000000 -0500
@@ -2072,6 +2072,7 @@ static int do_swap_page(struct mm_struct
delayacct_set_flag(DELAYACCT_PF_SWAPIN);
page = lookup_swap_cache(entry);
if (!page) {
+ trace_mark(mm_swap_in, "address #p%lu", address);
grab_swap_token(); /* Contend for token _before_ read-in */
swapin_readahead(entry, address, vma);
page = read_swap_cache_async(entry, vma, address);
@@ -2526,30 +2527,45 @@ unlock:
int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, int write_access)
{
+ int res;
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;

+ trace_mark(mm_handle_fault_entry, "address %lu ip #p%ld",
+ address, KSTK_EIP(current));
+
__set_current_state(TASK_RUNNING);

count_vm_event(PGFAULT);

- if (unlikely(is_vm_hugetlb_page(vma)))
- return hugetlb_fault(mm, vma, address, write_access);
+ if (unlikely(is_vm_hugetlb_page(vma))) {
+ res = hugetlb_fault(mm, vma, address, write_access);
+ goto end;
+ }

pgd = pgd_offset(mm, address);
pud = pud_alloc(mm, pgd, address);
- if (!pud)
- return VM_FAULT_OOM;
+ if (!pud) {
+ res = VM_FAULT_OOM;
+ goto end;
+ }
pmd = pmd_alloc(mm, pud, address);
- if (!pmd)
- return VM_FAULT_OOM;
+ if (!pmd) {
+ res = VM_FAULT_OOM;
+ goto end;
+ }
pte = pte_alloc_map(mm, pmd, address);
- if (!pte)
- return VM_FAULT_OOM;
+ if (!pte) {
+ res = VM_FAULT_OOM;
+ goto end;
+ }

- return handle_pte_fault(mm, vma, address, pte, pmd, write_access);
+ res = handle_pte_fault(mm, vma, address, pte, pmd, write_access);
+end:
+ trace_mark(mm_handle_fault_exit, MARK_NOARGS);
+ return res;
}

#ifndef __PAGETABLE_PUD_FOLDED
Index: linux-2.6-lttng/mm/page_alloc.c
===================================================================
--- linux-2.6-lttng.orig/mm/page_alloc.c 2007-11-13 09:25:26.000000000 -0500
+++ linux-2.6-lttng/mm/page_alloc.c 2007-11-13 09:49:35.000000000 -0500
@@ -519,6 +519,9 @@ static void __free_pages_ok(struct page
int i;
int reserved = 0;

+ trace_mark(mm_page_free, "order %u address %p",
+ order, page_address(page));
+
for (i = 0 ; i < (1 << order) ; ++i)
reserved += free_pages_check(page + i);
if (reserved)
@@ -1639,6 +1642,8 @@ fastcall unsigned long __get_free_pages(
page = alloc_pages(gfp_mask, order);
if (!page)
return 0;
+ trace_mark(mm_page_alloc, "order %u address %p",
+ order, page_address(page));
return (unsigned long) page_address(page);
}

Index: linux-2.6-lttng/mm/page_io.c
===================================================================
--- linux-2.6-lttng.orig/mm/page_io.c 2007-11-13 09:25:26.000000000 -0500
+++ linux-2.6-lttng/mm/page_io.c 2007-11-13 09:49:35.000000000 -0500
@@ -114,6 +114,7 @@ int swap_writepage(struct page *page, st
rw |= (1 << BIO_RW_SYNC);
count_vm_event(PSWPOUT);
set_page_writeback(page);
+ trace_mark(mm_swap_out, "address %p", page_address(page));
unlock_page(page);
submit_bio(rw, bio);
out:

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68


2007-11-15 21:07:08

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC 5/7] LTTng instrumentation mm

> On Tue, 2007-11-13 at 14:33 -0500, Mathieu Desnoyers wrote:
> linux-2.6-lttng/mm/page_io.c 2007-11-13 09:49:35.000000000 -0500
> @@ -114,6 +114,7 @@ int swap_writepage(struct page *page, st
> rw |= (1 << BIO_RW_SYNC);
> count_vm_event(PSWPOUT);
> set_page_writeback(page);
> + trace_mark(mm_swap_out, "address %p", page_address(page));
> unlock_page(page);
> submit_bio(rw, bio);
> out:

I'm not sure all this page_address() stuff makes any sense on highmem
systems. How about page_to_pfn()?

I also have to wonder if you should be hooking into count_vm_event() and
using those. Could you give a high-level overview of exactly why you
need these hooks, and perhaps what you expect from future people adding
things to the VM?

-- Dave

2007-11-15 21:52:03

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC 5/7] LTTng instrumentation mm

* Dave Hansen ([email protected]) wrote:
> > On Tue, 2007-11-13 at 14:33 -0500, Mathieu Desnoyers wrote:
> > linux-2.6-lttng/mm/page_io.c 2007-11-13 09:49:35.000000000 -0500
> > @@ -114,6 +114,7 @@ int swap_writepage(struct page *page, st
> > rw |= (1 << BIO_RW_SYNC);
> > count_vm_event(PSWPOUT);
> > set_page_writeback(page);
> > + trace_mark(mm_swap_out, "address %p", page_address(page));
> > unlock_page(page);
> > submit_bio(rw, bio);
> > out:
>
> I'm not sure all this page_address() stuff makes any sense on highmem
> systems. How about page_to_pfn()?
>

Hrm, maybe both ?

Knowing which page frame number has been swapped out is not always as
relevant as knowing the page's virtual address (when it has one). Saving
both the PFN and the page's virtual address could give us useful
information when the page is not mapped.

We face two possible approaches : either we save both the address and
the pfn at each event and later have the information at once in the
trace, or we instrument the kernel virtual addresses map/unmap
operations and let the trace analyzer figure out the mappings.

It is sometimes a big benefit traffic-wise to let the userspace tool do
recreate the kernel structures from the traced information, but it
involved specialized treatment in the userspace tools. If we chose this
solution, we could simply save the PFN in the event, as you propose.


> I also have to wonder if you should be hooking into count_vm_event() and
> using those. Could you give a high-level overview of exactly why you
> need these hooks, and perhaps what you expect from future people adding
> things to the VM?
>

Yep, I guess we could put useful markers beside the count_vm_events
inline function calls.

High level overview :

We currently have a "LTTng statedump", which iterates on the mappings of
all tasks at trace start time to dump them in the trace. We also
instrument memory allocation/free. We therefore have much of the
information needed to recreate the memory mappings in the kernel at any
point during the trace by "replaying" the trace.

Having the events that helps us to recreate it
- precisely
- efficiently
- with a level of generality that should not break "too much" between
kernel versions

would be useful to us.

Then we could start creating plugins in our userspace trace analysis
tool to analyze fun stuff such as sources of memory fragmentation.

Then coupling that with, eventually, performance counter, we could start
doing really fun things with cache misses...

It can also be useful to you guys to find our problems by adding ad-hoc
instrumentation to the VM code when pinpointing the cause of a problem.
Martin Bligh made interesting things applying a tracer to the vm,
described in "Linux Kernel Debugging on Google-sized clusters" in
OLS2007 proceedings.

(https://ols2006.108.redhat.com/2007/Reprints/OLS2007-Proceedings-V1.pdf)

Mathieu

> -- Dave
>

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-11-15 22:16:38

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC 5/7] LTTng instrumentation mm

On Thu, 2007-11-15 at 16:51 -0500, Mathieu Desnoyers wrote:
> * Dave Hansen ([email protected]) wrote:
> > > On Tue, 2007-11-13 at 14:33 -0500, Mathieu Desnoyers wrote:
> > > linux-2.6-lttng/mm/page_io.c 2007-11-13 09:49:35.000000000 -0500
> > > @@ -114,6 +114,7 @@ int swap_writepage(struct page *page, st
> > > rw |= (1 << BIO_RW_SYNC);
> > > count_vm_event(PSWPOUT);
> > > set_page_writeback(page);
> > > + trace_mark(mm_swap_out, "address %p", page_address(page));
> > > unlock_page(page);
> > > submit_bio(rw, bio);
> > > out:
> >
> > I'm not sure all this page_address() stuff makes any sense on highmem
> > systems. How about page_to_pfn()?
>
> Knowing which page frame number has been swapped out is not always as
> relevant as knowing the page's virtual address (when it has one). Saving
> both the PFN and the page's virtual address could give us useful
> information when the page is not mapped.

For most (all?) architectures, the PFN and the virtual address in the
kernel's linear are interchangeable with pretty trivial arithmetic. All
pages have a pfn, but not all have a virtual address. Thus, I suggested
using the pfn. What kind of virtual addresses are you talking about?

-- Dave

2007-11-16 14:31:23

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC 5/7] LTTng instrumentation mm

* Dave Hansen ([email protected]) wrote:
> On Thu, 2007-11-15 at 16:51 -0500, Mathieu Desnoyers wrote:
> > * Dave Hansen ([email protected]) wrote:
> > > > On Tue, 2007-11-13 at 14:33 -0500, Mathieu Desnoyers wrote:
> > > > linux-2.6-lttng/mm/page_io.c 2007-11-13 09:49:35.000000000 -0500
> > > > @@ -114,6 +114,7 @@ int swap_writepage(struct page *page, st
> > > > rw |= (1 << BIO_RW_SYNC);
> > > > count_vm_event(PSWPOUT);
> > > > set_page_writeback(page);
> > > > + trace_mark(mm_swap_out, "address %p", page_address(page));
> > > > unlock_page(page);
> > > > submit_bio(rw, bio);
> > > > out:
> > >
> > > I'm not sure all this page_address() stuff makes any sense on highmem
> > > systems. How about page_to_pfn()?
> >
> > Knowing which page frame number has been swapped out is not always as
> > relevant as knowing the page's virtual address (when it has one). Saving
> > both the PFN and the page's virtual address could give us useful
> > information when the page is not mapped.
>
> For most (all?) architectures, the PFN and the virtual address in the
> kernel's linear are interchangeable with pretty trivial arithmetic. All
> pages have a pfn, but not all have a virtual address. Thus, I suggested
> using the pfn. What kind of virtual addresses are you talking about?
>

Hum, the mappings I was referring to are the virual memory mappings of
all processes, which is not at all what interests us here.

Let's use the PFN then.

I see that the standard macro to get the kernel address from a pfn is :

asm-x86/page_32.h:#define pfn_to_kaddr(pfn) __va((pfn) << PAGE_SHIFT)

The question might seem trivial, but I wonder how this deals with large
pages ?

Mathieu


> -- Dave
>

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-11-16 14:47:55

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC 5/7] LTTng instrumentation mm

* Dave Hansen ([email protected]) wrote:
> On Thu, 2007-11-15 at 16:51 -0500, Mathieu Desnoyers wrote:
> > * Dave Hansen ([email protected]) wrote:
> > > > On Tue, 2007-11-13 at 14:33 -0500, Mathieu Desnoyers wrote:
> > > > linux-2.6-lttng/mm/page_io.c 2007-11-13 09:49:35.000000000 -0500
> > > > @@ -114,6 +114,7 @@ int swap_writepage(struct page *page, st
> > > > rw |= (1 << BIO_RW_SYNC);
> > > > count_vm_event(PSWPOUT);
> > > > set_page_writeback(page);
> > > > + trace_mark(mm_swap_out, "address %p", page_address(page));
> > > > unlock_page(page);
> > > > submit_bio(rw, bio);
> > > > out:
> > >
> > > I'm not sure all this page_address() stuff makes any sense on highmem
> > > systems. How about page_to_pfn()?
> >
> > Knowing which page frame number has been swapped out is not always as
> > relevant as knowing the page's virtual address (when it has one). Saving
> > both the PFN and the page's virtual address could give us useful
> > information when the page is not mapped.
>
> For most (all?) architectures, the PFN and the virtual address in the
> kernel's linear are interchangeable with pretty trivial arithmetic. All
> pages have a pfn, but not all have a virtual address. Thus, I suggested
> using the pfn. What kind of virtual addresses are you talking about?
>

Hrm, in asm-generic/memory_model.h, we have various versions of
__page_to_pfn. Normally they all cast the result to (unsigned long),
except for :


#elif defined(CONFIG_SPARSEMEM_VMEMMAP)

/* memmap is virtually contigious. */
#define __pfn_to_page(pfn) (vmemmap + (pfn))
#define __page_to_pfn(page) ((page) - vmemmap)

So I guess the result is a pointer ? Should this be expected ?

Mathieu

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-11-19 18:05:12

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC 5/7] LTTng instrumentation mm

On Fri, 2007-11-16 at 09:30 -0500, Mathieu Desnoyers wrote:
> I see that the standard macro to get the kernel address from a pfn is :
>
> asm-x86/page_32.h:#define pfn_to_kaddr(pfn) __va((pfn) << PAGE_SHIFT)
>
> The question might seem trivial, but I wonder how this deals with large
> pages ?

Well, first of all, large pages are a virtual addressing concept. We're
only talking about physical addresses here. You still address the
memory the same way no matter if it is composed of large or small pages.
The physical address (and pfn) never change no matter what we do with
the page or how we allocate ir.

-- Dave

2007-11-19 18:07:30

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC 5/7] LTTng instrumentation mm

On Fri, 2007-11-16 at 09:47 -0500, Mathieu Desnoyers wrote:
> * Dave Hansen ([email protected]) wrote:
> > For most (all?) architectures, the PFN and the virtual address in the
> > kernel's linear are interchangeable with pretty trivial arithmetic. All
> > pages have a pfn, but not all have a virtual address. Thus, I suggested
> > using the pfn. What kind of virtual addresses are you talking about?
> >
>
> Hrm, in asm-generic/memory_model.h, we have various versions of
> __page_to_pfn. Normally they all cast the result to (unsigned long),
> except for :
>
>
> #elif defined(CONFIG_SPARSEMEM_VMEMMAP)
>
> /* memmap is virtually contigious. */
> #define __pfn_to_page(pfn) (vmemmap + (pfn))
> #define __page_to_pfn(page) ((page) - vmemmap)
>
> So I guess the result is a pointer ? Should this be expected ?

Nope. 'pointer - pointer' is an integer. Just solve this equation for
integer:

'pointer + integer = pointer'

-- Dave

2007-11-19 18:58:17

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC 5/7] LTTng instrumentation mm

* Dave Hansen ([email protected]) wrote:
> On Fri, 2007-11-16 at 09:47 -0500, Mathieu Desnoyers wrote:
> > * Dave Hansen ([email protected]) wrote:
> > > For most (all?) architectures, the PFN and the virtual address in the
> > > kernel's linear are interchangeable with pretty trivial arithmetic. All
> > > pages have a pfn, but not all have a virtual address. Thus, I suggested
> > > using the pfn. What kind of virtual addresses are you talking about?
> > >
> >
> > Hrm, in asm-generic/memory_model.h, we have various versions of
> > __page_to_pfn. Normally they all cast the result to (unsigned long),
> > except for :
> >
> >
> > #elif defined(CONFIG_SPARSEMEM_VMEMMAP)
> >
> > /* memmap is virtually contigious. */
> > #define __pfn_to_page(pfn) (vmemmap + (pfn))
> > #define __page_to_pfn(page) ((page) - vmemmap)
> >
> > So I guess the result is a pointer ? Should this be expected ?
>
> Nope. 'pointer - pointer' is an integer. Just solve this equation for
> integer:
>
> 'pointer + integer = pointer'
>

Well, using page_to_pfn turns out to be ugly in markers (and in
printks) then. Depending on the architecture, it will result in either
an unsigned long (x86_64) or an unsigned int (i386), which corresponds
to %lu or %u and will print a warning if we don't cast it explicitly.

Mathieu


> -- Dave
>

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-11-19 19:00:58

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC 5/7] LTTng instrumentation mm

* Mathieu Desnoyers ([email protected]) wrote:
> * Dave Hansen ([email protected]) wrote:
> > On Fri, 2007-11-16 at 09:47 -0500, Mathieu Desnoyers wrote:
> > > * Dave Hansen ([email protected]) wrote:
> > > > For most (all?) architectures, the PFN and the virtual address in the
> > > > kernel's linear are interchangeable with pretty trivial arithmetic. All
> > > > pages have a pfn, but not all have a virtual address. Thus, I suggested
> > > > using the pfn. What kind of virtual addresses are you talking about?
> > > >
> > >
> > > Hrm, in asm-generic/memory_model.h, we have various versions of
> > > __page_to_pfn. Normally they all cast the result to (unsigned long),
> > > except for :
> > >
> > >
> > > #elif defined(CONFIG_SPARSEMEM_VMEMMAP)
> > >
> > > /* memmap is virtually contigious. */
> > > #define __pfn_to_page(pfn) (vmemmap + (pfn))
> > > #define __page_to_pfn(page) ((page) - vmemmap)
> > >
> > > So I guess the result is a pointer ? Should this be expected ?
> >
> > Nope. 'pointer - pointer' is an integer. Just solve this equation for
> > integer:
> >
> > 'pointer + integer = pointer'
> >
>
> Well, using page_to_pfn turns out to be ugly in markers (and in
> printks) then. Depending on the architecture, it will result in either
> an unsigned long (x86_64) or an unsigned int (i386), which corresponds

Well, it's signed long and signed int, but the point is still valid.

> to %lu or %u and will print a warning if we don't cast it explicitly.
>
> Mathieu
>
>
> > -- Dave
> >
>
> --
> Mathieu Desnoyers
> Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-11-19 19:43:18

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC 5/7] LTTng instrumentation mm

On Mon, 2007-11-19 at 13:52 -0500, Mathieu Desnoyers wrote:
> > > So I guess the result is a pointer ? Should this be expected ?
> >
> > Nope. 'pointer - pointer' is an integer. Just solve this equation for
> > integer:
> >
> > 'pointer + integer = pointer'
> >
>
> Well, using page_to_pfn turns out to be ugly in markers (and in
> printks) then. Depending on the architecture, it will result in either
> an unsigned long (x86_64) or an unsigned int (i386), which corresponds
> to %lu or %u and will print a warning if we don't cast it explicitly.

Casting the i386 one to be an unconditional 'unsigned long' shouldn't be
an issue. We don't generally expect pfns to fit into ints anyway.

-- Dave

2007-11-19 19:44:29

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC 5/7] LTTng instrumentation mm

On Mon, 2007-11-19 at 14:00 -0500, Mathieu Desnoyers wrote:
> > Well, using page_to_pfn turns out to be ugly in markers (and in
> > printks) then. Depending on the architecture, it will result in either
> > an unsigned long (x86_64) or an unsigned int (i386), which corresponds
>
> Well, it's signed long and signed int, but the point is still valid.

the result of page_to_pfn() may end up being signed in practice, but it
never needs to be. Just cast it to an unsigned long and make it
consistent everywhere.

-- Dave

2007-11-19 19:58:19

by Mathieu Desnoyers

[permalink] [raw]
Subject: [PATCH] Cast __page_to_pfn to unsigned long in CONFIG_SPARSEMEM

* Dave Hansen ([email protected]) wrote:
> On Mon, 2007-11-19 at 13:52 -0500, Mathieu Desnoyers wrote:
> > > > So I guess the result is a pointer ? Should this be expected ?
> > >
> > > Nope. 'pointer - pointer' is an integer. Just solve this equation for
> > > integer:
> > >
> > > 'pointer + integer = pointer'
> > >
> >
> > Well, using page_to_pfn turns out to be ugly in markers (and in
> > printks) then. Depending on the architecture, it will result in either
> > an unsigned long (x86_64) or an unsigned int (i386), which corresponds
> > to %lu or %u and will print a warning if we don't cast it explicitly.
>
> Casting the i386 one to be an unconditional 'unsigned long' shouldn't be
> an issue. We don't generally expect pfns to fit into ints anyway.

So would this make sense ?

Cast __page_to_pfn to unsigned long in CONFIG_SPARSEMEM

Make sure the type returned by __page_to_pfn is always unsigned long. If we
don't cast it explicitly, it can be int on i386, but long on x86_64. This is
especially inelegant for printks.

Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: Dave Hansen <[email protected]>
CC: [email protected]
CC: [email protected]
---
include/asm-generic/memory_model.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6-lttng/include/asm-generic/memory_model.h
===================================================================
--- linux-2.6-lttng.orig/include/asm-generic/memory_model.h 2007-11-19 14:47:30.000000000 -0500
+++ linux-2.6-lttng/include/asm-generic/memory_model.h 2007-11-19 14:48:30.000000000 -0500
@@ -50,7 +50,7 @@

/* memmap is virtually contigious. */
#define __pfn_to_page(pfn) (vmemmap + (pfn))
-#define __page_to_pfn(page) ((page) - vmemmap)
+#define __page_to_pfn(page) ((unsigned long)((page) - vmemmap))

#elif defined(CONFIG_SPARSEMEM)
/*

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-11-19 20:09:57

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH] Cast __page_to_pfn to unsigned long in CONFIG_SPARSEMEM

On Mon, 2007-11-19 at 14:52 -0500, Mathieu Desnoyers wrote:
> * Dave Hansen ([email protected]) wrote:
> > On Mon, 2007-11-19 at 13:52 -0500, Mathieu Desnoyers wrote:
> > > > > So I guess the result is a pointer ? Should this be expected ?
> > > >
> > > > Nope. 'pointer - pointer' is an integer. Just solve this equation for
> > > > integer:
> > > >
> > > > 'pointer + integer = pointer'
> > > >
> > >
> > > Well, using page_to_pfn turns out to be ugly in markers (and in
> > > printks) then. Depending on the architecture, it will result in either
> > > an unsigned long (x86_64) or an unsigned int (i386), which corresponds
> > > to %lu or %u and will print a warning if we don't cast it explicitly.
> >
> > Casting the i386 one to be an unconditional 'unsigned long' shouldn't be
> > an issue. We don't generally expect pfns to fit into ints anyway.
>
> So would this make sense ?
>
> Cast __page_to_pfn to unsigned long in CONFIG_SPARSEMEM
>
> Make sure the type returned by __page_to_pfn is always unsigned long. If we
> don't cast it explicitly, it can be int on i386, but long on x86_64. This is
> especially inelegant for printks.

The only thing I might suggest doing differently is actually using the
page_to_pfn() definition itself:

memory_model.h:#define page_to_pfn __page_to_pfn

The full inline function version should do this already, and we
shouldn't have any real direct __page_to_pfn() users anyway.

-- Dave

2007-11-19 20:20:35

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH] Cast page_to_pfn to unsigned long in CONFIG_SPARSEMEM

* Dave Hansen ([email protected]) wrote:
> The only thing I might suggest doing differently is actually using the
> page_to_pfn() definition itself:
>
> memory_model.h:#define page_to_pfn __page_to_pfn
>
> The full inline function version should do this already, and we
> shouldn't have any real direct __page_to_pfn() users anyway.
>

Like this then..

Cast page_to_pfn to unsigned long in CONFIG_SPARSEMEM

Make sure the type returned by page_to_pfn is always unsigned long. If we
don't cast it explicitly, it can be int on i386, but long on x86_64. This is
especially inelegant for printks.

Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: Dave Hansen <[email protected]>
CC: [email protected]
CC: [email protected]
---
include/asm-generic/memory_model.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6-lttng/include/asm-generic/memory_model.h
===================================================================
--- linux-2.6-lttng.orig/include/asm-generic/memory_model.h 2007-11-19 15:06:40.000000000 -0500
+++ linux-2.6-lttng/include/asm-generic/memory_model.h 2007-11-19 15:18:57.000000000 -0500
@@ -76,7 +76,7 @@ struct page;
extern struct page *pfn_to_page(unsigned long pfn);
extern unsigned long page_to_pfn(struct page *page);
#else
-#define page_to_pfn __page_to_pfn
+#define page_to_pfn ((unsigned long)__page_to_pfn)
#define pfn_to_page __pfn_to_page
#endif /* CONFIG_OUT_OF_LINE_PFN_TO_PAGE */




--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-11-19 21:14:45

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] Cast page_to_pfn to unsigned long in CONFIG_SPARSEMEM

On Mon, 19 Nov 2007 15:20:23 -0500
Mathieu Desnoyers <[email protected]> wrote:

> * Dave Hansen ([email protected]) wrote:
> > The only thing I might suggest doing differently is actually using the
> > page_to_pfn() definition itself:
> >
> > memory_model.h:#define page_to_pfn __page_to_pfn
> >
> > The full inline function version should do this already, and we
> > shouldn't have any real direct __page_to_pfn() users anyway.
> >
>
> Like this then..
>
> Cast page_to_pfn to unsigned long in CONFIG_SPARSEMEM
>
> Make sure the type returned by page_to_pfn is always unsigned long. If we
> don't cast it explicitly, it can be int on i386, but long on x86_64.

formally ptrdiff_t, I believe.

> This is
> especially inelegant for printks.
>
> Signed-off-by: Mathieu Desnoyers <[email protected]>
> CC: Dave Hansen <[email protected]>
> CC: [email protected]
> CC: [email protected]
> ---
> include/asm-generic/memory_model.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> Index: linux-2.6-lttng/include/asm-generic/memory_model.h
> ===================================================================
> --- linux-2.6-lttng.orig/include/asm-generic/memory_model.h 2007-11-19 15:06:40.000000000 -0500
> +++ linux-2.6-lttng/include/asm-generic/memory_model.h 2007-11-19 15:18:57.000000000 -0500
> @@ -76,7 +76,7 @@ struct page;
> extern struct page *pfn_to_page(unsigned long pfn);
> extern unsigned long page_to_pfn(struct page *page);
> #else
> -#define page_to_pfn __page_to_pfn
> +#define page_to_pfn ((unsigned long)__page_to_pfn)
> #define pfn_to_page __pfn_to_page
> #endif /* CONFIG_OUT_OF_LINE_PFN_TO_PAGE */

I'd have thought that __pfn_to_page() was the place to fix this: the
lower-level point. Because someone might later start using __pfn_to_page()
for something.

Heaven knows why though - why does __pfn_to_page() even exist?

2007-11-19 21:19:57

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH] Cast page_to_pfn to unsigned long in CONFIG_SPARSEMEM

On Mon, 2007-11-19 at 13:08 -0800, Andrew Morton wrote:
>
> > #else
> > -#define page_to_pfn __page_to_pfn
> > +#define page_to_pfn ((unsigned long)__page_to_pfn)
> > #define pfn_to_page __pfn_to_page
> > #endif /* CONFIG_OUT_OF_LINE_PFN_TO_PAGE */
>
> I'd have thought that __pfn_to_page() was the place to fix this: the
> lower-level point. Because someone might later start using
> __pfn_to_page()
> for something.
>
> Heaven knows why though - why does __pfn_to_page() even exist?

I think it's this stuff:

#ifdef CONFIG_OUT_OF_LINE_PFN_TO_PAGE
struct page *pfn_to_page(unsigned long pfn)
{
return __pfn_to_page(pfn);
}
unsigned long page_to_pfn(struct page *page)
{
return __page_to_pfn(page);
}
EXPORT_SYMBOL(pfn_to_page);
EXPORT_SYMBOL(page_to_pfn);
#endif /* CONFIG_OUT_OF_LINE_PFN_TO_PAGE */

Which comes from:

config OUT_OF_LINE_PFN_TO_PAGE
def_bool X86_64
depends on DISCONTIGMEM

and only on x86_64. Perhaps it can go away with the
discontig->sparsemem-vmemmap conversion.

-- Dave


2007-11-19 21:26:17

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH] Cast page_to_pfn to unsigned long in CONFIG_SPARSEMEM

On Mon, 2007-11-19 at 13:19 -0800, Dave Hansen wrote:
> On Mon, 2007-11-19 at 13:08 -0800, Andrew Morton wrote:
> > Heaven knows why though - why does __pfn_to_page() even exist?
> Perhaps it can go away with the
> discontig->sparsemem-vmemmap conversion.

In fact, Christoph Lameter's

Subject:
x86_64: Make sparsemem/vmemmap the
default memory model V2
Date:
Thu, 15 Nov 2007 19:55:11
-0800 (PST)

does remove it.

-- Dave

2007-11-20 17:39:33

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH] Cast page_to_pfn to unsigned long in CONFIG_SPARSEMEM

* Andrew Morton ([email protected]) wrote:
> On Mon, 19 Nov 2007 15:20:23 -0500
> Mathieu Desnoyers <[email protected]> wrote:
>
> > * Dave Hansen ([email protected]) wrote:
> > > The only thing I might suggest doing differently is actually using the
> > > page_to_pfn() definition itself:
> > >
> > > memory_model.h:#define page_to_pfn __page_to_pfn
> > >
> > > The full inline function version should do this already, and we
> > > shouldn't have any real direct __page_to_pfn() users anyway.
> > >
> >
> > Like this then..
> >
> > Cast page_to_pfn to unsigned long in CONFIG_SPARSEMEM
> >
> > Make sure the type returned by page_to_pfn is always unsigned long. If we
> > don't cast it explicitly, it can be int on i386, but long on x86_64.
>
> formally ptrdiff_t, I believe.
>
> > This is
> > especially inelegant for printks.
> >
> > Signed-off-by: Mathieu Desnoyers <[email protected]>
> > CC: Dave Hansen <[email protected]>
> > CC: [email protected]
> > CC: [email protected]
> > ---
> > include/asm-generic/memory_model.h | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > Index: linux-2.6-lttng/include/asm-generic/memory_model.h
> > ===================================================================
> > --- linux-2.6-lttng.orig/include/asm-generic/memory_model.h 2007-11-19 15:06:40.000000000 -0500
> > +++ linux-2.6-lttng/include/asm-generic/memory_model.h 2007-11-19 15:18:57.000000000 -0500
> > @@ -76,7 +76,7 @@ struct page;
> > extern struct page *pfn_to_page(unsigned long pfn);
> > extern unsigned long page_to_pfn(struct page *page);
> > #else
> > -#define page_to_pfn __page_to_pfn
> > +#define page_to_pfn ((unsigned long)__page_to_pfn)
> > #define pfn_to_page __pfn_to_page
> > #endif /* CONFIG_OUT_OF_LINE_PFN_TO_PAGE */
>
> I'd have thought that __pfn_to_page() was the place to fix this: the
> lower-level point. Because someone might later start using __pfn_to_page()
> for something.
>
> Heaven knows why though - why does __pfn_to_page() even exist?

Since it all does away with Christoph's patchset anyway, please drop
this patch. (I think there is also an issue with this patch version,
which is that the define should take the arguments...).

Mathieu

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-11-21 20:13:27

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH] Cast page_to_pfn to unsigned long in CONFIG_SPARSEMEM

On Mon, 19 Nov 2007, Dave Hansen wrote:

> Which comes from:
>
> config OUT_OF_LINE_PFN_TO_PAGE
> def_bool X86_64
> depends on DISCONTIGMEM
>
> and only on x86_64. Perhaps it can go away with the
> discontig->sparsemem-vmemmap conversion.

The discontig/flatmem removal patch for x86_64 in mm already removes this.

2007-11-28 14:10:19

by Mathieu Desnoyers

[permalink] [raw]
Subject: [RFC PATCH] LTTng instrumentation mm (using page_to_pfn)

LTTng instrumentation mm

Memory management core events.

Changelog:
- Use page_to_pfn for swap out instrumentation, wait_on_page_bit, do_swap_page,
page alloc/free.

Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: [email protected]
CC: Dave Hansen <[email protected]>
---
mm/filemap.c | 4 ++++
mm/memory.c | 34 +++++++++++++++++++++++++---------
mm/page_alloc.c | 5 +++++
mm/page_io.c | 1 +
4 files changed, 35 insertions(+), 9 deletions(-)

Index: linux-2.6-lttng/mm/filemap.c
===================================================================
--- linux-2.6-lttng.orig/mm/filemap.c 2007-11-28 08:38:46.000000000 -0500
+++ linux-2.6-lttng/mm/filemap.c 2007-11-28 08:59:05.000000000 -0500
@@ -514,9 +514,13 @@ void fastcall wait_on_page_bit(struct pa
{
DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);

+ trace_mark(mm_filemap_wait_start, "pfn %lu", page_to_pfn(page));
+
if (test_bit(bit_nr, &page->flags))
__wait_on_bit(page_waitqueue(page), &wait, sync_page,
TASK_UNINTERRUPTIBLE);
+
+ trace_mark(mm_filemap_wait_end, "pfn %lu", page_to_pfn(page));
}
EXPORT_SYMBOL(wait_on_page_bit);

Index: linux-2.6-lttng/mm/memory.c
===================================================================
--- linux-2.6-lttng.orig/mm/memory.c 2007-11-28 08:42:09.000000000 -0500
+++ linux-2.6-lttng/mm/memory.c 2007-11-28 09:02:57.000000000 -0500
@@ -2072,6 +2072,7 @@ static int do_swap_page(struct mm_struct
delayacct_set_flag(DELAYACCT_PF_SWAPIN);
page = lookup_swap_cache(entry);
if (!page) {
+ trace_mark(mm_swap_in, "pfn %lu", page_to_pfn(page));
grab_swap_token(); /* Contend for token _before_ read-in */
swapin_readahead(entry, address, vma);
page = read_swap_cache_async(entry, vma, address);
@@ -2526,30 +2527,45 @@ unlock:
int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, int write_access)
{
+ int res;
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;

+ trace_mark(mm_handle_fault_entry, "address %lu ip #p%ld",
+ address, KSTK_EIP(current));
+
__set_current_state(TASK_RUNNING);

count_vm_event(PGFAULT);

- if (unlikely(is_vm_hugetlb_page(vma)))
- return hugetlb_fault(mm, vma, address, write_access);
+ if (unlikely(is_vm_hugetlb_page(vma))) {
+ res = hugetlb_fault(mm, vma, address, write_access);
+ goto end;
+ }

pgd = pgd_offset(mm, address);
pud = pud_alloc(mm, pgd, address);
- if (!pud)
- return VM_FAULT_OOM;
+ if (!pud) {
+ res = VM_FAULT_OOM;
+ goto end;
+ }
pmd = pmd_alloc(mm, pud, address);
- if (!pmd)
- return VM_FAULT_OOM;
+ if (!pmd) {
+ res = VM_FAULT_OOM;
+ goto end;
+ }
pte = pte_alloc_map(mm, pmd, address);
- if (!pte)
- return VM_FAULT_OOM;
+ if (!pte) {
+ res = VM_FAULT_OOM;
+ goto end;
+ }

- return handle_pte_fault(mm, vma, address, pte, pmd, write_access);
+ res = handle_pte_fault(mm, vma, address, pte, pmd, write_access);
+end:
+ trace_mark(mm_handle_fault_exit, MARK_NOARGS);
+ return res;
}

#ifndef __PAGETABLE_PUD_FOLDED
Index: linux-2.6-lttng/mm/page_alloc.c
===================================================================
--- linux-2.6-lttng.orig/mm/page_alloc.c 2007-11-28 08:38:46.000000000 -0500
+++ linux-2.6-lttng/mm/page_alloc.c 2007-11-28 09:05:36.000000000 -0500
@@ -519,6 +519,9 @@ static void __free_pages_ok(struct page
int i;
int reserved = 0;

+ trace_mark(mm_page_free, "order %u pfn %lu",
+ order, page_to_pfn(page));
+
for (i = 0 ; i < (1 << order) ; ++i)
reserved += free_pages_check(page + i);
if (reserved)
@@ -1639,6 +1642,8 @@ fastcall unsigned long __get_free_pages(
page = alloc_pages(gfp_mask, order);
if (!page)
return 0;
+ trace_mark(mm_page_alloc, "order %u pfn %lu",
+ order, page_to_pfn(page));
return (unsigned long) page_address(page);
}

Index: linux-2.6-lttng/mm/page_io.c
===================================================================
--- linux-2.6-lttng.orig/mm/page_io.c 2007-11-28 08:38:47.000000000 -0500
+++ linux-2.6-lttng/mm/page_io.c 2007-11-28 08:52:14.000000000 -0500
@@ -114,6 +114,7 @@ int swap_writepage(struct page *page, st
rw |= (1 << BIO_RW_SYNC);
count_vm_event(PSWPOUT);
set_page_writeback(page);
+ trace_mark(mm_swap_out, "pfn %lu", page_to_pfn(page));
unlock_page(page);
submit_bio(rw, bio);
out:
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-11-28 16:54:30

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH] LTTng instrumentation mm (using page_to_pfn)

On Wed, 2007-11-28 at 09:09 -0500, Mathieu Desnoyers wrote:
> ===================================================================
> --- linux-2.6-lttng.orig/mm/filemap.c 2007-11-28 08:38:46.000000000 -0500
> +++ linux-2.6-lttng/mm/filemap.c 2007-11-28 08:59:05.000000000 -0500
> @@ -514,9 +514,13 @@ void fastcall wait_on_page_bit(struct pa
> {
> DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
>
> + trace_mark(mm_filemap_wait_start, "pfn %lu", page_to_pfn(page));
> +
> if (test_bit(bit_nr, &page->flags))
> __wait_on_bit(page_waitqueue(page), &wait, sync_page,
> TASK_UNINTERRUPTIBLE);
> +
> + trace_mark(mm_filemap_wait_end, "pfn %lu", page_to_pfn(page));
> }
> EXPORT_SYMBOL(wait_on_page_bit);

I've got some small nits with this. I guess I just wish that if we're
going to sprinkle hooks all over that we'd have those hooks be as useful
as possible for people who have to look at them on a daily basis.

Do you also want to put in the page bit which is being waited on?

>
> Index: linux-2.6-lttng/mm/memory.c
> ===================================================================
> --- linux-2.6-lttng.orig/mm/memory.c 2007-11-28 08:42:09.000000000 -0500
> +++ linux-2.6-lttng/mm/memory.c 2007-11-28 09:02:57.000000000 -0500
> @@ -2072,6 +2072,7 @@ static int do_swap_page(struct mm_struct
> delayacct_set_flag(DELAYACCT_PF_SWAPIN);
> page = lookup_swap_cache(entry);
> if (!page) {
> + trace_mark(mm_swap_in, "pfn %lu", page_to_pfn(page));
> grab_swap_token(); /* Contend for token _before_ read-in */
> swapin_readahead(entry, address, vma);
> page = read_swap_cache_async(entry, vma, address);

How about putting the swap file number and the offset as well?

> @@ -2526,30 +2527,45 @@ unlock:
> int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> unsigned long address, int write_access)
> {
> + int res;
> pgd_t *pgd;
> pud_t *pud;
> pmd_t *pmd;
> pte_t *pte;
>
> + trace_mark(mm_handle_fault_entry, "address %lu ip #p%ld",
> + address, KSTK_EIP(current));

For knowing this one, the write access can be pretty important. It can
help you find copy-on-write situations as well as some common bugs that
creep in here.

> __set_current_state(TASK_RUNNING);
>
> count_vm_event(PGFAULT);
>
> - if (unlikely(is_vm_hugetlb_page(vma)))
> - return hugetlb_fault(mm, vma, address, write_access);
> + if (unlikely(is_vm_hugetlb_page(vma))) {
> + res = hugetlb_fault(mm, vma, address, write_access);
> + goto end;
> + }

I think you should also add tracing to the hugetlb code while you're at
it. Those poor fellows seem to be always getting left out these
days. :)

> pgd = pgd_offset(mm, address);
> pud = pud_alloc(mm, pgd, address);
> - if (!pud)
> - return VM_FAULT_OOM;
> + if (!pud) {
> + res = VM_FAULT_OOM;
> + goto end;
> + }
> pmd = pmd_alloc(mm, pud, address);
> - if (!pmd)
> - return VM_FAULT_OOM;
> + if (!pmd) {
> + res = VM_FAULT_OOM;
> + goto end;
> + }
> pte = pte_alloc_map(mm, pmd, address);
> - if (!pte)
> - return VM_FAULT_OOM;
> + if (!pte) {
> + res = VM_FAULT_OOM;
> + goto end;
> + }
>
> - return handle_pte_fault(mm, vma, address, pte, pmd, write_access);
> + res = handle_pte_fault(mm, vma, address, pte, pmd, write_access);
> +end:
> + trace_mark(mm_handle_fault_exit, MARK_NOARGS);
> + return res;
> }
>
> #ifndef __PAGETABLE_PUD_FOLDED
> Index: linux-2.6-lttng/mm/page_alloc.c
> ===================================================================
> --- linux-2.6-lttng.orig/mm/page_alloc.c 2007-11-28 08:38:46.000000000 -0500
> +++ linux-2.6-lttng/mm/page_alloc.c 2007-11-28 09:05:36.000000000 -0500
> @@ -519,6 +519,9 @@ static void __free_pages_ok(struct page
> int i;
> int reserved = 0;
>
> + trace_mark(mm_page_free, "order %u pfn %lu",
> + order, page_to_pfn(page));
> +
> for (i = 0 ; i < (1 << order) ; ++i)
> reserved += free_pages_check(page + i);
> if (reserved)
> @@ -1639,6 +1642,8 @@ fastcall unsigned long __get_free_pages(
> page = alloc_pages(gfp_mask, order);
> if (!page)
> return 0;
> + trace_mark(mm_page_alloc, "order %u pfn %lu",
> + order, page_to_pfn(page));
> return (unsigned long) page_address(page);
> }
>
> Index: linux-2.6-lttng/mm/page_io.c
> ===================================================================
> --- linux-2.6-lttng.orig/mm/page_io.c 2007-11-28 08:38:47.000000000 -0500
> +++ linux-2.6-lttng/mm/page_io.c 2007-11-28 08:52:14.000000000 -0500
> @@ -114,6 +114,7 @@ int swap_writepage(struct page *page, st
> rw |= (1 << BIO_RW_SYNC);
> count_vm_event(PSWPOUT);
> set_page_writeback(page);
> + trace_mark(mm_swap_out, "pfn %lu", page_to_pfn(page));
> unlock_page(page);
> submit_bio(rw, bio);

I'd also like to see the swap file number and the location in swap for
this one.

-- Dave

2007-11-29 02:34:35

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH] LTTng instrumentation mm (using page_to_pfn)

I am adding the rest.. two questions left :

* Dave Hansen ([email protected]) wrote:

> >
> > Index: linux-2.6-lttng/mm/memory.c
> > ===================================================================
> > --- linux-2.6-lttng.orig/mm/memory.c 2007-11-28 08:42:09.000000000 -0500
> > +++ linux-2.6-lttng/mm/memory.c 2007-11-28 09:02:57.000000000 -0500
> > @@ -2072,6 +2072,7 @@ static int do_swap_page(struct mm_struct
> > delayacct_set_flag(DELAYACCT_PF_SWAPIN);
> > page = lookup_swap_cache(entry);
> > if (!page) {
> > + trace_mark(mm_swap_in, "pfn %lu", page_to_pfn(page));
> > grab_swap_token(); /* Contend for token _before_ read-in */
> > swapin_readahead(entry, address, vma);
> > page = read_swap_cache_async(entry, vma, address);
>
> How about putting the swap file number and the offset as well?
>
[...]
> > Index: linux-2.6-lttng/mm/page_io.c
> > ===================================================================
> > --- linux-2.6-lttng.orig/mm/page_io.c 2007-11-28 08:38:47.000000000 -0500
> > +++ linux-2.6-lttng/mm/page_io.c 2007-11-28 08:52:14.000000000 -0500
> > @@ -114,6 +114,7 @@ int swap_writepage(struct page *page, st
> > rw |= (1 << BIO_RW_SYNC);
> > count_vm_event(PSWPOUT);
> > set_page_writeback(page);
> > + trace_mark(mm_swap_out, "pfn %lu", page_to_pfn(page));
> > unlock_page(page);
> > submit_bio(rw, bio);
>
> I'd also like to see the swap file number and the location in swap for
> this one.
>

Before I start digging deeper in checking whether it is already
instrumented by the fs instrumentation (and would therefore be
redundant), is there a particular data structure from mm/ that you
suggest taking the swap file number and location in swap from ?

Mathieu

> -- Dave
>

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-11-29 06:26:17

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH] LTTng instrumentation mm (using page_to_pfn)

On Wed, 2007-11-28 at 21:34 -0500, Mathieu Desnoyers wrote:
> Before I start digging deeper in checking whether it is already
> instrumented by the fs instrumentation (and would therefore be
> redundant), is there a particular data structure from mm/ that you
> suggest taking the swap file number and location in swap from ?

page_private() at this point stores a swp_entry_t. There are swp_type()
and swp_offset() helpers to decode the two bits you need after you've
turned page_private() into a swp_entry_t. See how get_swap_bio()
creates a temporary swp_entry_t from the page_private() passed into it,
then uses swp_type/offset() on it?

I don't know if there is some history behind it, but it doesn't make a
whole ton of sense to me to be passing page_private(page) into
get_swap_bio() (which happens from its only two call sites). It just
kinda obfuscates where 'index' came from.

It think we probably could just be doing

swp_entry_t entry = { .val = page_private(page), };

in get_swap_bio() and not passing page_private(). We have the page in
there already, so we don't need to pass a derived value like
page_private(). At the least, it'll save some clutter in the function
declaration.

Or, make a helper:

static swp_entry_t page_swp_entry(struct page *page)
{
swp_entry_t entry;
VM_BUG_ON(!PageSwapCache(page));
entry.val = page_private(page);
return entry;
}

I see at least 4 call sites that could use this. The try_to_unmap_one()
caller would trip over the debug check, so you'd have to move the call
inside of the if(PageSwapCache(page)) statement.

-- Dave

2007-11-30 16:12:16

by Mathieu Desnoyers

[permalink] [raw]
Subject: [RFC PATCH] LTTng instrumentation mm (updated)

LTTng instrumentation mm

Memory management core events.

Changelog:
- Use page_to_pfn for swap out instrumentation, wait_on_page_bit, do_swap_page,
page alloc/free.
- add missing free_hot_cold_page instrumentation.
- add hugetlb page_alloc page_free instrumentation.
- Add write_access to mm fault.
- Add page bit_nr waited for by wait_on_page_bit.
- Move page alloc instrumentation to __aloc_pages so we cover the alloc zeroed
page path.
- Add swap file used for swap in and swap out events.
- Dump the swap files, instrument swapon and swapoff.

(note : I did not change the other sites where page_swp_entry could be
used)
(note 2 : my FS instrumentation does not dump the kernel vfs mounts,
which would be useful to interpret the "dump swap files"
instrumentation. I should add this eventually.)

Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: [email protected]
CC: Dave Hansen <[email protected]>
---
include/linux/swapops.h | 8 ++++++++
mm/filemap.c | 6 ++++++
mm/hugetlb.c | 2 ++
mm/memory.c | 38 +++++++++++++++++++++++++++++---------
mm/page_alloc.c | 6 ++++++
mm/page_io.c | 5 +++++
mm/swapfile.c | 22 ++++++++++++++++++++++
7 files changed, 78 insertions(+), 9 deletions(-)

Index: linux-2.6-lttng/mm/filemap.c
===================================================================
--- linux-2.6-lttng.orig/mm/filemap.c 2007-11-29 20:22:52.000000000 -0500
+++ linux-2.6-lttng/mm/filemap.c 2007-11-29 20:23:01.000000000 -0500
@@ -514,9 +514,15 @@ void fastcall wait_on_page_bit(struct pa
{
DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);

+ trace_mark(mm_filemap_wait_start, "pfn %lu bit_nr %d",
+ page_to_pfn(page), bit_nr);
+
if (test_bit(bit_nr, &page->flags))
__wait_on_bit(page_waitqueue(page), &wait, sync_page,
TASK_UNINTERRUPTIBLE);
+
+ trace_mark(mm_filemap_wait_end, "pfn %lu bit_nr %d",
+ page_to_pfn(page), bit_nr);
}
EXPORT_SYMBOL(wait_on_page_bit);

Index: linux-2.6-lttng/mm/memory.c
===================================================================
--- linux-2.6-lttng.orig/mm/memory.c 2007-11-29 20:22:52.000000000 -0500
+++ linux-2.6-lttng/mm/memory.c 2007-11-29 20:42:36.000000000 -0500
@@ -2090,6 +2090,10 @@ static int do_swap_page(struct mm_struct
/* Had to read the page from swap area: Major fault */
ret = VM_FAULT_MAJOR;
count_vm_event(PGMAJFAULT);
+ trace_mark(mm_swap_in, "pfn %lu filp %p offset %lu",
+ page_to_pfn(page),
+ get_swap_info_struct(swp_type(entry))->swap_file,
+ swp_offset(entry));
}

mark_page_accessed(page);
@@ -2526,30 +2530,46 @@ unlock:
int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, int write_access)
{
+ int res;
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;

+ trace_mark(mm_handle_fault_entry,
+ "address %lu ip #p%ld write_access %d",
+ address, KSTK_EIP(current), write_access);
+
__set_current_state(TASK_RUNNING);

count_vm_event(PGFAULT);

- if (unlikely(is_vm_hugetlb_page(vma)))
- return hugetlb_fault(mm, vma, address, write_access);
+ if (unlikely(is_vm_hugetlb_page(vma))) {
+ res = hugetlb_fault(mm, vma, address, write_access);
+ goto end;
+ }

pgd = pgd_offset(mm, address);
pud = pud_alloc(mm, pgd, address);
- if (!pud)
- return VM_FAULT_OOM;
+ if (!pud) {
+ res = VM_FAULT_OOM;
+ goto end;
+ }
pmd = pmd_alloc(mm, pud, address);
- if (!pmd)
- return VM_FAULT_OOM;
+ if (!pmd) {
+ res = VM_FAULT_OOM;
+ goto end;
+ }
pte = pte_alloc_map(mm, pmd, address);
- if (!pte)
- return VM_FAULT_OOM;
+ if (!pte) {
+ res = VM_FAULT_OOM;
+ goto end;
+ }

- return handle_pte_fault(mm, vma, address, pte, pmd, write_access);
+ res = handle_pte_fault(mm, vma, address, pte, pmd, write_access);
+end:
+ trace_mark(mm_handle_fault_exit, MARK_NOARGS);
+ return res;
}

#ifndef __PAGETABLE_PUD_FOLDED
Index: linux-2.6-lttng/mm/page_alloc.c
===================================================================
--- linux-2.6-lttng.orig/mm/page_alloc.c 2007-11-29 20:22:52.000000000 -0500
+++ linux-2.6-lttng/mm/page_alloc.c 2007-11-29 20:23:01.000000000 -0500
@@ -519,6 +519,9 @@ static void __free_pages_ok(struct page
int i;
int reserved = 0;

+ trace_mark(mm_page_free, "order %u pfn %lu",
+ order, page_to_pfn(page));
+
for (i = 0 ; i < (1 << order) ; ++i)
reserved += free_pages_check(page + i);
if (reserved)
@@ -981,6 +984,8 @@ static void fastcall free_hot_cold_page(
struct per_cpu_pages *pcp;
unsigned long flags;

+ trace_mark(mm_page_free, "order %u pfn %lu", 0, page_to_pfn(page));
+
if (PageAnon(page))
page->mapping = NULL;
if (free_pages_check(page))
@@ -1625,6 +1630,7 @@ nopage:
show_mem();
}
got_pg:
+ trace_mark(mm_page_alloc, "order %u pfn %lu", order, page_to_pfn(page));
return page;
}

Index: linux-2.6-lttng/mm/page_io.c
===================================================================
--- linux-2.6-lttng.orig/mm/page_io.c 2007-11-29 20:22:52.000000000 -0500
+++ linux-2.6-lttng/mm/page_io.c 2007-11-29 20:43:02.000000000 -0500
@@ -114,6 +114,11 @@ int swap_writepage(struct page *page, st
rw |= (1 << BIO_RW_SYNC);
count_vm_event(PSWPOUT);
set_page_writeback(page);
+ trace_mark(mm_swap_out, "pfn %lu filp %p offset %lu",
+ page_to_pfn(page),
+ get_swap_info_struct(swp_type(
+ page_swp_entry(page)))->swap_file,
+ swp_offset(page_swp_entry(page)));
unlock_page(page);
submit_bio(rw, bio);
out:
Index: linux-2.6-lttng/mm/hugetlb.c
===================================================================
--- linux-2.6-lttng.orig/mm/hugetlb.c 2007-11-29 20:22:52.000000000 -0500
+++ linux-2.6-lttng/mm/hugetlb.c 2007-11-29 20:23:01.000000000 -0500
@@ -118,6 +118,7 @@ static void free_huge_page(struct page *
int nid = page_to_nid(page);
struct address_space *mapping;

+ trace_mark(mm_huge_page_free, "pfn %lu", page_to_pfn(page));
mapping = (struct address_space *) page_private(page);
BUG_ON(page_count(page));
INIT_LIST_HEAD(&page->lru);
@@ -401,6 +402,7 @@ static struct page *alloc_huge_page(stru
if (!IS_ERR(page)) {
set_page_refcounted(page);
set_page_private(page, (unsigned long) mapping);
+ trace_mark(mm_huge_page_alloc, "pfn %lu", page_to_pfn(page));
}
return page;
}
Index: linux-2.6-lttng/include/linux/swapops.h
===================================================================
--- linux-2.6-lttng.orig/include/linux/swapops.h 2007-11-29 20:22:52.000000000 -0500
+++ linux-2.6-lttng/include/linux/swapops.h 2007-11-29 20:23:01.000000000 -0500
@@ -68,6 +68,14 @@ static inline pte_t swp_entry_to_pte(swp
return __swp_entry_to_pte(arch_entry);
}

+static inline swp_entry_t page_swp_entry(struct page *page)
+{
+ swp_entry_t entry;
+ VM_BUG_ON(!PageSwapCache(page));
+ entry.val = page_private(page);
+ return entry;
+}
+
#ifdef CONFIG_MIGRATION
static inline swp_entry_t make_migration_entry(struct page *page, int write)
{
Index: linux-2.6-lttng/mm/swapfile.c
===================================================================
--- linux-2.6-lttng.orig/mm/swapfile.c 2007-11-30 09:18:38.000000000 -0500
+++ linux-2.6-lttng/mm/swapfile.c 2007-11-30 10:21:50.000000000 -0500
@@ -1279,6 +1279,7 @@ asmlinkage long sys_swapoff(const char _
swap_map = p->swap_map;
p->swap_map = NULL;
p->flags = 0;
+ trace_mark(mm_swap_file_close, "filp %p", swap_file);
spin_unlock(&swap_lock);
mutex_unlock(&swapon_mutex);
vfree(swap_map);
@@ -1660,6 +1661,8 @@ asmlinkage long sys_swapon(const char __
} else {
swap_info[prev].next = p - swap_info;
}
+ trace_mark(mm_swap_file_open, "filp %p filename %s",
+ swap_file, name);
spin_unlock(&swap_lock);
mutex_unlock(&swapon_mutex);
error = 0;
@@ -1796,3 +1799,22 @@ int valid_swaphandles(swp_entry_t entry,
spin_unlock(&swap_lock);
return ret;
}
+
+void ltt_dump_swap_files(void *call_data)
+{
+ int type;
+ struct swap_info_struct * p = NULL;
+
+ mutex_lock(&swapon_mutex);
+ for (type = swap_list.head; type >= 0; type = swap_info[type].next) {
+ p = swap_info + type;
+ if ((p->flags & SWP_ACTIVE) != SWP_ACTIVE)
+ continue;
+ __trace_mark(0, statedump_swap_files, call_data,
+ "filp %p vfsmount %p dname %s",
+ p->swap_file, p->swap_file->f_vfsmnt,
+ p->swap_file->f_dentry->d_name.name);
+ }
+ mutex_unlock(&swapon_mutex);
+}
+EXPORT_SYMBOL_GPL(ltt_dump_swap_files);
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-11-30 16:47:19

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH] LTTng instrumentation mm (updated)

On Fri, 2007-11-30 at 11:11 -0500, Mathieu Desnoyers wrote:
> +static inline swp_entry_t page_swp_entry(struct page *page)
> +{
> + swp_entry_t entry;
> + VM_BUG_ON(!PageSwapCache(page));
> + entry.val = page_private(page);
> + return entry;
> +}

This probably needs to be introduced (and used) in a separate patch.
Please fix up those other places in the code that can take advantage of
it.

> #ifdef CONFIG_MIGRATION
> static inline swp_entry_t make_migration_entry(struct page *page, int
> write)
> {
> Index: linux-2.6-lttng/mm/swapfile.c
> ===================================================================
> --- linux-2.6-lttng.orig/mm/swapfile.c 2007-11-30 09:18:38.000000000
> -0500
> +++ linux-2.6-lttng/mm/swapfile.c 2007-11-30 10:21:50.000000000
> -0500
> @@ -1279,6 +1279,7 @@ asmlinkage long sys_swapoff(const char _
> swap_map = p->swap_map;
> p->swap_map = NULL;
> p->flags = 0;
> + trace_mark(mm_swap_file_close, "filp %p", swap_file);
> spin_unlock(&swap_lock);
> mutex_unlock(&swapon_mutex);
> vfree(swap_map);
> @@ -1660,6 +1661,8 @@ asmlinkage long sys_swapon(const char __
> } else {
> swap_info[prev].next = p - swap_info;
> }
> + trace_mark(mm_swap_file_open, "filp %p filename %s",
> + swap_file, name);

You print out the filp a number of times here, but how does that help in
a trace? If I was trying to figure out which swapfile, I'd probably
just want to know the swp_entry_t->type, then I could look at this:

dave@foo:~/garbage$ cat /proc/swaps
Filename Type Size Used Priority
/dev/sda2 partition 1992052 649336 -1

to see the ordering.

-- Dave

2007-11-30 17:05:34

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH] LTTng instrumentation mm (updated)

* Dave Hansen ([email protected]) wrote:
> On Fri, 2007-11-30 at 11:11 -0500, Mathieu Desnoyers wrote:
> > +static inline swp_entry_t page_swp_entry(struct page *page)
> > +{
> > + swp_entry_t entry;
> > + VM_BUG_ON(!PageSwapCache(page));
> > + entry.val = page_private(page);
> > + return entry;
> > +}
>
> This probably needs to be introduced (and used) in a separate patch.
> Please fix up those other places in the code that can take advantage of
> it.
>
Sure,

> > #ifdef CONFIG_MIGRATION
> > static inline swp_entry_t make_migration_entry(struct page *page, int
> > write)
> > {
> > Index: linux-2.6-lttng/mm/swapfile.c
> > ===================================================================
> > --- linux-2.6-lttng.orig/mm/swapfile.c 2007-11-30 09:18:38.000000000
> > -0500
> > +++ linux-2.6-lttng/mm/swapfile.c 2007-11-30 10:21:50.000000000
> > -0500
> > @@ -1279,6 +1279,7 @@ asmlinkage long sys_swapoff(const char _
> > swap_map = p->swap_map;
> > p->swap_map = NULL;
> > p->flags = 0;
> > + trace_mark(mm_swap_file_close, "filp %p", swap_file);
> > spin_unlock(&swap_lock);
> > mutex_unlock(&swapon_mutex);
> > vfree(swap_map);
> > @@ -1660,6 +1661,8 @@ asmlinkage long sys_swapon(const char __
> > } else {
> > swap_info[prev].next = p - swap_info;
> > }
> > + trace_mark(mm_swap_file_open, "filp %p filename %s",
> > + swap_file, name);
>
> You print out the filp a number of times here, but how does that help in
> a trace? If I was trying to figure out which swapfile, I'd probably
> just want to know the swp_entry_t->type, then I could look at this:
>
> dave@foo:~/garbage$ cat /proc/swaps
> Filename Type Size Used Priority
> /dev/sda2 partition 1992052 649336 -1
>
> to see the ordering.
>

Given a trace including :
- Swapfiles initially used
- multiple swapon/swapoff
- swap in/out events

We would like to be able to tell which swap file the information has
been written to/read from at any given time during the trace.

Therefore, I dump the swap file information at the beginning of the
trace (see the ltt_dump_swap_files function) and also follow each
swapon/swapoff.

The minimal information that has to be saved at each swap read/write
seems to be the struct file * that is used by the operation. We can then
map back to the file used by knowing the mapping between struct file *
and associated file names (dump/swapon/swapoff instrumentation).

The swp_entry_t->type does not seem to map to any specific information
in /proc/swaps ? (or I may have missed a detail) Even if it does, it is
limited to a specific point in time and does not follow swapon/swapoff
events.

You are talking about ordering in /proc/swaps : I wonder what happens if
we add/remove swap files from the array : I guess the swp_entry_t
ordering may become mixed up with the order of the /proc/swaps output,
since it is based on the swap_info array which will fill empty spots
upon swapon (again, unless I missed a clever detail).

Mathieu

> -- Dave
>

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-11-30 18:42:39

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH] LTTng instrumentation mm (updated)

On Fri, 2007-11-30 at 12:05 -0500, Mathieu Desnoyers wrote:
>
>
> Given a trace including :
> - Swapfiles initially used
> - multiple swapon/swapoff
> - swap in/out events
>
> We would like to be able to tell which swap file the information has
> been written to/read from at any given time during the trace.

Oh, tracing is expected to be on at all times? I figured someone would
encounter a problem, then turn it on to dig down a little deeper, then
turn it off.

As for why I care what is in /proc/swaps. Take a look at this:

struct swap_info_struct *
get_swap_info_struct(unsigned type)
{
return &swap_info[type];
}

Then, look at the proc functions:

static void *swap_next(struct seq_file *swap, void *v, loff_t *pos)
{
struct swap_info_struct *ptr;
struct swap_info_struct *endptr = swap_info + nr_swapfiles;

if (v == SEQ_START_TOKEN)
ptr = swap_info;
...

I guess if that swap_info[] has any holes, we can't relate indexes in
there right back to /proc/swaps, but maybe we should add some
information so that we _can_.

-- Dave

2007-11-30 19:10:26

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH] LTTng instrumentation mm (updated)

* Dave Hansen ([email protected]) wrote:
> On Fri, 2007-11-30 at 12:05 -0500, Mathieu Desnoyers wrote:
> >
> >
> > Given a trace including :
> > - Swapfiles initially used
> > - multiple swapon/swapoff
> > - swap in/out events
> >
> > We would like to be able to tell which swap file the information has
> > been written to/read from at any given time during the trace.
>
> Oh, tracing is expected to be on at all times? I figured someone would
> encounter a problem, then turn it on to dig down a little deeper, then
> turn it off.
>

Yep, it can be expected to be on at all times, especially on production
systems using "flight recorder" tracing to record information in a
circular buffer, then dumping the buffers when some triggers (error
conditions) happens.

> As for why I care what is in /proc/swaps. Take a look at this:
>
> struct swap_info_struct *
> get_swap_info_struct(unsigned type)
> {
> return &swap_info[type];
> }
>
> Then, look at the proc functions:
>
> static void *swap_next(struct seq_file *swap, void *v, loff_t *pos)
> {
> struct swap_info_struct *ptr;
> struct swap_info_struct *endptr = swap_info + nr_swapfiles;
>
> if (v == SEQ_START_TOKEN)
> ptr = swap_info;
> ...
>
> I guess if that swap_info[] has any holes, we can't relate indexes in
> there right back to /proc/swaps, but maybe we should add some
> information so that we _can_.
>

The if (!(ptr->flags & SWP_USED) test in swap_next seems to skip the
unused swap_info entries.

Why should we care about get_swap_info_struct always returning a "used"
swap info struct ?

> -- Dave
>

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-12-04 19:18:18

by Frank Ch. Eigler

[permalink] [raw]
Subject: Re: [RFC PATCH] LTTng instrumentation mm (updated)

Mathieu Desnoyers <[email protected]> writes:

> [...]
>> > We would like to be able to tell which swap file the information has
>> > been written to/read from at any given time during the trace.
>>
>> Oh, tracing is expected to be on at all times? I figured someone would
>> encounter a problem, then turn it on to dig down a little deeper, then
>> turn it off.
>
> Yep, it can be expected to be on at all times, especially on production
> systems using "flight recorder" tracing to record information in a
> circular buffer [...]

Considering how early in the boot sequence swap partitions are
activated, it seems optimistic to assume that the monitoring equipment
will always start up in time to catch the initial swapons. It would
be more useful if a marker parameter was included in the swap events
to let a tool/user map to /proc/swaps or a file name.

- FChE

2007-12-04 19:30:46

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH] LTTng instrumentation mm (updated)

* Frank Ch. Eigler ([email protected]) wrote:
> Mathieu Desnoyers <[email protected]> writes:
>
> > [...]
> >> > We would like to be able to tell which swap file the information has
> >> > been written to/read from at any given time during the trace.
> >>
> >> Oh, tracing is expected to be on at all times? I figured someone would
> >> encounter a problem, then turn it on to dig down a little deeper, then
> >> turn it off.
> >
> > Yep, it can be expected to be on at all times, especially on production
> > systems using "flight recorder" tracing to record information in a
> > circular buffer [...]
>
> Considering how early in the boot sequence swap partitions are
> activated, it seems optimistic to assume that the monitoring equipment
> will always start up in time to catch the initial swapons. It would
> be more useful if a marker parameter was included in the swap events
> to let a tool/user map to /proc/swaps or a file name.
>
> - FChE

Not early at all ? We have userspace processes running.. this is _late_
in the boot sequence! ;)

Anyhow, that I have now is a combination including your proposal :

- I dump the swapon/swapoff events.
- I also dump the equivalent of /proc/swaps (with kernel internal
information) at trace start to know what swap files are currently
used.

Does it sound fair ?

Mathieu

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-12-04 19:41:52

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH] LTTng instrumentation mm (updated)

On Tue, 2007-12-04 at 14:25 -0500, Mathieu Desnoyers wrote:
>
> - I also dump the equivalent of /proc/swaps (with kernel internal
> information) at trace start to know what swap files are currently
> used.

What about just enhancing /proc/swaps so that this information can be
useful to people other than those doing traces?

Now that we have /proc/$pid/pagemap, we expose some of the same
information about which userspace virtual addresses are stored where and
in which swapfile.

-- Dave

2007-12-04 20:11:21

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH] LTTng instrumentation mm (updated)

* Dave Hansen ([email protected]) wrote:
> On Tue, 2007-12-04 at 14:25 -0500, Mathieu Desnoyers wrote:
> >
> > - I also dump the equivalent of /proc/swaps (with kernel internal
> > information) at trace start to know what swap files are currently
> > used.
>
> What about just enhancing /proc/swaps so that this information can be
> useful to people other than those doing traces?
>

It includes an in-kernel struct file pointer, exporting it to userspace
would be somewhat ugly.

> Now that we have /proc/$pid/pagemap, we expose some of the same
> information about which userspace virtual addresses are stored where and
> in which swapfile.
>

The problems with /proc :

- It exports all the data in formatted text. What I need for my traces
is pure binary, compact representation.
- It's not very neat to export in-kernel pointer information like a
kernel tracer would need.
- The locking is very often wrong. I started correcting /proc/modules a
while ago, but I fear there are quite a few cases where a procfile
reader could release the locks between two consecutive reads of the
same list and therefore cause missing information or corruption. While
being manageable for a proc text file, this is _highly_ unwanted in a
trace. See my previous "seq file sorted" and "module.c sort module
list" patches about this. My tracer deals with addition/removal of
elements to a list between dumps done by "chunks" by tracing the
modifications done to the list at the same time. However, /proc seq
files will just get corrupted or forget about an element not touched
by the modification, which my tracer cannot cope with.

Mathieu


> -- Dave
>

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-12-04 20:24:59

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH] LTTng instrumentation mm (updated)

On Tue, 2007-12-04 at 15:05 -0500, Mathieu Desnoyers wrote:
> * Dave Hansen ([email protected]) wrote:
> > On Tue, 2007-12-04 at 14:25 -0500, Mathieu Desnoyers wrote:
> > >
> > > - I also dump the equivalent of /proc/swaps (with kernel internal
> > > information) at trace start to know what swap files are currently
> > > used.
> >
> > What about just enhancing /proc/swaps so that this information can be
> > useful to people other than those doing traces?
>
> It includes an in-kernel struct file pointer, exporting it to userspace
> would be somewhat ugly.

What about just exporting the 'type' field that we use to index into
swap_info[]?

As far as /proc goes, it may not be _ideal_ for your traces, but it sure
beats not getting the information out at all. ;) I guess I'm just not
that familiar with the tracing requirements and I can't really assess
whether what you're asking for is reasonable, or horrible
over-engineering. Dunno.

-- Dave

2007-12-04 20:28:28

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH] LTTng instrumentation mm (updated)

Or, think out of the box...

Maybe you can introduce some interfaces that expose information both in
sysfs (in normal human-readable formats) and in a way that lets you get
the same data out in some binary format.

Seems to me you'll have a lot easier time justifying all of these lines
of code spread all over the kernel if there are a few more users off the
bat.

-- Dave