LinuxLists.cc - [RFC] Track mlock()ed pages

2007-01-26 05:43:36

Subject: [RFC] Track mlock()ed pages

Add NR_MLOCK

Track mlocked pages via a ZVC

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.20-rc6/include/linux/mmzone.h
===================================================================
--- linux-2.6.20-rc6.orig/include/linux/mmzone.h 2007-01-25 20:29:58.000000000 -0800
+++ linux-2.6.20-rc6/include/linux/mmzone.h 2007-01-25 20:31:23.000000000 -0800
@@ -58,6 +58,7 @@ enum zone_stat_item {
NR_FILE_DIRTY,
NR_WRITEBACK,
/* Second 128 byte cacheline */
+ NR_MLOCK, /* Mlocked pages */
NR_SLAB_RECLAIMABLE,
NR_SLAB_UNRECLAIMABLE,
NR_PAGETABLE, /* used for pagetables */
Index: linux-2.6.20-rc6/mm/rmap.c
===================================================================
--- linux-2.6.20-rc6.orig/mm/rmap.c 2007-01-25 20:18:38.000000000 -0800
+++ linux-2.6.20-rc6/mm/rmap.c 2007-01-25 20:31:23.000000000 -0800
@@ -551,6 +551,8 @@ void page_add_new_anon_rmap(struct page
{
atomic_set(&page->_mapcount, 0); /* elevate count by 1 (starts at -1) */
__page_set_anon_rmap(page, vma, address);
+ if (vma->vm_flags & VM_LOCKED)
+ __inc_zone_page_state(page, NR_MLOCK);
}

/**
@@ -565,6 +567,16 @@ void page_add_file_rmap(struct page *pag
__inc_zone_page_state(page, NR_FILE_MAPPED);
}

+/*
+ * Add an rmap in a known vma. This allows us to update the mlock counter.
+ */
+void page_add_file_rmap_vma(struct page *page, struct vm_area_struct *vma)
+{
+ page_add_file_rmap(page);
+ if (vma->vm_flags & VM_LOCKED)
+ __inc_zone_page_state(page, NR_MLOCK);
+}
+
/**
* page_remove_rmap - take down pte mapping from a page
* @page: page to remove mapping from
@@ -602,6 +614,8 @@ void page_remove_rmap(struct page *page,
__dec_zone_page_state(page,
PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);
}
+ if (vma->vm_flags & VM_LOCKED)
+ __dec_zone_page_state(page, NR_MLOCK);
}

/*
Index: linux-2.6.20-rc6/include/linux/rmap.h
===================================================================
--- linux-2.6.20-rc6.orig/include/linux/rmap.h 2007-01-25 20:18:38.000000000 -0800
+++ linux-2.6.20-rc6/include/linux/rmap.h 2007-01-25 20:31:23.000000000 -0800
@@ -72,6 +72,7 @@ void __anon_vma_link(struct vm_area_stru
void page_add_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
void page_add_new_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
void page_add_file_rmap(struct page *);
+void page_add_file_rmap_vma(struct page *, struct vm_area_struct *);
void page_remove_rmap(struct page *, struct vm_area_struct *);

/**
Index: linux-2.6.20-rc6/mm/fremap.c
===================================================================
--- linux-2.6.20-rc6.orig/mm/fremap.c 2007-01-25 20:18:38.000000000 -0800
+++ linux-2.6.20-rc6/mm/fremap.c 2007-01-25 20:31:23.000000000 -0800
@@ -81,7 +81,7 @@ int install_page(struct mm_struct *mm, s
flush_icache_page(vma, page);
pte_val = mk_pte(page, prot);
set_pte_at(mm, addr, pte, pte_val);
- page_add_file_rmap(page);
+ page_add_file_rmap_vma(page, vma);
update_mmu_cache(vma, addr, pte_val);
lazy_mmu_prot_update(pte_val);
err = 0;
Index: linux-2.6.20-rc6/mm/memory.c
===================================================================
--- linux-2.6.20-rc6.orig/mm/memory.c 2007-01-25 20:18:38.000000000 -0800
+++ linux-2.6.20-rc6/mm/memory.c 2007-01-25 20:31:23.000000000 -0800
@@ -2256,7 +2256,7 @@ retry:
page_add_new_anon_rmap(new_page, vma, address);
} else {
inc_mm_counter(mm, file_rss);
- page_add_file_rmap(new_page);
+ page_add_file_rmap_vma(new_page, vma);
if (write_access) {
dirty_page = new_page;
get_page(dirty_page);
Index: linux-2.6.20-rc6/drivers/base/node.c
===================================================================
--- linux-2.6.20-rc6.orig/drivers/base/node.c 2007-01-25 20:30:17.000000000 -0800
+++ linux-2.6.20-rc6/drivers/base/node.c 2007-01-25 20:31:23.000000000 -0800
@@ -60,6 +60,7 @@ static ssize_t node_read_meminfo(struct
"Node %d FilePages: %8lu kB\n"
"Node %d Mapped: %8lu kB\n"
"Node %d AnonPages: %8lu kB\n"
+ "Node %d Mlock: %8lu kB\n"
"Node %d PageTables: %8lu kB\n"
"Node %d NFS_Unstable: %8lu kB\n"
"Node %d Bounce: %8lu kB\n"
@@ -82,6 +83,7 @@ static ssize_t node_read_meminfo(struct
nid, K(node_page_state(nid, NR_FILE_PAGES)),
nid, K(node_page_state(nid, NR_FILE_MAPPED)),
nid, K(node_page_state(nid, NR_ANON_PAGES)),
+ nid, K(node_page_state(nid, NR_MLOCK)),
nid, K(node_page_state(nid, NR_PAGETABLE)),
nid, K(node_page_state(nid, NR_UNSTABLE_NFS)),
nid, K(node_page_state(nid, NR_BOUNCE)),
Index: linux-2.6.20-rc6/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.20-rc6.orig/fs/proc/proc_misc.c 2007-01-25 20:30:21.000000000 -0800
+++ linux-2.6.20-rc6/fs/proc/proc_misc.c 2007-01-25 20:31:23.000000000 -0800
@@ -166,6 +166,7 @@ static int meminfo_read_proc(char *page,
"Writeback: %8lu kB\n"
"AnonPages: %8lu kB\n"
"Mapped: %8lu kB\n"
+ "Mlock: %8lu kB\n"
"Slab: %8lu kB\n"
"SReclaimable: %8lu kB\n"
"SUnreclaim: %8lu kB\n"
@@ -196,6 +197,7 @@ static int meminfo_read_proc(char *page,
K(global_page_state(NR_WRITEBACK)),
K(global_page_state(NR_ANON_PAGES)),
K(global_page_state(NR_FILE_MAPPED)),
+ K(global_page_state(NR_MLOCK)),
K(global_page_state(NR_SLAB_RECLAIMABLE) +
global_page_state(NR_SLAB_UNRECLAIMABLE)),
K(global_page_state(NR_SLAB_RECLAIMABLE)),
Index: linux-2.6.20-rc6/mm/vmstat.c
===================================================================
--- linux-2.6.20-rc6.orig/mm/vmstat.c 2007-01-25 20:30:21.000000000 -0800
+++ linux-2.6.20-rc6/mm/vmstat.c 2007-01-25 20:31:23.000000000 -0800
@@ -433,6 +433,7 @@ static const char * const vmstat_text[]
"nr_file_pages",
"nr_dirty",
"nr_writeback",
+ "nr_mlock",
"nr_slab_reclaimable",
"nr_slab_unreclaimable",
"nr_page_table_pages",

2007-01-26 06:30:55

by Nick Piggin

Subject: Re: [RFC] Track mlock()ed pages

Christoph Lameter wrote:
> Add NR_MLOCK
>
> Track mlocked pages via a ZVC

I think it is not quite right. You are tracking the number of ptes
that point to mlocked pages, which can be >= the actual number of pages.

Also, page_add_anon_rmap still needs to be balanced with page_remove_rmap.

I can't think of an easy way to do this without per-page state. ie.
another page flag.
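
The gap between counting ptes and counting pages can be shown with a minimal user-space sketch. This is not kernel code: `toy_mapping`, `count_locked_mappings()` and `count_locked_pages()` are hypothetical names invented for illustration, and each array entry stands in for one pte belonging to some vma.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model, not kernel code: one entry per pte that maps a page. */
struct toy_mapping {
    int page_id;    /* which physical page the pte points to */
    bool vm_locked; /* does the owning vma have VM_LOCKED set? */
};

/* Counts one per locked mapping, as incrementing on every rmap add would. */
static size_t count_locked_mappings(const struct toy_mapping *m, size_t n)
{
    size_t count = 0;

    assert(m != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (m[i].vm_locked)
            count++;
    return count;
}

/* Counts each page once, however many locked vmas map it. */
static size_t count_locked_pages(const struct toy_mapping *m, size_t n)
{
    size_t count = 0;

    assert(m != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        bool seen = false;

        if (!m[i].vm_locked)
            continue;
        /* Skip this page if an earlier locked mapping already counted it. */
        for (size_t j = 0; j < i; j++) {
            if (m[j].vm_locked && m[j].page_id == m[i].page_id) {
                seen = true;
                break;
            }
        }
        if (!seen)
            count++;
    }
    return count;
}
```

With three locked mappings of one page plus one locked mapping of a second page, the first function reports four while the second reports two, which is the ">=" relationship described above.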

>
> Signed-off-by: Christoph Lameter <[email protected]>
>

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2007-01-26 06:36:21

by Christoph Lameter

Subject: Re: [RFC] Track mlock()ed pages

On Fri, 26 Jan 2007, Nick Piggin wrote:

> Christoph Lameter wrote:
> > Add NR_MLOCK
> >
> > Track mlocked pages via a ZVC
>
> I think it is not quite right. You are tracking the number of ptes
> that point to mlocked pages, which can be >= the actual number of pages.

Mlocked pages are not inherited. I would expect sharing to be very rare.

> Also, page_add_anon_rmap still needs to be balanced with page_remove_rmap.

Hmmm....

> I can't think of an easy way to do this without per-page state. ie.
> another page flag.

That's what I am trying to avoid.

2007-01-26 11:13:22

by Andrew Morton

Subject: Re: [RFC] Track mlock()ed pages

On Thu, 25 Jan 2007 22:36:17 -0800 (PST)
Christoph Lameter <[email protected]> wrote:

> On Fri, 26 Jan 2007, Nick Piggin wrote:
>
> > Christoph Lameter wrote:
> > > Add NR_MLOCK
> > >
> > > Track mlocked pages via a ZVC

Why?

> > I think it is not quite right. You are tracking the number of ptes
> > that point to mlocked pages, which can be >= the actual number of pages.
>
> Mlocked pages are not inherited. I would expect sharing to be very rare.
>
> > Also, page_add_anon_rmap still needs to be balanced with page_remove_rmap.
>
> Hmmm....
>
> > I can't think of an easy way to do this without per-page state. ie.
> > another page flag.
>
> That's what I am trying to avoid.

You could perhaps go for a walk across all the other vmas which presently
map this page. If any of them have VM_LOCKED, don't increment the counter.
Similar on removal: only decrement the counter when the final mlocked VMA
is dropping the pte.
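
The walk suggested above can be sketched as a single-threaded user-space model. Everything here is an assumption made for illustration: `toy_page`, `toy_add_rmap()`, `toy_remove_rmap()` and the fixed slot array are hypothetical stand-ins for the real rmap structures, and the sketch does no locking, so it says nothing about whether the walk can be done without races in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_VMAS 8

/* Toy model, not kernel code: each slot is one vma that may map the page. */
struct toy_vma_slot {
    bool mapped;
    bool vm_locked;
};

struct toy_page {
    struct toy_vma_slot vmas[TOY_MAX_VMAS];
};

static long toy_nr_mlock; /* stand-in for the NR_MLOCK counter */

/* Walk all other vmas mapping the page and report whether any is locked. */
static bool toy_locked_elsewhere(const struct toy_page *page, size_t skip)
{
    for (size_t i = 0; i < TOY_MAX_VMAS; i++)
        if (i != skip && page->vmas[i].mapped && page->vmas[i].vm_locked)
            return true;
    return false;
}

/* Increment only when this is the first locked vma to map the page. */
static void toy_add_rmap(struct toy_page *page, size_t slot, bool vm_locked)
{
    assert(slot < TOY_MAX_VMAS && !page->vmas[slot].mapped);
    page->vmas[slot].mapped = true;
    page->vmas[slot].vm_locked = vm_locked;
    if (vm_locked && !toy_locked_elsewhere(page, slot))
        toy_nr_mlock++;
}

/* Decrement only when the final locked vma drops its mapping. */
static void toy_remove_rmap(struct toy_page *page, size_t slot)
{
    assert(slot < TOY_MAX_VMAS && page->vmas[slot].mapped);
    if (page->vmas[slot].vm_locked && !toy_locked_elsewhere(page, slot))
        toy_nr_mlock--;
    page->vmas[slot].mapped = false;
    page->vmas[slot].vm_locked = false;
}
```

In this model the counter stays at one no matter how many locked vmas map the page, at the cost of a walk over the mappings on every locked add and remove.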

2007-01-26 11:46:25

by Nick Piggin

Subject: Re: [RFC] Track mlock()ed pages

Christoph Lameter wrote:
> On Fri, 26 Jan 2007, Nick Piggin wrote:
>
>
>>Christoph Lameter wrote:
>>
>>>Add NR_MLOCK
>>>
>>>Track mlocked pages via a ZVC
>>
>>I think it is not quite right. You are tracking the number of ptes
>>that point to mlocked pages, which can be >= the actual number of pages.
>
>
> Mlocked pages are not inherited. I would expect sharing to be very rare.

Things like library and application text could easily have a lot of
sharing.

--
SUSE Labs, Novell Inc.

2007-01-26 12:00:22

by Nick Piggin

Subject: Re: [RFC] Track mlock()ed pages

Andrew Morton wrote:
> On Thu, 25 Jan 2007 22:36:17 -0800 (PST)
> Christoph Lameter <[email protected]> wrote:
>

>>>I can't think of an easy way to do this without per-page state. ie.
>>>another page flag.
>>
>>That's what I am trying to avoid.
>
>
> You could perhaps go for a walk across all the other vmas which presently
> map this page. If any of them have VM_LOCKED, don't increment the counter.
> Similar on removal: only decrement the counter when the final mlocked VMA
> is dropping the pte.

Can't do it un-racily because you can't get that information
atomically, AFAIKS. When / if we ever lock the page in fault handler,
this could become easier... but that seems nasty to do in fault path,
even if only for VM_LOCKED vmas.

--
SUSE Labs, Novell Inc.

2007-01-26 15:44:49

by Christoph Lameter

Subject: Re: [RFC] Track mlock()ed pages

On Fri, 26 Jan 2007, Andrew Morton wrote:

> > > > Track mlocked pages via a ZVC
>
> Why?

Large amounts of mlocked pages may be a problem for

1. Reclaim behavior.

2. Defragmentation

> You could perhaps go for a walk across all the other vmas which presently
> map this page. If any of them have VM_LOCKED, don't increment the counter.
> Similar on removal: only decrement the counter when the final mlocked VMA
> is dropping the pte.

For that we would need an additional refcount for vmlocked maps in the
page struct. Looks too expensive.

2007-01-26 18:10:36

by Andrew Morton

Subject: Re: [RFC] Track mlock()ed pages

On Fri, 26 Jan 2007 07:44:42 -0800 (PST)
Christoph Lameter <[email protected]> wrote:

> On Fri, 26 Jan 2007, Andrew Morton wrote:
>
> > > > > Track mlocked pages via a ZVC
> >
> > Why?
>
> Large amounts of mlocked pages may be a problem for
>
> 1. Reclaim behavior.
>
> 2. Defragmentation
>

We know that. What has that to do with this patch?

>
> > You could perhaps go for a walk across all the other vmas which presently
> > map this page. If any of them have VM_LOCKED, don't increment the counter.
> > Similar on removal: only decrement the counter when the final mlocked VMA
> > is dropping the pte.
>
> For that we would need an additional refcount for vmlocked maps in the
> page struct.

No you don't. The refcount is already there. It is "the sum of the VM_LOCKED
VMAs which map this page".

It might be impractical or expensive to calculate it, but it's there.

2007-01-26 18:22:21

by Kamezawa Hiroyuki

Subject: Re: [RFC] Track mlock()ed pages

On Fri, 26 Jan 2007 10:10:27 -0800
Andrew Morton <[email protected]> wrote:

> On Fri, 26 Jan 2007 07:44:42 -0800 (PST)
> Christoph Lameter <[email protected]> wrote:
>
> > On Fri, 26 Jan 2007, Andrew Morton wrote:
> >
> > > > > > Track mlocked pages via a ZVC
> > >
> > > Why?
> >
> > Large amounts of mlocked pages may be a problem for
> >
> > 1. Reclaim behavior.
> >
> > 2. Defragmentation
> >
>
> We know that. What has that to do with this patch?
>
3. just counting mlocked pages....

I was once asked by a user to calculate the "free" pages on a system
running several big 'mlockall' processes that shared a large number of
pages... I answered that you cannot trust the result of "/bin/free"
if you use mlocked processes.

It was very fun :P

-Kame

2007-01-26 18:23:51

by Christoph Lameter

Subject: Re: [RFC] Track mlock()ed pages

On Fri, 26 Jan 2007, Andrew Morton wrote:

> > Large amounts of mlocked pages may be a problem for
> >
> > 1. Reclaim behavior.
> >
> > 2. Defragmentation
> >
>
> We know that. What has that to do with this patch?

Knowing how many mlocked pages are where is necessary to solve these
issues.

> > > You could perhaps go for a walk across all the other vmas which presently
> > > map this page. If any of them have VM_LOCKED, don't increment the counter.
> > > Similar on removal: only decrement the counter when the final mlocked VMA
> > > is dropping the pte.
> >
> > For that we would need an additional refcount for vmlocked maps in the
> > page struct.
>
> No you don't. The refcount is already there. It is "the sum of the VM_LOCKED
> VMAs which map this page".
>
> It might be impractical or expensive to calculate it, but it's there.

Correct. It's so expensive that it cannot be used to build vm stats for
mlocked pages. For example, determining the final mlocked VMA dropping the
page would require a scan over all vmas mapping the page.

2007-01-26 18:42:16

by Andrew Morton

Subject: Re: [RFC] Track mlock()ed pages

On Fri, 26 Jan 2007 10:23:44 -0800 (PST)
Christoph Lameter <[email protected]> wrote:

> On Fri, 26 Jan 2007, Andrew Morton wrote:
>
> > > Large amounts of mlocked pages may be a problem for
> > >
> > > 1. Reclaim behavior.
> > >
> > > 2. Defragmentation
> > >
> >
> > We know that. What has that to do with this patch?
>
> Knowing how many mlocked pages are where is necessary to solve these
> issues.

If we continue this dialogue for long enough, we'll actually have a changelog.

> > > > You could perhaps go for a walk across all the other vmas which presently
> > > > map this page. If any of them have VM_LOCKED, don't increment the counter.
> > > > Similar on removal: only decrement the counter when the final mlocked VMA
> > > > is dropping the pte.
> > >
> > > For that we would need an additional refcount for vmlocked maps in the
> > > page struct.
> >
> > No you don't. The refcount is already there. It is "the sum of the VM_LOCKED
> > VMAs which map this page".
> >
> > It might be impractical or expensive to calculate it, but it's there.
>
> Correct. It's so expensive that it cannot be used to build vm stats for
> mlocked pages. For example, determining the final mlocked VMA dropping the
> page would require a scan over all vmas mapping the page.

Of course it would. But how do you know it is "too expensive"? We "scan
all the vmas mapping a page" as a matter of course in the page scanner -
millions of times a minute. If that's "too expensive" then ouch.

That, plus if we have so many vmas mapping a page for this effect to
matter, then your change as proposed will be so inaccurate as to be
useless, no?

2007-01-27 22:19:42

by Rik van Riel

Subject: Re: [RFC] Track mlock()ed pages

Andrew Morton wrote:

> Of course it would. But how do you know it is "too expensive"? We "scan
> all the vmas mapping a page" as a matter of course in the page scanner -
> millions of times a minute. If that's "too expensive" then ouch.

We can do it lazily.

At mlock time, move pages onto the mlocked list, unless they
are there already.

On munlock, move pages to the active list. For mlock-only
memory (shared memory segments?) we could add a simple check
to see if the next process on the list has the page mlocked,
checking only that one.

While scanning the active list, move mlocked pages that are
found back onto the mlocked list.

This lazy movement of pages will impact shared libraries,
but probably not shared memory segments.

Does this sound workable?
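
The lazy scheme described above can be sketched as a small user-space model. It is only one reading of the proposal: `toy_page`, `toy_mlock()`, `toy_munlock()` and `toy_scan_active()` are hypothetical names, the two lists are reduced to a per-page tag, and the optional "check the next process on the list" shortcut is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_MAPPERS 4

enum toy_list { TOY_ACTIVE, TOY_MLOCKED };

/* Toy model, not kernel code: which list a page is on, and who locked it. */
struct toy_page {
    enum toy_list list;
    bool locked_by[TOY_MAX_MAPPERS];
};

/* At mlock time, move the page onto the mlocked list unless already there. */
static void toy_mlock(struct toy_page *page, size_t mapper)
{
    assert(mapper < TOY_MAX_MAPPERS);
    page->locked_by[mapper] = true;
    if (page->list != TOY_MLOCKED)
        page->list = TOY_MLOCKED;
}

/* On munlock, move the page to the active list without checking others. */
static void toy_munlock(struct toy_page *page, size_t mapper)
{
    assert(mapper < TOY_MAX_MAPPERS);
    page->locked_by[mapper] = false;
    page->list = TOY_ACTIVE;
}

static bool toy_any_locked(const struct toy_page *page)
{
    for (size_t i = 0; i < TOY_MAX_MAPPERS; i++)
        if (page->locked_by[i])
            return true;
    return false;
}

/* While scanning the active list, move still-locked pages back. */
static size_t toy_scan_active(struct toy_page *pages, size_t n)
{
    size_t moved = 0;

    for (size_t i = 0; i < n; i++) {
        if (pages[i].list == TOY_ACTIVE && toy_any_locked(&pages[i])) {
            pages[i].list = TOY_MLOCKED;
            moved++;
        }
    }
    return moved;
}

/* Number of pages currently tagged as being on the mlocked list. */
static size_t toy_count_mlocked(const struct toy_page *pages, size_t n)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++)
        if (pages[i].list == TOY_MLOCKED)
            count++;
    return count;
}
```

Between a munlock and the next scan, a page that another mapper still has locked sits on the active list, so the mlocked count is temporarily too low. That is the inaccuracy the lazy approach accepts.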

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.

2007-01-27 22:30:31

by Andrew Morton

Subject: Re: [RFC] Track mlock()ed pages

On Sat, 27 Jan 2007 17:19:21 -0500
Rik van Riel <[email protected]> wrote:

> Andrew Morton wrote:
>
> > Of course it would. But how do you know it is "too expensive"? We "scan
> > all the vmas mapping a page" as a matter of course in the page scanner -
> > millions of times a minute. If that's "too expensive" then ouch.
>
> We can do it lazily.
>
> At mlock time, move pages onto the mlocked list, unless they
> are there already.

Needs another page flag to determine what list the page is on (eek).

> On munlock, move pages to the active list.

We'd need to determine whether some other vma has mlocked the page too.
That's either the page_struct refcount or the vma walk. The latter is
equivalent to what I'm suggesting.

> For mlock-only
> memory (shared memory segments?) we could add a simple check
> to see if the next process on the list has the page mlocked,
> checking only that one.
>
> While scanning the active list, move mlocked pages that are
> found back onto the mlocked list.
>
> This lazy movement of pages will impact shared libraries,
> but probably not shared memory segments.
>
> Does this sound workable?

I'm still not sure what problem we're trying to solve here.

Knowing how many mlocked pages there are in a zone doesn't sound terribly
interesting and I don't recall ever wanting to know that.

Being able to keep mlocked pages off the LRU altogether sounds more useful.

It's all rather a tight corner case - people don't use mlock much.

2007-01-27 22:36:10

by Rik van Riel

Subject: Re: [RFC] Track mlock()ed pages

Andrew Morton wrote:
> On Sat, 27 Jan 2007 17:19:21 -0500
> Rik van Riel <[email protected]> wrote:
>
>> Andrew Morton wrote:
>>
>>> Of course it would. But how do you know it is "too expensive"? We "scan
>>> all the vmas mapping a page" as a matter of course in the page scanner -
>>> millions of times a minute. If that's "too expensive" then ouch.
>> We can do it lazily.
>>
>> At mlock time, move pages onto the mlocked list, unless they
>> are there already.
>
> Needs another page flag to determine what list the page is on (eek).
>
>> On munlock, move pages to the active list.
>
> We'd need to determine whether some other vma has mlocked the page too.
> That's either the page_struct refcount or the vma walk. The latter is
> equivalent to what I'm suggesting.

It doesn't have to be 100% accurate. The pages that are mapped
both mlocked and non-mlocked will probably be only shared
libraries.

>> For mlock-only
>> memory (shared memory segments?) we could add a simple check
>> to see if the next process on the list has the page mlocked,
>> checking only that one.

As long as we get it right for the large objects, it should be
fine.

> I'm still not sure what problem we're trying to solve here.
>
> Knowing how many mlocked pages there are in a zone doesn't sound terribly
> interesting and I don't recall ever wanting to know that.
>
> Being able to keep mlocked pages off the LRU altogether sounds more useful.

> It's all rather a tight corner case - people don't use mlock much.

Just because it's not common does not mean you can just ignore
it and hope Linux doesn't unexpectedly fall over.

One thing that knowing the amount of mlocked data in each zone
(and node) would allow us to do is spread the mlocked memory
around a little better at allocation time. This could prevent
some of the "I started a 6GB Oracle on my 8GB system and it
immediately fell over" bugs.
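
One way a per-zone mlocked count could feed into allocation is sketched below. The policy is purely an assumption, since no concrete policy is given above: `toy_zone` and `toy_pick_zone()` are hypothetical, and the rule "prefer the zone with the lowest mlocked fraction" is just one simple choice.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model, not kernel code: per-zone size and mlocked page count. */
struct toy_zone {
    unsigned long present_pages;
    unsigned long nr_mlock;
};

/*
 * Return the index of the zone with the lowest mlocked fraction,
 * or n if no zone has any pages. Fractions are compared by
 * cross-multiplication to avoid floating point.
 */
static size_t toy_pick_zone(const struct toy_zone *zones, size_t n)
{
    size_t best = n;

    assert(zones != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        unsigned long long lhs, rhs;

        if (zones[i].present_pages == 0)
            continue; /* empty zone, nothing to allocate from */
        if (best == n) {
            best = i;
            continue;
        }
        /* nr_mlock[i] / present[i] < nr_mlock[best] / present[best] ? */
        lhs = (unsigned long long)zones[i].nr_mlock * zones[best].present_pages;
        rhs = (unsigned long long)zones[best].nr_mlock * zones[i].present_pages;
        if (lhs < rhs)
            best = i;
    }
    return best;
}
```

For zones that are 90%, 10% and 15% mlocked, this picks the second zone, steering new mlocked allocations away from the zone that is already nearly pinned.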

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.