2006-03-20 13:36:03

by Stone Wang

Subject: [PATCH][0/8] (Targeting 2.6.17) Posix memory locking and balanced mlock-LRU semantic

Both a friend of mine (who is working on a DBMS derived from
PostgreSQL) and I have run into unexpected OOMs with mlock/mlockall.

After careful code reading and testing, I found that the cause of the
OOM is that the VM's LRU algorithm treats mlocked pages as Active/Inactive,
even though mlocked pages can never be reclaimed.

Mlocking many pages easily unbalances the LRU lists against the slab:
the VM tends to reclaim from the Active/Inactive lists, most of which are
mlocked, so an OOM may be triggered even though there are plenty of
reclaimable pages in the slab.
(Setting a large "vfs_cache_pressure" may help avoid the OOM in this
situation, but I think it is better to do things right than to depend on
the "vfs_cache_pressure" tunable.)

We think treating mlocked pages as Active/Inactive is the wrong semantic.
Mlocked pages should not be counted by the page-reclaim algorithm,
since in practice they are never affected by page reclaim.

The following patch set tries to fix this, with some additions.

The patch set brings Linux:
1. POSIX mlock/munlock/mlockall/munlockall.
Bring mlock/munlock/mlockall/munlockall to the POSIX definition: transaction-like,
just as described in the mlock(2) man page for each of these calls.
Users of the mlock system call family thus always have a clear map of their
mlocked areas.
2. More consistent LRU semantics in memory management.
Mlocked pages are placed on a separate LRU list: the Wired list.
These pages take no part in the LRU algorithms, since they can never be swapped
until munlocked.
3. The Wired (mlocked) page count exported through /proc/meminfo.
One line is added to /proc/meminfo, "Wired: N kB", so Linux system
administrators and programmers get a clearer map of physical memory usage.


Testing the patch:

Test environment:
RHEL4.
Total physical memory: 256 MB, no swap.
One ext3 directory ("/mnt/test") with about 256 thousand small
files (2 kB each).

Step 1. Run a task that mlocks 220 MB.
Step 2. Run: "find /mnt/test -size 100"


Case A. Standard kernel.org kernel 2.6.15

Linux soon ran OOM; memory info at OOM time:

[root@Linux ~]# cat /proc/meminfo
MemTotal: 254248 kB
MemFree: 3144 kB
Buffers: 124 kB
Cached: 1584 kB
SwapCached: 0 kB
Active: 229308 kB
Inactive: 596 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 254248 kB
LowFree: 3144 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 0 kB
Writeback: 0 kB
Mapped: 228556 kB
Slab: 20076 kB
CommitLimit: 127124 kB
Committed_AS: 238424 kB
PageTables: 584 kB
VmallocTotal: 770040 kB
VmallocUsed: 180 kB
VmallocChunk: 769844 kB
HugePages_Total: 0
HugePages_Free: 0
Hugepagesize: 4096 kB


Case B. Patched 2.6.15

No OOM happened.

[root@Linux ~]# cat /proc/meminfo
MemTotal: 254344 kB
MemFree: 3508 kB
Buffers: 6352 kB
Cached: 2684 kB
SwapCached: 0 kB
Active: 7140 kB
Inactive: 4732 kB
Wired: 225284 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 254344 kB
LowFree: 3508 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 72 kB
Writeback: 0 kB
Mapped: 229208 kB
Slab: 12552 kB
CommitLimit: 127172 kB
Committed_AS: 238168 kB
PageTables: 572 kB
VmallocTotal: 770040 kB
VmallocUsed: 180 kB
VmallocChunk: 769844 kB
HugePages_Total: 0
HugePages_Free: 0
Hugepagesize: 4096 kB


Many thanks to Mel Gorman for his book "Understanding the Linux Virtual
Memory Manager". Thanks also to two other great Linux kernel books: ULK3 and
LDD3.

FreeBSD's VM implementation enlightened me; thanks to the FreeBSD folks.

Attached is the full patch; the following mails contain the pieces it splits into.

Shaoping Wang


Attachments:
patch-2.6.15-memlock (48.29 kB)

2006-03-20 13:41:23

by Arjan van de Ven

Subject: Re: [PATCH][0/8] (Targeting 2.6.17) Posix memory locking and balanced mlock-LRU semantic

> 1. POSIX mlock/munlock/mlockall/munlockall.
> Bring mlock/munlock/mlockall/munlockall to the POSIX definition: transaction-like,
> just as described in the mlock(2) man page for each of these calls.
> Users of the mlock system call family thus always have a clear map of their
> mlocked areas.
> 2. More consistent LRU semantics in memory management.
> Mlocked pages are placed on a separate LRU list: the Wired list.

please give this a more logical name, such as mlocked list or pinned
list


2006-03-20 17:27:36

by Christoph Lameter

Subject: Re: [PATCH][0/8] (Targeting 2.6.17) Posix memory locking and balanced mlock-LRU semantic

On Mon, 20 Mar 2006, Stone Wang wrote:

> 2. More consistent LRU semantics in memory management.
> Mlocked pages are placed on a separate LRU list: the Wired list.
> These pages take no part in the LRU algorithms, since they can never be swapped
> until munlocked.

This also implies that the dirty bits in the ptes of mlocked pages are never
checked.

Currently, light swapping (which is very common) will scan over all pages
and move the dirty bits from the pte into struct page. This may take
a while, but at least at some point we will write out dirtied pages.

The result of not scanning mlocked pages is that mmapped files will
not be updated unless either the process terminates or msync() is called.

2006-03-20 23:52:59

by Nate Diller

Subject: Re: [PATCH][0/8] (Targeting 2.6.17) Posix memory locking and balanced mlock-LRU semantic

On 3/20/06, Arjan van de Ven <[email protected]> wrote:
> > 1. POSIX mlock/munlock/mlockall/munlockall.
> > Bring mlock/munlock/mlockall/munlockall to the POSIX definition: transaction-like,
> > just as described in the mlock(2) man page for each of these calls.
> > Users of the mlock system call family thus always have a clear map of their
> > mlocked areas.
> > 2. More consistent LRU semantics in memory management.
> > Mlocked pages are placed on a separate LRU list: the Wired list.
>
> please give this a more logical name, such as mlocked list or pinned
> list

Shaoping, thanks for doing this work, it is something I have been
thinking about for the past few weeks. It's especially nice to be
able to see how many pages are pinned in this manner.

Might I suggest calling it the long_term_pinned list? It also might
be worth putting ramdisk pages on this list, since they cannot be
written out in response to memory pressure. This would eliminate the
need for AOP_WRITEPAGE_ACTIVATE.

NATE

2006-03-21 05:23:10

by Stone Wang

Subject: Re: [PATCH][0/8] (Targeting 2.6.17) Posix memory locking and balanced mlock-LRU semantic

I will check and fix it.

2006/3/20, Christoph Lameter <[email protected]>:
> On Mon, 20 Mar 2006, Stone Wang wrote:
>
> > 2. More consistent LRU semantics in memory management.
> > Mlocked pages are placed on a separate LRU list: the Wired list.
> > These pages take no part in the LRU algorithms, since they can never be swapped
> > until munlocked.
>
> This also implies that the dirty bits in the ptes of mlocked pages are never
> checked.
>
> Currently, light swapping (which is very common) will scan over all pages
> and move the dirty bits from the pte into struct page. This may take
> a while, but at least at some point we will write out dirtied pages.
>
> The result of not scanning mlocked pages is that mmapped files will
> not be updated unless either the process terminates or msync() is called.
>
>

2006-03-21 07:10:57

by Arjan van de Ven

Subject: Re: [PATCH][0/8] (Targeting 2.6.17) Posix memory locking and balanced mlock-LRU semantic

On Mon, 2006-03-20 at 15:52 -0800, Nate Diller wrote:
> On 3/20/06, Arjan van de Ven <[email protected]> wrote:
> > > 1. POSIX mlock/munlock/mlockall/munlockall.
> > > Bring mlock/munlock/mlockall/munlockall to the POSIX definition: transaction-like,
> > > just as described in the mlock(2) man page for each of these calls.
> > > Users of the mlock system call family thus always have a clear map of their
> > > mlocked areas.
> > > 2. More consistent LRU semantics in memory management.
> > > Mlocked pages are placed on a separate LRU list: the Wired list.
> >
> > please give this a more logical name, such as mlocked list or pinned
> > list
>
> Shaoping, thanks for doing this work, it is something I have been
> thinking about for the past few weeks. It's especially nice to be
> able to see how many pages are pinned in this manner.
>
> Might I suggest calling it the long_term_pinned list? It also might
> be worth putting ramdisk pages on this list, since they cannot be
> written out in response to memory pressure. This would eliminate the
> need for AOP_WRITEPAGE_ACTIVATE.

I like that idea



2006-03-21 12:55:21

by Nick Piggin

Subject: Re: [PATCH][0/8] (Targeting 2.6.17) Posix memory locking and balanced mlock-LRU semantic

Stone Wang wrote:
> Both a friend of mine (who is working on a DBMS derived from
> PostgreSQL) and I have run into unexpected OOMs with mlock/mlockall.
>

I'm not sure this is a great idea. There are more conditions than just
mlock that prevent pages from being reclaimed: running out of swap, for
example, or having no swap, or a page being temporarily pinned (in other
words, anything from fleeting to permanent). I think something _much_
simpler could be done for a more general approach, just teaching the VM
to tolerate these pages a bit better.

Also, supposing we do want this, I think there is a fairly significant
queue of mm stuff you need to line up behind... it is probably asking
too much to target 2.6.17 for such a significant change in any case.

But despite all that, I looked through it and have a few comments ;)
Kudos for jumping in and getting your hands dirty! It can be tricky code.

> The patch set brings Linux:
> 1. POSIX mlock/munlock/mlockall/munlockall.
> Bring mlock/munlock/mlockall/munlockall to the POSIX definition: transaction-like,
> just as described in the mlock(2) man page for each of these calls.
> Users of the mlock system call family thus always have a clear map of their
> mlocked areas.

In what way are we not POSIX compliant now?

--
SUSE Labs, Novell Inc.


Send instant messages to your online friends http://au.messenger.yahoo.com

2006-03-21 12:57:12

by Nick Piggin

Subject: Re: [PATCH][0/8] (Targeting 2.6.17) Posix memory locking and balanced mlock-LRU semantic

Nate Diller wrote:

> Might I suggest calling it the long_term_pinned list? It also might
> be worth putting ramdisk pages on this list, since they cannot be
> written out in response to memory pressure. This would eliminate the
> need for AOP_WRITEPAGE_ACTIVATE.
>

They are used by the ram filesystem, btw. And I don't think you can eliminate
AOP_WRITEPAGE_ACTIVATE, because it is needed for a number of reasons (running
out of swap space being one).

--
SUSE Labs, Novell Inc.

2006-03-21 15:20:17

by Stone Wang

Subject: Re: [PATCH][0/8] (Targeting 2.6.17) Posix memory locking and balanced mlock-LRU semantic

Checked: mlocked pages don't take part in swap writeback,
unlike normal mmapped pages:

linux-2.6.16/mm/rmap.c

try_to_unmap_one()

603 if ((vma->vm_flags & VM_LOCKED) ||
604 (ptep_clear_flush_young(vma, address, pte)
605 && !ignore_refs)) {
606 ret = SWAP_FAIL;
607 goto out_unmap;
608 }
609
610 /* Nuke the page table entry. */
611 flush_cache_page(vma, address, page_to_pfn(page));
612 pteval = ptep_clear_flush(vma, address, pte);
613
614 /* Move the dirty bit to the physical page now the pte is gone. */
615 if (pte_dirty(pteval))
616 set_page_dirty(page);

For a VM_LOCKED page, the code bails out (line 607) without reaching
set_page_dirty() (line 616).



2006/3/20, Christoph Lameter <[email protected]>:
> On Mon, 20 Mar 2006, Stone Wang wrote:
>
> > 2. More consistent LRU semantics in memory management.
> > Mlocked pages are placed on a separate LRU list: the Wired list.
> > These pages take no part in the LRU algorithms, since they can never be swapped
> > until munlocked.
>
> This also implies that the dirty bits in the ptes of mlocked pages are never
> checked.
>
> Currently, light swapping (which is very common) will scan over all pages
> and move the dirty bits from the pte into struct page. This may take
> a while, but at least at some point we will write out dirtied pages.
>
> The result of not scanning mlocked pages is that mmapped files will
> not be updated unless either the process terminates or msync() is called.
>
>

2006-03-24 04:46:16

by Rik van Riel

Subject: Re: [PATCH][0/8] (Targeting 2.6.17) Posix memory locking and balanced mlock-LRU semantic

On Mon, 20 Mar 2006, Christoph Lameter wrote:

> The result of not scanning mlocked pages will be that mmapped files will
> not be updated unless either the process terminates or msync() is called.

That's ok. Light swapping on a system with non-mlocked
mmapped pages has the same result, since we won't scan
mapped pages most of the time...

--
All Rights Reversed

2006-03-24 14:36:55

by Andi Kleen

Subject: Re: [PATCH][0/8] (Targeting 2.6.17) Posix memory locking and balanced mlock-LRU semantic

"Stone Wang" <[email protected]> writes:
> mlocked areas.
> 2. More consistent LRU semantics in memory management.
> Mlocked pages are placed on a separate LRU list: the Wired list.

If it's mlocked, why don't you just call it the Mlocked list?
Does strange jargon make the patch cooler? The same applies to meminfo.

-Andi

2006-03-24 14:54:19

by Stone Wang

Subject: Re: [PATCH][0/8] (Targeting 2.6.17) Posix memory locking and balanced mlock-LRU semantic

I am preparing a patch against 2.6.16 that replaces the name "wired" with "pinned".

Potentially, the list could be used for more purposes than just mlocked pages.

Shaoping Wang

24 Mar 2006 15:36:46 +0100, Andi Kleen <[email protected]>:
> "Stone Wang" <[email protected]> writes:
> > mlocked areas.
> > 2. More consistent LRU semantics in memory management.
> > Mlocked pages are placed on a separate LRU list: the Wired list.
>
> If it's mlocked, why don't you just call it the Mlocked list?
> Does strange jargon make the patch cooler? The same applies to meminfo.
>
> -Andi
>

2006-03-24 15:05:13

by Stone Wang

Subject: Re: [PATCH][0/8] (Targeting 2.6.17) Posix memory locking and balanced mlock-LRU semantic

2006/3/21, Nick Piggin <[email protected]>:
> Stone Wang wrote:
> > Both a friend of mine (who is working on a DBMS derived from
> > PostgreSQL) and I have run into unexpected OOMs with mlock/mlockall.
> >
>
> I'm not sure this is a great idea. There are more conditions than just
> mlock that prevent pages from being reclaimed: running out of swap, for
> example, or having no swap, or a page being temporarily pinned (in other
> words, anything from fleeting to permanent). I think something _much_
> simpler could be done for a more general approach, just teaching the VM
> to tolerate these pages a bit better.
>
> Also, supposing we do want this, I think there is a fairly significant
> queue of mm stuff you need to line up behind... it is probably asking
> too much to target 2.6.17 for such a significant change in any case.
>
> But despite all that, I looked through it and have a few comments ;)
> Kudos for jumping in and getting your hands dirty! It can be tricky code.
>
> > The patch set brings Linux:
> > 1. POSIX mlock/munlock/mlockall/munlockall.
> > Bring mlock/munlock/mlockall/munlockall to the POSIX definition: transaction-like,
> > just as described in the mlock(2) man page for each of these calls.
> > Users of the mlock system call family thus always have a clear map of their
> > mlocked areas.
>
> In what way are we not POSIX compliant now?

Currently, Linux's mlock, for example, may fail with only part of its
work done.

Whereas, according to the POSIX definition:

man mlock(2)

"
RETURN VALUE
On success, mlock returns zero. On error, -1 is returned, errno is set
appropriately, and no changes are made to any locks in the address
space of the process.
"

Shaoping Wang

>
> --
> SUSE Labs, Novell Inc.
>
>
>
>

2006-03-24 18:27:17

by Nick Piggin

Subject: Re: [PATCH][0/8] (Targeting 2.6.17) Posix memory locking and balanced mlock-LRU semantic

Stone Wang wrote:
> 2006/3/21, Nick Piggin <[email protected]>:

>>In what way are we not POSIX compliant now?
>
>
> Currently, Linux's mlock, for example, may fail with only part of its
> work done.
>
> Whereas, according to the POSIX definition:
>
> man mlock(2)
>
> "
> RETURN VALUE
> On success, mlock returns zero. On error, -1 is returned, errno is set
> appropriately, and no changes are made to any locks in the address
> space of the process.
> "
>

Looks like you're right, so good catch. You should probably try to submit your
POSIX mlock patch by itself then. Make sure you look at the coding standards,
though, and try to _really_ follow the coding conventions of the file you're
modifying.

You should also make sure the patch works standalone (i.e. not just as part of
a set). Oh, and introducing a new field in the vma for a flag is probably not
the best option if you still have room in the vm_flags field.

And the patch changelog should state the actual problem and quote the
relevant part of the POSIX definition, if applicable.

Thanks,
Nick

--
SUSE Labs, Novell Inc.