LinuxLists.cc - 2.6.26-rc5-mm3

[permalink] [raw]

Subject: Re: 2.6.26-rc5-mm3: kernel BUG at mm/vmscan.c:510

On Thu, 12 Jun 2008 11:58:58 +0400 Alexey Dobriyan <[email protected]> wrote:

> [ 254.217776] ------------[ cut here ]------------
> [ 254.217776] kernel BUG at mm/vmscan.c:510!
> [ 254.217776] invalid opcode: 0000 [1] PREEMPT SMP DEBUG_PAGEALLOC
> [ 254.217776] last sysfs file: /sys/kernel/uevent_seqnum
> [ 254.217776] CPU 1
> [ 254.217776] Modules linked in: ext2 nf_conntrack_irc xt_state iptable_filter ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack ip_tables x_tables usblp ehci_hcd uhci_hcd usbcore sr_mod cdrom
> [ 254.217776] Pid: 12044, comm: madvise02 Not tainted 2.6.26-rc5-mm3 #4
> [ 254.217776] RIP: 0010:[<ffffffff802729b2>] [<ffffffff802729b2>] putback_lru_page+0x152/0x160
> [ 254.217776] RSP: 0018:ffff81012edd1cd8 EFLAGS: 00010202
> [ 254.217776] RAX: ffffe20003f344b8 RBX: 0000000000000000 RCX: 0000000000000001
> [ 254.217776] RDX: 0000000000005d5c RSI: 0000000000000000 RDI: ffffe20003f344b8
> [ 254.217776] RBP: ffff81012edd1cf8 R08: 0000000000000000 R09: 0000000000000000
> [ 254.217776] R10: ffffffff80275152 R11: 0000000000000001 R12: ffffe20003f344b8
> [ 254.217776] R13: 00000000ffffffff R14: ffff810124801080 R15: ffffffffffffffff
> [ 254.217776] FS: 00007fb3ad83c6f0(0000) GS:ffff81017f845320(0000) knlGS:0000000000000000
> [ 254.217776] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 254.217776] CR2: 00007fffb5846d38 CR3: 0000000117de9000 CR4: 00000000000006e0
> [ 254.217776] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 254.217776] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 254.217776] Process madvise02 (pid: 12044, threadinfo ffff81012edd0000, task ffff81017db6b3c0)
> [ 254.217776] Stack: ffffe20003f344b8 ffffe20003f344b8 ffffffff80629300 0000000000000001
> [ 254.217776] ffff81012edd1d18 ffffffff8027d268 ffffe20003f344b8 0000000000000000
> [ 254.217776] ffff81012edd1d38 ffffffff80271783 0000000000000246 ffffe20003f344b8
> [ 254.217776] Call Trace:
> [ 254.217776] [<ffffffff8027d268>] __clear_page_mlock+0xe8/0x100
> [ 254.217776] [<ffffffff80271783>] truncate_complete_page+0x73/0x80
> [ 254.217776] [<ffffffff80271871>] truncate_inode_pages_range+0xe1/0x3c0
> [ 254.217776] [<ffffffff80271b60>] truncate_inode_pages+0x10/0x20
> [ 254.217776] [<ffffffff802e9738>] ext3_delete_inode+0x18/0xf0
> [ 254.217776] [<ffffffff802e9720>] ? ext3_delete_inode+0x0/0xf0
> [ 254.217776] [<ffffffff802aa27b>] generic_delete_inode+0x7b/0x100
> [ 254.217776] [<ffffffff802aa43c>] generic_drop_inode+0x13c/0x180
> [ 254.217776] [<ffffffff802a960d>] iput+0x5d/0x70
> [ 254.217776] [<ffffffff8029f43e>] do_unlinkat+0x13e/0x1e0
> [ 254.217776] [<ffffffff8046de77>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [ 254.217776] [<ffffffff80255c69>] ? trace_hardirqs_on_caller+0xc9/0x150
> [ 254.217776] [<ffffffff8046de77>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [ 254.217776] [<ffffffff8029f4f1>] sys_unlink+0x11/0x20
> [ 254.217776] [<ffffffff8020b6bb>] system_call_after_swapgs+0x7b/0x80
> [ 254.217776]
> [ 254.217776]
> [ 254.217776] Code: 0f 0b eb fe 0f 1f 44 00 00 f6 47 01 40 48 89 f8 75 1d 83 78 08 01 75 13 4c 89 e7 31 db e8 97 44 ff ff e9 2b ff ff ff 0f 0b eb fe <0f> 0b eb fe 48 8b 47 10 eb dd 0f 1f 40 00 55 48 89 e5 41 57 45
> [ 254.217776] RIP [<ffffffff802729b2>] putback_lru_page+0x152/0x160
> [ 254.217776] RSP <ffff81012edd1cd8>
> [ 254.234540] ---[ end trace a1dd07b571590cc8 ]---

int putback_lru_page(struct page *page)
{
int lru;
int ret = 1;
int was_unevictable;

VM_BUG_ON(!PageLocked(page));
VM_BUG_ON(PageLRU(page));

lru = !!TestClearPageActive(page);
was_unevictable = TestClearPageUnevictable(page); /* for page_evictable() */

if (unlikely(!page->mapping)) {
/*
* page truncated. drop lock as put_page() will
* free the page.
*/
VM_BUG_ON(page_count(page) != 1);

added by unevictable-lru-infrastructure.patch.

How does one reproduce this? Looks like LTP madvise2.

2008-06-12 08:27:53

by Alexey Dobriyan

[permalink] [raw]

Subject: Re: 2.6.26-rc5-mm3: kernel BUG at mm/vmscan.c:510

On Thu, Jun 12, 2008 at 01:22:05AM -0700, Andrew Morton wrote:
> On Thu, 12 Jun 2008 11:58:58 +0400 Alexey Dobriyan <[email protected]> wrote:
>
> > [ 254.217776] ------------[ cut here ]------------
> > [ 254.217776] kernel BUG at mm/vmscan.c:510!
> > [ 254.217776] invalid opcode: 0000 [1] PREEMPT SMP DEBUG_PAGEALLOC
> > [ 254.217776] last sysfs file: /sys/kernel/uevent_seqnum
> > [ 254.217776] CPU 1
> > [ 254.217776] Modules linked in: ext2 nf_conntrack_irc xt_state iptable_filter ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack ip_tables x_tables usblp ehci_hcd uhci_hcd usbcore sr_mod cdrom
> > [ 254.217776] Pid: 12044, comm: madvise02 Not tainted 2.6.26-rc5-mm3 #4
> > [ 254.217776] RIP: 0010:[<ffffffff802729b2>] [<ffffffff802729b2>] putback_lru_page+0x152/0x160
> > [ 254.217776] RSP: 0018:ffff81012edd1cd8 EFLAGS: 00010202
> > [ 254.217776] RAX: ffffe20003f344b8 RBX: 0000000000000000 RCX: 0000000000000001
> > [ 254.217776] RDX: 0000000000005d5c RSI: 0000000000000000 RDI: ffffe20003f344b8
> > [ 254.217776] RBP: ffff81012edd1cf8 R08: 0000000000000000 R09: 0000000000000000
> > [ 254.217776] R10: ffffffff80275152 R11: 0000000000000001 R12: ffffe20003f344b8
> > [ 254.217776] R13: 00000000ffffffff R14: ffff810124801080 R15: ffffffffffffffff
> > [ 254.217776] FS: 00007fb3ad83c6f0(0000) GS:ffff81017f845320(0000) knlGS:0000000000000000
> > [ 254.217776] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 254.217776] CR2: 00007fffb5846d38 CR3: 0000000117de9000 CR4: 00000000000006e0
> > [ 254.217776] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [ 254.217776] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > [ 254.217776] Process madvise02 (pid: 12044, threadinfo ffff81012edd0000, task ffff81017db6b3c0)
> > [ 254.217776] Stack: ffffe20003f344b8 ffffe20003f344b8 ffffffff80629300 0000000000000001
> > [ 254.217776] ffff81012edd1d18 ffffffff8027d268 ffffe20003f344b8 0000000000000000
> > [ 254.217776] ffff81012edd1d38 ffffffff80271783 0000000000000246 ffffe20003f344b8
> > [ 254.217776] Call Trace:
> > [ 254.217776] [<ffffffff8027d268>] __clear_page_mlock+0xe8/0x100
> > [ 254.217776] [<ffffffff80271783>] truncate_complete_page+0x73/0x80
> > [ 254.217776] [<ffffffff80271871>] truncate_inode_pages_range+0xe1/0x3c0
> > [ 254.217776] [<ffffffff80271b60>] truncate_inode_pages+0x10/0x20
> > [ 254.217776] [<ffffffff802e9738>] ext3_delete_inode+0x18/0xf0
> > [ 254.217776] [<ffffffff802e9720>] ? ext3_delete_inode+0x0/0xf0
> > [ 254.217776] [<ffffffff802aa27b>] generic_delete_inode+0x7b/0x100
> > [ 254.217776] [<ffffffff802aa43c>] generic_drop_inode+0x13c/0x180
> > [ 254.217776] [<ffffffff802a960d>] iput+0x5d/0x70
> > [ 254.217776] [<ffffffff8029f43e>] do_unlinkat+0x13e/0x1e0
> > [ 254.217776] [<ffffffff8046de77>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> > [ 254.217776] [<ffffffff80255c69>] ? trace_hardirqs_on_caller+0xc9/0x150
> > [ 254.217776] [<ffffffff8046de77>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> > [ 254.217776] [<ffffffff8029f4f1>] sys_unlink+0x11/0x20
> > [ 254.217776] [<ffffffff8020b6bb>] system_call_after_swapgs+0x7b/0x80
> > [ 254.217776]
> > [ 254.217776]
> > [ 254.217776] Code: 0f 0b eb fe 0f 1f 44 00 00 f6 47 01 40 48 89 f8 75 1d 83 78 08 01 75 13 4c 89 e7 31 db e8 97 44 ff ff e9 2b ff ff ff 0f 0b eb fe <0f> 0b eb fe 48 8b 47 10 eb dd 0f 1f 40 00 55 48 89 e5 41 57 45
> > [ 254.217776] RIP [<ffffffff802729b2>] putback_lru_page+0x152/0x160
> > [ 254.217776] RSP <ffff81012edd1cd8>
> > [ 254.234540] ---[ end trace a1dd07b571590cc8 ]---
>
> int putback_lru_page(struct page *page)
> {
> int lru;
> int ret = 1;
> int was_unevictable;
>
> VM_BUG_ON(!PageLocked(page));
> VM_BUG_ON(PageLRU(page));
>
> lru = !!TestClearPageActive(page);
> was_unevictable = TestClearPageUnevictable(page); /* for page_evictable() */
>
> if (unlikely(!page->mapping)) {
> /*
> * page truncated. drop lock as put_page() will
> * free the page.
> */
> VM_BUG_ON(page_count(page) != 1);
>
>
> added by unevictable-lru-infrastructure.patch.
>
> How does one reproduce this? Looks like LTP madvise2.

Yep, totally reproducible here:

sudo ./testcases/bin/madvise02

2008-06-12 08:45:39

by Kamalesh Babulal

[permalink] [raw]

Subject: [BUG] 2.6.26-rc5-mm3 kernel BUG at mm/filemap.c:575!

Hi Andrew,

2.6.26-rc5-mm3 kernel panics while booting up on the x86_64
machine. Sorry the console is bit overwritten for the first few lines.

------------[ cut here ]------------
ot fs
no fstab.kernel BUG at mm/filemap.c:575!
sys, mounting ininvalid opcode: 0000 [1] ternal defaultsSMP
Switching to ne
w root and runnilast sysfs file: /sys/block/dm-3/removable
ng init.
unmounCPU 3 ting old /dev
u
nmounting old /pModules linked in:roc
unmounting
old /sys
Pid: 1, comm: init Not tainted 2.6.26-rc5-mm3-autotest #1
RIP: 0010:[<ffffffff80268155>] [<ffffffff80268155>] unlock_page+0xf/0x26
RSP: 0018:ffff81003f9e1dc8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffe20000f63080 RCX: 0000000000000036
RDX: 0000000000000000 RSI: ffffe20000f63080 RDI: ffffe20000f63080
RBP: 0000000000000000 R08: ffff81003f9a5727 R09: ffffc10000200200
R10: ffffc10000100100 R11: 000000000000000e R12: 0000000000000000
R13: 0000000000000000 R14: ffff81003f47aed8 R15: 0000000000000000
FS: 000000000066d870(0063) GS:ffff81003f99fa80(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000065afa0 CR3: 000000003d580000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process init (pid: 1, threadinfo ffff81003f9e0000, task ffff81003f9d8000)
Stack: ffffe20000f63080 ffffffff80270d9c 0000000000000000 ffffffffffffffff
000000000000000e 0000000000000000 ffffe20000f63080 ffffe20000f630c0
ffffe20000f63100 ffffe20000f63140 ffffe20000f63180 ffffe20000f631c0
Call Trace:
[<ffffffff80270d9c>] truncate_inode_pages_range+0xc5/0x305
[<ffffffff802a7177>] generic_delete_inode+0xc9/0x133
[<ffffffff8029e3cd>] do_unlinkat+0xf0/0x160
[<ffffffff8020bd0b>] system_call_after_swapgs+0x7b/0x80

Code: 00 00 48 85 c0 74 0b 48 8b 40 10 48 85 c0 74 02 ff d0 e8 75 ec 32 00 41 5b 31 c0 c3 53 48 89 fb f0 0f ba 37 00 19 c0 85 c0 75 04 <0f> 0b eb fe e8 56 f5 ff ff 48 89 de 48 89 c7 31 d2 5b e9 47 be
RIP [<ffffffff80268155>] unlock_page+0xf/0x26
RSP <ffff81003f9e1dc8>
---[ end trace 27b1d01b03af7c12 ]---
Kernel panic - not syncing: Attempted to kill init!
Pid: 1, comm: init Tainted: G D 2.6.26-rc5-mm3-autotest #1

Call Trace:
[<ffffffff80232d87>] panic+0x86/0x144
[<ffffffff80233a09>] printk+0x4e/0x56
[<ffffffff80235740>] do_exit+0x71/0x67c
[<ffffffff80598691>] oops_begin+0x0/0x8c
[<ffffffff8020dbc0>] do_invalid_op+0x87/0x91
[<ffffffff80268155>] unlock_page+0xf/0x26
[<ffffffff805982d9>] error_exit+0x0/0x51
[<ffffffff80268155>] unlock_page+0xf/0x26
[<ffffffff80270d9c>] truncate_inode_pages_range+0xc5/0x305
[<ffffffff802a7177>] generic_delete_inode+0xc9/0x133
[<ffffffff8029e3cd>] do_unlinkat+0xf0/0x160
[<ffffffff8020bd0b>] system_call_after_swapgs+0x7b/0x80

--
Thanks & Regards,
Kamalesh Babulal,
Linux Technology Center,
IBM, ISTL.

2008-06-12 09:08:53

[permalink] [raw]

Subject: Re: [BUG] 2.6.26-rc5-mm3 kernel BUG at mm/filemap.c:575!

On Thu, 12 Jun 2008 14:14:21 +0530 Kamalesh Babulal <[email protected]> wrote:

> Hi Andrew,
>
> 2.6.26-rc5-mm3 kernel panics while booting up on the x86_64
> machine. Sorry the console is bit overwritten for the first few lines.
>
> ------------[ cut here ]------------
> ot fs
> no fstab.kernel BUG at mm/filemap.c:575!
> sys, mounting ininvalid opcode: 0000 [1] ternal defaultsSMP
> Switching to ne
> w root and runnilast sysfs file: /sys/block/dm-3/removable
> ng init.
> unmounCPU 3 ting old /dev
> u
> nmounting old /pModules linked in:roc
> unmounting
> old /sys
> Pid: 1, comm: init Not tainted 2.6.26-rc5-mm3-autotest #1
> RIP: 0010:[<ffffffff80268155>] [<ffffffff80268155>] unlock_page+0xf/0x26
> RSP: 0018:ffff81003f9e1dc8 EFLAGS: 00010246
> RAX: 0000000000000000 RBX: ffffe20000f63080 RCX: 0000000000000036
> RDX: 0000000000000000 RSI: ffffe20000f63080 RDI: ffffe20000f63080
> RBP: 0000000000000000 R08: ffff81003f9a5727 R09: ffffc10000200200
> R10: ffffc10000100100 R11: 000000000000000e R12: 0000000000000000
> R13: 0000000000000000 R14: ffff81003f47aed8 R15: 0000000000000000
> FS: 000000000066d870(0063) GS:ffff81003f99fa80(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 000000000065afa0 CR3: 000000003d580000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process init (pid: 1, threadinfo ffff81003f9e0000, task ffff81003f9d8000)
> Stack: ffffe20000f63080 ffffffff80270d9c 0000000000000000 ffffffffffffffff
> 000000000000000e 0000000000000000 ffffe20000f63080 ffffe20000f630c0
> ffffe20000f63100 ffffe20000f63140 ffffe20000f63180 ffffe20000f631c0
> Call Trace:
> [<ffffffff80270d9c>] truncate_inode_pages_range+0xc5/0x305
> [<ffffffff802a7177>] generic_delete_inode+0xc9/0x133
> [<ffffffff8029e3cd>] do_unlinkat+0xf0/0x160
> [<ffffffff8020bd0b>] system_call_after_swapgs+0x7b/0x80
>
>
> Code: 00 00 48 85 c0 74 0b 48 8b 40 10 48 85 c0 74 02 ff d0 e8 75 ec 32 00 41 5b 31 c0 c3 53 48 89 fb f0 0f ba 37 00 19 c0 85 c0 75 04 <0f> 0b eb fe e8 56 f5 ff ff 48 89 de 48 89 c7 31 d2 5b e9 47 be
> RIP [<ffffffff80268155>] unlock_page+0xf/0x26
> RSP <ffff81003f9e1dc8>
> ---[ end trace 27b1d01b03af7c12 ]---

Another unlock of an unlocked page. Presumably when reclaim hadn't
done anything yet.

Don't know, sorry. Strange.

2008-06-12 11:24:42

[permalink] [raw]

Subject: Re: [BUG] 2.6.26-rc5-mm3 kernel BUG at mm/filemap.c:575!

On Thu, 12 Jun 2008 01:57:46 -0700
Andrew Morton <[email protected]> wrote:

> > Call Trace:
> > [<ffffffff80270d9c>] truncate_inode_pages_range+0xc5/0x305
> > [<ffffffff802a7177>] generic_delete_inode+0xc9/0x133
> > [<ffffffff8029e3cd>] do_unlinkat+0xf0/0x160
> > [<ffffffff8020bd0b>] system_call_after_swapgs+0x7b/0x80
> >
> >
> > Code: 00 00 48 85 c0 74 0b 48 8b 40 10 48 85 c0 74 02 ff d0 e8 75 ec 32 00 41 5b 31 c0 c3 53 48 89 fb f0 0f ba 37 00 19 c0 85 c0 75 04 <0f> 0b eb fe e8 56 f5 ff ff 48 89 de 48 89 c7 31 d2 5b e9 47 be
> > RIP [<ffffffff80268155>] unlock_page+0xf/0x26
> > RSP <ffff81003f9e1dc8>
> > ---[ end trace 27b1d01b03af7c12 ]---
>
> Another unlock of an unlocked page. Presumably when reclaim hadn't
> done anything yet.
>
> Don't know, sorry. Strange.
>
at first look,

==
truncate_inode_pages_range()
-> TestSetPageLocked() //
-> truncate_complete_page()
-> remove_from_page_cache() // makes page->mapping to be NULL.
-> clear_page_mlock()
-> __clear_page_mlock()
-> putback_lru_page()
-> unlock_page() // page->mapping is NULL
-> unlock_page() //BUG
==

It seems truncate_complete_page() is bad.
==
static void
truncate_complete_page(struct address_space *mapping, struct page *page)
{
if (page->mapping != mapping)
return;

if (PagePrivate(page))
do_invalidatepage(page, 0);

cancel_dirty_page(page, PAGE_CACHE_SIZE);

remove_from_page_cache(page); -----------------(A)
clear_page_mlock(page); -----------------(B)
ClearPageUptodate(page);
ClearPageMappedToDisk(page);
page_cache_release(page); /* pagecache ref */
}
==

(B) should be called before (A) as invalidate_complete_page() does.

Thanks,
-Kame

2008-06-12 11:39:22

[permalink] [raw]

Subject: Re: [BUG] 2.6.26-rc5-mm3 kernel BUG at mm/filemap.c:575!

On Thursday 12 June 2008 18:57, Andrew Morton wrote:
> On Thu, 12 Jun 2008 14:14:21 +0530 Kamalesh Babulal
<[email protected]> wrote:
> > Hi Andrew,
> >
> > 2.6.26-rc5-mm3 kernel panics while booting up on the x86_64
> > machine. Sorry the console is bit overwritten for the first few lines.
> >
> > ------------[ cut here ]------------
> > ot fs
> > no fstab.kernel BUG at mm/filemap.c:575!
> > sys, mounting ininvalid opcode: 0000 [1] ternal defaultsSMP
> > Switching to ne
> > w root and runnilast sysfs file: /sys/block/dm-3/removable
> > ng init.
> > unmounCPU 3 ting old /dev
> > u
> > nmounting old /pModules linked in:roc
> > unmounting
> > old /sys
> > Pid: 1, comm: init Not tainted 2.6.26-rc5-mm3-autotest #1
> > RIP: 0010:[<ffffffff80268155>] [<ffffffff80268155>] unlock_page+0xf/0x26
> > RSP: 0018:ffff81003f9e1dc8 EFLAGS: 00010246
> > RAX: 0000000000000000 RBX: ffffe20000f63080 RCX: 0000000000000036
> > RDX: 0000000000000000 RSI: ffffe20000f63080 RDI: ffffe20000f63080
> > RBP: 0000000000000000 R08: ffff81003f9a5727 R09: ffffc10000200200
> > R10: ffffc10000100100 R11: 000000000000000e R12: 0000000000000000
> > R13: 0000000000000000 R14: ffff81003f47aed8 R15: 0000000000000000
> > FS: 000000000066d870(0063) GS:ffff81003f99fa80(0000)
> > knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > CR2: 000000000065afa0 CR3: 000000003d580000 CR4: 00000000000006e0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > Process init (pid: 1, threadinfo ffff81003f9e0000, task ffff81003f9d8000)
> > Stack: ffffe20000f63080 ffffffff80270d9c 0000000000000000
> > ffffffffffffffff 000000000000000e 0000000000000000 ffffe20000f63080
> > ffffe20000f630c0 ffffe20000f63100 ffffe20000f63140 ffffe20000f63180
> > ffffe20000f631c0 Call Trace:
> > [<ffffffff80270d9c>] truncate_inode_pages_range+0xc5/0x305
> > [<ffffffff802a7177>] generic_delete_inode+0xc9/0x133
> > [<ffffffff8029e3cd>] do_unlinkat+0xf0/0x160
> > [<ffffffff8020bd0b>] system_call_after_swapgs+0x7b/0x80
> >
> >
> > Code: 00 00 48 85 c0 74 0b 48 8b 40 10 48 85 c0 74 02 ff d0 e8 75 ec 32
> > 00 41 5b 31 c0 c3 53 48 89 fb f0 0f ba 37 00 19 c0 85 c0 75 04 <0f> 0b eb
> > fe e8 56 f5 ff ff 48 89 de 48 89 c7 31 d2 5b e9 47 be RIP
> > [<ffffffff80268155>] unlock_page+0xf/0x26
> > RSP <ffff81003f9e1dc8>
> > ---[ end trace 27b1d01b03af7c12 ]---
>
> Another unlock of an unlocked page. Presumably when reclaim hadn't
> done anything yet.
>
> Don't know, sorry. Strange.

Looks like something lockless pagecache *could* be connected with, but
I have never seen such a bug.

Hmm...

@@ -104,6 +105,7 @@ truncate_complete_page(struct address_sp
cancel_dirty_page(page, PAGE_CACHE_SIZE);

remove_from_page_cache(page);
+ clear_page_mlock(page);
ClearPageUptodate(page);
ClearPageMappedToDisk(page);
page_cache_release(page); /* pagecache ref */

...

+static inline void clear_page_mlock(struct page *page)
+{
+ if (unlikely(TestClearPageMlocked(page)))
+ __clear_page_mlock(page);
+}

...

+void __clear_page_mlock(struct page *page)
+{
+ VM_BUG_ON(!PageLocked(page)); /* for LRU isolate/putback */
+
+ dec_zone_page_state(page, NR_MLOCK);
+ count_vm_event(NORECL_PGCLEARED);
+ if (!isolate_lru_page(page)) {
+ putback_lru_page(page);
+ } else {
+ /*
+ * Page not on the LRU yet. Flush all pagevecs and retry.
+ */
+ lru_add_drain_all();
+ if (!isolate_lru_page(page))
+ putback_lru_page(page);
+ else if (PageUnevictable(page))
+ count_vm_event(NORECL_PGSTRANDED);
+ }
+}

...

+int putback_lru_page(struct page *page)
+{
+ int lru;
+ int ret = 1;
+ int was_unevictable;
+
+ VM_BUG_ON(!PageLocked(page));
+ VM_BUG_ON(PageLRU(page));
+
+ lru = !!TestClearPageActive(page);
+ was_unevictable = TestClearPageUnevictable(page); /* for
page_evictable() */
+
+ if (unlikely(!page->mapping)) {
+ /*
+ * page truncated. drop lock as put_page() will
+ * free the page.
+ */
+ VM_BUG_ON(page_count(page) != 1);
+ unlock_page(page);
^^^^^^^^^^^^^^^^^^

This is a rather wild thing to be doing. It's a really bad idea
to drop a lock that's taken several function calls distant and
across different files...

This is most likely where the locking is getting screwed up, but
even if it was cobbled together to work, it just makes the
locking scheme very hard to follow and verify.

I don't have any suggestions yet, as I still haven't been able
to review the patchset properly (and probably won't for the next
week or so). But please rethink the locking.

Thanks,
Nick

2008-06-12 23:32:56

by Byron Bradley

[permalink] [raw]

Subject: Re: 2.6.26-rc5-mm3

Looks like x86 and ARM both fail to boot if PROFILE_LIKELY, FTRACE and
DYNAMIC_FTRACE are selected. If any one of those three are disabled it
boots (or fails in some other way which I'm looking at now). The serial
console output from both machines when they fail to boot is below, let me
know if there is any other information I can provide.

ARM (Marvell Orion 5x):
<5>Linux version 2.6.26-rc5-mm3-dirty (bb3081@gamma) (gcc version 4.2.0 20070413 (prerelease) (CodeSourcery Sourcery G++ Lite 2007q1-21)) #24 PREEMPT Thu Jun 12 23:39:12 BST 2008
CPU: Feroceon [41069260] revision 0 (ARMv5TEJ), cr=a0053177
Machine: QNAP TS-109/TS-209
<4>Clearing invalid memory bank 0KB@0xffffffff
<4>Clearing invalid memory bank 0KB@0xffffffff
<4>Clearing invalid memory bank 0KB@0xffffffff
<4>Ignoring unrecognised tag 0x00000000
<4>Ignoring unrecognised tag 0x00000000
<4>Ignoring unrecognised tag 0x00000000
<4>Ignoring unrecognised tag 0x41000403
Memory policy: ECC disabled, Data cache writeback
<7>On node 0 totalpages: 32768
<7>Node 0 memmap at 0xc05df000 size 1048576 first pfn 0xc05df000
<7>free_area_init_node: node 0, pgdat c0529680, node_mem_map c05df000
<7> DMA zone: 32512 pages, LIFO batch:7
CPU0: D VIVT write-back cache
CPU0: I cache: 32768 bytes, associativity 1, 32 byte lines, 1024 sets
CPU0: D cache: 32768 bytes, associativity 1, 32 byte lines, 1024 sets
Built 1 zonelists in Zone order, mobility grouping on. Total pages: 32512
<5>Kernel command line: console=ttyS0,115200n8 root=/dev/nfs nfsroot=192.168.2.53:/stuff/debian ip=dhcp
PID hash table entries: 512 (order: 9, 2048 bytes)
Console: colour dummy device 80x30
<6>Dentry cache hash table entries: 16384 (order: 4, 65536 bytes)
<6>Inode-cache hash table entries: 8192 (order: 3, 32768 bytes)
<6>Memory: 128MB = 128MB total
<5>Memory: 123776KB available (5016K code, 799K data, 160K init)
<6>SLUB: Genslabs=12, HWalign=32, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
<7>Calibrating delay loop... 331.77 BogoMIPS (lpj=1658880)
Mount-cache hash table entries: 512
<6>CPU: Testing write buffer coherency: ok

x86 (AMD Athlon):
Linux version 2.6.26-rc5-mm3-dirty (bb3081@gamma) (gcc version 4.2.3 (Ubuntu 4.2.3-2ubuntu7)) #6 Thu Jun 12 23:53:18 BST 2008
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 000000001fff0000 (usable)
BIOS-e820: 000000001fff0000 - 000000001fff3000 (ACPI NVS)
BIOS-e820: 000000001fff3000 - 0000000020000000 (ACPI data)
BIOS-e820: 00000000fec00000 - 00000000fec01000 (reserved)
BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
BIOS-e820: 00000000ffff0000 - 0000000100000000 (reserved)
last_pfn = 131056 max_arch_pfn = 1048576
0MB HIGHMEM available.
511MB LOWMEM available.
mapped low ram: 0 - 01400000
low ram: 00f7a000 - 1fff0000
bootmap 00f7a000 - 00f7e000
early res: 0 [0-fff] BIOS data page
early res: 1 [100000-f74657] TEXT DATA BSS
early res: 2 [f75000-f79fff] INIT_PG_TABLE
early res: 3 [9f800-fffff] BIOS reserved
early res: 4 [f7a000-f7dfff] BOOTMAP
Zone PFN ranges:
DMA 0 -> 4096
Normal 4096 -> 131056
HighMem 131056 -> 131056
Movable zone start PFN for each node
early_node_map[2] active PFN ranges
0: 0 -> 159
0: 256 -> 131056
DMI 2.2 present.
ACPI: RSDP 000F7950, 0014 (r0 Nvidia)
ACPI: RSDT 1FFF3000, 002C (r1 Nvidia AWRDACPI 42302E31 AWRD 0)
ACPI: FACP 1FFF3040, 0074 (r1 Nvidia AWRDACPI 42302E31 AWRD 0)
ACPI: DSDT 1FFF30C0, 4C22 (r1 NVIDIA AWRDACPI 1000 MSFT 100000E)
ACPI: FACS 1FFF0000, 0040
ACPI: APIC 1FFF7D00, 006E (r1 Nvidia AWRDACPI 42302E31 AWRD 0)
ACPI: PM-Timer IO Port: 0x4008
Allocating PCI resources starting at 30000000 (gap: 20000000:dec00000)
Built 1 zonelists in Zone order, mobility grouping on. Total pages: 129935
Kernel command line: console=ttyS0,115200 root=/dev/nfs nfsroot=192.168.2.53:/stuff/debian-amd ip=dhcp BOOT_IMAGE=linux.amd
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Initializing CPU#0
PID hash table entries: 2048 (order: 11, 8192 bytes)
Detected 1102.525 MHz processor.
Console: colour VGA+ 80x25
console [ttyS0] enabled
Dentry cache hash table entries: 65536 (order: 6, 262144 bytes)
Inode-cache hash table entries: 32768 (order: 5, 131072 bytes)
Memory: 504184k/524224k available (8084k kernel code, 19476k reserved, 2784k data, 436k init, 0k highmem)
virtual kernel memory layout:
fixmap : 0xfffed000 - 0xfffff000 ( 72 kB)
pkmap : 0xff800000 - 0xffc00000 (4096 kB)
vmalloc : 0xe0800000 - 0xff7fe000 ( 495 MB)
lowmem : 0xc0000000 - 0xdfff0000 ( 511 MB)
.init : 0xc0ba0000 - 0xc0c0d000 ( 436 kB)
.data : 0xc08e53f1 - 0xc0b9d418 (2784 kB)
.text : 0xc0100000 - 0xc08e53f1 (8084 kB)
Checking if this processor honours the WP bit even in supervisor mode...Ok.
SLUB: Genslabs=12, HWalign=32, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
Calibrating delay using timer specific routine.. 2207.88 BogoMIPS (lpj=11039440)
Mount-cache hash table entries: 512
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 512K (64 bytes/line)
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
CPU: AMD Athlon(tm) stepping 00
Checking 'hlt' instruction... OK.
Freeing SMP alternatives: 0k freed
ACPI: Core revision 20080321
Parsing all Control Methods:
Table [DSDT](id 0001) - 804 Objects with 77 Devices 276 Methods 35 Regions
tbxface-0598 [00] tb_load_namespace : ACPI Tables successfully acquired
ACPI: setting ELCR to 0200 (from 1c28)
evxfevnt-0091 [00] enable : Transition to ACPI mode successful
gcov: version magic: 0x3430322a
net_namespace: 324 bytes

Cheers,

--
Byron Bradley

2008-06-12 23:55:23

by Daniel Walker

[permalink] [raw]

Subject: Re: 2.6.26-rc5-mm3

2008-06-13 00:04:39

by Byron Bradley

[permalink] [raw]

Subject: Re: 2.6.26-rc5-mm3

On Thu, 12 Jun 2008, Daniel Walker wrote:

>
> On Fri, 2008-06-13 at 00:32 +0100, Byron Bradley wrote:
> > Looks like x86 and ARM both fail to boot if PROFILE_LIKELY, FTRACE and
> > DYNAMIC_FTRACE are selected. If any one of those three are disabled it
> > boots (or fails in some other way which I'm looking at now). The serial
> > console output from both machines when they fail to boot is below, let me
> > know if there is any other information I can provide.
>
> Did you happen to check PROFILE_LIKELY and FTRACE alone?

Yes, without DYNAMIC_FTRACE the arm box gets all the way to userspace and
the x86 box panics while registering a driver so most likely unrelated to
this problem.

--
Byron Bradley

2008-06-13 00:23:25

[permalink] [raw]

Subject: Re: [BUG] 2.6.26-rc5-mm3 kernel BUG at mm/filemap.c:575!

On Thu, 12 Jun 2008 21:38:59 +1000
Nick Piggin <[email protected]> wrote:

> +int putback_lru_page(struct page *page)
> +{
> + int lru;
> + int ret = 1;
> + int was_unevictable;
> +
> + VM_BUG_ON(!PageLocked(page));
> + VM_BUG_ON(PageLRU(page));
> +
> + lru = !!TestClearPageActive(page);
> + was_unevictable = TestClearPageUnevictable(page); /* for
> page_evictable() */
> +
> + if (unlikely(!page->mapping)) {
> + /*
> + * page truncated. drop lock as put_page() will
> + * free the page.
> + */
> + VM_BUG_ON(page_count(page) != 1);
> + unlock_page(page);
> ^^^^^^^^^^^^^^^^^^
>
>
> This is a rather wild thing to be doing. It's a really bad idea
> to drop a lock that's taken several function calls distant and
> across different files...
>
I agree and strongly hope this unlock should be removed.
The caller can do unlock by itself, I think.

Thanks,
-Kame

2008-06-13 01:41:27

[permalink] [raw]

Subject: [PATCH] fix double unlock_page() in 2.6.26-rc5-mm3 kernel BUG at mm/filemap.c:575!

This is reproducer of panic. "quick fix" is attached.
But I think putback_lru_page() should be re-designed.

==
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <errno.h>

int main(int argc, char *argv[])
{
int fd;
char *filename = argv[1];
char buffer[4096];
char *addr;
int len;

fd = open(filename, O_CREAT | O_EXCL | O_RDWR, S_IRWXU);

if (fd < 0) {
perror("open");
exit(1);
}
len = write(fd, buffer, sizeof(buffer));

if (len < 0) {
perror("write");
exit(1);
}

addr = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED|MAP_LOCKED, fd, 0);
if (addr == MAP_FAILED) {
perror("mmap");
exit(1);
}
munmap(addr, 4096);
close(fd);

unlink(filename);
}
==
you'll see panic.

Fix is here
==

quick fix for double unlock_page();

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Index: linux-2.6.26-rc5-mm3/mm/truncate.c
===================================================================
--- linux-2.6.26-rc5-mm3.orig/mm/truncate.c
+++ linux-2.6.26-rc5-mm3/mm/truncate.c
@@ -104,8 +104,8 @@ truncate_complete_page(struct address_sp

cancel_dirty_page(page, PAGE_CACHE_SIZE);

- remove_from_page_cache(page);
clear_page_mlock(page);
+ remove_from_page_cache(page);
ClearPageUptodate(page);
ClearPageMappedToDisk(page);
page_cache_release(page); /* pagecache ref */

2008-06-13 02:15:04

[permalink] [raw]

Subject: Re: [PATCH] fix double unlock_page() in 2.6.26-rc5-mm3 kernel BUG at mm/filemap.c:575!

On Fri, 13 Jun 2008 10:44:44 +0900 KAMEZAWA Hiroyuki <[email protected]> wrote:

> This is reproducer of panic. "quick fix" is attached.

Thanks - I put that in
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.26-rc5/2.6.26-rc5-mm3/hot-fixes/

> But I think putback_lru_page() should be re-designed.

Yes, it sounds that way.

2008-06-13 04:19:50

by Valdis Klētnieks

[permalink] [raw]

Subject: Re: [BUG] 2.6.26-rc5-mm3 kernel BUG at mm/filemap.c:575!

On Thu, 12 Jun 2008 14:14:21 +0530, Kamalesh Babulal said:
> Hi Andrew,
>
> 2.6.26-rc5-mm3 kernel panics while booting up on the x86_64
> machine. Sorry the console is bit overwritten for the first few lines.

> no fstab.kernel BUG at mm/filemap.c:575!

For whatever it's worth, I'm seeing the same thing on my x86_64 laptop.
-rc5-mm2 works OK, I'm going to try to bisect it tonight.

% diff -u /usr/src/linux-2.6.26-rc5-mm[23]/.config
--- /usr/src/linux-2.6.26-rc5-mm2/.config 2008-06-10 22:21:13.000000000 -0400
+++ /usr/src/linux-2.6.26-rc5-mm3/.config 2008-06-12 22:20:25.000000000 -0400
@@ -1,7 +1,7 @@
#
# Automatically generated make config: don't edit
-# Linux kernel version: 2.6.26-rc5-mm2
-# Tue Jun 10 22:21:13 2008
+# Linux kernel version: 2.6.26-rc5-mm3
+# Thu Jun 12 22:20:25 2008
#
CONFIG_64BIT=y
# CONFIG_X86_32 is not set
@@ -275,7 +275,7 @@
CONFIG_ZONE_DMA_FLAG=1
CONFIG_BOUNCE=y
CONFIG_VIRT_TO_BUS=y
-# CONFIG_NORECLAIM_LRU is not set
+CONFIG_UNEVICTABLE_LRU=y
CONFIG_MTRR=y
CONFIG_MTRR_SANITIZER=y
CONFIG_MTRR_SANITIZER_ENABLE_DEFAULT=0

Not much changed there...

Attachments:

(No filename) (226.00 B)

2008-06-13 04:42:36

by Valdis Klētnieks

[permalink] [raw]

Subject: Re: [PATCH] fix double unlock_page() in 2.6.26-rc5-mm3 kernel BUG at mm/filemap.c:575!

On Fri, 13 Jun 2008 10:44:44 +0900, KAMEZAWA Hiroyuki said:

> quick fix for double unlock_page();
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> Index: linux-2.6.26-rc5-mm3/mm/truncate.c
> ===================================================================
> --- linux-2.6.26-rc5-mm3.orig/mm/truncate.c
> +++ linux-2.6.26-rc5-mm3/mm/truncate.c
> @@ -104,8 +104,8 @@ truncate_complete_page(struct address_sp
>
> cancel_dirty_page(page, PAGE_CACHE_SIZE);
>
> - remove_from_page_cache(page);
> clear_page_mlock(page);
> + remove_from_page_cache(page);
> ClearPageUptodate(page);
> ClearPageMappedToDisk(page);
> page_cache_release(page); /* pagecache ref */

Confirming this quick fix works on my laptop that was hitting this crash -
am now up and running on -rc5-mm3.

Attachments:

(No filename) (226.00 B)

2008-06-13 07:17:21

[permalink] [raw]

Subject: Re: [BUG] 2.6.26-rc5-mm3 kernel BUG at mm/filemap.c:575!

On Fri, 13 Jun 2008 00:18:43 -0400 [email protected] wrote:

> On Thu, 12 Jun 2008 14:14:21 +0530, Kamalesh Babulal said:
> > Hi Andrew,
> >
> > 2.6.26-rc5-mm3 kernel panics while booting up on the x86_64
> > machine. Sorry the console is bit overwritten for the first few lines.
>
> > no fstab.kernel BUG at mm/filemap.c:575!
>
> For whatever it's worth, I'm seeing the same thing on my x86_64 laptop.
> -rc5-mm2 works OK, I'm going to try to bisect it tonight.

ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.26-rc5/2.6.26-rc5-mm3/hot-fixes/fix-double-unlock_page-in-2626-rc5-mm3-kernel-bug-at-mm-filemapc-575.patch is said to "fix" it.

2008-06-13 15:30:32

[permalink] [raw]

Subject: Re: [PATCH] fix double unlock_page() in 2.6.26-rc5-mm3 kernel BUG at mm/filemap.c:575!

On Thu, 2008-06-12 at 19:13 -0700, Andrew Morton wrote:
> On Fri, 13 Jun 2008 10:44:44 +0900 KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> > This is reproducer of panic. "quick fix" is attached.
>
> Thanks - I put that in
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.26-rc5/2.6.26-rc5-mm3/hot-fixes/
>
> > But I think putback_lru_page() should be re-designed.
>
> Yes, it sounds that way.

Here's a proposed replacement patch that reworks putback_lru_page()
slightly and cleans up the call sites. I still want to balance the
get_page() in isolate_lru_page() with a put_page() in putback_lru_page()
for the primary users--vmscan and page migration. So, I need to drop
the lock before the put_page() when handed a page with null mapping and
a single reference count as the page will be freed on put_page() and a
locked page would bug out in free_pages_check()/bad_page().

Lee

PATCH fix page unlocking protocol for putback_lru_page()

Against: 2.6.26-rc5-mm3

Replaces Kame-san's hotfix:
fix-double-unlock_page-in-2626-rc5-mm3-kernel-bug-at-mm-filemapc-575.patch

Applies at end of vmscan/unevictable/mlock series to avoid patch conflicts.

1) modified putback_lru_page() to drop page lock only if both page_mapping()
NULL and page_count() == 1 [rather than VM_BUG_ON(page_count(page) != 1].
I want to balance the put_page() from isolate_lru_page() here for vmscan
and, e.g., page migration rather than requiring explicit checks of the
page_mapping() and explicit put_page() in these areas. However, the page
could be truncated while one of these subsystems holds it isolated from
the LRU. So, need to handle this case. Callers of putback_lru_page()
need to be aware of this and only call it with a page with NULL
page_mapping() when they will no longer reference the page afterwards.
This is the case for vmscan and page migration.

2) m[un]lock_vma_page() already will not be called for page with NULL
mapping. Added VM_BUG_ON() to assert this.

3) modified clear_page_lock() to skip the isolate/putback shuffle for
pages with NULL mapping, as they are being truncated/freed. Thus,
any future callers of clear_page_lock() need not be concerned about
the putback_lru_page() semantics for truncated pages.

Signed-off-by: Lee Schermerhorn <[email protected]>

mm/mlock.c | 29 +++++++++++++++++++----------
mm/vmscan.c | 12 +++++++-----
2 files changed, 26 insertions(+), 15 deletions(-)

Index: linux-2.6.26-rc5-mm3/mm/mlock.c
===================================================================
--- linux-2.6.26-rc5-mm3.orig/mm/mlock.c 2008-06-12 11:42:59.000000000 -0400
+++ linux-2.6.26-rc5-mm3/mm/mlock.c 2008-06-13 09:47:14.000000000 -0400
@@ -59,27 +59,33 @@ void __clear_page_mlock(struct page *pag

dec_zone_page_state(page, NR_MLOCK);
count_vm_event(NORECL_PGCLEARED);
- if (!isolate_lru_page(page)) {
- putback_lru_page(page);
- } else {
- /*
- * Page not on the LRU yet. Flush all pagevecs and retry.
- */
- lru_add_drain_all();
- if (!isolate_lru_page(page))
+ if (page->mapping) { /* truncated ? */
+ if (!isolate_lru_page(page)) {
putback_lru_page(page);
- else if (PageUnevictable(page))
- count_vm_event(NORECL_PGSTRANDED);
+ } else {
+ /*
+ * Page not on the LRU yet.
+ * Flush all pagevecs and retry.
+ */
+ lru_add_drain_all();
+ if (!isolate_lru_page(page))
+ putback_lru_page(page);
+ else if (PageUnevictable(page))
+ count_vm_event(NORECL_PGSTRANDED);
+ }
}
}

/*
* Mark page as mlocked if not already.
* If page on LRU, isolate and putback to move to unevictable list.
+ *
+ * Called with page locked and page_mapping() != NULL.
*/
void mlock_vma_page(struct page *page)
{
BUG_ON(!PageLocked(page));
+ VM_BUG_ON(!page_mapping(page));

if (!TestSetPageMlocked(page)) {
inc_zone_page_state(page, NR_MLOCK);
@@ -92,6 +98,8 @@ void mlock_vma_page(struct page *page)
/*
* called from munlock()/munmap() path with page supposedly on the LRU.
*
+ * Called with page locked and page_mapping() != NULL.
+ *
* Note: unlike mlock_vma_page(), we can't just clear the PageMlocked
* [in try_to_munlock()] and then attempt to isolate the page. We must
* isolate the page to keep others from messing with its unevictable
@@ -110,6 +118,7 @@ void mlock_vma_page(struct page *page)
static void munlock_vma_page(struct page *page)
{
BUG_ON(!PageLocked(page));
+ VM_BUG_ON(!page_mapping(page));

if (TestClearPageMlocked(page)) {
dec_zone_page_state(page, NR_MLOCK);
Index: linux-2.6.26-rc5-mm3/mm/vmscan.c
===================================================================
--- linux-2.6.26-rc5-mm3.orig/mm/vmscan.c 2008-06-12 11:39:09.000000000 -0400
+++ linux-2.6.26-rc5-mm3/mm/vmscan.c 2008-06-13 09:44:44.000000000 -0400
@@ -1,4 +1,4 @@
-/*
+ /*
* linux/mm/vmscan.c
*
* Copyright (C) 1991, 1992, 1993, 1994 Linus Torvalds
@@ -488,6 +488,9 @@ int remove_mapping(struct address_space
* lru_lock must not be held, interrupts must be enabled.
* Must be called with page locked.
*
+ * If page truncated [page_mapping() == NULL] and we hold the last reference,
+ * the page will be freed here. For vmscan and page migration.
+ *
* return 1 if page still locked [not truncated], else 0
*/
int putback_lru_page(struct page *page)
@@ -502,12 +505,11 @@ int putback_lru_page(struct page *page)
lru = !!TestClearPageActive(page);
was_unevictable = TestClearPageUnevictable(page); /* for page_evictable() */

- if (unlikely(!page->mapping)) {
+ if (unlikely(!page->mapping && page_count(page) == 1)) {
/*
- * page truncated. drop lock as put_page() will
- * free the page.
+ * page truncated and we hold last reference.
+ * drop lock as put_page() will free the page.
*/
- VM_BUG_ON(page_count(page) != 1);
unlock_page(page);
ret = 0;
} else if (page_evictable(page, NULL)) {

2008-06-14 13:33:03

by Kamalesh Babulal

[permalink] [raw]

Subject: Re: [PATCH] fix double unlock_page() in 2.6.26-rc5-mm3 kernel BUG at mm/filemap.c:575!

KAMEZAWA Hiroyuki wrote:
> This is reproducer of panic. "quick fix" is attached.
> But I think putback_lru_page() should be re-designed.
>
> ==
> #include <stdio.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <sys/mman.h>
> #include <unistd.h>
> #include <errno.h>
>
> int main(int argc, char *argv[])
> {
> int fd;
> char *filename = argv[1];
> char buffer[4096];
> char *addr;
> int len;
>
> fd = open(filename, O_CREAT | O_EXCL | O_RDWR, S_IRWXU);
>
> if (fd < 0) {
> perror("open");
> exit(1);
> }
> len = write(fd, buffer, sizeof(buffer));
>
> if (len < 0) {
> perror("write");
> exit(1);
> }
>
> addr = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED|MAP_LOCKED, fd, 0);
> if (addr == MAP_FAILED) {
> perror("mmap");
> exit(1);
> }
> munmap(addr, 4096);
> close(fd);
>
> unlink(filename);
> }
> ==
> you'll see panic.
>
> Fix is here
> ==
Hi Kame,

Thanks, The patch fixes the kernel panic.

Tested-by: Kamalesh Babulal <[email protected]>
>
> quick fix for double unlock_page();
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> Index: linux-2.6.26-rc5-mm3/mm/truncate.c
> ===================================================================
> --- linux-2.6.26-rc5-mm3.orig/mm/truncate.c
> +++ linux-2.6.26-rc5-mm3/mm/truncate.c
> @@ -104,8 +104,8 @@ truncate_complete_page(struct address_sp
>
> cancel_dirty_page(page, PAGE_CACHE_SIZE);
>
> - remove_from_page_cache(page);
> clear_page_mlock(page);
> + remove_from_page_cache(page);
> ClearPageUptodate(page);
> ClearPageMappedToDisk(page);
> page_cache_release(page); /* pagecache ref */
>

--
Thanks & Regards,
Kamalesh Babulal,
Linux Technology Center,
IBM, ISTL.

2008-06-15 04:00:12

by Kamalesh Babulal

[permalink] [raw]

Subject: Re: [PATCH] fix double unlock_page() in 2.6.26-rc5-mm3 kernel BUG at mm/filemap.c:575!

Lee Schermerhorn wrote:
> On Thu, 2008-06-12 at 19:13 -0700, Andrew Morton wrote:
>> On Fri, 13 Jun 2008 10:44:44 +0900 KAMEZAWA Hiroyuki <[email protected]> wrote:
>>
>>> This is reproducer of panic. "quick fix" is attached.
>> Thanks - I put that in
>> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.26-rc5/2.6.26-rc5-mm3/hot-fixes/
>>
>>> But I think putback_lru_page() should be re-designed.
>> Yes, it sounds that way.
>
> Here's a proposed replacement patch that reworks putback_lru_page()
> slightly and cleans up the call sites. I still want to balance the
> get_page() in isolate_lru_page() with a put_page() in putback_lru_page()
> for the primary users--vmscan and page migration. So, I need to drop
> the lock before the put_page() when handed a page with null mapping and
> a single reference count as the page will be freed on put_page() and a
> locked page would bug out in free_pages_check()/bad_page().
>
> Lee
>
> PATCH fix page unlocking protocol for putback_lru_page()
>
> Against: 2.6.26-rc5-mm3
>
> Replaces Kame-san's hotfix:
> fix-double-unlock_page-in-2626-rc5-mm3-kernel-bug-at-mm-filemapc-575.patch
>
> Applies at end of vmscan/unevictable/mlock series to avoid patch conflicts.
>
> 1) modified putback_lru_page() to drop page lock only if both page_mapping()
> NULL and page_count() == 1 [rather than VM_BUG_ON(page_count(page) != 1].
> I want to balance the put_page() from isolate_lru_page() here for vmscan
> and, e.g., page migration rather than requiring explicit checks of the
> page_mapping() and explicit put_page() in these areas. However, the page
> could be truncated while one of these subsystems holds it isolated from
> the LRU. So, need to handle this case. Callers of putback_lru_page()
> need to be aware of this and only call it with a page with NULL
> page_mapping() when they will no longer reference the page afterwards.
> This is the case for vmscan and page migration.
>
> 2) m[un]lock_vma_page() already will not be called for page with NULL
> mapping. Added VM_BUG_ON() to assert this.
>
> 3) modified clear_page_lock() to skip the isolate/putback shuffle for
> pages with NULL mapping, as they are being truncated/freed. Thus,
> any future callers of clear_page_lock() need not be concerned about
> the putback_lru_page() semantics for truncated pages.
>
Hi Lee,

Thanks, After applying the patch, the kernel does not panic's while
bootup.

Tested-by: Kamalesh Babulal <[email protected]>

> Signed-off-by: Lee Schermerhorn <[email protected]>
>
> mm/mlock.c | 29 +++++++++++++++++++----------
> mm/vmscan.c | 12 +++++++-----
> 2 files changed, 26 insertions(+), 15 deletions(-)
>
> Index: linux-2.6.26-rc5-mm3/mm/mlock.c
> ===================================================================
> --- linux-2.6.26-rc5-mm3.orig/mm/mlock.c 2008-06-12 11:42:59.000000000 -0400
> +++ linux-2.6.26-rc5-mm3/mm/mlock.c 2008-06-13 09:47:14.000000000 -0400
> @@ -59,27 +59,33 @@ void __clear_page_mlock(struct page *pag
>
> dec_zone_page_state(page, NR_MLOCK);
> count_vm_event(NORECL_PGCLEARED);
> - if (!isolate_lru_page(page)) {
> - putback_lru_page(page);
> - } else {
> - /*
> - * Page not on the LRU yet. Flush all pagevecs and retry.
> - */
> - lru_add_drain_all();
> - if (!isolate_lru_page(page))
> + if (page->mapping) { /* truncated ? */
> + if (!isolate_lru_page(page)) {
> putback_lru_page(page);
> - else if (PageUnevictable(page))
> - count_vm_event(NORECL_PGSTRANDED);
> + } else {
> + /*
> + * Page not on the LRU yet.
> + * Flush all pagevecs and retry.
> + */
> + lru_add_drain_all();
> + if (!isolate_lru_page(page))
> + putback_lru_page(page);
> + else if (PageUnevictable(page))
> + count_vm_event(NORECL_PGSTRANDED);
> + }
> }
> }
>
> /*
> * Mark page as mlocked if not already.
> * If page on LRU, isolate and putback to move to unevictable list.
> + *
> + * Called with page locked and page_mapping() != NULL.
> */
> void mlock_vma_page(struct page *page)
> {
> BUG_ON(!PageLocked(page));
> + VM_BUG_ON(!page_mapping(page));
>
> if (!TestSetPageMlocked(page)) {
> inc_zone_page_state(page, NR_MLOCK);
> @@ -92,6 +98,8 @@ void mlock_vma_page(struct page *page)
> /*
> * called from munlock()/munmap() path with page supposedly on the LRU.
> *
> + * Called with page locked and page_mapping() != NULL.
> + *
> * Note: unlike mlock_vma_page(), we can't just clear the PageMlocked
> * [in try_to_munlock()] and then attempt to isolate the page. We must
> * isolate the page to keep others from messing with its unevictable
> @@ -110,6 +118,7 @@ void mlock_vma_page(struct page *page)
> static void munlock_vma_page(struct page *page)
> {
> BUG_ON(!PageLocked(page));
> + VM_BUG_ON(!page_mapping(page));
>
> if (TestClearPageMlocked(page)) {
> dec_zone_page_state(page, NR_MLOCK);
> Index: linux-2.6.26-rc5-mm3/mm/vmscan.c
> ===================================================================
> --- linux-2.6.26-rc5-mm3.orig/mm/vmscan.c 2008-06-12 11:39:09.000000000 -0400
> +++ linux-2.6.26-rc5-mm3/mm/vmscan.c 2008-06-13 09:44:44.000000000 -0400
> @@ -1,4 +1,4 @@
> -/*
> + /*
> * linux/mm/vmscan.c
> *
> * Copyright (C) 1991, 1992, 1993, 1994 Linus Torvalds
> @@ -488,6 +488,9 @@ int remove_mapping(struct address_space
> * lru_lock must not be held, interrupts must be enabled.
> * Must be called with page locked.
> *
> + * If page truncated [page_mapping() == NULL] and we hold the last reference,
> + * the page will be freed here. For vmscan and page migration.
> + *
> * return 1 if page still locked [not truncated], else 0
> */
> int putback_lru_page(struct page *page)
> @@ -502,12 +505,11 @@ int putback_lru_page(struct page *page)
> lru = !!TestClearPageActive(page);
> was_unevictable = TestClearPageUnevictable(page); /* for page_evictable() */
>
> - if (unlikely(!page->mapping)) {
> + if (unlikely(!page->mapping && page_count(page) == 1)) {
> /*
> - * page truncated. drop lock as put_page() will
> - * free the page.
> + * page truncated and we hold last reference.
> + * drop lock as put_page() will free the page.
> */
> - VM_BUG_ON(page_count(page) != 1);
> unlock_page(page);
> ret = 0;
> } else if (page_evictable(page, NULL)) {
>
>

--
Thanks & Regards,
Kamalesh Babulal,
Linux Technology Center,
IBM, ISTL.

2008-06-16 14:48:46

[permalink] [raw]

Subject: Re: [PATCH] fix double unlock_page() in 2.6.26-rc5-mm3 kernel BUG at mm/filemap.c:575!

On Fri, 2008-06-13 at 11:30 -0400, Lee Schermerhorn wrote:
> On Thu, 2008-06-12 at 19:13 -0700, Andrew Morton wrote:
> > On Fri, 13 Jun 2008 10:44:44 +0900 KAMEZAWA Hiroyuki <[email protected]> wrote:
> >
> > > This is reproducer of panic. "quick fix" is attached.
> >
> > Thanks - I put that in
> > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.26-rc5/2.6.26-rc5-mm3/hot-fixes/
> >
> > > But I think putback_lru_page() should be re-designed.
> >
> > Yes, it sounds that way.
>
> Here's a proposed replacement patch that reworks putback_lru_page()
> slightly and cleans up the call sites. I still want to balance the
> get_page() in isolate_lru_page() with a put_page() in putback_lru_page()
> for the primary users--vmscan and page migration. So, I need to drop
> the lock before the put_page() when handed a page with null mapping and
> a single reference count as the page will be freed on put_page() and a
> locked page would bug out in free_pages_check()/bad_page().
>

Below is a fix to the "proposed replacement patch" posted on Friday.
Incorrect test for page->mapping().

Lee

Against: 2.6.26-rc5-mm3

Incremental fix to my proposed patch to "fix double unlock_page() in
2.6.26-rc5-mm3 kernel BUG at mm/filemap.c:575".

"page_mapping(page)" should be "page->mapping" in VM_BUG_ON()s
introduced to m[un]lock_vma_page().

Signed-off-by: Lee Schermerhorn <[email protected]>

mm/mlock.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

Index: linux-2.6.26-rc5-mm3/mm/mlock.c
===================================================================
--- linux-2.6.26-rc5-mm3.orig/mm/mlock.c 2008-06-16 09:47:28.000000000 -0400
+++ linux-2.6.26-rc5-mm3/mm/mlock.c 2008-06-16 09:48:27.000000000 -0400
@@ -80,12 +80,12 @@ void __clear_page_mlock(struct page *pag
* Mark page as mlocked if not already.
* If page on LRU, isolate and putback to move to unevictable list.
*
- * Called with page locked and page_mapping() != NULL.
+ * Called with page locked and page->mapping != NULL.
*/
void mlock_vma_page(struct page *page)
{
BUG_ON(!PageLocked(page));
- VM_BUG_ON(!page_mapping(page));
+ VM_BUG_ON(!page->mapping);

if (!TestSetPageMlocked(page)) {
inc_zone_page_state(page, NR_MLOCK);
@@ -98,7 +98,7 @@ void mlock_vma_page(struct page *page)
/*
* called from munlock()/munmap() path with page supposedly on the LRU.
*
- * Called with page locked and page_mapping() != NULL.
+ * Called with page locked and page->mapping != NULL.
*
* Note: unlike mlock_vma_page(), we can't just clear the PageMlocked
* [in try_to_munlock()] and then attempt to isolate the page. We must
@@ -118,7 +118,7 @@ void mlock_vma_page(struct page *page)
static void munlock_vma_page(struct page *page)
{
BUG_ON(!PageLocked(page));
- VM_BUG_ON(!page_mapping(page));
+ VM_BUG_ON(!page->mapping);

if (TestClearPageMlocked(page)) {
dec_zone_page_state(page, NR_MLOCK);

2008-06-17 02:38:20

[permalink] [raw]

Subject: Re: [PATCH] fix double unlock_page() in 2.6.26-rc5-mm3 kernel BUG at mm/filemap.c:575!

On Fri, 13 Jun 2008 11:30:46 -0400
Lee Schermerhorn <[email protected]> wrote:

> 1) modified putback_lru_page() to drop page lock only if both page_mapping()
> NULL and page_count() == 1 [rather than VM_BUG_ON(page_count(page) != 1].

I'm sorry that I cannot catch the whole changes..

I cannot convice that this implicit behavior won't cause lock-up in future, again.
Even if there are enough comments...

Why the page should be locked when it is put back to LRU ?
I think this restriction is added by RvR patch set, right ?
I'm sorry that I cannot catch the whole changes..

Anyway, IMHO, lock <-> unlock should be visible as a pair as much as possible.

Thanks,
-Kame

> I want to balance the put_page() from isolate_lru_page() here for vmscan
> and, e.g., page migration rather than requiring explicit checks of the
> page_mapping() and explicit put_page() in these areas. However, the page
> could be truncated while one of these subsystems holds it isolated from
> the LRU. So, need to handle this case. Callers of putback_lru_page()
> need to be aware of this and only call it with a page with NULL
> page_mapping() when they will no longer reference the page afterwards.
> This is the case for vmscan and page migration.
>
> 2) m[un]lock_vma_page() already will not be called for page with NULL
> mapping. Added VM_BUG_ON() to assert this.
>
> 3) modified clear_page_lock() to skip the isolate/putback shuffle for
> pages with NULL mapping, as they are being truncated/freed. Thus,
> any future callers of clear_page_lock() need not be concerned about
> the putback_lru_page() semantics for truncated pages.
>
> Signed-off-by: Lee Schermerhorn <[email protected]>
>
> mm/mlock.c | 29 +++++++++++++++++++----------
> mm/vmscan.c | 12 +++++++-----
> 2 files changed, 26 insertions(+), 15 deletions(-)
>
> Index: linux-2.6.26-rc5-mm3/mm/mlock.c
> ===================================================================
> --- linux-2.6.26-rc5-mm3.orig/mm/mlock.c 2008-06-12 11:42:59.000000000 -0400
> +++ linux-2.6.26-rc5-mm3/mm/mlock.c 2008-06-13 09:47:14.000000000 -0400
> @@ -59,27 +59,33 @@ void __clear_page_mlock(struct page *pag
>
> dec_zone_page_state(page, NR_MLOCK);
> count_vm_event(NORECL_PGCLEARED);
> - if (!isolate_lru_page(page)) {
> - putback_lru_page(page);
> - } else {
> - /*
> - * Page not on the LRU yet. Flush all pagevecs and retry.
> - */
> - lru_add_drain_all();
> - if (!isolate_lru_page(page))
> + if (page->mapping) { /* truncated ? */
> + if (!isolate_lru_page(page)) {
> putback_lru_page(page);
> - else if (PageUnevictable(page))
> - count_vm_event(NORECL_PGSTRANDED);
> + } else {
> + /*
> + * Page not on the LRU yet.
> + * Flush all pagevecs and retry.
> + */
> + lru_add_drain_all();
> + if (!isolate_lru_page(page))
> + putback_lru_page(page);
> + else if (PageUnevictable(page))
> + count_vm_event(NORECL_PGSTRANDED);
> + }
> }
> }
>
> /*
> * Mark page as mlocked if not already.
> * If page on LRU, isolate and putback to move to unevictable list.
> + *
> + * Called with page locked and page_mapping() != NULL.
> */
> void mlock_vma_page(struct page *page)
> {
> BUG_ON(!PageLocked(page));
> + VM_BUG_ON(!page_mapping(page));
>
> if (!TestSetPageMlocked(page)) {
> inc_zone_page_state(page, NR_MLOCK);
> @@ -92,6 +98,8 @@ void mlock_vma_page(struct page *page)
> /*
> * called from munlock()/munmap() path with page supposedly on the LRU.
> *
> + * Called with page locked and page_mapping() != NULL.
> + *
> * Note: unlike mlock_vma_page(), we can't just clear the PageMlocked
> * [in try_to_munlock()] and then attempt to isolate the page. We must
> * isolate the page to keep others from messing with its unevictable
> @@ -110,6 +118,7 @@ void mlock_vma_page(struct page *page)
> static void munlock_vma_page(struct page *page)
> {
> BUG_ON(!PageLocked(page));
> + VM_BUG_ON(!page_mapping(page));
>
> if (TestClearPageMlocked(page)) {
> dec_zone_page_state(page, NR_MLOCK);
> Index: linux-2.6.26-rc5-mm3/mm/vmscan.c
> ===================================================================
> --- linux-2.6.26-rc5-mm3.orig/mm/vmscan.c 2008-06-12 11:39:09.000000000 -0400
> +++ linux-2.6.26-rc5-mm3/mm/vmscan.c 2008-06-13 09:44:44.000000000 -0400
> @@ -1,4 +1,4 @@
> -/*
> + /*
> * linux/mm/vmscan.c
> *
> * Copyright (C) 1991, 1992, 1993, 1994 Linus Torvalds
> @@ -488,6 +488,9 @@ int remove_mapping(struct address_space
> * lru_lock must not be held, interrupts must be enabled.
> * Must be called with page locked.
> *
> + * If page truncated [page_mapping() == NULL] and we hold the last reference,
> + * the page will be freed here. For vmscan and page migration.
> + *
> * return 1 if page still locked [not truncated], else 0
> */
> int putback_lru_page(struct page *page)
> @@ -502,12 +505,11 @@ int putback_lru_page(struct page *page)
> lru = !!TestClearPageActive(page);
> was_unevictable = TestClearPageUnevictable(page); /* for page_evictable() */
>
> - if (unlikely(!page->mapping)) {
> + if (unlikely(!page->mapping && page_count(page) == 1)) {
> /*
> - * page truncated. drop lock as put_page() will
> - * free the page.
> + * page truncated and we hold last reference.
> + * drop lock as put_page() will free the page.
> */
> - VM_BUG_ON(page_count(page) != 1);
> unlock_page(page);
> ret = 0;
> } else if (page_evictable(page, NULL)) {
>
>
>

2008-06-17 07:36:20

[permalink] [raw]

Subject: [PATCH][RFC] fix kernel BUG at mm/migrate.c:719! in 2.6.26-rc5-mm3

Hi.

I got this bug while migrating pages only a few times
via memory_migrate of cpuset.

Unfortunately, even if this patch is applied,
I got bad_page problem after hundreds times of page migration
(I'll report it in another mail).
But I believe something like this patch is needed anyway.

------------[ cut here ]------------
kernel BUG at mm/migrate.c:719!
invalid opcode: 0000 [1] SMP
last sysfs file: /sys/devices/system/cpu/cpu3/cache/index1/shared_cpu_map
CPU 0
Modules linked in: ipv6 autofs4 hidp rfcomm l2cap bluetooth sunrpc dm_mirror dm_log dm_multipath dm_mod sbs sbshc button battery acpi_memhotplug ac parport_pc lp parport floppy serio_raw rtc_cmos rtc_core rtc_lib 8139too pcspkr 8139cp mii ata_piix libata sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd [last unloaded: microcode]
Pid: 3096, comm: switch.sh Not tainted 2.6.26-rc5-mm3 #1
RIP: 0010:[<ffffffff8029bb85>] [<ffffffff8029bb85>] migrate_pages+0x33e/0x49f
RSP: 0018:ffff81002f463bb8 EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffffe20000c17500 RCX: 0000000000000034
RDX: ffffe20000c17500 RSI: ffffe200010003c0 RDI: ffffe20000c17528
RBP: ffffe200010003c0 R08: 8000000000000000 R09: 304605894800282f
R10: 282f87058b480028 R11: 0028304005894800 R12: ffff81003f90a5d8
R13: 0000000000000000 R14: ffffe20000bf4cc0 R15: ffff81002f463c88
FS: 00007ff9386576f0(0000) GS:ffffffff8061d800(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007ff938669000 CR3: 000000002f458000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process switch.sh (pid: 3096, threadinfo ffff81002f462000, task ffff81003e99cf10)
Stack: 0000000000000001 ffffffff80290777 0000000000000000 0000000000000000
ffff81002f463c88 ffff81000000ea18 ffff81002f463c88 000000000000000c
ffff81002f463ca8 00007ffffffff000 00007fff649f6000 0000000000000004
Call Trace:
[<ffffffff80290777>] ? new_node_page+0x0/0x2f
[<ffffffff80291611>] ? do_migrate_pages+0x19b/0x1e7
[<ffffffff802315c7>] ? set_cpus_allowed_ptr+0xe6/0xf3
[<ffffffff8025c827>] ? cpuset_migrate_mm+0x58/0x8f
[<ffffffff8025d0fd>] ? cpuset_attach+0x8b/0x9e
[<ffffffff8025a3e1>] ? cgroup_attach_task+0x3a3/0x3f5
[<ffffffff80276cb5>] ? __alloc_pages_internal+0xe2/0x3d1
[<ffffffff8025af06>] ? cgroup_common_file_write+0x150/0x1dd
[<ffffffff8025aaf4>] ? cgroup_file_write+0x54/0x150
[<ffffffff8029f839>] ? vfs_write+0xad/0x136
[<ffffffff8029fd76>] ? sys_write+0x45/0x6e
[<ffffffff8020bef2>] ? tracesys+0xd5/0xda

Code: 4c 48 8d 7b 28 e8 cc 87 09 00 48 83 7b 18 00 75 30 48 8b 03 48 89 da 25 00 40 00 00 48 85 c0 74 04 48 8b 53 10 83 7a 08 01 74 04 <0f> 0b eb fe 48 89 df e8 5e 50 fd ff 48 89 df e8 7d d6 fd ff eb
RIP [<ffffffff8029bb85>] migrate_pages+0x33e/0x49f
RSP <ffff81002f463bb8>
Clocksource tsc unstable (delta = 438246251 ns)
---[ end trace ce4e6053f7b9bba1 ]---

This bug is caused by VM_BUG_ON() in unmap_and_move().

unmap_and_move()
710 if (rc != -EAGAIN) {
711 /*
712 * A page that has been migrated has all references
713 * removed and will be freed. A page that has not been
714 * migrated will have kepts its references and be
715 * restored.
716 */
717 list_del(&page->lru);
718 if (!page->mapping) {
719 VM_BUG_ON(page_count(page) != 1);
720 unlock_page(page);
721 put_page(page); /* just free the old page */
722 goto end_migration;
723 } else
724 unlock = putback_lru_page(page);
725 }

I think the page count is not necessarily 1 here, because
migration_entry_wait increases page count and waits for the
page to be unlocked.
So, if the old page is accessed between migrate_page_move_mapping,
which checks the page count, and remove_migration_ptes, page count
would not be 1 here.

Actually, just commenting out get/put_page from migration_entry_wait
works well in my environment(succeeded in hundreds times of page migration),
but modifying migration_entry_wait this way is not good, I think.

This patch depends on Lee Schermerhorn's fix for double unlock_page.

This patch also fixes a race between migrate_entry_wait and
page_freeze_refs in migrate_page_move_mapping.

Signed-off-by: Daisuke Nishimura <[email protected]>

---
diff -uprN linux-2.6.26-rc5-mm3/mm/migrate.c linux-2.6.26-rc5-mm3-test/mm/migrate.c
--- linux-2.6.26-rc5-mm3/mm/migrate.c 2008-06-17 15:31:23.000000000 +0900
+++ linux-2.6.26-rc5-mm3-test/mm/migrate.c 2008-06-17 13:59:15.000000000 +0900
@@ -232,6 +232,7 @@ void migration_entry_wait(struct mm_stru
swp_entry_t entry;
struct page *page;

+retry:
ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
pte = *ptep;
if (!is_swap_pte(pte))
@@ -243,11 +244,20 @@ void migration_entry_wait(struct mm_stru

page = migration_entry_to_page(entry);

- get_page(page);
- pte_unmap_unlock(ptep, ptl);
- wait_on_page_locked(page);
- put_page(page);
- return;
+ /*
+ * page count might be set to zero by page_freeze_refs()
+ * in migrate_page_move_mapping().
+ */
+ if (get_page_unless_zero(page)) {
+ pte_unmap_unlock(ptep, ptl);
+ wait_on_page_locked(page);
+ put_page(page);
+ return;
+ } else {
+ pte_unmap_unlock(ptep, ptl);
+ goto retry;
+ }
+
out:
pte_unmap_unlock(ptep, ptl);
}
@@ -715,13 +725,7 @@ unlock:
* restored.
*/
list_del(&page->lru);
- if (!page->mapping) {
- VM_BUG_ON(page_count(page) != 1);
- unlock_page(page);
- put_page(page); /* just free the old page */
- goto end_migration;
- } else
- unlock = putback_lru_page(page);
+ unlock = putback_lru_page(page);
}

if (unlock)

2008-06-17 07:48:32

[permalink] [raw]

Subject: [Bad page] trying to free locked page? (Re: [PATCH][RFC] fix kernel BUG at mm/migrate.c:719! in 2.6.26-rc5-mm3)

On Tue, 17 Jun 2008 16:35:01 +0900, Daisuke Nishimura <[email protected]> wrote:
> Hi.
>
> I got this bug while migrating pages only a few times
> via memory_migrate of cpuset.
>
> Unfortunately, even if this patch is applied,
> I got bad_page problem after hundreds times of page migration
> (I'll report it in another mail).
> But I believe something like this patch is needed anyway.
>

I got bad_page after hundreds times of page migration.
It seems that a locked page is being freed.

Bad page state in process 'switch.sh'
page:ffffe20001ee8f40 flags:0x0500000000080019 mapping:0000000000000000 mapcount:0 count:0
Trying to fix it up, but a reboot is needed
Backtrace:
Pid: 23283, comm: switch.sh Not tainted 2.6.26-rc5-mm3-test6-lee #1

Call Trace:
[<ffffffff802747b0>] bad_page+0x97/0x131
[<ffffffff80275ae6>] free_hot_cold_page+0xd4/0x19c
[<ffffffff8027a5c3>] putback_lru_page+0xf4/0xfb
[<ffffffff8029b210>] putback_lru_pages+0x46/0x74
[<ffffffff8029bc5b>] migrate_pages+0x3f4/0x468
[<ffffffff80290797>] new_node_page+0x0/0x2f
[<ffffffff80291631>] do_migrate_pages+0x19b/0x1e7
[<ffffffff8025c827>] cpuset_migrate_mm+0x58/0x8f
[<ffffffff8025d0fd>] cpuset_attach+0x8b/0x9e
[<ffffffff8032ffdc>] sscanf+0x49/0x51
[<ffffffff8025a3e1>] cgroup_attach_task+0x3a3/0x3f5
[<ffffffff80489a90>] __mutex_lock_slowpath+0x64/0x93
[<ffffffff8025af06>] cgroup_common_file_write+0x150/0x1dd
[<ffffffff8025aaf4>] cgroup_file_write+0x54/0x150
[<ffffffff8029f855>] vfs_write+0xad/0x136
[<ffffffff8029fd92>] sys_write+0x45/0x6e
[<ffffffff8020bef2>] tracesys+0xd5/0xda

Hexdump:
000: 28 00 08 00 00 00 00 05 02 00 00 00 01 00 00 00
010: 00 00 00 00 00 00 00 00 41 3b 41 2f 00 81 ff ff
020: 46 01 00 00 00 00 00 00 e8 17 e6 01 00 e2 ff ff
030: e8 4b e6 01 00 e2 ff ff 00 00 00 00 00 00 00 00
040: 19 00 08 00 00 00 00 05 00 00 00 00 ff ff ff ff
050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
060: ba 06 00 00 00 00 00 00 00 01 10 00 00 c1 ff ff
070: 00 02 20 00 00 c1 ff ff 00 00 00 00 00 00 00 00
080: 28 00 08 00 00 00 00 05 01 00 00 00 00 00 00 00
090: 00 00 00 00 00 00 00 00 01 3d 41 2f 00 81 ff ff
0a0: bb c3 55 f7 07 00 00 00 68 c4 f0 01 00 e2 ff ff
0b0: e8 8f ee 01 00 e2 ff ff 00 00 00 00 00 00 00 00
------------[ cut here ]------------
kernel BUG at mm/filemap.c:575!
invalid opcode: 0000 [1] SMP
last sysfs file: /sys/devices/system/cpu/cpu3/cache/index1/shared_cpu_map
CPU 1
Modules linked in: nfs lockd nfs_acl ipv6 autofs4 hidp rfcomm l2cap bluetooth sunrpc dm_mirror dm_log dm_multipath dm_mod sbs sbshc button battery acpi_memhotplug ac parport_pc lp parport floppy serio_raw rtc_cmos 8139too rtc_core rtc_lib 8139cp mii pcspkr ata_piix libata sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd [last unloaded: microcode]
Pid: 23283, comm: switch.sh Tainted: G B 2.6.26-rc5-mm3-test6-lee #1
RIP: 0010:[<ffffffff80270bfe>] [<ffffffff80270bfe>] unlock_page+0xf/0x26
RSP: 0018:ffff8100396e7b78 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffe20001ee8f40 RCX: 000000000000005a
RDX: 0000000000000006 RSI: 0000000000000003 RDI: ffffe20001ee8f40
RBP: ffffe20001f3e9c0 R08: 0000000000000008 R09: ffff810001101780
R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000004
R13: ffff8100396e7c88 R14: ffffe20001e8d080 R15: ffff8100396e7c88
FS: 00007fd4597fb6f0(0000) GS:ffff81007f98d280(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000418498 CR3: 000000003e9ac000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process switch.sh (pid: 23283, threadinfo ffff8100396e6000, task ffff8100318a64a0)
Stack: ffffe20001ee8f40 ffffffff8029b21c ffffe20001e98e40 ffff8100396e7c60
ffffe20000665140 ffff8100314fd581 0000000000000000 ffffffff8029bc5b
0000000000000000 ffffffff80290797 0000000000000000 0000000000000001
Call Trace:
[<ffffffff8029b21c>] ? putback_lru_pages+0x52/0x74
[<ffffffff8029bc5b>] ? migrate_pages+0x3f4/0x468
[<ffffffff80290797>] ? new_node_page+0x0/0x2f
[<ffffffff80291631>] ? do_migrate_pages+0x19b/0x1e7
[<ffffffff8025c827>] ? cpuset_migrate_mm+0x58/0x8f
[<ffffffff8025d0fd>] ? cpuset_attach+0x8b/0x9e
[<ffffffff8032ffdc>] ? sscanf+0x49/0x51
[<ffffffff8025a3e1>] ? cgroup_attach_task+0x3a3/0x3f5
[<ffffffff80489a90>] ? __mutex_lock_slowpath+0x64/0x93
[<ffffffff8025af06>] ? cgroup_common_file_write+0x150/0x1dd
[<ffffffff8025aaf4>] ? cgroup_file_write+0x54/0x150
[<ffffffff8029f855>] ? vfs_write+0xad/0x136
[<ffffffff8029fd92>] ? sys_write+0x45/0x6e
[<ffffffff8020bef2>] ? tracesys+0xd5/0xda

Code: 40 58 48 85 c0 74 0b 48 8b 40 10 48 85 c0 74 02 ff d0 e8 7b 89 21 00 41 5b 31 c0 c3 53 48 89 fb f0 0f ba 37 00 19 c0 85 c0 75 04 <0f> 0b eb fe e8 01 f5 ff ff 48 89 de 48 89 c7 31 d2 5b e9 ea 5e
RIP [<ffffffff80270bfe>] unlock_page+0xf/0x26
RSP <ffff8100396e7b78>
---[ end trace 4ab171fcf075cf2e ]---

2008-06-17 08:58:35

[permalink] [raw]

Subject: Re: [Bad page] trying to free locked page? (Re: [PATCH][RFC] fix kernel BUG at mm/migrate.c:719! in 2.6.26-rc5-mm3)

On Tue, 17 Jun 2008 16:47:09 +0900
Daisuke Nishimura <[email protected]> wrote:

> On Tue, 17 Jun 2008 16:35:01 +0900, Daisuke Nishimura <[email protected]> wrote:
> > Hi.
> >
> > I got this bug while migrating pages only a few times
> > via memory_migrate of cpuset.
> >
> > Unfortunately, even if this patch is applied,
> > I got bad_page problem after hundreds times of page migration
> > (I'll report it in another mail).
> > But I believe something like this patch is needed anyway.
> >
>
> I got bad_page after hundreds times of page migration.
> It seems that a locked page is being freed.
>
Good catch, and I think your investigation in the last e-mail was correct.
I'd like to dig this...but it seems some kind of big fix is necessary.
Did this happen under page-migraion by cpuset-task-move test ?

Thanks,
-Kame

>
> Bad page state in process 'switch.sh'
> page:ffffe20001ee8f40 flags:0x0500000000080019 mapping:0000000000000000 mapcount:0 count:0
> Trying to fix it up, but a reboot is needed
> Backtrace:
> Pid: 23283, comm: switch.sh Not tainted 2.6.26-rc5-mm3-test6-lee #1
>
> Call Trace:
> [<ffffffff802747b0>] bad_page+0x97/0x131
> [<ffffffff80275ae6>] free_hot_cold_page+0xd4/0x19c
> [<ffffffff8027a5c3>] putback_lru_page+0xf4/0xfb
> [<ffffffff8029b210>] putback_lru_pages+0x46/0x74
> [<ffffffff8029bc5b>] migrate_pages+0x3f4/0x468
> [<ffffffff80290797>] new_node_page+0x0/0x2f
> [<ffffffff80291631>] do_migrate_pages+0x19b/0x1e7
> [<ffffffff8025c827>] cpuset_migrate_mm+0x58/0x8f
> [<ffffffff8025d0fd>] cpuset_attach+0x8b/0x9e
> [<ffffffff8032ffdc>] sscanf+0x49/0x51
> [<ffffffff8025a3e1>] cgroup_attach_task+0x3a3/0x3f5
> [<ffffffff80489a90>] __mutex_lock_slowpath+0x64/0x93
> [<ffffffff8025af06>] cgroup_common_file_write+0x150/0x1dd
> [<ffffffff8025aaf4>] cgroup_file_write+0x54/0x150
> [<ffffffff8029f855>] vfs_write+0xad/0x136
> [<ffffffff8029fd92>] sys_write+0x45/0x6e
> [<ffffffff8020bef2>] tracesys+0xd5/0xda
>
> Hexdump:
> 000: 28 00 08 00 00 00 00 05 02 00 00 00 01 00 00 00
> 010: 00 00 00 00 00 00 00 00 41 3b 41 2f 00 81 ff ff
> 020: 46 01 00 00 00 00 00 00 e8 17 e6 01 00 e2 ff ff
> 030: e8 4b e6 01 00 e2 ff ff 00 00 00 00 00 00 00 00
> 040: 19 00 08 00 00 00 00 05 00 00 00 00 ff ff ff ff
> 050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 060: ba 06 00 00 00 00 00 00 00 01 10 00 00 c1 ff ff
> 070: 00 02 20 00 00 c1 ff ff 00 00 00 00 00 00 00 00
> 080: 28 00 08 00 00 00 00 05 01 00 00 00 00 00 00 00
> 090: 00 00 00 00 00 00 00 00 01 3d 41 2f 00 81 ff ff
> 0a0: bb c3 55 f7 07 00 00 00 68 c4 f0 01 00 e2 ff ff
> 0b0: e8 8f ee 01 00 e2 ff ff 00 00 00 00 00 00 00 00
> ------------[ cut here ]------------
> kernel BUG at mm/filemap.c:575!
> invalid opcode: 0000 [1] SMP
> last sysfs file: /sys/devices/system/cpu/cpu3/cache/index1/shared_cpu_map
> CPU 1
> Modules linked in: nfs lockd nfs_acl ipv6 autofs4 hidp rfcomm l2cap bluetooth sunrpc dm_mirror dm_log dm_multipath dm_mod sbs sbshc button battery acpi_memhotplug ac parport_pc lp parport floppy serio_raw rtc_cmos 8139too rtc_core rtc_lib 8139cp mii pcspkr ata_piix libata sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd [last unloaded: microcode]
> Pid: 23283, comm: switch.sh Tainted: G B 2.6.26-rc5-mm3-test6-lee #1
> RIP: 0010:[<ffffffff80270bfe>] [<ffffffff80270bfe>] unlock_page+0xf/0x26
> RSP: 0018:ffff8100396e7b78 EFLAGS: 00010246
> RAX: 0000000000000000 RBX: ffffe20001ee8f40 RCX: 000000000000005a
> RDX: 0000000000000006 RSI: 0000000000000003 RDI: ffffe20001ee8f40
> RBP: ffffe20001f3e9c0 R08: 0000000000000008 R09: ffff810001101780
> R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000004
> R13: ffff8100396e7c88 R14: ffffe20001e8d080 R15: ffff8100396e7c88
> FS: 00007fd4597fb6f0(0000) GS:ffff81007f98d280(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 0000000000418498 CR3: 000000003e9ac000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process switch.sh (pid: 23283, threadinfo ffff8100396e6000, task ffff8100318a64a0)
> Stack: ffffe20001ee8f40 ffffffff8029b21c ffffe20001e98e40 ffff8100396e7c60
> ffffe20000665140 ffff8100314fd581 0000000000000000 ffffffff8029bc5b
> 0000000000000000 ffffffff80290797 0000000000000000 0000000000000001
> Call Trace:
> [<ffffffff8029b21c>] ? putback_lru_pages+0x52/0x74
> [<ffffffff8029bc5b>] ? migrate_pages+0x3f4/0x468
> [<ffffffff80290797>] ? new_node_page+0x0/0x2f
> [<ffffffff80291631>] ? do_migrate_pages+0x19b/0x1e7
> [<ffffffff8025c827>] ? cpuset_migrate_mm+0x58/0x8f
> [<ffffffff8025d0fd>] ? cpuset_attach+0x8b/0x9e
> [<ffffffff8032ffdc>] ? sscanf+0x49/0x51
> [<ffffffff8025a3e1>] ? cgroup_attach_task+0x3a3/0x3f5
> [<ffffffff80489a90>] ? __mutex_lock_slowpath+0x64/0x93
> [<ffffffff8025af06>] ? cgroup_common_file_write+0x150/0x1dd
> [<ffffffff8025aaf4>] ? cgroup_file_write+0x54/0x150
> [<ffffffff8029f855>] ? vfs_write+0xad/0x136
> [<ffffffff8029fd92>] ? sys_write+0x45/0x6e
> [<ffffffff8020bef2>] ? tracesys+0xd5/0xda
>
>
> Code: 40 58 48 85 c0 74 0b 48 8b 40 10 48 85 c0 74 02 ff d0 e8 7b 89 21 00 41 5b 31 c0 c3 53 48 89 fb f0 0f ba 37 00 19 c0 85 c0 75 04 <0f> 0b eb fe e8 01 f5 ff ff 48 89 de 48 89 c7 31 d2 5b e9 ea 5e
> RIP [<ffffffff80270bfe>] unlock_page+0xf/0x26
> RSP <ffff8100396e7b78>
> ---[ end trace 4ab171fcf075cf2e ]---
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
>

2008-06-17 09:15:16

[permalink] [raw]

Subject: Re: [Bad page] trying to free locked page? (Re: [PATCH][RFC] fix kernel BUG at mm/migrate.c:719! in 2.6.26-rc5-mm3)

> > > I got this bug while migrating pages only a few times
> > > via memory_migrate of cpuset.
> > >
> > > Unfortunately, even if this patch is applied,
> > > I got bad_page problem after hundreds times of page migration
> > > (I'll report it in another mail).
> > > But I believe something like this patch is needed anyway.
> > >
> >
> > I got bad_page after hundreds times of page migration.
> > It seems that a locked page is being freed.
> >
> Good catch, and I think your investigation in the last e-mail was correct.
> I'd like to dig this...but it seems some kind of big fix is necessary.
> Did this happen under page-migraion by cpuset-task-move test ?

Indeed!

I guess lee's unevictable infrastructure and nick's specurative pagecache
is conflicted.
I'm investigating deeply now.

2008-06-17 09:16:41

[permalink] [raw]

Subject: Re: [Bad page] trying to free locked page? (Re: [PATCH][RFC] fix kernel BUG at mm/migrate.c:719! in 2.6.26-rc5-mm3)

On Tue, 17 Jun 2008 18:03:14 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> On Tue, 17 Jun 2008 16:47:09 +0900
> Daisuke Nishimura <[email protected]> wrote:
>
> > On Tue, 17 Jun 2008 16:35:01 +0900, Daisuke Nishimura <[email protected]> wrote:
> > > Hi.
> > >
> > > I got this bug while migrating pages only a few times
> > > via memory_migrate of cpuset.
> > >
> > > Unfortunately, even if this patch is applied,
> > > I got bad_page problem after hundreds times of page migration
> > > (I'll report it in another mail).
> > > But I believe something like this patch is needed anyway.
> > >
> >
> > I got bad_page after hundreds times of page migration.
> > It seems that a locked page is being freed.
> >
> Good catch, and I think your investigation in the last e-mail was correct.
> I'd like to dig this...but it seems some kind of big fix is necessary.
> Did this happen under page-migraion by cpuset-task-move test ?
>
Yes.

I made 2 cpuset directories, run some processes in each cpusets,
and run a script like below infinitely to move tasks and migrate pages.

---
#!/bin/bash

G1=$1
G2=$2

move_task()
{
for pid in $1
do
echo $pid >$2/tasks 2>/dev/null
done
}

G1_TASK=`cat ${G1}/tasks`
G2_TASK=`cat ${G2}/tasks`

move_task "${G1_TASK}" ${G2} &
move_task "${G2_TASK}" ${G1} &

wait
---

I got this bad_page after running this script for about 600 times.

Thanks,
Daisuke Nishimura.

2008-06-17 15:26:23

[permalink] [raw]

Subject: Re: [PATCH] fix double unlock_page() in 2.6.26-rc5-mm3 kernel BUG at mm/filemap.c:575!

On Tue, 2008-06-17 at 11:32 +0900, KAMEZAWA Hiroyuki wrote:
> On Fri, 13 Jun 2008 11:30:46 -0400
> Lee Schermerhorn <[email protected]> wrote:
>
> > 1) modified putback_lru_page() to drop page lock only if both page_mapping()
> > NULL and page_count() == 1 [rather than VM_BUG_ON(page_count(page) != 1].
>
> I'm sorry that I cannot catch the whole changes..
>
> I cannot convice that this implicit behavior won't cause lock-up in future, again.
> Even if there are enough comments...
>
> Why the page should be locked when it is put back to LRU ?
> I think this restriction is added by RvR patch set, right ?
> I'm sorry that I cannot catch the whole changes..

Kame-san: The restriction to put the page back to the LRU via
putback_lru_page() with the page locked does come from the unevictable
page infrastructure. Both page migration and vmscan can hold the page
isolated from the LRU, but unlocked, for quite some time. During this
time, a page can become nonreclaimable [or unevictable] or a
nonreclaimable page can become reclaimable. It's OK if an unevictable
pages gets on on the regular LRU lists, because we'll detect it and
"cull" it if/when vmscan attempts to reclaim it. However, if a
reclaimable page gets onto the unevictable LRU list, we may never get it
off, except via manual scan. Rik doesn't think we need the manual scan,
so we've been very careful to avoid conditions where we could "leak" a
reclaimable page permantently onto the unevictable list. Kosaki-san
found several scenarios where this could happen unless we check, under
page lock, the unevictable conditions when putting these pages back on
the LRU.

>
> Anyway, IMHO, lock <-> unlock should be visible as a pair as much as possible.

I've considered modifying putback_lru_page() not to unlock/put the page
when mapping == NULL and count == 1. Then all of the callers would have
to remember this state, drop the lock and call put page themselves. I
think this would duplicate code and look ugly, but if we need to do
that, I guess we'll do it.

Regards,
Lee
>
> Thanks,
> -Kame
>
> > I want to balance the put_page() from isolate_lru_page() here for vmscan
> > and, e.g., page migration rather than requiring explicit checks of the
> > page_mapping() and explicit put_page() in these areas. However, the page
> > could be truncated while one of these subsystems holds it isolated from
> > the LRU. So, need to handle this case. Callers of putback_lru_page()
> > need to be aware of this and only call it with a page with NULL
> > page_mapping() when they will no longer reference the page afterwards.
> > This is the case for vmscan and page migration.
> >
> > 2) m[un]lock_vma_page() already will not be called for page with NULL
> > mapping. Added VM_BUG_ON() to assert this.
> >
> > 3) modified clear_page_lock() to skip the isolate/putback shuffle for
> > pages with NULL mapping, as they are being truncated/freed. Thus,
> > any future callers of clear_page_lock() need not be concerned about
> > the putback_lru_page() semantics for truncated pages.
> >
> > Signed-off-by: Lee Schermerhorn <[email protected]>
> >
> > mm/mlock.c | 29 +++++++++++++++++++----------
> > mm/vmscan.c | 12 +++++++-----
> > 2 files changed, 26 insertions(+), 15 deletions(-)
> >
> > Index: linux-2.6.26-rc5-mm3/mm/mlock.c
> > ===================================================================
> > --- linux-2.6.26-rc5-mm3.orig/mm/mlock.c 2008-06-12 11:42:59.000000000 -0400
> > +++ linux-2.6.26-rc5-mm3/mm/mlock.c 2008-06-13 09:47:14.000000000 -0400
> > @@ -59,27 +59,33 @@ void __clear_page_mlock(struct page *pag
> >
> > dec_zone_page_state(page, NR_MLOCK);
> > count_vm_event(NORECL_PGCLEARED);
> > - if (!isolate_lru_page(page)) {
> > - putback_lru_page(page);
> > - } else {
> > - /*
> > - * Page not on the LRU yet. Flush all pagevecs and retry.
> > - */
> > - lru_add_drain_all();
> > - if (!isolate_lru_page(page))
> > + if (page->mapping) { /* truncated ? */
> > + if (!isolate_lru_page(page)) {
> > putback_lru_page(page);
> > - else if (PageUnevictable(page))
> > - count_vm_event(NORECL_PGSTRANDED);
> > + } else {
> > + /*
> > + * Page not on the LRU yet.
> > + * Flush all pagevecs and retry.
> > + */
> > + lru_add_drain_all();
> > + if (!isolate_lru_page(page))
> > + putback_lru_page(page);
> > + else if (PageUnevictable(page))
> > + count_vm_event(NORECL_PGSTRANDED);
> > + }
> > }
> > }
> >
> > /*
> > * Mark page as mlocked if not already.
> > * If page on LRU, isolate and putback to move to unevictable list.
> > + *
> > + * Called with page locked and page_mapping() != NULL.
> > */
> > void mlock_vma_page(struct page *page)
> > {
> > BUG_ON(!PageLocked(page));
> > + VM_BUG_ON(!page_mapping(page));
> >
> > if (!TestSetPageMlocked(page)) {
> > inc_zone_page_state(page, NR_MLOCK);
> > @@ -92,6 +98,8 @@ void mlock_vma_page(struct page *page)
> > /*
> > * called from munlock()/munmap() path with page supposedly on the LRU.
> > *
> > + * Called with page locked and page_mapping() != NULL.
> > + *
> > * Note: unlike mlock_vma_page(), we can't just clear the PageMlocked
> > * [in try_to_munlock()] and then attempt to isolate the page. We must
> > * isolate the page to keep others from messing with its unevictable
> > @@ -110,6 +118,7 @@ void mlock_vma_page(struct page *page)
> > static void munlock_vma_page(struct page *page)
> > {
> > BUG_ON(!PageLocked(page));
> > + VM_BUG_ON(!page_mapping(page));
> >
> > if (TestClearPageMlocked(page)) {
> > dec_zone_page_state(page, NR_MLOCK);
> > Index: linux-2.6.26-rc5-mm3/mm/vmscan.c
> > ===================================================================
> > --- linux-2.6.26-rc5-mm3.orig/mm/vmscan.c 2008-06-12 11:39:09.000000000 -0400
> > +++ linux-2.6.26-rc5-mm3/mm/vmscan.c 2008-06-13 09:44:44.000000000 -0400
> > @@ -1,4 +1,4 @@
> > -/*
> > + /*
> > * linux/mm/vmscan.c
> > *
> > * Copyright (C) 1991, 1992, 1993, 1994 Linus Torvalds
> > @@ -488,6 +488,9 @@ int remove_mapping(struct address_space
> > * lru_lock must not be held, interrupts must be enabled.
> > * Must be called with page locked.
> > *
> > + * If page truncated [page_mapping() == NULL] and we hold the last reference,
> > + * the page will be freed here. For vmscan and page migration.
> > + *
> > * return 1 if page still locked [not truncated], else 0
> > */
> > int putback_lru_page(struct page *page)
> > @@ -502,12 +505,11 @@ int putback_lru_page(struct page *page)
> > lru = !!TestClearPageActive(page);
> > was_unevictable = TestClearPageUnevictable(page); /* for page_evictable() */
> >
> > - if (unlikely(!page->mapping)) {
> > + if (unlikely(!page->mapping && page_count(page) == 1)) {
> > /*
> > - * page truncated. drop lock as put_page() will
> > - * free the page.
> > + * page truncated and we hold last reference.
> > + * drop lock as put_page() will free the page.
> > */
> > - VM_BUG_ON(page_count(page) != 1);
> > unlock_page(page);
> > ret = 0;
> > } else if (page_evictable(page, NULL)) {
> >
> >
> >
>

2008-06-17 15:33:44

[permalink] [raw]

Subject: Re: [PATCH][RFC] fix kernel BUG at mm/migrate.c:719! in 2.6.26-rc5-mm3

> @@ -715,13 +725,7 @@ unlock:
> * restored.
> */
> list_del(&page->lru);
> - if (!page->mapping) {
> - VM_BUG_ON(page_count(page) != 1);
> - unlock_page(page);
> - put_page(page); /* just free the old page */
> - goto end_migration;
> - } else
> - unlock = putback_lru_page(page);
> + unlock = putback_lru_page(page);
> }
>
> if (unlock)

this part is really necessary?
I tryed to remove it, but any problem doesn't happend.

Of cource, another part is definitly necessary for specurative pagecache :)

2008-06-17 15:34:38

[permalink] [raw]

Subject: Re: [Bad page] trying to free locked page? (Re: [PATCH][RFC] fix kernel BUG at mm/migrate.c:719! in 2.6.26-rc5-mm3)

> > I got this bug while migrating pages only a few times
> > via memory_migrate of cpuset.
> >
> > Unfortunately, even if this patch is applied,
> > I got bad_page problem after hundreds times of page migration
> > (I'll report it in another mail).
> > But I believe something like this patch is needed anyway.
> >
>
> I got bad_page after hundreds times of page migration.
> It seems that a locked page is being freed.

I can't reproduce this bad page.
I'll try again tomorrow ;)

2008-06-17 17:46:21

[permalink] [raw]

Subject: Re: [PATCH][RFC] fix kernel BUG at mm/migrate.c:719! in 2.6.26-rc5-mm3

On Tue, 2008-06-17 at 16:35 +0900, Daisuke Nishimura wrote:
> Hi.
>
> I got this bug while migrating pages only a few times
> via memory_migrate of cpuset.

Ah, I did test migration fairly heavily, but not by moving cpusets.

>
> Unfortunately, even if this patch is applied,
> I got bad_page problem after hundreds times of page migration
> (I'll report it in another mail).
> But I believe something like this patch is needed anyway.

Agreed. See comments below.
>
> ------------[ cut here ]------------
> kernel BUG at mm/migrate.c:719!
> invalid opcode: 0000 [1] SMP
> last sysfs file: /sys/devices/system/cpu/cpu3/cache/index1/shared_cpu_map
> CPU 0
> Modules linked in: ipv6 autofs4 hidp rfcomm l2cap bluetooth sunrpc dm_mirror dm_log dm_multipath dm_mod sbs sbshc button battery acpi_memhotplug ac parport_pc lp parport floppy serio_raw rtc_cmos rtc_core rtc_lib 8139too pcspkr 8139cp mii ata_piix libata sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd [last unloaded: microcode]
> Pid: 3096, comm: switch.sh Not tainted 2.6.26-rc5-mm3 #1
> RIP: 0010:[<ffffffff8029bb85>] [<ffffffff8029bb85>] migrate_pages+0x33e/0x49f
> RSP: 0018:ffff81002f463bb8 EFLAGS: 00010202
> RAX: 0000000000000000 RBX: ffffe20000c17500 RCX: 0000000000000034
> RDX: ffffe20000c17500 RSI: ffffe200010003c0 RDI: ffffe20000c17528
> RBP: ffffe200010003c0 R08: 8000000000000000 R09: 304605894800282f
> R10: 282f87058b480028 R11: 0028304005894800 R12: ffff81003f90a5d8
> R13: 0000000000000000 R14: ffffe20000bf4cc0 R15: ffff81002f463c88
> FS: 00007ff9386576f0(0000) GS:ffffffff8061d800(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00007ff938669000 CR3: 000000002f458000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process switch.sh (pid: 3096, threadinfo ffff81002f462000, task ffff81003e99cf10)
> Stack: 0000000000000001 ffffffff80290777 0000000000000000 0000000000000000
> ffff81002f463c88 ffff81000000ea18 ffff81002f463c88 000000000000000c
> ffff81002f463ca8 00007ffffffff000 00007fff649f6000 0000000000000004
> Call Trace:
> [<ffffffff80290777>] ? new_node_page+0x0/0x2f
> [<ffffffff80291611>] ? do_migrate_pages+0x19b/0x1e7
> [<ffffffff802315c7>] ? set_cpus_allowed_ptr+0xe6/0xf3
> [<ffffffff8025c827>] ? cpuset_migrate_mm+0x58/0x8f
> [<ffffffff8025d0fd>] ? cpuset_attach+0x8b/0x9e
> [<ffffffff8025a3e1>] ? cgroup_attach_task+0x3a3/0x3f5
> [<ffffffff80276cb5>] ? __alloc_pages_internal+0xe2/0x3d1
> [<ffffffff8025af06>] ? cgroup_common_file_write+0x150/0x1dd
> [<ffffffff8025aaf4>] ? cgroup_file_write+0x54/0x150
> [<ffffffff8029f839>] ? vfs_write+0xad/0x136
> [<ffffffff8029fd76>] ? sys_write+0x45/0x6e
> [<ffffffff8020bef2>] ? tracesys+0xd5/0xda
>
>
> Code: 4c 48 8d 7b 28 e8 cc 87 09 00 48 83 7b 18 00 75 30 48 8b 03 48 89 da 25 00 40 00 00 48 85 c0 74 04 48 8b 53 10 83 7a 08 01 74 04 <0f> 0b eb fe 48 89 df e8 5e 50 fd ff 48 89 df e8 7d d6 fd ff eb
> RIP [<ffffffff8029bb85>] migrate_pages+0x33e/0x49f
> RSP <ffff81002f463bb8>
> Clocksource tsc unstable (delta = 438246251 ns)
> ---[ end trace ce4e6053f7b9bba1 ]---
>
>
> This bug is caused by VM_BUG_ON() in unmap_and_move().
>
> unmap_and_move()
> 710 if (rc != -EAGAIN) {
> 711 /*
> 712 * A page that has been migrated has all references
> 713 * removed and will be freed. A page that has not been
> 714 * migrated will have kepts its references and be
> 715 * restored.
> 716 */
> 717 list_del(&page->lru);
> 718 if (!page->mapping) {
> 719 VM_BUG_ON(page_count(page) != 1);
> 720 unlock_page(page);
> 721 put_page(page); /* just free the old page */
> 722 goto end_migration;
> 723 } else
> 724 unlock = putback_lru_page(page);
> 725 }

I think that at least part of your patch, below, should fix this
problem. See comments there.

Now I wonder if the assertion that newpage count == 1 could be violated?
I don't see how. We've just allocated and filled it and haven't
unlocked it yet, so we should hold the only reference. Do you agree?
>
> I think the page count is not necessarily 1 here, because
> migration_entry_wait increases page count and waits for the
> page to be unlocked.
> So, if the old page is accessed between migrate_page_move_mapping,
> which checks the page count, and remove_migration_ptes, page count
> would not be 1 here.
>
> Actually, just commenting out get/put_page from migration_entry_wait
> works well in my environment(succeeded in hundreds times of page migration),
> but modifying migration_entry_wait this way is not good, I think.
>
>
> This patch depends on Lee Schermerhorn's fix for double unlock_page.
>
> This patch also fixes a race between migrate_entry_wait and
> page_freeze_refs in migrate_page_move_mapping.
>
>
> Signed-off-by: Daisuke Nishimura <[email protected]>
>
> ---
> diff -uprN linux-2.6.26-rc5-mm3/mm/migrate.c linux-2.6.26-rc5-mm3-test/mm/migrate.c
> --- linux-2.6.26-rc5-mm3/mm/migrate.c 2008-06-17 15:31:23.000000000 +0900
> +++ linux-2.6.26-rc5-mm3-test/mm/migrate.c 2008-06-17 13:59:15.000000000 +0900
> @@ -232,6 +232,7 @@ void migration_entry_wait(struct mm_stru
> swp_entry_t entry;
> struct page *page;
>
> +retry:
> ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
> pte = *ptep;
> if (!is_swap_pte(pte))
> @@ -243,11 +244,20 @@ void migration_entry_wait(struct mm_stru
>
> page = migration_entry_to_page(entry);
>
> - get_page(page);
> - pte_unmap_unlock(ptep, ptl);
> - wait_on_page_locked(page);
> - put_page(page);
> - return;
> + /*
> + * page count might be set to zero by page_freeze_refs()
> + * in migrate_page_move_mapping().
> + */
> + if (get_page_unless_zero(page)) {
> + pte_unmap_unlock(ptep, ptl);
> + wait_on_page_locked(page);
> + put_page(page);
> + return;
> + } else {
> + pte_unmap_unlock(ptep, ptl);
> + goto retry;
> + }
> +

I'm not sure about this part. If it IS needed, I think it would be
needed independently of the unevictable/putback_lru_page() changes, as
this race must have already existed.

However, unmap_and_move() replaced the migration entries with bona fide
pte's referencing the new page before freeing the old page, so I think
we're OK without this change.

> out:
> pte_unmap_unlock(ptep, ptl);
> }
> @@ -715,13 +725,7 @@ unlock:
> * restored.
> */
> list_del(&page->lru);
> - if (!page->mapping) {
> - VM_BUG_ON(page_count(page) != 1);
> - unlock_page(page);
> - put_page(page); /* just free the old page */
> - goto end_migration;
> - } else
> - unlock = putback_lru_page(page);
> + unlock = putback_lru_page(page);
> }
>
> if (unlock)

I agree with this part. I came to the same conclusion looking at the
code. If we just changed the if() and VM_BUG_ON() to:

if (!page->mapping && page_count(page) == 1) { ...

we'd be doing exactly what putback_lru_page() is doing. So, this code
as always unnecessary, duplicate code [that I was trying to avoid :(].
So, just let putback_lru_page() handle this condition and conditionally
unlock_page().

I'm testing with my stress load with the 2nd part of the patch above and
it's holding up OK. Of course, I didn't hit the problem before. I'll
try your duplicator script and see what happens.

Regards,
Lee

2008-06-17 18:29:21

[permalink] [raw]

Subject: Re: [Bad page] trying to free locked page? (Re: [PATCH][RFC] fix kernel BUG at mm/migrate.c:719! in 2.6.26-rc5-mm3)

On Tue, 2008-06-17 at 18:15 +0900, Daisuke Nishimura wrote:
> On Tue, 17 Jun 2008 18:03:14 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> > On Tue, 17 Jun 2008 16:47:09 +0900
> > Daisuke Nishimura <[email protected]> wrote:
> >
> > > On Tue, 17 Jun 2008 16:35:01 +0900, Daisuke Nishimura <[email protected]> wrote:
> > > > Hi.
> > > >
> > > > I got this bug while migrating pages only a few times
> > > > via memory_migrate of cpuset.
> > > >
> > > > Unfortunately, even if this patch is applied,
> > > > I got bad_page problem after hundreds times of page migration
> > > > (I'll report it in another mail).
> > > > But I believe something like this patch is needed anyway.
> > > >
> > >
> > > I got bad_page after hundreds times of page migration.
> > > It seems that a locked page is being freed.

I'm seeing *mlocked* pages [PG_mlocked] being freed now with my stress
load, with just the "if(!page->mapping) { } clause removed, as proposed
in your rfc patch in previous mail. Need to investigate this...

I'm not seeing *locked* pages [PG_lock], tho'. From your stack trace,
it appears that migrate_page() left locked pages on the list of pages to
be putback. The pages get locked and unlocked in unmap_and_move(). I
haven't found a path [yet] where the page can be returned still locked.
I think I need to duplicate the problem.

> > >
> > Good catch, and I think your investigation in the last e-mail was correct.
> > I'd like to dig this...but it seems some kind of big fix is necessary.
> > Did this happen under page-migraion by cpuset-task-move test ?
> >
> Yes.
>
> I made 2 cpuset directories, run some processes in each cpusets,
> and run a script like below infinitely to move tasks and migrate pages.

What processes/tests do you run in each cpuset?

>
> ---
> #!/bin/bash
>
> G1=$1
> G2=$2
>
> move_task()
> {
> for pid in $1
> do
> echo $pid >$2/tasks 2>/dev/null
> done
> }
>
> G1_TASK=`cat ${G1}/tasks`
> G2_TASK=`cat ${G2}/tasks`
>
> move_task "${G1_TASK}" ${G2} &
> move_task "${G2_TASK}" ${G1} &
>
> wait
> ---
>
> I got this bad_page after running this script for about 600 times.
>

2008-06-17 18:34:51

by Hugh Dickins

[permalink] [raw]

Subject: Re: [PATCH][RFC] fix kernel BUG at mm/migrate.c:719! in 2.6.26-rc5-mm3

On Tue, 17 Jun 2008, Lee Schermerhorn wrote:
>
> Now I wonder if the assertion that newpage count == 1 could be violated?
> I don't see how. We've just allocated and filled it and haven't
> unlocked it yet, so we should hold the only reference. Do you agree?

Disagree: IIRC, excellent example of the kind of assumption
that becomes invalid with Nick's speculative page references.

Someone interested in the previous use of the page may have
incremented the refcount, and in due course will find that
it's got reused for something else, and will then back off.

Hugh

2008-06-17 19:29:28

[permalink] [raw]

Subject: Re: [PATCH][RFC] fix kernel BUG at mm/migrate.c:719! in 2.6.26-rc5-mm3

On Tue, 2008-06-17 at 19:33 +0100, Hugh Dickins wrote:
> On Tue, 17 Jun 2008, Lee Schermerhorn wrote:
> >
> > Now I wonder if the assertion that newpage count == 1 could be violated?
> > I don't see how. We've just allocated and filled it and haven't
> > unlocked it yet, so we should hold the only reference. Do you agree?
>
> Disagree: IIRC, excellent example of the kind of assumption
> that becomes invalid with Nick's speculative page references.
>
> Someone interested in the previous use of the page may have
> incremented the refcount, and in due course will find that
> it's got reused for something else, and will then back off.
>

Yeah. Kosaki-san mentioned that we'd need some rework for the
speculative page cache work. Looks like we'll need to drop the
VM_BUG_ON().

I need to go read up on the new invariants we can trust with the
speculative page cache.

Thanks,
Lee

2008-06-17 20:00:24

[permalink] [raw]

Subject: [PATCH] unevictable mlocked pages: initialize mm member of munlock mm_walk structure

was: Re: [Bad page] trying to free locked page? (Re: [PATCH][RFC] fix
kernel BUG at mm/migrate.c:719! in 2.6.26-rc5-mm3)

On Tue, 2008-06-17 at 14:29 -0400, Lee Schermerhorn wrote:
> On Tue, 2008-06-17 at 18:15 +0900, Daisuke Nishimura wrote:
> > On Tue, 17 Jun 2008 18:03:14 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> > > On Tue, 17 Jun 2008 16:47:09 +0900
> > > Daisuke Nishimura <[email protected]> wrote:
> > >
> > > > On Tue, 17 Jun 2008 16:35:01 +0900, Daisuke Nishimura <[email protected]> wrote:
> > > > > Hi.
> > > > >
> > > > > I got this bug while migrating pages only a few times
> > > > > via memory_migrate of cpuset.
> > > > >
> > > > > Unfortunately, even if this patch is applied,
> > > > > I got bad_page problem after hundreds times of page migration
> > > > > (I'll report it in another mail).
> > > > > But I believe something like this patch is needed anyway.
> > > > >
> > > >
> > > > I got bad_page after hundreds times of page migration.
> > > > It seems that a locked page is being freed.
>
> I'm seeing *mlocked* pages [PG_mlocked] being freed now with my stress
> load, with just the "if(!page->mapping) { } clause removed, as proposed
> in your rfc patch in previous mail. Need to investigate this...
>
<snip>

This [freeing of mlocked pages] also occurs in unpatched 26-rc5-mm3.

Fixed by the following:

PATCH: fix munlock page table walk - now requires 'mm'

Against 2.6.26-rc5-mm3.

Incremental fix for: mlock-mlocked-pages-are-unevictable-fix.patch

Initialize the 'mm' member of the mm_walk structure, else the
page table walk doesn't occur, and mlocked pages will not be
munlocked. This is visible in the vmstats:

noreclaim_pgs_munlocked - should equal noreclaim_pgs_mlocked
less (nr_mlock + noreclaim_pgs_cleared), but is always zero
[munlock_vma_page() never called]

noreclaim_pgs_mlockfreed - should be zero [for debug only],
but == noreclaim_pgs_mlocked - (nr_mlock + noreclaim_pgs_cleared)

Signed-off-by: Lee Schermerhorn <[email protected]>

mm/mlock.c | 2 ++
1 file changed, 2 insertions(+)

Index: linux-2.6.26-rc5-mm3/mm/mlock.c
===================================================================
--- linux-2.6.26-rc5-mm3.orig/mm/mlock.c 2008-06-17 15:20:57.000000000 -0400
+++ linux-2.6.26-rc5-mm3/mm/mlock.c 2008-06-17 15:23:17.000000000 -0400
@@ -318,6 +318,8 @@ static void __munlock_vma_pages_range(st
VM_BUG_ON(start < vma->vm_start);
VM_BUG_ON(end > vma->vm_end);

+ munlock_page_walk.mm = mm;
+
lru_add_drain_all(); /* push cached pages to LRU */
walk_page_range(start, end, &munlock_page_walk);
lru_add_drain_all(); /* to update stats */

2008-06-18 01:08:53

[permalink] [raw]

Subject: Re: [PATCH][RFC] fix kernel BUG at mm/migrate.c:719! in 2.6.26-rc5-mm3

On Tue, 17 Jun 2008 16:35:01 +0900
Daisuke Nishimura <[email protected]> wrote:

> This patch also fixes a race between migrate_entry_wait and
> page_freeze_refs in migrate_page_move_mapping.
>
Ok, let's fix one by one. please add your Signed-off-by if ok.

This is a fix for page migration under speculative page lookup protocol.
-Kame
==
In speculative page cache lookup protocol, page_count(page) is set to 0
while radix-tree midification is going on, truncation, migration, etc...

While page migration, a page fault to page under migration should wait
unlock_page() and migration_entry_wait() waits for the page from its
pte entry. It does get_page() -> wait_on_page_locked() -> put_page() now.

In page migration, page_freeze_refs() -> page_unfreeze_refs() is called.

Here, page_unfreeze_refs() expects page_count(page) == 0 and panics
if page_count(page) != 0. To avoid this, we shouldn't touch page_count()
if it is zero. This patch uses page_cache_get_speculative() to avoid
the panic.

From: Daisuke Nishimura <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/migrate.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

Index: test-2.6.26-rc5-mm3/mm/migrate.c
===================================================================
--- test-2.6.26-rc5-mm3.orig/mm/migrate.c
+++ test-2.6.26-rc5-mm3/mm/migrate.c
@@ -243,7 +243,8 @@ void migration_entry_wait(struct mm_stru

page = migration_entry_to_page(entry);

- get_page(page);
+ if (!page_cache_get_speculative())
+ goto out;
pte_unmap_unlock(ptep, ptl);
wait_on_page_locked(page);
put_page(page);

2008-06-18 01:27:20

[permalink] [raw]

Subject: Re: [PATCH][RFC] fix kernel BUG at mm/migrate.c:719! in 2.6.26-rc5-mm3

On Wed, 18 Jun 2008 10:13:49 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> On Tue, 17 Jun 2008 16:35:01 +0900
> Daisuke Nishimura <[email protected]> wrote:
>
> > This patch also fixes a race between migrate_entry_wait and
> > page_freeze_refs in migrate_page_move_mapping.
> >
> Ok, let's fix one by one. please add your Signed-off-by if ok.
>
Agree. It should be fixed independently.

Signed-off-by: Daisuke Nishimura <[email protected]>

> This is a fix for page migration under speculative page lookup protocol.
> -Kame
> ==
> In speculative page cache lookup protocol, page_count(page) is set to 0
> while radix-tree midification is going on, truncation, migration, etc...
>
> While page migration, a page fault to page under migration should wait
> unlock_page() and migration_entry_wait() waits for the page from its
> pte entry. It does get_page() -> wait_on_page_locked() -> put_page() now.
>
> In page migration, page_freeze_refs() -> page_unfreeze_refs() is called.
>
> Here, page_unfreeze_refs() expects page_count(page) == 0 and panics
> if page_count(page) != 0. To avoid this, we shouldn't touch page_count()
> if it is zero. This patch uses page_cache_get_speculative() to avoid
> the panic.
>
> From: Daisuke Nishimura <[email protected]>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> mm/migrate.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> Index: test-2.6.26-rc5-mm3/mm/migrate.c
> ===================================================================
> --- test-2.6.26-rc5-mm3.orig/mm/migrate.c
> +++ test-2.6.26-rc5-mm3/mm/migrate.c
> @@ -243,7 +243,8 @@ void migration_entry_wait(struct mm_stru
>
> page = migration_entry_to_page(entry);
>
> - get_page(page);
> + if (!page_cache_get_speculative())
> + goto out;
> pte_unmap_unlock(ptep, ptl);
> wait_on_page_locked(page);
> put_page(page);
>

2008-06-18 01:49:26

[permalink] [raw]

Subject: [PATCH] migration_entry_wait fix.

On Wed, 18 Jun 2008 10:13:49 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> + if (!page_cache_get_speculative())
> + goto out;
This is obviously buggy....sorry..quilt refresh miss..

==
In speculative page cache lookup protocol, page_count(page) is set to 0
while radix-tree modification is going on, truncation, migration, etc...

While page migration, a page fault to page under migration should wait
unlock_page() and migration_entry_wait() waits for the page from its
pte entry. It does get_page() -> wait_on_page_locked() -> put_page() now.

In page migration, page_freeze_refs() -> page_unfreeze_refs() is called.

Here, page_unfreeze_refs() expects page_count(page) == 0 and panics
if page_count(page) != 0. To avoid this, we shouldn't touch page_count()
if it is zero. This patch uses page_cache_get_speculative() to avoid
the panic.

From: Daisuke Nishimura <[email protected]>
Signed-off-by: Daisuke Nishimura <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/migrate.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

Index: test-2.6.26-rc5-mm3/mm/migrate.c
===================================================================
--- test-2.6.26-rc5-mm3.orig/mm/migrate.c
+++ test-2.6.26-rc5-mm3/mm/migrate.c
@@ -243,7 +243,8 @@ void migration_entry_wait(struct mm_stru

page = migration_entry_to_page(entry);

- get_page(page);
+ if (!page_cache_get_speculative(page))
+ goto out;
pte_unmap_unlock(ptep, ptl);
wait_on_page_locked(page);
put_page(page);

2008-06-18 01:55:32

[permalink] [raw]

Subject: Re: [PATCH][RFC] fix kernel BUG at mm/migrate.c:719! in 2.6.26-rc5-mm3

On Wed, 18 Jun 2008 00:33:18 +0900, KOSAKI Motohiro <[email protected]> wrote:
> > @@ -715,13 +725,7 @@ unlock:
> > * restored.
> > */
> > list_del(&page->lru);
> > - if (!page->mapping) {
> > - VM_BUG_ON(page_count(page) != 1);
> > - unlock_page(page);
> > - put_page(page); /* just free the old page */
> > - goto end_migration;
> > - } else
> > - unlock = putback_lru_page(page);
> > + unlock = putback_lru_page(page);
> > }
> >
> > if (unlock)
>
> this part is really necessary?
> I tryed to remove it, but any problem doesn't happend.
>
I made this part first, and added a fix for migration_entry_wait later.

So, I haven't test without this part, and I think it will cause
VM_BUG_ON() here without this part.

Anyway, I will test it.

> Of cource, another part is definitly necessary for specurative pagecache :)
>

2008-06-18 02:34:06

[permalink] [raw]

Subject: Re: [Bad page] trying to free locked page? (Re: [PATCH][RFC] fix kernel BUG at mm/migrate.c:719! in 2.6.26-rc5-mm3)

On Wed, 18 Jun 2008 00:34:16 +0900, KOSAKI Motohiro <[email protected]> wrote:
> > > I got this bug while migrating pages only a few times
> > > via memory_migrate of cpuset.
> > >
> > > Unfortunately, even if this patch is applied,
> > > I got bad_page problem after hundreds times of page migration
> > > (I'll report it in another mail).
> > > But I believe something like this patch is needed anyway.
> > >
> >
> > I got bad_page after hundreds times of page migration.
> > It seems that a locked page is being freed.
>
> I can't reproduce this bad page.
> I'll try again tomorrow ;)
>

OK. I'll report on my test more precisely.

- Environment
HW: 4CPU(x86_64), 2node NUMA
kernel: 2.6.26-rc5-mm3 + Lee's two fixes about double unlock_page
+ my patch. config is attached.

- mount cpuset and make settings
# mount -t cgroup -o cpuset cpuset /cgroup/cpuset

# mkdir /cgroup/cpuset/01
# echo 0-1 >/cgroup/cpuset/01/cpuset.cpus
# echo 0 >/cgroup/cpuset/01/cpuset.mems
# echo 1 >/cgroup/cpuset/01/cpuset.memory_migrate

# mkdir /cgroup/cpuset/02
# echo 2-3 >/cgroup/cpuset/02/cpuset.cpus
# echo 1 >/cgroup/cpuset/02/cpuset.mems
# echo 1 >/cgroup/cpuset/02/cpuset.memory_migrate

- register processes in cpusets
# echo $$ >/cgroup/cpuset/01/tasks

I'm using LTP's page01 test, and running two instances infinitely.
# while true; do (somewhere)/page01 4194304 1; done &
# while true; do (somewhere)/page01 4194304 1; done &

The same thing should be done about 02 directory.

- echo pids to another directory
Run simple script like below.

---
#!/bin/bash

G1=$1
G2=$2

move_task()
{
for pid in $1
do
echo $pid >$2/tasks 2>/dev/null
done
}

G1_TASK=`cat ${G1}/tasks`
G2_TASK=`cat ${G2}/tasks`

move_task "${G1_TASK}" ${G2} &
move_task "${G2_TASK}" ${G1} &

wait
---

Please let me know if you need other information.
I'm also digging this problem.

Thanks,
Daisuke Nishimura.

Attachments:

config-2.6.26-rc5-mm3 (74.48 kB)

2008-06-18 02:42:31

[permalink] [raw]

Subject: Re: [Bad page] trying to free locked page? (Re: [PATCH][RFC] fix kernel BUG at mm/migrate.c:719! in 2.6.26-rc5-mm3)

> > > >
> > > Good catch, and I think your investigation in the last e-mail was correct.
> > > I'd like to dig this...but it seems some kind of big fix is necessary.
> > > Did this happen under page-migraion by cpuset-task-move test ?
> > >
> > Yes.
> >
> > I made 2 cpuset directories, run some processes in each cpusets,
> > and run a script like below infinitely to move tasks and migrate pages.
>
> What processes/tests do you run in each cpuset?
>

Please see the mail I've just sended to Kosaki-san :)

Thanks,
Daisuke Nishimura.

2008-06-18 03:00:40

[permalink] [raw]

Subject: Re: [PATCH][RFC] fix kernel BUG at mm/migrate.c:719! in 2.6.26-rc5-mm3

> > @@ -232,6 +232,7 @@ void migration_entry_wait(struct mm_stru
> > swp_entry_t entry;
> > struct page *page;
> >
> > +retry:
> > ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
> > pte = *ptep;
> > if (!is_swap_pte(pte))
> > @@ -243,11 +244,20 @@ void migration_entry_wait(struct mm_stru
> >
> > page = migration_entry_to_page(entry);
> >
> > - get_page(page);
> > - pte_unmap_unlock(ptep, ptl);
> > - wait_on_page_locked(page);
> > - put_page(page);
> > - return;
> > + /*
> > + * page count might be set to zero by page_freeze_refs()
> > + * in migrate_page_move_mapping().
> > + */
> > + if (get_page_unless_zero(page)) {
> > + pte_unmap_unlock(ptep, ptl);
> > + wait_on_page_locked(page);
> > + put_page(page);
> > + return;
> > + } else {
> > + pte_unmap_unlock(ptep, ptl);
> > + goto retry;
> > + }
> > +
>
> I'm not sure about this part. If it IS needed, I think it would be
> needed independently of the unevictable/putback_lru_page() changes, as
> this race must have already existed.
>
> However, unmap_and_move() replaced the migration entries with bona fide
> pte's referencing the new page before freeing the old page, so I think
> we're OK without this change.
>

Without this part, I can easily get VM_BUG_ON in get_page,
even when processes in cpusets are only bash.

---
kernel BUG at include/linux/mm.h:297!
:
Call Trace:
[<ffffffff80280d82>] ? handle_mm_fault+0x3e5/0x782
[<ffffffff8048c8bf>] ? do_page_fault+0x3d0/0x7a7
[<ffffffff80263ed0>] ? audit_syscall_exit+0x2e4/0x303
[<ffffffff8048a989>] ? error_exit+0x0/0x51
Code: b8 00 00 00 00 00 e2 ff ff 48 8d 1c 02 48 8b 13 f6 c
2 01 75 04 0f 0b eb fe 80 e6 40 48 89 d8 74 04 48 8b 43 10 83 78 08 00 75 04 <0f> 0b eb fe
f0 ff 40 08 fe 45 00 f6 03 01 74 0a 31 f6 48 89 df
RIP [<ffffffff8029c309>] migration_entry_wait+0xcb/0xfa
RSP <ffff81062cc6fe58>
---

I agree that this part should be fixed independently, and
Kamezawa-san has already posted a patch for this.

Thanks,
Daisuke Nishimura.

2008-06-18 03:33:47

[permalink] [raw]

Subject: Re: [PATCH] unevictable mlocked pages: initialize mm member of munlock mm_walk structure

> PATCH: fix munlock page table walk - now requires 'mm'
>
> Against 2.6.26-rc5-mm3.
>
> Incremental fix for: mlock-mlocked-pages-are-unevictable-fix.patch
>
> Initialize the 'mm' member of the mm_walk structure, else the
> page table walk doesn't occur, and mlocked pages will not be
> munlocked. This is visible in the vmstats:

Yup, Dave Hansen changed page_walk interface recently.
thus, his and ours patch is conflicted ;)

below patch is just nit cleanups.

===========================================
From: Lee Schermerhorn <[email protected]>

This [freeing of mlocked pages] also occurs in unpatched 26-rc5-mm3.

Fixed by the following:

PATCH: fix munlock page table walk - now requires 'mm'

Against 2.6.26-rc5-mm3.

Incremental fix for: mlock-mlocked-pages-are-unevictable-fix.patch

Initialize the 'mm' member of the mm_walk structure, else the
page table walk doesn't occur, and mlocked pages will not be
munlocked. This is visible in the vmstats:

noreclaim_pgs_munlocked - should equal noreclaim_pgs_mlocked
less (nr_mlock + noreclaim_pgs_cleared), but is always zero
[munlock_vma_page() never called]

noreclaim_pgs_mlockfreed - should be zero [for debug only],
but == noreclaim_pgs_mlocked - (nr_mlock + noreclaim_pgs_cleared)

Signed-off-by: Lee Schermerhorn <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>

mm/mlock.c | 1 +
1 file changed, 1 insertion(+)

Index: b/mm/mlock.c
===================================================================
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -310,6 +310,7 @@ static void __munlock_vma_pages_range(st
.pmd_entry = __munlock_pmd_handler,
.pte_entry = __munlock_pte_handler,
.private = &mpw,
+ .mm = mm,
};

VM_BUG_ON(start & ~PAGE_MASK || end & ~PAGE_MASK);

2008-06-18 04:42:53

[permalink] [raw]

Subject: Re: [PATCH][RFC] fix kernel BUG at mm/migrate.c:719! in 2.6.26-rc5-mm3

On Wed, 18 Jun 2008 10:54:00 +0900, Daisuke Nishimura <[email protected]> wrote:
> On Wed, 18 Jun 2008 00:33:18 +0900, KOSAKI Motohiro <[email protected]> wrote:
> > > @@ -715,13 +725,7 @@ unlock:
> > > * restored.
> > > */
> > > list_del(&page->lru);
> > > - if (!page->mapping) {
> > > - VM_BUG_ON(page_count(page) != 1);
> > > - unlock_page(page);
> > > - put_page(page); /* just free the old page */
> > > - goto end_migration;
> > > - } else
> > > - unlock = putback_lru_page(page);
> > > + unlock = putback_lru_page(page);
> > > }
> > >
> > > if (unlock)
> >
> > this part is really necessary?
> > I tryed to remove it, but any problem doesn't happend.
> >
> I made this part first, and added a fix for migration_entry_wait later.
>
> So, I haven't test without this part, and I think it will cause
> VM_BUG_ON() here without this part.
>
> Anyway, I will test it.
>
I got this VM_BUG_ON() as expected only by doing:

# echo $$ >/cgroup/cpuset/02/tasks

So, I beleive that both fixes for migration_entry_wait and
unmap_and_move (and, of course, removal VM_BUG_ON from
putback_lru_page) are needed.

Thanks,
Daisuke Nishimura.

2008-06-18 04:54:45

[permalink] [raw]

Subject: Re: [PATCH][RFC] fix kernel BUG at mm/migrate.c:719! in 2.6.26-rc5-mm3

On Wed, 18 Jun 2008 13:41:28 +0900
Daisuke Nishimura <[email protected]> wrote:

> On Wed, 18 Jun 2008 10:54:00 +0900, Daisuke Nishimura <[email protected]> wrote:
> > On Wed, 18 Jun 2008 00:33:18 +0900, KOSAKI Motohiro <[email protected]> wrote:
> > > > @@ -715,13 +725,7 @@ unlock:
> > > > * restored.
> > > > */
> > > > list_del(&page->lru);
> > > > - if (!page->mapping) {
> > > > - VM_BUG_ON(page_count(page) != 1);
> > > > - unlock_page(page);
> > > > - put_page(page); /* just free the old page */
> > > > - goto end_migration;
> > > > - } else
> > > > - unlock = putback_lru_page(page);
> > > > + unlock = putback_lru_page(page);
> > > > }
> > > >
> > > > if (unlock)
> > >
> > > this part is really necessary?
> > > I tryed to remove it, but any problem doesn't happend.
> > >
> > I made this part first, and added a fix for migration_entry_wait later.
> >
> > So, I haven't test without this part, and I think it will cause
> > VM_BUG_ON() here without this part.
> >
> > Anyway, I will test it.
> >
> I got this VM_BUG_ON() as expected only by doing:
>
> # echo $$ >/cgroup/cpuset/02/tasks
>
> So, I beleive that both fixes for migration_entry_wait and
> unmap_and_move (and, of course, removal VM_BUG_ON from
> putback_lru_page) are needed.
>
>
yes, but I'm now trying to rewrite putback_lru_page(). For avoid more complication.

Thanks,
-Kame

2008-06-18 05:20:12

[permalink] [raw]

Subject: Re: [PATCH][RFC] fix kernel BUG at mm/migrate.c:719! in 2.6.26-rc5-mm3

On Wednesday 18 June 2008 05:28, Lee Schermerhorn wrote:
> On Tue, 2008-06-17 at 19:33 +0100, Hugh Dickins wrote:
> > On Tue, 17 Jun 2008, Lee Schermerhorn wrote:
> > > Now I wonder if the assertion that newpage count == 1 could be
> > > violated? I don't see how. We've just allocated and filled it and
> > > haven't unlocked it yet, so we should hold the only reference. Do you
> > > agree?
> >
> > Disagree: IIRC, excellent example of the kind of assumption
> > that becomes invalid with Nick's speculative page references.
> >
> > Someone interested in the previous use of the page may have
> > incremented the refcount, and in due course will find that
> > it's got reused for something else, and will then back off.
>
> Yeah. Kosaki-san mentioned that we'd need some rework for the
> speculative page cache work. Looks like we'll need to drop the
> VM_BUG_ON().
>
> I need to go read up on the new invariants we can trust with the
> speculative page cache.

I don't know if I've added a summary, which is something I should
do.

The best thing to do is never use page_count, but just use get
and put to refcount it. If you really must use it:

- If there are X references to a page, page_count will return >= X.
- If page_count returns Y, there are no more than Y references to the page.

2008-06-18 05:27:44

[permalink] [raw]

Subject: Re: [PATCH] migration_entry_wait fix.

> From: Daisuke Nishimura <[email protected]>
> Signed-off-by: Daisuke Nishimura <[email protected]>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> mm/migrate.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> Index: test-2.6.26-rc5-mm3/mm/migrate.c
> ===================================================================
> --- test-2.6.26-rc5-mm3.orig/mm/migrate.c
> +++ test-2.6.26-rc5-mm3/mm/migrate.c
> @@ -243,7 +243,8 @@ void migration_entry_wait(struct mm_stru
>
> page = migration_entry_to_page(entry);
>
> - get_page(page);
> + if (!page_cache_get_speculative(page))
> + goto out;
> pte_unmap_unlock(ptep, ptl);
> wait_on_page_locked(page);
> put_page(page);

sorry, so late responce.

Acked-by: KOSAKI Motohiro <[email protected]>

2008-06-18 05:36:21

[permalink] [raw]

Subject: Re: [PATCH] migration_entry_wait fix.

On Wednesday 18 June 2008 11:54, KAMEZAWA Hiroyuki wrote:
> On Wed, 18 Jun 2008 10:13:49 +0900
>
> KAMEZAWA Hiroyuki <[email protected]> wrote:
> > + if (!page_cache_get_speculative())
> > + goto out;
>
> This is obviously buggy....sorry..quilt refresh miss..
>
> ==
> In speculative page cache lookup protocol, page_count(page) is set to 0
> while radix-tree modification is going on, truncation, migration, etc...

These tend to all happen while the page is locked, and in particular
while the page does not have any references other than the current
code path and the pagecache. So no page tables should point to it.

So migration_entry_wait should not find pages with a refcount of zero.

> While page migration, a page fault to page under migration should wait
> unlock_page() and migration_entry_wait() waits for the page from its
> pte entry. It does get_page() -> wait_on_page_locked() -> put_page() now.
>
> In page migration, page_freeze_refs() -> page_unfreeze_refs() is called.
>
> Here, page_unfreeze_refs() expects page_count(page) == 0 and panics
> if page_count(page) != 0. To avoid this, we shouldn't touch page_count()
> if it is zero. This patch uses page_cache_get_speculative() to avoid
> the panic.

At any rate, page_cache_get_speculative() should not be used for this
purpose, but for when we _really_ don't have any references to a page.

2008-06-18 05:59:39

[permalink] [raw]

Subject: Re: [PATCH] migration_entry_wait fix.

On Wed, 18 Jun 2008 15:35:57 +1000
Nick Piggin <[email protected]> wrote:

> On Wednesday 18 June 2008 11:54, KAMEZAWA Hiroyuki wrote:
> > On Wed, 18 Jun 2008 10:13:49 +0900
> >
> > KAMEZAWA Hiroyuki <[email protected]> wrote:
> > > + if (!page_cache_get_speculative())
> > > + goto out;
> >
> > This is obviously buggy....sorry..quilt refresh miss..
> >
> > ==
> > In speculative page cache lookup protocol, page_count(page) is set to 0
> > while radix-tree modification is going on, truncation, migration, etc...
>
> These tend to all happen while the page is locked, and in particular
> while the page does not have any references other than the current
> code path and the pagecache. So no page tables should point to it.
>
> So migration_entry_wait should not find pages with a refcount of zero.
>
>
> > While page migration, a page fault to page under migration should wait
> > unlock_page() and migration_entry_wait() waits for the page from its
> > pte entry. It does get_page() -> wait_on_page_locked() -> put_page() now.
> >
> > In page migration, page_freeze_refs() -> page_unfreeze_refs() is called.
> >
> > Here, page_unfreeze_refs() expects page_count(page) == 0 and panics
> > if page_count(page) != 0. To avoid this, we shouldn't touch page_count()
> > if it is zero. This patch uses page_cache_get_speculative() to avoid
> > the panic.
>
> At any rate, page_cache_get_speculative() should not be used for this
> purpose, but for when we _really_ don't have any references to a page.
>
Then, I got NAK. what should I do ?
(This fix is not related to lock_page() problem.)

If I read your advice correctly, we shouldn't use lock_page() here.

Before speculative page cache, page_table_entry of a page under migration
has a pte entry which encodes pfn as special pte entry. and wait for the
end of page migration by lock_page().

Maybe we just go back to user-land and makes it to do page-fault again
is better ?

Thanks,
-Kame

2008-06-18 06:43:05

[permalink] [raw]

Subject: Re: [PATCH] migration_entry_wait fix.

On Wednesday 18 June 2008 16:04, KAMEZAWA Hiroyuki wrote:
> On Wed, 18 Jun 2008 15:35:57 +1000
>
> Nick Piggin <[email protected]> wrote:
> > On Wednesday 18 June 2008 11:54, KAMEZAWA Hiroyuki wrote:
> > > On Wed, 18 Jun 2008 10:13:49 +0900
> > >
> > > KAMEZAWA Hiroyuki <[email protected]> wrote:
> > > > + if (!page_cache_get_speculative())
> > > > + goto out;
> > >
> > > This is obviously buggy....sorry..quilt refresh miss..
> > >
> > > ==
> > > In speculative page cache lookup protocol, page_count(page) is set to 0
> > > while radix-tree modification is going on, truncation, migration,
> > > etc...
> >
> > These tend to all happen while the page is locked, and in particular
> > while the page does not have any references other than the current
> > code path and the pagecache. So no page tables should point to it.
> >
> > So migration_entry_wait should not find pages with a refcount of zero.
> >
> > > While page migration, a page fault to page under migration should wait
> > > unlock_page() and migration_entry_wait() waits for the page from its
> > > pte entry. It does get_page() -> wait_on_page_locked() -> put_page()
> > > now.
> > >
> > > In page migration, page_freeze_refs() -> page_unfreeze_refs() is
> > > called.
> > >
> > > Here, page_unfreeze_refs() expects page_count(page) == 0 and panics
> > > if page_count(page) != 0. To avoid this, we shouldn't touch
> > > page_count() if it is zero. This patch uses
> > > page_cache_get_speculative() to avoid the panic.
> >
> > At any rate, page_cache_get_speculative() should not be used for this
> > purpose, but for when we _really_ don't have any references to a page.
>
> Then, I got NAK. what should I do ?

Well, not nack as such as just wanting to find out a bit more about
how this happens (I'm a little bit slow...)

> (This fix is not related to lock_page() problem.)
>
> If I read your advice correctly, we shouldn't use lock_page() here.
>
> Before speculative page cache, page_table_entry of a page under migration
> has a pte entry which encodes pfn as special pte entry. and wait for the
> end of page migration by lock_page().

What I don't think I understand, is how we can have a page in the
page tables (and with the ptl held) but with a zero refcount... Oh,
it's not actually a page but a migration entry! I'm not quite so
familiar with that code.

Hmm, so we might possibly see a page there that has a zero refcount
due to page_freeze_refs? In which case, I think the direction of you
fix is good. Sorry for my misunderstanding the problem, and thank
you for fixing up my code!

I would ask you to use get_page_unless_zero rather than
page_cache_get_speculative(), because it's not exactly a speculative
reference -- a speculative reference is one where we elevate _count
and then must recheck that the page we have is correct.

Also, please add a comment. It would really be nicer to hide this
transiently-frozen state away from migration_entry_wait, but I can't
see any lock that would easily solve it.

Thanks,
Nick

2008-06-18 06:47:43

[permalink] [raw]

Subject: Re: [PATCH] migration_entry_wait fix.

On Wed, 18 Jun 2008 16:42:37 +1000
Nick Piggin <[email protected]> wrote:

> > (This fix is not related to lock_page() problem.)
> >
> > If I read your advice correctly, we shouldn't use lock_page() here.
> >
> > Before speculative page cache, page_table_entry of a page under migration
> > has a pte entry which encodes pfn as special pte entry. and wait for the
> > end of page migration by lock_page().
>
> What I don't think I understand, is how we can have a page in the
> page tables (and with the ptl held) but with a zero refcount... Oh,
> it's not actually a page but a migration entry! I'm not quite so
> familiar with that code.
>
> Hmm, so we might possibly see a page there that has a zero refcount
> due to page_freeze_refs? In which case, I think the direction of you
> fix is good. Sorry for my misunderstanding the problem, and thank
> you for fixing up my code!
>
> I would ask you to use get_page_unless_zero rather than
> page_cache_get_speculative(), because it's not exactly a speculative
> reference -- a speculative reference is one where we elevate _count
> and then must recheck that the page we have is correct.
>
ok.

> Also, please add a comment. It would really be nicer to hide this
> transiently-frozen state away from migration_entry_wait, but I can't
> see any lock that would easily solve it.
>
ok, will adds comments.

Thanks,
-Kame

2008-06-18 07:25:14

[permalink] [raw]

Subject: [PATCH -mm][BUGFIX] migration_entry_wait fix. v2

In speculative page cache look up protocol, page_count(page) is set to 0
while radix-tree modification is going on, truncation, migration, etc...

While page migration, a page fault to page under migration does
- look up page table
- find it is migration_entry_pte
- decode pfn from migration_entry_pte and get page of pfn_page(pfn)
- wait until page is unlocked

It does get_page() -> wait_on_page_locked() -> put_page() now.

In page migration's radix-tree replacement, page_freeze_refs() ->
page_unfreeze_refs() is called. And page_count(page) turns to be zero
and must be kept to be zero while radix-tree replacement.

If get_page() is called against a page under radix-tree replacement,
the kernel panics(). To avoid this, we shouldn't increment page_count()
if it is zero. This patch uses get_page_unless_zero().

Even if get_page_unless_zero() fails, the caller just retries.
But will be a bit busier.

Change log v1->v2:
- rewrote the patch description and added comments.

From: Daisuke Nishimura <[email protected]>
Signed-off-by: Daisuke Nishimura <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/migrate.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)

Index: test-2.6.26-rc5-mm3/mm/migrate.c
===================================================================
--- test-2.6.26-rc5-mm3.orig/mm/migrate.c
+++ test-2.6.26-rc5-mm3/mm/migrate.c
@@ -242,8 +242,15 @@ void migration_entry_wait(struct mm_stru
goto out;

page = migration_entry_to_page(entry);
-
- get_page(page);
+ /*
+ * Once radix-tree replacement of page migration started, page_count
+ * *must* be zero. And, we don't want to call wait_on_page_locked()
+ * against a page without get_page().
+ * So, we use get_page_unless_zero(), here. Even failed, page fault
+ * will occur again.
+ */
+ if (!get_page_unless_zero(page))
+ goto out;
pte_unmap_unlock(ptep, ptl);
wait_on_page_locked(page);
put_page(page);

2008-06-18 07:26:42

[permalink] [raw]

Subject: Re: [PATCH -mm][BUGFIX] migration_entry_wait fix. v2

> In speculative page cache look up protocol, page_count(page) is set to 0
> while radix-tree modification is going on, truncation, migration, etc...
>
> While page migration, a page fault to page under migration does
> - look up page table
> - find it is migration_entry_pte
> - decode pfn from migration_entry_pte and get page of pfn_page(pfn)
> - wait until page is unlocked
>
> It does get_page() -> wait_on_page_locked() -> put_page() now.
>
> In page migration's radix-tree replacement, page_freeze_refs() ->
> page_unfreeze_refs() is called. And page_count(page) turns to be zero
> and must be kept to be zero while radix-tree replacement.
>
> If get_page() is called against a page under radix-tree replacement,
> the kernel panics(). To avoid this, we shouldn't increment page_count()
> if it is zero. This patch uses get_page_unless_zero().
>
> Even if get_page_unless_zero() fails, the caller just retries.
> But will be a bit busier.

Great!
Acked-by: KOSAKI Motohiro <[email protected]>

2008-06-18 07:40:29

[permalink] [raw]

Subject: Re: [PATCH -mm][BUGFIX] migration_entry_wait fix. v2

On Wednesday 18 June 2008 17:29, KAMEZAWA Hiroyuki wrote:
> In speculative page cache look up protocol, page_count(page) is set to 0
> while radix-tree modification is going on, truncation, migration, etc...
>
> While page migration, a page fault to page under migration does
> - look up page table
> - find it is migration_entry_pte
> - decode pfn from migration_entry_pte and get page of pfn_page(pfn)
> - wait until page is unlocked
>
> It does get_page() -> wait_on_page_locked() -> put_page() now.
>
> In page migration's radix-tree replacement, page_freeze_refs() ->
> page_unfreeze_refs() is called. And page_count(page) turns to be zero
> and must be kept to be zero while radix-tree replacement.
>
> If get_page() is called against a page under radix-tree replacement,
> the kernel panics(). To avoid this, we shouldn't increment page_count()
> if it is zero. This patch uses get_page_unless_zero().
>
> Even if get_page_unless_zero() fails, the caller just retries.
> But will be a bit busier.
>
> Change log v1->v2:
> - rewrote the patch description and added comments.
>

Thanks

Acked-by: Nick Piggin <[email protected]>

Andrew, this is a bugfix to mm-speculative-page-references.patch

> From: Daisuke Nishimura <[email protected]>
> Signed-off-by: Daisuke Nishimura <[email protected]>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> mm/migrate.c | 11 +++++++++--
> 1 file changed, 9 insertions(+), 2 deletions(-)
>
> Index: test-2.6.26-rc5-mm3/mm/migrate.c
> ===================================================================
> --- test-2.6.26-rc5-mm3.orig/mm/migrate.c
> +++ test-2.6.26-rc5-mm3/mm/migrate.c
> @@ -242,8 +242,15 @@ void migration_entry_wait(struct mm_stru
> goto out;
>
> page = migration_entry_to_page(entry);
> -
> - get_page(page);
> + /*
> + * Once radix-tree replacement of page migration started, page_count
> + * *must* be zero. And, we don't want to call wait_on_page_locked()
> + * against a page without get_page().
> + * So, we use get_page_unless_zero(), here. Even failed, page fault
> + * will occur again.
> + */
> + if (!get_page_unless_zero(page))
> + goto out;
> pte_unmap_unlock(ptep, ptl);
> wait_on_page_locked(page);
> put_page(page);

2008-06-18 07:54:50

[permalink] [raw]

Subject: [PATCH][-mm] remove redundant page->mapping check

> > > this part is really necessary?
> > > I tryed to remove it, but any problem doesn't happend.
> > >
> > I made this part first, and added a fix for migration_entry_wait later.
> >
> > So, I haven't test without this part, and I think it will cause
> > VM_BUG_ON() here without this part.
> >
> > Anyway, I will test it.
> >
> I got this VM_BUG_ON() as expected only by doing:
>
> # echo $$ >/cgroup/cpuset/02/tasks
>
> So, I beleive that both fixes for migration_entry_wait and
> unmap_and_move (and, of course, removal VM_BUG_ON from
> putback_lru_page) are needed.

OK, I confirmed this part.

Andrew, please pick.

==================================================

Against: 2.6.26-rc5-mm3

remove redundant mapping check.

we'd be doing exactly what putback_lru_page() is doing. So, this code
as always unnecessary, duplicate code.
So, just let putback_lru_page() handle this condition and conditionally
unlock_page().

Signed-off-by: Daisuke Nishimura <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
Acked-by: Lee Schermerhorn <[email protected]>

---
mm/migrate.c | 8 +-------
1 file changed, 1 insertion(+), 7 deletions(-)

Index: b/mm/migrate.c
===================================================================
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -716,13 +716,7 @@ unlock:
* restored.
*/
list_del(&page->lru);
- if (!page->mapping) {
- VM_BUG_ON(page_count(page) != 1);
- unlock_page(page);
- put_page(page); /* just free the old page */
- goto end_migration;
- } else
- unlock = putback_lru_page(page);
+ unlock = putback_lru_page(page);
}

if (unlock)

2008-06-18 09:34:49

[permalink] [raw]

Subject: [Experimental][PATCH] putback_lru_page rework

Lee-san, how about this ?
Tested on x86-64 and tried Nisimura-san's test at el. works good now.
-Kame
==
putback_lru_page()/unevictable page handling rework.

Now, putback_lru_page() requires that the page is locked.
And in some special case, implicitly unlock it.

This patch tries to make putback_lru_pages() to be lock_page() free.
(Of course, some callers must take the lock.)

The main reason that putback_lru_page() assumes that page is locked
is to avoid the change in page's status among Mlocked/Not-Mlocked.

Once it is added to unevictable list, the page is removed from
unevictable list only when page is munlocked. (there are other special
case. but we ignore the special case.)
So, status change during putback_lru_page() is fatal and page should
be locked.

putback_lru_page() in this patch has a new concepts.
When it adds page to unevictable list, it checks the status is
changed or not again. if changed, retry to putback.

This patche changes also caller side and cleaning up lock/unlock_page().

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

---
mm/internal.h | 2 -
mm/migrate.c | 23 +++----------
mm/mlock.c | 24 +++++++-------
mm/vmscan.c | 96 +++++++++++++++++++++++++---------------------------------
4 files changed, 61 insertions(+), 84 deletions(-)

Index: test-2.6.26-rc5-mm3/mm/vmscan.c
===================================================================
--- test-2.6.26-rc5-mm3.orig/mm/vmscan.c
+++ test-2.6.26-rc5-mm3/mm/vmscan.c
@@ -486,73 +486,63 @@ int remove_mapping(struct address_space
* Page may still be unevictable for other reasons.
*
* lru_lock must not be held, interrupts must be enabled.
- * Must be called with page locked.
- *
- * return 1 if page still locked [not truncated], else 0
*/
-int putback_lru_page(struct page *page)
+#ifdef CONFIG_UNEVICTABLE_LRU
+void putback_lru_page(struct page *page)
{
int lru;
- int ret = 1;
int was_unevictable;

- VM_BUG_ON(!PageLocked(page));
VM_BUG_ON(PageLRU(page));

- lru = !!TestClearPageActive(page);
was_unevictable = TestClearPageUnevictable(page); /* for page_evictable() */

- if (unlikely(!page->mapping)) {
- /*
- * page truncated. drop lock as put_page() will
- * free the page.
- */
- VM_BUG_ON(page_count(page) != 1);
- unlock_page(page);
- ret = 0;
- } else if (page_evictable(page, NULL)) {
- /*
- * For evictable pages, we can use the cache.
- * In event of a race, worst case is we end up with an
- * unevictable page on [in]active list.
- * We know how to handle that.
- */
+redo:
+ lru = !!TestClearPageActive(page);
+ if (page_evictable(page, NULL)) {
lru += page_is_file_cache(page);
lru_cache_add_lru(page, lru);
- mem_cgroup_move_lists(page, lru);
-#ifdef CONFIG_UNEVICTABLE_LRU
- if (was_unevictable)
- count_vm_event(NORECL_PGRESCUED);
-#endif
} else {
- /*
- * Put unevictable pages directly on zone's unevictable
- * list.
- */
+ lru = LRU_UNEVICTABLE;
add_page_to_unevictable_list(page);
- mem_cgroup_move_lists(page, LRU_UNEVICTABLE);
-#ifdef CONFIG_UNEVICTABLE_LRU
- if (!was_unevictable)
- count_vm_event(NORECL_PGCULLED);
-#endif
}
+ mem_cgroup_move_lists(page, lru);
+
+ /*
+ * page's status can change while we move it among lru. If an evictable
+ * page is on unevictable list, it never be freed. To avoid that,
+ * check after we added it to the list, again.
+ */
+ if (lru == LRU_UNEVICTABLE && page_evictable(page, NULL)) {
+ if (!isolate_lru_page(page)) {
+ put_page(page);
+ goto redo;
+ }
+ /* This means someone else dropped this page from LRU
+ * So, it will be freed or putback to LRU again. There is
+ * nothing to do here.
+ */
+ }
+
+ if (was_unevictable && lru != LRU_UNEVICTABLE)
+ count_vm_event(NORECL_PGRESCUED);
+ else if (!was_unevictable && lru == LRU_UNEVICTABLE)
+ count_vm_event(NORECL_PGCULLED);

put_page(page); /* drop ref from isolate */
- return ret; /* ret => "page still locked" */
}
-
-/*
- * Cull page that shrink_*_list() has detected to be unevictable
- * under page lock to close races with other tasks that might be making
- * the page evictable. Avoid stranding an evictable page on the
- * unevictable list.
- */
-static void cull_unevictable_page(struct page *page)
+#else
+void putback_lru_page(struct page *page)
{
- lock_page(page);
- if (putback_lru_page(page))
- unlock_page(page);
+ int lru;
+ VM_BUG_ON(PageLRU(page));
+
+ lru = !!TestClearPageActive(page) + page_is_file_cache(page);
+ lru_cache_add_lru(page, lru);
+ mem_cgroup_move_lists(page, lru);
+ put_page(page);
}
+#endif

/*
* shrink_page_list() returns the number of reclaimed pages
@@ -746,8 +736,8 @@ free_it:
continue;

cull_mlocked:
- if (putback_lru_page(page))
- unlock_page(page);
+ unlock_page(page);
+ putback_lru_page(page);
continue;

activate_locked:
@@ -1127,7 +1117,7 @@ static unsigned long shrink_inactive_lis
list_del(&page->lru);
if (unlikely(!page_evictable(page, NULL))) {
spin_unlock_irq(&zone->lru_lock);
- cull_unevictable_page(page);
+ putback_lru_page(page);
spin_lock_irq(&zone->lru_lock);
continue;
}
@@ -1231,7 +1221,7 @@ static void shrink_active_list(unsigned
list_del(&page->lru);

if (unlikely(!page_evictable(page, NULL))) {
- cull_unevictable_page(page);
+ putback_lru_page(page);
continue;
}

@@ -2393,8 +2383,6 @@ int zone_reclaim(struct zone *zone, gfp_
int page_evictable(struct page *page, struct vm_area_struct *vma)
{

- VM_BUG_ON(PageUnevictable(page));
-
if (mapping_unevictable(page_mapping(page)))
return 0;

Index: test-2.6.26-rc5-mm3/mm/mlock.c
===================================================================
--- test-2.6.26-rc5-mm3.orig/mm/mlock.c
+++ test-2.6.26-rc5-mm3/mm/mlock.c
@@ -55,7 +55,6 @@ EXPORT_SYMBOL(can_do_mlock);
*/
void __clear_page_mlock(struct page *page)
{
- VM_BUG_ON(!PageLocked(page)); /* for LRU isolate/putback */

dec_zone_page_state(page, NR_MLOCK);
count_vm_event(NORECL_PGCLEARED);
@@ -79,7 +78,6 @@ void __clear_page_mlock(struct page *pag
*/
void mlock_vma_page(struct page *page)
{
- BUG_ON(!PageLocked(page));

if (!TestSetPageMlocked(page)) {
inc_zone_page_state(page, NR_MLOCK);
@@ -109,7 +107,6 @@ void mlock_vma_page(struct page *page)
*/
static void munlock_vma_page(struct page *page)
{
- BUG_ON(!PageLocked(page));

if (TestClearPageMlocked(page)) {
dec_zone_page_state(page, NR_MLOCK);
@@ -169,7 +166,8 @@ static int __mlock_vma_pages_range(struc

/*
* get_user_pages makes pages present if we are
- * setting mlock.
+ * setting mlock. and this extra reference count will
+ * disable migration of this page.
*/
ret = get_user_pages(current, mm, addr,
min_t(int, nr_pages, ARRAY_SIZE(pages)),
@@ -197,14 +195,8 @@ static int __mlock_vma_pages_range(struc
for (i = 0; i < ret; i++) {
struct page *page = pages[i];

- /*
- * page might be truncated or migrated out from under
- * us. Check after acquiring page lock.
- */
- lock_page(page);
- if (page->mapping)
+ if (page_mapcount(page))
mlock_vma_page(page);
- unlock_page(page);
put_page(page); /* ref from get_user_pages() */

/*
@@ -240,6 +232,9 @@ static int __munlock_pte_handler(pte_t *
struct page *page;
pte_t pte;

+ /*
+ * page is never be unmapped by page-reclaim. we lock this page now.
+ */
retry:
pte = *ptep;
/*
@@ -261,7 +256,15 @@ retry:
goto out;

lock_page(page);
- if (!page->mapping) {
+ /*
+ * Because we lock page here, we have to check 2 cases.
+ * - the page is migrated.
+ * - the page is truncated (file-cache only)
+ * Note: Anonymous page doesn't clear page->mapping even if it
+ * is removed from rmap.
+ */
+ if (!page->mapping ||
+ (PageAnon(page) && !page_mapcount(page))) {
unlock_page(page);
goto retry;
}
Index: test-2.6.26-rc5-mm3/mm/migrate.c
===================================================================
--- test-2.6.26-rc5-mm3.orig/mm/migrate.c
+++ test-2.6.26-rc5-mm3/mm/migrate.c
@@ -67,9 +67,7 @@ int putback_lru_pages(struct list_head *

list_for_each_entry_safe(page, page2, l, lru) {
list_del(&page->lru);
- lock_page(page);
- if (putback_lru_page(page))
- unlock_page(page);
+ putback_lru_page(page);
count++;
}
return count;
@@ -571,7 +569,6 @@ static int fallback_migrate_page(struct
static int move_to_new_page(struct page *newpage, struct page *page)
{
struct address_space *mapping;
- int unlock = 1;
int rc;

/*
@@ -610,12 +607,11 @@ static int move_to_new_page(struct page
* Put back on LRU while holding page locked to
* handle potential race with, e.g., munlock()
*/
- unlock = putback_lru_page(newpage);
+ putback_lru_page(newpage);
} else
newpage->mapping = NULL;

- if (unlock)
- unlock_page(newpage);
+ unlock_page(newpage);

return rc;
}
@@ -632,7 +628,6 @@ static int unmap_and_move(new_page_t get
struct page *newpage = get_new_page(page, private, &result);
int rcu_locked = 0;
int charge = 0;
- int unlock = 1;

if (!newpage)
return -ENOMEM;
@@ -713,6 +708,7 @@ rcu_unlock:
rcu_read_unlock();

unlock:
+ unlock_page(page);

if (rc != -EAGAIN) {
/*
@@ -722,18 +718,9 @@ unlock:
* restored.
*/
list_del(&page->lru);
- if (!page->mapping) {
- VM_BUG_ON(page_count(page) != 1);
- unlock_page(page);
- put_page(page); /* just free the old page */
- goto end_migration;
- } else
- unlock = putback_lru_page(page);
+ putback_lru_page(page);
}

- if (unlock)
- unlock_page(page);
-
end_migration:
if (!charge)
mem_cgroup_end_migration(newpage);
Index: test-2.6.26-rc5-mm3/mm/internal.h
===================================================================
--- test-2.6.26-rc5-mm3.orig/mm/internal.h
+++ test-2.6.26-rc5-mm3/mm/internal.h
@@ -43,7 +43,7 @@ static inline void __put_page(struct pag
* in mm/vmscan.c:
*/
extern int isolate_lru_page(struct page *page);
-extern int putback_lru_page(struct page *page);
+extern void putback_lru_page(struct page *page);

/*
* in mm/page_alloc.c

2008-06-18 10:21:07

[permalink] [raw]

Subject: Re: [Bad page] trying to free locked page? (Re: [PATCH][RFC] fix kernel BUG at mm/migrate.c:719! in 2.6.26-rc5-mm3)

> > > I got bad_page after hundreds times of page migration.
> > > It seems that a locked page is being freed.
> >
> > I can't reproduce this bad page.
> > I'll try again tomorrow ;)
>
> OK. I'll report on my test more precisely.

Thank you verbose explain.
I ran its testcase >3H today.
but unfortunately, I couldn't reproduce it.

Hmm...

2008-06-18 11:37:33

[permalink] [raw]

Subject: Re: [Experimental][PATCH] putback_lru_page rework

Hi kame-san,

> putback_lru_page() in this patch has a new concepts.
> When it adds page to unevictable list, it checks the status is
> changed or not again. if changed, retry to putback.

it seems good idea :)
this patch can reduce lock_page() call.

> - } else if (page_evictable(page, NULL)) {
> - /*
> - * For evictable pages, we can use the cache.
> - * In event of a race, worst case is we end up with an
> - * unevictable page on [in]active list.
> - * We know how to handle that.
> - */

I think this comment is useful.
Why do you want kill it?

> +redo:
> + lru = !!TestClearPageActive(page);
> + if (page_evictable(page, NULL)) {
> lru += page_is_file_cache(page);
> lru_cache_add_lru(page, lru);
> - mem_cgroup_move_lists(page, lru);
> -#ifdef CONFIG_UNEVICTABLE_LRU
> - if (was_unevictable)
> - count_vm_event(NORECL_PGRESCUED);
> -#endif
> } else {
> - /*
> - * Put unevictable pages directly on zone's unevictable
> - * list.
> - */

ditto.

> + lru = LRU_UNEVICTABLE;
> add_page_to_unevictable_list(page);
> - mem_cgroup_move_lists(page, LRU_UNEVICTABLE);
> -#ifdef CONFIG_UNEVICTABLE_LRU
> - if (!was_unevictable)
> - count_vm_event(NORECL_PGCULLED);
> -#endif
> }
> + mem_cgroup_move_lists(page, lru);
> +
> + /*
> + * page's status can change while we move it among lru. If an evictable
> + * page is on unevictable list, it never be freed. To avoid that,
> + * check after we added it to the list, again.
> + */
> + if (lru == LRU_UNEVICTABLE && page_evictable(page, NULL)) {
> + if (!isolate_lru_page(page)) {
> + put_page(page);
> + goto redo;

No.
We should treat carefully unevictable -> unevictable moving too.

> + }
> + /* This means someone else dropped this page from LRU
> + * So, it will be freed or putback to LRU again. There is
> + * nothing to do here.
> + */
> + }
> +
> + if (was_unevictable && lru != LRU_UNEVICTABLE)
> + count_vm_event(NORECL_PGRESCUED);
> + else if (!was_unevictable && lru == LRU_UNEVICTABLE)
> + count_vm_event(NORECL_PGCULLED);
>
> put_page(page); /* drop ref from isolate */
> - return ret; /* ret => "page still locked" */
> }
> -
> -/*
> - * Cull page that shrink_*_list() has detected to be unevictable
> - * under page lock to close races with other tasks that might be making
> - * the page evictable. Avoid stranding an evictable page on the
> - * unevictable list.
> - */
> -static void cull_unevictable_page(struct page *page)
> +#else
> +void putback_lru_page(struct page *page)
> {
> - lock_page(page);
> - if (putback_lru_page(page))
> - unlock_page(page);
> + int lru;
> + VM_BUG_ON(PageLRU(page));
> +
> + lru = !!TestClearPageActive(page) + page_is_file_cache(page);
> + lru_cache_add_lru(page, lru);
> + mem_cgroup_move_lists(page, lru);
> + put_page(page);
> }
> +#endif
>
> /*
> * shrink_page_list() returns the number of reclaimed pages
> @@ -746,8 +736,8 @@ free_it:
> continue;
>
> cull_mlocked:
> - if (putback_lru_page(page))
> - unlock_page(page);
> + unlock_page(page);
> + putback_lru_page(page);
> continue;
>
> activate_locked:
> @@ -1127,7 +1117,7 @@ static unsigned long shrink_inactive_lis
> list_del(&page->lru);
> if (unlikely(!page_evictable(page, NULL))) {
> spin_unlock_irq(&zone->lru_lock);
> - cull_unevictable_page(page);
> + putback_lru_page(page);
> spin_lock_irq(&zone->lru_lock);
> continue;
> }
> @@ -1231,7 +1221,7 @@ static void shrink_active_list(unsigned
> list_del(&page->lru);
>
> if (unlikely(!page_evictable(page, NULL))) {
> - cull_unevictable_page(page);
> + putback_lru_page(page);
> continue;
> }
>
> @@ -2393,8 +2383,6 @@ int zone_reclaim(struct zone *zone, gfp_
> int page_evictable(struct page *page, struct vm_area_struct *vma)
> {
>
> - VM_BUG_ON(PageUnevictable(page));
> -
> if (mapping_unevictable(page_mapping(page)))
> return 0;

Why do you remove this?

> @@ -169,7 +166,8 @@ static int __mlock_vma_pages_range(struc
>
> /*
> * get_user_pages makes pages present if we are
> - * setting mlock.
> + * setting mlock. and this extra reference count will
> + * disable migration of this page.
> */
> ret = get_user_pages(current, mm, addr,
> min_t(int, nr_pages, ARRAY_SIZE(pages)),
> @@ -197,14 +195,8 @@ static int __mlock_vma_pages_range(struc
> for (i = 0; i < ret; i++) {
> struct page *page = pages[i];
>
> - /*
> - * page might be truncated or migrated out from under
> - * us. Check after acquiring page lock.
> - */
> - lock_page(page);
> - if (page->mapping)
> + if (page_mapcount(page))
> mlock_vma_page(page);
> - unlock_page(page);
> put_page(page); /* ref from get_user_pages() */
>
> /*
> @@ -240,6 +232,9 @@ static int __munlock_pte_handler(pte_t *
> struct page *page;
> pte_t pte;
>
> + /*
> + * page is never be unmapped by page-reclaim. we lock this page now.
> + */
> retry:
> pte = *ptep;
> /*
> @@ -261,7 +256,15 @@ retry:
> goto out;
>
> lock_page(page);
> - if (!page->mapping) {
> + /*
> + * Because we lock page here, we have to check 2 cases.
> + * - the page is migrated.
> + * - the page is truncated (file-cache only)
> + * Note: Anonymous page doesn't clear page->mapping even if it
> + * is removed from rmap.
> + */
> + if (!page->mapping ||
> + (PageAnon(page) && !page_mapcount(page))) {
> unlock_page(page);
> goto retry;
> }
> Index: test-2.6.26-rc5-mm3/mm/migrate.c
> ===================================================================
> --- test-2.6.26-rc5-mm3.orig/mm/migrate.c
> +++ test-2.6.26-rc5-mm3/mm/migrate.c
> @@ -67,9 +67,7 @@ int putback_lru_pages(struct list_head *
>
> list_for_each_entry_safe(page, page2, l, lru) {
> list_del(&page->lru);
> - lock_page(page);
> - if (putback_lru_page(page))
> - unlock_page(page);
> + putback_lru_page(page);
> count++;
> }
> return count;
> @@ -571,7 +569,6 @@ static int fallback_migrate_page(struct
> static int move_to_new_page(struct page *newpage, struct page *page)
> {
> struct address_space *mapping;
> - int unlock = 1;
> int rc;
>
> /*
> @@ -610,12 +607,11 @@ static int move_to_new_page(struct page
> * Put back on LRU while holding page locked to
> * handle potential race with, e.g., munlock()
> */

this comment isn't true.

> - unlock = putback_lru_page(newpage);
> + putback_lru_page(newpage);
> } else
> newpage->mapping = NULL;

originally move_to_lru() called in unmap_and_move().
unevictable infrastructure patch move to this point for
calling putback_lru_page() under page locked.

So, your patch remove page locked dependency.
move to unmap_and_move() again is better.

it become page lock holding time reducing.

>
> - if (unlock)
> - unlock_page(newpage);
> + unlock_page(newpage);
>
> return rc;
> }
> @@ -632,7 +628,6 @@ static int unmap_and_move(new_page_t get
> struct page *newpage = get_new_page(page, private, &result);
> int rcu_locked = 0;
> int charge = 0;
> - int unlock = 1;
>
> if (!newpage)
> return -ENOMEM;
> @@ -713,6 +708,7 @@ rcu_unlock:
> rcu_read_unlock();
>
> unlock:
> + unlock_page(page);
>
> if (rc != -EAGAIN) {
> /*
> @@ -722,18 +718,9 @@ unlock:
> * restored.
> */
> list_del(&page->lru);
> - if (!page->mapping) {
> - VM_BUG_ON(page_count(page) != 1);
> - unlock_page(page);
> - put_page(page); /* just free the old page */
> - goto end_migration;
> - } else
> - unlock = putback_lru_page(page);
> + putback_lru_page(page);
> }
>
> - if (unlock)
> - unlock_page(page);
> -
> end_migration:
> if (!charge)
> mem_cgroup_end_migration(newpage);
> Index: test-2.6.26-rc5-mm3/mm/internal.h
> ===================================================================
> --- test-2.6.26-rc5-mm3.orig/mm/internal.h
> +++ test-2.6.26-rc5-mm3/mm/internal.h
> @@ -43,7 +43,7 @@ static inline void __put_page(struct pag
> * in mm/vmscan.c:
> */
> extern int isolate_lru_page(struct page *page);
> -extern int putback_lru_page(struct page *page);
> +extern void putback_lru_page(struct page *page);
>
> /*
> * in mm/page_alloc.c
>

2008-06-18 11:51:19

[permalink] [raw]

Subject: Re: [Experimental][PATCH] putback_lru_page rework

On Wed, 18 Jun 2008 20:36:52 +0900
KOSAKI Motohiro <[email protected]> wrote:

> Hi kame-san,
>
> > putback_lru_page() in this patch has a new concepts.
> > When it adds page to unevictable list, it checks the status is
> > changed or not again. if changed, retry to putback.
>
> it seems good idea :)
> this patch can reduce lock_page() call.
>
yes.

>
> > - } else if (page_evictable(page, NULL)) {
> > - /*
> > - * For evictable pages, we can use the cache.
> > - * In event of a race, worst case is we end up with an
> > - * unevictable page on [in]active list.
> > - * We know how to handle that.
> > - */
>
> I think this comment is useful.
> Why do you want kill it?
>
Oh, my mistake.

> > + mem_cgroup_move_lists(page, lru);
> > +
> > + /*
> > + * page's status can change while we move it among lru. If an evictable
> > + * page is on unevictable list, it never be freed. To avoid that,
> > + * check after we added it to the list, again.
> > + */
> > + if (lru == LRU_UNEVICTABLE && page_evictable(page, NULL)) {
> > + if (!isolate_lru_page(page)) {
> > + put_page(page);
> > + goto redo;
>
> No.
> We should treat carefully unevictable -> unevictable moving too.
>
This lru is the destination ;)

>
> > + }
> > + /* This means someone else dropped this page from LRU
> > + * So, it will be freed or putback to LRU again. There is
> > + * nothing to do here.
> > + */
> > + }
> > +
> > + if (was_unevictable && lru != LRU_UNEVICTABLE)
> > + count_vm_event(NORECL_PGRESCUED);
> > + else if (!was_unevictable && lru == LRU_UNEVICTABLE)
> > + count_vm_event(NORECL_PGCULLED);
> >
> > put_page(page); /* drop ref from isolate */
> > - return ret; /* ret => "page still locked" */
> > }
> > -
> > -/*
> > - * Cull page that shrink_*_list() has detected to be unevictable
> > - * under page lock to close races with other tasks that might be making
> > - * the page evictable. Avoid stranding an evictable page on the
> > - * unevictable list.
> > - */
> > -static void cull_unevictable_page(struct page *page)
> > +#else
> > +void putback_lru_page(struct page *page)
> > {
> > - lock_page(page);
> > - if (putback_lru_page(page))
> > - unlock_page(page);
> > + int lru;
> > + VM_BUG_ON(PageLRU(page));
> > +
> > + lru = !!TestClearPageActive(page) + page_is_file_cache(page);
> > + lru_cache_add_lru(page, lru);
> > + mem_cgroup_move_lists(page, lru);
> > + put_page(page);
> > }
> > +#endif
> >
> > /*
> > * shrink_page_list() returns the number of reclaimed pages
> > @@ -746,8 +736,8 @@ free_it:
> > continue;
> >
> > cull_mlocked:
> > - if (putback_lru_page(page))
> > - unlock_page(page);
> > + unlock_page(page);
> > + putback_lru_page(page);
> > continue;
> >
> > activate_locked:
> > @@ -1127,7 +1117,7 @@ static unsigned long shrink_inactive_lis
> > list_del(&page->lru);
> > if (unlikely(!page_evictable(page, NULL))) {
> > spin_unlock_irq(&zone->lru_lock);
> > - cull_unevictable_page(page);
> > + putback_lru_page(page);
> > spin_lock_irq(&zone->lru_lock);
> > continue;
> > }
> > @@ -1231,7 +1221,7 @@ static void shrink_active_list(unsigned
> > list_del(&page->lru);
> >
> > if (unlikely(!page_evictable(page, NULL))) {
> > - cull_unevictable_page(page);
> > + putback_lru_page(page);
> > continue;
> > }
> >
> > @@ -2393,8 +2383,6 @@ int zone_reclaim(struct zone *zone, gfp_
> > int page_evictable(struct page *page, struct vm_area_struct *vma)
> > {
> >
> > - VM_BUG_ON(PageUnevictable(page));
> > -
> > if (mapping_unevictable(page_mapping(page)))
> > return 0;
>
> Why do you remove this?
>
I caught panci here ;)
maybe
==
if (lru == LRU_UNEVICTABLE && page_evictable(page, NULL))
==
check is.

>
>
>
> > @@ -169,7 +166,8 @@ static int __mlock_vma_pages_range(struc
> >
> > /*
> > * get_user_pages makes pages present if we are
> > - * setting mlock.
> > + * setting mlock. and this extra reference count will
> > + * disable migration of this page.
> > */
> > ret = get_user_pages(current, mm, addr,
> > min_t(int, nr_pages, ARRAY_SIZE(pages)),
> > @@ -197,14 +195,8 @@ static int __mlock_vma_pages_range(struc
> > for (i = 0; i < ret; i++) {
> > struct page *page = pages[i];
> >
> > - /*
> > - * page might be truncated or migrated out from under
> > - * us. Check after acquiring page lock.
> > - */
> > - lock_page(page);
> > - if (page->mapping)
> > + if (page_mapcount(page))
> > mlock_vma_page(page);
> > - unlock_page(page);
> > put_page(page); /* ref from get_user_pages() */
> >
> > /*
> > @@ -240,6 +232,9 @@ static int __munlock_pte_handler(pte_t *
> > struct page *page;
> > pte_t pte;
> >
> > + /*
> > + * page is never be unmapped by page-reclaim. we lock this page now.
> > + */
> > retry:
> > pte = *ptep;
> > /*
> > @@ -261,7 +256,15 @@ retry:
> > goto out;
> >
> > lock_page(page);
> > - if (!page->mapping) {
> > + /*
> > + * Because we lock page here, we have to check 2 cases.
> > + * - the page is migrated.
> > + * - the page is truncated (file-cache only)
> > + * Note: Anonymous page doesn't clear page->mapping even if it
> > + * is removed from rmap.
> > + */
> > + if (!page->mapping ||
> > + (PageAnon(page) && !page_mapcount(page))) {
> > unlock_page(page);
> > goto retry;
> > }
> > Index: test-2.6.26-rc5-mm3/mm/migrate.c
> > ===================================================================
> > --- test-2.6.26-rc5-mm3.orig/mm/migrate.c
> > +++ test-2.6.26-rc5-mm3/mm/migrate.c
> > @@ -67,9 +67,7 @@ int putback_lru_pages(struct list_head *
> >
> > list_for_each_entry_safe(page, page2, l, lru) {
> > list_del(&page->lru);
> > - lock_page(page);
> > - if (putback_lru_page(page))
> > - unlock_page(page);
> > + putback_lru_page(page);
> > count++;
> > }
> > return count;
> > @@ -571,7 +569,6 @@ static int fallback_migrate_page(struct
> > static int move_to_new_page(struct page *newpage, struct page *page)
> > {
> > struct address_space *mapping;
> > - int unlock = 1;
> > int rc;
> >
> > /*
> > @@ -610,12 +607,11 @@ static int move_to_new_page(struct page
> > * Put back on LRU while holding page locked to
> > * handle potential race with, e.g., munlock()
> > */
>
> this comment isn't true.
>
yes.

> > - unlock = putback_lru_page(newpage);
> > + putback_lru_page(newpage);
> > } else
> > newpage->mapping = NULL;
>
> originally move_to_lru() called in unmap_and_move().
> unevictable infrastructure patch move to this point for
> calling putback_lru_page() under page locked.
>
> So, your patch remove page locked dependency.
> move to unmap_and_move() again is better.
>
> it become page lock holding time reducing.
>
ok, will look into again.

Thanks,
-Kame

> >
> > - if (unlock)
> > - unlock_page(newpage);
> > + unlock_page(newpage);
> >
> > return rc;
> > }
> > @@ -632,7 +628,6 @@ static int unmap_and_move(new_page_t get
> > struct page *newpage = get_new_page(page, private, &result);
> > int rcu_locked = 0;
> > int charge = 0;
> > - int unlock = 1;
> >
> > if (!newpage)
> > return -ENOMEM;
> > @@ -713,6 +708,7 @@ rcu_unlock:
> > rcu_read_unlock();
> >
> > unlock:
> > + unlock_page(page);
> >
> > if (rc != -EAGAIN) {
> > /*
> > @@ -722,18 +718,9 @@ unlock:
> > * restored.
> > */
> > list_del(&page->lru);
> > - if (!page->mapping) {
> > - VM_BUG_ON(page_count(page) != 1);
> > - unlock_page(page);
> > - put_page(page); /* just free the old page */
> > - goto end_migration;
> > - } else
> > - unlock = putback_lru_page(page);
> > + putback_lru_page(page);
> > }
> >
> > - if (unlock)
> > - unlock_page(page);
> > -
> > end_migration:
> > if (!charge)
> > mem_cgroup_end_migration(newpage);
> > Index: test-2.6.26-rc5-mm3/mm/internal.h
> > ===================================================================
> > --- test-2.6.26-rc5-mm3.orig/mm/internal.h
> > +++ test-2.6.26-rc5-mm3/mm/internal.h
> > @@ -43,7 +43,7 @@ static inline void __put_page(struct pag
> > * in mm/vmscan.c:
> > */
> > extern int isolate_lru_page(struct page *page);
> > -extern int putback_lru_page(struct page *page);
> > +extern void putback_lru_page(struct page *page);
> >
> > /*
> > * in mm/page_alloc.c
> >
>
>
>
>

2008-06-18 14:51:18

[permalink] [raw]

Subject: Re: [Experimental][PATCH] putback_lru_page rework

Hi, Kamezawa-san.

Sorry for my late reply, and thank you for your patch.

> This patch tries to make putback_lru_pages() to be lock_page() free.
> (Of course, some callers must take the lock.)
>
I like this idea.

I'll test it tomorrow.

Thanks,
Daisuke Nishimura.

2008-06-18 17:56:26

by Daniel Walker

[permalink] [raw]

Subject: Re: 2.6.26-rc5-mm3

On Fri, 2008-06-13 at 00:32 +0100, Byron Bradley wrote:
> Looks like x86 and ARM both fail to boot if PROFILE_LIKELY, FTRACE and
> DYNAMIC_FTRACE are selected. If any one of those three are disabled it
> boots (or fails in some other way which I'm looking at now). The serial
> console output from both machines when they fail to boot is below, let me
> know if there is any other information I can provide.

I was able to reproduce a hang on x86 with those options. The patch
below is a potential fix. I think we don't want to trace
do_check_likely(), since the ftrace internals might use likely/unlikely
macro's which will just cause recursion back to do_check_likely()..

Signed-off-by: Daniel Walker <[email protected]>

---
lib/likely_prof.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.25/lib/likely_prof.c
===================================================================
--- linux-2.6.25.orig/lib/likely_prof.c
+++ linux-2.6.25/lib/likely_prof.c
@@ -22,7 +22,7 @@

static struct likeliness *likeliness_head;

-int do_check_likely(struct likeliness *likeliness, unsigned int ret)
+int notrace do_check_likely(struct likeliness *likeliness, unsigned int ret)
{
static unsigned long likely_lock;

2008-06-18 18:23:26

[permalink] [raw]

Subject: Re: [Experimental][PATCH] putback_lru_page rework

On Wed, 2008-06-18 at 18:40 +0900, KAMEZAWA Hiroyuki wrote:
> Lee-san, how about this ?
> Tested on x86-64 and tried Nisimura-san's test at el. works good now.

I have been testing with my work load on both ia64 and x86_64 and it
seems to be working well. I'll let them run for a day or so.

> -Kame
> ==
> putback_lru_page()/unevictable page handling rework.
>
> Now, putback_lru_page() requires that the page is locked.
> And in some special case, implicitly unlock it.
>
> This patch tries to make putback_lru_pages() to be lock_page() free.
> (Of course, some callers must take the lock.)
>
> The main reason that putback_lru_page() assumes that page is locked
> is to avoid the change in page's status among Mlocked/Not-Mlocked.
>
> Once it is added to unevictable list, the page is removed from
> unevictable list only when page is munlocked. (there are other special
> case. but we ignore the special case.)
> So, status change during putback_lru_page() is fatal and page should
> be locked.
>
> putback_lru_page() in this patch has a new concepts.
> When it adds page to unevictable list, it checks the status is
> changed or not again. if changed, retry to putback.

Given that the race that would activate this retry is likely quite rare,
this approach makes sense.

>
> This patche changes also caller side and cleaning up lock/unlock_page().
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

A couple of minor comments below, but:

Acked-by: Lee Schermerhorn <[email protected]>

>
> ---
> mm/internal.h | 2 -
> mm/migrate.c | 23 +++----------
> mm/mlock.c | 24 +++++++-------
> mm/vmscan.c | 96 +++++++++++++++++++++++++---------------------------------
> 4 files changed, 61 insertions(+), 84 deletions(-)
>
> Index: test-2.6.26-rc5-mm3/mm/vmscan.c
> ===================================================================
> --- test-2.6.26-rc5-mm3.orig/mm/vmscan.c
> +++ test-2.6.26-rc5-mm3/mm/vmscan.c
> @@ -486,73 +486,63 @@ int remove_mapping(struct address_space
> * Page may still be unevictable for other reasons.
> *
> * lru_lock must not be held, interrupts must be enabled.
> - * Must be called with page locked.
> - *
> - * return 1 if page still locked [not truncated], else 0
> */
> -int putback_lru_page(struct page *page)
> +#ifdef CONFIG_UNEVICTABLE_LRU
> +void putback_lru_page(struct page *page)
> {
> int lru;
> - int ret = 1;
> int was_unevictable;
>
> - VM_BUG_ON(!PageLocked(page));
> VM_BUG_ON(PageLRU(page));
>
> - lru = !!TestClearPageActive(page);
> was_unevictable = TestClearPageUnevictable(page); /* for page_evictable() */
>
> - if (unlikely(!page->mapping)) {
> - /*
> - * page truncated. drop lock as put_page() will
> - * free the page.
> - */
> - VM_BUG_ON(page_count(page) != 1);
> - unlock_page(page);
> - ret = 0;
> - } else if (page_evictable(page, NULL)) {
> - /*
> - * For evictable pages, we can use the cache.
> - * In event of a race, worst case is we end up with an
> - * unevictable page on [in]active list.
> - * We know how to handle that.
> - */
> +redo:
> + lru = !!TestClearPageActive(page);
> + if (page_evictable(page, NULL)) {
> lru += page_is_file_cache(page);
> lru_cache_add_lru(page, lru);
> - mem_cgroup_move_lists(page, lru);
> -#ifdef CONFIG_UNEVICTABLE_LRU
> - if (was_unevictable)
> - count_vm_event(NORECL_PGRESCUED);
> -#endif
> } else {
> - /*
> - * Put unevictable pages directly on zone's unevictable
> - * list.
> - */
> + lru = LRU_UNEVICTABLE;
> add_page_to_unevictable_list(page);
> - mem_cgroup_move_lists(page, LRU_UNEVICTABLE);
> -#ifdef CONFIG_UNEVICTABLE_LRU
> - if (!was_unevictable)
> - count_vm_event(NORECL_PGCULLED);
> -#endif
> }
> + mem_cgroup_move_lists(page, lru);
> +
> + /*
> + * page's status can change while we move it among lru. If an evictable
> + * page is on unevictable list, it never be freed. To avoid that,
> + * check after we added it to the list, again.
> + */
> + if (lru == LRU_UNEVICTABLE && page_evictable(page, NULL)) {
> + if (!isolate_lru_page(page)) {
> + put_page(page);
> + goto redo;
> + }
> + /* This means someone else dropped this page from LRU
> + * So, it will be freed or putback to LRU again. There is
> + * nothing to do here.
> + */
> + }
> +
> + if (was_unevictable && lru != LRU_UNEVICTABLE)
> + count_vm_event(NORECL_PGRESCUED);
> + else if (!was_unevictable && lru == LRU_UNEVICTABLE)
> + count_vm_event(NORECL_PGCULLED);
>
> put_page(page); /* drop ref from isolate */
> - return ret; /* ret => "page still locked" */
> }
> -
> -/*
> - * Cull page that shrink_*_list() has detected to be unevictable
> - * under page lock to close races with other tasks that might be making
> - * the page evictable. Avoid stranding an evictable page on the
> - * unevictable list.
> - */
> -static void cull_unevictable_page(struct page *page)
> +#else
> +void putback_lru_page(struct page *page)
> {
> - lock_page(page);
> - if (putback_lru_page(page))
> - unlock_page(page);
> + int lru;
> + VM_BUG_ON(PageLRU(page));
> +
> + lru = !!TestClearPageActive(page) + page_is_file_cache(page);
> + lru_cache_add_lru(page, lru);
> + mem_cgroup_move_lists(page, lru);
> + put_page(page);
> }
> +#endif
>
> /*
> * shrink_page_list() returns the number of reclaimed pages
> @@ -746,8 +736,8 @@ free_it:
> continue;
>
> cull_mlocked:
> - if (putback_lru_page(page))
> - unlock_page(page);
> + unlock_page(page);
> + putback_lru_page(page);
> continue;
>
> activate_locked:
> @@ -1127,7 +1117,7 @@ static unsigned long shrink_inactive_lis
> list_del(&page->lru);
> if (unlikely(!page_evictable(page, NULL))) {
> spin_unlock_irq(&zone->lru_lock);
> - cull_unevictable_page(page);
> + putback_lru_page(page);
> spin_lock_irq(&zone->lru_lock);
> continue;
> }
> @@ -1231,7 +1221,7 @@ static void shrink_active_list(unsigned
> list_del(&page->lru);
>
> if (unlikely(!page_evictable(page, NULL))) {
> - cull_unevictable_page(page);
> + putback_lru_page(page);
> continue;
> }
>
> @@ -2393,8 +2383,6 @@ int zone_reclaim(struct zone *zone, gfp_
> int page_evictable(struct page *page, struct vm_area_struct *vma)
> {
>
> - VM_BUG_ON(PageUnevictable(page));
> -
> if (mapping_unevictable(page_mapping(page)))
> return 0;
>
> Index: test-2.6.26-rc5-mm3/mm/mlock.c
> ===================================================================
> --- test-2.6.26-rc5-mm3.orig/mm/mlock.c
> +++ test-2.6.26-rc5-mm3/mm/mlock.c
> @@ -55,7 +55,6 @@ EXPORT_SYMBOL(can_do_mlock);
> */
> void __clear_page_mlock(struct page *page)
> {
> - VM_BUG_ON(!PageLocked(page)); /* for LRU isolate/putback */
>
> dec_zone_page_state(page, NR_MLOCK);
> count_vm_event(NORECL_PGCLEARED);
> @@ -79,7 +78,6 @@ void __clear_page_mlock(struct page *pag
> */
> void mlock_vma_page(struct page *page)
> {
> - BUG_ON(!PageLocked(page));
>
> if (!TestSetPageMlocked(page)) {
> inc_zone_page_state(page, NR_MLOCK);
> @@ -109,7 +107,6 @@ void mlock_vma_page(struct page *page)
> */
> static void munlock_vma_page(struct page *page)
> {
> - BUG_ON(!PageLocked(page));
>
> if (TestClearPageMlocked(page)) {
> dec_zone_page_state(page, NR_MLOCK);
> @@ -169,7 +166,8 @@ static int __mlock_vma_pages_range(struc
>
> /*
> * get_user_pages makes pages present if we are
> - * setting mlock.
> + * setting mlock. and this extra reference count will
> + * disable migration of this page.
> */
> ret = get_user_pages(current, mm, addr,
> min_t(int, nr_pages, ARRAY_SIZE(pages)),
> @@ -197,14 +195,8 @@ static int __mlock_vma_pages_range(struc
> for (i = 0; i < ret; i++) {
> struct page *page = pages[i];
>
> - /*
> - * page might be truncated or migrated out from under
> - * us. Check after acquiring page lock.
> - */
> - lock_page(page);

Hmmm. Still thinking about this. No need to protect against in flight
truncation or migration?

> - if (page->mapping)
> + if (page_mapcount(page))
> mlock_vma_page(page);
> - unlock_page(page);
> put_page(page); /* ref from get_user_pages() */
>
> /*
> @@ -240,6 +232,9 @@ static int __munlock_pte_handler(pte_t *
> struct page *page;
> pte_t pte;
>
> + /*
> + * page is never be unmapped by page-reclaim. we lock this page now.
> + */

I don't understand what you're trying to say here. That is, what the
point of this comment is...

> retry:
> pte = *ptep;
> /*
> @@ -261,7 +256,15 @@ retry:
> goto out;
>
> lock_page(page);
> - if (!page->mapping) {
> + /*
> + * Because we lock page here, we have to check 2 cases.
> + * - the page is migrated.
> + * - the page is truncated (file-cache only)
> + * Note: Anonymous page doesn't clear page->mapping even if it
> + * is removed from rmap.
> + */
> + if (!page->mapping ||
> + (PageAnon(page) && !page_mapcount(page))) {
> unlock_page(page);
> goto retry;
> }
> Index: test-2.6.26-rc5-mm3/mm/migrate.c
> ===================================================================
> --- test-2.6.26-rc5-mm3.orig/mm/migrate.c
> +++ test-2.6.26-rc5-mm3/mm/migrate.c
> @@ -67,9 +67,7 @@ int putback_lru_pages(struct list_head *
>
> list_for_each_entry_safe(page, page2, l, lru) {
> list_del(&page->lru);
> - lock_page(page);
> - if (putback_lru_page(page))
> - unlock_page(page);
> + putback_lru_page(page);
> count++;
> }
> return count;
> @@ -571,7 +569,6 @@ static int fallback_migrate_page(struct
> static int move_to_new_page(struct page *newpage, struct page *page)
> {
> struct address_space *mapping;
> - int unlock = 1;
> int rc;
>
> /*
> @@ -610,12 +607,11 @@ static int move_to_new_page(struct page
> * Put back on LRU while holding page locked to
> * handle potential race with, e.g., munlock()
> */
> - unlock = putback_lru_page(newpage);
> + putback_lru_page(newpage);
> } else
> newpage->mapping = NULL;
>
> - if (unlock)
> - unlock_page(newpage);
> + unlock_page(newpage);
>
> return rc;
> }
> @@ -632,7 +628,6 @@ static int unmap_and_move(new_page_t get
> struct page *newpage = get_new_page(page, private, &result);
> int rcu_locked = 0;
> int charge = 0;
> - int unlock = 1;
>
> if (!newpage)
> return -ENOMEM;
> @@ -713,6 +708,7 @@ rcu_unlock:
> rcu_read_unlock();
>
> unlock:
> + unlock_page(page);
>
> if (rc != -EAGAIN) {
> /*
> @@ -722,18 +718,9 @@ unlock:
> * restored.
> */
> list_del(&page->lru);
> - if (!page->mapping) {
> - VM_BUG_ON(page_count(page) != 1);
> - unlock_page(page);
> - put_page(page); /* just free the old page */
> - goto end_migration;
> - } else
> - unlock = putback_lru_page(page);
> + putback_lru_page(page);
> }
>
> - if (unlock)
> - unlock_page(page);
> -
> end_migration:
> if (!charge)
> mem_cgroup_end_migration(newpage);
> Index: test-2.6.26-rc5-mm3/mm/internal.h
> ===================================================================
> --- test-2.6.26-rc5-mm3.orig/mm/internal.h
> +++ test-2.6.26-rc5-mm3/mm/internal.h
> @@ -43,7 +43,7 @@ static inline void __put_page(struct pag
> * in mm/vmscan.c:
> */
> extern int isolate_lru_page(struct page *page);
> -extern int putback_lru_page(struct page *page);
> +extern void putback_lru_page(struct page *page);
>
> /*
> * in mm/page_alloc.c
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2008-06-19 00:17:43

[permalink] [raw]

Subject: Re: [Experimental][PATCH] putback_lru_page rework

On Wed, 18 Jun 2008 14:21:06 -0400
Lee Schermerhorn <[email protected]> wrote:

> On Wed, 2008-06-18 at 18:40 +0900, KAMEZAWA Hiroyuki wrote:
> > Lee-san, how about this ?
> > Tested on x86-64 and tried Nisimura-san's test at el. works good now.
>
> I have been testing with my work load on both ia64 and x86_64 and it
> seems to be working well. I'll let them run for a day or so.
>
thank you.
<snip>

> > @@ -240,6 +232,9 @@ static int __munlock_pte_handler(pte_t *
> > struct page *page;
> > pte_t pte;
> >
> > + /*
> > + * page is never be unmapped by page-reclaim. we lock this page now.
> > + */
>
> I don't understand what you're trying to say here. That is, what the
> point of this comment is...
>
We access the page-table without taking pte_lock. But this vm is MLOCKED
and migration-race is handled. So we don't need to be too nervous to access
the pte. I'll consider more meaningful words.

Thanks,
-Kame

2008-06-19 07:00:25

by Hidehiro Kawai

[permalink] [raw]

Subject: [BUG][PATCH -mm] avoid BUG() in __stop_machine_run()

When a process loads a kernel module, __stop_machine_run() is called, and
it calls sched_setscheduler() to give newly created kernel threads highest
priority. However, the process can have no CAP_SYS_NICE which required
for sched_setscheduler() to increase the priority. For example, SystemTap
loads its module with only CAP_SYS_MODULE. In this case,
sched_setscheduler() returns -EPERM, then BUG() is called.

Failure of sched_setscheduler() wouldn't be a real problem, so this
patch just ignores it.
Or, should we give the CAP_SYS_NICE capability temporarily?

Signed-off-by: Hidehiro Kawai <[email protected]>
---
kernel/stop_machine.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

Index: linux-2.6.26-rc5-mm3/kernel/stop_machine.c
===================================================================
--- linux-2.6.26-rc5-mm3.orig/kernel/stop_machine.c
+++ linux-2.6.26-rc5-mm3/kernel/stop_machine.c
@@ -143,8 +143,7 @@ int __stop_machine_run(int (*fn)(void *)
kthread_bind(threads[i], i);

/* Make it highest prio. */
- if (sched_setscheduler(threads[i], SCHED_FIFO, &param) != 0)
- BUG();
+ sched_setscheduler(threads[i], SCHED_FIFO, &param);
}

/* We've created all the threads. Wake them all: hold this CPU so one

2008-06-19 08:02:18

[permalink] [raw]

Subject: Re: [Experimental][PATCH] putback_lru_page rework

> > > - unlock = putback_lru_page(newpage);
> > > + putback_lru_page(newpage);
> > > } else
> > > newpage->mapping = NULL;
> >
> > originally move_to_lru() called in unmap_and_move().
> > unevictable infrastructure patch move to this point for
> > calling putback_lru_page() under page locked.
> >
> > So, your patch remove page locked dependency.
> > move to unmap_and_move() again is better.
> >
> > it become page lock holding time reducing.
> >
> ok, will look into again.
>

I agree with Kosaki-san.

And VM_BUG_ON(page_count(newpage) != 1) in unmap_and_move()
is not correct again, IMHO.
I got this BUG actually when testing this patch(with
migratin_entry_wait fix).

unmap_and_move()
move_to_new_page()
migrate_page()
remove_migration_ptes()
putback_lru_page() (*1)
:
if (!newpage->mapping) (*2)
VM_BUG_ON(page_count(newpage) != 1)

If a anonymous page(without mapping) is migrated successfully,
this page is moved back to lru by putback_lru_page()(*1),
and the page count becomes 1(pte only).

At the same time(between *1 and *2), if the process
that owns this page are freeing this page, the page count
becomes 0 and ->mapping becomes NULL by free_hot_cold_page(),
so this BUG is caused.

I've not seen this BUG on real HW yet(seen twice on fake-numa
hvm guest of Xen), but I think it can happen theoretically.

Thanks,
Daisuke Nishimura.

2008-06-19 08:19:52

[permalink] [raw]

Subject: Re: [Experimental][PATCH] putback_lru_page rework

On Thu, 19 Jun 2008 17:00:59 +0900
Daisuke Nishimura <[email protected]> wrote:

> > > > - unlock = putback_lru_page(newpage);
> > > > + putback_lru_page(newpage);
> > > > } else
> > > > newpage->mapping = NULL;
> > >
> > > originally move_to_lru() called in unmap_and_move().
> > > unevictable infrastructure patch move to this point for
> > > calling putback_lru_page() under page locked.
> > >
> > > So, your patch remove page locked dependency.
> > > move to unmap_and_move() again is better.
> > >
> > > it become page lock holding time reducing.
> > >
> > ok, will look into again.
> >
>
> I agree with Kosaki-san.
>
> And VM_BUG_ON(page_count(newpage) != 1) in unmap_and_move()
> is not correct again, IMHO.
> I got this BUG actually when testing this patch(with
> migratin_entry_wait fix).
>
> unmap_and_move()
> move_to_new_page()
> migrate_page()
> remove_migration_ptes()
> putback_lru_page() (*1)
> :
> if (!newpage->mapping) (*2)
> VM_BUG_ON(page_count(newpage) != 1)
>
> If a anonymous page(without mapping) is migrated successfully,
> this page is moved back to lru by putback_lru_page()(*1),
> and the page count becomes 1(pte only).
>
yes.

> At the same time(between *1 and *2), if the process
> that owns this page are freeing this page, the page count
> becomes 0 and ->mapping becomes NULL by free_hot_cold_page(),
> so this BUG is caused.
>
Agree, I see.

> I've not seen this BUG on real HW yet(seen twice on fake-numa
> hvm guest of Xen), but I think it can happen theoretically.
>
That's (maybe) because page->mapping is not cleared when it's removed
from rmap. (and there is pagevec to dealy freeing....)

But ok, I see your point. KOSAKI-san is now writing patch set to
fix the whole. please see it.

Thanks,
-Kame

2008-06-19 09:14:19

by Ingo Molnar

[permalink] [raw]

Subject: Re: 2.6.26-rc5-mm3

* Daniel Walker <[email protected]> wrote:

>
> On Fri, 2008-06-13 at 00:32 +0100, Byron Bradley wrote:
> > Looks like x86 and ARM both fail to boot if PROFILE_LIKELY, FTRACE and
> > DYNAMIC_FTRACE are selected. If any one of those three are disabled it
> > boots (or fails in some other way which I'm looking at now). The serial
> > console output from both machines when they fail to boot is below, let me
> > know if there is any other information I can provide.
>
> I was able to reproduce a hang on x86 with those options. The patch
> below is a potential fix. I think we don't want to trace
> do_check_likely(), since the ftrace internals might use likely/unlikely
> macro's which will just cause recursion back to do_check_likely()..
>
> Signed-off-by: Daniel Walker <[email protected]>
>
> ---
> lib/likely_prof.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> Index: linux-2.6.25/lib/likely_prof.c
> ===================================================================
> --- linux-2.6.25.orig/lib/likely_prof.c
> +++ linux-2.6.25/lib/likely_prof.c
> @@ -22,7 +22,7 @@
>
> static struct likeliness *likeliness_head;
>
> -int do_check_likely(struct likeliness *likeliness, unsigned int ret)
> +int notrace do_check_likely(struct likeliness *likeliness, unsigned int ret)

the better fix would be to add likely_prof.o to this list of exceptions
in lib/Makefile:

ifdef CONFIG_FTRACE
# Do not profile string.o, since it may be used in early boot or vdso
CFLAGS_REMOVE_string.o = -pg
# Also do not profile any debug utilities
CFLAGS_REMOVE_spinlock_debug.o = -pg
CFLAGS_REMOVE_list_debug.o = -pg
CFLAGS_REMOVE_debugobjects.o = -pg
endif

instead of adding notrace to the source.

Ingo

2008-06-19 10:13:22

by Rusty Russell

[permalink] [raw]

Subject: Re: [BUG][PATCH -mm] avoid BUG() in __stop_machine_run()

On Thursday 19 June 2008 16:59:50 Hidehiro Kawai wrote:
> When a process loads a kernel module, __stop_machine_run() is called, and
> it calls sched_setscheduler() to give newly created kernel threads highest
> priority. However, the process can have no CAP_SYS_NICE which required
> for sched_setscheduler() to increase the priority. For example, SystemTap
> loads its module with only CAP_SYS_MODULE. In this case,
> sched_setscheduler() returns -EPERM, then BUG() is called.

Hi Hidehiro,

Nice catch. This can happen in the current code, it just doesn't
BUG().

> Failure of sched_setscheduler() wouldn't be a real problem, so this
> patch just ignores it.

Well, it can mean that the stop_machine blocks indefinitely. Better
than a BUG(), but we should aim higher.

> Or, should we give the CAP_SYS_NICE capability temporarily?

I don't think so. It can be seen from another thread, and in theory
that should not see something random. Worse, they can change it from
another thread.

How's this?

sched_setscheduler: add a flag to control access checks

Hidehiro Kawai noticed that sched_setscheduler() can fail in
stop_machine: it calls sched_setscheduler() from insmod, which can
have CAP_SYS_MODULE without CAP_SYS_NICE.

This simply introduces a flag to allow us to disable the capability
checks for internal callers (this is simpler than splitting the
sched_setscheduler() function, since it loops checking permissions).

The flag is only "false" (ie. no check) for the following cases, where
it shouldn't matter:
drivers/input/touchscreen/ucb1400_ts.c:ucb1400_ts_thread()
- it's a kthread
drivers/mmc/core/sdio_irq.c:sdio_irq_thread()
- also a kthread
kernel/kthread.c:create_kthread()
- making a kthread (from kthreadd)
kernel/softlockup.c:watchdog()
- also a kthread

And these cases could have failed before:
kernel/softirq.c:cpu_callback()
- CPU hotplug callback
kernel/stop_machine.c:__stop_machine_run()
- Called from various places, including modprobe()

Signed-off-by: Rusty Russell <[email protected]>

diff -r 509f0724da6b drivers/input/touchscreen/ucb1400_ts.c
--- a/drivers/input/touchscreen/ucb1400_ts.c Thu Jun 19 17:06:30 2008 +1000
+++ b/drivers/input/touchscreen/ucb1400_ts.c Thu Jun 19 19:36:40 2008 +1000
@@ -287,7 +287,7 @@ static int ucb1400_ts_thread(void *_ucb)
int valid = 0;
struct sched_param param = { .sched_priority = 1 };

- sched_setscheduler(tsk, SCHED_FIFO, &param);
+ sched_setscheduler(tsk, SCHED_FIFO, &param, false);

set_freezable();
while (!kthread_should_stop()) {
diff -r 509f0724da6b drivers/mmc/core/sdio_irq.c
--- a/drivers/mmc/core/sdio_irq.c Thu Jun 19 17:06:30 2008 +1000
+++ b/drivers/mmc/core/sdio_irq.c Thu Jun 19 19:36:40 2008 +1000
@@ -70,7 +70,7 @@ static int sdio_irq_thread(void *_host)
unsigned long period, idle_period;
int ret;

- sched_setscheduler(current, SCHED_FIFO, &param);
+ sched_setscheduler(current, SCHED_FIFO, &param, false);

/*
* We want to allow for SDIO cards to work even on non SDIO
diff -r 509f0724da6b include/linux/sched.h
--- a/include/linux/sched.h Thu Jun 19 17:06:30 2008 +1000
+++ b/include/linux/sched.h Thu Jun 19 19:36:40 2008 +1000
@@ -1654,7 +1654,8 @@ extern int can_nice(const struct task_st
extern int can_nice(const struct task_struct *p, const int nice);
extern int task_curr(const struct task_struct *p);
extern int idle_cpu(int cpu);
-extern int sched_setscheduler(struct task_struct *, int, struct sched_param *);
+extern int sched_setscheduler(struct task_struct *, int, struct sched_param *,
+ bool);
extern struct task_struct *idle_task(int cpu);
extern struct task_struct *curr_task(int cpu);
extern void set_curr_task(int cpu, struct task_struct *p);
diff -r 509f0724da6b kernel/kthread.c
--- a/kernel/kthread.c Thu Jun 19 17:06:30 2008 +1000
+++ b/kernel/kthread.c Thu Jun 19 19:36:40 2008 +1000
@@ -104,7 +104,7 @@ static void create_kthread(struct kthrea
* root may have changed our (kthreadd's) priority or CPU mask.
* The kernel thread should not inherit these properties.
*/
- sched_setscheduler(create->result, SCHED_NORMAL, &param);
+ sched_setscheduler(create->result, SCHED_NORMAL, &param, false);
set_user_nice(create->result, KTHREAD_NICE_LEVEL);
set_cpus_allowed(create->result, CPU_MASK_ALL);
}
diff -r 509f0724da6b kernel/rtmutex-tester.c
--- a/kernel/rtmutex-tester.c Thu Jun 19 17:06:30 2008 +1000
+++ b/kernel/rtmutex-tester.c Thu Jun 19 19:36:40 2008 +1000
@@ -327,7 +327,8 @@ static ssize_t sysfs_test_command(struct
switch (op) {
case RTTEST_SCHEDOT:
schedpar.sched_priority = 0;
- ret = sched_setscheduler(threads[tid], SCHED_NORMAL, &schedpar);
+ ret = sched_setscheduler(threads[tid], SCHED_NORMAL, &schedpar,
+ true);
if (ret)
return ret;
set_user_nice(current, 0);
@@ -335,7 +336,8 @@ static ssize_t sysfs_test_command(struct

case RTTEST_SCHEDRT:
schedpar.sched_priority = dat;
- ret = sched_setscheduler(threads[tid], SCHED_FIFO, &schedpar);
+ ret = sched_setscheduler(threads[tid], SCHED_FIFO, &schedpar,
+ true);
if (ret)
return ret;
break;
diff -r 509f0724da6b kernel/sched.c
--- a/kernel/sched.c Thu Jun 19 17:06:30 2008 +1000
+++ b/kernel/sched.c Thu Jun 19 19:36:40 2008 +1000
@@ -4749,11 +4749,12 @@ __setscheduler(struct rq *rq, struct tas
* @p: the task in question.
* @policy: new policy.
* @param: structure containing the new RT priority.
+ * @user: do checks to ensure this thread has permission
*
* NOTE that the task may be already dead.
*/
int sched_setscheduler(struct task_struct *p, int policy,
- struct sched_param *param)
+ struct sched_param *param, bool user)
{
int retval, oldprio, oldpolicy = -1, on_rq, running;
unsigned long flags;
@@ -4785,7 +4786,7 @@ recheck:
/*
* Allow unprivileged RT tasks to decrease priority:
*/
- if (!capable(CAP_SYS_NICE)) {
+ if (user && !capable(CAP_SYS_NICE)) {
if (rt_policy(policy)) {
unsigned long rlim_rtprio;

@@ -4821,7 +4822,8 @@ recheck:
* Do not allow realtime tasks into groups that have no runtime
* assigned.
*/
- if (rt_policy(policy) && task_group(p)->rt_bandwidth.rt_runtime == 0)
+ if (user
+ && rt_policy(policy) && task_group(p)->rt_bandwidth.rt_runtime == 0)
return -EPERM;
#endif

@@ -4888,7 +4890,7 @@ do_sched_setscheduler(pid_t pid, int pol
retval = -ESRCH;
p = find_process_by_pid(pid);
if (p != NULL)
- retval = sched_setscheduler(p, policy, &lparam);
+ retval = sched_setscheduler(p, policy, &lparam, true);
rcu_read_unlock();

return retval;
diff -r 509f0724da6b kernel/softirq.c
--- a/kernel/softirq.c Thu Jun 19 17:06:30 2008 +1000
+++ b/kernel/softirq.c Thu Jun 19 19:36:40 2008 +1000
@@ -645,7 +645,7 @@ static int __cpuinit cpu_callback(struct

p = per_cpu(ksoftirqd, hotcpu);
per_cpu(ksoftirqd, hotcpu) = NULL;
- sched_setscheduler(p, SCHED_FIFO, &param);
+ sched_setscheduler(p, SCHED_FIFO, &param, false);
kthread_stop(p);
takeover_tasklets(hotcpu);
break;
diff -r 509f0724da6b kernel/softlockup.c
--- a/kernel/softlockup.c Thu Jun 19 17:06:30 2008 +1000
+++ b/kernel/softlockup.c Thu Jun 19 19:36:40 2008 +1000
@@ -211,7 +211,7 @@ static int watchdog(void *__bind_cpu)
struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
int this_cpu = (long)__bind_cpu;

- sched_setscheduler(current, SCHED_FIFO, &param);
+ sched_setscheduler(current, SCHED_FIFO, &param, false);

/* initialize timestamp */
touch_softlockup_watchdog();
diff -r 509f0724da6b kernel/stop_machine.c
--- a/kernel/stop_machine.c Thu Jun 19 17:06:30 2008 +1000
+++ b/kernel/stop_machine.c Thu Jun 19 19:36:40 2008 +1000
@@ -187,7 +187,7 @@ struct task_struct *__stop_machine_run(i
struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };

/* One high-prio thread per cpu. We'll do this one. */
- sched_setscheduler(p, SCHED_FIFO, &param);
+ sched_setscheduler(p, SCHED_FIFO, &param, false);
kthread_bind(p, cpu);
wake_up_process(p);
wait_for_completion(&smdata.done);

2008-06-19 14:39:42

by Daniel Walker

[permalink] [raw]

Subject: Re: 2.6.26-rc5-mm3

On Thu, 2008-06-19 at 11:13 +0200, Ingo Molnar wrote:

> the better fix would be to add likely_prof.o to this list of exceptions
> in lib/Makefile:
>
> ifdef CONFIG_FTRACE
> # Do not profile string.o, since it may be used in early boot or vdso
> CFLAGS_REMOVE_string.o = -pg
> # Also do not profile any debug utilities
> CFLAGS_REMOVE_spinlock_debug.o = -pg
> CFLAGS_REMOVE_list_debug.o = -pg
> CFLAGS_REMOVE_debugobjects.o = -pg
> endif
>
> instead of adding notrace to the source.
>
> Ingo

Here's the fix mentioned above.

--

Remove tracing from likely profiling since it could cause recursion if
ftrace uses likely/unlikely macro's internally.

Signed-off-by: Daniel Walker <[email protected]>

---
lib/Makefile | 2 ++
1 file changed, 2 insertions(+)

Index: linux-2.6.25/lib/Makefile
===================================================================
--- linux-2.6.25.orig/lib/Makefile
+++ linux-2.6.25/lib/Makefile
@@ -15,6 +15,8 @@ CFLAGS_REMOVE_string.o = -pg
CFLAGS_REMOVE_spinlock_debug.o = -pg
CFLAGS_REMOVE_list_debug.o = -pg
CFLAGS_REMOVE_debugobjects.o = -pg
+# likely profiling can cause recursion in ftrace, so don't trace it.
+CFLAGS_REMOVE_likely_prof.o = -pg
endif

lib-$(CONFIG_MMU) += ioremap.o

2008-06-19 14:45:39

[permalink] [raw]

Subject: Re: [Experimental][PATCH] putback_lru_page rework

On Thu, 2008-06-19 at 09:22 +0900, KAMEZAWA Hiroyuki wrote:
> On Wed, 18 Jun 2008 14:21:06 -0400
> Lee Schermerhorn <[email protected]> wrote:
>
> > On Wed, 2008-06-18 at 18:40 +0900, KAMEZAWA Hiroyuki wrote:
> > > Lee-san, how about this ?
> > > Tested on x86-64 and tried Nisimura-san's test at el. works good now.
> >
> > I have been testing with my work load on both ia64 and x86_64 and it
> > seems to be working well. I'll let them run for a day or so.
> >
> thank you.
> <snip>

Update:

On x86_64 [32GB, 4xdual-core Opteron], my work load has run for ~20:40
hours. Still running.

On ia64 [32G, 16cpu, 4 node], the system started going into softlockup
after ~7 hours. Stack trace [below] indicates zone-lru lock in
__page_cache_release() called from put_page(). Either heavy contention
or failure to unlock. Note that previous run, with patches to
putback_lru_page() and unmap_and_move(), the same load ran for ~18 hours
before I shut it down to try these patches.

I'm going to try again with the collected patches posted by Kosaki-san
[for which, Thanks!]. If it occurs again, I'll deconfig the unevictable
lru feature and see if I can reproduce it there. It may be unrelated to
the unevictable lru patches.

>
> > > @@ -240,6 +232,9 @@ static int __munlock_pte_handler(pte_t *
> > > struct page *page;
> > > pte_t pte;
> > >
> > > + /*
> > > + * page is never be unmapped by page-reclaim. we lock this page now.
> > > + */
> >
> > I don't understand what you're trying to say here. That is, what the
> > point of this comment is...
> >
> We access the page-table without taking pte_lock. But this vm is MLOCKED
> and migration-race is handled. So we don't need to be too nervous to access
> the pte. I'll consider more meaningful words.

OK, so you just want to note that we're accessing the pte w/o locking
and that this is safe because the vma has been VM_LOCKED and all pages
should be mlocked?

I'll note that the vma is NOT VM_LOCKED during the pte walk.
munlock_vma_pages_range() resets it so that try_to_unlock(), called from
munlock_vma_page(), won't try to re-mlock the page. However, we hold
the mmap sem for write, so faults are held off--no need to worry about a
COW fault occurring between when the VM_LOCKED was cleared and before
the page is munlocked. If that could occur, it could open a window
where a non-mlocked page is mapped in this vma, and page reclaim could
potentially unmap the page. Shouldn't be an issue as long as we never
downgrade the semaphore to read during munlock.

Lee

----------
softlockup stack trace for "usex" workload on ia64:

BUG: soft lockup - CPU#13 stuck for 61s! [usex:124359]
Modules linked in: ipv6 sunrpc dm_mirror dm_log dm_multipath scsi_dh dm_mod pci_slot fan dock thermal sg sr_mod processor button container ehci_hcd ohci_hcd uhci_hcd usbcore

Pid: 124359, CPU 13, comm: usex
psr : 00001010085a6010 ifs : 8000000000000000 ip : [<a00000010000a1a0>] Tainted: G D (2.6.26-rc5-mm3-kame-rework+mcl_inherit)
ip is at ia64_spinlock_contention+0x20/0x60
unat: 0000000000000000 pfs : 0000000000000081 rsc : 0000000000000003
rnat: 0000000000000000 bsps: 0000000000000000 pr : a65955959a96e969
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f
csd : 0000000000000000 ssd : 0000000000000000
b0 : a0000001001264a0 b6 : a0000001006f0350 b7 : a00000010000b940
f6 : 0ffff8000000000000000 f7 : 1003ecf3cf3cf3cf3cf3d
f8 : 1003e0000000000000001 f9 : 1003e0000000000000015
f10 : 1003e000003a82aaab1fb f11 : 1003e0000000000000000
r1 : a000000100c03650 r2 : 000000000000038a r3 : 0000000000000001
r8 : 00000010085a6010 r9 : 0000000000080028 r10 : 000000000000000b
r11 : 0000000000000a80 r12 : e0000741aaac7d50 r13 : e0000741aaac0000
r14 : 0000000000000000 r15 : a000400741329148 r16 : e000074000060100
r17 : e000076000078e98 r18 : 0000000000000015 r19 : 0000000000000018
r20 : 0000000000000003 r21 : 0000000000000002 r22 : e000076000078e88
r23 : e000076000078e80 r24 : 0000000000000001 r25 : 0240000000080028
r26 : ffffffffffff04d8 r27 : 00000010085a6010 r28 : 7fe3382473f8b380
r29 : 9c00000000000000 r30 : 0000000000000001 r31 : e000074000061400

Call Trace:
[<a000000100015e00>] show_stack+0x80/0xa0
sp=e0000741aaac79b0 bsp=e0000741aaac1528
[<a000000100016700>] show_regs+0x880/0x8c0
sp=e0000741aaac7b80 bsp=e0000741aaac14d0
[<a0000001000fbbe0>] softlockup_tick+0x2e0/0x340
sp=e0000741aaac7b80 bsp=e0000741aaac1480
[<a0000001000a9400>] run_local_timers+0x40/0x60
sp=e0000741aaac7b80 bsp=e0000741aaac1468
[<a0000001000a9460>] update_process_times+0x40/0xc0
sp=e0000741aaac7b80 bsp=e0000741aaac1438
[<a00000010003ded0>] timer_interrupt+0x1b0/0x4a0
sp=e0000741aaac7b80 bsp=e0000741aaac13d0
[<a0000001000fc480>] handle_IRQ_event+0x80/0x120
sp=e0000741aaac7b80 bsp=e0000741aaac1398
[<a0000001000fc660>] __do_IRQ+0x140/0x440
sp=e0000741aaac7b80 bsp=e0000741aaac1338
[<a0000001000136d0>] ia64_handle_irq+0x3f0/0x420
sp=e0000741aaac7b80 bsp=e0000741aaac12c0
[<a00000010000c120>] ia64_native_leave_kernel+0x0/0x270
sp=e0000741aaac7b80 bsp=e0000741aaac12c0
[<a00000010000a1a0>] ia64_spinlock_contention+0x20/0x60
sp=e0000741aaac7d50 bsp=e0000741aaac12c0
[<a0000001006f0350>] _spin_lock_irqsave+0x50/0x60
sp=e0000741aaac7d50 bsp=e0000741aaac12b8

Probably zone lru_lock in __page_cache_release().

[<a0000001001264a0>] put_page+0x100/0x300
sp=e0000741aaac7d50 bsp=e0000741aaac1280
[<a000000100157170>] free_page_and_swap_cache+0x70/0xe0
sp=e0000741aaac7d50 bsp=e0000741aaac1260
[<a000000100145a10>] exit_mmap+0x3b0/0x580
sp=e0000741aaac7d50 bsp=e0000741aaac1210
[<a00000010008b420>] mmput+0x80/0x1c0
sp=e0000741aaac7e10 bsp=e0000741aaac11d8

NOTE: all cpus show similar stack traces above here. Some, however, get
here from do_exit()/exit_mm(), rather than via execve().

[<a00000010019c2c0>] flush_old_exec+0x5a0/0x1520
sp=e0000741aaac7e10 bsp=e0000741aaac10f0
[<a000000100213080>] load_elf_binary+0x7e0/0x2600
sp=e0000741aaac7e20 bsp=e0000741aaac0fb8
[<a00000010019b7a0>] search_binary_handler+0x1a0/0x520
sp=e0000741aaac7e20 bsp=e0000741aaac0f30
[<a00000010019e4e0>] do_execve+0x320/0x3e0
sp=e0000741aaac7e20 bsp=e0000741aaac0ed0
[<a000000100014d00>] sys_execve+0x60/0xc0
sp=e0000741aaac7e30 bsp=e0000741aaac0e98
[<a00000010000b690>] ia64_execve+0x30/0x140
sp=e0000741aaac7e30 bsp=e0000741aaac0e48
[<a00000010000bfa0>] ia64_ret_from_syscall+0x0/0x20
sp=e0000741aaac7e30 bsp=e0000741aaac0e48
[<a000000000010720>] __start_ivt_text+0xffffffff00010720/0x400
sp=e0000741aaac8000 bsp=e0000741aaac0e48

2008-06-19 15:33:17

[permalink] [raw]

Subject: Re: Re: [Experimental][PATCH] putback_lru_page rework

----- Original Message -----
>Subject: Re: [Experimental][PATCH] putback_lru_page rework
>From: Lee Schermerhorn <[email protected]>

>On Thu, 2008-06-19 at 09:22 +0900, KAMEZAWA Hiroyuki wrote:
>> On Wed, 18 Jun 2008 14:21:06 -0400
>> Lee Schermerhorn <[email protected]> wrote:
>>
>> > On Wed, 2008-06-18 at 18:40 +0900, KAMEZAWA Hiroyuki wrote:
>> > > Lee-san, how about this ?
>> > > Tested on x86-64 and tried Nisimura-san's test at el. works good now.
>> >
>> > I have been testing with my work load on both ia64 and x86_64 and it
>> > seems to be working well. I'll let them run for a day or so.
>> >
>> thank you.
>> <snip>
>
>Update:
>
>On x86_64 [32GB, 4xdual-core Opteron], my work load has run for ~20:40
>hours. Still running.
>
>On ia64 [32G, 16cpu, 4 node], the system started going into softlockup
>after ~7 hours. Stack trace [below] indicates zone-lru lock in
>__page_cache_release() called from put_page(). Either heavy contention
>or failure to unlock. Note that previous run, with patches to
>putback_lru_page() and unmap_and_move(), the same load ran for ~18 hours
>before I shut it down to try these patches.
>
Thanks, then there are more troubles should be shooted down.

>I'm going to try again with the collected patches posted by Kosaki-san
>[for which, Thanks!]. If it occurs again, I'll deconfig the unevictable
>lru feature and see if I can reproduce it there. It may be unrelated to
>the unevictable lru patches.
>
I hope so...Hmm..I'll dig tomorrow.

>>
>> > > @@ -240,6 +232,9 @@ static int __munlock_pte_handler(pte_t *
>> > > struct page *page;
>> > > pte_t pte;
>> > >
>> > > + /*
>> > > + * page is never be unmapped by page-reclaim. we lock this page now.
>> > > + */
>> >
>> > I don't understand what you're trying to say here. That is, what the
>> > point of this comment is...
>> >
>> We access the page-table without taking pte_lock. But this vm is MLOCKED
>> and migration-race is handled. So we don't need to be too nervous to access
>> the pte. I'll consider more meaningful words.
>
>OK, so you just want to note that we're accessing the pte w/o locking
>and that this is safe because the vma has been VM_LOCKED and all pages
>should be mlocked?
>
yes that was my thought.

>I'll note that the vma is NOT VM_LOCKED during the pte walk.
Ouch..
>munlock_vma_pages_range() resets it so that try_to_unlock(), called from
>munlock_vma_page(), won't try to re-mlock the page. However, we hold
>the mmap sem for write, so faults are held off--no need to worry about a
>COW fault occurring between when the VM_LOCKED was cleared and before
>the page is munlocked.
okay.

> If that could occur, it could open a window
>where a non-mlocked page is mapped in this vma, and page reclaim could
>potentially unmap the page. Shouldn't be an issue as long as we never
>downgrade the semaphore to read during munlock.
>

Thank you for clarification. (so..will check Kosaki-san's one's comment later.
)

>
>Probably zone lru_lock in __page_cache_release().
>
> [<a0000001001264a0>] put_page+0x100/0x300
> sp=e0000741aaac7d50 bsp=e0000741aaac1280
> [<a000000100157170>] free_page_and_swap_cache+0x70/0xe0
> sp=e0000741aaac7d50 bsp=e0000741aaac1260
> [<a000000100145a10>] exit_mmap+0x3b0/0x580
> sp=e0000741aaac7d50 bsp=e0000741aaac1210
> [<a00000010008b420>] mmput+0x80/0x1c0
> sp=e0000741aaac7e10 bsp=e0000741aaac11d8
>
I think I have never seen this kind of dead-lock related to zone->lock.
(maybe it's because zone->lock is used in clear way historically)
I'll check around zone->lock. thanks.

Regards,
-Kame

2008-06-19 15:51:58

by Jeremy Fitzhardinge

[permalink] [raw]

Subject: Re: [BUG][PATCH -mm] avoid BUG() in __stop_machine_run()

Rusty Russell wrote:
> On Thursday 19 June 2008 16:59:50 Hidehiro Kawai wrote:
>
>> When a process loads a kernel module, __stop_machine_run() is called, and
>> it calls sched_setscheduler() to give newly created kernel threads highest
>> priority. However, the process can have no CAP_SYS_NICE which required
>> for sched_setscheduler() to increase the priority. For example, SystemTap
>> loads its module with only CAP_SYS_MODULE. In this case,
>> sched_setscheduler() returns -EPERM, then BUG() is called.
>>
>
> Hi Hidehiro,
>
> Nice catch. This can happen in the current code, it just doesn't
> BUG().
>
>
>> Failure of sched_setscheduler() wouldn't be a real problem, so this
>> patch just ignores it.
>>
>
> Well, it can mean that the stop_machine blocks indefinitely. Better
> than a BUG(), but we should aim higher.
>
>
>> Or, should we give the CAP_SYS_NICE capability temporarily?
>>
>
> I don't think so. It can be seen from another thread, and in theory
> that should not see something random. Worse, they can change it from
> another thread.
>
> How's this?
>
> sched_setscheduler: add a flag to control access checks
>
> Hidehiro Kawai noticed that sched_setscheduler() can fail in
> stop_machine: it calls sched_setscheduler() from insmod, which can
> have CAP_SYS_MODULE without CAP_SYS_NICE.
>
> This simply introduces a flag to allow us to disable the capability
> checks for internal callers (this is simpler than splitting the
> sched_setscheduler() function, since it loops checking permissions).
>
What about?

int sched_setscheduler(struct task_struct *p, int policy,
struct sched_param *param)
{
return __sched_setscheduler(p, policy, param, true);
}

int sched_setscheduler_nocheck(struct task_struct *p, int policy,
struct sched_param *param)
{
return __sched_setscheduler(p, policy, param, false);
}

(With the appropriate transformation of sched_setscheduler -> __)

Better than scattering stray true/falses around the code.

J

2008-06-19 16:27:50

[permalink] [raw]

Subject: Re: 2.6.26-rc5-mm3: BUG large value for HugePages_Rsvd

After running some of the libhugetlbfs tests the value for
/proc/meminfo/HugePages_Rsvd becomes really large. It looks like it has
wrapped backwards from zero.
Below is the sequence I used to run one of the tests that causes this;
the tests passes for what it is intended to test but leaves a large
value for reserved pages and that seemed strange to me.
test run on ppc64 with 16M huge pages

cat /proc/meminfo
....
HugePages_Total: 25
HugePages_Free: 25
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 16384 kB

mount -t hugetlbfs hugetlbfs /mnt

tundro4:~/libhugetlbfs-dev-20080516/tests # HUGETLBFS_VERBOSE=99 HUGETLBFS_DEBUG=y PATH="obj64:$PATH" LD_LIBRARY_PATH="$LD_LIBRARY_PATH:../obj64:obj64" truncate_above_4GB
Starting testcase "truncate_above_4GB", pid 3145
Mapping 3 hpages at offset 0x100000000...mapped at 0x3fffd000000
Replacing map at 0x3ffff000000 with map from offset 0x1000000...done
Truncating at 0x100000000...done
PASS

cat /proc/meminfo
....
HugePages_Total: 25
HugePages_Free: 25
HugePages_Rsvd: 18446744073709551614
HugePages_Surp: 0
Hugepagesize: 16384 kB

I put in some printks and see that the rsvd value goes mad in
'return_unused_surplus_pages'.

Debug output:

tundro4 kernel: mm/hugetlb.c:gather_surplus_pages:527; resv_huge_pages=0 delta=3
tundro4 kernel: Call Trace:
tundro4 kernel: [c000000287dff9a0] [c000000000010978] .show_stack+0x7c/0x1c4 (unreliable)
tundro4 kernel: [c000000287dffa50] [c0000000000d7c8c] .hugetlb_acct_memory+0xa4/0x448
tundro4 kernel: [c000000287dffb20] [c0000000000d85ec] .hugetlb_reserve_pages+0xec/0x16c
tundro4 kernel: [c000000287dffbc0] [c0000000001be7fc] .hugetlbfs_file_mmap+0xe0/0x154
tundro4 kernel: [c000000287dffc70] [c0000000000cbc78] .mmap_region+0x280/0x52c
tundro4 kernel: [c000000287dffd80] [c00000000000bfa0] .sys_mmap+0xa8/0x108
tundro4 kernel: [c000000287dffe30] [c0000000000086ac] syscall_exit+0x0/0x40
tundro4 kernel: mm/hugetlb.c:gather_surplus_pages:530; resv_huge_pages=3 delta=3
tundro4 kernel: mm/hugetlb.c:decrement_hugepage_resv_vma:147; resv_huge_pages=3
tundro4 kernel: mm/hugetlb.c:decrement_hugepage_resv_vma:149; resv_huge_pages=2
tundro4 kernel: mm/hugetlb.c:return_unused_surplus_pages:630; resv_huge_pages=2 unused_resv_pages=2
tundro4 kernel: Call Trace:
tundro4 kernel: [c000000287dff900] [c000000000010978] .show_stack+0x7c/0x1c4 (unreliable)
tundro4 kernel: [c000000287dff9b0] [c0000000000d7a10] .return_unused_surplus_pages+0x70/0x248
tundro4 kernel: [c000000287dffa50] [c0000000000d7fb8] .hugetlb_acct_memory+0x3d0/0x448
tundro4 kernel: [c000000287dffb20] [c0000000000c98fc] .remove_vma+0x64/0xe0
tundro4 kernel: [c000000287dffbb0] [c0000000000cb058] .do_munmap+0x30c/0x354
tundro4 kernel: [c000000287dffc70] [c0000000000cbad0] .mmap_region+0xd8/0x52c
tundro4 kernel: [c000000287dffd80] [c00000000000bfa0] .sys_mmap+0xa8/0x108
tundro4 kernel: [c000000287dffe30] [c0000000000086ac] syscall_exit+0x0/0x40
tundro4 kernel: mm/hugetlb.c:return_unused_surplus_pages:633; resv_huge_pages=0 unused_resv_pages=2
tundro4 kernel: mm/hugetlb.c:gather_surplus_pages:527; resv_huge_pages=0 delta=1
tundro4 kernel: Call Trace:
tundro4 kernel: [c000000287dff9a0] [c000000000010978] .show_stack+0x7c/0x1c4 (unreliable)
tundro4 kernel: [c000000287dffa50] [c0000000000d7c8c] .hugetlb_acct_memory+0xa4/0x448
tundro4 kernel: [c000000287dffb20] [c0000000000d85ec] .hugetlb_reserve_pages+0xec/0x16c
tundro4 kernel: [c000000287dffbc0] [c0000000001be7fc] .hugetlbfs_file_mmap+0xe0/0x154
tundro4 kernel: [c000000287dffc70] [c0000000000cbc78] .mmap_region+0x280/0x52c
tundro4 kernel: [c000000287dffd80] [c00000000000bfa0] .sys_mmap+0xa8/0x108
tundro4 kernel: [c000000287dffe30] [c0000000000086ac] syscall_exit+0x0/0x40
tundro4 kernel: mm/hugetlb.c:gather_surplus_pages:530; resv_huge_pages=1 delta=1
tundro4 kernel: mm/hugetlb.c:decrement_hugepage_resv_vma:147; resv_huge_pages=1
tundro4 kernel: mm/hugetlb.c:decrement_hugepage_resv_vma:149; resv_huge_pages=0
tundro4 kernel: mm/hugetlb.c:return_unused_surplus_pages:630; resv_huge_pages=0 unused_resv_pages=2
tundro4 kernel: Call Trace:
tundro4 kernel: [c000000287dff860] [c000000000010978] .show_stack+0x7c/0x1c4 (unreliable)
tundro4 kernel: [c000000287dff910] [c0000000000d7a10] .return_unused_surplus_pages+0x70/0x248
tundro4 kernel: [c000000287dff9b0] [c0000000000d7fb8] .hugetlb_acct_memory+0x3d0/0x448
tundro4 kernel: [c000000287dffa80] [c0000000000c98fc] .remove_vma+0x64/0xe0
tundro4 kernel: [c000000287dffb10] [c0000000000c9af0] .exit_mmap+0x178/0x1b8
tundro4 kernel: [c000000287dffbc0] [c000000000055ef0] .mmput+0x60/0x178
tundro4 kernel: [c000000287dffc50] [c00000000005add8] .exit_mm+0x130/0x154
tundro4 kernel: [c000000287dffce0] [c00000000005d598] .do_exit+0x2bc/0x778
tundro4 kernel: [c000000287dffda0] [c00000000005db38] .sys_exit_group+0x0/0x8
tundro4 kernel: [c000000287dffe30] [c0000000000086ac] syscall_exit+0x0/0x40
tundro4 kernel: mm/hugetlb.c:return_unused_surplus_pages:633; resv_huge_pages=18446744073709551614 unused_resv_pages=2

============the end===============

2008-06-19 17:19:38

[permalink] [raw]

Subject: Re: 2.6.26-rc5-mm3: BUG large value for HugePages_Rsvd

On Thu, Jun 19, 2008 at 11:27:47AM -0500, Jon Tollefson wrote:
> After running some of the libhugetlbfs tests the value for
> /proc/meminfo/HugePages_Rsvd becomes really large. It looks like it has
> wrapped backwards from zero.
> Below is the sequence I used to run one of the tests that causes this;
> the tests passes for what it is intended to test but leaves a large
> value for reserved pages and that seemed strange to me.
> test run on ppc64 with 16M huge pages

Yes Adam reported that here yesterday, he found it in his hugetlfs testing.
I have done some investigation on it and it is being triggered by a bug in
the private reservation tracking patches. It is triggered by the hugetlb
test which causes some complex vma splits to occur on a private mapping.

I believe I have the underlying problem nailed and do have some nearly
complete patches for this and they should be in a postable state by
tommorrow.

-apw

2008-06-20 00:41:44

[permalink] [raw]

Subject: Re: [Experimental][PATCH] putback_lru_page rework

On Thu, 19 Jun 2008 10:45:22 -0400
Lee Schermerhorn <[email protected]> wrote:

> On Thu, 2008-06-19 at 09:22 +0900, KAMEZAWA Hiroyuki wrote:
> > On Wed, 18 Jun 2008 14:21:06 -0400
> > Lee Schermerhorn <[email protected]> wrote:
> >
> > > On Wed, 2008-06-18 at 18:40 +0900, KAMEZAWA Hiroyuki wrote:
> > > > Lee-san, how about this ?
> > > > Tested on x86-64 and tried Nisimura-san's test at el. works good now.
> > >
> > > I have been testing with my work load on both ia64 and x86_64 and it
> > > seems to be working well. I'll let them run for a day or so.
> > >
> > thank you.
> > <snip>
>
> Update:
>
> On x86_64 [32GB, 4xdual-core Opteron], my work load has run for ~20:40
> hours. Still running.
>
> On ia64 [32G, 16cpu, 4 node], the system started going into softlockup
> after ~7 hours. Stack trace [below] indicates zone-lru lock in
> __page_cache_release() called from put_page(). Either heavy contention
> or failure to unlock. Note that previous run, with patches to
> putback_lru_page() and unmap_and_move(), the same load ran for ~18 hours
> before I shut it down to try these patches.
>

On ia64, ia64_spinlock_contention() enables irq and soft-lockup detection
by irq works well. On x86-64, irq is not enabled during spin-wait, and
soft-lockup detection irq cannot be handled until irq is enabled.
Then, it seems there is someone who drops into infinite-loop within
spin_lock_irqsave(&zone->lock, flags)....

Then, "A" cpu doesn't report soft-lockup while others report ?

Hmm..

-Kame

> I'm going to try again with the collected patches posted by Kosaki-san
> [for which, Thanks!]. If it occurs again, I'll deconfig the unevictable
> lru feature and see if I can reproduce it there. It may be unrelated to
> the unevictable lru patches.
>
> >
> > > > @@ -240,6 +232,9 @@ static int __munlock_pte_handler(pte_t *
> > > > struct page *page;
> > > > pte_t pte;
> > > >
> > > > + /*
> > > > + * page is never be unmapped by page-reclaim. we lock this page now.
> > > > + */
> > >
> > > I don't understand what you're trying to say here. That is, what the
> > > point of this comment is...
> > >
> > We access the page-table without taking pte_lock. But this vm is MLOCKED
> > and migration-race is handled. So we don't need to be too nervous to access
> > the pte. I'll consider more meaningful words.
>
> OK, so you just want to note that we're accessing the pte w/o locking
> and that this is safe because the vma has been VM_LOCKED and all pages
> should be mlocked?
>
> I'll note that the vma is NOT VM_LOCKED during the pte walk.
> munlock_vma_pages_range() resets it so that try_to_unlock(), called from
> munlock_vma_page(), won't try to re-mlock the page. However, we hold
> the mmap sem for write, so faults are held off--no need to worry about a
> COW fault occurring between when the VM_LOCKED was cleared and before
> the page is munlocked. If that could occur, it could open a window
> where a non-mlocked page is mapped in this vma, and page reclaim could
> potentially unmap the page. Shouldn't be an issue as long as we never
> downgrade the semaphore to read during munlock.
>
> Lee
>
> ----------
> softlockup stack trace for "usex" workload on ia64:
>
> BUG: soft lockup - CPU#13 stuck for 61s! [usex:124359]
> Modules linked in: ipv6 sunrpc dm_mirror dm_log dm_multipath scsi_dh dm_mod pci_slot fan dock thermal sg sr_mod processor button container ehci_hcd ohci_hcd uhci_hcd usbcore
>
> Pid: 124359, CPU 13, comm: usex
> psr : 00001010085a6010 ifs : 8000000000000000 ip : [<a00000010000a1a0>] Tainted: G D (2.6.26-rc5-mm3-kame-rework+mcl_inherit)
> ip is at ia64_spinlock_contention+0x20/0x60
> unat: 0000000000000000 pfs : 0000000000000081 rsc : 0000000000000003
> rnat: 0000000000000000 bsps: 0000000000000000 pr : a65955959a96e969
> ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f
> csd : 0000000000000000 ssd : 0000000000000000
> b0 : a0000001001264a0 b6 : a0000001006f0350 b7 : a00000010000b940
> f6 : 0ffff8000000000000000 f7 : 1003ecf3cf3cf3cf3cf3d
> f8 : 1003e0000000000000001 f9 : 1003e0000000000000015
> f10 : 1003e000003a82aaab1fb f11 : 1003e0000000000000000
> r1 : a000000100c03650 r2 : 000000000000038a r3 : 0000000000000001
> r8 : 00000010085a6010 r9 : 0000000000080028 r10 : 000000000000000b
> r11 : 0000000000000a80 r12 : e0000741aaac7d50 r13 : e0000741aaac0000
> r14 : 0000000000000000 r15 : a000400741329148 r16 : e000074000060100
> r17 : e000076000078e98 r18 : 0000000000000015 r19 : 0000000000000018
> r20 : 0000000000000003 r21 : 0000000000000002 r22 : e000076000078e88
> r23 : e000076000078e80 r24 : 0000000000000001 r25 : 0240000000080028
> r26 : ffffffffffff04d8 r27 : 00000010085a6010 r28 : 7fe3382473f8b380
> r29 : 9c00000000000000 r30 : 0000000000000001 r31 : e000074000061400
>
> Call Trace:
> [<a000000100015e00>] show_stack+0x80/0xa0
> sp=e0000741aaac79b0 bsp=e0000741aaac1528
> [<a000000100016700>] show_regs+0x880/0x8c0
> sp=e0000741aaac7b80 bsp=e0000741aaac14d0
> [<a0000001000fbbe0>] softlockup_tick+0x2e0/0x340
> sp=e0000741aaac7b80 bsp=e0000741aaac1480
> [<a0000001000a9400>] run_local_timers+0x40/0x60
> sp=e0000741aaac7b80 bsp=e0000741aaac1468
> [<a0000001000a9460>] update_process_times+0x40/0xc0
> sp=e0000741aaac7b80 bsp=e0000741aaac1438
> [<a00000010003ded0>] timer_interrupt+0x1b0/0x4a0
> sp=e0000741aaac7b80 bsp=e0000741aaac13d0
> [<a0000001000fc480>] handle_IRQ_event+0x80/0x120
> sp=e0000741aaac7b80 bsp=e0000741aaac1398
> [<a0000001000fc660>] __do_IRQ+0x140/0x440
> sp=e0000741aaac7b80 bsp=e0000741aaac1338
> [<a0000001000136d0>] ia64_handle_irq+0x3f0/0x420
> sp=e0000741aaac7b80 bsp=e0000741aaac12c0
> [<a00000010000c120>] ia64_native_leave_kernel+0x0/0x270
> sp=e0000741aaac7b80 bsp=e0000741aaac12c0
> [<a00000010000a1a0>] ia64_spinlock_contention+0x20/0x60
> sp=e0000741aaac7d50 bsp=e0000741aaac12c0
> [<a0000001006f0350>] _spin_lock_irqsave+0x50/0x60
> sp=e0000741aaac7d50 bsp=e0000741aaac12b8
>
> Probably zone lru_lock in __page_cache_release().
>
> [<a0000001001264a0>] put_page+0x100/0x300
> sp=e0000741aaac7d50 bsp=e0000741aaac1280
> [<a000000100157170>] free_page_and_swap_cache+0x70/0xe0
> sp=e0000741aaac7d50 bsp=e0000741aaac1260
> [<a000000100145a10>] exit_mmap+0x3b0/0x580
> sp=e0000741aaac7d50 bsp=e0000741aaac1210
> [<a00000010008b420>] mmput+0x80/0x1c0
> sp=e0000741aaac7e10 bsp=e0000741aaac11d8
>
> NOTE: all cpus show similar stack traces above here. Some, however, get
> here from do_exit()/exit_mm(), rather than via execve().
>
> [<a00000010019c2c0>] flush_old_exec+0x5a0/0x1520
> sp=e0000741aaac7e10 bsp=e0000741aaac10f0
> [<a000000100213080>] load_elf_binary+0x7e0/0x2600
> sp=e0000741aaac7e20 bsp=e0000741aaac0fb8
> [<a00000010019b7a0>] search_binary_handler+0x1a0/0x520
> sp=e0000741aaac7e20 bsp=e0000741aaac0f30
> [<a00000010019e4e0>] do_execve+0x320/0x3e0
> sp=e0000741aaac7e20 bsp=e0000741aaac0ed0
> [<a000000100014d00>] sys_execve+0x60/0xc0
> sp=e0000741aaac7e30 bsp=e0000741aaac0e98
> [<a00000010000b690>] ia64_execve+0x30/0x140
> sp=e0000741aaac7e30 bsp=e0000741aaac0e48
> [<a00000010000bfa0>] ia64_ret_from_syscall+0x0/0x20
> sp=e0000741aaac7e30 bsp=e0000741aaac0e48
> [<a000000000010720>] __start_ivt_text+0xffffffff00010720/0x400
> sp=e0000741aaac8000 bsp=e0000741aaac0e48
>
>
>
>

2008-06-20 01:08:35

[permalink] [raw]

Subject: Re: [Experimental][PATCH] putback_lru_page rework

Lee-san, this is an additonal one..
Not-tested-yet, just by review.

Fixing page_lock() <-> zone->lock nesting of bad-behavior.

Before:
lock_page()(TestSetPageLocked())
spin_lock(zone->lock)
unlock_page()
spin_unlock(zone->lock)
After:
spin_lock(zone->lock)
spin_unlock(zone->lock)

Including nit-pick fix. (I'll ask Kosaki-san to merge this to his 5/5)

Hmm...

---
mm/vmscan.c | 25 +++++--------------------
1 file changed, 5 insertions(+), 20 deletions(-)

Index: test-2.6.26-rc5-mm3/mm/vmscan.c
===================================================================
--- test-2.6.26-rc5-mm3.orig/mm/vmscan.c
+++ test-2.6.26-rc5-mm3/mm/vmscan.c
@@ -1106,7 +1106,7 @@ static unsigned long shrink_inactive_lis
if (nr_taken == 0)
goto done;

- spin_lock(&zone->lru_lock);
+ spin_lock_irq(&zone->lru_lock);
/*
* Put back any unfreeable pages.
*/
@@ -1136,9 +1136,8 @@ static unsigned long shrink_inactive_lis
}
}
} while (nr_scanned < max_scan);
- spin_unlock(&zone->lru_lock);
+ spin_unlock_irq(&zone->lru_lock);
done:
- local_irq_enable();
pagevec_release(&pvec);
return nr_reclaimed;
}
@@ -2438,7 +2437,7 @@ static void show_page_path(struct page *
*/
static void check_move_unevictable_page(struct page *page, struct zone *zone)
{
-
+retry:
ClearPageUnevictable(page); /* for page_evictable() */
if (page_evictable(page, NULL)) {
enum lru_list l = LRU_INACTIVE_ANON + page_is_file_cache(page);
@@ -2455,6 +2454,8 @@ static void check_move_unevictable_page(
*/
SetPageUnevictable(page);
list_move(&page->lru, &zone->lru[LRU_UNEVICTABLE].list);
+ if (page_evictable(page, NULL))
+ goto retry;
}
}

@@ -2494,16 +2495,6 @@ void scan_mapping_unevictable_pages(stru
next = page_index;
next++;

- if (TestSetPageLocked(page)) {
- /*
- * OK, let's do it the hard way...
- */
- if (zone)
- spin_unlock_irq(&zone->lru_lock);
- zone = NULL;
- lock_page(page);
- }
-
if (pagezone != zone) {
if (zone)
spin_unlock_irq(&zone->lru_lock);
@@ -2514,8 +2505,6 @@ void scan_mapping_unevictable_pages(stru
if (PageLRU(page) && PageUnevictable(page))
check_move_unevictable_page(page, zone);

- unlock_page(page);
-
}
if (zone)
spin_unlock_irq(&zone->lru_lock);
@@ -2551,15 +2540,11 @@ void scan_zone_unevictable_pages(struct
for (scan = 0; scan < batch_size; scan++) {
struct page *page = lru_to_page(l_unevictable);

- if (TestSetPageLocked(page))
- continue;
-
prefetchw_prev_lru_page(page, l_unevictable, flags);

if (likely(PageLRU(page) && PageUnevictable(page)))
check_move_unevictable_page(page, zone);

- unlock_page(page);
}
spin_unlock_irq(&zone->lru_lock);

2008-06-20 03:18:38

[permalink] [raw]

Subject: Re: 2.6.26-rc5-mm3: BUG large value for HugePages_Rsvd

Andy Whitcroft wrote:
> On Thu, Jun 19, 2008 at 11:27:47AM -0500, Jon Tollefson wrote:
>
>> After running some of the libhugetlbfs tests the value for
>> /proc/meminfo/HugePages_Rsvd becomes really large. It looks like it has
>> wrapped backwards from zero.
>> Below is the sequence I used to run one of the tests that causes this;
>> the tests passes for what it is intended to test but leaves a large
>> value for reserved pages and that seemed strange to me.
>> test run on ppc64 with 16M huge pages
>>
>
> Yes Adam reported that here yesterday, he found it in his hugetlfs testing.
> I have done some investigation on it and it is being triggered by a bug in
> the private reservation tracking patches. It is triggered by the hugetlb
> test which causes some complex vma splits to occur on a private mapping.
>
sorry I missed that

> I believe I have the underlying problem nailed and do have some nearly
> complete patches for this and they should be in a postable state by
> tommorrow.
>
Cool.
> -apw
>
Jon

2008-06-20 13:21:42

by Ingo Molnar

[permalink] [raw]

Subject: Re: [BUG][PATCH -mm] avoid BUG() in __stop_machine_run()

* Jeremy Fitzhardinge <[email protected]> wrote:

>> This simply introduces a flag to allow us to disable the capability
>> checks for internal callers (this is simpler than splitting the
>> sched_setscheduler() function, since it loops checking permissions).
>>
> What about?
>
> int sched_setscheduler(struct task_struct *p, int policy,
> struct sched_param *param)
> {
> return __sched_setscheduler(p, policy, param, true);
> }
>
>
> int sched_setscheduler_nocheck(struct task_struct *p, int policy,
> struct sched_param *param)
> {
> return __sched_setscheduler(p, policy, param, false);
> }
>
>
> (With the appropriate transformation of sched_setscheduler -> __)
>
> Better than scattering stray true/falses around the code.

agreed - it would also be less intrusive on the API change side.

i've created a new tip/sched/new-API-sched_setscheduler topic for this
to track it, but it would be nice to have a v2 of this patch that
introduces the new API the way suggested by Jeremy. (Hence the new topic
is auto-merged into tip/master but not into linux-next yet.) Thanks,

Ingo

2008-06-20 16:24:33

[permalink] [raw]

Subject: Re: Re: [Experimental][PATCH] putback_lru_page rework

On Fri, 2008-06-20 at 00:32 +0900, [email protected] wrote:
> ----- Original Message -----
> >Subject: Re: [Experimental][PATCH] putback_lru_page rework
> >From: Lee Schermerhorn <[email protected]>
>
> >On Thu, 2008-06-19 at 09:22 +0900, KAMEZAWA Hiroyuki wrote:
> >> On Wed, 18 Jun 2008 14:21:06 -0400
> >> Lee Schermerhorn <[email protected]> wrote:
> >>
> >> > On Wed, 2008-06-18 at 18:40 +0900, KAMEZAWA Hiroyuki wrote:
> >> > > Lee-san, how about this ?
> >> > > Tested on x86-64 and tried Nisimura-san's test at el. works good now.
> >> >
> >> > I have been testing with my work load on both ia64 and x86_64 and it
> >> > seems to be working well. I'll let them run for a day or so.
> >> >
> >> thank you.
> >> <snip>
> >
> >Update:
> >
> >On x86_64 [32GB, 4xdual-core Opteron], my work load has run for ~20:40
> >hours. Still running.
> >
> >On ia64 [32G, 16cpu, 4 node], the system started going into softlockup
> >after ~7 hours. Stack trace [below] indicates zone-lru lock in
> >__page_cache_release() called from put_page(). Either heavy contention
> >or failure to unlock. Note that previous run, with patches to
> >putback_lru_page() and unmap_and_move(), the same load ran for ~18 hours
> >before I shut it down to try these patches.
> >
> Thanks, then there are more troubles should be shooted down.
>
>
> >I'm going to try again with the collected patches posted by Kosaki-san
> >[for which, Thanks!]. If it occurs again, I'll deconfig the unevictable
> >lru feature and see if I can reproduce it there. It may be unrelated to
> >the unevictable lru patches.
> >
> I hope so...Hmm..I'll dig tomorrow.

Another update--with the collected patches:

Again, the x86_64 ran for > 22 hours w/o error before I shut it down.

And, again, the ia64 went into soft lockup--same stack traces. This
time after > 17 hours of running. It is possible that a BUG started
this, but it has long scrolled out of my terminal buffer by the time I
see the system.

I'm now trying the ia64 platform with 26-rc5-mm3 + collected patches
with UNEVICTABLE_LRU de-configured. I'll start that up today and let it
run over the weekend [with panic_on_oops set] if it hasn't hit the
problem before I leave.

Regards,
Lee

2008-06-20 17:30:12

[permalink] [raw]

Subject: Re: [Experimental][PATCH] putback_lru_page rework

On Fri, 2008-06-20 at 10:13 +0900, KAMEZAWA Hiroyuki wrote:
> Lee-san, this is an additonal one..
> Not-tested-yet, just by review.

OK, I'll test this on my x86_64 platform, which doesn't seem to hit the
soft lockups.

>
> Fixing page_lock() <-> zone->lock nesting of bad-behavior.
>
> Before:
> lock_page()(TestSetPageLocked())
> spin_lock(zone->lock)
> unlock_page()
> spin_unlock(zone->lock)

Couple of comments:
* I believe that the locks are acquired in the right order--at least as
documented in the comments in mm/rmap.c.
* The unlocking appears out of order because this function attempts to
hold the zone lock across a few pages in the pagevec, but must switch to
a different zone lru lock when it finds a page on a different zone from
the zone whose lock it is holding--like in the pagevec draining
functions, altho' they don't need to lock the page.

> After:
> spin_lock(zone->lock)
> spin_unlock(zone->lock)

Right. With your reworked check_move_unevictable_page() [with retry],
we don't need to lock the page here, any more. That means we can revert
all of the changes to pass the mapping back to sys_shmctl() and move the
call to scan_mapping_unevictable_pages() back to shmem_lock() after
clearing the address_space's unevictable flag. We only did that to
avoid sleeping while holding the shmem_inode_info lock and the
shmid_kernel's ipc_perm spinlock.

Shall I handle that, after we've tested this patch?

>
> Including nit-pick fix. (I'll ask Kosaki-san to merge this to his 5/5)
>
> Hmm...
>
> ---
> mm/vmscan.c | 25 +++++--------------------
> 1 file changed, 5 insertions(+), 20 deletions(-)
>
> Index: test-2.6.26-rc5-mm3/mm/vmscan.c
> ===================================================================
> --- test-2.6.26-rc5-mm3.orig/mm/vmscan.c
> +++ test-2.6.26-rc5-mm3/mm/vmscan.c
> @@ -1106,7 +1106,7 @@ static unsigned long shrink_inactive_lis
> if (nr_taken == 0)
> goto done;
>
> - spin_lock(&zone->lru_lock);
> + spin_lock_irq(&zone->lru_lock);

1) It appears that the spin_lock() [no '_irq'] was there because irqs
are disabled a few lines above so that we could use non-atomic
__count[_zone]_vm_events().
2) I think this predates the split lru or unevictable lru patches, so
these changes are unrelated.
> /*
> * Put back any unfreeable pages.
> */
> @@ -1136,9 +1136,8 @@ static unsigned long shrink_inactive_lis
> }
> }
> } while (nr_scanned < max_scan);
> - spin_unlock(&zone->lru_lock);
> + spin_unlock_irq(&zone->lru_lock);
> done:
> - local_irq_enable();
> pagevec_release(&pvec);
> return nr_reclaimed;
> }
> @@ -2438,7 +2437,7 @@ static void show_page_path(struct page *
> */
> static void check_move_unevictable_page(struct page *page, struct zone *zone)
> {
> -
> +retry:
> ClearPageUnevictable(page); /* for page_evictable() */
We can remove this comment ^^^^^^^^^^^^^^^^^^^^^^^^^^
page_evictable() no longer asserts !PageUnevictable(), right?

> if (page_evictable(page, NULL)) {
> enum lru_list l = LRU_INACTIVE_ANON + page_is_file_cache(page);
> @@ -2455,6 +2454,8 @@ static void check_move_unevictable_page(
> */
> SetPageUnevictable(page);
> list_move(&page->lru, &zone->lru[LRU_UNEVICTABLE].list);
> + if (page_evictable(page, NULL))
> + goto retry;
> }
> }
>
> @@ -2494,16 +2495,6 @@ void scan_mapping_unevictable_pages(stru
> next = page_index;
> next++;
>
> - if (TestSetPageLocked(page)) {
> - /*
> - * OK, let's do it the hard way...
> - */
> - if (zone)
> - spin_unlock_irq(&zone->lru_lock);
> - zone = NULL;
> - lock_page(page);
> - }
> -
> if (pagezone != zone) {
> if (zone)
> spin_unlock_irq(&zone->lru_lock);
> @@ -2514,8 +2505,6 @@ void scan_mapping_unevictable_pages(stru
> if (PageLRU(page) && PageUnevictable(page))
> check_move_unevictable_page(page, zone);
>
> - unlock_page(page);
> -
> }
> if (zone)
> spin_unlock_irq(&zone->lru_lock);
> @@ -2551,15 +2540,11 @@ void scan_zone_unevictable_pages(struct
> for (scan = 0; scan < batch_size; scan++) {
> struct page *page = lru_to_page(l_unevictable);
>
> - if (TestSetPageLocked(page))
> - continue;
> -
> prefetchw_prev_lru_page(page, l_unevictable, flags);
>
> if (likely(PageLRU(page) && PageUnevictable(page)))
> check_move_unevictable_page(page, zone);
>
> - unlock_page(page);
> }
> spin_unlock_irq(&zone->lru_lock);
>
>

I'll let you know how it goes.

Later,
Lee

2008-06-20 19:22:04

[permalink] [raw]

Subject: [RFC] hugetlb reservations -- MAP_PRIVATE fixes for split vmas

As reported by Adam Litke and Jon Tollefson one of the libhugetlbfs
regression tests triggers a negative overall reservation count. When
this occurs where there is no dynamic pool enabled tests will fail.

Following this email are two patches to fix this issue:

hugetlb reservations: move region tracking earlier -- simply moves the
region tracking code earlier so we do not have to supply prototypes, and

hugetlb reservations: fix hugetlb MAP_PRIVATE reservations across vma
splits -- which moves us to tracking the consumed reservation so that
we can correctly calculate the remaining reservations at vma close time.

This stack is against the top of v2.6.25-rc6-mm3, should this solution
prove acceptable it would probabally need porting below Nicks multiple
hugepage size patches and those updated; if so I would be happy to do
that too.

Jon could you have a test on this and see if it works out for you.

-apw

2008-06-20 19:22:22

[permalink] [raw]

Subject: [PATCH 2/2] hugetlb reservations: fix hugetlb MAP_PRIVATE reservations across vma splits

When a hugetlb mapping with a reservation is split, a new VMA is cloned
from the original. This new VMA is a direct copy of the original
including the reservation count. When this pair of VMAs are unmapped
we will incorrect double account the unused reservation and the overall
reservation count will be incorrect, in extreme cases it will wrap.

The problem occurs when we split an existing VMA say to unmap a page in
the middle. split_vma() will create a new VMA copying all fields from
the original. As we are storing our reservation count in vm_private_data
this is also copies, endowing the new VMA with a duplicate of the original
VMA's reservation. Neither of the new VMAs can exhaust these reservations
as they are too small, but when we unmap and close these VMAs we will
incorrect credit the remainder twice and resv_huge_pages will become
out of sync. This can lead to allocation failures on mappings with
reservations and even to resv_huge_pages wrapping which prevents all
subsequent hugepage allocations.

The simple fix would be to correctly apportion the remaining reservation
count when the split is made. However the only hook we have vm_ops->open
only has the new VMA we do not know the identity of the preceeding VMA.
Also even if we did have that VMA to hand we do not know how much of the
reservation was consumed each side of the split.

This patch therefore takes a different tack. We know that the whole of any
private mapping (which has a reservation) has a reservation over its whole
size. Any present pages represent consumed reservation. Therefore if
we track the instantiated pages we can calculate the remaining reservation.

This patch reuses the existing regions code to track the regions for which
we have consumed reservation (ie. the instantiated pages), as each page
is faulted in we record the consumption of reservation for the new page.
When we need to return unused reservations at unmap time we simply count
the consumed reservation region subtracting that from the whole of the map.
During a VMA split the newly opened VMA will point to the same region map,
as this map is offset oriented it remains valid for both of the split VMAs.
This map is referenced counted so that it is removed when all VMAs which
are part of the mmap are gone.

Signed-off-by: Andy Whitcroft <[email protected]>
---
mm/hugetlb.c | 151 ++++++++++++++++++++++++++++++++++++++++++++++++----------
1 files changed, 126 insertions(+), 25 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d701e39..ecff986 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -171,6 +171,30 @@ static long region_truncate(struct list_head *head, long end)
return chg;
}

+static long region_count(struct list_head *head, long f, long t)
+{
+ struct file_region *rg;
+ long chg = 0;
+
+ /* Locate each segment we overlap with, and count that overlap. */
+ list_for_each_entry(rg, head, link) {
+ int seg_from;
+ int seg_to;
+
+ if (rg->to <= f)
+ continue;
+ if (rg->from >= t)
+ break;
+
+ seg_from = max(rg->from, f);
+ seg_to = min(rg->to, t);
+
+ chg += seg_to - seg_from;
+ }
+
+ return chg;
+}
+
/*
* Convert the address within this vma to the page offset within
* the mapping, in base page units.
@@ -193,9 +217,14 @@ static pgoff_t vma_pagecache_offset(struct hstate *h,
(vma->vm_pgoff >> huge_page_order(h));
}

-#define HPAGE_RESV_OWNER (1UL << (BITS_PER_LONG - 1))
-#define HPAGE_RESV_UNMAPPED (1UL << (BITS_PER_LONG - 2))
+/*
+ * Flags for MAP_PRIVATE reservations. These are stored in the bottom
+ * bits of the reservation map pointer.
+ */
+#define HPAGE_RESV_OWNER (1UL << 0)
+#define HPAGE_RESV_UNMAPPED (1UL << 1)
#define HPAGE_RESV_MASK (HPAGE_RESV_OWNER | HPAGE_RESV_UNMAPPED)
+
/*
* These helpers are used to track how many pages are reserved for
* faults in a MAP_PRIVATE mapping. Only the process that called mmap()
@@ -205,6 +234,15 @@ static pgoff_t vma_pagecache_offset(struct hstate *h,
* the reserve counters are updated with the hugetlb_lock held. It is safe
* to reset the VMA at fork() time as it is not in use yet and there is no
* chance of the global counters getting corrupted as a result of the values.
+ *
+ * The private mapping reservation is represented in a subtly different
+ * manner to a shared mapping. A shared mapping has a region map associated
+ * with the underlying file, this region map represents the backing file
+ * pages which have had a reservation taken and this persists even after
+ * the page is instantiated. A private mapping has a region map associated
+ * with the original mmap which is attached to all VMAs which reference it,
+ * this region map represents those offsets which have consumed reservation
+ * ie. where pages have been instantiated.
*/
static unsigned long get_vma_private_data(struct vm_area_struct *vma)
{
@@ -217,22 +255,44 @@ static void set_vma_private_data(struct vm_area_struct *vma,
vma->vm_private_data = (void *)value;
}

-static unsigned long vma_resv_huge_pages(struct vm_area_struct *vma)
+struct resv_map {
+ struct kref refs;
+ struct list_head regions;
+};
+
+struct resv_map *resv_map_alloc(void)
+{
+ struct resv_map *resv_map = kmalloc(sizeof(*resv_map), GFP_KERNEL);
+ if (!resv_map)
+ return NULL;
+
+ kref_init(&resv_map->refs);
+ INIT_LIST_HEAD(&resv_map->regions);
+
+ return resv_map;
+}
+
+void resv_map_release(struct kref *ref)
+{
+ struct resv_map *resv_map = container_of(ref, struct resv_map, refs);
+
+ region_truncate(&resv_map->regions, 0);
+ kfree(resv_map);
+}
+
+static struct resv_map *vma_resv_map(struct vm_area_struct *vma)
{
VM_BUG_ON(!is_vm_hugetlb_page(vma));
if (!(vma->vm_flags & VM_SHARED))
- return get_vma_private_data(vma) & ~HPAGE_RESV_MASK;
+ return (struct resv_map *)(get_vma_private_data(vma) &
+ ~HPAGE_RESV_MASK);
return 0;
}

-static void set_vma_resv_huge_pages(struct vm_area_struct *vma,
- unsigned long reserve)
+static void set_vma_resv_map(struct vm_area_struct *vma, struct resv_map *map)
{
- VM_BUG_ON(!is_vm_hugetlb_page(vma));
- VM_BUG_ON(vma->vm_flags & VM_SHARED);
-
- set_vma_private_data(vma,
- (get_vma_private_data(vma) & HPAGE_RESV_MASK) | reserve);
+ set_vma_private_data(vma, (get_vma_private_data(vma) &
+ HPAGE_RESV_MASK) | (unsigned long)map);
}

static void set_vma_resv_flags(struct vm_area_struct *vma, unsigned long flags)
@@ -251,11 +311,11 @@ static int is_vma_resv_set(struct vm_area_struct *vma, unsigned long flag)
}

/* Decrement the reserved pages in the hugepage pool by one */
-static void decrement_hugepage_resv_vma(struct hstate *h,
- struct vm_area_struct *vma)
+static int decrement_hugepage_resv_vma(struct hstate *h,
+ struct vm_area_struct *vma, unsigned long address)
{
if (vma->vm_flags & VM_NORESERVE)
- return;
+ return 0;

if (vma->vm_flags & VM_SHARED) {
/* Shared mappings always use reserves */
@@ -266,14 +326,19 @@ static void decrement_hugepage_resv_vma(struct hstate *h,
* private mappings.
*/
if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
- unsigned long flags, reserve;
+ unsigned long idx = vma_pagecache_offset(h,
+ vma, address);
+ struct resv_map *reservations = vma_resv_map(vma);
+
h->resv_huge_pages--;
- flags = (unsigned long)vma->vm_private_data &
- HPAGE_RESV_MASK;
- reserve = (unsigned long)vma->vm_private_data - 1;
- vma->vm_private_data = (void *)(reserve | flags);
+
+ /* Mark this page used in the map. */
+ if (region_chg(&reservations->regions, idx, idx + 1) < 0)
+ return -1;
+ region_add(&reservations->regions, idx, idx + 1);
}
}
+ return 0;
}

/* Reset counters to 0 and clear all HPAGE_RESV_* flags */
@@ -289,7 +354,7 @@ static int vma_has_private_reserves(struct vm_area_struct *vma)
{
if (vma->vm_flags & VM_SHARED)
return 0;
- if (!vma_resv_huge_pages(vma))
+ if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER))
return 0;
return 1;
}
@@ -376,15 +441,16 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
nid = zone_to_nid(zone);
if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask) &&
!list_empty(&h->hugepage_freelists[nid])) {
+ if (!avoid_reserve &&
+ decrement_hugepage_resv_vma(h, vma, address) < 0)
+ return NULL;
+
page = list_entry(h->hugepage_freelists[nid].next,
struct page, lru);
list_del(&page->lru);
h->free_huge_pages--;
h->free_huge_pages_node[nid]--;

- if (!avoid_reserve)
- decrement_hugepage_resv_vma(h, vma);
-
break;
}
}
@@ -1456,10 +1522,39 @@ out:
return ret;
}

+static void hugetlb_vm_op_open(struct vm_area_struct *vma)
+{
+ struct resv_map *reservations = vma_resv_map(vma);
+
+ /*
+ * This new VMA will share its siblings reservation map. The open
+ * vm_op is only called for newly created VMAs which have been made
+ * from another, still existing VMA. As that VMA has a reference to
+ * this reservation map the reservation map cannot disappear until
+ * after this open completes. It is therefore safe to take a new
+ * reference here without additional locking.
+ */
+ if (reservations)
+ kref_get(&reservations->refs);
+}
+
static void hugetlb_vm_op_close(struct vm_area_struct *vma)
{
struct hstate *h = hstate_vma(vma);
- unsigned long reserve = vma_resv_huge_pages(vma);
+ struct resv_map *reservations = vma_resv_map(vma);
+ unsigned long reserve = 0;
+ unsigned long start;
+ unsigned long end;
+
+ if (reservations) {
+ start = vma_pagecache_offset(h, vma, vma->vm_start);
+ end = vma_pagecache_offset(h, vma, vma->vm_end);
+
+ reserve = (end - start) -
+ region_count(&reservations->regions, start, end);
+
+ kref_put(&reservations->refs, resv_map_release);
+ }

if (reserve)
hugetlb_acct_memory(h, -reserve);
@@ -1479,6 +1574,7 @@ static int hugetlb_vm_op_fault(struct vm_area_struct *vma, struct vm_fault *vmf)

struct vm_operations_struct hugetlb_vm_ops = {
.fault = hugetlb_vm_op_fault,
+ .open = hugetlb_vm_op_open,
.close = hugetlb_vm_op_close,
};

@@ -2037,8 +2133,13 @@ int hugetlb_reserve_pages(struct inode *inode,
if (!vma || vma->vm_flags & VM_SHARED)
chg = region_chg(&inode->i_mapping->private_list, from, to);
else {
+ struct resv_map *resv_map = resv_map_alloc();
+ if (!resv_map)
+ return -ENOMEM;
+
chg = to - from;
- set_vma_resv_huge_pages(vma, chg);
+
+ set_vma_resv_map(vma, resv_map);
set_vma_resv_flags(vma, HPAGE_RESV_OWNER);
}

--
1.5.6.205.g7ca3a

2008-06-20 19:22:41

[permalink] [raw]

Subject: [PATCH 1/2] hugetlb reservations: move region tracking earlier

Move the region tracking code much earlier so we can use it for page
presence tracking later on. No code is changed, just its location.

Signed-off-by: Andy Whitcroft <[email protected]>
---
mm/hugetlb.c | 246 +++++++++++++++++++++++++++++----------------------------
1 files changed, 125 insertions(+), 121 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0f76ed1..d701e39 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -47,6 +47,131 @@ static unsigned long __initdata default_hstate_size;
static DEFINE_SPINLOCK(hugetlb_lock);

/*
+ * Region tracking -- allows tracking of reservations and instantiated pages
+ * across the pages in a mapping.
+ */
+struct file_region {
+ struct list_head link;
+ long from;
+ long to;
+};
+
+static long region_add(struct list_head *head, long f, long t)
+{
+ struct file_region *rg, *nrg, *trg;
+
+ /* Locate the region we are either in or before. */
+ list_for_each_entry(rg, head, link)
+ if (f <= rg->to)
+ break;
+
+ /* Round our left edge to the current segment if it encloses us. */
+ if (f > rg->from)
+ f = rg->from;
+
+ /* Check for and consume any regions we now overlap with. */
+ nrg = rg;
+ list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
+ if (&rg->link == head)
+ break;
+ if (rg->from > t)
+ break;
+
+ /* If this area reaches higher then extend our area to
+ * include it completely. If this is not the first area
+ * which we intend to reuse, free it. */
+ if (rg->to > t)
+ t = rg->to;
+ if (rg != nrg) {
+ list_del(&rg->link);
+ kfree(rg);
+ }
+ }
+ nrg->from = f;
+ nrg->to = t;
+ return 0;
+}
+
+static long region_chg(struct list_head *head, long f, long t)
+{
+ struct file_region *rg, *nrg;
+ long chg = 0;
+
+ /* Locate the region we are before or in. */
+ list_for_each_entry(rg, head, link)
+ if (f <= rg->to)
+ break;
+
+ /* If we are below the current region then a new region is required.
+ * Subtle, allocate a new region at the position but make it zero
+ * size such that we can guarantee to record the reservation. */
+ if (&rg->link == head || t < rg->from) {
+ nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
+ if (!nrg)
+ return -ENOMEM;
+ nrg->from = f;
+ nrg->to = f;
+ INIT_LIST_HEAD(&nrg->link);
+ list_add(&nrg->link, rg->link.prev);
+
+ return t - f;
+ }
+
+ /* Round our left edge to the current segment if it encloses us. */
+ if (f > rg->from)
+ f = rg->from;
+ chg = t - f;
+
+ /* Check for and consume any regions we now overlap with. */
+ list_for_each_entry(rg, rg->link.prev, link) {
+ if (&rg->link == head)
+ break;
+ if (rg->from > t)
+ return chg;
+
+ /* We overlap with this area, if it extends futher than
+ * us then we must extend ourselves. Account for its
+ * existing reservation. */
+ if (rg->to > t) {
+ chg += rg->to - t;
+ t = rg->to;
+ }
+ chg -= rg->to - rg->from;
+ }
+ return chg;
+}
+
+static long region_truncate(struct list_head *head, long end)
+{
+ struct file_region *rg, *trg;
+ long chg = 0;
+
+ /* Locate the region we are either in or before. */
+ list_for_each_entry(rg, head, link)
+ if (end <= rg->to)
+ break;
+ if (&rg->link == head)
+ return 0;
+
+ /* If we are in the middle of a region then adjust it. */
+ if (end > rg->from) {
+ chg = rg->to - end;
+ rg->to = end;
+ rg = list_entry(rg->link.next, typeof(*rg), link);
+ }
+
+ /* Drop any remaining regions. */
+ list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
+ if (&rg->link == head)
+ break;
+ chg += rg->to - rg->from;
+ list_del(&rg->link);
+ kfree(rg);
+ }
+ return chg;
+}
+
+/*
* Convert the address within this vma to the page offset within
* the mapping, in base page units.
*/
@@ -649,127 +774,6 @@ static void return_unused_surplus_pages(struct hstate *h,
}
}

-struct file_region {
- struct list_head link;
- long from;
- long to;
-};
-
-static long region_add(struct list_head *head, long f, long t)
-{
- struct file_region *rg, *nrg, *trg;
-
- /* Locate the region we are either in or before. */
- list_for_each_entry(rg, head, link)
- if (f <= rg->to)
- break;
-
- /* Round our left edge to the current segment if it encloses us. */
- if (f > rg->from)
- f = rg->from;
-
- /* Check for and consume any regions we now overlap with. */
- nrg = rg;
- list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
- if (&rg->link == head)
- break;
- if (rg->from > t)
- break;
-
- /* If this area reaches higher then extend our area to
- * include it completely. If this is not the first area
- * which we intend to reuse, free it. */
- if (rg->to > t)
- t = rg->to;
- if (rg != nrg) {
- list_del(&rg->link);
- kfree(rg);
- }
- }
- nrg->from = f;
- nrg->to = t;
- return 0;
-}
-
-static long region_chg(struct list_head *head, long f, long t)
-{
- struct file_region *rg, *nrg;
- long chg = 0;
-
- /* Locate the region we are before or in. */
- list_for_each_entry(rg, head, link)
- if (f <= rg->to)
- break;
-
- /* If we are below the current region then a new region is required.
- * Subtle, allocate a new region at the position but make it zero
- * size such that we can guarantee to record the reservation. */
- if (&rg->link == head || t < rg->from) {
- nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
- if (!nrg)
- return -ENOMEM;
- nrg->from = f;
- nrg->to = f;
- INIT_LIST_HEAD(&nrg->link);
- list_add(&nrg->link, rg->link.prev);
-
- return t - f;
- }
-
- /* Round our left edge to the current segment if it encloses us. */
- if (f > rg->from)
- f = rg->from;
- chg = t - f;
-
- /* Check for and consume any regions we now overlap with. */
- list_for_each_entry(rg, rg->link.prev, link) {
- if (&rg->link == head)
- break;
- if (rg->from > t)
- return chg;
-
- /* We overlap with this area, if it extends futher than
- * us then we must extend ourselves. Account for its
- * existing reservation. */
- if (rg->to > t) {
- chg += rg->to - t;
- t = rg->to;
- }
- chg -= rg->to - rg->from;
- }
- return chg;
-}
-
-static long region_truncate(struct list_head *head, long end)
-{
- struct file_region *rg, *trg;
- long chg = 0;
-
- /* Locate the region we are either in or before. */
- list_for_each_entry(rg, head, link)
- if (end <= rg->to)
- break;
- if (&rg->link == head)
- return 0;
-
- /* If we are in the middle of a region then adjust it. */
- if (end > rg->from) {
- chg = rg->to - end;
- rg->to = end;
- rg = list_entry(rg->link.next, typeof(*rg), link);
- }
-
- /* Drop any remaining regions. */
- list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
- if (&rg->link == head)
- break;
- chg += rg->to - rg->from;
- list_del(&rg->link);
- kfree(rg);
- }
- return chg;
-}
-
/*
* Determine if the huge page at addr within the vma has an associated
* reservation. Where it does not we will need to logically increase
--
1.5.6.205.g7ca3a

2008-06-20 20:49:19

[permalink] [raw]

Subject: Re: [Experimental][PATCH] putback_lru_page rework

On Fri, 2008-06-20 at 13:10 -0400, Lee Schermerhorn wrote:
> On Fri, 2008-06-20 at 10:13 +0900, KAMEZAWA Hiroyuki wrote:
> > Lee-san, this is an additonal one..
> > Not-tested-yet, just by review.
>
> OK, I'll test this on my x86_64 platform, which doesn't seem to hit the
> soft lockups.
>

Quick update:

With this patch applied, at ~ 1.5 hours into the test, my system panic'd
[panic_on_oops set] with a BUG in __find_get_block() -- looks like the
BUG_ON() in check_irqs_on() called from bh_lru_install() inlined by
__find_get_block(). Before the panic occurred, I saw warnings from
native_smp_call_function_mask() [arch/x86/kernel/smp.c]--also because
irqs_disabled().

I'll back out the changes [spin_[un]lock() => spin_[un]lock_irq()] to
shrink_inactive_list() and try again. Just a hunch.

Lee

2008-06-21 08:39:16

[permalink] [raw]

Subject: Re: [Experimental][PATCH] putback_lru_page rework

Hi

> Lee-san, this is an additonal one..
> Not-tested-yet, just by review.
>
> Fixing page_lock() <-> zone->lock nesting of bad-behavior.
>
> Before:
> lock_page()(TestSetPageLocked())
> spin_lock(zone->lock)
> unlock_page()
> spin_unlock(zone->lock)
> After:
> spin_lock(zone->lock)
> spin_unlock(zone->lock)

Good catch!

>
> Including nit-pick fix. (I'll ask Kosaki-san to merge this to his 5/5)
>
> Hmm...
>
> ---
> mm/vmscan.c | 25 +++++--------------------
> 1 file changed, 5 insertions(+), 20 deletions(-)
>
> Index: test-2.6.26-rc5-mm3/mm/vmscan.c
> ===================================================================
> --- test-2.6.26-rc5-mm3.orig/mm/vmscan.c
> +++ test-2.6.26-rc5-mm3/mm/vmscan.c
> @@ -1106,7 +1106,7 @@ static unsigned long shrink_inactive_lis
> if (nr_taken == 0)
> goto done;
>
> - spin_lock(&zone->lru_lock);
> + spin_lock_irq(&zone->lru_lock);
> /*
> * Put back any unfreeable pages.
> */
> @@ -1136,9 +1136,8 @@ static unsigned long shrink_inactive_lis
> }
> }
> } while (nr_scanned < max_scan);
> - spin_unlock(&zone->lru_lock);
> + spin_unlock_irq(&zone->lru_lock);
> done:
> - local_irq_enable();
> pagevec_release(&pvec);
> return nr_reclaimed;
> }

No.

shrink_inactive_list lock usage is

local_irq_disable()
spin_lock(&zone->lru_lock);
while(){
if (!pagevec_add(&pvec, page)) {
spin_unlock_irq(&zone->lru_lock);
__pagevec_release(&pvec);
spin_lock_irq(&zone->lru_lock);
}
}
spin_unlock(&zone->lru_lock);
local_irq_enable();

this keep below lock rule.
- if zone->lru_lock is holded, interrupt is always disable disabled.

> @@ -2438,7 +2437,7 @@ static void show_page_path(struct page *
> */
> static void check_move_unevictable_page(struct page *page, struct zone *zone)
> {
> -
> +retry:
> ClearPageUnevictable(page); /* for page_evictable() */
> if (page_evictable(page, NULL)) {
> enum lru_list l = LRU_INACTIVE_ANON + page_is_file_cache(page);
> @@ -2455,6 +2454,8 @@ static void check_move_unevictable_page(
> */
> SetPageUnevictable(page);
> list_move(&page->lru, &zone->lru[LRU_UNEVICTABLE].list);
> + if (page_evictable(page, NULL))
> + goto retry;
> }
> }

Right, Thanks.

>
> @@ -2494,16 +2495,6 @@ void scan_mapping_unevictable_pages(stru
> next = page_index;
> next++;
>
> - if (TestSetPageLocked(page)) {
> - /*
> - * OK, let's do it the hard way...
> - */
> - if (zone)
> - spin_unlock_irq(&zone->lru_lock);
> - zone = NULL;
> - lock_page(page);
> - }
> -
> if (pagezone != zone) {
> if (zone)
> spin_unlock_irq(&zone->lru_lock);
> @@ -2514,8 +2505,6 @@ void scan_mapping_unevictable_pages(stru
> if (PageLRU(page) && PageUnevictable(page))
> check_move_unevictable_page(page, zone);
>
> - unlock_page(page);
> -
> }
> if (zone)
> spin_unlock_irq(&zone->lru_lock);

Right.

> @@ -2551,15 +2540,11 @@ void scan_zone_unevictable_pages(struct
> for (scan = 0; scan < batch_size; scan++) {
> struct page *page = lru_to_page(l_unevictable);
>
> - if (TestSetPageLocked(page))
> - continue;
> -
> prefetchw_prev_lru_page(page, l_unevictable, flags);
>
> if (likely(PageLRU(page) && PageUnevictable(page)))
> check_move_unevictable_page(page, zone);
>
> - unlock_page(page);
> }
> spin_unlock_irq(&zone->lru_lock);

Right.

2008-06-21 08:41:42

[permalink] [raw]

Subject: Re: [Experimental][PATCH] putback_lru_page rework

> > Before:
> > lock_page()(TestSetPageLocked())
> > spin_lock(zone->lock)
> > unlock_page()
> > spin_unlock(zone->lock)
>
> Couple of comments:
> * I believe that the locks are acquired in the right order--at least as
> documented in the comments in mm/rmap.c.
> * The unlocking appears out of order because this function attempts to
> hold the zone lock across a few pages in the pagevec, but must switch to
> a different zone lru lock when it finds a page on a different zone from
> the zone whose lock it is holding--like in the pagevec draining
> functions, altho' they don't need to lock the page.
>
> > After:
> > spin_lock(zone->lock)
> > spin_unlock(zone->lock)
>
> Right. With your reworked check_move_unevictable_page() [with retry],
> we don't need to lock the page here, any more. That means we can revert
> all of the changes to pass the mapping back to sys_shmctl() and move the
> call to scan_mapping_unevictable_pages() back to shmem_lock() after
> clearing the address_space's unevictable flag. We only did that to
> avoid sleeping while holding the shmem_inode_info lock and the
> shmid_kernel's ipc_perm spinlock.
>
> Shall I handle that, after we've tested this patch?

Yeah, I'll do it :)

> > @@ -2438,7 +2437,7 @@ static void show_page_path(struct page *
> > */
> > static void check_move_unevictable_page(struct page *page, struct zone *zone)
> > {
> > -
> > +retry:
> > ClearPageUnevictable(page); /* for page_evictable() */
> We can remove this comment ^^^^^^^^^^^^^^^^^^^^^^^^^^
> page_evictable() no longer asserts !PageUnevictable(), right?

Yes.
I'll remove it.

2008-06-21 08:56:50

[permalink] [raw]

Subject: Re: [Experimental][PATCH] putback_lru_page rework

> Quick update:
>
> With this patch applied, at ~ 1.5 hours into the test, my system panic'd
> [panic_on_oops set] with a BUG in __find_get_block() -- looks like the
> BUG_ON() in check_irqs_on() called from bh_lru_install() inlined by
> __find_get_block(). Before the panic occurred, I saw warnings from
> native_smp_call_function_mask() [arch/x86/kernel/smp.c]--also because
> irqs_disabled().
>
> I'll back out the changes [spin_[un]lock() => spin_[un]lock_irq()] to
> shrink_inactive_list() and try again. Just a hunch.

Yup.
Kamezawa-san's patch remove local_irq_enable(), but don't remove
local_irq_disable().
thus, irq is never enabled.

> - spin_unlock(&zone->lru_lock);
> + spin_unlock_irq(&zone->lru_lock);
> done:
> - local_irq_enable();

2008-06-23 00:25:42

[permalink] [raw]

Subject: Re: [Experimental][PATCH] putback_lru_page rework

On Sat, 21 Jun 2008 17:56:17 +0900
KOSAKI Motohiro <[email protected]> wrote:

> > Quick update:
> >
> > With this patch applied, at ~ 1.5 hours into the test, my system panic'd
> > [panic_on_oops set] with a BUG in __find_get_block() -- looks like the
> > BUG_ON() in check_irqs_on() called from bh_lru_install() inlined by
> > __find_get_block(). Before the panic occurred, I saw warnings from
> > native_smp_call_function_mask() [arch/x86/kernel/smp.c]--also because
> > irqs_disabled().
> >
> > I'll back out the changes [spin_[un]lock() => spin_[un]lock_irq()] to
> > shrink_inactive_list() and try again. Just a hunch.
>
> Yup.
> Kamezawa-san's patch remove local_irq_enable(), but don't remove
> local_irq_disable().
> thus, irq is never enabled.
>

Sorry,
-Kame

> > - spin_unlock(&zone->lru_lock);
> > + spin_unlock_irq(&zone->lru_lock);
> > done:
> > - local_irq_enable();
>
>
>

2008-06-23 03:57:30

by Rusty Russell

[permalink] [raw]

Subject: Re: [BUG][PATCH -mm] avoid BUG() in __stop_machine_run()

On Friday 20 June 2008 23:21:10 Ingo Molnar wrote:
> * Jeremy Fitzhardinge <[email protected]> wrote:
> >> This simply introduces a flag to allow us to disable the capability
> >> checks for internal callers (this is simpler than splitting the
> >> sched_setscheduler() function, since it loops checking permissions).
> >
> > What about?
> >
> > int sched_setscheduler(struct task_struct *p, int policy,
> > struct sched_param *param)
> > {
> > return __sched_setscheduler(p, policy, param, true);
> > }
> >
> >
> > int sched_setscheduler_nocheck(struct task_struct *p, int policy,
> > struct sched_param *param)
> > {
> > return __sched_setscheduler(p, policy, param, false);
> > }
> >
> >
> > (With the appropriate transformation of sched_setscheduler -> __)
> >
> > Better than scattering stray true/falses around the code.
>
> agreed - it would also be less intrusive on the API change side.

Yes, here's the patch. I've put it in my tree for testing, too.

sched_setscheduler_nocheck: add a flag to control access checks

Hidehiro Kawai noticed that sched_setscheduler() can fail in
stop_machine: it calls sched_setscheduler() from insmod, which can
have CAP_SYS_MODULE without CAP_SYS_NICE.

Two cases could have failed, so are changed to sched_setscheduler_nocheck:
kernel/softirq.c:cpu_callback()
- CPU hotplug callback
kernel/stop_machine.c:__stop_machine_run()
- Called from various places, including modprobe()

Signed-off-by: Rusty Russell <[email protected]>

diff -r 91c45b8d7775 include/linux/sched.h
--- a/include/linux/sched.h Mon Jun 23 13:49:26 2008 +1000
+++ b/include/linux/sched.h Mon Jun 23 13:54:55 2008 +1000
@@ -1655,6 +1655,8 @@ extern int task_curr(const struct task_s
extern int task_curr(const struct task_struct *p);
extern int idle_cpu(int cpu);
extern int sched_setscheduler(struct task_struct *, int, struct sched_param *);
+extern int sched_setscheduler_nocheck(struct task_struct *, int,
+ struct sched_param *);
extern struct task_struct *idle_task(int cpu);
extern struct task_struct *curr_task(int cpu);
extern void set_curr_task(int cpu, struct task_struct *p);
diff -r 91c45b8d7775 kernel/sched.c
--- a/kernel/sched.c Mon Jun 23 13:49:26 2008 +1000
+++ b/kernel/sched.c Mon Jun 23 13:54:55 2008 +1000
@@ -4744,16 +4744,8 @@ __setscheduler(struct rq *rq, struct tas
set_load_weight(p);
}

-/**
- * sched_setscheduler - change the scheduling policy and/or RT priority of a thread.
- * @p: the task in question.
- * @policy: new policy.
- * @param: structure containing the new RT priority.
- *
- * NOTE that the task may be already dead.
- */
-int sched_setscheduler(struct task_struct *p, int policy,
- struct sched_param *param)
+static int __sched_setscheduler(struct task_struct *p, int policy,
+ struct sched_param *param, bool user)
{
int retval, oldprio, oldpolicy = -1, on_rq, running;
unsigned long flags;
@@ -4785,7 +4777,7 @@ recheck:
/*
* Allow unprivileged RT tasks to decrease priority:
*/
- if (!capable(CAP_SYS_NICE)) {
+ if (user && !capable(CAP_SYS_NICE)) {
if (rt_policy(policy)) {
unsigned long rlim_rtprio;

@@ -4821,7 +4813,8 @@ recheck:
* Do not allow realtime tasks into groups that have no runtime
* assigned.
*/
- if (rt_policy(policy) && task_group(p)->rt_bandwidth.rt_runtime == 0)
+ if (user
+ && rt_policy(policy) && task_group(p)->rt_bandwidth.rt_runtime == 0)
return -EPERM;
#endif

@@ -4870,7 +4863,38 @@ recheck:

return 0;
}
+
+/**
+ * sched_setscheduler - change the scheduling policy and/or RT priority of a thread.
+ * @p: the task in question.
+ * @policy: new policy.
+ * @param: structure containing the new RT priority.
+ *
+ * NOTE that the task may be already dead.
+ */
+int sched_setscheduler(struct task_struct *p, int policy,
+ struct sched_param *param)
+{
+ return __sched_setscheduler(p, policy, param, true);
+}
EXPORT_SYMBOL_GPL(sched_setscheduler);
+
+/**
+ * sched_setscheduler_nocheck - change the scheduling policy and/or RT priority of a thread
from kernelspace.
+ * @p: the task in question.
+ * @policy: new policy.
+ * @param: structure containing the new RT priority.
+ *
+ * Just like sched_setscheduler, only don't bother checking if the
+ * current context has permission. For example, this is needed in
+ * stop_machine(): we create temporary high priority worker threads,
+ * but our caller might not have that capability.
+ */
+int sched_setscheduler_nocheck(struct task_struct *p, int policy,
+ struct sched_param *param)
+{
+ return __sched_setscheduler(p, policy, param, false);
+}

static int
do_sched_setscheduler(pid_t pid, int policy, struct sched_param __user *param)
diff -r 91c45b8d7775 kernel/softirq.c
--- a/kernel/softirq.c Mon Jun 23 13:49:26 2008 +1000
+++ b/kernel/softirq.c Mon Jun 23 13:54:55 2008 +1000
@@ -645,7 +645,7 @@ static int __cpuinit cpu_callback(struct

p = per_cpu(ksoftirqd, hotcpu);
per_cpu(ksoftirqd, hotcpu) = NULL;
- sched_setscheduler(p, SCHED_FIFO, &param);
+ sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
kthread_stop(p);
takeover_tasklets(hotcpu);
break;
diff -r 91c45b8d7775 kernel/stop_machine.c
--- a/kernel/stop_machine.c Mon Jun 23 13:49:26 2008 +1000
+++ b/kernel/stop_machine.c Mon Jun 23 13:54:55 2008 +1000
@@ -187,7 +187,7 @@ struct task_struct *__stop_machine_run(i
struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };

/* One high-prio thread per cpu. We'll do this one. */
- sched_setscheduler(p, SCHED_FIFO, &param);
+ sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
kthread_bind(p, cpu);
wake_up_process(p);
wait_for_completion(&smdata.done);

2008-06-23 07:33:27

[permalink] [raw]

Subject: Re: [PATCH 2/2] hugetlb reservations: fix hugetlb MAP_PRIVATE reservations across vma splits

On (20/06/08 20:17), Andy Whitcroft didst pronounce:
> When a hugetlb mapping with a reservation is split, a new VMA is cloned
> from the original. This new VMA is a direct copy of the original
> including the reservation count. When this pair of VMAs are unmapped
> we will incorrect double account the unused reservation and the overall
> reservation count will be incorrect, in extreme cases it will wrap.
>

D'oh. It's not even that extreme, it's fairly straight-forward
to trigger as it turns out as this crappy application shows
http://www.csn.ul.ie/~mel/postings/apw-20080622/hugetlbfs-unmap-private-test.c
. This runs on x86 and can wrap the rsvd counters. I believe the other tests
I was running had already used the reserves and missed this test case.

> The problem occurs when we split an existing VMA say to unmap a page in
> the middle. split_vma() will create a new VMA copying all fields from
> the original. As we are storing our reservation count in vm_private_data
> this is also copies, endowing the new VMA with a duplicate of the original
> VMA's reservation. Neither of the new VMAs can exhaust these reservations
> as they are too small, but when we unmap and close these VMAs we will
> incorrect credit the remainder twice and resv_huge_pages will become
> out of sync. This can lead to allocation failures on mappings with
> reservations and even to resv_huge_pages wrapping which prevents all
> subsequent hugepage allocations.
>

Yeah, that does sound as if it would occur all right and running the
test program confirms it.

> The simple fix would be to correctly apportion the remaining reservation
> count when the split is made. However the only hook we have vm_ops->open
> only has the new VMA we do not know the identity of the preceeding VMA.
> Also even if we did have that VMA to hand we do not know how much of the
> reservation was consumed each side of the split.
>
> This patch therefore takes a different tack. We know that the whole of any
> private mapping (which has a reservation) has a reservation over its whole
> size. Any present pages represent consumed reservation. Therefore if
> we track the instantiated pages we can calculate the remaining reservation.
>
> This patch reuses the existing regions code to track the regions for which
> we have consumed reservation (ie. the instantiated pages), as each page
> is faulted in we record the consumption of reservation for the new page.

Clever. The additional nice thing is that it makes private mappings less of
a special case in comparison to shared mappings. My impression right now is
that with the path, shared mappings track reservations based on the underlying
file and the private mappings are generally tracked per-mapping and only share
due to unmap-related-splits or forks(). That seems a bit more consistent.

> When we need to return unused reservations at unmap time we simply count
> the consumed reservation region subtracting that from the whole of the map.
> During a VMA split the newly opened VMA will point to the same region map,
> as this map is offset oriented it remains valid for both of the split VMAs.
> This map is referenced counted so that it is removed when all VMAs which
> are part of the mmap are gone.
>

This looks sensible and applying the patches and running the test program
means that the reserve counter does not wrap when the program exists which
is very nice. I also tested a parent-child scenario where the pool is of
insufficient size and the child gets killed as expected. Thanks a million
for cleaning this up.

Some comments below but they are relatively minor.

> Signed-off-by: Andy Whitcroft <[email protected]>
> ---
> mm/hugetlb.c | 151 ++++++++++++++++++++++++++++++++++++++++++++++++----------
> 1 files changed, 126 insertions(+), 25 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index d701e39..ecff986 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -171,6 +171,30 @@ static long region_truncate(struct list_head *head, long end)
> return chg;
> }
>
> +static long region_count(struct list_head *head, long f, long t)
> +{
> + struct file_region *rg;
> + long chg = 0;
> +
> + /* Locate each segment we overlap with, and count that overlap. */
> + list_for_each_entry(rg, head, link) {
> + int seg_from;
> + int seg_to;
> +
> + if (rg->to <= f)
> + continue;
> + if (rg->from >= t)
> + break;
> +
> + seg_from = max(rg->from, f);
> + seg_to = min(rg->to, t);
> +
> + chg += seg_to - seg_from;
> + }
> +
> + return chg;
> +}

Ok, seems straight forward. The tuples track pages that already exist so
by counting the overlaps in a given range, you know how many hugepages
have been faulted. The size of the VMA minus the overlap is the
required reservation.

> +
> /*
> * Convert the address within this vma to the page offset within
> * the mapping, in base page units.
> @@ -193,9 +217,14 @@ static pgoff_t vma_pagecache_offset(struct hstate *h,
> (vma->vm_pgoff >> huge_page_order(h));
> }
>
> -#define HPAGE_RESV_OWNER (1UL << (BITS_PER_LONG - 1))
> -#define HPAGE_RESV_UNMAPPED (1UL << (BITS_PER_LONG - 2))
> +/*
> + * Flags for MAP_PRIVATE reservations. These are stored in the bottom
> + * bits of the reservation map pointer.
> + */
> +#define HPAGE_RESV_OWNER (1UL << 0)
> +#define HPAGE_RESV_UNMAPPED (1UL << 1)
> #define HPAGE_RESV_MASK (HPAGE_RESV_OWNER | HPAGE_RESV_UNMAPPED)
> +

The bits move here but for good reason. private_data is now a pointer and
we pack flags into bits that are available due to alignment. Right?

> /*
> * These helpers are used to track how many pages are reserved for
> * faults in a MAP_PRIVATE mapping. Only the process that called mmap()
> @@ -205,6 +234,15 @@ static pgoff_t vma_pagecache_offset(struct hstate *h,
> * the reserve counters are updated with the hugetlb_lock held. It is safe
> * to reset the VMA at fork() time as it is not in use yet and there is no
> * chance of the global counters getting corrupted as a result of the values.
> + *
> + * The private mapping reservation is represented in a subtly different
> + * manner to a shared mapping. A shared mapping has a region map associated
> + * with the underlying file, this region map represents the backing file
> + * pages which have had a reservation taken and this persists even after
> + * the page is instantiated. A private mapping has a region map associated
> + * with the original mmap which is attached to all VMAs which reference it,
> + * this region map represents those offsets which have consumed reservation
> + * ie. where pages have been instantiated.
> */
> static unsigned long get_vma_private_data(struct vm_area_struct *vma)
> {
> @@ -217,22 +255,44 @@ static void set_vma_private_data(struct vm_area_struct *vma,
> vma->vm_private_data = (void *)value;
> }
>
> -static unsigned long vma_resv_huge_pages(struct vm_area_struct *vma)
> +struct resv_map {
> + struct kref refs;
> + struct list_head regions;
> +};
> +
> +struct resv_map *resv_map_alloc(void)
> +{
> + struct resv_map *resv_map = kmalloc(sizeof(*resv_map), GFP_KERNEL);
> + if (!resv_map)
> + return NULL;
> +
> + kref_init(&resv_map->refs);
> + INIT_LIST_HEAD(&resv_map->regions);
> +
> + return resv_map;
> +}
> +
> +void resv_map_release(struct kref *ref)
> +{
> + struct resv_map *resv_map = container_of(ref, struct resv_map, refs);
> +

tabs vs space problem here.

> + region_truncate(&resv_map->regions, 0);
> + kfree(resv_map);
> +}

Otherwise, looks right. The region_truncate() looked a bit odd but you
have call it or memory would leak so well thought out there.

> +
> +static struct resv_map *vma_resv_map(struct vm_area_struct *vma)
> {
> VM_BUG_ON(!is_vm_hugetlb_page(vma));
> if (!(vma->vm_flags & VM_SHARED))
> - return get_vma_private_data(vma) & ~HPAGE_RESV_MASK;
> + return (struct resv_map *)(get_vma_private_data(vma) &
> + ~HPAGE_RESV_MASK);
> return 0;
> }
>
> -static void set_vma_resv_huge_pages(struct vm_area_struct *vma,
> - unsigned long reserve)
> +static void set_vma_resv_map(struct vm_area_struct *vma, struct resv_map *map)
> {
> - VM_BUG_ON(!is_vm_hugetlb_page(vma));
> - VM_BUG_ON(vma->vm_flags & VM_SHARED);
> -
> - set_vma_private_data(vma,
> - (get_vma_private_data(vma) & HPAGE_RESV_MASK) | reserve);
> + set_vma_private_data(vma, (get_vma_private_data(vma) &
> + HPAGE_RESV_MASK) | (unsigned long)map);
> }

The VM_BUG_ON checks are removed here. Is that intentional? They still
seem valid but maybe I am missing something.

>
> static void set_vma_resv_flags(struct vm_area_struct *vma, unsigned long flags)
> @@ -251,11 +311,11 @@ static int is_vma_resv_set(struct vm_area_struct *vma, unsigned long flag)
> }
>
> /* Decrement the reserved pages in the hugepage pool by one */
> -static void decrement_hugepage_resv_vma(struct hstate *h,
> - struct vm_area_struct *vma)
> +static int decrement_hugepage_resv_vma(struct hstate *h,
> + struct vm_area_struct *vma, unsigned long address)
> {

The comment needs an update here to explain what the return value means.
I believe the reason is below.

> if (vma->vm_flags & VM_NORESERVE)
> - return;
> + return 0;
>
> if (vma->vm_flags & VM_SHARED) {
> /* Shared mappings always use reserves */
> @@ -266,14 +326,19 @@ static void decrement_hugepage_resv_vma(struct hstate *h,
> * private mappings.
> */
> if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
> - unsigned long flags, reserve;
> + unsigned long idx = vma_pagecache_offset(h,
> + vma, address);
> + struct resv_map *reservations = vma_resv_map(vma);
> +
> h->resv_huge_pages--;
> - flags = (unsigned long)vma->vm_private_data &
> - HPAGE_RESV_MASK;
> - reserve = (unsigned long)vma->vm_private_data - 1;
> - vma->vm_private_data = (void *)(reserve | flags);
> +
> + /* Mark this page used in the map. */
> + if (region_chg(&reservations->regions, idx, idx + 1) < 0)
> + return -1;
> + region_add(&reservations->regions, idx, idx + 1);

There is an incredibly remote possibility that a fault would fail for a
mapping that had reserved huge pages because the kmalloc() in region_chg
failed. The system would have to be in terrible shape though. Should a
KERN_WARNING be printed here if this failure path is entered? Otherwise it
will just mainfest as a SIGKILLd application.

> }
> }
> + return 0;
> }
>
> /* Reset counters to 0 and clear all HPAGE_RESV_* flags */
> @@ -289,7 +354,7 @@ static int vma_has_private_reserves(struct vm_area_struct *vma)
> {
> if (vma->vm_flags & VM_SHARED)
> return 0;
> - if (!vma_resv_huge_pages(vma))
> + if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER))
> return 0;
> return 1;
> }
> @@ -376,15 +441,16 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
> nid = zone_to_nid(zone);
> if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask) &&
> !list_empty(&h->hugepage_freelists[nid])) {
> + if (!avoid_reserve &&
> + decrement_hugepage_resv_vma(h, vma, address) < 0)
> + return NULL;
> +
> page = list_entry(h->hugepage_freelists[nid].next,
> struct page, lru);
> list_del(&page->lru);
> h->free_huge_pages--;
> h->free_huge_pages_node[nid]--;
>
> - if (!avoid_reserve)
> - decrement_hugepage_resv_vma(h, vma);
> -
> break;
> }
> }
> @@ -1456,10 +1522,39 @@ out:
> return ret;
> }
>
> +static void hugetlb_vm_op_open(struct vm_area_struct *vma)
> +{
> + struct resv_map *reservations = vma_resv_map(vma);
> +
> + /*
> + * This new VMA will share its siblings reservation map. The open
> + * vm_op is only called for newly created VMAs which have been made
> + * from another, still existing VMA. As that VMA has a reference to
> + * this reservation map the reservation map cannot disappear until
> + * after this open completes. It is therefore safe to take a new
> + * reference here without additional locking.
> + */
> + if (reservations)
> + kref_get(&reservations->refs);
> +}

This comment is a tad misleading. The open call is also called at fork()
time. However, in the case of fork, the private_data will be cleared.
Maybe something like;

====
The open vm_op is called when new VMAs are created but only VMAs which
have been made from another, still existing VMA will have a
reservation....
====

?

> +
> static void hugetlb_vm_op_close(struct vm_area_struct *vma)
> {
> struct hstate *h = hstate_vma(vma);
> - unsigned long reserve = vma_resv_huge_pages(vma);
> + struct resv_map *reservations = vma_resv_map(vma);
> + unsigned long reserve = 0;
> + unsigned long start;
> + unsigned long end;
> +
> + if (reservations) {
> + start = vma_pagecache_offset(h, vma, vma->vm_start);
> + end = vma_pagecache_offset(h, vma, vma->vm_end);
> +
> + reserve = (end - start) -
> + region_count(&reservations->regions, start, end);
> +
> + kref_put(&reservations->refs, resv_map_release);
> + }
>

Clever. So, a split VMA will have the reference map for portions of the
mapping outside its range but region_count() ensures that we decrement
by the correct amount.

> if (reserve)
> hugetlb_acct_memory(h, -reserve);
> @@ -1479,6 +1574,7 @@ static int hugetlb_vm_op_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>
> struct vm_operations_struct hugetlb_vm_ops = {
> .fault = hugetlb_vm_op_fault,
> + .open = hugetlb_vm_op_open,
> .close = hugetlb_vm_op_close,
> };
>
> @@ -2037,8 +2133,13 @@ int hugetlb_reserve_pages(struct inode *inode,
> if (!vma || vma->vm_flags & VM_SHARED)
> chg = region_chg(&inode->i_mapping->private_list, from, to);
> else {
> + struct resv_map *resv_map = resv_map_alloc();
> + if (!resv_map)
> + return -ENOMEM;
> +
> chg = to - from;
> - set_vma_resv_huge_pages(vma, chg);
> +
> + set_vma_resv_map(vma, resv_map);
> set_vma_resv_flags(vma, HPAGE_RESV_OWNER);
> }

Overall, this is a really clever idea and I like that it brings private
mappings closer to shared mappings in a number of respects.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2008-06-23 08:01:00

[permalink] [raw]

Subject: Re: [PATCH 2/2] hugetlb reservations: fix hugetlb MAP_PRIVATE reservations across vma splits

Typical. I spotted this after I pushed send.....

> <SNIP>

> @@ -266,14 +326,19 @@ static void decrement_hugepage_resv_vma(struct hstate *h,
> * private mappings.
> */
> if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
> - unsigned long flags, reserve;
> + unsigned long idx = vma_pagecache_offset(h,
> + vma, address);
> + struct resv_map *reservations = vma_resv_map(vma);
> +
> h->resv_huge_pages--;
> - flags = (unsigned long)vma->vm_private_data &
> - HPAGE_RESV_MASK;
> - reserve = (unsigned long)vma->vm_private_data - 1;
> - vma->vm_private_data = (void *)(reserve | flags);
> +
> + /* Mark this page used in the map. */
> + if (region_chg(&reservations->regions, idx, idx + 1) < 0)
> + return -1;
> + region_add(&reservations->regions, idx, idx + 1);
> }

decrement_hugepage_resv_vma() is called with hugetlb_lock held and region_chg
calls kmalloc(GFP_KERNEL). Hence it's possible we would sleep with that
spinlock held which is a bit uncool. The allocation needs to happen outside
the lock. Right?

> <SNIP>

--
Mel Gorman

2008-06-23 09:56:41

[permalink] [raw]

Subject: Re: [PATCH 2/2] hugetlb reservations: fix hugetlb MAP_PRIVATE reservations across vma splits

On Mon, Jun 23, 2008 at 09:00:48AM +0100, Mel Gorman wrote:
> Typical. I spotted this after I pushed send.....
>
> > <SNIP>
>
> > @@ -266,14 +326,19 @@ static void decrement_hugepage_resv_vma(struct hstate *h,
> > * private mappings.
> > */
> > if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
> > - unsigned long flags, reserve;
> > + unsigned long idx = vma_pagecache_offset(h,
> > + vma, address);
> > + struct resv_map *reservations = vma_resv_map(vma);
> > +
> > h->resv_huge_pages--;
> > - flags = (unsigned long)vma->vm_private_data &
> > - HPAGE_RESV_MASK;
> > - reserve = (unsigned long)vma->vm_private_data - 1;
> > - vma->vm_private_data = (void *)(reserve | flags);
> > +
> > + /* Mark this page used in the map. */
> > + if (region_chg(&reservations->regions, idx, idx + 1) < 0)
> > + return -1;
> > + region_add(&reservations->regions, idx, idx + 1);
> > }
>
> decrement_hugepage_resv_vma() is called with hugetlb_lock held and region_chg
> calls kmalloc(GFP_KERNEL). Hence it's possible we would sleep with that
> spinlock held which is a bit uncool. The allocation needs to happen outside
> the lock. Right?

Yes, good spot. Luckily this pair of calls can be separated, as the
first is a prepare and the second a commit. So I can trivially pull
the allocation outside the lock.

Had a quick go at this and it looks like I can move both out of the lock
to a much more logical spot and clean the patch up significantly. Will
fold in your other comments and post up a V2 once it has been tested.

Thanks.

-apw

2008-06-23 16:04:21

[permalink] [raw]

Subject: Re: [RFC] hugetlb reservations -- MAP_PRIVATE fixes for split vmas

Andy Whitcroft wrote:
> As reported by Adam Litke and Jon Tollefson one of the libhugetlbfs
> regression tests triggers a negative overall reservation count. When
> this occurs where there is no dynamic pool enabled tests will fail.
>
> Following this email are two patches to fix this issue:
>
> hugetlb reservations: move region tracking earlier -- simply moves the
> region tracking code earlier so we do not have to supply prototypes, and
>
> hugetlb reservations: fix hugetlb MAP_PRIVATE reservations across vma
> splits -- which moves us to tracking the consumed reservation so that
> we can correctly calculate the remaining reservations at vma close time.
>
> This stack is against the top of v2.6.25-rc6-mm3, should this solution
> prove acceptable it would probabally need porting below Nicks multiple
> hugepage size patches and those updated; if so I would be happy to do
> that too.
>
> Jon could you have a test on this and see if it works out for you.
>
> -apw
>
Looking good so far.
I am not seeing any of the tests push the reservation number negative -
with this patch set applied

Jon

2008-06-23 17:37:32

[permalink] [raw]

Subject: [RFC] hugetlb reservations -- MAP_PRIVATE fixes for split vmas V2

As reported by Adam Litke and Jon Tollefson one of the libhugetlbfs
regression tests triggers a negative overall reservation count. When
this occurs where there is no dynamic pool enabled tests will fail.

Following this email are two patches to address this issue:

hugetlb reservations: move region tracking earlier -- simply moves the
region tracking code earlier so we do not have to supply prototypes, and

hugetlb reservations: fix hugetlb MAP_PRIVATE reservations across vma
splits -- which moves us to tracking the consumed reservation so that
we can correctly calculate the remaining reservations at vma close time.

This stack is against the top of v2.6.25-rc6-mm3, should this solution
prove acceptable it would need slipping underneath Nick's multiple hugepage
size patches and those updated. I have a modified stack prepared for that.

This version incorporates Mel's feedback (both cosmetic, and an allocation
under spinlock issue) and has an improved layout.

Changes in V2:
- commentry updates
- pull allocations out from under hugetlb_lock
- refactor to match shared code layout
- reinstate BUG_ON's

Jon could you have a test on this and see if it works out for you.

-apw

2008-06-23 17:38:42

[permalink] [raw]

Subject: [PATCH 1/2] hugetlb reservations: move region tracking earlier

2008-06-23 17:38:53

[permalink] [raw]

Subject: [PATCH 2/2] hugetlb reservations: fix hugetlb MAP_PRIVATE reservations across vma splits V2

When a hugetlb mapping with a reservation is split, a new VMA is cloned
from the original. This new VMA is a direct copy of the original
including the reservation count. When this pair of VMAs are unmapped
we will incorrect double account the unused reservation and the overall
reservation count will be incorrect, in extreme cases it will wrap.

The problem occurs when we split an existing VMA say to unmap a page in
the middle. split_vma() will create a new VMA copying all fields from
the original. As we are storing our reservation count in vm_private_data
this is also copies, endowing the new VMA with a duplicate of the original
VMA's reservation. Neither of the new VMAs can exhaust these reservations
as they are too small, but when we unmap and close these VMAs we will
incorrect credit the remainder twice and resv_huge_pages will become
out of sync. This can lead to allocation failures on mappings with
reservations and even to resv_huge_pages wrapping which prevents all
subsequent hugepage allocations.

The simple fix would be to correctly apportion the remaining reservation
count when the split is made. However the only hook we have vm_ops->open
only has the new VMA we do not know the identity of the preceeding VMA.
Also even if we did have that VMA to hand we do not know how much of the
reservation was consumed each side of the split.

This patch therefore takes a different tack. We know that the whole of any
private mapping (which has a reservation) has a reservation over its whole
size. Any present pages represent consumed reservation. Therefore if
we track the instantiated pages we can calculate the remaining reservation.

This patch reuses the existing regions code to track the regions for which
we have consumed reservation (ie. the instantiated pages), as each page
is faulted in we record the consumption of reservation for the new page.
When we need to return unused reservations at unmap time we simply count
the consumed reservation region subtracting that from the whole of the map.
During a VMA split the newly opened VMA will point to the same region map,
as this map is offset oriented it remains valid for both of the split VMAs.
This map is referenced counted so that it is removed when all VMAs which
are part of the mmap are gone.

Thanks to Adam Litke and Mel Gorman for their review feedback.

Signed-off-by: Andy Whitcroft <[email protected]>
---
mm/hugetlb.c | 171 ++++++++++++++++++++++++++++++++++++++++++++++++---------
1 files changed, 144 insertions(+), 27 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d701e39..7ba6d4d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -49,6 +49,16 @@ static DEFINE_SPINLOCK(hugetlb_lock);
/*
* Region tracking -- allows tracking of reservations and instantiated pages
* across the pages in a mapping.
+ *
+ * The region data structures are protected by a combination of the mmap_sem
+ * and the hugetlb_instantion_mutex. To access or modify a region the caller
+ * must either hold the mmap_sem for write, or the mmap_sem for read and
+ * the hugetlb_instantiation mutex:
+ *
+ * down_write(&mm->mmap_sem);
+ * or
+ * down_read(&mm->mmap_sem);
+ * mutex_lock(&hugetlb_instantiation_mutex);
*/
struct file_region {
struct list_head link;
@@ -171,6 +181,30 @@ static long region_truncate(struct list_head *head, long end)
return chg;
}

+static long region_count(struct list_head *head, long f, long t)
+{
+ struct file_region *rg;
+ long chg = 0;
+
+ /* Locate each segment we overlap with, and count that overlap. */
+ list_for_each_entry(rg, head, link) {
+ int seg_from;
+ int seg_to;
+
+ if (rg->to <= f)
+ continue;
+ if (rg->from >= t)
+ break;
+
+ seg_from = max(rg->from, f);
+ seg_to = min(rg->to, t);
+
+ chg += seg_to - seg_from;
+ }
+
+ return chg;
+}
+
/*
* Convert the address within this vma to the page offset within
* the mapping, in base page units.
@@ -193,9 +227,15 @@ static pgoff_t vma_pagecache_offset(struct hstate *h,
(vma->vm_pgoff >> huge_page_order(h));
}

-#define HPAGE_RESV_OWNER (1UL << (BITS_PER_LONG - 1))
-#define HPAGE_RESV_UNMAPPED (1UL << (BITS_PER_LONG - 2))
+/*
+ * Flags for MAP_PRIVATE reservations. These are stored in the bottom
+ * bits of the reservation map pointer, which are always clear due to
+ * alignment.
+ */
+#define HPAGE_RESV_OWNER (1UL << 0)
+#define HPAGE_RESV_UNMAPPED (1UL << 1)
#define HPAGE_RESV_MASK (HPAGE_RESV_OWNER | HPAGE_RESV_UNMAPPED)
+
/*
* These helpers are used to track how many pages are reserved for
* faults in a MAP_PRIVATE mapping. Only the process that called mmap()
@@ -205,6 +245,15 @@ static pgoff_t vma_pagecache_offset(struct hstate *h,
* the reserve counters are updated with the hugetlb_lock held. It is safe
* to reset the VMA at fork() time as it is not in use yet and there is no
* chance of the global counters getting corrupted as a result of the values.
+ *
+ * The private mapping reservation is represented in a subtly different
+ * manner to a shared mapping. A shared mapping has a region map associated
+ * with the underlying file, this region map represents the backing file
+ * pages which have ever had a reservation assigned which this persists even
+ * after the page is instantiated. A private mapping has a region map
+ * associated with the original mmap which is attached to all VMAs which
+ * reference it, this region map represents those offsets which have consumed
+ * reservation ie. where pages have been instantiated.
*/
static unsigned long get_vma_private_data(struct vm_area_struct *vma)
{
@@ -217,22 +266,48 @@ static void set_vma_private_data(struct vm_area_struct *vma,
vma->vm_private_data = (void *)value;
}

-static unsigned long vma_resv_huge_pages(struct vm_area_struct *vma)
+struct resv_map {
+ struct kref refs;
+ struct list_head regions;
+};
+
+struct resv_map *resv_map_alloc(void)
+{
+ struct resv_map *resv_map = kmalloc(sizeof(*resv_map), GFP_KERNEL);
+ if (!resv_map)
+ return NULL;
+
+ kref_init(&resv_map->refs);
+ INIT_LIST_HEAD(&resv_map->regions);
+
+ return resv_map;
+}
+
+void resv_map_release(struct kref *ref)
+{
+ struct resv_map *resv_map = container_of(ref, struct resv_map, refs);
+
+ /* Clear out any active regions before we release the map. */
+ region_truncate(&resv_map->regions, 0);
+ kfree(resv_map);
+}
+
+static struct resv_map *vma_resv_map(struct vm_area_struct *vma)
{
VM_BUG_ON(!is_vm_hugetlb_page(vma));
if (!(vma->vm_flags & VM_SHARED))
- return get_vma_private_data(vma) & ~HPAGE_RESV_MASK;
+ return (struct resv_map *)(get_vma_private_data(vma) &
+ ~HPAGE_RESV_MASK);
return 0;
}

-static void set_vma_resv_huge_pages(struct vm_area_struct *vma,
- unsigned long reserve)
+static void set_vma_resv_map(struct vm_area_struct *vma, struct resv_map *map)
{
VM_BUG_ON(!is_vm_hugetlb_page(vma));
VM_BUG_ON(vma->vm_flags & VM_SHARED);

- set_vma_private_data(vma,
- (get_vma_private_data(vma) & HPAGE_RESV_MASK) | reserve);
+ set_vma_private_data(vma, (get_vma_private_data(vma) &
+ HPAGE_RESV_MASK) | (unsigned long)map);
}

static void set_vma_resv_flags(struct vm_area_struct *vma, unsigned long flags)
@@ -260,19 +335,12 @@ static void decrement_hugepage_resv_vma(struct hstate *h,
if (vma->vm_flags & VM_SHARED) {
/* Shared mappings always use reserves */
h->resv_huge_pages--;
- } else {
+ } else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
/*
* Only the process that called mmap() has reserves for
* private mappings.
*/
- if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
- unsigned long flags, reserve;
- h->resv_huge_pages--;
- flags = (unsigned long)vma->vm_private_data &
- HPAGE_RESV_MASK;
- reserve = (unsigned long)vma->vm_private_data - 1;
- vma->vm_private_data = (void *)(reserve | flags);
- }
+ h->resv_huge_pages--;
}
}

@@ -289,7 +357,7 @@ static int vma_has_private_reserves(struct vm_area_struct *vma)
{
if (vma->vm_flags & VM_SHARED)
return 0;
- if (!vma_resv_huge_pages(vma))
+ if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER))
return 0;
return 1;
}
@@ -794,12 +862,19 @@ static int vma_needs_reservation(struct hstate *h,
return region_chg(&inode->i_mapping->private_list,
idx, idx + 1);

- } else {
- if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER))
- return 1;
- }
+ } else if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
+ return 1;

- return 0;
+ } else {
+ int err;
+ pgoff_t idx = vma_pagecache_offset(h, vma, addr);
+ struct resv_map *reservations = vma_resv_map(vma);
+
+ err = region_chg(&reservations->regions, idx, idx + 1);
+ if (err < 0)
+ return err;
+ return 0;
+ }
}
static void vma_commit_reservation(struct hstate *h,
struct vm_area_struct *vma, unsigned long addr)
@@ -810,6 +885,13 @@ static void vma_commit_reservation(struct hstate *h,
if (vma->vm_flags & VM_SHARED) {
pgoff_t idx = vma_pagecache_offset(h, vma, addr);
region_add(&inode->i_mapping->private_list, idx, idx + 1);
+
+ } else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
+ pgoff_t idx = vma_pagecache_offset(h, vma, addr);
+ struct resv_map *reservations = vma_resv_map(vma);
+
+ /* Mark this page used in the map. */
+ region_add(&reservations->regions, idx, idx + 1);
}
}

@@ -1456,13 +1538,42 @@ out:
return ret;
}

+static void hugetlb_vm_op_open(struct vm_area_struct *vma)
+{
+ struct resv_map *reservations = vma_resv_map(vma);
+
+ /*
+ * This new VMA should share its siblings reservation map if present.
+ * The VMA will only ever have a valid reservation map pointer where
+ * it is being copied for another still existing VMA. As that VMA
+ * has a reference to the reservation map it cannot dissappear until
+ * after this open call completes. It is therefore safe to take a
+ * new reference here without additional locking.
+ */
+ if (reservations)
+ kref_get(&reservations->refs);
+}
+
static void hugetlb_vm_op_close(struct vm_area_struct *vma)
{
struct hstate *h = hstate_vma(vma);
- unsigned long reserve = vma_resv_huge_pages(vma);
+ struct resv_map *reservations = vma_resv_map(vma);
+ unsigned long reserve;
+ unsigned long start;
+ unsigned long end;

- if (reserve)
- hugetlb_acct_memory(h, -reserve);
+ if (reservations) {
+ start = vma_pagecache_offset(h, vma, vma->vm_start);
+ end = vma_pagecache_offset(h, vma, vma->vm_end);
+
+ reserve = (end - start) -
+ region_count(&reservations->regions, start, end);
+
+ kref_put(&reservations->refs, resv_map_release);
+
+ if (reserve)
+ hugetlb_acct_memory(h, -reserve);
+ }
}

/*
@@ -1479,6 +1590,7 @@ static int hugetlb_vm_op_fault(struct vm_area_struct *vma, struct vm_fault *vmf)

struct vm_operations_struct hugetlb_vm_ops = {
.fault = hugetlb_vm_op_fault,
+ .open = hugetlb_vm_op_open,
.close = hugetlb_vm_op_close,
};

@@ -2037,8 +2149,13 @@ int hugetlb_reserve_pages(struct inode *inode,
if (!vma || vma->vm_flags & VM_SHARED)
chg = region_chg(&inode->i_mapping->private_list, from, to);
else {
+ struct resv_map *resv_map = resv_map_alloc();
+ if (!resv_map)
+ return -ENOMEM;
+
chg = to - from;
- set_vma_resv_huge_pages(vma, chg);
+
+ set_vma_resv_map(vma, resv_map);
set_vma_resv_flags(vma, HPAGE_RESV_OWNER);
}

--
1.5.6.205.g7ca3a

2008-06-23 21:02:12

by Ingo Molnar

[permalink] [raw]

Subject: Re: [BUG][PATCH -mm] avoid BUG() in __stop_machine_run()

* Rusty Russell <[email protected]> wrote:

> On Friday 20 June 2008 23:21:10 Ingo Molnar wrote:
> > * Jeremy Fitzhardinge <[email protected]> wrote:
[...]
> > > (With the appropriate transformation of sched_setscheduler -> __)
> > >
> > > Better than scattering stray true/falses around the code.
> >
> > agreed - it would also be less intrusive on the API change side.
>
> Yes, here's the patch. I've put it in my tree for testing, too.
>
> sched_setscheduler_nocheck: add a flag to control access checks

applied to tip/sched/new-API-sched_setscheduler, thanks Rusty. Also
added it to auto-sched-next so that it shows up in linux-next.

btw., had to merge this bit manually:

> +/**
> + * sched_setscheduler_nocheck - change the scheduling policy and/or RT priority of a thread
> from kernelspace.
> + * @p: the task in question.

as it suffered from line-warp damage.

Ingo

2008-06-23 23:06:41

[permalink] [raw]

Subject: Re: [PATCH 1/2] hugetlb reservations: move region tracking earlier

On (23/06/08 18:35), Andy Whitcroft didst pronounce:
> Move the region tracking code much earlier so we can use it for page
> presence tracking later on. No code is changed, just its location.
>
> Signed-off-by: Andy Whitcroft <[email protected]>

Straight-forward code-move.

Acked-by: Mel Gorman <[email protected]>

> ---
> mm/hugetlb.c | 246 +++++++++++++++++++++++++++++----------------------------
> 1 files changed, 125 insertions(+), 121 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 0f76ed1..d701e39 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -47,6 +47,131 @@ static unsigned long __initdata default_hstate_size;
> static DEFINE_SPINLOCK(hugetlb_lock);
>
> /*
> + * Region tracking -- allows tracking of reservations and instantiated pages
> + * across the pages in a mapping.
> + */
> +struct file_region {
> + struct list_head link;
> + long from;
> + long to;
> +};
> +
> +static long region_add(struct list_head *head, long f, long t)
> +{
> + struct file_region *rg, *nrg, *trg;
> +
> + /* Locate the region we are either in or before. */
> + list_for_each_entry(rg, head, link)
> + if (f <= rg->to)
> + break;
> +
> + /* Round our left edge to the current segment if it encloses us. */
> + if (f > rg->from)
> + f = rg->from;
> +
> + /* Check for and consume any regions we now overlap with. */
> + nrg = rg;
> + list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
> + if (&rg->link == head)
> + break;
> + if (rg->from > t)
> + break;
> +
> + /* If this area reaches higher then extend our area to
> + * include it completely. If this is not the first area
> + * which we intend to reuse, free it. */
> + if (rg->to > t)
> + t = rg->to;
> + if (rg != nrg) {
> + list_del(&rg->link);
> + kfree(rg);
> + }
> + }
> + nrg->from = f;
> + nrg->to = t;
> + return 0;
> +}
> +
> +static long region_chg(struct list_head *head, long f, long t)
> +{
> + struct file_region *rg, *nrg;
> + long chg = 0;
> +
> + /* Locate the region we are before or in. */
> + list_for_each_entry(rg, head, link)
> + if (f <= rg->to)
> + break;
> +
> + /* If we are below the current region then a new region is required.
> + * Subtle, allocate a new region at the position but make it zero
> + * size such that we can guarantee to record the reservation. */
> + if (&rg->link == head || t < rg->from) {
> + nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
> + if (!nrg)
> + return -ENOMEM;
> + nrg->from = f;
> + nrg->to = f;
> + INIT_LIST_HEAD(&nrg->link);
> + list_add(&nrg->link, rg->link.prev);
> +
> + return t - f;
> + }
> +
> + /* Round our left edge to the current segment if it encloses us. */
> + if (f > rg->from)
> + f = rg->from;
> + chg = t - f;
> +
> + /* Check for and consume any regions we now overlap with. */
> + list_for_each_entry(rg, rg->link.prev, link) {
> + if (&rg->link == head)
> + break;
> + if (rg->from > t)
> + return chg;
> +
> + /* We overlap with this area, if it extends futher than
> + * us then we must extend ourselves. Account for its
> + * existing reservation. */
> + if (rg->to > t) {
> + chg += rg->to - t;
> + t = rg->to;
> + }
> + chg -= rg->to - rg->from;
> + }
> + return chg;
> +}
> +
> +static long region_truncate(struct list_head *head, long end)
> +{
> + struct file_region *rg, *trg;
> + long chg = 0;
> +
> + /* Locate the region we are either in or before. */
> + list_for_each_entry(rg, head, link)
> + if (end <= rg->to)
> + break;
> + if (&rg->link == head)
> + return 0;
> +
> + /* If we are in the middle of a region then adjust it. */
> + if (end > rg->from) {
> + chg = rg->to - end;
> + rg->to = end;
> + rg = list_entry(rg->link.next, typeof(*rg), link);
> + }
> +
> + /* Drop any remaining regions. */
> + list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
> + if (&rg->link == head)
> + break;
> + chg += rg->to - rg->from;
> + list_del(&rg->link);
> + kfree(rg);
> + }
> + return chg;
> +}
> +
> +/*
> * Convert the address within this vma to the page offset within
> * the mapping, in base page units.
> */
> @@ -649,127 +774,6 @@ static void return_unused_surplus_pages(struct hstate *h,
> }
> }
>
> -struct file_region {
> - struct list_head link;
> - long from;
> - long to;
> -};
> -
> -static long region_add(struct list_head *head, long f, long t)
> -{
> - struct file_region *rg, *nrg, *trg;
> -
> - /* Locate the region we are either in or before. */
> - list_for_each_entry(rg, head, link)
> - if (f <= rg->to)
> - break;
> -
> - /* Round our left edge to the current segment if it encloses us. */
> - if (f > rg->from)
> - f = rg->from;
> -
> - /* Check for and consume any regions we now overlap with. */
> - nrg = rg;
> - list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
> - if (&rg->link == head)
> - break;
> - if (rg->from > t)
> - break;
> -
> - /* If this area reaches higher then extend our area to
> - * include it completely. If this is not the first area
> - * which we intend to reuse, free it. */
> - if (rg->to > t)
> - t = rg->to;
> - if (rg != nrg) {
> - list_del(&rg->link);
> - kfree(rg);
> - }
> - }
> - nrg->from = f;
> - nrg->to = t;
> - return 0;
> -}
> -
> -static long region_chg(struct list_head *head, long f, long t)
> -{
> - struct file_region *rg, *nrg;
> - long chg = 0;
> -
> - /* Locate the region we are before or in. */
> - list_for_each_entry(rg, head, link)
> - if (f <= rg->to)
> - break;
> -
> - /* If we are below the current region then a new region is required.
> - * Subtle, allocate a new region at the position but make it zero
> - * size such that we can guarantee to record the reservation. */
> - if (&rg->link == head || t < rg->from) {
> - nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
> - if (!nrg)
> - return -ENOMEM;
> - nrg->from = f;
> - nrg->to = f;
> - INIT_LIST_HEAD(&nrg->link);
> - list_add(&nrg->link, rg->link.prev);
> -
> - return t - f;
> - }
> -
> - /* Round our left edge to the current segment if it encloses us. */
> - if (f > rg->from)
> - f = rg->from;
> - chg = t - f;
> -
> - /* Check for and consume any regions we now overlap with. */
> - list_for_each_entry(rg, rg->link.prev, link) {
> - if (&rg->link == head)
> - break;
> - if (rg->from > t)
> - return chg;
> -
> - /* We overlap with this area, if it extends futher than
> - * us then we must extend ourselves. Account for its
> - * existing reservation. */
> - if (rg->to > t) {
> - chg += rg->to - t;
> - t = rg->to;
> - }
> - chg -= rg->to - rg->from;
> - }
> - return chg;
> -}
> -
> -static long region_truncate(struct list_head *head, long end)
> -{
> - struct file_region *rg, *trg;
> - long chg = 0;
> -
> - /* Locate the region we are either in or before. */
> - list_for_each_entry(rg, head, link)
> - if (end <= rg->to)
> - break;
> - if (&rg->link == head)
> - return 0;
> -
> - /* If we are in the middle of a region then adjust it. */
> - if (end > rg->from) {
> - chg = rg->to - end;
> - rg->to = end;
> - rg = list_entry(rg->link.next, typeof(*rg), link);
> - }
> -
> - /* Drop any remaining regions. */
> - list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
> - if (&rg->link == head)
> - break;
> - chg += rg->to - rg->from;
> - list_del(&rg->link);
> - kfree(rg);
> - }
> - return chg;
> -}
> -
> /*
> * Determine if the huge page at addr within the vma has an associated
> * reservation. Where it does not we will need to logically increase
> --
> 1.5.6.205.g7ca3a
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2008-06-23 23:11:30

[permalink] [raw]

Subject: Re: [PATCH 2/2] hugetlb reservations: fix hugetlb MAP_PRIVATE reservations across vma splits V2

On (23/06/08 18:35), Andy Whitcroft didst pronounce:
> When a hugetlb mapping with a reservation is split, a new VMA is cloned
> from the original. This new VMA is a direct copy of the original
> including the reservation count. When this pair of VMAs are unmapped
> we will incorrect double account the unused reservation and the overall
> reservation count will be incorrect, in extreme cases it will wrap.
>
> The problem occurs when we split an existing VMA say to unmap a page in
> the middle. split_vma() will create a new VMA copying all fields from
> the original. As we are storing our reservation count in vm_private_data
> this is also copies, endowing the new VMA with a duplicate of the original
> VMA's reservation. Neither of the new VMAs can exhaust these reservations
> as they are too small, but when we unmap and close these VMAs we will
> incorrect credit the remainder twice and resv_huge_pages will become
> out of sync. This can lead to allocation failures on mappings with
> reservations and even to resv_huge_pages wrapping which prevents all
> subsequent hugepage allocations.
>
> The simple fix would be to correctly apportion the remaining reservation
> count when the split is made. However the only hook we have vm_ops->open
> only has the new VMA we do not know the identity of the preceeding VMA.
> Also even if we did have that VMA to hand we do not know how much of the
> reservation was consumed each side of the split.
>
> This patch therefore takes a different tack. We know that the whole of any
> private mapping (which has a reservation) has a reservation over its whole
> size. Any present pages represent consumed reservation. Therefore if
> we track the instantiated pages we can calculate the remaining reservation.
>
> This patch reuses the existing regions code to track the regions for which
> we have consumed reservation (ie. the instantiated pages), as each page
> is faulted in we record the consumption of reservation for the new page.
> When we need to return unused reservations at unmap time we simply count
> the consumed reservation region subtracting that from the whole of the map.
> During a VMA split the newly opened VMA will point to the same region map,
> as this map is offset oriented it remains valid for both of the split VMAs.
> This map is referenced counted so that it is removed when all VMAs which
> are part of the mmap are gone.
>
> Thanks to Adam Litke and Mel Gorman for their review feedback.
>
> Signed-off-by: Andy Whitcroft <[email protected]>

Nice explanation. Testing on i386 with qemu, this patch allows some
small tests to pass without corruption of the rsvd counters.
libhugetlbfs tests also passed. I do not see anything new to complain
about in the code. Thanks.

Acked-by: Mel Gorman <[email protected]>

> ---
> mm/hugetlb.c | 171 ++++++++++++++++++++++++++++++++++++++++++++++++---------
> 1 files changed, 144 insertions(+), 27 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index d701e39..7ba6d4d 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -49,6 +49,16 @@ static DEFINE_SPINLOCK(hugetlb_lock);
> /*
> * Region tracking -- allows tracking of reservations and instantiated pages
> * across the pages in a mapping.
> + *
> + * The region data structures are protected by a combination of the mmap_sem
> + * and the hugetlb_instantion_mutex. To access or modify a region the caller
> + * must either hold the mmap_sem for write, or the mmap_sem for read and
> + * the hugetlb_instantiation mutex:
> + *
> + * down_write(&mm->mmap_sem);
> + * or
> + * down_read(&mm->mmap_sem);
> + * mutex_lock(&hugetlb_instantiation_mutex);
> */
> struct file_region {
> struct list_head link;
> @@ -171,6 +181,30 @@ static long region_truncate(struct list_head *head, long end)
> return chg;
> }
>
> +static long region_count(struct list_head *head, long f, long t)
> +{
> + struct file_region *rg;
> + long chg = 0;
> +
> + /* Locate each segment we overlap with, and count that overlap. */
> + list_for_each_entry(rg, head, link) {
> + int seg_from;
> + int seg_to;
> +
> + if (rg->to <= f)
> + continue;
> + if (rg->from >= t)
> + break;
> +
> + seg_from = max(rg->from, f);
> + seg_to = min(rg->to, t);
> +
> + chg += seg_to - seg_from;
> + }
> +
> + return chg;
> +}
> +
> /*
> * Convert the address within this vma to the page offset within
> * the mapping, in base page units.
> @@ -193,9 +227,15 @@ static pgoff_t vma_pagecache_offset(struct hstate *h,
> (vma->vm_pgoff >> huge_page_order(h));
> }
>
> -#define HPAGE_RESV_OWNER (1UL << (BITS_PER_LONG - 1))
> -#define HPAGE_RESV_UNMAPPED (1UL << (BITS_PER_LONG - 2))
> +/*
> + * Flags for MAP_PRIVATE reservations. These are stored in the bottom
> + * bits of the reservation map pointer, which are always clear due to
> + * alignment.
> + */
> +#define HPAGE_RESV_OWNER (1UL << 0)
> +#define HPAGE_RESV_UNMAPPED (1UL << 1)
> #define HPAGE_RESV_MASK (HPAGE_RESV_OWNER | HPAGE_RESV_UNMAPPED)
> +
> /*
> * These helpers are used to track how many pages are reserved for
> * faults in a MAP_PRIVATE mapping. Only the process that called mmap()
> @@ -205,6 +245,15 @@ static pgoff_t vma_pagecache_offset(struct hstate *h,
> * the reserve counters are updated with the hugetlb_lock held. It is safe
> * to reset the VMA at fork() time as it is not in use yet and there is no
> * chance of the global counters getting corrupted as a result of the values.
> + *
> + * The private mapping reservation is represented in a subtly different
> + * manner to a shared mapping. A shared mapping has a region map associated
> + * with the underlying file, this region map represents the backing file
> + * pages which have ever had a reservation assigned which this persists even
> + * after the page is instantiated. A private mapping has a region map
> + * associated with the original mmap which is attached to all VMAs which
> + * reference it, this region map represents those offsets which have consumed
> + * reservation ie. where pages have been instantiated.
> */
> static unsigned long get_vma_private_data(struct vm_area_struct *vma)
> {
> @@ -217,22 +266,48 @@ static void set_vma_private_data(struct vm_area_struct *vma,
> vma->vm_private_data = (void *)value;
> }
>
> -static unsigned long vma_resv_huge_pages(struct vm_area_struct *vma)
> +struct resv_map {
> + struct kref refs;
> + struct list_head regions;
> +};
> +
> +struct resv_map *resv_map_alloc(void)
> +{
> + struct resv_map *resv_map = kmalloc(sizeof(*resv_map), GFP_KERNEL);
> + if (!resv_map)
> + return NULL;
> +
> + kref_init(&resv_map->refs);
> + INIT_LIST_HEAD(&resv_map->regions);
> +
> + return resv_map;
> +}
> +
> +void resv_map_release(struct kref *ref)
> +{
> + struct resv_map *resv_map = container_of(ref, struct resv_map, refs);
> +
> + /* Clear out any active regions before we release the map. */
> + region_truncate(&resv_map->regions, 0);
> + kfree(resv_map);
> +}
> +
> +static struct resv_map *vma_resv_map(struct vm_area_struct *vma)
> {
> VM_BUG_ON(!is_vm_hugetlb_page(vma));
> if (!(vma->vm_flags & VM_SHARED))
> - return get_vma_private_data(vma) & ~HPAGE_RESV_MASK;
> + return (struct resv_map *)(get_vma_private_data(vma) &
> + ~HPAGE_RESV_MASK);
> return 0;
> }
>
> -static void set_vma_resv_huge_pages(struct vm_area_struct *vma,
> - unsigned long reserve)
> +static void set_vma_resv_map(struct vm_area_struct *vma, struct resv_map *map)
> {
> VM_BUG_ON(!is_vm_hugetlb_page(vma));
> VM_BUG_ON(vma->vm_flags & VM_SHARED);
>
> - set_vma_private_data(vma,
> - (get_vma_private_data(vma) & HPAGE_RESV_MASK) | reserve);
> + set_vma_private_data(vma, (get_vma_private_data(vma) &
> + HPAGE_RESV_MASK) | (unsigned long)map);
> }
>
> static void set_vma_resv_flags(struct vm_area_struct *vma, unsigned long flags)
> @@ -260,19 +335,12 @@ static void decrement_hugepage_resv_vma(struct hstate *h,
> if (vma->vm_flags & VM_SHARED) {
> /* Shared mappings always use reserves */
> h->resv_huge_pages--;
> - } else {
> + } else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
> /*
> * Only the process that called mmap() has reserves for
> * private mappings.
> */
> - if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
> - unsigned long flags, reserve;
> - h->resv_huge_pages--;
> - flags = (unsigned long)vma->vm_private_data &
> - HPAGE_RESV_MASK;
> - reserve = (unsigned long)vma->vm_private_data - 1;
> - vma->vm_private_data = (void *)(reserve | flags);
> - }
> + h->resv_huge_pages--;
> }
> }
>
> @@ -289,7 +357,7 @@ static int vma_has_private_reserves(struct vm_area_struct *vma)
> {
> if (vma->vm_flags & VM_SHARED)
> return 0;
> - if (!vma_resv_huge_pages(vma))
> + if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER))
> return 0;
> return 1;
> }
> @@ -794,12 +862,19 @@ static int vma_needs_reservation(struct hstate *h,
> return region_chg(&inode->i_mapping->private_list,
> idx, idx + 1);
>
> - } else {
> - if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER))
> - return 1;
> - }
> + } else if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
> + return 1;
>
> - return 0;
> + } else {
> + int err;
> + pgoff_t idx = vma_pagecache_offset(h, vma, addr);
> + struct resv_map *reservations = vma_resv_map(vma);
> +
> + err = region_chg(&reservations->regions, idx, idx + 1);
> + if (err < 0)
> + return err;
> + return 0;
> + }
> }
> static void vma_commit_reservation(struct hstate *h,
> struct vm_area_struct *vma, unsigned long addr)
> @@ -810,6 +885,13 @@ static void vma_commit_reservation(struct hstate *h,
> if (vma->vm_flags & VM_SHARED) {
> pgoff_t idx = vma_pagecache_offset(h, vma, addr);
> region_add(&inode->i_mapping->private_list, idx, idx + 1);
> +
> + } else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
> + pgoff_t idx = vma_pagecache_offset(h, vma, addr);
> + struct resv_map *reservations = vma_resv_map(vma);
> +
> + /* Mark this page used in the map. */
> + region_add(&reservations->regions, idx, idx + 1);
> }
> }
>
> @@ -1456,13 +1538,42 @@ out:
> return ret;
> }
>
> +static void hugetlb_vm_op_open(struct vm_area_struct *vma)
> +{
> + struct resv_map *reservations = vma_resv_map(vma);
> +
> + /*
> + * This new VMA should share its siblings reservation map if present.
> + * The VMA will only ever have a valid reservation map pointer where
> + * it is being copied for another still existing VMA. As that VMA
> + * has a reference to the reservation map it cannot dissappear until
> + * after this open call completes. It is therefore safe to take a
> + * new reference here without additional locking.
> + */
> + if (reservations)
> + kref_get(&reservations->refs);
> +}
> +
> static void hugetlb_vm_op_close(struct vm_area_struct *vma)
> {
> struct hstate *h = hstate_vma(vma);
> - unsigned long reserve = vma_resv_huge_pages(vma);
> + struct resv_map *reservations = vma_resv_map(vma);
> + unsigned long reserve;
> + unsigned long start;
> + unsigned long end;
>
> - if (reserve)
> - hugetlb_acct_memory(h, -reserve);
> + if (reservations) {
> + start = vma_pagecache_offset(h, vma, vma->vm_start);
> + end = vma_pagecache_offset(h, vma, vma->vm_end);
> +
> + reserve = (end - start) -
> + region_count(&reservations->regions, start, end);
> +
> + kref_put(&reservations->refs, resv_map_release);
> +
> + if (reserve)
> + hugetlb_acct_memory(h, -reserve);
> + }
> }
>
> /*
> @@ -1479,6 +1590,7 @@ static int hugetlb_vm_op_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>
> struct vm_operations_struct hugetlb_vm_ops = {
> .fault = hugetlb_vm_op_fault,
> + .open = hugetlb_vm_op_open,
> .close = hugetlb_vm_op_close,
> };
>
> @@ -2037,8 +2149,13 @@ int hugetlb_reserve_pages(struct inode *inode,
> if (!vma || vma->vm_flags & VM_SHARED)
> chg = region_chg(&inode->i_mapping->private_list, from, to);
> else {
> + struct resv_map *resv_map = resv_map_alloc();
> + if (!resv_map)
> + return -ENOMEM;
> +
> chg = to - from;
> - set_vma_resv_huge_pages(vma, chg);
> +
> + set_vma_resv_map(vma, resv_map);
> set_vma_resv_flags(vma, HPAGE_RESV_OWNER);
> }
>
> --
> 1.5.6.205.g7ca3a
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2008-06-25 21:22:47