Subject: Re: [PATCH][RFC] fix kernel BUG at mm/migrate.c:719! in 2.6.26-rc5-mm3
From: Lee Schermerhorn
To: Daisuke Nishimura
Cc: Andrew Morton, Rik van Riel, Kosaki Motohiro, Nick Piggin,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    kernel-testers@vger.kernel.org
In-Reply-To: <20080617163501.7cf411ee.nishimura@mxp.nes.nec.co.jp>
References: <20080611225945.4da7bb7f.akpm@linux-foundation.org>
    <20080617163501.7cf411ee.nishimura@mxp.nes.nec.co.jp>
Organization: HP/OSLO
Date: Tue, 17 Jun 2008 13:46:38 -0400
Message-Id: <1213724798.8707.41.camel@lts-notebook>

On Tue, 2008-06-17 at 16:35 +0900, Daisuke Nishimura wrote:
> Hi.
>
> I got this bug while migrating pages only a few times
> via memory_migrate of cpuset.

Ah, I did test migration fairly heavily, but not by moving cpusets.

>
> Unfortunately, even if this patch is applied,
> I got bad_page problem after hundreds times of page migration
> (I'll report it in another mail).
> But I believe something like this patch is needed anyway.

Agreed.  See comments below.

>
> ------------[ cut here ]------------
> kernel BUG at mm/migrate.c:719!
> invalid opcode: 0000 [1] SMP
> last sysfs file: /sys/devices/system/cpu/cpu3/cache/index1/shared_cpu_map
> CPU 0
> Modules linked in: ipv6 autofs4 hidp rfcomm l2cap bluetooth sunrpc dm_mirror dm_log dm_multipath dm_mod sbs sbshc button battery acpi_memhotplug ac parport_pc lp parport floppy serio_raw rtc_cmos rtc_core rtc_lib 8139too pcspkr 8139cp mii ata_piix libata sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd [last unloaded: microcode]
> Pid: 3096, comm: switch.sh Not tainted 2.6.26-rc5-mm3 #1
> RIP: 0010:[] [] migrate_pages+0x33e/0x49f
> RSP: 0018:ffff81002f463bb8 EFLAGS: 00010202
> RAX: 0000000000000000 RBX: ffffe20000c17500 RCX: 0000000000000034
> RDX: ffffe20000c17500 RSI: ffffe200010003c0 RDI: ffffe20000c17528
> RBP: ffffe200010003c0 R08: 8000000000000000 R09: 304605894800282f
> R10: 282f87058b480028 R11: 0028304005894800 R12: ffff81003f90a5d8
> R13: 0000000000000000 R14: ffffe20000bf4cc0 R15: ffff81002f463c88
> FS: 00007ff9386576f0(0000) GS:ffffffff8061d800(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00007ff938669000 CR3: 000000002f458000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process switch.sh (pid: 3096, threadinfo ffff81002f462000, task ffff81003e99cf10)
> Stack: 0000000000000001 ffffffff80290777 0000000000000000 0000000000000000
>  ffff81002f463c88 ffff81000000ea18 ffff81002f463c88 000000000000000c
>  ffff81002f463ca8 00007ffffffff000 00007fff649f6000 0000000000000004
> Call Trace:
>  [] ? new_node_page+0x0/0x2f
>  [] ? do_migrate_pages+0x19b/0x1e7
>  [] ? set_cpus_allowed_ptr+0xe6/0xf3
>  [] ? cpuset_migrate_mm+0x58/0x8f
>  [] ? cpuset_attach+0x8b/0x9e
>  [] ? cgroup_attach_task+0x3a3/0x3f5
>  [] ? __alloc_pages_internal+0xe2/0x3d1
>  [] ? cgroup_common_file_write+0x150/0x1dd
>  [] ? cgroup_file_write+0x54/0x150
>  [] ? vfs_write+0xad/0x136
>  [] ? sys_write+0x45/0x6e
>  [] ? tracesys+0xd5/0xda
>
>
> Code: 4c 48 8d 7b 28 e8 cc 87 09 00 48 83 7b 18 00 75 30 48 8b 03 48 89 da 25 00 40 00 00 48 85 c0 74 04 48 8b 53 10 83 7a 08 01 74 04 <0f> 0b eb fe 48 89 df e8 5e 50 fd ff 48 89 df e8 7d d6 fd ff eb
> RIP [] migrate_pages+0x33e/0x49f
> RSP
> Clocksource tsc unstable (delta = 438246251 ns)
> ---[ end trace ce4e6053f7b9bba1 ]---
>
>
> This bug is caused by VM_BUG_ON() in unmap_and_move().
>
> unmap_and_move()
>  710         if (rc != -EAGAIN) {
>  711                 /*
>  712                  * A page that has been migrated has all references
>  713                  * removed and will be freed. A page that has not been
>  714                  * migrated will have kepts its references and be
>  715                  * restored.
>  716                  */
>  717                 list_del(&page->lru);
>  718                 if (!page->mapping) {
>  719                         VM_BUG_ON(page_count(page) != 1);
>  720                         unlock_page(page);
>  721                         put_page(page); /* just free the old page */
>  722                         goto end_migration;
>  723                 } else
>  724                         unlock = putback_lru_page(page);
>  725         }

I think that at least part of your patch, below, should fix this
problem.  See comments there.

Now I wonder if the assertion that newpage count == 1 could be violated?
I don't see how.  We've just allocated and filled it and haven't
unlocked it yet, so we should hold the only reference.  Do you agree?

>
> I think the page count is not necessarily 1 here, because
> migration_entry_wait increases page count and waits for the
> page to be unlocked.
> So, if the old page is accessed between migrate_page_move_mapping,
> which checks the page count, and remove_migration_ptes, page count
> would not be 1 here.
>
> Actually, just commenting out get/put_page from migration_entry_wait
> works well in my environment(succeeded in hundreds times of page migration),
> but modifying migration_entry_wait this way is not good, I think.
>
>
> This patch depends on Lee Schermerhorn's fix for double unlock_page.
>
> This patch also fixes a race between migrate_entry_wait and
> page_freeze_refs in migrate_page_move_mapping.
>
>
> Signed-off-by: Daisuke Nishimura
>
> ---
> diff -uprN linux-2.6.26-rc5-mm3/mm/migrate.c linux-2.6.26-rc5-mm3-test/mm/migrate.c
> --- linux-2.6.26-rc5-mm3/mm/migrate.c	2008-06-17 15:31:23.000000000 +0900
> +++ linux-2.6.26-rc5-mm3-test/mm/migrate.c	2008-06-17 13:59:15.000000000 +0900
> @@ -232,6 +232,7 @@ void migration_entry_wait(struct mm_stru
>  	swp_entry_t entry;
>  	struct page *page;
>  
> +retry:
>  	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
>  	pte = *ptep;
>  	if (!is_swap_pte(pte))
> @@ -243,11 +244,20 @@ void migration_entry_wait(struct mm_stru
>  
>  	page = migration_entry_to_page(entry);
>  
> -	get_page(page);
> -	pte_unmap_unlock(ptep, ptl);
> -	wait_on_page_locked(page);
> -	put_page(page);
> -	return;
> +	/*
> +	 * page count might be set to zero by page_freeze_refs()
> +	 * in migrate_page_move_mapping().
> +	 */
> +	if (get_page_unless_zero(page)) {
> +		pte_unmap_unlock(ptep, ptl);
> +		wait_on_page_locked(page);
> +		put_page(page);
> +		return;
> +	} else {
> +		pte_unmap_unlock(ptep, ptl);
> +		goto retry;
> +	}
> +

I'm not sure about this part.  If it IS needed, I think it would be
needed independently of the unevictable/putback_lru_page() changes, as
this race must have already existed.

However, unmap_and_move() replaced the migration entries with bona fide
pte's referencing the new page before freeing the old page, so I think
we're OK without this change.

>  out:
>  	pte_unmap_unlock(ptep, ptl);
>  }
> @@ -715,13 +725,7 @@ unlock:
>  		 * restored.
>  		 */
>  		list_del(&page->lru);
> -		if (!page->mapping) {
> -			VM_BUG_ON(page_count(page) != 1);
> -			unlock_page(page);
> -			put_page(page); /* just free the old page */
> -			goto end_migration;
> -		} else
> -			unlock = putback_lru_page(page);
> +		unlock = putback_lru_page(page);
>  	}
>  
>  	if (unlock)

I agree with this part.  I came to the same conclusion looking at the
code.  If we just changed the if() and VM_BUG_ON() to:

	if (!page->mapping && page_count(page) == 1) { ...
we'd be doing exactly what putback_lru_page() is doing.  So, this code
was always unnecessary, duplicate code [that I was trying to avoid :(].
So, just let putback_lru_page() handle this condition and conditionally
unlock_page().

I'm testing with my stress load with the 2nd part of the patch above and
it's holding up OK.  Of course, I didn't hit the problem before.  I'll
try your duplicator script and see what happens.

Regards,
Lee
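
For anyone following the migration_entry_wait() vs. page_freeze_refs() discussion above, here is a minimal user-space sketch of the reference-count behaviour in question. It is not kernel code: `struct toy_page` and every `toy_*` function are hypothetical names made up for illustration, and it assumes that page_freeze_refs() acts as a compare-and-swap of the count to zero, which is how I read the comment in the patch. It only shows why an unconditional get on a frozen count is a problem, and why a get-unless-zero lets the waiter back off and retry.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page: only the reference count. */
struct toy_page {
	atomic_int count;
};

/* Unconditional reference, as in the original migration_entry_wait(). */
void toy_get_page(struct toy_page *p)
{
	atomic_fetch_add(&p->count, 1);
}

/*
 * Take a reference only if the count is currently non-zero.
 * Models the idea behind get_page_unless_zero().
 */
bool toy_get_page_unless_zero(struct toy_page *p)
{
	int old = atomic_load(&p->count);

	while (old != 0) {
		/* On failure, 'old' is reloaded and the loop re-checks it. */
		if (atomic_compare_exchange_weak(&p->count, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop one reference; the count must have been positive. */
void toy_put_page(struct toy_page *p)
{
	int old = atomic_fetch_sub(&p->count, 1);

	assert(old > 0);
	(void)old;
}

/*
 * Assumed model of page_freeze_refs(): set the count to zero only if it
 * equals the expected value, otherwise leave it untouched.
 */
bool toy_freeze_refs(struct toy_page *p, int expected)
{
	return atomic_compare_exchange_strong(&p->count, &expected, 0);
}

/* Assumed model of the matching unfreeze: restore a known count. */
void toy_unfreeze_refs(struct toy_page *p, int count)
{
	atomic_store(&p->count, count);
}
```

In this model, once toy_freeze_refs() has succeeded, toy_get_page() still bumps the count from 0 to 1 behind the freezer's back, whereas toy_get_page_unless_zero() returns false so the caller can drop the pte lock and retry. Whether the real race can actually occur is exactly the open question in the reply above.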