Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752147AbbHMJTh (ORCPT ); Thu, 13 Aug 2015 05:19:37 -0400 Received: from blu004-omc1s4.hotmail.com ([65.55.116.15]:64987 "EHLO BLU004-OMC1S4.hotmail.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751938AbbHMJTU (ORCPT ); Thu, 13 Aug 2015 05:19:20 -0400 X-TMN: [pEdHRAhZUdwE834RZ40VgwRJXspgt5N9] X-Originating-Email: [wanpeng.li@hotmail.com] Message-ID: Subject: Re: [PATCH] mm/hwpoison: fix race between soft_offline_page and unpoison_memory To: Naoya Horiguchi References: <20150813085332.GA30163@hori1.linux.bs1.fc.nec.co.jp> CC: Andrew Morton , "linux-mm@kvack.org" , "linux-kernel@vger.kernel.org" From: Wanpeng Li Date: Thu, 13 Aug 2015 17:18:56 +0800 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:38.0) Gecko/20100101 Thunderbird/38.0.1 MIME-Version: 1.0 In-Reply-To: <20150813085332.GA30163@hori1.linux.bs1.fc.nec.co.jp> Content-Type: text/plain; charset="iso-2022-jp" Content-Transfer-Encoding: 7bit X-OriginalArrivalTime: 13 Aug 2015 09:19:14.0542 (UTC) FILETIME=[1F5248E0:01D0D5A9] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 9811 Lines: 218 On 8/13/15 4:53 PM, Naoya Horiguchi wrote: > On Thu, Aug 13, 2015 at 03:09:07PM +0800, Wanpeng Li wrote: >> [ 61.572584] BUG: Bad page state in process bash pfn:97000 >> [ 61.578106] page:ffffea00025c0000 count:0 mapcount:1 mapping: (null) index:0x7f4fdbe00 >> [ 61.586803] flags: 0x1fffff80080048(uptodate|active|swapbacked) >> [ 61.592809] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set >> [ 61.599250] bad because of flags: >> [ 61.602567] flags: 0x40(active) >> [ 61.605746] Modules linked in: snd_hda_codec_hdmi i915 rpcsec_gss_krb5 nfsv4 dns_resolver bnep rfcomm nfsd bluetooth auth_rpcgss nfs_acl nfs rfkill lockd grace sunrpc i2c_algo_bit drm_kms_helper snd_hda_codec_realtek snd_hda_codec_generic drm snd_hda_intel fscache snd_hda_codec x86_pkg_temp_thermal coretemp kvm_intel snd_hda_core snd_hwdep kvm snd_pcm snd_seq_dummy snd_seq_oss crct10dif_pclmul snd_seq_midi crc32_pclmul snd_seq_midi_event ghash_clmulni_intel snd_rawmidi aesni_intel lrw gf128mul snd_seq glue_helper ablk_helper snd_seq_device cryptd fuse snd_timer dcdbas serio_raw mei_me parport_pc snd mei ppdev i2c_core video lp soundcore parport lpc_ich shpchp mfd_core ext4 mbcache jbd2 sd_mod e1000e ahci ptp libahci crc32c_intel libata pps_core >> [ 61.605827] CPU: 3 PID: 2211 Comm: bash Not tainted 4.2.0-rc5-mm1+ #45 >> [ 61.605829] Hardware name: Dell Inc. OptiPlex 7020/0F5C5X, BIOS A03 01/08/2015 >> [ 61.605832] ffffffff818b3be8 ffff8800da373ad8 ffffffff8165ceb4 0000000001313ce1 >> [ 61.605837] ffffea00025c0000 ffff8800da373b08 ffffffff8117bdd6 ffff88021edd4b00 >> [ 61.605842] 0000000000000001 001fffff80080048 0000000000000000 ffff8800da373b88 >> [ 61.605847] Call Trace: >> [ 61.605858] [] dump_stack+0x48/0x5c >> [ 61.605865] [] bad_page+0xe6/0x140 >> [ 61.605870] [] free_pages_prepare+0x2f9/0x320 >> [ 61.605876] [] ? uncharge_list+0xdd/0x100 >> [ 61.605882] [] free_hot_cold_page+0x40/0x170 >> [ 61.605888] [] __put_single_page+0x20/0x30 >> [ 61.605892] [] put_page+0x25/0x40 >> [ 61.605897] [] unmap_and_move+0x1a6/0x1f0 >> [ 61.605908] [] migrate_pages+0x100/0x1d0 >> [ 61.605914] [] ? kill_procs+0x100/0x100 >> [ 61.605918] [] ? unlock_page+0x6f/0x90 >> [ 61.605923] [] __soft_offline_page+0x127/0x2a0 >> [ 61.605928] [] soft_offline_page+0xa6/0x200 >> >> There is a race window between soft_offline_page() and unpoison_memory(): >> >> CPU0 CPU1 >> >> soft_offline_page >> __soft_offline_page >> TestSetPageHWPoison >> unpoison_memory >> PageHWPoison check (true) >> TestClearPageHWPoison >> put_page -> release refcount held by get_hwpoison_page in unpoison_memory >> put_page -> release refcount held by isolate_lru_page in __soft_offline_page >> migrate_pages >> >> The second put_page() releases refcount held by isolate_lru_page() which >> will lead to unmap_and_move() releases the last refcount of page and w/ >> mapcount still 1 since try_to_unmap() is not called if there is only >> one user map the page. Anyway, the page refcount and mapcount will >> still mess if the page is mapped by multiple users. Commit (4491f712606: >> mm/memory-failure: set PageHWPoison before migrate_pages()) is introduced >> to avoid to reuse just successful migrated page, however, it also incurs >> this race window. >> >> Fix it by continue to use migratetype to guarantee the source page which >> is successful migration does not reused before PG_hwpoison is set. >> >> Signed-off-by: Wanpeng Li > Thank you for the report, Wanpeng. > > I think that unpoison is used only in testing so this race never affects > our end-users/customers, so going back to this migratetype change stuff > looks unworthy to me. Migratetype stuff is just removed by you two months ago, then this bug occurs recently since the more and more patches which you fix some races. > > If I read correctly, the old migratetype approach has a few problems: > 1) it doesn't fix the problem completely, because > set_migratetype_isolate() can fail to set MIGRATE_ISOLATE to the > target page if the pageblock of the page contains one or more > unmovable pages (i.e. has_unmovable_pages() returns true). > 2) the original code changes migratetype to MIGRATE_ISOLATE forcibly, > and sets it to MIGRATE_MOVABLE forcibly after soft offline, regardless > of the original migratetype state, which could impact other subsystems > like memory hotplug or compaction. Maybe we can add a "FIXME" comment on the Migratetype stuff, since the current linus tree calltrace and it should be fixed immediately, and I don't see obvious bugs appear on migratetype stuffs at least currently, so "FIXME" is enough. :-) > > So in my opinion, the best option is to fix this within unpoison code, > but it might be hard because we don't know whether the target page is Exactly. > under migration. The second option is to add a race check in the if (reason > == MR_MEMORY_FAILURE) branch in unmap_and_move(), although this looks ad-hoc You have already add MR_MEMORY_FAILURE, however, that is not enough to fix unpoison race. > and need testing. The third option is to leave it with some "FIXME" comment. I *prefer* add "FIXME" to migratetype stuffs. Regards, Wanpeng Li > > Thanks, > Naoya Horiguchi > >> --- >> include/linux/page-isolation.h | 5 +++++ >> mm/memory-failure.c | 16 ++++++++++++---- >> mm/migrate.c | 3 +-- >> mm/page_isolation.c | 4 ++-- >> 4 files changed, 20 insertions(+), 8 deletions(-) >> >> diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h >> index 047d647..ff5751e 100644 >> --- a/include/linux/page-isolation.h >> +++ b/include/linux/page-isolation.h >> @@ -65,6 +65,11 @@ undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn, >> int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn, >> bool skip_hwpoisoned_pages); >> >> +/* >> + * Internal functions. Changes pageblock's migrate type. >> + */ >> +int set_migratetype_isolate(struct page *page, bool skip_hwpoisoned_pages); >> +void unset_migratetype_isolate(struct page *page, unsigned migratetype); >> struct page *alloc_migrate_target(struct page *page, unsigned long private, >> int **resultp); >> >> diff --git a/mm/memory-failure.c b/mm/memory-failure.c >> index eca613e..0ed3814 100644 >> --- a/mm/memory-failure.c >> +++ b/mm/memory-failure.c >> @@ -1647,8 +1647,6 @@ static int __soft_offline_page(struct page *page, int flags) >> inc_zone_page_state(page, NR_ISOLATED_ANON + >> page_is_file_cache(page)); >> list_add(&page->lru, &pagelist); >> - if (!TestSetPageHWPoison(page)) >> - atomic_long_inc(&num_poisoned_pages); >> ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL, >> MIGRATE_SYNC, MR_MEMORY_FAILURE); >> if (ret) { >> @@ -1663,8 +1661,9 @@ static int __soft_offline_page(struct page *page, int flags) >> pfn, ret, page->flags); >> if (ret > 0) >> ret = -EIO; >> - if (TestClearPageHWPoison(page)) >> - atomic_long_dec(&num_poisoned_pages); >> + } else { >> + if (!TestSetPageHWPoison(page)) >> + atomic_long_inc(&num_poisoned_pages); >> } >> } else { >> pr_info("soft offline: %#lx: isolation failed: %d, page count %d, type %lx\n", >> @@ -1715,6 +1714,14 @@ int soft_offline_page(struct page *page, int flags) >> >> get_online_mems(); >> >> + /* >> + * Isolate the page, so that it doesn't get reallocated if it >> + * was free. This flag should be kept set until the source page >> + * is freed and PG_hwpoison on it is set. >> + */ >> + if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE) >> + set_migratetype_isolate(page, false); >> + >> ret = get_any_page(page, pfn, flags); >> put_online_mems(); >> if (ret > 0) { /* for in-use pages */ >> @@ -1733,5 +1740,6 @@ int soft_offline_page(struct page *page, int flags) >> atomic_long_inc(&num_poisoned_pages); >> } >> } >> + unset_migratetype_isolate(page, MIGRATE_MOVABLE); >> return ret; >> } >> diff --git a/mm/migrate.c b/mm/migrate.c >> index 1f8369d..472baf5 100644 >> --- a/mm/migrate.c >> +++ b/mm/migrate.c >> @@ -880,8 +880,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage, >> /* Establish migration ptes or remove ptes */ >> if (page_mapped(page)) { >> try_to_unmap(page, >> - TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS| >> - TTU_IGNORE_HWPOISON); >> + TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS); >> page_was_mapped = 1; >> } >> >> diff --git a/mm/page_isolation.c b/mm/page_isolation.c >> index 4568fd5..654662a 100644 >> --- a/mm/page_isolation.c >> +++ b/mm/page_isolation.c >> @@ -9,7 +9,7 @@ >> #include >> #include "internal.h" >> >> -static int set_migratetype_isolate(struct page *page, >> +int set_migratetype_isolate(struct page *page, >> bool skip_hwpoisoned_pages) >> { >> struct zone *zone; >> @@ -73,7 +73,7 @@ out: >> return ret; >> } >> >> -static void unset_migratetype_isolate(struct page *page, unsigned migratetype) >> +void unset_migratetype_isolate(struct page *page, unsigned migratetype) >> { >> struct zone *zone; >> unsigned long flags, nr_pages; >> -- >> 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/