Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1754774AbbHNJBo (ORCPT );
        Fri, 14 Aug 2015 05:01:44 -0400
Received: from blu004-omc1s33.hotmail.com ([65.55.116.44]:55678
        "EHLO BLU004-OMC1S33.hotmail.com" rhost-flags-OK-OK-OK-OK)
        by vger.kernel.org with ESMTP id S1753510AbbHNJBk (ORCPT );
        Fri, 14 Aug 2015 05:01:40 -0400
X-TMN: [C4JcD21PyGio07Jnjo+9ZWRadNV+r8en]
X-Originating-Email: [wanpeng.li@hotmail.com]
Message-ID:
Subject: Re: [PATCH] mm/hwpoison: fix race between soft_offline_page and unpoison_memory
To: Naoya Horiguchi
References: <20150813085332.GA30163@hori1.linux.bs1.fc.nec.co.jp>
        <20150813100407.GA2993@hori1.linux.bs1.fc.nec.co.jp>
        <20150814041939.GA9951@hori1.linux.bs1.fc.nec.co.jp>
        <20150814072649.GA31021@hori1.linux.bs1.fc.nec.co.jp>
        <20150814083818.GB6956@hori1.linux.bs1.fc.nec.co.jp>
CC: Andrew Morton, "linux-mm@kvack.org", "linux-kernel@vger.kernel.org"
From: Wanpeng Li
Date: Fri, 14 Aug 2015 17:01:34 +0800
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:38.0) Gecko/20100101 Thunderbird/38.0.1
MIME-Version: 1.0
In-Reply-To: <20150814083818.GB6956@hori1.linux.bs1.fc.nec.co.jp>
Content-Type: text/plain; charset="iso-2022-jp"
Content-Transfer-Encoding: 7bit
X-OriginalArrivalTime: 14 Aug 2015 09:01:38.0820 (UTC) FILETIME=[D479A440:01D0D66F]
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 2596
Lines: 59

On 8/14/15 4:38 PM, Naoya Horiguchi wrote:
> On Fri, Aug 14, 2015 at 03:59:21PM +0800, Wanpeng Li wrote:
>> On 8/14/15 3:54 PM, Wanpeng Li wrote:
>>> [...]
>>>> OK, then I rethink of handling the race in unpoison_memory().
>>>>
>>>> Currently properly contained/hwpoisoned pages should have page refcount 1
>>>> (when the memory error hits LRU pages or hugetlb pages) or refcount 0
>>>> (when the memory error hits the buddy page.)
>>>> And current unpoison_memory()
>>>> implicitly assumes this because otherwise the unpoisoned page has no place
>>>> to go and it's just leaked.
>>>> So to avoid the kernel panic, adding prechecks of refcount and mapcount
>>>> to limit the page to unpoison for only unpoisonable pages looks OK to me.
>>>> The page under soft offlining always has refcount >=2 and/or mapcount > 0,
>>>> so such pages should be filtered out.
>>>>
>>>> Here's a patch. In my testing (run soft offline stress testing then repeat
>>>> unpoisoning in background,) the reported (or similar) bug doesn't happen.
>>>> Can I have your comments?
>>> As page_action() prints out page maybe still referenced by some users,
>>> however, PageHWPoison has already set. So you will leak many poison pages.
>>>
>> Anyway, the bug is still there.
>>
>> [ 944.387559] BUG: Bad page state in process expr pfn:591e3
>> [ 944.393053] page:ffffea00016478c0 count:-1 mapcount:0 mapping: (null) index:0x2
>> [ 944.401147] flags: 0x1fffff80000000()
>> [ 944.404819] page dumped because: nonzero _count
> Hmm, no luck :(
>
> To investigate more, I'd like to test the exactly same kernel as yours, so
> could you share the kernel info (.config and base kernel and what patches
> you applied)? or pushing your tree somewhere like github?
> # if you like, sending to me privately is fine.
>
> I think that I tested v4.2-rc6 + +
> "mm/hwpoison: fix race between soft_offline_page and unpoison_memory",
> but I experienced some conflict in applying your patches for some reason,
> so it might happen that we are testing on different kernels.

I don't have a special config or tree. The latest mmotm has already merged
my recent 8 hwpoison patches, so you can test based on it.

Regards,
Wanpeng Li

>
> Mine is here:
> https://github.com/Naoya-Horiguchi/linux v4.2-rc6/fix_race_soft_offline_unpoison
>
> Thanks,
> Naoya Horiguchi