Subject: Re: [PATCH] mm/hwpoison: fix race between soft_offline_page and unpoison_memory
To: Naoya Horiguchi
References: <20150813085332.GA30163@hori1.linux.bs1.fc.nec.co.jp> <20150813100407.GA2993@hori1.linux.bs1.fc.nec.co.jp> <20150814041939.GA9951@hori1.linux.bs1.fc.nec.co.jp> <20150814072649.GA31021@hori1.linux.bs1.fc.nec.co.jp>
CC: Andrew Morton, "linux-mm@kvack.org", "linux-kernel@vger.kernel.org"
From: Wanpeng Li
Date: Fri, 14 Aug 2015 15:59:21 +0800

On 8/14/15 3:54 PM, Wanpeng Li wrote:
> [...]
>> OK, then I'll rethink handling the race in unpoison_memory().
>>
>> Currently, properly contained/hwpoisoned pages should have page refcount 1
>> (when the memory error hits LRU pages or hugetlb pages) or refcount 0
>> (when the memory error hits a buddy page). The current unpoison_memory()
>> implicitly assumes this, because otherwise the unpoisoned page has no place
>> to go and is simply leaked.
>> So to avoid the kernel panic, adding prechecks of refcount and mapcount
>> to limit unpoisoning to only unpoisonable pages looks OK to me.
>> The page under soft offlining always has refcount >= 2 and/or mapcount > 0,
>> so such pages should be filtered out.
>>
>> Here's a patch. In my testing (running soft offline stress testing, then
>> repeatedly unpoisoning in the background), the reported (or similar) bug
>> doesn't happen.
>> Can I have your comments?
> As page_action() prints out, the page may still be referenced by some users;
> however, PageHWPoison has already been set. So you will leak many poisoned pages.
>

Anyway, the bug is still there.

[ 944.387559] BUG: Bad page state in process expr pfn:591e3
[ 944.393053] page:ffffea00016478c0 count:-1 mapcount:0 mapping: (null) index:0x2
[ 944.401147] flags: 0x1fffff80000000()
[ 944.404819] page dumped because: nonzero _count

Regards,
Wanpeng Li
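
For readers following the thread, the refcount/mapcount precheck proposed in the quoted text can be modeled in userspace roughly as below. This is only a sketch under stated assumptions, not the patch under discussion and not kernel code: `struct page_model`, `can_unpoison()` and `self_check()` are hypothetical names invented for illustration, and the real check would operate on `struct page` with the kernel's own accessors.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of the per-page state discussed in the
 * thread. It is NOT the kernel's struct page. */
struct page_model {
    int refcount;   /* models the page reference count */
    int mapcount;   /* models the number of mappings of the page */
    bool hwpoison;  /* models the PageHWPoison flag */
};

/* Hypothetical precheck: treat a page as unpoisonable only when it is in
 * the "properly contained" state described in the quoted text, i.e.
 * poisoned, unmapped, and with refcount 1 (LRU/hugetlb case) or
 * refcount 0 (buddy case). */
static bool can_unpoison(const struct page_model *p)
{
    if (!p->hwpoison)
        return false;   /* nothing to unpoison */
    if (p->mapcount > 0)
        return false;   /* still mapped, e.g. under soft offlining */
    return p->refcount == 0 || p->refcount == 1;
}

/* Small self-check covering the cases mentioned in the thread. */
static void self_check(void)
{
    struct page_model contained_lru = { .refcount = 1, .mapcount = 0, .hwpoison = true };
    struct page_model contained_buddy = { .refcount = 0, .mapcount = 0, .hwpoison = true };
    struct page_model soft_offlining = { .refcount = 2, .mapcount = 1, .hwpoison = true };
    struct page_model bad_count = { .refcount = -1, .mapcount = 0, .hwpoison = true };

    assert(can_unpoison(&contained_lru));
    assert(can_unpoison(&contained_buddy));
    assert(!can_unpoison(&soft_offlining));
    assert(!can_unpoison(&bad_count));
}
```

In this model a page whose count has already gone to -1, as in the log above, is rejected by the precheck. That says nothing about how the count became negative in the first place, which is the open question in this reply.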