Date: Wed, 5 Dec 2018 17:57:16 +0100
From: Michal Hocko
To: Naoya Horiguchi, Oscar Salvador
Cc: Andrew Morton, Dan Williams, Pavel Tatashin, linux-mm@kvack.org,
	LKML, Stable tree
Subject: Re: [RFC PATCH] hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined
Message-ID: <20181205165716.GS1286@dhcp22.suse.cz>
References: <20181203100309.14784-1-mhocko@kernel.org> <20181205122918.GL1286@dhcp22.suse.cz>
In-Reply-To: <20181205122918.GL1286@dhcp22.suse.cz>
List-ID: X-Mailing-List: linux-kernel@vger.kernel.org

On Wed 05-12-18 13:29:18, Michal Hocko wrote:
[...]
> After some more thinking I am not really sure the above reasoning is
> still true with the current upstream kernel.
> Maybe I just managed to confuse myself so please hold off on this
> patch for now. Testing by Oscar has shown this patch is helping but
> the changelog might need to be updated.

OK, so Oscar has nailed it down and it seems that the 4.4 kernel we have
been debugging on behaves slightly differently. The underlying problem
is the same though. So I have reworded the changelog and added a "just
in case" PageLRU handling. Naoya, maybe you have an argument that would
make this void for current upstream kernels. I have dropped all the
Reviewed-by tags as the patch has changed slightly. Thanks a lot to
Oscar for the patience and testing he has devoted to this issue.

Btw. the way we drop all the work on the first page that we cannot
isolate is just goofy. Why don't we simply migrate all that we already
have on the list and go on? Something for a followup cleanup though.
---
From 909521051f41ae46a841b481acaf1ed9c695ae7b Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Mon, 3 Dec 2018 10:27:18 +0100
Subject: [PATCH] hwpoison, memory_hotplug: allow hwpoisoned pages to be
 offlined

We have received a bug report that an injected MCE about faulty memory
prevents memory offlining from succeeding on a 4.4 based kernel. The
underlying reason was that the HWPoison page has an elevated reference
count and the migration keeps failing.

There are two problems with that. First of all it is dubious to migrate
the poisoned page because we know that accessing that memory can fail.
Secondly it doesn't make any sense to migrate potentially broken content
and preserve the memory corruption over to a new location.
Oscar has found out that 4.4 and the current upstream kernels behave
slightly differently with his simple testcase:

===
/* Includes added for completeness. PAGE_ALIGN is a kernel macro and has
 * to be supplied by hand when compiling this in userspace (round up to
 * the next page boundary). */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
	int ret;
	int i;
	int fd;
	char *array = malloc(4096);
	char *array_locked = malloc(4096);

	fd = open("/tmp/data", O_RDONLY);
	read(fd, array, 4095);

	for (i = 0; i < 4096; i++)
		array_locked[i] = 'd';

	ret = mlock((void *)PAGE_ALIGN((unsigned long)array_locked),
		    sizeof(array_locked));
	if (ret)
		perror("mlock");

	sleep(20);

	ret = madvise((void *)PAGE_ALIGN((unsigned long)array_locked), 4096,
		      MADV_HWPOISON);
	if (ret)
		perror("madvise");

	for (i = 0; i < 4096; i++)
		array_locked[i] = 'd';

	return 0;
}
===

+ offline this memory.

In 4.4 kernels he saw the hwpoisoned page being returned back to the LRU
list
kernel:  [] dump_trace+0x59/0x340
kernel:  [] show_stack_log_lvl+0xea/0x170
kernel:  [] show_stack+0x21/0x40
kernel:  [] dump_stack+0x5c/0x7c
kernel:  [] warn_slowpath_common+0x81/0xb0
kernel:  [] __pagevec_lru_add_fn+0x14c/0x160
kernel:  [] pagevec_lru_move_fn+0xad/0x100
kernel:  [] __lru_cache_add+0x6c/0xb0
kernel:  [] add_to_page_cache_lru+0x46/0x70
kernel:  [] extent_readpages+0xc3/0x1a0 [btrfs]
kernel:  [] __do_page_cache_readahead+0x177/0x200
kernel:  [] ondemand_readahead+0x168/0x2a0
kernel:  [] generic_file_read_iter+0x41f/0x660
kernel:  [] __vfs_read+0xcd/0x140
kernel:  [] vfs_read+0x7a/0x120
kernel:  [] kernel_read+0x3b/0x50
kernel:  [] do_execveat_common.isra.29+0x490/0x6f0
kernel:  [] do_execve+0x28/0x30
kernel:  [] call_usermodehelper_exec_async+0xfb/0x130
kernel:  [] ret_from_fork+0x55/0x80

And that later confuses the hotremove path because migration of an LRU
page is attempted and fails due to the elevated reference count. It is
quite possible that the reuse of the HWPoisoned page is some kind of
race condition that has since been fixed, but I am not really sure about
that.

With the upstream kernel the failure is slightly different.
The page doesn't seem to have the LRU bit set but isolate_movable_page
simply fails, do_migrate_range puts all the isolated pages back to the
LRU, and therefore no progress is made and scan_movable_pages finds the
same set of pages over and over again.

Fix both cases by explicitly checking HWPoisoned pages before we even
try to get a reference on the page, and try to unmap such a page if it
is still mapped. As explained by Naoya:

: Hwpoison code never unmapped those for no big reason because
: Ksm pages never dominate memory, so we simply didn't have strong
: motivation to save the pages.

Also put a WARN_ON(PageLRU) in case there is a race and we can hit LRU
HWPoison pages, which shouldn't happen, but I couldn't convince myself
about that.

Debugged-by: Oscar Salvador
Cc: stable
Signed-off-by: Michal Hocko
---
 mm/memory_hotplug.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c6c42a7425e5..cfa1a2736876 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -34,6 +34,7 @@
 #include
 #include
 #include
+#include <linux/rmap.h>
 #include
@@ -1366,6 +1367,21 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 			pfn = page_to_pfn(compound_head(page))
 				+ hpage_nr_pages(page) - 1;

+		/*
+		 * HWPoison pages have elevated reference counts so the migration would
+		 * fail on them. It also doesn't make any sense to migrate them in the
+		 * first place. Still try to unmap such a page in case it is still mapped
+		 * (e.g. current hwpoison implementation doesn't unmap KSM pages but keep
+		 * the unmap as the catch all safety net).
+		 */
+		if (PageHWPoison(page)) {
+			if (WARN_ON(PageLRU(page)))
+				isolate_lru_page(page);
+			if (page_mapped(page))
+				try_to_unmap(page, TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS);
+			continue;
+		}
+
 		if (!get_page_unless_zero(page))
 			continue;
 		/*
--
2.19.2

--
Michal Hocko
SUSE Labs