From: Naoya Horiguchi
To: Michal Hocko
CC: linux-mm@kvack.org, Andrew Morton, xishi.qiuxishi@alibaba-inc.com, zy.zhengyi@alibaba-inc.com, linux-kernel@vger.kernel.org
Subject:
 Re: [PATCH v2 1/2] mm: fix race on soft-offlining free huge pages
Date: Thu, 19 Jul 2018 08:08:05 +0000
Message-ID: <20180719080804.GA32756@hori1.linux.bs1.fc.nec.co.jp>
In-Reply-To: <20180719071516.GK7193@dhcp22.suse.cz>

On Thu, Jul 19, 2018 at 09:15:16AM +0200, Michal Hocko wrote:
> On Thu 19-07-18 06:19:45, Naoya Horiguchi wrote:
> > On Wed, Jul 18, 2018 at 10:50:32AM +0200, Michal Hocko wrote:
> > > On Wed 18-07-18 00:55:29, Naoya Horiguchi wrote:
> > > > On Tue, Jul 17, 2018 at 04:27:43PM +0200, Michal Hocko wrote:
> > > > > On Tue 17-07-18 14:32:31, Naoya Horiguchi wrote:
> > > > > > There's a race condition between soft offline and hugetlb_fault which
> > > > > > causes unexpected process killing and/or hugetlb allocation failure.
> > > > > >
> > > > > > The process killing is caused by the following flow:
> > > > > >
> > > > > >   CPU 0               CPU 1              CPU 2
> > > > > >
> > > > > >   soft offline
> > > > > >     get_any_page
> > > > > >     // find the hugetlb is free
> > > > > >                       mmap a hugetlb file
> > > > > >                       page fault
> > > > > >                         ...
> > > > > >                           hugetlb_fault
> > > > > >                             hugetlb_no_page
> > > > > >                               alloc_huge_page
> > > > > >                               // succeed
> > > > > >     soft_offline_free_page
> > > > > >     // set hwpoison flag
> > > > > >                                          mmap the hugetlb file
> > > > > >                                          page fault
> > > > > >                                            ...
> > > > > >                                              hugetlb_fault
> > > > > >                                                hugetlb_no_page
> > > > > >                                                  find_lock_page
> > > > > >                                                  return VM_FAULT_HWPOISON
> > > > > >                                            mm_fault_error
> > > > > >                                              do_sigbus
> > > > > >                                              // kill the process
> > > > > >
> > > > > >
> > > > > > The hugetlb allocation failure comes from the following flow:
> > > > > >
> > > > > >   CPU 0                          CPU 1
> > > > > >
> > > > > >                                  mmap a hugetlb file
> > > > > >                                  // reserve all free page but don't fault-in
> > > > > >   soft offline
> > > > > >     get_any_page
> > > > > >     // find the hugetlb is free
> > > > > >       soft_offline_free_page
> > > > > >       // set hwpoison flag
> > > > > >         dissolve_free_huge_page
> > > > > >         // fail because all free hugepages are reserved
> > > > > >                                  page fault
> > > > > >                                    ...
> > > > > >                                      hugetlb_fault
> > > > > >                                        hugetlb_no_page
> > > > > >                                          alloc_huge_page
> > > > > >                                            ...
> > > > > >                                              dequeue_huge_page_node_exact
> > > > > >                                              // ignore hwpoisoned hugepage
> > > > > >                                              // and finally fail due to no-mem
> > > > > >
> > > > > > The root cause of this is that current soft-offline code is written
> > > > > > based on an assumption that PageHWPoison flag should be set at first to
> > > > > > avoid accessing the corrupted data. This makes sense for memory_failure()
> > > > > > or hard offline, but does not for soft offline because soft offline is
> > > > > > about corrected (not uncorrected) error and is safe from data loss.
> > > > > > This patch changes soft offline semantics where it sets PageHWPoison flag
> > > > > > only after containment of the error page completes successfully.
> > > > >
> > > > > Could you please expand on the workflow here please? The code is really
> > > > > hard to grasp. I must be missing something because the thing shouldn't
> > > > > be really complicated.
> > > > > Either the page is in the free pool and you just
> > > > > remove it from the allocator (with hugetlb asking for a new hugetlb page
> > > > > to guarantee reserves) or it is used and you just migrate the content to
> > > > > a new page (again with the hugetlb reserves consideration). Why should
> > > > > PageHWPoison flag ordering have any relevance?
> > > >
> > > > (Considering soft offlining free hugepage,)
> > > > PageHWPoison is set at first before this patch, which is racy with
> > > > hugetlb fault code because it's not protected by hugetlb_lock.
> > > >
> > > > Originally this was written in the similar manner as hard-offline, where
> > > > the race is accepted and a PageHWPoison flag is set as soon as possible.
> > > > But actually that's found not necessary/correct because soft offline is
> > > > supposed to be less aggressive and failure is OK.
> > >
> > > OK
> > >
> > > > So this patch is suggesting to make soft-offline less aggressive by
> > > > moving SetPageHWPoison into the lock.
> > >
> > > I guess I still do not understand why we should even care about the
> > > ordering of the HWPoison flag setting. Why cannot we simply have the
> > > following code flow? Or maybe we are doing that already I just do not
> > > follow the code
> > >
> > > soft_offline
> > >   check page_count
> > >     - free - normal page - remove from the allocator
> > >            - hugetlb - allocate a new hugetlb page && remove from the pool
> > >     - used - migrate to a new page && never release the old one
> > >
> > > Why do we even need HWPoison flag here? Everything can be completely
> > > transparent to the application. It shouldn't fail from what I
> > > understood.
> >
> > PageHWPoison flag is used for the 'remove from the allocator' part
> > which is like below:
> >
> > static inline
> > struct page *rmqueue(
> > ...
> >     do {
> >         page = NULL;
> >         if (alloc_flags & ALLOC_HARDER) {
> >             page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
> >             if (page)
> >                 trace_mm_page_alloc_zone_locked(page, order, migratetype);
> >         }
> >         if (!page)
> >             page = __rmqueue(zone, order, migratetype);
> >     } while (page && check_new_pages(page, order));
> >
> > check_new_pages() returns true if the page taken from free list has
> > a hwpoison page so that the allocator iterates another round to get
> > another page.
> >
> > There's no function that can be called from outside allocator to remove
> > a page in allocator. So actual page removal is done at allocation time,
> > not at error handling time. That's the reason why we need PageHWPoison.
>
> hwpoison is an internal mm functionality so why cannot we simply add a
> function that would do that?

That's one possible solution.

I know about another downside in the current implementation.
If a hwpoison page is found during high order page allocation,
all 2^order pages (not only the hwpoison page) are removed from
buddy because of the above quoted code. And these leaked pages
are never returned to the freelist even with unpoison_memory().
If we have a page removal function which properly splits high order
free pages into lower order pages, this problem is avoided.

OTOH PageHWPoison still has a role to report error to userspace.
Without it unpoison_memory() doesn't work.

Thanks,
Naoya Horiguchi

> I find the PageHWPoison usage here doing
> more complications than real good. Or am I missing something?
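
For illustration only, the "page removal function which properly splits
high order free pages into lower order pages" mentioned in the reply could
look roughly like the sketch below. It is a user-space toy model, not
kernel code. struct toy_zone, toy_zone_init(), toy_free_pages() and
toy_remove_free_page() are hypothetical names made up for this example,
and the zone is assumed to start as a single free order-3 block of 8 pages.
The poison flag is set only after the page has been taken off the free
list, which follows the ordering the patch description argues for.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_ORDER 3                     /* assumption: largest block is 2^3 pages */
#define TOY_NR_PAGES  (1u << TOY_MAX_ORDER)

/* Hypothetical toy zone, not a kernel structure. */
struct toy_zone {
    int  free_order[TOY_NR_PAGES];  /* order of the free block headed here, -1 if none */
    bool poisoned[TOY_NR_PAGES];    /* stand-in for the PageHWPoison flag */
};

/* Start with one free block of the maximum order covering every page. */
void toy_zone_init(struct toy_zone *z)
{
    for (unsigned int i = 0; i < TOY_NR_PAGES; i++) {
        z->free_order[i] = -1;
        z->poisoned[i] = false;
    }
    z->free_order[0] = TOY_MAX_ORDER;
}

/* Count the pages that are still on the toy free lists. */
unsigned int toy_free_pages(const struct toy_zone *z)
{
    unsigned int n = 0;

    for (unsigned int i = 0; i < TOY_NR_PAGES; i++)
        if (z->free_order[i] >= 0)
            n += 1u << z->free_order[i];
    return n;
}

/*
 * Remove only page `target` from the free block that contains it.
 * The block is halved repeatedly; the half without `target` goes back
 * to the free lists at the lower order. Returns false if the page is
 * not in any free block.
 */
bool toy_remove_free_page(struct toy_zone *z, unsigned int target)
{
    assert(target < TOY_NR_PAGES);

    for (int order = 0; order <= TOY_MAX_ORDER; order++) {
        /* Head of the order-sized aligned block that would contain target. */
        unsigned int head = target & ~((1u << order) - 1u);

        if (z->free_order[head] != order)
            continue;

        z->free_order[head] = -1;           /* take the whole block off the list */
        while (order > 0) {
            order--;
            unsigned int half = 1u << order;

            if (target >= head + half) {
                z->free_order[head] = order;        /* lower half stays free */
                head += half;
            } else {
                z->free_order[head + half] = order; /* upper half stays free */
            }
        }
        z->poisoned[target] = true;         /* flag set only after containment */
        return true;
    }
    return false;
}
```

With a freshly initialised toy zone, removing page 5 leaves free blocks of
order 2, 0 and 1 headed at pages 0, 4 and 6, so 7 of the 8 pages remain
allocatable instead of the whole 2^order block being dropped.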