From: Naoya Horiguchi
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Michal Hocko, Andrew Morton, Mike Kravetz, xishi.qiuxishi@alibaba-inc.com, Laurent Dufour
Subject: [RFC][PATCH v1 08/11] mm: soft-offline: isolate error pages from buddy freelist
Date: Fri, 9 Nov 2018 15:47:12 +0900
Message-Id: <1541746035-13408-9-git-send-email-n-horiguchi@ah.jp.nec.com>
In-Reply-To: <1541746035-13408-1-git-send-email-n-horiguchi@ah.jp.nec.com>

Soft-offline shares PG_hwpoison with hard-offline to keep track of memory
errors, but we recently found that this approach can be undesirable for
soft-offline because, unlike hard-offline, it is never expected to stop
applications. So this patch suggests that the memory error handler not
only set PG_hwpoison but also isolate error pages from the buddy
allocator in its context.

In previous work [1], we allowed the soft-offline handler to set
PG_hwpoison only after successful page migration and page freeing.
This patch, along with that, makes the isolation always happen via
set_hwpoison_free_buddy_page() with zone->lock held, so the behavior
should be less racy and more predictable.

Note that for isolation alone we don't have to set PG_hwpoison, but my
analysis shows that for memory hotremove to work properly we still need
some flag to clearly separate memory error pages from any other type of
page. So this patch doesn't change this.

[1]:
  commit 6bc9b56433b7 ("mm: fix race on soft-offlining free huge pages")
  commit d4ae9916ea29 ("mm: soft-offline: close the race against page allocation")

Signed-off-by: Naoya Horiguchi
---
 mm/memory-failure.c |  8 +++---
 mm/page_alloc.c     | 71 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 70 insertions(+), 9 deletions(-)

diff --git v4.19-mmotm-2018-10-30-16-08/mm/memory-failure.c v4.19-mmotm-2018-10-30-16-08_patched/mm/memory-failure.c
index 869ff8f..ecafd4a 100644
--- v4.19-mmotm-2018-10-30-16-08/mm/memory-failure.c
+++ v4.19-mmotm-2018-10-30-16-08_patched/mm/memory-failure.c
@@ -1762,9 +1762,11 @@ static int __soft_offline_page(struct page *page)
 	if (ret == 1) {
 		put_hwpoison_page(page);
 		pr_info("soft_offline: %#lx: invalidated\n", pfn);
-		SetPageHWPoison(page);
-		num_poisoned_pages_inc();
-		return 0;
+		if (set_hwpoison_free_buddy_page(page)) {
+			num_poisoned_pages_inc();
+			return 0;
+		} else
+			return -EBUSY;
 	}

 	/*
diff --git v4.19-mmotm-2018-10-30-16-08/mm/page_alloc.c v4.19-mmotm-2018-10-30-16-08_patched/mm/page_alloc.c
index ae31839..970d6ff 100644
--- v4.19-mmotm-2018-10-30-16-08/mm/page_alloc.c
+++ v4.19-mmotm-2018-10-30-16-08_patched/mm/page_alloc.c
@@ -8183,10 +8183,55 @@ bool is_free_buddy_page(struct page *page)
 }

 #ifdef CONFIG_MEMORY_FAILURE
+
+/*
+ * Pick out a free page from buddy allocator. Unlike expand(), this
+ * function can choose the target page by @target which is not limited
+ * to the first page of some free block.
+ *
+ * This function changes zone state, so callers need to hold zone->lock.
+ */
+static inline void pickout_buddy_page(struct zone *zone, struct page *page,
+			struct page *target, int torder, int low, int high,
+			struct free_area *area, int migratetype)
+{
+	unsigned long size = 1 << high;
+	struct page *current_buddy, *next_page;
+
+	while (high > low) {
+		area--;
+		high--;
+		size >>= 1;
+
+		if (target >= &page[size]) { /* target is in higher buddy */
+			next_page = page + size;
+			current_buddy = page;
+		} else { /* target is in lower buddy */
+			next_page = page;
+			current_buddy = page + size;
+		}
+		VM_BUG_ON_PAGE(bad_range(zone, current_buddy), current_buddy);
+
+		if (set_page_guard(zone, &page[size], high, migratetype))
+			continue;
+
+		list_add(&current_buddy->lru, &area->free_list[migratetype]);
+		area->nr_free++;
+		set_page_order(current_buddy, high);
+		page = next_page;
+	}
+}
+
 /*
- * Set PG_hwpoison flag if a given page is confirmed to be a free page. This
- * test is performed under the zone lock to prevent a race against page
- * allocation.
+ * Isolate hwpoisoned free page which actually does the following
+ *   - confirm that a given page is a free page under zone->lock,
+ *   - set PG_hwpoison flag,
+ *   - remove the page from buddy allocator, subdividing buddy page
+ *     of each order.
+ *
+ * Just setting PG_hwpoison flag is not safe enough for complete isolation
+ * because rapidly-changing memory allocator code is always with the
+ * risk of mishandling the flag and potential race.
  */
 bool set_hwpoison_free_buddy_page(struct page *page)
 {
@@ -8199,10 +8244,24 @@ bool set_hwpoison_free_buddy_page(struct page *page)
 	spin_lock_irqsave(&zone->lock, flags);
 	for (order = 0; order < MAX_ORDER; order++) {
 		struct page *page_head = page - (pfn & ((1 << order) - 1));
+		unsigned int forder = page_order(page_head);
+		struct free_area *area = &(zone->free_area[forder]);

-		if (PageBuddy(page_head) && page_order(page_head) >= order) {
-			if (!TestSetPageHWPoison(page))
-				hwpoisoned = true;
+		if (PageBuddy(page_head) && forder >= order) {
+			int migtype = get_pfnblock_migratetype(page_head,
+					page_to_pfn(page_head));
+			/*
+			 * TestSetPageHWPoison() will be used later when
+			 * reworking hard-offline part is finished.
+			 */
+			SetPageHWPoison(page);
+
+			list_del(&page_head->lru);
+			rmv_page_order(page_head);
+			area->nr_free--;
+			pickout_buddy_page(zone, page_head, page, 0, 0, forder,
+					area, migtype);
+			hwpoisoned = true;
 			break;
 		}
 	}
-- 
2.7.0
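
For readers following the splitting logic in pickout_buddy_page(), here is
a minimal userspace sketch of the idea, not kernel code. It models pages as
plain page frame numbers. The names struct free_block, pickout_model() and
MODEL_MAX_ORDER are hypothetical and exist only for this illustration, and
the guard-page handling done by set_page_guard() is left out. Each iteration
halves the block and hands back the half that does not contain the target,
so only the target page stays off the free lists.

```c
#include <assert.h>

#define MODEL_MAX_ORDER 11	/* assumed upper bound, for illustration only */

/* Hypothetical model of a block returned to a free list. */
struct free_block {
	unsigned long pfn;	/* first page frame number of the block */
	unsigned int order;	/* the block covers 1 << order pages */
};

/*
 * Split the free block [base, base + (1 << high)) until the single page
 * 'target' is the only page not handed back. Every split records the half
 * that does not contain the target in 'out', which must have room for
 * 'high' entries. Returns the number of blocks recorded.
 */
static int pickout_model(unsigned long base, unsigned int high,
			 unsigned long target, struct free_block *out)
{
	int n = 0;

	assert(high < MODEL_MAX_ORDER);
	assert(target >= base && target < base + (1UL << high));

	while (high > 0) {
		unsigned long size;

		high--;
		size = 1UL << high;

		if (target >= base + size) {
			/* target is in the higher half: give back the lower */
			out[n].pfn = base;
			base += size;
		} else {
			/* target is in the lower half: give back the higher */
			out[n].pfn = base + size;
		}
		out[n].order = high;
		n++;
	}
	return n;
}
```

For a free block of order 3 starting at pfn 8 and a target of pfn 13, the
model hands back (pfn 8, order 2), (pfn 14, order 1) and (pfn 12, order 0),
leaving only pfn 13 isolated.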