Date: Mon, 19 Nov 2018 21:44:41 -0800 (PST)
From: Hugh Dickins
X-X-Sender: hugh@eggly.anvils
To: Baoquan He
Cc: Hugh Dickins, Michal Hocko, Vlastimil Babka, David Hildenbrand,
    linux-mm@kvack.org, pifang@redhat.com, linux-kernel@vger.kernel.org,
    akpm@linux-foundation.org, aarcange@redhat.com, Mel Gorman
Subject: Re: Memory hotplug softlock issue
In-Reply-To: <20181120015644.GA5727@MiWiFi-R3L-srv>
Message-ID:
References: <20181115143204.GV23831@dhcp22.suse.cz>
    <20181116012433.GU2653@MiWiFi-R3L-srv>
    <20181116091409.GD14706@dhcp22.suse.cz>
    <20181119105202.GE18471@MiWiFi-R3L-srv>
    <20181119124033.GJ22247@dhcp22.suse.cz>
    <20181119125121.GK22247@dhcp22.suse.cz>
    <20181119141016.GO22247@dhcp22.suse.cz>
    <20181119173312.GV22247@dhcp22.suse.cz>
    <20181119205907.GW22247@dhcp22.suse.cz>
    <20181120015644.GA5727@MiWiFi-R3L-srv>
User-Agent: Alpine 2.11 (LSU 23 2013-08-11)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, 20 Nov 2018, Baoquan He wrote:
> On 11/19/18 at 09:59pm, Michal Hocko wrote:
> > On Mon 19-11-18 12:34:09, Hugh Dickins wrote:
> > > I'm glad that I delayed, what I had then (migration_waitqueue instead
> > > of using page_waitqueue) was not wrong, but what I've been using the
> > > last couple of months is rather better (and can be put to use to solve
> > > similar problems in collapsing pages on huge tmpfs, but we don't need
> > > to get into that at this time): put_and_wait_on_page_locked().
> > >
> > > What I have not yet done is verify it on the latest kernel, and research
> > > the interested Cc list (Linus and Tim Chen come immediately to mind),
> > > and write the commit comment. I have some testing to do on the latest
> > > kernel today, so I'll throw put_and_wait_on_page_locked() in too,
> > > and post tomorrow I hope.
> >
> > Cool, it seems that Baoquan has a reliable test case to trigger the
> > pathological case.
>
> Yes. I will test Hugh's patch.

Thanks: I've completed some of the retesting now, so it would probably
help us all if I post the patch in this thread, even without completing
its description, links and Cc list yet - there isn't even a problem
description below; I still have to paste that in from the unposted patch
that I made six months ago. Here is today's...

[PATCH] mm: put_and_wait_on_page_locked() while page is migrated

We have all assumed that it is essential to hold a page reference while
waiting on a page lock: partly to guarantee that there is still a struct
page when MEMORY_HOTREMOVE is configured, but also to protect against
reuse of the struct page going to someone who then holds the page locked
indefinitely, when the waiter can reasonably expect timely unlocking.

But in fact, so long as wait_on_page_bit_common() does the put_page(),
and is careful not to rely on struct page contents thereafter, there is
no need to hold a reference to the page while waiting on it. That does
mean that this case cannot go back through the loop: but that's fine for
the page migration case, and even if used more widely, is limited by the
"Stop walking if it's locked" optimization in wake_page_function().

Add interface put_and_wait_on_page_locked() to do this, using a negative
value of the lock arg to wait_on_page_bit_common() to implement it.
No interruptible or killable variant is needed yet, but they might follow:
I have a vague notion that reporting -EINTR should take precedence over
return from wait_on_page_bit_common() without knowing the page state, so
arrange it accordingly - but that may be nothing but pedantry.

shrink_page_list()'s __ClearPageLocked(): that was a surprise! The patch
survived a lot of testing before that one showed up. It does raise the
question: should is_page_cache_freeable() and __remove_mapping() now
treat a PG_waiters page as if an extra reference were held? Perhaps,
but I don't think it matters much, since shrink_page_list() already
had to win its trylock_page(), so waiters are not very common there: I
noticed no difference when trying the bigger change, and it's surely not
needed while put_and_wait_on_page_locked() is only for page migration.
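
To make the conversion easier to see at a glance, here is a condensed,
illustrative view of the caller-side change, lifted from the mm/migrate.c
hunk in the diff below (it is not an additional change, just the same
lines shown side by side):

	/* Before: the waiter holds its speculative reference across the sleep. */
	if (!get_page_unless_zero(page))	/* struct page may already be gone */
		goto out;
	pte_unmap_unlock(ptep, ptl);
	wait_on_page_locked(page);		/* sleeps with page_count() elevated */
	put_page(page);

	/* After: the helper drops that reference before it sleeps on PG_locked. */
	if (!get_page_unless_zero(page))
		goto out;
	pte_unmap_unlock(ptep, ptl);
	put_and_wait_on_page_locked(page);	/* put_page() is done inside, before sleeping */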
Signed-off-by: Hugh Dickins
---
 include/linux/pagemap.h |  2 ++
 mm/filemap.c            | 53 ++++++++++++++++++++++++++++++++---------
 mm/huge_memory.c        |  6 ++---
 mm/migrate.c            | 12 ++++------
 mm/vmscan.c             | 11 +++++----
 5 files changed, 57 insertions(+), 27 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 226f96f0dee0..e2d7039af6a3 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -537,6 +537,8 @@ static inline int wait_on_page_locked_killable(struct page *page)
 	return wait_on_page_bit_killable(compound_head(page), PG_locked);
 }
 
+extern void put_and_wait_on_page_locked(struct page *page);
+
 /*
  * Wait for a page to complete writeback
  */
diff --git a/mm/filemap.c b/mm/filemap.c
index 81adec8ee02c..ef82119032d8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -981,7 +981,14 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync,
 	if (wait_page->bit_nr != key->bit_nr)
 		return 0;
 
-	/* Stop walking if it's locked */
+	/*
+	 * Stop walking if it's locked.
+	 * Is this safe if put_and_wait_on_page_locked() is in use?
+	 * Yes: the waker must hold a reference to this page, and if PG_locked
+	 * has now already been set by another task, that task must also hold
+	 * a reference to the *same usage* of this page; so there is no need
+	 * to walk on to wake even the put_and_wait_on_page_locked() callers.
+	 */
 	if (test_bit(key->bit_nr, &key->page->flags))
 		return -1;
 
@@ -1050,13 +1057,14 @@ static void wake_up_page(struct page *page, int bit)
 }
 
 static inline int wait_on_page_bit_common(wait_queue_head_t *q,
-		struct page *page, int bit_nr, int state, bool lock)
+		struct page *page, int bit_nr, int state, int lock)
 {
 	struct wait_page_queue wait_page;
 	wait_queue_entry_t *wait = &wait_page.wait;
 	bool thrashing = false;
 	unsigned long pflags;
 	int ret = 0;
+	bool bit_is_set;
 
 	if (bit_nr == PG_locked &&
 	    !PageUptodate(page) && PageWorkingset(page)) {
@@ -1067,7 +1075,7 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 	}
 
 	init_wait(wait);
-	wait->flags = lock ? WQ_FLAG_EXCLUSIVE : 0;
+	wait->flags = lock > 0 ? WQ_FLAG_EXCLUSIVE : 0;
 	wait->func = wake_page_function;
 	wait_page.page = page;
 	wait_page.bit_nr = bit_nr;
@@ -1084,14 +1092,17 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 
 		spin_unlock_irq(&q->lock);
 
-		if (likely(test_bit(bit_nr, &page->flags))) {
+		bit_is_set = test_bit(bit_nr, &page->flags);
+		if (lock < 0)
+			put_page(page);
+
+		if (likely(bit_is_set))
 			io_schedule();
-		}
 
-		if (lock) {
+		if (lock > 0) {
 			if (!test_and_set_bit_lock(bit_nr, &page->flags))
 				break;
-		} else {
+		} else if (lock == 0) {
 			if (!test_bit(bit_nr, &page->flags))
 				break;
 		}
@@ -1100,6 +1111,17 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 			ret = -EINTR;
 			break;
 		}
+
+		if (lock < 0) {
+			/*
+			 * We can no longer safely access page->flags:
+			 * even if CONFIG_MEMORY_HOTREMOVE is not enabled,
+			 * there is a risk of waiting forever on a page reused
+			 * for something that keeps it locked indefinitely.
+			 * But best check for -EINTR above before breaking.
+			 */
+			break;
+		}
 	}
 
 	finish_wait(q, wait);
@@ -1124,17 +1146,26 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 void wait_on_page_bit(struct page *page, int bit_nr)
 {
 	wait_queue_head_t *q = page_waitqueue(page);
-	wait_on_page_bit_common(q, page, bit_nr, TASK_UNINTERRUPTIBLE, false);
+	wait_on_page_bit_common(q, page, bit_nr, TASK_UNINTERRUPTIBLE, 0);
 }
 EXPORT_SYMBOL(wait_on_page_bit);
 
 int wait_on_page_bit_killable(struct page *page, int bit_nr)
 {
 	wait_queue_head_t *q = page_waitqueue(page);
-	return wait_on_page_bit_common(q, page, bit_nr, TASK_KILLABLE, false);
+	return wait_on_page_bit_common(q, page, bit_nr, TASK_KILLABLE, 0);
 }
 EXPORT_SYMBOL(wait_on_page_bit_killable);
 
+void put_and_wait_on_page_locked(struct page *page)
+{
+	wait_queue_head_t *q;
+
+	page = compound_head(page);
+	q = page_waitqueue(page);
+	wait_on_page_bit_common(q, page, PG_locked, TASK_UNINTERRUPTIBLE, -1);
+}
+
 /**
  * add_page_wait_queue - Add an arbitrary waiter to a page's wait queue
  * @page: Page defining the wait queue of interest
@@ -1264,7 +1295,7 @@ void __lock_page(struct page *__page)
 {
 	struct page *page = compound_head(__page);
 	wait_queue_head_t *q = page_waitqueue(page);
-	wait_on_page_bit_common(q, page, PG_locked, TASK_UNINTERRUPTIBLE, true);
+	wait_on_page_bit_common(q, page, PG_locked, TASK_UNINTERRUPTIBLE, 1);
 }
 EXPORT_SYMBOL(__lock_page);
 
@@ -1272,7 +1303,7 @@ int __lock_page_killable(struct page *__page)
 {
 	struct page *page = compound_head(__page);
 	wait_queue_head_t *q = page_waitqueue(page);
-	return wait_on_page_bit_common(q, page, PG_locked, TASK_KILLABLE, true);
+	return wait_on_page_bit_common(q, page, PG_locked, TASK_KILLABLE, 1);
 }
 EXPORT_SYMBOL_GPL(__lock_page_killable);
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 622cced74fd9..832ab11badc2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1501,8 +1501,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
 		if (!get_page_unless_zero(page))
 			goto out_unlock;
 		spin_unlock(vmf->ptl);
-		wait_on_page_locked(page);
-		put_page(page);
+		put_and_wait_on_page_locked(page);
 		goto out;
 	}
 
@@ -1538,8 +1537,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
 		if (!get_page_unless_zero(page))
 			goto out_unlock;
 		spin_unlock(vmf->ptl);
-		wait_on_page_locked(page);
-		put_page(page);
+		put_and_wait_on_page_locked(page);
 		goto out;
 	}
 
diff --git a/mm/migrate.c b/mm/migrate.c
index f7e4bfdc13b7..acda06f99754 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -327,16 +327,13 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
 
 	/*
 	 * Once page cache replacement of page migration started, page_count
-	 * *must* be zero. And, we don't want to call wait_on_page_locked()
-	 * against a page without get_page().
-	 * So, we use get_page_unless_zero(), here. Even failed, page fault
-	 * will occur again.
+	 * is zero; but we must not call put_and_wait_on_page_locked() without
+	 * a ref. Use get_page_unless_zero(), and just fault again if it fails.
 	 */
 	if (!get_page_unless_zero(page))
 		goto out;
 	pte_unmap_unlock(ptep, ptl);
-	wait_on_page_locked(page);
-	put_page(page);
+	put_and_wait_on_page_locked(page);
 	return;
 out:
 	pte_unmap_unlock(ptep, ptl);
@@ -370,8 +367,7 @@ void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd)
 	if (!get_page_unless_zero(page))
 		goto unlock;
 	spin_unlock(ptl);
-	wait_on_page_locked(page);
-	put_page(page);
+	put_and_wait_on_page_locked(page);
 	return;
 unlock:
 	spin_unlock(ptl);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 62ac0c488624..85d9dde31153 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1459,11 +1459,14 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		/*
 		 * At this point, we have no other references and there is
 		 * no way to pick any more up (removed from LRU, removed
-		 * from pagecache). Can use non-atomic bitops now (and
-		 * we obviously don't have to worry about waking up a process
-		 * waiting on the page lock, because there are no references.
+		 * from pagecache). Usually we can use non-atomic bitops now,
+		 * but beware: earlier calls to put_and_wait_on_page_locked()
+		 * might still be waiting.
 		 */
-		__ClearPageLocked(page);
+		if (unlikely(PageWaiters(page)))
+			unlock_page(page);
+		else
+			__ClearPageLocked(page);
 free_it:
 		nr_reclaimed++;
 
-- 
2.19.1.1215.g8438c0b245-goog