Date: Tue, 27 Nov 2018 12:58:48 +0200
From: Mike Rapoport
To: Hugh Dickins
Cc: Linus Torvalds, Andrew Morton, Baoquan He, Michal Hocko,
	Vlastimil Babka, Andrea Arcangeli, David Hildenbrand, Mel Gorman,
	David Herrmann, Tim Chen, Kan Liang, Andi Kleen, Davidlohr Bueso,
	Peter Zijlstra, Christoph Lameter, Nick Piggin, pifang@redhat.com,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v2] mm: put_and_wait_on_page_locked() while page is migrated
Message-Id: <20181127105848.GD16502@rapoport-lnx>

On Mon, Nov 26, 2018 at 11:27:07AM -0800, Hugh Dickins wrote:
> Waiting on a page migration entry has used wait_on_page_locked() all
> along since 2006: but you cannot safely wait_on_page_locked() without
> holding a reference to the page, and that extra reference is enough to
> make migrate_page_move_mapping() fail with -EAGAIN, when a racing task
> faults on the entry before migrate_page_move_mapping() gets there.
>
> And that failure is retried nine times, amplifying the pain when
> trying to migrate a popular page. With a single persistent faulter,
> migration sometimes succeeds; with two or three concurrent faulters,
> success becomes much less likely (and the more the page was mapped,
> the worse the overhead of unmapping and remapping it on each try).
>
> This is especially a problem for memory offlining, where the outer
> level retries forever (or until terminated from userspace), because
> a heavy refault workload can trigger an endless loop of migration
> failures. wait_on_page_locked() is the wrong tool for the job.
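
To make the failure mode above concrete: migrate_page_move_mapping()
only proceeds when the page's reference count is exactly what the
migration path itself accounts for. A simplified sketch of that check
(the helper name is made up; this is not the literal mm/migrate.c code):

static bool can_freeze_page_refs(struct page *page, int expected_count)
{
	/*
	 * expected_count covers only the references the migration path
	 * knows about (the mapping's refs plus its own migration ref),
	 * so one waiter holding an extra get_page() reference is enough
	 * to make this fail and force the -EAGAIN retry described above.
	 */
	return page_count(page) == expected_count;
}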
> David Herrmann (but was he the first?) noticed this issue in 2014:
> https://marc.info/?l=linux-mm&m=140110465608116&w=2
>
> Tim Chen started a thread in August 2017 which appears relevant:
> https://marc.info/?l=linux-mm&m=150275941014915&w=2
> where Kan Liang went on to implicate __migration_entry_wait():
> https://marc.info/?l=linux-mm&m=150300268411980&w=2
> and the thread ended up with the v4.14 commits:
> 2554db916586 ("sched/wait: Break up long wake list walk")
> 11a19c7b099f ("sched/wait: Introduce wakeup boomark in wake_up_page_bit")
>
> Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
> https://marc.info/?l=linux-mm&m=154217936431300&w=2
>
> We have all assumed that it is essential to hold a page reference while
> waiting on a page lock: partly to guarantee that there is still a struct
> page when MEMORY_HOTREMOVE is configured, but also to protect against
> reuse of the struct page going to someone who then holds the page locked
> indefinitely, when the waiter can reasonably expect timely unlocking.
>
> But in fact, so long as wait_on_page_bit_common() does the put_page(),
> and is careful not to rely on struct page contents thereafter, there is
> no need to hold a reference to the page while waiting on it. That does
> mean that this case cannot go back through the loop: but that's fine for
> the page migration case, and even if used more widely, is limited by the
> "Stop walking if it's locked" optimization in wake_page_function().
>
> Add interface put_and_wait_on_page_locked() to do this, using "behavior"
> enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
> No interruptible or killable variant needed yet, but they might follow:
> I have a vague notion that reporting -EINTR should take precedence over
> return from wait_on_page_bit_common() without knowing the page state,
> so arrange it accordingly - but that may be nothing but pedantic.
>
> __migration_entry_wait() still has to take a brief reference to the
> page, prior to calling put_and_wait_on_page_locked(): but now that it
> is dropped before waiting, the chance of impeding page migration is
> very much reduced. Should we perhaps disable preemption across this?
>
> shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
> survived a lot of testing before that showed up. PageWaiters may have
> been set by wait_on_page_bit_common(), and the reference dropped, just
> before shrink_page_list() succeeds in freezing its last page reference:
> in such a case, unlock_page() must be used. Follow the suggestion from
> Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
> that optimization predates PageWaiters, and won't buy much these days;
> but we can reinstate it for the !PageWaiters case if anyone notices.
>
> It does raise the question: should vmscan.c's is_page_cache_freeable()
> and __remove_mapping() now treat a PageWaiters page as if an extra
> reference were held? Perhaps, but I don't think it matters much, since
> shrink_page_list() already had to win its trylock_page(), so waiters are
> not very common there: I noticed no difference when trying the bigger
> change, and it's surely not needed while put_and_wait_on_page_locked()
> is only used for page migration.
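
The mm/migrate.c hunk is not quoted below, but the calling pattern
described above amounts to roughly this (a sketch with a made-up
function name, not the actual hunk):

static void migration_entry_wait_sketch(struct page *page, spinlock_t *ptl)
{
	/* Take a brief reference so the struct page cannot be freed. */
	if (!get_page_unless_zero(page)) {
		spin_unlock(ptl);
		return;
	}
	/* Never sleep while holding the page table lock. */
	spin_unlock(ptl);
	/* Drops our reference before actually waiting on PG_locked. */
	put_and_wait_on_page_locked(page);
}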
>
> Reported-and-tested-by: Baoquan He
> Signed-off-by: Hugh Dickins
> Acked-by: Michal Hocko
> Reviewed-by: Andrea Arcangeli
> ---
>  include/linux/pagemap.h |  2 ++
>  mm/filemap.c            | 77 ++++++++++++++++++++++++++++++++++-------
>  mm/huge_memory.c        |  6 ++--
>  mm/migrate.c            | 12 +++----
>  mm/vmscan.c             | 10 ++----
>  5 files changed, 74 insertions(+), 33 deletions(-)
>
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 226f96f0dee0..e2d7039af6a3 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -537,6 +537,8 @@ static inline int wait_on_page_locked_killable(struct page *page)
>  	return wait_on_page_bit_killable(compound_head(page), PG_locked);
>  }
>  
> +extern void put_and_wait_on_page_locked(struct page *page);
> +
>  /*
>   * Wait for a page to complete writeback
>   */
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 81adec8ee02c..575e16c037ca 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -981,7 +981,14 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync,
>  	if (wait_page->bit_nr != key->bit_nr)
>  		return 0;
>  
> -	/* Stop walking if it's locked */
> +	/*
> +	 * Stop walking if it's locked.
> +	 * Is this safe if put_and_wait_on_page_locked() is in use?
> +	 * Yes: the waker must hold a reference to this page, and if PG_locked
> +	 * has now already been set by another task, that task must also hold
> +	 * a reference to the *same usage* of this page; so there is no need
> +	 * to walk on to wake even the put_and_wait_on_page_locked() callers.
> +	 */
>  	if (test_bit(key->bit_nr, &key->page->flags))
>  		return -1;
>  
> @@ -1049,25 +1056,44 @@ static void wake_up_page(struct page *page, int bit)
>  		wake_up_page_bit(page, bit);
>  }
>  
> +/*
> + * A choice of three behaviors for wait_on_page_bit_common():
> + */
> +enum behavior {
> +	EXCLUSIVE,	/* Hold ref to page and take the bit when woken, like
> +			 * __lock_page() waiting on then setting PG_locked.
> +			 */
> +	SHARED,		/* Hold ref to page and check the bit when woken, like
> +			 * wait_on_page_writeback() waiting on PG_writeback.
> +			 */
> +	DROP,		/* Drop ref to page before wait, no check when woken,
> +			 * like put_and_wait_on_page_locked() on PG_locked.
> +			 */
> +};

Can we please make it:

/**
 * enum behavior - a choice of three behaviors for wait_on_page_bit_common()
 */
enum behavior {
	/**
	 * @EXCLUSIVE: Hold ref to page and take the bit when woken,
	 * like __lock_page() waiting on then setting %PG_locked.
	 */
	EXCLUSIVE,

	/**
	 * @SHARED: Hold ref to page and check the bit when woken,
	 * like wait_on_page_writeback() waiting on %PG_writeback.
	 */
	SHARED,

	/**
	 * @DROP: Drop ref to page before wait, no check when woken,
	 * like put_and_wait_on_page_locked() on %PG_locked.
	 */
	DROP,
};

-- 
Sincerely yours,
Mike.