Received: by 2002:a05:6a10:f347:0:0:0:0 with SMTP id d7csp1781936pxu; Tue, 24 Nov 2020 08:48:35 -0800 (PST) X-Google-Smtp-Source: ABdhPJxyEsyUv4Uvn5LoBI1ueeLRz1dpwzsPBmd0s0+7p5dGC1vE+IrUcohaQwmoE6xuUz8YJDtX X-Received: by 2002:a05:6402:491:: with SMTP id k17mr4897571edv.370.1606236515469; Tue, 24 Nov 2020 08:48:35 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1606236515; cv=none; d=google.com; s=arc-20160816; b=Ivf/doqPnqMurj/t3CrloVhmhUmThNaiqjczlkk75dWHq0FWkVgXwk7ZGxeDyCkV1k kpeQQxt8lyaeBFkatanINZrapuvwededBJ+JqkyANL+BAeuzaUa2LWWX5RnupDwhKnnw SDAtDSQl2yjWkqDsJrfnx7iW+XaX9DC2tGbSuQz6UcgVBGrd3qeujYe0xxDZQ+qYjsWk +M4ohVMXeqO/5Ck7GufAgD+UNqMpp+1m//QuYu2mEYA9MDq26/jwHgFyueVrX9Jogged w1rMED6m4m45+ZIM4CrcQMR+SocwLxPkZBi4C90CfgwhoP42thvokiPTqMws5jO+BZQA nm4w== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:mime-version:user-agent:references:message-id :in-reply-to:subject:cc:to:from:date:dkim-signature; bh=r2MNsblWMD0h7dACeUcQYwsvIz1+vfqKTGuqjygmt6A=; b=miSko5mKpBeaR7KDCzWyr6uOuDNjg+m4WBRksTeMZTHSl2oK2SfLjGdA9muFjI7TzM 1B10XCANM86+5/Fpiln5EdV3bqdSIxBXVsRVzyHtSuq012zKbQ5w4PX0VeJA4IfO+0aP 17z7YVCczTCYFw3OcHos5RFR0Cdy536jRptwdSuhfuPUhTWkbPKjNRILIrJVY/2bnsVb bPC0q+iviyb8C49AYTK5omEUZq/7yJa+mizoSdHtrcTDvtiM7v9NkZmK5XjFQ2rE0hZv nb0giIB0JItJi1AFBVb7J5aseSeyLFI7HgjhGLhA44kF8jm0cEh+BMhd0mu0FZxx6jEG Vz8Q== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=oek5p2sp; spf=pass (google.com: domain of linux-ext4-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-ext4-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id ce6si8865849ejb.133.2020.11.24.08.48.08; Tue, 24 Nov 2020 08:48:35 -0800 (PST) Received-SPF: pass (google.com: domain of linux-ext4-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=oek5p2sp; spf=pass (google.com: domain of linux-ext4-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-ext4-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2390592AbgKXQrA (ORCPT + 99 others); Tue, 24 Nov 2020 11:47:00 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:36360 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2390499AbgKXQq7 (ORCPT ); Tue, 24 Nov 2020 11:46:59 -0500 Received: from mail-oi1-x243.google.com (mail-oi1-x243.google.com [IPv6:2607:f8b0:4864:20::243]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id F2D5CC061A4E for ; Tue, 24 Nov 2020 08:46:57 -0800 (PST) Received: by mail-oi1-x243.google.com with SMTP id k26so24511905oiw.0 for ; Tue, 24 Nov 2020 08:46:57 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=date:from:to:cc:subject:in-reply-to:message-id:references :user-agent:mime-version; bh=r2MNsblWMD0h7dACeUcQYwsvIz1+vfqKTGuqjygmt6A=; b=oek5p2spg5uGjV0ReaRb//qL3MZLLeV+cLAdm0LSD2luaYf/jDNBuTG0Zr/eipkW9j h49N3swg1t9yAVac0ZxbkFow2WD6DqYjY155804339u9GxeoPByd7eym9Nr6086ihBDT eYf5a0jiyTliGzfdOHSI4ajtt9jnEKy0KZwX3/8wjUnhpdNXJQt8x2xizOQvcCXIoKa1 97qFZ/LtEYvNuXZvQv40oEAoT+/ppecOzxlCHrblncb8D172okfP5Z0DJf5pdH7Rx2ov HfyYP6Txu94ZDlfB44H/EirbhUxHcNNwiAyieHais+Kmop1A9dLE9fUgydyPMXifgKSk hwYQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:in-reply-to:message-id :references:user-agent:mime-version; bh=r2MNsblWMD0h7dACeUcQYwsvIz1+vfqKTGuqjygmt6A=; b=ster5XmtGaglUsWnGYcpktU4BISSO/zXZsIcu5gCmDnm4BFePXarHxG76uwMbEsww8 zVHhmrL64QWGr1ZJ396ak76PMnHYdcSoAdTEQ+Dk0JxLeiY6BC/JNzrVqXbGUl7FxRJ8 TUGJP3NQmjC4mJSbGPU83eXDBgTQCy+P5VdYmX2tNUfnAPWwEoG6DdknORlJqdn3+/6/ q4oaL72LraH7K4FcNV/1bGYcDrqJ7VpYLhHtuRJfxltSvdb3v+jE0Rl5BYm5tagXIFnd WyClSYJyd/ltHQWHfYMqHM8C+VyJTOeLBykQmlgr9f0VRkIS/SGIDwFYT6WYAuUkb0n7 +TdA== X-Gm-Message-State: AOAM531s9jHcKXgMaNH5urHnSEsfgWCYVfbFPWoojB0C3Lg3MWkg1g3P vK+lVk6mfOv/Z6JKcQeVEKivQQ== X-Received: by 2002:aca:d755:: with SMTP id o82mr3124373oig.164.1606236416738; Tue, 24 Nov 2020 08:46:56 -0800 (PST) Received: from eggly.attlocal.net (172-10-233-147.lightspeed.sntcca.sbcglobal.net. [172.10.233.147]) by smtp.gmail.com with ESMTPSA id i43sm8108918ota.39.2020.11.24.08.46.54 (version=TLS1 cipher=ECDHE-ECDSA-AES128-SHA bits=128/128); Tue, 24 Nov 2020 08:46:55 -0800 (PST) Date: Tue, 24 Nov 2020 08:46:43 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Linus Torvalds cc: Jan Kara , syzbot , Andreas Dilger , Ext4 Developers List , Linux Kernel Mailing List , syzkaller-bugs , Theodore Ts'o , Linux-MM , Oleg Nesterov , Andrew Morton , "Kirill A. Shutemov" , Nicholas Piggin , Alex Shi , Qian Cai , Christoph Hellwig , "Darrick J. Wong" , Matthew Wilcox , William Kucharski , Jens Axboe , linux-fsdevel , linux-xfs , Hugh Dickins Subject: Re: kernel BUG at fs/ext4/inode.c:LINE! In-Reply-To: Message-ID: References: <000000000000d3a33205add2f7b2@google.com> <20200828100755.GG7072@quack2.suse.cz> <20200831100340.GA26519@quack2.suse.cz> User-Agent: Alpine 2.11 (LSU 23 2013-08-11) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Precedence: bulk List-ID: X-Mailing-List: linux-ext4@vger.kernel.org On Mon, 23 Nov 2020, Hugh Dickins wrote: > On Mon, 23 Nov 2020, Linus Torvalds wrote: > > > > IOW, why couldn't we just make the __test_set_page_writeback() > > increment the page count if the writeback flag wasn't already set, and > > then make the end_page_writeback() do a put_page() after it all? > > Right, that should be a lot simpler, and will not require any of the > cleanup (much as I liked that). If you're reasonably confident that > adding the extra get_page+put_page to every writeback (instead of > just to the waited case, which I presume significantly less common) > will get lost in the noise - I was not confident of that, nor > confident of devising realistic tests to decide it. > > What I did look into before sending, was whether in the filesystems > there was a pattern of doing a put_page() after *set_page_writeback(), > when it would just be a matter of deleting that put_page() and doing > it instead at the end of end_page_writeback(). But no: there were a > few cases like that, but in general no such pattern. > > Though, what I think I'll try is not quite what you suggest there, > but instead do both get_page() and put_page() in end_page_writeback(). > The reason being, there are a number of places (in mm at least) where > we judge what to do by the expected refcount: places that know to add > 1 on when PagePrivate is set (for buffers), but do not expect to add > 1 on when PageWriteback is set. Now, all of those places probably > have to have their own wait_on_page_writeback() too, but I'd rather > narrow the window when the refcount is raised, than work through > what if any change would be needed in those places. This ran fine overnight on several machines - just to check I hadn't screwed it up. Vanishingly unlikely to have hit either condition, nor would I have noticed any difference in performance. [PATCH] mm: fix VM_BUG_ON(PageTail) and BUG_ON(PageWriteback) Twice now, when exercising ext4 looped on shmem huge pages, I have crashed on the PF_ONLY_HEAD check inside PageWaiters(): ext4_finish_bio() calling end_page_writeback() calling wake_up_page() on tail of a shmem huge page, no longer an ext4 page at all. The problem is that PageWriteback is not accompanied by a page reference (as the NOTE at the end of test_clear_page_writeback() acknowledges): as soon as TestClearPageWriteback has been done, that page could be removed from page cache, freed, and reused for something else by the time that wake_up_page() is reached. https://lore.kernel.org/linux-mm/20200827122019.GC14765@casper.infradead.org/ Matthew Wilcox suggested avoiding or weakening the PageWaiters() tail check; but I'm paranoid about even looking at an unreferenced struct page, lest its memory might itself have already been reused or hotremoved (and wake_up_page_bit() may modify that memory with its ClearPageWaiters()). Then on crashing a second time, realized there's a stronger reason against that approach. If my testing just occasionally crashes on that check, when the page is reused for part of a compound page, wouldn't it be much more common for the page to get reused as an order-0 page before reaching wake_up_page()? And on rare occasions, might that reused page already be marked PageWriteback by its new user, and already be waited upon? What would that look like? It would look like BUG_ON(PageWriteback) after wait_on_page_writeback() in write_cache_pages() (though I have never seen that crash myself). And prior to 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic") this would have been much less likely: before that, wake_page_function()'s non-exclusive case would stop walking and not wake if it found Writeback already set again; whereas now the non-exclusive case proceeds to wake. I have not thought of a fix that does not add a little overhead: the simplest fix is for end_page_writeback() to get_page() before calling test_clear_page_writeback(), then put_page() after wake_up_page(). Was there a chance of missed wakeups before, since a page freed before reaching wake_up_page() would have PageWaiters cleared? I think not, because each waiter does hold a reference on the page. This bug comes when the old use of the page, the one we do TestClearPageWriteback on, had *no* waiters, so no additional page reference beyond the page cache (and whoever racily freed it). The reuse of the page has a waiter holding a reference, and its own PageWriteback set; but the belated wake_up_page() has woken the reuse to hit that BUG_ON(PageWriteback). Reported-by: syzbot+3622cea378100f45d59f@syzkaller.appspotmail.com Reported-by: Qian Cai Fixes: 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic") Signed-off-by: Hugh Dickins Cc: stable@vger.kernel.org # v5.8+ --- mm/filemap.c | 8 ++++++++ mm/page-writeback.c | 6 ------ 2 files changed, 8 insertions(+), 6 deletions(-) --- 5.10-rc5/mm/filemap.c 2020-11-22 17:43:01.637279974 -0800 +++ linux/mm/filemap.c 2020-11-23 23:08:20.141851113 -0800 @@ -1484,11 +1484,19 @@ void end_page_writeback(struct page *pag rotate_reclaimable_page(page); } + /* + * Writeback does not hold a page reference of its own, relying + * on truncation to wait for the clearing of PG_writeback. + * But here we must make sure that the page is not freed and + * reused before the wake_up_page(). + */ + get_page(page); if (!test_clear_page_writeback(page)) BUG(); smp_mb__after_atomic(); wake_up_page(page, PG_writeback); + put_page(page); } EXPORT_SYMBOL(end_page_writeback); --- 5.10-rc5/mm/page-writeback.c 2020-10-25 16:45:47.977843485 -0700 +++ linux/mm/page-writeback.c 2020-11-23 23:08:20.141851113 -0800 @@ -2754,12 +2754,6 @@ int test_clear_page_writeback(struct pag } else { ret = TestClearPageWriteback(page); } - /* - * NOTE: Page might be free now! Writeback doesn't hold a page - * reference on its own, it relies on truncation to wait for - * the clearing of PG_writeback. The below can only access - * page state that is static across allocation cycles. - */ if (ret) { dec_lruvec_state(lruvec, NR_WRITEBACK); dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);