Date: Fri, 25 Feb 2022 18:21:21 -0500
From: "Theodore Ts'o"
To: John Hubbard
Cc: Eric Biggers, Lee Jones, linux-ext4@vger.kernel.org, Christoph Hellwig,
    Dave Chinner, Goldwyn Rodrigues, "Darrick J. Wong", Bob Peterson,
    Damien Le Moal, Andreas Gruenbacher, Ritesh Harjani,
    Greg Kroah-Hartman, Johannes Thumshirn, linux-xfs@vger.kernel.org,
    linux-fsdevel@vger.kernel.org, cluster-devel@redhat.com,
    linux-kernel@vger.kernel.org
Subject: Re: [PATCH -v3] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first
References: <2f9933b3-a574-23e1-e632-72fc29e582cf@nvidia.com>
In-Reply-To: <2f9933b3-a574-23e1-e632-72fc29e582cf@nvidia.com>

On Fri, Feb 25, 2022 at 01:33:33PM -0800, John Hubbard wrote:
> On 2/25/22 13:23, Theodore Ts'o wrote:
> > [un]pin_user_pages_remote is dirtying pages without properly warning
> > the file system in advance. This was noted by Jan Kara in 2018[1] and
>
> In 2018, [un]pin_user_pages_remote did not exist. And so what Jan reported
> was actually that dio_bio_complete() was calling set_page_dirty_lock()
> on pages that were not (any longer) set up for that.

Fair enough, there are two problems that are getting conflated here,
and that's my bad.
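Just so we're looking at the same thing, the pattern at issue on the
gup side today is roughly the following.  This is a simplified sketch
from memory, not the actual mm/gup.c code, and the function name is
made up:

    #include <linux/mm.h>

    /*
     * Simplified sketch, not the real kernel code: the unpin path
     * dirties the pages behind the file system's back, i.e. without a
     * page_mkwrite() call having been made at this point.
     */
    static void unpin_pages_dirty_sketch(struct page **pages,
                                         unsigned long npages)
    {
            unsigned long i;

            for (i = 0; i < npages; i++) {
                    /* fs never saw page_mkwrite() for this dirty */
                    set_page_dirty_lock(pages[i]);
                    /* drop the FOLL_PIN reference */
                    unpin_user_page(pages[i]);
            }
    }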
The problem which Jan pointed out is one where the Direct I/O read path
triggered a page fault, so page_mkwrite() was actually called.  So in
this case the file system was actually notified, and the page was
marked dirty only after that notification.  But then the DIO read was
racing with the page cleaner, which would call writepage(), clean the
page, and then remove the buffer_heads.  Then dio_bio_complete() would
call set_page_dirty() a second time, and that's what would trigger the
BUG.

But in the syzbot reproducer it's a different problem.  In this case,
process_vm_writev() is calling [un]pin_user_pages_remote(), and
page_mkwrite() is never getting called.  So there is no race with the
page cleaner that needs to be won, and so the BUG triggers much more
reliably.

> > more recently has resulted in bug reports by Syzbot in various Android
> > kernels[2].
> >
> > This is technically a bug in mm/gup.c, but arguably ext4 is fragile in
>
> Is it, really? unpin_user_pages_dirty_lock() moved the set_page_dirty_lock()
> call into mm/gup.c, but that merely refactored things. The callers are
> all over the kernel, and those callers are what need changing in order
> to fix this.

From my perspective, the bug is calling set_page_dirty() without first
calling the file system's page_mkwrite().  That call is necessary
because the file system needs to allocate data blocks in preparation
for a future writeback.

Now, calling page_mkwrite() by itself is not enough, since the moment
you make the page dirty, the page cleaner could go ahead and call
writepage() behind your back and clean it.  In actual practice, with a
Direct I/O read request racing with writeback, this race was quite hard
to hit, because it would require the background writepage() call to
complete ahead of the synchronous read request, and the block layer
generally prioritizes synchronous reads over background write requests.
So in practice this race was ***very*** hard to hit.  Jan may have
reported it in 2018, but I don't think I've ever seen it happen myself.

For process_vm_writev() the user pages are pinned and then released in
short order, so I suspect that a race with the page cleaner would also
be very hard to hit.

But we could completely remove the potential for the race, and also
make things kinder for f2fs's and btrfs's compressed file write
support, by making things work much like the write(2) system call.
Imagine if we had a "pin_user_pages_local()" which calls write_begin(),
and an "unpin_user_pages_local()" which calls write_end(), where the
presumption with the "[un]pin_user_pages_local" API is that you don't
hold the pinned pages for very long --- say, not across a system call
boundary.  Then it would work the same way the write(2) system call
does, except that in the case of process_vm_writev(2) the pages are
identified via another process's address space, where they happen to be
mapped.  (A very rough sketch of what I mean is in the P.S. below.)

This obviously doesn't work when pinning pages for remote DMA, because
in that case the time between pin_user_pages_remote() and
unpin_user_pages_remote() could be a long, long time.  So there we
can't use write_begin/write_end; we'd need to call page_mkwrite() when
the pages are first pinned and then somehow prevent the page cleaner
from touching a dirty page which is pinned for use by the remote DMA.

Does that make sense?

						- Ted
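P.S.  Here's a very rough sketch of the sort of thing I'm hand-waving
about with "[un]pin_user_pages_local()" above.  None of these functions
exist today; this is purely hypothetical, it glosses over the real
->write_begin()/->write_end() signatures and all error handling, and it
ignores how we'd get from a pinned page back to the struct file and the
file offset (vma->vm_file and friends).  The only point is the
ordering: tell the file system *before* the page gets dirtied, and tell
it again when we're done, just like the write(2) path does.

    #include <linux/fs.h>
    #include <linux/mm.h>

    /*
     * Hypothetical sketch only -- not a real or proposed kernel API.
     * Single-page variant for brevity; the real thing would take a page
     * array like pin_user_pages(), and would also take the FOLL_PIN
     * reference on the page (omitted here).
     */
    static struct page *pin_user_page_local(struct file *file, loff_t pos,
                                            unsigned len, void **fsdata)
    {
            struct address_space *mapping = file->f_mapping;
            struct page *page;

            /* Give the fs a chance to allocate blocks / reserve space */
            if (mapping->a_ops->write_begin(file, mapping, pos, len, 0,
                                            &page, fsdata))
                    return NULL;
            return page;    /* locked, as with write_begin() today */
    }

    static void unpin_user_page_local(struct file *file, struct page *page,
                                      loff_t pos, unsigned len, void *fsdata)
    {
            struct address_space *mapping = file->f_mapping;

            /* Dirties and unlocks the page with the fs fully informed */
            mapping->a_ops->write_end(file, mapping, pos, len, len, page,
                                      fsdata);
    }

With something along those lines, process_vm_writev(2) would look to
the file system pretty much like an ordinary buffered write.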