Date: Mon, 9 Jan 2023 18:25:00 +0100
From: Jan Kara
To: David Howells
Cc: Jens Axboe, Al Viro, Christoph Hellwig, Matthew Wilcox,
    Logan Gunthorpe, Christoph Hellwig, Jeff Layton,
    linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org,
    linux-kernel@vger.kernel.org
Subject: Re: [PATCH v4 7/7] iov_iter, block: Make bio structs pin pages rather than ref'ing if appropriate
Message-ID: <20230109172500.bd4z2incticapm7x@quack3>
References: <167305160937.1521586.133299343565358971.stgit@warthog.procyon.org.uk>
    <167305166150.1521586.10220949115402059720.stgit@warthog.procyon.org.uk>
    <1880793.1673257404@warthog.procyon.org.uk>
In-Reply-To: <1880793.1673257404@warthog.procyon.org.uk>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon 09-01-23 09:43:24, David Howells wrote:
> Jens Axboe wrote:
>
> > > A field, bi_cleanup_mode, is added to the bio struct that gets set by
> > > iov_iter_extract_pages() with FOLL_* flags indicating what cleanup is
> > > necessary. FOLL_GET -> put_page(), FOLL_PIN -> unpin_user_page(). Other
> > > flags could also be used in future.
> > >
> > > Newly allocated bio structs have bi_cleanup_mode set to FOLL_GET to
> > > indicate that attached pages are ref'd by default. Cloning sets it to 0.
> > > __bio_iov_iter_get_pages() overrides it to what iov_iter_extract_pages()
> > > indicates.
> >
> > What's the motivation for this change?
>
> DIO reads in most filesystems and, I think, the block layer are currently
> broken with respect to concurrent fork in the same process because they take
> refs (FOLL_GET) on the pages involved which causes the CoW mechanism to
> malfunction, leading (I think) the parent process to not see the result of the
> DIO. IIRC, the pages undergoing DIO get forcibly copied by fork - and the
> copies given to the parent. Instead, DIO reads should be pinning the pages
> (FOLL_PIN). Maybe Willy can weigh in on this?
>
> Further, getting refs on pages in, say, a KVEC iterator is the wrong
> thing to do as the kvec may point to things that shouldn't be ref'd
> (vmap'd or vmalloc'd regions, for example). Instead, the in-kernel
> caller should do what it needs to do to keep hold of the memory and the
> DIO should not take a ref at all.

Yes, plus there is also a problem if a user sets up a DIO read into a
buffer backed by a memory-mapped file: these mapped pages can be cleaned
by writeback while the DIO read is running, causing checksum failures or
DIF/DIX failures.
Also, once the writeback is done, the filesystem currently thinks it
controls all paths modifying page data and thus can happily go on
deduplicating file blocks or doing similar things, although the pages are
concurrently modified by the DIO read, possibly causing data corruption.
See [1] for more details on why filesystems have a problem with this. So
filesystems really need DIO reads to use FOLL_PIN instead of FOLL_GET, and
consequently we need to pass information to the bio completion function
about how page references should be dropped.

								Honza

[1] https://lore.kernel.org/all/20180103100430.GE4911@quack2.suse.cz/

-- 
Jan Kara
SUSE Labs, CR
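
For illustration only: the point above, that the bio completion path has to be told how page references should be dropped, can be sketched as a small user-space model. This is not the patch and not kernel code. Every model_* name, the MODEL_* flag values, and the separate refcount/pincount fields are assumptions made up for this sketch (the real FOLL_* values and the kernel's pin accounting live in the kernel headers and differ from this). Only the mapping itself comes from the quoted patch description: FOLL_GET -> put_page(), FOLL_PIN -> unpin_user_page(), and 0 meaning no cleanup.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for FOLL_GET / FOLL_PIN; not the kernel's values. */
#define MODEL_FOLL_GET 0x1u
#define MODEL_FOLL_PIN 0x2u

/* Simplified page: a plain reference count and a separate pin count. */
struct model_page {
	int refcount;
	int pincount;
};

/* Simplified bio: remembers which cleanup its attached pages need. */
struct model_bio {
	unsigned int bi_cleanup_mode;	/* 0, MODEL_FOLL_GET or MODEL_FOLL_PIN */
	struct model_page **pages;
	size_t nr_pages;
};

/* Models put_page(): drop an ordinary reference. */
static void model_put_page(struct model_page *page)
{
	assert(page->refcount > 0);
	page->refcount--;
}

/* Models unpin_user_page(): drop a pin. */
static void model_unpin_user_page(struct model_page *page)
{
	assert(page->pincount > 0);
	page->pincount--;
}

/*
 * Completion-time release: pick the cleanup that matches how the pages
 * were obtained. A mode of 0 (e.g. a cloned bio in the quoted
 * description) means this bio holds nothing and must drop nothing.
 */
static void model_bio_release_pages(struct model_bio *bio)
{
	for (size_t i = 0; i < bio->nr_pages; i++) {
		struct model_page *page = bio->pages[i];

		if (bio->bi_cleanup_mode & MODEL_FOLL_PIN)
			model_unpin_user_page(page);
		else if (bio->bi_cleanup_mode & MODEL_FOLL_GET)
			model_put_page(page);
	}
}
```

In this model a bio filled from a user buffer would carry MODEL_FOLL_PIN, a freshly allocated one MODEL_FOLL_GET, and a clone 0, so the release function never drops a reference of the wrong kind.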