Date: Fri, 17 Jun 2022 17:11:35 +0200
From: Jan Kara
To: Amir Goldstein
Cc: Jan Kara, Dave Chinner, "Darrick J. Wong", Christoph Hellwig,
    Matthew Wilcox, linux-xfs, linux-fsdevel, Ext4, Lukas Czerner,
    Theodore Tso, Al Viro
Subject: Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload
Message-ID: <20220617151135.yc6vytge6hjabsuz@quack3>
References: <20190404165737.30889-1-amir73il@gmail.com>
 <20190404211730.GD26298@dastard>
 <20190407232728.GF26298@dastard>
 <20190408141114.GC15023@quack2.suse.cz>
 <20190409082605.GA8107@quack2.suse.cz>
List-ID: linux-ext4@vger.kernel.org

On Fri 17-06-22 17:48:08, Amir Goldstein wrote:
> On Tue, Apr 9, 2019 at 11:26 AM Jan Kara wrote:
> >
> > On Mon 08-04-19 20:41:09, Amir Goldstein wrote:
> > > On Mon, Apr 8, 2019 at 5:11 PM Jan Kara wrote:
> > > >
> > > > On Mon 08-04-19 12:02:34, Amir Goldstein wrote:
> > > > > On Mon, Apr 8, 2019 at 2:27 AM Dave Chinner wrote:
> > > > > >
> > > > > > On Fri, Apr 05, 2019 at 05:02:33PM +0300, Amir Goldstein wrote:
> > > > > > > On Fri, Apr 5, 2019 at 12:17 AM Dave Chinner wrote:
> > > > > > > >
> > > > > > > > On Thu, Apr 04, 2019 at 07:57:37PM +0300, Amir Goldstein wrote:
> > > > > > > > > This patch improves performance of mixed random rw workload
> > > > > > > > > on xfs without relaxing the atomic buffered read/write
> > > > > > > > > guarantee that xfs has always provided.
> > > > > > > > >
> > > > > > > > > We achieve that by calling generic_file_read_iter() twice.
> > > > > > > > > Once with a discard iterator to warm up page cache before
> > > > > > > > > taking the shared ilock and once again under shared ilock.
> > > > > > > >
> > > > > > > > This will race with things like truncate, hole punching, etc.
> > > > > > > > that serialise IO and invalidate the page cache for data
> > > > > > > > integrity reasons under the IOLOCK. These rely on there being
> > > > > > > > no IO to the inode in progress at all to work correctly, which
> > > > > > > > this patch violates. IOWs, while this is fast, it is not safe
> > > > > > > > and so not a viable approach to solving the problem.
> > > > > > > >
> > > > > > >
> > > > > > > This statement leaves me wondering: if ext4 does not take
> > > > > > > i_rwsem in generic_file_read_iter(), how does ext4 (or any
> > > > > > > other fs for that matter) guarantee buffered read
> > > > > > > synchronization with truncate, hole punching etc.?
> > > > > > > The answer in ext4's case is i_mmap_sem, which is read-locked
> > > > > > > in the page fault handler.
> > > > > >
> > > > > > Nope, the i_mmap_sem is for serialisation of /page faults/
> > > > > > against truncate, holepunching, etc. Completely irrelevant to
> > > > > > the read() path.
> > > > > >
> > > > >
> > > > > I'm at a loss here. Why are page faults completely irrelevant to
> > > > > the read() path? Aren't full pages supposed to be faulted in on
> > > > > read() after truncate_pagecache_range()?
> > > >
> > > > During read(2), pages are not "faulted in". Just look at what
> > > > generic_file_buffered_read() does. It uses completely separate
> > > > code to add a page to the page cache, trigger readahead, and
> > > > possibly call ->readpage() to fill the page with data. The
> > > > "fault" path (handled by filemap_fault()) applies only to
> > > > accesses from userspace to mmaps.
> > > >
> > >
> > > Oh! Thanks for fixing my blind spot.
> > > So if you agree with Dave that ext4, and who knows what other
> > > filesystems, are vulnerable to populating the page cache with
> > > stale "uptodate" data,
> >
> > Not that many filesystems support punching holes, but you're right.
> >
> > > then it seems to me that xfs is also vulnerable via readahead(2)
> > > and posix_fadvise().
> >
> > Yes, this is correct AFAICT.
> >
> > > Mind you, I recently added an fadvise f_op, so it could be used
> > > by xfs to synchronize with the IOLOCK.
> >
> > And yes, this should work.
> >
> > > Perhaps a better solution would be for truncate_pagecache_range()
> > > to leave zeroed or unwritten (i.e. lazily zeroed by read) pages
> > > in the page cache. When we have shared pages for files, these
> > > pages could be deduped.
> >
> > No, I wouldn't really mess with sharing pages due to this. It would
> > be hard to make that scale reasonably and it would be rather
> > complex. We really need a proper and reasonably simple
> > synchronization mechanism between operations removing blocks from
> > an inode and operations filling in the page cache of the inode. The
> > page lock was supposed to provide this but doesn't quite work
> > because hole punching first removes pagecache pages and then goes
> > on to remove all blocks.
> >
> > So I agree with Dave that going for a range lock is really the
> > cleanest way forward here without causing big regressions for mixed
> > rw workloads. I'm just thinking how to best do that without
> > introducing a lot of boilerplate code into each filesystem.
>
> Hi Jan, Dave,
>
> Trying to circle back to this after 3 years!
> Seeing that there is no progress with range locks and that the mixed
> rw workloads performance issue still very much exists:
>
> Is the situation now different than 3 years ago with invalidate_lock?

Yes, I've implemented invalidate_lock exactly to fix the issues you've
pointed out without regressing the mixed rw workloads (because
invalidate_lock is taken in shared mode only for reads and usually not
at all for writes).

> Would my approach of pre-warming the page cache before taking IOLOCK
> be safe if the page cache is pre-warmed with invalidate_lock held?

Why would it be needed? But yes, with invalidate_lock you could
presumably make that idea safe...

								Honza
--
Jan Kara
SUSE Labs, CR