Date: Tue, 23 Jun 2020 12:35:17 +1000
From: Dave Chinner <david@fromorbit.com>
To: Matthew Wilcox
Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, agruenba@redhat.com,
    linux-kernel@vger.kernel.org
Subject: Re: [RFC] Bypass filesystems for reading cached pages
Message-ID: <20200623023517.GG2040@dread.disaster.area>
References: <20200619155036.GZ8681@bombadil.infradead.org>
 <20200622003215.GC2040@dread.disaster.area>
 <20200622191857.GB21350@casper.infradead.org>
In-Reply-To: <20200622191857.GB21350@casper.infradead.org>

On Mon, Jun 22, 2020 at 08:18:57PM +0100, Matthew Wilcox wrote:
> On Mon, Jun 22, 2020 at 10:32:15AM +1000, Dave Chinner wrote:
> > On Fri, Jun 19, 2020 at 08:50:36AM -0700, Matthew Wilcox wrote:
> > >
> > > This patch lifts the IOCB_CACHED idea expressed by Andreas to the VFS.
> > > The advantage of this patch is that we can avoid taking any filesystem
> > > lock, as long as the pages being accessed are in the cache (and we don't
> > > need to readahead any pages into the cache). We also avoid an indirect
> > > function call in these cases.
> >
> > What does this micro-optimisation actually gain us except for more
> > complexity in the IO path?
> >
> > i.e. if a filesystem lock has such massive overhead that it slows
> > down the cached readahead path in production workloads, then that's
> > something the filesystem needs to address, not unconditionally
> > bypass the filesystem before the IO gets anywhere near it.
>
> You've been talking about adding a range lock to XFS for a while now.

I don't see what that has to do with this patch.

> I remain quite sceptical that range locks are a good idea; they have not
> worked out well as a replacement for the mmap_sem, although the workload
> for the mmap_sem is quite different and they may yet show promise for
> the XFS iolock.

That was a really poor implementation of a range lock. It had no
concurrency to speak of, because the tracking tree required a
spinlock to be taken for every lock or unlock the range lock
performed. Hence it had an expensive critical section that could
not scale past the number of ops a single CPU could perform on
that tree. IOWs, it topped out at about 150k lock cycles a second
with 2-3 concurrent AIO+DIO threads, and only went slower as the
number of concurrent IO submitters went up.

So, yeah, if you are going to talk about range locks, you need to
forget about what was tried on the mmap_sem, because nobody
actually scalability-tested that lock implementation by itself and
it turned out to be total crap....

> There are production workloads that do not work well on top of a single
> file on an XFS filesystem.  For example, using an XFS file in a host as
> the backing store for a guest block device.  People tend to work around
> that kind of performance bug rather than report it.

*cough* AIO+DIO *cough*

You may not like that answer, but anyone who cares about IO
performance, especially single-file IO performance, is using
AIO+DIO. Buffered IO for VM image files in production environments
tends to be the exception, not the norm, because caching is done in
the guest by the guest page cache. Double-caching IO data is
generally considered a waste of resources that could otherwise be
sold to customers.

> Do you agree that the guarantees that XFS currently supplies regarding
> locked operation will be maintained if the I/O is contained within a
> single page and the mutex is not taken?

Not at first glance, because block size < page size configurations
exist and hence filesystems might be punching out extents from a
sub-page range....
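
As a concrete (if hypothetical) illustration of that concern: a lockless
single-page fast path would probably need to exclude the sub-page block
size case in addition to the page-containment test quoted below. A
minimal sketch follows; the helper name and its call site are
assumptions for illustration, not anything in the posted RFC patch.

	#include <linux/fs.h>	/* struct kiocb, file_inode(), i_blocksize() */
	#include <linux/uio.h>	/* iov_iter_count() */

	/*
	 * Illustrative only: decide whether a buffered read is contained
	 * enough that a lockless page-cache fast path could even be
	 * considered.  i_blocksize(), file_inode() and iov_iter_count()
	 * are existing kernel helpers; this function and where it would
	 * be called from are assumptions.
	 */
	static inline bool read_fits_in_one_page(struct kiocb *iocb,
						 struct iov_iter *iter)
	{
		struct inode *inode = file_inode(iocb->ki_filp);

		/*
		 * With block size < page size, a single page can be backed
		 * by multiple extents, so a concurrent hole punch can pull
		 * blocks out from under part of the page being copied.
		 */
		if (i_blocksize(inode) < PAGE_SIZE)
			return false;

		/* The IO must start and end within the same page. */
		return iocb->ki_pos / PAGE_SIZE ==
		       (iocb->ki_pos + iov_iter_count(iter) - 1) / PAGE_SIZE;
	}
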
> ie add this check to the original
> patch:
>
>	if (iocb->ki_pos / PAGE_SIZE !=
>	    (iocb->ki_pos + iov_iter_count(iter) - 1) / PAGE_SIZE)
>		goto uncached;
>
> I think that gets me almost everything I want.  Small I/Os are going to
> notice the pain of the mutex more than large I/Os.

Exactly what are you trying to optimise, Willy? You haven't
explained to anyone what workload needs these micro-optimisations,
and without understanding why you want to cut the filesystems out
of the readahead path, I can't suggest alternative solutions...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
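
As a postscript on the arithmetic in the check quoted above, here is a
stand-alone user-space sketch (assuming a 4096-byte page size; not
kernel code) showing which IOs it classifies as crossing a page and
therefore sends down the uncached path:

	#include <stdio.h>

	#define PAGE_SIZE 4096UL	/* assumed page size for illustration */

	/* Same containment test as in the check quoted above. */
	static int crosses_page(unsigned long pos, unsigned long count)
	{
		return pos / PAGE_SIZE != (pos + count - 1) / PAGE_SIZE;
	}

	int main(void)
	{
		/* 200 bytes at offset 100: stays in page 0, fast path possible. */
		printf("pos=100  count=200 -> crosses=%d\n", crosses_page(100, 200));
		/* 200 bytes at offset 4000: spills into page 1, goes uncached. */
		printf("pos=4000 count=200 -> crosses=%d\n", crosses_page(4000, 200));
		return 0;
	}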