Date: Sat, 20 Jun 2020 12:15:21 -0700
From: Matthew Wilcox
To: Amir Goldstein
Cc: linux-fsdevel, Linux MM, Andreas Gruenbacher, linux-kernel
Subject: Re: [RFC] Bypass filesystems for reading cached pages
Message-ID: <20200620191521.GG8681@bombadil.infradead.org>
References: <20200619155036.GZ8681@bombadil.infradead.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Sat, Jun 20, 2020 at 09:19:37AM +0300, Amir Goldstein wrote:
> On Fri, Jun 19, 2020 at 6:52 PM Matthew Wilcox wrote:
> > This patch lifts the IOCB_CACHED idea expressed by Andreas to the VFS.
> > The advantage of this patch is that we can avoid taking any filesystem
> > lock, as long as the pages being accessed are in the cache (and we don't
> > need to readahead any pages into the cache). We also avoid an indirect
> > function call in these cases.
>
> XFS is taking i_rwsem lock in read_iter() for a surprising reason:
> https://lore.kernel.org/linux-xfs/CAOQ4uxjpqDQP2AKA8Hrt4jDC65cTo4QdYDOKFE-C3cLxBBa6pQ@mail.gmail.com/
> In that post I claim that ocfs2 and cifs also do some work in read_iter().
> I didn't go back to check what, but it sounds like cache coherence among
> nodes.

That's out of date. Here's POSIX-2017:

https://pubs.opengroup.org/onlinepubs/9699919799/functions/read.html

  "I/O is intended to be atomic to ordinary files and pipes and FIFOs.
  Atomic means that all the bytes from a single operation that started
  out together end up together, without interleaving from other I/O
  operations. It is a known attribute of terminals that this is not
  honored, and terminals are explicitly (and implicitly permanently)
  excepted, making the behavior unspecified. The behavior for other
  device types is also left unspecified, but the wording is intended to
  imply that future standards might choose to specify atomicity (or not)."

That _doesn't_ say "a read cannot observe a write in progress". It says
"Two writes cannot interleave".
Indeed, further down in that section, it says:

  "Earlier versions of this standard allowed two very different behaviors
  with regard to the handling of interrupts. In order to minimize the
  resulting confusion, it was decided that POSIX.1-2017 should support
  only one of these behaviors. Historical practice on AT&T-derived systems
  was to have read() and write() return -1 and set errno to [EINTR] when
  interrupted after some, but not all, of the data requested had been
  transferred. However, the US Department of Commerce FIPS 151-1 and FIPS
  151-2 require the historical BSD behavior, in which read() and write()
  return the number of bytes actually transferred before the interrupt. If
  -1 is returned when any data is transferred, it is difficult to recover
  from the error on a seekable device and impossible on a non-seekable
  device. Most new implementations support this behavior. The behavior
  required by POSIX.1-2017 is to return the number of bytes transferred."

That explicitly allows for a write to be interrupted by a signal and
later resumed, allowing a read to observe a half-complete write.

> Because if I am not mistaken, even though this change has a potential
> to improve many workloads, it may also degrade some workloads in cases
> where case readahead is not properly tuned. Imagine reading a large file
> and getting only a few pages worth of data read on every syscall.
> Or did I misunderstand your patch's behavior in that case?

I think you did. If the IOCB_CACHED read hits a readahead page,
it returns early. Then call_read_iter() notices the read is not yet
complete, and calls ->read_iter() to finish the read. So it's two
calls to generic_file_buffered_read() rather than one, but it's still
one syscall.
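
To make the two-pass flow above concrete, here is a small userspace model
of it. This is only a sketch under stated assumptions, not the patch
itself: struct model_file and every model_* function are hypothetical
names made up for illustration, the "page cache" is a pair of bool
arrays, and I assume the cached-only pass stops before copying the page
that carries the readahead marker.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096u
#define MODEL_NR_PAGES  8u

/* Hypothetical stand-in for a file's page cache; not a kernel structure. */
struct model_file {
	bool cached[MODEL_NR_PAGES];    /* page is present and up to date */
	bool readahead[MODEL_NR_PAGES]; /* page carries the readahead marker */
	char data[MODEL_NR_PAGES * MODEL_PAGE_SIZE];
	unsigned fast_calls;            /* cached-only attempts (no fs lock) */
	unsigned slow_calls;            /* fallbacks modelling ->read_iter() */
};

/* Copy at most up to the end of the page containing pos. */
static size_t model_copy_page(const struct model_file *f, char *buf,
			      size_t pos, size_t len)
{
	size_t n = MODEL_PAGE_SIZE - (pos % MODEL_PAGE_SIZE);

	assert(pos < sizeof(f->data));
	if (n > len)
		n = len;
	memcpy(buf, f->data + pos, n);
	return n;
}

/* Cached-only pass: return early at a missing or readahead-marked page. */
static size_t model_cached_read(struct model_file *f, char *buf,
				size_t pos, size_t len)
{
	size_t done = 0;

	f->fast_calls++;
	while (done < len) {
		size_t idx = (pos + done) / MODEL_PAGE_SIZE;

		if (idx >= MODEL_NR_PAGES || !f->cached[idx] || f->readahead[idx])
			break;
		done += model_copy_page(f, buf + done, pos + done, len - done);
	}
	return done;
}

/* Slow pass: pretend the filesystem fills pages and starts readahead. */
static size_t model_slow_read(struct model_file *f, char *buf,
			      size_t pos, size_t len)
{
	size_t done = 0;

	f->slow_calls++;
	while (done < len) {
		size_t idx = (pos + done) / MODEL_PAGE_SIZE;

		if (idx >= MODEL_NR_PAGES)
			break;                  /* end of the model file */
		f->cached[idx] = true;          /* I/O completed */
		f->readahead[idx] = false;      /* marker consumed */
		done += model_copy_page(f, buf + done, pos + done, len - done);
	}
	return done;
}

/* One "syscall": try the cached pass, then finish via the slow pass. */
static size_t model_read(struct model_file *f, char *buf,
			 size_t pos, size_t len)
{
	size_t done = model_cached_read(f, buf, pos, len);

	if (done < len && pos + done < sizeof(f->data))
		done += model_slow_read(f, buf + done, pos + done, len - done);
	return done;
}
```

With pages 0-2 cached and page 2 marked for readahead, a single
model_read() of three pages makes one cached pass and one slow pass and
still returns the full three pages to the caller. A read that is
satisfied entirely by unmarked cached pages never reaches the slow pass.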