From: Andreas Gruenbacher
Date: Wed, 27 Nov 2019 17:29:36 +0100
Subject: Re: [PATCH] mm/filemap: do not allocate cache pages beyond end of file at read
To: Steven Whitehouse
Cc: Linus Torvalds, Andreas Grünbacher, Konstantin Khlebnikov,
    "Kirill A. Shutemov", Linux-MM, Andrew Morton,
    Linux Kernel Mailing List, linux-fsdevel, Alexander Viro,
    Johannes Weiner, "cluster-devel@redhat.com", Ronnie Sahlberg,
    Steve French, Bob Peterson
Shutemov" , Linux-MM , Andrew Morton , Linux Kernel Mailing List , linux-fsdevel , Alexander Viro , Johannes Weiner , "cluster-devel@redhat.com" , Ronnie Sahlberg , Steve French , Bob Peterson X-MC-Unique: mVhujEwSPeG9LurHXTTbiQ-1 X-Mimecast-Spam-Score: 0 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Nov 27, 2019 at 4:42 PM Steven Whitehouse wro= te: > Hi, > > On 25/11/2019 17:05, Linus Torvalds wrote: > > On Mon, Nov 25, 2019 at 2:53 AM Steven Whitehouse = wrote: > >> Linus, is that roughly what you were thinking of? > > So the concept looks ok, but I don't really like the new flags as they > > seem to be gfs2-specific rather than generic. > > > > That said, I don't _gate_ them either, since they aren't in any > > critical code sequence, and it's not like they do anything really odd. > > > > I still think the whole gfs2 approach is broken. You're magically ok > > with using stale data from the cache just because it's cached, even if > > another client might have truncated the file or something. > > If another node tries to truncate the file, that will require an > exclusive glock, and in turn that means the all the other nodes will > have to drop their glock(s) shared or exclusive. That process > invalidates the page cache on those nodes, such that any further > requests on those nodes will find the cache empty and have to call into > the filesystem. > > If a page is truncated on another node, then when the local node gives > up its glock, after any copying (e.g. for read) has completed then the > truncate will take place. The local node will then have to reread any > data relating to new pages or return an error in case the next page to > be read has vanished due to the truncate. It is a pretty small window, > and the advantage is that in cases where the page is in cache, we can > directly use the cached page without having to call into the filesystem > at all. So it is page atomic in that sense. > > The overall aim here is to avoid taking (potentially slow) cluster locks > when at all possible, yet at the same time deliver close to local fs > semantics whenever we can. You can think of GFS2's glock concept (at > least as far as the inodes we are discussing here) as providing a > combination of (page) cache control and cluster (dlm) locking. > > > > > So you're ok with saying "the file used to be X bytes in size, so > > we'll just give you this data because we trust that the X is correct". > > > > But you're not ok to say "oh, the file used to be X bytes in size, but > > we don't want to give you a short read because it might not be correct > > any more". > > > > See the disconnect? You trust the size in one situation, but not in ano= ther one. > > Well we are not trusting the size at all... the original algorithm > worked entirely off "is this page in cache and uptodate?" and for > exactly the reason that we know the size in the inode might be out of > date, if we are not currently holding a glock in either shared or > exclusive mode. We also know that if there is a page in cache and > uptodate then we must be holding the glock too. > > > > > > I also don't really see that you *need* the new flag at all. 
> > I also don't really see that you *need* the new flag at all. Since
> > you're going to do a speculative read and then a real read anyway,
> > and since the only thing that you seem to care about is the file
> > size (because the *data* you will trust if it is cached), then why
> > don't you just use the *existing* generic read, and *IFF* you get a
> > truncated return value, then you go and double-check that the file
> > hasn't changed in size?
> >
> > See what I'm saying? I think gfs2 is being very inconsistent in when
> > it trusts the file size, and I don't see that you even need the new
> > behavior that patch gives, because you might as well just use the
> > existing code (just move the i_size check earlier, and then teach
> > gfs2 to double-check the "I didn't get as much as I expected" case).

We can identify short reads, but we won't get information about
readahead back from generic_file_read_iter or filemap_fault. We could
try to work around this with filesystem-specific flags for ->readpage
and ->readpages, but that would break down with multiple concurrent
readers, in addition to being a real mess. I'm currently out of better
ideas that avoid duplicating the generic code.

> > Linus
>
> I'll leave the finer details to Andreas here, since it is his patch,
> and hopefully we can figure out a good path forward. We are perhaps
> also a bit reluctant to go off and (nearly) duplicate code that is
> already in the core vfs library functions, since that often leads to
> things getting out of sync (our implementation of ->writepages is one
> case where that happened in the past) and missing important bug
> fixes/features in some cases. Hopefully though we can iterate on this
> a bit and come up with something which will resolve all the issues.
>
> Steve.

Thanks,
Andreas
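PS: for contrast with the sketch above, one way to realize what Linus
is suggesting would be shaped more or less like this (an untested
sketch, error handling simplified):

static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
	struct gfs2_inode *ip = GFS2_I(file_inode(iocb->ki_filp));
	struct gfs2_holder gh;
	ssize_t ret, ret2;

	/* Plain generic read; this may stop short at a stale i_size. */
	ret = generic_file_read_iter(iocb, to);
	if (ret < 0 || !iov_iter_count(to))
		return ret;

	/*
	 * Short read: the locally cached size may be out of date.  Take
	 * the inode glock shared so the size gets revalidated, then
	 * retry the remainder of the read.
	 */
	gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh);
	ret2 = gfs2_glock_nq(&gh);
	if (ret2)
		return ret ?: ret2;
	ret2 = generic_file_read_iter(iocb, to);
	gfs2_glock_dq_uninit(&gh);
	if (ret2 > 0)
		ret += ret2;
	return ret ?: ret2;
}

That deals with the short read itself, but as said above it still
tells us nothing about what readahead did, which is the part I'm
stuck on.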