Date: Wed, 2 Nov 2016 15:01:26 +1100
From: Dave Chinner <david@fromorbit.com>
To: Hugh Dickins
Cc: Jakob Unterwurzacher, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org
Subject: Re: tmpfs returns incorrect data on concurrent pread() and truncate()
Message-ID: <20161102040126.GG9920@dastard>

On Tue, Nov 01, 2016 at 06:38:26PM -0700, Hugh Dickins wrote:
> On Wed, 2 Nov 2016, Dave Chinner wrote:
> > On Tue, Nov 01, 2016 at 04:51:30PM -0700, Hugh Dickins wrote:
> > > On Wed, 26 Oct 2016, Jakob Unterwurzacher wrote:
> > > >
> > > > tmpfs seems to be incorrectly returning 0-bytes when reading from
> > > > a file that is concurrently being truncated.
> > >
> > > That is an interesting observation, and you got me worried;
> > > but in fact, it is not a tmpfs problem: if we call it a
> > > problem at all, it's a VFS problem or a userspace problem.
> > >
> > > You chose a ratio of 3 preads to 1 ftruncate in your program below:
> > > let's call that the Unterwurzacher Ratio, 3 for tmpfs; YMMV, but for
> > > me 4 worked well to show the same issue on ramfs, and 15 on ext4.
> > >
> > > The Linux VFS does not serialize reads against writes or truncation
> > > very strictly:
> >
> > Which is fine, because...
> >
> > > it's unusual to need that serialization, and most
> >
> > .... many filesystems need more robust serialisation, as hole punching
> > (and other fallocate-based extent manipulations) has much stricter
> > serialisation requirements than truncate, and these ....
> >
> > > users prefer maximum speed to the additional locking, or intermediate
> > > buffering, that would be required to avoid the issue you've seen.
> >
> > .... require additional locking to be done at the filesystem level
> > to avoid race conditions.
> >
> > Throw in the fact that we already have to do this serialisation in
> > the filesystem for direct IO, as there are no page locks to serialise
> > direct IO against truncate. And we need to lock out page faults
> > from refaulting while we are doing things like punching holes (to
> > avoid data coherency and corruption bugs), so we need more
> > filesystem-level locks to serialise mmap against fallocate().
> >
> > And DAX has similar issues - there are no struct pages to serialise
> > read or mmap access against truncate, so again we need filesystem
> > level serialisation for this.
> >
> > Put simply: page locks are insufficient as a generic mechanism for
> > serialising filesystem operations. The locking required for this is
> > generally deeply filesystem-implementation specific, so it's fine
> > that the VFS doesn't attempt to provide anything stricter than it
> > currently does....
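The race the thread is chasing is easy to demonstrate from userspace.
A minimal sketch of that kind of reproducer follows; the file path,
buffer size and loop structure are illustrative assumptions, not
Jakob's original program:

/*
 * Sketch of a pread()-vs-ftruncate() race checker: one thread
 * repeatedly truncates the file to zero and rewrites it, while
 * the main thread pread()s and inspects what it got back.  A
 * short read is legitimate (we may look while the file is
 * empty); a byte that is 0 instead of 'x' within the returned
 * length is the stale-zero result being reported.
 * Build with: cc -pthread
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define SZ 4096

static int fd;
static char data[SZ];

static void *truncater(void *unused)
{
	for (;;) {
		if (ftruncate(fd, 0) < 0)
			break;
		if (pwrite(fd, data, SZ, 0) != SZ)
			break;
	}
	return NULL;
}

int main(void)
{
	char buf[SZ];
	pthread_t t;
	ssize_t n, i;

	memset(data, 'x', SZ);
	fd = open("/tmp/pread-vs-truncate", O_RDWR | O_CREAT | O_TRUNC, 0644);
	if (fd < 0 || pwrite(fd, data, SZ, 0) != SZ)
		return 1;
	pthread_create(&t, NULL, truncater, NULL);

	for (;;) {
		n = pread(fd, buf, SZ, 0);
		for (i = 0; i < n; i++) {
			if (buf[i] != 'x') {
				fprintf(stderr, "bad byte 0x%02x at %zd of %zd\n",
					(unsigned char)buf[i], i, n);
				return 1;
			}
		}
	}
}

Varying the ratio of preads to truncates, as Hugh notes above, changes
how quickly the window is hit on different filesystems.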
> I think you are saying that: xfs already provides the extra locking
> that avoids this issue; most other filesystems do not; but more can
> be expected to add that extra locking in the coming months?

Effectively, yes. ext4 already has the extra mmap lock for DAX, and
it's likely to get saner internal IO locking as it is converted to
use iomaps (which is also needed for DAX).

As I've pointed out in a different thread recently, filesystems now
have three IO paths - page cache, direct IO and DAX - and only the
page cache IO path uses struct pages. The other IO paths don't, and
they still have to be serialised against truncate and other
operations. IOWs, if a filesystem wants to support the new incoming
storage technologies, it has to do this IO serialisation internally,
regardless of whatever other optimisations and go-fast paths the
page cache has...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
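The filesystem-level serialisation described above commonly takes the
form of a per-inode rwsem held shared across the IO paths and
exclusive across truncate/fallocate. A schematic sketch of that
pattern; all foo_* names are hypothetical, not code from XFS or any
other real filesystem:

/*
 * Schematic only: a per-inode IO lock serialising reads (page
 * cache, direct IO or DAX alike) against truncate and extent
 * manipulation.  Readers run concurrently with each other;
 * truncate excludes them all.
 */
#include <linux/fs.h>
#include <linux/kernel.h>
#include <linux/rwsem.h>
#include <linux/uio.h>

struct foo_inode {
	struct rw_semaphore	io_lock;	/* IO vs truncate/fallocate */
	struct inode		vfs_inode;
};

static inline struct foo_inode *FOO_I(struct inode *inode)
{
	return container_of(inode, struct foo_inode, vfs_inode);
}

/* hypothetical helpers, stand-ins for the real work */
extern ssize_t foo_do_read(struct kiocb *iocb, struct iov_iter *to);
extern int foo_remove_extents(struct foo_inode *fi, loff_t newsize);

static ssize_t foo_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
	struct foo_inode *fi = FOO_I(file_inode(iocb->ki_filp));
	ssize_t ret;

	down_read(&fi->io_lock);	/* shared: reads don't block reads */
	ret = foo_do_read(iocb, to);	/* buffered, direct or DAX */
	up_read(&fi->io_lock);
	return ret;
}

static int foo_truncate(struct foo_inode *fi, loff_t newsize)
{
	int error;

	down_write(&fi->io_lock);	/* exclusive: no IO in flight */
	/* a real filesystem would also lock out page faults here
	 * before removing pages and extents */
	error = foo_remove_extents(fi, newsize);
	up_write(&fi->io_lock);
	return error;
}

The idea is that a read then runs wholly before or wholly after a
truncate, whichever IO path it takes, rather than relying on page
locks that direct IO and DAX don't have.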