From: Jan Kara
To:
Cc: Matthew Wilcox , , Jan Kara
Subject: [PATCH 0/3 RFC] fs: Hole punch vs page cache filling races
Date: Wed, 20 Jan 2021 17:06:08 +0100
Message-Id: <20210120160611.26853-1-jack@suse.cz>
X-Mailing-List: linux-ext4@vger.kernel.org

Hello,

Amir has reported [1] that ext4 has a potential issue: reads can race with hole punching, possibly exposing stale data from freed blocks or even corrupting the filesystem when stale mapping data gets used for writeout. The problem is that during hole punching, new page cache pages can get instantiated in the punched range after truncate_inode_pages() has run but before the filesystem removes blocks from the file.
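Roughly, the problematic interleaving looks like this (the exact call sites vary per filesystem, this is just an illustration):

	hole punch				read(2) / fault / readahead
	----------				---------------------------
	truncate_inode_pages()
						new page instantiated in the
						punched range, old blocks mapped
	blocks removed from the file
						page still refers to the freed
						blocks - stale data on read,
						stale mapping used for writeout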
In principle, any filesystem implementing hole punching thus needs a mechanism to block instantiation of page cache pages during hole punching to avoid this race. This is further complicated by the fact that there are multiple places that can instantiate pages in the page cache. Regular read(2) or a page fault can do it, but fadvise(2) or madvise(2) can also result in reading in page cache pages through force_page_cache_readahead().

There are a couple of ways to fix this. The first (currently implemented by XFS) is to protect read(2) and *advise(2) calls with i_rwsem so that they are serialized with hole punching. This is easy to do, but as a result all reads are then serialized with writes, so mixed read-write workloads suffer heavily on ext4.

Thus for ext4 I want to use EXT4_I(inode)->i_mmap_sem to serialize reads against hole punching - the same serialization ext4 already uses to close this race for page faults. This is conceptually simple but the lock ordering is troublesome: since EXT4_I(inode)->i_mmap_sem is used in the page fault path, it ranks below mmap_sem. Thus we cannot simply grab EXT4_I(inode)->i_mmap_sem in ext4_file_read_iter(), because generic_file_buffered_read() copies data to userspace, which may require grabbing mmap_sem. Grabbing EXT4_I(inode)->i_mmap_sem in ext4_readpages() / ext4_readpage() is also problematic, because at that point we already have locked pages instantiated in the page cache, so EXT4_I(inode)->i_mmap_sem would effectively rank below the page lock, which is too low in the locking hierarchy.

So for ext4 (and other filesystems with similar locking constraints - F2FS, GFS2, OCFS2, ...) we need another hook in the read path that wraps insertion of pages into the page cache but does not include copying of data into userspace. This patch set implements one possible form of such a hook - we essentially abstract generic_file_buffered_read_get_pages() into a hook. I'm not completely sold on the naming or the API, or even on whether this is the best place for the hook, but I wanted to send something out for further discussion; a rough sketch of how ext4 could use it is appended below the reference.

For example, another workable option for ext4 would be an aops hook for adding a page into the page cache (essentially abstracting add_to_page_cache_lru()). The slight downside is that it would mean per-page acquisition of the lock instead of per-batch-of-pages; also, if we ever transition to range locking of the mapping, per-batch locking would be more efficient. What do people think about this?

								Honza

[1] https://lore.kernel.org/linux-fsdevel/CAOQ4uxjQNmxqmtA_VbYW0Su9rKRk2zobJmahcyeaEVOFKVQ5dw@mail.gmail.com/
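PS: As an illustration of how ext4 could use such a hook, here is a minimal sketch. The hook name, its signature and the exported generic helper are assumptions made for the example only and do not necessarily match what the patches actually introduce:

/*
 * Illustrative sketch only - ->get_pages is a hypothetical aops hook
 * wrapping the generic page-cache-filling helper; the name and the
 * signature are assumptions, not the API proposed in the patches.
 */
static int ext4_get_pages(struct kiocb *iocb, struct iov_iter *iter,
			  struct page **pages, unsigned int *nr_pages)
{
	struct inode *inode = file_inode(iocb->ki_filp);
	int ret;

	/*
	 * i_mmap_sem ranks below mmap_sem, so it can be taken here: the
	 * hook only instantiates pages, it does no copying to userspace
	 * (hence no mmap_sem) and holds no page locks yet.
	 */
	down_read(&EXT4_I(inode)->i_mmap_sem);
	ret = generic_file_buffered_read_get_pages(iocb, iter, pages,
						   nr_pages);
	up_read(&EXT4_I(inode)->i_mmap_sem);
	return ret;
}

Hole punching would then take i_mmap_sem exclusively around truncating the page cache and removing blocks, as it already does against the page fault path.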