Date: Wed, 14 Apr 2021 10:01:13 +1000
From: Dave Chinner <david@fromorbit.com>
To: Jan Kara <jack@suse.cz>
Cc: linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org,
	linux-xfs@vger.kernel.org, Ted Tso, Christoph Hellwig,
	Amir Goldstein
Subject: Re: [PATCH 2/7] mm: Protect operations adding pages to page cache with i_mapping_lock
Message-ID: <20210414000113.GG63242@dread.disaster.area>
References: <20210413105205.3093-1-jack@suse.cz> <20210413112859.32249-2-jack@suse.cz>
In-Reply-To: <20210413112859.32249-2-jack@suse.cz>

On Tue, Apr 13, 2021 at 01:28:46PM +0200, Jan Kara wrote:
> Currently, serializing operations such as page fault, read, or readahead
> against hole punching is rather difficult. The basic race scheme is
> like:
>
> fallocate(FALLOC_FL_PUNCH_HOLE)		read / fault / ..
>   truncate_inode_pages_range()
> 					  <create pages in page
> 					   cache here>
>
>
> Now the problem is in this way read / page fault / readahead can
> instantiate pages in page cache with potentially stale data (if blocks
> get quickly reused). Avoiding this race is not simple - page locks do
> not work because we want to make sure there are *no* pages in given
> range. inode->i_rwsem does not work because page fault happens under
> mmap_sem which ranks below inode->i_rwsem. Also using it for reads makes
> the performance for mixed read-write workloads suffer.
>
> So create a new rw_semaphore in the inode - i_mapping_sem - that
> protects adding of pages to page cache for page faults / reads /
> readahead.
>
> Signed-off-by: Jan Kara <jack@suse.cz>
....
> diff --git a/mm/filemap.c b/mm/filemap.c
> index bd7c50e060a9..bc82a7856d3e 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -77,7 +77,8 @@
>  *    ->i_pages lock
>  *
>  *  ->i_rwsem
> - *    ->i_mmap_rwsem		(truncate->unmap_mapping_range)
> + *    ->i_mapping_sem		(acquired by fs in truncate path)
> + *      ->i_mmap_rwsem		(truncate->unmap_mapping_range)

This is now officially a confusing mess.

We have inode->i_mapping that points to an address space, which has an
internal i_mmap variable and an i_mmap_rwsem that serialises access to
the i_mmap tree.

Now we have an inode->i_mapping_sem (which is actually a rwsem, not a
sem) that sorta serialises additions to the mapping tree
(inode->i_mapping->i_pages) against truncate, but it does not
serialise accesses to the rest of the inode->i_mapping structure
itself despite the similarities in naming.

Then we have the inode_lock() API and the i_mmap_lock() API that wrap
around the i_rwsem and i_mmap_rwsem, but there's no API for
inode_mapping_lock()...

Then we have the mmap_lock in the page fault path as well, which is
also an rwsem despite the name, and that protects something completely
different to the i_mmap and the i_mapping.

IOWs, we have 4 layers of entwined structures and locking here that
pretty much all have the same name but protect different things and
not always the obvious thing the name suggests. This makes it really
difficult to actually read the code and understand that the correct
lock is being used in the correct place...
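For reference, the four layers being described look roughly like this.
This is a simplified sketch of the structures involved, with unrelated
fields trimmed; i_mapping_sem below is the field this patch adds to
struct inode:

/* simplified sketches; most fields elided */
struct address_space {
	struct inode		*host;		/* owning inode */
	struct xarray		i_pages;	/* the cached pages */
	struct rb_root_cached	i_mmap;		/* tree of VMAs mapping the file */
	struct rw_semaphore	i_mmap_rwsem;	/* protects i_mmap */
};

struct inode {
	struct rw_semaphore	i_rwsem;	/* inode_lock()/inode_unlock() */
	struct rw_semaphore	i_mapping_sem;	/* added by this patch */
	struct address_space	*i_mapping;	/* usually &inode->i_data */
};

struct mm_struct {
	struct rw_semaphore	mmap_lock;	/* taken for page faults */
};

Four similarly named locks, spread over three structures, each with a
different protection scope.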
>  *
>  * ->mmap_lock
>  *   ->i_mmap_rwsem
> @@ -85,7 +86,8 @@
>  *    ->i_pages lock	(arch-dependent flush_dcache_mmap_lock)
>  *
>  * ->mmap_lock
> - *   ->lock_page		(access_process_vm)
> + *   ->i_mapping_sem		(filemap_fault)
> + *     ->lock_page		(filemap_fault, access_process_vm)
>  *
>  * ->i_rwsem			(generic_perform_write)
>  *   ->mmap_lock		(fault_in_pages_readable->do_page_fault)
> @@ -2276,16 +2278,28 @@ static int filemap_update_page(struct kiocb *iocb,
>  {
>  	int error;
>  
> +	if (iocb->ki_flags & IOCB_NOWAIT) {
> +		if (!down_read_trylock(&mapping->host->i_mapping_sem))
> +			return -EAGAIN;
> +	} else {
> +		down_read(&mapping->host->i_mapping_sem);
> +	}

We really need a lock primitive for this. The number of times this
exact lock pattern is being replicated all through the IO path is
getting out of hand.

static inline bool down_read_try_or_lock(struct rw_semaphore *sem, bool try)
{
	if (try) {
		if (!down_read_trylock(sem))
			return false;
	} else {
		down_read(sem);
	}
	return true;
}

and the callers become:

	if (!down_read_try_or_lock(sem, (iocb->ki_flags & IOCB_NOWAIT)))
		return -EAGAIN;

We can do the same with mutex_try_or_lock(), down_try_or_lock(), etc
and we don't need to rely on cargo cult knowledge to propagate this
pattern anymore. Because I'm betting relatively few people actually
know why the code is written this way because the only place it is
documented is in an XFS commit message....

Doing this is a separate cleanup, though, and not something that needs
to be done in this patchset.

> index c5b0457415be..ac5bb50b3a4c 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -192,6 +192,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
>  	 */
>  	unsigned int nofs = memalloc_nofs_save();
>  
> +	down_read(&mapping->host->i_mapping_sem);
>  	/*
>  	 * Preallocate as many pages as we will need.
>  	 */

I can't say I'm a great fan of having the mapping reach back up to the
host to lock the host. This seems the wrong way around to me given
that most of the locking in the IO path is in "host locks mapping" and
"mapping locks internal mapping structures" order...

I also come back to the naming confusion here, in that when we look at
this in long hand from the inode perspective, this chain actually
looks like:

	lock(inode->i_mapping->host->i_mapping_sem)

i.e. the mapping is reaching back up outside its scope to lock itself
against other inode->i_mapping operations. Smells of layering
violations to me.

So, next question: should this truncate semaphore actually be part of
the address space, not the inode? This patch is actually moving the
page fault serialisation from the inode into the address space
operations when page faults and page cache operations are done, so
maybe the lock should also make that move? That would help clear up
the naming problem, because now we can name it based around what it
serialises in the address space, not the address space as a whole...
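To make that concrete, here is one way such a move could look. This is
a sketch only; the invalidate_lock field name and the filemap_*
helpers below are hypothetical names picked for what the lock
serialises, not anything proposed in this patchset:

/*
 * Sketch: put the rwsem in the address_space and name it (and its
 * lock API) after what it serialises - page cache instantiation
 * against truncate/invalidation - rather than after the structure
 * it happens to live in.
 */
struct address_space {
	/* ... existing fields ... */
	struct rw_semaphore	invalidate_lock;
};

static inline void filemap_invalidate_lock_shared(struct address_space *mapping)
{
	down_read(&mapping->invalidate_lock);
}

static inline void filemap_invalidate_unlock_shared(struct address_space *mapping)
{
	up_read(&mapping->invalidate_lock);
}

Readahead would then lock the mapping it is already operating on
instead of reaching back up through ->host:

	down_read(&mapping->host->i_mapping_sem);

becomes

	filemap_invalidate_lock_shared(ractl->mapping);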
Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com