Date: Tue, 1 Dec 2020 13:31:21 -0800 (PST)
From: Hugh Dickins
To: Peter Xu
Cc: Andrea Arcangeli, Matthew Wilcox, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org, Andrew Morton, Hugh Dickins, Mike Rapoport
Subject: Re: [PATCH] mm: Don't fault around userfaultfd-registered regions on reads
In-Reply-To: <20201128152903.GK6573@xz-x1>
References: <20201126222359.8120-1-peterx@redhat.com> <20201127122224.GX4327@casper.infradead.org> <20201128152903.GK6573@xz-x1>

On Sat, 28 Nov 2020, Peter Xu wrote:
> On Fri, Nov 27, 2020 at 07:33:39PM -0500, Andrea Arcangeli wrote:
> >
> > Now it would be ok if filemap_map_pages re-filled the ptes, after they
> > were zapped the first time by fallocate, and then the fallocate would
> > zap them again before truncating the page, but I don't see a second
> > pte zap... there's just the below single call of unmap_mapping_range:
> >
> > 		if ((u64)unmap_end > (u64)unmap_start)
> > 			unmap_mapping_range(mapping, unmap_start,
> > 					    1 + unmap_end - unmap_start, 0);
> > 		shmem_truncate_range(inode, offset, offset + len - 1);
>
> IMHO the 2nd pte zap is inside shmem_truncate_range(), where we will call
> truncate_inode_page() then onto truncate_cleanup_page().

Correct.
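For readers following along, the punch-hole side of shmem_fallocate() being
discussed is roughly the following - paraphrased from the code quoted in this
thread and trimmed to just the two pte zaps, so a sketch rather than an exact
copy of the upstream function:

	if (mode & FALLOC_FL_PUNCH_HOLE) {
		struct address_space *mapping = file->f_mapping;
		loff_t unmap_start = round_up(offset, PAGE_SIZE);
		loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1;

		/* ... shmem_falloc waitq / inode->i_private setup ... */

		/* First pte zap: one page-table walk across the whole hole */
		if ((u64)unmap_end > (u64)unmap_start)
			unmap_mapping_range(mapping, unmap_start,
					    1 + unmap_end - unmap_start, 0);

		/*
		 * Second pte zap: shmem_truncate_range() ->
		 * truncate_inode_page() -> truncate_cleanup_page(), which
		 * unmaps any page it still finds mapped, this time with
		 * the page lock held.
		 */
		shmem_truncate_range(inode, offset, offset + len - 1);
	}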
>
> Since we're at it, some more context: this is actually where I started to
> notice the issue, that we'll try to pre-unmap the whole region first before
> shmem_truncate_range().  The thing is shmem_truncate_range() will zap the
> ptes too, in an even safer way (with page lock held).  So before I came up
> with the current patch, I also tried the patch below, and it also fixes the
> data corruption issue:
>
> -----8<-----
> diff --git a/mm/shmem.c b/mm/shmem.c
> index efa19e33e470..b275f401fe1f 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2777,7 +2777,6 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
>  	inode_lock(inode);
>  
>  	if (mode & FALLOC_FL_PUNCH_HOLE) {
> -		struct address_space *mapping = file->f_mapping;
>  		loff_t unmap_start = round_up(offset, PAGE_SIZE);
>  		loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1;
>  		DECLARE_WAIT_QUEUE_HEAD_ONSTACK(shmem_falloc_waitq);
> @@ -2795,9 +2794,6 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
>  		inode->i_private = &shmem_falloc;
>  		spin_unlock(&inode->i_lock);
>  
> -		if ((u64)unmap_end > (u64)unmap_start)
> -			unmap_mapping_range(mapping, unmap_start,
> -					    1 + unmap_end - unmap_start, 0);
>  		shmem_truncate_range(inode, offset, offset + len - 1);
>  		/* No need to unmap again: hole-punching leaves COWed pages */
> -----8<-----
>
> Above code existed starting from the 1st day of shmem fallocate code, so I
> was wondering why we did that.  One reason could be for performance, since
> this pre-unmap of explicit unmap_mapping_range() will try to walk the page
> tables once rather than walking for every single page, so when the hole is
> huge it could benefit us since truncate_cleanup_page() is able to avoid
> those per-page walks then because page_mapped() could be more likely to be
> zero:
>
> 	if (page_mapped(page)) {
> 		pgoff_t nr = PageTransHuge(page) ? HPAGE_PMD_NR : 1;
> 		unmap_mapping_pages(mapping, page->index, nr, false);
> 	}
>
> It would be good if Hugh can help confirm whether my understanding is
> correct, though.

Correct.  Code duplicated from mm/truncate.c, but with more comments over
there, in truncate_pagecache() and in truncate_pagecache_range().

>
> As a summary, that's why I didn't try to remove the optimization (which
> fixes the issue too), but instead tried to disable fault-around for uffd,
> which sounds a better thing to do.
>
> >
> > So looking at filemap_map_pages in shmem, I'm really wondering if the
> > non-uffd case is correct or not.
> >
> > Do we end up with ptes pointing to non pagecache, so then the virtual
> > mapping is out of sync also with read/write syscalls that will then
> > allocate another shmem page out of sync of the ptes?

No, as you point out, the second pte zap, under page lock, keeps it safe.

> >
> > If a real page fault happened instead of filemap_map_pages, the
> > shmem_fault() would block during fallocate punch hole by checking
> > inode->i_private, as the comment says:
> >
> > 		 * So refrain from
> > 		 * faulting pages into the hole while it's being punched.
> >
> > It's not immediately clear where filemap_map_pages refrains from
> > faulting pages into the hole while it's being punched... given it's
> > ignoring inode->i_private.

Please don't ever rely on that i_private business for correctness: as more
of the comment around "So refrain from" explains, it was added to avoid a
livelock with the trinity fuzzer, and I'd gladly delete it all, if I could
ever be confident that enough has changed in the intervening years that it's
no longer necessary.

It tends to prevent shmem faults racing hole punches in the same area
(without quite guaranteeing no race at all).  But without the livelock
issue, I'd just have gone on letting them race.  "Punch hole" == "Lose data"
- though I can imagine that UFFD would like more control over that.

Since map_pages is driven by faulting, it should already be throttled too.
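For reference, the alternative Peter describes above - disabling fault-around
for userfaultfd-registered regions rather than dropping the pre-unmap - would
take roughly the following shape in do_read_fault().  This is only a sketch of
the idea, not the patch as actually posted, and it assumes the existing
userfaultfd_armed() helper is the appropriate test:

	static vm_fault_t do_read_fault(struct vm_fault *vmf)
	{
		struct vm_area_struct *vma = vmf->vma;
		vm_fault_t ret = 0;

		/*
		 * Fault-around (->map_pages) fills neighbouring ptes without
		 * going through handle_userfault(), so a userfaultfd monitor
		 * would never see missing-page faults for those offsets.
		 * Sketch: skip it when the vma is registered with uffd.
		 */
		if (vma->vm_ops->map_pages &&
		    fault_around_bytes >> PAGE_SHIFT > 1 &&
		    !userfaultfd_armed(vma)) {
			ret = do_fault_around(vmf);
			if (ret)
				return ret;
		}

		/* ... rest of do_read_fault() unchanged: __do_fault() etc. ... */
	}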
But Andrea, in his next mail, goes on to see other issues with UFFD_WP in
relation to shmem swap, so this thread is probably dead now.

If there's a bit to spare for UFFD_WP in the anonymous swap entry, then that
bit should be usable for shmem too: but a shmem page (and its swap entry) can
be shared between many different users, so I don't know whether that will
make sense.

Hugh