Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18;
Date:   Thu, 7 Jan 2021 17:56:49 -0500
From:   Andrea Arcangeli <aarcange@redhat.com>
To:     Linus Torvalds <torvalds@linux-foundation.org>
Cc:     Jason Gunthorpe <jgg@ziepe.ca>, Linux-MM <linux-mm@kvack.org>,
        Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
        Yu Zhao <yuzhao@google.com>, Andy Lutomirski <luto@kernel.org>,
        Peter Xu <peterx@redhat.com>,
        Pavel Emelyanov <xemul@openvz.org>,
        Mike Kravetz <mike.kravetz@oracle.com>,
        Mike Rapoport <rppt@linux.vnet.ibm.com>,
        Minchan Kim <minchan@kernel.org>,
        Will Deacon <will@kernel.org>,
        Peter Zijlstra <peterz@infradead.org>,
        Hugh Dickins <hughd@google.com>,
        "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
        Matthew Wilcox <willy@infradead.org>,
        Oleg Nesterov <oleg@redhat.com>, Jann Horn <jannh@google.com>,
        Kees Cook <keescook@chromium.org>,
        John Hubbard <jhubbard@nvidia.com>,
        Leon Romanovsky <leonro@nvidia.com>, Jan Kara <jack@suse.cz>,
        Kirill Tkhai <ktkhai@virtuozzo.com>
Subject: Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy
Message-ID: <X/eRsbAU7bt9nvFD@redhat.com>
References: <B1B85771-B211-4FCC-AEEF-BDFD37332C25@vmware.com>
 <20210107200402.31095-1-aarcange@redhat.com>
 <20210107202525.GD504133@ziepe.ca>
 <CAHk-=wjTuS9JB=Ms4WAMaOkGuLmvYwaf2W0JhXxNPdcv4NWZUA@mail.gmail.com>
 <CAHk-=wjDkyom4haQu6OU_yykkCFqMi98qO2gUPgZBF-11krRAA@mail.gmail.com>
 <X/eFCt/lxRKgoPXu@redhat.com>
 <CAHk-=wjhpTa616YdTfincMV1LqqRWqokKtd1ARdMOPd1Wra8Pg@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CAHk-=wjhpTa616YdTfincMV1LqqRWqokKtd1ARdMOPd1Wra8Pg@mail.gmail.com>
User-Agent: Mutt/2.0.4 (2020-12-30)
Precedence: bulk

On Thu, Jan 07, 2021 at 02:17:50PM -0800, Linus Torvalds wrote:
> So I think we can agree that even that softdirty case we can just
> handle by "don't do that then".

Absolutely. The question is if somebody was happily running clear_refs
with a RDMA attached to the process, by the time they update and
reboot they'll find it the hard way with silent mm corruption
currently.

So I was obliged to report this issue and the fact there was very
strong reason why page_count was not used there and it's even
documented explicitly in the source:

 * [..] however we only use
 * page_trans_huge_mapcount() in the copy-on-write faults where we
 * need full accuracy to avoid breaking page pinning, [..]

I didn't entirely forget the comment when I reiterated it in fact also
in Message-ID: <20200527212005.GC31990@redhat.com> on May 27 2020
since I recalled there was a very explicit reason why we had to use
page_mapcount in do_wp_page and deliver full accuracy.

Now I cannot proof there's any such user that will break, but we'll
find those with a 1 year or more of delay.

Even the tlb flushing deferral that caused clear_refs_write to also
corrupt userland memory and literally lose userland writes even
without any special secondary MMU hardware being attached to the
memory, took 6 months to find.

> if a page is pinned, the dirty bit of it makes no sense, because it
> might be dirtied complately asynchronously by the pinner.
>
> So I think none of the softdirty stuff should touch pinned pages. I
> think it falls solidly under that "don't do it then".
> 
> Just skipping over them in clear_soft_dirty[_pmd]() doesn't look that
> hard, does it?

1) How do you know again if it's not speculative lookup or an O_DIRECT
   that made them look pinned?

2) it's a hugepage 1, a 4k pin will make soft dirty then skip 511
   dirty bits by mistake including those pages that weren't pinned and
   that userland is still writing through the transhuge pmd.

   Until v4.x we had a page pin counter for each subpage so the above
   wouldn't have happened, but not anymore since it was simplified and
   improved but now the page_count is pretty inefficient there, even
   if it'd be possible to make safe.

3) the GUP(write=0) may be just reading from RAM and sending to the
   other end so clear_refs may have currently very much tracked CPU
   writes fine, with no interference whatsoever from the secondary MMU
   working in readonly on the memory.

Thanks,
Andrea