Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67;
Date:   Mon, 11 Feb 2019 14:09:31 -0500
From:   Jerome Glisse <jglisse@redhat.com>
To:     Andrea Arcangeli <aarcange@redhat.com>
Cc:     linux-mm@kvack.org, linux-kernel@vger.kernel.org,
        Peter Xu <peterx@redhat.com>,
        Peter Zijlstra <peterz@infradead.org>,
        Ingo Molnar <mingo@redhat.com>,
        Arnaldo Carvalho de Melo <acme@kernel.org>,
        Alexander Shishkin <alexander.shishkin@linux.intel.com>,
        Jiri Olsa <jolsa@redhat.com>,
        Namhyung Kim <namhyung@kernel.org>,
        Andrew Morton <akpm@linux-foundation.org>,
        Matthew Wilcox <mawilcox@microsoft.com>,
        Paolo Bonzini <pbonzini@redhat.com>,
        Radim =?utf-8?B?S3LEjW3DocWZ?= <rkrcmar@redhat.com>,
        Michal Hocko <mhocko@kernel.org>, kvm@vger.kernel.org
Subject: Re: [RFC PATCH 0/4] Restore change_pte optimization to its former
 glory
Message-ID: <20190211190931.GA3908@redhat.com>
References: <20190131183706.20980-1-jglisse@redhat.com>
 <20190201235738.GA12463@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <20190201235738.GA12463@redhat.com>
User-Agent: Mutt/1.10.1 (2018-07-13)
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk

On Fri, Feb 01, 2019 at 06:57:38PM -0500, Andrea Arcangeli wrote:
> Hello everyone,
> 
> On Thu, Jan 31, 2019 at 01:37:02PM -0500, Jerome Glisse wrote:
> > From: J?r?me Glisse <jglisse@redhat.com>
> > 
> > This patchset is on top of my patchset to add context information to
> > mmu notifier [1] you can find a branch with everything [2]. I have not
> > tested it but i wanted to get the discussion started. I believe it is
> > correct but i am not sure what kind of kvm test i can run to exercise
> > this.
> > 
> > The idea is that since kvm will invalidate the secondary MMUs within
> > invalidate_range callback then the change_pte() optimization is lost.
> > With this patchset everytime core mm is using set_pte_at_notify() and
> > thus change_pte() get calls then we can ignore the invalidate_range
> > callback altogether and only rely on change_pte callback.
> > 
> > Note that this is only valid when either going from a read and write
> > pte to a read only pte with same pfn, or from a read only pte to a
> > read and write pte with different pfn. The other side of the story
> > is that the primary mmu pte is clear with ptep_clear_flush_notify
> > before the call to change_pte.
> 
> If it's cleared with ptep_clear_flush_notify, change_pte still won't
> work. The above text needs updating with
> "ptep_clear_flush". set_pte_at_notify is all about having
> ptep_clear_flush only before it or it's the same as having a range
> invalidate preceding it.
> 
> With regard to the code, wp_page_copy() needs
> s/ptep_clear_flush_notify/ptep_clear_flush/ before set_pte_at_notify.
> 
> change_pte relies on the ptep_clear_flush preceding the
> set_pte_at_notify that will make sure if the secondary MMU mapping
> randomly disappears between ptep_clear_flush and set_pte_at_notify,
> gup_fast will wait and block on the PT lock until after
> set_pte_at_notify is completed before trying to re-establish a
> secondary MMU mapping.
> 
> So then we've only to worry about what happens because we left the
> secondary MMU mapping potentially intact despite we flushed the
> primary MMU mapping with ptep_clear_flush (as opposed to
> ptep_clear_flush_notify which would teardown the secondary MMU mapping
> too).

So all the above is moot since as you pointed out in the other email
ptep_clear_flush_notify does not invalidate kvm secondary mmu hence.


> 
> In you wording above at least the "with a different pfn" is
> superflous. I think it's ok if the protection changes from read-only
> to read-write and the pfn remains the same. Like when we takeover a
> page because it's not shared anymore (fork child quit).
> 
> It's also ok to change pfn if the mapping is read-only and remains
> read-only, this is what KSM does in replace_page.

Yes i thought this was obvious i will reword and probably just do a
list of every case that is fine.

> 
> The read-write to read-only case must not change pfn to avoid losing
> coherency from the secondary MMU point of view. This isn't so much
> about change_pte itself, but the fact that the page-copy generally
> happens well before the pte mangling starts. This case never presents
> itself in the code because KSM is first write protecting the page and
> only later merging it, regardless of change_pte or not.
> 
> The important thing is that the secondary MMU must be updated first
> (unlike the invalidates) to be sure the secondary MMU already points
> to the new page when the pfn changes and the protection changes from
> read-only to read-write (COW fault). The primary MMU cannot read/write
> to the page anyway while we update the secondary MMU because we did
> ptep_clear_flush() before calling set_pte_at_notify(). So this
> ordering of "ptep_clear_flush; change_pte; set_pte_at" ensures
> whenever the CPU can access the memory, the access is synchronous
> with the secondary MMUs because they've all been updated already.
> 
> If (in set_pte_at_notify) we were to call change_pte() after
> set_pte_at() what would happen is that the CPU could write to the page
> through a TLB fill without page fault while the secondary MMUs still
> read the old memory in the old readonly page.

Yeah, between do you have any good workload for me to test this ? I
was thinking of running few same VM and having KSM work on them. Is
there some way to trigger KVM to fork ? As the other case is breaking
COW after fork.

Cheers,
J?r?me