Date: Mon, 11 Feb 2019 14:09:31 -0500
From: Jerome Glisse
To: Andrea Arcangeli
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Peter Xu,
    Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
    Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andrew Morton,
    Matthew Wilcox, Paolo Bonzini, Radim Krčmář, Michal Hocko,
    kvm@vger.kernel.org
Subject: Re: [RFC PATCH 0/4] Restore change_pte optimization to its former glory
Message-ID: <20190211190931.GA3908@redhat.com>
References: <20190131183706.20980-1-jglisse@redhat.com> <20190201235738.GA12463@redhat.com>
In-Reply-To: <20190201235738.GA12463@redhat.com>

On Fri, Feb 01, 2019 at 06:57:38PM -0500, Andrea Arcangeli wrote:
> Hello everyone,
>
> On Thu, Jan 31, 2019 at 01:37:02PM -0500, Jerome Glisse wrote:
> > From: Jérôme Glisse
> >
> > This patchset is on top of my patchset to add context information to
> > mmu notifier [1] you can find a branch with everything [2]. I have not
> > tested it but i wanted to get the discussion started. I believe it is
> > correct but i am not sure what kind of kvm test i can run to exercise
> > this.
> >
> > The idea is that since kvm will invalidate the secondary MMUs within
> > invalidate_range callback then the change_pte() optimization is lost.
> > With this patchset everytime core mm is using set_pte_at_notify() and
> > thus change_pte() get calls then we can ignore the invalidate_range
> > callback altogether and only rely on change_pte callback.
> >
> > Note that this is only valid when either going from a read and write
> > pte to a read only pte with same pfn, or from a read only pte to a
> > read and write pte with different pfn. The other side of the story
> > is that the primary mmu pte is clear with ptep_clear_flush_notify
> > before the call to change_pte.
>
> If it's cleared with ptep_clear_flush_notify, change_pte still won't
> work. The above text needs updating with
> "ptep_clear_flush". set_pte_at_notify is all about having
> ptep_clear_flush only before it or it's the same as having a range
> invalidate preceding it.
>
> With regard to the code, wp_page_copy() needs
> s/ptep_clear_flush_notify/ptep_clear_flush/ before set_pte_at_notify.
>
> change_pte relies on the ptep_clear_flush preceding the
> set_pte_at_notify that will make sure if the secondary MMU mapping
> randomly disappears between ptep_clear_flush and set_pte_at_notify,
> gup_fast will wait and block on the PT lock until after
> set_pte_at_notify is completed before trying to re-establish a
> secondary MMU mapping.
>
> So then we've only to worry about what happens because we left the
> secondary MMU mapping potentially intact despite we flushed the
> primary MMU mapping with ptep_clear_flush (as opposed to
> ptep_clear_flush_notify which would teardown the secondary MMU mapping
> too).

So all of the above is moot since, as you pointed out in the other
email, ptep_clear_flush_notify does not invalidate the kvm secondary
mmu.

> In your wording above at least the "with a different pfn" is
> superfluous. I think it's ok if the protection changes from read-only
> to read-write and the pfn remains the same. Like when we takeover a
> page because it's not shared anymore (fork child quit).
>
> It's also ok to change pfn if the mapping is read-only and remains
> read-only, this is what KSM does in replace_page.

Yes, I thought this was obvious. I will reword it and probably just
list every case that is fine.

> The read-write to read-only case must not change pfn to avoid losing
> coherency from the secondary MMU point of view. This isn't so much
> about change_pte itself, but the fact that the page-copy generally
> happens well before the pte mangling starts. This case never presents
> itself in the code because KSM is first write protecting the page and
> only later merging it, regardless of change_pte or not.
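
The cases discussed so far can be written down as a small predicate. This is only a user-space sketch for illustration, not kernel code: struct pte_model, change_pte_transition_ok() and example_cases() are made-up names, and the read-write to read-write case is not covered in the thread, so the conservative handling of it below is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a pte, for illustration only (not a kernel type). */
struct pte_model {
	unsigned long pfn;	/* page frame the pte points to */
	bool writable;		/* true: read-write, false: read-only */
};

/*
 * Returns true when the old -> new transition is one of the cases the
 * thread describes as fine to handle with change_pte() alone.
 */
static bool change_pte_transition_ok(struct pte_model old_pte,
				     struct pte_model new_pte)
{
	bool same_pfn = old_pte.pfn == new_pte.pfn;

	/*
	 * Read-only source: read-only -> read-write is fine with the same
	 * pfn (page takeover) or a different one (COW fault), and
	 * read-only -> read-only may change pfn (KSM replace_page).
	 */
	if (!old_pte.writable)
		return true;

	/* Read-write -> read-only must keep the same pfn. */
	if (!new_pte.writable)
		return same_pfn;

	/*
	 * Read-write -> read-write is not discussed in the thread.
	 * ASSUMPTION: accept it only when the pfn does not change.
	 */
	return same_pfn;
}

/* Usage example: the cases named in the thread. */
static void example_cases(void)
{
	struct pte_model ro_a = { .pfn = 1, .writable = false };
	struct pte_model rw_a = { .pfn = 1, .writable = true };
	struct pte_model ro_b = { .pfn = 2, .writable = false };
	struct pte_model rw_b = { .pfn = 2, .writable = true };

	assert(change_pte_transition_ok(rw_a, ro_a));	/* write protect */
	assert(change_pte_transition_ok(ro_a, rw_b));	/* COW fault */
	assert(change_pte_transition_ok(ro_a, rw_a));	/* page takeover */
	assert(change_pte_transition_ok(ro_a, ro_b));	/* KSM replace_page */
	assert(!change_pte_transition_ok(rw_a, ro_b));	/* loses coherency */
}
```
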
>
> The important thing is that the secondary MMU must be updated first
> (unlike the invalidates) to be sure the secondary MMU already points
> to the new page when the pfn changes and the protection changes from
> read-only to read-write (COW fault). The primary MMU cannot read/write
> to the page anyway while we update the secondary MMU because we did
> ptep_clear_flush() before calling set_pte_at_notify(). So this
> ordering of "ptep_clear_flush; change_pte; set_pte_at" ensures
> whenever the CPU can access the memory, the access is synchronous
> with the secondary MMUs because they've all been updated already.
>
> If (in set_pte_at_notify) we were to call change_pte() after
> set_pte_at() what would happen is that the CPU could write to the page
> through a TLB fill without page fault while the secondary MMUs still
> read the old memory in the old readonly page.

Yeah. By the way, do you have any good workload for me to test this?
I was thinking of running a few identical VMs and having KSM work on
them. Is there some way to trigger KVM to fork? The other case is
breaking COW after fork.

Cheers,
Jérôme
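
The ordering argument quoted above ("ptep_clear_flush; change_pte; set_pte_at") can be checked with a toy state machine. Again this is a user-space sketch under stated assumptions, not the kernel implementation: struct mmu_model and every function name below are hypothetical, and "coherent" is simplified to mean that whenever the primary pte is present, the secondary MMU points to the same pfn.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one address seen by both MMUs. */
struct mmu_model {
	bool primary_present;		/* false after the pte is cleared */
	unsigned long primary_pfn;	/* pfn mapped by the primary MMU */
	unsigned long secondary_pfn;	/* pfn mapped by the secondary MMU */
};

/* The CPU can only reach memory the secondary MMU also points to. */
static bool coherent(const struct mmu_model *m)
{
	return !m->primary_present || m->primary_pfn == m->secondary_pfn;
}

/* Order from the thread: clear primary, update secondary, set primary. */
static bool update_secondary_first(struct mmu_model *m, unsigned long new_pfn)
{
	bool ok = true;

	m->primary_present = false;	/* models ptep_clear_flush */
	ok = ok && coherent(m);
	m->secondary_pfn = new_pfn;	/* models change_pte */
	ok = ok && coherent(m);
	m->primary_pfn = new_pfn;	/* models set_pte_at */
	m->primary_present = true;
	ok = ok && coherent(m);
	return ok;
}

/* Wrong order: the primary pte is set before the secondary is updated. */
static bool update_secondary_last(struct mmu_model *m, unsigned long new_pfn)
{
	bool ok = true;

	m->primary_present = false;	/* models ptep_clear_flush */
	ok = ok && coherent(m);
	m->primary_pfn = new_pfn;	/* models set_pte_at */
	m->primary_present = true;
	ok = ok && coherent(m);		/* secondary still on the old page */
	m->secondary_pfn = new_pfn;	/* models change_pte, too late */
	ok = ok && coherent(m);
	return ok;
}

/* Usage example: a COW fault moving the mapping from pfn 10 to pfn 20. */
static void ordering_example(void)
{
	struct mmu_model good = { true, 10, 10 };
	struct mmu_model bad = { true, 10, 10 };

	assert(update_secondary_first(&good, 20));
	assert(!update_secondary_last(&bad, 20));
}
```

In the second function there is a window in which the primary MMU already maps the new page while the secondary MMU still maps the old one, which is the situation the last quoted paragraph describes.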