From: Andy Lutomirski
Date: Tue, 10 Sep 2019 21:18:22 -0700
Subject: Re: [RFC PATCH 1/2] x86: Don't let pgprot_modify() change the page encryption bit
To: Thomas Hellström (VMware)
Cc: Christoph Hellwig, Dave Hansen, LKML, X86 ML, pv-drivers@vmware.com,
    Thomas Hellstrom, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
    Thomas Gleixner, Ingo Molnar, Borislav Petkov, "H. Peter Anvin",
    Christian König, Marek Szyprowski, Tom Lendacky

On Tue, Sep 10, 2019 at 12:26 PM Thomas Hellström (VMware) wrote:
>
> On 9/10/19 6:11 PM, Andy Lutomirski wrote:
> >
> >> On Sep 5, 2019, at 8:24 AM, Christoph Hellwig wrote:
> >>
> >>> On Thu, Sep 05, 2019 at 05:21:24PM +0200, Thomas Hellström (VMware) wrote:
> >>>> On 9/5/19 4:15 PM, Dave Hansen wrote:
> >>>> Hi Thomas,
> >>>>
> >>>> Thanks for the second batch of patches! These look much improved on all
> >>>> fronts.
> >>> Yes, although the TTM functionality isn't in yet. Hopefully we won't have to
> >>> bother you with those though, since this assumes TTM will be using the dma
> >>> API.
> >> Please take a look at dma_mmap_prepare and dma_mmap_fault in this
> >> branch:
> >>
> >>   http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-mmap-improvements
> >>
> >> they should allow faulting dma api pages in the page fault handler. But
> >> this is totally hot off the press and not actually tested for the last
> >> few patches. Note that I've also included your two patches from this
> >> series to handle SEV.
> > I read that patch, and it seems like you've built in the assumption
> > that all pages in the mapping use identical protection or, if not,
> > that the same fake vma hack that TTM already has is used to fudge
> > around it. Could it be reworked slightly to avoid this?
> >
> > I wonder if it's a mistake to put the encryption bits in vm_page_prot
> > at all.
>
> From my POV, the encryption bits behave quite similarly to the caching
> mode bits, and they're also in vm_page_prot. They're the reason TTM
> needs to modify the page protection in the fault handler in the first
> place.
>
> The problem seen in TTM is that we want to be able to change the
> vm_page_prot from the fault handler, but it's problematic since we have
> the mmap_sem typically only in read mode. Hence the fake vma hack. From
> what I can tell it's reasonably well-behaved, since pte_modify() skips
> the bits TTM updates, so mprotect() and mremap() work OK. I think
> split_huge_pmd may run into trouble, but we don't support it (yet) with
> TTM.

One thing I'm confused about: does TTM move individual pages between
main memory and device memory, or does it move whole mappings? If it
moves individual pages, then a single mapping could have PTEs from
dma_alloc_coherent() space and from PCI space. How can this work with
vm_page_prot? I guess you might get lucky and have both have the same
protection bits, but that seems like an unfortunate thing to rely on.

As a for-real example, take a look at arch/x86/entry/vdso/vma.c. The
"vvar" VMA contains multiple pages that are backed by different types
of memory. There's a page of ordinary kernel memory. Historically
there was a page of genuine MMIO memory, but that's gone now. If the
system is a SEV guest with pvclock enabled, then there's a page of
decrypted memory. They all share a VMA, they're instantiated in
.fault, and there is no hackery involved. Look at vvar_fault().

So, Christoph, can't you restructure your code a bit to compute the
caching and encryption state per page instead of once in
dma_mmap_prepare() and insert the pages accordingly? You might need to
revert 6d958546ff611c9ae09b181e628c1c5d5da5ebda depending on exactly
how you structure it.
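Concretely, something like the untested sketch below is what I have in
mind. All the my_*() names are invented for illustration (no such
driver exists); the real helpers vmf_insert_pfn_prot(),
pgprot_decrypted() and pgprot_writecombine() are what vvar_fault() and
friends already use:

#include <linux/mm.h>
#include <linux/mem_encrypt.h>

struct my_buffer;	/* hypothetical driver object */
static unsigned long my_buffer_pfn(struct my_buffer *buf, pgoff_t pgoff);
static bool my_page_is_io(struct my_buffer *buf, pgoff_t pgoff);
static bool my_page_is_decrypted(struct my_buffer *buf, pgoff_t pgoff);

/*
 * Protection bits are derived per page at fault time instead of being
 * baked into vma->vm_page_prot once at mmap time.
 */
static vm_fault_t my_dma_fault(struct vm_fault *vmf)
{
	struct my_buffer *buf = vmf->vma->vm_private_data;
	pgoff_t pgoff = vmf->pgoff;
	pgprot_t prot = vmf->vma->vm_page_prot;
	unsigned long pfn = my_buffer_pfn(buf, pgoff);

	if (my_page_is_io(buf, pgoff))
		prot = pgprot_writecombine(prot);	/* PCI/VRAM pages */
	if (my_page_is_decrypted(buf, pgoff))
		prot = pgprot_decrypted(prot);		/* SEV shared pages */

	/* Instantiate this one PTE with the per-page protection. */
	return vmf_insert_pfn_prot(vmf->vma, vmf->address, pfn, prot);
}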
> We could probably get away with a WRITE_ONCE() update of the
> vm_page_prot before taking the page table lock since
>
> a) We're locking out all other writers.
> b) We can't race with another fault to the same vma since we hold an
> address space lock ("buffer object reservation").
> c) When we need to update there are no valid page table entries in the
> vma, since it only happens directly after mmap(), or after an
> unmap_mapping_range() with the same address space lock. When another
> reader (for example split_huge_pmd()) sees a valid page table entry, it
> also sees the new page protection and things are fine.
>
> But that would really be a special case. To solve this properly we'd
> probably need an additional lock to protect the vm_flags and
> vm_page_prot, taken after mmap_sem and i_mmap_lock.

This is all horrible IMO.
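That said, to make the mechanics concrete for anyone following along:
the reason pte_modify() can skip TTM's caching bits at all is the
_PAGE_CHG_MASK arithmetic, and if I read the patch in $SUBJECT right,
it extends the same preservation to the encryption bit. Roughly (a
simplified sketch, not the literal hunks from the patch):

#define _PAGE_CHG_MASK	(PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT |		\
			 _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY |	\
			 _PAGE_SOFT_DIRTY | _PAGE_DEVMAP | _PAGE_ENC)

/*
 * Bits in _PAGE_CHG_MASK are taken from the old protection and masked
 * out of the new one, so mprotect()/mremap() cannot flip them.  The
 * caching bits (_PAGE_PCD, _PAGE_PWT) were already in the mask; the
 * RFC adds the encryption bit (_PAGE_ENC) alongside them.
 */
static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
{
	pgprotval_t preservebits = pgprot_val(oldprot) & _PAGE_CHG_MASK;
	pgprotval_t addbits = pgprot_val(newprot) & ~_PAGE_CHG_MASK;

	return __pgprot(preservebits | addbits);
}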