2018-08-20 21:28:30

by Konrad Rzeszutek Wilk

Subject: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

Hi!

See eXclusive Page Frame Ownership (https://lwn.net/Articles/700606/), which was posted
way back in 2016.

In the last couple of months there has been a slew of CPU issues that have complicated
a lot of things. The latest, L1TF, is still fresh in everyone's mind, and it is
especially acute for virtualization workloads.

As such, a bunch of folks from various cloud companies (CCed) are looking
at ways to make the Linux kernel more resistant to hardware having these
sorts of bugs.

In particular we are looking at a way to "remove as many mappings from the global
kernel address space as possible. Specifically, while being in the
context of process A, memory of process B should not be visible in the
kernel." (email from Julian Stecklina). That is the high-level view; as for
how this could get done, well, that is why I am posting this to
LKML/linux-hardening/kvm-devel/linux-mm to start the discussion.

Usually I would start with a draft of RFC patches so folks can rip them apart, but
thanks to other people (Juerg, thank you!) one already exists:

(see https://www.mail-archive.com/[email protected]/msg1222756.html)

The idea would be to extend this to:

1) Only do it for processes that run on CPUs which are in the isolcpus= list (a
rough sketch of this idea follows below).

2) Expand this to per-CPU page tables. That is, each CPU has its own unique
set of page tables; naturally _START_KERNEL -> __end would be mapped but the
rest would not.
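
To make the first idea concrete, here is a minimal sketch, assuming a
cpumask populated at boot from the isolcpus= parameter; the mask and the
helper below are made-up names for illustration, not existing kernel
symbols:

/*
 * Hypothetical sketch: apply the XPFO treatment only to tasks whose
 * CPU affinity is a subset of the CPUs passed via isolcpus=.
 * xpfo_isolated_mask is an assumed mask, filled in at boot; it does
 * not exist in the kernel today.
 */
static cpumask_var_t xpfo_isolated_mask;

static bool task_wants_xpfo(struct task_struct *tsk)
{
        return cpumask_subset(&tsk->cpus_allowed, xpfo_isolated_mask);
}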

Thoughts? Is this possible? Crazy? Better ideas?


2018-08-20 21:50:22

by Linus Torvalds

Subject: Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

On Mon, Aug 20, 2018 at 2:26 PM Konrad Rzeszutek Wilk
<[email protected]> wrote:
>
> See eXclusive Page Frame Ownership (https://lwn.net/Articles/700606/), which was posted
> way back in 2016.

Ok, so my gut feel is that the above was reasonable within the context
of 2016, but that the XPFO model is completely pointless and wrong in
the post-Meltdown world we now live in.

Why?

Because with the Meltdown patches, we ALREADY HAVE the isolated page
tables that XPFO tries to do.

They are just the normal user page tables.

So don't go around doing other crazy things.

All you need to do is to literally:

- before you enter VMX mode, switch to the user page tables

- when you exit, switch back to the kernel page tables

don't do anything else. You're done.
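
In rough pseudo-C, the whole idea is just (a sketch only; the
switch_to_*_cr3() helpers stand in for the existing PTI CR3-switching
machinery and are not real function names):

/*
 * Sketch: run the guest under the restricted PTI user page tables so
 * the full kernel direct map is never loaded while guest-controlled
 * code can speculate.
 */
static void vmx_run_under_user_page_tables(struct kvm_vcpu *vcpu)
{
        switch_to_user_cr3();   /* assumed wrapper for the PTI CR3 switch */
        vmx_vcpu_run(vcpu);     /* the VMX state and this trampoline must
                                 * be mapped in the user page tables */
        switch_to_kernel_cr3(); /* restore the full kernel mapping */
}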

Now, this is complicated a bit by the fact that in order to enter VMX
mode with the user page tables, you do need to add the VMX state
itself to those user page tables (and add the actual trampoline code
to the vmenter too).

So it does imply we need to slightly extend the user mapping with a
few new patches, but that doesn't sound bad.

In fact, it sounds absolutely trivial to me.

The other thing you want to do is the trivial optimization of "hey,
we exited VMX mode due to a host interrupt", which would look like
this:

* switch to user page tables in order to do vmenter
* vmenter
* host interrupt happens
- switch to kernel page tables to handle irq
- do_IRQ etc
- switch back to user page tables
- iret
* switch to kernel page tables because the vmenter returned

so you want to have some trivial short-circuiting of that last "switch
to user page tables and back" dance. It may actually be that we don't
even need it, because the irq code may just be looking at what *mode*
we were in, not what page tables we were in. I looked at that code
back in the meltdown days, but that's already so last-year now that we
have all these _other_ CPU bugs we handled.
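
For illustration, the short-circuit could be as simple as the sketch
below; interrupted_vmenter() and switch_to_user_cr3() are assumed names,
not existing functions:

/*
 * Sketch only: on the irq-exit path, skip switching back to the user
 * page tables when the interrupt arrived during a vmenter; the
 * "switch to kernel page tables because the vmenter returned" step
 * then becomes a no-op.
 */
static void irq_exit_restore_cr3(struct pt_regs *regs)
{
        if (interrupted_vmenter(regs))
                return;          /* stay on the kernel page tables */
        switch_to_user_cr3();
}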

But other than small details like that, doesn't this "use our Meltdown
user page table" sound like the right thing to do?

And note: no new VM code or complexity. None. We already have the
"isolated KVM context with only pages for the KVM process" case
handled.

Of course, after the long (and entirely unrelated) discussion about
the TLB flushing bug we had, I'm starting to worry about my own
competence, and maybe I'm missing something really fundamental, and
the XPFO patches do something else than what I think they do, or my
"hey, let's use our Meltdown code" idea has some fundamental weakness
that I'm missing.

Linus

2018-08-20 22:20:15

by Kees Cook

Subject: Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

On Mon, Aug 20, 2018 at 2:52 PM, Woodhouse, David <[email protected]> wrote:
> On Mon, 2018-08-20 at 14:48 -0700, Linus Torvalds wrote:
>>
>> Of course, after the long (and entirely unrelated) discussion about
>> the TLB flushing bug we had, I'm starting to worry about my own
>> competence, and maybe I'm missing something really fundamental, and
>> the XPFO patches do something else than what I think they do, or my
>> "hey, let's use our Meltdown code" idea has some fundamental weakness
>> that I'm missing.
>
> The interesting part is taking the user (and other) pages out of the
> kernel's 1:1 physmap.
>
> It's the *kernel* we don't want being able to access those pages,
> because of the multitude of unfixable cache load gadgets.

Right. And even before Meltdown, it was desirable to remove those from
the physmap to avoid SMAP (and in some cases SMEP) bypasses (as
detailed in the mentioned paper:
http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf).

-Kees

--
Kees Cook
Pixel Security

2018-08-20 22:29:23

by Linus Torvalds

Subject: Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

On Mon, Aug 20, 2018 at 3:02 PM Woodhouse, David <[email protected]> wrote:
>
> It's the *kernel* we don't want being able to access those pages,
> because of the multitude of unfixable cache load gadgets.

Ahh.

I guess the proof is in the pudding. Did somebody try to forward-port
that patch set and see what the performance is like?

It used to be just 500 LOC. Was that because they took horrible
shortcuts? Are the performance numbers for the 32-bit case that
already had the kmap() overhead?

Linus

2018-08-20 22:37:49

by Tycho Andersen

Subject: Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

On Mon, Aug 20, 2018 at 03:27:52PM -0700, Linus Torvalds wrote:
> On Mon, Aug 20, 2018 at 3:02 PM Woodhouse, David <[email protected]> wrote:
> >
> > It's the *kernel* we don't want being able to access those pages,
> > because of the multitude of unfixable cache load gadgets.
>
> Ahh.
>
> I guess the proof is in the pudding. Did somebody try to forward-port
> that patch set and see what the performance is like?
>
> It used to be just 500 LOC. Was that because they took horrible
> shortcuts? Are the performance numbers for the 32-bit case that
> already had the kmap() overhead?

The last version I worked on was a bit before Meltdown was public:
https://lkml.org/lkml/2017/9/7/445

The overhead was a lot, but Dave Hansen gave some ideas about how to
speed things up in this thread: https://lkml.org/lkml/2017/9/20/828

Since Meltdown hit, I haven't worked seriously on understanding and
implementing his suggestions, in part because it wasn't clear to me
what pieces of the infrastructure we might be able to re-use. Someone
who knows more about mm/ might be able to suggest an approach, though.

Tycho

2018-08-20 23:01:41

by Dave Hansen

Subject: Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

On 08/20/2018 03:35 PM, Tycho Andersen wrote:
> Since Meltdown hit, I haven't worked seriously on understanding and
> implementing his suggestions, in part because it wasn't clear to me
> what pieces of the infrastructure we might be able to re-use. Someone
> who knows more about mm/ might be able to suggest an approach, though

Unfortunately, I'm not sure there's much of KPTI we can reuse. KPTI
still has a very static kernel map (well, two static kernel maps) and
XPFO really needs a much more dynamic map.

We do have a bit of infrastructure now to do TLB flushes near the kernel
exit point, but it's entirely for the user address space, which isn't
affected by XPFO.


2018-08-20 23:16:37

by David Woodhouse

Subject: Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)



On Mon, 2018-08-20 at 15:59 -0700, Dave Hansen wrote:
> On 08/20/2018 03:35 PM, Tycho Andersen wrote:
> > Since Meltdown hit, I haven't worked seriously on understanding and
> > implementing his suggestions, in part because it wasn't clear to me
> > what pieces of the infrastructure we might be able to re-use. Someone
> > who knows more about mm/ might be able to suggest an approach, though
>
> Unfortunately, I'm not sure there's much of KPTI we can reuse.  KPTI
> still has a very static kernel map (well, two static kernel maps) and
> XPFO really needs a much more dynamic map.
>
> We do have a bit of infrastructure now to do TLB flushes near the kernel
> exit point, but it's entirely for the user address space, which isn't
> affected by XPFO.

One option is to have separate kernel address spaces, both with and
without the full physmap.

If you need the physmap, then rather than manually mapping with 4KiB
pages, you just switch. Having first ensured that no malicious guest or
userspace is running on a sibling, of course.

I'm not sure it's a win, but it might be worth looking at.



2018-08-20 23:30:52

by Dave Hansen

Subject: Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

On 08/20/2018 04:14 PM, David Woodhouse wrote:
> If you need the physmap, then rather than manually mapping with 4KiB
> pages, you just switch. Having first ensured that no malicious guest or
> userspace is running on a sibling, of course.

The problem is determining when "you need the physmap". Tycho's
patches, as I remember them, basically classify pages between "user"
pages, which are accessed only via kmap() and friends, and "kernel"
pages, which need to be mapped all the time because they might hold a
'task_struct', a page table, or a 'struct file'.

You're right that we could have a full physmap that we switch to for
kmap()-like access to user pages. But, the real problem is
transitioning pages from kernel to user usage since it requires shooting
down the old kernel mappings for those pages in some way.
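
For reference, a simplified sketch of the kmap/kunmap half of that
model, roughly following the posted XPFO patches (field and helper
names varied between versions; the per-page mapcount shown here is an
approximation, not verbatim code):

/*
 * User pages sit unmapped in the physmap and are mapped back in only
 * for the duration of a kmap(). The expensive part is the unmap path:
 * zapping the kernel PTE requires a TLB shootdown.
 */
void xpfo_kmap(void *kaddr, struct page *page)
{
        if (atomic_inc_return(&page->xpfo_mapcount) == 1)
                set_kpte(kaddr, page, PAGE_KERNEL);     /* restore PTE */
}

void xpfo_kunmap(void *kaddr, struct page *page)
{
        if (atomic_dec_return(&page->xpfo_mapcount) == 0) {
                set_kpte(kaddr, page, __pgprot(0));     /* zap PTE */
                flush_tlb_kernel_range((unsigned long)kaddr,
                                       (unsigned long)kaddr + PAGE_SIZE);
        }
}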

2018-08-20 23:42:19

by Linus Torvalds

Subject: Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

On Mon, Aug 20, 2018 at 4:27 PM Dave Hansen <[email protected]> wrote:
>
> You're right that we could have a full physmap that we switch to for
> kmap()-like access to user pages. But, the real problem is
> transitioning pages from kernel to user usage since it requires shooting
> down the old kernel mappings for those pages in some way.

You might decide that you simply don't care enough, and are willing to
leave possible stale TLB entries rather than shoot things down.

Then you'd still possibly see user pages in the kernel map, but only
for a fairly limited time, and only until the TLB entry gets re-used
for other reasons.

Even with kernel page table entries being marked global, their
lifetime in the TLB is likely not very long, and definitely not long
enough for some user that tries to scan for pages.

Linus

2018-08-21 09:59:09

by David Woodhouse

Subject: Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

On Mon, 2018-08-20 at 15:27 -0700, Linus Torvalds wrote:
> On Mon, Aug 20, 2018 at 3:02 PM Woodhouse, David <[email protected]> wrote:
> >
> > It's the *kernel* we don't want being able to access those pages,
> > because of the multitude of unfixable cache load gadgets.
>
> Ahh.
>
> I guess the proof is in the pudding. Did somebody try to forward-port
> that patch set and see what the performance is like?

I hadn't actually seen the XPFO patch set before; we're going to take a
serious look.

Of course, this is only really something that a select few people (with
quite a lot of machines) would turn on. And they might be willing to
tolerate a significant performance cost if the alternative way to be
safe is to disable hyperthreading entirely — which is Intel's best
recommendation so far, it seems.

Another alternative... I'm told POWER8 does an interesting thing with
hyperthreading and gang scheduling for KVM. The host kernel doesn't
actually *see* the hyperthreads at all, and KVM just launches the full
set of siblings when it enters a guest, and gathers them again when any
of them exits. That's definitely worth investigating as an option for
x86, too.



2018-08-21 14:06:00

by Liran Alon

Subject: Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)


> On 21 Aug 2018, at 12:57, David Woodhouse <[email protected]> wrote:
>
> Another alternative... I'm told POWER8 does an interesting thing with
> hyperthreading and gang scheduling for KVM. The host kernel doesn't
> actually *see* the hyperthreads at all, and KVM just launches the full
> set of siblings when it enters a guest, and gathers them again when any
> of them exits. That's definitely worth investigating as an option for
> x86, too.

I actually think that such a scheduling mechanism, which prevents leaking cache entries to sibling hyperthreads, should co-exist together with the KVM address space isolation to fully mitigate L1TF and other similar vulnerabilities. The address space isolation should prevent VMExit handler code gadgets from loading arbitrary host memory into the cache. Once the VMExit code path switches to the full host address space, we should also make sure that no other sibling hyperthread is running in the guest.

Focusing on the scheduling mechanism, we must make sure that when a logical processor runs guest code, all sibling logical processors run code which does not populate the L1D cache with information unrelated to this VM. This includes forbidding one logical processor from running guest code while a sibling is running a host task such as a NIC interrupt handler.
Thus, when a vCPU thread exits the guest into the host and the VMExit handler reaches a code flow which could populate the L1D cache with such information, we should force the sibling logical processors to exit the guest, such that they will be allowed to resume only on a core whose L1D cache we can promise is free of information unrelated to this VM.

At first, I created a patch series which attempts to implement such a mechanism in KVM. However, it became clear to me that this may need to be implemented in the scheduler itself. This is because:
1. It is difficult to handle all the new scheduling constraints only in KVM.
2. This mechanism should be relevant for any Type-2 hypervisor which runs inside Linux besides KVM (such as VMware Workstation or VirtualBox).
3. This mechanism could also be used to prevent future "core-cache-leaking" vulnerabilities from being exploited between processes of different security domains which run as siblings on the same core.

The main idea is a mechanism which is very similar to Microsoft's "core scheduler" which they implemented to mitigate this vulnerability. The mechanism should work as follows:
1. Each CPU core will now be tagged with a "security domain id".
2. The scheduler will provide a mechanism to tag a task with a security domain id.
3. Tasks will inherit their security domain id from their parent task.
3.1. The first task in the system will have security domain id 0. Thus, if nothing special is done, all tasks will be assigned security domain id 0.
4. Tasks will be able to allocate a new security domain id from the scheduler and assign it to another task dynamically.
5. The Linux scheduler will prevent scheduling tasks on a core with a different security domain id:
5.1. A CPU core's security domain id will be set to the security domain id of the tasks which currently run on it.
5.2. The scheduler will attempt to first schedule a task on a core with the required security domain id, if such a core exists.
5.3. Otherwise, it will need to decide whether to kick all tasks running on some core in order to run the task with a different security domain id on that core.

The above mechanism can be used to mitigate the L1TF HT variant by assigning vCPU tasks a security domain id which is unique per VM and different from the security domain id of the host, which is 0.
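
As a minimal sketch of the idea (all names below are made up for
illustration and do not exist in the kernel today; assume struct
task_struct grows a "u64 security_domain" that is inherited across
fork() and is 0 for the first task in the system):

/*
 * Scheduler-side check: a task may be placed on a CPU only if every
 * SMT sibling of that CPU is idle or runs a task from the same
 * security domain.
 */
static bool smt_domain_compatible(struct task_struct *p, int cpu)
{
        int sibling;

        for_each_cpu(sibling, topology_sibling_cpumask(cpu)) {
                struct task_struct *curr = cpu_curr(sibling);

                if (curr != idle_task(sibling) &&
                    curr->security_domain != p->security_domain)
                        return false;
        }
        return true;
}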

I would be glad to hear feedback on the above suggestion.
If this should better be discussed on a separate email thread, please say so and I will open a new thread.

Thanks,
-Liran



2018-08-21 16:44:55

by David Woodhouse

Subject: Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

On Tue, 2018-08-21 at 17:01 +0300, Liran Alon wrote:
>
> > On 21 Aug 2018, at 12:57, David Woodhouse <[email protected]>
> wrote:
> > 
> > Another alternative... I'm told POWER8 does an interesting thing
> with
> > hyperthreading and gang scheduling for KVM. The host kernel doesn't
> > actually *see* the hyperthreads at all, and KVM just launches the
> full
> > set of siblings when it enters a guest, and gathers them again when
> any
> > of them exits. That's definitely worth investigating as an option
> for
> > x86, too.
>
> I actually think that such scheduling mechanism which prevents
> leaking cache entries to sibling hyperthreads should co-exist
> together with the KVM address space isolation to fully mitigate L1TF
> and other similar vulnerabilities. The address space isolation should
> prevent VMExit handlers code gadgets from loading arbitrary host
> memory to the cache. Once VMExit code path switches to full host
> address space, then we should also make sure that no other sibling
> hyperthread is running in the guest.

The KVM POWER8 solution (see arch/powerpc/kvm/book3s_hv.c) does that.
The siblings are *never* running host kernel code; they're all torn
down when any of them exits the guest. And it's always the *same*
guest.

> Focusing on the scheduling mechanism, we must make sure that when a
> logical processor runs guest code, all siblings logical processors
> must run code which do not populate L1D cache with information
> unrelated to this VM. This includes forbidding one logical processor
> to run guest code while sibling is running a host task such as a NIC
> interrupt handler.
> Thus, when a vCPU thread exits the guest into the host and VMExit
> handler reaches code flow which could populate L1D cache with this
> information, we should force an exit from the guest of the siblings
> logical processors, such that they will be allowed to resume only on
> a core which we can promise that the L1D cache is free from
> information unrelated to this VM.
>
> At first, I have created a patch series which attempts to implement
> such mechanism in KVM. However, it became clear to me that this may
> need to be implemented in the scheduler itself. This is because:
> 1. It is difficult to handle all new scheduling constraints only in
> KVM.
> 2. This mechanism should be relevant for any Type-2 hypervisor which
> runs inside Linux besides KVM (Such as VMware Workstation or
> VirtualBox).
> 3. This mechanism could also be used to prevent future “core-cache-
> leaking” vulnerabilities to be exploited between processes of
> different security domains which run as siblings on the same core.

I'm not sure I agree. If KVM is handling "only let siblings run the
*same* guest" and the siblings aren't visible to the host at all,
that's quite simple. Any other hypervisor can also do it.

Now, the down-side of this is that the siblings aren't visible to the
host. They can't be used to run multiple threads of the same userspace
processes; only multiple threads of the same KVM guest. A truly generic
core scheduler would cope with userspace threads too.

BUT I strongly suspect there's a huge correlation between the set of
people who care enough about the KVM/L1TF issue to enable a costly
XPFO-like solution, and the set of people who mostly don't give a shit
about having sibling CPUs available to run the host's userspace anyway.

This is not the "I happen to run a Windows VM on my Linux desktop" use
case...



2018-08-21 23:12:33

by Liran Alon

Subject: Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)



> On 21 Aug 2018, at 17:22, David Woodhouse <[email protected]> wrote:
>
> On Tue, 2018-08-21 at 17:01 +0300, Liran Alon wrote:
>>
>>> On 21 Aug 2018, at 12:57, David Woodhouse <[email protected]>
>> wrote:
>>>
>>> Another alternative... I'm told POWER8 does an interesting thing
>> with
>>> hyperthreading and gang scheduling for KVM. The host kernel doesn't
>>> actually *see* the hyperthreads at all, and KVM just launches the
>> full
>>> set of siblings when it enters a guest, and gathers them again when
>> any
>>> of them exits. That's definitely worth investigating as an option
>> for
>>> x86, too.
>>
>> I actually think that such scheduling mechanism which prevents
>> leaking cache entries to sibling hyperthreads should co-exist
>> together with the KVM address space isolation to fully mitigate L1TF
>> and other similar vulnerabilities. The address space isolation should
>> prevent VMExit handlers code gadgets from loading arbitrary host
>> memory to the cache. Once VMExit code path switches to full host
>> address space, then we should also make sure that no other sibling
>> hyperthread is running in the guest.
>
> The KVM POWER8 solution (see arch/powerpc/kvm/book3s_hv.c) does that.
> The siblings are *never* running host kernel code; they're all torn
> down when any of them exits the guest. And it's always the *same*
> guest.
>

I wasn't aware of this KVM POWER8 mechanism. Thanks for the pointer.
(371fefd6f2dc ("KVM: PPC: Allow book3s_hv guests to use SMT processor modes"))

Note though that my point regarding the co-existence of the isolated address space together with such a scheduling mechanism is still valid.
The scheduling mechanism should not be seen as an alternative to the isolated address space if we wish to reduce the frequency of events
in which we need to kick sibling hyperthreads out of the guest.

>> Focusing on the scheduling mechanism, we must make sure that when a
>> logical processor runs guest code, all siblings logical processors
>> must run code which do not populate L1D cache with information
>> unrelated to this VM. This includes forbidding one logical processor
>> to run guest code while sibling is running a host task such as a NIC
>> interrupt handler.
>> Thus, when a vCPU thread exits the guest into the host and VMExit
>> handler reaches code flow which could populate L1D cache with this
>> information, we should force an exit from the guest of the siblings
>> logical processors, such that they will be allowed to resume only on
>> a core which we can promise that the L1D cache is free from
>> information unrelated to this VM.
>>
>> At first, I have created a patch series which attempts to implement
>> such mechanism in KVM. However, it became clear to me that this may
>> need to be implemented in the scheduler itself. This is because:
>> 1. It is difficult to handle all new scheduling constraints only in
>> KVM.
>> 2. This mechanism should be relevant for any Type-2 hypervisor which
>> runs inside Linux besides KVM (Such as VMware Workstation or
>> VirtualBox).
>> 3. This mechanism could also be used to prevent future “core-cache-
>> leaking” vulnerabilities to be exploited between processes of
>> different security domains which run as siblings on the same core.
>
> I'm not sure I agree. If KVM is handling "only let siblings run the
> *same* guest" and the siblings aren't visible to the host at all,
> that's quite simple. Any other hypervisor can also do it.
>
> Now, the down-side of this is that the siblings aren't visible to the
> host. They can't be used to run multiple threads of the same userspace
> processes; only multiple threads of the same KVM guest. A truly generic
> core scheduler would cope with userspace threads too.
>
> BUT I strongly suspect there's a huge correlation between the set of
> people who care enough about the KVM/L1TF issue to enable a costly
> XPFO-like solution, and the set of people who mostly don't give a shit
> about having sibling CPUs available to run the host's userspace anyway.
>
> This is not the "I happen to run a Windows VM on my Linux desktop" use
> case...

If I understand your proposal correctly, you suggest doing something similar to the KVM POWER8 solution:
1. Disable HyperThreading for use by host userspace.
2. Use sibling hyperthreads only in KVM and schedule the group of vCPUs that run on a single core as a "gang" that enters and exits the guest together.

This solution may work well for KVM-based cloud providers that match the following criteria:
1. All compute instances run with SR-IOV and IOMMU Posted-Interrupts.
2. Configure affinity such that the host dedicates a distinct set of physical cores per guest; no physical core is able to run vCPUs from multiple guests.

However, this may not necessarily be the case: some cloud providers have compute instances whose devices are all emulated or paravirtualized.
Under the proposed scheduling mechanism, all the IOThreads of these guests would not be able to utilize HyperThreading, which can be a significant performance hit.
So Oracle Cloud (OCI) are folks who do care enough about the KVM/L1TF issue but also do give a shit about having sibling CPUs available to run host userspace. :)
Unless I’m missing something of course...

In addition, desktop users who run VMs today expect a security boundary to exist between the guest and the host.
Except for the L1TF HyperThreading variant, we have been able to preserve such a security boundary.
It seems a bit weird to implement a mechanism in x86 KVM whose message to users is basically:
"If you want a security boundary between a VM and the host, you need to enable this knob, which will also cause the rest of your host
to see half the number of logical processors".

Furthermore, I think it is important to think about a mechanism which may help us mitigate future similar "core-cache-leak" vulnerabilities.
As I previously mentioned, the "core scheduler" could help us mitigate these vulnerabilities at the OS level by disallowing userspace tasks of different "security domains"
from running as siblings on the same core.

-Liran

(Cc Paolo, who probably has good feedback on the entire email thread as well)





2018-08-31 07:47:09

by Julian Stecklina

Subject: Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

Hey everyone,

On Mon, 20 Aug 2018 15:27 Linus Torvalds <[email protected]> wrote:
> On Mon, Aug 20, 2018 at 3:02 PM Woodhouse, David <[email protected]> wrote:
>>
>> It's the *kernel* we don't want being able to access those pages,
>> because of the multitude of unfixable cache load gadgets.
>
> Ahh.
>
> I guess the proof is in the pudding. Did somebody try to forward-port
> that patch set and see what the performance is like?

I've been spending some cycles on the XPFO patch set this week. For the
patch set as it was posted for v4.13, the performance overhead of
compiling a Linux kernel is ~40% on x86_64[1]. The overhead comes almost
completely from TLB flushing. If we can live with stale TLB entries
allowing temporary access (which I think is reasonable), we can remove
all TLB flushing (on x86). This reduces the overhead to 2-3% for
kernel compile.

There were no problems in forward-porting the patch set to master.
You can find the result here, including a patch that makes the TLB flushing
configurable:
http://git.infradead.org/users/jsteckli/linux-xpfo.git/shortlog/refs/heads/xpfo-master
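
As a sketch of what such a knob can look like (the names here are
illustrative, not necessarily what the branch uses), the physmap TLB
shootdown can be gated behind a static key toggled at boot:

static DEFINE_STATIC_KEY_TRUE(xpfo_do_tlb_flush);

static int __init xpfo_tlbflush_setup(char *str)
{
        /* assumed boot parameter name, for illustration only */
        static_branch_disable(&xpfo_do_tlb_flush);
        return 1;
}
__setup("xpfotlbflush=off", xpfo_tlbflush_setup);

static void xpfo_flush_page(unsigned long kaddr)
{
        if (static_branch_likely(&xpfo_do_tlb_flush))
                flush_tlb_kernel_range(kaddr, kaddr + PAGE_SIZE);
}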

It survived some casual stress-ng runs. I can rerun the benchmarks on
this version, but I doubt there is any change.

> It used to be just 500 LOC. Was that because they took horrible
> shortcuts?

The patch is still fairly small. As for the horrible shortcuts, I'll let
others comment on that.

HTH,
Julian

[1] Measured on my quad-core (8 hyperthreads) Kaby Lake desktop building
Linux 4.18 with the Phoronix Test Suite.

--
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B


2018-08-31 09:52:53

by James Bottomley

Subject: Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

On Mon, 2018-08-20 at 21:52 +0000, Woodhouse, David wrote:
> On Mon, 2018-08-20 at 14:48 -0700, Linus Torvalds wrote:
> >
> > Of course, after the long (and entirely unrelated) discussion about
> > the TLB flushing bug we had, I'm starting to worry about my own
> > competence, and maybe I'm missing something really fundamental, and
> > the XPFO patches do something else than what I think they do, or my
> > "hey, let's use our Meltdown code" idea has some fundamental
> > weakness
> > that I'm missing.
>
> The interesting part is taking the user (and other) pages out of the
> kernel's 1:1 physmap.
>
> It's the *kernel* we don't want being able to access those pages,
> because of the multitude of unfixable cache load gadgets.

A long time ago, I gave a talk about precisely this at OLS (2005 I
think). On PA-RISC we have a problem with inequivalent aliasing in the
page cache (same physical page with two different virtual addresses
modulo 4MB) which causes a machine check if it occurs.
Architecturally, PA can move into the cache any page for which it has a
mapping, and the kernel offset map of every page causes an inequivalency
if the same page is in use in user space. Of course, in practice the
caching machinery is too busy moving in and out the pages we reference to
have any interest in speculating on other pages it has a mapping for, so
it almost never happens (the "almost" being a set of machine checks we see
very occasionally on the latest and most aggressively caching and
speculating CPUs). If this were implemented, we'd be interested in using it.

James


2018-08-31 16:40:55

by Tycho Andersen

Subject: Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

On Thu, Aug 30, 2018 at 06:00:51PM +0200, Julian Stecklina wrote:
> Hey everyone,
>
> On Mon, 20 Aug 2018 15:27 Linus Torvalds <[email protected]> wrote:
> > On Mon, Aug 20, 2018 at 3:02 PM Woodhouse, David <[email protected]> wrote:
> >>
> >> It's the *kernel* we don't want being able to access those pages,
> >> because of the multitude of unfixable cache load gadgets.
> >
> > Ahh.
> >
> > I guess the proof is in the pudding. Did somebody try to forward-port
> > that patch set and see what the performance is like?
>
> I've been spending some cycles on the XPFO patch set this week. For the
> patch set as it was posted for v4.13, the performance overhead of
> compiling a Linux kernel is ~40% on x86_64[1]. The overhead comes almost
> completely from TLB flushing. If we can live with stale TLB entries
> allowing temporary access (which I think is reasonable), we can remove
> all TLB flushing (on x86). This reduces the overhead to 2-3% for
> kernel compile.

Cool, thanks for doing this! Do you have any thoughts about what the
2-3% is? It seems to me like if we're not doing the TLB flushes, the
rest of this should be *really* cheap, even cheaper than 2-3%. Dave
Hansen had suggested coalescing things on a per mapping basis vs.
doing it per page, which might help?

> > It used to be just 500 LOC. Was that because they took horrible
> > shortcuts?
>
> The patch is still fairly small. As for the horrible shortcuts, I let
> others comment on that.

Heh, things like xpfo_temp_map() aren't awesome, but that can
hopefully be fixed by keeping a little bit of memory around for use
where we are mapping things and can't fail. I remember some discussion
about hopefully not having to sprinkle xpfo mapping calls everywhere
in the kernel, so perhaps we could get rid of it entirely?

Anyway, I'm working on some other stuff for the kernel right now, but
I hope (:D) that it should be close to done, and I'll have more cycles
to work on this soon too.

Tycho

2018-09-01 21:41:06

by Linus Torvalds

Subject: Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

On Fri, Aug 31, 2018 at 12:45 AM Julian Stecklina <[email protected]> wrote:
>
> I've been spending some cycles on the XPFO patch set this week. For the
> patch set as it was posted for v4.13, the performance overhead of
> compiling a Linux kernel is ~40% on x86_64[1]. The overhead comes almost
> completely from TLB flushing. If we can live with stale TLB entries
> allowing temporary access (which I think is reasonable), we can remove
> all TLB flushing (on x86). This reduces the overhead to 2-3% for
> kernel compile.

I have to say, even 2-3% for a kernel compile sounds absolutely horrendous.

Kernel builds are 90% user space at least for me, so a 2-3% slowdown
from a kernel is not some small unnoticeable thing.

Linus

2018-09-03 15:11:55

by Julian Stecklina

Subject: Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

Linus Torvalds <[email protected]> writes:

> On Fri, Aug 31, 2018 at 12:45 AM Julian Stecklina <[email protected]> wrote:
>>
>> I've been spending some cycles on the XPFO patch set this week. For the
>> patch set as it was posted for v4.13, the performance overhead of
>> compiling a Linux kernel is ~40% on x86_64[1]. The overhead comes almost
>> completely from TLB flushing. If we can live with stale TLB entries
>> allowing temporary access (which I think is reasonable), we can remove
>> all TLB flushing (on x86). This reduces the overhead to 2-3% for
>> kernel compile.
>
> I have to say, even 2-3% for a kernel compile sounds absolutely horrendous.

Well, it's at least in a range where it doesn't look hopeless.

> Kernel builds are 90% user space at least for me, so a 2-3% slowdown
> from a kernel is not some small unnoticeable thing.

The overhead seems to come from the hooks that XPFO adds to
alloc/free_pages. These hooks add a couple of atomic operations per
allocated (4K) page for bookkeeping. Some of these atomic ops are only
for debugging and could be removed. There is also some opportunity to
streamline the per-page space overhead of XPFO.
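
For illustration, the rough shape of the allocation-side hook (a
simplified sketch; the flag and field names approximate the patch set
rather than quoting it):

/*
 * Every 4K page destined for userspace gets its direct-map PTE zapped,
 * plus a bit of atomic per-page bookkeeping.
 */
static void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp)
{
        int i;

        for (i = 0; i < (1 << order); i++) {
                if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
                        /* xpfo_flags is an assumed per-page field */
                        set_bit(XPFO_PAGE_USER, &page[i].xpfo_flags);
                        set_kpte(page_address(page + i), page + i,
                                 __pgprot(0)); /* unmap from the physmap */
                }
        }
}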

I'll do some more in-depth profiling later this week.

Julian
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B


2018-09-03 15:28:50

by Andi Kleen

Subject: Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

On Sat, Sep 01, 2018 at 02:38:43PM -0700, Linus Torvalds wrote:
> On Fri, Aug 31, 2018 at 12:45 AM Julian Stecklina <[email protected]> wrote:
> >
> > I've been spending some cycles on the XPFO patch set this week. For the
> > patch set as it was posted for v4.13, the performance overhead of
> > compiling a Linux kernel is ~40% on x86_64[1]. The overhead comes almost
> > completely from TLB flushing. If we can live with stale TLB entries
> > allowing temporary access (which I think is reasonable), we can remove
> > all TLB flushing (on x86). This reduces the overhead to 2-3% for
> > kernel compile.
>
> I have to say, even 2-3% for a kernel compile sounds absolutely horrendous.
>
> Kernel builds are 90% user space at least for me, so a 2-3% slowdown
> from a kernel is not some small unnoticeable thing.

Also, the problem is that depending on the workload, everything may fit
into the TLBs, so the temporarily stale TLB entries may be around
for a long time. Modern CPUs have very large TLBs and good
LRU policies. For kernel entries which have the global bit set and
are used for something, there may be no reason to ever evict them.

Julian, I think you would need at least some quantitative perfmon data about
TLB replacement rates in the kernel to show that it's "reasonable"
instead of hand-waving.

Most likely, I suspect, you would need at least a low-frequency regular TLB
flush for the global entries, which will increase
the overhead again.

-Andi


2018-09-03 15:38:16

by Andi Kleen

Subject: Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

On Sat, Sep 01, 2018 at 06:33:22PM -0400, Wes Turner wrote:
> Speaking of pages and slowdowns,
> is there a better place to ask this question:
> From "'Turning Tables' shared page tables vuln":
> """
> 'New "Turning Tables" Technique Bypasses All Windows Kernel Mitigations'
> https://www.bleepingcomputer.com/news/security/new-turning-tables-technique-bypasses-all-windows-kernel-mitigations/
> > Furthermore, since the concept of page tables is also used by Apple and
> the Linux project, macOS and Linux are, in theory, also vulnerable to this
> technique, albeit the researchers have not verified such attacks, as of
> yet.
> Slides:
> https://cdn2.hubspot.net/hubfs/487909/Turning%20(Page)%20Tables_Slides.pdf
> Naturally, I took notice and decided to forward the latest scary headline
> to this list to see if this is already being addressed?

This essentially just says that if you can change page tables, you can subvert the kernel.
That's always been the case, always will be, and I'm sure it has been used forever by rootkits,
so I don't know why anybody would pass it off as a "new attack".

-Andi

2018-09-04 09:41:02

by Julian Stecklina

Subject: Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

Andi Kleen <[email protected]> writes:

> On Sat, Sep 01, 2018 at 02:38:43PM -0700, Linus Torvalds wrote:
>> On Fri, Aug 31, 2018 at 12:45 AM Julian Stecklina <[email protected]> wrote:
>> >
>> > I've been spending some cycles on the XPFO patch set this week. For the
>> > patch set as it was posted for v4.13, the performance overhead of
>> > compiling a Linux kernel is ~40% on x86_64[1]. The overhead comes almost
>> > completely from TLB flushing. If we can live with stale TLB entries
>> > allowing temporary access (which I think is reasonable), we can remove
>> > all TLB flushing (on x86). This reduces the overhead to 2-3% for
>> > kernel compile.
>>
>> I have to say, even 2-3% for a kernel compile sounds absolutely horrendous.
>>
>> Kernel builds are 90% user space at least for me, so a 2-3% slowdown
>> from a kernel is not some small unnoticeable thing.
>
> Also the problem is that depending on the workload everything may fit
> into the TLBs, so the temporary stale TLB entries may be around
> for a long time. Modern CPUs have very large TLBs, and good
> LRU policies. For the kernel entries with global bit set and
> which are used for something there may be no reason ever to evict.
>
> Julian, I think you would need at least some quantitative perfmon data about
> TLB replacement rates in the kernel to show that it's "reasonable"
> instead of hand waving.

That's a fair point. It definitely depends on the workload. My idle
laptop's GNOME GUI session still causes ~40k dTLB load misses per second
per core. My idle server (some shells, IRC client) still has ~8k dTLB
load misses per second per core. Compiling something pushes this to
millions of misses per second.

For comparison, according to https://www.7-cpu.com/cpu/Skylake_X.html, SKX
can fit 1536 entries into its L2 dTLB.

> Most likely I suspect you would need a low frequency regular TLB
> flush for the global entries at least, which will increase
> the overhead again.

Given the tiny experiment above, I don't think this is necessary except
for highly special use cases. If stale TLB entries are a concern, the
better intermediate step is to do an INVLPG on the core that modified the
page table.
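
On x86 that intermediate step would be something along these lines (a
sketch; __flush_tlb_one_kernel() is the existing helper for a
single-address kernel INVLPG):

/*
 * After zapping a page's direct-map PTE, invalidate only the local
 * CPU's TLB entry instead of doing a global shootdown. Other cores
 * may still hold a stale entry until it ages out.
 */
static void xpfo_flush_local(struct page *page)
{
        unsigned long kaddr = (unsigned long)page_address(page);

        __flush_tlb_one_kernel(kaddr);  /* INVLPG on this core only */
}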

And even with these shortcomings, XPFO severely limits the data an
attacker can leak from the kernel.

Julian
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B


2018-09-07 21:33:03

by Khalid Aziz

Subject: Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

On 08/30/2018 10:00 AM, Julian Stecklina wrote:
> Hey everyone,
>
> On Mon, 20 Aug 2018 15:27 Linus Torvalds <[email protected]> wrote:
>> On Mon, Aug 20, 2018 at 3:02 PM Woodhouse, David <[email protected]> wrote:
>>>
>>> It's the *kernel* we don't want being able to access those pages,
>>> because of the multitude of unfixable cache load gadgets.
>>
>> Ahh.
>>
>> I guess the proof is in the pudding. Did somebody try to forward-port
>> that patch set and see what the performance is like?
>
> I've been spending some cycles on the XPFO patch set this week. For the
> patch set as it was posted for v4.13, the performance overhead of
> compiling a Linux kernel is ~40% on x86_64[1]. The overhead comes almost
> completely from TLB flushing. If we can live with stale TLB entries
> allowing temporary access (which I think is reasonable), we can remove
> all TLB flushing (on x86). This reduces the overhead to 2-3% for
> kernel compile.
>
> There were no problems in forward-porting the patch set to master.
> You can find the result here, including a patch that makes the TLB flushing
> configurable:
> http://git.infradead.org/users/jsteckli/linux-xpfo.git/shortlog/refs/heads/xpfo-master
>
> It survived some casual stress-ng runs. I can rerun the benchmarks on
> this version, but I doubt there is any change.
>
>> It used to be just 500 LOC. Was that because they took horrible
>> shortcuts?
>
> The patch is still fairly small. As for the horrible shortcuts, I let
> others comment on that.


It looks like the performance impact can be a whole lot worse. On my test
system with 2 Xeon Platinum 8160 CPUs (HT enabled) and 768 GB of memory,
I am seeing a very high penalty with XPFO when building the 4.18.6 kernel
sources with "make -j60":

             No XPFO patch   XPFO (no TLB flush)   XPFO (TLB flush)
sys time     52m 54.036s     55m 47.897s           434m 8.645s

That is ~8% worse with TLB flush disabled and ~720% worse with TLB flush
enabled. This test was with kernel sources being compiled on an ext4
filesystem. XPFO seems to affect ext2 even more: with an ext2 filesystem,
the impact was ~18.6% and ~900% respectively.

--
Khalid



2018-09-12 15:40:07

by Julian Stecklina

Subject: Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

Julian Stecklina <[email protected]> writes:

> Linus Torvalds <[email protected]> writes:
>
>> On Fri, Aug 31, 2018 at 12:45 AM Julian Stecklina <[email protected]> wrote:
>>>
>>> I've been spending some cycles on the XPFO patch set this week. For the
>>> patch set as it was posted for v4.13, the performance overhead of
>>> compiling a Linux kernel is ~40% on x86_64[1]. The overhead comes almost
>>> completely from TLB flushing. If we can live with stale TLB entries
>>> allowing temporary access (which I think is reasonable), we can remove
>>> all TLB flushing (on x86). This reduces the overhead to 2-3% for
>>> kernel compile.
>>
>> I have to say, even 2-3% for a kernel compile sounds absolutely horrendous.
>
> Well, it's at least in a range where it doesn't look hopeless.
>
>> Kernel builds are 90% user space at least for me, so a 2-3% slowdown
>> from a kernel is not some small unnoticeable thing.
>
> The overhead seems to come from the hooks that XPFO adds to
> alloc/free_pages. These hooks add a couple of atomic operations per
> allocated (4K) page for bookkeeping. Some of these atomic ops are only
> for debugging and could be removed. There is also some opportunity to
> streamline the per-page space overhead of XPFO.

I've updated my XPFO branch[1] to make some of the debugging optional
and also integrated the XPFO bookkeeping with struct page, instead of
requiring CONFIG_PAGE_EXTENSION, which removes some checks in the hot
path. These changes push the overhead down to somewhere between 1.5 and
2% for my quad-core box in kernel compile. This is close to the
measurement noise, so I'm taking suggestions for a better benchmark here.

Of course, if you hit contention on the xpfo spinlock then performance
will suffer. I guess this is what happened on Khalid's large box.

I'll try to remove the spinlocks and add fixup code to the pagefault
handler to see whether this improves the situation on large boxes. This
might turn out to be ugly, though.

Julian

[1] http://git.infradead.org/users/jsteckli/linux-xpfo.git/shortlog/refs/heads/xpfo-master
--
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B


2018-09-13 06:12:25

by Juerg Haefliger

Subject: Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

On Wed, Sep 12, 2018 at 5:37 PM, Julian Stecklina <[email protected]> wrote:
> Julian Stecklina <[email protected]> writes:
>
>> Linus Torvalds <[email protected]> writes:
>>
>>> On Fri, Aug 31, 2018 at 12:45 AM Julian Stecklina <[email protected]> wrote:
>>>>
>>>> I've been spending some cycles on the XPFO patch set this week. For the
>>>> patch set as it was posted for v4.13, the performance overhead of
>>>> compiling a Linux kernel is ~40% on x86_64[1]. The overhead comes almost
>>>> completely from TLB flushing. If we can live with stale TLB entries
>>>> allowing temporary access (which I think is reasonable), we can remove
>>>> all TLB flushing (on x86). This reduces the overhead to 2-3% for
>>>> kernel compile.
>>>
>>> I have to say, even 2-3% for a kernel compile sounds absolutely horrendous.
>>
>> Well, it's at least in a range where it doesn't look hopeless.
>>
>>> Kernel builds are 90% user space at least for me, so a 2-3% slowdown
>>> from a kernel is not some small unnoticeable thing.
>>
>> The overhead seems to come from the hooks that XPFO adds to
>> alloc/free_pages. These hooks add a couple of atomic operations per
>> allocated (4K) page for bookkeeping. Some of these atomic ops are only
>> for debugging and could be removed. There is also some opportunity to
>> streamline the per-page space overhead of XPFO.
>
> I've updated my XPFO branch[1] to make some of the debugging optional
> and also integrated the XPFO bookkeeping with struct page, instead of
> requiring CONFIG_PAGE_EXTENSION, which removes some checks in the hot
> path.

FWIW, that was my original design, but there was some resistance to
adding more to the page struct, so page extension was suggested
instead.


> These changes push the overhead down to somewhere between 1.5 and
> 2% for my quad core box in kernel compile. This is close to the
> measurement noise, so I take suggestions for a better benchmark here.
>
> Of course, if you hit contention on the xpfo spinlock then performance
> will suffer. I guess this is what happened on Khalid's large box.
>
> I'll try to remove the spinlocks and add fixup code to the pagefault
> handler to see whether this improves the situation on large boxes. This
> might turn out to be ugly, though.

I'm wondering how much performance we're losing by having to split
hugepages. Any chance this can be quantified somehow? Maybe we can
have a pool of some sort reserved for user pages and group allocations
so that we can track the XPFO state at the hugepage level instead of
at the 4K level, to prevent/reduce page splitting. Not sure if that
causes issues or has any unwanted side effects, though...

...Juerg


> Julian
>
> [1] http://git.infradead.org/users/jsteckli/linux-xpfo.git/shortlog/refs/heads/xpfo-master
> --
> Amazon Development Center Germany GmbH
> Berlin - Dresden - Aachen
> main office: Krausenstr. 38, 10117 Berlin
> Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
> Ust-ID: DE289237879
> Eingetragen am Amtsgericht Charlottenburg HRB 149173 B
>

2018-09-14 17:08:22

by Khalid Aziz

Subject: Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

On 09/12/2018 09:37 AM, Julian Stecklina wrote:
> Julian Stecklina <[email protected]> writes:
>
>> Linus Torvalds <[email protected]> writes:
>>
>>> On Fri, Aug 31, 2018 at 12:45 AM Julian Stecklina <[email protected]> wrote:
>>>>
>>>> I've been spending some cycles on the XPFO patch set this week. For the
>>>> patch set as it was posted for v4.13, the performance overhead of
>>>> compiling a Linux kernel is ~40% on x86_64[1]. The overhead comes almost
>>>> completely from TLB flushing. If we can live with stale TLB entries
>>>> allowing temporary access (which I think is reasonable), we can remove
>>>> all TLB flushing (on x86). This reduces the overhead to 2-3% for
>>>> kernel compile.
>>>
>>> I have to say, even 2-3% for a kernel compile sounds absolutely horrendous.
>>
>> Well, it's at least in a range where it doesn't look hopeless.
>>
>>> Kernel builds are 90% user space at least for me, so a 2-3% slowdown
>>> from a kernel is not some small unnoticeable thing.
>>
>> The overhead seems to come from the hooks that XPFO adds to
>> alloc/free_pages. These hooks add a couple of atomic operations per
>> allocated (4K) page for bookkeeping. Some of these atomic ops are only
>> for debugging and could be removed. There is also some opportunity to
>> streamline the per-page space overhead of XPFO.
>
> I've updated my XPFO branch[1] to make some of the debugging optional
> and also integrated the XPFO bookkeeping with struct page, instead of
> requiring CONFIG_PAGE_EXTENSION, which removes some checks in the hot
> path. These changes push the overhead down to somewhere between 1.5 and
> 2% for my quad core box in kernel compile. This is close to the
> measurement noise, so I take suggestions for a better benchmark here.
>
> Of course, if you hit contention on the xpfo spinlock then performance
> will suffer. I guess this is what happened on Khalid's large box.
>
> I'll try to remove the spinlocks and add fixup code to the pagefault
> handler to see whether this improves the situation on large boxes. This
> might turn out to be ugly, though.
>

Hi Julian,

I ran tests with your updated code and gathered lock statistics. Change in system time for "make -j60" was in the noise margin (it actually went up by about 2%). There is some contention on xpfo_lock. Average wait time does not look high compared to other locks. Max hold time looks a little long. From /proc/lock_stat:

&(&page->xpfo_lock)->rlock:
    con-bounces:  29698        contentions:  29897
    waittime (min/max/total/avg):  0.06 / 134.39 / 15345.58 / 0.51
    acq-bounces:  422474670    acquisitions: 960222532
    holdtime (min/max/total/avg): 0.05 / 30362.05 / 195807002.62 / 0.20

Nevertheless even a smaller average wait time can add up.

--
Khalid




2018-09-17 09:53:07

by Julian Stecklina

Subject: Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

Khalid Aziz <[email protected]> writes:

> I ran tests with your updated code and gathered lock statistics. Change in
> system time for "make -j60" was in the noise margin (It actually went up by
> about 2%). There is some contention on xpfo_lock. Average wait time does not
> look high compared to other locks. Max hold time looks a little long. From
> /proc/lock_stat:
>
> &(&page->xpfo_lock)->rlock: 29698 29897 0.06 134.39 15345.58 0.51 422474670 960222532 0.05 30362.05 195807002.62 0.20
>
> Nevertheless even a smaller average wait time can add up.

Thanks for doing this!

I've spent some time optimizing spinlock usage in the code. See the two
last commits in my xpfo-master branch[1]. The optimization in
xpfo_kunmap is pretty safe. The last commit that optimizes locking in
xpfo_kmap is tricky, though, and I'm not sure this is the right
approach. FWIW, I've modeled this locking strategy in Spin and it
doesn't find any problems with it.

I've tested the result on a box with 72 hardware threads and I didn't
see a meaningful difference in kernel compile performance. It's still
hovering around 2%. So the question is, whether it's actually useful to
do these optimizations.

Khalid, you mentioned 5% overhead. Can you give the new code a spin and
see whether anything changes?

Julian

[1] http://git.infradead.org/users/jsteckli/linux-xpfo.git/shortlog/refs/heads/xpfo-master

--
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B


2018-09-17 10:01:53

by Julian Stecklina

Subject: Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

Juerg Haefliger <[email protected]> writes:

>> I've updated my XPFO branch[1] to make some of the debugging optional
>> and also integrated the XPFO bookkeeping with struct page, instead of
>> requiring CONFIG_PAGE_EXTENSION, which removes some checks in the hot
>> path.
>
> FWIW, that was my original design but there was some resistance to
> adding more to the page struct and page extension was suggested
> instead.

From looking at both versions, I have to say that having the metadata in
struct page makes the code easier to understand and removes some special
cases and bookkeeping.

> I'm wondering how much performance we're losing by having to split
> hugepages. Any chance this can be quantified somehow? Maybe we can
> have a pool of some sorts reserved for userpages and group allocations
> so that we can track the XPFO state at the hugepage level instead of
> at the 4k level to prevent/reduce page splitting. Not sure if that
> causes issues or has any unwanted side effects though...

Optimizing the allocation/deallocation path might be worthwhile, because
that's where most of the overhead goes. I haven't looked into how to do
this yet. I'd appreciate it if someone has pointers to code that tries to
achieve similar functionality, to get me started.

That being said, I'm wondering whether we have unrealistic expectations
about the overhead here and whether it's worth turning this patch into
something far more complicated. Opinions?

Julian
--
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B


2018-09-17 10:20:32

by Tycho Andersen

Subject: Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

On Mon, Sep 17, 2018 at 12:01:02PM +0200, Julian Stecklina wrote:
> Juerg Haefliger <[email protected]> writes:
>
> >> I've updated my XPFO branch[1] to make some of the debugging optional
> >> and also integrated the XPFO bookkeeping with struct page, instead of
> >> requiring CONFIG_PAGE_EXTENSION, which removes some checks in the hot
> >> path.
> >
> > FWIW, that was my original design but there was some resistance to
> > adding more to the page struct and page extension was suggested
> > instead.
>
> From looking at both versions, I have to say that having the metadata in
> struct page makes the code easier to understand and removes some special
> cases and bookkeeping.
>
> > I'm wondering how much performance we're losing by having to split
> > hugepages. Any chance this can be quantified somehow? Maybe we can
> > have a pool of some sorts reserved for userpages and group allocations
> > so that we can track the XPFO state at the hugepage level instead of
> > at the 4k level to prevent/reduce page splitting. Not sure if that
> > causes issues or has any unwanted side effects though...
>
> Optimizing the allocation/deallocation path might be worthwhile, because
> that's where most of the overhead goes. I haven't looked into how to do
> this yet. I'd appreciate if someone has pointers to code that tries to
> achieve similar functionality to get me started.
>
> That being said, I'm wondering whether we have unrealistic expectations
> about the overhead here and whether it's worth turning this patch into
> something far more complicated. Opinions?

I think that implementing Dave Hansen's suggestion of not doing
flushes/other work on every map/unmap, but only when pages are added
to the various free lists, will probably help out a lot. That's where
I got stuck last time when I was trying to do it, though :)
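
Roughly, the shape I had in mind is something like this -- just a
sketch under my own assumptions, not working code:

#include <linux/percpu.h>
#include <asm/tlbflush.h>

/*
 * Instead of flushing the TLB on every xpfo_kunmap(), count unmapped
 * pages per CPU and do one global flush when a batch completes (e.g.
 * as pages go back onto the free lists). Note the window where stale
 * kernel TLB entries still map the user page -- that's the trade-off.
 */
#define XPFO_FLUSH_BATCH 64

static DEFINE_PER_CPU(unsigned int, xpfo_pending_flush);

static void xpfo_queue_flush(void)
{
	unsigned int *pending = get_cpu_ptr(&xpfo_pending_flush);

	if (++*pending >= XPFO_FLUSH_BATCH) {
		/* One global flush amortized over the whole batch. */
		flush_tlb_all();
		*pending = 0;
	}
	put_cpu_ptr(&xpfo_pending_flush);
}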

Cheers,

Tycho

2018-09-17 13:30:37

by Christoph Hellwig

[permalink] [raw]
Subject: Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

On Mon, Sep 17, 2018 at 12:01:02PM +0200, Julian Stecklina wrote:
> Juerg Haefliger <[email protected]> writes:
>
> >> I've updated my XPFO branch[1] to make some of the debugging optional
> >> and also integrated the XPFO bookkeeping with struct page, instead of
> >> requiring CONFIG_PAGE_EXTENSION, which removes some checks in the hot
> >> path.
> >
> > FWIW, that was my original design but there was some resistance to
> > adding more to the page struct and page extension was suggested
> > instead.
>
> From looking at both versions, I have to say that having the metadata in
> struct page makes the code easier to understand and removes some special
> cases and bookkeeping.

Btw, can xpfo_lock be replaced with a bit spinlock in the page?
Growing struct page too much might cause performance issues. Then again
going beyond the 64 byte cache line might already cause that, and even
then it probably is still way better than the page extensions.

OTOH if you keep the spinlock it might be worth using
atomic_dec_and_lock on the count. Maybe the answer is a hash of
spinlocks, as we obviously can't take all that many of them at the same
time anyway.

Also for your transitions from zero it might be worth looking at
atomic_inc_not_zero.
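
Something like this is what I mean -- an illustrative sketch only (the
locks would need spin_lock_init() at boot, and a real implementation
would recheck the count under the lock):

#include <linux/atomic.h>
#include <linux/hash.h>
#include <linux/mm_types.h>
#include <linux/spinlock.h>

/* A small hash of spinlocks instead of a lock in every struct page. */
#define XPFO_LOCK_BITS	8
static spinlock_t xpfo_locks[1 << XPFO_LOCK_BITS];

static spinlock_t *xpfo_lock(struct page *page)
{
	return &xpfo_locks[hash_ptr(page, XPFO_LOCK_BITS)];
}

static void xpfo_kmap(struct page *page, atomic_t *mapcount)
{
	/* Transitions from nonzero need no lock and no remapping. */
	if (atomic_inc_not_zero(mapcount))
		return;

	/* Possible 0 -> 1 transition: serialize and map the page in. */
	spin_lock(xpfo_lock(page));
	if (atomic_inc_return(mapcount) == 1) {
		/* restore the direct-map entry for this page here */
	}
	spin_unlock(xpfo_lock(page));
}

static void xpfo_kunmap(struct page *page, atomic_t *mapcount)
{
	/* The lock is only taken when the count actually hits zero. */
	if (atomic_dec_and_lock(mapcount, xpfo_lock(page))) {
		/* unmap from the direct map and flush the TLB here */
		spin_unlock(xpfo_lock(page));
	}
}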

2018-09-18 23:01:47

by Khalid Aziz

[permalink] [raw]
Subject: Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

On 09/17/2018 03:51 AM, Julian Stecklina wrote:
> Khalid Aziz <[email protected]> writes:
>
>> I ran tests with your updated code and gathered lock statistics. Change in
>> system time for "make -j60" was in the noise margin (It actually went up by
>> about 2%). There is some contention on xpfo_lock. Average wait time does not
>> look high compared to other locks. Max hold time looks a little long. From
>> /proc/lock_stat:
>>
>> &(&page->xpfo_lock)->rlock: 29698 29897 0.06 134.39 15345.58 0.51 422474670 960222532 0.05 30362.05 195807002.62 0.20
>>
>> Nevertheless even a smaller average wait time can add up.
>
> Thanks for doing this!
>
> I've spent some time optimizing spinlock usage in the code. See the two
> last commits in my xpfo-master branch[1]. The optimization in
> xpfo_kunmap is pretty safe. The last commit that optimizes locking in
> xpfo_kmap is tricky, though, and I'm not sure this is the right
> approach. FWIW, I've modeled this locking strategy in Spin and it
> doesn't find any problems with it.
>
> I've tested the result on a box with 72 hardware threads and I didn't
> see a meaningful difference in kernel compile performance. It's still
> hovering around 2%. So the question is whether it's actually useful to
> do these optimizations.
>
> Khalid, you mentioned 5% overhead. Can you give the new code a spin and
> see whether anything changes?

Hi Julian,

I tested the kernel with this new code. When booted without "xpfotlbflush",
there is no meaningful change in system time with a kernel compile. The
kernel locks up during bootup when booted with "xpfotlbflush":

[ 52.967060] RIP: 0010:queued_spin_lock_slowpath+0xf6/0x1e0
[ 52.967061] Code: 48 03 34 c5 80 97 12 82 48 89 16 8b 42 08 85 c0 75 09 f3 90 8b 42 08 85 c0 74 f7 48 8b 32 48 85 f6 74 07 0f 0d 0e eb 02 f3 90 <8b> 07 66 85 c0 75 f7 41 89 c0 66 45 31 c0 41 39 c8 0f 84 93 00 00
[ 52.967061] RSP: 0018:ffffc9001cc83a00 EFLAGS: 00000002
[ 52.967062] RAX: 0000000000340101 RBX: ffffea06c16292e8 RCX: 0000000000580000
[ 52.967062] RDX: ffff88603c9e3980 RSI: 0000000000000000 RDI: ffffea06c16292e8
[ 52.967063] RBP: ffffea06c1629300 R08: 0000000000000001 R09: 0000000000000000
[ 52.967063] R10: 0000000000000000 R11: 0000000000000001 R12: ffff88c02765a000
[ 52.967063] R13: 0000000000000000 R14: ffff8860152a0d00 R15: 0000000000000000
[ 52.967064] FS: 00007f41ad1658c0(0000) GS:ffff88603c800000(0000) knlGS:0000000000000000
[ 52.967064] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 52.967064] CR2: ffff88c02765a000 CR3: 00000060252e4003 CR4: 00000000007606e0
[ 52.967065] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 52.967065] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 52.967065] PKRU: 55555554
[ 52.967066] Call Trace:
[ 52.967066] do_raw_spin_lock+0x6d/0xa0
[ 52.967066] _raw_spin_lock+0x53/0x70
[ 52.967067] ? xpfo_do_map+0x1b/0x52
[ 52.967067] xpfo_do_map+0x1b/0x52
[ 52.967067] xpfo_spurious_fault+0xac/0xae
[ 52.967068] __do_page_fault+0x3cc/0x4e0
[ 52.967068] ? __lock_acquire.isra.31+0x165/0x710
[ 52.967068] do_page_fault+0x32/0x180
[ 52.967068] page_fault+0x1e/0x30
[ 52.967069] RIP: 0010:memcpy_erms+0x6/0x10
[ 52.967069] Code: 90 90 90 90 eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38 fe
[ 52.967070] RSP: 0018:ffffc9001cc83bb8 EFLAGS: 00010246
[ 52.967070] RAX: ffff8860299d0f00 RBX: ffffc9001cc83dc8 RCX: 0000000000000080
[ 52.967071] RDX: 0000000000000080 RSI: ffff88c02765a000 RDI: ffff8860299d0f00
[ 52.967071] RBP: 0000000000000080 R08: ffffc9001cc83d90 R09: 0000000000000001
[ 52.967071] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000080
[ 52.967072] R13: 0000000000000080 R14: 0000000000000000 R15: ffff88c02765a080
[ 52.967072] _copy_to_iter+0x3b6/0x430
[ 52.967072] copy_page_to_iter+0x1cf/0x390
[ 52.967073] ? pagecache_get_page+0x26/0x200
[ 52.967073] generic_file_read_iter+0x620/0xaf0
[ 52.967073] ? avc_has_perm+0x12e/0x200
[ 52.967074] ? avc_has_perm+0x34/0x200
[ 52.967074] ? sched_clock+0x5/0x10
[ 52.967074] __vfs_read+0x112/0x190
[ 52.967074] vfs_read+0x8c/0x140
[ 52.967075] kernel_read+0x2c/0x40
[ 52.967075] prepare_binprm+0x121/0x230
[ 52.967075] __do_execve_file.isra.32+0x56f/0x930
[ 52.967076] ? __do_execve_file.isra.32+0x140/0x930
[ 52.967076] __x64_sys_execve+0x44/0x50
[ 52.967076] do_syscall_64+0x5b/0x190
[ 52.967077] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 52.967077] RIP: 0033:0x7f41abd898c7
[ 52.967078] Code: ff ff 76 df 89 c6 f7 de 64 41 89 32 eb d5 89 c6 f7 de 64 41 89 32 eb db 66 2e 0f 1f 84 00 00 00 00 00 90 b8 3b 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 02 f3 c3 48 8b 15 98 05 30 00 f7 d8 64 89 02
[ 52.967078] RSP: 002b:00007ffc34b18f48 EFLAGS: 00000207 ORIG_RAX: 000000000000003b
[ 52.967078] RAX: ffffffffffffffda RBX: 00007ffc34b190a0 RCX: 00007f41abd898c7
[ 52.967079] RDX: 00005573e1da99d0 RSI: 00007ffc34b190a0 RDI: 00007ffc34b194a0
[ 52.967079] RBP: 00005573e0895140 R08: 0000000000000008 R09: 0000000000000383
[ 52.967080] R10: 0000000000000008 R11: 0000000000000207 R12: 00005573e1da99d0
[ 52.967080] R13: 0000000000000007 R14: 000000000000000c R15: 00007ffc34b1ad10
[ 52.967080] Kernel panic - not syncing: Hard LOCKUP
[ 52.967081] CPU: 21 PID: 1127 Comm: systemd-udevd Not tainted 4.19.0-rc3-xpfo+ #3
[ 52.967081] Hardware name: Oracle Corporation ORACLE SERVER X7-2/ASM, MB, X7-2, BIOS 41017600 10/06/2017
[ 52.967081] Call Trace:
[ 52.967082] <NMI>
[ 52.967082] dump_stack+0x5a/0x73
[ 52.967082] panic+0xe8/0x25c
[ 52.967082] nmi_panic+0x37/0x40
[ 52.967083] watchdog_overflow_callback+0xef/0x110
[ 52.967083] __perf_event_overflow+0x51/0xe0
[ 52.967083] intel_pmu_handle_irq+0x222/0x4c0
[ 52.967084] ? _raw_spin_unlock+0x24/0x30
[ 52.967084] ? ghes_copy_tofrom_phys+0xf2/0x1a0
[ 52.967084] ? ghes_read_estatus+0x91/0x160
[ 52.967085] perf_event_nmi_handler+0x2e/0x50
[ 52.967085] nmi_handle+0x9a/0x180
[ 52.967085] ? nmi_handle+0x5/0x180
[ 52.967086] default_do_nmi+0xca/0x120
[ 52.967086] do_nmi+0x100/0x160
[ 52.967086] end_repeat_nmi+0x16/0x50
[ 52.967086] RIP: 0010:queued_spin_lock_slowpath+0xf6/0x1e0
[ 52.967087] Code: 48 03 34 c5 80 97 12 82 48 89 16 8b 42 08 85 c0 75 09 f3 90 8b 42 08 85 c0 74 f7 48 8b 32 48 85 f6 74 07 0f 0d 0e eb 02 f3 90 <8b> 07 66 85 c0 75 f7 41 89 c0 66 45 31 c0 41 39 c8 0f 84 93 00 00
[ 52.967087] RSP: 0018:ffffc9001cc83a00 EFLAGS: 00000002
[ 52.967088] RAX: 0000000000340101 RBX: ffffea06c16292e8 RCX: 0000000000580000
[ 52.967088] RDX: ffff88603c9e3980 RSI: 0000000000000000 RDI: ffffea06c16292e8
[ 52.967089] RBP: ffffea06c1629300 R08: 0000000000000001 R09: 0000000000000000
[ 52.967089] R10: 0000000000000000 R11: 0000000000000001 R12: ffff88c02765a000
[ 52.967089] R13: 0000000000000000 R14: ffff8860152a0d00 R15: 0000000000000000
[ 52.967090] ? queued_spin_lock_slowpath+0xf6/0x1e0
[ 52.967090] ? queued_spin_lock_slowpath+0xf6/0x1e0
[ 52.967090] </NMI>
[ 52.967091] do_raw_spin_lock+0x6d/0xa0
[ 52.967091] _raw_spin_lock+0x53/0x70
[ 52.967091] ? xpfo_do_map+0x1b/0x52
[ 52.967092] xpfo_do_map+0x1b/0x52
[ 52.967092] xpfo_spurious_fault+0xac/0xae
[ 52.967092] __do_page_fault+0x3cc/0x4e0
[ 52.967092] ? __lock_acquire.isra.31+0x165/0x710
[ 52.967093] do_page_fault+0x32/0x180
[ 52.967093] page_fault+0x1e/0x30
[ 52.967093] RIP: 0010:memcpy_erms+0x6/0x10
[ 52.967094] Code: 90 90 90 90 eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38 fe
[ 52.967094] RSP: 0018:ffffc9001cc83bb8 EFLAGS: 00010246
[ 52.967095] RAX: ffff8860299d0f00 RBX: ffffc9001cc83dc8 RCX: 0000000000000080
[ 52.967095] RDX: 0000000000000080 RSI: ffff88c02765a000 RDI: ffff8860299d0f00
[ 52.967096] RBP: 0000000000000080 R08: ffffc9001cc83d90 R09: 0000000000000001
[ 52.967096] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000080
[ 52.967096] R13: 0000000000000080 R14: 0000000000000000 R15: ffff88c02765a080
[ 52.967097] _copy_to_iter+0x3b6/0x430
[ 52.967097] copy_page_to_iter+0x1cf/0x390
[ 52.967097] ? pagecache_get_page+0x26/0x200
[ 52.967098] generic_file_read_iter+0x620/0xaf0
[ 52.967098] ? avc_has_perm+0x12e/0x200
[ 52.967098] ? avc_has_perm+0x34/0x200
[ 52.967098] ? sched_clock+0x5/0x10
[ 52.967099] __vfs_read+0x112/0x190
[ 52.967099] vfs_read+0x8c/0x140
[ 52.967099] kernel_read+0x2c/0x40
[ 52.967100] prepare_binprm+0x121/0x230
[ 52.967100] __do_execve_file.isra.32+0x56f/0x930
[ 52.967100] ? __do_execve_file.isra.32+0x140/0x930
[ 52.967101] __x64_sys_execve+0x44/0x50
[ 52.967101] do_syscall_64+0x5b/0x190
[ 52.967101] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 52.967102] RIP: 0033:0x7f41abd898c7
[ 52.967102] Code: ff ff 76 df 89 c6 f7 de 64 41 89 32 eb d5 89 c6 f7 de 64 41 89 32 eb db 66 2e 0f 1f 84 00 00 00 00 00 90 b8 3b 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 02 f3 c3 48 8b 15 98 05 30 00 f7 d8 64 89 02
[ 52.967102] RSP: 002b:00007ffc34b18f48 EFLAGS: 00000207 ORIG_RAX: 000000000000003b
[ 52.967103] RAX: ffffffffffffffda RBX: 00007ffc34b190a0 RCX: 00007f41abd898c7
[ 52.967103] RDX: 00005573e1da99d0 RSI: 00007ffc34b190a0 RDI: 00007ffc34b194a0
[ 52.967104] RBP: 00005573e0895140 R08: 0000000000000008 R09: 0000000000000383
[ 52.967104] R10: 0000000000000008 R11: 0000000000000207 R12: 00005573e1da99d0
[ 52.967104] R13: 0000000000000007 R14: 000000000000000c R15: 00007ffc34b1ad10
[ 54.001888] Shutting down cpus with NMI
[ 54.001889] Kernel Offset: disabled
[ 54.860701] ---[ end Kernel panic - not syncing: Hard LOCKUP ]---
[ 54.867733] ------------[ cut here ]------------
[ 54.867734] unchecked MSR access error: WRMSR to 0x83f (tried to write 0x00000000000000f6) at rIP: 0xffffffff81055864 (native_write_msr+0x4/0x20)
[ 54.867734] Call Trace:
[ 54.867734] <IRQ>
[ 54.867735] native_apic_msr_write+0x2e/0x40
[ 54.867735] arch_irq_work_raise+0x28/0x40
[ 54.867735] irq_work_queue+0x69/0x70
[ 54.867736] printk_safe_log_store+0xd0/0xf0
[ 54.867736] printk+0x58/0x6f
[ 54.867736] __warn_printk+0x46/0x90
[ 54.867737] ? enqueue_task_fair+0x8e/0x760
[ 54.867737] native_smp_send_reschedule+0x39/0x40
[ 54.867737] check_preempt_curr+0x75/0xb0
[ 54.867738] ttwu_do_wakeup+0x19/0x190
[ 54.867738] try_to_wake_up+0x21e/0x4f0
[ 54.867738] __wake_up_common+0x9d/0x190
[ 54.867738] ep_poll_callback+0xd5/0x370
[ 54.867739] ? ep_poll_callback+0x2b5/0x370
[ 54.867739] __wake_up_common+0x9d/0x190
[ 54.867739] __wake_up_common_lock+0x7a/0xc0
[ 54.867740] irq_work_run_list+0x4c/0x70
[ 54.867740] smp_call_function_interrupt+0x59/0x110
[ 54.867740] call_function_interrupt+0xf/0x20
[ 54.867741] </IRQ>
[ 54.867741] <NMI>
[ 54.867741] RIP: 0010:panic+0x206/0x25c
[ 54.867742] Code: 83 3d 11 83 c9 01 00 74 05 e8 d2 87 02 00 48 c7 c6 80 15 d1 82 48 c7 c7 80 fe 06 82 31 c0 e8 71 ed 06 00 fb 66 0f 1f 44 00 00 <45> 31 e4 e8 de 11 0e 00 4d 39 ec 7c 1e 41 83 f6 01 48 8b 05 ce 82
[ 54.867742] RSP: 0018:fffffe00003a4b58 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff03
[ 54.867743] RAX: 0000000000000038 RBX: fffffe00003a4e00 RCX: 0000000000000000
[ 54.867743] RDX: 0000000000000000 RSI: 0000000000000038 RDI: ffff88603c9d5d08
[ 54.867744] RBP: fffffe00003a4bc8 R08: 0000000000000000 R09: ffff88603c9d5d47
[ 54.867744] R10: 000000000000000b R11: 0000000000000000 R12: ffffffff8207b979
[ 54.867744] R13: 0000000000000000 R14: 0000000000000000 R15: ffff88603c80f560
[ 54.867745] ? panic+0x1ff/0x25c
[ 54.867745] nmi_panic+0x37/0x40
[ 54.867745] watchdog_overflow_callback+0xef/0x110
[ 54.867746] __perf_event_overflow+0x51/0xe0
[ 54.867746] intel_pmu_handle_irq+0x222/0x4c0
[ 54.867746] ? _raw_spin_unlock+0x24/0x30
[ 54.867747] ? ghes_copy_tofrom_phys+0xf2/0x1a0
[ 54.867747] ? ghes_read_estatus+0x91/0x160
[ 54.867747] perf_event_nmi_handler+0x2e/0x50
[ 54.867748] nmi_handle+0x9a/0x180
[ 54.867748] ? nmi_handle+0x5/0x180
[ 54.867748] default_do_nmi+0xca/0x120
[ 54.867748] do_nmi+0x100/0x160
[ 54.867749] end_repeat_nmi+0x16/0x50
[ 54.867749] RIP: 0010:queued_spin_lock_slowpath+0xf6/0x1e0
[ 54.867750] Code: 48 03 34 c5 80 97 12 82 48 89 16 8b 42 08 85 c0 75 09 f3 90 8b 42 08 85 c0 74 f7 48 8b 32 48 85 f6 74 07 0f 0d 0e eb 02 f3 90 <8b> 07 66 85 c0 75 f7 41 89 c0 66 45 31 c0 41 39 c8 0f 84 93 00 00
[ 54.867750] RSP: 0018:ffffc9001cc83a00 EFLAGS: 00000002
[ 54.867751] RAX: 0000000000340101 RBX: ffffea06c16292e8 RCX: 0000000000580000
[ 54.867751] RDX: ffff88603c9e3980 RSI: 0000000000000000 RDI: ffffea06c16292e8
[ 54.867751] RBP: ffffea06c1629300 R08: 0000000000000001 R09: 0000000000000000
[ 54.867752] R10: 0000000000000000 R11: 0000000000000001 R12: ffff88c02765a000
[ 54.867752] R13: 0000000000000000 R14: ffff8860152a0d00 R15: 0000000000000000
[ 54.867752] ? queued_spin_lock_slowpath+0xf6/0x1e0
[ 54.867753] ? queued_spin_lock_slowpath+0xf6/0x1e0
[ 54.867753] </NMI>
[ 54.867753] do_raw_spin_lock+0x6d/0xa0
[ 54.867754] _raw_spin_lock+0x53/0x70
[ 54.867754] ? xpfo_do_map+0x1b/0x52
[ 54.867754] xpfo_do_map+0x1b/0x52
[ 54.867754] xpfo_spurious_fault+0xac/0xae
[ 54.867755] __do_page_fault+0x3cc/0x4e0
[ 54.867755] ? __lock_acquire.isra.31+0x165/0x710
[ 54.867755] do_page_fault+0x32/0x180
[ 54.867756] page_fault+0x1e/0x30
[ 54.867756] RIP: 0010:memcpy_erms+0x6/0x10
[ 54.867757] Code: 90 90 90 90 eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38 fe
[ 54.867757] RSP: 0018:ffffc9001cc83bb8 EFLAGS: 00010246
[ 54.867758] RAX: ffff8860299d0f00 RBX: ffffc9001cc83dc8 RCX: 0000000000000080
[ 54.867758] RDX: 0000000000000080 RSI: ffff88c02765a000 RDI: ffff8860299d0f00
[ 54.867758] RBP: 0000000000000080 R08: ffffc9001cc83d90 R09: 0000000000000001
[ 54.867759] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000080
[ 54.867759] R13: 0000000000000080 R14: 0000000000000000 R15: ffff88c02765a080
[ 54.867759] _copy_to_iter+0x3b6/0x430
[ 54.867760] copy_page_to_iter+0x1cf/0x390
[ 54.867760] ? pagecache_get_page+0x26/0x200
[ 54.867760] generic_file_read_iter+0x620/0xaf0
[ 54.867761] ? avc_has_perm+0x12e/0x200
[ 54.867761] ? avc_has_perm+0x34/0x200
[ 54.867761] ? sched_clock+0x5/0x10
[ 54.867761] __vfs_read+0x112/0x190
[ 54.867762] vfs_read+0x8c/0x140
[ 54.867762] kernel_read+0x2c/0x40
[ 54.867762] prepare_binprm+0x121/0x230
[ 54.867763] __do_execve_file.isra.32+0x56f/0x930
[ 54.867763] ? __do_execve_file.isra.32+0x140/0x930
[ 54.867763] __x64_sys_execve+0x44/0x50
[ 54.867764] do_syscall_64+0x5b/0x190
[ 54.867764] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 54.867764] RIP: 0033:0x7f41abd898c7
[ 54.867765] Code: ff ff 76 df 89 c6 f7 de 64 41 89 32 eb d5 89 c6 f7 de 64 41 89 32 eb db 66 2e 0f 1f 84 00 00 00 00 00 90 b8 3b 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 02 f3 c3 48 8b 15 98 05 30 00 f7 d8 64 89 02
[ 54.867765] RSP: 002b:00007ffc34b18f48 EFLAGS: 00000207 ORIG_RAX: 000000000000003b
[ 54.867766] RAX: ffffffffffffffda RBX: 00007ffc34b190a0 RCX: 00007f41abd898c7
[ 54.867766] RDX: 00005573e1da99d0 RSI: 00007ffc34b190a0 RDI: 00007ffc34b194a0
[ 54.867767] RBP: 00005573e0895140 R08: 0000000000000008 R09: 0000000000000383
[ 54.867767] R10: 0000000000000008 R11: 0000000000000207 R12: 00005573e1da99d0
[ 54.867767] R13: 0000000000000007 R14: 000000000000000c R15: 00007ffc34b1ad10
[ 54.867768] sched: Unexpected reschedule of offline CPU#4!
[ 54.867768] WARNING: CPU: 21 PID: 1127 at arch/x86/kernel/smp.c:128 native_smp_send_reschedule+0x39/0x40
[ 54.867768] Modules linked in: crc32c_intel nvme nvme_core igb megaraid_sas ahci i2c_algo_bit bnxt_en libahci i2c_core libata dca dm_mirror dm_region_hash dm_log dm_mod
[ 54.867773] CPU: 21 PID: 1127 Comm: systemd-udevd Not tainted 4.19.0-rc3-xpfo+ #3
[ 54.867773] Hardware name: Oracle Corporation ORACLE SERVER X7-2/ASM, MB, X7-2, BIOS 41017600 10/06/2017
[ 54.867773] RIP: 0010:native_smp_send_reschedule+0x39/0x40
[ 54.867774] Code: 0f 92 c0 84 c0 74 15 48 8b 05 13 84 0f 01 be fd 00 00 00 48 8b 40 30 e9 e5 16 bc 00 89 fe 48 c7 c7 d8 48 06 82 e8 67 71 03 00 <0f> 0b c3 0f 1f 40 00 0f 1f 44 00 00 53 be 20 00 48 00 48 89 fb 48
[ 54.867774] RSP: 0018:ffff88603c803db8 EFLAGS: 00010086
[ 54.867775] RAX: 0000000000000000 RBX: ffff88603a7e2c80 RCX: 0000000000000000
[ 54.867775] RDX: 0000000000000000 RSI: 0000000000001277 RDI: ffff88603c9d5d08
[ 54.867776] RBP: ffff88603a7e2c80 R08: 0000000000000000 R09: 0000000000000000
[ 54.867776] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8860250a8000
[ 54.867776] R13: ffff88603c803e00 R14: 0000000000000000 R15: ffff88603a7e2c98
[ 54.867777] FS: 00007f41ad1658c0(0000) GS:ffff88603c800000(0000) knlGS:0000000000000000
[ 54.867777] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 54.867778] CR2: ffff88c02765a000 CR3: 00000060252e4003 CR4: 00000000007606e0
[ 54.867778] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 54.867778] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 54.867779] PKRU: 55555554
[ 54.867779] Call Trace:
[ 54.867779] <IRQ>
[ 54.867779] check_preempt_curr+0x75/0xb0
[ 54.867780] ttwu_do_wakeup+0x19/0x190
[ 54.867780] try_to_wake_up+0x21e/0x4f0
[ 54.867780] __wake_up_common+0x9d/0x190
[ 54.867781] ep_poll_callback+0xd5/0x370
[ 54.867781] ? ep_poll_callback+0x2b5/0x370
[ 54.867781] __wake_up_common+0x9d/0x190
[ 54.867782] __wake_up_common_lock+0x7a/0xc0
[ 54.867782] irq_work_run_list+0x4c/0x70
[ 54.867782] smp_call_function_interrupt+0x59/0x110
[ 54.867782] call_function_interrupt+0xf/0x20
[ 54.867783] </IRQ>
[ 54.867783] <NMI>
[ 54.867783] RIP: 0010:panic+0x206/0x25c
[ 54.867784] Code: 83 3d 11 83 c9 01 00 74 05 e8 d2 87 02 00 48 c7 c6 80 15 d1 82 48 c7 c7 80 fe 06 82 31 c0 e8 71 ed 06 00 fb 66 0f 1f 44 00 00 <45> 31 e4 e8 de 11 0e 00 4d 39 ec 7c 1e 41 83 f6 01 48 8b 05 ce 82
[ 54.867784] RSP: 0018:fffffe00003a4b58 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff03
[ 54.867785] RAX: 0000000000000038 RBX: fffffe00003a4e00 RCX: 0000000000000000
[ 54.867785] RDX: 0000000000000000 RSI: 0000000000000038 RDI: ffff88603c9d5d08
[ 54.867786] RBP: fffffe00003a4bc8 R08: 0000000000000000 R09: ffff88603c9d5d47
[ 54.867786] R10: 000000000000000b R11: 0000000000000000 R12: ffffffff8207b979
[ 54.867786] R13: 0000000000000000 R14: 0000000000000000 R15: ffff88603c80f560
[ 54.867787] ? panic+0x1ff/0x25c
[ 54.867787] nmi_panic+0x37/0x40
[ 54.867787] watchdog_overflow_callback+0xef/0x110
[ 54.867787] __perf_event_overflow+0x51/0xe0
[ 54.867788] intel_pmu_handle_irq+0x222/0x4c0
[ 54.867788] ? _raw_spin_unlock+0x24/0x30
[ 54.867788] ? ghes_copy_tofrom_phys+0xf2/0x1a0
[ 54.867789] ? ghes_read_estatus+0x91/0x160
[ 54.867789] perf_event_nmi_handler+0x2e/0x50
[ 54.867789] nmi_handle+0x9a/0x180
[ 54.867790] ? nmi_handle+0x5/0x180
[ 54.867790] default_do_nmi+0xca/0x120
[ 54.867790] do_nmi+0x100/0x160
[ 54.867791] end_repeat_nmi+0x16/0x50
[ 54.867791] RIP: 0010:queued_spin_lock_slowpath+0xf6/0x1e0
[ 54.867791] Code: 48 03 34 c5 80 97 12 82 48 89 16 8b 42 08 85 c0 75 09 f3 90 8b 42 08 85 c0 74 f7 48 8b 32 48 85 f6 74 07 0f 0d 0e eb 02 f3 90 <8b> 07 66 85 c0 75 f7 41 89 c0 66 45 31 c0 41 39 c8 0f 84 93 00 00
[ 54.867792] RSP: 0018:ffffc9001cc83a00 EFLAGS: 00000002
[ 54.867792] RAX: 0000000000340101 RBX: ffffea06c16292e8 RCX: 0000000000580000
[ 54.867793] RDX: ffff88603c9e3980 RSI: 0000000000000000 RDI: ffffea06c16292e8
[ 54.867793] RBP: ffffea06c1629300 R08: 0000000000000001 R09: 0000000000000000
[ 54.867793] R1
[ 54.867794] Lost 48 message(s)!

--
Khalid

2018-09-19 01:04:24

by Balbir Singh

[permalink] [raw]
Subject: Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

On Mon, Aug 20, 2018 at 09:52:19PM +0000, Woodhouse, David wrote:
> On Mon, 2018-08-20 at 14:48 -0700, Linus Torvalds wrote:
> >
> > Of course, after the long (and entirely unrelated) discussion about
> > the TLB flushing bug we had, I'm starting to worry about my own
> > competence, and maybe I'm missing something really fundamental, and
> > the XPFO patches do something else than what I think they do, or my
> > "hey, let's use our Meltdown code" idea has some fundamental weakness
> > that I'm missing.
>
> The interesting part is taking the user (and other) pages out of the
> kernel's 1:1 physmap.
>
> It's the *kernel* we don't want being able to access those pages,
> because of the multitude of unfixable cache load gadgets.

I am missing why we need this, since the kernel can't access user pages
(SMAP) unless we go through the copy_to/from_user interface, nor
execute any of the user pages. Is it because of the dependency on the
availability of those features?

Balbir Singh.


2018-09-19 15:45:18

by Jonathan Adams

[permalink] [raw]
Subject: Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

(apologies again; resending due to formatting issues)
On Tue, Sep 18, 2018 at 6:03 PM Balbir Singh <[email protected]> wrote:
>
> On Mon, Aug 20, 2018 at 09:52:19PM +0000, Woodhouse, David wrote:
> > On Mon, 2018-08-20 at 14:48 -0700, Linus Torvalds wrote:
> > >
> > > Of course, after the long (and entirely unrelated) discussion about
> > > the TLB flushing bug we had, I'm starting to worry about my own
> > > competence, and maybe I'm missing something really fundamental, and
> > > the XPFO patches do something else than what I think they do, or my
> > > "hey, let's use our Meltdown code" idea has some fundamental weakness
> > > that I'm missing.
> >
> > The interesting part is taking the user (and other) pages out of the
> > kernel's 1:1 physmap.
> >
> > It's the *kernel* we don't want being able to access those pages,
> > because of the multitude of unfixable cache load gadgets.
>
> I am missing why we need this, since the kernel can't access user pages
> (SMAP) unless we go through the copy_to/from_user interface, nor
> execute any of the user pages. Is it because of the dependency on the
> availability of those features?
>
SMAP protects against kernel accesses to non-PRIV (i.e. userspace)
mappings, but that isn't relevant to what's being discussed here.

David is talking about the kernel Direct Map, which is a PRIV (i.e.
kernel) mapping of all physical memory on the system, at
VA = (base + PA).
Since this mapping exists for all physical addresses, speculative
load gadgets (and the processor's prefetch mechanism, etc.) can
load arbitrary data even if it is only otherwise mapped into user
space.

XPFO fixes this by unmapping the Direct Map translations when the
page is allocated as a user page. The mapping is only restored:
1. temporarily if the kernel needs direct access to the page
(i.e. to zero it, access it from a device driver, etc),
2. when the page is freed

And in so doing, significantly reduces the amount of non-kernel data
vulnerable to speculative execution attacks against the kernel.
(and reduces what data can be loaded into the L1 data cache while
in kernel mode, to be peeked at by the recent L1 Terminal Fault
vulnerability).
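
In purely illustrative pseudo-kernel-C (all helper names are made up),
the lifecycle is:

#include <linux/mm.h>
#include <asm/tlbflush.h>

/* Page leaves the allocator destined for userspace. */
static void xpfo_page_to_user(struct page *page)
{
	unsigned long va = (unsigned long)page_address(page);

	clear_direct_map_pte(page);		/* made-up helper */
	flush_tlb_kernel_range(va, va + PAGE_SIZE);
}

/* Kernel briefly needs the contents (zeroing, driver I/O, ...). */
static void *xpfo_kmap(struct page *page)
{
	restore_direct_map_pte(page);		/* made-up helper */
	return page_address(page);
}

/* Page goes back to the allocator: it is kernel-owned again. */
static void xpfo_page_freed(struct page *page)
{
	restore_direct_map_pte(page);		/* made-up helper */
}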

Does that make sense?

Cheers,
- jonathan

2018-09-23 02:34:06

by Balbir Singh

[permalink] [raw]
Subject: Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

On Wed, Sep 19, 2018 at 08:43:07AM -0700, Jonathan Adams wrote:
> (apologies again; resending due to formatting issues)
> On Tue, Sep 18, 2018 at 6:03 PM Balbir Singh <[email protected]> wrote:
> >
> > On Mon, Aug 20, 2018 at 09:52:19PM +0000, Woodhouse, David wrote:
> > > On Mon, 2018-08-20 at 14:48 -0700, Linus Torvalds wrote:
> > > >
> > > > Of course, after the long (and entirely unrelated) discussion about
> > > > the TLB flushing bug we had, I'm starting to worry about my own
> > > > competence, and maybe I'm missing something really fundamental, and
> > > > the XPFO patches do something else than what I think they do, or my
> > > > "hey, let's use our Meltdown code" idea has some fundamental weakness
> > > > that I'm missing.
> > >
> > > The interesting part is taking the user (and other) pages out of the
> > > kernel's 1:1 physmap.
> > >
> > > It's the *kernel* we don't want being able to access those pages,
> > > because of the multitude of unfixable cache load gadgets.
> >
> > I am missing why we need this, since the kernel can't access user pages
> > (SMAP) unless we go through the copy_to/from_user interface, nor
> > execute any of the user pages. Is it because of the dependency on the
> > availability of those features?
> >
> SMAP protects against kernel accesses to non-PRIV (i.e. userspace)
> mappings, but that isn't relevant to what's being discussed here.
>
> David is talking about the kernel Direct Map, which is a PRIV (i.e.
> kernel) mapping of all physical memory on the system, at
> VA = (base + PA).
> Since this mapping exists for all physical addresses, speculative
> load gadgets (and the processor's prefetch mechanism, etc.) can
> load arbitrary data even if it is only otherwise mapped into user
> space.

Load arbitrary data with no permission checks (strict RWX).

>
> XPFO fixes this by unmapping the Direct Map translations when the
> page is allocated as a user page. The mapping is only restored:
> 1. temporarily if the kernel needs direct access to the page
> (i.e. to zero it, access it from a device driver, etc),
> 2. when the page is freed
>
> And in so doing, significantly reduces the amount of non-kernel data
> vulnerable to speculative execution attacks against the kernel.
> (and reduces what data can be loaded into the L1 data cache while
> in kernel mode, to be peeked at by the recent L1 Terminal Fault
> vulnerability).

I see, and there is no way for gadgets to invoke this path from
user space to make their speculation successful? We still have to
flush L1, independent of whether XPFO is enabled or not, right?

Balbir Singh.


2018-09-24 15:01:39

by Julian Stecklina

[permalink] [raw]
Subject: Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

On Tue, 2018-09-18 at 17:00 -0600, Khalid Aziz wrote:
> I tested the kernel with this new code. When booted without
> "xpfotlbflush", there is no meaningful change in system time with a
> kernel compile.

That's good news! So the lock optimizations seem to help.

> The kernel locks up during bootup when booted with "xpfotlbflush":

I didn't test the version with TLB flushes, because it's clear that the
overhead is so bad that no one wants to use this.

It shouldn't lock up though, so maybe there is still a race condition
somewhere. I'll give this a spin on my end later this week.

Thanks for trying this out!

Julian
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B

2018-09-25 14:14:14

by Julian Stecklina

[permalink] [raw]
Subject: Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

On Sun, 2018-09-23 at 12:33 +1000, Balbir Singh wrote:
> > And in so doing, significantly reduces the amount of non-kernel data
> > vulnerable to speculative execution attacks against the kernel.
> > (and reduces what data can be loaded into the L1 data cache while
> > in kernel mode, to be peeked at by the recent L1 Terminal Fault
> > vulnerability).
>
> I see, and there is no way for gadgets to invoke this path from
> user space to make their speculation successful? We still have to
> flush L1, independent of whether XPFO is enabled or not, right?

Yes. And even with XPFO and L1 cache flushing enabled, there are more
steps that need to be taken to reliably guard against information leaks
using speculative execution.

Specifically, I'm looking into making certain allocations in the Linux
kernel process-local to hide even more memory from prefetching.

Another puzzle piece is co-scheduling support, which is relevant for
systems with hyperthreading enabled: https://lwn.net/Articles/764461/

Julian
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B

2018-10-15 08:10:17

by Khalid Aziz

[permalink] [raw]
Subject: Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

On 09/24/2018 08:45 AM, Stecklina, Julian wrote:
> I didn't test the version with TLB flushes, because it's clear that the
> overhead is so bad that no one wants to use this.

I don't think we can ignore the vulnerability caused by not flushing
stale TLB entries. On a mostly idle system, TLB entries hang around long
enough to make this fairly easy to exploit. I was able to use the
additional test in the lkdtm module added by this patch series to
successfully read pages unmapped from physmap simply by waiting for the
system to become idle. A rogue program can monitor system load and mount
its attack using a ret2dir exploit when the system is mostly idle. This
brings us back to the prohibitive cost of TLB flushes. If we are
unmapping a page from physmap every time the page is allocated to
userspace, we are forced to incur the cost of TLB flushes in some way.
The work Tycho was doing to implement Dave's suggestion can help here.
Once Tycho has something working, I can measure the overhead on my test
machine. Tycho, I can help with your implementation if you need.
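
Roughly, the kind of check involved looks like this (an illustrative
sketch, not the actual lkdtm test code; XPFO_TEST_PATTERN is a made-up
constant):

#include <linux/compiler.h>
#include <linux/mm.h>

/*
 * After a page has been unmapped from physmap *without* a TLB flush,
 * a read through its old direct-map address can still succeed via a
 * stale TLB entry. Seeing the expected pattern here, with no fault,
 * means the stale window is open.
 */
static bool xpfo_stale_tlb_read(struct page *page)
{
	unsigned long *va = page_address(page);	/* old physmap VA */

	return READ_ONCE(*va) == XPFO_TEST_PATTERN;
}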

--
Khalid

2018-10-24 11:02:30

by Khalid Aziz

[permalink] [raw]
Subject: Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

On 10/15/2018 01:37 PM, Khalid Aziz wrote:
> On 09/24/2018 08:45 AM, Stecklina, Julian wrote:
>> I didn't test the version with TLB flushes, because it's clear that the
>> overhead is so bad that no one wants to use this.
>
> I don't think we can ignore the vulnerability caused by not flushing
> stale TLB entries. On a mostly idle system, TLB entries hang around long
> enough to make this fairly easy to exploit. I was able to use the
> additional test in the lkdtm module added by this patch series to
> successfully read pages unmapped from physmap simply by waiting for the
> system to become idle. A rogue program can monitor system load and mount
> its attack using a ret2dir exploit when the system is mostly idle. This
> brings us back to the prohibitive cost of TLB flushes. If we are
> unmapping a page from physmap every time the page is allocated to
> userspace, we are forced to incur the cost of TLB flushes in some way.
> The work Tycho was doing to implement Dave's suggestion can help here.
> Once Tycho has something working, I can measure the overhead on my test
> machine. Tycho, I can help with your implementation if you need.

I looked at Tycho's last patch with batch update from
<https://lkml.org/lkml/2017/11/9/951>. I ported it on top of Julian's
patches and got it working well enough to gather performance numbers.
Here is what I see for system times on a machine with dual Xeon E5-2630
and 256GB of memory when running "make -j30 all" on the 4.18.6 kernel
source (percentages are relative to the base 4.19-rc8 kernel without
xpfo):


Base 4.19-rc8 913.84s
4.19-rc8 + xpfo, no TLB flush 1027.985s (+12.5%)
4.19-rc8 + batch update, no TLB flush 970.39s (+6.2%)
4.19-rc8 + xpfo, TLB flush 8458.449s (+825.6%)
4.19-rc8 + batch update, TLB flush 4665.659s (+410.6%)

Batch update is a significant improvement, but we are starting so far
behind baseline that it is still a huge slowdown.

--
Khalid


2018-10-24 15:01:31

by Tycho Andersen

[permalink] [raw]
Subject: Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

On Wed, Oct 24, 2018 at 04:30:42PM +0530, Khalid Aziz wrote:
> On 10/15/2018 01:37 PM, Khalid Aziz wrote:
> > On 09/24/2018 08:45 AM, Stecklina, Julian wrote:
> > > I didn't test the version with TLB flushes, because it's clear that the
> > > overhead is so bad that no one wants to use this.
> >
> > I don't think we can ignore the vulnerability caused by not flushing
> > stale TLB entries. On a mostly idle system, TLB entries hang around long
> > enough to make this fairly easy to exploit. I was able to use the
> > additional test in the lkdtm module added by this patch series to
> > successfully read pages unmapped from physmap simply by waiting for the
> > system to become idle. A rogue program can monitor system load and mount
> > its attack using a ret2dir exploit when the system is mostly idle. This
> > brings us back to the prohibitive cost of TLB flushes. If we are
> > unmapping a page from physmap every time the page is allocated to
> > userspace, we are forced to incur the cost of TLB flushes in some way.
> > The work Tycho was doing to implement Dave's suggestion can help here.
> > Once Tycho has something working, I can measure the overhead on my test
> > machine. Tycho, I can help with your implementation if you need.
>
> I looked at Tycho's last patch with batch update from
> <https://lkml.org/lkml/2017/11/9/951>. I ported it on top of Julian's
> patches and got it working well enough to gather performance numbers.
> Here is what I see for system times on a machine with dual Xeon E5-2630
> and 256GB of memory when running "make -j30 all" on the 4.18.6 kernel
> source (percentages are relative to the base 4.19-rc8 kernel without
> xpfo):
>
>
> Base 4.19-rc8 913.84s
> 4.19-rc8 + xpfo, no TLB flush 1027.985s (+12.5%)
> 4.19-rc8 + batch update, no TLB flush 970.39s (+6.2%)
> 4.19-rc8 + xpfo, TLB flush 8458.449s (+825.6%)
> 4.19-rc8 + batch update, TLB flush 4665.659s (+410.6%)
>
> Batch update is a significant improvement, but we are starting so far
> behind baseline that it is still a huge slowdown.

There's some other stuff that Dave suggested that I didn't do; in
particular, coalescing xpfo bits instead of setting things once per
page when mappings are shared, etc.

Perhaps that will help more?

I'm still stuck working on something else for now, but I hope to be
able to participate more on this Soon (TM). Thanks for the testing!

Tycho