Mel,
The NUMA_BALANCING series beginning with 5d833062139d (mm: numa: do not
dereference pmd outside of the lock during NUMA hinting fault) and
specifically 8a0516ed8b90 (mm: convert p[te|md]_numa users to
p[te|md]_protnone_numa) breaks Xen 64-bit PV guests.
Any fault on a present userspace mapping (e.g., a write to a read-only
mapping) is being misinterpreted as a NUMA hinting fault and not handled
correctly. All userspace programs end up continuously faulting.
This is because the hypervisor sets _PAGE_GLOBAL (== _PAGE_PROTNONE) on
all present userspace page table entries.
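Concretely, here is a schematic sketch of why the check misfires (made-up
bit values for illustration, not the real pgtable_types.h definitions):

#include <stdio.h>

/* Schematic bit values for illustration only -- not the real
 * pgtable_types.h definitions. */
#define _PAGE_PRESENT  0x001UL
#define _PAGE_GLOBAL   0x100UL
#define _PAGE_PROTNONE _PAGE_GLOBAL	/* the alias in question */

int main(void)
{
	/* A present, read-only userspace entry as the hypervisor
	 * writes it: present and global, but not writable. */
	unsigned long xen_user_pte = _PAGE_PRESENT | _PAGE_GLOBAL;

	/* The pre-patch pte_protnone() test: */
	if (xen_user_pte & _PAGE_PROTNONE)
		printf("misread as a NUMA hinting fault\n");

	return 0;
}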
Note the comment in asm/pgtable_types.h that says _PAGE_BIT_PROTNONE is
only valid on non-present entries.
/* If _PAGE_BIT_PRESENT is clear, we use these: */
/* - if the user mapped it with PROT_NONE; pte_present gives true */
#define _PAGE_BIT_PROTNONE _PAGE_BIT_GLOBAL
Adjusting pte_protnone() and pmd_protnone() to check for the absence of
_PAGE_PRESENT allows 64-bit Xen PV guests to work correctly again (see
following patch), but I'm not sure if NUMA_BALANCING would correctly
work with this change.
David
8<---------------------------
x86: pte_protnone() and pmd_protnone() must check entry is
not present
Since _PAGE_PROTNONE aliases _PAGE_GLOBAL it is only valid if
_PAGE_PRESENT is clear. Make pte_protnone() and pmd_protnone() check
for this.
This fixes a 64-bit Xen PV guest regression introduced by
8a0516ed8b90c95ffa1363b420caa37418149f21 (mm: convert p[te|md]_numa
users to p[te|md]_protnone_numa). Any userspace process would
endlessly fault.
In a 64-bit PV guest, userspace page table entries have _PAGE_GLOBAL
set by the hypervisor. This meant that any fault on a present
userspace entry (e.g., a write to a read-only mapping) would be
misinterpreted as a NUMA hinting fault and the fault would not be
correctly handled, resulting in the access endlessly faulting.
Signed-off-by: David Vrabel <[email protected]>
Cc: Mel Gorman <[email protected]>
---
arch/x86/include/asm/pgtable.h | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 67fc3d2..a0c35bf 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -476,12 +476,14 @@ static inline int pmd_present(pmd_t pmd)
  */
 static inline int pte_protnone(pte_t pte)
 {
-	return pte_flags(pte) & _PAGE_PROTNONE;
+	return (pte_flags(pte) & (_PAGE_PROTNONE | _PAGE_PRESENT))
+		== _PAGE_PROTNONE;
 }
 
 static inline int pmd_protnone(pmd_t pmd)
 {
-	return pmd_flags(pmd) & _PAGE_PROTNONE;
+	return (pmd_flags(pmd) & (_PAGE_PROTNONE | _PAGE_PRESENT))
+		== _PAGE_PROTNONE;
 }
 
 #endif /* CONFIG_NUMA_BALANCING */
--
1.7.10.4
On Thu, Feb 19, 2015 at 01:06:53PM +0000, David Vrabel wrote:
> Mel,
>
> The NUMA_BALANCING series beginning with 5d833062139d (mm: numa: do not
> dereference pmd outside of the lock during NUMA hinting fault) and
> specifically 8a0516ed8b90 (mm: convert p[te|md]_numa users to
> p[te|md]_protnone_numa) breaks Xen 64-bit PV guests.
>
> Any fault on a present userspace mapping (e.g., a write to a read-only
> mapping) is being misinterpreted as a NUMA hinting fault and not handled
> correctly. All userspace programs end up continuously faulting.
>
> This is because the hypervisor sets _PAGE_GLOBAL (== _PAGE_PROTNONE) on
> all present userspace page table entries.
>
I see, this is a variation of the problem where the NUMA hinted PTE was
treated as special due to the paravirt interfaces not being used.
> Note the comment in asm/pgtable_types.h that says _PAGE_BIT_PROTNONE is
> only valid on non-present entries.
>
> /* If _PAGE_BIT_PRESENT is clear, we use these: */
> /* - if the user mapped it with PROT_NONE; pte_present gives true */
> #define _PAGE_BIT_PROTNONE _PAGE_BIT_GLOBAL
>
> Adjusting pte_protnone() and pmd_protnone() to check for the absence of
> _PAGE_PRESENT allows 64-bit Xen PV guests to work correctly again (see
> following patch), but I'm not sure if NUMA_BALANCING would correctly
> work with this change.
>
Thanks for the analysis and the reminder of some of the details from the
previous discussion.
>
> 8<---------------------------
> x86: pte_protnone() and pmd_protnone() must check entry is
> not present
>
> Since _PAGE_PROTNONE aliases _PAGE_GLOBAL it is only valid if
> _PAGE_PRESENT is clear. Make pte_protnone() and pmd_protnone() check
> for this.
>
> This fixes a 64-bit Xen PV guest regression introduced by
> 8a0516ed8b90c95ffa1363b420caa37418149f21 (mm: convert p[te|md]_numa
> users to p[te|md]_protnone_numa). Any userspace process would
> endlessly fault.
>
> In a 64-bit PV guest, userspace page table entries have _PAGE_GLOBAL
> set by the hypervisor. This meant that any fault on a present
> userspace entry (e.g., a write to a read-only mapping) would be
> misinterpreted as a NUMA hinting fault and the fault would not be
> correctly handled, resulting in the access endlessly faulting.
>
> Signed-off-by: David Vrabel <[email protected]>
> Cc: Mel Gorman <[email protected]>
I cannot think of a reason why this would fail for NUMA balancing on bare
metal. The PAGE_NONE protection clears the present bit on p[te|md]_modify
so the expectations are matched before or after the patch is applied. So,
for bare metal at least
Acked-by: Mel Gorman <[email protected]>
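To spell out the flag arithmetic (again with schematic values, not the
real definitions):

#include <assert.h>

/* PAGE_NONE carries _PAGE_PROTNONE but not _PAGE_PRESENT, and
 * p[te|md]_modify() takes the present bit from the new protection. */
#define _PAGE_PRESENT  0x001UL
#define _PAGE_PROTNONE 0x100UL

int main(void)
{
	/* What a prot_none/NUMA-hinting entry looks like on bare metal: */
	unsigned long prot_none_pte = _PAGE_PROTNONE;	/* present clear */

	/* The old test fires: */
	assert(prot_none_pte & _PAGE_PROTNONE);
	/* The new test fires too: */
	assert((prot_none_pte & (_PAGE_PROTNONE | _PAGE_PRESENT))
	       == _PAGE_PROTNONE);

	return 0;
}

Both tests match the entries NUMA balancing actually creates, which is
why bare metal behaves the same before and after the patch.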
I *think* this will work ok with Xen but I cannot 100% convince myself.
I'm adding Wei Liu to the cc who may have a Xen PV setup handy that
supports NUMA and may be able to test the patch to confirm.
--
Mel Gorman
SUSE Labs
On Thu, Feb 19, 2015 at 5:06 AM, David Vrabel <[email protected]> wrote:
>
> The NUMA_BALANCING series beginning with 5d833062139d (mm: numa: do not
> dereference pmd outside of the lock during NUMA hinting fault) and
> specifically 8a0516ed8b90 (mm: convert p[te|md]_numa users to
> p[te|md]_protnone_numa) breaks Xen 64-bit PV guests.
>
> Any fault on a present userspace mapping (e.g., a write to a read-only
> mapping) is being misinterpreted as a NUMA hinting fault and not handled
> correctly. All userspace programs end up continuously faulting.
>
> This is because the hypervisor sets _PAGE_GLOBAL (== _PAGE_PROTNONE) on
> all present userspace page table entries.
That's some crazy stuff, but whatever. The patch is clearly good. Applied,
Linus
On Thu, Feb 19, 2015 at 01:06:53PM +0000, David Vrabel wrote:
> Mel,
>
> The NUMA_BALANCING series beginning with 5d833062139d (mm: numa: do not
> dereference pmd outside of the lock during NUMA hinting fault) and
> specifically 8a0516ed8b90 (mm: convert p[te|md]_numa users to
> p[te|md]_protnone_numa) breaks Xen 64-bit PV guests.
>
> Any fault on a present userspace mapping (e.g., a write to a read-only
> mapping) is being misinterpreted as a NUMA hinting fault and not handled
> correctly. All userspace programs end up continuously faulting.
>
> This is because the hypervisor sets _PAGE_GLOBAL (== _PAGE_PROTNONE) on
> all present userspace page table entries.
I feel like I'm missing some very basic background on how Xen works, but
why does it set _PAGE_GLOBAL on userspace entries? It seems strange to me.
>
> Note the comment in asm/pgtable_types.h that says _PAGE_BIT_PROTNONE is
> only valid on non-present entries.
>
> /* If _PAGE_BIT_PRESENT is clear, we use these: */
> /* - if the user mapped it with PROT_NONE; pte_present gives true */
> #define _PAGE_BIT_PROTNONE _PAGE_BIT_GLOBAL
>
> Adjusting pte_protnone() and pmd_protnone() to check for the absence of
> _PAGE_PRESENT allows 64-bit Xen PV guests to work correctly again (see
> following patch), but I'm not sure if NUMA_BALANCING would correctly
> work with this change.
--
Kirill A. Shutemov
On Thu, Feb 19, 2015 at 5:05 PM, Kirill A. Shutemov
<[email protected]> wrote:
>
> I feel like I'm missing some very basic background on how Xen works, but
> why does it set _PAGE_GLOBAL on userspace entries? It seems strange to me.
It is definitely strange. I'm guessing that it's some ancient Xen hack
for the early Intel virtualization that used to have absolutely
horrendous vmenter/exit costs, including very much the TLB overhead.
These days, Intel has address space identifiers, and doesn't flush the
whole TLB on VM entry/exit, so it's probably pointless to play games
with the global bit.
I get the feeling that a lot of Xen stuff is that kind of "legacy
hacks" that should just be cleaned up, but nobody has the energy or
the interest. There was the whole odd crazy SHARED_KERNEL_PMD hackery
too.
Linus
On 19/02/15 23:09, Linus Torvalds wrote:
> On Thu, Feb 19, 2015 at 5:06 AM, David Vrabel <[email protected]> wrote:
>>
>> The NUMA_BALANCING series beginning with 5d833062139d (mm: numa: do not
>> dereference pmd outside of the lock during NUMA hinting fault) and
>> specifically 8a0516ed8b90 (mm: convert p[te|md]_numa users to
>> p[te|md]_protnone_numa) breaks Xen 64-bit PV guests.
>>
>> Any fault on a present userspace mapping (e.g., a write to a read-only
>> mapping) is being misinterpreted as a NUMA hinting fault and not handled
>> correctly. All userspace programs end up continuously faulting.
>>
>> This is because the hypervisor sets _PAGE_GLOBAL (== _PAGE_PROTNONE) on
>> all present userspace page table entries.
>
> That's some crazy stuff, but whatever. The patch is clearly good. Applied,
Thanks.
Xen PV guests do not use any hardware virtualization features. In
particular they do not use nested paging.
A 64-bit PV guest runs in user mode for both kernel and userspace. On
kernel to user mode transitions, the hypervisor flips between two sets
of page tables (the user mode tables do not contain any kernel mappings,
but the kernel mode tables contain both). By setting _PAGE_GLOBAL on
the userspace entries, a kernel to user transition can avoid flushing
the userspace mappings from the TLB.
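In pseudo-code (all names here are hypothetical, not the actual Xen code
or interfaces):

/* Hypothetical sketch of the kernel -> user transition. */
struct vcpu {
	unsigned long kernel_cr3;	/* tables with kernel + user mappings */
	unsigned long user_cr3;		/* tables with user mappings only */
};

static inline void write_cr3(unsigned long val)
{
	asm volatile("mov %0, %%cr3" : : "r" (val) : "memory");
}

static void switch_to_guest_user(struct vcpu *v)
{
	/* The CR3 write flushes non-global TLB entries, but the
	 * userspace entries were installed with _PAGE_GLOBAL set,
	 * so they survive the switch and stay warm in the TLB. */
	write_cr3(v->user_cr3);
}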
David
On 20/02/15 01:49, Linus Torvalds wrote:
> On Thu, Feb 19, 2015 at 5:05 PM, Kirill A. Shutemov
> <[email protected]> wrote:
>> I feel like I'm missing some very basic background on how Xen works, but
>> why does it set _PAGE_GLOBAL on userspace entries? It seems strange to me.
> It is definitely strange. I'm guessing that it's some ancient Xen hack
> for the early Intel virtualization that used to have absolutely
> horrendous vmenter/exit costs, including very much the TLB overhead.
>
> These days, Intel has address space identifiers, and doesn't flush the
> whole TLB on VM entry/exit, so it's probably pointless to play games
> with the global bit.
It was introduced in 2006, but has nothing to do with VT-x
http://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=6f562e72cdc4b7e1519e23be75f812aebbf41db3
As long mode drops segment limit checking, the only way to protect a
64bit PV kernel from its userspace (both of which run in ring3 on user
pages) is to maintain two sets of pagetables and switch between them on
guest kernel/user context switches. The user set lacks kernel mappings.
I can't comment about the performance impact of the patch (way before my
time), but the justification was to try and reduce the overhead of guest
context switches.
>
> I get the feeling that a lot of Xen stuff is that kind of "legacy
> hacks" that should just be cleaned up, but nobody has the energy or
> the interest.
Time, mainly.
There certainly are areas which should be up for re-evaluation, given 9
years of change in hardware.
> There was the whole odd crazy SHARED_KERNEL_PMD hackery too.
SHARED_KERNEL_PMD is an artefact of Xen living in the same virtual
address space as a PV guest, and needing to maintain linear mappings.
~Andrew
On Fri, Feb 20, 2015 at 10:47:52AM +0000, Andrew Cooper wrote:
> On 20/02/15 01:49, Linus Torvalds wrote:
> > On Thu, Feb 19, 2015 at 5:05 PM, Kirill A. Shutemov
> > <[email protected]> wrote:
> >> I feel like I'm missing some very basic background on how Xen works, but
> >> why does it set _PAGE_GLOBAL on userspace entries? It seems strange to me.
> > It is definitely strange. I'm guessing that it's some ancient Xen hack
> > for the early Intel virtualization that used to have absolutely
> > horrendous vmenter/exit costs, including very much the TLB overhead.
> >
> > These days, Intel has address space identifiers, and doesn't flush the
> > whole TLB on VM entry/exit, so it's probably pointless to play games
> > with the global bit.
>
> It was introduced in 2006, but has nothing to do with VT-x
>
> http://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=6f562e72cdc4b7e1519e23be75f812aebbf41db3
>
> As long mode drops segment limit checking, the only way to protect a
> 64bit PV kernel from its userspace (both of which run in ring3 on user
> pages) is to maintain two sets of pagetables and switch between them on
> guest kernel/user context switches. The user set lacks kernel mappings.
>
> I can't comment about the performance impact of the patch (way before my
> time), but the justification was to try and reduce the overhead of guest
> context switches.
IIUC, it tries to reduce the cost of userspace<->kernel switches in the
guest. It's still hopeless: the kernel mappings are always TLB-cold, right?
> > I get the feeling that a lot of Xen stuff is that kind of "legacy
> > hacks" that should just be cleaned up, but nobody has the energy or
> > the interest.
>
> Time, mainly.
>
> There certainly are areas which should be up for re-evaluation, given 9
> years of change in hardware.
Is Xen PV still widely used? I'm surprised that users can tolerate this
kind of overhead.
--
Kirill A. Shutemov
On 20/02/15 11:29, Kirill A. Shutemov wrote:
> On Fri, Feb 20, 2015 at 10:47:52AM +0000, Andrew Cooper wrote:
>> On 20/02/15 01:49, Linus Torvalds wrote:
>>> On Thu, Feb 19, 2015 at 5:05 PM, Kirill A. Shutemov
>>> <[email protected]> wrote:
>>>> I feel like I'm missing some very basic background on how Xen works, but
>>>> why does it set _PAGE_GLOBAL on userspace entries? It seems strange to me.
>>> It is definitely strange. I'm guessing that it's some ancient Xen hack
>>> for the early Intel virtualization that used to have absolutely
>>> horrendous vmenter/exit costs, including very much the TLB overhead.
>>>
>>> These days, Intel has address space identifiers, and doesn't flush the
>>> whole TLB on VM entry/exit, so it's probably pointless to play games
>>> with the global bit.
>> It was introduced in 2006, but has nothing to do with VT-x
>>
>> http://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=6f562e72cdc4b7e1519e23be75f812aebbf41db3
>>
>> As long mode drops segment limit checking, the only way to protect a
>> 64bit PV kernel from its userspace (both of which run in ring3 on user
>> pages) is to maintain two sets of pagetables and switch between them on
>> guest kernel/user context switches. The user set lacks kernel mappings.
>>
>> I can't comment about the performance impact of the patch (way before my
>> time), but the justification was to try and reduce the overhead of guest
>> context switches.
> IIUC, it tries to reduce the cost of userspace<->kernel switches in the
> guest. It's still hopeless: the kernel mappings are always TLB-cold, right?
There is no way to avoid the kernel mappings being TLB-cold on a guest
user -> kernel context switch.
A lot of the "legacy 32bit stuff" which was dropped in AMD64 was
exactly what Xen was using to efficiently make PV guests safe.
>
>>> I get the feeling that a lot of Xen stuff is that kind of "legacy
>>> hacks" that should just be cleaned up, but nobody has the energy or
>>> the interest.
>> Time, mainly.
>>
>> There certainly are areas which should be up for re-evaluation, given 9
>> years of change in hardware.
> Is Xen PV still widely used?
Dom0 realistically still needs to be PV.
PVH (hardware extensions but no qemu emulating a motherboard) is on the
horizon but still very much experimental with open issues needing to be
solved.
> I'm surprised that users can tolerate this kind of overhead.
On modern hardware, most workloads are now "better" in HVM guests, but
even just 5 years ago, the vmentry/vmexit overhead tended to outweigh
the PV overheads.
On the other hand, unikernel virtual machines can always be more
efficient as PV guests, so PV is not going to die any time soon.
~Andrew
Hi everyone,
On Thu, 2015-02-19 at 17:01 +0000, Mel Gorman wrote:
> On Thu, Feb 19, 2015 at 01:06:53PM +0000, David Vrabel wrote:
> I cannot think of a reason why this would fail for NUMA balancing on bare
> metal. The PAGE_NONE protection clears the present bit on p[te|md]_modify
> so the expectations are matched before or after the patch is applied. So,
> for bare metal at least
>
> Acked-by: Mel Gorman <[email protected]>
>
> I *think* this will work ok with Xen but I cannot 100% convince myself.
> I'm adding Wei Liu to the cc who may have a Xen PV setup handy that
> supports NUMA and may be able to test the patch to confirm.
>
I'm not Wei, but I've been able to test a kernel with David's patch in
the following conditions:
1. as Dom0 kernel, when Xen does not have any virtual NUMA support
2. as DomU PV kernel, when Xen does not have any virtual NUMA support
3. as DomU PV kernel, when Xen _does_ _have_ virtual NUMA support
(i.e., Wei's code)
Cases 1. and 2. have been, I believe, tested by David already, but
anyways... :-)
Case 3. worked well for me, as the following commands show. In fact,
with this in guest config file:
vnuma = [ [ "pnode=0","size=1000","vcpus=0-3","vdistances=10,20" ],
[ "pnode=1","size=1000","vcpus=4-7","vdistances=20,10" ],
]
This is what I get from inside the guest:
root@test-pv:~# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 951 MB
node 0 free: 868 MB
node 1 cpus: 4 5 6 7
node 1 size: 968 MB
node 1 free: 924 MB
node distances:
node 0 1
0: 10 20
1: 20 10
And this is it from the host:
root@Zhaman:~# xl debug-keys u ; xl dmesg |tail -12
(XEN) Memory location of each domain:
(XEN) Domain 0 (total: 1047417):
(XEN) Node 0: 1031009
(XEN) Node 1: 16408
(XEN) Domain 1 (total: 512000):
(XEN) Node 0: 256000
(XEN) Node 1: 256000
(XEN) 2 vnodes, 8 vcpus, guest physical layout:
(XEN) 0: pnode 0, vcpus 0-3
(XEN) 0000000000000000 - 000000003e800000
(XEN) 1: pnode 1, vcpus 4-7
(XEN) 000000003e800000 - 000000007d000000
Still inside the guest, I see this:
root@test-pv:~# cat /proc/sys/kernel/numa_balancing
1
And this:
root@test-pv:~# grep numa /proc/vmstat
numa_hit 65987
numa_miss 0
numa_foreign 0
numa_interleave 14473
numa_local 58642
numa_other 7345
numa_pte_updates 596
numa_huge_pte_updates 0
numa_hint_faults 479
numa_hint_faults_local 420
numa_pages_migrated 51
So, yes, I would say this works with Xen, is that correct, Mel?
I'll try running more complex stuff like 'perf bench numa' inside the
guest and see what happens...
Regards,
Dario
On Mon, Feb 23, 2015 at 03:13:48PM +0000, Dario Faggioli wrote:
> Hi everyone,
>
> On Thu, 2015-02-19 at 17:01 +0000, Mel Gorman wrote:
> > On Thu, Feb 19, 2015 at 01:06:53PM +0000, David Vrabel wrote:
>
> > I cannot think of a reason why this would fail for NUMA balancing on bare
> > metal. The PAGE_NONE protection clears the present bit on p[te|md]_modify
> > so the expectations are matched before or after the patch is applied. So,
> > for bare metal at least
> >
> > Acked-by: Mel Gorman <[email protected]>
> >
> > I *think* this will work ok with Xen but I cannot 100% convince myself.
> > I'm adding Wei Liu to the cc who may have a Xen PV setup handy that
> > supports NUMA and may be able to test the patch to confirm.
> >
> I'm not Wei, but I've been able to test a kernel with David's patch in
> the following conditions:
>
Thanks very much, it's appreciated.
> 1. as Dom0 kernel, when Xen does not have any virtual NUMA support
> 2. as DomU PV kernel, when Xen does not have any virtual NUMA support
> 3. as DomU PV kernel, when Xen _does_ _have_ virtual NUMA support
> (i.e., Wei's code)
>
> Cases 1. and 2. have been, I believe, tested by David already, but
> anyways... :-)
>
> Case 3. worked well for me, as the following commands show. In fact,
> with this in guest config file:
>
> <SNIP>
>
> And this:
>
> root@test-pv:~# grep numa /proc/vmstat
> numa_hit 65987
> numa_miss 0
> numa_foreign 0
> numa_interleave 14473
> numa_local 58642
> numa_other 7345
> numa_pte_updates 596
> numa_huge_pte_updates 0
> numa_hint_faults 479
> numa_hint_faults_local 420
> numa_pages_migrated 51
>
> So, yes, I would say this works with Xen, is that correct, Mel?
>
Yes, these stats indicate that NUMA balancing is active. There is not
much activity, but it's there.
> I'll try running more complex stuff like 'perf bench numa' inside the
> guest and see what happens...
>
Thanks.
--
Mel Gorman
SUSE Labs