2014-04-07 15:11:07

by Mel Gorman

Subject: [RFC PATCH 0/3] Use an alternative to _PAGE_PROTNONE for _PAGE_NUMA

Aliasing _PAGE_NUMA and _PAGE_PROTNONE had some convenient properties but
it ultimately gave Xen a headache and pisses off almost everybody who looks
closely at it. Two discussions on "why this makes sense" is one
discussion too many so rather than have a third there is this series.

Conceptually it's simple -- use an unused physical address bit for _PAGE_NUMA
and make it a 64-bit-only feature on x86. This had been avoided before
because if the physical address space expands we are back to square one,
but let's worry about that when it happens unless the x86 maintainers or
hardware people warn us that we're about to run headlong into a wall.

Testing was minimal -- short lived JVM and autonumabench tests that trigger
the relevant paths for NUMA balancing. Functionally it did not die miserably.
Performance looks as expected with no major changes.

arch/x86/Kconfig | 2 +-
arch/x86/include/asm/pgtable.h | 8 +++----
arch/x86/include/asm/pgtable_types.h | 44 ++++++++++++++++++++----------------
mm/memory.c | 12 ----------
4 files changed, 29 insertions(+), 37 deletions(-)

--
1.8.4.5


2014-04-07 15:11:03

by Mel Gorman

Subject: [PATCH 1/3] x86: Require x86-64 for automatic NUMA balancing

Automatic NUMA balancing currently depends on reusing the PROT_NONE
bit which has caused problems on Xen. In preparation for using one of
the unused physical address bits this patch requires x86-64 for automatic
NUMA balancing. 32-bit support for NUMA on x86 is no longer interesting
and the loss of automatic NUMA balancing support should be no surprise.

Signed-off-by: Mel Gorman <[email protected]>
---
arch/x86/Kconfig | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0af5250..084b1c1 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -26,7 +26,7 @@ config X86
select ARCH_MIGHT_HAVE_PC_SERIO
select HAVE_AOUT if X86_32
select HAVE_UNSTABLE_SCHED_CLOCK
- select ARCH_SUPPORTS_NUMA_BALANCING
+ select ARCH_SUPPORTS_NUMA_BALANCING if X86_64
select ARCH_SUPPORTS_INT128 if X86_64
select ARCH_WANTS_PROT_NUMA_PROT_NONE
select HAVE_IDE
--
1.8.4.5

2014-04-07 15:10:56

by Mel Gorman

Subject: [PATCH 2/3] x86: Define _PAGE_NUMA with unused physical address bits PMD and PTE levels

_PAGE_NUMA is currently an alias of _PAGE_PROTNONE to trap NUMA hinting
faults. As the bit is shared, care is taken that _PAGE_NUMA is only used in
places where _PAGE_PROTNONE could not reach, but this still causes problems
on Xen and is conceptually difficult.

Fundamentally, we only need the _PAGE_NUMA bit to tell the difference
between an entry that is really unmapped and a page that is protected
for NUMA hinting faults. Due to physical address limitations bits 52:62
are free so we can currently use them. As the present bit is cleared when
making a NUMA PTE, the hinting faults will still be trapped. It means that
32-bit NUMA cannot use automatic NUMA balancing but it is improbable that
anyone cares about that configuration.

In the future there will be a problem when the physical address space
expands because the bits may no longer be free. There is also the risk that
the hardware people are planning to use these bits for some other purpose.
When/if this happens then an option would be to use bit 11 and disable
kmemcheck if automatic NUMA balancing is enabled assuming bit 11 has not
been used for something else in the meantime.

Signed-off-by: Mel Gorman <[email protected]>
---
arch/x86/include/asm/pgtable.h | 8 +++----
arch/x86/include/asm/pgtable_types.h | 44 ++++++++++++++++++++----------------
2 files changed, 28 insertions(+), 24 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index bbc8b12..58fa7d1 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -447,8 +447,8 @@ static inline int pte_same(pte_t a, pte_t b)

static inline int pte_present(pte_t a)
{
- return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
- _PAGE_NUMA);
+ return (pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
+ _PAGE_NUMA)) != 0;
}

#define pte_accessible pte_accessible
@@ -477,8 +477,8 @@ static inline int pmd_present(pmd_t pmd)
* the _PAGE_PSE flag will remain set at all times while the
* _PAGE_PRESENT bit is clear).
*/
- return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE |
- _PAGE_NUMA);
+ return (pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE |
+ _PAGE_NUMA)) != 0;
}

static inline int pmd_none(pmd_t pmd)
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 1aa9ccd..f3eafd2 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -25,6 +25,15 @@
#define _PAGE_BIT_SPLITTING _PAGE_BIT_UNUSED1 /* only valid on a PSE pmd */
#define _PAGE_BIT_NX 63 /* No execute: only valid after cpuid check */

+/*
+ * Software bits ignored by the page table walker
+ * At the time of writing, different levels have bits that are ignored. Due
+ * to physical address limitations, bits 52:62 should be ignored for the PMD
+ * and PTE levels and are available for use by software. Be aware that this
+ * may change if the physical address space expands.
+ */
+#define _PAGE_BIT_NUMA 62
+
/* If _PAGE_BIT_PRESENT is clear, we use these: */
/* - if the user mapped it with PROT_NONE; pte_present gives true */
#define _PAGE_BIT_PROTNONE _PAGE_BIT_GLOBAL
@@ -56,6 +65,21 @@
#endif

/*
+ * _PAGE_NUMA distinguishes between a numa hinting minor fault and a page
+ * that is not present. The hinting fault gathers numa placement statistics
+ * (see pte_numa()). The bit is always zero when the PTE is not present.
+ *
+ * The bit picked must be always zero when the pmd is present and not
+ * present, so that we don't lose information when we set it while
+ * atomically clearing the present bit.
+ */
+#ifdef CONFIG_NUMA_BALANCING
+#define _PAGE_NUMA (_AT(pteval_t, 1) << _PAGE_BIT_NUMA)
+#else
+#define _PAGE_NUMA (_AT(pteval_t, 0))
+#endif
+
+/*
* The same hidden bit is used by kmemcheck, but since kmemcheck
* works on kernel pages while soft-dirty engine on user space,
* they do not conflict with each other.
@@ -94,26 +118,6 @@
#define _PAGE_FILE (_AT(pteval_t, 1) << _PAGE_BIT_FILE)
#define _PAGE_PROTNONE (_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)

-/*
- * _PAGE_NUMA indicates that this page will trigger a numa hinting
- * minor page fault to gather numa placement statistics (see
- * pte_numa()). The bit picked (8) is within the range between
- * _PAGE_FILE (6) and _PAGE_PROTNONE (8) bits. Therefore, it doesn't
- * require changes to the swp entry format because that bit is always
- * zero when the pte is not present.
- *
- * The bit picked must be always zero when the pmd is present and not
- * present, so that we don't lose information when we set it while
- * atomically clearing the present bit.
- *
- * Because we shared the same bit (8) with _PAGE_PROTNONE this can be
- * interpreted as _PAGE_NUMA only in places that _PAGE_PROTNONE
- * couldn't reach, like handle_mm_fault() (see access_error in
- * arch/x86/mm/fault.c, the vma protection must not be PROT_NONE for
- * handle_mm_fault() to be invoked).
- */
-#define _PAGE_NUMA _PAGE_PROTNONE
-
#define _PAGE_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | \
_PAGE_ACCESSED | _PAGE_DIRTY)
#define _KERNPG_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | \
--
1.8.4.5

2014-04-07 15:12:23

by Mel Gorman

Subject: [PATCH 3/3] mm: Allow FOLL_NUMA on FOLL_FORCE

As _PAGE_NUMA is no longer aliased to _PAGE_PROTNONE there should be no
confusion between them. It should be possible to kick away the special
casing in __get_user_pages.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/memory.c | 12 ------------
1 file changed, 12 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 22dfa61..b9c35a7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1714,18 +1714,6 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
vm_flags &= (gup_flags & FOLL_FORCE) ?
(VM_MAYREAD | VM_MAYWRITE) : (VM_READ | VM_WRITE);

- /*
- * If FOLL_FORCE and FOLL_NUMA are both set, handle_mm_fault
- * would be called on PROT_NONE ranges. We must never invoke
- * handle_mm_fault on PROT_NONE ranges or the NUMA hinting
- * page faults would unprotect the PROT_NONE ranges if
- * _PAGE_NUMA and _PAGE_PROTNONE are sharing the same pte/pmd
- * bitflag. So to avoid that, don't set FOLL_NUMA if
- * FOLL_FORCE is set.
- */
- if (!(gup_flags & FOLL_FORCE))
- gup_flags |= FOLL_NUMA;
-
i = 0;

do {
--
1.8.4.5

2014-04-07 15:32:44

by David Vrabel

Subject: Re: [PATCH 2/3] x86: Define _PAGE_NUMA with unused physical address bits PMD and PTE levels

On 07/04/14 16:10, Mel Gorman wrote:
> _PAGE_NUMA is currently an alias of _PROT_PROTNONE to trap NUMA hinting
> faults. As the bit is shared care is taken that _PAGE_NUMA is only used in
> places where _PAGE_PROTNONE could not reach but this still causes problems
> on Xen and conceptually difficult.

The problem with Xen guests occurred because mprotect() /was/ confusing
PROTNONE mappings with _PAGE_NUMA and clearing the non-existent NUMA hints.

David

2014-04-07 15:49:43

by Mel Gorman

Subject: Re: [PATCH 2/3] x86: Define _PAGE_NUMA with unused physical address bits PMD and PTE levels

On Mon, Apr 07, 2014 at 04:32:39PM +0100, David Vrabel wrote:
> On 07/04/14 16:10, Mel Gorman wrote:
> > _PAGE_NUMA is currently an alias of _PROT_PROTNONE to trap NUMA hinting
> > faults. As the bit is shared care is taken that _PAGE_NUMA is only used in
> > places where _PAGE_PROTNONE could not reach but this still causes problems
> > on Xen and conceptually difficult.
>
> The problem with Xen guests occurred because mprotect() /was/ confusing
> PROTNONE mappings with _PAGE_NUMA and clearing the non-existant NUMA hints.
>

I didn't bother spelling it out in case I gave the impression that I was
blaming Xen for the problem. As the bit has now changed, does it help
the Xen problem or cause another collision of some sort? There is no
guarantee _PAGE_NUMA will remain as bit 62 but at worst it'll use bit 11
and NUMA_BALANCING will depend on !KMEMCHECK.

--
Mel Gorman
SUSE Labs

2014-04-07 16:19:17

by Cyrill Gorcunov

Subject: Re: [PATCH 2/3] x86: Define _PAGE_NUMA with unused physical address bits PMD and PTE levels

On Mon, Apr 07, 2014 at 04:49:35PM +0100, Mel Gorman wrote:
> On Mon, Apr 07, 2014 at 04:32:39PM +0100, David Vrabel wrote:
> > On 07/04/14 16:10, Mel Gorman wrote:
> > > _PAGE_NUMA is currently an alias of _PROT_PROTNONE to trap NUMA hinting
> > > faults. As the bit is shared care is taken that _PAGE_NUMA is only used in
> > > places where _PAGE_PROTNONE could not reach but this still causes problems
> > > on Xen and conceptually difficult.
> >
> > The problem with Xen guests occurred because mprotect() /was/ confusing
> > PROTNONE mappings with _PAGE_NUMA and clearing the non-existant NUMA hints.
>
> I didn't bother spelling it out in case I gave the impression that I was
> blaming Xen for the problem. As the bit is now changes, does it help
> the Xen problem or cause another collision of some sort? There is no
> guarantee _PAGE_NUMA will remain as bit 62 but at worst it'll use bit 11
> and NUMA_BALANCING will depend in !KMEMCHECK.

Fwiw, we're using bit 11 for soft-dirty tracking, so I really hope the worst
case never happens. (At the moment I'm trying to figure out whether, with this
set, it would be possible to clean up the ugly macros in pgoff_to_pte for
2-level page tables.)

2014-04-07 18:29:03

by Mel Gorman

Subject: Re: [PATCH 2/3] x86: Define _PAGE_NUMA with unused physical address bits PMD and PTE levels

On Mon, Apr 07, 2014 at 08:19:10PM +0400, Cyrill Gorcunov wrote:
> On Mon, Apr 07, 2014 at 04:49:35PM +0100, Mel Gorman wrote:
> > On Mon, Apr 07, 2014 at 04:32:39PM +0100, David Vrabel wrote:
> > > On 07/04/14 16:10, Mel Gorman wrote:
> > > > _PAGE_NUMA is currently an alias of _PROT_PROTNONE to trap NUMA hinting
> > > > faults. As the bit is shared care is taken that _PAGE_NUMA is only used in
> > > > places where _PAGE_PROTNONE could not reach but this still causes problems
> > > > on Xen and conceptually difficult.
> > >
> > > The problem with Xen guests occurred because mprotect() /was/ confusing
> > > PROTNONE mappings with _PAGE_NUMA and clearing the non-existant NUMA hints.
> >
> > I didn't bother spelling it out in case I gave the impression that I was
> > blaming Xen for the problem. As the bit is now changes, does it help
> > the Xen problem or cause another collision of some sort? There is no
> > guarantee _PAGE_NUMA will remain as bit 62 but at worst it'll use bit 11
> > and NUMA_BALANCING will depend in !KMEMCHECK.
>
> Fwiw, we're using bit 11 for soft-dirty tracking, so i really hope worst case
> never happen. (At the moment I'm trying to figure out if with this set
> it would be possible to clean up ugly macros in pgoff_to_pte for 2 level pages).

I had considered the soft-dirty tracking usage of the same bit. I thought I'd
be able to swizzle around it or, as a further worst case, make soft-dirty and
automatic NUMA balancing mutually exclusive. Unfortunately, upon examination
it's not obvious how to have both of them share a bit and I suspect any
attempt to do so will break CRIU. In my current tree, NUMA_BALANCING cannot be
set if MEM_SOFT_DIRTY is, which is not particularly satisfactory. Next on the
list is examining whether _PAGE_BIT_IOMAP can be used.

--
Mel Gorman
SUSE Labs

2014-04-07 19:16:26

by Cyrill Gorcunov

Subject: Re: [PATCH 2/3] x86: Define _PAGE_NUMA with unused physical address bits PMD and PTE levels

On Mon, Apr 07, 2014 at 07:28:54PM +0100, Mel Gorman wrote:
> > > I didn't bother spelling it out in case I gave the impression that I was
> > > blaming Xen for the problem. As the bit is now changes, does it help
> > > the Xen problem or cause another collision of some sort? There is no
> > > guarantee _PAGE_NUMA will remain as bit 62 but at worst it'll use bit 11
> > > and NUMA_BALANCING will depend in !KMEMCHECK.
> >
> > Fwiw, we're using bit 11 for soft-dirty tracking, so i really hope worst case
> > never happen. (At the moment I'm trying to figure out if with this set
> > it would be possible to clean up ugly macros in pgoff_to_pte for 2 level pages).
>
> I had considered the soft-dirty tracking usage of the same bit. I thought I'd
> be able to swizzle around it or a further worst case of having soft-dirty and
> automatic NUMA balancing mutually exclusive. Unfortunately upon examination
> it's not obvious how to have both of them share a bit and I suspect any
> attempt to will break CRIU. In my current tree, NUMA_BALANCING cannot be
> set if MEM_SOFT_DIRTY which is not particularly satisfactory. Next on the
> list is examining if _PAGE_BIT_IOMAP can be used.

Thanks for the info, Mel! It seems that if indeed no more space is left on
x86-64 (the very worst case, which I still think won't happen anytime soon)
we'll have to make them mutually exclusive. But for now (with bit 62 used for
NUMA) they can live together, right?

2014-04-07 19:29:13

by H. Peter Anvin

Subject: Re: [PATCH 2/3] x86: Define _PAGE_NUMA with unused physical address bits PMD and PTE levels

On 04/07/2014 11:28 AM, Mel Gorman wrote:
>
> I had considered the soft-dirty tracking usage of the same bit. I thought I'd
> be able to swizzle around it or a further worst case of having soft-dirty and
> automatic NUMA balancing mutually exclusive. Unfortunately upon examination
> it's not obvious how to have both of them share a bit and I suspect any
> attempt to will break CRIU. In my current tree, NUMA_BALANCING cannot be
> set if MEM_SOFT_DIRTY which is not particularly satisfactory. Next on the
> list is examining if _PAGE_BIT_IOMAP can be used.
>

Didn't we smoke the last user of _PAGE_BIT_IOMAP?

-hpa

2014-04-07 19:36:50

by Cyrill Gorcunov

Subject: Re: [PATCH 2/3] x86: Define _PAGE_NUMA with unused physical address bits PMD and PTE levels

On Mon, Apr 07, 2014 at 12:27:10PM -0700, H. Peter Anvin wrote:
> On 04/07/2014 11:28 AM, Mel Gorman wrote:
> >
> > I had considered the soft-dirty tracking usage of the same bit. I thought I'd
> > be able to swizzle around it or a further worst case of having soft-dirty and
> > automatic NUMA balancing mutually exclusive. Unfortunately upon examination
> > it's not obvious how to have both of them share a bit and I suspect any
> > attempt to will break CRIU. In my current tree, NUMA_BALANCING cannot be
> > set if MEM_SOFT_DIRTY which is not particularly satisfactory. Next on the
> > list is examining if _PAGE_BIT_IOMAP can be used.
>
> Didn't we smoke the last user of _PAGE_BIT_IOMAP?

Seems so, at least for non-kernel pages (not considering the references to
this bit in Xen code, which I simply don't know, but I guess it's used for
kernel pages only).

2014-04-07 19:43:31

by H. Peter Anvin

Subject: Re: [PATCH 2/3] x86: Define _PAGE_NUMA with unused physical address bits PMD and PTE levels

On 04/07/2014 12:36 PM, Cyrill Gorcunov wrote:
> On Mon, Apr 07, 2014 at 12:27:10PM -0700, H. Peter Anvin wrote:
>> On 04/07/2014 11:28 AM, Mel Gorman wrote:
>>>
>>> I had considered the soft-dirty tracking usage of the same bit. I thought I'd
>>> be able to swizzle around it or a further worst case of having soft-dirty and
>>> automatic NUMA balancing mutually exclusive. Unfortunately upon examination
>>> it's not obvious how to have both of them share a bit and I suspect any
>>> attempt to will break CRIU. In my current tree, NUMA_BALANCING cannot be
>>> set if MEM_SOFT_DIRTY which is not particularly satisfactory. Next on the
>>> list is examining if _PAGE_BIT_IOMAP can be used.
>>
>> Didn't we smoke the last user of _PAGE_BIT_IOMAP?
>
> Seems so, at least for non-kernel pages (not considering this bit references in
> xen code, which i simply don't know but i guess it's used for kernel pages only).
>

David Vrabel has a patchset which I presumed would be pulled through the
Xen tree this merge window:

[PATCHv5 0/8] x86/xen: fixes for mapping high MMIO regions (and remove
_PAGE_IOMAP)

That frees up this bit.

-hpa

2014-04-07 19:58:40

by Dave Hansen

Subject: Re: [PATCH 2/3] x86: Define _PAGE_NUMA with unused physical address bits PMD and PTE levels

On 04/07/2014 08:10 AM, Mel Gorman wrote:
> +/*
> + * Software bits ignored by the page table walker
> + * At the time of writing, different levels have bits that are ignored. Due
> + * to physical address limitations, bits 52:62 should be ignored for the PMD
> + * and PTE levels and are available for use by software. Be aware that this
> + * may change if the physical address space expands.
> + */
> +#define _PAGE_BIT_NUMA 62

Doesn't moving it up to the high bits break pte_modify()'s assumptions?
I was thinking of this nugget from change_pte_range():

ptent = ptep_modify_prot_start(mm, addr, pte);
if (pte_numa(ptent))
ptent = pte_mknonnuma(ptent);
ptent = pte_modify(ptent, newprot);

pte_modify() pulls off all the high bits out of 'ptent' and only adds
them back if they're in newprot (which as far as I can tell comes from
the VMA). So I _think_ it'll axe the _PAGE_NUMA out of 'ptent'.

2014-04-07 21:19:53

by Mel Gorman

Subject: Re: [PATCH 2/3] x86: Define _PAGE_NUMA with unused physical address bits PMD and PTE levels

On Mon, Apr 07, 2014 at 12:27:10PM -0700, H. Peter Anvin wrote:
> On 04/07/2014 11:28 AM, Mel Gorman wrote:
> >
> > I had considered the soft-dirty tracking usage of the same bit. I thought I'd
> > be able to swizzle around it or a further worst case of having soft-dirty and
> > automatic NUMA balancing mutually exclusive. Unfortunately upon examination
> > it's not obvious how to have both of them share a bit and I suspect any
> > attempt to will break CRIU. In my current tree, NUMA_BALANCING cannot be
> > set if MEM_SOFT_DIRTY which is not particularly satisfactory. Next on the
> > list is examining if _PAGE_BIT_IOMAP can be used.
> >
>
> Didn't we smoke the last user of _PAGE_BIT_IOMAP?
>

There are still some users of _PAGE_IOMAP with Xen being the main user.
For x86 on bare metal it looks like userspace should never have a PTE with
_PAGE_IOMAP set so it should be usable as _PAGE_NUMA. Patches that do that
are currently being tested, but a side-effect was that I had to disable
support on Xen as Xen appears to use it to distinguish between Xen PTEs
and MFNs. It's unclear what automatic NUMA balancing on Xen even means --
are NUMA nodes always mapped to the physical topology? What is sensible
behaviour if guest and host both run it? etc. If they need it, we can then
examine the proper way to support _PAGE_NUMA on Xen.

--
Mel Gorman
SUSE Labs

2014-04-07 21:25:43

by Mel Gorman

Subject: Re: [PATCH 2/3] x86: Define _PAGE_NUMA with unused physical address bits PMD and PTE levels

On Mon, Apr 07, 2014 at 12:42:40PM -0700, H. Peter Anvin wrote:
> On 04/07/2014 12:36 PM, Cyrill Gorcunov wrote:
> > On Mon, Apr 07, 2014 at 12:27:10PM -0700, H. Peter Anvin wrote:
> >> On 04/07/2014 11:28 AM, Mel Gorman wrote:
> >>>
> >>> I had considered the soft-dirty tracking usage of the same bit. I thought I'd
> >>> be able to swizzle around it or a further worst case of having soft-dirty and
> >>> automatic NUMA balancing mutually exclusive. Unfortunately upon examination
> >>> it's not obvious how to have both of them share a bit and I suspect any
> >>> attempt to will break CRIU. In my current tree, NUMA_BALANCING cannot be
> >>> set if MEM_SOFT_DIRTY which is not particularly satisfactory. Next on the
> >>> list is examining if _PAGE_BIT_IOMAP can be used.
> >>
> >> Didn't we smoke the last user of _PAGE_BIT_IOMAP?
> >
> > Seems so, at least for non-kernel pages (not considering this bit references in
> > xen code, which i simply don't know but i guess it's used for kernel pages only).
> >
>
> David Vrabel has a patchset which I presumed would be pulled through the
> Xen tree this merge window:
>
> [PATCHv5 0/8] x86/xen: fixes for mapping high MMIO regions (and remove
> _PAGE_IOMAP)
>
> That frees up this bit.
>

Thanks, I was not aware of that patch. Based on it, I intend to make
automatic NUMA balancing depend on !XEN and see what the reaction is. If
support for Xen is really required then it can potentially be re-enabled
if/when that series is merged, assuming they do not need the bit for
something else.

--
Mel Gorman
SUSE Labs

2014-04-08 04:04:56

by Steven Noonan

Subject: Re: [PATCH 2/3] x86: Define _PAGE_NUMA with unused physical address bits PMD and PTE levels

On Mon, Apr 7, 2014 at 2:25 PM, Mel Gorman <[email protected]> wrote:
> On Mon, Apr 07, 2014 at 12:42:40PM -0700, H. Peter Anvin wrote:
>> On 04/07/2014 12:36 PM, Cyrill Gorcunov wrote:
>> > On Mon, Apr 07, 2014 at 12:27:10PM -0700, H. Peter Anvin wrote:
>> >> On 04/07/2014 11:28 AM, Mel Gorman wrote:
>> >>>
>> >>> I had considered the soft-dirty tracking usage of the same bit. I thought I'd
>> >>> be able to swizzle around it or a further worst case of having soft-dirty and
>> >>> automatic NUMA balancing mutually exclusive. Unfortunately upon examination
>> >>> it's not obvious how to have both of them share a bit and I suspect any
>> >>> attempt to will break CRIU. In my current tree, NUMA_BALANCING cannot be
>> >>> set if MEM_SOFT_DIRTY which is not particularly satisfactory. Next on the
>> >>> list is examining if _PAGE_BIT_IOMAP can be used.
>> >>
>> >> Didn't we smoke the last user of _PAGE_BIT_IOMAP?
>> >
>> > Seems so, at least for non-kernel pages (not considering this bit references in
>> > xen code, which i simply don't know but i guess it's used for kernel pages only).
>> >
>>
>> David Vrabel has a patchset which I presumed would be pulled through the
>> Xen tree this merge window:
>>
>> [PATCHv5 0/8] x86/xen: fixes for mapping high MMIO regions (and remove
>> _PAGE_IOMAP)
>>
>> That frees up this bit.
>>
>
> Thanks, I was not aware of that patch. Based on it, I intend to force
> automatic NUMA balancing to depend on !XEN and see what the reaction is. If
> support for Xen is really required then it potentially be re-enabled if/when
> that series is merged assuming they do not need the bit for something else.
>

Amazon EC2 does have large memory instance types with NUMA exposed to
the guest (e.g. c3.8xlarge, i2.8xlarge, etc), so it'd be preferable
(to me anyway) if we didn't require !XEN.

2014-04-08 09:31:39

by David Vrabel

Subject: Re: [PATCH 2/3] x86: Define _PAGE_NUMA with unused physical address bits PMD and PTE levels

On 07/04/14 20:36, Cyrill Gorcunov wrote:
> On Mon, Apr 07, 2014 at 12:27:10PM -0700, H. Peter Anvin wrote:
>> On 04/07/2014 11:28 AM, Mel Gorman wrote:
>>>
>>> I had considered the soft-dirty tracking usage of the same bit. I thought I'd
>>> be able to swizzle around it or a further worst case of having soft-dirty and
>>> automatic NUMA balancing mutually exclusive. Unfortunately upon examination
>>> it's not obvious how to have both of them share a bit and I suspect any
>>> attempt to will break CRIU. In my current tree, NUMA_BALANCING cannot be
>>> set if MEM_SOFT_DIRTY which is not particularly satisfactory. Next on the
>>> list is examining if _PAGE_BIT_IOMAP can be used.
>>
>> Didn't we smoke the last user of _PAGE_BIT_IOMAP?

Not yet.

A last minute regression with mapping of I/O regions from userspace was
found so I had to drop the series from 3.15. It should be back for 3.16.

> Seems so, at least for non-kernel pages (not considering this bit references in
> xen code, which i simply don't know but i guess it's used for kernel pages only).

Xen uses it for all I/O mappings, both kernel and userspace.

David

2014-04-08 15:18:01

by H. Peter Anvin

Subject: Re: [PATCH 2/3] x86: Define _PAGE_NUMA with unused physical address bits PMD and PTE levels

<snark>

Of course, it would also be preferable if Amazon (or anything else) didn't need Xen PV :(

On April 7, 2014 9:04:53 PM PDT, Steven Noonan <[email protected]> wrote:
>On Mon, Apr 7, 2014 at 2:25 PM, Mel Gorman <[email protected]> wrote:
>> On Mon, Apr 07, 2014 at 12:42:40PM -0700, H. Peter Anvin wrote:
>>> On 04/07/2014 12:36 PM, Cyrill Gorcunov wrote:
>>> > On Mon, Apr 07, 2014 at 12:27:10PM -0700, H. Peter Anvin wrote:
>>> >> On 04/07/2014 11:28 AM, Mel Gorman wrote:
>>> >>>
>>> >>> I had considered the soft-dirty tracking usage of the same bit.
>I thought I'd
>>> >>> be able to swizzle around it or a further worst case of having
>soft-dirty and
>>> >>> automatic NUMA balancing mutually exclusive. Unfortunately upon
>examination
>>> >>> it's not obvious how to have both of them share a bit and I
>suspect any
>>> >>> attempt to will break CRIU. In my current tree, NUMA_BALANCING
>cannot be
>>> >>> set if MEM_SOFT_DIRTY which is not particularly satisfactory.
>Next on the
>>> >>> list is examining if _PAGE_BIT_IOMAP can be used.
>>> >>
>>> >> Didn't we smoke the last user of _PAGE_BIT_IOMAP?
>>> >
>>> > Seems so, at least for non-kernel pages (not considering this bit
>references in
>>> > xen code, which i simply don't know but i guess it's used for
>kernel pages only).
>>> >
>>>
>>> David Vrabel has a patchset which I presumed would be pulled through
>the
>>> Xen tree this merge window:
>>>
>>> [PATCHv5 0/8] x86/xen: fixes for mapping high MMIO regions (and
>remove
>>> _PAGE_IOMAP)
>>>
>>> That frees up this bit.
>>>
>>
>> Thanks, I was not aware of that patch. Based on it, I intend to
>force
>> automatic NUMA balancing to depend on !XEN and see what the reaction
>is. If
>> support for Xen is really required then it potentially be re-enabled
>if/when
>> that series is merged assuming they do not need the bit for something
>else.
>>
>
>Amazon EC2 does have large memory instance types with NUMA exposed to
>the guest (e.g. c3.8xlarge, i2.8xlarge, etc), so it'd be preferable
>(to me anyway) if we didn't require !XEN.

--
Sent from my mobile phone. Please pardon brevity and lack of formatting.

2014-04-08 16:03:40

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCH 2/3] x86: Define _PAGE_NUMA with unused physical address bits PMD and PTE levels

.snip..
> >>> David Vrabel has a patchset which I presumed would be pulled through
> >the
> >>> Xen tree this merge window:
> >>>
> >>> [PATCHv5 0/8] x86/xen: fixes for mapping high MMIO regions (and
> >remove
> >>> _PAGE_IOMAP)
> >>>
> >>> That frees up this bit.
> >>>
> >>
> >> Thanks, I was not aware of that patch. Based on it, I intend to
> >force
> >> automatic NUMA balancing to depend on !XEN and see what the reaction
> >is. If
> >> support for Xen is really required then it potentially be re-enabled
> >if/when
> >> that series is merged assuming they do not need the bit for something
> >else.
> >>
> >
> >Amazon EC2 does have large memory instance types with NUMA exposed to
> >the guest (e.g. c3.8xlarge, i2.8xlarge, etc), so it'd be preferable
> >(to me anyway) if we didn't require !XEN.

What about the patch that David Vrabel posted:

http://osdir.com/ml/general/2014-03/msg41979.html

Has anybody taken it for a spin?

2014-04-08 16:18:47

by H. Peter Anvin

Subject: Re: [PATCH 2/3] x86: Define _PAGE_NUMA with unused physical address bits PMD and PTE levels

On 04/08/2014 09:02 AM, Konrad Rzeszutek Wilk wrote:
>>>
>>> Amazon EC2 does have large memory instance types with NUMA exposed to
>>> the guest (e.g. c3.8xlarge, i2.8xlarge, etc), so it'd be preferable
>>> (to me anyway) if we didn't require !XEN.
>
> What about the patch that David Vrabel posted:
>
> http://osdir.com/ml/general/2014-03/msg41979.html
>
> Has anybody taken it for a spin?
>

Oh lovely, more pvops in low level paths. I'm so thrilled.

Incidentally, I wasn't even Cc:'d on that patch and was only added to
the thread by Linus, but never saw the early bits of the thread
including the actual patch.

-hpa

2014-04-08 16:47:49

by Mel Gorman

Subject: Re: [PATCH 2/3] x86: Define _PAGE_NUMA with unused physical address bits PMD and PTE levels

On Tue, Apr 08, 2014 at 09:16:49AM -0700, H. Peter Anvin wrote:
> On 04/08/2014 09:02 AM, Konrad Rzeszutek Wilk wrote:
> >>>
> >>> Amazon EC2 does have large memory instance types with NUMA exposed to
> >>> the guest (e.g. c3.8xlarge, i2.8xlarge, etc), so it'd be preferable
> >>> (to me anyway) if we didn't require !XEN.
> >
> > What about the patch that David Vrabel posted:
> >
> > http://osdir.com/ml/general/2014-03/msg41979.html
> >
> > Has anybody taken it for a spin?
> >
>
> Oh lovely, more pvops in low level paths. I'm so thrilled.
>
> Incidentally, I wasn't even Cc:'d on that patch and was only added to
> the thread by Linus, but never saw the early bits of the thread
> including the actual patch.
>

I posted an alternative to that patch that confines the damage to the
NUMA pte helpers.

--
Mel Gorman
SUSE Labs

2014-04-08 16:51:15

by David Vrabel

Subject: Re: [PATCH 2/3] x86: Define _PAGE_NUMA with unused physical address bits PMD and PTE levels

On 08/04/14 17:16, H. Peter Anvin wrote:
> On 04/08/2014 09:02 AM, Konrad Rzeszutek Wilk wrote:
>>>>
>>>> Amazon EC2 does have large memory instance types with NUMA exposed to
>>>> the guest (e.g. c3.8xlarge, i2.8xlarge, etc), so it'd be preferable
>>>> (to me anyway) if we didn't require !XEN.
>>
>> What about the patch that David Vrabel posted:
>>
>> http://osdir.com/ml/general/2014-03/msg41979.html
>>
>> Has anybody taken it for a spin?
>>
>
> Oh lovely, more pvops in low level paths. I'm so thrilled.
>
> Incidentally, I wasn't even Cc:'d on that patch and was only added to
> the thread by Linus, but never saw the early bits of the thread
> including the actual patch.

I did resend a version CC'd to all the x86 maintainers and included some
performance figures for native (~1 extra clock cycle).

I've included it again below.

My preference would be to take this patch as it fixes things for both
NUMA rebalancing and any future uses that want to set or clear
_PAGE_PRESENT.
David

8<--------------
x86: use pv-ops in {pte, pmd}_{set,clear}_flags()

Instead of using native functions to operate on the PTEs in
pte_set_flags(), pte_clear_flags(), pmd_set_flags(), pmd_clear_flags()
use the PV aware ones.

This fixes a regression in Xen PV guests introduced by 1667918b6483
(mm: numa: clear numa hinting information on mprotect).

This has negligible performance impact on native since the pte_val()
and __pte() (etc.) calls are patched at runtime when running on bare
metal. Measurements on a 3 GHz AMD 4284 give approx. 0.3 ns (~1 clock
cycle) of additional time for each function.

Xen PV guest page tables require that their entries use machine
addresses if the present bit (_PAGE_PRESENT) is set, and (for
successful migration) non-present PTEs must use pseudo-physical
addresses. This is because, on migration, only the MFNs in present
PTEs are translated to PFNs (canonicalised) so that they may be
translated back to the new MFNs in the destination domain
(uncanonicalised).

pte_mknonnuma(), pmd_mknonnuma(), pte_mknuma() and pmd_mknuma() set
and clear the _PAGE_PRESENT bit using pte_set_flags(),
pte_clear_flags(), etc.

In a Xen PV guest, these functions must translate MFNs to PFNs when
clearing _PAGE_PRESENT and translate PFNs to MFNs when setting
_PAGE_PRESENT.

Signed-off-by: David Vrabel <[email protected]>
Cc: Steven Noonan <[email protected]>
Cc: Elena Ufimtseva <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: <[email protected]> [3.12+]
---
arch/x86/include/asm/pgtable.h | 12 ++++++------
1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index bbc8b12..323e5e2 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -174,16 +174,16 @@ static inline int has_transparent_hugepage(void)

static inline pte_t pte_set_flags(pte_t pte, pteval_t set)
{
- pteval_t v = native_pte_val(pte);
+ pteval_t v = pte_val(pte);

- return native_make_pte(v | set);
+ return __pte(v | set);
}

static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear)
{
- pteval_t v = native_pte_val(pte);
+ pteval_t v = pte_val(pte);

- return native_make_pte(v & ~clear);
+ return __pte(v & ~clear);
}

static inline pte_t pte_mkclean(pte_t pte)
@@ -248,14 +248,14 @@ static inline pte_t pte_mkspecial(pte_t pte)

static inline pmd_t pmd_set_flags(pmd_t pmd, pmdval_t set)
{
- pmdval_t v = native_pmd_val(pmd);
+ pmdval_t v = pmd_val(pmd);

return __pmd(v | set);
}

static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
{
- pmdval_t v = native_pmd_val(pmd);
+ pmdval_t v = pmd_val(pmd);

return __pmd(v & ~clear);
}

2014-04-08 16:51:30

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 2/3] x86: Define _PAGE_NUMA with unused physical address bits PMD and PTE levels

On Tue, Apr 08, 2014 at 12:02:50PM -0400, Konrad Rzeszutek Wilk wrote:
> .snip..
> > >>> David Vrabel has a patchset which I presumed would be pulled through
> > >the
> > >>> Xen tree this merge window:
> > >>>
> > >>> [PATCHv5 0/8] x86/xen: fixes for mapping high MMIO regions (and
> > >remove
> > >>> _PAGE_IOMAP)
> > >>>
> > >>> That frees up this bit.
> > >>>
> > >>
> > >> Thanks, I was not aware of that patch. Based on it, I intend to
> > >force
> > >> automatic NUMA balancing to depend on !XEN and see what the reaction
> > >is. If
> > >> support for Xen is really required then it potentially be re-enabled
> > >if/when
> > >> that series is merged assuming they do not need the bit for something
> > >else.
> > >>
> > >
> > >Amazon EC2 does have large memory instance types with NUMA exposed to
> > >the guest (e.g. c3.8xlarge, i2.8xlarge, etc), so it'd be preferable
> > >(to me anyway) if we didn't require !XEN.
>
> What about the patch that David Vrabel posted:
>
> http://osdir.com/ml/general/2014-03/msg41979.html
>
> Has anybody taken it for a spin?

Alternatively "[PATCH 4/5] mm: use paravirt friendly ops for NUMA
hinting ptes" which modifies the NUMA pte helpers instead of the main
set/clear ones.

--
Mel Gorman
SUSE Labs

2014-04-08 20:51:31

by Steven Noonan

[permalink] [raw]
Subject: Re: [PATCH 2/3] x86: Define _PAGE_NUMA with unused physical address bits PMD and PTE levels

On Tue, Apr 8, 2014 at 8:16 AM, H. Peter Anvin <[email protected]> wrote:
> <snark>
>
> Of course, it would also be preferable if Amazon (or anything else) didn't need Xen PV :(

Well Amazon doesn't expose NUMA on PV, only on HVM guests.

> On April 7, 2014 9:04:53 PM PDT, Steven Noonan <[email protected]> wrote:
>>On Mon, Apr 7, 2014 at 2:25 PM, Mel Gorman <[email protected]> wrote:
>>> On Mon, Apr 07, 2014 at 12:42:40PM -0700, H. Peter Anvin wrote:
>>>> On 04/07/2014 12:36 PM, Cyrill Gorcunov wrote:
>>>> > On Mon, Apr 07, 2014 at 12:27:10PM -0700, H. Peter Anvin wrote:
>>>> >> On 04/07/2014 11:28 AM, Mel Gorman wrote:
>>>> >>>
>>>> >>> I had considered the soft-dirty tracking usage of the same bit.
>>I thought I'd
>>>> >>> be able to swizzle around it or a further worst case of having
>>soft-dirty and
>>>> >>> automatic NUMA balancing mutually exclusive. Unfortunately upon
>>examination
>>>> >>> it's not obvious how to have both of them share a bit and I
>>suspect any
>>>> >>> attempt to will break CRIU. In my current tree, NUMA_BALANCING
>>cannot be
>>>> >>> set if MEM_SOFT_DIRTY which is not particularly satisfactory.
>>Next on the
>>>> >>> list is examining if _PAGE_BIT_IOMAP can be used.
>>>> >>
>>>> >> Didn't we smoke the last user of _PAGE_BIT_IOMAP?
>>>> >
>>>> > Seems so, at least for non-kernel pages (not considering this bit
>>references in
>>>> > xen code, which i simply don't know but i guess it's used for
>>kernel pages only).
>>>> >
>>>>
>>>> David Vrabel has a patchset which I presumed would be pulled through
>>the
>>>> Xen tree this merge window:
>>>>
>>>> [PATCHv5 0/8] x86/xen: fixes for mapping high MMIO regions (and
>>remove
>>>> _PAGE_IOMAP)
>>>>
>>>> That frees up this bit.
>>>>
>>>
>>> Thanks, I was not aware of that patch. Based on it, I intend to
>>force
>>> automatic NUMA balancing to depend on !XEN and see what the reaction
>>is. If
>>> support for Xen is really required then it potentially be re-enabled
>>if/when
>>> that series is merged assuming they do not need the bit for something
>>else.
>>>
>>
>>Amazon EC2 does have large memory instance types with NUMA exposed to
>>the guest (e.g. c3.8xlarge, i2.8xlarge, etc), so it'd be preferable
>>(to me anyway) if we didn't require !XEN.
>
> --
> Sent from my mobile phone. Please pardon brevity and lack of formatting.
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: [email protected]

2014-04-08 21:00:48

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 2/3] x86: Define _PAGE_NUMA with unused physical address bits PMD and PTE levels

On 04/08/2014 01:51 PM, Steven Noonan wrote:
> On Tue, Apr 8, 2014 at 8:16 AM, H. Peter Anvin <[email protected]> wrote:
>> <snark>
>>
>> Of course, it would also be preferable if Amazon (or anything else) didn't need Xen PV :(
>
> Well Amazon doesn't expose NUMA on PV, only on HVM guests.
>

Yes, but Amazon is one of the main things keeping Xen PV alive as far as
I can tell, which means the support gets built in, and so on.

-hpa

2014-04-09 15:05:43

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCH 2/3] x86: Define _PAGE_NUMA with unused physical address bits PMD and PTE levels

On Tue, Apr 08, 2014 at 01:59:09PM -0700, H. Peter Anvin wrote:
> On 04/08/2014 01:51 PM, Steven Noonan wrote:
> > On Tue, Apr 8, 2014 at 8:16 AM, H. Peter Anvin <[email protected]> wrote:
> >> <snark>
> >>
> >> Of course, it would also be preferable if Amazon (or anything else) didn't need Xen PV :(
> >
> > Well Amazon doesn't expose NUMA on PV, only on HVM guests.
> >
>
> Yes, but Amazon is one of the main things keeping Xen PV alive as far as
> I can tell, which means the support gets built in, and so on.

Taking the snarkiness aside, the issue here is that the problem shows
up even on guests without NUMA exposed. That is, the 'mknuma' helpers
are still being called even if the guest topology is not NUMA!

Which brings up a question - why aren't mknuma and its friends gated
by the jump_label machinery or suchlike?

Mel, any particular reason why it couldn't be done this way?
>
> -hpa
>
>

2014-04-09 15:10:13

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/3] x86: Define _PAGE_NUMA with unused physical address bits PMD and PTE levels

On Wed, Apr 09, 2014 at 11:04:48AM -0400, Konrad Rzeszutek Wilk wrote:
> On Tue, Apr 08, 2014 at 01:59:09PM -0700, H. Peter Anvin wrote:
> > On 04/08/2014 01:51 PM, Steven Noonan wrote:
> > > On Tue, Apr 8, 2014 at 8:16 AM, H. Peter Anvin <[email protected]> wrote:
> > >> <snark>
> > >>
> > >> Of course, it would also be preferable if Amazon (or anything else) didn't need Xen PV :(
> > >
> > > Well Amazon doesn't expose NUMA on PV, only on HVM guests.
> > >
> >
> > Yes, but Amazon is one of the main things keeping Xen PV alive as far as
> > I can tell, which means the support gets built in, and so on.
>
> Taking the snarkiness aside, the issue here is that the problem shows
> up even on guests without NUMA exposed. That is, the 'mknuma' helpers
> are still being called even if the guest topology is not NUMA!
>
> Which brings up a question - why aren't mknuma and its friends gated
> by the jump_label machinery or suchlike?
>
> Mel, any particular reason why it couldn't be done this way?

Hmm, I thought we disabled all that when there was only the 1 node. All
this should be driven from task_tick_numa() which only gets called when
numabalancing_enabled, and that _should_ be false when nr_nodes == 1.

2014-04-09 15:19:19

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCH 2/3] x86: Define _PAGE_NUMA with unused physical address bits PMD and PTE levels

On Tue, Apr 08, 2014 at 05:51:23PM +0100, Mel Gorman wrote:
> On Tue, Apr 08, 2014 at 12:02:50PM -0400, Konrad Rzeszutek Wilk wrote:
> > .snip..
> > > >>> David Vrabel has a patchset which I presumed would be pulled through
> > > >the
> > > >>> Xen tree this merge window:
> > > >>>
> > > >>> [PATCHv5 0/8] x86/xen: fixes for mapping high MMIO regions (and
> > > >remove
> > > >>> _PAGE_IOMAP)
> > > >>>
> > > >>> That frees up this bit.
> > > >>>
> > > >>
> > > >> Thanks, I was not aware of that patch. Based on it, I intend to
> > > >force
> > > >> automatic NUMA balancing to depend on !XEN and see what the reaction
> > > >is. If
> > > >> support for Xen is really required then it potentially be re-enabled
> > > >if/when
> > > >> that series is merged assuming they do not need the bit for something
> > > >else.
> > > >>
> > > >
> > > >Amazon EC2 does have large memory instance types with NUMA exposed to
> > > >the guest (e.g. c3.8xlarge, i2.8xlarge, etc), so it'd be preferable
> > > >(to me anyway) if we didn't require !XEN.
> >
> > What about the patch that David Vrabel posted:
> >
> > http://osdir.com/ml/general/2014-03/msg41979.html
> >
> > Has anybody taken it for a spin?
>
> Alternatively "[PATCH 4/5] mm: use paravirt friendly ops for NUMA
> hinting ptes" which modifies the NUMA pte helpers instead of the main
> set/clear ones.

Ah nice! Looking forward to it being posted as non-RFC and could you also
please CC '[email protected]' on it?

Thank you!

2014-04-09 15:39:26

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 2/3] x86: Define _PAGE_NUMA with unused physical address bits PMD and PTE levels

On Wed, Apr 09, 2014 at 11:18:27AM -0400, Konrad Rzeszutek Wilk wrote:
> On Tue, Apr 08, 2014 at 05:51:23PM +0100, Mel Gorman wrote:
> > On Tue, Apr 08, 2014 at 12:02:50PM -0400, Konrad Rzeszutek Wilk wrote:
> > > .snip..
> > > > >>> David Vrabel has a patchset which I presumed would be pulled through
> > > > >the
> > > > >>> Xen tree this merge window:
> > > > >>>
> > > > >>> [PATCHv5 0/8] x86/xen: fixes for mapping high MMIO regions (and
> > > > >remove
> > > > >>> _PAGE_IOMAP)
> > > > >>>
> > > > >>> That frees up this bit.
> > > > >>>
> > > > >>
> > > > >> Thanks, I was not aware of that patch. Based on it, I intend to
> > > > >force
> > > > >> automatic NUMA balancing to depend on !XEN and see what the reaction
> > > > >is. If
> > > > >> support for Xen is really required then it potentially be re-enabled
> > > > >if/when
> > > > >> that series is merged assuming they do not need the bit for something
> > > > >else.
> > > > >>
> > > > >
> > > > >Amazon EC2 does have large memory instance types with NUMA exposed to
> > > > >the guest (e.g. c3.8xlarge, i2.8xlarge, etc), so it'd be preferable
> > > > >(to me anyway) if we didn't require !XEN.
> > >
> > > What about the patch that David Vrabel posted:
> > >
> > > http://osdir.com/ml/general/2014-03/msg41979.html
> > >
> > > Has anybody taken it for a spin?
> >
> > Alternatively "[PATCH 4/5] mm: use paravirt friendly ops for NUMA
> > hinting ptes" which modifies the NUMA pte helpers instead of the main
> > set/clear ones.
>
> Ah nice! Looking forward to it being posted as non-RFC and could you also
> please CC '[email protected]' on it?
>

Yes I will. Unless the x86 maintainers push for it on the grounds that
it is a functional fix for xen, I'm going to wait until after the merge
window to resend it. That'd give it some chance of being tested in -next
before hitting mainline.

--
Mel Gorman
SUSE Labs