2013-04-27 00:07:39

by Pierre-Loup A. Griffais

Subject: IO regression after ab8fabd46f on x86 kernels with high memory

I initially observed this between kernels 3.2 and 3.5: on 3.2, copying a
180M shared object on the same ext4 filesystem takes 0.6s. On 3.5, it
takes between two and three minutes. It looks like a similar throughput
regression happens on any machine running an i386 PAE kernel with high
amounts of memory; the threshold seems to be 16G; passing mem=15G to the
kernel commandline fixes it.

I bisected it to the following change:

commit ab8fabd46f811d5153d8a0cd2fac9a0d41fb593d
Author: Johannes Weiner <[email protected]>
Date: Tue Jan 10 15:07:42 2012 -0800

mm: exclude reserved pages from dirtyable memory

I realize running x86 kernels against high amounts of memory is not
advised for various reasons, but I would not expect such a big
regression in basic functionality to be one of them. Is that
accurate, or are these configurations expected to become unusable from
3.3 onwards?

Also CCing Sonny since it looks like he tried to fix an overflow issue
related to the same change with commit c8b74c2f66049, but I'm still
experiencing the problem with a kernel built from master.

Thanks,
- Pierre-Loup


2013-04-27 01:54:12

by Rik van Riel

Subject: Re: IO regression after ab8fabd46f on x86 kernels with high memory

On 04/26/2013 07:44 PM, Pierre-Loup A. Griffais wrote:
> I initially observed this between kernels 3.2 and 3.5: on 3.2, copying a
> 180M shared object on the same ext4 filesystem takes 0.6s. On 3.5, it
> takes between two and three minutes. It looks like a similar throughput
> regression happens on any machine running an i386 PAE kernel with high
> amounts of memory; the threshold seems to be 16G; passing mem=15G to the
> kernel commandline fixes it.

If you have that much memory in the system, you will
want to run a 64 bit kernel to avoid all kinds of
memory management corner cases.

> I bisected it to the following change:
>
> commit ab8fabd46f811d5153d8a0cd2fac9a0d41fb593d
> Author: Johannes Weiner <[email protected]>
> Date: Tue Jan 10 15:07:42 2012 -0800
>
> mm: exclude reserved pages from dirtyable memory
>
> I realize running x86 kernels against high amounts of memory is not
> advised for various reasons, but I would assume that such a big
> regression in basic functionality to not be part of them. Is that
> accurate, or are these configurations expected to become unusable from
> 3.3 onwards?

Reverting that patch would probably break i686 PAE systems with
lots of memory at a different threshold.

With more than 8-12GB of memory, an i686 kernel is between a
rock and a hard place. Whether you move it closer to the rock,
or closer to the hard place, all you do is change the way in
which it breaks.

> Also CCing Sonny since it looks like he tried to fix an overflow issue
> related to the same change with commit c8b74c2f66049, but I'm still
> experiencing the problem with a kernel built from master.
>
> Thanks,
> - Pierre-Loup


--
All rights reversed

2013-04-27 02:42:58

by Johannes Weiner

Subject: Re: IO regression after ab8fabd46f on x86 kernels with high memory

On Fri, Apr 26, 2013 at 09:53:56PM -0400, Rik van Riel wrote:
> On 04/26/2013 07:44 PM, Pierre-Loup A. Griffais wrote:
> >I initially observed this between kernels 3.2 and 3.5: on 3.2, copying a
> >180M shared object on the same ext4 filesystem takes 0.6s. On 3.5, it
> >takes between two and three minutes. It looks like a similar throughput
> >regression happens on any machine running an i386 PAE kernel with high
> >amounts of memory; the threshold seems to be 16G; passing mem=15G to the
> >kernel commandline fixes it.
>
> If you have that much memory in the system, you will
> want to run a 64 bit kernel to avoid all kinds of
> memory management corner cases.

Agreed. You can even keep your 32 bit userland, just swap the
kernel...

> >I bisected it to the following change:
> >
> >commit ab8fabd46f811d5153d8a0cd2fac9a0d41fb593d
> >Author: Johannes Weiner <[email protected]>
> >Date: Tue Jan 10 15:07:42 2012 -0800
> >
> > mm: exclude reserved pages from dirtyable memory
> >
> >I realize running x86 kernels against high amounts of memory is not
> >advised for various reasons, but I would assume that such a big
> >regression in basic functionality to not be part of them. Is that
> >accurate, or are these configurations expected to become unusable from
> >3.3 onwards?
>
> Reverting that patch would probably break i686 PAE systems with
> lots of memory at a different threshold.

It would also re-introduce the reclaim stalls when zones with very
little page cache due to lowmem reserves end up with a large
percentage of their LRU dirty. And that affects modern machines too,
because of the lowmem reserves in DMA32 due to relatively bigger
Normal zones.

On such large highmem machines, however, the imbalance between highmem
and lowmem is so enormous that the lowmem reserves basically exclude
all of lowmem from page cache usage.

But because dirty highmem creates lowmem pressure, and the amount of
sanely allowable dirty memory is actually a function of lowmem, not
highmem, highmem is not included in the amount of dirtyable memory.

So because your lowmem is not available for page cache and highmem is
not considered dirtyable out of the box, the amount of dirtyable
memory on your machine is 0. You can work around this by setting
vm.highmem_is_dirtyable=1.
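
The accounting described above can be sketched as a small userspace model. This is not the kernel's code: the struct, the function name and every page count are hypothetical, chosen only to illustrate the two rules stated in this message (reserved pages are excluded, and highmem counts only when vm.highmem_is_dirtyable is set).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of one memory zone (units: pages). */
struct zone_model {
	unsigned long usable;	/* free + reclaimable pages */
	unsigned long reserve;	/* pages held back as lowmem reserve */
	bool highmem;		/* true for the highmem zone */
};

/*
 * Made-up numbers for a large highmem machine: the lowmem reserve is
 * bigger than what lowmem has available, and nearly all RAM is highmem.
 */
static const struct zone_model example_zones[] = {
	{ .usable = 180000,  .reserve = 200000, .highmem = false },
	{ .usable = 3900000, .reserve = 0,      .highmem = true  },
};

/*
 * Sum the pages that may hold dirty page cache. Reserved pages never
 * count, and highmem counts only if highmem_is_dirtyable is set.
 * The subtraction saturates at zero so a large reserve cannot underflow.
 */
static unsigned long dirtyable_pages(const struct zone_model *zones, size_t n,
				     bool highmem_is_dirtyable)
{
	unsigned long total = 0;

	assert(zones != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (zones[i].highmem && !highmem_is_dirtyable)
			continue;
		if (zones[i].usable > zones[i].reserve)
			total += zones[i].usable - zones[i].reserve;
	}
	return total;
}
```

With the example zones, the model reports zero dirtyable pages by default and 3900000 once highmem is counted, which mirrors the effect of the sysctl mentioned above.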

2013-04-29 22:03:24

by Linus Torvalds

Subject: Re: IO regression after ab8fabd46f on x86 kernels with high memory

On Mon, Apr 29, 2013 at 2:53 PM, Pierre-Loup A. Griffais
<[email protected]> wrote:
>
> Other than this particular concern, what's the high-level take-away? Is PAE
> support in the Linux kernel a false promise that distros should not be
> shipping by default, if at all? Should it be removed from the kernel
> entirely if these configurations are knowingly broken by commits like this?

PAE is "make it barely work". The whole concept is fundamentally
flawed, and anybody who runs a 32-bit kernel with 16GB of RAM doesn't
even understand *how* flawed and stupid that is.

Don't do it. Upgrade to 64-bit, or live with the fact that IO
performance will suck. The fact that it happened to work better under
your particular load with one particular IO size is entirely just
"random noise".

Yeah, the difference between "we can cache it" and "we have to do IO"
is huge. With a 32-bit kernel, we do IO much earlier now, just to
avoid some really nasty situations. That makes you go from the "can
sit in the cache" to the "do lots of IO" situation. Tough.

Seriously, you can compile yourself a 64-bit kernel and continue to
use your 32-bit user-land. And you can complain to whatever distro you
used that it didn't do that in the first place. But we're not going to
bother with trying to tune PAE for some particular load. It's just not
worth it to anybody.

Linus

2013-04-29 22:16:43

by Pierre-Loup A. Griffais

Subject: Re: IO regression after ab8fabd46f on x86 kernels with high memory

On 04/29/2013 03:03 PM, Linus Torvalds wrote:
> On Mon, Apr 29, 2013 at 2:53 PM, Pierre-Loup A. Griffais
> <[email protected]> wrote:
>>
>> Other than this particular concern, what's the high-level take-away? Is PAE
>> support in the Linux kernel a false promise that distros should not be
>> shipping by default, if at all? Should it be removed from the kernel
>> entirely if these configurations are knowingly broken by commits like this?
>
> PAE is "make it barely work". The whole concept is fundamentally
> flawed, and anybody who runs a 32-bit kernel with 16GB of RAM doesn't
> even understand *how* flawed and stupid that is.
>
> Don't do it. Upgrade to 64-bit, or live with the fact that IO
> performance will suck. The fact that it happened to work better under
> your particular load with one particular IO size is entirely just
> "random noise".
>
> Yeah, the difference between "we can cache it" and "we have to do IO"
> is huge. With a 32-bit kernel, we do IO much earlier now, just to
> avoid some really nasty situations. That makes you go from the "can
> sit in the cache" to the "do lots of IO" situation. Tough.
>
> Seriously, you can compile yourself a 64-bit kernel and continue to
> use your 32-bit user-land. And you can complain to whatever distro you
> used that it didn't do that in the first place. But we're not going to
> bother with trying to tune PAE for some particular load. It's just not
> worth it to anybody.

All of this came from me trying to reproduce slowdowns reported by other
people; I personally run a 64-bit kernel and understand how bad of an
idea it is to attempt to run 32-bit kernels with PAE enabled on modern
machines. However, my goal is to avoid ending up with a variety of
end-users that don't necessarily understand this getting bitten by it
and breaking their systems by upgrading their kernels. I will indeed
bring this up with distributors and point out that shipping PAE kernels
by default is not a good idea given these problems and your stance on
the matter.

Thanks,
- Pierre-Loup

>
> Linus
>

2013-04-29 22:16:45

by Pierre-Loup A. Griffais

Subject: Re: IO regression after ab8fabd46f on x86 kernels with high memory

On 04/26/2013 07:42 PM, Johannes Weiner wrote:
> On Fri, Apr 26, 2013 at 09:53:56PM -0400, Rik van Riel wrote:
>> On 04/26/2013 07:44 PM, Pierre-Loup A. Griffais wrote:
>>> I initially observed this between kernels 3.2 and 3.5: on 3.2, copying a
>>> 180M shared object on the same ext4 filesystem takes 0.6s. On 3.5, it
>>> takes between two and three minutes. It looks like a similar throughput
>>> regression happens on any machine running an i386 PAE kernel with high
>>> amounts of memory; the threshold seems to be 16G; passing mem=15G to the
>>> kernel commandline fixes it.
>>
>> If you have that much memory in the system, you will
>> want to run a 64 bit kernel to avoid all kinds of
>> memory management corner cases.
>
> Agreed. You can even keep your 32 bit userland, just swap the
> kernel...
>
>>> I bisected it to the following change:
>>>
>>> commit ab8fabd46f811d5153d8a0cd2fac9a0d41fb593d
>>> Author: Johannes Weiner <[email protected]>
>>> Date: Tue Jan 10 15:07:42 2012 -0800
>>>
>>> mm: exclude reserved pages from dirtyable memory
>>>
>>> I realize running x86 kernels against high amounts of memory is not
>>> advised for various reasons, but I would assume that such a big
>>> regression in basic functionality to not be part of them. Is that
>>> accurate, or are these configurations expected to become unusable from
>>> 3.3 onwards?
>>
>> Reverting that patch would probably break i686 PAE systems with
>> lots of memory at a different threshold.
>
> It would also re-introduce the reclaim stalls when zones with very
> little page cache due to lowmem reserves end up with a large
> percentage of their LRU dirty. And that affects modern machines too,
> because of the lowmem reserves in DMA32 due to relatively bigger
> Normal zones.
>
> On such large highmem machines, however, the imbalance between highmem
> and lowmem is so enormous that the lowmem reserves basically exclude
> all of lowmem from page cache usage.
>
> But because dirty highmem creates lowmem pressure, and the amount of
> sanely allowable dirty memory is actually a function of lowmem, not
> highmem, highmem is not included in the amount of dirtyable memory.
>
> So because your lowmem is not available for page cache and highmem is
> not considered dirtyable out of the box, the amount of dirtyable
> memory on your machine is 0. You can work around this by setting
> vm.highmem_is_dirtyable=1.

I understand the technical concerns; we had some existing issues on 3.2
with 24/32GB machines where the kernel would start erroneously
OOM-killing new processes after a while; booting with mem=16G solved
that. But now this goes a level further, since the machine is unusable
upfront, right at boot, even with mem=16G. As such, this clearly seems
more like a regression than a tradeoff.

We're in a situation where popular distros ship 32-bit as the default
"use this if you're not sure what to get" option, with PAE also enabled
by default. Most modern computers are shipping with more than 16G of RAM,
especially for gaming. Looking at the Steam HW survey data we have
hundreds of users using this combination; this commit means that
installing package updates that pull in a new kernel will immediately
cause their system to become unusable.

Other than this particular concern, what's the high-level take-away? Is
PAE support in the Linux kernel a false promise that distros should not
be shipping by default, if at all? Should it be removed from the kernel
entirely if these configurations are knowingly broken by commits like this?

Thanks,
- Pierre-Loup

2013-04-30 00:48:41

by Rik van Riel

Subject: Re: IO regression after ab8fabd46f on x86 kernels with high memory

On 04/29/2013 06:03 PM, Linus Torvalds wrote:

> Seriously, you can compile yourself a 64-bit kernel and continue to
> use your 32-bit user-land. And you can complain to whatever distro you
> used that it didn't do that in the first place. But we're not going to
> bother with trying to tune PAE for some particular load. It's just not
> worth it to anybody.

I can think of one way to "tune PAE" that will help
avoid the breakage, and at the same time draw the
attention of users.

Limit the memory that a 32 bit PAE kernel uses, to
something small enough where the user will not
encounter random breakage. Maybe 8 or 12GB?

It could also print out a friendly message, to
inform the user they should upgrade to a 64 bit
kernel to enjoy the use of all of their memory.

It is a bit of a heavy stick, but I suspect that
it would clue in all of the affected users.

If you have no objection to this, I'll whip up a
patch.

2013-04-30 01:29:40

by Pierre-Loup A. Griffais

Subject: Re: IO regression after ab8fabd46f on x86 kernels with high memory

On 04/29/2013 05:48 PM, Rik van Riel wrote:
> On 04/29/2013 06:03 PM, Linus Torvalds wrote:
>
>> Seriously, you can compile yourself a 64-bit kernel and continue to
>> use your 32-bit user-land. And you can complain to whatever distro you
>> used that it didn't do that in the first place. But we're not going to
>> bother with trying to tune PAE for some particular load. It's just not
>> worth it to anybody.
>
> I can think of one way to "tune PAE" that will help
> avoid the breakage, and at the same time draw the
> attention of users.
>
> Limit the memory that a 32 bit PAE kernel uses, to
> something small enough where the user will not
> encounter random breakage. Maybe 8 or 12GB?
>
> It could also print out a friendly message, to
> inform the user they should upgrade to a 64 bit
> kernel to enjoy the use of all of their memory.
>
> It is a bit of a heavy stick, but I suspect that
> it would clue in all of the affected users.
>
> If you have no objection to this, I'll whip up a
> patch.
>

That would be pretty useful, especially if I can then convince
distributors to apply it and roll it out ASAP. I haven't personally
observed any problems with mem=15G, whereas mem=16G exhibits the IO issue
upfront, and anything above that exhibits the OOM-killer / low memory
starvation issue that existed before Johannes' change.

Thanks,
- Pierre-Loup

2013-05-02 01:34:31

by Steven Rostedt

Subject: Re: IO regression after ab8fabd46f on x86 kernels with high memory

On Mon, Apr 29, 2013 at 08:48:17PM -0400, Rik van Riel wrote:
>
> It could also print out a friendly message, to
> inform the user they should upgrade to a 64 bit
> kernel to enjoy the use of all of their memory.

Oh, oh, oh!!! Can we use my message:

http://lwn.net/Articles/501769/

OK, maybe it's not so friendly ;-)

-- Steve

2013-05-02 02:46:25

by Rik van Riel

Subject: [PATCH] mm,x86: limit 32 bit kernel to 12GB memory

On Wed, 1 May 2013 21:34:26 -0400
Steven Rostedt <[email protected]> wrote:
> On Mon, Apr 29, 2013 at 08:48:17PM -0400, Rik van Riel wrote:
> >
> > It could also print out a friendly message, to
> > inform the user they should upgrade to a 64 bit
> > kernel to enjoy the use of all of their memory.
>
> Oh, oh, oh!!! Can we use my message:
>
> http://lwn.net/Articles/501769/
>
> OK, maybe it's not so friendly ;-)

Here's a somewhat friendlier one. Printing out the total amount of
memory in the system may give them some extra motivation to upgrade
to a 64 bit kernel :)

---8<----
Subject: mm,x86: limit 32 bit kernel to 12GB memory

Running 32 bit kernels on very large memory systems is a recipe
for disaster, due to fundamental architectural limits in both
Linux and the hardware. Moreover, all modern hardware with large
memory supports 64 bits.

However, many users continue using 32 bit kernels, and end up
encountering stability and performance problems as a result.

It may be better to save those people the frustration of stability
issues by limiting memory on a 32 bit kernel to 12GB (about the upper
limit that still works right), and printing a friendly reminder that
they really should be using a 64 bit kernel.

Signed-off-by: Rik van Riel <[email protected]>
---
 arch/x86/include/asm/setup.h |  1 +
 arch/x86/mm/init_32.c        | 11 +++++++++++
 2 files changed, 12 insertions(+)

diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index b7bf350..79de6bf 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -14,6 +14,7 @@
  */
 #define MAXMEM_PFN	PFN_DOWN(MAXMEM)
 #define MAX_NONPAE_PFN	(1 << 20)
+#define MAX_PAE_PFN	(3 << 20)

 #endif /* __i386__ */

diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 3ac7e31..e35b3f5 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -600,6 +600,12 @@ static void __init lowmem_pfn_init(void)

 #define MSG_HIGHMEM_TRIMMED \
 	"Warning: only 4GB will be used. Use a HIGHMEM64G enabled kernel!\n"
+
+#define MSG_HIGHMEM_INSANE \
+	"Warning: 32 bit kernels on large memory systems have problems.\n" \
+	"Limiting memory to 12GB for system stability.\n" \
+	"Use a 64 bit kernel to access all %lu MB of memory.\n"
+
 /*
  * We have more RAM than fits into lowmem - we try to put it into
  * highmem, also taking the highmem=x boot parameter into account:
@@ -634,6 +640,11 @@ static void __init highmem_pfn_init(void)
 		max_pfn = MAX_NONPAE_PFN;
 		printk(KERN_WARNING MSG_HIGHMEM_TRIMMED);
 	}
+#else /* !CONFIG_HIGHMEM64G */
+	if (max_pfn > MAX_PAE_PFN) {
+		printk(KERN_WARNING MSG_HIGHMEM_INSANE, max_pfn>>8);
+		max_pfn = MAX_PAE_PFN;
+	}
 #endif /* !CONFIG_HIGHMEM64G */
 #endif /* !CONFIG_HIGHMEM */
 }
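
For readers checking the constants in the patch, the page-frame arithmetic can be reproduced in userspace. This is only a sketch that assumes 4 KiB pages. The helper names pfn_to_mib and clamp_pae_pfn are hypothetical, and the macros are redefined here as unsigned long long for portability rather than copied from the kernel headers.

```c
#include <assert.h>

#define PAGE_SHIFT_4K	12			/* assumption: 4 KiB pages */
#define MAX_NONPAE_PFN	(1ULL << 20)		/* 4GB worth of page frames */
#define MAX_PAE_PFN	(3ULL << 20)		/* 12GB worth of page frames */

/* 3 << 20 page frames of 4 KiB each add up to exactly 12 GiB. */
static_assert(((3ULL << 20) << PAGE_SHIFT_4K) == (12ULL << 30),
	      "MAX_PAE_PFN corresponds to 12 GiB");

/* Convert a page frame count to MiB: 2^20 / 2^12 = 2^8 pages per MiB. */
static unsigned long long pfn_to_mib(unsigned long long pfn)
{
	return pfn >> (20 - PAGE_SHIFT_4K);
}

/* Clamp max_pfn to the 12GB limit, as the new hunk in highmem_pfn_init() does. */
static unsigned long long clamp_pae_pfn(unsigned long long max_pfn)
{
	return max_pfn > MAX_PAE_PFN ? MAX_PAE_PFN : max_pfn;
}
```

The shift by 8 in the printk argument is the same conversion as pfn_to_mib: a 16GB machine has 4 << 20 page frames, so the warning would report 16384 MB.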

2013-05-02 04:38:05

by Sonny Rao

Subject: Re: IO regression after ab8fabd46f on x86 kernels with high memory

On Mon, Apr 29, 2013 at 3:08 PM, Pierre-Loup A. Griffais
<[email protected]> wrote:
> On 04/29/2013 03:03 PM, Linus Torvalds wrote:
>>
>> On Mon, Apr 29, 2013 at 2:53 PM, Pierre-Loup A. Griffais
>> <[email protected]> wrote:
>>>
>>>
>>> Other than this particular concern, what's the high-level take-away? Is
>>> PAE
>>> support in the Linux kernel a false promise that distros should not be
>>> shipping by default, if at all? Should it be removed from the kernel
>>> entirely if these configurations are knowingly broken by commits like
>>> this?
>>
>>
>> PAE is "make it barely work". The whole concept is fundamentally
>> flawed, and anybody who runs a 32-bit kernel with 16GB of RAM doesn't
>> even understand *how* flawed and stupid that is.
>>
>> Don't do it. Upgrade to 64-bit, or live with the fact that IO
>> performance will suck. The fact that it happened to work better under
>> your particular load with one particular IO size is entirely just
>> "random noise".
>>
>> Yeah, the difference between "we can cache it" and "we have to do IO"
>> is huge. With a 32-bit kernel, we do IO much earlier now, just to
>> avoid some really nasty situations. That makes you go from the "can
>> sit in the cache" to the "do lots of IO" situation. Tough.
>>
>> Seriously, you can compile yourself a 64-bit kernel and continue to
>> use your 32-bit user-land. And you can complain to whatever distro you
>> used that it didn't do that in the first place. But we're not going to
>> bother with trying to tune PAE for some particular load. It's just not
>> worth it to anybody.
>
>
> All of this came from me trying to reproduce slowdowns reported by other
> people; I personally run a 64-bit kernel and understand how bad of an idea
> it is to attempt to run 32-bit kernels with PAE enabled on modern machines.
> However, my goal is to avoid ending up with a variety of end-users that
> don't necessarily understand this getting bitten by it and breaking their
> systems by upgrading their kernels. I will indeed bring this up with
> distributors and point out that shipping PAE kernels by default is not a
> good idea given these problems and your stance on the matter.
>

Sorry, just saw this (my stupid gmail filters for lkml). The slow-down
we ran into wasn't even on PAE -- it was *just* with highmem on a 2GB
system. The non-zero amount (90MB? or so) of highmem was enough to
cause major problems due to that particular underflow.

I would say regardless of how much memory you have, if the system can
use a 64-bit kernel, then it almost certainly should. I've seen some
very minor performance impacts on 64-bit capable Atom systems with
tiny L2 caches, but it's almost in the noise and not worth the pain.

> Thanks,
> - Pierre-Loup
>
>>
>> Linus
>>
>

2013-05-02 08:01:15

by Pierre-Loup A. Griffais

Subject: Re: [PATCH] mm,x86: limit 32 bit kernel to 12GB memory

Reviewed-by: Pierre-Loup A. Griffais <[email protected]>

On 05/01/2013 07:46 PM, Rik van Riel wrote:
> On Wed, 1 May 2013 21:34:26 -0400
> Steven Rostedt <[email protected]> wrote:
>> On Mon, Apr 29, 2013 at 08:48:17PM -0400, Rik van Riel wrote:
>>>
>>> It could also print out a friendly message, to
>>> inform the user they should upgrade to a 64 bit
>>> kernel to enjoy the use of all of their memory.
>>
>> Oh, oh, oh!!! Can we use my message:
>>
>> http://lwn.net/Articles/501769/
>>
>> OK, maybe it's not so friendly ;-)
>
> Here's a somewhat friendlier one. Printing out the total amount of
> memory in the system may give them some extra motivation to upgrade
> to a 64 bit kernel :)
>
> ---8<----
> Subject: mm,x86: limit 32 bit kernel to 12GB memory
>
> Running 32 bit kernels on very large memory systems is a recipe
> for disaster, due to fundamental architectural limits in both
> Linux and the hardware. Moreover, all modern hardware with large
> memory supports 64 bits.
>
> However, many users continue using 32 bit kernels, and end up
> encountering stability and performance problems as a result.
>
> It may be better to save those people the frustration of stability
> issues by limiting memory on a 32 bit kernel to 12GB (about the upper
> limit that still works right), and printing a friendly reminder that
> they really should be using a 64 bit kernel.
>
> Signed-off-by: Rik van Riel <[email protected]>
> ---
> arch/x86/include/asm/setup.h | 1 +
> arch/x86/mm/init_32.c | 11 +++++++++++
> 2 files changed, 12 insertions(+)
>
> diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
> index b7bf350..79de6bf 100644
> --- a/arch/x86/include/asm/setup.h
> +++ b/arch/x86/include/asm/setup.h
> @@ -14,6 +14,7 @@
> */
> #define MAXMEM_PFN PFN_DOWN(MAXMEM)
> #define MAX_NONPAE_PFN (1 << 20)
> +#define MAX_PAE_PFN (3 << 20)
>
> #endif /* __i386__ */
>
> diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
> index 3ac7e31..e35b3f5 100644
> --- a/arch/x86/mm/init_32.c
> +++ b/arch/x86/mm/init_32.c
> @@ -600,6 +600,12 @@ static void __init lowmem_pfn_init(void)
>
> #define MSG_HIGHMEM_TRIMMED \
> "Warning: only 4GB will be used. Use a HIGHMEM64G enabled kernel!\n"
> +
> +#define MSG_HIGHMEM_INSANE \
> + "Warning: 32 bit kernels on large memory systems have problems.\n" \
> + "Limiting memory to 12GB for system stability.\n" \
> + "Use a 64 bit kernel to access all %lu MB of memory.\n"
> +
> /*
> * We have more RAM than fits into lowmem - we try to put it into
> * highmem, also taking the highmem=x boot parameter into account:
> @@ -634,6 +640,11 @@ static void __init highmem_pfn_init(void)
> max_pfn = MAX_NONPAE_PFN;
> printk(KERN_WARNING MSG_HIGHMEM_TRIMMED);
> }
> +#else /* !CONFIG_HIGHMEM64G */
> + if (max_pfn > MAX_PAE_PFN) {
> + printk(KERN_WARNING MSG_HIGHMEM_INSANE, max_pfn>>8);
> + max_pfn = MAX_PAE_PFN;
> + }
> #endif /* !CONFIG_HIGHMEM64G */
> #endif /* !CONFIG_HIGHMEM */
> }
>

2013-05-02 20:03:16

by Linus Torvalds

Subject: Re: [PATCH] mm,x86: limit 32 bit kernel to 12GB memory

On Wed, May 1, 2013 at 7:46 PM, Rik van Riel <[email protected]> wrote:
>
> Here's a somewhat friendlier one. Printing out the total amount of
> memory in the system may give them some extra motivation to upgrade
> to a 64 bit kernel :)

This needs more work:

- suggesting a 64-bit kernel on a truly 32-bit CPU is insane, so it
had better actually check the CPUID for 64-bit support ("lm" for "long
mode").

- we don't remove features, so there should be a kernel command line
option to say "I'm insane, I know this is going to have problems, I
want you to try to use more memory anyway" and disable the new 12GB
limit

- I don't think it's necessarily "system stability". The problem with
large amounts of highmem ends up being that we end up using up almost
all of the lowmem just to *track* the huge amount of highmem, and then
we have so little lowmem that we suck at performance and have various
random problems. So it's not just "system stability", it's more fluid
than that.
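
As an illustration of the "lm" check in the first point: inside the kernel this would be a CPUID feature test, but the same question can be answered from userspace by looking for the "lm" token in a flags line such as the one shown in /proc/cpuinfo. The sketch below is only that userspace variant, and the function name flags_have_lm is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Return true if the whitespace-separated flag list contains the exact
 * token "lm". Longer flags that merely contain those letters, such as
 * "lahf_lm", must not match.
 */
static bool flags_have_lm(const char *flags)
{
	assert(flags != NULL);

	const char *p = flags;
	while (*p != '\0') {
		/* Skip separators before the next token. */
		while (*p == ' ' || *p == '\t' || *p == '\n')
			p++;
		size_t len = strcspn(p, " \t\n");
		if (len == 2 && strncmp(p, "lm", 2) == 0)
			return true;
		p += len;
	}
	return false;
}
```

Matching whole tokens matters here because a plain substring search would wrongly report long mode on a CPU that only advertises "lahf_lm".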

The "it's more fluid than that" is also why I'd want to have a way to
override it. Using up all lowmem to track highmem is actually ok under
some very specific loads. If you have a setup where you have tons of
highmem, but all it is ever used for is anonymous user pages, you
don't need a lot of lowmem. Some of the craziest PAE users were that
class of use, and for all we know there are still crazy people with
real 32-bit CPU's that want to do it.
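
One possible reading of these requirements, written as a userspace decision function rather than kernel code. Everything here is hypothetical: the names, the result struct, and in particular the choice to still apply the limit on CPUs without long mode while suppressing the 64-bit hint, which the message does not spell out.

```c
#include <assert.h>
#include <stdbool.h>

#define PAE_LIMIT_PFN	(3UL << 20)	/* 12GB in 4 KiB pages, as in the patch */

static_assert(PAE_LIMIT_PFN == 3145728UL, "3 << 20 page frames");

struct pae_limit_result {
	unsigned long max_pfn;	/* page frames the kernel would use */
	bool suggest_64bit;	/* whether to print the 64 bit kernel hint */
};

/*
 * Decide how much memory to use. Below the limit nothing changes.
 * Above it, memory is trimmed unless the user explicitly overrides,
 * and the 64 bit hint is only given when the CPU has long mode.
 */
static struct pae_limit_result apply_pae_limit(unsigned long max_pfn,
						bool cpu_has_lm,
						bool user_override)
{
	struct pae_limit_result r = { max_pfn, false };

	if (max_pfn <= PAE_LIMIT_PFN)
		return r;

	r.suggest_64bit = cpu_has_lm;
	if (!user_override)
		r.max_pfn = PAE_LIMIT_PFN;
	return r;
}
```

In a real patch the override would presumably arrive through a kernel command line option; its name and parsing are deliberately left out here because the thread does not define them.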

We don't really want to support it, we don't really care, but I don't
think we want to then say "you cannot do that" either. We want to say
"you're a f*cking crazy moron, and we don't think what you do is a
good idea, but if you absolutely want to shoot yourself in the
foot, here's how to do it. Don't expect things to work well in
general, but you might have a load where it's acceptable".

Linus

2013-05-08 19:10:50

by H. Peter Anvin

Subject: Re: IO regression after ab8fabd46f on x86 kernels with high memory

On 04/29/2013 03:03 PM, Linus Torvalds wrote:
> On Mon, Apr 29, 2013 at 2:53 PM, Pierre-Loup A. Griffais
> <[email protected]> wrote:
>>
>> Other than this particular concern, what's the high-level take-away? Is PAE
>> support in the Linux kernel a false promise that distros should not be
>> shipping by default, if at all? Should it be removed from the kernel
>> entirely if these configurations are knowingly broken by commits like this?
>
> PAE is "make it barely work". The whole concept is fundamentally
> flawed, and anybody who runs a 32-bit kernel with 16GB of RAM doesn't
> even understand *how* flawed and stupid that is.
>

Let's be straight... the problem isn't PAE per se, the problem is
*HIGHMEM*. PAE just allows HIGHMEM to stretch further into problematic
territory.

Distros install PAE kernels by default because it is required to support
NX. That is fine.

The problem is that once your memory crosses the HIGHMEM threshold
-- 896 MiB in the normal configuration -- then you are in "this is going
to hurt" territory. I have seen HIGHMEM devastate performance without
even crossing the 4 GiB threshold where PAE is required.
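
A rough back-of-the-envelope model of that threshold, and of the lowmem tracking cost raised earlier in the thread. It assumes 4 KiB pages and the 896 MiB lowmem figure quoted above. The 32-byte struct page size is an assumption (the real size depends on kernel version and configuration), and both helper names are hypothetical.

```c
#include <assert.h>

#define MIB			(1ULL << 20)
#define LOWMEM_BYTES		(896 * MIB)	/* normal configuration, per the text */
#define PAGE_BYTES		4096ULL		/* assumption: 4 KiB pages */
#define STRUCT_PAGE_BYTES	32ULL		/* ASSUMPTION: config-dependent */

static_assert(LOWMEM_BYTES % PAGE_BYTES == 0, "lowmem is a whole number of pages");

/* Bytes of installed RAM that end up in highmem. */
static unsigned long long highmem_bytes(unsigned long long ram_bytes)
{
	return ram_bytes > LOWMEM_BYTES ? ram_bytes - LOWMEM_BYTES : 0;
}

/* Bytes of lowmem needed for one struct page per physical page. */
static unsigned long long mem_map_bytes(unsigned long long ram_bytes)
{
	return (ram_bytes / PAGE_BYTES) * STRUCT_PAGE_BYTES;
}
```

Under these assumptions a 16 GiB machine spends 128 MiB of its 896 MiB of lowmem on page structures alone, before page tables and other lowmem-only allocations are counted, so this figure is a lower bound on the pressure rather than the whole story.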

We kernel guys have been asking the distros to ship 64-bit kernels even
in their 32-bit distros for many years, but concerns about compat issues
and the desire to deprecate 32-bit userspace seem to have kept that
from happening.

-hpa

2013-05-11 09:16:54

by Yuhong Bao

Subject: RE: [PATCH] mm,x86: limit 32 bit kernel to 12GB memory

> - I don't think it's necessarily "system stability". The problem with
> large amounts of highmem ends up being that we end up using up almost
> all of the lowmem just to *track* the huge amount of highmem, and then
> we have so little lowmem that we suck at performance and have various
> random problems. So it's not just "system stability", it's more fluid
> than that.

FYI 32-bit Windows already limits to 16GB when 3G/1G split is used for a similar reason. (They default to 2G/2G split.)

Yuhong Bao

2013-06-03 01:24:13

by Yuhong Bao

Subject: RE: IO regression after ab8fabd46f on x86 kernels with high memory

> We kernel guys have been asking the distros to ship 64-bit kernels even
> in their 32-bit distros for many years, but concerns of compat issues
> and the desire to deprecate 32-bit userspace seems to have kept that
> from happening.

And now there is another reason: to call 64-bit EFI runtime services.
In retrospect, I would have stuck with 32-bit EFI with 64-bit kernels calling runtime services in compatibility mode, but of course it is too late for that now.

Yuhong Bao