Hi,
I had a look at an IBM patch, which is described thus:
- Order 2 allocation relief
Symptom: Under stress and after long uptimes of a 64 bit system
the error message "__alloc_pages: 2-order allocation failed."
appears and either the fork of a new process fails or an
active process dies.
Problem: The order 2 allocation problem stems from the size of the
region and segment tables as defined by the zSeries
architecture. A full region or segment table in 64 bit mode
takes 16 KB of contiguous real memory. The page allocation
routines do not guarantee that a higher order allocation
will succeed due to memory fragmentation.
Solution: The order 2 allocation fix is supposed to reduce the number
of order 2 allocations for the region and segment tables to
a minimum. To do so it uses a feature of the architecture
that allows incomplete region and segment tables to be created.
In almost all cases a process does not need full region or
segment tables. If a full region or segment table is needed,
the table is reallocated at its full size.
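For concreteness, here is a rough sketch of what such an incomplete-table
scheme might look like (the helper and constant names are made up, this is
not the actual patch code): start every process with a single 4KB table
page and only fall back to the contiguous 16KB allocation when the address
space outgrows it.

/*
 * Sketch only -- hypothetical helpers, not the IBM patch.  A full 64-bit
 * segment table is 16KB, i.e. four contiguous 4KB pages (an order-2
 * allocation).  If the architecture lets a table be marked as covering
 * only part of the address range, most processes can live with a single
 * 4KB page and never trigger an order-2 allocation at all.
 */
#define SEGMENT_ENTRY_INVALID   0x20UL  /* assumed invalid-entry pattern */

static unsigned long *alloc_partial_segment_table(void)
{
        /* order-0 allocation: succeeds even on badly fragmented memory */
        unsigned long *table = (unsigned long *) __get_free_page(GFP_KERNEL);
        int i;

        if (!table)
                return NULL;
        for (i = 0; i < PAGE_SIZE / sizeof(long); i++)
                table[i] = SEGMENT_ENTRY_INVALID;
        return table;
}

/* Fault path: the process touched an address the partial table can't map. */
static unsigned long *upgrade_segment_table(unsigned long *old)
{
        /* only now do we need 16KB of contiguous memory (order 2) */
        unsigned long *full = (unsigned long *) __get_free_pages(GFP_KERNEL, 2);
        int i;

        if (!full)
                return NULL;
        memcpy(full, old, PAGE_SIZE);           /* keep the populated entries */
        for (i = PAGE_SIZE / sizeof(long); i < 4 * PAGE_SIZE / sizeof(long); i++)
                full[i] = SEGMENT_ENTRY_INVALID;
        free_page((unsigned long) old);
        return full;
}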
This patch is very s/390 specific and breaks all other architectures.
<<they meant "zSeries specific", surely --zaitcev>>
It's a stupid question, but: why can we not simply
wait until a desired unfragmented memory area is available,
with a GFP flag? What they describe does not happen in an
interrupt context, so we can sleep.
And another one: why not increase the kernel-visible or "soft"
page size to 16KB for zSeries? It's a 64-bit platform. There
will be some increase in fragmentation, but nobody has measured it.
Perhaps it's not going to be severe. It may even improve paging
efficiency.
-- Pete
P.S. The patch itself is at:
http://www10.software.ibm.com/developerworks/opensource/linux390/alpha_src/linux-2.4.7-order2-3.tar.gz
Pete Zaitcev wrote:
> This patch is very s/390 specific and breaks all other architectures.
> <<they meant "zSeries specific", surely --zaitcev>>
Btw, Martin found a way to make the patch less intrusive so
that it won't break other archs any more ...
>It's a stupid question, but: why can we not simply
>wait until a desired unfragmented memory area is available,
>with a GFP flag? What they describe does not happen in an
>interrupt context, so we can sleep.
Because nobody even *tries* to free adjacent pages to build up
a free order-2 area. You could wait really long ...
This looks hard to fix with the current mm layer. Maybe Rik's
rmap method could help here, because with reverse mappings we
can at least try to free adjacent areas (because we then at least
*know* who's using the pages).
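Roughly what that could look like (the helper names below are made up,
this is not code from the -rmap patch): pick an aligned order-2 block and
use the reverse mappings to evict each of its four pages.

/*
 * Sketch only, hypothetical helpers.  With reverse mappings each struct
 * page can tell us every pte that maps it, so "defragmenting" an order-2
 * block becomes: walk the four pages, unmap each one from all its users,
 * write it back if dirty, and free it so the buddy allocator can merge.
 */
static int try_to_free_order2_block(struct page *base)
{
        int i;

        for (i = 0; i < 4; i++) {
                struct page *page = base + i;

                if (PageReserved(page) || PageLocked(page))
                        return -EBUSY;  /* pinned, give up on this block */

                /* hypothetical: unmap every pte on the page's reverse-map chain */
                if (unmap_all_mappings(page))
                        return -EAGAIN;

                if (PageDirty(page) && write_page_out(page))    /* hypothetical */
                        return -EAGAIN;

                drop_from_page_cache_or_swap(page);             /* hypothetical */
        }
        return 0;       /* all four pages freed; the order-2 block coalesces */
}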
>And another one: why not increase the kernel-visible or "soft"
>page size to 16KB for zSeries? It's a 64-bit platform. There
>will be some increase in fragmentation, but nobody has measured it.
>Perhaps it's not going to be severe. It may even improve paging
>efficiency.
Because then we can mmap() to user space only on 16KB boundaries.
This is a problem in particular for the 31-bit emulation layer,
as 31-bit binaries are laid out on 4KB boundaries by the linker,
so you really need to be able to mmap() on 4KB boundaries.
One way to fix this could be to allow user space mappings on a
different granularity than the 'page size' for the allocator.
(Is this what PAGE_SIZE vs. PAGE_CACHE_SIZE had been intended
for, maybe? It doesn't work at the moment in any case.)
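To illustrate the granularity problem (simplified, made-up numbers, not
actual do_mmap() code): with a 16KB soft page size every mapping has to
start on a 16KB boundary, but a 31-bit binary linked for 4KB pages will
happily ask for mappings that are only 4KB-aligned.

#define SOFT_PAGE_SIZE  16384UL
#define SOFT_PAGE_MASK  (~(SOFT_PAGE_SIZE - 1))

static int mmap_addr_ok(unsigned long addr, unsigned long len)
{
        if (addr & ~SOFT_PAGE_MASK)     /* start not on a 16KB boundary */
                return 0;
        if (len & ~SOFT_PAGE_MASK)      /* length not a multiple of 16KB */
                return 0;
        return 1;
}

/*
 * mmap_addr_ok(0x451000, 0x2000) == 0: perfectly legal with 4KB pages,
 * but rejected with a 16KB page size -- exactly the case the 31-bit
 * emulation layer runs into.
 */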
Mit freundlichen Gruessen / Best Regards
Ulrich Weigand
--
Dr. Ulrich Weigand
Linux for S/390 Design & Development
IBM Deutschland Entwicklung GmbH, Schoenaicher Str. 220, 71032 Boeblingen
Phone: +49-7031/16-3727 --- Email: [email protected]
> >with a GFP flag? What they describe does not happen in an
> >interrupt context, so we can sleep.
>
> Because nobody even *tries* to free adjacent pages to build up
> a free order-2 area. You could wait really long ...
Without the rmap patch you can't easily do it
> rmap method could help here, because with reverse mappings we
> can at least try to free adjacent areas (because we then at least
> *know* who's using the pages).
rmap definitely makes it a real no-brainer to do this at least for small
clusters of pages. Doing large chunks gets progressively harder.
From: Alan Cox <[email protected]>
Date: Thu, 7 Feb 2002 00:18:47 +0000 (GMT)
> >with a GFP flag? What they describe does not happen in an
> >interrupt context, so we can sleep.
>
> Because nobody even *tries* to free adjacent pages to build up
> a free order-2 area. You could wait really long ...
Without the rmap patch you can't easily do it
One difference between Rik's VM (both of them, the 2.4.9-based AC stuff and
RMAP) and the current stuff in Linus's tree is that order 2 and
smaller allocations are all treated equally.
This got rid of a lot of problems on Sparc64 and with AF_UNIX sockets,
for example. Sparc64 has the same issue the IBM patch is trying
to solve: we need order 1 pages for our page table allocations. And
AF_UNIX was trying to use large linear buffers for better performance
during bulk transfers.
Btw, the AF_UNIX side of this results in all kinds of MySQL
performance problems, or at least this is how I remember it.
(for more details on this grep for SKB_MAX_ALLOC in current
2.4.x/2.5.x sources, in particular the references in
include/linux/skbuff.h and net/unix/af_unix.c)
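The idea, roughly (illustration only; if I remember the macro right, the
real SKB_MAX_ALLOC also accounts for the struct skb_shared_info overhead):
cap the linear data of each skb so a single skb never needs more than an
order-2 allocation, and send bulk data as several such skbs.

#define MAX_LINEAR_DATA         (PAGE_SIZE << 2)        /* one order-2 block */

/* Split a bulk write into allocator-friendly chunks. */
static inline unsigned long unix_chunk(unsigned long remaining)
{
        return remaining < MAX_LINEAR_DATA ? remaining : MAX_LINEAR_DATA;
}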
There was even a linux-kernel thread about all of this back in
the 2.4.{13,14,15} days, perhaps someone can find it on
marc.theaimsgroup.com
I do not think the Linus VM behavior is unreasonable, which basically
amounts to continually trying to free pages for all order 3 and below
allocations (if you can sleep and you aren't PF_MEMALLOC etc.).
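In other words, something like this (heavily simplified, not the real
__alloc_pages(), with made-up helper names for the freelist and reclaim
steps):

static struct page *alloc_retry(unsigned int gfp_mask, unsigned int order)
{
        for (;;) {
                struct page *page = take_from_freelists(order); /* hypothetical */

                if (page)
                        return page;

                if (!(gfp_mask & __GFP_WAIT) || (current->flags & PF_MEMALLOC))
                        return NULL;    /* can't sleep, or we *are* the reclaimer */

                if (order > 3)
                        return NULL;    /* big requests are still allowed to fail */

                reclaim_some_pages(gfp_mask);   /* hypothetical stand-in for
                                                   the try_to_free_pages() path */
        }
}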
> rmap method could help here, because with reverse mappings we
> can at least try to free adjacent areas (because we then at least
> *know* who's using the pages).
rmap definitely makes it a real no-brainer to do this at least for small
clusters of pages. Doing large chunks gets progressively harder.
You just have to be careful that you don't let the algorithm
degenerate into a dumb scan, which is the kind of silly stuff
the VM used to do back in the pre-2.2.x days :-)
On Wed, 6 Feb 2002, David S. Miller wrote:
> I do not think the Linus VM behavior is unreasonable, which basically
> amounts to continually trying to free pages for all order 3 and below
> allocations (if you can sleep and you aren't PF_MEMALLOC etc.).
The only problem is that it doesn't. It won't try to free
pages once you have enough free pages, which means you'll
just end up in a livelock.
Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document
http://www.surriel.com/ http://distro.conectiva.com/
From: Rik van Riel <[email protected]>
Date: Thu, 7 Feb 2002 10:16:22 -0200 (BRST)
The only problem is that it doesn't. It won't try to free
pages once you have enough free pages, which means you'll
just end up in a livelock.
It always calls balance_classzone which always calls try_to_free_pages
which always will try to free SWAP_CLUSTER_MAX pages.
Oh, I see, is it that the old and RMAP VM won't do that? :-)
BTW, in checking this out it seems current->allocation_order is only
set and never checked anywhere.
On Thu, 7 Feb 2002, David S. Miller wrote:
> From: Rik van Riel <[email protected]>
> Date: Thu, 7 Feb 2002 10:16:22 -0200 (BRST)
>
> The only problem is that it doesn't. It won't try to free
> pages once you have enough free pages, which means you'll
> just end up in a livelock.
>
> It always calls balance_classzone which always calls try_to_free_pages
> which always will try to free SWAP_CLUSTER_MAX pages.
Duh, indeed. It seems Linus' free_plenty() checks were
removed somewhere along the way.
Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document
http://www.surriel.com/ http://distro.conectiva.com/
On Thu, 7 Feb 2002, David S. Miller wrote:
>
> BTW, in checking this out it seems current->allocation_order is only
> set and never checked anywhere.
Yes, the "local_pages" interaction between __free_pages_ok and
balance_classzone is in a half-baked state in the mainline tree;
I think Linus backed out some of what Andrea intended: the -aa tree
makes more sense there (where "allocation_order" is "local_pages.order").
Hugh
On February 6, 2002 10:50 pm, Ulrich Weigand wrote:
> Pete Zaitcev wrote:
> >It's a stupid question, but: why can we not simply
> >wait until a desired unfragmented memory area is available,
> >with a GFP flag? What they describe does not happen in an
> >interrupt context, so we can sleep.
>
> Because nobody even *tries* to free adjacent pages to build up
> a free order-2 area. You could wait really long ...
>
> This looks hard to fix with the current mm layer. Maybe Rik's
> rmap method could help here, because with reverse mappings we
> can at least try to free adjacent areas (because we then at least
> *know* who's using the pages).
Yes, that's one of the leading reasons for wanting rmap. (Number one and two
reasons are: allow forcible unmapping of multiply referenced pages for
swapout; get more reliable hardware ref bit readings.)
Note that even if we can do forcible freeing we still have to deal with the
issue of fragmentation due to pinned pages, e.g., slab cache, admittedly a
rarer problem.
--
Daniel
On Thu, 7 Feb 2002, Daniel Phillips wrote:
> > This looks hard to fix with the current mm layer. Maybe Rik's
> > rmap method could help here, because with reverse mappings we
> > can at least try to free adjacent areas (because we then at least
> > *know* who's using the pages).
>
> Yes, that's one of the leading reasons for wanting rmap. (Number one and
> two reasons are: allow forcible unmapping of multiply referenced pages
> for swapout; get more reliable hardware ref bit readings.)
It's still on my TODO list. Patches are very much welcome
though ;)
Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document
http://www.surriel.com/ http://distro.conectiva.com/
On February 7, 2002 03:55 pm, Rik van Riel wrote:
> On Thu, 7 Feb 2002, Daniel Phillips wrote:
>
> > > This looks hard to fix with the current mm layer. Maybe Rik's
> > > rmap method could help here, because with reverse mappings we
> > > can at least try to free adjacent areas (because we then at least
> > > *know* who's using the pages).
> >
> > Yes, that's one of the leading reasons for wanting rmap. (Number one and
> > two reasons are: allow forcible unmapping of multiply referenced pages
> > for swapout; get more reliable hardware ref bit readings.)
>
> It's still on my TODO list. Patches are very much welcome
> though ;)
I'd rather see rmap go in in its simplest possible form, outperforming the
current virtual scanning method on basic page replacement performance, rather
than using the other things we know rmap can do as the argument for inclusion.
It's for this reason that I'm concentrating on the fork speedup.
--
Daniel
Rik van Riel wrote:
>On Thu, 7 Feb 2002, Daniel Phillips wrote:
>
>> Yes, that's one of the leading reasons for wanting rmap. (Number one and
>> two reasons are: allow forcible unmapping of multiply referenced pages
>> for swapout; get more reliable hardware ref bit readings.)
>
>It's still on my TODO list. Patches are very much welcome
>though ;)
On s390 we have per physical page hardware referenced / changed bits.
In the rmap framework, it should also be possible to make more efficient
use of these ...
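For example, something along these lines (hypothetical helper names, and
the pte_chain layout is only approximate): page aging could test the
per-page hardware bit directly instead of walking the reverse mappings
and collecting pte_young() from every mapper.

/* s390-style: one test-and-clear on the physical page is enough. */
static int page_referenced_s390(struct page *page)
{
        /* hypothetical wrapper around the storage-key "reset reference bit" op */
        return storage_key_test_and_clear_referenced(page_to_phys(page));
}

/* Generic rmap fallback: walk the pte chain and collect the pte ref bits. */
static int page_referenced_generic(struct page *page)
{
        struct pte_chain *pc;
        int referenced = 0;

        for (pc = page->pte_chain; pc; pc = pc->next)
                if (ptep_test_and_clear_young(pc->ptep))
                        referenced = 1;
        return referenced;
}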
Mit freundlichen Gruessen / Best Regards
Ulrich Weigand
--
Dr. Ulrich Weigand
Linux for S/390 Design & Development
IBM Deutschland Entwicklung GmbH, Schoenaicher Str. 220, 71032 Boeblingen
Phone: +49-7031/16-3727 --- Email: [email protected]
From: Daniel Phillips <[email protected]>
Date: Thu, 7 Feb 2002 16:07:39 +0100
I'd rather see rmap go in in its simplest possible form, outperforming the
current virtual scanning method on basic page replacement performance, rather
than using the other things we know rmap can do as the argument for inclusion.
It's for this reason that I'm concentrating on the fork speedup.
Ok, but just keep in mind that failing order < 3 page allocations
would be a regression from what is in there now.
On Thu, 7 Feb 2002, Ulrich Weigand wrote:
> On s390 we have per physical page hardware referenced / changed bits.
> In the rmap framework, it should also be possible to make more
> efficient use of these ...
Absolutely, on S390 you could basically bypass part
of the -rmap code.
Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document
http://www.surriel.com/ http://distro.conectiva.com/
On February 7, 2002 04:05 pm, Ulrich Weigand wrote:
> Rik van Riel wrote:
>
> >On Thu, 7 Feb 2002, Daniel Phillips wrote:
> >
> >> Yes, that's one of the leading reasons for wanting rmap. (Number one and
> >> two reasons are: allow forcible unmapping of multiply referenced pages
> >> for swapout; get more reliable hardware ref bit readings.)
> >
> >It's still on my TODO list. Patches are very much welcome
> >though ;)
>
> On s390 we have per physical page hardware referenced / changed bits.
> In the rmap framework, it should also be possible to make more efficient
> use of these ...
I'm an rmap fan, but this feature in fact negates one of the advantages of
rmap, since virtual scanning doesn't need to propagate the page ref
bit into the physical page. However, it still has to unmap pages, which
is the biggest disconnect with virtual scanning.
As Rik said, it would make rmap run a little faster, but some organization
is needed to make the ref bit propagation per-arch.
--
Daniel
--On Thursday, 07 February, 2002 3:12 PM +0100 Daniel Phillips
<[email protected]> wrote:
>> Maybe Rik's
>> rmap method could help here, because with reverse mappings we
>> can at least try to free adjacent areas (because we then at least
>> *know* who's using the pages).
>
> Yes, that's one of the leading reasons for wanting rmap. (Number one and two
> reasons are: allow forcible unmapping of multiply referenced pages for
> swapout; get more reliable hardware ref bit readings.)
>
> Note that even if we can do forcible freeing we still have to deal with
> the issue of fragmentation due to pinned pages, e.g., slab cache,
> admittedly a rarer problem.
Perhaps that could be mitigated by using the same machinery you are using
to do the freeing to ensure that pinned pages (slab cache etc.) are
preferentially allocated next to other pinned pages.
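Something like this, perhaps (entirely hypothetical helpers, just to make
the suggestion concrete):

enum alloc_kind { ALLOC_RECLAIMABLE, ALLOC_PINNED };

static struct page *alloc_grouped(unsigned int gfp_mask, unsigned int order,
                                  enum alloc_kind kind)
{
        /* hypothetical per-kind freelists, refilled one large block at a time */
        struct page *page = take_from_kind_freelist(kind, order);

        if (page)
                return page;

        /*
         * Carve a fresh, aligned block out of the buddy allocator and
         * dedicate it to this kind of allocation, so a single pinned page
         * (slab, page table, kernel stack) never ends up stranded in the
         * middle of an otherwise reclaimable -- and thus defragmentable --
         * region.
         */
        return carve_new_block_for(kind, gfp_mask, order);
}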
--
Alex Bligh