Linus,
As you know, we currently allow 1-order allocations to fail easily.
However, there is one special case of 1-order allocations which cannot
fail: fork.
Here is the tested patch against pre4.
--- linux.orig/mm/page_alloc.c Thu Oct 18 14:26:28 2001
+++ linux/mm/page_alloc.c Thu Oct 18 16:23:15 2001
@@ -393,8 +393,13 @@
}
}
- /* Don't let big-order allocations loop */
- if (order)
+ /* We have one special 1-order alloc user: fork().
+ * It obviously cannot fail easily like other
+ * high order allocations. This could also be fixed
+ * by having a __GFP_LOOP flag to indicate that the
+ * high order allocation is "critical".
+ */
+ if (order > 1)
return NULL;
/* Yield for kswapd, and try again */
From: Marcelo Tosatti <[email protected]>
Date: Thu, 18 Oct 2001 15:04:15 -0200 (BRST)
As you know, we currently allow 1-order allocations to fail easily.
However, there is one special case of 1-order allocations which cannot
fail: fork.
Here is the tested patch against pre4.
There are also some platforms using 1-order allocations
for page tables.
But I don't know if I agree with this special casing.
Why not just put something into the GFP flag bits
which distinguishes between high order allocations which
are "critical" and others which are "don't try too hard"?
BTW, such a scheme could be useful for page cache pre-fetching.
If you use a high order allocation, it is more likely that all
of the pages in that prefetch will fit into the same kernel TLB
mapping. We could use a GFP_NONCRITICAL for something like this.
Franks a lot,
David S. Miller
[email protected]
On Thu, 18 Oct 2001, David S. Miller wrote:
> From: Marcelo Tosatti <[email protected]>
> Date: Thu, 18 Oct 2001 15:04:15 -0200 (BRST)
>
> As you know, we currently allow 1-order allocations to fail easily.
>
> However, there is one special case of 1-order allocations which cannot
> fail: fork.
>
> Here is the tested patch against pre4.
>
> There are also some platforms using 1-order allocations
> for page tables as well.
>
> But I don't know if I agree with this special casing.
> Why not just put something into the GFP flag bits
> which distinguishes between high order allocations which
> are "critical" and others which are "don't try too hard"?
Look at the comment on my patch. I've suggested that :)
I added a __GFP_FAIL flag back in the 2.4-ac days exactly for that
purpose. I ported the same code to the XFS tree so they could try to
"lazily" allocate (big) structures to build page clusters.
However, there is one nasty problem with it: how can we define "don't
try too hard"?
Let's say you want to use the __GFP_FAIL flag when trying to allocate data
to do more readahead. If it fails too easily, we're never going to do
enough readahead.
What I'm trying to say is that we would need levels of "don't try too
hard" to have a nice scheme, and that's not simple.
See my point?
> BTW, such a scheme could be useful for page cache pre-fetching.
It could be used in a _LOT_ of performance critical parts of the kernel,
indeed.
> If you use a high order allocation, it is more likely that all
> of the pages in that prefetch will fit into the same kernel TLB
> mapping. We could use a GFP_NONCRITICAL for something like this.
On Thu, 18 Oct 2001, David S. Miller wrote:
>
> There are also some platforms using 1-order allocations
> for page tables as well.
>
> But I don't know if I agree with this special casing.
Well, it's not really any _new_ special casing - we've always had the
special case for order-0, the patch just expands it to order-1 too.
That said, I think a separate flag saying "don't try too hard" would be
better: it can be used for all orders, including 0 and 1, and just says
"ok, we want you to balance things, but if this allocation fails that's
not a big deal".
So the flag would just always be implicit in allocations of higher orders,
because big orders are basically impossible to guarantee..
Linus
On Thu, 18 Oct 2001, Linus Torvalds wrote:
>
> On Thu, 18 Oct 2001, David S. Miller wrote:
> >
> > There are also some platforms using 1-order allocations
> > for page tables as well.
> >
> > But I don't know if I agree with this special casing.
>
> Well, it's not really any _new_ special casing - we've always had the
> special case for order-0, the patch just expands it to order-1 too.
>
> That said, I think a separate flag saying "don't try too hard" would be
> better: it can be used for all orders, including 0 and 1, and just says
> "ok, we want you to balance things, but if this allocation fails that's
> not a big deal".
>
> So the flag would just always be implicit in allocations of higher orders,
> because big orders are basically impossible to guarantee..
Read my last mail on this thread... A single flag saying "we can fail
easily" does not sound good to me.
Imagine people changing the point where the
if ((gfp_mask & __GFP_FAIL))
return;
check is done (inside the freeing routines).
I would like to have a _defined_ meaning for a "fail easily" allocation,
and a simple unique __GFP_FAIL flag can't give us that IMO.
On Thu, 18 Oct 2001, Marcelo Tosatti wrote:
> Imagine people changing the point where the
>
> if ((gfp_mask & __GFP_FAIL))
> return;
>
> check is done (inside the freeing routines).
>
> I would like to have a _defined_ meaning for a "fail easily" allocation,
> and a simple unique __GFP_FAIL flag can't give us that IMO.
Actually, I guess we could define this to be the same point
where we'd end up freeing memory in order to satisfy our
allocation.
This would result in __GFP_FAIL meaning "give me memory if
it's available, but don't waste time freeing memory if we
don't have enough free memory now".
Space-wise these semantics could change (say, pages_low
vs. pages_min), but they'll stay the same when you look at
"how hard to try" or "how much effort to spend".
regards,
Rik
--
DMCA, SSSCA, W3C? Who cares? http://thefreeworld.net/ (volunteers needed)
http://www.surriel.com/ http://distro.conectiva.com/
On Thu, 18 Oct 2001, Rik van Riel wrote:
> On Thu, 18 Oct 2001, Marcelo Tosatti wrote:
>
> > Imagine people changing the point where the
> >
> > if ((gfp_mask & __GFP_FAIL))
> > return;
> >
> > check is done (inside the freeing routines).
> >
> > I would like to have a _defined_ meaning for a "fail easily" allocation,
> > and a simple unique __GFP_FAIL flag can't give us that IMO.
>
> Actually, I guess we could define this to be the same point
> where we'd end up freeing memory in order to satisfy our
> allocation.
>
> This would result in __GFP_FAIL meaning "give me memory if
> it's available, but don't waste time freeing memory if we
> don't have enough free memory now".
>
> Space-wise these semantics could change (say, pages_low
> vs. pages_min), but they'll stay the same when you look at
> "how hard to try" or "how much effort to spend".
Just remember that if we give __GFP_FAIL a "give me memory if it's
available" meaning we simply can't use it for stuff like pagecache
prefetching --- it's _too_ fragile.
That's why I think we need the freeing levels, and that's why I think we
should leave all of that for 2.5. :)
On Thu, 18 Oct 2001, Marcelo Tosatti wrote:
> On Thu, 18 Oct 2001, Rik van Riel wrote:
> > Actually, I guess we could define this to be the same point
> > where we'd end up freeing memory in order to satisfy our
> > allocation.
>
> Just remember that if we give __GFP_FAIL a "give me memory if it's
> available" meaning we simply can't use it for stuff like pagecache
> prefetching --- it's _too_ fragile.
IMHO it makes perfect sense, since at this point, one more
allocation _will_ push us over the limit and let kswapd go
to work to free up more memory.
We just need to make sure that the "wake up kswapd and maybe
help free memory" point is EXACTLY the same as the __GFP_FAIL
failure point.
Unless of course I'm overlooking something ... in that case
I'd appreciate it if you could point it out to me ;)
regards,
Rik
--
DMCA, SSSCA, W3C? Who cares? http://thefreeworld.net/ (volunteers needed)
http://www.surriel.com/ http://distro.conectiva.com/
On Thu, 18 Oct 2001, Rik van Riel wrote:
> On Thu, 18 Oct 2001, Marcelo Tosatti wrote:
> > On Thu, 18 Oct 2001, Rik van Riel wrote:
>
> > > Actually, I guess we could define this to be the same point
> > > where we'd end up freeing memory in order to satisfy our
> > > allocation.
> >
> > Just remember that if we give __GFP_FAIL a "give me memory if it's
> > available" meaning we simply can't use it for stuff like pagecache
> > prefetching --- it's _too_ fragile.
>
> IMHO it makes perfect sense, since at this point, one more
> allocation _will_ push us over the limit and let kswapd go
> to work to free up more memory.
>
> We just need to make sure that the "wake up kswapd and maybe
> help free memory" point is EXACTLY the same as the __GFP_FAIL
> failure point.
Ok, great, that works fine. We can do that for 2.4, no problem.
> Unless of course I'm overlooking something ... in that case
> I'd appreciate it if you could point it out to me ;)
I would just like to have a _good_ scheme for this kind of "lazy
allocations" for 2.5 which can also be used by the page clustering code.
We really don't want the page clustering code to simply use a
"__GFP_FAIL" allocation which fails this easily, because performance
matters there.
Got my point?
On Thu, 18 Oct 2001, Marcelo Tosatti wrote:
> On Thu, 18 Oct 2001, Rik van Riel wrote:
> > We just need to make sure that the "wake up kswapd and maybe
> > help free memory" point is EXACTLY the same as the __GFP_FAIL
> > failure point.
>
> Ok, great, that works fine. We can do that for 2.4, no problem.
A quick (untested) patch to demonstrate my idea is below
my signature. Note the comments in __alloc_pages() ...
it's important to only fail our __GFP_FAIL allocation
_after_ having woken up kswapd.
regards,
Rik
--
DMCA, SSSCA, W3C? Who cares? http://thefreeworld.net/ (volunteers needed)
http://www.surriel.com/ http://distro.conectiva.com/
--- linux-2.4.12-ac3/mm/page_alloc.c.nofail Thu Oct 18 23:24:50 2001
+++ linux-2.4.12-ac3/mm/page_alloc.c Fri Oct 19 01:34:31 2001
@@ -341,18 +341,25 @@
/*
* OK, none of the zones on our zonelist has lots
- * of pages free.
- *
- * We wake up kswapd, in the hope that kswapd will
- * resolve this situation before memory gets tight.
- *
- * We'll also help a bit trying to free pages, this
- * way statistics will make sure really fast allocators
- * are slowed down more than slow allocators and other
- * programs in the system shouldn't be impacted as much
- * by the hogs.
+ * of pages free. Kswapd has work to do ...
*/
wakeup_kswapd();
+
+ /*
+ * We don't want to do memory balancing work ourselves,
+ * instead we fail this allocation and hope that kswapd
+ * will have things in a better shape next time.
+ */
+ if (gfp_mask & __GFP_FAIL)
+ return NULL;
+
+ /*
+ * Free some pages ourselves, rather than eating up the
+ * last few free pages and running the system into the
+ * ground. Since this slows down heavy allocators more
+ * than occasional allocators, it provides some fairness
+ * and smoother behaviour under heavy load.
+ */
if ((gfp_mask & __GFP_WAIT) && !(current->flags & PF_MEMALLOC))
try_to_free_pages(gfp_mask);
--- linux-2.4.12-ac3/include/linux/mm.h.nofail Fri Oct 19 01:29:19 2001
+++ linux-2.4.12-ac3/include/linux/mm.h Fri Oct 19 01:33:30 2001
@@ -567,12 +567,14 @@
#define __GFP_HIGH 0x20 /* Should access emergency pools? */
#define __GFP_IO 0x40 /* Can start physical IO? */
#define __GFP_FS 0x80 /* Can call down to low-level FS? */
+#define __GFP_FAIL 0x100 /* Fail early when low on free pages. */
#define GFP_NOIO (__GFP_HIGH | __GFP_WAIT)
#define GFP_NOFS (__GFP_HIGH | __GFP_WAIT | __GFP_IO)
#define GFP_ATOMIC (__GFP_HIGH)
#define GFP_USER ( __GFP_WAIT | __GFP_IO | __GFP_FS)
#define GFP_HIGHUSER ( __GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM)
+#define GFP_READAHEAD ( __GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM | __GFP_FAIL)
#define GFP_KERNEL (__GFP_HIGH | __GFP_WAIT | __GFP_IO | __GFP_FS)
#define GFP_NFS (__GFP_HIGH | __GFP_WAIT | __GFP_IO | __GFP_FS)
#define GFP_KSWAPD ( __GFP_IO | __GFP_FS)