Change ALLOC_NO_WATERMARK page allocation such that dipping into the reserves
becomes a system wide event.
This has the advantage that logic dealing with reserve pages need not be node
aware (when we're this low on memory, speed is usually not an issue).
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Cc: Christoph Lameter <[email protected]>
---
mm/page_alloc.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)
Index: linux-2.6-2/mm/page_alloc.c
===================================================================
--- linux-2.6-2.orig/mm/page_alloc.c
+++ linux-2.6-2/mm/page_alloc.c
@@ -1311,6 +1311,21 @@ restart:
 rebalance:
        if (alloc_flags & ALLOC_NO_WATERMARKS) {
 nofail_alloc:
+               /*
+                * break out of mempolicy boundaries
+                */
+               zonelist = NODE_DATA(numa_node_id())->node_zonelists +
+                       gfp_zone(gfp_mask);
+
+               /*
+                * Before going bare metal, try to get a page above the
+                * critical threshold - ignoring CPU sets.
+                */
+               page = get_page_from_freelist(gfp_mask, order, zonelist,
+                               ALLOC_WMARK_MIN|ALLOC_HIGH|ALLOC_HARDER);
+               if (page)
+                       goto got_pg;
+
                /* go through the zonelist yet again, ignoring mins */
                page = get_page_from_freelist(gfp_mask, order, zonelist,
                                ALLOC_NO_WATERMARKS);
--
On Mon, 6 Aug 2007, Peter Zijlstra wrote:
> Change ALLOC_NO_WATERMARK page allocation such that dipping into the reserves
> becomes a system wide event.
Shudder. That can just be a disaster for NUMA. Both performance wise and
logic wise. One cpuset being low on memory should not affect applications
in other cpusets.
On Monday 06 August 2007 11:11, Christoph Lameter wrote:
> On Mon, 6 Aug 2007, Peter Zijlstra wrote:
> > Change ALLOC_NO_WATERMARK page allocation such that dipping into
> > the reserves becomes a system wide event.
>
> Shudder. That can just be a disaster for NUMA. Both performance wise
> and logic wise. One cpuset being low on memory should not affect
> applications in other cpusets.
Currently your system likely would have died here, so ending up with a
reserve page temporarily on the wrong node is already an improvement.
I agree that the reserve pool should be per-node in the end, but I do
not think that serves the interest of simplifying the initial patch
set. How about a numa performance patch that adds onto the end of
Peter's series?
Regards,
Daniel
On Mon, 2007-08-06 at 11:21 -0700, Daniel Phillips wrote:
> On Monday 06 August 2007 11:11, Christoph Lameter wrote:
> > On Mon, 6 Aug 2007, Peter Zijlstra wrote:
> > > Change ALLOC_NO_WATERMARK page allocation such that dipping into
> > > the reserves becomes a system wide event.
> >
> > Shudder. That can just be a disaster for NUMA. Both performance wise
> > and logic wise. One cpuset being low on memory should not affect
> > applications in other cpusets.
Do note that these are only PF_MEMALLOC allocations that will break the
cpuset. And one can argue that these are not application allocation but
system allocations.
> Currently your system likely would have died here, so ending up with a
> reserve page temporarily on the wrong node is already an improvement.
>
> I agree that the reserve pool should be per-node in the end, but I do
> not think that serves the interest of simplifying the initial patch
> set. How about a numa performance patch that adds onto the end of
> Peter's series?
Trouble with keeping this per node is that all the code dealing with the
reserve needs to keep per-cpu state, which given that the system is
really crawling at that moment, seems excessive.
On Mon, 6 Aug 2007, Daniel Phillips wrote:
> Currently your system likely would have died here, so ending up with a
> reserve page temporarily on the wrong node is already an improvement.
The system would have died? Why? The application in the cpuset that
ran out of memory should have died not the system.
On Monday 06 August 2007 11:31, Peter Zijlstra wrote:
> > I agree that the reserve pool should be per-node in the end, but I
> > do not think that serves the interest of simplifying the initial
> > patch set. How about a numa performance patch that adds onto the
> > end of Peter's series?
>
> Trouble with keeping this per node is that all the code dealing with
> the reserve needs to keep per-cpu state, which given that the system
> is really crawling at that moment, seems excessive.
It does. I was suggesting that Christoph think about the NUMA part; our
job is just to save the world ;-)
Regards,
Daniel
On Monday 06 August 2007 11:42, Christoph Lameter wrote:
> On Mon, 6 Aug 2007, Daniel Phillips wrote:
> > Currently your system likely would have died here, so ending up
> > with a reserve page temporarily on the wrong node is already an
> > improvement.
>
> The system would have died? Why?
Because a block device may have deadlocked here, leaving the system
unable to clean dirty memory, or unable to load executables over the
network for example.
> The application in the cpuset that ran out of memory should have died
> not the system.
If your "application" is a virtual block device then you can land in
deep doodoo.
Regards,
Daniel
On Mon, 6 Aug 2007, Daniel Phillips wrote:
> On Monday 06 August 2007 11:42, Christoph Lameter wrote:
> > On Mon, 6 Aug 2007, Daniel Phillips wrote:
> > > Currently your system likely would have died here, so ending up
> > > with a reserve page temporarily on the wrong node is already an
> > > improvement.
> >
> > The system would have died? Why?
>
> Because a block device may have deadlocked here, leaving the system
> unable to clean dirty memory, or unable to load executables over the
> network for example.
So this is a locking problem that has not been taken care of?
On Mon, 6 Aug 2007, Peter Zijlstra wrote:
> > > Shudder. That can just be a disaster for NUMA. Both performance wise
> > > and logic wise. One cpuset being low on memory should not affect
> > > applications in other cpusets.
>
> Do note that these are only PF_MEMALLOC allocations that will break the
> cpuset. And one can argue that these are not application allocation but
> system allocations.
This is global, global locking etc etc. On a large NUMA system this will
cause significant delays. One fears that a livelock may result.
On Monday 06 August 2007 11:51, Christoph Lameter wrote:
> On Mon, 6 Aug 2007, Daniel Phillips wrote:
> > On Monday 06 August 2007 11:42, Christoph Lameter wrote:
> > > On Mon, 6 Aug 2007, Daniel Phillips wrote:
> > > > Currently your system likely would have died here, so ending up
> > > > with a reserve page temporarily on the wrong node is already an
> > > > improvement.
> > >
> > > The system would have died? Why?
> >
> > Because a block device may have deadlocked here, leaving the system
> > unable to clean dirty memory, or unable to load executables over
> > the network for example.
>
> So this is a locking problem that has not been taken care of?
A deadlock problem that we used to call memory recursion deadlock, aka:
somebody in the vm writeout path needs to allocate memory, but there is
no memory, so vm writeout stops forever.
Regards,
Daniel
On Mon, 2007-08-06 at 12:11 -0700, Christoph Lameter wrote:
> On Mon, 6 Aug 2007, Peter Zijlstra wrote:
>
> > > > Shudder. That can just be a disaster for NUMA. Both performance wise
> > > > and logic wise. One cpuset being low on memory should not affect
> > > > applications in other cpusets.
> >
> > Do note that these are only PF_MEMALLOC allocations that will break the
> > cpuset. And one can argue that these are not application allocation but
> > system allocations.
>
> This is global, global locking etc etc. On a large NUMA system this will
> cause significant delays. One fears that a livelock may result.
The only new lock is in SLUB, and I'm not aware of any regular
PF_MEMALLOC paths using slab allocations, but I'll instrument the
regular reclaim path to verify this.
The functionality this is aimed at is swap over network, and I doubt
you'll be enabling that on these machines.
On Mon, 6 Aug 2007, Peter Zijlstra wrote:
> The functionality this is aimed at is swap over network, and I doubt
> you'll be enabling that on these machines.
So add #ifdefs around it?
On Mon, Aug 06, 2007 at 11:51:45AM -0700, Christoph Lameter wrote:
> On Mon, 6 Aug 2007, Daniel Phillips wrote:
>
> > On Monday 06 August 2007 11:42, Christoph Lameter wrote:
> > > On Mon, 6 Aug 2007, Daniel Phillips wrote:
> > > > Currently your system likely would have died here, so ending up
> > > > with a reserve page temporarily on the wrong node is already an
> > > > improvement.
> > >
> > > The system would have died? Why?
> >
> > Because a block device may have deadlocked here, leaving the system
> > unable to clean dirty memory, or unable to load executables over the
> > network for example.
>
> So this is a locking problem that has not been taken care of?
No.
It's very simple:
1) memory becomes full
2) we try to free memory by paging or swapping
3) I/O requires a memory allocation which fails because memory is full
4) box dies because it's unable to dig itself out of OOM
Most I/O paths can deal with this by having a mempool for their I/O
needs. For network I/O, this turns out to be prohibitively hard due to
the complexity of the stack.
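For a block driver, that mempool solution usually has roughly the shape of
the sketch below (illustrative names and sizes; mempool_create_kmalloc_pool(),
mempool_alloc() and mempool_free() are the stock kernel API):

#include <linux/mempool.h>
#include <linux/gfp.h>
#include <linux/errno.h>

struct my_request {             /* per-I/O bookkeeping; illustrative */
        char data[256];
};

#define MIN_IO_REQUESTS 4       /* assumed minimum needed for forward progress */

static mempool_t *req_pool;

static int io_pool_init(void)
{
        /* Pre-reserve MIN_IO_REQUESTS objects so the I/O path can always
         * make forward progress, even when the page allocator is empty. */
        req_pool = mempool_create_kmalloc_pool(MIN_IO_REQUESTS,
                                               sizeof(struct my_request));
        return req_pool ? 0 : -ENOMEM;
}

/* Writeout path: may dip into the pre-reserved objects and will wait for
 * one to be returned rather than fail, as long as completions keep
 * calling mempool_free(). */
static struct my_request *get_request(void)
{
        return mempool_alloc(req_pool, GFP_NOIO);
}

static void put_request(struct my_request *req)
{
        mempool_free(req, req_pool);
}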
--
Mathematics is the supreme nostalgia of our time.
On Mon, 6 Aug 2007, Matt Mackall wrote:
> > > Because a block device may have deadlocked here, leaving the system
> > > unable to clean dirty memory, or unable to load executables over the
> > > network for example.
> >
> > So this is a locking problem that has not been taken care of?
>
> No.
>
> It's very simple:
>
> 1) memory becomes full
We do have limits to avoid memory getting too full.
> 2) we try to free memory by paging or swapping
> 3) I/O requires a memory allocation which fails because memory is full
> 4) box dies because it's unable to dig itself out of OOM
>
> Most I/O paths can deal with this by having a mempool for their I/O
> needs. For network I/O, this turns out to be prohibitively hard due to
> the complexity of the stack.
The common solution is to have a reserve (min_free_kbytes). The problem
with the network stack seems to be that the amount of reserve needed
cannot be predicted accurately.
The solution may be as simple as configuring the reserves right and
avoid the unbounded memory allocations. That is possible if one
would make sure that the network layer triggers reclaim once in a
while.
On Mon, 2007-08-06 at 13:19 -0700, Christoph Lameter wrote:
> On Mon, 6 Aug 2007, Matt Mackall wrote:
>
> > > > Because a block device may have deadlocked here, leaving the system
> > > > unable to clean dirty memory, or unable to load executables over the
> > > > network for example.
> > >
> > > So this is a locking problem that has not been taken care of?
> >
> > No.
> >
> > It's very simple:
> >
> > 1) memory becomes full
>
> We do have limits to avoid memory getting too full.
>
> > 2) we try to free memory by paging or swapping
> > 3) I/O requires a memory allocation which fails because memory is full
> > 4) box dies because it's unable to dig itself out of OOM
> >
> > Most I/O paths can deal with this by having a mempool for their I/O
> > needs. For network I/O, this turns out to be prohibitively hard due to
> > the complexity of the stack.
>
> The common solution is to have a reserve (min_free_kbytes).
This patch set builds on that. Trouble with min_free_kbytes is the
relative nature of ALLOC_HIGH and ALLOC_HARDER.
> The problem
> with the network stack seems to be that the amount of reserve needed
> cannot be predicted accurately.
>
> The solution may be as simple as configuring the reserves right and
> avoid the unbounded memory allocations.
Which is what the next series of patches will be doing. Please do look
in detail at these networked swap patches I've been posting for the last
year or so.
> That is possible if one
> would make sure that the network layer triggers reclaim once in a
> while.
This does not make sense, we cannot reclaim from reclaim.
On Mon, 6 Aug 2007 13:19:26 -0700 (PDT) Christoph Lameter <[email protected]> wrote:
> On Mon, 6 Aug 2007, Matt Mackall wrote:
>
> > > > Because a block device may have deadlocked here, leaving the system
> > > > unable to clean dirty memory, or unable to load executables over the
> > > > network for example.
> > >
> > > So this is a locking problem that has not been taken care of?
> >
> > No.
> >
> > It's very simple:
> >
> > 1) memory becomes full
>
> We do have limits to avoid memory getting too full.
>
> > 2) we try to free memory by paging or swapping
> > 3) I/O requires a memory allocation which fails because memory is full
> > 4) box dies because it's unable to dig itself out of OOM
> >
> > Most I/O paths can deal with this by having a mempool for their I/O
> > needs. For network I/O, this turns out to be prohibitively hard due to
> > the complexity of the stack.
>
> The common solution is to have a reserve (min_free_kbytes). The problem
> with the network stack seems to be that the amount of reserve needed
> cannot be predicted accurately.
>
> The solution may be as simple as configuring the reserves right and
> avoid the unbounded memory allocations. That is possible if one
> would make sure that the network layer triggers reclaim once in a
> while.
Such a simple fix would be attractive. Some of the net drivers now have
remarkably large rx and tx queues. One wonders if this is playing a part
in the problem and whether reducing the queue sizes would help.
I guess we'd need to reduce the queue size on all NICs in the machine
though, which might be somewhat of a performance problem.
I don't think we've seen a lot of justification for those large queues.
I'm suspecting it's a few percent in carefully-chosen workloads (of the
microbenchmark kind?)
On Mon, 6 Aug 2007, Peter Zijlstra wrote:
> > The solution may be as simple as configuring the reserves right and
> > avoid the unbounded memory allocations.
>
> Which is what the next series of patches will be doing. Please do look
> in detail at these networked swap patches I've been posting for the last
> year or so.
>
> > That is possible if one
> > would make sure that the network layer triggers reclaim once in a
> > while.
>
> This does not make sense, we cannot reclaim from reclaim.
But we should limit the amounts of allocation we do while performing
reclaim. F.e. refilling memory pools during reclaim should be disabled.
(What Peter already wrote, but in different words)
On Monday 06 August 2007 13:19, Christoph Lameter wrote:
> The solution may be as simple as configuring the reserves right and
> avoid the unbounded memory allocations.
Exactly. That is what this patch set is about. This is the part that
provides some hooks to extend the traditional reserves to be able to
handle some of the more difficult situations.
> That is possible if one
> would make sure that the network layer triggers reclaim once in a
> while.
No. It does no good at all for network to do a bunch of work
reclaiming, then have some other random task (for example, a heavy
writer) swoop in and grab the reclaimed memory before net can use it.
Also, net allocates memory in interrupt context where shrink_caches is
not possible. The correct solution is to _reserve_ the memory net
needs for vm writeout, which is in Peter's next patch set coming down
the pipe.
Regards,
Daniel
On Monday 06 August 2007 14:05, Christoph Lameter wrote:
> > > That is possible if one
> > > would make sure that the network layer triggers reclaim once in a
> > > while.
> >
> > This does not make sense, we cannot reclaim from reclaim.
>
> But we should limit the amounts of allocation we do while performing
> reclaim...
Correct. That is what the throttling part of these patches is about.
In order to fix the vm writeout deadlock problem properly, two things
are necessary:
1) Throttle the vm writeout path to use a bounded amount of memory
2) Provide access to a sufficiently large amount of reserve memory for
each memory user in the vm writeout path
You can understand every detail of this patch set and the following ones
coming from Peter in terms of those two requirements.
> F.e. refilling memory pools during reclaim should be disabled.
Actually, recursing into the vm should be disabled entirely but that is
a rather deeply ingrained part of mm culture we do not propose to
fiddle with just now.
Memory pools are refilled when the pool user frees some memory, not ever
by the mm.
Regards,
Daniel
On Mon, 6 Aug 2007, Daniel Phillips wrote:
> Correct. That is what the throttling part of these patches is about.
Where are those patches?
> In order to fix the vm writeout deadlock problem properly, two things
> are necessary:
>
> 1) Throttle the vm writeout path to use a bounded amount of memory
>
> 2) Provide access to a sufficiently large amount of reserve memory for
> each memory user in the vm writeout path
>
> You can understand every detail of this patch set and the following ones
> coming from Peter in terms of those two requirements.
AFAICT: This patchset is not throttling processes but failing allocations.
The patchset does not reconfigure the memory reserves as expected. Instead
new reserve logic is added. And I suspect that we have the same issues as
in earlier releases with various corner cases not being covered. Code is
added that is supposedly not used. If it ever is on a large config then we
are in very deep trouble by the new code paths themselves that serialize
things in order to give some allocations precedence over the other
allocations that are made to fail ....
On Monday 06 August 2007 13:27, Andrew Morton wrote:
> On Mon, 6 Aug 2007 13:19:26 -0700 (PDT) Christoph Lameter wrote:
> > The solution may be as simple as configuring the reserves right and
> > avoid the unbounded memory allocations. That is possible if one
> > would make sure that the network layer triggers reclaim once in a
> > while.
>
> Such a simple fix would be attractive. Some of the net drivers now
> have remarkably large rx and tx queues. One wonders if this is
> playing a part in the problem and whether reducing the queue sizes
> would help.
There is nothing wrong with having huge rx and tx queues, except when
they lie in the vm writeout path. In that case, we do indeed throttle
them to reasonable numbers.
See:
http://zumastor.googlecode.com/svn/trunk/ddsnap/kernel/dm-ddsnap.c
down(&info->throttle_sem);
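The throttling pattern in there is essentially a counting semaphore that
bounds the number of in-flight requests; a rough sketch with illustrative
names (the real code is in the dm-ddsnap.c linked above):

#include <asm/semaphore.h>

#define MAX_INFLIGHT    64      /* assumed per-device bound on queued requests */

struct devinfo {
        struct semaphore throttle_sem;
};

static void throttle_init(struct devinfo *info)
{
        sema_init(&info->throttle_sem, MAX_INFLIGHT);
}

/* Submission path: blocks once MAX_INFLIGHT requests are outstanding, so
 * the writeout path can only pin a known, bounded amount of memory. */
static void submit_throttled(struct devinfo *info)
{
        down(&info->throttle_sem);
        /* ... allocate per-request memory and queue the I/O ... */
}

/* Completion path: release a slot. */
static void complete_request(struct devinfo *info)
{
        up(&info->throttle_sem);
}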
The only way we have ever gotten ddsnap to run reliably under heavy load
without deadlocking is with Peter's patch set (a distant descendant of
mine from two years or so ago) and we did put a lot of effort into
trying to make congestion_wait and friends do the job instead.
To be sure, it is a small subset of Peter's full patch set that actually
does the job that congestion_wait cannot be made to do, which is to
guarantee that socket->write() is always able to get enough memory to
send out the vm traffic without recursing. What Peter's patches do is
make it _nice_ to fix these problems.
Regards,
Daniel
On Monday 06 August 2007 16:14, Christoph Lameter wrote:
> On Mon, 6 Aug 2007, Daniel Phillips wrote:
> > Correct. That is what the throttling part of these patches is
> > about.
>
> Where are those patches?
Here is one user:
http://zumastor.googlecode.com/svn/trunk/ddsnap/kernel/dm-ddsnap.c
down(&info->throttle_sem);
Peter has another (swap over net). A third is the network stack itself,
which is in the full patch set that Peter has posted a number of times
in the past but did not appear today because Peter broke the full patch
set up into multiple sets to make it all easier to understand.
> AFAICT: This patchset is not throttling processes but failing
> allocations.
Failing allocations? Where do you see that? As far as I can see,
Peter's patch set allows allocations to fail exactly where the user has
always specified they may fail, and in no new places. If there is a
flaw in that logic, please let us know.
What the current patch set actually does is allow some critical
allocations that would have failed or recursed into deadlock before to
succeed instead, allowing vm writeout to complete successfully. You
may quibble with exactly how he accomplishes that, but preventing these
allocations from failing is not optional.
> The patchset does not reconfigure the memory reserves as
> expected.
What do you mean by that? Expected by who?
> Instead new reserve logic is added.
Peter did not actually have to add a new layer of abstraction to
alloc_pages to impose order on the hardcoded hacks that currently live
in there to decide how far various callers can dig into reserves. It
would probably help people understand this patch set if that part were
taken out for now and replaced with the original seat-of-the-pants two
line hack I had in the original. But I do not see anything wrong with
what Peter has written there, it just takes a little more time to read.
> And I suspect that we
> have the same issues as in earlier releases with various corner cases
> not being covered.
Do you have an example?
> Code is added that is supposedly not used.
What makes you think that?
> If it ever is on a large config then we are in very deep trouble by
> the new code paths themselves that serialize things in order to give
> some allocations precedence over the other allocations that are made
> to fail ....
You mean by allocating the reserve memory on the wrong node in NUMA?
That is on a code path that avoids destroying your machine performance
or killing the machine entirely as with current kernels, for which a
few cachelines pulled to another node is a small price to pay. And you
are free to use your special expertise in NUMA to make those fallback
paths even more efficient, but first you need to understand what they
are doing and why.
At your service for any more questions :-)
Regards,
Daniel
On Mon, 6 Aug 2007, Daniel Phillips wrote:
> > AFAICT: This patchset is not throttling processes but failing
> > allocations.
>
> Failing allocations? Where do you see that? As far as I can see,
> Peter's patch set allows allocations to fail exactly where the user has
> always specified they may fail, and in no new places. If there is a
> flaw in that logic, please let us know.
See the code added to slub: Allocations are satisfied from the reserve
patch or they are failing.
> > The patchset does not reconfigure the memory reserves as
> > expected.
>
> What do you mean by that? Expected by who?
What would be expected is some recalculation of min_free_kbytes?
> > And I suspect that we
> > have the same issues as in earlier releases with various corner cases
> > not being covered.
>
> Do you have an example?
Try NUMA constraints and zone limitations.
> > Code is added that is supposedly not used.
>
> What makes you think that?
Because the argument is that performance does not matter since the code
paths are not used.
> > If it ever is on a large config then we are in very deep trouble by
> > the new code paths themselves that serialize things in order to give
> > some allocations precedence over the other allocations that are made
> > to fail ....
>
> You mean by allocating the reserve memory on the wrong node in NUMA?
No I mean all 1024 processors of our system running into this fail/succeed
thingy that was added.
> That is on a code path that avoids destroying your machine performance
> or killing the machine entirely as with current kernels, for which a
As far as I know from our systems: The current kernels do not kill the
machine if the reserves are configured the right way.
> few cachelines pulled to another node is a small price to pay. And you
> are free to use your special expertise in NUMA to make those fallback
> paths even more efficient, but first you need to understand what they
> are doing and why.
There is your problem. The justification is not clear at all and the
solution likely causes unrelated problems.
On Tue, 2007-08-07 at 15:18 -0700, Christoph Lameter wrote:
> On Mon, 6 Aug 2007, Daniel Phillips wrote:
>
> > > AFAICT: This patchset is not throttling processes but failing
> > > allocations.
> >
> > Failing allocations? Where do you see that? As far as I can see,
> > Peter's patch set allows allocations to fail exactly where the user has
> > always specified they may fail, and in no new places. If there is a
> > flaw in that logic, please let us know.
>
> See the code added to slub: Allocations are satisfied from the reserve
> patch or they are failing.
Allocations are satisfied from the reserve IFF the allocation context is
entitled to the reserve, otherwise it will try to allocate a new slab.
And that is exactly like any other low mem situation, allocations do
fail under pressure, but not more so with this patch.
> > > The patchset does not reconfigure the memory reserves as
> > > expected.
> >
> > What do you mean by that? Expected by who?
>
> What would be expected is some recalculation of min_free_kbytes?
And I have been telling you:
if (alloc_flags & ALLOC_HIGH)
min -= min / 2;
if (alloc_flags & ALLOC_HARDER)
min -= min / 4;
so the reserve is a relative 3/8 of min_free_kbytes, whereas a fixed limit is required.
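To put illustrative numbers on that: with a per-zone min of, say, 1024
pages, ALLOC_HIGH drops the watermark to 512 pages and ALLOC_HARDER then
to 384 pages, i.e. 3/8 of the original min. The emergency headroom thus
scales with whatever min_free_kbytes happens to be, rather than being an
absolute amount sized for the worst-case burst the network stack has to
survive.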
> > > Code is added that is supposedly not used.
> >
> > What makes you think that?
>
> Because the argument is that performance does not matter since the code
> paths are not used.
The argument is that once you hit these code paths, you don't much care
for performance, not the other way around.
> > > And I suspect that we
> > > have the same issues as in earlier releases with various corner cases
> > > not being covered.
> >
> > Do you have an example?
>
> Try NUMA constraints and zone limitations.
How exactly are these failing? Breaking out of the policy boundaries
in the reserve path is no different from IRQ context allocations not
honouring them.
> > > If it ever is on a large config then we are in very deep trouble by
> > > the new code paths themselves that serialize things in order to give
> > > some allocations precedence over the other allocations that are made
> > > to fail ....
> >
> > You mean by allocating the reserve memory on the wrong node in NUMA?
>
> No I mean all 1024 processors of our system running into this fail/succeed
> thingy that was added.
You mean to say 1024 cpus running into the kmem_cache-wide reserve
slab? Yeah, if that happens that will hurt.
I just instrumented the kernel to find the allocations done under
PF_MEMALLOC for a heavy (non-swapping) reclaim load:
localhost ~ # cat /proc/ip_trace
79 [<000000006001c140>] do_ubd_request+0x97/0x176
1 [<000000006006dfef>] allocate_slab+0x44/0x9a
165 [<00000000600e75d7>] new_handle+0x1d/0x47
1 [<00000000600ee897>] __jbd_kmalloc+0x18/0x1a
141 [<00000000600eeb2d>] journal_alloc_journal_head+0x15/0x6f
1 [<0000000060136907>] current_io_context+0x38/0x95
1 [<0000000060139634>] alloc_as_io_context+0x18/0x95
this is from about 15 mins of load-5 trashing file backed reclaim.
So sadly I was mistaken, you will run into it, albeit from the numbers
it's not very frequent. (And here I thought that path was fully covered
with mempools.)
> > That is on a code path that avoids destroying your machine performance
> > or killing the machine entirely as with current kernels, for which a
>
> As far as I know from our systems: The current kernels do not kill the
> machine if the reserves are configured the right way.
AFAIK the current kernels do not allow for swap over nfs and a bunch of
other things.
The scenario I'm interested in is (two machines: client [A]/server [B]):
1) networked machine [A] runs swap-over-net (NFS/iSCSI)
2) someone trips over the NFS/iSCSI server's [B] ethernet cable
/ someone does /etc/init.d/nfs stop
3) the machine [A] stops dead in its tracks in the middle of reclaim
4) the machine [A] keeps on receiving unrelated network traffic for
whatever other purpose the machine served
- time passes -
5) cable gets reconnected / nfs server started [B]
6) client [A] succeeds with the reconnect and happily continues its
swapping life.
Between 4-5 it needs to receive and discard an unspecified amount of
network traffic while user-space is basically frozen solid.
FWIW this scenario works here.
> > few cachelines pulled to another node is a small price to pay. And you
> > are free to use your special expertise in NUMA to make those fallback
> > paths even more efficient, but first you need to understand what they
> > are doing and why.
>
> There is your problem. The justification is not clear at all and the
> solution likely causes unrelated problems.
The situation I'm wanting to avoid is during 3), the machine will slowly
but steadily freeze over as userspace allocations start to block on
outstanding swap-IO.
Since I need a fixed reserve to operate 4), I do not want these slowly
freezing processes to 'accidentally' gobble up slabs allocated from the
reserve for more important matters.
So what the modification to the slab allocator does is:
- if there is a reserve slab:
- if the context allows access to the reserves
- serve object
- otherwise try to allocate a new slab
- if this succeeds
- remove the reserve slab since clearly the pressure is gone
- serve from the new slab
- otherwise fail
Now, that 'fail' part scares you; note, however, that __GFP_WAIT
allocations will stick around in the allocation path for a while, just
like in regular low memory situations.
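A rough, self-contained C sketch of the decision logic just described;
the names (reserve_slab, can_use_reserve(), new_slab() and friends) are
illustrative stand-ins, not the actual SLUB identifiers:

#include <linux/gfp.h>
#include <linux/mm.h>

struct cache_sketch {
        struct page *reserve_slab;      /* slab taken from the reserve, if any */
};

/* Hypothetical helpers standing in for the real SLUB internals. */
extern struct page *new_slab(struct cache_sketch *s, gfp_t gfp);
extern void *first_free_object(struct page *slab);
extern void discard_slab(struct cache_sketch *s, struct page *slab);
extern int can_use_reserve(gfp_t gfp);  /* __GFP_MEMALLOC / PF_MEMALLOC test */

static void *alloc_with_reserve(struct cache_sketch *s, gfp_t gfp)
{
        struct page *slab;

        if (s->reserve_slab) {
                if (can_use_reserve(gfp))
                        return first_free_object(s->reserve_slab);

                slab = new_slab(s, gfp);
                if (slab) {
                        /* Allocation succeeded: the pressure is gone, so
                         * drop the reserve slab and serve from the new one. */
                        discard_slab(s, s->reserve_slab);
                        s->reserve_slab = NULL;
                        return first_free_object(slab);
                }
                return NULL;    /* not entitled and no memory: fail */
        }
        /* No reserve slab: fall through to the regular slow path (not shown). */
        return NULL;
}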
Anyway, I'm in a bit of a mess having proven even file backed reclaim
hits these paths.
1) I'm reluctant to add #ifdef in core allocation paths
2) I'm reluctant to add yet another NR_CPUS array to struct kmem_cache
Christoph, does this all explain the situation?
On 8/7/07, Christoph Lameter <[email protected]> wrote:
> > > AFAICT: This patchset is not throttling processes but failing
> > > allocations.
> >
> > Failing allocations? Where do you see that? As far as I can see,
> > Peter's patch set allows allocations to fail exactly where the user has
> > always specified they may fail, and in no new places. If there is a
> > flaw in that logic, please let us know.
>
> See the code added to slub: Allocations are satisfied from the reserve
> patch or they are failing.
First off, what I know about this patch set is, it works. Without it,
ddsnap deadlocks on network IO, and with it, no deadlock. We are
niggling over how it works, not whether it works, or whether it is
needed.
Next a confession: I have not yet applied this particular patch and
read the resulting file. However, the algorithm it is supposed to
implement with respect to slab is:
1. If the allocation can be satisfied in the usual way, do that.
2. Otherwise, if the GFP flags do not include __GFP_MEMALLOC and
PF_MEMALLOC is not set, fail the allocation
3. Otherwise, if the memcache's reserve quota is not reached,
satisfy the request, allocating a new page from the MEMALLOC reserve,
bump the memcache's reserve counter and succeed
4. Die. We are the walking dead. Do whatever we feel like, for
example fail the allocation. This is a kernel bug; if we are lucky,
enough of the kernel will remain running to get some diagnostics.
If it does not implement that algorithm, please shout.
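For reference, the entitlement test in step 2 boils down to something
like the sketch below; __GFP_MEMALLOC is the flag this patch series
introduces, PF_MEMALLOC is the long-standing process flag:

#include <linux/gfp.h>
#include <linux/sched.h>

/* May this allocation dip into the emergency reserve? */
static int can_use_reserve(gfp_t gfp_mask)
{
        if (gfp_mask & __GFP_MEMALLOC)          /* flag added by the patch set */
                return 1;
        if (current->flags & PF_MEMALLOC)       /* task is in the reclaim path */
                return 1;
        return 0;
}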
> > > The patchset does not reconfigure the memory reserves as
> > > expected.
> >
> > What do you mean by that? Expected by who?
>
> What would be expected is some recalculation of min_free_kbytes?
I still do not know exactly what you are talking about. The patch set
provides a means of adjusting the global memalloc reserve when a
memalloc pool user starts or stops. We could leave that part entirely
out of the patch set and just rely on the reserve being "big enough"
as we have done since the dawn of time. That would leave less to
niggle about and the reserve adjustment mechanism could be submitted
later. Would that be better?
> > > And I suspect that we
> > > have the same issues as in earlier releases with various corner cases
> > > not being covered.
> >
> > Do you have an example?
>
> Try NUMA constraints and zone limitations.
Are you worried about a correctness issue that would prevent the
machine from operating, or are you just worried about allocating
reserve pages to the local node for performance reasons?
> > > Code is added that is supposedly not used.
> >
> > What makes you think that?
>
> Because the argument is that performance does not matter since the code
> paths are not used.
Used, yes. Maybe heavily, maybe not. The point is, with current
kernels you would never get to these new code paths because your
machine would have slowed to a crawl or deadlocked. So not having the
perfect NUMA implementation of the new memory reserves is perhaps a
new performance issue for NUMA, but it is no showstopper, and
according to design, existing code paths are not degraded to any
measurable extent. Deadlock in current kernels is a showstopper; that
is what this patch set fixes.
Since there is no regression for NUMA here (there is not supposed to
be anyway) I do not think that adding more complexity to the patch set
to optimize NUMA in this corner case that you could never even get to
before is warranted. That is properly a follow-on NUMA-specific
patch, analogous to a per-arch optimization.
> > > If it ever is on a large config then we are in very deep trouble by
> > > the new code paths themselves that serialize things in order to give
> > > some allocations precedence over the other allocations that are made
> > > to fail ....
> >
> > You mean by allocating the reserve memory on the wrong node in NUMA?
>
> No I mean all 1024 processors of our system running into this fail/succeed
> thingy that was added.
If an allocation now fails that would have succeeded in the past, the
patch set is buggy. I can't say for sure one way or another at this
time of night. If you see something, could you please mention a
file/line number?
> > That is on a code path that avoids destroying your machine performance
> > or killing the machine entirely as with current kernels, for which a
>
> As far as I know from our systems: The current kernels do not kill the
> machine if the reserves are configured the right way.
Current kernels deadlock on a regular basis doing fancy block IO. I
am not sure what you mean.
> > few cachelines pulled to another node is a small price to pay. And you
> > are free to use your special expertise in NUMA to make those fallback
> > paths even more efficient, but first you need to understand what they
> > are doing and why.
>
> There is your problem. The justification is not clear at all and the
> solution likely causes unrelated problems.
Well I hope the justification is clear now. Not deadlocking is a very
good thing, and we have a before and after test case. Getting late
here, Peter's email shift starts now ;-)
Regards,
Daniel
On Wed, 8 Aug 2007, Peter Zijlstra wrote:
> Christoph, does this all explain the situation?
Sort of. I am still very sceptical that this will work reliably. I'd
rather look at an alternate solution like fixing reclaim. Could you have a
look at Andrew's and my comments on the slub patch?
On Wed, 8 Aug 2007, Daniel Phillips wrote:
> 1. If the allocation can be satisfied in the usual way, do that.
> 2. Otherwise, if the GFP flags do not include __GFP_MEMALLOC and
> PF_MEMALLOC is not set, fail the allocation
> 3. Otherwise, if the memcache's reserve quota is not reached,
> satisfy the request, allocating a new page from the MEMALLOC reserve,
> bump the memcache's reserve counter and succeed
Maybe we need to kill PF_MEMALLOC....
> > Try NUMA constraints and zone limitations.
>
> Are you worried about a correctness issue that would prevent the
> machine from operating, or are you just worried about allocating
> reserve pages to the local node for performance reasons?
I am worried that allocation constraints will make the approach incorrect.
Because logically you must have distinct pools for each set of allocation
constraints. Otherwise something will drain the precious reserve slab.
> > No I mean all 1024 processors of our system running into this fail/succeed
> > thingy that was added.
>
> If an allocation now fails that would have succeeded in the past, the
> patch set is buggy. I can't say for sure one way or another at this
> time of night. If you see something, could you please mention a
> file/line number?
It seems that allocations fail that the reclaim logic should have taken
care of letting succeed. Not good.
On 8/8/07, Christoph Lameter <[email protected]> wrote:
> On Wed, 8 Aug 2007, Daniel Phillips wrote:
> Maybe we need to kill PF_MEMALLOC....
Shrink_caches needs to be able to recurse into filesystems at least,
and for the duration of the recursion the filesystem must have
privileged access to reserves. Consider the difficulty of handling
that with anything other than a process flag.
> > > Try NUMA constraints and zone limitations.
> >
> > Are you worried about a correctness issue that would prevent the
> > machine from operating, or are you just worried about allocating
> > reserve pages to the local node for performance reasons?
>
> I am worried that allocation constraints will make the approach incorrect.
> Because logically you must have distinct pools for each set of allocation
> constraints. Otherwise something will drain the precious reserve slab.
To be precise, each user of the reserve must have bounded reserve
requirements, and a shared reserve must be at least as large as the
total of the bounds.
Peter's contribution here was to modify the slab slightly so that
multiple slabs can share the global memalloc reserve. Not having to
pre-allocate pages into the reserve reduces complexity. The total
size of the reserve can also be reduced by this pool sharing strategy
in theory, but the patch set does not attempt this.
Anyway, you are right that if allocation constraints are not
calculated and enforced on the user side then the global reserve may
drain and get the system into a very bad state. That is why each user
on the block writeout path must be audited for reserve requirements
and throttled to a requirement no greater than its share of the global
pool. This patch set does not provide an example of such block IO
throttling, but you can see one in ddsnap at the link I provided
earlier (throttle_sem).
> > > No I mean all 1024 processors of our system running into this fail/succeed
> > > thingy that was added.
> >
> > If an allocation now fails that would have succeeded in the past, the
> > patch set is buggy. I can't say for sure one way or another at this
> > time of night. If you see something, could you please mention a
> > file/line number?
>
> It seems that allocations fail that the reclaim logic should have taken
> care of letting succeed. Not good.
I do not see a case like that, if you can see one then we would like
to know. We make a guarantee to each reserve user that the memory it
has reserved will be available immediately from the global memalloc
pool. Bugs can break this, and we do not automatically monitor each
user just now to detect violations, but both are also the case with
the traditional memalloc paths.
In theory, we could reduce the size of the global memalloc pool by
including "easily freeable" memory in it. This is just an
optimization and does not belong in this patch set, which fixes a
system integrity issue.
Regards,
Daniel
On Thu, 9 Aug 2007, Daniel Phillips wrote:
> On 8/8/07, Christoph Lameter <[email protected]> wrote:
> > On Wed, 8 Aug 2007, Daniel Phillips wrote:
> > Maybe we need to kill PF_MEMALLOC....
>
> Shrink_caches needs to be able to recurse into filesystems at least,
> and for the duration of the recursion the filesystem must have
> privileged access to reserves. Consider the difficulty of handling
> that with anything other than a process flag.
Shrink_caches needs to allocate memory? Hmmm... Maybe we can only limit
the PF_MEMALLOC use.
> In theory, we could reduce the size of the global memalloc pool by
> including "easily freeable" memory in it. This is just an
> optimization and does not belong in this patch set, which fixes a
> system integrity issue.
I think the main thing would be to fix reclaim to not do stupid things
like triggering writeout early in the reclaim pass and to allow reentry
into reclaim. The idea of memory pools always sounded strange to me given
that you have a lot of memory in a zone that is reclaimable as needed.
On 8/9/07, Christoph Lameter <[email protected]> wrote:
> On Thu, 9 Aug 2007, Daniel Phillips wrote:
> > On 8/8/07, Christoph Lameter <[email protected]> wrote:
> > > On Wed, 8 Aug 2007, Daniel Phillips wrote:
> > > Maybe we need to kill PF_MEMALLOC....
> > Shrink_caches needs to be able to recurse into filesystems at least,
> > and for the duration of the recursion the filesystem must have
> > privileged access to reserves. Consider the difficulty of handling
> > that with anything other than a process flag.
>
> Shrink_caches needs to allocate memory? Hmmm... Maybe we can only limit
> the PF_MEMALLOC use.
PF_MEMALLOC is not such a bad thing. It will usually be less code
than mempool for the same use case, besides being able to handle a
wider range of problems. We introduce __GFP_MEMALLOC for situations
where the need for reserve memory is locally known, as in the network
stack, which is similar or identical to the use case for mempool. One
could reasonably ask why we need mempool with a lighter alternative
available. But this is a case of to each their own I think. Either
technique will work for reserve management.
> > In theory, we could reduce the size of the global memalloc pool by
> > including "easily freeable" memory in it. This is just an
> > optimization and does not belong in this patch set, which fixes a
> > system integrity issue.
>
> I think the main thing would be to fix reclaim to not do stupid things
> like triggering writeout early in the reclaim pass and to allow reentry
> into reclaim. The idea of memory pools always sounded strange to me given
> that you have a lot of memory in a zone that is reclaimable as needed.
You can fix reclaim as much as you want and the basic deadlock will
still not go away. When you finally do get to writing something out,
memory consumers in the writeout path are going to cause problems,
which this patch set fixes.
Agreed that the idea of mempool always sounded strange, and we show
how to get rid of them, but that is not the immediate purpose of this
patch set.
Regards,
Daniel
On Thu, 9 Aug 2007, Daniel Phillips wrote:
> You can fix reclaim as much as you want and the basic deadlock will
> still not go away. When you finally do get to writing something out,
> memory consumers in the writeout path are going to cause problems,
> which this patch set fixes.
We currently also do *not* write out immediately. I/O is queued when
submitted so it does *not* reduce memory. It is better to actually delay
writeout until you have thrown out clean pages. At that point the free
memory is at its high point. If memory goes below the high point again by
these writes then we can again reclaim until things are right.
> Agreed that the idea of mempool always sounded strange, and we show
> how to get rid of them, but that is not the immediate purpose of this
> patch set.
Ok, mempools are unrelated. The allocation problems that this patch
addresses can be fixed by making reclaim more intelligent. This may well
make mempools less of an issue in the kernel. If we can reclaim in an
emergency even in ATOMIC contexts then things get much easier.
On 8/9/07, Christoph Lameter <[email protected]> wrote:
> The allocation problems that this patch addresses can be fixed by making reclaim
> more intelligent.
If you believe that the deadlock problems we address here can be
better fixed by making reclaim more intelligent then please post a
patch and we will test it. I am highly skeptical, but the proof is in
the patch.
> If we can reclaim in an emergency even in ATOMIC contexts then things get much
> easier.
It is already easy, and it is already fixed in this patch series.
Sure, we can pare these patches down a little more, but you are going
to have a really hard time coming up with something simpler that
actually works.
Regards,
Daniel
On Thu, 9 Aug 2007, Daniel Phillips wrote:
> If you believe that the deadlock problems we address here can be
> better fixed by making reclaim more intelligent then please post a
> patch and we will test it. I am highly skeptical, but the proof is in
> the patch.
Then please test the patch that I posted here earlier to reclaim even if
PF_MEMALLOC is set. It may require some fixups but it should address your
issues in most vm load situations.
On 8/9/07, Christoph Lameter <[email protected]> wrote:
> > If you believe that the deadlock problems we address here can be
> > better fixed by making reclaim more intelligent then please post a
> > patch and we will test it. I am highly skeptical, but the proof is in
> > the patch.
>
> Then please test the patch that I posted here earlier to reclaim even if
> PF_MEMALLOC is set. It may require some fixups but it should address your
> issues in most vm load situations.
It is quite clear what is in your patch. Instead of just grabbing a
page off the buddy free lists in a critical allocation situation you
go invoke shrink_caches. Why oh why? All the memory needed to get
through these crunches is already sitting right there on the buddy
free lists, ready to be used, why would you go off scanning instead?
And this does not work in atomic contexts at all, that is a whole
thing you would have to develop, and why? You just offered us
functionality that we already have, except your idea has issues.
You do not do anything to prevent mixing of ordinary slab allocations
of unknown duration with critical allocations of controlled duration.
This is _very important_ for sk_alloc. How are you going to take
care of that?
In short, you provide a piece we don't need because we already have it
in a more efficient form, your approach does not work in atomic
context, and you need to solve the slab object problem. You also need
integration with sk_alloc. That is just what I noticed on a
once-over-lightly. Your patch has a _long_ way to go before it is
ready to try.
We have already presented a patch set that is tested and is known to
solve the deadlocks. This patch set has been more than two years in
development. It covers problems you have not even begun to think
about, which we have been aware of for years. Your idea is not
anywhere close to working. Why don't you just work with us instead?
There are certainly improvements that can be made to the posted patch
set. Running off and learning from scratch how to do this is not
really helpful.
Regards,
Daniel
On Fri, 10 Aug 2007, Daniel Phillips wrote:
> It is quite clear what is in your patch. Instead of just grabbing a
> page off the buddy free lists in a critical allocation situation you
> go invoke shrink_caches. Why oh why? All the memory needed to get
Because we get to the code of interest when we have no memory on the
buddy free lists and need to reclaim memory to fill them up again.
> You do not do anything to prevent mixing of ordinary slab allocations
> of unknown duration with critical allocations of controlled duration.
> This is _very important_ for sk_alloc. How are you going to take
> care of that?
It is not necessary because you can reclaim memory as needed.
> There are certainly improvements that can be made to the posted patch
> set. Running off and learning from scratch how to do this is not
> really helpful.
The idea of adding code to deal with "I have no memory" situations
in a kernel that is based on having as much memory as possible in use at all
times is plainly the wrong approach. If you need memory then memory needs
to be reclaimed. That is the basic way that things work and following that
through brings about a much less invasive solution without all the issues
that the proposed solution creates.
On 8/10/07, Christoph Lameter <[email protected]> wrote:
> The idea of adding code to deal with "I have no memory" situations
> in a kernel that is based on having as much memory as possible in use at all
> times is plainly the wrong approach.
No. It is you who have read the patches wrongly, because what you
imply here is exactly backwards.
> If you need memory then memory needs
> to be reclaimed. That is the basic way that things work
Wrong. A naive reading of your comment would suggest you do not
understand how PF_MEMALLOC works: it has worked that way from
day one (well, since long before I arrived), and we just do more
of the same, except better.
> and following that
> through brings about a much less invasive solution without all the issues
> that the proposed solution creates.
What issues? Test case please, a real one that you have run yourself.
Please, no more theoretical issues that cannot be demonstrated in
practice because they do not exist.
Regards,
Daniel
On Friday 10 August 2007 10:46, Christoph Lameter wrote:
> On Fri, 10 Aug 2007, Daniel Phillips wrote:
> > It is quite clear what is in your patch. Instead of just grabbing
> > a page off the buddy free lists in a critical allocation situation
> > you go invoke shrink_caches. Why oh why? All the memory needed to
> > get
>
> Because we get to the code of interest when we have no memory on the
> buddy free lists...
Ah wait, that statement is incorrect and may well be the crux of your
misunderstanding. Buddy free lists are not exhausted until the entire
memalloc reserve has been depleted, which would indicate a kernel bug
and imminent system death.
> ...and need to reclaim memory to fill them up again.
That we do, but we satisfy the allocations in the vm writeout path
first, without waiting for shrink_caches to do its thing.
Regards,
Daniel
On Sun, 12 Aug 2007, Daniel Phillips wrote:
> > Because we get to the code of interest when we have no memory on the
> > buddy free lists...
>
> Ah wait, that statement is incorrect and may well be the crux of your
> misunderstanding. Buddy free lists are not exhausted until the entire
> memalloc reserve has been depleted, which would indicate a kernel bug
> and imminent system death.
I added the call to reclaim at the point where memory runs out.