LinuxLists.cc - [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF

2007-08-14 15:37:48

Subject: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set

If we exhaust the reserves in the page allocator when PF_MEMALLOC is set
then no longer give up but call into reclaim with PF_MEMALLOC set.

This is in essence a recursive call back into page reclaim with another
gfp flag (__GFP_NOMEMALLOC) set. The recursion is bounded since potential
allocations with __GFP_NOMEMALLOC set will not enter that branch again.

This means that allocation under PF_MEMALLOC will no longer run out of
memory. Allocations under PF_MEMALLOC will do a limited form of reclaim
instead.

The reclaim is of particular importance to stacked filesystems that may
do a lot of allocations in the write path. Reclaim will keep working
as long as there are clean file backed pages to reclaim.

Signed-off-by: Christoph Lameter <[email protected]>

---
mm/page_alloc.c | 11 +++++++++++
1 file changed, 11 insertions(+)

Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c 2007-08-13 23:50:01.000000000 -0700
+++ linux-2.6/mm/page_alloc.c 2007-08-13 23:58:43.000000000 -0700
@@ -1306,6 +1306,17 @@ nofail_alloc:
 				zonelist, ALLOC_NO_WATERMARKS);
 			if (page)
 				goto got_pg;
+			/*
+			 * If we are already in reclaim then the environment
+			 * is already set up. We can simply call
+			 * try_to_free_pages(). Just make sure that
+			 * we do not allocate anything.
+			 */
+			if (p->flags & PF_MEMALLOC && wait &&
+				try_to_free_pages(zonelist->zones, order,
+					gfp_mask | __GFP_NOMEMALLOC))
+				goto restart;
+
 			if (gfp_mask & __GFP_NOFAIL) {
 				congestion_wait(WRITE, HZ/50);
 				goto nofail_alloc;

--
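
The bounded recursion described in the patch can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not kernel code: `sim_alloc()`, `sim_reclaim()`, `struct sim_state` and the `SIM_*` flag values are hypothetical names invented for this example, and the model assumes that any allocation attempted from inside the nested reclaim carries the NOMEMALLOC flag.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for PF_MEMALLOC and __GFP_NOMEMALLOC; values are arbitrary. */
#define SIM_PF_MEMALLOC     0x1u
#define SIM_GFP_NOMEMALLOC  0x2u

struct sim_state {
	unsigned int task_flags; /* models p->flags */
	int free_pages;          /* pages available above the watermarks */
	int reserve_pages;       /* emergency reserve (ALLOC_NO_WATERMARKS) */
	int clean_pages;         /* clean file pages that reclaim can drop */
	int max_depth;           /* deepest nesting seen, for inspection */
};

static bool sim_reclaim(struct sim_state *s, unsigned int gfp, int depth);

static bool sim_alloc(struct sim_state *s, unsigned int gfp, int depth)
{
	if (depth > s->max_depth)
		s->max_depth = depth;

	for (;;) {
		if (s->free_pages > 0) {
			s->free_pages--;
			return true;
		}
		/* The reserve branch is closed to non-PF_MEMALLOC tasks
		 * and to requests that carry NOMEMALLOC. */
		if (!(s->task_flags & SIM_PF_MEMALLOC) ||
		    (gfp & SIM_GFP_NOMEMALLOC))
			return false;
		if (s->reserve_pages > 0) {
			s->reserve_pages--;
			return true;
		}
		/* Reserves exhausted: reclaim with NOMEMALLOC added, then restart. */
		if (!sim_reclaim(s, gfp | SIM_GFP_NOMEMALLOC, depth + 1))
			return false;
	}
}

static bool sim_reclaim(struct sim_state *s, unsigned int gfp, int depth)
{
	/* Reclaim may try to allocate; with NOMEMALLOC set the attempt
	 * cannot nest further. The result is ignored in this toy model. */
	(void)sim_alloc(s, gfp, depth);

	if (s->clean_pages == 0)
		return false;
	s->clean_pages--;
	s->free_pages++;
	return true;
}
```

Starting from a state with PF_MEMALLOC set, no free or reserve pages and a few clean pages, `sim_alloc(&s, 0, 0)` succeeds and `max_depth` never exceeds 1. With no clean pages left the same call fails, which is the case questioned in the replies.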

2007-08-18 09:31:52

by Pavel Machek

Subject: Re: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set

Hi!

> If we exhaust the reserves in the page allocator when PF_MEMALLOC is set
> then no longer give up but call into reclaim with PF_MEMALLOC set.
>
> This is in essence a recursive call back into page reclaim with another
> gfp flag (__GFP_NOMEMALLOC) set. The recursion is bounded since potential
> allocations with __GFP_NOMEMALLOC set will not enter that branch again.
>
> This means that allocation under PF_MEMALLOC will no longer run out of
> memory. Allocations under PF_MEMALLOC will do a limited form of reclaim
> instead.
>
> The reclaim is of particular importance to stacked filesystems that may
> do a lot of allocations in the write path. Reclaim will be working
> as long as there are clean file backed pages to reclaim.

I don't get it. Let's say that we have a stacked filesystem that needs
it. That filesystem is broken today.

Now you give it a second chance by reclaiming clean pages, but there is
no guarantee that we have any... so that filesystem is still broken
with your patch?

Should we fix the filesystem instead?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-08-20 19:00:59

by Christoph Lameter

Subject: Re: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set

On Sat, 18 Aug 2007, Pavel Machek wrote:

> > The reclaim is of particular importance to stacked filesystems that may
> > do a lot of allocations in the write path. Reclaim will be working
> > as long as there are clean file backed pages to reclaim.
>
> I don't get it. Let's say that we have a stacked filesystem that needs
> it. That filesystem is broken today.
>
> Now you give it a second chance by reclaiming clean pages, but there is
> no guarantee that we have any... so that filesystem is still broken
> with your patch?

There is a guarantee that we have some because the user space program is
executing, meaning the executable pages can be reclaimed. The amount of
dirty memory in the system is limited by the dirty_ratio. So the VM can
only get into trouble if there is a sufficient amount of anonymous pages
and all executables have been reclaimed. That is pretty rare.

Plus the same issue can happen today. Writes are usually not completed
during reclaim. If the writes are sufficiently deferred then you have the
same issue now.
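
The dirty_ratio argument above amounts to a simple lower bound, sketched here as arithmetic. This is a rough model under explicit assumptions, not how the kernel accounts memory: it treats every page as either anonymous, dirty file backed or clean file backed, and it applies the ratio to total memory for simplicity. `sim_min_clean_pages()` is a hypothetical helper name.

```c
/*
 * Lower bound on clean file backed pages, in pages.
 * Assumes total = anon + dirty file + clean file, with dirty file
 * pages capped at dirty_ratio_percent of total memory.
 */
static long sim_min_clean_pages(long total, long anon, int dirty_ratio_percent)
{
	long max_dirty = total * dirty_ratio_percent / 100;
	long clean = total - anon - max_dirty;

	return clean > 0 ? clean : 0;
}
```

For a 128M machine with 4K pages (32768 pages), 8192 anonymous pages and a ratio of 10, the bound is 21300 clean pages. With 30000 anonymous pages the bound drops to 0, which is the "sufficient amount of anonymous pages" case where the guarantee no longer holds.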

2007-08-20 20:18:29

by Peter Zijlstra

Subject: Re: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set

On Mon, 2007-08-20 at 12:00 -0700, Christoph Lameter wrote:
> On Sat, 18 Aug 2007, Pavel Machek wrote:
>
> > > The reclaim is of particular importance to stacked filesystems that may
> > > do a lot of allocations in the write path. Reclaim will be working
> > > as long as there are clean file backed pages to reclaim.
> >
> > I don't get it. Let's say that we have a stacked filesystem that needs
> > it. That filesystem is broken today.
> >
> > Now you give it a second chance by reclaiming clean pages, but there is
> > no guarantee that we have any... so that filesystem is still broken
> > with your patch?
>
> There is a guarantee that we have some because the user space program is
> executing. Meaning the executable pages can be retrieved. The amount
> of dirty memory in the system is limited by the dirty_ratio. So the VM can
> only get into trouble if there is a sufficient amount of anonymous pages
> and all executables have been reclaimed. That is pretty rare.
>
> Plus the same issue can happen today. Writes are usually not completed
> during reclaim. If the writes are sufficiently deferred then you have the
> same issue now.

Once we have initiated (disk) writeout we do not need more memory to
complete it, all we need to do is wait for the completion interrupt.

Networking is different here in that an unbounded amount of net traffic
needs to be processed in order to find the completion event.

2007-08-20 20:32:53

by Christoph Lameter

Subject: Re: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set

On Mon, 20 Aug 2007, Peter Zijlstra wrote:

> > Plus the same issue can happen today. Writes are usually not completed
> > during reclaim. If the writes are sufficiently deferred then you have the
> > same issue now.
>
> Once we have initiated (disk) writeout we do not need more memory to
> complete it, all we need to do is wait for the completion interrupt.

We cannot reclaim the page as long as the I/O is not complete. If you
have too many anonymous pages and the rest of memory is dirty then you can
get into OOM scenarios even without this patch.

> Networking is different here in that an unbounded amount of net traffic
> needs to be processed in order to find the completion event.

It's not that different. Pages are pinned during writeout from reclaim and
it is not clear when the write will complete. There are no bounds that I
know in reclaim for the writeback of dirty anonymous pages.

But some throttling function like for dirty pages is likely needed for
network traffic.

2007-08-20 21:14:41

by Peter Zijlstra

Subject: Re: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set

On Mon, 2007-08-20 at 13:27 -0700, Christoph Lameter wrote:
> On Mon, 20 Aug 2007, Peter Zijlstra wrote:
>
> > > Plus the same issue can happen today. Writes are usually not completed
> > > during reclaim. If the writes are sufficiently deferred then you have the
> > > same issue now.
> >
> > Once we have initiated (disk) writeout we do not need more memory to
> > complete it, all we need to do is wait for the completion interrupt.
>
> We cannot reclaim the page as long as the I/O is not complete. If you
> have too many anonymous pages and the rest of memory is dirty then you can
> get into OOM scenarios even without this patch.

As long as the reserve is large enough to completely initialize writeout
of a single page we can make progress. Once writeout is initialized the
completion interrupt is guaranteed to happen (assuming working
hardware).

This is why I can happily run a 256M anonymous workload on a machine
with only 128M of memory.

> > Networking is different here in that an unbounded amount of net traffic
> > needs to be processed in order to find the completion event.
>
> It's not that different.

Yes it is, disk based completion does not require memory, network based
completion requires unbounded memory.

> Pages are pinned during writeout from reclaim and
> it is not clear when the write will complete.

For disk based writeback you do not know when it comes, but you need
only passively wait for it.

For networked writeback you need to receive all packets that happen to
be targeted at your machine and inspect them - and toss some away
because you cannot keep everything, memory is limited.

> There are no bounds that I
> know in reclaim for the writeback of dirty anonymous pages.

throttle_vm_writeout() does sort-of.

> But some throttling function like for dirty pages is likely needed for
> network traffic.

Yes, Daniel is working on writeout throttling.

2007-08-20 21:17:39

by Christoph Lameter

Subject: Re: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set

On Mon, 20 Aug 2007, Peter Zijlstra wrote:

> > It's not that different.
>
> Yes it is, disk based completion does not require memory, network based
> completion requires unbounded memory.

Disk based completion only requires no memory if it's not on a stack of
other devices and if the interrupt handler is appropriately shaped. If
there are multiple levels below or there is some sort of complex
completion handling then this also may require memory.

2007-08-21 00:39:33

by Nick Piggin

Subject: Re: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set

On Mon, Aug 20, 2007 at 11:14:08PM +0200, Peter Zijlstra wrote:
> On Mon, 2007-08-20 at 13:27 -0700, Christoph Lameter wrote:
> > On Mon, 20 Aug 2007, Peter Zijlstra wrote:
> >
> > > > Plus the same issue can happen today. Writes are usually not completed
> > > > during reclaim. If the writes are sufficiently deferred then you have the
> > > > same issue now.
> > >
> > > Once we have initiated (disk) writeout we do not need more memory to
> > > complete it, all we need to do is wait for the completion interrupt.
> >
> > We cannot reclaim the page as long as the I/O is not complete. If you
> > have too many anonymous pages and the rest of memory is dirty then you can
> > get into OOM scenarios even without this patch.
>
> As long as the reserve is large enough to completely initialize writeout
> of a single page we can make progress. Once writeout is initialized the
> completion interrupt is guaranteed to happen (assuming working
> hardware).

Although interestingly, we are not guaranteed to have enough memory to
completely initialise writeout of a single page.

The buffer layer doesn't require disk blocks to be allocated at page
dirty-time. Allocating disk blocks can require complex filesystem operations
and readin of buffer cache pages. The buffer_head structures themselves may
not even be present and must be allocated :P

In _practice_, this isn't such a problem because we have dirty limits, and
we're almost guaranteed to have some clean pages to be reclaimed. In this
same way, networked filesystems are not a problem in practice. However
network swap, because there is no dirty limits on swap, can actually see
the deadlock problems.

2007-08-21 14:07:27

by Peter Zijlstra

Subject: Re: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set

On Mon, 2007-08-20 at 14:17 -0700, Christoph Lameter wrote:
> On Mon, 20 Aug 2007, Peter Zijlstra wrote:
>
> > > It's not that different.
> >
> > Yes it is, disk based completion does not require memory, network based
> > completion requires unbounded memory.
>
> Disk based completion only requires no memory if it's not on a stack of
> other devices and if the interrupt handler is appropriately shaped. If
> there are multiple levels below or there is some sort of complex
> completion handling then this also may require memory.

I'm not aware of such a scenario - but it could well be. Still, if it
did, it would take a _bounded_ amount of memory per page.

Network would still differ in that it requires receiving and processing
an _unbounded_ number of packets in order to receive that completion.

2007-08-21 14:09:18

by Peter Zijlstra

Subject: Re: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set

On Tue, 2007-08-21 at 02:39 +0200, Nick Piggin wrote:
> On Mon, Aug 20, 2007 at 11:14:08PM +0200, Peter Zijlstra wrote:
> > On Mon, 2007-08-20 at 13:27 -0700, Christoph Lameter wrote:
> > > On Mon, 20 Aug 2007, Peter Zijlstra wrote:
> > >
> > > > > Plus the same issue can happen today. Writes are usually not completed
> > > > > during reclaim. If the writes are sufficiently deferred then you have the
> > > > > same issue now.
> > > >
> > > > Once we have initiated (disk) writeout we do not need more memory to
> > > > complete it, all we need to do is wait for the completion interrupt.
> > >
> > > We cannot reclaim the page as long as the I/O is not complete. If you
> > > have too many anonymous pages and the rest of memory is dirty then you can
> > > get into OOM scenarios even without this patch.
> >
> > As long as the reserve is large enough to completely initialize writeout
> > of a single page we can make progress. Once writeout is initialized the
> > completion interrupt is guaranteed to happen (assuming working
> > hardware).
>
> Although interestingly, we are not guaranteed to have enough memory to
> completely initialise writeout of a single page.

Yes, that is due to the unbounded nature of direct reclaim, no?

I've been meaning to write some patches to address this problem in a way
that does not introduce the hard wall Linus objects to. If only I had
this extra day in the week :-/

And then there is the deadlock in add_to_swap() that I still have to
look into, I hope it can eventually be solved using reserve based
allocation.

> The buffer layer doesn't require disk blocks to be allocated at page
> dirty-time. Allocating disk blocks can require complex filesystem operations
> and readin of buffer cache pages. The buffer_head structures themselves may
> not even be present and must be allocated :P
>
> In _practice_, this isn't such a problem because we have dirty limits, and
> we're almost guaranteed to have some clean pages to be reclaimed. In this
> same way, networked filesystems are not a problem in practice. However
> network swap, because there is no dirty limits on swap, can actually see
> the deadlock problems.

The main problem with networked swap is not so much sending out the
pages (this has similar problems to the filesystems but is all bounded
in its memory use).

The biggest issue is receiving the completion notification. Network
needs to fall back to a state where it neither blindly consumes memory
nor drops _all_ packets. An intermediate state is required, one where we
can receive and inspect incoming packets but commit to very few.

In order to create such a network state and for it to be stable, a
certain amount of memory needs to be available and an external trigger
is needed to enter and leave this state - currently provided by there
being more memory available than needed or not.
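
The intermediate receive state described above can be sketched as a toy model. This is an illustration under assumptions, not the proposed implementation: `sim_receive()`, `struct sim_packet` and `struct sim_rx_result` are hypothetical names, each packet is assumed to need exactly one buffer while it is inspected, and packets for the swap connection are assumed to be processed and released immediately.

```c
#include <stdbool.h>
#include <stddef.h>

struct sim_packet {
	bool for_swap_socket; /* destined for the (hypothetical) swap connection */
	bool is_completion;   /* carries the write completion being waited for */
};

struct sim_rx_result {
	bool completion_seen;
	size_t inspected;  /* packets received and looked at */
	size_t dropped;    /* unrelated packets tossed after inspection */
	size_t peak_held;  /* most buffers held at any one time */
};

/*
 * drop_unrelated == true models the intermediate state: inspect every
 * packet but keep none that are unrelated to the swap connection.
 * drop_unrelated == false models blindly queueing all traffic.
 */
static struct sim_rx_result sim_receive(const struct sim_packet *pkts,
					size_t n, size_t reserve,
					bool drop_unrelated)
{
	struct sim_rx_result r = { false, 0, 0, 0 };
	size_t held = 0; /* buffers currently pinned */

	for (size_t i = 0; i < n; i++) {
		if (held >= reserve)
			break; /* no buffer left to even look at the next packet */

		held++; /* buffer used to receive and inspect this packet */
		r.inspected++;
		if (held > r.peak_held)
			r.peak_held = held;

		if (pkts[i].for_swap_socket) {
			held--; /* processed right away, buffer returned */
			if (pkts[i].is_completion) {
				r.completion_seen = true;
				break;
			}
		} else if (drop_unrelated) {
			r.dropped++;
			held--; /* tossed after inspection */
		}
		/* otherwise the unrelated packet stays queued and pins its buffer */
	}
	return r;
}
```

With five unrelated packets ahead of the completion and a reserve of two buffers, the dropping mode finds the completion while holding at most one buffer. The queueing mode runs out of buffers after two packets and never sees the completion, which is the deadlock in miniature.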

2007-08-23 03:38:36

by Nick Piggin

Subject: Re: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set

On Tue, Aug 21, 2007 at 04:07:15PM +0200, Peter Zijlstra wrote:
> On Tue, 2007-08-21 at 02:39 +0200, Nick Piggin wrote:
> >
> > Although interestingly, we are not guaranteed to have enough memory to
> > completely initialise writeout of a single page.
>
> Yes, that is due to the unbounded nature of direct reclaim, no?

Even writing out a single page to a plain old block backed filesystem
can take a fair chunk of memory. I'm not really sure how problematic
this is with a "real" filesystem, but even with something pretty simple,
you might have to do block allocation, which itself might have to do
indirect block allocation (which itself can be 3 or 4 levels), all of
which have to actually update block bitmaps (which themselves may be
many pages big). Then you also may have to even just allocate the
buffer_head structure itself. And that's just to write out a single
buffer in the page (on a 64K page system, there might be 64 of these).

Unbounded direct reclaim surely doesn't help either :P
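
To put rough numbers on the above, here is a toy worst-case count. It is only an illustrative model with invented assumptions: each buffer in the page is charged one buffer_head, one block per indirect level, and one bitmap page for the data block and for each indirect block. Real filesystems share indirect blocks and bitmaps between neighbouring buffers, so this overstates the cost. `sim_worst_case()` and `struct sim_writeout_cost` are hypothetical names.

```c
struct sim_writeout_cost {
	long buffers;    /* block-sized buffers in one page */
	long per_buffer; /* worst-case allocations charged to one buffer */
	long total;      /* worst-case allocations for the whole page */
};

static struct sim_writeout_cost sim_worst_case(long page_size, long block_size,
					       int indirect_levels)
{
	struct sim_writeout_cost c;

	c.buffers = page_size / block_size;
	/*
	 * One buffer_head, one block per indirect level, and one bitmap
	 * page touched for the data block and for each indirect block.
	 */
	c.per_buffer = 1 + indirect_levels + (indirect_levels + 1);
	c.total = c.buffers * c.per_buffer;
	return c;
}
```

A 64K page with 1K blocks gives the 64 buffers mentioned above; with 3 indirect levels this model charges 8 allocations per buffer, 512 in total, compared with 8 for a 4K page with 4K blocks.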

> I've been meaning to write some patches to address this problem in a way
> that does not introduce the hard wall Linus objects to. If only I had
> this extra day in the week :-/

For this problem I think the right way to go is to ensure everything
is allocated to do writeout at page-dirty-time. This is what fsblock
does (or at least _allows_ for: filesystems that do journalling or
delayed allocation etc. themselves will have to ensure they have
sufficient preallocations to do the manipulations they need at writeout
time).

But again, on the pragmatic side, the best behaviour I think is just
to have writeouts not allocate from reserves without first trying to
reclaim some clean memory, and also limit the number of users of the
reserve. We want this anyway, right, because we don't want regular
reclaim to start causing things like atomic allocation failures when
load goes up.

> And then there is the deadlock in add_to_swap() that I still have to
> look into, I hope it can eventually be solved using reserve based
> allocation.

Yes it should have a reserve. It wouldn't be hard, all you need is
enough memory to be able to swap out a single page I would think (ie.
one preload's worth).

> > The buffer layer doesn't require disk blocks to be allocated at page
> > dirty-time. Allocating disk blocks can require complex filesystem operations
> > and readin of buffer cache pages. The buffer_head structures themselves may
> > not even be present and must be allocated :P
> >
> > In _practice_, this isn't such a problem because we have dirty limits, and
> > we're almost guaranteed to have some clean pages to be reclaimed. In this
> > same way, networked filesystems are not a problem in practice. However
> > network swap, because there is no dirty limits on swap, can actually see
> > the deadlock problems.
>
> The main problem with networked swap is not so much sending out the
> pages (this has similar problems like the filesystems but is all bounded
> in its memory use).
>
> The biggest issue is receiving the completion notification. Network
> needs to fall back to a state where it neither blindly consumes memory
> or drops _all_ packets. An intermediate state is required, one where we
> can receive and inspect incoming packets but commit to very few.

Yes, I understand this is the main problem. But it is not _helped_ by
the fact that reclaim reserves include the atomic allocation reserves.
I haven't run into this problem for a long time, but I'd venture to guess the
_main_ reason the deadlock is hit is not because of networking allocating
a lot of other irrelevant data, but because of reclaim using up most of
the atomic allocation reserves.

And this observation is not tied to recursive reclaim: if we somehow had
a reserve for atomic allocations that was aside from the reclaim reserve,
I think such a system would be practically free of deadlock for more
anonymous-intensive workloads too.

> In order to create such a network state and for it to be stable, a
> certain amount of memory needs to be available and an external trigger
> is needed to enter and leave this state - currently provided by there
> being more memory available than needed or not.

I do appreciate the deadlock and solution. I'm puzzled by your last line
though? Currently we do not provide the required reserves in the network
layer, *at all*, right?

2007-08-23 09:27:09

by Peter Zijlstra

Subject: Re: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set

On Thu, 2007-08-23 at 05:38 +0200, Nick Piggin wrote:
> On Tue, Aug 21, 2007 at 04:07:15PM +0200, Peter Zijlstra wrote:
> > On Tue, 2007-08-21 at 02:39 +0200, Nick Piggin wrote:
> > >
> > > Although interestingly, we are not guaranteed to have enough memory to
> > > completely initialise writeout of a single page.
> >
> > Yes, that is due to the unbounded nature of direct reclaim, no?
>
> Even writing out a single page to a plain old block backed filesystem
> can take a fair chunk of memory. I'm not really sure how problematic
> this is with a "real" filesystem, but even with something pretty simple,
> you might have to do block allocation, which itself might have to do
> indirect block allocation (which itself can be 3 or 4 levels), all of
> which have to actually update block bitmaps (which themselves may be
> many pages big). Then you also may have to even just allocate the
> buffer_head structure itself. And that's just to write out a single
> buffer in the page (on a 64K page system, there might be 64 of these).

Right, Nikita once talked me through all that when we talked about
clustered writeout.

IIRC filesystems were supposed to keep mempools big enough to do this
for a single writepage at a time. Not sure it's actually done though.

One advantage here is that swap writeout is very simple, so for
swap_writepage() the overhead is minimal, and we can free up space to
make progress with the fs writeout. And if there is little anonymous
memory in the system, it must have a lot of clean pages because of the
dirty limit.

But yeah, there are some nasty details left here.

> > I've been meaning to write some patches to address this problem in a way
> > that does not introduce the hard wall Linus objects to. If only I had
> > this extra day in the week :-/
>
> For this problem I think the right way to go is to ensure everything
> is allocated to do writeout at page-dirty-time. This is what fsblock
> does (or at least _allows_ for: filesystems that do journalling or
> delayed allocation etc. themselves will have to ensure they have
> sufficient preallocations to do the manipulations they need at writeout
> time).
>
> But again, on the pragmatic side, the best behaviour I think is just
> to have writeouts not allocate from reserves without first trying to
> reclaim some clean memory, and also limit the number of users of the
> reserve. We want this anyway, right, because we don't want regular
> reclaim to start causing things like atomic allocation failures when
> load goes up.

My idea is to extend kswapd, run cpus_per_node instances of kswapd per
node for each of GFP_KERNEL, GFP_NOFS, GFP_NOIO. (basically 3 kswapds
per cpu)

Whenever we would hit direct reclaim, add ourselves to a special
waitqueue corresponding to the type of GFP and kick all the
corresponding kswapds.

Now Linus' big objection is that all these processes would hit a wall
and not progress until the watermarks are high again.

Here is where the 'special' part of the waitqueue comes into play.

Instead of freeing pages to the page allocator, these kswapds would hand
out pages to the waiting processes in a round robin fashion. Only if
there are no more waiting processes left, would the page go to the buddy
system.
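
The round robin hand-out can be sketched as follows. This is a userspace illustration of the idea only, with assumed details: `sim_hand_out()` and `struct sim_waiter` are hypothetical names, and each waiter is assumed to know how many pages it wants.

```c
#include <stddef.h>

struct sim_waiter {
	int wanted;   /* pages this process is waiting for */
	int received; /* pages handed to it so far */
};

/*
 * Hand freed pages to waiters one page at a time in round robin order.
 * Returns the number of pages left over for the buddy system once no
 * waiter needs more.
 */
static int sim_hand_out(struct sim_waiter *w, size_t n, int freed)
{
	size_t satisfied = 0;
	size_t next = 0;

	for (size_t i = 0; i < n; i++)
		if (w[i].received >= w[i].wanted)
			satisfied++;

	while (freed > 0 && satisfied < n) {
		struct sim_waiter *cur = &w[next];

		next = (next + 1) % n;
		if (cur->received >= cur->wanted)
			continue; /* already has what it asked for */

		cur->received++;
		freed--;
		if (cur->received == cur->wanted)
			satisfied++;
	}
	return freed;
}
```

With waiters wanting 2, 1 and 3 pages and 4 pages freed, every waiter gets at least one page before anyone gets a second, so no process sits behind a wall waiting for the watermarks to recover. Pages reach the buddy system only after all waiters are satisfied.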

> > And then there is the deadlock in add_to_swap() that I still have to
> > look into, I hope it can eventually be solved using reserve based
> > allocation.
>
> Yes it should have a reserve. It wouldn't be hard, all you need is
> enough memory to be able to swap out a single page I would think (ie.
> one preload's worth).

Yeah, just need to look at the locking and batching, and ensure it has
enough preload to survive one batch, once all the locks are dropped it
can breathe again :-)

> > > The buffer layer doesn't require disk blocks to be allocated at page
> > > dirty-time. Allocating disk blocks can require complex filesystem operations
> > > and readin of buffer cache pages. The buffer_head structures themselves may
> > > not even be present and must be allocated :P
> > >
> > > In _practice_, this isn't such a problem because we have dirty limits, and
> > > we're almost guaranteed to have some clean pages to be reclaimed. In this
> > > same way, networked filesystems are not a problem in practice. However
> > > network swap, because there is no dirty limits on swap, can actually see
> > > the deadlock problems.
> >
> > The main problem with networked swap is not so much sending out the
> > pages (this has similar problems like the filesystems but is all bounded
> > in its memory use).
> >
> > The biggest issue is receiving the completion notification. Network
> > needs to fall back to a state where it neither blindly consumes memory
> > or drops _all_ packets. An intermediate state is required, one where we
> > can receive and inspect incoming packets but commit to very few.
>
> Yes, I understand this is the main problem. But it is not _helped_ by
> the fact that reclaim reserves include the atomic allocation reserves.
> I haven't run this problem for a long time, but I'd venture to guess the
> _main_ reason the deadlock is hit is not because of networking allocating
> a lot of other irrelevant data, but because of reclaim using up most of
> the atomic allocation reserves.

Ah, interesting notion.

> > And this observation is not tied to recursive reclaim: if we somehow had
> a reserve for atomic allocations that was aside from the reclaim reserve,
> I think such a system would be practically free of deadlock for more
> anonymous-intensive workloads too.

One could get quite far, however the scenario of shutting down the
remote swap server while other network traffic is present will surely
still deadlock.

> > In order to create such a network state and for it to be stable, a
> > certain amount of memory needs to be available and an external trigger
> > is needed to enter and leave this state - currently provided by there
> > being more memory available than needed or not.
>
> I do appreciate the deadlock and solution. I'm puzzled by your last line
> though? Currently we do not provide the required reserves in the network
> layer, *at all*, right?

Right, I was speaking of a kernel with my patches applied. Sorry for the
confusion.

2007-08-23 10:12:43

by Nikita Danilov

Subject: Re: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set

Peter Zijlstra writes:

[...]

> My idea is to extend kswapd, run cpus_per_node instances of kswapd per
> node for each of GFP_KERNEL, GFP_NOFS, GFP_NOIO. (basically 3 kswapds
> per cpu)
>
> whenever we would hit direct reclaim, add ourselves to a special
> waitqueue corresponding to the type of GFP and kick all the
> corresponding kswapds.

There are two standard objections to this:

 - direct reclaim was introduced to reduce memory allocation latency,
   and going to the scheduler kills this. But more importantly,

 - it might so happen that _all_ per-cpu kswapd instances are
   blocked, e.g., waiting for IO on indirect blocks, or queue
   congestion. In that case the whole system stops, waiting for IO to
   complete. In the direct reclaim case, other threads can continue
   zone scanning.

Nikita.

2007-08-23 13:59:29

by Peter Zijlstra

Subject: Re: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set

On Thu, 2007-08-23 at 14:11 +0400, Nikita Danilov wrote:
> Peter Zijlstra writes:
>
> [...]
>
> > My idea is to extend kswapd, run cpus_per_node instances of kswapd per
> > node for each of GFP_KERNEL, GFP_NOFS, GFP_NOIO. (basically 3 kswapds
> > per cpu)
> >
> > whenever we would hit direct reclaim, add ourselves to a special
> > waitqueue corresponding to the type of GFP and kick all the
> > corresponding kswapds.
>
> There are two standard objections to this:
>
> - direct reclaim was introduced to reduce memory allocation latency,
> and going to scheduler kills this. But more importantly,

The part you snipped:

> > Here is where the 'special' part of the waitqueue comes into play.
> >
> > Instead of freeing pages to the page allocator, these kswapds would hand
> > out pages to the waiting processes in a round robin fashion. Only if
> > there are no more waiting processes left, would the page go to the buddy
> > system.

should deal with that, it allows processes to quickly get some memory.

> - it might so happen that _all_ per-cpu kswapd instances are
> blocked, e.g., waiting for IO on indirect blocks, or queue
> congestion. In that case whole system stops waiting for IO to
> complete. In the direct reclaim case, other threads can continue
> zone scanning.

By running separate GFP_KERNEL, GFP_NOFS and GFP_NOIO kswapds this
should not occur. Much like it now does not occur.

This approach would make it work pretty much like it does now. But
instead of letting each separate context run into reclaim we then have a
fixed set of reclaim contexts which evenly distribute their resulting
free pages.

The possible down sides are:

- more schedule()s, but I don't think these will matter when we're that
deep into reclaim
- less concurrency - but I hope 1 set per cpu is enough, we could up
this if it turns out to really help.

2007-08-24 04:00:21

by Nick Piggin

Subject: Re: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set

On Thu, Aug 23, 2007 at 11:26:48AM +0200, Peter Zijlstra wrote:
> On Thu, 2007-08-23 at 05:38 +0200, Nick Piggin wrote:
> > On Tue, Aug 21, 2007 at 04:07:15PM +0200, Peter Zijlstra wrote:
> > > On Tue, 2007-08-21 at 02:39 +0200, Nick Piggin wrote:
> > > >
> > > > Although interestingly, we are not guaranteed to have enough memory to
> > > > completely initialise writeout of a single page.
> > >
> > > Yes, that is due to the unbounded nature of direct reclaim, no?
> >
> > Even writing out a single page to a plain old block backed filesystem
> > can take a fair chunk of memory. I'm not really sure how problematic
> > this is with a "real" filesystem, but even with something pretty simple,
> > you might have to do block allocation, which itself might have to do
> > indirect block allocation (which itself can be 3 or 4 levels), all of
> > which have to actually update block bitmaps (which themselves may be
> > many pages big). Then you also may have to even just allocate the
> > buffer_head structure itself. And that's just to write out a single
> > buffer in the page (on a 64K page system, there might be 64 of these).
>
> Right, Nikita once talked me through all that when we talked about
> clustered writeout.
>
> IIRC filesystems were supposed to keep mempools big enough to do this
> for a single writepage at a time. Not sure it's actually done though.

It isn't ;) At least I don't think so for the minix-derived ones
I've seen. But no matter, this is going a bit off topic anyway.

> > But again, on the pragmatic side, the best behaviour I think is just
> > to have writeouts not allocate from reserves without first trying to
> > reclaim some clean memory, and also limit the number of users of the
> > reserve. We want this anyway, right, because we don't want regular
> > reclaim to start causing things like atomic allocation failures when
> > load goes up.
>
> My idea is to extend kswapd, run cpus_per_node instances of kswapd per
> node for each of GFP_KERNEL, GFP_NOFS, GFP_NOIO. (basically 3 kswapds
> per cpu)
>
> whenever we would hit direct reclaim, add ourselves to a special
> waitqueue corresponding to the type of GFP and kick all the
> corresponding kswapds.

I don't know what this is solving? You don't need to run all reclaim
from kswapd process in order to limit concurrency. Just explicitly
limit it when a process applies for PF_MEMALLOC reserves. I had a
patch to do this at one point, but it never got much testing -- I
think there were other problems with a single process able to do
unbounded writeout and such anyway. But yeah, I don't think getting
rid of direct reclaim will do anything magical.
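
Explicitly limiting how many processes may use the reserves at once could look roughly like the gate below. This is not the patch mentioned above, which is not shown in this thread; it is a minimal userspace sketch using C11 atomics, and `sim_reserve_init()`, `sim_reserve_enter()` and `sim_reserve_leave()` are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct sim_reserve_gate {
	atomic_int users; /* processes currently allowed into the reserves */
	int max_users;    /* concurrency limit */
};

static void sim_reserve_init(struct sim_reserve_gate *g, int max_users)
{
	atomic_init(&g->users, 0);
	g->max_users = max_users;
}

/* Try to become a reserve user; fails once the limit is reached. */
static bool sim_reserve_enter(struct sim_reserve_gate *g)
{
	int cur = atomic_load(&g->users);

	while (cur < g->max_users) {
		/* On failure, cur is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&g->users, &cur, cur + 1))
			return true;
	}
	return false;
}

static void sim_reserve_leave(struct sim_reserve_gate *g)
{
	atomic_fetch_sub(&g->users, 1);
}
```

A caller that fails `sim_reserve_enter()` would have to wait or fall back to reclaiming clean memory instead of dipping into the reserves. What it should do in that case is left open here.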

> Now Linus' big objection is that all these processes would hit a wall
> and not progress until the watermarks are high again.
>
> Here is where the 'special' part of the waitqueue comes into play.
>
> Instead of freeing pages to the page allocator, these kswapds would hand
> out pages to the waiting processes in a round robin fashion. Only if
> there are no more waiting processes left, would the page go to the buddy
> system.

Directly getting back pages (and having more than 1 kswapd per node)
may be things worth exploring at some point. But I don't see how much
bearing they have on any deadlock problems.

> > > And then there is the deadlock in add_to_swap() that I still have to
> > > look into, I hope it can eventually be solved using reserve based
> > > allocation.
> >
> > Yes it should have a reserve. It wouldn't be hard, all you need is
> > enough memory to be able to swap out a single page I would think (ie.
> > one preload's worth).
>
> Yeah, just need to look at the locking and batching, and ensure it has
> enough preload to survive one batch, once all the locks are dropped it
> can breathe again :-)

I don't think you'd need to do anything remotely fancy ;) Just so long
as it can allocate a swapcache entry for a single page to write out, that
page will be written and eventually reclaimed, along with its radix tree
nodes.

> > > The biggest issue is receiving the completion notification. Network
> > > needs to fall back to a state where it neither blindly consumes memory
> > > or drops _all_ packets. An intermediate state is required, one where we
> > > can receive and inspect incoming packets but commit to very few.
> >
> > Yes, I understand this is the main problem. But it is not _helped_ by
> > the fact that reclaim reserves include the atomic allocation reserves.
> > I haven't run this problem for a long time, but I'd venture to guess the
> > _main_ reason the deadlock is hit is not because of networking allocating
> > a lot of other irrelevant data, but because of reclaim using up most of
> > the atomic allocation reserves.
>
> Ah, interesting notion.
>
> > > And this observation is not tied to recursive reclaim: if we somehow had
> > a reserve for atomic allocations that was aside from the reclaim reserve,
> > I think such a system would be practically free of deadlock for more
> > anonymous-intensive workloads too.
>
> One could get quite far, however the scenario of shutting down the
> remote swap server while other network traffic is present will surely
> still deadlock.

I guess it would still have all the same theoretical holes, and some
could surely still be tickled, yes ;)