LinuxLists.cc - [RFC][PATCH 0/3] beat kswapd with the proverbial clue-bat

2004-09-05 05:44:36

Subject: [RFC][PATCH 0/3] beat kswapd with the proverbial clue-bat

Kswapd is dumb as bricks when it comes to higher order allocations.
Actually that's not quite fair: it is bad at lots of things... but
higher order allocations are one of its more spectacular failures.

The major problem that I can see is with !wait allocations, where
you aren't allowed to free anything yourself - you're relying on
kswapd (aside from that, it's always nice to avoid synchronous reclaim).

Apparently these (higher-order && !wait) come up mainly in networking
which is the thing I had in mind. *However* as I only have half of a
gigabit network (ie. 1 card), I haven't done any testing where it
really counts. I'm also seeing surprisingly few reports on lkml, so
perhaps it is me that needs the beating?

Anyway, the big failure case is when memory is fragmented to the
point that pages_free > pages_low, but you still have no higher order
pages left. In that case, your !wait allocations can keep calling
wakeup_kswapd but he'll just keep sleeping. min_free_kbytes is not
really a solution because it just raises pages_low. In a nutshell,
that whole area doesn't really have any idea about higher order
allocations.

So my solution? Just teach kswapd and the watermark code about higher
order allocations in a fairly simple way. If pages_low is (say), 1024KB,
we now also require 512KB of order-1 and above pages, 256K of order-2
and up, 128K of order 3, etc. (perhaps we should stop at about order-3?)

*Also*, if we have requested an order 5 allocation, but one isn't
available, we'll get kswapd to try to free at least 1, even if its
order-5 "free-until" watermark is 0KB.

The main cost is keeping track of the number of free pages of each order.
There is also a penalty in the allocator for order > 0 allocations, but
I have tried to do it so lower order allocations need to do less work.

Flames? Comments?

2004-09-05 05:46:02

by Nick Piggin

[permalink] [raw]

Subject: [RFC][PATCH 1/3] account free buddy areas

1/3

Attachments:

vm-free-order-pages.patch (3.32 kB)

2004-09-05 05:47:22

by Nick Piggin

[permalink] [raw]

Subject: [RFC][PATCH 2/3] alloc-order watermarks

2/3

Attachments:

vm-alloc-order-watermarks.patch (3.76 kB)

2004-09-05 05:49:21

by Nick Piggin

[permalink] [raw]

Subject: [RFC][PATCH 3/3] teach kswapd about watermarks

3/3

Attachments:

vm-kswapd-heed-order-watermarks.patch (4.82 kB)

2004-09-05 05:50:56

by Nick Piggin

[permalink] [raw]

Subject: Re: [RFC][PATCH 2/3] alloc-order watermarks

Nick Piggin wrote:
> 2/3
>
>
> ------------------------------------------------------------------------
>
>
>
> Move the watermark checking code into a single function. Extend it to account
> for the order of the allocation and the number of free pages that could satisfy
> such a request.
>
> Signed-off-by: Nick Piggin <[email protected]>
>
>
> ---
>
> linux-2.6-npiggin/include/linux/mmzone.h | 2 +
> linux-2.6-npiggin/mm/page_alloc.c | 57 ++++++++++++++++++++-----------
> 2 files changed, 40 insertions(+), 19 deletions(-)
>
> diff -puN mm/page_alloc.c~vm-alloc-order-watermarks mm/page_alloc.c
> --- linux-2.6/mm/page_alloc.c~vm-alloc-order-watermarks 2004-09-05 14:55:46.000000000 +1000
> +++ linux-2.6-npiggin/mm/page_alloc.c 2004-09-05 15:10:07.000000000 +1000
> @@ -676,6 +676,36 @@ buffered_rmqueue(struct zone *zone, int
> }
>
> /*
> + * Return the number of pages available for order 'order' allocations.
> + */

Sorry, stale comment. It actually returns 1 if free pages are above the
watermark, 0 otherwise.

> +int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
> + int alloc_type, int can_try_harder, int gfp_high)

2004-09-05 06:04:07

by David Miller

[permalink] [raw]

Subject: Re: [RFC][PATCH 0/3] beat kswapd with the proverbial clue-bat

On Sun, 05 Sep 2004 15:44:18 +1000
Nick Piggin <[email protected]> wrote:

> So my solution? Just teach kswapd and the watermark code about higher
> order allocations in a fairly simple way. If pages_low is (say), 1024KB,
> we now also require 512KB of order-1 and above pages, 256K of order-2
> and up, 128K of order 3, etc. (perhaps we should stop at about order-3?)

Whether to stop at order 3 is indeed an interesting question.

The reality is that the high-order allocations come mostly from folks
using jumbo 9K MTUs on gigabit and faster technologies. On x86, an
order 2 would cover those packet allocations, but on sparc64 for example
order 1 would be enough, whereas on a 2K PAGE_SIZE system order 3 would
be necessary.

People using e1000 cards are hitting this case, and some of the e1000
developers are going to play around with using page array based SKBs
(via the existing SKB page frags mechanism). So instead of allocating
a huge linear chunk for RX packets, they'll allocate a header area of
256 bytes then an array of pages to cover the rest.

Right now, my current suggestion would not be to stop at a certain order.

2004-09-05 06:06:52

by David Miller

[permalink] [raw]

Subject: Re: [RFC][PATCH 3/3] teach kswapd about watermarks

If you're only doing atomic_set() and atomic_read() on kswapd_max_order,
you're not doing anything atomic on the datum so no need to make it
an atomic_t.

2004-09-05 06:11:39

by Andrew Morton

[permalink] [raw]

Subject: Re: [RFC][PATCH 0/3] beat kswapd with the proverbial clue-bat

Nick Piggin <[email protected]> wrote:
>
> Kswapd is dumb as bricks when it comes to higher order allocations.
> Actually that's not quite fair: it is bad at lots of things... but
> higher order allocations are one of its more spectacular failures.
>
> The major problem that I can see is with !wait allocations, where
> you aren't allowed to free anything yourself - you're relying on
> kswapd (aside from that, it's always nice to avoid synchronous reclaim).
>
> Apparently these (higher-order && !wait) come up mainly in networking
> which is the thing I had in mind. *However* as I only have half of a
> gigabit network (ie. 1 card), I haven't done any testing where it
> really counts. I'm also seeing surprisingly few reports on lkml, so
> perhaps it is me that needs the beating?

There have been few reports, and I believe that networking is getting
changed to reduce the amount of GFP_ATOMIC higher-order allocation
attempts.

There have been multiple instances in the past year or so where we've made
changes in there, the changes were not adequately tested and stuff broke in
subtle ways. We need to raise the bar a bit - clearly demonstrate that we
have a problem, and then demonstrate that the fix fixes it, then worry
about side-effects.

> Anyway, the big failure case is when memory is fragmented to the
> point that pages_free > pages_low, but you still have no higher order
> pages left. In that case, your !wait allocations can keep calling
> wakeup_kswapd but he'll just keep sleeping. min_free_kbytes is not
> really a solution because it just raises pages_low. In a nutshell,
> that whole area doesn't really have any idea about higher order
> allocations.
>
> So my solution? Just teach kswapd and the watermark code about higher
> order allocations in a fairly simple way. If pages_low is (say), 1024KB,
> we now also require 512KB of order-1 and above pages, 256K of order-2
> and up, 128K of order 3, etc. (perhaps we should stop at about order-3?)
>
> *Also*, if we have requested an order 5 allocation, but one isn't
> available, we'll get kswapd to try to free at least 1, even if its
> order-5 "free-until" watermark is 0KB.
>
> The main cost is keeping track of the number of free pages of each order.
> There is also a penalty in the allocator for order > 0 allocations, but
> I have tried to do it so lower order allocations need to do less work.
>

I don't see anything in your code which directly prevents the following
serious scenario:

a) Some random 0-order allocation causes a 4-order page to be split up,
taking the 4-order pool below threshold.

b) kswapd goes berzerk reclaiming 9000 pages to replenish the 4-order
pool even though we don't need it.

You have arith in there which kinda-sorta prevents it, but I don't see any
hard-and-fast protection. Or did I miss it?

2004-09-05 06:13:57

by Nick Piggin

[permalink] [raw]

Subject: Re: [RFC][PATCH 1/3] account free buddy areas

Nick Piggin wrote:
> 1/3
>
>
> ------------------------------------------------------------------------
>
>
>
> Keep track of the number of free pages of each order in the buddy allocator.
>
> Signed-off-by: Nick Piggin <[email protected]>
>
>
> ---
>
> linux-2.6-npiggin/include/linux/mmzone.h | 1 +
> linux-2.6-npiggin/mm/page_alloc.c | 22 ++++++++--------------
> 2 files changed, 9 insertions(+), 14 deletions(-)
>
> diff -puN mm/page_alloc.c~vm-free-order-pages mm/page_alloc.c
> --- linux-2.6/mm/page_alloc.c~vm-free-order-pages 2004-09-05 14:53:53.000000000 +1000
> +++ linux-2.6-npiggin/mm/page_alloc.c 2004-09-05 14:53:53.000000000 +1000
> @@ -216,6 +216,7 @@ static inline void __free_pages_bulk (st
> page_idx &= mask;
> }
> list_add(&(base + page_idx)->lru, &area->free_list);
> + area->nr_free++;
> }
>

Ahh, yes _that_ is why I got an offset in page_alloc.c

Obviously this function needs an area->nr_free-- in the loop somewhere around
list_del(&buddy1->lru);

I have actually tested the complete patchset with this addition, I just forgot
to update the patch.

2004-09-05 06:17:26

by Nick Piggin

[permalink] [raw]

Subject: Re: [RFC][PATCH 0/3] beat kswapd with the proverbial clue-bat

David S. Miller wrote:
> On Sun, 05 Sep 2004 15:44:18 +1000
> Nick Piggin <[email protected]> wrote:
>
>
>>So my solution? Just teach kswapd and the watermark code about higher
>>order allocations in a fairly simple way. If pages_low is (say), 1024KB,
>>we now also require 512KB of order-1 and above pages, 256K of order-2
>>and up, 128K of order 3, etc. (perhaps we should stop at about order-3?)
>
>
> Whether to stop at order 3 is indeed an interesting question.
>
> The reality is that the high-order allocations come mostly from folks
> using jumbo 9K MTUs on gigabit and faster technologies. On x86, an
> order 2 would cover those packet allocations, but on sparc64 for example
> order 1 would be enough, whereas on a 2K PAGE_SIZE system order 3 would
> be necessary.
>

Yeah I see.

> People using e1000 cards are hitting this case, and some of the e1000
> developers are going to play around with using page array based SKBs
> (via the existing SKB page frags mechanism). So instead of allocating
> a huge linear chunk for RX packets, they'll allocate a header area of
> 256 bytes then an array of pages to cover the rest.
>

Yes, I guess that would be ideal from the memory manager's POV.

> Right now, my current suggestion would not be to stop at a certain order.
>

OK I'll keep it as is and we'll see how that goes. Thanks.

2004-09-05 06:20:37

by Nick Piggin

[permalink] [raw]

Subject: Re: [RFC][PATCH 3/3] teach kswapd about watermarks

David S. Miller wrote:
> If you're only doing atomic_set() and atomic_read() on kswapd_max_order,
> you're not doing anything atomic on the datum so no need to make it
> an atomic_t.
>

OK, sure. And yes, anything other than loads and stores on that
value would never be correct.

There is still the small race of two threads updating the value,
but that doesn't matter (and the atomics don't help there anyway
of course).

2004-09-05 06:26:20

by Nick Piggin

[permalink] [raw]

Subject: Re: [RFC][PATCH 0/3] beat kswapd with the proverbial clue-bat

Andrew Morton wrote:
> Nick Piggin <[email protected]> wrote:
>

>>Apparently these (higher-order && !wait) come up mainly in networking
>>which is the thing I had in mind. *However* as I only have half of a
>>gigabit network (ie. 1 card), I haven't done any testing where it
>>really counts. I'm also seeing surprisingly few reports on lkml, so
>>perhaps it is me that needs the beating?
>
>
> There have been few reports, and I believe that networking is getting
> changed to reduce the amount of GFP_ATOMIC higher-order allocation
> attempts.
>

That is the ideal goal, I think. But while our allocator offers higher
order allocations, we *should* be a bit smarter about them.

> There have been multiple instances in the past year or so where we've made
> changes in there, the changes were not adequately tested and stuff broke in
> subtle ways. We need to raise the bar a bit - clearly demonstrate that we
> have a problem, and then demonstrate that the fix fixes it, then worry
> about side-effects.
>

Yep. As you see I've already corrected myself a couple of times :\
RFC only at this stage.

>
> I don't see anything in your code which directly prevents the following
> serious scenario:
>
> a) Some random 0-order allocation causes a 4-order page to be split up,
> taking the 4-order pool below threshold.
>
> b) kswapd goes berzerk reclaiming 9000 pages to replenish the 4-order
> pool even though we don't need it.
>
> You have arith in there which kinda-sorta prevents it, but I don't see any
> hard-and-fast protection. Or did I miss it?
>

Yep. Kswapd will not care about 4-order allocations unless someone does
a wake_kswapd(order 4);

We could get into a situation where kswapd free smore than required, but
if you've got someone regularly allocating 4-order pages, it probably
isn't *that* dumb to free one or two more.

If we complete an entire balance_pgdat round without freeing the required
pages, that kswapd_max_order gets reset to zero anyway...

2004-09-05 06:28:38

by Anton Blanchard

[permalink] [raw]

Subject: Re: [RFC][PATCH 0/3] beat kswapd with the proverbial clue-bat

> There have been few reports, and I believe that networking is getting
> changed to reduce the amount of GFP_ATOMIC higher-order allocation
> attempts.

FYI I seem to remember issues on loopback due to its large MTU. Also the
printk_ratelimit stuff first appeared because the e1000 was spewing so
many higher order page allocation failures on some boxes.

But yes, the e1000 guys were going to look into multiple buffer mode so
they dont need a high order allocation.

Anton

2004-09-05 10:09:52

by Nick Piggin

[permalink] [raw]

Subject: Re: [RFC][PATCH 0/3] beat kswapd with the proverbial clue-bat

Anton Blanchard wrote:
>>There have been few reports, and I believe that networking is getting
>>changed to reduce the amount of GFP_ATOMIC higher-order allocation
>>attempts.
>
>
> FYI I seem to remember issues on loopback due to its large MTU. Also the

Yeah I had seen a few, surprisingly few though. Sorry I'm a bit clueless
about networking - I suppose there is a good reason for the 16K MTU? My
first thought might be that a 4K one could be better on CPU cache as well
as lighter on the mm. I know the networking guys know what they're doing
though...

> printk_ratelimit stuff first appeared because the e1000 was spewing so
> many higher order page allocation failures on some boxes.
>
> But yes, the e1000 guys were going to look into multiple buffer mode so
> they dont need a high order allocation.
>

Well let me be the first to say I don't want to stop that from happening.

With regard to getting this patchset tested, I might see if I can hunt
down another e1000 and give it a try at the end of the week. If anyone
would like to beat me to it, just let me know and I'll send out a new
set of patches with those couple of required fixes.

2004-09-05 10:15:38

by Nick Piggin

[permalink] [raw]

Subject: Re: [RFC][PATCH 0/3] beat kswapd with the proverbial clue-bat

Nick Piggin wrote:
> David S. Miller wrote:
>
>> On Sun, 05 Sep 2004 15:44:18 +1000
>> Nick Piggin <[email protected]> wrote:
>>
>>
>>> So my solution? Just teach kswapd and the watermark code about higher
>>> order allocations in a fairly simple way. If pages_low is (say), 1024KB,
>>> we now also require 512KB of order-1 and above pages, 256K of order-2
>>> and up, 128K of order 3, etc. (perhaps we should stop at about order-3?)
>>
>>
>>
>> Whether to stop at order 3 is indeed an interesting question.
>>
>> The reality is that the high-order allocations come mostly from folks
>> using jumbo 9K MTUs on gigabit and faster technologies. On x86, an
>> order 2 would cover those packet allocations, but on sparc64 for example
>> order 1 would be enough, whereas on a 2K PAGE_SIZE system order 3 would
>> be necessary.
>>
>
> Yeah I see.
>

Hmm, and the crowning argument for not stopping at order 3 is that if we
never use higher order allocations, nothing will care about their watermarks
anyway. I think I had myself confused when that question in the first place.

So yeah, stopping at a fixed number isn't required, and as you say it keeps
things general and special cases minimal.

2004-09-05 16:49:45

by Linus Torvalds

[permalink] [raw]

Subject: Re: [RFC][PATCH 0/3] beat kswapd with the proverbial clue-bat

On Sun, 5 Sep 2004, Nick Piggin wrote:
>
> So my solution? Just teach kswapd and the watermark code about higher
> order allocations in a fairly simple way. If pages_low is (say), 1024KB,
> we now also require 512KB of order-1 and above pages, 256K of order-2
> and up, 128K of order 3, etc. (perhaps we should stop at about order-3?)

I'm pretty sure we did something like this (where we not only required a
certain amount of memory free, but also a certain buddy availability) for
a short while in the 2.3.x timeframe or something like that, and that it
caused problems for some machines that just kept on trying to free memory.

I wrote a program at the time to check why, but I've long since lost it,
so I re-created it just now (probably buggy as hell), and append it here
for you to debug the dang thing.

In it's first rough draft (ie "it might be totally wrong") using this
script:

for i in 4 64 128 256 512 1024 2048 4096; do
echo $i megabytes:
./a.out $(($i*256)) pages
echo
done

I get

4 megabytes:
Got 2**1 buddy in 63 iterations (6%)
Got 2**2 buddy in 199 iterations (19%)
Got 2**3 buddy in 623 iterations (60%)

64 megabytes:
Got 2**1 buddy in 66 iterations (0%)
Got 2**2 buddy in 1541 iterations (9%)
Got 2**3 buddy in 4872 iterations (29%)

128 megabytes:
Got 2**1 buddy in 66 iterations (0%)
Got 2**2 buddy in 2865 iterations (8%)
Got 2**3 buddy in 7247 iterations (22%)

256 megabytes:
Got 2**1 buddy in 66 iterations (0%)
Got 2**2 buddy in 5593 iterations (8%)
Got 2**3 buddy in 13163 iterations (20%)

512 megabytes:
Got 2**1 buddy in 66 iterations (0%)
Got 2**2 buddy in 10897 iterations (8%)
Got 2**3 buddy in 30265 iterations (23%)

1024 megabytes:
Got 2**1 buddy in 1539 iterations (0%)
Got 2**2 buddy in 20664 iterations (7%)
Got 2**3 buddy in 67737 iterations (25%)

2048 megabytes:
Got 2**1 buddy in 1539 iterations (0%)
Got 2**2 buddy in 32541 iterations (6%)
Got 2**3 buddy in 122437 iterations (23%)

4096 megabytes:
Got 2**1 buddy in 1539 iterations (0%)
Got 2**2 buddy in 38852 iterations (3%)
Got 2**3 buddy in 220501 iterations (21%)

ie it shows how it's _really_ hard to get high-order buddies if you don't
have a lot of memory.

Now, this test-result is not "real life", in that the whole point of the
buddy allocator is that it is good at trying to keep higher orders free,
and as such real life should behave much better. But this _is_ pretty
accurate for the case where we totally ran out of memory and are trying
to randomly free one page here and one page there. That's where buddy
doesn't much help us, so this kind of shows the worst case.

(Rule: you should not allow the machine to get to this point, exactly
because at the < ~5% free point, buddy stops being very effective).

Notice how you may need to free 20% of memory to get a 2**3 allocation, if
you have totally depleted your pages. And it's much worse if you have very
little memory to begin with.

Anyway. I haven't debugged this program, so it may have serious bugs, and
be off by an order of magnitude or two. Whatever. If I'm wrong, somebody
can fix the program/script and see what the real numbers are.

Linus

----
#include <stdlib.h>

#define MAX_NR_PAGES (1024*1024)

#define DECLARE_BITMAP(x,size) \
unsigned int x[((size)+31)/32];

DECLARE_BITMAP(first, MAX_NR_PAGES);
DECLARE_BITMAP(second, MAX_NR_PAGES/2);
DECLARE_BITMAP(third, MAX_NR_PAGES/4);
DECLARE_BITMAP(fourth, MAX_NR_PAGES/8);

static int switch_bit(int nr, unsigned int *bitmap)
{
unsigned int mask = 1U << (nr & 31);
unsigned int *map = bitmap + (nr >> 5);
unsigned int old = *map;

*map = old ^ mask;
return (old & mask) != 0;
}

static int set_bit(int nr, unsigned int *bitmap)
{
unsigned int mask = 1U << (nr & 31);
unsigned int *map = bitmap + (nr >> 5);
unsigned int old = *map;

*map = old | mask;
return (old & mask) != 0;
}

static void fill_one(int nr, int max, int random)
{
random >>= 1;
if (switch_bit(random, second)) {
static int hit = 0;

if (!hit) {
hit = 1;
printf("Got 2**1 buddy in %d iterations (%d%%)\n", nr, nr*100/max);
}
random >>= 1;
if (switch_bit(random, third)) {
static int hit = 0;

if (!hit) {
hit = 1;
printf("Got 2**2 buddy in %d iterations (%d%%)\n", nr, nr*100/max);
}
random >>= 1;
if (switch_bit(random, fourth)) {
printf("Got 2**3 buddy in %d iterations (%d%%)\n", nr, nr*100/max);
exit(0);
}
}
}
}

int main(int argc, char **argv)
{
int i;
int nrpages;

if (argc < 2)
return -1;
nrpages = atoi(argv[1]);
if (nrpages < 1 || nrpages > MAX_NR_PAGES)
return -1;
srandom(time(NULL));

i = 0;
do {
unsigned r = random() % nrpages;
if (set_bit(r, first))
continue;
i++;
fill_one(i, nrpages, r);
} while (i < nrpages * 3 / 4);
printf("Filled 75% (%d)\n", i);
return -1;
}

2004-09-05 17:25:09

by Linus Torvalds

[permalink] [raw]

Subject: Re: [RFC][PATCH 0/3] beat kswapd with the proverbial clue-bat

On Sun, 5 Sep 2004, Nick Piggin wrote:
>
> Hmm, and the crowning argument for not stopping at order 3 is that if we
> never use higher order allocations, nothing will care about their watermarks
> anyway. I think I had myself confused when that question in the first place.
>
> So yeah, stopping at a fixed number isn't required, and as you say it keeps
> things general and special cases minimal.

Hey, please refute my "you need 20% free" to get even to order-3 for most
cases first.

It's probably acceptable to have a _very_ backgrounded job that does
freeing if order-3 isn't available, but it had better be pretty
slow-moving, I suspect. On the order of "It's probably ok to try to aim
for up to 25% free 'overnight' if the machine is idle" but it's almost
certainly not ok to aggressively push things out to that degree..

Linus

2004-09-05 17:36:42

by Martin J. Bligh

[permalink] [raw]

Subject: Re: [RFC][PATCH 0/3] beat kswapd with the proverbial clue-bat

>> Hmm, and the crowning argument for not stopping at order 3 is that if we
>> never use higher order allocations, nothing will care about their watermarks
>> anyway. I think I had myself confused when that question in the first place.
>>
>> So yeah, stopping at a fixed number isn't required, and as you say it keeps
>> things general and special cases minimal.
>
> Hey, please refute my "you need 20% free" to get even to order-3 for most
> cases first.
>
> It's probably acceptable to have a _very_ backgrounded job that does
> freeing if order-3 isn't available, but it had better be pretty
> slow-moving, I suspect. On the order of "It's probably ok to try to aim
> for up to 25% free 'overnight' if the machine is idle" but it's almost
> certainly not ok to aggressively push things out to that degree..

IIRC, the way we free pages is basically random shootdown. We don't even
attempt to free pages in a contiguous block, so it's unsuprising it doesn't
work. Now we have rmap, we should be able to walk through a block, see if
everything in it looks swappable, and then try to shoot it down. If not,
move onto the next one. I think that'd be a damned sight more likely
to free things up for higher order allocs if they do happen.

M.

2004-09-05 17:37:33

by Arjan van de Ven

[permalink] [raw]

Subject: Re: [RFC][PATCH 0/3] beat kswapd with the proverbial clue-bat

On Sun, 2004-09-05 at 19:24, Linus Torvalds wrote:
> On Sun, 5 Sep 2004, Nick Piggin wrote:
> >
> > Hmm, and the crowning argument for not stopping at order 3 is that if we
> > never use higher order allocations, nothing will care about their watermarks
> > anyway. I think I had myself confused when that question in the first place.
> >
> > So yeah, stopping at a fixed number isn't required, and as you say it keeps
> > things general and special cases minimal.
>
> Hey, please refute my "you need 20% free" to get even to order-3 for most
> cases first.
>
> It's probably acceptable to have a _very_ backgrounded job that does
> freeing if order-3 isn't available, but it had better be pretty
> slow-moving, I suspect. On the order of "It's probably ok to try to aim
> for up to 25% free 'overnight' if the machine is idle" but it's almost
> certainly not ok to aggressively push things out to that degree..

well... we have a reverse mapping now. What is stopping us from doing
physical defragmentation ?

Attachments:

signature.asc (189.00 B)
This is a digitally signed message part

2004-09-05 17:58:28

by Linus Torvalds

[permalink] [raw]

Subject: Re: [RFC][PATCH 0/3] beat kswapd with the proverbial clue-bat

On Sun, 5 Sep 2004, Arjan van de Ven wrote:
>
> well... we have a reverse mapping now. What is stopping us from doing
> physical defragmentation ?

Nothing but replacement policy, really, and the fact that not everything
is rmappable.

I think we should _normally_ honor replacement policy, the way we do now.
Only if we are in the situation "we have enough memory, but not enough
high-order-pages" should we go to a separate physical defrag algorithm.

So either kswapd should have a totally different mode, or there should be
a separate "kdefragd". It would potentially also be good if it is user-
triggerable, so that you could, for example, have a heavier defragd run
from the daily "cron" runs - something that doesn't seem to make much
sense from a traditional kswapd standpoint.

In other words, I don't think the physical thing should be triggered at
all by normal memory pressure. A large-order allocation failure would
trigger it "somewhat", and maybe it might run very slowly in the
background (wake up every five minutes or so to see if it is worth doing
anything), and then some user-triggerable way to make it more aggressive.

Does that sound sane to people?

Linus

2004-09-05 18:42:07

by Arjan van de Ven

[permalink] [raw]

Subject: Re: [RFC][PATCH 0/3] beat kswapd with the proverbial clue-bat

On Sun, Sep 05, 2004 at 10:58:07AM -0700, Linus Torvalds wrote:
>
>
> On Sun, 5 Sep 2004, Arjan van de Ven wrote:
> >
> > well... we have a reverse mapping now. What is stopping us from doing
> > physical defragmentation ?
>
> Nothing but replacement policy, really, and the fact that not everything
> is rmappable.
>
> I think we should _normally_ honor replacement policy, the way we do now.

yes it absolutely is quite a heavy hammer.
However right now the alternative (free a LOT of memory and hope it
collapses into higher order ones) is even heavier, freeing the wrong 8 pages
is less of a disturbance than freeing 8000 of the mostly wrong pages ;)

I absolutely agree this heavy hammer only should trigger if there is a
request for a higher order page that isn't there. Doing it from a special
thread does make sense, makes it relatively easy to keep track of such
wakeups and more importantly rate limit them etc.

Attachments:

(No filename) (932.00 B)
(No filename) (189.00 B)
Download all attachments

2004-09-06 00:55:05

by Nick Piggin

[permalink] [raw]

Subject: Re: [RFC][PATCH 0/3] beat kswapd with the proverbial clue-bat

Linus Torvalds wrote:

>
>On Sun, 5 Sep 2004, Nick Piggin wrote:
>
>>So my solution? Just teach kswapd and the watermark code about higher
>>order allocations in a fairly simple way. If pages_low is (say), 1024KB,
>>we now also require 512KB of order-1 and above pages, 256K of order-2
>>and up, 128K of order 3, etc. (perhaps we should stop at about order-3?)
>>
>
>I'm pretty sure we did something like this (where we not only required a
>certain amount of memory free, but also a certain buddy availability) for
>a short while in the 2.3.x timeframe or something like that, and that it
>caused problems for some machines that just kept on trying to free memory.
>
>

OK. As I explained in a follow up, my paragraph you quoted was a bit
confused. We don't try to keep any higher order allocations free by
checking the watermarks unless we're trying to make an allocation of
that order in the first place.

So if order 0 allocations are the only thing that ever get done, then
everything proceeds as before. If we do an order 1 allocation, we'll
only kick kswapd if we're low on memory that can satisfy order 1 allocs.

So it is only ever going to free pages when they really are being requested.
Also, if the required memory just can't be freed, kswapd will give up and
forget about it (ie. kswapd_max_order = 0;)

>I wrote a program at the time to check why, but I've long since lost it,
>so I re-created it just now (probably buggy as hell), and append it here
>for you to debug the dang thing.
>
>In it's first rough draft (ie "it might be totally wrong") using this
>script:
>
>

It looks correct to me. It does give a good argument for defragmentation.
But that is a bit orthogonal to what my patches do: they make kswapd work
properly to free higher order memory. The current alternatives are to do
the exact same work (lots of scanning) in allocator context, or fail the
allocation.

If someone wants to make that work, "lots of scanning" into "a bit of
defragmenting", great.

>
>Now, this test-result is not "real life", in that the whole point of the
>buddy allocator is that it is good at trying to keep higher orders free,
>and as such real life should behave much better. But this _is_ pretty
>accurate for the case where we totally ran out of memory and are trying
>to randomly free one page here and one page there. That's where buddy
>doesn't much help us, so this kind of shows the worst case.
>
>(Rule: you should not allow the machine to get to this point, exactly
>because at the < ~5% free point, buddy stops being very effective).
>
>Notice how you may need to free 20% of memory to get a 2**3 allocation, if
>you have totally depleted your pages. And it's much worse if you have very
>little memory to begin with.
>
>Anyway. I haven't debugged this program, so it may have serious bugs, and
>be off by an order of magnitude or two. Whatever. If I'm wrong, somebody
>can fix the program/script and see what the real numbers are.
>
>

No, Andrew just recently reported that order-1 allocations were needing to
free 20MB or more (on systems with not really huge memories IIRC). So I
think your program could be reasonably close to real life.

2004-09-06 01:12:59

by Nick Piggin

[permalink] [raw]

Subject: Re: [RFC][PATCH 0/3] beat kswapd with the proverbial clue-bat

Linus Torvalds wrote:

>
>On Sun, 5 Sep 2004, Nick Piggin wrote:
>
>>Hmm, and the crowning argument for not stopping at order 3 is that if we
>>never use higher order allocations, nothing will care about their watermarks
>>anyway. I think I had myself confused when that question in the first place.
>>
>>So yeah, stopping at a fixed number isn't required, and as you say it keeps
>>things general and special cases minimal.
>>
>
>Hey, please refute my "you need 20% free" to get even to order-3 for most
>cases first.
>
>

s/need 20% free/need to free 20%/ ?

I won't refute that. It is an unfortunate effect of using dumb scanning to
free higher order memory. Orthogonal to my patch, which just uses the
current
freeing mechanism to ensure kswapd gets off its bottom and does some
work for
higher order allocators.

>It's probably acceptable to have a _very_ backgrounded job that does
>freeing if order-3 isn't available, but it had better be pretty
>slow-moving, I suspect. On the order of "It's probably ok to try to aim
>for up to 25% free 'overnight' if the machine is idle" but it's almost
>certainly not ok to aggressively push things out to that degree..
>
>

I think my mechanism is about as on-demand as you can get. I think we've
typically leaned this way WRT the memory manager as opposed to background
scanning and freeing.... so sure you could introduce a new method for
freeing
higher order memory (ie. defragmenting), but I think you'd still want to use
my kswapd scheme in that case as well.

So instead of always freeing pages, you'd say:
if (!watermark_ok(order 0 pages))
try_to_free_pages
if (!watermark_ok(order-that-i-need))
try_to_defragment_pages

2004-09-06 01:35:25

by Nick Piggin

[permalink] [raw]

Subject: Re: [RFC][PATCH 0/3] beat kswapd with the proverbial clue-bat

Linus Torvalds wrote:

>
>On Sun, 5 Sep 2004, Arjan van de Ven wrote:
>
>>well... we have a reverse mapping now. What is stopping us from doing
>>physical defragmentation ?
>>
>
>Nothing but replacement policy, really, and the fact that not everything
>is rmappable.
>
>I think we should _normally_ honor replacement policy, the way we do now.
>Only if we are in the situation "we have enough memory, but not enough
>high-order-pages" should we go to a separate physical defrag algorithm.
>
>

Sure.

>So either kswapd should have a totally different mode, or there should be
>a separate "kdefragd". It would potentially also be good if it is user-
>triggerable, so that you could, for example, have a heavier defragd run
>from the daily "cron" runs - something that doesn't seem to make much
>sense from a traditional kswapd standpoint.
>
>In other words, I don't think the physical thing should be triggered at
>all by normal memory pressure. A large-order allocation failure would
>trigger it "somewhat", and maybe it might run very slowly in the
>background (wake up every five minutes or so to see if it is worth doing
>anything), and then some user-triggerable way to make it more aggressive.
>
>Does that sound sane to people?
>
>

Not to me :P

I think doing it just in time with kswapd and watermarks like we
do for order 0 allocations should be fine.

If you think of kswapd as "do the same freeing work the allocator
will otherwise have to" and "provide a context for doing freeing
work if the allocator can't" (in the !wait case)... then I think my
changes are pretty logical.

I think I confused everybody in the first email - we *do not* try
to heed any order-3 and above watermarks if we're only doing order-2
and below allocations... maybe this was the sticking point?

2004-09-06 01:49:32

by Nick Piggin

[permalink] [raw]

Subject: Re: [RFC][PATCH 0/3] beat kswapd with the proverbial clue-bat

Nick Piggin wrote:

>>
>> Notice how you may need to free 20% of memory to get a 2**3
>> allocation, if you have totally depleted your pages. And it's much
>> worse if you have very little memory to begin with.
>>
>> Anyway. I haven't debugged this program, so it may have serious bugs,
>> and be off by an order of magnitude or two. Whatever. If I'm wrong,
>> somebody can fix the program/script and see what the real numbers are.
>>
>>
>
> No, Andrew just recently reported that order-1 allocations were
> needing to
> free 20MB or more (on systems with not really huge memories IIRC). So I
> think your program could be reasonably close to real life.
>
>

But yeah, that is when your memory is completely depleted. A small
modification to your program to make it just keep scanning until we've
freed a set amount of memory obviously shows that the more you've freed,
the easier it becomes to free higher order areas... In this way, having
kswapd batch up the freeing might possibly make it *more* efficient than
only freeing the single higher order area when we've absolutely run out
of areas (and simply failing !wait allocations altogether).

2004-09-06 03:35:47

by David Miller

[permalink] [raw]

Subject: Re: [RFC][PATCH 0/3] beat kswapd with the proverbial clue-bat

On Sun, 05 Sep 2004 20:09:30 +1000
Nick Piggin <[email protected]> wrote:

> Yeah I had seen a few, surprisingly few though. Sorry I'm a bit clueless
> about networking - I suppose there is a good reason for the 16K MTU? My
> first thought might be that a 4K one could be better on CPU cache as well
> as lighter on the mm. I know the networking guys know what they're doing
> though...

It's better to get as long a stride as possible for the copy
from userspace, and yes as you get larger you run into cache
issues. 16K turned out the be the break point considering those
two attributes when I did my testing.

Just fool around with ifconfig lo mtu XXX and TCP bandwidth tests.
See what you come up with.

2004-09-06 08:55:55

by Nick Piggin

[permalink] [raw]

Subject: Re: [RFC][PATCH 0/3] beat kswapd with the proverbial clue-bat

David S. Miller wrote:
> On Sun, 05 Sep 2004 20:09:30 +1000
> Nick Piggin <[email protected]> wrote:
>
>
>>Yeah I had seen a few, surprisingly few though. Sorry I'm a bit clueless
>>about networking - I suppose there is a good reason for the 16K MTU? My
>>first thought might be that a 4K one could be better on CPU cache as well
>>as lighter on the mm. I know the networking guys know what they're doing
>>though...
>
>
> It's better to get as long a stride as possible for the copy
> from userspace, and yes as you get larger you run into cache
> issues. 16K turned out the be the break point considering those
> two attributes when I did my testing.
>

OK. Makes sense.

> Just fool around with ifconfig lo mtu XXX and TCP bandwidth tests.
> See what you come up with.
>

Thanks, I'll give that a try. I don't nearly have access to a
representitive range of architectures, but if I see anything
interesting on what I've got, I'll ping you.

2004-09-15 13:31:48

by Arjan van de Ven

[permalink] [raw]

Subject: Re: [RFC][PATCH 0/3] beat kswapd with the proverbial clue-bat

On Wed, Sep 15, 2004 at 03:27:12PM +0200, J?rn Engel wrote:
> On Sun, 5 September 2004 10:58:07 -0700, Linus Torvalds wrote:
> > On Sun, 5 Sep 2004, Arjan van de Ven wrote:
> > >
> > > well... we have a reverse mapping now. What is stopping us from doing
> > > physical defragmentation ?
> >
> > Nothing but replacement policy, really, and the fact that not everything
> > is rmappable.
>
> What about pointers? I have an ethernet driver that hands pointers to
> physical memory pages to hardware. Would be fun if someone
> defragmented those pages. ;)

if you haven't pinned those pages then you have lost already.

Attachments:

(No filename) (621.00 B)
(No filename) (189.00 B)
Download all attachments

2004-09-15 13:36:56

by Jörn Engel

[permalink] [raw]

Subject: Re: [RFC][PATCH 0/3] beat kswapd with the proverbial clue-bat

On Sun, 5 September 2004 10:58:07 -0700, Linus Torvalds wrote:
> On Sun, 5 Sep 2004, Arjan van de Ven wrote:
> >
> > well... we have a reverse mapping now. What is stopping us from doing
> > physical defragmentation ?
>
> Nothing but replacement policy, really, and the fact that not everything
> is rmappable.

What about pointers? I have an ethernet driver that hands pointers to
physical memory pages to hardware. Would be fun if someone
defragmented those pages. ;)

J?rn

--
It's just what we asked for, but not what we want!
-- anonymous

2004-09-15 13:43:36

by Arjan van de Ven

[permalink] [raw]

Subject: Re: [RFC][PATCH 0/3] beat kswapd with the proverbial clue-bat

On Wed, Sep 15, 2004 at 03:34:08PM +0200, J?rn Engel wrote:
> On Wed, 15 September 2004 15:29:04 +0200, Arjan van de Ven wrote:
> >
> > if you haven't pinned those pages then you have lost already.
>
> Bug reports say otherwise. Could you explain "pinning" to a newbie
> like me?

if your page is made unfreeable for the vm, for example by virtue of not
being on the LRU or having an elevated count or.. or .. then such a page is
pinned.

if your page is freeable byt he VM and your device is dmaing from/to it you
have a really bad bug

Attachments:

(No filename) (540.00 B)
(No filename) (189.00 B)
Download all attachments

2004-09-15 14:11:19

by Jörn Engel

[permalink] [raw]

Subject: Re: [RFC][PATCH 0/3] beat kswapd with the proverbial clue-bat

On Wed, 15 September 2004 15:29:04 +0200, Arjan van de Ven wrote:
>
> if you haven't pinned those pages then you have lost already.

Bug reports say otherwise. Could you explain "pinning" to a newbie
like me?

J?rn

--
Homo Sapiens is a goal, not a description.
-- unknown

2004-09-15 14:24:05

by Jörn Engel

[permalink] [raw]

Subject: Re: [RFC][PATCH 0/3] beat kswapd with the proverbial clue-bat

On Wed, 15 September 2004 15:39:39 +0200, Arjan van de Ven wrote:
>
> if your page is made unfreeable for the vm, for example by virtue of not
> being on the LRU or having an elevated count or.. or .. then such a page is
> pinned.
>
> if your page is freeable byt he VM and your device is dmaing from/to it you
> have a really bad bug

Agreed. skb->data should be safe as long as kfree_skb isn't called [1].
Thanks for the education.

[1] Actually, that does cause a really bad bug, but completely
unrelated to memory management. kfree_skb does more than the name
indicates. Patches will follow...

J?rn

--
Everything should be made as simple as possible, but not simpler.
-- Albert Einstein