2005-11-01 20:59:15

by Joel Schopp

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19


>> The patches have gone through a large number of revisions, have been
>> heavily tested and reviewed by a few people. The memory footprint of this
>> approach is smaller than introducing new zones. If the cache footprint,
>> increased branches and instructions were a problem, I would expect
>> them to
>> show up in the aim9 benchmark or the benchmark that ran ghostscript
>> multiple times on a large file.
>>
>
> I appreciate that a lot of work has gone into them. You must appreciate
> that they add a reasonable amount of complexity and a non-zero performance
> cost to the page allocator.

The patches do add a reasonable amount of complexity to the page allocator. In
my opinion that is the only downside of these patches, even though it is a big
one. What we need to decide as a community is if there is a less complex way to
do this, and if there isn't a less complex way then is the benefit worth the
increased complexity.

As to the non-zero performance cost, I think hard numbers should carry more
weight than they have been given in this area. Mel has posted hard numbers that
say the patches are a wash with respect to performance. I don't see any
evidence to contradict those results.

>> They will need high order allocations if we want to provide HugeTLB pages
>> to userspace on-demand rather than reserving at boot-time. This is a
>> future problem, but it's one that is not worth tackling until the
>> fragmentation problem is fixed first.
>>
>
> Sure. In what form, we haven't agreed. I vote zones! :)

I'd like to hear more details of how zones would be less complex while still
solving the problem. I just don't get it.


2005-11-02 01:04:42

by Nick Piggin

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

Joel Schopp wrote:

> The patches do add a reasonable amount of complexity to the page
> allocator. In my opinion that is the only downside of these patches,
> even though it is a big one. What we need to decide as a community is
> if there is a less complex way to do this, and if there isn't a less
> complex way then is the benefit worth the increased complexity.
>
> As to the non-zero performance cost, I think hard numbers should carry
> more weight than they have been given in this area. Mel has posted hard
> numbers that say the patches are a wash with respect to performance. I
> don't see any evidence to contradict those results.
>

The numbers I have seen show that performance is decreased. People
like Ken Chen spend months trying to find a 0.05% improvement in
performance. Not long ago I just spent days getting our cached
kbuild performance back to where 2.4 is on my build system.

I can simply see they will cost more icache, more dcache, more branches,
etc. in what is the hottest part of the kernel in some workloads (kernel
compiles, for one).

I'm sorry if I sound like a wet blanket. I just don't look at a patch
and think "wow all those 3 guys with Linux on IBM mainframes and using
lpars are going to be so much happier now, this is something we need".

>>> They will need high order allocations if we want to provide HugeTLB pages
>>> to userspace on-demand rather than reserving at boot-time. This is a
>>> future problem, but it's one that is not worth tackling until the
>>> fragmentation problem is fixed first.
>>>
>>
>> Sure. In what form, we haven't agreed. I vote zones! :)
>
>
> I'd like to hear more details of how zones would be less complex while
> still solving the problem. I just don't get it.
>

You have an extra zone. You size that zone at boot according to the
amount of memory you need to be able to free. Only easy-reclaim stuff
goes in that zone.

It is less complex because zones are a complexity we already have to
live with. 99% of the infrastructure is already there to do this.

If you want to hot unplug memory or guarantee hugepage allocation,
this is the way to do it. Nobody has told me why this *doesn't* work.

--
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com

2005-11-02 01:42:34

by Martin Bligh

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

> The numbers I have seen show that performance is decreased. People
> like Ken Chen spend months trying to find a 0.05% improvement in
> performance. Not long ago I just spent days getting our cached
> kbuild performance back to where 2.4 is on my build system.

Ironically, we're currently trying to chase down a 'database benchmark'
regression that seems to have been caused by the last round of "let's
rewrite the scheduler again" (more details later). Nick, you've added an
awful lot of complexity to some of these code paths yourself ... seems
ironic that you're the one complaining about it ;-)

>>> Sure. In what form, we haven't agreed. I vote zones! :)
>>
>>
>> I'd like to hear more details of how zones would be less complex while
>> still solving the problem. I just don't get it.
>>
>
> You have an extra zone. You size that zone at boot according to the
> amount of memory you need to be able to free. Only easy-reclaim stuff
> goes in that zone.
>
> It is less complex because zones are a complexity we already have to
> live with. 99% of the infrastructure is already there to do this.
>
> If you want to hot unplug memory or guarantee hugepage allocation,
> this is the way to do it. Nobody has told me why this *doesn't* work.

Because the zone is statically sized, and you're back to the same crap
we had with 32bit systems of splitting ZONE_NORMAL and ZONE_HIGHMEM,
effectively. Define how much you need for system ram, and how much
for easily reclaimable memory at boot time. You can't - it doesn't work.

M.

2005-11-02 02:02:16

by Nick Piggin

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

Martin J. Bligh wrote:
>>The numbers I have seen show that performance is decreased. People
>>like Ken Chen spend months trying to find a 0.05% improvement in
>>performance. Not long ago I just spent days getting our cached
>>kbuild performance back to where 2.4 is on my build system.
>
>
> Ironically, we're currently trying to chase down a 'database benchmark'
> regression that seems to have been caused by the last round of "let's
> rewrite the scheduler again" (more details later). Nick, you've added an
> awful lot of complexity to some of these code paths yourself ... seems
> ironic that you're the one complaining about it ;-)
>

Yeah, that's unfortunate, but I think a large portion of the problem
(if they are at all the same) has been narrowed down to some over-eager
wakeup balancing, for which there are a number of proposed patches.

But in this case I was more worried about getting the groundwork done
for handling the multicore systems that everyone will soon
be using rather than several % performance regression on TPC-C (not
to say that I don't care about that at all)... I don't see the irony.

But let's move this to another thread if it is going to continue. I
would be happy to discuss scheduler problems.

>>You have an extra zone. You size that zone at boot according to the
>>amount of memory you need to be able to free. Only easy-reclaim stuff
>>goes in that zone.
>>
>>It is less complex because zones are a complexity we already have to
>>live with. 99% of the infrastructure is already there to do this.
>>
>>If you want to hot unplug memory or guarantee hugepage allocation,
>>this is the way to do it. Nobody has told me why this *doesn't* work.
>
>
> Because the zone is statically sized, and you're back to the same crap
> we had with 32bit systems of splitting ZONE_NORMAL and ZONE_HIGHMEM,
> effectively. Define how much you need for system ram, and how much
> for easily reclaimable memory at boot time. You can't - it doesn't work.
>

You can't what? What doesn't work? If you have no hard limits set,
then the frag patches can't guarantee anything either.

You can't have it both ways. Either you have limits for things or
you don't need any guarantees. Zones handle the former case nicely,
and we currently do the latter case just fine (along with the frag
patches).

--
SUSE Labs, Novell Inc.


2005-11-02 02:24:01

by Martin Bligh

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

>>> The numbers I have seen show that performance is decreased. People
>>> like Ken Chen spend months trying to find a 0.05% improvement in
>>> performance. Not long ago I just spent days getting our cached
>>> kbuild performance back to where 2.4 is on my build system.
>>
>> Ironically, we're currently trying to chase down a 'database benchmark'
>> regression that seems to have been caused by the last round of "let's
>> rewrite the scheduler again" (more details later). Nick, you've added an
>> awful lot of complexity to some of these code paths yourself ... seems
>> ironic that you're the one complaining about it ;-)
>
> Yeah that's unfortunate, but I think a large portion of the problem
> (if they are anything the same) has been narrowed down to some over
> eager wakeup balancing for which there are a number of proposed
> patches.
>
> But in this case I was more worried about getting the groundwork done
> for handling the multicore systems that everyone will soon
> be using rather than several % performance regression on TPC-C (not
> to say that I don't care about that at all)... I don't see the irony.
>
> But let's move this to another thread if it is going to continue. I
> would be happy to discuss scheduler problems.

My point was that most things we do add complexity to the codebase,
including the things you do yourself ... I'm not saying that we're worse
off for the changes you've made, by any means - I think they've been
mostly beneficial. I'm just pointing out that we ALL do it, so let us
not be too quick to judge when others propose adding something that does ;-)

>> Because the zone is statically sized, and you're back to the same crap
>> we had with 32bit systems of splitting ZONE_NORMAL and ZONE_HIGHMEM,
>> effectively. Define how much you need for system ram, and how much
>> for easily reclaimable memory at boot time. You can't - it doesn't work.
>
> You can't what? What doesn't work? If you have no hard limits set,
> then the frag patches can't guarantee anything either.
>
> You can't have it both ways. Either you have limits for things or
> you don't need any guarantees. Zones handle the former case nicely,
> and we currently do the latter case just fine (along with the frag
> patches).

I'll go look through Mel's current patchset again. I was under the
impression it didn't suffer from this problem, at least not as much
as zones did.

Nothing is guaranteed. You can shag the whole machine and/or VM in
any number of ways ... if we can significantly improve the probability
of existing higher order allocs working, and new functionality has
an excellent probability of success, that's as good as you're going to
get. Have a free "perfect is the enemy of good" Linus quote, on me ;-)

M.

2005-11-02 02:51:33

by Nick Piggin

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

Martin J. Bligh wrote:

>>But let's move this to another thread if it is going to continue. I
>>would be happy to discuss scheduler problems.
>
>
> My point was that most things we do add complexity to the codebase,
> including the things you do yourself ... I'm not saying that we're worse
> off for the changes you've made, by any means - I think they've been
> mostly beneficial.

Heh - I like the "mostly" ;)

> I'm just pointing out that we ALL do it, so let us
> not be too quick to judge when others propose adding something that does ;-)
>

What I'm getting worried about is the marked increase in the
rate of features and complexity going in.

I am almost certainly never going to use memory hotplug or
demand paging of hugepages. I am pretty likely going to have
to wade through this code at some point in the future if it
is merged.

It is also going to slow down my kernel by maybe 1% when
doing kbuilds, but hey, let's not worry about that until we've
merged 10 more such slowdowns (OK, that wasn't aimed at you or
Mel, but at my perception of the status quo).

>
>>You can't what? What doesn't work? If you have no hard limits set,
>>then the frag patches can't guarantee anything either.
>>
>>You can't have it both ways. Either you have limits for things or
>>you don't need any guarantees. Zones handle the former case nicely,
>>and we currently do the latter case just fine (along with the frag
>>patches).
>
>
> I'll go look through Mel's current patchset again. I was under the
> impression it didn't suffer from this problem, at least not as much
> as zones did.
>

Over time, I don't think it can offer any stronger a guarantee
than what we currently have. I'm not even sure that it would be
any better at all for problematic workloads as time -> infinity.

> Nothing is guaranteed. You can shag the whole machine and/or VM in
> any number of ways ... if we can significantly improve the probability
> of existing higher order allocs working, and new functionality has
> an excellent probability of success, that's as good as you're going to
> get. Have a free "perfect is the enemy of good" Linus quote, on me ;-)
>

I think it falls down if these higher order allocations actually
get *used* for anything. You'll simply be going through the process
of replacing your contiguous, easy-to-reclaim memory with pinned
kernel memory.

However, for the purpose of memory hot unplug, a new zone *will*
guarantee memory can be reclaimed and unplugged.

--
SUSE Labs, Novell Inc.


2005-11-02 04:39:12

by Martin Bligh

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

>> I'm just pointing out that we ALL do it, so let us
>> not be too quick to judge when others propose adding something that does ;-)
>
> What I'm getting worried about is the marked increase in the
> rate of features and complexity going in.
>
> I am almost certainly never going to use memory hotplug or
> demand paging of hugepages. I am pretty likely going to have
> to wade through this code at some point in the future if it
> is merged.

Mmm. Though whether any one of us will personally use each feature
is perhaps not the most ideal criteria to judge things by ;-)

> It is also going to slow down my kernel by maybe 1% when
> doing kbuilds, but hey let's not worry about that until we've
> merged 10 more such slowdowns (ok that wasn't aimed at you or
> Mel, but my perception of the status quo).

If it's really 1%, yes, that's a huge problem. And yes, I agree with
you that there's a problem with the rate of change. Part of that is
a lack of performance measurement and testing, and the quality sometimes
scares me (though the last month has actually been significantly better,
the tree mostly builds and boots now!). I've tried to do something on
the testing front, but I'm acutely aware it's not sufficient by any means.

>>> You can't what? What doesn't work? If you have no hard limits set,
>>> then the frag patches can't guarantee anything either.
>>>
>>> You can't have it both ways. Either you have limits for things or
>>> you don't need any guarantees. Zones handle the former case nicely,
>>> and we currently do the latter case just fine (along with the frag
>>> patches).
>>
>> I'll go look through Mel's current patchset again. I was under the
>> impression it didn't suffer from this problem, at least not as much
>> as zones did.
>
> Over time, I don't think it can offer any stronger a guarantee
> than what we currently have. I'm not even sure that it would be
> any better at all for problematic workloads as time -> infinity.

Sounds worth discussing. We need *some* way of dealing with fragmentation
issues. To me that means both an avoidance strategy, and an ability
to actively defragment if we need it. Linux is evolved software, it
may not be perfect at first - that's the way we work, and it's served
us well up till now. To me, that's the biggest advantage we have over
the proprietary model.

>> Nothing is guaranteed. You can shag the whole machine and/or VM in
>> any number of ways ... if we can significantly improve the probability
>> of existing higher order allocs working, and new functionality has
>> an excellent probability of success, that's as good as you're going to
>> get. Have a free "perfect is the enemy of good" Linus quote, on me ;-)
>
> I think it falls down if these higher order allocations actually
> get *used* for anything. You'll simply be going through the process
> of replacing your contiguous, easy-to-reclaim memory with pinned
> kernel memory.

It seems inevitable that we need both physically contiguous memory
sections, and virtually contiguous in kernel space (which equates to
the same thing, unless we totally break the 1-1 P-V mapping and
lose the large page mapping for kernel, which I'd hate to do.)

> However, for the purpose of memory hot unplug, a new zone *will*
> guarantee memory can be reclaimed and unplugged.

It's not just about memory hotplug. There are, as we have discussed
already, many uses for physically contiguous (and virtually contiguous)
memory segments. Focusing purely on any one of them will not solve the
issue at hand ...

M.

2005-11-02 05:08:01

by Nick Piggin

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

Martin J. Bligh wrote:

>>I am almost certainly never going to use memory hotplug or
>>demand paging of hugepages. I am pretty likely going to have
>>to wade through this code at some point in the future if it
>>is merged.
>
>
> Mmm. Though whether any one of us will personally use each feature
> is perhaps not the most ideal criteria to judge things by ;-)
>

Of course, but I'd say very few people will. Then again maybe
I'm just a luddite who doesn't know what's good for him ;)

>
>>It is also going to slow down my kernel by maybe 1% when
>>doing kbuilds, but hey let's not worry about that until we've
>>merged 10 more such slowdowns (ok that wasn't aimed at you or
>>Mel, but my perception of the status quo).
>
>
> If it's really 1%, yes, that's a huge problem. And yes, I agree with
> you that there's a problem with the rate of change. Part of that is
> a lack of performance measurement and testing, and the quality sometimes
> scares me (though the last month has actually been significantly better,
> the tree mostly builds and boots now!). I've tried to do something on
> the testing front, but I'm acutely aware it's not sufficient by any means.
>

To be honest I haven't tested so this is an unfounded guess. However
it is based on what I have seen of Mel's numbers, and the fact that
the kernel spends nearly 1/3rd of its time in the page allocator when
running a kbuild.

I may get around to getting some real numbers when my current patch
queues shrink.

>>Over time, I don't think it can offer any stronger a guarantee
>>than what we currently have. I'm not even sure that it would be
>>any better at all for problematic workloads as time -> infinity.
>
>
> Sounds worth discussing. We need *some* way of dealing with fragmentation
> issues. To me that means both an avoidance strategy, and an ability
> to actively defragment if we need it. Linux is evolved software, it
> may not be perfect at first - that's the way we work, and it's served
> us well up till now. To me, that's the biggest advantage we have over
> the proprietary model.
>

True and I'm also annoyed that we have these issues at all. I just
don't see that the avoidance strategy helps that much because as I
said, you don't need to keep these lovely contiguous regions just for
show (or other easy-to-reclaim user pages).

The absolute priority is to move away from higher order allocs or
use fallbacks IMO. And that doesn't necessarily mean order 1 or even
2 allocations, because we don't seem to have a problem with those.

Because I want Linux to be as robust as you do.

>>I think it falls down if these higher order allocations actually
>>get *used* for anything. You'll simply be going through the process
>>of replacing your contiguous, easy-to-reclaim memory with pinned
>>kernel memory.
>
>
> It seems inevitable that we need both physically contiguous memory
> sections, and virtually contiguous in kernel space (which equates to
> the same thing, unless we totally break the 1-1 P-V mapping and
> lose the large page mapping for kernel, which I'd hate to do.)
>

I think this isn't as bad an idea as you think. If it means those
guys doing memory hotplug take a few % performance hit and nobody else
has to bear the costs then that sounds great.

>
>>However, for the purpose of memory hot unplug, a new zone *will*
>>guarantee memory can be reclaimed and unplugged.
>
>
> It's not just about memory hotplug. There are, as we have discussed
> already, many uses for physically contiguous (and virtually contiguous)
> memory segments. Focusing purely on any one of them will not solve the
> issue at hand ...
>

True, but we don't seem to have huge problems with other things. The
main ones that have come up on lkml are e1000 which is getting fixed,
and maybe XFS which I think there are also moves to improve.

--
SUSE Labs, Novell Inc.


2005-11-02 05:15:12

by Martin Bligh

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

>> It's not just about memory hotplug. There are, as we have discussed
>> already, many uses for physically contiguous (and virtually contiguous)
>> memory segments. Focusing purely on any one of them will not solve the
>> issue at hand ...
>
> True, but we don't seem to have huge problems with other things. The
> main ones that have come up on lkml are e1000 which is getting fixed,
> and maybe XFS which I think there are also moves to improve.

It should be fairly easy to trawl through the list of all allocations
and pull out all the higher order ones from the whole source tree. I
suspect there's a lot ... maybe I'll play with it later on.

M.

2005-11-02 06:24:04

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

Martin J. Bligh wrote:
>>True, but we don't seem to have huge problems with other things. The
>>main ones that have come up on lkml are e1000 which is getting fixed,
>>and maybe XFS which I think there are also moves to improve.
>
>
> It should be fairly easy to trawl through the list of all allocations
> and pull out all the higher order ones from the whole source tree. I
> suspect there's a lot ... maybe I'll play with it later on.
>

Please check kmalloc(32k, 64k).

For example, the loopback device's default MTU of 16436 means order=3
allocations, and maybe there are other high-MTU devices.

I suspect the skb_makewritable()/skb_copy()/skb_linearize() functions can
suffer from fragmentation when the MTU is big. They allocate a large skb by
gathering fragmented skbs. When these skb_* funcs fail, the packet is
silently discarded by netfilter. If fragmentation is heavy, packets
(especially TCP) using a large MTU never reach their destination, even over
loopback.

Honestly, I'm not familiar with the network code; could anyone comment on
this?

-- Kame


2005-11-02 07:21:03

by Yasunori Goto

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

Hello, Nick-san.

I posted patches that add ZONE_REMOVABLE to LHMS.
I don't claim they are better than Mel-san's patch.
I hope this will be the basis of a good discussion.


There were 2 types.

One just added ZONE_REMOVABLE.
This patch came from an early implementation by the VA-Linux memory
hotplug team.
http://sourceforge.net/mailarchive/forum.php?thread_id=5969508&forum_id=223

ZONE_HIGHMEM was used for this purpose in the early implementation.
We thought ZONE_HIGHMEM was easier to remove than the other zones,
but some architectures don't use it. That is why ZONE_REMOVABLE
was born.
(And I remember that ZONE_DMA32 was defined after this patch, so the
number of zones became 5 and one more bit was necessary in
page->flags. (I don't know the recent progress of ZONE_DMA32.))


The other was a bit similar to Mel-san's one.
One motivation for this patch was to create an orthogonal relationship
between Removable and DMA/Normal/Highmem, which I thought was desirable
because ppc64 can treat all of its memory as one (DMA) zone.
I thought that a new zone would spoil that good feature.

http://sourceforge.net/mailarchive/forum.php?thread_id=5345977&forum_id=223
http://sourceforge.net/mailarchive/forum.php?thread_id=5345978&forum_id=223
http://sourceforge.net/mailarchive/forum.php?thread_id=5345979&forum_id=223
http://sourceforge.net/mailarchive/forum.php?thread_id=5345980&forum_id=223


Thanks.

P.S. to Mel-san.
I'm sorry for writing this so late. This thread was a mail bomb for me
to read with my poor English skill. :-(


> Martin J. Bligh wrote:
>
> >>But let's move this to another thread if it is going to continue. I
> >>would be happy to discuss scheduler problems.
> >
> >
> > My point was that most things we do add complexity to the codebase,
> > including the things you do yourself ... I'm not saying that we're worse
> > off for the changes you've made, by any means - I think they've been
> > mostly beneficial.
>
> Heh - I like the "mostly" ;)
>
> > I'm just pointing out that we ALL do it, so let us
> > not be too quick to judge when others propose adding something that does ;-)
> >
>
> What I'm getting worried about is the marked increase in the
> rate of features and complexity going in.
>
> I am almost certainly never going to use memory hotplug or
> demand paging of hugepages. I am pretty likely going to have
> to wade through this code at some point in the future if it
> is merged.
>
> It is also going to slow down my kernel by maybe 1% when
> doing kbuilds, but hey let's not worry about that until we've
> merged 10 more such slowdowns (ok that wasn't aimed at you or
> Mel, but my perception of the status quo).
>
> >
> >>You can't what? What doesn't work? If you have no hard limits set,
> >>then the frag patches can't guarantee anything either.
> >>
> >>You can't have it both ways. Either you have limits for things or
> >>you don't need any guarantees. Zones handle the former case nicely,
> >>and we currently do the latter case just fine (along with the frag
> >>patches).
> >
> >
> > I'll go look through Mel's current patchset again. I was under the
> > impression it didn't suffer from this problem, at least not as much
> > as zones did.
> >
>
> Over time, I don't think it can offer any stronger a guarantee
> than what we currently have. I'm not even sure that it would be
> any better at all for problematic workloads as time -> infinity.
>
> > Nothing is guaranteed. You can shag the whole machine and/or VM in
> > any number of ways ... if we can significantly improve the probability
> > of existing higher order allocs working, and new functionality has
> > an excellent probability of success, that's as good as you're going to
> > get. Have a free "perfect is the enemy of good" Linus quote, on me ;-)
> >
>
> I think it falls down if these higher order allocations actually
> get *used* for anything. You'll simply be going through the process
> of replacing your contiguous, easy-to-reclaim memory with pinned
> kernel memory.
>
> However, for the purpose of memory hot unplug, a new zone *will*
> guarantee memory can be reclaimed and unplugged.
>
> --
> SUSE Labs, Novell Inc.
>

--
Yasunori Goto

2005-11-02 10:13:31

by Nick Piggin

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

KAMEZAWA Hiroyuki wrote:

> Please check kmalloc(32k, 64k).
>
> For example, the loopback device's default MTU of 16436 means order=3
> allocations, and maybe there are other high-MTU devices.
>
> I suspect the skb_makewritable()/skb_copy()/skb_linearize() functions can
> suffer from fragmentation when the MTU is big. They allocate a large skb
> by gathering fragmented skbs. When these skb_* funcs fail, the packet is
> silently discarded by netfilter. If fragmentation is heavy, packets
> (especially TCP) using a large MTU never reach their destination, even
> over loopback.
>
> Honestly, I'm not familiar with the network code; could anyone comment on
> this?
>

I'd be interested to know, actually. I was hoping loopback should always
use order-0 allocations, because the loopback driver is SG, FRAGLIST,
and HIGHDMA capable. However I'm likewise not familiar with network code.

--
SUSE Labs, Novell Inc.


2005-11-02 11:37:30

by Mel Gorman

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Wed, 2 Nov 2005, Nick Piggin wrote:

> Joel Schopp wrote:
>
> > The patches do add a reasonable amount of complexity to the page allocator.
> > In my opinion that is the only downside of these patches, even though it is
> > a big one. What we need to decide as a community is if there is a less
> > complex way to do this, and if there isn't a less complex way then is the
> > benefit worth the increased complexity.
> >
> > As to the non-zero performance cost, I think hard numbers should carry more
> > weight than they have been given in this area. Mel has posted hard numbers
> > that say the patches are a wash with respect to performance. I don't see
> > any evidence to contradict those results.
> >
>
> The numbers I have seen show that performance is decreased. People
> like Ken Chen spend months trying to find a 0.05% improvement in
> performance.

Fine, that is understandable. The AIM9 benchmarks also show performance
improvements in other areas like fork_test, about a 5% difference, which is
also important for kernel builds. Wider testing would be needed to see
whether the improvements are specific to my tests. Every set of patches has
had a performance regression test run with AIM9, so I certainly have not
been ignoring performance.

> Not long ago I just spent days getting our cached
> kbuild performance back to where 2.4 is on my build system.
>

Then it would be interesting to find out how 2.6.14-rc5-mm1 compares
against 2.6.14-rc5-mm1-mbuddy-v19?

> I can simply see they will cost more icache, more dcache, more branches,
> etc. in what is the hottest part of the kernel in some workloads (kernel
> compiles, for one).
>
> I'm sorry if I sound like a wet blanket. I just don't look at a patch
> and think "wow all those 3 guys with Linux on IBM mainframes and using
> lpars are going to be so much happier now, this is something we need".
>

I developed this as the beginning of a long term solution for on-demand
HugeTLB pages as part of a PhD. This could potentially help desktop
workloads in the future. Hotplug machines are a benefit that was picked up
by the work on the way. We can help hotplug to some extent today and
desktop users in the future (and given time, all of the hotplug problems
as well). But if we tell desktop users "Yeah, your applications will run a
bit better with HugeTLB pages as long as you configure the size of the
zone correctly" at any stage, we'll be told where to go.

> > > > They will need high order allocations if we want to provide HugeTLB pages
> > > > to userspace on-demand rather than reserving at boot-time. This is a
> > > > future problem, but it's one that is not worth tackling until the
> > > > fragmentation problem is fixed first.
> > > >
> > >
> > > Sure. In what form, we haven't agreed. I vote zones! :)
> >
> >
> > I'd like to hear more details of how zones would be less complex while still
> > solving the problem. I just don't get it.
> >
>
> You have an extra zone. You size that zone at boot according to the
> amount of memory you need to be able to free. Only easy-reclaim stuff
> goes in that zone.
>

Helps hotplug, no one else. Rules out HugeTLB on demand for userspace
unless we are willing to tell desktop users to configure this tunable.

> It is less complex because zones are a complexity we already have to
> live with. 99% of the infrastructure is already there to do this.
>

The simplicity of zones is still in dispute. I am putting together a mail
of pros, cons, situations and future work for both approaches. I hope to
send it out fairly soon.

> If you want to hot unplug memory or guarantee hugepage allocation,
> this is the way to do it. Nobody has told me why this *doesn't* work.
>

Hot unplug of the configured zone of memory, and guaranteed hugepage
allocation only for userspace. There is no help for kernel allocations that
need a huge page under any circumstance. Our approach allows the kernel to
get the large page, at the cost of fragmentation degrading slowly over
time. To stop it fragmenting slowly over time, more work is needed.

--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab

2005-11-02 11:41:52

by Mel Gorman

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Wed, 2 Nov 2005, Nick Piggin wrote:

> Martin J. Bligh wrote:
> > > The numbers I have seen show that performance is decreased. People
> > > like Ken Chen spend months trying to find a 0.05% improvement in
> > > performance. Not long ago I just spent days getting our cached
> > > kbuild performance back to where 2.4 is on my build system.
> >
> >
> > Ironically, we're currently trying to chase down a 'database benchmark'
> > regression that seems to have been cause by the last round of "let's
> > rewrite the scheduler again" (more details later). Nick, you've added an
> > awful lot of complexity to some of these code paths yourself ... seems
> > ironic that you're the one complaining about it ;-)
> >
>
> Yeah that's unfortunate, but I think a large portion of the problem
> (if they are anything the same) has been narrowed down to some over
> eager wakeup balancing for which there are a number of proposed
> patches.
>
> But in this case I was more worried about getting the groundwork done
> for handling the multicore systems that everyone will soon
> be using rather than several % performance regression on TPC-C (not
> to say that I don't care about that at all)... I don't see the irony.
>
> But let's move this to another thread if it is going to continue. I
> would be happy to discuss scheduler problems.
>
> > > You have an extra zone. You size that zone at boot according to the
> > > amount of memory you need to be able to free. Only easy-reclaim stuff
> > > goes in that zone.
> > >
> > > It is less complex because zones are a complexity we already have to
> > > live with. 99% of the infrastructure is already there to do this.
> > >
> > > If you want to hot unplug memory or guarantee hugepage allocation,
> > > this is the way to do it. Nobody has told me why this *doesn't* work.
> >
> >
> > Because the zone is statically sized, and you're back to the same crap
> > we had with 32bit systems of splitting ZONE_NORMAL and ZONE_HIGHMEM,
> > effectively. Define how much you need for system ram, and how much
> > for easily reclaimable memory at boot time. You can't - it doesn't work.
> >
>
> You can't what? What doesn't work? If you have no hard limits set,
> then the frag patches can't guarantee anything either.
>

True, but the difference is:

Anti-defrag: best effort at low cost (according to Aim9), no tunable required
Zones: will work, but requires a tunable and falls apart if tuned wrong

> You can't have it both ways. Either you have limits for things or
> you don't need any guarantees. Zones handle the former case nicely,
> and we currently do the latter case just fine (along with the frag
> patches).
>

Sure, so you compromise and do best effort for as long as possible.
Always try to keep fragmentation low. If the system is configured to
really need low fragmentation, then after a long period of time, a
page-migration mechanism kicks in to move the kernel pages out of EasyRclm
areas and we continue on.

--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab

2005-11-02 11:48:19

by Mel Gorman

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Wed, 2 Nov 2005, Nick Piggin wrote:

> Martin J. Bligh wrote:
>
> > > But let's move this to another thread if it is going to continue. I
> > > would be happy to discuss scheduler problems.
> >
> >
> > My point was that most things we do add complexity to the codebase,
> > including the things you do yourself ... I'm not saying the we're worse
> > off for the changes you've made, by any means - I think they've been
> > mostly beneficial.
>
> Heh - I like the "mostly" ;)
>
> > I'm just pointing out that we ALL do it, so let us
> > not be too quick to judge when others propose adding something that does ;-)
> >
>
> What I'm getting worried about is the marked increase in the
> rate of features and complexity going in.
>
> I am almost certainly never going to use memory hotplug or
> demand paging of hugepages. I am pretty likely going to have
> to wade through this code at some point in the future if it
> is merged.
>

Plenty of features in the kernel I don't use either :).

> It is also going to slow down my kernel by maybe 1% when
> doing kbuilds, but hey let's not worry about that until we've
> merged 10 more such slowdowns (ok that wasn't aimed at you or
> Mel, but my perception of the status quo).
>

Ok, my patches show performance gains and losses on different parts of
Aim9. page_test is slightly down but fork_test was considerably up. Both
would have an effect on kbuild, so more figures are needed on more
machines. That will only be found by testing on a variety of machines.

> >
> > > You can't what? What doesn't work? If you have no hard limits set,
> > > then the frag patches can't guarantee anything either.
> > >
> > > You can't have it both ways. Either you have limits for things or
> > > you don't need any guarantees. Zones handle the former case nicely,
> > > and we currently do the latter case just fine (along with the frag
> > > patches).
> >
> >
> > I'll go look through Mel's current patchset again. I was under the
> > impression it didn't suffer from this problem, at least not as much
> > as zones did.
> >
>
> Over time, I don't think it can offer any stronger a guarantee
> than what we currently have. I'm not even sure that it would be
> any better at all for problematic workloads as time -> infinity.
>

Not as they currently stand, no. As I've said elsewhere, to really
guarantee things, kswapd would need to know how to clear out UserRclm
pages from the other reserve types.

> > Nothing is guaranteed. You can shag the whole machine and/or VM in
> > any number of ways ... if we can significantly improve the probability of
> > existing higher order allocs working, and new functionality has
> > an excellent probability of success, that's as good as you're going to get.
> > Have a free "perfect is the enemy of good" Linus quote, on me ;-)
> >
>
> I think it falls down if these higher order allocations actually
> get *used* for anything. You'll simply be going through the process
> of replacing your contiguous, easy-to-reclaim memory with pinned
> kernel memory.
>

And a misconfigured zone-based approach just falls apart. Going to finish
that summary mail to avoid repetition.

> However, for the purpose of memory hot unplug, a new zone *will*
> guarantee memory can be reclaimed and unplugged.
>

>

--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab

2005-11-02 15:11:31

by Mel Gorman

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On (02/11/05 12:06), Nick Piggin didst pronounce:
> Joel Schopp wrote:
>
> >The patches do ad a reasonable amount of complexity to the page
> >allocator. In my opinion that is the only downside of these patches,
> >even though it is a big one. What we need to decide as a community is
> >if there is a less complex way to do this, and if there isn't a less
> >complex way then is the benefit worth the increased complexity.
> >
> >As to the non-zero performance cost, I think hard numbers should carry
> >more weight than they have been given in this area. Mel has posted hard
> >numbers that say the patches are a wash with respect to performance. I
> >don't see any evidence to contradict those results.
> >
>
> The numbers I have seen show that performance is decreased. People
> like Ken Chen spend months trying to find a 0.05% improvement in
> performance. Not long ago I just spent days getting our cached
> kbuild performance back to where 2.4 is on my build system.
>

One contention point is the overhead this introduces. Let's say we do
discover that kbuild is slower with this patch (still unknown); then we
have to get rid of mbuddy, disable it or replace it with an
as-yet-to-be-written zone-based approach.

I wrote a quick patch that disables anti-defrag via a config option and ran
aim9 on the test machine I have been using all along. I deliberately changed
as little of the anti-defrag code as possible, but maybe we could make this
patch even smaller, or go the other way and conditionally take out as much
anti-defrag as possible.

Here are the Aim9 comparisons between -clean and
-mbuddy-v19-antidefrag-disabled-with-config-option (just the one run)

These are both based on 2.6.14-rc5-mm1

vanilla-mm mbuddy-disabled-via-config
1 creat-clo 16006.00 15844.72 -161.28 -1.01% File Creations and Closes/second
2 page_test 117515.83 119696.77 2180.94 1.86% System Allocations & Pages/second
3 brk_test 440289.81 439870.04 -419.77 -0.10% System Memory Allocations/second
4 jmp_test 4179466.67 4179150.00 -316.67 -0.01% Non-local gotos/second
5 signal_test 80803.20 82055.98 1252.78 1.55% Signal Traps/second
6 exec_test 61.75 61.53 -0.22 -0.36% Program Loads/second
7 fork_test 1327.01 1344.55 17.54 1.32% Task Creations/second
8 link_test 5531.53 5548.33 16.80 0.30% Link/Unlink Pairs/second

On this kernel, I forgot to disable the collection of buddy allocator
statistics. Collection introduces more overhead in both CPU and memory.
Here are the figures when statistic collection is also disabled via the
config option.

vanilla-mm mbuddy-disabled-via-config-nostats
1 creat-clo 16006.00 15906.06 -99.94 -0.62% File Creations and Closes/second
2 page_test 117515.83 120736.54 3220.71 2.74% System Allocations & Pages/second
3 brk_test 440289.81 430311.61 -9978.20 -2.27% System Memory Allocations/second
4 jmp_test 4179466.67 4181683.33 2216.66 0.05% Non-local gotos/second
5 signal_test 80803.20 87387.54 6584.34 8.15% Signal Traps/second
6 exec_test 61.75 62.14 0.39 0.63% Program Loads/second
7 fork_test 1327.01 1345.77 18.76 1.41% Task Creations/second
8 link_test 5531.53 5556.72 25.19 0.46% Link/Unlink Pairs/second

So, now we have performance gains in a number of areas. There is a nice big
jump in page_test, and the fork_test improvement probably won't hurt kbuild
either, with exec_test giving a bit of a nudge. signal_test has a big hike
for some reason; not sure who will benefit there, but hey, it can't be bad.
I am annoyed with brk_test, especially as it is very similar to page_test
in the aim9 source code, but there is no point hiding the result either.
These figures do not tell us how kbuild really performs, of course. For
that, kbuild needs to be run on both kernels and compared. This applies to
any workload.

This anti-defrag makes the code more complex and harder to read, no
argument there. However, on at least one test machine, there is a very
small difference when anti-defrag is enabled in comparison to a vanilla
kernel. With the patches applied and anti-defrag disabled via the kernel
option, we see a number of performance gains, on one machine at least,
which is a good thing. Wider testing would show whether these good figures
are specific to my testbed or not.

If other testbeds show nothing bad, anti-defrag with this additional
patch could give us the best of both worlds. If you have a hotplug machine
or you care about high-order allocations, enable this option. Otherwise,
choose N and avoid the anti-defrag overhead.

diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-mbuddy-v19-noconfig/include/linux/gfp.h linux-2.6.14-rc5-mm1-mbuddy-v19-withconfig/include/linux/gfp.h
--- linux-2.6.14-rc5-mm1-mbuddy-v19-noconfig/include/linux/gfp.h 2005-11-02 12:44:06.000000000 +0000
+++ linux-2.6.14-rc5-mm1-mbuddy-v19-withconfig/include/linux/gfp.h 2005-11-02 12:49:24.000000000 +0000
@@ -50,6 +50,7 @@ struct vm_area_struct;
#define __GFP_HARDWALL 0x40000u /* Enforce hardwall cpuset memory allocs */
#define __GFP_VALID 0x80000000u /* valid GFP flags */

+#ifdef CONFIG_PAGEALLOC_ANTIDEFRAG
/*
* Allocation type modifiers, these are required to be adjacent
* __GFP_EASYRCLM: Easily reclaimed pages like userspace or buffer pages
@@ -61,6 +62,11 @@ struct vm_area_struct;
#define __GFP_EASYRCLM 0x80000u /* User and other easily reclaimed pages */
#define __GFP_KERNRCLM 0x100000u /* Kernel page that is reclaimable */
#define __GFP_RCLM_BITS (__GFP_EASYRCLM|__GFP_KERNRCLM)
+#else
+#define __GFP_EASYRCLM 0
+#define __GFP_KERNRCLM 0
+#define __GFP_RCLM_BITS 0
+#endif /* CONFIG_PAGEALLOC_ANTIDEFRAG */

#define __GFP_BITS_SHIFT 21 /* Room for 21 __GFP_FOO bits */
#define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-mbuddy-v19-noconfig/include/linux/mmzone.h linux-2.6.14-rc5-mm1-mbuddy-v19-withconfig/include/linux/mmzone.h
--- linux-2.6.14-rc5-mm1-mbuddy-v19-noconfig/include/linux/mmzone.h 2005-11-02 12:44:07.000000000 +0000
+++ linux-2.6.14-rc5-mm1-mbuddy-v19-withconfig/include/linux/mmzone.h 2005-11-02 13:00:56.000000000 +0000
@@ -23,6 +23,7 @@
#endif
#define PAGES_PER_MAXORDER (1 << (MAX_ORDER-1))

+#ifdef CONFIG_PAGEALLOC_ANTIDEFRAG
/*
* The two bit field __GFP_RECLAIMBITS enumerates the following types of
* page reclaimability.
@@ -33,6 +34,14 @@
#define RCLM_FALLBACK 3
#define RCLM_TYPES 4
#define BITS_PER_RCLM_TYPE 2
+#else
+#define RCLM_NORCLM 0
+#define RCLM_EASY 0
+#define RCLM_KERN 0
+#define RCLM_FALLBACK 0
+#define RCLM_TYPES 1
+#define BITS_PER_RCLM_TYPE 0
+#endif

#define for_each_rclmtype_order(type, order) \
for (order = 0; order < MAX_ORDER; order++) \
@@ -60,6 +69,7 @@ struct zone_padding {
#define ZONE_PADDING(name)
#endif

+#ifdef CONFIG_PAGEALLOC_ANTIDEFRAG
/*
* Indices into pcpu_list
* PCPU_KERNEL: For RCLM_NORCLM and RCLM_KERN allocations
@@ -68,6 +78,11 @@ struct zone_padding {
#define PCPU_KERNEL 0
#define PCPU_EASY 1
#define PCPU_TYPES 2
+#else
+#define PCPU_KERNEL 0
+#define PCPU_EASY 0
+#define PCPU_TYPES 1
+#endif

struct per_cpu_pages {
int count[PCPU_TYPES]; /* Number of pages on each list */
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-mbuddy-v19-noconfig/init/Kconfig linux-2.6.14-rc5-mm1-mbuddy-v19-withconfig/init/Kconfig
--- linux-2.6.14-rc5-mm1-mbuddy-v19-noconfig/init/Kconfig 2005-11-02 12:42:20.000000000 +0000
+++ linux-2.6.14-rc5-mm1-mbuddy-v19-withconfig/init/Kconfig 2005-11-02 12:59:49.000000000 +0000
@@ -419,6 +419,17 @@ config CC_ALIGN_JUMPS
no dummy operations need be executed.
Zero means use compiler's default.

+config PAGEALLOC_ANTIDEFRAG
+ bool "Try and avoid fragmentation in the page allocator"
+ def_bool y
+ help
+ The standard allocator will fragment memory over time which means that
+ high order allocations will fail even if kswapd is running. If this
+ option is set, the allocator will try and group page types into
+ three groups KernNoRclm, KernRclm and EasyRclm. The gain is a best
+ effort attempt at lowering fragmentation. The loss is more complexity
+
+
endmenu # General setup

config TINY_SHMEM
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-mbuddy-v19-noconfig/mm/page_alloc.c linux-2.6.14-rc5-mm1-mbuddy-v19-withconfig/mm/page_alloc.c
--- linux-2.6.14-rc5-mm1-mbuddy-v19-noconfig/mm/page_alloc.c 2005-11-02 13:05:07.000000000 +0000
+++ linux-2.6.14-rc5-mm1-mbuddy-v19-withconfig/mm/page_alloc.c 2005-11-02 14:09:37.000000000 +0000
@@ -57,11 +57,17 @@ long nr_swap_pages;
* fallback_allocs contains the fallback types for low memory conditions
* where the preferred alloction type if not available.
*/
+#ifdef CONFIG_PAGEALLOC_ANTIDEFRAG
int fallback_allocs[RCLM_TYPES-1][RCLM_TYPES+1] = {
{RCLM_NORCLM, RCLM_FALLBACK, RCLM_KERN, RCLM_EASY, RCLM_TYPES},
{RCLM_EASY, RCLM_FALLBACK, RCLM_NORCLM, RCLM_KERN, RCLM_TYPES},
{RCLM_KERN, RCLM_FALLBACK, RCLM_NORCLM, RCLM_EASY, RCLM_TYPES}
};
+#else
+int fallback_allocs[RCLM_TYPES][RCLM_TYPES+1] = {
+ {RCLM_NORCLM, RCLM_TYPES}
+};
+#endif /* CONFIG_PAGEALLOC_ANTIDEFRAG */

/* Returns 1 if the needed percentage of the zone is reserved for fallbacks */
static inline int min_fallback_reserved(struct zone *zone)
@@ -98,6 +104,7 @@ EXPORT_SYMBOL(totalram_pages);
#error __GFP_KERNRCLM not mapping to RCLM_KERN
#endif

+#ifdef CONFIG_PAGEALLOC_ANTIDEFRAG
/*
* This function maps gfpflags to their RCLM_TYPE. It makes assumptions
* on the location of the GFP flags.
@@ -115,6 +122,12 @@ static inline int gfpflags_to_rclmtype(g

return rclmbits >> RCLM_SHIFT;
}
+#else
+static inline int gfpflags_to_rclmtype(gfp_t gfp_flags)
+{
+ return RCLM_NORCLM;
+}
+#endif /* CONFIG_PAGEALLOC_ANTIDEFRAG */

/*
* copy_bits - Copy bits between bitmaps
@@ -134,6 +147,9 @@ static inline void copy_bits(unsigned lo
int sindex_src,
int nr)
{
+ if (nr == 0)
+ return;
+
/*
* Written like this to take advantage of arch-specific
* set_bit() and clear_bit() functions
@@ -188,8 +204,12 @@ static char *zone_names[MAX_NR_ZONES] =
int min_free_kbytes = 1024;

#ifdef CONFIG_ALLOCSTATS
+#ifdef CONFIG_PAGEALLOC_ANTIDEFRAG
static char *type_names[RCLM_TYPES] = { "KernNoRclm", "EasyRclm",
"KernRclm", "Fallback"};
+#else
+static char *type_names[RCLM_TYPES] = { "KernNoRclm" };
+#endif /* CONFIG_PAGEALLOC_ANTIDEFRAG */
#endif /* CONFIG_ALLOCSTATS */

unsigned long __initdata nr_kernel_pages;
@@ -2228,8 +2248,10 @@ static void __init setup_usemap(struct p
struct zone *zone, unsigned long zonesize)
{
unsigned long usemapsize = usemap_size(zonesize);
- zone->free_area_usemap = alloc_bootmem_node(pgdat, usemapsize);
- memset(zone->free_area_usemap, RCLM_NORCLM, usemapsize);
+ if (usemapsize != 0) {
+ zone->free_area_usemap = alloc_bootmem_node(pgdat, usemapsize);
+ memset(zone->free_area_usemap, RCLM_NORCLM, usemapsize);
+ }
}
#else
static void inline setup_usemap(struct pglist_data *pgdat,