2005-11-01 20:59:15

by Joel Schopp

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19


>> The patches have gone through a large number of revisions, have been
>> heavily tested and reviewed by a few people. The memory footprint of this
>> approach is smaller than introducing new zones. If the cache footprint,
>> increased branches and instructions were a problem, I would expect
>> them to
>> show up in the aim9 benchmark or the benchmark that ran ghostscript
>> multiple times on a large file.
>>
>
> I appreciate that a lot of work has gone into them. You must appreciate
> that they add a reasonable amount of complexity and a non-zero performance
> cost to the page allocator.

The patches do add a reasonable amount of complexity to the page allocator. In
my opinion that is the only downside of these patches, even though it is a big
one. What we need to decide as a community is if there is a less complex way to
do this, and if there isn't a less complex way then is the benefit worth the
increased complexity.

As to the non-zero performance cost, I think hard numbers should carry more
weight than they have been given in this area. Mel has posted hard numbers that
say the patches are a wash with respect to performance. I don't see any
evidence to contradict those results.

>> They will need high order allocations if we want to provide HugeTLB pages
>> to userspace on-demand rather than reserving at boot-time. This is a
>> future problem, but it's one that is not worth tackling until the
>> fragmentation problem is fixed first.
>>
>
> Sure. In what form, we haven't agreed. I vote zones! :)

I'd like to hear more details of how zones would be less complex while still
solving the problem. I just don't get it.


2005-11-02 01:04:42

by Nick Piggin

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

Joel Schopp wrote:

> The patches do add a reasonable amount of complexity to the page
> allocator. In my opinion that is the only downside of these patches,
> even though it is a big one. What we need to decide as a community is
> if there is a less complex way to do this, and if there isn't a less
> complex way then is the benefit worth the increased complexity.
>
> As to the non-zero performance cost, I think hard numbers should carry
> more weight than they have been given in this area. Mel has posted hard
> numbers that say the patches are a wash with respect to performance. I
> don't see any evidence to contradict those results.
>

The numbers I have seen show that performance is decreased. People
like Ken Chen spend months trying to find a 0.05% improvement in
performance. Not long ago I just spent days getting our cached
kbuild performance back to where 2.4 is on my build system.

I can simply see they will cost more icache, more dcache, more branches,
etc. in what is the hottest part of the kernel in some workloads (kernel
compiles, for one).

I'm sorry if I sound like a wet blanket. I just don't look at a patch
and think "wow all those 3 guys with Linux on IBM mainframes and using
lpars are going to be so much happier now, this is something we need".

>>> They will need high order allocations if we want to provide HugeTLB pages
>>> to userspace on-demand rather than reserving at boot-time. This is a
>>> future problem, but it's one that is not worth tackling until the
>>> fragmentation problem is fixed first.
>>>
>>
>> Sure. In what form, we haven't agreed. I vote zones! :)
>
>
> I'd like to hear more details of how zones would be less complex while
> still solving the problem. I just don't get it.
>

You have an extra zone. You size that zone at boot according to the
amount of memory you need to be able to free. Only easy-reclaim stuff
goes in that zone.

It is less complex because zones are a complexity we already have to
live with. 99% of the infrastructure is already there to do this.

If you want to hot unplug memory or guarantee hugepage allocation,
this is the way to do it. Nobody has told me why this *doesn't* work.

--
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com

2005-11-02 01:42:34

by Martin Bligh

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

> The numbers I have seen show that performance is decreased. People
> like Ken Chen spend months trying to find a 0.05% improvement in
> performance. Not long ago I just spent days getting our cached
> kbuild performance back to where 2.4 is on my build system.

Ironically, we're currently trying to chase down a 'database benchmark'
regression that seems to have been caused by the last round of "let's
rewrite the scheduler again" (more details later). Nick, you've added an
awful lot of complexity to some of these code paths yourself ... seems
ironic that you're the one complaining about it ;-)

>>> Sure. In what form, we haven't agreed. I vote zones! :)
>>
>>
>> I'd like to hear more details of how zones would be less complex while
>> still solving the problem. I just don't get it.
>>
>
> You have an extra zone. You size that zone at boot according to the
> amount of memory you need to be able to free. Only easy-reclaim stuff
> goes in that zone.
>
> It is less complex because zones are a complexity we already have to
> live with. 99% of the infrastructure is already there to do this.
>
> If you want to hot unplug memory or guarantee hugepage allocation,
> this is the way to do it. Nobody has told me why this *doesn't* work.

Because the zone is statically sized, and you're back to the same crap
we had with 32bit systems of splitting ZONE_NORMAL and ZONE_HIGHMEM,
effectively. Define how much you need for system ram, and how much
for easily reclaimable memory at boot time. You can't - it doesn't work.

M.

2005-11-02 02:02:16

by Nick Piggin

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

Martin J. Bligh wrote:
>>The numbers I have seen show that performance is decreased. People
>>like Ken Chen spend months trying to find a 0.05% improvement in
>>performance. Not long ago I just spent days getting our cached
>>kbuild performance back to where 2.4 is on my build system.
>
>
> Ironically, we're currently trying to chase down a 'database benchmark'
> regression that seems to have been caused by the last round of "let's
> rewrite the scheduler again" (more details later). Nick, you've added an
> awful lot of complexity to some of these code paths yourself ... seems
> ironic that you're the one complaining about it ;-)
>

Yeah, that's unfortunate, but I think a large portion of the problem
(if they are at all the same) has been narrowed down to some over-eager
wakeup balancing, for which there are a number of proposed patches.

But in this case I was more worried about getting the groundwork done
for handling the multicore systems that everyone will soon
be using rather than several % performance regression on TPC-C (not
to say that I don't care about that at all)... I don't see the irony.

But let's move this to another thread if it is going to continue. I
would be happy to discuss scheduler problems.

>>You have an extra zone. You size that zone at boot according to the
>>amount of memory you need to be able to free. Only easy-reclaim stuff
>>goes in that zone.
>>
>>It is less complex because zones are a complexity we already have to
>>live with. 99% of the infrastructure is already there to do this.
>>
>>If you want to hot unplug memory or guarantee hugepage allocation,
>>this is the way to do it. Nobody has told me why this *doesn't* work.
>
>
> Because the zone is statically sized, and you're back to the same crap
> we had with 32bit systems of splitting ZONE_NORMAL and ZONE_HIGHMEM,
> effectively. Define how much you need for system ram, and how much
> for easily reclaimable memory at boot time. You can't - it doesn't work.
>

You can't what? What doesn't work? If you have no hard limits set,
then the frag patches can't guarantee anything either.

You can't have it both ways. Either you have limits for things or
you don't need any guarantees. Zones handle the former case nicely,
and we currently do the latter case just fine (along with the frag
patches).

--
SUSE Labs, Novell Inc.


2005-11-02 02:24:01

by Martin Bligh

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

>>> The numbers I have seen show that performance is decreased. People
>>> like Ken Chen spend months trying to find a 0.05% improvement in
>>> performance. Not long ago I just spent days getting our cached
>>> kbuild performance back to where 2.4 is on my build system.
>>
>> Ironically, we're currently trying to chase down a 'database benchmark'
>> regression that seems to have been caused by the last round of "let's
>> rewrite the scheduler again" (more details later). Nick, you've added an
>> awful lot of complexity to some of these code paths yourself ... seems
>> ironic that you're the one complaining about it ;-)
>
> Yeah that's unfortunate, but I think a large portion of the problem
> (if they are anything the same) has been narrowed down to some over
> eager wakeup balancing for which there are a number of proposed
> patches.
>
> But in this case I was more worried about getting the groundwork done
> for handling the multicore systems that everyone will soon
> be using rather than several % performance regression on TPC-C (not
> to say that I don't care about that at all)... I don't see the irony.
>
> But let's move this to another thread if it is going to continue. I
> would be happy to discuss scheduler problems.

My point was that most things we do add complexity to the codebase,
including the things you do yourself ... I'm not saying that we're worse
off for the changes you've made, by any means - I think they've been
mostly beneficial. I'm just pointing out that we ALL do it, so let us
not be too quick to judge when others propose adding something that does ;-)

>> Because the zone is statically sized, and you're back to the same crap
>> we had with 32bit systems of splitting ZONE_NORMAL and ZONE_HIGHMEM,
>> effectively. Define how much you need for system ram, and how much
>> for easily reclaimable memory at boot time. You can't - it doesn't work.
>
> You can't what? What doesn't work? If you have no hard limits set,
> then the frag patches can't guarantee anything either.
>
> You can't have it both ways. Either you have limits for things or
> you don't need any guarantees. Zones handle the former case nicely,
> and we currently do the latter case just fine (along with the frag
> patches).

I'll go look through Mel's current patchset again. I was under the
impression it didn't suffer from this problem, at least not as much
as zones did.

Nothing is guaranteed. You can shag the whole machine and/or VM in
any number of ways ... if we can significantly improve the probability
of existing higher order allocs working, and new functionality has
an excellent probability of success, that's as good as you're going to
get. Have a free "perfect is the enemy of good" Linus quote, on me ;-)

M.

2005-11-02 02:51:33

by Nick Piggin

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

Martin J. Bligh wrote:

>>But let's move this to another thread if it is going to continue. I
>>would be happy to discuss scheduler problems.
>
>
> My point was that most things we do add complexity to the codebase,
> including the things you do yourself ... I'm not saying that we're worse
> off for the changes you've made, by any means - I think they've been
> mostly beneficial.

Heh - I like the "mostly" ;)

> I'm just pointing out that we ALL do it, so let us
> not be too quick to judge when others propose adding something that does ;-)
>

What I'm getting worried about is the marked increase in the
rate of features and complexity going in.

I am almost certainly never going to use memory hotplug or
demand paging of hugepages. I am pretty likely going to have
to wade through this code at some point in the future if it
is merged.

It is also going to slow down my kernel by maybe 1% when
doing kbuilds, but hey, let's not worry about that until we've
merged 10 more such slowdowns (OK, that wasn't aimed at you or
Mel, but at my perception of the status quo).

>
>>You can't what? What doesn't work? If you have no hard limits set,
>>then the frag patches can't guarantee anything either.
>>
>>You can't have it both ways. Either you have limits for things or
>>you don't need any guarantees. Zones handle the former case nicely,
>>and we currently do the latter case just fine (along with the frag
>>patches).
>
>
> I'll go look through Mel's current patchset again. I was under the
> impression it didn't suffer from this problem, at least not as much
> as zones did.
>

Over time, I don't think it can offer any stronger a guarantee
than what we currently have. I'm not even sure that it would be
any better at all for problematic workloads as time -> infinity.

> Nothing is guaranteed. You can shag the whole machine and/or VM in
> any number of ways ... if we can significantly improve the probability
> of existing higher order allocs working, and new functionality has
> an excellent probability of success, that's as good as you're going to
> get. Have a free "perfect is the enemy of good" Linus quote, on me ;-)
>

I think it falls down if these higher order allocations actually
get *used* for anything. You'll simply be going through the process
of replacing your contiguous, easy-to-reclaim memory with pinned
kernel memory.

However, for the purpose of memory hot unplug, a new zone *will*
guarantee memory can be reclaimed and unplugged.

--
SUSE Labs, Novell Inc.


2005-11-02 04:39:12

by Martin Bligh

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

>> I'm just pointing out that we ALL do it, so let us
>> not be too quick to judge when others propose adding something that does ;-)
>
> What I'm getting worried about is the marked increase in the
> rate of features and complexity going in.
>
> I am almost certainly never going to use memory hotplug or
> demand paging of hugepages. I am pretty likely going to have
> to wade through this code at some point in the future if it
> is merged.

Mmm. Though whether any one of us will personally use each feature
is perhaps not the most ideal criteria to judge things by ;-)

> It is also going to slow down my kernel by maybe 1% when
> doing kbuilds, but hey let's not worry about that until we've
> merged 10 more such slowdowns (ok that wasn't aimed at you or
> Mel, but my perception of the status quo).

If it's really 1%, yes, that's a huge problem. And yes, I agree with
you that there's a problem with the rate of change. Part of that is
a lack of performance measurement and testing, and the quality sometimes
scares me (though the last month has actually been significantly better,
the tree mostly builds and boots now!). I've tried to do something on
the testing front, but I'm acutely aware it's not sufficient by any means.

>>> You can't what? What doesn't work? If you have no hard limits set,
>>> then the frag patches can't guarantee anything either.
>>>
>>> You can't have it both ways. Either you have limits for things or
>>> you don't need any guarantees. Zones handle the former case nicely,
>>> and we currently do the latter case just fine (along with the frag
>>> patches).
>>
>> I'll go look through Mel's current patchset again. I was under the
>> impression it didn't suffer from this problem, at least not as much
>> as zones did.
>
> Over time, I don't think it can offer any stronger a guarantee
> than what we currently have. I'm not even sure that it would be
> any better at all for problematic workloads as time -> infinity.

Sounds worth discussing. We need *some* way of dealing with fragmentation
issues. To me that means both an avoidance strategy, and an ability
to actively defragment if we need it. Linux is evolved software, it
may not be perfect at first - that's the way we work, and it's served
us well up till now. To me, that's the biggest advantage we have over
the proprietary model.

>> Nothing is guaranteed. You can shag the whole machine and/or VM in
>> any number of ways ... if we can significantly improve the probability
>> of existing higher order allocs working, and new functionality has
>> an excellent probability of success, that's as good as you're going to
>> get. Have a free "perfect is the enemy of good" Linus quote, on me ;-)
>
> I think it falls down if these higher order allocations actually
> get *used* for anything. You'll simply be going through the process
> of replacing your contiguous, easy-to-reclaim memory with pinned
> kernel memory.

It seems inevitable that we need both physically contiguous memory
sections, and virtually contiguous in kernel space (which equates to
the same thing, unless we totally break the 1-1 P-V mapping and
lose the large page mapping for kernel, which I'd hate to do.)

> However, for the purpose of memory hot unplug, a new zone *will*
> guarantee memory can be reclaimed and unplugged.

It's not just about memory hotplug. There are, as we have discussed
already, many uses for physically contiguous (and virtually contiguous)
memory segments. Focusing purely on any one of them will not solve the
issue at hand ...

M.

2005-11-02 05:08:01

by Nick Piggin

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

Martin J. Bligh wrote:

>>I am almost certainly never going to use memory hotplug or
>>demand paging of hugepages. I am pretty likely going to have
>>to wade through this code at some point in the future if it
>>is merged.
>
>
> Mmm. Though whether any one of us will personally use each feature
> is perhaps not the most ideal criteria to judge things by ;-)
>

Of course, but I'd say very few people will. Then again maybe
I'm just a luddite who doesn't know what's good for him ;)

>
>>It is also going to slow down my kernel by maybe 1% when
>>doing kbuilds, but hey let's not worry about that until we've
>>merged 10 more such slowdowns (ok that wasn't aimed at you or
>>Mel, but my perception of the status quo).
>
>
> If it's really 1%, yes, that's a huge problem. And yes, I agree with
> you that there's a problem with the rate of change. Part of that is
> a lack of performance measurement and testing, and the quality sometimes
> scares me (though the last month has actually been significantly better,
> the tree mostly builds and boots now!). I've tried to do something on
> the testing front, but I'm acutely aware it's not sufficient by any means.
>

To be honest I haven't tested so this is an unfounded guess. However
it is based on what I have seen of Mel's numbers, and the fact that
the kernel spends nearly 1/3rd of its time in the page allocator when
running a kbuild.

I may get around to getting some real numbers when my current patch
queues shrink.

>>Over time, I don't think it can offer any stronger a guarantee
>>than what we currently have. I'm not even sure that it would be
>>any better at all for problematic workloads as time -> infinity.
>
>
> Sounds worth discussing. We need *some* way of dealing with fragmentation
> issues. To me that means both an avoidance strategy, and an ability
> to actively defragment if we need it. Linux is evolved software, it
> may not be perfect at first - that's the way we work, and it's served
> us well up till now. To me, that's the biggest advantage we have over
> the proprietary model.
>

True and I'm also annoyed that we have these issues at all. I just
don't see that the avoidance strategy helps that much because as I
said, you don't need to keep these lovely contiguous regions just for
show (or other easy-to-reclaim user pages).

The absolute priority is to move away from higher order allocs or
use fallbacks IMO. And that doesn't necessarily mean order 1 or even
2 allocations, because we don't seem to have a problem with those.

Because I want Linux to be as robust as you do.

>>I think it falls down if these higher order allocations actually
>>get *used* for anything. You'll simply be going through the process
>>of replacing your contiguous, easy-to-reclaim memory with pinned
>>kernel memory.
>
>
> It seems inevitable that we need both physically contiguous memory
> sections, and virtually contiguous in kernel space (which equates to
> the same thing, unless we totally break the 1-1 P-V mapping and
> lose the large page mapping for kernel, which I'd hate to do.)
>

I think this isn't as bad an idea as you think. If it means those
guys doing memory hotplug take a few % performance hit and nobody else
has to bear the costs then that sounds great.

>
>>However, for the purpose of memory hot unplug, a new zone *will*
>>guarantee memory can be reclaimed and unplugged.
>
>
> It's not just about memory hotplug. There are, as we have discussed
> already, many uses for physically contiguous (and virtually contiguous)
> memory segments. Focusing purely on any one of them will not solve the
> issue at hand ...
>

True, but we don't seem to have huge problems with other things. The
main ones that have come up on lkml are e1000 which is getting fixed,
and maybe XFS which I think there are also moves to improve.

--
SUSE Labs, Novell Inc.


2005-11-02 05:15:12

by Martin Bligh

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

>> It's not just about memory hotplug. There are, as we have discussed
>> already, many uses for physically contiguous (and virtually contiguous)
>> memory segments. Focusing purely on any one of them will not solve the
>> issue at hand ...
>
> True, but we don't seem to have huge problems with other things. The
> main ones that have come up on lkml are e1000 which is getting fixed,
> and maybe XFS which I think there are also moves to improve.

It should be fairly easy to trawl through the list of all allocations
and pull out all the higher order ones from the whole source tree. I
suspect there's a lot ... maybe I'll play with it later on.

M.

2005-11-02 06:24:04

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

Martin J. Bligh wrote:
>>True, but we don't seem to have huge problems with other things. The
>>main ones that have come up on lkml are e1000 which is getting fixed,
>>and maybe XFS which I think there are also moves to improve.
>
>
> It should be fairly easy to trawl through the list of all allocations
> and pull out all the higher order ones from the whole source tree. I
> suspect there's a lot ... maybe I'll play with it later on.
>

Please check kmalloc(32k, 64k).

For example, the loopback device's default MTU of 16436 means order=3
allocations, and maybe there are other high-MTU devices.

I suspect the skb_makewritable()/skb_copy()/skb_linearize() functions can
suffer from fragmentation when the MTU is big. They allocate a large skb by
gathering fragmented skbs. When these skb_* funcs fail, the packet is
silently discarded by netfilter. If fragmentation is heavy, packets
(especially TCP) using a large MTU never reach their destination, even over
loopback.

Honestly, I'm not familiar with the network code; could anyone comment on
this?

-- Kame


2005-11-02 07:21:03

by Yasunori Goto

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

Hello, Nick-san.

I posted patches that add ZONE_REMOVABLE to LHMS.
I don't claim they are better than Mel-san's patch.
I hope this will be the basis of a good discussion.


There were 2 types.

One just added ZONE_REMOVABLE.
This patch came from an early implementation by the VA-Linux memory
hotplug team.
http://sourceforge.net/mailarchive/forum.php?thread_id=5969508&forum_id=223

ZONE_HIGHMEM was used for this purpose in the early implementation.
We thought ZONE_HIGHMEM was easier to remove than the other zones,
but some architectures don't use it. That is why ZONE_REMOVABLE
was born.
(And I remember that ZONE_DMA32 was defined after this patch, so the
number of zones became 5 and one more bit was necessary in
page->flags. (I don't know the recent progress of ZONE_DMA32.))


The other was a bit similar to Mel-san's one.
One motivation for this patch was to create an orthogonal relationship
between Removable and DMA/Normal/Highmem, which I thought was desirable
because ppc64 can treat all of its memory as one (DMA) zone.
I thought that a new zone would spoil that good feature.

http://sourceforge.net/mailarchive/forum.php?thread_id=5345977&forum_id=223
http://sourceforge.net/mailarchive/forum.php?thread_id=5345978&forum_id=223
http://sourceforge.net/mailarchive/forum.php?thread_id=5345979&forum_id=223
http://sourceforge.net/mailarchive/forum.php?thread_id=5345980&forum_id=223


Thanks.

P.S. to Mel-san.
I'm sorry for writing this so late. This thread was a mail bomb for me
to read with my poor English skill. :-(


> Martin J. Bligh wrote:
>
> >>But let's move this to another thread if it is going to continue. I
> >>would be happy to discuss scheduler problems.
> >
> >
> > My point was that most things we do add complexity to the codebase,
> > including the things you do yourself ... I'm not saying that we're worse
> > off for the changes you've made, by any means - I think they've been
> > mostly beneficial.
>
> Heh - I like the "mostly" ;)
>
> > I'm just pointing out that we ALL do it, so let us
> > not be too quick to judge when others propose adding something that does ;-)
> >
>
> What I'm getting worried about is the marked increase in the
> rate of features and complexity going in.
>
> I am almost certainly never going to use memory hotplug or
> demand paging of hugepages. I am pretty likely going to have
> to wade through this code at some point in the future if it
> is merged.
>
> It is also going to slow down my kernel by maybe 1% when
> doing kbuilds, but hey let's not worry about that until we've
> merged 10 more such slowdowns (ok that wasn't aimed at you or
> Mel, but my perception of the status quo).
>
> >
> >>You can't what? What doesn't work? If you have no hard limits set,
> >>then the frag patches can't guarantee anything either.
> >>
> >>You can't have it both ways. Either you have limits for things or
> >>you don't need any guarantees. Zones handle the former case nicely,
> >>and we currently do the latter case just fine (along with the frag
> >>patches).
> >
> >
> > I'll go look through Mel's current patchset again. I was under the
> > impression it didn't suffer from this problem, at least not as much
> > as zones did.
> >
>
> Over time, I don't think it can offer any stronger a guarantee
> than what we currently have. I'm not even sure that it would be
> any better at all for problematic workloads as time -> infinity.
>
> > Nothing is guaranteed. You can shag the whole machine and/or VM in
> > any number of ways ... if we can significantly improve the probability
> > of existing higher order allocs working, and new functionality has
> > an excellent probability of success, that's as good as you're going to
> > get. Have a free "perfect is the enemy of good" Linus quote, on me ;-)
> >
>
> I think it falls down if these higher order allocations actually
> get *used* for anything. You'll simply be going through the process
> of replacing your contiguous, easy-to-reclaim memory with pinned
> kernel memory.
>
> However, for the purpose of memory hot unplug, a new zone *will*
> guarantee memory can be reclaimed and unplugged.
>
> --
> SUSE Labs, Novell Inc.
>

--
Yasunori Goto

2005-11-02 10:13:31

by Nick Piggin

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

KAMEZAWA Hiroyuki wrote:

> Please check kmalloc(32k, 64k).
>
> For example, the loopback device's default MTU of 16436 means order=3
> allocations, and maybe there are other high-MTU devices.
>
> I suspect the skb_makewritable()/skb_copy()/skb_linearize() functions can
> suffer from fragmentation when the MTU is big. They allocate a large skb
> by gathering fragmented skbs. When these skb_* funcs fail, the packet is
> silently discarded by netfilter. If fragmentation is heavy, packets
> (especially TCP) using a large MTU never reach their destination, even
> over loopback.
>
> Honestly, I'm not familiar with the network code; could anyone comment on
> this?
>

I'd be interested to know, actually. I was hoping loopback should always
use order-0 allocations, because the loopback driver is SG, FRAGLIST,
and HIGHDMA capable. However I'm likewise not familiar with network code.

--
SUSE Labs, Novell Inc.


2005-11-02 11:37:30

by Mel Gorman

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Wed, 2 Nov 2005, Nick Piggin wrote:

> Joel Schopp wrote:
>
> > The patches do add a reasonable amount of complexity to the page allocator.
> > In my opinion that is the only downside of these patches, even though it is
> > a big one. What we need to decide as a community is if there is a less
> > complex way to do this, and if there isn't a less complex way then is the
> > benefit worth the increased complexity.
> >
> > As to the non-zero performance cost, I think hard numbers should carry more
> > weight than they have been given in this area. Mel has posted hard numbers
> > that say the patches are a wash with respect to performance. I don't see
> > any evidence to contradict those results.
> >
>
> The numbers I have seen show that performance is decreased. People
> like Ken Chen spend months trying to find a 0.05% improvement in
> performance.

Fine, that is understandable. The AIM9 benchmarks also show performance
improvements in other areas like fork_test, about a 5% difference, which is
also important for kernel builds. Wider testing would be needed to see
whether the improvements are specific to my tests. Every set of patches has
had a performance regression test run with AIM9, so I certainly have not
been ignoring performance.

> Not long ago I just spent days getting our cached
> kbuild performance back to where 2.4 is on my build system.
>

Then it would be interesting to find out how 2.6.14-rc5-mm1 compares
against 2.6.14-rc5-mm1-mbuddy-v19?

> I can simply see they will cost more icache, more dcache, more branches,
> etc. in what is the hottest part of the kernel in some workloads (kernel
> compiles, for one).
>
> I'm sorry if I sound like a wet blanket. I just don't look at a patch
> and think "wow all those 3 guys with Linux on IBM mainframes and using
> lpars are going to be so much happier now, this is something we need".
>

I developed this as the beginning of a long term solution for on-demand
HugeTLB pages as part of a PhD. This could potentially help desktop
workloads in the future. Hotplug machines are a benefit that was picked up
by the work on the way. We can help hotplug to some extent today and
desktop users in the future (and given time, all of the hotplug problems
as well). But if we tell desktop users "Yeah, your applications will run a
bit better with HugeTLB pages as long as you configure the size of the
zone correctly" at any stage, we'll be told where to go.

> > > > They will need high order allocations if we want to provide HugeTLB pages
> > > > to userspace on-demand rather than reserving at boot-time. This is a
> > > > future problem, but it's one that is not worth tackling until the
> > > > fragmentation problem is fixed first.
> > > >
> > >
> > > Sure. In what form, we haven't agreed. I vote zones! :)
> >
> >
> > I'd like to hear more details of how zones would be less complex while still
> > solving the problem. I just don't get it.
> >
>
> You have an extra zone. You size that zone at boot according to the
> amount of memory you need to be able to free. Only easy-reclaim stuff
> goes in that zone.
>

Helps hotplug, no one else. Rules out HugeTLB on demand for userspace
unless we are willing to tell desktop users to configure this tunable.

> It is less complex because zones are a complexity we already have to
> live with. 99% of the infrastructure is already there to do this.
>

The simplicity of zones is still in dispute. I am putting together a mail
of pros, cons, situations and future work for both approaches. I hope to
send it out fairly soon.

> If you want to hot unplug memory or guarantee hugepage allocation,
> this is the way to do it. Nobody has told me why this *doesn't* work.
>

Hot unplug of the configured zone of memory, and guaranteed hugepage
allocation only for userspace. There is no help for kernel allocations that
need a huge page under any circumstance. Our approach allows the kernel to
get the large page, at the cost of fragmentation degrading slowly over
time. To stop it fragmenting slowly over time, more work is needed.

--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab

2005-11-02 11:41:52

by Mel Gorman

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Wed, 2 Nov 2005, Nick Piggin wrote:

> Martin J. Bligh wrote:
> > > The numbers I have seen show that performance is decreased. People
> > > like Ken Chen spend months trying to find a 0.05% improvement in
> > > performance. Not long ago I just spent days getting our cached
> > > kbuild performance back to where 2.4 is on my build system.
> >
> >
> > Ironically, we're currently trying to chase down a 'database benchmark'
> > regression that seems to have been cause by the last round of "let's
> > rewrite the scheduler again" (more details later). Nick, you've added an
> > awful lot of complexity to some of these code paths yourself ... seems
> > ironic that you're the one complaining about it ;-)
> >
>
> Yeah that's unfortunate, but I think a large portion of the problem
> (if they are anything the same) has been narrowed down to some over
> eager wakeup balancing for which there are a number of proposed
> patches.
>
> But in this case I was more worried about getting the groundwork done
> for handling the multicore systems that everyone will soon
> be using rather than several % performance regression on TPC-C (not
> to say that I don't care about that at all)... I don't see the irony.
>
> But let's move this to another thread if it is going to continue. I
> would be happy to discuss scheduler problems.
>
> > > You have an extra zone. You size that zone at boot according to the
> > > amount of memory you need to be able to free. Only easy-reclaim stuff
> > > goes in that zone.
> > >
> > > It is less complex because zones are a complexity we already have to
> > > live with. 99% of the infrastructure is already there to do this.
> > >
> > > If you want to hot unplug memory or guarantee hugepage allocation,
> > > this is the way to do it. Nobody has told me why this *doesn't* work.
> >
> >
> > Because the zone is statically sized, and you're back to the same crap
> > we had with 32bit systems of splitting ZONE_NORMAL and ZONE_HIGHMEM,
> > effectively. Define how much you need for system ram, and how much
> > for easily reclaimable memory at boot time. You can't - it doesn't work.
> >
>
> You can't what? What doesn't work? If you have no hard limits set,
> then the frag patches can't guarantee anything either.
>

True, but the difference is:

Anti-defrag: best effort at low cost (according to Aim9), no tunable required
Zones: will work, but requires a tunable and falls apart if tuned wrong

> You can't have it both ways. Either you have limits for things or
> you don't need any guarantees. Zones handle the former case nicely,
> and we currently do the latter case just fine (along with the frag
> patches).
>

Sure, so you compromise and do best effort for as long as possible.
Always try to keep fragmentation low. If the system is configured to
really need low fragmentation, then after a long period of time, a
page-migration mechanism kicks in to move the kernel pages out of EasyRclm
areas and we continue on.

--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab

2005-11-02 11:48:19

by Mel Gorman

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Wed, 2 Nov 2005, Nick Piggin wrote:

> Martin J. Bligh wrote:
>
> > > But let's move this to another thread if it is going to continue. I
> > > would be happy to discuss scheduler problems.
> >
> >
> > My point was that most things we do add complexity to the codebase,
> > including the things you do yourself ... I'm not saying the we're worse
> > off for the changes you've made, by any means - I think they've been
> > mostly beneficial.
>
> Heh - I like the "mostly" ;)
>
> > I'm just pointing out that we ALL do it, so let us
> > not be too quick to judge when others propose adding something that does ;-)
> >
>
> What I'm getting worried about is the marked increase in the
> rate of features and complexity going in.
>
> I am almost certainly never going to use memory hotplug or
> demand paging of hugepages. I am pretty likely going to have
> to wade through this code at some point in the future if it
> is merged.
>

Plenty of features in the kernel I don't use either :).

> It is also going to slow down my kernel by maybe 1% when
> doing kbuilds, but hey let's not worry about that until we've
> merged 10 more such slowdowns (ok that wasn't aimed at you or
> Mel, but my perception of the status quo).
>

Ok, my patches show performance gains and losses on different parts of
Aim9. page_test is slightly down but fork_test was considerably up. Both
would have an effect on kbuild, so more figures are needed on more
machines. That will only be found by testing on a variety of machines.

> >
> > > You can't what? What doesn't work? If you have no hard limits set,
> > > then the frag patches can't guarantee anything either.
> > >
> > > You can't have it both ways. Either you have limits for things or
> > > you don't need any guarantees. Zones handle the former case nicely,
> > > and we currently do the latter case just fine (along with the frag
> > > patches).
> >
> >
> > I'll go look through Mel's current patchset again. I was under the
> > impression it didn't suffer from this problem, at least not as much
> > as zones did.
> >
>
> Over time, I don't think it can offer any stronger a guarantee
> than what we currently have. I'm not even sure that it would be
> any better at all for problematic workloads as time -> infinity.
>

Not as they currently stand, no. As I've said elsewhere, to really
guarantee things, kswapd would need to know how to clear out UserRclm
pages from the other reserve types.

> > Nothing is guaranteed. You can shag the whole machine and/or VM in
> > any number of ways ... if we can significantly improve the probability of
> > existing higher order allocs working, and new functionality has
> > an excellent probability of success, that's as good as you're going to get.
> > Have a free "perfect is the enemy of good" Linus quote, on me ;-)
> >
>
> I think it falls down if these higher order allocations actually
> get *used* for anything. You'll simply be going through the process
> of replacing your contiguous, easy-to-reclaim memory with pinned
> kernel memory.
>

And a misconfigured zone-based approach just falls apart. Going to finish
that summary mail to avoid repetition.

> However, for the purpose of memory hot unplug, a new zone *will*
> guarantee memory can be reclaimed and unplugged.
>

>

--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab

2005-11-02 15:11:31

by Mel Gorman

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On (02/11/05 12:06), Nick Piggin didst pronounce:
> Joel Schopp wrote:
>
> >The patches do ad a reasonable amount of complexity to the page
> >allocator. In my opinion that is the only downside of these patches,
> >even though it is a big one. What we need to decide as a community is
> >if there is a less complex way to do this, and if there isn't a less
> >complex way then is the benefit worth the increased complexity.
> >
> >As to the non-zero performance cost, I think hard numbers should carry
> >more weight than they have been given in this area. Mel has posted hard
> >numbers that say the patches are a wash with respect to performance. I
> >don't see any evidence to contradict those results.
> >
>
> The numbers I have seen show that performance is decreased. People
> like Ken Chen spend months trying to find a 0.05% improvement in
> performance. Not long ago I just spent days getting our cached
> kbuild performance back to where 2.4 is on my build system.
>

One contention point is the overhead this introduces. Let's say we do
discover that kbuild is slower with this patch (still unknown); then we
have to get rid of mbuddy, disable it or replace it with an
as-yet-to-be-written zone-based approach.

I wrote a quick patch that disables anti-defrag via a config option and ran
aim9 on the test machine I have been using all along. I deliberately changed
as little of the anti-defrag code as possible, but maybe we could make this
patch even smaller, or go the other way and conditionally take out as much
anti-defrag as possible.

Here are the Aim9 comparisons between -clean and
-mbuddy-v19-antidefrag-disabled-with-config-option (just the one run)

These are both based on 2.6.14-rc5-mm1

vanilla-mm mbuddy-disabled-via-config
1 creat-clo 16006.00 15844.72 -161.28 -1.01% File Creations and Closes/second
2 page_test 117515.83 119696.77 2180.94 1.86% System Allocations & Pages/second
3 brk_test 440289.81 439870.04 -419.77 -0.10% System Memory Allocations/second
4 jmp_test 4179466.67 4179150.00 -316.67 -0.01% Non-local gotos/second
5 signal_test 80803.20 82055.98 1252.78 1.55% Signal Traps/second
6 exec_test 61.75 61.53 -0.22 -0.36% Program Loads/second
7 fork_test 1327.01 1344.55 17.54 1.32% Task Creations/second
8 link_test 5531.53 5548.33 16.80 0.30% Link/Unlink Pairs/second

On this kernel, I forgot to disable the collection of buddy allocator
statistics. Collection introduces more overhead in both CPU and memory.
Here are the figures when statistic collection is also disabled via the
config option.

vanilla-mm mbuddy-disabled-via-config-nostats
1 creat-clo 16006.00 15906.06 -99.94 -0.62% File Creations and Closes/second
2 page_test 117515.83 120736.54 3220.71 2.74% System Allocations & Pages/second
3 brk_test 440289.81 430311.61 -9978.20 -2.27% System Memory Allocations/second
4 jmp_test 4179466.67 4181683.33 2216.66 0.05% Non-local gotos/second
5 signal_test 80803.20 87387.54 6584.34 8.15% Signal Traps/second
6 exec_test 61.75 62.14 0.39 0.63% Program Loads/second
7 fork_test 1327.01 1345.77 18.76 1.41% Task Creations/second
8 link_test 5531.53 5556.72 25.19 0.46% Link/Unlink Pairs/second

So, now we have performance gains in a number of areas. There is a nice big
jump in page_test, and the fork_test improvement probably won't hurt kbuild
either, with exec_test giving a bit of a nudge. signal_test has a big hike
for some reason; not sure who will benefit there, but hey, it can't be bad.
I am annoyed with brk_test, especially as it is very similar to page_test
in the aim9 source code, but there is no point hiding the result either.
These figures do not tell us how kbuild really performs, of course. For
that, kbuild needs to be run on both kernels and compared. This applies to
any workload.

This anti-defrag makes the code more complex and harder to read, no
argument there. However, on at least one test machine, there is a very
small difference when anti-defrag is enabled in comparison to a vanilla
kernel. With the patches applied and anti-defrag disabled via the kernel
option, we see a number of performance gains, on one machine at least,
which is a good thing. Wider testing would show whether these good figures
are specific to my testbed or not.

If other testbeds show nothing bad, anti-defrag with this additional
patch could give us the best of both worlds. If you have a hotplug machine
or you care about high-order allocations, enable this option. Otherwise,
choose N and avoid the anti-defrag overhead.

diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-mbuddy-v19-noconfig/include/linux/gfp.h linux-2.6.14-rc5-mm1-mbuddy-v19-withconfig/include/linux/gfp.h
--- linux-2.6.14-rc5-mm1-mbuddy-v19-noconfig/include/linux/gfp.h 2005-11-02 12:44:06.000000000 +0000
+++ linux-2.6.14-rc5-mm1-mbuddy-v19-withconfig/include/linux/gfp.h 2005-11-02 12:49:24.000000000 +0000
@@ -50,6 +50,7 @@ struct vm_area_struct;
#define __GFP_HARDWALL 0x40000u /* Enforce hardwall cpuset memory allocs */
#define __GFP_VALID 0x80000000u /* valid GFP flags */

+#ifdef CONFIG_PAGEALLOC_ANTIDEFRAG
/*
* Allocation type modifiers, these are required to be adjacent
* __GFP_EASYRCLM: Easily reclaimed pages like userspace or buffer pages
@@ -61,6 +62,11 @@ struct vm_area_struct;
#define __GFP_EASYRCLM 0x80000u /* User and other easily reclaimed pages */
#define __GFP_KERNRCLM 0x100000u /* Kernel page that is reclaimable */
#define __GFP_RCLM_BITS (__GFP_EASYRCLM|__GFP_KERNRCLM)
+#else
+#define __GFP_EASYRCLM 0
+#define __GFP_KERNRCLM 0
+#define __GFP_RCLM_BITS 0
+#endif /* CONFIG_PAGEALLOC_ANTIDEFRAG */

#define __GFP_BITS_SHIFT 21 /* Room for 21 __GFP_FOO bits */
#define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-mbuddy-v19-noconfig/include/linux/mmzone.h linux-2.6.14-rc5-mm1-mbuddy-v19-withconfig/include/linux/mmzone.h
--- linux-2.6.14-rc5-mm1-mbuddy-v19-noconfig/include/linux/mmzone.h 2005-11-02 12:44:07.000000000 +0000
+++ linux-2.6.14-rc5-mm1-mbuddy-v19-withconfig/include/linux/mmzone.h 2005-11-02 13:00:56.000000000 +0000
@@ -23,6 +23,7 @@
#endif
#define PAGES_PER_MAXORDER (1 << (MAX_ORDER-1))

+#ifdef CONFIG_PAGEALLOC_ANTIDEFRAG
/*
* The two bit field __GFP_RECLAIMBITS enumerates the following types of
* page reclaimability.
@@ -33,6 +34,14 @@
#define RCLM_FALLBACK 3
#define RCLM_TYPES 4
#define BITS_PER_RCLM_TYPE 2
+#else
+#define RCLM_NORCLM 0
+#define RCLM_EASY 0
+#define RCLM_KERN 0
+#define RCLM_FALLBACK 0
+#define RCLM_TYPES 1
+#define BITS_PER_RCLM_TYPE 0
+#endif

#define for_each_rclmtype_order(type, order) \
for (order = 0; order < MAX_ORDER; order++) \
@@ -60,6 +69,7 @@ struct zone_padding {
#define ZONE_PADDING(name)
#endif

+#ifdef CONFIG_PAGEALLOC_ANTIDEFRAG
/*
* Indices into pcpu_list
* PCPU_KERNEL: For RCLM_NORCLM and RCLM_KERN allocations
@@ -68,6 +78,11 @@ struct zone_padding {
#define PCPU_KERNEL 0
#define PCPU_EASY 1
#define PCPU_TYPES 2
+#else
+#define PCPU_KERNEL 0
+#define PCPU_EASY 0
+#define PCPU_TYPES 1
+#endif

struct per_cpu_pages {
int count[PCPU_TYPES]; /* Number of pages on each list */
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-mbuddy-v19-noconfig/init/Kconfig linux-2.6.14-rc5-mm1-mbuddy-v19-withconfig/init/Kconfig
--- linux-2.6.14-rc5-mm1-mbuddy-v19-noconfig/init/Kconfig 2005-11-02 12:42:20.000000000 +0000
+++ linux-2.6.14-rc5-mm1-mbuddy-v19-withconfig/init/Kconfig 2005-11-02 12:59:49.000000000 +0000
@@ -419,6 +419,17 @@ config CC_ALIGN_JUMPS
no dummy operations need be executed.
Zero means use compiler's default.

+config PAGEALLOC_ANTIDEFRAG
+ bool "Try and avoid fragmentation in the page allocator"
+ def_bool y
+ help
+ The standard allocator will fragment memory over time which means that
+ high order allocations will fail even if kswapd is running. If this
+ option is set, the allocator will try and group page types into
+ three groups KernNoRclm, KernRclm and EasyRclm. The gain is a best
+ effort attempt at lowering fragmentation. The loss is more complexity
+
+
endmenu # General setup

config TINY_SHMEM
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-mbuddy-v19-noconfig/mm/page_alloc.c linux-2.6.14-rc5-mm1-mbuddy-v19-withconfig/mm/page_alloc.c
--- linux-2.6.14-rc5-mm1-mbuddy-v19-noconfig/mm/page_alloc.c 2005-11-02 13:05:07.000000000 +0000
+++ linux-2.6.14-rc5-mm1-mbuddy-v19-withconfig/mm/page_alloc.c 2005-11-02 14:09:37.000000000 +0000
@@ -57,11 +57,17 @@ long nr_swap_pages;
* fallback_allocs contains the fallback types for low memory conditions
* where the preferred alloction type if not available.
*/
+#ifdef CONFIG_PAGEALLOC_ANTIDEFRAG
int fallback_allocs[RCLM_TYPES-1][RCLM_TYPES+1] = {
{RCLM_NORCLM, RCLM_FALLBACK, RCLM_KERN, RCLM_EASY, RCLM_TYPES},
{RCLM_EASY, RCLM_FALLBACK, RCLM_NORCLM, RCLM_KERN, RCLM_TYPES},
{RCLM_KERN, RCLM_FALLBACK, RCLM_NORCLM, RCLM_EASY, RCLM_TYPES}
};
+#else
+int fallback_allocs[RCLM_TYPES][RCLM_TYPES+1] = {
+ {RCLM_NORCLM, RCLM_TYPES}
+};
+#endif /* CONFIG_PAGEALLOC_ANTIDEFRAG */

/* Returns 1 if the needed percentage of the zone is reserved for fallbacks */
static inline int min_fallback_reserved(struct zone *zone)
@@ -98,6 +104,7 @@ EXPORT_SYMBOL(totalram_pages);
#error __GFP_KERNRCLM not mapping to RCLM_KERN
#endif

+#ifdef CONFIG_PAGEALLOC_ANTIDEFRAG
/*
* This function maps gfpflags to their RCLM_TYPE. It makes assumptions
* on the location of the GFP flags.
@@ -115,6 +122,12 @@ static inline int gfpflags_to_rclmtype(g

return rclmbits >> RCLM_SHIFT;
}
+#else
+static inline int gfpflags_to_rclmtype(gfp_t gfp_flags)
+{
+ return RCLM_NORCLM;
+}
+#endif /* CONFIG_PAGEALLOC_ANTIDEFRAG */

/*
* copy_bits - Copy bits between bitmaps
@@ -134,6 +147,9 @@ static inline void copy_bits(unsigned lo
int sindex_src,
int nr)
{
+ if (nr == 0)
+ return;
+
/*
* Written like this to take advantage of arch-specific
* set_bit() and clear_bit() functions
@@ -188,8 +204,12 @@ static char *zone_names[MAX_NR_ZONES] =
int min_free_kbytes = 1024;

#ifdef CONFIG_ALLOCSTATS
+#ifdef CONFIG_PAGEALLOC_ANTIDEFRAG
static char *type_names[RCLM_TYPES] = { "KernNoRclm", "EasyRclm",
"KernRclm", "Fallback"};
+#else
+static char *type_names[RCLM_TYPES] = { "KernNoRclm" };
+#endif /* CONFIG_PAGEALLOC_ANTIDEFRAG */
#endif /* CONFIG_ALLOCSTATS */

unsigned long __initdata nr_kernel_pages;
@@ -2228,8 +2248,10 @@ static void __init setup_usemap(struct p
struct zone *zone, unsigned long zonesize)
{
unsigned long usemapsize = usemap_size(zonesize);
- zone->free_area_usemap = alloc_bootmem_node(pgdat, usemapsize);
- memset(zone->free_area_usemap, RCLM_NORCLM, usemapsize);
+ if (usemapsize != 0) {
+ zone->free_area_usemap = alloc_bootmem_node(pgdat, usemapsize);
+ memset(zone->free_area_usemap, RCLM_NORCLM, usemapsize);
+ }
}
#else
static void inline setup_usemap(struct pglist_data *pgdat,