2007-11-04 20:30:51

by Don Porter

Subject: [RFC/PATCH] Optimize zone allocator synchronization

From: Donald E. Porter <[email protected]>

In the bulk page allocation/free routines in mm/page_alloc.c, the zone
lock is held across all iterations. For certain parallel workloads, I
have found that releasing and reacquiring the lock for each iteration
yields better performance, especially at higher CPU counts. For
instance, kernel compilation is sped up by 5% on an 8 CPU test
machine. In most cases, there is no significant effect on performance
(although the effect tends to be slightly positive). This seems quite
reasonable for the very small scope of the change.

My intuition is that this patch prevents smaller requests from waiting
on larger ones. While grabbing and releasing the lock within the loop
adds a few instructions, it can lower the latency for a particular
thread's allocation which is often on the thread's critical path.
Lowering the average latency for allocation can increase system throughput.

More detailed information, including data from the tests I ran to
validate this change, is available at
http://www.cs.utexas.edu/~porterde/kernel-patch.html .

Thanks in advance for your consideration and feedback.

Don

Signed-off-by: Donald E. Porter <[email protected]>

---

diff -uprN linux-2.6.23.1/mm/page_alloc.c linux-2.6.23.1-opt/mm/page_alloc.c
--- linux-2.6.23.1/mm/page_alloc.c 2007-10-12 11:43:44.000000000 -0500
+++ linux-2.6.23.1-opt/mm/page_alloc.c 2007-10-29 18:29:05.000000000 -0500
@@ -477,19 +477,19 @@ static inline int free_pages_check(struc
 static void free_pages_bulk(struct zone *zone, int count,
 					struct list_head *list, int order)
 {
-	spin_lock(&zone->lock);
 	zone->all_unreclaimable = 0;
 	zone->pages_scanned = 0;
 	while (count--) {
 		struct page *page;
+		spin_lock(&zone->lock);
 
 		VM_BUG_ON(list_empty(list));
 		page = list_entry(list->prev, struct page, lru);
 		/* have to delete it as __free_one_page list manipulates */
 		list_del(&page->lru);
 		__free_one_page(page, zone, order);
+		spin_unlock(&zone->lock);
 	}
-	spin_unlock(&zone->lock);
 }

static void free_one_page(struct zone *zone, struct page *page, int order)
@@ -665,14 +665,19 @@ static int rmqueue_bulk(struct zone *zon
 {
 	int i;
 
-	spin_lock(&zone->lock);
 	for (i = 0; i < count; ++i) {
-		struct page *page = __rmqueue(zone, order);
-		if (unlikely(page == NULL))
+		struct page *page;
+
+		spin_lock(&zone->lock);
+		page = __rmqueue(zone, order);
+		if (unlikely(page == NULL)) {
+			spin_unlock(&zone->lock);
 			break;
+		}
 		list_add_tail(&page->lru, list);
+		spin_unlock(&zone->lock);
 	}
-	spin_unlock(&zone->lock);
+
 	return i;
 }


2007-11-06 10:08:20

by Chris Snook

Subject: Re: [RFC/PATCH] Optimize zone allocator synchronization

Don Porter wrote:
> From: Donald E. Porter <[email protected]>
>
> In the bulk page allocation/free routines in mm/page_alloc.c, the zone
> lock is held across all iterations. For certain parallel workloads, I
> have found that releasing and reacquiring the lock for each iteration
> yields better performance, especially at higher CPU counts. For
> instance, kernel compilation is sped up by 5% on an 8 CPU test
> machine. In most cases, there is no significant effect on performance
> (although the effect tends to be slightly positive). This seems quite
> reasonable for the very small scope of the change.
>
> My intuition is that this patch prevents smaller requests from waiting
> on larger ones. While grabbing and releasing the lock within the loop
> adds a few instructions, it can lower the latency for a particular
> thread's allocation which is often on the thread's critical path.
> Lowering the average latency for allocation can increase system throughput.
>
> More detailed information, including data from the tests I ran to
> validate this change, is available at
> http://www.cs.utexas.edu/~porterde/kernel-patch.html .
>
> Thanks in advance for your consideration and feedback.

That's an interesting insight. My intuition is that Nick Piggin's
recently-posted ticket spinlocks patches[1] will reduce the need for this patch,
though it may be useful to have both. Can you benchmark again with only ticket
spinlocks, and with ticket spinlocks + this patch? You'll probably want to use
2.6.24-rc1 as your baseline, due to the x86 architecture merge.

-- Chris

[1] http://lkml.org/lkml/2007/11/1/123

2007-11-07 06:20:47

by Andrew Morton

Subject: Re: [RFC/PATCH] Optimize zone allocator synchronization

> On Tue, 06 Nov 2007 05:08:07 -0500 Chris Snook <[email protected]> wrote:
> Don Porter wrote:
> > From: Donald E. Porter <[email protected]>
> >
> > In the bulk page allocation/free routines in mm/page_alloc.c, the zone
> > lock is held across all iterations. For certain parallel workloads, I
> > have found that releasing and reacquiring the lock for each iteration
> > yields better performance, especially at higher CPU counts. For
> > instance, kernel compilation is sped up by 5% on an 8 CPU test
> > machine. In most cases, there is no significant effect on performance
> > (although the effect tends to be slightly positive). This seems quite
> > reasonable for the very small scope of the change.
> >
> > My intuition is that this patch prevents smaller requests from waiting
> > on larger ones. While grabbing and releasing the lock within the loop
> > adds a few instructions, it can lower the latency for a particular
> > thread's allocation which is often on the thread's critical path.
> > Lowering the average latency for allocation can increase system throughput.
> >
> > More detailed information, including data from the tests I ran to
> > validate this change, is available at
> > http://www.cs.utexas.edu/~porterde/kernel-patch.html .
> >
> > Thanks in advance for your consideration and feedback.
>
> That's an interesting insight. My intuition is that Nick Piggin's
> recently-posted ticket spinlocks patches[1] will reduce the need for this patch,
> though it may be useful to have both. Can you benchmark again with only ticket
> spinlocks, and with ticket spinlocks + this patch? You'll probably want to use
> 2.6.24-rc1 as your baseline, due to the x86 architecture merge.

The patch as-is would hurt low cpu-count workloads and single-threaded
workloads: it simply takes that lock a lot more times. This will be
particularly noticeable on things like older P4 machines, which have
peculiarly expensive locked operations.

A test to run would be, on ext2:

time (dd if=/dev/zero of=foo bs=16k count=2048 ; rm foo)

(might need to increase /proc/sys/vm/dirty* to avoid any writeback)


I wonder if we can do something like:

	if (lock_is_contended(lock)) {
		spin_unlock(lock);
		spin_lock(lock);	/* To the back of the queue */
	}

(in conjunction with the ticket locks) so that we only do the expensive
buslocked operation when we actually have a need to do so.

(The above should be wrapped in some new spinlock interface function which
is probably a no-op on architectures which cannot implement it usefully)

2007-11-07 07:38:56

by Nick Piggin

Subject: Re: [RFC/PATCH] Optimize zone allocator synchronization

On Wednesday 07 November 2007 17:19, Andrew Morton wrote:
> > On Tue, 06 Nov 2007 05:08:07 -0500 Chris Snook <[email protected]> wrote:
> >
> > Don Porter wrote:
> > > From: Donald E. Porter <[email protected]>
> > >
> > > In the bulk page allocation/free routines in mm/page_alloc.c, the zone
> > > lock is held across all iterations. For certain parallel workloads, I
> > > have found that releasing and reacquiring the lock for each iteration
> > > yields better performance, especially at higher CPU counts. For
> > > instance, kernel compilation is sped up by 5% on an 8 CPU test
> > > machine. In most cases, there is no significant effect on performance
> > > (although the effect tends to be slightly positive). This seems quite
> > > reasonable for the very small scope of the change.
> > >
> > > My intuition is that this patch prevents smaller requests from waiting
> > > on larger ones. While grabbing and releasing the lock within the loop
> > > adds a few instructions, it can lower the latency for a particular
> > > thread's allocation which is often on the thread's critical path.
> > > Lowering the average latency for allocation can increase system
> > > throughput.
> > >
> > > More detailed information, including data from the tests I ran to
> > > validate this change, is available at
> > > http://www.cs.utexas.edu/~porterde/kernel-patch.html .
> > >
> > > Thanks in advance for your consideration and feedback.

I did see this initial post, and didn't quite know what to make of it.
I'll admit it is slightly unexpected :) Always good to research ideas
against common convention, though.

I don't know whether your reasoning is correct though: unless there is
a significant number of higher order allocations (which there should
not be, AFAIKS), all allocators will go through the per CPU lists which
batch the same number of objects on and off, so there is no such thing
as smaller or larger requests.

And there are a number of regressions as well in your tests. It would be
nice to get some more detailed profile numbers (preferably with an
upstream kernel) to try to work out what is going faster.

It's funny, Dave Miller and I were just talking about the possible
reappearance of zone->lock contention with massively multi core and
multi threaded CPUs. I think the right way to fix this in the long run
if it turns into a real problem, is something like having a lock per
MAX_ORDER block, and having CPUs prefer to allocate from different
blocks. Anti-frag makes this pretty interesting to implement, but it
will be possible.


> > That's an interesting insight. My intuition is that Nick Piggin's
> > recently-posted ticket spinlocks patches[1] will reduce the need for this
> > patch, though it may be useful to have both. Can you benchmark again
> > with only ticket spinlocks, and with ticket spinlocks + this patch?
> > You'll probably want to use 2.6.24-rc1 as your baseline, due to the x86
> > architecture merge.
>
> The patch as-is would hurt low cpu-count workloads and single-threaded
> workloads: it simply takes that lock a lot more times. This will be
> particularly noticeable on things like older P4 machines, which have
> peculiarly expensive locked operations.

It's not even restricted to P4s -- another big cost is going to be the
cacheline pingpong. Actually it might be worth trying another test run
with zone->lock put into its own cacheline (as it stands, when the lock
gets contended, spinners will just sit there pushing useful fields out
of the holder's memory -- ticket locks will do better here, but they
still write to the lock once, then sit there loading it).


> A test to run would be, on ext2:
>
> time (dd if=/dev/zero of=foo bs=16k count=2048 ; rm foo)
>
> (might need to increase /proc/sys/vm/dirty* to avoid any writeback)
>
>
> I wonder if we can do something like:
>
> 	if (lock_is_contended(lock)) {
> 		spin_unlock(lock);
> 		spin_lock(lock);	/* To the back of the queue */
> 	}
>
> (in conjunction with the ticket locks) so that we only do the expensive
> buslocked operation when we actually have a need to do so.
>
> (The above should be wrapped in some new spinlock interface function which
> is probably a no-op on architectures which cannot implement it usefully)

We have the need_lockbreak stuff. Of course, that's often pretty useless
with regular spinlocks (when you consider that my tests show that a single
CPU can be allowed to retake the same lock several million times in a row
despite contention)...

Anyway, yeah we could do that. But I think we do actually want to batch
up allocations on a given CPU in the multithreaded case as well, rather
than interleave them. There are some benefits avoiding cacheline bouncing.

2007-11-18 05:36:47

by Don Porter

Subject: Re: [RFC/PATCH] Optimize zone allocator synchronization

Thank you all for your consideration and insightful responses to my
posting. I apologize for not responding sooner---I have been under a
deadline.

It seems clear that further investigation will be needed to understand
these performance numbers better.

To summarize, I understand that the following experiments will be helpful:

1) Instrument the allocation code to determine the common size/order
of the allocations for these workloads.

2) Try to integrate these changes with ticket spinlocks

3) Try placing the zone lock in its own cacheline

4) Look for single-threaded regressions (dd benchmark).

I'll do these at my first opportunity, hopefully within the next week.
Please let me know if I misunderstood any of your comments.

My intuition about the cost of ping-ponging the lock's cache line
certainly matched yours, so I was very surprised to see these
performance numbers.

On Wed, Nov 07, 2007 at 04:31:59PM +1100, Nick Piggin wrote:
> It's funny, Dave Miller and I were just talking about the possible
> reappearance of zone->lock contention with massively multi core and
> multi threaded CPUs. I think the right way to fix this in the long run
> if it turns into a real problem, is something like having a lock per
> MAX_ORDER block, and having CPUs prefer to allocate from different
> blocks. Anti-frag makes this pretty interesting to implement, but it
> will be possible.

As a bit of background, the zone lock is indeed one of the more
contended locks in my target workloads so it was no accident that I
was looking for ways to improve its scalability. I am quite
interested in Nick's ideas about how to split up the zone allocator's
synchronization.

Of course, these contention levels may not meet your definition of
"real problem" (~0.1% of the execution time).

Best regards,
Don

2008-01-29 17:05:55

by Don Porter

Subject: Re: [RFC/PATCH] Optimize zone allocator synchronization

I apologize again for the long delay in responding with the requested
additional data needed to understand the performance of this patch.
The complete information is available at:

http://www.cs.utexas.edu/~porterde/kernel-patch.html#subsequent

I instrumented the kernel within Simics to determine that the mean
value of the count is 7 and the order is 0 for both free_pages_bulk()
and rmqueue_bulk() for the simulation workloads.

I applied the change suggested by Andrew Morton to the 2.6.24-rc7
kernel patched with Nick Piggin's ticket spinlock patch.

These data indicate that adding ticket spinlocks alone incurs a small
performance penalty (1-2%), probably because the kernel does not see
enough lock contention for their fairness benefits to repay the added
cost of the implementation. By allowing the lock to be released and
reacquired under contention in just these two places, the 1-2%
overhead of ticket spinlocks is reclaimed in most benchmarks, making
overall performance comparable to the baseline kernel.

The only data inconsistent with these results are the kernel
compilation benchmarks. These data indicate that the best performance
is from the baseline kernel. It is not clear to me what property of
kernel compilation causes it to have this performance profile.

Placing the zone spin lock in its own cache line hurts performance.

The dd regression test actually shows a similar trend to the other
benchmarks; ticket spinlocks + my patch perform best.

Thanks for your comments and consideration,

Don Porter

On Sat, Nov 17, 2007 at 11:36:26PM -0600, Don Porter wrote:
> Thank you all for your consideration and insightful responses to my
> posting. I apologize for not responding sooner---I have been under a
> deadline.
>
> It seems clear that further investigation will be needed to understand
> these performance numbers better.
>
> To summarize, I understand that the following experiments will be helpful:
>
> 1) Instrument the allocation code to determine the common size/order
> of the allocations for these workloads.
>
> 2) Try to integrate these changes with ticket spinlocks
>
> 3) Try placing the zone lock in its own cacheline
>
> 4) Look for single-threaded regressions (dd benchmark).
>
> I'll do these at my first opportunity, hopefully within the next week.
> Please let me know if I misunderstood any of your comments.
>
> My intuition about the cost of ping-ponging the lock's cache line
> certainly matched yours, so I was very surprised to see these
> performance numbers.
>
> On Wed, Nov 07, 2007 at 04:31:59PM +1100, Nick Piggin wrote:
> > It's funny, Dave Miller and I were just talking about the possible
> > reappearance of zone->lock contention with massively multi core and
> > multi threaded CPUs. I think the right way to fix this in the long run
> > if it turns into a real problem, is something like having a lock per
> > MAX_ORDER block, and having CPUs prefer to allocate from different
> > blocks. Anti-frag makes this pretty interesting to implement, but it
> > will be possible.
>
> As a bit of background, the zone lock is indeed one of the more
> contended locks in my target workloads so it was no accident that I
> was looking for ways to improve its scalability. I am quite
> interested in Nick's ideas about how to split up the zone allocator's
> synchronization.
>
> Of course, these contention levels may not meet your definition of
> "real problem" (~0.1% of the execution time).
>
> Best regards,
> Don