This patch set does two things:
1. Fix a bug on 32-bit systems. percpu_counter uses an s64 counter, and
reading an s64 on a 32-bit system without any locking isn't atomic, so a
reader can see a torn value (see the sketch after this list).
2. Improve the scalability of __percpu_counter_add. In some cases _add
can cause heavy lock contention (see patch 4 for detailed information and
data). The patches remove the contention and speed it up a bit. The last
post (http://marc.info/?l=linux-kernel&m=130259547913607&w=2) simply used
atomic64 for percpu_counter, but Tejun pointed out this could cause
deviation in __percpu_counter_sum.
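For illustration, here is a minimal userspace sketch (mine, not the kernel
code) of how an unlocked 64-bit read can tear on a 32-bit machine:

#include <stdint.h>

int64_t counter = 0xffffffffLL;		/* low word all ones, high word zero */

/* On a 32-bit build this compiles to two loads: low word, then high word. */
int64_t unsafe_read(void)
{
	return counter;
}

void updater(void)
{
	counter += 1;			/* counter becomes 0x100000000LL */
}

/*
 * If the reader loads the low word (0xffffffff), the updater then stores
 * the new value, and the reader loads the new high word (0x1), the reader
 * returns 0x1ffffffff - a value the counter never held.
 */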
The new implementation uses lglock to protect the percpu data. Each cpu
has its private lock, which other cpus don't take in _add. This way _add
no longer needs to take the global lock, which removes the deviation. It
still gives me about a 5x ~ 6x speedup (not as much as the original 7x,
but still good) with the workload mentioned in patch 4.
patch 1 fixes the s64 read bug for 32-bit UP systems.
patch 2 converts lglock so it can be used by dynamically allocated
structures. A later patch uses lglock for percpu_counter.
patches 3,4 fix the s64 read bug for 32-bit SMP systems and also improve
the scalability of __percpu_counter_add.
patch 5 is from Christoph Lameter and makes the __percpu_counter_add
fastpath preemptless. I added it here because I converted percpu_counter
to use lglock. Any bugs are mine.
Comments and suggestions are welcome!
Thanks,
Shaohua
Hey, Shaohua.
On Wed, May 11, 2011 at 04:10:12PM +0800, Shaohua Li wrote:
> The new implementation uses lglock to protect percpu data. Each cpu has its
> private lock while other cpu doesn't take. In this way _add doesn't need take
> global lock anymore and remove the deviation. This still gives me about
> about 5x ~ 6x faster (not that faster than the original 7x faster, but still
> good) with the workload mentioned in patch 4.
I'm afraid I'm not too thrilled about lglock + atomic64 usage. It is
a very patchy approach which addresses a very specific use case which
might just need a higher @batch. I just can't see enough benefits to
justify the overhead and complexity. :-(
Thanks.
--
tejun
On Wed, 2011-05-11 at 17:28 +0800, Tejun Heo wrote:
> Hey, Shaohua.
>
> On Wed, May 11, 2011 at 04:10:12PM +0800, Shaohua Li wrote:
> > The new implementation uses lglock to protect percpu data. Each cpu has its
> > private lock while other cpu doesn't take. In this way _add doesn't need take
> > global lock anymore and remove the deviation. This still gives me about
> > about 5x ~ 6x faster (not that faster than the original 7x faster, but still
> > good) with the workload mentioned in patch 4.
>
> I'm afraid I'm not too thrilled about lglock + atomic64 usage. It is
> a very patchy approach which addresses a very specific use case which
> might just need a higher @batch.
It's quite hard to get a higher @batch. Please see my comments in
http://marc.info/?l=linux-kernel&m=130153302319613&w=2
And the atomic64 approach not only improves performance (which is always
welcome), it also fixes a bug for 32-bit systems. The usage of lglock is
actually quite straightforward and is the standard usage described in the
lglock.h comments; it's just that lglock doesn't currently work for
dynamically allocated structures, which needs a conversion.
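Roughly, the pattern is the following (a conceptual sketch with plain
per-cpu spinlocks, only to illustrate the idea; it is not the actual
lglock API and not the posted patches):

struct pcp_counter {
	s32 __percpu *counters;
	spinlock_t __percpu *locks;	/* one lock per cpu */
	atomic64_t count;
};

/* fast path: take only this cpu's lock */
static void pcp_add(struct pcp_counter *c, s32 amount, s32 batch)
{
	spinlock_t *lock = get_cpu_ptr(c->locks);
	s32 *pcount = this_cpu_ptr(c->counters);

	spin_lock(lock);
	*pcount += amount;
	if (*pcount >= batch || *pcount <= -batch) {
		atomic64_add(*pcount, &c->count);
		*pcount = 0;
	}
	spin_unlock(lock);
	put_cpu_ptr(c->locks);
}

/*
 * slow path: take every cpu's lock so the percpu values can't move under
 * us (lockdep nesting annotations omitted in this sketch)
 */
static s64 pcp_sum(struct pcp_counter *c)
{
	s64 ret;
	int cpu;

	for_each_possible_cpu(cpu)
		spin_lock(per_cpu_ptr(c->locks, cpu));

	ret = atomic64_read(&c->count);
	for_each_possible_cpu(cpu)
		ret += *per_cpu_ptr(c->counters, cpu);

	for_each_possible_cpu(cpu)
		spin_unlock(per_cpu_ptr(c->locks, cpu));

	return ret;
}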
Thanks,
Shaohua
Hello,
On Thu, May 12, 2011 at 10:48:13AM +0800, Shaohua Li wrote:
> And the atomic64 approach not just improved the performance (which is
> always welcomed), but it also fixes a bug for 32-bit system. The usage
> of lglock is actually quite straightforward and is standard usage of
> lglock (the comments of lglock.h declare such usage), just lglock
> doesn't work for dynamatically allocated structure currently, which
> needs a convert.
lglock doesn't seem like the right solution for the problem at hand.
It's way too heavy-handed and used to close a very small race window.
It doesn't seem right. Eric's idea seemed much better to me and I
don't see why that can't be improved and used instead. Would you be
interested in looking that direction?
Thanks.
--
tejun
Hi,
On Thu, 2011-05-12 at 16:21 +0800, Tejun Heo wrote:
> On Thu, May 12, 2011 at 10:48:13AM +0800, Shaohua Li wrote:
> > And the atomic64 approach not just improved the performance (which is
> > always welcomed), but it also fixes a bug for 32-bit system. The usage
> > of lglock is actually quite straightforward and is standard usage of
> > lglock (the comments of lglock.h declare such usage), just lglock
> > doesn't work for dynamatically allocated structure currently, which
> > needs a convert.
>
> lglock doesn't seem like the right solution for the problem at hand.
> It's way too heavy handed and used to close a very small race window.
> It doesn't seem right. Eric's idea seemed much better to me and I
> don't see why that can't be improved and used instead. Would you be
> interested in looking that direction?
Sure, but I thought it's quite difficult to determine @maxfuzzy in his
proposal (and it could confuse users). Did I miss anything?
Thanks,
Shaohua
Hello,
On Thu, May 12, 2011 at 04:55:20PM +0800, Shaohua Li wrote:
> sure, but it's quite difficult to determine a @maxfuzzy in his proposal
> I thought (and could confuse user), did I miss anything?
I don't think @maxfuzzy is necessary there. I wrote this before but
why can't we track the actual deviation instead of the number of
deviation events?
Thanks.
--
tejun
On Thu, May 12, 2011 at 10:59 +0200, Tejun Heo wrote:
> Hello,
>
> On Thu, May 12, 2011 at 04:55:20PM +0800, Shaohua Li wrote:
> > sure, but it's quite difficult to determine a @maxfuzzy in his proposal
> > I thought (and could confuse user), did I miss anything?
>
> I don't think @maxfuzzy is necessary there. I wrote this before but
> why can't we track the actual deviation instead of the number of
> deviation events?
>
That's roughly the same thing (apart from a BATCH multiplier factor).
Most users of a given percpu_counter object use the same BATCH, don't
they?
On Thu, May 12, 2011 at 11:02 +0200, Eric Dumazet wrote:
> Le jeudi 12 mai 2011 à 10:59 +0200, Tejun Heo a écrit :
> > Hello,
> >
> > On Thu, May 12, 2011 at 04:55:20PM +0800, Shaohua Li wrote:
> > > sure, but it's quite difficult to determine a @maxfuzzy in his proposal
> > > I thought (and could confuse user), did I miss anything?
> >
> > I don't think @maxfuzzy is necessary there. I wrote this before but
> > why can't we track the actual deviation instead of the number of
> > deviation events?
> >
>
> Thats roughly same thing (BATCH multiplicator factor apart)
>
> Most percpu_counter users for a given percpu_counter object use a given
> BATCH, dont they ?
>
>
I guess nr_cpu_ids would be a nice @maxfuzzy default value...
Hello,
On Thu, May 12, 2011 at 11:02:15AM +0200, Eric Dumazet wrote:
> > I don't think @maxfuzzy is necessary there. I wrote this before but
> > why can't we track the actual deviation instead of the number of
> > deviation events?
>
> Thats roughly same thing (BATCH multiplicator factor apart)
>
> Most percpu_counter users for a given percpu_counter object use a given
> BATCH, dont they ?
Well, @maxfuzzy is much harder than @batch. It's way less intuitive.
Although I haven't really thought about it that much, I think it might
be possible to eliminate it. Maybe I'm confused. I'll take another
look later but if someone can think of something, please jump right
in.
Thanks.
--
tejun
On Wed, 11 May 2011, Tejun Heo wrote:
> Hey, Shaohua.
>
> On Wed, May 11, 2011 at 04:10:12PM +0800, Shaohua Li wrote:
> > The new implementation uses lglock to protect percpu data. Each cpu has its
> > private lock while other cpu doesn't take. In this way _add doesn't need take
> > global lock anymore and remove the deviation. This still gives me about
> > about 5x ~ 6x faster (not that faster than the original 7x faster, but still
> > good) with the workload mentioned in patch 4.
>
> I'm afraid I'm not too thrilled about lglock + atomic64 usage. It is
> a very patchy approach which addresses a very specific use case which
> might just need a higher @batch. I just can't see enough benefits to
> justify the overhead and complexity. :-(
Same here.
Hi,
On Thu, May 12, 2011 at 05:05:34PM +0800, Tejun Heo wrote:
> Hello,
>
> On Thu, May 12, 2011 at 11:02:15AM +0200, Eric Dumazet wrote:
> > > I don't think @maxfuzzy is necessary there. I wrote this before but
> > > why can't we track the actual deviation instead of the number of
> > > deviation events?
> >
> > Thats roughly same thing (BATCH multiplicator factor apart)
> >
> > Most percpu_counter users for a given percpu_counter object use a given
> > BATCH, dont they ?
>
> Well, @maxfuzzy is much harder than @batch. It's way less intuitive.
> Although I haven't really thought about it that much, I think it might
> be possible to eliminate it. Maybe I'm confused. I'll take another
> look later but if someone can think of something, please jump right
> in.
There is another problem here: _sum could keep spinning if a concurrent
updater keeps running. We could slightly change Eric's idea; how about
something like this:
s64 __percpu_counter_sum(struct percpu_counter *fbc)
{
	int retry_times = 0;

retry:
	old_seq = fbc->seq;
	sum = do_sum();
	if (old_seq != fbc->seq && retry_times++ < MAX_RETRY)
		goto retry;
	return sum;
}
MAX_RETRY could be nr_cpu_ids. The rationale here is that if concurrent
updaters keep running we can't get an accurate sum anyway, so just don't
try too hard.
Thanks,
Shaohua
On Thu, 2011-05-12 at 17:05 +0800, Tejun Heo wrote:
> Hello,
>
> On Thu, May 12, 2011 at 11:02:15AM +0200, Eric Dumazet wrote:
> > > I don't think @maxfuzzy is necessary there. I wrote this before but
> > > why can't we track the actual deviation instead of the number of
> > > deviation events?
> >
> > Thats roughly same thing (BATCH multiplicator factor apart)
> >
> > Most percpu_counter users for a given percpu_counter object use a given
> > BATCH, dont they ?
>
> Well, @maxfuzzy is much harder than @batch. It's way less intuitive.
> Although I haven't really thought about it that much, I think it might
> be possible to eliminate it. Maybe I'm confused. I'll take another
> look later but if someone can think of something, please jump right
> in.
Hmm, it looks like Eric's approach doesn't work: because we want to
remove the lock in _add, checking the sequence in _sum still races with
_add. Can we do something like this:
void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
{
	s64 count;

	preempt_disable();
	count = __this_cpu_read(*fbc->counters) + amount;
	if (count >= batch || count <= -batch) {
		while (1) {
			atomic_inc(&fbc->add_start);
			if (atomic_read(&fbc->sum_start) != 0)
				atomic_dec(&fbc->add_start);
			else
				break;
			while (atomic_read(&fbc->sum_start) != 0)
				cpu_relax();
		}
		atomic64_add(count, &fbc->count);
		__this_cpu_write(*fbc->counters, 0);
		atomic_dec(&fbc->add_start);
	} else {
		__this_cpu_write(*fbc->counters, count);
	}
	preempt_enable();
}

s64 __percpu_counter_sum(struct percpu_counter *fbc)
{
	s64 ret = 0;
	int cpu;

	atomic_inc(&fbc->sum_start);
	while (atomic_read(&fbc->add_start) != 0)
		cpu_relax();

	for_each_online_cpu(cpu) {
		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
		ret += *pcount;
	}
	ret += atomic64_read(&fbc->count);
	atomic_dec(&fbc->sum_start);
	return ret;
}
If _add finds _sum is in progress, it gives up and waits for _sum. If
_sum finds _add is in progress, it waits for _add to give up or finish.
We let _add wait for _sum because _sum is seldom called; if _sum waited
for _add, _sum might spin forever. Maybe we also need a spinlock to
protect concurrent _sum calls. Anything wrong here?
On Fri, May 13, 2011 at 12:37 +0800, Shaohua Li wrote:
> On Thu, 2011-05-12 at 17:05 +0800, Tejun Heo wrote:
> > Hello,
> >
> > On Thu, May 12, 2011 at 11:02:15AM +0200, Eric Dumazet wrote:
> > > > I don't think @maxfuzzy is necessary there. I wrote this before but
> > > > why can't we track the actual deviation instead of the number of
> > > > deviation events?
> > >
> > > Thats roughly same thing (BATCH multiplicator factor apart)
> > >
> > > Most percpu_counter users for a given percpu_counter object use a given
> > > BATCH, dont they ?
> >
> > Well, @maxfuzzy is much harder than @batch. It's way less intuitive.
> > Although I haven't really thought about it that much, I think it might
> > be possible to eliminate it. Maybe I'm confused. I'll take another
> > look later but if someone can think of something, please jump right
> > in.
> Hmm, looks Eric's approach doesn't work. because we want to remove lock
> in _add, checking seq in _sum still races with _add.
>
Why?
I'll code a patch; I believe it should work.
A seqcount is not a 'lock'.
The thing is we want _add to be really fast, so it must not hit a lock
taken by _sum().
[Think about a machine with 4096 cpus]
Hi,
On Fri, May 13, 2011 at 01:20:06PM +0800, Eric Dumazet wrote:
> Le vendredi 13 mai 2011 à 12:37 +0800, Shaohua Li a écrit :
> > On Thu, 2011-05-12 at 17:05 +0800, Tejun Heo wrote:
> > > Hello,
> > >
> > > On Thu, May 12, 2011 at 11:02:15AM +0200, Eric Dumazet wrote:
> > > > > I don't think @maxfuzzy is necessary there. I wrote this before but
> > > > > why can't we track the actual deviation instead of the number of
> > > > > deviation events?
> > > >
> > > > Thats roughly same thing (BATCH multiplicator factor apart)
> > > >
> > > > Most percpu_counter users for a given percpu_counter object use a given
> > > > BATCH, dont they ?
> > >
> > > Well, @maxfuzzy is much harder than @batch. It's way less intuitive.
> > > Although I haven't really thought about it that much, I think it might
> > > be possible to eliminate it. Maybe I'm confused. I'll take another
> > > look later but if someone can think of something, please jump right
> > > in.
> > Hmm, looks Eric's approach doesn't work. because we want to remove lock
> > in _add, checking seq in _sum still races with _add.
> >
>
> Why ?
>
> I'll code a patch, I believe it should work.
I thought your proposal was:

in _add()
{
	if (count >= batch || count <= -batch) {
		fbc->seqcount++;
		atomic64_add(count, &fbc->count);
-------->
		__this_cpu_write(*fbc->counters, 0);
	}
}

in _sum()
{
restart:
	oldseq = fbc->seqcount;
	smp_rmb();
	do_sum();
	smp_rmb();
	newseq = fbc->seqcount;
	if (newseq - oldseq >= maxfuzzy)
		goto restart;
	return ret;
}
If _sum runs at the point marked above in _add, the seqcount check
doesn't catch it, and we still have the deviation Tejun pointed out.
Thanks,
Shaohua
On Fri, May 13, 2011 at 13:28 +0800, Shaohua Li wrote:
> Hi,
> On Fri, May 13, 2011 at 01:20:06PM +0800, Eric Dumazet wrote:
> > Le vendredi 13 mai 2011 à 12:37 +0800, Shaohua Li a écrit :
> > > On Thu, 2011-05-12 at 17:05 +0800, Tejun Heo wrote:
> > > > Hello,
> > > >
> > > > On Thu, May 12, 2011 at 11:02:15AM +0200, Eric Dumazet wrote:
> > > > > > I don't think @maxfuzzy is necessary there. I wrote this before but
> > > > > > why can't we track the actual deviation instead of the number of
> > > > > > deviation events?
> > > > >
> > > > > Thats roughly same thing (BATCH multiplicator factor apart)
> > > > >
> > > > > Most percpu_counter users for a given percpu_counter object use a given
> > > > > BATCH, dont they ?
> > > >
> > > > Well, @maxfuzzy is much harder than @batch. It's way less intuitive.
> > > > Although I haven't really thought about it that much, I think it might
> > > > be possible to eliminate it. Maybe I'm confused. I'll take another
> > > > look later but if someone can think of something, please jump right
> > > > in.
> > > Hmm, looks Eric's approach doesn't work. because we want to remove lock
> > > in _add, checking seq in _sum still races with _add.
> > >
> >
> > Why ?
> >
> > I'll code a patch, I believe it should work.
> I thought your proposal is:
> in _add
> {
> if (count >= batch || count <= -batch) {
> fbc->seq_count++;
> atomic64_add(count, &fbc->count);
> -------->
> __this_cpu_write(*fbc->counters, 0);
> }
> }
>
> in _sum
> {
> restart:
> oldseq = fbc->seqcount;
> smp_rmb();
> do_sum();
> smp_rmb()
> newseq = fbc->seqcount;
> if (newseq - oldseq >= maxfuzzy)
> goto restart;
> return ret;
> }
> if _sum run between above line marked in _add, then the seqcount check
> doesn't work, we still have deviation Tejun pointed out.
>
I see the point, thanks. I'll think a bit more about it.
We currently serialize both _sum() and _add() with a spinlock.
My idea was OK if we still kept the spinlock in _add(), but that is
obviously not what is needed.
Your goal is to let _add() run without a spinlock, but can we agree that
_sum() can keep running under a spinlock like today [no more than one
instance of _sum() running per percpu_counter]?
On Fri, 2011-05-13 at 14:34 +0800, Eric Dumazet wrote:
> Le vendredi 13 mai 2011 à 13:28 +0800, Shaohua Li a écrit :
> > Hi,
> > On Fri, May 13, 2011 at 01:20:06PM +0800, Eric Dumazet wrote:
> > > Le vendredi 13 mai 2011 à 12:37 +0800, Shaohua Li a écrit :
> > > > On Thu, 2011-05-12 at 17:05 +0800, Tejun Heo wrote:
> > > > > Hello,
> > > > >
> > > > > On Thu, May 12, 2011 at 11:02:15AM +0200, Eric Dumazet wrote:
> > > > > > > I don't think @maxfuzzy is necessary there. I wrote this before but
> > > > > > > why can't we track the actual deviation instead of the number of
> > > > > > > deviation events?
> > > > > >
> > > > > > Thats roughly same thing (BATCH multiplicator factor apart)
> > > > > >
> > > > > > Most percpu_counter users for a given percpu_counter object use a given
> > > > > > BATCH, dont they ?
> > > > >
> > > > > Well, @maxfuzzy is much harder than @batch. It's way less intuitive.
> > > > > Although I haven't really thought about it that much, I think it might
> > > > > be possible to eliminate it. Maybe I'm confused. I'll take another
> > > > > look later but if someone can think of something, please jump right
> > > > > in.
> > > > Hmm, looks Eric's approach doesn't work. because we want to remove lock
> > > > in _add, checking seq in _sum still races with _add.
> > > >
> > >
> > > Why ?
> > >
> > > I'll code a patch, I believe it should work.
> > I thought your proposal is:
> > in _add
> > {
> > if (count >= batch || count <= -batch) {
> > fbc->seq_count++;
> > atomic64_add(count, &fbc->count);
> > -------->
> > __this_cpu_write(*fbc->counters, 0);
> > }
> > }
> >
> > in _sum
> > {
> > restart:
> > oldseq = fbc->seqcount;
> > smp_rmb();
> > do_sum();
> > smp_rmb()
> > newseq = fbc->seqcount;
> > if (newseq - oldseq >= maxfuzzy)
> > goto restart;
> > return ret;
> > }
> > if _sum run between above line marked in _add, then the seqcount check
> > doesn't work, we still have deviation Tejun pointed out.
> >
>
> I see the point thanks, I'll think a bit more about it.
>
> We currently serializes both _sum() and _add() with a spinlock.
>
> My idea was OK if we still kept spinlock in _add(), but this obviously
> is not the need.
>
> Your goal is letting _add() run without spinlock, but can we agree
> _sum() can run with a spinlock() like today [no more than one instance
> of _sum() running per percpu_counter] ?
Locking _sum should be fine.
On Fri, May 13, 2011 at 08:34 +0200, Eric Dumazet wrote:
> I see the point thanks, I'll think a bit more about it.
>
> We currently serializes both _sum() and _add() with a spinlock.
>
> My idea was OK if we still kept spinlock in _add(), but this obviously
> is not the need.
>
> Your goal is letting _add() run without spinlock, but can we agree
> _sum() can run with a spinlock() like today [no more than one instance
> of _sum() running per percpu_counter] ?
>
>
Here is the patch I cooked (on top of linux-2.6).
This solves the problem quite well for me.
The idea is:
Consider _sum() the slow path. It is still serialized by a spinlock.
Add a fbc->sequence, so that _add() can detect that a _sum() is in flight
and add directly to a new atomic64_t field I named "fbc->slowcount"
(without touching its percpu s32 variable, so that _sum() can get an
accurate percpu_counter value).
The low-order bit of the 'sequence' is used to signal that a _sum() is in
flight, while _add() threads that overflow their percpu s32 variable do a
sequence += 2, so that _sum() can detect that at least one cpu touched
fbc->count and reset its s32 variable. _sum() can then restart its loop,
but since the sequence still has its low-order bit set, we have a
guarantee that the _sum() loop won't be restarted ad infinitum.
Note: I disabled IRQs in _add() to reduce the window, making _add() as
fast as possible to avoid extra _sum() loops, but it's not strictly
necessary; we can discuss this point, since _sum() is the slow path :)
_sum() is accurate and no longer blocks _add(). It slows _add() down a
bit, of course, since all _add() calls will touch fbc->slowcount.
_sum() is about the same speed as before in my tests.
On my 8-cpu (Intel(R) Xeon(R) CPU E5450 @ 3.00GHz) machine with a 32-bit
kernel, the following bench:

	loop (10000000 times) {
		p = mmap(128M, ANONYMOUS);
		munmap(p, 128M);
	}

run on all 8 cpus gives:
Before patch :
real 3m22.759s
user 0m6.353s
sys 26m28.919s
After patch :
real 0m23.420s
user 0m6.332s
sys 2m44.561s
Quite good results considering atomic64_add() uses two "lock cmpxchg8b"
on x86_32 :
33.03% mmap_test [kernel.kallsyms] [k] unmap_vmas
12.99% mmap_test [kernel.kallsyms] [k] atomic64_add_return_cx8
5.62% mmap_test [kernel.kallsyms] [k] free_pgd_range
3.07% mmap_test [kernel.kallsyms] [k] sysenter_past_esp
2.48% mmap_test [kernel.kallsyms] [k] memcpy
2.24% mmap_test [kernel.kallsyms] [k] perf_event_mmap
2.21% mmap_test [kernel.kallsyms] [k] _raw_spin_lock
2.02% mmap_test [vdso] [.] 0xffffe424
2.01% mmap_test [kernel.kallsyms] [k] perf_event_mmap_output
1.38% mmap_test [kernel.kallsyms] [k] vma_adjust
1.24% mmap_test [kernel.kallsyms] [k] sched_clock_local
1.23% perf [kernel.kallsyms] [k] __copy_from_user_ll_nozero
1.07% mmap_test [kernel.kallsyms] [k] down_write
If only one cpu runs the program :
real 0m16.685s
user 0m0.771s
sys 0m15.815s
Signed-off-by: Eric Dumazet <[email protected]>
---
include/linux/percpu_counter.h | 15 +++++++--
lib/percpu_counter.c | 47 +++++++++++++++++++------------
2 files changed, 40 insertions(+), 22 deletions(-)
diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
index 46f6ba5..288acf4 100644
--- a/include/linux/percpu_counter.h
+++ b/include/linux/percpu_counter.h
@@ -16,14 +16,21 @@
#ifdef CONFIG_SMP
struct percpu_counter {
- spinlock_t lock;
- s64 count;
+ spinlock_t lock;
+ atomic_t sequence; /* low order bit set if one sum() is in flight */
+ atomic64_t count;
+ atomic64_t slowcount;
#ifdef CONFIG_HOTPLUG_CPU
struct list_head list; /* All percpu_counters are on a list */
#endif
s32 __percpu *counters;
};
+static inline bool percpu_counter_active_sum(const struct percpu_counter *fbc)
+{
+ return (atomic_read(&fbc->sequence) & 1) ? true : false;
+}
+
extern int percpu_counter_batch;
int __percpu_counter_init(struct percpu_counter *fbc, s64 amount,
@@ -60,7 +67,7 @@ static inline s64 percpu_counter_sum(struct percpu_counter *fbc)
static inline s64 percpu_counter_read(struct percpu_counter *fbc)
{
- return fbc->count;
+ return atomic64_read(&fbc->count) + atomic64_read(&fbc->slowcount);
}
/*
@@ -70,7 +77,7 @@ static inline s64 percpu_counter_read(struct percpu_counter *fbc)
*/
static inline s64 percpu_counter_read_positive(struct percpu_counter *fbc)
{
- s64 ret = fbc->count;
+ s64 ret = percpu_counter_read(fbc);
barrier(); /* Prevent reloads of fbc->count */
if (ret >= 0)
diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index 28f2c33..aef4bd5 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -59,31 +59,35 @@ void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
{
int cpu;
- spin_lock(&fbc->lock);
for_each_possible_cpu(cpu) {
s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
*pcount = 0;
}
- fbc->count = amount;
- spin_unlock(&fbc->lock);
+ atomic64_set(&fbc->count, amount);
+ atomic64_set(&fbc->slowcount, 0);
}
EXPORT_SYMBOL(percpu_counter_set);
void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
{
s64 count;
+ unsigned long flags;
- preempt_disable();
+ if (percpu_counter_active_sum(fbc)) {
+ atomic64_add(amount, &fbc->slowcount);
+ return;
+ }
+
+ local_irq_save(flags);
count = __this_cpu_read(*fbc->counters) + amount;
if (count >= batch || count <= -batch) {
- spin_lock(&fbc->lock);
- fbc->count += count;
+ atomic_add(2, &fbc->sequence);
+ atomic64_add(count, &fbc->count);
__this_cpu_write(*fbc->counters, 0);
- spin_unlock(&fbc->lock);
} else {
__this_cpu_write(*fbc->counters, count);
}
- preempt_enable();
+ local_irq_restore(flags);
}
EXPORT_SYMBOL(__percpu_counter_add);
@@ -95,13 +99,22 @@ s64 __percpu_counter_sum(struct percpu_counter *fbc)
{
s64 ret;
int cpu;
+ unsigned int seq;
spin_lock(&fbc->lock);
- ret = fbc->count;
- for_each_online_cpu(cpu) {
- s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
- ret += *pcount;
- }
+ atomic_inc(&fbc->sequence);
+ do {
+ seq = atomic_read(&fbc->sequence);
+
+ ret = atomic64_read(&fbc->count);
+ for_each_online_cpu(cpu) {
+ s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
+ ret += *pcount;
+ }
+ } while (atomic_read(&fbc->sequence) != seq);
+
+ atomic_inc(&fbc->sequence);
+ ret += atomic64_read(&fbc->slowcount);
spin_unlock(&fbc->lock);
return ret;
}
@@ -112,7 +125,8 @@ int __percpu_counter_init(struct percpu_counter *fbc, s64 amount,
{
spin_lock_init(&fbc->lock);
lockdep_set_class(&fbc->lock, key);
- fbc->count = amount;
+ atomic64_set(&fbc->count, amount);
+ atomic64_set(&fbc->slowcount, 0);
fbc->counters = alloc_percpu(s32);
if (!fbc->counters)
return -ENOMEM;
@@ -171,13 +185,10 @@ static int __cpuinit percpu_counter_hotcpu_callback(struct notifier_block *nb,
mutex_lock(&percpu_counters_lock);
list_for_each_entry(fbc, &percpu_counters, list) {
s32 *pcount;
- unsigned long flags;
- spin_lock_irqsave(&fbc->lock, flags);
pcount = per_cpu_ptr(fbc->counters, cpu);
- fbc->count += *pcount;
+ atomic64_add(*pcount, &fbc->count);
*pcount = 0;
- spin_unlock_irqrestore(&fbc->lock, flags);
}
mutex_unlock(&percpu_counters_lock);
#endif
On Fri, May 13, 2011 at 16:51 +0200, Eric Dumazet wrote:
> Here the patch I cooked (on top of linux-2.6)
>
> This solves the problem quite well for me.
>
> Idea is :
>
> Consider _sum() being slow path. It is still serialized by a spinlock().
>
> Add a fbc->sequence, so that _add() can detect a sum() is in flight, and
> directly add to a new atomic64_t field I named "fbc->slowcount" (and not
> touch its percpu s32 variable so that _sum() can get accurate
> percpu_counter 'Value')
>
> Low order bit of the 'sequence' is used to signal _sum() is in flight,
> while _add() threads that overflow their percpu s32 variable do a
> sequence += 2, so that _sum() can detect at least one cpu messed the
> fbc->count and reset its s32 variable. _sum() can restart its loop, but
> since sequence has still low order bit set, we have guarantee that the
> _sum() loop wont be restarted ad infinitum.
>
> Notes : I disabled IRQ in _add() to reduce window, making _add() as fast
> as possible to avoid _sum() extra loops, but its not strictly necessary,
> we can discuss this point, since _sum() is slow path :)
>
> _sum() is accurate and not blocking anymore _add(). It's slowing it a
> bit of course since all _add() will touch fbc->slowcount.
>
> _sum() is about same speed than before in my tests.
>
> On my 8 cpu (Intel(R) Xeon(R) CPU E5450 @ 3.00GHz) machine, and 32bit
> kernel, the :
> loop (10000000 times) {
> p = mmap(128M, ANONYMOUS);
> munmap(p, 128M);
> }
> done on 8 cpus bench :
>
> Before patch :
> real 3m22.759s
> user 0m6.353s
> sys 26m28.919s
>
> After patch :
> real 0m23.420s
> user 0m6.332s
> sys 2m44.561s
>
> Quite good results considering atomic64_add() uses two "lock cmpxchg8b"
> on x86_32 :
>
> 33.03% mmap_test [kernel.kallsyms] [k] unmap_vmas
> 12.99% mmap_test [kernel.kallsyms] [k] atomic64_add_return_cx8
> 5.62% mmap_test [kernel.kallsyms] [k] free_pgd_range
> 3.07% mmap_test [kernel.kallsyms] [k] sysenter_past_esp
> 2.48% mmap_test [kernel.kallsyms] [k] memcpy
> 2.24% mmap_test [kernel.kallsyms] [k] perf_event_mmap
> 2.21% mmap_test [kernel.kallsyms] [k] _raw_spin_lock
> 2.02% mmap_test [vdso] [.] 0xffffe424
> 2.01% mmap_test [kernel.kallsyms] [k] perf_event_mmap_output
> 1.38% mmap_test [kernel.kallsyms] [k] vma_adjust
> 1.24% mmap_test [kernel.kallsyms] [k] sched_clock_local
> 1.23% perf [kernel.kallsyms] [k] __copy_from_user_ll_nozero
> 1.07% mmap_test [kernel.kallsyms] [k] down_write
>
>
> If only one cpu runs the program :
>
> real 0m16.685s
> user 0m0.771s
> sys 0m15.815s
Thinking a bit more, we could allow several _sum() calls in flight (we
would need an atomic_t count of in-flight _sum() calls, not a single bit,
and could remove the spinlock).
This would allow using a separate integer for the add_did_change_fbc_count
signal and would remove one atomic operation in _add() { the
atomic_add(2, &fbc->sequence); of my previous patch }.
Another idea would be to also put fbc->count / fbc->slowcount out of
line, to keep "struct percpu_counter" read-mostly.
I'll send a V2 with this updated scheme.
By the way, I ran the bench on a more recent 2x4x2 machine with a 64-bit
kernel (HP G6: Intel(R) Xeon(R) CPU E5540 @ 2.53GHz).
1) One process started (no contention) :
Before :
real 0m21.372s
user 0m0.680s
sys 0m20.670s
After V1 patch :
real 0m19.941s
user 0m0.750s
sys 0m19.170s
2) 16 processes started
Before patch:
real 2m14.509s
user 0m13.780s
sys 35m24.170s
After V1 patch :
real 0m48.617s
user 0m16.980s
sys 12m9.400s
On Fri, May 13, 2011 at 17:39 +0200, Eric Dumazet wrote:
> Thinking a bit more, we could allow several _sum() in flight (we would
> need an atomic_t counter for counter of _sum(), not a single bit, and
> remove the spinlock.
>
> This would allow using a separate integer for the
> add_did_change_fbc_count and remove one atomic operation in _add() { the
> atomic_add(2, &fbc->sequence); of my previous patch }
>
>
> Another idea would also put fbc->count / fbc->slowcount out of line,
> to keep "struct percpu_counter" read mostly.
>
> I'll send a V2 with this updated schem.
>
Here is V2 of the patch.
The idea is:
We consider _sum() the slow path. We don't try to make it fast [but this
implementation should be better, since I remove the spinlock that used to
serialize _sum() / _add() invocations].
Add a fbc->sum_cnt, so that _add() can detect that a _sum() is in flight
and add directly to a new atomic64_t field I named "fbc->slowcount"
(without touching its percpu s32 variable, so that _sum() can get an
accurate percpu_counter value).
Use an out-of-line structure to make "struct percpu_counter" read-mostly.
This structure uses its own cache line to reduce false sharing.
Each time an _add() thread overflows its percpu s32 variable, it
increments a sequence, so that _sum() can detect that at least one cpu
touched fbc->count and reset its s32 variable. _sum() can then restart
its loop, but since sum_cnt is non-zero, we have a guarantee that the
_sum() loop won't be restarted ad infinitum.
In practice, it should be restarted at most once:
I disabled IRQs in _add() to reduce the window, making _add() as fast as
possible to avoid extra _sum() loops, but it's not strictly necessary; we
can discuss this point, since _sum() is the slow path :)
_sum() is accurate and no longer blocks _add(). It slows _add() down a
bit, of course, since all _add() calls will touch fbc->slowcount.
On my 2x4x2 cpu (Intel(R) Xeon(R) CPU E5540 @ 2.53GHz) machine with a
64-bit kernel, the following bench:

	loop (10000000 times) {
		p = mmap(128M, ANONYMOUS);
		munmap(p, 128M);
	}
1) One process started (no contention) :
Before :
real 0m21.372s
user 0m0.680s
sys 0m20.670s
After V2 patch :
real 0m19.654s
user 0m0.730s
sys 0m18.890s
2) 16 processes started
Before patch:
real 2m14.509s
user 0m13.780s
sys 35m24.170s
After V2 patch:
real 0m35.635s
user 0m17.670s
sys 8m11.020s
Signed-off-by: Eric Dumazet <[email protected]>
---
include/linux/percpu_counter.h | 26 +++++++--
lib/percpu_counter.c | 83 ++++++++++++++++++++-----------
2 files changed, 74 insertions(+), 35 deletions(-)
diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
index 46f6ba5..4aac7f5 100644
--- a/include/linux/percpu_counter.h
+++ b/include/linux/percpu_counter.h
@@ -15,13 +15,25 @@
#ifdef CONFIG_SMP
-struct percpu_counter {
- spinlock_t lock;
- s64 count;
+/*
+ * For performance reasons, we keep this part in a separate cache line
+ */
+struct percpu_counter_rw {
+ atomic64_t count;
+ unsigned int sequence;
+ atomic64_t slowcount;
+
+ /* since we have plenty room, store list here, even if never used */
#ifdef CONFIG_HOTPLUG_CPU
struct list_head list; /* All percpu_counters are on a list */
+ struct percpu_counter *fbc;
#endif
- s32 __percpu *counters;
+} ____cacheline_aligned_in_smp;
+
+struct percpu_counter {
+ atomic_t sum_cnt; /* count of in flight sum() */
+ struct percpu_counter_rw *pcrw;
+ s32 __percpu *counters;
};
extern int percpu_counter_batch;
@@ -60,7 +72,9 @@ static inline s64 percpu_counter_sum(struct percpu_counter *fbc)
static inline s64 percpu_counter_read(struct percpu_counter *fbc)
{
- return fbc->count;
+ struct percpu_counter_rw *pcrw = fbc->pcrw;
+
+ return atomic64_read(&pcrw->count) + atomic64_read(&pcrw->slowcount);
}
/*
@@ -70,7 +84,7 @@ static inline s64 percpu_counter_read(struct percpu_counter *fbc)
*/
static inline s64 percpu_counter_read_positive(struct percpu_counter *fbc)
{
- s64 ret = fbc->count;
+ s64 ret = percpu_counter_read(fbc);
barrier(); /* Prevent reloads of fbc->count */
if (ret >= 0)
diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index 28f2c33..c9c33c1 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -9,6 +9,7 @@
#include <linux/cpu.h>
#include <linux/module.h>
#include <linux/debugobjects.h>
+#include <linux/slab.h>
static LIST_HEAD(percpu_counters);
static DEFINE_MUTEX(percpu_counters_lock);
@@ -58,32 +59,38 @@ static inline void debug_percpu_counter_deactivate(struct percpu_counter *fbc)
void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
{
int cpu;
+ struct percpu_counter_rw *pcrw = fbc->pcrw;
- spin_lock(&fbc->lock);
for_each_possible_cpu(cpu) {
s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
*pcount = 0;
}
- fbc->count = amount;
- spin_unlock(&fbc->lock);
+ atomic64_set(&pcrw->count, amount);
+ atomic64_set(&pcrw->slowcount, 0);
}
EXPORT_SYMBOL(percpu_counter_set);
void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
{
s64 count;
+ unsigned long flags;
+ struct percpu_counter_rw *pcrw = fbc->pcrw;
- preempt_disable();
+ if (atomic_read(&fbc->sum_cnt)) {
+ atomic64_add(amount, &pcrw->slowcount);
+ return;
+ }
+
+ local_irq_save(flags);
count = __this_cpu_read(*fbc->counters) + amount;
if (count >= batch || count <= -batch) {
- spin_lock(&fbc->lock);
- fbc->count += count;
+ pcrw->sequence++; /* lazy increment (not atomic) */
+ atomic64_add(count, &pcrw->count);
__this_cpu_write(*fbc->counters, 0);
- spin_unlock(&fbc->lock);
} else {
__this_cpu_write(*fbc->counters, count);
}
- preempt_enable();
+ local_irq_restore(flags);
}
EXPORT_SYMBOL(__percpu_counter_add);
@@ -95,14 +102,25 @@ s64 __percpu_counter_sum(struct percpu_counter *fbc)
{
s64 ret;
int cpu;
+ unsigned int seq;
+ struct percpu_counter_rw *pcrw = fbc->pcrw;
- spin_lock(&fbc->lock);
- ret = fbc->count;
- for_each_online_cpu(cpu) {
- s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
- ret += *pcount;
- }
- spin_unlock(&fbc->lock);
+ atomic_inc(&fbc->sum_cnt);
+ do {
+ seq = pcrw->sequence;
+ smp_rmb();
+
+ ret = atomic64_read(&pcrw->count);
+ for_each_online_cpu(cpu) {
+ s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
+ ret += *pcount;
+ }
+
+ smp_rmb();
+ } while (pcrw->sequence != seq);
+
+ atomic_dec(&fbc->sum_cnt);
+ ret += atomic64_read(&pcrw->slowcount);
return ret;
}
EXPORT_SYMBOL(__percpu_counter_sum);
@@ -110,19 +128,27 @@ EXPORT_SYMBOL(__percpu_counter_sum);
int __percpu_counter_init(struct percpu_counter *fbc, s64 amount,
struct lock_class_key *key)
{
- spin_lock_init(&fbc->lock);
- lockdep_set_class(&fbc->lock, key);
- fbc->count = amount;
+ struct percpu_counter_rw *pcrw;
+
+ pcrw = kzalloc(sizeof(*pcrw), GFP_KERNEL);
+ if (!pcrw)
+ return -ENOMEM;
+ atomic64_set(&pcrw->count, amount);
+
fbc->counters = alloc_percpu(s32);
- if (!fbc->counters)
+ if (!fbc->counters) {
+ kfree(pcrw);
return -ENOMEM;
+ }
+ fbc->pcrw = pcrw;
debug_percpu_counter_activate(fbc);
#ifdef CONFIG_HOTPLUG_CPU
- INIT_LIST_HEAD(&fbc->list);
+ INIT_LIST_HEAD(&pcrw->list);
+ pcrw->fbc = fbc;
mutex_lock(&percpu_counters_lock);
- list_add(&fbc->list, &percpu_counters);
+ list_add(&pcrw->list, &percpu_counters);
mutex_unlock(&percpu_counters_lock);
#endif
return 0;
@@ -138,11 +164,13 @@ void percpu_counter_destroy(struct percpu_counter *fbc)
#ifdef CONFIG_HOTPLUG_CPU
mutex_lock(&percpu_counters_lock);
- list_del(&fbc->list);
+ list_del(&fbc->pcrw->list);
mutex_unlock(&percpu_counters_lock);
#endif
free_percpu(fbc->counters);
fbc->counters = NULL;
+ kfree(fbc->pcrw);
+ fbc->pcrw = NULL;
}
EXPORT_SYMBOL(percpu_counter_destroy);
@@ -161,7 +189,7 @@ static int __cpuinit percpu_counter_hotcpu_callback(struct notifier_block *nb,
{
#ifdef CONFIG_HOTPLUG_CPU
unsigned int cpu;
- struct percpu_counter *fbc;
+ struct percpu_counter_rw *pcrw;
compute_batch_value();
if (action != CPU_DEAD)
@@ -169,15 +197,12 @@ static int __cpuinit percpu_counter_hotcpu_callback(struct notifier_block *nb,
cpu = (unsigned long)hcpu;
mutex_lock(&percpu_counters_lock);
- list_for_each_entry(fbc, &percpu_counters, list) {
+ list_for_each_entry(pcrw, &percpu_counters, list) {
s32 *pcount;
- unsigned long flags;
- spin_lock_irqsave(&fbc->lock, flags);
- pcount = per_cpu_ptr(fbc->counters, cpu);
- fbc->count += *pcount;
+ pcount = per_cpu_ptr(pcrw->fbc->counters, cpu);
+ atomic64_add(*pcount, &pcrw->count);
*pcount = 0;
- spin_unlock_irqrestore(&fbc->lock, flags);
}
mutex_unlock(&percpu_counters_lock);
#endif
On Fri, May 13, 2011 at 18:35 +0200, Eric Dumazet wrote:
If you want to try this patch, please note that I missed the following in
__percpu_counter_init():
fbc->sum_cnt = 0;
> int __percpu_counter_init(struct percpu_counter *fbc, s64 amount,
> struct lock_class_key *key)
> {
> - spin_lock_init(&fbc->lock);
> - lockdep_set_class(&fbc->lock, key);
> - fbc->count = amount;
> + struct percpu_counter_rw *pcrw;
> +
> + pcrw = kzalloc(sizeof(*pcrw), GFP_KERNEL);
> + if (!pcrw)
> + return -ENOMEM;
> + atomic64_set(&pcrw->count, amount);
> +
> fbc->counters = alloc_percpu(s32);
> - if (!fbc->counters)
> + if (!fbc->counters) {
> + kfree(pcrw);
> return -ENOMEM;
> + }
> + fbc->pcrw = pcrw;
fbc->sum_cnt = 0;
>
> debug_percpu_counter_activate(fbc);
>
Shaohua Li reported a scalability problem with many threads doing
mmap()/munmap() calls: the vm_committed_as percpu_counter hits its
spinlock very hard if the size of the mmaped zones is bigger than the
percpu_counter batch. We could tune this batch value, but it is better to
have a more scalable percpu_counter infrastructure.
Shaohua provided some patches to speed up __percpu_counter_add() by
removing the need for a spinlock and using an atomic64_t fbc->count
instead.
The problem with these patches was a possibly large deviation seen by
__percpu_counter_sum().
The idea of this patch is to extend Shaohua's idea:
We consider _sum() the slow path. We don't try to make it fast [but this
implementation should be better, since we remove the spinlock that used
to serialize _sum() / _add() invocations].
Add a fbc->sum_cnt, so that _add() can detect that a _sum() is in flight
and add directly to a new atomic64_t field named "fbc->slowcount"
(without touching its percpu s32 variable, so that _sum() can get a more
accurate percpu_counter value).
Use an out-of-line structure to make "struct percpu_counter" read-mostly.
This structure uses its own cache line to reduce false sharing.
Each time an _add() thread overflows its percpu s32 variable, it
increments a sequence, so that _sum() can detect that at least one cpu
touched fbc->count and reset its s32 variable. _sum() can then restart
its loop, but since sum_cnt is non-zero, we have a guarantee that the
_sum() loop won't be restarted ad infinitum.
_sum() is accurate and no longer blocks _add() [it slows _add() down a
bit, of course, since all _add() calls will touch the shared
fbc->slowcount].
On my 2x4x2 cpu (Intel(R) Xeon(R) CPU E5540 @ 2.53GHz) machine with a
64-bit kernel, the following bench:

	loop (10000000 times) {
		p = mmap(128M, ANONYMOUS);
		munmap(p, 128M);
	}
16 processes started
Before patch:
real 2m14.509s
user 0m13.780s
sys 35m24.170s
After patch:
real 0m34.055s
user 0m17.910s
sys 8m1.680s
Signed-off-by: Eric Dumazet <[email protected]>
CC: Shaohua Li <[email protected]>
CC: Andrew Morton <[email protected]>
CC: Christoph Lameter <[email protected]>
CC: Tejun Heo <[email protected]>
CC: Nick Piggin <[email protected]>
---
V3: remove irq masking in __percpu_counter_add()
initialize fbc->sum_cnt in __percpu_counter_init
include/linux/percpu_counter.h | 26 +++++++---
lib/percpu_counter.c | 79 ++++++++++++++++++++-----------
2 files changed, 72 insertions(+), 33 deletions(-)
diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
index 46f6ba5..4aac7f5 100644
--- a/include/linux/percpu_counter.h
+++ b/include/linux/percpu_counter.h
@@ -15,13 +15,25 @@
#ifdef CONFIG_SMP
-struct percpu_counter {
- spinlock_t lock;
- s64 count;
+/*
+ * For performance reasons, we keep this part in a separate cache line
+ */
+struct percpu_counter_rw {
+ atomic64_t count;
+ unsigned int sequence;
+ atomic64_t slowcount;
+
+ /* since we have plenty room, store list here, even if never used */
#ifdef CONFIG_HOTPLUG_CPU
struct list_head list; /* All percpu_counters are on a list */
+ struct percpu_counter *fbc;
#endif
- s32 __percpu *counters;
+} ____cacheline_aligned_in_smp;
+
+struct percpu_counter {
+ atomic_t sum_cnt; /* count of in flight sum() */
+ struct percpu_counter_rw *pcrw;
+ s32 __percpu *counters;
};
extern int percpu_counter_batch;
@@ -60,7 +72,9 @@ static inline s64 percpu_counter_sum(struct percpu_counter *fbc)
static inline s64 percpu_counter_read(struct percpu_counter *fbc)
{
- return fbc->count;
+ struct percpu_counter_rw *pcrw = fbc->pcrw;
+
+ return atomic64_read(&pcrw->count) + atomic64_read(&pcrw->slowcount);
}
/*
@@ -70,7 +84,7 @@ static inline s64 percpu_counter_read(struct percpu_counter *fbc)
*/
static inline s64 percpu_counter_read_positive(struct percpu_counter *fbc)
{
- s64 ret = fbc->count;
+ s64 ret = percpu_counter_read(fbc);
barrier(); /* Prevent reloads of fbc->count */
if (ret >= 0)
diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index 28f2c33..ff486b2 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -9,6 +9,7 @@
#include <linux/cpu.h>
#include <linux/module.h>
#include <linux/debugobjects.h>
+#include <linux/slab.h>
static LIST_HEAD(percpu_counters);
static DEFINE_MUTEX(percpu_counters_lock);
@@ -58,28 +59,33 @@ static inline void debug_percpu_counter_deactivate(struct percpu_counter *fbc)
void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
{
int cpu;
+ struct percpu_counter_rw *pcrw = fbc->pcrw;
- spin_lock(&fbc->lock);
for_each_possible_cpu(cpu) {
s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
*pcount = 0;
}
- fbc->count = amount;
- spin_unlock(&fbc->lock);
+ atomic64_set(&pcrw->count, amount);
+ atomic64_set(&pcrw->slowcount, 0);
}
EXPORT_SYMBOL(percpu_counter_set);
void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
{
s64 count;
+ struct percpu_counter_rw *pcrw = fbc->pcrw;
+
+ if (atomic_read(&fbc->sum_cnt)) {
+ atomic64_add(amount, &pcrw->slowcount);
+ return;
+ }
preempt_disable();
count = __this_cpu_read(*fbc->counters) + amount;
if (count >= batch || count <= -batch) {
- spin_lock(&fbc->lock);
- fbc->count += count;
+ atomic64_add(count, &pcrw->count);
+ pcrw->sequence++;
__this_cpu_write(*fbc->counters, 0);
- spin_unlock(&fbc->lock);
} else {
__this_cpu_write(*fbc->counters, count);
}
@@ -95,14 +101,25 @@ s64 __percpu_counter_sum(struct percpu_counter *fbc)
{
s64 ret;
int cpu;
+ unsigned int seq;
+ struct percpu_counter_rw *pcrw = fbc->pcrw;
- spin_lock(&fbc->lock);
- ret = fbc->count;
- for_each_online_cpu(cpu) {
- s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
- ret += *pcount;
- }
- spin_unlock(&fbc->lock);
+ atomic_inc(&fbc->sum_cnt);
+ do {
+ seq = pcrw->sequence;
+ smp_rmb();
+
+ ret = atomic64_read(&pcrw->count);
+ for_each_online_cpu(cpu) {
+ s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
+ ret += *pcount;
+ }
+
+ smp_rmb();
+ } while (pcrw->sequence != seq);
+
+ atomic_dec(&fbc->sum_cnt);
+ ret += atomic64_read(&pcrw->slowcount);
return ret;
}
EXPORT_SYMBOL(__percpu_counter_sum);
@@ -110,19 +127,28 @@ EXPORT_SYMBOL(__percpu_counter_sum);
int __percpu_counter_init(struct percpu_counter *fbc, s64 amount,
struct lock_class_key *key)
{
- spin_lock_init(&fbc->lock);
- lockdep_set_class(&fbc->lock, key);
- fbc->count = amount;
+ struct percpu_counter_rw *pcrw;
+
+ pcrw = kzalloc(sizeof(*pcrw), GFP_KERNEL);
+ if (!pcrw)
+ return -ENOMEM;
+ atomic64_set(&pcrw->count, amount);
+
fbc->counters = alloc_percpu(s32);
- if (!fbc->counters)
+ if (!fbc->counters) {
+ kfree(pcrw);
return -ENOMEM;
+ }
+ fbc->pcrw = pcrw;
+ atomic_set(&fbc->sum_cnt, 0);
debug_percpu_counter_activate(fbc);
#ifdef CONFIG_HOTPLUG_CPU
- INIT_LIST_HEAD(&fbc->list);
+ INIT_LIST_HEAD(&pcrw->list);
+ pcrw->fbc = fbc;
mutex_lock(&percpu_counters_lock);
- list_add(&fbc->list, &percpu_counters);
+ list_add(&pcrw->list, &percpu_counters);
mutex_unlock(&percpu_counters_lock);
#endif
return 0;
@@ -138,11 +164,13 @@ void percpu_counter_destroy(struct percpu_counter *fbc)
#ifdef CONFIG_HOTPLUG_CPU
mutex_lock(&percpu_counters_lock);
- list_del(&fbc->list);
+ list_del(&fbc->pcrw->list);
mutex_unlock(&percpu_counters_lock);
#endif
free_percpu(fbc->counters);
fbc->counters = NULL;
+ kfree(fbc->pcrw);
+ fbc->pcrw = NULL;
}
EXPORT_SYMBOL(percpu_counter_destroy);
@@ -161,7 +189,7 @@ static int __cpuinit percpu_counter_hotcpu_callback(struct notifier_block *nb,
{
#ifdef CONFIG_HOTPLUG_CPU
unsigned int cpu;
- struct percpu_counter *fbc;
+ struct percpu_counter_rw *pcrw;
compute_batch_value();
if (action != CPU_DEAD)
@@ -169,15 +197,12 @@ static int __cpuinit percpu_counter_hotcpu_callback(struct notifier_block *nb,
cpu = (unsigned long)hcpu;
mutex_lock(&percpu_counters_lock);
- list_for_each_entry(fbc, &percpu_counters, list) {
+ list_for_each_entry(pcrw, &percpu_counters, list) {
s32 *pcount;
- unsigned long flags;
- spin_lock_irqsave(&fbc->lock, flags);
- pcount = per_cpu_ptr(fbc->counters, cpu);
- fbc->count += *pcount;
+ pcount = per_cpu_ptr(pcrw->fbc->counters, cpu);
+ atomic64_add(*pcount, &pcrw->count);
*pcount = 0;
- spin_unlock_irqrestore(&fbc->lock, flags);
}
mutex_unlock(&percpu_counters_lock);
#endif
On Sat, 2011-05-14 at 06:03 +0800, Eric Dumazet wrote:
> Shaohua Li reported a scalability problem with many threads doing
> mmap()/munmap() calls. vm_committed_as percpu_counter is hitting its
> spinlock very hard, if size of mmaped zones are bigger than
> percpu_counter batch. We could tune this batch value but better have a
> more scalable percpu_counter infrastructure.
>
> Shaohua provided some patches to speedup __percpu_counter_add(), by
> removing the need to use a spinlock and use an atomic64_t fbc->count
> instead.
>
> Problem of these patches were a possible big deviation seen by
> __percpu_counter_sum()
>
> Idea of this patch is to extend Shaohua idea :
>
> We consider _sum() being slow path. We dont try to make it fast [ but
> this implementation should be better since we remove the spinlock that
> used to serialize _sum() / _add() invocations ]
>
> Add a fbc->sum_cnt, so that _add() can detect a _sum() is in flight, and
> directly add to a new atomic64_t field named "fbc->slowcount" (and not
> touch its percpu s32 variable so that _sum() can get more accurate
> percpu_counter 'Value')
>
> Use an out of line structure to make "struct percpu_count" mostly read
> This structure uses its own cache line to reduce false sharing.
>
> Each time one _add() thread overflows its percpu s32 variable, do an
> increment of a sequence, so that _sum() can detect at least one cpu
> messed the fbc->count and reset its s32 variable.
> _sum() can restart its loop, but since sum_cnt is non null, we have
> guarantee that the _sum() loop wont be restarted ad infinitum.
>
> _sum() is accurate and not blocking anymore _add() [ It's slowing it a
> bit of course since all _add() will touch shared fbc->slowcount ]
>
> On my 2x4x2 cpu (Intel(R) Xeon(R) CPU E5540 @ 2.53GHz) machine, and
> 64bit kernel, the following bench :
>
> loop (10000000 times) {
> p = mmap(128M, ANONYMOUS);
> munmap(p, 128M);
> }
>
> 16 processes started
>
> Before patch:
> real 2m14.509s
> user 0m13.780s
> sys 35m24.170s
>
> After patch:
> real 0m34.055s
> user 0m17.910s
> sys 8m1.680s
>
> Signed-off-by: Eric Dumazet <[email protected]>
> CC: Shaohua Li <[email protected]>
> CC: Andrew Morton <[email protected]>
> CC: Christoph Lameter <[email protected]>
> CC: Tejun Heo <[email protected]>
> CC: Nick Piggin <[email protected]>
> ---
> V3: remove irq masking in __percpu_counter_add()
> initialize fbc->sum_cnt in __percpu_counter_init
>
> include/linux/percpu_counter.h | 26 +++++++---
> lib/percpu_counter.c | 79 ++++++++++++++++++++-----------
> 2 files changed, 72 insertions(+), 33 deletions(-)
>
> diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
> index 46f6ba5..4aac7f5 100644
> --- a/include/linux/percpu_counter.h
> +++ b/include/linux/percpu_counter.h
> @@ -15,13 +15,25 @@
>
> #ifdef CONFIG_SMP
>
> -struct percpu_counter {
> - spinlock_t lock;
> - s64 count;
> +/*
> + * For performance reasons, we keep this part in a separate cache line
> + */
> +struct percpu_counter_rw {
> + atomic64_t count;
> + unsigned int sequence;
> + atomic64_t slowcount;
> +
> + /* since we have plenty room, store list here, even if never used */
> #ifdef CONFIG_HOTPLUG_CPU
> struct list_head list; /* All percpu_counters are on a list */
> + struct percpu_counter *fbc;
> #endif
> - s32 __percpu *counters;
> +} ____cacheline_aligned_in_smp;
> +
> +struct percpu_counter {
> + atomic_t sum_cnt; /* count of in flight sum() */
> + struct percpu_counter_rw *pcrw;
> + s32 __percpu *counters;
> };
>
> extern int percpu_counter_batch;
> @@ -60,7 +72,9 @@ static inline s64 percpu_counter_sum(struct percpu_counter *fbc)
>
> static inline s64 percpu_counter_read(struct percpu_counter *fbc)
> {
> - return fbc->count;
> + struct percpu_counter_rw *pcrw = fbc->pcrw;
> +
> + return atomic64_read(&pcrw->count) + atomic64_read(&pcrw->slowcount);
> }
>
> /*
> @@ -70,7 +84,7 @@ static inline s64 percpu_counter_read(struct percpu_counter *fbc)
> */
> static inline s64 percpu_counter_read_positive(struct percpu_counter *fbc)
> {
> - s64 ret = fbc->count;
> + s64 ret = percpu_counter_read(fbc);
>
> barrier(); /* Prevent reloads of fbc->count */
> if (ret >= 0)
> diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
> index 28f2c33..ff486b2 100644
> --- a/lib/percpu_counter.c
> +++ b/lib/percpu_counter.c
> @@ -9,6 +9,7 @@
> #include <linux/cpu.h>
> #include <linux/module.h>
> #include <linux/debugobjects.h>
> +#include <linux/slab.h>
>
> static LIST_HEAD(percpu_counters);
> static DEFINE_MUTEX(percpu_counters_lock);
> @@ -58,28 +59,33 @@ static inline void debug_percpu_counter_deactivate(struct percpu_counter *fbc)
> void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
> {
> int cpu;
> + struct percpu_counter_rw *pcrw = fbc->pcrw;
>
> - spin_lock(&fbc->lock);
> for_each_possible_cpu(cpu) {
> s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
> *pcount = 0;
> }
> - fbc->count = amount;
> - spin_unlock(&fbc->lock);
> + atomic64_set(&pcrw->count, amount);
> + atomic64_set(&pcrw->slowcount, 0);
> }
> EXPORT_SYMBOL(percpu_counter_set);
>
> void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
> {
> s64 count;
> + struct percpu_counter_rw *pcrw = fbc->pcrw;
> +
> + if (atomic_read(&fbc->sum_cnt)) {
> + atomic64_add(amount, &pcrw->slowcount);
> + return;
> + }
>
> preempt_disable();
> count = __this_cpu_read(*fbc->counters) + amount;
> if (count >= batch || count <= -batch) {
> - spin_lock(&fbc->lock);
> - fbc->count += count;
> + atomic64_add(count, &pcrw->count);
So if _sum starts and ends here, _sum can still see a deviation.
I had a patch which uses the idea I described in my last email and should
remove the deviation, see below. It delays _add if _sum is running, which
sounds scary, but since _sum is called quite rarely this isn't a big
problem. We could further convert add_start below to a percpu counter to
reduce cache line bouncing.
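Roughly something like this (just my sketch of the per-cpu add_start
idea, not part of the patch below; names and barriers are illustrative
only):

struct percpu_counter {
	atomic_t sum_start;
	int __percpu *add_start;	/* was: atomic_t add_start */
	atomic64_t count;
	s32 __percpu *counters;
};

/* in __percpu_counter_add(), under preempt_disable() */
static void add_start_enter(struct percpu_counter *fbc)
{
	for (;;) {
		__this_cpu_inc(*fbc->add_start);
		smp_mb();	/* pairs with the scan in _sum() */
		if (!atomic_read(&fbc->sum_start))
			break;
		__this_cpu_dec(*fbc->add_start);
		while (atomic_read(&fbc->sum_start))
			cpu_relax();
	}
}

/* in __percpu_counter_sum(), after bumping sum_start */
static void sum_wait_adders(struct percpu_counter *fbc)
{
	int cpu;

	smp_mb();
	for_each_online_cpu(cpu)
		while (*per_cpu_ptr(fbc->add_start, cpu))
			cpu_relax();
}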
---
include/linux/percpu_counter.h | 18 ++++-------------
lib/percpu_counter.c | 43 +++++++++++++++++++++++++----------------
2 files changed, 32 insertions(+), 29 deletions(-)
Index: linux/include/linux/percpu_counter.h
===================================================================
--- linux.orig/include/linux/percpu_counter.h 2011-05-13 11:13:25.000000000 +0800
+++ linux/include/linux/percpu_counter.h 2011-05-13 16:22:41.000000000 +0800
@@ -16,8 +16,9 @@
#ifdef CONFIG_SMP
struct percpu_counter {
+ atomic_t sum_start, add_start;
+ atomic64_t count;
spinlock_t lock;
- s64 count;
#ifdef CONFIG_HOTPLUG_CPU
struct list_head list; /* All percpu_counters are on a list */
#endif
@@ -26,16 +27,7 @@ struct percpu_counter {
extern int percpu_counter_batch;
-int __percpu_counter_init(struct percpu_counter *fbc, s64 amount,
- struct lock_class_key *key);
-
-#define percpu_counter_init(fbc, value) \
- ({ \
- static struct lock_class_key __key; \
- \
- __percpu_counter_init(fbc, value, &__key); \
- })
-
+int percpu_counter_init(struct percpu_counter *fbc, s64 amount);
void percpu_counter_destroy(struct percpu_counter *fbc);
void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch);
@@ -60,7 +52,7 @@ static inline s64 percpu_counter_sum(str
static inline s64 percpu_counter_read(struct percpu_counter *fbc)
{
- return fbc->count;
+ return atomic64_read(&fbc->count);
}
/*
@@ -70,7 +62,7 @@ static inline s64 percpu_counter_read(st
*/
static inline s64 percpu_counter_read_positive(struct percpu_counter *fbc)
{
- s64 ret = fbc->count;
+ s64 ret = percpu_counter_read(fbc);
barrier(); /* Prevent reloads of fbc->count */
if (ret >= 0)
Index: linux/lib/percpu_counter.c
===================================================================
--- linux.orig/lib/percpu_counter.c 2011-05-13 10:29:04.000000000 +0800
+++ linux/lib/percpu_counter.c 2011-05-13 16:22:03.000000000 +0800
@@ -59,13 +59,11 @@ void percpu_counter_set(struct percpu_co
{
int cpu;
- spin_lock(&fbc->lock);
for_each_possible_cpu(cpu) {
s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
*pcount = 0;
}
- fbc->count = amount;
- spin_unlock(&fbc->lock);
+ atomic64_set(&fbc->count, amount);
}
EXPORT_SYMBOL(percpu_counter_set);
@@ -76,10 +74,20 @@ void __percpu_counter_add(struct percpu_
preempt_disable();
count = __this_cpu_read(*fbc->counters) + amount;
if (count >= batch || count <= -batch) {
- spin_lock(&fbc->lock);
- fbc->count += count;
+ while (1) {
+ atomic_inc_return(&fbc->add_start);
+ if (atomic_read(&fbc->sum_start) != 0)
+ atomic_dec(&fbc->add_start);
+ else
+ break;
+ while (atomic_read(&fbc->sum_start) != 0)
+ cpu_relax();
+ }
+
+ atomic64_add(count, &fbc->count);
__this_cpu_write(*fbc->counters, 0);
- spin_unlock(&fbc->lock);
+
+ atomic_dec(&fbc->add_start);
} else {
__this_cpu_write(*fbc->counters, count);
}
@@ -97,22 +105,28 @@ s64 __percpu_counter_sum(struct percpu_c
int cpu;
spin_lock(&fbc->lock);
- ret = fbc->count;
+ atomic_inc_return(&fbc->sum_start);
+ while (atomic_read(&fbc->add_start) != 0)
+ cpu_relax();
+
+ ret = atomic64_read(&fbc->count);
for_each_online_cpu(cpu) {
s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
ret += *pcount;
}
+
+ atomic_dec(&fbc->sum_start);
spin_unlock(&fbc->lock);
return ret;
}
EXPORT_SYMBOL(__percpu_counter_sum);
-int __percpu_counter_init(struct percpu_counter *fbc, s64 amount,
- struct lock_class_key *key)
+int percpu_counter_init(struct percpu_counter *fbc, s64 amount)
{
spin_lock_init(&fbc->lock);
- lockdep_set_class(&fbc->lock, key);
- fbc->count = amount;
+ atomic64_set(&fbc->count, amount);
+ atomic_set(&fbc->sum_start, 0);
+ atomic_set(&fbc->add_start, 0);
fbc->counters = alloc_percpu(s32);
if (!fbc->counters)
return -ENOMEM;
@@ -127,7 +141,7 @@ int __percpu_counter_init(struct percpu_
#endif
return 0;
}
-EXPORT_SYMBOL(__percpu_counter_init);
+EXPORT_SYMBOL(percpu_counter_init);
void percpu_counter_destroy(struct percpu_counter *fbc)
{
@@ -171,13 +185,10 @@ static int __cpuinit percpu_counter_hotc
mutex_lock(&percpu_counters_lock);
list_for_each_entry(fbc, &percpu_counters, list) {
s32 *pcount;
- unsigned long flags;
- spin_lock_irqsave(&fbc->lock, flags);
pcount = per_cpu_ptr(fbc->counters, cpu);
- fbc->count += *pcount;
+ atomic64_add(*pcount, &fbc->count);
*pcount = 0;
- spin_unlock_irqrestore(&fbc->lock, flags);
}
mutex_unlock(&percpu_counters_lock);
#endif
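The add/sum handshake in the hunks above, compressed into a user-space sketch
(C11 atomics stand in for the kernel atomic_t/atomic64_t API; this is an
illustration of the protocol only, not the patch itself):

#include <stdatomic.h>

static atomic_int sum_start, add_start;
static atomic_llong count;                      /* plays the role of fbc->count */

static void add_slowpath(long long local)       /* local delta crossed the batch */
{
        for (;;) {
                atomic_fetch_add(&add_start, 1);        /* announce the pending add */
                if (atomic_load(&sum_start) == 0)
                        break;                          /* no _sum() in flight */
                atomic_fetch_sub(&add_start, 1);        /* a _sum() is running: back off */
                while (atomic_load(&sum_start) != 0)
                        ;                               /* cpu_relax() in the kernel */
        }
        atomic_fetch_add(&count, local);
        /* the kernel patch resets the per-cpu s32 to 0 right here */
        atomic_fetch_sub(&add_start, 1);
}

static long long sum(const int *percpu, int ncpu)
{
        long long ret;
        int cpu;

        atomic_fetch_add(&sum_start, 1);        /* block new slow-path adds */
        while (atomic_load(&add_start) != 0)
                ;                               /* wait for in-flight adds to finish */
        ret = atomic_load(&count);
        for (cpu = 0; cpu < ncpu; cpu++)
                ret += percpu[cpu];             /* fold the per-cpu deltas */
        atomic_fetch_sub(&sum_start, 1);
        return ret;
}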
On Mon, 2011-05-16 at 08:58 +0800, Shaohua Li wrote:
> so if _sum starts and ends here, _sum can still get deviation.
This makes no sense at all. If you have so many cpus 'here' right before
you increment fbc->sum_cnt, then no matter how precise and super
cautious you are in your _sum() implementation, as soon as you exit from
sum(), other cpus already changed the percpu counter global value.
> @@ -76,10 +74,20 @@ void __percpu_counter_add(struct percpu_
> preempt_disable();
> count = __this_cpu_read(*fbc->counters) + amount;
> if (count >= batch || count <= -batch) {
> - spin_lock(&fbc->lock);
> - fbc->count += count;
> + while (1) {
> + atomic_inc_return(&fbc->add_start);
> + if (atomic_read(&fbc->sum_start) != 0)
> + atomic_dec(&fbc->add_start);
> + else
> + break;
> + while (atomic_read(&fbc->sum_start) != 0)
> + cpu_relax();
> + }
> +
> + atomic64_add(count, &fbc->count);
> __this_cpu_write(*fbc->counters, 0);
> - spin_unlock(&fbc->lock);
> +
> + atomic_dec(&fbc->add_start);
> } else {
> __this_cpu_write(*fbc->counters, count);
> }
>
This is way too heavy. You have 3 atomic ops here and a very slow
atomic_inc_return() in the fast path [ not all machines are x86 ].
Not all percpu_counters are used in such a degenerate way. Most of them hit
the global count only rarely.
Your version slows down a very common case (a single cpu calling _add()
several times, for example the network stack on the input path).
With fbc->counters in the same cache line as fbc->add_start/sum_start and
the rest, I bet everything will be very slow during a _sum() on a 4096 cpu
machine, especially if that _sum() is interrupted by some long lasting
interrupt.
I believe the 'deviation' risk is almost null with my patch.
Remember percpu_counter is not an exact counter but a very lazy one.
(Only requirement is to not have drift)
The risk is small especially if we move the :
__this_cpu_write(*fbc->counters, 0);
before the :
atomic64_add(count, &fbc->count);
and then do the sequence increment _after_ this.
Here is my V4: we don't need the second fbc->slowcount, given that sum() gets
fbc->count after the folding, not before. If some cpus enter _add()
while _sum() is running, they'll see the sum_cnt signal and change
fbc->count immediately.
I also use the following sequence in _add():
__this_cpu_write(*fbc->counters, 0);
atomic64_add(count, &pcrw->count);
pcrw->sequence++;
include/linux/percpu_counter.h | 25 +++++++--
lib/percpu_counter.c | 78 ++++++++++++++++++++-----------
2 files changed, 70 insertions(+), 33 deletions(-)
diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
index 46f6ba5..e3e62b1 100644
--- a/include/linux/percpu_counter.h
+++ b/include/linux/percpu_counter.h
@@ -15,13 +15,24 @@
#ifdef CONFIG_SMP
-struct percpu_counter {
- spinlock_t lock;
- s64 count;
+/*
+ * For performance reasons, we keep this part in a separate cache line
+ */
+struct percpu_counter_rw {
+ atomic64_t count;
+ unsigned int sequence;
+
+ /* since we have plenty room, store list here, even if never used */
#ifdef CONFIG_HOTPLUG_CPU
struct list_head list; /* All percpu_counters are on a list */
+ struct percpu_counter *fbc;
#endif
- s32 __percpu *counters;
+} ____cacheline_aligned_in_smp;
+
+struct percpu_counter {
+ atomic_t sum_cnt; /* count of in flight sum() */
+ struct percpu_counter_rw *pcrw;
+ s32 __percpu *counters;
};
extern int percpu_counter_batch;
@@ -60,7 +71,9 @@ static inline s64 percpu_counter_sum(struct percpu_counter *fbc)
static inline s64 percpu_counter_read(struct percpu_counter *fbc)
{
- return fbc->count;
+ struct percpu_counter_rw *pcrw = fbc->pcrw;
+
+ return atomic64_read(&pcrw->count);
}
/*
@@ -70,7 +83,7 @@ static inline s64 percpu_counter_read(struct percpu_counter *fbc)
*/
static inline s64 percpu_counter_read_positive(struct percpu_counter *fbc)
{
- s64 ret = fbc->count;
+ s64 ret = percpu_counter_read(fbc);
barrier(); /* Prevent reloads of fbc->count */
if (ret >= 0)
diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index 28f2c33..27292ba 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -9,6 +9,7 @@
#include <linux/cpu.h>
#include <linux/module.h>
#include <linux/debugobjects.h>
+#include <linux/slab.h>
static LIST_HEAD(percpu_counters);
static DEFINE_MUTEX(percpu_counters_lock);
@@ -58,28 +59,32 @@ static inline void debug_percpu_counter_deactivate(struct percpu_counter *fbc)
void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
{
int cpu;
+ struct percpu_counter_rw *pcrw = fbc->pcrw;
- spin_lock(&fbc->lock);
for_each_possible_cpu(cpu) {
s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
*pcount = 0;
}
- fbc->count = amount;
- spin_unlock(&fbc->lock);
+ atomic64_set(&pcrw->count, amount);
}
EXPORT_SYMBOL(percpu_counter_set);
void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
{
s64 count;
+ struct percpu_counter_rw *pcrw = fbc->pcrw;
+
+ if (atomic_read(&fbc->sum_cnt)) {
+ atomic64_add(amount, &pcrw->count);
+ return;
+ }
preempt_disable();
count = __this_cpu_read(*fbc->counters) + amount;
if (count >= batch || count <= -batch) {
- spin_lock(&fbc->lock);
- fbc->count += count;
__this_cpu_write(*fbc->counters, 0);
- spin_unlock(&fbc->lock);
+ atomic64_add(count, &pcrw->count);
+ pcrw->sequence++;
} else {
__this_cpu_write(*fbc->counters, count);
}
@@ -95,14 +100,25 @@ s64 __percpu_counter_sum(struct percpu_counter *fbc)
{
s64 ret;
int cpu;
+ unsigned int seq;
+ struct percpu_counter_rw *pcrw = fbc->pcrw;
- spin_lock(&fbc->lock);
- ret = fbc->count;
- for_each_online_cpu(cpu) {
- s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
- ret += *pcount;
- }
- spin_unlock(&fbc->lock);
+ atomic_inc(&fbc->sum_cnt);
+ do {
+ seq = pcrw->sequence;
+ smp_rmb();
+
+ ret = 0;
+ for_each_online_cpu(cpu) {
+ s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
+ ret += *pcount;
+ }
+ ret += atomic64_read(&pcrw->count);
+
+ smp_rmb();
+ } while (pcrw->sequence != seq);
+
+ atomic_dec(&fbc->sum_cnt);
return ret;
}
EXPORT_SYMBOL(__percpu_counter_sum);
@@ -110,19 +126,28 @@ EXPORT_SYMBOL(__percpu_counter_sum);
int __percpu_counter_init(struct percpu_counter *fbc, s64 amount,
struct lock_class_key *key)
{
- spin_lock_init(&fbc->lock);
- lockdep_set_class(&fbc->lock, key);
- fbc->count = amount;
+ struct percpu_counter_rw *pcrw;
+
+ pcrw = kzalloc(sizeof(*pcrw), GFP_KERNEL);
+ if (!pcrw)
+ return -ENOMEM;
+ atomic64_set(&pcrw->count, amount);
+
fbc->counters = alloc_percpu(s32);
- if (!fbc->counters)
+ if (!fbc->counters) {
+ kfree(pcrw);
return -ENOMEM;
+ }
+ fbc->pcrw = pcrw;
+ atomic_set(&fbc->sum_cnt, 0);
debug_percpu_counter_activate(fbc);
#ifdef CONFIG_HOTPLUG_CPU
- INIT_LIST_HEAD(&fbc->list);
+ INIT_LIST_HEAD(&pcrw->list);
+ pcrw->fbc = fbc;
mutex_lock(&percpu_counters_lock);
- list_add(&fbc->list, &percpu_counters);
+ list_add(&pcrw->list, &percpu_counters);
mutex_unlock(&percpu_counters_lock);
#endif
return 0;
@@ -138,11 +163,13 @@ void percpu_counter_destroy(struct percpu_counter *fbc)
#ifdef CONFIG_HOTPLUG_CPU
mutex_lock(&percpu_counters_lock);
- list_del(&fbc->list);
+ list_del(&fbc->pcrw->list);
mutex_unlock(&percpu_counters_lock);
#endif
free_percpu(fbc->counters);
fbc->counters = NULL;
+ kfree(fbc->pcrw);
+ fbc->pcrw = NULL;
}
EXPORT_SYMBOL(percpu_counter_destroy);
@@ -161,7 +188,7 @@ static int __cpuinit percpu_counter_hotcpu_callback(struct notifier_block *nb,
{
#ifdef CONFIG_HOTPLUG_CPU
unsigned int cpu;
- struct percpu_counter *fbc;
+ struct percpu_counter_rw *pcrw;
compute_batch_value();
if (action != CPU_DEAD)
@@ -169,15 +196,12 @@ static int __cpuinit percpu_counter_hotcpu_callback(struct notifier_block *nb,
cpu = (unsigned long)hcpu;
mutex_lock(&percpu_counters_lock);
- list_for_each_entry(fbc, &percpu_counters, list) {
+ list_for_each_entry(pcrw, &percpu_counters, list) {
s32 *pcount;
- unsigned long flags;
- spin_lock_irqsave(&fbc->lock, flags);
- pcount = per_cpu_ptr(fbc->counters, cpu);
- fbc->count += *pcount;
+ pcount = per_cpu_ptr(pcrw->fbc->counters, cpu);
+ atomic64_add(*pcount, &pcrw->count);
*pcount = 0;
- spin_unlock_irqrestore(&fbc->lock, flags);
}
mutex_unlock(&percpu_counters_lock);
#endif
On Mon, 2011-05-16 at 14:11 +0800, Eric Dumazet wrote:
> Le lundi 16 mai 2011 à 08:58 +0800, Shaohua Li a écrit :
>
> > so if _sum starts and ends here, _sum can still get deviation.
>
> This makes no sense at all. If you have so many cpus 'here' right before
> you increment fbc->sum_cnt, then no matter how precise and super
> cautious you are in your _sum() implementation, as soon as you exit from
> sum(), other cpus already changed the percpu counter global value.
I don't agree here. The original implementation also has just a quite
small window where we get deviation; the window only exists between these two
lines:
atomic64_add(count, &fbc->count);
__this_cpu_write(*fbc->counters, 0);
If you think we should ignore it, we'd better not use any protection
here.
> > @@ -76,10 +74,20 @@ void __percpu_counter_add(struct percpu_
> > preempt_disable();
> > count = __this_cpu_read(*fbc->counters) + amount;
> > if (count >= batch || count <= -batch) {
> > - spin_lock(&fbc->lock);
> > - fbc->count += count;
> > + while (1) {
> > + atomic_inc_return(&fbc->add_start);
> > + if (atomic_read(&fbc->sum_start) != 0)
> > + atomic_dec(&fbc->add_start);
> > + else
> > + break;
> > + while (atomic_read(&fbc->sum_start) != 0)
> > + cpu_relax();
> > + }
> > +
> > + atomic64_add(count, &fbc->count);
> > __this_cpu_write(*fbc->counters, 0);
> > - spin_unlock(&fbc->lock);
> > +
> > + atomic_dec(&fbc->add_start);
> > } else {
> > __this_cpu_write(*fbc->counters, count);
> > }
> >
>
> This is way too heavy. You have 3 atomic ops here and a very slow
> atomic_inc_return() in fast path [ not all machines are x86].
>
> Not all percpu_counters are used in degenerated way. Most of them hit
> the global count not very often.
>
> Your version slows down a very common case (one cpu only calling _add()
> several times, for example network stack in input path)
>
> fbc->counters being in same cache line than fbc->add_start/sum_start and
> all, I bet everything will be very slow during a _sum() on a 4096 cpu
> machine, especially if this _sum() is interrupted by some long lasting
> interrupt.
as I wrote in the email, the atomic and cacheline issue can be resolved
with a per_cpu data, I just didn't post the patch. I post it this time,
please see below. There is no cache line bounce anymore.
> I believe the 'deviation' risk is almost null with my patch.
> Remember percpu_counter is not an exact counter but a very lazy one.
> (Only requirement is to not have drift)
>
> The risk is small especially if we move the :
> __this_cpu_write(*fbc->counters, 0);
> before the :
> atomic64_add(count, &fbc->count);
>
> and then do the sequence increment _after_ this.
>
>
>
> Here is my V4: we don't need the second fbc->slowcount, given that sum() gets
> fbc->count after the folding, not before. If some cpus enter _add()
> while _sum() is running, they'll see the sum_cnt signal and change
> fbc->count immediately.
>
> I also use the following sequence in _add():
>
> __this_cpu_write(*fbc->counters, 0);
We still have the deviation issue if _sum starts and ends here; this
doesn't change anything.
> atomic64_add(count, &pcrw->count);
> pcrw->sequence++;
>
>
> include/linux/percpu_counter.h | 25 +++++++--
> lib/percpu_counter.c | 78 ++++++++++++++++++++-----------
> 2 files changed, 70 insertions(+), 33 deletions(-)
>
> diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
> index 46f6ba5..e3e62b1 100644
> --- a/include/linux/percpu_counter.h
> +++ b/include/linux/percpu_counter.h
> @@ -15,13 +15,24 @@
>
> #ifdef CONFIG_SMP
>
> -struct percpu_counter {
> - spinlock_t lock;
> - s64 count;
> +/*
> + * For performance reasons, we keep this part in a separate cache line
> + */
> +struct percpu_counter_rw {
> + atomic64_t count;
> + unsigned int sequence;
> +
> + /* since we have plenty room, store list here, even if never used */
> #ifdef CONFIG_HOTPLUG_CPU
> struct list_head list; /* All percpu_counters are on a list */
> + struct percpu_counter *fbc;
> #endif
> - s32 __percpu *counters;
> +} ____cacheline_aligned_in_smp;
> +
> +struct percpu_counter {
> + atomic_t sum_cnt; /* count of in flight sum() */
> + struct percpu_counter_rw *pcrw;
> + s32 __percpu *counters;
> };
>
> extern int percpu_counter_batch;
> @@ -60,7 +71,9 @@ static inline s64 percpu_counter_sum(struct percpu_counter *fbc)
>
> static inline s64 percpu_counter_read(struct percpu_counter *fbc)
> {
> - return fbc->count;
> + struct percpu_counter_rw *pcrw = fbc->pcrw;
> +
> + return atomic64_read(&pcrw->count);
> }
>
> /*
> @@ -70,7 +83,7 @@ static inline s64 percpu_counter_read(struct percpu_counter *fbc)
> */
> static inline s64 percpu_counter_read_positive(struct percpu_counter *fbc)
> {
> - s64 ret = fbc->count;
> + s64 ret = percpu_counter_read(fbc);
>
> barrier(); /* Prevent reloads of fbc->count */
> if (ret >= 0)
> diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
> index 28f2c33..27292ba 100644
> --- a/lib/percpu_counter.c
> +++ b/lib/percpu_counter.c
> @@ -9,6 +9,7 @@
> #include <linux/cpu.h>
> #include <linux/module.h>
> #include <linux/debugobjects.h>
> +#include <linux/slab.h>
>
> static LIST_HEAD(percpu_counters);
> static DEFINE_MUTEX(percpu_counters_lock);
> @@ -58,28 +59,32 @@ static inline void debug_percpu_counter_deactivate(struct percpu_counter *fbc)
> void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
> {
> int cpu;
> + struct percpu_counter_rw *pcrw = fbc->pcrw;
>
> - spin_lock(&fbc->lock);
> for_each_possible_cpu(cpu) {
> s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
> *pcount = 0;
> }
> - fbc->count = amount;
> - spin_unlock(&fbc->lock);
> + atomic64_set(&pcrw->count, amount);
> }
> EXPORT_SYMBOL(percpu_counter_set);
>
> void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
> {
> s64 count;
> + struct percpu_counter_rw *pcrw = fbc->pcrw;
> +
> + if (atomic_read(&fbc->sum_cnt)) {
> + atomic64_add(amount, &pcrw->count);
> + return;
> + }
>
> preempt_disable();
> count = __this_cpu_read(*fbc->counters) + amount;
> if (count >= batch || count <= -batch) {
> - spin_lock(&fbc->lock);
> - fbc->count += count;
> __this_cpu_write(*fbc->counters, 0);
> - spin_unlock(&fbc->lock);
> + atomic64_add(count, &pcrw->count);
An smp_wmb() or atomic64_add_return() is needed here to guarantee the changes
are seen before the sequence++.
> + pcrw->sequence++;
sequence++ can introduce cache line bouncing.
add_start causes a lot of cache line bouncing because it's updated by all
cpus. We can actually make it a percpu variable, which removes the
bouncing completely.
With this patch on top of the last one, the workload described in the last
patch runs about 7x faster; with the last patch alone it is only about 4x
faster.
This doesn't slow down _sum because we removed the lock from _sum. I did
a stress test with 23 cpus running _add and one cpu running _sum. When _add
takes the fast path (no lock held), _sum runs a little slower (about 20%).
When _add takes the slow path (lock held), _sum runs much faster (about 9x).
Signed-off-by: Shaohua Li <[email protected]>
---
include/linux/percpu_counter.h | 3 ++-
lib/percpu_counter.c | 22 ++++++++++++++++------
2 files changed, 18 insertions(+), 7 deletions(-)
Index: linux/include/linux/percpu_counter.h
===================================================================
--- linux.orig/include/linux/percpu_counter.h 2011-05-16 10:26:05.000000000 +0800
+++ linux/include/linux/percpu_counter.h 2011-05-16 10:27:48.000000000 +0800
@@ -16,12 +16,13 @@
#ifdef CONFIG_SMP
struct percpu_counter {
- atomic_t sum_start, add_start;
+ atomic_t sum_start;
atomic64_t count;
#ifdef CONFIG_HOTPLUG_CPU
struct list_head list; /* All percpu_counters are on a list */
#endif
s32 __percpu *counters;
+ char __percpu *add_starts;
};
extern int percpu_counter_batch;
Index: linux/lib/percpu_counter.c
===================================================================
--- linux.orig/lib/percpu_counter.c 2011-05-16 10:26:58.000000000 +0800
+++ linux/lib/percpu_counter.c 2011-05-16 10:46:12.000000000 +0800
@@ -75,10 +75,12 @@ void __percpu_counter_add(struct percpu_
count = __this_cpu_read(*fbc->counters) + amount;
if (count >= batch || count <= -batch) {
while (1) {
- atomic_inc_return(&fbc->add_start);
+ __this_cpu_write(*fbc->add_starts, 1);
+ /* Guarantee add_starts is seen by _sum */
+ smp_wmb();
if (atomic_read(&fbc->sum_start) == 0)
break;
- atomic_dec(&fbc->add_start);
+ __this_cpu_write(*fbc->add_starts, 0);
while (atomic_read(&fbc->sum_start) != 0)
cpu_relax();
}
@@ -86,7 +88,7 @@ void __percpu_counter_add(struct percpu_
atomic64_add(count, &fbc->count);
__this_cpu_write(*fbc->counters, 0);
- atomic_dec(&fbc->add_start);
+ __this_cpu_write(*fbc->add_starts, 0);
} else {
__this_cpu_write(*fbc->counters, count);
}
@@ -104,8 +106,10 @@ s64 __percpu_counter_sum(struct percpu_c
int cpu;
atomic_inc_return(&fbc->sum_start);
- while (atomic_read(&fbc->add_start) != 0)
- cpu_relax();
+ for_each_online_cpu(cpu) {
+ while (*per_cpu_ptr(fbc->add_starts, cpu) != 0)
+ cpu_relax();
+ }
ret = atomic64_read(&fbc->count);
for_each_online_cpu(cpu) {
@@ -122,10 +126,15 @@ int percpu_counter_init(struct percpu_co
{
atomic64_set(&fbc->count, amount);
atomic_set(&fbc->sum_start, 0);
- atomic_set(&fbc->add_start, 0);
fbc->counters = alloc_percpu(s32);
if (!fbc->counters)
return -ENOMEM;
+ fbc->add_starts = alloc_percpu(char);
+ if (!fbc->add_starts) {
+ free_percpu(fbc->counters);
+ return -ENOMEM;
+ }
+
debug_percpu_counter_activate(fbc);
@@ -152,6 +161,7 @@ void percpu_counter_destroy(struct percp
mutex_unlock(&percpu_counters_lock);
#endif
free_percpu(fbc->counters);
+ free_percpu(fbc->add_starts);
fbc->counters = NULL;
}
EXPORT_SYMBOL(percpu_counter_destroy);
On Mon, 2011-05-16 at 14:37 +0800, Shaohua Li wrote:
> On Mon, 2011-05-16 at 14:11 +0800, Eric Dumazet wrote:
> > Le lundi 16 mai 2011 à 08:58 +0800, Shaohua Li a écrit :
> >
> > > so if _sum starts and ends here, _sum can still get deviation.
> >
> > This makes no sense at all. If you have so many cpus 'here' right before
> > you increment fbc->sum_cnt, then no matter how precise and super
> > cautious you are in your _sum() implementation, as soon as you exit from
> > sum(), other cpus already changed the percpu counter global value.
> I don't agree here. The original implementation also just has quite
> small window we have deviation, the window only exists between the two
> lines:
> atomic64_add(count, &fbc->count);
> __this_cpu_write(*fbc->counters, 0);
> if you think we should ignore it, we'd better not use any protection
> here.
>
Not at all. Your version didn't forbid a new cpu from coming into _add() and
hitting the deviation problem.
There is a small difference, or else I wouldn't have bothered.
> as I wrote in the email, the atomic and cacheline issue can be resolved
> with a per_cpu data, I just didn't post the patch. I post it this time,
> please see below. There is no cache line bounce anymore.
>
I am afraid we are making no progress at all here if you just try to push
your patch and ignore my comments.
percpu_counter is a compromise; don't make it too slow for normal
operations. It works well if most _add() operations only go through the
percpu data.
Please just move vm_committed_as to a plain atomic_t; this will solve
your problem.
Thanks
On Mon, 2011-05-16 at 14:55 +0800, Eric Dumazet wrote:
> Le lundi 16 mai 2011 à 14:37 +0800, Shaohua Li a écrit :
> > On Mon, 2011-05-16 at 14:11 +0800, Eric Dumazet wrote:
> > > Le lundi 16 mai 2011 à 08:58 +0800, Shaohua Li a écrit :
> > >
> > > > so if _sum starts and ends here, _sum can still get deviation.
> > >
> > > This makes no sense at all. If you have so many cpus 'here' right before
> > > you increment fbc->sum_cnt, then no matter how precise and super
> > > cautious you are in your _sum() implementation, as soon as you exit from
> > > sum(), other cpus already changed the percpu counter global value.
> > I don't agree here. The original implementation also just has quite
> > small window we have deviation, the window only exists between the two
> > lines:
> > atomic64_add(count, &fbc->count);
> > __this_cpu_write(*fbc->counters, 0);
> > if you think we should ignore it, we'd better not use any protection
> > here.
> >
>
> Not at all. Your version didnt forbid new cpu to come in _add() and
> hitting the deviation problem.
If everybody agrees the deviation isn't a problem, I will not bother to
argue here,
but your patch does have the deviation issue which Tejun dislikes.
> There is a small difference, or else I wouldnt had bother.
In _sum, set a bit; in _add, wait until the bit is cleared. That would
solve the issue too, and much more simply (a rough sketch follows).
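A rough sketch of that idea (illustrative only, with C11 atomics as a
stand-in for the kernel API; note it still leaves the window an _add() hits
when it checks the bit just before _sum() sets it, which is what add_start
closed in the earlier patch):

#include <stdatomic.h>

static atomic_int sum_active;                   /* the bit set by _sum() */
static atomic_llong count;

static void add_slowpath(long long local)
{
        while (atomic_load(&sum_active))
                ;                               /* cpu_relax() in the kernel */
        atomic_fetch_add(&count, local);
        /* the per-cpu s32 would be reset to 0 here */
}

static long long sum(const int *percpu, int ncpu)
{
        long long ret;
        int cpu;

        atomic_store(&sum_active, 1);
        ret = atomic_load(&count);
        for (cpu = 0; cpu < ncpu; cpu++)
                ret += percpu[cpu];
        atomic_store(&sum_active, 0);
        return ret;
}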
> > as I wrote in the email, the atomic and cacheline issue can be resolved
> > with a per_cpu data, I just didn't post the patch. I post it this time,
> > please see below. There is no cache line bounce anymore.
> >
>
> I am afraid we make no progress at all here, if you just try to push
> your patch and ignore my comments.
I did try to push my patch, but I didn't ignore your comments. I pointed
out that your patch still has the deviation issue and you didn't think it was
an issue, so you are actually ignoring my comments. On the other hand, I
push my patch because I think mine doesn't have the deviation.
> percpu_counter is a compromise, dont make it too slow for normal
> operations. It works well if most _add() operations only go through
> percpu data.
>
> Please just move vm_committed_as to a plain atomic_t, this will solve
> your problem.
I can, but you can't prevent me to optimize percpu_counter.
Thanks,
Shaohua
On Mon, 2011-05-16 at 15:15 +0800, Shaohua Li wrote:
> I can, but you can't prevent me to optimize percpu_counter.
>
Well, I have the right to say you're wrong.
Your last patch is not good, sorry. Please take the time to read it
again and fix obvious problems. And also give us numbers if one process
does the mmap()/munmap() loop, before and after your patch.
A percpu_counter is already a beast as is, you're suggesting to double
its size, for a pathological case.
Its absolutely not clear to me why vm_committed_as is using the default
percpu_counter_batch.
By the way could you make sure percpu_counter_batch has the right value
on your 24 cpus machine ?
Your 128Mbyte mmap threshold sounds like percpu_counter_batch=32 instead
of 48
On Mon, 2011-05-16 at 15:44 +0800, Eric Dumazet wrote:
> Le lundi 16 mai 2011 à 15:15 +0800, Shaohua Li a écrit :
>
> > I can, but you can't prevent me to optimize percpu_counter.
> >
>
> Well, I have the right to say you're wrong.
sure, but please give a reason.
> Your last patch is not good, sorry.
> Please take the time to read it
> again and fix obvious problems.
what kind of obvious problems?
> And also give us numbers if one process
> does the mmap()/munmap() loop, before and after your patch.
I did a stress test with one thread
while {
__percpu_counter_add(+count)
__percpu_counter_add(-count)
}
The loop runs 10000000 times.
In the _add fast path (no lock held):
before my patch:
real 0m0.133s
user 0m0.000s
sys 0m0.124s
after:
real 0m0.129s
user 0m0.000s
sys 0m0.120s
the difference is variation.
In the _add slow path (lock held):
before my patch:
real 0m0.374s
user 0m0.000s
sys 0m0.372s
after:
real 0m0.245s
user 0m0.000s
sys 0m0.020s
My patch actually makes _add faster, because we removed spin_lock.
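For reference, the stress loop above could be reconstructed as a throwaway
kernel module roughly like this (the module skeleton and counter setup are
assumptions; the numbers above were not produced with this exact code):

#include <linux/module.h>
#include <linux/init.h>
#include <linux/percpu_counter.h>

static struct percpu_counter test_fbc;

static int __init pcc_stress_init(void)
{
        long i;
        /* use 1 to stay on the per-cpu fast path, or a value >=
         * percpu_counter_batch to force the slow (locked) path */
        s64 amount = 1;
        int err = percpu_counter_init(&test_fbc, 0);

        if (err)
                return err;

        for (i = 0; i < 10000000; i++) {
                percpu_counter_add(&test_fbc, amount);
                percpu_counter_add(&test_fbc, -amount);
        }

        percpu_counter_destroy(&test_fbc);
        return 0;
}

static void __exit pcc_stress_exit(void)
{
}

module_init(pcc_stress_init);
module_exit(pcc_stress_exit);
MODULE_LICENSE("GPL");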
> A percpu_counter is already a beast as is, you're suggesting to double
> its size, for a pathological case.
>
> Its absolutely not clear to me why vm_committed_as is using the default
> percpu_counter_batch.
>
> By the way could you make sure percpu_counter_batch has the right value
> on your 24 cpus machine ?
>
> Your 128Mbyte mmap threshold sounds like percpu_counter_batch=32 instead
> of 48
let's not argue the batch size anymore. If we can make percpu_counter
faster, why we don't (even your patch mentioned this).
Thanks,
Shaohua
On Mon, 2011-05-16 at 16:34 +0800, Shaohua Li wrote:
> let's not argue the batch size anymore. If we can make percpu_counter
> faster, why we don't (even your patch mentioned this).
>
I actually changed my mind after trying to solve the problem and spending
a few hours on it. This is not worth it and is counterproductive.
The whole point of percpu_counter is being able to avoid the false sharing
in _most_ cases. I would even make the false-sharing case more
expensive just to pinpoint a bad user, thanks to profiling.
Trying to make it fast in the pathological case is just putting a brown
paper bag over the problem.
An interesting move would be to make percpu_counter hierarchical,
because we might need it for 4096 cpus machines.
Given that vm_committed has one percent resolution need
(sysctl_overcommit_ratio is expressed with percent resolution), it
should be used with an appropriate batch value, something like :
vm_committed_as_batch = max(percpu_counter_batch,
total_ram_pages/(num_possible_cpus()*100));
Instead of the default percpu_counter_batch, more aimed for
_add(1)/_add(-1) uses.
Note: this won't solve your mmap(128M)/munmap() problem unless your
machine has a _lot_ of memory. Still, this will be a win on many real
workloads.
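Sketched as code, that suggestion might look like the following (the helper
name, the initcall hook, and the batched accounting wrapper are assumptions,
not something posted in this thread):

#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/mm.h>
#include <linux/mman.h>
#include <linux/cpumask.h>
#include <linux/percpu_counter.h>

int vm_committed_as_batch;

static int __init init_vm_committed_as_batch(void)
{
        /* keep roughly one percent resolution: 1% of RAM spread over all cpus */
        vm_committed_as_batch = max_t(int, percpu_counter_batch,
                        totalram_pages / (num_possible_cpus() * 100));
        return 0;
}
core_initcall(init_vm_committed_as_batch);

/* accounting sites would then pass the bigger batch explicitly */
static inline void vm_acct_memory_batched(long pages)
{
        __percpu_counter_add(&vm_committed_as, pages, vm_committed_as_batch);
}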
On Mon, 2011-05-16 at 11:35 +0200, Eric Dumazet wrote:
> Given that vm_committed has one percent resolution need
> (sysctl_overcommit_ratio is expressed with percent resolution), it
> should be used with an appropriate batch value, something like :
>
> vm_committed_as_batch = max(percpu_counter_batch,
> total_ram_pages/(num_possible_cpus()*100));
>
Funny thing with vm_committed_as is we dont even read its value with
default vm configuration
(sysctl_overcommit_memory == OVERCOMMIT_ALWAYS or OVERCOMMIT_GUESS)
[ In this case, we read it only for /proc/meminfo output ]
Ideally, we could dynamically change vm_committed_as_batch when
sysctl_overcommit_memory or other param is changed. This would need a
mechanism to ask all cpus to transfert/clear their local s32 into global
fbc->count [when lowering vm_committed_as_batch]
Another idea would be to use an atomic when manipulating the percpu s32,
so that __percpu_counter_sum() is able to make this operation itself :
At the end of __percpu_counter_sum(), fbc->count would be the final
result, and all s32 would be zero, unless some cpus called _add()
meanwhile.
On Mon, 2011-05-16 at 22:22 +0800, Eric Dumazet wrote:
> Le lundi 16 mai 2011 à 11:35 +0200, Eric Dumazet a écrit :
>
> > Given that vm_committed has one percent resolution need
> > (sysctl_overcommit_ratio is expressed with percent resolution), it
> > should be used with an appropriate batch value, something like :
> >
> > vm_committed_as_batch = max(percpu_counter_batch,
> > total_ram_pages/(num_possible_cpus()*100));
> >
>
>
> Funny thing with vm_committed_as is we dont even read its value with
> default vm configuration
>
> (sysctl_overcommit_memory == OVERCOMMIT_ALWAYS or OVERCOMMIT_GUESS)
>
> [ In this case, we read it only for /proc/meminfo output ]
>
> Ideally, we could dynamically change vm_committed_as_batch when
> sysctl_overcommit_memory or other param is changed. This would need a
> mechanism to ask all cpus to transfert/clear their local s32 into global
> fbc->count [when lowering vm_committed_as_batch]
I actually posted something like this before
http://marc.info/?l=linux-mm&m=130144785326028&w=2
but this could affect /proc/meminfo read.
> Another idea would be to use an atomic when manipulating the percpu s32,
> so that __percpu_counter_sum() is able to make this operation itself :
> At the end of __percpu_counter_sum(), fbc->count would be the final
> result, and all s32 would be zero, unless some cpus called _add()
> meanwhile.
don't understand it. But if concurrent _add can introduce deviation,
this is good.
I'm still interesting in improving percpu_counter itself. If we can
improve it, why we don't? My patch doesn't slow down anything for all
tests. Why didn't you ever look at it?
Thanks,
Shaohua
On Tue, 2011-05-17 at 08:55 +0800, Shaohua Li wrote:
> I'm still interesting in improving percpu_counter itself. If we can
> improve it, why we don't? My patch doesn't slow down anything for all
> tests. Why didn't you ever look at it?
>
I did and said there were obvious problems in it.
1) 4096 cpus : size of one percpu_counter is 16Kbytes.
After your last patch -> 20 kbytes for no good reason.
2) Two separate alloc_percpu() -> two separate cache lines instead of
one.
But then, if one alloc_percpu() -> 32 kbytes per object.
3) Focus on percpu_counter() implementation instead of making an
analysis of callers.
I did a lot of rwlock removal in the network stack because rwlocks are not
the right synchronization primitive in many cases. I did not optimize
rwlocks. If rwlocks were even slower, I suspect other people would have
helped me to convert things faster.
4) There is a possible way to solve your deviation case: add, at the
beginning of _add(), a shortcut for crazy 'amount' values. It's a bit
expensive on 32-bit arches, so it might be added in a new helper to keep
_add() fast for normal and gentle users.
if (unlikely(amount >= batch || amount <= -batch)) {
atomic64_add(amount, &fbc->count);
return;
}
I.e., don't bother reading the s32 value and changing it to 0; just leave it
alone.
Thanks
------------------------------------------------------------------
About making percpu s32 'atomic', here is the patch to show the idea:
Each time we call __percpu_counter_sum(), fbc->counters[] values are
zeroed and fbc->count is the "precise" value of the counter. (deviation
comes close to 0)
So if we need to dynamically change one percpu counter batch value, we
can call __percpu_counter_sum() to bring the 'deviation' close to 0;
no need to send an IPI to all cpus so that they perform the transfer
themselves.
1) vm_committed_as could use a big vm_committed_as_batch when
(sysctl_overcommit_memory == OVERCOMMIT_ALWAYS or OVERCOMMIT_GUESS)
2) We could switch vm_committed_as_batch to an adequate value if/when
sysctl_overcommit_memory is changed to OVERCOMMIT_NEVER (and redo the
batch computation if swap space is changed as well, or hugepage
reservations, or the number of online cpus, or ...)
Note: this has a cost, because percpu_counter() fast path would have one
atomic operation instead of 0 in current implementation (cmpxchg() on
the percpu s32). Not sure current cpus would care for percpu data (no
contention)
include/linux/percpu_counter.h | 7 +---
lib/percpu_counter.c | 51 +++++++++++++++----------------
2 files changed, 29 insertions(+), 29 deletions(-)
diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
index 46f6ba5..d6b7831 100644
--- a/include/linux/percpu_counter.h
+++ b/include/linux/percpu_counter.h
@@ -16,8 +16,7 @@
#ifdef CONFIG_SMP
struct percpu_counter {
- spinlock_t lock;
- s64 count;
+ atomic64_t count;
#ifdef CONFIG_HOTPLUG_CPU
struct list_head list; /* All percpu_counters are on a list */
#endif
@@ -60,7 +59,7 @@ static inline s64 percpu_counter_sum(struct percpu_counter *fbc)
static inline s64 percpu_counter_read(struct percpu_counter *fbc)
{
- return fbc->count;
+ return atomic64_read(&fbc->count);
}
/*
@@ -70,7 +69,7 @@ static inline s64 percpu_counter_read(struct percpu_counter *fbc)
*/
static inline s64 percpu_counter_read_positive(struct percpu_counter *fbc)
{
- s64 ret = fbc->count;
+ s64 ret = percpu_counter_read(fbc);
barrier(); /* Prevent reloads of fbc->count */
if (ret >= 0)
diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index 28f2c33..745787e 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -59,29 +59,36 @@ void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
{
int cpu;
- spin_lock(&fbc->lock);
for_each_possible_cpu(cpu) {
s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
*pcount = 0;
}
- fbc->count = amount;
- spin_unlock(&fbc->lock);
+ atomic64_set(&fbc->count, amount);
}
EXPORT_SYMBOL(percpu_counter_set);
void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
{
s64 count;
+ s32 *ptr, old;
+
+ if (unlikely(amount >= batch || amount <= -batch)) {
+ atomic64_add(amount, &fbc->count);
+ return;
+ }
preempt_disable();
- count = __this_cpu_read(*fbc->counters) + amount;
+ ptr = this_cpu_ptr(fbc->counters);
+retry:
+ old = *ptr;
+ count = old + amount;
if (count >= batch || count <= -batch) {
- spin_lock(&fbc->lock);
- fbc->count += count;
- __this_cpu_write(*fbc->counters, 0);
- spin_unlock(&fbc->lock);
+ if (unlikely(cmpxchg(ptr, old, 0) != old))
+ goto retry;
+ atomic64_add(count, &fbc->count);
} else {
- __this_cpu_write(*fbc->counters, count);
+ if (unlikely(cmpxchg(ptr, old, count) != old))
+ goto retry;
}
preempt_enable();
}
@@ -93,26 +100,23 @@ EXPORT_SYMBOL(__percpu_counter_add);
*/
s64 __percpu_counter_sum(struct percpu_counter *fbc)
{
- s64 ret;
int cpu;
- spin_lock(&fbc->lock);
- ret = fbc->count;
for_each_online_cpu(cpu) {
- s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
- ret += *pcount;
+ s32 count, *pcount = per_cpu_ptr(fbc->counters, cpu);
+
+ count = xchg(pcount, 0);
+ if (count)
+ atomic64_add(count, &fbc->count);
}
- spin_unlock(&fbc->lock);
- return ret;
+ return atomic64_read(&fbc->count);
}
EXPORT_SYMBOL(__percpu_counter_sum);
int __percpu_counter_init(struct percpu_counter *fbc, s64 amount,
struct lock_class_key *key)
{
- spin_lock_init(&fbc->lock);
- lockdep_set_class(&fbc->lock, key);
- fbc->count = amount;
+ atomic64_set(&fbc->count, amount);
fbc->counters = alloc_percpu(s32);
if (!fbc->counters)
return -ENOMEM;
@@ -170,14 +174,11 @@ static int __cpuinit percpu_counter_hotcpu_callback(struct notifier_block *nb,
cpu = (unsigned long)hcpu;
mutex_lock(&percpu_counters_lock);
list_for_each_entry(fbc, &percpu_counters, list) {
- s32 *pcount;
- unsigned long flags;
+ s32 count, *pcount;
- spin_lock_irqsave(&fbc->lock, flags);
pcount = per_cpu_ptr(fbc->counters, cpu);
- fbc->count += *pcount;
- *pcount = 0;
- spin_unlock_irqrestore(&fbc->lock, flags);
+ count = xchg(pcount, 0);
+ atomic64_add(count, &fbc->count);
}
mutex_unlock(&percpu_counters_lock);
#endif
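Idea 2) above could be hooked up roughly as follows (the helper name and the
per-counter batch variable are assumptions; it relies on the xchg()-based
_sum() in the patch leaving the per-cpu values at zero):

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/mman.h>
#include <linux/cpumask.h>
#include <linux/percpu_counter.h>

extern int vm_committed_as_batch;       /* assumed per-counter batch value */

void vm_committed_as_policy_changed(int new_policy)
{
        if (new_policy == OVERCOMMIT_NEVER)
                /* strict accounting: fall back to the small default batch */
                vm_committed_as_batch = percpu_counter_batch;
        else
                vm_committed_as_batch = max_t(int, percpu_counter_batch,
                                totalram_pages / (num_possible_cpus() * 100));

        /* fold the per-cpu deltas into fbc->count right away */
        percpu_counter_sum(&vm_committed_as);
}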
On Tue, 2011-05-17 at 12:56 +0800, Eric Dumazet wrote:
> Le mardi 17 mai 2011 à 08:55 +0800, Shaohua Li a écrit :
>
> > I'm still interesting in improving percpu_counter itself. If we can
> > improve it, why we don't? My patch doesn't slow down anything for all
> > tests. Why didn't you ever look at it?
> >
>
> I did and said there were obvious problems in it.
>
> 1) 4096 cpus : size of one percpu_counter is 16Kbytes.
>
> After your last patch -> 20 kbytes for no good reason.
I don't know why you said there is no good reason. I posted a lot of
data which shows improvement, while you just ignore.
The size issue is completely pointless. If you have 4096 CPUs, how could
you worry about 16k bytes memory. Especially the extra memory makes the
API much faster.
> 2) Two separate alloc_percpu() -> two separate cache lines instead of
> one.
Might be in one cache line actually, but can be easily fixed if not
anyway. On the other hand, even touch two cache lines, it's still faster
than the original spinlock implementation, which I already posted data.
> But then, if one alloc_percpu() -> 32 kbytes per object.
the size issue is completely pointless
> 3) Focus on percpu_counter() implementation instead of making an
> analysis of callers.
>
> I did a lot of rwlocks removal in network stack because they are not the
> right synchronization primitive in many cases. I did not optimize
> rwlocks. If rwlocks were even slower, I suspect other people would have
> help me to convert things faster.
My original issue is mmap, but as I have already said several times, if we
can make percpu_counter better, why wouldn't we?
> 4) There is a possible way to solve your deviation case : add at _add()
> beginning a short cut for crazy 'amount' values. Its a bit expensive on
> 32bit arches, so might be added in a new helper to let _add() be fast
> for normal and gentle users.
+ if (unlikely(cmpxchg(ptr, old, 0) != old))
> + goto retry;
this doesn't change anything, you still have the deviation issue here
> + atomic64_add(count, &fbc->count);
> if (unlikely(amount >= batch || amount <= -batch)) {
> atomic64(amount, &fbc->count);
> return;
> }
why we just handle this special case, my patch can make the whole part
faster without deviation
so you didn't point out any obvious problem with my patch actually. This
is good.
Thanks,
Shaohua
On Tue, 2011-05-17 at 13:22 +0800, Shaohua Li wrote:
> I don't know why you said there is no good reason. I posted a lot of
> data which shows improvement, while you just ignore.
>
Dear Shaohua, ignoring you would mean I would not even answer and would let
other people do so, when they have time (maybe in 2 or 3 months, maybe
never; just take a look at my previous attempts, two years ago, when
atomic64_t didn't exist yet, obviously).
I hope you can see the value I add to your concerns, keeping this subject
alive and even coding stuff. We all share ideas; we are not fighting.
> The size issue is completely pointless. If you have 4096 CPUs, how could
> you worry about 16k bytes memory. Especially the extra memory makes the
> API much faster.
>
It is not pointless at all, maybe for Intel guys it is.
I just NACK this idea
> > 2) Two separate alloc_percpu() -> two separate cache lines instead of
> > one.
> Might be in one cache line actually, but can be easily fixed if not
> anyway. On the other hand, even touch two cache lines, it's still faster
> than the original spinlock implementation, which I already posted data.
>
> > But then, if one alloc_percpu() -> 32 kbytes per object.
> the size issue is completely pointless
>
Thats your opinion
> > 3) Focus on percpu_counter() implementation instead of making an
> > analysis of callers.
> >
> > I did a lot of rwlocks removal in network stack because they are not the
> > right synchronization primitive in many cases. I did not optimize
> > rwlocks. If rwlocks were even slower, I suspect other people would have
> > help me to convert things faster.
> My original issue is mmap, but I already declaimed several times we can
> make percpu_counter better, why won't we?
>
Only if it's a good compromise. Your last patches are not yet good
candidates I'm afraid.
> > 4) There is a possible way to solve your deviation case : add at _add()
> > beginning a short cut for crazy 'amount' values. Its a bit expensive on
> > 32bit arches, so might be added in a new helper to let _add() be fast
> > for normal and gentle users.
>
> + if (unlikely(cmpxchg(ptr, old, 0) != old))
> > + goto retry;
> this doesn't change anything, you still have the deviation issue here
>
You do understand that 'my last patch' doesn't address the deviation problem
anymore? It's a completely different matter, addressing the vm_committed_as
problem (and maybe other percpu_counters).
The thing you prefer not to touch so that your 'results' sound better...
If your percpu_counter is hit so hard that you have many cpus
competing in atomic64_add(count, &fbc->count), the _sum() result is wrong right
after it returns. So _sum() _can_ deviate even if it claims to be more
precise.
> > + atomic64_add(count, &fbc->count);
>
> > if (unlikely(amount >= batch || amount <= -batch)) {
> > atomic64(amount, &fbc->count);
> > return;
> > }
> why we just handle this special case, my patch can make the whole part
> faster without deviation
>
This 'special case' is the whole problem others pointed out, and it
makes the worst-case deviation the same as before your initial patch.
> so you didn't point out any obvious problem with my patch actually. This
> is good.
>
This brings nothing. Just say NO to people saying it's needed.
It's not because Tejun says there is a deviation "problem" that you need to
change lglock and bring lglock to percpu_counter, or double the
percpu_counter size, or whatever crazy idea.
Just convince him that percpu_counter by itself cannot bring a max
deviation guarantee. No percpu_counter user cares at all. If they do,
then percpu_counter choice for their implementation is probably wrong.
[ We dont provide yet a percpu_counter_add_return() function ]
Hello, Eric, Shaohua.
On Tue, May 17, 2011 at 11:01:01AM +0200, Eric Dumazet wrote:
> Just convince him that percpu_counter by itself cannot bring a max
> deviation guarantee. No percpu_counter user cares at all. If they do,
> then percpu_counter choice for their implementation is probably wrong.
>
> [ We dont provide yet a percpu_counter_add_return() function ]
I haven't gone through this thread yet but will do so later today, but
let me clarify the whole deviation thing.
1. I don't care about a reasonable (can't think of a better word at the
moment) level of deviation. Under a high level of concurrency, the
exact value isn't even well defined - nobody can tell in what order
operations happened anyway.
2. But I _do_ object to _sum() has the possibility of deviating by
multiples of @batch even with very low level of activity.
I'm completely fine with #1. I'm not that crazy but I don't really
want to take #2 - that makes the whole _sum() interface almost
pointless. Also, I don't want to add a big honking lglock just to avoid
#2 unless it can be shown that the same effect can't be achieved in a
saner manner, and I'm highly skeptical that that would happen.
Thank you.
--
tejun
On Tue, 2011-05-17 at 11:11 +0200, Tejun Heo wrote:
> I'm completely fine with #1. I'm not that crazy but I don't really
> want to take #2 - that makes the whole _sum() interface almost
> pointless.
Hi Tejun
_sum() is a bit more precise than percpu_counter_read(), but making it
really precise means we would have to stop concurrent activities, and we
never did that in the previous/current implementation.
We could add this (as Shaohua and I tried in various patches)
later, if needed, but nowhere in the kernel do we currently need that.
Even /proc/meminfo doesn't call _sum(&vm_committed_as) but the lazy
percpu_counter_read_positive() function...
Really, _sum() gives a good approximation of the counter, more precise
because of the percpu s32 folding, but with no guarantee on deviation.
Hello, Eric.
On Tue, May 17, 2011 at 11:45:41AM +0200, Eric Dumazet wrote:
> _sum() is a bit more precise than percpu_counter_read(), but making it
> really precise means we would have to stop concurrent activities, and we
> never did that in the previous/current implementation.
>
> We could add this (as Shaohua and I tried in various patches)
> later, if needed, but nowhere in the kernel do we currently need that.
>
> Even /proc/meminfo doesn't call _sum(&vm_committed_as) but the lazy
> percpu_counter_read_positive() function...
>
> Really, _sum() gives a good approximation of the counter, more precise
> because of the percpu s32 folding, but with no guarantee on deviation.
I'm not asking to make it more accurate, but the initial patches from
Shaohua made the _sum() result deviate by @batch even when only one
thread is doing _inc(), due to the race window between adding to the
main counter and resetting the local one. All I'm asking is closing
that hole and I'll be completely happy with it. The lglock does that
but it's ummm.... not a very nice way to do it.
Please forget about deviations from concurrent activities. I don't
care and nobody should. All I'm asking is to remove the possibility of
that unnecessary spike on any update, and I don't think that would be
too hard.
Thanks.
--
tejun
On Tue, 2011-05-17 at 11:50 +0200, Tejun Heo wrote:
> I'm not asking to make it more accurate, but the initial patches from
> Shaohua made the _sum() result deviate by @batch even when only one
> thread is doing _inc(), due to the race window between adding to the
> main counter and resetting the local one. All I'm asking is closing
> that hole and I'll be completely happy with it. The lglock does that
> but it's ummm.... not a very nice way to do it.
>
> Please forget about deviations from concurrent activities. I don't
> care and nobody should. All I'm asking is to remove the possibility of
> that unnecessary spike on any update, and I don't think that would be
> too hard.
>
Spikes are expected and have no effect by design.
batch value is chosen so that granularity of the percpu_counter
(batch*num_online_cpus()) is the spike factor, and thats pretty
difficult when number of cpus is high.
In Shaohua workload, 'amount' for a 128Mbyte mapping is 32768, while the
batch value is 48. 48*24 = 1152.
So the percpu s32 being in [-47 .. 47] range would not change the
accuracy of the _sum() function [ if it was eventually called, but its
not ]
No drift in the counter is the only thing we care - and _read() being
not too far away from the _sum() value, in particular if the
percpu_counter is used to check a limit that happens to be low (against
granularity of the percpu_counter : batch*num_online_cpus()).
I claim extra care is not needed. This might give the false impression
to reader/user that percpu_counter object can replace a plain
atomic64_t.
For example, I feel vm_committed_as could be a plain atomic_long_t
Hello, Eric.
On Tue, May 17, 2011 at 02:20:07PM +0200, Eric Dumazet wrote:
> Spikes are expected and have no effect by design.
>
> batch value is chosen so that granularity of the percpu_counter
> (batch*num_online_cpus()) is the spike factor, and thats pretty
> difficult when number of cpus is high.
>
> In Shaohua workload, 'amount' for a 128Mbyte mapping is 32768, while the
> batch value is 48. 48*24 = 1152.
> So the percpu s32 being in [-47 .. 47] range would not change the
> accuracy of the _sum() function [ if it was eventually called, but its
> not ]
>
> No drift in the counter is the only thing we care - and _read() being
> not too far away from the _sum() value, in particular if the
> percpu_counter is used to check a limit that happens to be low (against
> granularity of the percpu_counter : batch*num_online_cpus()).
>
> I claim extra care is not needed. This might give the false impression
> to reader/user that percpu_counter object can replace a plain
> atomic64_t.
We already had this discussion. Sure, we can argue about it again all
day but I just don't think it's a necessary compromise and really
makes _sum() quite dubious. It's not about strict correctness, it
can't be, but if I spent the overhead to walk all the different percpu
counters, I'd like to have a rather exact number if there's nothing
much going on (freeblock count, for example). Also, I want to be able
to use large @batch if the situation allows for it without worrying
about _sum() accuracy.
Given that _sum() is super-slow path and we have a lot of latitude
there, this should be possible without resorting to heavy handed
approach like lglock. I was hoping that someone would come up with a
better solution, which didn't seem to have happened. Maybe I was
wrong, I don't know. I'll give it a shot.
But, anyways, here's my position regarding the issue.
* If we're gonna just fix up the slow path, I don't want to make
_sum() less useful by making its accuracy dependent upon @batch.
* If somebody is interested, it would be worthwhile to see whether we
can integrate vmstat and percpu counters so that its deviation is
automatically regulated and we don't have to think about all this
anymore.
I'll see if I can come up with something.
Thank you.
--
tejun
On Tue, 2011-05-17 at 14:45 +0200, Tejun Heo wrote:
> Given that _sum() is super-slow path and we have a lot of latitude
> there,
Absolutely not. Its slow path yes, but not super-slow.
Don't change the rules now ;)
Take a look at include/net/tcp.h and commit ad1af0fedba14f82b
(tcp: Combat per-cpu skew in orphan tests.)
If you guys make percpu_counter much slower, we'll just remove its use
in the network stack, and there will not be a lot of users left.
Hello,
On Tue, May 17, 2011 at 03:00:23PM +0200, Eric Dumazet wrote:
> Absolutely not. Its slow path yes, but not super-slow.
I was speaking in relative terms. We're talking about local only fast
path vs. something which hits every percpu counter likely causing
cache misses in many of them. It's bound to be multiple orders of
magnitude heavier than fast path.
Thanks.
--
tejun
On Tue, 17 May 2011, Tejun Heo wrote:
> Hello,
>
> On Tue, May 17, 2011 at 03:00:23PM +0200, Eric Dumazet wrote:
> > Absolutely not. Its slow path yes, but not super-slow.
>
> I was speaking in relative terms. We're talking about local only fast
> path vs. something which hits every percpu counter likely causing
> cache misses in many of them. It's bound to be multiple orders of
> magnitude heavier than fast path.
Well lets just adopt the system that vm statistics use. Bound the error by
time and batch and allow the user to change the batch if more accuracy is
desired.
The _sum function is optional, and it should be explained that the
result *could* be better, but don't count on it.
Hey,
On Tue, May 17, 2011 at 08:55:33AM -0500, Christoph Lameter wrote:
> Well lets just adopt the system that vm statistics use. Bound the error by
> time and batch and allow the user to change the batch if more accuracy is
> desired.
Yeah, that would be the better long term solution.
> The _sum function is optional, and it should be explained that the
> result *could* be better, but don't count on it.
We can implement force-flush for dire situations, similar to expedited
RCU flush thing.
Thanks.
--
tejun
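Such a force-flush could look roughly like this (a hypothetical helper,
assuming the atomic64_t fbc->count from the patches earlier in this thread;
not something anyone posted):

#include <linux/smp.h>
#include <linux/percpu.h>
#include <linux/percpu_counter.h>

static void percpu_counter_fold_local(void *info)
{
        struct percpu_counter *fbc = info;
        s32 count = __this_cpu_read(*fbc->counters);

        /* assumes fbc->count is an atomic64_t as in the patches above */
        atomic64_add(count, &fbc->count);
        __this_cpu_write(*fbc->counters, 0);
}

/* fold every cpu's local delta before reading, for a near-exact value */
s64 percpu_counter_sum_flush(struct percpu_counter *fbc)
{
        on_each_cpu(percpu_counter_fold_local, fbc, 1);
        return atomic64_read(&fbc->count);
}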
On Tue, 17 May 2011, Tejun Heo wrote:
> We can implement force-flush for dire situations, similar to expedited
> RCU flush thing.
I would suggest that if the user wants accurate results then another
external form of synchronization needs to be provided. Maybe the counter
is always incremented when another lock has been taken.
On Tue, 2011-05-17 at 17:50 +0800, Tejun Heo wrote:
> Hello, Eric.
>
> On Tue, May 17, 2011 at 11:45:41AM +0200, Eric Dumazet wrote:
> > _sum() is a bit more precise than percpu_counter_read(), but to make it
> > really precise, we means we have to stop concurrent activities, and we
> > never did in previous/current implementation.
> >
> > We could add this (as Shaohua and myself tried in various patches)
> > later, if needed, but nowhere in kernel we currently need that.
> >
> > Even /proc/meminfo doesnt call _sum(&vm_committed_as) but the lazy
> > percpu_counter_read_positive() function...
> >
> > Reammy _sum() gives a good approximation of the counter, more precise
> > because of the percpu s32 folding, but no guarantee of deviation.
>
> I'm not asking to make it more accurate but the initial patches from
> Shaohua made the _sum() result to deviate by @batch even when only one
> thread is doing _inc() due to the race window between adding to the
> main counter and resetting the local one. All I'm asking is closing
> that hole and I'll be completely happy with it. The lglock does that
> but it's ummm.... not a very nice way to do it.
>
> Please forget about deviations from concurrent activities. I don't
> care and nobody should. All I'm asking is removing that any update
> having the possibility of that unnecessary spike and I don't think
> that would be too hard.
Hmm, we are once again talking about the deviation issue. I thought we
agreed in the last discussion that the deviation issue should be resolved,
but it seems not...
I would suggest you guys seriously look at my v3 patches, which don't
use lglock but do solve the deviation issue and have no significant
overhead.
Thanks,
Shaohua