2006-01-26 18:56:45

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: [patch 0/4] net: Percpufy frequently used variables on struct proto

The following patches change struct proto.memory_allocated and
proto.sockets_allocated to use per-cpu counters. This patchset also switches
the proto.inuse percpu variable to use alloc_percpu, instead of NR_CPUS *
cacheline-size padding.

We saw a 5% improvement in ApacheBench requests per second with this
patchset on a multi-NIC, 8-way, 3.3 GHz IBM x460 Xeon server.

Patches follow.

Thanks,
Kiran


2006-01-26 18:59:28

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: [patch 1/4] net: Percpufy frequently used variables -- add percpu_counter_mod_bh

Add percpu_counter_mod_bh so that these counters can be used safely from
both softirq and process context.

Signed-off-by: Pravin B. Shelar <[email protected]>
Signed-off-by: Ravikiran G Thirumalai <[email protected]>
Signed-off-by: Shai Fultheim <[email protected]>

Index: linux-2.6.16-rc1/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.16-rc1.orig/include/linux/percpu_counter.h 2006-01-25 11:53:56.000000000 -0800
+++ linux-2.6.16-rc1/include/linux/percpu_counter.h 2006-01-25 12:09:41.000000000 -0800
@@ -11,6 +11,7 @@
#include <linux/smp.h>
#include <linux/threads.h>
#include <linux/percpu.h>
+#include <linux/interrupt.h>

#ifdef CONFIG_SMP

@@ -102,4 +103,11 @@ static inline void percpu_counter_dec(st
percpu_counter_mod(fbc, -1);
}

+static inline void percpu_counter_mod_bh(struct percpu_counter *fbc, long amount)
+{
+	local_bh_disable();
+	percpu_counter_mod(fbc, amount);
+	local_bh_enable();
+}
+
#endif /* _LINUX_PERCPU_COUNTER_H */
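
For context, a minimal usage sketch of the new helper (the counter and the
caller here are hypothetical, and counter initialization is elided):

/* Process-context update of a counter that is also modified from
 * softirq context. percpu_counter_mod_bh() disables bottom halves
 * around the read-modify-write of the local slot, so a softirq on
 * the same CPU cannot interleave with it.
 */
static struct percpu_counter example_allocated;

static void example_charge(long amount)
{
	percpu_counter_mod_bh(&example_allocated, amount);
}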

2006-01-26 19:02:11

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: [patch 2/4] net: Percpufy frequently used variables -- struct proto.memory_allocated

Change struct proto->memory_allocated to a batching per-CPU counter
(percpu_counter) from an atomic_t. A batching counter is better than a
plain per-CPU counter as this field is read often.

Signed-off-by: Pravin B. Shelar <[email protected]>
Signed-off-by: Ravikiran Thirumalai <[email protected]>
Signed-off-by: Shai Fultheim <[email protected]>
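
For reference, the batching behaviour this patch relies on looks roughly like
the following (paraphrased from the 2.6-era percpu_counter_mod(), which lives
in mm/swap.c at this point; worth checking against the tree). Each CPU
accumulates into a local slot and folds into the shared count only once the
slot reaches FBC_BATCH, which is what keeps percpu_counter_read() a cheap
single read of fbc->count:

void percpu_counter_mod(struct percpu_counter *fbc, long amount)
{
	long count;
	long *pcount;
	int cpu = get_cpu();

	pcount = per_cpu_ptr(fbc->counters, cpu);
	count = *pcount + amount;
	if (count >= FBC_BATCH || count <= -FBC_BATCH) {
		spin_lock(&fbc->lock);
		fbc->count += count;	/* fold into the shared count */
		*pcount = 0;
		spin_unlock(&fbc->lock);
	} else {
		*pcount = count;	/* stay cpu-local */
	}
	put_cpu();
}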

Index: linux-2.6.16-rc1mm3/include/net/sock.h
===================================================================
--- linux-2.6.16-rc1mm3.orig/include/net/sock.h 2006-01-25 16:25:16.000000000 -0800
+++ linux-2.6.16-rc1mm3/include/net/sock.h 2006-01-25 16:56:59.000000000 -0800
@@ -48,6 +48,7 @@
#include <linux/netdevice.h>
#include <linux/skbuff.h> /* struct sk_buff */
#include <linux/security.h>
+#include <linux/percpu_counter.h>

#include <linux/filter.h>

@@ -541,8 +542,9 @@ struct proto {

/* Memory pressure */
void (*enter_memory_pressure)(void);
- atomic_t *memory_allocated; /* Current allocated memory. */
+ struct percpu_counter *memory_allocated; /* Current allocated memory. */
atomic_t *sockets_allocated; /* Current number of sockets. */
+
/*
* Pressure flag: try to collapse.
* Technical note: it is used by multiple contexts non atomically.
Index: linux-2.6.16-rc1mm3/include/net/tcp.h
===================================================================
--- linux-2.6.16-rc1mm3.orig/include/net/tcp.h 2006-01-25 16:25:16.000000000 -0800
+++ linux-2.6.16-rc1mm3/include/net/tcp.h 2006-01-25 16:56:59.000000000 -0800
@@ -220,7 +220,7 @@ extern int sysctl_tcp_moderate_rcvbuf;
extern int sysctl_tcp_tso_win_divisor;
extern int sysctl_tcp_abc;

-extern atomic_t tcp_memory_allocated;
+extern struct percpu_counter tcp_memory_allocated;
extern atomic_t tcp_sockets_allocated;
extern int tcp_memory_pressure;

Index: linux-2.6.16-rc1mm3/net/core/sock.c
===================================================================
--- linux-2.6.16-rc1mm3.orig/net/core/sock.c 2006-01-25 16:25:16.000000000 -0800
+++ linux-2.6.16-rc1mm3/net/core/sock.c 2006-01-25 16:56:59.000000000 -0800
@@ -1616,12 +1616,13 @@ static char proto_method_implemented(con

static void proto_seq_printf(struct seq_file *seq, struct proto *proto)
{
- seq_printf(seq, "%-9s %4u %6d %6d %-3s %6u %-3s %-10s "
+ seq_printf(seq, "%-9s %4u %6d %6lu %-3s %6u %-3s %-10s "
"%2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c\n",
proto->name,
proto->obj_size,
proto->sockets_allocated != NULL ? atomic_read(proto->sockets_allocated) : -1,
- proto->memory_allocated != NULL ? atomic_read(proto->memory_allocated) : -1,
+ proto->memory_allocated != NULL ?
+ percpu_counter_read(proto->memory_allocated) : -1,
proto->memory_pressure != NULL ? *proto->memory_pressure ? "yes" : "no" : "NI",
proto->max_header,
proto->slab == NULL ? "no" : "yes",
Index: linux-2.6.16-rc1mm3/net/core/stream.c
===================================================================
--- linux-2.6.16-rc1mm3.orig/net/core/stream.c 2006-01-25 16:25:16.000000000 -0800
+++ linux-2.6.16-rc1mm3/net/core/stream.c 2006-01-25 16:56:59.000000000 -0800
@@ -17,6 +17,7 @@
#include <linux/signal.h>
#include <linux/tcp.h>
#include <linux/wait.h>
+#include <linux/percpu.h>
#include <net/sock.h>

/**
@@ -196,11 +197,11 @@ EXPORT_SYMBOL(sk_stream_error);
void __sk_stream_mem_reclaim(struct sock *sk)
{
if (sk->sk_forward_alloc >= SK_STREAM_MEM_QUANTUM) {
- atomic_sub(sk->sk_forward_alloc / SK_STREAM_MEM_QUANTUM,
- sk->sk_prot->memory_allocated);
+ percpu_counter_mod_bh(sk->sk_prot->memory_allocated,
+ -(sk->sk_forward_alloc / SK_STREAM_MEM_QUANTUM));
sk->sk_forward_alloc &= SK_STREAM_MEM_QUANTUM - 1;
if (*sk->sk_prot->memory_pressure &&
- (atomic_read(sk->sk_prot->memory_allocated) <
+ (percpu_counter_read(sk->sk_prot->memory_allocated) <
sk->sk_prot->sysctl_mem[0]))
*sk->sk_prot->memory_pressure = 0;
}
@@ -213,23 +214,26 @@ int sk_stream_mem_schedule(struct sock *
int amt = sk_stream_pages(size);

sk->sk_forward_alloc += amt * SK_STREAM_MEM_QUANTUM;
- atomic_add(amt, sk->sk_prot->memory_allocated);
+ percpu_counter_mod_bh(sk->sk_prot->memory_allocated, amt);

/* Under limit. */
- if (atomic_read(sk->sk_prot->memory_allocated) < sk->sk_prot->sysctl_mem[0]) {
+ if (percpu_counter_read(sk->sk_prot->memory_allocated) <
+ sk->sk_prot->sysctl_mem[0]) {
if (*sk->sk_prot->memory_pressure)
*sk->sk_prot->memory_pressure = 0;
return 1;
}

/* Over hard limit. */
- if (atomic_read(sk->sk_prot->memory_allocated) > sk->sk_prot->sysctl_mem[2]) {
+ if (percpu_counter_read(sk->sk_prot->memory_allocated) >
+ sk->sk_prot->sysctl_mem[2]) {
sk->sk_prot->enter_memory_pressure();
goto suppress_allocation;
}

/* Under pressure. */
- if (atomic_read(sk->sk_prot->memory_allocated) > sk->sk_prot->sysctl_mem[1])
+ if (percpu_counter_read(sk->sk_prot->memory_allocated) >
+ sk->sk_prot->sysctl_mem[1])
sk->sk_prot->enter_memory_pressure();

if (kind) {
@@ -259,7 +263,7 @@ suppress_allocation:

/* Alas. Undo changes. */
sk->sk_forward_alloc -= amt * SK_STREAM_MEM_QUANTUM;
- atomic_sub(amt, sk->sk_prot->memory_allocated);
+ percpu_counter_mod_bh(sk->sk_prot->memory_allocated, -amt);
return 0;
}

Index: linux-2.6.16-rc1mm3/net/decnet/af_decnet.c
===================================================================
--- linux-2.6.16-rc1mm3.orig/net/decnet/af_decnet.c 2006-01-25 16:25:16.000000000 -0800
+++ linux-2.6.16-rc1mm3/net/decnet/af_decnet.c 2006-01-25 17:25:15.000000000 -0800
@@ -154,7 +154,7 @@ static const struct proto_ops dn_proto_o
static DEFINE_RWLOCK(dn_hash_lock);
static struct hlist_head dn_sk_hash[DN_SK_HASH_SIZE];
static struct hlist_head dn_wild_sk;
-static atomic_t decnet_memory_allocated;
+static struct percpu_counter decnet_memory_allocated;

static int __dn_setsockopt(struct socket *sock, int level, int optname, char __user *optval, int optlen, int flags);
static int __dn_getsockopt(struct socket *sock, int level, int optname, char __user *optval, int __user *optlen, int flags);
@@ -2383,7 +2383,8 @@ static int __init decnet_init(void)
rc = proto_register(&dn_proto, 1);
if (rc != 0)
goto out;
-
+
+ percpu_counter_init(&decnet_memory_allocated);
dn_neigh_init();
dn_dev_init();
dn_route_init();
@@ -2424,6 +2425,7 @@ static void __exit decnet_exit(void)
proc_net_remove("decnet");

proto_unregister(&dn_proto);
+ percpu_counter_destroy(&decnet_memory_allocated);
}
module_exit(decnet_exit);
#endif
Index: linux-2.6.16-rc1mm3/net/ipv4/proc.c
===================================================================
--- linux-2.6.16-rc1mm3.orig/net/ipv4/proc.c 2006-01-25 16:25:16.000000000 -0800
+++ linux-2.6.16-rc1mm3/net/ipv4/proc.c 2006-01-25 16:56:59.000000000 -0800
@@ -61,10 +61,10 @@ static int fold_prot_inuse(struct proto
static int sockstat_seq_show(struct seq_file *seq, void *v)
{
socket_seq_show(seq);
- seq_printf(seq, "TCP: inuse %d orphan %d tw %d alloc %d mem %d\n",
+ seq_printf(seq, "TCP: inuse %d orphan %d tw %d alloc %d mem %lu\n",
fold_prot_inuse(&tcp_prot), atomic_read(&tcp_orphan_count),
tcp_death_row.tw_count, atomic_read(&tcp_sockets_allocated),
- atomic_read(&tcp_memory_allocated));
+ percpu_counter_read(&tcp_memory_allocated));
seq_printf(seq, "UDP: inuse %d\n", fold_prot_inuse(&udp_prot));
seq_printf(seq, "RAW: inuse %d\n", fold_prot_inuse(&raw_prot));
seq_printf(seq, "FRAG: inuse %d memory %d\n", ip_frag_nqueues,
Index: linux-2.6.16-rc1mm3/net/ipv4/tcp.c
===================================================================
--- linux-2.6.16-rc1mm3.orig/net/ipv4/tcp.c 2006-01-25 16:25:16.000000000 -0800
+++ linux-2.6.16-rc1mm3/net/ipv4/tcp.c 2006-01-25 16:56:59.000000000 -0800
@@ -283,7 +283,7 @@ EXPORT_SYMBOL(sysctl_tcp_mem);
EXPORT_SYMBOL(sysctl_tcp_rmem);
EXPORT_SYMBOL(sysctl_tcp_wmem);

-atomic_t tcp_memory_allocated; /* Current allocated memory. */
+struct percpu_counter tcp_memory_allocated; /* Current allocated memory. */
atomic_t tcp_sockets_allocated; /* Current number of TCP sockets. */

EXPORT_SYMBOL(tcp_memory_allocated);
@@ -1593,7 +1593,7 @@ adjudge_to_death:
sk_stream_mem_reclaim(sk);
if (atomic_read(sk->sk_prot->orphan_count) > sysctl_tcp_max_orphans ||
(sk->sk_wmem_queued > SOCK_MIN_SNDBUF &&
- atomic_read(&tcp_memory_allocated) > sysctl_tcp_mem[2])) {
+ percpu_counter_read(&tcp_memory_allocated) > sysctl_tcp_mem[2])) {
if (net_ratelimit())
printk(KERN_INFO "TCP: too many of orphaned "
"sockets\n");
@@ -2127,6 +2127,8 @@ void __init tcp_init(void)
"(established %d bind %d)\n",
tcp_hashinfo.ehash_size << 1, tcp_hashinfo.bhash_size);

+ percpu_counter_init(&tcp_memory_allocated);
+
tcp_register_congestion_control(&tcp_reno);
}

Index: linux-2.6.16-rc1mm3/net/ipv4/tcp_input.c
===================================================================
--- linux-2.6.16-rc1mm3.orig/net/ipv4/tcp_input.c 2006-01-25 16:25:16.000000000 -0800
+++ linux-2.6.16-rc1mm3/net/ipv4/tcp_input.c 2006-01-25 16:56:59.000000000 -0800
@@ -333,7 +333,7 @@ static void tcp_clamp_window(struct sock
if (sk->sk_rcvbuf < sysctl_tcp_rmem[2] &&
!(sk->sk_userlocks & SOCK_RCVBUF_LOCK) &&
!tcp_memory_pressure &&
- atomic_read(&tcp_memory_allocated) < sysctl_tcp_mem[0]) {
+ percpu_counter_read(&tcp_memory_allocated) < sysctl_tcp_mem[0]) {
sk->sk_rcvbuf = min(atomic_read(&sk->sk_rmem_alloc),
sysctl_tcp_rmem[2]);
}
@@ -3508,7 +3508,7 @@ static int tcp_should_expand_sndbuf(stru
return 0;

/* If we are under soft global TCP memory pressure, do not expand. */
- if (atomic_read(&tcp_memory_allocated) >= sysctl_tcp_mem[0])
+ if (percpu_counter_read(&tcp_memory_allocated) >= sysctl_tcp_mem[0])
return 0;

/* If we filled the congestion window, do not expand. */
Index: linux-2.6.16-rc1mm3/net/ipv4/tcp_timer.c
===================================================================
--- linux-2.6.16-rc1mm3.orig/net/ipv4/tcp_timer.c 2006-01-25 16:25:16.000000000 -0800
+++ linux-2.6.16-rc1mm3/net/ipv4/tcp_timer.c 2006-01-25 16:56:59.000000000 -0800
@@ -80,7 +80,7 @@ static int tcp_out_of_resources(struct s

if (orphans >= sysctl_tcp_max_orphans ||
(sk->sk_wmem_queued > SOCK_MIN_SNDBUF &&
- atomic_read(&tcp_memory_allocated) > sysctl_tcp_mem[2])) {
+ percpu_counter_read(&tcp_memory_allocated) > sysctl_tcp_mem[2])) {
if (net_ratelimit())
printk(KERN_INFO "Out of socket memory\n");

2006-01-26 19:03:55

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

Change the atomic_t sockets_allocated member of struct proto to a
per-cpu counter.

Signed-off-by: Pravin B. Shelar <[email protected]>
Signed-off-by: Ravikiran Thirumalai <[email protected]>
Signed-off-by: Shai Fultheim <[email protected]>

Index: linux-2.6.16-rc1/include/net/sock.h
===================================================================
--- linux-2.6.16-rc1.orig/include/net/sock.h 2006-01-24 14:37:45.000000000 -0800
+++ linux-2.6.16-rc1/include/net/sock.h 2006-01-24 17:35:34.000000000 -0800
@@ -543,7 +543,7 @@ struct proto {
/* Memory pressure */
void (*enter_memory_pressure)(void);
struct percpu_counter *memory_allocated; /* Current allocated memory. */
- atomic_t *sockets_allocated; /* Current number of sockets. */
+ int *sockets_allocated; /* Current number of sockets (percpu counter). */

/*
* Pressure flag: try to collapse.
@@ -579,6 +579,24 @@ struct proto {
} stats[NR_CPUS];
};

+static inline int read_sockets_allocated(struct proto *prot)
+{
+	int total = 0;
+	int cpu;
+	for_each_cpu(cpu)
+		total += *per_cpu_ptr(prot->sockets_allocated, cpu);
+	return total;
+}
+
+static inline void mod_sockets_allocated(int *sockets_allocated, int count)
+{
+	(*per_cpu_ptr(sockets_allocated, get_cpu())) += count;
+	put_cpu();
+}
+
+#define inc_sockets_allocated(c) mod_sockets_allocated(c, 1)
+#define dec_sockets_allocated(c) mod_sockets_allocated(c, -1)
+
extern int proto_register(struct proto *prot, int alloc_slab);
extern void proto_unregister(struct proto *prot);

Index: linux-2.6.16-rc1/include/net/tcp.h
===================================================================
--- linux-2.6.16-rc1.orig/include/net/tcp.h 2006-01-24 14:37:45.000000000 -0800
+++ linux-2.6.16-rc1/include/net/tcp.h 2006-01-24 17:05:34.000000000 -0800
@@ -221,7 +221,7 @@ extern int sysctl_tcp_tso_win_divisor;
extern int sysctl_tcp_abc;

extern struct percpu_counter tcp_memory_allocated;
-extern atomic_t tcp_sockets_allocated;
+extern int *tcp_sockets_allocated;
extern int tcp_memory_pressure;

/*
Index: linux-2.6.16-rc1/net/core/sock.c
===================================================================
--- linux-2.6.16-rc1.orig/net/core/sock.c 2006-01-24 14:37:45.000000000 -0800
+++ linux-2.6.16-rc1/net/core/sock.c 2006-01-24 16:47:55.000000000 -0800
@@ -771,7 +771,7 @@ struct sock *sk_clone(const struct sock
newsk->sk_sleep = NULL;

if (newsk->sk_prot->sockets_allocated)
- atomic_inc(newsk->sk_prot->sockets_allocated);
+ inc_sockets_allocated(newsk->sk_prot->sockets_allocated);
}
out:
return newsk;
@@ -1620,7 +1620,7 @@ static void proto_seq_printf(struct seq_
"%2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c\n",
proto->name,
proto->obj_size,
- proto->sockets_allocated != NULL ? atomic_read(proto->sockets_allocated) : -1,
+ proto->sockets_allocated != NULL ? read_sockets_allocated(proto) : -1,
proto->memory_allocated != NULL ?
percpu_counter_read(proto->memory_allocated) : -1,
proto->memory_pressure != NULL ? *proto->memory_pressure ? "yes" : "no" : "NI",
Index: linux-2.6.16-rc1/net/core/stream.c
===================================================================
--- linux-2.6.16-rc1.orig/net/core/stream.c 2006-01-24 14:37:45.000000000 -0800
+++ linux-2.6.16-rc1/net/core/stream.c 2006-01-24 16:20:22.000000000 -0800
@@ -243,7 +243,7 @@ int sk_stream_mem_schedule(struct sock *
return 1;

if (!*sk->sk_prot->memory_pressure ||
- sk->sk_prot->sysctl_mem[2] > atomic_read(sk->sk_prot->sockets_allocated) *
+ sk->sk_prot->sysctl_mem[2] > read_sockets_allocated(sk->sk_prot) *
sk_stream_pages(sk->sk_wmem_queued +
atomic_read(&sk->sk_rmem_alloc) +
sk->sk_forward_alloc))
Index: linux-2.6.16-rc1/net/ipv4/proc.c
===================================================================
--- linux-2.6.16-rc1.orig/net/ipv4/proc.c 2006-01-24 14:37:45.000000000 -0800
+++ linux-2.6.16-rc1/net/ipv4/proc.c 2006-01-24 16:20:22.000000000 -0800
@@ -63,7 +63,7 @@ static int sockstat_seq_show(struct seq_
socket_seq_show(seq);
seq_printf(seq, "TCP: inuse %d orphan %d tw %d alloc %d mem %lu\n",
fold_prot_inuse(&tcp_prot), atomic_read(&tcp_orphan_count),
- tcp_death_row.tw_count, atomic_read(&tcp_sockets_allocated),
+ tcp_death_row.tw_count, read_sockets_allocated(&tcp_prot),
percpu_counter_read(&tcp_memory_allocated));
seq_printf(seq, "UDP: inuse %d\n", fold_prot_inuse(&udp_prot));
seq_printf(seq, "RAW: inuse %d\n", fold_prot_inuse(&raw_prot));
Index: linux-2.6.16-rc1/net/ipv4/tcp.c
===================================================================
--- linux-2.6.16-rc1.orig/net/ipv4/tcp.c 2006-01-24 14:38:37.000000000 -0800
+++ linux-2.6.16-rc1/net/ipv4/tcp.c 2006-01-24 17:39:25.000000000 -0800
@@ -284,7 +284,7 @@ EXPORT_SYMBOL(sysctl_tcp_rmem);
EXPORT_SYMBOL(sysctl_tcp_wmem);

struct percpu_counter tcp_memory_allocated; /* Current allocated memory. */
-atomic_t tcp_sockets_allocated; /* Current number of TCP sockets. */
+int *tcp_sockets_allocated; /* Current number of TCP sockets. */

EXPORT_SYMBOL(tcp_memory_allocated);
EXPORT_SYMBOL(tcp_sockets_allocated);
@@ -2055,6 +2055,12 @@ void __init tcp_init(void)
if (!tcp_hashinfo.bind_bucket_cachep)
panic("tcp_init: Cannot alloc tcp_bind_bucket cache.");

+	tcp_sockets_allocated = alloc_percpu(int);
+	if (!tcp_sockets_allocated)
+		panic("tcp_init: Cannot alloc tcp_sockets_allocated");
+
+	tcp_prot.sockets_allocated = tcp_sockets_allocated;
+
/* Size and allocate the main established and bind bucket
* hash tables.
*
Index: linux-2.6.16-rc1/net/ipv4/tcp_ipv4.c
===================================================================
--- linux-2.6.16-rc1.orig/net/ipv4/tcp_ipv4.c 2006-01-24 13:52:24.000000000 -0800
+++ linux-2.6.16-rc1/net/ipv4/tcp_ipv4.c 2006-01-24 16:59:08.000000000 -0800
@@ -1272,7 +1272,7 @@ static int tcp_v4_init_sock(struct sock
sk->sk_sndbuf = sysctl_tcp_wmem[1];
sk->sk_rcvbuf = sysctl_tcp_rmem[1];

- atomic_inc(&tcp_sockets_allocated);
+ inc_sockets_allocated(tcp_sockets_allocated);

return 0;
}
@@ -1306,7 +1306,7 @@ int tcp_v4_destroy_sock(struct sock *sk)
sk->sk_sndmsg_page = NULL;
}

- atomic_dec(&tcp_sockets_allocated);
+ dec_sockets_allocated(tcp_sockets_allocated);

return 0;
}
@@ -1814,7 +1814,6 @@ struct proto tcp_prot = {
.unhash = tcp_unhash,
.get_port = tcp_v4_get_port,
.enter_memory_pressure = tcp_enter_memory_pressure,
- .sockets_allocated = &tcp_sockets_allocated,
.orphan_count = &tcp_orphan_count,
.memory_allocated = &tcp_memory_allocated,
.memory_pressure = &tcp_memory_pressure,
Index: linux-2.6.16-rc1/net/ipv6/tcp_ipv6.c
===================================================================
--- linux-2.6.16-rc1.orig/net/ipv6/tcp_ipv6.c 2006-01-24 13:52:24.000000000 -0800
+++ linux-2.6.16-rc1/net/ipv6/tcp_ipv6.c 2006-01-24 17:01:17.000000000 -0800
@@ -1373,7 +1373,7 @@ static int tcp_v6_init_sock(struct sock
sk->sk_sndbuf = sysctl_tcp_wmem[1];
sk->sk_rcvbuf = sysctl_tcp_rmem[1];

- atomic_inc(&tcp_sockets_allocated);
+ inc_sockets_allocated(tcp_sockets_allocated);

return 0;
}
@@ -1571,7 +1571,6 @@ struct proto tcpv6_prot = {
.unhash = tcp_unhash,
.get_port = tcp_v6_get_port,
.enter_memory_pressure = tcp_enter_memory_pressure,
- .sockets_allocated = &tcp_sockets_allocated,
.memory_allocated = &tcp_memory_allocated,
.memory_pressure = &tcp_memory_pressure,
.orphan_count = &tcp_orphan_count,
@@ -1604,6 +1603,7 @@ static struct inet_protosw tcpv6_protosw
void __init tcpv6_init(void)
{
int err;
+ tcpv6_prot.sockets_allocated = tcp_sockets_allocated;

/* register inet6 protocol */
if (inet6_add_protocol(&tcpv6_protocol, IPPROTO_TCP) < 0)

2006-01-26 19:05:50

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: [patch 4/4] net: Percpufy frequently used variables -- proto.inuse

This patch uses alloc_percpu to allocate per-CPU memory for the proto->inuse
field. The inuse field is currently laid out as NR_CPUS * cacheline-size
padding -- a big bloat on arches with large cachelines. This patch also
marks some frequently used protos __read_mostly.
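
(For scale: with NR_CPUS = 128 and 128-byte cachelines, the old stats[NR_CPUS]
array costs 16 KB per struct proto even on a 2-CPU box, whereas alloc_percpu()
only allocates for CPUs that can actually be present.)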

Signed-off-by: Pravin B. Shelar <[email protected]>
Signed-off-by: Ravikiran Thirumalai <[email protected]>
Signed-off-by: Shai Fultheim <[email protected]>

Index: linux-2.6.16-rc1/include/net/sock.h
===================================================================
--- linux-2.6.16-rc1.orig/include/net/sock.h 2006-01-25 11:55:46.000000000 -0800
+++ linux-2.6.16-rc1/include/net/sock.h 2006-01-25 11:55:48.000000000 -0800
@@ -573,10 +573,7 @@ struct proto {
#ifdef SOCK_REFCNT_DEBUG
atomic_t socks;
#endif
-	struct {
-		int inuse;
-		u8 __pad[SMP_CACHE_BYTES - sizeof(int)];
-	} stats[NR_CPUS];
+	int *inuse;
};

static inline int read_sockets_allocated(struct proto *prot)
@@ -628,12 +625,12 @@ static inline void sk_refcnt_debug_relea
/* Called with local bh disabled */
static __inline__ void sock_prot_inc_use(struct proto *prot)
{
- prot->stats[smp_processor_id()].inuse++;
+ (*per_cpu_ptr(prot->inuse, smp_processor_id())) += 1;
}

static __inline__ void sock_prot_dec_use(struct proto *prot)
{
- prot->stats[smp_processor_id()].inuse--;
+ (*per_cpu_ptr(prot->inuse, smp_processor_id())) -= 1;
}

/* With per-bucket locks this operation is not-atomic, so that
Index: linux-2.6.16-rc1/net/core/sock.c
===================================================================
--- linux-2.6.16-rc1.orig/net/core/sock.c 2006-01-25 11:55:46.000000000 -0800
+++ linux-2.6.16-rc1/net/core/sock.c 2006-01-25 11:57:29.000000000 -0800
@@ -1508,12 +1508,21 @@ int proto_register(struct proto *prot, i
}
}

+	prot->inuse = alloc_percpu(int);
+	if (prot->inuse == NULL) {
+		if (alloc_slab)
+			goto out_free_timewait_sock_slab_name_cache;
+		else
+			goto out;
+	}
write_lock(&proto_list_lock);
list_add(&prot->node, &proto_list);
write_unlock(&proto_list_lock);
rc = 0;
out:
return rc;
+out_free_timewait_sock_slab_name_cache:
+	kmem_cache_destroy(prot->twsk_prot->twsk_slab);
out_free_timewait_sock_slab_name:
kfree(timewait_sock_slab_name);
out_free_request_sock_slab:
@@ -1537,6 +1546,11 @@ void proto_unregister(struct proto *prot
list_del(&prot->node);
write_unlock(&proto_list_lock);

+	if (prot->inuse != NULL) {
+		free_percpu(prot->inuse);
+		prot->inuse = NULL;
+	}
+
if (prot->slab != NULL) {
kmem_cache_destroy(prot->slab);
prot->slab = NULL;
Index: linux-2.6.16-rc1/net/ipv4/proc.c
===================================================================
--- linux-2.6.16-rc1.orig/net/ipv4/proc.c 2006-01-25 11:55:46.000000000 -0800
+++ linux-2.6.16-rc1/net/ipv4/proc.c 2006-01-25 11:55:48.000000000 -0800
@@ -50,7 +50,7 @@ static int fold_prot_inuse(struct proto
int cpu;

for_each_cpu(cpu)
- res += proto->stats[cpu].inuse;
+ res += (*per_cpu_ptr(proto->inuse, cpu));

return res;
}
Index: linux-2.6.16-rc1/net/ipv4/raw.c
===================================================================
--- linux-2.6.16-rc1.orig/net/ipv4/raw.c 2006-01-25 11:43:42.000000000 -0800
+++ linux-2.6.16-rc1/net/ipv4/raw.c 2006-01-25 11:55:48.000000000 -0800
@@ -718,7 +718,7 @@ static int raw_ioctl(struct sock *sk, in
}
}

-struct proto raw_prot = {
+struct proto raw_prot __read_mostly = {
.name = "RAW",
.owner = THIS_MODULE,
.close = raw_close,
Index: linux-2.6.16-rc1/net/ipv4/tcp_ipv4.c
===================================================================
--- linux-2.6.16-rc1.orig/net/ipv4/tcp_ipv4.c 2006-01-25 11:55:46.000000000 -0800
+++ linux-2.6.16-rc1/net/ipv4/tcp_ipv4.c 2006-01-25 11:55:48.000000000 -0800
@@ -1794,7 +1794,7 @@ void tcp4_proc_exit(void)
}
#endif /* CONFIG_PROC_FS */

-struct proto tcp_prot = {
+struct proto tcp_prot __read_mostly = {
.name = "TCP",
.owner = THIS_MODULE,
.close = tcp_close,
Index: linux-2.6.16-rc1/net/ipv4/udp.c
===================================================================
--- linux-2.6.16-rc1.orig/net/ipv4/udp.c 2006-01-25 11:43:42.000000000 -0800
+++ linux-2.6.16-rc1/net/ipv4/udp.c 2006-01-25 11:55:48.000000000 -0800
@@ -1340,7 +1340,7 @@ unsigned int udp_poll(struct file *file,

}

-struct proto udp_prot = {
+struct proto udp_prot __read_mostly = {
.name = "UDP",
.owner = THIS_MODULE,
.close = udp_close,
Index: linux-2.6.16-rc1/net/ipv6/proc.c
===================================================================
--- linux-2.6.16-rc1.orig/net/ipv6/proc.c 2006-01-25 11:43:42.000000000 -0800
+++ linux-2.6.16-rc1/net/ipv6/proc.c 2006-01-25 11:55:48.000000000 -0800
@@ -39,7 +39,7 @@ static int fold_prot_inuse(struct proto
int cpu;

for_each_cpu(cpu)
- res += proto->stats[cpu].inuse;
+ res += (*per_cpu_ptr(proto->inuse, cpu));

return res;
}

2006-01-27 08:54:13

by Eric Dumazet

[permalink] [raw]
Subject: Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

Ravikiran G Thirumalai a écrit :
> Change the atomic_t sockets_allocated member of struct proto to a
> per-cpu counter.
>
> Signed-off-by: Pravin B. Shelar <[email protected]>
> Signed-off-by: Ravikiran Thirumalai <[email protected]>
> Signed-off-by: Shai Fultheim <[email protected]>
>
Hi Ravikiran

If I correctly read this patch, I think there is a scalability problem.

On a big SMP machine, read_sockets_allocated() is going to be a real killer.

Say we have 128 Opteron CPUs in a box.

You'll need to bring in 128 cache lines (plus 8*128 bytes to read the 128
pointers inside the percpu structure).

I think a solution 'a la percpu_counter' is preferable, or even better a
dedicated per-cpu counter with threshold management (see mm/swap.c, function
vm_acct_memory(), for how vm_committed_space is updated without hurting SMP
scalability too badly).

Thank you

Eric
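
For readers chasing the reference, that scheme looks roughly like this
(paraphrased from the 2.6-era mm/swap.c; worth double-checking against the
tree). Each CPU batches a small local delta and only touches the shared
atomic_t past a threshold:

#define ACCT_THRESHOLD	max(16, NR_CPUS * 2)

static DEFINE_PER_CPU(long, committed_space) = 0;

void vm_acct_memory(long pages)
{
	long *local;

	preempt_disable();
	local = &__get_cpu_var(committed_space);
	*local += pages;
	/* fold into the global counter only past the threshold */
	if (*local > ACCT_THRESHOLD || *local < -ACCT_THRESHOLD) {
		atomic_add(*local, &vm_committed_space);
		*local = 0;
	}
	preempt_enable();
}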

2006-01-27 09:01:40

by Eric Dumazet

[permalink] [raw]
Subject: Re: [patch 2/4] net: Percpufy frequently used variables -- struct proto.memory_allocated

Ravikiran G Thirumalai a écrit :
> Change struct proto->memory_allocated to a batching per-CPU counter
> (percpu_counter) from an atomic_t. A batching counter is better than a
> plain per-CPU counter as this field is read often.
>
> Signed-off-by: Pravin B. Shelar <[email protected]>
> Signed-off-by: Ravikiran Thirumalai <[email protected]>
> Signed-off-by: Shai Fultheim <[email protected]>
>

Hello Ravikiran

I like this patch, but I'm not sure current percpu_counter fits the needs.

percpu_counter_read() can return a value that is off by +/-
FBC_BATCH*NR_CPUS, i.e. 2*(NR_CPUS^2) or 4*(NR_CPUS^2).

If NR_CPUS = 128, the 'error' from percpu_counter_read() is +/- 32768.

Is that acceptable?

Thank you
Eric
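
For reference, the batch size Eric is computing from is defined in the
2.6.16-era include/linux/percpu_counter.h roughly as follows (paraphrased;
worth double-checking against the tree):

#if NR_CPUS >= 16
#define FBC_BATCH	(NR_CPUS * 2)
#else
#define FBC_BATCH	(NR_CPUS * 4)
#endif

Each CPU can hold up to FBC_BATCH unflushed units in its local slot, so the
worst-case error of percpu_counter_read() is FBC_BATCH * NR_CPUS; with
NR_CPUS = 128 that is (128 * 2) * 128 = 32768.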

2006-01-27 19:52:27

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

On Fri, Jan 27, 2006 at 09:53:53AM +0100, Eric Dumazet wrote:
> Ravikiran G Thirumalai a écrit :
> >Change the atomic_t sockets_allocated member of struct proto to a
> >per-cpu counter.
> >
> >Signed-off-by: Pravin B. Shelar <[email protected]>
> >Signed-off-by: Ravikiran Thirumalai <[email protected]>
> >Signed-off-by: Shai Fultheim <[email protected]>
> >
> Hi Ravikiran
>
> If I correctly read this patch, I think there is a scalability problem.
>
> On a big SMP machine, read_sockets_allocated() is going to be a real killer.
>
> Say we have 128 Opterons CPUS in a box.

read_sockets_allocated is invoked when /proc/net/protocols is read,
which can be assumed to be infrequent.
At sk_stream_mem_schedule(), read_sockets_allocated() is invoked only under
certain conditions, under memory pressure -- on a large CPU count machine,
you'd have large memory, and I don't think read_sockets_allocated would get
called often. It did not, at least on our 8cpu/16G box. So this should be OK,
I think.

There are no 128-CPU Opteron boxes yet, AFAIK ;).

Thanks,
Kiran

2006-01-27 20:16:44

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

Ravikiran G Thirumalai <[email protected]> wrote:
>
> On Fri, Jan 27, 2006 at 09:53:53AM +0100, Eric Dumazet wrote:
> > Ravikiran G Thirumalai a écrit :
> > >Change the atomic_t sockets_allocated member of struct proto to a
> > >per-cpu counter.
> > >
> > >Signed-off-by: Pravin B. Shelar <[email protected]>
> > >Signed-off-by: Ravikiran Thirumalai <[email protected]>
> > >Signed-off-by: Shai Fultheim <[email protected]>
> > >
> > Hi Ravikiran
> >
> > If I correctly read this patch, I think there is a scalability problem.
> >
> > On a big SMP machine, read_sockets_allocated() is going to be a real killer.
> >
> > Say we have 128 Opterons CPUS in a box.
>
> read_sockets_allocated is being invoked when /proc/net/protocols is read,
> which can be assumed as not frequent.
> At sk_stream_mem_schedule(), read_sockets_allocated() is invoked only
> certain conditions, under memory pressure -- on a large CPU count machine,
> you'd have large memory, and I don't think read_sockets_allocated would get
> called often. It did not atleast on our 8cpu/16G box. So this should be OK
> I think.

That being said, the percpu_counters aren't a terribly successful concept
and probably do need a revisit due to the high inaccuracy at high CPU
counts. It might be better to do some generic version of vm_acct_memory()
instead.

If the benchmarks say that we need to. If we cannot observe any problems
in testing of existing code and if we can't demonstrate any benefit from
the patched code then one option is to go off and do something else ;)

2006-01-27 22:30:38

by Eric Dumazet

[permalink] [raw]
Subject: Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

Andrew Morton a écrit :
> Ravikiran G Thirumalai <[email protected]> wrote:
>> On Fri, Jan 27, 2006 at 09:53:53AM +0100, Eric Dumazet wrote:
>>> Ravikiran G Thirumalai a écrit :
>>>> Change the atomic_t sockets_allocated member of struct proto to a
>>>> per-cpu counter.
>>>>
>>>> Signed-off-by: Pravin B. Shelar <[email protected]>
>>>> Signed-off-by: Ravikiran Thirumalai <[email protected]>
>>>> Signed-off-by: Shai Fultheim <[email protected]>
>>>>
>>> Hi Ravikiran
>>>
>>> If I correctly read this patch, I think there is a scalability problem.
>>>
>>> On a big SMP machine, read_sockets_allocated() is going to be a real killer.
>>>
>>> Say we have 128 Opterons CPUS in a box.
>> read_sockets_allocated is being invoked when /proc/net/protocols is read,
>> which can be assumed as not frequent.
>> At sk_stream_mem_schedule(), read_sockets_allocated() is invoked only
>> certain conditions, under memory pressure -- on a large CPU count machine,
>> you'd have large memory, and I don't think read_sockets_allocated would get
>> called often. It did not atleast on our 8cpu/16G box. So this should be OK
>> I think.
>
> That being said, the percpu_counters aren't a terribly successful concept
> and probably do need a revisit due to the high inaccuracy at high CPU
> counts. It might be better to do some generic version of vm_acct_memory()
> instead.

There are several issues here :

alloc_percpu()'s current implementation is a waste of RAM (because it uses
slab allocations, which have a minimum size of 32 bytes).

Currently we cannot use per_cpu(&some_object, cpu), so a generic version of
vm_acct_memory() would need a rework of percpu.h, and maybe that is not
possible on every platform?

#define per_cpu(var, cpu) (*RELOC_HIDE(&per_cpu__##var, __per_cpu_offset[cpu]))

-->

#define per_cpu_name(var) per_cpu__##var
#define per_cpu_addr(var) &per_cpu_name(var)
#define per_cpu(var, cpu) (*RELOC_HIDE(per_cpu_addr(var), __per_cpu_offset[cpu])


But this could render TLS migration difficult...

Eric

2006-01-27 22:44:30

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

On Fri, Jan 27, 2006 at 12:16:02PM -0800, Andrew Morton wrote:
> Ravikiran G Thirumalai <[email protected]> wrote:
> >
> > which can be assumed as not frequent.
> > At sk_stream_mem_schedule(), read_sockets_allocated() is invoked only
> > certain conditions, under memory pressure -- on a large CPU count machine,
> > you'd have large memory, and I don't think read_sockets_allocated would get
> > called often. It did not atleast on our 8cpu/16G box. So this should be OK
> > I think.
>
> That being said, the percpu_counters aren't a terribly successful concept
> and probably do need a revisit due to the high inaccuracy at high CPU
> counts. It might be better to do some generic version of vm_acct_memory()
> instead.

AFAICS vm_acct_memory is no better. The deviation on large cpu counts is the
same as percpu_counters -- (NR_CPUS * NR_CPUS * 2) ...

>
> If the benchmarks say that we need to. If we cannot observe any problems
> in testing of existing code and if we can't demonstrate any benefit from
> the patched code then one option is to go off and do something else ;)

We first tried plain per-CPU counters for memory_allocated, found that reads
of memory_allocated were causing cacheline transfers, and then
switched over to batching. So batching reads is useful. To avoid
inaccuracy, we can maybe change percpu_counter_init to:

void percpu_counter_init(struct percpu_counter *fbc, int maxdev)

The per-cpu batching limit would then be maxdev/num_possible_cpus. One would
use batching counters only when both reads and writes are frequent. With
the above scheme, we would fetch cachelines from other cpus on reads only
at large cpu counts, which is no worse than the global counter alternative,
but it would still be beneficial on smaller machines, without sacrificing a
pre-set deviation.

Comments?

Thanks,
Kiran
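
A hypothetical sketch of that init (neither the maxdev argument nor a
per-counter batch field exists in this tree; the names are illustrative):

void percpu_counter_init(struct percpu_counter *fbc, int maxdev)
{
	spin_lock_init(&fbc->lock);
	fbc->count = 0;
	/* choose the per-cpu batch so that the worst-case global
	 * deviation (batch * num_possible_cpus) stays within maxdev */
	fbc->batch = max(1, maxdev / num_possible_cpus());
	fbc->counters = alloc_percpu(long);
}

percpu_counter_mod() would then compare the local slot against fbc->batch
instead of the compile-time FBC_BATCH.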

2006-01-27 22:50:36

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

On Fri, Jan 27, 2006 at 11:30:23PM +0100, Eric Dumazet wrote:
>
> There are several issues here :
>
> alloc_percpu() current implementation is a a waste of ram. (because it uses
> slab allocations that have a minimum size of 32 bytes)

Oh there was a solution for that :).

http://lwn.net/Articles/119532/

I can quickly revive it if there is interest.


Thanks,
Kiran

2006-01-27 22:58:18

by Eric Dumazet

[permalink] [raw]
Subject: Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

Ravikiran G Thirumalai a écrit :
> On Fri, Jan 27, 2006 at 12:16:02PM -0800, Andrew Morton wrote:
>> Ravikiran G Thirumalai <[email protected]> wrote:
>>> which can be assumed as not frequent.
>>> At sk_stream_mem_schedule(), read_sockets_allocated() is invoked only
>>> certain conditions, under memory pressure -- on a large CPU count machine,
>>> you'd have large memory, and I don't think read_sockets_allocated would get
>>> called often. It did not atleast on our 8cpu/16G box. So this should be OK
>>> I think.
>> That being said, the percpu_counters aren't a terribly successful concept
>> and probably do need a revisit due to the high inaccuracy at high CPU
>> counts. It might be better to do some generic version of vm_acct_memory()
>> instead.
>
> AFAICS vm_acct_memory is no better. The deviation on large cpu counts is the
> same as percpu_counters -- (NR_CPUS * NR_CPUS * 2) ...

Ah... yes, you are right, I read it as min(16, NR_CPUS*2).

I wonder if it is not a typo... I mean, I understand that the more cpus you
have, the fewer updates on the central atomic_t are desirable, but a
quadratic offset seems too much...

Eric

2006-01-27 22:59:21

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

Ravikiran G Thirumalai <[email protected]> wrote:
>
> On Fri, Jan 27, 2006 at 12:16:02PM -0800, Andrew Morton wrote:
> > Ravikiran G Thirumalai <[email protected]> wrote:
> > >
> > > which can be assumed as not frequent.
> > > At sk_stream_mem_schedule(), read_sockets_allocated() is invoked only
> > > certain conditions, under memory pressure -- on a large CPU count machine,
> > > you'd have large memory, and I don't think read_sockets_allocated would get
> > > called often. It did not atleast on our 8cpu/16G box. So this should be OK
> > > I think.
> >
> > That being said, the percpu_counters aren't a terribly successful concept
> > and probably do need a revisit due to the high inaccuracy at high CPU
> > counts. It might be better to do some generic version of vm_acct_memory()
> > instead.
>
> AFAICS vm_acct_memory is no better. The deviation on large cpu counts is the
> same as percpu_counters -- (NR_CPUS * NR_CPUS * 2) ...

I suppose so. Except vm_acct_memory() has

#define ACCT_THRESHOLD max(16, NR_CPUS * 2)

But if we were to perform similar tuning to percpu_counter, yes, they're
pretty similar.

Oh, and because vm_acct_memory() is counting a singleton object, it can use
DEFINE_PER_CPU rather than alloc_percpu(), so it saves on a bit of kmalloc
overhead.


> >
> > If the benchmarks say that we need to. If we cannot observe any problems
> > in testing of existing code and if we can't demonstrate any benefit from
> > the patched code then one option is to go off and do something else ;)
>
> We first tried plain per-CPU counters for memory_allocated, found that reads
> on memory_allocated was causing cacheline transfers, and then
> switched over to batching. So batching reads is useful. To avoid
> inaccuracy, we can maybe change percpu_counter_init to:
>
> void percpu_counter_init(struct percpu_counter *fbc, int maxdev)
>
> the percpu batching limit would then be maxdev/num_possible_cpus. One would
> use batching counters only when both reads and writes are frequent. With
> the above scheme, we would go fetch cachelines from other cpus for read
> often only on large cpu counts, which is not any worse than the global
> counter alternative, but it would still be beneficial on smaller machines,
> without sacrificing a pre-set deviation.
>
> Comments?

Sounds sane.

2006-01-27 23:07:05

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

Andrew Morton <[email protected]> wrote:
>
> Oh, and because vm_acct_memory() is counting a singleton object, it can use
> DEFINE_PER_CPU rather than alloc_percpu(), so it saves on a bit of kmalloc
> overhead.

Actually, I don't think that's true. We're allocating a sizeof(long) with
kmalloc_node(), so there shouldn't be memory wastage.

2006-01-27 23:14:53

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

Eric Dumazet <[email protected]> wrote:
>
> Ravikiran G Thirumalai a écrit :
> > On Fri, Jan 27, 2006 at 12:16:02PM -0800, Andrew Morton wrote:
> >> Ravikiran G Thirumalai <[email protected]> wrote:
> >>> which can be assumed as not frequent.
> >>> At sk_stream_mem_schedule(), read_sockets_allocated() is invoked only
> >>> certain conditions, under memory pressure -- on a large CPU count machine,
> >>> you'd have large memory, and I don't think read_sockets_allocated would get
> >>> called often. It did not atleast on our 8cpu/16G box. So this should be OK
> >>> I think.
> >> That being said, the percpu_counters aren't a terribly successful concept
> >> and probably do need a revisit due to the high inaccuracy at high CPU
> >> counts. It might be better to do some generic version of vm_acct_memory()
> >> instead.
> >
> > AFAICS vm_acct_memory is no better. The deviation on large cpu counts is the
> > same as percpu_counters -- (NR_CPUS * NR_CPUS * 2) ...
>
> Ah... yes you are right, I read min(16, NR_CPUS*2)

So did I ;)

> I wonder if it is not a typo... I mean, I understand the more cpus you have,
> the less updates on central atomic_t is desirable, but a quadratic offset
> seems too much...

I'm not sure whether it was a mistake or if I intended it and didn't do the
sums on accuracy :(

An advantage of retaining a spinlock in percpu_counter is that if accuracy
is needed at a low rate (say, /proc reading) we can take the lock and then
go spill each CPU's local count into the main one. It would need to be a
very low rate though. Or we make the cpu-local counters atomic too.

Certainly it's sensible to delegate the tuning to the creator of the
percpu_counter, but it'll be a difficult thing to get right.

2006-01-27 23:21:10

by Eric Dumazet

[permalink] [raw]
Subject: Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

Ravikiran G Thirumalai a écrit :
> On Fri, Jan 27, 2006 at 11:30:23PM +0100, Eric Dumazet wrote:
>> There are several issues here :
>>
>> alloc_percpu() current implementation is a a waste of ram. (because it uses
>> slab allocations that have a minimum size of 32 bytes)
>
> Oh there was a solution for that :).
>
> http://lwn.net/Articles/119532/
>
> I can quickly revive it if there is interest.
>

Well, nice work! :)

Maybe a little bit complicated if the expected percpu space is 50 KB per cpu?

Why not use a boot-time allocated percpu area (as done today in
setup_per_cpu_areas()), but instead of reserving extra space for modules'
percpu data, be able to serve alloc_percpu() from this reserved area (ie no
kmalloced data anymore), keeping your

#define per_cpu_ptr(ptr, cpu) ((__typeof__(ptr)) \
	(RELOC_HIDE(ptr, PCPU_BLKSIZE * cpu)))

Some code from kernel/module.c could be reworked to serve both as an
allocator when a module's percpu data must be relocated (insmod) / freed
(rmmod), and to serve alloc_percpu() for 'dynamic allocations'.

Eric

2006-01-28 00:01:04

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

On Fri, Jan 27, 2006 at 03:08:47PM -0800, Andrew Morton wrote:
> Andrew Morton <[email protected]> wrote:
> >
> > Oh, and because vm_acct_memory() is counting a singleton object, it can use
> > DEFINE_PER_CPU rather than alloc_percpu(), so it saves on a bit of kmalloc
> > overhead.
>
> Actually, I don't think that's true. we're allocating a sizeof(long) with
> kmalloc_node() so there shouldn't be memory wastage.

Oh yeah, there is. Each dynamic per-cpu object would have been at least
(NR_CPUS * sizeof (void *) + num_cpus_possible * cacheline_size).
Now kmalloc_node will fall back on size-32 for the allocation of a long, so
replace the cacheline_size above with 32 -- which then means dynamic per-cpu
data are not on a cacheline boundary anymore (most modern cpus have
64-byte/128-byte cache lines), which means per-cpu data could end up falsely
shared....

Kiran
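
(As a concrete example: with 32-byte slab objects and 128-byte cache lines,
four different CPUs' 32-byte slots can end up sharing one line, so updates
from those CPUs bounce the line between them even though each slot is
nominally private.)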

2006-01-28 00:24:27

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

Ravikiran G Thirumalai <[email protected]> wrote:
>
> On Fri, Jan 27, 2006 at 03:08:47PM -0800, Andrew Morton wrote:
> > Andrew Morton <[email protected]> wrote:
> > >
> > > Oh, and because vm_acct_memory() is counting a singleton object, it can use
> > > DEFINE_PER_CPU rather than alloc_percpu(), so it saves on a bit of kmalloc
> > > overhead.
> >
> > Actually, I don't think that's true. we're allocating a sizeof(long) with
> > kmalloc_node() so there shouldn't be memory wastage.
>
> Oh yeah there is. Each dynamic per-cpu object would have been atleast
> (NR_CPUS * sizeof (void *) + num_cpus_possible * cacheline_size ).
> Now kmalloc_node will fall back on size-32 for allocation of long, so
> replace the cacheline_size above with 32 -- which then means dynamic per-cpu
> data are not on a cacheline boundary anymore (most modern cpus have 64byte/128
> byte cache lines) which means per-cpu data could end up false shared....
>

OK. But isn't the core of the problem the fact that __alloc_percpu() is
using kmalloc_node() rather than a (new, as-yet-unimplemented)
kmalloc_cpu()? kmalloc_cpu() wouldn't need the L1 cache alignment.

It might be worth creating just a small number of per-cpu slabs (4-byte,
8-byte). A kmalloc_cpu() would just need a per-cpu array of
kmem_cache_t*'s and it'd internally use kmalloc_node(cpu_to_node), no?

Or we could just give __alloc_percpu() a custom, hand-rolled,
not-cacheline-padded sizeof(long) slab per CPU and use that if (size ==
sizeof(long)). Or something.
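
A hypothetical sketch of that kmalloc_cpu() (nothing here exists in the tree;
cache creation is omitted and the names are illustrative):

static kmem_cache_t *cpu_slabs[2];	/* unpadded 4-byte and 8-byte caches */

static void *kmalloc_cpu(size_t size, gfp_t flags, int cpu)
{
	/* no L1 padding needed: a per-cpu object is written almost
	 * exclusively by its owning cpu */
	kmem_cache_t *cachep = cpu_slabs[size > 4];

	BUG_ON(size > 8);
	return kmem_cache_alloc_node(cachep, flags, cpu_to_node(cpu));
}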

2006-01-28 00:28:25

by Eric Dumazet

[permalink] [raw]
Subject: Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

struct percpu_counter {
	atomic_long_t count;
	atomic_long_t *counters;
};

#ifdef CONFIG_SMP
void percpu_counter_mod(struct percpu_counter *fbc, long amount)
{
	long old, new;
	atomic_long_t *pcount;

	pcount = per_cpu_ptr(fbc->counters, get_cpu());
start:
	old = atomic_long_read(pcount);
	new = old + amount;
	if (new >= FBC_BATCH || new <= -FBC_BATCH) {
		if (unlikely(atomic_long_cmpxchg(pcount, old, 0) != old))
			goto start;
		atomic_long_add(new, &fbc->count);
	} else
		atomic_long_add(amount, pcount);

	put_cpu();
}
EXPORT_SYMBOL(percpu_counter_mod);

long percpu_counter_read_accurate(struct percpu_counter *fbc)
{
	long res = 0;
	int cpu;
	atomic_long_t *pcount;

	for_each_cpu(cpu) {
		pcount = per_cpu_ptr(fbc->counters, cpu);
		/* dont dirty cache line if not necessary */
		if (atomic_long_read(pcount))
			res += atomic_long_xchg(pcount, 0);
	}
	return res;
}
EXPORT_SYMBOL(percpu_counter_read_accurate);
#endif /* CONFIG_SMP */


Attachments:
functions (949.00 B)

2006-01-28 00:35:10

by Eric Dumazet

[permalink] [raw]
Subject: Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

Eric Dumazet a écrit :
> Andrew Morton a écrit :
>> Eric Dumazet <[email protected]> wrote:
>>> Ravikiran G Thirumalai a écrit :
>>>> On Fri, Jan 27, 2006 at 12:16:02PM -0800, Andrew Morton wrote:
>>>>> Ravikiran G Thirumalai <[email protected]> wrote:
>>>>>> which can be assumed as not frequent. At
>>>>>> sk_stream_mem_schedule(), read_sockets_allocated() is invoked only
>>>>>> certain conditions, under memory pressure -- on a large CPU count
>>>>>> machine, you'd have large memory, and I don't think
>>>>>> read_sockets_allocated would get called often. It did not atleast
>>>>>> on our 8cpu/16G box. So this should be OK I think.
>>>>> That being said, the percpu_counters aren't a terribly successful
>>>>> concept
>>>>> and probably do need a revisit due to the high inaccuracy at high CPU
>>>>> counts. It might be better to do some generic version of
>>>>> vm_acct_memory()
>>>>> instead.
>>>> AFAICS vm_acct_memory is no better. The deviation on large cpu
>>>> counts is the same as percpu_counters -- (NR_CPUS * NR_CPUS * 2) ...
>>> Ah... yes you are right, I read min(16, NR_CPUS*2)
>>
>> So did I ;)
>>
>>> I wonder if it is not a typo... I mean, I understand the more cpus
>>> you have, the less updates on central atomic_t is desirable, but a
>>> quadratic offset seems too much...
>>
>> I'm not sure whether it was a mistake or if I intended it and didn't
>> do the
>> sums on accuracy :(
>>
>> An advantage of retaining a spinlock in percpu_counter is that if
>> accuracy
>> is needed at a low rate (say, /proc reading) we can take the lock and
>> then
>> go spill each CPU's local count into the main one. It would need to be a
>> very low rate though. Or we make the cpu-local counters atomic too.
>
> We might use atomic_long_t only (and no spinlocks)
> Something like this ?
>
>
> ------------------------------------------------------------------------
>
> struct percpu_counter {
> 	atomic_long_t count;
> 	atomic_long_t *counters;
> };
>
> #ifdef CONFIG_SMP
> void percpu_counter_mod(struct percpu_counter *fbc, long amount)
> {
> 	long old, new;
> 	atomic_long_t *pcount;
>
> 	pcount = per_cpu_ptr(fbc->counters, get_cpu());
> start:
> 	old = atomic_long_read(pcount);
> 	new = old + amount;
> 	if (new >= FBC_BATCH || new <= -FBC_BATCH) {
> 		if (unlikely(atomic_long_cmpxchg(pcount, old, 0) != old))
> 			goto start;
> 		atomic_long_add(new, &fbc->count);
> 	} else
> 		atomic_long_add(amount, pcount);
>
> 	put_cpu();
> }
> EXPORT_SYMBOL(percpu_counter_mod);
>
> long percpu_counter_read_accurate(struct percpu_counter *fbc)
> {
> 	long res = 0;
> 	int cpu;
> 	atomic_long_t *pcount;
>
> 	for_each_cpu(cpu) {
> 		pcount = per_cpu_ptr(fbc->counters, cpu);
> 		/* dont dirty cache line if not necessary */
> 		if (atomic_long_read(pcount))
> 			res += atomic_long_xchg(pcount, 0);
> 	}

	atomic_long_add(res, &fbc->count);
	res = atomic_long_read(&fbc->count);

> 	return res;
> }
> EXPORT_SYMBOL(percpu_counter_read_accurate);
> #endif /* CONFIG_SMP */
>

2006-01-28 00:40:25

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

On Sat, Jan 28, 2006 at 12:21:07AM +0100, Eric Dumazet wrote:
> Ravikiran G Thirumalai a écrit :
> >On Fri, Jan 27, 2006 at 11:30:23PM +0100, Eric Dumazet wrote:
>
> Why not use a boot time allocated percpu area (as done today in
> setup_per_cpu_areas()), but instead of reserving extra space for module's
> percpu data, being able to serve alloc_percpu() from this reserved area (ie
> no kmalloced data anymore), and keeping your

At that time ia64 placed a limit on the max size of the per-CPU area
(PERCPU_ENOUGH_ROOM). I think the limit is still there, but hopefully
64K per-CPU should be enough for static + dynamic + modules?

Let me do an allyesconfig on my box and verify.

Thanks,
Kiran

2006-01-28 00:41:23

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

Eric Dumazet <[email protected]> wrote:
>
> >
> > An advantage of retaining a spinlock in percpu_counter is that if accuracy
> > is needed at a low rate (say, /proc reading) we can take the lock and then
> > go spill each CPU's local count into the main one. It would need to be a
> > very low rate though. Or we make the cpu-local counters atomic too.
>
> We might use atomic_long_t only (and no spinlocks)

Yup, that's it.

> Something like this ?
>

It'd be a lot neater if we had atomic_long_xchg().

2006-01-28 01:10:06

by Eric Dumazet

[permalink] [raw]
Subject: Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

--- a/include/asm-generic/atomic.h 2006-01-28 02:59:49.000000000 +0100
+++ b/include/asm-generic/atomic.h 2006-01-28 02:57:36.000000000 +0100
@@ -66,6 +66,18 @@
atomic64_sub(i, v);
}

+static inline long atomic_long_xchg(atomic_long_t *l, long val)
+{
+	atomic64_t *v = (atomic64_t *)l;
+	return atomic64_xchg(v, val);
+}
+
+static inline long atomic_long_cmpxchg(atomic_long_t *l, long old, long new)
+{
+	atomic64_t *v = (atomic64_t *)l;
+	return atomic64_cmpxchg(v, old, new);
+}
+
#else

typedef atomic_t atomic_long_t;
@@ -113,5 +125,17 @@
atomic_sub(i, v);
}

+static inline long atomic_long_xchg(atomic_long_t *l, long val)
+{
+	atomic_t *v = (atomic_t *)l;
+	return atomic_xchg(v, val);
+}
+
+static inline long atomic_long_cmpxchg(atomic_long_t *l, long old, long new)
+{
+	atomic_t *v = (atomic_t *)l;
+	return atomic_cmpxchg(v, old, new);
+}
+
#endif
#endif


Attachments:
atomic.patch (882.00 B)

2006-01-28 01:16:32

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

Eric Dumazet <[email protected]> wrote:
>
> [PATCH] Add atomic_long_xchg() and atomic_long_cmpxchg() wrappers
>
> ...
>
> +static inline long atomic_long_xchg(atomic_long_t *l, long val)
> +{
> + atomic64_t *v = (atomic64_t *)l;
> + return atomic64_xchg(v, val);

All we need now is some implementations of atomic64_xchg() ;)
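
A minimal generic fallback sketch, assuming the architecture already provides
atomic64_read() and atomic64_cmpxchg() (illustrative only; most arches would
supply a native xchg instead):

static inline long atomic64_xchg_generic(atomic64_t *v, long new)
{
	long old;

	/* retry until no other cpu has modified v in between */
	do {
		old = atomic64_read(v);
	} while (atomic64_cmpxchg(v, old, new) != old);
	return old;
}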

2006-01-28 04:52:48

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

On Sat, Jan 28, 2006 at 01:35:03AM +0100, Eric Dumazet wrote:
> Eric Dumazet a écrit :
> > Andrew Morton a écrit :
> >> Eric Dumazet <[email protected]> wrote:
> >
> > #ifdef CONFIG_SMP
> > void percpu_counter_mod(struct percpu_counter *fbc, long amount)
> > {
> > 	long old, new;
> > 	atomic_long_t *pcount;
> >
> > 	pcount = per_cpu_ptr(fbc->counters, get_cpu());
> > start:
> > 	old = atomic_long_read(pcount);
> > 	new = old + amount;
> > 	if (new >= FBC_BATCH || new <= -FBC_BATCH) {
> > 		if (unlikely(atomic_long_cmpxchg(pcount, old, 0) != old))
> > 			goto start;
> > 		atomic_long_add(new, &fbc->count);
> > 	} else
> > 		atomic_long_add(amount, pcount);
> >
> > 	put_cpu();
> > }
> > EXPORT_SYMBOL(percpu_counter_mod);
> >
> > long percpu_counter_read_accurate(struct percpu_counter *fbc)
> > {
> > 	long res = 0;
> > 	int cpu;
> > 	atomic_long_t *pcount;
> >
> > 	for_each_cpu(cpu) {
> > 		pcount = per_cpu_ptr(fbc->counters, cpu);
> > 		/* dont dirty cache line if not necessary */
> > 		if (atomic_long_read(pcount))
> > 			res += atomic_long_xchg(pcount, 0);
---------------------------> (A)
> > 	}
> >

> 	atomic_long_add(res, &fbc->count);
---------------------------> (B)
> 	res = atomic_long_read(&fbc->count);
>
> > 	return res;
> > }

The read is still theoretically FBC_BATCH * NR_CPUS inaccurate, no?
What happens when other cpus update their local counters at (A) and (B)?

(I am hoping we don't need percpu_counter_read_accurate anywhere yet and this
is just a demo ;). I certainly don't want to iterate over all cpus on the
read side, or add a cmpxchg to the write path, for the proto counters that
started this discussion.)

Thanks,
Kiran

2006-01-28 07:19:17

by Eric Dumazet

[permalink] [raw]
Subject: Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

struct percpu_counter {
	atomic_long_t count;
	atomic_long_t *counters;
};

#ifdef CONFIG_SMP
void percpu_counter_mod(struct percpu_counter *fbc, long amount)
{
	long new;
	atomic_long_t *pcount;

	pcount = per_cpu_ptr(fbc->counters, get_cpu());

	new = atomic_long_read(pcount) + amount;

	if (new >= FBC_BATCH || new <= -FBC_BATCH) {
		new = atomic_long_xchg(pcount, 0) + amount;
		if (new)
			atomic_long_add(new, &fbc->count);
	} else
		atomic_long_add(amount, pcount);

	put_cpu();
}
EXPORT_SYMBOL(percpu_counter_mod);

long percpu_counter_read_accurate(struct percpu_counter *fbc)
{
	long res = 0;
	int cpu;
	atomic_long_t *pcount;

	for_each_cpu(cpu) {
		pcount = per_cpu_ptr(fbc->counters, cpu);
		/* dont dirty cache line if not necessary */
		if (atomic_long_read(pcount))
			res += atomic_long_xchg(pcount, 0);
	}
	atomic_long_add(res, &fbc->count);
	return atomic_long_read(&fbc->count);
}
EXPORT_SYMBOL(percpu_counter_read_accurate);
#endif /* CONFIG_SMP */


Attachments:
functions (1.03 kB)

2006-01-29 00:49:18

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

On Sat, Jan 28, 2006 at 01:28:20AM +0100, Eric Dumazet wrote:
> We might use atomic_long_t only (and no spinlocks)
> Something like this ?

Erk, complex and slow... Try using local_t instead, which is substantially
cheaper on the P4 as it doesn't use the lock prefix or act as a memory
barrier. See asm/local.h.

-ben

2006-01-29 00:56:21

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

Benjamin LaHaise <[email protected]> wrote:
>
> On Sat, Jan 28, 2006 at 01:28:20AM +0100, Eric Dumazet wrote:
> > We might use atomic_long_t only (and no spinlocks)
> > Something like this ?
>
> Erk, complex and slow... Try using local_t instead, which is substantially
> cheaper on the P4 as it doesn't use the lock prefix and act as a memory
> barrier. See asm/local.h.
>

local_t isn't much use until we get rid of asm-generic/local.h. Bloaty,
racy with nested interrupts.

2006-01-29 01:24:12

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

On Sat, Jan 28, 2006 at 04:55:49PM -0800, Andrew Morton wrote:
> local_t isn't much use until we get rid of asm-generic/local.h. Bloaty,
> racy with nested interrupts.

The overuse of atomics is horrific in what is being proposed. All the
major architectures except powerpc (i386, x86-64, ia64, and sparc64)
implement local_t. It would make far more sense to push the last few
stragglers (which mostly seem to be uniprocessor) into writing the
appropriate implementations. Perhaps it's time to add a #error in
asm-generic/local.h?

-ben
--
"Ladies and gentlemen, I'm sorry to interrupt, but the police are here
and they've asked us to stop the party." Don't Email: <[email protected]>.

2006-01-29 01:30:26

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

Benjamin LaHaise <[email protected]> wrote:
>
> On Sat, Jan 28, 2006 at 04:55:49PM -0800, Andrew Morton wrote:
> > local_t isn't much use until we get rid of asm-generic/local.h. Bloaty,
> > racy with nested interrupts.
>
> The overuse of atomics is horrific in what is being proposed.

Well yeah, it wasn't really a very serious proposal. There's some
significant work yet to be done on this stuff.

> Perhaps it's time to add a #error in asm-generic/local.h?

heh, or #warning "you suck" or something.

2006-01-29 01:45:04

by Kyle McMartin

[permalink] [raw]
Subject: Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

On Sat, Jan 28, 2006 at 08:19:44PM -0500, Benjamin LaHaise wrote:
> The overuse of atomics is horrific in what is being proposed. All the
> major architectures except powerpc (i386, x86-64, ia64, and sparc64)
> implement local_t. It would make far more sense to push the last few
> stragglers (which mostly seem to be uniprocessor) into writing the
> appropriate implementations. Perhaps it's time to add a #error in
> asm-generic/local.h?
>

Surely asm-generic/local.h could now be reimplemented using atomic_long_t,
killing off the aberration that the BITS_PER_LONG != 32 case currently is...?
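
(Roughly what that reimplementation could look like; a sketch, not a patch.
One definition for 32- and 64-bit, interrupt-safe, at the price of atomic ops:)

	/* asm-generic/local.h on top of atomic_long_t, sketch */
	typedef struct {
		atomic_long_t a;
	} local_t;

	#define LOCAL_INIT(i)		{ ATOMIC_LONG_INIT(i) }
	#define local_read(l)		atomic_long_read(&(l)->a)
	#define local_set(l, i)		atomic_long_set(&(l)->a, (i))
	#define local_inc(l)		atomic_long_inc(&(l)->a)
	#define local_dec(l)		atomic_long_dec(&(l)->a)
	#define local_add(i, l)		atomic_long_add((i), &(l)->a)
	#define local_sub(i, l)		atomic_long_sub((i), &(l)->a)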

2006-01-29 05:38:28

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

[adding linux-arch]

On Sunday 29 January 2006 01:55, Andrew Morton wrote:
> Benjamin LaHaise <[email protected]> wrote:
> > On Sat, Jan 28, 2006 at 01:28:20AM +0100, Eric Dumazet wrote:
> > > We might use atomic_long_t only (and no spinlocks)
> > > Something like this?
> >
> > Erk, complex and slow... Try using local_t instead, which is
> > substantially cheaper on the P4 as it doesn't use the lock prefix or act
> > as a memory barrier. See asm/local.h.
>
> local_t isn't much use until we get rid of asm-generic/local.h. It's
> bloaty and racy with nested interrupts.

It is just implemented wrong. It should use
local_irq_save()/local_irq_restore() instead. But my bigger problem
with local_t is the few architectures (IA64, PPC64) that implement it
with atomic_t. This means we can't replace local statistics counters
with local_t, because it would be a regression for them. I haven't done
the benchmarks yet, but I suspect both IA64 and PPC64 really should
just turn off interrupts.
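
(For the generic fallback that would look something like the sketch below,
assuming local_t keeps a plain counter field:)

	/* nested-interrupt safe generic local_add, sketch */
	static inline void local_add(long i, local_t *l)
	{
		unsigned long flags;

		local_irq_save(flags);	/* shut out this cpu's irq handlers */
		l->counter += i;
		local_irq_restore(flags);
	}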

-Andi

2006-01-29 06:54:18

by Eric Dumazet

[permalink] [raw]
Subject: Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

Benjamin LaHaise wrote:
> On Sat, Jan 28, 2006 at 01:28:20AM +0100, Eric Dumazet wrote:
>> We might use atomic_long_t only (and no spinlocks)
>> Something like this?
>
> Erk, complex and slow... Try using local_t instead, which is substantially
> cheaper on the P4 as it doesn't use the lock prefix or act as a memory
> barrier. See asm/local.h.
>

Well, I think that might be doable, maybe with RCU magic?

1) local_t is not that nice on all archs.

2) The consolidation phase (summing all the cpus' local offsets into the
central counter) might be more difficult to do (we would need something
like two counters per cpu, plus an index that can be flipped by the cpu
that wants a consolidation (still 'expensive'))

struct cpu_offset {
	local_t offset[2];
};

struct percpu_counter {
	atomic_long_t count;
	unsigned int offidx;
	spinlock_t lock;	/* to guard offidx changes */
	struct cpu_offset *counters;
};

void percpu_counter_mod(struct percpu_counter *fbc, long amount)
{
	long val;
	struct cpu_offset *cp;
	local_t *l;

	cp = per_cpu_ptr(fbc->counters, get_cpu());
	l = &cp->offset[fbc->offidx];

	local_add(amount, l);
	val = local_read(l);
	if (val >= FBC_BATCH || val <= -FBC_BATCH) {
		local_set(l, 0);
		atomic_long_add(val, &fbc->count);
	}
	put_cpu();
}

long percpu_counter_read_accurate(struct percpu_counter *fbc)
{
	long res = 0, val;
	int cpu, idx;
	struct cpu_offset *cp;
	local_t *l;

	spin_lock(&fbc->lock);
	idx = fbc->offidx;
	fbc->offidx ^= 1;
	mb();
	/*
	 * FIXME:
	 * must 'wait' until other cpus no longer touch their old local_t
	 */
	for_each_cpu(cpu) {
		cp = per_cpu_ptr(fbc->counters, cpu);
		l = &cp->offset[idx];
		val = local_read(l);
		/* don't dirty an alien cache line if not necessary */
		if (val)
			local_set(l, 0);
		res += val;
	}
	spin_unlock(&fbc->lock);
	atomic_long_add(res, &fbc->count);
	return atomic_long_read(&fbc->count);
}
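
(The 'wait' the FIXME asks for is essentially a grace period. Since
percpu_counter_mod runs between get_cpu()/put_cpu(), i.e. with preemption
disabled, classic RCU could supply it; a sketch of the flip step under that
assumption:)

	spin_lock(&fbc->lock);
	idx = fbc->offidx;
	fbc->offidx ^= 1;
	spin_unlock(&fbc->lock);
	/* every cpu that saw the old offidx has preemption disabled;
	   once a grace period elapses they are done with offset[idx] */
	synchronize_rcu();
	/* offset[idx] is now quiescent and can be drained safely */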



3) Are the locked ops so expensive if done on a cache line that is mostly in
the exclusive state in the cpu cache?

Thank you
Eric

2006-01-29 19:57:11

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

On Sun, Jan 29, 2006 at 07:54:09AM +0100, Eric Dumazet wrote:
> Well, I think that might be doable, maybe with RCU magic?
>
> 1) local_t is not that nice on all archs.

It is for the users that matter, and the hooks are there if someone finds
it to be a performance problem.

> 2) The consolidation phase (summing all the cpus' local offsets into the
> central counter) might be more difficult to do (we would need something
> like two counters per cpu, plus an index that can be flipped by the cpu
> that wants a consolidation (still 'expensive'))

For the vast majority of these sorts of statistics counters, we don't
need 100% accurate counts. And I think it should be possible to choose
between a lightweight implementation and the expensive implementation.
On a chip like the Core Duo the cost of bouncing between the two cores
is minimal, so all the extra code and data is a waste.

> 3) Are the locked ops so expensive if done on a cache line that is mostly
> in the exclusive state in the cpu cache?

Yes. What happens on the P4 is that it forces outstanding memory
transactions in the reorder buffer to be flushed so that the memory barrier
semantics of the lock prefix are observed. This can take a long time as
there can be over a hundred instructions in flight.

-ben
--
"Ladies and gentlemen, I'm sorry to interrupt, but the police are here
and they've asked us to stop the party." Don't Email: <[email protected]>.

2006-02-03 03:05:46

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

On Fri, Jan 27, 2006 at 03:01:06PM -0800, Andrew Morton wrote:
> Ravikiran G Thirumalai <[email protected]> wrote:
>
>
> > >
> > > If the benchmarks say that we need to. If we cannot observe any problems
> > > in testing of existing code and if we can't demonstrate any benefit from
> > > the patched code then one option is to go off and do something else ;)
> >
> > We first tried plain per-CPU counters for memory_allocated, found that reads
> > on memory_allocated were causing cacheline transfers, and then
> > switched over to batching. So batching reads is useful. To avoid
> > inaccuracy, we can maybe change percpu_counter_init to:
> >
> > void percpu_counter_init(struct percpu_counter *fbc, int maxdev)
> >
> > the percpu batching limit would then be maxdev/num_possible_cpus. One would
> > use batching counters only when both reads and writes are frequent. With
> > the above scheme, we would fetch cachelines from other cpus for reads
> > only on large cpu counts, which is not any worse than the global
> > counter alternative, but it would still be beneficial on smaller machines,
> > without sacrificing a pre-set deviation.
> >
> > Comments?
>
> Sounds sane.
>

Here's an implementation which delegates tuning of the batching to the user.
We don't really need local_t at all, as percpu_counter_mod is not safe
against interrupts and softirqs as it is. If we have a counter which could
be modified in both process context and irq/bh context, we just have to use
a wrapper like percpu_counter_mod_bh, which disables and enables bottom
halves around the modification. Reads on the counters are safe as they are
atomic reads, and the cpu-local variables are only ever accessed by the
owning cpu.
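
(A usage sketch against this interface; the maxerr value here is
illustrative:)

	struct percpu_counter tcp_memory_allocated;

	/* tolerate at most ~512 of total drift, spread across the cpus */
	percpu_counter_init(&tcp_memory_allocated, 512);

	/* process context */
	percpu_counter_mod(&tcp_memory_allocated, 1);

	/* counter also touched from bh context: use the bh-safe wrapper */
	percpu_counter_mod_bh(&tcp_memory_allocated, -1);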

(PS: the maxerr for ext2/ext3 is just a guesstimate)

Comments?

Index: linux-2.6.16-rc1mm4/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.16-rc1mm4.orig/include/linux/percpu_counter.h 2006-02-02 11:18:54.000000000 -0800
+++ linux-2.6.16-rc1mm4/include/linux/percpu_counter.h 2006-02-02 18:29:46.000000000 -0800
@@ -16,24 +16,32 @@

struct percpu_counter {
atomic_long_t count;
+ int percpu_batch;
long *counters;
};

-#if NR_CPUS >= 16
-#define FBC_BATCH (NR_CPUS*2)
-#else
-#define FBC_BATCH (NR_CPUS*4)
-#endif

-static inline void percpu_counter_init(struct percpu_counter *fbc)
+/*
+ * Choose maxerr carefully. maxerr/num_possible_cpus determines the per-cpu batch.
+ * Set maximum tolerance for better performance on large systems.
+ */
+static inline void percpu_counter_init(struct percpu_counter *fbc,
+ unsigned int maxerr)
{
atomic_long_set(&fbc->count, 0);
- fbc->counters = alloc_percpu(long);
+ fbc->percpu_batch = maxerr/num_possible_cpus();
+ if (fbc->percpu_batch) {
+ fbc->counters = alloc_percpu(long);
+ if (!fbc->counters)
+ fbc->percpu_batch = 0;
+ }
+
}

static inline void percpu_counter_destroy(struct percpu_counter *fbc)
{
- free_percpu(fbc->counters);
+ if (fbc->percpu_batch)
+ free_percpu(fbc->counters);
}

void percpu_counter_mod(struct percpu_counter *fbc, long amount);
@@ -63,7 +71,8 @@ struct percpu_counter {
long count;
};

-static inline void percpu_counter_init(struct percpu_counter *fbc)
+static inline void percpu_counter_init(struct percpu_counter *fbc,
+ unsigned int maxerr)
{
fbc->count = 0;
}
Index: linux-2.6.16-rc1mm4/mm/swap.c
===================================================================
--- linux-2.6.16-rc1mm4.orig/mm/swap.c 2006-01-29 20:20:20.000000000 -0800
+++ linux-2.6.16-rc1mm4/mm/swap.c 2006-02-02 18:36:21.000000000 -0800
@@ -470,13 +470,20 @@ static int cpu_swap_callback(struct noti
#ifdef CONFIG_SMP
void percpu_counter_mod(struct percpu_counter *fbc, long amount)
{
- long count;
long *pcount;
- int cpu = get_cpu();
+ long count;
+ int cpu;

+ /* Slow mode */
+ if (unlikely(!fbc->percpu_batch)) {
+ atomic_long_add(amount, &fbc->count);
+ return;
+ }
+
+ cpu = get_cpu();
pcount = per_cpu_ptr(fbc->counters, cpu);
count = *pcount + amount;
- if (count >= FBC_BATCH || count <= -FBC_BATCH) {
+ if (count >= fbc->percpu_batch || count <= -fbc->percpu_batch) {
atomic_long_add(count, &fbc->count);
count = 0;
}
Index: linux-2.6.16-rc1mm4/fs/ext2/super.c
===================================================================
--- linux-2.6.16-rc1mm4.orig/fs/ext2/super.c 2006-02-02 18:30:28.000000000 -0800
+++ linux-2.6.16-rc1mm4/fs/ext2/super.c 2006-02-02 18:36:39.000000000 -0800
@@ -610,6 +610,7 @@ static int ext2_fill_super(struct super_
int db_count;
int i, j;
__le32 features;
+ int maxerr;

sbi = kmalloc(sizeof(*sbi), GFP_KERNEL);
if (!sbi)
@@ -835,9 +836,14 @@ static int ext2_fill_super(struct super_
printk ("EXT2-fs: not enough memory\n");
goto failed_mount;
}
- percpu_counter_init(&sbi->s_freeblocks_counter);
- percpu_counter_init(&sbi->s_freeinodes_counter);
- percpu_counter_init(&sbi->s_dirs_counter);
+
+ if (num_possible_cpus() <= 16)
+ maxerr = 256;
+ else
+ maxerr = 1024;
+ percpu_counter_init(&sbi->s_freeblocks_counter, maxerr);
+ percpu_counter_init(&sbi->s_freeinodes_counter, maxerr);
+ percpu_counter_init(&sbi->s_dirs_counter, maxerr);
bgl_lock_init(&sbi->s_blockgroup_lock);
sbi->s_debts = kmalloc(sbi->s_groups_count * sizeof(*sbi->s_debts),
GFP_KERNEL);
Index: linux-2.6.16-rc1mm4/fs/ext3/super.c
===================================================================
--- linux-2.6.16-rc1mm4.orig/fs/ext3/super.c 2006-02-02 18:30:28.000000000 -0800
+++ linux-2.6.16-rc1mm4/fs/ext3/super.c 2006-02-02 18:38:10.000000000 -0800
@@ -1353,6 +1353,7 @@ static int ext3_fill_super (struct super
int i;
int needs_recovery;
__le32 features;
+ int maxerr;

sbi = kmalloc(sizeof(*sbi), GFP_KERNEL);
if (!sbi)
@@ -1578,9 +1579,14 @@ static int ext3_fill_super (struct super
goto failed_mount;
}

- percpu_counter_init(&sbi->s_freeblocks_counter);
- percpu_counter_init(&sbi->s_freeinodes_counter);
- percpu_counter_init(&sbi->s_dirs_counter);
+ if (num_possible_cpus() <= 16)
+ maxerr = 256;
+ else
+ maxerr = 1024;
+
+ percpu_counter_init(&sbi->s_freeblocks_counter, maxerr);
+ percpu_counter_init(&sbi->s_freeinodes_counter, maxerr);
+ percpu_counter_init(&sbi->s_dirs_counter, maxerr);
bgl_lock_init(&sbi->s_blockgroup_lock);

for (i = 0; i < db_count; i++) {

2006-02-03 03:17:15

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

Ravikiran G Thirumalai <[email protected]> wrote:
>
> On Fri, Jan 27, 2006 at 03:01:06PM -0800, Andrew Morton wrote:
> > Ravikiran G Thirumalai <[email protected]> wrote:
> >
> >
> > > >
> > > > If the benchmarks say that we need to. If we cannot observe any problems
> > > > in testing of existing code and if we can't demonstrate any benefit from
> > > > the patched code then one option is to go off and do something else ;)
> > >
> > > We first tried plain per-CPU counters for memory_allocated, found that reads
> > > on memory_allocated were causing cacheline transfers, and then
> > > switched over to batching. So batching reads is useful. To avoid
> > > inaccuracy, we can maybe change percpu_counter_init to:
> > >
> > > void percpu_counter_init(struct percpu_counter *fbc, int maxdev)
> > >
> > > the percpu batching limit would then be maxdev/num_possible_cpus. One would
> > > use batching counters only when both reads and writes are frequent. With
> > > the above scheme, we would fetch cachelines from other cpus for reads
> > > only on large cpu counts, which is not any worse than the global
> > > counter alternative, but it would still be beneficial on smaller machines,
> > > without sacrificing a pre-set deviation.
> > >
> > > Comments?
> >
> > Sounds sane.
> >
>
> Here's an implementation which delegates tuning of the batching to the user.
> We don't really need local_t at all, as percpu_counter_mod is not safe
> against interrupts and softirqs as it is. If we have a counter which could
> be modified in both process context and irq/bh context, we just have to use
> a wrapper like percpu_counter_mod_bh, which disables and enables bottom
> halves around the modification. Reads on the counters are safe as they are
> atomic reads, and the cpu-local variables are only ever accessed by the
> owning cpu.
>
> (PS: the maxerr for ext2/ext3 is just a guesstimate)

Well that's the problem. We need to choose production-quality values for
use in there.

> Comments?

Using num_possible_cpus() in that header file is just asking for build
errors. Probably best to uninline the function rather than adding the
needed include of cpumask.h.

2006-02-03 19:37:12

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

On Thu, Feb 02, 2006 at 07:16:00PM -0800, Andrew Morton wrote:
> Ravikiran G Thirumalai <[email protected]> wrote:
> >
> > On Fri, Jan 27, 2006 at 03:01:06PM -0800, Andrew Morton wrote:
> > Here's an implementation which delegates tuning of the batching to the user.
> > We don't really need local_t at all, as percpu_counter_mod is not safe
> > against interrupts and softirqs as it is. If we have a counter which could
> > be modified in both process context and irq/bh context, we just have to use
> > a wrapper like percpu_counter_mod_bh, which disables and enables bottom
> > halves around the modification. Reads on the counters are safe as they are
> > atomic reads, and the cpu-local variables are only ever accessed by the
> > owning cpu.
> >
> > (PS: the maxerr for ext2/ext3 is just a guesstimate)
>
> Well that's the problem. We need to choose production-quality values for
> use in there.

The guesstimate was loosely based on keeping the per-cpu batch at at least 8
on reasonably sized systems, while not letting maxerr grow too big. I guess
machines with cpu counts of more than 128 don't use ext3 :). And if they do,
they can tune the counters with a higher maxerr. I guess it might be a bit
ugly on the user side with all the num_possible_cpus() checks, but is there
a better alternative?

(I plan to test the counter values for ext2 and ext3 on a 16 way box, and
change these if they turn out to be not so good)
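
(To make the arithmetic concrete: maxerr = 1024 on a 128-way box gives a
per-cpu batch of 1024/128 = 8, so each cpu folds its local delta into the
global count at least every 8 units and a reader can be off by at most about
1024 in total; on a 16-way box, maxerr = 256 gives a per-cpu batch of 16.)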

>
> > Comments?
>
> Using num_possible_cpus() in that header file is just asking for build
> errors. Probably best to uninline the function rather than adding the
> needed include of cpumask.h.

Yup,

Here it is.

Change the percpu_counter interface so that the user can specify the maximum
tolerable deviation.

Signed-off-by: Ravikiran Thirumalai <[email protected]>

Index: linux-2.6.16-rc1mm4/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.16-rc1mm4.orig/include/linux/percpu_counter.h 2006-02-02 11:18:54.000000000 -0800
+++ linux-2.6.16-rc1mm4/include/linux/percpu_counter.h 2006-02-03 11:04:05.000000000 -0800
@@ -16,24 +16,20 @@

struct percpu_counter {
atomic_long_t count;
+ int percpu_batch;
long *counters;
};

-#if NR_CPUS >= 16
-#define FBC_BATCH (NR_CPUS*2)
-#else
-#define FBC_BATCH (NR_CPUS*4)
-#endif
-
-static inline void percpu_counter_init(struct percpu_counter *fbc)
-{
- atomic_long_set(&fbc->count, 0);
- fbc->counters = alloc_percpu(long);
-}
+/*
+ * Choose maxerr carefully. maxerr/num_possible_cpus determines the per-cpu
+ * batch. Set maximum tolerance for better performance on large systems.
+ */
+void percpu_counter_init(struct percpu_counter *fbc, unsigned int maxerr);

static inline void percpu_counter_destroy(struct percpu_counter *fbc)
{
- free_percpu(fbc->counters);
+ if (fbc->percpu_batch)
+ free_percpu(fbc->counters);
}

void percpu_counter_mod(struct percpu_counter *fbc, long amount);
@@ -63,7 +59,8 @@ struct percpu_counter {
long count;
};

-static inline void percpu_counter_init(struct percpu_counter *fbc)
+static inline void percpu_counter_init(struct percpu_counter *fbc,
+ unsigned int maxerr)
{
fbc->count = 0;
}
Index: linux-2.6.16-rc1mm4/mm/swap.c
===================================================================
--- linux-2.6.16-rc1mm4.orig/mm/swap.c 2006-01-29 20:20:20.000000000 -0800
+++ linux-2.6.16-rc1mm4/mm/swap.c 2006-02-03 11:02:05.000000000 -0800
@@ -468,15 +468,39 @@ static int cpu_swap_callback(struct noti
#endif /* CONFIG_SMP */

#ifdef CONFIG_SMP
+
+/*
+ * Choose maxerr carefully. maxerr/num_possible_cpus determines the per-cpu
+ * batch. Set maximum tolerance for better performance on large systems.
+ */
+void percpu_counter_init(struct percpu_counter *fbc, unsigned int maxerr)
+{
+ atomic_long_set(&fbc->count, 0);
+ fbc->percpu_batch = maxerr/num_possible_cpus();
+ if (fbc->percpu_batch) {
+ fbc->counters = alloc_percpu(long);
+ if (!fbc->counters)
+ fbc->percpu_batch = 0;
+ }
+
+}
+
void percpu_counter_mod(struct percpu_counter *fbc, long amount)
{
- long count;
long *pcount;
- int cpu = get_cpu();
+ long count;
+ int cpu;

+ /* Slow mode */
+ if (unlikely(!fbc->percpu_batch)) {
+ atomic_long_add(amount, &fbc->count);
+ return;
+ }
+
+ cpu = get_cpu();
pcount = per_cpu_ptr(fbc->counters, cpu);
count = *pcount + amount;
- if (count >= FBC_BATCH || count <= -FBC_BATCH) {
+ if (count >= fbc->percpu_batch || count <= -fbc->percpu_batch) {
atomic_long_add(count, &fbc->count);
count = 0;
}
@@ -484,6 +508,7 @@ void percpu_counter_mod(struct percpu_co
put_cpu();
}
EXPORT_SYMBOL(percpu_counter_mod);
+EXPORT_SYMBOL(percpu_counter_init);
#endif

/*
Index: linux-2.6.16-rc1mm4/fs/ext2/super.c
===================================================================
--- linux-2.6.16-rc1mm4.orig/fs/ext2/super.c 2006-02-03 11:03:54.000000000 -0800
+++ linux-2.6.16-rc1mm4/fs/ext2/super.c 2006-02-03 11:04:10.000000000 -0800
@@ -610,6 +610,7 @@ static int ext2_fill_super(struct super_
int db_count;
int i, j;
__le32 features;
+ int maxerr;

sbi = kmalloc(sizeof(*sbi), GFP_KERNEL);
if (!sbi)
@@ -835,9 +836,14 @@ static int ext2_fill_super(struct super_
printk ("EXT2-fs: not enough memory\n");
goto failed_mount;
}
- percpu_counter_init(&sbi->s_freeblocks_counter);
- percpu_counter_init(&sbi->s_freeinodes_counter);
- percpu_counter_init(&sbi->s_dirs_counter);
+
+ if (num_possible_cpus() <= 16)
+ maxerr = 256;
+ else
+ maxerr = 1024;
+ percpu_counter_init(&sbi->s_freeblocks_counter, maxerr);
+ percpu_counter_init(&sbi->s_freeinodes_counter, maxerr);
+ percpu_counter_init(&sbi->s_dirs_counter, maxerr);
bgl_lock_init(&sbi->s_blockgroup_lock);
sbi->s_debts = kmalloc(sbi->s_groups_count * sizeof(*sbi->s_debts),
GFP_KERNEL);
Index: linux-2.6.16-rc1mm4/fs/ext3/super.c
===================================================================
--- linux-2.6.16-rc1mm4.orig/fs/ext3/super.c 2006-02-03 11:03:54.000000000 -0800
+++ linux-2.6.16-rc1mm4/fs/ext3/super.c 2006-02-03 11:04:10.000000000 -0800
@@ -1353,6 +1353,7 @@ static int ext3_fill_super (struct super
int i;
int needs_recovery;
__le32 features;
+ int maxerr;

sbi = kmalloc(sizeof(*sbi), GFP_KERNEL);
if (!sbi)
@@ -1578,9 +1579,14 @@ static int ext3_fill_super (struct super
goto failed_mount;
}

- percpu_counter_init(&sbi->s_freeblocks_counter);
- percpu_counter_init(&sbi->s_freeinodes_counter);
- percpu_counter_init(&sbi->s_dirs_counter);
+ if (num_possible_cpus() <= 16)
+ maxerr = 256;
+ else
+ maxerr = 1024;
+
+ percpu_counter_init(&sbi->s_freeblocks_counter, maxerr);
+ percpu_counter_init(&sbi->s_freeinodes_counter, maxerr);
+ percpu_counter_init(&sbi->s_dirs_counter, maxerr);
bgl_lock_init(&sbi->s_blockgroup_lock);

for (i = 0; i < db_count; i++) {

2006-02-03 20:13:41

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

Ravikiran G Thirumalai <[email protected]> wrote:
>
> On Thu, Feb 02, 2006 at 07:16:00PM -0800, Andrew Morton wrote:
> > Ravikiran G Thirumalai <[email protected]> wrote:
> > >
> > > On Fri, Jan 27, 2006 at 03:01:06PM -0800, Andrew Morton wrote:
> > > Here's an implementation which delegates tuning of the batching to the user.
> > > We don't really need local_t at all, as percpu_counter_mod is not safe
> > > against interrupts and softirqs as it is. If we have a counter which could
> > > be modified in both process context and irq/bh context, we just have to use
> > > a wrapper like percpu_counter_mod_bh, which disables and enables bottom
> > > halves around the modification. Reads on the counters are safe as they are
> > > atomic reads, and the cpu-local variables are only ever accessed by the
> > > owning cpu.
> > >
> > > (PS: the maxerr for ext2/ext3 is just a guesstimate)
> >
> > Well that's the problem. We need to choose production-quality values for
> > use in there.
>
> The guesstimate was loosely based on keeping the per-cpu batch at at least 8
> on reasonably sized systems, while not letting maxerr grow too big. I guess
> machines with cpu counts of more than 128 don't use ext3 :). And if they do,
> they can tune the counters with a higher maxerr. I guess it might be a bit
> ugly on the user side with all the num_possible_cpus() checks, but is there
> a better alternative?
>
> (I plan to test the counter values for ext2 and ext3 on a 16 way box, and
> change these if they turn out to be not so good)

OK, thanks. Frankly, I think I went overboard on the scalability thing when
adding percpu counters to ext2 and ext3. I suspect they're not providing
significant benefit over a per-sb spinlock and a ulong.
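
(The simpler scheme would look roughly like this; the lock and field names
are illustrative:)

	/* one lock and one word per superblock, no per-cpu data at all */
	spin_lock(&sbi->s_counter_lock);
	sbi->s_freeblocks += count;
	spin_unlock(&sbi->s_counter_lock);

	/* reads are exact and writes are cheap on small machines, but every
	   cpu updating the counter bounces the same cache line */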

> >
> > > Comments?
> >
> > Using num_possible_cpus() in that header file is just asking for build
> > errors. Probably best to uninline the function rather than adding the
> > needed include of cpumask.h.
>
> Yup,
>
> Here it is.
>
> Change the percpu_counter interface so that user can specify the maximum
> tolerable deviation.

OK, thanks. I need to sit down and a) remember why we're even discussing
this and b) see what we've merged thus far and work out what it all does ;)