Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756130AbYBPKHm (ORCPT ); Sat, 16 Feb 2008 05:07:42 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1752857AbYBPKHa (ORCPT ); Sat, 16 Feb 2008 05:07:30 -0500 Received: from neuf-infra-smtp-out-sp604007av.neufgp.fr ([84.96.92.120]:48418 "EHLO neuf-infra-smtp-out-sp604007av.neufgp.fr" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751763AbYBPKH2 (ORCPT ); Sat, 16 Feb 2008 05:07:28 -0500 Message-ID: <47B6B5E1.90705@cosmosbay.com> Date: Sat, 16 Feb 2008 11:07:29 +0100 From: Eric Dumazet User-Agent: Thunderbird 2.0.0.9 (Windows/20071031) MIME-Version: 1.0 To: Andrew Morton CC: Herbert Xu , "David S. Miller" , netdev@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: include/linux/pcounter.h References: <20080204014402.1c55d3fe.akpm@linux-foundation.org> <20080215193711.2e3d41f3.akpm@linux-foundation.org> In-Reply-To: <20080215193711.2e3d41f3.akpm@linux-foundation.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 9564 Lines: 208 Andrew Morton a ?crit : > - First up, why was this added at all? We have percpu_counter.h which > has several years development invested in it. afaict it would suit the > present applications of pcounters. > > If some deficiency in percpu_counters has been identified, is it > possible to correct that deficiency rather than implementing a whole new > counter type? That would be much better. > > - The comments in pcounter.h appear to indicate that there is a > performance advantage (and we infer that the advantage is when the > statically-allocated flavour of pcounters is used). When compared with > percpu_counters the number of data-reference indirections is the same as > with percpu_counters, so no advantage there. > > And, bizarrely, because of a quite inappropriate abstraction toy, both > flavours of pcounters add an indirect function call which I believe is > significantly more expensive than a plain old pointer indirection. > > So it's quite possible that DEFINE_PCOUNTER-style counters consume more > memory and are slower than percpu_counters. They will surely be much > slower on the read side. More below. > > If we really want to put some helper wrappers around > DEFINE_PER_CPU(s32) then I'd suggest that we should do that as a > standalone thing and not attempt to graft the same interface onto two > quite different types of storage (DEFINE_PER_CPU and alloc_per_cpu) > > - The comment "2)" in pcounter.h (which overflows 80 cols and probably > wasn't passed through checkpatch) indicates that some other > implementation (presumably plain old DEFINE_PER_CPU) will use > NR_CPUS*(32+sizeof(void *)) bytes of storage. But DEFINE_PCOUNTER will > use as much memory as DEFINE_PER_CPU(s32) and both pcounter_alloc()-style > pcounters and percpu_counters use > num_possible_cpus()*sizeof(s32)+epsilon. > > - The CONFIG_SMP=n stubs in pcounter.h are cheesy and are vulnerable to > several well-known compilation risks which I always forget. Should be > converted to regular static inlines. > > - the comment in lib/pcounter.c needlessly exceeds 80 cols. > > - pcounter_dyn_add() will spew a > use-of-smp_processor_id()-in-preemptible-code warning if used in places > where one could reasonably use it. The interface could do with a bit of > a rethink. Or at least some justification and documentation. > > - pcounter_getval() will be disastrously inefficient if > num_possible_cpus() is much greater than num_online_cpus(). It should > use for_each_online_cpu() (as does percpu_counter), and implement a CPU > hotplug notifier (as does percpu_counter). > > It will remain grossly inefficient at high CPU counts, unlike > percpu_counters, which solved this problem by doing a batched spill into > a central counter at add/sub time. > > The danger here is that someone will actually use this interface in new > code. Six months later (when it's too late to fix it) the big-NUMA guys > come up and say "whaa, when our user does it disabled interrupts > for N milliseconds". > > - pcounter_getval() can return incorrect negative numbers. This can > cause caller malfunctions in very rare situations because callers just > don't expect the things which they're counting to go negative. > > We experienced this during the evolution of percpu_counter. See > percpu_counter_sum_positive() and friends. > > - pcounter_alloc() should return -ENOMEM on allocation error, not "1". > > - pcounter_free() perhaps shouldn't test for (self->per_cpu_values != > NULL), because callers shouldn't be calling it if pcounter_alloc() failed > (arguable). > > > > afaict the whole implementation can and should be removed and replaced with > percpu_counters. I don't think there's much point in its ability to manage > DEFINE_PER_CPU counters: pcounter_getval() remains grossly inefficient (and > can return negative values) and quite a bit of new code will need to be put > in place to address that. > > But perhaps there are plans to evolve it into something further in the > future, I don't know. But I would suggest that the people who have worked > upon percpu_counters (principally Gautham, Peter Z, clameter and me) be > involved in that work. > Andrew, pcounter is a temporary abstraction. It is temporaty because it will vanish as soon as Christoph Clameter (or somebody else) provides real cheap per cpu counter implementation. At time I introduced it in network tree (locally, not meant to invade kernel land and makes you unhappy :) ), the goals were : Some counters (total sockets count) were a single integer, that were doing ping-pong between cpus (SMP/NUMA). As they are basically lazy values (as we dont really need to read their value), using plain atomic_t was overkill. Using a plain percpu_counters was expensive (NR_CPUS*(32+sizeof(void *)) instead of num_possible_cpus()*4). Using 'online' instead of 'possible' stuff is not really needed for a temporary thing. - We dont care of read sides. We want really fast write side. Real fast. Read side is when you do a "cat /proc/net/sockstat". That is ... once in a while... Now when we allocate a new socket, code to increment the "socket count" is : c03a74a8 : c03a74a8: b8 90 26 5f c0 mov $0xc05f2690,%eax c03a74ad: 64 8b 0d 10 f1 5e c0 mov %fs:0xc05ef110,%ecx c03a74b4: 01 14 01 add %edx,(%ecx,%eax,1) c03a74b7: c3 ret That is 4 instructions. I could be two in the future, thanks to current work on fs/gs based percpu variables. Current percpu_counters implementation is more expensive : c021467b <__percpu_counter_add>: c021467b: 55 push %ebp c021467c: 57 push %edi c021467d: 89 c7 mov %eax,%edi c021467f: 56 push %esi c0214680: 53 push %ebx c0214681: 83 ec 04 sub $0x4,%esp c0214684: 8b 40 14 mov 0x14(%eax),%eax c0214687: 64 8b 1d 08 f0 5e c0 mov %fs:0xc05ef008,%ebx c021468e: 8b 6c 24 18 mov 0x18(%esp),%ebp c0214692: f7 d0 not %eax c0214694: 8b 1c 98 mov (%eax,%ebx,4),%ebx c0214697: 89 1c 24 mov %ebx,(%esp) c021469a: 8b 03 mov (%ebx),%eax c021469c: 89 c3 mov %eax,%ebx c021469e: 89 c6 mov %eax,%esi c02146a0: c1 fe 1f sar $0x1f,%esi c02146a3: 89 e8 mov %ebp,%eax c02146a5: 01 d3 add %edx,%ebx c02146a7: 11 ce adc %ecx,%esi c02146a9: 99 cltd c02146aa: 39 d6 cmp %edx,%esi c02146ac: 7f 15 jg c02146c3 <__percpu_counter_add+0x48> c02146ae: 7c 04 jl c02146b4 <__percpu_counter_add+0x39> c02146b0: 39 eb cmp %ebp,%ebx c02146b2: 73 0f jae c02146c3 <__percpu_counter_add+0x48> c02146b4: f7 dd neg %ebp c02146b6: 89 e8 mov %ebp,%eax c02146b8: 99 cltd c02146b9: 39 d6 cmp %edx,%esi c02146bb: 7f 20 jg c02146dd <__percpu_counter_add+0x62> c02146bd: 7c 04 jl c02146c3 <__percpu_counter_add+0x48> c02146bf: 39 eb cmp %ebp,%ebx c02146c1: 77 1a ja c02146dd <__percpu_counter_add+0x62> c02146c3: 89 f8 mov %edi,%eax c02146c5: e8 04 cc 1f 00 call c04112ce <_spin_lock> c02146ca: 01 5f 04 add %ebx,0x4(%edi) c02146cd: 11 77 08 adc %esi,0x8(%edi) c02146d0: 8b 04 24 mov (%esp),%eax c02146d3: c7 00 00 00 00 00 movl $0x0,(%eax) c02146d9: fe 07 incb (%edi) c02146db: eb 05 jmp c02146e2 <__percpu_counter_add+0x67> c02146dd: 8b 04 24 mov (%esp),%eax c02146e0: 89 18 mov %ebx,(%eax) c02146e2: 58 pop %eax c02146e3: 5b pop %ebx c02146e4: 5e pop %esi c02146e5: 5f pop %edi c02146e6: 5d pop %ebp c02146e7: c3 ret Once it is better, just make pcounter vanish. It is even clearly stated at the top of include/linux/pcounter.h /* * Using a dynamic percpu 'int' variable has a cost : * 1) Extra dereference * Current per_cpu_ptr() implementation uses an array per 'percpu variable'. * 2) memory cost of NR_CPUS*(32+sizeof(void *)) instead of num_possible_cpus()*4 * * This pcounter implementation is an abstraction to be able to use * either a static or a dynamic per cpu variable. * One dynamic per cpu variable gets a fast & cheap implementation, we can * change pcounter implementation too. */ We all agree. Thank you -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/