Clean up the percpu_counter code and fix some bugs. The main purpose is to
convert percpu_counter to use atomic64, which helps workloads where
percpu_counter->lock is contended. In a workload I tested, the atomic method
is 7x faster (please see patch 3 for details).
Note, patch 3 is against Christoph's 'percpu: preemptless __per_cpu_counter_add'