V3->V4:
- Fix various macro definitions.
- Provide an experimental percpu based fastpath that does not disable
  interrupts for SLUB.
V2->V3:
- Available via git tree against latest upstream from
git://git.kernel.org/pub/scm/linux/kernel/git/christoph/percpu.git linus
- Rework SLUB per cpu operations. Get rid of dynamic DMA slab creation
for CONFIG_ZONE_DMA
- Create a fallback framework so that 64 bit ops on 32 bit platforms
  can fall back to the use of preempt or interrupt disabling. 64 bit
  platforms can use 64 bit atomic per cpu ops.
V1->V2:
- Various minor fixes
- Add SLUB conversion
- Add Page allocator conversion
- Patch against the git tree of today
The patchset introduces various operations that allow efficient access
to per cpu variables of the current processor. Currently there is
no way in the core to calculate the address of the instance
of a per cpu variable without a table lookup, so we see a lot of
	per_cpu_ptr(x, smp_processor_id())
The patchset introduces a way to calculate the address from an offset
that is available in arch-specific ways (a register or special memory
locations) using
	this_cpu_ptr(x)
In addition, macros are provided that can operate on per cpu
variables in a per cpu atomic way. With these, scalars in structures
allocated with the new percpu allocator can be modified without disabling
preemption or interrupts. This works by generating a single instruction that
does both the relocation of the address to the proper percpu area and
the RMW action.
For example,
this_cpu_add(x->var, 20)
can be used to generate an instruction that uses a segment register to
relocate the per cpu address into the per cpu area of the current processor
and then increments the variable by 20. The instruction cannot be interrupted,
so the modification is atomic with respect to the cpu (it either happens or it
does not). Rescheduling or an interrupt can only happen before or after the
instruction.
Per cpu atomicity does not provide protection from concurrent modifications by
other processors. In general, per cpu data is modified only from the processor
that the per cpu area is associated with, so per cpu atomicity provides a fast
and effective means of dealing with concurrency. It may allow the development
of better fastpaths for allocators and other important subsystems.
The per cpu atomic RMW operations can be used to avoid having to dimension
pointer arrays by processor count in the allocators (patches for the page
allocator and SLUB are provided) and to avoid pointer lookups in the hot paths
of the allocators, thereby decreasing the latency of critical OS paths. The
macros could also be used to revise the critical paths in the allocators so
that they no longer need to disable interrupts (not included).
Per cpu atomic RMW operations are useful for decreasing the overhead of counter
maintenance in the kernel. For example, a this_cpu_inc() can generate a single
instruction that needs no registers on x86. Preempt on/off can
be avoided in many places.
The patchset will reduce the code size and increase the speed of operations
for dynamically allocated per cpu based statistics. A set of patches modifies
the fastpaths of the SLUB allocator, reducing code size and cache footprint
through the per cpu atomic operations.
The patchset depends on all arches supporting the new per cpu allocator.
IA64 still uses the old percpu allocator. Tejun has patches to fix up IA64,
and the patch was approved by Tony Luck, but the IA64 patches have not been
merged yet.
Hello,
[email protected] wrote:
> V3->V4:
> - Fix various macro definitions.
> - Provide an experimental percpu based fastpath that does not disable
>   interrupts for SLUB.
The series looks very good to me. percpu#for-next now has ia64 bits
included and the legacy allocator is gone there so it can carry this
series. Sans the last one, it seems they can be stable and
incremental from now on, right? Shall I include this series into the
percpu tree?
Thanks.
--
tejun
* Tejun Heo <[email protected]> wrote:
> Hello,
>
> [email protected] wrote:
> > V3->V4:
> > - Fix various macro definitions.
> > - Provide an experimental percpu based fastpath that does not disable
> >   interrupts for SLUB.
>
> The series looks very good to me. [...]
Seconded, very nice series!
One final step/cleanup seems to be missing from it: it should replace
current uses of percpu_op() [percpu_read(), etc.] in the x86 tree and
elsewhere with the new this_cpu_*() primitives. this_cpu_*() uses
per_cpu_from_op/per_cpu_to_op directly, so we don't need those percpu_op()
variants anymore.
There should also be a kernel image size comparison done for that step,
to make sure all the new primitives are optimized to the max on the
instruction level.
> [...] percpu#for-next now has ia64 bits included and the legacy
> allocator is gone there so it can carry this series. Sans the last
> one, it seems they can be stable and incremental from now on, right?
> Shall I include this series into the percpu tree?
I'd definitely recommend doing that - it should be tested early and widely
for v2.6.33, together with the other percpu bits.
Ingo
On Fri, 2 Oct 2009, Tejun Heo wrote:
> [email protected] wrote:
> > V3->V4:
> > - Fix various macro definitions.
> > - Provide an experimental percpu based fastpath that does not disable
> >   interrupts for SLUB.
>
> The series looks very good to me. percpu#for-next now has ia64 bits
> included and the legacy allocator is gone there so it can carry this
> series. Sans the last one, it seems they can be stable and
> incremental from now on, right? Shall I include this series into the
> percpu tree?
You can include all but the last patch, which is experimental.
On Fri, 2 Oct 2009, Ingo Molnar wrote:
> One final step/cleanup seems to be missing from it: it should replace
> current uses of percpu_op() [percpu_read(), etc.] in the x86 tree and
> elsewhere with the new this_cpu_*() primitives. this_cpu_*() is using
> per_cpu_from_op/per_cpu_to_op directly, we don't need those percpu_op()
> variants anymore.
Well, after things settle with this_cpu_xx we can drop those.
> There should also be a kernel image size comparison done for that step,
> to make sure all the new primitives are optimized to the max on the
> instruction level.
Right. There will be a time period in which other arches will need to add
support for this_cpu_xx first.
* Christoph Lameter <[email protected]> wrote:
> On Fri, 2 Oct 2009, Ingo Molnar wrote:
>
> > One final step/cleanup seems to be missing from it: it should
> > replace current uses of percpu_op() [percpu_read(), etc.] in the x86
> > tree and elsewhere with the new this_cpu_*() primitives.
> > this_cpu_*() is using per_cpu_from_op/per_cpu_to_op directly, we
> > don't need those percpu_op() variants anymore.
>
> Well after things settle with this_cpu_xx we can drop those.
>
> > There should also be a kernel image size comparison done for that
> > step, to make sure all the new primitives are optimized to the max
> > on the instruction level.
>
> Right. There will be a time period in which other arches will need to
> add support for this_cpu_xx first.
Size comparison should only be done on architectures that support it (i.e.
x86 right now). The generic fallbacks might be bloaty, no argument about
that. ( => all the more reason for any architecture to add optimizations for
the this_cpu_*() APIs. )
Ingo
On Fri, 2 Oct 2009, Ingo Molnar wrote:
> > Right. There will be a time period in which other arches will need to
> > add support for this_cpu_xx first.
>
> Size comparison should be only on architectures that support it (i.e.
> x86 right now). The generic fallbacks might be bloaty, no argument about
> that. ( => the more reason for any architecture to add optimizations for
> this_cpu_*() APIs. )
The fallbacks basically generate the same code (at least for the core
code) that was there before. For example:
Before:
#define SNMP_INC_STATS(mib, field) \
do { \
per_cpu_ptr(mib[!in_softirq()], get_cpu())->mibs[field]++; \
put_cpu(); \
} while (0)
After:
#define SNMP_INC_STATS_USER(mib, field) \
this_cpu_inc(mib[1]->mibs[field])
For the x86 case this means that we can use a simple atomic increment
with a segment prefix to do all the work.
The fallback case for arches not providing per cpu atomics is:
preempt_disable();
*__this_cpu_ptr(&mib[1]->mibs[field]) += 1;
preempt_enable();
If the arch can optimize __this_cpu_ptr (and provides __my_cpu_offset)
because it keeps the per cpu offset of the local cpu in some privileged
location, then this is still going to be a win, since we avoid
smp_processor_id() entirely and we also avoid the array lookup.
If the arch has no such mechanism then we fall back for this_cpu_ptr too:
#ifndef __my_cpu_offset
#define __my_cpu_offset per_cpu_offset(raw_smp_processor_id())
#endif
And then the result in terms of overhead is the same as before the
per_cpu_xx patches, since get_cpu() does both a preempt_disable() and a
smp_processor_id() call.