2006-03-21 18:28:10

by Andreas Mohr

[permalink] [raw]
Subject: ring buffer indices: way too much modulo (division!) fiddling

Hello all,

"Just One" shocking example (2.6.16 arch/i386/kernel/apm.c):

static apm_event_t get_queued_event(struct apm_user *as)
{
as->event_tail = (as->event_tail + 1) % APM_MAX_EVENTS;
return as->events[as->event_tail];
}

objdump x86 output:

000007cf <get_queued_event>:
7cf: 83 ec 08 sub $0x8,%esp
7d2: ba 67 66 66 66 mov $0x66666667,%edx
7d7: 89 74 24 04 mov %esi,0x4(%esp)
7db: 89 1c 24 mov %ebx,(%esp)
7de: 89 c6 mov %eax,%esi
7e0: 8b 58 24 mov 0x24(%eax),%ebx
7e3: 83 c3 01 add $0x1,%ebx
7e6: 89 d8 mov %ebx,%eax
7e8: 89 d9 mov %ebx,%ecx
7ea: f7 ea imul %edx
7ec: c1 f9 1f sar $0x1f,%ecx
7ef: c1 fa 03 sar $0x3,%edx
7f2: 29 ca sub %ecx,%edx
7f4: 6b d2 14 imul $0x14,%edx,%edx
7f7: 29 d3 sub %edx,%ebx
7f9: 0f b7 44 5e 28 movzwl 0x28(%esi,%ebx,2),%eax
7fe: 89 5e 24 mov %ebx,0x24(%esi)
801: 8b 1c 24 mov (%esp),%ebx
804: 8b 74 24 04 mov 0x4(%esp),%esi
808: 83 c4 08 add $0x8,%esp
80b: c3 ret


Doing it The Probably Proper Way (tm):

static apm_event_t get_queued_event(struct apm_user *as)
{
if (++as->event_tail >= APM_MAX_EVENTS)
as->event_tail = 0;
return as->events[as->event_tail];
}

objdump x86 output:

000007cf <get_queued_event>:
7cf: 8b 48 24 mov 0x24(%eax),%ecx
7d2: 31 d2 xor %edx,%edx
7d4: 83 c1 01 add $0x1,%ecx
7d7: 83 f9 14 cmp $0x14,%ecx
7da: 0f 4c d1 cmovl %ecx,%edx
7dd: 89 50 24 mov %edx,0x24(%eax)
7e0: 0f b7 44 50 28 movzwl 0x28(%eax,%edx,2),%eax
7e5: c3 ret


Result: no painful division avoidance (0x66666667 factors, imul), *vastly*
shorter function (22 vs. 60 bytes), most likely **WAY** faster
(no multiplication cycle burning, better pipelining, less opcodes).

Any problems with such a change that I'm missing here?

If such a change holds water, then I'd strongly suggest creating a new
(arch-specific?) macro named something like

ringbuf_advance_idx(my_idx_var, MY_BUF_SIZE)


(any more generic name? It's not always about ring buffers...)

to be used all over the place instead of custom-writing stuff (and getting it
imperfect/wrong) in many cases.
This would be a perfect job for kernel-janitors, if you ask me.

Or does there happen to be a macro already for such a rather common operation?


This kind of unnecessary modulo operation happens in lots of places (mostly
network drivers).
Probably the most prominent (read: painful) place is
./mm/hugetlb.c/alloc_fresh_huge_page()
(if I judge that place right; I cannot easily verify it since I don't have and
don't know how to configure a NUMA machine setup/image)
So, additional question:
can anyone confirm that alloc_fresh_huge_page() binary code uses an expensive
division or multiplication on a NUMA machine?

Comments?

Andreas Mohr


2006-03-21 19:51:17

by Ray Lee

[permalink] [raw]
Subject: Re: ring buffer indices: way too much modulo (division!) fiddling

On 3/21/06, Andreas Mohr <[email protected]> wrote:
> "Just One" shocking example (2.6.16 arch/i386/kernel/apm.c):
>
> static apm_event_t get_queued_event(struct apm_user *as)
> {
> as->event_tail = (as->event_tail + 1) % APM_MAX_EVENTS;
> return as->events[as->event_tail];
> }
[...]
> 7d7: 83 f9 14 cmp $0x14,%ecx

APM_MAX_EVENTS appears to be 20, not a power of 2. GCC does the
obvious transform of modulo to an ANDL when the modulus is a sane
value. Rewrite the above as a four line function and run it through
gcc -S to see to difference bewteen APM_MAX_EVENTS=20 and =32.

> Any problems with such a change that I'm missing here?

<shrug> It's probably not a wide-spread problem. How many places in
the kernel use ringbuffers that aren't sized to a power of 2?

> ringbuf_advance_idx(my_idx_var, MY_BUF_SIZE)
> (any more generic name? It's not always about ring buffers...)

modulo_add? But it's probably unneccessary, unless this really is
widespread. Even then, just correcting those already in place would be
better than adding yet another abstraction of a simple operation to
the kernel.

> This kind of unnecessary modulo operation happens in lots of places (mostly
> network drivers).

Check to see if those are power of 2 sized buffers. gcc really will do
those correctly.

Ray