2005-10-17 17:01:59

by Ananiev, Leonid I

Subject: Re: [PATCH 1/1] indirect function calls elimination in IO scheduler

Jens Axboe writes

> I don't really see the patch doing what you describe - the indirect
> function calls are the same.

For example, on a Pentium 4, in the function elv_next_request() the line
struct request *rq =
q->elevator->ops->elevator_next_req_fn(q);
required 11% of the function's running time before the patch, as
oprofile reports:
samples  %
26 0.0457 :c0270ecb: mov 0xc(%edi),%eax
3455 6.0670 :c0270ece: mov (%eax),%eax
2848 5.0011 :c0270ed0: mov %edi,(%esp)
1538 2.7008 :c0270ed3: call *0xc(%eax)

A patch that eliminates the intermediate pointer dereferences entirely
was also tried:
struct request *rq = q->elevator_cp.ops.elevator_next_req_fn(q);
9 0.0224 :c0270eea: mov %edi,(%esp)
3814 9.4793 :c0270eed: call *0x18(%edi)

But that would require additional memory for a copy of 'ops' in each
queue. The intermediate (proposed) patch has the same timing effect
while saving that memory:
struct request *rq =
q->elevator_cp.ops->elevator_next_req_fn(q);
drivers/block/elevator.c:351
ffffffff802a8b97: 49 8b 44 24 18 mov 0x18(%r12),%rax
ffffffff802a8b9c: 4c 89 e7 mov %r12,%rdi
ffffffff802a8b9f: ff 50 18 callq *0x18(%rax)
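
In C terms the three variants differ only in how many dependent loads
feed the (still indirect) call. A minimal sketch, using simplified
stand-in struct definitions rather than the exact kernel ones:

/* Simplified stand-ins for the real structures. */
struct request;
struct request_queue;

struct elevator_ops {
	struct request *(*elevator_next_req_fn)(struct request_queue *q);
	/* ... the other elevator hooks ... */
};

struct elevator_queue {
	struct elevator_ops *ops;
	void *elevator_data;		/* per-queue scheduler data */
};

struct request_queue {
	struct elevator_queue *elevator;	/* current layout */
	struct elevator_queue elevator_cp;	/* proposed: embedded copy */
	/* both members shown here only for illustration */
};

/* Current layout: two dependent loads feed the indirect call. */
static struct request *next_req_current(struct request_queue *q)
{
	struct elevator_queue *e = q->elevator;	/* load 1 */
	struct elevator_ops *ops = e->ops;	/* load 2 */
	return ops->elevator_next_req_fn(q);	/* indirect call */
}

/* Proposed layout: the embedded elevator_cp removes one load; the
 * call itself stays indirect, matching the disassembly above. */
static struct request *next_req_proposed(struct request_queue *q)
{
	struct elevator_ops *ops = q->elevator_cp.ops;	/* load 1 */
	return ops->elevator_next_req_fn(q);		/* indirect call */
}

/* Copying ops by value (the variant tried above) would drop even that
 * load, at the cost of sizeof(struct elevator_ops) in every queue:
 *	rq = q->elevator_cp.ops.elevator_next_req_fn(q);
 */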

For Itanium the difference is huge:
Before the patch:
drivers/block/elevator.c:351
a0000001002cbb60: 0d f0 00 4c 18 10 [MFI] ld8 r30=[r38]
a0000001002cbb66: 00 00 00 02 00 c0 nop.f 0x0
a0000001002cbb6c: 05 00 01 84 mov r46=r32;;
a0000001002cbb70: 0b e8 00 3c 18 10 [MMI] ld8 r29=[r30];;
a0000001002cbb76: c0 c1 74 00 42 00 adds r28=24,r29
a0000001002cbb7c: 00 00 04 00 nop.i 0x0;;
a0000001002cbb80: 0b d0 00 38 18 10 [MMI] ld8 r26=[r28];;
a0000001002cbb86: b0 41 68 30 28 00 ld8 r27=[r26],8
a0000001002cbb8c: 00 00 04 00 nop.i 0x0;;
a0000001002cbb90: 11 08 00 34 18 10 [MIB] ld8 r1=[r26]
a0000001002cbb96: 70 d8 04 80 03 00 mov b7=r27
a0000001002cbb9c: 78 00 80 10
br.call.sptk.many

After the patch there is no contiguous object code for this line; it
is scattered and interleaved with code generated for other source
lines.

Leonid


2005-10-17 17:58:14

by Jens Axboe

Subject: Re: [PATCH 1/1] indirect function calls elimination in IO scheduler

On Mon, Oct 17 2005, Ananiev, Leonid I wrote:
> Jens Axboe writes
>
> > I don't really see the patch doing what you describe - the indirect
> > function calls are the same.
>
> For example, on a Pentium 4, in the function elv_next_request() the line
> struct request *rq =
> q->elevator->ops->elevator_next_req_fn(q);
> required 11% of the function's running time before the patch, as
> oprofile reports:
> samples  %
> 26 0.0457 :c0270ecb: mov 0xc(%edi),%eax
> 3455 6.0670 :c0270ece: mov (%eax),%eax
> 2848 5.0011 :c0270ed0: mov %edi,(%esp)
> 1538 2.7008 :c0270ed3: call *0xc(%eax)
>
> A patch that eliminates the intermediate pointer dereferences entirely
> was also tried:
> struct request *rq = q->elevator_cp.ops.elevator_next_req_fn(q);
> 9 0.0224 :c0270eea: mov %edi,(%esp)
> 3814 9.4793 :c0270eed: call *0x18(%edi)
>
> But that would require additional memory for a copy of 'ops' in each
> queue. The intermediate (proposed) patch has the same timing effect
> while saving that memory:
> struct request *rq =
> q->elevator_cp.ops->elevator_next_req_fn(q);
> drivers/block/elevator.c:351
> ffffffff802a8b97: 49 8b 44 24 18 mov 0x18(%r12),%rax
> ffffffff802a8b9c: 4c 89 e7 mov %r12,%rdi
> ffffffff802a8b9f: ff 50 18 callq *0x18(%rax)

But with the proposed patch, the function call is still indirect. You
are only eliminating a dereference of elevator->, since that structure
is now embedded in the queue. That matches your asm: you eliminate one
mov there.

I'm guessing you are testing this with your NULL driver, which is why
the difference is so 'huge' in profiling. And you are probably using
noop, correct? To be honest, I don't see a lot of real-world relevance
in this testing; the io path isn't completely lean with the regular io
schedulers either, and I bet this would be noise in real-world testing.
Micro benchmarks are all fine, but they only say so much. And as I
originally stated, this patch is a no-go from the beginning, since you
cannot ref count a statically embedded structure. It has to be
dynamically allocated.
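
To sketch why (the names below are illustrative, not the actual
kernel API): the last put must be able to free the object, which only
works when the object is its own allocation:

#include <linux/slab.h>
#include <asm/atomic.h>

/* Illustrative only; not the real elevator code. */
struct elv_sketch {
	atomic_t refcnt;
	/* ... ops, private data ... */
};

static void elv_sketch_put(struct elv_sketch *e)
{
	if (atomic_dec_and_test(&e->refcnt))
		kfree(e);	/* fine: e came from kmalloc */
}

/* If e were instead embedded in the queue,
 *	struct request_queue { struct elv_sketch elevator_cp; ... };
 * the final put could neither kfree() it nor let it outlive the
 * queue while a late reference (sysfs, an in-flight switch) is
 * still held. */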

So if you are really interested in this and have a valid reason to
pursue it, please think more about other ways to solve this.

--
Jens Axboe

2005-10-17 19:25:26

by Chen, Kenneth W

Subject: RE: [PATCH 1/1] indirect function calls elimination in IO scheduler

Jens Axboe wrote on Monday, October 17, 2005 10:59 AM
> you cannot ref count a statically embedded structure. It has to be
> dynamically allocated.

I'm confused. For every block device queue there is one unique
elevator_t structure, allocated via kmalloc. And vice versa: each
elevator_t has only one request_queue pointing to it. This elevator_t
structure is per-queue, since it holds a pointer to per-queue elevator
private data.

Since it is always a one-to-one relationship, the ref count is
predictable and static. I see there is a ref count on q->elevator,
but it is always 2: one from object instantiation and one from adding
a sysfs hierarchy directory. In that case, I don't see the
difference. Am I missing something?

- Ken

2005-10-17 19:39:31

by Jens Axboe

Subject: Re: [PATCH 1/1] indirect function calls elimination in IO scheduler

On Mon, Oct 17 2005, Chen, Kenneth W wrote:
> Jens Axboe wrote on Monday, October 17, 2005 10:59 AM
> > you cannot ref count a statically embedded structure. It has to be
> > dynamically allocated.
>
> I'm confused. For every block device queue there is one unique
> elevator_t structure, allocated via kmalloc. And vice versa: each
> elevator_t has only one request_queue pointing to it. This elevator_t
> structure is per-queue, since it holds a pointer to per-queue elevator
> private data.

For every _non_ stacked queue there is an elevator_t structure
attached; stacked dm/md queues have none. That's a separate point, but
it means that embedding the elevator inside the queue would waste
memory (104 bytes per queue on the box I'm typing on) for dm/md
devices.

> Since it is always a one-to-one relationship, the ref count is
> predictable and static. I see there is a ref count on q->elevator,
> but it is always 2: one from object instantiation and one from adding
> a sysfs hierarchy directory. In that case, I don't see the
> difference. Am I missing something?

References to the elevator do exist outside of the queue itself being
gotten/put; elevator switching is one of them. Tejun has patches
improving the switching, which would make it possible to keep two
schedulers alive on the queue for the duration of the switch.
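
Very roughly, a switch along those lines could look like this; the
helper names are invented for the sketch and are not Tejun's actual
patches:

#include <linux/errno.h>

/* Illustrative only. With independently allocated, refcounted
 * elevators, old and new can coexist while requests are drained
 * across. */
struct elevator_type;
struct elevator_queue;
struct request_queue {
	struct elevator_queue *elevator;
	/* ... */
};

/* Assumed helpers: allocate a refcounted elevator, drop a reference. */
struct elevator_queue *elv_alloc_sketch(struct request_queue *q,
					struct elevator_type *t);
void elv_put_sketch(struct elevator_queue *e);

static int elv_switch_sketch(struct request_queue *q,
			     struct elevator_type *new_type)
{
	struct elevator_queue *old = q->elevator;
	struct elevator_queue *new = elv_alloc_sketch(q, new_type);

	if (!new)
		return -ENOMEM;

	/* drain or migrate pending requests from old to new here */

	q->elevator = new;
	elv_put_sketch(old);	/* old survives until its last ref drops */
	return 0;
}

/* A single elevator embedded in the queue leaves nowhere for the
 * second, concurrent instance to live. */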

--
Jens Axboe