2005-03-09 01:44:14

by Chen, Kenneth W

[permalink] [raw]
Subject: Direct io on block device has performance regression on 2.6.x kernel

I don't know where to start, but let me start with the bombshell:

Direct I/O on block device running 2.6.X kernel is a lot SLOWER
than running on a 2.4 Kernel!


Processing a direct I/O request on 2.6 is taking a lot more cpu
time compared to the same I/O request running on a 2.4 kernel.

The proof: easy. I started off by having a pseudo disk, a software
disk that has zero access latency. By hooking this pseudo disk into
the block layer API, I can effectively stress the entire I/O stack
above the block level. Combined with user level test programs that
simply submit direct I/O in a tight while loop, I can measure how
fast the kernel can process these I/O requests. The performance metric
can be either throughput (# of I/Os per second) or per unit of work
(processing time per I/O). For the data presented below, I'm using
the throughput metric (meaning a larger number is better performance).
Pay attention to the relative percentages, as the absolute numbers
depend on the platform/CPU that the test suite runs on.
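
The core of the measurement is essentially the loop below (a minimal
sketch of the idea, not the actual test programs, which follow later in
this thread; the raw device path is whatever the pseudo disk is bound to):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <malloc.h>
#include <sys/time.h>

#define NIOS 1000000				/* I/Os per measurement interval */

int main(void)
{
	char *buf = memalign(4096, 4096);	/* aligned buffer for direct I/O */
	int fd = open("/dev/raw/raw1", O_RDONLY);
	struct timeval t0, t1;
	double secs;
	int i;

	if (fd == -1 || buf == NULL) {
		perror("setup");
		exit(1);
	}

	gettimeofday(&t0, NULL);
	for (i = 0; i < NIOS; i++)		/* one-page direct reads at offset 0 */
		pread(fd, buf, 4096, 0);
	gettimeofday(&t1, NULL);

	secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
	printf("%.0f I/O per second\n", NIOS / secs);
	return 0;
}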


                    synchronous I/O                 AIO
                    (pread/pwrite/read/write)       io_submit
2.4.21 based
(RHEL3)             265,122                         229,810

2.6.9               218,565                         206,917
2.6.10              213,041                         205,891
2.6.11              212,284                         201,124

From the above chart, you can see that the 2.6 kernel is at least 18%
slower at processing direct I/O (on a block device) in the synchronous
path and 10% slower in the AIO path compared to a distributor's 2.4
kernel. What's worse, with each new kernel version, the I/O
path is becoming slower and slower.

Most of the performance regression for 2.6.9 came from the dio layer (I
still have to find where the regression came from with 2.6.10 and 2.6.11).
DIO is just overloaded with too many areas to cover. I think it's better
to break things up a little bit.

For example, having a set of dedicated functions that do direct I/O
on a block device improves the performance dramatically:

                    synchronous I/O                 AIO
                    (pread/pwrite/read/write)       io_submit
2.4.21 based
(RHEL3)             265,122                         229,810
2.6.9               218,565                         206,917
2.6.9+patches       323,016                         268,484

See, we can actually be 22% faster in the synchronous path and 17% faster
in the AIO path, if we do it right!

Kernel patch and test suite to follow in the next couple postings.

- Ken





2005-03-09 01:59:47

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: Direct io on block device has performance regression on 2.6.x kernel

OK, last one in the series: user level test programs that stress
the kernel I/O stack. Pretty dull stuff.

- Ken



diff -Nur zero/aio_null.c blknull_test/aio_null.c
--- zero/aio_null.c 1969-12-31 16:00:00.000000000 -0800
+++ blknull_test/aio_null.c 2005-03-08 00:46:17.000000000 -0800
@@ -0,0 +1,77 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <sched.h>
+#include <signal.h>
+#include <sys/types.h>
+#include <linux/ioctl.h>
+#include <libaio.h>
+
+#define MAXAIO 1024
+
+char buf[4096] __attribute__((aligned(4096)));
+
+io_context_t io_ctx;
+struct iocb iocbpool[MAXAIO];
+struct io_event ioevent[MAXAIO];
+
+void aio_setup(int n)
+{
+ int res = io_queue_init(n, &io_ctx);
+ if (res != 0) {
+ printf("io_queue_setup(%d) returned %d (%s)\n",
+ n, res, strerror(-res));
+ exit(0);
+ }
+}
+
+int main(int argc, char* argv[])
+{
+ int fd, i, status, batch;
+ struct iocb* iocbbatch[MAXAIO];
+ int devidx;
+ off_t offset;
+ unsigned long start, end;
+
+ batch = argc < 2 ? 100: atoi(argv[1]);
+ if (batch >= MAXAIO)
+ batch = MAXAIO;
+
+ aio_setup(MAXAIO);
+ fd = open("/dev/raw/raw1", O_RDONLY);
+
+ if (fd == -1) {
+ perror("error opening\n");
+ exit (0);
+ }
+ for (i=0; i<batch; i++) {
+ iocbbatch[i] = iocbpool+i;
+ io_prep_pread(iocbbatch[i], fd, buf, 4096, 0);
+ }
+
+ while (1) {
+ struct timespec ts={30,0};
+ int bufidx;
+ int reap;
+
+ status = io_submit(io_ctx, i, iocbbatch);
+ if (status != i) {
+ printf("bad io_submit: %d [%s]\n", status,
+ strerror(-status));
+ }
+
+ // reap at least batch count back
+ reap = io_getevents(io_ctx, batch, MAXAIO, ioevent, &ts);
+ if (reap < batch) {
+ printf("io_getevents returned=%d [%s]\n", reap,
+ strerror(-reap));
+ }
+
+ // check the return result of each event
+ for (i=0; i<reap; i++)
+ if (ioevent[i].res != 4096)
+ printf("error in read: %lx\n", ioevent[i].res);
+ } /* while (1) */
+}
diff -Nur zero/pread_null.c blknull_test/pread_null.c
--- zero/pread_null.c 1969-12-31 16:00:00.000000000 -0800
+++ blknull_test/pread_null.c 2005-03-08 00:44:20.000000000 -0800
@@ -0,0 +1,28 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <malloc.h>
+
+int main(int argc, char* argv[])
+{
+ int fd;
+ char *addr;
+
+ fd = open("/dev/raw/raw1", O_RDONLY);
+ if (fd == -1) {
+ perror("error opening\n");
+ exit(0);
+ }
+
+ addr = memalign(4096, 4096);
+ if (addr == 0) {
+ printf("no memory\n");
+ exit(0);
+ }
+
+ while (1) {
+ pread(fd, addr, 4096, 0);
+ }
+
+}
diff -Nur zero/makefile blknull_test/makefile
--- zero/makefile 1969-12-31 16:00:00.000000000 -0800
+++ blknull_test/makefile 2005-03-08 17:10:39.000000000 -0800
@@ -0,0 +1,10 @@
+all: pread_null aio_null
+
+pread_null: pread_null.c
+ gcc -O3 -o $@ pread_null.c
+
+aio_null: aio_null.c
+ gcc -O3 -o $@ aio_null.c -laio
+
+clean:
+ rm -f pread_null aio_null
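
To run these against a real (or the pseudo) disk, /dev/raw/raw1 first has
to be bound to the block device under test with the raw(8) utility, e.g.
"raw /dev/raw/raw1 /dev/sdb" (the device name here is just an example), and
the programs have to run with permission to open the raw device. aio_null
also needs libaio installed to link against -laio.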



2005-03-09 06:30:35

by Andrew Morton

[permalink] [raw]
Subject: Re: Direct io on block device has performance regression on 2.6.x kernel

"Chen, Kenneth W" <[email protected]> wrote:
>
> Direct I/O on block device running 2.6.X kernel is a lot SLOWER
> than running on a 2.4 Kernel!
>

A little bit slower, it appears. It used to be faster.

> ...
>
>                   synchronous I/O                 AIO
>                   (pread/pwrite/read/write)       io_submit
> 2.4.21 based
> (RHEL3)           265,122                         229,810
>
> 2.6.9             218,565                         206,917
> 2.6.10            213,041                         205,891
> 2.6.11            212,284                         201,124

What sort of CPU?

What speed CPU?

What size requests?

Reads or writes?

At 5 usecs per request I figure that's 3% CPU utilisation for 16k requests
at 100 MB/sec.
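
For reference, the arithmetic behind that figure: ~200,000 I/Os per second
in the tables above works out to roughly 5 usec of CPU per request; 100 MB/sec
in 16k requests is about 6,400 requests/sec, and 6,400 x 5 usec is 32 msec of
CPU per second, i.e. roughly 3% of one CPU.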

Once you bolt this onto a real device driver the proportional difference
will fall, due to addition of the constant factor.

Once you bolt all this onto a real disk controller all the numbers will get
worse (but in a strictly proportional manner) due to the disk transfers
depriving the CPU of memory bandwidth.

The raw driver is deprecated and we'd like to remove it. The preferred way
of doing direct-IO against a blockdev is by opening it with O_DIRECT.

Your patches don't address blockdevs opened with O_DIRECT. What you should
do is to make def_blk_aops.direct_IO point at a new function. That will
then work correctly with both raw and with open(/dev/hdX, O_DIRECT).
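
For reference, the preferred style from user space looks something like the
sketch below (illustrative only; /dev/sdb is a placeholder device name, and
the buffer, size and offset are kept page aligned):

#define _GNU_SOURCE			/* for O_DIRECT */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <malloc.h>

int main(void)
{
	char *buf = memalign(4096, 4096);	/* aligned buffer for direct I/O */
	int fd = open("/dev/sdb", O_RDONLY | O_DIRECT);
	ssize_t n;

	if (fd == -1 || buf == NULL) {
		perror("setup");
		exit(1);
	}
	n = pread(fd, buf, 4096, 0);		/* direct read, no raw driver needed */
	printf("read %zd bytes\n", n);
	return 0;
}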


But before doing anything else, please bench this on real hardware, see if
it is worth pursuing. And gather an oprofile trace of the existing code.


2005-03-09 17:21:57

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: Direct io on block device has performance regression on 2.6.x kernel

Andrew Morton wrote on Tuesday, March 08, 2005 10:28 PM
> But before doing anything else, please bench this on real hardware,
> see if it is worth pursuing.

Let me answer the questions in reverse order. We started with running
industry standard transaction processing database benchmark on 2.6 kernel,
on real hardware (4P smp, 64 GB memory, 450 disks) running industry
standard db application. What we measured is that with best tuning done
to the system, 2.6 kernel has a huge performance regression relative to
its predecessor 2.4 kernel (a kernel from RHEL3, 2.4.21 based).

Ever since we made that measurement, people have been kicking my butt every
day, asking "after telling us how great the 2.6 kernel is, why is my workload
running significantly slower on this shiny 2.6 kernel?". It hurts,
it hurts like a sledgehammer nailed right in the middle of my head.

This is all real: real benchmark running on real hardware, with real
result showing large performance regression. Nothing synthetic here.

And yes, it is all worth pursuing, the two patches on raw device recuperate
1/3 of the total benchmark performance regression.

The reason I posted the pseudo disk driver is for people to see the effect
more easily without shelling out a couple of million dollars to buy all that
equipment.


> Once you bolt this onto a real device driver the proportional difference
> will fall, due to addition of the constant factor.
>
> Once you bolt all this onto a real disk controller all the numbers will get
> worse (but in a strictly proportional manner) due to the disk transfers
> depriving the CPU of memory bandwidth.
>

That's not how I would interpret the numbers. Kernel utilization went up for
the 2.6 kernel running the same db workload. One reason is that the I/O stack
taxes each I/O call a little more (or I should say is less efficient); even if
the amount is minuscule, given the sheer I/O rate it gets amplified very quickly.
One cpu cycle spent in the kernel means one less cpu cycle for the application.
My main point is that with a less efficient I/O stack, the kernel is actually
taking away valuable compute resources that the application could use to crunch
SQL transactions. And that leads to lower performance.

One can extrapolate it the other way: make the kernel more efficient at
processing these I/O requests, kernel utilization goes down, the cycles saved
transfer to the application to crunch more SQL transactions, and performance
goes up. I hope everyone is following me here.


> At 5 usecs per request I figure that's 3% CPU utilisation for 16k requests
> at 100 MB/sec.

Our smallest setup has 450 disks, and the workload will generate about 50,000
I/Os per second. Larger setups will have higher I/O rates.


> What sort of CPU?
>
> What speed CPU?
>
> What size requests?
>
> Reads or writes?
>

1.6 GHz Itanium2, 9M L3.
I/O requests are a mixture of 2KB and 16KB, with occasional larger sizes in bursts.
Both reads and writes, about a 50/50 split between them.

- Ken


2005-03-09 20:10:09

by Andrew Morton

[permalink] [raw]
Subject: Re: Direct io on block device has performance regression on 2.6.x kernel

"Chen, Kenneth W" <[email protected]> wrote:
>
> Andrew Morton wrote on Tuesday, March 08, 2005 10:28 PM
> > But before doing anything else, please bench this on real hardware,
> > see if it is worth pursuing.
>
> Let me answer the questions in reverse order. We started with running
> industry standard transaction processing database benchmark on 2.6 kernel,
> on real hardware (4P smp, 64 GB memory, 450 disks) running industry
> standard db application. What we measured is that with best tuning done
> to the system, 2.6 kernel has a huge performance regression relative to
> its predecessor 2.4 kernel (a kernel from RHEL3, 2.4.21 based).

That's news to me. I thought we were doing OK with big database stuff.
Surely lots of people have been testing such things.

> And yes, it is all worth pursuing, the two patches on raw device recuperate
> 1/3 of the total benchmark performance regression.

On a real disk driver? hm, I'm wrong then.

Did you generate a kernel profile?

2005-03-09 22:05:07

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: Direct io on block device has performance regression on 2.6.x kernel

Andrew Morton wrote on Wednesday, March 09, 2005 12:05 PM
> "Chen, Kenneth W" <[email protected]> wrote:
> > Let me answer the questions in reverse order. We started with running
> > industry standard transaction processing database benchmark on 2.6 kernel,
> > on real hardware (4P smp, 64 GB memory, 450 disks) running industry
> > standard db application. What we measured is that with best tuning done
> > to the system, 2.6 kernel has a huge performance regression relative to
> > its predecessor 2.4 kernel (a kernel from RHEL3, 2.4.21 based).
>
> That's news to me. I thought we were doing OK with big database stuff.
> Surely lots of people have been testing such things.

There are different levels of "big" stuff. We used to work on a 32-way numa
box, but other show-stopper issues popped up before we got to the I/O stack.
The good thing that came out of that work is the removal of the global unplug lock.


> > And yes, it is all worth pursuing, the two patches on raw device recuperate
> > 1/3 of the total benchmark performance regression.
>
> On a real disk driver? hm, I'm wrong then.
>

Yes, on real disk driver (qlogic fiber channel) and with real 15K rpm disks.


> Did you generate a kernel profile?

Top 40 kernel hot functions, percentage is normalized to kernel utilization.

_spin_unlock_irqrestore 23.54%
_spin_unlock_irq 19.27%
__blockdev_direct_IO 3.57%
follow_hugetlb_page 1.84%
e1000_clean 1.38%
kmem_cache_alloc 1.31%
put_page 1.29%
__generic_file_aio_read 1.18%
e1000_intr 1.07%
schedule 1.01%
dio_bio_complete 0.97%
mempool_alloc 0.96%
kmem_cache_free 0.90%
__end_that_request_first 0.88%
__copy_user 0.82%
kfree 0.77%
generic_make_request 0.73%
_spin_lock 0.73%
kref_put 0.73%
vfs_read 0.68%
update_atime 0.68%
scsi_dispatch_cmd 0.67%
fget_light 0.66%
put_io_context 0.60%
_spin_lock_irqsave 0.58%
scsi_finish_command 0.58%
generic_file_aio_write_nolock 0.57%
inode_times_differ 0.55%
break_fault 0.53%
__do_softirq 0.48%
aio_read_evt 0.48%
try_atomic_semop 0.44%
sys_pread64 0.43%
__bio_add_page 0.43%
__mod_timer 0.42%
bio_alloc 0.41%
scsi_decide_disposition 0.40%
e1000_clean_rx_irq 0.39%
find_vma 0.38%
dnotify_parent 0.38%


Profile with spin locks inlined, so that it is easier to see the functions
that have the lock contention, again top 40 hot functions:

scsi_request_fn 7.54%
finish_task_switch 6.25%
__blockdev_direct_IO 4.97%
__make_request 3.87%
scsi_end_request 3.54%
dio_bio_end_io 2.70%
follow_hugetlb_page 2.39%
__wake_up 2.37%
aio_complete 1.82%
kmem_cache_alloc 1.68%
__mod_timer 1.63%
e1000_clean 1.57%
__generic_file_aio_read 1.42%
mempool_alloc 1.37%
put_page 1.35%
e1000_intr 1.31%
schedule 1.25%
dio_bio_complete 1.20%
scsi_device_unbusy 1.07%
kmem_cache_free 1.06%
__copy_user 1.04%
scsi_dispatch_cmd 1.04%
__end_that_request_first 1.04%
generic_make_request 1.02%
kfree 0.94%
__aio_get_req 0.93%
sys_pread64 0.83%
get_request 0.79%
put_io_context 0.76%
dnotify_parent 0.73%
vfs_read 0.73%
update_atime 0.73%
finished_one_bio 0.63%
generic_file_aio_write_nolock 0.63%
scsi_put_command 0.62%
break_fault 0.62%
e1000_xmit_frame 0.62%
aio_read_evt 0.59%
scsi_io_completion 0.59%
inode_times_differ 0.58%



2005-03-09 22:51:55

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: Direct io on block device has performance regression on 2.6.x kernel

Chen, Kenneth W wrote on Wednesday, March 09, 2005 1:59 PM
> > Did you generate a kernel profile?
>
> Top 40 kernel hot functions, percentage is normalized to kernel utilization.
>
> _spin_unlock_irqrestore 23.54%
> _spin_unlock_irq 19.27%
> ....
>
> Profile with spin lock inlined, so that it is easier to see functions
> that has the lock contention, again top 40 hot functions:

Just to clarify here, these data need to be taken with a grain of salt. A
high count in the _spin_unlock_* functions does not automatically point to
lock contention. It's one of the blind spots of timer based
profiling on ia64. There are some lock contentions in the 2.6 kernel that
we are staring at. Please do not misinterpret the numbers here.

- Ken


2005-03-09 23:26:29

by Andi Kleen

[permalink] [raw]
Subject: Re: Direct io on block device has performance regression on 2.6.x kernel

"Chen, Kenneth W" <[email protected]> writes:
>
> Just to clarify here, these data need to be taken at grain of salt. A
> high count in _spin_unlock_* functions do not automatically points to
> lock contention. It's one of the blind spot syndrome with timer based
> profile on ia64. There are some lock contentions in 2.6 kernel that
> we are staring at. Please do not misinterpret the number here.

Why don't you use oprofile? It uses NMIs and can profile "inside"
interrupt disabled sections.

-Andi

2005-03-10 01:22:30

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: Direct io on block device has performance regression on 2.6.x kernel

Andi Kleen wrote on Wednesday, March 09, 2005 3:23 PM
> > Just to clarify here, these data need to be taken at grain of salt. A
> > high count in _spin_unlock_* functions do not automatically points to
> > lock contention. It's one of the blind spot syndrome with timer based
> > profile on ia64. There are some lock contentions in 2.6 kernel that
> > we are staring at. Please do not misinterpret the number here.
>
> Why don't you use oprofile?>? It uses NMIs and can profile "inside"
> interrupt disabled sections.

The profile is taken on ia64; we don't have NMIs. Oprofile will produce
the same result.

- Ken


2005-03-10 01:25:23

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: Direct io on block device has performance regression on 2.6.x kernel

Jesse Barnes wrote on Wednesday, March 09, 2005 3:53 PM
> > "Chen, Kenneth W" <[email protected]> writes:
> > > Just to clarify here, these data need to be taken at grain of salt. A
> > > high count in _spin_unlock_* functions do not automatically points to
> > > lock contention. It's one of the blind spot syndrome with timer based
> > > profile on ia64. There are some lock contentions in 2.6 kernel that
> > > we are staring at. Please do not misinterpret the number here.
> >
> > Why don't you use oprofile?>? It uses NMIs and can profile "inside"
> > interrupt disabled sections.
>
> Oh, and there are other ways of doing interrupt off profiling by using the
> PMU. q-tools can do this I think.

Thank you all for the suggestions. I'm well aware of q-tools and have been
using it on and off. It's just that I don't have any data collected with
q-tools for that particular hardware/software benchmark configuration. I
posted whatever data I have.

- Ken


2005-03-10 01:40:37

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: Direct io on block device has performance regression on 2.6.x kernel

For people who are dying to see a q-tools profile, here is one.
It's not a vanilla 2.6.9 kernel, but one with patches in the raw device
to get around the DIO performance problem.

- Ken


Flat profile of CPU_CYCLES in hist#0:
Each histogram sample counts as 255.337u seconds
% time self cumul calls self/call tot/call name
5.08 1.92 1.92 - - - schedule
4.64 0.62 2.54 - - - __ia64_readw_relaxed
4.03 0.54 3.08 - - - _stext
3.03 0.41 3.49 - - - qla2x00_queuecommand
2.73 0.37 3.86 - - - qla2x00_start_scsi
1.92 0.26 4.12 - - - __mod_timer
1.78 0.24 4.36 - - - scsi_request_fn
1.68 0.23 4.58 - - - __copy_user
1.45 0.20 4.78 - - - raw_file_rw
1.30 0.17 4.95 - - - kmem_cache_alloc
1.29 0.17 5.12 - - - mempool_alloc
1.29 0.17 5.30 - - - follow_hugetlb_page
1.19 0.16 5.46 - - - generic_make_request
1.14 0.15 5.61 - - - qla2x00_next
1.06 0.14 5.75 - - - memset
1.03 0.14 5.89 - - - raw_file_aio_rw
1.01 0.14 6.03 - - - e1000_clean
0.93 0.13 6.15 - - - scsi_get_command
0.93 0.12 6.28 - - - sd_init_command
0.87 0.12 6.39 - - - __make_request
0.87 0.12 6.51 - - - __aio_get_req
0.81 0.11 6.62 - - - qla2300_intr_handler
0.77 0.10 6.72 - - - put_io_context
0.75 0.10 6.82 - - - qla2x00_process_completed_request
0.74 0.10 6.92 - - - e1000_intr
0.73 0.10 7.02 - - - get_request
0.72 0.10 7.12 - - - rse_clear_invalid
0.70 0.09 7.21 - - - aio_read_evt
0.70 0.09 7.31 - - - e1000_xmit_frame
0.70 0.09 7.40 - - - __bio_add_page
0.69 0.09 7.49 - - - qla2x00_process_response_queue
0.69 0.09 7.58 - - - vfs_read
0.69 0.09 7.68 - - - break_fault
0.67 0.09 7.77 - - - scsi_dispatch_cmd
0.66 0.09 7.86 - - - try_to_wake_up
0.64 0.09 7.94 - - - blk_queue_start_tag
0.63 0.08 8.03 - - - sys_pread64
0.62 0.08 8.11 - - - alt_dtlb_miss
0.60 0.08 8.19 - - - ia64_spinlock_contention
0.57 0.08 8.27 - - - skb_release_data
0.55 0.07 8.34 - - - scsi_prep_fn
0.53 0.07 8.41 - - - tcp_sendmsg
0.52 0.07 8.48 - - - internal_add_timer
0.51 0.07 8.55 - - - drive_stat_acct
0.51 0.07 8.62 - - - tcp_transmit_skb
0.50 0.07 8.69 - - - task_rq_lock
0.49 0.07 8.75 - - - get_user_pages
0.48 0.06 8.82 - - - tcp_rcv_established
0.47 0.06 8.88 - - - kmem_cache_free
0.47 0.06 8.94 - - - save_switch_stack
0.46 0.06 9.00 - - - del_timer
0.46 0.06 9.07 - - - aio_pread
0.45 0.06 9.13 - - - bio_alloc
0.44 0.06 9.19 - - - finish_task_switch
0.44 0.06 9.25 - - - ip_queue_xmit
0.43 0.06 9.30 - - - move_tasks
0.42 0.06 9.36 - - - lock_sock
0.40 0.05 9.41 - - - elv_queue_empty
0.40 0.05 9.47 - - - bio_add_page
0.39 0.05 9.52 - - - try_atomic_semop
0.38 0.05 9.57 - - - qla2x00_done
0.38 0.05 9.62 - - - tcp_recvmsg
0.37 0.05 9.67 - - - put_page
0.37 0.05 9.72 - - - elv_next_request
0.36 0.05 9.77 - - - mod_timer
0.36 0.05 9.82 - - - load_switch_stack
0.36 0.05 9.86 - - - interrupt
0.36 0.05 9.91 - - - kref_get
0.36 0.05 9.96 - - - fget
0.35 0.05 10.01 - - - sk_wait_data
0.35 0.05 10.05 - - - do_IRQ
0.34 0.05 10.10 - - - e1000_clean_rx_irq
0.34 0.05 10.15 - - - find_vma
0.34 0.05 10.19 - - - ia64_leave_syscall
0.33 0.04 10.24 - - - tcp_ack
0.33 0.04 10.28 - - - fget_light
0.33 0.04 10.32 - - - swiotlb_map_sg
0.33 0.04 10.37 - - - blk_plug_device
0.32 0.04 10.41 - - - tcp_current_mss
0.31 0.04 10.45 - - - dev_queue_xmit
0.31 0.04 10.50 - - - __aio_run_iocbs
0.31 0.04 10.54 - - - ia64_handle_irq
0.31 0.04 10.58 - - - tcp_prequeue_process
0.30 0.04 10.62 - - - kfree
0.29 0.04 10.66 - - - submit_bio
0.29 0.04 10.70 - - - kref_put
0.29 0.04 10.74 - - - blk_queue_end_tag
0.29 0.04 10.77 - - - dequeue_task
0.28 0.04 10.81 - - - sys_semtimedop
0.28 0.04 10.85 - - - mempool_free
0.28 0.04 10.89 - - - __end_that_request_first
0.28 0.04 10.93 - - - qdisc_restart
0.28 0.04 10.96 - - - __aio_put_req
0.27 0.04 11.00 - - - sys_write
0.27 0.04 11.03 - - - schedule_timeout
0.26 0.04 11.07 - - - __umoddi3
0.25 0.03 11.10 - - - sched_clock
0.25 0.03 11.14 - - - add_wait_queue_exclusive
0.25 0.03 11.17 - - - blk_recount_segments
0.25 0.03 11.20 - - - aio_complete
0.24 0.03 11.24 - - - vfs_write
0.24 0.03 11.27 - - - remove_wait_queue
0.24 0.03 11.30 - - - __kfree_skb
0.24 0.03 11.33 - - - __elv_add_request
0.23 0.03 11.36 - - - enqueue_task
0.23 0.03 11.39 - - - ip_finish_output2
0.22 0.03 11.42 - - - io_submit_one
0.22 0.03 11.45 - - - tcp_v4_rcv
0.22 0.03 11.48 - - - bio_put
0.21 0.03 11.51 - - - elv_set_request
0.21 0.03 11.54 - - - inet_sendmsg
0.20 0.03 11.57 - - - rebalance_tick
0.20 0.03 11.59 - - - scsi_device_unbusy
0.20 0.03 11.62 - - - sys_io_getevents
0.20 0.03 11.65 - - - local_bh_enable
0.19 0.03 11.67 - - - __do_softirq
0.19 0.03 11.70 - - - ia64_switch_to
0.19 0.03 11.72 - - - update_process_times
0.19 0.03 11.75 - - - end_that_request_last
0.19 0.02 11.77 - - - memmove
0.19 0.02 11.80 - - - eth_type_trans
0.18 0.02 11.82 - - - generic_unplug_device
0.18 0.02 11.85 - - - huge_pte_offset
0.18 0.02 11.87 - - - scheduler_tick
0.17 0.02 11.90 - - - blk_queue_bounce
0.17 0.02 11.92 - - - mempool_alloc_slab
0.17 0.02 11.94 - - - scsi_finish_command
0.17 0.02 11.96 - - - blk_backing_dev_unplug
0.17 0.02 11.99 - - - qla2x00_build_scsi_iocbs_64
0.17 0.02 12.01 - - - blk_rq_map_sg
0.16 0.02 12.03 - - - swiotlb_unmap_sg
0.16 0.02 12.05 - - - sock_aio_write
0.16 0.02 12.07 - - - aio_setup_iocb
0.16 0.02 12.10 - - - blk_remove_plug
0.16 0.02 12.12 - - - ipc_lock
0.16 0.02 12.14 - - - __wake_up
0.15 0.02 12.16 - - - scsi_put_command
0.15 0.02 12.18 - - - scsi_end_request
0.15 0.02 12.20 - - - alloc_skb
0.15 0.02 12.22 - - - raw_end_aio
0.14 0.02 12.24 - - - cache_alloc_refill
0.14 0.02 12.26 - - - clear_queue_congested
0.14 0.02 12.27 - - - ia64_save_extra
0.14 0.02 12.29 - - - tcp_rtt_estimator
0.13 0.02 12.31 - - - ia64_leave_kernel
0.13 0.02 12.33 - - - ia64_syscall_setup
0.13 0.02 12.35 - - - tcp_write_xmit
0.13 0.02 12.36 - - - recalc_task_prio
0.13 0.02 12.38 - - - __tcp_v4_lookup_established
0.13 0.02 12.40 - - - scsi_io_completion
0.13 0.02 12.42 - - - sys_io_submit
0.12 0.02 12.43 - - - elv_merge
0.12 0.02 12.45 - - - skb_clone
0.12 0.02 12.47 - - - __kmalloc
0.12 0.02 12.48 - - - swiotlb_map_single
0.12 0.02 12.50 - - - __udivdi3
0.12 0.02 12.51 - - - get_io_context
0.12 0.02 12.53 - - - elevator_noop_merge
0.12 0.02 12.54 - - - __wake_up_common
0.12 0.02 12.56 - - - lookup_ioctx
0.11 0.02 12.58 - - - io_schedule
0.11 0.02 12.59 - - - net_rx_action
0.11 0.01 12.61 - - - disk_round_stats
0.11 0.01 12.62 - - - __generic_unplug_device
0.11 0.01 12.63 - - - tcp_event_data_recv
0.11 0.01 12.65 - - - __scsi_done
0.10 0.01 12.66 - - - dnotify_parent
0.10 0.01 12.68 - - - timer_interrupt
0.10 0.01 12.69 - - - ip_route_input
0.10 0.01 12.70 - - - sock_wfree
0.10 0.01 12.72 - - - add_timer_randomness
0.09 0.01 12.73 - - - run_timer_softirq
0.09 0.01 12.74 - - - sock_aio_read
0.09 0.01 12.75 - - - e1000_alloc_rx_buffers
0.09 0.01 12.77 - - - release_sock
0.09 0.01 12.78 - - - handle_IRQ_event
0.09 0.01 12.79 - - - pfifo_fast_enqueue
0.09 0.01 12.80 - - - ia64_load_extra
0.09 0.01 12.82 - - - finish_wait
0.08 0.01 12.83 - - - qla2x00_calc_iocbs_64
0.08 0.01 12.84 - - - find_busiest_group
0.08 0.01 12.85 - - - elevator_noop_next_request
0.08 0.01 12.86 - - - netif_receive_skb
0.08 0.01 12.87 - - - bio_destructor
0.08 0.01 12.88 - - - scsi_run_queue
0.08 0.01 12.89 - - - ip_output
0.07 0.01 12.90 - - - aio_pwrite
0.07 0.01 12.91 - - - ia64_ret_from_syscall
0.07 0.01 12.92 - - - add_disk_randomness
0.07 0.01 12.93 - - - scsi_decide_disposition
0.07 0.01 12.94 - - - do_sync_read
0.07 0.01 12.95 - - - bio_check_pages_dirty
0.07 0.01 12.96 - - - add_interrupt_randomness
0.06 0.01 12.97 - - - kobject_get
0.06 0.01 12.97 - - - sock_common_recvmsg
0.06 0.01 12.98 - - - get_device
0.06 0.01 12.99 - - - scsi_softirq
0.06 0.01 13.00 - - - aio_put_req
0.06 0.01 13.01 - - - freed_request
0.06 0.01 13.01 - - - scsi_add_timer
0.06 0.01 13.02 - - - elv_completed_request
0.06 0.01 13.03 - - - ip_rcv
0.06 0.01 13.04 - - - skb_copy_datagram_iovec
0.06 0.01 13.05 - - - iosapic_end_level_irq
0.06 0.01 13.05 - - - __tcp_select_window
0.06 0.01 13.06 - - - kobject_put
0.06 0.01 13.07 - - - activate_task
0.05 0.01 13.08 - - - sys_read
0.05 0.01 13.08 - - - elevator_noop_add_request
0.05 0.01 13.09 - - - elv_try_last_merge
0.05 0.01 13.10 - - - pfm_syst_wide_update_task
0.05 0.01 13.10 - - - elv_remove_request
0.05 0.01 13.11 - - - do_sync_write
0.05 0.01 13.12 - - - raw_file_aio_write
0.05 0.01 13.12 - - - tcp_v4_do_rcv
0.05 0.01 13.13 - - - __ia64_irq_desc
0.05 0.01 13.13 - - - prepare_to_wait
0.04 0.01 13.14 - - - ip_local_deliver_finish
0.04 0.01 13.15 - - - resched_task
0.04 0.01 13.15 - - - qla2x00_timer
0.04 0.01 13.16 - - - cleanup_rbuf
0.04 0.01 13.16 - - - memcpy_toiovec
0.04 0.01 13.17 - - - tcp_rcv_space_adjust
0.04 0.01 13.17 - - - cache_flusharray
0.04 0.01 13.18 - - - bio_phys_segments
0.04 0.01 13.18 - - - rbs_switch
0.04 0.01 13.19 - - - mempool_free_slab
0.04 0.01 13.20 - - - sys_semctl
0.04 0.01 13.20 - - - pfifo_fast_dequeue
0.04 0.01 13.21 - - - bio_endio
0.04 0.01 13.21 - - - ipcperms
0.04 0.01 13.22 - - - memcpy
0.04 0.01 13.22 - - - deactivate_task
0.04 0.00 13.23 - - - blk_run_queue
0.04 0.00 13.23 - - - find_extend_vma
0.04 0.00 13.24 - - - bio_set_pages_dirty
0.04 0.00 13.24 - - - raw_file_read
0.04 0.00 13.25 - - - tcp_send_delayed_ack
0.04 0.00 13.25 - - - swiotlb_unmap_single
0.03 0.00 13.26 - - - sk_reset_timer
0.03 0.00 13.26 - - - update_queue
0.03 0.00 13.26 - - - do_softirq
0.03 0.00 13.27 - - - __blk_put_request
0.03 0.00 13.27 - - - qla2x00_async_event
0.03 0.00 13.28 - - - __ia64_readl
0.03 0.00 13.28 - - - qla2x00_get_new_sp
0.03 0.00 13.29 - - - tcp_v4_send_check
0.03 0.00 13.29 - - - sd_rw_intr
0.03 0.00 13.29 - - - move_irq
0.03 0.00 13.30 - - - put_device
0.03 0.00 13.30 - - - raw_file_aio_read
0.03 0.00 13.30 - - - mark_clean
0.03 0.00 13.31 - - - time_interpolator_update
0.02 0.00 13.31 - - - ip_rcv_finish
0.02 0.00 13.31 - - - bio_hw_segments
0.02 0.00 13.32 - - - e1000_irq_enable
0.02 0.00 13.32 - - - fw2300ipx_version
0.02 0.00 13.32 - - - end_that_request_chunk
0.02 0.00 13.33 - - - find_vma_prev
0.02 0.00 13.33 - - - scsi_free_sgtable
0.02 0.00 13.33 - - - tcp_delack_timer
0.02 0.00 13.34 - - - ip_fast_csum
0.02 0.00 13.34 - - - __find_next_bit
0.02 0.00 13.34 - - - profile_tick
0.02 0.00 13.34 - - - ip_finish_output
0.02 0.00 13.35 - - - del_singleshot_timer_sync
0.02 0.00 13.35 - - - do_timer
0.02 0.00 13.35 - - - csum_tcpudp_magic
0.02 0.00 13.35 - - - autoremove_wake_function
0.02 0.00 13.36 - - - default_wake_function
0.02 0.00 13.36 - - - __sk_stream_mem_reclaim
0.02 0.00 13.36 - - - tcp_ack_saw_tstamp
0.02 0.00 13.36 - - - page_waitqueue
0.02 0.00 13.37 - - - ia64_pfn_valid
0.02 0.00 13.37 - - - elv_put_request
0.02 0.00 13.37 - - - tcp_set_skb_tso_segs
0.02 0.00 13.37 - - - mark_page_accessed
0.02 0.00 13.37 - - - __ia64_local_vector_to_irq
0.02 0.00 13.38 - - - wake_up_process
0.01 0.00 13.38 - - - kfree_skbmem
0.01 0.00 13.38 - - - scsi_next_command
0.01 0.00 13.38 - - - memcmp
0.01 0.00 13.39 - - - drain_array_locked
0.01 0.00 13.39 - - - lsapic_noop
0.01 0.00 13.39 - - - do_csum
0.01 0.00 13.39 - - - wake_up_page
0.01 0.00 13.39 - - - scsi_delete_timer
0.01 0.00 13.39 - - - cache_reap
0.01 0.00 13.40 - - - follow_page
0.01 0.00 13.40 - - - ip_local_deliver
0.01 0.00 13.40 - - - effective_prio
0.01 0.00 13.40 - - - skip_rbs_switch
0.01 0.00 13.40 - - - rt_hash_code
0.01 0.00 13.40 - - - nop
0.01 0.00 13.40 - - - scsi_done
0.01 0.00 13.41 - - - raw_end_io
0.01 0.00 13.41 - - - dont_preserve_current_frame
0.01 0.00 13.41 - - - raise_softirq_irqoff
0.01 0.00 13.41 - - - smp_send_reschedule
0.01 0.00 13.41 - - - raise_softirq
0.01 0.00 13.41 - - - unlock_page
0.01 0.00 13.41 - - - machvec_timer_interrupt
0.01 0.00 13.41 - - - tcp_cwnd_application_limited
0.01 0.00 13.41 - - - check_pgt_cache
0.01 0.00 13.42 - - - set_page_dirty
0.01 0.00 13.42 - - - e1000_read_phy_reg
0.01 0.00 13.42 - - - e1000_update_stats
0.01 0.00 13.42 - - - ia64_do_page_fault
0.01 0.00 13.42 - - - follow_huge_addr
0.01 0.00 13.42 - - - I_BDEV
0.01 0.00 13.42 - - - sk_stream_mem_schedule
0.00 0.00 13.42 - - - page_fault
0.00 0.00 13.42 - - - net_tx_action
0.00 0.00 13.42 - - - sys_gettimeofday
0.00 0.00 13.42 - - - __rcu_process_callbacks
0.00 0.00 13.42 - - - ipc_checkid
0.00 0.00 13.42 - - - __ia64_save_fpu
0.00 0.00 13.42 - - - find_busiest_queue
0.00 0.00 13.43 - - - pfm_stop
0.00 0.00 13.43 - - - sys_semop
0.00 0.00 13.43 - - - tcp_send_ack
0.00 0.00 13.43 - - - sys_poll
0.00 0.00 13.43 - - - free_block
0.00 0.00 13.43 - - - cascade
0.00 0.00 13.43 - - - do_select
0.00 0.00 13.43 - - - double_lock_balance
0.00 0.00 13.43 - - - i8042_interrupt
0.00 0.00 13.43 - - - __alloc_pages
0.00 0.00 13.43 - - - __ia64_inw
0.00 0.00 13.43 - - - handle_mm_fault
0.00 0.00 13.43 - - - del_timer_sync
0.00 0.00 13.43 - - - ia64_send_ipi
0.00 0.00 13.43 - - - ipc_unlock
0.00 0.00 13.43 - - - vsnprintf
0.00 0.00 13.43 - - - nr_blockdev_pages
0.00 0.00 13.43 - - - rcu_process_callbacks
0.00 0.00 13.43 - - - __free_pages
0.00 0.00 13.43 - - - ia64_fault
0.00 0.00 13.43 - - - ext2_permission
0.00 0.00 13.43 - - - profile_hit
0.00 0.00 13.43 - - - sys_lseek
0.00 0.00 13.43 - - - tasklet_action
0.00 0.00 13.43 - - - wb_kupdate
0.00 0.00 13.43 - - - alloc_pages_current
0.00 0.00 13.43 - - - buffered_rmqueue
0.00 0.00 13.43 - - - __get_free_pages
0.00 0.00 13.43 - - - show_stat
0.00 0.00 13.43 - - - batch_entropy_store
0.00 0.00 13.43 - - - __pollwait
0.00 0.00 13.43 - - - e1000_watchdog
0.00 0.00 13.43 - - - __tasklet_schedule
0.00 0.00 13.43 - - - run_local_timers
0.00 0.00 13.43 - - - task_timeslice
0.00 0.00 13.43 - - - __ia64_irq_to_vector
0.00 0.00 13.43 - - - do_generic_mapping_read
0.00 0.00 13.43 - - - bad_range
0.00 0.00 13.43 - - - time_interpolator_get_offset
0.00 0.00 13.43 - - - idle_cpu


2005-03-10 01:49:32

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: Direct io on block device has performance regression on 2.6.x kernel

Andrew Morton wrote on Wednesday, March 09, 2005 5:34 PM
> What are these percentages? Total CPU time? The direct-io stuff doesn't
> look too bad. It's surprising that tweaking the direct-io submission code
> makes much difference.

Percentages are relative to total kernel time. There are three DIO functions
that showed up in the profile:

__blockdev_direct_IO 4.97%
dio_bio_end_io 2.70%
dio_bio_complete 1.20%

> hm. __blockdev_direct_IO() doesn't actually do much. I assume your damn
> compiler went and inlined direct_io_worker() on us.

We are using gcc version 3.4.3. I suppose I can finger point gcc ? :-P

- Ken


2005-03-10 01:34:20

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: Direct io on block device has performance regression on 2.6.x kernel

Andrew Morton wrote on Wednesday, March 09, 2005 2:45 PM
> >
> > > Did you generate a kernel profile?
> >
> > Top 40 kernel hot functions, percentage is normalized to kernel utilization.
> >
> > _spin_unlock_irqrestore 23.54%
> > _spin_unlock_irq 19.27%
>
> Cripes.
>
> Is that with CONFIG_PREEMPT? If so, and if you disable CONFIG_PREEMPT,
> this cost should be accounting the the spin_unlock() caller and we can see
> who the culprit is. Perhaps dio->bio_lock.

CONFIG_PREEMPT is off.

Sorry for all the confusion, I probably shouldn't have posted the first profile.
See the 2nd profile that I posted earlier (copied here again).

scsi_request_fn 7.54%
finish_task_switch 6.25%
__blockdev_direct_IO 4.97%
__make_request 3.87%
scsi_end_request 3.54%
dio_bio_end_io 2.70%
follow_hugetlb_page 2.39%
__wake_up 2.37%
aio_complete 1.82%
kmem_cache_alloc 1.68%
__mod_timer 1.63%
e1000_clean 1.57%
__generic_file_aio_read 1.42%
mempool_alloc 1.37%
put_page 1.35%
e1000_intr 1.31%
schedule 1.25%
dio_bio_complete 1.20%
scsi_device_unbusy 1.07%
kmem_cache_free 1.06%
__copy_user 1.04%
scsi_dispatch_cmd 1.04%
__end_that_request_first 1.04%
generic_make_request 1.02%
kfree 0.94%
__aio_get_req 0.93%
sys_pread64 0.83%
get_request 0.79%
put_io_context 0.76%
dnotify_parent 0.73%
vfs_read 0.73%
update_atime 0.73%
finished_one_bio 0.63%
generic_file_aio_write_nolock 0.63%
scsi_put_command 0.62%
break_fault 0.62%
e1000_xmit_frame 0.62%
aio_read_evt 0.59%
scsi_io_completion 0.59%
inode_times_differ 0.58%


2005-03-10 02:08:57

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: Direct io on block device has performance regression on 2.6.x kernel

Chen, Kenneth W wrote on Wednesday, March 09, 2005 5:45 PM
> Andrew Morton wrote on Wednesday, March 09, 2005 5:34 PM
> > What are these percentages? Total CPU time? The direct-io stuff doesn't
> > look too bad. It's surprising that tweaking the direct-io submission code
> > makes much difference.
>
> Percentage is relative to total kernel time. There are three DIO functions
> showed up in the profile:
>
> __blockdev_direct_IO 4.97%
> dio_bio_end_io 2.70%
> dio_bio_complete 1.20%

For the sake of comparison, let's look at the effect of the performance patch on
the raw device. In place of the above three functions, we now have two:

raw_file_rw 1.59%
raw_file_aio_rw 1.19%

A total saving of 6.09% (4.97+2.70+1.20 - 1.59 - 1.19). That's only counting
the cpu cycles. We have tons of other data showing a significant kernel path
length reduction with the performance patch. Cache misses are reduced across
the entire 3-level cache hierarchy; that's a secondary effect that cannot be
ignored, since the kernel is also competing for cache resources with the
application.

- Ken


2005-03-10 01:54:59

by Andrew Morton

[permalink] [raw]
Subject: Re: Direct io on block device has performance regression on 2.6.x kernel

"Chen, Kenneth W" <[email protected]> wrote:
>
> Andrew Morton wrote on Wednesday, March 09, 2005 2:45 PM
> > >
> > > > Did you generate a kernel profile?
> > >
> > > Top 40 kernel hot functions, percentage is normalized to kernel utilization.
> > >
> > > _spin_unlock_irqrestore 23.54%
> > > _spin_unlock_irq 19.27%
> >
> > Cripes.
> >
> > Is that with CONFIG_PREEMPT? If so, and if you disable CONFIG_PREEMPT,
> > this cost should be accounting the the spin_unlock() caller and we can see
> > who the culprit is. Perhaps dio->bio_lock.
>
> CONFIG_PREEMPT is off.
>
> Sorry for all the confusion, I probably shouldn't post the first profile
> to confuse people. See 2nd profile that I posted earlier (copied here again).
>
> scsi_request_fn 7.54%
> finish_task_switch 6.25%
> __blockdev_direct_IO 4.97%
> __make_request 3.87%
> scsi_end_request 3.54%
> dio_bio_end_io 2.70%
> follow_hugetlb_page 2.39%
> __wake_up 2.37%
> aio_complete 1.82%

What are these percentages? Total CPU time? The direct-io stuff doesn't
look too bad. It's surprising that tweaking the direct-io submission code
makes much difference.

hm. __blockdev_direct_IO() doesn't actually do much. I assume your damn
compiler went and inlined direct_io_worker() on us.

2005-03-10 02:30:01

by Andrew Morton

[permalink] [raw]
Subject: Re: Direct io on block device has performance regression on 2.6.x kernel

"Chen, Kenneth W" <[email protected]> wrote:
>
> This is all real: real benchmark running on real hardware, with real
> result showing large performance regression. Nothing synthetic here.
>

Ken, could you *please* be more complete, more organized and more specific?

What does "1/3 of the total benchmark performance regression" mean? One
third of 0.1% isn't very impressive. You haven't told us anything at all
about the magnitude of this regression.

Where does the rest of the regression come from?

How much system time? User time? All that stuff.

> And yes, it is all worth pursuing, the two patches on raw device recuperate
> 1/3 of the total benchmark performance regression.

The patch needs a fair bit of work, and if it still provides useful gains
when it's complete, I guess it could make sense as some database special-case.

But the first thing to do is to work out where the cycles are going to.


Also, I'm rather peeved that we're hearing about this regression now rather
than two years ago. And mystified as to why yours is the only group which
has reported it.

2005-03-10 03:51:14

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: Direct io on block device has performance regression on 2.6.x kernel

Andrew Morton wrote Wednesday, March 09, 2005 6:26 PM
> What does "1/3 of the total benchmark performance regression" mean? One
> third of 0.1% isn't very impressive. You haven't told us anything at all
> about the magnitude of this regression.

The 2.6.9 kernel is 6% slower compared to the distributor's 2.4 kernel (RHEL3).
Roughly 2% came from the storage driver (I'm not allowed to say anything beyond
that; there is a fix though).

2% came from DIO.

The rest of 2% is still unaccounted for. We don't know where.

> How much system time? User time? All that stuff.
20.5% in the kernel, 79.5% in user space.


> But the first thing to do is to work out where the cycles are going to.
You've seen the profile. That's where all the cycles went.


> Also, I'm rather peeved that we're hearing about this regression now rather
> than two years ago. And mystified as to why yours is the only group which
> has reported it.

The 2.6.x kernel has never been faster than the 2.4 kernel (RHEL3). At one point
in time, around 2.6.2, the gap was pretty close, at around 1%, but still slower.
Around 2.6.5, we found the global plug list was causing huge lock contention on
a 32-way numa box. That got fixed in 2.6.7. Then came 2.6.8, which took a big
dip to close to a 20% regression. Then we fixed a 17% regression in the scheduler
(fixed with cache_decay_tick). 2.6.9 is the last one we measured and it is 6%
slower. It's a constantly moving target, a wild goose chase.

I don't know why other people have not reported the problem. Perhaps they
haven't had a chance to run a transaction processing db workload on the 2.6
kernel. Perhaps they have not compared, perhaps they are working on the same
problem. I just don't know.

- Ken


2005-03-10 04:12:20

by Andrew Morton

[permalink] [raw]
Subject: Re: Direct io on block device has performance regression on 2.6.x kernel

David Lang <[email protected]> wrote:
>
> (I've seen a 50%
> performance hit on 2.4 with just a thousand or two threads compared to
> 2.6)

Was that 2.4 kernel a vendor kernel with the O(1) scheduler?

2005-03-10 04:14:49

by Andrew Morton

[permalink] [raw]
Subject: Re: Direct io on block device has performance regression on 2.6.x kernel

"Chen, Kenneth W" <[email protected]> wrote:
>
> Andrew Morton wrote Wednesday, March 09, 2005 6:26 PM
> > What does "1/3 of the total benchmark performance regression" mean? One
> > third of 0.1% isn't very impressive. You haven't told us anything at all
> > about the magnitude of this regression.
>
> 2.6.9 kernel is 6% slower compare to distributor's 2.4 kernel (RHEL3). Roughly
> 2% came from storage driver (I'm not allowed to say anything beyond that, there
> is a fix though).

The codepaths are indeed longer in 2.6.

> 2% came from DIO.

hm, that's not a lot.

Once you redo that patch to use aops and to work with O_DIRECT, the paths
will get a little deeper, but not much. We really should do this so that
O_DIRECT works, and in case someone has gone and mmapped the blockdev.

Fine-grained alignment is probably too hard, and it should fall back to
__blockdev_direct_IO().

Does it do the right thing with a request which is non-page-aligned, but
512-byte aligned?

readv and writev?

2% is pretty thin :(
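
For concreteness, the 512-byte-aligned and vectored request shapes being
asked about look like this from user space (illustrative only; /dev/sdb is a
placeholder, and the buffers are 512-byte aligned but not necessarily page
aligned):

#define _GNU_SOURCE			/* for O_DIRECT */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <malloc.h>
#include <sys/uio.h>

int main(void)
{
	char *buf = memalign(512, 4 * 512);	/* 512-byte aligned buffer */
	int fd = open("/dev/sdb", O_RDONLY | O_DIRECT);
	struct iovec iov[2];
	ssize_t n;

	if (fd == -1 || buf == NULL) {
		perror("setup");
		exit(1);
	}

	/* single sub-page request, 512-byte aligned buffer and offset */
	n = pread(fd, buf + 512, 512, 512);
	printf("pread: %zd\n", n);

	/* vectored request: two 512-byte aligned segments */
	iov[0].iov_base = buf;
	iov[0].iov_len  = 512;
	iov[1].iov_base = buf + 512;
	iov[1].iov_len  = 1024;
	n = readv(fd, iov, 2);
	printf("readv: %zd\n", n);
	return 0;
}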

> The rest of 2% is still unaccounted for. We don't know where.

General cache replacement, perhaps. 9MB is a big cache though.

> ...
> Around 2.6.5, we found global plug list is causing huge lock contention on
> 32-way numa box. That got fixed in 2.6.7. Then comes 2.6.8 which took a big
> dip at close to 20% regression. Then we fixed 17% regression in the scheduler
> (fixed with cache_decay_tick). 2.6.9 is the last one we measured and it is 6%
> slower. It's a constant moving target, a wild goose to chase.
>

OK. Seems that the 2.4 O(1) scheduler got it right for that machine.

> haven't got a chance to run transaction processing db workload on 2.6 kernel.
> Perhaps they have not compared, perhaps they are working on the same problem.
> I just don't know.

Maybe there are other factors which drown these little things out:
architecture improvements, choice of architecture, driver changes, etc.

2005-03-10 04:08:38

by David Lang

[permalink] [raw]
Subject: RE: Direct io on block device has performance regression on 2.6.x kernel

On Wed, 9 Mar 2005, Chen, Kenneth W wrote:

>> Also, I'm rather peeved that we're hearing about this regression now rather
>> than two years ago. And mystified as to why yours is the only group which
>> has reported it.
>
> 2.6.X kernel has never been faster than the 2.4 kernel (RHEL3). At one point
> of time, around 2.6.2, the gap is pretty close, at around 1%, but still slower.
> Around 2.6.5, we found global plug list is causing huge lock contention on
> 32-way numa box. That got fixed in 2.6.7. Then comes 2.6.8 which took a big
> dip at close to 20% regression. Then we fixed 17% regression in the scheduler
> (fixed with cache_decay_tick). 2.6.9 is the last one we measured and it is 6%
> slower. It's a constant moving target, a wild goose to chase.
>
> I don't know why other people have not reported the problem, perhaps they
> haven't got a chance to run transaction processing db workload on 2.6 kernel.
> Perhaps they have not compared, perhaps they are working on the same problem.
> I just don't know.

Also, the 2.6 kernel is SO much better in the case where you have many
threads (even if they are all completely idle) that that improvement may
be masking the regression that Ken is reporting (I've seen a 50%
performance hit on 2.4 with just a thousand or two threads compared to
2.6). Let's face it, a typical linux box today starts up a LOT of stuff
that will never get used, but will count as idle threads.

David Lang

--
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
-- C.A.R. Hoare

2005-03-10 04:20:43

by David Lang

[permalink] [raw]
Subject: Re: Direct io on block device has performance regression on 2.6.x kernel

On Wed, 9 Mar 2005, Andrew Morton wrote:

> David Lang <[email protected]> wrote:
>>
>> (I've seen a 50%
>> performance hit on 2.4 with just a thousand or two threads compared to
>> 2.6)
>
> Was that 2.4 kernel a vendor kernel with the O(1) scheduler?

No, a kernel.org kernel. The 2.6 kernel is so much faster for this
workload that I switched for this app and never looked back. I found that
if I had 5000 or so idle tasks, 2.4 performance would drop to about a
quarter of 2.6's (with CPU system time being the limiting factor).

David Lang

--
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
-- C.A.R. Hoare

2005-03-10 04:33:02

by Andrew Vasquez

[permalink] [raw]
Subject: Re: Direct io on block device has performance regression on 2.6.x kernel

On Wed, 09 Mar 2005, Chen, Kenneth W wrote:

> Andrew Morton wrote Wednesday, March 09, 2005 6:26 PM
> > What does "1/3 of the total benchmark performance regression" mean? One
> > third of 0.1% isn't very impressive. You haven't told us anything at all
> > about the magnitude of this regression.
>
> 2.6.9 kernel is 6% slower compare to distributor's 2.4 kernel (RHEL3). Roughly
> 2% came from storage driver (I'm not allowed to say anything beyond that, there
> is a fix though).
>

Ok now, that statement piqued my interest -- since looking through a
previous email it seems you are using the qla2xxx driver. Care to
elaborate?

Regards,
Andrew Vasquez

2005-03-10 04:34:05

by Jesse Barnes

[permalink] [raw]
Subject: Re: Direct io on block device has performance regression on 2.6.x kernel

On Wednesday, March 9, 2005 3:23 pm, Andi Kleen wrote:
> "Chen, Kenneth W" <[email protected]> writes:
> > Just to clarify here, these data need to be taken at grain of salt. A
> > high count in _spin_unlock_* functions do not automatically points to
> > lock contention. It's one of the blind spot syndrome with timer based
> > profile on ia64. There are some lock contentions in 2.6 kernel that
> > we are staring at. Please do not misinterpret the number here.
>
> Why don't you use oprofile?>? It uses NMIs and can profile "inside"
> interrupt disabled sections.

Oh, and there are other ways of doing interrupt off profiling by using the
PMU. q-tools can do this I think.

Jesse

2005-03-10 04:39:49

by Jesse Barnes

[permalink] [raw]
Subject: Re: Direct io on block device has performance regression on 2.6.x kernel

On Wednesday, March 9, 2005 3:23 pm, Andi Kleen wrote:
> "Chen, Kenneth W" <[email protected]> writes:
> > Just to clarify here, these data need to be taken at grain of salt. A
> > high count in _spin_unlock_* functions do not automatically points to
> > lock contention. It's one of the blind spot syndrome with timer based
> > profile on ia64. There are some lock contentions in 2.6 kernel that
> > we are staring at. Please do not misinterpret the number here.
>
> Why don't you use oprofile?>? It uses NMIs and can profile "inside"
> interrupt disabled sections.

That was oprofile output, but on ia64, 'NMI's are maskable due to the way irq
disabling works. Here's a very hackish patch that changes the kernel to use
cr.tpr instead of psr.i for interrupt control. Making oprofile use real ia64
NMIs is left as an exercise for the reader :)

Jesse


Attachments:
nmi-enable-5.patch (5.67 kB)

2005-03-10 04:29:53

by Andrew Morton

[permalink] [raw]
Subject: Re: Direct io on block device has performance regression on 2.6.x kernel

"Chen, Kenneth W" <[email protected]> wrote:
>
> > Did you generate a kernel profile?
>
> Top 40 kernel hot functions, percentage is normalized to kernel utilization.
>
> _spin_unlock_irqrestore 23.54%
> _spin_unlock_irq 19.27%

Cripes.

Is that with CONFIG_PREEMPT? If so, and if you disable CONFIG_PREEMPT,
this cost should be accounted to the spin_unlock() caller and we can see
who the culprit is. Perhaps dio->bio_lock.

2005-03-10 18:41:24

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: Direct io on block device has performance regression on 2.6.x kernel

Andrew Morton wrote on Wednesday, March 09, 2005 8:10 PM
> > 2.6.9 kernel is 6% slower compare to distributor's 2.4 kernel (RHEL3). Roughly
> > 2% came from storage driver (I'm not allowed to say anything beyond that, there
> > is a fix though).
>
> The codepaths are indeed longer in 2.6.

Thank you for acknowledging this.


> > 2% came from DIO.
>
> hm, that's not a lot.
> ....
> 2% is pretty thin :(

This is the exact reason that I did not want to put these numbers out
in the first place: most people usually underestimate the
magnitude of these percentage points.

Now I have to give a speech on "performance optimization 101". Take a
look at this page: http://www.suse.de/~aj/SPEC/CINT/d-permanent/index.html
This page tracks the development of gcc and measures the performance of
gcc with SPECint2000. Study the last chart, take out your calculator
and calculate how much of a performance gain gcc made over the course of 3.5
years of development. Also please factor in the kind of manpower that
went into the compiler development.

Until people understand the kind of scale to expect when evaluating a
complex piece of software, we can't really talk about a database transaction
processing benchmark. This benchmark goes one step further: it benchmarks
the entire software stack (kernel/library/application/compiler), it benchmarks
the entire hardware platform (cpu/memory/IO/chipset), and on the grand
scale, it benchmarks system integration: storage, network, interconnect, mid-
tier app servers, front end clients, etc. Any specific function/
component only represents a small portion of the entire system, essential
but small. For example, the hottest function in the kernel is 7.5%, out
of 20% kernel time. If we threw away that function entirely, there would
be only a 1.5% direct impact on total cpu cycles.

So what's the point? The point is that when judging whether a number is
thin or thick, it has to be judged against the complexity of the SUT (the
system under test). It has to be judged against a relevant scale for that
particular workload. And the scale has to be laid out correctly so that it
represents the weight of each component.

Losing 6% just from the Linux kernel is a huge deal for this type of benchmark.
People work for days to implement features which might give a sub-percentage
gain. Making software run faster is not easy, but making software run slower
apparently is a fairly easy task.



> Fine-grained alignment is probably too hard, and it should fall back to
> __blockdev_direct_IO().
>
> Does it do the right thing with a request which is non-page-aligned, but
> 512-byte aligned?
>
> readv and writev?
>

That's why direct_io_worker() is slower. It does everything and handles
every possible usage scenario out there. I hope making the function fatter
is not in the plan.

- Ken


2005-03-10 20:36:31

by Andrew Morton

[permalink] [raw]
Subject: Re: Direct io on block device has performance regression on 2.6.x kernel

"Chen, Kenneth W" <[email protected]> wrote:
>
> Losing 6% just from Linux kernel is a huge deal for this type of benchmark.
> People work for days to implement features which might give sub percentage
> gain. Making Software run faster is not easy, but making software run slower
> apparently is a fairly easy task.
>
>

heh

>
> > Fine-grained alignment is probably too hard, and it should fall back to
> > __blockdev_direct_IO().
> >
> > Does it do the right thing with a request which is non-page-aligned, but
> > 512-byte aligned?
> >
> > readv and writev?
> >
>
> That's why direct_io_worker() is slower. It does everything and handles
> every possible usage scenarios out there. I hope making the function fatter
> is not in the plan.

We just cannot make a change like this if it does not support readv and
writev well, and if it does not support down-to-512-byte size and
alignment. It will break applications.

2005-03-10 22:02:05

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: Direct io on block device has performance regression on 2.6.x kernel

Andrew Morton wrote on Thursday, March 10, 2005 12:31 PM
> > > Fine-grained alignment is probably too hard, and it should fall back to
> > > __blockdev_direct_IO().
> > >
> > > Does it do the right thing with a request which is non-page-aligned, but
> > > 512-byte aligned?
> > >
> > > readv and writev?
> > >
> >
> > That's why direct_io_worker() is slower. It does everything and handles
> > every possible usage scenarios out there. I hope making the function fatter
> > is not in the plan.
>
> We just cannot make a change like this if it does not support readv and
> writev well, and if it does not support down-to-512-byte size and
> alignment. It will break applications.

I must have misread your mail. Yes, it does support 512-byte size and alignment.
Let me work on the readv/writev support (unless someone beat me to it).

- Ken


2005-03-10 22:17:13

by Andrew Morton

[permalink] [raw]
Subject: Re: Direct io on block device has performance regression on 2.6.x kernel

"Chen, Kenneth W" <[email protected]> wrote:
>
> Let me work on the readv/writev support (unless someone beat me to it).

Please also move it to the address_space_operations level. Yes, there are
performance benefits from simply omitting the LFS checks, the mmap
consistency fixes, etc. But they're there for a reason.