2008-02-11 20:57:18

by Alan D. Brunelle

Subject: IO queueing and complete affinity w/ threads: Some results

The test case chosen may not be a very good start, but anyway, here are some initial test results with the "nasty arch bits". This was performed on a 32-way ia64 box with 1 terabyte of RAM and 144 FC disks (contained in 24 HP MSA1000 RAID controllers attached to 12 dual-port adapters). Each test case was run for 3 minutes. I had one application per device performing a large amount of direct/asynchronous large reads. Here's the table of results, with explanation below (results are for all 144 devices, either accumulated (MBPS) or averaged (other columns)):

A Q C | MBPS Avg Lat StdDev | Q-local Q-remote | C-local C-remote
----- | ------ -------- ------ | -------- -------- | ------- --------
X X X | 3859.9 1.190067 0.0502 | 0.0 19484.7 | 0.0 9758.8
X X A | 3856.3 1.191220 0.0490 | 0.0 19467.2 | 0.0 9750.1
X X I | 3850.3 1.192992 0.0508 | 0.0 19437.3 | 9735.1 0.0
----- | ------ -------- ------ | -------- -------- | ------- --------
X A X | 3853.9 1.191891 0.0503 | 19455.4 0.0 | 0.0 9744.2
X A A | 3853.5 1.191935 0.0507 | 19453.2 0.0 | 0.0 9743.1
X A I | 3856.6 1.191043 0.0512 | 19468.7 0.0 | 9750.8 0.0
----- | ------ -------- ------ | -------- -------- | ------- --------
X I X | 3854.7 1.191674 0.0491 | 0.0 19459.8 | 0.0 9746.4
X I A | 3855.3 1.191434 0.0501 | 0.0 19461.9 | 0.0 9747.4
X I I | 3856.2 1.191128 0.0506 | 0.0 19466.6 | 9749.8 0.0
===== | ====== ======== ====== | ======== ======== | ======= ========
I X X | 3857.0 1.190987 0.0500 | 0.0 19471.9 | 0.0 9752.5
I X A | 3856.5 1.191082 0.0496 | 0.0 19469.4 | 9751.2 0.0
I X I | 3853.7 1.191938 0.0500 | 0.0 19456.2 | 9744.6 0.0
----- | ------ -------- ------ | -------- -------- | ------- --------
I A X | 3854.8 1.191675 0.0502 | 19461.5 0.0 | 0.0 9747.2
I A A | 3855.1 1.191464 0.0503 | 19464.0 0.0 | 9748.5 0.0
I A I | 3854.9 1.191627 0.0483 | 19461.7 0.0 | 9747.4 0.0
----- | ------ -------- ------ | -------- -------- | ------- --------
I I X | 3853.4 1.192070 0.0484 | 19454.8 0.0 | 0.0 9743.9
I I A | 3852.2 1.192403 0.0502 | 19448.5 0.0 | 9740.8 0.0
I I I | 3854.0 1.191822 0.0499 | 19457.9 0.0 | 9745.5 0.0
===== | ====== ======== ====== | ======== ======== | ======= ========
rq=0 | 3854.8 1.191680 0.0480 | 19459.7 0.0 | 202.9 9543.5
rq=1 | 3854.0 1.191965 0.0483 | 19457.0 0.0 | 403.1 9341.9
----- | ------ -------- ------ | -------- -------- | ------- --------

The variables being played with:

'A' - When set to 'X', the application was placed on a CPU other than the one handling IRQs for the device (in another cell).

'Q' - When set to 'X', queue affinity was placed in a cell other than the one containing the application, the completion, or the IRQ; when set to 'A', it was pegged to the same CPU as the application; when set to 'I', it was set to the CPU managing the IRQ for its device.

'C' - Likewise for the completion affinity: 'X' means another cell besides the one containing the application, the queueing, or the IRQ-handling CPU; 'A' means put on the same CPU as the application; 'I' means put on the same CPU as the IRQ handler.

o For the last two rows, we set Q == C == -1 and let the application go to any CPU (as dictated by the scheduler). Then we had 'rq_affinity' set to 0 or 1.
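
To make these knobs concrete, here is a rough sketch of how a per-device run might apply them (a sketch only: queue_affinity and completion_affinity are the sysfs tunable names used later in this thread for Jens' patches, rq_affinity is the existing block-layer tunable, and the device name and CPU numbers are just examples):

/* Sketch: pin this process to a CPU and set the per-queue affinity
 * tunables for one device.  A value of -1 means "default"; otherwise
 * the value names the CPU to pin queueing/completion work to, and
 * rq_affinity is simply 0 or 1. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

static void set_tunable(const char *dev, const char *name, int val)
{
        char path[128];
        FILE *f;

        snprintf(path, sizeof(path), "/sys/block/%s/queue/%s", dev, name);
        f = fopen(path, "w");
        if (!f) {
                perror(path);
                exit(1);
        }
        fprintf(f, "%d\n", val);
        fclose(f);
}

int main(void)
{
        cpu_set_t mask;

        /* 'A': place the application on a chosen CPU (CPU 1 here) */
        CPU_ZERO(&mask);
        CPU_SET(1, &mask);
        if (sched_setaffinity(0, sizeof(mask), &mask))
                perror("sched_setaffinity");

        /* 'Q' and 'C': -1 = default, otherwise a CPU number */
        set_tunable("sds", "queue_affinity", -1);
        set_tunable("sds", "completion_affinity", 0);
        set_tunable("sds", "rq_affinity", 0);

        return 0;
}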

The resulting columns include:

MBPS - Total megabytes per second (so we're seeing about 3.8 gigabytes per second for the system)
Avg lat - Average per IO measured latency in seconds (note: I had upwards of 128 X 256K IOs going on per device across the system)
StdDev - Average standard deviation across the devices

Q-local & Q-remote refer to the average number of queue operations handled locally and remotely, respectively. (Average per device)
C-local & C-remote refer to the average number of completion operations handled locally and remotely, respectively. (Average per device)

As noted above, I'm not so sure this is the best test case - it's rather artificial. I was hoping to see some differences based upon affinitization, but whilst there appear to be some trends, the results are so close (0.2% difference from best to worst case MBPS, and the latency standard deviations overlap within the groups) that I doubt there is anything definitive. Unfortunately, most of the disks are being used for real data right now, so I can't perform significant write tests (with file systems in place, say), which would be more realistic. I do have access to about 24 of the disks, so I will try to place file systems on those and do some tests. [I won't be able to use XFS without going through some hoops - it's a Red Hat installation right now, and they don't support XFS out of the box...]

BTW: The Q/C local/remote columns were put in place to make sure that I had things set up right, and for the first 18 cases I think they look right. For the RQ cases at the end, I /think/ what is happening is that on occasion we end up with the application on the CPU that has the IRQ handler, which would cause us to sometimes be local - but most of the time (due to the pseudo-random nature of the initial process placement) we'd end up away from the IRQ-handling CPU, and thus the queue/completion handling ends up remote... The disparity between the Q & C results is due to merging - we issue (and hence complete) fewer IOs than are submitted to the block IO layer (here it looks to be about 2-to-1).

Alan


2008-02-12 20:56:22

by Alan D. Brunelle

Subject: Re: IO queueing and complete affinity w/ threads: Some results

Whilst running a series of file system related loads on our 32-way*, I dropped down to a 16-way w/ only 24 disks, and ran two kernels: the original set of Jens' patches and then his subsequent kthreads-based set. Here are the results:

Original:
A Q C | MBPS Avg Lat StdDev | Q-local Q-remote | C-local C-remote
----- | ------ -------- ------ | -------- -------- | ------- --------
X X X | 1850.4 0.413880 0.0109 | 0.0 55860.8 | 0.0 27946.9
X X A | 1850.6 0.413848 0.0106 | 0.0 55859.2 | 0.0 27946.1
X X I | 1850.6 0.413830 0.0107 | 0.0 55858.5 | 27945.8 0.0
----- | ------ -------- ------ | -------- -------- | ------- --------
X A X | 1850.0 0.413949 0.0106 | 55843.7 0.0 | 0.0 27938.3
X A A | 1850.2 0.413931 0.0107 | 55844.2 0.0 | 0.0 27938.6
X A I | 1850.4 0.413862 0.0107 | 55854.3 0.0 | 27943.7 0.0
----- | ------ -------- ------ | -------- -------- | ------- --------
X I X | 1850.9 0.413764 0.0107 | 0.0 55866.2 | 0.0 27949.6
X I A | 1850.5 0.413854 0.0108 | 0.0 55855.0 | 0.0 27944.0
X I I | 1850.4 0.413848 0.0105 | 0.0 55854.6 | 27943.8 0.0
===== | ====== ======== ====== | ======== ======== | ======= ========
I X X | 1570.7 0.487686 0.0142 | 0.0 47406.1 | 0.0 23719.5
I X A | 1570.8 0.487666 0.0143 | 0.0 47409.3 | 23721.2 0.0
I X I | 1570.8 0.487664 0.0142 | 0.0 47410.7 | 23721.8 0.0
----- | ------ -------- ------ | -------- -------- | ------- --------
I A X | 1570.9 0.487642 0.0144 | 47412.2 0.0 | 0.0 23722.6
I A A | 1570.8 0.487647 0.0141 | 47411.2 0.0 | 23722.1 0.0
I A I | 1570.8 0.487651 0.0143 | 47410.8 0.0 | 23721.9 0.0
----- | ------ -------- ------ | -------- -------- | ------- --------
I I X | 1570.8 0.487683 0.0142 | 47410.2 0.0 | 0.0 23721.6
I I A | 1571.1 0.487591 0.0146 | 47415.0 0.0 | 23724.0 0.0
I I I | 1571.0 0.487623 0.0143 | 47412.5 0.0 | 23722.8 0.0
===== | ====== ======== ====== | ======== ======== | ======= ========
rq=0 | 1726.7 0.443562 0.0120 | 52118.6 0.0 | 2138.6 23937.2
rq=1 | 1820.5 0.420729 0.0110 | 54938.2 0.0 | 0.0 27485.6
----- | ------ -------- ------ | -------- -------- | ------- --------


kthreads-based:
A Q C | MBPS Avg Lat StdDev | Q-local Q-remote | C-local C-remote
----- | ------ -------- ------ | -------- -------- | ------- --------
X X X | 1850.5 0.413867 0.0107 | 0.0 55854.7 | 0.0 27943.8
X X A | 1850.9 0.413763 0.0107 | 0.0 55867.0 | 0.0 27950.0
X X I | 1850.3 0.413911 0.0109 | 0.0 55849.0 | 27941.0 0.0
----- | ------ -------- ------ | -------- -------- | ------- --------
X A X | 1851.0 0.413730 0.0107 | 55871.4 0.0 | 0.0 27952.2
X A A | 1850.1 0.413919 0.0107 | 55845.5 0.0 | 0.0 27939.2
X A I | 1850.8 0.413789 0.0108 | 55864.8 0.0 | 27948.9 0.0
----- | ------ -------- ------ | -------- -------- | ------- --------
X I X | 1850.5 0.413849 0.0107 | 0.0 55856.5 | 0.0 27944.8
X I A | 1850.6 0.413818 0.0108 | 0.0 55860.2 | 0.0 27946.6
X I I | 1850.8 0.413764 0.0108 | 0.0 55866.7 | 27949.8 0.0
===== | ====== ======== ====== | ======== ======== | ======= ========
I X X | 1570.9 0.487662 0.0145 | 0.0 47410.1 | 0.0 23721.6
I X A | 1570.7 0.487691 0.0142 | 0.0 47406.9 | 23720.0 0.0
I X I | 1570.7 0.487688 0.0141 | 0.0 47406.5 | 23719.8 0.0
----- | ------ -------- ------ | -------- -------- | ------- --------
I A X | 1570.9 0.487661 0.0144 | 47415.4 0.0 | 0.0 23724.2
I A A | 1570.8 0.487648 0.0141 | 47409.1 0.0 | 23721.0 0.0
I A I | 1570.7 0.487667 0.0141 | 47406.1 0.0 | 23719.5 0.0
----- | ------ -------- ------ | -------- -------- | ------- --------
I I X | 1570.8 0.487691 0.0142 | 47409.3 0.0 | 0.0 23721.2
I I A | 1570.9 0.487644 0.0142 | 47408.8 0.0 | 23720.9 0.0
I I I | 1570.6 0.487671 0.0141 | 47412.5 0.0 | 23722.8 0.0
===== | ====== ======== ====== | ======== ======== | ======= ========
rq=0 | 1742.1 0.439676 0.0118 | 52578.1 0.0 | 3602.6 22703.0
rq=1 | 1745.0 0.438918 0.0115 | 52666.3 0.0 | 3473.0 22876.6
----- | ------ -------- ------ | -------- -------- | ------- --------

For the first 18 sets on both kernels the results are very similar; the last two rq=0/1 sets are perturbed too much by application placement (I would guess). Have to think about that some more.

Alan
* What I'm doing on the 32-way is to compare and contrast mkfs, untar, kernel make & kernel clean times with different combinations of Q, C and RQ. [[This is currently with the "Jens original" patch; if things go well, I can do an overnight run with the kthreads-based patch.]]

2008-02-12 22:08:53

by Alan D. Brunelle

Subject: Re: IO queueing and complete affinity w/ threads: Some results

Back on the 32-way: in this set of tests we're running 12 disks spread across the 8 cells of the 32-way. Each disk will have an Ext2 FS placed on it, a clean Linux kernel source untarred onto it, then a full make (-j4) and then a make clean performed. The 12 series are done in parallel - so each disk will have:

mkfs
tar x
make
make clean

performed. This sequence was run ten times, and the overall averages are presented below - note this is Jens' original patch sequence, NOT the kthread one (those results available tomorrow, hopefully).

mkfs Min Avg Max Std Dev
--------- ------- ------- ------- -------
q0.c0.rq0 17.814 30.322 33.263 4.551
q0.c0.rq1 17.540 30.058 32.885 4.321
q0.c1.rq0 17.770 31.328 32.958 3.121
q1.c0.rq0 17.907 31.032 32.767 3.515
q1.c1.rq0 16.891 30.319 33.097 4.624

untar Min Avg Max Std Dev
--------- ------- ------- ------- -------
q0.c0.rq0 19.747 21.971 26.292 1.215
q0.c0.rq1 19.680 22.365 36.395 2.010
q0.c1.rq0 18.823 21.390 24.455 0.976
q1.c0.rq0 18.433 21.500 23.371 1.009
q1.c1.rq0 19.414 21.761 34.115 1.378

make Min Avg Max Std Dev
--------- ------- ------- ------- -------
q0.c0.rq0 527.418 543.296 552.030 5.384
q0.c0.rq1 526.265 542.312 549.477 5.467
q0.c1.rq0 528.935 544.940 553.823 4.746
q1.c0.rq0 529.432 544.399 553.212 5.166
q1.c1.rq0 527.638 543.577 551.323 5.478

clean Min Avg Max Std Dev
--------- ------- ------- ------- -------
q0.c0.rq0 16.962 20.308 33.775 3.179
q0.c0.rq1 17.436 20.156 29.370 3.097
q0.c1.rq0 17.061 20.111 31.504 2.791
q1.c0.rq0 16.745 20.247 29.327 2.953
q1.c1.rq0 17.346 20.316 31.178 3.283

Hopefully, the first column is self-explanatory - these are the settings applied to the queue_affinity, completion_affinity and rq_affinity tunables. Because the standard deviations are so large and the average results so close, I'm not seeing anything in this set of tests to favor any of the combinations...

As noted, I will be having the machine run the kthreads-variant of the patch stream tonight, and then I have to go back and run a non-patched kernel to see if there are any /regressions/.

Alan

2008-02-12 22:26:40

by Alan D. Brunelle

Subject: Re: IO queueing and complete affinity w/ threads: Some results

Alan D. Brunelle wrote:

>
> Hopefully, the first column is self-explanatory - these are the settings applied to the queue_affinity, completion_affinity and rq_affinity tunables. Because the standard deviations are so large and the average results so close, I'm not seeing anything in this set of tests to favor any of the combinations...
>

Not quite:

Q or C = 0 really means Q or C set to -1 (the default); Q or C = 1 means placing that thread on the CPU managing the IRQ. Sorry...

<sigh>
Alan

2008-02-13 15:36:23

by Alan D. Brunelle

Subject: Re: IO queueing and complete affinity w/ threads: Some results

Comparative results between the original affinity patch and the kthreads-based patch on the 32-way running the kernel make sequence.

It may be easier to compare/contrast with the graphs provided at http://free.linux.hp.com/~adb/jens/kernmk.png (kernmk.agr also provided, if you want to run xmgrace by hand).

Tests are:

1. Make Ext2 FS on each of 12 64GB devices in parallel, times include: mkfs, mount & unmount
2. Untar a full Linux source code tree onto the devices in parallel, times include: mount, untar, unmount
3. Make (-j4) of the full source code tree, times include: mount, make -j4, unmount
4. Clean full source code tree, times include: mount, make clean, unmount

The results are so close amongst all the runs (given the large-ish standard deviations) that we probably can't deduce much from this. A bit of a concern on the top two graphs - mkfs & untar: it certainly appears that the kthreads version is a little slower (about 2.9% difference across the values for the mkfs runs, and 3.5% for the untar operations). On the make runs, however, we saw hardly any difference between the runs at all...

We are trying to set up some AIM7 tests on a different system over the weekend (15 February - 18 February 2008); I'll post those results on the 18th or 19th if we can pull it off. [I'll also try to steal time on the 32-way to run a straight 2.6.24 kernel, do these runs again, and post those results.]

For the tables below:

q0 == queue_affinity set to -1
q1 == queue_affinity set to the CPU managing the IRQ for each device
c0 == completion_affinity set to -1
c1 == completion_affinity set to CPU managing the IRQ for each device
rq0 == rq_affinity set to 0
rq1 == rq_affinity set to 1

This 4-test sequence was run 10 times (for each kernel), and the results averaged. As posted yesterday, here are the original patch sequence results:

mkfs Min Avg Max Std Dev
--------- ------- ------- ------- -------
q0.c0.rq0 17.814 30.322 33.263 4.551
q0.c0.rq1 17.540 30.058 32.885 4.321
q0.c1.rq0 17.770 31.328 32.958 3.121
q1.c0.rq0 17.907 31.032 32.767 3.515
q1.c1.rq0 16.891 30.319 33.097 4.624

untar Min Avg Max Std Dev
--------- ------- ------- ------- -------
q0.c0.rq0 19.747 21.971 26.292 1.215
q0.c0.rq1 19.680 22.365 36.395 2.010
q0.c1.rq0 18.823 21.390 24.455 0.976
q1.c0.rq0 18.433 21.500 23.371 1.009
q1.c1.rq0 19.414 21.761 34.115 1.378

make Min Avg Max Std Dev
--------- ------- ------- ------- -------
q0.c0.rq0 527.418 543.296 552.030 5.384
q0.c0.rq1 526.265 542.312 549.477 5.467
q0.c1.rq0 528.935 544.940 553.823 4.746
q1.c0.rq0 529.432 544.399 553.212 5.166
q1.c1.rq0 527.638 543.577 551.323 5.478

clean Min Avg Max Std Dev
--------- ------- ------- ------- -------
q0.c0.rq0 16.962 20.308 33.775 3.179
q0.c0.rq1 17.436 20.156 29.370 3.097
q0.c1.rq0 17.061 20.111 31.504 2.791
q1.c0.rq0 16.745 20.247 29.327 2.953
q1.c1.rq0 17.346 20.316 31.178 3.283

And for the kthreads-based kernel:

mkfs Min Avg Max Std Dev
--------- ------- ------- ------- -------
q0.c0.rq0 16.686 31.069 33.361 3.452
q0.c0.rq1 16.976 31.719 32.869 2.395
q0.c1.rq0 16.857 31.345 33.410 3.209
q1.c0.rq0 17.317 31.997 34.444 3.099
q1.c1.rq0 16.791 32.266 33.378 2.035

untar Min Avg Max Std Dev
--------- ------- ------- ------- -------
q0.c0.rq0 19.769 22.398 25.196 1.076
q0.c0.rq1 19.742 22.517 38.498 1.733
q0.c1.rq0 20.071 22.698 36.160 2.259
q1.c0.rq0 19.910 22.377 35.640 1.528
q1.c1.rq0 19.448 22.339 24.887 0.926

make Min Avg Max Std Dev
--------- ------- ------- ------- -------
q0.c0.rq0 526.971 542.820 550.591 4.607
q0.c0.rq1 527.320 544.422 550.504 3.798
q0.c1.rq0 527.367 543.856 550.331 4.152
q1.c0.rq0 527.406 543.636 552.947 4.315
q1.c1.rq0 528.921 544.594 550.832 3.786

clean Min Avg Max Std Dev
--------- ------- ------- ------- -------
q0.c0.rq0 16.644 20.242 29.524 2.991
q0.c0.rq1 16.942 20.008 29.729 2.845
q0.c1.rq0 17.205 20.117 29.851 2.661
q1.c0.rq0 17.400 20.147 32.581 2.862
q1.c1.rq0 16.799 20.072 31.883 2.872

2008-02-14 15:36:57

by Alan D. Brunelle

Subject: Re: IO queueing and complete affinity w/ threads: Some results

Taking a step back, I went to a very simple test environment:

o 4-way IA64
o 2 disks (on separate RAID controllers, handled by separate ports on the same FC HBA - generating different IRQs).
o Using write-cached tests - keeping all IOs inside the RAID controller's cache, so no perturbations due to platter accesses.

Basically:

o CPU 0 handled IRQs for /dev/sds
o CPU 2 handled IRQs for /dev/sdaa

We placed an IO generator on CPU1 (for /dev/sds) and CPU3 (for /dev/sdaa). The IO generator performed 4KiB sequential direct AIOs in a very small range (2MB - well within the controller cache on the external storage device). We have found that this is a simple way to maximize throughput, and thus be able to watch the system for effects without worrying about odd seek & other platter-induced issues. Each test took about 6 minutes to run (ran a specific amount of IO, so we could compare & contrast system measurements).
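
For reference, a minimal sketch of the kind of generator described above (hedged: the actual tool wasn't named; this assumes libaio, and the device path, queue depth and IO count are illustrative):

/* Sketch: 4 KiB O_DIRECT asynchronous writes issued sequentially over a
 * 2 MB region, wrapping around so all IO stays within the controller's
 * write cache.  Build with -laio; run against an otherwise unused device. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define IO_SIZE  4096
#define RANGE    (2 * 1024 * 1024)
#define DEPTH    32
#define BATCHES  10000                  /* total IO = BATCHES * DEPTH * 4 KiB */

int main(void)
{
        io_context_t ctx = 0;
        struct iocb iocbs[DEPTH], *iops[DEPTH];
        struct io_event events[DEPTH];
        void *buf;
        long long off = 0;
        int fd, i, b;

        fd = open("/dev/sds", O_WRONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }
        if (posix_memalign(&buf, 4096, IO_SIZE)) return 1;
        memset(buf, 0x5a, IO_SIZE);
        if (io_setup(DEPTH, &ctx) < 0) { fprintf(stderr, "io_setup failed\n"); return 1; }

        for (b = 0; b < BATCHES; b++) {
                /* keep DEPTH writes in flight, advancing sequentially and
                 * wrapping within the 2 MB range */
                for (i = 0; i < DEPTH; i++) {
                        io_prep_pwrite(&iocbs[i], fd, buf, IO_SIZE, off);
                        iops[i] = &iocbs[i];
                        off = (off + IO_SIZE) % RANGE;
                }
                if (io_submit(ctx, DEPTH, iops) != DEPTH)
                        break;
                if (io_getevents(ctx, DEPTH, DEPTH, events, NULL) != DEPTH)
                        break;
        }

        io_destroy(ctx);
        close(fd);
        free(buf);
        return 0;
}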

First: overall performance

2.6.24 (no patches) : 106.90 MB/sec

2.6.24 + original patches + rq=0 : 103.09 MB/sec
rq=1 : 98.81 MB/sec

2.6.24 + kthreads patches + rq=0 : 106.85 MB/sec
rq=1 : 107.16 MB/sec

So, the kthreads patches work much better here - on par with or better than straight 2.6.24. I also ran Caliper (akin to Oprofile, but proprietary and ia64-specific, sorry) and looked at the cycles used. On ia64, back-end bubbles are deadly, and can be caused by cache misses &c. Looking at the gross data:

Kernel CPU_CYCLES BACK END BUBBLES 100.0 * (BEB/CC)
-------------------------------- ----------------- ----------------- ----------------
2.6.24 (no patches) : 2,357,215,454,852 231,547,237,267 9.8%

2.6.24 + original patches + rq=0 : 2,444,895,579,790 242,719,920,828 9.9%
rq=1 : 2,551,175,203,455 148,586,145,513 5.8%

2.6.24 + kthreads patches + rq=0 : 2,359,376,156,043 255,563,975,526 10.8%
rq=1 : 2,350,539,631,362 208,888,961,094 8.9%

For both the original & kthreads patches we see a /significant/ drop in bubbles when setting rq=1 over rq=0. This shows up in extra CPU cycles available (not spent in %system) - a graph is provided up on http://free.linux.hp.com/~adb/jens/cached_mps.png - it shows the results from stats extracted from running mpstat in conjunction with the IO runs.

Combining %sys & %soft IRQ, we see:

Kernel % user % sys % iowait % idle
-------------------------------- -------- -------- -------- --------
2.6.24 (no patches) : 0.141% 10.088% 43.949% 45.819%

2.6.24 + original patches + rq=0 : 0.123% 11.361% 43.507% 45.008%
rq=1 : 0.156% 6.030% 44.021% 49.794%

2.6.24 + kthreads patches + rq=0 : 0.163% 10.402% 43.744% 45.686%
rq=1 : 0.156% 8.160% 41.880% 49.804%

The good news (I think) is that even with rq=0 with the kthreads patches we're getting on-par performance w/ 2.6.24, so the default case should be ok...

I've only done a few runs by hand with this - these results are from one representative run out of the bunch - but at least this (I believe) shows what this patch stream is intending to do: optimize placement of IO completion handling to minimize cache & TLB disruptions. Freeing up cycles in the kernel is always helpful! :-)

I'm going to try similar runs on an AMD64 w/ Oprofile and see what results I get there... (BTW: I'll be dropping testing of the original patch sequence - the kthreads patches look better in general, both in terms of code & results; coincidence?)

Alan

2008-02-18 12:37:32

by Jens Axboe

Subject: Re: IO queueing and complete affinity w/ threads: Some results

On Thu, Feb 14 2008, Alan D. Brunelle wrote:
> Taking a step back, I went to a very simple test environment:
>
> o 4-way IA64
> o 2 disks (on separate RAID controllers, handled by separate ports on the same FC HBA - generating different IRQs).
> o Using write-cached tests - keeping all IOs inside the RAID controller's cache, so no perturbations due to platter accesses.
>
> Basically:
>
> o CPU 0 handled IRQs for /dev/sds
> o CPU 2 handled IRQs for /dev/sdaa
>
> We placed an IO generator on CPU1 (for /dev/sds) and CPU3 (for /dev/sdaa). The IO generator performed 4KiB sequential direct AIOs in a very small range (2MB - well within the controller cache on the external storage device). We have found that this is a simple way to maximize throughput, and thus be able to watch the system for effects without worrying about odd seek & other platter-induced issues. Each test took about 6 minutes to run (ran a specific amount of IO, so we could compare & contrast system measurements).
>
> First: overall performance
>
> 2.6.24 (no patches) : 106.90 MB/sec
>
> 2.6.24 + original patches + rq=0 : 103.09 MB/sec
> rq=1 : 98.81 MB/sec
>
> 2.6.24 + kthreads patches + rq=0 : 106.85 MB/sec
> rq=1 : 107.16 MB/sec
>
> So, the kthreads patches work much better here - on par with or better than straight 2.6.24. I also ran Caliper (akin to Oprofile, but proprietary and ia64-specific, sorry) and looked at the cycles used. On ia64, back-end bubbles are deadly, and can be caused by cache misses &c. Looking at the gross data:
>
> Kernel CPU_CYCLES BACK END BUBBLES 100.0 * (BEB/CC)
> -------------------------------- ----------------- ----------------- ----------------
> 2.6.24 (no patches) : 2,357,215,454,852 231,547,237,267 9.8%
>
> 2.6.24 + original patches + rq=0 : 2,444,895,579,790 242,719,920,828 9.9%
> rq=1 : 2,551,175,203,455 148,586,145,513 5.8%
>
> 2.6.24 + kthreads patches + rq=0 : 2,359,376,156,043 255,563,975,526 10.8%
> rq=1 : 2,350,539,631,362 208,888,961,094 8.9%
>
> For both the original & kthreads patches we see a /significant/ drop in bubbles when setting rq=1 over rq=0. This shows up in extra CPU cycles available (not spent in %system) - a graph is provided up on http://free.linux.hp.com/~adb/jens/cached_mps.png - it shows the results from stats extracted from running mpstat in conjunction with the IO runs.
>
> Combining %sys & %soft IRQ, we see:
>
> Kernel % user % sys % iowait % idle
> -------------------------------- -------- -------- -------- --------
> 2.6.24 (no patches) : 0.141% 10.088% 43.949% 45.819%
>
> 2.6.24 + original patches + rq=0 : 0.123% 11.361% 43.507% 45.008%
> rq=1 : 0.156% 6.030% 44.021% 49.794%
>
> 2.6.24 + kthreads patches + rq=0 : 0.163% 10.402% 43.744% 45.686%
> rq=1 : 0.156% 8.160% 41.880% 49.804%
>
> The good news (I think) is that even with rq=0 with the kthreads patches we're getting on-par performance w/ 2.6.24, so the default case should be ok...
>
> I've only done a few runs by hand with this - these results are from one representative run out of the bunch - but at least this (I believe) shows what this patch stream is intending to do: optimize placement of IO completion handling to minimize cache & TLB disruptions. Freeing up cycles in the kernel is always helpful! :-)
>
> I'm going to try similar runs on an AMD64 w/ Oprofile and see what results I get there... (BTW: I'll be dropping testing of the original patch sequence - the kthreads patches look better in general, both in terms of code & results; coincidence?)

Alan, thanks for your very nice testing efforts on this! It's very
encouraging to see that the kthread based approach is even faster than
the softirq one, since the code is indeed much simpler and doesn't
require any arch modifications. So I'd agree that just testing the
kthread approach is the best way forward, and that scrapping the remote
softirq trigger stuff is sanest.

My main worry with the current code is the ->lock in the per-cpu
completion structure. If we do a lot of migrations to other CPUs, then
that cacheline will be bounced around. But we'll be dirtying the list of
that CPU structure anyway, so playing games to make that part lockless
is probably pretty pointless. So if you get around to testing on bigger
SMP boxes, it'd be interesting to look for. So far it looks like it's a
net win with more idle time; the benefit of keeping the rq completion
queue local must be outweighing the cost of diddling with the per-cpu
data.

--
Jens Axboe

2008-02-18 13:33:30

by Andi Kleen

Subject: Re: IO queueing and complete affinity w/ threads: Some results

Jens Axboe <[email protected]> writes:

> and that scrapping the remote
> softirq trigger stuff is sanest.

I actually liked Nick's queued smp_call_function_single() patch. So even
if it was not used for block I would still like to see it being merged
in some form to speed up all the other IPI users.

-Andi

2008-02-18 14:16:26

by Jens Axboe

Subject: Re: IO queueing and complete affinity w/ threads: Some results

On Mon, Feb 18 2008, Andi Kleen wrote:
> Jens Axboe <[email protected]> writes:
>
> > and that scrapping the remote
> > softirq trigger stuff is sanest.
>
> I actually liked Nick's queued smp_call_function_single() patch. So even
> if it was not used for block I would still like to see it being merged
> in some form to speed up all the other IPI users.

Sure, Nick's patch was generically usable; my IPI stuff was just a hack
made to go as fast as possible for a single use. The current
call-on-other-cpu path is not exactly scalable...

--
Jens Axboe

2008-02-19 01:50:07

by Nick Piggin

Subject: Re: IO queueing and complete affinity w/ threads: Some results

On Mon, Feb 18, 2008 at 02:33:17PM +0100, Andi Kleen wrote:
> Jens Axboe <[email protected]> writes:
>
> > and that scrapping the remote
> > softirq trigger stuff is sanest.
>
> I actually liked Nick's queued smp_call_function_single() patch. So even
> if it was not used for block I would still like to see it being merged
> in some form to speed up all the other IPI users.

Yeah, that hasn't been forgotten (nor have your comments about folding
my special function into smp_call_function_single).

The call function path is terribly unscalable at the moment on a lot
of architectures, and also it isn't allowed to be used with interrupts
off due to deadlock - two CPUs cross-calling each other with interrupts
disabled would each spin waiting for an IPI the other can never service
(the queued version can allow it, provided that wait=0).

I will get around to sending that upstream soon.
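
For readers less familiar with the API, a rough sketch of what pushing a completion to another CPU via a cross-call looks like (purely illustrative: handle_completion() is a hypothetical stand-in, the signature shown is the current smp_call_function_single(), and - as noted above - the then-current implementation could not safely be used with interrupts off):

#include <linux/smp.h>

extern void handle_completion(void *data);      /* hypothetical stand-in */

/* runs on the target CPU, in interrupt context */
static void remote_complete(void *data)
{
        handle_completion(data);
}

static void complete_on_cpu(void *req, int cpu)
{
        if (cpu == smp_processor_id())
                handle_completion(req);
        else
                /* wait=0: fire the IPI and return without spinning */
                smp_call_function_single(cpu, remote_complete, req, 0);
}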

2008-02-19 21:14:29

by Paul Jackson

Subject: Re: IO queueing and complete affinity w/ threads: Some results

Jens wrote:
> My main worry with the current code is the ->lock in the per-cpu
> completion structure.

Drive-by-comment here: Does the patch posted later this same day by Mike Travis:

[PATCH 0/2] percpu: Optimize percpu accesses v3

help with this lock issue any? (I have no real clue here -- just connecting
up the pretty colored dots ;).

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214

2008-02-19 21:31:48

by Mike Travis

Subject: Re: IO queueing and complete affinity w/ threads: Some results

Paul Jackson wrote:
> Jens wrote:
>> My main worry with the current code is the ->lock in the per-cpu
>> completion structure.
>
> Drive-by-comment here: Does the patch posted later this same day by Mike Travis:
>
> [PATCH 0/2] percpu: Optimize percpu accesses v3
>
> help with this lock issue any? (I have no real clue here -- just connecting
> up the pretty colored dots ;).
>

I'm not sure of the context here but a big motivation for doing the
zero-based per_cpu variables was to optimize access to the local
per cpu variables to one instruction, reducing the need for locks.

-Mike

2008-02-20 08:08:36

by Jens Axboe

Subject: Re: IO queueing and complete affinity w/ threads: Some results

On Tue, Feb 19 2008, Mike Travis wrote:
> Paul Jackson wrote:
> > Jens wrote:
> >> My main worry with the current code is the ->lock in the per-cpu
> >> completion structure.
> >
> > Drive-by-comment here: Does the patch posted later this same day by Mike Travis:
> >
> > [PATCH 0/2] percpu: Optimize percpu accesses v3
> >
> > help with this lock issue any? (I have no real clue here -- just connecting
> > up the pretty colored dots ;).
> >
>
> I'm not sure of the context here but a big motivation for doing the
> zero-based per_cpu variables was to optimize access to the local
> per cpu variables to one instruction, reducing the need for locks.

I'm afraid the two things aren't related, although faster access to
per-cpu is of course a benefit for this as well. My expressed concern
was the:

spin_lock(&bc->lock);
was_empty = list_empty(&bc->list);
list_add_tail(&req->donelist, &bc->list);
spin_unlock(&bc->lock);

where 'bc' may be per-cpu data of another CPU.
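
To make the concern concrete, here is a rough sketch of the pattern the fragment above comes from (structure and names are assumed for illustration, not taken from the actual patch): the completing CPU appends the request to the target CPU's per-cpu list under that CPU's lock, and the completion kthread on the target CPU takes the same lock to drain the list - so frequent remote completions keep pulling that lock's cacheline back and forth.

#include <linux/blkdev.h>
#include <linux/list.h>
#include <linux/percpu.h>
#include <linux/sched.h>
#include <linux/spinlock.h>

struct blk_completion {
        spinlock_t         lock;   /* the ->lock being discussed */
        struct list_head   list;   /* requests to be completed on this CPU */
        struct task_struct *task;  /* per-CPU completion kthread */
};

static DEFINE_PER_CPU(struct blk_completion, blk_comp);

/* Called from the CPU that took the completion interrupt; 'cpu' is the
 * CPU chosen to run the completion work.  When cpu != smp_processor_id(),
 * bc points into another CPU's per-cpu data, so the lock and list head
 * cachelines bounce between the two CPUs. */
static void raise_blk_completion(struct request *req, int cpu)
{
        struct blk_completion *bc = &per_cpu(blk_comp, cpu);
        int was_empty;

        spin_lock(&bc->lock);
        was_empty = list_empty(&bc->list);
        list_add_tail(&req->donelist, &bc->list);
        spin_unlock(&bc->lock);

        if (was_empty)
                wake_up_process(bc->task);
}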

--
Jens Axboe