SUMMARY:
Basically, I'm trying to do async I/O by having a high-priority SCHED_FIFO
thread read the disk while another, lower-priority thread does the "real"
work. But the working thread gets _nowhere near_ enough CPU while the other
thread is reading the disk. It is just intolerably inefficient, and I _hope_
that I am making a mistake.
Any ideas on how this should work are appreciated.
MORE INFO (only if you must have it):
I read much of the async I/O / kio discussion on the LK mailing list. In the
end Linus concluded that threading _is_ the way to go for now (2001, I
believe).
First, I have kernel 2.2.16 (Red Hat 6.2). If this has been corrected in
2.4, please let me know, but I think not.
On my system, "raw" read()ing a large chunk of the /dev/hda5 partition
shows that reading a page (4k) takes about 230,000 clock "ticks", which is
the CPU effort required for about 23 context switches. So I figure that if
the disk generates the "I/O available" interrupt once every 4k chunk (this
might be the bad assumption), then Linux has plenty of time to switch
several times between the interrupt handler, the high-priority SCHED_FIFO
process, and the low-priority SCHED_FIFO process, still do plenty of useful
work at the user level, and still get back in time to handle the next I/O
request. During this read(), I should be able to use at _least_ 50% of my
CPU. But I get much less than 10 percent!! Why??
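(To spell out the arithmetic behind that 50% guess: the per-switch cost is
inferred from my own numbers above, and the number of switches per chunk is
just my assumption.)

  one 4k page read:     ~230,000 ticks
  one context switch:   ~10,000 ticks   (230,000 / 23)
  switches per chunk:   ~3 (irq handler -> reader -> worker, assumed)
  switching overhead:   ~30,000 ticks, i.e. roughly 13% of the budget

So even with generous slop for copies and the interrupt handler itself,
well over half of the ticks should be left over for useful work.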
If there is anything that should be done to the kernel, please let me
know, as I'd certainly be very willing to help. How exactly _does_ this
scheduling and I/O interaction work? Is there some "jiffy" that _must_
expire before Linux switches and lets my other thread do useful work? If
so, how do I shorten it? Or is my IDE disk just very lousy? If so, what
parameters should I look at in an IDE disk, and how do I tell what I have?
Or is this simply an open Linux bug?
(hard to believe)
Or maybe my test code is faulty (also unlikely).
(Test code at http://www.linisoft.com/test/async.c .)
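In case that link ever dies, the structure of the test is roughly the
following. This is a stripped-down sketch from memory, not the actual
async.c; the device path, chunk count, and priorities are only
illustrative.

/* Sketch of the two-thread test: a high-priority SCHED_FIFO reader
 * and a low-priority SCHED_FIFO worker that counts "useful work".
 * Build: gcc -O2 -o async-sketch async-sketch.c -lpthread
 * Needs root for SCHED_FIFO.
 */
#include <fcntl.h>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

static volatile long useful_work = 0;  /* bumped by the worker */
static volatile int  io_done = 0;      /* set when the reader finishes */

static void set_fifo_priority(int prio)
{
    struct sched_param sp = { .sched_priority = prio };
    if (sched_setscheduler(0, SCHED_FIFO, &sp) < 0)
        perror("sched_setscheduler");
}

static void *reader(void *arg)
{
    static char buf[4096];
    int fd = open((const char *)arg, O_RDONLY);

    set_fifo_priority(20);                 /* high priority */
    if (fd >= 0) {
        for (int i = 0; i < 16384; i++)    /* 64 MB in 4k chunks */
            if (read(fd, buf, sizeof buf) <= 0)
                break;
        close(fd);
    }
    io_done = 1;
    return NULL;
}

static void *worker(void *arg)
{
    set_fifo_priority(1);                  /* low priority */
    while (!io_done)
        useful_work++;                     /* stand-in for "real" work */
    return NULL;
}

int main(int argc, char **argv)
{
    pthread_t r, w;

    pthread_create(&r, NULL, reader, argc > 1 ? argv[1] : "/dev/hda5");
    pthread_create(&w, NULL, worker, NULL);
    pthread_join(r, NULL);
    pthread_join(w, NULL);
    printf("useful CPU work while reading: %ld\n", useful_work);
    return 0;
}

The whole point is that the worker should get the CPU whenever the reader
is blocked inside read(); the question is why it doesn't.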
Please reply to me directly.
Thanks in advance for any insight.
--
Reza
Hello,
I tried your program on my system (P3 800MHz / 256MB RAM, IDE hard drive
with UDMA enabled, 2.4.17-rmap12f) with minor changes: I used a file
instead of a raw device. After creating the file (64 megabytes) and
flushing the read cache (by writing another huge file with dd on the same
filesystem), this is what happened:
A normal read test (for speed measurements).
[root@masouds1 bsd]# time cat mytest > /dev/null
real 0m1.771s
user 0m0.020s
sys 0m0.280s
---
So, 1.7 seconds total to read the data from the file.
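(The flush itself is nothing special; something like the following, with a
made-up file name and a size picked to be larger than RAM:)

  dd if=/dev/zero of=flushfile bs=1024k count=300
  sync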
Now, I flushed the cache again and ran your test program:
[root@masouds1 bsd]# ./async
useful CPU work 1 at time(secs, micro-secs) 1014831058 173783
useful CPU work 80848 at time(secs, micro-secs) 1014831059 776664
useful CPU work 1216070 at time(secs, micro-secs) 1014831069 786353
Between lines 2 and 3, your program sleeps for 10 seconds. That works out
to about 121607 counts per second. Now, when the reader thread and the
worker thread are both running, you get about 80000 counts in 1.6 seconds,
where at full speed you would get 1.6 * 121607 = 194570. That is at least
a third of the CPU power.
And remember that a lot of time is consumed copying 64 megabytes of data
into the user buffer (let alone the kernel moving it around, and the
context switches).
So I believe there isn't a bug in recent versions of the Linux kernel.
Unless I'm way off track!
Can you run the same test I did and report the results here?
Masoud
PS: make sure you are not running your IDE drive in PIO mode.
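You can check and change this from userspace with hdparm; the device name
and values below are only examples:

  hdparm -d /dev/hda             # shows whether using_dma is on
  hdparm -d1 -m16 -c1 /dev/hda   # enable DMA, multiple sector I/O, 32-bit I/O
  hdparm -t /dev/hda             # rough timing of buffered reads, as a sanity check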
Masoud,
First let me thank you for reading and running my test code on your
system. I greatly appreciate that. It's so good to see a man with hard
numbers as opposed to just "speeches." Your response has been extremely
helpful.
> PS: make sure you are not running your IDE drive in PIO mode.
This one-line tip of yours was probably more helpful to me than many
hours of heart-bleeding M$ support can be to some people.
You reminded me that long ago, due to system instability, I had turned
down some of my BIOS features. That may have caused the kernel to set my
hda parameters conservatively. It turns out that not only was DMA off, but
"multiple read" was also set to one. The DMA setting made the biggest
difference, though:
> a raw device. After creating the file (64 megabytes) and flushing the read cache
> (by writing another huge file with dd on the same filesystem), this is what
> happened:
> A normal read test (for speed measurements).
> [root@masouds1 bsd]# time cat mytest > /dev/null
>
> real 0m1.771s
> user 0m0.020s
> sys 0m0.280s
> ---
I got:
[root in ~]$ time cat /scratch0/big > /dev/null
0.53user 3.12system 0:09.35elapsed 39%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (25214major+14minor)pagefaults 0swaps
This is probably much better than I had before (big = 102 MB).
> So, 1.7 seconds total to read the data from the file.
> Now, I flushed the cache again and ran your test program:
> [root@masouds1 bsd]# ./async
> useful CPU work 1 at time(secs, micro-secs) 1014831058 173783
> useful CPU work 80848 at time(secs, micro-secs) 1014831059 776664
> useful CPU work 1216070 at time(secs, micro-secs) 1014831069 786353
>
I get:
[root in /home/reza/backup/tmpwork/tests/linux_timings]$ ./async.out
useful CPU work 1 at time(secs, micro-secs) 1014905754 12224
useful CPU work 240204 at time(secs, micro-secs) 1014905758 8111
useful CPU work 1082083 at time(secs, micro-secs) 1014905768 15236
(using the raw device, NOT the cache)
This is about 63% efficiency. It is beautiful. Note that this answers my
basic question, which I had known all along anyway: making my code more
complex to take advantage of multi-threading is most certainly worth it.
Now, I did the test again, this time using fifos for doing the "real
work". This is less efficient, and gives back about 45% of the CPU during
another thread's read(2).
Intuition suggests that this can still be better, because I also did
tests of memcpy and thread context switching under Linux, and Linux is
very efficient in these areas (my machine can do roughly 400k context
switches in the 4 seconds it took to read that ~50MB chunk; see the test
above). That appears to be excellent performance (on the microsecond
scale, anyway). And one might figure that the CPU does not need 55% of
its power sustaining a few inter-thread context switches and copies
during read(large_chunk). But my tests are small chunks of code; when
things get large, as they are in the kernel, I can see constant factors
like TLB updates and such adding up.
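(For completeness, the context-switch test I mentioned is just a pipe
ping-pong between two threads, along the lines of the sketch below. This
is a reconstruction, not the exact code I ran; the round count is
arbitrary.)

/* Rough context-switch benchmark: two threads ping-pong one byte over a
 * pair of pipes, so every round trip forces at least two switches.
 * Build: gcc -O2 -o ctxswitch ctxswitch.c -lpthread
 */
#include <pthread.h>
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

#define ROUNDS 200000

static int ping[2], pong[2];   /* ping: main -> echo, pong: echo -> main */

static void *echo(void *arg)
{
    char c;
    for (int i = 0; i < ROUNDS; i++) {
        read(ping[0], &c, 1);
        write(pong[1], &c, 1);
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    char c = 'x';
    struct timeval t0, t1;

    pipe(ping);
    pipe(pong);
    pthread_create(&t, NULL, echo, NULL);

    gettimeofday(&t0, NULL);
    for (int i = 0; i < ROUNDS; i++) {
        write(ping[1], &c, 1);
        read(pong[0], &c, 1);
    }
    gettimeofday(&t1, NULL);
    pthread_join(t, NULL);

    double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    /* each round trip is at least two switches, plus pipe syscall cost,
     * so the per-switch number below is an upper bound */
    printf("%d round trips in %.0f us (%.2f us per switch, upper bound)\n",
           ROUNDS, us, us / (2.0 * ROUNDS));
    return 0;
}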
I can see how valuable it would be to put aside some time and study the
IDE driver source, and the kernel in general. At least when one wants
something specific, like the Google servers do, one can probably find ways
to tailor the kernel and get more out of it. Maybe much more, for
something really specific. No WONDER Google would choose Linux. It is
impossible to customize any closed-source OS that way.
In any case, for any more questions in this regard, the source code and
the LK archives should be my reference. But your great help got me
beautifully into the order of magnitude I wanted.
Thank you so much, again.
--
Reza
Btw, I mentioned that I rewrote the test to do the "useful work" using
fifos, and that gave back about 45% of the CPU during the read() operation.
Just in case anyone wants that test, it is on the web site with the
other test:
http://www.linisoft.com/test/asyncf.c //async using fifo
http://www.linisoft.com/test/async.c //async using __asm__(lock)
--
Reza