Here is a set of contest benchmarks for 2.4.20-ck3
comparing it to vanilla 2.4.20.
ck3a is with the default aa vm;
ck3r is with the rmap vm option.
no_load:
Kernel     [runs]  Time   CPU%  Loads  LCPU%  Ratio
2.4.20          1    78   93.6    0.0    0.0   1.00
2420ck3a        1    78   94.9    0.0    0.0   1.00
2420ck3r        1    80   93.8    0.0    0.0   1.00
cacherun:
Kernel     [runs]  Time   CPU%  Loads  LCPU%  Ratio
2.4.20          1    74   98.6    0.0    0.0   0.95
2420ck3a        1    75   98.7    0.0    0.0   0.96
2420ck3r        1    76   97.4    0.0    0.0   0.95
process_load:
Kernel     [runs]  Time   CPU%  Loads  LCPU%  Ratio
2.4.20          1   123   56.1  105.0   41.5   1.58
2420ck3a        1   115   63.5   88.0   35.7   1.47
2420ck3r        1   120   61.7   94.0   35.8   1.50
ctar_load:
Kernel     [runs]  Time   CPU%  Loads  LCPU%  Ratio
2.4.20          1    98   79.6    2.0    5.1   1.26
2420ck3a        1    92   82.6    1.0    2.2   1.18
2420ck3r        1    97   80.4    1.0    3.1   1.21
xtar_load:
Kernel     [runs]  Time   CPU%  Loads  LCPU%  Ratio
2.4.20          1   130   59.2    2.0    4.6   1.67
2420ck3a        1   116   65.5    2.0    5.2   1.49
2420ck3r        1   127   60.6    2.0    3.9   1.59
io_load:
Kernel     [runs]  Time   CPU%  Loads  LCPU%  Ratio
2.4.20          1   297   25.3   95.5   15.5   3.81
2420ck3a        1   124   61.3   32.9   12.9   1.59
2420ck3r        1   152   51.3   35.7   12.5   1.90
io_other:
Kernel     [runs]  Time   CPU%  Loads  LCPU%  Ratio
2.4.20          1   100   74.0   32.8   15.8   1.28
2420ck3a        1    93   81.7   26.6   14.0   1.19
2420ck3r        1   110   70.0   29.7   13.6   1.38
read_load:
Kernel     [runs]  Time   CPU%  Loads  LCPU%  Ratio
2.4.20          1   200   39.5   16.8    5.5   2.56
2420ck3a        1    93   82.8    6.1    5.4   1.19
2420ck3r        1   121   65.3    7.2    5.0   1.51
list_load:
Kernel     [runs]  Time   CPU%  Loads  LCPU%  Ratio
2.4.20          1    95   78.9    0.0    5.3   1.22
2420ck3a        1    82   91.5    0.0    3.7   1.05
2420ck3r        1    94   80.9    0.0    3.2   1.18
mem_load:
Kernel     [runs]  Time   CPU%  Loads  LCPU%  Ratio
2.4.20          1    91   84.6   50.0    1.1   1.17
2420ck3a        1   104   73.1   51.0    1.9   1.33
2420ck3r        1   105   76.2   54.0    3.8   1.31
dbench_load:
Kernel     [runs]  Time   CPU%  Loads  LCPU%  Ratio
2.4.20          1   354   21.2    4.0   38.4   4.54
2420ck3a        1   135   61.5    0.0    0.0   1.73
2420ck3r        1   180   45.0    0.0    0.0   2.25
Con
The only way to get the minimal possible latency and maximal fairness is
my new stochastic fair queueing idea.
The throughput/latency curve has only a few points where throughput
increases while latency decreases. Minimum latency doesn't necessarily
mean minimal throughput, just as maximum latency doesn't necessarily imply
maximal throughput.
But clearly maximum throughput likely implies non-minimum latency, and the
other way around.
I described my idea to Jens a few days ago and he's working on it right
now AFAIK; incidentally he had an equivalent idea in the past and he had
some old code that he's starting with. For the details on how it works
you can read the stochastic fair queueing code in the network packet
scheduler (net/sched/sch_sfq.c).
The design I proposed is to have multiple I/O queues, to apply the
elevator within each of them, and to choose the queue as a function of the
task->pid that is submitting the bh/bio. You'll have to apply a hash to the
pid, and you probably want a perturbation timer that will change the hash
function every 30 sec or so. Plus I want a special queue for everything
asynchronous, so that the asynchronous queue will be elevated and
reordered like today, since it's not latency critical. I suggested to Jens
to differentiate it by checking current->mm; all the async stuff is
normally submitted by the kernel daemons. A more fine-grained way is to
add a bitflag that you have to set between the bh lock and the submit_bh,
and that will be cleared by unlock_buffer (or the equivalent for the bio
in 2.5). But the current->mm check will take care of async-io with keventd
too, so it should be just fine (modulo the first aio submit, but that's a
minor issue).
Then the low-level drivers will pick requests from these queues in round
robin (other variations may be possible too, like two requests per queue
or stuff like that, but 1 request per queue should give an unbelievable
responsiveness to the system, something that we never experienced before).
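
To make the scheme concrete, here is a rough userspace sketch of the idea as described above; it is illustrative only (not Andrea's or Jens' code), and the queue count, hash, perturbation source and helper names are all assumptions:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NR_SFQ_QUEUES   64              /* per-task queues (power of two) */
#define ASYNC_QUEUE     NR_SFQ_QUEUES   /* extra queue for async (no ->mm) I/O */
#define PERTURB_PERIOD  30              /* seconds between hash perturbations */

static uint32_t perturbation;           /* mixed into the hash, refreshed periodically */
static time_t last_perturb;

/* Pick a queue for a request: async I/O (kernel daemons, no mm) gets its own
   queue; everything else is hashed by pid so each task tends to get a private
   queue that the elevator can sort independently. */
static unsigned int sfq_pick_queue(int pid, bool has_mm)
{
	if (!has_mm)
		return ASYNC_QUEUE;

	/* change the perturbation every PERTURB_PERIOD seconds so two tasks
	   that happen to collide in the hash don't collide forever */
	if (time(NULL) - last_perturb >= PERTURB_PERIOD) {
		perturbation = (uint32_t)rand();
		last_perturb = time(NULL);
	}
	return ((uint32_t)pid * 2654435761u ^ perturbation) & (NR_SFQ_QUEUES - 1);
}

/* Dispatch: drain the queues round robin, one request per non-empty queue per
   pass (a plain LIFO pop here, only to keep the sketch short). */
struct request { long sector; };
static struct request *queues[NR_SFQ_QUEUES + 1][128];
static int queue_len[NR_SFQ_QUEUES + 1];

static struct request *sfq_next_request(void)
{
	static unsigned int rr;         /* round-robin cursor */
	for (unsigned int i = 0; i <= NR_SFQ_QUEUES; i++) {
		unsigned int q = (rr + i) % (NR_SFQ_QUEUES + 1);
		if (queue_len[q] > 0) {
			rr = q + 1;
			return queues[q][--queue_len[q]];
		}
	}
	return NULL;                    /* all queues empty */
}

int main(void)
{
	srand((unsigned)time(NULL));
	printf("pid 1234 -> queue %u\n", sfq_pick_queue(1234, true));
	printf("kernel daemon -> queue %u\n", sfq_pick_queue(0, false));
	printf("nothing queued -> %p\n", (void *)sfq_next_request());
	return 0;
}
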
This stochastic fair queueing I/O scheduling algorithm will be the best
for desktop/multimedia, i.e. when your priority is that xmms or mplayer
never skips no matter how many cp /dev/zero . you're running in the
background, or also for a multiuser system. There is no other possible
definitive fix IMHO. The only other possible fix would be to reduce the
I/O queue to say 512kbytes and to take advantage of the FIFO behaviour
of the queue wakeups. I tried that, it works so well that you can trivially
test it with my elevator-lowlatency by just changing a line, but the
problem is that 512k is too small an I/O pipeline, i.e. it is not enough to
guarantee maximal throughput during contiguous I/O. 2M of queue is
the minimum for making the best use of 100Mbyte arrays like my raid box.
Also, the elevator-lowlatency patch right now has an unplug race that can
hang the box under some workloads; it's not a fatal bug (i.e. no risk of
corruption, and it's hard to trigger) but I need to fix it, and I also need
to limit the number of requests, or seeking workloads get huge
latencies with no throughput advantage (very well shown by tiobench in
the great bigbox.html effort from Randy).
The stochastic fair queueing will allow us to keep a reasonably sized
queue to avoid hurting workloads like bonnie, while still providing
the maximum possible fairness across different tasks.
However it has to be optional. I suggested using the b_max ioctl
(currently unused) to select the scheduler mode, because dbench and
other workloads really want the maximum task-against-task reordering to get
maximal throughput, no matter how long a certain dbench task is
delayed. It is really up to the user to choose the I/O scheduling
algorithm, just like it happens today in the network stack. Desktop
users will want the stochastic fair queueing; people running workloads
a la dbench must avoid it.
The stochastic fair queueing will also make the anticipatory scheduling
a very low priority to have. Stochastic fair queueing will be an order
of magnitude more important than anticipatory scheduling IMHO.
Andrea
Andrea Arcangeli wrote:
>The only way to get the minimal possible latency and maximal fairness is
>my new stochastic fair queueing idea.
>
Sounds nice. I would like to see per process disk distribution.
>
>
>The throughput/latency curve has only a few points where throughput
>increases while latency decreases. Minimum latency doesn't necessarily
>mean minimal throughput, just as maximum latency doesn't necessarily imply
>maximal throughput.
>
>But clearly maximum throughput likely implies non-minimum latency, and the
>other way around.
>
>I described my idea to Jens a few days ago and he's working on it right
>now AFAIK, incidentally he had an equivalent idea in the past and he had
>some old code that he's starting with. For the details on how it works
>you can read the stochastic fair queueing in the network packet
>scheduler (net/sched/sch_sfq.c).
>
>The design I proposed is to have multiple I/O queues, where to apply the
>elevator, and to choose the queue in function of the task->pid that is
>submitting the bh/bio. You'll have to apply a hash to the pid and you
>probably want a perturbation timer that will change the hash function
>every 30 sec or so. Plus I want a special queue for everything
>asynchronous. So that the asynchronous queue will be elevated and
>reordered like today since it's not latency critical. I suggested to Jens to
>differentiate it by checking current->mm, all the async stuff is
>normally submitted by the kernel daemons. A more fine-grained way is to
>add a bitflag that you have to set between the bh lock and the submit_bh
>and that will be cleared by unlock_buffer (or equivalent one for the bio
>in 2.5). But the current->mm will take care of async-io with keventd too
>so it should be just fine (modulo the first aio submit but it's a minor
>issue).
>
>Then the lowlevel drivers will pick requests from these queues in round
>robin (other variations may be possible too, like two requests per queue
>or stuff like that, but 1 request per queue should give an unbelievable
>responsiveness to the system, something that we never experienced before).
>
However dependent reads obviously cannot merge with each other, so
you could end up submitting, say, 4K reads per process.
>
>
>This stochastic fair queueing I/O scheduling algorithm will be the best
>for desktop/multimedia, i.e. when your priority is that xmms or mplayer
>never skips no matter how many cp /dev/zero . you're running in
>background. Or also for a multiuser system. There is no other possible
>definitive fix IMHO. The only other possible fix would be to reduce the
>I/O queue to say 512kbytes and to take advantage of the FIFO behaviour
>of the queue wakeups, I tried that, it works so well, you can trivially
>test it with my elevator-lowlatency by just changing a line, but the
>problem is that 512k is too small an I/O pipeline, i.e. it is not enough to
>guarantee maximal throughput during contiguous I/O. 2M of queue is
>the minimum for making the best use of 100Mbyte arrays like my raid box.
>
But your solution also does nothing for sequential IO throughput in
the presence of more than one submitter. Take two sequential readers on
different parts of the disk, for example: submit a 128K request (4ms),
seek (5ms), submit a 128K request (4ms), and so on, so your 30MB/s disk
gets less than 15MB/s here. Same with writes.
I think you should be giving each process a timeslice; that way you
don't need to try to account for request size or seek vs IO time, and
it is easily tunable. This should solve the above problem with, say, 50ms
timeslices.
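
As a rough illustration of the timeslice idea (a sketch under assumptions, not Nick's code; the 50ms figure comes from the paragraph above, everything else is made up): the dispatcher keeps serving one submitter until its slice of disk time is used up or it runs out of requests, so seek, transfer and settle costs are all charged implicitly by the clock.

#include <stdint.h>
#include <stdio.h>

#define TIMESLICE_MS 50   /* disk time each process may use before switching */

struct io_context {
	int      pid;
	uint64_t slice_start_ms;   /* when this process got the disk */
	int      pending;          /* queued requests for this process */
};

/* Decide whether to keep servicing 'cur' or switch to 'next'. All the
   seek/transfer/settle cost a process causes is charged to it simply by
   measuring wall-clock disk time, which makes the scheme device independent. */
static struct io_context *pick_submitter(struct io_context *cur,
                                         struct io_context *next,
                                         uint64_t now_ms)
{
	if (cur && cur->pending &&
	    now_ms - cur->slice_start_ms < TIMESLICE_MS)
		return cur;                     /* slice not used up yet */

	if (next && next->pending) {
		next->slice_start_ms = now_ms;  /* start a fresh slice */
		return next;
	}
	return cur;                             /* nothing else to run */
}

int main(void)
{
	struct io_context a = { .pid = 100, .slice_start_ms = 0, .pending = 8 };
	struct io_context b = { .pid = 200, .slice_start_ms = 0, .pending = 8 };

	/* at 10 ms process a keeps the disk; at 60 ms b takes over */
	printf("t=10ms -> pid %d\n", pick_submitter(&a, &b, 10)->pid);
	printf("t=60ms -> pid %d\n", pick_submitter(&a, &b, 60)->pid);
	return 0;
}
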
It also does little to help dependent reads vs writes or streaming
reads.
For reads, anticipatory scheduling can be very helpful in theory
however it remains to be seen if I can make it work without adding
too much complexity. Your fair queueing should go nicely on top
if I can make it work. I do like your idea though!
Nick
On Sun, 9 Feb 2003, Andrea Arcangeli wrote:
> The only way to get the minimal possible latency and maximal fairness is
> my new stochastic fair queueing idea.
"The only way" ? That sounds like a lack of fantasy ;))
Having said that, I like the idea of using SFQ for fairness,
since it seems to work really well for networking...
I'll definitely try such a patch.
> The only other possible fix would be to reduce the I/O queue to say
> 512kbytes and to take advantage of the FIFO behaviour of the queue
> wakeups, I tried that, it works so well, you can trivially test it with
> my elevator-lowlatency by just changing a line, but the problem is 512k
> is a too small I/O pipeline, i.e. it is not enough to guarantee the
> maximal throughput during contiguous I/O.
Maybe you want to count the I/O pipeline size in disk seeks
and not in disk blocks ?
In the time of one disk seek plus half rotational latency
(12 ms) you can do a pretty large amount of reading (>400kB).
This means that for near and medium disk seeks you don't care
all that much about how large the submitted IO is. Track buffers
further reduce this importance.
OTOH, if you're seeking to the track next-door, or have mixed
read and write operations on the same track, the seek time
goes to near-zero and only the rotational latency counts. In
that case the size of the IO does have an influence on the
speed the request can be handled.
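
For reference, the >400kB figure above follows from simple arithmetic; a tiny sketch, assuming a sustained transfer rate of about 35 MB/s for a disk of that era:

#include <stdio.h>

int main(void)
{
	double seek_plus_rotation_ms = 12.0;   /* one medium seek + half a rotation */
	double transfer_mb_per_s     = 35.0;   /* assumed sustained media rate */

	/* data you could have streamed in the time one seek costs */
	double kb = seek_plus_rotation_ms / 1000.0 * transfer_mb_per_s * 1024.0;
	printf("break-even transfer size: ~%.0f kB\n", kb);   /* ~430 kB */
	return 0;
}
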
> the stochastic fair queueing will also make the anticipatory scheduling
> a very low priority to have. Stochastic fair queueing will be an order
> of magnitude more important than anticipatory scheduling IMHO.
On the contrary, once we have SFQ to fix the biggest elevator
problems the difference made by the anticipatory scheduler should
be much more visible.
Think of a disk with 6 track buffers for reading and a system with
10 active reader processes. Without the anticipatory scheduler you'd
need to go to the platter for almost every OS read (because each
process flushes out the track buffer for the others), while with the
anticipatory scheduler you've got a bigger chance of having the data
you want in one of the drive's track buffers, meaning that you don't
need to go to the platter but can just do a silicon to silicon copy.
If you look at the academic papers on anticipatory scheduling, you'll
find that it gives as much as a 73% increase in throughput,
on real-world tasks, not on specially contrived benchmarks.
The only aspect of the anticipatory scheduler that is no longer needed
with your SFQ idea is the distinction between reads and writes, since
your idea already makes the (better, I guess) distinction between
synchronous and asynchronous requests.
regards,
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://guru.conectiva.com/
Current spamtrap: [email protected]
On Mon, 10 Feb 2003, Nick Piggin wrote:
> Andrea Arcangeli wrote:
>
> >The only way to get the minimal possible latency and maximal fairness is
> >my new stochastic fair queueing idea.
>
> Sounds nice. I would like to see per process disk distribution.
Sounds like the easiest way to get that fair, indeed. Manage
every disk as a separately scheduled resource...
> However dependent reads can not merge with each other obviously so
> you could end up say submitting 4K reads per process.
Considering that one medium/far disk seek counts for about 400 kB
of data read/write, I suspect we'll just want to merge requests or
put adjacent requests next to each other into the elevator up to
a fairly large size. Probably about 1 MB for a hard disk or a cdrom,
but much less for floppies, opticals, etc...
> But your solution also does nothing for sequential IO throughput in
> the presence of more than one submitter.
> I think you should be giving each process a timeslice,
That is the anticipatory scheduler. A good complement to the SFQ
part of the IO scheduler. I'd really like to see both ideas together
in one scheduler, they sound like a winning pair.
> For reads, anticipatory scheduling can be very helpful in theory
> however it remains to be seen if I can make it work without adding
> too much complexity. Your fair queueing should go nicely on top
> if I can make it work. I do like your idea though!
The anticipatory scheduler can just remain "on" while the priority
difference of the requests of the current process aren't too big
when compared to the priority of the other requests. Once the SFQ
part of the scheduler sends down requests with a really big priority
difference the anticipatory scheduler can "skip a beat" and switch
to another process.
Note that Andrea's distinction between synchronous and asynchronous
requests may be more appropriate than the distinction between read
and write requests ...
The only remaining question is how do we know which requests are
synchronous and which are asynchronous ? Ie. which requests have
a process waiting on their completion and which requests don't have
a process waiting on their completion ?
regards,
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://guru.conectiva.com/
Current spamtrap: [email protected]
On Mon, 10 Feb 2003, Rik van Riel wrote:
> The only aspect of the anticipatory scheduler that is no longer needed
> with your SFQ idea is the distinction between reads and writes, since
> your idea already makes the (better, I guess) distinction between
> synchronous and asynchronous requests.
Forget that I said that, we don't have the infrastructure to
get this right. The definition of "synchronous" is "some
process is waiting on this request to complete", but processes
wait on other objects instead.
A normal sequence (probably the most common) is:
1) submit request
(request is now asynchronous)
2) wait_on_page
(request should now magically become synchronous)
The infrastructure to get this working is probably too big a
change for 2.5/2.6, at this point, so chances are that we're
better off using the (90% accurate?) distinction between reads
and writes.
regards,
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://guru.conectiva.com/
Current spamtrap: [email protected]
note that issuing an fsync should change all pending writes to 'synchronous',
as should writes to any partition mounted with the sync option, or writes
to a directory with the S flag set.
David Lang
On Mon, 10 Feb 2003, Rik van Riel
wrote:
> Date: Mon, 10 Feb 2003 02:15:10 -0200 (BRST)
> From: Rik van Riel <[email protected]>
> To: Andrea Arcangeli <[email protected]>
> Cc: Con Kolivas <[email protected]>, lkml <[email protected]>,
> Jens Axboe <[email protected]>
> Subject: Re: stochastic fair queueing in the elevator [Re: [BENCHMARK]
> 2.4.20-ck3 / aa / rmap with contest]
>
> On Mon, 10 Feb 2003, Rik van Riel wrote:
>
> > The only aspect of the anticipatory scheduler that is no longer needed
> > with your SFQ idea is the distinction between reads and writes, since
> > your idea already makes the (better, I guess) distinction between
> > synchronous and asynchronous requests.
>
> Forget that I said that, we don't have the infrastructure to
> get this right. The definition of "synchronous" is "some
> process is waiting on this request to complete", but processes
> wait on other objects instead.
>
> A normal sequence (probably the most common) is:
>
> 1) submit request
> (request is now asynchronous)
> 2) wait_on_page
> (request should now magically become synchronous)
>
> The infrastructure to get this working is probably too big a
> change for 2.5/2.6, at this point, so chances are that we're
> better off using the (90% accurate?) distinction between reads
> and writes.
>
> regards,
>
> Rik
> --
> Bravely reimplemented by the knights who say "NIH".
> http://www.surriel.com/ http://guru.conectiva.com/
> Current spamtrap: [email protected]
On Sun, 9 Feb 2003, David Lang wrote:
> note that issuing an fsync should change all pending writes to 'synchronous'
> as should writes to any partition mounted with the sync option, or writes
> to a directory with the S flag set.
Exactly. This is nasty with our current data structures;
probably not something to do during the current code slush.
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://guru.conectiva.com/
Current spamtrap: [email protected]
David Lang <[email protected]> wrote:
>
> > note that issuing an fsync should change all pending writes to 'synchronous'
> as should writes to any partition mounted with the sync option, or writes
> to a directory with the S flag set.
We know, at I/O submission time, whether a write is to be waited upon.
That's in writeback_control.sync_mode.
That, combined with an assumption that "all reads are synchronous" would
allow the outgoing BIOs to be appropriately tagged.
It's still approximate. An exact solution would involve only marking I/O as
synchronous when some process actually waits on its completion. I do not
believe that all the extra lookup and locking infrastructure and storage
which this would require is justified. Certainly not in a first iteration.
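
A sketch of the approximation described here, with stand-in types rather than the real kernel structures (the enum mirrors what writeback_control.sync_mode conveys; all names are illustrative assumptions): reads are tagged synchronous unconditionally, writes only when the writeback path says the caller will wait.

#include <stdbool.h>
#include <stdio.h>

enum io_dir  { IO_READ, IO_WRITE };
enum wb_sync { SYNC_NONE, SYNC_ALL };   /* stand-in for writeback_control.sync_mode */

/* Tag an outgoing request as synchronous (a process is presumed to be waiting
   on it) or asynchronous, using only information available at submission time. */
static bool request_is_sync(enum io_dir dir, enum wb_sync sync_mode)
{
	if (dir == IO_READ)
		return true;              /* assume all reads are waited upon */
	return sync_mode == SYNC_ALL;     /* writes: only if writeback says so */
}

int main(void)
{
	printf("background write sync?  %d\n", request_is_sync(IO_WRITE, SYNC_NONE));
	printf("fsync-driven write sync? %d\n", request_is_sync(IO_WRITE, SYNC_ALL));
	printf("read sync?               %d\n", request_is_sync(IO_READ, SYNC_NONE));
	return 0;
}
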
The Rice Uni researchers did implement controls which attempted to learn IO
submission patterns on a per-process basis, and I believe these were also
used to avoid undesirable starvations. We have only briefly played with
process-aware logic.
The first thing to do is to get the anticipatory scheduler working properly.
Nick has been tied down for a week chasing generic bugs in the request layer.
He seems to have nailed the important ones now.
On Mon, 10 Feb 2003, Rik van Riel wrote:
> On Sun, 9 Feb 2003, Andrea Arcangeli wrote:
>
> > The only way to get the minimal possible latency and maximal fairness is
> > my new stochastic fair queueing idea.
>
> "The only way" ? That sounds like a lack of fantasy ;))
Took about 30 minutes, but here is an alternative algorithm.
One that doesn't suffer from SFQ's "temporary unfairness due
to unlucky hashing" problem, which can be made worse with SFQ's
"rehashing once every N seconds" policy, where N is too big
for the user, say 30 seconds.
It requires 2 extra fields in the struct files_struct:
->last_request and
->request_priority
where ->last_request is the time (jiffies value) when a
process associated with this files_struct last submitted
an IO request and ->request_priority is a floating average
of the time between IO requests, which can be directly used
to determine the request priority.
The floating priority is kept over (1 << RQ_PRIO_SHIFT) requests,
maybe 32 would be a good value to start ? Maybe 128 ?
The request priority of the files_struct used by the current
process can be calculated in a simple O(1) way every time a
request is submitted:
{
	struct files_struct *files = current->files;
	unsigned long interval = jiffies - files->last_request;

	files->last_request = jiffies;	/* next interval is measured from here */
	files->request_priority -= (files->request_priority >> RQ_PRIO_SHIFT);
	files->request_priority += interval;
}
The request_priority gets assigned as the priority of the
currently submitted request (or the request gets submitted
in the queue belonging to this priority range). We don't
change the priority of already submitted requests, except
maybe when merging requests.
This idea should adapt faster to changing IO usage of tasks
and is O(1) scalable. Dunno if it'll be better than SFQ in
practice too, but it just shows that SFQ isn't the only way. ;)
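
To get a feel for how the floating average behaves, here is a small standalone demo of the update rule above (my own illustration, assuming RQ_PRIO_SHIFT = 5, i.e. an average over 32 requests): a task that was idle for a long time and then submits a burst sees its priority value decay only gradually toward the new inter-request interval.

#include <stdio.h>

#define RQ_PRIO_SHIFT 5   /* average over 1 << 5 = 32 requests */

int main(void)
{
	unsigned long request_priority = 3200;  /* long-idle task: big average gap */

	/* the task now submits a burst of requests 1 jiffy apart */
	for (int i = 0; i < 64; i++) {
		unsigned long interval = 1;
		request_priority -= request_priority >> RQ_PRIO_SHIFT;
		request_priority += interval;
		if (i % 16 == 15)
			printf("after %2d requests: priority %lu\n",
			       i + 1, request_priority);
	}
	return 0;
}
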
have fun,
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://guru.conectiva.com/
Current spamtrap: [email protected]
Rik van Riel wrote:
>On Mon, 10 Feb 2003, Nick Piggin wrote:
>
>>Andrea Arcangeli wrote:
>>
>>
>>>The only way to get the minimal possible latency and maximal fairness is
>>>my new stochastic fair queueing idea.
>>>
>>Sounds nice. I would like to see per process disk distribution.
>>
>
>Sounds like the easiest way to get that fair, indeed. Manage
>every disk as a separately scheduled resource...
>
I hope this option becomes available one day.
>
>
>>However dependent reads can not merge with each other obviously so
>>you could end up say submitting 4K reads per process.
>>
>
>Considering that one medium/far disk seek counts for about 400 kB
>of data read/write, I suspect we'll just want to merge requests or
>put adjacent requests next to each other into the elevator up to
>a fairly large size. Probably about 1 MB for a hard disk or a cdrom,
>but much less for floppies, opticals, etc...
>
Yes, but the point is _dependent reads_. This is why Andrea's approach
alone can't solve dependent read vs write or non-dependent read - while
maintaining good throughput.
>
>
>>But your solution also does nothing for sequential IO throughput in
>>the presence of more than one submitter.
>>
>
>>I think you should be giving each process a timeslice,
>>
>
>That is the anticipatory scheduler. A good complement to the SFQ
>part of the IO scheduler. I'd really like to see both ideas together
>in one scheduler, they sound like a winning pair.
>
No, anticipatory scheduling just "pauses for a bit after some reads".
I am talking about all this nonsense about trying to measure seek vs
throughput vs rotational latency, etc. We don't schedule processes by
how many memory accesses a process makes * processor speed / memory
speed + instruction jumps...
If you want a fair disk scheduler, just give each process a disk
timeslice, and all your seek, stream, settle, track buffer, etc.
accounting is magically done for you... for any device, and it is
100% tunable. The point is: account by _time_.
Nick
On Sun, 9 Feb 2003, Andrew Morton wrote:
> David Lang <[email protected]> wrote:
> >
> > > note that issuing an fsync should change all pending writes to 'synchronous'
> > as should writes to any partition mounted with the sync option, or writes
> > to a directory with the S flag set.
>
> We know, at I/O submission time, whether a write is to be waited upon.
> That's in writeback_control.sync_mode.
An fsync might change already submitted asynchronous writes
into synchronous writes, but this is not something I'm going
to lose sleep over. ;)
> That, combined with an assumption that "all reads are synchronous" would
> allow the outgoing BIOs to be appropriately tagged.
>
> It's still approximate.
Sounds like a good enough approximation to me.
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://guru.conectiva.com/
Current spamtrap: [email protected]
On Sun, Feb 09, 2003 at 08:33:43PM -0800, Andrew Morton wrote:
> David Lang <[email protected]> wrote:
> >
> > > note that issuing an fsync should change all pending writes to 'synchronous'
> > as should writes to any partition mounted with the sync option, or writes
> > to a directory with the S flag set.
>
> We know, at I/O submission time, whether a write is to be waited upon.
> That's in writeback_control.sync_mode.
>
> That, combined with an assumption that "all reads are synchronous" would
> allow the outgoing BIOs to be appropriately tagged.
This may be a terribly stupid question, if so pls. just tell me :)
I assume read-ahead requests go elsewhere? Or do we assume that someone
is waiting for them as well?
If we assume they are synchronous, that would be rather unfair
especially on multi-user systems - and the 90% accuracy that Rik
suggested would seem exaggerated to say the least (accuracy would be
more like 10% on a good day).
> It's still approximate. An exact solution would involve only marking I/O as
> synchronous when some process actually waits on its completion. I do not
> believe that all the extra lookup and locking infrastructure and storage
> which this would require is justified. Certainly not in a first iteration.
Just a quirk; NFS file servers.
The client does read-ahead - the nfsd on the server-side will wait for
the read request, thus the read is synchronous... But it's not.
--
................................................................
: [email protected] : And I see the elder races, :
:.........................: putrid forms of man :
: Jakob Østergaard : See him rise and claim the earth, :
: OZ9ABN : his downfall is at hand. :
:.........................:............{Konkhra}...............:
Jakob Oestergaard wrote:
>On Sun, Feb 09, 2003 at 08:33:43PM -0800, Andrew Morton wrote:
>
>>David Lang <[email protected]> wrote:
>>
>>>note that issuing an fsync should change all pending writes to 'synchronous'
>>>as should writes to any partition mounted with the sync option, or writes
>>>to a directory with the S flag set.
>>>
>>We know, at I/O submission time, whether a write is to be waited upon.
>>That's in writeback_control.sync_mode.
>>
>>That, combined with an assumption that "all reads are synchronous" would
>>allow the outgoing BIOs to be appropriately tagged.
>>
>
>This may be a terribly stupid question, if so pls. just tell me :)
>
>I assume read-ahead requests go elsewhere? Or do we assume that someone
>is waiting for them as well?
>
>If we assume they are synchronous, that would be rather unfair
>especially on multi-user systems - and the 90% accuracy that Rik
>suggested would seem exaggerated to say the least (accuracy would be
>more like 10% on a good day).
>
Remember that readahead gets scaled down quickly if it isn't
getting hits. It is also likely to be sequential and in the
track buffer, so it is a small cost.
Huge readahead is a problem however anticipatory scheduling
will hopefully allow good throughput for multiple read streams
without requiring much readahead.
Nick
Jakob Oestergaard <[email protected]> wrote:
>
> On Sun, Feb 09, 2003 at 08:33:43PM -0800, Andrew Morton wrote:
> > David Lang <[email protected]> wrote:
> > >
> > > note that issuing an fsync should change all pending writes to 'synchronous'
> > > as should writes to any partition mounted with the sync option, or writes
> > > to a directory with the S flag set.
> >
> > We know, at I/O submission time, whether a write is to be waited upon.
> > That's in writeback_control.sync_mode.
> >
> > That, combined with an assumption that "all reads are synchronous" would
> > allow the outgoing BIOs to be appropriately tagged.
>
> This may be a terribly stupid question, if so pls. just tell me :)
>
> I assume read-ahead requests go elsewhere? Or do we assume that someone
> is waiting for them as well?
Yes, they are treated the same. If any part of the request is to be waited
upon then the whole request should be treated as synchronous.
And if no part of a readahead request is waited upon, we should never have
submitted it anyway.
> If we assume they are synchronous, that would be rather unfair
> especially on multi-user systems - and the 90% accuracy that Rik
> suggested would seem exaggerated to say the least (accuracy would be
> more like 10% on a good day).
The simplified assumption that "reads are waited upon and writes are not" is
highly accurate.
I want to see the Linux I/O scheduler work as well as possible with workloads
which make that assumption 100% true. Once we have done that, then we can
start worrying about corner cases like AIO, O_SYNC, truncate, etc.
On Mon, Feb 10, 2003 at 03:58:26PM +1100, Nick Piggin wrote:
...
> >If we assume they are synchronous, that would be rather unfair
> >especially on multi-user systems - and the 90% accuracy that Rik
> >suggested would seem exaggerated to say the least (accuracy would be
> >more like 10% on a good day).
> >
> Remember that readahead gets scaled down quickly if it isn't
> getting hits. It is also likely to be sequential and in the
> track buffer, so it is a small cost.
I buy the sequential-argument - very good point.
Thanks,
> Huge readahead is a problem however anticipatory scheduling
> will hopefully allow good throughput for multiple read streams
> without requiring much readahead.
I'm curious as to how these things will interact in the NFS
server<->client situation :)
Time will show, I guess.
Random data-point (not meant as a rant - I'm happy for all I got ;)
In stock 2.4.20 the interaction is horrible - whatever was done there is
not optimal. A 'tar xf' on the client will neither load the network
nor the server - it seems to be network latency bound (readahead not
doing its job - changing min-readahead and max-readahead on the client
doesn't seem to make a difference). However, my desktop (running on the
client) can hang for 10 seconds straight every half hour or so, when the
nightly backup runs on the server (local disk reads streaming at around
2 MB/sec from an array capable of at least 40 MB/sec).
--
................................................................
: [email protected] : And I see the elder races, :
:.........................: putrid forms of man :
: Jakob Østergaard : See him rise and claim the earth, :
: OZ9ABN : his downfall is at hand. :
:.........................:............{Konkhra}...............:
I am compiling 2.5.59 with the latest BK patch applied, and get the following
compile error when compiling sound/usb/usbaudio.c:
sound/usb/usbaudio.c, in function 'create_composite_quirk':
warning: initialization from incompatible pointer type, line 2219
structure has no member named 'interface', line 2226
This error did not occur in earlier 2.5.5x versions, but has in 2.5.59, 2.5.58
and at least a few before that.
P.S. I am testing a machine with a CMI8330 sound card and HiFi-Link usb audio
device.
.config follows:
#
# Automatically generated make config: don't edit
#
CONFIG_X86=y
CONFIG_MMU=y
CONFIG_SWAP=y
CONFIG_UID16=y
CONFIG_GENERIC_ISA_DMA=y
#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y
#
# General setup
#
CONFIG_SYSVIPC=y
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_SYSCTL=y
CONFIG_LOG_BUF_SHIFT=14
#
# Loadable module support
#
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
CONFIG_OBSOLETE_MODPARM=y
CONFIG_MODVERSIONS=y
CONFIG_KMOD=y
#
# Processor type and features
#
CONFIG_X86_PC=y
# CONFIG_X86_VOYAGER is not set
# CONFIG_X86_NUMAQ is not set
# CONFIG_X86_SUMMIT is not set
# CONFIG_X86_BIGSMP is not set
# CONFIG_M386 is not set
# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
# CONFIG_M586MMX is not set
# CONFIG_M686 is not set
CONFIG_MPENTIUMII=y
# CONFIG_MPENTIUMIII is not set
# CONFIG_MPENTIUM4 is not set
# CONFIG_MK6 is not set
# CONFIG_MK7 is not set
# CONFIG_MK8 is not set
# CONFIG_MELAN is not set
# CONFIG_MCRUSOE is not set
# CONFIG_MWINCHIPC6 is not set
# CONFIG_MWINCHIP2 is not set
# CONFIG_MWINCHIP3D is not set
# CONFIG_MCYRIXIII is not set
CONFIG_X86_CMPXCHG=y
CONFIG_X86_XADD=y
CONFIG_X86_L1_CACHE_SHIFT=5
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INVLPG=y
CONFIG_X86_BSWAP=y
CONFIG_X86_POPAD_OK=y
CONFIG_X86_TSC=y
CONFIG_X86_GOOD_APIC=y
CONFIG_X86_INTEL_USERCOPY=y
CONFIG_X86_USE_PPRO_CHECKSUM=y
# CONFIG_HUGETLB_PAGE is not set
# CONFIG_SMP is not set
CONFIG_PREEMPT=y
# CONFIG_X86_UP_APIC is not set
CONFIG_X86_MCE=y
# CONFIG_X86_MCE_NONFATAL is not set
# CONFIG_TOSHIBA is not set
# CONFIG_I8K is not set
CONFIG_MICROCODE=m
CONFIG_X86_MSR=m
CONFIG_X86_CPUID=m
# CONFIG_EDD is not set
# CONFIG_NOHIGHMEM is not set
CONFIG_HIGHMEM4G=y
# CONFIG_HIGHMEM64G is not set
CONFIG_HIGHMEM=y
# CONFIG_HIGHPTE is not set
# CONFIG_MATH_EMULATION is not set
CONFIG_MTRR=y
CONFIG_HAVE_DEC_LOCK=y
#
# Power management options (ACPI, APM)
#
CONFIG_PM=y
# CONFIG_SOFTWARE_SUSPEND is not set
#
# ACPI Support
#
# CONFIG_ACPI is not set
# CONFIG_APM is not set
# CONFIG_CPU_FREQ is not set
#
# Bus options (PCI, PCMCIA, EISA, MCA, ISA)
#
CONFIG_PCI=y
# CONFIG_PCI_GOBIOS is not set
# CONFIG_PCI_GODIRECT is not set
CONFIG_PCI_GOANY=y
CONFIG_PCI_BIOS=y
CONFIG_PCI_DIRECT=y
# CONFIG_SCx200 is not set
# CONFIG_PCI_LEGACY_PROC is not set
CONFIG_PCI_NAMES=y
CONFIG_ISA=y
# CONFIG_EISA is not set
# CONFIG_MCA is not set
CONFIG_HOTPLUG=y
#
# PCMCIA/CardBus support
#
CONFIG_PCMCIA=m
CONFIG_CARDBUS=y
CONFIG_I82092=m
CONFIG_I82365=m
# CONFIG_TCIC is not set
#
# PCI Hotplug Support
#
# CONFIG_HOTPLUG_PCI is not set
#
# Executable file formats
#
CONFIG_KCORE_ELF=y
# CONFIG_KCORE_AOUT is not set
CONFIG_BINFMT_AOUT=m
CONFIG_BINFMT_ELF=y
CONFIG_BINFMT_MISC=m
#
# Memory Technology Devices (MTD)
#
# CONFIG_MTD is not set
#
# Parallel port support
#
CONFIG_PARPORT=m
CONFIG_PARPORT_PC=m
CONFIG_PARPORT_PC_CML1=m
# CONFIG_PARPORT_SERIAL is not set
# CONFIG_PARPORT_PC_FIFO is not set
# CONFIG_PARPORT_PC_SUPERIO is not set
CONFIG_PARPORT_PC_PCMCIA=m
# CONFIG_PARPORT_OTHER is not set
CONFIG_PARPORT_1284=y
#
# Plug and Play support
#
CONFIG_PNP=y
CONFIG_PNP_NAMES=y
CONFIG_PNP_CARD=y
CONFIG_PNP_DEBUG=y
#
# Protocols
#
CONFIG_ISAPNP=y
CONFIG_PNPBIOS=y
#
# Block devices
#
CONFIG_BLK_DEV_FD=y
# CONFIG_BLK_DEV_XD is not set
# CONFIG_PARIDE is not set
# CONFIG_BLK_CPQ_DA is not set
# CONFIG_BLK_CPQ_CISS_DA is not set
# CONFIG_BLK_DEV_DAC960 is not set
CONFIG_BLK_DEV_UMEM=m
CONFIG_BLK_DEV_LOOP=y
CONFIG_BLK_DEV_NBD=m
CONFIG_BLK_DEV_RAM=y
CONFIG_BLK_DEV_RAM_SIZE=4096
CONFIG_BLK_DEV_INITRD=y
# CONFIG_LBD is not set
#
# ATA/ATAPI/MFM/RLL device support
#
CONFIG_IDE=y
#
# IDE, ATA and ATAPI Block devices
#
CONFIG_BLK_DEV_IDE=y
#
# Please see Documentation/ide.txt for help/info on IDE drives
#
# CONFIG_BLK_DEV_HD_IDE is not set
# CONFIG_BLK_DEV_HD is not set
CONFIG_BLK_DEV_IDEDISK=y
CONFIG_IDEDISK_MULTI_MODE=y
# CONFIG_IDEDISK_STROKE is not set
CONFIG_BLK_DEV_IDECS=m
CONFIG_BLK_DEV_IDECD=y
C
On Mon, 10 Feb 2003 06:10:08 +0100, Jakob Oestergaard said:
> In stock 2.4.20 the interaction is horrible - whatever was done there is
> not optimal. A 'tar xf' on the client will neither load the network
> nor the server - it seems to be network latency bound (readahead not
> doing its job - changing min-readahead and max-readahead on the client
> doesn't seem to make a difference). However, my desktop (running on the
This sounds like the traditional NFS suckage that has been there for decades.
The problem is that 'tar xf' ends up doing a *LOT* of NFS calls - a huge
stream of stat()/open()/chmod()/utime() calls. On a local disk, most of
this gets accelerated by the in-core inode cache, but on an NFS mount, you're
looking at lots and lots of synchronous calls.
In 'man 5 exports':
async This option allows the NFS server to violate the NFS protocol
and reply to requests before any changes made by that request
have been committed to stable storage (e.g. disc drive).
Using this option usually improves performance, but at the cost
that an unclean server restart (i.e. a crash) can cause data to
be lost or corrupted.
In releases of nfs-utils up to and including 1.0.0, this option
was the default. In this and future releases, sync is the
default, and async must be explicitly requested if needed. To
help make system administrators aware of this change, 'exportfs'
will issue a warning if neither sync nor async is specified.
Does this address your NFS issue?
--
Valdis Kletnieks
Computer Systems Senior Engineer
Virginia Tech
On Mon, Feb 10, 2003 at 01:06:27AM -0500, [email protected] wrote:
> On Mon, 10 Feb 2003 06:10:08 +0100, Jakob Oestergaard said:
>
> > In stock 2.4.20 the interaction is horrible - whatever was done there is
> > not optimal. A 'tar xf' on the client will neither load the network
> > nor the server - it seems to be network latency bound (readahead not
> > doing its job - changing min-readahead and max-readahead on the client
> > doesn't seem to make a difference). However, my desktop (running on the
>
> This sounds like the traditional NFS suckage that has been there for decades.
> The problem is that 'tar xf' ends up doing a *LOT* of NFS calls - a huge
> stream of stat()/open()/chmod()/utime() calls. On a local disk, most of
> this gets accelerated by the in-core inode cache, but on an NFS mount, you're
> looking at lots and lots of synchronous calls.
>
> In 'man 5 exports':
>
> async This option allows the NFS server to violate the NFS protocol
> and reply to requests before any changes made by that request
> have been committed to stable storage (e.g. disc drive).
>
> Using this option usually improves performance, but at the cost
> that an unclean server restart (i.e. a crash) can cause data to
> be lost or corrupted.
>
> In releases of nfs-utils up to and including 1.0.0, this option
> was the default. In this and future releases, sync is the
> default, and async must be explicitly requested if needed. To
> help make system administrators aware of this change, 'exportfs'
> will issue a warning if neither sync nor async is specified.
>
> Does this address your NFS issue?
No. Tried it in the past, just tried it again to make sure.
The caching should be done client-side, I guess, to be effective. Even
if the server acknowledges operations before they hit the disks, the
client is still bound by the RPC call latency.
I'm sure there are problems preventing such caching... Hmm... Seems
like the only way out is a multi-threaded tar ;) Or tar using AIO.
Btw. nocto on the client doesn't really seem to help matters much
either.
Thanks for the suggestion,
--
................................................................
: [email protected] : And I see the elder races, :
:.........................: putrid forms of man :
: Jakob Østergaard : See him rise and claim the earth, :
: OZ9ABN : his downfall is at hand. :
:.........................:............{Konkhra}...............:
On Mon, Feb 10, 2003 at 01:42:28AM -0200, Rik van Riel wrote:
> On Sun, 9 Feb 2003, Andrea Arcangeli wrote:
>
> > The only way to get the minimal possible latency and maximal fairness is
> > my new stochastic fair queueing idea.
>
> "The only way" ? That sounds like a lack of fantasy ;))
you can do more, but you'd need to build additional APIs to allow the
high level (possibly userspace too) to give hints to the low level.
This only requires the pid and a check of current->mm, which is trivial. So
without adding a complex API I think this is the best/only thing you can
do to get close to the minimal possible I/O latency from a process point
of view.
> On the contrary, once we have SFQ to fix the biggest elevator
> problems the difference made by the anticipatory scheduler should
> be much more visible.
>
> Think of a disk with 6 track buffers for reading and a system with
> 10 active reader processes. Without the anticipatory scheduler you'd
> need to go to the platter for almost every OS read (because each
> process flushes out the track buffer for the others), while with the
> anticipatory scheduler you've got a bigger chance of having the data
> you want in one of the drive's track buffers, meaning that you don't
> need to go to the platter but can just do a silicon to silicon copy.
>
> If you look at the academic papers of an anticipatory scheduler, you'll
> find that it gives as much as a 73% increase in throughput.
> On real-world tasks, not even on specially contrived benchmarks.
>
> The only aspect of the anticipatory scheduler that is no longer needed
> with your SFQ idea is the distinction between reads and writes, since
> your idea already makes the (better, I guess) distinction between
> synchronous and asynchronous requests.
I'm not saying anticipatory scheduling is going to be obsoleted by SFQ,
especially because SFQ has to be an option, to be used only when your sole
concern is to get the lowest possible I/O latency from a per-process
point of view (like while playing an mpeg or mp3).
But I still definitely think that if you run an anticipatory scheduling
benchmark w/ and w/o SFQ, the effect w/o SFQ (i.e. right now) is going
to be much more visible than w/ SFQ enabled. The reason is the size of
the queue, which w/o SFQ can be as large as several seconds in time and
several dozen megabytes in size.
Andrea
On Mon, Feb 10, 2003 at 02:29:20AM -0200, Rik van Riel wrote:
> On Sun, 9 Feb 2003, David Lang wrote:
>
> > note that issuing an fsync should change all pending writes to 'synchronous'
> > as should writes to any partition mounted with the sync option, or writes
> > to a directory with the S flag set.
>
> Exactly. This is nasty with our current data structures;
> probably not something to do during the current code slush.
as I said, you can consider asynchronous all requests submitted without
current->mm (i.e. by the kernel daemons); this will be 90% accurate. This
whole thing is a heuristic: if the heuristic fails, the behaviour becomes
like w/o SFQ, so there is no regression. Also the latency of fsync is less
critical than the latency of a read() syscall; again, this holds
statistically, sometimes it can be the other way around.
Andrea
On Mon, Feb 10, 2003 at 02:47:19AM -0200, Rik van Riel wrote:
> On Sun, 9 Feb 2003, Andrew Morton wrote:
> > David Lang <[email protected]> wrote:
> > >
> > > note that issuing an fsync should change all pending writes to 'synchronous'
> > > as should writes to any partition mounted with the sync option, or writes
> > > to a directory with the S flag set.
> >
> > We know, at I/O submission time, whether a write is to be waited upon.
> > That's in writeback_control.sync_mode.
>
> An fsync might change already submitted asynchronous writes
> into synchronous writes, but this is not something I'm going
> to lose sleep over. ;)
correct.
>
> > That, combined with an assumption that "all reads are synchronous" would
> > allow the outgoing BIOs to be appropriately tagged.
> >
> > It's still approximate.
>
> Sounds like a good enough approximation to me.
yes, the approximation should be more than enough IMHO
Andrea
On Mon, Feb 10, 2003 at 02:44:59AM -0200, Rik van Riel wrote:
> On Mon, 10 Feb 2003, Rik van Riel wrote:
> > On Sun, 9 Feb 2003, Andrea Arcangeli wrote:
> >
> > > The only way to get the minimal possible latency and maximal fairness is
> > > my new stochastic fair queueing idea.
> >
> > "The only way" ? That sounds like a lack of fantasy ;))
>
> Took about 30 minutes, but here is an alternative algorithm.
> One that doesn't suffer from SFQ's "temporary unfairness due
> to unlucky hashing" problem, which can be made worse with SFQ's
> "rehashing once every N seconds" policy, where N is too big
> for the user, say 30 seconds.
>
> It requires 2 extra fields in the struct files_struct:
> ->last_request and
> ->request_priority
>
> where ->last_request is the time (jiffies value) when a
> process associated with this files_struct last submitted
> an IO request and ->request_priority is a floating average
> of the time between IO requests, which can be directly used
> to determine the request priority.
>
> The floating priority is kept over (1 << RQ_PRIO_SHIFT) requests,
> maybe 32 would be a good value to start ? Maybe 128 ?
>
> The request priority of the files_struct used by the current
> process can be calculated in a simple O(1) way every time a
> request is submitted:
>
> {
> struct files_struct *files = current->files;
> unsigned long interval = jiffies - files->last_request;
>
> files->request_priority -= (files->request_priority >> RQ_PRIO_SHIFT);
> files->request_priority += interval;
> }
>
> The request_priority gets assigned as the priority of the
> currently submitted request (or the request gets submitted
> in the queue belonging to this priority range). We don't
> change the priority of already submitted requests, except
> maybe when merging requests.
>
> This idea should adapt faster to changing IO usage of tasks
> and is O(1) scalable. Dunno if it'll be better than SFQ in
> practice too, but it just shows that SFQ isn't the only way. ;)
The file level is worthless IMHO; you have no idea of what is going on
in the I/O queue from there, and the queue can be filled with 64M in a few
msec. If you really think the above can be nearly as good as SFQ at the
queue level, implement your alternate solution and benchmark it against SFQ
with a seeking tiotest 1 with a cp /dev/zero . in the background; if it
works better than SFQ we'll use it.
Andrea
On Mon, Feb 10, 2003 at 03:44:35PM +1100, Nick Piggin wrote:
> Rik van Riel wrote:
>
> >On Mon, 10 Feb 2003, Nick Piggin wrote:
> >
> >>Andrea Arcangeli wrote:
> >>
> >>
> >>>The only way to get the minimal possible latency and maximal fairness is
> >>>my new stochastic fair queueing idea.
> >>>
> >>Sounds nice. I would like to see per process disk distribution.
> >>
> >
> >Sounds like the easiest way to get that fair, indeed. Manage
> >every disk as a separately scheduled resource...
> >
> I hope this option becomes available one day.
>
> >
> >
> >>>>However dependent reads can not merge with each other obviously so
> >>you could end up say submitting 4K reads per process.
> >>
> >
> >Considering that one medium/far disk seek counts for about 400 kB
> >of data read/write, I suspect we'll just want to merge requests or
> >>put adjacent requests next to each other into the elevator up to
> >a fairly large size. Probably about 1 MB for a hard disk or a cdrom,
> >but much less for floppies, opticals, etc...
> >
> Yes, but the point is _dependent reads_. This is why Andrea's approach
> alone can't solve dependent read vs write or non-dependent read - while
> maintaining a good throughput.
note that for some workloads the _dependent reads_ can have _dependent
writes_ in between. I know it's not the most common case, but I don't
love making reads that special. Still, it makes perfect sense to have it
as a feature; I just wouldn't like it to be the only option. I mean,
there must always be a way to disable anticipatory scheduling, like there
must be a way to disable SFQ (in fact for SFQ it makes sense to leave it
disabled by default; it's only a few people that definitely care about
having the lowest possible latency for mplayer or xmms, or for a multiuser
system doing lots of I/O, but those guys can't live without it)
Andrea
On Mon, Feb 10, 2003 at 05:51:07AM +0100, Jakob Oestergaard wrote:
> On Sun, Feb 09, 2003 at 08:33:43PM -0800, Andrew Morton wrote:
> > David Lang <[email protected]> wrote:
> > >
> > > note that issuing an fsync should change all pending writes to 'synchronous'
> > > as should writes to any partition mounted with the sync option, or writes
> > > to a directory with the S flag set.
> >
> > We know, at I/O submission time, whether a write is to be waited upon.
> > That's in writeback_control.sync_mode.
> >
> > That, combined with an assumption that "all reads are synchronous" would
> > allow the outgoing BIOs to be appropriately tagged.
>
> This may be a terribly stupid question, if so pls. just tell me :)
>
> I assume read-ahead requests go elsewhere? Or do we assume that someone
> is waiting for them as well?
readahead is meant for merging, so in short there is normally no
additional request generated by readahead. Intra-process merging must
not be forbidden by SFQ.
We also have to choose whether or not to forbid cross-process merging. In
theory SFQ would tell us to avoid it, and to only merge in the context
of the same pid; in the common case that should still take care of most
merging. Things like dbench aren't going to run well with SFQ no matter
whether we do global merging or only per-process merging. Optimizing for
throughput is not the object here.
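
A sketch of that merge-policy choice (illustrative only, with hypothetical names): strict per-process merging only considers requests queued under the same pid, while an optional global-merging mode also looks at the other queues.

#include <stdbool.h>
#include <stdio.h>

struct pending_req {
	int  pid;           /* submitter the request was queued under */
	long start, end;    /* sector range covered */
};

/* Can a new request starting at 'new_start' (submitted by 'pid') be appended
   to 'r'? With per-process merging only requests from the same pid qualify;
   with global merging any adjacent request does. */
static bool can_merge(const struct pending_req *r, int pid,
                      long new_start, bool global_merging)
{
	if (!global_merging && r->pid != pid)
		return false;
	return r->end == new_start;     /* simple back-merge check */
}

int main(void)
{
	struct pending_req r = { .pid = 42, .start = 100, .end = 108 };

	printf("same pid:                  %d\n", can_merge(&r, 42, 108, false));
	printf("other pid:                 %d\n", can_merge(&r, 99, 108, false));
	printf("other pid, global merging: %d\n", can_merge(&r, 99, 108, true));
	return 0;
}
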
Andrea
Andrea Arcangeli wrote:
>On Mon, Feb 10, 2003 at 01:42:28AM -0200, Rik van Riel wrote:
>
>>On Sun, 9 Feb 2003, Andrea Arcangeli wrote:
>>
>>
>>>The only way to get the minimal possible latency and maximal fairness is
>>>my new stochastic fair queueing idea.
>>>
>>"The only way" ? That sounds like a lack of fantasy ;))
>>
>
>you can do more but you'd need to build additional APIs, to allow the
>highlevel (possibly userspace too) to give hints to the lowlevel.
>
>This only requires the pid and checking current->mm which is trivial. So
>without adding a complex API I think this is the best/only thing you can
>do to get close to the minimal possible I/O latency from a process point
>of view.
>
>
>>On the contrary, once we have SFQ to fix the biggest elevator
>>problems the difference made by the anticipatory scheduler should
>>be much more visible.
>>
>>Think of a disk with 6 track buffers for reading and a system with
>>10 active reader processes. Without the anticipatory scheduler you'd
>>need to go to the platter for almost every OS read (because each
>>process flushes out the track buffer for the others), while with the
>>anticipatory scheduler you've got a bigger chance of having the data
>>you want in one of the drive's track buffers, meaning that you don't
>>need to go to the platter but can just do a silicon to silicon copy.
>>
>>If you look at the academic papers of an anticipatory scheduler, you'll
>>find that it gives as much as a 73% increase in throughput.
>>On real-world tasks, not even on specially contrived benchmarks.
>>
>>The only aspect of the anticipatory scheduler that is no longer needed
>>with your SFQ idea is the distinction between reads and writes, since
>>your idea already makes the (better, I guess) distinction between
>>synchronous and asynchronous requests.
>>
>
>I'm not saying anticipatory scheduling is going to be obsoleted by SFQ,
>especially because SFQ has to be an option, to be used only when your sole
>concern is to get the lowest possible I/O latency from a per-process
>point of view (like while playing an mpeg or mp3).
>
>But I still definitely think that if you run an anticipatory scheduling
>benchmark w/ and w/o SFQ, the effect w/o SFQ (i.e. right now) is going
>to be much more visible than w/ SFQ enabled. The reason is the size of
>the queue that w/o SFQ can be as large as several seconds in time and
>several dozen megabytes in size.
>
You are addressing a fundamentally different problem. In the 2.5
elevator we can have whatever queue size we like. I get great
contest results with an 8192-request queue size. We track each
request's submission time so we can impose a soft upper limit on
service times, which provides fine latency, and the sync nature
of reads ensures pretty good fairness.
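
A sketch of the "soft upper limit on service times" idea (illustrative only, not the actual 2.5 code; the deadline value and names are assumptions): each request records its submission time, and the scheduler serves the seek-optimal candidate unless the oldest pending request has waited past the deadline.

#include <stdint.h>
#include <stdio.h>

#define SERVICE_DEADLINE_MS 500   /* soft cap on how long a request may wait */

struct req {
	uint64_t submitted_ms;    /* recorded when the request entered the queue */
	long     sector;
};

/* 'sorted' is the elevator's preferred (seek-optimal) candidate, 'oldest'
   the request that has been waiting longest. */
static struct req *pick_next(struct req *sorted, struct req *oldest,
                             uint64_t now_ms)
{
	if (oldest && now_ms - oldest->submitted_ms > SERVICE_DEADLINE_MS)
		return oldest;    /* latency bound kicks in */
	return sorted;            /* otherwise optimise for the disk head */
}

int main(void)
{
	struct req a = { .submitted_ms = 0,   .sector = 900000 };  /* old, far away */
	struct req b = { .submitted_ms = 590, .sector = 1000   };  /* new, nearby  */

	/* at t=600ms the old request has exceeded the deadline and wins */
	printf("t=600ms -> sector %ld\n", pick_next(&b, &a, 600)->sector);
	return 0;
}
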
What your scheduler is good for is providing a much stronger
per-process fairness solution, which is obviously very useful
for some tasks.
The problems each is trying to solve are different. I don't
think your idea solves anything that anticipatory scheduling
solves (or tries to).
Nick
Andrea Arcangeli wrote:
>On Mon, Feb 10, 2003 at 03:44:35PM +1100, Nick Piggin wrote:
>
>>Rik van Riel wrote:
>>
>>
>>>On Mon, 10 Feb 2003, Nick Piggin wrote:
>>>
>>>
>>>>Andrea Arcangeli wrote:
>>>>
>>>>
>>>>
>>>>>The only way to get the minimal possible latency and maximal fairness is
>>>>>my new stochastic fair queueing idea.
>>>>>
>>>>>
>>>>Sounds nice. I would like to see per process disk distribution.
>>>>
>>>>
>>>Sounds like the easiest way to get that fair, indeed. Manage
>>>every disk as a separately scheduled resource...
>>>
>>>
>>I hope this option becomes available one day.
>>
>>
>>>
>>>>However dependent reads can not merge with each other obviously so
>>>>you could end up say submitting 4K reads per process.
>>>>
>>>>
>>>Considering that one medium/far disk seek counts for about 400 kB
>>>of data read/write, I suspect we'll just want to merge requests or
>>>put adjacent requests next to each other into the elevator up to
>>>a fairly large size. Probably about 1 MB for a hard disk or a cdrom,
>>>but much less for floppies, opticals, etc...
>>>
>>>
>>Yes, but the point is _dependent reads_. This is why Andrea's approach
>>alone can't solve dependent read vs write or non-dependent read - while
>>maintaining a good throughput.
>>
>
>note that for some workloads the _dependent reads_ can have _dependent
>writes_ in between. I know it's not the most common case though, but I
>don't love too much to make reads that much special. But again, it makes
>perfect sense to have it as a feature, but I wouldn't like it to be only
>option, I mean there must always be a way to disable anticipatory
>scheduling like there must be a way to disable SFQ (infact for SFQ it
>makes sense to leave it disabled by default, it's only a few people that
>definitely only cares to have the lowest possible latency for mplayer or
>xmms, or for multiuser system doing lots of I/O, but those guys can't
>live without it)
>
>Andrea
>
If you pass the information down, the IO scheduler can easily put a
lower expiry time on sync writes for example, or a higher one on
async reads. This isn't what the anticipatory scheduler patch is
about.
The point is that it is very easy to get a huge backlog of writes
for the purpose of disk head schedule optimising, but this is
difficult with reads; anticipatory scheduling helps address this.
That's all, really.
Andrea Arcangeli wrote:
>On Mon, Feb 10, 2003 at 03:58:26PM +1100, Nick Piggin wrote:
>
>>Jakob Oestergaard wrote:
>>
>>
>>>On Sun, Feb 09, 2003 at 08:33:43PM -0800, Andrew Morton wrote:
>>>
>>>
>>>>David Lang <[email protected]> wrote:
>>>>
>>>>
>>>>>note that issuing a fsync should change all pending writes to
>>>>>'synchronous'
>>>>>as should writes to any partition mounted with the sync option, or writes
>>>>>to a directory with the S flag set.
>>>>>
>>>>>
>>>>We know, at I/O submission time, whether a write is to be waited upon.
>>>>That's in writeback_control.sync_mode.
>>>>
>>>>That, combined with an assumption that "all reads are synchronous" would
>>>>allow the outgoing BIOs to be appropriately tagged.
>>>>
>>>>
>>>This may be a terribly stupid question, if so pls. just tell me :)
>>>
>>>I assume read-ahead requests go elsewhere? Or do we assume that someone
>>>is waiting for them as well?
>>>
>>>If we assume they are synchronous, that would be rather unfair
>>>especially on multi-user systems - and the 90% accuracy that Rik
>>>suggested would seem exaggerated to say the least (accuracy would be
>>>more like 10% on a good day).
>>>
>>>
>>Remember that readahead gets scaled down quickly if it isn't
>>getting hits. It is also likely to be sequential and in the
>>track buffer, so it is a small cost.
>>
>>Huge readahead is a problem however anticipatory scheduling
>>will hopefully allow good throughput for multiple read streams
>>without requiring much readahead.
>>
>
>the main purpose of readahead is to generate 512k scsi commands when you
>read a file with a 4k user buffer, anticipatory scheduling isn't very
>related to readahead.
>
You seem to be forgetting things like seek time.
On Mon, Feb 10, 2003 at 03:58:26PM +1100, Nick Piggin wrote:
> Jakob Oestergaard wrote:
>
> >On Sun, Feb 09, 2003 at 08:33:43PM -0800, Andrew Morton wrote:
> >
> >>David Lang <[email protected]> wrote:
> >>
> >>>note that issuing a fsync should change all pending writes to
> >>>'synchronous'
> >>>as should writes to any partition mounted with the sync option, or writes
> >>>to a directory with the S flag set.
> >>>
> >>We know, at I/O submission time, whether a write is to be waited upon.
> >>That's in writeback_control.sync_mode.
> >>
> >>That, combined with an assumption that "all reads are synchronous" would
> >>allow the outgoing BIOs to be appropriately tagged.
> >>
> >
> >This may be a terribly stupid question, if so pls. just tell me :)
> >
> >I assume read-ahead requests go elsewhere? Or do we assume that someone
> >is waiting for them as well?
> >
> >If we assume they are synchronous, that would be rather unfair
> >especially on multi-user systems - and the 90% accuracy that Rik
> >suggested would seem exaggerated to say the least (accuracy would be
> >more like 10% on a good day).
> >
> Remember that readahead gets scaled down quickly if it isn't
> getting hits. It is also likely to be sequential and in the
> track buffer, so it is a small cost.
>
> Huge readahead is a problem however anticipatory scheduling
> will hopefully allow good throughput for multiple read streams
> without requiring much readahead.
the main purpose of readahead is to generate 512k scsi commands when you
read a file with a 4k user buffer; anticipatory scheduling isn't very
related to readahead.
Andrea
On Mon, Feb 10, 2003 at 06:41:14PM +1100, Nick Piggin wrote:
> Andrea Arcangeli wrote:
>
> >On Mon, Feb 10, 2003 at 03:58:26PM +1100, Nick Piggin wrote:
> >
> >>Jakob Oestergaard wrote:
> >>
> >>
> >>>On Sun, Feb 09, 2003 at 08:33:43PM -0800, Andrew Morton wrote:
> >>>
> >>>
> >>>>David Lang <[email protected]> wrote:
> >>>>
> >>>>
> >>>>>note that issuing an fsync should change all pending writes to
> >>>>>'synchronous'
> >>>>>as should writes to any partition mounted with the sync option, or
> >>>>>writes
> >>>>>to a directory with the S flag set.
> >>>>>
> >>>>>
> >>>>We know, at I/O submission time, whether a write is to be waited upon.
> >>>>That's in writeback_control.sync_mode.
> >>>>
> >>>>That, combined with an assumption that "all reads are synchronous" would
> >>>>allow the outgoing BIOs to be appropriately tagged.
> >>>>
> >>>>
> >>>This may be a terribly stupid question, if so pls. just tell me :)
> >>>
> >>>I assume read-ahead requests go elsewhere? Or do we assume that someone
> >>>is waiting for them as well?
> >>>
> >>>If we assume they are synchronous, that would be rather unfair
> >>>especially on multi-user systems - and the 90% accuracy that Rik
> >>>suggested would seem exaggerated to say the least (accuracy would be
> >>>more like 10% on a good day).
> >>>
> >>>
> >>Remember that readahead gets scaled down quickly if it isn't
> >>getting hits. It is also likely to be sequential and in the
> >>track buffer, so it is a small cost.
> >>
> >>Huge readahead is a problem however anticipatory scheduling
> >>will hopefully allow good throughput for multiple read streams
> >>without requiring much readahead.
> >>
> >
> >the main purpose of readahead is to generate 512k scsi commands when you
> >read a file with a 4k user buffer, anticipatory scheduling isn't very
> >related to readahead.
> >
> You seem to be forgetting things like seek time.
I didn't say it's the only purpose. Of course there's no hope for
merging in the metadata dependent reads of the fs where anticipatory
scheduling does its best, and in fact they don't even attempt to do any
readahead. BTW, one thing that should definitely do readahead and it's
not doing that (at least in 2.4) is the readdir path, again to generate
big commands, no matter the seeks. It was lost with the directory in
pagecache.
Andrea
Andrea Arcangeli <[email protected]> wrote:
>
> BTW, one thing that should definitely do readahead and it's
> not doing that (at least in 2.4) is the readdir path, again to generate
> big commands, no matter the seeks. It was lost with the directory in
> pagecache.
Yes. But ext3 is still doing directory readahead, and I have never
noticed it gaining any particular benefit over ext2 from it.
And neither fs performs inode table readahead.
Andrea Arcangeli wrote:
>On Mon, Feb 10, 2003 at 06:41:14PM +1100, Nick Piggin wrote:
>
>>Andrea Arcangeli wrote:
>>
>>
>>>On Mon, Feb 10, 2003 at 03:58:26PM +1100, Nick Piggin wrote:
>>>
>>>>Remember that readahead gets scaled down quickly if it isn't
>>>>getting hits. It is also likely to be sequential and in the
>>>>track buffer, so it is a small cost.
>>>>
>>>>Huge readahead is a problem however anticipatory scheduling
>>>>will hopefully allow good throughput for multiple read streams
>>>>without requiring much readahead.
>>>>
What I mean by this is: if we have >1 sequential readers (e.g. ftp
server), let's say 30MB/s disk, 4ms avg seek+settle+blah time,
submitting reads in say 128KB chunks alternating between streams
will cut throughput in half... At 1MB readahead we're at 89%
throughput. At 2MB, 94%.
With anticipatory scheduling, we can give each stream say 100ms
so that's 96% with, say... 8K readahead if you like. (Yes, I am
aware that CPU/PCI/IDE efficiency also mandates a larger request
size).
Anyway that is the theory. It remains to be seen if we can make
it work.
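Those percentages follow from a simple model, efficiency = transfer_time /
(transfer_time + switch_time). A small stand-alone program using only the
figures assumed above (30MB/s streaming rate, 4ms lost per switch between
streams) reproduces them; the 3000KB case approximates the ~100ms of
uninterrupted streaming that anticipatory scheduling would grant each reader:

/* Rough model of two or more sequential streams serviced alternately:
 * every chunk of `chunk_kb' costs one ~4ms repositioning.  30MB/s and
 * 4ms are the assumptions stated above, not measurements. */
#include <stdio.h>

int main(void)
{
        const double mb_per_sec = 30.0;
        const double seek_ms = 4.0;
        const double chunk_kb[] = { 128, 1024, 2048, 3000 };
        unsigned int i;

        for (i = 0; i < sizeof(chunk_kb) / sizeof(chunk_kb[0]); i++) {
                double xfer_ms = chunk_kb[i] / 1024.0 / mb_per_sec * 1000.0;
                double eff = xfer_ms / (xfer_ms + seek_ms);
                printf("%5.0fKB per switch: %2.0f%% of streaming throughput\n",
                       chunk_kb[i], 100.0 * eff);
        }
        return 0;
}

which prints roughly 51%, 89%, 94% and 96%.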
>>>>
>>>>
>>>the main purpose of readahead is to generate 512k scsi commands when you
>>>read a file with a 4k user buffer, anticipatory scheduling isn't very
>>>related to readahead.
>>>
>>>
>>You seem to be forgetting things like seek time.
>>
>
>I didn't say it's the only purpose. Of course there's no hope for
>merging in the metadata dependent reads of the fs where anticipatory
>scheduling does its best, and in fact they don't even attempt to do any
>readahead. BTW, one thing that should definitely do readahead and it's
>not doing that (at least in 2.4) is the readdir path, again to generate
>big commands, no matter the seeks. It was lost with the directory in
>pagecache.
>
>Andrea
>
>
>
>
On Mon, Feb 10, 2003 at 12:19:21AM -0800, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > BTW, one thing that should definitely do readahead and it's
> > not doing that (at least in 2.4) is the readdir path, again to generate
> > big commands, no matter the seeks. It was lost with the directory in
> > pagecache.
>
> Yes. But ext3 is still doing directory readahead, and I have never
> noticed it gaining any particular benefit over ext2 from it.
At least for big directories it should be a quite obvious
micro-optimization, if you do it at the logical level. Maybe it wasn't
worthwhile in the past (like in ext3) because it was done at the block
layer with a dumb reada rather than with a proper read of the next logical
block before wait_on_buffer. Also keep in mind the max readahead
generated by the block layer is nothing compared to the true readahead
that the logical layer is able to generate.
Andrea
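A rough sketch of the logical-level directory readahead described above, in
2.4-style buffer_head terms; dir_block_to_phys() stands in for the
filesystem's own logical-to-physical mapping (ext2/ext3 would use their
getblk helpers), and the window of 8 blocks is an arbitrary example:

/*
 * Sketch only (2.4-flavoured): before sleeping on the directory block we
 * actually need, queue READA for the next few *logical* blocks, mapped
 * through the filesystem, so the elevator can merge them into one large
 * command.  dir_block_to_phys() is hypothetical.
 */
#include <linux/fs.h>
#include <linux/locks.h>

#define DIR_RA_BLOCKS 8

/* hypothetical: map a logical directory block to its on-disk block number */
static int dir_block_to_phys(struct inode *dir, unsigned long lblock);

static struct buffer_head *read_dir_block(struct inode *dir,
                                          unsigned long lblock, int blocksize)
{
        kdev_t dev = dir->i_dev;
        struct buffer_head *bh, *ra;
        unsigned long i;

        bh = getblk(dev, dir_block_to_phys(dir, lblock), blocksize);
        if (!buffer_uptodate(bh))
                ll_rw_block(READ, 1, &bh);

        /* readahead for the following logical blocks, submitted before
         * we block on the one we need */
        for (i = 1; i <= DIR_RA_BLOCKS; i++) {
                ra = getblk(dev, dir_block_to_phys(dir, lblock + i), blocksize);
                if (!buffer_uptodate(ra))
                        ll_rw_block(READA, 1, &ra);
                brelse(ra);
        }

        wait_on_buffer(bh);     /* only now do we sleep */
        return bh;
}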
On Mon, Feb 10, 2003 at 07:27:51PM +1100, Nick Piggin wrote:
>
>
> Andrea Arcangeli wrote:
>
> >On Mon, Feb 10, 2003 at 06:41:14PM +1100, Nick Piggin wrote:
> >
> >>Andrea Arcangeli wrote:
> >>
> >>
> >>>On Mon, Feb 10, 2003 at 03:58:26PM +1100, Nick Piggin wrote:
> >>>
> >>>>Remember that readahead gets scaled down quickly if it isn't
> >>>>getting hits. It is also likely to be sequential and in the
> >>>>track buffer, so it is a small cost.
> >>>>
> >>>>Huge readahead is a problem however anticipatory scheduling
> >>>>will hopefully allow good throughput for multiple read streams
> >>>>without requiring much readahead.
> >>>>
> What I mean by this is: if we have >1 sequential readers (e.g. ftp
> server), let's say 30MB/s disk, 4ms avg seek+settle+blah time,
> submitting reads in say 128KB chunks alternating between streams
> will cut throughput in half... At 1MB readahead we're at 89%
> throughput. At 2MB, 94%.
>
> With anticipatory scheduling, we can give each stream say 100ms
> so that's 96% with, say... 8K readahead if you like. (Yes, I am
> aware that CPU/PCI/IDE efficiency also mandates a larger request
> size).
the way things work, if you give 8k readahead, you'll end up submitting 8k
requests no matter how long you wait, and that will kill throughput to 10%
of what was possible to achieve, especially with scsi; the max
coalescing of ide is 64k and btw that is its main weakness IMHO.
Your claim that you can decrease the readahead by adding anticipatory
scheduling doesn't make any sense to me; if you do, you'll run
very slow at 8k per request in all common workloads.
>
> Anyway that is the theory. It remains to be seen if we can make
> it work.
>
> >>>>
> >>>>
> >>>the main purpose of readahead is to generate 512k scsi commands when you
> >>>read a file with a 4k user buffer, anticipatory scheduling isn't very
> >>>related to readahead.
> >>>
> >>>
> >>You seem to be forgetting things like seek time.
> >>
> >
> >I didn't say it's the only purpose. Of course there's no hope for
> >merging in the metadata dependent reads of the fs where anticipatory
> >scheduling does its best, and in fact they don't even attempt to do any
> >readahead. BTW, one thing that should definitely do readahead and it's
> >not doing that (at least in 2.4) is the readdir path, again to generate
> >big commands, no matter the seeks. It was lost with the directory in
> >pagecache.
> >
> >Andrea
> >
> >
> >
> >
>
Andrea
Andrea Arcangeli <[email protected]> wrote:
>
> On Mon, Feb 10, 2003 at 12:19:21AM -0800, Andrew Morton wrote:
> > Andrea Arcangeli <[email protected]> wrote:
> > >
> > > BTW, one thing that should definitely do readahead and it's
> > > not doing that (at least in 2.4) is the readdir path, again to generate
> > > big commands, no matter the seeks. It was lost with the directory in
> > > pagecache.
> >
> > Yes. But ext3 is still doing directory readahead, and I have never
> > noticed it gaining any particular benefit over ext2 from it.
>
> At least for big directories it should be a quite obvious
> micro-optimization, if you do it at the logical level. Maybe it wasn't
> worthwhile in the past (like in ext3) because it was done at the block
> layer with a dumb reada rather than with proper read of the next logical
> block before wait_on_buffer. Also keep in mind the max readahead
> generated by the block layer is nothing compared to the true readahead
> that the logical layer is able to generate.
>
No, it's doing the directory readahead against the correct blocks (it calls
ext3_getblk() to find them).
Large directories tend to be spread all around the disk anyway. And I've
never explicitly tested for any problems which the loss of readahead might
have caused ext2. Nor have I tested inode table readahead. Guess I should.
On Mon, Feb 10, 2003 at 01:09:37AM -0800, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > On Mon, Feb 10, 2003 at 12:19:21AM -0800, Andrew Morton wrote:
> > > Andrea Arcangeli <[email protected]> wrote:
> > > >
> > > > BTW, one thing that should definitely do readahead and it's
> > > > not doing that (at least in 2.4) is the readdir path, again to generate
> > > > big commands, no matter the seeks. It was lost with the directory in
> > > > pagecache.
> > >
> > > Yes. But ext3 is still doing directory readahead, and I have never
> > > noticed it gaining any particular benefit over ext2 from it.
> >
> > At least for big directories it should be a quite obvious
> > micro-optimization, if you do it at the logical level. Maybe it wasn't
> > worthwhile in the past (like in ext3) because it was done at the block
> > layer with a dumb reada rather than with proper read of the next logical
> > block before wait_on_buffer. Also keep in mind the max readahead
> > generated by the block layer is nothing compared to the true readahead
> > that the logical layer is able to generate.
> >
>
> No, it's doing the directory readahead against the correct blocks (it calls
> ext3_getblk() to find them).
yes, sorry.
Andrea
Andrea Arcangeli wrote:
>On Mon, Feb 10, 2003 at 07:27:51PM +1100, Nick Piggin wrote:
>
>>
>>Andrea Arcangeli wrote:
>>
>>
>>>On Mon, Feb 10, 2003 at 06:41:14PM +1100, Nick Piggin wrote:
>>>
>>>
>>>>Andrea Arcangeli wrote:
>>>>
>>>>
>>>>
>>>>>On Mon, Feb 10, 2003 at 03:58:26PM +1100, Nick Piggin wrote:
>>>>>
>>>>>
>>>>>>Remember that readahead gets scaled down quickly if it isn't
>>>>>>getting hits. It is also likely to be sequential and in the
>>>>>>track buffer, so it is a small cost.
>>>>>>
>>>>>>Huge readahead is a problem however anticipatory scheduling
>>>>>>will hopefully allow good throughput for multiple read streams
>>>>>>without requiring much readahead.
>>>>>>
>>>>>>
>>What I mean by this is: if we have >1 sequential readers (e.g. ftp
>>server), let's say 30MB/s disk, 4ms avg seek+settle+blah time,
>>submitting reads in say 128KB chunks alternating between streams
>>will cut throughput in half... At 1MB readahead we're at 89%
>>throughput. At 2MB, 94%.
>>
>>With anticipatory scheduling, we can give each stream say 100ms
>>so that's 96% with, say... 8K readahead if you like. (Yes, I am
>>aware that CPU/PCI/IDE efficiency also mandates a larger request
>>size).
>>
>
>the way things work, if you give 8k readahead, you'll end up submitting 8k
>requests no matter how long you wait, and that will kill throughput to 10%
>of what was possible to achieve, especially with scsi; the max
>coalescing of ide is 64k and btw that is its main weakness IMHO.
>
>Your claim that you can decrease the readahead by adding anticipatory
>scheduling doesn't make any sense to me; if you do, you'll run
>very slow at 8k per request in all common workloads.
>
I mean you kill throughput by submitting non-linear requests. OK,
with such small requests you probably would lose some throughput
due to the much greater ratio of overheads to work done, but in general
we are talking about disk head positioning. 10 1MB requests should
be much quicker than 1000 10K requests, for disk performance.
Andrew Morton wrote:
>Andrea Arcangeli <[email protected]> wrote:
>
>
>>BTW, one thing that should definitely do readahead and it's
>>not doing that (at least in 2.4) is the readdir path, again to generate
>>big commands, no matter the seeks. It was lost with the directory in
>>pagecache.
>>
>>
>
>Yes. But ext3 is still doing directory readahead, and I have never
>noticed it gaining any particular benefit over ext2 from it.
>
>And neither fs performs inode table readahead.
reiser4 does directory readahead. It gets a lot of gain from it.
--
Hans
> In the time of one disk seek plus half rotational latency
> (12 ms) you can do a pretty large amount of reading (>400kB).
> This means that for near and medium disk seeks you don't care
> all that much about how large the submitted IO is. Track buffers
> further reduce this importance.
This isn't always true. Removable devices usually have a quite
low seek time compared to their raw transfer rate.
Bye.
Andrew Morton wrote:
>
>Large directories tend to be spread all around the disk anyway. And I've
>never explicitly tested for any problems which the loss of readahead might
>have caused ext2. Nor have I tested inode table readahead. Guess I should.
>
>
readahead seems to be less effective for non-sequential objects. Or at
least, you don't get the order of magnitude benefit from doing only one
seek, you only get the better elevator scheduling from having more
things in the elevator, which isn't such a big gain.
For the spectators of the thread, one of the things most of us learn
from experience about readahead is that there has to be something else
trying to interject seeks into your workload for readahead to make a big
difference, otherwise a modern disk drive (meaning, not 30 or so years
old) is going to optimize it for you anyway using its track cache.
--
Hans
Hans Reiser <[email protected]> wrote:
>
> reiser4 does directory readahead. It gets a lot of gain from it.
What is "a lot"?
Andrew Morton wrote:
>Hans Reiser <[email protected]> wrote:
>
>
>>reiser4 does directory readahead. It gets a lot of gain from it.
>>
>>
>
>What is "a lot"?
>
>
>
>
I'll need a bit of time to answer your question. I must confess that I
am speaking based on vague memories.
--
Hans
On Mon, Feb 10, 2003 at 01:07:25PM +0300, Hans Reiser wrote:
> Andrew Morton wrote:
>
> >
> >Large directories tend to be spread all around the disk anyway. And I've
> >never explicitly tested for any problems which the loss of readahead might
> >have caused ext2. Nor have I tested inode table readahead. Guess I
> >should.
> >
> >
> readahead seems to be less effective for non-sequential objects. Or at
yes, this is why I said readahead matters mostly to generate the big dma
commands, so if the object is sequential it will be served by the
lowlevel with a single dma using SG. This is also why, when I moved the
high dma limit of scsi to 512k (from 128k IIRC), I got such a big
throughput improvement. Also watch the read speed in my tree compared to
2.4 and 2.5 in bigbox.html from Randy (bonnie shows it well).
> least, you don't get the order of magnitude benefit from doing only one
> seek, you only get the better elevator scheduling from having more
> things in the elevator, which isn't such a big gain.
Agreed.
Andrea
Andrew Morton wrote:
>Hans Reiser <[email protected]> wrote:
>
>
>>reiser4 does directory readahead. It gets a lot of gain from it.
>>
>>
>
>What is "a lot"?
>
>
>
>
Vladimir Saveliev had the numbers ready at hand after all, and said:
The most noticeable gain is found in the following test (not sure that it
can be considered effective readahead though): ls -l over a directory
containing 30000 files (named numerically) created in random order:
4.87 with no readahead
2.99 with readahead (which is limited by 25% of ram or by the directory and
all its stat data)
--
Hans
Andrea Arcangeli wrote:
>On Mon, Feb 10, 2003 at 01:07:25PM +0300, Hans Reiser wrote:
>
>>Andrew Morton wrote:
>>
>>
>>>Large directories tend to be spread all around the disk anyway. And I've
>>>never explicitly tested for any problems which the loss of readahead might
>>>have caused ext2. Nor have I tested inode table readahead. Guess I
>>>should.
>>>
>>>
>>readahead seems to be less effective for non-sequential objects. Or at
>>
>
>yes, this is why I said readahead matters mostly to generate the big dma
>commands, so if the object is sequential it will be served by the
>lowlevel with a single dma using SG. this is also why when I moved the
>high dma limit of scsi to 512k (from 128k IIRC) I got such a relevant
>throughput improvement. Also watch the read speed in my tree compared to
>2.4 and 2.5 in bigbox.html from Randy (bonnie shows it well).
>
...AND readahead matters mostly to disk head scheduling when there
is other IO competing with the streaming read...
A big throughput improvement would have been all to do with
better disk head scheduling due to the larger request sizes, yeah?
And this would probably be to do with the elevator accounting being
based on requests and/or seeks. You shouldn't notice much difference
with the elevator stuff in Andrew's mm patches.
I don't know too much about SCSI stuff, but if driver / wire / device
overheads were that much higher at 128K compared to 512K I would
think something is broken or maybe optimised badly.
Nick
Hans Reiser <[email protected]> wrote:
>
> readahead seems to be less effective for non-sequential objects. Or at
> least, you don't get the order of magnitude benefit from doing only one
> seek, you only get the better elevator scheduling from having more
> things in the elevator, which isn't such a big gain.
>
> For the spectators of the thread, one of the things most of us learn
> from experience about readahead is that there has to be something else
> trying to interject seeks into your workload for readahead to make a big
> difference, otherwise a modern disk drive (meaning, not 30 or so years
> old) is going to optimize it for you anyway using its track cache.
>
The biggest place where readahead helps is when there are several files being
read at the same time. Without readahead the disk will seek from one file
to the next after having performed a 4k I/O. With readahead, after performing
a (say) 128k I/O. It really can make a 32x difference.
Readahead is brute force. The anticipatory scheduler solves this in a
smarter way.
Yes, device readahead helps. But I believe the caches are only 4-way
segmented or thereabouts, so it is easy to thrash them. Let's test:
vmm:/mnt/hda6> dd if=/dev/zero of=file1 bs=1M count=100
100+0 records in
100+0 records out
vmm:/mnt/hda6> dd if=/dev/zero of=file2 bs=1M count=100
100+0 records in
100+0 records out
vmm:/mnt/hda6> fadvise file1 0 9999999999 dontneed
vmm:/mnt/hda6> fadvise file2 0 9999999999 dontneed
vmm:/mnt/hda6> time cat file1 > /dev/null & time cat file2 > /dev/null
cat 2 > /dev/null 0.00s user 0.11s system 0% cpu 21.011 total
cat 1 > /dev/null 0.00s user 0.14s system 0% cpu 23.011 total
vmm:/mnt/hda6> 0 blockdev --setra 4 /dev/hda
vmm:/mnt/hda6> fadvise file1 0 9999999999 dontneed
vmm:/mnt/hda6> fadvise file2 0 9999999999 dontneed
vmm:/mnt/hda6> time cat file1 > /dev/null & time cat file2 > /dev/null
cat 2 > /dev/null 0.07s user 0.43s system 1% cpu 31.431 total
cat 1 > /dev/null 0.01s user 0.27s system 0% cpu 31.430 total
OK, so there's a 50% slowdown on a modern IDE disk from not using readahead.
vmm:/mnt/hda6> for i in $(seq 1 8)
for> do
for> dd if=/dev/zero of=file$i bs=1M count=10
for> sync
for> fadvise file$i 0 999999999 dontneed
for> done
vmm:/home/akpm> 0 blockdev --setra 240 /dev/hda1
for i in $(seq 1 8)
do
time cat file$i >/dev/null&
done
cat file$i > /dev/null 0.00s user 0.01s system 0% cpu 4.422 total
cat file$i > /dev/null 0.00s user 0.01s system 0% cpu 5.801 total
cat file$i > /dev/null 0.00s user 0.01s system 0% cpu 6.012 total
cat file$i > /dev/null 0.00s user 0.01s system 0% cpu 6.261 total
cat file$i > /dev/null 0.00s user 0.02s system 0% cpu 6.401 total
cat file$i > /dev/null 0.00s user 0.01s system 0% cpu 6.494 total
cat file$i > /dev/null 0.00s user 0.01s system 0% cpu 6.754 total
cat file$i > /dev/null 0.00s user 0.01s system 0% cpu 6.757 total
vmm:/mnt/hda6> for i in $(seq 1 8)
do
fadvise file$i 0 999999999 dontneed
done
vmm:/home/akpm> 0 blockdev --setra 4 /dev/hda1
vmm:/mnt/hda6> for i in $(seq 1 8)
do
time cat file$i >/dev/null&
done
cat file$i > /dev/null 0.00s user 0.02s system 0% cpu 13.373 total
cat file$i > /dev/null 0.00s user 0.04s system 0% cpu 14.849 total
cat file$i > /dev/null 0.00s user 0.02s system 0% cpu 15.816 total
cat file$i > /dev/null 0.00s user 0.04s system 0% cpu 16.808 total
cat file$i > /dev/null 0.00s user 0.05s system 0% cpu 17.824 total
cat file$i > /dev/null 0.00s user 0.03s system 0% cpu 19.004 total
cat file$i > /dev/null 0.00s user 0.03s system 0% cpu 19.274 total
cat file$i > /dev/null 0.00s user 0.03s system 0% cpu 19.301 total
So with eight files, it's a 300% drop from not using readahead.
With 16 files, readahead enabled:
cat file$i > /dev/null 0.00s user 0.01s system 0% cpu 9.887 total
cat file$i > /dev/null 0.00s user 0.01s system 0% cpu 10.925 total
cat file$i > /dev/null 0.00s user 0.01s system 0% cpu 11.033 total
cat file$i > /dev/null 0.00s user 0.01s system 0% cpu 11.146 total
cat file$i > /dev/null 0.00s user 0.01s system 0% cpu 11.381 total
cat file$i > /dev/null 0.00s user 0.02s system 0% cpu 11.640 total
cat file$i > /dev/null 0.00s user 0.01s system 0% cpu 11.907 total
cat file$i > /dev/null 0.00s user 0.01s system 0% cpu 12.262 total
....
Readahead disabled:
cat file$i > /dev/null 0.00s user 0.03s system 0% cpu 39.307 total
cat file$i > /dev/null 0.00s user 0.03s system 0% cpu 39.378 total
cat file$i > /dev/null 0.00s user 0.04s system 0% cpu 39.679 total
cat file$i > /dev/null 0.00s user 0.03s system 0% cpu 39.888 total
cat file$i > /dev/null 0.00s user 0.02s system 0% cpu 40.261 total
Still only 300%. I'd expect to see much larger differences on some older
SCSI disks I have here.
Andrew Morton wrote:
>Hans Reiser <[email protected]> wrote:
>
>>readahead seems to be less effective for non-sequential objects. Or at
>>least, you don't get the order of magnitude benefit from doing only one
>>seek, you only get the better elevator scheduling from having more
>>things in the elevator, which isn't such a big gain.
>>
>>For the spectators of the thread, one of the things most of us learn
>>from experience about readahead is that there has to be something else
>>trying to interject seeks into your workload for readahead to make a big
>>difference, otherwise a modern disk drive (meaning, not 30 or so years
>>old) is going to optimize it for you anyway using its track cache.
>>
>>
>
>The biggest place where readahead helps is when there are several files being
>read at the same time. Without readahead the disk will seek from one file
>to the next after having performed a 4k I/O. With readahead, after performing
>a (say) 128k I/O. It really can make a 32x difference.
>
>Readahead is brute force. The anticipatory scheduler solves this in a
>smarter way.
>
>Yes, device readahead helps. But I believe the caches are only 4-way
>segmented or thereabouts, so it is easy to thrash them. Let's test:
>
[snip results]
>
>
>Still only 300%. I'd expect to see much larger differences on some older
>SCSI disks I have here.
>
It would be interesting to see numbers with and without anticipatory
scheduling as well. I have tried two streaming readers.
Speed with default readahead 13MB/s
Speed with no readahead, no anticipation 7MB/s
Speed with no readahead, anticipation 11.8MB/s
I suspect more readers will simply favour the anticipatory scheduler
more.
Nick
On Mon, Feb 10, 2003 at 09:40:34PM +1100, Nick Piggin wrote:
> I don't know too much about SCSI stuff, but if driver / wire / device
> overheads were that much higher at 128K compared to 512K I would
> think something is broken or maybe optimised badly.
I guess it's also a matter of the way the harddisk can serve the I/O if
it sees it all at the same time, not only of the cpu/bus protocol overhead,
which is minor after all. Most certainly it's not a software mistake in linux
that the big commands run that much faster. Again, go check the numbers
in bigbox.html between my tree, 2.4 and 2.5 in bonnie read sequential,
to see the difference between 128k commands and 512k commands with
reads; these are facts. (and no writes and no seeks here)
Andrea
On Mon, Feb 10, 2003 at 02:48:10AM -0800, Andrew Morton wrote:
> Hans Reiser <[email protected]> wrote:
> >
> > readahead seems to be less effective for non-sequential objects. Or at
> > least, you don't get the order of magnitude benefit from doing only one
> > seek, you only get the better elevator scheduling from having more
> > things in the elevator, which isn't such a big gain.
> >
> > For the spectators of the thread, one of the things most of us learn
> > from experience about readahead is that there has to be something else
> > trying to interject seeks into your workload for readahead to make a big
> > difference, otherwise a modern disk drive (meaning, not 30 or so years
> > old) is going to optimize it for you anyway using its track cache.
> >
>
> The biggest place where readahead helps is when there are several files being
> read at the same time. Without readahead the disk will seek from one file
> to the next after having performed a 4k I/O. With readahead, after performing
> a (say) 128k I/O. It really can make a 32x difference.
definitely, but again the most important thing by far is to submit big
commands; if you get the 512k commands interleaved by different tasks
it isn't a panic situation.
> Readahead is brute force. The anticipatory scheduler solves this in a
> smarter way.
If you decrease readahead with the anticipatory scheduler you will submit
small commands; with readahead decreased you can stall any write in the
queue for 10 minutes, and still you won't grow the command at the head of
the queue.
> vmm:/mnt/hda6> for i in $(seq 1 8)
> for> do
> for> dd if=/dev/zero of=file$i bs=1M count=10
> for> sync
> for> fadvise file$i 0 999999999 dontneed
what is this fadvise doing?
> for> done
>
> vmm:/home/akpm> 0 blockdev --setra 240 /dev/hda1
> for i in $(seq 1 8)
> do
> time cat file$i >/dev/null&
> done
> cat file$i > /dev/null 0.00s user 0.01s system 0% cpu 4.422 total
> cat file$i > /dev/null 0.00s user 0.01s system 0% cpu 5.801 total
> cat file$i > /dev/null 0.00s user 0.01s system 0% cpu 6.012 total
> cat file$i > /dev/null 0.00s user 0.01s system 0% cpu 6.261 total
> cat file$i > /dev/null 0.00s user 0.02s system 0% cpu 6.401 total
> cat file$i > /dev/null 0.00s user 0.01s system 0% cpu 6.494 total
> cat file$i > /dev/null 0.00s user 0.01s system 0% cpu 6.754 total
> cat file$i > /dev/null 0.00s user 0.01s system 0% cpu 6.757 total
>
> vmm:/mnt/hda6> for i in $(seq 1 8)
> do
> fadvise file$i 0 999999999 dontneed
> done
> vmm:/home/akpm> 0 blockdev --setra 4 /dev/hda1
> vmm:/mnt/hda6> for i in $(seq 1 8)
> do
> time cat file$i >/dev/null&
> done
> cat file$i > /dev/null 0.00s user 0.02s system 0% cpu 13.373 total
> cat file$i > /dev/null 0.00s user 0.04s system 0% cpu 14.849 total
> cat file$i > /dev/null 0.00s user 0.02s system 0% cpu 15.816 total
> cat file$i > /dev/null 0.00s user 0.04s system 0% cpu 16.808 total
> cat file$i > /dev/null 0.00s user 0.05s system 0% cpu 17.824 total
> cat file$i > /dev/null 0.00s user 0.03s system 0% cpu 19.004 total
> cat file$i > /dev/null 0.00s user 0.03s system 0% cpu 19.274 total
> cat file$i > /dev/null 0.00s user 0.03s system 0% cpu 19.301 total
>
> So with eight files, it's a 300% drop from not using readahead.
yes, and the anticipatory scheduler isn't going to fix it. Furthermore, with
IDE the max merging is 64k so you can see little of the whole picture.
Andrea
Andrea Arcangeli <[email protected]> wrote:
>
> On Mon, Feb 10, 2003 at 09:40:34PM +1100, Nick Piggin wrote:
> > I don't know too much about SCSI stuff, but if driver / wire / device
> > overheads were that much higher at 128K compared to 512K I would
> > think something is broken or maybe optimised badly.
>
> I guess it's also a matter of the way the harddisk can serve the I/O if
> it sees it all at the same time, not only the cpu/bus protocol after all
> minor overhead. Most certainly it's not a software mistake in linux
> that the big commands runs that much faster. Again go check the numbers
> in bigbox.html between my tree, 2.4 and 2.5 in bonnie read sequential,
> to see the difference between 128k commands and 512k commands with
> reads, these are facts. (and no writes and no seeks here)
>
I thought scsi in 2.5 was doing 512k I/O's at present???
Doesn't Randy attribute the differences there to an updated
qlogic driver? (Or was the update to allow 512k I/O's? ;))
Andrea Arcangeli wrote:
>On Mon, Feb 10, 2003 at 09:40:34PM +1100, Nick Piggin wrote:
>
>>I don't know too much about SCSI stuff, but if driver / wire / device
>>overheads were that much higher at 128K compared to 512K I would
>>think something is broken or maybe optimised badly.
>>
>
>I guess it's also a matter of the way the harddisk can serve the I/O if
>it sees it all at the same time, not only the cpu/bus protocol after all
>minor overhead. Most certainly it's not a software mistake in linux
>that the big commands runs that much faster. Again go check the numbers
>in bigbox.html between my tree, 2.4 and 2.5 in bonnie read sequential,
>to see the difference between 128k commands and 512k commands with
>reads, these are facts. (and no writes and no seeks here)
>
Yes, it is very clear from the numbers that your tree is more than
150% the speed for reads. As I said I don't know too much about
SCSI, but it is very interesting that writes don't get a noticeable
improvement although they would be using the bigger request sizes
too, right? Something is causing this, but the cpu, bus, wire
overhead of using small requests does not seem to be it.
Nick
On Mon, Feb 10, 2003 at 03:21:01AM -0800, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > On Mon, Feb 10, 2003 at 09:40:34PM +1100, Nick Piggin wrote:
> > > I don't know too much about SCSI stuff, but if driver / wire / device
> > > overheads were that much higher at 128K compared to 512K I would
> > > think something is broken or maybe optimised badly.
> >
> > I guess it's also a matter of the way the harddisk can serve the I/O if
> > it sees it all at the same time, not only the cpu/bus protocol after all
> > minor overhead. Most certainly it's not a software mistake in linux
> > that the big commands runs that much faster. Again go check the numbers
> > in bigbox.html between my tree, 2.4 and 2.5 in bonnie read sequential,
> > to see the difference between 128k commands and 512k commands with
> > reads, these are facts. (and no writes and no seeks here)
> >
>
> I thought scsi in 2.5 was doing 512k I/O's at present???
>
> Doesn't Randy attribute the differences there to an updated
> qlogic driver? (Or was the update to allow 512k I/O's? ;))
The special case of the qlogic that Randy is using does 256k commands
with the new driver.
Andrea
Is it true that there is no manpage available anywhere for fadvise?
For the benefit of others reading this thread, anticipatory scheduling
is described at:
http://lwn.net/Articles/21274/
--
Hans
Andrea Arcangeli <[email protected]> wrote:
>
> what is this fadvise doing?
Invoking fadvise64(). Dropping the pagecache.
> yes, and the anticipatory scheduler isn't going to fix it. Furthermore, with
> IDE the max merging is 64k so you can see little of the whole picture.
124k. (Or is it 128k now?)
On Mon, Feb 10, 2003 at 10:24:57PM +1100, Nick Piggin wrote:
> Andrea Arcangeli wrote:
>
> >On Mon, Feb 10, 2003 at 09:40:34PM +1100, Nick Piggin wrote:
> >
> >>I don't know too much about SCSI stuff, but if driver / wire / device
> >>overheads were that much higher at 128K compared to 512K I would
> >>think something is broken or maybe optimised badly.
> >>
> >
> >I guess it's also a matter of the way the harddisk can serve the I/O if
> >it sees it all at the same time, not only the cpu/bus protocol after all
> >minor overhead. Most certainly it's not a software mistake in linux
> >that the big commands runs that much faster. Again go check the numbers
> >in bigbox.html between my tree, 2.4 and 2.5 in bonnie read sequential,
> >to see the difference between 128k commands and 512k commands with
> >reads, these are facts. (and no writes and no seeks here)
> >
> Yes it is very clear from the numbers that your tree is more than
> 150% the speed for reads. As I said I don't know too much about
correct, that's the huge improvement I get in the read sequential case
(i.e. bonnie), which is a crucial common workload.
> SCSI, but it is very interesting that writes don't get a noticeable
> improvement although they would be using the bigger request sizes
> too, right? Something is causing this but the cpu, bus, wire
It's the readahead in my tree that allows the reads to use the max scsi
command size. It has nothing to do with the max scsi command size
itself.
Writes don't need readahead to use the max command size; they have always
used it from the start, so they can't go any faster, they never
had a problem.
It's by fixing readahead that reads get boosted. This has nothing to do
with writes or with max_sectors itself.
You can wait 10 minutes and still such a command can't grow. This is why
the claim that anticipatory scheduling can decrease the need for readahead
doesn't make much sense to me; there are important things you just can't
achieve by only waiting.
> overhead of using small requests does not seem to be it.
>
> Nick
>
Andrea
Andrea Arcangeli wrote:
>On Mon, Feb 10, 2003 at 02:48:10AM -0800, Andrew Morton wrote:
>
>>Hans Reiser <[email protected]> wrote:
>>
>>>readahead seems to be less effective for non-sequential objects. Or at
>>>least, you don't get the order of magnitude benefit from doing only one
>>>seek, you only get the better elevator scheduling from having more
>>>things in the elevator, which isn't such a big gain.
>>>
>>>For the spectators of the thread, one of the things most of us learn
>>>from experience about readahead is that there has to be something else
>>>trying to interject seeks into your workload for readahead to make a big
>>>difference, otherwise a modern disk drive (meaning, not 30 or so years
>>>old) is going to optimize it for you anyway using its track cache.
>>>
>>>
>>The biggest place where readahead helps is when there are several files being
>>read at the same time. Without readahead the disk will seek from one file
>>to the next after having performed a 4k I/O. With readahead, after performing
>>a (say) 128k I/O. It really can make a 32x difference.
>>
>
>definitely, but again the most important by far is to submit big
>commands, if you get the 512k commands interleaved by different task
>it isn't a panic situation.
>
Well the most important by far is to minimise seeks - you may
be able to achieve that by submitting big commands, but you can
also do it by accounting by time (not number of requests) and
anticipating reads. I'm not saying your way doesn't work.
>
>
>>Readahead is brute force. The anticipatory scheduler solves this in a
>>smarter way.
>>
>
>If you decrease readahead with anticipatory scheduler you will submit
>the small commands, if you decrease the readahead you can stall 10
>minutes any write in the queue, and still you won't grow the command in
>the head of the queue.
>
Not sure exactly what you mean here; however, no big starvation
occurs with the scheduler you will find in 2.5.59-mm10 +
experimental.
A streaming reader and streaming writer total 14MB/s in aggregate
throughput, with 2.23 bytes read per write with default settings.
This goes to 11.4MB/s and a ratio of 1.6 with 0 readahead.
When I turn off anticipatory scheduling, however, I get a good
17MB/s throughput with a ratio of 0.00 reads per write (actually
some small positive epsilon less than one over one hundred).
[snipped]
>>
>>So with eight files, it's a 300% drop from not using readahead.
>>
>
>yes, and the anticipatory scheduler isn't going to fix it. Furthermore, with
>
I am guessing it will do a pretty good job.
>
>IDE the max merging is 64k so you can see little of the whole picture.
>
The disk can't, but the io scheduler obviously can. We can give
the disk a lot of help.
>
>
>Andrea
>
>
>
>
On Mon, Feb 10, 2003 at 03:33:27AM -0800, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > what is this fadvise doing?
>
> Invoking fadvise64(). Dropping the pagecache.
>
> > yes, and the anticipatory scheduler isn't going to fix it. Furthermore, with
> > IDE the max merging is 64k so you can see little of the whole picture.
>
> 124k. (Or is it 128k now?)
sure, sorry, the ide limit is 128k
Andrea
Hans Reiser <[email protected]> wrote:
>
> Is it true that there is no manpage available anywhere for fadvise?
>
It's pretty simple.
http://www.opengroup.org/onlinepubs/007904975/functions/posix_fadvise.html
It's also basically unimplementable without the radix tree, so I don't see how
other systems can be doing it. Maybe they just lock up for a day when
someone does fadvise64(fd, 0, -1, FADV_DONTNEED) ;)
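For anyone wanting to do the same thing from their own code, the interface in
that spec boils down to a single call; a small example (assuming a libc that
exposes posix_fadvise(); the "fadvise" helper used earlier in the thread calls
the raw fadvise64() syscall instead):

/* Drop any cached pages of a file, roughly what
 * "fadvise <file> 0 <len> dontneed" was doing above. */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        const char *name = argc > 1 ? argv[1] : "file1";   /* example path */
        int fd = open(name, O_RDONLY);
        int err;

        if (fd < 0) {
                perror("open");
                return 1;
        }
        /* offset 0 with len 0 means "to the end of the file" */
        err = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
        if (err != 0)
                fprintf(stderr, "posix_fadvise: error %d\n", err);
        close(fd);
        return 0;
}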
Andrea Arcangeli wrote:
>On Mon, Feb 10, 2003 at 10:24:57PM +1100, Nick Piggin wrote:
>
>>Andrea Arcangeli wrote:
>>
>>
>>>On Mon, Feb 10, 2003 at 09:40:34PM +1100, Nick Piggin wrote:
>>>
>>>
>>>>I don't know too much about SCSI stuff, but if driver / wire / device
>>>>overheads were that much higher at 128K compared to 512K I would
>>>>think something is broken or maybe optimised badly.
>>>>
>>>>
>>>I guess it's also a matter of the way the harddisk can serve the I/O if
>>>it sees it all at the same time, not only the cpu/bus protocol after all
>>>minor overhead. Most certainly it's not a software mistake in linux
>>>that the big commands runs that much faster. Again go check the numbers
>>>in bigbox.html between my tree, 2.4 and 2.5 in bonnie read sequential,
>>>to see the difference between 128k commands and 512k commands with
>>>reads, these are facts. (and no writes and no seeks here)
>>>
>>>
>>Yes it is very clear from the numbers that your tree is more than
>>150% the speed for reads. As I said I don't know too much about
>>
>
>correct, that's the huge improvement I get in the read sequential case
>(i.e. bonnie), which is a crucial common workload.
>
Yep
>
>
>>SCSI, but it is very interesting that writes don't get a noticeable
>>improvement although they would be using the bigger request sizes
>>too, right? Something is causing this but the cpu, bus, wire
>>
>
>It's the readahead in my tree that allows the reads to use the max scsi
>command size. It has nothing to do with the max scsi command size
>itself.
>
>writes don't need readahead to use the max command size, they always
>used it since the first place, so they can't go even faster, they never
>had a problem.
>
Yes I am an idiot, I don't know why I said that :P
>
>
>It's by fixing readahead that reads gets boosted. this has nothing to do
>with writes or the max_sectors itself.
>
>You can wait 10 minutes and still such command can't grow. This is why
>claiming anticipatory scheduling can decrease the need for readahead
>doesn't make much sense to me, there are important things you just can't
>achieve by only waiting.
>
Look at the few simple tests I have been posting - they clearly
indicate the need is decreased. I did say however that it would
be subject to CPU/bus/device overheads. I did not realise
SCSI setups would behave like this. From a purely disk head
perspective it does nullify the need for readahead (though
it is obviously still needed for other reasons).
Andrea Arcangeli <[email protected]> wrote:
>
> It's the readahead in my tree that allows the reads to use the max scsi
> command size. It has nothing to do with the max scsi command size
> itself.
Oh bah.
- *max_ra++ = vm_max_readahead;
+ *max_ra = ((128*4) >> (PAGE_SHIFT - 10)) - 1;
Well of course that will get bigger bonnie numbers, for exactly the reasons
I've explained. It will seek between files after every 512k rather than
after every 128k.
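(Following the arithmetic, assuming 4KB pages so PAGE_SHIFT is 12: the
replacement line works out to ((128*4) >> 2) - 1 = 127, i.e. a readahead
window of 128 pages = 512KB stored as "pages minus one", against the stock
~128KB window the old line gave - exactly the 512k-vs-128k difference
described here.)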
> You can wait 10 minutes and still such command can't grow. This is why
> claiming anticipatory scheduling can decrease the need for readahead
> doesn't make much sense to me, there are important things you just can't
> achieve by only waiting.
>
The anticipatory scheduler can easily permit 512k of reading before seeking
away to another file. In fact it can allow much more, without requiring that
readahead be cranked up.
Andrew Morton wrote:
>Andrea Arcangeli <[email protected]> wrote:
>
>>It's the readahead in my tree that allows the reads to use the max scsi
>>command size. It has nothing to do with the max scsi command size
>>itself.
>>
>
>Oh bah.
>
>- *max_ra++ = vm_max_readahead;
>+ *max_ra = ((128*4) >> (PAGE_SHIFT - 10)) - 1;
>
>
>Well of course that will get bigger bonnie numbers, for exactly the reasons
>I've explained. It will seek between files after every 512k rather than
>after every 128k.
>
Though Andrea did say it is a "single threaded" streaming read.
That is what I can't understand. Movement of the disk head should
be exactly the same in either situation and 128K is not exactly
a pitiful request size - so it suggests a quirk somewhere. It
is not as if the disk has to be particularly smart or know a
lot about the data in order to optimise the head movement for
a load like this.
On Mon, Feb 10, 2003 at 10:45:59PM +1100, Nick Piggin wrote:
> perspective it does nullify the need for readahead (though
> it is obviously still needed for other reasons).
I'm guessing that physically it may be needed from a head perspective
too; I doubt it only has to do with the in-core overhead. Seeing it all
before reaching the seek point might allow the disk to do smarter things
and to keep the head at the right place for longer, dunno. Anyways,
whatever the reason is, it doesn't make much difference from our point of
view ;), but I don't expect this hardware behaviour to change in future
high end storage.
NOTE: just to be sure, I'm not at all against anticipatory scheduling,
it's clearly a very good feature to have (still, I would like an option
to disable it, especially in heavy async environments like databases,
where lots of writes are sync too) but it should probably be enabled
by default, especially for the metadata reads that have to be
synchronous by design.
In fact I wonder if it may be interesting to also make it optionally
controlled by a fs hint (of course we don't expect all filesystems to provide
the hint), so that you stall I/O writes only when you know for sure
you're going to submit another read in a few usec, just the time to
parse the last metadata you read. Still, a timeout would be needed for
scheduling against RT etc..., but it could be a much more relaxed
timeout with this option enabled, so it would need less accurate
timings, and it would be less dependent on hardware, and it would
be less prone to generate false positive stalls. The downside is having
to add the hints.
Andrea
On Mon, Feb 10, 2003 at 10:53:27PM +1100, Nick Piggin wrote:
> Andrew Morton wrote:
>
> >Andrea Arcangeli <[email protected]> wrote:
> >
> >>It's the readahead in my tree that allows the reads to use the max scsi
> >>command size. It has nothing to do with the max scsi command size
> >>itself.
> >>
> >
> >Oh bah.
> >
> >- *max_ra++ = vm_max_readahead;
> >+ *max_ra = ((128*4) >> (PAGE_SHIFT - 10)) - 1;
> >
> >
> >Well of course that will get bigger bonnie numbers, for exactly the reasons
> >I've explained. It will seek between files after every 512k rather than
> >after every 128k.
> >
> Though Andrea did say it is a "single threaded" streaming read.
yes, I pointed you to bonnie read sequential in bigbox.html, not
tiobench.
Andrea
Andrea Arcangeli wrote:
>On Mon, Feb 10, 2003 at 10:45:59PM +1100, Nick Piggin wrote:
>
>>perspective it does nullify the need for readahead (though
>>it is obivously still needed for other reasons).
>>
>
> >I'm guessing that physically it may be needed from a head perspective
>too, I doubt it only has to do with the in-core overhead. Seeing it all
>before reaching the seek point might allow the disk to do smarter things
>and to keep the head at the right place for longer, dunno. Anyways,
>whatever is the reason it doesn't make much difference from our point of
>view ;), but I don't expect this hardware behaviour changing in future
>high end storage.
>
I don't understand it at all. I mean there is no other IO going
on so there would be no reason for the disk head to move anyway.
Rotational latency should be basically non-existent due to track
buffers, and being FC RAID hardware you wouldn't expect them to
skimp on anything.
>
>
>NOTE: just to be sure, I'm not at all against anticipatory scheduling,
>it's clearly a very good feature to have (still I would like an option
>to disable it especially in heavy async environments like databases,
>where lots of writes are sync too) but it should probably be enabled
>by default, especially for the metadata reads that have to be
>synchronous by design.
>
Yes, it definitely has to be selectable (in fact, in my current
version, setting antic_expire = 0 disables it), and Andrew has
been working on tuning the non-anticipatory version into shape.
>
>
>In fact I wonder if it may be interesting to also make it optionally
>controlled by a fs hint (of course we don't expect all filesystems to provide
>the hint), so that you stall I/O writes only when you know for sure
>you're going to submit another read in a few usec, just the time to
>parse the last metadata you read. Still a timeout would be needed for
>scheduling against RT etc..., but it could be a much more relaxed
>timeout with this option enabled, so it would need less accurate
>timings, and it would be less dependent on hardware, and it would
>be less prone to generate false positive stalls. The downside is having
>to add the hints.
>
It would be easy to anticipate or not based on hints. We could
anticipate sync writes if we wanted, lower expire time for sync
writes, increase it for async reads. It is really not very
complex (although the code needs tidying up).
On Mon, Feb 10, 2003 at 03:48:08AM -0800, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > It's the readahead in my tree that allows the reads to use the max scsi
> > command size. It has nothing to do with the max scsi command size
> > itself.
>
> Oh bah.
>
> - *max_ra++ = vm_max_readahead;
> + *max_ra = ((128*4) >> (PAGE_SHIFT - 10)) - 1;
>
>
> Well of course that will get bigger bonnie numbers, for exactly the reasons
> I've explained. It will seek between files after every 512k rather than
> after every 128k.
NOTE: first, there is no seek at all in the benchmark we're talking
about; no idea why you think there are seeks. This is not tiobench, this
is bonnie sequential read.
Second, I increased it to 4 times 128k very recently; it used to be
exactly 128k, and 512k for scsi. In fact the difference happens in SCSI
(where going past 128k is helpful), not IDE; on ide, as said, the
command size is limited by the hardware anyway.
The only thing you can argue is that the qla is capable of max 256k per
command, and I'm doing readahead of 512k at once, but I don't think that
can matter, just as it doesn't matter now that I increased it even more;
all that matters is not to stop at 128k, and to submit the 256k commands.
>
> > You can wait 10 minutes and still such command can't grow. This is why
> > claiming anticipatory scheduling can decrease the need for readahead
> > doesn't make much sense to me, there are important things you just can't
> > achieve by only waiting.
> >
>
> The anticipatory scheduler can easily permit 512k of reading before seeking
> away to another file. In fact it can allow much more, without requiring that
> readhead be cranked up.
This has nothing to do with "the other file". There is a single task in
the system, it runs in place of /sbin/init, it never forks and it
never opens more than one single file, called /myfile. Nothing else,
nothing more.
Anticipatory scheduling can't help that benchmark in any way, period. If
you destroy readahead you won't be able to build scsi commands of the
optimal size and this will hurt. Waiting can't help this.
Andrea
Nick Piggin <[email protected]> wrote:
>
> Andrew Morton wrote:
>
> >Andrea Arcangeli <[email protected]> wrote:
> >
> >>It's the readahead in my tree that allows the reads to use the max scsi
> >>command size. It has nothing to do with the max scsi command size
> >>itself.
> >>
> >
> >Oh bah.
> >
> >- *max_ra++ = vm_max_readahead;
> >+ *max_ra = ((128*4) >> (PAGE_SHIFT - 10)) - 1;
> >
> >
> >Well of course that will get bigger bonnie numbers, for exactly the reasons
> >I've explained. It will seek between files after every 512k rather than
> >after every 128k.
> >
> Though Andrea did say it is a "single threaded" streaming read.
Oh sorry, I missed that.
> That is what I can't understand. Movement of the disk head should
> be exactly the same in either situation and 128K is not exactly
> a pitiful request size - so it suggests a quirk somewhere. It
> is not as if the disk has to be particularly smart or know a
> lot about the data in order to optimise the head movement for
> a load like this.
Yes, that's a bit odd. Some reduction in CPU cost and bus
traffic, etc would be expected. Could be that sending out a
request which is larger than a track is saving a rev of the disk
for some reason.
Andrea Arcangeli wrote:
>On Mon, Feb 10, 2003 at 03:48:08AM -0800, Andrew Morton wrote:
>
>>Andrea Arcangeli <[email protected]> wrote:
>>
>>>It's the readahead in my tree that allows the reads to use the max scsi
>>>command size. It has nothing to do with the max scsi command size
>>>itself.
>>>
>>Oh bah.
>>
>>- *max_ra++ = vm_max_readahead;
>>+ *max_ra = ((128*4) >> (PAGE_SHIFT - 10)) - 1;
>>
>>
>>Well of course that will get bigger bonnie numbers, for exactly the reasons
>>I've explained. It will seek between files after every 512k rather than
>>after every 128k.
>>
>
>NOTE: first there is no seek at all in the benchmark we're talking
>about, no idea why you think there are seeks. This is not tiobench, this
>is bonnie sequential read.
>
Yes, Andrew obviously missed this... Anyway, could it be due to
a big stripe size and hitting more disks in the RAID? How does
a single SCSI disk perform here, Andrea?
Andrea Arcangeli wrote:
>On Mon, Feb 10, 2003 at 10:53:27PM +1100, Nick Piggin wrote:
>
>>Andrew Morton wrote:
>>
>>
>>>Andrea Arcangeli <[email protected]> wrote:
>>>
>>>
>>>>It's the readahead in my tree that allows the reads to use the max scsi
>>>>command size. It has nothing to do with the max scsi command size
>>>>itself.
>>>>
>>>>
>>>Oh bah.
>>>
>>>- *max_ra++ = vm_max_readahead;
>>>+ *max_ra = ((128*4) >> (PAGE_SHIFT - 10)) - 1;
>>>
>>>
>>>Well of course that will get bigger bonnie numbers, for exactly the reasons
>>>I've explained. It will seek between files after every 512k rather than
>>>after every 128k.
>>>
>>>
>>Though Andrea did say it is a "single threaded" streaming read.
>>
>
>yes, I pointed you to bonnie read sequential in bigbox.html, not
>tiobench.
>
The same pattern is also observed with single threaded tiobench
anyway.
On Mon, Feb 10, 2003 at 11:11:01PM +1100, Nick Piggin wrote:
> Andrea Arcangeli wrote:
>
> >On Mon, Feb 10, 2003 at 10:45:59PM +1100, Nick Piggin wrote:
> >
> >>perspective it does nullify the need for readahead (though
>>it is obviously still needed for other reasons).
> >>
> >
> >I'm guessing that physically it may be needed from a head perspective
> >too, I doubt it only has to do with the in-core overhead. Seeing it all
> >before reaching the seek point might allow the disk to do smarter things
> >and to keep the head at the right place for longer, dunno. Anyways,
> >whatever is the reason it doesn't make much difference from our point of
> >view ;), but I don't expect this hardware behaviour changing in future
> >high end storage.
> >
> I don't understand it at all. I mean there is no other IO going
Unfortunately I can't help you understand it, but this is what I found
in my practical experience. I found it the first time on my alpha, years
ago, when I increased the sym limit to 512k in early 2.4; then, since it
could break stuff, we added the max_sectors again in 2.4. But of course if
you don't fix readahead there's no way reads can take advantage of these
lowlevel fixes. I thought I had fixed readahead too, but it seems it got
backed out, and when I noticed I resurrected it in my tree (see the name
of the patch ;)
> >NOTE: just to be sure, I'm not at all against anticipatory scheduling,
> >it's clearly a very good feature to have (still I would like an option
> >to disable it especially in heavy async environments like databases,
> >where lots of writes are sync too) but it should probably be enabled
> >by default, especially for the metadata reads that have to be
> >synchronous by design.
> >
> Yes it definitely has to be selectable (in fact, in my current
> version, setting antic_expire = 0 disables it), and Andrew has
> been working on tuning the non anticipatory version into shape.
Great.
> >In fact I wonder if it may be interesting to also make it optionally
> >controlled by a fs hint (of course we don't expect all filesystems to provide
> >the hint), so that you stall I/O writes only when you know for sure
> >you're going to submit another read in a few usec, just the time to
> >parse the last metadata you read. Still a timeout would be needed for
> >scheduling against RT etc..., but it could be a much more relaxed
> >timeout with this option enabled, so it would need less accurate
> >timings, and it would be less dependent on hardware, and it would
> >be less prone to generate false positive stalls. The downside is having
> >to add the hints.
> >
> It would be easy to anticipate or not based on hints. We could
yep.
> anticipate sync writes if we wanted, lower expire time for sync
> writes, increase it for async reads. It is really not very
> complex (although the code needs tidying up).
this is not the way I thought about it. I'm interested in giving a hint
only to know for sure which are the intermediate sync dependent reads
(the obvious example is when doing the get_block and walking the
3 levels of inode indirect metadata blocks with big files, or while
walking the balanced tree in reiserfs), and I'm not interested at all
in writes. And I would just set a higher timeout when a read that I
know for sure (thanks to the hint) is "intermediate" is completed. We can
use high timeouts there because we know they won't trigger 90% of the
time: a new dependent read will always be submitted first.
Andrea
On Mon, Feb 10, 2003 at 11:27:13PM +1100, Nick Piggin wrote:
> Is there a magic number above which you see the improvement,
> Andrea? Or does it steadily climb?
I recall more than 512k wasn't worthwhile anymore
Andrea
On Mon, Feb 10, 2003 at 11:14:45PM +1100, Nick Piggin wrote:
> Andrea Arcangeli wrote:
>
> >On Mon, Feb 10, 2003 at 10:53:27PM +1100, Nick Piggin wrote:
> >
> >>Andrew Morton wrote:
> >>
> >>
> >>>Andrea Arcangeli <[email protected]> wrote:
> >>>
> >>>
> >>>>It's the readahead in my tree that allows the reads to use the max scsi
> >>>>command size. It has nothing to do with the max scsi command size
> >>>>itself.
> >>>>
> >>>>
> >>>Oh bah.
> >>>
> >>>- *max_ra++ = vm_max_readahead;
> >>>+ *max_ra = ((128*4) >> (PAGE_SHIFT - 10)) - 1;
> >>>
> >>>
> >>>Well of course that will get bigger bonnie numbers, for exactly the
> >>>reasons
> >>>I've explained. It will seek between files after every 512k rather than
> >>>after every 128k.
> >>>
> >>>
> >>Though Andrea did say it is a "single threaded" streaming read.
> >>
> >
> >yes, I pointed you to bonnie read sequential in bigbox.html, not
> >tiobench.
> >
> The same pattern is also observed with single threaded tiobench
> anyway.
yes, of course.
Andrea
On Mon, Feb 10, 2003 at 04:12:45AM -0800, Andrew Morton wrote:
> traffic, etc would be expected. Could be that sending out a
> request which is larger than a track is saving a rev of the disk
> for some reason.
I'm guessing something along those lines, yes; I doubt it's purely
in-core overhead that makes that much difference. Just for
completeness, I never read this in literature or data sheets, this is
all out of practical experience, so we can't exclude something odd in
the scsi layer, but I very much doubt it; the only thing that might
explain it is wasted cpu, and we know it's not wasting it.
Andrea
On Mon, Feb 10, 2003 at 11:17:30PM +1100, Nick Piggin wrote:
> Andrea Arcangeli wrote:
>
> >On Mon, Feb 10, 2003 at 03:48:08AM -0800, Andrew Morton wrote:
> >
> >>Andrea Arcangeli <[email protected]> wrote:
> >>
> >>>It's the readahead in my tree that allows the reads to use the max scsi
> >>>command size. It has nothing to do with the max scsi command size
> >>>itself.
> >>>
> >>Oh bah.
> >>
> >>- *max_ra++ = vm_max_readahead;
> >>+ *max_ra = ((128*4) >> (PAGE_SHIFT - 10)) - 1;
> >>
> >>
> >>Well of course that will get bigger bonnie numbers, for exactly the
> >>reasons
> >>I've explained. It will seek between files after every 512k rather than
> >>after every 128k.
> >>
> >
> >NOTE: first there is no seek at all in the benchmark we're talking
> >about, no idea why you think there are seeks. This is not tiobench, this
> >is bonnie sequential read.
> >
> Yes, Andrew obviously missed this... Anyway, could it be due to
> a big stripe size and hitting more disks in the RAID? How does
> a single SCSI disk perform here, Andrea?
Actually I increased readahead to more than just 512k in my last tree,
especially to take care of RAID :) so that both low-level disks will
get the 512k, instead of being limited to a 256k command each 8) This
might apply to hardware raid too.
Andrea
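(To make that arithmetic concrete, assuming a plain two-member stripe
with the readahead window split evenly across the members:
  512k window / 2 members = 256k command per low-level disk
  1M window   / 2 members = 512k command per low-level disk
so the window has to grow with the number of stripe members if each
low-level device is to keep seeing max-sized commands.)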
Andrew Morton wrote:
>Nick Piggin <[email protected]> wrote:
>
>>That is what I can't understand. Movement of the disk head should
>>be exactly the same in either situation and 128K is not exactly
>>a pitiful request size - so it suggests a quirk somewhere. It
>>is not as if the disk has to be particularly smart or know a
>>lot about the data in order to optimise the head movement for
>>a load like this.
>>
>
>Yes, that's a bit odd. Some reduction in CPU cost and bus
>traffic, etc would be expected. Could be that sending out a
>request which is larger than a track is saving a rev of the disk
>for some reason.
>
Shouldn't be. Even at 128KB readahead we should always have
outstanding requests against the disk in a streaming read
scenario, right? Maybe if the track buffers are bigger than
128K?
Is there a magic number above which you see the improvement,
Andrea? Or does it steadily climb?
Andrea Arcangeli wrote:
>On Mon, Feb 10, 2003 at 11:27:13PM +1100, Nick Piggin wrote:
>
>>Is there a magic number above which you see the improvement,
>>Andrea? Or does it steadily climb?
>>
>
>I recall more than 512k wasn't worthwhile anymore
>
Behaviour between 128 and 512 would be interesting then
Andrea Arcangeli wrote:
>On Mon, Feb 10, 2003 at 11:11:01PM +1100, Nick Piggin wrote:
>
>>I don't understand it at all. I mean there is no other IO going
>>
>
>Unfortunately I can't help you understand it, but this is what I found
>in my practical experience. I first ran into it on my alpha years ago,
>when I increased the sym driver to 512k in early 2.4; then, since it
>could break stuff, we added the max_sectors limit again in 2.4. But of
>course if you don't fix readahead there's no way reads can take
>advantage of these low-level fixes. I thought I had fixed readahead
>too, but it got backed out, and when I noticed I resurrected it in my
>tree (see the name of the patch ;)
>
Fair enough. I accept it is important. Still think it's odd ;)
[snip]
>>It would be easy to anticipate or not based on hints. We could
>>
>
>yep.
>
>
>>anticipate sync writes if we wanted, lower expire time for sync
>>writes, increase it for async reads. It is really not very
>>complex (although the code needs tidying up).
>>
>
>this is not the way I thought about it. I'm interested in giving a hint
>only to know for sure which reads are the intermediate sync dependent
>ones (the obvious example is doing the get_block and walking the 3
>levels of inode indirect metadata blocks with big files, or walking the
>balanced tree in reiserfs), and I'm not interested at all in writes.
>And I would just set a higher timeout when a read that I know for sure
>(thanks to the hint) is "intermediate" is completed. We can use high
>timeouts there because we know they won't trigger 90% of the time; a
>new dependent read will always be submitted first.
>
This is a lot of nitty gritty stuff. It will all help, especially
in corner cases. Luckily it seems you don't need such
infrastructure to demonstrate most anticipatory scheduler gains.
Andrea Arcangeli wrote:
>On Mon, Feb 10, 2003 at 11:34:11PM +1100, Nick Piggin wrote:
>
>
>>Andrea Arcangeli wrote:
>>
>>
>>
>>>On Mon, Feb 10, 2003 at 11:27:13PM +1100, Nick Piggin wrote:
>>>
>>>
>>>
>>>>Is there a magic number above which you see the improvement,
>>>>Andrea? Or does it steadily climb?
>>>>
>>>>
>>>>
>>>I recall more than 512k wasn't worthwhile anymore
>>>
>>>
>>>
>>Behaviour between 128 and 512 would be interesting then
>>
>>
>
>this is exactly what is shown by Randy's bonnie ;)
>
I don't see it... I mean e.g. 192, 256, 320...
Is the following a fair summary?
There is a certain minimum size required for the IOs to be cost
effective. This can be determined by single reader benchmarking. Only
readahead and not anticipatory scheduling addresses that.
Anticipatory scheduling does not address the application that spends one
minute processing every read that it makes. Readahead does.
Anticipatory scheduling does address the application that reads multiple
files that are near each other (because they are in the same directory),
and current readahead implementations (excepting reiser4 in progress
vaporware) do not.
Anticipatory scheduling can do a better job of avoiding unnecessary
reads for workloads with small time gaps between reads than readahead
(it is potentially more accurate for some workloads).
Is this a fair summary?
--
Hans
Andrew Morton wrote:
>Hans Reiser <[email protected]> wrote:
>
>
>>Is it true that there is no manpage available anywhere for fadvise?
>>
>>
>>
>
>It's pretty simple.
>
>http://www.opengroup.org/onlinepubs/007904975/functions/posix_fadvise.html
>
>It's also basically unimplementable without the radix tree, so I don't see how
>other systems can be doing it. Maybe they just lock up for a day when
>someone does fadvise64(fd, 0, -1, FADV_DONTNEED) ;)
>
>
>
>
>
Andrew, don't you think you should write a linux manpage and/or other
documentation when you add system calls like fadvise to linux? Also,
Andrew, your measurements would be a lot more understandable if you did
not use programs written by you that are not present in any linux
distro (do I understand that correctly?) without defining what they do.
Yes? ;-)
--
Hans
On Mon, Feb 10, 2003 at 11:34:11PM +1100, Nick Piggin wrote:
> Andrea Arcangeli wrote:
>
> >On Mon, Feb 10, 2003 at 11:27:13PM +1100, Nick Piggin wrote:
> >
> >>Is there a magic number above which you see the improvement,
> >>Andrea? Or does it steadily climb?
> >>
> >
> >I recall more than 512k wasn't worthwhile anymore
> >
> Behaviour between 128 and 512 would be interesting then
this is exactly what is shown by Randy's bonnie ;)
Andrea
On Mon, Feb 10, 2003 at 11:36:50PM +1100, Nick Piggin wrote:
> in corner cases. Luckily it seems you don't need such
> infrastructure to demonstrate most anticipatory scheduler gains.
sure. I guess it would be most important to be allowed to leave
anticipatory scheduling enabled *always*, even with very async-io
intensive workloads, so when you read a file with the normal fileutils
you can still take advantage of it in the most important places.
Ideally, in the case where without the hint we might have to disable
anticipatory scheduling completely to get peak async-io performance,
the non-hint timeout could be zero (i.e. no stall at all) and the
kernel would still understand and do the right thing.
Andrea
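A minimal sketch of the timeout selection being described above, in
plain user-space C rather than real elevator code; the structure and
the names (fs_hinted, antic_expire_hinted, and so on) are invented for
illustration only:

/* Toy model: pick the anticipation timeout per completed request.
 * A read that carried the filesystem hint ("a dependent read follows
 * almost immediately") earns a generous window; everything else gets
 * the default, which can be zero, i.e. no stall at all. */
struct done_request {
    int is_read;
    int fs_hinted;   /* set by the fs before it submits the metadata read */
};

static const unsigned int antic_expire_default = 0;   /* ms; 0 = never anticipate */
static const unsigned int antic_expire_hinted  = 20;  /* ms; rarely expires in practice */

static unsigned int anticipation_timeout(const struct done_request *rq)
{
    if (!rq->is_read)
        return 0;    /* never stall the queue on behalf of writes */
    return rq->fs_hinted ? antic_expire_hinted : antic_expire_default;
}

The generous timeout is only ever paid on reads the filesystem has
promised to chase immediately, so it can coexist with a default of
zero for everything else.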
On Mon, Feb 10, 2003 at 03:58:13PM +0300, Hans Reiser wrote:
> Is the following a fair summary?
>
> There is a certain minimum size required for the IOs to be cost
> effective. This can be determined by single reader benchmarking. Only
> readahead and not anticipatory scheduling addresses that.
>
> Anticipatory scheduling does not address the application that spends one
> minute processing every read that it makes. Readahead does.
>
> Anticipatory scheduling does address the application that reads multiple
> files that are near each other (because they are in the same directory),
> and current readahead implementations (excepting reiser4 in progress
> vaporware) do not.
>
> Anticipatory scheduling can do a better job of avoiding unnecessary
> reads for workloads with small time gaps between reads than readahead
> (it is potentially more accurate for some workloads).
>
> Is this a fair summary?
I would also add what I feel is the most important thing: anticipatory
scheduling can help decrease a lot the latencies of filesystem reads in
the presence of lots of misc I/O, by knowing which read commands are
intermediate-dependent-sync, meaning that a new dependent read will be
submitted very quickly as soon as the current read-I/O is completed. In
such a case it makes lots of sense to wait for the next read to be
submitted instead of immediately starting to process the rest of the
I/O queue. This way, when you read a file and you have to walk the
indirect blocks before being able to read the data, you won't be
greatly penalized against the writes or reads that don't need to pass
through metadata I/O before being served.
This doesn't obviate the need for SFQ in the pathological multimedia
cases where I/O latency is the first priority, or in workloads where
sync-write latency is the first priority.
BTW, I'm also thinking that SFQ could be selected not system wide but
on a per-process basis, with a prctl or something, so you could have
all the I/O going into the single default async-io queue except for
mplayer, which would use the SFQ queues. This may not even need to be
privileged, since SFQ is fair and if a user can already write a lot it
can just hurt latency; with SFQ it could hurt more, but still not
deadlock. This SFQ prctl for the I/O scheduler would be very similar
to the RT policy for the main process scheduler. In fact it may be
best to just have SFQ always enabled, selectable only via the SFQ
prctl, and to enable the functionality only on a per-process basis
rather than globally. We can still add a sysctl to enable it globally
even if nobody set the per-process flag.
Andrea
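For what it's worth, the userspace side of the knob being floated here
could look roughly like the sketch below. PR_SET_IO_SFQ is invented
purely for illustration (no such prctl exists); the only point is that
a single unprivileged call before exec'ing mplayer would flip that
process onto the SFQ queues while everything else stays in the default
async queue.

#include <stdio.h>
#include <sys/prctl.h>
#include <unistd.h>

#define PR_SET_IO_SFQ 0x53465121    /* hypothetical request number */

int main(int argc, char **argv)
{
    /* opt this process (and whatever it execs) into per-process SFQ I/O */
    if (prctl(PR_SET_IO_SFQ, 1, 0, 0, 0) != 0)
        perror("prctl(PR_SET_IO_SFQ)");    /* fails on any real kernel today */

    if (argc > 1)
        execvp(argv[1], argv + 1);    /* e.g. ./iosfq mplayer file.avi */
    return 0;
}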
Hans Reiser wrote:
> Is the following a fair summary?
>
> There is a certain minimum size required for the IOs to be cost
> effective. This can be determined by single reader benchmarking.
> Only readahead and not anticipatory scheduling addresses that.
Well as a rule you would gain efficiency the larger your request size gets.
There will be a point where you make the trade off. And no, anticipatory
scheduling will not make read request sizes bigger.
>
>
> Anticipatory scheduling does not address the application that spends
> one minute processing every read that it makes. Readahead does.
In the presence of other IO, anticipatory scheduling would help here. In the
absence of other IO, readahead would not help (past the efficiency problem
above). However, anticipatory scheduling _would_ mean you don't need as much
RAM tied up doing nothing (or being discarded) for one minute before it
is needed.
>
>
> Anticipatory scheduling does address the application that reads
> multiple files that are near each other (because they are in the same
> directory), and current readahead implementations (excepting reiser4
> in progress vaporware) do not.
File readahead would not help. Anticipatory scheduling can.
>
>
> Anticipatory scheduling can do a better job of avoiding unnecessary
> reads for workloads with small time gaps between reads than readahead
> (it is potentially more accurate for some workloads).
It avoids seeks mainly, but the lesser need for readahead should mean
the readahead algorithm doesn't need to be very smart.
>
>
> Is this a fair summary?
Well I don't see it so much as readahead vs anticipatory scheduling.
I know readahead is important.
On Mon, 10 Feb 2003, Andrea Arcangeli wrote:
> On Mon, Feb 10, 2003 at 10:45:59PM +1100, Nick Piggin wrote:
> > perspective it does nullify the need for readahead (though
> > it is obviously still needed for other reasons).
>
> I'm guessing that physically it may be needed from a head perspective
> too, I doubt it only has to do with the in-core overhead. Seeing it all
> before reaching the seek point might allow the disk to do smarter things
> NOTE: just to be sure, I'm not at all against anticipatory scheduling,
Most disks seem to have a large cache, but with the cache unit
for most of the cache being one _track_ at a time.
This has the effect of the disk reading one track in at a time,
but only being able to cache a few of these tracks in its cache.
Anticipatory scheduling should reduce any thrashing of this disk
cache.
regards,
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://guru.conectiva.com/
Current spamtrap: [email protected]
On Mon, 10 Feb 2003, Andrew Morton wrote:
> Could be that sending out a request which is larger than a track is
> saving a rev of the disk for some reason.
I guess disks are optimised for the benchmarks that are
run by popular PC magazines ...
After all, those benchmarks and the sales price are the
main factors determining sales ;)
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://guru.conectiva.com/
Current spamtrap: [email protected]
>> You can wait 10 minutes and still such command can't grow. This is why
>> claiming anticipatory scheduling can decrease the need for readahead
>> doesn't make much sense to me, there are important things you just can't
>> achieve by only waiting.
>>
>
> The anticipatory scheduler can easily permit 512k of reading before seeking
> away to another file. In fact it can allow much more, without requiring that
> readahead be cranked up.
IMHO anticipatory scheduling and readahead address different problems.
RA is simpler and cheaper. Reading a few more KB comes almost for free
and that helps sequential reads a lot. AS is useful for random i/o (fs
metadata, executables, ...), but it wastes time if the timer expires,
or if the new request wants data which is far away from the previous
one. AS is useful for sequential reads too, but only if the application
reads large chunks of data; otherwise RA is better. I think we need
both RA and AS to address the largest variety of workloads.
Bye.
On Mon, Feb 10, 2003 at 03:49:59PM +0100, Giuliano Pochini wrote:
>
> >> You can wait 10 minutes and still such command can't grow. This is why
> >> claiming anticipatory scheduling can decrease the need for readahead
> >> doesn't make much sense to me, there are important things you just can't
> >> achieve by only waiting.
> >>
> >
> > The anticipatory scheduler can easily permit 512k of reading before seeking
> > away to another file. In fact it can allow much more, without requiring that
> > readahead be cranked up.
>
> IMHO anticipatory scheduling and readahead address different problems.
> RA is simpler and cheaper. Reading a few more KB comes almost for free
> and that helps sequential reads a lot. AS is useful for random i/o (fs
> metadata, executables, ...), but it wastes time if the timer expires,
> or if the new request wants data which is far away from the previous
> one. AS is useful for
Yep.
> sequential reads too, but only if the application reads large chunks
> of data; otherwise RA is better. I think we need both RA and AS to
> address the
actually, even if the app reads large chunks of data RA is needed: the
size of the read/io_submit syscalls won't reach readpage/wait_on_page
at the pagecache layer, so w/o readahead we would read PAGE_SIZE per
dma and not more (ignoring PAGE_CACHE_SIZE of course, which isn't going
to be enough anyway due to mem fragmentation issues)
> largest variety of workloads.
>
> Bye.
Andrea
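(Concretely: with 4 KiB pages, a 512 KiB read() that bypasses readahead
ends up, roughly speaking, as 512/4 = 128 separate page-sized DMAs, each
waited on at the pagecache layer before the next page is looked at,
whereas a 512 KiB readahead window has all 128 pages in flight before
the first wait, which is what lets the low level merge them into one
max-sized command.)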
Andrea Arcangeli wrote:
>On Mon, Feb 10, 2003 at 03:58:13PM +0300, Hans Reiser wrote:
>
>
>>Is the following a fair summary?
>>
>>There is a certain minimum size required for the IOs to be cost
>>effective. This can be determined by single reader benchmarking. Only
>>readahead and not anticipatory scheduling addresses that.
>>
>>Anticipatory scheduling does not address the application that spends one
>>minute processing every read that it makes. Readahead does.
>>
>>Anticipatory scheduling does address the application that reads multiple
>>files that are near each other (because they are in the same directory),
>>and current readahead implementations (excepting reiser4 in progress
>>vaporware) do not.
>>
>>Anticipatory scheduling can do a better job of avoiding unnecessary
>>reads for workloads with small time gaps between reads than readahead
>>(it is potentially more accurate for some workloads).
>>
>>Is this a fair summary?
>>
>>
>
>I would also add what I feel is the most important thing: anticipatory
>scheduling can help decrease a lot the latencies of filesystem reads in
>the presence of lots of misc I/O, by knowing which read commands are
>intermediate-dependent-sync, meaning that a new dependent read will be
>submitted very quickly as soon as the current read-I/O is completed.
>
Ah.... yes... I have seen that be a problem, and reiserfs suffers from
it more than ext2....
maybe we should simply have an explicit protocol for that... I would be
quite willing to have reiserfs use some flag that says
WILL_READ_NEARBY_NEXT whenever it reads an internal node to get the
location of its children.... that might be a lot cleaner than trying to
do it all in your layer... what do you guys think...
Perhaps it will be both less code, and a better result in that it avoids
some unnecessary lingering....
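A toy sketch of the kind of hint being proposed here: WILL_READ_NEARBY_NEXT
is the flag name suggested above, while the buffer structure and helpers
below are made up so the fragment stands alone. It is not reiserfs or
block-layer code, only an illustration of where such a flag would be set:

/* The fs marks a read whose result it will chase immediately (an
 * internal tree node lookup), so the I/O scheduler knows a dependent
 * read follows as soon as this one completes. */
#define WILL_READ_NEARBY_NEXT  0x1

struct toy_buffer {
    unsigned long block;
    unsigned int  io_flags;
};

static void toy_submit_read(struct toy_buffer *bh)
{
    /* stand-in for submit_bh(); a real implementation would carry
     * io_flags down to the request so the anticipation timeout can be
     * chosen per request, as discussed earlier in the thread */
    (void)bh;
}

static void read_internal_node(struct toy_buffer *bh)
{
    bh->io_flags |= WILL_READ_NEARBY_NEXT;   /* child lookup follows at once */
    toy_submit_read(bh);
}

static void read_leaf_data(struct toy_buffer *bh)
{
    bh->io_flags &= ~WILL_READ_NEARBY_NEXT;  /* nothing depends on this read */
    toy_submit_read(bh);
}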
> In such a case it makes lots of sense to
>wait for the next read to be submitted instead of immediately starting
>to process the rest of the I/O queue. This way, when you read a file
>and you have to walk the indirect blocks before being able to read the
>data, you won't be greatly penalized against the writes or reads that
>don't need to pass through metadata I/O before being served.
>
>This doesn't obviate the need for SFQ in the pathological multimedia
>cases where I/O latency is the first priority, or in workloads where
>sync-write latency is the first priority.
>
>BTW, I'm also thinking that SFQ could be selected not system wide but
>on a per-process basis, with a prctl or something, so you could have
>all the I/O going into the single default async-io queue except for
>mplayer, which would use the SFQ queues. This may not even need to be
>privileged, since SFQ is fair and if a user can already write a lot it
>can just hurt latency; with SFQ it could hurt more, but still not
>deadlock. This SFQ prctl for the I/O scheduler would be very similar
>to the RT policy for the main process scheduler. In fact it may be
>best to just have SFQ always enabled, selectable only via the SFQ
>prctl, and to enable the functionality only on a per-process basis
>rather than globally. We can still add a sysctl to enable it globally
>even if nobody set the per-process flag.
>
>Andrea
>
>
>
>
I still don't really understand your SFQ design, probably because I
haven't studied the recent networking algorithms that your description
assumed I understood.
--
Hans
On Mon, Feb 10, 2003 at 10:02:48AM +0100, Andrea Arcangeli wrote:
> On Mon, Feb 10, 2003 at 07:27:51PM +1100, Nick Piggin wrote:
> It doesn't make any sense to me your claim that you can decrease the
> readahead by adding anticipatory scheduling, if you do you'll run
> so slow at 8k per request in all common workloads.
Readahead kills seeks and command overhead at the expense of maybe
transferring data needlessly over the bus and consuming RAM.
AS kills seeks. (At the expense of delaying some IO a tiny bit.)
If unnecessary seeks are the main problem, with AS a smaller READA is
possible. If command overhead is a problem, READA needs to be large.
Regards,
--
Kurt Garloff <[email protected]> Eindhoven, NL
GPG key: See mail header, key servers SuSE Labs (Head)
SuSE Linux AG, Nuernberg, DE SCSI, Security
On Mon, 10 Feb 2003, Kurt Garloff wrote:
> Readahead kills seeks and command overhead at the expense of maybe
> transferring data needlessly over the bus and consuming RAM.
>
> AS kills seeks. (At the expense of delaying some IO a tiny bit.)
>
> If unnecessary seeks are the main problem, with AS a smaller READA is
> possible. If command overhead is a problem, READA needs to be large.
Think about reading inode blocks or a bunch of related .h
files. We can't nicely do readahead in either of these
scenarios, but AS should take care of them nicely...
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://guru.conectiva.com/
Current spamtrap: [email protected]
Hi!
> The design I proposed is to have multiple I/O queues, where to apply the
> elevator, and to choose the queue in function of the task->pid that is
> sumbitting the bh/bio. You'll have to apply an hash to the pid and you
> probably want a perturbation timer that will change the hash function
> every 30 sec or so. Plus I want a special queue for everything
> asynchronoys. So that the asynchronoys queue will be elevated and
What about adding a special queue (queues?) for niced things? That way
a nice -20 job would not interfere with normal processes, not even for
I/O.
Pavel
--
Worst form of spam? Adding advertisement signatures ala sourceforge.net.
What goes next? Inserting advertisement *into* email?
On Mon, Feb 10 2003, Pavel Machek wrote:
> Hi!
>
> > The design I proposed is to have multiple I/O queues, where to apply the
> > elevator, and to choose the queue in function of the task->pid that is
> > sumbitting the bh/bio. You'll have to apply an hash to the pid and you
> > probably want a perturbation timer that will change the hash function
> > every 30 sec or so. Plus I want a special queue for everything
> > asynchronoys. So that the asynchronoys queue will be elevated and
>
> What about adding a special queue (queues?) for niced things? That way
> nice -20 job would not interfere with normal processes, not even for
> I/O.
Once you have per-process queues, it's fairly trivial to add stuff like
that.
--
Jens Axboe
Hi!
> The design I proposed is to have multiple I/O queues, where to apply the
> elevator, and to choose the queue in function of the task->pid that is
> sumbitting the bh/bio. You'll have to apply an hash to the pid and
> you
Well, if you want a *fair* scheduler, as in "fair between users", I
guess you should base it on task->uid.
That should solve your dbench problems, too, as you are very unlikely
to see two tasks working on the same data with different uids.
Pavel
--
Worst form of spam? Adding advertisement signatures ala sourceforge.net.
What goes next? Inserting advertisement *into* email?
On Mon, Feb 10, 2003 at 05:23:02PM +0100, Pavel Machek wrote:
> Hi!
>
> > The design I proposed is to have multiple I/O queues, where to apply the
> > elevator, and to choose the queue in function of the task->pid that is
> > sumbitting the bh/bio. You'll have to apply an hash to the pid and
> > you
>
> Well, if you want *fair* scheduler, as in "fair between users", I
> guess you should base it on task->uid.
Good idea. All these cases should be optional, and they make plenty of
sense to me.
> That should solve your dbench problems, too, as you are very unlikely
> to see two tasks working over same data with different uids.
yes, but the main reason we do this isn't for multiuser systems. For
multiuser systems it's true it would solve dbench, but there must be an
option to go with task->pid, and being able to specify the latency
critical tasks from userspace would be a good idea too.
Andrea
On Tue, Feb 11 2003, Andrea Arcangeli wrote:
> On Mon, Feb 10, 2003 at 05:23:02PM +0100, Pavel Machek wrote:
> > Hi!
> >
> > > The design I proposed is to have multiple I/O queues, where to apply the
> > > elevator, and to choose the queue in function of the task->pid that is
> > > sumbitting the bh/bio. You'll have to apply an hash to the pid and
> > > you
> >
> > Well, if you want *fair* scheduler, as in "fair between users", I
> > guess you should base it on task->uid.
>
> Good idea. All these cases should be optional, and they make plenty of
> sense to me.
Coolest would be to simply stack these schedulers any way you want.
Sneak the uid based fairness scheduler in front of the pid based one,
and you have per-user with per-process fairness.
Let's lego.
--
Jens Axboe
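A rough sketch of the stacking described above, folding in the salted
pid hash from the SFQ proposal quoted earlier in the thread: route by
uid first, then by pid within that user's bucket. Plain user-space C
with made-up structures; it only models how a request would pick its
queue, not a real elevator.

#include <stdint.h>

#define NR_USER_BUCKETS     16
#define NR_QUEUES_PER_USER   8

struct submitter {
    uint32_t uid;
    uint32_t pid;
};

static uint32_t sfq_salt;    /* reseeded every ~30s by a perturbation timer */

static uint32_t hash32(uint32_t v, uint32_t salt)
{
    return (v ^ salt) * 2654435761u;    /* Knuth multiplicative hash */
}

static unsigned int pick_queue(const struct submitter *s)
{
    unsigned int user_bucket = hash32(s->uid, sfq_salt) % NR_USER_BUCKETS;
    unsigned int task_queue  = hash32(s->pid, sfq_salt) % NR_QUEUES_PER_USER;

    /* per-user fairness first, per-process fairness within the user */
    return user_bucket * NR_QUEUES_PER_USER + task_queue;
}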
[email protected] said:
> Coolest would be to simply stack these schedulers any way you want. Sneak
> the uid based fairness scheduler in front of the pid based one, and
> you have per-user with per-process fairness.
Which again reminds us of the network queueing. You all seem to be
reinventing alexey's wheel here. The above reminds me of HTB with SFQ
leaf nodes.
By all means, do the same thing with disk i/o. It's been a smashing
success with packet queueing.
Jason
On Tue, Feb 11 2003, Jason Lunz wrote:
> [email protected] said:
> > Coolest would be to simply stack these schedulers any way you want. Sneak
> > the uid based fairness scheduler in front of the pid based one, and
> > you have per-user with per-process fairness.
>
> Which again reminds us of the network queueing. You all seem to be
> reinventing alexey's wheel here. The above reminds me of HTB with SFQ
> leaf nodes.
There's no wheel reinventing here, just applying the goodies from
network scheduling to disk scheduling.
> By all means, do the same thing with disk i/o. It's been a smashing
> success with packet queueing.
Well, that's the point.
--
Jens Axboe
[email protected] said:
>> By all means, do the same thing with disk i/o. It's been a smashing
>> success with packet queueing.
>
> Well, that's the point.
Yes, what you've done with cbq is great. What I was referring to,
though, is the user configurability of network frame queueing. It's
possible to do really complex things for very specialized needs, yet
also easy to put in a simple tweak if there's just one type of traffic
you need to prioritize. It'd be nice to have that kind of
configurability for unusual i/o loads, and the arbitrary queue stacking
is a whole different beast than having a couple of tunables to tweak.
Jason
On Tue, Feb 11 2003, Jason Lunz wrote:
> [email protected] said:
> >> By all means, do the same thing with disk i/o. It's been a smashing
> >> success with packet queueing.
> >
> > Well, that's the point.
>
> Yes, what you've done with cbq is great. What I was referring to,
> though, is the user configurability of network frame queueing. It's
> possible to do really complex things for very specialized needs, yet
> also easy to put in a simple tweak if there's just one type of traffic
> you need to prioritize. It'd be nice to have that kind of
> configurability for unusual i/o loads, and the arbitrary queue stacking
> is a whole different beast than having a couple of tunables to tweak.
That is indeed the goal. We'll see how much is doable within the 2.6
time frame, though.
--
Jens Axboe
On Mon, 2003-02-10 at 05:30, ext Rik van Riel wrote:
> On Mon, 10 Feb 2003, Andrew Morton wrote:
>
> > Could be that sending out a request which is larger than a track is
> > saving a rev of the disk for some reason.
>
> I guess disks are optimised for the benchmarks that are
> run by popular PC magazines ...
Yes, absolutely they are (I used to work for one such company that no
longer exists). However, there's a catch -- the disk drive companies
are optimizing for the test as currently constituted, but the test keeps
changing. So they're optimizing for version N-1 while the magazine is
publishing N. Can't help it. Ever heard of the Nyquist frequency?
>
> After all, those benchmarks and the sales price are the
> main factors determining sales ;)
>
Yup. In general, it's capacity first, then reliability, then
performance. No, wait, not quite; the FIRST criterion is the ability to
ship when you say you will. If you make Dell late on shipping a bunch
of machines, you WILL feel the pain.
--Rod