2003-02-22 05:33:03

by Andi Kleen

[permalink] [raw]
Subject: [[email protected]: Re: iosched: impact of streaming read on read-many-files]

Sorry, I typoed the linux-kernel address on the first try.

----- Forwarded message from Andi Kleen <[email protected]> -----

To: Andrew Morton <[email protected]>
Cc: [email protected], [email protected]
Subject: Re: iosched: impact of streaming read on read-many-files
From: Andi Kleen <[email protected]>
Date: 22 Feb 2003 06:40:41 +0100
In-Reply-To: Andrew Morton's message of "21 Feb 2003 22:21:19 +0100"

Andrew Morton <[email protected]> writes:
>
> The correct way to design such an application is to use an RT thread to
> perform the display/audio device I/O and a non-RT thread to perform the disk
> I/O. The disk IO thread keeps the shared 8 megabyte buffer full. The RT
> thread mlocks that buffer.

This requires making xmms suid root. Do you really want to do that?

Also, who takes care of all the security holes that would cause?

If you require applications doing such things to work around VM/scheduler
breakage, you would first need to make RT and mlock available in a controlled
way to non-root applications (and no, capabilities are not enough:
Linux capabilities are designed in such a way that once you have one
capability you can usually get all of them soon).

I thought for a long time about making some limited mlock available to
user processes, e.g. subject to a ulimit. bash already seems to have
a ulimit setting for mlocked memory, but the kernel doesn't support it.
Of course it would need to be per user id, not per process, but the
kernel already keeps a per-user-id structure, so that won't be a big
issue. This would be useful for cryptographic applications like gpg
that don't want their secret key data to hit swap space.

Still, mlock'ing 8MB even in a controlled way would be too nasty.

For RT it is more difficult. One way I can think of to make it
available in a controlled way is to define subgroups of threads or
processes (perhaps per user id again) and have RT properties only
inside these groups. Another possible way would be to define RT with a
minimum CPU time, so that e.g. user A's processes with local RT
priority could take up to 5% of the CPU time of the box, or similar.
Still, mating that with latency guarantees would be rather hard (such
fair schedulers tend to be good at evening out CPU time shares over a
longer time, but not good at guaranteeing short latencies after a
given event). And xmms cares just about latencies, not about long-term
CPU share.

But both are quite a bit of work, especially the latter, and may impact
other loads. Fixing the I/O scheduler for them is probably easier.

-Andi

----- End forwarded message -----


2003-02-22 06:57:05

by Andrew Morton

[permalink] [raw]
Subject: Re: [[email protected]: Re: iosched: impact of streaming read on read-many-files]

Andi Kleen <[email protected]> wrote:
>
> ..
> >
> > The correct way to design such an application is to use an RT thread to
> > perform the display/audio device I/O and a non-RT thread to perform the disk
> > I/O. The disk IO thread keeps the shared 8 megabyte buffer full. The RT
> > thread mlocks that buffer.
>
> This requires making xmms suid root. Do you really want to do that?
>
> Also, who takes care of all the security holes that would cause?
>
> If you require applications doing such things to work around VM/scheduler
> breakage

It is utterly unreasonable to characterise this as "breakage".

No, we do not really need to implement RLIM_MEMLOCK for such applications.
They can leave their memory unlocked for any reasonable loads.

Yes, we _do_ need to give these applications at least elevated scheduling
priority, if not policy, so they get the CPU in a reasonable period of
time.

> But both are quite a bit of work, especially the latter, and may impact
> other loads. Fixing the I/O scheduler for them is probably easier.
>

You have not defined "fix". An IO scheduler which attempts to serve every
request within ten milliseconds is an impossibility. Attempting to
achieve it will result in something which seeks all over the place.

The best solution is to implement five or ten seconds worth of buffering
in the application and for the kernel to implement a high throughput general
purpose I/O scheduler which does not suffer from starvation.

2003-02-22 17:00:27

by Alan

[permalink] [raw]
Subject: Re: [[email protected]: Re: iosched: impact of streaming read on read-many-files]

On Sat, 2003-02-22 at 07:07, Andrew Morton wrote:
> No, we do not really need to implement RLIM_MEMLOCK for such applications.
> They can leave their memory unlocked for any reasonable loads.
>
> Yes, we _do_ need to give these applications at least elevated scheduling
> priority, if not policy, so they get the CPU in a reasonable period of
> time.

It isn't about CPU, it's about disk. If the app gets a code page swapped
out, then unless we have disk as well as CPU priority, or we do the memlock
stuff, you are screwed. I guess the obvious answer with 2.5 is to
simply abuse the futex bugs and lock down futex pages with code in them ;)

2003-02-22 18:12:53

by Ingo Oeser

[permalink] [raw]
Subject: Re: [[email protected]: Re: iosched: impact of streaming read on read-many-files]

On Fri, Feb 21, 2003 at 11:07:16PM -0800, Andrew Morton wrote:
> You have not defined "fix". An IO scheduler which attempts to serve every
> request within ten milliseconds is an impossibility. Attempting to
> achieve it will result in something which seeks all over the place.
>
> The best solution is to implement five or ten seconds worth of buffering
> in the application and for the kernel to implement a high throughput general
> purpose I/O scheduler which does not suffer from starvation.

What about implementing I/O requests which can time out? If a request
will not be serviced in time, or we know that it will not be serviced
in time, we can skip it.

This could easily be fitted into the AIO API by cancelling requests
which are older than a specified time. Just attach a jiffies value to
each request and add a new syscall like io_cancel but with a starting
time attached. Or even make it a property of the AIO list we are
currently handling and use a kernel timer.

That way we could help streaming applications and the kernel
itself (by reducing its I/O requests) at the same time.

Combined with your buffering suggestion, this would help in cases
where the system is under high load and cannot satisfy these
applications anyway.

What do you think?

Regards

Ingo Oeser
--
Science is what we can tell a computer. Art is everything else. --- D.E.Knuth

2003-02-23 15:23:17

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [[email protected]: Re: iosched: impact of streaming read on read-many-files]

On Fri, Feb 21, 2003 at 11:07:16PM -0800, Andrew Morton wrote:
> request within ten milliseconds is an impossibility. Attempting to
> achieve it will result in something which seeks all over the place.

This is called SFQ, or CFQ with a dispatch queue level of 0, and it
works fine (given a fixed number of tasks doing I/O). You still don't
understand that you don't care about throughput and seeks if you only
need to read 1 block with maximum fairness and you don't mind reading
another block within 1 second. Seeking is the last problem here;
waiting more than 1 second is the only problem here.

Andrea

2003-02-23 15:29:07

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [[email protected]: Re: iosched: impact of streaming read on read-many-files]

On Sat, Feb 22, 2003 at 02:57:28PM +0100, Ingo Oeser wrote:
> On Fri, Feb 21, 2003 at 11:07:16PM -0800, Andrew Morton wrote:
> > You have not defined "fix". An IO scheduler which attempts to serve every
> > request within ten milliseconds is an impossibility. Attempting to
> > achieve it will result in something which seeks all over the place.
> >
> > The best solution is to implement five or ten seconds worth of buffering
> > in the application and for the kernel to implement a high throughput general
> > purpose I/O scheduler which does not suffer from starvation.
>
> What about implementing I/O requests which can time out? If a request
> will not be serviced in time, or we know that it will not be serviced
> in time, we can skip it.
>
> This could easily be fitted into the AIO API by cancelling requests
> which are older than a specified time. Just attach a jiffies value to
> each request and add a new syscall like io_cancel but with a starting
> time attached. Or even make it a property of the AIO list we are
> currently handling and use a kernel timer.
>
> That way we could help streaming applications and the kernel
> itself (by reducing its I/O requests) at the same time.
>
> Combined with your buffering suggestion, this would help in cases
> where the system is under high load and cannot satisfy these
> applications anyway.
>
> What do you think?

that works only if the congestion comes from multimedia apps that are
willing to cancel the timed-out (now worthless) I/O, which is normally
never the case due to the low I/O load they generate (usually it's the
apps not going to cancel their I/O that congest the blkdev layer).

Still, it's a good idea; you're basically asking to implement the
cancel AIO API, and I doubt anybody could disagree with that ;).

Andrea

2003-02-23 19:22:09

by Ingo Oeser

[permalink] [raw]
Subject: Re: [[email protected]: Re: iosched: impact of streaming read on read-many-files]

On Sun, Feb 23, 2003 at 04:40:02PM +0100, Andrea Arcangeli wrote:
> On Sat, Feb 22, 2003 at 02:57:28PM +0100, Ingo Oeser wrote:
> > What about implementing I/O requests which can time out? If a request
> > will not be serviced in time, or we know that it will not be serviced
> > in time, we can skip it.
>
> that works only if the congestion comes from multimedia apps that are
> willing to cancel the timed-out (now worthless) I/O, which is normally
> never the case due to the low I/O load they generate (usually it's the
> apps not going to cancel their I/O that congest the blkdev layer).

I would restrict it to loads where it doesn't matter that all streams
are serviced all the time, and where it matters more that we stay
responsive and show the latest frames available.

> still, it's a good idea; you're basically asking to implement the
> cancel AIO API, and I doubt anybody could disagree with that ;).

I'm aware of these, but if we are so heavily busy that the
application is already losing I/O frames, then calling into an
application (which might be swapped out) to traverse its AIO list
for the cancel makes no sense any more.

I want a deadline attached to them and to have them automatically
cancelled after this deadline (that's why I quoted the relevant
part of my e-mail again).

I can see many uses besides multimedia applications. An http or ftp
server could also do this if it is busier than expected or if a
connection dropped.

Regards

Ingo Oeser
--
Science is what we can tell a computer. Art is everything else. --- D.E.Knuth

2003-02-23 23:21:33

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [[email protected]: Re: iosched: impact of streaming read on read-many-files]

On Sun, Feb 23, 2003 at 08:05:08PM +0100, Ingo Oeser wrote:
> On Sun, Feb 23, 2003 at 04:40:02PM +0100, Andrea Arcangeli wrote:
> > On Sat, Feb 22, 2003 at 02:57:28PM +0100, Ingo Oeser wrote:
> > > What about implementing I/O requests which can time out? If a request
> > > will not be serviced in time, or we know that it will not be serviced
> > > in time, we can skip it.
> >
> > that works only if the congestion comes from multimedia apps that are
> > willing to cancel the timed-out (now worthless) I/O, which is normally
> > never the case due to the low I/O load they generate (usually it's the
> > apps not going to cancel their I/O that congest the blkdev layer).
>
> I would restrict it to loads where it doesn't matter that all streams
> are serviced all the time, and where it matters more that we stay
> responsive and show the latest frames available.
>
> > still, it's a good idea; you're basically asking to implement the
> > cancel AIO API, and I doubt anybody could disagree with that ;).
>
> I'm aware of these, but if we are so heavily busy that the
> application is already losing I/O frames, then calling into an

Note that we're dealing with a disk that is heavily busy, since requests
are timing out, but we're not necessarily CPU-busy at all. The
cancellation is a completely disk-less operation; it runs at CPU core
speed. A simple getevents with a timeout + io_cancel will do the trick
just fine. Sure, this way the cancellation has to run a bit of userspace
code too, but it shouldn't make much difference in practice (at least
unless there's a heavy CPU or swap overload too that introduces a huge
time gap between the getevents timeout and the io_cancel entering the
kernel). Your API makes sense too, but it's just a different API, and
I'm not sure it is worth implementing in light of the above
considerations, given we already have the io_cancel syscall. Also note
that the cancellation API makes sense even if we don't work in the
disk-queue layer; in fact, unless you have a 4G worthless I/O queue like
in Andrew's example, it probably won't make a huge difference to cancel
the I/O in the queue too, even if of course it would make perfect sense
to do that as well. It is especially difficult to cancel the I/O in the
disk queue at the moment because we can't tell which requests are ours
and which are somebody else's. The same goes for your cancellation API;
it would be difficult to cancel there. Everything in the queue is
asynchronous and not everybody will wait on its stuff immediately.

> application (which might be swapped out) to traverse its AIO list
> for the cancel makes no sense any more.

again, we're not necessarily swapped out; the AIO API should take care
of the I/O congestion, not necessarily of the VM and CPU congestion.
And for the I/O congestion it looks sane to me even if it has to pass
through userspace; passing through userspace may even be an advantage
sometimes (so you can choose how much to cancel, or whatever).

> I want a deadline attached to them and to have them automatically
> cancelled after this deadline (that's why I quoted the relevant
> part of my e-mail again).

Sure I understood your point.

> I can see many uses besides multimedia applications. An http or ftp
> server could also do this if it is busier than expected or if a
> connection dropped.

Not sure how much you can save; disconnects shouldn't be that frequent,
and filling the cache may make sense anyway, especially for http. The
timeouts are also quite long: by the time the network timeout triggers,
the I/O likely completed a long time ago, and it wouldn't be easy to
synchronize the network timeout with the request's submit-time timeout.
I believe cancellation matters for interactive applications, and http
and ftp aren't interactive; I mean, there's no throughput requirement
for http/ftp, no frame rate to keep up with. If flash or java takes
longer to load because your network is slower, you'll just wait longer,
etc.

Andrea