2003-01-24 03:41:25

by Andrew Morton

Subject: 2.5.59-mm5


http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.59/2.5.59-mm5/

. -mm3 and -mm4 were not announced - they were sync-up patches as we
worked on the I/O scheduler.

. -mm5 has the first cut of Nick Piggin's anticipatory I/O scheduler.
Here's the scoop:

The problem being addressed here is (mainly) kernel behaviour when there
is a stream of writeout happening, and someone submits a read.

In 2.4.x, the disk queues contain up to 30 megabytes of writes (say, one
second's worth). When a read is submitted, the 2.4 I/O scheduler will try
to insert that at the right place between the writes. Usually, there is no
right place and the read is appended to the queue. That is: it will be
serviced in one second.

But the problem with reads is that they are dependent - neither the
application nor the kernel can submit read #N until read #N-1 has
completed. So something as simple as

cat /usr/src/linux/kernel/*.c > /dev/null

requires several hundred dependent reads. And in the presence of a
streaming write, each and every one of those reads gets stuck at the end of
the queue, and takes a second to propagate to the head. The `cat' takes
hundreds of seconds.
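
To make the dependency concrete, here is a minimal userspace sketch (not
from the original mail) of the loop `cat' effectively runs; the function
name and buffer size are illustrative:

#include <fcntl.h>
#include <unistd.h>

/* Each read() cannot be issued until the previous one returns, so a
 * request queue full of writes adds its full latency to every single
 * iteration of this loop. */
static int slurp(const char *path)
{
	char buf[4096];
	ssize_t n;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	while ((n = read(fd, buf, sizeof(buf))) > 0)
		;	/* discard, as when redirecting to /dev/null */
	close(fd);
	return n < 0 ? -1 : 0;
}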

The celebrated read-latency2 patch recognises the fact that appending a
read to a tail of writes is dumb, and puts the read near the head of the
queue of writes. It provides an improvement of up to 30x. The deadline
I/O scheduler in 2.5 does the same thing: if reads are queued up, promote
them past writes, even if those writes have been waiting longer.
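
A rough, self-contained sketch of that policy (illustrative names only,
not the actual deadline scheduler code):

struct pending_rq {
	unsigned long sector;	/* where on disk this request points */
};

/*
 * Both FIFOs are ordered by arrival time.  Any queued read is dispatched
 * ahead of queued writes, with a small cap so a stream of reads cannot
 * starve writes forever.
 */
static struct pending_rq *pick_next(struct pending_rq *reads, int nr_reads,
				    struct pending_rq *writes, int nr_writes,
				    int *writes_starved)
{
	if (nr_reads && (!nr_writes || *writes_starved < 2)) {
		(*writes_starved)++;
		return &reads[0];	/* promote the read past older writes */
	}
	*writes_starved = 0;
	if (nr_writes)
		return &writes[0];
	return NULL;
}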


So far so good, but these fixes are still dumb. Because we're solving
the dependent read problem by creating a seek storm. Every time someone
submits a read, we stop writing, seek over and service the read, and then
*immediately* seek back and start servicing writes again.

But in the common case, the application which submitted a read is about
to go and submit another one, closeby on-disk to the first. So whoops, we
have to seek back to service that one as well.


So what anticipatory scheduling does is very simple: if an application
has performed a read, do *nothing at all* for a few milliseconds. Just
return to userspace (or to the filesystem) in the expectation that the
application or filesystem will quickly submit another read which is
closeby.

If the application _does_ submit the read then fine - we service that
quickly. If it does not submit a read then we lose. Time out and go back
to doing writes.
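
A minimal sketch of that decision (the names and the timeout value are
illustrative, not Nick's actual implementation):

#define ANTIC_EXPIRE_MS	6	/* assumed anticipation window, for illustration */

enum antic_state { ANTIC_OFF, ANTIC_WAITING };

struct antic {
	enum antic_state state;
	unsigned long timeout;	/* when to give up and resume writes (ms) */
};

/* Called when a read completes for some task: start anticipating. */
static void antic_arm(struct antic *a, unsigned long now_ms)
{
	a->state = ANTIC_WAITING;
	a->timeout = now_ms + ANTIC_EXPIRE_MS;
}

/* Called by the dispatch path when only writes are queued. */
static int may_dispatch_writes(struct antic *a, unsigned long now_ms,
			       int read_just_arrived)
{
	if (read_just_arrived) {
		a->state = ANTIC_OFF;	/* we won the bet: go service the read */
		return 0;
	}
	if (a->state == ANTIC_WAITING && (long)(now_ms - a->timeout) < 0)
		return 0;		/* keep the disk idle a little longer */
	a->state = ANTIC_OFF;		/* timed out: we lost, back to writes */
	return 1;
}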

The end result is a large reduction in seeking - decreased read latency,
increased read bandwidth and increased write bandwidth.


The code as-is has rough spots and still needs quite some work. But it
appears to be stable. The test which I have concentrated on is "how long
does my laptop take to compile util-linux when there is a continuous write
happening". On ext2, mounted noatime:

2.4.20: 538 seconds
2.5.59: 400 seconds
2.5.59-mm5: 70 seconds
No streaming write: 48 seconds

A couple of VFS changes were needed as well.

More details on anticipatory scheduling may be found at

http://www.cs.rice.edu/~ssiyer/r/antsched/




Changes since 2.5.59-mm2:


+preempt-locking.patch

Speed up the smp preempt locking.

+ext2-allocation-failure-fix.patch

ext2 ENOSPC crash fix

+ext2_new_block-fixes.patch

ext2 cleanups

+hangcheck-timer.patch

A form of software watchdog

+slab-irq-fix.patch

Fix a BUG() in slab when memory exhaustion happens at a bad time.

+sendfile-security-hooks.patch

Reinstate lost security hooks around sendfile()

+buffer-io-accounting.patch

Fix IO-wait accounting

+aic79xx-linux-2.5.59-20030122.patch

aic7xxx driver update

+topology-remove-underbars.patch

cleanup

+mandlock-oops-fix.patch

file locking fix

+reiserfs_file_write.patch

reworked reiserfs write code.

-exit_mmap-fix2.patch

Dropped

+generic_file_readonly_mmap-fix.patch

Fix MAP_PRIVATE mmaps for filesystems which don't support ->writepage()

+seq_file-page-defn.patch

Compile fix

+exit_mmap-fix-ppc64.patch
+exit_mmap-ia64-fix.patch

Fix the exit_mmap() problem in arch code.

+show_task-fix.patch

Fix oops in show_task()

+scsi-iothread.patch

software suspend fix

+numaq-ioapic-fix2.patch

NUMAQ stuff

+misc.patch

Random fixes

+writeback-sync-cleanup.patch

remove some junk from fs-writeback.c

+dont-wait-on-inode.patch

Fix large delays in the writeback path

+unlink-latency-fix.patch

Fix large delays in unlink()

+anticipatory_io_scheduling-2_5_59-mm3.patch

Anticipatory scheduling implementation



All 65 patches:


kgdb.patch

devfs-fix.patch

deadline-np-42.patch
(undescribed patch)

deadline-np-43.patch
(undescribed patch)

setuid-exec-no-lock_kernel.patch
remove lock_kernel() from exec of setuid apps

buffer-debug.patch
buffer.c debugging

warn-null-wakeup.patch

reiserfs-readpages.patch
reiserfs v3 readpages support

fadvise.patch
implement posix_fadvise64()

ext3-scheduling-storm.patch
ext3: fix scheduling storm and lockups

auto-unplug.patch
self-unplugging request queues

less-unplugging.patch
Remove most of the blk_run_queues() calls

lockless-current_kernel_time.patch
Lockless current_kernel_time()

scheduler-tunables.patch
scheduler tunables

htlb-2.patch
hugetlb: fix MAP_FIXED handling

kirq.patch

kirq-up-fix.patch
Subject: Re: 2.5.59-mm1

ext3-truncate-ordered-pages.patch
ext3: explicitly free truncated pages

prune-icache-stats.patch
add stats for page reclaim via inode freeing

vma-file-merge.patch

mmap-whitespace.patch

read_cache_pages-cleanup.patch
cleanup in read_cache_pages()

remove-GFP_HIGHIO.patch
remove __GFP_HIGHIO

quota-lockfix.patch
quota locking fix

quota-offsem.patch
quota semaphore fix

oprofile-p4.patch

oprofile_cpu-as-string.patch
oprofile cpu-as-string

preempt-locking.patch
Subject: spinlock efficiency problem [was 2.5.57 IO slowdown with CONFIG_PREEMPT enabled]

wli-11_pgd_ctor.patch
(undescribed patch)

wli-11_pgd_ctor-update.patch
pgd_ctor update

stack-overflow-fix.patch
stack overflow checking fix

ext2-allocation-failure-fix.patch
Subject: [PATCH] ext2 allocation failures

ext2_new_block-fixes.patch
ext2_new_block cleanups and fixes

hangcheck-timer.patch
hangcheck-timer

slab-irq-fix.patch
slab IRQ fix

Richard_Henderson_for_President.patch
Subject: [PATCH] Richard Henderson for President!

parenthesise-pgd_index.patch
Subject: i386 pgd_index() doesn't parenthesize its arg

sendfile-security-hooks.patch
Subject: [RFC][PATCH] Restore LSM hook calls to sendfile

macro-double-eval-fix.patch
Subject: Re: i386 pgd_index() doesn't parenthesize its arg

mmzone-parens.patch
asm-i386/mmzone.h macro paren/eval fixes

blkdev-fixes.patch
blkdev.h fixes

remove-will_become_orphaned_pgrp.patch
remove will_become_orphaned_pgrp()

buffer-io-accounting.patch
correct wait accounting in wait_on_buffer()

aic79xx-linux-2.5.59-20030122.patch
aic7xxx update

MAX_IO_APICS-ifdef.patch
MAX_IO_APICS #ifdef'd wrongly

dac960-error-retry.patch
Subject: [PATCH] linux2.5.56 patch to DAC960 driver for error retry

topology-remove-underbars.patch
Remove __ from topology macros

mandlock-oops-fix.patch
ftruncate/truncate oopses with mandatory locking

put_user-warning-fix.patch
Subject: Re: Linux 2.5.59

reiserfs_file_write.patch
Subject: reiserfs file_write patch

vmlinux-fix.patch
vmlinux fix

smalldevfs.patch
smalldevfs

sound-firmware-load-fix.patch
soundcore.c referenced non-existent errno variable

generic_file_readonly_mmap-fix.patch
Fix generic_file_readonly_mmap()

seq_file-page-defn.patch
Include <asm/page.h> in fs/seq_file.c, as it uses PAGE_SIZE

exit_mmap-fix-ppc64.patch

exit_mmap-ia64-fix.patch
Fix ia64's 64bit->32bit app switching

show_task-fix.patch
Subject: [PATCH] 2.5.59: show_task() oops

scsi-iothread.patch
scsi_eh_* needs to run even during suspend

numaq-ioapic-fix2.patch
NUMAQ io_apic programming fix

misc.patch
misc fixes

writeback-sync-cleanup.patch

dont-wait-on-inode.patch

unlink-latency-fix.patch

anticipatory_io_scheduling-2_5_59-mm3.patch
Subject: [PATCH] 2.5.59-mm3 antic io sched



Subject: Re: 2.5.59-mm5



--On 23 January 2003 19:50 -0800 Andrew Morton <[email protected]> wrote:

> So what anticipatory scheduling does is very simple: if an application
> has performed a read, do *nothing at all* for a few milliseconds. Just
> return to userspace (or to the filesystem) in the expectation that the
> application or filesystem will quickly submit another read which is
> closeby.

I'm sure this is a really dumb question, as I've never played
with this subsystem, in which case I apologize in advance.

Why not follow (by default) the old system where you put the reads
effectively at the back of the queue. Then rather than doing nothing
for a few milliseconds, you carry on with doing the writes. However,
promote the reads to the front of the queue when you have a "good
lump" of them. If you get further reads while you are processing
a lump of them, put them behind the lump. Switch back to the putting
reads at the end when we have done "a few lumps worth" of
reads, or exhausted the reads at the start of the queue (or
perhaps are short of memory).

I.e. (with a "lump" = 20 and "a few" = 3):

W0 W1 W2 ... W50 W51

[Read arrives, we process some writes]

W5 ... W50 W51 R0

[More reads arrive, more writes processed]

W10 ... W50 W51 R0 R1 R2 .. R7

[Haven't got a big enough lump, but a
write arrives]

W12 W13... W50 W51 W52 R0 R1 R2 .. R7

[More reads arrive, more writes processed]

W14 W15 ... W50 W51 W52 R0 R1 R2 .. R7 R8 R9.. R19

[Another read arrives, after 4 more writes have
been processed, and we move the lump to the
front]

R0 R1 R2 .. R7 R8 R9.. R19 R20 W18 W19 ... W50 W51 W52

[Some reads are processed, and some more arrive, which
we insert into our lump at the front]

R0 R1 R2 .. R7 R8 R9.. R19 R20 R21 R22 W18 W19 ... W50 W51 W52

Then either if the reads are processed at the front of
the queue faster than they arrive, and the "lump" disappears,
or if we've processed 3 x 20 = 60 reads, we revert to
sticking reads back at the end.

All this does is lump between 20 and 60 reads together.

The advantage being that you don't "do nothing" for a few
milliseconds, and can attract larger lumps than by
waiting, without incurring additional latency.

Now of course you have the ordering problem (in that I've
assumed you can insert things into the queue at will),
but you have that anyway.

--
Alex Bligh

2003-01-24 11:07:04

by Andrew Morton

Subject: Re: 2.5.59-mm5

Alex Bligh - linux-kernel <[email protected]> wrote:
>
>
>
> --On 23 January 2003 19:50 -0800 Andrew Morton <[email protected]> wrote:
>
> > So what anticipatory scheduling does is very simple: if an application
> > has performed a read, do *nothing at all* for a few milliseconds. Just
> > return to userspace (or to the filesystem) in the expectation that the
> > application or filesystem will quickly submit another read which is
> > closeby.
>
> I'm sure this is a really dumb question, as I've never played
> with this subsystem, in which case I apologize in advance.
>
> Why not follow (by default) the old system where you put the reads
> effectively at the back of the queue. Then rather than doing nothing
> for a few milliseconds, you carry on with doing the writes. However,
> promote the reads to the front of the queue when you have a "good
> lump" of them.

That is the problem. Reads do not come in "lumps". They are dependent.
Consider the case of reading a file:

1: Read the directory.

This is a single read, and we cannot do anything until it has
completed.

2: The directory told us where the inode is. Go read the inode.

This is a single read, and we cannot do anything until it has
completed.

3: Go read the first 12 blocks of the file and the first indirect.


This is a single read, and we cannot do anything until it has
completed.

The above process takes up to three trips through the request queue.
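
The same chain seen from userspace, as a hedged sketch (not from the
original mail): open() cannot return until the directory and inode reads
complete, and read() cannot even be issued until open() has returned.

#include <fcntl.h>
#include <unistd.h>

static int read_first_block(const char *path, char *buf, size_t len)
{
	/* Steps 1 and 2: directory lookup, then the inode read.  The
	 * kernel cannot start step 2 before step 1 completes. */
	int fd = open(path, O_RDONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	/* Step 3: only now do we know which data blocks to fetch. */
	n = read(fd, buf, len);
	close(fd);
	return n < 0 ? -1 : (int)n;
}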


In this very common scenario, the only way we'll ever get "lumps" of reads is
if some other processes come in and happen to want to read nearby sectors.
In the best case, the size of the lump is proportional to the number of
processes which are concurrently trying to read something. This just doesn't
happen enough to be significant or interesting.

But writes are completely different. There is no dependency between them and
at any point in time we know where on-disk a lot of writes will be placed.
We don't know that for reads, which is why we need to twiddle thumbs until the
application or filesystem makes up its mind.


2003-01-24 11:14:47

by Jens Axboe

Subject: Re: 2.5.59-mm5

On Fri, Jan 24 2003, Alex Bligh - linux-kernel wrote:
>
> --On 23 January 2003 19:50 -0800 Andrew Morton <[email protected]> wrote:
>
> > So what anticipatory scheduling does is very simple: if an application
> > has performed a read, do *nothing at all* for a few milliseconds. Just
> > return to userspace (or to the filesystem) in the expectation that the
> > application or filesystem will quickly submit another read which is
> > closeby.
>
> I'm sure this is a really dumb question, as I've never played
> with this subsystem, in which case I apologize in advance.
>
> Why not follow (by default) the old system where you put the reads
> effectively at the back of the queue. Then rather than doing nothing
> for a few milliseconds, you carry on with doing the writes. However,
> promote the reads to the front of the queue when you have a "good
> lump" of them. If you get further reads while you are processing
> a lump of them, put them behind the lump. Switch back to the putting
> reads at the end when we have done "a few lumps worth" of
> reads, or exhausted the reads at the start of the queue (or
> perhaps are short of memory).

The whole point of anticipatory disk scheduling is that the one process
that submits a read is not going to do anything before that read
completes. However, maybe it will issue a _new_ read right after the
first one completes. The anticipation being that the same process will
submit a nearby read immediately.

--
Jens Axboe

2003-01-24 11:24:48

by Alex Tomas

Subject: Re: 2.5.59-mm5

>>>>> Andrew Morton (AM) writes:

AM> But writes are completely different. There is no dependency
AM> between them and at any point in time we know where on-disk a lot
AM> of writes will be placed. We don't know that for reads, which is
AM> why we need to twiddle thumbs until the application or filesystem
AM> makes up its mind.


it's significant that an application doesn't want to wait long for read
completion, and in most cases doesn't wait for write completion at all.



2003-01-24 11:40:50

by Andrew Morton

Subject: Re: 2.5.59-mm5

Alex Tomas <[email protected]> wrote:
>
> >>>>> Andrew Morton (AM) writes:
>
> AM> But writes are completely different. There is no dependency
> AM> between them and at any point in time we know where on-disk a lot
> AM> of writes will be placed. We don't know that for reads, which is
> AM> why we need to twiddle thumbs until the application or filesystem
> AM> makes up its mind.
>
>
> it's significant that an application doesn't want to wait long for read
> completion, and in most cases doesn't wait for write completion at all.

That's correct. Reads are usually synchronous and writes are rarely
synchronous.

The most common place where the kernel forces a user process to wait on
completion of a write is actually in unlink (truncate, really). Because
truncate must wait for in-progress I/O to complete before allowing the
filesystem to free (and potentially reuse) the affected blocks.

If there's a lot of writeout happening then truncate can take _ages_. Hence
this patch:





Truncates can take a very long time. Especially if there is a lot of
writeout happening, because truncate must wait on in-progress I/O.

And sys_unlink() is performing that truncate while holding the parent
directory's i_sem. This basically shuts down new accesses to the entire
directory until the synchronous I/O completes.

In the testing I've been doing, that directory is /tmp, and this hurts.

So change sys_unlink() to perform the actual truncate outside i_sem.

When there is a continuous streaming write to the same disk, this patch
reduces the time for `make -j4 bzImage' from 370 seconds to 220.



namei.c | 12 ++++++++++++
1 files changed, 12 insertions(+)

diff -puN fs/namei.c~unlink-latency-fix fs/namei.c
--- 25/fs/namei.c~unlink-latency-fix 2003-01-24 02:41:04.000000000 -0800
+++ 25-akpm/fs/namei.c 2003-01-24 02:47:36.000000000 -0800
@@ -1659,12 +1659,19 @@ int vfs_unlink(struct inode *dir, struct
return error;
}

+/*
+ * Make sure that the actual truncation of the file will occur outside its
+ * directory's i_sem. truncate can take a long time if there is a lot of
+ * writeout happening, and we don't want to prevent access to the directory
+ * while waiting on the I/O.
+ */
asmlinkage long sys_unlink(const char * pathname)
{
int error = 0;
char * name;
struct dentry *dentry;
struct nameidata nd;
+ struct inode *inode = NULL;

name = getname(pathname);
if(IS_ERR(name))
@@ -1683,6 +1690,9 @@ asmlinkage long sys_unlink(const char *
/* Why not before? Because we want correct error value */
if (nd.last.name[nd.last.len])
goto slashes;
+ inode = dentry->d_inode;
+ if (inode)
+ inode = igrab(inode);
error = vfs_unlink(nd.dentry->d_inode, dentry);
exit2:
dput(dentry);
@@ -1693,6 +1703,8 @@ exit1:
exit:
putname(name);

+ if (inode)
+ iput(inode); /* truncate the inode here */
return error;

slashes:

_

2003-01-24 12:05:35

by Nikita Danilov

Subject: Re: 2.5.59-mm5

Andrew Morton writes:

[...]

>
> In this very common scenario, the only way we'll ever get "lumps" of reads is
> if some other processes come in and happen to want to read nearby sectors.

Or if you have read-ahead for meta-data, which is quite useful. Isn't
read ahead targeting the same problem as this anticipatory scheduling?

> In the best case, the size of the lump is proportional to the number of
> processes which are concurrently trying to read something. This just doesn't
> happen enough to be significant or interesting.
>
> But writes are completely different. There is no dependency between them and
> at any point in time we know where on-disk a lot of writes will be placed.
> We don't know that for reads, which is why we need to twiddle thumbs until the
> application or filesystem makes up its mind.
>

Nikita.

2003-01-24 12:05:55

by Alex Tomas

Subject: Re: 2.5.59-mm5

>>>>> Andrew Morton (AM) writes:

AM> That's correct. Reads are usually synchronous and writes are
AM> rarely synchronous.

AM> The most common place where the kernel forces a user process to
AM> wait on completion of a write is actually in unlink (truncate,
AM> really). Because truncate must wait for in-progress I/O to
AM> complete before allowing the filesystem to free (and potentially
AM> reuse) the affected blocks.

looks like I miss something here.

why do we wait for write completion in truncate?

getblk (blockmap);
getblk (bitmap);
set 0 in blockmap->b_data[N];
mark_buffer_dirty (blockmap);
clear_bit (N, &bitmap);
mark_buffer_dirty (bitmap);

isn't that enough?

2003-01-24 15:47:30

by Oliver Xymoron

Subject: Re: 2.5.59-mm5

On Fri, Jan 24, 2003 at 03:50:17AM -0800, Andrew Morton wrote:
> Alex Tomas <[email protected]> wrote:
> >
> > >>>>> Andrew Morton (AM) writes:
> >
> > AM> But writes are completely different. There is no dependency
> > AM> between them and at any point in time we know where on-disk a lot
> > AM> of writes will be placed. We don't know that for reads, which is
> > AM> why we need to twiddle thumbs until the application or filesystem
> > AM> makes up its mind.
> >
> >
> > it's significant that an application doesn't want to wait long for read
> > completion, and in most cases doesn't wait for write completion at all.
>
> That's correct. Reads are usually synchronous and writes are rarely
> synchronous.
>
> The most common place where the kernel forces a user process to wait on
> completion of a write is actually in unlink (truncate, really). Because
> truncate must wait for in-progress I/O to complete before allowing the
> filesystem to free (and potentially reuse) the affected blocks.
>
> If there's a lot of writeout happening then truncate can take _ages_. Hence
> this patch:

An alternate approach might be to change the way the scheduler splits
things. That is, rather than marking I/O read vs write and scheduling
based on that, add a flag bit to mark them all sync vs async since
that's the distinction we actually care about. The normal paths can
all do read+sync and write+async, but you can now do things like
marking your truncate writes sync and readahead async.

And dependent/nondependent or stalling/nonstalling might be a clearer
terminology.
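
A hypothetical sketch of that idea; the flag and function names below are
invented for illustration and are not the 2.5 block layer API:

#define RQ_WRITE	0x1	/* data direction */
#define RQ_SYNC		0x2	/* a task is stalled until this completes */

static inline int classify_rq(int is_write, int caller_waits)
{
	int flags = is_write ? RQ_WRITE : 0;

	if (caller_waits)
		flags |= RQ_SYNC;	/* e.g. a truncate forced to wait */
	return flags;
}

/*
 * Ordinary read:             classify_rq(0, 1) -> sync
 * Readahead:                 classify_rq(0, 0) -> async
 * Background writeback:      classify_rq(1, 0) -> async
 * Truncate's blocking write: classify_rq(1, 1) -> sync
 */

The scheduler would then favour RQ_SYNC requests regardless of direction.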

--
"Love the dolphins," she advised him. "Write by W.A.S.T.E.."

2003-01-24 15:50:59

by Nick Piggin

Subject: Re: 2.5.59-mm5

Nikita Danilov wrote:

>Andrew Morton writes:
>
>[...]
>
> >
> > In this very common scenario, the only way we'll ever get "lumps" of reads is
> > if some other processes come in and happen to want to read nearby sectors.
>
>Or if you have read-ahead for meta-data, which is quite useful. Isn't
>read ahead targeting the same problem as this anticipatory scheduling?
>
Finesse vs brute force. A bit of readahead is good.

2003-01-24 15:55:40

by Nick Piggin

Subject: Re: 2.5.59-mm5

Oliver Xymoron wrote:

>On Fri, Jan 24, 2003 at 03:50:17AM -0800, Andrew Morton wrote:
>
>>Alex Tomas <[email protected]> wrote:
>>
>>>>>>>>Andrew Morton (AM) writes:
>>>>>>>>
>>> AM> But writes are completely different. There is no dependency
>>> AM> between them and at any point in time we know where on-disk a lot
>>> AM> of writes will be placed. We don't know that for reads, which is
>>> AM> why we need to twiddle thumbs until the application or filesystem
>>> AM> makes up its mind.
>>>
>>>
>>>it's significant that an application doesn't want to wait long for read
>>>completion, and in most cases doesn't wait for write completion at all.
>>>
>>That's correct. Reads are usually synchronous and writes are rarely
>>synchronous.
>>
>>The most common place where the kernel forces a user process to wait on
>>completion of a write is actually in unlink (truncate, really). Because
>>truncate must wait for in-progress I/O to complete before allowing the
>>filesystem to free (and potentially reuse) the affected blocks.
>>
>>If there's a lot of writeout happening then truncate can take _ages_. Hence
>>this patch:
>>
>
>An alternate approach might be to change the way the scheduler splits
>things. That is, rather than marking I/O read vs write and scheduling
>based on that, add a flag bit to mark them all sync vs async since
>that's the distinction we actually care about. The normal paths can
>all do read+sync and write+async, but you can now do things like
>marking your truncate writes sync and readahead async.
>
>And dependent/nondependent or stalling/nonstalling might be a clearer
>terminology.
>
That will be worth investigating to see if the complexity is worth it.
I think from a disk point of view, we still want to split batches between
reads and writes. Could be wrong.

Nick

2003-01-24 16:08:19

by jlnance

Subject: Re: 2.5.59-mm5

On Thu, Jan 23, 2003 at 07:50:44PM -0800, Andrew Morton wrote:
> So what anticipatory scheduling does is very simple: if an application
> has performed a read, do *nothing at all* for a few milliseconds. Just
> return to userspace (or to the filesystem) in the expectation that the
> application or filesystem will quickly submit another read which is
> closeby.

Does this affect the faulting in of executable pages as well?

Thanks,

Jim

2003-01-24 16:50:49

by Luck, Tony

Subject: Re: 2.5.59-mm5

Andrew Morton wrote:

So what anticipatory scheduling does is very simple: if an application
has performed a read, do *nothing at all* for a few milliseconds. Just
return to userspace (or to the filesystem) in the expectation that the
application or filesystem will quickly submit another read which is
closeby.

Do you need to give a process being woken from the read an extra priority
boost to make sure that it actually gets run in your "few milliseconds"
window. It would be a shame to leave the disk idle for the interval, and
then discover that the process scheduler had been running other stuff, so
that the reader didn't get a chance to issue the next read.

-Tony Luck

2003-01-24 17:00:53

by Giuliano Pochini

Subject: Re: 2.5.59-mm5


>>An alternate approach might be to change the way the scheduler splits
>>things. That is, rather than marking I/O read vs write and scheduling
>>based on that, add a flag bit to mark them all sync vs async since
>>that's the distinction we actually care about. The normal paths can
>>all do read+sync and write+async, but you can now do things like
>>marking your truncate writes sync and readahead async.

> That will be worth investigating to see if the complexity is worth it.
> I think from a disk point of view, we still want to split batches between
> reads and writes. Could be wrong.

Yes, sync vs async is a better way to classify IO requests than
read vs write, and it's more correct from the OS point of view. IMHO
it's not more complex than now. Just replace r/w with sy/as and
it will work.


Bye.

2003-01-24 17:13:24

by Nick Piggin

Subject: Re: 2.5.59-mm5

Giuliano Pochini wrote:

>>>An alternate approach might be to change the way the scheduler splits
>>>things. That is, rather than marking I/O read vs write and scheduling
>>>based on that, add a flag bit to mark them all sync vs async since
>>>that's the distinction we actually care about. The normal paths can
>>>all do read+sync and write+async, but you can now do things like
>>>marking your truncate writes sync and readahead async.
>>>
>
>>That will be worth investigating to see if the complexity is worth it.
>>I think from a disk point of view, we still want to split batches between
>>reads and writes. Could be wrong.
>>
>
>Yes, sync vs async is a better way to classify IO requests than
>read vs write, and it's more correct from the OS point of view. IMHO
>it's not more complex than now. Just replace r/w with sy/as and
>it will work.
>
We probably wouldn't want to go that far as you obviously can
only merge reads with reads and writes with writes, a flag would
be fine. We have to get the basics working first though ;)

2003-01-24 18:55:50

by Andrew Morton

Subject: Re: 2.5.59-mm5

[email protected] wrote:
>
> On Thu, Jan 23, 2003 at 07:50:44PM -0800, Andrew Morton wrote:
> > So what anticipatory scheduling does is very simple: if an application
> > has performed a read, do *nothing at all* for a few milliseconds. Just
> > return to userspace (or to the filesystem) in the expectation that the
> > application or filesystem will quickly submit another read which is
> > closeby.
>
> Does this affect the faulting in of executable pages as well?
>

Yes. It should in theory decrease average major fault latency when there's
I/O load. That is unproven.


2003-01-24 19:03:21

by Andrew Morton

Subject: Re: 2.5.59-mm5

Alex Tomas <[email protected]> wrote:
>
> >>>>> Andrew Morton (AM) writes:
>
> AM> That's correct. Reads are usually synchronous and writes are
> AM> rarely synchronous.
>
> AM> The most common place where the kernel forces a user process to
> AM> wait on completion of a write is actually in unlink (truncate,
> AM> really). Because truncate must wait for in-progress I/O to
> AM> complete before allowing the filesystem to free (and potentially
> AM> reuse) the affected blocks.
>
> looks like I miss something here.
>
> why do wait for write completion in truncate?

We cannot free disk blocks until I/O against them has completed. Otherwise
the block could be reused for something else, then the old IO will scribble
on the new data.

What we _can_ do is to defer the waiting - only wait on the I/O when someone
reuses the disk blocks. So there are actually unused blocks with I/O in
flight against them.

We do that for metadata (the wait happens in unmap_underlying_metadata()) but
for file data blocks there is no mechanism in place to look them up.
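
To illustrate what such a deferred wait could look like for data blocks,
here is a hedged sketch with made-up names (this is not how fs/buffer.c
spells it): the wait moves from the truncate path to whoever next
allocates the block.

struct io_handle;			/* stands in for whatever tracks the old I/O */
void wait_for_io(struct io_handle *io);	/* hypothetical helper */

struct busy_block {
	unsigned long long block;	/* freed on disk, old I/O still in flight */
	struct io_handle *old_io;
	struct busy_block *next;
};

struct busy_block *alloc_busy_block(void);	/* hypothetical allocators */
void free_busy_block(struct busy_block *b);

/* Truncate path: free the block immediately, just remember the old I/O. */
static void free_block_deferred(struct busy_block **list,
				unsigned long long block,
				struct io_handle *io)
{
	struct busy_block *b = alloc_busy_block();

	b->block = block;
	b->old_io = io;
	b->next = *list;
	*list = b;			/* note: no waiting here */
}

/* Allocation path: only the new user of the block pays for the wait. */
static void claim_block(struct busy_block **list, unsigned long long block)
{
	struct busy_block **p, *b;

	for (p = list; (b = *p) != NULL; p = &b->next) {
		if (b->block != block)
			continue;
		wait_for_io(b->old_io);	/* old data must hit disk before reuse */
		*p = b->next;
		free_busy_block(b);
		return;
	}
}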

2003-01-24 19:27:03

by Valdis Klētnieks

Subject: Re: 2.5.59-mm5

On Sat, 25 Jan 2003 04:22:39 +1100, Nick Piggin said:
> We probably wouldn't want to go that far as you obviously can
> only merge reads with reads and writes with writes, a flag would
> be fine. We have to get the basics working first though ;)

"obviously can only"? Admittedly, merging reads and writes is a lot
trickier, and probably "too hairy to bother", but I'm not aware of a
fundamental "cant" that applies across IDE/SCSI/USB/1394/fiberchannel/etc.
--
Valdis Kletnieks
Computer Systems Senior Engineer
Virginia Tech



2003-01-24 19:55:56

by Jens Axboe

Subject: Re: 2.5.59-mm5

On Fri, Jan 24 2003, [email protected] wrote:
> On Sat, 25 Jan 2003 04:22:39 +1100, Nick Piggin said:
> > We probably wouldn't want to go that far as you obviously can
> > only merge reads with reads and writes with writes, a flag would
> > be fine. We have to get the basics working first though ;)
>
> "obviously can only"? Admittedly, merging reads and writes is a lot
> trickier, and probably "too hairy to bother", but I'm not aware of a
> fundamental "cant" that applies across IDE/SCSI/USB/1394/fiberchannel/etc.

Nick's comment refers to the block layer situation, we obviously cannot
merge reads and writes there. You would basically have to rewrite the
entire request submission structure and break all drivers. And for zero
benefit. Face it, it would be stupid to even attempt such a maneuver.

Since you bring it up, you must know of a device which can take a single
command that says "read blocks a to b, and write blocks x to z"? Even if
such a thing existed, it would be much better implemented by the driver
as pulling more requests off the queue and constructing these weirdo
commands itself. Something as ugly as that would never invade the Linux
block layer, at least not as long as I have any input on the design of
it.

So I quite agree with the "obviously".

--
Jens Axboe

2003-01-24 19:59:28

by Alex Tomas

Subject: Re: 2.5.59-mm5

>>>>> Andrew Morton (AM) writes:

AM> We cannot free disk blocks until I/O against them has completed.
AM> Otherwise the block could be reused for something else, then the
AM> old IO will scribble on the new data.

AM> What we _can_ do is to defer the waiting - only wait on the I/O
AM> when someone reuses the disk blocks. So there are actually
AM> unused blocks with I/O in flight against them.

AM> We do that for metadata (the wait happens in
AM> unmap_underlying_metadata()) but for file data blocks there is no
AM> mechanism in place to look them up

yeah! indeed. my stupid mistake ...




2003-01-24 21:03:18

by Andrew Morton

Subject: Re: 2.5.59-mm5

"Luck, Tony" <[email protected]> wrote:
>
> Andrew Morton wrote:
>
> So what anticipatory scheduling does is very simple: if an application
> has performed a read, do *nothing at all* for a few milliseconds. Just
> return to userspace (or to the filesystem) in the expectation that the
> application or filesystem will quickly submit another read which is
> closeby.
>
> Do you need to give a process being woken from the read an extra priority
> boost to make sure that it actually gets run in your "few milliseconds"
> window. It would be a shame to leave the disk idle for the interval, and
> then discover that the process scheduler had been running other stuff, so
> that the reader didn't get a chance to issue the next read.
>

Indeed. I have experimented with giving the to-be-woken task a boost in the
CPU scheduler, and was not able to observe much difference. At the very
least, one would expect to be able to decrease the anticipation timeout with
that in place.

Maybe it just needs more testing to find the usage patterns which need this
change.

It could be that the woken task is getting sufficient boost from the
effective_prio() logic that no change will be needed. I don't know yet.

2003-01-24 21:54:33

by Valdis Klētnieks

Subject: Re: 2.5.59-mm5

On Fri, 24 Jan 2003 21:04:34 +0100, Jens Axboe said:

> Nick's comment refers to the block layer situation, we obviously cannot
> merge reads and writes there. You would basically have to rewrite the
> entire request submission structure and break all drivers. And for zero
> benefit. Face it, it would be stupid to even attempt such a maneuver.

As I *said* - "hairy beyond benefit", not "cant".

> Since you bring it up, you must know if a device which can take a single
> command that says "read blocks a to b, and write blocks x to z"? Even
> such thing existed,

They do exist.

IBM mainframe disks (the 3330/50/80 series) are able to do much more than that
in one CCW chain. So it was *quite* possible to even express things like "Go to
this cylinder/track, search for each record that has value XYZ in the 'key'
field, and if found, write value ABC in the data field". (In fact, the DASD I/O
opcodes for CCW chains are Turing-complete).

> it would be much better implemented by the driver
> as pulling more requests off the queue and constructing these weirdo

The only operating system I'm aware of that actually uses that stuff is MVS.

> So I quite agree with the "obviously".

My complaint was the confusion of "obviously cant" with "we have decided we
don't want to".

/Valdis



2003-01-25 08:24:14

by Andres Salomon

Subject: Re: 2.5.59-mm5

My atyfb_base.c compile fix (from 2.5.54) still hasn't found its way into
any of the main kernel trees. The original patch generates a reject
against 2.5.59-mm5, so here's an updated patch.


On Thu, 23 Jan 2003 19:50:44 -0800, Andrew Morton wrote:

> http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.59/2.5.59-mm5/
>
> . -mm3 and -mm4 were not announced - they were sync-up patches as we
> worked on the I/O scheduler.
>
> . -mm5 has the first cut of Nick Piggin's anticipatory I/O scheduler.
> Here's the scoop:
>
[...]
>
> anticipatory_io_scheduling-2_5_59-mm3.patch
> Subject: [PATCH] 2.5.59-mm3 antic io sched
>
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/


--- a/drivers/video/aty/atyfb_base.c 2003-01-25 03:02:35.000000000 -0500
+++ b/drivers/video/aty/atyfb_base.c 2003-01-25 03:21:48.000000000 -0500
@@ -2587,12 +2587,12 @@
if (info->screen_base)
iounmap((void *) info->screen_base);
#ifdef __BIG_ENDIAN
- if (info->cursor && par->cursor->ram)
+ if (par->cursor && par->cursor->ram)
iounmap(par->cursor->ram);
#endif
#endif
- if (info->cursor)
- kfree(info->cursor);
+ if (par->cursor)
+ kfree(par->cursor);
#ifdef __sparc__
if (par->mmap_map)
kfree(par->mmap_map);

2003-01-25 12:19:54

by Jens Axboe

Subject: Re: 2.5.59-mm5

On Fri, Jan 24 2003, [email protected] wrote:
> On Fri, 24 Jan 2003 21:04:34 +0100, Jens Axboe said:
>
> > Nick's comment refers to the block layer situation, we obviously cannot
> > merge reads and writes there. You would basically have to rewrite the
> > entire request submission structure and break all drivers. And for zero
> > benefit. Face it, it would be stupid to even attempt such a maneuver.
>
> As I *said* - "hairy beyond benefit", not "cant".

Hairy is ok as long as it provides substantial benefit in some way, and
this does definitely not qualify.

> > Since you bring it up, you must know of a device which can take a single
> > command that says "read blocks a to b, and write blocks x to z"? Even
> > such thing existed,
>
> They do exist.
>
> IBM mainframe disks (the 3330/50/80 series) are able to do much more
> than that in one CCW chain. So it was *quite* possible to even express
> things like "Go to this cylinder/track, search for each record that
> has value XYZ in the 'key' field, and if found, write value ABC in the
> data field". (In fact, the DASD I/O
> opcodes for CCW chains are Turing-complete).

Well as interesting as that is, it is still an obscurity that will not
be generally supported. As I said, if you wanted to do such a thing you
can do it in the driver. Complicating the block layer in this way is
totally unacceptable, and is just bound to be an endless source of data
corrupting driver bugs.

> > So I quite agree with the "obviously".
>
> My complaint was the confusion of "obviously cant" with "we have decided we
> don't want to".

Ok fair enough, make that a strong "obviously wont" instead then.

--
Jens Axboe