2009-04-03 07:01:11

by Theodore Ts'o

[permalink] [raw]
Subject: [GIT PULL] Ext3 latency fixes

Hi Linus,

Please pull from:

git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git ext3-latency-fixes

I posted these patches a while back, and with your "overwrite.c" test
case, I decided to see how they did. The results were spectacular
enough (see below) that I've decided to request that they be included
in 2.6.30. I've posted the critical patches below for review before,
and Jan Kara has acked them, and there have been no complaints about
them.

I've also added two patches which add replace-via-truncate and
replace-via-rename workarounds to ext3's data=writeback mode. They
only change the behavior in the (currently non-default) data=writeback
mode.

The benchmark which I used was Linus's overwrite.c as the background
workload, and my fsync-tester as the foreground tester. The
fsync-tester writes a megabyte to a file and then times how long it
takes to fsync that file, and then sleeps a second before repeating.

Using an unpatched 2.6.29, fsync-tester shows the following times:

fsync time: 3.4732
fsync time: 2.4338
fsync time: 5.9496
fsync time: 6.2402
fsync time: 4.3375
fsync time: 6.3283
fsync time: 3.6930
fsync time: 3.1848
fsync time: 3.3231

The final report of overwrite.c is:

1.984 GB written in 82.75 (24 MB/s)

With these patches applied, the fsync-tester times are:

fsync time: 1.4538
fsync time: 1.6328
fsync time: 1.4632
fsync time: 1.4550
fsync time: 0.2932
fsync time: 1.6986
fsync time: 0.3787
fsync time: 1.3380
fsync time: 1.8145
fsync time: 0.4050
fsync time: 1.3880

... and the final report of overwrite.c is:

1.984 GB written in 93.77 (21 MB/s)

By having the fsync-related I/O fixed to be posted using WRITE_SYNC,
instead of WRITE, it prioritizes the fsync-related I/O so that it gets
done ahead of the streaming write. This does slow down the background
write process, but it speeds up the worst-case fsync() latency from
6.2 seconds to 1.8 seconds. (Measurements done on a 5400 rpm laptop
drive.)

All aside from the benchmark improvements, if the writes are coming
from fsync(), they really are synchronous operations, and they should
be marked that way just from a correctness point of view.

- Ted

Theodore Ts'o (4):
block_write_full_page: Use synchronous writes for WBC_SYNC_ALL writebacks
ext3: Use WRITE_SYNC for commits which are caused by fsync()
ext3: Add replace-on-truncate hueristics for data=writeback mode
ext3: Add replace-on-rename hueristics for data=writeback mode

fs/buffer.c | 5 +++--
fs/ext3/file.c | 4 ++++
fs/ext3/inode.c | 3 +++
fs/ext3/namei.c | 6 +++++-
fs/jbd/commit.c | 23 +++++++++++++++--------
fs/jbd/transaction.c | 2 ++
include/linux/ext3_fs.h | 1 +
include/linux/jbd.h | 5 +++++
8 files changed, 38 insertions(+), 11 deletions(-)






2009-04-03 07:01:11

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH 1/4] block_write_full_page: Use synchronous writes for WBC_SYNC_ALL writebacks

When doing synchronous writes because wbc->sync_mode is set to
WBC_SYNC_ALL, send the write request using WRITE_SYNC, so that we
don't unduly block system calls such as fsync().

Signed-off-by: "Theodore Ts'o" <[email protected]>
Acked-by: Jan Kara <[email protected]>
---
fs/buffer.c | 5 +++--
1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 891e1c7..e7ebd95 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1714,6 +1714,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
struct buffer_head *bh, *head;
const unsigned blocksize = 1 << inode->i_blkbits;
int nr_underway = 0;
+ int write_op = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);

BUG_ON(!PageLocked(page));

@@ -1805,7 +1806,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
do {
struct buffer_head *next = bh->b_this_page;
if (buffer_async_write(bh)) {
- submit_bh(WRITE, bh);
+ submit_bh(write_op, bh);
nr_underway++;
}
bh = next;
@@ -1859,7 +1860,7 @@ recover:
struct buffer_head *next = bh->b_this_page;
if (buffer_async_write(bh)) {
clear_buffer_dirty(bh);
- submit_bh(WRITE, bh);
+ submit_bh(write_op, bh);
nr_underway++;
}
bh = next;
--
1.5.6.3


2009-04-03 07:01:11

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH 3/4] ext3: Add replace-on-truncate hueristics for data=writeback mode

In data=writeback mode, start an asynchronous flush when closing a
file which had been previously truncated down to zero. This lowers
the probability of data loss in the case of applications that attempt
to replace a file using truncate.

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext3/file.c | 4 ++++
fs/ext3/inode.c | 3 +++
include/linux/ext3_fs.h | 1 +
3 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/fs/ext3/file.c b/fs/ext3/file.c
index 3be1e06..4a04cbb 100644
--- a/fs/ext3/file.c
+++ b/fs/ext3/file.c
@@ -33,6 +33,10 @@
*/
static int ext3_release_file (struct inode * inode, struct file * filp)
{
+ if (EXT3_I(inode)->i_state & EXT3_STATE_FLUSH_ON_CLOSE) {
+ filemap_flush(inode->i_mapping);
+ EXT3_I(inode)->i_state &= ~EXT3_STATE_FLUSH_ON_CLOSE;
+ }
/* if we are the last writer on the inode, drop the block reservation */
if ((filp->f_mode & FMODE_WRITE) &&
(atomic_read(&inode->i_writecount) == 1))
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 5fa453b..0f5bca0 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -2346,6 +2346,9 @@ void ext3_truncate(struct inode *inode)
if (!ext3_can_truncate(inode))
return;

+ if (inode->i_size == 0 && ext3_should_writeback_data(inode))
+ ei->i_state |= EXT3_STATE_FLUSH_ON_CLOSE;
+
/*
* We have to lock the EOF page here, because lock_page() nests
* outside journal_start().
diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
index dd495b8..d2630c5 100644
--- a/include/linux/ext3_fs.h
+++ b/include/linux/ext3_fs.h
@@ -208,6 +208,7 @@ static inline __u32 ext3_mask_flags(umode_t mode, __u32 flags)
#define EXT3_STATE_JDATA 0x00000001 /* journaled data exists */
#define EXT3_STATE_NEW 0x00000002 /* inode is newly created */
#define EXT3_STATE_XATTR 0x00000004 /* has in-inode xattrs */
+#define EXT3_STATE_FLUSH_ON_CLOSE 0x00000008

/* Used to pass group descriptor data when online resize is done */
struct ext3_new_group_input {
--
1.5.6.3


2009-04-03 07:01:11

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH 4/4] ext3: Add replace-on-rename hueristics for data=writeback mode

In data=writeback mode, start an asynchronous flush when renaming a
file on top of an already-existing file. This lowers the probability
of data loss in the case of applications that attempt to replace a
file via using rename().

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext3/namei.c | 6 +++++-
1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c
index 4db4ffa..ab98a66 100644
--- a/fs/ext3/namei.c
+++ b/fs/ext3/namei.c
@@ -2265,7 +2265,7 @@ static int ext3_rename (struct inode * old_dir, struct dentry *old_dentry,
struct inode * old_inode, * new_inode;
struct buffer_head * old_bh, * new_bh, * dir_bh;
struct ext3_dir_entry_2 * old_de, * new_de;
- int retval;
+ int retval, flush_file = 0;

old_bh = new_bh = dir_bh = NULL;

@@ -2401,6 +2401,8 @@ static int ext3_rename (struct inode * old_dir, struct dentry *old_dentry,
ext3_mark_inode_dirty(handle, new_inode);
if (!new_inode->i_nlink)
ext3_orphan_add(handle, new_inode);
+ if (ext3_should_writeback_data(new_inode))
+ flush_file = 1;
}
retval = 0;

@@ -2409,6 +2411,8 @@ end_rename:
brelse (old_bh);
brelse (new_bh);
ext3_journal_stop(handle);
+ if (retval == 0 && flush_file)
+ filemap_flush(old_inode->i_mapping);
return retval;
}

--
1.5.6.3


2009-04-03 07:01:11

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH 2/4] ext3: Use WRITE_SYNC for commits which are caused by fsync()

If a commit is triggered by fsync(), set a flag indicating the journal
blocks associated with the transaction should be flushed out using
WRITE_SYNC.

Signed-off-by: "Theodore Ts'o" <[email protected]>
Acked-by: Jan Kara <[email protected]>
---
fs/jbd/commit.c | 23 +++++++++++++++--------
fs/jbd/transaction.c | 2 ++
include/linux/jbd.h | 5 +++++
3 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index 3fbffb1..f8077b9 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -20,6 +20,7 @@
#include <linux/slab.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
+#include <linux/bio.h>

/*
* Default IO end handler for temporary BJ_IO buffer_heads.
@@ -171,14 +172,15 @@ static int journal_write_commit_record(journal_t *journal,
return (ret == -EIO);
}

-static void journal_do_submit_data(struct buffer_head **wbuf, int bufs)
+static void journal_do_submit_data(struct buffer_head **wbuf, int bufs,
+ int write_op)
{
int i;

for (i = 0; i < bufs; i++) {
wbuf[i]->b_end_io = end_buffer_write_sync;
/* We use-up our safety reference in submit_bh() */
- submit_bh(WRITE, wbuf[i]);
+ submit_bh(write_op, wbuf[i]);
}
}

@@ -186,7 +188,8 @@ static void journal_do_submit_data(struct buffer_head **wbuf, int bufs)
* Submit all the data buffers to disk
*/
static int journal_submit_data_buffers(journal_t *journal,
- transaction_t *commit_transaction)
+ transaction_t *commit_transaction,
+ int write_op)
{
struct journal_head *jh;
struct buffer_head *bh;
@@ -225,7 +228,7 @@ write_out_data:
BUFFER_TRACE(bh, "needs blocking lock");
spin_unlock(&journal->j_list_lock);
/* Write out all data to prevent deadlocks */
- journal_do_submit_data(wbuf, bufs);
+ journal_do_submit_data(wbuf, bufs, write_op);
bufs = 0;
lock_buffer(bh);
spin_lock(&journal->j_list_lock);
@@ -256,7 +259,7 @@ write_out_data:
jbd_unlock_bh_state(bh);
if (bufs == journal->j_wbufsize) {
spin_unlock(&journal->j_list_lock);
- journal_do_submit_data(wbuf, bufs);
+ journal_do_submit_data(wbuf, bufs, write_op);
bufs = 0;
goto write_out_data;
}
@@ -286,7 +289,7 @@ write_out_data:
}
}
spin_unlock(&journal->j_list_lock);
- journal_do_submit_data(wbuf, bufs);
+ journal_do_submit_data(wbuf, bufs, write_op);

return err;
}
@@ -315,6 +318,7 @@ void journal_commit_transaction(journal_t *journal)
int first_tag = 0;
int tag_flag;
int i;
+ int write_op = WRITE;

/*
* First job: lock down the current transaction and wait for
@@ -347,6 +351,8 @@ void journal_commit_transaction(journal_t *journal)
spin_lock(&journal->j_state_lock);
commit_transaction->t_state = T_LOCKED;

+ if (commit_transaction->t_synchronous_commit)
+ write_op = WRITE_SYNC;
spin_lock(&commit_transaction->t_handle_lock);
while (commit_transaction->t_updates) {
DEFINE_WAIT(wait);
@@ -431,7 +437,8 @@ void journal_commit_transaction(journal_t *journal)
* Now start flushing things to disk, in the order they appear
* on the transaction lists. Data blocks go first.
*/
- err = journal_submit_data_buffers(journal, commit_transaction);
+ err = journal_submit_data_buffers(journal, commit_transaction,
+ write_op);

/*
* Wait for all previously submitted IO to complete.
@@ -660,7 +667,7 @@ start_journal_io:
clear_buffer_dirty(bh);
set_buffer_uptodate(bh);
bh->b_end_io = journal_end_buffer_io_sync;
- submit_bh(WRITE, bh);
+ submit_bh(write_op, bh);
}
cond_resched();

diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index e6a1174..ed886e6 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -1440,6 +1440,8 @@ int journal_stop(handle_t *handle)
}
}

+ if (handle->h_sync)
+ transaction->t_synchronous_commit = 1;
current->journal_info = NULL;
spin_lock(&journal->j_state_lock);
spin_lock(&transaction->t_handle_lock);
diff --git a/include/linux/jbd.h b/include/linux/jbd.h
index 64246dc..2c69431 100644
--- a/include/linux/jbd.h
+++ b/include/linux/jbd.h
@@ -552,6 +552,11 @@ struct transaction_s
*/
int t_handle_count;

+ /*
+ * This transaction is being forced and some process is
+ * waiting for it to finish.
+ */
+ int t_synchronous_commit:1;
};

/**
--
1.5.6.3


Subject: EXT4 in embedded systems


Hi all,

I am currently running kernel 2.6.15 on an embedded mips system.

I really would like to be able to use the ext4 file system.

I can easily upgrade to kernel 2.6.22 but not any further.

I would like to get the latest and greatest stable ext4 code onto this
box.

Is it possible to back-port the latest ext4 code to the 2.6.22 kernel?

Or, is there a better way to go about it?

Thanks,
Nick Hennenfent
Cisco IPTV


-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Theodore Ts'o
Sent: Friday, April 03, 2009 3:01 AM
To: [email protected]
Cc: Linux Kernel Developers List; Ext4 Developers List
Subject: [GIT PULL] Ext3 latency fixes

Hi Linus,

Please pull from:

git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git
ext3-latency-fixes

I posted these patches a while back, and with your "overwrite.c" test
case, I decided to see how they did. The results were spectacular
enough (see below) that I've decided to request that they be included
in 2.6.30. I've posted the critical patches below for review before,
and Jan Kara has acked them, and there have been no complaints about
them.

I've also added two patches which add replace-via-truncate and
replace-via-rename workarounds to ext3's data=writeback mode. They
only change the behavior in the (currently non-default) data=writeback
mode.

The benchmark which I used was Linus's overwrite.c as the background
workload, and my fsync-tester as the foreground tester. The
fsync-tester writes a megabyte to a file and then times how long it
takes to fsync that file, and then sleeps a second before repeating.

Using an unpatched 2.6.29, fsync-tester shows the following times:

fsync time: 3.4732
fsync time: 2.4338
fsync time: 5.9496
fsync time: 6.2402
fsync time: 4.3375
fsync time: 6.3283
fsync time: 3.6930
fsync time: 3.1848
fsync time: 3.3231

The final report of overwrite.c is:

1.984 GB written in 82.75 (24 MB/s)

With these patches applied, the fsync-tester times are:

fsync time: 1.4538
fsync time: 1.6328
fsync time: 1.4632
fsync time: 1.4550
fsync time: 0.2932
fsync time: 1.6986
fsync time: 0.3787
fsync time: 1.3380
fsync time: 1.8145
fsync time: 0.4050
fsync time: 1.3880

... and the final report of overwrite.c is:

1.984 GB written in 93.77 (21 MB/s)

By having the fsync-related I/O fixed to be posted using WRITE_SYNC,
instead of WRITE, it prioritizes the fsync-related I/O so that it gets
done ahead of the streaming write. This does slow down the background
write process, but it speeds up the worst-case fsync() latency from
6.2 seconds to 1.8 seconds. (Measurements done on a 5400 rpm laptop
drive.)

All aside from the benchmark improvements, if the writes are coming
from fsync(), they really are synchronous operations, and they should
be marked that way just from a correctness point of view.

- Ted

Theodore Ts'o (4):
block_write_full_page: Use synchronous writes for WBC_SYNC_ALL
writebacks
ext3: Use WRITE_SYNC for commits which are caused by fsync()
ext3: Add replace-on-truncate hueristics for data=writeback mode
ext3: Add replace-on-rename hueristics for data=writeback mode

fs/buffer.c | 5 +++--
fs/ext3/file.c | 4 ++++
fs/ext3/inode.c | 3 +++
fs/ext3/namei.c | 6 +++++-
fs/jbd/commit.c | 23 +++++++++++++++--------
fs/jbd/transaction.c | 2 ++
include/linux/ext3_fs.h | 1 +
include/linux/jbd.h | 5 +++++
8 files changed, 38 insertions(+), 11 deletions(-)



2009-04-03 16:07:10

by Eric Sandeen

[permalink] [raw]
Subject: Re: EXT4 in embedded systems

Nick Hennenfent (nhennefe) wrote:
> Hi all,
>
> I am currently running kernel 2.6.15 on an embedded mips system.
>
> I really would like to be able to use the ext4 file system.
>
> I can easily upgrade to kernel 2.6.22 but not any further.
>
> I would like to get the latest and greatest stable ext4 code onto this
> box.
>
> Is it possible to back-port the latest ext4 code to the 2.6.22 kernel?
>
> Or, is there a better way to go about it?

Hi Nick -

Well, "it's just code" :) I'm sure it's possible to backport it, though
it'd be nontrivial, and unlikely to be something that will happen from
the upstream developers; backporting that far can often be significant
work, as the kernel APIs are constantly changing. In my experience,
this requires more than just massaging fs/ext4 and fs/jbd2 into shape,
but requires careful core kernel changes at times as well. It might be
the sort of thing you could find a contract for, I imagine there are
people who would be willing to do this.

Aside from all that though, I wonder if you can share the use case with
us? It's always interesting to see how people choose to use various
filesystems for specific tasks, and this sounds like an interesting one.

Thanks,
-Eric

> Thanks,
> Nick Hennenfent
> Cisco IPTV



Subject: RE: EXT4 in embedded systems


We build IPTV settop boxes that are sold to telecom companies,
so our video comes into the box via a multicast IP stream.

Adding PVR/DVR/Tivo type functionality is becoming very popular
with end users.

As usual, in embedded systems, we have limited memory and cpu power.
We are currently using XFS for recording video to the hard disk, since
it uses less cpu than EXT3 did.

When a consumer is watching a High Def channel and recording a High Def
channel, we need all the extra performance we can get. We are hoping
that
the EXT4 file system will give us a performance boost.

People really hate it when they push a button on their remote and it
takes
a while for something to happen :)

Nick Hennenfent
Cisco IPTV


-----Original Message-----
From: Eric Sandeen [mailto:[email protected]]
Sent: Friday, April 03, 2009 12:07 PM
To: Nick Hennenfent (nhennefe)
Cc: Ext4 Developers List
Subject: Re: EXT4 in embedded systems

Nick Hennenfent (nhennefe) wrote:
> Hi all,
>
> I am currently running kernel 2.6.15 on an embedded mips system.
>
> I really would like to be able to use the ext4 file system.
>
> I can easily upgrade to kernel 2.6.22 but not any further.
>
> I would like to get the latest and greatest stable ext4 code onto this
> box.
>
> Is it possible to back-port the latest ext4 code to the 2.6.22 kernel?
>
> Or, is there a better way to go about it?

Hi Nick -

Well, "it's just code" :) I'm sure it's possible to backport it, though
it'd be nontrivial, and unlikely to be something that will happen from
the upstream developers; backporting that far can often be significant
work, as the kernel APIs are constantly changing. In my experience,
this requires more than just massaging fs/ext4 and fs/jbd2 into shape,
but requires careful core kernel changes at times as well. It might be
the sort of thing you could find a contract for, I imagine there are
people who would be willing to do this.

Aside from all that though, I wonder if you can share the use case with
us? It's always interesting to see how people choose to use various
filesystems for specific tasks, and this sounds like an interesting one.

Thanks,
-Eric

> Thanks,
> Nick Hennenfent
> Cisco IPTV



2009-04-03 18:27:23

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes



On Fri, 3 Apr 2009, Theodore Ts'o wrote:
>
> Please pull from:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git ext3-latency-fixes

Thanks, pulled. I'll be interested to see how it feels. Will report back
after I've rebuild and gone through a few more emails.

One thing I started wondering about in your changes to start using
WRITE_SYNC is that I'm getting closer to thinking that we did the whole
WRITE-vs-WRITE_SYNC thing the wrong way around.

Now, it's clearly true that non-synchronous writes are hopefully always
the common case, so in that sense it makes sense to think of "WRITE" as
the default non-critical case, and then make the (fewer) WRITE_SYNC cases
be the special case.

But at the same time, I now suspect that we could actually have solved
this problem more easily by just doing things the other way around: make
the default "WRITE" be the high-priority one (to match "READ"), and then
just explicitly marking the data writes with "WRITE_ASYNC".

Why? Because I think that with all the writes sprinkled around in random
places, it's probably _easier_ to find the bulk writes that cause the
biggest issues, and just fix _those_ to be WRITE_ASYNC. They may be bulk,
they may be the common case, but they also tend to be the case where we
write with generic routines (eg the whole "do_writepages()" thing).

So the VFS layer tends to already do much of the bulk writeout, and maybe
we would have been better off just changing those to ASYNC and leaving any
more specialized cases as the SYNC case? That would have avoided a lot of
this effort at the filesystem level. We'd just assume that the default
filesystem-specific writes tend to all be SYNC.

Comments?

Linus

2009-04-03 18:47:32

by Jens Axboe

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes

On Fri, Apr 03 2009, Linus Torvalds wrote:
>
>
> On Fri, 3 Apr 2009, Theodore Ts'o wrote:
> >
> > Please pull from:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git ext3-latency-fixes
>
> Thanks, pulled. I'll be interested to see how it feels. Will report back
> after I've rebuild and gone through a few more emails.

I have one question, didn't see this series before... Ted, what kind of
tests did you run with this and on what? Currently one has to be careful
with WRITE_SYNC, as it also implies an immediate unplug of the device.
So not only does it flag the priority as sync, it'll also kick things
off immediately. We had a nasty regression in performance in a few
revisions of the kernel due to this, sqlite performance was basically 4
times as bad as before we did WRITE_SYNC in sync_dirty_buffer(). So I'd
be curious what kind of testing was done with the patch series before
submitting it.

We should probably just dump the unplug bit from WRITE_SYNC and make
sure we do those explicitly after submission instead.

> One thing I started wondering about in your changes to start using
> WRITE_SYNC is that I'm getting closer to thinking that we did the whole
> WRITE-vs-WRITE_SYNC thing the wrong way around.
>
> Now, it's clearly true that non-synchronous writes are hopefully always
> the common case, so in that sense it makes sense to think of "WRITE" as
> the default non-critical case, and then make the (fewer) WRITE_SYNC cases
> be the special case.
>
> But at the same time, I now suspect that we could actually have solved
> this problem more easily by just doing things the other way around: make
> the default "WRITE" be the high-priority one (to match "READ"), and then
> just explicitly marking the data writes with "WRITE_ASYNC".
>
> Why? Because I think that with all the writes sprinkled around in random
> places, it's probably _easier_ to find the bulk writes that cause the
> biggest issues, and just fix _those_ to be WRITE_ASYNC. They may be bulk,
> they may be the common case, but they also tend to be the case where we
> write with generic routines (eg the whole "do_writepages()" thing).
>
> So the VFS layer tends to already do much of the bulk writeout, and maybe
> we would have been better off just changing those to ASYNC and leaving any
> more specialized cases as the SYNC case? That would have avoided a lot of
> this effort at the filesystem level. We'd just assume that the default
> filesystem-specific writes tend to all be SYNC.

Makes some sense, but we have to be really careful with SYNC writes.
It's important that it really be things that are immediately waited upon
and not "important" writes, since otherwise it'll wreak havoc with the
responsiveness of our reads.

--
Jens Axboe


2009-04-03 19:05:51

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes



On Fri, 3 Apr 2009, Linus Torvalds wrote:

>
>
> On Fri, 3 Apr 2009, Theodore Ts'o wrote:
> >
> > Please pull from:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git ext3-latency-fixes
>
> Thanks, pulled. I'll be interested to see how it feels. Will report back
> after I've rebuild and gone through a few more emails.

Hmm.

The "overwrite" behavior may well be better, but it was smooth enough
beforehand too (never having more than ~8MB dirty). The "create big file
and sync" workload causes huge fsync pauses, though. IOW, try with

while :
do
time sh -c "dd if=/dev/zero of=bigfile bs=8M count=256 ; sync"
done

and even really small fsync's end up being at the end of all that
unrelated activity, and you see things like

fsync(7) = 0 <32.756308>

(that was my "switch email folders with update" test case, the full trace
for that file descriptor is

open("/home/torvalds/mail/git-list", O_RDWR) = 7 <0.000010>
fstatfs(7, {f_type="EXT2_SUPER_MAGIC", f_bsize=4096, f_blocks=19230104, f_bfree=13853292, f_bavail=12876440, f_files=4890624
flock(7, LOCK_EX) = 0 <0.000009>
fstat(7, {st_mode=S_IFREG|0600, st_size=54231534, ...}) = 0 <0.000005>
lseek(7, 0, SEEK_SET) = 0 <0.000006>
write(7, "From MAILER-DAEMON Fri Apr 3 11:"..., 554) = 554 <0.000012>
lseek(7, 54202529, SEEK_SET) = 54202529 <0.000007>
read(7, "From [email protected]"..., 66) = 66 <0.000008>
lseek(7, 54202595, SEEK_SET) = 54202595 <0.000006>
read(7, "Return-Path: <[email protected]"..., 2915) = 2915 <0.000007>
lseek(7, 54202529, SEEK_SET) = 54202529 <0.000005>
write(7, "From [email protected]"..., 2981) = 2981 <0.000009>
ftruncate(7, 54231534) = 0 <0.000008>
fsync(7) = 0 <32.756308>
close(7) = 0 <0.000006>

so it had done just a few kB of writes, but because it ended up behind
the humongous backlog of 'bigfile' it didn't much help.

Also, it's maybe worth noting that you don't actually need a 2GB file to
trigger this behavior. Change that "count=256" into a "count=16", and you
now have a simulation of just writing 128MB at a time, with a "sync" in
between to make sure it hits the disk. It makes the pauses smaller, but
they are still several seconds.

(That, btw, is probably more the kind of thing I see when doign a "yum
update". I assume a package manager would do exactly that kind of "unpack
files and sync" in a loop).

Btw, I assume this same thing holds true for ext4 too? Because it shows
how two different "sync" operations interact, and one kills the
performance of the other one. So as long as there is a _single_ fsync()
user, you're fine. It's when you get more than one...

Again, I have that Intel SSD that should do pretty reliably 40+MB/s even
with really nasty write patterns, so I do need several hundred megs to
really see painful pauses. On a slower disk you'd need much less).

Linus

2009-04-03 19:13:26

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes

On Fri, Apr 03, 2009 at 08:47:29PM +0200, Jens Axboe wrote:
>
> I have one question, didn't see this series before... Ted, what kind of
> tests did you run with this and on what?

Well, I've been using it in my default kernel for about a month, and
I've noticed any massive performance issues. (4GB X61s, with an SSD
as my primary drive and a 5400 rpm disk as my secondary drive). I
also did testing on my crash and burn machine N270 ATOM netbook, 512
MB memory, 5400rpm drive.

Most of my testing has been on the fsync latency issue plus my
standard filesystem regression for stability (dbench, fsx, etc.) I
have not done an exhaustive series of performance tests, so you bring
up a good point.

> Currently one has to be careful with WRITE_SYNC, as it also implies
> an immediate unplug of the device.

Yeah, I can definitely see how this could be problematic. What I
really wanted was to separate out the priority of "things which cause
the user to wait and fume" from "things that are just being dumped out
to disk in the background". Unplugging the device is definitely *not*
a good thing here, and you're right, it would be better if we fixed
the relatively few places in the code which use WRITE_SYNC to unplug
the queue after all of the writes are submitted.

- Ted

2009-04-03 19:55:02

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes

On Fri, Apr 03, 2009 at 11:24:50AM -0700, Linus Torvalds wrote:
> But at the same time, I now suspect that we could actually have solved
> this problem more easily by just doing things the other way around: make
> the default "WRITE" be the high-priority one (to match "READ"), and then
> just explicitly marking the data writes with "WRITE_ASYNC".
>
> Why? Because I think that with all the writes sprinkled around in random
> places, it's probably _easier_ to find the bulk writes that cause the
> biggest issues, and just fix _those_ to be WRITE_ASYNC. They may be bulk,
> they may be the common case, but they also tend to be the case where we
> write with generic routines (eg the whole "do_writepages()" thing).
>
> So the VFS layer tends to already do much of the bulk writeout, and maybe
> we would have been better off just changing those to ASYNC and leaving any
> more specialized cases as the SYNC case? That would have avoided a lot of
> this effort at the filesystem level. We'd just assume that the default
> filesystem-specific writes tend to all be SYNC.

Well, most filesystem-specific writes tend all to be ASYNC; it's only
those related to commits and fsync() which are SYNC. Ext3 is unusual
in that data=ordered and the physical-block journalling design of the
jbd layer means that we actually have a much larger number of blocks
that need to be written out synchronously than most other filesystems.
But even so, the number of callsites that I needed to change weren't
that large; in fact, over half of them weren't in the filesystem at
all, but in the page writeback code, since fsync() and data=ordered
both need to wait for the inodes's pages to be flushed out to disk,
and that's all done in common code. The other 40% was in the jbd's
commit code, while we are writing out the journal buffers.

I suspect the more important thing to address is the fact that
WRITE_SYNC unplugs the block device queue, and we would be better off
separating marking a particular I/O as "a user is waiting for this"
from "unplug the device queue now". That will hopefully improve
things even more.

- Ted

2009-04-03 20:44:24

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes



On Fri, 3 Apr 2009, Linus Torvalds wrote:
>
> The "overwrite" behavior may well be better, but it was smooth enough
> beforehand too (never having more than ~8MB dirty). The "create big file
> and sync" workload causes huge fsync pauses, though. IOW, try with
>
> while :
> do
> time sh -c "dd if=/dev/zero of=bigfile bs=8M count=256 ; sync"
> done
>
> and even really small fsync's end up being at the end of all that
> unrelated activity, and you see things like
>
> fsync(7) = 0 <32.756308>

Hmm. So I decided to try with "data=writeback" to see if it really makes
that big of a difference. It does help, but I still easily trigger
multi-second pauses:

fsync(4) = 0 <2.447926>
fsync(4) = 0 <4.275472>
fsync(4) = 0 <3.731948>
fsync(4) = 0 <4.020839>
fsync(6) = 0 <3.482735>
fsync(6) = 0 <5.819923>

even though the system _should_ be able to write back the 'bigfile'
datablocks without any ordering constraint on the fsync.

So at a guess, it now avoids some nasty journal writing ordering issue
where it has to wait for the previous transaction, and it's probably now
purely an IO ordering issue.

This is all with your ext3 work, btw. But I also added "rm bigfile" at the
end of the loop (so that it shouldn't trigge any "write out bigfile early"
logic), and that didn't seem to make any difference.

Are we perhaps ending up doing those regular 'bigfile' writes as
WRITE_SYNC, just because of the global "sync()" call? That's probably a
bad idea. A "sync" is about pure throughput. It's not about latency like
"fsync()" is.

Linus

2009-04-03 21:02:00

by Chris Mason

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes

On Fri, 2009-04-03 at 20:47 +0200, Jens Axboe wrote:
> On Fri, Apr 03 2009, Linus Torvalds wrote:
> >
> >
> > On Fri, 3 Apr 2009, Theodore Ts'o wrote:
> > >
> > > Please pull from:
> > >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git ext3-latency-fixes
> >
> > Thanks, pulled. I'll be interested to see how it feels. Will report back
> > after I've rebuild and gone through a few more emails.
>
> I have one question, didn't see this series before... Ted, what kind of
> tests did you run with this and on what? Currently one has to be careful
> with WRITE_SYNC, as it also implies an immediate unplug of the device.
> So not only does it flag the priority as sync, it'll also kick things
> off immediately. We had a nasty regression in performance in a few
> revisions of the kernel due to this, sqlite performance was basically 4
> times as bad as before we did WRITE_SYNC in sync_dirty_buffer(). So I'd
> be curious what kind of testing was done with the patch series before
> submitting it.
>

I had run a few tests against this...the assumption was that since ext3
doesn't have writepages I should be able to easily show using WRITE_SYNC
made for fewer merges and lower performance. But, the write cache on my
array happily ate the smaller requests and came out just as fast.

So, your sqlite example made me think of something different:

mkfs.ext3 /dev/sdb
mount /dev/sdb /mnt
cd /mnt
tar xf /local/linux-2.6.tar
sync
sync
sync
echo 3 > /proc/sys/vm/drop_caches

# step one, start an O_SYNC writer to trigger lots of WRITE_SYNC
# this 32GB file is big enough that my drive can't finish it before the
# tar is done.
dd if=/dev/zero of=/mnt/foo bs=1M count=32000 oflag=sync &

# step two, time how long it takes to read the linux-2.6 dir
# from above
time tar cf /dev/zero /mnt/linux-2.6 &
wait %2
kill %1

/local/linux-2.6.tar is a full kernel tree after compiling with
debuginfo turned on and including the mercurial history files. It has
2.2GB of files. The goal here is to time how long it takes to read the
big directory tree in the face of a writer doing constant WRITE_SYNC
writes. Time to read the directory tree:

With Ted's ext3 patches: 8m2s
Plain 2.6.29: 4m9s

-chris



2009-04-04 13:57:28

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes

On Fri, Apr 03, 2009 at 01:41:54PM -0700, Linus Torvalds wrote:
>
> Hmm. So I decided to try with "data=writeback" to see if it really makes
> that big of a difference. It does help, but I still easily trigger
> multi-second pauses

Using ext3 data=writeback, your "write big (2GB) file and sync" as the
background workload, and fsync-tester, I was able to reduce the
latency down to under 1 second...

fsync time: 0.1717
fsync time: 0.0205
fsync time: 0.1814
fsync time: 0.7408
fsync time: 0.1955
fsync time: 0.0327
fsync time: 0.0456
fsync time: 0.0563

...by doing two things: (1) Applying the attached patch, which fixes
one last critical place where we were using WRITE instead of
WRITE_SYNC --- the commit record was being written by
sync_dirty_buffer(), and this of _all_ places needs to use WRITE_SYNC,
since it waits for the buffer to be written write after submitting the
I/O, and (2) using the anticipatory I/O scheduler instead of the cfq
I/O scheduler.

(1) brought things down from 2-3.5 seconds on my system to 1-2
seconds, and (2) brought things down to what you see above. I think
what is going on with the cfq scheduler is that it's using time slices
to make sure sync and async operations never completely starve each
other out, and in this case we need to tune the I/O scheduling
parameters so that for this workload, the synchronous operations don't
end up getting impeded by the asynchronous writes caused by background
"write big file and sync" task.

In any case, Jens may want to look at this test case (ext3
data=writeback, 'while : ; do time sh -c "dd if=/dev/zero of=bigfile
bs=8M count=256 ; sync"; done', and fsync-tester) as a good way to see
how cfq might be improved.

On another thread, it's been commented that there are still workloads
for which people are quietly switching from CFQ to AS, and this is
bad, because it causes us not to try to figure out why our default I/O
scheduler still as these 1% of cases where people need to use another
scheduler. Well, here's one such case which is relatively easy to
reproduce.

> Are we perhaps ending up doing those regular 'bigfile' writes as
> WRITE_SYNC, just because of the global "sync()" call? That's probably a
> bad idea. A "sync" is about pure throughput. It's not about latency like
> "fsync()" is.

Well, at least on my system where I did this round of testing (4gig
X61s with a 5400 RPM laptop drive), most of the time we weren't
writing the bigfile writes because of the sync, but because dd and
pdflush processes was trying to flush out the dirty pages from the big
write operation. At the moment where "dd" completes and the "sync"
command is executed, the fsync latency jumped up to about 4-5 seconds
before this last round of changes. After adding the attached patch
and switching to the AS I/O scheduler, at the moment of the sync the
fsync latency was just over a second (1.1 to 1.2 seconds). The rest
of the time we are averaging between a 1/4 and a 1/3 of a second, with
rare fsync latency spikes up to about 3/4 of a second, as show at the
beginning of this message.

(Maybe on a system with a lot more memory, the dirty pages don't start
getting flushed to disk until the sync command, but that's not what
I'm seeing on my 4 gig laptop.)

In answer to your question, "sync" does the writes in two passes;
first it pushes out writes with wbc.sync_mode set to WB_SYNC_NONE, and
then it calls the page writeback routines a second time with
WB_SYNC_ALL. So most of the writes should go out with WRITE, except
that the page writeback routines aren't as aggressive about pushing
out _all_ pages in WB_SYNC_NONE, so I believe some of the pages would
still be written on the WB_SYNC_ALL, and thus would go out using
WRITE_SYNc. This is based on 2-3 month old memory of how things
worked in the page-writeback routines, which is the last time I traced
the very deep call trees involved in this area. I'd have to run a
blktrace experiment to see for sure how many of the writes were going
out as WRITE vs. WRITE_SYNC in the case of the 'sync' command.

In any case, I recommend you take the following attached patch, and
then try out ext3 data=writeback with anticipatory I/O scheduler.
Hopefully you'll be pleased with the results.

- Ted

>From 6d293d2aa42d43c120f113bde55f7b0d6f3f35ae Mon Sep 17 00:00:00 2001
From: Theodore Ts'o <[email protected]>
Date: Sat, 4 Apr 2009 09:17:38 -0400
Subject: [PATCH] sync_dirty_buffer: Use WRITE_SYNC instead of WRITE

The sync_dirty_buffer() function submits a buffer for write, and then
synchronously waits for it. It clearly should use WRITE_SYNC instead
of WRITE. This significantly reduces ext3's fsync() latency when
there is a huge background task writing data asyncronously in the
background, since ext3 uses sync_dirty_buffer() to write the commit
block.

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/buffer.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 2ed4b68..78ed086 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3038,7 +3038,7 @@ int sync_dirty_buffer(struct buffer_head *bh)
if (test_clear_buffer_dirty(bh)) {
get_bh(bh);
bh->b_end_io = end_buffer_write_sync;
- ret = submit_bh(WRITE, bh);
+ ret = submit_bh(WRITE_SYNC, bh);
wait_on_buffer(bh);
if (buffer_eopnotsupp(bh)) {
clear_buffer_eopnotsupp(bh);
--
1.5.6.3


2009-04-04 15:16:53

by Jens Axboe

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes

On Sat, Apr 04 2009, Theodore Tso wrote:
> On Fri, Apr 03, 2009 at 01:41:54PM -0700, Linus Torvalds wrote:
> >
> > Hmm. So I decided to try with "data=writeback" to see if it really makes
> > that big of a difference. It does help, but I still easily trigger
> > multi-second pauses
>
> Using ext3 data=writeback, your "write big (2GB) file and sync" as the
> background workload, and fsync-tester, I was able to reduce the
> latency down to under 1 second...
>
> fsync time: 0.1717
> fsync time: 0.0205
> fsync time: 0.1814
> fsync time: 0.7408
> fsync time: 0.1955
> fsync time: 0.0327
> fsync time: 0.0456
> fsync time: 0.0563
>
> ...by doing two things: (1) Applying the attached patch, which fixes
> one last critical place where we were using WRITE instead of
> WRITE_SYNC --- the commit record was being written by
> sync_dirty_buffer(), and this of _all_ places needs to use WRITE_SYNC,
> since it waits for the buffer to be written write after submitting the
> I/O, and (2) using the anticipatory I/O scheduler instead of the cfq
> I/O scheduler.
>
> (1) brought things down from 2-3.5 seconds on my system to 1-2
> seconds, and (2) brought things down to what you see above. I think
> what is going on with the cfq scheduler is that it's using time slices
> to make sure sync and async operations never completely starve each
> other out, and in this case we need to tune the I/O scheduling
> parameters so that for this workload, the synchronous operations don't
> end up getting impeded by the asynchronous writes caused by background
> "write big file and sync" task.
>
> In any case, Jens may want to look at this test case (ext3
> data=writeback, 'while : ; do time sh -c "dd if=/dev/zero of=bigfile
> bs=8M count=256 ; sync"; done', and fsync-tester) as a good way to see
> how cfq might be improved.
>
> On another thread, it's been commented that there are still workloads
> for which people are quietly switching from CFQ to AS, and this is
> bad, because it causes us not to try to figure out why our default I/O
> scheduler still as these 1% of cases where people need to use another
> scheduler. Well, here's one such case which is relatively easy to
> reproduce.
>
> > Are we perhaps ending up doing those regular 'bigfile' writes as
> > WRITE_SYNC, just because of the global "sync()" call? That's probably a
> > bad idea. A "sync" is about pure throughput. It's not about latency like
> > "fsync()" is.
>
> Well, at least on my system where I did this round of testing (4gig
> X61s with a 5400 RPM laptop drive), most of the time we weren't
> writing the bigfile writes because of the sync, but because dd and
> pdflush processes was trying to flush out the dirty pages from the big
> write operation. At the moment where "dd" completes and the "sync"
> command is executed, the fsync latency jumped up to about 4-5 seconds
> before this last round of changes. After adding the attached patch
> and switching to the AS I/O scheduler, at the moment of the sync the
> fsync latency was just over a second (1.1 to 1.2 seconds). The rest
> of the time we are averaging between a 1/4 and a 1/3 of a second, with
> rare fsync latency spikes up to about 3/4 of a second, as show at the
> beginning of this message.
>
> (Maybe on a system with a lot more memory, the dirty pages don't start
> getting flushed to disk until the sync command, but that's not what
> I'm seeing on my 4 gig laptop.)
>
> In answer to your question, "sync" does the writes in two passes;
> first it pushes out writes with wbc.sync_mode set to WB_SYNC_NONE, and
> then it calls the page writeback routines a second time with
> WB_SYNC_ALL. So most of the writes should go out with WRITE, except
> that the page writeback routines aren't as aggressive about pushing
> out _all_ pages in WB_SYNC_NONE, so I believe some of the pages would
> still be written on the WB_SYNC_ALL, and thus would go out using
> WRITE_SYNc. This is based on 2-3 month old memory of how things
> worked in the page-writeback routines, which is the last time I traced
> the very deep call trees involved in this area. I'd have to run a
> blktrace experiment to see for sure how many of the writes were going
> out as WRITE vs. WRITE_SYNC in the case of the 'sync' command.
>
> In any case, I recommend you take the following attached patch, and
> then try out ext3 data=writeback with anticipatory I/O scheduler.
> Hopefully you'll be pleased with the results.
>
> - Ted
>
> From 6d293d2aa42d43c120f113bde55f7b0d6f3f35ae Mon Sep 17 00:00:00 2001
> From: Theodore Ts'o <[email protected]>
> Date: Sat, 4 Apr 2009 09:17:38 -0400
> Subject: [PATCH] sync_dirty_buffer: Use WRITE_SYNC instead of WRITE
>
> The sync_dirty_buffer() function submits a buffer for write, and then
> synchronously waits for it. It clearly should use WRITE_SYNC instead
> of WRITE. This significantly reduces ext3's fsync() latency when
> there is a huge background task writing data asyncronously in the
> background, since ext3 uses sync_dirty_buffer() to write the commit
> block.

Sorry for not commenting on the rest of your email, I don't have a lot
of spare time on my hands. I'll get to that.

Big nack on this patch. Ted, this is EXACTLY where I told you we saw big
write regressions (sqlite performance drops by a factor of 4-5). Do a
git log on fs/buffer.c and see the original patch (which does what your
patch does) and the later revert. No idea why you are now suggestion
making that exact change?!

Low latency is nice, but not at the cost of 4-5x throughput for real
world cases. Lets please try and keep a reasonable focus here, things
need to be tested widely before we go and sprinkle sync magic
everywhere. I can't rule out that CFQ will need adjusting, cutting down
to basics it looks at sync vs async like AS does.

--
Jens Axboe


2009-04-04 15:59:58

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes



On Sat, 4 Apr 2009, Jens Axboe wrote:
>
> Big nack on this patch. Ted, this is EXACTLY where I told you we saw big
> write regressions (sqlite performance drops by a factor of 4-5). Do a
> git log on fs/buffer.c and see the original patch (which does what your
> patch does) and the later revert. No idea why you are now suggestion
> making that exact change?!

Jens, if I can re-create the 'fsync' times (I haven't yet), then the
default scheduler _will_ be switched to AS.

> Low latency is nice, but not at the cost of 4-5x throughput for real
> world cases.

I'm sorry, but that fsync thing _is_ a real-world case, and it's the one
that a hell of a lot more people care about than some idiotic sqlite
throughput issue.

You have a test-case now. Consider it a priority, or consider CFQ to be a
"for crazy servers that only care about throughput".

Quite frankly, the fact that I can see _seconds_ of latencies with a
really good SSD is not acceptable. The fact that it is by design is even
less so.

Latency is more important than throughput. It's that simple.

Linus

2009-04-04 16:09:25

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes



On Sat, 4 Apr 2009, Linus Torvalds wrote:

>
>
> On Sat, 4 Apr 2009, Jens Axboe wrote:
> >
> > Big nack on this patch. Ted, this is EXACTLY where I told you we saw big
> > write regressions (sqlite performance drops by a factor of 4-5). Do a
> > git log on fs/buffer.c and see the original patch (which does what your
> > patch does) and the later revert. No idea why you are now suggestion
> > making that exact change?!
>
> Jens, if I can re-create the 'fsync' times (I haven't yet), then the
> default scheduler _will_ be switched to AS.

Btw, that patch is "obviously correct".

That write we're submitting is very much a synchronous write. After all,
the code is literally

ret = submit_bh(WRITE, bh);
wait_on_buffer(bh);

and it just doesn't get any more synchronous than that. If we don't start
the IO immediately (since we're _waiting_ for it immediately), we're
broken.

Now, if we need to fix some mysql throughput issue as a result, then I'd
suggest that we look at whether "sync_dirty_buffer()" is sometimes called
when it doesn't need to be od (b) whether perhaps the unplugging behavior
is simply buggy in some other way.

But Ted's patch makes so much sense on a purely conceptual level, that
when you look at the patch, you should almost not even need to see the
performance numbers to know it's right. But together with the numbers Ted
posted, it's a total no-brainer. CFQ is clearly broken here, and it's
pretty clear that apparently CFQ has been tuned (improperly) purely for
throughput.

Linus

2009-04-04 17:34:15

by Jens Axboe

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes

On Sat, Apr 04 2009, Linus Torvalds wrote:
>
>
> On Sat, 4 Apr 2009, Jens Axboe wrote:
> >
> > Big nack on this patch. Ted, this is EXACTLY where I told you we saw big
> > write regressions (sqlite performance drops by a factor of 4-5). Do a
> > git log on fs/buffer.c and see the original patch (which does what your
> > patch does) and the later revert. No idea why you are now suggestion
> > making that exact change?!
>
> Jens, if I can re-create the 'fsync' times (I haven't yet), then the
> default scheduler _will_ be switched to AS.

Linus, I'm not aware of a difference here between AS and CFQ. If there
is, it's surely a bug and it will be fixed ASAP. The email I wrote has
nothing to do with CFQ performance, it was a general observation on what
happened with an identical patch.

> > Low latency is nice, but not at the cost of 4-5x throughput for real
> > world cases.
>
> I'm sorry, but that fsync thing _is_ a real-world case, and it's the one
> that a hell of a lot more people care about than some idiotic sqlite
> throughput issue.

sqlite is just one case, I'm sure there are others. My point is that we
should make sure that we don't regress on the throughput side. It's a
trade off, we don't want throughput to fall through the floor either.

> You have a test-case now. Consider it a priority, or consider CFQ to be a
> "for crazy servers that only care about throughput".

CFQ was never for crazy servers, it was very much about interactiveness
from the very beginning. So if it's broken on CFQ, it will of course get
fixed right away.

> Quite frankly, the fact that I can see _seconds_ of latencies with a
> really good SSD is not acceptable. The fact that it is by design is even
> less so.

Agree, multi-second latencies is not acceptable.

> Latency is more important than throughput. It's that simple.

It's really not that simple, otherwise the schedulers would be much
simpler. It's pretty easy to get good latency if you disregard any
throughput concerns, that'll never work in the real world. Latency is
definitely extremely important, but simply stating that latency is the
one and only factor of importance is just too simplistic.

--
Jens Axboe


2009-04-04 17:36:44

by Jens Axboe

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes

On Sat, Apr 04 2009, Linus Torvalds wrote:
>
>
> On Sat, 4 Apr 2009, Linus Torvalds wrote:
>
> >
> >
> > On Sat, 4 Apr 2009, Jens Axboe wrote:
> > >
> > > Big nack on this patch. Ted, this is EXACTLY where I told you we saw big
> > > write regressions (sqlite performance drops by a factor of 4-5). Do a
> > > git log on fs/buffer.c and see the original patch (which does what your
> > > patch does) and the later revert. No idea why you are now suggestion
> > > making that exact change?!
> >
> > Jens, if I can re-create the 'fsync' times (I haven't yet), then the
> > default scheduler _will_ be switched to AS.
>
> Btw, that patch is "obviously correct".
>
> That write we're submitting is very much a synchronous write. After all,
> the code is literally
>
> ret = submit_bh(WRITE, bh);
> wait_on_buffer(bh);
>
> and it just doesn't get any more synchronous than that. If we don't start
> the IO immediately (since we're _waiting_ for it immediately), we're
> broken.
>
> Now, if we need to fix some mysql throughput issue as a result, then I'd
> suggest that we look at whether "sync_dirty_buffer()" is sometimes called
> when it doesn't need to be od (b) whether perhaps the unplugging behavior
> is simply buggy in some other way.
>
> But Ted's patch makes so much sense on a purely conceptual level, that
> when you look at the patch, you should almost not even need to see the
> performance numbers to know it's right. But together with the numbers Ted
> posted, it's a total no-brainer. CFQ is clearly broken here, and it's
> pretty clear that apparently CFQ has been tuned (improperly) purely for
> throughput.

I agree, hence I previously wrote and submitted an IDENTICAL patch. I'll
go and test things, not sure why you think that AS and CFQ perform very
differently here, to my knowledge no such postings exist for this test
case. And I'll state again that if they do, of course I'll look into
that and fix it.

One thing that may be of concern is the immediate unplug, but on the
other hand, we do an immediate wait which would unplug anyway. So if we
just look at the sqlite regression, I'm sure it'll be pretty easy to pin
point with some blktrace data.

--
Jens Axboe


2009-04-04 17:46:56

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes



On Sat, 4 Apr 2009, Jens Axboe wrote:
> >
> > I'm sorry, but that fsync thing _is_ a real-world case, and it's the one
> > that a hell of a lot more people care about than some idiotic sqlite
> > throughput issue.
>
> sqlite is just one case, I'm sure there are others. My point is that we
> should make sure that we don't regress on the throughput side. It's a
> trade off, we don't want throughput to fall through the floor either.

Jens, we _have_ regressed on the latency side. Everybody agrees.

Also, I may be odd, but I really do think latency is more important than
throughput. When my disk has latencies in the sub-milliseconds, I simply
do not think it is _acceptable_ to have hickups that affect my workload in
human-visible terms.

You say sqlite might regress by 4-5x. But Ted's numbers improve latencies
by mor than that. I haven't re-created them yet myself (still reading
email), but the point is, 4-5x may sound bad to you, but turn it around:
the current latency situation is _really_ bad. If we can fix it, we
definitely should.

> > Quite frankly, the fact that I can see _seconds_ of latencies with a
> > really good SSD is not acceptable. The fact that it is by design is even
> > less so.
>
> Agree, multi-second latencies is not acceptable.

I can literally send you strace output from my MUA, where it pauses for
ten seconds after it has written about 5kB (that's _kilobytes_) of data
and does a 'fsync'.

That's the load that Ted worked on and has a solution for.

Linus

2009-04-04 18:00:58

by Trenton D. Adams

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes

Hi Guys,

Sorry for this "quick" interruption. Jens doesn't like the sqllite
problem. Ted and Linus don't like the latency problem. Why not make
that SYNC_ALL option a 1/0 option in /proc/sys/vm/sync_all or
something of the like? That sort of change is pretty small, and quick
to make, right? This would help, at least until the real problem is
found.

Sorry if I'm stepping out of bounds. :P

2009-04-04 18:01:11

by Jens Axboe

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes

On Sat, Apr 04 2009, Linus Torvalds wrote:
>
>
> On Sat, 4 Apr 2009, Jens Axboe wrote:
> > >
> > > I'm sorry, but that fsync thing _is_ a real-world case, and it's the one
> > > that a hell of a lot more people care about than some idiotic sqlite
> > > throughput issue.
> >
> > sqlite is just one case, I'm sure there are others. My point is that we
> > should make sure that we don't regress on the throughput side. It's a
> > trade off, we don't want throughput to fall through the floor either.
>
> Jens, we _have_ regressed on the latency side. Everybody agrees.

It appears so, yes.

> Also, I may be odd, but I really do think latency is more important than
> throughput. When my disk has latencies in the sub-milliseconds, I simply
> do not think it is _acceptable_ to have hickups that affect my workload in
> human-visible terms.

Not everyone has an Intel SSD. But yes, latency is definitely more
important than throughput. That's not the same as saying that throughput
doesn't matter, because it definitely does.

> You say sqlite might regress by 4-5x. But Ted's numbers improve latencies
> by mor than that. I haven't re-created them yet myself (still reading
> email), but the point is, 4-5x may sound bad to you, but turn it around:
> the current latency situation is _really_ bad. If we can fix it, we
> definitely should.

I haven't either. On monday I'll throw some testing and patches on the
boxes here. We can get the latency right, I want that as much as the
next guy. I just want to make sure it doesn't become too one-sided.

> > > Quite frankly, the fact that I can see _seconds_ of latencies with a
> > > really good SSD is not acceptable. The fact that it is by design is even
> > > less so.
> >
> > Agree, multi-second latencies is not acceptable.
>
> I can literally send you strace output from my MUA, where it pauses for
> ten seconds after it has written about 5kB (that's _kilobytes_) of data
> and does a 'fsync'.

Unless you make all journal writes sync, ext3 fsync will always suck big
time. But I get your point.

--
Jens Axboe


2009-04-04 18:12:52

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes



On Sat, 4 Apr 2009, Jens Axboe wrote:
>
> > Also, I may be odd, but I really do think latency is more important than
> > throughput. When my disk has latencies in the sub-milliseconds, I simply
> > do not think it is _acceptable_ to have hickups that affect my workload in
> > human-visible terms.
>
> Not everyone has an Intel SSD.

The thing is, I know for a fact that the hickups were much worse _before_
I had the Intel SSD.

With the SSD, I have to work a lot more to get them. Back when I had
regular rotational disks, I definitely did not need to do any 2GB file
creation at all. Just standard "yum update" would make my mail reader
pause.

Yeah, it's just for a short time, but it's very annoying. One second is
quite noticeable when I type something, and it doesn't show up on the
screen ("What? My Nehalem cannot keep up with my _typing_?"). 2-3 seconds
it really feels bad. And by the time it takes 5+ seconds, I start moving
my mouse around just to see whether the computer crashed.

Linus

2009-04-04 19:18:34

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes

On Sat, Apr 04, 2009 at 05:16:50PM +0200, Jens Axboe wrote:
> Big nack on this patch. Ted, this is EXACTLY where I told you we saw big
> write regressions (sqlite performance drops by a factor of 4-5). Do a
> git log on fs/buffer.c and see the original patch (which does what your
> patch does) and the later revert.

You mean this revert, right?

commit 78f707bfc723552e8309b7c38a8d0cc51012e813
Author: Jens Axboe <[email protected]>
Date: Tue Feb 17 13:59:08 2009 +0100

block: revert part of 18ce3751ccd488c78d3827e9f6bf54e6322676fb

The above commit added WRITE_SYNC and switched various places to using
that for committing writes that will be waited upon immediately after
submission. However, this causes a performance regression with AS and CFQ
for ext3 at least, since sync_dirty_buffer() will submit some writes with
WRITE_SYNC while ext3 has sumitted others dependent writes without the sync
flag set. This causes excessive anticipation/idling in the IO scheduler
because sync and async writes get interleaved, causing a big performance
regression for the below test case (which is meant to simulate sqlite
like behaviour)....

OK, let me test things out first, but note that with the changes that
Linus has already accepted, this may not be an issue --- since we've
now fixed ext3 to submit those dependent writes with the SYNC flag
now. So I'm not sure the performance regression still applies, but
I'll test using the test case supplied in the rest of the commit log
and get back to you.

- Ted

2009-04-04 20:18:45

by Ingo Molnar

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes


* Jens Axboe <[email protected]> wrote:

> > Quite frankly, the fact that I can see _seconds_ of latencies
> > with a really good SSD is not acceptable. The fact that it is by
> > design is even less so.
>
> Agree, multi-second latencies is not acceptable.

btw., just to insert some hard data: usability studies place the
acceptance threshold for delays to around 300 milliseconds.

If the computer does not 'respond' (in a real or a fake, visible or
audible way) to their input within 0.3 seconds they get annoyed
emotionally.

It does not matter how complex it is for the kernel to solve this
problem, it does not matter how far away and difficult to access the
data is. If we are not absolutely latency centric in Linux, if we
spuriously delay apps that do supposedly simple-looking things the
user _will_ get annoyed and _will_ pick another OS.

All things considered it is certainly a combined kernel and app
problem space to get there, but not being completely aware of its
importance in the kernel kills any chance of us ever having a
long-term solution.

Ingo

2009-04-04 20:54:56

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes

On Sat, 4 Apr 2009 19:34:12 +0200
Jens Axboe <[email protected]> wrote:

> > Latency is more important than throughput. It's that simple.
>
> It's really not that simple, otherwise the schedulers would be much
> simpler. It's pretty easy to get good latency if you disregard any
> throughput concerns,

I'd be very interested in a scheduler like that.....
How much work would it be to make it ?

(if nothing else it would be a good number to have "should be within
50% of the perfect one for the tradeoff")


--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2009-04-04 22:15:45

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes



On Sat, 4 Apr 2009, Theodore Tso wrote:
>
> Using ext3 data=writeback, your "write big (2GB) file and sync" as the
> background workload, and fsync-tester, I was able to reduce the
> latency down to under 1 second...

Hmm. I can certainly see a very noticeable improvement.

I could still see 1-2s fsync() delays for my MUA test (and seem to have
seen a 4s one once too). But most were all 'quite noticeable stutters',
and very seldom do they go into the 'ooh, that's really painful' stage.

And yes, anticipatory seems to be quite noticeably better than cfq here.
With cfq I got a few two-second delays on 'ftruncate()' too (probably
because of your new serialization code?), and the longest fsync() delay
was over 7 seconds. That was definitely solidly in the "painful" category.

(Jens - my test-case is not the exact same fsycn() test that Ted uses:
it's really just me being in my MUA editing an email, and doing that

while : ; do time sh -c "dd if=/dev/zero of=bigfile bs=8M count=256 ; sync; rm bigfile"; done

in another window. My MUA does these save-files ever minute or so, and
does a 'fsync()' after writing that (small) file. If the fsync() takes
more than half a second, I can definitely notice. If it takes 1-2 seconds,
it's a big stutter, where keyboard auto-repeat when moving around really
hurts (ie I don't see how it moves).

If it takes 5+ seconds, it feels really bad, and as if the whole email
client had just died.

Yeah, it's a nasty test.

Linus

2009-04-04 22:22:04

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes



On Sat, 4 Apr 2009, Linus Torvalds wrote:
>
> Hmm. I can certainly see a very noticeable improvement.

.. and btw, just to put that into perspective - I ended up running with
that repeated 'dd+sync+rm' in the background for quite a while, and the
machine was in no way unusable. I don't think I saw a single "ooh, that
feels really dead" case with anticipatory, although I saw several "bad
stutter" cases.

Of course, with the whole "overwrite" case, things are really much
smoother, and I can hardly notice anything happening at all. Although I
_do_ get an occasional 200-600ms fsync hickup, and I can tell those very
clearly. But if that was the worst I'd ever see, I'd be a very happy
person. It's a good goal to have, I suspect.

Linus

2009-04-04 23:22:31

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes

On Sat, Apr 04, 2009 at 08:01:08PM +0200, Jens Axboe wrote:
>
> Unless you make all journal writes sync, ext3 fsync will always suck big
> time. But I get your point.
>

We have already made all other journal writes sync when triggered by
fsync() --- except for the commit record which is being written by
sync_dirty_buffer(). I've verified this via blktrace.

However, you're right. We do have an performance regression using the
regression test provided in commit 78f707bf. Taking the average of at
least 5 runs, I found:

stock 2.6.29 1687 ms
2.6.29 w/ ext3-latency-fixes 7337 ms
2.6.29 w/ full-latency-fixes 13722 ms

"ext3-latency-fixes" are the patches which Linus has already pulled
into the 2.6.29 tree. "full-latency-fixes" are ext3-latency-fixes
plus the sync_dirty_buffes() patch.

Looking at blktrace of stock 2.6.29 running the sqlite performance
measurement, we see this:

254,2 1 13 0.005120554 7183 Q W 199720 + 8 [sqlite-fsync-te]
254,2 1 15 0.005666573 6947 Q W 10277200 + 8 [kjournald]
254,2 1 16 0.005691576 6947 Q W 10277208 + 8 [kjournald]
254,2 1 18 0.006145684 6947 Q W 10277216 + 8 [kjournald]
254,2 1 21 0.006685348 7183 Q W 199720 + 8 [sqlite-fsync-te]
254,2 1 24 0.007162644 6947 Q W 10277224 + 8 [kjournald]
254,2 1 25 0.007187857 6947 Q W 10277232 + 8 [kjournald]
254,2 0 27 0.007616473 6947 Q W 10277240 + 8 [kjournald]

Looking at a blktrace of 2.6.29 plus the full-latency-fixes, we see this:

254,2 0 13 0.013744556 7205 Q WS 199208 + 8 [sqlite-fsync-te]
254,2 0 16 0.019270748 6965 Q WS 10301896 + 8 [kjournald]
254,2 0 17 0.019400024 6965 Q WS 10301904 + 8 [kjournald]
254,2 1 23 0.019892824 6965 Q WS 10301912 + 8 [kjournald]
254,2 0 20 0.020450995 7205 Q WS 199208 + 8 [sqlite-fsync-te]
254,2 1 26 0.025954909 6965 Q WS 10301920 + 8 [kjournald]
254,2 1 27 0.026084814 6965 Q WS 10301928 + 8 [kjournald]
254,2 0 24 0.026612884 6965 Q WS 10301936 + 8 [kjournald]

Looking at the deltas between the two iterations of the sqlite
analogue, we see that stock 2.6.29 is 1.56ms, where as with the full
latency fixes, it's 6.70ms.

However, the full latency fixes all the writes are synchronous, so it
must be the case that the delays are caused by the fact that queue is
getting implicitly unplugged after the synchronous write, and the
problem is no longer the mixing of WRITE and WRITE_SYNC requests as
posted in the commit log for 78f707bf. If we remove the automatic
unplug for WRITE_SYNC requests, and add an explicit unplug where it is
needed, that should fix the performance regression for this particular
sqlite test case. (Which isn't a throughput issue, since the test is
basically fopen/write/fsync/fclose over and over again, with no
background load.)

The scenario which Linus and I had been focused on is one where there
was a heavy background load writing asynchronously. We do want to
make sure that a series of fsync() calls in a tight loop is also
reasonable as well, so Jens, I do think you are absolutely right that
this is something that we need to pay attention to.

I had missed the commit 78f707bf when you originally submitted it, so
I didn't do this test before submitting the patch. And I guess you
had missed my patch proposal to LKML, and I didn't think to explicitly
CC you on my patches. Apologies for the communication faults, but
hopefully we can fix this performance issue for both cases and get
these problems behind us.

Regards,

- Ted

2009-04-04 23:32:20

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes

On Sat, 4 Apr 2009 19:22:22 -0400
Theodore Tso <[email protected]> wrote:
>
> However, the full latency fixes all the writes are synchronous, so it
> must be the case that the delays are caused by the fact that queue is
> getting implicitly unplugged after the synchronous write, and the
> problem is no longer the mixing of WRITE and WRITE_SYNC requests as
> posted in the commit log for 78f707bf. If we remove the automatic
> unplug for WRITE_SYNC requests, and add an explicit unplug where it is
> needed, that should fix the performance regression for this particular
> sqlite test case.

removing the unplug is bound to be bad; after all we're waiting on the
IO. But maybe it should be "make the unplug a REALLY short time".
At least for rotating storage. For non-rotating .. I'd never wait.

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2009-04-05 04:09:52

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes

On Sat, Apr 04, 2009 at 04:33:49PM -0700, Arjan van de Ven wrote:
> > However, the full latency fixes all the writes are synchronous, so it
> > must be the case that the delays are caused by the fact that queue is
> > getting implicitly unplugged after the synchronous write, and the
> > problem is no longer the mixing of WRITE and WRITE_SYNC requests as
> > posted in the commit log for 78f707bf. If we remove the automatic
> > unplug for WRITE_SYNC requests, and add an explicit unplug where it is
> > needed, that should fix the performance regression for this particular
> > sqlite test case.
>
> removing the unplug is bound to be bad; after all we're waiting on the
> IO. But maybe it should be "make the unplug a REALLY short time".
> At least for rotating storage. For non-rotating .. I'd never wait.

Ext3 needs to submit a large number of blocks for writing with
WRITE_SYNC priority, without unplugging the queue, until they are all
submitted. Then we want to let things rip. (Hence the "add an
explicit unplug where it is needed".)

It may be that it's easier from an less-lines-of-the-kernels-to-change
point of view to add a WRITE_SYNC_PLUGGED which doesn't do the unplug,
and keep WRITE_SNYC as having the implicit unplug. Or maybe it would
be better to completely separate the "send a write which is marked as
SYNC" from the "implicit unplug" in the API.

Jens, do you have an opinion/preference?

- Ted



2009-04-05 04:09:44

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes

On Sat, Apr 04, 2009 at 03:13:13PM -0700, Linus Torvalds wrote:
> And yes, anticipatory seems to be quite noticeably better than cfq here.
> With cfq I got a few two-second delays on 'ftruncate()' too (probably
> because of your new serialization code?), and the longest fsync() delay
> was over 7 seconds. That was definitely solidly in the "painful" category.

What is going on with in your "ftruncate()" case? The synchronization
code I added will do call filemap_flush() on close if the file
descriptor had been previously truncated down to zero, either because
it was opened with O_TRUNCATE, or if ftruncate(fd, 0) was explicitly
called. But it won't actually call fsync() or do anything special on
the actual ftrucate() call; it just sets a flag indicating that the
file in question should be flushed on close.

This is to make the right thing happen for applications which try to
edit a file in place via:

fd = open("foo", O_RDWR);
len = read(fd, buf, MAXBUF);
<modify buf>
ftruncate(fd, 0);
write(fd, buf, len);
close(fd);

Otherwise, given the lack of fsync(fd) in the above sequence, a crash
may leave he file "foo" truncated or only partially written out.

- Ted

2009-04-05 15:04:05

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes

On Sat, 4 Apr 2009 20:10:05 -0400
Theodore Tso <[email protected]> wrote:

> On Sat, Apr 04, 2009 at 04:33:49PM -0700, Arjan van de Ven wrote:
> > > However, the full latency fixes all the writes are synchronous,
> > > so it must be the case that the delays are caused by the fact
> > > that queue is getting implicitly unplugged after the synchronous
> > > write, and the problem is no longer the mixing of WRITE and
> > > WRITE_SYNC requests as posted in the commit log for 78f707bf. If
> > > we remove the automatic unplug for WRITE_SYNC requests, and add
> > > an explicit unplug where it is needed, that should fix the
> > > performance regression for this particular sqlite test case.
> >
> > removing the unplug is bound to be bad; after all we're waiting on
> > the IO. But maybe it should be "make the unplug a REALLY short
> > time". At least for rotating storage. For non-rotating .. I'd never
> > wait.
>
> Ext3 needs to submit a large number of blocks for writing with
> WRITE_SYNC priority, without unplugging the queue, until they are all
> submitted. Then we want to let things rip. (Hence the "add an
> explicit unplug where it is needed".)
>
> It may be that it's easier from an less-lines-of-the-kernels-to-change
> point of view to add a WRITE_SYNC_PLUGGED which doesn't do the unplug,
> and keep WRITE_SNYC as having the implicit unplug. Or maybe it would
> be better to completely separate the "send a write which is marked as
> SYNC" from the "implicit unplug" in the API.

I think there's actually 3 cases:
* Normal write with normal plugging rules
* Write that should be "sync" priority, but which should explicitly NOT
unplug, even if it otherwise would have happened, because the caller
will, as part of the contract of the API, do the unplug when all IO
is submitted.
(a bit like tcp/cork)
* Write that is sync priority and is the last one of a sequence,
and thus should unplug immediately.

I can even imagine using a temporary/special queue for the 2nd case,
and then splice that into the regular queue when the final IO comes in,
in one go.

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2009-04-05 17:05:04

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes



On Sat, 4 Apr 2009, Theodore Tso wrote:
>
> It may be that it's easier from an less-lines-of-the-kernels-to-change
> point of view to add a WRITE_SYNC_PLUGGED which doesn't do the unplug,
> and keep WRITE_SNYC as having the implicit unplug. Or maybe it would
> be better to completely separate the "send a write which is marked as
> SYNC" from the "implicit unplug" in the API.

Well, the block layer internally already separates the two, it's just
WRITE_SYNC that ties them together. So a trivial patch like the appended
would make WRITE_SYNC just mean "this is synchronous IO" without the
unplugging, and then WRITE_UNPLUG would mark it both as synchronous _and_
unplug it.

(Or you could do the WRITE_SYNC_PLUGGED model too, but I think WRITE_SYNC
and WRITE_UNPLUG actually would be better names, since they talk about the
effects more directly).

The question is more of a "who really wants the unplugging". Presumably
the direct-io code-path really does (on the assumption that if you are
doing direct-io, the caller had better done all the coalescing it wants).

Non-rotational media would tend to want the unplugging, but the block
layer already does that (ie in __make_request() we already do:

if (unplug || blk_queue_nonrot(q))
__generic_unplug_device(q);

so non-rotational devices get unplugged whether the request was a
unplugging request or not).

Of course, different IO schedulers react differently to that whole "sync
vs unplug" thing. I think CFQ is the only one that actually cares about
the "sync" bit (using different queues for sync vs async). The other
schedulers only care about the plugging. So the patch below really doesn't
make much sense as-is, because as things are right now, the scheduler
behaviors are so different for the unplug-vs-sync thing that no sane user
can ever know whether they should use WRITE_SYNC (== higher priority
queueing for CFQ, no-op for others) or WRITE_UNPLUG (unplug on all, and
additionally higher priority for CFQ).

So I'm not serious about this patch, because as things are right now, I
don't think it's sensible, but I do think that this just generally shows
the confusion of what we should do. When is plugging right, when isn't it?

Also, one of the issues seems to literally be that the higher-level
request handling doesn't care AT ALL about the priority. Allocating the
request itself does keep reads and writes separated, but if the request is
a SYNCIO request, and non-sync writes have filled up th write requests,
we'll have to wait for the background writes to free up requests.

That is quite possibly the longest wait we have in the system.

See get_request():

const int rw = rw_flags & 0x01;

(That should _really_ be RW_MASK instead of 0x01, I suspect) and how the
request allocation itself is done.

I think this is broken.

Linus

---
include/linux/fs.h | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index a09e17c..a144377 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -95,7 +95,8 @@ struct inodes_stat_t {
#define SWRITE 3 /* for ll_rw_block() - wait for buffer lock */
#define READ_SYNC (READ | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG))
#define READ_META (READ | (1 << BIO_RW_META))
-#define WRITE_SYNC (WRITE | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG))
+#define WRITE_SYNC (WRITE | (1 << BIO_RW_SYNCIO))
+#define WRITE_UNPLUG (WRITE | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG))
#define SWRITE_SYNC (SWRITE | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG))
#define WRITE_BARRIER (WRITE | (1 << BIO_RW_BARRIER))
#define DISCARD_NOBARRIER (1 << BIO_RW_DISCARD)

2009-04-05 17:15:14

by Mark Lord

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes

Linus Torvalds wrote:
>
> Non-rotational media would tend to want the unplugging, but the block
> layer already does that (ie in __make_request() we already do:
>
> if (unplug || blk_queue_nonrot(q))
> __generic_unplug_device(q);
>
> so non-rotational devices get unplugged whether the request was a
> unplugging request or not).
..

Mmm.. except that Jeff's SSD doesn't appear to identify itself
as a non-rotating media. I suppose there's a way to "quirk" that
somewhere, though.

:/

2009-04-05 18:55:08

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes

On Sun, 5 Apr 2009 10:01:06 -0700 (PDT)
Linus Torvalds <[email protected]> wrote:
> Also, one of the issues seems to literally be that the higher-level
> request handling doesn't care AT ALL about the priority. Allocating
> the request itself does keep reads and writes separated, but if the
> request is a SYNCIO request, and non-sync writes have filled up th
> write requests, we'll have to wait for the background writes to free
> up requests.
>
> That is quite possibly the longest wait we have in the system.

it often is; latencytop tends to point that out.
>
> See get_request():

our default number of requests is so low that we very regularly hit the
limit. In addition to setting kjournald to higher priority, I tend to
set the number of requests to 4096 or so to improve interactive
performance on my own systems. That way at least the elevator has a
chance to see the requests ;-)


--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2009-04-05 19:37:50

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes



On Sun, 5 Apr 2009, Arjan van de Ven wrote:
>
> > See get_request():
>
> our default number of requests is so low that we very regularly hit the
> limit. In addition to setting kjournald to higher priority, I tend to
> set the number of requests to 4096 or so to improve interactive
> performance on my own systems. That way at least the elevator has a
> chance to see the requests ;-)

That's insane. Long queues make the problem harder to hit, yes. But it
also tends to make the problem them a million times worse when you _do_
hit it.

I would suggest looking instead at trying to have separate allocation
pools for bulk and "sync" IO. Instead of having just one rq->rq_pool, we
could easily have a rq->rq_bulk_pool and rq->rq_sync_pool.

We might even _save_ memory by having two pools simply because that may
make it much less important to have a big pool. Most subsystems don't
really need that many requests in flight anyway, and the advantage to the
elevator of huge pools is rather dubious.

So you obviously need more requests than the hardware can have in flight
(since you want to be able to feed the hardware new requests and overlap
the refill with the ones executing), but 4096 sounds excessive if you're
doing something like SATA that can only have 32 actual outstanding
requests at the hardware.

But yes, if a synchronous request gets blocked just because we've already
used all the requests, latency will be suffer.

Linus

2009-04-05 20:06:48

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes

On Sun, 5 Apr 2009 12:34:32 -0700 (PDT)
Linus Torvalds <[email protected]> wrote:

>
>
> On Sun, 5 Apr 2009, Arjan van de Ven wrote:
> >
> > > See get_request():
> >
> > our default number of requests is so low that we very regularly hit
> > the limit. In addition to setting kjournald to higher priority, I
> > tend to set the number of requests to 4096 or so to improve
> > interactive performance on my own systems. That way at least the
> > elevator has a chance to see the requests ;-)
>
> That's insane.

4096 is an absolutely insane value that hides some of the problem

> Long queues make the problem harder to hit, yes. But
> it also tends to make the problem them a million times worse when you
> _do_ hit it.

There is a dilemma though. By not having the IO needs in a queue,
to some degree, they haven't gone away; they just are invisible.

Now there is also a throttling value in having these limits, to
slow down "regular" processes that would cause too much IO.
Except that we have the dirty limit for that in the VM, and except that
most actual IO is done by pdflush and other kernel threads, with the
dirtying of data asynchronous to that.

I would contend that for most common cases, not giving callers a request
immediately does not change or throttle the actual IO that is in want
of being sent to the device. All it does is reduce visibility of the IO
need so less grouping of adjacent and prioritization can be done by the
elevator.

> I would suggest looking instead at trying to have separate allocation
> pools for bulk and "sync" IO. Instead of having just one rq->rq_pool,
> we could easily have a rq->rq_bulk_pool and rq->rq_sync_pool.

Well that or have pools for a few buckets of priority level.
The risk of this is that someone like pdflush might get stuck on a low
priority queue, and thus cannot send the IO it might have wanted to
send into a higher priority queue. I fear that any such limits will in
general punish the wrong guy; after all number 129 is punished, not the
guy who put numbers 1 to 128 in the queue.

I wonder if it wouldn't be a better solution to give insight of the
queue length in use to pdflush, and have pdflush decide what kind of IO
to submit based on the length, rather than having it just block.

Just think of the sync() or fsync() cases.
The total amount of IO that those calls will cause is pretty much
fixed: the data that is "relevantly dirty" at the time of the call.
Holding things back at the request allocation level does not change
that, all it changes is that we delay merging requests that are
adjacent, sort on priority, etc.



--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2009-04-05 20:58:06

by Jeff Garzik

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes

Mark Lord wrote:
> Linus Torvalds wrote:
>>
>> Non-rotational media would tend to want the unplugging, but the block
>> layer already does that (ie in __make_request() we already do:
>>
>> if (unplug || blk_queue_nonrot(q))
>> __generic_unplug_device(q);
>>
>> so non-rotational devices get unplugged whether the request was a
>> unplugging request or not).
> ..
>
> Mmm.. except that Jeff's SSD doesn't appear to identify itself
> as a non-rotating media. I suppose there's a way to "quirk" that
> somewhere, though.

We set it in libata-scsi.c:ata_scsi_dev_config() based on ata_id_is_ssd()

That hueristic probably assumes Intel SSDs or something :/

Jeff



2009-04-05 23:47:47

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes

On Sun, 05 Apr 2009 16:57:21 -0400
Jeff Garzik <[email protected]> wrote:
>
> We set it in libata-scsi.c:ata_scsi_dev_config() based on
> ata_id_is_ssd()
>
> That hueristic probably assumes Intel SSDs or something :/

you mean the "rpm" set to '1' ?
I was pretty sure that that was industry standard...


--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2009-04-06 02:29:00

by Mark Lord

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes

Arjan van de Ven wrote:
> On Sun, 05 Apr 2009 16:57:21 -0400
> Jeff Garzik <[email protected]> wrote:
>> We set it in libata-scsi.c:ata_scsi_dev_config() based on
>> ata_id_is_ssd()
>>
>> That hueristic probably assumes Intel SSDs or something :/
>
> you mean the "rpm" set to '1' ?
> I was pretty sure that that was industry standard...
..

Or even set to zero. I just don't recall seeing the RPM
reported by hdparm-9.12 for Jeff's SSD.

Cheers

2009-04-06 05:48:07

by Jeff Garzik

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes

Arjan van de Ven wrote:
> On Sun, 05 Apr 2009 16:57:21 -0400
> Jeff Garzik <[email protected]> wrote:
>> We set it in libata-scsi.c:ata_scsi_dev_config() based on
>> ata_id_is_ssd()
>>
>> That hueristic probably assumes Intel SSDs or something :/
>
> you mean the "rpm" set to '1' ?
> I was pretty sure that that was industry standard...

A -new- industry standard. You can certainly create a compliant SSD
while only conforming to ATA-7, for example. Some older IDE flash
devices pretend they are normal hard drives in almost every respect, too.

Jeff




2009-04-06 06:16:02

by Jens Axboe

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes

On Sat, Apr 04 2009, Theodore Tso wrote:
> On Sat, Apr 04, 2009 at 08:01:08PM +0200, Jens Axboe wrote:
> >
> > Unless you make all journal writes sync, ext3 fsync will always suck big
> > time. But I get your point.
> >
>
> We have already made all other journal writes sync when triggered by
> fsync() --- except for the commit record which is being written by
> sync_dirty_buffer(). I've verified this via blktrace.
>
> However, you're right. We do have an performance regression using the
> regression test provided in commit 78f707bf. Taking the average of at
> least 5 runs, I found:
>
> stock 2.6.29 1687 ms
> 2.6.29 w/ ext3-latency-fixes 7337 ms
> 2.6.29 w/ full-latency-fixes 13722 ms
>
> "ext3-latency-fixes" are the patches which Linus has already pulled
> into the 2.6.29 tree. "full-latency-fixes" are ext3-latency-fixes
> plus the sync_dirty_buffes() patch.

Ugh yes, not everyone will appreciate the better latency and for them an
8x slowdown is veeery painful.

> Looking at blktrace of stock 2.6.29 running the sqlite performance
> measurement, we see this:
>
> 254,2 1 13 0.005120554 7183 Q W 199720 + 8 [sqlite-fsync-te]
> 254,2 1 15 0.005666573 6947 Q W 10277200 + 8 [kjournald]
> 254,2 1 16 0.005691576 6947 Q W 10277208 + 8 [kjournald]
> 254,2 1 18 0.006145684 6947 Q W 10277216 + 8 [kjournald]
> 254,2 1 21 0.006685348 7183 Q W 199720 + 8 [sqlite-fsync-te]
> 254,2 1 24 0.007162644 6947 Q W 10277224 + 8 [kjournald]
> 254,2 1 25 0.007187857 6947 Q W 10277232 + 8 [kjournald]
> 254,2 0 27 0.007616473 6947 Q W 10277240 + 8 [kjournald]
>
> Looking at a blktrace of 2.6.29 plus the full-latency-fixes, we see this:
>
> 254,2 0 13 0.013744556 7205 Q WS 199208 + 8 [sqlite-fsync-te]
> 254,2 0 16 0.019270748 6965 Q WS 10301896 + 8 [kjournald]
> 254,2 0 17 0.019400024 6965 Q WS 10301904 + 8 [kjournald]
> 254,2 1 23 0.019892824 6965 Q WS 10301912 + 8 [kjournald]
> 254,2 0 20 0.020450995 7205 Q WS 199208 + 8 [sqlite-fsync-te]
> 254,2 1 26 0.025954909 6965 Q WS 10301920 + 8 [kjournald]
> 254,2 1 27 0.026084814 6965 Q WS 10301928 + 8 [kjournald]
> 254,2 0 24 0.026612884 6965 Q WS 10301936 + 8 [kjournald]
>
> Looking at the deltas between the two iterations of the sqlite
> analogue, we see that stock 2.6.29 is 1.56ms, where as with the full
> latency fixes, it's 6.70ms.

It would be interesting with the full set of information, like unplug
and merge. The above definitely shows longer between queue-to-queue, I
wonder what that is due to.

> However, the full latency fixes all the writes are synchronous, so it
> must be the case that the delays are caused by the fact that queue is
> getting implicitly unplugged after the synchronous write, and the
> problem is no longer the mixing of WRITE and WRITE_SYNC requests as
> posted in the commit log for 78f707bf. If we remove the automatic
> unplug for WRITE_SYNC requests, and add an explicit unplug where it is
> needed, that should fix the performance regression for this particular
> sqlite test case. (Which isn't a throughput issue, since the test is
> basically fopen/write/fsync/fclose over and over again, with no
> background load.)

We definitely should add a WRITE_SYNC variant that doesn't unplug, for
the cases where you know you are going to submit more than one IO before
waiting. Whether that should just be WRITE_SYNC or we should add a
WRITE_SYNC_UNPLUG, I don't have a really strong preference either way.

But if the delays are due to premature unplugging, it should be visible
in the trace. Care to send the "full" thing?

> The scenario which Linus and I had been focused on is one where there
> was a heavy background load writing asynchronously. We do want to
> make sure that a series of fsync() calls in a tight loop is also
> reasonable as well, so Jens, I do think you are absolutely right that
> this is something that we need to pay attention to.

Most definitely!

> I had missed the commit 78f707bf when you originally submitted it, so
> I didn't do this test before submitting the patch. And I guess you
> had missed my patch proposal to LKML, and I didn't think to explicitly
> CC you on my patches. Apologies for the communication faults, but
> hopefully we can fix this performance issue for both cases and get
> these problems behind us.

We still have a long cycle ahead of us, so I'm sure we can get it nailed
down before release.

--
Jens Axboe


2009-04-06 06:24:01

by Jens Axboe

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes

On Sun, Apr 05 2009, Linus Torvalds wrote:
>
>
> On Sat, 4 Apr 2009, Theodore Tso wrote:
> >
> > It may be that it's easier from an less-lines-of-the-kernels-to-change
> > point of view to add a WRITE_SYNC_PLUGGED which doesn't do the unplug,
> > and keep WRITE_SNYC as having the implicit unplug. Or maybe it would
> > be better to completely separate the "send a write which is marked as
> > SYNC" from the "implicit unplug" in the API.
>
> Well, the block layer internally already separates the two, it's just
> WRITE_SYNC that ties them together. So a trivial patch like the appended
> would make WRITE_SYNC just mean "this is synchronous IO" without the
> unplugging, and then WRITE_UNPLUG would mark it both as synchronous _and_
> unplug it.
>
> (Or you could do the WRITE_SYNC_PLUGGED model too, but I think WRITE_SYNC
> and WRITE_UNPLUG actually would be better names, since they talk about the
> effects more directly).
>
> The question is more of a "who really wants the unplugging". Presumably
> the direct-io code-path really does (on the assumption that if you are
> doing direct-io, the caller had better done all the coalescing it wants).
>
> Non-rotational media would tend to want the unplugging, but the block
> layer already does that (ie in __make_request() we already do:
>
> if (unplug || blk_queue_nonrot(q))
> __generic_unplug_device(q);
>
> so non-rotational devices get unplugged whether the request was a
> unplugging request or not).
>
> Of course, different IO schedulers react differently to that whole "sync
> vs unplug" thing. I think CFQ is the only one that actually cares about
> the "sync" bit (using different queues for sync vs async). The other
> schedulers only care about the plugging. So the patch below really doesn't
> make much sense as-is, because as things are right now, the scheduler
> behaviors are so different for the unplug-vs-sync thing that no sane user
> can ever know whether they should use WRITE_SYNC (== higher priority
> queueing for CFQ, no-op for others) or WRITE_UNPLUG (unplug on all, and
> additionally higher priority for CFQ).

AS also very much cares about sync vs async. CFQ and AS both use that to
determine whether to wait for a new request from that io context, even
if we have other work to do. This is the logic that enables dependent
reads to proceed at a fast pace, instead of incurring a seek on every
read from a process reading lots of smaller files.

This could very well be what is causing a performance problem with the
writes. For most of these writes, you really don't want to idle at the
end even if they are SYNC. For O_DIRECT it's a win though, so we can't
just say 'never idle for sync writes'. Perhaps what we really need is a
specific WRITE_ODIRECT that signals sync and enables idling, while
keeping the WRITE_SYNC / WRITE_SYNC_UNPLUG like normal writes but just
at sync priority.

> So I'm not serious about this patch, because as things are right now, I
> don't think it's sensible, but I do think that this just generally shows
> the confusion of what we should do. When is plugging right, when isn't it?
>
> Also, one of the issues seems to literally be that the higher-level
> request handling doesn't care AT ALL about the priority. Allocating the
> request itself does keep reads and writes separated, but if the request is
> a SYNCIO request, and non-sync writes have filled up th write requests,
> we'll have to wait for the background writes to free up requests.
>
> That is quite possibly the longest wait we have in the system.
>
> See get_request():
>
> const int rw = rw_flags & 0x01;
>
> (That should _really_ be RW_MASK instead of 0x01, I suspect) and how the
> request allocation itself is done.
>
> I think this is broken.

If we just make it (rw & REQ_RW_SYNC) then we get all sync requests from
the same pool, instead of doing the read vs write thing there. And I
agree, that makes sense.

> @@ -95,7 +95,8 @@ struct inodes_stat_t {
> #define SWRITE 3 /* for ll_rw_block() - wait for buffer lock */
> #define READ_SYNC (READ | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG))
> #define READ_META (READ | (1 << BIO_RW_META))
> -#define WRITE_SYNC (WRITE | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG))
> +#define WRITE_SYNC (WRITE | (1 << BIO_RW_SYNCIO))
> +#define WRITE_UNPLUG (WRITE | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG))
> #define SWRITE_SYNC (SWRITE | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG))
> #define WRITE_BARRIER (WRITE | (1 << BIO_RW_BARRIER))
> #define DISCARD_NOBARRIER (1 << BIO_RW_DISCARD)

On the naming, after thinking a bit about it, I think it should be
WRITE_SYNC_PLUGGED and WRITE_SYNC. The reason being that if you use
WRITE_SYNC_PLUGGED, then you MUST unplug at some point in time. By
making that explicit in the name, we don't risk callers forgetting to
unplug and causing the unplug timer to do that work (and slowing things
down).

--
Jens Axboe


2009-04-06 06:05:42

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes

On Sun, Apr 05, 2009 at 10:01:06AM -0700, Linus Torvalds wrote:
> Of course, different IO schedulers react differently to that whole "sync
> vs unplug" thing. I think CFQ is the only one that actually cares about
> the "sync" bit (using different queues for sync vs async).

It looks liks AS and CFQ both care about the "sync" bit; they both use
rq_is_sync defined in include/lonux/blkdev.h. Deadline apparently
only distinguishes between read and write requests, and not whether
they are considered synchronous or not. The noop scheduler, obviously
also doesn't care. :-)

> The other schedulers only care about the plugging. So the patch
> below really doesn't make much sense as-is, because as things are
> right now, the scheduler behaviors are so different for the
> unplug-vs-sync thing that no sane user can ever know whether they
> should use WRITE_SYNC (== higher priority queueing for CFQ, no-op
> for others) or WRITE_UNPLUG (unplug on all, and additionally higher
> priority for CFQ).

Well, if the deadline scheduler ignores the SYNC bit, it would still
make sense for it to only unplug the queue after for the commit block,
and not for any of the other writes to the journal. Unplugging after
every synchronous write is going to lead to a performance problem,
which I've demonstrated using the fopen/fprintf/fsync/fclose scenario
that Jens pointed me at.

- Ted

2009-04-06 06:25:52

by Jens Axboe

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes

On Sun, Apr 05 2009, Linus Torvalds wrote:
>
>
> On Sun, 5 Apr 2009, Arjan van de Ven wrote:
> >
> > > See get_request():
> >
> > our default number of requests is so low that we very regularly hit the
> > limit. In addition to setting kjournald to higher priority, I tend to
> > set the number of requests to 4096 or so to improve interactive
> > performance on my own systems. That way at least the elevator has a
> > chance to see the requests ;-)
>
> That's insane. Long queues make the problem harder to hit, yes. But it
> also tends to make the problem them a million times worse when you _do_
> hit it.

Unfortunately it is insane, because the vm will barf big time on that.
Perhaps when we do the "throttle at device IO rate" for dirtying of data
we can be more relaxed, but increasing nr_requests too much will just
cause the vm to go nuts.

> I would suggest looking instead at trying to have separate allocation
> pools for bulk and "sync" IO. Instead of having just one rq->rq_pool, we
> could easily have a rq->rq_bulk_pool and rq->rq_sync_pool.
>
> We might even _save_ memory by having two pools simply because that may
> make it much less important to have a big pool. Most subsystems don't
> really need that many requests in flight anyway, and the advantage to the
> elevator of huge pools is rather dubious.

The pools are not big, the number is just a maximum allocation number.
In reality we only store 4 requests in the mempool, and we should
probably just dump that down to 1.

--
Jens Axboe


2009-04-06 07:06:56

by Jens Axboe

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes

On Sat, Apr 04 2009, Arjan van de Ven wrote:
> On Sat, 4 Apr 2009 19:34:12 +0200
> Jens Axboe <[email protected]> wrote:
>
> > > Latency is more important than throughput. It's that simple.
> >
> > It's really not that simple, otherwise the schedulers would be much
> > simpler. It's pretty easy to get good latency if you disregard any
> > throughput concerns,
>
> I'd be very interested in a scheduler like that.....
> How much work would it be to make it ?
>
> (if nothing else it would be a good number to have "should be within
> 50% of the perfect one for the tradeoff")

It'd be pretty close to the first version of CFQ. The easiest would be
to add a cfq sysfs know that basically just switches a bunch of things
off in CFQ. Never idle, always dispatch only a single request at the
time, etc. At least for test purposes it would not be that hard. CFQ
doesn't export all of the settings that allow to make this possible
right now, otherwise it could just be done with a shell script.

--
Jens Axboe


2009-04-06 08:12:44

by Jens Axboe

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes

On Sat, Apr 04 2009, Theodore Tso wrote:
> On Sat, Apr 04, 2009 at 05:16:50PM +0200, Jens Axboe wrote:
> > Big nack on this patch. Ted, this is EXACTLY where I told you we saw big
> > write regressions (sqlite performance drops by a factor of 4-5). Do a
> > git log on fs/buffer.c and see the original patch (which does what your
> > patch does) and the later revert.
>
> You mean this revert, right?
>
> commit 78f707bfc723552e8309b7c38a8d0cc51012e813
> Author: Jens Axboe <[email protected]>
> Date: Tue Feb 17 13:59:08 2009 +0100
>
> block: revert part of 18ce3751ccd488c78d3827e9f6bf54e6322676fb
>
> The above commit added WRITE_SYNC and switched various places to using
> that for committing writes that will be waited upon immediately after
> submission. However, this causes a performance regression with AS and CFQ
> for ext3 at least, since sync_dirty_buffer() will submit some writes with
> WRITE_SYNC while ext3 has sumitted others dependent writes without the sync
> flag set. This causes excessive anticipation/idling in the IO scheduler
> because sync and async writes get interleaved, causing a big performance
> regression for the below test case (which is meant to simulate sqlite
> like behaviour)....
>
> OK, let me test things out first, but note that with the changes that
> Linus has already accepted, this may not be an issue --- since we've
> now fixed ext3 to submit those dependent writes with the SYNC flag
> now. So I'm not sure the performance regression still applies, but
> I'll test using the test case supplied in the rest of the commit log
> and get back to you.

Yep that's the one. I'll throw some testing together here, too.

--
Jens Axboe

2009-04-06 08:13:58

by Jens Axboe

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes

On Sun, Apr 05 2009, Mark Lord wrote:
> Linus Torvalds wrote:
>>
>> Non-rotational media would tend to want the unplugging, but the block
>> layer already does that (ie in __make_request() we already do:
>>
>> if (unplug || blk_queue_nonrot(q))
>> __generic_unplug_device(q);
>>
>> so non-rotational devices get unplugged whether the request was a
>> unplugging request or not).
> ..
>
> Mmm.. except that Jeff's SSD doesn't appear to identify itself
> as a non-rotating media. I suppose there's a way to "quirk" that
> somewhere, though.

You can quirk it in userspace, just have it set the 'rotational' sysfs
file to 0 if it matches make/model/whatever.

--
Jens Axboe


2009-04-06 08:16:19

by Jens Axboe

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes

On Sat, Apr 04 2009, Arjan van de Ven wrote:
> On Sat, 4 Apr 2009 19:22:22 -0400
> Theodore Tso <[email protected]> wrote:
> >
> > However, the full latency fixes all the writes are synchronous, so it
> > must be the case that the delays are caused by the fact that queue is
> > getting implicitly unplugged after the synchronous write, and the
> > problem is no longer the mixing of WRITE and WRITE_SYNC requests as
> > posted in the commit log for 78f707bf. If we remove the automatic
> > unplug for WRITE_SYNC requests, and add an explicit unplug where it is
> > needed, that should fix the performance regression for this particular
> > sqlite test case.
>
> removing the unplug is bound to be bad; after all we're waiting on the
> IO. But maybe it should be "make the unplug a REALLY short time".
> At least for rotating storage. For non-rotating .. I'd never wait.

Well, either you are submitting a single piece of IO (in which case you
just want to unplug or directly submit as part of the submit_bio()), or
you are submitting several IOS (in which case you just want to unplug at
the end of the IO submission, before waiting).

--
Jens Axboe


2009-04-06 14:51:38

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes



On Mon, 6 Apr 2009, Jens Axboe wrote:
>
> Well, either you are submitting a single piece of IO (in which case you
> just want to unplug or directly submit as part of the submit_bio()), or
> you are submitting several IOS (in which case you just want to unplug at
> the end of the IO submission, before waiting).

That's not true.

The plugging is often across multiple threads. It didn't _use_ to be (we
always unplugged at wait), but it is now. Nothing else explains why that
patch by Ted makes such a big throughput thing, because the code did

ret = submit_bh(WRITE_SYNC, bh);
wait_on_buffer(bh);

ie it very much submits a _single_ IO, and waits on it. If plugging made a
difference, that means that unplugging was delayed so long that somebody
else does IO too - ie it gets delayed past a wait event.

So according to your own rules, that submit_bh() _should_ use WRITE_SYNC,
but something bad happens if it does. I'm not quite seeing _what_, though,
unless there are multiple processes trying to dirty the _same_ buffer, and
they win if they all can dirty it without doing IO on it in between (and
then the write turns into just one write).

Linus


2009-04-06 15:09:46

by Jens Axboe

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes

On Mon, Apr 06 2009, Linus Torvalds wrote:
>
>
> On Mon, 6 Apr 2009, Jens Axboe wrote:
> >
> > Well, either you are submitting a single piece of IO (in which case you
> > just want to unplug or directly submit as part of the submit_bio()), or
> > you are submitting several IOS (in which case you just want to unplug at
> > the end of the IO submission, before waiting).
>
> That's not true.
>
> The plugging is often across multiple threads. It didn't _use_ to be (we

It is completely true, you will very rarely see merges between
processes. The plug may be across the entire device and it'll get you
some inter-process sorting as well if you have more than one app busy,
but for merging it's usually effectively per-process.

> always unplugged at wait), but it is now. Nothing else explains why that
> patch by Ted makes such a big throughput thing, because the code did
>
> ret = submit_bh(WRITE_SYNC, bh);
> wait_on_buffer(bh);

Linus, the implementation is still basically the same. Yes it's true
that you used to do

submit();
unplug(;)
wait();

and you better still be doing that or things will run at the timer
unplug speed - not very fast. The only difference is that we hide the
unplug in the wait_on_bit() callback, but it's definitely still very
much the same thing.

> ie it very much submits a _single_ IO, and waits on it. If plugging made a
> difference, that means that unplugging was delayed so long that somebody
> else does IO too - ie it gets delayed past a wait event.
>
> So according to your own rules, that submit_bh() _should_ use WRITE_SYNC,
> but something bad happens if it does. I'm not quite seeing _what_, though,
> unless there are multiple processes trying to dirty the _same_ buffer, and
> they win if they all can dirty it without doing IO on it in between (and
> then the write turns into just one write).

sync_dirty_buffer() should use it, I agree, I even did the original
patch for it. And in the series posted today, it's also there and it
performs as expected now. For this particular case, it's not about
plugging. The performance penalty came from the 'sync' modifier, it'll
change how the IO schedulers look at the IO. Both AS and CFQ would have
considerably worse performance with this, as you would be mixing related
IO into both sync and async buckets. So it wasn't merging (as suspected)
or plugging.

--
Jens Axboe


2009-04-06 21:50:36

by Lennart Sorensen

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes

On Sat, Apr 04, 2009 at 10:18:20PM +0200, Ingo Molnar wrote:
> btw., just to insert some hard data: usability studies place the
> acceptance threshold for delays to around 300 milliseconds.
>
> If the computer does not 'respond' (in a real or a fake, visible or
> audible way) to their input within 0.3 seconds they get annoyed
> emotionally.

You mean like hitting 'skip commercial' on the remote, and 2 minutes later
when the commercials are almost over, then it happens. Yeah users get a
bit peeved at that. Fortunately that particularly behaviour is somewhat
rare. 5 to 10 seconds is more common and sometimes it works fine.

> It does not matter how complex it is for the kernel to solve this
> problem, it does not matter how far away and difficult to access the
> data is. If we are not absolutely latency centric in Linux, if we
> spuriously delay apps that do supposedly simple-looking things the
> user _will_ get annoyed and _will_ pick another OS.

I would have to get _very_ annoyed before that happened. I would have
to find a less annoying OS to switch to as well.

> All things considered it is certainly a combined kernel and app
> problem space to get there, but not being completely aware of its
> importance in the kernel kills any chance of us ever having a
> long-term solution.

Certainly my problems are likely to a large extent mythtv's fault.

--
Len Sorensen

2009-04-07 13:31:39

by Mark Lord

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes

Lennart Sorensen wrote:
> On Sat, Apr 04, 2009 at 10:18:20PM +0200, Ingo Molnar wrote:
>> btw., just to insert some hard data: usability studies place the
>> acceptance threshold for delays to around 300 milliseconds.
>>
>> If the computer does not 'respond' (in a real or a fake, visible or
>> audible way) to their input within 0.3 seconds they get annoyed
>> emotionally.
>
> You mean like hitting 'skip commercial' on the remote, and 2 minutes later
> when the commercials are almost over, then it happens. Yeah users get a
> bit peeved at that. Fortunately that particularly behaviour is somewhat
> rare. 5 to 10 seconds is more common and sometimes it works fine.
..

I generally don't see any delay at all when doing that,
even when the Mythtv system is very busy on multiple recordings.
But then, I also don't use the huge LIRC stack for the simple R/C either.

cheers

2009-04-07 14:48:03

by Lennart Sorensen

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes

On Tue, Apr 07, 2009 at 09:31:37AM -0400, Mark Lord wrote:
> I generally don't see any delay at all when doing that,
> even when the Mythtv system is very busy on multiple recordings.
> But then, I also don't use the huge LIRC stack for the simple R/C either.

But it happens with the keyboard too. :)

--
Len Sorensen

2009-04-07 16:02:18

by Indan Zupancic

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes

On Sat, April 4, 2009 17:57, Linus Torvalds wrote:
>
> On Sat, 4 Apr 2009, Jens Axboe wrote:
>>
>> Big nack on this patch. Ted, this is EXACTLY where I told you we saw big
>> write regressions (sqlite performance drops by a factor of 4-5). Do a
>> git log on fs/buffer.c and see the original patch (which does what your
>> patch does) and the later revert. No idea why you are now suggestion
>> making that exact change?!
>
> Jens, if I can re-create the 'fsync' times (I haven't yet), then the
> default scheduler _will_ be switched to AS.

http://bugzilla.kernel.org/show_bug.cgi?id=5900

I can try it with a different fs, kernel or machine as well.

To me it seems this stuff is way more tricky than it should be, with
all the interaction between fs, vm, io schedulers and hardware.
Perhaps there's a need for one sophisticated io scheduler that gets
enough information from the rest to make the right decisions. Sync
or not sync isn't enough info.



2009-04-07 18:23:53

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes



On Mon, 6 Apr 2009, Jeff Garzik wrote:

> Arjan van de Ven wrote:
> > On Sun, 05 Apr 2009 16:57:21 -0400
> > Jeff Garzik <[email protected]> wrote:
> > > We set it in libata-scsi.c:ata_scsi_dev_config() based on
> > > ata_id_is_ssd()
> > >
> > > That hueristic probably assumes Intel SSDs or something :/
> >
> > you mean the "rpm" set to '1' ?
> > I was pretty sure that that was industry standard...
>
> A -new- industry standard. You can certainly create a compliant SSD while
> only conforming to ATA-7, for example. Some older IDE flash devices pretend
> they are normal hard drives in almost every respect, too.

Something like this might be a good idea.

I've seen several SSD's that do _not_ do that whole RPM == 1 thing, but
they have "SSD" in their names.

I forget how the ID is stored (I have this memory of it being big-endian
16-bit words or something crazy like that?), but aside from fixing up that
kind of crazyness, maybe something like this is worth it?

And making it non-inline, of course. And maybe it should use 'strstr()'
instead of checking whether the name ends in 'SSD'. You get the idea..

Linus

---
--- a/include/linux/ata.h
+++ b/include/linux/ata.h
@@ -769,7 +769,14 @@ static inline int ata_id_is_cfa(const u16 *id)

static inline int ata_id_is_ssd(const u16 *id)
{
- return id[ATA_ID_ROT_SPEED] == 0x01;
+ int len;
+ const char *model;
+
+ if (id[ATA_ID_ROT_SPEED] == 0x01)
+ return 1;
+ model = (const char *)&id[ATA_ID_PROD];
+ len = strnlen(model, ATA_ID_PROD_LEN);
+ return len > 3 && !memcmp(model+len-3, "SSD", 3);
}

static inline int ata_drive_40wire(const u16 *dev_id)

2009-04-07 18:28:20

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes



On Tue, 7 Apr 2009, Linus Torvalds wrote:
>
> Something like this might be a good idea.

And it clearly is "something like".

> I've seen several SSD's that do _not_ do that whole RPM == 1 thing, but
> they have "SSD" in their names.
>
> I forget how the ID is stored (I have this memory of it being big-endian
> 16-bit words or something crazy like that?), but aside from fixing up that
> kind of crazyness, maybe something like this is worth it?

Yeah, just checked. My memory wasn't wrong, and that code can not possibly
work. Not only isn't that whole field generally NUL-terminated at all, my
recollection of odd 16-bit byte ordering was right.

So that patch is crap.

But the _concept_ may be worth pursuing.

Linus

2009-04-07 19:21:18

by Mark Lord

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes

Lennart Sorensen wrote:
> On Tue, Apr 07, 2009 at 09:31:37AM -0400, Mark Lord wrote:
>> I generally don't see any delay at all when doing that,
>> even when the Mythtv system is very busy on multiple recordings.
>> But then, I also don't use the huge LIRC stack for the simple R/C either.
>
> But it happens with the keyboard too. :)
..

Mmm.. now that could be problem.

I wonder if mythfrontend is simply getting paged out of memory,
until you press a button and then the relevant code has to all
get paged back in again before it can respond (?).


2009-04-07 19:57:59

by Lennart Sorensen

[permalink] [raw]
Subject: Re: [GIT PULL] Ext3 latency fixes

On Tue, Apr 07, 2009 at 03:21:15PM -0400, Mark Lord wrote:
> Mmm.. now that could be problem.
>
> I wonder if mythfrontend is simply getting paged out of memory,
> until you press a button and then the relevant code has to all
> get paged back in again before it can respond (?).

While the video is playing? Well it is possible I guess. I haven't had
it happen in a while, so perhaps that was a mythtv bug that got fixed
since the keyboard worked fine if I switch to vt1 for example.

--
Len Sorensen