2010-04-01 19:05:08

by Jeff Moyer

Subject: [patch/rft] jbd2: tag journal writes as metadata I/O

Hi,

In running iozone for writes to small files, we noticed a pretty big
discrepancy between the performance of the deadline and cfq I/O
schedulers. Investigation showed that I/O was being issued from 2
different contexts: the iozone process itself, and the jbd2/sdh-8 thread
(as expected). Because of the way cfq performs slice idling, the delays
introduced between the metadata and data I/Os were significant. For
example, cfq would see about 7MB/s versus deadline's 35 for the same
workload. I also tested fs_mark with writing and fsyncing 1000 64k
files, and a similar 5x performance difference was observed. Eric
Sandeen suggested that I flag the journal writes as metadata, and once I
did that, the performance difference went away completely (cfq has
special logic to prioritize metadata I/O).
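
For reference, the cfq logic being referred to is the preemption check in
block/cfq-iosched.c. Paraphrased from the 2.6.33-era source (so treat the
exact form as approximate), it looks roughly like this:

/*
 * Sketch of cfq_should_preempt(): a request flagged as metadata lets its
 * queue preempt a queue that is only doing regular I/O, instead of
 * waiting out the idle window.
 */
static bool
cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
		   struct request *rq)
{
	struct cfq_queue *cfqq = cfqd->active_queue;

	/* ... sync/async and priority checks elided ... */

	/*
	 * Both queues are sync. Let the new request get disk time if
	 * it's a metadata request and the current queue is doing
	 * regular IO.
	 */
	if (rq_is_meta(rq) && !cfqq->meta_pending)
		return true;

	/* ... */
	return false;
}

The patch below just makes that rq_is_meta() test fire for the journal
writes.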

So, I'm submitting this patch for comments and testing. I have a
similar patch for jbd that I will submit if folks agree that this is a
good idea.

Cheers,
Jeff

Signed-off-by: Jeff Moyer <[email protected]>

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 671da7f..1998265 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -139,7 +139,7 @@ static int journal_submit_commit_record(journal_t *journal,
 		set_buffer_ordered(bh);
 		barrier_done = 1;
 	}
-	ret = submit_bh(WRITE_SYNC_PLUG, bh);
+	ret = submit_bh(WRITE_SYNC_PLUG | (1<<BIO_RW_META), bh);
 	if (barrier_done)
 		clear_buffer_ordered(bh);

@@ -160,7 +160,7 @@ static int journal_submit_commit_record(journal_t *journal,
 		lock_buffer(bh);
 		set_buffer_uptodate(bh);
 		clear_buffer_dirty(bh);
-		ret = submit_bh(WRITE_SYNC_PLUG, bh);
+		ret = submit_bh(WRITE_SYNC_PLUG | (1<<BIO_RW_META), bh);
 	}
 	*cbh = bh;
 	return ret;
@@ -369,7 +369,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	int tag_bytes = journal_tag_bytes(journal);
 	struct buffer_head *cbh = NULL; /* For transactional checksums */
 	__u32 crc32_sum = ~0;
-	int write_op = WRITE;
+	int write_op = WRITE_META;

 	/*
 	 * First job: lock down the current transaction and wait for
@@ -409,7 +409,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	 * instead we rely on sync_buffer() doing the unplug for us.
 	 */
 	if (commit_transaction->t_synchronous_commit)
-		write_op = WRITE_SYNC_PLUG;
+		write_op = WRITE_SYNC_PLUG | (1<<BIO_RW_META);
 	trace_jbd2_commit_locking(journal, commit_transaction);
 	stats.run.rs_wait = commit_transaction->t_max_wait;
 	stats.run.rs_locked = jiffies;

2010-04-01 19:48:28

by Jan Kara

Subject: Re: [patch/rft] jbd2: tag journal writes as metadata I/O

Hi,

> In running iozone for writes to small files, we noticed a pretty big
> discrepancy between the performance of the deadline and cfq I/O
> schedulers. Investigation showed that I/O was being issued from 2
> different contexts: the iozone process itself, and the jbd2/sdh-8 thread
> (as expected). Because of the way cfq performs slice idling, the delays
> introduced between the metadata and data I/Os were significant. For
> example, cfq would see about 7MB/s versus deadline's 35 for the same
> workload. I also tested fs_mark with writing and fsyncing 1000 64k
> files, and a similar 5x performance difference was observed. Eric
> Sandeen suggested that I flag the journal writes as metadata, and once I
> did that, the performance difference went away completely (cfq has
> special logic to prioritize metadata I/O).
>
> So, I'm submitting this patch for comments and testing. I have a
> similar patch for jbd that I will submit if folks agree that this is a
> good idea.
This looks like a good idea to me. I'd just be careful about data=journal
mode where even data is written via journal and thus you'd incorrectly
prioritize all the IO. I suppose that could have a negative impact on the performance
of other filesystems on the same disk. So for data=journal mode, I'd leave
write_op to be just WRITE / WRITE_SYNC_PLUG.

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2010-04-02 07:00:07

by Jens Axboe

Subject: Re: [patch/rft] jbd2: tag journal writes as metadata I/O

On Thu, Apr 01 2010, Jeff Moyer wrote:
> Hi,
>
> In running iozone for writes to small files, we noticed a pretty big
> discrepancy between the performance of the deadline and cfq I/O
> schedulers. Investigation showed that I/O was being issued from 2
> different contexts: the iozone process itself, and the jbd2/sdh-8 thread
> (as expected). Because of the way cfq performs slice idling, the delays
> introduced between the metadata and data I/Os were significant. For
> example, cfq would see about 7MB/s versus deadline's 35 for the same
> workload. I also tested fs_mark with writing and fsyncing 1000 64k
> files, and a similar 5x performance difference was observed. Eric
> Sandeen suggested that I flag the journal writes as metadata, and once I
> did that, the performance difference went away completely (cfq has
> special logic to prioritize metadata I/O).
>
> So, I'm submitting this patch for comments and testing. I have a
> similar patch for jbd that I will submit if folks agree that this is a
> good idea.

Looks good to me.

--
Jens Axboe

2010-04-05 15:24:29

by Jeff Moyer

Subject: Re: [patch/rft] jbd2: tag journal writes as metadata I/O

Jan Kara <[email protected]> writes:

> Hi,
>
>> In running iozone for writes to small files, we noticed a pretty big
>> discrepancy between the performance of the deadline and cfq I/O
>> schedulers. Investigation showed that I/O was being issued from 2
>> different contexts: the iozone process itself, and the jbd2/sdh-8 thread
>> (as expected). Because of the way cfq performs slice idling, the delays
>> introduced between the metadata and data I/Os were significant. For
>> example, cfq would see about 7MB/s versus deadline's 35 for the same
>> workload. I also tested fs_mark with writing and fsyncing 1000 64k
>> files, and a similar 5x performance difference was observed. Eric
>> Sandeen suggested that I flag the journal writes as metadata, and once I
>> did that, the performance difference went away completely (cfq has
>> special logic to prioritize metadata I/O).
>>
>> So, I'm submitting this patch for comments and testing. I have a
>> similar patch for jbd that I will submit if folks agree that this is a
>> good idea.
> This looks like a good idea to me. I'd just be careful about data=journal
> mode where even data is written via journal and thus you'd incorrectly
> prioritize all the IO. I suppose that could have a negative impact on the performance
> of other filesystems on the same disk. So for data=journal mode, I'd leave
> write_op to be just WRITE / WRITE_SYNC_PLUG.

Hi, Jan, thanks for the review! I'm trying to figure out the best way
to relay the journal mode from ext3 or ext4 to jbd or jbd2. Would a new
journal flag, set in journal_init_inode, be appropriate? This wouldn't
cover the case of data journalling set per inode, though. It also puts
some ext3-specific code into the purportedly fs-agnostic jbd code
(specifically, testing the superblock for the data journal mount flag).
Do you have any suggestions?
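
Just to illustrate the kind of flag I have in mind (the
JBD2_DATA_IN_JOURNAL name below is made up, nothing like it exists in
jbd2 today, and per-inode data journalling still wouldn't be covered):

/* Hypothetical: ext4 would set the flag at journal setup time when
 * mounted data=journal, e.g.: */
if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA)
	journal->j_flags |= JBD2_DATA_IN_JOURNAL;

/* ...and jbd2_journal_commit_transaction() would then skip the metadata
 * tagging: */
int write_op = (journal->j_flags & JBD2_DATA_IN_JOURNAL) ?
		WRITE : WRITE_META;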

Thanks!
Jeff

2010-04-05 17:46:11

by Theodore Ts'o

Subject: Re: [patch/rft] jbd2: tag journal writes as metadata I/O

On Mon, Apr 05, 2010 at 11:24:13AM -0400, Jeff Moyer wrote:
> Jan Kara <[email protected]> writes:
>
> > Hi,
> >
> >> In running iozone for writes to small files, we noticed a pretty big
> >> discrepancy between the performance of the deadline and cfq I/O
> >> schedulers. Investigation showed that I/O was being issued from 2
> >> different contexts: the iozone process itself, and the jbd2/sdh-8 thread
> >> (as expected). Because of the way cfq performs slice idling, the delays
> >> introduced between the metadata and data I/Os were significant. For
> >> example, cfq would see about 7MB/s versus deadline's 35 for the same
> >> workload. I also tested fs_mark with writing and fsyncing 1000 64k
> >> files, and a similar 5x performance difference was observed. Eric
> >> Sandeen suggested that I flag the journal writes as metadata, and once I
> >> did that, the performance difference went away completely (cfq has
> >> special logic to prioritize metadata I/O).
> >>
> >> So, I'm submitting this patch for comments and testing. I have a
> >> similar patch for jbd that I will submit if folks agree that this is a
> >> good idea.
> > This looks like a good idea to me. I'd just be careful about data=journal
> > mode where even data is written via journal and thus you'd incorrectly
> > prioritize all the IO. I suppose that could have a negative impact on the performance
> > of other filesystems on the same disk. So for data=journal mode, I'd leave
> > write_op to be just WRITE / WRITE_SYNC_PLUG.
>
> Hi, Jan, thanks for the review! I'm trying to figure out the best way
> to relay the journal mode from ext3 or ext4 to jbd or jbd2. Would a new
> journal flag, set in journal_init_inode, be appropriate? This wouldn't
> cover the case of data journalling set per inode, though. It also puts
> some ext3-specific code into the purportedly fs-agnostic jbd code
> (specifically, testing the superblock for the data journal mount flag).
> Do you have any suggestions?

I don't think it's necessary to worry about data=journal mode. First
of all, it's not true that all of the I/O would be prioritized as
metadata. In data=journal mode, data blocks are written twice; once
to the journal, and once to the final location on disk. And the
journal writes do need to be prioritized because the commit can't go
out until all of the preceding journal blocks have been written. So
treating all of the journal writes as metadata for the purposes of
cfq's prioritization makes sense to me....

- Ted

2010-04-05 17:52:11

by Theodore Ts'o

Subject: Re: [patch/rft] jbd2: tag journal writes as metadata I/O

On Thu, Apr 01, 2010 at 03:04:54PM -0400, Jeff Moyer wrote:
>
> So, I'm submitting this patch for comments and testing. I have a
> similar patch for jbd that I will submit if folks agree that this is a
> good idea.

Added to the ext4 patch queue.

What benchmark were you using to test small file writes? This looks
good to me as well, but we might want to do some extra benchmarking
just to be sure we're not accidentally introducing a performance
regression.

- Ted

2010-04-05 18:36:25

by Jeff Moyer

Subject: Re: [patch/rft] jbd2: tag journal writes as metadata I/O

[email protected] writes:

> On Thu, Apr 01, 2010 at 03:04:54PM -0400, Jeff Moyer wrote:
>>
>> So, I'm submitting this patch for comments and testing. I have a
>> similar patch for jbd that I will submit if folks agree that this is a
>> good idea.
>
> Added to the ext4 patch queue.
>
> What benchmark were you using to test small file writes? This looks
> good to me as well, but we might want to do some extra benchmarking
> just to be sure we're not accidentally introducing a performance
> regression.

iozone showed regressions for write and re-write in runs that include
fsync timings for small files (<8MB). Here's the command line used for
testing:

iozone -az -n 4k -g 2048m -y 1k -q 1m -e

I also ran fs_mark using the following command line:

fs_mark -S 1 -D 100 -N 1000 -d /mnt/test/fs_mark -s 65536 -t 1 -w 4096

I'll let you know if there are any regressions caused by this patch in
any of our other testing.

Cheers,
Jeff

2010-04-05 19:48:18

by Theodore Ts'o

Subject: Re: [patch/rft] jbd2: tag journal writes as metadata I/O

On Mon, Apr 05, 2010 at 02:36:07PM -0400, Jeff Moyer wrote:
> >
> > What benchmark were you using to test small file writes? This looks
> > good to me as well, but we might want to do some extra benchmarking
> > just to be sure we're not accidentally introducing a performance
> > regression.
>
> iozone showed regressions for write and re-write in runs that include
> fsync timings for small files (<8MB). Here's the command line used for
> testing:
>
> iozone -az -n 4k -g 2048m -y 1k -q 1m -e

iozone is showing performance regressions or performance improvements?
I thought the point of this patch was to improve iozone benchmarks?

- Ted

2010-04-05 20:34:54

by Jeff Moyer

Subject: Re: [patch/rft] jbd2: tag journal writes as metadata I/O

[email protected] writes:

> On Mon, Apr 05, 2010 at 02:36:07PM -0400, Jeff Moyer wrote:
>> >
>> > What benchmark were you using to test small file writes? This looks
>> > good to me as well, but we might want to do some extra benchmarking
>> > just to be sure we're not accidentally introducing a performance
>> > regression.
>>
>> iozone showed regressions for write and re-write in runs that include
>> fsync timings for small files (<8MB). Here's the command line used for
>> testing:
>>
>> iozone -az -n 4k -g 2048m -y 1k -q 1m -e
>
> iozone is showing performance regressions or performance improvements?
> I thought the point of this patch was to improve iozone benchmarks?

Sorry, Ted, what I meant to say was that iozone showed differences
between deadline and cfq, where cfq's performance was much worse than
deadline's.

Thanks!
Jeff

2010-04-05 20:42:01

by Jeff Moyer

Subject: Re: [patch/rft] jbd2: tag journal writes as metadata I/O

Jeff Moyer <[email protected]> writes:

> [email protected] writes:
>
>> On Mon, Apr 05, 2010 at 02:36:07PM -0400, Jeff Moyer wrote:
>>> >
>>> > What benchmark were you using to test small file writes? This looks
>>> > good to me as well, but we might want to do some extra benchmarking
>>> > just to be sure we're not accidentally introducing a performance
>>> > regression.
>>>
>>> iozone showed regressions for write and re-write in runs that include
>>> fsync timings for small files (<8MB). Here's the command line used for
>>> testing:
>>>
>>> iozone -az -n 4k -g 2048m -y 1k -q 1m -e
>>
>> iozone is showing performance regressions or performance improvements?
>> I thought the point of this patch was to improve iozone benchmarks?
>
> Sorry, Ted, what I meant to say was that iozone showed differences
> between deadline and cfq, where cfq's performance was much worse than
> deadline's.

And to be 100% clear, with the patch, the performance differences
between deadline and cfq were in the noise.

Cheers,
Jeff

2010-04-05 21:01:42

by Theodore Ts'o

Subject: Re: [patch/rft] jbd2: tag journal writes as metadata I/O

On Mon, Apr 05, 2010 at 04:41:48PM -0400, Jeff Moyer wrote:
> > Sorry, Ted, what I meant to say was that iozone showed differences
> > between deadline and cfq, where cfq's performance was much worse than
> > deadline's.
>
> And to be 100% clear, with the patch, the performance differences
> between deadline and cfq were in the noise.

Right, and since most people don't actually change the I/O scheduler
from cfq, this is basically a performance improvement patch, which is
how I'm going to describe it. :-)

- Ted

2010-04-06 15:20:35

by Jan Kara

Subject: Re: [patch/rft] jbd2: tag journal writes as metadata I/O

On Mon 05-04-10 13:46:03, [email protected] wrote:
> On Mon, Apr 05, 2010 at 11:24:13AM -0400, Jeff Moyer wrote:
> > Jan Kara <[email protected]> writes:
> > >> In running iozone for writes to small files, we noticed a pretty big
> > >> discrepancy between the performance of the deadline and cfq I/O
> > >> schedulers. Investigation showed that I/O was being issued from 2
> > >> different contexts: the iozone process itself, and the jbd2/sdh-8 thread
> > >> (as expected). Because of the way cfq performs slice idling, the delays
> > >> introduced between the metadata and data I/Os were significant. For
> > >> example, cfq would see about 7MB/s versus deadline's 35 for the same
> > >> workload. I also tested fs_mark with writing and fsyncing 1000 64k
> > >> files, and a similar 5x performance difference was observed. Eric
> > >> Sandeen suggested that I flag the journal writes as metadata, and once I
> > >> did that, the performance difference went away completely (cfq has
> > >> special logic to prioritize metadata I/O).
> > >>
> > >> So, I'm submitting this patch for comments and testing. I have a
> > >> similar patch for jbd that I will submit if folks agree that this is a
> > >> good idea.
> > > This looks like a good idea to me. I'd just be careful about data=journal
> > > mode where even data is written via journal and thus you'd incorrectly
> > > prioritize all the IO. I suppose that could have a negative impact on the performance
> > > of other filesystems on the same disk. So for data=journal mode, I'd leave
> > > write_op to be just WRITE / WRITE_SYNC_PLUG.
> >
> > Hi, Jan, thanks for the review! I'm trying to figure out the best way
> > to relay the journal mode from ext3 or ext4 to jbd or jbd2. Would a new
> > journal flag, set in journal_init_inode, be appropriate? This wouldn't
> > cover the case of data journalling set per inode, though. It also puts
> > some ext3-specific code into the purportedly fs-agnostic jbd code
> > (specifically, testing the superblock for the data journal mount flag).
> > Do you have any suggestions?
>
> I don't think it's necessary to worry about data=journal mode. First
> of all, it's not true that all of the I/O would be prioritized as
> metadata. In data=journal mode, data blocks are written twice; once
> to the journal, and once to the final location on disk. And the
> journal writes do need to be prioritized because the commit can't go
> out until all of the preceding journal blocks have been written. So
> treating all of the journal writes as metadata for the purposes of
> cfq's prioritization makes sense to me....
OK, you are right that if we look at it this way, it makes sense to
treat data writes to journal as metadata. So Jeff, now I agree with your
original patch.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2010-04-06 18:27:07

by Vivek Goyal

Subject: Re: [patch/rft] jbd2: tag journal writes as metadata I/O

On Mon, Apr 05, 2010 at 01:46:03PM -0400, [email protected] wrote:
> On Mon, Apr 05, 2010 at 11:24:13AM -0400, Jeff Moyer wrote:
> > Jan Kara <[email protected]> writes:
> >
> > > Hi,
> > >
> > >> In running iozone for writes to small files, we noticed a pretty big
> > >> discrepancy between the performance of the deadline and cfq I/O
> > >> schedulers. Investigation showed that I/O was being issued from 2
> > >> different contexts: the iozone process itself, and the jbd2/sdh-8 thread
> > >> (as expected). Because of the way cfq performs slice idling, the delays
> > >> introduced between the metadata and data I/Os were significant. For
> > >> example, cfq would see about 7MB/s versus deadline's 35 for the same
> > >> workload. I also tested fs_mark with writing and fsyncing 1000 64k
> > >> files, and a similar 5x performance difference was observed. Eric
> > >> Sandeen suggested that I flag the journal writes as metadata, and once I
> > >> did that, the performance difference went away completely (cfq has
> > >> special logic to prioritize metadata I/O).
> > >>
> > >> So, I'm submitting this patch for comments and testing. I have a
> > >> similar patch for jbd that I will submit if folks agree that this is a
> > >> good idea.
> > > This looks like a good idea to me. I'd just be careful about data=journal
> > > mode where even data is written via journal and thus you'd incorrectly
> > > prioritize all the IO. I suppose that could have a negative impact on the performance
> > > of other filesystems on the same disk. So for data=journal mode, I'd leave
> > > write_op to be just WRITE / WRITE_SYNC_PLUG.
> >
> > Hi, Jan, thanks for the review! I'm trying to figure out the best way
> > to relay the journal mode from ext3 or ext4 to jbd or jbd2. Would a new
> > journal flag, set in journal_init_inode, be appropriate? This wouldn't
> > cover the case of data journalling set per inode, though. It also puts
> > some ext3-specific code into the purportedly fs-agnostic jbd code
> > (specifically, testing the superblock for the data journal mount flag).
> > Do you have any suggestions?
>
> I don't think it's necessary to worry about data=journal mode. First
> of all, it's not true that all of the I/O would be prioritized as
> metadata. In data=journal mode, data blocks are written twice; once
> to the journal, and once to the final location on disk. And the
> journal writes do need to be prioritized because the commit can't go
> out until all of the preceding journal blocks have been written. So
> treating all of the journal writes as metadata for the purposes of
> cfq's prioritization makes sense to me....
>

CFQ currently seems to be preempting any thread doing IO if a request has
been marked as metadata. I think this is going to be bad for any other IO
going on.

I wrote a small fio script which is doing buffered writes with bs=32K and I
am doing fsync on file after every 20 IOs (fsync=20). I am assuming that this is
something close to writing a small file and then doing fsync on that.

With that fio script running I launched firefox and measured the time it
takes.

Without patch
=============
real 0m13.141s
user 0m0.588s
sys 0m0.164s

real 0m11.859s
user 0m0.580s
sys 0m0.163s

real 0m13.384s
user 0m0.587s
sys 0m0.163s

With patch
==========
real 0m25.986s
user 0m0.627s
sys 0m0.150s

real 0m29.582s
user 0m0.623s
sys 0m0.156s

real 0m25.898s
user 0m0.587s
sys 0m0.160s

So it looks like firefox launch times have almost doubled.

My fio script looks like as follows.

*******************************************
[global]
directory=/root/fio/
ioscheduler=cfq
size=1G
group_reporting=1
bs=32K
exec_prerun='echo 3 > /proc/sys/vm/drop_caches'

[iostest]
numjobs=1
new_group=1
rw=write
fsync=20
**********************************************

Thanks
Vivek

> - Ted

2010-04-06 18:45:13

by Theodore Ts'o

Subject: Re: [patch/rft] jbd2: tag journal writes as metadata I/O

> CFQ currently seems to be preempting any thread doing IO if a request has
> been marked as metadata. I think this is going to be bad for any other IO
> going on.
>
> I wrote a small fio script which is doing buffered writes with bs=32K and I
> am doing fsync on file after every 20 IOs (fsync=20). I am assuming that this is
> something close to writing a small file and then doing fsync on that.
>
> With that fio script running I launched firefox and measured the
> time it takes..... it looks like firefox launch times have
> almost doubled.

Vivek, thanks for pointing this out. Sounds like we need to think
carefully about whether the potential unfairness that this patch might
impose on other workloads sharing the file system dominates the
improvements that Jeff found when there's only a single workload
running on the file system.

I'm tentatively leaning towards pulling this patch so we can do more
testing / benchmarking. Jeff, any thoughts or comments?

- Ted

2010-04-06 19:05:19

by Jeff Moyer

Subject: Re: [patch/rft] jbd2: tag journal writes as metadata I/O

[email protected] writes:

>> CFQ currently seems to be preempting any thread doing IO if a request has
>> been marked as metadata. I think this is going to be bad for any other IO
>> going on.
>>
>> I wrote a small fio script which is doing buffered writes with bs=32K and I
>> am doing fsync on file after every 20 IOs (fsync=20). I am assuming that this is
>> something close to writing a small file and then doing fsync on that.
>>
>> With that fio script running I launched firefox and measured the
>> time it takes..... it looks like firefox launch times have
>> almost doubled.
>
> Vivek, thanks for pointing this out. Sounds like we need to think
> carefully about whether the potential unfairness that this patch might
> impose on other workloads sharing the file system dominates the
> improvements that Jeff found when there's only a single workload
> running on the file system.
>
> I'm tentatively leaning towards pulling this patch so we can do more
> testing / benchmarking. Jeff, any thoughts or comments?

Yeah, pull it. I just talked this over with Vivek, and looking back at
the blktrace data I think I have another idea that might work and be
less invasive.

Basically, the iozone process is, umm, special. It does the following
at startup (from memory, so I might've missed a step):

open(fd);
truncate(fd, 0);
close(fd);
open(fd);
fsync(fd);

*then* it does I/O.

So, we get a sync cfqq for the iozone process that ends up doing the
metadata lookups. Then it does the truncate and of course blocks. At
this point, we are waiting for jbd2 to run to sync out its transaction.
However, cfq is idling on the iozone cfqq, so the jbd2 thread doesn't get a
chance to run for 8ms (cfq's default slice_idle).

If we instead just pass a hint down to the I/O scheduler on fsync,
similar to the schedule() call for the cpu scheduler, then I think we
can give up our time slice and allow the jbd2 process to run.
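
Something along these lines, purely to illustrate the shape of the idea
(blk_yield() and cfq_yield() below are made-up names, not existing
interfaces):

/* fs side: called from the fsync path before blocking on the journal
 * commit, telling the scheduler not to idle on our queue. */
blk_yield(bdev_get_queue(bdev));

/* cfq side: a new elevator hook that gives up the rest of the active
 * queue's idle window/slice so another queue (jbd2's) can be served. */
static void cfq_yield(struct request_queue *q)
{
	struct cfq_data *cfqd = q->elevator->elevator_data;

	if (cfqd->active_queue)
		cfq_slice_expired(cfqd, 1);
}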

I'll report back with my findings. Vivek, thanks a ton for the careful
(as always) review.

Cheers,
Jeff