LinuxLists.cc - trying to understand READ_META, READ_SYNC, WRITE

[permalink] [raw]

Subject: Re: trying to understand READ_META, READ_SYNC, WRITE_SYNC & co

On 2010-06-21 11:48, Christoph Hellwig wrote:
> Hi Jens,
>
> I'm trying to understand the "sync" and "meta" flags we add to bios /
> requests and the more I look into the more I get confused. Let's start
> with the simple bits:
>
> - REQ_META
>
> Only really used for READ_META. There's a WRITE_META that I
> added a while ago, but the actual users of it are gone again
> now.
>
> The READ_META is used by ext3 and ext4 for reading inodes,
> and for all reads of directory blocks. XFS has a place that
> sets READ_META, but it's unreachable now. GFS2 on the other
> hand uses REQ_META directly, and ORs it into just about every
> other kind of READ/WRITE request: WRITE_BARRIER,
> WRITE_SYNC_PLUG, WRITE, READ_SYNC.
>
> - REQ_SYNC
>
> Used for READ_SYNC / WRITE_SYNC_PLUG / WRITE_SYNC /
> WRITE_ODIRECT_PLUG / WRITE_BARRIER and the SWRITE variants.
>
> READ_SYNC is used by GFS2 in a few spots, but always together
> with REQ_META, and by jfs and xfs for the code reading the log
> during recovery.
> (S)WRITE_SYNC_PLUG is used for ->writepage instances with
> the WB_SYNC_ALL sync_mode, and random pieces of journal /
> metadata I/O in gfs2 and jbd(2) as well as fsync_buffers_list().
> Note that the SWRITE_SYNC_PLUG uses actually degenerate to WRITE
> in ll_rw_block.c due to what seems to be a bug in that code.
> WRITE_SYNC is used by xfs for the log write if barriers are not
> supported, by jfs for all log writes, by gfs2 for at least some
> log writes, by btrfs for data writeout in some cases.
> SWRITE_SYNC is entirely unused.
> WRITE_ODIRECT_PLUG is used for O_DIRECT writes, and
> WRITE_BARRIER is used for anything using barriers.
>
> - REQ_NOIDLE
>
> Used for WRITE_SYNC_PLUG / WRITE_SYNC / WRITE_BARRIER and
> SWRITE variants. See the description above for details.
>
> - REQ_UNPLUG
>
> Used for READ_SYNC, WRITE_SYNC and WRITE_BARRIER.
>
> Now how do we use these flags in the block layer?
>
> - REQ_META
>
> The only place where we ever use this flag is inside the
> cfq scheduler. In cfq_choose_req we use it to give a meta
> request priority over one that doesn't have it. But before
> that we already do the same preference check with rw_is_sync,
> which evaluates to true for requests with that are either
> reads or have REQ_SYNC set. So for reads the REQ_META flag
> here effectively is a no-op, and for writes it gives less
> priority than REQ_SYNC.
> In addition to that we use it to account for pending metadata
> requests in cfq_rq_enqueued/cfq_remove_request which gets
> checked in cfq_should_preempt to give priority to a meta
> request if the other queue doesn't have any pending meta
> requests. But again this priority comes after a similar
> check for sync requests that checks if the other queue has
> been marked to have sync requests pending.

It's also annotation for blktrace, so you can tell which parts of the IO
is meta data etc. The scheduler impact is questionable, I doubt it makes
a whole lot of difference.

> - REQ_SYNC
>
> This one is used in quite a few places, with many of them
> obsfucated by macros like rw_is_sync, rq_is_sync and
> cfq_bio_sync. In general all uses seem to imply giving
> a write request the same priority as a read request and
> treat it as synchronous. I could not spot a place where
> it actually has any effect on reads.

Reads are sync by nature in the block layer, so they don't get that
special annotation.

> - REQ_NOIDLE
>
> Only used inside the cfq I/O scheduler in one place:
> cfq_completed_request. Only used for READ or synchronous
> request. I can't figure out what's actually going on in
> details here.

For a sync request, CFQ will idle the disk waiting for another one that
is close to the completed request. This is what keeps up high throughput
for read or sync write workloads, if you have more than one stream
going.

> Now the big questions:
>
> What eactly is the REQ_META flag for? Even after going
> through all the things I mentioned above I don't quite
> understand it. It's only used for reads, and doesn't seem
> to give any major priority. Should it be used for all metadata
> reads or just some? Currently it's not actually very widely
> used.
>
> Should we allow REQ_META on writes, but ignore it except for
> blktrace? If the answer above is we want to tag all metadata
> reads as REQ_META that would allow easily spotting all metadata
> I/O in blktrace.

For analysis purposes, annotating all meta data IOs as such would be
beneficial (reads as well as writes). As mentioned above, the current
scheduler impact isn't huge. There may be some interesting test and
benchmarks there for improving that part.

> Is there any point in keeping the READ_SYNC? If the users
> really want to keep the direct unplugging semantics we can
> just write READ | REQ_UNPLUG, but even that seems to be wrong
> for the log recvoery callers.
>
> Similarly is there a point in keeping WRITE_SYNC? Most callers
> really want WRITE_SYNC_PLUG semantics, and the current naming
> is extremtly confusing. I'd much rather see the callers
> specify explicitly using REQ_UNPLUG if they want to unplug
> the queue.

So a large part of that problem is the overloaded meaning of sync. For
some cases it means "unplug on issue", and for others it means that the
IO itself is syncronous. The other nasty bit is the implicit plugging
that happens behind the back of the submitter, but that's an issue to
tackle separately. I'd suggest avoiding unnecessary churn in naming of
those.

> Why do O_DIRECT writes not want to set REQ_NOIDLE (and that
> exactly does REQ_NOIDLE mean anyway). It's the only sync writes
> that do not set it, so if this special case went away we
> could get rid of the flag and key it off REQ_SYNC.

See above for NOIDLE. You kill O_DIRECT write throughput if you don't
idle at the end of a write, if you have other activity on the disk.

> Do we really need to keep the SWRITE flags? I'd rather
> make that an additional flag to ll_rw_block, or even better
> split ll_rw_block into two helpers for the guaranteed locking
> or not case.

No, lets kill those.

--
Jens Axboe

2010-06-21 11:04:39

[permalink] [raw]

Subject: Re: trying to understand READ_META, READ_SYNC, WRITE_SYNC & co

On Mon, Jun 21, 2010 at 12:04:06PM +0200, Jens Axboe wrote:
> It's also annotation for blktrace, so you can tell which parts of the IO
> is meta data etc. The scheduler impact is questionable, I doubt it makes
> a whole lot of difference.

> For analysis purposes, annotating all meta data IOs as such would be
> beneficial (reads as well as writes). As mentioned above, the current
> scheduler impact isn't huge. There may be some interesting test and
> benchmarks there for improving that part.

As mentioned in the previous mail the use of the flag is not very
wide spread. Currently it's only ext3/ext3 inodes and directories as well
as all metadata I/O in gfs2 that gets marked this way. And I'd be much more
comfortable to add more annotations if it didn't also have some form
of schedule impact. The I/O schedules in general and cfq in particular
have caused us far too many issues with such subtile differences.

> > This one is used in quite a few places, with many of them
> > obsfucated by macros like rw_is_sync, rq_is_sync and
> > cfq_bio_sync. In general all uses seem to imply giving
> > a write request the same priority as a read request and
> > treat it as synchronous. I could not spot a place where
> > it actually has any effect on reads.
>
> Reads are sync by nature in the block layer, so they don't get that
> special annotation.

Well, we do give them this special annotation in a few places, but we
don't actually use it.

> So a large part of that problem is the overloaded meaning of sync. For
> some cases it means "unplug on issue", and for others it means that the
> IO itself is syncronous. The other nasty bit is the implicit plugging
> that happens behind the back of the submitter, but that's an issue to
> tackle separately. I'd suggest avoiding unnecessary churn in naming of
> those.

Well, the current naming is extremly confusing. The best thing I could
come up with is to completely drop READ_SYNC and WRITE_SYNC and just
pass REQ_UNPLUG explicitly together with READ / WRITE_SYNC_PLUG.
There's only 5 respective 8 users of them, so explicitly documenting
our intentions there seems useful. Especially if we want to add more
_META annotation in which case the simple READ_* / WRITE_* macros
don't do anymore either. Similarly it might be useful to remove
READ_META/WRITE_META and replace them with explicit | REQ_META, which
is just about as short and a lot more descriptive, especially for
synchronous metadata writes.

> > Why do O_DIRECT writes not want to set REQ_NOIDLE (and that
> > exactly does REQ_NOIDLE mean anyway). It's the only sync writes
> > that do not set it, so if this special case went away we
> > could get rid of the flag and key it off REQ_SYNC.
>
> See above for NOIDLE. You kill O_DIRECT write throughput if you don't
> idle at the end of a write, if you have other activity on the disk.

Ok, makes sense. Would you mind taking a patch to kill the
WRITE_ODIRECT_PLUG and just do a

/*
* O_DIRECT writes are synchronous, but we must not disable the
* idling logic in CFQ to avoid killing performance.
*/
if (rw & WRITE)
rw |= REQ_SYNC;

But that leaves the question why disabling the idling logical for
data integrity ->writepage is fine? This gets called from ->fsync
or O_SYNC writes and will have the same impact as O_DIRECT writes.

2010-06-21 18:52:50

by Jeff Moyer

[permalink] [raw]

Subject: Re: trying to understand READ_META, READ_SYNC, WRITE_SYNC & co

Jens Axboe <[email protected]> writes:

> On 2010-06-21 11:48, Christoph Hellwig wrote:
>> Now how do we use these flags in the block layer?
>>
>> - REQ_META
>>
>> The only place where we ever use this flag is inside the
>> cfq scheduler. In cfq_choose_req we use it to give a meta
>> request priority over one that doesn't have it. But before
>> that we already do the same preference check with rw_is_sync,
>> which evaluates to true for requests with that are either
>> reads or have REQ_SYNC set. So for reads the REQ_META flag
>> here effectively is a no-op, and for writes it gives less
>> priority than REQ_SYNC.
>> In addition to that we use it to account for pending metadata
>> requests in cfq_rq_enqueued/cfq_remove_request which gets
>> checked in cfq_should_preempt to give priority to a meta
>> request if the other queue doesn't have any pending meta
>> requests. But again this priority comes after a similar
>> check for sync requests that checks if the other queue has
>> been marked to have sync requests pending.
>
> It's also annotation for blktrace, so you can tell which parts of the IO
> is meta data etc. The scheduler impact is questionable, I doubt it makes
> a whole lot of difference.

Really? Even after I showed the performance impact of setting that bit
for journal I/O?

http://lkml.org/lkml/2010/4/1/344

Cheers,
Jeff

2010-06-21 18:57:03

[permalink] [raw]

Subject: Re: trying to understand READ_META, READ_SYNC, WRITE_SYNC & co

On 21/06/10 13.04, Christoph Hellwig wrote:
> On Mon, Jun 21, 2010 at 12:04:06PM +0200, Jens Axboe wrote:
>> It's also annotation for blktrace, so you can tell which parts of the IO
>> is meta data etc. The scheduler impact is questionable, I doubt it makes
>> a whole lot of difference.
>
>> For analysis purposes, annotating all meta data IOs as such would be
>> beneficial (reads as well as writes). As mentioned above, the current
>> scheduler impact isn't huge. There may be some interesting test and
>> benchmarks there for improving that part.
>
> As mentioned in the previous mail the use of the flag is not very
> wide spread. Currently it's only ext3/ext3 inodes and directories as well
> as all metadata I/O in gfs2 that gets marked this way. And I'd be much more
> comfortable to add more annotations if it didn't also have some form
> of schedule impact. The I/O schedules in general and cfq in particular
> have caused us far too many issues with such subtile differences.

FWIW, Windows marks meta data writes and they go out with FUA set
on SATA disks. And SATA firmware prioritizes FUA writes, it's essentially
a priority bit as well as a platter access bit. So at least we have some
one else using a meta data boost. I agree that it would be a lot more
trivial to add the annotations if they didn't have scheduler impact
as well, but I still think it's a sane thing.

>>> This one is used in quite a few places, with many of them
>>> obsfucated by macros like rw_is_sync, rq_is_sync and
>>> cfq_bio_sync. In general all uses seem to imply giving
>>> a write request the same priority as a read request and
>>> treat it as synchronous. I could not spot a place where
>>> it actually has any effect on reads.
>>
>> Reads are sync by nature in the block layer, so they don't get that
>> special annotation.
>
> Well, we do give them this special annotation in a few places, but we
> don't actually use it.

For unplugging?

>> So a large part of that problem is the overloaded meaning of sync. For
>> some cases it means "unplug on issue", and for others it means that the
>> IO itself is syncronous. The other nasty bit is the implicit plugging
>> that happens behind the back of the submitter, but that's an issue to
>> tackle separately. I'd suggest avoiding unnecessary churn in naming of
>> those.
>
> Well, the current naming is extremly confusing. The best thing I could
> come up with is to completely drop READ_SYNC and WRITE_SYNC and just
> pass REQ_UNPLUG explicitly together with READ / WRITE_SYNC_PLUG.
> There's only 5 respective 8 users of them, so explicitly documenting
> our intentions there seems useful. Especially if we want to add more
> _META annotation in which case the simple READ_* / WRITE_* macros
> don't do anymore either. Similarly it might be useful to remove
> READ_META/WRITE_META and replace them with explicit | REQ_META, which
> is just about as short and a lot more descriptive, especially for
> synchronous metadata writes.

Sure, we can add the explicit plug naming, I don't have an issue
with that.

>>> Why do O_DIRECT writes not want to set REQ_NOIDLE (and that
>>> exactly does REQ_NOIDLE mean anyway). It's the only sync writes
>>> that do not set it, so if this special case went away we
>>> could get rid of the flag and key it off REQ_SYNC.
>>
>> See above for NOIDLE. You kill O_DIRECT write throughput if you don't
>> idle at the end of a write, if you have other activity on the disk.
>
> Ok, makes sense. Would you mind taking a patch to kill the
> WRITE_ODIRECT_PLUG and just do a
>
> /*
> * O_DIRECT writes are synchronous, but we must not disable the
> * idling logic in CFQ to avoid killing performance.
> */
> if (rw & WRITE)
> rw |= REQ_SYNC;

At the caller site? Sure.

> But that leaves the question why disabling the idling logical for
> data integrity ->writepage is fine? This gets called from ->fsync
> or O_SYNC writes and will have the same impact as O_DIRECT writes.

We have never enabled idling for those. O_SYNC should get a nice
boost too, it just needs to be benchmarked and tested and then
there would be no reason not to add it.

--
Jens Axboe

2010-06-21 18:59:04

[permalink] [raw]

Subject: Re: trying to understand READ_META, READ_SYNC, WRITE_SYNC & co

On 21/06/10 20.52, Jeff Moyer wrote:
> Jens Axboe <[email protected]> writes:
>
>> On 2010-06-21 11:48, Christoph Hellwig wrote:
>>> Now how do we use these flags in the block layer?
>>>
>>> - REQ_META
>>>
>>> The only place where we ever use this flag is inside the
>>> cfq scheduler. In cfq_choose_req we use it to give a meta
>>> request priority over one that doesn't have it. But before
>>> that we already do the same preference check with rw_is_sync,
>>> which evaluates to true for requests with that are either
>>> reads or have REQ_SYNC set. So for reads the REQ_META flag
>>> here effectively is a no-op, and for writes it gives less
>>> priority than REQ_SYNC.
>>> In addition to that we use it to account for pending metadata
>>> requests in cfq_rq_enqueued/cfq_remove_request which gets
>>> checked in cfq_should_preempt to give priority to a meta
>>> request if the other queue doesn't have any pending meta
>>> requests. But again this priority comes after a similar
>>> check for sync requests that checks if the other queue has
>>> been marked to have sync requests pending.
>>
>> It's also annotation for blktrace, so you can tell which parts of the IO
>> is meta data etc. The scheduler impact is questionable, I doubt it makes
>> a whole lot of difference.
>
> Really? Even after I showed the performance impact of setting that bit
> for journal I/O?
>
> http://lkml.org/lkml/2010/4/1/344

It's definitely a win in some cases, as you showed there as well.
My initial testing a long time ago had some nice benefits too. So
perhaps the above wasn't worded very well, I always worry that we
have regressions doing boosts for things like that. But given that
meta data is something that needs to be done before we get to the
real data, bumping priority generally seems like a good thing to do.

--
Jens Axboe

2010-06-21 19:08:28

by Jeff Moyer

[permalink] [raw]

Subject: Re: trying to understand READ_META, READ_SYNC, WRITE_SYNC & co

Jens Axboe <[email protected]> writes:

> On 21/06/10 20.52, Jeff Moyer wrote:
>> Jens Axboe <[email protected]> writes:
>>
>>> On 2010-06-21 11:48, Christoph Hellwig wrote:
>>>> Now how do we use these flags in the block layer?
>>>>
>>>> - REQ_META
>>>>
>>>> The only place where we ever use this flag is inside the
>>>> cfq scheduler. In cfq_choose_req we use it to give a meta
>>>> request priority over one that doesn't have it. But before
>>>> that we already do the same preference check with rw_is_sync,
>>>> which evaluates to true for requests with that are either
>>>> reads or have REQ_SYNC set. So for reads the REQ_META flag
>>>> here effectively is a no-op, and for writes it gives less
>>>> priority than REQ_SYNC.
>>>> In addition to that we use it to account for pending metadata
>>>> requests in cfq_rq_enqueued/cfq_remove_request which gets
>>>> checked in cfq_should_preempt to give priority to a meta
>>>> request if the other queue doesn't have any pending meta
>>>> requests. But again this priority comes after a similar
>>>> check for sync requests that checks if the other queue has
>>>> been marked to have sync requests pending.
>>>
>>> It's also annotation for blktrace, so you can tell which parts of the IO
>>> is meta data etc. The scheduler impact is questionable, I doubt it makes
>>> a whole lot of difference.
>>
>> Really? Even after I showed the performance impact of setting that bit
>> for journal I/O?
>>
>> http://lkml.org/lkml/2010/4/1/344
>
> It's definitely a win in some cases, as you showed there as well.
> My initial testing a long time ago had some nice benefits too. So
> perhaps the above wasn't worded very well, I always worry that we
> have regressions doing boosts for things like that. But given that
> meta data is something that needs to be done before we get to the
> real data, bumping priority generally seems like a good thing to do.

Oh, I'm not arguing for that approach. I just wanted to make it clear
that it can and does have a noticible impact.

Cheers,
Jeff

2010-06-21 19:14:14

[permalink] [raw]

Subject: Re: trying to understand READ_META, READ_SYNC, WRITE_SYNC & co

On Mon, Jun 21, 2010 at 08:56:55PM +0200, Jens Axboe wrote:
> FWIW, Windows marks meta data writes and they go out with FUA set
> on SATA disks. And SATA firmware prioritizes FUA writes, it's essentially
> a priority bit as well as a platter access bit. So at least we have some
> one else using a meta data boost. I agree that it would be a lot more
> trivial to add the annotations if they didn't have scheduler impact
> as well, but I still think it's a sane thing.

And we still disable the FUA bit in libata unless people set a
non-standard module option..

> >> Reads are sync by nature in the block layer, so they don't get that
> >> special annotation.
> >
> > Well, we do give them this special annotation in a few places, but we
> > don't actually use it.
>
> For unplugging?

We use the explicit unplugging, yes - but READ_META also includes
REQ_SYNC which is not used anywhere.

> > But that leaves the question why disabling the idling logical for
> > data integrity ->writepage is fine? This gets called from ->fsync
> > or O_SYNC writes and will have the same impact as O_DIRECT writes.
>
> We have never enabled idling for those. O_SYNC should get a nice
> boost too, it just needs to be benchmarked and tested and then
> there would be no reason not to add it.

We've only started using any kind of sync tag last year in ->writepage in
commit a64c8610bd3b753c6aff58f51c04cdf0ae478c18 "block_write_full_page:
Use synchronous writes for WBC_SYNC_ALL writebacks", switching from
WRITE_SYNC to WRITE_SYNC_PLUG a bit later in commit
6e34eeddf7deec1444bbddab533f03f520d8458c "block_write_full_page: switch
synchronous writes to use WRITE_SYNC_PLUG"

Before that we used plain WRITE, which had the normal idling logic.

2010-06-21 19:16:35

[permalink] [raw]

Subject: Re: trying to understand READ_META, READ_SYNC, WRITE_SYNC & co

On 21/06/10 21.14, Christoph Hellwig wrote:
> On Mon, Jun 21, 2010 at 08:56:55PM +0200, Jens Axboe wrote:
>> FWIW, Windows marks meta data writes and they go out with FUA set
>> on SATA disks. And SATA firmware prioritizes FUA writes, it's essentially
>> a priority bit as well as a platter access bit. So at least we have some
>> one else using a meta data boost. I agree that it would be a lot more
>> trivial to add the annotations if they didn't have scheduler impact
>> as well, but I still think it's a sane thing.
>
> And we still disable the FUA bit in libata unless people set a
> non-standard module option..

Yes, but that's a separate thing. The point is that boosting meta data
IO is done by others as well. That we don't fully do the same on the
hw side is a different story. That doesn't mean that the io scheduler
boost isn't useful.

>>>> Reads are sync by nature in the block layer, so they don't get that
>>>> special annotation.
>>>
>>> Well, we do give them this special annotation in a few places, but we
>>> don't actually use it.
>>
>> For unplugging?
>
> We use the explicit unplugging, yes - but READ_META also includes
> REQ_SYNC which is not used anywhere.

That does look superfluous.

>>> But that leaves the question why disabling the idling logical for
>>> data integrity ->writepage is fine? This gets called from ->fsync
>>> or O_SYNC writes and will have the same impact as O_DIRECT writes.
>>
>> We have never enabled idling for those. O_SYNC should get a nice
>> boost too, it just needs to be benchmarked and tested and then
>> there would be no reason not to add it.
>
> We've only started using any kind of sync tag last year in ->writepage in
> commit a64c8610bd3b753c6aff58f51c04cdf0ae478c18 "block_write_full_page:
> Use synchronous writes for WBC_SYNC_ALL writebacks", switching from
> WRITE_SYNC to WRITE_SYNC_PLUG a bit later in commit
> 6e34eeddf7deec1444bbddab533f03f520d8458c "block_write_full_page: switch
> synchronous writes to use WRITE_SYNC_PLUG"
>
> Before that we used plain WRITE, which had the normal idling logic.

Plain write does not idle.

--
Jens Axboe

2010-06-21 19:20:27

[permalink] [raw]

Subject: Re: trying to understand READ_META, READ_SYNC, WRITE_SYNC & co

On Mon, Jun 21, 2010 at 09:16:33PM +0200, Jens Axboe wrote:
> Yes, but that's a separate thing. The point is that boosting meta data
> IO is done by others as well. That we don't fully do the same on the
> hw side is a different story. That doesn't mean that the io scheduler
> boost isn't useful.

Well, the FUA bit is useful for writes. And I agree that we need to
prioritize log I/O metadata. On the other hand at least for XFS most
other metadata writes actually are asynchronous - there's no need to
prioritize those.

> > We use the explicit unplugging, yes - but READ_META also includes
> > REQ_SYNC which is not used anywhere.
>
> That does look superfluous.

The above should READ_SYNC, but the same still applies.

> > We've only started using any kind of sync tag last year in ->writepage in
> > commit a64c8610bd3b753c6aff58f51c04cdf0ae478c18 "block_write_full_page:
> > Use synchronous writes for WBC_SYNC_ALL writebacks", switching from
> > WRITE_SYNC to WRITE_SYNC_PLUG a bit later in commit
> > 6e34eeddf7deec1444bbddab533f03f520d8458c "block_write_full_page: switch
> > synchronous writes to use WRITE_SYNC_PLUG"
> >
> > Before that we used plain WRITE, which had the normal idling logic.
>
> Plain write does not idle.

Well, whatever logic we use for normal files. Note that all the commits
above were introduced in 2.6.29. That's where the horrible fsync
latency regressions due to the idling logic that Jeff is fighting were
introduced. So there's defintively something wrong going on here.

2010-06-21 20:25:14

[permalink] [raw]

Subject: Re: trying to understand READ_META, READ_SYNC, WRITE_SYNC & co

On Mon, Jun 21, 2010 at 12:04:06PM +0200, Jens Axboe wrote:

[..]
> > Now how do we use these flags in the block layer?
> >
> > - REQ_META
> >
> > The only place where we ever use this flag is inside the
> > cfq scheduler. In cfq_choose_req we use it to give a meta
> > request priority over one that doesn't have it. But before
> > that we already do the same preference check with rw_is_sync,
> > which evaluates to true for requests with that are either
> > reads or have REQ_SYNC set. So for reads the REQ_META flag
> > here effectively is a no-op, and for writes it gives less
> > priority than REQ_SYNC.
> > In addition to that we use it to account for pending metadata
> > requests in cfq_rq_enqueued/cfq_remove_request which gets
> > checked in cfq_should_preempt to give priority to a meta
> > request if the other queue doesn't have any pending meta
> > requests. But again this priority comes after a similar
> > check for sync requests that checks if the other queue has
> > been marked to have sync requests pending.
>
> It's also annotation for blktrace, so you can tell which parts of the IO
> is meta data etc. The scheduler impact is questionable, I doubt it makes
> a whole lot of difference.

In my testing in the past, this was helping if lots of sequential readers
are running in a system (with 100ms slice each) and if there is another
reader doing small file reads. Without meta data based preemption check,
latency of opening small file used to be very high (Because all the sync
sequntial reader gets to consume their 100ms slices first).

Situation should be somewhat better now after corrado's changes of
reducing slice length dynamically. But I suspect that we will still
experience significantly reduced latency for small file reads in presece
of multiple sequntial reads going on.

Thanks
Vivek

2010-06-21 21:36:28

[permalink] [raw]

Subject: Re: trying to understand READ_META, READ_SYNC, WRITE_SYNC & co

On Mon, Jun 21, 2010 at 09:14:10PM +0200, Christoph Hellwig wrote:
> On Mon, Jun 21, 2010 at 08:56:55PM +0200, Jens Axboe wrote:
> > FWIW, Windows marks meta data writes and they go out with FUA set
> > on SATA disks. And SATA firmware prioritizes FUA writes, it's essentially
> > a priority bit as well as a platter access bit. So at least we have some
> > one else using a meta data boost. I agree that it would be a lot more
> > trivial to add the annotations if they didn't have scheduler impact
> > as well, but I still think it's a sane thing.
>
> And we still disable the FUA bit in libata unless people set a
> non-standard module option..
>
> > >> Reads are sync by nature in the block layer, so they don't get that
> > >> special annotation.
> > >
> > > Well, we do give them this special annotation in a few places, but we
> > > don't actually use it.
> >
> > For unplugging?
>
> We use the explicit unplugging, yes - but READ_META also includes
> REQ_SYNC which is not used anywhere.
>
> > > But that leaves the question why disabling the idling logical for
> > > data integrity ->writepage is fine? This gets called from ->fsync
> > > or O_SYNC writes and will have the same impact as O_DIRECT writes.
> >
> > We have never enabled idling for those. O_SYNC should get a nice
> > boost too, it just needs to be benchmarked and tested and then
> > there would be no reason not to add it.
>
> We've only started using any kind of sync tag last year in ->writepage in
> commit a64c8610bd3b753c6aff58f51c04cdf0ae478c18 "block_write_full_page:
> Use synchronous writes for WBC_SYNC_ALL writebacks", switching from
> WRITE_SYNC to WRITE_SYNC_PLUG a bit later in commit
> 6e34eeddf7deec1444bbddab533f03f520d8458c "block_write_full_page: switch
> synchronous writes to use WRITE_SYNC_PLUG"
>
> Before that we used plain WRITE, which had the normal idling logic.

I have got very little understanding of file system layer, but if I had
to guess, i think following might have happened.

- We switched from WRITE to WRITE_SYNC for fsync() path.

- This might have caused issues with idling as for SYNC_WRITE we will idle
in CFQ but probably it is not desirable in certain cases where next set
of WRITES is going to come from journaling thread.

- That might have prompted us to introduce the rq_noidle() to make sure
we don't idle in WRITE_SYNC path but direct IO path was avoided to make
sure good throughput is maintained. But this left one question open
and that is it good to disable idling on all WRITE_SYNC path in kernel.

- Slowly cfq code emerged and as it stands today, to me rq_noidle() is
practically of not much use. For sync-idle tree (where idling is
enabled), we are ignoring the rq_noidle() and always arming the timer.
For sync-noidle, we choose not to idle based on if there was some other
thread who did even a single IO with rq_noidle=0.

I think in practice, there is on thread of other which is doing some
read or write with rq_noidle=0 and if that's the case, we will end up
idling on sync-noidle tree also and rq_noidle() practically does
not take effect.

So if rq_noidle() was introduced to solve the issue of not idling on
fsync() path (as jbd thread will send more data now), then probably slice
yielding patch of jeff might come handy here and and we can get rid of
rq_noidle() logic. This is just a guess work and I might be completely
wrong here...

Thanks
Vivek

2010-06-23 09:26:23

[permalink] [raw]

Subject: Re: trying to understand READ_META, READ_SYNC, WRITE_SYNC & co

On Mon, Jun 21, 2010 at 08:58:51PM +0200, Jens Axboe wrote:
> It's definitely a win in some cases, as you showed there as well.
> My initial testing a long time ago had some nice benefits too. So
> perhaps the above wasn't worded very well, I always worry that we
> have regressions doing boosts for things like that. But given that
> meta data is something that needs to be done before we get to the
> real data, bumping priority generally seems like a good thing to do.

Even if the REQ_META special casing helps with performance it creates
a big issue if we want to follow your other guide line, that is marking
all actual metadata requests REQ_META for blocktrace. What about
only applying the metadata preference only to _synchronous_ (read or
REQ_SYNC) I/Os that also have REQ_META set?

Right now we never use REQ_META on a non-synchronous request (XFS appears
to, but the code is not actually reachable anymore), so it's not
actually a change in behaviour. After that we could do an easy sweep
through the tree and mark all metadata requests as REQ_META. Btw, what
do we consider metadata for this purpose? The interesting question
here is about indirect blocks / bmap btree blocks. In the traditional
sense they are metadata, but for I/O purposes they are mostly part of
the I/O stream.

2010-06-23 10:01:54

[permalink] [raw]

Subject: Re: trying to understand READ_META, READ_SYNC, WRITE_SYNC & co

On Mon, Jun 21, 2010 at 05:36:18PM -0400, Vivek Goyal wrote:
> I have got very little understanding of file system layer, but if I had
> to guess, i think following might have happened.
>
> - We switched from WRITE to WRITE_SYNC for fsync() path.

Yes. WRITE_SYNC_PLUG to be exact. Note that we don't juse do this
for fsync but also for O_SYNC writes which use ->fsync, and also sync(2)
and the unmount path, which all end up submitting WB_SYNC_ALL writeback
requests.

> - This might have caused issues with idling as for SYNC_WRITE we will idle
> in CFQ but probably it is not desirable in certain cases where next set
> of WRITES is going to come from journaling thread.

I'm still a bit confused about what the idling logic actually does. Is
it some sort of additional plugging where we wait for more I/O to
accumulate?

> - That might have prompted us to introduce the rq_noidle() to make sure
> we don't idle in WRITE_SYNC path but direct IO path was avoided to make
> sure good throughput is maintained. But this left one question open
> and that is it good to disable idling on all WRITE_SYNC path in kernel.

I still fail to see why we should make any difference in the I/O
scheduler for O_DIRECT vs O_SYNC/fsync workloads. In both cases the
caller blocks waiting for the I/O completion.

> - Slowly cfq code emerged and as it stands today, to me rq_noidle() is
> practically of not much use. For sync-idle tree (where idling is
> enabled), we are ignoring the rq_noidle() and always arming the timer.
> For sync-noidle, we choose not to idle based on if there was some other
> thread who did even a single IO with rq_noidle=0.
>
> I think in practice, there is on thread of other which is doing some
> read or write with rq_noidle=0 and if that's the case, we will end up
> idling on sync-noidle tree also and rq_noidle() practically does
> not take effect.
>
> So if rq_noidle() was introduced to solve the issue of not idling on
> fsync() path (as jbd thread will send more data now), then probably slice
> yielding patch of jeff might come handy here and and we can get rid of
> rq_noidle() logic. This is just a guess work and I might be completely
> wrong here...

Getting rid of the noidle logic and more bio flag that us filesystem
developers have real trouble understanding would be a good thing.

After that we're down to three bio modifiers for filesystem use, of
which at least two are very easy to grasp:

- REQ_SYNC - treat a request as synchronous, implicitly enabled for
reads anyway
- REQ_UNPLUG - explicitly unplug the queue after I/O submission

and

- REQ_META - which we're currenly trying to define in detail

REQ_NOIDLE currenly really is a lot of deep magic.

2010-06-23 10:03:11

[permalink] [raw]

Subject: Re: trying to understand READ_META, READ_SYNC, WRITE_SYNC & co

On Mon, Jun 21, 2010 at 04:25:04PM -0400, Vivek Goyal wrote:
> In my testing in the past, this was helping if lots of sequential readers
> are running in a system (with 100ms slice each) and if there is another
> reader doing small file reads. Without meta data based preemption check,
> latency of opening small file used to be very high (Because all the sync
> sequntial reader gets to consume their 100ms slices first).
>
> Situation should be somewhat better now after corrado's changes of
> reducing slice length dynamically. But I suspect that we will still
> experience significantly reduced latency for small file reads in presece
> of multiple sequntial reads going on.

Any chance we could create a benchmark suite for the I/O schedulers?
I'd really like to get these annotations correct, and having an
automated suite that shows the numbers for the interesting workloads
would help greatly with that.

2010-06-24 01:44:31

[permalink] [raw]

Subject: Re: trying to understand READ_META, READ_SYNC, WRITE_SYNC & co

On Wed, Jun 23, 2010 at 12:01:38PM +0200, Christoph Hellwig wrote:
> On Mon, Jun 21, 2010 at 05:36:18PM -0400, Vivek Goyal wrote:
> > I have got very little understanding of file system layer, but if I had
> > to guess, i think following might have happened.
> >
> > - We switched from WRITE to WRITE_SYNC for fsync() path.
>
> Yes. WRITE_SYNC_PLUG to be exact. Note that we don't juse do this
> for fsync but also for O_SYNC writes which use ->fsync, and also sync(2)
> and the unmount path, which all end up submitting WB_SYNC_ALL writeback
> requests.
>
> > - This might have caused issues with idling as for SYNC_WRITE we will idle
> > in CFQ but probably it is not desirable in certain cases where next set
> > of WRITES is going to come from journaling thread.
>
> I'm still a bit confused about what the idling logic actually does. Is
> it some sort of additional plugging where we wait for more I/O to
> accumulate?

Let me explain the general idling logic and then see if it makes sense in case
of WRITE_SYNC.

Once a request has completed, if the cfq queue is empty, we have two choices.
Either expire the cfq queue and move on to dispatch requests from a
different queue or we idle on the queue hoping we will get more IO from
same process/queue. Idling can help (on SATA disks with high seek cost), if
our guess was right and soon we got another request from same process. We
cut down on number of seeks hence increased throghput.

Apart from cutting down number of seeks, idling also helps in getting a
process queue getting its fair share of IO done. Otherwise once you expire the
queue, you lose the slice and you might not get a chance to dispatch more IO
till other queues in the CFQ have used their slice.

Above, idling is done only for processes which are doing sequential IO. If
random IO is happening from process then there is not much point in idling
because disk will anyway perform seek.

But we if don't idle at all for random IO queues then they can experience
large latencies when heavy buffered writting is going on. So corrado came
up with a concept of grouping all the random IO queues and idle on the
whole group and not on individual queues.

Now the big question, should we really idle on WRITE_SYNC IO or not. Looks
like we seem to have mixed use cases. Some cases where we will get the
next IO on the same queue and in some cases we will not and there is
really no sure shot way to differentiate between two.

So looks like fsync path will do bunch of IO and then will wait for jbd thread
to finish the work. In this case idling is waste of time. I guess same will
be true for umount and sync() path. But same probably is not necessarily true
for a O_DIRECT writer (database comes to mind), and for O_SYNC writer
(virtual machines?).

If we always choose not to idle on WRITE_SYNC, then direct writers and
O_SYNC writers will get little disk share in presence of heavy buffered
WRITES. If we choose to not special case WRITE_SYNC and continue to
idle on the queue then we probably are wasting time and reducing overall
throughput. (The fsync() case Jeff is running into).

I think that's the reason we introdue rq_noidle() in an attempt to tell
IO scheduler in what cases to expect more IO and in what cases not to. In
case of direct writes, there is no way to tell, hence that path was
separated and rq_noidle=0. Rest of the WRITE_SYNC was using rq_noidle=1.
But as you said there is probably no differnce between direct writes and
O_SYNC path and both the paths should get same treatment in IO scheduler
when it comes to idling.

So IO scheduler needs a clear direction from higher layers in terms of when
not to idle after a sync write and save time. One way would be to continue
to use rq_noidle() and just clean up upper layer paths to make sure we
set this flag only when we know we are not going to dispatch more sync
writes in the same context. Other way could be to get rid of rq_noidle()
logic and upper layers explicitly call IO scheduler to yield the slice
when we know there will be no more IO from the same process/context.
Jeff's blk_yield() patches handle one such case (fsync). May be we need
to do something similar for other paths also, if they exist (umount).

>
> > - That might have prompted us to introduce the rq_noidle() to make sure
> > we don't idle in WRITE_SYNC path but direct IO path was avoided to make
> > sure good throughput is maintained. But this left one question open
> > and that is it good to disable idling on all WRITE_SYNC path in kernel.
>
> I still fail to see why we should make any difference in the I/O
> scheduler for O_DIRECT vs O_SYNC/fsync workloads. In both cases the
> caller blocks waiting for the I/O completion.

I also don't know. I think in both the cases kernel does not know if user
space immediately is going to do more IO or not in the same context. So they
should fall in same category when it comes to IO scheduler treatment.

So one possible way could be that don't try to special case synchronous
writes and continue to idle on the queue based on other parameters. If
kernel/higher layers have knowledge that we are not going to issue more
IO in same context, then they should explicitly call blk_yield(), to
stop idling and give up slice.

>
> > - Slowly cfq code emerged and as it stands today, to me rq_noidle() is
> > practically of not much use. For sync-idle tree (where idling is
> > enabled), we are ignoring the rq_noidle() and always arming the timer.
> > For sync-noidle, we choose not to idle based on if there was some other
> > thread who did even a single IO with rq_noidle=0.
> >
> > I think in practice, there is on thread of other which is doing some
> > read or write with rq_noidle=0 and if that's the case, we will end up
> > idling on sync-noidle tree also and rq_noidle() practically does
> > not take effect.
> >
> > So if rq_noidle() was introduced to solve the issue of not idling on
> > fsync() path (as jbd thread will send more data now), then probably slice
> > yielding patch of jeff might come handy here and and we can get rid of
> > rq_noidle() logic. This is just a guess work and I might be completely
> > wrong here...
>
> Getting rid of the noidle logic and more bio flag that us filesystem
> developers have real trouble understanding would be a good thing.
>
> After that we're down to three bio modifiers for filesystem use, of
> which at least two are very easy to grasp:
>
> - REQ_SYNC - treat a request as synchronous, implicitly enabled for
> reads anyway
> - REQ_UNPLUG - explicitly unplug the queue after I/O submission
>
> and
>
> - REQ_META - which we're currenly trying to define in detail
>
>
> REQ_NOIDLE currenly really is a lot of deep magic.

I agree. rq_noidle() logic is confusing.

Thanks
Vivek

2010-06-25 11:03:38

[permalink] [raw]

Subject: Re: trying to understand READ_META, READ_SYNC, WRITE_SYNC & co

On Wed, Jun 23, 2010 at 09:44:20PM -0400, Vivek Goyal wrote:
> Let me explain the general idling logic and then see if it makes sense in case
> of WRITE_SYNC.
>
> Once a request has completed, if the cfq queue is empty, we have two choices.
> Either expire the cfq queue and move on to dispatch requests from a
> different queue or we idle on the queue hoping we will get more IO from
> same process/queue.

queues are basically processes in this context?

> Idling can help (on SATA disks with high seek cost), if
> our guess was right and soon we got another request from same process. We
> cut down on number of seeks hence increased throghput.

I don't really understand the logic behind this. If we lots of I/O
that actually is close to each other we should generally submit it in
one batch. That is true for pagecache writeback, that is true for
metadata (at least in XFS..), and it's true for any sane application
doing O_DIRECT / O_SYNC style I/O.

What workloads produde I/O that is local (not random) writes with small
delays between the I/O requests?

I see the point of this logic for reads where various workloads have
dependent reads that might be close to each other, but I don't really
see any point for writes.

> So looks like fsync path will do bunch of IO and then will wait for jbd thread
> to finish the work. In this case idling is waste of time.

Given that ->writepage already does WRITE_SYNC_PLUG I/O which includes
REQ_NODILE I'm still confused why we still have that issue.

> I guess same will
> be true for umount and sync() path. But same probably is not necessarily true
> for a O_DIRECT writer (database comes to mind), and for O_SYNC writer
> (virtual machines?).

For virtual machines idling seems like a waste of ressources. If we
have sequential I/O we dispatch in batches - in fact qemu even merges
sequential small block I/O it gets from the guest into one large request
we hand off to the host kernel. For reads the same caveat as above
applies as read requests as handed through 1:1 from the guest.

> O_SYNC writers will get little disk share in presence of heavy buffered
> WRITES. If we choose to not special case WRITE_SYNC and continue to
> idle on the queue then we probably are wasting time and reducing overall
> throughput. (The fsync() case Jeff is running into).

Remember that O_SYNC writes are implemented as normal buffered write +
fsync (a range fsync to be exact, but that doesn't change a thing).

And that's what they conceptually are anyway, so treating a normal
buffered write + fsync different from an O_SYNC write is not only wrong
conceptuall but also in implementation. You have the exact same issue
of handing off work to the journal commit thread in extN. Note that
the log write (or at least parts of it) will always use WRITE_BARRIER,
which completey bypasses the I/O scheduler.

> So one possible way could be that don't try to special case synchronous
> writes and continue to idle on the queue based on other parameters. If
> kernel/higher layers have knowledge that we are not going to issue more
> IO in same context, then they should explicitly call blk_yield(), to
> stop idling and give up slice.

We have no way to know what userspace will do if we are doing
O_SYNC/O_DIRECT style I/O or use fsync. We know that we will most
likely continue kicking things from the same queue when doing page
writeback. One thing that should help with this is Jens' explicit
per-process plugging stuff, which I noticed he recently updated to a
current kernel.

2010-06-26 03:35:39

[permalink] [raw]

Subject: Re: trying to understand READ_META, READ_SYNC, WRITE_SYNC & co

On Fri, Jun 25, 2010 at 01:03:20PM +0200, Christoph Hellwig wrote:
> On Wed, Jun 23, 2010 at 09:44:20PM -0400, Vivek Goyal wrote:
> > Let me explain the general idling logic and then see if it makes sense in case
> > of WRITE_SYNC.
> >
> > Once a request has completed, if the cfq queue is empty, we have two choices.
> > Either expire the cfq queue and move on to dispatch requests from a
> > different queue or we idle on the queue hoping we will get more IO from
> > same process/queue.
>
> queues are basically processes in this context?

Yes. Basically IO context. Generally one per process but multiple threads
can also share the IO context and requests from all the threads will go
into single cfq queue.

>
> > Idling can help (on SATA disks with high seek cost), if
> > our guess was right and soon we got another request from same process. We
> > cut down on number of seeks hence increased throghput.
>
> I don't really understand the logic behind this. If we lots of I/O
> that actually is close to each other we should generally submit it in
> one batch. That is true for pagecache writeback, that is true for
> metadata (at least in XFS..), and it's true for any sane application
> doing O_DIRECT / O_SYNC style I/O.
>
> What workloads produde I/O that is local (not random) writes with small
> delays between the I/O requests?
>

That's a good question. I don't have an answer for that. I have most of the
time done my testing using fio where one can create synthetic workload of
doing O_DIRECT/O_SYNC sequential/local writes after small delays.

I think I need to run blktrace on some of the real life workloads and monitor
for WRITE_SYNC pattern.

> I see the point of this logic for reads where various workloads have
> dependent reads that might be close to each other, but I don't really
> see any point for writes.
>
> > So looks like fsync path will do bunch of IO and then will wait for jbd thread
> > to finish the work. In this case idling is waste of time.
>
> Given that ->writepage already does WRITE_SYNC_PLUG I/O which includes
> REQ_NODILE I'm still confused why we still have that issue.

In current form, cfq honors REQ_NOIDLE conditionally and that's why we
still have the issue. If you look at cfq_completed_request(), we continue
to idle in following two cases.

- If we classifed the queue as SYNC_WORKLOAD.
- If there is another random read/write happening on sync-noidle service
tree.

SYNC_WORKLOAD means that cfq thinks this particular queue is doing sequential
IO. For random IO queues, we don't idle on each individual queue but a
group of queue.

In jeff's testing, fsync thread/queue sometimes is viewed as sequential
workload and goes on SYNC_WORKLOAD tree. In that case even if request is
REQ_NOIDLE, we will continue to idle hence fsync issue.

This logic was introduced by corrado to ensure WRITE_SYNC does not
lose fair share. Now we are back to the same question, what is the workload
which does that.

8e55063 cfq-iosched: fix corner cases in idling logic

Before this patch, we will simply not do any idling on WRITE_SYNC. That
means no idling on O_SYNC/fsync paths but still idle on direct IO
WRITE_SYNC. Which is a bit of discrepancy.

>
> > I guess same will
> > be true for umount and sync() path. But same probably is not necessarily true
> > for a O_DIRECT writer (database comes to mind), and for O_SYNC writer
> > (virtual machines?).
>
> For virtual machines idling seems like a waste of ressources. If we
> have sequential I/O we dispatch in batches - in fact qemu even merges
> sequential small block I/O it gets from the guest into one large request
> we hand off to the host kernel. For reads the same caveat as above
> applies as read requests as handed through 1:1 from the guest.
>

Ok, That's good to know.

> > O_SYNC writers will get little disk share in presence of heavy buffered
> > WRITES. If we choose to not special case WRITE_SYNC and continue to
> > idle on the queue then we probably are wasting time and reducing overall
> > throughput. (The fsync() case Jeff is running into).
>
> Remember that O_SYNC writes are implemented as normal buffered write +
> fsync (a range fsync to be exact, but that doesn't change a thing).
>
> And that's what they conceptually are anyway, so treating a normal
> buffered write + fsync different from an O_SYNC write is not only wrong
> conceptuall but also in implementation. You have the exact same issue
> of handing off work to the journal commit thread in extN.

Actually right now I think O_DIRECT is getting different treatment as
it does not mark queue IO as RQ_NOIDLE. As you said O_SYNC is just fsync
on a specific range, so both the paths will mark RQ_NOIDLE and get treated
same in IO scheduler.

Intially Jens treated O_DIRECT differently and then Jeff put a patch which
accidently got rid of that difference and finally I had reintroduced the
O_DIRECT flag.

d9449ce Fix regression in direct writes performance due to WRITE_ODIRECT flag

I have explained the effect of not idling on DIRECT writes in changelog. It
gets very little BW in presence of sequential reads.

With idling on WRITE_SYNC, distribution looks as follows.

direct writes: aggrb=2,968KB/s
readers : aggrb=101MB/s

Without idling on WRITE_SYNC, distribution of BW looks as follows.

direct write: aggrb=19KB/s
readers aggrb=137MB/s

So notice how sequential reads can starve a DIRECT sequential writer. Now
this is a synthetic workload created with the help of fio where sequential
writes are submitted with delay in between (i think so). We are back to
the question that the application should coalesce sequential writes and
submit bigger write requests. Otherwise you will lose share and get
starved out.

So going forward, I think there are few possible options.

- Stop idling on all the WRITE_SYNC IO. There is no reasonable way to
tell whether there will be more IO or not from applicatoin. This will
impact direct writes, O_SYNC writes and fsync().

If direct IO application is submitting writes with a delay in between
it can be starved out in presnce of competing workloads.

In this scheme we get rid of all rq_noidle logic, get rid of separate
flag for direct IO, WRITE_ODIRECT_PLUG, and also don't need any queue
yielding patches as we never idle on SYNC_WRITES.

The important thing to note here is that even direct random writes
will lose share here if we don't keep the queue busy continuously
and issue random writes after a delay. (database?).

- Do idling by default on WRITE_SYNC path. Implement Jeff's queue yielding
patches. That way O_SYNC/fsyn path will call blk_yield() and stop idling.
Direct IO write path will stil continue to idle (I think if file has already
been laid out?).

I have no idea in real world how SYNC_WRITE workloads look like and how
their traffic pattern is. So it is hard to say which scheme to go for.
Jens, what do you think of above?

> Note that
> the log write (or at least parts of it) will always use WRITE_BARRIER,
> which completey bypasses the I/O scheduler.
>
> > So one possible way could be that don't try to special case synchronous
> > writes and continue to idle on the queue based on other parameters. If
> > kernel/higher layers have knowledge that we are not going to issue more
> > IO in same context, then they should explicitly call blk_yield(), to
> > stop idling and give up slice.
>
> We have no way to know what userspace will do if we are doing
> O_SYNC/O_DIRECT style I/O or use fsync. We know that we will most
> likely continue kicking things from the same queue when doing page
> writeback. One thing that should help with this is Jens' explicit
> per-process plugging stuff, which I noticed he recently updated to a
> current kernel.

Ok. I will have a look at that.

Thanks
Vivek

2010-06-26 09:26:12

by Nick Piggin

[permalink] [raw]

Subject: Re: trying to understand READ_META, READ_SYNC, WRITE_SYNC & co

2010-06-26 09:28:16

[permalink] [raw]

Subject: Re: trying to understand READ_META, READ_SYNC, WRITE_SYNC & co

On Sat, Jun 26, 2010 at 07:25:56PM +1000, Nick Piggin wrote:
> Biggest thing is multiple small files operations like on the same
> directory. Best case I measured back when doing AS io scheduler
> versus deadline was about 100x improvement on a uncached kernel
> grep workload when competing with a streaming writeout (the writeout
> probably ended up going somewhat slower naturally, but it is fairer).

As I mentioned below I absolutely see the case for reads. A normal
grep basically is a dependent read kind of workload. For for writes
it should either be O_SYNC-style workloads or batched I/O.

2010-06-26 10:05:43

[permalink] [raw]

Subject: Re: trying to understand READ_META, READ_SYNC, WRITE_SYNC & co

On Fri, Jun 25, 2010 at 11:35:10PM -0400, Vivek Goyal wrote:
> This logic was introduced by corrado to ensure WRITE_SYNC does not
> lose fair share. Now we are back to the same question, what is the workload
> which does that.
>
> 8e55063 cfq-iosched: fix corner cases in idling logic
>
> Before this patch, we will simply not do any idling on WRITE_SYNC. That
> means no idling on O_SYNC/fsync paths but still idle on direct IO
> WRITE_SYNC. Which is a bit of discrepancy.

Corrado, can you explain what workloads you doing that commit for?
The commit message doesn't really given any useful information.

> - Stop idling on all the WRITE_SYNC IO. There is no reasonable way to
> tell whether there will be more IO or not from applicatoin. This will
> impact direct writes, O_SYNC writes and fsync().
>
> If direct IO application is submitting writes with a delay in between
> it can be starved out in presnce of competing workloads.

So what application does this?

> - Do idling by default on WRITE_SYNC path. Implement Jeff's queue yielding
> patches. That way O_SYNC/fsyn path will call blk_yield() and stop idling.
> Direct IO write path will stil continue to idle (I think if file has already
> been laid out?).

I don't think this makes much sense. And O_SYNC write / fsync are
defined to not return before the I/O makes it to disk, and for any
sane filesystem end with a WRITE_BARRIER request because of that.
So we end these with an explicit unplug anyway, no need for idling
logic.

2010-06-26 10:10:51

by Nick Piggin

[permalink] [raw]

Subject: Re: trying to understand READ_META, READ_SYNC, WRITE_SYNC & co

On Sat, Jun 26, 2010 at 11:27:59AM +0200, Christoph Hellwig wrote:
> On Sat, Jun 26, 2010 at 07:25:56PM +1000, Nick Piggin wrote:
> > Biggest thing is multiple small files operations like on the same
> > directory. Best case I measured back when doing AS io scheduler
> > versus deadline was about 100x improvement on a uncached kernel
> > grep workload when competing with a streaming writeout (the writeout
> > probably ended up going somewhat slower naturally, but it is fairer).
>
> As I mentioned below I absolutely see the case for reads. A normal
> grep basically is a dependent read kind of workload. For for writes
> it should either be O_SYNC-style workloads or batched I/O.

Sorry I missed that. OK well that may be true. One would hope
it was benchmarked before being merged.

But I'm sure apps can submit fsyncs much faster than once per
few ms, like small database transactions.

2010-06-26 10:16:33

[permalink] [raw]

Subject: Re: trying to understand READ_META, READ_SYNC, WRITE_SYNC & co

On Sat, Jun 26, 2010 at 08:10:45PM +1000, Nick Piggin wrote:
> But I'm sure apps can submit fsyncs much faster than once per
> few ms, like small database transactions.

fsync / O_SYNC should be irrelevant for the idling logic. One those
retourn to the user data must have made it to the disk, and with our
barrier implementation that implies fully draining any outstanding I/O
on the device.

2010-06-26 11:20:57

[permalink] [raw]

Subject: Re: trying to understand READ_META, READ_SYNC, WRITE_SYNC & co

On Sat, Jun 26 2010, Christoph Hellwig wrote:
> > - Stop idling on all the WRITE_SYNC IO. There is no reasonable way to
> > tell whether there will be more IO or not from applicatoin. This will
> > impact direct writes, O_SYNC writes and fsync().
> >
> > If direct IO application is submitting writes with a delay in between
> > it can be starved out in presnce of competing workloads.
>
> So what application does this?

It isn't about apps having a small delay between submissions,
not sure where Vivek gets that from. Even if you submit back
to back, there's still very small time where the io scheduler
will switch to something else if there's other io pending. This
happens instantly when the sync request finishes - if we don't
idle for any given request, then we of course go to service
someone else with pending io.

The whole idling/anticipation is all about knowing when to
stall the queue very briefly for sync io, allowing that single
sync stream to make good progress for a while before switching
to something else. This switching back and forth potentially
destroyes throughput for the O_DIRECT writer, especially for
disks with write through caching.

Christoph, you seem not to agree on the concept of idling. As
Nick writes in another reply, the difference in performance
when you get it right is staggering. We don't idle the disk
for kicks and laughs. I wont rule out bugs, both in the
handling and the signalling of idling. But as a concept, it
is definitely sound and proven.

--
Jens Axboe

2010-06-26 11:56:51

[permalink] [raw]

Subject: Re: trying to understand READ_META, READ_SYNC, WRITE_SYNC & co

On Sat, Jun 26, 2010 at 01:20:55PM +0200, Jens Axboe wrote:
> The whole idling/anticipation is all about knowing when to
> stall the queue very briefly for sync io, allowing that single
> sync stream to make good progress for a while before switching
> to something else. This switching back and forth potentially
> destroyes throughput for the O_DIRECT writer, especially for
> disks with write through caching.
>
> Christoph, you seem not to agree on the concept of idling.

I'm still trying to understand the use case. Basically CFQ gets
worse results for any workload I'm looking at, be that from the
filesystem developer point of view, or virtualization point of
view. And it tends to get worse the more intelligence is added
to CFQ.

If we could actually narrow down what it's supposed to help
with into useful benchmarks that can be trivially reproduced a
lot of things would be easier.

2010-06-27 15:44:38

by Jeff Moyer

[permalink] [raw]

Subject: Re: trying to understand READ_META, READ_SYNC, WRITE_SYNC & co

Vivek Goyal <[email protected]> writes:

> On Fri, Jun 25, 2010 at 01:03:20PM +0200, Christoph Hellwig wrote:
>> On Wed, Jun 23, 2010 at 09:44:20PM -0400, Vivek Goyal wrote:
>> I see the point of this logic for reads where various workloads have
>> dependent reads that might be close to each other, but I don't really
>> see any point for writes.
>>
>> > So looks like fsync path will do bunch of IO and then will wait for jbd thread
>> > to finish the work. In this case idling is waste of time.
>>
>> Given that ->writepage already does WRITE_SYNC_PLUG I/O which includes
>> REQ_NODILE I'm still confused why we still have that issue.
>
> In current form, cfq honors REQ_NOIDLE conditionally and that's why we
> still have the issue. If you look at cfq_completed_request(), we continue
> to idle in following two cases.
>
> - If we classifed the queue as SYNC_WORKLOAD.
> - If there is another random read/write happening on sync-noidle service
> tree.
>
> SYNC_WORKLOAD means that cfq thinks this particular queue is doing sequential
> IO. For random IO queues, we don't idle on each individual queue but a
> group of queue.
>
> In jeff's testing, fsync thread/queue sometimes is viewed as sequential
> workload and goes on SYNC_WORKLOAD tree. In that case even if request is
> REQ_NOIDLE, we will continue to idle hence fsync issue.

I'm now testing OCFS2, and I'm seeing performance that is not great
(even with the blk_yield patches applied). What happens is that we
successfully yield the queue to the journal thread, but then idle on the
journal thread (even though RQ_NOIDLE was set).

So, can we just get rid of idling when RQ_NOIDLE is set?

Vivek sent me this patch to test, and it got rid of the performance
issue for the fsync workload. Can we discuss its merits?

Thanks,
Jeff

Index: linux-2.6/block/cfq-iosched.c
===================================================================
--- linux-2.6.orig/block/cfq-iosched.c 2010-06-25 15:57:33.832125786 -0400
+++ linux-2.6/block/cfq-iosched.c 2010-06-25 15:59:19.788876361 -0400
@@ -318,6 +318,7 @@
CFQ_CFQQ_FLAG_split_coop, /* shared cfqq will be splitted */
CFQ_CFQQ_FLAG_deep, /* sync cfqq experienced large depth */
CFQ_CFQQ_FLAG_wait_busy, /* Waiting for next request */
+ CFQ_CFQQ_FLAG_group_idle, /* This queue is doing group idle */
};

#define CFQ_CFQQ_FNS(name) \
@@ -347,6 +348,7 @@
CFQ_CFQQ_FNS(split_coop);
CFQ_CFQQ_FNS(deep);
CFQ_CFQQ_FNS(wait_busy);
+CFQ_CFQQ_FNS(group_idle);
#undef CFQ_CFQQ_FNS

#ifdef CONFIG_CFQ_GROUP_IOSCHED
@@ -1613,6 +1615,7 @@

cfq_clear_cfqq_wait_request(cfqq);
cfq_clear_cfqq_wait_busy(cfqq);
+ cfq_clear_cfqq_group_idle(cfqq);

/*
* If this cfqq is shared between multiple processes, check to
@@ -3176,6 +3179,13 @@
if (cfq_class_rt(new_cfqq) && !cfq_class_rt(cfqq))
return true;

+ /*
+ * If were doing group_idle and we got new request in same group,
+ * preempt the queue
+ */
+ if (cfq_cfqq_group_idle(cfqq))
+ return true;
+
if (!cfqd->active_cic || !cfq_cfqq_wait_request(cfqq))
return false;

@@ -3271,6 +3281,7 @@
struct cfq_queue *cfqq = RQ_CFQQ(rq);

cfq_log_cfqq(cfqd, cfqq, "insert_request");
+ cfq_clear_cfqq_group_idle(cfqq);
cfq_init_prio_data(cfqq, RQ_CIC(rq)->ioc);

rq_set_fifo_time(rq, jiffies + cfqd->cfq_fifo_expire[rq_is_sync(rq)]);
@@ -3416,10 +3427,12 @@
* SYNC_NOIDLE_WORKLOAD idles at the end of the tree
* only if we processed at least one !rq_noidle request
*/
- if (cfqd->serving_type == SYNC_WORKLOAD
- || cfqd->noidle_tree_requires_idle
- || cfqq->cfqg->nr_cfqq == 1)
+ if (cfqd->noidle_tree_requires_idle)
+ cfq_arm_slice_timer(cfqd);
+ else if (cfqq->cfqg->nr_cfqq == 1) {
+ cfq_mark_cfqq_group_idle(cfqq);
cfq_arm_slice_timer(cfqd);
+ }
}
}

2010-06-29 09:07:20

by Corrado Zoccolo

[permalink] [raw]

Subject: Re: trying to understand READ_META, READ_SYNC, WRITE_SYNC & co

On Sun, Jun 27, 2010 at 5:44 PM, Jeff Moyer <[email protected]> wrote:
> Vivek Goyal <[email protected]> writes:
>
>> On Fri, Jun 25, 2010 at 01:03:20PM +0200, Christoph Hellwig wrote:
>>> On Wed, Jun 23, 2010 at 09:44:20PM -0400, Vivek Goyal wrote:
>>> I see the point of this logic for reads where various workloads have
>>> dependent reads that might be close to each other, but I don't really
>>> see any point for writes.
>>>
>>> > So looks like fsync path will do bunch of IO and then will wait for jbd thread
>>> > to finish the work. In this case idling is waste of time.
>>>
>>> Given that ->writepage already does WRITE_SYNC_PLUG I/O which includes
>>> REQ_NODILE I'm still confused why we still have that issue.
>>
>> In current form, cfq honors REQ_NOIDLE conditionally and that's why we
>> still have the issue. If you look at cfq_completed_request(), we continue
>> to idle in following two cases.
>>
>> - If we classifed the queue as SYNC_WORKLOAD.
>> - If there is another random read/write happening on sync-noidle service
>> tree.
>>
>> SYNC_WORKLOAD means that cfq thinks this particular queue is doing sequential
>> IO. For random IO queues, we don't idle on each individual queue but a
>> group of queue.
>>
>> In jeff's testing, fsync thread/queue sometimes is viewed as sequential
>> workload and goes on SYNC_WORKLOAD tree. In that case even if request is
>> REQ_NOIDLE, we will continue to idle hence fsync issue.
>
> I'm now testing OCFS2, and I'm seeing performance that is not great
> (even with the blk_yield patches applied). What happens is that we
> successfully yield the queue to the journal thread, but then idle on the
> journal thread (even though RQ_NOIDLE was set).
>
> So, can we just get rid of idling when RQ_NOIDLE is set?
Hi Jeff,
I think I spotted a problem with the initial implementation of the
tree-wide idle when RQ_NOIDLE is set: I assumed that a queue would
either send possibly-idling requests or no-idle requests, but it seems
that RQ_NOIDLE is being used to mark the end of a stream of
possibly-idling requests (in my initial implementation, this will then
cause an unintended idle). The attached patch should fix it, and I
think the logic is simpler than Vivek's. Can you give it a spin?
Otherwise, I think that reverting the "noidle_tree_requires_idle"
behaviour completely may be better than adding complexity, since it is
really trying to solve corner cases (that maybe happen only on
synthetic workloads), but affecting negatively more common cases.

About what it is trying to solve, since I think it was not clear:
- we have a workload of 2 queues, both issuing requests that are being
put in the no-idle tree (e.g. they are random) + 1 queue issuing
idling requests (e.g. sequential).
- if one of the 2 "random" queues marks its requests as RQ_NOIDLE,
then the timeslice for the no-idle tree is not preserved, causing
unfairness, as soon as an RQ_NOIDLE request is serviced and the tree
is empty.

Thanks
Corrado

>
> Vivek sent me this patch to test, and it got rid of the performance
> issue for the fsync workload. Can we discuss its merits?
>
> Thanks,
> Jeff
>
> Index: linux-2.6/block/cfq-iosched.c
> ===================================================================
> --- linux-2.6.orig/block/cfq-iosched.c 2010-06-25 15:57:33.832125786 -0400
> +++ linux-2.6/block/cfq-iosched.c 2010-06-25 15:59:19.788876361 -0400
> @@ -318,6 +318,7 @@
> CFQ_CFQQ_FLAG_split_coop, /* shared cfqq will be splitted */
> CFQ_CFQQ_FLAG_deep, /* sync cfqq experienced large depth */
> CFQ_CFQQ_FLAG_wait_busy, /* Waiting for next request */
> + CFQ_CFQQ_FLAG_group_idle, /* This queue is doing group idle */
> };
>
> #define CFQ_CFQQ_FNS(name) \
> @@ -347,6 +348,7 @@
> CFQ_CFQQ_FNS(split_coop);
> CFQ_CFQQ_FNS(deep);
> CFQ_CFQQ_FNS(wait_busy);
> +CFQ_CFQQ_FNS(group_idle);
> #undef CFQ_CFQQ_FNS
>
> #ifdef CONFIG_CFQ_GROUP_IOSCHED
> @@ -1613,6 +1615,7 @@
>
> cfq_clear_cfqq_wait_request(cfqq);
> cfq_clear_cfqq_wait_busy(cfqq);
> + cfq_clear_cfqq_group_idle(cfqq);
>
> /*
> * If this cfqq is shared between multiple processes, check to
> @@ -3176,6 +3179,13 @@
> if (cfq_class_rt(new_cfqq) && !cfq_class_rt(cfqq))
> return true;
>
> + /*
> + * If were doing group_idle and we got new request in same group,
> + * preempt the queue
> + */
> + if (cfq_cfqq_group_idle(cfqq))
> + return true;
> +
> if (!cfqd->active_cic || !cfq_cfqq_wait_request(cfqq))
> return false;
>
> @@ -3271,6 +3281,7 @@
> struct cfq_queue *cfqq = RQ_CFQQ(rq);
>
> cfq_log_cfqq(cfqd, cfqq, "insert_request");
> + cfq_clear_cfqq_group_idle(cfqq);
> cfq_init_prio_data(cfqq, RQ_CIC(rq)->ioc);
>
> rq_set_fifo_time(rq, jiffies + cfqd->cfq_fifo_expire[rq_is_sync(rq)]);
> @@ -3416,10 +3427,12 @@
> * SYNC_NOIDLE_WORKLOAD idles at the end of the tree
> * only if we processed at least one !rq_noidle request
> */
> - if (cfqd->serving_type == SYNC_WORKLOAD
> - || cfqd->noidle_tree_requires_idle
> - || cfqq->cfqg->nr_cfqq == 1)
> + if (cfqd->noidle_tree_requires_idle)
> + cfq_arm_slice_timer(cfqd);
> + else if (cfqq->cfqg->nr_cfqq == 1) {
> + cfq_mark_cfqq_group_idle(cfqq);
> cfq_arm_slice_timer(cfqd);
> + }
> }
> }
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

--
__________________________________________________________________________

dott. Corrado Zoccolo mailto:[email protected]
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
Tales of Power - C. Castaneda

Hi

Attachments:

0001-cfq-iosched-fix-tree-wide-handling-of-rq_noidle.patch (2.20 kB)

2010-06-29 12:30:44