2004-03-19 15:35:59

by Jens Axboe

[permalink] [raw]
Subject: [PATCH] barrier patch set

Hi,

A first release of a collected barrier patchset for 2.6.5-rc1-mm2. I
have a few changes planned to support dm/md + sata, I'll do those
changes over the weekend.

Reiser has the best barrier support, ext3 works but only if things don't
go wrong. So only attempt to use the barrier feature on ext3 if on ide
drives, not SCSI nor SATA.

Also note that for reiser you need to add:

-o barrier=flush

while ext3 currently wants:

-o barrier=1

Cosmetic stuff that will get ironed out. You can find the patches here:

ftp://ftp.kernel.org/pub/linux/kernel/people/axboe/patches/v2.6/2.6.5-rc1-mm2/

ide-barrier-2.6.5-rc1-mm2-1
ide/core part

ext3-barrier-2.6.5-rc1-mm2-1
ext3 part

reiserfs-current-2.6.5-rc1-mm2-1
current reiser tree, get it here in parts:

ftp.suse.com/pub/people/mason/patches/data-logging/experimental/2.6.4

(use series.mm for apply order)

reiserfs-barrier-2.6.5-rc1-mm2-1
reiser part.

or just apply

all-barrier-2.6.5-rc1-mm2-1
all rolled up into one patch.

--
Jens Axboe


2004-03-19 16:25:16

by Mika Penttilä

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set



Jens Axboe wrote:

>Hi,
>
>A first release of a collected barrier patchset for 2.6.5-rc1-mm2. I
>have a few changes planned to support dm/md + sata, I'll do those
>changes over the weekend.
>
>Reiser has the best barrier support, ext3 works but only if things don't
>go wrong. So only attempt to use the barrier feature on ext3 if on ide
>drives, not SCSI nor SATA.
>
>
>
What are these brutal pieces...?


+static int ide_transform_pc_req(ide_drive_t *drive, struct request *rq)
+{
+ if (rq->cmd[0] != 0x35) {
+ ide_end_request(drive, 0, 0);
+ return 1;
+ }
+
+ if (!drive->wcache) {
+ ide_end_request(drive, 1, 0);
+ return 1;
+ }
+
+ ide_fill_flush_cmd(drive, rq);
+ return 0;
+}


/*
+ * basic transformation support for scsi -> ata commands
+ */
+ if (blk_pc_request(rq)) {
+ if (drive->media != ide_disk)
+ goto kill_rq;
+ if (ide_transform_pc_req(drive, rq))
+ return ide_stopped;
+ }
+



--Mika


2004-03-19 16:35:09

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

Jens Axboe wrote:
> Cosmetic stuff that will get ironed out. You can find the patches here:
>
> ftp://ftp.kernel.org/pub/linux/kernel/people/axboe/patches/v2.6/2.6.5-rc1-mm2/
>
> ide-barrier-2.6.5-rc1-mm2-1
> ide/core part


WRT ATA and flush-cache... before the IDE pieces of this patch are
merged, IMO it is a requirement that the entire flush-cache stuff gets a
review. ide_get_error_location() is one of the important pieces
(great!). Another important piece is to make sure that a drive's
flush-cache capability is correctly deduced and set up from the
identify-device. The steps look like

- check identify-device word 83, bits 12 (flush cache) and 13 (flush
cache ext)
- issue set-features command to get flush-cache into proper state
(enabled or disabled, as the user desires), if identify-device word 86
indicates it is not already in the state you seek.
- re-read identify [packet] device page from device, make sure
flush-cache[-ext] is enabled. A slacker could just make sure the
set-features command completed successfully, but to be 100% correct you
need to re-read the identify-device page. :/

NOTE 1: these steps are specific to the flush-cache command, and are
only vaguely related to write caching (which must be tested-for and
enabled separately). It is important to go through these steps
separately for write-caching and flush-cache[-ext].

Write-caching and flush-cache-command state should always be considered
separately, even though the two are often used together. I want to
avoid the LG cdrom debacle, where not properly checking for, and setting
up, flush-cache resulted in turning cdroms into bricks.

NOTE 2: Don't forget to check for 0x0000 or 0xffff in word 86, of
__only__ identify-packet-device. These special case meanings do not
appear for ATA devices, only ATAPI devices.

NOTE 3: flush-cache-ext should always be used on lba48 devices, even if
the request was inside the lba28 limit.

NOTE 4: flush-cache-ext does not exist on ATAPI devices.

Jeff



2004-03-19 16:48:31

by Marc-Christian Petersen

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

On Friday 19 March 2004 16:35, Jens Axboe wrote:

Hi Jens,

> A first release of a collected barrier patchset for 2.6.5-rc1-mm2. I
> have a few changes planned to support dm/md + sata, I'll do those
> changes over the weekend.
> Reiser has the best barrier support, ext3 works but only if things don't
> go wrong. So only attempt to use the barrier feature on ext3 if on ide
> drives, not SCSI nor SATA.
> reiserfs-barrier-2.6.5-rc1-mm2-1
> reiser part.

is this intended? ;)

-rw------- 1 axboe axboe 3377 Mar 19 07:32
reiserfs-barrier-2.6.5-rc1-mm2-1.bz2
-rw------- 1 axboe axboe 248 Mar 19 07:32
reiserfs-barrier-2.6.5-rc1-mm2-1.bz2.sign
-rw------- 1 axboe axboe 3473 Mar 19 07:32
reiserfs-barrier-2.6.5-rc1-mm2-1.gz
-rw------- 1 axboe axboe 248 Mar 19 07:32
reiserfs-barrier-2.6.5-rc1-mm2-1.gz.sign
-rw------- 1 axboe axboe 248 Mar 19 07:32
reiserfs-barrier-2.6.5-rc1-mm2-1.sign

means permission denied for us.


ciao, Marc

2004-03-19 18:16:20

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

On Fri, Mar 19 2004, Mika Penttil? wrote:
>
>
> Jens Axboe wrote:
>
> >Hi,
> >
> >A first release of a collected barrier patchset for 2.6.5-rc1-mm2. I
> >have a few changes planned to support dm/md + sata, I'll do those
> >changes over the weekend.
> >
> >Reiser has the best barrier support, ext3 works but only if things don't
> >go wrong. So only attempt to use the barrier feature on ext3 if on ide
> >drives, not SCSI nor SATA.
> >
> >
> >
> What are these brutal pieces...?
>
>
> +static int ide_transform_pc_req(ide_drive_t *drive, struct request *rq)
> +{
> + if (rq->cmd[0] != 0x35) {
> + ide_end_request(drive, 0, 0);
> + return 1;
> + }
> +
> + if (!drive->wcache) {
> + ide_end_request(drive, 1, 0);
> + return 1;
> + }
> +
> + ide_fill_flush_cmd(drive, rq);
> + return 0;
> +}
>
>
> /*
> + * basic transformation support for scsi -> ata commands
> + */
> + if (blk_pc_request(rq)) {
> + if (drive->media != ide_disk)
> + goto kill_rq;
> + if (ide_transform_pc_req(drive, rq))
> + return ide_stopped;
> + }

Hmm, I thought it was pretty obvious, even just from the naming and
comments. Right now, the block layer issued flush without data attached
(ie a drive barrier without pinning it to a buffer) comes as a scsi
synchronize cache command. I'm going to change this anyways and allow
queue hook of a ->issue_flush_fn() that can just tailored to ide or
scsi, _or_ dm/md and that sort of thing.

--
Jens Axboe

2004-03-19 18:20:13

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

On Fri, Mar 19 2004, Jeff Garzik wrote:
> Jens Axboe wrote:
> >Cosmetic stuff that will get ironed out. You can find the patches here:
> >
> >ftp://ftp.kernel.org/pub/linux/kernel/people/axboe/patches/v2.6/2.6.5-rc1-mm2/
> >
> >ide-barrier-2.6.5-rc1-mm2-1
> > ide/core part
>
>
> WRT ATA and flush-cache... before the IDE pieces of this patch are
> merged, IMO it is a requirement that the entire flush-cache stuff gets a
> review. ide_get_error_location() is one of the important pieces

Well what are you waiting for :-). Honestly, the bits have been "out
there" for 2 years or so, so it's not like it's anything brand new.

ide_get_error_location() is new for 2.6, I consider the 2.6 code to have
basically 100% coverage where the 2.4 one never really did...

> (great!). Another important piece is to make sure that a drive's
> flush-cache capability is correctly deduced and set up from the
> identify-device. The steps look like
>
> - check identify-device word 83, bits 12 (flush cache) and 13 (flush
> cache ext)
> - issue set-features command to get flush-cache into proper state
> (enabled or disabled, as the user desires), if identify-device word 86
> indicates it is not already in the state you seek.
> - re-read identify [packet] device page from device, make sure
> flush-cache[-ext] is enabled. A slacker could just make sure the
> set-features command completed successfully, but to be 100% correct you
> need to re-read the identify-device page. :/
>
> NOTE 1: these steps are specific to the flush-cache command, and are
> only vaguely related to write caching (which must be tested-for and
> enabled separately). It is important to go through these steps
> separately for write-caching and flush-cache[-ext].
>
> Write-caching and flush-cache-command state should always be considered
> separately, even though the two are often used together. I want to
> avoid the LG cdrom debacle, where not properly checking for, and setting
> up, flush-cache resulted in turning cdroms into bricks.
>
> NOTE 2: Don't forget to check for 0x0000 or 0xffff in word 86, of
> __only__ identify-packet-device. These special case meanings do not
> appear for ATA devices, only ATAPI devices.
>
> NOTE 3: flush-cache-ext should always be used on lba48 devices, even if
> the request was inside the lba28 limit.
>
> NOTE 4: flush-cache-ext does not exist on ATAPI devices.

Only ide-disk is supported. But that's some great info, I'll go over the
code this weekend and make sure we're doing ok and correct any
deviations.

--
Jens Axboe

2004-03-19 18:20:02

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

On Fri, Mar 19 2004, Marc-Christian Petersen wrote:
> On Friday 19 March 2004 16:35, Jens Axboe wrote:
>
> Hi Jens,
>
> > A first release of a collected barrier patchset for 2.6.5-rc1-mm2. I
> > have a few changes planned to support dm/md + sata, I'll do those
> > changes over the weekend.
> > Reiser has the best barrier support, ext3 works but only if things don't
> > go wrong. So only attempt to use the barrier feature on ext3 if on ide
> > drives, not SCSI nor SATA.
> > reiserfs-barrier-2.6.5-rc1-mm2-1
> > reiser part.
>
> is this intended? ;)
>
> -rw------- 1 axboe axboe 3377 Mar 19 07:32
> reiserfs-barrier-2.6.5-rc1-mm2-1.bz2
> -rw------- 1 axboe axboe 248 Mar 19 07:32
> reiserfs-barrier-2.6.5-rc1-mm2-1.bz2.sign
> -rw------- 1 axboe axboe 3473 Mar 19 07:32
> reiserfs-barrier-2.6.5-rc1-mm2-1.gz
> -rw------- 1 axboe axboe 248 Mar 19 07:32
> reiserfs-barrier-2.6.5-rc1-mm2-1.gz.sign
> -rw------- 1 axboe axboe 248 Mar 19 07:32
> reiserfs-barrier-2.6.5-rc1-mm2-1.sign
>
> means permission denied for us.

Corrected, thanks.

--
Jens Axboe

2004-03-19 18:39:21

by Mika Penttilä

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set



Jens Axboe wrote:

>On Fri, Mar 19 2004, Mika Penttil? wrote:
>
>
>>Jens Axboe wrote:
>>
>>
>>
>>>Hi,
>>>
>>>A first release of a collected barrier patchset for 2.6.5-rc1-mm2. I
>>>have a few changes planned to support dm/md + sata, I'll do those
>>>changes over the weekend.
>>>
>>>Reiser has the best barrier support, ext3 works but only if things don't
>>>go wrong. So only attempt to use the barrier feature on ext3 if on ide
>>>drives, not SCSI nor SATA.
>>>
>>>
>>>
>>>
>>>
>>What are these brutal pieces...?
>>
>>
>>+static int ide_transform_pc_req(ide_drive_t *drive, struct request *rq)
>>+{
>>+ if (rq->cmd[0] != 0x35) {
>>+ ide_end_request(drive, 0, 0);
>>+ return 1;
>>+ }
>>+
>>+ if (!drive->wcache) {
>>+ ide_end_request(drive, 1, 0);
>>+ return 1;
>>+ }
>>+
>>+ ide_fill_flush_cmd(drive, rq);
>>+ return 0;
>>+}
>>
>>
>>/*
>>+ * basic transformation support for scsi -> ata commands
>>+ */
>>+ if (blk_pc_request(rq)) {
>>+ if (drive->media != ide_disk)
>>+ goto kill_rq;
>>+ if (ide_transform_pc_req(drive, rq))
>>+ return ide_stopped;
>>+ }
>>
>>
>
>Hmm, I thought it was pretty obvious, even just from the naming and
>comments. Right now, the block layer issued flush without data attached
>(ie a drive barrier without pinning it to a buffer) comes as a scsi
>synchronize cache command. I'm going to change this anyways and allow
>queue hook of a ->issue_flush_fn() that can just tailored to ide or
>scsi, _or_ dm/md and that sort of thing.
>
>
I mean other BLOCK_PC requests than SYNCHRONIZE CACHE ->
ide_end_request() and ide_stopped.

--Mika


2004-03-19 23:01:42

by Matthias Andree

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

On Fri, 19 Mar 2004, Jeff Garzik wrote:

> - issue set-features command to get flush-cache into proper state
> (enabled or disabled, as the user desires), if identify-device word 86
> indicates it is not already in the state you seek.

BTW, speaking of identify-device, hdparm -i (which uses
HDIO_GET_IDENTITY) always returns "WriteCache=enabled" while hdparm -I
that uses HDIO_DRIVE_CMD with WIN_PIDENTIFY reports the "correct" state
that I've previously set with -W0. This is an i386 machine w/ 2.6.5-rc1.

Is HDIO_GET_IDENTITY working correctly?

--
Matthias Andree

Encrypt your mail: my GnuPG key ID is 0x052E7D95

Subject: Re: [PATCH] barrier patch set

On Friday 19 of March 2004 17:34, Jeff Garzik wrote:
> Jens Axboe wrote:
> > Cosmetic stuff that will get ironed out. You can find the patches here:
> >
> > ftp://ftp.kernel.org/pub/linux/kernel/people/axboe/patches/v2.6/2.6.5-rc1
> >-mm2/
> >
> > ide-barrier-2.6.5-rc1-mm2-1
> > ide/core part

Jens, am I right that you didn't do any changes/cleanups I asked you to do?
Here they are once again (probably some new items added as a bonus). ;-)

- do not use hwgroup->wrq (die!) and do not add drive->special_buf,
just do what PM code does and other special commands do - use taskfile
(yes, dirty stack allocation)

- SCSI -> IDE transform should die, please use something like REQ_FLUSH
and let subsystems deal with it

- ide_get_error_location() is cool but clean other places doing same thing
as you are duplicating existing code
(please use u64 not sector_t - you are getting raw info from the disk)

- why does blkdev_issue_flush() add REQ_BLOCK_PC to rq->flags?

- why are we doing pre-flush?

> WRT ATA and flush-cache... before the IDE pieces of this patch are
> merged, IMO it is a requirement that the entire flush-cache stuff gets a
> review. ide_get_error_location() is one of the important pieces
> (great!). Another important piece is to make sure that a drive's
> flush-cache capability is correctly deduced and set up from the
> identify-device. The steps look like

Yep.

> - check identify-device word 83, bits 12 (flush cache) and 13 (flush
> cache ext)
> - issue set-features command to get flush-cache into proper state
> (enabled or disabled, as the user desires), if identify-device word 86
> indicates it is not already in the state you seek.
> - re-read identify [packet] device page from device, make sure
> flush-cache[-ext] is enabled. A slacker could just make sure the
> set-features command completed successfully, but to be 100% correct you
> need to re-read the identify-device page. :/

The fact that spec says "supported" not "enabled" in description of word86
makes me wonder - can they be disabled? (FLUSH CACHE is mandatory for General
feature set and FLUSH CACHE EXT is mandatory if 48-bit LBA is supported)

Jeff, please note that these bits were introduced by ATA-6 spec
and take a look at ATA-5 spec:

...
FLUSH CACHE
General feature set
- Mandatory for all devices
...

and ATA-4 spec:

...
FLUSH CACHE
General feature set
- Optional for all devices
...

IMO to test if FLUSH CACHE works we should just issue it during disk setup
and check result. This way we can use FLUSH CACHE also on < ATA-6 devices
(there is a lot of them).

Cheers,
Bartlomiej

Subject: Re: [PATCH] barrier patch set

On Saturday 20 of March 2004 00:01, Matthias Andree wrote:
> On Fri, 19 Mar 2004, Jeff Garzik wrote:
> > - issue set-features command to get flush-cache into proper state
> > (enabled or disabled, as the user desires), if identify-device word 86
> > indicates it is not already in the state you seek.
>
> BTW, speaking of identify-device, hdparm -i (which uses
> HDIO_GET_IDENTITY) always returns "WriteCache=enabled" while hdparm -I
> that uses HDIO_DRIVE_CMD with WIN_PIDENTIFY reports the "correct" state
> that I've previously set with -W0. This is an i386 machine w/ 2.6.5-rc1.
>
> Is HDIO_GET_IDENTITY working correctly?

There were reports that on some drives you can't disable write cache
and even (?) that some drives lie (WC still enabled but marked as disabled).

Regards,
Bartlomiej

2004-03-20 00:15:14

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

Bartlomiej Zolnierkiewicz wrote:
> The fact that spec says "supported" not "enabled" in description of word86
> makes me wonder - can they be disabled? (FLUSH CACHE is mandatory for General
> feature set and FLUSH CACHE EXT is mandatory if 48-bit LBA is supported)

Yes, that's why there are separate 'supported' and 'enabled' bits for
each feature.

Words 82-84 are 'supported' bits. Words 85-87 are 'enabled' bits.
These bits mirror each other, i.e. Word 83 and Word 86 have basically
the same bits, except that Word 86 definitions change _slightly_ since
the only bits that are relevant are the ones for features that can be
disabled/enabled.

You use set-features command to enable and disable these features, and
then the result shows up in subsequent identify-device command output.

If the driver is testing for a capability but does not enable it, then
always use the 'enabled' set of bits, not the 'supported' set of bits.


> Jeff, please note that these bits were introduced by ATA-6 spec
> and take a look at ATA-5 spec:
>
> ...
> FLUSH CACHE
> General feature set
> - Mandatory for all devices
> ...
>
> and ATA-4 spec:
>
> ...
> FLUSH CACHE
> General feature set
> - Optional for all devices
> ...
>
> IMO to test if FLUSH CACHE works we should just issue it during disk setup
> and check result. This way we can use FLUSH CACHE also on < ATA-6 devices
> (there is a lot of them).

I disagree. "just issue it" is how those LG cdrom drives got cooked.

LG cdrom drives indicated in their identify-packet-device page that
flush-cache was not supported... and then re-used the flush-cache ATA
opcode for their vendor-specific download-firmware command. Combine
that with a Linux patch that didn't properly check for flush-cache
support. Result: brick.

All drives that support flush-cache list the relevant bits in
identify-device, even on pre-ATA-6 devices. Whether the feature was
optional or mandantory, we can check the feature bits.

Jeff



2004-03-20 00:17:49

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

Bartlomiej Zolnierkiewicz wrote:
> The fact that spec says "supported" not "enabled" in description of word86
> makes me wonder - can they be disabled? (FLUSH CACHE is mandatory for General
> feature set and FLUSH CACHE EXT is mandatory if 48-bit LBA is supported)

IOW, "mandantory" isn't really mandantory, if you scale over time...
Always check the identify-device feature bits before using a feature set :)

Jeff



Subject: Re: [PATCH] barrier patch set

On Saturday 20 of March 2004 01:14, Jeff Garzik wrote:
> Bartlomiej Zolnierkiewicz wrote:
> > The fact that spec says "supported" not "enabled" in description of
> > word86 makes me wonder - can they be disabled? (FLUSH CACHE is mandatory
> > for General feature set and FLUSH CACHE EXT is mandatory if 48-bit LBA is
> > supported)
>
> Yes, that's why there are separate 'supported' and 'enabled' bits for
> each feature.
>
> Words 82-84 are 'supported' bits. Words 85-87 are 'enabled' bits.
> These bits mirror each other, i.e. Word 83 and Word 86 have basically
> the same bits, except that Word 86 definitions change _slightly_ since
> the only bits that are relevant are the ones for features that can be
> disabled/enabled.
>
> You use set-features command to enable and disable these features, and
> then the result shows up in subsequent identify-device command output.
>
> If the driver is testing for a capability but does not enable it, then
> always use the 'enabled' set of bits, not the 'supported' set of bits.

This is quite obvious but I am talking about confusing wording in description
of word86 - for some features 'enabled' is used and for others 'supported'

> > Jeff, please note that these bits were introduced by ATA-6 spec
> > and take a look at ATA-5 spec:
> >
> > ...
> > FLUSH CACHE
> > General feature set
> > - Mandatory for all devices
> > ...
> >
> > and ATA-4 spec:
> >
> > ...
> > FLUSH CACHE
> > General feature set
> > - Optional for all devices
> > ...
> >
> > IMO to test if FLUSH CACHE works we should just issue it during disk
> > setup and check result. This way we can use FLUSH CACHE also on < ATA-6
> > devices (there is a lot of them).
>
> I disagree. "just issue it" is how those LG cdrom drives got cooked.

I'm aware of LG fun. Jens already stated that current barrier implementation
is disk-only and I'm talking about disks only.

If anybody reused CACHE FLUSH opcode for disk drive he/she deserves to loose.
8)

> LG cdrom drives indicated in their identify-packet-device page that
> flush-cache was not supported... and then re-used the flush-cache ATA
> opcode for their vendor-specific download-firmware command. Combine
> that with a Linux patch that didn't properly check for flush-cache
> support. Result: brick.
>
> All drives that support flush-cache list the relevant bits in
> identify-device, even on pre-ATA-6 devices. Whether the feature was
> optional or mandantory, we can check the feature bits.

Hm. so this is undocumented in the spec?

Bartlomiej

2004-03-20 00:42:38

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

Bartlomiej Zolnierkiewicz wrote:
> On Saturday 20 of March 2004 01:14, Jeff Garzik wrote:
>
>>Bartlomiej Zolnierkiewicz wrote:
>>
>>>The fact that spec says "supported" not "enabled" in description of
>>>word86 makes me wonder - can they be disabled? (FLUSH CACHE is mandatory
>>>for General feature set and FLUSH CACHE EXT is mandatory if 48-bit LBA is
>>>supported)
>>
>>Yes, that's why there are separate 'supported' and 'enabled' bits for
>>each feature.
>>
>>Words 82-84 are 'supported' bits. Words 85-87 are 'enabled' bits.
>>These bits mirror each other, i.e. Word 83 and Word 86 have basically
>>the same bits, except that Word 86 definitions change _slightly_ since
>>the only bits that are relevant are the ones for features that can be
>>disabled/enabled.
>>
>>You use set-features command to enable and disable these features, and
>>then the result shows up in subsequent identify-device command output.
>>
>>If the driver is testing for a capability but does not enable it, then
>>always use the 'enabled' set of bits, not the 'supported' set of bits.
>
>
> This is quite obvious but I am talking about confusing wording in description
> of word86 - for some features 'enabled' is used and for others 'supported'

Yeah, mainly the difference is communicating in the description of each
word.

Anyway, what I described is how things work :) For example, features
that are always enabled in the drive are listed with both support and
enabled bits set. The driver sees that, and does not issue a
set-features command, because it does not need to.


>>>IMO to test if FLUSH CACHE works we should just issue it during disk
>>>setup and check result. This way we can use FLUSH CACHE also on < ATA-6
>>>devices (there is a lot of them).
>>
>>I disagree. "just issue it" is how those LG cdrom drives got cooked.
>
>
> I'm aware of LG fun. Jens already stated that current barrier implementation
> is disk-only and I'm talking about disks only.
>
> If anybody reused CACHE FLUSH opcode for disk drive he/she deserves to loose.
> 8)

Well... If you don't check the proper feature bits found in the spec, I
blame the driver for ignoring the spec... :)


>>All drives that support flush-cache list the relevant bits in
>>identify-device, even on pre-ATA-6 devices. Whether the feature was
>>optional or mandantory, we can check the feature bits.
>
>
> Hm. so this is undocumented in the spec?

? When it was optional, there was a feature bit to test. When it
became mandantory, the feature bit to test stayed in there. The feature
bit is zero, otherwise. Makes it possible to use "just test the bit"
and have things Just Work(tm). :)

Jeff



Subject: Re: [PATCH] barrier patch set

On Saturday 20 of March 2004 01:42, Jeff Garzik wrote:
> Bartlomiej Zolnierkiewicz wrote:
> > On Saturday 20 of March 2004 01:14, Jeff Garzik wrote:
> >>Bartlomiej Zolnierkiewicz wrote:
> >>>The fact that spec says "supported" not "enabled" in description of
> >>>word86 makes me wonder - can they be disabled? (FLUSH CACHE is mandatory
> >>>for General feature set and FLUSH CACHE EXT is mandatory if 48-bit LBA
> >>> is supported)
> >>
> >>Yes, that's why there are separate 'supported' and 'enabled' bits for
> >>each feature.
> >>
> >>Words 82-84 are 'supported' bits. Words 85-87 are 'enabled' bits.
> >>These bits mirror each other, i.e. Word 83 and Word 86 have basically
> >>the same bits, except that Word 86 definitions change _slightly_ since
> >>the only bits that are relevant are the ones for features that can be
> >>disabled/enabled.
> >>
> >>You use set-features command to enable and disable these features, and
> >>then the result shows up in subsequent identify-device command output.
> >>
> >>If the driver is testing for a capability but does not enable it, then
> >>always use the 'enabled' set of bits, not the 'supported' set of bits.
> >
> > This is quite obvious but I am talking about confusing wording in
> > description of word86 - for some features 'enabled' is used and for
> > others 'supported'
>
> Yeah, mainly the difference is communicating in the description of each
> word.
>
> Anyway, what I described is how things work :) For example, features
> that are always enabled in the drive are listed with both support and
> enabled bits set. The driver sees that, and does not issue a
> set-features command, because it does not need to.

Yep but in your original mail you suggested that we should explicitly enable
FLUSH CACHE and FLUSH CACHE EXT features - there are even no subcommands
to do this. ;-)

> >>>IMO to test if FLUSH CACHE works we should just issue it during disk
> >>>setup and check result. This way we can use FLUSH CACHE also on < ATA-6
> >>>devices (there is a lot of them).
> >>
> >>I disagree. "just issue it" is how those LG cdrom drives got cooked.
> >
> > I'm aware of LG fun. Jens already stated that current barrier
> > implementation is disk-only and I'm talking about disks only.
> >
> > If anybody reused CACHE FLUSH opcode for disk drive he/she deserves to
> > loose. 8)
>
> Well... If you don't check the proper feature bits found in the spec, I
> blame the driver for ignoring the spec... :)

Please point me to these bits in ATA-4 or ATA-5 spec.
Have you checked them as I asked?

> >>All drives that support flush-cache list the relevant bits in
> >>identify-device, even on pre-ATA-6 devices. Whether the feature was
> >>optional or mandantory, we can check the feature bits.
> >
> > Hm. so this is undocumented in the spec?
>
> ? When it was optional, there was a feature bit to test. When it
> became mandantory, the feature bit to test stayed in there. The feature
> bit is zero, otherwise. Makes it possible to use "just test the bit"
> and have things Just Work(tm). :)

I wish it was so simple. Here is an example to make it clear:

model: WDC WD800JB-00CRA1 firmware: 17.07W77
word 0x83 is 4b01, word 0x86 is 0x0801

and drive of course supports CACHE FLUSH command.

Cheers,
Bartlomiej

2004-03-20 01:48:50

by Johannes Stezenbach

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

Bartlomiej Zolnierkiewicz wrote:
> On Saturday 20 of March 2004 00:01, Matthias Andree wrote:
> >
> > BTW, speaking of identify-device, hdparm -i (which uses
> > HDIO_GET_IDENTITY) always returns "WriteCache=enabled" while hdparm -I
> > that uses HDIO_DRIVE_CMD with WIN_PIDENTIFY reports the "correct" state
> > that I've previously set with -W0. This is an i386 machine w/ 2.6.5-rc1.
> >
> > Is HDIO_GET_IDENTITY working correctly?
>
> There were reports that on some drives you can't disable write cache
> and even (?) that some drives lie (WC still enabled but marked as disabled).

hdparm -i and -I ultimately both interpret WIN_IDENTIFY result, and both test
bit 0x0020 of word 85. So it's unclear to me why they report a
different write cache setting. I added a hexdump to dump_identity()
in hdparm.c, and found that bit 0x0020 of word 85 is always set.

BTW, 'cat /proc/ide/hda/identify' or 'hdparm -Istdin </dev/ide/hda/identify'
reports the same value as hdparm -I, and that is consistent with
the value I set with hdparm -W x.


So, is HDIO_GET_IDENTITY broken?

Johannes

Subject: Re: [PATCH] barrier patch set

On Saturday 20 of March 2004 02:48, Johannes Stezenbach wrote:
> Bartlomiej Zolnierkiewicz wrote:
> > On Saturday 20 of March 2004 00:01, Matthias Andree wrote:
> > > BTW, speaking of identify-device, hdparm -i (which uses
> > > HDIO_GET_IDENTITY) always returns "WriteCache=enabled" while hdparm -I
> > > that uses HDIO_DRIVE_CMD with WIN_PIDENTIFY reports the "correct" state
> > > that I've previously set with -W0. This is an i386 machine w/
> > > 2.6.5-rc1.
> > >
> > > Is HDIO_GET_IDENTITY working correctly?
> >
> > There were reports that on some drives you can't disable write cache
> > and even (?) that some drives lie (WC still enabled but marked as
> > disabled).

Doh, I misunderstood the question.

Correct answer is: everything is fine, RTFM (man hdparm). ;-)

> hdparm -i and -I ultimately both interpret WIN_IDENTIFY result, and both
> test bit 0x0020 of word 85. So it's unclear to me why they report a
> different write cache setting. I added a hexdump to dump_identity()
> in hdparm.c, and found that bit 0x0020 of word 85 is always set.

or WIN_PIDENTIFY to be strict but

-i returns _cached_ (read when the device was probed) identify data
(uses HDIO_GET_IDENTIFY ioctl)
-I reads _current_ data directly from the device
(uses HDIO_DRIVE_CMD ioctl)

> BTW, 'cat /proc/ide/hda/identify' or 'hdparm -Istdin
> </dev/ide/hda/identify' reports the same value as hdparm -I, and that is
> consistent with
> the value I set with hdparm -W x.
>
>
> So, is HDIO_GET_IDENTITY broken?

No.

Regards,
Bartlomiej

2004-03-20 02:53:19

by Johannes Stezenbach

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

Bartlomiej Zolnierkiewicz wrote:
> On Saturday 20 of March 2004 02:48, Johannes Stezenbach wrote:
>
> > hdparm -i and -I ultimately both interpret WIN_IDENTIFY result, and both
> > test bit 0x0020 of word 85. So it's unclear to me why they report a
> > different write cache setting. I added a hexdump to dump_identity()
> > in hdparm.c, and found that bit 0x0020 of word 85 is always set.
>
> or WIN_PIDENTIFY to be strict but
>
> -i returns _cached_ (read when the device was probed) identify data
> (uses HDIO_GET_IDENTIFY ioctl)
> -I reads _current_ data directly from the device
> (uses HDIO_DRIVE_CMD ioctl)

Oh, right.

But: HDIO_GET_IDENTITY returns drive->id, and surely drive->id
is used internally. So isn't it a bug that drive->id is not
updated when the write cache setting is changed?

I think the barrier code uses drive->id to determine if the
write cache is enabled.

Johannes

2004-03-20 09:55:14

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

On Fri, Mar 19 2004, Mika Penttil? wrote:
>
>
> Jens Axboe wrote:
>
> >On Fri, Mar 19 2004, Mika Penttil? wrote:
> >
> >
> >>Jens Axboe wrote:
> >>
> >>
> >>
> >>>Hi,
> >>>
> >>>A first release of a collected barrier patchset for 2.6.5-rc1-mm2. I
> >>>have a few changes planned to support dm/md + sata, I'll do those
> >>>changes over the weekend.
> >>>
> >>>Reiser has the best barrier support, ext3 works but only if things don't
> >>>go wrong. So only attempt to use the barrier feature on ext3 if on ide
> >>>drives, not SCSI nor SATA.
> >>>
> >>>
> >>>
> >>>
> >>>
> >>What are these brutal pieces...?
> >>
> >>
> >>+static int ide_transform_pc_req(ide_drive_t *drive, struct request *rq)
> >>+{
> >>+ if (rq->cmd[0] != 0x35) {
> >>+ ide_end_request(drive, 0, 0);
> >>+ return 1;
> >>+ }
> >>+
> >>+ if (!drive->wcache) {
> >>+ ide_end_request(drive, 1, 0);
> >>+ return 1;
> >>+ }
> >>+
> >>+ ide_fill_flush_cmd(drive, rq);
> >>+ return 0;
> >>+}
> >>
> >>
> >>/*
> >>+ * basic transformation support for scsi -> ata commands
> >>+ */
> >>+ if (blk_pc_request(rq)) {
> >>+ if (drive->media != ide_disk)
> >>+ goto kill_rq;
> >>+ if (ide_transform_pc_req(drive, rq))
> >>+ return ide_stopped;
> >>+ }
> >>
> >>
> >
> >Hmm, I thought it was pretty obvious, even just from the naming and
> >comments. Right now, the block layer issued flush without data attached
> >(ie a drive barrier without pinning it to a buffer) comes as a scsi
> >synchronize cache command. I'm going to change this anyways and allow
> >queue hook of a ->issue_flush_fn() that can just tailored to ide or
> >scsi, _or_ dm/md and that sort of thing.
> >
> >
> I mean other BLOCK_PC requests than SYNCHRONIZE CACHE ->
> ide_end_request() and ide_stopped.

That's a bug, good catch. Added to fixme list for next release :)

--
Jens Axboe

2004-03-20 09:54:05

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

On Sat, Mar 20 2004, Bartlomiej Zolnierkiewicz wrote:
> On Friday 19 of March 2004 17:34, Jeff Garzik wrote:
> > Jens Axboe wrote:
> > > Cosmetic stuff that will get ironed out. You can find the patches here:
> > >
> > > ftp://ftp.kernel.org/pub/linux/kernel/people/axboe/patches/v2.6/2.6.5-rc1
> > >-mm2/
> > >
> > > ide-barrier-2.6.5-rc1-mm2-1
> > > ide/core part
>
> Jens, am I right that you didn't do any changes/cleanups I asked you to do?
> Here they are once again (probably some new items added as a bonus). ;-)

Probably, ide code was idle for some time :). As I said to Chris, there
are a bunch of things I want to do to the code over the weekend, I
wanted to get something out there before that though (raises the
incentive to finish it)

> - do not use hwgroup->wrq (die!) and do not add drive->special_buf,
> just do what PM code does and other special commands do - use taskfile
> (yes, dirty stack allocation)

Doesn't work for split flush, ie issue a bunch of flushes to devices,
then wait for them. I agree using ->wrq and special_buf is ugly as hell,
though.

> - SCSI -> IDE transform should die, please use something like REQ_FLUSH
> and let subsystems deal with it

That's what I wanted to avoid, adding more flags. However, if you see
the comment in there this is being changed to ->issue_flush() instead.
So it's dying, don't worry.

> - ide_get_error_location() is cool but clean other places doing same thing
> as you are duplicating existing code
> (please use u64 not sector_t - you are getting raw info from the disk)

Ok

> - why does blkdev_issue_flush() add REQ_BLOCK_PC to rq->flags?

Ehm, because it _is_ a REQ_BLOCK_PC? ;-)

> - why are we doing pre-flush?

To ensure previously written data is on platter first.

--
Jens Axboe

2004-03-20 09:58:32

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

On Sat, Mar 20 2004, Bartlomiej Zolnierkiewicz wrote:
> > >>All drives that support flush-cache list the relevant bits in
> > >>identify-device, even on pre-ATA-6 devices. Whether the feature was
> > >>optional or mandantory, we can check the feature bits.
> > >
> > > Hm. so this is undocumented in the spec?
> >
> > ? When it was optional, there was a feature bit to test. When it
> > became mandantory, the feature bit to test stayed in there. The feature
> > bit is zero, otherwise. Makes it possible to use "just test the bit"
> > and have things Just Work(tm). :)
>
> I wish it was so simple. Here is an example to make it clear:
>
> model: WDC WD800JB-00CRA1 firmware: 17.07W77
> word 0x83 is 4b01, word 0x86 is 0x0801
>
> and drive of course supports CACHE FLUSH command.

I agree with Bart, it's usually never that clear. Quit harping the
stupid LG issue, they did something brain dead in the firmware and I
almost have to say that they got what they deserved for doing something
as _stupid_ as that.

Jeff, it's wonderful to think that you can always rely on checking spec
bits, but in reality it never really 'just works out' for any given set
of hardware.

--
Jens Axboe

2004-03-20 10:13:08

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

Jens Axboe wrote:
> I agree with Bart, it's usually never that clear. Quit harping the
> stupid LG issue, they did something brain dead in the firmware and I
> almost have to say that they got what they deserved for doing something
> as _stupid_ as that.

I use it because it's an excellent illustration of what happens when you
ignore the spec.


> Jeff, it's wonderful to think that you can always rely on checking spec
> bits, but in reality it never really 'just works out' for any given set
> of hardware.

"just issue it and hope" is not a reasonable plan, IMO.

Jeff



2004-03-20 10:19:41

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

On Sat, Mar 20 2004, Jeff Garzik wrote:
> Jens Axboe wrote:
> >I agree with Bart, it's usually never that clear. Quit harping the
> >stupid LG issue, they did something brain dead in the firmware and I
> >almost have to say that they got what they deserved for doing something
> >as _stupid_ as that.
>
> I use it because it's an excellent illustration of what happens when you
> ignore the spec.

Really, I think it's a much better demonstration of exactly how stupid
hardware developers are at times...

> >Jeff, it's wonderful to think that you can always rely on checking spec
> >bits, but in reality it never really 'just works out' for any given set
> >of hardware.
>
> "just issue it and hope" is not a reasonable plan, IMO.

I agree with that as well. I just didn't agree with your rosy idea of
thinking everything would always work if you just check the bits
according to spec.

--
Jens Axboe

2004-03-20 10:22:19

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

Bartlomiej Zolnierkiewicz wrote:
> Yep but in your original mail you suggested that we should explicitly enable
> FLUSH CACHE and FLUSH CACHE EXT features - there are even no subcommands
> to do this. ;-)

Whoops, you're right. I was thinking about the general protocol for
features.


> I wish it was so simple. Here is an example to make it clear:
>
> model: WDC WD800JB-00CRA1 firmware: 17.07W77
> word 0x83 is 4b01, word 0x86 is 0x0801
>
> and drive of course supports CACHE FLUSH command.

What ATA revision?

Sending down opcodes because they will "probably" work isn't the best
idea in the world...

Jeff



2004-03-20 10:37:54

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

Jens Axboe wrote:
> On Sat, Mar 20 2004, Jeff Garzik wrote:
>
>>Jens Axboe wrote:
>>
>>>I agree with Bart, it's usually never that clear. Quit harping the
>>>stupid LG issue, they did something brain dead in the firmware and I
>>>almost have to say that they got what they deserved for doing something
>>>as _stupid_ as that.
>>
>>I use it because it's an excellent illustration of what happens when you
>>ignore the spec.
>
>
> Really, I think it's a much better demonstration of exactly how stupid
> hardware developers are at times...

No argument. But their behavior, however awful, _was_ reported in
places where a spec-driven driver would have noticed... :)


>>>Jeff, it's wonderful to think that you can always rely on checking spec
>>>bits, but in reality it never really 'just works out' for any given set
>>>of hardware.
>>
>>"just issue it and hope" is not a reasonable plan, IMO.
>
>
> I agree with that as well. I just didn't agree with your rosy idea of
> thinking everything would always work if you just check the bits
> according to spec.

Everything _will_ always work, if you check the bits according to spec.
Not reporting the flush-cache feature bit, when it really the
opcode exists, isn't a spec violation. The opposite is, and I haven't
heard of any such drives :)

AFAICS:
* for ATA versions where flush-cache is optional, you must check the
bit. otherwise, issuing flush-cache would be very unwise.
* for ATA versions where flush-cache is mandatory... the argument can
be made that issuing a flush-cache in the absence of the bit isn't a bad
thing. I'm not sure I agree with that, but I'm willing to be convinced.

"just check the bits" always works because it is the conservative
choice... but it can lead to suboptimal (but valid!) behavior by
ignoring some devices' flush-cache capability.

If it's only a few drives out there that misreport their flush-cache
bit, then who cares --> just more broken hardware, that we won't issue a
flush-cache on. If it's a lot of drives, that changes things...

Jeff


2004-03-20 11:36:32

by Matthias Andree

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

> Correct answer is: everything is fine, RTFM (man hdparm). ;-)

Not everything is fine. hdparm documents -i returns inconsistent data.
Most, but _NOT_ _EVERYTHING_ is cached: the multcount is updated, for
instance. What is that good for? Mix & Match whatever is convenient?

Are there systems where -I will not work? If there are none, hdparm 6.0
should be shipped without the -i option.

--
Matthias Andree

Encrypt your mail: my GnuPG key ID is 0x052E7D95

Subject: Re: [PATCH] barrier patch set

On Saturday 20 of March 2004 11:21, Jeff Garzik wrote:
> > I wish it was so simple. Here is an example to make it clear:
> >
> > model: WDC WD800JB-00CRA1 firmware: 17.07W77
> > word 0x83 is 4b01, word 0x86 is 0x0801
> >
> > and drive of course supports CACHE FLUSH command.
>
> What ATA revision?

ATA-5

> Sending down opcodes because they will "probably" work isn't the best
> idea in the world...

It isn't but spec says that devices should abort unknown commands. 8)

Bartlomiej

Subject: Re: [PATCH] barrier patch set

On Saturday 20 of March 2004 12:36, Matthias Andree wrote:
> > Correct answer is: everything is fine, RTFM (man hdparm). ;-)
>
> Not everything is fine. hdparm documents -i returns inconsistent data.
> Most, but _NOT_ _EVERYTHING_ is cached: the multcount is updated, for
> instance. What is that good for? Mix & Match whatever is convenient?

I'm aware of this bug - driver shouldn't modify drive->id. Patches are welcomed.

> Are there systems where -I will not work? If there are none, hdparm 6.0

It should always work.

> should be shipped without the -i option.

Why? It can be sometimes useful for debugging purposes
and HDIO_GET_IDENTIFY ioctl is not going away in 2.6.x.

Regards,
Bartlomiej

Subject: Re: [PATCH] barrier patch set

On Saturday 20 of March 2004 03:53, Johannes Stezenbach wrote:
> Bartlomiej Zolnierkiewicz wrote:
> > On Saturday 20 of March 2004 02:48, Johannes Stezenbach wrote:
> > > hdparm -i and -I ultimately both interpret WIN_IDENTIFY result, and
> > > both test bit 0x0020 of word 85. So it's unclear to me why they report
> > > a different write cache setting. I added a hexdump to dump_identity()
> > > in hdparm.c, and found that bit 0x0020 of word 85 is always set.
> >
> > or WIN_PIDENTIFY to be strict but
> >
> > -i returns _cached_ (read when the device was probed) identify data
> > (uses HDIO_GET_IDENTIFY ioctl)
> > -I reads _current_ data directly from the device
> > (uses HDIO_DRIVE_CMD ioctl)
>
> Oh, right.
>
> But: HDIO_GET_IDENTITY returns drive->id, and surely drive->id
> is used internally. So isn't it a bug that drive->id is not
> updated when the write cache setting is changed?

No, drive->id shouldn't be changed.

> I think the barrier code uses drive->id to determine if the
> write cache is enabled.

The barrier code depends on drive->wcache.

Regards,
Bartlomiej

Subject: Re: [PATCH] barrier patch set

On Saturday 20 of March 2004 10:53, Jens Axboe wrote:
> On Sat, Mar 20 2004, Bartlomiej Zolnierkiewicz wrote:
> > - do not use hwgroup->wrq (die!) and do not add drive->special_buf,
> > just do what PM code does and other special commands do - use taskfile
> > (yes, dirty stack allocation)
>
> Doesn't work for split flush, ie issue a bunch of flushes to devices,
> then wait for them. I agree using ->wrq and special_buf is ugly as hell,
> though.

I can't see why it doesn't work, please explain.

> > - ide_get_error_location() is cool but clean other places doing same
> > thing as you are duplicating existing code
> > (please use u64 not sector_t - you are getting raw info from the disk)
>
> Ok

Cool.

> > - why does blkdev_issue_flush() add REQ_BLOCK_PC to rq->flags?
>
> Ehm, because it _is_ a REQ_BLOCK_PC? ;-)

Ok, it is PC till SCSI->IDE transform, then it is no longer PC. :)

> > - why are we doing pre-flush?
>
> To ensure previously written data is on platter first.

I know this, I want to know what for you are doing this?

Previously written data is already acknowledgment to the upper layers so you
can't do much even if you hit error on flush cache. IMO if error happens we
should just check if failed sector is of our ordered write if not well report
it and continue. It's cleaner and can give some (small?) performance gain.

Regards,
Bartlomiej

Subject: Re: [PATCH] barrier patch set

On Saturday 20 of March 2004 11:37, Jeff Garzik wrote:
> Jens Axboe wrote:
> > On Sat, Mar 20 2004, Jeff Garzik wrote:
> >>Jens Axboe wrote:
> >>>I agree with Bart, it's usually never that clear. Quit harping the
> >>>stupid LG issue, they did something brain dead in the firmware and I
> >>>almost have to say that they got what they deserved for doing something
> >>>as _stupid_ as that.
> >>
> >>I use it because it's an excellent illustration of what happens when you
> >>ignore the spec.
> >
> > Really, I think it's a much better demonstration of exactly how stupid
> > hardware developers are at times...
>
> No argument. But their behavior, however awful, _was_ reported in
> places where a spec-driven driver would have noticed... :)
>
> >>>Jeff, it's wonderful to think that you can always rely on checking spec
> >>>bits, but in reality it never really 'just works out' for any given set
> >>>of hardware.
> >>
> >>"just issue it and hope" is not a reasonable plan, IMO.
> >
> > I agree with that as well. I just didn't agree with your rosy idea of
> > thinking everything would always work if you just check the bits
> > according to spec.
>
> Everything _will_ always work, if you check the bits according to spec.
> Not reporting the flush-cache feature bit, when it really the
> opcode exists, isn't a spec violation. The opposite is, and I haven't
> heard of any such drives :)
>
> AFAICS:
> * for ATA versions where flush-cache is optional, you must check the
> bit. otherwise, issuing flush-cache would be very unwise.

There is not flush-cache bit in both ATA-4 (command optional)
and ATA-5 (command mandatory).

> * for ATA versions where flush-cache is mandatory... the argument can
> be made that issuing a flush-cache in the absence of the bit isn't a bad
> thing. I'm not sure I agree with that, but I'm willing to be convinced.
>
> "just check the bits" always works because it is the conservative
> choice... but it can lead to suboptimal (but valid!) behavior by
> ignoring some devices' flush-cache capability.

It's just damn too conservative.

> If it's only a few drives out there that misreport their flush-cache
> bit, then who cares --> just more broken hardware, that we won't issue a
> flush-cache on. If it's a lot of drives, that changes things...

They don't misreport! They comply with the spec (ATA-4 or ATA-5).

Jeff, please RTFSPEC. ;-)

Cheers,
Bartlomiej

2004-03-20 16:27:42

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

On Sat, Mar 20 2004, Bartlomiej Zolnierkiewicz wrote:
> On Saturday 20 of March 2004 10:53, Jens Axboe wrote:
> > On Sat, Mar 20 2004, Bartlomiej Zolnierkiewicz wrote:
> > > - do not use hwgroup->wrq (die!) and do not add drive->special_buf,
> > > just do what PM code does and other special commands do - use taskfile
> > > (yes, dirty stack allocation)
> >
> > Doesn't work for split flush, ie issue a bunch of flushes to devices,
> > then wait for them. I agree using ->wrq and special_buf is ugly as hell,
> > though.
>
> I can't see why it doesn't work, please explain.

You'd need to pass on some opaque buffer from the issuer to the flush
cache function for that to work, which would be ugly.

> > > - why does blkdev_issue_flush() add REQ_BLOCK_PC to rq->flags?
> >
> > Ehm, because it _is_ a REQ_BLOCK_PC? ;-)
>
> Ok, it is PC till SCSI->IDE transform, then it is no longer PC. :)

Right, and during that transform, REQ_BLOCK_PC is cleared:

+ rq->flags &= ~REQ_BLOCK_PC;
+ rq->flags |= REQ_DRIVE_TASK | REQ_STARTED;

(REQ_STARTED should be removed, btw).

> > > - why are we doing pre-flush?
> >
> > To ensure previously written data is on platter first.
>
> I know this, I want to know what for you are doing this?
>
> Previously written data is already acknowledgment to the upper layers
> so you can't do much even if you hit error on flush cache. IMO if
> error happens we should just check if failed sector is of our ordered
> write if not well report it and continue. It's cleaner and can give
> some (small?) performance gain.

This depends entirely on what you assume with a barrier and what the
upper layers want. My implementation guarentees that existing data sent
to drive is on platter before you issue the barrier. What is confusing
is that you cannot pass back where the error occured, that probably
needs some work.

--
Jens Axboe

2004-03-20 16:30:48

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

On Sat, 2004-03-20 at 11:23, Bartlomiej Zolnierkiewicz wrote:

> > > - why are we doing pre-flush?
> >
> > To ensure previously written data is on platter first.
>
> I know this, I want to know what for you are doing this?
>
> Previously written data is already acknowledgment to the upper layers so you
> can't do much even if you hit error on flush cache. IMO if error happens we
> should just check if failed sector is of our ordered write if not well report
> it and continue. It's cleaner and can give some (small?) performance gain.
>

The journaled filesystems need this. We need to make sure that before
we write the commit block for a transaction, all the previous log blocks
we're written are safely on media. Then we also need to make sure the
commit block is on media.

We end up with a log blocks, pre-flush, commit block, post-flush cycle,
which is what gives the proper transaction ordering on disk.

For data blocks we only need the post flush, which is why Jens made
blkdev_issue_flush skip the pre-flush.

-chris


Subject: Re: [PATCH] barrier patch set

On Saturday 20 of March 2004 17:32, Chris Mason wrote:
> On Sat, 2004-03-20 at 11:23, Bartlomiej Zolnierkiewicz wrote:
> > > > - why are we doing pre-flush?
> > >
> > > To ensure previously written data is on platter first.
> >
> > I know this, I want to know what for you are doing this?
> >
> > Previously written data is already acknowledgment to the upper layers so
> > you can't do much even if you hit error on flush cache. IMO if error
> > happens we should just check if failed sector is of our ordered write if
> > not well report it and continue. It's cleaner and can give some (small?)
> > performance gain.
>
> The journaled filesystems need this. We need to make sure that before
> we write the commit block for a transaction, all the previous log blocks
> we're written are safely on media. Then we also need to make sure the
> commit block is on media.

For low-level driver it shouldn't really matter whether sectors to be
written are the commit block for a transaction or the previous log blocks
and in the current implementation it does matter.

> We end up with a log blocks, pre-flush, commit block, post-flush cycle,
> which is what gives the proper transaction ordering on disk.

Jens, can you explain how this translates to the block layer?
If "log blocks" is a separate request from "commit block",
we can just do: log blocks, flush, commit block, flush cycle.

> For data blocks we only need the post flush, which is why Jens made
> blkdev_issue_flush skip the pre-flush.

Yes.

Thanks,
Bartlomiej

2004-03-20 17:07:42

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

On Sat, 2004-03-20 at 12:05, Bartlomiej Zolnierkiewicz wrote:
> On Saturday 20 of March 2004 17:32, Chris Mason wrote:
> > On Sat, 2004-03-20 at 11:23, Bartlomiej Zolnierkiewicz wrote:
> > > > > - why are we doing pre-flush?
> > The journaled filesystems need this. We need to make sure that before
> > we write the commit block for a transaction, all the previous log blocks
> > we're written are safely on media. Then we also need to make sure the
> > commit block is on media.
>
> For low-level driver it shouldn't really matter whether sectors to be
> written are the commit block for a transaction or the previous log blocks
> and in the current implementation it does matter.
>
As Jens said, it depends on how you define barrier ;-) I define it as
this io will be written after all the previous io and before any later
io. It was originally written with scsi tags in mind as well, the FS
side was the same for both.

In the end, I'm not that picky though, any reasonable setup that gets
the blocks on media is fine.

-chris


2004-03-20 18:51:58

by Helge Hafting

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

On Sat, Mar 20, 2004 at 01:02:39AM +0100, Bartlomiej Zolnierkiewicz wrote:
[...]
> There were reports that on some drives you can't disable write cache
> and even (?) that some drives lie (WC still enabled but marked as disabled).
>
I think the simple solution of not supporting data integrity properly
on such a broken disk is perfectly ok.

Helge Hafting

Subject: Re: [PATCH] barrier patch set

On Saturday 20 of March 2004 18:10, Chris Mason wrote:
> On Sat, 2004-03-20 at 12:05, Bartlomiej Zolnierkiewicz wrote:
> > On Saturday 20 of March 2004 17:32, Chris Mason wrote:
> > > On Sat, 2004-03-20 at 11:23, Bartlomiej Zolnierkiewicz wrote:
> > > > > > - why are we doing pre-flush?
> > >
> > > The journaled filesystems need this. We need to make sure that before
> > > we write the commit block for a transaction, all the previous log
> > > blocks we're written are safely on media. Then we also need to make
> > > sure the commit block is on media.
> >
> > For low-level driver it shouldn't really matter whether sectors to be
> > written are the commit block for a transaction or the previous log blocks
> > and in the current implementation it does matter.
>
> As Jens said, it depends on how you define barrier ;-) I define it as
> this io will be written after all the previous io and before any later
> io. It was originally written with scsi tags in mind as well, the FS
> side was the same for both.

Yes, thanks for explaining this.

I took a quick look at fs/jbd/ and now I think I understand the way barriers
currently work. I assume that SCSI handles barriers by ordered tags, right?

> In the end, I'm not that picky though, any reasonable setup that gets
> the blocks on media is fine.

Yep.

Regards,
Bartlomiej

2004-03-20 23:36:13

by Johannes Stezenbach

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

Bartlomiej Zolnierkiewicz wrote:
> On Saturday 20 of March 2004 12:36, Matthias Andree wrote:
> > > Correct answer is: everything is fine, RTFM (man hdparm). ;-)
> >
> > Not everything is fine. hdparm documents -i returns inconsistent data.
> > Most, but _NOT_ _EVERYTHING_ is cached: the multcount is updated, for
> > instance. What is that good for? Mix & Match whatever is convenient?
>
> I'm aware of this bug - driver shouldn't modify drive->id. Patches are welcomed.

Why? What's the reason for keeping out-of-date IDENTIFY data?

And what about ide_driveid_update()? Is it a bug that
it exists?

This is all too confusing for me :-(

Johannes

Subject: Re: [PATCH] barrier patch set

On Sunday 21 of March 2004 00:36, Johannes Stezenbach wrote:
> Bartlomiej Zolnierkiewicz wrote:
> > On Saturday 20 of March 2004 12:36, Matthias Andree wrote:
> > > > Correct answer is: everything is fine, RTFM (man hdparm). ;-)
> > >
> > > Not everything is fine. hdparm documents -i returns inconsistent data.
> > > Most, but _NOT_ _EVERYTHING_ is cached: the multcount is updated, for
> > > instance. What is that good for? Mix & Match whatever is convenient?
> >
> > I'm aware of this bug - driver shouldn't modify drive->id. Patches are
> > welcomed.
>
> Why? What's the reason for keeping out-of-date IDENTIFY data?

HDIO_GET_IDENTIFY ioctl which should die first.

> And what about ide_driveid_update()? Is it a bug that
> it exists?

It should just check/copy relevant bits from the new identify
(but without modifying drive->id).

> This is all too confusing for me :-(

Welcome in the wonderful world of IDE driver. ;-)

Bartlomiej

2004-03-21 09:43:30

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

On Sat, Mar 20 2004, Bartlomiej Zolnierkiewicz wrote:
> On Saturday 20 of March 2004 18:10, Chris Mason wrote:
> > On Sat, 2004-03-20 at 12:05, Bartlomiej Zolnierkiewicz wrote:
> > > On Saturday 20 of March 2004 17:32, Chris Mason wrote:
> > > > On Sat, 2004-03-20 at 11:23, Bartlomiej Zolnierkiewicz wrote:
> > > > > > > - why are we doing pre-flush?
> > > >
> > > > The journaled filesystems need this. We need to make sure that before
> > > > we write the commit block for a transaction, all the previous log
> > > > blocks we're written are safely on media. Then we also need to make
> > > > sure the commit block is on media.
> > >
> > > For low-level driver it shouldn't really matter whether sectors to be
> > > written are the commit block for a transaction or the previous log blocks
> > > and in the current implementation it does matter.
> >
> > As Jens said, it depends on how you define barrier ;-) I define it as
> > this io will be written after all the previous io and before any later
> > io. It was originally written with scsi tags in mind as well, the FS
> > side was the same for both.
>
> Yes, thanks for explaining this.
>
> I took a quick look at fs/jbd/ and now I think I understand the way barriers
> currently work. I assume that SCSI handles barriers by ordered tags, right?

That's the idea, yes. Needs some SCSI error handling work though, to
always ensure correct ordering.

--
Jens Axboe

2004-03-21 18:12:27

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

Bartlomiej Zolnierkiewicz wrote:
> They don't misreport! They comply with the spec (ATA-4 or ATA-5).
>
> Jeff, please RTFSPEC. ;-)

Arrrrgh. Yes, you're right, and I stand corrected.

One wonders why the bits appeared at all, then... I wonder if some ATA
drives _don't_ support mandatory flush-cache command? grumble.

Jeff



2004-03-22 11:09:42

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

Jens Axboe <[email protected]> wrote:
>
> A first release of a collected barrier patchset for 2.6.5-rc1-mm2.

The tagging of BIOs with set_buffer_ordered() or WRITE_BARRIER is a little
awkward.

Take the case of an ext2 fsync() or even an ext3 fsync() which frequently
will not trigger a commit. If we must perform the barrier by tagging the
final BIO, that will be tricky to implement. I could set some new field in
struct writeback_control and rework the mpage code, but working out "this
is the final BIO for this operation" is a fairly hard thing to do.
sys_sync() would require even more VFS surgery.

Generally, it would be much preferable to use the blkdev_issue_flush() API.
What is the status of that?

2004-03-22 11:10:45

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

On Mon, Mar 22 2004, Andrew Morton wrote:
> Jens Axboe <[email protected]> wrote:
> >
> > A first release of a collected barrier patchset for 2.6.5-rc1-mm2.
>
> The tagging of BIOs with set_buffer_ordered() or WRITE_BARRIER is a little
> awkward.
>
> Take the case of an ext2 fsync() or even an ext3 fsync() which frequently
> will not trigger a commit. If we must perform the barrier by tagging the
> final BIO, that will be tricky to implement. I could set some new field in
> struct writeback_control and rework the mpage code, but working out "this
> is the final BIO for this operation" is a fairly hard thing to do.
> sys_sync() would require even more VFS surgery.

Yeah, it's not very pretty if you have to track the last sumit. Chris
complained about that in 2.4 as well :-)

> Generally, it would be much preferable to use the blkdev_issue_flush()
> API. What is the status of that?

It'll be fully supported.

--
Jens Axboe

2004-03-22 11:15:32

by Matthias Andree

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

On Sat, 20 Mar 2004, Helge Hafting wrote:

> On Sat, Mar 20, 2004 at 01:02:39AM +0100, Bartlomiej Zolnierkiewicz wrote:
> [...]
> > There were reports that on some drives you can't disable write cache
> > and even (?) that some drives lie (WC still enabled but marked as disabled).
> >
> I think the simple solution of not supporting data integrity properly
> on such a broken disk is perfectly ok.

At least Linux should warn the user that his data is heading for doom.

--
Matthias Andree

Encrypt your mail: my GnuPG key ID is 0x052E7D95

2004-03-30 16:07:08

by Stephen C. Tweedie

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

Hi,

On Sat, 2004-03-20 at 17:05, Bartlomiej Zolnierkiewicz wrote:

> Jens, can you explain how this translates to the block layer?
> If "log blocks" is a separate request from "commit block",
> we can just do: log blocks, flush, commit block, flush cycle.

It's not so simple. There are two different operations here as far as
the fs is concerned --- ordering ("barrier"), and waiting ("flush").
For most journaling purposes, a filesystem really doesn't have to know
when data has hit disk --- it only needs to be assured that certain
ordering constraints will be obeyed. So in SCSI terms we can submit the
log blocks, then set an ORDERED queue tag on the commit block, and leave
it at that. We don't have to wait until that ORDERED tag actually
becomes persistent on disk --- we don't care, in most cases. This is
the "barrier", and in journaling, we basically need a barrier before and
after the commit block.

The tricky bit is that _sometimes_ we find, later on, that we need to
know when data has hit disk. If an application requests synchronised IO
(such as fsync, O_SYNC or O_DIRECT) then we cannot legally return until
the disk has told us that the data is persistent for certain. _That_ is
a full flush, and it's distinct from the simple ordering constraint that
we normally have for journal commit blocks.

The distinction becomes important if flushing and barriers have
significantly different performance characteristics. In the SCSI tagged
command case, we can insert an ORDERED queue tag into the pipeline
without having to wait for anything. But if we implement the barrier
and the flush by the same mechanism, then the fs ends up waiting; we get

write log blocks
wait for IO queue to empty
write commit block
wait for IO queue to empty

instead of

write log blocks
write commit block with ORDERED tag set

so the commits are miles slower.

Jens' patch happens to implement barriers via flushes on IDE, but at the
block layer blkdev_issue_flush() and set_buffer_ordered() are quite
distinct APIs.

--Stephen


2004-03-30 19:19:20

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

On Tue, 2004-03-30 at 11:04, Stephen C. Tweedie wrote:

[ barriers vs ordered writes ]

> The distinction becomes important if flushing and barriers have
> significantly different performance characteristics. In the SCSI tagged
> command case, we can insert an ORDERED queue tag into the pipeline
> without having to wait for anything. But if we implement the barrier
> and the flush by the same mechanism, then the fs ends up waiting; we get
>
> write log blocks
> wait for IO queue to empty
> write commit block
> wait for IO queue to empty
>
> instead of
>
> write log blocks
> write commit block with ORDERED tag set
>
> so the commits are miles slower.
>
I think we're mixing a few concepts together. submit_bh(WRITE_BARRIER,
bh) gives us an ordered write in whatever form the lower layers can
provide. It also ensures that if you happen to call wait_on_buffer()
for the barrier buffer, the wait won't return until the data is on
media.

The construct allows for the second type of commit you describe above,
where wait_on_buffer is not called on the log blocks until after the
commit block has been sent down the pipe. The reiserfs barrier patch
does this now.

The fact that IDE implements the barriers via a flush and that no other
layer has barriers right now is secondary ;) I do want to revive some
kind of ordering for scsi as well, but since IDE is a safety issue right
now I wanted to concentrate there first.

> Jens' patch happens to implement barriers via flushes on IDE, but at the
> block layer blkdev_issue_flush() and set_buffer_ordered() are quite
> distinct APIs.

Yes, they are completely different. blkdev_issue_flush is the kernel
recognizing that wait_on_buffer (or page or whatever) isn't always
enough to make sure writes are really done. It has no ordering
semantics at all, it just means "really write what you've already told
me is written, unless you promise via battery backed cache that it will
get to disk eventually"

(roughly)

-chris


2004-03-30 21:52:38

by Stephen C. Tweedie

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

Hi,

On Tue, 2004-03-30 at 20:19, Chris Mason wrote:


> I think we're mixing a few concepts together. submit_bh(WRITE_BARRIER,
> bh) gives us an ordered write in whatever form the lower layers can
> provide. It also ensures that if you happen to call wait_on_buffer()
> for the barrier buffer, the wait won't return until the data is on
> media.

Right, but that's just how it works right now --- one doesn't _have_ to
imply the other. You could easily imagine an implementation that
implements barriers and flushing separately, and which does not do
automatic flushing on completion of WRITE_BARRIER IOs. SCSI with
writeback caching enabled might be one example of that. NBD/DRBD would
be another likely candidate --- if you've got network latencies in the
way, then a flushing sync may be far more expensive than a barrier
propagation.

Unfortunately, a lot of the cases we care about really have to do the
barrier via flushing, so the benefit of keeping them separate is
limited. For LVM/raid0, for example, we've got no way of preserving
ordering between IOs on different drives, so a flush is necessary there
unless we start journaling the low-level IOs to preserve order.

> The fact that IDE implements the barriers via a flush and that no other
> layer has barriers right now is secondary ;) I do want to revive some
> kind of ordering for scsi as well, but since IDE is a safety issue right
> now I wanted to concentrate there first.

Right.

> Yes, they are completely different. blkdev_issue_flush is the kernel
> recognizing that wait_on_buffer (or page or whatever) isn't always
> enough to make sure writes are really done.

Yep. It scares me to think what performance characteristics we'll start
seeing once that gets used everywhere it's needed, though. If every raw
or O_DIRECT write needs a flush after it, databases are going to become
very sensitive to flush performance. I guess disabling the flushing and
using disks which tell the truth about data hitting the platter is the
sane answer there.

--Stephen

2004-03-30 22:14:05

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

On Tue, 2004-03-30 at 16:50, Stephen C. Tweedie wrote:
> Hi,
>
> On Tue, 2004-03-30 at 20:19, Chris Mason wrote:
>
>
> > I think we're mixing a few concepts together. submit_bh(WRITE_BARRIER,
> > bh) gives us an ordered write in whatever form the lower layers can
> > provide. It also ensures that if you happen to call wait_on_buffer()
> > for the barrier buffer, the wait won't return until the data is on
> > media.
>
> Right, but that's just how it works right now --- one doesn't _have_ to
> imply the other. You could easily imagine an implementation that
> implements barriers and flushing separately, and which does not do
> automatic flushing on completion of WRITE_BARRIER IOs. SCSI with
> writeback caching enabled might be one example of that. NBD/DRBD would
> be another likely candidate --- if you've got network latencies in the
> way, then a flushing sync may be far more expensive than a barrier
> propagation.
>
Yes, that's true, although the barriers don't really imply a flush, it
just implies that if you do use wait_on_* for flushing, it will report
things accurately.

> Unfortunately, a lot of the cases we care about really have to do the
> barrier via flushing, so the benefit of keeping them separate is
> limited. For LVM/raid0, for example, we've got no way of preserving
> ordering between IOs on different drives, so a flush is necessary there
> unless we start journaling the low-level IOs to preserve order.
>
Right.

> Yep. It scares me to think what performance characteristics we'll start
> seeing once that gets used everywhere it's needed, though. If every raw
> or O_DIRECT write needs a flush after it, databases are going to become
> very sensitive to flush performance. I guess disabling the flushing and
> using disks which tell the truth about data hitting the platter is the
> sane answer there.

Most database benchmarks are done on scsi, and the blkdev_flush should
be a noop there. For IDE based database and mail server benchmarks, the
results won't be pretty.

The reiserfs fsync code tries hard to only flush once, so if a commit is
done then blkdev_flush isn't called. We might have to do a few other
tricks to queue up multiple synchronous ios and only flush once.

-chris




2004-03-30 22:22:08

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

Stephen C. Tweedie wrote:
> Yep. It scares me to think what performance characteristics we'll start
> seeing once that gets used everywhere it's needed, though. If every raw
> or O_DIRECT write needs a flush after it, databases are going to become
> very sensitive to flush performance. I guess disabling the flushing and
> using disks which tell the truth about data hitting the platter is the
> sane answer there.


For IDE, O_DIRECT and O_SYNC can use special "FUA" commands, which don't
return until the data is on the platter.

Jeff



Subject: Re: [PATCH] barrier patch set

On Wednesday 31 of March 2004 00:21, Jeff Garzik wrote:
> Stephen C. Tweedie wrote:
> > Yep. It scares me to think what performance characteristics we'll start
> > seeing once that gets used everywhere it's needed, though. If every raw
> > or O_DIRECT write needs a flush after it, databases are going to become
> > very sensitive to flush performance. I guess disabling the flushing and
> > using disks which tell the truth about data hitting the platter is the
> > sane answer there.
>
> For IDE, O_DIRECT and O_SYNC can use special "FUA" commands, which don't
> return until the data is on the platter.

Do you know of any drive supporting it? I don't.

Bartlomiej

2004-03-30 22:39:48

by Chris Wedgwood

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set


> For IDE, O_DIRECT and O_SYNC can use special "FUA" commands, which
> don't return until the data is on the platter.

On modern drives how reliable is this? At one point disk-scrubbing
software which used FUA (to ensure data was being written to the
platters) showed that some drives completely ignore this.

Has the state of things changed significantly that we can assume this
is very rare or might we need to have to kind of whitelist/blacklist
system?


--cw

2004-03-30 22:39:57

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

Bartlomiej Zolnierkiewicz wrote:
> On Wednesday 31 of March 2004 00:21, Jeff Garzik wrote:
>
>>Stephen C. Tweedie wrote:
>>
>>>Yep. It scares me to think what performance characteristics we'll start
>>>seeing once that gets used everywhere it's needed, though. If every raw
>>>or O_DIRECT write needs a flush after it, databases are going to become
>>>very sensitive to flush performance. I guess disabling the flushing and
>>>using disks which tell the truth about data hitting the platter is the
>>>sane answer there.
>>
>>For IDE, O_DIRECT and O_SYNC can use special "FUA" commands, which don't
>>return until the data is on the platter.
>
>
> Do you know of any drive supporting it? I don't.


Newer SATAs do...

Jeff



2004-03-30 22:42:36

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

Chris Wedgwood wrote:
>>For IDE, O_DIRECT and O_SYNC can use special "FUA" commands, which
>>don't return until the data is on the platter.
>
>
> On modern drives how reliable is this? At one point disk-scrubbing
> software which used FUA (to ensure data was being written to the
> platters) showed that some drives completely ignore this.

I'm suspicious of this, because of Bart's point... I haven't seen any
PATA disks that did FUA, so it sounds like broken software.

Jeff



2004-03-30 22:44:59

by Chris Wedgwood

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

On Tue, Mar 30, 2004 at 05:39:26PM -0500, Jeff Garzik wrote:

> I'm suspicious of this, because of Bart's point... I haven't seen
> any PATA disks that did FUA, so it sounds like broken software.

I was kinda hoping that the response would be "all modern SATA and
PATA dirves" because ideally this would be a nice thing to have.
Maybe we should just test in a few places and see if it works (timing
will show obviously).


2004-03-31 14:04:28

by Stephen C. Tweedie

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

Hi,

On Tue, 2004-03-30 at 23:13, Chris Mason wrote:

> Most database benchmarks are done on scsi, and the blkdev_flush should
> be a noop there. For IDE based database and mail server benchmarks, the
> results won't be pretty.

Yep. I'm really not too worried about big database benchmarks -- those
are very much special cases, using rather specialised storage setup
(SCSI or FC, striped over lots of small disks rather than fewer large
ones.) I'm much more concerned about your average LAMP user's mysql
database, and how to keep performance sane on that.

> The reiserfs fsync code tries hard to only flush once, so if a commit is
> done then blkdev_flush isn't called. We might have to do a few other
> tricks to queue up multiple synchronous ios and only flush once.

Batching is really helpful when you've got lots of threads that can be
coalesced, yes. ext3 does that for things like mail servers. I'm not
sure whether the same tricks will apply to the various databases out
there, though.

--Stephen

2004-03-31 14:08:44

by Stephen C. Tweedie

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

Hi,

On Tue, 2004-03-30 at 23:21, Jeff Garzik wrote:

> For IDE, O_DIRECT and O_SYNC can use special "FUA" commands, which don't
> return until the data is on the platter.

fsync() is still really nasty, because that can require that we wait on
IO that was submitted by the VM before we knew that there was a
synchronous IO wait coming. SCSI also has an FUA bit that can make a
difference if you've got writeback caching enabled. (And FUA on read
can bypass drive writethrough caches, too, for media verification.)

--Stephen

2004-03-31 14:25:45

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

On Wed, 2004-03-31 at 09:03, Stephen C. Tweedie wrote:
> Hi,
>
> On Tue, 2004-03-30 at 23:13, Chris Mason wrote:
>
> > Most database benchmarks are done on scsi, and the blkdev_flush should
> > be a noop there. For IDE based database and mail server benchmarks, the
> > results won't be pretty.
>
> Yep. I'm really not too worried about big database benchmarks -- those
> are very much special cases, using rather specialised storage setup
> (SCSI or FC, striped over lots of small disks rather than fewer large
> ones.) I'm much more concerned about your average LAMP user's mysql
> database, and how to keep performance sane on that.
>
In some cases, it's going to be so much slower that it will look like
the old code wasn't writing the data at all. I don't think there's much
we can do about that.

> > The reiserfs fsync code tries hard to only flush once, so if a commit is
> > done then blkdev_flush isn't called. We might have to do a few other
> > tricks to queue up multiple synchronous ios and only flush once.
>
> Batching is really helpful when you've got lots of threads that can be
> coalesced, yes. ext3 does that for things like mail servers. I'm not
> sure whether the same tricks will apply to the various databases out
> there, though.

We can do better in general when there's more then one process doing an
fsync. reiserfs and ext3 both try to be smart about batching log
commits, but I think we could do more to streamline the data writes.

I'm playing with a few ideas, I'll post more when I've got real code to
back things up.

If there's only one process doing fsyncs, there's not much the kernel
can do except provide an aio fsync call.

-chris




2004-03-31 14:22:27

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

On Wed, 2004-03-31 at 09:08, Stephen C. Tweedie wrote:
> Hi,
>
> On Tue, 2004-03-30 at 23:21, Jeff Garzik wrote:
>
> > For IDE, O_DIRECT and O_SYNC can use special "FUA" commands, which don't
> > return until the data is on the platter.
>
> fsync() is still really nasty, because that can require that we wait on
> IO that was submitted by the VM before we knew that there was a
> synchronous IO wait coming.

Yes, it gets ugly in a hurry. Jeff, look at the whole thread about the
O_DIRECT read vs buffered write races. I don't think we can use FUA for
fsync or O_SYNC without using it for every write.

We might be able to get away with using it on O_DIRECT.

-chris


2004-03-31 18:29:14

by Ric Wheeler

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set


As one of the large users of the Jens and Chris's barrier support in
2.4, I am very motivated to help validate and benchmark the new version.

Stephen, if you have a specific mysql workload or usage case, I can try
to through that into the mix. We do a lot of Sleepycat DB testing,
would results from that help?

Ric


Chris Mason wrote:

>On Wed, 2004-03-31 at 09:03, Stephen C. Tweedie wrote:
>
>
>>Hi,
>>
>>On Tue, 2004-03-30 at 23:13, Chris Mason wrote:
>>
>>
>>
>>>Most database benchmarks are done on scsi, and the blkdev_flush should
>>>be a noop there. For IDE based database and mail server benchmarks, the
>>>results won't be pretty.
>>>
>>>
>>Yep. I'm really not too worried about big database benchmarks -- those
>>are very much special cases, using rather specialised storage setup
>>(SCSI or FC, striped over lots of small disks rather than fewer large
>>ones.) I'm much more concerned about your average LAMP user's mysql
>>database, and how to keep performance sane on that.
>>
>>
>>
>In some cases, it's going to be so much slower that it will look like
>the old code wasn't writing the data at all. I don't think there's much
>we can do about that.
>
>
>
>>>The reiserfs fsync code tries hard to only flush once, so if a commit is
>>>done then blkdev_flush isn't called. We might have to do a few other
>>>tricks to queue up multiple synchronous ios and only flush once.
>>>
>>>
>>Batching is really helpful when you've got lots of threads that can be
>>coalesced, yes. ext3 does that for things like mail servers. I'm not
>>sure whether the same tricks will apply to the various databases out
>>there, though.
>>
>>
>
>We can do better in general when there's more then one process doing an
>fsync. reiserfs and ext3 both try to be smart about batching log
>commits, but I think we could do more to streamline the data writes.
>
>I'm playing with a few ideas, I'll post more when I've got real code to
>back things up.
>
>If there's only one process doing fsyncs, there's not much the kernel
>can do except provide an aio fsync call.
>
>-chris
>
>
>
>
>-
>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>the body of a message to [email protected]
>More majordomo info at http://vger.kernel.org/majordomo-info.html
>Please read the FAQ at http://www.tux.org/lkml/
>
>

2004-03-31 21:32:30

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

Chris Mason wrote:
> On Wed, 2004-03-31 at 09:08, Stephen C. Tweedie wrote:
>
>>Hi,
>>
>>On Tue, 2004-03-30 at 23:21, Jeff Garzik wrote:
>>
>>
>>>For IDE, O_DIRECT and O_SYNC can use special "FUA" commands, which don't
>>>return until the data is on the platter.
>>
>>fsync() is still really nasty, because that can require that we wait on
>>IO that was submitted by the VM before we knew that there was a
>>synchronous IO wait coming.
>
>
> Yes, it gets ugly in a hurry. Jeff, look at the whole thread about the
> O_DIRECT read vs buffered write races. I don't think we can use FUA for

Yes, I'm aware of the thread...


> fsync or O_SYNC without using it for every write.

Why not for O_SYNC? Is some crazy userspace application flipping this
bit on and off rapidly?


> We might be able to get away with using it on O_DIRECT.

Nod.

Jeff



2004-03-31 21:33:39

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

Stephen C. Tweedie wrote:
> Hi,
>
> On Tue, 2004-03-30 at 23:21, Jeff Garzik wrote:
>
>
>>For IDE, O_DIRECT and O_SYNC can use special "FUA" commands, which don't
>>return until the data is on the platter.
>
>
> fsync() is still really nasty, because that can require that we wait on
> IO that was submitted by the VM before we knew that there was a
> synchronous IO wait coming. SCSI also has an FUA bit that can make a
> difference if you've got writeback caching enabled. (And FUA on read
> can bypass drive writethrough caches, too, for media verification.)

Agreed, but I did not mention fsync(), since that would not be
appropriate to FUA like O_DIRECT or O_SYNC might be.

Jeff




2004-03-31 22:09:18

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH] barrier patch set

On Wed, 2004-03-31 at 16:26, Jeff Garzik wrote:
>
> > Yes, it gets ugly in a hurry. Jeff, look at the whole thread about the
> > O_DIRECT read vs buffered write races. I don't think we can use FUA for
>
> Yes, I'm aware of the thread...
>
>
> > fsync or O_SYNC without using it for every write.
>
> Why not for O_SYNC? Is some crazy userspace application flipping this
> bit on and off rapidly?
>

For both fsync and O_SYNC, the pages we want to write synchronously are
also available for some other part of the kernel to write async. Since
we do know the write is going to be O_SYNC when we are marking the pages
dirty, we could mark them dirty_fua or something as well.

Even assuming we can deal with the data=ordered ext3/reiserfs issues, it
makes the writeback for O_SYNC yet another corner case to check, and one
where we have no useful way to make sure the fua bit really got set on
all the writes for a given O_SYNC (unless we pin the page and check each
one after the writes are complete).

Since O_DIRECT is much less complex I think we should start there.

-chris