2010-06-01 10:30:47

by Christof Schmitt

Subject: Re: Wrong DIF guard tag on ext2 write

On Mon, May 31, 2010 at 06:30:05PM +0300, Boaz Harrosh wrote:
> On 05/31/2010 06:01 PM, James Bottomley wrote:
> > On Mon, 2010-05-31 at 10:20 -0400, Martin K. Petersen wrote:
> >>>>>>> "Christof" == Christof Schmitt <[email protected]> writes:
> >>
> >> Christof> Since the guard tags are created in Linux, it seems that the
> >> Christof> data attached to the write request changes between the
> >> Christof> generation in bio_integrity_generate and the call to
> >> Christof> sd_prep_fn.
> >>
> >> Yep, known bug. Page writeback locking is messed up for buffer_head
> >> users. The extNfs folks volunteered to look into this a while back but
> >> I don't think they have found the time yet.
> >>
> >>
> >> Christof> Using ext3 or ext4 instead of ext2 does not show the problem.
> >>
> >> Last I looked there were still code paths in ext3 and ext4 that
> >> permitted pages to be changed during flight. I guess you've just been
> >> lucky.
> >
> > Pages have always been modifiable in flight. The OS guarantees they'll
> > be rewritten, so the drivers can drop them if it detects the problem.
> > This is identical to the iscsi checksum issue (iscsi adds a checksum
> > because it doesn't trust TCP/IP and if the checksum is generated in
> > software, there's time between generation and page transmission for the
> > alteration to occur). The solution in the iscsi case was not to
> > complain if the page is still marked dirty.
> >
>
> And also why RAID1 and RAID4/5/6 need the data bounced. I wish VFS
> would prevent data writing given a device queue flag that requests
> it. So all these devices and modes could just flag the VFS/filesystems
> that: "please don't allow concurrent writes, otherwise I need to copy data"
>
> From what Chris Mason has said before, all the mechanics are there, and it's
> what btrfs is doing. Though I don't know how myself?

I also tested with btrfs and invalid guard tags in writes have been
encountered as well (again in 2.6.34). The only difference is that no
error was reported to userspace, although this might be a
configuration issue.

What is the best strategy to continue with the invalid guard tags on
write requests? Should this be fixed in the filesystems?

Another idea would be to pass invalid guard tags on write requests
down to the hardware, expect an "invalid guard tag" error and report
it to the block layer where a new checksum is generated and the
request is issued again. Basically implement a retry through the whole
I/O stack. But this also sounds complicated.

--
Christof Schmitt


2010-06-01 10:50:02

by Boaz Harrosh

Subject: Re: Wrong DIF guard tag on ext2 write

On 06/01/2010 01:30 PM, Christof Schmitt wrote:
> On Mon, May 31, 2010 at 06:30:05PM +0300, Boaz Harrosh wrote:
>> On 05/31/2010 06:01 PM, James Bottomley wrote:
>>> On Mon, 2010-05-31 at 10:20 -0400, Martin K. Petersen wrote:
>>>>>>>>> "Christof" == Christof Schmitt <[email protected]> writes:
>>>>
>>>> Christof> Since the guard tags are created in Linux, it seems that the
>>>> Christof> data attached to the write request changes between the
>>>> Christof> generation in bio_integrity_generate and the call to
>>>> Christof> sd_prep_fn.
>>>>
>>>> Yep, known bug. Page writeback locking is messed up for buffer_head
>>>> users. The extNfs folks volunteered to look into this a while back but
>>>> I don't think they have found the time yet.
>>>>
>>>>
>>>> Christof> Using ext3 or ext4 instead of ext2 does not show the problem.
>>>>
>>>> Last I looked there were still code paths in ext3 and ext4 that
>>>> permitted pages to be changed during flight. I guess you've just been
>>>> lucky.
>>>
>>> Pages have always been modifiable in flight. The OS guarantees they'll
>>> be rewritten, so the drivers can drop them if it detects the problem.
>>> This is identical to the iscsi checksum issue (iscsi adds a checksum
>>> because it doesn't trust TCP/IP and if the checksum is generated in
>>> software, there's time between generation and page transmission for the
>>> alteration to occur). The solution in the iscsi case was not to
>>> complain if the page is still marked dirty.
>>>
>>
>> And also why RAID1 and RAID4/5/6 need the data bounced. I wish VFS
>> would prevent data writing given a device queue flag that requests
>> it. So all these devices and modes could just flag the VFS/filesystems
>> that: "please don't allow concurrent writes, otherwise I need to copy data"
>>
>> From what Chris Mason has said before, all the mechanics are there, and it's
>> what btrfs is doing. Though I don't know how myself?
>
> I also tested with btrfs and invalid guard tags in writes have been
> encountered as well (again in 2.6.34). The only difference is that no
> error was reported to userspace, although this might be a
> configuration issue.
>

I think in btrfs you need a raid1/5 multi-device configuration for this
to work. If you use a single device then it is just like ext4.

BTW: you could use DM or MD and it will guard your DIF by copying the
data before IO.

> What is the best strategy to continue with the invalid guard tags on
> write requests? Should this be fixed in the filesystems?
>
> Another idea would be to pass invalid guard tags on write requests
> down to the hardware, expect an "invalid guard tag" error and report
> it to the block layer where a new checksum is generated and the
> request is issued again. Basically implement a retry through the whole
> I/O stack. But this also sounds complicated.
>

I suggest we talk about this issue at the upcoming LSF, because it
affects not only DIF but any checksum subsystem. And it could enhance
Linux RAID performance.

> --
> Christof Schmitt

Boaz

2010-06-01 13:05:37

by Chris Mason

Subject: Re: Wrong DIF guard tag on ext2 write

On Tue, Jun 01, 2010 at 12:30:42PM +0200, Christof Schmitt wrote:
> On Mon, May 31, 2010 at 06:30:05PM +0300, Boaz Harrosh wrote:
> > On 05/31/2010 06:01 PM, James Bottomley wrote:
> > > On Mon, 2010-05-31 at 10:20 -0400, Martin K. Petersen wrote:
> > >>>>>>> "Christof" == Christof Schmitt <[email protected]> writes:
> > >>
> > >> Christof> Since the guard tags are created in Linux, it seems that the
> > >> Christof> data attached to the write request changes between the
> > >> Christof> generation in bio_integrity_generate and the call to
> > >> Christof> sd_prep_fn.
> > >>
> > >> Yep, known bug. Page writeback locking is messed up for buffer_head
> > >> users. The extNfs folks volunteered to look into this a while back but
> > >> I don't think they have found the time yet.
> > >>
> > >>
> > >> Christof> Using ext3 or ext4 instead of ext2 does not show the problem.
> > >>
> > >> Last I looked there were still code paths in ext3 and ext4 that
> > >> permitted pages to be changed during flight. I guess you've just been
> > >> lucky.
> > >
> > > Pages have always been modifiable in flight. The OS guarantees they'll
> > > be rewritten, so the drivers can drop them if it detects the problem.
> > > This is identical to the iscsi checksum issue (iscsi adds a checksum
> > > because it doesn't trust TCP/IP and if the checksum is generated in
> > > software, there's time between generation and page transmission for the
> > > alteration to occur). The solution in the iscsi case was not to
> > > complain if the page is still marked dirty.
> > >
> >
> > And also why RAID1 and RAID4/5/6 need the data bounced. I wish VFS
> > would prevent data writing given a device queue flag that requests
> > it. So all these devices and modes could just flag the VFS/filesystems
> > that: "please don't allow concurrent writes, otherwise I need to copy data"
> >
> > From what Chris Mason has said before, all the mechanics are there, and it's
> > what btrfs is doing. Though I don't know how myself?
>
> I also tested with btrfs and invalid guard tags in writes have been
> encountered as well (again in 2.6.34). The only difference is that no
> error was reported to userspace, although this might be a
> configuration issue.

This would be a btrfs bug. We have strict checks in place that are
supposed to prevent buffers changing while in flight. What was the
workload that triggered this problem?

>
> What is the best strategy to continue with the invalid guard tags on
> write requests? Should this be fixed in the filesystems?
>

Long term, I think the filesystems shouldn't be changing pages in
flight. Bouncing just hurts way too much.

-chris

2010-06-01 13:28:05

by James Bottomley

Subject: Re: Wrong DIF guard tag on ext2 write

On Tue, 2010-06-01 at 12:30 +0200, Christof Schmitt wrote:
> What is the best strategy to continue with the invalid guard tags on
> write requests? Should this be fixed in the filesystems?

For write requests, as long as the page dirty bit is still set, it's
safe to drop the request, since it's already going to be repeated. What
we probably want is an error code we can return that the layer that sees
both the request and the page flags can make the call.

> Another idea would be to pass invalid guard tags on write requests
> down to the hardware, expect an "invalid guard tag" error and report
> it to the block layer where a new checksum is generated and the
> request is issued again. Basically implement a retry through the whole
> I/O stack. But this also sounds complicated.

No, no ... as long as the guard tag is wrong because the fs changed the
page, the write request for the updated page will already be queued or
in-flight, so there's no need to retry. We still have to pass checksum
failures on in case the data changed because of some HW (or SW) cockup.
The check for this is page dirty. If we get a checksum error back and
the page is still clean, we know nothing in the OS changed it, therefore
it's a real bit flip error.
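
A minimal sketch of that check, assuming a hypothetical completion-time
hook that can see both the failed write and the page it covered (the
helper name and calling context are invented):

#include <linux/mm.h>
#include <linux/errno.h>

/*
 * Hypothetical end_io helper, purely illustrative: decide whether a
 * guard tag failure on a write was self-induced (the fs redirtied the
 * page while it was in flight) or a genuine corruption.
 */
static int guard_error_disposition(struct page *page)
{
	if (PageDirty(page)) {
		/*
		 * Redirtied after the guard was generated; the write
		 * will be repeated anyway, so the error can be dropped.
		 */
		return 0;
	}

	/* Page is clean: nothing in the OS changed it -> real bit flip. */
	return -EIO;
}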

James

2010-06-01 13:35:11

by Chris Mason

Subject: Re: Wrong DIF guard tag on ext2 write

On Tue, Jun 01, 2010 at 01:27:56PM +0000, James Bottomley wrote:
> On Tue, 2010-06-01 at 12:30 +0200, Christof Schmitt wrote:
> > What is the best strategy to continue with the invalid guard tags on
> > write requests? Should this be fixed in the filesystems?
>
> For write requests, as long as the page dirty bit is still set, it's
> safe to drop the request, since it's already going to be repeated. What
> we probably want is an error code we can return that the layer that sees
> both the request and the page flags can make the call.

I'm afraid this isn't entirely true. The FS tends to do this:

change the page
<---------> truck sized race right here where the page is clean
mark the page dirty

-chris

2010-06-01 13:40:44

by James Bottomley

Subject: Re: Wrong DIF guard tag on ext2 write

On Tue, 2010-06-01 at 09:33 -0400, Chris Mason wrote:
> On Tue, Jun 01, 2010 at 01:27:56PM +0000, James Bottomley wrote:
> > On Tue, 2010-06-01 at 12:30 +0200, Christof Schmitt wrote:
> > > What is the best strategy to continue with the invalid guard tags on
> > > write requests? Should this be fixed in the filesystems?
> >
> > For write requests, as long as the page dirty bit is still set, it's
> > safe to drop the request, since it's already going to be repeated. What
> > we probably want is an error code we can return that the layer that sees
> > both the request and the page flags can make the call.
>
> I'm afraid this isn't entirely true. The FS tends to do this:
>
> change the page
> <---------> truck sized race right here where the page is clean
> mark the page dirty

Would it be too much work in the fs to mark the page dirty before you
begin altering it (and again after you finish, just in case some cleaner
noticed and initiated a write)? Or some other flag that indicates page
under modification? All the process controlling the writeout (which is
pretty high up in the stack) needs to know is if we triggered the check
error by altering the page while it was in flight.
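
Roughly, the ordering being proposed, as a sketch only (lock_page() and
set_page_dirty() are real calls, the wrapper itself is hypothetical):

#include <linux/mm.h>
#include <linux/pagemap.h>

/* Hypothetical wrapper illustrating the "dirty before and after" idea. */
static void modify_page_marked_dirty(struct page *page,
				     void (*modify)(struct page *))
{
	lock_page(page);
	set_page_dirty(page);	/* dirty *before* touching the data */
	modify(page);
	set_page_dirty(page);	/* and again after, in case a cleaner
				 * noticed the first one and started I/O */
	unlock_page(page);
}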

I agree that a block based retry would close all the holes ... it just
doesn't look elegant to me that the fs will already be repeating the I/O
if it changed the page and so will block.

James

2010-06-01 13:50:28

by Chris Mason

Subject: Re: Wrong DIF guard tag on ext2 write

On Tue, Jun 01, 2010 at 08:40:37AM -0500, James Bottomley wrote:
> On Tue, 2010-06-01 at 09:33 -0400, Chris Mason wrote:
> > On Tue, Jun 01, 2010 at 01:27:56PM +0000, James Bottomley wrote:
> > > On Tue, 2010-06-01 at 12:30 +0200, Christof Schmitt wrote:
> > > > What is the best strategy to continue with the invalid guard tags on
> > > > write requests? Should this be fixed in the filesystems?
> > >
> > > For write requests, as long as the page dirty bit is still set, it's
> > > safe to drop the request, since it's already going to be repeated. What
> > > we probably want is an error code we can return that the layer that sees
> > > both the request and the page flags can make the call.
> >
> > I'm afraid this isn't entirely true. The FS tends to do this:
> >
> > change the page
> > <---------> truck sized race right here where the page is clean
> > mark the page dirty
>
> Would it be too much work in the fs to mark the page dirty before you
> begin altering it (and again after you finish, just in case some cleaner
> noticed and initiated a write)? Or some other flag that indicates page
> under modification? All the process controlling the writeout (which is
> pretty high up in the stack) needs to know is if we triggered the check
> error by altering the page while it was in flight.

I expect that once we went down that path we would end up waiting for
the IO to finish before changing the page. Maybe there is a less
complex way, but I sure didn't see it.

>
> I agree that a block based retry would close all the holes ... it just
> doesn't look elegant to me that the fs will already be repeating the I/O
> if it changed the page and so will block.

We might not ever repeat the IO. We might change the page, write it,
change it again, truncate the file and toss the page completely.

-chris

2010-06-01 13:51:05

by Christof Schmitt

Subject: Re: Wrong DIF guard tag on ext2 write

On Tue, Jun 01, 2010 at 09:03:25AM -0400, Chris Mason wrote:
> On Tue, Jun 01, 2010 at 12:30:42PM +0200, Christof Schmitt wrote:
> > On Mon, May 31, 2010 at 06:30:05PM +0300, Boaz Harrosh wrote:
> > > On 05/31/2010 06:01 PM, James Bottomley wrote:
> > > > On Mon, 2010-05-31 at 10:20 -0400, Martin K. Petersen wrote:
> > > >>>>>>> "Christof" == Christof Schmitt <[email protected]> writes:
> > > >>
> > > >> Christof> Since the guard tags are created in Linux, it seems that the
> > > >> Christof> data attached to the write request changes between the
> > > >> Christof> generation in bio_integrity_generate and the call to
> > > >> Christof> sd_prep_fn.
> > > >>
> > > >> Yep, known bug. Page writeback locking is messed up for buffer_head
> > > >> users. The extNfs folks volunteered to look into this a while back but
> > > >> I don't think they have found the time yet.
> > > >>
> > > >>
> > > >> Christof> Using ext3 or ext4 instead of ext2 does not show the problem.
> > > >>
> > > >> Last I looked there were still code paths in ext3 and ext4 that
> > > >> permitted pages to be changed during flight. I guess you've just been
> > > >> lucky.
> > > >
> > > > Pages have always been modifiable in flight. The OS guarantees they'll
> > > > be rewritten, so the drivers can drop them if it detects the problem.
> > > > This is identical to the iscsi checksum issue (iscsi adds a checksum
> > > > because it doesn't trust TCP/IP and if the checksum is generated in
> > > > software, there's time between generation and page transmission for the
> > > > alteration to occur). The solution in the iscsi case was not to
> > > > complain if the page is still marked dirty.
> > > >
> > >
> > > And also why RAID1 and RAID4/5/6 need the data bounced. I wish VFS
> > > would prevent data writing given a device queue flag that requests
> > > it. So all these devices and modes could just flag the VFS/filesystems
> > > that: "please don't allow concurrent writes, otherwise I need to copy data"
> > >
> > > From what Chris Mason has said before, all the mechanics are there, and it's
> > > what btrfs is doing. Though I don't know how myself?
> >
> > I also tested with btrfs and invalid guard tags in writes have been
> > encountered as well (again in 2.6.34). The only difference is that no
> > error was reported to userspace, although this might be a
> > configuration issue.
>
> This would be a btrfs bug. We have strict checks in place that are
> supposed to prevent buffers changing while in flight. What was the
> workload that triggered this problem?

I am running an internal test tool that creates files with a known
pattern until the disk is full, reads the data to verify if the
pattern is still intact, removes the files and starts over.

# uname -r
2.6.34

# mkfs.btrfs /dev/sda

WARNING! - Btrfs Btrfs v0.19 IS EXPERIMENTAL
WARNING! - see http://btrfs.wiki.kernel.org before using

fs created label (null) on /dev/sda
nodesize 4096 leafsize 4096 sectorsize 4096 size 8.00GB
Btrfs Btrfs v0.19

# mount /dev/sda /mnt/esslv0/

The kernel log shows:

Jun 1 11:11:53 R78_4E19_ESAME26_40_1 kernel: device fsid d260dc729d784bf0-9d097091e01ec932 devid 1 transid 7 /dev/sda
Jun 1 11:23:26 R78_4E19_ESAME26_40_1 kernel: no space left, need 4096, 0 delalloc bytes, 7701659648 bytes_used, 0 bytes_reserved, 0 bytes_pinned, 0 bytes_readonly, 0 may use 7701659648 total

I guess this is to be expected, since the test tool keeps writing
until there is no space left.

Jun 1 11:25:10 R78_4E19_ESAME26_40_1 kernel: sd tag mismatch, lba 0x12480, dif tag c8c0, computed tag dbe0
Jun 1 11:25:10 R78_4E19_ESAME26_40_1 kernel: sd tag mismatch, lba 0x12480, dif tag 51d, computed tag 5861
Jun 1 11:25:10 R78_4E19_ESAME26_40_1 kernel: sd tag mismatch, lba 0x12480, dif tag 3693, computed tag c52e
Jun 1 11:25:10 R78_4E19_ESAME26_40_1 kernel: sd tag mismatch, lba 0x12480, dif tag ff8a, computed tag 4ac3
Jun 1 11:25:10 R78_4E19_ESAME26_40_1 kernel: sd tag mismatch, lba 0x12480, dif tag b337, computed tag 3840
Jun 1 11:25:10 R78_4E19_ESAME26_40_1 kernel: sd tag mismatch, lba 0x12480, dif tag fbed, computed tag 746a
Jun 1 11:25:10 R78_4E19_ESAME26_40_1 kernel: sd tag mismatch, lba 0x12480, dif tag 2bfc, computed tag 263e
Jun 1 11:25:10 R78_4E19_ESAME26_40_1 kernel: sd tag mismatch, lba 0x12480, dif tag c0a4, computed tag 3942

This is the output from my debug patch, where I see the mismatch
between the data and the provided guard tags in the block integrity
data.

Jun 1 11:25:10 R78_4E19_ESAME26_40_1 kernel: sd 0:0:13:1077952528: [sda] Host Data Integrity Failure
Jun 1 11:25:10 R78_4E19_ESAME26_40_1 kernel: sd 0:0:13:1077952528: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 1 11:25:10 R78_4E19_ESAME26_40_1 kernel: sd 0:0:13:1077952528: [sda] Sense Key : Illegal Request [current] [descriptor]
Jun 1 11:25:10 R78_4E19_ESAME26_40_1 kernel: sd 0:0:13:1077952528: [sda] Add. Sense: Logical block guard check failed
Jun 1 11:25:10 R78_4E19_ESAME26_40_1 kernel: sd 0:0:13:1077952528: [sda] CDB: Write(10): 2a 20 00 01 24 80 00 00 08 00
Jun 1 11:25:10 R78_4E19_ESAME26_40_1 kernel: end_request: I/O error, dev sda, sector 74880

As expected, the controller detects the checksum mismatch in the same
request and this is reported as an error for the request. But the
running application is not affected.
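
For reference, the WRITE(10) CDB above decodes as follows, assuming
512-byte logical blocks (which matches the sector number in the
end_request line):

    2a 20 00 01 24 80 00 00 08 00
    opcode 0x2a = WRITE(10), WRPROTECT = 1 (byte 1 = 0x20),
    LBA = 0x00012480 = 74880, transfer length = 8 blocks,
    i.e. byte offset 74880 * 512 = 38338560 on the device.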

I can start this again to get more data, but for now, this is what I
see.

--
Christof Schmitt

2010-06-01 13:52:20

by Martin K. Petersen

Subject: Re: Wrong DIF guard tag on ext2 write

>>>>> "James" == James Bottomley <[email protected]> writes:

James> Would it be too much work in the fs to mark the page dirty before
James> you begin altering it (and again after you finish, just in case
James> some cleaner noticed and initiated a write)? Or some other flag
James> that indicates page under modification? All the process
James> controlling the writeout (which is pretty high up in the stack)
James> needs to know is if we triggered the check error by altering the
James> page while it was in flight.

James> I agree that a block based retry would close all the holes ... it
James> just doesn't look elegant to me that the fs will already be
James> repeating the I/O if it changed the page and so will block.

I experimented with this approach a while back. However, I quickly got
into a situation where frequently updated blocks never made it to disk
because the page was constantly being updated. And all writes failed
with a guard tag error.

--
Martin K. Petersen Oracle Linux Engineering

2010-06-01 13:59:24

by Chris Mason

Subject: Re: Wrong DIF guard tag on ext2 write

On Tue, Jun 01, 2010 at 03:50:59PM +0200, Christof Schmitt wrote:
> On Tue, Jun 01, 2010 at 09:03:25AM -0400, Chris Mason wrote:
> > On Tue, Jun 01, 2010 at 12:30:42PM +0200, Christof Schmitt wrote:
> > > On Mon, May 31, 2010 at 06:30:05PM +0300, Boaz Harrosh wrote:
> > > > On 05/31/2010 06:01 PM, James Bottomley wrote:
> > > > > On Mon, 2010-05-31 at 10:20 -0400, Martin K. Petersen wrote:
> > > > >>>>>>> "Christof" == Christof Schmitt <[email protected]> writes:
> > > > >>
> > > > >> Christof> Since the guard tags are created in Linux, it seems that the
> > > > >> Christof> data attached to the write request changes between the
> > > > >> Christof> generation in bio_integrity_generate and the call to
> > > > >> Christof> sd_prep_fn.
> > > > >>
> > > > >> Yep, known bug. Page writeback locking is messed up for buffer_head
> > > > >> users. The extNfs folks volunteered to look into this a while back but
> > > > >> I don't think they have found the time yet.
> > > > >>
> > > > >>
> > > > >> Christof> Using ext3 or ext4 instead of ext2 does not show the problem.
> > > > >>
> > > > >> Last I looked there were still code paths in ext3 and ext4 that
> > > > >> permitted pages to be changed during flight. I guess you've just been
> > > > >> lucky.
> > > > >
> > > > > Pages have always been modifiable in flight. The OS guarantees they'll
> > > > > be rewritten, so the drivers can drop them if it detects the problem.
> > > > > This is identical to the iscsi checksum issue (iscsi adds a checksum
> > > > > because it doesn't trust TCP/IP and if the checksum is generated in
> > > > > software, there's time between generation and page transmission for the
> > > > > alteration to occur). The solution in the iscsi case was not to
> > > > > complain if the page is still marked dirty.
> > > > >
> > > >
> > > > And also why RAID1 and RAID4/5/6 need the data bounced. I wish VFS
> > > > would prevent data writing given a device queue flag that requests
> > > > it. So all these devices and modes could just flag the VFS/filesystems
> > > > that: "please don't allow concurrent writes, otherwise I need to copy data"
> > > >
> > > > From what Chris Mason has said before, all the mechanics are there, and it's
> > > > what btrfs is doing. Though I don't know how myself?
> > >
> > > I also tested with btrfs and invalid guard tags in writes have been
> > > encountered as well (again in 2.6.34). The only difference is that no
> > > error was reported to userspace, although this might be a
> > > configuration issue.
> >
> > This would be a btrfs bug. We have strict checks in place that are
> > supposed to prevent buffers changing while in flight. What was the
> > workload that triggered this problem?
>
> I am running an internal test tool that creates files with a known
> pattern until the disk is full, reads the data to verify if the
> pattern is still intact, removes the files and starts over.

Ok, is the lba in the output the sector offset? We can map that to a
btrfs block and figure out what it was.

Btrfs never complains about the IO error? We really should explode.

-chris

2010-06-01 14:27:04

by Nick Piggin

Subject: Re: Wrong DIF guard tag on ext2 write

On Tue, Jun 01, 2010 at 03:50:59PM +0200, Christof Schmitt wrote:
> On Tue, Jun 01, 2010 at 09:03:25AM -0400, Chris Mason wrote:
> > On Tue, Jun 01, 2010 at 12:30:42PM +0200, Christof Schmitt wrote:
> > > On Mon, May 31, 2010 at 06:30:05PM +0300, Boaz Harrosh wrote:
> > > > On 05/31/2010 06:01 PM, James Bottomley wrote:
> > > > > On Mon, 2010-05-31 at 10:20 -0400, Martin K. Petersen wrote:
> > > > >>>>>>> "Christof" == Christof Schmitt <[email protected]> writes:
> > > > >>
> > > > >> Christof> Since the guard tags are created in Linux, it seems that the
> > > > >> Christof> data attached to the write request changes between the
> > > > >> Christof> generation in bio_integrity_generate and the call to
> > > > >> Christof> sd_prep_fn.
> > > > >>
> > > > >> Yep, known bug. Page writeback locking is messed up for buffer_head
> > > > >> users. The extNfs folks volunteered to look into this a while back but
> > > > >> I don't think they have found the time yet.
> > > > >>
> > > > >>
> > > > >> Christof> Using ext3 or ext4 instead of ext2 does not show the problem.
> > > > >>
> > > > >> Last I looked there were still code paths in ext3 and ext4 that
> > > > >> permitted pages to be changed during flight. I guess you've just been
> > > > >> lucky.
> > > > >
> > > > > Pages have always been modifiable in flight. The OS guarantees they'll
> > > > > be rewritten, so the drivers can drop them if it detects the problem.
> > > > > This is identical to the iscsi checksum issue (iscsi adds a checksum
> > > > > because it doesn't trust TCP/IP and if the checksum is generated in
> > > > > software, there's time between generation and page transmission for the
> > > > > alteration to occur). The solution in the iscsi case was not to
> > > > > complain if the page is still marked dirty.
> > > > >
> > > >
> > > > And also why RAID1 and RAID4/5/6 need the data bounced. I wish VFS
> > > > would prevent data writing given a device queue flag that requests
> > > > it. So all these devices and modes could just flag the VFS/filesystems
> > > > that: "please don't allow concurrent writes, otherwise I need to copy data"
> > > >
> > > > From what Chris Mason has said before, all the mechanics are there, and it's
> > > > what btrfs is doing. Though I don't know how myself?
> > >
> > > I also tested with btrfs and invalid guard tags in writes have been
> > > encountered as well (again in 2.6.34). The only difference is that no
> > > error was reported to userspace, although this might be a
> > > configuration issue.
> >
> > This would be a btrfs bug. We have strict checks in place that are
> > supposed to prevent buffers changing while in flight. What was the
> > workload that triggered this problem?
>
> I am running an internal test tool that creates files with a known
> pattern until the disk is full, reads the data to verify if the
> pattern is still intact, removes the files and starts over.

How are the checks done? The lock_page and wait_on_page_writeback from
prepare_pages?

Looks OK but AFAIKS you would need to unmap_mapping_range() the range
after taking the locks. You are also screwed if the page is in the
process of being modified by a get_user_pages() caller. You'd have to
add VM_IO to vm_flags to prevent get_user_pages.

Direct-IO is always going to have the same problems, so it's not like
block based solutions can just pretend it will easily go away.
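
For reference, the pattern being discussed looks roughly like this (a
sketch only, not the actual btrfs code, with unmap_mapping_range() added
as suggested above):

#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Sketch: lock the page and wait out any writeback, then zap user
 * mappings of the range so mmap writers fault instead of scribbling on
 * the page while it is under I/O.
 */
static void stabilize_page_for_io(struct address_space *mapping,
				  struct page *page)
{
	loff_t off = (loff_t)page->index << PAGE_CACHE_SHIFT;

	lock_page(page);
	wait_on_page_writeback(page);
	unmap_mapping_range(mapping, off, PAGE_CACHE_SIZE, 1);
}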

2010-06-01 14:28:27

by Nick Piggin

Subject: Re: Wrong DIF guard tag on ext2 write

On Tue, Jun 01, 2010 at 09:50:29AM -0400, Martin K. Petersen wrote:
> >>>>> "James" == James Bottomley <[email protected]> writes:
>
> James> Would it be too much work in the fs to mark the page dirty before
> James> you begin altering it (and again after you finish, just in case
> James> some cleaner noticed and initiated a write)? Or some other flag
> James> that indicates page under modification? All the process
> James> controlling the writeout (which is pretty high up in the stack)
> James> needs to know is if we triggered the check error by altering the
> James> page while it was in flight.
>
> James> I agree that a block based retry would close all the holes ... it
> James> just doesn't look elegant to me that the fs will already be
> James> repeating the I/O if it changed the page and so will block.
>
> I experimented with this approach a while back. However, I quickly got
> into a situation where frequently updated blocks never made it to disk
> because the page was constantly being updated. And all writes failed
> with a guard tag error.

What if you bounce in the case of a first guard error?

2010-06-01 14:32:17

by James Bottomley

Subject: Re: Wrong DIF guard tag on ext2 write

On Tue, 2010-06-01 at 09:50 -0400, Martin K. Petersen wrote:
> >>>>> "James" == James Bottomley <[email protected]> writes:
>
> James> Would it be too much work in the fs to mark the page dirty before
> James> you begin altering it (and again after you finish, just in case
> James> some cleaner noticed and initiated a write)? Or some other flag
> James> that indicates page under modification? All the process
> James> controlling the writeout (which is pretty high up in the stack)
> James> needs to know is if we triggered the check error by altering the
> James> page while it was in flight.
>
> James> I agree that a block based retry would close all the holes ... it
> James> just doesn't look elegant to me that the fs will already be
> James> repeating the I/O if it changed the page and so will block.
>
> I experimented with this approach a while back. However, I quickly got
> into a situation where frequently updated blocks never made it to disk
> because the page was constantly being updated. And all writes failed
> with a guard tag error.

But that's unfixable with a retry based system as well if the page is
changing so fast that the guard is always wrong by the time we get to
the array. The only way to fix this is either to copy or freeze the
page.

James


2010-06-01 14:54:33

by Martin K. Petersen

Subject: Re: Wrong DIF guard tag on ext2 write

>>>>> "James" == James Bottomley <[email protected]> writes:

>> I experimented with this approach a while back. However, I quickly
>> got into a situation where frequently updated blocks never made it to
>> disk because the page was constantly being updated. And all writes
>> failed with a guard tag error.

James> But that's unfixable with a retry based system as well if the
James> page is changing so fast that the guard is always wrong by the
James> time we get to the array. The only way to fix this is either to
James> copy or freeze the page.

Exactly, and that's why I'm in favor of the filesystems implementing
whatever method they see fit for ensuring that pages don't change in
flight. Whether that be locking, unmapping, or copying the page.

If there's a performance hit we can have a flag that indicates whether
this block device requires pages to be stable or not. I believe extN
really depends on modifying pages for performance reasons. However,
both XFS and btrfs seem to do just fine without it.
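
Such a flag could be expressed as a backing_dev_info capability bit,
along these lines (the name and value below are invented; no such flag
exists today):

#include <linux/backing-dev.h>

/* Hypothetical capability bit, named here only for illustration. */
#define BDI_CAP_STABLE_WRITES	0x00000200

static inline int bdi_requires_stable_pages(struct backing_dev_info *bdi)
{
	return !!(bdi->capabilities & BDI_CAP_STABLE_WRITES);
}

A filesystem could then check it at write time and decide whether it has
to lock, unmap or copy pages under I/O, or whether it can keep modifying
them the way extN does today.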

Over time we'll have checksums coming down from userspace with various
I/O submission interfaces, internally generated checksums for filesystem
metadata, etc. I really don't think a one-size-fits-all retry heuristic
is going to cut it.

--
Martin K. Petersen Oracle Linux Engineering

2010-06-01 16:29:33

by Matthew Wilcox

Subject: Re: Wrong DIF guard tag on ext2 write

On Tue, Jun 01, 2010 at 09:49:51AM -0400, Chris Mason wrote:
> > I agree that a block based retry would close all the holes ... it just
> > doesn't look elegant to me that the fs will already be repeating the I/O
> > if it changed the page and so will block.
>
> We might not ever repeat the IO. We might change the page, write it,
> change it again, truncate the file and toss the page completely.

Why does it matter that it was never written in that case?

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2010-06-01 16:48:19

by Chris Mason

Subject: Re: Wrong DIF guard tag on ext2 write

On Tue, Jun 01, 2010 at 10:29:30AM -0600, Matthew Wilcox wrote:
> On Tue, Jun 01, 2010 at 09:49:51AM -0400, Chris Mason wrote:
> > > I agree that a block based retry would close all the holes ... it just
> > > doesn't look elegant to me that the fs will already be repeating the I/O
> > > if it changed the page and so will block.
> >
> > We might not ever repeat the IO. We might change the page, write it,
> > change it again, truncate the file and toss the page completely.
>
> Why does it matter that it was never written in that case?

It matters if the storage layer is going to wait around for the block to
be written again with a correct crc.

Unless there is trim + DIF, but that's a lot of plates to spin just for
a basic implementation.

-chris

2010-06-01 16:55:18

by James Bottomley

Subject: Re: Wrong DIF guard tag on ext2 write

On Tue, 2010-06-01 at 12:47 -0400, Chris Mason wrote:
> On Tue, Jun 01, 2010 at 10:29:30AM -0600, Matthew Wilcox wrote:
> > On Tue, Jun 01, 2010 at 09:49:51AM -0400, Chris Mason wrote:
> > > > I agree that a block based retry would close all the holes ... it just
> > > > doesn't look elegant to me that the fs will already be repeating the I/O
> > > > if it changed the page and so will block.
> > >
> > > We might not ever repeat the IO. We might change the page, write it,
> > > change it again, truncate the file and toss the page completely.
> >
> > Why does it matter that it was never written in that case?
>
> It matters is the storage layer is going to wait around for the block to
> be written again with a correct crc.

Actually, I wasn't advocating that. I think block should return a guard
mismatch error. I think somewhere in filesystem writeout is the place
to decide whether the error was self induced or systematic. For self
induced errors (as long as we can detect them) I think we can just
forget about it ... if the changed page is important, the I/O request
gets repeated (modulo the problem of too great a frequency of changes
leading to us never successfully writing it) or it gets dropped because
the file was truncated or the data deleted for some other reason.

James

2010-06-01 18:09:45

by Chris Mason

Subject: Re: Wrong DIF guard tag on ext2 write

On Tue, Jun 01, 2010 at 04:54:53PM +0000, James Bottomley wrote:
> On Tue, 2010-06-01 at 12:47 -0400, Chris Mason wrote:
> > On Tue, Jun 01, 2010 at 10:29:30AM -0600, Matthew Wilcox wrote:
> > > On Tue, Jun 01, 2010 at 09:49:51AM -0400, Chris Mason wrote:
> > > > > I agree that a block based retry would close all the holes ... it just
> > > > > doesn't look elegant to me that the fs will already be repeating the I/O
> > > > > if it changed the page and so will block.
> > > >
> > > > We might not ever repeat the IO. We might change the page, write it,
> > > > change it again, truncate the file and toss the page completely.
> > >
> > > Why does it matter that it was never written in that case?
> >
> > It matters is the storage layer is going to wait around for the block to
> > be written again with a correct crc.
>
> Actually, I wasn't advocating that. I think block should return a guard
> mismatch error. I think somewhere in filesystem writeout is the place
> to decide whether the error was self induced or systematic.

In that case the io error goes to the async page writeback bio-endio
handlers. We don't have a reference on the inode and no ability to
reliably restart the IO, but we can set a bit on the address space
indicating that somewhere, sometime in the past we had an IO error.

> For self
> induced errors (as long as we can detect them) I think we can just
> forget about it ... if the changed page is important, the I/O request
> gets repeated (modulo the problem of too great a frequency of changes
> leading to us never successfully writing it) or it gets dropped because
> the file was truncated or the data deleted for some other reason.

Sorry, how can we tell the errors that are self induced from the evil
bit flipping cable induced errors?

-chris

2010-06-01 18:47:10

by Nick Piggin

Subject: Re: Wrong DIF guard tag on ext2 write

On Tue, Jun 01, 2010 at 02:09:05PM -0400, Chris Mason wrote:
> On Tue, Jun 01, 2010 at 04:54:53PM +0000, James Bottomley wrote:
> > On Tue, 2010-06-01 at 12:47 -0400, Chris Mason wrote:
> > > On Tue, Jun 01, 2010 at 10:29:30AM -0600, Matthew Wilcox wrote:
> > > > On Tue, Jun 01, 2010 at 09:49:51AM -0400, Chris Mason wrote:
> > > > > > I agree that a block based retry would close all the holes ... it just
> > > > > > doesn't look elegant to me that the fs will already be repeating the I/O
> > > > > > if it changed the page and so will block.
> > > > >
> > > > > We might not ever repeat the IO. We might change the page, write it,
> > > > > change it again, truncate the file and toss the page completely.
> > > >
> > > > Why does it matter that it was never written in that case?
> > >
> > > It matters is the storage layer is going to wait around for the block to
> > > be written again with a correct crc.
> >
> > Actually, I wasn't advocating that. I think block should return a guard
> > mismatch error. I think somewhere in filesystem writeout is the place
> > to decide whether the error was self induced or systematic.
>
> In that case the io error goes to the async page writeback bio-endio
> handlers. We don't have a reference on the inode and no ability to
> reliably restart the IO, but we can set a bit on the address space
> indicating that somewhere, sometime in the past we had an IO error.
>
> > For self
> > induced errors (as long as we can detect them) I think we can just
> > forget about it ... if the changed page is important, the I/O request
> > gets repeated (modulo the problem of too great a frequency of changes
> > leading to us never successfully writing it) or it gets dropped because
> > the file was truncated or the data deleted for some other reason.
>
> Sorry, how can we tell the errors that are self induced from the evil
> bit flipping cable induced errors?

Block layer should retry it with bounce pages. That would be a lot nicer
than forcing all upper layers to avoid the problem.

2010-06-01 19:36:14

by Chris Mason

Subject: Re: Wrong DIF guard tag on ext2 write

On Wed, Jun 02, 2010 at 04:46:49AM +1000, Nick Piggin wrote:
> On Tue, Jun 01, 2010 at 02:09:05PM -0400, Chris Mason wrote:
> > On Tue, Jun 01, 2010 at 04:54:53PM +0000, James Bottomley wrote:
> >
> > > For self
> > > induced errors (as long as we can detect them) I think we can just
> > > forget about it ... if the changed page is important, the I/O request
> > > gets repeated (modulo the problem of too great a frequency of changes
> > > leading to us never successfully writing it) or it gets dropped because
> > > the file was truncated or the data deleted for some other reason.
> >
> > Sorry, how can we tell the errors that are self induced from the evil
> > bit flipping cable induced errors?
>
> Block layer should retry it with bounce pages. That would be a lot nicer
> than forcing all upper layers to avoid the problem.
>

So the idea is that we have sent down a buffer and it changed in flight.
The block layer is going to say: oh look, the crcs don't match, I'll
bounce it, recrc it and send again. But, there are at least 3 reasons the crc
will change:

1) filesystem changed it
2) corruption on the wire or in the raid controller
3) the page was corrupted while the IO layer was doing the IO.

1 and 2 are easy, we bounce, retry and everyone continues on with
their lives. With #3, we'll recrc and send the IO down again thinking
the data is correct when really we're writing garbage.

How can we tell these three cases apart?
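
Spelled out, the bounce-and-retry idea amounts to something like the
following (illustrative only, both helpers are invented), which is
exactly where case #3 slips through:

#include <linux/bio.h>
#include <linux/fs.h>

/* Both helpers are invented for illustration; neither exists. */
struct bio *bounce_copy_of(struct bio *bio);
void regenerate_guard_tags(struct bio *bio);

/*
 * Cases 1 and 2 are cured by copying and re-checksumming; case 3 is
 * silently blessed, because the new guard is computed over the
 * already-corrupted data.
 */
static void retry_write_with_bounce(struct bio *bio)
{
	struct bio *clone = bounce_copy_of(bio);

	regenerate_guard_tags(clone);
	submit_bio(WRITE, clone);	/* cannot tell cases 1/2 from 3 */
}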

-chris

2010-06-01 21:08:09

by James Bottomley

Subject: Re: Wrong DIF guard tag on ext2 write

On Tue, 2010-06-01 at 14:09 -0400, Chris Mason wrote:
> On Tue, Jun 01, 2010 at 04:54:53PM +0000, James Bottomley wrote:
> > On Tue, 2010-06-01 at 12:47 -0400, Chris Mason wrote:
> > > On Tue, Jun 01, 2010 at 10:29:30AM -0600, Matthew Wilcox wrote:
> > > > On Tue, Jun 01, 2010 at 09:49:51AM -0400, Chris Mason wrote:
> > > > > > I agree that a block based retry would close all the holes ... it just
> > > > > > doesn't look elegant to me that the fs will already be repeating the I/O
> > > > > > if it changed the page and so will block.
> > > > >
> > > > > We might not ever repeat the IO. We might change the page, write it,
> > > > > change it again, truncate the file and toss the page completely.
> > > >
> > > > Why does it matter that it was never written in that case?
> > >
> > > It matters is the storage layer is going to wait around for the block to
> > > be written again with a correct crc.
> >
> > Actually, I wasn't advocating that. I think block should return a guard
> > mismatch error. I think somewhere in filesystem writeout is the place
> > to decide whether the error was self induced or systematic.
>
> In that case the io error goes to the async page writeback bio-endio
> handlers. We don't have a reference on the inode and no ability to
> reliably restart the IO, but we can set a bit on the address space
> indicating that somewhere, sometime in the past we had an IO error.
>
> > For self
> > induced errors (as long as we can detect them) I think we can just
> > forget about it ... if the changed page is important, the I/O request
> > gets repeated (modulo the problem of too great a frequency of changes
> > leading to us never successfully writing it) or it gets dropped because
> > the file was truncated or the data deleted for some other reason.
>
> Sorry, how can we tell the errors that are self induced from the evil
> bit flipping cable induced errors?

We have all the information ... the fs will eventually mark the page
dirty when it finishes the alterations, we just have to find a way to
express that.

If you're thinking of the double fault scenario where the page
spontaneously corrupts *and* the filesystem alters it, then the only way
of detecting that is to freeze the page as it undergoes I/O ... which
involves quite a bit of filesystem surgery, doesn't it?

James

2010-06-01 22:50:59

by Chris Mason

Subject: Re: Wrong DIF guard tag on ext2 write

On Tue, Jun 01, 2010 at 04:07:43PM -0500, James Bottomley wrote:
> On Tue, 2010-06-01 at 14:09 -0400, Chris Mason wrote:
> > On Tue, Jun 01, 2010 at 04:54:53PM +0000, James Bottomley wrote:
> > > On Tue, 2010-06-01 at 12:47 -0400, Chris Mason wrote:
> > > > On Tue, Jun 01, 2010 at 10:29:30AM -0600, Matthew Wilcox wrote:
> > > > > On Tue, Jun 01, 2010 at 09:49:51AM -0400, Chris Mason wrote:
> > > > > > > I agree that a block based retry would close all the holes ... it just
> > > > > > > doesn't look elegant to me that the fs will already be repeating the I/O
> > > > > > > if it changed the page and so will block.
> > > > > >
> > > > > > We might not ever repeat the IO. We might change the page, write it,
> > > > > > change it again, truncate the file and toss the page completely.
> > > > >
> > > > > Why does it matter that it was never written in that case?
> > > >
> > > > It matters is the storage layer is going to wait around for the block to
> > > > be written again with a correct crc.
> > >
> > > Actually, I wasn't advocating that. I think block should return a guard
> > > mismatch error. I think somewhere in filesystem writeout is the place
> > > to decide whether the error was self induced or systematic.
> >
> > In that case the io error goes to the async page writeback bio-endio
> > handlers. We don't have a reference on the inode and no ability to
> > reliably restart the IO, but we can set a bit on the address space
> > indicating that somewhere, sometime in the past we had an IO error.
> >
> > > For self
> > > induced errors (as long as we can detect them) I think we can just
> > > forget about it ... if the changed page is important, the I/O request
> > > gets repeated (modulo the problem of too great a frequency of changes
> > > leading to us never successfully writing it) or it gets dropped because
> > > the file was truncated or the data deleted for some other reason.
> >
> > Sorry, how can we tell the errors that are self induced from the evil
> > bit flipping cable induced errors?
>
> We have all the information ... the fs will eventually mark the page
> dirty when it finishes the alterations, we just have to find a way to
> express that.

Eventually?

>
> If you're thinking of the double fault scenario where the page
> spontaneously corrupts *and* the filesystem alters it, then the only way
> of detecting that is to freeze the page as it undergoes I/O ... which
> involves quite a bit of filesystem surgery, doesn't it?

No, I'm thinking of the case where the page corrupts and the FS doesn't
alter it. I still don't know how to determine if the FS changed the
page and will write it again, if the FS changed the page and won't write
it again, or if the page corrupted all on its own. The page dirty bit
isn't sufficient for this.

-chris

2010-06-02 03:20:51

by Nick Piggin

Subject: Re: Wrong DIF guard tag on ext2 write

On Tue, Jun 01, 2010 at 03:35:28PM -0400, Chris Mason wrote:
> On Wed, Jun 02, 2010 at 04:46:49AM +1000, Nick Piggin wrote:
> > On Tue, Jun 01, 2010 at 02:09:05PM -0400, Chris Mason wrote:
> > > On Tue, Jun 01, 2010 at 04:54:53PM +0000, James Bottomley wrote:
> > >
> > > > For self
> > > > induced errors (as long as we can detect them) I think we can just
> > > > forget about it ... if the changed page is important, the I/O request
> > > > gets repeated (modulo the problem of too great a frequency of changes
> > > > leading to us never successfully writing it) or it gets dropped because
> > > > the file was truncated or the data deleted for some other reason.
> > >
> > > Sorry, how can we tell the errors that are self induced from the evil
> > > bit flipping cable induced errors?
> >
> > Block layer should retry it with bounce pages. That would be a lot nicer
> > than forcing all upper layers to avoid the problem.
> >
>
> So the idea is that we have sent down a buffer and it changed in flight.
> The block layer is going to say: oh look, the crcs don't match, I'll
> bounce it, recrc it and send again. But, there are at least 3 reasons the crc
> will change:
>
> 1) filesystem changed it
> 2) corruption on the wire or in the raid controller
> 3) the page was corrupted while the IO layer was doing the IO.
>
> 1 and 2 are easy, we bounce, retry and everyone continues on with
> their lives. With #3, we'll recrc and send the IO down again thinking
> the data is correct when really we're writing garbage.
>
> How can we tell these three cases apart?

Do we really need to handle #3? It could have happened before the
checksum was calculated.

2010-06-02 13:19:30

by Martin K. Petersen

Subject: Re: Wrong DIF guard tag on ext2 write

>>>>> "Nick" == Nick Piggin <[email protected]> writes:

>> 1) filesystem changed it
>> 2) corruption on the wire or in the raid controller
>> 3) the page was corrupted while the IO layer was doing the IO.
>>
>> 1 and 2 are easy, we bounce, retry and everyone continues on with
>> their lives. With #3, we'll recrc and send the IO down again
>> thinking the data is correct when really we're writing garbage.
>>
>> How can we tell these three cases apart?

Nick> Do we really need to handle #3? It could have happened before the
Nick> checksum was calculated.

Reason #3 is one of the main reasons for having the checksum in the
first place. The whole premise of the data integrity extensions is that
the checksum is calculated in close temporal proximity to the data
creation. I.e. eventually in userland.

Filesystems will inevitably have to be integrity-aware for that to work.
And it will be their job to keep the data pages stable during DMA.

--
Martin K. Petersen Oracle Linux Engineering

2010-06-02 13:41:43

by Nick Piggin

Subject: Re: Wrong DIF guard tag on ext2 write

On Wed, Jun 02, 2010 at 09:17:56AM -0400, Martin K. Petersen wrote:
> >>>>> "Nick" == Nick Piggin <[email protected]> writes:
>
> >> 1) filesystem changed it
> >> 2) corruption on the wire or in the raid controller
> >> 3) the page was corrupted while the IO layer was doing the IO.
> >>
> >> 1 and 2 are easy, we bounce, retry and everyone continues on with
> >> their lives. With #3, we'll recrc and send the IO down again
> >> thinking the data is correct when really we're writing garbage.
> >>
> >> How can we tell these three cases apart?
>
> Nick> Do we really need to handle #3? It could have happened before the
> Nick> checksum was calculated.
>
> Reason #3 is one of the main reasons for having the checksum in the
> first place. The whole premise of the data integrity extensions is that
> the checksum is calculated in close temporal proximity to the data
> creation. I.e. eventually in userland.
>
> Filesystems will inevitably have to be integrity-aware for that to work.
> And it will be their job to keep the data pages stable during DMA.

Let's just think hard about what windows can actually be closed versus
how much effort goes into closing them. I also prefer not to accept
half-solutions in the kernel because they don't want to implement real
solutions in hardware (it's pretty hard to checksum and protect all
kernel data structures by hand).

For "normal" writes into pagecache, the data can get corrupted anywhere
from after it is generated in userspace, during the copy, while it is
dirty in cache, and while it is being written out.

Closing the while-it-is-dirty and while-it-is-being-written-back windows
still leaves a pretty big one. Also, how do you handle mmap writes?
Write-protect and checksum the destination page after every store? Or
leave some window between when the pagecache is dirtied and when it is
written back? So I don't know whether it's worth putting a lot of effort
into this case.

If you had an interface for userspace to attach checksums to direct IO
requests or pagecache ranges, then not only could you close the entire
gap between userspace data generation and writeback, but you could also
handle mmap writes and anything else just fine: userspace knows about
the concurrency details, so it can add the right checksum (and
potentially fsync etc.) when it's ready.
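
No such interface exists; purely as a strawman, it could look like a
per-range attachment of user-computed guard tags (every name below is
invented):

#include <linux/types.h>

/*
 * Strawman only -- no such ABI exists.  Userspace attaches guard tags
 * it computed itself to a byte range it is about to write, e.g. via a
 * hypothetical ioctl(fd, FS_IOC_SET_INTEGRITY_RANGE, &range).
 */
struct user_integrity_range {
	__u64	offset;		/* byte offset in the file              */
	__u64	len;		/* length of the data range             */
	__u64	guards_ptr;	/* user pointer to per-block guard tags */
	__u32	block_size;	/* protection interval, e.g. 512        */
	__u32	flags;
};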

And the bounce-retry method would be sufficient to handle IO
transmission errors for normal IOs without being intrusive.

2010-06-03 11:19:43

by Vladislav Bolkhovitin

Subject: Re: Wrong DIF guard tag on ext2 write

James Bottomley, on 06/01/2010 05:27 PM wrote:
> On Tue, 2010-06-01 at 12:30 +0200, Christof Schmitt wrote:
>> What is the best strategy to continue with the invalid guard tags on
>> write requests? Should this be fixed in the filesystems?
>
> For write requests, as long as the page dirty bit is still set, it's
> safe to drop the request, since it's already going to be repeated. What
> we probably want is an error code we can return that the layer that sees
> both the request and the page flags can make the call.
>
>> Another idea would be to pass invalid guard tags on write requests
>> down to the hardware, expect an "invalid guard tag" error and report
>> it to the block layer where a new checksum is generated and the
>> request is issued again. Basically implement a retry through the whole
>> I/O stack. But this also sounds complicated.
>
> No, no ... as long as the guard tag is wrong because the fs changed the
> page, the write request for the updated page will already be queued or
> in-flight, so there's no need to retry.

There's one interesting problem here, at least theoretically, with SCSI
or similar transports which allow a command queue depth >1 and are
allowed to internally reorder queued requests. I don't know the FS/block
layers sufficiently well to tell whether sending several requests for
the same page is really possible or not, but we can see a real life
problem which can be well explained if it is.

The problem could occur if the second (rewrite) request (SCSI command)
for the same page is queued to the corresponding device before the
original request has finished. Since the device is allowed to freely
reorder requests, there's a probability that the original write request
would hit the permanent storage *AFTER* the retry request, hence the
data changes it's carrying would be lost, hence welcome data corruption.

For single parallel SCSI or SAS devices such a race may look practically
impossible, but for sophisticated clusters where many nodes pretend to
be a single SCSI device in a load balancing configuration, it becomes
very real.

We can see a real life instance of this problem in an active-active
DRBD setup. In this configuration 2 nodes act as a single SCST-powered
SCSI device and they both run DRBD to keep their backstorage in sync.
The initiator uses them as a single multipath device in an active-active
round-robin load-balancing configuration, i.e. it sends requests to both
nodes in parallel, then DRBD takes care of replicating the requests to
the other node.

The problem is that sometimes DRBD complains about concurrent local
writes, like:

kernel: drbd0: scsi_tgt0[12503] Concurrent local write detected!
[DISCARD L] new: 144072784s +8192; pending: 144072784s +8192

This message means that DRBD detected that both nodes received
overlapping writes on the same block(s) and DRBD can't figure out which
one to store. This is possible only if the initiator sent the second
write request before the first one completed.

The topic of this discussion could well explain the cause of that. But,
unfortunately, the people who reported it forgot to note which OS they
were running on the initiator, i.e. I can't say for sure it was Linux.

Vlad

2010-06-03 12:07:41

by Boaz Harrosh

Subject: Re: Wrong DIF guard tag on ext2 write

On 06/03/2010 02:20 PM, Vladislav Bolkhovitin wrote:
>
> There's one interesting problem here, at least theoretically, with SCSI
> or similar transports which allow to have commands queue depth >1 and
> allowed to internally reorder queued requests. I don't know the FS/block
> layers sufficiently well to tell if sending several requests for the
> same page really possible or not, but we can see a real life problem,
> which can be well explained if it's possible.
>
> The problem could be if the second (rewrite) request (SCSI command) for
> the same page queued to the corresponding device before the original
> request finished. Since the device allowed to freely reorder requests,
> there's a probability that the original write request would hit the
> permanent storage *AFTER* the retry request, hence the data changes it's
> carrying would be lost, hence welcome data corruption.
>

I might be totally wrong here, but I think NCQ can reorder sectors but
not writes. That is, if the sector is cached in device memory and a
later write comes to modify the same sector, then the original should be
replaced; two values of the same sector should not be kept in the device
cache at the same time.

Failing to do so is a SCSI device problem.

Please note that page-to-sector is not necessarily constant. The same
page might get written to a different sector next time. But FSs will
have to barrier in this case.

> For single parallel SCSI or SAS devices such race may look practically
> impossible, but for sophisticated clusters when many nodes pretending to
> be a single SCSI device in a load balancing configuration, it becomes
> very real.
>
> The real life problem we can see in an active-active DRBD-setup. In this
> configuration 2 nodes act as a single SCST-powered SCSI device and they
> both run DRBD to keep their backstorage in-sync. The initiator uses them
> as a single multipath device in an active-active round-robin
> load-balancing configuration, i.e. sends requests to both nodes in
> parallel, then DRBD takes care to replicate the requests to the other node.
>
> The problem is that sometimes DRBD complies about concurrent local
> writes, like:
>
> kernel: drbd0: scsi_tgt0[12503] Concurrent local write detected!
> [DISCARD L] new: 144072784s +8192; pending: 144072784s +8192
>
> This message means that DRBD detected that both nodes received
> overlapping writes on the same block(s) and DRBD can't figure out which
> one to store. This is possible only if the initiator sent the second
> write request before the first one completed.
>

It is totally possible in today's code.

DRBD should store the original command_sn of the write and discard
the sector with the lower SN. It should appear as a single device
to the initiator.

> The topic of the discussion could well explain the cause of that. But,
> unfortunately, people who reported it forgot to note which OS they run
> on the initiator, i.e. I can't say for sure it's Linux.
>
> Vlad
>

Boaz

2010-06-03 12:40:53

by Vladislav Bolkhovitin

[permalink] [raw]
Subject: Re: Wrong DIF guard tag on ext2 write

Boaz Harrosh, on 06/03/2010 04:07 PM wrote:
> On 06/03/2010 02:20 PM, Vladislav Bolkhovitin wrote:
>> There's one interesting problem here, at least theoretically, with SCSI
>> or similar transports which allow to have commands queue depth >1 and
>> allowed to internally reorder queued requests. I don't know the FS/block
>> layers sufficiently well to tell if sending several requests for the
>> same page really possible or not, but we can see a real life problem,
>> which can be well explained if it's possible.
>>
>> The problem could be if the second (rewrite) request (SCSI command) for
>> the same page queued to the corresponding device before the original
>> request finished. Since the device allowed to freely reorder requests,
>> there's a probability that the original write request would hit the
>> permanent storage *AFTER* the retry request, hence the data changes it's
>> carrying would be lost, hence welcome data corruption.
>>
>
> I might be totally wrong here but I think NCQ can reorder sectors but
> not writes. That is if the sector is cached in device memory and a later
> write comes to modify the same sector then the original should be
> replaced not two values of the same sector be kept in device cache at the
> same time.
>
> Failing to do so is a scsi device problem.

SCSI devices supporting the Full task management model (almost all of
them) and having the QUEUE ALGORITHM MODIFIER bits in the Control mode
page set to 1 are allowed to freely reorder any commands with the SIMPLE
task attribute. If an application wants to maintain the order of some
commands for such devices, it must issue them with the ORDERED task
attribute and over a _single_ MPIO path to the device.

Linux neither uses the ORDERED attribute, nor honors or enforces the
QUEUE ALGORITHM MODIFIER bits in any way, nor takes care to send commands
with order dependencies (overlapping writes in our case) over a single
MPIO path.
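
For reference, the field in question is easy to check once the Control
mode page has been fetched with MODE SENSE; a minimal sketch, assuming
the caller has already skipped the mode parameter header and any block
descriptors and points at the start of the page itself:

/*
 * Sketch only: extract the QUEUE ALGORITHM MODIFIER from a Control mode
 * page (page code 0Ah). Per SPC the field occupies bits 7:4 of byte 3 of
 * the page; a value of 1 means the device may reorder SIMPLE commands.
 */
#include <linux/types.h>

static u8 queue_algorithm_modifier(const u8 *control_mode_page)
{
        return (control_mode_page[3] >> 4) & 0x0f;
}

static bool device_may_reorder_simple_cmds(const u8 *control_mode_page)
{
        return queue_algorithm_modifier(control_mode_page) == 1;
}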

> Please note that page-to-sector is not necessary constant. And the same page
> might get written at a different sector, next time. But FSs will have to
> barrier in this case.
>
>> For single parallel SCSI or SAS devices such race may look practically
>> impossible, but for sophisticated clusters when many nodes pretending to
>> be a single SCSI device in a load balancing configuration, it becomes
>> very real.
>>
>> The real life problem we can see in an active-active DRBD-setup. In this
>> configuration 2 nodes act as a single SCST-powered SCSI device and they
>> both run DRBD to keep their backstorage in-sync. The initiator uses them
>> as a single multipath device in an active-active round-robin
>> load-balancing configuration, i.e. sends requests to both nodes in
>> parallel, then DRBD takes care to replicate the requests to the other node.
>>
>> The problem is that sometimes DRBD complies about concurrent local
>> writes, like:
>>
>> kernel: drbd0: scsi_tgt0[12503] Concurrent local write detected!
>> [DISCARD L] new: 144072784s +8192; pending: 144072784s +8192
>>
>> This message means that DRBD detected that both nodes received
>> overlapping writes on the same block(s) and DRBD can't figure out which
>> one to store. This is possible only if the initiator sent the second
>> write request before the first one completed.
>
> It is totally possible in today's code.
>
> DRBD should store the original command_sn of the write and discard
> the sector with the lower SN. It should appear as a single device
> to the initiator.

How can it find the SN? The commands were sent over _different_ MPIO
paths to the device, so by the time they were sent all the ordering
information had already been lost.

Until SCSI generally allows preserving ordering information between
MPIO paths, in such configurations the only way to maintain command
order is queue draining. Hence, for safety, all initiators working
with such devices must do it.

But it looks like Linux doesn't do it, so is it unsafe with MPIO clusters?

Vlad

2010-06-03 12:46:38

by Vladislav Bolkhovitin

[permalink] [raw]
Subject: Re: Wrong DIF guard tag on ext2 write



Vladislav Bolkhovitin, on 06/03/2010 04:41 PM wrote:
> Boaz Harrosh, on 06/03/2010 04:07 PM wrote:
>> On 06/03/2010 02:20 PM, Vladislav Bolkhovitin wrote:
>>> There's one interesting problem here, at least theoretically, with SCSI
>>> or similar transports which allow to have commands queue depth >1 and
>>> allowed to internally reorder queued requests. I don't know the FS/block
>>> layers sufficiently well to tell if sending several requests for the
>>> same page really possible or not, but we can see a real life problem,
>>> which can be well explained if it's possible.
>>>
>>> The problem could be if the second (rewrite) request (SCSI command) for
>>> the same page queued to the corresponding device before the original
>>> request finished. Since the device allowed to freely reorder requests,
>>> there's a probability that the original write request would hit the
>>> permanent storage *AFTER* the retry request, hence the data changes it's
>>> carrying would be lost, hence welcome data corruption.
>>>
>> I might be totally wrong here but I think NCQ can reorder sectors but
>> not writes. That is if the sector is cached in device memory and a later
>> write comes to modify the same sector then the original should be
>> replaced not two values of the same sector be kept in device cache at the
>> same time.
>>
>> Failing to do so is a scsi device problem.
>
> SCSI devices supporting Full task management model (almost all) and
> having QUEUE ALGORITHM MODIFIER bits in Control mode page set to 1
> allowed to freely reorder any commands with SIMPLE task attribute. If an
> application wants to maintain order of some commands for such devices,
> it must issue them with ORDERED task attribute and over a _single_ MPIO
> path to the device.
>
> Linux neither uses ORDERED attribute, nor honors or enforces anyhow
> QUEUE ALGORITHM MODIFIER bits, nor takes care to send commands with
> order dependencies (overlapping writes in our case) over a single MPIO path.
>
>> Please note that page-to-sector is not necessary constant. And the same page
>> might get written at a different sector, next time. But FSs will have to
>> barrier in this case.
>>
>>> For single parallel SCSI or SAS devices such race may look practically
>>> impossible, but for sophisticated clusters when many nodes pretending to
>>> be a single SCSI device in a load balancing configuration, it becomes
>>> very real.
>>>
>>> The real life problem we can see in an active-active DRBD-setup. In this
>>> configuration 2 nodes act as a single SCST-powered SCSI device and they
>>> both run DRBD to keep their backstorage in-sync. The initiator uses them
>>> as a single multipath device in an active-active round-robin
>>> load-balancing configuration, i.e. sends requests to both nodes in
>>> parallel, then DRBD takes care to replicate the requests to the other node.
>>>
>>> The problem is that sometimes DRBD complies about concurrent local
>>> writes, like:
>>>
>>> kernel: drbd0: scsi_tgt0[12503] Concurrent local write detected!
>>> [DISCARD L] new: 144072784s +8192; pending: 144072784s +8192
>>>
>>> This message means that DRBD detected that both nodes received
>>> overlapping writes on the same block(s) and DRBD can't figure out which
>>> one to store. This is possible only if the initiator sent the second
>>> write request before the first one completed.
>> It is totally possible in today's code.
>>
>> DRBD should store the original command_sn of the write and discard
>> the sector with the lower SN. It should appear as a single device
>> to the initiator.
>
> How can it find the SN? The commands were sent over _different_ MPIO
> paths to the device, so at the moment of the sending all the order
> information was lost.
>
> Until SCSI generally allowed to preserve ordering information between
> MPIO paths in such configurations the only way to maintain commands
> order would be queue draining. Hence, for safety all initiators working
> with such devices must do it.
>
> But looks like Linux doesn't do it, so unsafe with MPIO clusters?

I meant load balancing MPIO clusters.

Vlad

2010-06-03 13:06:57

by Boaz Harrosh

[permalink] [raw]
Subject: Re: Wrong DIF guard tag on ext2 write

On 06/03/2010 03:41 PM, Vladislav Bolkhovitin wrote:
> Boaz Harrosh, on 06/03/2010 04:07 PM wrote:
>> On 06/03/2010 02:20 PM, Vladislav Bolkhovitin wrote:
>>> There's one interesting problem here, at least theoretically, with SCSI
>>> or similar transports which allow to have commands queue depth >1 and
>>> allowed to internally reorder queued requests. I don't know the FS/block
>>> layers sufficiently well to tell if sending several requests for the
>>> same page really possible or not, but we can see a real life problem,
>>> which can be well explained if it's possible.
>>>
>>> The problem could be if the second (rewrite) request (SCSI command) for
>>> the same page queued to the corresponding device before the original
>>> request finished. Since the device allowed to freely reorder requests,
>>> there's a probability that the original write request would hit the
>>> permanent storage *AFTER* the retry request, hence the data changes it's
>>> carrying would be lost, hence welcome data corruption.
>>>
>>
>> I might be totally wrong here but I think NCQ can reorder sectors but
>> not writes. That is if the sector is cached in device memory and a later
>> write comes to modify the same sector then the original should be
>> replaced not two values of the same sector be kept in device cache at the
>> same time.
>>
>> Failing to do so is a scsi device problem.
>
> SCSI devices supporting Full task management model (almost all) and
> having QUEUE ALGORITHM MODIFIER bits in Control mode page set to 1
> allowed to freely reorder any commands with SIMPLE task attribute. If an
> application wants to maintain order of some commands for such devices,
> it must issue them with ORDERED task attribute and over a _single_ MPIO
> path to the device.
>
> Linux neither uses ORDERED attribute, nor honors or enforces anyhow
> QUEUE ALGORITHM MODIFIER bits, nor takes care to send commands with
> order dependencies (overlapping writes in our case) over a single MPIO path.
>

OK, I'll take your word for it. But that sounds stupid to me. I would think
that sectors can be ordered, not commands per se. What happens with reads
then? Do they get ordered? I mean, for a read in between the two writes,
which value is read? It gets so complicated that only a sector model makes
sense to me.

>> Please note that page-to-sector is not necessary constant. And the same page
>> might get written at a different sector, next time. But FSs will have to
>> barrier in this case.
>>
>>> For single parallel SCSI or SAS devices such race may look practically
>>> impossible, but for sophisticated clusters when many nodes pretending to
>>> be a single SCSI device in a load balancing configuration, it becomes
>>> very real.
>>>
>>> The real life problem we can see in an active-active DRBD-setup. In this
>>> configuration 2 nodes act as a single SCST-powered SCSI device and they
>>> both run DRBD to keep their backstorage in-sync. The initiator uses them
>>> as a single multipath device in an active-active round-robin
>>> load-balancing configuration, i.e. sends requests to both nodes in
>>> parallel, then DRBD takes care to replicate the requests to the other node.
>>>
>>> The problem is that sometimes DRBD complies about concurrent local
>>> writes, like:
>>>
>>> kernel: drbd0: scsi_tgt0[12503] Concurrent local write detected!
>>> [DISCARD L] new: 144072784s +8192; pending: 144072784s +8192
>>>
>>> This message means that DRBD detected that both nodes received
>>> overlapping writes on the same block(s) and DRBD can't figure out which
>>> one to store. This is possible only if the initiator sent the second
>>> write request before the first one completed.
>>
>> It is totally possible in today's code.
>>
>> DRBD should store the original command_sn of the write and discard
>> the sector with the lower SN. It should appear as a single device
>> to the initiator.
>
> How can it find the SN? The commands were sent over _different_ MPIO
> paths to the device, so at the moment of the sending all the order
> information was lost.
>

I'm not firm on the specifics here. But I think the initiator has either
set the same SN on the two paths, or has incremented it between paths.
You said:

> The initiator uses them as a single multipath device in an active-active
> round-robin load-balancing configuration, i.e. sends requests to both nodes
> in paralle.

So what was the SN sent to each side? Is there a relationship between them,
or do they each advance independently?

If there is a relationship, then the targets on the two sides should store
the SN for later comparison. (Life is hard.)

> Until SCSI generally allowed to preserve ordering information between
> MPIO paths in such configurations the only way to maintain commands
> order would be queue draining. Hence, for safety all initiators working
> with such devices must do it.
>
> But looks like Linux doesn't do it, so unsafe with MPIO clusters?
>
> Vlad
>

Thanks
Boaz

2010-06-03 13:23:12

by Vladislav Bolkhovitin

[permalink] [raw]
Subject: Re: Wrong DIF guard tag on ext2 write

Boaz Harrosh, on 06/03/2010 05:06 PM wrote:
> On 06/03/2010 03:41 PM, Vladislav Bolkhovitin wrote:
>> Boaz Harrosh, on 06/03/2010 04:07 PM wrote:
>>> On 06/03/2010 02:20 PM, Vladislav Bolkhovitin wrote:
>>>> There's one interesting problem here, at least theoretically, with SCSI
>>>> or similar transports which allow to have commands queue depth >1 and
>>>> allowed to internally reorder queued requests. I don't know the FS/block
>>>> layers sufficiently well to tell if sending several requests for the
>>>> same page really possible or not, but we can see a real life problem,
>>>> which can be well explained if it's possible.
>>>>
>>>> The problem could be if the second (rewrite) request (SCSI command) for
>>>> the same page queued to the corresponding device before the original
>>>> request finished. Since the device allowed to freely reorder requests,
>>>> there's a probability that the original write request would hit the
>>>> permanent storage *AFTER* the retry request, hence the data changes it's
>>>> carrying would be lost, hence welcome data corruption.
>>>>
>>> I might be totally wrong here but I think NCQ can reorder sectors but
>>> not writes. That is if the sector is cached in device memory and a later
>>> write comes to modify the same sector then the original should be
>>> replaced not two values of the same sector be kept in device cache at the
>>> same time.
>>>
>>> Failing to do so is a scsi device problem.
>> SCSI devices supporting Full task management model (almost all) and
>> having QUEUE ALGORITHM MODIFIER bits in Control mode page set to 1
>> allowed to freely reorder any commands with SIMPLE task attribute. If an
>> application wants to maintain order of some commands for such devices,
>> it must issue them with ORDERED task attribute and over a _single_ MPIO
>> path to the device.
>>
>> Linux neither uses ORDERED attribute, nor honors or enforces anyhow
>> QUEUE ALGORITHM MODIFIER bits, nor takes care to send commands with
>> order dependencies (overlapping writes in our case) over a single MPIO path.
>>
>
> OK I take your word for it. But that sounds stupid to me. I would think
> that sectors can be ordered. not commands per se. What happen with reads
> then? do they get to be ordered? I mean a read in between the two writes which
> value is read? It gets so complicated that only a sector model makes sense
> to me.

Look wider. For a single HDD your way of thinking makes sense. But what
about big clusters consisting of many nodes with many clients? In them,
maintaining internal command order is generally bad and often way too
expensive for performance.

It's the same as with modern CPUs, where for performance reasons
programmers also must live with the possibility of command reordering and
use barriers when necessary.
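
To make the CPU analogy concrete, here is the classic flag/data pattern;
a sketch only (the header providing the barrier macros varies across
kernel versions):

/*
 * Illustration of the CPU analogy: without the barriers, the store to
 * "ready" may become visible before the store to "data", i.e. the second
 * write overtakes the first, just like reordered SCSI commands.
 */
#include <asm/barrier.h>   /* smp_wmb()/smp_rmb(); location varies by kernel */

static int data;
static int ready;

static void producer(void)
{
        data = 42;      /* the "first write" */
        smp_wmb();      /* make data visible before ready */
        ready = 1;      /* the "second write" */
}

static int consumer(void)
{
        if (ready) {
                smp_rmb();      /* pairs with smp_wmb() in producer() */
                return data;    /* guaranteed to observe 42 */
        }
        return -1;
}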

>>> Please note that page-to-sector is not necessary constant. And the same page
>>> might get written at a different sector, next time. But FSs will have to
>>> barrier in this case.
>>>
>>>> For single parallel SCSI or SAS devices such race may look practically
>>>> impossible, but for sophisticated clusters when many nodes pretending to
>>>> be a single SCSI device in a load balancing configuration, it becomes
>>>> very real.
>>>>
>>>> The real life problem we can see in an active-active DRBD-setup. In this
>>>> configuration 2 nodes act as a single SCST-powered SCSI device and they
>>>> both run DRBD to keep their backstorage in-sync. The initiator uses them
>>>> as a single multipath device in an active-active round-robin
>>>> load-balancing configuration, i.e. sends requests to both nodes in
>>>> parallel, then DRBD takes care to replicate the requests to the other node.
>>>>
>>>> The problem is that sometimes DRBD complies about concurrent local
>>>> writes, like:
>>>>
>>>> kernel: drbd0: scsi_tgt0[12503] Concurrent local write detected!
>>>> [DISCARD L] new: 144072784s +8192; pending: 144072784s +8192
>>>>
>>>> This message means that DRBD detected that both nodes received
>>>> overlapping writes on the same block(s) and DRBD can't figure out which
>>>> one to store. This is possible only if the initiator sent the second
>>>> write request before the first one completed.
>>> It is totally possible in today's code.
>>>
>>> DRBD should store the original command_sn of the write and discard
>>> the sector with the lower SN. It should appear as a single device
>>> to the initiator.
>> How can it find the SN? The commands were sent over _different_ MPIO
>> paths to the device, so at the moment of the sending all the order
>> information was lost.
>>
>
> I'm not hard on the specifics here. But I think the initiator has set
> the same SN on the two paths, or has incremented them between paths.
> You said:
>
>> The initiator uses them as a single multipath device in an active-active
>> round-robin load-balancing configuration, i.e. sends requests to both nodes
>> in paralle.
>
> So what was the SN sent to each side. Is there a relationship between them
> or they each advance independently?
>
> If there is a relationship then the targets on two sides should store
> the SN for later comparison. (Life is hard)

No SCSI transport, including iSCSI, carries any SN-related ordering
information to other paths (I_T nexuses) in its internal packets. It's
simply outside the scope of SAM. If you need ordering information between
paths, you must use "extensions", like iSCSI MC/S, but they are bad for
many other reasons. I summarized them in http://scst.sourceforge.net/mc_s.html.

Vlad

2010-06-03 15:48:10

by Chris Mason

[permalink] [raw]
Subject: Re: Wrong DIF guard tag on ext2 write

On Wed, Jun 02, 2010 at 11:41:21PM +1000, Nick Piggin wrote:
> On Wed, Jun 02, 2010 at 09:17:56AM -0400, Martin K. Petersen wrote:
> > >>>>> "Nick" == Nick Piggin <[email protected]> writes:
> >
> > >> 1) filesystem changed it
> > >> 2) corruption on the wire or in the raid controller
> > >> 3) the page was corrupted while the IO layer was doing the IO.
> > >>
> > >> 1 and 2 are easy, we bounce, retry and everyone continues on with
> > >> their lives. With #3, we'll recrc and send the IO down again
> > >> thinking the data is correct when really we're writing garbage.
> > >>
> > >> How can we tell these three cases apart?
> >
> > Nick> Do we really need to handle #3? It could have happened before the
> > Nick> checksum was calculated.
> >
> > Reason #3 is one of the main reasons for having the checksum in the
> > first place. The whole premise of the data integrity extensions is that
> > the checksum is calculated in close temporal proximity to the data
> > creation. I.e. eventually in userland.
> >
> > Filesystems will inevitably have to be integrity-aware for that to work.
> > And it will be their job to keep the data pages stable during DMA.
>
> Let's just think hard about what windows can actually be closed versus
> how much effort goes in to closing them. I also prefer not to accept
> half-solutions in the kernel because they don't want to implement real
> solutions in hardware (it's pretty hard to checksum and protect all
> kernel data structures by hand).
>
> For "normal" writes into pagecache, the data can get corrupted anywhere
> from after it is generated in userspace, during the copy, while it is
> dirty in cache, and while it is being written out.

This is why the DIF/DIX spec has the idea of a crc generated in userland
when the data is generated. At any rate, the basic idea is to crc early
but not often... recalculating the crc after we hand our precious memory
to the evil device driver does weaken the end-to-end integrity checks.

What I don't want to do is weaken the basic DIF/DIX structure by letting
the lower layers re-crc stuff as they find faults. It would be fine if we
had some definitive way to say: the FS raced, just re-crc, but we really
don't.
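
For reference, "crc early" for the common T10 guard tag boils down to a
16-bit CRC per 512-byte sector. A minimal sketch of just the checksum
step (not the kernel's actual integrity code; a real implementation
would place the tags in the bio integrity payload):

/*
 * Sketch only: compute DIF guard tags for one page of data, one 16-bit
 * CRC per 512-byte sector, using the kernel's crc_t10dif() helper. Guard
 * tags are big-endian on the wire, hence the cpu_to_be16().
 */
#include <linux/crc-t10dif.h>
#include <linux/highmem.h>
#include <linux/mm.h>
#include <asm/byteorder.h>

static void generate_guard_tags(struct page *page, __be16 *guard_tags)
{
        unsigned char *data = kmap(page);
        unsigned int i;

        for (i = 0; i < PAGE_SIZE / 512; i++)
                guard_tags[i] = cpu_to_be16(crc_t10dif(data + i * 512, 512));

        kunmap(page);
}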

>
> Closing the while it is dirty, while it is being written back window
> still leaves a pretty big window. Also, how do you handle mmap writes?
> Write protect and checksum the destination page after every store? Or
> leave some window between when the pagecache is dirtied and when it is
> written back? So I don't know whether it's worth putting a lot of effort
> into this case.

So, changing gears to how we protect filesystem page cache pages
instead of the generic idea of DIF/DIX: btrfs crcs just before writing,
which does leave a pretty big window for the page to get corrupted.
The storage layer shouldn't care or know about that though; we hand it a
crc and it makes sure data matching that crc goes to the media.

>
> If you had an interface for userspace to insert checksums to direct IO
> requests or pagecache ranges, then not only can you close the entire gap
> between userspace data generation, and writeback. But you also can
> handle mmap writes and anything else just fine: userspace knows about
> the concurrency details, so it can add the right checksum (and
> potentially fsync etc) when it's ready.

Yeah, I do agree here.

-chris

2010-06-03 16:27:26

by Nick Piggin

[permalink] [raw]
Subject: Re: Wrong DIF guard tag on ext2 write

On Thu, Jun 03, 2010 at 11:46:34AM -0400, Chris Mason wrote:
> On Wed, Jun 02, 2010 at 11:41:21PM +1000, Nick Piggin wrote:
> > On Wed, Jun 02, 2010 at 09:17:56AM -0400, Martin K. Petersen wrote:
> > > >>>>> "Nick" == Nick Piggin <[email protected]> writes:
> > >
> > > >> 1) filesystem changed it
> > > >> 2) corruption on the wire or in the raid controller
> > > >> 3) the page was corrupted while the IO layer was doing the IO.
> > > >>
> > > >> 1 and 2 are easy, we bounce, retry and everyone continues on with
> > > >> their lives. With #3, we'll recrc and send the IO down again
> > > >> thinking the data is correct when really we're writing garbage.
> > > >>
> > > >> How can we tell these three cases apart?
> > >
> > > Nick> Do we really need to handle #3? It could have happened before the
> > > Nick> checksum was calculated.
> > >
> > > Reason #3 is one of the main reasons for having the checksum in the
> > > first place. The whole premise of the data integrity extensions is that
> > > the checksum is calculated in close temporal proximity to the data
> > > creation. I.e. eventually in userland.
> > >
> > > Filesystems will inevitably have to be integrity-aware for that to work.
> > > And it will be their job to keep the data pages stable during DMA.
> >
> > Let's just think hard about what windows can actually be closed versus
> > how much effort goes in to closing them. I also prefer not to accept
> > half-solutions in the kernel because they don't want to implement real
> > solutions in hardware (it's pretty hard to checksum and protect all
> > kernel data structures by hand).
> >
> > For "normal" writes into pagecache, the data can get corrupted anywhere
> > from after it is generated in userspace, during the copy, while it is
> > dirty in cache, and while it is being written out.
>
> This is why the DIF/DIX spec has the idea of a crc generated in userland
> when the data is generated. At any rate the basic idea is to crc early
> but not often...recalculating the crc after we hand our precious memory
> to the evil device driver does weaken the end-to-end integrity checks.
>
> What I don't want to do is weaken the basic DIF/DIX structure by letting
> the lower recrc stuff as they find faults. It would be fine if we had
> some definitive way to say: the FS raced, just recrc, but we really
> don't.

That's true but a naive kernel crc cannot do better than the
user/kernel boundary (and has very big problems even doing that well
with mmap, get_user_pages, concurrent dirtying). So we are already
resigned there to a best effort approach.

Since we fundamentally can't have end-to-end protection then, it's
much harder to argue for significant complexity just to close the
hole a little.

So if we do the block layer retries in response to concurrent writes, it
opens the window there a little bit, but remember only a small
proportion of writes will require retries, and for that proportion, the
window is only opening a small amount.

As far as I know, we're not checksumming at the usercopy point, but the
writeback point, so we have a vastly bigger window there already.


> > Closing the while it is dirty, while it is being written back window
> > still leaves a pretty big window. Also, how do you handle mmap writes?
> > Write protect and checksum the destination page after every store? Or
> > leave some window between when the pagecache is dirtied and when it is
> > written back? So I don't know whether it's worth putting a lot of effort
> > into this case.
>
> So, changing gears to how do we protect filesystem page cache pages
> instead of the generic idea of dif/dix, btrfs crcs just before writing,
> which does leave a pretty big window for the page to get corrupted.
> The storage layer shouldn't care or know about that though, we hand it a
> crc and it makes sure data matching that crc goes to the media.

I'm not totally against freezing redirtying events during writeback,
but as long as there is a huge window of the page sitting dirty in
pagecache, it's not worth much if any complexity.

Also I don't think we can deal with memory errors and scribbles just by
crcing dirty data. The calculations generating the data could get
corrupted. Data can be corrupted on its way back from the device to
userspace. Dirty memory writeback is usually only a small part of the
entire data transformation process.


> > If you had an interface for userspace to insert checksums to direct IO
> > requests or pagecache ranges, then not only can you close the entire gap
> > between userspace data generation, and writeback. But you also can
> > handle mmap writes and anything else just fine: userspace knows about
> > the concurrency details, so it can add the right checksum (and
> > potentially fsync etc) when it's ready.
>
> Yeah, I do agree here.

After protecting writeback from IO bus and wire errors, I think this
would be the most productive thing to work on. Obviously this feature is
being pushed by databases and such that really want to pass checksums
all the way from userspace. Block retrying is _not_ needed or wanted
here of course.

After that is done, it might be worth coming back to see if
regular pagecache can be protected any better. I just think that's
the highest hanging fruit at the moment.

2010-06-04 01:31:19

by Martin K. Petersen

[permalink] [raw]
Subject: Re: Wrong DIF guard tag on ext2 write

>>>>> "Nick" == Nick Piggin <[email protected]> writes:

Nick,

>> Filesystems will inevitably have to be integrity-aware for that to
>> work. And it will be their job to keep the data pages stable during
>> DMA.

Nick> Closing the while it is dirty, while it is being written back
Nick> window still leaves a pretty big window. Also, how do you handle
Nick> mmap writes? Write protect and checksum the destination page
Nick> after every store? Or leave some window between when the pagecache
Nick> is dirtied and when it is written back? So I don't know whether
Nick> it's worth putting a lot of effort into this case.

I'm mostly interested in the cases where the filesystem acts as a
conduit for integrity metadata from user space.

I agree the corruption windows inside the kernel are only of moderate
interest. No filesystems have added support for explicitly protecting a
bio because the block layer's function to do so automatically is just a
few function calls away.

--
Martin K. Petersen Oracle Linux Engineering

2010-06-04 01:47:33

by Martin K. Petersen

[permalink] [raw]
Subject: Re: Wrong DIF guard tag on ext2 write

>>>>> "Nick" == Nick Piggin <[email protected]> writes:

Nick> Also I don't think we can deal with memory errors and scribbles
Nick> just by crcing dirty data. The calculations generating the data
Nick> could get corrupted.

Yep, the goal is to make the window as small as possible.


Nick> Data can be corrupted on its way back from the device to
Nick> userspace.

We also get a CRC back from the storage. So the (integrity-aware)
application is also able to check on read.


Nick> Obviously this feature is being pushed by databases and such that
Nick> really want to pass checksums all the way from userspace. Block
Nick> retrying is _not_ needed or wanted here of course.

Nope. The integrity error is bubbled all the way up to the database and
we can decide to retry, recreate or error out depending on what we find
when we do validation checks on the data buffer and the integrity
metadata.

--
Martin K. Petersen Oracle Linux Engineering

2010-06-04 02:03:01

by Dave Chinner

[permalink] [raw]
Subject: Re: Wrong DIF guard tag on ext2 write

On Thu, Jun 03, 2010 at 11:46:34AM -0400, Chris Mason wrote:
> On Wed, Jun 02, 2010 at 11:41:21PM +1000, Nick Piggin wrote:
> > Closing the while it is dirty, while it is being written back window
> > still leaves a pretty big window. Also, how do you handle mmap writes?
> > Write protect and checksum the destination page after every store? Or
> > leave some window between when the pagecache is dirtied and when it is
> > written back? So I don't know whether it's worth putting a lot of effort
> > into this case.
>
> So, changing gears to how do we protect filesystem page cache pages
> instead of the generic idea of dif/dix, btrfs crcs just before writing,
> which does leave a pretty big window for the page to get corrupted.
> The storage layer shouldn't care or know about that though, we hand it a
> crc and it makes sure data matching that crc goes to the media.

I think the only way to get accurate CRCs is to stop modifications
from occurring while the page is under writeback. i.e. when a page
transitions from dirty to writeback we need to unmap any writable
mappings on the page, and then any new modifications (either by the
write() path or through ->fault) need to block waiting for
page writeback to complete before they can proceed...

If we can lock out modification during writeback, we can calculate
CRCs safely at any point in time the page is not mapped. e.g. we
could do the CRC calculation at copy-in time and store it on new
pages. During writeback, if the page has not been mapped then the
stored CRC can be used. If it has been mapped (say writeable
mappings clear the stored CRC during ->fault) then we can
recalculate the CRC once we've transitioned the page to being under
writeback...
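
A tiny sketch of that bookkeeping, with invented fields (a real
implementation would need somewhere to hang the per-page checksum, e.g. a
separate structure keyed by the page; crc32c() is only used as an example
checksum here):

/*
 * Sketch only: per-page checksum bookkeeping as described above. The
 * struct and helpers are invented for illustration.
 */
#include <linux/crc32c.h>
#include <linux/highmem.h>
#include <linux/mm.h>

struct page_csum {
        u32  crc;
        bool valid;     /* cleared when a writable mapping touches the page */
};

/* Called at copy-in time (e.g. ->write_end), while the data is fresh. */
static void store_csum_at_copyin(struct page *page, struct page_csum *pc)
{
        void *kaddr = kmap(page);

        pc->crc = crc32c(~0, kaddr, PAGE_SIZE);
        pc->valid = true;
        kunmap(page);
}

/* Called from ->page_mkwrite: a writable mapping may change the data. */
static void invalidate_csum_on_fault(struct page_csum *pc)
{
        pc->valid = false;
}

/* Called at writeback time, once the page has gone under writeback. */
static u32 csum_for_writeback(struct page *page, struct page_csum *pc)
{
        void *kaddr;
        u32 crc;

        if (pc->valid)
                return pc->crc;         /* page was never mapped writable */

        kaddr = kmap(page);
        crc = crc32c(~0, kaddr, PAGE_SIZE);     /* recompute under writeback */
        kunmap(page);
        return crc;
}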

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-06-04 03:09:50

by Nick Piggin

[permalink] [raw]
Subject: Re: Wrong DIF guard tag on ext2 write

On Thu, Jun 03, 2010 at 09:46:02PM -0400, Martin K. Petersen wrote:
> >>>>> "Nick" == Nick Piggin <[email protected]> writes:
>
> Nick> Also I don't think we can deal with memory errors and scribbles
> Nick> just by crcing dirty data. The calculations generating the data
> Nick> could get corrupted.
>
> Yep, the goal is to make the window as small as possible.
>
>
> Nick> Data can be corrupted on its way back from the device to
> Nick> userspace.
>
> We also get a CRC back from the storage. So the (integrity-aware)
> application is also able to check on read.

Well that's nice :)


> Nick> Obviously this feature is being pushed by databases and such that
> Nick> really want to pass checksums all the way from userspace. Block
> Nick> retrying is _not_ needed or wanted here of course.
>
> Nope. The integrity error is bubbled all the way up to the database and
> we can decide to retry, recreate or error out depending on what we find
> when we do validation checks on the data buffer and the integrity
> metadata.

By block retrying, I just meant the bounce / re-checksum approach.

2010-06-04 15:32:30

by Jan Kara

[permalink] [raw]
Subject: Re: Wrong DIF guard tag on ext2 write

On Fri 04-06-10 12:02:43, Dave Chinner wrote:
> On Thu, Jun 03, 2010 at 11:46:34AM -0400, Chris Mason wrote:
> > On Wed, Jun 02, 2010 at 11:41:21PM +1000, Nick Piggin wrote:
> > > Closing the while it is dirty, while it is being written back window
> > > still leaves a pretty big window. Also, how do you handle mmap writes?
> > > Write protect and checksum the destination page after every store? Or
> > > leave some window between when the pagecache is dirtied and when it is
> > > written back? So I don't know whether it's worth putting a lot of effort
> > > into this case.
> >
> > So, changing gears to how do we protect filesystem page cache pages
> > instead of the generic idea of dif/dix, btrfs crcs just before writing,
> > which does leave a pretty big window for the page to get corrupted.
> > The storage layer shouldn't care or know about that though, we hand it a
> > crc and it makes sure data matching that crc goes to the media.
>
> I think the only way to get accurate CRCs is to stop modifications
> from occurring while the page is under writeback. i.e. when a page
> transitions from dirty to writeback we need to unmap any writable
> mappings on the page, and then any new modifications (either by the
> write() path or through ->fault) need to block waiting for
> page writeback to complete before they can proceed...
Actually, we already write-protect the page in clear_page_dirty_for_io
so the first part already happens. Any filesystem can do
wait_on_page_writeback() in its ->page_mkwrite function so even the second
part shouldn't be hard. I'm just a bit worried about the performance
implications / hidden deadlocks...
Also, we'd have to wait_on_page_writeback() in the ->write_begin function to
protect against ordinary writes, but that's the easy part...
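
A minimal sketch of what that could look like, assuming the
->page_mkwrite(vma, vmf) prototype used around 2.6.30+ (the function name
is generic, and a real filesystem would add its own block allocation and
journalling around this):

/*
 * Sketch only: keep a page stable during writeback by making page faults
 * wait for writeback to finish before the page can be redirtied. Loosely
 * follows the shape of filemap_page_mkwrite(), with the wait added.
 */
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

static int example_page_mkwrite(struct vm_area_struct *vma,
                                struct vm_fault *vmf)
{
        struct page *page = vmf->page;
        struct inode *inode = vma->vm_file->f_mapping->host;

        lock_page(page);
        if (page->mapping != inode->i_mapping) {
                /* page was truncated out from under us */
                unlock_page(page);
                return VM_FAULT_NOPAGE;
        }

        /* the key step: don't let a store race with in-flight writeback */
        wait_on_page_writeback(page);

        /* return with the page locked; the fault path handles unlocking */
        return VM_FAULT_LOCKED;
}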

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2010-06-08 07:18:41

by Christof Schmitt

[permalink] [raw]
Subject: Re: Wrong DIF guard tag on ext2 write

On Tue, Jun 01, 2010 at 09:58:18AM -0400, Chris Mason wrote:
> On Tue, Jun 01, 2010 at 03:50:59PM +0200, Christof Schmitt wrote:
> > On Tue, Jun 01, 2010 at 09:03:25AM -0400, Chris Mason wrote:
> > > On Tue, Jun 01, 2010 at 12:30:42PM +0200, Christof Schmitt wrote:
> > > > On Mon, May 31, 2010 at 06:30:05PM +0300, Boaz Harrosh wrote:
> > > > > On 05/31/2010 06:01 PM, James Bottomley wrote:
> > > > > > On Mon, 2010-05-31 at 10:20 -0400, Martin K. Petersen wrote:
> > > > > >>>>>>> "Christof" == Christof Schmitt <[email protected]> writes:
> > > > > >>
> > > > > >> Christof> Since the guard tags are created in Linux, it seems that the
> > > > > >> Christof> data attached to the write request changes between the
> > > > > >> Christof> generation in bio_integrity_generate and the call to
> > > > > >> Christof> sd_prep_fn.
> > > > > >>
> > > > > >> Yep, known bug. Page writeback locking is messed up for buffer_head
> > > > > >> users. The extNfs folks volunteered to look into this a while back but
> > > > > >> I don't think they have found the time yet.
> > > > > >>
> > > > > >>
> > > > > >> Christof> Using ext3 or ext4 instead of ext2 does not show the problem.
> > > > > >>
> > > > > >> Last I looked there were still code paths in ext3 and ext4 that
> > > > > >> permitted pages to be changed during flight. I guess you've just been
> > > > > >> lucky.
> > > > > >
> > > > > > Pages have always been modifiable in flight. The OS guarantees they'll
> > > > > > be rewritten, so the drivers can drop them if it detects the problem.
> > > > > > This is identical to the iscsi checksum issue (iscsi adds a checksum
> > > > > > because it doesn't trust TCP/IP and if the checksum is generated in
> > > > > > software, there's time between generation and page transmission for the
> > > > > > alteration to occur). The solution in the iscsi case was not to
> > > > > > complain if the page is still marked dirty.
> > > > > >
> > > > >
> > > > > And also why RAID1 and RAID4/5/6 need the data bounced. I wish VFS
> > > > > would prevent data writing given a device queue flag that requests
> > > > > it. So all these devices and modes could just flag the VFS/filesystems
> > > > > that: "please don't allow concurrent writes, otherwise I need to copy data"
> > > > >
> > > > > From what Chris Mason has said before, all the mechanics are there, and it's
> > > > > what btrfs is doing. Though I don't know how myself?
> > > >
> > > > I also tested with btrfs and invalid guard tags in writes have been
> > > > encountered as well (again in 2.6.34). The only difference is that no
> > > > error was reported to userspace, although this might be a
> > > > configuration issue.
> > >
> > > This would be a btrfs bug. We have strict checks in place that are
> > > supposed to prevent buffers changing while in flight. What was the
> > > workload that triggered this problem?
> >
> > I am running an internal test tool that creates files with a known
> > pattern until the disk is full, reads the data to verify if the
> > pattern is still intact, removes the files and starts over.
>
> Ok, is the lba in the output the sector offset? We can map that to a
> btrfs block and figure out what it was.
>
> Btrfs never complains about the IO error? We really should explode.

Right now, I don't have a system available to continue with this. When
I have a chance to run it again, I can try the latest rc kernel to see
if there is any difference.

2010-06-09 15:57:38

by Vladislav Bolkhovitin

[permalink] [raw]
Subject: Re: Wrong DIF guard tag on ext2 write

Vladislav Bolkhovitin, on 06/03/2010 04:46 PM wrote:
>
> Vladislav Bolkhovitin, on 06/03/2010 04:41 PM wrote:
>> Boaz Harrosh, on 06/03/2010 04:07 PM wrote:
>>> On 06/03/2010 02:20 PM, Vladislav Bolkhovitin wrote:
>>>> There's one interesting problem here, at least theoretically, with SCSI
>>>> or similar transports which allow to have commands queue depth >1 and
>>>> allowed to internally reorder queued requests. I don't know the FS/block
>>>> layers sufficiently well to tell if sending several requests for the
>>>> same page really possible or not, but we can see a real life problem,
>>>> which can be well explained if it's possible.
>>>>
>>>> The problem could be if the second (rewrite) request (SCSI command) for
>>>> the same page queued to the corresponding device before the original
>>>> request finished. Since the device allowed to freely reorder requests,
>>>> there's a probability that the original write request would hit the
>>>> permanent storage *AFTER* the retry request, hence the data changes it's
>>>> carrying would be lost, hence welcome data corruption.
>>>>
>>> I might be totally wrong here but I think NCQ can reorder sectors but
>>> not writes. That is if the sector is cached in device memory and a later
>>> write comes to modify the same sector then the original should be
>>> replaced not two values of the same sector be kept in device cache at the
>>> same time.
>>>
>>> Failing to do so is a scsi device problem.
>> SCSI devices supporting Full task management model (almost all) and
>> having QUEUE ALGORITHM MODIFIER bits in Control mode page set to 1
>> allowed to freely reorder any commands with SIMPLE task attribute. If an
>> application wants to maintain order of some commands for such devices,
>> it must issue them with ORDERED task attribute and over a _single_ MPIO
>> path to the device.
>>
>> Linux neither uses ORDERED attribute, nor honors or enforces anyhow
>> QUEUE ALGORITHM MODIFIER bits, nor takes care to send commands with
>> order dependencies (overlapping writes in our case) over a single MPIO path.
>>
>>> Please note that page-to-sector is not necessary constant. And the same page
>>> might get written at a different sector, next time. But FSs will have to
>>> barrier in this case.
>>>
>>>> For single parallel SCSI or SAS devices such race may look practically
>>>> impossible, but for sophisticated clusters when many nodes pretending to
>>>> be a single SCSI device in a load balancing configuration, it becomes
>>>> very real.
>>>>
>>>> The real life problem we can see in an active-active DRBD-setup. In this
>>>> configuration 2 nodes act as a single SCST-powered SCSI device and they
>>>> both run DRBD to keep their backstorage in-sync. The initiator uses them
>>>> as a single multipath device in an active-active round-robin
>>>> load-balancing configuration, i.e. sends requests to both nodes in
>>>> parallel, then DRBD takes care to replicate the requests to the other node.
>>>>
>>>> The problem is that sometimes DRBD complies about concurrent local
>>>> writes, like:
>>>>
>>>> kernel: drbd0: scsi_tgt0[12503] Concurrent local write detected!
>>>> [DISCARD L] new: 144072784s +8192; pending: 144072784s +8192
>>>>
>>>> This message means that DRBD detected that both nodes received
>>>> overlapping writes on the same block(s) and DRBD can't figure out which
>>>> one to store. This is possible only if the initiator sent the second
>>>> write request before the first one completed.
>>> It is totally possible in today's code.
>>>
>>> DRBD should store the original command_sn of the write and discard
>>> the sector with the lower SN. It should appear as a single device
>>> to the initiator.
>> How can it find the SN? The commands were sent over _different_ MPIO
>> paths to the device, so at the moment of the sending all the order
>> information was lost.
>>
>> Until SCSI generally allowed to preserve ordering information between
>> MPIO paths in such configurations the only way to maintain commands
>> order would be queue draining. Hence, for safety all initiators working
>> with such devices must do it.
>>
>> But looks like Linux doesn't do it, so unsafe with MPIO clusters?
>
> I meant load balancing MPIO clusters.

Actually, if we consider the processing of exception conditions like Task
Set Full status or Unit Attentions, queuing several write commands for
the same page(s) is not safe for all other MPIO clusters either, nor for
single-path SCSI-transport devices, like regular HDDs or other
SAS/FC/iSCSI/... storage.

This is because, in case of an exception condition, the first write command
could be finished prematurely to deliver the exception condition status
to the initiator, while the commands queued after it would be neither
aborted nor suspended. So, after a retry, the command can be queued
_after_ the second write command, hence they would be executed in the
reverse order, with the corresponding data corruption.

To prevent such things, the SCSI standard provides the ACA and UA interlock
facilities, but Linux doesn't use them.

Thus, to be safe, Linux should:

1. Either not write on pages under IO, hence not queue retries,

2. Or queue retries only after the original write has finished (a rough
sketch of this option follows).
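
The sketch of option 2 below is not existing block layer code, and every
name in it is invented for illustration. The idea is simply to record
in-flight write ranges and make an overlapping (retried) write wait until
the original write completes:

/*
 * Illustrative sketch of option 2 above; not existing block layer code.
 * An overlapping (retried) write sleeps until the original write for the
 * same sectors has completed, so the two can never be reordered.
 */
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/types.h>
#include <linux/wait.h>

struct inflight_write {
        struct list_head list;
        sector_t start;
        sector_t len;           /* length in sectors */
};

static LIST_HEAD(inflight_list);
static DEFINE_SPINLOCK(inflight_lock);
static DECLARE_WAIT_QUEUE_HEAD(inflight_wq);

static bool range_busy(sector_t start, sector_t len)
{
        struct inflight_write *w;

        list_for_each_entry(w, &inflight_list, list)
                if (start < w->start + w->len && w->start < start + len)
                        return true;
        return false;
}

static bool range_idle(sector_t start, sector_t len)
{
        bool busy;

        spin_lock(&inflight_lock);
        busy = range_busy(start, len);
        spin_unlock(&inflight_lock);
        return !busy;
}

/* Call before issuing the write; returns with the range registered. */
static void wait_for_overlap_and_track(struct inflight_write *w)
{
        for (;;) {
                wait_event(inflight_wq, range_idle(w->start, w->len));
                spin_lock(&inflight_lock);
                if (!range_busy(w->start, w->len)) {
                        /* re-checked under the lock, safe to proceed */
                        list_add(&w->list, &inflight_list);
                        spin_unlock(&inflight_lock);
                        return;
                }
                spin_unlock(&inflight_lock);
        }
}

/* Call from the write's completion path. */
static void untrack_and_wake(struct inflight_write *w)
{
        spin_lock(&inflight_lock);
        list_del(&w->list);
        spin_unlock(&inflight_lock);
        wake_up(&inflight_wq);
}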

Vlad

2010-07-23 17:59:50

by Gennadiy Nerubayev

[permalink] [raw]
Subject: Re: Wrong DIF guard tag on ext2 write

On Thu, Jun 3, 2010 at 7:20 AM, Vladislav Bolkhovitin <[email protected]> wrote:
>
> James Bottomley, on 06/01/2010 05:27 PM wrote:
>>
>> On Tue, 2010-06-01 at 12:30 +0200, Christof Schmitt wrote:
>>>
>>> What is the best strategy to continue with the invalid guard tags on
>>> write requests? Should this be fixed in the filesystems?
>>
>> For write requests, as long as the page dirty bit is still set, it's
>> safe to drop the request, since it's already going to be repeated. What
>> we probably want is an error code we can return that the layer that sees
>> both the request and the page flags can make the call.
>>
>>> Another idea would be to pass invalid guard tags on write requests
>>> down to the hardware, expect an "invalid guard tag" error and report
>>> it to the block layer where a new checksum is generated and the
>>> request is issued again. Basically implement a retry through the whole
>>> I/O stack. But this also sounds complicated.
>>
>> No, no ... as long as the guard tag is wrong because the fs changed the
>> page, the write request for the updated page will already be queued or
>> in-flight, so there's no need to retry.
>
> There's one interesting problem here, at least theoretically, with SCSI or similar transports which allow to have commands queue depth >1 and allowed to internally reorder queued requests. I don't know the FS/block layers sufficiently well to tell if sending several requests for the same page really possible or not, but we can see a real life problem, which can be well explained if it's possible.
>
> The problem could be if the second (rewrite) request (SCSI command) for the same page queued to the corresponding device before the original request finished. Since the device allowed to freely reorder requests, there's a probability that the original write request would hit the permanent storage *AFTER* the retry request, hence the data changes it's carrying would be lost, hence welcome data corruption.
>
> For single parallel SCSI or SAS devices such race may look practically impossible, but for sophisticated clusters when many nodes pretending to be a single SCSI device in a load balancing configuration, it becomes very real.
>
> The real life problem we can see in an active-active DRBD-setup. In this configuration 2 nodes act as a single SCST-powered SCSI device and they both run DRBD to keep their backstorage in-sync. The initiator uses them as a single multipath device in an active-active round-robin load-balancing configuration, i.e. sends requests to both nodes in parallel, then DRBD takes care to replicate the requests to the other node.
>
> The problem is that sometimes DRBD complies about concurrent local writes, like:
>
> kernel: drbd0: scsi_tgt0[12503] Concurrent local write detected! [DISCARD L] new: 144072784s +8192; pending: 144072784s +8192
>
> This message means that DRBD detected that both nodes received overlapping writes on the same block(s) and DRBD can't figure out which one to store. This is possible only if the initiator sent the second write request before the first one completed.
>
> The topic of the discussion could well explain the cause of that. But, unfortunately, people who reported it forgot to note which OS they run on the initiator, i.e. I can't say for sure it's Linux.

Sorry for the late chime in, but here's some more information of
potential interest, as I've previously inquired about this on the drbd
mailing list:

1. It only happens when using blockio mode in IET or SCST. Fileio,
nv_cache, and write_through do not generate the warnings.
2. It happens on active/passive drbd clusters (on the active node
obviously), NOT active/active. In fact, I've found that doing round
robin on active/active is a Bad Idea (tm) even with a clustered
filesystem, until at least the target software is able to synchronize
the command state of either node.
3. Linux and ESX initiators can generate the warning, but I've so far
only been able to reliably reproduce it using a Windows initiator and
sqlio or iometer benchmarks. I'll be trying again using iometer when I
have the time.
4. It only happens using a random write io workload (any block size),
with initiator threads >1, OR initiator queue depth >1. The higher
either of those is, the more spammy the warnings become.
5. The transport does not matter (reproduced with iSCSI and SRP)
6. If DRBD is disconnected (primary/unknown), the warnings are not
generated. As soon as it's reconnected (primary/secondary), the
warnings will reappear.

(sorry for the duplicate, forgot to plaintext)

-Gennadiy

2010-07-23 19:16:52

by Vladislav Bolkhovitin

[permalink] [raw]
Subject: Re: Wrong DIF guard tag on ext2 write

Gennadiy Nerubayev, on 07/23/2010 09:59 PM wrote:
> On Thu, Jun 3, 2010 at 7:20 AM, Vladislav Bolkhovitin<[email protected]> wrote:
>>
>> James Bottomley, on 06/01/2010 05:27 PM wrote:
>>>
>>> On Tue, 2010-06-01 at 12:30 +0200, Christof Schmitt wrote:
>>>>
>>>> What is the best strategy to continue with the invalid guard tags on
>>>> write requests? Should this be fixed in the filesystems?
>>>
>>> For write requests, as long as the page dirty bit is still set, it's
>>> safe to drop the request, since it's already going to be repeated. What
>>> we probably want is an error code we can return that the layer that sees
>>> both the request and the page flags can make the call.
>>>
>>>> Another idea would be to pass invalid guard tags on write requests
>>>> down to the hardware, expect an "invalid guard tag" error and report
>>>> it to the block layer where a new checksum is generated and the
>>>> request is issued again. Basically implement a retry through the whole
>>>> I/O stack. But this also sounds complicated.
>>>
>>> No, no ... as long as the guard tag is wrong because the fs changed the
>>> page, the write request for the updated page will already be queued or
>>> in-flight, so there's no need to retry.
>>
>> There's one interesting problem here, at least theoretically, with SCSI or similar transports which allow to have commands queue depth>1 and allowed to internally reorder queued requests. I don't know the FS/block layers sufficiently well to tell if sending several requests for the same page really possible or not, but we can see a real life problem, which can be well explained if it's possible.
>>
>> The problem could be if the second (rewrite) request (SCSI command) for the same page queued to the corresponding device before the original request finished. Since the device allowed to freely reorder requests, there's a probability that the original write request would hit the permanent storage *AFTER* the retry request, hence the data changes it's carrying would be lost, hence welcome data corruption.
>>
>> For single parallel SCSI or SAS devices such race may look practically impossible, but for sophisticated clusters when many nodes pretending to be a single SCSI device in a load balancing configuration, it becomes very real.
>>
>> The real life problem we can see in an active-active DRBD-setup. In this configuration 2 nodes act as a single SCST-powered SCSI device and they both run DRBD to keep their backstorage in-sync. The initiator uses them as a single multipath device in an active-active round-robin load-balancing configuration, i.e. sends requests to both nodes in parallel, then DRBD takes care to replicate the requests to the other node.
>>
>> The problem is that sometimes DRBD complies about concurrent local writes, like:
>>
>> kernel: drbd0: scsi_tgt0[12503] Concurrent local write detected! [DISCARD L] new: 144072784s +8192; pending: 144072784s +8192
>>
>> This message means that DRBD detected that both nodes received overlapping writes on the same block(s) and DRBD can't figure out which one to store. This is possible only if the initiator sent the second write request before the first one completed.
>>
>> The topic of the discussion could well explain the cause of that. But, unfortunately, people who reported it forgot to note which OS they run on the initiator, i.e. I can't say for sure it's Linux.
>
> Sorry for the late chime in, but here's some more information of
> potential interest as I've previously inquired about this to the drbd
> mailing list:
>
> 1. It only happens when using blockio mode in IET or SCST. Fileio,
> nv_cache, and write_through do not generate the warnings.

Some explanations for those who are not familiar with the terminology:

- "Fileio" means the Linux IO stack on the target receives IO via
vfs_readv()/vfs_writev()

- "NV_CACHE" means all the cache synchronization requests
(SYNCHRONIZE_CACHE, FUA) from the initiator are ignored

- "WRITE_THROUGH" means write through, i.e. the corresponding backend
file for the device is opened with the O_SYNC flag.

> 2. It happens on active/passive drbd clusters (on the active node
> obviously), NOT active/active. In fact, I've found that doing round
> robin on active/active is a Bad Idea (tm) even with a clustered
> filesystem, until at least the target software is able to synchronize
> the command state of either node.
> 3. Linux and ESX initiators can generate the warning, but I've so far
> only been able to reliably reproduce it using a Windows initiator and
> sqlio or iometer benchmarks. I'll be trying again using iometer when I
> have the time.
> 4. It only happens using a random write io workload (any block size),
> with initiator threads>1, OR initiator queue depth>1. The higher
> either of those is, the more spammy the warnings become.
> 5. The transport does not matter (reproduced with iSCSI and SRP)
> 6. If DRBD is disconnected (primary/unknown), the warnings are not
> generated. As soon as it's reconnected (primary/secondary), the
> warnings will reappear.

It would be great if you could prove or disprove our suspicion that Linux
can produce several write requests for the same blocks simultaneously. To
be sure we need:

1. The initiator to be Linux. Windows and ESX are not needed for this
particular case.

2. If you are able to reproduce it, a full description of which
application was used on the initiator to generate the load and in
which mode.

The target and DRBD configuration don't matter; you can use any.

Thanks,
Vlad

2010-07-23 20:52:04

by Gennadiy Nerubayev

[permalink] [raw]
Subject: Re: Wrong DIF guard tag on ext2 write

On Fri, Jul 23, 2010 at 3:16 PM, Vladislav Bolkhovitin <[email protected]> wrote:
> Gennadiy Nerubayev, on 07/23/2010 09:59 PM wrote:
>>
>> On Thu, Jun 3, 2010 at 7:20 AM, Vladislav Bolkhovitin<[email protected]>
>> wrote:
>>>
>>> James Bottomley, on 06/01/2010 05:27 PM wrote:
>>>>
>>>> On Tue, 2010-06-01 at 12:30 +0200, Christof Schmitt wrote:
>>>>>
>>>>> What is the best strategy to continue with the invalid guard tags on
>>>>> write requests? Should this be fixed in the filesystems?
>>>>
>>>> For write requests, as long as the page dirty bit is still set, it's
>>>> safe to drop the request, since it's already going to be repeated. What
>>>> we probably want is an error code we can return that the layer that sees
>>>> both the request and the page flags can make the call.
>>>>
>>>>> Another idea would be to pass invalid guard tags on write requests
>>>>> down to the hardware, expect an "invalid guard tag" error and report
>>>>> it to the block layer where a new checksum is generated and the
>>>>> request is issued again. Basically implement a retry through the whole
>>>>> I/O stack. But this also sounds complicated.
>>>>
>>>> No, no ... as long as the guard tag is wrong because the fs changed the
>>>> page, the write request for the updated page will already be queued or
>>>> in-flight, so there's no need to retry.
>>>
>>> There's one interesting problem here, at least theoretically, with SCSI
>>> or similar transports which allow to have commands queue depth>1 and allowed
>>> to internally reorder queued requests. I don't know the FS/block layers
>>> sufficiently well to tell if sending several requests for the same page
>>> really possible or not, but we can see a real life problem, which can be
>>> well explained if it's possible.
>>>
>>> The problem could be if the second (rewrite) request (SCSI command) for
>>> the same page queued to the corresponding device before the original request
>>> finished. Since the device allowed to freely reorder requests, there's a
>>> probability that the original write request would hit the permanent storage
>>> *AFTER* the retry request, hence the data changes it's carrying would be
>>> lost, hence welcome data corruption.
>>>
>>> For single parallel SCSI or SAS devices such a race may look practically
>>> impossible, but for sophisticated clusters where many nodes pretend to be
>>> a single SCSI device in a load-balancing configuration, it becomes very
>>> real.
>>>
>>> The real life problem we can see in an active-active DRBD-setup. In this
>>> configuration 2 nodes act as a single SCST-powered SCSI device and they both
>>> run DRBD to keep their backstorage in-sync. The initiator uses them as a
>>> single multipath device in an active-active round-robin load-balancing
>>> configuration, i.e. sends requests to both nodes in parallel, then DRBD
>>> takes care to replicate the requests to the other node.
>>>
>>> The problem is that sometimes DRBD complains about concurrent local
>>> writes, like:
>>>
>>> kernel: drbd0: scsi_tgt0[12503] Concurrent local write detected! [DISCARD
>>> L] new: 144072784s +8192; pending: 144072784s +8192
>>>
>>> This message means that DRBD detected that both nodes received
>>> overlapping writes on the same block(s) and DRBD can't figure out which one
>>> to store. This is possible only if the initiator sent the second write
>>> request before the first one completed.
>>>
>>> The topic of the discussion could well explain the cause of that. But,
>>> unfortunately, people who reported it forgot to note which OS they run on
>>> the initiator, i.e. I can't say for sure it's Linux.
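For readers unfamiliar with that warning format, a conceptual sketch (not DRBD source) of the check behind it, reading "144072784s +8192" as a start sector plus a length in bytes, which is an assumption made here only for the sake of the example:

#include <stdbool.h>
#include <stdint.h>

/* Two in-flight writes conflict if their 512-byte sector ranges
 * intersect; the target then cannot know which payload should win. */
static bool writes_overlap(uint64_t new_sector, uint32_t new_bytes,
			   uint64_t pending_sector, uint32_t pending_bytes)
{
	uint64_t new_end = new_sector + new_bytes / 512;          /* exclusive */
	uint64_t pending_end = pending_sector + pending_bytes / 512;

	return new_sector < pending_end && pending_sector < new_end;
}

For the logged values the two ranges are identical, so the check is true, which is exactly the situation the warning reports.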
>>
>> Sorry for the late chime in, but here's some more information of
>> potential interest as I've previously inquired about this to the drbd
>> mailing list:
>>
>> 1. It only happens when using blockio mode in IET or SCST. Fileio,
>> nv_cache, and write_through do not generate the warnings.
>
> Some explanations for those who are not familiar with the terminology:
>
> - "Fileio" means Linux IO stack on the target receives IO via
> vfs_readv()/vfs_writev()
>
> - "NV_CACHE" means all the cache synchronization requests
> (SYNCHRONIZE_CACHE, FUA) from the initiator are ignored
>
> - "WRITE_THROUGH" means write through, i.e. the corresponding backend file
> for the device is opened with the O_SYNC flag.
>
>> 2. It happens on active/passive drbd clusters (on the active node
>> obviously), NOT active/active. In fact, I've found that doing round
>> robin on active/active is a Bad Idea (tm) even with a clustered
>> filesystem, until at least the target software is able to synchronize
>> the command state of either node.
>> 3. Linux and ESX initiators can generate the warning, but I've so far
>> only been able to reliably reproduce it using a Windows initiator and
>> sqlio or iometer benchmarks. I'll be trying again using iometer when I
>> have the time.
>> 4. It only happens using a random write io workload (any block size),
>> with initiator threads>1, OR initiator queue depth>1. The higher
>> either of those is, the more spammy the warnings become.
>> 5. The transport does not matter (reproduced with iSCSI and SRP)
>> 6. If DRBD is disconnected (primary/unknown), the warnings are not
>> generated. As soon as it's reconnected (primary/secondary), the
>> warnings will reappear.
>
> It would be great if you prove or disprove our suspicions that Linux can
> produce several write requests for the same blocks simultaneously. To be
> sure we need:
>
> 1. The initiator is Linux. Windows and ESX are not needed for this
> particular case.
>
> 2. If you are able to reproduce it, we will need a full description of which
> application was used on the initiator to generate the load and in which mode.
>
> Target and DRBD configuration doesn't matter, you can use any.

I just tried, and this particular DRBD warning is not reproducible
with IO (iometer) coming from a Linux initiator (2.6.30.10). The same
iometer parameters were used as on Windows, and both the base device
and a filesystem (ext3) were tested; both were negative. I'll try a
few more tests, but it seems that this is a non-issue with a Linux
initiator.

Hope that helps,

-Gennadiy

2010-07-24 01:04:12

by Dave Chinner

[permalink] [raw]
Subject: Re: Wrong DIF guard tag on ext2 write

On Fri, Jul 23, 2010 at 11:16:33PM +0400, Vladislav Bolkhovitin wrote:
> It would be great if you prove or disprove our suspicions that Linux
> can produce several write requests for the same blocks
> simultaneously. To be sure we need:

Just use direct IO. Case in point is the concurrent sub-block
AIO-DIO data corruption we're chasing on XFS and ext4 at the moment
where we have two concurrent unaligned write IOs to the same
filesystem block:

http://oss.sgi.com/archives/xfs/2010-07/msg00278.html

Cheers,

Dave.
--
Dave Chinner
[email protected]
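To make that scenario concrete, here is a rough sketch, assuming libaio, a 4096-byte filesystem block and an arbitrary test file name: two 512-byte direct-IO writes are submitted in one io_submit() call, so both are in flight against the same filesystem block at the same time.

/* Two sub-block (512-byte) AIO+DIO writes into the same 4k filesystem
 * block, queued together so they are concurrently in flight.
 * Build with: gcc -D_GNU_SOURCE aio-dio.c -laio */
#include <fcntl.h>
#include <libaio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	io_context_t ctx = 0;
	struct iocb cb[2], *cbs[2] = { &cb[0], &cb[1] };
	struct io_event ev[2];
	void *buf[2];
	int fd, i;

	fd = open("aio-dio-test.dat", O_RDWR | O_CREAT | O_DIRECT, 0644);
	if (fd < 0 || io_setup(8, &ctx) != 0)
		return 1;

	for (i = 0; i < 2; i++) {
		if (posix_memalign(&buf[i], 512, 512))
			return 1;
		memset(buf[i], 'A' + i, 512);
		/* offsets 0 and 512 both fall inside filesystem block 0 */
		io_prep_pwrite(&cb[i], fd, buf[i], 512, 512LL * i);
	}

	if (io_submit(ctx, 2, cbs) != 2)	/* both writes queued at once */
		return 1;
	if (io_getevents(ctx, 2, 2, ev, NULL) != 2)
		return 1;

	io_destroy(ctx);
	close(fd);
	return 0;
}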

2010-07-26 12:22:42

by Vladislav Bolkhovitin

[permalink] [raw]
Subject: Re: Wrong DIF guard tag on ext2 write

Gennadiy Nerubayev, on 07/24/2010 12:51 AM wrote:
>>>> The real life problem we can see in an active-active DRBD-setup. In this
>>>> configuration 2 nodes act as a single SCST-powered SCSI device and they both
>>>> run DRBD to keep their backstorage in-sync. The initiator uses them as a
>>>> single multipath device in an active-active round-robin load-balancing
>>>> configuration, i.e. sends requests to both nodes in parallel, then DRBD
>>>> takes care to replicate the requests to the other node.
>>>>
>>>> The problem is that sometimes DRBD complains about concurrent local
>>>> writes, like:
>>>>
>>>> kernel: drbd0: scsi_tgt0[12503] Concurrent local write detected! [DISCARD
>>>> L] new: 144072784s +8192; pending: 144072784s +8192
>>>>
>>>> This message means that DRBD detected that both nodes received
>>>> overlapping writes on the same block(s) and DRBD can't figure out which one
>>>> to store. This is possible only if the initiator sent the second write
>>>> request before the first one completed.
>>>>
>>>> The topic of the discussion could well explain the cause of that. But,
>>>> unfortunately, people who reported it forgot to note which OS they run on
>>>> the initiator, i.e. I can't say for sure it's Linux.
>>>
>>> Sorry for the late chime in, but here's some more information of
>>> potential interest as I've previously inquired about this to the drbd
>>> mailing list:
>>>
>>> 1. It only happens when using blockio mode in IET or SCST. Fileio,
>>> nv_cache, and write_through do not generate the warnings.
>>
>> Some explanations for those who are not familiar with the terminology:
>>
>> - "Fileio" means Linux IO stack on the target receives IO via
>> vfs_readv()/vfs_writev()
>>
>> - "NV_CACHE" means all the cache synchronization requests
>> (SYNCHRONIZE_CACHE, FUA) from the initiator are ignored
>>
>> - "WRITE_THROUGH" means write through, i.e. the corresponding backend file
>> for the device is opened with the O_SYNC flag.
>>
>>> 2. It happens on active/passive drbd clusters (on the active node
>>> obviously), NOT active/active. In fact, I've found that doing round
>>> robin on active/active is a Bad Idea (tm) even with a clustered
>>> filesystem, until at least the target software is able to synchronize
>>> the command state of either node.
>>> 3. Linux and ESX initiators can generate the warning, but I've so far
>>> only been able to reliably reproduce it using a Windows initiator and
>>> sqlio or iometer benchmarks. I'll be trying again using iometer when I
>>> have the time.
>>> 4. It only happens using a random write io workload (any block size),
>>> with initiator threads>1, OR initiator queue depth>1. The higher
>>> either of those is, the more spammy the warnings become.
>>> 5. The transport does not matter (reproduced with iSCSI and SRP)
>>> 6. If DRBD is disconnected (primary/unknown), the warnings are not
>>> generated. As soon as it's reconnected (primary/secondary), the
>>> warnings will reappear.
>>
>> It would be great if you prove or disprove our suspicions that Linux can
>> produce several write requests for the same blocks simultaneously. To be
>> sure we need:
>>
>> 1. The initiator is Linux. Windows and ESX are not needed for this
>> particular case.
>>
>> 2. If you are able to reproduce it, we will need a full description of which
>> application was used on the initiator to generate the load and in which mode.
>>
>> Target and DRBD configuration doesn't matter, you can use any.
>
> I just tried, and this particular DRBD warning is not reproducible
> with io (iometer) coming from a Linux initiator (2.6.30.10) The same
> iometer parameters were used as on windows, and both the base device
> as well as filesystem (ext3) were tested, both negative. I'll try a
> few more tests, but it seems that this is a nonissue with a Linux
> initiator.

OK, but to be completely sure, can you also check with load
generators other than IOmeter, please? IOmeter on Linux is a lot less
effective than on Windows, because it uses sync IO, while we need a big
multi-IO load to trigger the problem we are discussing, if it exists.
Plus, to catch it we need an FS on the initiator side, not raw
devices. So, something like fio over files on an FS, or dbench, should be
more appropriate. Please don't use direct IO, to avoid the bug Dave
Chinner pointed out to us.

Also, you mentioned above that Linux can generate the warning. Can
you recall on which configuration, including the kernel version, the
load application and its configuration, you have seen it?

Thanks,
Vlad
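Since only the tool names are given above, here is a rough sketch of the kind of load being asked for: several threads doing buffered (non-direct) random writes over one file on the initiator's filesystem. The thread count, file name and sizes are arbitrary; in practice fio with numjobs>1 and iodepth>1 over such a file is the more convenient way to generate the same load.

/* Buffered random-write load from several threads over one file,
 * deliberately without O_DIRECT.  Build with: gcc -pthread randwrite.c */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NTHREADS	8
#define BLOCKSIZE	4096
#define FILEBLOCKS	262144L		/* ~1 GiB working set */
#define WRITES		100000L

static int fd;

static void *writer(void *arg)
{
	unsigned int seed = (unsigned int)(long)arg;
	char buf[BLOCKSIZE];
	long i;

	memset(buf, 0x5a, sizeof(buf));
	for (i = 0; i < WRITES; i++) {
		off_t off = (off_t)(rand_r(&seed) % FILEBLOCKS) * BLOCKSIZE;
		pwrite(fd, buf, BLOCKSIZE, off);	/* goes through the page cache */
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	int i;

	fd = open("randwrite-load.dat", O_RDWR | O_CREAT, 0644);
	if (fd < 0 || ftruncate(fd, (off_t)FILEBLOCKS * BLOCKSIZE) < 0)
		return 1;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, writer, (void *)(long)(i + 1));
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);

	close(fd);
	return 0;
}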

2010-07-26 17:01:05

by Gennadiy Nerubayev

[permalink] [raw]
Subject: Re: Wrong DIF guard tag on ext2 write

On Mon, Jul 26, 2010 at 8:22 AM, Vladislav Bolkhovitin <[email protected]> wrote:
> Gennadiy Nerubayev, on 07/24/2010 12:51 AM wrote:
>>>>>
>>>>> The real life problem we can see in an active-active DRBD-setup. In
>>>>> this
>>>>> configuration 2 nodes act as a single SCST-powered SCSI device and they
>>>>> both
>>>>> run DRBD to keep their backstorage in-sync. The initiator uses them as
>>>>> a
>>>>> single multipath device in an active-active round-robin load-balancing
>>>>> configuration, i.e. sends requests to both nodes in parallel, then DRBD
>>>>> takes care to replicate the requests to the other node.
>>>>>
>>>>> The problem is that sometimes DRBD complains about concurrent local
>>>>> writes, like:
>>>>>
>>>>> kernel: drbd0: scsi_tgt0[12503] Concurrent local write detected!
>>>>> [DISCARD
>>>>> L] new: 144072784s +8192; pending: 144072784s +8192
>>>>>
>>>>> This message means that DRBD detected that both nodes received
>>>>> overlapping writes on the same block(s) and DRBD can't figure out which
>>>>> one
>>>>> to store. This is possible only if the initiator sent the second write
>>>>> request before the first one completed.
>>>>>
>>>>> The topic of the discussion could well explain the cause of that. But,
>>>>> unfortunately, people who reported it forgot to note which OS they run
>>>>> on
>>>>> the initiator, i.e. I can't say for sure it's Linux.
>>>>
>>>> Sorry for the late chime in, but here's some more information of
>>>> potential interest as I've previously inquired about this to the drbd
>>>> mailing list:
>>>>
>>>> 1. It only happens when using blockio mode in IET or SCST. Fileio,
>>>> nv_cache, and write_through do not generate the warnings.
>>>
>>> Some explanations for those who are not familiar with the terminology:
>>>
>>> - "Fileio" means Linux IO stack on the target receives IO via
>>> vfs_readv()/vfs_writev()
>>>
>>> - "NV_CACHE" means all the cache synchronization requests
>>> (SYNCHRONIZE_CACHE, FUA) from the initiator are ignored
>>>
>>> - "WRITE_THROUGH" means write through, i.e. the corresponding backend
>>> file for the device is opened with the O_SYNC flag.
>>>
>>>> 2. It happens on active/passive drbd clusters (on the active node
>>>> obviously), NOT active/active. In fact, I've found that doing round
>>>> robin on active/active is a Bad Idea (tm) even with a clustered
>>>> filesystem, until at least the target software is able to synchronize
>>>> the command state of either node.
>>>> 3. Linux and ESX initiators can generate the warning, but I've so far
>>>> only been able to reliably reproduce it using a Windows initiator and
>>>> sqlio or iometer benchmarks. I'll be trying again using iometer when I
>>>> have the time.
>>>> 4. It only happens using a random write io workload (any block size),
>>>> with initiator threads>1, OR initiator queue depth>1. The higher
>>>> either of those is, the more spammy the warnings become.
>>>> 5. The transport does not matter (reproduced with iSCSI and SRP)
>>>> 6. If DRBD is disconnected (primary/unknown), the warnings are not
>>>> generated. As soon as it's reconnected (primary/secondary), the
>>>> warnings will reappear.
>>>
>>> It would be great if you prove or disprove our suspicions that Linux can
>>> produce several write requests for the same blocks simultaneously. To be
>>> sure we need:
>>>
>>> 1. The initiator is Linux. Windows and ESX are not needed for this
>>> particular case.
>>>
>>> 2. If you are able to reproduce it, we will need a full description of
>>> which application was used on the initiator to generate the load and in
>>> which mode.
>>>
>>> Target and DRBD configuration doesn't matter, you can use any.
>>
>> I just tried, and this particular DRBD warning is not reproducible
>> with io (iometer) coming from a Linux initiator (2.6.30.10) The same
>> iometer parameters were used as on windows, and both the base device
>> as well as filesystem (ext3) were tested, both negative. I'll try a
>> few more tests, but it seems that this is a nonissue with a Linux
>> initiator.
>
> OK, but to be completely sure, can you check also with other load
> generators, than IOmeter, please? IOmeter on Linux is a lot less effective
> than on Windows, because it uses sync IO, while we need big multi-IO load to
> trigger the problem we are discussing, if it exists. Plus, to catch it we
> need an FS on the initiator side, not using raw devices. So, something like
> fio over files on FS or diskbench should be more appropriate. Please don't
> use direct IO to avoid the bug Dave Chinner pointed us out.

I tried both fio and dbench, with the same results. With fio in
particular, I think I used pretty much every possible combination of
engines, directio, and sync settings with 8 threads, 32 queue depth
and random write workload.

> Also, you mentioned above about that Linux can generate the warning. Can you
> recall on which configuration, including the kernel version, the load
> application and its configuration, you have seen it?

Sorry, after double checking, it's only ESX and Windows that generate
them. The majority of the ESX virtuals in question are Windows, though
I can see some indications of ESX servers that have Linux-only
virtuals generating one here and there. It's somewhat difficult to
tell historically, and I probably would not be able to determine what
those virtuals were running at the time.

-Gennadiy

2010-07-26 19:26:14

by Vladislav Bolkhovitin

[permalink] [raw]
Subject: Re: Wrong DIF guard tag on ext2 write

Gennadiy Nerubayev, on 07/26/2010 09:00 PM wrote:
> On Mon, Jul 26, 2010 at 8:22 AM, Vladislav Bolkhovitin<[email protected]> wrote:
>> Gennadiy Nerubayev, on 07/24/2010 12:51 AM wrote:
>>>>>>
>>>>>> The real life problem we can see in an active-active DRBD-setup. In
>>>>>> this
>>>>>> configuration 2 nodes act as a single SCST-powered SCSI device and they
>>>>>> both
>>>>>> run DRBD to keep their backstorage in-sync. The initiator uses them as
>>>>>> a
>>>>>> single multipath device in an active-active round-robin load-balancing
>>>>>> configuration, i.e. sends requests to both nodes in parallel, then DRBD
>>>>>> takes care to replicate the requests to the other node.
>>>>>>
>>>>>> The problem is that sometimes DRBD complains about concurrent local
>>>>>> writes, like:
>>>>>>
>>>>>> kernel: drbd0: scsi_tgt0[12503] Concurrent local write detected!
>>>>>> [DISCARD
>>>>>> L] new: 144072784s +8192; pending: 144072784s +8192
>>>>>>
>>>>>> This message means that DRBD detected that both nodes received
>>>>>> overlapping writes on the same block(s) and DRBD can't figure out which
>>>>>> one
>>>>>> to store. This is possible only if the initiator sent the second write
>>>>>> request before the first one completed.
>>>>>>
>>>>>> The topic of the discussion could well explain the cause of that. But,
>>>>>> unfortunately, people who reported it forgot to note which OS they run
>>>>>> on
>>>>>> the initiator, i.e. I can't say for sure it's Linux.
>>>>>
>>>>> Sorry for the late chime in, but here's some more information of
>>>>> potential interest as I've previously inquired about this to the drbd
>>>>> mailing list:
>>>>>
>>>>> 1. It only happens when using blockio mode in IET or SCST. Fileio,
>>>>> nv_cache, and write_through do not generate the warnings.
>>>>
>>>> Some explanations for those who are not familiar with the terminology:
>>>>
>>>> - "Fileio" means Linux IO stack on the target receives IO via
>>>> vfs_readv()/vfs_writev()
>>>>
>>>> - "NV_CACHE" means all the cache synchronization requests
>>>> (SYNCHRONIZE_CACHE, FUA) from the initiator are ignored
>>>>
>>>> - "WRITE_THROUGH" means write through, i.e. the corresponding backend
>>>> file for the device is opened with the O_SYNC flag.
>>>>
>>>>> 2. It happens on active/passive drbd clusters (on the active node
>>>>> obviously), NOT active/active. In fact, I've found that doing round
>>>>> robin on active/active is a Bad Idea (tm) even with a clustered
>>>>> filesystem, until at least the target software is able to synchronize
>>>>> the command state of either node.
>>>>> 3. Linux and ESX initiators can generate the warning, but I've so far
>>>>> only been able to reliably reproduce it using a Windows initiator and
>>>>> sqlio or iometer benchmarks. I'll be trying again using iometer when I
>>>>> have the time.
>>>>> 4. It only happens using a random write io workload (any block size),
>>>>> with initiator threads>1, OR initiator queue depth>1. The higher
>>>>> either of those is, the more spammy the warnings become.
>>>>> 5. The transport does not matter (reproduced with iSCSI and SRP)
>>>>> 6. If DRBD is disconnected (primary/unknown), the warnings are not
>>>>> generated. As soon as it's reconnected (primary/secondary), the
>>>>> warnings will reappear.
>>>>
>>>> It would be great if you prove or disprove our suspicions that Linux can
>>>> produce several write requests for the same blocks simultaneously. To be
>>>> sure we need:
>>>>
>>>> 1. The initiator is Linux. Windows and ESX are not needed for this
>>>> particular case.
>>>>
>>>> 2. If you are able to reproduce it, we will need a full description of
>>>> which application was used on the initiator to generate the load and in
>>>> which mode.
>>>>
>>>> Target and DRBD configuration doesn't matter, you can use any.
>>>
>>> I just tried, and this particular DRBD warning is not reproducible
>>> with io (iometer) coming from a Linux initiator (2.6.30.10) The same
>>> iometer parameters were used as on windows, and both the base device
>>> as well as filesystem (ext3) were tested, both negative. I'll try a
>>> few more tests, but it seems that this is a nonissue with a Linux
>>> initiator.
>>
>> OK, but to be completely sure, can you check also with other load
>> generators, than IOmeter, please? IOmeter on Linux is a lot less effective
>> than on Windows, because it uses sync IO, while we need big multi-IO load to
>> trigger the problem we are discussing, if it exists. Plus, to catch it we
>> need an FS on the initiator side, not using raw devices. So, something like
>> fio over files on FS or diskbench should be more appropriate. Please don't
>> use direct IO to avoid the bug Dave Chinner pointed us out.
>
> I tried both fio and dbench, with the same results. With fio in
> particular, I think I used pretty much every possible combination of
> engines, directio, and sync settings with 8 threads, 32 queue depth
> and random write workload.
>
>> Also, you mentioned above about that Linux can generate the warning. Can you
>> recall on which configuration, including the kernel version, the load
>> application and its configuration, you have seen it?
>
> Sorry, after double checking, it's only ESX and Windows that generate
> them. The majority of the ESX virtuals in question are Windows, though
> I can see some indications of ESX servers that have Linux-only
> virtuals generating one here and there. It's somewhat difficult to
> tell historically, and I probably would not be able to determine what
> those virtuals were running at the time.

OK, I see. A negative result is also a result. Now we know that Linux
(in contrast to VMware and Windows) works well in this area.

Thank you!
Vlad