2001-07-01 04:53:56

by Benjamin LaHaise

Subject: [RFC][PATCH] first cut 64 bit block support

Hey folks,

Below is the first cut at making the block size limit configurable to 64
bits on x86, as well as always 64 bits on 64 bit machines. The audit
isn't complete yet, but a good chunk of it is done.
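
For orientation before wading into the diff: the heart of the change is a new blkoff_t type selected by CONFIG_BLKOFF_LONGLONG, plus a matching printk format macro (see the linux/types.h and config.in hunks below). A condensed sketch follows; report_sector() is just a made-up illustration, not anything in the patch:

#include <linux/kernel.h>               /* printk, KERN_INFO */

#if defined(CONFIG_BLKOFF_LONGLONG)
#define BLKOFF_FMT "Lu"                 /* printk length modifier for long long */
typedef unsigned long long blkoff_t;    /* 64-bit block/sector offsets */
#else
#define BLKOFF_FMT "lu"
typedef unsigned long blkoff_t;         /* unchanged native word size */
#endif

/* drivers that used to print sectors with %ld/%lu switch to BLKOFF_FMT */
static void report_sector(blkoff_t sector)
{
	printk(KERN_INFO "sector=%" BLKOFF_FMT "\n", sector);
}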

Filesystem 1k-blocks Used Available Use% Mounted on
/dev/md1 7508125768 20 7476280496 1% /mnt/3

This is a 7TB ext2 filesystem on 4KB blocks. The 7TB /dev/md1 consists of
7x 1TB sparse files on loop devices raid0'd together. The current patch
does not have the fixes in the SCSI layer or IDE driver yet; expect the
SCSI fixes in the next version, although I'll need a tester. The
following should be 64 bit clean now: nbd, loop, raid0, raid1, raid5.
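
To sanity-check sizes from user space, the patch also wires up a BLKGETSIZE64 ioctl (see the linux/fs.h and md.c hunks below); for md it simply copies nr_sects back as a long long, i.e. the size in 512-byte sectors, and md is the only driver implementing it so far. A rough user-space sketch, not part of the patch:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>

#ifndef BLKGETSIZE64
#define BLKGETSIZE64 _IO(0x12,109)      /* matches the linux/fs.h hunk below */
#endif

int main(int argc, char **argv)
{
	long long sectors;              /* 512-byte sectors, per the md.c hunk */
	int fd = open(argc > 1 ? argv[1] : "/dev/md1", O_RDONLY);

	if (fd < 0 || ioctl(fd, BLKGETSIZE64, &sectors) < 0) {
		perror("BLKGETSIZE64");
		return 1;
	}
	printf("%lld sectors (%lld 1k-blocks)\n", sectors, sectors / 2);
	close(fd);
	return 0;
}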

Ugly bits: I had to add libgcc.a to satisfy the need for 64 bit division.
Yeah, it sucks, but RAID needs some more massaging before I can remove the
64 bit division completely. This will be fixed.
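
For the curious: a bare 64-bit divide on 32-bit x86 compiles into a call to __udivdi3, which lives in libgcc, hence the Makefile hack. The loop.c changes below already dodge it by trading / and % for shifts and masks; the other obvious route is do_div() from <asm/div64.h>, which divides a 64-bit value by a 32-bit divisor in place and hands back the remainder. A minimal sketch of that route, with stripe_of() purely illustrative and not necessarily how the RAID cleanup will end up looking:

#include <linux/types.h>                /* blkoff_t, added by this patch */
#include <asm/div64.h>                  /* do_div(): 64-by-32 divide, no libgcc needed */

static blkoff_t stripe_of(blkoff_t sector, unsigned int sectors_per_chunk)
{
	unsigned long long n = sector;

	/* n becomes the quotient; do_div() returns the remainder (unused here) */
	do_div(n, sectors_per_chunk);
	return (blkoff_t)n;
}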

Copies of this and later versions of the patch are available at
http://people.redhat.com/bcrl/lb/ and http://www.kvack.org/~blah/lb/ .
Please forward any bug fixes or comments to me. Cheers,

-ben

::::v2.4.6-pre8-largeblock4.diff::::
diff -ur /md0/kernels/2.4/v2.4.6-pre8/arch/i386/Makefile lb-2.4.6-pre8/arch/i386/Makefile
--- /md0/kernels/2.4/v2.4.6-pre8/arch/i386/Makefile Thu May 3 11:22:07 2001
+++ lb-2.4.6-pre8/arch/i386/Makefile Sun Jul 1 00:35:25 2001
@@ -92,6 +92,7 @@

CORE_FILES := arch/i386/kernel/kernel.o arch/i386/mm/mm.o $(CORE_FILES)
LIBS := $(TOPDIR)/arch/i386/lib/lib.a $(LIBS) $(TOPDIR)/arch/i386/lib/lib.a
+LIBS += $(shell gcc -print-libgcc-file-name)

ifdef CONFIG_MATH_EMULATION
SUBDIRS += arch/i386/math-emu
diff -ur /md0/kernels/2.4/v2.4.6-pre8/arch/i386/config.in lb-2.4.6-pre8/arch/i386/config.in
--- /md0/kernels/2.4/v2.4.6-pre8/arch/i386/config.in Sat Jun 30 14:04:26 2001
+++ lb-2.4.6-pre8/arch/i386/config.in Sat Jun 30 15:37:37 2001
@@ -185,6 +185,7 @@
mainmenu_option next_comment
comment 'General setup'

+bool '64 bit block device support' CONFIG_BLKOFF_LONGLONG
bool 'Networking support' CONFIG_NET
bool 'SGI Visual Workstation support' CONFIG_VISWS
if [ "$CONFIG_VISWS" = "y" ]; then
diff -ur /md0/kernels/2.4/v2.4.6-pre8/drivers/block/floppy.c lb-2.4.6-pre8/drivers/block/floppy.c
--- /md0/kernels/2.4/v2.4.6-pre8/drivers/block/floppy.c Mon Feb 26 10:20:05 2001
+++ lb-2.4.6-pre8/drivers/block/floppy.c Sat Jun 30 16:23:07 2001
@@ -468,7 +468,7 @@
*/
static struct floppy_struct user_params[N_DRIVE];

-static int floppy_sizes[256];
+static blkoff_t floppy_sizes[256];
static int floppy_blocksizes[256];

/*
@@ -2640,8 +2640,8 @@

max_sector = _floppy->sect * _floppy->head;

- TRACK = CURRENT->sector / max_sector;
- sector_t = CURRENT->sector % max_sector;
+ TRACK = (int)(CURRENT->sector) / max_sector;
+ sector_t = (int)(CURRENT->sector) % max_sector;
if (_floppy->track && TRACK >= _floppy->track) {
if (CURRENT->current_nr_sectors & 1) {
current_count_sectors = 1;
@@ -2982,7 +2982,7 @@

if (usage_count == 0) {
printk("warning: usage count=0, CURRENT=%p exiting\n", CURRENT);
- printk("sect=%ld cmd=%d\n", CURRENT->sector, CURRENT->cmd);
+ printk("sect=%" BLKOFF_FMT " cmd=%d\n", CURRENT->sector, CURRENT->cmd);
return;
}
if (fdc_busy){
diff -ur /md0/kernels/2.4/v2.4.6-pre8/drivers/block/ll_rw_blk.c lb-2.4.6-pre8/drivers/block/ll_rw_blk.c
--- /md0/kernels/2.4/v2.4.6-pre8/drivers/block/ll_rw_blk.c Sat Jun 30 14:04:27 2001
+++ lb-2.4.6-pre8/drivers/block/ll_rw_blk.c Sat Jun 30 15:38:40 2001
@@ -82,7 +82,7 @@
*
* if (!blk_size[MAJOR]) then no minor size checking is done.
*/
-int * blk_size[MAX_BLKDEV];
+blkoff_t *blk_size[MAX_BLKDEV];

/*
* blksize_size contains the size of all block-devices:
@@ -667,7 +667,8 @@
static int __make_request(request_queue_t * q, int rw,
struct buffer_head * bh)
{
- unsigned int sector, count;
+ blkoff_t sector;
+ unsigned count;
int max_segments = MAX_SEGMENTS;
struct request * req, *freereq = NULL;
int rw_ahead, max_sectors, el_ret;
@@ -859,7 +860,7 @@
void generic_make_request (int rw, struct buffer_head * bh)
{
int major = MAJOR(bh->b_rdev);
- int minorsize = 0;
+ blkoff_t minorsize = 0;
request_queue_t *q;

if (!bh->b_end_io)
@@ -869,8 +870,8 @@
if (blk_size[major])
minorsize = blk_size[major][MINOR(bh->b_rdev)];
if (minorsize) {
- unsigned long maxsector = (minorsize << 1) + 1;
- unsigned long sector = bh->b_rsector;
+ blkoff_t maxsector = (minorsize << 1) + 1;
+ blkoff_t sector = bh->b_rsector;
unsigned int count = bh->b_size >> 9;

if (maxsector < count || maxsector - count < sector) {
@@ -881,8 +882,9 @@
without checking the size of the device, e.g.,
when mounting a device. */
printk(KERN_INFO
- "attempt to access beyond end of device\n");
- printk(KERN_INFO "%s: rw=%d, want=%ld, limit=%d\n",
+ "attempt to access beyond end of device\n"
+ KERN_INFO "%s: rw=%d, want=%" BLKOFF_FMT
+ ", limit=%" BLKOFF_FMT "\n",
kdevname(bh->b_rdev), rw,
(sector + count)>>1, minorsize);

@@ -905,7 +907,7 @@
if (!q) {
printk(KERN_ERR
"generic_make_request: Trying to access "
- "nonexistent block-device %s (%ld)\n",
+ "nonexistent block-device %s (%" BLKOFF_FMT ")\n",
kdevname(bh->b_rdev), bh->b_rsector);
buffer_IO_error(bh);
break;
@@ -1114,7 +1116,7 @@

req->errors = 0;
if (!uptodate)
- printk("end_request: I/O error, dev %s (%s), sector %lu\n",
+ printk("end_request: I/O error, dev %s (%s), sector %" BLKOFF_FMT "\n",
kdevname(req->rq_dev), name, req->sector);

if ((bh = req->bh) != NULL) {
diff -ur /md0/kernels/2.4/v2.4.6-pre8/drivers/block/loop.c lb-2.4.6-pre8/drivers/block/loop.c
--- /md0/kernels/2.4/v2.4.6-pre8/drivers/block/loop.c Sat Jun 30 14:04:27 2001
+++ lb-2.4.6-pre8/drivers/block/loop.c Sat Jun 30 23:41:37 2001
@@ -76,7 +76,7 @@

static int max_loop = 8;
static struct loop_device *loop_dev;
-static int *loop_sizes;
+static blkoff_t *loop_sizes;
static int *loop_blksizes;
static devfs_handle_t devfs_handle; /* For the directory */

@@ -84,7 +84,7 @@
* Transfer functions
*/
static int transfer_none(struct loop_device *lo, int cmd, char *raw_buf,
- char *loop_buf, int size, int real_block)
+ char *loop_buf, int size, blkoff_t real_block)
{
if (cmd == READ)
memcpy(loop_buf, raw_buf, size);
@@ -95,7 +95,7 @@
}

static int transfer_xor(struct loop_device *lo, int cmd, char *raw_buf,
- char *loop_buf, int size, int real_block)
+ char *loop_buf, int size, blkoff_t real_block)
{
char *in, *out, *key;
int i, keysize;
@@ -147,7 +147,7 @@

#define MAX_DISK_SIZE 1024*1024*1024

-static int compute_loop_size(struct loop_device *lo, struct dentry * lo_dentry, kdev_t lodev)
+static blkoff_t compute_loop_size(struct loop_device *lo, struct dentry * lo_dentry, kdev_t lodev)
{
if (S_ISREG(lo_dentry->d_inode->i_mode))
return (lo_dentry->d_inode->i_size - lo->lo_offset) >> BLOCK_SIZE_BITS;
@@ -172,7 +172,7 @@
struct address_space_operations *aops = mapping->a_ops;
struct page *page;
char *kaddr, *data;
- unsigned long index;
+ blkoff_t index;
unsigned size, offset;
int len;

@@ -181,7 +181,7 @@
len = bh->b_size;
data = bh->b_data;
while (len > 0) {
- int IV = index * (PAGE_CACHE_SIZE/bsize) + offset/bsize;
+ blkoff_t IV = index * (PAGE_CACHE_SIZE/bsize) + offset/bsize;
size = PAGE_CACHE_SIZE - offset;
if (size > len)
size = len;
@@ -209,7 +209,7 @@
return 0;

write_fail:
- printk(KERN_ERR "loop: transfer error block %ld\n", index);
+ printk(KERN_ERR "loop: transfer error block %"BLKOFF_FMT"\n", index);
ClearPageUptodate(page);
kunmap(page);
unlock:
@@ -232,7 +232,7 @@
unsigned long count = desc->count;
struct lo_read_data *p = (struct lo_read_data*)desc->buf;
struct loop_device *lo = p->lo;
- int IV = page->index * (PAGE_CACHE_SIZE/p->bsize) + offset/p->bsize;
+ blkoff_t IV = (blkoff_t)page->index * (PAGE_CACHE_SIZE/p->bsize) + offset/p->bsize;

if (size > count)
size = count;
@@ -283,15 +283,27 @@

return bs;
}
+static inline int loop_get_shift(struct loop_device *lo)
+{
+ int size = loop_get_bs(lo);
+ int i = 0;
+
+ while (size) {
+ i++;
+ size >>= 1;
+ }
+ return i;
+}

static inline unsigned long loop_get_iv(struct loop_device *lo,
- unsigned long sector)
+ blkoff_t sector)
{
int bs = loop_get_bs(lo);
+ int shift = loop_get_shift(lo);
unsigned long offset, IV;

- IV = sector / (bs >> 9) + lo->lo_offset / bs;
- offset = ((sector % (bs >> 9)) << 9) + lo->lo_offset % bs;
+ IV = (sector >> (bs - 9)) + (lo->lo_offset >> shift);
+ offset = (sector & (bs - 1) & ~511) + (lo->lo_offset & (bs - 1));
if (offset >= bs)
IV++;

@@ -983,7 +995,7 @@
if (!loop_dev)
return -ENOMEM;

- loop_sizes = kmalloc(max_loop * sizeof(int), GFP_KERNEL);
+ loop_sizes = kmalloc(max_loop * sizeof(blkoff_t), GFP_KERNEL);
if (!loop_sizes)
goto out_sizes;

@@ -1003,7 +1015,7 @@
spin_lock_init(&lo->lo_lock);
}

- memset(loop_sizes, 0, max_loop * sizeof(int));
+ memset(loop_sizes, 0, max_loop * sizeof(blkoff_t));
memset(loop_blksizes, 0, max_loop * sizeof(int));
blk_size[MAJOR_NR] = loop_sizes;
blksize_size[MAJOR_NR] = loop_blksizes;
diff -ur /md0/kernels/2.4/v2.4.6-pre8/drivers/block/nbd.c lb-2.4.6-pre8/drivers/block/nbd.c
--- /md0/kernels/2.4/v2.4.6-pre8/drivers/block/nbd.c Sat Jun 30 14:04:27 2001
+++ lb-2.4.6-pre8/drivers/block/nbd.c Sat Jun 30 14:22:13 2001
@@ -56,7 +56,7 @@

static int nbd_blksizes[MAX_NBD];
static int nbd_blksize_bits[MAX_NBD];
-static int nbd_sizes[MAX_NBD];
+static blkoff_t nbd_sizes[MAX_NBD];
static u64 nbd_bytesizes[MAX_NBD];

static struct nbd_device nbd_dev[MAX_NBD];
diff -ur /md0/kernels/2.4/v2.4.6-pre8/drivers/ide/ide-cd.c lb-2.4.6-pre8/drivers/ide/ide-cd.c
--- /md0/kernels/2.4/v2.4.6-pre8/drivers/ide/ide-cd.c Fri May 25 22:48:09 2001
+++ lb-2.4.6-pre8/drivers/ide/ide-cd.c Sat Jun 30 15:40:14 2001
@@ -1060,7 +1060,7 @@
paranoid and check. */
if (rq->current_nr_sectors < (rq->bh->b_size >> SECTOR_BITS) &&
(rq->sector % SECTORS_PER_FRAME) != 0) {
- printk ("%s: cdrom_read_from_buffer: buffer botch (%ld)\n",
+ printk ("%s: cdrom_read_from_buffer: buffer botch (%" BLKOFF_FMT ")\n",
drive->name, rq->sector);
cdrom_end_request (0, drive);
return -1;
diff -ur /md0/kernels/2.4/v2.4.6-pre8/drivers/ide/ide-probe.c lb-2.4.6-pre8/drivers/ide/ide-probe.c
--- /md0/kernels/2.4/v2.4.6-pre8/drivers/ide/ide-probe.c Thu Apr 5 11:53:40 2001
+++ lb-2.4.6-pre8/drivers/ide/ide-probe.c Sat Jun 30 20:42:28 2001
@@ -759,7 +759,7 @@
}
minors = units * (1<<PARTN_BITS);
gd = kmalloc (sizeof(struct gendisk), GFP_KERNEL);
- gd->sizes = kmalloc (minors * sizeof(int), GFP_KERNEL);
+ gd->sizes = kmalloc (minors * sizeof(blkoff_t), GFP_KERNEL);
gd->part = kmalloc (minors * sizeof(struct hd_struct), GFP_KERNEL);
bs = kmalloc (minors*sizeof(int), GFP_KERNEL);
max_sect = kmalloc (minors*sizeof(int), GFP_KERNEL);
diff -ur /md0/kernels/2.4/v2.4.6-pre8/drivers/ide/ide.c lb-2.4.6-pre8/drivers/ide/ide.c
--- /md0/kernels/2.4/v2.4.6-pre8/drivers/ide/ide.c Fri May 25 22:48:09 2001
+++ lb-2.4.6-pre8/drivers/ide/ide.c Sat Jun 30 15:39:44 2001
@@ -881,7 +881,7 @@
IN_BYTE(IDE_SECTOR_REG));
}
if (HWGROUP(drive)->rq)
- printk(", sector=%ld", HWGROUP(drive)->rq->sector);
+ printk(", sector=%" BLKOFF_FMT, HWGROUP(drive)->rq->sector);
}
}
#endif /* FANCY_STATUS_DUMPS */
diff -ur /md0/kernels/2.4/v2.4.6-pre8/drivers/md/linear.c lb-2.4.6-pre8/drivers/md/linear.c
--- /md0/kernels/2.4/v2.4.6-pre8/drivers/md/linear.c Mon Feb 26 10:20:07 2001
+++ lb-2.4.6-pre8/drivers/md/linear.c Sat Jun 30 16:26:55 2001
@@ -125,15 +125,14 @@
linear_conf_t *conf = mddev_to_conf(mddev);
struct linear_hash *hash;
dev_info_t *tmp_dev;
- long block;
+ blkoff_t block;

block = bh->b_rsector >> 1;
hash = conf->hash_table + (block / conf->smallest->size);

if (block >= (hash->dev0->size + hash->dev0->offset)) {
if (!hash->dev1) {
- printk ("linear_make_request : hash->dev1==NULL for block %ld\n",
- block);
+ printk ("linear_make_request : hash->dev1==NULL for block %"BLKOFF_FMT"\n", block);
buffer_IO_error(bh);
return 0;
}
@@ -143,7 +142,7 @@

if (block >= (tmp_dev->size + tmp_dev->offset)
|| block < tmp_dev->offset) {
- printk ("linear_make_request: Block %ld out of bounds on dev %s size %ld offset %ld\n", block, kdevname(tmp_dev->dev), tmp_dev->size, tmp_dev->offset);
+ printk ("linear_make_request: Block %" BLKOFF_FMT " out of bounds on dev %s size %"BLKOFF_FMT" offset %"BLKOFF_FMT"\n", block, kdevname(tmp_dev->dev), tmp_dev->size, tmp_dev->offset);
buffer_IO_error(bh);
return 0;
}
diff -ur /md0/kernels/2.4/v2.4.6-pre8/drivers/md/lvm.c lb-2.4.6-pre8/drivers/md/lvm.c
--- /md0/kernels/2.4/v2.4.6-pre8/drivers/md/lvm.c Thu May 3 11:22:10 2001
+++ lb-2.4.6-pre8/drivers/md/lvm.c Sat Jun 30 14:56:20 2001
@@ -1068,7 +1068,7 @@
static int lvm_user_bmap(struct inode *inode, struct lv_bmap *user_result)
{
struct buffer_head bh;
- unsigned long block;
+ blkoff_t block;
int err;

if (get_user(block, &user_result->lv_block))
@@ -1481,8 +1481,8 @@
ulong index;
ulong pe_start;
ulong size = bh->b_size >> 9;
- ulong rsector_tmp = bh->b_rsector;
- ulong rsector_sav;
+ blkoff_t rsector_tmp = bh->b_rsector;
+ blkoff_t rsector_sav;
kdev_t rdev_tmp = bh->b_rdev;
kdev_t rdev_sav;
vg_t *vg_this = vg[VG_BLK(minor)];
@@ -1504,8 +1504,8 @@
return -1;
}

- P_MAP("%s - lvm_map minor:%d *rdev: %02d:%02d *rsector: %lu "
- "size:%lu\n",
+ P_MAP("%s - lvm_map minor:%d *rdev: %02d:%02d *rsector: %" BLKOFF_FMT
+ " size:%lu\n",
lvm_name, minor,
MAJOR(rdev_tmp),
MINOR(rdev_tmp),
diff -ur /md0/kernels/2.4/v2.4.6-pre8/drivers/md/md.c lb-2.4.6-pre8/drivers/md/md.c
--- /md0/kernels/2.4/v2.4.6-pre8/drivers/md/md.c Sat Jun 30 14:04:27 2001
+++ lb-2.4.6-pre8/drivers/md/md.c Sat Jun 30 21:34:30 2001
@@ -112,7 +112,7 @@
static int md_maxreadahead[MAX_MD_DEVS];
static mdk_thread_t *md_recovery_thread;

-int md_size[MAX_MD_DEVS];
+blkoff_t md_size[MAX_MD_DEVS];

extern struct block_device_operations md_fops;
static devfs_handle_t devfs_handle;
@@ -803,7 +803,7 @@

static void print_rdev(mdk_rdev_t *rdev)
{
- printk("md: rdev %s: O:%s, SZ:%08ld F:%d DN:%d ",
+ printk("md: rdev %s: O:%s, SZ:%08"BLKOFF_FMT" F:%d DN:%d ",
partition_name(rdev->dev), partition_name(rdev->old_dev),
rdev->size, rdev->faulty, rdev->desc_nr);
if (rdev->sb) {
@@ -912,7 +912,7 @@
{
struct buffer_head *bh;
kdev_t dev;
- unsigned long sb_offset, size;
+ blkoff_t sb_offset, size;
mdp_super_t *sb;

if (!rdev->sb) {
@@ -931,7 +931,7 @@
dev = rdev->dev;
sb_offset = calc_dev_sboffset(dev, rdev->mddev, 1);
if (rdev->sb_offset != sb_offset) {
- printk("%s's sb offset has changed from %ld to %ld, skipping\n", partition_name(dev), rdev->sb_offset, sb_offset);
+ printk("%s's sb offset has changed from %"BLKOFF_FMT" to %"BLKOFF_FMT", skipping\n", partition_name(dev), rdev->sb_offset, sb_offset);
goto skip;
}
/*
@@ -941,11 +941,11 @@
*/
size = calc_dev_size(dev, rdev->mddev, 1);
if (size != rdev->size) {
- printk("%s's size has changed from %ld to %ld since import, skipping\n", partition_name(dev), rdev->size, size);
+ printk("%s's size has changed from %"BLKOFF_FMT" to %"BLKOFF_FMT" since import, skipping\n", partition_name(dev), rdev->size, size);
goto skip;
}

- printk("(write) %s's sb offset: %ld\n", partition_name(dev), sb_offset);
+ printk("(write) %s's sb offset: %"BLKOFF_FMT"\n", partition_name(dev), sb_offset);
fsync_dev(dev);
set_blocksize(dev, MD_SB_BYTES);
bh = getblk(dev, sb_offset / MD_SB_BLOCKS, MD_SB_BYTES);
@@ -1485,7 +1485,7 @@
rdev->size = calc_dev_size(rdev->dev, mddev, persistent);
if (rdev->size < sb->chunk_size / 1024) {
printk (KERN_WARNING
- "md: Dev %s smaller than chunk_size: %ldk < %dk\n",
+ "md: Dev %s smaller than chunk_size: %"BLKOFF_FMT"k < %dk\n",
partition_name(rdev->dev),
rdev->size, sb->chunk_size / 1024);
return -EINVAL;
@@ -2508,10 +2508,27 @@
err = -EINVAL;
goto abort;
}
+ if ((long)md_hd_struct[minor].nr_sects !=
+ md_hd_struct[minor].nr_sects) {
+ err = -EOVERFLOW;
+ goto abort;
+ }
err = md_put_user(md_hd_struct[minor].nr_sects,
(long *) arg);
goto done;

+ case BLKGETSIZE64: /* Return device size */
+ {
+ long long val = md_hd_struct[minor].nr_sects;
+ if (!arg) {
+ err = -EINVAL;
+ goto abort;
+ }
+ err = 0;
+ if (copy_to_user((void *)arg, &val, sizeof val))
+ err = -EFAULT;
+ goto done;
+ }
case BLKFLSBUF:
fsync_dev(dev);
invalidate_buffers(dev);
@@ -3051,7 +3068,8 @@
static int md_status_read_proc(char *page, char **start, off_t off,
int count, int *eof, void *data)
{
- int sz = 0, j, size;
+ blkoff_t size;
+ int sz = 0, j;
struct md_list_head *tmp, *tmp2;
mdk_rdev_t *rdev;
mddev_t *mddev;
@@ -3092,10 +3110,10 @@

if (mddev->nb_dev) {
if (mddev->pers)
- sz += sprintf(page + sz, "\n %d blocks",
+ sz += sprintf(page + sz, "\n %" BLKOFF_FMT " blocks",
md_size[mdidx(mddev)]);
else
- sz += sprintf(page + sz, "\n %d blocks", size);
+ sz += sprintf(page + sz, "\n %" BLKOFF_FMT " blocks", size);
}

if (!mddev->pers) {
@@ -3226,8 +3244,9 @@
int md_do_sync(mddev_t *mddev, mdp_disk_t *spare)
{
mddev_t *mddev2;
- unsigned int max_sectors, currspeed,
- j, window, err, serialize;
+ blkoff_t max_sectors, j;
+ unsigned int currspeed,
+ window, err, serialize;
kdev_t read_disk = mddev_to_kdev(mddev);
unsigned long mark[SYNC_MARKS];
unsigned long mark_cnt[SYNC_MARKS];
@@ -3288,7 +3307,7 @@
* Tune reconstruction:
*/
window = MAX_READAHEAD*(PAGE_SIZE/512);
- printk(KERN_INFO "md: using %dk window, over a total of %d blocks.\n",window/2,max_sectors/2);
+ printk(KERN_INFO "md: using %dk window, over a total of %" BLKOFF_FMT " blocks.\n",window/2,max_sectors/2);

atomic_set(&mddev->recovery_active, 0);
init_waitqueue_head(&mddev->recovery_wait);
@@ -3306,6 +3325,7 @@
j += sectors;
mddev->curr_resync = j;

+ /* README: Uhhh, is this right? last_check is always 0 here */
if (last_check + window > j)
continue;

diff -ur /md0/kernels/2.4/v2.4.6-pre8/drivers/md/raid0.c lb-2.4.6-pre8/drivers/md/raid0.c
--- /md0/kernels/2.4/v2.4.6-pre8/drivers/md/raid0.c Sat Jun 30 14:04:27 2001
+++ lb-2.4.6-pre8/drivers/md/raid0.c Sat Jun 30 23:47:30 2001
@@ -41,7 +41,7 @@
printk("raid0: looking at %s\n", partition_name(rdev1->dev));
c = 0;
ITERATE_RDEV_ORDERED(mddev,rdev2,j2) {
- printk("raid0: comparing %s(%ld) with %s(%ld)\n", partition_name(rdev1->dev), rdev1->size, partition_name(rdev2->dev), rdev2->size);
+ printk("raid0: comparing %s(%"BLKOFF_FMT") with %s(%"BLKOFF_FMT")\n", partition_name(rdev1->dev), rdev1->size, partition_name(rdev2->dev), rdev2->size);
if (rdev2 == rdev1) {
printk("raid0: END\n");
break;
@@ -95,7 +95,7 @@
c++;
if (!smallest || (rdev->size <smallest->size)) {
smallest = rdev;
- printk(" (%ld) is smallest!.\n", rdev->size);
+ printk(" (%"BLKOFF_FMT") is smallest!.\n", rdev->size);
}
} else
printk(" nope.\n");
@@ -103,7 +103,7 @@

zone->nb_dev = c;
zone->size = (smallest->size - current_offset) * c;
- printk("raid0: zone->nb_dev: %d, size: %ld\n",zone->nb_dev,zone->size);
+ printk("raid0: zone->nb_dev: %d, size: %"BLKOFF_FMT"\n",zone->nb_dev,zone->size);

if (!conf->smallest || (zone->size < conf->smallest->size))
conf->smallest = zone;
@@ -138,8 +138,8 @@
if (create_strip_zones (mddev))
goto out_free_conf;

- printk("raid0 : md_size is %d blocks.\n", md_size[mdidx(mddev)]);
- printk("raid0 : conf->smallest->size is %ld blocks.\n", conf->smallest->size);
+ printk("raid0 : md_size is %" BLKOFF_FMT " blocks.\n", md_size[mdidx(mddev)]);
+ printk("raid0 : conf->smallest->size is %" BLKOFF_FMT " blocks.\n", conf->smallest->size);
nb_zone = md_size[mdidx(mddev)]/conf->smallest->size +
(md_size[mdidx(mddev)] % conf->smallest->size ? 1 : 0);
printk("raid0 : nb_zone is %ld.\n", nb_zone);
@@ -231,7 +231,8 @@
struct raid0_hash *hash;
struct strip_zone *zone;
mdk_rdev_t *tmp_dev;
- unsigned long chunk, block, rsect;
+ blkoff_t chunk;
+ blkoff_t block, rsect;

chunk_size = mddev->param.chunk_size >> 10;
chunksize_bits = ffz(~chunk_size);
@@ -239,7 +240,7 @@
hash = conf->hash_table + block / conf->smallest->size;

/* Sanity check */
- if (chunk_size < (block % chunk_size) + (bh->b_size >> 10))
+ if (chunk_size < (block & (chunk_size-1)) + (bh->b_size >> 10))
goto bad_map;

if (!hash)
@@ -274,16 +275,16 @@
return 1;

bad_map:
- printk ("raid0_make_request bug: can't convert block across chunks or bigger than %dk %ld %d\n", chunk_size, bh->b_rsector, bh->b_size >> 10);
+ printk ("raid0_make_request bug: can't convert block across chunks or bigger than %dk %"BLKOFF_FMT" %d\n", chunk_size, bh->b_rsector, bh->b_size >> 10);
goto outerr;
bad_hash:
- printk("raid0_make_request bug: hash==NULL for block %ld\n", block);
+ printk("raid0_make_request bug: hash==NULL for block %"BLKOFF_FMT"\n", block);
goto outerr;
bad_zone0:
- printk ("raid0_make_request bug: hash->zone0==NULL for block %ld\n", block);
+ printk ("raid0_make_request bug: hash->zone0==NULL for block %"BLKOFF_FMT"\n", block);
goto outerr;
bad_zone1:
- printk ("raid0_make_request bug: hash->zone1==NULL for block %ld\n", block);
+ printk ("raid0_make_request bug: hash->zone1==NULL for block %"BLKOFF_FMT"\n", block);
outerr:
buffer_IO_error(bh);
return 0;
diff -ur /md0/kernels/2.4/v2.4.6-pre8/drivers/md/raid1.c lb-2.4.6-pre8/drivers/md/raid1.c
--- /md0/kernels/2.4/v2.4.6-pre8/drivers/md/raid1.c Sat Jun 30 14:04:27 2001
+++ lb-2.4.6-pre8/drivers/md/raid1.c Sat Jun 30 23:48:47 2001
@@ -335,7 +335,7 @@
}


-static void inline io_request_done(unsigned long sector, raid1_conf_t *conf, int phase)
+static void inline io_request_done(blkoff_t sector, raid1_conf_t *conf, int phase)
{
unsigned long flags;
spin_lock_irqsave(&conf->segment_lock, flags);
@@ -417,7 +417,7 @@
/*
* oops, read error:
*/
- printk(KERN_ERR "raid1: %s: rescheduling block %lu\n",
+ printk(KERN_ERR "raid1: %s: rescheduling block %"BLKOFF_FMT"\n",
partition_name(bh->b_dev), bh->b_blocknr);
raid1_reschedule_retry(r1_bh);
return;
@@ -450,10 +450,10 @@
{
int new_disk = conf->last_used;
const int sectors = bh->b_size >> 9;
- const unsigned long this_sector = bh->b_rsector;
+ const blkoff_t this_sector = bh->b_rsector;
int disk = new_disk;
- unsigned long new_distance;
- unsigned long current_distance;
+ blkoff_t new_distance;
+ blkoff_t current_distance;

/*
* Check if it is sane at all to balance
@@ -510,9 +510,9 @@

goto rb_out;
}
-
- current_distance = abs(this_sector -
- conf->mirrors[disk].head_position);
+ current_distance = (this_sector > conf->mirrors[disk].head_position) ?
+ this_sector - conf->mirrors[disk].head_position :
+ conf->mirrors[disk].head_position - this_sector;

/* Find the disk which is closest */

@@ -525,8 +525,9 @@
(!conf->mirrors[disk].operational))
continue;

- new_distance = abs(this_sector -
- conf->mirrors[disk].head_position);
+ new_distance = (this_sector > conf->mirrors[disk].head_position) ?
+ this_sector - conf->mirrors[disk].head_position :
+ conf->mirrors[disk].head_position - this_sector;

if (new_distance < current_distance) {
conf->sect_count = 0;
@@ -1088,10 +1089,10 @@


#define IO_ERROR KERN_ALERT \
-"raid1: %s: unrecoverable I/O read error for block %lu\n"
+"raid1: %s: unrecoverable I/O read error for block %"BLKOFF_FMT"\n"

#define REDIRECT_SECTOR KERN_ERR \
-"raid1: %s: redirecting sector %lu to another mirror\n"
+"raid1: %s: redirecting sector %"BLKOFF_FMT" to another mirror\n"

/*
* This is a kernel thread which:
@@ -1304,7 +1305,7 @@
* issue suitable write requests
*/

-static int raid1_sync_request (mddev_t *mddev, unsigned long sector_nr)
+static int raid1_sync_request (mddev_t *mddev, blkoff_t sector_nr)
{
raid1_conf_t *conf = mddev_to_conf(mddev);
struct mirror_info *mirror;
@@ -1312,7 +1313,7 @@
struct buffer_head *bh;
int bsize;
int disk;
- int block_nr;
+ blkoff_t block_nr;

spin_lock_irq(&conf->segment_lock);
if (!sector_nr) {
diff -ur /md0/kernels/2.4/v2.4.6-pre8/drivers/md/raid5.c lb-2.4.6-pre8/drivers/md/raid5.c
--- /md0/kernels/2.4/v2.4.6-pre8/drivers/md/raid5.c Sat Jun 30 14:04:27 2001
+++ lb-2.4.6-pre8/drivers/md/raid5.c Sat Jun 30 16:04:05 2001
@@ -204,7 +204,7 @@
for (i=disks; i--; ) {
if (sh->bh_read[i] || sh->bh_write[i] || sh->bh_written[i] ||
buffer_locked(sh->bh_cache[i])) {
- printk("sector=%lx i=%d %p %p %p %d\n",
+ printk("sector=%"BLKOFF_FMT" i=%d %p %p %p %d\n",
sh->sector, i, sh->bh_read[i],
sh->bh_write[i], sh->bh_written[i],
buffer_locked(sh->bh_cache[i]));
@@ -464,7 +464,7 @@
{
raid5_conf_t *conf = sh->raid_conf;
struct buffer_head *bh = sh->bh_cache[i];
- unsigned long block = sh->sector / (sh->size >> 9);
+ blkoff_t block = sh->sector / (sh->size >> 9);

init_buffer(bh, raid5_end_read_request, sh);
bh->b_dev = conf->disks[i].dev;
@@ -539,7 +539,7 @@
* Input: a 'big' sector number,
* Output: index of the data and parity disk, and the sector # in them.
*/
-static unsigned long raid5_compute_sector(unsigned long r_sector, unsigned int raid_disks,
+static unsigned long raid5_compute_sector(blkoff_t r_sector, unsigned int raid_disks,
unsigned int data_disks, unsigned int * dd_idx,
unsigned int * pd_idx, raid5_conf_t *conf)
{
@@ -607,12 +607,12 @@
{
raid5_conf_t *conf = sh->raid_conf;
int raid_disks = conf->raid_disks, data_disks = raid_disks - 1;
- unsigned long new_sector = sh->sector, check;
+ blkoff_t new_sector = sh->sector, check;
int sectors_per_chunk = conf->chunk_size >> 9;
- unsigned long stripe = new_sector / sectors_per_chunk;
+ blkoff_t stripe = new_sector / sectors_per_chunk;
int chunk_offset = new_sector % sectors_per_chunk;
int chunk_number, dummy1, dummy2, dd_idx = i;
- unsigned long r_sector, blocknr;
+ blkoff_t r_sector, blocknr;

switch (conf->algorithm) {
case ALGORITHM_LEFT_ASYMMETRIC:
@@ -670,7 +670,7 @@
if (buffer_uptodate(bh))
bh_ptr[count++] = bh;
else
- printk("compute_block() %d, stripe %lu, %d not present\n", dd_idx, sh->sector, i);
+ printk("compute_block() %d, stripe %"BLKOFF_FMT", %d not present\n", dd_idx, sh->sector, i);

check_xor();
}
@@ -781,7 +781,7 @@
else
bhp = &sh->bh_write[dd_idx];
while (*bhp) {
- printk(KERN_NOTICE "raid5: multiple %d requests for sector %ld\n", rw, sh->sector);
+ printk(KERN_NOTICE "raid5: multiple %d requests for sector %"BLKOFF_FMT"\n", rw, sh->sector);
bhp = & (*bhp)->b_reqnext;
}
*bhp = bh;
@@ -1236,18 +1236,18 @@
return correct_size;
}

-static int raid5_sync_request (mddev_t *mddev, unsigned long sector_nr)
+static int raid5_sync_request (mddev_t *mddev, blkoff_t sector_nr)
{
raid5_conf_t *conf = (raid5_conf_t *) mddev->private;
struct stripe_head *sh;
int sectors_per_chunk = conf->chunk_size >> 9;
- unsigned long stripe = sector_nr/sectors_per_chunk;
+ blkoff_t stripe = sector_nr/sectors_per_chunk;
int chunk_offset = sector_nr % sectors_per_chunk;
int dd_idx, pd_idx;
- unsigned long first_sector;
+ blkoff_t first_sector;
int raid_disks = conf->raid_disks;
int data_disks = raid_disks-1;
- int redone = 0;
+ blkoff_t redone = 0;
int bufsize;

sh = get_active_stripe(conf, sector_nr, 0, 0);
diff -ur /md0/kernels/2.4/v2.4.6-pre8/drivers/scsi/scsi_lib.c lb-2.4.6-pre8/drivers/scsi/scsi_lib.c
--- /md0/kernels/2.4/v2.4.6-pre8/drivers/scsi/scsi_lib.c Fri May 25 22:48:09 2001
+++ lb-2.4.6-pre8/drivers/scsi/scsi_lib.c Sat Jun 30 16:07:29 2001
@@ -369,7 +369,7 @@
req = &SCpnt->request;
req->errors = 0;
if (!uptodate) {
- printk(" I/O error: dev %s, sector %lu\n",
+ printk(" I/O error: dev %s, sector %"BLKOFF_FMT"\n",
kdevname(req->rq_dev), req->sector);
}
do {
diff -ur /md0/kernels/2.4/v2.4.6-pre8/drivers/scsi/sd.c lb-2.4.6-pre8/drivers/scsi/sd.c
--- /md0/kernels/2.4/v2.4.6-pre8/drivers/scsi/sd.c Sat Jun 30 14:04:27 2001
+++ lb-2.4.6-pre8/drivers/scsi/sd.c Sat Jun 30 14:22:13 2001
@@ -81,7 +81,7 @@
struct hd_struct *sd;

static Scsi_Disk *rscsi_disks;
-static int *sd_sizes;
+static blkoff_t *sd_sizes;
static int *sd_blocksizes;
static int *sd_hardsizes; /* Hardware sector size */

@@ -1050,10 +1050,11 @@
memset(rscsi_disks, 0, sd_template.dev_max * sizeof(Scsi_Disk));

/* for every (necessary) major: */
- sd_sizes = kmalloc((sd_template.dev_max << 4) * sizeof(int), GFP_ATOMIC);
+ /* FIXME: GFP_ATOMIC??? Someone please pass the pipe... */
+ sd_sizes = kmalloc((sd_template.dev_max << 4) * sizeof(blkoff_t), GFP_ATOMIC);
if (!sd_sizes)
goto cleanup_disks;
- memset(sd_sizes, 0, (sd_template.dev_max << 4) * sizeof(int));
+ memset(sd_sizes, 0, (sd_template.dev_max << 4) * sizeof(blkoff_t));

sd_blocksizes = kmalloc((sd_template.dev_max << 4) * sizeof(int), GFP_ATOMIC);
if (!sd_blocksizes)
diff -ur /md0/kernels/2.4/v2.4.6-pre8/drivers/scsi/sr.c lb-2.4.6-pre8/drivers/scsi/sr.c
--- /md0/kernels/2.4/v2.4.6-pre8/drivers/scsi/sr.c Fri May 25 22:48:09 2001
+++ lb-2.4.6-pre8/drivers/scsi/sr.c Sat Jun 30 22:00:58 2001
@@ -85,7 +85,7 @@
};

Scsi_CD *scsi_CDs;
-static int *sr_sizes;
+static blkoff_t *sr_sizes;

static int *sr_blocksizes;

@@ -766,10 +766,10 @@
goto cleanup_devfs;
memset(scsi_CDs, 0, sr_template.dev_max * sizeof(Scsi_CD));

- sr_sizes = kmalloc(sr_template.dev_max * sizeof(int), GFP_ATOMIC);
+ sr_sizes = kmalloc(sr_template.dev_max * sizeof(blkoff_t), GFP_ATOMIC);
if (!sr_sizes)
goto cleanup_cds;
- memset(sr_sizes, 0, sr_template.dev_max * sizeof(int));
+ memset(sr_sizes, 0, sr_template.dev_max * sizeof(blkoff_t));

sr_blocksizes = kmalloc(sr_template.dev_max * sizeof(int), GFP_ATOMIC);
if (!sr_blocksizes)
diff -ur /md0/kernels/2.4/v2.4.6-pre8/fs/adfs/adfs.h lb-2.4.6-pre8/fs/adfs/adfs.h
--- /md0/kernels/2.4/v2.4.6-pre8/fs/adfs/adfs.h Mon Sep 18 18:14:06 2000
+++ lb-2.4.6-pre8/fs/adfs/adfs.h Sat Jun 30 15:27:18 2001
@@ -66,7 +66,7 @@

/* Inode stuff */
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,3,0)
-int adfs_get_block(struct inode *inode, long block,
+int adfs_get_block(struct inode *inode, blkoff_t block,
struct buffer_head *bh, int create);
#else
int adfs_bmap(struct inode *inode, int block);
diff -ur /md0/kernels/2.4/v2.4.6-pre8/fs/adfs/inode.c lb-2.4.6-pre8/fs/adfs/inode.c
--- /md0/kernels/2.4/v2.4.6-pre8/fs/adfs/inode.c Fri Dec 29 17:07:57 2000
+++ lb-2.4.6-pre8/fs/adfs/inode.c Sat Jun 30 15:25:32 2001
@@ -25,7 +25,7 @@
* not support creation of new blocks, so we return -EIO for this case.
*/
int
-adfs_get_block(struct inode *inode, long block, struct buffer_head *bh, int create)
+adfs_get_block(struct inode *inode, blkoff_t block, struct buffer_head *bh, int create)
{
if (block < 0)
goto abort_negative;
diff -ur /md0/kernels/2.4/v2.4.6-pre8/fs/affs/file.c lb-2.4.6-pre8/fs/affs/file.c
--- /md0/kernels/2.4/v2.4.6-pre8/fs/affs/file.c Sat Jun 30 14:04:27 2001
+++ lb-2.4.6-pre8/fs/affs/file.c Sat Jun 30 15:28:09 2001
@@ -38,7 +38,6 @@
static struct buffer_head *affs_alloc_extblock(struct inode *inode, struct buffer_head *bh, u32 ext);
static inline struct buffer_head *affs_get_extblock(struct inode *inode, u32 ext);
static struct buffer_head *affs_get_extblock_slow(struct inode *inode, u32 ext);
-static int affs_get_block(struct inode *inode, long block, struct buffer_head *bh_result, int create);

static ssize_t affs_file_write(struct file *filp, const char *buf, size_t count, loff_t *ppos);
static int affs_file_open(struct inode *inode, struct file *filp);
@@ -331,7 +330,7 @@
}

static int
-affs_get_block(struct inode *inode, long block, struct buffer_head *bh_result, int create)
+affs_get_block(struct inode *inode, blkoff_t block, struct buffer_head *bh_result, int create)
{
struct super_block *sb = inode->i_sb;
struct buffer_head *ext_bh;
diff -ur /md0/kernels/2.4/v2.4.6-pre8/fs/affs/super.c lb-2.4.6-pre8/fs/affs/super.c
--- /md0/kernels/2.4/v2.4.6-pre8/fs/affs/super.c Thu May 3 11:22:16 2001
+++ lb-2.4.6-pre8/fs/affs/super.c Sat Jun 30 14:22:13 2001
@@ -29,7 +29,6 @@
#include <asm/system.h>
#include <asm/uaccess.h>

-extern int *blk_size[];
extern struct timezone sys_tz;

static int affs_statfs(struct super_block *sb, struct statfs *buf);
diff -ur /md0/kernels/2.4/v2.4.6-pre8/fs/bfs/file.c lb-2.4.6-pre8/fs/bfs/file.c
--- /md0/kernels/2.4/v2.4.6-pre8/fs/bfs/file.c Mon Dec 4 22:02:45 2000
+++ lb-2.4.6-pre8/fs/bfs/file.c Sat Jun 30 15:26:50 2001
@@ -53,7 +53,7 @@
return 0;
}

-static int bfs_get_block(struct inode * inode, long block,
+static int bfs_get_block(struct inode * inode, blkoff_t block,
struct buffer_head * bh_result, int create)
{
long phys;
diff -ur /md0/kernels/2.4/v2.4.6-pre8/fs/block_dev.c lb-2.4.6-pre8/fs/block_dev.c
--- /md0/kernels/2.4/v2.4.6-pre8/fs/block_dev.c Sat Jun 30 14:04:27 2001
+++ lb-2.4.6-pre8/fs/block_dev.c Sat Jun 30 15:01:36 2001
@@ -14,12 +14,10 @@
#include <linux/major.h>
#include <linux/devfs_fs_kernel.h>
#include <linux/smp_lock.h>
+#include <linux/blkdev.h>

#include <asm/uaccess.h>

-extern int *blk_size[];
-extern int *blksize_size[];
-
#define MAX_BUF_PER_PAGE (PAGE_SIZE / 512)
#define NBUF 64

@@ -28,15 +26,15 @@
{
struct inode * inode = filp->f_dentry->d_inode;
ssize_t blocksize, blocksize_bits, i, buffercount, write_error;
- ssize_t block, blocks;
+ ssize_t blocks;
loff_t offset;
ssize_t chars;
ssize_t written, retval;
struct buffer_head * bhlist[NBUF];
- size_t size;
kdev_t dev = inode->i_rdev;
struct buffer_head * bh, *bufferlist[NBUF];
register char * p;
+ blkoff_t block, size;

if (is_read_only(dev))
return -EPERM;
@@ -57,7 +55,7 @@
offset = *ppos & (blocksize-1);

if (blk_size[MAJOR(dev)])
- size = ((loff_t) blk_size[MAJOR(dev)][MINOR(dev)] << BLOCK_SIZE_BITS) >> blocksize_bits;
+ size = ((unsigned long long) blk_size[MAJOR(dev)][MINOR(dev)] << BLOCK_SIZE_BITS) >> blocksize_bits;
else
size = INT_MAX;
while (count>0) {
@@ -177,7 +175,6 @@
ssize_t block_read(struct file * filp, char * buf, size_t count, loff_t *ppos)
{
struct inode * inode = filp->f_dentry->d_inode;
- size_t block;
loff_t offset;
ssize_t blocksize;
ssize_t blocksize_bits, i;
@@ -190,6 +187,7 @@
loff_t size;
kdev_t dev;
ssize_t read;
+ blkoff_t block;

dev = inode->i_rdev;
blocksize = BLOCK_SIZE;
diff -ur /md0/kernels/2.4/v2.4.6-pre8/fs/buffer.c lb-2.4.6-pre8/fs/buffer.c
--- /md0/kernels/2.4/v2.4.6-pre8/fs/buffer.c Sat Jun 30 14:04:27 2001
+++ lb-2.4.6-pre8/fs/buffer.c Sat Jun 30 14:22:13 2001
@@ -531,7 +531,7 @@
* will force it bad). This shouldn't really happen currently, but
* the code is ready.
*/
-static inline struct buffer_head * __get_hash_table(kdev_t dev, int block, int size)
+static inline struct buffer_head * __get_hash_table(kdev_t dev, blkoff_t block, int size)
{
struct buffer_head *bh = hash(dev, block);

@@ -546,7 +546,7 @@
return bh;
}

-struct buffer_head * get_hash_table(kdev_t dev, int block, int size)
+struct buffer_head * get_hash_table(kdev_t dev, blkoff_t block, int size)
{
struct buffer_head *bh;

@@ -665,7 +665,6 @@

void set_blocksize(kdev_t dev, int size)
{
- extern int *blksize_size[];
int i, nlist, slept;
struct buffer_head * bh, * bh_next;

@@ -712,7 +711,7 @@
if (!atomic_read(&bh->b_count)) {
if (buffer_dirty(bh))
printk(KERN_WARNING
- "set_blocksize: dev %s buffer_dirty %lu size %hu\n",
+ "set_blocksize: dev %s buffer_dirty %" BLKOFF_FMT " size %hu\n",
kdevname(dev), bh->b_blocknr, bh->b_size);
remove_inode_queue(bh);
__remove_from_queues(bh);
@@ -723,7 +722,7 @@
clear_bit(BH_Uptodate, &bh->b_state);
printk(KERN_WARNING
"set_blocksize: "
- "b_count %d, dev %s, block %lu, from %p\n",
+ "b_count %d, dev %s, block %" BLKOFF_FMT ", from %p\n",
atomic_read(&bh->b_count), bdevname(bh->b_dev),
bh->b_blocknr, __builtin_return_address(0));
}
@@ -970,7 +969,7 @@
* 14.02.92: changed it to sync dirty buffers a bit: better performance
* when the filesystem starts to get full of dirty blocks (I hope).
*/
-struct buffer_head * getblk(kdev_t dev, int block, int size)
+struct buffer_head * getblk(kdev_t dev, blkoff_t block, int size)
{
struct buffer_head * bh;
int isize;
@@ -1155,7 +1154,7 @@
* bread() reads a specified block and returns the buffer that contains
* it. It returns NULL if the block was unreadable.
*/
-struct buffer_head * bread(kdev_t dev, int block, int size)
+struct buffer_head * bread(kdev_t dev, blkoff_t block, int size)
{
struct buffer_head * bh;

@@ -1659,7 +1658,7 @@
int block_read_full_page(struct page *page, get_block_t *get_block)
{
struct inode *inode = page->mapping->host;
- unsigned long iblock, lblock;
+ blkoff_t iblock, lblock;
struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
unsigned int blocksize, blocks;
int nr, i;
@@ -1672,7 +1671,7 @@
head = page->buffers;

blocks = PAGE_CACHE_SIZE >> inode->i_sb->s_blocksize_bits;
- iblock = page->index << (PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits);
+ iblock = (blkoff_t)page->index << (PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits);
lblock = (inode->i_size+blocksize-1) >> inode->i_sb->s_blocksize_bits;
bh = head;
nr = 0;
@@ -1949,7 +1948,7 @@
goto done;
}

-int generic_block_bmap(struct address_space *mapping, long block, get_block_t *get_block)
+blkoff_t generic_block_bmap(struct address_space *mapping, blkoff_t block, get_block_t *get_block)
{
struct buffer_head tmp;
struct inode *inode = mapping->host;
@@ -2033,7 +2032,7 @@
int pageind;
int bhind;
int offset;
- unsigned long blocknr;
+ blkoff_t blocknr;
struct kiobuf * iobuf = NULL;
struct page * map;
struct buffer_head *tmp, **bhs = NULL;
@@ -2147,7 +2146,7 @@
* FIXME: we need a swapper_inode->get_block function to remove
* some of the bmap kludges and interface ugliness here.
*/
-int brw_page(int rw, struct page *page, kdev_t dev, int b[], int size)
+int brw_page(int rw, struct page *page, kdev_t dev, blkoff_t b[], int size)
{
struct buffer_head *head, *bh;

diff -ur /md0/kernels/2.4/v2.4.6-pre8/fs/efs/file.c lb-2.4.6-pre8/fs/efs/file.c
--- /md0/kernels/2.4/v2.4.6-pre8/fs/efs/file.c Sat Feb 26 23:33:05 2000
+++ lb-2.4.6-pre8/fs/efs/file.c Sat Jun 30 14:22:13 2001
@@ -8,7 +8,7 @@

#include <linux/efs_fs.h>

-int efs_get_block(struct inode *inode, long iblock,
+int efs_get_block(struct inode *inode, blkoff_t iblock,
struct buffer_head *bh_result, int create)
{
int error = -EROFS;
diff -ur /md0/kernels/2.4/v2.4.6-pre8/fs/ext2/inode.c lb-2.4.6-pre8/fs/ext2/inode.c
--- /md0/kernels/2.4/v2.4.6-pre8/fs/ext2/inode.c Sat Jun 30 14:04:27 2001
+++ lb-2.4.6-pre8/fs/ext2/inode.c Sat Jun 30 14:22:13 2001
@@ -503,7 +503,7 @@
* reachable from inode.
*/

-static int ext2_get_block(struct inode *inode, long iblock, struct buffer_head *bh_result, int create)
+static int ext2_get_block(struct inode *inode, blkoff_t iblock, struct buffer_head *bh_result, int create)
{
int err = -EIO;
int offsets[4];
diff -ur /md0/kernels/2.4/v2.4.6-pre8/fs/fat/file.c lb-2.4.6-pre8/fs/fat/file.c
--- /md0/kernels/2.4/v2.4.6-pre8/fs/fat/file.c Fri May 25 22:48:09 2001
+++ lb-2.4.6-pre8/fs/fat/file.c Sat Jun 30 14:22:13 2001
@@ -54,7 +54,7 @@
}


-int fat_get_block(struct inode *inode, long iblock, struct buffer_head *bh_result, int create)
+int fat_get_block(struct inode *inode, blkoff_t iblock, struct buffer_head *bh_result, int create)
{
struct super_block *sb = inode->i_sb;
unsigned long phys;
diff -ur /md0/kernels/2.4/v2.4.6-pre8/fs/freevxfs/vxfs_subr.c lb-2.4.6-pre8/fs/freevxfs/vxfs_subr.c
--- /md0/kernels/2.4/v2.4.6-pre8/fs/freevxfs/vxfs_subr.c Fri May 25 22:48:09 2001
+++ lb-2.4.6-pre8/fs/freevxfs/vxfs_subr.c Sat Jun 30 14:22:13 2001
@@ -134,7 +134,7 @@
* Zero on success, else a negativ error code (-EIO).
*/
static int
-vxfs_getblk(struct inode *ip, long iblock,
+vxfs_getblk(struct inode *ip, blkoff_t iblock,
struct buffer_head *bp, int create)
{
daddr_t pblock;
diff -ur /md0/kernels/2.4/v2.4.6-pre8/fs/hfs/file.c lb-2.4.6-pre8/fs/hfs/file.c
--- /md0/kernels/2.4/v2.4.6-pre8/fs/hfs/file.c Mon Feb 26 10:20:13 2001
+++ lb-2.4.6-pre8/fs/hfs/file.c Sat Jun 30 14:22:13 2001
@@ -106,7 +106,7 @@
* block number. This function just calls hfs_extent_map() to do the
* real work and then stuffs the appropriate info into the buffer_head.
*/
-int hfs_get_block(struct inode *inode, long iblock, struct buffer_head *bh_result, int create)
+int hfs_get_block(struct inode *inode, blkoff_t iblock, struct buffer_head *bh_result, int create)
{
unsigned long phys;

diff -ur /md0/kernels/2.4/v2.4.6-pre8/fs/hfs/hfs.h lb-2.4.6-pre8/fs/hfs/hfs.h
--- /md0/kernels/2.4/v2.4.6-pre8/fs/hfs/hfs.h Sat Jun 30 15:20:48 2001
+++ lb-2.4.6-pre8/fs/hfs/hfs.h Sat Jun 30 18:03:31 2001
@@ -495,7 +495,7 @@
extern void hfs_extent_free(struct hfs_fork *);

/* file.c */
-extern int hfs_get_block(struct inode *, long, struct buffer_head *, int);
+extern int hfs_get_block(struct inode *, blkoff_t, struct buffer_head *, int);

/* mdb.c */
extern struct hfs_mdb *hfs_mdb_get(hfs_sysmdb, int, hfs_s32);
diff -ur /md0/kernels/2.4/v2.4.6-pre8/fs/hpfs/file.c lb-2.4.6-pre8/fs/hpfs/file.c
--- /md0/kernels/2.4/v2.4.6-pre8/fs/hpfs/file.c Fri Dec 29 17:07:57 2000
+++ lb-2.4.6-pre8/fs/hpfs/file.c Sat Jun 30 14:22:13 2001
@@ -66,7 +66,7 @@
hpfs_write_inode(i);
}

-int hpfs_get_block(struct inode *inode, long iblock, struct buffer_head *bh_result, int create)
+int hpfs_get_block(struct inode *inode, blkoff_t iblock, struct buffer_head *bh_result, int create)
{
secno s;
s = hpfs_bmap(inode, iblock);
diff -ur /md0/kernels/2.4/v2.4.6-pre8/fs/isofs/inode.c lb-2.4.6-pre8/fs/isofs/inode.c
--- /md0/kernels/2.4/v2.4.6-pre8/fs/isofs/inode.c Thu May 3 11:22:16 2001
+++ lb-2.4.6-pre8/fs/isofs/inode.c Sat Jun 30 16:09:03 2001
@@ -876,7 +876,7 @@
/* Life is simpler than for other filesystem since we never
* have to create a new block, only find an existing one.
*/
-static int isofs_get_block(struct inode *inode, long iblock,
+static int isofs_get_block(struct inode *inode, blkoff_t iblock,
struct buffer_head *bh_result, int create)
{
unsigned long b_off;
@@ -951,18 +951,18 @@
goto abort;

abort_beyond_end:
- printk("isofs_get_block: block >= EOF (%ld, %ld)\n",
+ printk("isofs_get_block: block >= EOF (%"BLKOFF_FMT", %ld)\n",
iblock, (unsigned long) inode->i_size);
goto abort;

abort_too_many_sections:
printk("isofs_get_block: More than 100 file sections ?!?, aborting...\n");
- printk("isofs_get_block: ino=%lu block=%ld firstext=%u sect_size=%u nextino=%lu\n",
+ printk("isofs_get_block: ino=%lu block=%" BLKOFF_FMT " firstext=%u sect_size=%u nextino=%lu\n",
inode->i_ino, iblock, firstext, (unsigned) sect_size, nextino);
goto abort;
}

-static int isofs_bmap(struct inode *inode, int block)
+static blkoff_t isofs_bmap(struct inode *inode, blkoff_t block)
{
struct buffer_head dummy;
int error;
diff -ur /md0/kernels/2.4/v2.4.6-pre8/fs/minix/inode.c lb-2.4.6-pre8/fs/minix/inode.c
--- /md0/kernels/2.4/v2.4.6-pre8/fs/minix/inode.c Thu May 3 11:22:16 2001
+++ lb-2.4.6-pre8/fs/minix/inode.c Sat Jun 30 15:24:05 2001
@@ -350,7 +350,7 @@
return 0;
}

-static int minix_get_block(struct inode *inode, long block,
+static int minix_get_block(struct inode *inode, blkoff_t block,
struct buffer_head *bh_result, int create)
{
if (INODE_VERSION(inode) == MINIX_V1)
diff -ur /md0/kernels/2.4/v2.4.6-pre8/fs/minix/itree_common.c lb-2.4.6-pre8/fs/minix/itree_common.c
--- /md0/kernels/2.4/v2.4.6-pre8/fs/minix/itree_common.c Sat Jun 30 14:04:27 2001
+++ lb-2.4.6-pre8/fs/minix/itree_common.c Sat Jun 30 14:22:13 2001
@@ -140,7 +140,7 @@
return -EAGAIN;
}

-static inline int get_block(struct inode * inode, long block,
+static inline int get_block(struct inode * inode, blkoff_t block,
struct buffer_head *bh_result, int create)
{
int err = -EIO;
diff -ur /md0/kernels/2.4/v2.4.6-pre8/fs/partitions/check.c lb-2.4.6-pre8/fs/partitions/check.c
--- /md0/kernels/2.4/v2.4.6-pre8/fs/partitions/check.c Sat Jun 30 14:04:27 2001
+++ lb-2.4.6-pre8/fs/partitions/check.c Sat Jun 30 14:22:13 2001
@@ -33,8 +33,6 @@
#include "ibm.h"
#include "ultrix.h"

-extern int *blk_size[];
-
struct gendisk *gendisk_head;
int warn_no_part = 1; /*This is ugly: should make genhd removable media aware*/

@@ -250,7 +248,7 @@
char buf[64];

len += sprintf(page + len,
- "%4d %4d %10d %s\n",
+ "%4d %4d %10" BLKOFF_FMT " %s\n",
dsk->major, n, dsk->sizes[n],
disk_name(dsk, n, buf));
if (len < offset)
diff -ur /md0/kernels/2.4/v2.4.6-pre8/fs/qnx4/inode.c lb-2.4.6-pre8/fs/qnx4/inode.c
--- /md0/kernels/2.4/v2.4.6-pre8/fs/qnx4/inode.c Thu May 3 11:22:17 2001
+++ lb-2.4.6-pre8/fs/qnx4/inode.c Sat Jun 30 14:22:13 2001
@@ -204,7 +204,7 @@
return NULL;
}

-int qnx4_get_block( struct inode *inode, long iblock, struct buffer_head *bh, int create )
+int qnx4_get_block( struct inode *inode, blkoff_t iblock, struct buffer_head *bh, int create )
{
unsigned long phys;

diff -ur /md0/kernels/2.4/v2.4.6-pre8/fs/reiserfs/inode.c lb-2.4.6-pre8/fs/reiserfs/inode.c
--- /md0/kernels/2.4/v2.4.6-pre8/fs/reiserfs/inode.c Sat Jun 30 14:04:27 2001
+++ lb-2.4.6-pre8/fs/reiserfs/inode.c Sat Jun 30 15:32:52 2001
@@ -438,7 +438,7 @@

// this is called to create file map. So, _get_block_create_0 will not
// read direct item
-int reiserfs_bmap (struct inode * inode, long block,
+int reiserfs_bmap (struct inode * inode, blkoff_t block,
struct buffer_head * bh_result, int create)
{
if (!file_capable (inode, block))
@@ -468,7 +468,7 @@
** don't want to send create == GET_BLOCK_NO_HOLE to reiserfs_get_block,
** don't use this function.
*/
-static int reiserfs_get_block_create_0 (struct inode * inode, long block,
+static int reiserfs_get_block_create_0 (struct inode * inode, blkoff_t block,
struct buffer_head * bh_result, int create) {
return reiserfs_get_block(inode, block, bh_result, GET_BLOCK_NO_HOLE) ;
}
@@ -559,7 +559,7 @@
// determine which parts are derivative, if any, understanding that
// there are only so many ways to code to a given interface.
//
-int reiserfs_get_block (struct inode * inode, long block,
+int reiserfs_get_block (struct inode * inode, blkoff_t block,
struct buffer_head * bh_result, int create)
{
int repeat, retval;
diff -ur /md0/kernels/2.4/v2.4.6-pre8/fs/sysv/inode.c lb-2.4.6-pre8/fs/sysv/inode.c
--- /md0/kernels/2.4/v2.4.6-pre8/fs/sysv/inode.c Sat Jun 30 14:04:27 2001
+++ lb-2.4.6-pre8/fs/sysv/inode.c Sat Jun 30 14:22:13 2001
@@ -787,7 +787,7 @@
return result;
}

-static int sysv_get_block(struct inode *inode, long iblock, struct buffer_head *bh_result, int create)
+static int sysv_get_block(struct inode *inode, blkoff_t iblock, struct buffer_head *bh_result, int create)
{
struct super_block *sb;
int ret, err, new;
diff -ur /md0/kernels/2.4/v2.4.6-pre8/fs/udf/inode.c lb-2.4.6-pre8/fs/udf/inode.c
--- /md0/kernels/2.4/v2.4.6-pre8/fs/udf/inode.c Sat Jun 30 14:04:27 2001
+++ lb-2.4.6-pre8/fs/udf/inode.c Sat Jun 30 14:22:13 2001
@@ -56,7 +56,7 @@
static void udf_update_extents(struct inode *,
long_ad [EXTENT_MERGE_SIZE], int, int,
lb_addr, Uint32, struct buffer_head **);
-static int udf_get_block(struct inode *, long, struct buffer_head *, int);
+static int udf_get_block(struct inode *, blkoff_t, struct buffer_head *, int);

/*
* udf_put_inode
@@ -311,7 +311,7 @@
return dbh;
}

-static int udf_get_block(struct inode *inode, long block, struct buffer_head *bh_result, int create)
+static int udf_get_block(struct inode *inode, blkoff_t block, struct buffer_head *bh_result, int create)
{
int err, new;
struct buffer_head *bh;
diff -ur /md0/kernels/2.4/v2.4.6-pre8/include/linux/blkdev.h lb-2.4.6-pre8/include/linux/blkdev.h
--- /md0/kernels/2.4/v2.4.6-pre8/include/linux/blkdev.h Mon Jun 18 22:03:03 2001
+++ lb-2.4.6-pre8/include/linux/blkdev.h Sun Jul 1 00:35:58 2001
@@ -1,6 +1,9 @@
#ifndef _LINUX_BLKDEV_H
#define _LINUX_BLKDEV_H

+#ifndef _LINUX_TYPES_H
+#include <linux/types.h>
+#endif
#include <linux/major.h>
#include <linux/sched.h>
#include <linux/genhd.h>
@@ -33,9 +36,10 @@
kdev_t rq_dev;
int cmd; /* READ or WRITE */
int errors;
- unsigned long sector;
+ blkoff_t sector;
+ blkoff_t hard_sector;
unsigned long nr_sectors;
- unsigned long hard_sector, hard_nr_sectors;
+ unsigned long hard_nr_sectors;
unsigned int nr_segments;
unsigned int nr_hw_segments;
unsigned long current_nr_sectors;
@@ -164,7 +168,7 @@
extern void blk_queue_make_request(request_queue_t *, make_request_fn *);
extern void generic_unplug_device(void *);

-extern int * blk_size[MAX_BLKDEV];
+extern blkoff_t * blk_size[MAX_BLKDEV];

extern int * blksize_size[MAX_BLKDEV];

diff -ur /md0/kernels/2.4/v2.4.6-pre8/include/linux/fs.h lb-2.4.6-pre8/include/linux/fs.h
--- /md0/kernels/2.4/v2.4.6-pre8/include/linux/fs.h Sat Jun 30 14:04:28 2001
+++ lb-2.4.6-pre8/include/linux/fs.h Sun Jul 1 00:35:46 2001
@@ -188,6 +188,7 @@
/* This was here just to show that the number is taken -
probably all these _IO(0x12,*) ioctls should be moved to blkpg.h. */
#endif
+#define BLKGETSIZE64 _IO(0x12,109) /* return device size */


#define BMAP_IOCTL 1 /* obsolete - kept for compatibility */
@@ -235,7 +236,7 @@
struct buffer_head {
/* First cache line: */
struct buffer_head *b_next; /* Hash queue list */
- unsigned long b_blocknr; /* block number */
+ blkoff_t b_blocknr; /* block number */
unsigned short b_size; /* block size */
unsigned short b_list; /* List that this buffer appears */
kdev_t b_dev; /* device (B_FREE = free) */
@@ -256,7 +257,7 @@
void (*b_end_io)(struct buffer_head *bh, int uptodate); /* I/O completion */
void *b_private; /* reserved for b_end_io */

- unsigned long b_rsector; /* Real buffer location on disk */
+ blkoff_t b_rsector; /* Real buffer location on disk */
wait_queue_head_t b_wait;

struct inode * b_inode;
@@ -1283,8 +1284,8 @@
extern struct file * get_empty_filp(void);
extern void file_move(struct file *f, struct list_head *list);
extern void file_moveto(struct file *new, struct file *old);
-extern struct buffer_head * get_hash_table(kdev_t, int, int);
-extern struct buffer_head * getblk(kdev_t, int, int);
+extern struct buffer_head * get_hash_table(kdev_t, blkoff_t, int);
+extern struct buffer_head * getblk(kdev_t, blkoff_t, int);
extern void ll_rw_block(int, int, struct buffer_head * bh[]);
extern void submit_bh(int, struct buffer_head *);
extern int is_read_only(kdev_t);
@@ -1301,12 +1302,12 @@
__bforget(buf);
}
extern void set_blocksize(kdev_t, int);
-extern struct buffer_head * bread(kdev_t, int, int);
+extern struct buffer_head * bread(kdev_t, blkoff_t, int);
extern void wakeup_bdflush(int wait);

-extern int brw_page(int, struct page *, kdev_t, int [], int);
+extern int brw_page(int, struct page *, kdev_t, blkoff_t [], int);

-typedef int (get_block_t)(struct inode*,long,struct buffer_head*,int);
+typedef int (get_block_t)(struct inode*,blkoff_t,struct buffer_head*,int);

/* Generic buffer handling for block filesystems.. */
extern int block_flushpage(struct page *, unsigned long);
@@ -1318,7 +1319,7 @@
unsigned long *);
extern int block_sync_page(struct page *);

-int generic_block_bmap(struct address_space *, long, get_block_t *);
+blkoff_t generic_block_bmap(struct address_space *, blkoff_t, get_block_t *);
int generic_commit_write(struct file *, struct page *, unsigned, unsigned);
int block_truncate_page(struct address_space *, loff_t, get_block_t *);

diff -ur /md0/kernels/2.4/v2.4.6-pre8/include/linux/genhd.h lb-2.4.6-pre8/include/linux/genhd.h
--- /md0/kernels/2.4/v2.4.6-pre8/include/linux/genhd.h Mon Jun 18 22:03:03 2001
+++ lb-2.4.6-pre8/include/linux/genhd.h Sun Jul 1 00:35:58 2001
@@ -48,8 +48,8 @@
# include <linux/devfs_fs_kernel.h>

struct hd_struct {
- long start_sect;
- long nr_sects;
+ blkoff_t start_sect;
+ blkoff_t nr_sects;
devfs_handle_t de; /* primary (master) devfs entry */
};

@@ -63,7 +63,7 @@
int max_p; /* maximum partitions per device */

struct hd_struct *part; /* [indexed by minor] */
- int *sizes; /* [idem], device size in blocks */
+ blkoff_t *sizes; /* [idem], device size in blocks */
int nr_real; /* number of real devices */

void *real_devices; /* internal use */
diff -ur /md0/kernels/2.4/v2.4.6-pre8/include/linux/loop.h lb-2.4.6-pre8/include/linux/loop.h
--- /md0/kernels/2.4/v2.4.6-pre8/include/linux/loop.h Thu Apr 5 11:53:45 2001
+++ lb-2.4.6-pre8/include/linux/loop.h Sat Jun 30 23:40:35 2001
@@ -28,13 +28,13 @@
int lo_number;
int lo_refcnt;
kdev_t lo_device;
- int lo_offset;
+ loff_t lo_offset;
int lo_encrypt_type;
int lo_encrypt_key_size;
int lo_flags;
int (*transfer)(struct loop_device *, int cmd,
char *raw_buf, char *loop_buf, int size,
- int real_block);
+ blkoff_t real_block);
char lo_name[LO_NAME_SIZE];
char lo_encrypt_key[LO_KEY_SIZE];
__u32 lo_init[2];
@@ -98,7 +98,7 @@
dev_t lo_device; /* ioctl r/o */
unsigned long lo_inode; /* ioctl r/o */
dev_t lo_rdevice; /* ioctl r/o */
- int lo_offset;
+ loff_t lo_offset;
int lo_encrypt_type;
int lo_encrypt_key_size; /* ioctl w/o */
int lo_flags; /* ioctl r/o */
@@ -128,7 +128,7 @@
struct loop_func_table {
int number; /* filter type */
int (*transfer)(struct loop_device *lo, int cmd, char *raw_buf,
- char *loop_buf, int size, int real_block);
+ char *loop_buf, int size, blkoff_t real_block);
int (*init)(struct loop_device *, struct loop_info *);
/* release is called from loop_unregister_transfer or clr_fd */
int (*release)(struct loop_device *);
diff -ur /md0/kernels/2.4/v2.4.6-pre8/include/linux/msdos_fs.h lb-2.4.6-pre8/include/linux/msdos_fs.h
--- /md0/kernels/2.4/v2.4.6-pre8/include/linux/msdos_fs.h Sat Jun 30 15:18:45 2001
+++ lb-2.4.6-pre8/include/linux/msdos_fs.h Sat Jun 30 18:02:53 2001
@@ -241,7 +241,7 @@
/* inode.c */
extern void fat_hash_init(void);
extern int fat_bmap(struct inode *inode,int block);
-extern int fat_get_block(struct inode *, long, struct buffer_head *, int);
+extern int fat_get_block(struct inode *, blkoff_t, struct buffer_head *, int);
extern int fat_notify_change(struct dentry *, struct iattr *);
extern void fat_clear_inode(struct inode *inode);
extern void fat_delete_inode(struct inode *inode);
diff -ur /md0/kernels/2.4/v2.4.6-pre8/include/linux/qnx4_fs.h lb-2.4.6-pre8/include/linux/qnx4_fs.h
--- /md0/kernels/2.4/v2.4.6-pre8/include/linux/qnx4_fs.h Thu Jun 29 18:53:42 2000
+++ lb-2.4.6-pre8/include/linux/qnx4_fs.h Sat Jun 30 15:24:47 2001
@@ -118,7 +118,7 @@
extern int qnx4_rmdir(struct inode *dir, struct dentry *dentry);
extern int qnx4_sync_file(struct file *file, struct dentry *dentry, int);
extern int qnx4_sync_inode(struct inode *inode);
-extern int qnx4_get_block(struct inode *inode, long iblock, struct buffer_head *bh, int create);
+extern int qnx4_get_block(struct inode *inode, blkoff_t iblock, struct buffer_head *bh, int create);

#endif /* __KERNEL__ */

diff -ur /md0/kernels/2.4/v2.4.6-pre8/include/linux/raid/linear.h lb-2.4.6-pre8/include/linux/raid/linear.h
--- /md0/kernels/2.4/v2.4.6-pre8/include/linux/raid/linear.h Tue Jun 19 13:32:19 2001
+++ lb-2.4.6-pre8/include/linux/raid/linear.h Sun Jul 1 00:36:41 2001
@@ -5,8 +5,8 @@

struct dev_info {
kdev_t dev;
- unsigned long size;
- unsigned long offset;
+ blkoff_t size;
+ blkoff_t offset;
};

typedef struct dev_info dev_info_t;
diff -ur /md0/kernels/2.4/v2.4.6-pre8/include/linux/raid/md.h lb-2.4.6-pre8/include/linux/raid/md.h
--- /md0/kernels/2.4/v2.4.6-pre8/include/linux/raid/md.h Sat Jun 30 02:27:33 2001
+++ lb-2.4.6-pre8/include/linux/raid/md.h Sun Jul 1 00:35:59 2001
@@ -58,7 +58,7 @@
#define MD_MINOR_VERSION 90
#define MD_PATCHLEVEL_VERSION 0

-extern int md_size[MAX_MD_DEVS];
+extern blkoff_t md_size[MAX_MD_DEVS];
extern struct hd_struct md_hd_struct[MAX_MD_DEVS];

extern void add_mddev_mapping (mddev_t *mddev, kdev_t dev, void *data);
diff -ur /md0/kernels/2.4/v2.4.6-pre8/include/linux/raid/md_k.h lb-2.4.6-pre8/include/linux/raid/md_k.h
--- /md0/kernels/2.4/v2.4.6-pre8/include/linux/raid/md_k.h Fri May 25 22:48:10 2001
+++ lb-2.4.6-pre8/include/linux/raid/md_k.h Sat Jun 30 14:22:13 2001
@@ -162,14 +162,14 @@

kdev_t dev; /* Device number */
kdev_t old_dev; /* "" when it was last imported */
- unsigned long size; /* Device size (in blocks) */
+ blkoff_t size; /* Device size (in blocks) */
mddev_t *mddev; /* RAID array if running */
unsigned long last_events; /* IO event timestamp */

struct block_device *bdev; /* block device handle */

mdp_super_t *sb;
- unsigned long sb_offset;
+ blkoff_t sb_offset;

int faulty; /* if faulty do not issue IO requests */
int desc_nr; /* descriptor index in the superblock */
@@ -237,7 +237,7 @@

int (*stop_resync)(mddev_t *mddev);
int (*restart_resync)(mddev_t *mddev);
- int (*sync_request)(mddev_t *mddev, unsigned long block_nr);
+ int (*sync_request)(mddev_t *mddev, blkoff_t block_nr);
};


diff -ur /md0/kernels/2.4/v2.4.6-pre8/include/linux/raid/raid0.h lb-2.4.6-pre8/include/linux/raid/raid0.h
--- /md0/kernels/2.4/v2.4.6-pre8/include/linux/raid/raid0.h Tue Jun 19 13:32:19 2001
+++ lb-2.4.6-pre8/include/linux/raid/raid0.h Sun Jul 1 00:36:41 2001
@@ -5,9 +5,9 @@

struct strip_zone
{
- unsigned long zone_offset; /* Zone offset in md_dev */
- unsigned long dev_offset; /* Zone offset in real dev */
- unsigned long size; /* Zone size */
+ blkoff_t zone_offset; /* Zone offset in md_dev */
+ blkoff_t dev_offset; /* Zone offset in real dev */
+ blkoff_t size; /* Zone size */
int nb_dev; /* # of devices attached to the zone */
mdk_rdev_t *dev[MD_SB_DISKS]; /* Devices attached to the zone */
};
diff -ur /md0/kernels/2.4/v2.4.6-pre8/include/linux/raid/raid1.h lb-2.4.6-pre8/include/linux/raid/raid1.h
--- /md0/kernels/2.4/v2.4.6-pre8/include/linux/raid/raid1.h Tue Jun 19 13:32:19 2001
+++ lb-2.4.6-pre8/include/linux/raid/raid1.h Sun Jul 1 00:36:41 2001
@@ -7,8 +7,8 @@
int number;
int raid_disk;
kdev_t dev;
- int sect_limit;
- int head_position;
+ blkoff_t sect_limit;
+ blkoff_t head_position;

/*
* State bits:
@@ -27,7 +27,7 @@
int raid_disks;
int working_disks;
int last_used;
- unsigned long next_sect;
+ blkoff_t next_sect;
int sect_count;
mdk_thread_t *thread, *resync_thread;
int resync_mirrors;
@@ -47,7 +47,7 @@
md_wait_queue_head_t wait_buffer;

/* for use when syncing mirrors: */
- unsigned long start_active, start_ready,
+ blkoff_t start_active, start_ready,
start_pending, start_future;
int cnt_done, cnt_active, cnt_ready,
cnt_pending, cnt_future;
diff -ur /md0/kernels/2.4/v2.4.6-pre8/include/linux/raid/raid5.h lb-2.4.6-pre8/include/linux/raid/raid5.h
--- /md0/kernels/2.4/v2.4.6-pre8/include/linux/raid/raid5.h Sat Jun 30 14:04:28 2001
+++ lb-2.4.6-pre8/include/linux/raid/raid5.h Sun Jul 1 00:36:44 2001
@@ -133,7 +133,7 @@
struct buffer_head *bh_write[MD_SB_DISKS]; /* write request buffers of the MD device */
struct buffer_head *bh_written[MD_SB_DISKS]; /* write request buffers of the MD device that have been scheduled for write */
struct page *bh_page[MD_SB_DISKS]; /* saved bh_cache[n]->b_page when reading around the cache */
- unsigned long sector; /* sector of this row */
+ blkoff_t sector; /* sector of this row */
int size; /* buffers size */
int pd_idx; /* parity disk index */
unsigned long state; /* state flags */
diff -ur /md0/kernels/2.4/v2.4.6-pre8/include/linux/reiserfs_fs.h lb-2.4.6-pre8/include/linux/reiserfs_fs.h
--- /md0/kernels/2.4/v2.4.6-pre8/include/linux/reiserfs_fs.h Sat Jun 30 15:28:41 2001
+++ lb-2.4.6-pre8/include/linux/reiserfs_fs.h Sat Jun 30 18:03:54 2001
@@ -1797,7 +1797,7 @@
loff_t offset, int type, int length, int entry_count);
/*void store_key (struct key * key);
void forget_key (struct key * key);*/
-int reiserfs_get_block (struct inode * inode, long block,
+int reiserfs_get_block (struct inode * inode, blkoff_t block,
struct buffer_head * bh_result, int create);
struct inode * reiserfs_iget (struct super_block * s, struct cpu_key * key);
void reiserfs_read_inode (struct inode * inode) ;
diff -ur /md0/kernels/2.4/v2.4.6-pre8/include/linux/types.h lb-2.4.6-pre8/include/linux/types.h
--- /md0/kernels/2.4/v2.4.6-pre8/include/linux/types.h Tue Jun 19 00:54:47 2001
+++ lb-2.4.6-pre8/include/linux/types.h Sat Jun 30 15:37:56 2001
@@ -3,6 +3,14 @@

#ifdef __KERNEL__
#include <linux/config.h>
+
+#if defined(CONFIG_BLKOFF_LONGLONG)
+#define BLKOFF_FMT "Lu"
+typedef unsigned long long blkoff_t;
+#else
+#define BLKOFF_FMT "lu"
+typedef unsigned long blkoff_t;
+#endif
#endif

#include <linux/posix_types.h>
diff -ur /md0/kernels/2.4/v2.4.6-pre8/mm/page_io.c lb-2.4.6-pre8/mm/page_io.c
--- /md0/kernels/2.4/v2.4.6-pre8/mm/page_io.c Thu May 3 11:22:20 2001
+++ lb-2.4.6-pre8/mm/page_io.c Sat Jun 30 14:22:13 2001
@@ -36,7 +36,7 @@
static int rw_swap_page_base(int rw, swp_entry_t entry, struct page *page)
{
unsigned long offset;
- int zones[PAGE_SIZE/512];
+ blkoff_t zones[PAGE_SIZE/512];
int zones_used;
kdev_t dev = 0;
int block_size;


2001-07-03 04:53:47

by Ragnar Kjørstad

[permalink] [raw]
Subject: Re: [RFC][PATCH] first cut 64 bit block support

On Sun, Jul 01, 2001 at 12:53:25AM -0400, Ben LaHaise wrote:
> Hey folks,
>
> Below is the first cut at making the block size limit configurable to 64
> bits on x86, as well as always 64 bits on 64 bit machines. The audit
> isn't complete yet, but a good chunk of it is done.

Great!

> Filesystem 1k-blocks Used Available Use% Mounted on
> /dev/md1 7508125768 20 7476280496 1% /mnt/3
>
> This is a 7TB ext2 filesystem on 4KB blocks. The 7TB /dev/md1 consists of
> 7x 1TB sparse files on loop devices raid0'd together. The current patch
> does not have the fixes in the SCSI layer or IDE driver yet; expect the
> SCSI fixes in the next version, although I'll need a tester. The
> following should be 64 bit clean now: nbd, loop, raid0, raid1, raid5.

What about LVM?

We'll see what we can do to test the scsi-code. Please send it to us
when you have code. I guess there are fixes for both generic-scsi code
and for each controller, right? What controllers are you planning on
fixing first?
What tests do you recommend?
mkfs on a big device, and then putting >2TB data on it?



--
Ragnar Kjorstad
Big Storage

2001-07-04 02:20:01

by Benjamin LaHaise

[permalink] [raw]
Subject: [PATCH] 64 bit scsi read/write

On Tue, 3 Jul 2001, Ragnar Kjørstad wrote:

> What about LVM?

Errr, I'll refrain from talking about LVM.

> We'll see what we can do to test the scsi-code. Please send it to us
> when you have code. I guess there are fixes for both generic-scsi code
> and for each controller, right? What controllers are you planning on
> fixing first?
> What tests do you recommend?
> mkfs on a big device, and then putting >2TB data on it?

Here's the [completely untested] generic scsi fixup, but I'm told that
some controllers will break with it. Give it a whirl and let me know how
many pieces you're left holding. =) Please note that msdos partitions do
*not* work on devices larger than 2TB, so you'll have to use the scsi disk
directly. This patch applies on top of v2.4.6-pre8-largeblock4.diff.

Testing-wise, I'm looking for tests on ext2, the block device and raw
devices that write out enough data to fill the device and then read the
data back looking for any corruption. There are a few test programs I've
got to this end, but I need to clean them up before releasing them. If
anyone wants to help sort out issues on other filesystems, I'll certainly
track patches and feedback. Cheers,

-ben

.... ~/patches/v2.4.6-pre8-lb-scsi.diff ....
diff -ur lb-2.4.6-pre8/drivers/scsi/scsi.h lb-2.4.6-pre8.scsi/drivers/scsi/scsi.h
--- lb-2.4.6-pre8/drivers/scsi/scsi.h Tue Jul 3 01:31:47 2001
+++ lb-2.4.6-pre8.scsi/drivers/scsi/scsi.h Tue Jul 3 22:03:16 2001
@@ -351,7 +351,7 @@
#define DRIVER_MASK 0x0f
#define SUGGEST_MASK 0xf0

-#define MAX_COMMAND_SIZE 12
+#define MAX_COMMAND_SIZE 16
#define SCSI_SENSE_BUFFERSIZE 64

/*
@@ -613,6 +613,7 @@
unsigned expecting_cc_ua:1; /* Expecting a CHECK_CONDITION/UNIT_ATTN
* because we did a bus reset. */
unsigned device_blocked:1; /* Device returned QUEUE_FULL. */
+ unsigned sixteen:1; /* use 16 byte read / write */
unsigned ten:1; /* support ten byte read / write */
unsigned remap:1; /* support remapping */
unsigned starved:1; /* unable to process commands because
diff -ur lb-2.4.6-pre8/drivers/scsi/sd.c lb-2.4.6-pre8.scsi/drivers/scsi/sd.c
--- lb-2.4.6-pre8/drivers/scsi/sd.c Tue Jul 3 22:08:28 2001
+++ lb-2.4.6-pre8.scsi/drivers/scsi/sd.c Tue Jul 3 22:05:46 2001
@@ -277,11 +277,12 @@

static int sd_init_command(Scsi_Cmnd * SCpnt)
{
- int dev, devm, block, this_count;
+ int dev, devm, this_count;
Scsi_Disk *dpnt;
#if CONFIG_SCSI_LOGGING
char nbuff[6];
#endif
+ blkoff_t block;

devm = SD_PARTITION(SCpnt->request.rq_dev);
dev = DEVICE_NR(SCpnt->request.rq_dev);
@@ -289,7 +290,7 @@
block = SCpnt->request.sector;
this_count = SCpnt->request_bufflen >> 9;

- SCSI_LOG_HLQUEUE(1, printk("Doing sd request, dev = %d, block = %d\n", devm, block));
+ SCSI_LOG_HLQUEUE(1, printk("Doing sd request, dev = %d, block = %"BLKOFF_FMT"\n", devm, block));

dpnt = &rscsi_disks[dev];
if (devm >= (sd_template.dev_max << 4) ||
@@ -374,7 +375,21 @@

SCpnt->cmnd[1] = (SCpnt->lun << 5) & 0xe0;

- if (((this_count > 0xff) || (block > 0x1fffff)) || SCpnt->device->ten) {
+ if (SCpnt->device->sixteen) {
+ SCpnt->cmnd[0] += READ_16 - READ_6;
+ SCpnt->cmnd[2] = (unsigned char) (block >> 56) & 0xff;
+ SCpnt->cmnd[3] = (unsigned char) (block >> 48) & 0xff;
+ SCpnt->cmnd[4] = (unsigned char) (block >> 40) & 0xff;
+ SCpnt->cmnd[5] = (unsigned char) (block >> 32) & 0xff;
+ SCpnt->cmnd[6] = (unsigned char) (block >> 24) & 0xff;
+ SCpnt->cmnd[7] = (unsigned char) (block >> 16) & 0xff;
+ SCpnt->cmnd[8] = (unsigned char) (block >> 8) & 0xff;
+ SCpnt->cmnd[9] = (unsigned char) block & 0xff;
+ SCpnt->cmnd[10] = (unsigned char) (this_count >> 24) & 0xff;
+ SCpnt->cmnd[11] = (unsigned char) (this_count >> 16) & 0xff;
+ SCpnt->cmnd[12] = (unsigned char) (this_count >> 8) & 0xff;
+ SCpnt->cmnd[13] = (unsigned char) this_count & 0xff;
+ } else if (SCpnt->device->ten || (this_count > 0xff) || (block > 0x1fffff)) {
if (this_count > 0xffff)
this_count = 0xffff;

@@ -882,14 +897,61 @@
*/
rscsi_disks[i].ready = 1;

- rscsi_disks[i].capacity = 1 + ((buffer[0] << 24) |
- (buffer[1] << 16) |
- (buffer[2] << 8) |
- buffer[3]);
+ rscsi_disks[i].capacity = buffer[0];
+ rscsi_disks[i].capacity <<= 8;
+ rscsi_disks[i].capacity |= buffer[1];
+ rscsi_disks[i].capacity <<= 8;
+ rscsi_disks[i].capacity |= buffer[2];
+ rscsi_disks[i].capacity <<= 8;
+ rscsi_disks[i].capacity |= buffer[3];
+ rscsi_disks[i].capacity += 1;

sector_size = (buffer[4] << 24) |
(buffer[5] << 16) | (buffer[6] << 8) | buffer[7];

+
+ /* Is this disk larger than 32 bits? */
+ if (rscsi_disks[i].capacity == 0x100000000) {
+ cmd[0] = READ_CAPACITY;
+ cmd[1] = (rscsi_disks[i].device->lun << 5) & 0xe0;
+ cmd[1] |= 0x2; /* Longlba */
+ memset((void *) &cmd[2], 0, 8);
+ memset((void *) buffer, 0, 8);
+ SRpnt->sr_cmd_len = 0;
+ SRpnt->sr_sense_buffer[0] = 0;
+ SRpnt->sr_sense_buffer[2] = 0;
+
+ SRpnt->sr_data_direction = SCSI_DATA_READ;
+ scsi_wait_req(SRpnt, (void *) cmd, (void *) buffer,
+ 8, SD_TIMEOUT, MAX_RETRIES);
+
+ /* cool! 64 bit goodness... */
+ if (!SRpnt->sr_result) {
+ rscsi_disks[i].capacity = buffer[0];
+ rscsi_disks[i].capacity <<= 8;
+ rscsi_disks[i].capacity |= buffer[1];
+ rscsi_disks[i].capacity <<= 8;
+ rscsi_disks[i].capacity |= buffer[2];
+ rscsi_disks[i].capacity <<= 8;
+ rscsi_disks[i].capacity |= buffer[3];
+ rscsi_disks[i].capacity <<= 8;
+ rscsi_disks[i].capacity |= buffer[4];
+ rscsi_disks[i].capacity <<= 8;
+ rscsi_disks[i].capacity |= buffer[5];
+ rscsi_disks[i].capacity <<= 8;
+ rscsi_disks[i].capacity |= buffer[6];
+ rscsi_disks[i].capacity <<= 8;
+ rscsi_disks[i].capacity |= buffer[7];
+ rscsi_disks[i].capacity += 1;
+
+ sector_size = (buffer[8] << 24) |
+ (buffer[9] << 16) | (buffer[10] << 8) |
+ buffer[11];
+
+ SRpnt->sr_device->sixteen = 1;
+ }
+ }
+
if (sector_size == 0) {
sector_size = 512;
printk("%s : sector size 0 reported, assuming 512.\n",
@@ -930,7 +992,7 @@
*/
int m;
int hard_sector = sector_size;
- int sz = rscsi_disks[i].capacity * (hard_sector/256);
+ blkoff_t sz = rscsi_disks[i].capacity * (hard_sector/256);

/* There are 16 minors allocated for each major device */
for (m = i << 4; m < ((i + 1) << 4); m++) {
@@ -938,7 +1000,7 @@
}

printk("SCSI device %s: "
- "%d %d-byte hdwr sectors (%d MB)\n",
+ "%"BLKOFF_FMT" %d-byte hdwr sectors (%"BLKOFF_FMT" MB)\n",
nbuff, rscsi_disks[i].capacity,
hard_sector, (sz/2 - sz/1250 + 974)/1950);
}
diff -ur lb-2.4.6-pre8/drivers/scsi/sd.h lb-2.4.6-pre8.scsi/drivers/scsi/sd.h
--- lb-2.4.6-pre8/drivers/scsi/sd.h Tue Jul 3 01:31:47 2001
+++ lb-2.4.6-pre8.scsi/drivers/scsi/sd.h Tue Jul 3 22:03:16 2001
@@ -26,7 +26,7 @@
extern struct hd_struct *sd;

typedef struct scsi_disk {
- unsigned capacity; /* size in blocks */
+ u64 capacity; /* size in blocks */
Scsi_Device *device;
unsigned char ready; /* flag ready for FLOPTICAL */
unsigned char write_prot; /* flag write_protect for rmvable dev */
diff -ur lb-2.4.6-pre8/include/scsi/scsi.h lb-2.4.6-pre8.scsi/include/scsi/scsi.h
--- lb-2.4.6-pre8/include/scsi/scsi.h Thu May 3 11:22:20 2001
+++ lb-2.4.6-pre8.scsi/include/scsi/scsi.h Tue Jul 3 18:06:43 2001
@@ -78,6 +78,9 @@
#define MODE_SENSE_10 0x5a
#define PERSISTENT_RESERVE_IN 0x5e
#define PERSISTENT_RESERVE_OUT 0x5f
+#define READ_16 0x88
+#define WRITE_16 0x8a
+#define WRITE_VERIFY_16 0x8e
#define MOVE_MEDIUM 0xa5
#define READ_12 0xa8
#define WRITE_12 0xaa

2001-07-04 07:11:52

by Alan

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

> --- lb-2.4.6-pre8/drivers/scsi/scsi.h Tue Jul 3 01:31:47 2001
> +++ lb-2.4.6-pre8.scsi/drivers/scsi/scsi.h Tue Jul 3 22:03:16 2001
> @@ -351,7 +351,7 @@
> #define DRIVER_MASK 0x0f
> #define SUGGEST_MASK 0xf0
>
> -#define MAX_COMMAND_SIZE 12
> +#define MAX_COMMAND_SIZE 16

Please talk to Khalid at HP who has already submitted patches to handle
16 byte command blocks on some controllers cleanly. I think you need to
combine both patches to get the right result.

> + if (SCpnt->device->sixteen) {

[and controller]

Alan

2001-07-04 10:17:36

by Chris Wedgwood

[permalink] [raw]
Subject: Re: [RFC][PATCH] first cut 64 bit block support

On Sun, Jul 01, 2001 at 12:53:25AM -0400, Ben LaHaise wrote:

> Ugly bits: I had to add libgcc.a to satisfy the need for 64 bit
> division. Yeah, it sucks, but RAID needs some more massaging before
> I can remove the 64 bit division completely. This will be fixed.

I would rather see this code removed from libgcc and put into a
function (optionally inline) such that code like:

__u64 foo(__u64 a, __u64 b)
{
__u64 t;


t = a * SOME_CONST + b;

return t / BLEM;
}

would really look like:

__u64 foo(__u64 a, __u64 b)
{
__u64 t;

t = mul_64b(a, SOME_CONST) + b;

return udiv_64b(t, BLEM);
}


such that for people to use 64-bit operations in the kernel, they have
to explicitly code them in, not just accidentally change a variable
type and have gcc/libgcc hide this fact from them.

Note, I use __u64 not "long long" as I'm not 100% sure "long long" will
mean 64 bits on all future architectures (it would be cool, for
example, if it was 128-bit on some!).


What do you think? Would you accept patches for either of these?




--cw

2001-07-04 17:00:27

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [RFC][PATCH] first cut 64 bit block support

On Wed, 4 Jul 2001, Chris Wedgwood wrote:

> On Sun, Jul 01, 2001 at 12:53:25AM -0400, Ben LaHaise wrote:
>
> > Ugly bits: I had to add libgcc.a to satisfy the need for 64 bit
> > division. Yeah, it sucks, but RAID needs some more massaging before
> > I can remove the 64 bit division completely. This will be fixed.
>
> I would rather see this code removed from libgcc and put into a
> function (optionally inline) such that code like:

I'm getting rid of the need for libgcc entirely. That's what "This will
be fixed" means. If you want to expedite the process, send a patch.
Until then, this is Good Enough for testing purposes.

-ben

2001-07-05 06:34:30

by Ragnar Kjørstad

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

On Tue, Jul 03, 2001 at 10:19:36PM -0400, Ben LaHaise wrote:
> > > [ patch to make md and nbd work for >2TB devices ]
> > What about LVM?
>
> Errr, I'll refrain from talking about LVM.

What do you mean?
Is it not feasible to fix this in LVM as well, or do you just not know
what needs to be done to LVM?


--
Ragnar Kjorstad
Big Storage

2001-07-05 07:36:00

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

On Thu, 5 Jul 2001, Ragnar Kjørstad wrote:

> What do you mean?
> Is it not feasible to fix this in LVM as well, or do you just not know
> what needs to be done to LVM?

Fixing LVM is not on the radar of my priorities. The code is sorely in
need of a rewrite and violates several of the basic planning tenets that
any good code in the block layer should follow. Namely, it should have 1)
planned on supporting 64 bit offsets, 2) never used multiplication,
division or modulus on block numbers, and 3) never allocated memory
structures that are indexed by block numbers. LVM failed on all three of
these -- and this is just what I noticed in a quick 5 minute glance
through the code. Sorry, but LVM is obsolete by design. It will continue
to work on 32 bit block devices, but if you try to use it beyond that, it
will fail. That said, we'll have to make sure these failures are graceful
and occur before the user has a chance of losing any data.

Now, thankfully there are alternatives like ELVM, which are working on
getting the details right from the lessons learned. Given that, I think
we'll be in good shape during the 2.5 cycle.

-ben

2001-07-13 18:21:20

by Albert D. Cahalan

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

Ben LaHaise writes:
> On Thu, 5 Jul 2001, Ragnar Kjørstad wrote:

>> What do you mean?
>> Is it not feasible to fix this in LVM as well, or do you just not know
>> what needs to be done to LVM?
>
> Fixing LVM is not on the radar of my priorities. The code is sorely in
> need of a rewrite and violates several of the basic planning tenets that
> any good code in the block layer should follow. Namely, it should have 1)
> planned on supporting 64 bit offsets, 2) never used multiplication,
> division or modulus on block numbers, and 3) never allocated memory
> structures that are indexed by block numbers. LVM failed on all three of
> these -- and this is just what I noticed in a quick 5 minute glance
> through the code. Sorry, but LVM is obsolete by design. It will continue
> to work on 32 bit block devices, but if you try to use it beyond that, it
> will fail. That said, we'll have to make sure these failures are graceful
> and occur before the user has a chance of losing any data.
>
> Now, thankfully there are alternatives like ELVM, which are working on
> getting the details right from the lessons learned. Given that, I think
> we'll be in good shape during the 2.5 cycle.

How can any of this even work?

Say I have N disks, mirrored, or maybe with parity. I'm trying
to have a reliable system. I change a file. The write goes out
to my disks, and power is lost. Some number M, such that 0<M<N,
of the disks are written before the power loss. The rest of the
disks don't complete the write. Maybe worse, this is more than
one sector, and some disks have partial writes.

Doesn't RAID need a journal or the phase-tree algorithm?
How does one tell what data is old and what data is new?

2001-07-13 20:43:28

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

Albert writes:
> How can any of this even work?
>
> Say I have N disks, mirrored, or maybe with parity. I'm trying
> to have a reliable system. I change a file. The write goes out
> to my disks, and power is lost. Some number M, such that 0<M<N,
> of the disks are written before the power loss. The rest of the
> disks don't complete the write. Maybe worse, this is more than
> one sector, and some disks have partial writes.
>
> Doesn't RAID need a journal or the phase-tree algorithm?
> How does one tell what data is old and what data is new?

Yes, RAID should have a journal or other ordering enforcement, but
it really isn't any worse in this regard than a single disk. Even
on a single disk you don't have any guarantees of data ordering, so
if you change the file and the power is lost, some of the sectors
will make it to disk and some will not => fsck, with possible data
corruption or loss.

That's why the journaled filesystems have a multi-stage commit of I/O,
first to the journal and then to the disk, so there is no chance of
corrupting the metadata, and if you journal data as well, the data
cannot be corrupted (but some may be lost).
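
As a rough user-space illustration of that ordering (file names invented,
fsync() standing in for whatever barrier/flush the lower layers provide,
and most error handling trimmed for brevity):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* stage the new blocks in the journal, make them stable, write a
 * commit record, make it stable, and only then write in place */
int commit(const char *buf, size_t len, off_t home_off)
{
	int jfd = open("journal", O_WRONLY | O_APPEND | O_CREAT, 0600);
	int ffd = open("fsdata", O_WRONLY | O_CREAT, 0600);
	char commit_rec[512];

	if (jfd < 0 || ffd < 0)
		return -1;

	write(jfd, buf, len);			/* 1: data/metadata into the journal */
	fsync(jfd);				/*    force it to stable storage */

	memset(commit_rec, 0xC0, sizeof(commit_rec));
	write(jfd, commit_rec, sizeof(commit_rec));	/* 2: commit record; the txn is now "real" */
	fsync(jfd);

	pwrite(ffd, buf, len, home_off);	/* 3: in place; a crash here is replayable */
	fsync(ffd);

	close(jfd);
	close(ffd);
	return 0;
}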

RAID 5 throws a wrench into this by not guaranteeing that all of the
blocks in a stripe are consistent (you don't know which blocks and/or
parity were written and which not). Ideally, you want a multi-stage
commit for RAID as well, so that you write the data first, and the
parity afterwards (so on reboot you trust the data first, and not the
parity). You have a problem if there is a bad disk and you crash.

With a data-journaled fs you don't care what RAID does because the fs
journal knows which transactions were in progress. If an I/O was being
written into the journal and did not complete, it is discarded. If it
was written into the journal and did not finish the write into the fs,
it will re-write it on recovery. In both cases you don't care if the
RAID finished the write or not.

Note that LVM (the original topic) does NOT do any RAID stuff
at all; it is just a virtually contiguous disk, made up of one or more
real disks (or stacked on top of RAID).

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-07-13 21:07:22

by Chris Wedgwood

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

On Fri, Jul 13, 2001 at 02:41:52PM -0600, Andreas Dilger wrote:

Yes, RAID should have a journal or other ordering enforcement, but
it really isn't any worse in this regard than a single disk. Even
on a single disk you don't have any guarantees of data ordering,
so if you change the file and the power is lost, some of the
sectors will make it to disk and some will not => fsck, with
possible data corrpution or loss.

How so? On a single disk you can either disable write-caching or for
SCSI disks you can use barriers of sorts.

At which time, you can either assume a sector is written or not.


--cw

2001-07-13 21:14:39

by Alan

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

> RAID 5 throws a wrench into this by not guaranteeing that all of the
> blocks in a stripe are consistent (you don't know which blocks and/or
> parity were written and which not). Ideally, you want a multi-stage
> commit for RAID as well, so that you write the data first, and the
> parity afterwards (so on reboot you trust the data first, and not the
> parity). You have a problem if there is a bad disk and you crash.

Well to be honest so does most disk firmware. IDE especially. For one thing
the logical sector size the drive writes need not match the illusions
provided upstream, and the write flush commands are frequently not implemented
because they damage benchmarketing numbers from folks like ZDNet.


2001-07-13 22:07:07

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

Chris writes:
> On Fri, Jul 13, 2001 at 02:41:52PM -0600, Andreas Dilger wrote:
>
> Yes, RAID should have a journal or other ordering enforcement, but
> it really isn't any worse in this regard than a single disk. Even
> on a single disk you don't have any guarantees of data ordering,
> so if you change the file and the power is lost, some of the
> sectors will make it to disk and some will not => fsck, with
> possible data corrpution or loss.
>
> How so? On a single disk you can either disable write-caching or for
> SCSI disks you can use barriers of sorts.
>
> At which time, you can either assume a sector is written or not.

Well, I _think_ your statement is only true if you are using rawio.
Otherwise, you have a minimum block size of 1kB (for filesystems at
least) so you can't write less than that, and you could potentially
write one sector and not another.

I'm not sure of the exact MD RAID implementation, but I suspect that
if you write a single sector*, it will be exactly the same situation.
However, it also has to write the parity to disk, so if you crash at
this point what you get back depends on the RAID implementation**.

As Alan said in another reply, with IDE disks, you have no guarantee
about write caching on the disk, even if you try to turn it off.

If you are doing synchronous I/O from your application, then I think a
RAID write will not complete until all of the data+parity I/O
is complete, so you should again be as safe as with a single disk.

If you want safety, but async I/O, use ext3 with full data journaling
and a large journal. Andrew Morton has just done some testing with
this and the performance is very good, as long as your journal is big
enough to hold your largest write bursts, and you have < 50% duty
cycle for disk I/O (i.e. you have to have enough spare I/O bandwidth
to write everything to disk twice, but it will go to the journal in a
single contiguous (synchronous) write and can go to the filesystem
asynchronously at a later time when there is no other I/O). If you
put your journal on NVRAM, you will have blazing synchronous I/O.

Cheers, Andreas

*) You _may_ be limited to a larger minimum write, depending on the stripe
size, I haven't looked closely at the code. AFAIK, MD RAID does not
let you stripe a single sector across multiple disks (nor would you
want to), so all disk I/O would still be one or more single sector I/Os
to one or more disks. This means the sector I/O to each individual
disk is still atomic, so it is not any worse than writes to a single
disk (the parity is NOT atomic, but then you don't have parity at
all on a single disk...).

**) As I said in my previous posting, it depends on if/how MD RAID does
write ordering of I/O to the data sector and the parity sector. If
it holds back the parity write until the data I/O(s) are complete, and
trusts the data over parity on recovery, you should be OK unless you
have multiple failures (i.e. bad disk + crash). If it doesn't do this
ordering, or trusts parity over data, then you are F***ed (I doubt it
would have this problem).
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-07-14 00:51:48

by Jonathan Lundell

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

At 4:04 PM -0600 2001-07-13, Andreas Dilger wrote:
>**) As I said in my previous posting, it depends on if/how MD RAID does
> write ordering of I/O to the data sector and the parity sector. If
> it holds back the parity write until the data I/O(s) are complete, and
> trusts the data over parity on recovery, you should be OK unless you
> have multiple failures (i.e. bad disk + crash). If it doesn't do this
> ordering, or trusts parity over data, then you are F***ed (I doubt it
> would have this problem).

That wouldn't help, would it, if more than one data sector were being written.

The fault mode of a sector simply not being written seems like a real
weak point of both RAID-1 and RAID-5. Not that RAID-5 parity ever
gets checked, I think, under normal circumstances, nor RAID-1 mirrors
compared, but if they were checked and there was a parity or
mirror-compare error and no other indication of a fault (e.g. CRC),
there would be no way to recover the correct data.
--
/Jonathan Lundell.

2001-07-14 03:23:18

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

Alan Cox wrote:
>
> > RAID 5 throws a wrench into this by not guaranteeing that all of the
> > blocks in a stripe are consistent (you don't know which blocks and/or
> > parity were written and which not). Ideally, you want a multi-stage
> > commit for RAID as well, so that you write the data first, and the
> > parity afterwards (so on reboot you trust the data first, and not the
> > parity). You have a problem if there is a bad disk and you crash.
>
> Well to be honest so does most disk firmware. IDE especially. For one thing
> the logical sector size the drives writes need not match the illusions
> provided upstream, and the write flush commands are frequently not implemented
> because they damage benchmarketing numbers from folks like Zdnet..

If, after a power outage, the IDE disk can keep going for long enough
to write its write cache out to the reserved vendor area (which will
only take 20-30 milliseconds) then the data may be considered *safe*
as soon as it hits writecache.

In which case it is perfectly legitimate and sensible for the drive
to ignore flush commands, and to ack data as soon as it hits cache.

Yes?

If I'm right then the only open question is: which disks do and
do not do the right thing when the lights go out.

-

2001-07-14 08:46:01

by Alan

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

> If, after a power outage, the IDE disk can keep going for long enough
> to write its write cache out to the reserved vendor area (which will
> only take 20-30 milliseconds) then the data may be considered *safe*
> as soon as it hits writecache.

Hohohoho.

> In which case it is perfectly legitimate and sensible for the drive
> to ignore flush commands, and to ack data as soon as it hits cache.

Since the flushing commands are 'optional' it can legitimately ignore them

> If I'm right then the only open question is: which disks do and
> do not do the right thing when the lights go out.

As far as I can tell none of them at least in the IDE world

Alan

2001-07-14 12:27:47

by Paul Jakma

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

On Fri, 13 Jul 2001, Andreas Dilger wrote:

> put your journal on NVRAM, you will have blazing synchronous I/O.

so ext3 supports having the journal somewhere else then. question: can
the journal be on tmpfs?

> Cheers, Andreas

--paulj

2001-07-14 14:49:17

by Chris Wedgwood

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

On Sat, Jul 14, 2001 at 01:27:37PM +0100, Paul Jakma wrote:

so ext3 supports having the journal somewhere else then. question: can
the journal be on tmpfs?

*why* would you want to do this?


--cw

2001-07-14 14:50:16

by Chris Wedgwood

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

On Sat, Jul 14, 2001 at 09:45:44AM +0100, Alan Cox wrote:

As far as I can tell none of them at least in the IDE world

SCSI disks must, or at least some... if not, how do people like NetApp
get these cool HA certifications?



--cw

2001-07-14 15:09:09

by Ed Tomlinson

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

On Fri, 13 Jul 2001, Paul Jakma wrote:
>On Fri, 13 Jul 2001, Andreas Dilger wrote:

>> put your journal on NVRAM, you will have blazing synchronous I/O.

>so ext3 supports having the journal somewhere else then. question: can
>the journal be on tmpfs?

Why would you want to? You _need_ the journal after a crash to recover
without an fsck - if it's on tmpfs you are SOL...

Ed Tomlinson

2001-07-14 15:46:07

by Paul Jakma

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

On Sun, 15 Jul 2001, Chris Wedgwood wrote:

> *why* would you want to to do this?

:)

to test the performance advantage of a journal in RAM before going to spend
money on NVRAM...

> --cw

--paulj

2001-07-14 16:23:06

by Jonathan Lundell

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

At 2:50 AM +1200 2001-07-15, Chris Wedgwood wrote:
>On Sat, Jul 14, 2001 at 09:45:44AM +0100, Alan Cox wrote:
>
> As far as I can tell none of them at least in the IDE world
>
>SCSI disk must, or at least some... if not, how to peopel like NetApp
>get these cool HA certifications?

NetApp uses a large system-local NVRAM buffer, do they not?
--
/Jonathan Lundell.

2001-07-14 17:00:51

by Chris Wedgwood

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

On Sat, Jul 14, 2001 at 08:41:52AM -0700, Jonathan Lundell wrote:

NetApp uses a large system-local NVRAM buffer, do they not?

Yes... and for clusters it's shared via some kind of NUMA interconnect.
Anyhow, that doesn't prevent disk/fs corruption alone; I suspect it
might be one of the reasons they use raid4 and not raid5 (plus they
also get better LVM management).



--cw

2001-07-14 17:18:47

by Chris Wedgwood

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

tmpfs is going to be _much_ faster than any external bus-connected
NVRAM solution

create a ram disk on a PCI connected video card and journal to that to
compare if you like (PCI bulk writes suck for speed)




--cw





On Sat, Jul 14, 2001 at 04:42:04PM +0100, Paul Jakma wrote:
On Sun, 15 Jul 2001, Chris Wedgwood wrote:

> *why* would you want to to do this?

:)

to test performance advantage of journal on RAM before going to spend
money on NVRAM...

> --cw

--paulj

2001-07-14 17:38:40

by Jonathan Lundell

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

At 9:45 AM +0100 2001-07-14, Alan Cox wrote:
> > If, after a power outage, the IDE disk can keep going for long enough
>> to write its write cache out to the reserved vendor area (which will
>> only take 20-30 milliseconds) then the data may be considered *safe*
>> as soon as it hits writecache.
>
>Hohohoho.
>
>> In which case it is perfectly legitimate and sensible for the drive
>> to ignore flush commands, and to ack data as soon as it hits cache.
>
>Since the flushing commands are 'optional' it can legitimately ignore them
>
>> If I'm right then the only open question is: which disks do and
>> do not do the right thing when the lights go out.
>
>As far as I can tell none of them at least in the IDE world

It's not so great in the SCSI world either. Here's a bit from the
Ultrastar 73LZX functional spec (this is the current-technology
Ultra160 73GB family):

>5.0 Data integrity
>The drive retains recorded information under all non-write operations.
>No more than one sector will be lost by power down during write
>operation while write cache is
>disabled.
>If power down occurs before completion of data transfer from write
>cache to disk while write cache is
>enabled, the data remaining in write cache will be lost. To prevent
>this data loss at power off, the
>following action is recommended:
>* Confirm successful completion of SYNCHRONIZE CACHE (35h) command.

What's worse, though the spec is not explicit on this point, it
appears that the write cache is lost on a SCSI reset, which is
typically used by drivers for last-resort error recovery. And of
course a SCSI bus reset affects all the drives on the bus, not just
the offending one.
--
/Jonathan Lundell.

2001-07-14 20:08:56

by Daniel Phillips

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

On Saturday 14 July 2001 16:50, Chris Wedgwood wrote:
> On Sat, Jul 14, 2001 at 09:45:44AM +0100, Alan Cox wrote:
>
> As far as I can tell none of them at least in the IDE world
>
> SCSI disk must, or at least some... if not, how to peopel like NetApp
> get these cool HA certifications?

Atomic commit. The superblock, which references the updated version
of the filesystem, carries a sequence number and a checksum. It is
written to one of two alternating locations. On restart, both
locations are read and the highest numbered superblock with a correct
checksum is chosen as the new filesystem root.
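
To make the restart-time choice concrete, here is a minimal sketch; the
structure layout and the trivial additive checksum are invented for the
example, this is not from any particular filesystem:

#include <stddef.h>

struct sb {
	unsigned long long seq;		/* bumped on every commit */
	unsigned char data[496];	/* rest of the root block; size made up */
	unsigned long csum;		/* checksum of everything above */
};

static unsigned long sb_checksum(const struct sb *s)
{
	const unsigned char *p = (const unsigned char *)s;
	unsigned long sum = 0;
	size_t i;

	for (i = 0; i < offsetof(struct sb, csum); i++)
		sum += p[i];
	return sum;
}

/* read both locations, then pick the newest copy that checks out */
struct sb *pick_root(struct sb *a, struct sb *b)
{
	int va = (a->csum == sb_checksum(a));
	int vb = (b->csum == sb_checksum(b));

	if (va && vb)
		return a->seq > b->seq ? a : b;	/* both good: newer wins */
	if (va)
		return a;			/* the other was torn by the crash */
	if (vb)
		return b;
	return NULL;				/* both bad: no consistent root */
}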

--
Daniel

2001-07-15 01:20:13

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

Daniel Phillips wrote:
>
> On Saturday 14 July 2001 16:50, Chris Wedgwood wrote:
> > On Sat, Jul 14, 2001 at 09:45:44AM +0100, Alan Cox wrote:
> >
> > As far as I can tell none of them at least in the IDE world
> >
> > SCSI disk must, or at least some... if not, how to peopel like NetApp
> > get these cool HA certifications?
>
> Atomic commit. The superblock, which references the updated version
> of the filesystem, carries a sequence number and a checksum. It is
> written to one of two alternating locations. On restart, both
> locations are read and the highest numbered superblock with a correct
> checksum is chosen as the new filesystem root.

But this assumes that it is the most-recently-written sector/block
which gets lost in a power failure.

The disk will be reordering writes - so when it fails it may have
written the commit block but *not* the data which that block is
committing.

You need a barrier or a full synchronous flush prior to writing
the commit block. A `don't-reorder-past-me' barrier is very much
preferable, of course.

-

2001-07-15 01:50:21

by Daniel Phillips

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

On Sunday 15 July 2001 03:21, Andrew Morton wrote:
> Daniel Phillips wrote:
> > On Saturday 14 July 2001 16:50, Chris Wedgwood wrote:
> > > On Sat, Jul 14, 2001 at 09:45:44AM +0100, Alan Cox wrote:
> > >
> > > As far as I can tell none of them at least in the IDE world
> > >
> > > SCSI disk must, or at least some... if not, how to peopel like
> > > NetApp get these cool HA certifications?
> >
> > Atomic commit. The superblock, which references the updated
> > version of the filesystem, carries a sequence number and a
> > checksum. It is written to one of two alternating locations. On
> > restart, both locations are read and the highest numbered
> > superblock with a correct checksum is chosen as the new filesystem
> > root.
>
> But this assumes that it is the most-recently-written sector/block
> which gets lost in a power failure.
>
> The disk will be reordering writes - so when it fails it may have
> written the commit block but *not* the data which that block is
> committing.
>
> You need a barrier or a full synchronous flush prior to writing
> the commit block. A `don't-reorder-past-me' barrier is very much
> preferable, of course.

Oh yes, absolutely, that's very much part of the puzzle. Any disk
that doesn't support a real write barrier or write cache flush is
fundamentally broken as far as failsafe operation goes. A disk that
claims to provide such support and doesn't is an even worse offender.
I find Alan's comment there worrisome. We need to know which disks
deliver on this and which don't.

--
Daniel

2001-07-15 03:36:24

by Chris Wedgwood

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

On Sat, Jul 14, 2001 at 10:11:30PM +0200, Daniel Phillips wrote:

Atomic commit. The superblock, which references the updated
version of the filesystem, carries a sequence number and a
checksum. It is written to one of two alternating locations. On
restart, both locations are read and the highest numbered
superblock with a correct checksum is chosen as the new filesystem
root.

Yes... and whichever part of the superblock contains the sequence
number must be written atomically.

The point is, you _NEED_ to be sure that data written before the
superblock (or indeed anywhere further up the tree; in theory you can
make changes which don't require superblock updates) is written
firmly to the platters before anything which refers to it is updated.

Alan was saying that with IDE you cannot reliably do this; my point
was that I assume you can with SCSI.



--cw

2001-07-15 04:03:04

by Chris Wedgwood

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

On Sat, Jul 14, 2001 at 10:33:44AM -0700, Jonathan Lundell wrote:

What's worse, though the spec is not explicit on this point, it
appears that the write cache is lost on a SCSI reset, which is
typically used by drivers for last-resort error recovery. And of
course a SCSI bus reset affects all the drives on the bus, not
just the offending one.

Doesn't SCSI have a notion of write barriers?

Even if this is required, the above still works because for anything
requiring a barrier, you wait of a positive SYNCHRONIZE CACHE



--cw


2001-07-15 05:47:46

by Jonathan Lundell

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

At 4:02 PM +1200 2001-07-15, Chris Wedgwood wrote:
>On Sat, Jul 14, 2001 at 10:33:44AM -0700, Jonathan Lundell wrote:
>
> What's worse, though the spec is not explicit on this point, it
> appears that the write cache is lost on a SCSI reset, which is
> typically used by drivers for last-resort error recovery. And of
> course a SCSI bus reset affects all the drives on the bus, not
> just the offending one.
>
>Doesn't SCSI have a notion of write barriers?
>
>Even if this is required, the above still works because for anything
>requiring a barrier, you wait for a positive SYNCHRONIZE CACHE

Sure, if you keep all your write buffers around until then, so you
can re-write if the sync fails. And if you don't crash in the
meantime.
--
/Jonathan Lundell.

2001-07-15 06:07:00

by John Alvord

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write



On Sun, 15 Jul 2001, Chris Wedgwood wrote:

> On Sat, Jul 14, 2001 at 10:11:30PM +0200, Daniel Phillips wrote:
>
> Atomic commit. The superblock, which references the updated
> version of the filesystem, carries a sequence number and a
> checksum. It is written to one of two alternating locations. On
> restart, both locations are read and the highest numbered
> superblock with a correct checksum is chosen as the new filesystem
> root.
>
> Yes... and which ever part of the superblock contains the sequence
> number must be written atomically.
>
> The point is, you _NEED_ to be sure that data written before the
> superblock (or indeed anywhere further up the tree, you can make
> changes in theory which don't require super-block updates) are written
> firmly to the platters before any thing which refers to it is updated.
>
> Alan was saying with IDE you cannot reliably do this, I assume you can
> with SCSI was my point.

In the IBM solution to this (1977-78, VM/CMS) the critical data was
written at the beginning and the end of the block. If the two data items
didn't match then the block was rejected.

john alvord
>
>
>
> --cw
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2001-07-15 06:08:10

by Chris Wedgwood

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

On Sat, Jul 14, 2001 at 11:05:36PM -0700, John Alvord wrote:

In the IBM solution to this (1977-78, VM/CMS) the critical data was
written at the begining and the end of the block. If the two data items
didn't match then the block was rejected.

Neat.


Simple and effective. Presumably you can also checksum the block, and
check that.
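
Something like the following is presumably what's meant; layout and names
are invented for the example, and as Ken points out below, the head/tail
stamp alone isn't enough once the controller can reorder sectors inside
the block, which is where the checksum earns its keep:

#include <stddef.h>

#define BLK_SIZE 4096

struct stamped_block {
	unsigned long seq_head;		/* stamp at the front... */
	unsigned char payload[BLK_SIZE - 3 * sizeof(unsigned long)];
	unsigned long csum;		/* optional checksum over the payload */
	unsigned long seq_tail;		/* ...and the same stamp at the back */
};

static unsigned long payload_csum(const struct stamped_block *b)
{
	unsigned long sum = 0;
	size_t i;

	for (i = 0; i < sizeof(b->payload); i++)
		sum += b->payload[i];
	return sum;
}

/* reject the block if the write was torn (stamps differ) or garbled */
int block_ok(const struct stamped_block *b)
{
	return b->seq_head == b->seq_tail && b->csum == payload_csum(b);
}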



--cw

2001-07-15 13:15:14

by Ken Hirsch

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

Chris Wedgwood <[email protected]> wrote:
> On Sat, Jul 14, 2001 at 11:05:36PM -0700, John Alvord wrote:
>
> In the IBM solution to this (1977-78, VM/CMS) the critical data was
> written at the beginning and the end of the block. If the two data
> items didn't match then the block was rejected.
>
> Neat.
>
>
> Simple and effective. Presumably you can also checksum the block, and
> check that.

The first technique is not sufficient with modern disk controllers, which
may reorder sector writes within a block. A checksum, especially a robust
CRC32, is sufficient, but rather expensive.

Mohan has a clever technique that is computationally trivial and only uses
one bit per sector: http://www.almaden.ibm.com/u/mohan/ICDE95.pdf

Unfortunately, it's also patented:
http://www.delphion.com/details?pn=US05418940__

Perhaps IBM will clarify their position with respect to free software and
patents in the upcoming conference.



2001-07-15 13:41:23

by Daniel Phillips

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

On Sunday 15 July 2001 05:36, Chris Wedgwood wrote:
> On Sat, Jul 14, 2001 at 10:11:30PM +0200, Daniel Phillips wrote:
>
> Atomic commit. The superblock, which references the updated
> version of the filesystem, carries a sequence number and a
> checksum. It is written to one of two alternating locations. On
> restart, both locations are read and the highest numbered
> superblock with a correct checksum is chosen as the new
> filesystem root.
>
> Yes... and which ever part of the superblock contains the sequence
> number must be written atomically.

The only requirement here is that the checksum be correct. And sure,
that's not a hard guarantee because, on average, you will get a good
checksum for bad data once every 4 billion power events that mess up
the final superblock transfer. Let me see, if that happens once a year,
your data should still be good when the warrantee on the sun expires.
:-)

> The point is, you _NEED_ to be sure that data written before the
> superblock (or indeed anywhere further up the tree, you can make
> changes in theory which don't require super-block updates) are
> written firmly to the platters before any thing which refers to it is
> updated.

Since the updated tree is created non-destructively with respect to
the original tree, the only priority relationship that matters is the
requirement that all blocks of the updated tree be securely committed
before the new superblock is written.

> Alan was saying with IDE you cannot reliably do this, I assume you
> can with SCSI was my point.

Surely it can't be that *all* IDE disks can fail in that way? And it
seems the jury is still out on SCSI, I'm interested to see where that
discussion goes.

--
Daniel

2001-07-15 14:39:33

by Chris Wedgwood

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

On Sun, Jul 15, 2001 at 03:44:14PM +0200, Daniel Phillips wrote:

The only requirement here is that the checksum be correct. And
sure, that's not a hard guarantee because, on average, you will
get a good checksum for bad data once every 4 billion power events
that mess up the final superblock transfer. Let me see, if that
happens once a year, your data should still be good when the
warrantee on the sun expires. :-)

the sun will probably last a tad longer than that even continuing to
burn hydrogen; if you allow for helium burning, you will probably get
a few errors sneaking by

Surely it can't be that *all* IDE disks can fail in that way? And
it seems the jury is still out on SCSI, I'm interested to see
where that discussion goes.

Alan said *ALL* disks appear to lie, and I'm not going to argue with
him :)

I only have SCSI disks to test with, but they are hot-plug, so I guess
I can write a whole bunch of blocks with different numbers on them,
all over the disk, if I can figure out how to place SCSI barriers and
then pull the drive and see what gives?



--cw

2001-07-15 14:50:14

by Chris Wedgwood

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

On Sun, Jul 15, 2001 at 09:16:09AM -0400, Ken Hirsch wrote:

The first technique is not sufficient with modern disk
controllers, which may reorder sector writes within a block. A
checksum, especially a robust CRC32, is sufficient, but rather
expensive.

So you write the number to the start and end of each sector, or, you
only assume sector-wide 'block-sizes' for integrity.

A 32-bit CRC is plenty cheap on modern CPUs, especially
considering how often you actually need to calculate it.

Mohan has a clever technique that is computationally trivial and
only uses one bit per sector:
http://www.almaden.ibm.com/u/mohan/ICDE95.pdf

Unfortunately, it's also patented:
http://www.delphion.com/details?pn=US05418940__

Perhaps IBM will clarify their position with respect to free
software and patents in the upcoming conference.

Wow... pretty neat, but fortunately not necessary.



--cw

2001-07-15 15:07:41

by Jonathan Lundell

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

At 2:39 AM +1200 2001-07-16, Chris Wedgwood wrote:
>On Sun, Jul 15, 2001 at 03:44:14PM +0200, Daniel Phillips wrote:
>
> The only requirement here is that the checksum be correct. And
> sure, that's not a hard guarantee because, on average, you will
> get a good checksum for bad data once every 4 billion power events
> that mess up the final superblock transfer. Let me see, if that
> happens once a year, your data should still be good when the
> warrantee on the sun expires. :-)
>
>the sun will probably last a tad longer than that even contuing to
>burn hydrogen, if you allow for helium burning, you will probably get
>errors to sneak by
>
> Surely it can't be that *all* IDE disks can fail in that way? And
> it seems the jury is still out on SCSI, I'm interested to see
> where that discussion goes.
>
>Alan said *ALL* disks appear to lie, and I'm not going to argue with
>him :)
>
>I only have SCSI disks to test with, but they are hot-plug, so I guess
>I can write a whole bunch of blocks with different numbers on them,
>all over the disk, if I can figure out how to place SCSI barriers and
>then pull the drive and see what gives?

Consider the possibility (probability, I think) that SCSI drives blow
away their (unwritten) write cache buffers on a SCSI bus reset, and
that a SCSI bus reset is a routine, albeit last-resort, error
recovery technique. (It's also necessary; by the time a driver gets
to a bus reset, all else has failed. It's also, in my experience, not
especially rare.)

The fix for that particular problem--disabling write caching--is
simple enough, though it presumably has a performance consequence. A
second benefit of disabling write caching is that the drive can't
reorder writes (though of course the system still might).

At first glance, by the way, the only write barrier I see in the SCSI
command set is the synchronize-cache command, which completes only
after all the drive's dirty buffers are written out. Of course,
without write caching, it's not an issue.
--
/Jonathan Lundell.

2001-07-15 15:22:33

by Chris Wedgwood

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

On Sun, Jul 15, 2001 at 08:06:39AM -0700, Jonathan Lundell wrote:

At first glance, by the way, the only write barrier I see in the
SCSI command set is the synchronize-cache command, which completes
only after all the drive's dirty buffers are written out. Of
course, without write caching, it's not an issue.

Is the spec you have distributable? I believe some of the early drafts
were, but the final spec isn't.

I'd really like to check it out myself; I always assumed SCSI had the
smarts for write barriers and force-unit-access, but I guess I was
wrong.

Anyhow, I'd like to see the spec for myself if it is something I can
get hold of.



--cw

2001-07-15 15:33:34

by Chris Wedgwood

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

On Sun, Jul 15, 2001 at 04:32:59PM +0100, Alan Cox wrote:

Another way is to time

write block
write barrier
write same block
write barrier
repeat

If the write barrier is working you should be able to measure the
drive rpm 8)

Yeah, I was thinking of doing this with caches turned off, since I
know how to do that, but not a write-barrier.




--cs

2001-07-15 15:34:04

by Alan

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

> I only have SCSI disks to test with, but they are hot-plug, so I guess
> I can write a whole bunch of blocks with different numbers on them,
> all over the disk, if I can figure out how to place SCSI barriers and
> then pull the drive and see what gives?

Another way is to time

write block
write barrier
write same block
write barrier
repeat

If the write barrier is working you should be able to measure the drive rpm 8)
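
(Not Chris's write-bench.c, just a rough sketch of the same idea for
anyone who wants to play along; fsync() stands in for a real barrier,
and it scribbles over the first sector of whatever device you point it
at, so use a scratch disk.)

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char buf[512];
	int fd, i, n = 1000;
	time_t t0, t1;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <scratch block device>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_WRONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	memset(buf, 0xAA, sizeof(buf));

	time(&t0);
	for (i = 0; i < n; i++) {
		/* rewrite the same block over and over */
		if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf)) {
			perror("pwrite");
			return 1;
		}
		fsync(fd);	/* poor man's write barrier */
	}
	time(&t1);

	/* with the write cache off, writes/sec should sit near rpm/60 */
	printf("%d writes in %ld seconds\n", n, (long)(t1 - t0));
	close(fd);
	return 0;
}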

2001-07-15 16:24:41

by Chris Wedgwood

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

On Sun, Jul 15, 2001 at 04:32:59PM +0100, Alan Cox wrote:

Another way is to time

write block
write barrier
write same block
write barrier
repeat

If the write barrier is working you should be able to measure the
drive rpm 8)

OK, I just wrote this in order to test just that; run it on a raw device
and turn caching off if you can.

For my drives, I cannot disable caching (I don't know if it is on or
not) and I get abysmal speed, but nothing unrealistic.

Anyhow, I just wrote this and tested it a couple of times; if it
breaks or eats your disk, don't bitch at me.

Otherwise, flames and comments on my god-awful code are welcome.



--cw


Attachments:
write-bench.c (2.62 kB) -- More of Blondie's awful code

2001-07-15 17:11:02

by Chris Wedgwood

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

On Sat, Jul 14, 2001 at 09:45:44AM +0100, Alan Cox wrote:

As far as I can tell none of them at least in the IDE world

Can you test with the code I posted an hour or so ago please?

I ask this because I tested writes to:

-- buffered devices

-- ide with caching on

-- ide with caching off

-- scsi (caching on?)

To a buffered device, I get something silly like 63000
writes/second. No big surprises there (other than Linux is bloody lean
these days).

To a SCSI device (10K RPM SCSI-3 160 drive), I get something like 167
writes/second, which seems moderately sane if caching is disabled.

To a cheap IDE drive (5400 RPM?) with caching off, I get about 87
writes/second.

To the same drive, with caching on, I get almost 4000 writes/second.

This seems to imply, at least for my test IDE drive, that you can turn
caching off --- and it's about half as fast as my SCSI drives, which
rotate at about twice the speed (sanity check).

IDE drive: IBM-DTTA-351010, ATA DISK drive
SCSI drive: SEAGATE ST318404LC




--cw


2001-07-15 17:45:11

by Jonathan Lundell

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

At 5:10 AM +1200 2001-07-16, Chris Wedgwood wrote:
>On Sat, Jul 14, 2001 at 09:45:44AM +0100, Alan Cox wrote:
>
> As far as I can tell none of them at least in the IDE world
>
>Can you test with the code I posted a hour or so ago please?

AC's comment was about whether the drive's cache would be written out
on power failure, which is another issue, a little harder to test
(and not easily testable by writing a single sector). I raise the
related question of what happens to the write cache on a bus reset on
SCSI drives.

>I ask this because I tested writes to:
>
> -- buffered devices
>
> -- ide with caching on
>
> -- ide with caching off
>
> -- scsi (caching on?)
>
>To a buffered device, I get something silly like 63000
>writes/second. No big surprises there (other than Linux is bloody lean
>these days).
>
>To a SCSI device (10K RPM SCSI-3 160 drive), I get something like 167
>writes/second, which seems moderately sane if caching is disabled.

My impression, based on a little but not much research, is that most
SCSI drives disable write caching by default. IBM SCSI drives may be
an exception to this.

>To a cheap IDE drive (5400 RPM?) with caching off, I get about 87
>writes/second.
>
>To the same drive, with caching on, I get almost 4000 writes/second.
>
>This seems to imply, at least for my test IDE drive, you can turn
>caching off --- and its about half as fast as my SCSI drives which
>rotate at about twice the speed (sanity check).
>
>IDE drive: IBM-DTTA-351010, ATA DISK drive
>SCSI drive: SEAGATE ST318404LC


--
/Jonathan Lundell.

2001-07-15 17:45:11

by Jonathan Lundell

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

At 3:22 AM +1200 2001-07-16, Chris Wedgwood wrote:
>On Sun, Jul 15, 2001 at 08:06:39AM -0700, Jonathan Lundell wrote:
>
> At first glance, by the way, the only write barrier I see in the
> SCSI command set is the synchronize-cache command, which completes
> only after all the drive's dirty buffers are written out. Of
> course, without write caching, it's not an issue.
>
>Is the spec you have distributable? I believe some of the early drafts
>were, but the final spec isn't.
>
>I'd really like to check it out myself, I alwasy assumed SCSI had the
>smarts for write-barriers and force-unit-access but I guess I was
>wrong.
>
>Anyhow, I'd like to see the spec for myself if it is something I can
>get hold of.

I was referring to IBM's spec, as implemented in their recent SCSI
and FC drives. You can find a copy at
http://www.storage.ibm.com/techsup/hddtech/prodspec/ddyf_spi.pdf

WRITE EXTENDED has a bit (FUA) that will let you force that
particular write to go to disk immediately, independent of write
caching, but there's no suggestion that it otherwise acts as a write
barrier for cached writes.

WRITE VERIFY implies a CACHE SYNCHRONIZE, so it's a write barrier,
but an expensive (because synchronous) one.
--
/Jonathan Lundell.

2001-07-15 17:47:51

by Justin T. Gibbs

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

>Consider the possibility (probability, I think) that SCSI drives blow
>away their (unwritten) write cache buffers on a SCSI bus reset, and
>that a SCSI bus reset is a routine, albeit last-resort, error
>recovery technique. (It's also necessary; by the time a driver gets
>to a bus reset, all else has failed. It's also, in my experience, not
>especially rare.)

I have never seen this to be the case. The SCSI spec is quite clear
in stating that a bus reset only affects "I/O processes that have not
completed, SCSI device reservations, and SCSI device operating modes".
The soft reset section clarifies the meaning of "completed commands"
as:
e) An initiator shall consider an I/O process to be completed
when it negates ACK for a successfully received COMMAND
COMPLETE message.
f) A target shall consider an I/O process to be completed when
it detects the transition of ACK to false for the COMMAND
COMPLETE message with the ATN signal false.

As the soft reset section also specifies how to deal with initiators
that are not expecting soft reset semantics, I believe this applies to
either reset model.

If we look at the section on caching for direct access devices we see,
"[write-back cached] data may be lost if power to the device is lost or
a hardware failure occurs". There is no mention of a bus reset having
any effect on commands already acked as completed to the initiator.

>The fix for that particular problem--disabling write caching--is
>simple enough, though it presumably has a performance consequence. A
>second benefit of disabling write caching is that the drive can't
>reorder writes (though of course the system still might).

Simply disabling the write cache does not guarantee the order of writes.
For one, with tagged I/O and the use of the SIMPLE_Q tag qualifier,
commands may be completed in any order. If you want some semblance of
order, either disable the write cache or use the FUA bit in all writes,
and use the ORDERED tag qualifier. Even when using these options,
it is not clear that the drive cannot reorder writes "slightly" to
make track writes more efficient (e.g. two separate commands to write
sequential sectors on the same track may be written in reverse order).

>At first glance, by the way, the only write barrier I see in the SCSI
>command set is the synchronize-cache command, which completes only
>after all the drive's dirty buffers are written out. Of course,
>without write caching, it's not an issue.

The ordered tag qualifier gives you barrier semantics with the caveats
listed above.
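
(For reference, the tag qualifier is just the first byte of the two-byte
queue tag message sent with the command. A small sketch with the SCSI-2
message codes - the helper itself is illustrative, not from any driver:)

#include <stdint.h>

#define SIMPLE_QUEUE_TAG    0x20   /* drive may reorder freely */
#define HEAD_OF_QUEUE_TAG   0x21   /* run ahead of queued commands */
#define ORDERED_QUEUE_TAG   0x22   /* complete in the order issued */

/* Sketch: fill in the two-byte queue tag message for a tagged command. */
static void build_tag_msg(uint8_t msg[2], uint8_t qualifier, uint8_t tag)
{
        msg[0] = qualifier;   /* SIMPLE, HEAD OF QUEUE, or ORDERED */
        msg[1] = tag;         /* tag number identifying this I/O process */
}
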

--
Justin

2001-07-15 22:10:52

by Daniel Phillips

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

On Sunday 15 July 2001 15:16, Ken Hirsch wrote:
> Chris Wedgwood <[email protected]> wrote:
> > On Sat, Jul 14, 2001 at 11:05:36PM -0700, John Alvord wrote:
> > >
> > > In the IBM solution to this (1977-78, VM/CMS) the critical data
> > > was written at the begining and the end of the block. If the two
> > > data items didn't match then the block was rejected.
> >
> > Neat.
> >
> > Simple and effective. Presumably you can also checksum the block,
> > and check that.
>
> The first technique is not sufficient with modern disk controllers,
> which may reorder sector writes within a block. A checksum,
> especially a robust CRC32, is sufficient, but rather expensive.

As somebody else pointed out, not if you don't have to compute it on
every block, as with journalling or atomic commit.

> Mohan has a clever technique that is computationally trivial and only
> uses one bit per sector:
> http://www.almaden.ibm.com/u/mohan/ICDE95.pdf
>
> Unfortunately, it's also patented:
> http://www.delphion.com/details?pn=US05418940__

Fortunately, it's clunky and unappealing compared to the simple
checksum method, applied only to those blocks that define consistency
points. I don't think this is patented. I'd be disturbed if it was,
since it's obvious.
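
(For the record, the checksum in question is nothing exotic - if it is
only computed over blocks that define consistency points, even a plain
bitwise CRC32 is cheap enough. A sketch using the standard IEEE
polynomial; illustration only, not code from any filesystem:)

#include <stdint.h>
#include <stddef.h>

/* Sketch: bitwise CRC32 (IEEE polynomial, reflected) over one block.
 * A table-driven version would be used in practice. */
static uint32_t block_crc32(const void *data, size_t len)
{
        const uint8_t *p = data;
        uint32_t crc = 0xFFFFFFFF;
        int i;

        while (len--) {
                crc ^= *p++;
                for (i = 0; i < 8; i++)
                        crc = (crc >> 1) ^ (0xEDB88320 & -(crc & 1));
        }
        return crc ^ 0xFFFFFFFF;
}
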

> Perhaps IBM will clarify their position with respect to free software
> and patents in the upcoming conference.

Wouldn't that be nice. Imagine, IBM comes out and says, we admit it,
patents are a net burden on everybody, even us - from now on, we use
them only against those who use them against us, and we'll put that
in writing. Right.

--
Daniel

2001-07-15 23:01:44

by Rod Van Meter

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

I don't have the SCSI spec in front of me (though, as noted, some
drafts are available online; try t10.org somewhere), but as I
understand it (having worked, briefly, for a major disk manufacturer):

You can commit an individual write with the FUA (force unit access)
bit. The command for this is not WRITE EXTENDED, but WRITE(10) or
WRITE(12). I don't think WRITE(6) has room for the bit, and WRITE(6)
is useless nowadays, anyway. WRITE EXTENDED lets you write over the
ECC bits -- it's a raw write to the platter. Dunno that anyone
implements it any more.

That does NOT get you ordering with respect to other commands. You
can use the complex tagging stuff to get that, but most disk drives
didn't implement it properly in the SCSI-2 days, and there are
significant differences in SCSI-3.

Otherwise, your choice, as noted, is SYNCHRONIZE CACHE before the root
block write, and after. AFAIK, all drives treat that the way it's
meant to be done; everything's on platter when you get a COMMAND
COMPLETE back from it, but they weren't necessarily done in order.

Even within a command, I don't believe there is a guarantee that the
blocks will go to platter in order. Say you write blocks 0-7; the
drive will start the transfer to buffer immediately, as the seek is
begun. When the seek completes, the write gate will enable writes
from buffer to platter, and a state machine takes care of that.
However, the seek and settle may complete when the head is over block
3, so the first write to platter would be block 4, then 5-7. This is
followed by almost an entire revolution's delay(*see note) to get back
to block 0, and 3 will be the last block written.

I have had this exact conversation with disk drive folks (of which I
am not one), but I haven't seen the firmware and state machines
myself, so treat this as an educated guess. The folks I was talking
to may have been wrong, or more likely, misunderstood what I was
asking.

Some manufacturers can put either IDE or SCSI on a drive, and this
behavior is likely to be the same on both. It may not apply to all
members of a family, and probably doesn't apply across families from
the same manufacturer.

Most disk drives, as recently as two years ago, were a lot dumber than
you think, and I doubt the situation has improved much. For the most
part, disk manufacturers get paid for capacity, not smarts, but
there's an entire year-long argument there.

--Rod

* Note: In theory, that rotational delay doesn't have to be idle. I
believe any blocks between 7 and 0 that are also in cache will be
written as the head passes over them. Thus, the drive might
literally interleave writes from multiple commands. It's also
possible, in theory, to switch tracks for a short time and come back
to the first track before block 0 rolls around, but I don't believe
existing controllers are that sophisticated.

P.S. I gotta put in another plug here -- you have until Friday to
write this behavior up and submit it as a paper to USENIX FAST --
Conference on File and Storage Technology. See
http://www.usenix.org/events/fast/

2001-07-16 00:37:40

by Jonathan Lundell

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

At 4:14 PM -0700 2001-07-15, Rod Van Meter wrote:
>You can commit an individual write with the FUA (force unit access)
>bit. The command for this is not WRITE EXTENDED, but WRITE(10) or
>WRITE(12). I don't think WRITE(6) has room for the bit, and WRITE(6)
>is useless nowadays, anyway. WRITE EXTENDED lets you write over the
>ECC bits -- it's a raw write to the platter. Dunno that anyone
>implements it any more.

WRITE EXTENDED is WRITE(10), I believe. The ECC-writing version is
WRITE LONG; IBM (at least) implements it.

At 11:47 AM -0600 2001-07-15, Justin T. Gibbs wrote:
>As the soft reset section also specifies how to deal with initiators
>that are not expecting soft reset semantics, I believe this applies to
>either reset model.
>
>If we look at the section on caching for direct access devices we see,
>"[write-back cached] data may be lost if power to the device is lost or
>a hardware failure occurs". There is no mention of a bus reset having
>any effect on commands already acked as completed to the initiator.

I'd very much like to think so; thanks for the reference. I'd feel a
little more sanguine about the subject if there were some explicit
guarantee of the desired behavior, either in the SCSI spec or in an
implementer's functional spec. Nonetheless, it's testable behavior,
and it's a reasonable inference that drives should behave correctly.
Thanks again.
--
/Jonathan Lundell.

2001-07-16 01:08:58

by Albert D. Cahalan

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

Daniel Phillips writes:
> On Sunday 15 July 2001 05:36, Chris Wedgwood wrote:
>> On Sat, Jul 14, 2001 at 10:11:30PM +0200, Daniel Phillips wrote:

>>> Atomic commit. The superblock, which references the updated
>>> version of the filesystem, carries a sequence number and a
>>> checksum. It is written to one of two alternating locations. On
>>> restart, both locations are read and the highest numbered
>>> superblock with a correct checksum is chosen as the new
>>> filesystem root.
>>
>> Yes... and whichever part of the superblock contains the sequence
>> number must be written atomically.
>
> The only requirement here is that the checksum be correct. And sure,
> that's not a hard guarantee because, on average, you will get a good
> checksum for bad data once every 4 billion power events that mess up
> the final superblock transfer. Let me see, if that happens once a year,

In a tree-structured filesystem, checksums on everything would only
cost you space similar to the number of pointers you have. Whenever
a non-leaf node points to a child, it can hold a checksum for that
child as well.

This gives a very reliable way to spot filesystem errors, including
corrupt data blocks.
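
(A minimal sketch of what such a child reference might look like on
disk - the names and layout are hypothetical, not from any existing
filesystem:)

#include <stdint.h>

/* Sketch: each child reference pairs the block number with a checksum
 * of the child's contents, so a parent can validate a child before
 * trusting anything it points to.  Hypothetical layout. */
struct child_ref {
        uint64_t block;      /* on-disk block number of the child */
        uint32_t csum;       /* checksum over the child block */
        uint32_t reserved;   /* padding / future use */
};
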

2001-07-16 08:49:44

by Chris Wedgwood

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

On Sun, Jul 15, 2001 at 09:08:41PM -0400, Albert D. Cahalan wrote:

In a tree-structured filesystem, checksums on everything would
only cost you space similar to the number of pointers you
have. Whenever a non-leaf node points to a child, it can hold a
checksum for that child as well.

This gives a very reliable way to spot filesystem errors,
including corrupt data blocks.

Actually, this is a really nice concept... have additional checksums
and such floating about. When filesystems get to several terabytes, it
would allow background consistency checking (as checking on boot would
be far too slow).

It would also allow the fs layer to fsck the filesystem _as_ data is
accessed if need be, which would more often be the case.



--cw

2001-07-16 08:56:44

by Chris Wedgwood

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

On Sun, Jul 15, 2001 at 11:47:10AM -0600, Justin T. Gibbs wrote:

Simply disabling the write cache does not guarantee the order of
writes. For one, with tagged I/O and the use of the SIMPLE_Q tag
qualifier, commands may be completed in any order. If you want
some semblance of order, either disable the write cache or use the
FUA bit in all writes, and use the ORDERED tag qualifier. Even
when using these options, it is not clear that the drive cannot
reorder writes "slightly" to make track writes more efficient
(e.g. two separate commands to write sequential sectors on the
same track may be written in reverse order).

ORDERED sounds like the trick... I assume this is some kind of
write-barrier? If so, then I assume it has some kind of strict
temporal ordering, even between command issues to the drive.

If so, that would be ideal if we can have the fs communicate this all
the way down to the device layer; making it work for soft-raid and LVM
may be a little harder, perhaps.



--cw

2001-07-16 13:16:02

by Daniel Phillips

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

On Monday 16 July 2001 10:56, Chris Wedgwood wrote:
> On Sun, Jul 15, 2001 at 11:47:10AM -0600, Justin T. Gibbs wrote:
>
> Simply disabling the write cache does not guarantee the order of
> writes. For one, with tagged I/O and the use of the SIMPLE_Q tag
> qualifier, commands may be completed in any order. If you want
> some semblance of order, either disable the write cache or use
> the FUA bit in all writes, and use the ORDERED tag qualifier. Even
> when using these options, it is not clear that the drive cannot
> reorder writes "slightly" to make track writes more efficient (e.g.
> two separate commands to write sequential sectors on the same track
> may be written in reverse order).
>
> ORDERED sounds like the trick... I assume this is some kind of
> write-barrier? If so, then I assume it has some kind of strict
> temporal ordering, even between command issues to the drive.
>
> If so, that would be ideal if we can have the fs communicate this all
> the way down to the device layer; making it work for soft-raid and
> LVM may be a little harder, perhaps.

There was general agreement amongst filesystem developers at San Jose
that we need some kind of internal interface at the filesystem level
for this, independent of the type of underlying block device - IDE,
SCSI or "other". That's as far as it got.

--
Daniel

2001-07-16 14:59:25

by Rod Van Meter

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

At 4:14 PM -0700 2001-07-15, Rod Van Meter wrote:
> >You can commit an individual write with the FUA (force unit access)
> >bit. The command for this is not WRITE EXTENDED, but WRITE(10) or
> >WRITE(12). I don't think WRITE(6) has room for the bit, and WRITE(6)
> >is useless nowadays, anyway. WRITE EXTENDED lets you write over the
> >ECC bits -- it's a raw write to the platter. Dunno that anyone
> >implements it any more.
>
> WRITE EXTENDED is WRITE(10), I believe. The ECC-writing version is
> WRITE LONG; IBM (at least) implements it.
>

Whoops, you're right! Brain fart. We never used the term WRITE
EXTENDED; we always just called it WRITE(10) or WRITE(12).

--Rod

2001-07-16 19:00:26

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

Paul Jamka writes:
> On Fri, 13 Jul 2001, Andreas Dilger wrote:
> > put your journal on NVRAM, you will have blazing synchronous I/O.
>
> so ext3 supports having the journal somewhere else then. question: can
> the journal be on tmpfs?

There are patches for this (2.2 only, not 2.4) but it is not in the core
ext3 code yet. The ext3 design and on-disk layout allow for it (and the
e2fsprogs have the basic support for it), so it will not be a major change
to start using external devices for journals.

If you are keen to do performance testing (on a temporary filesystem, for
sure), you can hack around the current lack of ext3 support for journal
devices by doing the following (works for reiserfs also) with LVM:

1) create a PV on NVRAM/SSD/ramdisk (needs hacks to LVM code to support the
NVRAM device, or you can loopback mount the device ;-) It should be big
enough to hold the entire journal + a bit of overhead*
2) create a VG and LV on the ramdisk
3) create a PV on a regular disk, add it to the above VG
4) extend the LV with the new PV space**
5) create a 4kB blocksize ext2 filesystem on this LV
6) use "dumpe2fs <LV NAME>" to find the free blocks count in the first group
7) use "tune2fs -J size=<blocks * blocksize> <LV name>" to create the
journal, where "blocks" <= number of free blocks in first group and
also <= (number of blocks on NVRAM device - overhead*)

You _should_ have the journal on NVRAM now, along with the superblock and
all of the metadata for the first group. This will also improve performance
as the superblock and group descriptor tables are hot spots as well.

Of course, once support for external journal devices is added to ext3, it
will simply be a matter of doing "tune2fs -J device=<NVRAM device>".

Cheers, Andreas
---------------
*) For ext3, you need enough extra space for the superblock, group descriptors,
one block and inode bitmap, the first inode table, (and lost+found if
you don't want to do extra work deleting lost+found before creating the
journal, and re-creating it afterwards). The output from "dumpe2fs"
will tell you the number of inode blocks and group descriptor blocks.
For reiserfs it is hard to tell exactly where the file will go, but if
you had, say, a 64MB NVRAM device and a new filesystem, you could expect
the journal to be put entirely on the NVRAM device.

**) The LV will have the NVRAM device as the first Logical Extent, so this
will also be logically the first part of the filesystem. The PEs added
to the LV will be appended to the NVRAM device.
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-07-16 19:13:46

by Ragnar Kjørstad

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

> Cheers, Andreas
> ---------------
> *) For ext3, you need enough extra space for the superblock, group descriptors,
> one block and inode bitmap, the first inode table, (and lost+found if
> you don't want to do extra work deleting lost+found before creating the
> journal, and re-creating it afterwards). The output from "dumpe2fs"
> will tell you the number of inode blocks and group descriptor blocks.
> For reiserfs it is hard to tell exactly where the file will go, but if
> you had, say, a 64MB NVRAM device and a new filesystem, you could expect
> the journal to be put entirely on the NVRAM device.

You can use the LVM tools to see which extents are written most often
- I'm sure that after having used the filesystem a little bit it will be
clear which extents hold the journal (and then you can move them to
NVRAM).

For reiserfs, I believe you can now specify a separate device for your
journal and don't need LVM. Not sure if this code has entered the kernel
yet though - maybe you need a patch.


When doing your testing, you should be aware that the results will be
very much dependent on the device you use for the filesystem. For one
thing, if you use a slow IDE drive, then the NVRAM/disk performance
ratio will be higher than if you used a fast SCSI drive. But more
importantly, if you use a high-end RAID, it will include NVRAM of its
own. So if you really want to know whether separate NVRAM makes sense
for your high-end server - don't test this on a regular disk and assume
the results will be the same.




--
Ragnar Kjorstad
Big Storage

2001-07-18 08:57:55

by Juan Quintela

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

>>>>> "chris" == Chris Wedgwood <[email protected]> writes:

chris> On Sat, Jul 14, 2001 at 11:05:36PM -0700, John Alvord wrote:
chris> In the IBM solution to this (1977-78, VM/CMS) the critical data was
chris> written at the beginning and the end of the block. If the two data items
chris> didn't match then the block was rejected.

chris> Neat.


chris> Simple and effective. Presumably you can also checksum the block, and
chris> check that.

There is a rumor (I can't confirm it) that you need checksums because
some disks manage to write the beginning & the end of the sector
correctly but put garbage in the middle when something goes wrong. I
have never been able to reproduce such errors, but ....

Later, Juan.


--
In theory, practice and theory are the same, but in practice they
are different -- Larry McVoy

2001-07-21 00:06:43

by Stephen C. Tweedie

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

Hi,

On Sat, Jul 14, 2001 at 04:42:04PM +0100, Paul Jakma wrote:
>
> > *why* would you want to to do this?
>
> :)
>
> to test performance advantage of journal on RAM before going to spend
> money on NVRAM...

Journaling to ramdisk has been tried, yes. The result was faster than
ext2 doing the same jobs. Of course, the support for journal to
external devices is still only really at prototype stage.

Cheers,
Stephen

2001-07-21 19:20:08

by Alexander Griesser

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

On Sun, Jul 15, 2001 at 09:08:41PM -0400, you wrote:
> > The only requirement here is that the checksum be correct. And sure,
> > that's not a hard guarantee because, on average, you will get a good
> > checksum for bad data once every 4 billion power events that mess up
> > the final superblock transfer. Let me see, if that happens once a year,
> In a tree-structured filesystem, checksums on everything would only
> cost you space similar to the number of pointers you have. Whenever
> a non-leaf node points to a child, it can hold a checksum for that
> child as well.
> This gives a very reliable way to spot filesystem errors, including
> corrupt data blocks.

Hmm, maybe this is crap, but:
If the checksum calculation for one node fails, wouldn't that mean that
the data in this node is not to be trusted? Therefore the checksums in
this node could also be corrupted, and so the node two hops away can't
be validated with 100% certainty...

regards, alexx
--
| .-. | Alexander Griesser <[email protected]> -=- ICQ:63180135 | .''`. |
| /v\ | http://www.tuxx-home.at -=- Linux Version 2.4.7 | : :' : |
| /( )\ | FAQ zu at.linux: http://alfie.ist.org/LinuxFAQ | `. `' |
| ^^ ^^ `---------------------------------------------------? `- |

2001-07-22 03:52:42

by Albert D. Cahalan

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

Alexander Griesser writes:
> On Sun, Jul 15, 2001 at 09:08:41PM -0400, you wrote:

>> In a tree-structured filesystem, checksums on everything would only
>> cost you space similar to the number of pointers you have. Whenever
>> a non-leaf node points to a child, it can hold a checksum for that
>> child as well. This gives a very reliable way to spot filesystem
>> errors, including corrupt data blocks.
>
> Hmm, maybe this is crap, but: If the checksum calculation for one
> node fails, wouldn't that mean that the data in this node is not
> to be trusted? Therefore the checksums in this node could also be
> corrupted, and so the node two hops away can't be validated with
> 100% certainty...

If I understand you right ("one"? "this"?), yes and we want that.

Node 1 has children 2, 3, and 4.
Node 3 has children 5, 6, and 7.
Node 6 has children 8, 9, and 10. (children might be data blocks)

To have a child is to have a checksum+pointer pair.

If node 3 contains a corrupt pointer to node 6, then it is unlikely
that the checksum will match. So node 6 is bad, along with 8, 9, and 10.
(actually we might not be able to know that 8, 9, and 10 exist)
This result is wonderful, since it prevents interpreting random
disk blocks as useful data.

If node 3 contains a corrupt checksum for node 6, same thing. Damn.
This case should be rare, though: why would node 1 hold a checksum
that is OK for node 3 if node 3 has corruption?

If node 6 itself is corrupt, same thing. Good, we are stopped from
using bad data.
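
(The lookup rule above, as a small sketch - read_block() and checksum()
are hypothetical helpers, not real kernel interfaces:)

#include <stdint.h>

struct node;
extern struct node *read_block(uint64_t block);
extern uint32_t checksum(const struct node *n);

/* Sketch: a child is only trusted if its contents match the checksum
 * stored next to the pointer in its parent. */
static struct node *read_child(uint64_t block, uint32_t expected_csum)
{
        struct node *child = read_block(block);

        if (child == NULL || checksum(child) != expected_csum) {
                /* Pointer, checksum or data is corrupt: stop here
                 * rather than follow garbage. */
                return NULL;
        }
        return child;
}
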

2001-07-23 14:37:12

by Daniel Phillips

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

On Sunday 22 July 2001 05:52, Albert D. Cahalan wrote:
> Alexander Griesser writes:
> > On Sun, Jul 15, 2001 at 09:08:41PM -0400, you wrote:
> >> In a tree-structured filesystem, checksums on everything would
> >> only cost you space similar to the number of pointers you have.
> >> Whenever a non-leaf node points to a child, it can hold a checksum
> >> for that child as well. This gives a very reliable way to spot
> >> filesystem errors, including corrupt data blocks.
> >
> > Hmm, maybe this is crap, but: If the checksum calculation for one
> > node fails, wouldn't that mean that the data in this node is not
> > to be trusted? Therefore the checksums in this node could also be
> > corrupted, and so the node two hops away can't be validated with
> > 100% certainty...
>
> If I understand you right ("one"? "this"?), yes and we want that.
>
> Node 1 has children 2, 3, and 4.
> Node 3 has children 5, 6, and 7.
> Node 6 has children 8, 9, and 10. (children might be data blocks)
>
> To have a child is to have a checksum+pointer pair.
>
> If node 3 contains a corrupt pointer to node 6, then it is unlikely
> that the checksum will match. So node 6 is bad, along with 8, 9, and 10.
> (actually we might not be able to know that 8, 9, and 10 exist)
> This result is wonderful, since it prevents interpreting random
> disk blocks as useful data.
>
> If node 3 contains a corrupt checksum for node 6, same thing. Damn.
> This case should be rare, though: why would node 1 hold a checksum
> that is OK for node 3 if node 3 has corruption?
>
> If node 6 itself is corrupt, same thing. Good, we are stopped from
> using bad data.

I agree that your suggestion will work and that doubling the size of
the metadata isn't an enormous cost, especially if you'd already
compressed it using extents. On the other hand, sometimes I just feel
like trusting the hardware a little. Both atomic-commit and
journalling strategies take care of normal failure modes, and the disk
hardware is supposed to flag other failures by ecc'ing each sector on
disk.

--
Daniel

2001-07-24 04:29:58

by Albert D. Cahalan

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

Daniel Phillips writes:
> On Sunday 22 July 2001 05:52, Albert D. Cahalan wrote:
>> [...]
>>> On Sun, Jul 15, 2001 at 09:08:41PM -0400, you wrote:

>>>> In a tree-structured filesystem, checksums on everything would
>>>> only cost you space similar to the number of pointers you have.
>>>> Whenever a non-leaf node points to a child, it can hold a checksum
>>>> for that child as well. This gives a very reliable way to spot
>>>> filesystem errors, including corrupt data blocks.
...
>> To have a child is to have a checksum+pointer pair.
...
> I agree that your suggestion will work and that doubling the size
> of the metadata isn't an enormous cost, especially if you'd already
> compressed it using extents. On the other hand, sometimes I just
> feel like trusting the hardware a little. Both atomic-commit and
> journalling strategies take care of normal failure modes, and the
> disk hardware is supposed to flag other failures by ecc'ing each
> sector on disk.

Maybe you should discuss power-loss behavior with Theodore Ts'o.
For whatever reason, it seems that many drives and/or controllers
like to scribble on random unrelated sectors as power is lost.

For the atomic-commit case, an additional defense against this
sort of problem might be to keep a few extra trees on disk,
using a generation counter to pick the latest one. This does
bring us back to scanning the whole filesystem at boot though,
in order to disregard snapshots that have been damaged.
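
(A sketch of the selection rule - a handful of candidate root blocks,
and the highest generation whose checksum still validates wins. Names
and helpers are hypothetical:)

#include <stdint.h>
#include <stddef.h>

struct tree_root {
        uint64_t generation;   /* bumped on every commit */
        uint32_t csum;         /* checksum over the root block */
        /* ... rest of the root block ... */
};

extern int root_csum_ok(const struct tree_root *r);

/* Sketch: pick the newest tree root that still validates; returns NULL
 * if every candidate is damaged. */
static const struct tree_root *pick_root(const struct tree_root *cand,
                                         size_t n)
{
        const struct tree_root *best = NULL;
        size_t i;

        for (i = 0; i < n; i++)
                if (root_csum_ok(&cand[i]) &&
                    (best == NULL || cand[i].generation > best->generation))
                        best = &cand[i];
        return best;
}
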


2001-07-24 13:57:05

by Daniel Phillips

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

On Tuesday 24 July 2001 06:29, Albert D. Cahalan wrote:
> Daniel Phillips writes:
> > On Sunday 22 July 2001 05:52, Albert D. Cahalan wrote:
> >> [...]
> >>> On Sun, Jul 15, 2001 at 09:08:41PM -0400, you wrote:
> >>>> In a tree-structured filesystem, checksums on everything would
> >>>> only cost you space similar to the number of pointers you have.
> >>>> Whenever a non-leaf node points to a child, it can hold a
> >>>> checksum for that child as well. This gives a very reliable way
> >>>> to spot filesystem errors, including corrupt data blocks.
> ...
> >> To have a child is to have a checksum+pointer pair.
> ...
> > I agree that your suggestion will work and that doubling the size
> > of the metadata isn't an enormous cost, especially if you'd already
> > compressed it using extents. On the other hand, sometimes I just
> > feel like trusting the hardware a little. Both atomic-commit and
> > journalling strategies take care of normal failure modes, and the
> > disk hardware is supposed to flag other failures by ecc'ing each
> > sector on disk.
>
> Maybe you should discuss power-loss behavior with Theodore T'so.
> For whatever reason, it seems that many drives and/or controllers
> like to scribble on random unrelated sectors as power is lost.

Last time we discussed this on lkml - I don't think Ted was involved
that time - the consensus was that only the last sector written is
in danger of being scribbled on. (Sometimes, because of reordering,
we don't know which sector was written last, but that's another story.) If
you have experience with any disk that scribbled on a sector other
than the last written, I'd really appreciate knowing the model and
manufacturer - so that I can stay far away from such a POS.

As for silently feeding you corrupted sectors - that's clearly a
firmware bug, or outright omission. Again, the term POS applies.

> For the atomic-commit case, an additional defense against this
> sort of problem might be to keep a few extra trees on disk,
> using a generation counter to pick the latest one. This does
> bring us back to scanning the whole filesystem at boot though,
> in order to disregard snapshots that have been damaged.

Unfortunately, most of the blocks are shared between trees so this
doesn't provide any extra protection. RAID, or some RAID-like
thing (a little birdie told me that something may be in the works)
is probably the way to go, for dealing with substandard hardware
that you can't avoid using or weren't warned about.

--
Daniel

2001-07-26 02:18:31

by Ragnar Kjørstad

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

On Tue, Jul 03, 2001 at 10:19:36PM -0400, Ben LaHaise wrote:
> Here's the [completely untested] generic scsi fixup, but I'm told that
> some controllers will break with it. Give it a whirl and let me know how
> many pieces you're left holding. =) Please note that msdos partitions do
> *not* work on devices larger than 2TB, so you'll have to use the scsi disk
> directly. This patch applies on top of v2.4.6-pre8-largeblock4.diff.

I just tried this, but now I can't load the md modules because of
missing symbols for __udivdi3 and __umoddi3.

These are the messages from make install:
find kernel -path '*/pcmcia/*' -name '*.o' | xargs -i -r ln -sf ../{}
pcmcia
if [ -r System.map ]; then /sbin/depmod -ae -F System.map 2.4.6-pre8;
fi
depmod: *** Unresolved symbols in
/lib/modules/2.4.6-pre8/kernel/drivers/md/linear.o
depmod: __udivdi3
depmod: __umoddi3
depmod: *** Unresolved symbols in
/lib/modules/2.4.6-pre8/kernel/drivers/md/lvm-mod.o
depmod: __udivdi3
depmod: __umoddi3
depmod: *** Unresolved symbols in
/lib/modules/2.4.6-pre8/kernel/drivers/md/md.o
depmod: __udivdi3
depmod: *** Unresolved symbols in
/lib/modules/2.4.6-pre8/kernel/drivers/md/raid0.o
depmod: __udivdi3
depmod: __umoddi3
depmod: *** Unresolved symbols in
/lib/modules/2.4.6-pre8/kernel/drivers/md/raid5.o
depmod: __udivdi3
depmod: __umoddi3


Did you forget something in your patch, or was it not supposed to work
on ia32?

This is kind of urgent, because I will temporarily be without testing
equipment pretty soon. Tips are appreciated!



--
Ragnar Kjorstad
Big Storage

2001-07-26 16:25:35

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH] 64 bit scsi read/write

Ragnar writes:
> These are the messages from make install:
> depmod: *** Unresolved symbols in
> /lib/modules/2.4.6-pre8/kernel/drivers/md/lvm-mod.o
> /lib/modules/2.4.6-pre8/kernel/drivers/md/linear.o
> /lib/modules/2.4.6-pre8/kernel/drivers/md/md.o
> /lib/modules/2.4.6-pre8/kernel/drivers/md/raid0.o
> /lib/modules/2.4.6-pre8/kernel/drivers/md/raid5.o
> depmod: __udivdi3
> depmod: __umoddi3

These drivers do division and/or modulus on block numbers, instead of
doing shift/mask. Someone has to fix them.
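
(For illustration, the shape of the fix where the divisor is a power of
two, so no __udivdi3/__umoddi3 call is generated for 64-bit block
numbers on ia32 - names are made up, this is not the actual md code:)

#include <stdint.h>

/* Sketch: replace 64-bit divide/modulus by a chunk size that is a
 * power of two with shift/mask, avoiding libgcc helpers on 32-bit. */
static inline uint64_t chunk_index(uint64_t block, unsigned int chunk_shift)
{
        return block >> chunk_shift;              /* was: block / chunk_size */
}

static inline uint32_t chunk_offset(uint64_t block, unsigned int chunk_shift)
{
        return block & ((1U << chunk_shift) - 1); /* was: block % chunk_size */
}
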

Cheers, Andreas
--
Andreas Dilger Turbolinux filesystem development
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/