Just a heads-up for now. I ran ll_ver_fs on a 96TB fs - the write phase
finished without problems, but the read phase encountered a problem:
...
read File name: /mnt/dir00373/file026
llverfs: verify /mnt/dir00373/file026 failed offset/timestamp/inode 3244298240/1248819541/1096796: found 3243249664/1248819541/1096796 instead
llverfs: Data verification failed
770.45user 218639.65system 67:38:18elapsed 90%CPU (0avgtext+0avgdata 0maxresident)k
100357573552inputs+195522668184outputs (1major+414minor)pagefaults 0swaps
make: *** [llver] Error 2
The offset difference is exactly 1M, and it occurs about 3GB into the file.
In total, there are 726 directories, each containing 32 4GB files (except
the last, which has only 12). So directory 373 is roughly half-way. I'll
take a look at the block allocation of both the directory and the file and
see whether they straddle the 16TB boundary (or some other significant
boundary).
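For a rough sense of scale, a back-of-the-envelope sketch (assuming
zero-based numbering and strict write-order placement, which the allocator
certainly does not guarantee) puts the failing chunk's nominal position
well past 16TB:

,----
| /* Back-of-the-envelope only: nominal position of the failing chunk
|  * if files landed strictly in write order.  Numbers are from the
|  * report above; directory/file numbering assumed zero-based. */
| #include <stdio.h>
|
| int main(void)
| {
|         unsigned long long GB = 1ULL << 30;
|         unsigned long long files_before = 373ULL * 32 + 26;
|         unsigned long long pos = files_before * 4 * GB + 3244298240ULL;
|
|         /* prints ~46.73 TiB - nowhere near 16 TiB in write order */
|         printf("nominal position: %.2f TiB\n", (double)pos / (1ULL << 40));
|         return 0;
| }
`----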
Running a fsck now to see if there are other problems.
Thanks,
Nick
Nick Dokos <[email protected]> wrote:
> Just a heads-up for now. I ran ll_ver_fs on a 96TB fs - the write phase
> finished without problems, but the read phase encountered a problem:
>
> ...
> Running a fsck now to see if there are other problems.
>
The fsck did not find any other problems:
sudo time e2fsck -t -t -F -n -f /dev/mapper/bigvg-bigvol
e2fsck 1.41.8 (20-Jul-2009)
Pass 1: Checking inodes, blocks, and sizes
Pass 1: Memory used: 284k/1630436k (131k/154k), time: 1055.90/880.05/ 6.00
Pass 1: I/O read: 190MB, write: 0MB, rate: 0.18MB/s
Pass 2: Checking directory structure
Pass 2: Memory used: 284k/18014398508112064k (94k/191k), time: 2.33/ 0.93/ 0.92
Pass 2: I/O read: 3MB, write: 0MB, rate: 1.29MB/s
Pass 3: Checking directory connectivity
Peak memory: Memory used: 284k/18014398508635724k (94k/191k), time: 1067.60/888.18/ 9.10
Pass 3: Memory used: 284k/18014398508112064k (86k/199k), time: 0.58/ 0.26/ 0.32
Pass 3: I/O read: 1MB, write: 0MB, rate: 1.71MB/s
Pass 4: Checking reference counts
Pass 4: Memory used: 284k/18014398509394812k (85k/200k), time: 147.37/147.08/ 0.27
Pass 4: I/O read: 0MB, write: 0MB, rate: 0.00MB/s
Pass 5: Checking group summary information
Pass 5: Memory used: 336k/18014398508871152k (86k/251k), time: 1124.59/707.93/18.27
Pass 5: I/O read: 2934MB, write: 0MB, rate: 2.61MB/s
/dev/mapper/bigvg-bigvol: 23950/4289791968 files (0.0% non-contiguous), 24609217723/24609226752 blocks
Memory used: 336k/18014398508871152k (86k/251k), time: 2339.61/1743.19/27.68
I/O read: 3173MB, write: 0MB, rate: 1.36MB/s
1743.18user 31.66system 39:03.59elapsed 75%CPU (0avgtext+0avgdata 0maxresident)k
32603472inputs+0outputs (0major+4403060minor)pagefaults 0swaps
Nick
On Mon, Aug 03, 2009 at 09:54:36AM -0400, Nick Dokos wrote:
> Just a heads-up for now. I ran ll_ver_fs on a 96TB fs - the write phase
> finished without problems, but the read phase encountered a problem:
>
> ...
> read File name: /mnt/dir00373/file026
>
> llverfs: verify /mnt/dir00373/file026 failed offset/timestamp/inode 3244298240/1248819541/1096796: found 3243249664/1248819541/1096796 instead
>
> llverfs: Data verification failed
> 770.45user 218639.65system 67:38:18elapsed 90%CPU (0avgtext+0avgdata 0maxresident)k
> 100357573552inputs+195522668184outputs (1major+414minor)pagefaults 0swaps
> make: *** [llver] Error 2
>
>
> The offset difference is exactly 1M, and it occurs about 3GB into the file.
Interesting - exactly 1M off. Does this correspond to anything notable
in the extent layout or block allocation boundaries?
Any chance you can patch ll_ver_fs to continue after the first error?
I'd be happy to write the patch for you.
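Roughly this shape, say - untested, with verify_file_keep_going() a
made-up name and verify_chunk()'s argument types guessed for the sake of
a compilable sketch:

,----
| #include <stdio.h>
| #include <unistd.h>
|
| /* Assumed from ll_ver_fs; the real types may differ. */
| extern char *progname;
| extern int verify_chunk(char *chunk_buf, size_t chunksize,
|                         unsigned long long offset, long time_st,
|                         unsigned long long inode_st, const char *file);
|
| /* Sketch: scan a whole file, counting bad chunks instead of
|  * returning on the first mismatch. */
| static int verify_file_keep_going(int fd, char *chunk_buf,
|                                   size_t chunksize, long time_st,
|                                   unsigned long long inode_st,
|                                   const char *file)
| {
|         unsigned long long offset = 0, bad = 0;
|         ssize_t nread;
|
|         while ((nread = read(fd, chunk_buf, chunksize)) > 0) {
|                 if (verify_chunk(chunk_buf, nread, offset, time_st,
|                                  inode_st, file) != 0)
|                         bad++;          /* note it, but keep reading */
|                 offset += nread;
|         }
|         if (nread < 0)
|                 perror(file);           /* real read error */
|         if (bad)
|                 fprintf(stderr, "%s: %llu bad chunk(s) in %s\n",
|                         progname, bad, file);
|         return (bad || nread < 0) ? 1 : 0;
| }
`----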
> In total, there are 726 directories, each containing 32 4GB files (except
> the last, which has only 12). So directory 373 is roughly half-way. I'll
> take a look at the block allocation of both the directory and the file and
> see whether they straddle the 16TB boundary (or some other significant
> boundary).
Did you have a chance to look at what falls before and after the 16TB
boundary?
Thanks,
-VAL
On Mon, Aug 03, 2009 at 09:54:36AM -0400, Nick Dokos wrote:
> Just a heads-up for now. I ran ll_ver_fs on a 96TB fs - the write phase
> finished without problems, but the read phase encountered a problem:
>
> ...
> read File name: /mnt/dir00373/file026
>
> llverfs: verify /mnt/dir00373/file026 failed offset/timestamp/inode 3244298240/1248819541/1096796: found 3243249664/1248819541/1096796 instead
>
> llverfs: Data verification failed
> 770.45user 218639.65system 67:38:18elapsed 90%CPU (0avgtext+0avgdata 0maxresident)k
> 100357573552inputs+195522668184outputs (1major+414minor)pagefaults 0swaps
> make: *** [llver] Error 2
>
>
> The offset difference is exactly 1M, and it occurs about 3GB into the file.
Just a note, this is an off-by-one (off by exactly one 1MB chunk) rather
than a random bit flip:
obase=2
3244298240
11000001011000000001000000000000
3243249664
11000001010100000001000000000000
-VAL
Valerie Aurora <[email protected]> wrote:
> On Mon, Aug 03, 2009 at 09:54:36AM -0400, Nick Dokos wrote:
> > Just a heads-up for now. I ran ll_ver_fs on a 96TB fs - the write phase
> > finished without problems, but the read phase encountered a problem:
> >
> > ...
> > read File name: /mnt/dir00373/file026
> >
> > llverfs: verify /mnt/dir00373/file026 failed offset/timestamp/inode 3244298240/1248819541/1096796: found 3243249664/1248819541/1096796 instead
> >
> > llverfs: Data verification failed
> > 770.45user 218639.65system 67:38:18elapsed 90%CPU (0avgtext+0avgdata 0maxresident)k
> > 100357573552inputs+195522668184outputs (1major+414minor)pagefaults 0swaps
> > make: *** [llver] Error 2
> >
> >
> > The offset difference is exactly 1M, and it occurs about 3GB into the file.
>
> Interesting - exactly 1M off. Does this correspond to anything notable
> in the extent layout or block allocation boundaries?
>
> Any chance you can patch ll_ver_fs to continue after the first error?
> I'd be happy to write the patch for you.
I did that to begin with, but the problem turned out to be much more
mundane: there was an IO error on one of the volumes. It wasn't quite
obvious (no red lights going off), but there *was* a message in
/var/log/messages - unfortunately I missed it. I eventually recreated
the error by trying to read the file with ``od -c'' and then went back
and found the original error. I don't know why or how ll_ver_fs managed
to read past it and come up with a 1M offset difference[1] -- ``od -c''
failed with a big thud.
We have now replaced the disk and I'm doing the test again: it should be
done (barring further problems) by sometime next week.
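For reference, the failing spot can be probed more surgically than with
``od -c'' - a minimal sketch, with the path and offset lifted from the
report above (on 32-bit hosts build with -D_FILE_OFFSET_BITS=64):

,----
| #include <errno.h>
| #include <fcntl.h>
| #include <stdio.h>
| #include <stdlib.h>
| #include <string.h>
| #include <unistd.h>
|
| /* Read 1MB at the offset where verification failed; on the bad
|  * disk this should come back with EIO. */
| int main(void)
| {
|         char *buf = malloc(1 << 20);
|         int fd = open("/mnt/dir00373/file026", O_RDONLY);
|         ssize_t n;
|
|         if (fd < 0 || buf == NULL) {
|                 perror("setup");
|                 return 1;
|         }
|         n = pread(fd, buf, 1 << 20, 3244298240ULL);
|         if (n < 0)
|                 fprintf(stderr, "pread: %s\n", strerror(errno));
|         else
|                 printf("read %zd bytes without error\n", n);
|         free(buf);
|         close(fd);
|         return n < 0;
| }
`----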
>
> > In total, there are 726 directories, each containing 32 4GB files (except
> > the last, which has only 12). So directory 373 is roughly half-way. I'll
> > take a look at the block allocation of both the directory and the file and
> > see whether they straddle the 16TB boundary (or some other significant
> > boundary).
>
> Did you have a chance to look at what falls before and after the 16TB
> boundary?
>
I did go barking up the wrong tree for a while :-) (or should that be :-( ?)
Thanks,
Nick
[1] that's a 2-bit flip (bits 20 and 21, i.e. exactly 2^21 - 2^20 = 1M):
3244298240 = 2#11000001011000000001000000000000
3243249664 = 2#11000001010100000001000000000000
On Aug 06, 2009 16:37 -0400, Nick Dokos wrote:
> I did that to begin with, but the problem turned out to be much more
> mundane: there was an IO error on one of the volumes. It wasn't quite
> obvious (no red lights going off), but there *was* a message in
> /var/log/messages - unfortunately I missed it. I eventually recreated
> the error by trying to read the file with ``od -c'' and then went back
> and found the original error. I don't know why or how ll_ver_fs managed
> to read past it and come up with a 1M offset difference[1] -- ``od -c''
> failed with a big thud.
Can you have a look at the error handling in ll_ver_fs at that point?
It seems that it might just have re-used the previous 1MB buffer, but
didn't detect/report the error from the read, which would itself be bad.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
> On Aug 06, 2009 16:37 -0400, Nick Dokos wrote:
> > ...
>
> Can you have a look at the error handling in ll_ver_fs at that point?
> It seems that it might just have re-used the previous 1MB buffer, but
> didn't detect/report the error from the read, which would itself be bad.
>
It looks right to me:
,----
| ...
| if (read(fd, chunk_buf, chunksize) < 0) {
|         fprintf(stderr, "\n%s: read %s+%llu failed: %s\n",
|                 progname, file, offset, strerror(errno));
|         return 1;
| }
| if (verify_chunk(chunk_buf, chunksize, offset, time_st,
|                  inode_st, file) != 0)
|         return 1;
| ...
`----
The read() should have failed (and I should have gotten a different error
message) but somehow it didn't - instead, verify_chunk() was called and
*that* detected the mismatch.
Thanks,
Nick
Nick Dokos <[email protected]> wrote:
> > On Aug 06, 2009 16:37 -0400, Nick Dokos wrote:
> > > ...
> >
> > Can you have a look at the error handling in ll_ver_fs at that point?
> > It seems that it might just have re-used the previous 1MB buffer, but
> > didn't detect/report the error from the read, which would itself be bad.
> >
>
> It looks right to me:
>
> ,----
> | ...
> | if (read(fd, chunk_buf, chunksize) < 0) {
> |         fprintf(stderr, "\n%s: read %s+%llu failed: %s\n",
> |                 progname, file, offset, strerror(errno));
> |         return 1;
> | }
> | if (verify_chunk(chunk_buf, chunksize, offset, time_st,
> |                  inode_st, file) != 0)
> |         return 1;
> | ...
> `----
>
> The read() should have failed (and I should have gotten a different error
> message) but somehow it didn't - instead, verify_chunk() was called and
> *that* detected the mismatch.
>
The offset starts at 0 and is marched along at a 1MB stride, so it's
conceivable that the read() returned 0 (avoiding the error return) and
did not touch chunk_buf (thereby leading to the verify_chunk() failure).
That would be consistent with what I got, but if so, it would be a kernel
bug somewhere in the read() calling sequence, right?
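A toy model of that failure mode reproduces the reported numbers exactly
- purely illustrative, with each chunk stamped by its offset much the
way llverfs stamps its blocks:

,----
| #include <stdio.h>
| #include <string.h>
|
| #define CHUNK (1 << 20)
|
| int main(void)
| {
|         static char chunk_buf[CHUNK];
|         unsigned long long found, offset = 3243249664ULL;
|
|         /* previous chunk read fine; its data carries its offset */
|         memcpy(chunk_buf, &offset, sizeof(offset));
|         offset += CHUNK;        /* advance to the next chunk */
|         /* ...read() silently returns 0 here, chunk_buf untouched... */
|         memcpy(&found, chunk_buf, sizeof(found));
|         if (found != offset)
|                 printf("expected %llu, found %llu: off by %llu\n",
|                        offset, found, offset - found);
|         return 0;
| }
`----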
Thanks,
Nick
PS. For the record, this is the error from /var/log/messages:
,----
| ...
| Aug 5 18:14:11 shifter kernel: sd 10:0:7:3: [sdn] Unhandled sense code
| Aug 5 18:14:11 shifter kernel: sd 10:0:7:3: [sdn] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
| Aug 5 18:14:11 shifter kernel: sd 10:0:7:3: [sdn] Sense Key : Medium Error [current]
| Aug 5 18:14:11 shifter kernel: sd 10:0:7:3: [sdn] Add. Sense: No additional sense information
| Aug 5 18:14:11 shifter kernel: end_request: I/O error, dev sdn, sector 2238819328
| Aug 5 18:16:03 shifter kernel: sd 10:0:7:3: [sdn] Unhandled sense code
| Aug 5 18:16:03 shifter kernel: sd 10:0:7:3: [sdn] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
| Aug 5 18:16:03 shifter kernel: sd 10:0:7:3: [sdn] Sense Key : Medium Error [current]
| Aug 5 18:16:03 shifter kernel: sd 10:0:7:3: [sdn] Add. Sense: No additional sense information
| Aug 5 18:16:03 shifter kernel: end_request: I/O error, dev sdn, sector 2238819592
`----
Nick Dokos <[email protected]> wrote:
> Nick Dokos <[email protected]> wrote:
>
> > > On Aug 06, 2009 16:37 -0400, Nick Dokos wrote:
> > > > I did that to begin with but the problem turns out to be much more
> > > > mundane: there was an IO error on one of the volumes. It wasn't quite
> > > > obvious (no red lights going off) but there *was* a message in
> > > > /var/log/messages - unfortunately I missed it. I eventually recreated
> > > > the error by trying to read the file with ``od -c'' and then went back
> > > > and found the original error. I don't know why/how ll_ver_fs managed to
> > > > read the offset and come up with a 1M difference[1] -- ``od -c'' failed with
> > > > a big thud.
> > >
> > > Can you have a look at the error handling in ll_ver_fs at that point?
> > > It seems that it might just have re-used the previous 1MB buffer, but
> > > didn't detect/report the error from the read, which would itself be bad.
> > >
> >
> > It looks right to me:
> >
> > ,----
> > | ...
> > | if (read(fd, chunk_buf, chunksize) < 0) {
> > | fprintf(stderr, "\n%s: read %s+%llu failed: %s\n",
> > | progname, file, offset, strerror(errno));
> > | return 1;
> > | }
> > | if (verify_chunk(chunk_buf, chunksize, offset, time_st,
> > | inode_st, file) != 0)
> > | return 1;
> > | ...
> > `----
> >
> > The read() should have failed (and I should have gotten a different error
> > message) but somehow it didn't - instead, verify_chunk() was called and
> > *that* detected the mismatch.
> >
>
> The offset starts at 0 and is marched along at a 1MB stride, so it's
> conceivable that the read() returned 0 (avoiding the error return) and
> did not touch chunk_buf (thereby leading to the verify_chunk() failure).
>
> That would be consistent with what I got, but if so, it would be a kernel
> bug somewhere in the read() calling sequence, right?
>
> Thanks,
> Nick
>
> PS. For the record, this is the error from /var/log/messages:
>
> ,----
> | ...
> | Aug 5 18:14:11 shifter kernel: sd 10:0:7:3: [sdn] Unhandled sense code
> | Aug 5 18:14:11 shifter kernel: sd 10:0:7:3: [sdn] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> | Aug 5 18:14:11 shifter kernel: sd 10:0:7:3: [sdn] Sense Key : Medium Error [current]
> | Aug 5 18:14:11 shifter kernel: sd 10:0:7:3: [sdn] Add. Sense: No additional sense information
> | Aug 5 18:14:11 shifter kernel: end_request: I/O error, dev sdn, sector 2238819328
> | Aug 5 18:16:03 shifter kernel: sd 10:0:7:3: [sdn] Unhandled sense code
> | Aug 5 18:16:03 shifter kernel: sd 10:0:7:3: [sdn] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> | Aug 5 18:16:03 shifter kernel: sd 10:0:7:3: [sdn] Sense Key : Medium Error [current]
> | Aug 5 18:16:03 shifter kernel: sd 10:0:7:3: [sdn] Add. Sense: No additional sense information
> | Aug 5 18:16:03 shifter kernel: end_request: I/O error, dev sdn, sector 2238819592
> `----
>
Jim Owens pointed out to me that the code doesn't deal correctly with
short reads. The verification information is at the beginning of every
4kB block, so the sector error causes a short read with exactly the
symptoms I see. The next read() would get the error.
The code should perhaps read something like this:
ssize_t nread;

if ((nread = read(fd, chunk_buf, chunksize)) < 0) {
        fprintf(stderr, "\n%s: read %s+%llu failed: %s\n",
                progname, file, offset, strerror(errno));
        return 1;
}
if (nread < chunksize) {
        fprintf(stderr, "\n%s: short read %s+%llu: %ld < %llu\n",
                progname, file, offset, (long)nread,
                (unsigned long long)chunksize);
        /* or force the next read() to check for errors? */
        return 1;
}
if (verify_chunk(chunk_buf, chunksize, offset, time_st,
                 inode_st, file) != 0)
        return 1;
Thanks,
Nick
On Aug 06, 2009 17:28 -0400, Nick Dokos wrote:
> > On Aug 06, 2009 16:37 -0400, Nick Dokos wrote:
> > Can you have a look at the error handling in ll_ver_fs at that point?
> > It seems that it might just have re-used the previous 1MB buffer, but
> > didn't detect/report the error from the read, which would itself be bad.
>
> It looks right to me:
>
> ,----
> | ...
> | if (read(fd, chunk_buf, chunksize) < 0) {
> |         fprintf(stderr, "\n%s: read %s+%llu failed: %s\n",
> |                 progname, file, offset, strerror(errno));
> |         return 1;
> | }
> | if (verify_chunk(chunk_buf, chunksize, offset, time_st,
> |                  inode_st, file) != 0)
> |         return 1;
> | ...
> `----
>
> The read() should have failed (and I should have gotten a different error
> message) but somehow it didn't - instead, verify_chunk() was called and
> *that* detected the mismatch.
Well, it seems possible that read() returned fewer than chunksize bytes,
and the buffer verified correctly up to the first 4kB block lying beyond
the data actually read. That looks like a small bug in llverfs, since it
is legal for read() to return fewer bytes than were requested.
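The usual cure is a loop that keeps reading until the buffer is full,
treating only EOF and real errors as terminal - an untested sketch
(read_full() is a made-up name):

,----
| #include <errno.h>
| #include <unistd.h>
|
| /* Read exactly `count' bytes unless EOF or an error intervenes.
|  * Returns bytes read (less than count only at EOF), or -1 with
|  * errno set on error. */
| static ssize_t read_full(int fd, char *buf, size_t count)
| {
|         size_t done = 0;
|
|         while (done < count) {
|                 ssize_t n = read(fd, buf + done, count - done);
|
|                 if (n < 0) {
|                         if (errno == EINTR)
|                                 continue;       /* retry */
|                         return -1;              /* e.g. the EIO above */
|                 }
|                 if (n == 0)
|                         break;                  /* genuine EOF */
|                 done += n;
|         }
|         return done;
| }
`----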
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
> ...
> The code should perhaps read something like this:
>
> ssize_t nread;
>
> if ((nread = read(fd, chunk_buf, chunksize)) < 0) {
>         fprintf(stderr, "\n%s: read %s+%llu failed: %s\n",
>                 progname, file, offset, strerror(errno));
>         return 1;
> }
> if (nread < chunksize) {
>         fprintf(stderr, "\n%s: short read %s+%llu: %ld < %llu\n",
>                 progname, file, offset, (long)nread,
>                 (unsigned long long)chunksize);
>         /* or force the next read() to check for errors? */
>         return 1;
> }
> if (verify_chunk(chunk_buf, chunksize, offset, time_st,
>                  inode_st, file) != 0)
>         return 1;
>
... except that the write() side also does not deal with short writes,
and on a run that fills the filesystem, the last file of the last
directory will probably come up short even if no errors occur. In that
case, the short read is not an error.
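That legitimate tail case is exactly what a read_full()-style loop (as
sketched earlier in the thread) reports as a short return at EOF; the
write side has no such excuse. Another untested sketch, with
write_full() a made-up name:

,----
| #include <errno.h>
| #include <unistd.h>
|
| /* Write exactly `count' bytes or fail.  A partial write() here just
|  * means "call again"; real failures (ENOSPC, EIO, ...) surface as
|  * -1 from a subsequent call. */
| static ssize_t write_full(int fd, const char *buf, size_t count)
| {
|         size_t done = 0;
|
|         while (done < count) {
|                 ssize_t n = write(fd, buf + done, count - done);
|
|                 if (n < 0) {
|                         if (errno == EINTR)
|                                 continue;       /* retry */
|                         return -1;              /* ENOSPC, EIO, ... */
|                 }
|                 if (n == 0) {                   /* shouldn't happen */
|                         errno = EIO;
|                         return -1;
|                 }
|                 done += n;
|         }
|         return done;
| }
`----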
Thanks,
Nick