2017-08-23 19:54:59

by Doug Nazar

[permalink] [raw]
Subject: Kernels v4.9+ cause short reads of block devices

The following commits cause short reads of block devices, however writes
are still allowed.

c2a9737f45e2 ("vfs,mm: fix a dead loop in truncate_inode_pages_range()")
d05c5f7ba164 ("vfs,mm: fix return value of read() at s_maxbytes")

When e2fsck sees this, it thinks it's a bad sector and tries to write a
block of nulls which overwrites the valid data.

Device is LVM over 2 x RAID-5 on an old 32bit desktop.

RO RA SSZ BSZ StartSec Size Device
rw 4096 512 4096 0 9748044840960 /dev/Storage/Main

Doug


2017-08-23 19:37:20

by Linus Torvalds

[permalink] [raw]
Subject: Re: Kernels v4.9+ cause short reads of block devices

On Wed, Aug 23, 2017 at 12:15 PM, Doug Nazar <[email protected]> wrote:
> The following commits cause short reads of block devices, however writes are
> still allowed.
>
> c2a9737f45e2 ("vfs,mm: fix a dead loop in truncate_inode_pages_range()")
> d05c5f7ba164 ("vfs,mm: fix return value of read() at s_maxbytes")
>
> When e2fsck sees this, it thinks it's a bad sector and tries to write a
> block of nulls which overwrites the valid data.

Hmm. Block devices shouldn't have issues with s_maxbytes, and I'm
surprised that nobody has seen that before.

> Device is LVM over 2 x RAID-5 on an old 32bit desktop.
>
> RO RA SSZ BSZ StartSec Size Device
> rw 4096 512 4096 0 9748044840960 /dev/Storage/Main

.. and the problem may be as simple as just a missing initialization
of s_maxbytes for blockdev_superblock.

Does the attached trivial one-liner fix things for you?

Al, if it really is this simple, how come nobody even noticed?

Also, I do wonder if that check in do_generic_file_read() should just
unconditionally use MAX_LFS_FILESIZE, since the whole point there is
really about the index wrap-around, not about any underlying
filesystem limits per se.

And that's exactly what MAX_LFS_FILESIZE is - the maximum size that
fits in the page index.

Linus


Attachments:
patch.diff (425.00 B)

2017-08-23 19:53:57

by Doug Nazar

[permalink] [raw]
Subject: Re: Kernels v4.9+ cause short reads of block devices

On 8/23/17 3:37 PM, Linus Torvalds wrote:
> On Wed, Aug 23, 2017 at 12:15 PM, Doug Nazar <[email protected]> wrote:
>> The following commits cause short reads of block devices, however writes are
>> still allowed.
>>
>> c2a9737f45e2 ("vfs,mm: fix a dead loop in truncate_inode_pages_range()")
>> d05c5f7ba164 ("vfs,mm: fix return value of read() at s_maxbytes")
>>
>> When e2fsck sees this, it thinks it's a bad sector and tries to write a
>> block of nulls which overwrites the valid data.
> Hmm. Block devices shouldn't have issues with s_maxbytes, and I'm
> surprised that nobody has seen that before.
>
>> Device is LVM over 2 x RAID-5 on an old 32bit desktop.
>>
>> RO RA SSZ BSZ StartSec Size Device
>> rw 4096 512 4096 0 9748044840960 /dev/Storage/Main
> .. and the problem may be as simple as just a missing initialization
> of s_maxbytes for blockdev_superblock.
>
> Does the attached trivial one-liner fix things for you?
>
> Al, if it really is this simple, how come nobody even noticed?
>
> Also, I do wonder if that check in do_generic_file_read() should just
> unconditionally use MAX_LFS_FILESIZE, since the whole point there is
> really about the index wrap-around, not about any underlying
> filesystem limits per se.
>
> And that's exactly what MAX_LFS_FILESIZE is - the maximum size that
> fits in the page index.

It's compiling now, but I think it's already set to MAX_LFS_FILESIZE.

[ 169.095127] ppos=80180006000, s_maxbytes=7ffffffffff, magic=0x62646576, type=bdev

Doug

2017-08-23 20:13:07

by Linus Torvalds

[permalink] [raw]
Subject: Re: Kernels v4.9+ cause short reads of block devices

On Wed, Aug 23, 2017 at 12:53 PM, Doug Nazar <[email protected]> wrote:
>
> It's compiling now, but I think it's already set to MAX_LFS_FILESIZE.
>
> [ 169.095127] ppos=80180006000, s_maxbytes=7ffffffffff, magic=0x62646576,
> type=bdev

Oh, right you are - I'm much too used to 64-bit, where
MAX_LFS_FILESIZE is basically infinite, and was just assuming that it
was something like the UFS bug we had not that long ago that was due
to the 32-bit limit.

But yes, on 32-bit, we are limited by the 32-bit index into the page
cache, and we limit the index to 31 bits too, so we have
(PAGE_SIZE << 31) - 1, which is that 7ffffffffff.

And that also explains why people haven't seen it. You do need

(a) 32-bit environment

(b) a disk larger than that 8TB limit

The *hard* limit for the page cache on a 32-bit environment should
actually be (PAGE_SIZE << 32) - PAGE_SIZE (that final PAGE_SIZE
subtraction is to make sure we don't generate a page cache entry with
index -1), so having a disk that is 16TB or larger is not going to
work, but your disk is right in that 8TB-16TB hole that used to work
and was broken by that check.
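
As a sanity check on that arithmetic, here is a minimal userspace
sketch (not kernel code; it assumes a 32-bit kernel with 4 KiB pages
and hard-codes the device size reported above) showing that the disk
falls squarely in that hole:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
            const int64_t page_size = 4096;
            /* old limit: page cache index capped at 31 bits */
            int64_t old_max = (page_size << 31) - 1;          /* 0x7ffffffffff */
            /* hard limit: full 32-bit index, minus one page for index -1 */
            int64_t hard_max = (page_size << 32) - page_size; /* 0xffffffff000 */
            int64_t disk = 9748044840960LL;                   /* ~8.9 TiB */

            printf("old_max  = %#llx\n", (long long)old_max);
            printf("hard_max = %#llx\n", (long long)hard_max);
            printf("in the 8TB-16TB hole: %s\n",
                   (disk > old_max && disk <= hard_max) ? "yes" : "no");
            return 0;
    }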

Anyway, that makes me feel better. I should have looked at your disk
size more; now I at least understand why nobody noticed before.

So just throw away my patch. That's wrong, and garbage.

The *right* patch is likely to be just this instead:

-#define MAX_LFS_FILESIZE (((loff_t)PAGE_SIZE << (BITS_PER_LONG-1))-1)
+#define MAX_LFS_FILESIZE (((loff_t)PAGE_SIZE << BITS_PER_LONG)-PAGE_SIZE)

which should make MAX_LFS_FILESIZE be 0xffffffff000, and your disk
size should be ok.

Linus

2017-08-23 21:01:47

by Andreas Dilger

[permalink] [raw]
Subject: Re: Kernels v4.9+ cause short reads of block devices

On Aug 23, 2017, at 2:13 PM, Linus Torvalds <[email protected]> wrote:
>
> On Wed, Aug 23, 2017 at 12:53 PM, Doug Nazar <[email protected]> wrote:
>>
>> It's compiling now, but I think it's already set to MAX_LFS_FILESIZE.
>>
>> [ 169.095127] ppos=80180006000, s_maxbytes=7ffffffffff, magic=0x62646576,
>> type=bdev
>
> Oh, right you are - I'm much too used to 64-bit, where
> MAX_LFS_FILESIZE is basically infinite, and was just assuming that it
> was something like the UFS bug we had not that long ago that was due
> to the 32-bit limit.
>
> But yes, on 32-bit, we are limited by the 32-bit index into the page
> cache, and we limit the index to 31 bits too, so we have
> (PAGE_SIZE << 31) - 1, which is that 7ffffffffff.
>
> And that also explains why people haven't seen it. You do need
>
> (a) 32-bit environment
>
> (b) a disk larger than that 8TB limit
>
> The *hard* limit for the page cache on a 32-bit environment should
> actually be (PAGE_SIZE << 32) - PAGE_SIZE (that final PAGE_SIZE
> subtraction is to make sure we don't generate a page cache entry with
> index -1), so having a disk that is 16TB or larger is not going to
> work, but your disk is right in that 8TB-16TB hole that used to work
> and was broken by that check.
>
> Anyway, that makes me feel better. I should have looked at your disk
> size more; now I at least understand why nobody noticed before.
>
> So just throw away my patch. That's wrong, and garbage.
>
> The *right* patch is likely to be just this instead:
>
> -#define MAX_LFS_FILESIZE (((loff_t)PAGE_SIZE << (BITS_PER_LONG-1))-1)
> +#define MAX_LFS_FILESIZE (((loff_t)PAGE_SIZE << BITS_PER_LONG)-PAGE_SIZE)
>
> which should make MAX_LFS_FILESIZE be 0xffffffff000, and your disk
> size should be ok.

Doug,
I noticed while checking for other implications of changing MAX_LFS_FILESIZE
that fs/jfs/super.c is also working around this limit. If you are going
to submit a patch for this, it also makes sense to fix jfs_fill_super() to
use MAX_LFS_FILESIZE instead of JFS rolling its own, something like:

    /* logical blocks are represented by 40 bits in pxd_t, etc.
     * and page cache is indexed by long. */
    sb->s_maxbytes = min(((u64)sb->s_blocksize) << 40,
                         MAX_LFS_FILESIZE);
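
A caveat on that sketch: the kernel's min() macro complains when its
two arguments have different types, and MAX_LFS_FILESIZE expands to a
loff_t while the shifted blocksize is a u64, so a buildable variant
would probably need min_t (hypothetical, untested):

    /* min_t() casts both arguments to u64, sidestepping the
     * strict type check in the kernel's min() macro. */
    sb->s_maxbytes = min_t(u64, (u64)sb->s_blocksize << 40,
                           MAX_LFS_FILESIZE);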

It also looks like ocfs2_max_file_offset() is trying to avoid overflowing
the old 31-bit limit, and isn't using MAX_LFS_FILESIZE directly, so it will
now be wrong. It looks like it could use "bitshift = 32; trim = bytes;",
but Joel or Mark should confirm.

Finally, there is a check in fs/super.c::mount_fs() that is verifying
s_maxbytes is not set too large, but this has been present since 2.6.32
and should probably be removed at this point, or changed to a BUG_ON()
(see commit 42cb56ae2ab for details).

Cheers, Andreas

Attachments:
signature.asc (195.00 B)
Message signed with OpenPGP

2017-08-24 10:03:47

by Doug Nazar

[permalink] [raw]
Subject: Re: Kernels v4.9+ cause short reads of block devices

On 8/23/17 4:13 PM, Linus Torvalds wrote:

> Oh, right you are - I'm much too used to 64-bit, where
> MAX_LFS_FILESIZE is basically infinite, and was just assuming that it
> was something like the UFS bug we had not that long ago that was due
> to the 32-bit limit.
>
> But yes, on 32-bit, we are limited by the 32-bit index into the page
> cache, and we limit the index to 31 bits too, so we have
> (PAGE_SIZE << 31) - 1, which is that 7ffffffffff.

Yeah, it's an old organically grown storage server (install images,
old VMs, etc.) that I can't convince myself is worth upgrading.

> The *right* patch is likely to be just this instead:
>
> -#define MAX_LFS_FILESIZE (((loff_t)PAGE_SIZE << (BITS_PER_LONG-1))-1)
> +#define MAX_LFS_FILESIZE (((loff_t)PAGE_SIZE << BITS_PER_LONG)-PAGE_SIZE)
>
> which should make MAX_LFS_FILESIZE be 0xffffffff000, and your disk
> size should be ok.

That solves my issue. I'm curious whether that check should also be in
the write path; if it had been there too, I wouldn't have ended up with
any corruption.

I'm not sure if anything over 16TB will even create/assemble (or if
anybody else is crazy like me).

Doug

2017-08-24 10:20:28

by Doug Nazar

[permalink] [raw]
Subject: Re: Kernels v4.9+ cause short reads of block devices

On 8/23/17 5:01 PM, Andreas Dilger wrote:
> Doug,
> I noticed while checking for other implications of changing MAX_LFS_FILESIZE
> that fs/jfs/super.c is also working around this limit. If you are going
> to submit a patch for this, it also makes sense to fix jfs_fill_super() to
> use MAX_LFS_FILESIZE instead of JFS rolling its own, something like:
>
> /* logical blocks are represented by 40 bits in pxd_t, etc.
>  * and page cache is indexed by long. */
> sb->s_maxbytes = min(((u64)sb->s_blocksize) << 40,
>                      MAX_LFS_FILESIZE);
>
> It also looks like ocfs2_max_file_offset() is trying to avoid overflowing
> the old 31-bit limit, and isn't using MAX_LFS_FILESIZE directly, so it will
> now be wrong. It looks like it could use "bitshift = 32; trim = bytes;",
> but Joel or Mark should confirm.
>
> Finally, there is a check in fs/super.c::mount_fs() that is verifying
> s_maxbytes is not set too large, but this has been present since 2.6.32
> and should probably be removed at this point, or changed to a BUG_ON()
> (see commit 42cb56ae2ab for details).

I don't have any issue trying to write patches for those, but I have
no domain knowledge in the area and no way to test them.

From a quick glance, jfs is locked to PSIZE (4096), so it should be ok.
OCFS2 looks a little complex, and since it's a shared fs, I'm a little
hesitant.

The check in fs/super.c, maybe that should be:

    sb->s_maxbytes > MAX_LFS_FILESIZE

Actually, I'm a little confused: the comment says unsigned, but loff_t
looks like it's long long. Maybe cast to u64 and check greater than?
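
A sketch of that suggestion (hypothetical and untested; the WARN()
shape mirrors the existing check in mount_fs() from commit 42cb56ae2ab,
which only catches negative values):

    /* Treat s_maxbytes as unsigned so that both negative values and
     * values past the page cache limit trip the warning. */
    WARN((u64)sb->s_maxbytes > MAX_LFS_FILESIZE,
         "%s set sb->s_maxbytes beyond MAX_LFS_FILESIZE (%lld)\n",
         type->name, sb->s_maxbytes);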

Doug

2017-08-24 15:23:11

by Dave Kleikamp

[permalink] [raw]
Subject: Re: Kernels v4.9+ cause short reads of block devices

On 08/24/2017 05:20 AM, Doug Nazar wrote:
> On 8/23/17 5:01 PM, Andreas Dilger wrote:
>> Doug,
>> I noticed while checking for other implications of changing
>> MAX_LFS_FILESIZE
>> that fs/jfs/super.c is also working around this limit. If you are going
>> to submit a patch for this, it also makes sense to fix
>> jfs_fill_super() to
>> use MAX_LFS_FILESIZE instead of JFS rolling its own, something like:
>>
>>     /* logical blocks are represented by 40 bits in pxd_t, etc.
>>      * and page cache is indexed by long. */
>>     sb->s_maxbytes = min(((u64)sb->s_blocksize) << 40,
>>                          MAX_LFS_FILESIZE);
>>
>> It also looks like ocfs2_max_file_offset() is trying to avoid overflowing
>> the old 31-bit limit, and isn't using MAX_LFS_FILESIZE directly, so it
>> will
>> now be wrong. It looks like it could use "bitshift = 32; trim = bytes;",
>> but Joel or Mark should confirm.
>>
>> Finally, there is a check in fs/super.c::mount_fs() that is verifying
>> s_maxbytes is not set too large, but this has been present since 2.6.32
>> and should probably be removed at this point, or changed to a BUG_ON()
>> (see commit 42cb56ae2ab for details).
>
> I don't have any issue trying to write patches for those, but I have
> no domain knowledge in the area and no way to test them.

If you want to wrap the jfs change into this, I will be happy to test it
for you, or I could take care of jfs with a separate patch if you'd prefer.

>
> From a quick glance, jfs is locked to PSIZE (4096), so it should be ok.
> OCFS2 looks a little complex, and since it's a shared fs, I'm a little
> hesitant.
>
> The check in fs/super.c, maybe that should be:
>
>     sb->s_maxbytes > MAX_LFS_FILESIZE
>
> Actually, I'm a little confused: the comment says unsigned, but loff_t
> looks like it's long long. Maybe cast to u64 and check greater than?
>
> Doug
>

2017-08-27 19:47:06

by Linus Torvalds

[permalink] [raw]
Subject: Re: Kernels v4.9+ cause short reads of block devices

On Wed, Aug 23, 2017 at 2:01 PM, Andreas Dilger <[email protected]> wrote:
>
> Doug,
> I noticed while checking for other implications of changing MAX_LFS_FILESIZE
> that fs/jfs/super.c is also working around this limit.

Note to people: I just committed the patch to update MAX_LFS_FILESIZE.

I made it use the simpler (and clearer) calculation of

((loff_t)ULONG_MAX << PAGE_SHIFT)

for the 32-bit case, and I did *not* change any other users.
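
For reference, a small userspace sketch (assuming 32-bit, i.e. a
32-bit unsigned long and a PAGE_SHIFT of 12; the loff32 typedef is a
stand-in for the kernel's loff_t) showing that this evaluates to the
same 0xffffffff000 as the earlier
(PAGE_SIZE << BITS_PER_LONG) - PAGE_SIZE form:

    #include <assert.h>
    #include <stdio.h>

    typedef long long loff32;              /* stand-in for loff_t */

    int main(void)
    {
            const unsigned int page_shift = 12;
            const loff32 page_size = 1 << page_shift;
            const loff32 ulong_max = 0xffffffffLL;  /* 32-bit ULONG_MAX */

            loff32 committed = ulong_max << page_shift;
            loff32 earlier = (page_size << 32) - page_size;

            assert(committed == earlier);   /* both 0xffffffff000 */
            printf("MAX_LFS_FILESIZE = %#llx\n", committed);
            return 0;
    }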

The jfs comment was a bit confusing, and talks about "wraps around" at
8TB, when that actually happens at 16TB. Yes, if you use a signed
number for the index, it does wrap at 8TB, but you really shouldn't
(and the code the jfs comment points to doesn't).

So I didn't touch that. Nor did I touch:

> it also makes sense to fix jfs_fill_super() to
> use MAX_LFS_FILESIZE instead of JFS rolling its own, something like:
>
> /* logical blocks are represented by 40 bits in pxd_t, etc.
>  * and page cache is indexed by long. */
> sb->s_maxbytes = min(((u64)sb->s_blocksize) << 40,
>                      MAX_LFS_FILESIZE);

which I agree should be modified. The new MAX_LFS_FILESIZE should be
the right size, but the difference now is only one page less one byte.

Linus

2017-08-27 19:55:02

by Dave Kleikamp

[permalink] [raw]
Subject: Re: Kernels v4.9+ cause short reads of block devices

On 08/27/2017 02:47 PM, Linus Torvalds wrote:
> On Wed, Aug 23, 2017 at 2:01 PM, Andreas Dilger <[email protected]> wrote:
>>
>> Doug,
>> I noticed while checking for other implications of changing MAX_LFS_FILESIZE
>> that fs/jfs/super.c is also working around this limit.
>
> Note to people: I just committed the patch to update MAX_LFS_FILESIZE.
>
> I made it use the simpler (and clearer) calculation of
>
> ((loff_t)ULONG_MAX << PAGE_SHIFT)
>
> for the 32-bit case, and I did *not* change any other users.
>
> The jfs comment was a bit confusing, and talks about "wraps around" at
> 8TB, when that actually happens at 16TB. Yes, if you use a signed
> number for the index, it does wrap at 8TB, but you really shouldn't
> (and the code the jfs comment points to doesn't).
>
> So I didn't touch that. Nor did I touch:
>
>> it also makes sense to fix jfs_fill_super() to
>> use MAX_LFS_FILESIZE instead of JFS rolling its own, something like:
>>
>> /* logical blocks are represented by 40 bits in pxd_t, etc.
>>  * and page cache is indexed by long. */
>> sb->s_maxbytes = min(((u64)sb->s_blocksize) << 40,
>>                      MAX_LFS_FILESIZE);
>
> which I agree should be modified. The new MAX_LFS_FILESIZE should be
> the right size, but the difference now is only one page less one byte.

I'll submit a patch to clean up jfs.

Thanks,
Shaggy

>
> Linus
>

2017-08-29 10:40:25

by Doug Nazar

[permalink] [raw]
Subject: Re: Kernels v4.9+ cause short reads of block devices

On 8/27/17 3:47 PM, Linus Torvalds wrote:
> Note to people: I just committed the patch to update MAX_LFS_FILESIZE.
> I made it use the simpler (and clearer) calculation of
>
> ((loff_t)ULONG_MAX << PAGE_SHIFT)
>
> for the 32-bit case, and I did *not* change any other users.

Not a problem... I got caught trying to find out why 4.13 hangs on a
warm reset. It has really slowed down my testing, having to walk a
couple of floors for every reboot.

Doug