On May 18, 2006 11:23 +0100, Stephen C. Tweedie wrote:
> On Wed, 2006-05-17 at 17:58 -0600, Andreas Dilger wrote:
> > > the next question then is should we fail mounting of any >2TB fs
> > > w/o CONFIG_LBD or it would be better to mount IFF the fs has
> > > no allocated blocks beyond 2TB?
> >
> > This could actually be the primary source of the > 2TB filesystem
> > corruption problem that I've seen reported occasionally. The question
> > is if CONFIG_LBD isn't enabled will the kernel silently truncate
> > block-layer offsets?
>
> It appears so, yes. bread() goes through submit_bh(), which converts
> blocknr to sector with:
>
> bio->bi_sector = bh->b_blocknr * (bh->b_size >> 9);
>
> without checking for overflow.
>
> > This would seem to be a critical kernel bug that should be addressed in
> > the block subsystem to prevent use of block devices > 2TB if sector_t
> > is only 32 bits.
>
> I think the arguments are a little less strong about restricting block
> device access in such cases, but certainly filesystem access needs to be
> properly checked. We need to check the *filesystem* size, not the
> device size (they are usually the same but they don't strictly have to
> be), both on mount and on attempted resize.
At least the submit_bh() code should return an error in this case
regardless of who the caller is. If ext[23] and other filesystems
check this themselves at mount time that is of course also prudent.
Patch has not been more than compile tested but it is a pretty serious
problem dating back a long time so I'm posting it for review anyways.
There are several ways to check for the 64-bit overflow, the efficiency
of which depends on the architecture. We could alternately shift b_blocknr
by >> 41 or change the mask, for example. The primary concern is really
the 32-bit overflow and resulting silent and massive data corruption
when CONFIG_LBD isn't enabled.
It is coincidental that mke2fs will clobber the primary group's metadata
(bitmaps, inode table) when formatting such a filesystem, but since this
data is identical at format time (group 0 offsets and group 16384 offsets
modulo 2TB are the same) it is very hard to detect before corruption hits.
==================== buffer_overflow.diff ===============================
--- ./fs/buffer.c.orig 2006-05-18 12:15:45.000000000 -0600
+++ ./fs/buffer.c 2006-05-18 12:09:20.000000000 -0600
@@ -2758,12 +2784,32 @@ static int end_bio_bh_io_sync(struct bio
int submit_bh(int rw, struct buffer_head * bh)
{
struct bio *bio;
+ unsigned long long sector;
int ret = 0;
BUG_ON(!buffer_locked(bh));
BUG_ON(!buffer_mapped(bh));
BUG_ON(!bh->b_end_io);
+ /* Check if we overflow sector_t when computing the sector offset. */
+ sector = (unsigned long long)bh->b_blocknr * (bh->b_size >> 9);
+#if !defined(CONFIG_LBD) && BITS_PER_LONG == 32
+ if (unlikely(sector != (sector_t)sector))
+#else
+ if (unlikely(((bh->b_blocknr >> 32) * (bh->b_size >> 9)) >=
+ 0xffffffff00000000ULL))
+#endif
+ {
+ printk(KERN_ERR "IO past maximum addressable sector"
+#if !defined(CONFIG_LBD) && BITS_PER_LONG == 32
+ "- CONFIG_LBD not enabled"
+#endif
+ "\n");
+ buffer_io_error(bh);
+
+ return -EOVERFLOW;
+ }
+
if (buffer_ordered(bh) && (rw == WRITE))
rw = WRITE_BARRIER;
@@ -2780,7 +2828,7 @@ int submit_bh(int rw, struct buffer_head
*/
bio = bio_alloc(GFP_NOIO, 1);
- bio->bi_sector = bh->b_blocknr * (bh->b_size >> 9);
+ bio->bi_sector = sector;
bio->bi_bdev = bh->b_bdev;
bio->bi_io_vec[0].bv_page = bh->b_page;
bio->bi_io_vec[0].bv_len = bh->b_size;
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
On Thu, May 18, 2006 at 12:59:56PM -0600, Andreas Dilger wrote:
> At least the submit_bh() code should return an error in this case
> regardless of who the caller is. If ext[23] and other filesystems
> check this themselves at mount time that is of course also prudent.
>
> Patch has not been more than compile tested but it is a pretty serious
> problem dating back a long time so I'm posting it for review anyways.
>
> There are several ways to check for the 64-bit overflow, the efficiency
> of which depends on the architecture. We could alternately shift b_blocknr
> by >> 41 or change the mask, for example. The primary concern is really
> the 32-bit overflow and resulting silent and massive data corruption
> when CONFIG_LBD isn't enabled.
>
> It is coincidental that mke2fs will clobber the primary group's metadata
> (bitmaps, inode table) when formatting such a filesystem, but since this
> data is identical at format time (group 0 offsets and group 16384 offsets
> modulo 2TB are the same) it is very hard to detect before corruption hits.
>
> ==================== buffer_overflow.diff ===============================
> --- ./fs/buffer.c.orig 2006-05-18 12:15:45.000000000 -0600
> +++ ./fs/buffer.c 2006-05-18 12:09:20.000000000 -0600
> @@ -2758,12 +2784,32 @@ static int end_bio_bh_io_sync(struct bio
> int submit_bh(int rw, struct buffer_head * bh)
> {
> struct bio *bio;
> + unsigned long long sector;
> int ret = 0;
>
> BUG_ON(!buffer_locked(bh));
> BUG_ON(!buffer_mapped(bh));
> BUG_ON(!bh->b_end_io);
>
> + /* Check if we overflow sector_t when computing the sector offset. */
> + sector = (unsigned long long)bh->b_blocknr * (bh->b_size >> 9);
> +#if !defined(CONFIG_LBD) && BITS_PER_LONG == 32
> + if (unlikely(sector != (sector_t)sector))
> +#else
> + if (unlikely(((bh->b_blocknr >> 32) * (bh->b_size >> 9)) >=
> + 0xffffffff00000000ULL))
> +#endif
> + {
> + printk(KERN_ERR "IO past maximum addressable sector"
> +#if !defined(CONFIG_LBD) && BITS_PER_LONG == 32
> + "- CONFIG_LBD not enabled"
> +#endif
> + "\n");
You may disagree, but this is ugly and you're touching generic code.
Why not hide it inside static inline function? And don't cut it into
pieces while you're at it.
#if !defined(CONFIG_LBD) && BITS_PER_LONG == 32
static inline
{
}
#else
static inline
{
}
#endif
> + buffer_io_error(bh);
> +
> + return -EOVERFLOW;
> + }
> +
> if (buffer_ordered(bh) && (rw == WRITE))
> rw = WRITE_BARRIER;
>
On Thu, 18 May 2006, Andreas Dilger wrote:
> struct bio *bio;
> + unsigned long long sector;
> int ret = 0;
>
> BUG_ON(!buffer_locked(bh));
> BUG_ON(!buffer_mapped(bh));
> BUG_ON(!bh->b_end_io);
>
> + /* Check if we overflow sector_t when computing the sector offset. */
> + sector = (unsigned long long)bh->b_blocknr * (bh->b_size >> 9);
Ok so far, looks fine.
But what the heck is this:
> +#if !defined(CONFIG_LBD) && BITS_PER_LONG == 32
> + if (unlikely(sector != (sector_t)sector))
> +#else
> + if (unlikely(((bh->b_blocknr >> 32) * (bh->b_size >> 9)) >=
> + 0xffffffff00000000ULL))
> +#endif
I don't understand the #ifdef at all.
Why isn't that just a
if (unlikely(sector != (sector_t)sector))
and that's it? What does this have to do with CONFIG_LBD or BITS_PER_LONG,
or anything at all?
If the sector number fits in a sector_t, we're all good.
Linus
On Thu, 18 May 2006, Linus Torvalds wrote:
> On Thu, 18 May 2006, Andreas Dilger wrote:
> > struct bio *bio;
> > + unsigned long long sector;
> > int ret = 0;
> >
> > BUG_ON(!buffer_locked(bh));
> > BUG_ON(!buffer_mapped(bh));
> > BUG_ON(!bh->b_end_io);
> >
> > + /* Check if we overflow sector_t when computing the sector offset. */
> > + sector = (unsigned long long)bh->b_blocknr * (bh->b_size >> 9);
>
> Ok so far, looks fine.
>
> But what the heck is this:
>
> > +#if !defined(CONFIG_LBD) && BITS_PER_LONG == 32
> > + if (unlikely(sector != (sector_t)sector))
> > +#else
> > + if (unlikely(((bh->b_blocknr >> 32) * (bh->b_size >> 9)) >=
> > + 0xffffffff00000000ULL))
> > +#endif
>
> I don't understand the #ifdef at all.
>
> Why isn't that just a
>
> if (unlikely(sector != (sector_t)sector))
>
> and that's it? What does this have to do with CONFIG_LBD or BITS_PER_LONG,
> or anything at all?
>
> If the sector number fits in a sector_t, we're all good.
I think you missed that Andrewas said he is worried about 64-bit overflows
as well. And you would not catch those with the sector !=
(sector_t)sector test because you would be comparing two 64-bit values
together so they always match...
Hence why he shifts the value right by 32 bits then multiplies and tests
the result for overflowing 32-bits which if it does it means it would
overflow the 64-bit multiplication, too therefor your "sector" is
truncated.
Makes sense to me in a some very convoluted sickening way... (-;
Best regards,
Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/
On Thu, 18 May 2006, Anton Altaparmakov wrote:
>
> I think you missed that Andrewas said he is worried about 64-bit overflows
> as well.
Ahh. Ok. However, then the test _really_ should be something like
sector_t maxsector, sector;
int sector_shift = get_sector_shift(bh->b_size);
maxsector = (~(sector_t)0) >> sector_shift;
if (unshifted_value > maxsector)
return -EIO;
sector = (sector_t) unshifted_value << sector_shift;
which is a lot clearer, and likely faster too, with a proper
get_sector_shift. Something like this:
/*
* What it the shift required to turn a bh of size
* "size" into a 512-byte sector count?
*/
static inline int get_sector_shift(unsigned int size)
{
int shift = -1;
unsigned int n = 256;
do {
shift++;
} while ((n += n) < size);
return shift;
}
which should generate good code on just about any architecture (it avoids
actually using shifts on purpose), and I think the end result will end up
being more readable (I'm not claiming that the "get_sector_shift()"
implementation is readable, I'm claiming that the _users_ will be more
readable).
Of course, even better would be to not have "b_size" at all, but use
"b_shift", but we don't have that. And the sector shift calculation might
be fast enough that it's even a win (turning a 64-bit multiply into a
shift _tends_ to be better)
Linus
On May 18, 2006 23:12 +0100, Anton Altaparmakov wrote:
> On Thu, 18 May 2006, Linus Torvalds wrote:
> > On Thu, 18 May 2006, Andreas Dilger wrote:
> > > + /* Check if we overflow sector_t when computing the sector offset. */
> > > + sector = (unsigned long long)bh->b_blocknr * (bh->b_size >> 9);
> >
> > Ok so far, looks fine.
> >
> > But what the heck is this:
> >
> > > +#if !defined(CONFIG_LBD) && BITS_PER_LONG == 32
> > > + if (unlikely(sector != (sector_t)sector))
> > > +#else
> > > + if (unlikely(((bh->b_blocknr >> 32) * (bh->b_size >> 9)) >=
> > > + 0xffffffff00000000ULL))
> > > +#endif
> >
> > I don't understand the #ifdef at all.
> >
> > Why isn't that just a
> >
> > if (unlikely(sector != (sector_t)sector))
> >
> > and that's it? What does this have to do with CONFIG_LBD or BITS_PER_LONG,
> > or anything at all?
> >
> > If the sector number fits in a sector_t, we're all good.
>
> I think you missed that Andreas said he is worried about 64-bit overflows
> as well. And you would not catch those with the sector !=
> (sector_t)sector test because you would be comparing two 64-bit values
> together so they always match...
>
> Hence why he shifts the value right by 32 bits then multiplies and tests
> the result for overflowing 32-bits which if it does it means it would
> overflow the 64-bit multiplication, too therefor your "sector" is
> truncated.
Yes, this is exactly correct. At various times while writing this I had
more comments to explain everything, but then removed them as too wordy.
The actual implementation isn't what I care about - the resulting corruption
is more important and I thought it better to at least get some attention
on it instead of worrying over the details.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
On May 18, 2006 15:41 -0700, Linus Torvalds wrote:
> On Thu, 18 May 2006, Anton Altaparmakov wrote:
> > I think you missed that Andreas said he is worried about 64-bit overflows
> > as well.
>
> Ahh. Ok. However, then the test _really_ should be something like
>
> sector_t maxsector, sector;
> int sector_shift = get_sector_shift(bh->b_size);
>
> maxsector = (~(sector_t)0) >> sector_shift;
> if (unshifted_value > maxsector)
> return -EIO;
> sector = (sector_t) unshifted_value << sector_shift;
>
> which is a lot clearer, and likely faster too, with a proper
> get_sector_shift.
I looked at that also, but it isn't clear from the use of "b_size" here
that there is any constraint that b_size is a power of two, only that it
is a multiple of 512. Granted, I don't know whether there are any users
of such a crazy thing, but the fact that this is in bytes instead of a
shift made me think twice.
> Something like this:
>
> /*
> * What it the shift required to turn a bh of size
> * "size" into a 512-byte sector count?
> */
> static inline int get_sector_shift(unsigned int size)
> {
> int shift = -1;
> unsigned int n = 256;
>
> do {
> shift++;
> } while ((n += n) < size);
> return shift;
> }
>
> which should generate good code on just about any architecture (it avoids
> actually using shifts on purpose), and I think the end result will end up
> being more readable (I'm not claiming that the "get_sector_shift()"
> implementation is readable, I'm claiming that the _users_ will be more
> readable).
This in fact exists virtually identically in blkdev.h::blksize_bits()
which I had a look at, but worried a bit about b_size != 2^n and also
the fact that this has branches and/or loop unwinding vs. the fixed
shift operations.
> Of course, even better would be to not have "b_size" at all, but use
> "b_shift", but we don't have that.
I was thinking exactly the same thing myself.
> And the sector shift calculation might be fast enough that it's even
> a win (turning a 64-bit multiply into a shift _tends_ to be better)
My thought was that the gratuitous 64-bit multiply in the 32-bit case
was offset by the fact that the comparison is easy. In the 64-bit case
we are already doing a 64-bit multiply so the goal is to make the
comparison as cheap as possible.
In the end, I don't really care about the exact mechanics of the check...
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Hi,
On Thu, 2006-05-18 at 17:23 -0600, Andreas Dilger wrote:
> I looked at that also, but it isn't clear from the use of "b_size" here
> that there is any constraint that b_size is a power of two, only that it
> is a multiple of 512. Granted, I don't know whether there are any users
> of such a crazy thing, but the fact that this is in bytes instead of a
> shift made me think twice.
Yeah. It was very strongly constrained to a power-of-two in the dim and
distant past, when buffer_heads were only ever used for true buffer-
cache data (the entire IO path had to be special-cased for IO that
wasn't from the buffer cache, such as swap IO.)
But more recently it has been a lot more relaxed, and we've had patches
like Jens' "varyIO" patches on 2.4 which routinely generated odd-sized
b_size buffer_heads when doing raw/direct IO on unaligned disk offsets.
But in 2.6, I _think_ such paths should be going straight to bio, not
via submit_bh. Direct IO certainly doesn't use bh's any more, and
pretty much any other normal disk IO paths are page-aligned. I might be
missing something, though.
--Stephen
"Stephen C. Tweedie" <[email protected]> wrote:
>
> Hi,
>
> On Thu, 2006-05-18 at 17:23 -0600, Andreas Dilger wrote:
>
> > I looked at that also, but it isn't clear from the use of "b_size" here
> > that there is any constraint that b_size is a power of two, only that it
> > is a multiple of 512. Granted, I don't know whether there are any users
> > of such a crazy thing, but the fact that this is in bytes instead of a
> > shift made me think twice.
>
> Yeah. It was very strongly constrained to a power-of-two in the dim and
> distant past, when buffer_heads were only ever used for true buffer-
> cache data (the entire IO path had to be special-cased for IO that
> wasn't from the buffer cache, such as swap IO.)
>
> But more recently it has been a lot more relaxed, and we've had patches
> like Jens' "varyIO" patches on 2.4 which routinely generated odd-sized
> b_size buffer_heads when doing raw/direct IO on unaligned disk offsets.
>
> But in 2.6, I _think_ such paths should be going straight to bio, not
> via submit_bh. Direct IO certainly doesn't use bh's any more, and
> pretty much any other normal disk IO paths are page-aligned. I might be
> missing something, though.
>
We use various values of b_size when using a buffer_head for gathering a
disk mapping. See, for example, fs/direct-io.c:get_more_blocks().
I don't think we ever play such tricks when a bh is used as an IO
container. But we might - I see nothing in submit_bh() which would prevent
it.
btw, it seems odd to me that we're trying to handle the
device-too-large-for-sector_t problem at the submit_bh() level. What
happens if someone uses submit_bio()? Isn't it something we can check at
mount time, or partition parsing, or...?
On May 19, 2006 13:11 -0700, Andrew Morton wrote:
> btw, it seems odd to me that we're trying to handle the
> device-too-large-for-sector_t problem at the submit_bh() level. What
> happens if someone uses submit_bio()? Isn't it something we can check at
> mount time, or partition parsing, or...?
Doing so at mount time will be added to ext3 shortly. That said, I'd still
like a check in the block layer to avoid the silent corruption possible if
the filesystem makes a mistake, or for those filesystems that haven't added
such a check yet, or new filesystems that don't in the future.
I saw that xfs and ocfs2 have some checks at fill_super time related
to sector_t and CONFIG_LBD, which I found AFTER I saw that ext3 already
had a problem and was searching for where CONFIG_LBD was used. Looking
at that code, however, it isn't even clear there that this is checking
for sector_t overflow, only what the maximum file size is, so they may
also be susceptible to the same corruption.
As for submit_bio() I don't think it is even POSSIBLE to check for overflow
at that stage because the bi_sector has already been truncated to sector_t.
It would have to be done in the caller of submit_bio(), as in submit_bh().
As for the kernel completely disallowing access to a block device larger
than 2TB when CONFIG_LBD isn't set is a separate question entirely.
Stephen suggested (and I can believe this happening, just did it myself
last week during testing on an 8TB+delta device) that a filesystem may be
formatted to only use the accessible part of the device. Having the block
layer fail IO beyond the limit is good, not being able to use the device
because it is just a bit bigger than the safe limit is not.
One extra suggestion that might be safe and acceptible all around is if
a device is larger than 2TB w/o a 64-bit sector_t that the block device
size itself be truncated in the kernel to 2TB-512. This at least prevents
userspace tools from trying to e.g. format a 3TB filesystem on a device
that will just corrupt the filesystem.
I don't think partitions on a larger device make any difference, since
the kernel will remap them to the final sector offset internally in the
end, and still overflow sector_t later (which could itself be checked
also). I don't think this is sufficient to avoid the submit_bh() check
though, since it doesn't address wrapping when calculating the sector.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Hi,
On Fri, 2006-05-19 at 13:11 -0700, Andrew Morton wrote:
> btw, it seems odd to me that we're trying to handle the
> device-too-large-for-sector_t problem at the submit_bh() level. What
> happens if someone uses submit_bio()?
The initial code we were trying to protect was the
bio->bi_sector = bh->b_blocknr * (bh->b_size >> 9);
in submit_bh(), which can take a blocknr that fits within 2^32 and
multiply it such that it overflows sector_t. That specific case doesn't
happen for submit_bio() simply because we're already taking input
counted in sectors in that case.
So for submit_bio(), we can't do it at IO time (at least not within the
block layer.) But...
> Isn't it something we can check at
> mount time, or partition parsing, or...?
Yes, we could and we should --- the recent ext2-devel >32-bit
discussions have already identified mount and resize as needing this
sort of attention. It's not just for >32-bit filesystems, either --- an
existing 31-bit ext3 filesystem can be up to 8TB with 4k blocks, and
that easily exceeds the addressing limit of sector_t on 32-bit hosts
without CONFIG_LBD.
I don't think we should be doing it at partition check time. We don't
want to unnecessarily hurt the user who created a LUN just a little
larger than 2TB and formatted a filesystem onto it that does actually
fit; or who has a <2TB filesystem, tries to lvextend it, and then finds
that the fs itself won't grow beyond 2TB. As long as the filesystem
itself fits into sector_t we should just allow access, so it's really
mount time, not partition time, when we need to check all of this.
--Stephen
Andreas Dilger <[email protected]> wrote:
>
> One extra suggestion that might be safe and acceptible all around is if
> a device is larger than 2TB w/o a 64-bit sector_t that the block device
> size itself be truncated in the kernel to 2TB-512. This at least prevents
> userspace tools from trying to e.g. format a 3TB filesystem on a device
> that will just corrupt the filesystem.
'twould be good if we could do something like that - doing it on every
single IO submission in submit_bh() Just Feels Wrong.
Also, there's always the option (or enhancement) of emitting lots of scary
warnings and then just proceeding.
Hi!
> > Why isn't that just a
> >
> > if (unlikely(sector != (sector_t)sector))
> >
> > and that's it? What does this have to do with CONFIG_LBD or BITS_PER_LONG,
> > or anything at all?
> >
> > If the sector number fits in a sector_t, we're all good.
>
> I think you missed that Andrewas said he is worried about 64-bit overflows
> as well. And you would not catch those with the sector !=
Can 64-bit really overflow? That's 16 000 Peta bytes, AFAICS. Does
anyone really have disk array over 100 Peta bytes? How much space does
Google have, for example?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html