Hi all,
In testing large (>4TB) device support on 2.6, I've been using a simple
write/verify test to check both block device and regular file
correctness.
Set to write 1MB poison patterns across the whole of a file until EOF is
encountered, it worked just fine on ext3: the last write came up short,
leaving the file at its largest permitted size of 0x1fffffff000 bytes
(2^32 512-byte sectors, minus a page). Verify works fine.
This 2^32 sector limit is set in ext3_max_size(), which has the comment
/*
* Maximal file size. There is a direct, and {,double-,triple-}indirect
* block limit, and also a limit of (2^32 - 1) 512-byte sectors in i_blocks.
* We need to be 1 filesystem block less than the 2^32 sector limit.
*/
Trouble is, that limit *should* be an i_blocks limit, because i_blocks
is still 32-bits, and (more importantly) is multiplied by the fs
blocksize / 512 in stat(2) to return st_blocks in 512-byte chunks.
Overflow 2^32 sectors in i_blocks and stat(2) wraps.
But i_blocks includes indirect blocks as well as data, so for a
non-sparse file we wrap stat(2) st_blocks well before the file is
2^32*512 bytes long. Yet ext3_max_size() doesn't understand this:
it simply caps the size with
if (res > (512LL << 32) - (1 << bits))
res = (512LL << 32) - (1 << bits);
so write() keeps writing past the wrap, resulting in a file which looks
like:
[root@host scratch]# ls -lh verif-file9.tmp
-rw-r--r-- 1 root root 2.0T Feb 10 05:49 verif-file9.tmp
[root@host scratch]# du -h verif-file9.tmp
2.1G verif-file9.tmp
Worse comes at e2fsck time: near the end of walking the indirect tree,
e2fsck decides that the file has grown too large, as in this fsck -n
output:
Pass 1: Checking inodes, blocks, and sizes
Inode 20 is too big. Truncate? no
Block #536346622 (980630816) causes file to be too big. IGNORED.
Block #536346623 (980630817) causes file to be too big. IGNORED.
Block #536346624 (980630818) causes file to be too big. IGNORED.
...
Whoops. e2fsck sees that st_blocks is too large at this point, and
decides that it wants to truncate the file to make it fit. So if a user
has legitimately created such a file, the next fsck will effectively
attempt to corrupt it.
So who is right? Should ext3 let the file grow that large?
For now, I think we need to constrain ext2/3 files so that i_blocks does
not exceed 2^32*512/blocksize. Even if we fix up all the stat() stuff
to pass back 64-bit st_blocks, we still have every e2fsck in existence
which will not be able to deal with those files. Eventually 64-bit
st_blocks would be good to have, but it needs an fs feature flag
to let e2fsck know about it.
--Stephen
On Feb 11, 2005 20:52 +0000, Stephen C. Tweedie wrote:
> /*
> * Maximal file size. There is a direct, and {,double-,triple-}indirect
> * block limit, and also a limit of (2^32 - 1) 512-byte sectors in i_blocks.
> * We need to be 1 filesystem block less than the 2^32 sector limit.
> */
>
> Trouble is, that limit *should* be an i_blocks limit, because i_blocks
> is still 32-bits, and (more importantly) is multiplied by the fs
> blocksize / 512 in stat(2) to return st_blocks in 512-byte chunks.
> Overflow 2^32 sectors in i_blocks and stat(2) wraps.
I agree. The problem AFAIR is that the i_blocks accounting is done in
the quota code, so it was a challenge to get it right, and the i_size
limit was easier to do. Until now I don't think anyone has created
dense 2TB files, so the sparse limit was enough.
Fixing this to count i_blocks correctly would also allow us to have
larger sparse files (up to the indirect limit).
Note also that there was a patch to extend i_blocks floating around
(pretty small hack to use one of the reserved fields), and it might make
sense to get this into the kernel before we actually need it.
> But i_blocks includes indirect blocks as well as data, so for a
> non-sparse file we wrap stat(2) st_blocks well before the file is
> 2^32*512 bytes long. Yet ext3_max_size() doesn't understand this:
> it simply caps the size with
>
> if (res > (512LL << 32) - (1 << bits))
> res = (512LL << 32) - (1 << bits);
So, for the quick fix we could reduce this by the number of expected
[td]indirect blocks and submit that to 2.4 also.
Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://members.shaw.ca/adilger/ http://members.shaw.ca/golinux/
Hi,
On Fri, 2005-02-11 at 21:27, Andreas Dilger wrote:
> > Trouble is, that limit *should* be an i_blocks limit, because i_blocks
> > is still 32-bits, and (more importantly) is multiplied by the fs
> > blocksize / 512 in stat(2) to return st_blocks in 512-byte chunks.
> > Overflow 2^32 sectors in i_blocks and stat(2) wraps.
>
> I agree. The problem AFAIR is that the i_blocks accounting is done in
> the quota code, so it was a challenge to get it right, and the i_size
> limit was easier to do.
The i_size limit is also wrong for dense files; I'd be satisfied with
just getting it right! i_blocks handling through the quota calls is
cleaner these days, but I don't think that's a particularly satisfactory
solution --- reaching maximum file size has all sorts of specific
semantics such as sending SIGXFSZ which you don't really want to have to
replicate.
> Until now I don't think anyone has created
> dense 2TB files, so the sparse limit was enough.
Yep.
> Note also that there was a patch to extend i_blocks floating around
> (pretty small hack to use one of the reserved fields), and it might make
> sense to get this into the kernel before we actually need it.
True, but it's not really a problem right now --- i_blocks is counted in
fs blocksize units, so we're nowhere near overflowing that. It's only
when stat() converts it to st_blocks' 512-byte units that we get into
trouble within the kernel.
> > if (res > (512LL << 32) - (1 << bits))
> > res = (512LL << 32) - (1 << bits);
>
> So, for the quick fix we could reduce this by the number of expected
> [td]indirect blocks and submit that to 2.4 also.
Agreed.
--Stephen
On Feb 11, 2005 21:39 +0000, Stephen C. Tweedie wrote:
> ...i_blocks is counted in fs blocksize units, so we're nowhere near
> overflowing that. It's only when stat() converts it to st_blocks'
> 512-byte units that we get into trouble within the kernel.
Umm, I don't think so. ext3 i_blocks is sectors and not fs blocks (one of
my pet peeves actually). In 2.4 it is as below, 2.6 has one more copy.
ext3_read_inode()
{
        :
        inode->i_blocks = le32_to_cpu(raw_inode->i_blocks);

cp_new_stat()
{
        :
        tmp.st_blocks = inode->i_blocks;
I've wondered at times whether it might make sense to store i_blocks in
fs blocksize units when we add some new feature (e.g. high bits for
i_blocks if we overflow 2^32) but I'm not sure the increased complexity
makes up for the minor increase in dynamic range.
In the end, we hit the 2^64 fs size limit before we would run out of
range for i_blocks (assuming 64 bits there) so changing it doesn't help
much. The only reason to change would be to store up to 2^48 fs blocks
(only using 16 bits in the core inode, e.g. i_frag + i_fsize) and assume
we need to use 2^16 blocksize for the largest files with extents.
Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://members.shaw.ca/adilger/ http://members.shaw.ca/golinux/