2006-01-18 13:06:37

by Takashi Sato

Subject: [PATCH] ext3: Extends blocksize up to pagesize

Hi,

As disks keep getting larger, storage systems now offer multi-TB
capacity. But currently, ext3 can't support a filesystem larger than
8TB when the blocksize is 4KB. That's why I think ext3 needs to
support more than 8TB.

Therefore I think the filesystem size can be increased on architectures
which have a pagesize larger than 4KB by extending the blocksize up to
the pagesize on ext3. For example, the following is the case for ia64.
(Blocksizes up to the pagesize are already supported on ext2. Why is
the max blocksize restricted to 4KB on ext3?)

Max filesystem size on ia64:
  Original:                                      4096 (blocksize) * 2^31 =   8TB
  After modification [pagesize=16KB (default)]: 16384 (blocksize) * 2^31 =  32TB
  After modification [pagesize=64KB (max)]:     65536 (blocksize) * 2^31 = 128TB
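
The figures above can be checked with a few lines of userspace C. This is only a sketch of the arithmetic, assuming 2^31 addressable blocks as stated in the table; the helper names `max_fs_bytes` and `bytes_to_tib` are illustrative and are not kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative helper, not a kernel symbol: maximum filesystem size in
 * bytes, assuming 2^31 addressable blocks as in the table above. */
static uint64_t max_fs_bytes(uint32_t blocksize)
{
    /* Block sizes considered here range from 1KB to 64KB. */
    assert(blocksize >= 1024 && blocksize <= 65536);
    return (uint64_t)blocksize << 31;
}

/* Convert a byte count to whole tebibytes (2^40 bytes). */
static uint64_t bytes_to_tib(uint64_t bytes)
{
    return bytes >> 40;
}
```

Calling `bytes_to_tib(max_fs_bytes(4096))` gives 8, and the 16KB and 64KB block sizes give 32 and 128 respectively.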

The following are the details of the modification.
- ext3_fill_super
When checking the blocksize at mount time, allow a blocksize up to the
pagesize.

- ext3_readdir
Currently ext3 reads ahead 16 sectors when reading a directory, but does
no read-ahead if the blocksize is more than 8KB. So I modified it to
read ahead one fs-block if the blocksize is more than 8KB.
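
To see why the read-ahead disappears above 8KB, the two loop-count expressions from the patch below can be compared in userspace. This is a sketch only: the function names are hypothetical, and `SECTOR_BITS` stands in for the `EXT3_SECTOR_BITS` constant the patch introduces.

```c
#include <assert.h>

#define SECTOR_BITS 9 /* log2(512), standing in for EXT3_SECTOR_BITS */

/* Original expression: 16 sectors converted to filesystem blocks. */
static int readahead_blocks_orig(int block_size_bits)
{
    assert(block_size_bits >= 10 && block_size_bits <= 16);
    return 16 >> (block_size_bits - SECTOR_BITS);
}

/* Patched expression: one block's worth of sectors above 8KB, else 16. */
static int readahead_blocks_patched(int block_size_bits)
{
    int blocksize = 1 << block_size_bits;
    int readcnt;

    assert(block_size_bits >= 10 && block_size_bits <= 16);
    if (blocksize > 8192)
        readcnt = blocksize >> SECTOR_BITS;
    else
        readcnt = 16;
    return readcnt >> (block_size_bits - SECTOR_BITS);
}
```

For a 16KB blocksize (14 bits) the original expression yields zero blocks, while the patched one yields one; at 8KB and below both give the same count.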

Any feedback and comments are welcome.

Signed-off-by: Takashi Sato <[email protected]>
---

diff -uprN -X linux-2.6.16-rc1.org/Documentation/dontdiff linux-2.6.16-rc1.org/fs/ext3/dir.c linux-2.6.16-rc1-bigblk/fs/ext3/dir.c
--- linux-2.6.16-rc1.org/fs/ext3/dir.c 2006-01-18 08:10:26.000000000 +0900
+++ linux-2.6.16-rc1-bigblk/fs/ext3/dir.c 2006-01-18 08:12:05.000000000 +0900
@@ -142,7 +142,13 @@ static int ext3_readdir(struct file * fi
* Do the readahead
*/
if (!offset) {
- for (i = 16 >> (EXT3_BLOCK_SIZE_BITS(sb) - 9), num = 0;
+ int readcnt;
+ if (sb->s_blocksize > 8192) {
+ readcnt = sb->s_blocksize >> EXT3_SECTOR_BITS;
+ } else {
+ readcnt = 16;
+ }
+ for (i = readcnt >> (EXT3_BLOCK_SIZE_BITS(sb) - 9), num = 0;
i > 0; i--) {
tmp = ext3_getblk (NULL, inode, ++blk, 0, &err);
if (tmp && !buffer_uptodate(tmp) &&
diff -uprN -X linux-2.6.16-rc1.org/Documentation/dontdiff linux-2.6.16-rc1.org/fs/ext3/super.c linux-2.6.16-rc1-bigblk/fs/ext3/super.c
--- linux-2.6.16-rc1.org/fs/ext3/super.c 2006-01-18 08:10:26.000000000 +0900
+++ linux-2.6.16-rc1-bigblk/fs/ext3/super.c 2006-01-18 08:12:05.000000000 +0900
@@ -1461,6 +1461,7 @@ static int ext3_fill_super (struct super
blocksize = BLOCK_SIZE << le32_to_cpu(es->s_log_block_size);

if (blocksize < EXT3_MIN_BLOCK_SIZE ||
+ blocksize > PAGE_SIZE ||
blocksize > EXT3_MAX_BLOCK_SIZE) {
printk(KERN_ERR
"EXT3-fs: Unsupported filesystem blocksize %d on %s.\n",
diff -uprN -X linux-2.6.16-rc1.org/Documentation/dontdiff linux-2.6.16-rc1.org/include/linux/ext2_fs.h linux-2.6.16-rc1-bigblk/include/linux/ext2_fs.h
--- linux-2.6.16-rc1.org/include/linux/ext2_fs.h 2006-01-18 08:10:56.000000000 +0900
+++ linux-2.6.16-rc1-bigblk/include/linux/ext2_fs.h 2006-01-18 08:12:05.000000000 +0900
@@ -90,7 +90,7 @@ static inline struct ext2_sb_info *EXT2_
* Macro-instructions used to manage several block sizes
*/
#define EXT2_MIN_BLOCK_SIZE 1024
-#define EXT2_MAX_BLOCK_SIZE 4096
+#define EXT2_MAX_BLOCK_SIZE 65536
#define EXT2_MIN_BLOCK_LOG_SIZE 10
#ifdef __KERNEL__
# define EXT2_BLOCK_SIZE(s) ((s)->s_blocksize)
diff -uprN -X linux-2.6.16-rc1.org/Documentation/dontdiff linux-2.6.16-rc1.org/include/linux/ext3_fs.h linux-2.6.16-rc1-bigblk/include/linux/ext3_fs.h
--- linux-2.6.16-rc1.org/include/linux/ext3_fs.h 2006-01-18 08:10:53.000000000 +0900
+++ linux-2.6.16-rc1-bigblk/include/linux/ext3_fs.h 2006-01-18 20:13:36.309026768 +0900
@@ -84,7 +84,7 @@ struct statfs;
* Macro-instructions used to manage several block sizes
*/
#define EXT3_MIN_BLOCK_SIZE 1024
-#define EXT3_MAX_BLOCK_SIZE 4096
+#define EXT3_MAX_BLOCK_SIZE 65536
#define EXT3_MIN_BLOCK_LOG_SIZE 10
#ifdef __KERNEL__
# define EXT3_BLOCK_SIZE(s) ((s)->s_blocksize)
@@ -97,6 +97,7 @@ struct statfs;
#else
# define EXT3_BLOCK_SIZE_BITS(s) ((s)->s_log_block_size + 10)
#endif
+#define EXT3_SECTOR_BITS 9 /* log2(SECTOR_SIZE) */
#ifdef __KERNEL__
#define EXT3_ADDR_PER_BLOCK_BITS(s) (EXT3_SB(s)->s_addr_per_block_bits)
#define EXT3_INODE_SIZE(s) (EXT3_SB(s)->s_inode_size)
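
The effect of the super.c hunk above, together with the raised EXT3_MAX_BLOCK_SIZE, can be modelled in userspace roughly as follows. This is a hypothetical sketch, not the kernel code: `blocksize_supported` and its `page_size` parameter are made-up names, with `page_size` standing in for the kernel's PAGE_SIZE.

```c
#include <assert.h>
#include <stdbool.h>

#define MIN_BLOCK_SIZE 1024   /* EXT3_MIN_BLOCK_SIZE */
#define MAX_BLOCK_SIZE 65536  /* EXT3_MAX_BLOCK_SIZE after the patch */

/* Hypothetical userspace model of the mount-time check.  log_block_size
 * is the superblock's s_log_block_size; page_size stands in for PAGE_SIZE. */
static bool blocksize_supported(unsigned int log_block_size,
                                unsigned int page_size)
{
    unsigned int blocksize;

    assert(log_block_size < 16);  /* keep the shift below well defined */
    blocksize = 1024u << log_block_size;
    return blocksize >= MIN_BLOCK_SIZE &&
           blocksize <= page_size &&
           blocksize <= MAX_BLOCK_SIZE;
}
```

Under this model a 16KB-block filesystem is rejected on a machine with 4KB pages but accepted with 16KB pages, and anything above 64KB is rejected regardless of page size.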



2006-01-18 15:48:10

by John Stoffel

Subject: Re: [PATCH] ext3: Extends blocksize up to pagesize


Takashi> As disks keep getting larger, storage systems now offer
Takashi> multi-TB capacity. But currently, ext3 can't support a
Takashi> filesystem larger than 8TB when the blocksize is 4KB. That's
Takashi> why I think ext3 needs to support more than 8TB.

Man, I don't want to even think about doing an FSCK on an 8TB
filesystem running ext[23] at all.

In that size range, you really need a filesystem which doesn't need an
FSCK at all. Not sure what the real answer is though...

John

2006-01-18 18:52:53

by Andreas Dilger

Subject: Re: [Ext2-devel] [PATCH] ext3: Extends blocksize up to pagesize

On Jan 18, 2006 22:06 +0900, Takashi Sato wrote:
> As disks keep getting larger, storage systems now offer multi-TB
> capacity. But currently, ext3 can't support a filesystem larger than
> 8TB when the blocksize is 4KB. That's why I think ext3 needs to
> support more than 8TB.
>
> Therefore I think the filesystem size can be increased on architectures
> which have a pagesize larger than 4KB by extending the blocksize up to
> the pagesize on ext3. For example, the following is the case for ia64.
> (Blocksizes up to the pagesize are already supported on ext2. Why is
> the max blocksize restricted to 4KB on ext3?)
>
> Max filesystem size on ia64:
>   Original:                                      4096 (blocksize) * 2^31 =   8TB
>   After modification [pagesize=16KB (default)]: 16384 (blocksize) * 2^31 =  32TB
>   After modification [pagesize=64KB (max)]:     65536 (blocksize) * 2^31 = 128TB

Just for others' info - the fill_super change has been tested in the past
by Sonny Rao at IBM also. e2fsprogs has supported this for a long time
already.

> - ext3_readdir
> Currently ext3 reads ahead 16 sectors when reading a directory, but does
> no read-ahead if the blocksize is more than 8KB. So I modified it to
> read ahead one fs-block if the blocksize is more than 8KB.

> @@ -142,7 +142,13 @@ static int ext3_readdir(struct file * fi
> if (!offset) {
> - for (i = 16 >> (EXT3_BLOCK_SIZE_BITS(sb) - 9), num = 0;
> + int readcnt;
> + if (sb->s_blocksize > 8192) {
> + readcnt = sb->s_blocksize >> EXT3_SECTOR_BITS;
> + } else {
> + readcnt = 16;
> + }
> + for (i = readcnt >> (EXT3_BLOCK_SIZE_BITS(sb) - 9), num = 0;
> i > 0; i--) {
> tmp = ext3_getblk (NULL, inode, ++blk, 0, &err);
> if (tmp && !buffer_uptodate(tmp) &&

The code doesn't get any clearer with your change. Using "sectors" as
a readahead unit is kind of pointless in the first place (that isn't your
fault, not sure why it is like that), and 8kB readahead is probably too
small for today's disks.

It would probably make more sense to just change this to be straightforward:
"readcnt = 16 * sb->s_blocksize" or "readcnt = 64 >> EXT3_BLOCK_SIZE_BITS(sb)".

Usually directories are small, so this is a non-issue. If they are large
you likely want to readahead more anyways, especially since directory blocks
are usually fragmented and more readahead time is better.


Note that this should probably also include an ext2 version of the same
change so that it is still possible to mount such a filesystem as ext2.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2006-01-21 07:10:42

by Andrew Morton

Subject: Re: [Ext2-devel] [PATCH] ext3: Extends blocksize up to pagesize

Andreas Dilger <[email protected]> wrote:
>
> On Jan 18, 2006 22:06 +0900, Takashi Sato wrote:
> > As disks keep getting larger, storage systems now offer multi-TB
> > capacity. But currently, ext3 can't support a filesystem larger than
> > 8TB when the blocksize is 4KB. That's why I think ext3 needs to
> > support more than 8TB.
> >
> > Therefore I think the filesystem size can be increased on architectures
> > which have a pagesize larger than 4KB by extending the blocksize up to
> > the pagesize on ext3. For example, the following is the case for ia64.
> > (Blocksizes up to the pagesize are already supported on ext2. Why is
> > the max blocksize restricted to 4KB on ext3?)
> >
> > Max filesystem size on ia64:
> >   Original:                                      4096 (blocksize) * 2^31 =   8TB
> >   After modification [pagesize=16KB (default)]: 16384 (blocksize) * 2^31 =  32TB
> >   After modification [pagesize=64KB (max)]:     65536 (blocksize) * 2^31 = 128TB
>
> Just for others' info - the fill_super change has been tested in the past
> by Sonny Rao at IBM also. e2fsprogs has supported this for a long time
> already.

I have a vague memory that there's some piece of metadata (per-block-group
info, I think) which will overflow at 8kb blocksize. I say this in the
hope that you'll remember what it was ;)


> > - ext3_readdir
> > Currently ext3 reads ahead 16 sectors when reading a directory, but does
> > no read-ahead if the blocksize is more than 8KB. So I modified it to
> > read ahead one fs-block if the blocksize is more than 8KB.

I've rewritten ext3 directory readahead to use the generic pagecache
functions, so changes here shouldn't be needed.

But I haven't yet got around to performance-testing it.

ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.16-rc1/2.6.16-rc1-mm2/broken-out/ext3_readdir-use-generic-readahead.patch

2006-01-22 21:03:13

by Theodore Ts'o

Subject: Re: [Ext2-devel] [PATCH] ext3: Extends blocksize up to pagesize

On Fri, Jan 20, 2006 at 11:10:16PM -0800, Andrew Morton wrote:
> Andreas Dilger <[email protected]> wrote:
> > Just for others' info - the fill_super change has been tested in the past
> > by Sonny Rao at IBM also. e2fsprogs has supported this for a long time
> > already.
>
> I have a vague memory that there's some piece of metadata (per-block-group
> info, I think) which will overflow at 8kb blocksize. I say this in the
> hope that you'll remember what it was ;)

The limiting factors are bg_free_blocks_count (and to some extent,
possibly bg_free_inodes_count), which are 16 bit fields. At 8kb, the
default block group size is 8kb * 8 bits/byte == 65536 blocks. At
16kb, a block group size of 131072 would overflow
bg_free_blocks_count. You could of course artificially limit the
block group size to 65536 for block sizes >= 16kb. The better thing
to do would be to use the extra space in the per-block group metadata
to extend those fields to 32 bits.
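
The overflow can be illustrated with a small userspace sketch. Everything named here is hypothetical; in particular `MIN_GROUP_OVERHEAD` is an assumption (block bitmap, inode bitmap and at least one inode-table block always in use per group), which is why the free count still fits at 8kb even though the nominal group size of 65536 is one more than a 16-bit field can hold.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed minimum per-group overhead: block bitmap, inode bitmap and at
 * least one inode-table block.  This figure is an illustrative assumption. */
#define MIN_GROUP_OVERHEAD 3u

/* Blocks covered by one block bitmap: one bit per block, 8 bits per byte. */
static uint32_t default_blocks_per_group(uint32_t blocksize)
{
    assert(blocksize >= 1024 && blocksize <= 65536);
    return blocksize * 8;
}

/* Largest free-block count a group of the default size could report. */
static uint32_t max_free_blocks(uint32_t blocksize)
{
    return default_blocks_per_group(blocksize) - MIN_GROUP_OVERHEAD;
}

/* True if that count fits in a 16-bit field such as bg_free_blocks_count. */
static bool free_count_fits_u16(uint32_t blocksize)
{
    return max_free_blocks(blocksize) <= UINT16_MAX;
}
```

With these assumptions `free_count_fits_u16()` holds for 4kb and 8kb blocks and fails from 16kb upward, matching the description above.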

- Ted

2006-01-22 21:03:25

by Theodore Ts'o

Subject: Re: [PATCH] ext3: Extends blocksize up to pagesize

On Wed, Jan 18, 2006 at 10:48:06AM -0500, John Stoffel wrote:
>
> Takashi> As disks keep getting larger, storage systems now offer
> Takashi> multi-TB capacity. But currently, ext3 can't support a
> Takashi> filesystem larger than 8TB when the blocksize is 4KB. That's
> Takashi> why I think ext3 needs to support more than 8TB.
>
> Man, I don't want to even think about doing an FSCK on an 8TB
> filesystem running ext[23] at all.
>
> In that size range, you really need a filesystem which doesn't need an
> FSCK at all. Not sure what the real answer is though...

Ext3 doesn't require a fsck under normal circumstances. The only
reason why it still requires a periodic fsck after some number of
mounts is sheer paranoia about the reliability of PC class hardware.
All filesystems need some kind of filesystem consistency checker to
deal with filesystem corruptions caused by OS bugs or hardware
corruption bugs. The only question is whether or not the filesystem
assumes at a fundamental level that the hardware can be trusted to be
reliable. (People have claimed that XFS is much less robust in the
face of hardware errors when compared to ext[23]; I haven't seen a
definitive study on the issue, although that tends to correspond with
my experience. Other people would say it doesn't matter because
that's why you pay $$$$$ for an EMC Symmetrix box or an IBM
Shark/DS6000/DS8000, or some other Really Expensive Storage
Hardware.)

But if you're willing to assume that your hardware is reliable and
never fails, hey, feel free to disable the periodic FSCK checking
using the command: "tune2fs -c 0 -i 0 /dev/sdXXX".

- Ted

2006-01-23 05:38:42

by Andreas Dilger

Subject: Re: [Ext2-devel] [PATCH] ext3: Extends blocksize up to pagesize

On Jan 20, 2006 23:10 -0800, Andrew Morton wrote:
> Andreas Dilger <[email protected]> wrote:
> >
> > On Jan 18, 2006 22:06 +0900, Takashi Sato wrote:
> > > As disks keep getting larger, storage systems now offer multi-TB
> > > capacity. But currently, ext3 can't support a filesystem larger than
> > > 8TB when the blocksize is 4KB. That's why I think ext3 needs to
> > > support more than 8TB.
> > >
> > > Therefore I think the filesystem size can be increased on architectures
> > > which have a pagesize larger than 4KB by extending the blocksize up to
> > > the pagesize on ext3. For example, the following is the case for ia64.
> > > (Blocksizes up to the pagesize are already supported on ext2. Why is
> > > the max blocksize restricted to 4KB on ext3?)
> > >
> > > Max filesystem size on ia64:
> > >   Original:                                      4096 (blocksize) * 2^31 =   8TB
> > >   After modification [pagesize=16KB (default)]: 16384 (blocksize) * 2^31 =  32TB
> > >   After modification [pagesize=64KB (max)]:     65536 (blocksize) * 2^31 = 128TB
> >
> > Just for others' info - the fill_super change has been tested in the past
> > by Sonny Rao at IBM also. e2fsprogs has supported this for a long time
> > already.
>
> I have a vague memory that there's some piece of metadata (per-block-group
> info, I think) which will overflow at 8kb blocksize. I say this in the
> hope that you'll remember what it was ;)

These are the per-group free block and free inode counts in the group
descriptor. The current mke2fs code limits the number of blocks and
inodes per group to EXT2_MAX_BLOCKS_PER_GROUP (2^16 - 8) and
(2^16 - EXT2_INODES_PER_BLOCK) so that they won't overflow. We still
get linear growth of filesystem limits with blocksize, but not the
cubic growth that would otherwise be there.
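
A rough userspace sketch of that clamping, for illustration only: the 2^16 - 8 cap is the figure quoted above, while the function names and the idea of taking the smaller of the bitmap capacity and the cap are assumptions about how such a limit would be applied, not a copy of the mke2fs code.

```c
#include <assert.h>
#include <stdint.h>

/* The mke2fs cap quoted above: 2^16 - 8 blocks per group. */
#define MAX_BLOCKS_PER_GROUP ((1u << 16) - 8)

/* Blocks per group: bitmap capacity (8 bits per byte of one block),
 * clamped so a 16-bit free-block count cannot overflow. */
static uint32_t capped_blocks_per_group(uint32_t blocksize)
{
    uint32_t bitmap_capacity;

    assert(blocksize >= 1024 && blocksize <= 65536);
    bitmap_capacity = blocksize * 8;
    return bitmap_capacity < MAX_BLOCKS_PER_GROUP ? bitmap_capacity
                                                  : MAX_BLOCKS_PER_GROUP;
}

/* Bytes spanned by one group; grows only linearly once the cap applies. */
static uint64_t group_bytes(uint32_t blocksize)
{
    return (uint64_t)capped_blocks_per_group(blocksize) * blocksize;
}
```

Once the cap applies (8KB blocks and larger in this sketch), doubling the blocksize merely doubles the bytes covered by one group.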

Bull has a patch which enlarges the group descriptor fields to allow
more blocks and inodes per group.

> I've rewritten ext3 directory readahead to use the generic pagecache
> functions, so changes here shouldn't be needed.
>
> But I haven't yet got around to performance-testing it.
>
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.16-rc1/2.6.16-rc1-mm2/broken-out/ext3_readdir-use-generic-readahead.patch

Hmm, I'll have to take a look at that when I get a chance.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2006-01-23 05:45:18

by Andreas Dilger

Subject: Re: [Ext2-devel] Re: [PATCH] ext3: Extends blocksize up to pagesize

On Jan 22, 2006 13:28 -0500, Theodore Ts'o wrote:
> On Wed, Jan 18, 2006 at 10:48:06AM -0500, John Stoffel wrote:
> > In that size range, you really need a filesystem which doesn't need an
> > FSCK at all. Not sure what the real answer is though...
>
> Ext3 doesn't require a fsck under normal circumstances. The only
> reason why it still requires a periodic fsck after some number of
> mounts is sheer paranoia about the reliability of PC class hardware.
> All filesystems need some kind of filesystem consistency checker to
> deal with filesystem corruptions caused by OS bugs or hardware
> corruption bugs. The only question is whether or not the filesystem
> assumes at a fundamental level that the hardware can be trusted to be
> reliable. (People have claimed that XFS is much less robust in the
> face of hardware errors when compared to ext[23]; I haven't seen a
> definitive study on the issue, although that tends to correspond with
> my experience. Other people would say it doesn't matter because
> that's why you pay $$$$$ for an EMC Symmetrix box or an IBM
> Shark/DS6000/DS8000, or some other Really Expensive Storage
> Hardware.)

I think the work done by the U. Wisconsin group for IRON ext3 is the
way to go (namely checksumming of filesystem metadata, and possibly
some level of redundancy). This gives us concrete checks on what metadata
is valid and the filesystem can avoid any (or further) corruption when
the hardware goes bad. The existing ext3 code already has these checks,
but as filesystems get larger the validity of a block number of an inode
is harder to check because any value may be correct. Given that CPU
speed is growing orders of magnitude faster than disk IO the overhead of
checksumming is a reasonable thing to do these days (optionally, of course).
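
As a rough illustration of what metadata checksumming involves, here is a minimal userspace sketch. CRC-32 is an assumption chosen for brevity; the thread does not say which checksum IRON ext3 uses, and `crc32_buf` and `metadata_block_valid` are hypothetical names rather than ext3 functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320), chosen for brevity. */
static uint32_t crc32_buf(const unsigned char *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    size_t i;
    int bit;

    assert(buf != NULL || len == 0);
    for (i = 0; i < len; i++) {
        crc ^= buf[i];
        /* Shift out one bit at a time, folding in the polynomial
         * whenever the bit shifted out was set. */
        for (bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Returns 1 if the block still matches the checksum stored for it. */
static int metadata_block_valid(const unsigned char *block, size_t len,
                                uint32_t stored_crc)
{
    return crc32_buf(block, len) == stored_crc;
}
```

The checksum would be computed when a metadata block is written and verified when it is read back, so a block damaged by the hardware is detected instead of being trusted.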

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2006-01-23 08:25:27

by Helge Hafting

Subject: Re: [PATCH] ext3: Extends blocksize up to pagesize

John Stoffel wrote:

>Takashi> As disks keep getting larger, storage systems now offer
>Takashi> multi-TB capacity. But currently, ext3 can't support a
>Takashi> filesystem larger than 8TB when the blocksize is 4KB. That's
>Takashi> why I think ext3 needs to support more than 8TB.
>
>Man, I don't want to even think about doing an FSCK on an 8TB
>filesystem running ext[23] at all.
>
>In that size range, you really need a filesystem which doesn't need an
>FSCK at all. Not sure what the real answer is though...
>
>
An fs that allows consistency checking while in use, perhaps?
Then it doesn't matter if it takes days, for you can use the fs
in the meantime.

Helge Hafting

2006-01-23 20:36:06

by folkert

Subject: Re: [Ext2-devel] Re: [PATCH] ext3: Extends blocksize up to pagesize

> I think the work done by the U. Wisconsin group for IRON ext3 is the
> way to go (namely checksumming of filesystem metadata, and possibly
> some level of redundancy). This gives us concrete checks on what metadata
> is valid and the filesystem can avoid any (or further) corruption when
> the hardware goes bad. The existing ext3 code already has these checks,
> but as filesystems get larger the validity of a block number of an inode
> is harder to check because any value may be correct. Given that CPU
> speed is growing orders of magnitude faster than disk IO the overhead of
> checksumming is a reasonable thing to do these days (optionally, of course).

Then please make it optional per mount point.
E.g.: I don't care if the filesystem holding the filestore of my Squid
setup goes bad (mke2fs will fix it just nicely) but I would get upset
if its OS filesystem got corrupted.


Folkert van Heusden

--
Ever wonder what is out there? Any alien races? Then please support
the seti@home project: setiathome.ssl.berkeley.edu
----------------------------------------------------------------------
Phone: +31-6-41278122, PGP-key: 1F28D8AE, http://www.vanheusden.com