2006-01-05 10:04:39

by Takashi Sato

[permalink] [raw]
Subject: [PATCH 2/3] Fix problems on multi-TB filesystem and file

This is a patch to add blkcnt_t as the type of inode.i_blocks.
This enables you to make the size of blkcnt_t either 4 bytes or 8 bytes
on 32 bits architecture with CONFIG_LSF.

The content of the patch attached to this mail is below.
- CONFIG_LSF
Add new configuration parameter.
- blkcnt_t
On h8300, i386, mips, powerpc, s390 and sh that define sector_t,
blkcnt_t is defined as u64 if CONFIG_LSF is enabled; otherwise it is
defined as unsigned long.
On other architectures, it is defined as unsigned long.
- inode.i_blocks
Change the type from sector_t to blkcnt_t.

Any feedback and comments are welcome.

Signed-off-by: Takashi Sato <[email protected]>

diff -uprN -X linux-2.6.15-iblocks/Documentation/dontdiff linux-2.6.15-iblocks/block/Kconfig linux-2.6.15-lsf/block/Kconfig
--- linux-2.6.15-iblocks/block/Kconfig 2006-01-03 12:21:10.000000000 +0900
+++ linux-2.6.15-lsf/block/Kconfig 2006-01-04 15:37:17.000000000 +0900
@@ -6,9 +6,19 @@
config LBD
bool "Support for Large Block Devices"
depends on X86 || (MIPS && 32BIT) || PPC32 || ARCH_S390_31 || SUPERH || UML
+ select LSF
help
Say Y here if you want to attach large (bigger than 2TB) discs to
your machine, or if you want to have a raid or loopback device
bigger than 2TB. Otherwise say N.

+config LSF
+ bool "Support for Large Single Files"
+ depends on X86 || (MIPS && 32BIT) || PPC32 || ARCH_S390_31 || SUPERH || UML
+ default n
+ help
+ When CONFIG_LBD is disabled, say Y here if you want to
+ handle large file(bigger than 2TB), otherwise say N.
+ When CONFIG_LBD is enabled, Y is set automatically.
+
source block/Kconfig.iosched
diff -uprN -X linux-2.6.15-iblocks/Documentation/dontdiff linux-2.6.15-iblocks/include/asm-h8300/types.h
linux-2.6.15-lsf/include/asm-h8300/types.h
--- linux-2.6.15-iblocks/include/asm-h8300/types.h 2006-01-03 12:21:10.000000000 +0900
+++ linux-2.6.15-lsf/include/asm-h8300/types.h 2006-01-04 15:37:17.000000000 +0900
@@ -58,6 +58,9 @@ typedef u32 dma_addr_t;
#define HAVE_SECTOR_T
typedef u64 sector_t;

+#define HAVE_BLKCNT_T
+typedef u64 blkcnt_t;
+
#endif /* __KERNEL__ */

#endif /* __ASSEMBLY__ */
diff -uprN -X linux-2.6.15-iblocks/Documentation/dontdiff linux-2.6.15-iblocks/include/asm-i386/types.h
linux-2.6.15-lsf/include/asm-i386/types.h
--- linux-2.6.15-iblocks/include/asm-i386/types.h 2006-01-03 12:21:10.000000000 +0900
+++ linux-2.6.15-lsf/include/asm-i386/types.h 2006-01-04 15:37:17.000000000 +0900
@@ -63,6 +63,11 @@ typedef u64 sector_t;
#define HAVE_SECTOR_T
#endif

+#ifdef CONFIG_LSF
+typedef u64 blkcnt_t;
+#define HAVE_BLKCNT_T
+#endif
+
#endif /* __ASSEMBLY__ */

#endif /* __KERNEL__ */
diff -uprN -X linux-2.6.15-iblocks/Documentation/dontdiff linux-2.6.15-iblocks/include/asm-mips/types.h
linux-2.6.15-lsf/include/asm-mips/types.h
--- linux-2.6.15-iblocks/include/asm-mips/types.h 2006-01-03 12:21:10.000000000 +0900
+++ linux-2.6.15-lsf/include/asm-mips/types.h 2006-01-04 15:37:17.000000000 +0900
@@ -99,6 +99,11 @@ typedef u64 sector_t;
#define HAVE_SECTOR_T
#endif

+#ifdef CONFIG_LSF
+typedef u64 blkcnt_t;
+#define HAVE_BLKCNT_T
+#endif
+
#endif /* __ASSEMBLY__ */

#endif /* __KERNEL__ */
diff -uprN -X linux-2.6.15-iblocks/Documentation/dontdiff linux-2.6.15-iblocks/include/asm-powerpc/types.h
linux-2.6.15-lsf/include/asm-powerpc/types.h
--- linux-2.6.15-iblocks/include/asm-powerpc/types.h 2006-01-03 12:21:10.000000000 +0900
+++ linux-2.6.15-lsf/include/asm-powerpc/types.h 2006-01-04 15:37:17.000000000 +0900
@@ -103,6 +103,11 @@ typedef u64 sector_t;
#define HAVE_SECTOR_T
#endif

+#ifdef CONFIG_LSF
+typedef u64 blkcnt_t;
+#define HAVE_BLKCNT_T
+#endif
+
#endif /* __ASSEMBLY__ */

#endif /* __KERNEL__ */
diff -uprN -X linux-2.6.15-iblocks/Documentation/dontdiff linux-2.6.15-iblocks/include/asm-s390/types.h
linux-2.6.15-lsf/include/asm-s390/types.h
--- linux-2.6.15-iblocks/include/asm-s390/types.h 2006-01-03 12:21:10.000000000 +0900
+++ linux-2.6.15-lsf/include/asm-s390/types.h 2006-01-04 15:37:17.000000000 +0900
@@ -93,6 +93,11 @@ typedef u64 sector_t;
#define HAVE_SECTOR_T
#endif

+#ifdef CONFIG_LSF
+typedef u64 blkcnt_t;
+#define HAVE_BLKCNT_T
+#endif
+
#endif /* ! __s390x__ */
#endif /* __ASSEMBLY__ */
#endif /* __KERNEL__ */
diff -uprN -X linux-2.6.15-iblocks/Documentation/dontdiff linux-2.6.15-iblocks/include/asm-sh/types.h
linux-2.6.15-lsf/include/asm-sh/types.h
--- linux-2.6.15-iblocks/include/asm-sh/types.h 2006-01-03 12:21:10.000000000 +0900
+++ linux-2.6.15-lsf/include/asm-sh/types.h 2006-01-04 15:37:17.000000000 +0900
@@ -58,6 +58,11 @@ typedef u64 sector_t;
#define HAVE_SECTOR_T
#endif

+#ifdef CONFIG_LSF
+typedef u64 blkcnt_t;
+#define HAVE_BLKCNT_T
+#endif
+
#endif /* __ASSEMBLY__ */

#endif /* __KERNEL__ */
diff -uprN -X linux-2.6.15-iblocks/Documentation/dontdiff linux-2.6.15-iblocks/include/linux/fs.h
linux-2.6.15-lsf/include/linux/fs.h
--- linux-2.6.15-iblocks/include/linux/fs.h 2006-01-04 15:39:06.000000000 +0900
+++ linux-2.6.15-lsf/include/linux/fs.h 2006-01-04 15:37:17.000000000 +0900
@@ -450,7 +450,7 @@ struct inode {
unsigned int i_blkbits;
unsigned long i_blksize;
unsigned long i_version;
- sector_t i_blocks;
+ blkcnt_t i_blocks;
unsigned short i_bytes;
spinlock_t i_lock; /* i_blocks, i_bytes, maybe i_size */
struct semaphore i_sem;
diff -uprN -X linux-2.6.15-iblocks/Documentation/dontdiff linux-2.6.15-iblocks/include/linux/types.h
linux-2.6.15-lsf/include/linux/types.h
--- linux-2.6.15-iblocks/include/linux/types.h 2006-01-03 12:21:10.000000000 +0900
+++ linux-2.6.15-lsf/include/linux/types.h 2006-01-04 15:37:17.000000000 +0900
@@ -135,6 +135,10 @@ typedef __s64 int64_t;
typedef unsigned long sector_t;
#endif

+#ifndef HAVE_BLKCNT_T
+typedef unsigned long blkcnt_t;
+#endif
+
/*
* The type of an index into the pagecache. Use a #define so asm/types.h
* can override it.

-- Takashi Sato



2006-01-13 02:33:43

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 2/3] Fix problems on multi-TB filesystem and file

"Takashi Sato" <[email protected]> wrote:
>
> This is a patch to add blkcnt_t as the type of inode.i_blocks.
> This enables you to make the size of blkcnt_t either 4 bytes or 8 bytes
> on 32 bits architecture with CONFIG_LSF.

What was the rationale behind CONFIG_LSF? It's a bit of an ugly thing and
I'm wondering if we wouldn't be better off just removing it and simply
fixing >2TB support for all .configs?

Do the common userspace tools need to be updated for this, or do they
already get it right?

2006-01-13 13:50:28

by Takashi Sato

[permalink] [raw]
Subject: RE: [PATCH 2/3] Fix problems on multi-TB filesystem and file

Hi,

> > This is a patch to add blkcnt_t as the type of inode.i_blocks.
> > This enables you to make the size of blkcnt_t either 4 bytes or 8 bytes
> > on 32 bits architecture with CONFIG_LSF.
>
> What was the rationale behind CONFIG_LSF? It's a bit of an ugly thing and
> I'm wondering if we wouldn't be better off just removing it and simply
> fixing >2TB support for all .configs?

We should avoid needless growth of heavily-used structure such as
inode for a small system like embedded system.
So I make it possible to configure the size of i_blocks with CONFIG_LSF.

> Do the common userspace tools need to be updated for this, or do they
> already get it right?

glibc's st_blocks is always 8 bytes as below.

__blkcnt64_t st_blocks;

On the original kernel, a padding for st_blocks is set at appropriate
location for each endian types as below.
So, there is no need to fix user program(glibc).

include/asm-i386/stat.h:
--------------------------------------------------------------------------
unsigned long st_blocks; /* Number 512-byte blocks allocated. */
unsigned long __pad4; /* future possible st_blocks high bits */
---------------------------------------------------------------------------

include/asm-m68k/stat.h:
--------------------------------------------------------------------------
unsigned long __pad4; /* future possible st_blocks high bits */
unsigned long st_blocks; /* Number 512-byte blocks allocated. */
--------------------------------------------------------------------------

-- Takashi Sato


2006-01-13 20:28:53

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 2/3] Fix problems on multi-TB filesystem and file

"Takashi Sato" <[email protected]> wrote:
>
> > > This is a patch to add blkcnt_t as the type of inode.i_blocks.
> > > This enables you to make the size of blkcnt_t either 4 bytes or 8 bytes
> > > on 32 bits architecture with CONFIG_LSF.
> >
> > What was the rationale behind CONFIG_LSF? It's a bit of an ugly thing and
> > I'm wondering if we wouldn't be better off just removing it and simply
> > fixing >2TB support for all .configs?
>
> We should avoid needless growth of heavily-used structure such as
> inode for a small system like embedded system.
> So I make it possible to configure the size of i_blocks with CONFIG_LSF.

Variables whose size varies according to Kconfig are surprisingly expensive
with respect to maintenence cost and runtime stability risk. The main
offended is sector_t.

I am forever fixing people's code which does

printk("%ld", some_sector_t);

With CONFIG_LBD=n that's fine. With CONFIG_LBD=y it generates a warning.

The consequences of that warning are:

a) It'll probably print a garbage number

b) If the printk did

printk("%d %s", some_sector_t, some_string);

then printk might well oops the kernel, because `some_string' will be
passed in either the wrong register or in the wrong stack slot.

And because printks are almost always in rare error paths, such an
oops will probably sail through everyone's kernel testing and will only
hit users when rare things like I/O errors or disk corruption occur.

The best fix for the printk(sector_t) problem is

printk("%llu", (unsigned long long)some_sector_t);

All other solutions have subtle problems.


So now we're proposing to repeat the sector_t problem with a bunch of new
fields. Fortunately we're less likely to be putting these particular
fields into printk statements but I note that CIFS (at least) has a couple
such statements and with your patch they're now generating warnings (and
potential runtime bugs).


On the other hand, for a fairly fat .config which has 17 filesystems in
.vmlinux:

text data bss dec hex filename
4633032 1011304 248288 5892624 59ea10 vmlinux CONFIG_LSF=y
4633680 1011304 248288 5893272 59ec98 vmlinux CONFIG_LSF=n

It's probably less 0.5 kbytes for usual embedded .config.


I just don't think the benefit of CONFIG_LSF outweighs its costs.

2006-01-13 20:51:57

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH 2/3] Fix problems on multi-TB filesystem and file

On Jan 13, 2006 12:28 -0800, Andrew Morton wrote:
> So now we're proposing to repeat the sector_t problem with a bunch of new
> fields. Fortunately we're less likely to be putting these particular
> fields into printk statements but I note that CIFS (at least) has a couple
> such statements and with your patch they're now generating warnings (and
> potential runtime bugs).
>
> On the other hand, for a fairly fat .config which has 17 filesystems in
> .vmlinux:
>
> text data bss dec hex filename
> 4633032 1011304 248288 5892624 59ea10 vmlinux CONFIG_LSF=y
> 4633680 1011304 248288 5893272 59ec98 vmlinux CONFIG_LSF=n
>
> It's probably less 0.5 kbytes for usual embedded .config.
> I just don't think the benefit of CONFIG_LSF outweighs its costs.

We were originally going to use CONFIG_LBD, but there were some complaints
that "sector_t" isn't the right variable to use for this, even though they
are remarkably close. That would at least remove one config change.

I don't think the cost is in the vmlinux itself, but rather that having a
long long for i_blocks is overkill for any but the very largest systems
(Lustre has been running fine w/o this, at the expense of some accuracy
for the i_blocks count on many-TB files). Growing struct inode for these
0.0000001% of systems is probably harmful for small systems, given how
many inodes are used in a system.

Two options exist IMHO:
- remove the new CONFIG_* parameters and stick with CONFIG_LBD (this could
still use a separate type from sector_t if desired) to reduce the amount
of testing combinations needed
- make the new CONFIG_* default to on and allow it to be disabled with
CONFIG_BASE_SMALL

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2006-01-13 21:20:22

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 2/3] Fix problems on multi-TB filesystem and file

Andreas Dilger <[email protected]> wrote:
>
> On Jan 13, 2006 12:28 -0800, Andrew Morton wrote:
> > So now we're proposing to repeat the sector_t problem with a bunch of new
> > fields. Fortunately we're less likely to be putting these particular
> > fields into printk statements but I note that CIFS (at least) has a couple
> > such statements and with your patch they're now generating warnings (and
> > potential runtime bugs).
> >
> > On the other hand, for a fairly fat .config which has 17 filesystems in
> > .vmlinux:
> >
> > text data bss dec hex filename
> > 4633032 1011304 248288 5892624 59ea10 vmlinux CONFIG_LSF=y
> > 4633680 1011304 248288 5893272 59ec98 vmlinux CONFIG_LSF=n
> >
> > It's probably less 0.5 kbytes for usual embedded .config.
> > I just don't think the benefit of CONFIG_LSF outweighs its costs.
>
> We were originally going to use CONFIG_LBD, but there were some complaints
> that "sector_t" isn't the right variable to use for this, even though they
> are remarkably close. That would at least remove one config change.
>
> I don't think the cost is in the vmlinux itself, but rather that having a
> long long for i_blocks is overkill for any but the very largest systems
> (Lustre has been running fine w/o this, at the expense of some accuracy
> for the i_blocks count on many-TB files). Growing struct inode for these
> 0.0000001% of systems is probably harmful for small systems, given how
> many inodes are used in a system.

I'd expect that rh and suse and others will turn on the >2TB option, so
that's most systems.

> Two options exist IMHO:
> - remove the new CONFIG_* parameters and stick with CONFIG_LBD (this could
> still use a separate type from sector_t if desired) to reduce the amount
> of testing combinations needed
> - make the new CONFIG_* default to on and allow it to be disabled with
> CONFIG_BASE_SMALL

Well yes, but we still have the printk problem.

CONFIG_LFS would become a specialised option for embedded systems and for
the minority of people who self-compile kernels. I just don't think that's
worth the maintainability hassle.

Ho hum. But then, people don't printk these fields as much. I spose we
could live with your option 1). And we need to find all those places (like
CIFS) which are presently trying to print a blkcnt_t and add the %llu and
the typecast.

2006-01-18 12:55:05

by Takashi Sato

[permalink] [raw]
Subject: RE: [PATCH 2/3] Fix problems on multi-TB filesystem and file

Hi,

> > > On the other hand, for a fairly fat .config which has 17 filesystems in
> > > .vmlinux:
> > >
> > > text data bss dec hex filename
> > > 4633032 1011304 248288 5892624 59ea10 vmlinux CONFIG_LSF=y
> > > 4633680 1011304 248288 5893272 59ec98 vmlinux CONFIG_LSF=n
> > >
> > > It's probably less 0.5 kbytes for usual embedded .config.
> > > I just don't think the benefit of CONFIG_LSF outweighs its costs.

I looked into the number of struct inode on slab of x86
in cases i_blocks is both 4 bytes and 8 bytes.
As the following tables, the number of inodes per slab is the same
in both cases. So at least on x86, there seems to be no influence for
the memory usage by extending inode->i_blocks.

In default configuration (CONFIG_QUOTA=ON, CONFIG_INOTIFY=ON):
-------------------------------------------------------------
In case inode->i_blocks is unsigned long
size of inode(byte) the number per slab
ext2_inode_info 588 6
ext3_inode_info 612 6

In case inode->i_blocks is unsigned long long
size of inode(byte) the number per slab
ext2_inode_info 592 6
ext3_inode_info 616 6
--------------------------------------------------------------

CONFIG_QUOTA=OFF, CONFIG_INOTIFY=OFF:
-------------------------------------------------------------
In case inode->i_blocks is unsigned long
size of inode(byte) the number per slab
ext2_inode_info 540 7
ext3_inode_info 564 7

In case inode->i_blocks is unsigned long long
size of inode(byte) the number per slab
ext2_inode_info 544 7
ext3_inode_info 568 7
--------------------------------------------------------------

> > Two options exist IMHO:
> > - remove the new CONFIG_* parameters and stick with CONFIG_LBD (this could
> > still use a separate type from sector_t if desired) to reduce the amount
> > of testing combinations needed
> > - make the new CONFIG_* default to on and allow it to be disabled with
> > CONFIG_BASE_SMALL
>
> Well yes, but we still have the printk problem.
>
> CONFIG_LFS would become a specialised option for embedded systems and
> for the minority of people who self-compile kernels. I just don't
> think that's worth the maintainability hassle.

I added CONFIG_LSF to use large filesystem over network with >2TB file
even on a small system as CONFIG_LBD disable. And I heard that some
people dislike network filesystems depending on block device.

Trond, do you have comments about integrating CONFIG_LFS and
CONFIG_LBD?

-- Takashi Sato


2006-01-18 21:23:20

by Trond Myklebust

[permalink] [raw]
Subject: RE: [PATCH 2/3] Fix problems on multi-TB filesystem and file

On Wed, 2006-01-18 at 21:54 +0900, Takashi Sato wrote:
> > CONFIG_LFS would become a specialised option for embedded systems and
> > for the minority of people who self-compile kernels. I just don't
> > think that's worth the maintainability hassle.
>
> I added CONFIG_LSF to use large filesystem over network with >2TB file
> even on a small system as CONFIG_LBD disable. And I heard that some
> people dislike network filesystems depending on block device.
>
> Trond, do you have comments about integrating CONFIG_LFS and
> CONFIG_LBD?

If you do merge CONFIG_LFS and CONFIG_LBD, then please throw out the
name CONFIG_LBD in favour of CONFIG_LFS, since the resulting option will
_not_ be block device specific.

Unless someone has some really good arguments against it, I too would
vote for hiding both options behind CONFIG_EMBEDDED.

Cheers,
Trond