2009-07-18 00:08:20

by NeilBrown

Subject: How to handle >16TB devices on 32 bit hosts ??


Hi,
It has recently come to my attention that Linux on a 32 bit host does
not handle devices beyond 16TB particularly well.

In particular, any access that goes through the page cache for the
block device is limited to a pgoff_t number of pages.
As pgoff_t is "unsigned long" and hence 32bit, and as page size is
4096, this comes to 16TB total.

A filesystem created on a 17TB device should be able to access and
cache file data perfectly, provided CONFIG_LBDAF is set.
However if the filesystem caches metadata using the block device,
then metadata beyond 16TB will be a problem.

Access to the block device (/dev/whatever) via open/read/write will
also cause problems beyond 16TB, though if O_DIRECT is used I think
it should work OK (it will probably try to flush out completely
irrelevant parts of the page cache before allowing the IO, but that
is a benign error case I think).

With 2TB drives easily available, more people will probably try
building arrays this big and we cannot just assume they will only do
it on 64bit hosts.

So the question I really wanted to ask is: is there any point in
allowing >16TB arrays to be created on 32bit hosts, or should we just
disallow them? If we allow them, what steps should we take to make
the possible failure modes more obvious?

As I said, I think O_DIRECT largely works fine on these devices and
we could fix the few irregularities with little effort. So one step
might be to make mkfs/fsck utilities use O_DIRECT on >16TB devices on
32bit hosts.

Given that non-O_DIRECT can fail (e.g. in do_generic_file_read,
index = *ppos >> PAGE_CACHE_SHIFT
will lose the high bits if *ppos does not fit in 44 bits) we should probably fail
opens on devices larger than 16TB.... though just failing the open
doesn't help if the device can change size, as dm and md devices can.
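
To make the failure mode concrete, here is a small user-space model of
that line (pgoff_t below just stands in for the kernel type; this is
only an illustration):

#include <stdio.h>

int main(void)
{
    typedef unsigned long pgoff_t;   /* 32 bits on a 32-bit host */
    long long ppos = 17LL << 40;     /* an offset at 17TB */
    pgoff_t index = ppos >> 12;      /* PAGE_CACHE_SHIFT == 12 */

    /* On 32-bit the high bits of the page index are silently dropped,
     * so the access aliases into the low 16TB of the device. */
    printf("index = %lu, wanted %llu\n",
           (unsigned long)index, (unsigned long long)(ppos >> 12));
    return 0;
}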

I believe ext[234] uses the block device's page cache for metadata, so
they cannot safely be used with >16TB devices on 32bit. Is that
correct? Should they fail a mount attempt? Do they?

Are there any filesystems that do not use the block device cache and
so are not limited to 16TB on 32bit?

Even if no filesystem can use >16TB on 32bit, I suspect dm can
usefully use such a device for logical volume management, and as long
as each logical volume does not exceed 16TB, all should be happy. So
completely disallowing them might not be best.

I suppose we could add a CONFIG option to make pgoff_t be
"unsigned long long". Would the cost/benefit of that be acceptable?

Your thoughts are most welcome.

Thanks,
NeilBrown


2009-07-18 04:33:08

by Andreas Dilger

Subject: Re: How to handle >16TB devices on 32 bit hosts ??

On Jul 18, 2009 10:08 +1000, Neil Brown wrote:
> It has recently come to my attention that Linux on a 32 bit host does
> not handle devices beyond 16TB particularly well.
>
> In particular, any access that goes through the page cache for the
> block device is limited to a pgoff_t number of pages.
> As pgoff_t is "unsigned long" and hence 32bit, and as page size is
> 4096, this comes to 16TB total.
:
:
> I suppose we could add a CONFIG option to make pgoff_t be
> "unsigned long long". Would the cost/benefit of that be acceptable?

I think the point is that for those people who want to use > 16TB
devices on 32-bit platforms (e.g. embedded/appliance systems) the
choice is between "completely non-functional" and "uses a bit more
memory per page", and the answer is pretty obvious.

Users who don't want to support this don't have to (just
like CONFIG_LBD or whatever), and for 64-bit systems it is irrelevant.
I think years ago we had the idea that it would be 64-bit everywhere
by now, and while that is true for many systems, embedded/appliance
systems will probably continue to be 32-bit for as long as they can.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

2009-07-18 06:09:39

by Andi Kleen

Subject: Re: How to handle >16TB devices on 32 bit hosts ??

Neil Brown <[email protected]> writes:
>
> With 2TB drives easily available, more people will probably try
> building arrays this big and we cannot just assume they will only do
> it on 64bit hosts.

They should -- no 32bit fsck has any chance of running on a 16TB volume.

-Andi

--
[email protected] -- Speaking for myself only.

2009-07-18 06:16:13

by Andi Kleen

Subject: Re: How to handle >16TB devices on 32 bit hosts ??

Andreas Dilger <[email protected]> writes:
>
> I think the point is that for those people who want to use > 16TB
> devices on 32-bit platforms (e.g. embedded/appliance systems) the
> choice is between "completely non-functional" and "uses a bit more
> memory per page", and the answer is pretty obvious.

It's not just more memory per page, but also worse code all over the
VM. long long 32bit code is generally rather bad, especially on
register constrained x86.

But I think the fsck problem is a show stopper here anyways.
Enabling a setup that cannot handle IO errors wouldn't
really be a good idea.

In fact this problem already hits before 16TB on 32bit.

Unless people rewrite fsck to use /dev/shm >4GB swapping
(or perhaps use JFS which iirc had a way to use the file system
itself as fsck scratch space)

-Andi

--
[email protected] -- Speaking for myself only.

2009-07-18 06:53:38

by Andreas Dilger

Subject: Re: How to handle >16TB devices on 32 bit hosts ??

On Jul 18, 2009 08:16 +0200, Andi Kleen wrote:
> Andreas Dilger <[email protected]> writes:
> > I think the point is that for those people who want to use > 16TB
> > devices on 32-bit platforms (e.g. embedded/appliance systems) the
> > choice is between "completely non-functional" and "uses a bit more
> > memory per page", and the answer is pretty obvious.
>
> It's not just more memory per page, but also worse code all over the
> VM. long long 32bit code is generally rather bad, especially on
> register constrained x86.

If you aren't running a 32-bit system with this config, you shouldn't
really care. For those systems that need to run in this mode they
would rather have it work a few percent slower instead of not at all.

> But I think the fsck problem is a show stopper here anyways.
> Enabling a setup that cannot handle IO errors wouldn't
> really be a good idea.
>
> In fact this problem already hits before 16TB on 32bit.

The e2fsck code is currently just starting to get > 16TB support,
and while the initial implementation is naive, we are definitely
planning on reducing the memory needed to check very large devices.

The last test numbers I saw were 5GB of RAM for a 20TB filesystem,
but since the bitmaps used are fully-allocated arrays that isn't
surprising. We are planning to replace this with a tree, since the
majority of bitmaps used by e2fsck have large contiguous ranges of
set or unset bits and can be represented much more efficiently.
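
As a rough illustration of the idea (this is not the e2fsprogs code,
just a sketch): store the bitmap as a sorted list of extents of set
bits, and turn test_bit into a search:

#include <stddef.h>

struct extent {
    unsigned long long start;   /* first set bit in the run */
    unsigned long long count;   /* number of consecutive set bits */
};

/* Is block 'blk' marked?  'ext' holds 'n' sorted, non-overlapping runs. */
static int extent_test_bit(const struct extent *ext, size_t n,
                           unsigned long long blk)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {               /* binary search over the runs */
        size_t mid = lo + (hi - lo) / 2;

        if (blk < ext[mid].start)
            hi = mid;
        else if (blk >= ext[mid].start + ext[mid].count)
            lo = mid + 1;
        else
            return 1;               /* inside a run of set bits */
    }
    return 0;
}

Most e2fsck bitmaps would collapse to a handful of such runs, and a
balanced tree of the same records keeps insertion cheap.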

> Unless people rewrite fsck to use /dev/shm >4GB swapping
> (or perhaps use JFS which iirc had a way to use the file system
> itself as fsck scratch space)

I'm guessing that such systems won't have a 20TB boot device, but
rather a small flash boot/swap device (a few GB is cheap) and then
they could swap, if strictly necessary.

Also, for filesystems like btrfs or ZFS the checking can be done
online and incrementally without storing a full representation of
the state in memory.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

2009-07-18 07:48:20

by Andi Kleen

Subject: Re: How to handle >16TB devices on 32 bit hosts ??

On Sat, Jul 18, 2009 at 02:52:13AM -0400, Andreas Dilger wrote:
> If you aren't running a 32-bit system with this config, you shouldn't
> really care. For those systems that need to run in this mode they
> would rather have it work a few percent slower instead of not at all.

Well, it doesn't work at all anyways due to the fsck problem.

> The last test numbers I saw were 5GB of RAM for a 20TB filesystem,
> but since the bitmaps used are fully-allocated arrays that isn't
> surprising. We are planning to replace this with a tree, since the
> majority of bitmaps used by e2fsck have large contiguous ranges of
> set or unset bits and can be represented much more efficiently.

You would need to get <~2.5GB for 32bit. In practice that's
the limit you have there.

> Also, for filesystems like btrfs or ZFS the checking can be done
> online and incrementally without storing a full representation of
> the state in memory.

You could, but I suspect it would be cheaper to just use a
64bit system than to rewrite fsck. 64bit is available
for a lot of embedded setups these days too.

-Andi

--
[email protected] -- Speaking for myself only.

2009-07-18 13:50:05

by Theodore Ts'o

Subject: Re: How to handle >16TB devices on 32 bit hosts ??

On Sat, Jul 18, 2009 at 09:48:11AM +0200, Andi Kleen wrote:
>
> > Also, for filesystems like btrfs or ZFS the checking can be done
> > online and incrementally without storing a full representation of
> > the state in memory.
>
> You could, but I suspect it would be cheaper to just use a
> 64bit system than to rewrite fsck. 64bit is available
> for a lot of embedded setups these days too.

We don't have to rewrite fsck; most of the framework for supporting a
run-length encoding for compressed bitmaps is already in patches that
add > 32-bit block numbers to e2fsprogs; we've just been more focused
on getting 64-bit block number support merged than implementing
compressed bitmaps, but it's only one file that would need to be
added, and we might be able to steal the compressed bitmap support
from xfsprogs --- which does this already.

- Ted

2009-07-18 14:22:01

by Andi Kleen

Subject: Re: How to handle >16TB devices on 32 bit hosts ??

> We don't have to rewrite fsck; most of the framework for supporting a
> run-length encoding for compressed bitmaps is already in patches that
> add > 32-bit block numbers to e2fsprogs; we've just been more focused
> on getting 64-bit block number support merged than implementing
> compressed bitmaps, but it's only one file that would need to be
> added, and we might be able to steal the compressed bitmap support
> from xfsprogs --- which does this already.

There are regular reports of xfs_repair failing on 32bit,
even on volumes far smaller than 16TB.

-Andi
--
[email protected] -- Speaking for myself only.

2009-07-18 14:33:39

by Andreas Dilger

Subject: Re: How to handle >16TB devices on 32 bit hosts ??

On Jul 18, 2009 16:21 +0200, Andi Kleen wrote:
> > We don't have to rewrite fsck; most of the framework for supporting a
> > run-length encoding for compressed bitmaps is already in patches that
> > add > 32-bit block numbers to e2fsprogs; we've just been more focused
> > on getting 64-bit block number support merged than implementing
> > compressed bitmaps, but it's only one file that would need to be
> > added, and we might be able to steal the compressed bitmap support
> > from xfsprogs --- which does this already.
>
> There are regular reports of xfs_repair failing on 32bit,
> even on volumes far smaller than 16TB.

It's a good thing we have ext4 then :-).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

2009-07-18 18:19:42

by Christoph Hellwig

Subject: Re: How to handle >16TB devices on 32 bit hosts ??

On Sat, Jul 18, 2009 at 04:21:56PM +0200, Andi Kleen wrote:
> > We don't have to rewrite fsck; most of the framework for supporting a
> > run-length encoding for compressed bitmaps is already in patches that
> > add > 32-bit block numbers to e2fsprogs; we've just been more focused
> > on getting 64-bit block number support merged than implementing
> > compressed bitmaps, but it's only one file that would need to be
> > added, and we might be able to steal the compressed bitmap support
> > from xfsprogs --- which does this already.
>
> There are regular reports of xfs_repair failing on 32bit,
> even on volumes far smaller than 16TB.

We have a set of patches pending to reduce memory use in xfs_repair
dramatically, replacing bitmaps for tracking used blocks with extent
structures stored in btrees.

2009-07-19 04:01:49

by Tapani Tarvainen

Subject: Re: How to handle >16TB devices on 32 bit hosts ??

On Sat, Jul 18, 2009 at 09:48:11AM +0200, Andi Kleen ([email protected]) wrote:
> On Sat, Jul 18, 2009 at 02:52:13AM -0400, Andreas Dilger wrote:
> > If you aren't running a 32-bit system with this config, you shouldn't
> > really care. For those systems that need to run in this mode they
> > would rather have it work a few percent slower instead of not at all.
>
> Well, it doesn't work at all anyways due to the fsck problem.

I can imagine several scenarios where a >16TB RAID would be useful
without any filesystem bigger than 2TB.
Or does LVM have problems with 16TB+ devices in 32bit-systems, too?

--
Tapani Tarvainen

2009-07-22 06:59:26

by Andrew Morton

Subject: Re: How to handle >16TB devices on 32 bit hosts ??

On Sat, 18 Jul 2009 10:08:10 +1000 Neil Brown <[email protected]> wrote:

> It has recently come to my attention that Linux on a 32 bit host does
> not handle devices beyond 16TB particularly well.
>
> In particular, any access that goes through the page cache for the
> block device is limited to a pgoff_t number of pages.
> As pgoff_t is "unsigned long" and hence 32bit, and as page size is
> 4096, this comes to 16TB total.

I expect that the VFS could be made to work with 64-bit pgoff_t fairly
easily. The generated code will be pretty damn sad.

radix-trees use a ulong index, so we would need a new
lib/radix_tree64.c or some other means of fixing that up.
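
(The index type is baked into the API -- the declarations in
include/linux/radix-tree.h are along the lines of:)

int radix_tree_insert(struct radix_tree_root *root,
                      unsigned long index, void *item);
void *radix_tree_lookup(struct radix_tree_root *root, unsigned long index);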

The bigger problem is filesystems - they'll each need to be checked,
tested, fixed and enabled. It's probably not too bad for the
mainstream filesystems which mostly bounce their operations into VFS
library functions anyway.



There's perhaps a middle ground - support >16TB devices, but not >16TB
partitions. That way everything remains 32-bit and we just have to get
the offsetting right (probably already the case).

So now /dev/sda1, /dev/sda2 etc are all <16TB. The remaining problem
is that /dev/sda is >16TB. I expect that we could arrange for the
kernel to error out if userspace tries to access /dev/sda beyond the
16TB point, and those very very few applications which want to touch
that part of the disk will need to be written using direct-io (or
perhaps SG_IO), or run on 64-bit machines.
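
(Something along these lines in the block device read/write path -- the
helper name below is invented, this is just a sketch:)

/* hypothetical helper, name invented: refuse buffered access to a
 * block device past the last byte a 32-bit pgoff_t can index */
static int blkdev_pos_in_range(loff_t pos)
{
    if ((pos >> PAGE_CACHE_SHIFT) > (loff_t)(pgoff_t)~0UL)
        return -EOVERFLOW;
    return 0;
}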

2009-07-22 18:33:33

by Andreas Dilger

Subject: Re: How to handle >16TB devices on 32 bit hosts ??

On Jul 21, 2009 23:59 -0700, Andrew Morton wrote:
> On Sat, 18 Jul 2009 10:08:10 +1000 Neil Brown <[email protected]> wrote:
> I expect that the VFS could be made to work with 64-bit pgoff_t fairly
> easily. The generated code will be pretty damn sad.
>
> radix-trees use a ulong index, so we would need a new
> lib/radix_tree64.c or some other means of fixing that up.
>
> The bigger problem is filesystems - they'll each need to be checked,
> tested, fixed and enabled. It's probably not too bad for the
> mainstream filesystems which mostly bounce their operations into VFS
> library functions anyway.

I don't think this is a primary concern for most filesystems even today.
Filesystems that work correctly > 16TB on 64-bit platforms should continue
to work correctly on 32-bit platforms. ext4 and XFS will be fine, and
we can slap a "refuse to mount > 16TB filesystem on 32-bit" check in
*_fill_super() for the other filesystems, ext3 included. Maintainers can
veto that if they think it will work, and for the rest I don't think
anyone will even notice.
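
(For instance, something like this near the top of a fill_super -- a
sketch only, error code and field access approximate:)

/* sketch: refuse to mount if the device is too large for a 32-bit
 * page cache index; only relevant when pgoff_t is 32 bits wide */
if (sizeof(pgoff_t) < 8 &&
    i_size_read(sb->s_bdev->bd_inode) >
            ((loff_t)1 << (32 + PAGE_CACHE_SHIFT))) {
    printk(KERN_ERR "device too large for 32-bit page cache\n");
    return -EFBIG;
}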

> There's perhaps a middle ground - support >16TB devices, but not >16TB
> partitions. That way everything remains 32-bit and we just have to get
> the offsetting right (probably already the case).
>
> So now /dev/sda1, /dev/sda2 etc are all <16TB. The remaining problem
> is that /dev/sda is >16TB. I expect that we could arrange for the
> kernel to error out if userspace tries to access /dev/sda beyond the
> 16TB point, and those very very few applications which want to touch
> that part of the disk will need to be written using direct-io, (or
> perhaps sgio) or run on 64-bit machines.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

2009-07-22 18:52:48

by Andrew Morton

Subject: Re: How to handle >16TB devices on 32 bit hosts ??

On Wed, 22 Jul 2009 12:32:44 -0600
Andreas Dilger <[email protected]> wrote:

> On Jul 21, 2009 23:59 -0700, Andrew Morton wrote:
> > On Sat, 18 Jul 2009 10:08:10 +1000 Neil Brown <[email protected]> wrote:
> > I expect that the VFS could be made to work with 64-bit pgoff_t fairly
> > easily. The generated code will be pretty damn sad.
> >
> > radix-trees use a ulong index, so we would need a new
> > lib/radix_tree64.c or some other means of fixing that up.
> >
> > The bigger problem is filesystems - they'll each need to be checked,
> > tested, fixed and enabled. It's probably not too bad for the
> > mainstream filesystems which mostly bounce their operations into VFS
> > library functions anyway.
>
> I don't think this is a primary concern for most filesystems even today.
> Filesystems that work correctly > 16TB on 64-bit platforms should continue
> to work correctly on 32-bit platforms.

Not if they use an unsigned long to hold a pagecache index anywhere.

akpm:/usr/src/25> grep 'unsigned long' fs/*/*.c | wc -l
3465

2009-07-29 15:07:25

by Pavel Machek

Subject: Re: How to handle >16TB devices on 32 bit hosts ??

Hi!

> > > Also, for filesystems like btrfs or ZFS the checking can be done
> > > online and incrementally without storing a full representation of
> > > the state in memory.
> >
> > You could, but I suspect it would be cheaper to just use a
> > 64bit system than to rewrite fsck. 64bit is available
> > for a lot of embedded setups these days too.
>
> We don't have to rewrite fsck; most of the framework for supporting a
> run-length encoding for compressed bitmaps is already in patches that
> add > 32-bit block numbers to e2fsprogs; we've just been more focused
> on getting 64-bit block number support merged than implementing
> compressed bitmaps, but it's only one file that would need to be
> added, and we might be able to steal the compressed bitmap support
> from xfsprogs --- which does this already.

Well... 'If the allocation pattern is bad, your fsck runs out of address
space and breaks on your 15TB fs' would scare me.

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html