2014-09-19 15:54:45

by TR Reardon

Subject: Reserved GDT inode: blocks vs extents

Hello all: there's probably a good reason for this, but I'm wondering why inode #7 (reserved GDT blocks) is always created with a block map rather than extents?

[see ext2fs_create_resize_inode()]

Would changing this to extent resolve problem of resize_inode and 64bit / >16TB filesystems?
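
(For anyone who wants to verify this on an image: below is a minimal
sketch, not from e2fsprogs itself, that uses libext2fs to read inode #7
and report whether EXT4_EXTENTS_FL is set. The image path on the command
line is a placeholder; build with "cc check.c -lext2fs -lcom_err".)

#include <stdio.h>
#include <et/com_err.h>
#include <ext2fs/ext2fs.h>

int main(int argc, char **argv)
{
	ext2_filsys fs;
	struct ext2_inode inode;
	errcode_t err;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <image>\n", argv[0]);
		return 1;
	}

	/* Open the filesystem read-only with the default unix I/O manager. */
	err = ext2fs_open(argv[1], 0, 0, 0, unix_io_manager, &fs);
	if (err) {
		com_err(argv[0], err, "while opening %s", argv[1]);
		return 1;
	}

	/* EXT2_RESIZE_INO is inode #7, built by ext2fs_create_resize_inode(). */
	err = ext2fs_read_inode(fs, EXT2_RESIZE_INO, &inode);
	if (err) {
		com_err(argv[0], err, "while reading inode %d", EXT2_RESIZE_INO);
		ext2fs_close(fs);
		return 1;
	}

	printf("resize inode uses %s\n",
	       (inode.i_flags & EXT4_EXTENTS_FL) ?
	       "an extent tree" : "an indirect block map");

	ext2fs_close(fs);
	return 0;
}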


2014-09-19 16:36:52

by Theodore Ts'o

Subject: Re: Reserved GDT inode: blocks vs extents

On Fri, Sep 19, 2014 at 11:54:39AM -0400, TR Reardon wrote:
> Hello all: there's probably a good reason for this, but I'm wondering why inode #7 (reserved GDT blocks) is always created with a block map rather than extents?
>
> [see ext2fs_create_resize_inode()]

It's created using an indirect block map because the on-line resizing
code in the kernel relies on it. That code is rather dependent on the
structure of the indirect block map; it's how the kernel knows where to
fetch the necessary blocks in each block group when extending the block
group descriptor table.

So no, we can't change it.

And we do have a solution, namely the meta_bg layout which mostly
solves the problem, although at the cost of slowing down the mount
time.
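
(To make that tradeoff concrete: with meta_bg, the descriptors for each
group of EXT2_DESC_PER_BLOCK(sb) block groups live at the start of that
meta-group instead of all together after the superblock, so mounting
costs one scattered read per meta-group instead of one mostly-contiguous
read. Here's a simplified sketch of where the primary copy lands,
ignoring backup copies and filesystems that mix the old and new layouts;
the authoritative logic is ext2fs_descriptor_block_loc() in libext2fs.)

#include <ext2fs/ext2fs.h>

/*
 * Illustrative only: primary descriptor block location for block
 * group "grp" in a pure meta_bg layout.
 */
static blk64_t metabg_desc_block(ext2_filsys fs, dgrp_t grp)
{
	/* One meta-group spans as many groups as descriptors fit in a block. */
	dgrp_t desc_per_blk = EXT2_DESC_PER_BLOCK(fs->super);
	dgrp_t metabg_first = (grp / desc_per_blk) * desc_per_blk;

	/*
	 * The descriptor block sits at the start of the meta-group's first
	 * group, just past the backup superblock if that group carries one.
	 */
	return ext2fs_group_first_block2(fs, metabg_first) +
		ext2fs_bg_has_super(fs, metabg_first);
}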

But that may be moot, since one of the things that I've been
considering is to stop pinning the block group descriptors in memory,
and just start reading them into memory as they are needed. The
rationale is that for a 4TB disk, we're burning 8 MB of memory. And if
you have two dozen disks attached to your system, then you're burning
192 megabytes of memory, which starts to add up to a fairly significant
amount of memory, especially for bookcase NAS servers.

If I were doing it all over again, knowing what we know now, I'd
probably redesign the meta_bg layout somewhat to group block group
descriptors into chunks. But it's probably not worth it to add yet
another block group descriptor layout at this point.

Cheers,

- Ted

2014-09-19 17:26:39

by TR Reardon

Subject: RE: Reserved GDT inode: blocks vs extents

> Date: Fri, 19 Sep 2014 12:36:49 -0400
> From: [email protected]
> To: [email protected]
> CC: [email protected]
> Subject: Re: Reserved GDT inode: blocks vs extents
>
> On Fri, Sep 19, 2014 at 11:54:39AM -0400, TR Reardon wrote:
>> Hello all: there's probably a good reason for this, but I'm wondering why inode #7 (reserved GDT blocks) is always created with a block map rather than extents?
>>
>> [see ext2fs_create_resize_inode()]
>
> It's created using an indirect block map because the on-line resizing
> code in the kernel relies on it. That code is rather dependent on the
> structure of the indirect block map; it's how the kernel knows where to
> fetch the necessary blocks in each block group when extending the block
> group descriptor table.
>
> So no, we can't change it.
>
> And we do have a solution, namely the meta_bg layout which mostly
> solves the problem, although at the cost of slowing down the mount
> time.
>
> But that may be moot, since one of the things that I've been
> considering is to stop pinning the block group descriptors in memory,
> and just start reading them into memory as they are needed. The
> rationale is that for a 4TB disk, we're burning 8 MB of memory. And if
> you have two dozen disks attached to your system, then you're burning
> 192 megabytes of memory, which starts to add up to a fairly significant
> amount of memory, especially for bookcase NAS servers.

But I'd argue that in many use cases, in particular bookcase NAS servers,
ext4+vfs should optimize for avoiding spinups rather than reducing RAM usage.
Would this change increase spinups when scanning for changes, say via rsync?
For mostly-cold storage I wish I had the ability to make the dentry and
inode caches long-lived, and to have ext4 prefer to retain directory blocks
over file-data cache blocks, rather than the current non-deterministic
behavior via vfs_cache_pressure. Unfortunately, it is precisely the kind
of large files on bookcase NAS servers that are read linearly (and used
only once) that blow out the cache of directory blocks (and dentries etc.,
but it's really the dir blocks that create the problem with spinups on
cold storage).

Of course, it's likelier that I don't actually understand how all these caches work ;)

+Reardon



2014-09-19 19:47:59

by Andreas Dilger

Subject: Re: Reserved GDT inode: blocks vs extents

On Sep 19, 2014, at 11:26 AM, TR Reardon <[email protected]> wrote:
>> Date: Fri, 19 Sep 2014 12:36:49 -0400
>> From: [email protected]
>> To: [email protected]
>> CC: [email protected]
>> Subject: Re: Reserved GDT inode: blocks vs extents
>>
>> On Fri, Sep 19, 2014 at 11:54:39AM -0400, TR Reardon wrote:
>>> Hello all: there's probably a good reason for this, but I'm wondering why inode #7 (reserved GDT blocks) is always created with a block map rather than extents?
>>>
>>> [see ext2fs_create_resize_inode()]
>>
>> But that may be moot, since one of the things that I've been
>> considering is to stop pinning the block group descriptors in memory,
>> and just start reading them into memory as they are needed. The
>> rationale is that for a 4TB disk, we're burning 8 MB of memory. And if
>> you have two dozen disks attached to your system, then you're burning
>> 192 megabytes of memory, which starts to add up to a fairly significant
>> amount of memory, especially for bookcase NAS servers.
>
> But I'd argue that in many use cases, in particular bookcase NAS servers,
> ext4+vfs should optimize for avoiding spinups rather than reducing RAM usage.
> Would this change increase spinups when scanning for changes, say via rsync?

I think not pinning the group descriptors would be a bad thing. If
we consider reading the block allocation bitmaps to be slow, then it
would be twice as slow to have to read the group descriptor from disk
before even knowing which groups have free space for allocation.

I think this kind of change only makes sense if there is some other
in-memory structure (e.g. rbtree) to describe the free space in the
filesystem. That would be a net win, since we wouldn't have to scan
potentially thousands of groups to find free space.
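
(Sketching what that might look like, with made-up names rather than any
existing kernel or e2fsprogs API: a small in-memory summary per block
group lets the allocator pick a target group without first reading that
group's descriptor from disk.)

#include <stdint.h>
#include <stddef.h>

/* Hypothetical per-group summary, kept entirely in memory. */
struct group_summary {
	uint32_t free_blocks;	/* total free blocks in the group */
	uint32_t largest_run;	/* longest free extent, if tracked */
};

/*
 * Pick the first group whose largest free run can satisfy an
 * allocation of "want" blocks.  A linear scan is shown for brevity;
 * an rbtree keyed on largest_run would make this O(log n) instead
 * of scanning potentially thousands of groups.
 */
static int find_group(const struct group_summary *sum, size_t ngroups,
		      uint32_t want)
{
	for (size_t g = 0; g < ngroups; g++)
		if (sum[g].largest_run >= want)
			return (int)g;
	return -1;	/* nothing big enough anywhere */
}

At 8 bytes per group this stays small even for very large filesystems,
which is the net win described above.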

Cheers, Andreas

