From: Theodore Tso
Subject: Re: ext4 64bit (disk >16TB) question
Date: Tue, 15 Jul 2008 08:36:32 -0400
Message-ID: <20080715123632.GA16704@mit.edu>
References: <87bq10w8gv.fsf@frosties.localdomain> <20080714234616.GD3382@mit.edu> <87y743vh3q.fsf@frosties.localdomain>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: linux-ext4@vger.kernel.org
To: Goswin von Brederlow
Return-path:
Received: from www.church-of-our-saviour.ORG ([69.25.196.31]:58565
	"EHLO thunker.thunk.org" rhost-flags-OK-OK-OK-FAIL)
	by vger.kernel.org with ESMTP id S1755263AbYGOMgf (ORCPT );
	Tue, 15 Jul 2008 08:36:35 -0400
Content-Disposition: inline
In-Reply-To: <87y743vh3q.fsf@frosties.localdomain>
Sender: linux-ext4-owner@vger.kernel.org
List-ID:

On Tue, Jul 15, 2008 at 07:42:01AM +0200, Goswin von Brederlow wrote:
> Is that a problem for the kernel or for the user space? I noticed that
> mke2fs 1.39 used over a gigabyte memory to format a >16TiB disk. While
> being a lot that is not really a problem here.

Userspace. The kernel demand-loads bitmap blocks as needed, but
e2fsprogs keeps bitarrays in user memory. The problem is e2fsck; in
the worst case it needs something like 5 different block bitmaps and
3 or 4 inode bitmaps. (I don't remember the exact numbers, but it's
that order of magnitude.) So if it's something like a gigabyte of
memory for mke2fs, it might be 6-7 gigs of memory for e2fsck. If this
is before swap has been enabled, it might not work at all, and even
with swap, we're talking serious slowdown if e2fsck is constantly
paging to disk.

> Will there be filesystem changes as well? The above mentioned
> run-length encoding sounds a bit like a new bitmap format or is that
> only supposed to be the in memory format in userspace?

No, it will only be an in-memory format in userspace. And I anticipate
multiple backend storage formats for the bitmaps, depending on what
they will be used for.
For example, e2fsck uses one inode bitmap to detect directory loops
when following the parent '..' entry; this is a super-sparse array,
with at most N bits set in the entire array, where N is the depth of
the deepest directory in the filesystem. Simply storing a sorted list
of the bits that are "on" is the most efficient representation for
that particular bitmap. Other bitmaps will be much better off stored
in memory using perhaps extents of "on" bits in a red-black tree, etc.
At least initially I will implement the "dumb and stupid" fixed
bitarray, but I need to make sure that we have the right dispatching
to support the rest.

> what is the plan of how to add 64-bit support to the shared lib now?
> Will you introduce a do_foo64() function in parallel to do_foo() to
> maintain abi compatibility? Will you add versioned symbols? Or will
> there be an abi break at some point?

There's a pretty good description of my plans here:

	http://thread.gmane.org/gmane.comp.file-systems.ext4/2845

So no versioned symbols; instead, new functions, where we go from
ext2fs_block_iterate2() to ext2fs_block_iterate3(), etc. All of the
new interfaces that I have been adding have been 64-bit clean to begin
with. So for example all of the extents code uses blk64_t. The
io_manager has been switched over to support 64-bit block numbers,
etc.

> The reason I ask all this is because I'm willing to spend some time
> patching and testing. A single >16TiB filesystem instead of multiple
> smaller ones would be a great benefit for us.

Jose Santos has been working on some patches, and I've been working on
the 64-bit bitmap support (when I have time, which means it's been
sporadic). My primary priority for ext4 has been getting the last
major bits of the patches into mainline and getting e2fsprogs 1.41 out
the door so that basic testing, bug fixing, and stabilization could
begin. We still have some bugs that we need to squash, such as the
summary statistics and/or checksums in the block group descriptors
getting corrupted.
Nothing so far that can't be fixed with e2fsck, but getting ext4
stable is just a *much* higher priority for me right now.

That being said, if you want to join the ext4 development efforts,
please subscribe to the linux-ext4@vger.kernel.org mailing list
(standard majordomo subscription interface, like all of the kernel.org
lists). The wiki at http://ext4.wiki.kernel.org has some good stuff,
but some of what is there is out of date. Still, things like the ext4
IRC channel are listed there, and the "getting started" page is
reasonably up to date.

Regards,

						- Ted
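
Illustration of the sorted-list idea mentioned above for the
directory-loop bitmap: the following is only a minimal sketch in C, not
code from e2fsprogs. The names sparse_bitmap, sb_lower_bound, sb_set,
sb_test, sb_clear, and sb_free are hypothetical and chosen purely for
this example. It assumes that very few bits are ever set, so memory use
is proportional to the number of "on" bits rather than to the size of
the filesystem, and lookups are a binary search.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sparse bitmap: a sorted array of the bit numbers that are set. */
struct sparse_bitmap {
    uint64_t *bits;   /* ascending order, no duplicates */
    size_t    count;  /* number of set bits */
    size_t    cap;    /* allocated slots in bits[] */
};

/* Binary search: index of the first entry >= bit, or count if there is none. */
size_t sb_lower_bound(const struct sparse_bitmap *bm, uint64_t bit)
{
    size_t lo = 0, hi = bm->count;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (bm->bits[mid] < bit)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Return 1 if the bit is set, 0 otherwise. */
int sb_test(const struct sparse_bitmap *bm, uint64_t bit)
{
    size_t i;

    assert(bm != NULL);
    i = sb_lower_bound(bm, bit);
    return i < bm->count && bm->bits[i] == bit;
}

/* Set a bit, keeping the array sorted. Returns 0 on success, -1 on ENOMEM. */
int sb_set(struct sparse_bitmap *bm, uint64_t bit)
{
    size_t i = sb_lower_bound(bm, bit);

    if (i < bm->count && bm->bits[i] == bit)
        return 0;                       /* already set */
    if (bm->count == bm->cap) {         /* grow geometrically */
        size_t ncap = bm->cap ? bm->cap * 2 : 8;
        uint64_t *p = realloc(bm->bits, ncap * sizeof(*p));
        if (!p)
            return -1;
        bm->bits = p;
        bm->cap = ncap;
    }
    /* Shift the tail up by one slot to open a gap at index i. */
    memmove(&bm->bits[i + 1], &bm->bits[i],
            (bm->count - i) * sizeof(*bm->bits));
    bm->bits[i] = bit;
    bm->count++;
    return 0;
}

/* Clear a bit if it is set; do nothing otherwise. */
void sb_clear(struct sparse_bitmap *bm, uint64_t bit)
{
    size_t i = sb_lower_bound(bm, bit);

    if (i == bm->count || bm->bits[i] != bit)
        return;
    /* Shift the tail down by one slot to close the gap at index i. */
    memmove(&bm->bits[i], &bm->bits[i + 1],
            (bm->count - i - 1) * sizeof(*bm->bits));
    bm->count--;
}

/* Release the storage and reset the bitmap to empty. */
void sb_free(struct sparse_bitmap *bm)
{
    free(bm->bits);
    bm->bits = NULL;
    bm->count = bm->cap = 0;
}
```

A caller would start from a zero-initialized struct sparse_bitmap, call
sb_set() for each inode visited while walking '..' entries, and use
sb_test() to notice an inode seen twice. Insertion is O(N) because of
the memmove, which is only acceptable under the assumption that N stays
small; denser bitmaps would want one of the other backends described in
the message.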