From: "Darrick J. Wong"
Subject: Re: kernel bug at fs/ext4/resize.c:409
Date: Fri, 14 Feb 2014 19:16:24 -0800
Message-ID: <20140215031624.GI9176@birch.djwong.org>
References: <20140203182634.GA28811@shaniqua> <20140203185633.GA22856@thunk.org> <20140206210844.GA4335@helmut> <87sirnp2m3.fsf@openvz.org> <20140213145323.GA6296@helmut> <20140213211831.GA11480@thunk.org> <20140214201905.GA26292@helmut> <20140214234631.GC1748@thunk.org>
In-Reply-To: <20140214234631.GC1748@thunk.org>
To: "Theodore Ts'o"
Cc: Jon Bernard, Dmitry Monakhov, linux-ext4@vger.kernel.org

Per Ted's request, I've started editing a document on the ext4 wiki:

https://ext4.wiki.kernel.org/index.php/Ext4_VM_Images

[comments below too]

On Fri, Feb 14, 2014 at 06:46:31PM -0500, Theodore Ts'o wrote:
> On Fri, Feb 14, 2014 at 03:19:05PM -0500, Jon Bernard wrote:
> > Ahh, I see. Here's where this comes from: the particular use case is
> > provisioning of new cloud instances whose root volume is of unknown
> > size. The filesystem and its contents are created and bundled
> > beforehand into the smallest filesystem possible. The instance is PXE
> > booted for provisioning and the root filesystem is then copied onto the
> > disk and resized to take advantage of the total amount of space.
> >
> > In order to support very large partitions, the filesystem is created
> > with an abnormally large inode table so that large resizes would be
> > possible.
> > I traced it to this commit as best I can tell:
> >
> > https://github.com/openstack/diskimage-builder/commit/fb246a02eb2ed330d3cc37f5795b3ed026aabe07
> >
> > I assumed that additional inodes would be allocated along with block
> > groups during an online resize, but that commit contradicts my current
> > understanding.
>
> Additional inodes *are* allocated as the file system is grown.
> Whoever thought otherwise was wrong. What happens is that there is a
> fixed number of inodes per block group. When the file system is
> resized, whether by growing or shrinking it, block groups are added
> to or removed from the file system, and the inodes in those groups
> are added or removed along with them.
>
> > I suggested that the filesystem be created during the time of
> > provisioning to allow a more optimal on-disk layout, and I believe this
> > is being considered now.
>
> What causes the most damage in terms of a non-optimal data block
> layout is installing onto a large file system and then shrinking
> the file system to its minimum size using resize2fs -M. There is
> also some non-optimality that occurs as the file system gets filled
> beyond about 90% full, but it's not nearly as bad as shrinking the
> file system, which you should avoid at all costs.
>
> From a performance point of view, the only time you should try to do
> an off-line resize2fs shrink is if you are shrinking the file system
> by a handful of blocks as part of converting a file system in place to
> use LVM or LUKS encryption, and you need to make room for some
> metadata blocks at the end of the partition.
>
> The other thing to note is that if you are using a format such
> as qcow2, or something like the device-mapper's thin-provisioning
> (thinp) scheme, or if you are willing to deal with sparse files, one
> approach is to not resize the file system at all.
> You could just use a tool like zerofree[1] to zero out all of the
> unused blocks in the file system, and then use
> "/bin/cp --sparse=always" to cause all zero blocks to be treated as
> sparse blocks on the destination file.
>
> [1] http://git.kernel.org/cgit/fs/ext2/xfstests-bld.git/tree/kvm-xfstests/util/zerofree.c

I have a zerofree variant that knows how to punch/discard blocks that
I'll throw into contrib/ the next time I send out one of my megapatch
sets.

> This is part of how I maintain my root filesystem that I use in a VM
> for testing ext4 changes upstream. After I update to the latest
> Debian unstable packages and install the latest updates from the
> xfstests and e2fsprogs git repositories, I run the following
> script, which uses the zerofree.c program to compress the qcow2 root
> file system image that I use with kvm:
>
> http://git.kernel.org/cgit/fs/ext2/xfstests-bld.git/tree/kvm-xfstests/compress-rootfs
>
>
> Also, starting with e2fsprogs 1.42.10, there's another way you can

These three options (-rap) are available in 1.42.9. Is there a
particular reason not to use it before 1.42.10?

> efficiently deploy a large file system image by only copying the
> blocks which are in use, by using a command like this:
>
>     e2image -rap src_fs dest_fs
>
> (See also the -c flag as described in e2image's man page if you want
> to use this technique to do incremental image-based backups onto a
> flash-based backup medium; I was using this for a while to keep two
> laptop SSDs' root filesystems in sync with one another.)
>
> So there are lots of ways that you can do what you need, all without
> playing games with resize2fs. Perhaps some of them would actually be
> better for your use case.

Calvin Watson noted on Ted's G+ repost that one can use fstrim in newer
versions of QEMU (1.5+?) to punch out unused blocks if the virtual disk
is emulated via virtio-scsi.
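
For anyone who wants to see the sparse-copy effect from Ted's
"cp --sparse=always" suggestion without touching a real image, here is
a minimal sketch. It assumes GNU coreutils (cp, head, stat, du). The
file names are made up for illustration, and the zero-filled tail is
only a stand-in for the free space of an image that zerofree has
already processed; it is not Ted's compress-rootfs script.

```shell
#!/bin/sh
# Sketch only: show that cp --sparse=always turns runs of zero bytes
# into holes. Assumes GNU coreutils; file names are hypothetical.
set -eu

workdir=$(mktemp -d)
src="$workdir/rootfs.img"
dst="$workdir/rootfs-sparse.img"

# Stand-in for a zerofree'd image: 1 MiB of data followed by
# 15 MiB of explicitly written zeros.
head -c 1048576 /dev/urandom > "$src"
head -c 15728640 /dev/zero >> "$src"

# Copy, letting cp detect zero blocks and skip writing them.
cp --sparse=always "$src" "$dst"

# The apparent size is unchanged; the allocated size should drop
# if the destination filesystem supports holes.
src_size=$(stat -c %s "$src")
dst_size=$(stat -c %s "$dst")
echo "apparent bytes: src=$src_size dst=$dst_size"
echo "allocated KiB:  src=$(du -k "$src" | cut -f1) dst=$(du -k "$dst" | cut -f1)"
echo "scratch files are in $workdir; remove it when done"
```

How much the allocated size shrinks depends on the destination
filesystem, so the script prints both numbers rather than promising a
particular saving.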

--D

> >
> > If it turns out to be not terribly complicated and there is not an
> > immediate time constraint, I would love to try to help with this or at
> > least test patches.
>
> I will hopefully have a bug fix in the next week or two.
>
> Cheers,
>
> - Ted