Message-ID: <4C1BA3E5.7020400@gmail.com>
Date: Fri, 18 Jun 2010 18:50:45 +0200
From: Edward Shishkin
To: Daniel J Blueman
CC: Mat, LKML, linux-fsdevel@vger.kernel.org, Chris Mason, Ric Wheeler,
    Andrew Morton, Linus Torvalds, The development of BTRFS
Subject: Re: Btrfs: broken file system design (was Unbound(?) internal
    fragmentation in Btrfs)

Daniel J Blueman wrote:
> On Fri, Jun 18, 2010 at 1:32 PM, Edward Shishkin wrote:
>
>> Mat wrote:
>>
>>> On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin wrote:
>>>
>>>> Hello everyone.
>>>>
>>>> I was asked to review/evaluate Btrfs for using in enterprise
>>>> systems and the below are my first impressions (linux-2.6.33).
>>>>
>>>> The first test I have made was filling an empty 659M (/dev/sdb2)
>>>> btrfs partition (mounted to /mnt) with 2K files:
>>>>
>>>> # for i in $(seq 1000000); \
>>>> do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
>>>> (terminated after getting "No space left on device" reports).
>>>>
>>>> # ls /mnt | wc -l
>>>> 59480
>>>>
>>>> So, I got the "dirty" utilization 59480*2048 / (659*1024*1024) = 0.17,
>>>> and the first obvious question is "hey, where are other 83% of my
>>>> disk space???" I looked at the btrfs storage tree (fs_tree) and was
>>>> shocked with the situation on the leaf level.
>>>> Appendix B shows 5 adjacent btrfs leafs, which have the same parent.
>>>>
>>>> For example, look at the leaf 29425664: "items 1 free space 3892"
>>>> (of 4096!!). Note that this "free" space (3892) is _dead_: any
>>>> attempts to write to the file system will result in "No space left
>>>> on device".
>>>>
>>>> Internal fragmentation (see Appendix A) of those 5 leafs is
>>>> (1572+3892+1901+3666+1675)/(4096*5) = 0.62. This is even worse than
>>>> ext4 and xfs: the latter in this example will show fragmentation
>>>> near zero with blocksize <= 2K. Even with 4K blocksize they will
>>>> show better utilization 0.50 (against 0.38 in btrfs)!
>>>>
>>>> I have a small question for btrfs developers: Why do you folks put
>>>> "inline extents", xattr, etc items of variable size to the B-tree
>>>> in spite of the fact that B-tree is a data structure NOT for variable
>>>> sized records? This disadvantage of B-trees was widely discussed.
>>>> For example, maestro D. Knuth warned about this issue a long time
>>>> ago (see Appendix C).
>>>>
>>>> It is a well known fact that internal fragmentation of classic Bayer's
>>>> B-trees is restricted by the value 0.50 (see Appendix C). However it
>>>> takes place only if your tree contains records of the _same_ length
>>>> (for example, extent pointers). Once you put to your B-tree records
>>>> of variable length (restricted only by leaf size, like btrfs "inline
>>>> extents"), your tree LOSES this boundary. Moreover, even worse:
>>>> it is clear that in this case utilization of B-tree scales as zero(!).
>>>> That said, for every small E and for every amount of data N we
>>>> can construct a consistent B-tree, which contains data N and has
>>>> utilization worse than E. I.e. from the standpoint of utilization
>>>> such trees can be completely degenerated.
>>>>
>>>> That said, the very important property of B-trees, which guarantees
>>>> non-zero utilization, has been lost, and I don't see in Btrfs code any
>>>> substitution for this property. In other words, where is a formal
>>>> guarantee that all disk space of our users won't be eaten by internal
>>>> fragmentation? I consider such guarantee as a *necessary* condition
>>>> for putting a file system to production.
>>>>
>
> Wow...a small part of me says 'well said', on the basis that your
> assertions are true, but I do think there needs to be more
> constructivity in such critique; it is almost impossible to be a great
> engineer and a great academic at once in a time-pressured environment.
>

Sure it is impossible. I believe in division of labour:
academics write algorithms, and we (engineers) encode them.

I have noticed that events in Btrfs develop along a scenario not
predicted by the paper of the academic Ohad Rodeh (in spite of the
announcement that Btrfs is based on this paper). This is why I have
started to grumble.

Thanks.

> If you can produce some specifics and suggestions with code references,
> I'm sure we'll get some good discussion with potential to improve from
> where we are.
>
> Thanks,
> Daniel
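
The two ratios quoted in the original report can be rechecked with a
short script. This is only a sketch: Python is my choice here (the
thread itself uses nothing but shell), the function and constant names
are made up for illustration, and every input is a figure quoted in the
report above, not something measured independently.

```python
# All numbers below are copied from the quoted report; names are hypothetical.
PARTITION_BYTES = 659 * 1024 * 1024   # the 659M /dev/sdb2 partition
FILE_COUNT = 59480                    # output of "ls /mnt | wc -l"
FILE_BYTES = 2048                     # dd bs=2048 count=1
LEAF_BYTES = 4096                     # leaf size given in the report
LEAF_FREE_BYTES = [1572, 3892, 1901, 3666, 1675]  # free space in the 5 leafs


def disk_utilization(file_count, file_bytes, partition_bytes):
    """Fraction of the partition occupied by file payload."""
    return file_count * file_bytes / partition_bytes


def internal_fragmentation(free_bytes, leaf_bytes):
    """Fraction of the listed leaves that is unused space."""
    return sum(free_bytes) / (leaf_bytes * len(free_bytes))


utilization = disk_utilization(FILE_COUNT, FILE_BYTES, PARTITION_BYTES)
fragmentation = internal_fragmentation(LEAF_FREE_BYTES, LEAF_BYTES)

print(f"disk utilization:       {utilization:.3f}")
print(f"internal fragmentation: {fragmentation:.2f}")
print(f"leaf utilization:       {1 - fragmentation:.2f}")
```

The disk utilization comes out at about 0.176, which the report gives
as 0.17. The fragmentation figure matches 0.62 only when the divisor is
read as (4096*5), i.e. the total size of the five leaves, and the leaf
utilization of 0.38 is simply one minus that value.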