Hello everyone.
I was asked to review/evaluate Btrfs for use in enterprise
systems, and below are my first impressions (linux-2.6.33).
The first test I have made was filling an empty 659M (/dev/sdb2)
btrfs partition (mounted to /mnt) with 2K files:
# for i in $(seq 1000000); \
do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
(terminated after getting "No space left on device" reports).
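For anyone reproducing this: a variant of the same loop that silences
dd and stops at the first write failure (a sketch, assuming GNU dd)
would be
# for i in $(seq 1000000); do \
  dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1 2>/dev/null || break; \
  done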
# ls /mnt | wc -l
59480
So, I got the "dirty" utilization 59480*2048 / (659*1024*1024) = 0.17,
and the first obvious question is "hey, where is the other 83% of my
disk space???" I looked at the btrfs storage tree (fs_tree) and was
shocked by the situation at the leaf level. Appendix B shows
5 adjacent btrfs leaves which have the same parent.
For example, look at leaf 29425664: "items 1 free space 3892"
(of 4096!!). Note that this "free" space (3892 bytes) is _dead_: any
attempt to write to the file system will result in "No space left
on device".
Internal fragmentation (see Appendix A) of those 5 leaves is
(1572+3892+1901+3666+1675)/(4096*5) = 0.62. This is even worse than
ext4 and xfs: those two in this example will show fragmentation
near zero with blocksize <= 2K. Even with a 4K blocksize they will
show better utilization, 0.50 (against 0.38 in btrfs)!
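A quick cross-check of these numbers, assuming the usual btrfs leaf
layout of a 101-byte header plus a 25-byte header per item: leaf
29982720 stores items of 78 + 2069 + 160 + 16 = 2323 bytes, and
101 + 4*25 + 2323 + 1572 = 4096, so the dump is self-consistent.
Note also that 2*2069 = 4138 > 4096, so two such inline-extent items
can never share a single leaf.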
I have a small question for the btrfs developers: why do you folks put
"inline extents", xattrs, and other items of variable size into the
B-tree, despite the fact that a B-tree is a data structure NOT meant
for variable-sized records? This disadvantage of B-trees has been
widely discussed. For example, maestro D. Knuth warned about this
issue a long time ago (see Appendix C).
It is a well-known fact that the internal fragmentation of classic
Bayer B-trees is bounded by 0.50 (see Appendix C). However, this holds
only if your tree contains records of the _same_ length (for example,
extent pointers). Once you put records of variable length into your
B-tree (bounded only by the leaf size, like btrfs "inline extents"),
your tree LOSES this bound. Moreover, even worse: it is clear that in
this case the utilization of the B-tree scales as zero(!). That is,
for every small E and for every amount of data N we can construct a
consistent B-tree which contains the data N and has utilization worse
than E. I.e., from the standpoint of utilization such trees can be
completely degenerate.
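Here is a minimal sketch of such a construction (my own, assuming only
the balancing behavior visible in Appendix B): take 4096-byte leaves
and insert, in key order, pairs of a large record (say, a 2069-byte
inline extent) and a tiny one (say, a 78-byte xattr item). Since two
large records never fit in one leaf, splits can strand each tiny
record alone in its own leaf, with utilization about 78/4096 = 0.02
there, and nothing bounds the fraction of such leaves.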
That is, the very important property of B-trees which guarantees
non-zero utilization has been lost, and I don't see any substitute for
this property in the Btrfs code. In other words, where is a formal
guarantee that all the disk space of our users won't be eaten by
internal fragmentation? I consider such a guarantee a *necessary*
condition for putting a file system into production.
Any technical comments are welcome.
Thanks,
Edward.
Appendix A.
-----------
Glossary
1. Utilization of data and/or metadata storage.
The fraction A/B, where
A is the total size in bytes of stored data and/or metadata,
B = N * S, where
N is the number of blocks occupied by the stored data and/or metadata,
S is the block size in bytes.
2. Internal fragmentation of data and/or metadata storage.
The difference (1 - U), where U is utilization.
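Example: for the 5 leaves in Appendix B,
U = 1 - (1572+3892+1901+3666+1675)/(4096*5) = 1 - 12706/20480 = 0.38,
i.e. internal fragmentation (1 - U) = 0.62.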
Appendix B.
-----------
a "period" in the dump of the fs_tree (btrfs-debug-tree /dev/sdb2)
...
leaf 29982720 items 4 free space 1572 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
item 0 key (319 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78
location key (0 UNKNOWN 0) type 8
namelen 16 datalen 32 name: security.selinux
item 1 key (319 EXTENT_DATA 0) itemoff 1848 itemsize 2069
inline extent data size 2048 ram 2048 compress 0
item 2 key (320 INODE_ITEM 0) itemoff 1688 itemsize 160
inode generation 8 size 2048 block group 29360128 mode 100644 links 1
item 3 key (320 INODE_REF 256) itemoff 1672 itemsize 16
inode ref index 65 namelen 6 name: file64
leaf 29425664 items 1 free space 3892 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
item 0 key (320 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78
location key (0 UNKNOWN 0) type 8
namelen 16 datalen 32 name: security.selinux
leaf 29990912 items 1 free space 1901 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
item 0 key (320 EXTENT_DATA 0) itemoff 1926 itemsize 2069
inline extent data size 2048 ram 2048 compress 0
leaf 29986816 items 3 free space 3666 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
item 0 key (321 INODE_ITEM 0) itemoff 3835 itemsize 160
inode generation 8 size 2048 block group 29360128 mode 100644 links 1
item 1 key (321 INODE_REF 256) itemoff 3819 itemsize 16
inode ref index 66 namelen 6 name: file65
item 2 key (321 XATTR_ITEM 3817753667) itemoff 3741 itemsize 78
location key (0 UNKNOWN 0) type 8
namelen 16 datalen 32 name: security.selinux
leaf 29995008 items 3 free space 1675 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
item 0 key (321 EXTENT_DATA 0) itemoff 1926 itemsize 2069
inline extent data size 2048 ram 2048 compress 0
item 1 key (322 INODE_ITEM 0) itemoff 1766 itemsize 160
inode generation 8 size 2048 block group 29360128 mode 100644 links 1
item 2 key (322 INODE_REF 256) itemoff 1750 itemsize 16
inode ref index 67 namelen 6 name: file66
...
Appendix C.
-----------
D.E. Knuth, The Art of Computer Programming, vol. 3 (Sorting and Searching),
Addison-Wesley, Reading, MA, 1973.
--
Edward O. Shishkin
Principal Software Engineer
Red Hat Czech
On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin <[email protected]> wrote:
> [...]
Hi to all,
First of all, let me say: Btrfs really has matured a lot in the last
months, and this is thanks to you guys (the developers!)
More and more people are making it their dedicated filesystem (MeeGo)
or an option (Ubuntu, Fedora).
So thank you very, very much for your ongoing efforts on making this
more and more a viable (and usable!) alternative/competition to zfs
:)
The problems Edward mentioned sound like some very important points
(issues?) to investigate.
I find it somewhat surprising that none of you guys commented on his mail.
I'm in no way a developer (in fact a power-user) but would
nevertheless be interested in your opinions on this matter -
especially Chris'.
Thanks!
Mat
Aloha Mat:
> On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin<[email protected]> wrote:
>> [...]
> [...]
> I find it somewhat surprising that none of you guys commented on his mail.
Me too.
> [...]
With all the best
Christian Stroetmann
Mat wrote:
> On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin <[email protected]> wrote:
>> [...]
> [...]
> I find it somewhat surprising that none of you guys commented on his mail.
It must be a highly unexpected and difficult question for file system
developers: "how efficiently does your file system manage disk space?"
In the meanwhile I confirm that the Btrfs design is completely broken:
records stored in the B-tree differ greatly from each other in size
(this is unacceptable!), and the balancing algorithms have been
modified in an insane manner. All these factors have led to the loss
of *all* bounds on internal fragmentation and to an exhaustive waste
of disk space (and memory!) in spite of the claimed property of
"scaling in their ability to address large storage".
This is not large storage, this is a "scalable sieve": you cannot rely
on finding some given amount of water in it even after infinitely
increasing the size of the sieve (read: escalating the pool of Btrfs
devices).
It seems that nobody reviewed Btrfs before its inclusion in the
mainline. I have found only a pair of recommendations, with the common
idea that the Btrfs maintainer is "not a crazy man", plus a number of
papers which admire the "Btrfs phenomenon". Sigh.
Well, let's decide what we can do in the current situation..
The first obvious point here is that we *cannot* put such a file
system into production, simply because it doesn't provide any
guarantees for our users regarding disk space utilization.
I'll explain with a simple example why this is important. Think of a
file system as a bank which deducts an interest q. I.e., an amount of
money N that you put into your account can be reduced to (N - qN).
That said, in order to buy a suit which costs M you should put not
less than M/(1-q) into your account. Now imagine that the bank deducts
100% (q=1). Will you bring your money to such a bank? No. Not just
because you are greedy, but also because you won't be able to plan
your purchases. So why should we push our users to keep money in such
a bank?
I should remind the developers that we work for *users*. They want a
*good* environment to run programs in. Our subsystems should provide
*efficient* management of users' resources (such as memory and disk
space). A subsystem which is going to send all of a user's resources
down the toilet is *bad*!!!
If you decide to base your file system on some algorithms, then please
use the original ones from proper academic papers. DO NOT modify the
algorithms in solitude: this is a very fragile thing! All such
modifications must be reviewed by specialists in the theory of
algorithms. Such a review can be done in various scientific journals
of the proper level.
Personally I don't see any way to improve the situation with Btrfs
except fully redesigning it. If you want to base your file system on
the paper of Ohad Rodeh, then please use *exactly* the Bayer B-trees
that he refers to. That said, make sure that all records you put into
the tree have equal length and that all non-root nodes of your tree
are at least half filled.
As to current Btrfs code: *NOT ACK*!!! I don't think we need such
"file systems".
Thanks!
On Fri, Jun 18, 2010 at 1:32 PM, Edward Shishkin
<[email protected]> wrote:
> Mat wrote:
>> [...]
>>> [...]
>>> That is, the very important property of B-trees which guarantees
>>> non-zero utilization has been lost, and I don't see any substitute
>>> for this property in the Btrfs code. In other words, where is a
>>> formal guarantee that all the disk space of our users won't be
>>> eaten by internal fragmentation? I consider such a guarantee a
>>> *necessary* condition for putting a file system into production.
Wow... a small part of me says 'well said', on the basis that your
assertions are true, but I do think there needs to be more
constructiveness in such critique; it is almost impossible to be a
great engineer and a great academic at once in a time-pressured
environment.
If you can produce some specific suggestions with code references,
I'm sure we'll get some good discussion with the potential to improve
from where we are.
Thanks,
Daniel
--
Daniel J Blueman
On Fri, Jun 18, 2010 at 03:32:16PM +0200, Edward Shishkin wrote:
> Mat wrote:
> >On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin <[email protected]> wrote:
> >>[...]
> >>For example, look at leaf 29425664: "items 1 free space 3892"
> >>(of 4096!!). Note that this "free" space (3892 bytes) is _dead_: any
> >>attempt to write to the file system will result in "No space left
> >>on device".
There are two easy ways to fix this problem. Turn off the inline
extents (max_inline=0) or allow splitting of the inline extents. I
didn't put in the splitting simply because the complexity was high while
the benefits were low (in comparison with just turning off the inline
extents).
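For reference, turning inline extents off would look like this (a
sketch, assuming the usual mount syntax):
# mount -o max_inline=0 /dev/sdb2 /mnt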
>
> [...]
> Well, let's decide what we can do in the current situation..
> The first obvious point here is that we *cannot* put such a file
> system into production, simply because it doesn't provide any
> guarantees for our users regarding disk space utilization.
Are you basing all of this on inline extents? The other extents of
variable size are more flexible (taking up the room in the leaf), but
they can also easily be split during balancing.
-chris
Chris Mason wrote:
> [...]
> There are two easy ways to fix this problem. Turn off the inline
> extents (max_inline=0) or allow splitting of the inline extents. I
> didn't put in the splitting simply because the complexity was high while
> the benefits were low (in comparison with just turning off the inline
> extents).
>
Hello, Chris. Thanks for the response!
I'm afraid that both ways won't fix the problem. Look at this leaf:
[...]
leaf 29425664 items 1 free space 3892 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
item 0 key (320 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78
location key (0 UNKNOWN 0) type 8
namelen 16 datalen 32 name: security.selinux
[...]
There are no inline extents, so what are you going to split here?
All leaves must be at least half filled, otherwise we lose all the
bounds that provide non-zero utilization..
Any ideas?
Thanks,
Edward.
> [...]
On Fri, Jun 18, 2010 at 05:05:46PM +0200, Edward Shishkin wrote:
> Chris Mason wrote:
> >[...]
> Hello, Chris. Thanks for the response!
> I'm afraid that both ways won't fix the problem. Look at this leaf:
>
> [...]
> leaf 29425664 items 1 free space 3892 generation 8 owner 5
> [...]
>
> There are no inline extents, so what are you going to split here?
> All leaves must be at least half filled, otherwise we lose all the
> bounds that provide non-zero utilization..
Right, there is no inline extent because we require them to fit entirely
in the leaf. So we end up with mostly empty leaves because the inline
item is large enough to make it difficult to push around but not large
enough to fill the leaf.
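(Concretely, with the Appendix B numbers: a 2069-byte inline item
cannot be pushed into a neighbor with only 1901 bytes free, and two of
them, 2*2069 = 4138 bytes, never fit in one 4096-byte leaf.)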
-chris
Chris Mason wrote:
> On Fri, Jun 18, 2010 at 03:32:16PM +0200, Edward Shishkin wrote:
>> [...]
>>>> [...]
> There are two easy ways to fix this problem. Turn off the inline
> extents (max_inline=0) or allow splitting of the inline extents. I
> didn't put in the splitting simply because the complexity was high while
> the benefits were low (in comparison with just turning off the inline
> extents).
>
But then the benefits of splitting must be high, because it solves this
problem if inline extents are turned on.
>
>> [...]
> Are you basing all of this on inline extents? The other extents of
> variable size are more flexible (taking up the room in the leaf), but
> they can also easily be split during balancing.
>
If we have to split everywhere, won't that have some (dramatic) impact
on the performance of the Btrfs filesystem?
As was said above: splitting has high complexity.
> -chris
Have fun
Christian Stroetmann
On Fri, Jun 18, 2010 at 05:21:21PM +0200, Christian Stroetmann wrote:
> Chris Mason wrote:
> >[...]
> But then the benefits of splitting must be high, because it solves
> this problem if inline extents are turned on.
It depends, we might also argue that for larger inline extents like this
we are better off with much larger leaves or with using max_inline=0
instead.
> >>[...]
> If we have to split everywhere, won't that have some (dramatic)
> impact on the performance of the Btrfs filesystem?
> As was said above: splitting has high complexity.
Yes. Both Edward and I have worked on the reiserfs sources where inline
extents were split during balancing. It made for a few surprises in the
file read/write code. They aren't impossible by any means, but I wanted
to avoid them.
-chris
Edward Shishkin wrote:
> If you decide to base your file system on some algorithms, then
> please use the original ones from proper academic papers. DO NOT
> modify the algorithms in solitude: this is a very fragile thing! All
> such modifications must be reviewed by specialists in the theory of
> algorithms. Such a review can be done in various scientific journals
> of the proper level.
>
> Personally I don't see any way to improve the situation with Btrfs
> except fully redesigning it. If you want to base your file system on
> the paper of Ohad Rodeh, then please use *exactly* the Bayer B-trees
> that he refers to. That said, make sure that all records you put
> into the tree have equal length and that all non-root nodes of your
> tree are at least half filled.
First, thanks Edward for identifying a specific problem with the
current btrfs implementation.
I've studied modified B-trees quite a lot and know enough to be sure
that they are quite robust when you modify them in all sorts of ways.
Moreover, you are incorrect to say there's an intrinsic algorithmic
problem with variable-length records. It is not true; if Knuth said
so, Knuth was mistaken.
This is easily shown by modelling variable-length records (key ->
string) as a range of fixed-length records ([key,index] -> byte) and
applying the standard B-tree algorithms to that, which guarantees
algorithmic properties such as space utilisation and time; then you
can substitute a "compressed" representation of contiguous index runs,
which amounts to nothing more than just storing the strings (split
where the algorithm says to do so) and endpoint indexes, and because
this compression does not expand (in any way that matters), the
classic algorithmic properties are still guaranteed.
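For instance, the variable-length record (key -> "abc") becomes three
fixed-length entries ([key,0] -> 'a', [key,1] -> 'b', [key,2] -> 'c'),
and a run of consecutive [key,i] entries within one node is then
stored as the substring plus its endpoint indexes.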
Variable-length keys are a different business. Those are trickier,
but as far as I know, btrfs doesn't use them.
> As to current Btrfs code: *NOT ACK*!!! I don't think we need such
> "file systems".
Btrfs provides many useful features that other filesystems don't. We
definitely need it, or something like it. You have identified a bug.
It's not a corruption bug, but it's definitely a bug, and probably
affects performance as well as space utilisation.
It is not a deep design bug; it is just a result of the packing and
balancing heuristics.
If you are still interested, please apply your knowledge of B-tree
algorithms to understanding why btrfs fails to balance the tree
sufficiently well, and then propose a fix.
Note that it's not necessarily a problem to have a few nodes with low
utilisation. Deliberate violation of the classic balancing heuristic
is often useful for performance.[*] The problem you've found is only a
real problem when there are _too many_ nodes with low utilisation.
[*] For example when filling a tree by inserting contiguously
ascending keys, the classic "split into two when full" heuristic gives
the worst possible results (50% lost space), and deliberately
underfilling the most actively updated nodes, which is not permitted
at all by the classic algorithm, gives denser packing in the end
(almost zero lost space). It's also faster. The trick is to make
sure there's just the right number of underfilled nodes...
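(Concretely: insert keys 1, 2, 3, ... with split-in-half, and every
node, once split, is never touched again, so it stays half full
forever; split "everything stays left, new node on the right" instead,
and you leave a chain of full nodes behind.)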
-- Jamie
Chris Mason wrote:
> On Fri, Jun 18, 2010 at 05:05:46PM +0200, Edward Shishkin wrote:
>> [...]
>
> Right, there is no inline extent because we require them to fit entirely
> in the leaf. So we end up with mostly empty leaves because the inline
> item is large enough to make it difficult to push around but not large
> enough to fill the leaf.
>
How about the left and right neighbors? They contain a lot of
free space (1572 and 1901 bytes respectively).
I am not happy with the very fact of such shallow leaves which
contain only one small (xattr) item..
Daniel J Blueman wrote:
> On Fri, Jun 18, 2010 at 1:32 PM, Edward Shishkin
> <[email protected]> wrote:
>> [...]
>>>> [...]
>
> Wow... a small part of me says 'well said', on the basis that your
> assertions are true, but I do think such a critique needs to be more
> constructive; it is almost impossible to be a great engineer and a
> great academic at once in a time-pressured environment.
>
Sure, it is impossible. I believe in the division of labour:
academics write the algorithms, and we (engineers) encode them.
I have noticed that events in Btrfs develop by a scenario not
predicted by the paper of the academic Ohad Rodeh (in spite of the
announcement that Btrfs is based on this paper). This is why I have
started to grumble..
Thanks.
> If you can produce some specifics and suggestions with code references,
> I'm sure we'll get some good discussion with potential to improve from
> where we are.
>
> Thanks,
> Daniel
>
On Fri, Jun 18, 2010 at 06:22:39PM +0200, Edward Shishkin wrote:
> [...]
> How about the left and right neighbors? They contain a lot of
> free space (1572 and 1901 bytes respectively).
> I am not happy with the very fact of such shallow leaves, which
> contain only one small (xattr) item..
Sure, the balancing can also be made more aggressive. This should be
very easy to fix.
-chris
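To make the merge criterion under discussion concrete, here is a
minimal userspace sketch using the free-space numbers from the dump
above. The names and constants are illustrative assumptions for the
example, not btrfs code:

#include <stdio.h>

#define LEAF_SIZE   4096
#define HEADER_SIZE  101   /* assumed per-leaf header size */

struct leaf { unsigned free; };   /* free bytes as reported by the dump */

/* bytes of item headers + payload held by a leaf */
static unsigned used(const struct leaf *l)
{
	return LEAF_SIZE - HEADER_SIZE - l->free;
}

/* two adjacent leaves can be merged when one side's items fit
 * into the other side's free space */
static int mergeable(const struct leaf *a, const struct leaf *b)
{
	return used(b) <= a->free;
}

int main(void)
{
	struct leaf left = { 1572 };   /* left neighbor  */
	struct leaf mid  = { 3892 };   /* leaf 29425664  */

	printf("merge 29425664 into its left neighbor: %s\n",
	       mergeable(&left, &mid) ? "yes" : "no");   /* prints "yes" */
	return 0;
}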
Daniel J Blueman wrote:
> On Fri, Jun 18, 2010 at 1:32 PM, Edward Shishkin
> <[email protected]> wrote:
>
>> Mat wrote:
>>
>>> On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin <[email protected]> wrote:
>>>
>>>> [...]
> Wow... a small part of me says 'well said', on the basis that your
> assertions are true, but I do think such a critique needs to be more
> constructive; it is almost impossible to be a great engineer and a
> great academic at once in a time-pressured environment.
>
I find this somewhat off-topic, but:
for sure it isn't impossible. History has shown, and the present
shows, that there are exceptions.
> If you can produce some specifics and suggestions with code references,
> I'm sure we'll get some good discussion with potential to improve from
> where we are.
>
> Thanks,
> Daniel
>
Have fun
Christian Stroetmann
Jamie Lokier wrote:
> Edward Shishkin wrote:
>
>> If you decide to base your file system on some algorithms, then please
>> use the original ones from proper academic papers. DO NOT modify the
>> algorithms in solitude: this is a very fragile thing! All such
>> modifications must be reviewed by specialists in the theory of
>> algorithms. Such a review can be done in various scientific journals
>> of the proper level.
>>
>> Personally I don't see any way to improve the situation with Btrfs
>> except fully redesigning it. If you want to base your file
>> system on the paper of Ohad Rodeh, then please use *exactly* the
>> Bayer B-trees that he refers to. That said, make sure that all
>> records you put into the tree have equal length and all non-root
>> nodes of your tree are at least half filled.
>>
> First, thanks Edward for identifying a specific problem with the
> current btrfs implementation.
>
> I've studied modified B-trees quite a lot and know enough to be sure
> that they are quite robust when you modify them in all sorts of ways.
>
This is the point: Which kind of modified B-tree data structure is best
suited?
> Moreover, you are incorrect to say there's an intrinsic algorithmic
> problem with variable-length records. It is not true; if Knuth said
> so, Knuth was mistaken.
>
> This is easily shown by modelling variable-length records (key ->
> string) as a range of fixed length records ([key,index] -> byte) and
> apply the standard B-tree algorithms to that, which guarantees
> algorithm properties such as space utilisation and time; then you can
> substitute a "compressed" representation of contiguous index runs,
> which amounts to nothing more than just storing the strings (split
>> where the algorithm says to do so) and endpoint indexes, and because
> this compression does not expand (in any way that matters), classic
> algorithmic properties are still guaranteed.
>
> Variable-length keys are a different business. Those are trickier,
> but as far as I know, btrfs doesn't use them.
>
>
>> As to current Btrfs code: *NOT ACK*!!! I don't think we need such
>> "file systems".
>>
> Btrfs provides many useful features that other filesystems don't. We
> definitely need it, or something like it. You have identified a bug.
> It's not a corruption bug, but it's definitely a bug, and probably
> affects performance as well as space utilisation.
>
> It is not a deep design bug; it is just a result of the packing and
> balancing heuristics.
>
I think this is the most important design question for filesystems
that use some kind of B-tree: if the wrong modified B-tree was chosen
as the fundamental data structure, then this is a deep design bug.
> If you are still interested, please apply your knowledge of B-tree
> algorithms to understanding why btrfs fails to balance the tree
> sufficiently well, and then propose a fix.
>
This is a general problem of filesystem design, especially the packing
and balancing heuristics, and a specific problem of the Btrfs
filesystem. You can't simply say "do it in this or that way". That's
why another filesystem uses something exotic like a B*-tree in
conjunction with dancing trees as its fundamental data structure,
which leads back to the deep design
question/problem/decision/bug/.... And after I followed the
explanations of this exotic B-tree version by its main developer, I
knew right from the start of the development of the Btrfs filesystem
that the right modified B-tree data structure had not been chosen,
because the chosen one was too simple and too general. And for some
days now I have had the impression that no design decision was made at
all, with the only exception that there had to be some kind of B-tree
algorithm/data structure in the Btrfs filesystem.
And I also think that such a deep design decision can't simply be
corrected afterwards (subjective opinion).
> Note that it's not necessarily a problem to have a few nodes with low
> utilisation. Deliberate violation of the classic balancing heuristic
> is often useful for performance.[*] The problem you've found is only a
> real problem when there are _too many_ nodes with low utilisation.
>
The problem that was found is the first problem with the chosen
modified B-tree data structure. I wouldn't call it a problem in a
special case only.
> [*] For example when filling a tree by inserting contiguously
> ascending keys, the classic "split into two when full" heuristic gives
> the worst possible results (50% lost space), and deliberately
> underfilling the most actively updated nodes, which is not permitted
> at all by the classic algorithm, gives denser packing in the end
> (almost zero lost space). It's also faster. The trick is to make
> sure there's just the right number of underfilled nodes...
>
Yes, but....
Firstly, maybe you are too focused on the classic B-tree algorithm here.
Secondly, a trick here, a split there, turning off a feature, and then?
Then we have complexity at the end, which brings us back to the start:
the design decision.
But if you say there are no deep problems, then I will believe you for now.
> -- Jamie
>
With all the best
Christian Stroetmann
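To make the reduction Jamie describes concrete, here is a minimal
sketch. Everything below is illustrative (the types and the 64-byte
fragment limit are assumptions for the example, not btrfs structures):
each variable-length record (key -> string) is viewed as a run of
fixed-size records ([key, index] -> byte); a contiguous run is stored
"compressed" as the fragment itself plus its endpoint indexes, and a
split dictated by the classic algorithm simply cuts the run in two, so
the classic utilization bound on fixed-size records carries over.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* the conceptual fixed-size unit the classic analysis counts;
 * never materialized on disk, only reasoned about */
struct fixed_rec {
	uint64_t key;
	uint32_t index;   /* position within the record's string */
	uint8_t  byte;
};

/* the "compressed" form actually stored: one contiguous run */
struct run {
	uint64_t key;
	uint32_t first, last;   /* endpoint indexes of the run */
	uint8_t  bytes[64];     /* the string fragment itself  */
};

/* a split dictated by the classic algorithm just cuts the run in two;
 * no conceptual fixed_rec is gained or lost, so the classic bound on
 * fixed-size records carries over */
static void split_run(const struct run *r, uint32_t at,
		      struct run *lo, struct run *hi)
{
	*lo = *hi = *r;
	lo->last  = at;
	lo->bytes[at + 1 - r->first] = '\0';
	hi->first = at + 1;
	memmove(hi->bytes, r->bytes + (at + 1 - r->first), r->last - at);
	hi->bytes[r->last - at] = '\0';
}

int main(void)
{
	struct run r = { .key = 42, .first = 0, .last = 10,
			 .bytes = "hello world" };
	struct run lo, hi;

	split_run(&r, 4, &lo, &hi);
	printf("lo: [%u..%u] \"%s\"\n", lo.first, lo.last, lo.bytes);
	printf("hi: [%u..%u] \"%s\"\n", hi.first, hi.last, hi.bytes);
	return 0;
}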
Jamie Lokier wrote:
> Edward Shishkin wrote:
>
>> If you decide to base your file system on some algorithms, then please
>> use the original ones from proper academic papers. DO NOT modify the
>> algorithms in solitude: this is a very fragile thing! All such
>> modifications must be reviewed by specialists in the theory of
>> algorithms. Such a review can be done in various scientific journals
>> of the proper level.
>>
>> Personally I don't see any way to improve the situation with Btrfs
>> except fully redesigning it. If you want to base your file
>> system on the paper of Ohad Rodeh, then please use *exactly* the
>> Bayer B-trees that he refers to. That said, make sure that all
>> records you put into the tree have equal length and all non-root
>> nodes of your tree are at least half filled.
>>
>
> First, thanks Edward for identifying a specific problem with the
> current btrfs implementation.
>
Hello Jamie.
> I've studied modified B-trees quite a lot and know enough to be sure
> that they are quite robust when you modify them in all sorts of ways.
>
Which property is robust?
> Moreover, you are incorrect to say there's an intrinsic algorithmic
> problem with variable-length records. It is not true; if Knuth said
> so, Knuth was mistaken.
>
I didn't say anything about intrinsic algorithmic problems :)
I just repeat (after Knuth et al) that B-trees with variable-length
records don't have any sane boundary for internal fragmentation.
The common idea is that if we don't want Btrfs to stay in an infinite
development stage, then we should choose some *sane* strategy (for
example, the paper of Ohad Rodeh) and strictly adhere to it in the
future.
> This is easily shown by modelling variable-length records (key ->
> string) as a range of fixed length records ([key,index] -> byte) and
> apply the standard B-tree algorithms to that, which guarantees
> algorithm properties such as space utilisation and time; then you can
> substitute a "compressed" representation of contiguous index runs,
> which amounts to nothing more than just storing the strings (split
> where the algorithm says to do so) and endpoint indexes, and because
> this compression does not expand (in any way that matters), classic
> algorithmic properties are still guaranteed.
>
> Variable-length keys are a different business. Those are trickier,
> but as far as I know, btrfs doesn't use them.
>
>
>> As to current Btrfs code: *NOT ACK*!!! I don't think we need such
>> "file systems".
>>
>
> Btrfs provides many useful features that other filesystems don't. We
> definitely need it, or something like it. You have identified a bug.
> It's not a corruption bug, but it's definitely a bug, and probably
> affects performance as well as space utilisation.
>
> It is not a deep design bug; it is just a result of the packing and
> balancing heuristics.
>
Frankly, I would like to believe in such an outcome, taking into
account the amount of my contributions to the Btrfs project. At least
to make sure I didn't do useless work..
> If you are still interested, please apply your knowledge of B-tree
> algorithms to understanding why btrfs fails to balance the tree
> sufficiently well,
Because of the top-down balancing. It doesn't like "clever" things
like "splitting" and "merging". Currently, top-down balancing works
properly only with the stupid classic Bayer B-trees.
> and then propose a fix.
>
I'll try to help, but I am rather pessimistic here: working out
algorithms is the kind of work that doesn't like timelines..
> Note that it's not necessarily a problem to have a few nodes with low
> utilisation. Deliberate violation of the classic balancing heuristic
> is often useful for performance.[*]
Ok, let's violate this, but not to the detriment of utilisation:
internal fragmentation is a horror for file systems, the enemy #1
(one which was completely defeated in the last century, BTW).
> The problem you've found is only a
> real problem when there are _too many_ nodes with low utilisation.
>
IMHO this is a problem, as we cannot estimate the number of such nodes.
Do you have any sane upper bound for this number? I don't know of one.
Maybe I have missed something?
> [*] For example when filling a tree by inserting contiguously
> ascending keys, the classic "split into two when full" heuristic gives
> the worst possible results (50% lost space), and deliberately
> underfilling the most actively updated nodes, which is not permitted
> at all by the classic algorithm, gives denser packing in the end
> (almost zero lost space).
At the end of what? I hope you don't mean on-line defragmentation?
Can you point me to the paper (if any)?
Thanks!
> It's also faster. The trick is to make
> sure there's just the right number of underfilled nodes...
>
> -- Jamie
--
Edward O. Shishkin
Principal Software Engineer
Red Hat Czech
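For what it's worth, the effect in Jamie's footnote is easy to
reproduce in a toy model. The sketch below is a simulation under
stated assumptions (fixed-size items, ascending keys, one active
rightmost leaf), not a model of btrfs itself: classic split-in-half
permanently strands a half-full leaf at every split, while splitting
at the insertion point leaves full leaves behind.

#include <stdio.h>

#define CAP 100   /* items per leaf in this toy model */

/* Fill a tree with ascending keys; only the rightmost leaf ever
 * receives inserts. "classic" splits a full leaf in half (the left
 * half is never touched again); the alternative starts a fresh leaf
 * at the insertion point, leaving the old leaf 100% full. */
static double fill_ratio(long nkeys, int classic)
{
	long done = 0;     /* leaves that will never be touched again */
	long stored = 0;   /* items resting in those leaves           */
	long cur = 0;      /* items in the active rightmost leaf      */

	for (long i = 0; i < nkeys; i++) {
		if (cur == CAP) {
			if (classic) { done++; stored += CAP / 2; cur = CAP / 2; }
			else         { done++; stored += CAP;     cur = 0;       }
		}
		cur++;
	}
	return (double)(stored + cur) / (double)((done + 1) * CAP);
}

int main(void)
{
	printf("classic split-in-half: %.2f\n", fill_ratio(1000000, 1));
	printf("split at insert point: %.2f\n", fill_ratio(1000000, 0));
	return 0;
}

Run, this should print roughly 0.50 for the first variant and 1.00
for the second.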
On Fri, Jun 18, 2010 at 09:29:40PM +0200, Edward Shishkin wrote:
> [...]
> I didn't say anything about intrinsic algorithmic problems :)
> I just repeat (after Knuth et al) that B-trees with variable-length
> records don't have any sane boundary for internal fragmentation.
> The common idea is that if we don't want Btrfs to stay in an infinite
> development stage, then we should choose some *sane* strategy (for
> example, the paper of Ohad Rodeh) and strictly adhere to it in the
> future.
Again, other than the inline file data, what exactly do you believe
needs to change? Top down balancing vs balancing on insertion doesn't
impact our ability to maintain full leaves. The current code is clearly
choosing not to merge two leaves that it should have merged, which is
just a plain old bug.
-chris
Chris Mason wrote:
> On Fri, Jun 18, 2010 at 09:29:40PM +0200, Edward Shishkin wrote:
>
>> [...]
>
> Again, other than the inline file data, what exactly do you believe
> needs to change?
1. getting rid of inline extents;
2. new formats for directory and xattr items, so that they don't look
like a train which can occupy a whole leaf;
3. making sure we do pro-active balancing as described in the paper.
Sorry, I don't see other ways for now..
> Top down balancing vs balancing on insertion doesn't
> impact our ability to maintain full leaves. The current code is clearly
> choosing not to merge two leaves that it should have merged, which is
> just a plain old bug.
>
How are you going to balance leaves while walking from top to bottom?
Suppose 1) and 2) above are not satisfied, and having arrived at the
leaf level we see a number of items of variable length. What will we
do to keep the leaves full?
Could you please provide a sketch of the algorithm?
Thanks!
--
Edward O. Shishkin
Principal Software Engineer
Red Hat Czech
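For reference, the classic "pro-active" (single-pass, top-down)
scheme found in textbook preemptive-split B-trees can be sketched as
follows. This is a generic illustration with toy types and stub
helpers, not a description of the btrfs code:

#include <stddef.h>

struct item { unsigned size; };

struct node {
	int      leaf;
	unsigned free;             /* free bytes in this node */
	struct node *child[16];    /* toy fanout              */
	unsigned nchild;
};

/* toy stubs so the sketch compiles; real versions would pick the
 * child by key range and actually move items during the split */
static struct node *pick_child(struct node *n, unsigned key)
{
	return n->child[key % n->nchild];
}

static void split_child(struct node *parent, struct node *child)
{
	(void)parent;              /* parent is known to have room: it
				      was fixed on the way down      */
	child->free += 2048;       /* toy stand-in for a real split   */
}

/* the actual point: repair any child that might overflow *before*
 * descending into it, so a split never has to propagate back up */
void descend_insert(struct node *n, unsigned key, struct item it)
{
	while (!n->leaf) {
		struct node *c = pick_child(n, key);
		if (c->free < it.size)
			split_child(n, c);
		n = pick_child(n, key);   /* re-pick: the split may have
					     moved the key's range     */
	}
	n->free -= it.size;               /* leaf-level insert */
}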
On 06/18/2010 06:04 PM, Edward Shishkin wrote:
> Chris Mason wrote:
>> On Fri, Jun 18, 2010 at 09:29:40PM +0200, Edward Shishkin wrote:
>>> [...]
>>
>> Again, other than the inline file data, what exactly do you believe
>> needs to change?
>
> 1. getting rid of inline extents;
> 2. new formats for directory and xattr items, so that they don't look
> like a train which can occupy a whole leaf;
> 3. making sure we do pro-active balancing as described in the paper.
>
> Sorry, I don't see other ways for now..
>
>> Top down balancing vs balancing on insertion doesn't
>> impact our ability to maintain full leaves. The current code is clearly
>> choosing not to merge two leaves that it should have merged, which is
>> just a plain old bug.
>
> How are you going to balance leaves while walking from top to bottom?
> Suppose 1) and 2) above are not satisfied, and having arrived at the
> leaf level we see a number of items of variable length. What will we
> do to keep the leaves full?
>
> Could you please provide a sketch of the algorithm?
>
> Thanks!
>
Hi Edward,
Is it really a requirement to have 100% full leaves? Most DBs (if I
remember correctly) have deliberate strategies around this kind of
thing. You might want to leave room in leaf nodes so that future
insertions can be contiguous on the media with older data.
Regards,
Ric
Ric Wheeler wrote:
> On 06/18/2010 06:04 PM, Edward Shishkin wrote:
>> [...]
>
> Hi Edward,
>
> Is it really a requirement to have 100% full leaves? Most DBs (if I
> remember correctly) have deliberate strategies around this kind of
> thing. You might want to leave room in leaf nodes so that future
> insertions can be contiguous on the media with older data.
>
> Regards,
>
> Ric
Hello Ric.
No, not every leaf necessarily has to be 100% full.
We may require every L-vicinity(*) of every leaf to be full in
some sense, and this local condition sometimes yields a (global)
lower bound for the utilization of the whole tree.
In the classic Bayer B-trees the so-called "S(0)" balancing
condition is implicitly satisfied: it requires every leaf to be at
least half full, and it provides the lower bound 0.5 for the
utilization of the whole tree.
The next example is the ReiserFS file system, which uses the
so-called "S(1)" balancing condition on the leaf level: it requires
every 1-vicinity of every leaf to be "incompressible" (i.e. two
adjacent leaves cannot be squeezed into a single one). This local
incompressibility yields the sweet lower utilization bound 0.5-E for
the whole tree (E is a small number, inversely proportional to the
tree branching).
(*) any set of L+1 neighboring nodes which contains the leaf.
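(To spell out why S(1) yields such a bound, in simplified form: if
used(i) + used(i+1) > C for every pair of adjacent leaves, where C is
the leaf capacity, then summing over the n-1 adjacent pairs gives
2U > (n-1)C for the total used bytes U, so the average utilization
U/(nC) exceeds (1/2)(1 - 1/n). The precise E in ReiserFS depends on
the tree branching rather than on the number of leaves n, but the
mechanism is the same: a local incompressibility condition forces a
global lower bound.)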
--
Edward O. Shishkin
Principal Software Engineer
Red Hat Czech
On Sat, Jun 19, 2010 at 12:04:06AM +0200, Edward Shishkin wrote:
> Chris Mason wrote:
> >[...]
> >Again, other than the inline file data, what exactly do you believe
> >needs to change?
>
> 1. getting rid of inline extents;
> 2. new formats for directory and xattr items, so that they don't look
> like a train which can occupy a whole leaf;
> 3. making sure we do pro-active balancing as described in the paper.
>
> Sorry, I don't see other ways for now..
>
> > Top down balancing vs balancing on insertion doesn't
> >impact our ability to maintain full leaves. The current code is clearly
> >choosing not to merge two leaves that it should have merged, which is
> >just a plain old bug.
>
> How are you going to balance leaves while walking from top to bottom?
> Suppose 1) and 2) above are not satisfied, and having arrived at the
> leaf level we see a number of items of variable length. What will we
> do to keep the leaves full?
I think I'm mixing up complaints ;) Top-down balancing is unrelated to
the variable-length items. You could balance from the bottom up, and if
you have a 1K item that can't be split next to a 3K item that can't be
split, we're going to have those two items in two leaves. This is true
regardless of how we balance. So the prime complaint here is that you
don't like variable-length items, correct?
Ideally leaves are full, and ideally data blocks are full. In practice
we can end up with either one utilized less than we would like. At the
end of the day, every FS has a worst-case setup where it is mostly empty
in terms of bytes stored but full in terms of blocks. On xfs:
for x in `seq 1 bazillion`; do echo 'f' > $x; done
This would work on ext4 too, until you ran out of inodes.
>
> Could you please provide a sketch of the algorithm?
>
I'll reproduce from your test case and provide a fix. mount -o
max_inline=1500 would give us 50% usage in the worst case (minus the
balancing bug you hit).
-chris
On Mon, Jun 21, 2010 at 09:15:28AM -0400, Chris Mason wrote:
>
> I'll reproduce from your test case and provide a fix. mount -o
> max_inline=1500 would give us 50% usage in the worst case (minus the
> balancing bug you hit).
Ok, the balancing bug was interesting. What happens is we create all
the inodes and directory items nicely packed into the leaves.
Then delayed allocation kicks in and starts inserting the big fat inline
extents. This often forces the balancing code to split a leaf twice in
order to make room for the new inline extent. The double split code
wasn't balancing the bits that were left over on either side of our
desired slot.
The patch below fixes the problem for me, reducing the number of leaves
that have more than 2K of free space down from 6% of the total to about
74 leaves. It could be zero, but the balancing code doesn't push
items around if our leaf is in the first or last slot of the parent
node (complexity vs benefit tradeoff).
With the patch, I'm able to create 106894 files (2K each) on a 1GB FS.
That doesn't sound like a huge improvement, but the default from
mkfs.btrfs is to duplicate metadata. After duplication, that's about
417MB or about 40% of the overall space.
When you factor in the space that we reserve to avoid exploding on
enospc and the space that we have allocated for data extents, that's not
a very surprising number.
I'm still putting this patch through more testing, the double split code
is used in some difficult corners and I need to make sure I've tried
all of them.
-chris
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 0d1d966..1f393b0 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2304,12 +2304,17 @@ noinline int btrfs_leaf_free_space(struct btrfs_root *root,
return ret;
}
+/*
+ * min slot controls the lowest index we're willing to push to the
+ * right. We'll push up to and including min_slot, but no lower
+ */
static noinline int __push_leaf_right(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
struct btrfs_path *path,
int data_size, int empty,
struct extent_buffer *right,
- int free_space, u32 left_nritems)
+ int free_space, u32 left_nritems,
+ u32 min_slot)
{
struct extent_buffer *left = path->nodes[0];
struct extent_buffer *upper = path->nodes[1];
@@ -2327,7 +2332,7 @@ static noinline int __push_leaf_right(struct btrfs_trans_handle *trans,
if (empty)
nr = 0;
else
- nr = 1;
+ nr = max_t(u32, 1, min_slot);
if (path->slots[0] >= left_nritems)
push_space += data_size;
@@ -2469,10 +2474,13 @@ out_unlock:
*
* returns 1 if the push failed because the other node didn't have enough
* room, 0 if everything worked out and < 0 if there were major errors.
+ *
+ * this will push starting from min_slot to the end of the leaf. It won't
+ * push any slot lower than min_slot
*/
static int push_leaf_right(struct btrfs_trans_handle *trans, struct btrfs_root
*root, struct btrfs_path *path, int data_size,
- int empty)
+ int empty, u32 min_slot)
{
struct extent_buffer *left = path->nodes[0];
struct extent_buffer *right;
@@ -2515,7 +2523,7 @@ static int push_leaf_right(struct btrfs_trans_handle *trans, struct btrfs_root
goto out_unlock;
return __push_leaf_right(trans, root, path, data_size, empty,
- right, free_space, left_nritems);
+ right, free_space, left_nritems, min_slot);
out_unlock:
btrfs_tree_unlock(right);
free_extent_buffer(right);
@@ -2525,12 +2533,17 @@ out_unlock:
/*
* push some data in the path leaf to the left, trying to free up at
* least data_size bytes. returns zero if the push worked, nonzero otherwise
+ *
+ * max_slot can put a limit on how far into the leaf we'll push items. The
+ * item at 'max_slot' won't be touched. Use (u32)-1 to make us do all the
+ * items
*/
static noinline int __push_leaf_left(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
struct btrfs_path *path, int data_size,
int empty, struct extent_buffer *left,
- int free_space, int right_nritems)
+ int free_space, u32 right_nritems,
+ u32 max_slot)
{
struct btrfs_disk_key disk_key;
struct extent_buffer *right = path->nodes[0];
@@ -2549,9 +2562,9 @@ static noinline int __push_leaf_left(struct btrfs_trans_handle *trans,
slot = path->slots[1];
if (empty)
- nr = right_nritems;
+ nr = min(right_nritems, max_slot);
else
- nr = right_nritems - 1;
+ nr = min(right_nritems - 1, max_slot);
for (i = 0; i < nr; i++) {
item = btrfs_item_nr(right, i);
@@ -2712,10 +2725,14 @@ out:
/*
* push some data in the path leaf to the left, trying to free up at
* least data_size bytes. returns zero if the push worked, nonzero otherwise
+ *
+ * max_slot can put a limit on how far into the leaf we'll push items. The
+ * item at 'max_slot' won't be touched. Use (u32)-1 to make us push all the
+ * items
*/
static int push_leaf_left(struct btrfs_trans_handle *trans, struct btrfs_root
*root, struct btrfs_path *path, int data_size,
- int empty)
+ int empty, u32 max_slot)
{
struct extent_buffer *right = path->nodes[0];
struct extent_buffer *left;
@@ -2762,7 +2779,8 @@ static int push_leaf_left(struct btrfs_trans_handle *trans, struct btrfs_root
}
return __push_leaf_left(trans, root, path, data_size,
- empty, left, free_space, right_nritems);
+ empty, left, free_space, right_nritems,
+ max_slot);
out:
btrfs_tree_unlock(left);
free_extent_buffer(left);
@@ -2855,6 +2873,59 @@ static noinline int copy_for_split(struct btrfs_trans_handle *trans,
}
/*
+ * double splits happen when we need to insert a big item in the middle
+ * of a leaf. A double split can leave us with 3 mostly empty leaves:
+ * leaf: [ slots 0 - N] [ our target ] [ N + 1 - total in leaf ]
+ * A B C
+ *
+ * We avoid this by trying to push the items on either side of our target
+ * into the adjacent leaves. If all goes well we can avoid the double split
+ * completely.
+ */
+static noinline int push_for_double_split(struct btrfs_trans_handle *trans,
+ struct btrfs_root *root,
+ struct btrfs_path *path)
+{
+ int ret;
+ int progress = 0;
+ int slot;
+
+ slot = path->slots[0];
+
+ /*
+ * try to push all the items after our slot into the
+ * right leaf
+ */
+ ret = push_leaf_right(trans, root, path, 1, 0, slot);
+ if (ret < 0)
+ return ret;
+
+ if (ret == 0)
+ progress++;
+
+ /*
+ * our goal is to get our slot at the start or end of a leaf. If
+ * we've done so we're done
+ */
+ if (path->slots[0] == 0 ||
+ path->slots[0] == btrfs_header_nritems(path->nodes[0]))
+ return 0;
+
+ /* try to push all the items before our slot into the next leaf */
+ slot = path->slots[0];
+ ret = push_leaf_left(trans, root, path, 1, 0, slot);
+ if (ret < 0)
+ return ret;
+
+ if (ret == 0)
+ progress++;
+
+ if (progress)
+ return 0;
+ return 1;
+}
+
+/*
* split the path's leaf in two, making sure there is at least data_size
* available for the resulting leaf level of the path.
*
@@ -2876,6 +2947,7 @@ static noinline int split_leaf(struct btrfs_trans_handle *trans,
int wret;
int split;
int num_doubles = 0;
+ int tried_avoid_double = 0;
l = path->nodes[0];
slot = path->slots[0];
@@ -2884,12 +2956,13 @@ static noinline int split_leaf(struct btrfs_trans_handle *trans,
return -EOVERFLOW;
/* first try to make some room by pushing left and right */
- if (data_size && ins_key->type != BTRFS_DIR_ITEM_KEY) {
- wret = push_leaf_right(trans, root, path, data_size, 0);
+ if (data_size) {
+ wret = push_leaf_right(trans, root, path, data_size, 0, 0);
if (wret < 0)
return wret;
if (wret) {
- wret = push_leaf_left(trans, root, path, data_size, 0);
+ wret = push_leaf_left(trans, root, path,
+ data_size, 0, (u32)-1);
if (wret < 0)
return wret;
}
@@ -2923,6 +2996,12 @@ again:
if (mid != nritems &&
leaf_space_used(l, mid, nritems - mid) +
data_size > BTRFS_LEAF_DATA_SIZE(root)) {
+ if (!tried_avoid_double) {
+ push_for_double_split(trans,
+ root, path);
+ tried_avoid_double = 1;
+ goto again;
+ }
split = 2;
}
}
@@ -2939,6 +3018,12 @@ again:
if (mid != nritems &&
leaf_space_used(l, mid, nritems - mid) +
data_size > BTRFS_LEAF_DATA_SIZE(root)) {
+ if (!tried_avoid_double) {
+ push_for_double_split(trans,
+ root, path);
+ tried_avoid_double = 1;
+ goto again;
+ }
split = 2 ;
}
}
@@ -3915,13 +4000,14 @@ int btrfs_del_items(struct btrfs_trans_handle *trans, struct btrfs_root *root,
extent_buffer_get(leaf);
btrfs_set_path_blocking(path);
- wret = push_leaf_left(trans, root, path, 1, 1);
+ wret = push_leaf_left(trans, root, path, 1, 1, (u32)-1);
if (wret < 0 && wret != -ENOSPC)
ret = wret;
if (path->nodes[0] == leaf &&
btrfs_header_nritems(leaf)) {
- wret = push_leaf_right(trans, root, path, 1, 1);
+ wret = push_leaf_right(trans, root, path,
+ 1, 1, 0);
if (wret < 0 && wret != -ENOSPC)
ret = wret;
}
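(In short: the new min_slot/max_slot parameters let callers push only
the items on one side of the target slot, push_for_double_split() uses
them to try to move everything around the target into the adjacent
leaves before split_leaf() falls back to split = 2, and the old
BTRFS_DIR_ITEM_KEY special case in split_leaf() is dropped along the
way.)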
Chris Mason wrote:
> On Mon, Jun 21, 2010 at 09:15:28AM -0400, Chris Mason wrote:
>
>> I'll reproduce from your test case and provide a fix. mount -o
>> max_inline=1500 would give us 50% usage in the worst case
This is a very strange statement: how did you calculate this lower bound?
>> (minus the
>> balancing bug you hit).
>>
>
> Ok, the balancing bug was interesting. What happens is we create all
> the inodes and directory items nicely packed into the leaves.
>
> Then delayed allocation kicks in and starts inserting the big fat inline
> extents. This often forces the balancing code to split a leaf twice in
> order to make room for the new inline extent. The double split code
> wasn't balancing the bits that were left over on either side of our
> desired slot.
>
> The patch below fixes the problem for me, reducing the number of leaves
> that have more than 2K of free space down from 6% of the total to about
> 74 leaves. It could be zero, but the balancing code doesn't push
> items around if our leaf is in the first or last slot of the parent
> node (complexity vs benefit tradeoff).
>
Nevertheless, I see leaves which are not in the first or last slot,
but which are mergeable with their neighbors (the test case is the same):
leaf 269557760 items 22 free space 2323 generation 25 owner 2
fs uuid 614fb921-cfa9-403d-9feb-940021e72382
chunk uuid b1674882-a445-45f9-b250-0985e483d231
leaf 280027136 items 18 free space 2627 generation 25 owner 2
fs uuid 614fb921-cfa9-403d-9feb-940021e72382
chunk uuid b1674882-a445-45f9-b250-0985e483d231
node 269549568 level 1 items 60 free 61 generation 25 owner 2
fs uuid 614fb921-cfa9-403d-9feb-940021e72382
chunk uuid b1674882-a445-45f9-b250-0985e483d231
key (175812608 EXTENT_ITEM 4096) block 175828992 (42927) gen 15
key (176025600 EXTENT_ITEM 4096) block 176111616 (42996) gen 15
key (176238592 EXTENT_ITEM 4096) block 176300032 (43042) gen 15
key (176451584 EXTENT_ITEM 4096) block 216248320 (52795) gen 17
key (176672768 EXTENT_ITEM 4096) block 216236032 (52792) gen 17
key (176783360 EXTENT_ITEM 4096) block 216252416 (52796) gen 17
key (176955392 EXTENT_ITEM 4096) block 138854400 (33900) gen 25
key (177131520 EXTENT_ITEM 4096) block 280289280 (68430) gen 25
key (177348608 EXTENT_ITEM 4096) block 280285184 (68429) gen 25
key (177561600 EXTENT_ITEM 4096) block 269557760 (65810) gen 25
key (177795072 EXTENT_ITEM 4096) block 280027136 (68366) gen 25
key (178008064 EXTENT_ITEM 4096) block 280064000 (68375) gen 25
key (178233344 EXTENT_ITEM 4096) block 285061120 (69595) gen 25
key (178442240 EXTENT_ITEM 4096) block 178442240 (43565) gen 16
key (178655232 EXTENT_ITEM 4096) block 178655232 (43617) gen 16
key (178868224 EXTENT_ITEM 4096) block 178868224 (43669) gen 16
[...]
> With the patch, I'm able to create 106894 files (2K each) on a 1GB FS.
> That doesn't sound like a huge improvement, but the default from
> mkfs.btrfs is to duplicate metadata. After duplication, that's about
> 417MB or about 40% of the overall space.
>
> When you factor in the space that we reserve to avoid exploding on
> enospc and the space that we have allocated for data extents, that's not
> a very surprising number.
>
> I'm still putting this patch through more testing, the double split code
> is used in some difficult corners and I need to make sure I've tried
> all of them.
>
Chris, for further code review we need documents which reflect the
original ideas of the balancing, etc. Could you please provide them?
Obviously, I cannot write them instead of you, as I don't know those
ideas.
Thanks,
Edward.
> -chris
>
> [...]
--
Edward O. Shishkin
Principal Software Engineer
Red Hat Czech
On Tue, Jun 22, 2010 at 04:12:57PM +0200, Edward Shishkin wrote:
> Chris Mason wrote:
> >On Mon, Jun 21, 2010 at 09:15:28AM -0400, Chris Mason wrote:
> >>I'll reproduce from your test case and provide a fix. mount -o
> >>max_inline=1500 would give us 50% usage in the worst case
>
> This is a very strange statement: how did you calculate this lower bound?
We want room for the extent and the inode item and the inode backref.
It's a rough number that leaves some extra room. But even with a 2K
item we're getting very close to 50% usage of the metadata area.
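(Rough arithmetic behind that estimate, treating the sizes as
approximations rather than exact on-disk numbers: a 4K leaf has about
4096 - 101 = 3995 usable bytes, and a capped inline extent costs at
most roughly 1500 bytes of data plus per-item overhead, i.e. well
under half of the usable area. So two capped extents always fit in one
leaf, and a leaf holding one capped extent plus the accompanying inode
item and backref sits in the neighbourhood of half full.)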
>
> >> (minus the
> >>balancing bug you hit).
> >
> >Ok, the balancing bug was interesting. What happens is we create all
> >the inodes and directory items nicely packed into the leaves.
> >
> >Then delayed allocation kicks in and starts inserting the big fat inline
> >extents. This often forces the balancing code to split a leaf twice in
> >order to make room for the new inline extent. The double split code
> >wasn't balancing the bits that were left over on either side of our
> >desired slot.
> >
> >The patch below fixes the problem for me, reducing the number of leaves
> >that have more than 2K of free space down from 6% of the total to about
> >74 leaves. It could be zero, but the balancing code doesn't push
> >items around if our leaf is in the first or last slot of the parent
> >node (complexity vs benefit tradeoff).
>
> Nevertheless, I see leaves, which are not in the first or last slot,
> but mergeable with their neighbors (test case is the same):
Correct, but it was in the first or last slot when it was balanced (I
confirmed this with printk).
Then the higher layers were balanced and our leaves were no longer in
the first/last slot. We don't rebalance leaves when we balance level 1.
>
> leaf 269557760 items 22 free space 2323 generation 25 owner 2
> fs uuid 614fb921-cfa9-403d-9feb-940021e72382
> chunk uuid b1674882-a445-45f9-b250-0985e483d231
>
> leaf 280027136 items 18 free space 2627 generation 25 owner 2
> fs uuid 614fb921-cfa9-403d-9feb-940021e72382
> chunk uuid b1674882-a445-45f9-b250-0985e483d231
>
> node 269549568 level 1 items 60 free 61 generation 25 owner 2
> fs uuid 614fb921-cfa9-403d-9feb-940021e72382
> chunk uuid b1674882-a445-45f9-b250-0985e483d231
> key (175812608 EXTENT_ITEM 4096) block 175828992 (42927) gen 15
> key (176025600 EXTENT_ITEM 4096) block 176111616 (42996) gen 15
> key (176238592 EXTENT_ITEM 4096) block 176300032 (43042) gen 15
> key (176451584 EXTENT_ITEM 4096) block 216248320 (52795) gen 17
> key (176672768 EXTENT_ITEM 4096) block 216236032 (52792) gen 17
> key (176783360 EXTENT_ITEM 4096) block 216252416 (52796) gen 17
> key (176955392 EXTENT_ITEM 4096) block 138854400 (33900) gen 25
> key (177131520 EXTENT_ITEM 4096) block 280289280 (68430) gen 25
> key (177348608 EXTENT_ITEM 4096) block 280285184 (68429) gen 25
> key (177561600 EXTENT_ITEM 4096) block 269557760 (65810) gen 25
> key (177795072 EXTENT_ITEM 4096) block 280027136 (68366) gen 25
> key (178008064 EXTENT_ITEM 4096) block 280064000 (68375) gen 25
> key (178233344 EXTENT_ITEM 4096) block 285061120 (69595) gen 25
> key (178442240 EXTENT_ITEM 4096) block 178442240 (43565) gen 16
> key (178655232 EXTENT_ITEM 4096) block 178655232 (43617) gen 16
> key (178868224 EXTENT_ITEM 4096) block 178868224 (43669) gen 16
> [...]
>
> >With the patch, I'm able to create 106894 files (2K each) on a 1GB FS.
> >That doesn't sound like a huge improvement, but the default from
> >mkfs.btrfs is to duplicate metadata. After duplication, that's about
> >417MB or about 40% of the overall space.
> >
> >When you factor in the space that we reserve to avoid exploding on
> >enospc and the space that we have allocated for data extents, that's not
> >a very surprising number.
> >
> >I'm still putting this patch through more testing, the double split code
> >is used in some difficult corners and I need to make sure I've tried
> >all of them.
>
> Chris, for the further code review we need documents, which reflect
> the original ideas of the balancing, etc. Could you please provide them?
> Obviously, I can not do it instead of you, as I don't know those ideas.
>
Which part are you most interested in?
-chris
Chris Mason wrote:
> On Tue, Jun 22, 2010 at 04:12:57PM +0200, Edward Shishkin wrote:
>
>> Chris Mason wrote:
>>
>>> On Mon, Jun 21, 2010 at 09:15:28AM -0400, Chris Mason wrote:
>>>
>>>> I'll reproduce from your test case and provide a fix. mount -o
>>>> max_inline=1500 would give us 50% usage in the worst case
>>>>
>> This is a very strange statement: how did you calculate this lower bound?
>>
>
> We want room for the extent and the inode item and the inode backref.
> It's a rough number that leaves some extra room. But even with a 2K
> item we're getting very close to 50% usage of the metadata area.
>
>
>>>> (minus the
>>>> balancing bug you hit).
>>>>
>>> Ok, the balancing bug was interesting. What happens is we create all
>>> the inodes and directory items nicely packed into the leaves.
>>>
>>> Then delayed allocation kicks in and starts inserting the big fat inline
>>> extents. This often forces the balancing code to split a leaf twice in
>>> order to make room for the new inline extent. The double split code
>>> wasn't balancing the bits that were left over on either side of our
>>> desired slot.
>>>
>>> The patch below fixes the problem for me, reducing the number of leaves
>>> that have more than 2K of free space down from 6% of the total to about
>>> 74 leaves. It could be zero, but the balancing code doesn't push
>>> items around if our leaf is in the first or last slot of the parent
>>> node (complexity vs benefit tradeoff).
>>>
>> Nevertheless, I see leaves, which are not in the first or last slot,
>> but mergeable with their neighbors (test case is the same):
>>
>
> Correct, but it was in the first or last slot when it was balanced (I
> confirmed this with printk).
>
> Then the higher layers were balanced and our leaves were no longer in
> the first/last slot. We don't rebalance leaves when we balance level 1.
>
>
>> leaf 269557760 items 22 free space 2323 generation 25 owner 2
>> fs uuid 614fb921-cfa9-403d-9feb-940021e72382
>> chunk uuid b1674882-a445-45f9-b250-0985e483d231
>>
>> leaf 280027136 items 18 free space 2627 generation 25 owner 2
>> fs uuid 614fb921-cfa9-403d-9feb-940021e72382
>> chunk uuid b1674882-a445-45f9-b250-0985e483d231
>>
>> node 269549568 level 1 items 60 free 61 generation 25 owner 2
>> fs uuid 614fb921-cfa9-403d-9feb-940021e72382
>> chunk uuid b1674882-a445-45f9-b250-0985e483d231
>> key (175812608 EXTENT_ITEM 4096) block 175828992 (42927) gen 15
>> key (176025600 EXTENT_ITEM 4096) block 176111616 (42996) gen 15
>> key (176238592 EXTENT_ITEM 4096) block 176300032 (43042) gen 15
>> key (176451584 EXTENT_ITEM 4096) block 216248320 (52795) gen 17
>> key (176672768 EXTENT_ITEM 4096) block 216236032 (52792) gen 17
>> key (176783360 EXTENT_ITEM 4096) block 216252416 (52796) gen 17
>> key (176955392 EXTENT_ITEM 4096) block 138854400 (33900) gen 25
>> key (177131520 EXTENT_ITEM 4096) block 280289280 (68430) gen 25
>> key (177348608 EXTENT_ITEM 4096) block 280285184 (68429) gen 25
>> key (177561600 EXTENT_ITEM 4096) block 269557760 (65810) gen 25
>> key (177795072 EXTENT_ITEM 4096) block 280027136 (68366) gen 25
>> key (178008064 EXTENT_ITEM 4096) block 280064000 (68375) gen 25
>> key (178233344 EXTENT_ITEM 4096) block 285061120 (69595) gen 25
>> key (178442240 EXTENT_ITEM 4096) block 178442240 (43565) gen 16
>> key (178655232 EXTENT_ITEM 4096) block 178655232 (43617) gen 16
>> key (178868224 EXTENT_ITEM 4096) block 178868224 (43669) gen 16
>> [...]
>>
>>
>>> With the patch, I'm able to create 106894 files (2K each) on a 1GB FS.
>>> That doesn't sound like a huge improvement, but the default from
>>> mkfs.btrfs is to duplicate metadata. After duplication, that's about
>>> 417MB or about 40% of the overall space.
>>>
>>> When you factor in the space that we reserve to avoid exploding on
>>> enospc and the space that we have allocated for data extents, that's not
>>> a very surprising number.
>>>
>>> I'm still putting this patch through more testing, the double split code
>>> is used in some difficult corners and I need to make sure I've tried
>>> all of them.
>>>
>> Chris, for the further code review we need documents, which reflect
>> the original ideas of the balancing, etc. Could you please provide them?
>> Obviously, I can not do it instead of you, as I don't know those ideas.
>>
>>
>
> Which part are you most interested in?
>
Balancing.
What conditions do we want to provide?
Why won't we get a utilization slump in the future?
How do we need to fix things when something goes wrong?
Meanwhile, judging by the single existing paper, the Btrfs
design looks really broken, because:
1. the balancing condition n <= number_of_keys <= 2n+1
is not satisfied (indeed, when number_of_keys is 1 and
n > 1, the lower bound n <= number_of_keys fails);
2. the trees mentioned in this paper don't allow keys of
variable length.
Thanks,
Edward.
Edward Shishkin wrote:
> Chris Mason wrote:
> >On Tue, Jun 22, 2010 at 04:12:57PM +0200, Edward Shishkin wrote:
> >
> >>Chris Mason wrote:
> >>
> >>>On Mon, Jun 21, 2010 at 09:15:28AM -0400, Chris Mason wrote:
> >>>
> >>>>I'll reproduce from your test case and provide a fix. mount -o
> >>>>max_inline=1500 would give us 50% usage in the worst case
> >>>>
> >>This is a very strange statement: how did you calculate this lower bound?
> >>
> >
> >We want room for the extent and the inode item and the inode backref.
> >It's a rough number that leaves some extra room. But even with a 2K
> >item we're getting very close to 50% usage of the metadata area.
> >
> >
> >>>>(minus the
> >>>>balancing bug you hit).
> >>>>
> >>>Ok, the balancing bug was interesting. What happens is we create all
> >>>the inodes and directory items nicely packed into the leaves.
> >>>
> >>>Then delayed allocation kicks in and starts inserting the big fat inline
> >>>extents. This often forces the balancing code to split a leaf twice in
> >>>order to make room for the new inline extent. The double split code
> >>>wasn't balancing the bits that were left over on either side of our
> >>>desired slot.
> >>>
> >>>The patch below fixes the problem for me, reducing the number of leaves
> >>>that have more than 2K of free space down from 6% of the total to about
> >>>74 leaves. It could be zero, but the balancing code doesn't push
> >>>items around if our leaf is in the first or last slot of the parent
> >>>node (complexity vs benefit tradeoff).
> >>>
> >>Nevertheless, I see leaves, which are not in the first or last slot,
> >>but mergeable with their neighbors (test case is the same):
> >>
> >
> >Correct, but it was in the first or last slot when it was balanced (I
> >confirmed this with printk).
> >
> >Then the higher layers were balanced and our leaves were no longer in
> >the first/last slot. We don't rebalance leaves when we balance level 1.
> >
> >
> >>leaf 269557760 items 22 free space 2323 generation 25 owner 2
> >>fs uuid 614fb921-cfa9-403d-9feb-940021e72382
> >>chunk uuid b1674882-a445-45f9-b250-0985e483d231
> >>
> >>leaf 280027136 items 18 free space 2627 generation 25 owner 2
> >>fs uuid 614fb921-cfa9-403d-9feb-940021e72382
> >>chunk uuid b1674882-a445-45f9-b250-0985e483d231
> >>
> >>node 269549568 level 1 items 60 free 61 generation 25 owner 2
> >>fs uuid 614fb921-cfa9-403d-9feb-940021e72382
> >>chunk uuid b1674882-a445-45f9-b250-0985e483d231
> >> key (175812608 EXTENT_ITEM 4096) block 175828992 (42927) gen 15
> >> key (176025600 EXTENT_ITEM 4096) block 176111616 (42996) gen 15
> >> key (176238592 EXTENT_ITEM 4096) block 176300032 (43042) gen 15
> >> key (176451584 EXTENT_ITEM 4096) block 216248320 (52795) gen 17
> >> key (176672768 EXTENT_ITEM 4096) block 216236032 (52792) gen 17
> >> key (176783360 EXTENT_ITEM 4096) block 216252416 (52796) gen 17
> >> key (176955392 EXTENT_ITEM 4096) block 138854400 (33900) gen 25
> >> key (177131520 EXTENT_ITEM 4096) block 280289280 (68430) gen 25
> >> key (177348608 EXTENT_ITEM 4096) block 280285184 (68429) gen 25
> >> key (177561600 EXTENT_ITEM 4096) block 269557760 (65810) gen 25
> >> key (177795072 EXTENT_ITEM 4096) block 280027136 (68366) gen 25
> >> key (178008064 EXTENT_ITEM 4096) block 280064000 (68375) gen 25
> >> key (178233344 EXTENT_ITEM 4096) block 285061120 (69595) gen 25
> >> key (178442240 EXTENT_ITEM 4096) block 178442240 (43565) gen 16
> >> key (178655232 EXTENT_ITEM 4096) block 178655232 (43617) gen 16
> >> key (178868224 EXTENT_ITEM 4096) block 178868224 (43669) gen 16
> >>[...]
> >>
> >>
> >>>With the patch, I'm able to create 106894 files (2K each) on a 1GB FS.
> >>>That doesn't sound like a huge improvement, but the default from
> >>>mkfs.btrfs is to duplicate metadata. After duplication, that's about
> >>>417MB or about 40% of the overall space.
> >>>
> >>>When you factor in the space that we reserve to avoid exploding on
> >>>enospc and the space that we have allocated for data extents, that's not
> >>>a very surprising number.
> >>>
> >>>I'm still putting this patch through more testing, the double split code
> >>>is used in some difficult corners and I need to make sure I've tried
> >>>all of them.
> >>>
> >>Chris, for the further code review we need documents, which reflect
> >>the original ideas of the balancing, etc. Could you please provide them?
> >>Obviously, I can not do it instead of you, as I don't know those ideas.
> >>
> >>
> >
> >Which part are you most interested in?
> >
>
> Balancing.
>
> What conditions do we want to provide?
> Why won't we get a utilization slump in the future?
> How do we need to fix things when something goes wrong?
> Meanwhile, judging by the single existing paper, the Btrfs
> design looks really broken, because:
> 1. the balancing condition n <= number_of_keys <= 2n+1
> is not satisfied (indeed, when number_of_keys is 1 and
> n > 1, the lower bound n <= number_of_keys fails);
That doesn't matter. It is not necessary (or desirable) to follow the
classical B-tree algorithms to the letter to get sufficiently good
properties. B-trees are quite robust to changing the details, as long
as you follow the generalised principles which make them work.
(It is entirely possible that btrfs does not follow those principles
though :-)
All that matters to guarantee utilisation and algorithmic performance
is to do enough merging on deletions, and avoid doing too many splits,
to ensure that an amount X% of all used blocks (except the root) are
at least Y% full.
Y is 50 and X is 100 in your description, but other numbers work too.
Reducing X slightly improves utilisation in common filesystem
scenarios where many blocks are filled in ascending key sequences.
(But delayed allocation/writing, like btrfs does, is another way to
achieve a similar result.) Increasing Y improves minimum utilisation,
which seems like a good thing to demand from a filesystem - for the
actual data, anyway.
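(Stated as a bound: if at least X% of the used blocks are at least Y%
full, overall utilisation is at least (X/100)*(Y/100); X=100 and Y=50
recover the classic 1/2 guarantee.)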
From what's been said in this thread, I'm getting the impression that
btrfs does not guarantee a particular value of X, but rather depends
heuristically on typical behaviour to give an equivalent outcome, and
that that does not always work in some pathological cases. And also
that Y is/was not guaranteed, but that may be quite possibly an
ordinary bug and might have been fixed by Chris during this thread.
I must admit that only getting 200MB of 2kB files into a formatted 1GB
partition before it says ENOSPC - and that's after Chris's recent
tweak - is a bit disappointing. A little over 50% loss for 2kB files
is quite reasonable - most filesystems would do that just due to
padding to the next 4kB (the block size), and about 50% wastage in a
B-tree is fair enough (at least if there is no sane pattern to the
file's creation), as guaranteeing higher utilisation has a performance
penalty. But with the delayed allocation, I'd hope for better packing
than the worst case, and I'd expect even the worst case to be better
than 20% available for data.
Losing 80% of the space to utilisation loss and metadata duplication
is disappointing (to me), and a bit surprising, given that one of the
purposes of inline extents is to pack small files more efficiently
than if they were padded to the block size.
> 2. the trees mentioned in this paper don't allow keys of
> variable length.
Inline extents aren't the same as variable-length keys, because they
aren't sorted or searched by content. I showed in an earlier reply to
you how variable length *values*[1] can be kept from causing
unbalancing, provided the splits are put in acceptable places. But to
make sense of that, you do have to grok the generalising of B-trees
beyond the original algorithms.
[1] Think of the B-tree variants which contain (key,value) pairings
where "value" is not ordered, rather than the type which contain just
"keys" which are fully ordered.
Are there any true (fully ordered) variable-length keys in btrfs?
-- Jamie
Edward Shishkin wrote:
> I have noticed that events in Btrfs develop by a scenario not
> predicted by the paper of academic Ohad Rodeh (in spite of the
> announcement that Btrfs is based on this paper). This is why I have
> started to grumble...
In btrfs, "based on" means "started with the algorithm and ideas of",
not "uses the same algorithm as".
-- Jamie
Edward Shishkin wrote:
> I'll try to help, but I am rather pessimistic here: working out
> algorithms is something which doesn't like timelines...
Nonsense. Working out algorithms is just work to an algorithm
designer, just like programming is work to a programmer. Sure, some
things are harder than others, and there's an element of creativity.
But lots of it is just cranking the handle, even in algorithm design.
> >Note that it's not necessarily a problem to have a few nodes with low
> >utilisation. Deliberate violation of the classic balancing heuristic
> >is often useful for performance.[*]
>
> Ok, let's violate this, but not to the detriment of utilisation:
> internal fragmentation is a horror for file systems, the enemy #1
> (which was completely defeated in the last century, BTW).
>
> > The problem you've found is only a
> >real problem when there are _too many_ nodes with low utilisation.
>
> IMHO this is a problem, as we cannot estimate the number of such nodes.
> Do you have any sane upper boundary for this number? I don't know of one.
> Maybe I have missed something?
Well, I don't know if btrfs maintains enough statistics or other
implicit processes to guarantee a sane upper boundary for this. The
impression I'm getting from the thread is that it relies on heuristic
behaviour which is usually sufficient (and perhaps a bit faster than
updating statistics on the disk would be), but that it does not
provide a hard upper boundary. I'm really not sure, though. I
haven't read the code.
> >[*] For example when filling a tree by inserting contiguously
> >ascending keys, the classic "split into two when full" heuristic gives
> >the worst possible results (50% lost space), and deliberately
> >underfilling the most actively updated nodes, which is not permitted
> >at all by the classic algorithm, gives denser packing in the end
> >(almost zero lost space).
>
> At the end of what?
At the end of inserting keys in ascending order. Just think about the
classic B-tree when that is done: Node[1] fills to 100% utilisation,
then is split into two nodes at 50% (call them node[1] and node[2]).
Node[2] fills to 100%, then splits into node[2] at 50% and node[3].
Etc etc: Result is a million nodes at 50% utilisation, except the last
one.
If instead you fill node[1], then *don't* split it but permit node[2]
to have 0% to start with, let that fill to 100%, then don't split
node[2] and let node[3] start with 0%, etc. etc: Result is half a
million nodes at 100% utilisation, except the last one.
Both fit the desired bounds, but the second is more desirable in
practice, especially as it's common behaviour to fill with contiguous,
ascending keys in a filesystem (or blob-filled database) where data
extents are part of the tree :-)
To handle the cases of multiple independent ascending runs of keys
being written in parallel, you generalise to maintain more than one
below-50% block, near the "active insertion heads".
The above describes classic B-trees where updates are assumed to be
written immediately. Things are different when there's a memory cache
and delayed allocation/writing, and packing can be reorganised in
memory before sending to storage.
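The two fill strategies are easy to model. Here is a toy C simulation
(illustrative only, nothing like btrfs code) that reproduces the 50%
vs ~100% outcome for ascending insertions:

#include <stdio.h>

#define CAP   100       /* keys per leaf in the toy model */
#define NKEYS 1000000   /* keys inserted in ascending order */

int main(void)
{
    /* classic: on overflow, split in half; the insertion point
     * stays in the right half, freezing the left half at 50% */
    long classic_leaves = 1, classic_fill = 0;
    /* lazy: on overflow, just open a new empty leaf at the
     * insertion head, freezing the previous one at 100% */
    long lazy_leaves = 1, lazy_fill = 0;
    long i;

    for (i = 0; i < NKEYS; i++) {
        if (classic_fill == CAP) {
            classic_leaves++;
            classic_fill = CAP / 2;
        }
        classic_fill++;
        if (lazy_fill == CAP) {
            lazy_leaves++;
            lazy_fill = 0;
        }
        lazy_fill++;
    }
    printf("classic: %ld leaves, avg fill %.0f%%\n", classic_leaves,
           100.0 * NKEYS / (classic_leaves * CAP));
    printf("lazy:    %ld leaves, avg fill %.0f%%\n", lazy_leaves,
           100.0 * NKEYS / (lazy_leaves * CAP));
    return 0;
}

It prints roughly 20000 leaves at ~50% for the classic heuristic
against 10000 leaves at ~100% for the lazy one.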
> I hope you don't speak about on-line defragmentation?
No, not at all.
> Can you point me to the paper (if any)?
Sorry, no, I haven't seen this described in a paper. Imho it's
obvious when you think about it. But maybe it's not obvious to
everyone - perhaps this is even the heuristic modification missing
from btrfs which it would need to avoid the scenario which started
this thread?
-- Jamie
Just an FYI reminder. The original test (2K files) is utterly
pathological for disk drives with 4K physical sectors, such as
those now shipping from WD, Seagate, and others. Some of the
SSDs have larger (16K) or smaller (2K) blocks. There is also
the issue of btrfs over RAID (which I know is not entirely
sensible, but which will happen).
The absolute minimum allocation size for data should be the same
as, and aligned with, the underlying disk block size. If that
results in underutilization, I think that's a good thing for
performance, compared to read-modify-write cycles to update
partial disk blocks.
On Wed, Jun 23, 2010 at 8:43 PM, Daniel Taylor <[email protected]> wrote:
> Just an FYI reminder. The original test (2K files) is utterly
> pathological for disk drives with 4K physical sectors, such as
> those now shipping from WD, Seagate, and others. Some of the
> SSDs have larger (16K) or smaller (2K) blocks. There is also
> the issue of btrfs over RAID (which I know is not entirely
> sensible, but which will happen).
>
> The absolute minimum allocation size for data should be the same
> as, and aligned with, the underlying disk block size. If that
> results in underutilization, I think that's a good thing for
> performance, compared to read-modify-write cycles to update
> partial disk blocks.
Block size = 4k
Btrfs packs smaller objects into the blocks in certain cases.
On Wed, 2010-06-23 at 20:43 -0700, Daniel Taylor wrote:
> There is also the issue of btrfs over RAID (which I know is not
> entirely sensible, but which will happen).
Well, we could discourage that by merging the RAID support that's been
pending for a while.... but I suspect Chris is a bit busy right now :)
--
dwmw2
On Thu, Jun 24, 2010 at 12:37:59AM +0100, Jamie Lokier wrote:
> Edward Shishkin wrote:
> > Chris Mason wrote:
> > >On Tue, Jun 22, 2010 at 04:12:57PM +0200, Edward Shishkin wrote:
> > >
> > >>Chris Mason wrote:
> > >>
> > >>>On Mon, Jun 21, 2010 at 09:15:28AM -0400, Chris Mason wrote:
> > >>>
> > >>>>I'll reproduce from your test case and provide a fix. mount -o
> > >>>>max_inline=1500 would give us 50% usage in the worst case
> > >>>>
> > >>This is a very strange statement: how did you calculate this lower bound?
> > >>
> > >
> > >We want room for the extent and the inode item and the inode backref.
> > >It's a rough number that leaves some extra room. But even with a 2K
> > >item we're getting very close to 50% usage of the metadata area.
> > >
> > >
> > >>>>(minus the
> > >>>>balancing bug you hit).
> > >>>>
> > >>>Ok, the balancing bug was interesting. What happens is we create all
> > >>>the inodes and directory items nicely packed into the leaves.
> > >>>
> > >>>Then delayed allocation kicks in and starts inserting the big fat inline
> > >>>extents. This often forces the balancing code to split a leaf twice in
> > >>>order to make room for the new inline extent. The double split code
> > >>>wasn't balancing the bits that were left over on either side of our
> > >>>desired slot.
> > >>>
> > >>>The patch below fixes the problem for me, reducing the number of leaves
> > >>>that have more than 2K of free space down from 6% of the total to about
> > >>>74 leaves. It could be zero, but the balancing code doesn't push
> > >>>items around if our leaf is in the first or last slot of the parent
> > >>>node (complexity vs benefit tradeoff).
> > >>>
> > >>Nevertheless, I see leaves, which are not in the first or last slot,
> > >>but mergeable with their neighbors (test case is the same):
> > >>
> > >
> > >Correct, but it was in the first or last slot when it was balanced (I
> > >confirmed this with printk).
> > >
> > >Then the higher layers were balanced and our leaves were no longer in
> > >the first/last slot. We don't rebalance leaves when we balance level 1.
> > >
> > >
> > >>leaf 269557760 items 22 free space 2323 generation 25 owner 2
> > >>fs uuid 614fb921-cfa9-403d-9feb-940021e72382
> > >>chunk uuid b1674882-a445-45f9-b250-0985e483d231
> > >>
> > >>leaf 280027136 items 18 free space 2627 generation 25 owner 2
> > >>fs uuid 614fb921-cfa9-403d-9feb-940021e72382
> > >>chunk uuid b1674882-a445-45f9-b250-0985e483d231
> > >>
> > >>node 269549568 level 1 items 60 free 61 generation 25 owner 2
> > >>fs uuid 614fb921-cfa9-403d-9feb-940021e72382
> > >>chunk uuid b1674882-a445-45f9-b250-0985e483d231
> > >> key (175812608 EXTENT_ITEM 4096) block 175828992 (42927) gen 15
> > >> key (176025600 EXTENT_ITEM 4096) block 176111616 (42996) gen 15
> > >> key (176238592 EXTENT_ITEM 4096) block 176300032 (43042) gen 15
> > >> key (176451584 EXTENT_ITEM 4096) block 216248320 (52795) gen 17
> > >> key (176672768 EXTENT_ITEM 4096) block 216236032 (52792) gen 17
> > >> key (176783360 EXTENT_ITEM 4096) block 216252416 (52796) gen 17
> > >> key (176955392 EXTENT_ITEM 4096) block 138854400 (33900) gen 25
> > >> key (177131520 EXTENT_ITEM 4096) block 280289280 (68430) gen 25
> > >> key (177348608 EXTENT_ITEM 4096) block 280285184 (68429) gen 25
> > >> key (177561600 EXTENT_ITEM 4096) block 269557760 (65810) gen 25
> > >> key (177795072 EXTENT_ITEM 4096) block 280027136 (68366) gen 25
> > >> key (178008064 EXTENT_ITEM 4096) block 280064000 (68375) gen 25
> > >> key (178233344 EXTENT_ITEM 4096) block 285061120 (69595) gen 25
> > >> key (178442240 EXTENT_ITEM 4096) block 178442240 (43565) gen 16
> > >> key (178655232 EXTENT_ITEM 4096) block 178655232 (43617) gen 16
> > >> key (178868224 EXTENT_ITEM 4096) block 178868224 (43669) gen 16
> > >>[...]
> > >>
> > >>
> > >>>With the patch, I'm able to create 106894 files (2K each) on a 1GB FS.
> > >>>That doesn't sound like a huge improvement, but the default from
> > >>>mkfs.btrfs is to duplicate metadata. After duplication, that's about
> > >>>417MB or about 40% of the overall space.
> > >>>
> > >>>When you factor in the space that we reserve to avoid exploding on
> > >>>enospc and the space that we have allocated for data extents, that's not
> > >>>a very surprising number.
> > >>>
> > >>>I'm still putting this patch through more testing, the double split code
> > >>>is used in some difficult corners and I need to make sure I've tried
> > >>>all of them.
> > >>>
> > >>Chris, for the further code review we need documents, which reflect
> > >>the original ideas of the balancing, etc. Could you please provide them?
> > >>Obviously, I can not do it instead of you, as I don't know those ideas.
> > >>
> > >>
> > >
> > >Which part are you most interested in?
> > >
> >
> > Balancing.
> >
> > What conditions do we want to provide?
> > Why won't we get a utilization slump in the future?
> > How do we need to fix things when something goes wrong?
> > Meanwhile, judging by the single existing paper, the Btrfs
> > design looks really broken, because:
> 
> > 1. the balancing condition n <= number_of_keys <= 2n+1
> > is not satisfied (indeed, when number_of_keys is 1 and
> > n > 1, the lower bound n <= number_of_keys fails);
>
> That doesn't matter. It is not necessary (or desirable) to follow the
> classical B-tree algorithms to the letter to get sufficiently good
> properties. B-trees are quite robust to changing the details, as long
> as you follow the generalised principles which make them work.
There are lots of different perspectives here, but it is important to
step back and look at what we're using the btree for. We're making an
index to filesystem metadata. The index is there so that we can find
that metadata with a reasonable number of IOs and in a reasonable amount
of time.
Because btrfs is COW, and because of the reference counted COW used to
maintain snapshots, the cost of updating the btree is different from
traditional btrees. We do try to avoid balancing more and
intentionally allow the btree leaves to be less compact simply because
it allows us to avoid the higher cost of those operations.
The btree nodes are fairly strict. They contain fixed records and are
balanced in a simple fashion. When we are inserting a new key into a
node, if the node is completely full we split it. When we are deleting
a key, if the node is less than half full we balance it. Calculating
full on the btree nodes is a factor of the number of keys in them.
The leaves have fixed length keys that describe items of variable size.
On deletion merging is done when we're less than 1/3rd full and on
insertion we split when the item we're trying to shove in doesn't fit.
Calculating full on the btree leaves is a factor of the total size of
the items stored.
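Condensed into standalone C (illustrative names and numbers only, not
the actual ctree.c code), the policy is:

#include <stdio.h>

struct node { int nr_keys, max_keys; };     /* fixed-size key pointers */
struct leaf { int used, capacity; };        /* bytes of stored items */

static void node_after_insert(struct node *n)
{
    if (n->nr_keys == n->max_keys)          /* completely full */
        printf("split node\n");
}

static void node_after_delete(struct node *n)
{
    if (n->nr_keys < n->max_keys / 2)       /* less than half full */
        printf("balance node\n");
}

static void leaf_before_insert(struct leaf *l, int item_size)
{
    if (l->capacity - l->used < item_size)  /* item doesn't fit */
        printf("split leaf\n");
}

static void leaf_after_delete(struct leaf *l)
{
    if (l->used < l->capacity / 3)          /* under 1/3rd full */
        printf("merge leaf\n");
}

int main(void)
{
    struct node n = { 121, 121 };   /* ~121 key pointers per 4K node */
    struct leaf l = { 1200, 3995 }; /* ~3995 usable bytes per 4K leaf */

    node_after_insert(&n);
    node_after_delete(&n);
    leaf_before_insert(&l, 3000);
    leaf_after_delete(&l);
    return 0;
}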
So, with all of that said, the nodes point to leaves and the leaves
contain inodes, directory entries and sometimes file data.
The most important part of the index is the higher levels of the tree.
The index of most filesystems points to but does not contain
inodes and file data, and so the higher levels of the btrfs btree are
more like a traditional FS index.
There's a knob here for the max inline item size and if someone wants to
fill their whole FS with 2K files, the default settings are not at all
suitable. Filling with 3K files actually works better because we end
up with inode and file data in the same block much of the time.
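The knob is the max_inline mount option mentioned earlier in this
thread; for example, to cap inline file data at 2K:
# mount -o max_inline=2048 /dev/sdb2 /mnt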
>
> (It is entirely possible that btrfs does not follow those principles
> though :-)
We try ;) Clearly there was a bug around splitting for large items.
>
> All that matters to guarantee utilisation and algorithmic performance
> is to do enough merging on deletions, and avoid doing too many splits,
> to ensure that an amount X% of all used blocks (except the root) are
> at least Y% full.
>
> Y is 50 and X is 100 in your description, but other numbers work too.
> Reducing X slightly improves utilisation in common filesystem
> scenarios where many blocks are filled in ascending key sequences.
> (But delayed allocation/writing, like btrfs does, is another way to
> achieve a similar result.) Increasing Y improves minimum utilisation,
> which seems like a good thing to demand from a filesystem - for the
> actual data, anyway.
>
> From what's been said in this thread, I'm getting the impression that
> btrfs does not guarantee a particular value of X, but rather depends
> heuristically on typical behaviour to give an equivalent outcome, and
> that that does not always work in some pathological cases. And also
> that Y is/was not guaranteed, but that may be quite possibly an
> ordinary bug and might have been fixed by Chris during this thread.
>
> I must admit that only getting 200MB of 2kB files into a formatted 1GB
> partition before it says ENOSPC - and that's after Chris's recent
> tweak - is a bit disappointing. A little over 50% loss for 2kB files
> is quite reasonable - most filesystems would do that just due to
> padding to the next 4kB (the block size), and about 50% wastage in a
> B-tree is fair enough (at least if there is no sane pattern to the
> file's creation), as guaranteeing higher utilisation has a performance
> penalty. But with the delayed allocation, I'd hope for better packing
> than the worst case, and I'd expect even the worst case to be better
> than 20% available for data.
>
> Losing 80% of the space to utilisation loss and metadata duplication
> is disappointing (to me), and a bit surprising, given that one of the
> purposes of inline extents is to pack small files more efficiently
> than if they were padded to the block size.
I agree, in this case metadata duplication is not helping. A new
default limit on the size of the inline data might be in order.
>
> > 2. the trees mentioned in this paper don't allow keys of
> > variable length.
>
> Inline extents aren't the same as variable-length keys, because they
> aren't sorted or searched by content. I showed in an earlier reply to
> you how variable length *values*[1] can be kept from causing
> unbalancing, provided the splits are put in acceptable places. But to
> make sense of that, you do have to grok the generalising of B-trees
> beyond the original algorithms.
>
> [1] Think of the B-tree variants which contain (key,value) pairings
> where "value" is not ordered, rather than the type which contain just
> "keys" which are fully ordered.
>
> Are there any true (fully ordered) variable-length keys in btrfs?
The keys are a fixed length; only the size of the items stored can vary.
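For reference, as of 2.6.33 the on-disk key is this fixed 17-byte
structure from ctree.h:

struct btrfs_disk_key {
    __le64 objectid;
    u8 type;
    __le64 offset;
} __attribute__ ((__packed__));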
-chris
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On
> Behalf Of Mike Fedyk
> Sent: Wednesday, June 23, 2010 9:51 PM
> To: Daniel Taylor
> Cc: Daniel J Blueman; Mat; LKML;
> [email protected]; Chris Mason; Ric Wheeler;
> Andrew Morton; Linus Torvalds; The development of BTRFS
> Subject: Re: Btrfs: broken file system design (was Unbound(?)
> internal fragmentation in Btrfs)
>
> On Wed, Jun 23, 2010 at 8:43 PM, Daniel Taylor
> <[email protected]> wrote:
> > Just an FYI reminder. The original test (2K files) is utterly
> > pathological for disk drives with 4K physical sectors, such as
> > those now shipping from WD, Seagate, and others. Some of the
> > SSDs have larger (16K) or smaller (2K) blocks. There is also
> > the issue of btrfs over RAID (which I know is not entirely
> > sensible, but which will happen).
> >
> > The absolute minimum allocation size for data should be the same
> > as, and aligned with, the underlying disk block size. If that
> > results in underutilization, I think that's a good thing for
> > performance, compared to read-modify-write cycles to update
> > partial disk blocks.
>
> Block size = 4k
>
> Btrfs packs smaller objects into the blocks in certain cases.
>
As long as no object smaller than the disk block size is ever
flushed to media, and all flushed objects are aligned to the disk
blocks, there should be no real performance hit from that.
Otherwise we end up with the same damage as the ext[234] family, where
the file blocks can be aligned, but the 1K inode updates cause
read-modify-write (RMW) cycles and cost a >10% performance
hit for creation/update of large numbers of files.
An RMW cycle costs at least a full rotation (11 msec on a 5400 RPM
drive), which is painful.
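(5400 RPM is 90 revolutions per second, so one rotation takes 1000/90,
about 11.1 msec.)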
"Daniel Taylor" <[email protected]> writes:
>
> As long as no object smaller than the disk block size is ever
> flushed to media, and all flushed objects are aligned to the disk
> blocks, there should be no real performance hit from that.
The question is just how large such a block needs to be.
Traditionally some RAID controllers (and possibly some SSDs now)
needed very large blocks, up to megabytes.
>
> Otherwise we end up with the same damage as the ext[234] family, where
> the file blocks can be aligned, but the 1K inode updates cause
> read-modify-write (RMW) cycles and cost a >10% performance
> hit for creation/update of large numbers of files.
Fixing that doesn't require a new file system layout, just some effort
to read/write inodes in batches. XFS has done similar things for a
long time, and I believe there have been some efforts in this
direction for ext4 too.
-Andi
--
[email protected] -- Speaking for myself only.
On 06/24/2010 06:06 PM, Daniel Taylor wrote:
>
>
>
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On
>> Behalf Of Mike Fedyk
>> Sent: Wednesday, June 23, 2010 9:51 PM
>> To: Daniel Taylor
>> Cc: Daniel J Blueman; Mat; LKML;
>> [email protected]; Chris Mason; Ric Wheeler;
>> Andrew Morton; Linus Torvalds; The development of BTRFS
>> Subject: Re: Btrfs: broken file system design (was Unbound(?)
>> internal fragmentation in Btrfs)
>>
>> On Wed, Jun 23, 2010 at 8:43 PM, Daniel Taylor
>> <[email protected]> wrote:
>>
>>> Just an FYI reminder. The original test (2K files) is utterly
>>> pathological for disk drives with 4K physical sectors, such as
>>> those now shipping from WD, Seagate, and others. Some of the
>>> SSDs have larger (16K) or smaller (2K) blocks. There is also
>>> the issue of btrfs over RAID (which I know is not entirely
>>> sensible, but which will happen).
>>>
>>> The absolute minimum allocation size for data should be the same
>>> as, and aligned with, the underlying disk block size. If that
>>> results in underutilization, I think that's a good thing for
>>> performance, compared to read-modify-write cycles to update
>>> partial disk blocks.
>>>
>> Block size = 4k
>>
>> Btrfs packs smaller objects into the blocks in certain cases.
>>
>>
> As long as no object smaller than the disk block size is ever
> flushed to media, and all flushed objects are aligned to the disk
> blocks, there should be no real performance hit from that.
>
> Otherwise we end up with the same damage as the ext[234] family, where
> the file blocks can be aligned, but the 1K inode updates cause
> read-modify-write (RMW) cycles and cost a >10% performance
> hit for creation/update of large numbers of files.
>
> An RMW cycle costs at least a full rotation (11 msec on a 5400 RPM
> drive), which is painful.
>
It is also interesting to note that you can get significant overhead
even with zero-length files. Path names, metadata overhead, etc. can
consume (depending on the pathname length) quite a bit of space per file.
Ric
25.06.2010 22:58, Ric Wheeler wrote:
> On 06/24/2010 06:06 PM, Daniel Taylor wrote:
[]
>>> On Wed, Jun 23, 2010 at 8:43 PM, Daniel Taylor
>>> <[email protected]> wrote:
>>>
>>>> Just an FYI reminder. The original test (2K files) is utterly
>>>> pathological for disk drives with 4K physical sectors, such as
>>>> those now shipping from WD, Seagate, and others. Some of the
>>>> SSDs have larger (16K) or smaller (2K) blocks. There is also
>>>> the issue of btrfs over RAID (which I know is not entirely
>>>> sensible, but which will happen).
Why is it not sensible to use btrfs on RAID devices?
Nowadays RAID is just everywhere, from 'fakeraid' on AHCI to
large external arrays on iSCSI-attached storage. Sometimes
it is nearly impossible to _not_ use RAID -- many servers
come with a built-in RAID card which can't be turned off or
disabled. And hardware RAID is faster (at least in theory),
if only because it puts less load on various system busses.
To many "enterprise folks" the statement "we don't need hw raid,
we have a better solution" sounds like "we're just a toy, don't
use us".
Hmm? ;)
/mjt, who always used and preferred _software_ raid due to
multiple reasons, and never used btrfs so far.
On 06/26/2010 01:18 AM, Michael Tokarev wrote:
> 25.06.2010 22:58, Ric Wheeler wrote:
>
>> On 06/24/2010 06:06 PM, Daniel Taylor wrote:
>>
> []
>
>>>> On Wed, Jun 23, 2010 at 8:43 PM, Daniel Taylor
>>>> <[email protected]> wrote:
>>>>
>>>>
>>>>> Just an FYI reminder. The original test (2K files) is utterly
>>>>> pathological for disk drives with 4K physical sectors, such as
>>>>> those now shipping from WD, Seagate, and others. Some of the
>>>>> SSDs have larger (16K) or smaller (2K) blocks. There is also
>>>>> the issue of btrfs over RAID (which I know is not entirely
>>>>> sensible, but which will happen).
>>>>>
> Why is it not sensible to use btrfs on RAID devices?
> Nowadays RAID is just everywhere, from 'fakeraid' on AHCI to
> large external arrays on iSCSI-attached storage. Sometimes
> it is nearly impossible to _not_ use RAID -- many servers
> come with a built-in RAID card which can't be turned off or
> disabled. And hardware RAID is faster (at least in theory),
> if only because it puts less load on various system busses.
>
> To many "enterprise folks" the statement "we don't need hw raid,
> we have a better solution" sounds like "we're just a toy, don't
> use us".
>
> Hmm? ;)
>
> /mjt, who always used and preferred _software_ raid due to
> multiple reasons, and never used btrfs so far.
>
Absolutely no reason that you would not use btrfs on hardware raid
volumes (or software RAID for that matter).
Ric
Chris Mason wrote:
[...]
>>> 1. the balancing condition n <= number_of_keys <= 2n+1
>>> is not satisfied (indeed, when number_of_keys is 1 and
>>> n > 1, the lower bound n <= number_of_keys fails);
>>>
>> That doesn't matter. It is not necessary (or desirable) to follow the
>> classical B-tree algorithms to the letter to get sufficiently good
>> properties.
Sure, but we need something instead of the classic balancing
condition. We cannot just discard it: it is your lifejacket while
performing tree operations. How will you prove utilization bounds
without any balancing conditions?
>> B-trees are quite robust to changing the details, as long
>> as you follow the generalised principles which make them work.
>>
>
> There are lots of different perspectives here,
I wouldn't be so optimistic here (see below)
> but it is important to
> step back and look at what we're using the btree for. We're making an
> index to filesystem metadata. The index is there so that we can find
> that metadata with a reasonable number of IOs and in a reasonable amount
> of time.
>
> Because btrfs is COW, and because of the reference counted COW used to
> maintain snapshots, the cost of updating the btree is different from
> traditional btrees. We do try to avoid balancing more and
> intentionally allow the btree leaves to be less compact simply because
> it allows us to avoid the higher cost of those operations.
>
I can understand reducing the lower bound, say, from 1/2 to 1/3 for
performance reasons, but again, a bound of 0.00 is unacceptable. So
please make sure you have a sane, proven bound for every such
"compromise".
> The btree nodes are fairly strict. They contain fixed records and are
> balanced in a simple fashion. When we are inserting a new key into a
> node, if the node is completely full we split it. When we are deleting
> a key, if the node is less than half full we balance it.
This works only if an insertion/deletion on the leaf level doesn't
result in the insertion/deletion of more than one key on the upper levels.
> Calculating
> full on the btree nodes is a factor of the number of keys in them.
>
> The leaves have fixed length keys that describe items of variable size.
> On deletion merging is done when we're less than 1/3rd full and on
> insertion we split when the item we're trying to shove in doesn't fit.
> Calculating full on the btree leaves is a factor of the total size of
> the items stored.
>
> So, with all of that said, the nodes point to leaves and the leaves
> contain inodes, directory entries and sometimes file data.
>
> The most important part of the index is the higher levels of the tree.
> The index of most filesystems points to but does not contain
> inodes and file data, and so the higher levels of the btrfs btree are
> more like a traditional FS index.
>
> There's a knob here for the max inline item size and if someone wants to
> fill their whole FS with 2K files, the default settings are not at all
> suitable. Filling with 3K files actually works better because we end
> up with inode and file data in the same block much of the time.
>
Chris, thanks for the description.
I think that such balancing is totally incorrect. Under your strategy
insertions can spawn shallow leaves, and deletions will result in
shallow leaves (because of merging with shallow leaves). This will
lead to an unbounded utilization slump.
In particular, your patch is just a workaround, which doesn't fix the
core problem. Per-case knobs are nonsense. Are you kidding? Ext4,
xfs and reiserfs don't use any knobs, so why should we accept this
regression against the general state of the art? Everything should
pack fine without any knobs...
Your strategy slightly resembles the balancing in the Reiserfs file
system, so I have two comments here.
1. The algorithms used in Reiserfs are "precise" (or "minimal"). That
means there is no "superfluous" balancing there. It is impossible to
save on balancing operations in exchange for a slightly lower
utilization bound: you will lose the bound entirely (which is
unacceptable). Only classic Bayer B-trees allow reducing the 1/2
bound, by replacing the usual balancing condition q <= N <= 2q with a
more liberal one (say, q <= N <= 3q).
2. If you want to build on a COW-friendly Reiserfs, then its
algorithms should be properly modified for the case of top-down
balancing and reviewed (preferably by the authors of the original
algorithms).
I recommend considering the following approach:
Balancing the leaf level
A. Inserting a key of length I(*)
01 Suppose we traversed the tree and arrived at the position
02 pos in the leaf A.
03
04 At first we try to make space by shifting all possible items at the
05 left and right from the pos to the neighbors B and C respectively:
06
07 B A C
08
09 Let's denote by L and R the total length of items in leaf A at the left
10 and right of pos respectively.
11
12 If after shifting there is enough space in A, then we perform
13 insertion to A. After the insertion pairs (BA) and (AC) are
14 incompressible for obvious reasons.
15
16 Else, if there is not enough space in A, we make space by inserting
17 a new empty leaf D and shift all items at the right side of pos
18 to D:
19
20 B A D C
21
22 If there is enough space in A, then we perform insertion to A.
23 After insertion (AD) is incompressible, because there wasn't
24 enough space at (16), i.e. I+L+R > N, hence I+L > N-R,
25 i.e. there is not enough space in D for all items of A
26 (N is the capacity of an empty leaf).
27
28 Else if there is enough space in D, then we perform insertion
29 to D. Pair (DC) is incompressible, because at the point (4)
30 we have shifted right all possible items at the right side
31 of pos. Pair (AD) is incompressible, because we didn't jump
32 to (22), i.e. I is larger than the free space in A.
33
34 Else (there is no space in A and D for insertion), we make
35 space via insertion of a new empty leaf E, shift all items
36 from A to E, and perform insertion to the empty node A:
37
38 B E A D C
39
40 In this case (BE) is incompressible, because at the point (4)
41 we have shifted left all possible items. Pair (EA) is
42 incompressible, because there wasn't enough space in A at
43 (34).
(*) Variable length key is an abstraction. In real life this usually
means a record which includes a key of fixed length and some payload
of variable length.
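The case analysis in A reduces to a pure decision on I, L, R and N; as
a sketch (illustrative C, not proposed kernel code):

#include <stdio.h>

/*
 * N: capacity of an empty leaf
 * I: length of the item being inserted
 * L, R: total length of the items left/right of pos after the
 *       shifts to the neighbors B and C at step (04)
 */
static const char *choose(int N, int I, int L, int R)
{
    if (I + L + R <= N)     /* lines 12-14 */
        return "insert into A";
    if (I + L <= N)         /* lines 16-26: D takes the right part */
        return "insert into A after adding D";
    if (I + R <= N)         /* lines 28-32: the new item goes to D */
        return "insert into D";
    return "insert into A after moving A to E";  /* lines 34-43 */
}

int main(void)
{
    /* e.g. a 2069-byte item with 1500 bytes left and 1800 right */
    printf("%s\n", choose(4096, 2069, 1500, 1800));
    return 0;
}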
B. Deleting a key
01 Suppose we traversed the tree and arrived at the position
02 pos in the leaf A:
03
04 B A C
05
06 At first we perform deletion from A.
07 If A is empty, then
08 remove leaf A:
09
10 B C
11
12 If the pair (BC) is compressible, then shift everything from B
13 to C and remove leaf B:
14
15 C
16
17 At this point leaf C is pairwise incompressible with its left
18 and right neighbors for obvious reasons.
19
20 Else
21 if the pair (BA) is compressible, then shift everything from B
22 to A and remove leaf B:
23
24 A C
25
26 If the pair (AC) is compressible, then shift everything from A to
27 C and remove leaf A:
28
29 C
30
31 Here leaf C is pairwise incompressible with its left and right
32 neighbors for obvious reasons.
Balancing the upper levels
Thus, an insertion/deletion on the leaf level can lead to the
insertion/deletion of not more than 2 leaves, and hence 2 keys on the
upper levels. So we modify the classic balancing condition and the
top-down balancing strategy in the following way:
Every non-root node of upper (non-leaf) levels may contain between
q and 2(q+1) entries (q>=2).
A. Inserting a key
If a node with 2q+2 (2q+1) entries is encountered while traversing the
tree, it is split into two nodes with q and q+2 (q and q+1) entries.
B. Deleting a key
If a node with q (q+1) entries is encountered while traversing the
tree, it is fixed. Fixing means either moving keys into it, or merging
it with a neighbor, so that it will contain not less than q+2 keys.
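Sanity check of the headroom: a node that is not split on the way down
has at most 2q entries, and a split part has at most q+2, so even
after the up-to-2 keys added on behalf of the leaf level every node
stays <= 2(q+1), given q >= 2. Symmetrically, a fixed node has at
least q+2 entries, so removing up to 2 keys never drops it below q.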
Conclusion.
So with my strategy the S1 condition is satisfied on the leaf level
and S0 on the upper levels; hence we can rely on the lower bound 0.5,
as I mentioned here:
http://marc.info/?l=linux-btrfs&m=127690584607800&w=2
Thanks.
--
Edward O. Shishkin
Principal Software Engineer
Red Hat Czech
On Wed, Jun 30, 2010 at 10:05:21PM +0200, Edward Shishkin wrote:
> Chris Mason wrote:
> [...]
> >>>1. the balancing condition n <= number_of_keys <= 2n+1
> >>> is not satisfied (indeed, when number_of_keys is 1 and
> >>> n > 1, the lower bound n <= number_of_keys fails);
> >>That doesn't matter. It is not necessary (or desirable) to follow the
> >>classical B-tree algorithms to the letter to get sufficiently good
> >>properties.
>
> Sure, but we need something instead of the classic balancing
> condition. We cannot just discard it: it is your lifejacket while
> performing tree operations. How will you prove utilization bounds
> without any balancing conditions?
Again, the leaves and nodes both have balancing conditions. I'm not
sure I follow.
>
>
> >> B-trees are quite robust to changing the details, as long
> >>as you follow the generalised principles which make them work.
> >
> >There are lots of different perspectives here,
>
> I wouldn't be so optimistic here (see below)
>
> > but it is important to
> >step back and look at what we're using the btree for. We're making an
> >index to filesystem metadata. The index is there so that we can find
> >that metadata with a reasonable number of IOs and in a reasonable amount
> >of time.
> >
> >Because btrfs is COW, and because of the reference counted COW used to
> >maintain snapshots, the cost of updating the btree is different from
> >traditional btrees. We do try to avoid balancing more and
> >intentionally allow the btree leaves to be less compact simply because
> >it allows us to avoid the higher cost of those operations.
>
> I can understand reducing the lower bound, say, from 1/2 to 1/3 for
> performance reasons, but again, a bound of 0.00 is unacceptable. So
> please make sure you have a sane, proven bound for every such
> "compromise".
>
> >The btree nodes are fairly strict. They contain fixed records and are
> >balanced in a simple fashion. When we are inserting a new key into a
> >node, if the node is completely full we split it. When we are deleting
> >a key, if the node is less than half full we balance it.
>
> This works only if an insertion/deletion on the leaf level doesn't
> result in the insertion/deletion of more than one key on the upper
> levels.
The top-down balancing requires that we control the number of upper
level inserts. This is already done today.
>
> > Calculating
> >full on the btree nodes is a factor of the number of keys in them.
> >
> >The leaves have fixed length keys that describe items of variable size.
> >On deletion merging is done when we're less than 1/3rd full and on
> >insertion we split when the item we're trying to shove in doesn't fit.
> >Calculating full on the btree leaves is a factor of the total size of
> >the items stored.
> >
> >So, with all of that said, the nodes point to leaves and the leaves
> >contain inodes, directory entries and sometimes file data.
> >
> >The most important part of the index is the higher levels of the tree.
> >The index of most filesystems points to but does not contain
> >inodes and file data, and so the higher levels of the btrfs btree are
> >more like a traditional FS index.
> >
> >There's a knob here for the max inline item size and if someone wants to
> >fill their whole FS with 2K files, the default settings are not at all
> >suitable. Filling with 3K files actually works better because we end
> >up with inode and file data in the same block much of the time.
>
> Chris, thanks for the description.
>
> I think that such balancing is totally incorrect. Under your strategy
> insertions can spawn shallow leaves, and deletions will result in
> shallow leaves (because of merging with shallow leaves). This will
> lead to an unbounded utilization slump.
Ok, insertions can spawn shallow leaves containing file data and
deletions can result in shallow leaves containing file data, just trying
to make sure I'm reading this correctly.
Shallow leaves with file data happen in every filesystem, but the other
filesystems call them data extents. I think we're talking past each other a
little here ;)
>
> In particular, your patch is just a workaround, which doesn't fix the
> core problem. Per-case knobs are nonsense. Are you kidding? Ext4,
> xfs and reiserfs don't use any knobs, so why should we accept this
> regression against the general state of the art? Everything should
> pack fine without any knobs...
Ext4 and xfs don't pack file data into the btree. reiserfs does but
splits the data items. I'm still trying to nail down if you're worried
about something other than packed file data.
I'm saying there's no difference between a btrfs leaf with a single item
and a block pointer in xfs pointing to a single block. The btrfs leaf
is full and the single block in xfs is full.
Both are backed by higher levels in the btree that are properly compact.
Btrfs metadata mirroring makes the cost of that leaf much higher, which
is something we might want to address by lowering the size of file data
that we choose to inline.
>
> Your strategy slightly resembles the balancing in the Reiserfs file
> system, so I have two comments here.
> 
> 1. The algorithms used in Reiserfs are "precise" (or "minimal"). That
> means there is no "superfluous" balancing there. It is impossible to
> save on balancing operations in exchange for a slightly lower
> utilization bound: you will lose the bound entirely (which is
> unacceptable). Only classic Bayer B-trees allow reducing the 1/2
> bound, by replacing the usual balancing condition q <= N <= 2q with a
> more liberal one (say, q <= N <= 3q).
> 
> 2. If you want to build on a COW-friendly Reiserfs, then its
> algorithms should be properly modified for the case of top-down
> balancing and reviewed (preferably by the authors of the original
> algorithms).
> I recommend considering the following approach:
>
>
> Balancing the leaf level
>
>
> A. Inserting a key of length I(*)
>
>
> 01 Suppose we traversed the tree and arrived at the position
> 02 pos in the leaf A.
> 03
> 04 At first we try to make space by shifting all possible items at the
> 05 left and right from the pos to the neighbors B and C respectively:
Ok, maybe it is just before dinner and I'm reading this too quickly, but
this is very similar to how the btrfs balancing already works. A
few comments below.
> 06
> 07 B A C
> 08
> 09 Let's denote by L and R the total length of items in leaf A at the left
> 10 and right of pos respectively.
push_leaf_left(), push_leaf_right() in the btrfs code
> 11
> 12 If after shifting there is enough space in A, then we perform
> 13 insertion to A. After the insertion pairs (BA) and (AC) are
> 14 incompressible for obvious reasons.
> 15
> 16 Else, if there is not enough space in A, we make space by inserting
> 17 a new empty leaf D and shift all items at the right side of pos
> 18 to D:
> 19
> 20 B A D C
split_leaf()
> 21
> 22 If there is enough space in A, then we perform insertion to A.
> 23 After insertion (AD) is incompressible, because there wasn't
> 24 enough space at (16), i.e. I+L+R > N, hence I+L > N-R,
> 25 i.e. there is not enough space in D for all items of A
> 26 (N is the capacity of an empty leaf).
> 27
> 28 Else if there is enough space in D, then we perform insertion
> 29 to D. Pair (DC) is incompressible, because at the point (4)
> 30 we have shifted right all possible items at the right side
> 31 of pos. Pair (AD) is incompressible, because we didn't jump
> 32 to (22), i.e. I is larger than the free space in A.
> 33
> 34 Else (there is no space in A and D for insertion), we make
> 35 space via insertion of a new empty leaf E, shift all items
> 36 from A to E, and perform insertion to the empty node A:
> 37
> 38 B E A D C
This is the double split condition in split_leaf()
> 39
> 40 In this case (BE) is incompressible, because at the point (4)
> 41 we have shifted left all possible items. Pair (EA) is
> 42 incompressible, because there wasn't enough space in A at
> 43 (34).
>
> (*) Variable length key is an abstraction. In real life this usually
> means a record which includes a key of fixed length and some payload
> of variable length.
>
>
>
> B. Deleting a key
>
>
> 01 Suppose we traversed the tree and arrived at the position
> 02 pos in the leaf A:
> 03
> 04 B A C
> 05
> 06 At first we perform deletion from A.
> 07 If A is empty, then
> 08 remove leaf A:
> 09
> 10 B C
> 11
> 12 If the pair (BC) is compressible, then shift everything from B
> 13 to C and remove leaf B:
> 14
> 15 C
> 16
> 17 At this point leaf C is pairwise incompressible with its left
> 18 and right neighbors for obvious reasons.
> 19
> 20 Else
> 21 if the pair (BA) is compressible, then shift everything from B
> 22 to A and remove leaf B:
> 23
> 24 A C
> 25
> 26 If the pair (AC) is compressible, then shift everything from A to
> 27 C and remove leaf A:
> 28
> 29 C
> 30
> 31 Here leaf C is pairwise incompressible with its left and right
> 32 neighbors for obvious reasons.
The above is what we do in btrfs, except it is triggered based on how
empty the leaf is.
>
>
> Balancing the upper levels
>
>
> Thus, an insertion/deletion on the leaf level can lead to the
> insertion/deletion of not more than 2 leaves, and hence 2 keys on the
> upper levels. So we modify the classic balancing condition and the
> top-down balancing strategy in the following way:
>
> Every non-root node of upper (non-leaf) levels may contain between
> q and 2(q+1) entries (q>=2).
>
> A. Inserting a key
>
> If a node with 2q+2 (2q+1) entries is encountered while traversing the
> tree, it is split into two nodes with q and q+2 (q and q+1) entries.
>
>
> B. Deleting a key
>
> If a node with q (q+1) entries is encountered while traversing the
> tree, it is fixed. Fixing means either moving keys into it, or merging
> it with a neighbor, so that it will contain not less than q+2 keys.
This is balance_level()
-chris