2015-12-03 07:44:38

by lokesh jaliminche

[permalink] [raw]
Subject: Regarding random grouop search start for allocation of inode.

hello folks,
I am new to ext4 code. I was going through the
ext4-source for allocation of inode.
There is one thing that I did not understand while selection of groups
for inode allocation . I came across this code snippet which is part
of find_group_orlov function. question is, why group search start is
random ?

Code snippet:
==========
···if (qstr) {
»·······»·······»·······hinfo.hash_version = LDISKFS_DX_HASH_HALF_MD4;
»·······»·······»·······hinfo.seed = sbi->s_hash_seed;
»·······»·······»·······ldiskfsfs_dirhash(qstr->name, qstr->len, &hinfo);
»·······»·······»·······grp = hinfo.hash;
»·······»·······} else
»·······»·······»·······get_random_bytes(&grp, sizeof(grp));
»·······»·······parent_group = (unsigned)grp % ngroups;
»·······»·······for (i = 0; i < ngroups; i++) {
»·······»·······»·······g = (parent_group + i) % ngroups;
»·······»·······»·······get_orlov_stats(sb, g, flex_size, &stats);
»·······»·······»·······if (!stats.free_inodes)
»·······»·······»·······»·······continue;
»·······»·······»·······if (stats.used_dirs >= best_ndir)
»·······»·······»·······»·······continue;
»·······»·······»·······if (stats.free_inodes < avefreei)
»·······»·······»·······»·······continue;
»·······»·······»·······if (stats.free_blocks < avefreeb)
»·······»·······»·······»·······continue;
»·······»·······»·······grp = g;
»·······»·······»·······ret = 0;
»·······»·······»·······best_ndir = stats.used_dirs;
»·······»·······}

Thanks & Regards,
Lokesh


2015-12-03 08:07:35

by lokesh jaliminche

[permalink] [raw]
Subject: Re: Regarding random grouop search start for allocation of inode.

Thought of giving more clarification on my question
why group search start is random ? because we can also start search
for valid groups for inode allocation from the start . As this group
search is random inode selection might go to end of groups which
might affect IO performance

On Thu, Dec 3, 2015 at 1:14 PM, lokesh jaliminche
<[email protected]> wrote:
> hello folks,
> I am new to ext4 code. I was going through the
> ext4-source for allocation of inode.
> There is one thing that I did not understand while selection of groups
> for inode allocation . I came across this code snippet which is part
> of find_group_orlov function. question is, why group search start is
> random ?
>
> Code snippet:
> ==========
> ···if (qstr) {
> »·······»·······»·······hinfo.hash_version = LDISKFS_DX_HASH_HALF_MD4;
> »·······»·······»·······hinfo.seed = sbi->s_hash_seed;
> »·······»·······»·······ldiskfsfs_dirhash(qstr->name, qstr->len, &hinfo);
> »·······»·······»·······grp = hinfo.hash;
> »·······»·······} else
> »·······»·······»·······get_random_bytes(&grp, sizeof(grp));
> »·······»·······parent_group = (unsigned)grp % ngroups;
> »·······»·······for (i = 0; i < ngroups; i++) {
> »·······»·······»·······g = (parent_group + i) % ngroups;
> »·······»·······»·······get_orlov_stats(sb, g, flex_size, &stats);
> »·······»·······»·······if (!stats.free_inodes)
> »·······»·······»·······»·······continue;
> »·······»·······»·······if (stats.used_dirs >= best_ndir)
> »·······»·······»·······»·······continue;
> »·······»·······»·······if (stats.free_inodes < avefreei)
> »·······»·······»·······»·······continue;
> »·······»·······»·······if (stats.free_blocks < avefreeb)
> »·······»·······»·······»·······continue;
> »·······»·······»·······grp = g;
> »·······»·······»·······ret = 0;
> »·······»·······»·······best_ndir = stats.used_dirs;
> »·······»·······}
>
> Thanks & Regards,
> Lokesh

2015-12-03 17:58:45

by Andreas Dilger

[permalink] [raw]
Subject: Re: Regarding random grouop search start for allocation of inode.

On Dec 3, 2015, at 01:07, lokesh jaliminche <[email protected]> wrote:
>
> Thought of giving more clarification on my question
> why group search start is random ? because we can also start search
> for valid groups for inode allocation from the start. As this group
> search is random inode selection might go to end of groups which
> might affect IO performance

Starting the inode search at the beginning of the disk each time
means that inode allocation will be inefficient because it will search
over groups that are mostly or entirely full already.

Allocating the new directory in a semi-random group, one that is
relatively unused, ensures that new
inode and block allocations are relatively efficient afterward.

Cheers, Andreas

> On Thu, Dec 3, 2015 at 1:14 PM, lokesh jaliminche
> <[email protected]> wrote:
>> hello folks,
>> I am new to ext4 code. I was going through the
>> ext4-source for allocation of inode.
>> There is one thing that I did not understand while selection of groups
>> for inode allocation . I came across this code snippet which is part
>> of find_group_orlov function. question is, why group search start is
>> random ?
>>
>> Code snippet:
>> ==========
>> В·В·В·if (qstr) {
>> »·······»·······»·······hinfo.hash_version = LDISKFS_DX_HASH_HALF_MD4;
>> »·······»·······»·······hinfo.seed = sbi->s_hash_seed;
>> »·······»·······»·······ldiskfsfs_dirhash(qstr->name, qstr->len, &hinfo);
>> »·······»·······»·······grp = hinfo.hash;
>> »·······»·······} else
>> »·······»·······»·······get_random_bytes(&grp, sizeof(grp));
>> »·······»·······parent_group = (unsigned)grp % ngroups;
>> »·······»·······for (i = 0; i < ngroups; i++) {
>> »·······»·······»·······g = (parent_group + i) % ngroups;
>> »·······»·······»·······get_orlov_stats(sb, g, flex_size, &stats);
>> »·······»·······»·······if (!stats.free_inodes)
>> »·······»·······»·······»·······continue;
>> »·······»·······»·······if (stats.used_dirs >= best_ndir)
>> »·······»·······»·······»·······continue;
>> »·······»·······»·······if (stats.free_inodes < avefreei)
>> »·······»·······»·······»·······continue;
>> »·······»·······»·······if (stats.free_blocks < avefreeb)
>> »·······»·······»·······»·······continue;
>> »·······»·······»·······grp = g;
>> »·······»·······»·······ret = 0;
>> »·······»·······»·······best_ndir = stats.used_dirs;
>> »·······»·······}
>>
>> Thanks & Regards,
>> Lokesh
> N‹§Іжмrё›yъиљШbІX¬¶З§vШ^–)Ює{.nЗ+‰·ҐЉ{±{xЉ{ayєК‡Ъ™л,j­ўfЈў·hљ‹аz№®wҐўё ў·¦j:+v‰ЁЉwиjШm¶џяѕ«‘кзzZ+ѓщљЋЉЭўj"ќъ!¶i

2015-12-03 19:13:09

by lokesh jaliminche

[permalink] [raw]
Subject: Re: Regarding random grouop search start for allocation of inode.

Ohh thanks for the clarification. There is one more thing I would like
to point out here.
In the code there is a loop to scan the groups for inode
alllocation(Inside find_group_orlov function).
There are some policies for group selection . while scanning the
groups, it checks for these
policies to be satisfied.
If a particular group satisfy these properties it should get selected
for inode allocation but instead
it does further lookup in next groups.
I think there is missing breaking condition. I have added break over
there and here is the
patch for that. Any reason for not having break condition over here ?

diff -Nur linux-2.6.32-431.17.1.el6.x86_64/fs/ext4/ialloc.c
linux-2.6.32-431.17.1.el6.x86_64/fs/ext4/ialloc.c
--- linux-2.6.32-431.17.1.el6.x86_64/fs/ext4/ialloc.c 2014-04-12
01:20:31.000000000 +0530
+++ linux-2.6.32-431.17.1.el6.x86_64/fs/ext4/ialloc.c 2015-11-29
21:36:51.805542209 +0530
@@ -529,6 +529,7 @@
grp = g;
ret = 0;
best_ndir = stats.used_dirs;
+ break;
}
if (ret)
goto fallback;

Thanks & Regards,
Lokesh



On Thu, Dec 3, 2015 at 11:28 PM, Andreas Dilger <[email protected]> wrote:
> On Dec 3, 2015, at 01:07, lokesh jaliminche <[email protected]> wrote:
>>
>> Thought of giving more clarification on my question
>> why group search start is random ? because we can also start search
>> for valid groups for inode allocation from the start. As this group
>> search is random inode selection might go to end of groups which
>> might affect IO performance
>
> Starting the inode search at the beginning of the disk each time
> means that inode allocation will be inefficient because it will search
> over groups that are mostly or entirely full already.
>
> Allocating the new directory in a semi-random group, one that is
> relatively unused, ensures that new
> inode and block allocations are relatively efficient afterward.
>
> Cheers, Andreas
>
>> On Thu, Dec 3, 2015 at 1:14 PM, lokesh jaliminche
>> <[email protected]> wrote:
>>> hello folks,
>>> I am new to ext4 code. I was going through the
>>> ext4-source for allocation of inode.
>>> There is one thing that I did not understand while selection of groups
>>> for inode allocation . I came across this code snippet which is part
>>> of find_group_orlov function. question is, why group search start is
>>> random ?
>>>
>>> Code snippet:
>>> ==========
>>> В·В·В·if (qstr) {
>>> »·······»·······»·······hinfo.hash_version = LDISKFS_DX_HASH_HALF_MD4;
>>> »·······»·······»·······hinfo.seed = sbi->s_hash_seed;
>>> »·······»·······»·······ldiskfsfs_dirhash(qstr->name, qstr->len, &hinfo);
>>> »·······»·······»·······grp = hinfo.hash;
>>> »·······»·······} else
>>> »·······»·······»·······get_random_bytes(&grp, sizeof(grp));
>>> »·······»·······parent_group = (unsigned)grp % ngroups;
>>> »·······»·······for (i = 0; i < ngroups; i++) {
>>> »·······»·······»·······g = (parent_group + i) % ngroups;
>>> »·······»·······»·······get_orlov_stats(sb, g, flex_size, &stats);
>>> »·······»·······»·······if (!stats.free_inodes)
>>> »·······»·······»·······»·······continue;
>>> »·······»·······»·······if (stats.used_dirs >= best_ndir)
>>> »·······»·······»·······»·······continue;
>>> »·······»·······»·······if (stats.free_inodes < avefreei)
>>> »·······»·······»·······»·······continue;
>>> »·······»·······»·······if (stats.free_blocks < avefreeb)
>>> »·······»·······»·······»·······continue;
>>> »·······»·······»·······grp = g;
>>> »·······»·······»·······ret = 0;
>>> »·······»·······»·······best_ndir = stats.used_dirs;
>>> »·······»·······}
>>>
>>> Thanks & Regards,
>>> Lokesh
>> N‹§Іжмrё›yъиљШbІX¬¶З§vШ^–)Ює{.nЗ+‰·ҐЉ{±{ xЉ{ayє К‡Ъ™л,j ­ўfЈў·hљ‹аz№ ®wҐўё ў·¦j:+v‰ЁЉwиjШm¶џяѕ «‘кзzZ+ѓщљЋЉЭўj"ќъ!¶i

2015-12-04 21:55:05

by Andreas Dilger

[permalink] [raw]
Subject: Re: Regarding random grouop search start for allocation of inode.


> On Dec 3, 2015, at 12:13 PM, lokesh jaliminche <[email protected]> wrote:
>
> Ohh thanks for the clarification. There is one more thing I would like
> to point out here.
> In the code there is a loop to scan the groups for inode
> alllocation(Inside find_group_orlov function).
> There are some policies for group selection . while scanning the
> groups, it checks for these
> policies to be satisfied.
> If a particular group satisfy these properties it should get selected
> for inode allocation but instead
> it does further lookup in next groups.
> I think there is missing breaking condition. I have added break over
> there and here is the
> patch for that. Any reason for not having break condition over here ?
>
> diff -Nur linux-2.6.32-431.17.1.el6.x86_64/fs/ext4/ialloc.c
> linux-2.6.32-431.17.1.el6.x86_64/fs/ext4/ialloc.c
> --- linux-2.6.32-431.17.1.el6.x86_64/fs/ext4/ialloc.c 2014-04-12
> 01:20:31.000000000 +0530
> +++ linux-2.6.32-431.17.1.el6.x86_64/fs/ext4/ialloc.c 2015-11-29
> 21:36:51.805542209 +0530
> @@ -529,6 +529,7 @@
> grp = g;
> ret = 0;
> best_ndir = stats.used_dirs;
> + break;
> }
> if (ret)
> goto fallback;

No, this isn't correct. This loop is looking for the *best* group it can find,
and your "break" would have it exit the loop as soon as the *first* directory
that matches the conditions is found. Since those conditions are fairly weak,
for example that the group actually has free inodes, and it has better than
average free inodes and blocks, it makes sense to search beyond just the first
matching group.

That said, it also doesn't make sense to search beyond a "perfect" group that
has no allocated inodes and no allocated blocks, so a break condition could be
added to this loop and make it more efficient, especially for very large
filesystems that have 128k+ groups.

It should be noted that this part of the algorithm only applies to "top level"
directories (those below the root inode, or with the EXT4_INODE_TOPDIR flag
set, so searching a bit longer for a good group is not a bad idea in this case.

Cheers, Andreas.

> Thanks & Regards,
> Lokesh
>
>
>
> On Thu, Dec 3, 2015 at 11:28 PM, Andreas Dilger <[email protected]> wrote:
>> On Dec 3, 2015, at 01:07, lokesh jaliminche <[email protected]> wrote:
>>>
>>> Thought of giving more clarification on my question
>>> why group search start is random ? because we can also start search
>>> for valid groups for inode allocation from the start. As this group
>>> search is random inode selection might go to end of groups which
>>> might affect IO performance
>>
>> Starting the inode search at the beginning of the disk each time
>> means that inode allocation will be inefficient because it will search
>> over groups that are mostly or entirely full already.
>>
>> Allocating the new directory in a semi-random group, one that is
>> relatively unused, ensures that new
>> inode and block allocations are relatively efficient afterward.
>>
>> Cheers, Andreas
>>
>>> On Thu, Dec 3, 2015 at 1:14 PM, lokesh jaliminche
>>> <[email protected]> wrote:
>>>> hello folks,
>>>> I am new to ext4 code. I was going through the
>>>> ext4-source for allocation of inode.
>>>> There is one thing that I did not understand while selection of groups
>>>> for inode allocation . I came across this code snippet which is part
>>>> of find_group_orlov function. question is, why group search start is
>>>> random ?
>>>>
>>>> Code snippet:
>>>> ==========
>>>> В·В·В·if (qstr) {
>>>> »·······»·······»·······hinfo.hash_version = LDISKFS_DX_HASH_HALF_MD4;
>>>> »·······»·······»·······hinfo.seed = sbi->s_hash_seed;
>>>> »·······»·······»·······ldiskfsfs_dirhash(qstr->name, qstr->len, &hinfo);
>>>> »·······»·······»·······grp = hinfo.hash;
>>>> »·······»·······} else
>>>> »·······»·······»·······get_random_bytes(&grp, sizeof(grp));
>>>> »·······»·······parent_group = (unsigned)grp % ngroups;
>>>> »·······»·······for (i = 0; i < ngroups; i++) {
>>>> »·······»·······»·······g = (parent_group + i) % ngroups;
>>>> »·······»·······»·······get_orlov_stats(sb, g, flex_size, &stats);
>>>> »·······»·······»·······if (!stats.free_inodes)
>>>> »·······»·······»·······»·······continue;
>>>> »·······»·······»·······if (stats.used_dirs >= best_ndir)
>>>> »·······»·······»·······»·······continue;
>>>> »·······»·······»·······if (stats.free_inodes < avefreei)
>>>> »·······»·······»·······»·······continue;
>>>> »·······»·······»·······if (stats.free_blocks < avefreeb)
>>>> »·······»·······»·······»·······continue;
>>>> »·······»·······»·······grp = g;
>>>> »·······»·······»·······ret = 0;
>>>> »·······»·······»·······best_ndir = stats.used_dirs;
>>>> »·······»·······}
>>>>
>>>> Thanks & Regards,
>>>> Lokesh
>>> N‹§Іжмrё›yъиљШbІX¬¶З§vШ^–)Ює{.nЗ+‰·ҐЉ{±{ xЉ{ayє К‡Ъ™л,j ­ўfЈў·hљ‹аz№ ®wҐўё ў·¦j:+v‰ЁЉwиjШm¶џяѕ «‘кзzZ+ѓщљЋЉЭўj"ќъ!¶i


Cheers, Andreas






Attachments:
signature.asc (833.00 B)
Message signed with OpenPGP using GPGMail

2015-12-15 10:33:39

by lokesh jaliminche

[permalink] [raw]
Subject: Re: Regarding random grouop search start for allocation of inode.

"No, this isn't correct. This loop is looking for the *best* group it
can find, and your "break" would have it exit the loop as soon as the
*first* group that matches the conditions is found."

But as we are checking all the groups with the same conditions then
how it guarantees better group selection ? As per my understanding we
are just wasting time in looping.

On Sat, Dec 5, 2015 at 3:24 AM, Andreas Dilger <[email protected]> wrote:
>
>> On Dec 3, 2015, at 12:13 PM, lokesh jaliminche <[email protected]> wrote:
>>
>> Ohh thanks for the clarification. There is one more thing I would like
>> to point out here.
>> In the code there is a loop to scan the groups for inode
>> alllocation(Inside find_group_orlov function).
>> There are some policies for group selection . while scanning the
>> groups, it checks for these
>> policies to be satisfied.
>> If a particular group satisfy these properties it should get selected
>> for inode allocation but instead
>> it does further lookup in next groups.
>> I think there is missing breaking condition. I have added break over
>> there and here is the
>> patch for that. Any reason for not having break condition over here ?
>>
>> diff -Nur linux-2.6.32-431.17.1.el6.x86_64/fs/ext4/ialloc.c
>> linux-2.6.32-431.17.1.el6.x86_64/fs/ext4/ialloc.c
>> --- linux-2.6.32-431.17.1.el6.x86_64/fs/ext4/ialloc.c 2014-04-12
>> 01:20:31.000000000 +0530
>> +++ linux-2.6.32-431.17.1.el6.x86_64/fs/ext4/ialloc.c 2015-11-29
>> 21:36:51.805542209 +0530
>> @@ -529,6 +529,7 @@
>> grp = g;
>> ret = 0;
>> best_ndir = stats.used_dirs;
>> + break;
>> }
>> if (ret)
>> goto fallback;
>
> No, this isn't correct. This loop is looking for the *best* group it can find,
> and your "break" would have it exit the loop as soon as the *first* directory
> that matches the conditions is found. Since those conditions are fairly weak,
> for example that the group actually has free inodes, and it has better than
> average free inodes and blocks, it makes sense to search beyond just the first
> matching group.
>
> That said, it also doesn't make sense to search beyond a "perfect" group that
> has no allocated inodes and no allocated blocks, so a break condition could be
> added to this loop and make it more efficient, especially for very large
> filesystems that have 128k+ groups.
>
> It should be noted that this part of the algorithm only applies to "top level"
> directories (those below the root inode, or with the EXT4_INODE_TOPDIR flag
> set, so searching a bit longer for a good group is not a bad idea in this case.
>
> Cheers, Andreas.
>
>> Thanks & Regards,
>> Lokesh
>>
>>
>>
>> On Thu, Dec 3, 2015 at 11:28 PM, Andreas Dilger <[email protected]> wrote:
>>> On Dec 3, 2015, at 01:07, lokesh jaliminche <[email protected]> wrote:
>>>>
>>>> Thought of giving more clarification on my question
>>>> why group search start is random ? because we can also start search
>>>> for valid groups for inode allocation from the start. As this group
>>>> search is random inode selection might go to end of groups which
>>>> might affect IO performance
>>>
>>> Starting the inode search at the beginning of the disk each time
>>> means that inode allocation will be inefficient because it will search
>>> over groups that are mostly or entirely full already.
>>>
>>> Allocating the new directory in a semi-random group, one that is
>>> relatively unused, ensures that new
>>> inode and block allocations are relatively efficient afterward.
>>>
>>> Cheers, Andreas
>>>
>>>> On Thu, Dec 3, 2015 at 1:14 PM, lokesh jaliminche
>>>> <[email protected]> wrote:
>>>>> hello folks,
>>>>> I am new to ext4 code. I was going through the
>>>>> ext4-source for allocation of inode.
>>>>> There is one thing that I did not understand while selection of groups
>>>>> for inode allocation . I came across this code snippet which is part
>>>>> of find_group_orlov function. question is, why group search start is
>>>>> random ?
>>>>>
>>>>> Code snippet:
>>>>> ==========
>>>>> В·В·В·if (qstr) {
>>>>> »·······»·······»·······hinfo.hash_version = LDISKFS_DX_HASH_HALF_MD4;
>>>>> »·······»·······»·······hinfo.seed = sbi->s_hash_seed;
>>>>> »·······»·······»·······ldiskfsfs_dirhash(qstr->name, qstr->len, &hinfo);
>>>>> »·······»·······»·······grp = hinfo.hash;
>>>>> »·······»·······} else
>>>>> »·······»·······»·······get_random_bytes(&grp, sizeof(grp));
>>>>> »·······»·······parent_group = (unsigned)grp % ngroups;
>>>>> »·······»·······for (i = 0; i < ngroups; i++) {
>>>>> »·······»·······»·······g = (parent_group + i) % ngroups;
>>>>> »·······»·······»·······get_orlov_stats(sb, g, flex_size, &stats);
>>>>> »·······»·······»·······if (!stats.free_inodes)
>>>>> »·······»·······»·······»·······continue;
>>>>> »·······»·······»·······if (stats.used_dirs >= best_ndir)
>>>>> »·······»·······»·······»·······continue;
>>>>> »·······»·······»·······if (stats.free_inodes < avefreei)
>>>>> »·······»·······»·······»·······continue;
>>>>> »·······»·······»·······if (stats.free_blocks < avefreeb)
>>>>> »·······»·······»·······»·······continue;
>>>>> »·······»·······»·······grp = g;
>>>>> »·······»·······»·······ret = 0;
>>>>> »·······»·······»·······best_ndir = stats.used_dirs;
>>>>> »·······»·······}
>>>>>
>>>>> Thanks & Regards,
>>>>> Lokesh
>>>> N‹§Іжмrё›yъиљШbІX¬¶З§vШ^–)Ює{.nЗ+‰·ҐЉ{±{ xЉ{ayє К‡Ъ™л,j ­ўfЈў·hљ‹аz№ ®wҐўё ў·¦j:+v‰ЁЉwиjШm¶џяѕ «‘кзzZ+ѓщљЋЉЭўj"ќъ!¶i
>
>
> Cheers, Andreas
>
>
>
>
>

2015-12-15 21:33:00

by Andreas Dilger

[permalink] [raw]
Subject: Re: Regarding random grouop search start for allocation of inode.

On Dec 15, 2015, at 3:33 AM, lokesh jaliminche <[email protected]> wrote:
>
> "No, this isn't correct. This loop is looking for the *best* group it
> can find, and your "break" would have it exit the loop as soon as the
> *first* group that matches the conditions is found."
>
> But as we are checking all the groups with the same conditions then
> how it guarantees better group selection ? As per my understanding we
> are just wasting time in looping.

The important part of the loop is ensuring that the selected group is always
improving over the previous one:

if (le16_to_cpu(desc->bg_used_dirs_count) >= best_ndir)
continue;

The "best_ndir" value tracks for the best group found so far the number of
used directories in the group, and if the new directory has fewer directories
than the previous "best" directory and still meets all the other criteria
(fewer than average inodes allocated, etc) then the new group will be chosen.

That said, you are correct that the loop can spend a lot of time searching
needlessly. It would be trivial to add a check at the start of the loop:

/* the group can't get any better than empty */
if (desc->bg_free_inodes_count == inodes_per_group &&
desc->bg_free_blocks_count ==
EXT4_BLOCKS_PER_GROUP(sb))
goto found;

This jumps directly to "found" with "desc" and "group" already set, so
there is no need to go through the extra steps of setting "best_desc" and
"best_group" and then break out of the loop just to set "desc" and "group"
again.

Since you are the one to find this optimization, could you please submit a
patch to this effect so you get the credit.

Cheers, Andreas

> On Sat, Dec 5, 2015 at 3:24 AM, Andreas Dilger <[email protected]> wrote:
>>
>>> On Dec 3, 2015, at 12:13 PM, lokesh jaliminche <[email protected]> wrote:
>>>
>>> Ohh thanks for the clarification. There is one more thing I would like
>>> to point out here.
>>> In the code there is a loop to scan the groups for inode
>>> alllocation(Inside find_group_orlov function).
>>> There are some policies for group selection . while scanning the
>>> groups, it checks for these
>>> policies to be satisfied.
>>> If a particular group satisfy these properties it should get selected
>>> for inode allocation but instead
>>> it does further lookup in next groups.
>>> I think there is missing breaking condition. I have added break over
>>> there and here is the
>>> patch for that. Any reason for not having break condition over here ?
>>>
>>> diff -Nur linux-2.6.32-431.17.1.el6.x86_64/fs/ext4/ialloc.c
>>> linux-2.6.32-431.17.1.el6.x86_64/fs/ext4/ialloc.c
>>> --- linux-2.6.32-431.17.1.el6.x86_64/fs/ext4/ialloc.c 2014-04-12
>>> 01:20:31.000000000 +0530
>>> +++ linux-2.6.32-431.17.1.el6.x86_64/fs/ext4/ialloc.c 2015-11-29
>>> 21:36:51.805542209 +0530
>>> @@ -529,6 +529,7 @@
>>> grp = g;
>>> ret = 0;
>>> best_ndir = stats.used_dirs;
>>> + break;
>>> }
>>> if (ret)
>>> goto fallback;
>>
>> No, this isn't correct. This loop is looking for the *best* group it can find,
>> and your "break" would have it exit the loop as soon as the *first* directory
>> that matches the conditions is found. Since those conditions are fairly weak,
>> for example that the group actually has free inodes, and it has better than
>> average free inodes and blocks, it makes sense to search beyond just the first
>> matching group.
>>
>> That said, it also doesn't make sense to search beyond a "perfect" group that
>> has no allocated inodes and no allocated blocks, so a break condition could be
>> added to this loop and make it more efficient, especially for very large
>> filesystems that have 128k+ groups.
>>
>> It should be noted that this part of the algorithm only applies to "top level"
>> directories (those below the root inode, or with the EXT4_INODE_TOPDIR flag
>> set, so searching a bit longer for a good group is not a bad idea in this case.
>>
>> Cheers, Andreas.
>>
>>> Thanks & Regards,
>>> Lokesh
>>>
>>>
>>>
>>> On Thu, Dec 3, 2015 at 11:28 PM, Andreas Dilger <[email protected]> wrote:
>>>> On Dec 3, 2015, at 01:07, lokesh jaliminche <[email protected]> wrote:
>>>>>
>>>>> Thought of giving more clarification on my question
>>>>> why group search start is random ? because we can also start search
>>>>> for valid groups for inode allocation from the start. As this group
>>>>> search is random inode selection might go to end of groups which
>>>>> might affect IO performance
>>>>
>>>> Starting the inode search at the beginning of the disk each time
>>>> means that inode allocation will be inefficient because it will search
>>>> over groups that are mostly or entirely full already.
>>>>
>>>> Allocating the new directory in a semi-random group, one that is
>>>> relatively unused, ensures that new
>>>> inode and block allocations are relatively efficient afterward.
>>>>
>>>> Cheers, Andreas
>>>>
>>>>> On Thu, Dec 3, 2015 at 1:14 PM, lokesh jaliminche
>>>>> <[email protected]> wrote:
>>>>>> hello folks,
>>>>>> I am new to ext4 code. I was going through the
>>>>>> ext4-source for allocation of inode.
>>>>>> There is one thing that I did not understand while selection of groups
>>>>>> for inode allocation . I came across this code snippet which is part
>>>>>> of find_group_orlov function. question is, why group search start is
>>>>>> random ?
>>>>>>
>>>>>> Code snippet:
>>>>>> ==========
>>>>>> В·В·В·if (qstr) {
>>>>>> »·······»·······»·······hinfo.hash_version = LDISKFS_DX_HASH_HALF_MD4;
>>>>>> »·······»·······»·······hinfo.seed = sbi->s_hash_seed;
>>>>>> »·······»·······»·······ldiskfsfs_dirhash(qstr->name, qstr->len, &hinfo);
>>>>>> »·······»·······»·······grp = hinfo.hash;
>>>>>> »·······»·······} else
>>>>>> »·······»·······»·······get_random_bytes(&grp, sizeof(grp));
>>>>>> »·······»·······parent_group = (unsigned)grp % ngroups;
>>>>>> »·······»·······for (i = 0; i < ngroups; i++) {
>>>>>> »·······»·······»·······g = (parent_group + i) % ngroups;
>>>>>> »·······»·······»·······get_orlov_stats(sb, g, flex_size, &stats);
>>>>>> »·······»·······»·······if (!stats.free_inodes)
>>>>>> »·······»·······»·······»·······continue;
>>>>>> »·······»·······»·······if (stats.used_dirs >= best_ndir)
>>>>>> »·······»·······»·······»·······continue;
>>>>>> »·······»·······»·······if (stats.free_inodes < avefreei)
>>>>>> »·······»·······»·······»·······continue;
>>>>>> »·······»·······»·······if (stats.free_blocks < avefreeb)
>>>>>> »·······»·······»·······»·······continue;
>>>>>> »·······»·······»·······grp = g;
>>>>>> »·······»·······»·······ret = 0;
>>>>>> »·······»·······»·······best_ndir = stats.used_dirs;
>>>>>> »·······»·······}
>>>>>>
>>>>>> Thanks & Regards,
>>>>>> Lokesh
>>>>> N‹§Іжмrё›yъиљШbІX¬¶З§vШ^–)Ює{.nЗ+‰·ҐЉ{±{ xЉ{ayє К‡Ъ™л,j ­ўfЈў·hљ‹аz№ ®wҐўё ў·¦j:+v‰ЁЉwиjШm¶џяѕ «‘кзzZ+ѓщљЋЉЭўj"ќъ!¶i
>>
>>
>> Cheers, Andreas
>>
>>
>>
>>
>>


Cheers, Andreas






Attachments:
signature.asc (833.00 B)
Message signed with OpenPGP using GPGMail

2015-12-17 14:34:42

by lokesh jaliminche

[permalink] [raw]
Subject: Re: Regarding random grouop search start for allocation of inode.

Hey thanks for that, I will push a patch for this ASAP :-)

Regards,
Lokesh

On Wed, Dec 16, 2015 at 3:02 AM, Andreas Dilger <[email protected]> wrote:
> On Dec 15, 2015, at 3:33 AM, lokesh jaliminche <[email protected]> wrote:
>>
>> "No, this isn't correct. This loop is looking for the *best* group it
>> can find, and your "break" would have it exit the loop as soon as the
>> *first* group that matches the conditions is found."
>>
>> But as we are checking all the groups with the same conditions then
>> how it guarantees better group selection ? As per my understanding we
>> are just wasting time in looping.
>
> The important part of the loop is ensuring that the selected group is always
> improving over the previous one:
>
> if (le16_to_cpu(desc->bg_used_dirs_count) >= best_ndir)
> continue;
>
> The "best_ndir" value tracks for the best group found so far the number of
> used directories in the group, and if the new directory has fewer directories
> than the previous "best" directory and still meets all the other criteria
> (fewer than average inodes allocated, etc) then the new group will be chosen.
>
> That said, you are correct that the loop can spend a lot of time searching
> needlessly. It would be trivial to add a check at the start of the loop:
>
> /* the group can't get any better than empty */
> if (desc->bg_free_inodes_count == inodes_per_group &&
> desc->bg_free_blocks_count ==
> EXT4_BLOCKS_PER_GROUP(sb))
> goto found;
>
> This jumps directly to "found" with "desc" and "group" already set, so
> there is no need to go through the extra steps of setting "best_desc" and
> "best_group" and then break out of the loop just to set "desc" and "group"
> again.
>
> Since you are the one to find this optimization, could you please submit a
> patch to this effect so you get the credit.
>
> Cheers, Andreas
>
>> On Sat, Dec 5, 2015 at 3:24 AM, Andreas Dilger <[email protected]> wrote:
>>>
>>>> On Dec 3, 2015, at 12:13 PM, lokesh jaliminche <[email protected]> wrote:
>>>>
>>>> Ohh thanks for the clarification. There is one more thing I would like
>>>> to point out here.
>>>> In the code there is a loop to scan the groups for inode
>>>> alllocation(Inside find_group_orlov function).
>>>> There are some policies for group selection . while scanning the
>>>> groups, it checks for these
>>>> policies to be satisfied.
>>>> If a particular group satisfy these properties it should get selected
>>>> for inode allocation but instead
>>>> it does further lookup in next groups.
>>>> I think there is missing breaking condition. I have added break over
>>>> there and here is the
>>>> patch for that. Any reason for not having break condition over here ?
>>>>
>>>> diff -Nur linux-2.6.32-431.17.1.el6.x86_64/fs/ext4/ialloc.c
>>>> linux-2.6.32-431.17.1.el6.x86_64/fs/ext4/ialloc.c
>>>> --- linux-2.6.32-431.17.1.el6.x86_64/fs/ext4/ialloc.c 2014-04-12
>>>> 01:20:31.000000000 +0530
>>>> +++ linux-2.6.32-431.17.1.el6.x86_64/fs/ext4/ialloc.c 2015-11-29
>>>> 21:36:51.805542209 +0530
>>>> @@ -529,6 +529,7 @@
>>>> grp = g;
>>>> ret = 0;
>>>> best_ndir = stats.used_dirs;
>>>> + break;
>>>> }
>>>> if (ret)
>>>> goto fallback;
>>>
>>> No, this isn't correct. This loop is looking for the *best* group it can find,
>>> and your "break" would have it exit the loop as soon as the *first* directory
>>> that matches the conditions is found. Since those conditions are fairly weak,
>>> for example that the group actually has free inodes, and it has better than
>>> average free inodes and blocks, it makes sense to search beyond just the first
>>> matching group.
>>>
>>> That said, it also doesn't make sense to search beyond a "perfect" group that
>>> has no allocated inodes and no allocated blocks, so a break condition could be
>>> added to this loop and make it more efficient, especially for very large
>>> filesystems that have 128k+ groups.
>>>
>>> It should be noted that this part of the algorithm only applies to "top level"
>>> directories (those below the root inode, or with the EXT4_INODE_TOPDIR flag
>>> set, so searching a bit longer for a good group is not a bad idea in this case.
>>>
>>> Cheers, Andreas.
>>>
>>>> Thanks & Regards,
>>>> Lokesh
>>>>
>>>>
>>>>
>>>> On Thu, Dec 3, 2015 at 11:28 PM, Andreas Dilger <[email protected]> wrote:
>>>>> On Dec 3, 2015, at 01:07, lokesh jaliminche <[email protected]> wrote:
>>>>>>
>>>>>> Thought of giving more clarification on my question
>>>>>> why group search start is random ? because we can also start search
>>>>>> for valid groups for inode allocation from the start. As this group
>>>>>> search is random inode selection might go to end of groups which
>>>>>> might affect IO performance
>>>>>
>>>>> Starting the inode search at the beginning of the disk each time
>>>>> means that inode allocation will be inefficient because it will search
>>>>> over groups that are mostly or entirely full already.
>>>>>
>>>>> Allocating the new directory in a semi-random group, one that is
>>>>> relatively unused, ensures that new
>>>>> inode and block allocations are relatively efficient afterward.
>>>>>
>>>>> Cheers, Andreas
>>>>>
>>>>>> On Thu, Dec 3, 2015 at 1:14 PM, lokesh jaliminche
>>>>>> <[email protected]> wrote:
>>>>>>> hello folks,
>>>>>>> I am new to ext4 code. I was going through the
>>>>>>> ext4-source for allocation of inode.
>>>>>>> There is one thing that I did not understand while selection of groups
>>>>>>> for inode allocation . I came across this code snippet which is part
>>>>>>> of find_group_orlov function. question is, why group search start is
>>>>>>> random ?
>>>>>>>
>>>>>>> Code snippet:
>>>>>>> ==========
>>>>>>> В·В·В·if (qstr) {
>>>>>>> »·······»·······»·······hinfo.hash_version = LDISKFS_DX_HASH_HALF_MD4;
>>>>>>> »·······»·······»·······hinfo.seed = sbi->s_hash_seed;
>>>>>>> »·······»·······»·······ldiskfsfs_dirhash(qstr->name, qstr->len, &hinfo);
>>>>>>> »·······»·······»·······grp = hinfo.hash;
>>>>>>> »·······»·······} else
>>>>>>> »·······»·······»·······get_random_bytes(&grp, sizeof(grp));
>>>>>>> »·······»·······parent_group = (unsigned)grp % ngroups;
>>>>>>> »·······»·······for (i = 0; i < ngroups; i++) {
>>>>>>> »·······»·······»·······g = (parent_group + i) % ngroups;
>>>>>>> »·······»·······»·······get_orlov_stats(sb, g, flex_size, &stats);
>>>>>>> »·······»·······»·······if (!stats.free_inodes)
>>>>>>> »·······»·······»·······»·······continue;
>>>>>>> »·······»·······»·······if (stats.used_dirs >= best_ndir)
>>>>>>> »·······»·······»·······»·······continue;
>>>>>>> »·······»·······»·······if (stats.free_inodes < avefreei)
>>>>>>> »·······»·······»·······»·······continue;
>>>>>>> »·······»·······»·······if (stats.free_blocks < avefreeb)
>>>>>>> »·······»·······»·······»·······continue;
>>>>>>> »·······»·······»·······grp = g;
>>>>>>> »·······»·······»·······ret = 0;
>>>>>>> »·······»·······»·······best_ndir = stats.used_dirs;
>>>>>>> »·······»·······}
>>>>>>>
>>>>>>> Thanks & Regards,
>>>>>>> Lokesh
>>>>>> N‹§Іжмrё›yъиљШbІX¬¶З§vШ^–)Ює{.nЗ+‰·ҐЉ{±{ xЉ{ayє К‡Ъ™л,j ­ўfЈў·hљ‹аz№ ®wҐўё ў·¦j:+v‰ЁЉwиjШm¶џяѕ «‘кзzZ+ѓщљЋЉЭўj"ќъ!¶i
>>>
>>>
>>> Cheers, Andreas
>>>
>>>
>>>
>>>
>>>
>
>
> Cheers, Andreas
>
>
>
>
>

2015-12-19 11:14:41

by lokesh jaliminche

[permalink] [raw]
Subject: Re: Regarding random grouop search start for allocation of inode.

From 9e09fef78b2fa552c883bf8124af873abfde0805 Mon Sep 17 00:00:00 2001
From: Lokesh Nagappa Jaliminche <[email protected]>
Date: Sat, 19 Dec 2015 00:33:06 +0530
Subject: [PATCH] ext4: optimizing group serch for inode allocation

Added a check at the start of group search loop to
avoid looping unecessarily in case of empty group.
This also allow group search to jump directly to
"found_flex_bg" with "stats" and "group" already set,
so there is no need to go through the extra steps of
setting "best_desc" and "best_group" and then break
out of the loop just to set "stats" and "group" again.

Signed-off-by: Lokesh Nagappa Jaliminche <[email protected]>
---
fs/ext4/ialloc.c | 8 ++++++++
1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 1b8024d..588bf8e 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -446,6 +446,8 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent,
struct ext4_sb_info *sbi = EXT4_SB(sb);
ext4_group_t real_ngroups = ext4_get_groups_count(sb);
int inodes_per_group = EXT4_INODES_PER_GROUP(sb);
+ unsigned int inodes_per_flex_group;
+ long unsigned int blocks_per_clustre;
unsigned int freei, avefreei, grp_free;
ext4_fsblk_t freeb, avefreec;
unsigned int ndirs;
@@ -470,6 +472,8 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent,
percpu_counter_read_positive(&sbi->s_freeclusters_counter));
avefreec = freeb;
do_div(avefreec, ngroups);
+ inodes_per_flex_group = inodes_per_group * flex_size;
+ blocks_per_clustre = sbi->s_blocks_per_group * flex_size;
ndirs = percpu_counter_read_positive(&sbi->s_dirs_counter);
·
if (S_ISDIR(mode) &&
@@ -489,6 +493,10 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent,
for (i = 0; i < ngroups; i++) {
g = (parent_group + i) % ngroups;
get_orlov_stats(sb, g, flex_size, &stats);
+ /* the group can't get any better than empty */
+ if (inodes_per_flex_group == stats.free_inodes &&
+ blocks_per_clustre == stats.free_clusters)
+ goto found_flex_bg;
if (!stats.free_inodes)
continue;
if (stats.used_dirs >= best_ndir)
--
1.7.1
On Tue, Dec 15, 2015 at 02:32:52PM -0700, Andreas Dilger wrote:
> On Dec 15, 2015, at 3:33 AM, lokesh jaliminche <[email protected]> wrote:
> >
> > "No, this isn't correct. This loop is looking for the *best* group it
> > can find, and your "break" would have it exit the loop as soon as the
> > *first* group that matches the conditions is found."
> >
> > But as we are checking all the groups with the same conditions then
> > how it guarantees better group selection ? As per my understanding we
> > are just wasting time in looping.
>
> The important part of the loop is ensuring that the selected group is always
> improving over the previous one:
>
> if (le16_to_cpu(desc->bg_used_dirs_count) >= best_ndir)
> continue;
>
> The "best_ndir" value tracks for the best group found so far the number of
> used directories in the group, and if the new directory has fewer directories
> than the previous "best" directory and still meets all the other criteria
> (fewer than average inodes allocated, etc) then the new group will be chosen.
>
> That said, you are correct that the loop can spend a lot of time searching
> needlessly. It would be trivial to add a check at the start of the loop:
>
> /* the group can't get any better than empty */
> if (desc->bg_free_inodes_count == inodes_per_group &&
> desc->bg_free_blocks_count ==
> EXT4_BLOCKS_PER_GROUP(sb))
> goto found;
>
> This jumps directly to "found" with "desc" and "group" already set, so
> there is no need to go through the extra steps of setting "best_desc" and
> "best_group" and then break out of the loop just to set "desc" and "group"
> again.
>
> Since you are the one to find this optimization, could you please submit a
> patch to this effect so you get the credit.
>
> Cheers, Andreas
>
> > On Sat, Dec 5, 2015 at 3:24 AM, Andreas Dilger <[email protected]> wrote:
> >>
> >>> On Dec 3, 2015, at 12:13 PM, lokesh jaliminche <[email protected]> wrote:
> >>>
> >>> Ohh thanks for the clarification. There is one more thing I would like
> >>> to point out here.
> >>> In the code there is a loop to scan the groups for inode
> >>> alllocation(Inside find_group_orlov function).
> >>> There are some policies for group selection . while scanning the
> >>> groups, it checks for these
> >>> policies to be satisfied.
> >>> If a particular group satisfy these properties it should get selected
> >>> for inode allocation but instead
> >>> it does further lookup in next groups.
> >>> I think there is missing breaking condition. I have added break over
> >>> there and here is the
> >>> patch for that. Any reason for not having break condition over here ?
> >>>
> >>> diff -Nur linux-2.6.32-431.17.1.el6.x86_64/fs/ext4/ialloc.c
> >>> linux-2.6.32-431.17.1.el6.x86_64/fs/ext4/ialloc.c
> >>> --- linux-2.6.32-431.17.1.el6.x86_64/fs/ext4/ialloc.c 2014-04-12
> >>> 01:20:31.000000000 +0530
> >>> +++ linux-2.6.32-431.17.1.el6.x86_64/fs/ext4/ialloc.c 2015-11-29
> >>> 21:36:51.805542209 +0530
> >>> @@ -529,6 +529,7 @@
> >>> grp = g;
> >>> ret = 0;
> >>> best_ndir = stats.used_dirs;
> >>> + break;
> >>> }
> >>> if (ret)
> >>> goto fallback;
> >>
> >> No, this isn't correct. This loop is looking for the *best* group it can find,
> >> and your "break" would have it exit the loop as soon as the *first* directory
> >> that matches the conditions is found. Since those conditions are fairly weak,
> >> for example that the group actually has free inodes, and it has better than
> >> average free inodes and blocks, it makes sense to search beyond just the first
> >> matching group.
> >>
> >> That said, it also doesn't make sense to search beyond a "perfect" group that
> >> has no allocated inodes and no allocated blocks, so a break condition could be
> >> added to this loop and make it more efficient, especially for very large
> >> filesystems that have 128k+ groups.
> >>
> >> It should be noted that this part of the algorithm only applies to "top level"
> >> directories (those below the root inode, or with the EXT4_INODE_TOPDIR flag
> >> set, so searching a bit longer for a good group is not a bad idea in this case.
> >>
> >> Cheers, Andreas.
> >>
> >>> Thanks & Regards,
> >>> Lokesh
> >>>
> >>>
> >>>
> >>> On Thu, Dec 3, 2015 at 11:28 PM, Andreas Dilger <[email protected]> wrote:
> >>>> On Dec 3, 2015, at 01:07, lokesh jaliminche <[email protected]> wrote:
> >>>>>
> >>>>> Thought of giving more clarification on my question
> >>>>> why group search start is random ? because we can also start search
> >>>>> for valid groups for inode allocation from the start. As this group
> >>>>> search is random inode selection might go to end of groups which
> >>>>> might affect IO performance
> >>>>
> >>>> Starting the inode search at the beginning of the disk each time
> >>>> means that inode allocation will be inefficient because it will search
> >>>> over groups that are mostly or entirely full already.
> >>>>
> >>>> Allocating the new directory in a semi-random group, one that is
> >>>> relatively unused, ensures that new
> >>>> inode and block allocations are relatively efficient afterward.
> >>>>
> >>>> Cheers, Andreas
> >>>>
> >>>>> On Thu, Dec 3, 2015 at 1:14 PM, lokesh jaliminche
> >>>>> <[email protected]> wrote:
> >>>>>> hello folks,
> >>>>>> I am new to ext4 code. I was going through the
> >>>>>> ext4-source for allocation of inode.
> >>>>>> There is one thing that I did not understand while selection of groups
> >>>>>> for inode allocation . I came across this code snippet which is part
> >>>>>> of find_group_orlov function. question is, why group search start is
> >>>>>> random ?
> >>>>>>
> >>>>>> Code snippet:
> >>>>>> ==========
> >>>>>> В·В·В·if (qstr) {
> >>>>>> »·······»·······»·······hinfo.hash_version = LDISKFS_DX_HASH_HALF_MD4;
> >>>>>> »·······»·······»·······hinfo.seed = sbi->s_hash_seed;
> >>>>>> »·······»·······»·······ldiskfsfs_dirhash(qstr->name, qstr->len, &hinfo);
> >>>>>> »·······»·······»·······grp = hinfo.hash;
> >>>>>> »·······»·······} else
> >>>>>> »·······»·······»·······get_random_bytes(&grp, sizeof(grp));
> >>>>>> »·······»·······parent_group = (unsigned)grp % ngroups;
> >>>>>> »·······»·······for (i = 0; i < ngroups; i++) {
> >>>>>> »·······»·······»·······g = (parent_group + i) % ngroups;
> >>>>>> »·······»·······»·······get_orlov_stats(sb, g, flex_size, &stats);
> >>>>>> »·······»·······»·······if (!stats.free_inodes)
> >>>>>> »·······»·······»·······»·······continue;
> >>>>>> »·······»·······»·······if (stats.used_dirs >= best_ndir)
> >>>>>> »·······»·······»·······»·······continue;
> >>>>>> »·······»·······»·······if (stats.free_inodes < avefreei)
> >>>>>> »·······»·······»·······»·······continue;
> >>>>>> »·······»·······»·······if (stats.free_blocks < avefreeb)
> >>>>>> »·······»·······»·······»·······continue;
> >>>>>> »·······»·······»·······grp = g;
> >>>>>> »·······»·······»·······ret = 0;
> >>>>>> »·······»·······»·······best_ndir = stats.used_dirs;
> >>>>>> »·······»·······}
> >>>>>>
> >>>>>> Thanks & Regards,
> >>>>>> Lokesh
> >>>>> N‹§Іжмrё›yъиљШbІX¬¶З§vШ^–)Ює{.nЗ+‰·ҐЉ{±{ xЉ{ayє К‡Ъ™л,j ­ўfЈў·hљ‹аz№ ®wҐўё ў·¦j:+v‰ЁЉwиjШm¶џяѕ «‘кзzZ+ѓщљЋЉЭўj"ќъ!¶i
> >>
> >>
> >> Cheers, Andreas
> >>
> >>
> >>
> >>
> >>
>
>
> Cheers, Andreas
>
>
>
>
>

2015-12-21 18:25:15

by Andreas Dilger

[permalink] [raw]
Subject: Re: Regarding random grouop search start for allocation of inode.

On Dec 18, 2015, at 10:09 PM, lokesh <[email protected]> wrote:
>
> From 9e09fef78b2fa552c883bf8124af873abfde0805 Mon Sep 17 00:00:00 2001
> From: Lokesh Nagappa Jaliminche <[email protected]>
> Date: Sat, 19 Dec 2015 00:33:06 +0530
> Subject: [PATCH] ext4: optimizing group serch for inode allocation

You need to send the patch with [PATCH] in the actual email Subject
line, or it will likely be missed by Ted. In this case, you should
use "[PATCH v3]" for the new patch, since you've already sent out this
patch twice.

> Added a check at the start of group search loop to
> avoid looping unecessarily in case of empty group.
> This also allow group search to jump directly to
> "found_flex_bg" with "stats" and "group" already set,
> so there is no need to go through the extra steps of
> setting "best_desc" and "best_group" and then break
> out of the loop just to set "stats" and "group" again.


> Signed-off-by: Lokesh Nagappa Jaliminche <[email protected]>
> ---
> fs/ext4/ialloc.c | 8 ++++++++
> 1 files changed, 8 insertions(+), 0 deletions(-)
>
> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> index 1b8024d..588bf8e 100644
> --- a/fs/ext4/ialloc.c
> +++ b/fs/ext4/ialloc.c
> @@ -446,6 +446,8 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent,
> struct ext4_sb_info *sbi = EXT4_SB(sb);
> ext4_group_t real_ngroups = ext4_get_groups_count(sb);
> int inodes_per_group = EXT4_INODES_PER_GROUP(sb);
> + unsigned int inodes_per_flex_group;
> + long unsigned int blocks_per_clustre;

s/clustre/cluster/ but it should really be "clusters_per_flex_group"
to match "inodes_per_flex_group".

These declarations should be inside the "if" block, by "int best_ndir"

> unsigned int freei, avefreei, grp_free;
> ext4_fsblk_t freeb, avefreec;
> unsigned int ndirs;
> @@ -470,6 +472,8 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent,
> percpu_counter_read_positive(&sbi->s_freeclusters_counter));
> avefreec = freeb;
> do_div(avefreec, ngroups);
> + inodes_per_flex_group = inodes_per_group * flex_size;
> + blocks_per_clustre = sbi->s_blocks_per_group * flex_size;

This calculation should also be inside the "if (S_ISDIR()" block below.

> ndirs = percpu_counter_read_positive(&sbi->s_dirs_counter);
> ·
> if (S_ISDIR(mode) &&
> @@ -489,6 +493,10 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent,
> for (i = 0; i < ngroups; i++) {
> g = (parent_group + i) % ngroups;
> get_orlov_stats(sb, g, flex_size, &stats);
> + /* the group can't get any better than empty */
> + if (inodes_per_flex_group == stats.free_inodes &&
> + blocks_per_clustre == stats.free_clusters)

Align continued line after '(' on previous line.

> + goto found_flex_bg;

This should only have a single level of indent.

Cheers, Andreas

> if (!stats.free_inodes)
> continue;
> if (stats.used_dirs >= best_ndir)
> --
> 1.7.1
> On Tue, Dec 15, 2015 at 02:32:52PM -0700, Andreas Dilger wrote:
>> On Dec 15, 2015, at 3:33 AM, lokesh jaliminche <[email protected]> wrote:
>>>
>>> "No, this isn't correct. This loop is looking for the *best* group it
>>> can find, and your "break" would have it exit the loop as soon as the
>>> *first* group that matches the conditions is found."
>>>
>>> But as we are checking all the groups with the same conditions then
>>> how it guarantees better group selection ? As per my understanding we
>>> are just wasting time in looping.
>>
>> The important part of the loop is ensuring that the selected group is always
>> improving over the previous one:
>>
>> if (le16_to_cpu(desc->bg_used_dirs_count) >= best_ndir)
>> continue;
>>
>> The "best_ndir" value tracks for the best group found so far the number of
>> used directories in the group, and if the new directory has fewer directories
>> than the previous "best" directory and still meets all the other criteria
>> (fewer than average inodes allocated, etc) then the new group will be chosen.
>>
>> That said, you are correct that the loop can spend a lot of time searching
>> needlessly. It would be trivial to add a check at the start of the loop:
>>
>> /* the group can't get any better than empty */
>> if (desc->bg_free_inodes_count == inodes_per_group &&
>> desc->bg_free_blocks_count ==
>> EXT4_BLOCKS_PER_GROUP(sb))
>> goto found;
>>
>> This jumps directly to "found" with "desc" and "group" already set, so
>> there is no need to go through the extra steps of setting "best_desc" and
>> "best_group" and then break out of the loop just to set "desc" and "group"
>> again.
>>
>> Since you are the one to find this optimization, could you please submit a
>> patch to this effect so you get the credit.
>>
>> Cheers, Andreas
>>
>>> On Sat, Dec 5, 2015 at 3:24 AM, Andreas Dilger <[email protected]> wrote:
>>>>
>>>>> On Dec 3, 2015, at 12:13 PM, lokesh jaliminche <[email protected]> wrote:
>>>>>
>>>>> Ohh thanks for the clarification. There is one more thing I would like
>>>>> to point out here.
>>>>> In the code there is a loop to scan the groups for inode
>>>>> alllocation(Inside find_group_orlov function).
>>>>> There are some policies for group selection . while scanning the
>>>>> groups, it checks for these
>>>>> policies to be satisfied.
>>>>> If a particular group satisfy these properties it should get selected
>>>>> for inode allocation but instead
>>>>> it does further lookup in next groups.
>>>>> I think there is missing breaking condition. I have added break over
>>>>> there and here is the
>>>>> patch for that. Any reason for not having break condition over here ?
>>>>>
>>>>> diff -Nur linux-2.6.32-431.17.1.el6.x86_64/fs/ext4/ialloc.c
>>>>> linux-2.6.32-431.17.1.el6.x86_64/fs/ext4/ialloc.c
>>>>> --- linux-2.6.32-431.17.1.el6.x86_64/fs/ext4/ialloc.c 2014-04-12
>>>>> 01:20:31.000000000 +0530
>>>>> +++ linux-2.6.32-431.17.1.el6.x86_64/fs/ext4/ialloc.c 2015-11-29
>>>>> 21:36:51.805542209 +0530
>>>>> @@ -529,6 +529,7 @@
>>>>> grp = g;
>>>>> ret = 0;
>>>>> best_ndir = stats.used_dirs;
>>>>> + break;
>>>>> }
>>>>> if (ret)
>>>>> goto fallback;
>>>>
>>>> No, this isn't correct. This loop is looking for the *best* group it can find,
>>>> and your "break" would have it exit the loop as soon as the *first* directory
>>>> that matches the conditions is found. Since those conditions are fairly weak,
>>>> for example that the group actually has free inodes, and it has better than
>>>> average free inodes and blocks, it makes sense to search beyond just the first
>>>> matching group.
>>>>
>>>> That said, it also doesn't make sense to search beyond a "perfect" group that
>>>> has no allocated inodes and no allocated blocks, so a break condition could be
>>>> added to this loop and make it more efficient, especially for very large
>>>> filesystems that have 128k+ groups.
>>>>
>>>> It should be noted that this part of the algorithm only applies to "top level"
>>>> directories (those below the root inode, or with the EXT4_INODE_TOPDIR flag
>>>> set, so searching a bit longer for a good group is not a bad idea in this case.
>>>>
>>>> Cheers, Andreas.
>>>>
>>>>> Thanks & Regards,
>>>>> Lokesh
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Dec 3, 2015 at 11:28 PM, Andreas Dilger <[email protected]> wrote:
>>>>>> On Dec 3, 2015, at 01:07, lokesh jaliminche <[email protected]> wrote:
>>>>>>>
>>>>>>> Thought of giving more clarification on my question
>>>>>>> why group search start is random ? because we can also start search
>>>>>>> for valid groups for inode allocation from the start. As this group
>>>>>>> search is random inode selection might go to end of groups which
>>>>>>> might affect IO performance
>>>>>>
>>>>>> Starting the inode search at the beginning of the disk each time
>>>>>> means that inode allocation will be inefficient because it will search
>>>>>> over groups that are mostly or entirely full already.
>>>>>>
>>>>>> Allocating the new directory in a semi-random group, one that is
>>>>>> relatively unused, ensures that new
>>>>>> inode and block allocations are relatively efficient afterward.
>>>>>>
>>>>>> Cheers, Andreas
>>>>>>
>>>>>>> On Thu, Dec 3, 2015 at 1:14 PM, lokesh jaliminche
>>>>>>> <[email protected]> wrote:
>>>>>>>> hello folks,
>>>>>>>> I am new to ext4 code. I was going through the
>>>>>>>> ext4-source for allocation of inode.
>>>>>>>> There is one thing that I did not understand while selection of groups
>>>>>>>> for inode allocation . I came across this code snippet which is part
>>>>>>>> of find_group_orlov function. question is, why group search start is
>>>>>>>> random ?
>>>>>>>>
>>>>>>>> Code snippet:
>>>>>>>> ==========
>>>>>>>> В·В·В·if (qstr) {
>>>>>>>> »·······»·······»·······hinfo.hash_version = LDISKFS_DX_HASH_HALF_MD4;
>>>>>>>> »·······»·······»·······hinfo.seed = sbi->s_hash_seed;
>>>>>>>> »·······»·······»·······ldiskfsfs_dirhash(qstr->name, qstr->len, &hinfo);
>>>>>>>> »·······»·······»·······grp = hinfo.hash;
>>>>>>>> »·······»·······} else
>>>>>>>> »·······»·······»·······get_random_bytes(&grp, sizeof(grp));
>>>>>>>> »·······»·······parent_group = (unsigned)grp % ngroups;
>>>>>>>> »·······»·······for (i = 0; i < ngroups; i++) {
>>>>>>>> »·······»·······»·······g = (parent_group + i) % ngroups;
>>>>>>>> »·······»·······»·······get_orlov_stats(sb, g, flex_size, &stats);
>>>>>>>> »·······»·······»·······if (!stats.free_inodes)
>>>>>>>> »·······»·······»·······»·······continue;
>>>>>>>> »·······»·······»·······if (stats.used_dirs >= best_ndir)
>>>>>>>> »·······»·······»·······»·······continue;
>>>>>>>> »·······»·······»·······if (stats.free_inodes < avefreei)
>>>>>>>> »·······»·······»·······»·······continue;
>>>>>>>> »·······»·······»·······if (stats.free_blocks < avefreeb)
>>>>>>>> »·······»·······»·······»·······continue;
>>>>>>>> »·······»·······»·······grp = g;
>>>>>>>> »·······»·······»·······ret = 0;
>>>>>>>> »·······»·······»·······best_ndir = stats.used_dirs;
>>>>>>>> »·······»·······}
>>>>>>>>
>>>>>>>> Thanks & Regards,
>>>>>>>> Lokesh
>>>>>>> N‹§Іжмrё›yъиљШbІX¬¶З§vШ^–)Ює{.nЗ+‰·ҐЉ{±{ xЉ{ayє К‡Ъ™л,j ­ўfЈў·hљ‹аz№ ®wҐўё ў·¦j:+v‰ЁЉwиjШm¶џяѕ «‘кзzZ+ѓщљЋЉЭўj"ќъ!¶i
>>>>
>>>>
>>>> Cheers, Andreas
>>>>
>>>>
>>>>
>>>>
>>>>
>>
>>
>> Cheers, Andreas
>>
>>
>>
>>
>>
>
>


Cheers, Andreas






Attachments:
signature.asc (833.00 B)
Message signed with OpenPGP using GPGMail