It would be possible to optimize mb_free_blocks() by having it
clear a whole word at a time instead of a series if bits.
I thought that was done already, but it doesn't appear to be the case.
Also, it isn't clear that the bit "normalization" is needed anymore.
This was done back in the aniceint times when the buddy bitmaps were stored on disk instead of being regenerated only at mount time.
Cheers, Andreas
> On Aug 4, 2014, at 16:30, Mason <[email protected]> wrote:
>
>> On 18/07/2014 11:29, Lukáš Czerner wrote:
>>
>> Mason wrote:
>>
>>> Theodore Ts'o wrote:
>>>
>>>> Mason wrote:
>>>>
>>>>> unlink("/mnt/hdd/xxx") = 0 <111.479283>
>>>>>
>>>>> 0.01user 111.48system 1:51.99elapsed 99%CPU (0avgtext+0avgdata 772maxresident)k
>>>>> 0inputs+0outputs (0major+434minor)pagefaults 0swaps
>>>>
>>>> ... and we're CPU bound inside the kernel.
>>>>
>>>> Can you run perf so we can see exactly where we're spending the CPU?
>>>> You're not using a journal, so I'm pretty sure what you will find is
>>>> that we're spending all of our time in mb_free_blocks(), when it is
>>>> updating the internal mballoc buddy bitmaps.
>>>>
>>>> With a journal, this work done by mb_free_blocks() is hidden in the
>>>> kjournal thread, and happens after the commit is completed, so it
>>>> won't block other file system operations (other than burning some
>>>> extra CPU on one of the multiple cores available on a typical x86
>>>> CPU).
>>>>
>>>> Also, I suspect the CPU overhead is *much* less on an x86 CPU, which
>>>> has native bit test/set/clear instructions, whereas the MIPS
>>>> architecture was designed by Prof. Hennessy at Stanford, who was a
>>>> doctrinaire RISC fanatic, so there would be no bitop instructions.
>
> I've attached the output of "mips-linux-gnu-objdump -xd mballoc.o"
> in case someone wants to peek at the generated code.
>
>>>> Even though I'm pretty sure what we'll find, knowing exactly *where*
>>>> in mb_free_blocks() or the function it calls would be helpful in
>>>> knowing what we need to optimize. So if you could try using perf
>>>> (assuming that the perf is supported MIPS; not sure if it does) that
>>>> would be really helpful.
>
> How do you get perf to tell you where in mb_free_blocks we are spending
> the most time?
>
>>> What command-line do you suggest I run to get the output you expect?
>>
>> If perf works on your system you can record data with
>>
>> perf record -g ./test file <size>
>>
>> and then report with
>>
>> perf report --stdio
>>
>> That should yield some interesting information about where we spend
>> the most time in kernel.
>
> I've no idea why, but the unlink operation, which used to take
> 111 seconds to run, now only takes 53...
>
> Anyway, here is the requested output.
>
> # time perf record -g foo /mnt/hdd/xxx 300
> [ perf record: Woken up 8 times to write data ]
> [ perf record: Captured and wrote 1.909 MB perf.data (~83406 samples) ]
> 0.04user 0.08system 0:53.54elapsed 0%CPU (0avgtext+0avgdata 3616maxresident)k
> 0inputs+0outputs (0major+984minor)pagefaults 0swaps
>
> # perf report --stdio > report.txt
> (Complete report attached as report.txt.xz)
>
> What can I do to improve the latency of unlinking large files?
> Would sparse_super2 help at all?
>
>
> # Events: 14K cycles
> #
> # Overhead Command Shared Object Symbol
> # ........ ....... ................. ............................
> #
> 33.94% foo [kernel.kallsyms] [k] mb_free_blocks
> |
> --- mb_free_blocks
> ext4_free_blocks
> ext4_ext_rm_leaf
> ext4_ext_truncate
> ext4_truncate
> ext4_evict_inode
> evict
> do_unlinkat
> stack_done
>
> 21.11% foo [kernel.kallsyms] [k] __find_get_block
> |
> --- __find_get_block
> |
> |--99.94%-- ext4_free_blocks
> | ext4_ext_rm_leaf
> | ext4_ext_truncate
> | ext4_truncate
> | ext4_evict_inode
> | evict
> | do_unlinkat
> | stack_done
> --0.06%-- [...]
>
> 8.33% foo [kernel.kallsyms] [k] radix_tree_lookup_slot
> |
> --- radix_tree_lookup_slot
> find_get_page
> __find_get_block_slow
> __find_get_block
> ext4_free_blocks
> ext4_ext_rm_leaf
> ext4_ext_truncate
> ext4_truncate
> ext4_evict_inode
> evict
> do_unlinkat
> stack_done
>
> 6.99% foo [kernel.kallsyms] [k] mb_find_buddy
> |
> --- mb_find_buddy
> mb_free_blocks
> ext4_free_blocks
> ext4_ext_rm_leaf
> ext4_ext_truncate
> ext4_truncate
> ext4_evict_inode
> evict
> do_unlinkat
> stack_done
>
> 4.21% foo [kernel.kallsyms] [k] trace_preempt_off
> |
> --- trace_preempt_off
> |
> |--99.99%-- __find_get_block
> | ext4_free_blocks
> | ext4_ext_rm_leaf
> | ext4_ext_truncate
> | ext4_truncate
> | ext4_evict_inode
> | evict
> | do_unlinkat
> | stack_done
> --0.01%-- [...]
>
> 4.19% foo [kernel.kallsyms] [k] ext4_free_blocks
> |
> --- ext4_free_blocks
> ext4_ext_rm_leaf
> ext4_ext_truncate
> ext4_truncate
> ext4_evict_inode
> evict
> do_unlinkat
> stack_done
>
> 4.14% foo [kernel.kallsyms] [k] sub_preempt_count
> |
> --- sub_preempt_count
> |
> |--99.69%-- __find_get_block
> | ext4_free_blocks
> | ext4_ext_rm_leaf
> | ext4_ext_truncate
> | ext4_truncate
> | ext4_evict_inode
> | evict
> | do_unlinkat
> | stack_done
> --0.31%-- [...]
>
> 3.97% foo [kernel.kallsyms] [k] __find_get_block_slow
> |
> --- __find_get_block_slow
> __find_get_block
> ext4_free_blocks
> ext4_ext_rm_leaf
> ext4_ext_truncate
> ext4_truncate
> ext4_evict_inode
> evict
> do_unlinkat
> stack_done
>
> 3.53% foo [kernel.kallsyms] [k] __rcu_read_unlock
> |
> --- __rcu_read_unlock
> |
> |--100.00%-- find_get_page
> | __find_get_block_slow
> | __find_get_block
> | ext4_free_blocks
> | ext4_ext_rm_leaf
> | ext4_ext_truncate
> | ext4_truncate
> | ext4_evict_inode
> | evict
> | do_unlinkat
> | stack_done
> --0.00%-- [...]
>
> 3.26% foo [kernel.kallsyms] [k] trace_preempt_on
> |
> --- trace_preempt_on
> sub_preempt_count
> |
> |--100.00%-- __find_get_block
> | ext4_free_blocks
> | ext4_ext_rm_leaf
> | ext4_ext_truncate
> | ext4_truncate
> | ext4_evict_inode
> | evict
> | do_unlinkat
> | stack_done
> --0.00%-- [...]
>
> 2.06% foo [kernel.kallsyms] [k] find_get_page
> |
> --- find_get_page
> |
> |--100.00%-- __find_get_block_slow
> | __find_get_block
> | ext4_free_blocks
> | ext4_ext_rm_leaf
> | ext4_ext_truncate
> | ext4_truncate
> | ext4_evict_inode
> | evict
> | do_unlinkat
> | stack_done
> --0.00%-- [...]
>
> 1.39% foo [kernel.kallsyms] [k] add_preempt_count
> |
> --- add_preempt_count
> |
> |--99.99%-- __find_get_block
> | ext4_free_blocks
> | ext4_ext_rm_leaf
> | ext4_ext_truncate
> | ext4_truncate
> | ext4_evict_inode
> | evict
> | do_unlinkat
> | stack_done
> --0.01%-- [...]
>
> 1.26% foo [kernel.kallsyms] [k] __rcu_read_lock
> |
> --- __rcu_read_lock
> find_get_page
> __find_get_block_slow
> __find_get_block
> ext4_free_blocks
> ext4_ext_rm_leaf
> ext4_ext_truncate
> ext4_truncate
> ext4_evict_inode
> evict
> do_unlinkat
> stack_done
>
> --
> Regards.
>
> <report.txt.xz>
> <mballoc.dump.xz>
On Tue, Aug 05, 2014 at 12:55:14AM +0200, Andreas Dilger wrote:
> It would be possible to optimize mb_free_blocks() by having it
> clear a whole word at a time instead of a series if bits.
It looks like we're doing this already in mb_test_and_clear_bits(),
aren't we?
> I thought that was done already, but it doesn't appear to be the case.
> Also, it isn't clear that the bit "normalization" is needed anymore.
> This was done back in the aniceint times when the buddy bitmaps were stored on disk instead of being regenerated only at mount time.
I'm not sure what you mean by this; the only reference I can find
normalization is with normalizing requests?
- Ted
On 05/08/2014 00:55, Andreas Dilger wrote:
> It would be possible to optimize mb_free_blocks() by having it
> clear a whole word at a time instead of a series of bits.
>
> I thought that was done already, but it doesn't appear to be the case.
> Also, it isn't clear that the bit "normalization" is needed anymore.
> This was done back in the ancient times when the buddy bitmaps were
> stored on disk instead of being regenerated only at mount time.
Are there any other tests you'd like me to run?
(I will be permanently losing access to this platform in a few days.)
--
Regards.
On Aug 5, 2014, at 4:33, Theodore Ts'o <[email protected]> wrote:
>
>> On Tue, Aug 05, 2014 at 12:55:14AM +0200, Andreas Dilger wrote:
>> It would be possible to optimize mb_free_blocks() by having it
>> clear a whole word at a time instead of a series if bits.
>
> It looks like we're doing this already in mb_test_and_clear_bits(),
> aren't we?
Sorry, I didn't see mb_test_and_clear_bits(), I was only looking at
mb_clear_bit() to see if it be the multi-bit optimization.
>> I thought that was done already, but it doesn't appear to be the case.
>> Also, it isn't clear that the bit "normalization" is needed anymore.
>> This was done back in the aniceint times when the buddy bitmaps were stored on disk instead of being regenerated only at mount time.
>
> I'm not sure what you mean by this; the only reference I can find
> normalization is with normalizing requests?
I meant mb_correct_addr_and_bit().
Cheers, Andreas