2005-01-10 21:30:58

by Mike Waychison

Subject: [patch] Per-sb s_all_inodes list

The following patch (against 2.6.10) adds a per-super_block list of all
inodes.

Releasing a super_block requires walking all inodes for the given
superblock and releasing them. Currently, inodes are found on one of
four lists:

- global list inode_in_use
- global list inode_unused
- per-sb ->s_dirty
- per-sb ->s_io

The second list, inode_unused, can potentially be quite large.
Unfortunately, it cannot be made per-sb, as it is the global LRU list
used for inode cache reduction under memory pressure.

When unmounting a single filesystem, profiling shows a dramatic amount
of time spent walking inode_unused. This becomes very noticeable when
unmounting a decently sized tree of filesystems.
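
To illustrate the cost, here is a simplified sketch of the kind of walk
invalidate_inodes() does today (paraphrased from the 2.6.10-era
fs/inode.c, not the verbatim source): every inode on each global list
is visited just to test whether it belongs to the superblock being torn
down.

-----------
#include <linux/fs.h>
#include <linux/list.h>

/*
 * Simplified sketch, paraphrased: to tear down one superblock, every
 * inode on a global list is visited just to check its ->i_sb.
 */
static int invalidate_list(struct list_head *head, struct super_block *sb,
			   struct list_head *dispose)
{
	struct list_head *pos, *next;
	int busy = 0;

	list_for_each_safe(pos, next, head) {
		struct inode *inode = list_entry(pos, struct inode, i_list);

		if (inode->i_sb != sb)
			continue;	/* the common, wasted case */
		/* ... move unused inodes to *dispose, else mark busy ... */
	}
	return busy;
}
-----------

With a large inode_unused list, the same global lists get rescanned end
to end for every single umount.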

The proposed solution is to create a new per-sb list that contains all
allocated inodes. It is maintained under the inode_lock for the sake of
simplicity, but this may prove unnecessary, and it may be better done
with another global or per-sb lock.

Unfortunately, this patch also adds another list_head to each struct
inode, but absent other suggestions, I don't see how else to do this.
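
For concreteness, a minimal sketch of the shape of the change follows.
Apart from s_all_inodes, the field and helper names here are
illustrative guesses, not names taken from the attached diff:

-----------
#include <linux/fs.h>
#include <linux/list.h>

/*
 * Illustrative sketch only; apart from s_all_inodes, the names
 * (i_sb_list, the helpers) are guesses, not from the attached diff.
 *
 * struct super_block grows:  struct list_head s_all_inodes;
 * struct inode grows:        struct list_head i_sb_list;
 */

/* Hooked where a new inode joins the global lists, under inode_lock. */
static inline void inode_add_to_sb_list(struct inode *inode)
{
	list_add(&inode->i_sb_list, &inode->i_sb->s_all_inodes);
}

/* Hooked at inode teardown, under inode_lock. */
static inline void inode_del_from_sb_list(struct inode *inode)
{
	list_del_init(&inode->i_sb_list);
}
-----------

Unmount can then walk only sb->s_all_inodes, so its cost is
proportional to the number of inodes belonging to that sb rather than
to the total size of the inode cache.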

The following script was used to profile:

-----------
#!/bin/sh

LOOP=100

# SETUP: create $LOOP tmpfs mounts, each containing 10 nested tmpfs mounts.
for test in `seq 1 $LOOP` ; do
	mkdir /tmp/test$test
	mount -t tmpfs test$test /tmp/test$test
	for i in `seq 1 10` ; do
		mkdir /tmp/test$test/$i
		mount -t tmpfs test$test-$i /tmp/test$test/$i
	done
done

# PROFILE: reset the profiling counters, lazily detach each mount tree,
# then dump per-function tick counts sorted by normalized load.
/usr/local/sbin/readprofile -r
for test in `seq 1 $LOOP` ; do
	umount -l /tmp/test$test
done
/usr/local/sbin/readprofile | sort -nr +2

# CLEANUP
for test in `seq 1 $LOOP` ; do
	rmdir /tmp/test$test
done
-----------

Before applying this patch, from a fresh boot:
-----------
4207 poll_idle 65.7344
3611 invalidate_list 20.5170
121 kmem_cache_free 0.9453
48 kfree 0.3000
15 kernel_map_pages 0.1339
25 seq_escape 0.1302
8 _atomic_dec_and_lock 0.1000
30 __d_path 0.0938
8 m_start 0.0625
5 poison_obj 0.0625
1 obj_dbghead 0.0625
15 d_path 0.0625
5 page_remove_rmap 0.0521
10 kmap_atomic 0.0521
13 seq_path 0.0508
4 seq_puts 0.0374
13 show_vfsmnt 0.0369
1 proc_follow_link 0.0312
1 fput 0.0312
3 change_page_attr 0.0268
1 system_call 0.0227
2 strnlen_user 0.0208
1 shmem_destroy_inode 0.0208
1 seq_putc 0.0208
11 do_anonymous_page 0.0202
2 pte_alloc_one 0.0179
2 handle_IRQ_event 0.0179
2 d_genocide 0.0179
2 sysenter_past_esp 0.0171
2 de_put 0.0156
1 page_waitqueue 0.0156
1 __mntput 0.0156
1 file_kill 0.0156
2 iput 0.0139
2 write_profile 0.0138
1 __vm_stat_account 0.0125
1 put_filesystem 0.0125
1 profile_handoff_task 0.0125
1 path_release 0.0125
1 name_to_int 0.0125
3 __pagevec_lru_add_active 0.0117
2 __copy_user_intel 0.0114
2 flush_tlb_page 0.0104
1 __user_walk 0.0104
1 find_vma_prev 0.0104
1 find_vma 0.0104
1 find_get_page 0.0104
1 fget 0.0104
1 d_lookup 0.0104
2 __do_softirq 0.0096
3 release_task 0.0089
1 set_page_dirty 0.0089
1 page_add_file_rmap 0.0089
8 do_wp_page 0.0083
2 __might_sleep 0.0083
1 sys_read 0.0078
3 release_pages 0.0075
7 copy_page_range 0.0072
3 clear_page_tables 0.0069
2 vfs_read 0.0069
1 zap_pmd_range 0.0069
1 sync_inodes_sb 0.0069
1 page_add_anon_rmap 0.0069
1 finish_task_switch 0.0069
1 filp_close 0.0069
6 zap_pte_range 0.0067
1 shrink_dcache_anon 0.0063
1 remove_vm_struct 0.0063
1 dup_task_struct 0.0063
1 deactivate_super 0.0063
5 do_no_page 0.0060
3 handle_mm_fault 0.0059
1 kmem_cache_alloc 0.0057
1 dnotify_flush 0.0057
1 generic_fillattr 0.0052
4 get_signal_to_deliver 0.0049
1 proc_lookup 0.0048
1 invalidate_inodes 0.0048
1 do_sync_read 0.0048
2 cache_alloc_debugcheck_after 0.0045
1 proc_pid_make_inode 0.0045
1 writeback_inodes 0.0042
1 scsi_end_request 0.0042
8 do_page_fault 0.0041
1 pte_alloc_map 0.0035
8299 total 0.0033
3 copy_mm 0.0032
1 old_mmap 0.0031
1 __d_lookup 0.0031
1 cap_vm_enough_memory 0.0031
2 sync_sb_inodes 0.0027
1 prio_tree_insert 0.0026
2 vma_adjust 0.0025
1 vfs_quota_sync 0.0023
1 check_poison_obj 0.0021
2 scsi_request_fn 0.0019
1 try_to_wake_up 0.0015
1 number 0.0015
3 exit_notify 0.0013
1 filemap_nopage 0.0011
2 link_path_walk 0.0006
1 do_mmap_pgoff 0.0005
-----------

Before applying the patch, but after a 'find / > /dev/null':
-----------
21489 poll_idle 335.7656
21820 invalidate_list 123.9773
164 kmem_cache_free 1.2812
95 _atomic_dec_and_lock 1.1875
110 de_put 0.8594
411 __sync_single_inode 0.8027
131 proc_lookup 0.6298
70 kfree 0.4375
153 sync_sb_inodes 0.2079
23 kernel_map_pages 0.2054
41 writeback_inodes 0.1708
39 __d_path 0.1219
14 m_start 0.1094
3 wake_up_inode 0.0938
9 seq_puts 0.0841
16 seq_escape 0.0833
18 d_path 0.0750
25 show_vfsmnt 0.0710
5 poison_obj 0.0625
1 zone_statistics 0.0625
13 scsi_end_request 0.0542
13 seq_path 0.0508
4 bit_waitqueue 0.0500
9 kmap_atomic 0.0469
2 system_call 0.0455
4 find_get_page 0.0417
6 kmem_cache_alloc 0.0341
1 writeback_acquire 0.0312
1 pmd_ctor 0.0312
4 finish_task_switch 0.0278
2 page_remove_rmap 0.0208
1 m_next 0.0208
1 find_task_by_pid_type 0.0208
1 eventpoll_init_file 0.0208
1 blkdev_writepage 0.0208
9 __mark_inode_dirty 0.0194
6 release_task 0.0179
2 change_page_attr 0.0179
44876 total 0.0177
3 __copy_user_intel 0.0170
9 do_anonymous_page 0.0165
1 pagevec_lookup_tag 0.0156
1 kmem_flagcheck 0.0156
2 find_get_pages_tag 0.0139
2 write_profile 0.0138
3 fn_hash_lookup 0.0125
2 dispose_list 0.0125
1 __read_page_state 0.0125
1 current_kernel_time 0.0125
2 idr_remove 0.0114
2 flush_tlb_page 0.0104
1 vm_acct_memory 0.0104
1 pid_base_iput 0.0104
1 find_vma 0.0104
1 bio_destructor 0.0104
10 do_wp_page 0.0104
5 handle_mm_fault 0.0098
3 generic_shutdown_super 0.0094
1 strncpy_from_user 0.0089
1 set_page_dirty 0.0089
1 prio_tree_replace 0.0089
17 do_page_fault 0.0087
1 sysenter_past_esp 0.0085
2 __might_sleep 0.0083
7 zap_pte_range 0.0078
1 unmap_page_range 0.0078
1 lru_cache_add_active 0.0078
1 flush_thread 0.0078
1 del_timer 0.0078
4 cache_reap 0.0076
2 sub_remove 0.0074
1 skb_release_data 0.0069
1 rt_hash_code 0.0069
1 page_add_anon_rmap 0.0069
1 iput 0.0069
2 __d_lookup 0.0063
1 follow_mount 0.0063
1 deactivate_super 0.0063
4 seq_read 0.0060
1 fsync_super 0.0052
5 copy_page_range 0.0051
2 release_pages 0.0050
4 do_no_page 0.0048
2 do_page_cache_readahead 0.0048
1 locks_remove_flock 0.0048
2 prune_dcache 0.0046
2 cache_alloc_debugcheck_after 0.0045
1 sys_mmap2 0.0045
1 shmem_delete_inode 0.0045
2 dput 0.0042
2 check_poison_obj 0.0042
1 rb_insert_color 0.0039
1 __pagevec_lru_add_active 0.0039
1 __find_get_block 0.0039
1 umount_tree 0.0037
1 vfs_read 0.0035
1 pte_alloc_map 0.0035
2 ahd_linux_isr 0.0033
1 generic_delete_inode 0.0031
1 cap_vm_enough_memory 0.0031
1 profile_hit 0.0030
1 mempool_alloc 0.0028
1 __fput 0.0028
1 path_lookup 0.0027
1 nf_hook_slow 0.0027
1 clear_inode 0.0027
2 get_signal_to_deliver 0.0025
1 exec_mmap 0.0024
2 mpage_writepages 0.0021
1 unmap_vmas 0.0017
1 try_to_wake_up 0.0015
1 udp_v4_mcast_deliver 0.0014
3 exit_notify 0.0013
1 vma_adjust 0.0012
1 copy_mm 0.0011
1 ext3_do_update_inode 0.0010
1 load_elf_binary 0.0003
-----------

After applying the patch, from a fresh boot:
-----------
614 poll_idle 9.5938
150 kmem_cache_free 1.1719
39 kfree 0.2437
26 kernel_map_pages 0.2321
21 seq_escape 0.1094
32 __d_path 0.1000
21 d_path 0.0875
15 kmap_atomic 0.0781
8 seq_puts 0.0748
17 seq_path 0.0664
5 _atomic_dec_and_lock 0.0625
4 __read_page_state 0.0500
17 show_vfsmnt 0.0483
6 m_start 0.0469
3 poison_obj 0.0375
1 fput 0.0312
3 pte_alloc_one 0.0268
3 page_add_file_rmap 0.0268
3 change_page_attr 0.0268
4 __copy_user_intel 0.0227
2 page_remove_rmap 0.0208
1 seq_putc 0.0208
2 sysenter_past_esp 0.0171
1 __wake_up_bit 0.0156
5 release_task 0.0149
2 sys_readlink 0.0139
2 write_profile 0.0138
7 handle_mm_fault 0.0137
7 do_anonymous_page 0.0129
5 release_pages 0.0125
1 mounts_release 0.0125
1 inode_times_differ 0.0125
1 current_kernel_time 0.0125
21 do_page_fault 0.0108
2 flush_tlb_page 0.0104
1 strlcpy 0.0104
1 strncpy_from_user 0.0089
1 prio_tree_replace 0.0089
7 do_no_page 0.0084
8 do_wp_page 0.0083
1 __copy_to_user_ll 0.0078
1 anon_vma_unlink 0.0078
2 pte_alloc_map 0.0069
1 iput 0.0069
1 finish_task_switch 0.0069
1 filp_close 0.0069
1 d_rehash 0.0069
6 zap_pte_range 0.0067
2 setup_sigcontext 0.0063
2 __d_lookup 0.0063
1 sched_migrate_task 0.0063
1 dnotify_parent 0.0063
1 dispose_list 0.0063
2 free_hot_cold_page 0.0057
1 kmem_cache_alloc 0.0057
1 free_pages_and_swap_cache 0.0057
1 vma_prio_tree_add 0.0052
1 proc_lookup 0.0048
2 cache_alloc_debugcheck_after 0.0045
1 sigprocmask 0.0045
4 copy_page_range 0.0041
1 __pagevec_lru_add_active 0.0039
1 get_empty_filp 0.0039
1 expand_stack 0.0039
3 get_signal_to_deliver 0.0037
1 path_lookup 0.0027
1 vfs_quota_sync 0.0023
1 prune_dcache 0.0023
1 clear_page_tables 0.0023
2 filemap_nopage 0.0022
1 dput 0.0021
1 buffered_rmqueue 0.0018
1 unmap_vmas 0.0017
1 page_cache_readahead 0.0017
1 scsi_dispatch_cmd 0.0016
1 do_brk 0.0015
2 exit_notify 0.0009
1 do_wait 0.0008
1 do_mmap_pgoff 0.0005
1127 total 0.0004
1 link_path_walk 0.0003
-----------

After applying the patch, doing a fresh reboot and running 'find / >
/dev/null':
-----------
630 poll_idle 9.8438
167 kmem_cache_free 1.3047
52 kfree 0.3250
24 m_start 0.1875
16 kernel_map_pages 0.1429
40 __d_path 0.1250
23 seq_escape 0.1198
8 seq_puts 0.0748
13 kmap_atomic 0.0677
16 d_path 0.0667
5 _atomic_dec_and_lock 0.0625
16 seq_path 0.0625
5 page_remove_rmap 0.0521
18 show_vfsmnt 0.0511
4 poison_obj 0.0500
4 change_page_attr 0.0357
1 wake_up_inode 0.0312
1 fput 0.0312
14 do_anonymous_page 0.0257
8 release_task 0.0238
2 strnlen_user 0.0208
2 find_get_page 0.0208
1 seq_putc 0.0208
1 unlock_page 0.0156
1 mark_page_accessed 0.0156
2 page_add_anon_rmap 0.0139
2 finish_task_switch 0.0139
2 write_profile 0.0138
1 __read_page_state 0.0125
1 kill_anon_super 0.0125
1 flush_signal_handlers 0.0125
1 bit_waitqueue 0.0125
1 anon_vma_link 0.0125
23 do_page_fault 0.0118
2 __copy_user_intel 0.0114
10 zap_pte_range 0.0112
1 vm_acct_memory 0.0104
1 find_vma 0.0104
10 do_wp_page 0.0104
2 shmem_delete_inode 0.0089
1 strncpy_from_user 0.0089
1 set_page_dirty 0.0089
1 page_add_file_rmap 0.0089
1 bad_range 0.0089
1 sysenter_past_esp 0.0085
2 __pagevec_lru_add_active 0.0078
1 invalidate_inodes 0.0078
1 __insert_vm_struct 0.0078
1 fget_light 0.0078
3 release_pages 0.0075
1 lookup_mnt 0.0069
1 destroy_inode 0.0069
3 dput 0.0063
1 remove_vm_struct 0.0063
1 deactivate_super 0.0063
5 do_no_page 0.0060
3 handle_mm_fault 0.0059
1 kmem_cache_alloc 0.0057
1 generic_fillattr 0.0052
1 flush_tlb_page 0.0052
5 copy_page_range 0.0051
2 do_page_cache_readahead 0.0048
1 proc_lookup 0.0048
2 prune_dcache 0.0046
2 clear_page_tables 0.0046
2 cache_alloc_debugcheck_after 0.0045
1 elf_map 0.0045
1 vma_link 0.0042
1 __might_sleep 0.0042
1 cp_new_stat64 0.0039
3 get_signal_to_deliver 0.0037
1 __d_lookup 0.0031
2 seq_read 0.0030
1 proc_get_inode 0.0027
1 alloc_inode 0.0025
2 copy_mm 0.0021
1 copy_strings 0.0018
1 do_generic_mapping_read 0.0010
1 create_elf_tables 0.0010
2 exit_notify 0.0009
2 link_path_walk 0.0006
1 do_mmap_pgoff 0.0005
1200 total 0.0005
1 load_elf_binary 0.0003
-----------

From the profile data above, it would appear that this patch lets
unmounting scale independently of the total size of the inode caches.

Please consider applying. Thanks,

--
Mike Waychison
Sun Microsystems, Inc.
1 (650) 352-5299 voice
1 (416) 202-8336 voice



Attachments:
per_super_block_inode_list.diff (4.29 kB)

2005-01-11 01:32:32

by William Lee Irwin III

Subject: Re: [patch] Per-sb s_all_inodes list

On Mon, Jan 10, 2005 at 04:19:24PM -0500, Mike Waychison wrote:
> Releasing a super_block requires walking all inodes for the given
> superblock and releasing them. Currently, inodes are found on one of
> four lists:
[...]
> The second list, inode_unused, can potentially be quite large.
> Unfortunately, it cannot be made per-sb, as it is the global LRU list
> used for inode cache reduction under memory pressure.
> When unmounting a single filesystem, profiling shows a dramatic amount
> of time spent walking inode_unused. This becomes very noticeable when
> unmounting a decently sized tree of filesystems.
> The proposed solution is to create a new per-sb list that contains all
> allocated inodes. It is maintained under the inode_lock for the sake of
> simplicity, but this may prove unnecessary, and it may be better done
> with another global or per-sb lock.

I thought this was a good idea myself a number of months ago, when I
saw a patch from Kirill Korotaev implementing this for 2.4.x, so I
ported that code to 2.6.x and it was merged into -mm at the time. That
patch was merged into Linus' bk shortly after 2.6.10. Could you check
Linus' bk to see whether what made it there resolves the issue as well
as your patch does?

-- wli

2005-01-11 02:15:15

by Mike Waychison

Subject: Re: [patch] Per-sb s_all_inodes list

William Lee Irwin III wrote:
> On Mon, Jan 10, 2005 at 04:19:24PM -0500, Mike Waychison wrote:
>
>>Releasing a super_block requires walking all inodes for the given
>>superblock and releasing them. Currently, inodes are found on one of
>>four lists:
>
> [...]
>
>>The second list, inode_unused, can potentially be quite large.
>>Unfortunately, it cannot be made per-sb, as it is the global LRU list
>>used for inode cache reduction under memory pressure.
>>When unmounting a single filesystem, profiling shows a dramatic amount
>>of time spent walking inode_unused. This becomes very noticeable when
>>unmounting a decently sized tree of filesystems.
>>The proposed solution is to create a new per-sb list that contains all
>>allocated inodes. It is maintained under the inode_lock for the sake of
>>simplicity, but this may prove unnecessary, and it may be better done
>>with another global or per-sb lock.
>
>
> I thought this was a good idea myself a number of months ago, when I
> saw a patch from Kirill Korotaev implementing this for 2.4.x, so I
> ported that code to 2.6.x and it was merged into -mm at the time. That
> patch was merged into Linus' bk shortly after 2.6.10. Could you check
> Linus' bk to see whether what made it there resolves the issue as well
> as your patch does?
>

Excellent. I eyeballed the patch on bkbits.net, and it does exactly
what my patch does.

Thanks,

--
Mike Waychison
Sun Microsystems, Inc.
1 (650) 352-5299 voice
1 (416) 202-8336 voice
