2010-07-22 19:01:11

by Nick Piggin

Subject: VFS scalability git tree

I'm pleased to announce I have a git tree up of my vfs scalability work.

git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git

Branch vfs-scale-working

The really interesting new item is the store-free path walk (43fe2b),
which I've re-introduced. It has had a complete redesign: it has much
better performance and scalability in more cases, and is actually sane
code now.

What this does is allow parallel name lookups to walk down common
elements without any cacheline bouncing between them. It can walk
across many interesting cases such as mount points, back up '..', and
negative dentries of most filesystems. It does so without requiring any
atomic operations or any stores at all to shared data. This also makes
it very fast in serial performance (path walking is nearly twice as fast
on my Opteron).

In cases where it cannot continue the RCU walk (eg. dentry does not
exist), then it can in most cases take a reference on the farthest
element it has reached so far, and then continue on with a regular
refcount-based path walk. My first attempt at this simply dropped
everything and re-did the full refcount based walk.
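
For illustration, the read side of that pattern looks roughly like the
following. This is a minimal user-space sketch using C11 atomics; the names
and data layout are invented for illustration and do not match the actual
kernel code.

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct dentry_stub {
    atomic_uint d_seq;          /* odd while a rename/unlink is in progress */
    struct dentry_stub *child;  /* stands in for the hash lookup of the next name */
    int d_count;                /* refcount, only touched on the ref-walk path */
};

static unsigned seq_begin(struct dentry_stub *d)
{
    unsigned seq;
    while ((seq = atomic_load_explicit(&d->d_seq, memory_order_acquire)) & 1)
        ;                       /* writer active: wait for a stable snapshot */
    return seq;
}

static bool seq_retry(struct dentry_stub *d, unsigned seq)
{
    atomic_thread_fence(memory_order_acquire);
    return atomic_load_explicit(&d->d_seq, memory_order_relaxed) != seq;
}

/*
 * One rcu-walk step: advance *pos to the next path element with no atomic
 * read-modify-write operations and no stores to shared data.  Returns false
 * when the dentry changed underneath us (or the child is missing), in which
 * case the caller takes a reference on the last element it reached and
 * continues with a regular refcount-based walk.
 */
static bool rcu_walk_step(struct dentry_stub **pos)
{
    struct dentry_stub *parent = *pos;
    unsigned seq = seq_begin(parent);
    struct dentry_stub *next = parent->child;

    if (next == NULL || seq_retry(parent, seq))
        return false;           /* bail out: fall back to ref-walk */
    *pos = next;
    return true;
}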

I've also been working on stress testing, bug fixing, cutting down
'XXX'es, and improving changelogs and comments.

Most filesystems are untested (it's too large a job to do comprehensive
stress tests on everything), but none have known issues (except nilfs2).
Ext2/3, nfs, nfsd, and ram based filesystems seem to work well,
ext4/btrfs/xfs/autofs4 have had light testing.

I've never had filesystem corruption when testing these patches (only
lockups or other bugs). But standard disclaimer: they may eat your data.

Summary of a few numbers I've run. Google's socket teardown workload
runs 3-4x faster on my 2 socket Opteron. Single thread git diff runs 20%
faster on the same machine. A 32 node Altix runs dbench on ramfs 150x faster
(100MB/s up to 15GB/s).

At this point, I would be very interested in reviewing, correctness
testing on different configurations, and of course benchmarking.

Thanks,
Nick


2010-07-23 11:13:32

by Dave Chinner

Subject: Re: VFS scalability git tree

On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> I'm pleased to announce I have a git tree up of my vfs scalability work.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
>
> Branch vfs-scale-working

I've got a couple of patches needed to build XFS - the shrinker
merge left some bad fragments - I'll post them in a minute. This
email is for the longest ever lockdep warning I've seen, which
occurred on boot.

Cheers,

Dave.

[ 6.368707] ======================================================
[ 6.369773] [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
[ 6.370379] 2.6.35-rc5-dgc+ #58
[ 6.370882] ------------------------------------------------------
[ 6.371475] pmcd/2124 [HC0[0]:SC0[1]:HE1:SE0] is trying to acquire:
[ 6.372062] (&sb->s_type->i_lock_key#6){+.+...}, at: [<ffffffff81736f8c>] socket_get_id+0x3c/0x60
[ 6.372268]
[ 6.372268] and this task is already holding:
[ 6.372268] (&(&hashinfo->ehash_locks[i])->rlock){+.-...}, at: [<ffffffff81791750>] established_get_first+0x60/0x120
[ 6.372268] which would create a new lock dependency:
[ 6.372268] (&(&hashinfo->ehash_locks[i])->rlock){+.-...} -> (&sb->s_type->i_lock_key#6){+.+...}
[ 6.372268]
[ 6.372268] but this new dependency connects a SOFTIRQ-irq-safe lock:
[ 6.372268] (&(&hashinfo->ehash_locks[i])->rlock){+.-...}
[ 6.372268] ... which became SOFTIRQ-irq-safe at:
[ 6.372268] [<ffffffff810b3b26>] __lock_acquire+0x576/0x1450
[ 6.372268] [<ffffffff810b4aa6>] lock_acquire+0xa6/0x160
[ 6.372268] [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70
[ 6.372268] [<ffffffff8177a1ba>] __inet_hash_nolisten+0xfa/0x180
[ 6.372268] [<ffffffff8179392a>] tcp_v4_syn_recv_sock+0x1aa/0x2d0
[ 6.372268] [<ffffffff81795502>] tcp_check_req+0x202/0x440
[ 6.372268] [<ffffffff817948c4>] tcp_v4_do_rcv+0x304/0x4f0
[ 6.372268] [<ffffffff81795134>] tcp_v4_rcv+0x684/0x7e0
[ 6.372268] [<ffffffff81771512>] ip_local_deliver+0xe2/0x1c0
[ 6.372268] [<ffffffff81771af7>] ip_rcv+0x397/0x760
[ 6.372268] [<ffffffff8174d067>] __netif_receive_skb+0x277/0x330
[ 6.372268] [<ffffffff8174d1f4>] process_backlog+0xd4/0x1e0
[ 6.372268] [<ffffffff8174dc38>] net_rx_action+0x188/0x2b0
[ 6.372268] [<ffffffff81084cc2>] __do_softirq+0xd2/0x260
[ 6.372268] [<ffffffff81035edc>] call_softirq+0x1c/0x50
[ 6.372268] [<ffffffff8108551b>] local_bh_enable_ip+0xeb/0xf0
[ 6.372268] [<ffffffff8182c544>] _raw_spin_unlock_bh+0x34/0x40
[ 6.372268] [<ffffffff8173c59e>] release_sock+0x14e/0x1a0
[ 6.372268] [<ffffffff817a3975>] inet_stream_connect+0x75/0x320
[ 6.372268] [<ffffffff81737917>] sys_connect+0xa7/0xc0
[ 6.372268] [<ffffffff81034ff2>] system_call_fastpath+0x16/0x1b
[ 6.372268]
[ 6.372268] to a SOFTIRQ-irq-unsafe lock:
[ 6.372268] (&sb->s_type->i_lock_key#6){+.+...}
[ 6.372268] ... which became SOFTIRQ-irq-unsafe at:
[ 6.372268] ... [<ffffffff810b3b73>] __lock_acquire+0x5c3/0x1450
[ 6.372268] [<ffffffff810b4aa6>] lock_acquire+0xa6/0x160
[ 6.372268] [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70
[ 6.372268] [<ffffffff8116af72>] new_inode+0x52/0xd0
[ 6.372268] [<ffffffff81174a40>] get_sb_pseudo+0xb0/0x180
[ 6.372268] [<ffffffff81735a41>] sockfs_get_sb+0x21/0x30
[ 6.372268] [<ffffffff81152dba>] vfs_kern_mount+0x8a/0x1e0
[ 6.372268] [<ffffffff81152f29>] kern_mount_data+0x19/0x20
[ 6.372268] [<ffffffff81e1c075>] sock_init+0x4e/0x59
[ 6.372268] [<ffffffff810001dc>] do_one_initcall+0x3c/0x1a0
[ 6.372268] [<ffffffff81de5767>] kernel_init+0x17a/0x204
[ 6.372268] [<ffffffff81035de4>] kernel_thread_helper+0x4/0x10
[ 6.372268]
[ 6.372268] other info that might help us debug this:
[ 6.372268]
[ 6.372268] 3 locks held by pmcd/2124:
[ 6.372268] #0: (&p->lock){+.+.+.}, at: [<ffffffff81171dae>] seq_read+0x3e/0x430
[ 6.372268] #1: (&(&hashinfo->ehash_locks[i])->rlock){+.-...}, at: [<ffffffff81791750>] established_get_first+0x60/0x120
[ 6.372268] #2: (clock-AF_INET){++....}, at: [<ffffffff8173b6ae>] sock_i_ino+0x2e/0x70
[ 6.372268]
[ 6.372268] the dependencies between SOFTIRQ-irq-safe lock and the holding lock:
[ 6.372268] -> (&(&hashinfo->ehash_locks[i])->rlock){+.-...} ops: 3 {
[ 6.372268] HARDIRQ-ON-W at:
[ 6.372268] [<ffffffff810b3b47>] __lock_acquire+0x597/0x1450
[ 6.372268] [<ffffffff810b4aa6>] lock_acquire+0xa6/0x160
[ 6.372268] [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70
[ 6.372268] [<ffffffff8177a1ba>] __inet_hash_nolisten+0xfa/0x180
[ 6.372268] [<ffffffff8177ab6a>] __inet_hash_connect+0x33a/0x3d0
[ 6.372268] [<ffffffff8177ac4f>] inet_hash_connect+0x4f/0x60
[ 6.372268] [<ffffffff81792522>] tcp_v4_connect+0x272/0x4f0
[ 6.372268] [<ffffffff817a3b8e>] inet_stream_connect+0x28e/0x320
[ 6.372268] [<ffffffff81737917>] sys_connect+0xa7/0xc0
[ 6.372268] [<ffffffff81034ff2>] system_call_fastpath+0x16/0x1b
[ 6.372268] IN-SOFTIRQ-W at:
[ 6.372268] [<ffffffff810b3b26>] __lock_acquire+0x576/0x1450
[ 6.372268] [<ffffffff810b4aa6>] lock_acquire+0xa6/0x160
[ 6.372268] [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70
[ 6.372268] [<ffffffff8177a1ba>] __inet_hash_nolisten+0xfa/0x180
[ 6.372268] [<ffffffff8179392a>] tcp_v4_syn_recv_sock+0x1aa/0x2d0
[ 6.372268] [<ffffffff81795502>] tcp_check_req+0x202/0x440
[ 6.372268] [<ffffffff817948c4>] tcp_v4_do_rcv+0x304/0x4f0
[ 6.372268] [<ffffffff81795134>] tcp_v4_rcv+0x684/0x7e0
[ 6.372268] [<ffffffff81771512>] ip_local_deliver+0xe2/0x1c0
[ 6.372268] [<ffffffff81771af7>] ip_rcv+0x397/0x760
[ 6.372268] [<ffffffff8174d067>] __netif_receive_skb+0x277/0x330
[ 6.372268] [<ffffffff8174d1f4>] process_backlog+0xd4/0x1e0
[ 6.372268] [<ffffffff8174dc38>] net_rx_action+0x188/0x2b0
[ 6.372268] [<ffffffff81084cc2>] __do_softirq+0xd2/0x260
[ 6.372268] [<ffffffff81035edc>] call_softirq+0x1c/0x50
[ 6.372268] [<ffffffff8108551b>] local_bh_enable_ip+0xeb/0xf0
[ 6.372268] [<ffffffff8182c544>] _raw_spin_unlock_bh+0x34/0x40
[ 6.372268] [<ffffffff8173c59e>] release_sock+0x14e/0x1a0
[ 6.372268] [<ffffffff817a3975>] inet_stream_connect+0x75/0x320
[ 6.372268] [<ffffffff81737917>] sys_connect+0xa7/0xc0
[ 6.372268] [<ffffffff81034ff2>] system_call_fastpath+0x16/0x1b
[ 6.372268] INITIAL USE at:
[ 6.372268] [<ffffffff810b37e2>] __lock_acquire+0x232/0x1450
[ 6.372268] [<ffffffff810b4aa6>] lock_acquire+0xa6/0x160
[ 6.372268] [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70
[ 6.372268] [<ffffffff8177a1ba>] __inet_hash_nolisten+0xfa/0x180
[ 6.372268] [<ffffffff8177ab6a>] __inet_hash_connect+0x33a/0x3d0
[ 6.372268] [<ffffffff8177ac4f>] inet_hash_connect+0x4f/0x60
[ 6.372268] [<ffffffff81792522>] tcp_v4_connect+0x272/0x4f0
[ 6.372268] [<ffffffff817a3b8e>] inet_stream_connect+0x28e/0x320
[ 6.372268] [<ffffffff81737917>] sys_connect+0xa7/0xc0
[ 6.372268] [<ffffffff81034ff2>] system_call_fastpath+0x16/0x1b
[ 6.372268] }
[ 6.372268] ... key at: [<ffffffff8285ddf8>] __key.47027+0x0/0x8
[ 6.372268] ... acquired at:
[ 6.372268] [<ffffffff810b2940>] check_irq_usage+0x60/0xf0
[ 6.372268] [<ffffffff810b41ff>] __lock_acquire+0xc4f/0x1450
[ 6.372268] [<ffffffff810b4aa6>] lock_acquire+0xa6/0x160
[ 6.372268] [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70
[ 6.372268] [<ffffffff81736f8c>] socket_get_id+0x3c/0x60
[ 6.372268] [<ffffffff8173b6c3>] sock_i_ino+0x43/0x70
[ 6.372268] [<ffffffff81790fc9>] tcp4_seq_show+0x1a9/0x520
[ 6.372268] [<ffffffff81172005>] seq_read+0x295/0x430
[ 6.372268] [<ffffffff811ad9f4>] proc_reg_read+0x84/0xc0
[ 6.372268] [<ffffffff81150165>] vfs_read+0xb5/0x170
[ 6.372268] [<ffffffff81150274>] sys_read+0x54/0x90
[ 6.372268] [<ffffffff81034ff2>] system_call_fastpath+0x16/0x1b
[ 6.372268]
[ 6.372268]
[ 6.372268] the dependencies between the lock to be acquired and SOFTIRQ-irq-unsafe lock:
[ 6.372268] -> (&sb->s_type->i_lock_key#6){+.+...} ops: 1185 {
[ 6.372268] HARDIRQ-ON-W at:
[ 6.372268] [<ffffffff810b3b47>] __lock_acquire+0x597/0x1450
[ 6.372268] [<ffffffff810b4aa6>] lock_acquire+0xa6/0x160
[ 6.372268] [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70
[ 6.372268] [<ffffffff8116af72>] new_inode+0x52/0xd0
[ 6.372268] [<ffffffff81174a40>] get_sb_pseudo+0xb0/0x180
[ 6.372268] [<ffffffff81735a41>] sockfs_get_sb+0x21/0x30
[ 6.372268] [<ffffffff81152dba>] vfs_kern_mount+0x8a/0x1e0
[ 6.372268] [<ffffffff81152f29>] kern_mount_data+0x19/0x20
[ 6.372268] [<ffffffff81e1c075>] sock_init+0x4e/0x59
[ 6.372268] [<ffffffff810001dc>] do_one_initcall+0x3c/0x1a0
[ 6.372268] [<ffffffff81de5767>] kernel_init+0x17a/0x204
[ 6.372268] [<ffffffff81035de4>] kernel_thread_helper+0x4/0x10
[ 6.372268] SOFTIRQ-ON-W at:
[ 6.372268] [<ffffffff810b3b73>] __lock_acquire+0x5c3/0x1450
[ 6.372268] [<ffffffff810b4aa6>] lock_acquire+0xa6/0x160
[ 6.372268] [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70
[ 6.372268] [<ffffffff8116af72>] new_inode+0x52/0xd0
[ 6.372268] [<ffffffff81174a40>] get_sb_pseudo+0xb0/0x180
[ 6.372268] [<ffffffff81735a41>] sockfs_get_sb+0x21/0x30
[ 6.372268] [<ffffffff81152dba>] vfs_kern_mount+0x8a/0x1e0
[ 6.372268] [<ffffffff81152f29>] kern_mount_data+0x19/0x20
[ 6.372268] [<ffffffff81e1c075>] sock_init+0x4e/0x59
[ 6.372268] [<ffffffff810001dc>] do_one_initcall+0x3c/0x1a0
[ 6.372268] [<ffffffff81de5767>] kernel_init+0x17a/0x204
[ 6.372268] [<ffffffff81035de4>] kernel_thread_helper+0x4/0x10
[ 6.372268] INITIAL USE at:
[ 6.372268] [<ffffffff810b37e2>] __lock_acquire+0x232/0x1450
[ 6.372268] [<ffffffff810b4aa6>] lock_acquire+0xa6/0x160
[ 6.372268] [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70
[ 6.372268] [<ffffffff8116af72>] new_inode+0x52/0xd0
[ 6.372268] [<ffffffff81174a40>] get_sb_pseudo+0xb0/0x180
[ 6.372268] [<ffffffff81735a41>] sockfs_get_sb+0x21/0x30
[ 6.372268] [<ffffffff81152dba>] vfs_kern_mount+0x8a/0x1e0
[ 6.372268] [<ffffffff81152f29>] kern_mount_data+0x19/0x20
[ 6.372268] [<ffffffff81e1c075>] sock_init+0x4e/0x59
[ 6.372268] [<ffffffff810001dc>] do_one_initcall+0x3c/0x1a0
[ 6.372268] [<ffffffff81de5767>] kernel_init+0x17a/0x204
[ 6.372268] [<ffffffff81035de4>] kernel_thread_helper+0x4/0x10
[ 6.372268] }
[ 6.372268] ... key at: [<ffffffff81bd5bd8>] sock_fs_type+0x58/0x80
[ 6.372268] ... acquired at:
[ 6.372268] [<ffffffff810b2940>] check_irq_usage+0x60/0xf0
[ 6.372268] [<ffffffff810b41ff>] __lock_acquire+0xc4f/0x1450
[ 6.372268] [<ffffffff810b4aa6>] lock_acquire+0xa6/0x160
[ 6.372268] [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70
[ 6.372268] [<ffffffff81736f8c>] socket_get_id+0x3c/0x60
[ 6.372268] [<ffffffff8173b6c3>] sock_i_ino+0x43/0x70
[ 6.372268] [<ffffffff81790fc9>] tcp4_seq_show+0x1a9/0x520
[ 6.372268] [<ffffffff81172005>] seq_read+0x295/0x430
[ 6.372268] [<ffffffff811ad9f4>] proc_reg_read+0x84/0xc0
[ 6.372268] [<ffffffff81150165>] vfs_read+0xb5/0x170
[ 6.372268] [<ffffffff81150274>] sys_read+0x54/0x90
[ 6.372268] [<ffffffff81034ff2>] system_call_fastpath+0x16/0x1b
[ 6.372268]
[ 6.372268]
[ 6.372268] stack backtrace:
[ 6.372268] Pid: 2124, comm: pmcd Not tainted 2.6.35-rc5-dgc+ #58
[ 6.372268] Call Trace:
[ 6.372268] [<ffffffff810b28d9>] check_usage+0x499/0x4a0
[ 6.372268] [<ffffffff810b24c6>] ? check_usage+0x86/0x4a0
[ 6.372268] [<ffffffff810af729>] ? __bfs+0x129/0x260
[ 6.372268] [<ffffffff810b2940>] check_irq_usage+0x60/0xf0
[ 6.372268] [<ffffffff810b41ff>] __lock_acquire+0xc4f/0x1450
[ 6.372268] [<ffffffff810b4aa6>] lock_acquire+0xa6/0x160
[ 6.372268] [<ffffffff81736f8c>] ? socket_get_id+0x3c/0x60
[ 6.372268] [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70
[ 6.372268] [<ffffffff81736f8c>] ? socket_get_id+0x3c/0x60
[ 6.372268] [<ffffffff81736f8c>] socket_get_id+0x3c/0x60
[ 6.372268] [<ffffffff8173b6c3>] sock_i_ino+0x43/0x70
[ 6.372268] [<ffffffff81790fc9>] tcp4_seq_show+0x1a9/0x520
[ 6.372268] [<ffffffff81791750>] ? established_get_first+0x60/0x120
[ 6.372268] [<ffffffff8182beb7>] ? _raw_spin_lock_bh+0x67/0x70
[ 6.372268] [<ffffffff81172005>] seq_read+0x295/0x430
[ 6.372268] [<ffffffff81171d70>] ? seq_read+0x0/0x430
[ 6.372268] [<ffffffff811ad9f4>] proc_reg_read+0x84/0xc0
[ 6.372268] [<ffffffff81150165>] vfs_read+0xb5/0x170
[ 6.372268] [<ffffffff81150274>] sys_read+0x54/0x90
[ 6.372268] [<ffffffff81034ff2>] system_call_fastpath+0x16/0x1b

--
Dave Chinner
[email protected]

2010-07-23 11:17:51

by Christoph Hellwig

Subject: Re: VFS scalability git tree

I might sound like a broken record, but if you want to make forward
progress with this split it into smaller series.

What would be useful for example would be one series each to split
the global inode_lock and dcache_lock, without introducing all the
fancy new locking primitives, per-bucket locks and lru schemes for
a start.

2010-07-23 13:55:33

by Dave Chinner

Subject: Re: VFS scalability git tree

On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> I'm pleased to announce I have a git tree up of my vfs scalability work.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
>
> Branch vfs-scale-working

Bugs I've noticed so far:

- Using XFS, the existing vfs inode count statistic does not decrease
as inodes are freed.
- the existing vfs dentry count remains at zero
- the existing vfs free inode count remains at zero

$ pminfo -f vfs.inodes vfs.dentry

vfs.inodes.count
value 7472612

vfs.inodes.free
value 0

vfs.dentry.count
value 0

vfs.dentry.free
value 0


Performance Summary:

With lockdep and CONFIG_XFS_DEBUG enabled, a 16 thread parallel
sequential create/unlink workload on an 8p/4GB RAM VM with a virtio
block device sitting on a short-stroked 12x2TB SAS array w/ 512MB
BBWC in RAID0 via dm and using the noop elevator in the guest VM:

$ sudo mkfs.xfs -f -l size=128m -d agcount=16 /dev/vdb
meta-data=/dev/vdb isize=256 agcount=16, agsize=1638400 blks
= sectsz=512 attr=2
data = bsize=4096 blocks=26214400, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal log bsize=4096 blocks=32768, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
$ sudo mount -o delaylog,logbsize=262144,nobarrier /dev/vdb /mnt/scratch
$ sudo chmod 777 /mnt/scratch
$ cd ~/src/fs_mark-3.3/
$ ./fs_mark -S0 -n 500000 -s 0 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/3 -d /mnt/scratch/2 -d /mnt/scratch/4 -d /mnt/scratch/5 -d /mnt/scratch/6 -d /mnt/scratch/7 -d /mnt/scratch/8 -d /mnt/scratch/9 -d /mnt/scratch/10 -d /mnt/scratch/11 -d /mnt/scratch/12 -d /mnt/scratch/13 -d /mnt/scratch/14 -d /mnt/scratch/15

files/s
2.6.34-rc4 12550
2.6.35-rc5+scale 12285

So it's the same within the error margins of the benchmark.

Screenshot of monitoring graphs - you can see the effect of the
broken stats:

http://userweb.kernel.org/~dgc/shrinker-2.6.36/fs_mark-2.6.35-rc4-16x500-xfs.png
http://userweb.kernel.org/~dgc/shrinker-2.6.36/fs_mark-2.6.35-rc5-npiggin-scale-lockdep-16x500-xfs.png

With a production build (i.e. no lockdep, no xfs debug), I'll
run the same fs_mark parallel create/unlink workload to show
scalability as I ran here:

http://oss.sgi.com/archives/xfs/2010-05/msg00329.html

The numbers can't be directly compared, but the test and the setup
is the same. The XFS numbers below are with delayed logging
enabled. ext4 is using default mkfs and mount parameters except for
barrier=0. All numbers are averages of three runs.

fs_mark rate (thousands of files/second)
               2.6.35-rc5          2.6.35-rc5-scale
threads      xfs     ext4         xfs     ext4
  1           20      39           20      39
  2           35      55           35      57
  4           60      41           57      42
  8           79       9           75       9

ext4 is getting IO bound at more than 2 threads, so apart from
pointing out that XFS is 8-9x faster than ext4 at 8 thread, I'm
going to ignore ext4 for the purposes of testing scalability here.

For XFS w/ delayed logging, 2.6.35-rc5 is only getting to about 600%
CPU and with Nick's patches it's about 650% (10% higher) for
slightly lower throughput. So at this class of machine for this
workload, the changes result in a slight reduction in scalability.

I looked at dbench on XFS as well, but didn't see any significant
change in the numbers at up to 200 load threads, so not much to
talk about there.

Sometime over the weekend I'll build a 16p VM and see what I get
from that...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-07-23 14:04:26

by Dave Chinner

Subject: [PATCH 1/2] xfs: fix shrinker build

From: Dave Chinner <[email protected]>

Remove the stray mount list lock reference from the shrinker code.

Signed-off-by: Dave Chinner <[email protected]>
---
fs/xfs/linux-2.6/xfs_sync.c | 5 +----
1 files changed, 1 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index 7a5a368..05426bf 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -916,10 +916,8 @@ xfs_reclaim_inode_shrink(

done:
nr = shrinker_do_scan(&nr_to_scan, SHRINK_BATCH);
- if (!nr) {
- up_read(&xfs_mount_list_lock);
+ if (!nr)
return 0;
- }
xfs_inode_ag_iterator(mp, xfs_reclaim_inode, 0,
XFS_ICI_RECLAIM_TAG, 1, &nr);
/* if we don't exhaust the scan, don't bother coming back */
@@ -935,7 +933,6 @@ xfs_inode_shrinker_register(
struct xfs_mount *mp)
{
mp->m_inode_shrink.shrink = xfs_reclaim_inode_shrink;
- mp->m_inode_shrink.seeks = DEFAULT_SEEKS;
register_shrinker(&mp->m_inode_shrink);
}

--
1.7.1

2010-07-23 14:04:33

by Dave Chinner

Subject: [PATCH 0/2] vfs scalability tree fixes

Nick,

Here's the fixes I applied to your tree to make the XFS inode cache
shrinker build and scan sanely.

Cheers,

Dave.

2010-07-23 14:04:45

by Dave Chinner

Subject: [PATCH 2/2] xfs: shrinker should use a per-filesystem scan count

From: Dave Chinner <[email protected]>

The shrinker uses a global static to aggregate excess scan counts.
This should be per-filesystem, like all the other shrinker context, for
it to operate correctly.

Signed-off-by: Dave Chinner <[email protected]>
---
fs/xfs/linux-2.6/xfs_sync.c | 5 ++---
fs/xfs/xfs_mount.h | 1 +
2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index 05426bf..b0e6296 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -893,7 +893,6 @@ xfs_reclaim_inode_shrink(
unsigned long global,
gfp_t gfp_mask)
{
- static unsigned long nr_to_scan;
int nr;
struct xfs_mount *mp;
struct xfs_perag *pag;
@@ -908,14 +907,14 @@ xfs_reclaim_inode_shrink(
nr_reclaimable += pag->pag_ici_reclaimable;
xfs_perag_put(pag);
}
- shrinker_add_scan(&nr_to_scan, scanned, global, nr_reclaimable,
+ shrinker_add_scan(&mp->m_shrink_scan_nr, scanned, global, nr_reclaimable,
DEFAULT_SEEKS);
if (!(gfp_mask & __GFP_FS)) {
return 0;
}

done:
- nr = shrinker_do_scan(&nr_to_scan, SHRINK_BATCH);
+ nr = shrinker_do_scan(&mp->m_shrink_scan_nr, SHRINK_BATCH);
if (!nr)
return 0;
xfs_inode_ag_iterator(mp, xfs_reclaim_inode, 0,
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 5761087..ed5531f 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -260,6 +260,7 @@ typedef struct xfs_mount {
__int64_t m_update_flags; /* sb flags we need to update
on the next remount,rw */
struct shrinker m_inode_shrink; /* inode reclaim shrinker */
+ unsigned long m_shrink_scan_nr; /* shrinker scan count */
} xfs_mount_t;

/*
--
1.7.1

2010-07-23 15:35:36

by Nick Piggin

Subject: Re: VFS scalability git tree

On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> I'm pleased to announce I have a git tree up of my vfs scalability work.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git

> Summary of a few numbers I've run. google's socket teardown workload
> runs 3-4x faster on my 2 socket Opteron. Single thread git diff runs 20%
> on same machine. 32 node Altix runs dbench on ramfs 150x faster (100MB/s
> up to 15GB/s).

The following post just contains some preliminary benchmark numbers on a
POWER7. Boring if you're not interested in this stuff.

IBM and Mikey kindly allowed me to do some test runs on a big POWER7
system today. Very is the only word I'm authorized to describe how big
is big. We tested the vfs-scale-working and master branches from my git
tree as of today. I'll stick with relative numbers to be safe. All
tests were run on ramfs.


First and very important is single threaded performance of basic code.
POWER7 is obviously vastly different from a Barcelona or Nehalem, and
store-free path walk uses a lot of seqlocks, which are cheap on x86 and a
little more expensive on others.

Test case          time difference, vanilla to vfs-scale (negative is better)
stat()             -10.8% +/- 0.3%
close(open())        4.3% +/- 0.3%
unlink(creat())     36.8% +/- 0.3%

stat is significantly faster which is really good.

open/close is a bit slower which we didn't get time to analyse. There
are one or two seqlock checks which might be avoided, which could make
up the difference. It's not horrible, but I hope to get POWER7
open/close more competitive (on x86 open/close is even a bit faster).

Note this is a worst case for rcu-path-walk: lookup of "./file", because
it has to take refcount on the final element. With more elements, rcu
walk should gain the advantage.

creat/unlink is showing the big RCU penalty. However, I have penciled
out a working design with Linus of how to do SLAB_DESTROY_BY_RCU.
It makes the store-free path walking and some inode RCU list
walking a little bit trickier, though, so I prefer not to pile too much
on at once. There is something that can be done if regressions show up.
I don't anticipate many regressions outside microbenchmarks, and this
is about the absolute worst case.


On to parallel tests. Firstly, the google socket workload.
Running with "NR_THREADS" children, vfs-scale patches do this:

root@p7ih06:~/google# time ./google --files_per_cpu 10000 > /dev/null
real 0m4.976s
user 8m38.925s
sys 6m45.236s

root@p7ih06:~/google# time ./google --files_per_cpu 20000 > /dev/null
real 0m7.816s
user 11m21.034s
sys 14m38.258s

root@p7ih06:~/google# time ./google --files_per_cpu 40000 > /dev/null
real 0m11.358s
user 11m37.955s
sys 28m44.911s

Reducing to NR_THREADS/4 children allows vanilla to complete:

root@p7ih06:~/google# time ./google --files_per_cpu 10000
real 1m23.118s
user 3m31.820s
sys 81m10.405s

I was actually surprised it did that well.


Dbench was an interesting one. We didn't manage to stretch the box's
legs, unfortunately! dbench with 1 proc gave about 500MB/s, 64 procs
gave 21GB/s, and at 128 throughput dropped dramatically. Turns out that
weird things start happening with the rename seqlock versus d_lookup, and
d_move contention (dbench does a sprinkle of renaming). That can be
improved I think, but it's not worth bothering with for the time being.

It's not really worth testing vanilla at high dbench parallelism.


Parallel git diff workload looked OK. It seemed to be scaling fine
in the vfs, but it hit a bottleneck in powerpc's tlb invalidation, so the
numbers may not be so interesting.


Lastly, some parallel syscall microbenchmarks:

procs vanilla vfs-scale
open-close, separate-cwd
1 384557.70 355923.82 op/s/proc
NR_CORES 86.63 164054.64 op/s/proc
NR_THREADS 18.68 (ouch!)

open-close, same-cwd
1 381074.32 339161.25
NR_CORES 104.16 107653.05

creat-unlink, separate-cwd
1 145891.05 104301.06
NR_CORES 29.81 10061.66

creat-unlink, same-cwd
1 129681.27 104301.06
NR_CORES 12.68 181.24

So we can see the single thread performance regressions here, but
the vanilla case really chokes at high CPU counts.

2010-07-23 15:42:27

by Nick Piggin

Subject: Re: VFS scalability git tree

On Fri, Jul 23, 2010 at 07:17:46AM -0400, Christoph Hellwig wrote:
> I might sound like a broken record, but if you want to make forward
> progress with this split it into smaller series.

No, I appreciate the advice. I put this tree up for people to fetch
without posting patches all the time. I think it is important to
test and to see the big picture when reviewing the patches, but you
are right about how to actually submit patches on the ML.


> What would be useful for example would be one series each to split
> the global inode_lock and dcache_lock, without introducing all the
> fancy new locking primitives, per-bucket locks and lru schemes for
> a start.

I've kept the series fairly well structured like that. Basically it
is in these parts:

1. files lock
2. vfsmount lock
3. mnt refcount
4a. put several new global spinlocks around different parts of dcache
4b. remove dcache_lock after the above protect everything
4c. start doing fine grained locking of hash, inode alias, lru, etc etc
5a, 5b, 5c. same for inodes
6. some further optimisations and cleanups
7. store-free path walking

This kind of sequence. I will again try to submit a first couple of
things to Al soon.

2010-07-23 15:51:24

by Nick Piggin

Subject: Re: VFS scalability git tree

On Fri, Jul 23, 2010 at 09:13:10PM +1000, Dave Chinner wrote:
> On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > I'm pleased to announce I have a git tree up of my vfs scalability work.
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> >
> > Branch vfs-scale-working
>
> I've got a couple of patches needed to build XFS - they shrinker
> merge left some bad fragments - I'll post them in a minute. This

OK cool.


> email is for the longest ever lockdep warning I've seen that
> occurred on boot.

Ah thanks. OK, that was one of my attempts to keep sockets from
hitting the vfs as much as possible (lazy inode number evaluation).
Not a big problem, but I'll drop the patch for now.

I have just got one for you too, btw :) (on a vanilla kernel, but it is
messing up my lockdep stress testing on xfs). Real or false positive?

[ INFO: possible circular locking dependency detected ]
2.6.35-rc5-00064-ga9f7f2e #334
-------------------------------------------------------
kswapd0/605 is trying to acquire lock:
(&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffff8125500c>]
xfs_ilock+0x7c/0xa0

but task is already holding lock:
(&xfs_mount_list_lock){++++.-}, at: [<ffffffff81281a76>]
xfs_reclaim_inode_shrink+0xc6/0x140

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #1 (&xfs_mount_list_lock){++++.-}:
[<ffffffff8106ef9a>] lock_acquire+0x5a/0x70
[<ffffffff815aa646>] _raw_spin_lock+0x36/0x50
[<ffffffff810fabf3>] try_to_free_buffers+0x43/0xb0
[<ffffffff812763b2>] xfs_vm_releasepage+0x92/0xe0
[<ffffffff810908ee>] try_to_release_page+0x2e/0x50
[<ffffffff8109ef56>] shrink_page_list+0x486/0x5a0
[<ffffffff8109f35d>] shrink_inactive_list+0x2ed/0x700
[<ffffffff8109fda0>] shrink_zone+0x3b0/0x460
[<ffffffff810a0f41>] try_to_free_pages+0x241/0x3a0
[<ffffffff810999e2>] __alloc_pages_nodemask+0x4c2/0x6b0
[<ffffffff810c52c6>] alloc_pages_current+0x76/0xf0
[<ffffffff8109205b>] __page_cache_alloc+0xb/0x10
[<ffffffff81092a2a>] find_or_create_page+0x4a/0xa0
[<ffffffff812780cc>] _xfs_buf_lookup_pages+0x14c/0x360
[<ffffffff81279122>] xfs_buf_get+0x72/0x160
[<ffffffff8126eb68>] xfs_trans_get_buf+0xc8/0xf0
[<ffffffff8124439f>] xfs_da_do_buf+0x3df/0x6d0
[<ffffffff81244825>] xfs_da_get_buf+0x25/0x30
[<ffffffff8124a076>] xfs_dir2_data_init+0x46/0xe0
[<ffffffff81247f89>] xfs_dir2_sf_to_block+0xb9/0x5a0
[<ffffffff812501c8>] xfs_dir2_sf_addname+0x418/0x5c0
[<ffffffff81247d7c>] xfs_dir_createname+0x14c/0x1a0
[<ffffffff81271d49>] xfs_create+0x449/0x5d0
[<ffffffff8127d802>] xfs_vn_mknod+0xa2/0x1b0
[<ffffffff8127d92b>] xfs_vn_create+0xb/0x10
[<ffffffff810ddc81>] vfs_create+0x81/0xd0
[<ffffffff810df1a5>] do_last+0x535/0x690
[<ffffffff810e11fd>] do_filp_open+0x21d/0x660
[<ffffffff810d16b4>] do_sys_open+0x64/0x140
[<ffffffff810d17bb>] sys_open+0x1b/0x20
[<ffffffff810023eb>] system_call_fastpath+0x16/0x1b

:-> #0 (&(&ip->i_lock)->mr_lock){++++--}:
[<ffffffff8106ef10>] __lock_acquire+0x1be0/0x1c10
[<ffffffff8106ef9a>] lock_acquire+0x5a/0x70
[<ffffffff8105dfba>] down_write_nested+0x4a/0x70
[<ffffffff8125500c>] xfs_ilock+0x7c/0xa0
[<ffffffff81280c98>] xfs_reclaim_inode+0x98/0x250
[<ffffffff81281824>] xfs_inode_ag_walk+0x74/0x120
[<ffffffff81281953>] xfs_inode_ag_iterator+0x83/0xe0
[<ffffffff81281aa4>] xfs_reclaim_inode_shrink+0xf4/0x140
[<ffffffff8109ff7d>] shrink_slab+0x12d/0x190
[<ffffffff810a07ad>] balance_pgdat+0x43d/0x6f0
[<ffffffff810a0b1e>] kswapd+0xbe/0x2a0
[<ffffffff810592ae>] kthread+0x8e/0xa0
[<ffffffff81003194>] kernel_thread_helper+0x4/0x10

other info that might help us debug this:

2 locks held by kswapd0/605:
#0: (shrinker_rwsem){++++..}, at: [<ffffffff8109fe88>]
shrink_slab+0x38/0x190
#1: (&xfs_mount_list_lock){++++.-}, at: [<ffffffff81281a76>]
xfs_reclaim_inode_shrink+0xc6/0x140

stack backtrace:
Pid: 605, comm: kswapd0 Not tainted 2.6.35-rc5-00064-ga9f7f2e #334
Call Trace:
[<ffffffff8106c5d9>] print_circular_bug+0xe9/0xf0
[<ffffffff8106ef10>] __lock_acquire+0x1be0/0x1c10
[<ffffffff8106e3c2>] ? __lock_acquire+0x1092/0x1c10
[<ffffffff8106ef9a>] lock_acquire+0x5a/0x70
[<ffffffff8125500c>] ? xfs_ilock+0x7c/0xa0
[<ffffffff8105dfba>] down_write_nested+0x4a/0x70
[<ffffffff8125500c>] ? xfs_ilock+0x7c/0xa0
[<ffffffff815ae795>] ? sub_preempt_count+0x95/0xd0
[<ffffffff8125500c>] xfs_ilock+0x7c/0xa0
[<ffffffff81280c98>] xfs_reclaim_inode+0x98/0x250
[<ffffffff81281824>] xfs_inode_ag_walk+0x74/0x120
[<ffffffff81280c00>] ? xfs_reclaim_inode+0x0/0x250
[<ffffffff81281953>] xfs_inode_ag_iterator+0x83/0xe0
[<ffffffff81280c00>] ? xfs_reclaim_inode+0x0/0x250
[<ffffffff81281aa4>] xfs_reclaim_inode_shrink+0xf4/0x140
[<ffffffff8109ff7d>] shrink_slab+0x12d/0x190
[<ffffffff810a07ad>] balance_pgdat+0x43d/0x6f0
[<ffffffff810a0b1e>] kswapd+0xbe/0x2a0
[<ffffffff81059700>] ? autoremove_wake_function+0x0/0x40
[<ffffffff815aaf3d>] ? _raw_spin_unlock_irqrestore+0x3d/0x70
[<ffffffff810a0a60>] ? kswapd+0x0/0x2a0
[<ffffffff810592ae>] kthread+0x8e/0xa0
[<ffffffff81003194>] kernel_thread_helper+0x4/0x10
[<ffffffff815ab400>] ? restore_args+0x0/0x30
[<ffffffff81059220>] ? kthread+0x0/0xa0

2010-07-23 16:09:33

by Nick Piggin

Subject: Re: [PATCH 0/2] vfs scalability tree fixes

On Sat, Jul 24, 2010 at 12:04:00AM +1000, Dave Chinner wrote:
> Nick,
>
> Here's the fixes I applied to your tree to make the XFS inode cache
> shrinker build and scan sanely.

Thanks for these Dave

2010-07-23 16:16:20

by Nick Piggin

Subject: Re: VFS scalability git tree

On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote:
> On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > I'm pleased to announce I have a git tree up of my vfs scalability work.
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> >
> > Branch vfs-scale-working
>
> Bugs I've noticed so far:
>
> - Using XFS, the existing vfs inode count statistic does not decrease
> as inodes are freed.
> - the existing vfs dentry count remains at zero
> - the existing vfs free inode count remains at zero
>
> $ pminfo -f vfs.inodes vfs.dentry
>
> vfs.inodes.count
> value 7472612
>
> vfs.inodes.free
> value 0
>
> vfs.dentry.count
> value 0
>
> vfs.dentry.free
> value 0

Hm, I must have broken it along the way and not noticed. Thanks
for pointing that out.


> With a production build (i.e. no lockdep, no xfs debug), I'll
> run the same fs_mark parallel create/unlink workload to show
> scalability as I ran here:
>
> http://oss.sgi.com/archives/xfs/2010-05/msg00329.html
>
> The numbers can't be directly compared, but the test and the setup
> is the same. The XFS numbers below are with delayed logging
> enabled. ext4 is using default mkfs and mount parameters except for
> barrier=0. All numbers are averages of three runs.
>
> fs_mark rate (thousands of files/second)
> 2.6.35-rc5 2.6.35-rc5-scale
> threads xfs ext4 xfs ext4
> 1 20 39 20 39
> 2 35 55 35 57
> 4 60 41 57 42
> 8 79 9 75 9
>
> ext4 is getting IO bound at more than 2 threads, so apart from
> pointing out that XFS is 8-9x faster than ext4 at 8 thread, I'm
> going to ignore ext4 for the purposes of testing scalability here.
>
> For XFS w/ delayed logging, 2.6.35-rc5 is only getting to about 600%
> CPU and with Nick's patches it's about 650% (10% higher) for
> slightly lower throughput. So at this class of machine for this
> workload, the changes result in a slight reduction in scalability.

That's a good test case, thanks. I'll see if I can find where
this is coming from. I suspect RCU-inodes, I suppose. Hm,
may have to make them DESTROY_BY_RCU after all.

Thanks,
Nick

2010-07-24 00:21:20

by Dave Chinner

Subject: Re: VFS scalability git tree

On Sat, Jul 24, 2010 at 01:51:18AM +1000, Nick Piggin wrote:
> On Fri, Jul 23, 2010 at 09:13:10PM +1000, Dave Chinner wrote:
> > On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > > I'm pleased to announce I have a git tree up of my vfs scalability work.
> > >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> > >
> > > Branch vfs-scale-working
> >
> > I've got a couple of patches needed to build XFS - they shrinker
> > merge left some bad fragments - I'll post them in a minute. This
>
> OK cool.
>
>
> > email is for the longest ever lockdep warning I've seen that
> > occurred on boot.
>
> Ah thanks. OK, that was one of my attempts to keep sockets from
> hitting the vfs as much as possible (lazy inode number evaluation).
> Not a big problem, but I'll drop the patch for now.
>
> I have just got one for you too, btw :) (on vanilla kernel but it is
> messing up my lockdep stress testing on xfs). Real or false?
>
> [ INFO: possible circular locking dependency detected ]
> 2.6.35-rc5-00064-ga9f7f2e #334
> -------------------------------------------------------
> kswapd0/605 is trying to acquire lock:
> (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffff8125500c>]
> xfs_ilock+0x7c/0xa0
>
> but task is already holding lock:
> (&xfs_mount_list_lock){++++.-}, at: [<ffffffff81281a76>]
> xfs_reclaim_inode_shrink+0xc6/0x140

False positive, but the xfs_mount_list_lock is gone in 2.6.35-rc6 -
the shrinker context change has fixed that - so you can ignore it
anyway.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-07-24 08:43:59

by KOSAKI Motohiro

Subject: Re: VFS scalability git tree

> At this point, I would be very interested in reviewing, correctness
> testing on different configurations, and of course benchmarking.

I haven't reviewed this series for a long time, but I've found one mysterious
shrink_slab() usage. Can you please look at my patch? (I will send it as
another mail.)

2010-07-24 08:44:54

by KOSAKI Motohiro

Subject: [PATCH 1/2] vmscan: shrink_all_slab() use reclaim_state instead of the return value of shrink_slab()

Now, shrink_slab() doesn't return the number of reclaimed objects. IOW,
the current shrink_all_slab() is broken. Thus we instead use reclaim_state
to detect when there are no reclaimable slab objects left.

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/vmscan.c | 20 +++++++++-----------
1 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index d7256e0..bfa1975 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -300,18 +300,16 @@ static unsigned long shrink_slab(struct zone *zone, unsigned long scanned, unsig
void shrink_all_slab(void)
{
struct zone *zone;
- unsigned long nr;
+ struct reclaim_state reclaim_state;

-again:
- nr = 0;
- for_each_zone(zone)
- nr += shrink_slab(zone, 1, 1, 1, GFP_KERNEL);
- /*
- * If we reclaimed less than 10 objects, might as well call
- * it a day. Nothing special about the number 10.
- */
- if (nr >= 10)
- goto again;
+ current->reclaim_state = &reclaim_state;
+ do {
+ reclaim_state.reclaimed_slab = 0;
+ for_each_zone(zone)
+ shrink_slab(zone, 1, 1, 1, GFP_KERNEL);
+ } while (reclaim_state.reclaimed_slab);
+
+ current->reclaim_state = NULL;
}

static inline int is_page_cache_freeable(struct page *page)
--
1.6.5.2


2010-07-24 08:46:27

by KOSAKI Motohiro

Subject: [PATCH 2/2] vmscan: change shrink_slab() return type to void

Now, no caller uses the return value of shrink_slab(), so we can change
it to void.

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/vmscan.c | 7 +++----
1 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bfa1975..89b593e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -277,24 +277,23 @@ EXPORT_SYMBOL(shrinker_do_scan);
*
* Returns the number of slab objects which we shrunk.
*/
-static unsigned long shrink_slab(struct zone *zone, unsigned long scanned, unsigned long total,
+static void shrink_slab(struct zone *zone, unsigned long scanned, unsigned long total,
unsigned long global, gfp_t gfp_mask)
{
struct shrinker *shrinker;
- unsigned long ret = 0;

if (scanned == 0)
scanned = SWAP_CLUSTER_MAX;

if (!down_read_trylock(&shrinker_rwsem))
- return 1; /* Assume we'll be able to shrink next time */
+ return;

list_for_each_entry(shrinker, &shrinker_list, list) {
(*shrinker->shrink)(shrinker, zone, scanned,
total, global, gfp_mask);
}
up_read(&shrinker_rwsem);
- return ret;
+ return;
}

void shrink_all_slab(void)
--
1.6.5.2


2010-07-24 10:54:49

by KOSAKI Motohiro

Subject: Re: VFS scalability git tree

> > At this point, I would be very interested in reviewing, correctness
> > testing on different configurations, and of course benchmarking.
>
> I haven't reviewed this series for a long time, but I've found one mysterious
> shrink_slab() usage. Can you please look at my patch? (I will send it as
> another mail.)

Plus, I have one question. The upstream shrink_slab() calculation and your
calculation differ by more than your patch description explains.

upstream:

shrink_slab()

    basic_scan_objects = 4 x (lru_scanned / lru_pages)
                           x (max_pass / shrinker->seeks)    [shrinker->seeks default: 2]

    scan_objects = min(basic_scan_objects, max_pass * 2)

shrink_icache_memory()

    max_pass = inodes_stat.nr_unused x (sysctl_vfs_cache_pressure / 100)


In other words, a higher sysctl_vfs_cache_pressure results in more slab reclaim.


On the other hand, your code:

shrinker_add_scan()

    scan_objects = 4 x (scanned / total) x (objects / ratio)
                     x SHRINK_FACTOR x SHRINK_FACTOR

shrink_icache_memory()

ratio = DEFAULT_SEEKS * sysctl_vfs_cache_pressure / 100

In other words, a higher sysctl_vfs_cache_pressure results in less slab reclaim.


So I guess the following change more faithfully reflects your original intention.

The new calculation is:

shrinker_add_scan()

    scan_objects = (scanned / total) x objects x ratio

shrink_icache_memory()

ratio = DEFAULT_SEEKS * sysctl_vfs_cache_pressure / 100

This has the same behaviour as upstream, because upstream's 4/shrinker->seeks = 2,
and the above has DEFAULT_SEEKS = SHRINK_FACTOR*2.
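
To make the directional difference concrete, here is a hedged user-space C
sketch (the example numbers are made up, and the SHRINK_FACTOR fixed-point
scaling is deliberately omitted): it only shows that the scan target grows
with sysctl_vfs_cache_pressure in the upstream formula and in the proposed
one, while it shrinks in the tree as currently posted.

#include <stdio.h>

/* upstream 2.6.35: max_pass already contains the pressure factor */
static unsigned long upstream(unsigned long scanned, unsigned long lru_pages,
                              unsigned long nr_unused, unsigned long pressure)
{
    unsigned long max_pass = nr_unused * pressure / 100;  /* shrink_icache_memory() */
    return 4 * scanned * max_pass / (2 /* DEFAULT_SEEKS */ * lru_pages);
}

/* vfs-scale tree as posted: ratio sits in the denominator */
static unsigned long tree_current(unsigned long scanned, unsigned long total,
                                  unsigned long objects, unsigned long ratio)
{
    return 4 * scanned * objects / (ratio * total + 1);
}

/* proposed change: ratio moved to the numerator, matching upstream's direction */
static unsigned long proposed(unsigned long scanned, unsigned long total,
                              unsigned long objects, unsigned long ratio)
{
    return scanned * objects * ratio / (total + 1);
}

int main(void)
{
    for (unsigned long pressure = 50; pressure <= 200; pressure += 50) {
        unsigned long ratio = 2 /* DEFAULT_SEEKS */ * pressure / 100;
        printf("pressure %3lu: upstream %lu, current %lu, proposed %lu\n",
               pressure,
               upstream(1000, 100000, 50000, pressure),
               tree_current(1000, 100000, 50000, ratio),
               proposed(1000, 100000, 50000, ratio));
    }
    return 0;
}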



===============
o move 'ratio' from denominator to numerator
o adapt kvm/mmu_shrink
o SHRINK_FACTOR / 2 (default seek) x 4 (unknown shrink slab modifier)
-> (SHRINK_FACTOR*2) == DEFAULT_SEEKS

---
arch/x86/kvm/mmu.c | 2 +-
mm/vmscan.c | 10 ++--------
2 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ae5a038..cea1e92 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2942,7 +2942,7 @@ static int mmu_shrink(struct shrinker *shrink,
}

shrinker_add_scan(&nr_to_scan, scanned, global, cache_count,
- DEFAULT_SEEKS*10);
+ DEFAULT_SEEKS/10);

done:
cache_count = shrinker_do_scan(&nr_to_scan, SHRINK_BATCH);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 89b593e..2d8e9ab 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -208,14 +208,8 @@ void shrinker_add_scan(unsigned long *dst,
{
unsigned long long delta;

- /*
- * The constant 4 comes from old code. Who knows why.
- * This could all use a good tune up with some decent
- * benchmarks and numbers.
- */
- delta = (unsigned long long)scanned * objects
- * SHRINK_FACTOR * SHRINK_FACTOR * 4UL;
- do_div(delta, (ratio * total + 1));
+ delta = (unsigned long long)scanned * objects * ratio;
+ do_div(delta, total+ 1);

/*
* Avoid risking looping forever due to too large nr value:
--
1.6.5.2



2010-07-24 12:05:10

by KOSAKI Motohiro

Subject: Re: [PATCH 1/2] vmscan: shrink_all_slab() use reclaim_state instead of the return value of shrink_slab()

2010/7/24 KOSAKI Motohiro <[email protected]>:
> Now, shrink_slab() doesn't return number of reclaimed objects. IOW,
> current shrink_all_slab() is broken. Thus instead we use reclaim_state
> to detect no reclaimable slab objects.
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> ---
>  mm/vmscan.c |   20 +++++++++-----------
>  1 files changed, 9 insertions(+), 11 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d7256e0..bfa1975 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -300,18 +300,16 @@ static unsigned long shrink_slab(struct zone *zone, unsigned long scanned, unsig
>  void shrink_all_slab(void)
>  {
>         struct zone *zone;
> -       unsigned long nr;
> +       struct reclaim_state reclaim_state;
>
> -again:
> -       nr = 0;
> -       for_each_zone(zone)
> -               nr += shrink_slab(zone, 1, 1, 1, GFP_KERNEL);
> -       /*
> -        * If we reclaimed less than 10 objects, might as well call
> -        * it a day. Nothing special about the number 10.
> -        */
> -       if (nr >= 10)
> -               goto again;
> +       current->reclaim_state = &reclaim_state;
> +       do {
> +               reclaim_state.reclaimed_slab = 0;
> +               for_each_zone(zone)

Oops, this should be for_each_populated_zone().


> +                       shrink_slab(zone, 1, 1, 1, GFP_KERNEL);
> +       } while (reclaim_state.reclaimed_slab);
> +
> +       current->reclaim_state = NULL;
>  }
>
>  static inline int is_page_cache_freeable(struct page *page)
> --
> 1.6.5.2
>
>

2010-07-26 05:41:19

by Nick Piggin

Subject: Re: VFS scalability git tree

On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> I'm pleased to announce I have a git tree up of my vfs scalability work.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git

Pushed several fixes and improvements
o XFS bugs fixed by Dave
o dentry and inode stats bugs noticed by Dave
o vmscan shrinker bugs fixed by KOSAKI san
o compile bugs noticed by John
o a few attempts to improve powerpc performance (eg. reducing smp_rmb())
o scalability improvements for rename_lock

2010-07-27 07:05:47

by Nick Piggin

Subject: Re: VFS scalability git tree

On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote:
> On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > I'm pleased to announce I have a git tree up of my vfs scalability work.
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> >
> > Branch vfs-scale-working
>
> With a production build (i.e. no lockdep, no xfs debug), I'll
> run the same fs_mark parallel create/unlink workload to show
> scalability as I ran here:
>
> http://oss.sgi.com/archives/xfs/2010-05/msg00329.html

I've made a similar setup, 2s8c machine, but using 2GB ramdisk instead
of a real disk (I don't have easy access to a good disk setup ATM, but
I guess we're more interested in code above the block layer anyway).

Made an XFS on /dev/ram0 with 16 ags, 64MB log, otherwise same config as
yours.

I found that performance is a little unstable, so I sync and echo 3 >
drop_caches between each run. When it starts reclaiming memory, things
get a bit more erratic (and XFS seemed to be almost livelocking for tens
of seconds in inode reclaim). So I started with 50 runs of fs_mark
-n 20000 (which did not cause reclaim), rebuilding a new filesystem
between every run.

That gave the following files/sec numbers:
N Min Max Median Avg Stddev
x 50 100986.4 127622 125013.4 123248.82 5244.1988
+ 50 100967.6 135918.6 130214.9 127926.94 6374.6975
Difference at 95.0% confidence
4678.12 +/- 2316.07
3.79567% +/- 1.87919%
(Student's t, pooled s = 5836.88)

This is 3.8% in favour of vfs-scale-working.

I then did 10 runs of -n 20000 but with -L 4 (4 iterations) which did
start to fill up memory and cause reclaim during the 2nd and subsequent
iterations.

N Min Max Median Avg Stddev
x 10 116919.7 126785.7 123279.2 122245.17 3169.7993
+ 10 110985.1 132440.7 130122.1 126573.41 7151.2947
No difference proven at 95.0% confidence

x 10 75820.9 105934.9 79521.7 84263.37 11210.173
+ 10 75698.3 115091.7 82932 93022.75 16725.304
No difference proven at 95.0% confidence

x 10 66330.5 74950.4 69054.5 69102 2335.615
+ 10 68348.5 74231.5 70728.2 70879.45 1838.8345
No difference proven at 95.0% confidence

x 10 59353.8 69813.1 67416.7 65164.96 4175.8209
+ 10 59670.7 77719.1 74326.1 70966.02 6469.0398
Difference at 95.0% confidence
5801.06 +/- 5115.66
8.90212% +/- 7.85033%
(Student's t, pooled s = 5444.54)

vfs-scale-working was ahead at every point, but the results were
too erratic to read much into it (even the last point I think is
questionable).

I can provide raw numbers or more details on the setup if required.


> enabled. ext4 is using default mkfs and mount parameters except for
> barrier=0. All numbers are averages of three runs.
>
> fs_mark rate (thousands of files/second)
> 2.6.35-rc5 2.6.35-rc5-scale
> threads xfs ext4 xfs ext4
> 1 20 39 20 39
> 2 35 55 35 57
> 4 60 41 57 42
> 8 79 9 75 9
>
> ext4 is getting IO bound at more than 2 threads, so apart from
> pointing out that XFS is 8-9x faster than ext4 at 8 thread, I'm
> going to ignore ext4 for the purposes of testing scalability here.
>
> For XFS w/ delayed logging, 2.6.35-rc5 is only getting to about 600%
> CPU and with Nick's patches it's about 650% (10% higher) for
> slightly lower throughput. So at this class of machine for this
> workload, the changes result in a slight reduction in scalability.

I wonder if these results are stable. It's possible that changes in
reclaim behaviour are causing my patches to require more IO for a
given unit of work?

I was seeing XFS 'livelock' in reclaim more with my patches; it
could be due to more parallelism now being allowed from the vfs and
reclaim.

Based on my above numbers, I don't see that rcu-inodes is causing a
problem, and in terms of SMP scalability, there is really no way that
vanilla is more scalable, so I'm interested to see where this slowdown
is coming from.


> I looked at dbench on XFS as well, but didn't see any significant
> change in the numbers at up to 200 load threads, so not much to
> talk about there.

On a smaller system, dbench doesn't bottleneck too much. It's more of
a test to find shared cachelines and such on larger systems when you're
talking about several GB/s bandwidths.

Thanks,
Nick

2010-07-27 11:10:06

by Nick Piggin

Subject: Re: VFS scalability git tree

On Tue, Jul 27, 2010 at 05:05:39PM +1000, Nick Piggin wrote:
> On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote:
> > On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > > I'm pleased to announce I have a git tree up of my vfs scalability work.
> > >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> > >
> > > Branch vfs-scale-working
> >
> > With a production build (i.e. no lockdep, no xfs debug), I'll
> > run the same fs_mark parallel create/unlink workload to show
> > scalability as I ran here:
> >
> > http://oss.sgi.com/archives/xfs/2010-05/msg00329.html
>
> I've made a similar setup, 2s8c machine, but using 2GB ramdisk instead
> of a real disk (I don't have easy access to a good disk setup ATM, but
> I guess we're more interested in code above the block layer anyway).
>
> Made an XFS on /dev/ram0 with 16 ags, 64MB log, otherwise same config as
> yours.

I also tried dbench on this setup. 20 runs of dbench -t20 8
(that is a 20 second run, 8 clients).

Numbers are throughput, higher is better:

N Min Max Median Avg Stddev
vanilla 20 2219.19 2249.43 2230.43 2230.9915 7.2528893
scale 20 2428.21 2490.8 2437.86 2444.111 16.668256
Difference at 95.0% confidence
213.119 +/- 8.22695
9.55268% +/- 0.368757%
(Student's t, pooled s = 12.8537)

vfs-scale is 9.5% or 210MB/s faster than vanilla.

Like fs_mark, dbench has creat/unlink activity, so I hope rcu-inodes
should not be such a problem in practice. In my creat/unlink benchmark,
it is creating and destroying one inode repeatedly, which is the
absolute worst case for rcu-inodes. Whereas most real workloads
would be creating and destroying many inodes, which is not such a
disadvantage for rcu-inodes.

Incidentally, XFS was by far the fastest "real" filesystem I tested on
this workload. ext4 was around 1700MB/s (ext2 was around 3100MB/s and
ramfs is 3350MB/s).

2010-07-27 13:18:31

by Dave Chinner

Subject: Re: VFS scalability git tree

On Tue, Jul 27, 2010 at 05:05:39PM +1000, Nick Piggin wrote:
> On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote:
> > On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > > I'm pleased to announce I have a git tree up of my vfs scalability work.
> > >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> > >
> > > Branch vfs-scale-working
> >
> > With a production build (i.e. no lockdep, no xfs debug), I'll
> > run the same fs_mark parallel create/unlink workload to show
> > scalability as I ran here:
> >
> > http://oss.sgi.com/archives/xfs/2010-05/msg00329.html
>
> I've made a similar setup, 2s8c machine, but using 2GB ramdisk instead
> of a real disk (I don't have easy access to a good disk setup ATM, but
> I guess we're more interested in code above the block layer anyway).
>
> Made an XFS on /dev/ram0 with 16 ags, 64MB log, otherwise same config as
> yours.

As a personal preference, I don't like testing filesystem performance
on ramdisks because it hides problems caused by changes in IO
latency. I'll come back to this later.

> I found that performance is a little unstable, so I sync and echo 3 >
> drop_caches between each run.

Quite possibly because of the smaller log - that will cause more
frequent pushing on the log tail and hence I/O patterns will vary a
bit...

Also, keep in mind that delayed logging is shiny and new - it has
increased XFS metadata performance and parallelism by an order of
magnitude and so we're really seeing a bunch of brand new issues
that have never been seen before with this functionality. As such,
there are still some interactions I haven't got to the bottom of with
delayed logging - it's stable enough to use and benchmark and won't
corrupt anything, but there are still some warts we need to
solve. The difficulty (as always) is in reliably reproducing the bad
behaviour.

> When it starts reclaiming memory, things
> get a bit more erratic (and XFS seemed to be almost livelocking for tens
> of seconds in inode reclaim).

I can't say that I've seen this - even when testing up to 10m
inodes. Yes, kswapd is almost permanently active on these runs,
but when creating 100,000 inodes/s we also need to be reclaiming
100,000 inodes/s so it's not surprising that when 7 CPUs are doing
allocation we need at least one CPU to run reclaim....

> So I started with 50 runs of fs_mark
> -n 20000 (which did not cause reclaim), rebuilding a new filesystem
> between every run.
>
> That gave the following files/sec numbers:
> N Min Max Median Avg Stddev
> x 50 100986.4 127622 125013.4 123248.82 5244.1988
> + 50 100967.6 135918.6 130214.9 127926.94 6374.6975
> Difference at 95.0% confidence
> 4678.12 +/- 2316.07
> 3.79567% +/- 1.87919%
> (Student's t, pooled s = 5836.88)
>
> This is 3.8% in favour of vfs-scale-working.
>
> I then did 10 runs of -n 20000 but with -L 4 (4 iterations) which did
> start to fill up memory and cause reclaim during the 2nd and subsequent
> iterations.

I haven't used this mode, so I can't really comment on the results
you are seeing.

> > enabled. ext4 is using default mkfs and mount parameters except for
> > barrier=0. All numbers are averages of three runs.
> >
> > fs_mark rate (thousands of files/second)
> > 2.6.35-rc5 2.6.35-rc5-scale
> > threads xfs ext4 xfs ext4
> > 1 20 39 20 39
> > 2 35 55 35 57
> > 4 60 41 57 42
> > 8 79 9 75 9
> >
> > ext4 is getting IO bound at more than 2 threads, so apart from
> > pointing out that XFS is 8-9x faster than ext4 at 8 thread, I'm
> > going to ignore ext4 for the purposes of testing scalability here.
> >
> > For XFS w/ delayed logging, 2.6.35-rc5 is only getting to about 600%
> > CPU and with Nick's patches it's about 650% (10% higher) for
> > slightly lower throughput. So at this class of machine for this
> > workload, the changes result in a slight reduction in scalability.
>
> I wonder if these results are stable. It's possible that changes in
> reclaim behaviour are causing my patches to require more IO for a
> given unit of work?

More likely that's the result of using a smaller log size because it
will require more frequent metadata pushes to make space for new
transactions.

> I was seeing XFS 'livelock' in reclaim more with my patches, it
> could be due to more parallelism now being allowed from the vfs and
> reclaim.
>
> Based on my above numbers, I don't see that rcu-inodes is causing a
> problem, and in terms of SMP scalability, there is really no way that
> vanilla is more scalable, so I'm interested to see where this slowdown
> is coming from.

As I said initially, ram disks hide IO latency changes resulting
from increased numbers of IOs or increases in seek distances. My
initial guess is that the change in inode reclaim behaviour is causing
different IO patterns and more seeks under reclaim, because the zone
based reclaim is no longer reclaiming inodes in the order
they are created (i.e. we are not doing sequential inode reclaim any
more).

FWIW, I use PCP monitoring graphs to correlate behavioural changes
across different subsystems because it is far easier to relate
information visually than it is by looking at raw numbers or traces.
I think this graph shows the effect of reclaim on performance
most clearly:

http://userweb.kernel.org/~dgc/shrinker-2.6.36/fs_mark-2.6.35-rc3-context-only-per-xfs-batch6-16x500-xfs.png

It's pretty clear that when the inode/dentry cache shrinkers are
running, sustained create/unlink performance goes right down. From a
different tab not in the screen shot (the other "test-4" tab), I
could see CPU usage also goes down and the disk iops go way up
whenever the create/unlink performance dropped. This same behaviour
happens with the vfs-scale patchset, so it's not related to lock
contention - just aggressive reclaim of still-dirty inodes.

FYI, the patch under test there was the XFS shrinker ignoring 7 out
of 8 shrinker calls and then, on the 8th call, doing the work of all
previous calls, i.e. emulating SHRINK_BATCH = 1024. Interestingly
enough, that one change reduced the runtime of the 8m inode
create/unlink load by ~25% (from ~24min to ~18min).
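
In rough pseudo-code, the hack was along these lines (illustrative
only - the function and variable names are made up and locking is
omitted - it just shows the "ignore 7, batch up the 8th" logic):

static unsigned long deferred_scan_count;	/* work put off so far */
static unsigned long shrink_call_count;

static int xfs_batched_shrink(int nr_to_scan, gfp_t gfp_mask)
{
	if (nr_to_scan > 0) {
		deferred_scan_count += nr_to_scan;

		/* Ignore 7 out of every 8 shrinker calls... */
		if (++shrink_call_count & 7)
			return 0;

		/* ...and do all of the accumulated work on the 8th. */
		do_deferred_inode_reclaim(deferred_scan_count);
		deferred_scan_count = 0;
	}

	/* Report the remaining reclaimable objects, as shrinkers do. */
	return count_reclaimable_inodes();
}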

That is by far the largest improvement I've been able to obtain from
modifying the shrinker code, and it is from those sorts of
observations that I think that IO being issued from reclaim is
currently the most significant performance limiting factor for XFS
in this sort of workload....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-07-27 15:09:16

by Nick Piggin

[permalink] [raw]
Subject: Re: VFS scalability git tree

On Tue, Jul 27, 2010 at 11:18:10PM +1000, Dave Chinner wrote:
> On Tue, Jul 27, 2010 at 05:05:39PM +1000, Nick Piggin wrote:
> > On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote:
> > > On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > > > I'm pleased to announce I have a git tree up of my vfs scalability work.
> > > >
> > > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > > > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> > > >
> > > > Branch vfs-scale-working
> > >
> > > With a production build (i.e. no lockdep, no xfs debug), I'll
> > > run the same fs_mark parallel create/unlink workload to show
> > > scalability as I ran here:
> > >
> > > http://oss.sgi.com/archives/xfs/2010-05/msg00329.html
> >
> > I've made a similar setup, 2s8c machine, but using 2GB ramdisk instead
> > of a real disk (I don't have easy access to a good disk setup ATM, but
> > I guess we're more interested in code above the block layer anyway).
> >
> > Made an XFS on /dev/ram0 with 16 ags, 64MB log, otherwise same config as
> > yours.
>
> As a personal preference, I don't like testing filesystem performance
> on ramdisks because it hides problems caused by changes in IO
> latency. I'll come back to this later.

Very true, although it's useful when you don't have fast disks available,
and it can be good for triggering different races than disks tend to.

So I still want to get to the bottom of the slowdown you saw on
vfs-scale.


> > I found that performance is a little unstable, so I sync and echo 3 >
> > drop_caches between each run.
>
> Quite possibly because of the smaller log - that will cause more
> frequent pushing on the log tail and hence I/O patterns will vary a
> bit...

Well... I think the test case (or how I'm running it) is simply a
bit unstable. I mean, there are subtle interactions all the way from
the CPU scheduler to the disk, so when I say unstable I'm not
particularly blaming XFS :)


> Also, keep in mind that delayed logging is shiny and new - it has
> increased XFS metadata performance and parallelism by an order of
> magnitude and so we're really seeing a bunch of brand new issues
> that have never been seen before with this functionality. As such,
> there's still some interactions I haven't got to the bottom of with
> delayed logging - it's stable enough to use and benchmark and won't
> corrupt anything, but there are still some warts we need to
> solve. The difficulty (as always) is in reliably reproducing the bad
> behaviour.

Sure, and I didn't see any corruptions, it seems pretty stable and
scalability is better than other filesystems. I'll see if I can
give a better recipe to reproduce the 'livelock'ish behaviour.


> > I then did 10 runs of -n 20000 but with -L 4 (4 iterations) which did
> > start to fill up memory and cause reclaim during the 2nd and subsequent
> > iterations.
>
> I haven't used this mode, so I can't really comment on the results
> you are seeing.

It's a bit strange. Help says it should clear inodes between iterations
(without the -k flag), but it does not seem to.


> > > enabled. ext4 is using default mkfs and mount parameters except for
> > > barrier=0. All numbers are averages of three runs.
> > >
> > > fs_mark rate (thousands of files/second)
> > > 2.6.35-rc5 2.6.35-rc5-scale
> > > threads xfs ext4 xfs ext4
> > > 1 20 39 20 39
> > > 2 35 55 35 57
> > > 4 60 41 57 42
> > > 8 79 9 75 9
> > >
> > > ext4 is getting IO bound at more than 2 threads, so apart from
> > > pointing out that XFS is 8-9x faster than ext4 at 8 thread, I'm
> > > going to ignore ext4 for the purposes of testing scalability here.
> > >
> > > For XFS w/ delayed logging, 2.6.35-rc5 is only getting to about 600%
> > > CPU and with Nick's patches it's about 650% (10% higher) for
> > > slightly lower throughput. So at this class of machine for this
> > > workload, the changes result in a slight reduction in scalability.
> >
> > I wonder if these results are stable. It's possible that changes in
> > reclaim behaviour are causing my patches to require more IO for a
> > given unit of work?
>
> More likely that's the result of using a smaller log size because it
> will require more frequent metadata pushes to make space for new
> transactions.

I was just checking whether your numbers are stable (where you
saw some slowdown with vfs-scale patches), and what could be the
cause. I agree that running real disks could make big changes in
behaviour.


> > I was seeing XFS 'livelock' in reclaim more with my patches, it
> > could be due to more parallelism now being allowed from the vfs and
> > reclaim.
> >
> > Based on my above numbers, I don't see that rcu-inodes is causing a
> > problem, and in terms of SMP scalability, there is really no way that
> > vanilla is more scalable, so I'm interested to see where this slowdown
> > is coming from.
>
> As I said initially, ram disks hide IO latency changes resulting
> from increased numbers of IO or increases in seek distances. My
> initial guess is that the change in inode reclaim behaviour is causing
> different IO patterns and more seeks under reclaim, because zone-based
> reclaim is no longer reclaiming inodes in the order they are created
> (i.e. we are not doing sequential inode reclaim any more).

Sounds plausible. I'll do more investigations along those lines.


> FWIW, I use PCP monitoring graphs to correlate behavioural changes
> across different subsystems because it is far easier to relate
> information visually than it is by looking at raw numbers or traces.
> I think this graph shows the effect of reclaim on performance
> most clearly:
>
> http://userweb.kernel.org/~dgc/shrinker-2.6.36/fs_mark-2.6.35-rc3-context-only-per-xfs-batch6-16x500-xfs.png

I haven't actually used that, it looks interesting.


> It's pretty clear that when the inode/dentry cache shrinkers are
> running, sustained create/unlink performance goes right down. From a
> different tab not in the screen shot (the other "test-4" tab), I
> could see CPU usage also goes down and the disk iops go way up
> whenever the create/unlink performance dropped. This same behaviour
> happens with the vfs-scale patchset, so it's not related to lock
> contention - just aggressive reclaim of still-dirty inodes.
>
> FYI, the patch under test there was the XFS shrinker ignoring 7 out
> of 8 shrinker calls and then, on the 8th call, doing the work of all
> previous calls, i.e. emulating SHRINK_BATCH = 1024. Interestingly
> enough, that one change reduced the runtime of the 8m inode
> create/unlink load by ~25% (from ~24min to ~18min).

Hmm, interesting. Well that's naturally configurable with the
shrinker API changes I'm hoping to have merged. I'll plan to push
that ahead of the vfs-scale patches of course.


> That is by far the largest improvement I've been able to obtain from
> modifying the shrinker code, and it is from those sorts of
> observations that I think that IO being issued from reclaim is
> currently the most significant performance limiting factor for XFS
> in this sort of workload....

How is the xfs inode reclaim tied to linux inode reclaim? Does the
xfs inode not become reclaimable until some time after the linux inode
is reclaimed? Or what?

Do all or most of the xfs inodes require IO before being reclaimed
during this test? I wonder if you could throttle them a bit or sort
them somehow so that they tend to be cleaned by writeout and reclaim
just comes after and removes the clean ones, like pagecache reclaim
is (supposed) to work?

2010-07-28 05:00:05

by Dave Chinner

[permalink] [raw]
Subject: Re: VFS scalability git tree

On Wed, Jul 28, 2010 at 01:09:08AM +1000, Nick Piggin wrote:
> On Tue, Jul 27, 2010 at 11:18:10PM +1000, Dave Chinner wrote:
> > On Tue, Jul 27, 2010 at 05:05:39PM +1000, Nick Piggin wrote:
> > > On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote:
> > solve. The difficulty (as always) is in reliably reproducing the bad
> > behaviour.
>
> Sure, and I didn't see any corruptions, it seems pretty stable and
> scalability is better than other filesystems. I'll see if I can
> give a better recipe to reproduce the 'livelock'ish behaviour.

Well, stable is a good start :)

> > > > fs_mark rate (thousands of files/second)
> > > > 2.6.35-rc5 2.6.35-rc5-scale
> > > > threads xfs ext4 xfs ext4
> > > > 1 20 39 20 39
> > > > 2 35 55 35 57
> > > > 4 60 41 57 42
> > > > 8 79 9 75 9
> > > >
> > > > ext4 is getting IO bound at more than 2 threads, so apart from
> > > > pointing out that XFS is 8-9x faster than ext4 at 8 thread, I'm
> > > > going to ignore ext4 for the purposes of testing scalability here.
> > > >
> > > > For XFS w/ delayed logging, 2.6.35-rc5 is only getting to about 600%
> > > > CPU and with Nick's patches it's about 650% (10% higher) for
> > > > slightly lower throughput. So at this class of machine for this
> > > > workload, the changes result in a slight reduction in scalability.
> > >
> > > I wonder if these results are stable. It's possible that changes in
> > > reclaim behaviour are causing my patches to require more IO for a
> > > given unit of work?
> >
> > More likely that's the result of using a smaller log size because it
> > will require more frequent metadata pushes to make space for new
> > transactions.
>
> I was just checking whether your numbers are stable (where you
> saw some slowdown with vfs-scale patches), and what could be the
> cause. I agree that running real disks could make big changes in
> behaviour.

Yeah, the numbers are repeatable within about +/-5%. I generally
don't bother with optimisations that result in gains/losses less
than that because IO benchmarks that reliably reproduce results with
more precise repeatability than that are few and far between.

> > FWIW, I use PCP monitoring graphs to correlate behavioural changes
> > across different subsystems because it is far easier to relate
> > information visually than it is by looking at raw numbers or traces.
> > I think this graph shows the effect of reclaim on performance
> > most clearly:
> >
> > http://userweb.kernel.org/~dgc/shrinker-2.6.36/fs_mark-2.6.35-rc3-context-only-per-xfs-batch6-16x500-xfs.png
>
> I haven't actually used that, it looks interesting.

The archiving side of PCP is the most useful, I find - i.e. being able
to record the metrics into a file and analyse them with pmchart or
other tools after the fact...

> > That is by far the largest improvement I've been able to obtain from
> > modifying the shrinker code, and it is from those sorts of
> > observations that I think that IO being issued from reclaim is
> > currently the most significant performance limiting factor for XFS
> > in this sort of workload....
>
> How is the xfs inode reclaim tied to linux inode reclaim? Does the
> xfs inode not become reclaimable until some time after the linux inode
> is reclaimed? Or what?

The struct xfs_inode embeds a struct inode like so:

struct xfs_inode {
	.....
	struct inode	i_inode;	/* embedded VFS inode */
};

so they are the same chunk of memory. XFS does not use the VFS inode
hashes for finding inodes - that's what the per-ag radix trees are
used for. The xfs_inode lives longer than the struct inode because
we do non-trivial work after the VFS "reclaims" the struct inode.
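
Getting from one to the other is just pointer arithmetic on the embedded
member - roughly the following helpers (a sketch of what fs/xfs provides
as XFS_I()/VFS_I()):

/* Recover the containing xfs_inode from the embedded VFS inode. */
static inline struct xfs_inode *XFS_I(struct inode *inode)
{
	return container_of(inode, struct xfs_inode, i_inode);
}

/* The other direction is just the address of the embedded member. */
static inline struct inode *VFS_I(struct xfs_inode *ip)
{
	return &ip->i_inode;
}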

For example, when an inode is unlinked we
do not truncate or free the inode until after the VFS has finished with
it - the inode remains on the unlinked list (orphaned in ext3 terms)
from the time it is unlinked by the VFS to the time the last VFS
reference goes away. When XFS gets it, XFS then issues the inactive
transaction that takes the inode off the unlinked list and marks it
free in the inode alloc btree. This transaction is asynchronous and
dirties the xfs inode. Finally XFS will mark the inode as
reclaimable via a radix tree tag. The final processing of the inode
is then done via a background reclaim walk from xfssyncd
(every 30s), which does non-blocking operations to finalize reclaim
(the shrinker can also trigger this work, as discussed below).
It may take several passes to actually reclaim the inode:
e.g. one pass to force the log if the inode is pinned, another pass
to flush the inode to disk if it is dirty and not stale, and then
another pass to reclaim the inode once clean. There may be multiple
passes in between where the inode is skipped because those operations
have not completed.
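
In rough pseudo-code, one non-blocking pass over a single inode looks
something like this (not the real fs/xfs code - the helpers are
stand-ins and all locking and error handling is omitted - it just shows
why several passes may be needed):

static int reclaim_inode_pass(struct xfs_inode *ip)
{
	if (inode_is_pinned(ip)) {
		/* Pass 1: the log must be forced before we can flush. */
		start_async_log_force(ip);
		return -EAGAIN;		/* skip, revisit on a later pass */
	}

	if (inode_is_dirty(ip) && !inode_is_stale(ip)) {
		/* Pass 2: issue delayed-write metadata IO to clean it. */
		start_async_inode_flush(ip);
		return -EAGAIN;		/* skip until the flush completes */
	}

	/* Pass 3: clean (or stale), so the xfs_inode can now be freed. */
	free_reclaimed_inode(ip);
	return 0;
}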

And to top it all off, if the inode is looked up again (cache hit)
while in the reclaimable state, it will be removed from the reclaim
state and reused immediately. In this case we don't need to continue
the reclaim processing - other things will ensure all the correct
information goes to disk.

> Do all or most of the xfs inodes require IO before being reclaimed
> during this test?

Yes, because all the inodes are being dirtied and they are being
reclaimed faster than background flushing expires them.

> I wonder if you could throttle them a bit or sort
> them somehow so that they tend to be cleaned by writeout and reclaim
> just comes after and removes the clean ones, like pagecache reclaim
> is (supposed) to work?

The whole point of using the radix trees is to get nicely sorted
reclaim IO - inodes are indexed by number, and the radix tree walk
gives us ascending inode number (and hence ascending block number)
reclaim - and the background reclaim allows optimal flushing to
occur by aggregating all the IO into delayed write metadata buffers
so they can be sorted and flushed to the elevator by the xfsbufd in
the most optimal manner possible.
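
A minimal sketch of that ascending-order walk over one AG's inode radix
tree (illustrative only - RECLAIM_TAG and inode_number() are stand-ins,
and the real code handles locking, races and per-AG iteration):

static void reclaim_walk_ag(struct radix_tree_root *ici_root)
{
	unsigned long first_index = 0;
	void *batch[32];
	unsigned int nr, i;

	do {
		/*
		 * Tagged gang lookup returns inodes in ascending inode
		 * number (and hence roughly ascending block) order.
		 */
		nr = radix_tree_gang_lookup_tag(ici_root, batch, first_index,
						32, RECLAIM_TAG);
		for (i = 0; i < nr; i++) {
			struct xfs_inode *ip = batch[i];

			first_index = inode_number(ip) + 1;
			reclaim_inode_pass(ip);	/* sketch from above */
		}
	} while (nr);
}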

The shrinker does preempt this somewhat, which is why delaying the
XFS shrinker's work appears to improve things a lot. If the shrinker
is not running, the background reclaim does exactly what you are
suggesting.

However, I don't think the increase in iops is caused by the XFS
inode shrinker - I think that it is the VFS cache shrinkers. If you
look at the graphs in the link above, performance doesn't
decrease when the XFS inode cache is being shrunk (top chart, yellow
trace) - it drops when the vfs caches are being shrunk (middle
chart). I haven't correlated the behaviour any further than that
because I haven't had time.

FWIW, all this background reclaim, radix tree reclaim tagging and
walking, embedded struct inodes, etc is all relatively new code.
The oldest bit of it was introduced in 2.6.31 (I think) and so a
significant part of what we are exploring here is uncharted
territory. The changes to reclaim, etc. are partially responsible for
the scalability we are getting from delayed logging, but there is
certainly room for improvement....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-07-28 10:25:09

by Nick Piggin

[permalink] [raw]
Subject: Re: VFS scalability git tree

On Mon, Jul 26, 2010 at 03:41:11PM +1000, Nick Piggin wrote:
> On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > I'm pleased to announce I have a git tree up of my vfs scalability work.
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
>
> Pushed several fixes and improvements
> o XFS bugs fixed by Dave
> o dentry and inode stats bugs noticed by Dave
> o vmscan shrinker bugs fixed by KOSAKI san
> o compile bugs noticed by John
> o a few attempts to improve powerpc performance (eg. reducing smp_rmb())
> o scalability improvments for rename_lock

Yet another result on my small 2s8c Opteron. This time, the
re-aim benchmark configured as described here:

http://ertos.nicta.com.au/publications/papers/Chubb_Williams_05.pdf

It is using ext2 on ramdisk and an IO intensive workload, with fsync
activity.

I did 10 runs on each, and took the max jobs/sec of each run.

N Min Max Median Avg Stddev
x 10 2598750 2735122 2665384.6 2653353.8 46421.696
+ 10 3337297.3 3484687.5 3410689.7 3397763.8 49994.631
Difference at 95.0% confidence
744410 +/- 45327.3
28.0554% +/- 1.7083%

Average is 2653K jobs/s for vanilla, versus 3398K jobs/s for vfs-scale,
or a 28% speedup.

The profile is interesting. It is known to be inode_lock intensive, but
we also see here that it is do_lookup intensive, due to cacheline bouncing
in common elements of path lookups.

Vanilla:
# Overhead Symbol
# ........ ......
#
7.63% [k] __d_lookup
|
|--88.59%-- do_lookup
|--9.75%-- __lookup_hash
|--0.89%-- d_lookup

7.17% [k] _raw_spin_lock
|
|--11.07%-- _atomic_dec_and_lock
| |
| |--53.73%-- dput
| --46.27%-- iput
|
|--9.85%-- __mark_inode_dirty
| |
| |--46.25%-- ext2_new_inode
| |--25.32%-- __set_page_dirty
| |--18.27%-- nobh_write_end
| |--6.91%-- ext2_new_blocks
| |--3.12%-- ext2_unlink
|
|--7.69%-- ext2_new_inode
|
|--6.84%-- insert_inode_locked
| ext2_new_inode
|
|--6.56%-- new_inode
| ext2_new_inode
|
|--5.61%-- writeback_single_inode
| sync_inode
| generic_file_fsync
| ext2_fsync
|
|--5.13%-- dput
|--3.75%-- generic_delete_inode
|--3.56%-- __d_lookup
|--3.53%-- ext2_free_inode
|--3.40%-- sync_inode
|--2.71%-- d_instantiate
|--2.36%-- d_delete
|--2.25%-- inode_sub_bytes
|--1.84%-- file_move
|--1.52%-- file_kill
|--1.36%-- ext2_new_blocks
|--1.34%-- ext2_create
|--1.34%-- d_alloc
|--1.11%-- do_lookup
|--1.07%-- iput
|--1.05%-- __d_instantiate

4.19% [k] mutex_spin_on_owner
|
|--99.92%-- __mutex_lock_slowpath
| mutex_lock
| |
| |--56.45%-- do_unlinkat
| | sys_unlink
| |
| --43.55%-- do_last
| do_filp_open

2.96% [k] _atomic_dec_and_lock
|
|--58.18%-- dput
|--31.02%-- mntput_no_expire
|--3.30%-- path_put
|--3.09%-- iput
|--2.69%-- link_path_walk
|--1.02%-- fput

2.73% [k] copy_user_generic_string
2.67% [k] __mark_inode_dirty
2.65% [k] link_path_walk
2.63% [k] mark_buffer_dirty
1.72% [k] __memcpy
1.62% [k] generic_getxattr
1.50% [k] acl_permission_check
1.30% [k] __find_get_block
1.30% [k] __memset
1.17% [k] ext2_find_entry
1.09% [k] ext2_new_inode
1.06% [k] system_call
1.01% [k] kmem_cache_free
1.00% [k] dput


In vfs-scale, most of the spinlock contention and path lookup cost is
gone. Contention for parent i_mutex (and d_lock) for creat/unlink
operations is now at the top of the profile.

A lot of the spinlock overhead seems to be not contention so much as
the cost of the atomics. Down at 3%, it is much less of a problem than
it was, though.

We may run into a bit of contention on the per-bdi inode dirty/io
list lock, with just a single ramdisk device (dirty/fsync activity
will hit this lock), but it is really not worth worrying about at
the moment.

# Overhead Symbol
# ........ ......
#
5.67% [k] mutex_spin_on_owner
|
|--99.96%-- __mutex_lock_slowpath
| mutex_lock
| |
| |--58.63%-- do_unlinkat
| | sys_unlink
| |
| --41.37%-- do_last
| do_filp_open

3.93% [k] __mark_inode_dirty
3.43% [k] copy_user_generic_string
3.31% [k] link_path_walk
3.15% [k] mark_buffer_dirty
3.11% [k] _raw_spin_lock
|
|--11.03%-- __mark_inode_dirty
|--10.54%-- ext2_new_inode
|--7.60%-- ext2_free_inode
|--6.33%-- inode_sub_bytes
|--6.27%-- ext2_new_blocks
|--5.80%-- generic_delete_inode
|--4.09%-- ext2_create
|--3.62%-- writeback_single_inode
|--2.92%-- sync_inode
|--2.81%-- generic_drop_inode
|--2.46%-- iput
|--1.86%-- dput
|--1.80%-- __dquot_alloc_space
|--1.61%-- __mutex_unlock_slowpath
|--1.59%-- generic_file_fsync
|--1.57%-- __d_instantiate
|--1.55%-- __set_page_dirty_buffers
|--1.36%-- d_alloc_and_lookup
|--1.23%-- do_path_lookup
|--1.10%-- ext2_free_blocks

2.13% [k] __memset
2.12% [k] __memcpy
1.98% [k] __d_lookup_rcu
1.46% [k] generic_getxattr
1.44% [k] ext2_find_entry
1.41% [k] __find_get_block
1.27% [k] kmem_cache_free
1.25% [k] ext2_new_inode
1.23% [k] system_call
1.02% [k] ext2_add_link
1.01% [k] strncpy_from_user
0.96% [k] kmem_cache_alloc
0.95% [k] find_get_page
0.94% [k] sysret_check
0.88% [k] __d_lookup
0.75% [k] ext2_delete_entry
0.70% [k] generic_file_aio_read
0.67% [k] generic_file_buffered_write
0.63% [k] ext2_new_blocks
0.62% [k] __percpu_counter_add
0.59% [k] __bread
0.58% [k] __wake_up_bit
0.58% [k] __mutex_lock_slowpath
0.56% [k] __ext2_write_inode
0.55% [k] ext2_get_blocks

2010-07-30 09:12:34

by Nick Piggin

[permalink] [raw]
Subject: Re: VFS scalability git tree

On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> I'm pleased to announce I have a git tree up of my vfs scalability work.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
>
> Branch vfs-scale-working
>
> The really interesting new item is the store-free path walk, (43fe2b)
> which I've re-introduced. It has had a complete redesign, it has much
> better performance and scalability in more cases, and is actually sane
> code now.

Things are progressing well here with fixes and improvements to the
branch.

One thing that has been brought to my attention is that store-free path
walking (rcu-walk) drops into the normal refcounted walking on any
filesystem that has posix ACLs enabled.

IS_POSIXACL is based on a superblock flag; having misread that, I had
thought we only drop out of rcu-walk when encountering an inode that
actually has ACLs.

This is quite an important point for any performance testing work.
ACLs can actually be rcu checked quite easily in most cases, but it
takes a bit of work on APIs.

Filesystems defining their own ->permission and ->d_revalidate will
also not use rcu-walk. These could likewise be made to support rcu-walk
more widely, but it will require knowledge of rcu-walk to be pushed
into filesystems.

It's not a big deal, basically: no blocking, no stores, no referencing
non-rcu-protected data, and confirm with seqlock. That is usually the
case in fastpaths. If it cannot be satisfied, then just return -ECHILD
and you'll get called in the usual ref-walk mode next time.
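
For a filesystem author, the shape of an rcu-walk aware ->d_revalidate
ends up something like the sketch below. This is only illustrative -
the myfs_* helpers are hypothetical, and the way rcu-walk mode is
signalled to the filesystem (shown here as a LOOKUP_RCU nameidata flag)
depends on the final API:

static int myfs_d_revalidate(struct dentry *dentry, struct nameidata *nd)
{
	struct inode *inode = dentry->d_inode;

	if (nd->flags & LOOKUP_RCU) {
		/*
		 * rcu-walk: no blocking, no stores, and only
		 * rcu-protected data may be touched.  If we can't
		 * decide under those rules, punt back to ref-walk.
		 */
		if (!inode || myfs_needs_slow_revalidation(inode))
			return -ECHILD;
		return 1;	/* still valid; walk continues store-free */
	}

	/* ref-walk: normal blocking revalidation is fine here. */
	return myfs_full_revalidate(dentry, inode);
}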

But for now, keep this in mind if you plan to do any serious performance
testing work: *do not mount filesystems with ACL support*.

Thanks,
Nick

2010-08-03 00:28:16

by john stultz

[permalink] [raw]
Subject: Re: VFS scalability git tree

On Fri, 2010-07-30 at 19:12 +1000, Nick Piggin wrote:
> On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > I'm pleased to announce I have a git tree up of my vfs scalability work.
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> >
> > Branch vfs-scale-working
> >
> > The really interesting new item is the store-free path walk, (43fe2b)
> > which I've re-introduced. It has had a complete redesign, it has much
> > better performance and scalability in more cases, and is actually sane
> > code now.
>
> Things are progressing well here with fixes and improvements to the
> branch.

Hey Nick,
Just another minor compile issue with today's vfs-scale-working branch.

fs/fuse/dir.c:231: error: ‘fuse_dentry_revalidate_rcu’ undeclared here
(not in a function)

From looking at the vfat and ecryptfs changes in
582c56f032983e9a8e4b4bd6fac58d18811f7d41 it looks like you intended to
add the following?


diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index f0c2479..9ee4c10 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -154,7 +154,7 @@ u64 fuse_get_attr_version(struct fuse_conn *fc)
* the lookup once more. If the lookup results in the same inode,
* then refresh the attributes, timeouts and mark the dentry valid.
*/
-static int fuse_dentry_revalidate(struct dentry *entry, struct nameidata *nd)
+static int fuse_dentry_revalidate_rcu(struct dentry *entry, struct nameidata *nd)
{
struct inode *inode = entry->d_inode;


2010-08-03 05:44:08

by Nick Piggin

[permalink] [raw]
Subject: Re: VFS scalability git tree

On Mon, Aug 02, 2010 at 05:27:59PM -0700, John Stultz wrote:
> On Fri, 2010-07-30 at 19:12 +1000, Nick Piggin wrote:
> > On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > > I'm pleased to announce I have a git tree up of my vfs scalability work.
> > >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> > >
> > > Branch vfs-scale-working
> > >
> > > The really interesting new item is the store-free path walk, (43fe2b)
> > > which I've re-introduced. It has had a complete redesign, it has much
> > > better performance and scalability in more cases, and is actually sane
> > > code now.
> >
> > Things are progressing well here with fixes and improvements to the
> > branch.
>
> Hey Nick,
> Just another minor compile issue with today's vfs-scale-working branch.
>
> fs/fuse/dir.c:231: error: ‘fuse_dentry_revalidate_rcu’ undeclared here
> (not in a function)
>
> From looking at the vfat and ecryptfs changes in
> 582c56f032983e9a8e4b4bd6fac58d18811f7d41 it looks like you intended to
> add the following?

Thanks John, you're right.

I thought I actually linked and ran this, but I must not have had fuse
compiled in.