2010-02-19 17:16:22

by Mikael Abrahamsson

Subject: disk/crypto performance regression 2.6.31 -> 2.6.32 (mmap problem?)


I have a system with 8 gigs of ram, c2d 2.4GHz and md
raid6->crypto->lvm/lv->xfs that I've had for quite a while. It used to be
raid5, but with 2.6.32 I was able to restripe it into a raid6.

Under 2.6.32 (Linux ub 2.6.32-12-generic #16-Ubuntu SMP Thu Jan 28
07:47:05 UTC 2010 x86_64 GNU/Linux) I'm seeing the following in latencytop
(sometimes worse, up to several seconds) when running rtorrent (which uses
mmap):

Cause                            Maximum        Percentage
Page fault                       1580.4 msec    84.7 %
Writing a page to disk            869.7 msec     2.2 %
[xfs_ilock]                       232.9 msec     2.3 %
Truncating file                   110.1 msec     0.3 %
Waiting for TTY data               18.2 msec     0.0 %
Reading EXT3 indirect blocks       10.0 msec     0.1 %
Executing a program                 9.8 msec     0.0 %
Walking directory tree              8.7 msec     0.0 %
Waiting for event (select)          5.0 msec     1.8 %

swmike@ub:~$ free
             total       used       free     shared    buffers     cached
Mem:       8194328    6455004    1739324          0       2012    4343532
-/+ buffers/cache:    2109460    6084868
Swap:            0          0          0

iostat -x 5 gives this:

avg-cpu: %user %nice %system %iowait %steal %idle
0.61 0.00 4.36 63.29 0.00 31.74

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 35.80 59.00 20.80 21.00 457.60 646.40 26.41 0.13 3.21 2.06 8.60
sdc 33.20 74.40 21.60 26.40 491.20 814.40 27.20 0.23 4.88 3.29 15.80
sdd 24.20 96.60 21.00 32.80 457.60 1040.00 27.84 0.31 5.84 4.16 22.40
sde 25.60 94.80 18.80 33.60 420.80 1033.60 27.76 0.28 5.42 3.82 20.00
md0 0.00 0.00 5.00 97.20 393.60 2390.40 27.24 0.00 0.00 0.00 0.00
sdf 29.80 87.20 18.60 28.80 460.80 937.60 29.50 0.16 3.38 2.70 12.80
sdg 35.80 61.80 20.60 19.60 476.80 657.60 28.22 0.26 5.92 3.38 13.60
sdh 32.20 71.20 21.00 22.20 465.60 752.00 28.19 0.23 5.23 3.24 14.00
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 5.00 132.40 393.60 2390.40 20.26 3.32 22.39 7.26 99.80
dm-3 0.00 0.00 5.20 77.80 444.80 1841.60 27.55 3.32 37.08 12.05 100.00

swmike@ub:/dev/mapper$ ls -l
total 0
crw-rw---- 1 root root 10, 59 2010-02-05 17:57 control
brw-rw---- 1 root disk 252, 2 2010-02-15 13:55 md0_crypt
brw-rw---- 1 root disk 252, 0 2010-02-15 13:55 sdb1_crypt
brw-rw---- 1 root disk 252, 1 2010-02-15 13:55 ub-root
brw-rw---- 1 root disk 252, 3 2010-02-15 13:55 vg01-home

So from what I can gather, even though very little traffic is moving
through the crypt layer, it is still the bottleneck (when just copying a
single file I can easily get 50+ megabytes/s through the crypto layer).
Top indicates low CPU usage but of course still a high load average, since
rtorrent is "blocking" most of the time (unresponsive to interactive
input).

top - 17:34:54 up 14 days, 37 min, 6 users, load average: 3.60, 3.90, 3.89
Tasks: 237 total, 2 running, 235 sleeping, 0 stopped, 0 zombie
Cpu0 : 1.7%us, 8.6%sy, 0.0%ni, 55.1%id, 30.9%wa, 2.3%hi, 1.3%si, 0.0%st
Cpu1 : 1.4%us, 11.1%sy, 0.0%ni, 2.8%id, 81.9%wa, 0.0%hi, 2.8%si, 0.0%st
Mem: 8194328k total, 8143856k used, 50472k free, 2136k buffers
Swap: 0k total, 0k used, 0k free, 5625364k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9950 root 20 0 0 0 0 S 5 0.0 1012:29 md0_raid5
25684 root 20 0 0 0 0 R 9 0.0 342:58.88 kcryptd
22007 swmike 20 0 325m 170m 140m D 5 2.1 87:09.82 rtorrent
15543 swmike 20 0 124m 24m 3296 S 1 0.3 78:25.01 rtorrent
31346 swmike 9 -11 327m 5940 4200 S 0 0.1 43:12.09 pulseaudio

When I was running the 2.6.31 (Ubuntu 9.10) kernel I didn't see this
problem at all. Things have changed in that I'm now running raid6 instead
of raid5, but from the iostat output it looks like the disk+md layer is
not causing this behaviour (under 2.6.31 that layer was the bottleneck,
but it isn't anymore).

Here is an example of when I dd a large file instead; note the dm-2
behaviour:

avg-cpu: %user %nice %system %iowait %steal %idle
29.10 0.00 70.90 0.00 0.00 0.00

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 3.00 0.00 54.40 0.00 18.13 0.01 4.00 1.33 0.40
sdb 42.80 1990.20 3.40 75.40 369.60 16537.60 214.56 0.20 2.51 0.79 6.20
sdc 38.60 1993.40 3.20 75.20 334.40 16563.20 215.53 0.10 1.22 0.56 4.40
sdd 31.80 1990.40 2.20 75.00 272.00 16537.60 217.74 0.12 1.53 0.65 5.00
sde 31.80 1986.80 2.20 75.40 272.00 16512.00 216.29 0.25 3.20 1.11 8.60
md0 0.00 0.00 0.00 777.20 0.00 99225.60 127.67 0.00 0.00 0.00 0.00
sdf 28.80 1993.40 2.00 75.40 246.40 16563.20 217.18 0.06 0.80 0.52 4.00
sdg 36.80 1999.20 3.00 75.80 318.40 16612.80 214.86 0.56 7.13 1.73 13.60
sdh 17.60 1993.20 1.80 75.60 155.20 16563.20 216.00 0.02 0.26 0.21 1.60
dm-0 0.00 0.00 3.00 0.00 54.40 0.00 18.13 0.01 4.00 1.33 0.40
dm-1 0.00 0.00 3.00 0.00 54.40 0.00 18.13 0.01 4.00 1.33 0.40
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 14696.19 0.00 0.00 100.00
dm-3 0.00 0.00 0.00 9051.80 0.00 963836.80 106.48 14696.19 468.84 0.11 100.00
sdj 24.80 1999.00 2.20 76.00 216.00 16612.80 215.20 0.10 1.25 0.72 5.60

avg-cpu: %user %nice %system %iowait %steal %idle
24.82 0.00 73.27 1.91 0.00 0.00

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 1.20 0.00 24.00 0.00 20.00 0.00 3.33 3.33 0.40
sdb 3.00 3288.00 0.20 83.40 25.60 26972.80 322.95 0.16 1.91 1.89 15.80
sdc 6.00 3287.80 0.80 83.60 54.40 26972.80 320.23 0.19 2.25 2.13 18.00
sdd 7.20 3283.40 1.00 83.00 65.60 26932.80 321.41 0.29 3.40 2.45 20.60
sde 7.20 3283.80 1.00 82.60 65.60 26932.80 322.95 0.25 3.01 2.39 20.00
md0 0.00 0.00 0.00 3665.20 0.00 161574.40 44.08 0.00 0.00 0.00 0.00
sdf 1.20 3283.80 0.60 82.60 14.40 26932.80 323.88 0.21 2.50 2.33 19.40
sdg 3.00 3285.00 0.20 83.20 25.60 26947.20 323.41 0.29 3.43 2.45 20.40
sdh 1.20 3283.80 0.60 82.60 14.40 26932.80 323.88 0.20 2.43 2.28 19.00
dm-0 0.00 0.00 1.20 0.00 24.00 0.00 20.00 0.00 3.33 3.33 0.40
dm-1 0.00 0.00 1.20 0.00 24.00 0.00 20.00 0.00 3.33 3.33 0.40
dm-2 0.00 0.00 0.00 3872.40 0.00 334886.40 86.48 9989.29 6114.24 0.26 100.00
dm-3 0.00 0.00 0.00 2731.80 0.00 178907.60 65.49 9989.30 8667.10 0.37 100.00
sdj 4.20 3283.80 0.80 82.60 40.00 26932.80 323.41 0.18 2.16 2.13 17.80

avg-cpu: %user %nice %system %iowait %steal %idle
30.89 0.00 65.09 4.02 0.00 0.00

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 12.00 3141.00 0.80 78.20 102.40 25680.00 326.36 0.15 1.85 1.77 14.00
sdc 9.00 3141.20 0.60 78.20 76.80 25756.80 327.84 0.14 1.83 1.75 13.80
sdd 9.00 3140.80 0.60 78.60 76.80 25755.20 326.16 0.32 4.04 2.30 18.20
sde 9.00 3140.60 0.60 78.80 76.80 25755.20 325.34 0.27 3.38 2.14 17.00
md0 0.00 0.00 0.00 1685.80 0.00 154444.80 91.62 0.00 0.00 0.00 0.00
sdf 6.00 3141.00 0.40 78.40 51.20 25755.20 327.49 0.16 2.03 1.93 15.20
sdg 9.00 3142.60 0.60 78.40 76.80 25769.60 327.17 0.48 6.08 3.32 26.20
sdh 6.00 3141.20 0.40 77.80 51.20 25678.40 329.02 0.14 1.74 1.71 13.40
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 9797.52 0.00 0.00 100.00
dm-3 0.00 0.00 0.20 2035.60 1.60 135518.80 66.57 9797.52 3423.06 0.49 100.00
sdj 9.00 3142.40 0.60 78.60 76.80 25769.60 326.34 0.18 2.25 1.97 15.60

Any ideas? What more information can I collect to help look into this
issue? Currently I'm running with:

swmike@ub:~$ cat /etc/sysctl.conf | grep vm
vm.dirty_background_ratio = 1
vm.dirty_ratio = 60

Setting these to 10/30 or something similar doesn't really change the
behaviour with rtorrent. I've also mucked around with the I/O scheduler
(I'm running deadline right now) and don't see much difference between
the schedulers under 2.6.32 as far as this problem goes.
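
For reference, this is roughly how I change those values at runtime when
testing (the numbers below are only examples, not a recommendation):

# sysctl -w vm.dirty_background_ratio=10   # example value only
# sysctl -w vm.dirty_ratio=30              # example value only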

--
Mikael Abrahamsson email: [email protected]


2010-02-22 12:39:39

by Mikael Abrahamsson

Subject: Re: disk/crypto performance regression 2.6.31 -> 2.6.32 (mmap problem?)

On Fri, 19 Feb 2010, Mikael Abrahamsson wrote:

> I have a system with 8 gigs of ram, c2d 2.4GHz and md
> raid6->crypto->lvm/lv->xfs that I've had for quite a while. It used to be
> raid5, but with 2.6.32 I was able to restripe it into a raid6.

This is not a 2.6.32 regression; I verified during the weekend that the
behaviour is similar under 2.6.31.

It seems iostat isn't really able to show exactly where the bottleneck
is; my current guess is that it's actually the md raid6 layer that is
causing my performance problems, even though it doesn't look that way in
iostat.

--
Mikael Abrahamsson email: [email protected]

2010-02-23 04:48:24

by Dave Chinner

Subject: Re: disk/crypto performance regression 2.6.31 -> 2.6.32 (mmap problem?)

On Mon, Feb 22, 2010 at 01:39:37PM +0100, Mikael Abrahamsson wrote:
> On Fri, 19 Feb 2010, Mikael Abrahamsson wrote:
>
>> I have a system with 8 gigs of ram, c2d 2.4GHz and md
>> raid6->crypto->lvm/lv->xfs that I've had for quite a while. It used to
>> be raid5, but with 2.6.32 I was able to restripe it into a raid6.
>
> This is not a 2.6.32 regression; I verified during the weekend that the
> behaviour is similar under 2.6.31.
>
> It seems iostat isn't really able to show exactly where the bottleneck
> is; my current guess is that it's actually the md raid6 layer that is
> causing my performance problems, even though it doesn't look that way in
> iostat.

XFS issues IO barriers by default. They were recently enabled in md
for raid5/6, so that might be the cause of the slowdown/latency. You
could try using the "nobarrier" mount option to see if this makes the
problem go away, but beware that this can result in filesystem
corruption if the machine crashes.
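
A rough sketch of what that test could look like (the mount point below
is only an example; if the remount doesn't accept the option, unmount and
mount again instead):

# mount -o remount,nobarrier /mnt/data   # example mount point, adjust to yours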

If it is not barriers that are causing this, then the other thing
you might want to look at is if XFS is configured with lazy-count=1
(xfs_info <mntpt>). If it is not enabled (0), then a significant
amount of latency could be coming from the superblock buffer being
locked during transaction commits. Unfortunately enabling this
feature is an offline operation (via xfs_admin) so enabling may not
be feasible for you.
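
Roughly, something like this (the mount point and device below are only
examples based on your setup, adjust as needed):

# xfs_info /mnt/data | grep lazy-count     # check the current setting
# umount /mnt/data
# xfs_admin -c 1 /dev/mapper/vg01-home     # enable lazy-count
# mount /mnt/data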

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-02-23 07:24:29

by Mikael Abrahamsson

Subject: Re: disk/crypto performance regression 2.6.31 -> 2.6.32 (mmap problem?)

On Tue, 23 Feb 2010, Dave Chinner wrote:

> XFS issues IO barriers by default. They were recently enabled in md
> for raid5/6, so that might be the cause of the slowdown/latency. You
> could try using the "nobarrier" mount option to see if this makes the
> problem go away, but beware that this can result in filesystem
> corruption if the machine crashes.

I don't think the crypto layer supports it, so I think my barriers are
off; that's the message I'm used to seeing every time it mounts the
partition.

> If it is not barriers that are causing this, then the other thing
> you might want to look at is if XFS is configured with lazy-count=1
> (xfs_info <mntpt>). If it is not enabled (0), then a significant
> amount of latency could be coming from the superblock buffer being
> locked during transaction commits. Unfortunately enabling this
> feature is an offline operation (via xfs_admin) so enabling may not
> be feasible for you.

Currently it's "lazy-count=0", so I'll change that setting tonight.

The behaviour I'm seeing right now causes things like "sync" to take a
very long time. I just saw this as well (I've seen the same thing once
when I ran "sync" manually):

[210846.031875] INFO: task xfssyncd:3167 blocked for more than 120 seconds.
[210846.031879] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[210846.031883] xfssyncd D 00000000ffffffff 0 3167 2 0x00000000
[210846.031889] ffff880208569c50 0000000000000046 ffff880208569c10 0000000000015880
[210846.031895] ffff88020b0ec7c0 0000000000015880 0000000000015880 0000000000015880
[210846.031900] 0000000000015880 ffff88020b0ec7c0 0000000000015880 0000000000015880
[210846.031905] Call Trace:
[210846.031935] [<ffffffffa02767d5>] __xfs_iunpin_wait+0x95/0xd0 [xfs]
[210846.031943] [<ffffffff810789c0>] ? autoremove_wake_function+0x0/0x40
[210846.031964] [<ffffffffa0279dad>] xfs_iflush+0x8d/0x2b0 [xfs]
[210846.031971] [<ffffffff815299cb>] ? __down_write+0xb/0x10
[210846.031992] [<ffffffffa02a06f4>] xfs_reclaim_inode+0x114/0x180 [xfs]
[210846.032013] [<ffffffffa02a07e2>] xfs_reclaim_inode_now+0x82/0x90 [xfs]
[210846.032033] [<ffffffffa02a0760>] ? xfs_reclaim_inode_now+0x0/0x90 [xfs]
[210846.032053] [<ffffffffa02a11d2>] xfs_inode_ag_walk+0x72/0xd0 [xfs]
[210846.032073] [<ffffffffa02a0760>] ? xfs_reclaim_inode_now+0x0/0x90 [xfs]
[210846.032094] [<ffffffffa02a1297>] xfs_inode_ag_iterator+0x67/0xa0 [xfs]
[210846.032114] [<ffffffffa02a12e4>] xfs_reclaim_inodes+0x14/0x20 [xfs]
[210846.032134] [<ffffffffa02a1349>] xfs_sync_worker+0x59/0x90 [xfs]
[210846.032154] [<ffffffffa02a0a0a>] xfssyncd+0x17a/0x220 [xfs]
[210846.032174] [<ffffffffa02a0890>] ? xfssyncd+0x0/0x220 [xfs]
[210846.032179] [<ffffffff810785d6>] kthread+0xa6/0xb0
[210846.032183] [<ffffffff810130ea>] child_rip+0xa/0x20
[210846.032188] [<ffffffff81078530>] ? kthread+0x0/0xb0
[210846.032191] [<ffffffff810130e0>] ? child_rip+0x0/0x20

So something weird is going on; I don't really see why it wouldn't be
able to write enough data in 120 seconds in an orderly fashion.

--
Mikael Abrahamsson email: [email protected]

2010-02-23 13:32:26

by Dave Chinner

Subject: Re: disk/crypto performance regression 2.6.31 -> 2.6.32 (mmap problem?)

On Tue, Feb 23, 2010 at 08:24:26AM +0100, Mikael Abrahamsson wrote:
> On Tue, 23 Feb 2010, Dave Chinner wrote:
>
>> XFS issues IO barriers by default. They were recently enabled in md
>> for raid5/6, so that might be the cause of the slowdown/latency. You
>> could try using the "nobarrier" mount option to see if this makes the
>> problem go away, but beware that this can result in filesystem
>> corruption if the machine crashes.
>
> I don't think the crypto layer supports it, so I think my barriers are
> off; that's the message I'm used to seeing every time it mounts the
> partition.

Barriers are only recently supported across DM and MD, so it would
be worth checking your logs for the last mount of the filesystems to
confirm.
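
Something along these lines should show it (exact message wording varies
between kernel versions, so this is only a rough check):

# dmesg | grep -i barrier
# grep -i barrier /var/log/kern.log   # Ubuntu's default kernel log, if dmesg has rotated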

>> If it is not barriers that are causing this, then the other thing
>> you might want to look at is if XFS is configured with lazy-count=1
>> (xfs_info <mntpt>). If it is not enabled (0), then a significant
>> amount of latency could be coming from the superblock buffer being
>> locked during transaction commits. Unfortunately enabling this
>> feature is an offline operation (via xfs_admin) so enabling may not
>> be feasible for you.
>
> Currently it's "lazy-count=0", so I'll change that setting tonight.
>
> The behaviour I'm seeing right now causes things like "sync" to take a
> very long time. I just saw this as well (I've seen the same thing once
> when I ran "sync" manually):
>
> [210846.031875] INFO: task xfssyncd:3167 blocked for more than 120 seconds.
> [210846.031879] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [210846.031883] xfssyncd D 00000000ffffffff 0 3167 2 0x00000000
> [210846.031889] ffff880208569c50 0000000000000046 ffff880208569c10 0000000000015880
> [210846.031895] ffff88020b0ec7c0 0000000000015880 0000000000015880 0000000000015880
> [210846.031900] 0000000000015880 ffff88020b0ec7c0 0000000000015880 0000000000015880
> [210846.031905] Call Trace:
> [210846.031935] [<ffffffffa02767d5>] __xfs_iunpin_wait+0x95/0xd0 [xfs]
> [210846.031943] [<ffffffff810789c0>] ? autoremove_wake_function+0x0/0x40
> [210846.031964] [<ffffffffa0279dad>] xfs_iflush+0x8d/0x2b0 [xfs]
> [210846.031971] [<ffffffff815299cb>] ? __down_write+0xb/0x10
> [210846.031992] [<ffffffffa02a06f4>] xfs_reclaim_inode+0x114/0x180 [xfs]
> [210846.032013] [<ffffffffa02a07e2>] xfs_reclaim_inode_now+0x82/0x90 [xfs]
> [210846.032033] [<ffffffffa02a0760>] ? xfs_reclaim_inode_now+0x0/0x90 [xfs]
> [210846.032053] [<ffffffffa02a11d2>] xfs_inode_ag_walk+0x72/0xd0 [xfs]
> [210846.032073] [<ffffffffa02a0760>] ? xfs_reclaim_inode_now+0x0/0x90 [xfs]
> [210846.032094] [<ffffffffa02a1297>] xfs_inode_ag_iterator+0x67/0xa0 [xfs]
> [210846.032114] [<ffffffffa02a12e4>] xfs_reclaim_inodes+0x14/0x20 [xfs]
> [210846.032134] [<ffffffffa02a1349>] xfs_sync_worker+0x59/0x90 [xfs]
> [210846.032154] [<ffffffffa02a0a0a>] xfssyncd+0x17a/0x220 [xfs]
> [210846.032174] [<ffffffffa02a0890>] ? xfssyncd+0x0/0x220 [xfs]
> [210846.032179] [<ffffffff810785d6>] kthread+0xa6/0xb0
> [210846.032183] [<ffffffff810130ea>] child_rip+0xa/0x20
> [210846.032188] [<ffffffff81078530>] ? kthread+0x0/0xb0
> [210846.032191] [<ffffffff810130e0>] ? child_rip+0x0/0x20
>
> So something weird is going on; I don't really see why it wouldn't be
> able to write enough data in 120 seconds in an orderly fashion.

If it is blocked there, it is waiting for a log force to complete.
The lazy-count setting will make no difference to that. However,
this adds more weight to the barrier theory, because one of the
sources of barrier writes is writes triggered by log forces.

It would be instructive to gather the output of:

# echo w > /proc/sysrq-trigger

when IO is really slow, so we get a better idea of where things are
blocking. Running a few of these 30s apart will give a fair
indication of what is blocked and what is making progress....
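
A simple way to capture a few of them in one go (just a sketch; adjust
the count and the output file to taste):

# for i in 1 2 3 4; do echo w > /proc/sysrq-trigger; sleep 30; done   # 4 dumps, 30s apart
# dmesg > /tmp/blocked-tasks.txt                                      # example output file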

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-02-23 22:41:35

by Dave Chinner

Subject: Re: disk/crypto performance regression 2.6.31 -> 2.6.32 (mmap problem?)

On Tue, Feb 23, 2010 at 09:42:40PM +0100, Mikael Abrahamsson wrote:
> On Wed, 24 Feb 2010, Dave Chinner wrote:
>
> I thought I should send the dump only to you; would you like me to post
> to the list as well and include a link to the logfile? (I think
> attachments shouldn't be sent to lkml too often, right?) Feel free to
> post any information from this letter or attachment back onto lkml if
> you feel it's appropriate.

Attachments under roughly 100kb (IIRC) are fine for LKML. This is
allowed specifically to let people attach log files and configs.

I've re-added the lkml CC to continue the discussion there. I also
added the DM mailing list, so that they know directly that barriers
are causing _significant_ system slowdowns. This is important,
because there have been several reports of this problem to the XFS
forums since the start of the year as people upgrade their kernels.

>> Barriers are only recently supported across DM and MD, so it would be
> worth checking your logs for the last mount of the filesystems to
>> confirm.
>
> You're right, it doesn't say that anymore in 2.6.31, so I think I'm
> indeed running with barriers on.

And the stack traces confirm that. Every single blocked process
output set has this trace in it:

kdmflush D 0000000000000000 0 3082 2 0x00000000
ffff880208553d20 0000000000000046 ffff880210a5f640 0000000000015880
ffff8802241ac7c0 0000000000015880 0000000000015880 0000000000015880
0000000000015880 ffff8802241ac7c0 0000000000015880 0000000000015880
Call Trace:
[<ffffffff81527cf8>] io_schedule+0x28/0x40
[<ffffffff813e7ca7>] dm_wait_for_completion+0x87/0x150
[<ffffffff81053c50>] ? default_wake_function+0x0/0x10
[<ffffffff813e8617>] ? __split_and_process_bio+0x107/0x170
[<ffffffff813e86a5>] dm_flush+0x25/0x70
[<ffffffff813e8749>] process_barrier+0x59/0xf0
[<ffffffff813e881b>] dm_wq_work+0x3b/0x100
[<ffffffff813e87e0>] ? dm_wq_work+0x0/0x100
[<ffffffff81073695>] run_workqueue+0x95/0x170
[<ffffffff81073814>] worker_thread+0xa4/0x120
[<ffffffff810789c0>] ? autoremove_wake_function+0x0/0x40
[<ffffffff81073770>] ? worker_thread+0x0/0x120
[<ffffffff810785d6>] kthread+0xa6/0xb0
[<ffffffff810130ea>] child_rip+0xa/0x20
[<ffffffff81078530>] ? kthread+0x0/0xb0
[<ffffffff810130e0>] ? child_rip+0x0/0x20

That is, it appears that DM is blocking waiting for a barrier to
complete. Everything else is backed up waiting for IO completion to
occur...

>>> Currently it's "lazy-count=0", so I'll change that setting tonight.
>
> I didn't do this before the test I'm referring to now.
>
>> When IO is really slow so we get a better idea of where things are
>> blocking. Running a few of these 30s apart will give a fair indication
>> of what is blocked and what is making progress....
>
> Attached to this email, logfile from yesterday and today.
>
> Some interesting parts as well, that didn't trigger from the above sysrq
> stuff:
>
> Feb 22 19:56:16 ub kernel: [201245.583915] INFO: task md0_raid5:425 blocked for more than 120 seconds.
> Feb 22 22:36:16 ub kernel: [210846.031875] INFO: task xfssyncd:3167 blocked for more than 120 seconds.
> Feb 23 18:10:18 ub kernel: [281287.492499] INFO: task kdmflush:3082 blocked for more than 120 seconds.
> Feb 23 18:12:18 ub kernel: [281407.491041] INFO: task kdmflush:3082 blocked for more than 120 seconds.
> Feb 23 21:36:18 ub kernel: [293647.665917] INFO: task md0_raid5:425 blocked for more than 120 seconds.
>
> Didn't really think md0_raid5 could be blocked like that...

They are blocking in:

[<ffffffff813dc0e1>] md_super_wait+0xc1/0xe0
[<ffffffff810789c0>] ? autoremove_wake_function+0x0/0x40
[<ffffffff813dc4e0>] md_update_sb+0x230/0x3b0
[<ffffffff813de2f2>] md_check_recovery+0x272/0x590
[<ffffffffa0063621>] raid5d+0x21/0x320 [raid456]
[<ffffffff813dbd94>] md_thread+0x44/0x120

I'm assuming this is just taking a long time because of all the
other IO that is going on slowing this down. The other 3 (xfssyncd,
kdmflush) are all waiting for barrier IO to complete...

So it sounds like the immediate workaround (WAR) for your extremely slow
IO problem is to mount the XFS filesystems with "-o nobarrier", but keep
in mind this can result in corruption if a crash occurs.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-02-24 02:26:11

by James Cloos

Subject: Re: disk/crypto performance regression 2.6.31 -> 2.6.32 (mmap problem?)

M> You're right, it doesn't say that anymore in 2.6.31, so I think I'm
M> indeed running with barriers on.

When barriers were added to ext4, I saw a similar slowdown on lock- and
sync-heavy workloads.

Based on a recent thread on the ext4 list I've started using deadline
rather than cfq on that disk. There are some slowdowns on that disk's
other partition, but the overall throughput is significantly better than
using the combination of cfq, ext4 and barriers.

You might want to test out deadline and/or noop.

Cf: /sys/block/*/queue/scheduler
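
For example (sdX below is a placeholder for the disk in question; the
change is per-device and does not persist across reboots):

$ cat /sys/block/sdX/queue/scheduler              # current scheduler shown in brackets
# echo deadline > /sys/block/sdX/queue/scheduler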

-JimC
--
James Cloos <[email protected]> OpenPGP: 1024D/ED7DAEA6

2010-02-24 05:20:14

by Mikael Abrahamsson

Subject: Re: disk/crypto performance regression 2.6.31 -> 2.6.32 (mmap problem?)

On Tue, 23 Feb 2010, James Cloos wrote:

> Based on a recent thread on the ext4 list I've started using deadline
> rather than cfq on that disk. There are some slowdowns on that disk's
> other partition, but the overall throughput is significantly better than
> using the combination of cfq, ext4 and barriers.
>
> You might want to test out deadline and/or noop.
>
> Cf: /sys/block/*/queue/scheduler

I have been running deadline on the drives themselves for years. I've
tried both cfq and deadline in this case, and it doesn't really help.

Another question is what the recommended scheduler setup is when it comes
to my different layers: drive->md->crypto(dm)->lvm(dm). For now I have
only been changing the scheduler to deadline on the drive layer.

I guess the different layers don't really know that much about each
other? I can imagine scenarios where one wants to do most of the
scheduling on the lvm layer, keep the queueing on the other layers to a
minimum, and keep the queues there as small as possible, so the lvm layer
can do the proper re-ordering.
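
For what it's worth, this is how I check what each layer currently
reports (device names are from my setup; if I understand correctly, md
and the bio-based dm targets don't do request scheduling themselves, so
they should just report "none"):

$ for d in sdb md0 dm-2 dm-3; do echo -n "$d: "; cat /sys/block/$d/queue/scheduler; done   # devices from my setup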

Does anyone have thoughts to share on this? I don't have much experience
with this when it comes to block devices; I'm a network engineer, and I'm
trying to apply my experience with QoS/packet schedulers in different
layers, where, for instance, when one runs an IP QoS scheduler, one
doesn't want a lot of buffering on the underlying ATM layer, because it
makes the IP scheduler's job much harder.

--
Mikael Abrahamsson email: [email protected]