2010-02-19 17:16:22

by Mikael Abrahamsson

Subject: disk/crypto performance regression 2.6.31 -> 2.6.32 (mmap problem?)


I have a system with 8 gigs of ram, c2d 2.4GHz and md
raid6->crypto->lvm/lv->xfs that I've had for quite a while. It used to be
raid5, but with 2.6.32 I was able to restripe it into a raid6.

Under 2.6.32 (Linux ub 2.6.32-12-generic #16-Ubuntu SMP Thu Jan 28
07:47:05 UTC 2010 x86_64 GNU/Linux) I'm seeing the following in latencytop
(sometimes worse, up to several seconds) when running rtorrent (which uses
mmap):

Cause                            Maximum        Percentage
Page fault                       1580.4 msec    84.7 %
Writing a page to disk            869.7 msec     2.2 %
[xfs_ilock]                       232.9 msec     2.3 %
Truncating file                   110.1 msec     0.3 %
Waiting for TTY data               18.2 msec     0.0 %
Reading EXT3 indirect blocks       10.0 msec     0.1 %
Executing a program                 9.8 msec     0.0 %
Walking directory tree              8.7 msec     0.0 %
Waiting for event (select)          5.0 msec     1.8 %

swmike@ub:~$ free
             total       used       free     shared    buffers     cached
Mem:       8194328    6455004    1739324          0       2012    4343532
-/+ buffers/cache:    2109460    6084868
Swap:            0          0          0

iostat -x 5 gives this:

avg-cpu: %user %nice %system %iowait %steal %idle
0.61 0.00 4.36 63.29 0.00 31.74

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 35.80 59.00 20.80 21.00 457.60 646.40 26.41 0.13 3.21 2.06 8.60
sdc 33.20 74.40 21.60 26.40 491.20 814.40 27.20 0.23 4.88 3.29 15.80
sdd 24.20 96.60 21.00 32.80 457.60 1040.00 27.84 0.31 5.84 4.16 22.40
sde 25.60 94.80 18.80 33.60 420.80 1033.60 27.76 0.28 5.42 3.82 20.00
md0 0.00 0.00 5.00 97.20 393.60 2390.40 27.24 0.00 0.00 0.00 0.00
sdf 29.80 87.20 18.60 28.80 460.80 937.60 29.50 0.16 3.38 2.70 12.80
sdg 35.80 61.80 20.60 19.60 476.80 657.60 28.22 0.26 5.92 3.38 13.60
sdh 32.20 71.20 21.00 22.20 465.60 752.00 28.19 0.23 5.23 3.24 14.00
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 5.00 132.40 393.60 2390.40 20.26 3.32 22.39 7.26 99.80
dm-3 0.00 0.00 5.20 77.80 444.80 1841.60 27.55 3.32 37.08 12.05 100.00

swmike@ub:/dev/mapper$ ls -l
total 0
crw-rw---- 1 root root 10, 59 2010-02-05 17:57 control
brw-rw---- 1 root disk 252, 2 2010-02-15 13:55 md0_crypt
brw-rw---- 1 root disk 252, 0 2010-02-15 13:55 sdb1_crypt
brw-rw---- 1 root disk 252, 1 2010-02-15 13:55 ub-root
brw-rw---- 1 root disk 252, 3 2010-02-15 13:55 vg01-home

So from what I can gather, even though very little traffic is moving
through the crypt layer, it is still the bottleneck (when just copying a
single file I can easily get 50+ megabytes/s through the crypto layer).
Top indicates low CPU usage but of course still a high load average, since
rtorrent is "blocking" most of the time (unresponsive to interactive
input).

top - 17:34:54 up 14 days, 37 min, 6 users, load average: 3.60, 3.90, 3.89
Tasks: 237 total, 2 running, 235 sleeping, 0 stopped, 0 zombie
Cpu0 : 1.7%us, 8.6%sy, 0.0%ni, 55.1%id, 30.9%wa, 2.3%hi, 1.3%si, 0.0%st
Cpu1 : 1.4%us, 11.1%sy, 0.0%ni, 2.8%id, 81.9%wa, 0.0%hi, 2.8%si, 0.0%st
Mem: 8194328k total, 8143856k used, 50472k free, 2136k buffers
Swap: 0k total, 0k used, 0k free, 5625364k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9950 root 20 0 0 0 0 S 5 0.0 1012:29 md0_raid5
25684 root 20 0 0 0 0 R 9 0.0 342:58.88 kcryptd
22007 swmike 20 0 325m 170m 140m D 5 2.1 87:09.82 rtorrent
15543 swmike 20 0 124m 24m 3296 S 1 0.3 78:25.01 rtorrent
31346 swmike 9 -11 327m 5940 4200 S 0 0.1 43:12.09 pulseaudio

When I was running the 2.6.31 (Ubuntu 9.10) kernel I didn't see this
problem at all. Things have changed in that I'm now running raid6 instead
of raid5, but from the iostat output it looks like the disk+md layer is
not causing this behaviour (under 2.6.31 that layer was the bottleneck,
but it isn't anymore).

Here is an example of when I dd a large file instead; note the dm-2
behaviour:

avg-cpu: %user %nice %system %iowait %steal %idle
29.10 0.00 70.90 0.00 0.00 0.00

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 3.00 0.00 54.40 0.00 18.13 0.01 4.00 1.33 0.40
sdb 42.80 1990.20 3.40 75.40 369.60 16537.60 214.56 0.20 2.51 0.79 6.20
sdc 38.60 1993.40 3.20 75.20 334.40 16563.20 215.53 0.10 1.22 0.56 4.40
sdd 31.80 1990.40 2.20 75.00 272.00 16537.60 217.74 0.12 1.53 0.65 5.00
sde 31.80 1986.80 2.20 75.40 272.00 16512.00 216.29 0.25 3.20 1.11 8.60
md0 0.00 0.00 0.00 777.20 0.00 99225.60 127.67 0.00 0.00 0.00 0.00
sdf 28.80 1993.40 2.00 75.40 246.40 16563.20 217.18 0.06 0.80 0.52 4.00
sdg 36.80 1999.20 3.00 75.80 318.40 16612.80 214.86 0.56 7.13 1.73 13.60
sdh 17.60 1993.20 1.80 75.60 155.20 16563.20 216.00 0.02 0.26 0.21 1.60
dm-0 0.00 0.00 3.00 0.00 54.40 0.00 18.13 0.01 4.00 1.33 0.40
dm-1 0.00 0.00 3.00 0.00 54.40 0.00 18.13 0.01 4.00 1.33 0.40
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 14696.19 0.00 0.00 100.00
dm-3 0.00 0.00 0.00 9051.80 0.00 963836.80 106.48 14696.19 468.84 0.11 100.00
sdj 24.80 1999.00 2.20 76.00 216.00 16612.80 215.20 0.10 1.25 0.72 5.60

avg-cpu: %user %nice %system %iowait %steal %idle
24.82 0.00 73.27 1.91 0.00 0.00

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 1.20 0.00 24.00 0.00 20.00 0.00 3.33 3.33 0.40
sdb 3.00 3288.00 0.20 83.40 25.60 26972.80 322.95 0.16 1.91 1.89 15.80
sdc 6.00 3287.80 0.80 83.60 54.40 26972.80 320.23 0.19 2.25 2.13 18.00
sdd 7.20 3283.40 1.00 83.00 65.60 26932.80 321.41 0.29 3.40 2.45 20.60
sde 7.20 3283.80 1.00 82.60 65.60 26932.80 322.95 0.25 3.01 2.39 20.00
md0 0.00 0.00 0.00 3665.20 0.00 161574.40 44.08 0.00 0.00 0.00 0.00
sdf 1.20 3283.80 0.60 82.60 14.40 26932.80 323.88 0.21 2.50 2.33 19.40
sdg 3.00 3285.00 0.20 83.20 25.60 26947.20 323.41 0.29 3.43 2.45 20.40
sdh 1.20 3283.80 0.60 82.60 14.40 26932.80 323.88 0.20 2.43 2.28 19.00
dm-0 0.00 0.00 1.20 0.00 24.00 0.00 20.00 0.00 3.33 3.33 0.40
dm-1 0.00 0.00 1.20 0.00 24.00 0.00 20.00 0.00 3.33 3.33 0.40
dm-2 0.00 0.00 0.00 3872.40 0.00 334886.40 86.48 9989.29 6114.24 0.26 100.00
dm-3 0.00 0.00 0.00 2731.80 0.00 178907.60 65.49 9989.30 8667.10 0.37 100.00
sdj 4.20 3283.80 0.80 82.60 40.00 26932.80 323.41 0.18 2.16 2.13 17.80

avg-cpu: %user %nice %system %iowait %steal %idle
30.89 0.00 65.09 4.02 0.00 0.00

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 12.00 3141.00 0.80 78.20 102.40 25680.00 326.36 0.15 1.85 1.77 14.00
sdc 9.00 3141.20 0.60 78.20 76.80 25756.80 327.84 0.14 1.83 1.75 13.80
sdd 9.00 3140.80 0.60 78.60 76.80 25755.20 326.16 0.32 4.04 2.30 18.20
sde 9.00 3140.60 0.60 78.80 76.80 25755.20 325.34 0.27 3.38 2.14 17.00
md0 0.00 0.00 0.00 1685.80 0.00 154444.80 91.62 0.00 0.00 0.00 0.00
sdf 6.00 3141.00 0.40 78.40 51.20 25755.20 327.49 0.16 2.03 1.93 15.20
sdg 9.00 3142.60 0.60 78.40 76.80 25769.60 327.17 0.48 6.08 3.32 26.20
sdh 6.00 3141.20 0.40 77.80 51.20 25678.40 329.02 0.14 1.74 1.71 13.40
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 9797.52 0.00 0.00 100.00
dm-3 0.00 0.00 0.20 2035.60 1.60 135518.80 66.57 9797.52 3423.06 0.49 100.00
sdj 9.00 3142.40 0.60 78.60 76.80 25769.60 326.34 0.18 2.25 1.97 15.60

Any ideas? What more information can I collect to help look into this
issue? Currently I'm running with:

swmike@ub:~$ cat /etc/sysctl.conf | grep vm
vm.dirty_background_ratio = 1
vm.dirty_ratio = 60

Setting these to 10/30 or something similar doesn't really change the
behaviour with rtorrent. I've also mucked around with the I/O scheduler
(I'm running deadline right now) and don't see much difference between
the schedulers under 2.6.32 as far as this problem goes.
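
For reference, this is roughly how I change those values at runtime when
testing (the numbers below are only examples, not a recommendation):

# sysctl -w vm.dirty_background_ratio=10   # example value only
# sysctl -w vm.dirty_ratio=30              # example value only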

--
Mikael Abrahamsson email: [email protected]


2010-02-22 12:39:39

by Mikael Abrahamsson

Subject: Re: disk/crypto performance regression 2.6.31 -> 2.6.32 (mmap problem?)

On Fri, 19 Feb 2010, Mikael Abrahamsson wrote:

> I have a system with 8 gigs of ram, c2d 2.4GHz and md
> raid6->crypto->lvm/lv->xfs that I've had for quite a while. It used to be
> raid5, but with 2.6.32 I was able to restripe it into a raid6.

This is not a 2.6.32 regression; I verified during the weekend that the
behaviour is similar under 2.6.31.

It seems iostat isn't really able to show exactly where the bottleneck
is; my current guess is that it's actually the md raid6 layer that is
causing my performance problems, even though it doesn't look that way in
iostat.

--
Mikael Abrahamsson email: [email protected]

2010-02-23 04:48:24

by Dave Chinner

Subject: Re: disk/crypto performance regression 2.6.31 -> 2.6.32 (mmap problem?)

On Mon, Feb 22, 2010 at 01:39:37PM +0100, Mikael Abrahamsson wrote:
> On Fri, 19 Feb 2010, Mikael Abrahamsson wrote:
>
>> I have a system with 8 gigs of ram, c2d 2.4GHz and md
>> raid6->crypto->lvm/lv->xfs that I've had for quite a while. It used to
>> be raid5, but with 2.6.32 I was able to restripe it into a raid6.
>
> This is not a 2.6.32 regression; I verified during the weekend that the
> behaviour is similar under 2.6.31.
>
> It seems iostat isn't really able to show exactly where the bottleneck
> is; my current guess is that it's actually the md raid6 layer that is
> causing my performance problems, even though it doesn't look that way in
> iostat.

XFS issues IO barriers by default. They were recently enabled in md
for raid5/6, so that might be the cause of the slowdown/latency. You
could try using the "nobarrier" mount option to see if this makes the
problem go away, but beware that this can result in filesystem
corruption if the machine crashes.
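
A rough sketch of what that test could look like (the mount point below
is only an example; if the remount doesn't accept the option, unmount and
mount again instead):

# mount -o remount,nobarrier /mnt/data   # example mount point, adjust to yours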

If it is not barriers that are causing this, then the other thing
you might want to look at is if XFS is configured with lazy-count=1
(xfs_info <mntpt>). If it is not enabled (0), then a significant
amount of latency could be coming from the superblock buffer being
locked during transaction commits. Unfortunately enabling this
feature is an offline operation (via xfs_admin) so enabling may not
be feasible for you.
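
Roughly, something like this (the mount point and device below are only
examples based on your setup, adjust as needed):

# xfs_info /mnt/data | grep lazy-count     # check the current setting
# umount /mnt/data
# xfs_admin -c 1 /dev/mapper/vg01-home     # enable lazy-count
# mount /mnt/data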

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-02-23 07:24:29

by Mikael Abrahamsson

Subject: Re: disk/crypto performance regression 2.6.31 -> 2.6.32 (mmap problem?)

On Tue, 23 Feb 2010, Dave Chinner wrote:

> XFS issues IO barriers by default. They were recently enabled in md
> for raid5/6, so that might be the cause of the slowdown/latency. You
> could try using the "nobarrier" mount option to see if this makes the
> problem go away, but beware that this can result in filesystem
> corruption if the machine crashes.

I don't think the crypto layer supports it, so I think my barriers are
off; that's the message I'm used to seeing every time it mounts the
partition.

> If it is not barriers that are causing this, then the other thing
> you might want to look at is if XFS is configured with lazy-count=1
> (xfs_info <mntpt>). If it is not enabled (0), then a significant
> amount of latency could be coming from the superblock buffer being
> locked during transaction commits. Unfortunately enabling this
> feature is an offline operation (via xfs_admin) so enabling may not
> be feasible for you.

Currently it's "lazy-count=0", so I'll change that setting tonight.

The behaviour I'm seeing right now causes things like "sync" to take a
very long time. I just saw this as well (I've seen the same thing once
when I ran "sync" manually):

[210846.031875] INFO: task xfssyncd:3167 blocked for more than 120 seconds.
[210846.031879] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[210846.031883] xfssyncd D 00000000ffffffff 0 3167 2 0x00000000
[210846.031889] ffff880208569c50 0000000000000046 ffff880208569c10 0000000000015880
[210846.031895] ffff88020b0ec7c0 0000000000015880 0000000000015880 0000000000015880
[210846.031900] 0000000000015880 ffff88020b0ec7c0 0000000000015880 0000000000015880
[210846.031905] Call Trace:
[210846.031935] [<ffffffffa02767d5>] __xfs_iunpin_wait+0x95/0xd0 [xfs]
[210846.031943] [<ffffffff810789c0>] ? autoremove_wake_function+0x0/0x40
[210846.031964] [<ffffffffa0279dad>] xfs_iflush+0x8d/0x2b0 [xfs]
[210846.031971] [<ffffffff815299cb>] ? __down_write+0xb/0x10
[210846.031992] [<ffffffffa02a06f4>] xfs_reclaim_inode+0x114/0x180 [xfs]
[210846.032013] [<ffffffffa02a07e2>] xfs_reclaim_inode_now+0x82/0x90 [xfs]
[210846.032033] [<ffffffffa02a0760>] ? xfs_reclaim_inode_now+0x0/0x90 [xfs]
[210846.032053] [<ffffffffa02a11d2>] xfs_inode_ag_walk+0x72/0xd0 [xfs]
[210846.032073] [<ffffffffa02a0760>] ? xfs_reclaim_inode_now+0x0/0x90 [xfs]
[210846.032094] [<ffffffffa02a1297>] xfs_inode_ag_iterator+0x67/0xa0 [xfs]
[210846.032114] [<ffffffffa02a12e4>] xfs_reclaim_inodes+0x14/0x20 [xfs]
[210846.032134] [<ffffffffa02a1349>] xfs_sync_worker+0x59/0x90 [xfs]
[210846.032154] [<ffffffffa02a0a0a>] xfssyncd+0x17a/0x220 [xfs]
[210846.032174] [<ffffffffa02a0890>] ? xfssyncd+0x0/0x220 [xfs]
[210846.032179] [<ffffffff810785d6>] kthread+0xa6/0xb0
[210846.032183] [<ffffffff810130ea>] child_rip+0xa/0x20
[210846.032188] [<ffffffff81078530>] ? kthread+0x0/0xb0
[210846.032191] [<ffffffff810130e0>] ? child_rip+0x0/0x20

So something weird is going on; I don't really see why it wouldn't be
able to write enough data in 120 seconds in an orderly fashion.

--
Mikael Abrahamsson email: [email protected]

2010-02-23 13:32:26

by Dave Chinner

Subject: Re: disk/crypto performance regression 2.6.31 -> 2.6.32 (mmap problem?)

On Tue, Feb 23, 2010 at 08:24:26AM +0100, Mikael Abrahamsson wrote:
> On Tue, 23 Feb 2010, Dave Chinner wrote:
>
>> XFS issues IO barriers by default. They were recently enabled in md
>> for raid5/6, so that might be the cause of the slowdown/latency. You
>> could try using the "nobarrier" mount option to see if this makes the
>> problem go away, but beware that this can result in filesystem
>> corruption if the machine crashes.
>
> I don't think the crypto layer supports it, so I think my barriers are
> off; that's the message I'm used to seeing every time it mounts the
> partition.

Barriers are only recently supported across DM and MD, so it would
be worth checking your logs for the last mount of the filesystems to
confirm.
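
Something along these lines should show it (exact message wording varies
between kernel versions, so this is only a rough check):

# dmesg | grep -i barrier
# grep -i barrier /var/log/kern.log   # Ubuntu's default kernel log, if dmesg has rotated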

>> If it is not barriers that are causing this, then the other thing
>> you might want to look at is if XFS is configured with lazy-count=1
>> (xfs_info <mntpt>). If it is not enabled (0), then a significant
>> amount of latency could be coming from the superblock buffer being
>> locked during transaction commits. Unfortunately enabling this
>> feature is an offline operation (via xfs_admin) so enabling may not
>> be feasible for you.
>
> Currently it's "lazy-count=0", so I'll change that setting tonight.
>
> The behaviour I'm seeing right now causes things like "sync" to take a
> very long time. I just saw this as well (I've seen the same thing once
> when I ran "sync" manually):
>
> [210846.031875] INFO: task xfssyncd:3167 blocked for more than 120 seconds.
> [210846.031879] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [210846.031883] xfssyncd D 00000000ffffffff 0 3167 2 0x00000000
> [210846.031889] ffff880208569c50 0000000000000046 ffff880208569c10 0000000000015880
> [210846.031895] ffff88020b0ec7c0 0000000000015880 0000000000015880 0000000000015880
> [210846.031900] 0000000000015880 ffff88020b0ec7c0 0000000000015880 0000000000015880
> [210846.031905] Call Trace:
> [210846.031935] [<ffffffffa02767d5>] __xfs_iunpin_wait+0x95/0xd0 [xfs]
> [210846.031943] [<ffffffff810789c0>] ? autoremove_wake_function+0x0/0x40
> [210846.031964] [<ffffffffa0279dad>] xfs_iflush+0x8d/0x2b0 [xfs]
> [210846.031971] [<ffffffff815299cb>] ? __down_write+0xb/0x10
> [210846.031992] [<ffffffffa02a06f4>] xfs_reclaim_inode+0x114/0x180 [xfs]
> [210846.032013] [<ffffffffa02a07e2>] xfs_reclaim_inode_now+0x82/0x90 [xfs]
> [210846.032033] [<ffffffffa02a0760>] ? xfs_reclaim_inode_now+0x0/0x90 [xfs]
> [210846.032053] [<ffffffffa02a11d2>] xfs_inode_ag_walk+0x72/0xd0 [xfs]
> [210846.032073] [<ffffffffa02a0760>] ? xfs_reclaim_inode_now+0x0/0x90 [xfs]
> [210846.032094] [<ffffffffa02a1297>] xfs_inode_ag_iterator+0x67/0xa0 [xfs]
> [210846.032114] [<ffffffffa02a12e4>] xfs_reclaim_inodes+0x14/0x20 [xfs]
> [210846.032134] [<ffffffffa02a1349>] xfs_sync_worker+0x59/0x90 [xfs]
> [210846.032154] [<ffffffffa02a0a0a>] xfssyncd+0x17a/0x220 [xfs]
> [210846.032174] [<ffffffffa02a0890>] ? xfssyncd+0x0/0x220 [xfs]
> [210846.032179] [<ffffffff810785d6>] kthread+0xa6/0xb0
> [210846.032183] [<ffffffff810130ea>] child_rip+0xa/0x20
> [210846.032188] [<ffffffff81078530>] ? kthread+0x0/0xb0
> [210846.032191] [<ffffffff810130e0>] ? child_rip+0x0/0x20
>
> So something weird is going on; I don't really see why it wouldn't be
> able to write enough data in 120 seconds in an orderly fashion.

If it is blocked there, it is waiting for a log force to complete.
The lazy-count setting will make no difference to that. However,
this adds more weight to the barrier theory, because one of the
sources of barrier writes is writes triggered by log forces.

It would be instructive to gather the output of:

# echo w > /proc/sysrq-trigger

when IO is really slow, so we get a better idea of where things are
blocking. Running a few of these 30s apart will give a fair
indication of what is blocked and what is making progress....
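
A simple way to capture a few of them in one go (just a sketch; adjust
the count and the output file to taste):

# for i in 1 2 3 4; do echo w > /proc/sysrq-trigger; sleep 30; done   # 4 dumps, 30s apart
# dmesg > /tmp/blocked-tasks.txt                                      # example output file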

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-02-23 22:41:35

by Dave Chinner

Subject: Re: disk/crypto performance regression 2.6.31 -> 2.6.32 (mmap problem?)

On Tue, Feb 23, 2010 at 09:42:40PM +0100, Mikael Abrahamsson wrote:
> On Wed, 24 Feb 2010, Dave Chinner wrote:
>
> I thought I should send the dump only to you; would you like me to post
> to the list as well and include a link to the logfile? (I think
> attachments shouldn't be sent to lkml too often, right?) Feel free to
> post any information from this letter or attachment back onto lkml if
> you feel it's appropriate.

Attachments under roughly 100kb (IIRC) are fine for LKML. This is
allowed specifically to let people attach log files and configs.

I've re-added the lkml CC to continue the discussion there. I also
added the DM mailing list, so that they know directly that barriers
are causing _significant_ system slowdowns. This is important,
because there have been several reports of this problem to the XFS
forums since the start of the year as people upgrade their kernels.

>> Barriers are only recently supported across DM and MD, so it would be
> worth checking your logs for the last mount of the filesystems to
>> confirm.
>
> You're right, it doesn't say that anymore in 2.6.31, so I think I'm
> indeed running with barriers on.

And the stack traces confirm that. Every single blocked process
output set has this trace in it:

kdmflush D 0000000000000000 0 3082 2 0x00000000
ffff880208553d20 0000000000000046 ffff880210a5f640 0000000000015880
ffff8802241ac7c0 0000000000015880 0000000000015880 0000000000015880
0000000000015880 ffff8802241ac7c0 0000000000015880 0000000000015880
Call Trace:
[<ffffffff81527cf8>] io_schedule+0x28/0x40
[<ffffffff813e7ca7>] dm_wait_for_completion+0x87/0x150
[<ffffffff81053c50>] ? default_wake_function+0x0/0x10
[<ffffffff813e8617>] ? __split_and_process_bio+0x107/0x170
[<ffffffff813e86a5>] dm_flush+0x25/0x70
[<ffffffff813e8749>] process_barrier+0x59/0xf0
[<ffffffff813e881b>] dm_wq_work+0x3b/0x100
[<ffffffff813e87e0>] ? dm_wq_work+0x0/0x100
[<ffffffff81073695>] run_workqueue+0x95/0x170
[<ffffffff81073814>] worker_thread+0xa4/0x120
[<ffffffff810789c0>] ? autoremove_wake_function+0x0/0x40
[<ffffffff81073770>] ? worker_thread+0x0/0x120
[<ffffffff810785d6>] kthread+0xa6/0xb0
[<ffffffff810130ea>] child_rip+0xa/0x20
[<ffffffff81078530>] ? kthread+0x0/0xb0
[<ffffffff810130e0>] ? child_rip+0x0/0x20

That is, it appears that DM is blocking waiting for a barrier to
complete. Everything else is backed up waiting for IO completion to
occur...

>>> Currently it's "lazy-count=0", so I'll change that setting tonight.
>
> I didn't do this before the test I'm referring to now.
>
>> When IO is really slow so we get a better idea of where things are
>> blocking. Running a few of these 30s apart will give a fair indication
>> of what is blocked and what is making progress....
>
> Attached to this email, logfile from yesterday and today.
>
> Some interesting parts as well, that didn't trigger from the above sysrq
> stuff:
>
> Feb 22 19:56:16 ub kernel: [201245.583915] INFO: task md0_raid5:425 blocked for more than 120 seconds.
> Feb 22 22:36:16 ub kernel: [210846.031875] INFO: task xfssyncd:3167 blocked for more than 120 seconds.
> Feb 23 18:10:18 ub kernel: [281287.492499] INFO: task kdmflush:3082 blocked for more than 120 seconds.
> Feb 23 18:12:18 ub kernel: [281407.491041] INFO: task kdmflush:3082 blocked for more than 120 seconds.
> Feb 23 21:36:18 ub kernel: [293647.665917] INFO: task md0_raid5:425 blocked for more than 120 seconds.
>
> Didn't really think md0_raid5 could be blocked like that...

They are blocking in:

[<ffffffff813dc0e1>] md_super_wait+0xc1/0xe0
[<ffffffff810789c0>] ? autoremove_wake_function+0x0/0x40
[<ffffffff813dc4e0>] md_update_sb+0x230/0x3b0
[<ffffffff813de2f2>] md_check_recovery+0x272/0x590
[<ffffffffa0063621>] raid5d+0x21/0x320 [raid456]
[<ffffffff813dbd94>] md_thread+0x44/0x120

I'm assuming this is just taking a long time because of all the
other IO that is going on slowing this down. The other 3 (xfssyncd,
kdmflush) are all waiting for barrier IO to complete...

So it sounds like the immediate workaround (WAR) for your extremely slow
IO problem is to mount the XFS filesystems with "-o nobarrier", but keep
in mind this can result in corruption if a crash occurs.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-02-24 02:26:11

by James Cloos

Subject: Re: disk/crypto performance regression 2.6.31 -> 2.6.32 (mmap problem?)

M> You're right, it doesn't say that anymore in 2.6.31, so I think I'm
M> indeed running with barriers on.

When barriers were added to ext4, I saw a similar slowdown on lock- and
sync-heavy workloads.

Based on a recent thread on the ext4 list I've started using deadline
rather than cfq on that disk. There are some slowdowns on that disk's
other partition, but the overall throughput is significantly better than
using the combination of cfq, ext4 and barriers.

You might want to test out deadline and/or noop.

Cf: /sys/block/*/queue/scheduler
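
For example (sdX below is a placeholder for the disk in question; the
change is per-device and does not persist across reboots):

$ cat /sys/block/sdX/queue/scheduler              # current scheduler shown in brackets
# echo deadline > /sys/block/sdX/queue/scheduler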

-JimC
--
James Cloos <[email protected]> OpenPGP: 1024D/ED7DAEA6

2010-02-24 05:20:14

by Mikael Abrahamsson

Subject: Re: disk/crypto performance regression 2.6.31 -> 2.6.32 (mmap problem?)

On Tue, 23 Feb 2010, James Cloos wrote:

> Based on a recent thread on the ext4 list I've started using deadline
> rather than cfq on that disk. There are some slowdowns on that disk's
> other partition, but the overall throughput is significantly better than
> using the combination of cfq, ext4 and barriers.
>
> You might want to test out deadline and/or noop.
>
> Cf: /sys/block/*/queue/scheduler

I have been running deadline on the drives themselves for years. I've
tried both cfq and deadline in this case, and it doesn't really help.

Another question is what the recommended scheduler setup is when it comes
to my different layers: drive->md->crypto(dm)->lvm(dm). For now I have
only been changing the scheduler to deadline on the drive layer.

I guess the different layers don't really know that much about each
other? I can imagine scenarios where one wants to do most of the
scheduling on the lvm layer, keep the queueing on the other layers to a
minimum, and keep the queues there as small as possible, so the lvm layer
can do the proper re-ordering.
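
For what it's worth, this is how I check what each layer currently
reports (device names are from my setup; if I understand correctly, md
and the bio-based dm targets don't do request scheduling themselves, so
they should just report "none"):

$ for d in sdb md0 dm-2 dm-3; do echo -n "$d: "; cat /sys/block/$d/queue/scheduler; done   # devices from my setup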

Does anyone have thoughts to share on this? I don't have much experience
with this when it comes to block devices; I'm a network engineer, and I'm
trying to apply my experience with QoS/packet schedulers in different
layers, where, for instance, when one runs an IP QoS scheduler, one
doesn't want a lot of buffering on the underlying ATM layer, because it
makes the IP scheduler's job much harder.

--
Mikael Abrahamsson email: [email protected]