2016-05-31 14:09:25

by Eryu Guan

Subject: xfstests generic/130 hang with non-4k block size ext4 on 4.7-rc1 kernel

Hi,

I noticed that generic/130 hangs starting from the 4.7-rc1 kernel on non-4k
block size ext4 (x86_64 host), and I bisected it to commit 06bd3c36a733
("ext4: fix data exposure after a crash").

It's the "Small Vector Sync" sub-test in generic/130 that hangs the kernel,
and I can reproduce it on different hosts, both bare metal and KVM
guests.

Thanks,
Eryu

P.S-1: a slightly simplified reproducer

#!/bin/bash
dev=/dev/sda5
mnt=/mnt/ext4

mkfs -t ext4 -b 1024 $dev
mount $dev $mnt

echo "abcdefghijklmnopqrstuvwxyz" > $mnt/testfile
xfs_io -f -s \
	-c "pread -v 0 1" -c "pread -v 1 1" -c "pread -v 2 1" -c "pread -v 3 1" \
	-c "pread -v 4 1" -c "pread -v 5 1" -c "pread -v 6 1" -c "pread -v 7 1" \
	-c "pread -v 8 1" -c "pread -v 9 1" -c "pread -v 10 1" -c "pread -v 11 1" \
	-c "pread -v 12 1" -c "pread -v 13 13" \
	-c "pwrite -S 0x61 4090 1" -c "pwrite -S 0x62 4091 1" \
	-c "pwrite -S 0x63 4092 1" -c "pwrite -S 0x64 4093 1" \
	-c "pwrite -S 0x65 4094 1" -c "pwrite -S 0x66 4095 1" \
	-c "pwrite -S 0x67 4096 1" -c "pwrite -S 0x68 4097 1" \
	-c "pwrite -S 0x69 4098 1" -c "pwrite -S 0x6A 4099 1" \
	-c "pwrite -S 0x6B 4100 1" -c "pwrite -S 0x6C 4101 1" \
	-c "pwrite -S 0x6D 4102 1" -c "pwrite -S 0x6E 4103 1" \
	-c "pwrite -S 0x6F 4104 1" -c "pwrite -S 0x70 4105 1" \
	-c "pread -v 4090 4" -c "pread -v 4094 4" \
	-c "pread -v 4098 4" -c "pread -v 4102 4" \
	-c "pwrite -S 0x61 10000000000 1" -c "pwrite -S 0x62 10000000001 1" \
	-c "pwrite -S 0x63 10000000002 1" -c "pwrite -S 0x64 10000000003 1" \
	-c "pwrite -S 0x65 10000000004 1" -c "pwrite -S 0x66 10000000005 1" \
	-c "pwrite -S 0x67 10000000006 1" -c "pwrite -S 0x68 10000000007 1" \
	-c "pwrite -S 0x69 10000000008 1" -c "pwrite -S 0x6A 10000000009 1" \
	-c "pwrite -S 0x6B 10000000010 1" -c "pwrite -S 0x6C 10000000011 1" \
	-c "pwrite -S 0x6D 10000000012 1" -c "pwrite -S 0x6E 10000000013 1" \
	-c "pwrite -S 0x6F 10000000014 1" -c "pwrite -S 0x70 10000000015 1" \
	-c "pread -v 10000000000 4" -c "pread -v 10000000004 4" \
	-c "pread -v 10000000008 4" -c "pread -v 10000000012 4" \
	$mnt/testfile


P.S-2: sysrq-w output
[43360.261177] sysrq: SysRq : Show Blocked State
[43360.265588] task PC stack pid father
[43360.271579] jbd2/sda5-8 D ffff880225d3b9e8 0 21723 2 0x00000080
[43360.278718] ffff880225d3b9e8 0000000000000000 ffff88022695bd80 0000000000002000
[43360.286229] ffff880225d3c000 0000000000000000 7fffffffffffffff ffff88022ffaa790
[43360.293741] ffffffff816c2f50 ffff880225d3ba00 ffffffff816c26e5 ffff88022fc17ec0
[43360.301268] Call Trace:
[43360.303737] [<ffffffff816c2f50>] ? bit_wait+0x50/0x50
[43360.308900] [<ffffffff816c26e5>] schedule+0x35/0x80
[43360.313884] [<ffffffff816c5691>] schedule_timeout+0x231/0x2d0
[43360.319733] [<ffffffff81318ad0>] ? queue_unplugged+0xa0/0xb0
[43360.325505] [<ffffffff810fc44c>] ? ktime_get+0x3c/0xb0
[43360.330739] [<ffffffff816c2f50>] ? bit_wait+0x50/0x50
[43360.335895] [<ffffffff816c1fb6>] io_schedule_timeout+0xa6/0x110
[43360.341917] [<ffffffff816c2f6b>] bit_wait_io+0x1b/0x60
[43360.347161] [<ffffffff816c2b10>] __wait_on_bit+0x60/0x90
[43360.352580] [<ffffffff8119025e>] wait_on_page_bit+0xce/0xf0
[43360.358256] [<ffffffff810cd1c0>] ? autoremove_wake_function+0x40/0x40
[43360.364798] [<ffffffff8119037f>] __filemap_fdatawait_range+0xff/0x180
[43360.371341] [<ffffffff8131b127>] ? submit_bio+0x77/0x150
[43360.376758] [<ffffffff81312b9b>] ? bio_alloc_bioset+0x1ab/0x2d0
[43360.382782] [<ffffffffa06bffa9>] ? jbd2_journal_write_metadata_buffer+0x279/0x430 [jbd2]
[43360.390973] [<ffffffff81190414>] filemap_fdatawait_range+0x14/0x30
[43360.397264] [<ffffffff81190453>] filemap_fdatawait+0x23/0x30
[43360.403032] [<ffffffffa06b7787>] jbd2_journal_commit_transaction+0x677/0x1860 [jbd2]
[43360.410881] [<ffffffff81036bb9>] ? sched_clock+0x9/0x10
[43360.416195] [<ffffffff8102c6d9>] ? __switch_to+0x219/0x5c0
[43360.421795] [<ffffffffa06bcd5a>] kjournald2+0xca/0x260 [jbd2]
[43360.427649] [<ffffffff810cd180>] ? prepare_to_wait_event+0xf0/0xf0
[43360.433936] [<ffffffffa06bcc90>] ? commit_timeout+0x10/0x10 [jbd2]
[43360.440215] [<ffffffff810a92b8>] kthread+0xd8/0xf0
[43360.445105] [<ffffffff816c663f>] ret_from_fork+0x1f/0x40
[43360.450519] [<ffffffff810a91e0>] ? kthread_park+0x60/0x60
[43360.456025] xfs_io D ffff880082503960 0 21895 21474 0x00000080
[43360.463145] ffff880082503960 0000000000000246 ffff880220805200 ffff880225d5f088
[43360.470681] ffff880082504000 0000000000000012 ffff880225d5f088 ffff880225d5f024
[43360.478186] ffff8800825039a8 ffff880082503978 ffffffff816c26e5 ffff880225d5f000
[43360.485707] Call Trace:
[43360.488176] [<ffffffff816c26e5>] schedule+0x35/0x80
[43360.493162] [<ffffffffa06bc899>] jbd2_log_wait_commit+0xa9/0x130 [jbd2]
[43360.499877] [<ffffffff810cd180>] ? prepare_to_wait_event+0xf0/0xf0
[43360.506163] [<ffffffffa06b560c>] jbd2_journal_stop+0x38c/0x3e0 [jbd2]
[43360.512731] [<ffffffffa07337fc>] __ext4_journal_stop+0x3c/0xa0 [ext4]
[43360.519278] [<ffffffffa0703bce>] ext4_writepages+0x8ce/0xd70 [ext4]
[43360.525660] [<ffffffff8119e8ae>] do_writepages+0x1e/0x30
[43360.531068] [<ffffffff81192996>] __filemap_fdatawrite_range+0xc6/0x100
[43360.537699] [<ffffffff81192b01>] filemap_write_and_wait_range+0x41/0x90
[43360.544420] [<ffffffffa06fa971>] ext4_sync_file+0xb1/0x320 [ext4]
[43360.550619] [<ffffffff8124ca7d>] vfs_fsync_range+0x3d/0xb0
[43360.556223] [<ffffffffa06f9fad>] ext4_file_write_iter+0x22d/0x330 [ext4]
[43360.563031] [<ffffffff811937b7>] ? generic_file_read_iter+0x627/0x7b0
[43360.569569] [<ffffffff812180b3>] __vfs_write+0xe3/0x160
[43360.574888] [<ffffffff81219302>] vfs_write+0xb2/0x1b0
[43360.580046] [<ffffffff8121a8f7>] SyS_pwrite64+0x87/0xb0
[43360.585366] [<ffffffff81003b12>] do_syscall_64+0x62/0x110
[43360.590869] [<ffffffff816c64e1>] entry_SYSCALL64_slow_path+0x25/0x25


2016-05-31 15:40:22

by Theodore Ts'o

Subject: Re: xfstests generic/130 hang with non-4k block size ext4 on 4.7-rc1 kernel

On Tue, May 31, 2016 at 10:09:22PM +0800, Eryu Guan wrote:
>
> I noticed that generic/130 hangs starting from 4.7-rc1 kernel, on non-4k
> block size ext4 (x86_64 host). And I bisected to commit 06bd3c36a733
> ("ext4: fix data exposure after a crash").
>
> It's the sub-test "Small Vector Sync" in generic/130 hangs the kernel,
> and I can reproduce it on different hosts, both bare metal and kvm
> guest.

Hmm, it's not reproducing for me, either with your simplified reproducer
or with generic/130. Is there perhaps something specific in your kernel
config that is needed for reproduction?

- Ted

FSTESTVER: e2fsprogs v1.43-25-ge2406b9 (Wed, 25 May 2016 00:30:42 -0400)
FSTESTVER: fio fio-2.6-8-ge6989e1 (Thu, 4 Feb 2016 12:09:48 -0700)
FSTESTVER: quota 67fd9cc (Mon, 4 Apr 2016 00:32:39 -0400)
FSTESTVER: xfsprogs v4.3.0 (Mon, 23 Nov 2015 15:24:24 +1100)
FSTESTVER: xfstests-bld ccae8d1 (Tue, 26 Apr 2016 00:42:18 -0400)
FSTESTVER: xfstests linux-v3.8-1036-gd22e675 (Wed, 25 May 2016 00:58:35 -0400)
FSTESTVER: kernel 4.7.0-rc1-ext4 #293 SMP Tue May 31 11:31:24 EDT 2016 x86_64
FSTESTCFG: "1k"
FSTESTSET: "generic/130"
FSTESTEXC: ""
FSTESTOPT: "aex"
MNTOPTS: ""
CPUS: "2"
MEM: "2007.43"
total used free shared buffers cached
Mem: 2007 93 1913 9 3 29
-/+ buffers/cache: 60 1946
Swap: 0 0 0
BEGIN TEST 1k: Ext4 1k block Tue May 31 11:33:55 EDT 2016
DEVICE: /dev/vdd
MK2FS OPTIONS: -q -b 1024
MOUNT OPTIONS: -o block_validity
FSTYP -- ext4
PLATFORM -- Linux/x86_64 kvm-xfstests 4.7.0-rc1-ext4
MKFS_OPTIONS -- -q -b 1024 /dev/vdc
MOUNT_OPTIONS -- -o acl,user_xattr -o block_validity /dev/vdc /vdc

generic/130 [11:33:55]run fstests generic/130 at 2016-05-31 11:33:55
[11:33:59] 4s
Ran: generic/130
Passed all 1 tests

total used free shared buffers cached
Mem: 2007 71 1936 9 0 16
-/+ buffers/cache: 53 1953
Swap: 0 0 0
END TEST: Ext4 1k block Tue May 31 11:33:59 EDT 2016
reboot: Power down

2016-06-01 06:38:33

by Eryu Guan

Subject: Re: xfstests generic/130 hang with non-4k block size ext4 on 4.7-rc1 kernel

On Tue, May 31, 2016 at 11:40:17AM -0400, Theodore Ts'o wrote:
> On Tue, May 31, 2016 at 10:09:22PM +0800, Eryu Guan wrote:
> >
> > I noticed that generic/130 hangs starting from 4.7-rc1 kernel, on non-4k
> > block size ext4 (x86_64 host). And I bisected to commit 06bd3c36a733
> > ("ext4: fix data exposure after a crash").
> >
> > It's the sub-test "Small Vector Sync" in generic/130 hangs the kernel,
> > and I can reproduce it on different hosts, both bare metal and kvm
> > guest.
>
> Hmm, it's not reproducing for me, either using your simplified repro
> or generic/130. Is there something specific with your kernel config,
> which is needed for the reproduction, perhaps?

That's weird; it's easily reproduced for me on different hosts/guests.
The kernel config I'm using is based on the config from the RHEL 7.2
kernel, leaving all new config options at their default choices, i.e.

cp /boot/<config-rhel7.2> ./.config && yes "" | make oldconfig && make

I attached my kernel config file.

My test VM has 8G of memory and 4 vcpus, with RHEL 7.2 installed running
an upstream kernel; the host is RHEL 6.7. xfsprogs version 3.2.2 (shipped
with RHEL 7.2) and version 4.5.0 (compiled from upstream) made no difference.

I think I can try configs from other vendors such as SuSE and Ubuntu. If
you can share your config file, I'll test it as well.

Thanks,
Eryu

>
> - Ted
>
> FSTESTVER: e2fsprogs v1.43-25-ge2406b9 (Wed, 25 May 2016 00:30:42 -0400)
> FSTESTVER: fio fio-2.6-8-ge6989e1 (Thu, 4 Feb 2016 12:09:48 -0700)
> FSTESTVER: quota 67fd9cc (Mon, 4 Apr 2016 00:32:39 -0400)
> FSTESTVER: xfsprogs v4.3.0 (Mon, 23 Nov 2015 15:24:24 +1100)
> FSTESTVER: xfstests-bld ccae8d1 (Tue, 26 Apr 2016 00:42:18 -0400)
> FSTESTVER: xfstests linux-v3.8-1036-gd22e675 (Wed, 25 May 2016 00:58:35 -0400)
> FSTESTVER: kernel 4.7.0-rc1-ext4 #293 SMP Tue May 31 11:31:24 EDT 2016 x86_64
> FSTESTCFG: "1k"
> FSTESTSET: "generic/130"
> FSTESTEXC: ""
> FSTESTOPT: "aex"
> MNTOPTS: ""
> CPUS: "2"
> MEM: "2007.43"
> total used free shared buffers cached
> Mem: 2007 93 1913 9 3 29
> -/+ buffers/cache: 60 1946
> Swap: 0 0 0
> BEGIN TEST 1k: Ext4 1k block Tue May 31 11:33:55 EDT 2016
> DEVICE: /dev/vdd
> MK2FS OPTIONS: -q -b 1024
> MOUNT OPTIONS: -o block_validity
> FSTYP -- ext4
> PLATFORM -- Linux/x86_64 kvm-xfstests 4.7.0-rc1-ext4
> MKFS_OPTIONS -- -q -b 1024 /dev/vdc
> MOUNT_OPTIONS -- -o acl,user_xattr -o block_validity /dev/vdc /vdc
>
> generic/130 [11:33:55]run fstests generic/130 at 2016-05-31 11:33:55
> [11:33:59] 4s
> Ran: generic/130
> Passed all 1 tests
>
> total used free shared buffers cached
> Mem: 2007 71 1936 9 0 16
> -/+ buffers/cache: 53 1953
> Swap: 0 0 0
> END TEST: Ext4 1k block Tue May 31 11:33:59 EDT 2016
> reboot: Power down


Attachments:
(No filename) (3.14 kB)
kernel-config.bz2 (31.88 kB)

2016-06-01 13:53:28

by Theodore Ts'o

Subject: Re: xfstests generic/130 hang with non-4k block size ext4 on 4.7-rc1 kernel

On Wed, Jun 01, 2016 at 02:38:22PM +0800, Eryu Guan wrote:
> I think I can try configs from other venders such as SuSE, Ubuntu. If
> you can share your config file I'll test it as well.

Here are the configs I'm using. The 32-bit config is used with KVM.
The 64-bit config is used with KVM and Google Compute Engine
(http://thunk.org/gce-xfstests).

Cheers,

- Ted


Attachments:
(No filename) (372.00 B)
ext4-32.config.bz2 (19.89 kB)
ext4-64.config.bz2 (13.31 kB)

2016-06-01 16:58:12

by Eryu Guan

Subject: Re: xfstests generic/130 hang with non-4k block size ext4 on 4.7-rc1 kernel

On Wed, Jun 01, 2016 at 02:38:22PM +0800, Eryu Guan wrote:
> On Tue, May 31, 2016 at 11:40:17AM -0400, Theodore Ts'o wrote:
> > On Tue, May 31, 2016 at 10:09:22PM +0800, Eryu Guan wrote:
> > >
> > > I noticed that generic/130 hangs starting from 4.7-rc1 kernel, on non-4k
> > > block size ext4 (x86_64 host). And I bisected to commit 06bd3c36a733
> > > ("ext4: fix data exposure after a crash").
> > >
> > > It's the sub-test "Small Vector Sync" in generic/130 hangs the kernel,
> > > and I can reproduce it on different hosts, both bare metal and kvm
> > > guest.
> >
> > Hmm, it's not reproducing for me, either using your simplified repro
> > or generic/130. Is there something specific with your kernel config,
> > which is needed for the reproduction, perhaps?
>
> That's weird, it's easily reproduced for me on different hosts/guests.
> The kernel config I'm using is based on the config from RHEL7.2 kernel,
> leaving all new config options to their default choices. i.e
>
> cp /boot/<config-rhel7.2> ./.config && yes "" | make oldconfig && make
>
> I attached my kernel config file.
>
> And my test vm has 8G memory & 4 vcpus, with RHEL7.2 installed running
> upstream kernel, host is RHEL6.7. xfsprogs version 3.2.2 (shipped with
> RHEL7.2) and version 4.5.0 (compiled from upstream) made no difference.
>
> I think I can try configs from other venders such as SuSE, Ubuntu. If
> you can share your config file I'll test it as well.

I've tried the kernel config from Ubuntu 16.04, and I can reproduce the
hang with it as well. If I add the "-o data=journal" or "-o data=writeback"
mount option, I don't see the hang. So it seems it only happens in
data=ordered mode, which matches the code change in commit 06bd3c36a733,
I think.

I had some trouble booting the kernel compiled with your config file;
I'll dig into it more tomorrow.

Thanks,
Eryu

2016-06-02 08:58:43

by Jan Kara

Subject: Re: xfstests generic/130 hang with non-4k block size ext4 on 4.7-rc1 kernel

On Thu 02-06-16 00:58:00, Eryu Guan wrote:
> On Wed, Jun 01, 2016 at 02:38:22PM +0800, Eryu Guan wrote:
> > On Tue, May 31, 2016 at 11:40:17AM -0400, Theodore Ts'o wrote:
> > > On Tue, May 31, 2016 at 10:09:22PM +0800, Eryu Guan wrote:
> > > >
> > > > I noticed that generic/130 hangs starting from 4.7-rc1 kernel, on non-4k
> > > > block size ext4 (x86_64 host). And I bisected to commit 06bd3c36a733
> > > > ("ext4: fix data exposure after a crash").
> > > >
> > > > It's the sub-test "Small Vector Sync" in generic/130 hangs the kernel,
> > > > and I can reproduce it on different hosts, both bare metal and kvm
> > > > guest.
> > >
> > > Hmm, it's not reproducing for me, either using your simplified repro
> > > or generic/130. Is there something specific with your kernel config,
> > > which is needed for the reproduction, perhaps?
> >
> > That's weird, it's easily reproduced for me on different hosts/guests.
> > The kernel config I'm using is based on the config from RHEL7.2 kernel,
> > leaving all new config options to their default choices. i.e
> >
> > cp /boot/<config-rhel7.2> ./.config && yes "" | make oldconfig && make
> >
> > I attached my kernel config file.
> >
> > And my test vm has 8G memory & 4 vcpus, with RHEL7.2 installed running
> > upstream kernel, host is RHEL6.7. xfsprogs version 3.2.2 (shipped with
> > RHEL7.2) and version 4.5.0 (compiled from upstream) made no difference.
> >
> > I think I can try configs from other venders such as SuSE, Ubuntu. If
> > you can share your config file I'll test it as well.
>
> I've tried kernel config from Ubuntu 16.04, and I can reproduce the hang
> as well. If I add "-o data=journal" or "-o data=writeback" mount option,
> I don't see the hang. So seems it only happens in data=ordered mode,
> which matches the code change in commit 06bd3c36a733, I think.

Yeah, so this is what I kind of expected. From the backtraces you have
provided it is clear that:

1) There is a process (xfs_io) doing an O_SYNC write. It is blocked
waiting for a transaction commit, having entered the fsync path.

2) The jbd2 thread is blocked waiting for PG_Writeback to be cleared;
this happens only in data=ordered mode.

But what is not clear to me is: why doesn't PG_Writeback get cleared for
the page? It should get cleared once the IO that was submitted completes...
Also, how can my change trigger the problem? We have waited for
PG_Writeback in data=ordered mode even before. What my patch did is that we
now avoid the filemap_fdatawrite() call before the filemap_fdatawait()
call. So I suspect this is a race that has always been there, and the new,
faster code path is just tickling it in your setup.
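
Viewed as a wait-for graph, the two backtraces sketch a circular wait. Below is a toy model of that graph, not kernel code; the first two edges come straight from the backtraces, while the third edge (that clearing PG_Writeback depends on the stuck writer itself making progress) is an assumption, consistent with the deadlock fix posted later in the thread.

```python
# Toy wait-for graph built from the two backtraces above. Edges 1 and 2
# are read directly from the traces; edge 3 is the assumed missing link
# that closes the cycle.
waits_for = {
    "xfs_io in ext4_writepages": "jbd2 transaction commit",
    "jbd2 transaction commit": "PG_Writeback on the data page",
    "PG_Writeback on the data page": "xfs_io in ext4_writepages",
}

def find_cycle(graph, start):
    """Follow wait-for edges from start; return the cycle if a node
    is reached twice, else None (no deadlock along this path)."""
    seen = []
    node = start
    while node not in seen:
        seen.append(node)
        node = graph.get(node)
        if node is None:
            return None
    return seen[seen.index(node):]

# All three nodes come back, i.e. a closed cycle: a deadlock.
print(find_cycle(waits_for, "xfs_io in ext4_writepages"))
```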

I'll try to reproduce this problem in my setup (but my kvm instance fails
to boot with 4.7-rc1 so I'm debugging that currently) and if I succeed,
I'll debug this more. If I'm unable to reproduce this, I'll need you to
debug why the IO for that page does not complete. Probably attaching to the
hung kvm guest with gdb and looking through it is the simplest in that
case. Thanks for your report!

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2016-06-02 12:17:53

by Jan Kara

Subject: Re: xfstests generic/130 hang with non-4k block size ext4 on 4.7-rc1 kernel

On Thu 02-06-16 10:58:40, Jan Kara wrote:
> On Thu 02-06-16 00:58:00, Eryu Guan wrote:
> > On Wed, Jun 01, 2016 at 02:38:22PM +0800, Eryu Guan wrote:
> > > On Tue, May 31, 2016 at 11:40:17AM -0400, Theodore Ts'o wrote:
> > > > On Tue, May 31, 2016 at 10:09:22PM +0800, Eryu Guan wrote:
> > > > >
> > > > > I noticed that generic/130 hangs starting from 4.7-rc1 kernel, on non-4k
> > > > > block size ext4 (x86_64 host). And I bisected to commit 06bd3c36a733
> > > > > ("ext4: fix data exposure after a crash").
> > > > >
> > > > > It's the sub-test "Small Vector Sync" in generic/130 hangs the kernel,
> > > > > and I can reproduce it on different hosts, both bare metal and kvm
> > > > > guest.
> > > >
> > > > Hmm, it's not reproducing for me, either using your simplified repro
> > > > or generic/130. Is there something specific with your kernel config,
> > > > which is needed for the reproduction, perhaps?
> > >
> > > That's weird, it's easily reproduced for me on different hosts/guests.
> > > The kernel config I'm using is based on the config from RHEL7.2 kernel,
> > > leaving all new config options to their default choices. i.e
> > >
> > > cp /boot/<config-rhel7.2> ./.config && yes "" | make oldconfig && make
> > >
> > > I attached my kernel config file.
> > >
> > > And my test vm has 8G memory & 4 vcpus, with RHEL7.2 installed running
> > > upstream kernel, host is RHEL6.7. xfsprogs version 3.2.2 (shipped with
> > > RHEL7.2) and version 4.5.0 (compiled from upstream) made no difference.
> > >
> > > I think I can try configs from other venders such as SuSE, Ubuntu. If
> > > you can share your config file I'll test it as well.
> >
> > I've tried kernel config from Ubuntu 16.04, and I can reproduce the hang
> > as well. If I add "-o data=journal" or "-o data=writeback" mount option,
> > I don't see the hang. So seems it only happens in data=ordered mode,
> > which matches the code change in commit 06bd3c36a733, I think.
>
> Yeah, so this is what I kind of expected. From the backtraces you have
> provided it is clear that:
>
> 1) There is process (xfs_io) doing O_SYNC write. That is blocked waiting
> for transaction commit when it entered fsync path.
>
> 2) jbd2 thread is blocked waiting for PG_Writeback to be cleared - this
> happens only in data=ordered mode.
>
> But what is not clear to me is: Why PG_Writeback doesn't get cleared for
> the page? It should get cleared once the IO that was submitted completes...
> Also how my change can trigger the problem - we have waited for
> PG_Writeback in data=ordered mode even before. What my patch did is that we
> are now avoiding filemap_fdatawrite() call before the filemap_fdatawait()
> call. So I suspect this is a race that has always been there and the new
> faster code path is just tickling it in your setup.
>
> I'll try to reproduce this problem in my setup (but my kvm instance fails
> to boot with 4.7-rc1 so I'm debugging that currently) and if I succeed,
> I'll debug this more. If I'm unable to reproduce this, I'll need you to
> debug why the IO for that page does not complete. Probably attaching to the
> hung kvm guest with gdb and looking through it is the simplest in that
> case. Thanks for your report!

So I kept trying, but I could not reproduce the hang either. Can you find
out which page the jbd2 thread is waiting for, and dump page->index,
page->flags, and also bh->b_state and bh->b_blocknr of all 4 buffer heads
attached to it via page->private? Maybe that will shed some light...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2016-06-02 12:30:12

by Nikola Pajkovsky

Subject: Re: xfstests generic/130 hang with non-4k block size ext4 on 4.7-rc1 kernel

Jan Kara <[email protected]> writes:

> On Thu 02-06-16 10:58:40, Jan Kara wrote:
>> On Thu 02-06-16 00:58:00, Eryu Guan wrote:
>> > On Wed, Jun 01, 2016 at 02:38:22PM +0800, Eryu Guan wrote:
>> > > On Tue, May 31, 2016 at 11:40:17AM -0400, Theodore Ts'o wrote:
>> > > > On Tue, May 31, 2016 at 10:09:22PM +0800, Eryu Guan wrote:
>> > > > >
>> > > > > I noticed that generic/130 hangs starting from 4.7-rc1 kernel, on non-4k
>> > > > > block size ext4 (x86_64 host). And I bisected to commit 06bd3c36a733
>> > > > > ("ext4: fix data exposure after a crash").
>> > > > >
>> > > > > It's the sub-test "Small Vector Sync" in generic/130 hangs the kernel,
>> > > > > and I can reproduce it on different hosts, both bare metal and kvm
>> > > > > guest.
>> > > >
>> > > > Hmm, it's not reproducing for me, either using your simplified repro
>> > > > or generic/130. Is there something specific with your kernel config,
>> > > > which is needed for the reproduction, perhaps?
>> > >
>> > > That's weird, it's easily reproduced for me on different hosts/guests.
>> > > The kernel config I'm using is based on the config from RHEL7.2 kernel,
>> > > leaving all new config options to their default choices. i.e
>> > >
>> > > cp /boot/<config-rhel7.2> ./.config && yes "" | make oldconfig && make
>> > >
>> > > I attached my kernel config file.
>> > >
>> > > And my test vm has 8G memory & 4 vcpus, with RHEL7.2 installed running
>> > > upstream kernel, host is RHEL6.7. xfsprogs version 3.2.2 (shipped with
>> > > RHEL7.2) and version 4.5.0 (compiled from upstream) made no difference.
>> > >
>> > > I think I can try configs from other venders such as SuSE, Ubuntu. If
>> > > you can share your config file I'll test it as well.
>> >
>> > I've tried kernel config from Ubuntu 16.04, and I can reproduce the hang
>> > as well. If I add "-o data=journal" or "-o data=writeback" mount option,
>> > I don't see the hang. So seems it only happens in data=ordered mode,
>> > which matches the code change in commit 06bd3c36a733, I think.
>>
>> Yeah, so this is what I kind of expected. From the backtraces you have
>> provided it is clear that:
>>
>> 1) There is process (xfs_io) doing O_SYNC write. That is blocked waiting
>> for transaction commit when it entered fsync path.
>>
>> 2) jbd2 thread is blocked waiting for PG_Writeback to be cleared - this
>> happens only in data=ordered mode.
>>
>> But what is not clear to me is: Why PG_Writeback doesn't get cleared for
>> the page? It should get cleared once the IO that was submitted completes...
>> Also how my change can trigger the problem - we have waited for
>> PG_Writeback in data=ordered mode even before. What my patch did is that we
>> are now avoiding filemap_fdatawrite() call before the filemap_fdatawait()
>> call. So I suspect this is a race that has always been there and the new
>> faster code path is just tickling it in your setup.
>>
>> I'll try to reproduce this problem in my setup (but my kvm instance fails
>> to boot with 4.7-rc1 so I'm debugging that currently) and if I succeed,
>> I'll debug this more. If I'm unable to reproduce this, I'll need you to
>> debug why the IO for that page does not complete. Probably attaching to the
>> hung kvm guest with gdb and looking through it is the simplest in that
>> case. Thanks for your report!
>
> So I was trying but I could not reproduce the hang either. Can you find out
> which page is jbd2 thread waiting for and dump page->index, page->flags and
> also bh->b_state, bh->b_blocknr of all 4 buffer heads attached to it via
> page->private? Maybe that will shed some light...

It's a dumb question, but how can I find out which page the jbd2 thread
is waiting for, and dump page->index, page->flags, and also bh->b_state
and bh->b_blocknr of the 4 buffer heads attached to it via page->private?

--
Nikola

2016-06-03 10:16:20

by Eryu Guan

Subject: Re: xfstests generic/130 hang with non-4k block size ext4 on 4.7-rc1 kernel

On Thu, Jun 02, 2016 at 02:17:50PM +0200, Jan Kara wrote:
>
> So I was trying but I could not reproduce the hang either. Can you find out
> which page is jbd2 thread waiting for and dump page->index, page->flags and
> also bh->b_state, bh->b_blocknr of all 4 buffer heads attached to it via
> page->private? Maybe that will shed some light...

I'm using crash on the live system when the hang happens, so I got the
page address from "bt -f":

#6 [ffff880212343b40] wait_on_page_bit at ffffffff8119009e
ffff880212343b48: ffffea0002c23600 000000000000000d
ffff880212343b58: 0000000000000000 0000000000000000
ffff880212343b68: ffff880213251480 ffffffff810cd000
ffff880212343b78: ffff88021ff27218 ffff88021ff27218
ffff880212343b88: 00000000c1b4a75a ffff880212343c68
ffff880212343b98: ffffffff811901bf

The call is "wait_on_page_bit(page, PG_writeback);", and PG_writeback is
13 in the page flags enum, which is 0xd in hex. So ffffea0002c23600 should
be the page address (confirmed by following the page back to the correct
inode).

So page->index:
crash> page ffffea0002c23600 | grep index
index = 2441406,

page->flags:
crash> page ffffea0002c23600 | grep flags
flags = 9007197107267616,
crash> eval 9007197107267616
hexadecimal: 1fffff80002820
decimal: 9007197107267616
octal: 377777760000024040
binary: 0000000000011111111111111111111110000000000000000010100000100000
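
The flag bits can also be decoded mechanically. A small sketch follows; the bit positions are taken from a 4.7-era include/linux/page-flags.h and are not stable across kernel versions or configs, and the high 0x1fffff8... part of the word is the section/node/zone encoding, not flag bits.

```python
# Decode the low bits of page->flags as printed by crash. Bit positions
# follow include/linux/page-flags.h from a 4.7-era x86_64 kernel (an
# assumption; they differ between versions/configs). The high bits of
# the word encode section/node/zone, not flags, so only low bits matter.
PAGE_FLAG_BITS = {
    0: "locked", 1: "error", 2: "referenced", 3: "uptodate",
    4: "dirty", 5: "lru", 6: "active", 7: "slab",
    8: "owner_priv_1", 9: "arch_1", 10: "reserved",
    11: "private", 12: "private_2", 13: "writeback",
}

def decode_page_flags(flags):
    """Return the names of the defined low flag bits set in flags."""
    return [name for bit, name in sorted(PAGE_FLAG_BITS.items())
            if flags & (1 << bit)]

# 0x1fffff80002820: the low bits set are 5, 11 and 13.
print(decode_page_flags(9007197107267616))
# ['lru', 'private', 'writeback']
```

This matches the picture from the hang: the page is on the LRU, has buffer heads attached (private), and is under writeback, which is the bit the jbd2 thread is waiting on.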

page->private:
crash> page ffffea0002c23600 | grep private
private = 18446612141237651696,
crash> eval 18446612141237651696
hexadecimal: ffff880213e0c8f0

The 4 buffer heads:
crash> buffer_head ffff880213e0c8f0
struct buffer_head {
b_state = 0,
b_this_page = 0xffff880213e0c958,
b_page = 0xffffea0002c23600,
b_blocknr = 18446744073709551615,
b_size = 1024,
b_data = 0xffff8800b08d8000 "",
b_bdev = 0x0,
b_end_io = 0x0,
b_private = 0x0,
b_assoc_buffers = {
next = 0xffff880213e0c938,
prev = 0xffff880213e0c938
},
b_assoc_map = 0x0,
b_count = {
counter = 0
}
}
crash> buffer_head 0xffff880213e0c958
struct buffer_head {
b_state = 289,
b_this_page = 0xffff880213e0c9c0,
b_page = 0xffffea0002c23600,
b_blocknr = 19194,
b_size = 1024,
b_data = 0xffff8800b08d8400 "a",
b_bdev = 0xffff8802152009c0,
b_end_io = 0x0,
b_private = 0x0,
b_assoc_buffers = {
next = 0xffff880213e0c9a0,
prev = 0xffff880213e0c9a0
},
b_assoc_map = 0x0,
b_count = {
counter = 0
}
}
crash> buffer_head 0xffff880213e0c9c0
struct buffer_head {
b_state = 1,
b_this_page = 0xffff880213e0ca28,
b_page = 0xffffea0002c23600,
b_blocknr = 18446744073709551615,
b_size = 1024,
b_data = 0xffff8800b08d8800 "",
b_bdev = 0x0,
b_end_io = 0x0,
b_private = 0x0,
b_assoc_buffers = {
next = 0xffff880213e0ca08,
prev = 0xffff880213e0ca08
},
b_assoc_map = 0x0,
b_count = {
counter = 0
}
}
crash> buffer_head 0xffff880213e0ca28
struct buffer_head {
b_state = 1,
b_this_page = 0xffff880213e0c8f0,
b_page = 0xffffea0002c23600,
b_blocknr = 18446744073709551615,
b_size = 1024,
b_data = 0xffff8800b08d8c00 "",
b_bdev = 0x0,
b_end_io = 0x0,
b_private = 0x0,
b_assoc_buffers = {
next = 0xffff880213e0ca70,
prev = 0xffff880213e0ca70
},
b_assoc_map = 0x0,
b_count = {
counter = 0
}
}
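
The b_state values above can be decoded the same way. This is a sketch; the bit order is taken from enum bh_state_bits in a 4.7-era include/linux/buffer_head.h and may differ on other kernels.

```python
# Decode bh->b_state from the buffer_head dumps above. Bit order follows
# enum bh_state_bits in a 4.7-era include/linux/buffer_head.h (an
# assumption; verify against the kernel actually being debugged).
BH_STATE_BITS = [
    "Uptodate", "Dirty", "Lock", "Req", "Uptodate_Lock", "Mapped",
    "New", "Async_Read", "Async_Write", "Delay", "Boundary",
    "Write_EIO", "Unwritten",
]

def decode_bh_state(state):
    """Return the names of the bh state bits set in state."""
    return [name for bit, name in enumerate(BH_STATE_BITS)
            if state & (1 << bit)]

print(decode_bh_state(289))  # ['Uptodate', 'Mapped', 'Async_Write']
print(decode_bh_state(1))    # ['Uptodate']
print(decode_bh_state(0))    # []
```

Notably, only the second buffer head (b_state = 289, the one mapped to block 19194 and holding the "a" data) is marked Async_Write; the other three are unmapped (b_blocknr = -1). That is consistent with write IO having been set up for one 1k buffer of the page while PG_Writeback on the page is never cleared.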

If you need anything else please let me know.

And just a follow-up on the testing with different kernel configs: I can
still reproduce the hang easily with an OpenSuSE-based kernel, and I still
have trouble booting the kernel compiled with Ted's config file... Maybe
it's missing some modules, but there are so many of them that I tend to
give up :)

Thanks,
Eryu

2016-06-03 11:58:47

by Jan Kara

Subject: Re: xfstests generic/130 hang with non-4k block size ext4 on 4.7-rc1 kernel

On Fri 03-06-16 18:16:12, Eryu Guan wrote:
> On Thu, Jun 02, 2016 at 02:17:50PM +0200, Jan Kara wrote:
> >
> > So I was trying but I could not reproduce the hang either. Can you find out
> > which page is jbd2 thread waiting for and dump page->index, page->flags and
> > also bh->b_state, bh->b_blocknr of all 4 buffer heads attached to it via
> > page->private? Maybe that will shed some light...
>
> I'm using crash on live system when the hang happens, so I got the page
> address from "bt -f"
>
> #6 [ffff880212343b40] wait_on_page_bit at ffffffff8119009e
> ffff880212343b48: ffffea0002c23600 000000000000000d
> ffff880212343b58: 0000000000000000 0000000000000000
> ffff880212343b68: ffff880213251480 ffffffff810cd000
> ffff880212343b78: ffff88021ff27218 ffff88021ff27218
> ffff880212343b88: 00000000c1b4a75a ffff880212343c68
> ffff880212343b98: ffffffff811901bf

Thanks for debugging! In the end I was able to reproduce the issue on my
UML instance as well and I'm debugging what's going on.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2016-06-08 12:56:35

by Jan Kara

Subject: Re: xfstests generic/130 hang with non-4k block size ext4 on 4.7-rc1 kernel

On Fri 03-06-16 13:58:44, Jan Kara wrote:
> On Fri 03-06-16 18:16:12, Eryu Guan wrote:
> > On Thu, Jun 02, 2016 at 02:17:50PM +0200, Jan Kara wrote:
> > >
> > > So I was trying but I could not reproduce the hang either. Can you find out
> > > which page is jbd2 thread waiting for and dump page->index, page->flags and
> > > also bh->b_state, bh->b_blocknr of all 4 buffer heads attached to it via
> > > page->private? Maybe that will shed some light...
> >
> > I'm using crash on live system when the hang happens, so I got the page
> > address from "bt -f"
> >
> > #6 [ffff880212343b40] wait_on_page_bit at ffffffff8119009e
> > ffff880212343b48: ffffea0002c23600 000000000000000d
> > ffff880212343b58: 0000000000000000 0000000000000000
> > ffff880212343b68: ffff880213251480 ffffffff810cd000
> > ffff880212343b78: ffff88021ff27218 ffff88021ff27218
> > ffff880212343b88: 00000000c1b4a75a ffff880212343c68
> > ffff880212343b98: ffffffff811901bf
>
> Thanks for debugging! In the end I was able to reproduce the issue on my
> UML instance as well and I'm debugging what's going on.

The attached patch fixes the issue for me. I'll submit it once a full
xfstests run finishes for it (which may take a while, as our server room
is currently moving to a different place).

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR


Attachments:
(No filename) (1.32 kB)
0001-ext4-Fix-deadlock-during-page-writeback.patch (2.40 kB)

2016-06-08 14:30:06

by Holger Hoffstätte

Subject: Re: xfstests generic/130 hang with non-4k block size ext4 on 4.7-rc1 kernel

On Wed, 08 Jun 2016 14:56:31 +0200, Jan Kara wrote:
(snip)
> Attached patch fixes the issue for me. I'll submit it once a full xfstests
> run finishes for it (which may take a while as our server room is currently
> moving to a different place).
>
> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR
> From 3a120841a5d9a6c42bf196389467e9e663cf1cf8 Mon Sep 17 00:00:00 2001
> From: Jan Kara <[email protected]>
> Date: Wed, 8 Jun 2016 10:01:45 +0200
> Subject: [PATCH] ext4: Fix deadlock during page writeback
>
> Commit 06bd3c36a733 (ext4: fix data exposure after a crash) uncovered a
> deadlock in ext4_writepages() which was previously much harder to hit.
> After this commit xfstest generic/130 reproduces the deadlock on small
> filesystems.

Since you marked this for -stable, just a heads-up that the previous patch
for the data exposure was rejected from -stable (see [1]) because it has
the mismatching "!IS_NOQUOTA(inode) &&" line, which didn't exist until 4.6.
I removed it locally, but Greg probably wants an official patch.

So both this patch and the previous one need to be submitted.

-h

[1] http://permalink.gmane.org/gmane.linux.kernel.stable/18074{4,5,6}


2016-06-09 07:23:33

by Nikola Pajkovsky

Subject: Re: xfstests generic/130 hang with non-4k block size ext4 on 4.7-rc1 kernel

Holger Hoffstätte <[email protected]> writes:

> On Wed, 08 Jun 2016 14:56:31 +0200, Jan Kara wrote:
> (snip)
>> Attached patch fixes the issue for me. I'll submit it once a full xfstests
>> run finishes for it (which may take a while as our server room is currently
>> moving to a different place).
>>
>> Honza
>> --
>> Jan Kara <[email protected]>
>> SUSE Labs, CR
>> From 3a120841a5d9a6c42bf196389467e9e663cf1cf8 Mon Sep 17 00:00:00 2001
>> From: Jan Kara <[email protected]>
>> Date: Wed, 8 Jun 2016 10:01:45 +0200
>> Subject: [PATCH] ext4: Fix deadlock during page writeback
>>
>> Commit 06bd3c36a733 (ext4: fix data exposure after a crash) uncovered a
>> deadlock in ext4_writepages() which was previously much harder to hit.
>> After this commit xfstest generic/130 reproduces the deadlock on small
>> filesystems.
>
> Since you marked this for -stable, just a heads-up that the previous patch
> for the data exposure was rejected from -stable (see [1]) because it
> has the mismatching "!IS_NOQUOTA(inode) &&" line, which didn't exist
> until 4.6. I removed it locally but Greg probably wants an official patch.
>
> So both this and the previous patch need to be submitted.
>
> [1] http://permalink.gmane.org/gmane.linux.kernel.stable/18074{4,5,6}

I'm just wondering whether Jan's patch is related to the blocked
processes in the following trace. It's very hard to hit and I don't have
a reproducer.
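For context, traces like the ones below are typically gathered with the
kernel's hung-task detector and magic SysRq. A minimal sketch (assuming root
and a kernel built with CONFIG_DETECT_HUNG_TASK and CONFIG_MAGIC_SYSRQ; the
guards make it a no-op elsewhere):

```shell
#!/bin/sh
# Report stalled D-state tasks sooner than the 120s default seen in the log:
sysctl -w kernel.hung_task_timeout_secs=30 2>/dev/null || true

# Dump all blocked tasks on demand; this produces the
# "SysRq : Show Blocked State" output quoted at the end of the trace:
{ echo w 2>/dev/null > /proc/sysrq-trigger; } 2>/dev/null || true
```

Without a reproducer, lowering the timeout and dumping blocked state when the
stall recurs is usually the only way to capture a fresh trace.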

kernel: INFO: task jbd2/vdc-8:4710 blocked for more than 120 seconds. severity=err
kernel: Not tainted 3.18.21-1.el6.gdc.x86_64 #1 severity=err
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. severity=err
kernel: jbd2/vdc-8 D 0000000000000002 0 4710 2 0x00000080 severity=info
kernel: ffff88273a373b38 0000000000000046 ffff88273a373ae8 ffff88273a370010 severity=warning
kernel: 0000000000012c40 0000000000012c40 ffff88273c76c7e0 ffff88273e322b20 severity=warning
kernel: ffff88273a373b38 ffff88277f092c40 ffff88273c76c7e0 ffffffff81614e10 severity=warning
kernel: Call Trace: severity=warning
kernel: [<ffffffff81614e10>] ? bit_wait_timeout+0x70/0x70 severity=warning
kernel: [<ffffffff816145c9>] schedule+0x29/0x70 severity=warning
kernel: [<ffffffff8161469c>] io_schedule+0x8c/0xd0 severity=warning
kernel: [<ffffffff81614e3c>] bit_wait_io+0x2c/0x50 severity=warning
kernel: [<ffffffff81614bc5>] __wait_on_bit+0x65/0x90 severity=warning
kernel: [<ffffffff81614e10>] ? bit_wait_timeout+0x70/0x70 severity=warning
kernel: [<ffffffff81614d18>] out_of_line_wait_on_bit+0x78/0x90 severity=warning
kernel: [<ffffffff810ad2c0>] ? wake_atomic_t_function+0x40/0x40 severity=warning
kernel: [<ffffffff8121021e>] __wait_on_buffer+0x2e/0x30 severity=warning
kernel: [<ffffffffa01180a1>] jbd2_journal_commit_transaction+0x861/0x16e0 [jbd2] severity=warning
kernel: [<ffffffff810ad240>] ? woken_wake_function+0x20/0x20 severity=warning
kernel: [<ffffffffa011d847>] kjournald2+0x127/0x340 [jbd2] severity=warning
kernel: [<ffffffff810ad240>] ? woken_wake_function+0x20/0x20 severity=warning
kernel: [<ffffffffa011d720>] ? jbd2_journal_clear_features+0x90/0x90 [jbd2] severity=warning
kernel: [<ffffffff8108b31e>] kthread+0xce/0xf0 severity=warning
kernel: [<ffffffff8108b250>] ? kthread_freezable_should_stop+0x70/0x70 severity=warning
kernel: [<ffffffff81618c58>] ret_from_fork+0x58/0x90 severity=warning
kernel: [<ffffffff8108b250>] ? kthread_freezable_should_stop+0x70/0x70 severity=warning
kernel: INFO: task postmaster:25406 blocked for more than 120 seconds. severity=err
kernel: Not tainted 3.18.21-1.el6.gdc.x86_64 #1 severity=err
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. severity=err
kernel: postmaster D 000000000000000f 0 25406 25395 0x00000080 severity=info
kernel: ffff882485e5f9a8 0000000000000086 ffff882485e5f958 ffff882485e5c010 severity=warning
kernel: 0000000000012c40 0000000000012c40 ffff882657c9d640 ffff88273e3b8000 severity=warning
kernel: ffff882485e5f9a8 ffff88277f3d2c40 ffff882657c9d640 ffffffff81614e10 severity=warning
kernel: Call Trace: severity=warning
kernel: [<ffffffff81614e10>] ? bit_wait_timeout+0x70/0x70 severity=warning
kernel: [<ffffffff816145c9>] schedule+0x29/0x70 severity=warning
kernel: [<ffffffff8161469c>] io_schedule+0x8c/0xd0 severity=warning
kernel: [<ffffffff81614e3c>] bit_wait_io+0x2c/0x50 severity=warning
kernel: [<ffffffff81614bc5>] __wait_on_bit+0x65/0x90 severity=warning
kernel: [<ffffffff81614e10>] ? bit_wait_timeout+0x70/0x70 severity=warning
kernel: [<ffffffff81614d18>] out_of_line_wait_on_bit+0x78/0x90 severity=warning
kernel: [<ffffffff810ad2c0>] ? wake_atomic_t_function+0x40/0x40 severity=warning
kernel: [<ffffffffa011676d>] do_get_write_access+0x1fd/0x4e0 [jbd2] severity=warning
kernel: [<ffffffff81210a9a>] ? __getblk_gfp+0x3a/0x80 severity=warning
kernel: [<ffffffffa0116ba1>] jbd2_journal_get_write_access+0x31/0x50 [jbd2] severity=warning
kernel: [<ffffffffa0174403>] __ext4_journal_get_write_access+0x43/0x90 [ext4] severity=warning
kernel: [<ffffffffa0146c08>] ext4_reserve_inode_write+0x78/0xa0 [ext4] severity=warning
kernel: [<ffffffffa0146c82>] ext4_mark_inode_dirty+0x52/0x240 [ext4] severity=warning
kernel: [<ffffffffa0146eba>] ext4_dirty_inode+0x4a/0x70 [ext4] severity=warning
kernel: [<ffffffff8120644a>] __mark_inode_dirty+0x3a/0x290 severity=warning
kernel: [<ffffffff811f6f91>] update_time+0x81/0xc0 severity=warning
kernel: [<ffffffff811f7068>] file_update_time+0x98/0xe0 severity=warning
kernel: [<ffffffff811630b2>] __generic_file_write_iter+0x162/0x390 severity=warning
kernel: [<ffffffffa013deb9>] ext4_file_write_iter+0x119/0x430 [ext4] severity=warning
kernel: [<ffffffff811dbf22>] new_sync_write+0x92/0xd0 severity=warning
kernel: [<ffffffff811dc43e>] vfs_write+0xce/0x180 severity=warning
kernel: [<ffffffff811dca86>] SyS_write+0x56/0xd0 severity=warning
kernel: [<ffffffff81618d09>] system_call_fastpath+0x12/0x17 severity=warning
kernel: INFO: task kworker/u36:2:20256 blocked for more than 120 seconds. severity=err
kernel: Not tainted 3.18.21-1.el6.gdc.x86_64 #1 severity=err
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. severity=err
kernel: kworker/u36:2 D 0000000000000002 0 20256 2 0x00000080 severity=info
kernel: Workqueue: writeback bdi_writeback_workfn (flush-253:32) severity=info
kernel: ffff881b4ef83448 0000000000000046 ffff881b4ef833f8 ffff881b4ef80010 severity=warning
kernel: 0000000000012c40 0000000000012c40 ffff88273a238e60 ffff88273e322b20 severity=warning
kernel: ffff881b4ef83418 ffff88277f092c40 ffff88273a238e60 ffffffff81614e10 severity=warning
kernel: Call Trace: severity=warning
kernel: [<ffffffff81614e10>] ? bit_wait_timeout+0x70/0x70 severity=warning
kernel: [<ffffffff816145c9>] schedule+0x29/0x70 severity=warning
kernel: [<ffffffff8161469c>] io_schedule+0x8c/0xd0 severity=warning
kernel: [<ffffffff81614e3c>] bit_wait_io+0x2c/0x50 severity=warning
kernel: [<ffffffff81614bc5>] __wait_on_bit+0x65/0x90 severity=warning
kernel: [<ffffffffa0175f37>] ? mb_mark_used+0x297/0x350 [ext4] severity=warning
kernel: [<ffffffff81614e10>] ? bit_wait_timeout+0x70/0x70 severity=warning
kernel: [<ffffffff81614d18>] out_of_line_wait_on_bit+0x78/0x90 severity=warning
kernel: [<ffffffff810ad2c0>] ? wake_atomic_t_function+0x40/0x40 severity=warning
kernel: [<ffffffffa011676d>] do_get_write_access+0x1fd/0x4e0 [jbd2] severity=warning
kernel: [<ffffffff81210a9a>] ? __getblk_gfp+0x3a/0x80 severity=warning
kernel: [<ffffffffa0116ba1>] jbd2_journal_get_write_access+0x31/0x50 [jbd2] severity=warning
kernel: [<ffffffffa0174403>] __ext4_journal_get_write_access+0x43/0x90 [ext4] severity=warning
kernel: [<ffffffffa013c61a>] ? ext4_read_block_bitmap+0x3a/0x60 [ext4] severity=warning
kernel: [<ffffffffa0177139>] ext4_mb_mark_diskspace_used+0x79/0x320 [ext4] severity=warning
kernel: [<ffffffffa017f341>] ext4_mb_new_blocks+0x331/0x510 [ext4] severity=warning
kernel: [<ffffffffa01722d7>] ext4_ext_map_blocks+0x877/0xae0 [ext4] severity=warning
kernel: [<ffffffff8116f9e1>] ? release_pages+0x1e1/0x250 severity=warning
kernel: [<ffffffffa0143fc6>] ext4_map_blocks+0x176/0x4b0 [ext4] severity=warning
kernel: [<ffffffffa01449c4>] ? mpage_prepare_extent_to_map+0x304/0x350 [ext4] severity=warning
kernel: [<ffffffffa0144591>] mpage_map_one_extent+0x71/0x1a0 [ext4] severity=warning
kernel: [<ffffffffa01484ab>] mpage_map_and_submit_extent+0x4b/0x220 [ext4] severity=warning
kernel: [<ffffffffa01746a4>] ? __ext4_journal_start_sb+0x74/0x100 [ext4] severity=warning
kernel: [<ffffffffa0148b4b>] ext4_writepages+0x4cb/0x630 [ext4] severity=warning
kernel: [<ffffffff812c37e5>] ? blk_mq_flush_plug_list+0x135/0x150 severity=warning
kernel: [<ffffffff810a4073>] ? set_next_entity+0x93/0xa0 severity=warning
kernel: [<ffffffff8116cb20>] do_writepages+0x20/0x40 severity=warning
kernel: [<ffffffff81205990>] __writeback_single_inode+0x40/0x220 severity=warning
kernel: [<ffffffff810ad72f>] ? wake_up_bit+0x2f/0x40 severity=warning
kernel: [<ffffffff81206c44>] writeback_sb_inodes+0x2b4/0x3c0 severity=warning
kernel: [<ffffffff811deef1>] ? put_super+0x31/0x40 severity=warning
kernel: [<ffffffff81206dee>] __writeback_inodes_wb+0x9e/0xd0 severity=warning
kernel: [<ffffffff81207063>] wb_writeback+0x243/0x2d0 severity=warning
kernel: [<ffffffff812071eb>] wb_do_writeback+0xfb/0x1e0 severity=warning
kernel: [<ffffffff81207340>] bdi_writeback_workfn+0x70/0x210 severity=warning
kernel: [<ffffffff810848c3>] process_one_work+0x143/0x4b0 severity=warning
kernel: [<ffffffff81085a33>] worker_thread+0x123/0x520 severity=warning
kernel: [<ffffffff81614095>] ? __schedule+0x375/0x750 severity=warning
kernel: [<ffffffff81085910>] ? maybe_create_worker+0x130/0x130 severity=warning
kernel: [<ffffffff8108b31e>] kthread+0xce/0xf0 severity=warning
kernel: [<ffffffff8108b250>] ? kthread_freezable_should_stop+0x70/0x70 severity=warning
kernel: [<ffffffff81618c58>] ret_from_fork+0x58/0x90 severity=warning
kernel: [<ffffffff8108b250>] ? kthread_freezable_should_stop+0x70/0x70 severity=warning
kernel: INFO: task postmaster:14855 blocked for more than 120 seconds. severity=err
kernel: Not tainted 3.18.21-1.el6.gdc.x86_64 #1 severity=err
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. severity=err
kernel: postmaster D 0000000000000009 0 14855 25395 0x00000080 severity=info
kernel: ffff8822bf3877b8 0000000000000082 ffff8822bf387768 ffff8822bf384010 severity=warning
kernel: 0000000000012c40 0000000000012c40 ffff8825d71f5640 ffff88273e391cc0 severity=warning
kernel: ffff8822bf387788 ffff88277f252c40 ffff8825d71f5640 ffffffff81614e10 severity=warning
kernel: Call Trace: severity=warning
kernel: [<ffffffff81614e10>] ? bit_wait_timeout+0x70/0x70 severity=warning
kernel: [<ffffffff816145c9>] schedule+0x29/0x70 severity=warning
kernel: [<ffffffff8161469c>] io_schedule+0x8c/0xd0 severity=warning
kernel: [<ffffffff81614e3c>] bit_wait_io+0x2c/0x50 severity=warning
kernel: [<ffffffff81614bc5>] __wait_on_bit+0x65/0x90 severity=warning
kernel: [<ffffffff8120f780>] ? bh_lru_install+0x170/0x1a0 severity=warning
kernel: [<ffffffff81614e10>] ? bit_wait_timeout+0x70/0x70 severity=warning
kernel: [<ffffffff81614d18>] out_of_line_wait_on_bit+0x78/0x90 severity=warning
kernel: [<ffffffff810ad2c0>] ? wake_atomic_t_function+0x40/0x40 severity=warning
kernel: [<ffffffffa011676d>] do_get_write_access+0x1fd/0x4e0 [jbd2] severity=warning
kernel: [<ffffffffa0116ba1>] jbd2_journal_get_write_access+0x31/0x50 [jbd2] severity=warning
kernel: [<ffffffffa0174403>] __ext4_journal_get_write_access+0x43/0x90 [ext4] severity=warning
kernel: [<ffffffffa013c61a>] ? ext4_read_block_bitmap+0x3a/0x60 [ext4] severity=warning
kernel: [<ffffffffa01771ab>] ext4_mb_mark_diskspace_used+0xeb/0x320 [ext4] severity=warning
kernel: [<ffffffffa017f341>] ext4_mb_new_blocks+0x331/0x510 [ext4] severity=warning
kernel: [<ffffffffa01722d7>] ext4_ext_map_blocks+0x877/0xae0 [ext4] severity=warning
kernel: [<ffffffff8116f9e1>] ? release_pages+0x1e1/0x250 severity=warning
kernel: [<ffffffffa0143fc6>] ext4_map_blocks+0x176/0x4b0 [ext4] severity=warning
kernel: [<ffffffffa01449c4>] ? mpage_prepare_extent_to_map+0x304/0x350 [ext4] severity=warning
kernel: [<ffffffffa0144591>] mpage_map_one_extent+0x71/0x1a0 [ext4] severity=warning
kernel: [<ffffffffa01484ab>] mpage_map_and_submit_extent+0x4b/0x220 [ext4] severity=warning
kernel: [<ffffffffa0148b4b>] ext4_writepages+0x4cb/0x630 [ext4] severity=warning
kernel: [<ffffffff81193472>] ? handle_mm_fault+0xb2/0x1a0 severity=warning
kernel: [<ffffffff8116cb20>] do_writepages+0x20/0x40 severity=warning
kernel: [<ffffffff81160a69>] __filemap_fdatawrite_range+0x59/0x60 severity=warning
kernel: [<ffffffff81160b1a>] filemap_write_and_wait_range+0xaa/0x100 severity=warning
kernel: [<ffffffffa013eac3>] ext4_sync_file+0xf3/0x260 [ext4] severity=warning
kernel: [<ffffffff8120cf91>] vfs_fsync_range+0x21/0x30 severity=warning
kernel: [<ffffffff8120cfbc>] vfs_fsync+0x1c/0x20 severity=warning
kernel: [<ffffffff8120d1ad>] do_fsync+0x3d/0x70 severity=warning
kernel: [<ffffffff8120d210>] SyS_fsync+0x10/0x20 severity=warning
kernel: [<ffffffff81618d09>] system_call_fastpath+0x12/0x17 severity=warning
kernel: INFO: task postmaster:14856 blocked for more than 120 seconds. severity=err
kernel: Not tainted 3.18.21-1.el6.gdc.x86_64 #1 severity=err
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. severity=err
kernel: postmaster D 0000000000000006 0 14856 25395 0x00000080 severity=info
kernel: ffff88262affb8a8 0000000000000086 ffff88262affb858 ffff88262aff8010 severity=warning
kernel: 0000000000012c40 0000000000012c40 ffff88266aab1cc0 ffff88273e3264a0 severity=warning
kernel: ffff88225c97d6a0 ffff88277f192c40 ffff88266aab1cc0 ffffffff81614e10 severity=warning
kernel: Call Trace: severity=warning
kernel: [<ffffffff81614e10>] ? bit_wait_timeout+0x70/0x70 severity=warning
kernel: [<ffffffff816145c9>] schedule+0x29/0x70 severity=warning
kernel: [<ffffffff8161469c>] io_schedule+0x8c/0xd0 severity=warning
kernel: [<ffffffff81614e3c>] bit_wait_io+0x2c/0x50 severity=warning
kernel: [<ffffffff81614bc5>] __wait_on_bit+0x65/0x90 severity=warning
kernel: [<ffffffffa011576e>] ? start_this_handle+0x28e/0x520 [jbd2] severity=warning
kernel: [<ffffffff81614e10>] ? bit_wait_timeout+0x70/0x70 severity=warning
kernel: [<ffffffff81614d18>] out_of_line_wait_on_bit+0x78/0x90 severity=warning
kernel: [<ffffffff810ad2c0>] ? wake_atomic_t_function+0x40/0x40 severity=warning
kernel: [<ffffffffa011676d>] do_get_write_access+0x1fd/0x4e0 [jbd2] severity=warning
kernel: [<ffffffffa0116ba1>] jbd2_journal_get_write_access+0x31/0x50 [jbd2] severity=warning
kernel: [<ffffffffa0174403>] __ext4_journal_get_write_access+0x43/0x90 [ext4] severity=warning
kernel: [<ffffffffa01406d4>] __ext4_new_inode+0x3a4/0x1110 [ext4] severity=warning
kernel: [<ffffffff8123c960>] ? __dquot_initialize+0x30/0x1e0 severity=warning
kernel: [<ffffffffa0150bbb>] ext4_create+0xab/0x170 [ext4] severity=warning
kernel: [<ffffffff811e9118>] vfs_create+0xd8/0x100 severity=warning
kernel: [<ffffffff811e982e>] lookup_open+0x13e/0x1d0 severity=warning
kernel: [<ffffffff811ed006>] do_last+0x416/0x820 severity=warning
kernel: [<ffffffff811ed4d4>] path_openat+0xc4/0x480 severity=warning
kernel: [<ffffffff811e77a9>] ? putname+0x29/0x40 severity=warning
kernel: [<ffffffff811ed9ca>] do_filp_open+0x4a/0xa0 severity=warning
kernel: [<ffffffff811fa64c>] ? __alloc_fd+0xac/0x150 severity=warning
kernel: [<ffffffff811db9ea>] do_sys_open+0x11a/0x230 severity=warning
kernel: [<ffffffff81021dc3>] ? syscall_trace_enter_phase1+0x133/0x150 severity=warning
kernel: [<ffffffff8110f3e6>] ? __audit_syscall_exit+0x246/0x2f0 severity=warning
kernel: [<ffffffff811dbb3e>] SyS_open+0x1e/0x20 severity=warning
kernel: [<ffffffff81618d09>] system_call_fastpath+0x12/0x17 severity=warning
kernel: INFO: task postmaster:14857 blocked for more than 120 seconds. severity=err
kernel: Not tainted 3.18.21-1.el6.gdc.x86_64 #1 severity=err
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. severity=err
kernel: postmaster D 0000000000000010 0 14857 25395 0x00000080 severity=info
kernel: ffff880748203df8 0000000000000082 ffff8826a94e67c8 ffff880748200010 severity=warning
kernel: 0000000000012c40 0000000000012c40 ffff88266aab5640 ffff88273e3b8e60 severity=warning
kernel: ffff880748203e08 00000000023bc691 ffff882737c2e800 ffff880748203e38 severity=warning
kernel: Call Trace: severity=warning
kernel: [<ffffffff816145c9>] schedule+0x29/0x70 severity=warning
kernel: [<ffffffffa011db45>] jbd2_log_wait_commit+0xe5/0x170 [jbd2] severity=warning
kernel: [<ffffffff810ad240>] ? woken_wake_function+0x20/0x20 severity=warning
kernel: [<ffffffffa011de47>] jbd2_complete_transaction+0x57/0x90 [jbd2] severity=warning
kernel: [<ffffffffa013ebcd>] ext4_sync_file+0x1fd/0x260 [ext4] severity=warning
kernel: [<ffffffff8120cf91>] vfs_fsync_range+0x21/0x30 severity=warning
kernel: [<ffffffff8120cfbc>] vfs_fsync+0x1c/0x20 severity=warning
kernel: [<ffffffff8120d1ad>] do_fsync+0x3d/0x70 severity=warning
kernel: [<ffffffff8120d210>] SyS_fsync+0x10/0x20 severity=warning
kernel: [<ffffffff81618d09>] system_call_fastpath+0x12/0x17 severity=warning
kernel: INFO: task postmaster:14859 blocked for more than 120 seconds. severity=err
kernel: Not tainted 3.18.21-1.el6.gdc.x86_64 #1 severity=err
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. severity=err
kernel: postmaster D 0000000000000001 0 14859 25395 0x00000080 severity=info
kernel: ffff8820d2017c38 0000000000000082 ffff882700000000 ffff8820d2014010 severity=warning
kernel: 0000000000012c40 0000000000012c40 ffff882657c99cc0 ffff88273e321cc0 severity=warning
kernel: ffff882657c99cc0 ffff8826feadc068 ffff882657c99cc0 ffff8826feadc06c severity=warning
kernel: Call Trace: severity=warning
kernel: [<ffffffff816145c9>] schedule+0x29/0x70 severity=warning
kernel: [<ffffffff8161472e>] schedule_preempt_disabled+0xe/0x10 severity=warning
kernel: [<ffffffff81616535>] __mutex_lock_slowpath+0x95/0x100 severity=warning
kernel: [<ffffffff816165c3>] mutex_lock+0x23/0x40 severity=warning
kernel: [<ffffffff811ecfec>] do_last+0x3fc/0x820 severity=warning
kernel: [<ffffffff811ed4d4>] path_openat+0xc4/0x480 severity=warning
kernel: [<ffffffff816165b6>] ? mutex_lock+0x16/0x40 severity=warning
kernel: [<ffffffff811ed9ca>] do_filp_open+0x4a/0xa0 severity=warning
kernel: [<ffffffff811fa64c>] ? __alloc_fd+0xac/0x150 severity=warning
kernel: [<ffffffff811db9ea>] do_sys_open+0x11a/0x230 severity=warning
kernel: [<ffffffff81021dc3>] ? syscall_trace_enter_phase1+0x133/0x150 severity=warning
kernel: [<ffffffff8110f3e6>] ? __audit_syscall_exit+0x246/0x2f0 severity=warning
kernel: [<ffffffff811dbb3e>] SyS_open+0x1e/0x20 severity=warning
kernel: [<ffffffff81618d09>] system_call_fastpath+0x12/0x17 severity=warning
kernel: INFO: task postmaster:14860 blocked for more than 120 seconds. severity=err
kernel: Not tainted 3.18.21-1.el6.gdc.x86_64 #1 severity=err
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. severity=err
kernel: postmaster D 000000000000000e 0 14860 25395 0x00000080 severity=info
kernel: ffff8820d22fbdf8 0000000000000086 ffff8826947ba4d8 ffff8820d22f8010 severity=warning
kernel: 0000000000012c40 0000000000012c40 ffff88266a93e4a0 ffff88273e3964a0 severity=warning
kernel: ffff8820d22fbe08 00000000023bc691 ffff882737c2e800 ffff8820d22fbe38 severity=warning
kernel: Call Trace: severity=warning
kernel: [<ffffffff816145c9>] schedule+0x29/0x70 severity=warning
kernel: [<ffffffffa011db45>] jbd2_log_wait_commit+0xe5/0x170 [jbd2] severity=warning
kernel: [<ffffffff810ad240>] ? woken_wake_function+0x20/0x20 severity=warning
kernel: [<ffffffffa011de47>] jbd2_complete_transaction+0x57/0x90 [jbd2] severity=warning
kernel: [<ffffffffa013ebcd>] ext4_sync_file+0x1fd/0x260 [ext4] severity=warning
kernel: [<ffffffff8120cf91>] vfs_fsync_range+0x21/0x30 severity=warning
kernel: [<ffffffff8120cfbc>] vfs_fsync+0x1c/0x20 severity=warning
kernel: [<ffffffff8120d1ad>] do_fsync+0x3d/0x70 severity=warning
kernel: [<ffffffff8120d210>] SyS_fsync+0x10/0x20 severity=warning
kernel: [<ffffffff81618d09>] system_call_fastpath+0x12/0x17 severity=warning
kernel: INFO: task postmaster:14861 blocked for more than 120 seconds. severity=err
kernel: Not tainted 3.18.21-1.el6.gdc.x86_64 #1 severity=err
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. severity=err
kernel: postmaster D 0000000000000007 0 14861 25395 0x00000080 severity=info
kernel: ffff88041dfdfc38 0000000000000082 0000000000000000 ffff88041dfdc010 severity=warning
kernel: 0000000000012c40 0000000000012c40 ffff88266a939cc0 ffff88273e390000 severity=warning
kernel: ffff88281ffd8d80 ffff8826feadc068 ffff88266a939cc0 ffff8826feadc06c severity=warning
kernel: Call Trace: severity=warning
kernel: [<ffffffff816145c9>] schedule+0x29/0x70 severity=warning
kernel: [<ffffffff8161472e>] schedule_preempt_disabled+0xe/0x10 severity=warning
kernel: [<ffffffff81616535>] __mutex_lock_slowpath+0x95/0x100 severity=warning
kernel: [<ffffffff816165c3>] mutex_lock+0x23/0x40 severity=warning
kernel: [<ffffffff811ecfec>] do_last+0x3fc/0x820 severity=warning
kernel: [<ffffffff811ed4d4>] path_openat+0xc4/0x480 severity=warning
kernel: [<ffffffff811ed9ca>] do_filp_open+0x4a/0xa0 severity=warning
kernel: [<ffffffff811fa64c>] ? __alloc_fd+0xac/0x150 severity=warning
kernel: [<ffffffff811db9ea>] do_sys_open+0x11a/0x230 severity=warning
kernel: [<ffffffff81021dc3>] ? syscall_trace_enter_phase1+0x133/0x150 severity=warning
kernel: [<ffffffff811dbb3e>] SyS_open+0x1e/0x20 severity=warning
kernel: [<ffffffff81618d09>] system_call_fastpath+0x12/0x17 severity=warning
kernel: INFO: task postmaster:14868 blocked for more than 120 seconds. severity=err
kernel: Not tainted 3.18.21-1.el6.gdc.x86_64 #1 severity=err
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. severity=err
kernel: postmaster D 000000000000000a 0 14868 25395 0x00000080 severity=info
kernel: ffff8821bd68fdf8 0000000000000086 ffff88269980e7c8 ffff8821bd68c010 severity=warning
kernel: 0000000000012c40 0000000000012c40 ffff88273dd68e60 ffff88273e392b20 severity=warning
kernel: ffff8821bd68fe08 00000000023bc691 ffff882737c2e800 ffff8821bd68fe38 severity=warning
kernel: Call Trace: severity=warning
kernel: [<ffffffff816145c9>] schedule+0x29/0x70 severity=warning
kernel: [<ffffffffa011db45>] jbd2_log_wait_commit+0xe5/0x170 [jbd2] severity=warning
kernel: [<ffffffff810ad240>] ? woken_wake_function+0x20/0x20 severity=warning
kernel: [<ffffffffa011de47>] jbd2_complete_transaction+0x57/0x90 [jbd2] severity=warning
kernel: [<ffffffffa013ebcd>] ext4_sync_file+0x1fd/0x260 [ext4] severity=warning
kernel: [<ffffffff8120cf91>] vfs_fsync_range+0x21/0x30 severity=warning
kernel: [<ffffffff8120cfbc>] vfs_fsync+0x1c/0x20 severity=warning
kernel: [<ffffffff8120d1ad>] do_fsync+0x3d/0x70 severity=warning
kernel: [<ffffffff8120d210>] SyS_fsync+0x10/0x20 severity=warning
kernel: [<ffffffff81618d09>] system_call_fastpath+0x12/0x17 severity=warning
kernel: SysRq : Show Blocked State severity=info
kernel: task PC stack pid father severity=info
kernel: jbd2/vdc-8 D 0000000000000002 0 4710 2 0x00000080 severity=info
kernel: ffff88273a373b38 0000000000000046 ffff88273a373ae8 ffff88273a370010 severity=warning
kernel: 0000000000012c40 0000000000012c40 ffff88273c76c7e0 ffff88273e322b20 severity=warning
kernel: ffff88273a373b38 ffff88277f092c40 ffff88273c76c7e0 ffffffff81614e10 severity=warning
kernel: Call Trace: severity=warning
kernel: [<ffffffff81614e10>] ? bit_wait_timeout+0x70/0x70 severity=warning
kernel: [<ffffffff816145c9>] schedule+0x29/0x70 severity=warning
kernel: [<ffffffff8161469c>] io_schedule+0x8c/0xd0 severity=warning
kernel: [<ffffffff81614e3c>] bit_wait_io+0x2c/0x50 severity=warning
kernel: [<ffffffff81614bc5>] __wait_on_bit+0x65/0x90 severity=warning
kernel: [<ffffffff81614e10>] ? bit_wait_timeout+0x70/0x70 severity=warning
kernel: [<ffffffff81614d18>] out_of_line_wait_on_bit+0x78/0x90 severity=warning
kernel: [<ffffffff810ad2c0>] ? wake_atomic_t_function+0x40/0x40 severity=warning
kernel: [<ffffffff8121021e>] __wait_on_buffer+0x2e/0x30 severity=warning
kernel: [<ffffffffa01180a1>] jbd2_journal_commit_transaction+0x861/0x16e0 [jbd2] severity=warning
kernel: [<ffffffff810ad240>] ? woken_wake_function+0x20/0x20 severity=warning
kernel: [<ffffffffa011d847>] kjournald2+0x127/0x340 [jbd2] severity=warning
kernel: [<ffffffff810ad240>] ? woken_wake_function+0x20/0x20 severity=warning
kernel: [<ffffffffa011d720>] ? jbd2_journal_clear_features+0x90/0x90 [jbd2] severity=warning
kernel: [<ffffffff8108b31e>] kthread+0xce/0xf0 severity=warning
kernel: [<ffffffff8108b250>] ? kthread_freezable_should_stop+0x70/0x70 severity=warning
kernel: [<ffffffff81618c58>] ret_from_fork+0x58/0x90 severity=warning
kernel: [<ffffffff8108b250>] ? kthread_freezable_should_stop+0x70/0x70 severity=warning
kernel: postmaster D 000000000000000f 0 25406 25395 0x00000080 severity=info
kernel: ffff882485e5f9a8 0000000000000086 ffff882485e5f958 ffff882485e5c010 severity=warning
kernel: 0000000000012c40 0000000000012c40 ffff882657c9d640 ffff88273e3b8000 severity=warning
kernel: ffff882485e5f9a8 ffff88277f3d2c40 ffff882657c9d640 ffffffff81614e10 severity=warning
kernel: Call Trace: severity=warning
kernel: [<ffffffff81614e10>] ? bit_wait_timeout+0x70/0x70 severity=warning
kernel: [<ffffffff816145c9>] schedule+0x29/0x70 severity=warning
kernel: [<ffffffff8161469c>] io_schedule+0x8c/0xd0 severity=warning
kernel: [<ffffffff81614e3c>] bit_wait_io+0x2c/0x50 severity=warning
kernel: [<ffffffff81614bc5>] __wait_on_bit+0x65/0x90 severity=warning
kernel: [<ffffffff81614e10>] ? bit_wait_timeout+0x70/0x70 severity=warning
kernel: [<ffffffff81614d18>] out_of_line_wait_on_bit+0x78/0x90 severity=warning
kernel: [<ffffffff810ad2c0>] ? wake_atomic_t_function+0x40/0x40 severity=warning
kernel: [<ffffffffa011676d>] do_get_write_access+0x1fd/0x4e0 [jbd2] severity=warning
kernel: [<ffffffff81210a9a>] ? __getblk_gfp+0x3a/0x80 severity=warning
kernel: [<ffffffffa0116ba1>] jbd2_journal_get_write_access+0x31/0x50 [jbd2] severity=warning
kernel: [<ffffffffa0174403>] __ext4_journal_get_write_access+0x43/0x90 [ext4] severity=warning
kernel: [<ffffffffa0146c08>] ext4_reserve_inode_write+0x78/0xa0 [ext4] severity=warning
kernel: [<ffffffffa0146c82>] ext4_mark_inode_dirty+0x52/0x240 [ext4] severity=warning
kernel: [<ffffffffa0146eba>] ext4_dirty_inode+0x4a/0x70 [ext4] severity=warning
kernel: [<ffffffff8120644a>] __mark_inode_dirty+0x3a/0x290 severity=warning
kernel: [<ffffffff811f6f91>] update_time+0x81/0xc0 severity=warning
kernel: [<ffffffff811f7068>] file_update_time+0x98/0xe0 severity=warning
kernel: [<ffffffff811630b2>] __generic_file_write_iter+0x162/0x390 severity=warning
kernel: [<ffffffffa013deb9>] ext4_file_write_iter+0x119/0x430 [ext4] severity=warning
kernel: [<ffffffff811dbf22>] new_sync_write+0x92/0xd0 severity=warning
kernel: [<ffffffff811dc43e>] vfs_write+0xce/0x180 severity=warning
kernel: [<ffffffff811dca86>] SyS_write+0x56/0xd0 severity=warning
kernel: [<ffffffff81618d09>] system_call_fastpath+0x12/0x17 severity=warning
kernel: postmaster D 000000000000000b 0 26139 25395 0x00000080 severity=info
kernel: ffff882451267e28 0000000000000082 ffff882737404020 ffff882451264010 severity=warning
kernel: 0000000000012c40 0000000000012c40 ffff882641ea8e60 ffff88273e393980 severity=warning
kernel: 0000000000000003 ffff8826feadc068 ffff882641ea8e60 ffff8826feadc06c severity=warning
kernel: Call Trace: severity=warning
kernel: [<ffffffff816145c9>] schedule+0x29/0x70 severity=warning
kernel: [<ffffffff8161472e>] schedule_preempt_disabled+0xe/0x10 severity=warning
kernel: [<ffffffff81616694>] __mutex_lock_killable_slowpath+0xb4/0x140 severity=warning
kernel: [<ffffffff81616762>] mutex_lock_killable+0x42/0x50 severity=warning
kernel: [<ffffffff811efc77>] iterate_dir+0x77/0x140 severity=warning
kernel: [<ffffffff8110e0cc>] ? __audit_syscall_entry+0xac/0x110 severity=warning
kernel: [<ffffffff811efef1>] SyS_getdents+0xb1/0x100 severity=warning
kernel: [<ffffffff811effd0>] ? SyS_old_readdir+0x90/0x90 severity=warning
kernel: [<ffffffff81618d09>] system_call_fastpath+0x12/0x17 severity=warning
kernel: kworker/u36:2 D 0000000000000002 0 20256 2 0x00000080 severity=info
kernel: Workqueue: writeback bdi_writeback_workfn (flush-253:32) severity=info
kernel: ffff881b4ef83448 0000000000000046 ffff881b4ef833f8 ffff881b4ef80010 severity=warning
kernel: 0000000000012c40 0000000000012c40 ffff88273a238e60 ffff88273e322b20 severity=warning
kernel: ffff881b4ef83418 ffff88277f092c40 ffff88273a238e60 ffffffff81614e10 severity=warning
kernel: Call Trace: severity=warning
kernel: [<ffffffff81614e10>] ? bit_wait_timeout+0x70/0x70 severity=warning
kernel: [<ffffffff816145c9>] schedule+0x29/0x70 severity=warning
kernel: [<ffffffff8161469c>] io_schedule+0x8c/0xd0 severity=warning
kernel: [<ffffffff81614e3c>] bit_wait_io+0x2c/0x50 severity=warning
kernel: [<ffffffff81614bc5>] __wait_on_bit+0x65/0x90 severity=warning
kernel: [<ffffffffa0175f37>] ? mb_mark_used+0x297/0x350 [ext4] severity=warning
kernel: [<ffffffff81614e10>] ? bit_wait_timeout+0x70/0x70 severity=warning
kernel: [<ffffffff81614d18>] out_of_line_wait_on_bit+0x78/0x90 severity=warning
kernel: [<ffffffff810ad2c0>] ? wake_atomic_t_function+0x40/0x40 severity=warning
kernel: [<ffffffffa011676d>] do_get_write_access+0x1fd/0x4e0 [jbd2] severity=warning
kernel: [<ffffffff81210a9a>] ? __getblk_gfp+0x3a/0x80 severity=warning
kernel: [<ffffffffa0116ba1>] jbd2_journal_get_write_access+0x31/0x50 [jbd2] severity=warning
kernel: [<ffffffffa0174403>] __ext4_journal_get_write_access+0x43/0x90 [ext4]
kernel: [<ffffffffa013c61a>] ? ext4_read_block_bitmap+0x3a/0x60 [ext4]
kernel: [<ffffffffa0177139>] ext4_mb_mark_diskspace_used+0x79/0x320 [ext4]
kernel: [<ffffffffa017f341>] ext4_mb_new_blocks+0x331/0x510 [ext4]
kernel: [<ffffffffa01722d7>] ext4_ext_map_blocks+0x877/0xae0 [ext4]
kernel: [<ffffffff8116f9e1>] ? release_pages+0x1e1/0x250
kernel: [<ffffffffa0143fc6>] ext4_map_blocks+0x176/0x4b0 [ext4]
kernel: [<ffffffffa01449c4>] ? mpage_prepare_extent_to_map+0x304/0x350 [ext4]
kernel: [<ffffffffa0144591>] mpage_map_one_extent+0x71/0x1a0 [ext4]
kernel: [<ffffffffa01484ab>] mpage_map_and_submit_extent+0x4b/0x220 [ext4]
kernel: [<ffffffffa01746a4>] ? __ext4_journal_start_sb+0x74/0x100 [ext4]
kernel: [<ffffffffa0148b4b>] ext4_writepages+0x4cb/0x630 [ext4]
kernel: [<ffffffff812c37e5>] ? blk_mq_flush_plug_list+0x135/0x150
kernel: [<ffffffff810a4073>] ? set_next_entity+0x93/0xa0
kernel: [<ffffffff8116cb20>] do_writepages+0x20/0x40
kernel: [<ffffffff81205990>] __writeback_single_inode+0x40/0x220
kernel: [<ffffffff810ad72f>] ? wake_up_bit+0x2f/0x40
kernel: [<ffffffff81206c44>] writeback_sb_inodes+0x2b4/0x3c0
kernel: [<ffffffff811deef1>] ? put_super+0x31/0x40
kernel: [<ffffffff81206dee>] __writeback_inodes_wb+0x9e/0xd0
kernel: [<ffffffff81207063>] wb_writeback+0x243/0x2d0
kernel: [<ffffffff812071eb>] wb_do_writeback+0xfb/0x1e0
kernel: [<ffffffff81207340>] bdi_writeback_workfn+0x70/0x210
kernel: [<ffffffff810848c3>] process_one_work+0x143/0x4b0
kernel: [<ffffffff81085a33>] worker_thread+0x123/0x520
kernel: [<ffffffff81614095>] ? __schedule+0x375/0x750
kernel: [<ffffffff81085910>] ? maybe_create_worker+0x130/0x130
kernel: [<ffffffff8108b31e>] kthread+0xce/0xf0
kernel: [<ffffffff8108b250>] ? kthread_freezable_should_stop+0x70/0x70
kernel: [<ffffffff81618c58>] ret_from_fork+0x58/0x90
kernel: [<ffffffff8108b250>] ? kthread_freezable_should_stop+0x70/0x70
kernel: postmaster D 0000000000000009 0 14855 25395 0x00000080
kernel: ffff8822bf3877b8 0000000000000082 ffff8822bf387768 ffff8822bf384010
kernel: 0000000000012c40 0000000000012c40 ffff8825d71f5640 ffff88273e391cc0
kernel: ffff8822bf387788 ffff88277f252c40 ffff8825d71f5640 ffffffff81614e10
kernel: Call Trace:
kernel: [<ffffffff81614e10>] ? bit_wait_timeout+0x70/0x70
kernel: [<ffffffff816145c9>] schedule+0x29/0x70
kernel: [<ffffffff8161469c>] io_schedule+0x8c/0xd0
kernel: [<ffffffff81614e3c>] bit_wait_io+0x2c/0x50
kernel: [<ffffffff81614bc5>] __wait_on_bit+0x65/0x90
kernel: [<ffffffff8120f780>] ? bh_lru_install+0x170/0x1a0
kernel: [<ffffffff81614e10>] ? bit_wait_timeout+0x70/0x70
kernel: [<ffffffff81614d18>] out_of_line_wait_on_bit+0x78/0x90
kernel: [<ffffffff810ad2c0>] ? wake_atomic_t_function+0x40/0x40
kernel: [<ffffffffa011676d>] do_get_write_access+0x1fd/0x4e0 [jbd2]
kernel: [<ffffffffa0116ba1>] jbd2_journal_get_write_access+0x31/0x50 [jbd2]
kernel: [<ffffffffa0174403>] __ext4_journal_get_write_access+0x43/0x90 [ext4]
kernel: [<ffffffffa013c61a>] ? ext4_read_block_bitmap+0x3a/0x60 [ext4]
kernel: [<ffffffffa01771ab>] ext4_mb_mark_diskspace_used+0xeb/0x320 [ext4]
kernel: [<ffffffffa017f341>] ext4_mb_new_blocks+0x331/0x510 [ext4]
kernel: [<ffffffffa01722d7>] ext4_ext_map_blocks+0x877/0xae0 [ext4]
kernel: [<ffffffff8116f9e1>] ? release_pages+0x1e1/0x250
kernel: [<ffffffffa0143fc6>] ext4_map_blocks+0x176/0x4b0 [ext4]
kernel: [<ffffffffa01449c4>] ? mpage_prepare_extent_to_map+0x304/0x350 [ext4]
kernel: [<ffffffffa0144591>] mpage_map_one_extent+0x71/0x1a0 [ext4]
kernel: [<ffffffffa01484ab>] mpage_map_and_submit_extent+0x4b/0x220 [ext4]
kernel: [<ffffffffa0148b4b>] ext4_writepages+0x4cb/0x630 [ext4]
kernel: [<ffffffff81193472>] ? handle_mm_fault+0xb2/0x1a0
kernel: [<ffffffff8116cb20>] do_writepages+0x20/0x40
kernel: [<ffffffff81160a69>] __filemap_fdatawrite_range+0x59/0x60
kernel: [<ffffffff81160b1a>] filemap_write_and_wait_range+0xaa/0x100
kernel: [<ffffffffa013eac3>] ext4_sync_file+0xf3/0x260 [ext4]
kernel: [<ffffffff8120cf91>] vfs_fsync_range+0x21/0x30
kernel: [<ffffffff8120cfbc>] vfs_fsync+0x1c/0x20
kernel: [<ffffffff8120d1ad>] do_fsync+0x3d/0x70
kernel: [<ffffffff8120d210>] SyS_fsync+0x10/0x20
kernel: [<ffffffff81618d09>] system_call_fastpath+0x12/0x17
kernel: postmaster D 0000000000000006 0 14856 25395 0x00000080
kernel: ffff88262affb8a8 0000000000000086 ffff88262affb858 ffff88262aff8010
kernel: 0000000000012c40 0000000000012c40 ffff88266aab1cc0 ffff88273e3264a0
kernel: ffff88225c97d6a0 ffff88277f192c40 ffff88266aab1cc0 ffffffff81614e10
kernel: Call Trace:
kernel: [<ffffffff81614e10>] ? bit_wait_timeout+0x70/0x70
kernel: [<ffffffff816145c9>] schedule+0x29/0x70
kernel: [<ffffffff8161469c>] io_schedule+0x8c/0xd0
kernel: [<ffffffff81614e3c>] bit_wait_io+0x2c/0x50
kernel: [<ffffffff81614bc5>] __wait_on_bit+0x65/0x90
kernel: [<ffffffffa011576e>] ? start_this_handle+0x28e/0x520 [jbd2]
kernel: [<ffffffff81614e10>] ? bit_wait_timeout+0x70/0x70
kernel: [<ffffffff81614d18>] out_of_line_wait_on_bit+0x78/0x90
kernel: [<ffffffff810ad2c0>] ? wake_atomic_t_function+0x40/0x40
kernel: [<ffffffffa011676d>] do_get_write_access+0x1fd/0x4e0 [jbd2]
kernel: [<ffffffffa0116ba1>] jbd2_journal_get_write_access+0x31/0x50 [jbd2]
kernel: [<ffffffffa0174403>] __ext4_journal_get_write_access+0x43/0x90 [ext4]
kernel: [<ffffffffa01406d4>] __ext4_new_inode+0x3a4/0x1110 [ext4]
kernel: [<ffffffff8123c960>] ? __dquot_initialize+0x30/0x1e0
kernel: [<ffffffffa0150bbb>] ext4_create+0xab/0x170 [ext4]
kernel: [<ffffffff811e9118>] vfs_create+0xd8/0x100
kernel: [<ffffffff811e982e>] lookup_open+0x13e/0x1d0
kernel: [<ffffffff811ed006>] do_last+0x416/0x820
kernel: [<ffffffff811ed4d4>] path_openat+0xc4/0x480
kernel: [<ffffffff811e77a9>] ? putname+0x29/0x40
kernel: [<ffffffff811ed9ca>] do_filp_open+0x4a/0xa0
kernel: [<ffffffff811fa64c>] ? __alloc_fd+0xac/0x150
kernel: [<ffffffff811db9ea>] do_sys_open+0x11a/0x230
kernel: [<ffffffff81021dc3>] ? syscall_trace_enter_phase1+0x133/0x150
kernel: [<ffffffff8110f3e6>] ? __audit_syscall_exit+0x246/0x2f0
kernel: [<ffffffff811dbb3e>] SyS_open+0x1e/0x20
kernel: [<ffffffff81618d09>] system_call_fastpath+0x12/0x17
kernel: postmaster D 0000000000000010 0 14857 25395 0x00000080
kernel: ffff880748203df8 0000000000000082 ffff8826a94e67c8 ffff880748200010
kernel: 0000000000012c40 0000000000012c40 ffff88266aab5640 ffff88273e3b8e60
kernel: ffff880748203e08 00000000023bc691 ffff882737c2e800 ffff880748203e38
kernel: Call Trace:
kernel: [<ffffffff816145c9>] schedule+0x29/0x70
kernel: [<ffffffffa011db45>] jbd2_log_wait_commit+0xe5/0x170 [jbd2]
kernel: [<ffffffff810ad240>] ? woken_wake_function+0x20/0x20
kernel: [<ffffffffa011de47>] jbd2_complete_transaction+0x57/0x90 [jbd2]
kernel: [<ffffffffa013ebcd>] ext4_sync_file+0x1fd/0x260 [ext4]
kernel: [<ffffffff8120cf91>] vfs_fsync_range+0x21/0x30
kernel: [<ffffffff8120cfbc>] vfs_fsync+0x1c/0x20
kernel: [<ffffffff8120d1ad>] do_fsync+0x3d/0x70
kernel: [<ffffffff8120d210>] SyS_fsync+0x10/0x20
kernel: [<ffffffff81618d09>] system_call_fastpath+0x12/0x17
kernel: postmaster D 0000000000000001 0 14859 25395 0x00000080
kernel: ffff8820d2017c38 0000000000000082 ffff882700000000 ffff8820d2014010
kernel: 0000000000012c40 0000000000012c40 ffff882657c99cc0 ffff88273e321cc0
kernel: ffff882657c99cc0 ffff8826feadc068 ffff882657c99cc0 ffff8826feadc06c
kernel: Call Trace:
kernel: [<ffffffff816145c9>] schedule+0x29/0x70
kernel: [<ffffffff8161472e>] schedule_preempt_disabled+0xe/0x10
kernel: [<ffffffff81616535>] __mutex_lock_slowpath+0x95/0x100
kernel: [<ffffffff816165c3>] mutex_lock+0x23/0x40
kernel: [<ffffffff811ecfec>] do_last+0x3fc/0x820
kernel: [<ffffffff811ed4d4>] path_openat+0xc4/0x480
kernel: [<ffffffff816165b6>] ? mutex_lock+0x16/0x40
kernel: [<ffffffff811ed9ca>] do_filp_open+0x4a/0xa0
kernel: [<ffffffff811fa64c>] ? __alloc_fd+0xac/0x150
kernel: [<ffffffff811db9ea>] do_sys_open+0x11a/0x230
kernel: [<ffffffff81021dc3>] ? syscall_trace_enter_phase1+0x133/0x150
kernel: [<ffffffff8110f3e6>] ? __audit_syscall_exit+0x246/0x2f0
kernel: [<ffffffff811dbb3e>] SyS_open+0x1e/0x20
kernel: [<ffffffff81618d09>] system_call_fastpath+0x12/0x17
kernel: postmaster D 000000000000000e 0 14860 25395 0x00000080
kernel: ffff8820d22fbdf8 0000000000000086 ffff8826947ba4d8 ffff8820d22f8010
kernel: 0000000000012c40 0000000000012c40 ffff88266a93e4a0 ffff88273e3964a0
kernel: ffff8820d22fbe08 00000000023bc691 ffff882737c2e800 ffff8820d22fbe38
kernel: Call Trace:
kernel: [<ffffffff816145c9>] schedule+0x29/0x70
kernel: [<ffffffffa011db45>] jbd2_log_wait_commit+0xe5/0x170 [jbd2]
kernel: [<ffffffff810ad240>] ? woken_wake_function+0x20/0x20
kernel: [<ffffffffa011de47>] jbd2_complete_transaction+0x57/0x90 [jbd2]
kernel: [<ffffffffa013ebcd>] ext4_sync_file+0x1fd/0x260 [ext4]
kernel: [<ffffffff8120cf91>] vfs_fsync_range+0x21/0x30
kernel: [<ffffffff8120cfbc>] vfs_fsync+0x1c/0x20
kernel: [<ffffffff8120d1ad>] do_fsync+0x3d/0x70
kernel: [<ffffffff8120d210>] SyS_fsync+0x10/0x20
kernel: [<ffffffff81618d09>] system_call_fastpath+0x12/0x17
kernel: postmaster D 0000000000000007 0 14861 25395 0x00000080
kernel: ffff88041dfdfc38 0000000000000082 0000000000000000 ffff88041dfdc010
kernel: 0000000000012c40 0000000000012c40 ffff88266a939cc0 ffff88273e390000
kernel: ffff88281ffd8d80 ffff8826feadc068 ffff88266a939cc0 ffff8826feadc06c
kernel: Call Trace:
kernel: [<ffffffff816145c9>] schedule+0x29/0x70
kernel: [<ffffffff8161472e>] schedule_preempt_disabled+0xe/0x10
kernel: [<ffffffff81616535>] __mutex_lock_slowpath+0x95/0x100
kernel: [<ffffffff816165c3>] mutex_lock+0x23/0x40
kernel: [<ffffffff811ecfec>] do_last+0x3fc/0x820
kernel: [<ffffffff811ed4d4>] path_openat+0xc4/0x480
kernel: [<ffffffff811ed9ca>] do_filp_open+0x4a/0xa0
kernel: [<ffffffff811fa64c>] ? __alloc_fd+0xac/0x150
kernel: [<ffffffff811db9ea>] do_sys_open+0x11a/0x230
kernel: [<ffffffff81021dc3>] ? syscall_trace_enter_phase1+0x133/0x150
kernel: [<ffffffff811dbb3e>] SyS_open+0x1e/0x20
kernel: [<ffffffff81618d09>] system_call_fastpath+0x12/0x17
kernel: postmaster D 000000000000000a 0 14868 25395 0x00000080
kernel: ffff8821bd68fdf8 0000000000000086 ffff88269980e7c8 ffff8821bd68c010
kernel: 0000000000012c40 0000000000012c40 ffff88273dd68e60 ffff88273e392b20
kernel: ffff8821bd68fe08 00000000023bc691 ffff882737c2e800 ffff8821bd68fe38
kernel: Call Trace:
kernel: [<ffffffff816145c9>] schedule+0x29/0x70
kernel: [<ffffffffa011db45>] jbd2_log_wait_commit+0xe5/0x170 [jbd2]
kernel: [<ffffffff810ad240>] ? woken_wake_function+0x20/0x20
kernel: [<ffffffffa011de47>] jbd2_complete_transaction+0x57/0x90 [jbd2]
kernel: [<ffffffffa013ebcd>] ext4_sync_file+0x1fd/0x260 [ext4]
kernel: [<ffffffff8120cf91>] vfs_fsync_range+0x21/0x30
kernel: [<ffffffff8120cfbc>] vfs_fsync+0x1c/0x20
kernel: [<ffffffff8120d1ad>] do_fsync+0x3d/0x70
kernel: [<ffffffff8120d210>] SyS_fsync+0x10/0x20
kernel: [<ffffffff81618d09>] system_call_fastpath+0x12/0x17
kernel: postmaster D 0000000000000001 0 14869 25395 0x00000080
kernel: ffff880ed01d3df8 0000000000000082 ffff8826995ed418 ffff880ed01d0010
kernel: 0000000000012c40 0000000000012c40 ffff88273dd6d640 ffff88273e321cc0
kernel: ffff880ed01d3e08 00000000023bc691 ffff882737c2e800 ffff880ed01d3e38
kernel: Call Trace:
kernel: [<ffffffff816145c9>] schedule+0x29/0x70
kernel: [<ffffffffa011db45>] jbd2_log_wait_commit+0xe5/0x170 [jbd2]
kernel: [<ffffffff810ad240>] ? woken_wake_function+0x20/0x20
kernel: [<ffffffffa011de47>] jbd2_complete_transaction+0x57/0x90 [jbd2]
kernel: [<ffffffffa013ebcd>] ext4_sync_file+0x1fd/0x260 [ext4]
kernel: [<ffffffff8120cf91>] vfs_fsync_range+0x21/0x30
kernel: [<ffffffff8120cfbc>] vfs_fsync+0x1c/0x20
kernel: [<ffffffff8120d1ad>] do_fsync+0x3d/0x70
kernel: [<ffffffff8120d210>] SyS_fsync+0x10/0x20
kernel: [<ffffffff81618d09>] system_call_fastpath+0x12/0x17
kernel: postmaster D 000000000000000b 0 14905 25395 0x00000080
kernel: ffff8824d12979a8 0000000000000086 ffff8824d1297958 ffff8824d1294010
kernel: 0000000000012c40 0000000000012c40 ffff8826317f3980 ffff88273e393980
kernel: 000000000000008d ffff88277f2d2c40 ffff8826317f3980 ffffffff81614e10
kernel: Call Trace:
kernel: [<ffffffff81614e10>] ? bit_wait_timeout+0x70/0x70
kernel: [<ffffffff816145c9>] schedule+0x29/0x70
kernel: [<ffffffff8161469c>] io_schedule+0x8c/0xd0
kernel: [<ffffffff81614e3c>] bit_wait_io+0x2c/0x50
kernel: [<ffffffff81614bc5>] __wait_on_bit+0x65/0x90
kernel: [<ffffffff81614e10>] ? bit_wait_timeout+0x70/0x70
kernel: [<ffffffff81614d18>] out_of_line_wait_on_bit+0x78/0x90
kernel: [<ffffffff810ad2c0>] ? wake_atomic_t_function+0x40/0x40
kernel: [<ffffffffa011676d>] do_get_write_access+0x1fd/0x4e0 [jbd2]
kernel: [<ffffffff81210a9a>] ? __getblk_gfp+0x3a/0x80
kernel: [<ffffffffa0116ba1>] jbd2_journal_get_write_access+0x31/0x50 [jbd2]
kernel: [<ffffffffa0174403>] __ext4_journal_get_write_access+0x43/0x90 [ext4]
kernel: [<ffffffffa0146c08>] ext4_reserve_inode_write+0x78/0xa0 [ext4]
kernel: [<ffffffffa0146c82>] ext4_mark_inode_dirty+0x52/0x240 [ext4]
kernel: [<ffffffffa0146eba>] ext4_dirty_inode+0x4a/0x70 [ext4]
kernel: [<ffffffff8120644a>] __mark_inode_dirty+0x3a/0x290
kernel: [<ffffffff811f6f91>] update_time+0x81/0xc0
kernel: [<ffffffff811fcc63>] ? mntput+0x23/0x40
kernel: [<ffffffff811f7068>] file_update_time+0x98/0xe0
kernel: [<ffffffff811e6925>] ? terminate_walk+0x35/0x40
kernel: [<ffffffff811630b2>] __generic_file_write_iter+0x162/0x390
kernel: [<ffffffffa013deb9>] ext4_file_write_iter+0x119/0x430 [ext4]
kernel: [<ffffffff81170c3b>] ? invalidate_mapping_pages+0x8b/0x210
kernel: [<ffffffff811dbf22>] new_sync_write+0x92/0xd0
kernel: [<ffffffff811dc43e>] vfs_write+0xce/0x180
kernel: [<ffffffff811dca86>] SyS_write+0x56/0xd0
kernel: [<ffffffff81618d09>] system_call_fastpath+0x12/0x17
kernel: postmaster D 0000000000000000 0 14910 25395 0x00000080
kernel: ffff882641f7bdf8 0000000000000086 ffff8802a90d7398 ffff882641f78010
kernel: 0000000000012c40 0000000000012c40 ffff8800db8c0000 ffffffff81c1b4c0
kernel: ffff882641f7be08 00000000023bc691 ffff882737c2e800 ffff882641f7be38
kernel: Call Trace:
kernel: [<ffffffff816145c9>] schedule+0x29/0x70
kernel: [<ffffffffa011db45>] jbd2_log_wait_commit+0xe5/0x170 [jbd2]
kernel: [<ffffffff810ad240>] ? woken_wake_function+0x20/0x20
kernel: [<ffffffffa011de47>] jbd2_complete_transaction+0x57/0x90 [jbd2]
kernel: [<ffffffffa013ebcd>] ext4_sync_file+0x1fd/0x260 [ext4]
kernel: [<ffffffff8120cf91>] vfs_fsync_range+0x21/0x30
kernel: [<ffffffff8120cfbc>] vfs_fsync+0x1c/0x20
kernel: [<ffffffff8120d1ad>] do_fsync+0x3d/0x70
kernel: [<ffffffff8120d210>] SyS_fsync+0x10/0x20
kernel: [<ffffffff81618d09>] system_call_fastpath+0x12/0x17
kernel: smokerd plugin D 0000000000000009 0 15001 29979 0x00000084
kernel: ffff88264c437b48 0000000000000082 ffff88264c437af8 ffff88264c434010
kernel: 0000000000012c40 0000000000012c40 ffff8825d71f1cc0 ffff88273e391cc0
kernel: ffff8803362adc98 ffff88277f252c40 ffff8825d71f1cc0 ffffffff81614e10
kernel: Call Trace:
kernel: [<ffffffff81614e10>] ? bit_wait_timeout+0x70/0x70
kernel: [<ffffffff816145c9>] schedule+0x29/0x70
kernel: [<ffffffff8161469c>] io_schedule+0x8c/0xd0
kernel: [<ffffffff81614e3c>] bit_wait_io+0x2c/0x50
kernel: [<ffffffff81614bc5>] __wait_on_bit+0x65/0x90
kernel: [<ffffffffa0116cbf>] ? jbd2_journal_dirty_metadata+0xff/0x250 [jbd2]
kernel: [<ffffffff81614e10>] ? bit_wait_timeout+0x70/0x70
kernel: [<ffffffff81614d18>] out_of_line_wait_on_bit+0x78/0x90
kernel: [<ffffffff810ad2c0>] ? wake_atomic_t_function+0x40/0x40
kernel: [<ffffffffa011676d>] do_get_write_access+0x1fd/0x4e0 [jbd2]
kernel: [<ffffffffa0116ba1>] jbd2_journal_get_write_access+0x31/0x50 [jbd2]
kernel: [<ffffffffa0174403>] __ext4_journal_get_write_access+0x43/0x90 [ext4]
kernel: [<ffffffffa014cb7f>] ext4_orphan_add+0xbf/0x2e0 [ext4]
kernel: [<ffffffffa0146cb8>] ? ext4_mark_inode_dirty+0x88/0x240 [ext4]
kernel: [<ffffffffa0151d3a>] ext4_unlink+0x2aa/0x2d0 [ext4]
kernel: [<ffffffff811e8bc3>] vfs_unlink+0xd3/0x140
kernel: [<ffffffff811ebd06>] do_unlinkat+0x2d6/0x320
kernel: [<ffffffff81021c8c>] ? do_audit_syscall_entry+0x6c/0x70
kernel: [<ffffffff81021dc3>] ? syscall_trace_enter_phase1+0x133/0x150
kernel: [<ffffffff811ebd66>] SyS_unlink+0x16/0x20
kernel: [<ffffffff81618d09>] system_call_fastpath+0x12/0x17
kernel: postmaster D 0000000000000007 0 15029 25395 0x00000080
kernel: ffff88206081f8a8 0000000000000086 ffff88206081f858 ffff88206081c010
kernel: 0000000000012c40 0000000000012c40 ffff8825ab41ab20 ffff88273e390000
kernel: ffff88277ec03700 ffff88277f1d2c40 ffff8825ab41ab20 ffffffff81614e10
kernel: Call Trace:
kernel: [<ffffffff81614e10>] ? bit_wait_timeout+0x70/0x70
kernel: [<ffffffff816145c9>] schedule+0x29/0x70
kernel: [<ffffffff8161469c>] io_schedule+0x8c/0xd0
kernel: [<ffffffff81614e3c>] bit_wait_io+0x2c/0x50
kernel: [<ffffffff81614bc5>] __wait_on_bit+0x65/0x90
kernel: [<ffffffffa011576e>] ? start_this_handle+0x28e/0x520 [jbd2]
kernel: [<ffffffff81614e10>] ? bit_wait_timeout+0x70/0x70
kernel: [<ffffffff81614d18>] out_of_line_wait_on_bit+0x78/0x90
kernel: [<ffffffff810ad2c0>] ? wake_atomic_t_function+0x40/0x40
kernel: [<ffffffffa011676d>] do_get_write_access+0x1fd/0x4e0 [jbd2]
kernel: [<ffffffffa0116ba1>] jbd2_journal_get_write_access+0x31/0x50 [jbd2]
kernel: [<ffffffffa0174403>] __ext4_journal_get_write_access+0x43/0x90 [ext4]
kernel: [<ffffffffa0140ada>] __ext4_new_inode+0x7aa/0x1110 [ext4]
kernel: [<ffffffff8123c960>] ? __dquot_initialize+0x30/0x1e0
kernel: [<ffffffff8115fece>] ? find_get_entry+0x1e/0xa0
kernel: [<ffffffffa0150bbb>] ext4_create+0xab/0x170 [ext4]
kernel: [<ffffffff811e9118>] vfs_create+0xd8/0x100
kernel: [<ffffffff811e982e>] lookup_open+0x13e/0x1d0
kernel: [<ffffffff811ed006>] do_last+0x416/0x820
kernel: [<ffffffff811ed4d4>] path_openat+0xc4/0x480
kernel: [<ffffffff811ed9ca>] do_filp_open+0x4a/0xa0
kernel: [<ffffffff811e77a9>] ? putname+0x29/0x40
kernel: [<ffffffff811fa64c>] ? __alloc_fd+0xac/0x150
kernel: [<ffffffff811db9ea>] do_sys_open+0x11a/0x230
kernel: [<ffffffff81021dc3>] ? syscall_trace_enter_phase1+0x133/0x150
kernel: [<ffffffff811dbb3e>] SyS_open+0x1e/0x20
kernel: [<ffffffff81618d09>] system_call_fastpath+0x12/0x17
kernel: postmaster D 0000000000000000 0 15175 25395 0x00000080
kernel: ffff88057229bb48 0000000000000086 ffff88057229baf8 ffff880572298010
kernel: 0000000000012c40 0000000000012c40 ffff88245105c7e0 ffffffff81c1b4c0
kernel: ffff88057229bb38 ffff88277f012c40 ffff88245105c7e0 ffffffff81614e10
kernel: Call Trace:
kernel: [<ffffffff81614e10>] ? bit_wait_timeout+0x70/0x70
kernel: [<ffffffff816145c9>] schedule+0x29/0x70
kernel: [<ffffffff8161469c>] io_schedule+0x8c/0xd0
kernel: [<ffffffff81614e3c>] bit_wait_io+0x2c/0x50
kernel: [<ffffffff81614bc5>] __wait_on_bit+0x65/0x90
kernel: [<ffffffffa0116cbf>] ? jbd2_journal_dirty_metadata+0xff/0x250 [jbd2]
kernel: [<ffffffff81614e10>] ? bit_wait_timeout+0x70/0x70
kernel: [<ffffffff81614d18>] out_of_line_wait_on_bit+0x78/0x90
kernel: [<ffffffff810ad2c0>] ? wake_atomic_t_function+0x40/0x40
kernel: [<ffffffffa011676d>] do_get_write_access+0x1fd/0x4e0 [jbd2]
kernel: [<ffffffffa0116ba1>] jbd2_journal_get_write_access+0x31/0x50 [jbd2]
kernel: [<ffffffffa0174403>] __ext4_journal_get_write_access+0x43/0x90 [ext4]
kernel: [<ffffffffa014cb7f>] ext4_orphan_add+0xbf/0x2e0 [ext4]
kernel: [<ffffffffa0146cb8>] ? ext4_mark_inode_dirty+0x88/0x240 [ext4]
kernel: [<ffffffffa0151d3a>] ext4_unlink+0x2aa/0x2d0 [ext4]
kernel: [<ffffffff811e8bc3>] vfs_unlink+0xd3/0x140
kernel: [<ffffffff811ebd06>] do_unlinkat+0x2d6/0x320
kernel: [<ffffffff81021c8c>] ? do_audit_syscall_entry+0x6c/0x70
kernel: [<ffffffff81021dc3>] ? syscall_trace_enter_phase1+0x133/0x150
kernel: [<ffffffff811ebd66>] SyS_unlink+0x16/0x20
kernel: [<ffffffff81618d09>] system_call_fastpath+0x12/0x17
kernel: postmaster D 0000000000000003 0 15823 25395 0x00000084
kernel: ffff880424173db8 0000000000000086 ffff880424173de8 ffff880424170010
kernel: 0000000000012c40 0000000000012c40 ffff882254f49cc0 ffff88273e323980
kernel: ffff880424173dd8 ffff8826e845acb8 ffff882254f49cc0 ffff8826e845acbc
kernel: Call Trace:
kernel: [<ffffffff816145c9>] schedule+0x29/0x70
kernel: [<ffffffff8161472e>] schedule_preempt_disabled+0xe/0x10
kernel: [<ffffffff81616535>] __mutex_lock_slowpath+0x95/0x100
kernel: [<ffffffff816165c3>] mutex_lock+0x23/0x40
kernel: [<ffffffff811ebb58>] do_unlinkat+0x128/0x320
kernel: [<ffffffff81021c8c>] ? do_audit_syscall_entry+0x6c/0x70
kernel: [<ffffffff81021dc3>] ? syscall_trace_enter_phase1+0x133/0x150
kernel: [<ffffffff811ebd66>] SyS_unlink+0x16/0x20
kernel: [<ffffffff81618d09>] system_call_fastpath+0x12/0x17
kernel: postmaster D 0000000000000007 0 15855 25395 0x00000080
kernel: ffff88266a8c3db8 0000000000000086 ffff88266a8c3de8 ffff88266a8c0010
kernel: 0000000000012c40 0000000000012c40 ffff8820f680c7e0 ffff88273e390000
kernel: ffff88266a8c3dd8 ffff8826e845acb8 ffff8820f680c7e0 ffff8826e845acbc
kernel: Call Trace:
kernel: [<ffffffff816145c9>] schedule+0x29/0x70
kernel: [<ffffffff8161472e>] schedule_preempt_disabled+0xe/0x10
kernel: [<ffffffff81616535>] __mutex_lock_slowpath+0x95/0x100
kernel: [<ffffffff816165c3>] mutex_lock+0x23/0x40
kernel: [<ffffffff811ebb58>] do_unlinkat+0x128/0x320
kernel: [<ffffffff81021c8c>] ? do_audit_syscall_entry+0x6c/0x70
kernel: [<ffffffff81021dc3>] ? syscall_trace_enter_phase1+0x133/0x150
kernel: [<ffffffff811ebd66>] SyS_unlink+0x16/0x20
kernel: [<ffffffff81618d09>] system_call_fastpath+0x12/0x17
kernel: postmaster D 0000000000000006 0 16102 25395 0x00000084
kernel: ffff8820d23e7db8 0000000000000086 ffff8820d23e7de8 ffff8820d23e4010
kernel: 0000000000012c40 0000000000012c40 ffff881b4d6147e0 ffff88273e3264a0
kernel: ffff8820d23e7dd8 ffff8826e845acb8 ffff881b4d6147e0 ffff8826e845acbc
kernel: Call Trace:
kernel: [<ffffffff816145c9>] schedule+0x29/0x70
kernel: [<ffffffff8161472e>] schedule_preempt_disabled+0xe/0x10
kernel: [<ffffffff81616535>] __mutex_lock_slowpath+0x95/0x100
kernel: [<ffffffff816165c3>] mutex_lock+0x23/0x40
kernel: [<ffffffff811ebb58>] do_unlinkat+0x128/0x320
kernel: [<ffffffff81021c8c>] ? do_audit_syscall_entry+0x6c/0x70
kernel: [<ffffffff81021dc3>] ? syscall_trace_enter_phase1+0x133/0x150
kernel: [<ffffffff811ebd66>] SyS_unlink+0x16/0x20
kernel: [<ffffffff81618d09>] system_call_fastpath+0x12/0x17
kernel: postmaster D 000000000000000c 0 16227 25395 0x00000084
kernel: ffff881b4d8efdb8 0000000000000082 ffff881b4d8efde8 ffff881b4d8ec010
kernel: 0000000000012c40 0000000000012c40 ffff88220a8fe4a0 ffff88273e3947e0
kernel: ffff881b4d8efdd8 ffff8826e845acb8 ffff88220a8fe4a0 ffff8826e845acbc
kernel: Call Trace:
kernel: [<ffffffff816145c9>] schedule+0x29/0x70
kernel: [<ffffffff8161472e>] schedule_preempt_disabled+0xe/0x10
kernel: [<ffffffff81616535>] __mutex_lock_slowpath+0x95/0x100
kernel: [<ffffffff816165c3>] mutex_lock+0x23/0x40
kernel: [<ffffffff811ebb58>] do_unlinkat+0x128/0x320
kernel: [<ffffffff81021c8c>] ? do_audit_syscall_entry+0x6c/0x70
kernel: [<ffffffff81021dc3>] ? syscall_trace_enter_phase1+0x133/0x150
kernel: [<ffffffff811ebd66>] SyS_unlink+0x16/0x20
kernel: [<ffffffff81618d09>] system_call_fastpath+0x12/0x17
kernel: postmaster D 000000000000000d 0 16658 25395 0x00000084
kernel: ffff88251e143db8 0000000000000082 ffff88251e143de8 ffff88251e140010
kernel: 0000000000012c40 0000000000012c40 ffff88265b268000 ffff88273e395640
kernel: ffff88251e143dd8 ffff8826e845acb8 ffff88265b268000 ffff8826e845acbc
kernel: Call Trace:
kernel: [<ffffffff816145c9>] schedule+0x29/0x70
kernel: [<ffffffff8161472e>] schedule_preempt_disabled+0xe/0x10
kernel: [<ffffffff81616535>] __mutex_lock_slowpath+0x95/0x100
kernel: [<ffffffff816165c3>] mutex_lock+0x23/0x40
kernel: [<ffffffff811ebb58>] do_unlinkat+0x128/0x320
kernel: [<ffffffff81021c8c>] ? do_audit_syscall_entry+0x6c/0x70
kernel: [<ffffffff81021dc3>] ? syscall_trace_enter_phase1+0x133/0x150
kernel: [<ffffffff811ebd66>] SyS_unlink+0x16/0x20
kernel: [<ffffffff81618d09>] system_call_fastpath+0x12/0x17
kernel: postmaster D 0000000000000003 0 24648 25395 0x00000084
kernel: ffff88266abb3db8 0000000000000086 ffff88266abb3de8 ffff88266abb0010
kernel: 0000000000012c40 0000000000012c40 ffff882254f4ab20 ffff88273e323980
kernel: ffff88266abb3dd8 ffff8826e845acb8 ffff882254f4ab20 ffff8826e845acbc
kernel: Call Trace:
kernel: [<ffffffff816145c9>] schedule+0x29/0x70
kernel: [<ffffffff8161472e>] schedule_preempt_disabled+0xe/0x10
kernel: [<ffffffff81616535>] __mutex_lock_slowpath+0x95/0x100
kernel: [<ffffffff816165c3>] mutex_lock+0x23/0x40
kernel: [<ffffffff811ebb58>] do_unlinkat+0x128/0x320
kernel: [<ffffffff81021c8c>] ? do_audit_syscall_entry+0x6c/0x70
kernel: [<ffffffff81021dc3>] ? syscall_trace_enter_phase1+0x133/0x150
kernel: [<ffffffff811ebd66>] SyS_unlink+0x16/0x20
kernel: [<ffffffff81618d09>] system_call_fastpath+0x12/0x17
kernel: bash D 0000000000000004 0 25435 25434 0x00000084
kernel: ffff88265b2c3e28 0000000000000086 ffff882737404020 ffff88265b2c0010
kernel: 0000000000012c40 0000000000012c40 ffff88217e4764a0 ffff88273e3247e0
kernel: 0000000000000003 ffff88270bfe3c78 ffff88217e4764a0 ffff88270bfe3c7c
kernel: Call Trace:
kernel: [<ffffffff816145c9>] schedule+0x29/0x70
kernel: [<ffffffff8161472e>] schedule_preempt_disabled+0xe/0x10
kernel: [<ffffffff81616694>] __mutex_lock_killable_slowpath+0xb4/0x140
kernel: [<ffffffff81616762>] mutex_lock_killable+0x42/0x50
kernel: [<ffffffff811efc77>] iterate_dir+0x77/0x140
kernel: [<ffffffff8110e0cc>] ? __audit_syscall_entry+0xac/0x110
kernel: [<ffffffff811efef1>] SyS_getdents+0xb1/0x100
kernel: [<ffffffff811effd0>] ? SyS_old_readdir+0x90/0x90
kernel: [<ffffffff81618d09>] system_call_fastpath+0x12/0x17
kernel: postmaster D 000000000000000b 0 27811 25395 0x00000084
kernel: ffff8822bf0e7db8 0000000000000086 ffff8822bf0e7de8 ffff8822bf0e4010
kernel: 0000000000012c40 0000000000012c40 ffff882739808000 ffff88273e393980
kernel: ffff8822bf0e7dd8 ffff8826e845acb8 ffff882739808000 ffff8826e845acbc
kernel: Call Trace:
kernel: [<ffffffff816145c9>] schedule+0x29/0x70
kernel: [<ffffffff8161472e>] schedule_preempt_disabled+0xe/0x10
kernel: [<ffffffff81616535>] __mutex_lock_slowpath+0x95/0x100
kernel: [<ffffffff816165c3>] mutex_lock+0x23/0x40
kernel: [<ffffffff811ebb58>] do_unlinkat+0x128/0x320
kernel: [<ffffffff81021c8c>] ? do_audit_syscall_entry+0x6c/0x70
kernel: [<ffffffff81021dc3>] ? syscall_trace_enter_phase1+0x133/0x150
kernel: [<ffffffff811ebd66>] SyS_unlink+0x16/0x20
kernel: [<ffffffff81618d09>] system_call_fastpath+0x12/0x17
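[Editorial note: when wading through a long blocked-task dump like the one above, it can help to count how many D-state tasks are stuck in each interesting frame. The following small Python helper is a sketch, not part of the original thread; the regexes encode assumptions about this particular dmesg format (a `comm D ...` header line followed by `[<addr>] func+0x...` frames), and the `skip` set of generic sleep/lock wrappers is an illustrative choice.]

```python
import re
from collections import Counter

# "kernel: <comm> D <addr> ..." task header (comm may contain spaces)
HEADER = re.compile(r"kernel:\s+(.+?)\s+D\s")
# "kernel: [<addr>] func+0x..." stack frame; group 1 captures an optional "?"
FRAME = re.compile(r"kernel:\s+\[<[0-9a-f]+>\]\s+(\??\s*)(\S+?)\+0x")

# Generic scheduler/lock wrappers that appear at the top of every trace;
# skip these to find the frame that actually identifies what the task does.
SKIP = {
    "schedule", "io_schedule", "schedule_preempt_disabled",
    "bit_wait_io", "__wait_on_bit", "out_of_line_wait_on_bit",
    "__mutex_lock_slowpath", "mutex_lock",
    "__mutex_lock_killable_slowpath", "mutex_lock_killable",
}

def summarize(dump: str) -> Counter:
    """Count blocked tasks by the first certain (non-'?') interesting frame."""
    counts = Counter()
    current = None  # comm of the task whose trace we are inside
    found = None    # first interesting frame seen for that task
    for line in dump.splitlines():
        m = HEADER.search(line)
        if m:
            if current is not None and found is not None:
                counts[found] += 1
            current, found = m.group(1), None
            continue
        f = FRAME.search(line)
        if f and current is not None and found is None:
            qmark, func = f.group(1).strip(), f.group(2)
            # "?" frames are stale/uncertain entries; ignore them
            if not qmark and func not in SKIP:
                found = func
    if current is not None and found is not None:
        counts[found] += 1
    return counts
```

Fed the dump above, this would report, for example, how many tasks sit in `jbd2_log_wait_commit`, `do_get_write_access`, or `do_unlinkat`, which makes the fsync-vs-journal pattern easier to see at a glance.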

--
Nikola

2016-06-09 14:59:25

by Jan Kara

[permalink] [raw]
Subject: Re: xfstests generic/130 hang with non-4k block size ext4 on 4.7-rc1 kernel

Please use reply-to-all when replying to emails. I usually read mailing
lists much less frequently than directly sent email...

On Wed 08-06-16 14:23:49, Holger Hoffstätte wrote:
> On Wed, 08 Jun 2016 14:56:31 +0200, Jan Kara wrote:
> (snip)
> > Attached patch fixes the issue for me. I'll submit it once a full xfstests
> > run finishes for it (which may take a while as our server room is currently
> > moving to a different place).
> >
> > Honza
> > --
> > Jan Kara <[email protected]>
> > SUSE Labs, CR
> > From 3a120841a5d9a6c42bf196389467e9e663cf1cf8 Mon Sep 17 00:00:00 2001
> > From: Jan Kara <[email protected]>
> > Date: Wed, 8 Jun 2016 10:01:45 +0200
> > Subject: [PATCH] ext4: Fix deadlock during page writeback
> >
> > Commit 06bd3c36a733 (ext4: fix data exposure after a crash) uncovered a
> > deadlock in ext4_writepages() which was previously much harder to hit.
> > After this commit xfstest generic/130 reproduces the deadlock on small
> > filesystems.
>
> Since you marked this for -stable, just a heads-up that the previous patch
> for the data exposure was rejected from -stable (see [1]) because it
> has the mismatching "!IS_NOQUOTA(inode) &&" line, which didn't exist
> until 4.6. I removed it locally but Greg probably wants an official patch.
>
> So both this and the previous patch need to be submitted.

This patch is actually an independent fix from the previous one; the
previous patch just made the bug more visible. Regarding the backport - yeah, I
have it on my todo list.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2016-06-09 15:04:08

by Jan Kara

[permalink] [raw]
Subject: Re: xfstests generic/130 hang with non-4k block size ext4 on 4.7-rc1 kernel

On Thu 09-06-16 09:23:29, Nikola Pajkovsky wrote:
> Holger Hoffstätte <[email protected]> writes:
>
> > On Wed, 08 Jun 2016 14:56:31 +0200, Jan Kara wrote:
> > (snip)
> >> Attached patch fixes the issue for me. I'll submit it once a full xfstests
> >> run finishes for it (which may take a while as our server room is currently
> >> moving to a different place).
> >>
> >> Honza
> >> --
> >> Jan Kara <[email protected]>
> >> SUSE Labs, CR
> >> From 3a120841a5d9a6c42bf196389467e9e663cf1cf8 Mon Sep 17 00:00:00 2001
> >> From: Jan Kara <[email protected]>
> >> Date: Wed, 8 Jun 2016 10:01:45 +0200
> >> Subject: [PATCH] ext4: Fix deadlock during page writeback
> >>
> >> Commit 06bd3c36a733 (ext4: fix data exposure after a crash) uncovered a
> >> deadlock in ext4_writepages() which was previously much harder to hit.
> >> After this commit xfstest generic/130 reproduces the deadlock on small
> >> filesystems.
> >
> > Since you marked this for -stable, just a heads-up that the previous patch
> > for the data exposure was rejected from -stable (see [1]) because it
> > has the mismatching "!IS_NOQUOTA(inode) &&" line, which didn't exist
> > until 4.6. I removed it locally but Greg probably wants an official patch.
> >
> > So both this and the previous patch need to be submitted.
> >
> > [1] http://permalink.gmane.org/gmane.linux.kernel.stable/18074{4,5,6}
>
> I'm just wondering if the Jan's patch is not related to blocked
> processes in following trace. It very hard to hit it and I don't have
> any reproducer.

This looks like a different issue. Does the machine recover itself or is it
a hard hang and you have to press a reset button?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2016-06-10 05:52:59

by Nikola Pajkovsky

[permalink] [raw]
Subject: Re: xfstests generic/130 hang with non-4k block size ext4 on 4.7-rc1 kernel

Jan Kara <[email protected]> writes:

> On Thu 09-06-16 09:23:29, Nikola Pajkovsky wrote:
>> Holger Hoffstätte <[email protected]> writes:
>>
>> > On Wed, 08 Jun 2016 14:56:31 +0200, Jan Kara wrote:
>> > (snip)
>> >> Attached patch fixes the issue for me. I'll submit it once a full xfstests
>> >> run finishes for it (which may take a while as our server room is currently
>> >> moving to a different place).
>> >>
>> >> Honza
>> >> --
>> >> Jan Kara <[email protected]>
>> >> SUSE Labs, CR
>> >> From 3a120841a5d9a6c42bf196389467e9e663cf1cf8 Mon Sep 17 00:00:00 2001
>> >> From: Jan Kara <[email protected]>
>> >> Date: Wed, 8 Jun 2016 10:01:45 +0200
>> >> Subject: [PATCH] ext4: Fix deadlock during page writeback
>> >>
>> >> Commit 06bd3c36a733 (ext4: fix data exposure after a crash) uncovered a
>> >> deadlock in ext4_writepages() which was previously much harder to hit.
>> >> After this commit xfstest generic/130 reproduces the deadlock on small
>> >> filesystems.
>> >
>> > Since you marked this for -stable, just a heads-up that the previous patch
>> > for the data exposure was rejected from -stable (see [1]) because it
>> > has the mismatching "!IS_NOQUOTA(inode) &&" line, which didn't exist
>> > until 4.6. I removed it locally but Greg probably wants an official patch.
>> >
>> > So both this and the previous patch need to be submitted.
>> >
>> > [1] http://permalink.gmane.org/gmane.linux.kernel.stable/18074{4,5,6}
>>
>> I'm just wondering if the Jan's patch is not related to blocked
>> processes in following trace. It very hard to hit it and I don't have
>> any reproducer.
>
> This looks like a different issue. Does the machine recover itself or is it
> a hard hang and you have to press a reset button?

The machine is a bit bigger than I have pretended. It's 18 vcpus with 160 GB
ram, and the machine has a dedicated mount point only for PostgreSQL data.

Nevertheless, I was always able to ssh to the machine, so the machine itself
was not in a hard hang, and ext4 mostly recovered by itself (it took
~30min). But I have seen a situation where every process that touched the ext4
filesystem went immediately to D state and did not recover even after an hour.

--
Nikola

2016-06-10 08:37:44

by Eryu Guan

[permalink] [raw]
Subject: Re: xfstests generic/130 hang with non-4k block size ext4 on 4.7-rc1 kernel

On Wed, Jun 08, 2016 at 02:56:31PM +0200, Jan Kara wrote:
> On Fri 03-06-16 13:58:44, Jan Kara wrote:
> > On Fri 03-06-16 18:16:12, Eryu Guan wrote:
> > > On Thu, Jun 02, 2016 at 02:17:50PM +0200, Jan Kara wrote:
> > > >
> > > > So I was trying but I could not reproduce the hang either. Can you find out
> > > > which page is jbd2 thread waiting for and dump page->index, page->flags and
> > > > also bh->b_state, bh->b_blocknr of all 4 buffer heads attached to it via
> > > > page->private? Maybe that will shed some light...
> > >
> > > I'm using crash on live system when the hang happens, so I got the page
> > > address from "bt -f"
> > >
> > > #6 [ffff880212343b40] wait_on_page_bit at ffffffff8119009e
> > > ffff880212343b48: ffffea0002c23600 000000000000000d
> > > ffff880212343b58: 0000000000000000 0000000000000000
> > > ffff880212343b68: ffff880213251480 ffffffff810cd000
> > > ffff880212343b78: ffff88021ff27218 ffff88021ff27218
> > > ffff880212343b88: 00000000c1b4a75a ffff880212343c68
> > > ffff880212343b98: ffffffff811901bf
> >
> > Thanks for debugging! In the end I was able to reproduce the issue on my
> > UML instance as well and I'm debugging what's going on.
>
> Attached patch fixes the issue for me. I'll submit it once a full xfstests
> run finishes for it (which may take a while as our server room is currently
> moving to a different place).

(Sorry for the late reply, I was on holiday yesterday)

Thanks for the fix! I'll give it a test as well.

Thanks,
Eryu

2016-06-12 03:28:30

by Eryu Guan

[permalink] [raw]
Subject: Re: xfstests generic/130 hang with non-4k block size ext4 on 4.7-rc1 kernel

On Fri, Jun 10, 2016 at 04:37:36PM +0800, Eryu Guan wrote:
> On Wed, Jun 08, 2016 at 02:56:31PM +0200, Jan Kara wrote:
> > On Fri 03-06-16 13:58:44, Jan Kara wrote:
> > > On Fri 03-06-16 18:16:12, Eryu Guan wrote:
> > > > On Thu, Jun 02, 2016 at 02:17:50PM +0200, Jan Kara wrote:
> > > > >
> > > > > So I was trying but I could not reproduce the hang either. Can you find out
> > > > > which page is jbd2 thread waiting for and dump page->index, page->flags and
> > > > > also bh->b_state, bh->b_blocknr of all 4 buffer heads attached to it via
> > > > > page->private? Maybe that will shed some light...
> > > >
> > > > I'm using crash on live system when the hang happens, so I got the page
> > > > address from "bt -f"
> > > >
> > > > #6 [ffff880212343b40] wait_on_page_bit at ffffffff8119009e
> > > > ffff880212343b48: ffffea0002c23600 000000000000000d
> > > > ffff880212343b58: 0000000000000000 0000000000000000
> > > > ffff880212343b68: ffff880213251480 ffffffff810cd000
> > > > ffff880212343b78: ffff88021ff27218 ffff88021ff27218
> > > > ffff880212343b88: 00000000c1b4a75a ffff880212343c68
> > > > ffff880212343b98: ffffffff811901bf
> > >
> > > Thanks for debugging! In the end I was able to reproduce the issue on my
> > > UML instance as well and I'm debugging what's going on.
> >
> > Attached patch fixes the issue for me. I'll submit it once a full xfstests
> > run finishes for it (which may take a while as our server room is currently
> > moving to a different place).
>
> (Sorry for the late reply, I was on holiday yesterday)
>
> Thanks for the fix! I'll give it a test as well.

I tested this patch with xfstests on x86_64 and ppc64 hosts; all results
look fine, no regressions found. The test configurations were: 4k/2k/1k block
size ext4/3/2, and data=journal|writeback ext4.

Thanks,
Eryu

2016-06-16 13:26:23

by Jan Kara

[permalink] [raw]
Subject: Re: xfstests generic/130 hang with non-4k block size ext4 on 4.7-rc1 kernel

On Fri 10-06-16 07:52:56, Nikola Pajkovsky wrote:
> Jan Kara <[email protected]> writes:
> > On Thu 09-06-16 09:23:29, Nikola Pajkovsky wrote:
> >> Holger Hoffstätte <[email protected]> writes:
> >>
> >> > On Wed, 08 Jun 2016 14:56:31 +0200, Jan Kara wrote:
> >> > (snip)
> >> >> Attached patch fixes the issue for me. I'll submit it once a full xfstests
> >> >> run finishes for it (which may take a while as our server room is currently
> >> >> moving to a different place).
> >> >>
> >> >> Honza
> >> >> --
> >> >> Jan Kara <[email protected]>
> >> >> SUSE Labs, CR
> >> >> From 3a120841a5d9a6c42bf196389467e9e663cf1cf8 Mon Sep 17 00:00:00 2001
> >> >> From: Jan Kara <[email protected]>
> >> >> Date: Wed, 8 Jun 2016 10:01:45 +0200
> >> >> Subject: [PATCH] ext4: Fix deadlock during page writeback
> >> >>
> >> >> Commit 06bd3c36a733 (ext4: fix data exposure after a crash) uncovered a
> >> >> deadlock in ext4_writepages() which was previously much harder to hit.
> >> >> After this commit xfstest generic/130 reproduces the deadlock on small
> >> >> filesystems.
> >> >
> >> > Since you marked this for -stable, just a heads-up that the previous patch
> >> > for the data exposure was rejected from -stable (see [1]) because it
> >> > has the mismatching "!IS_NOQUOTA(inode) &&" line, which didn't exist
> >> > until 4.6. I removed it locally but Greg probably wants an official patch.
> >> >
> >> > So both this and the previous patch need to be submitted.
> >> >
> >> > [1] http://permalink.gmane.org/gmane.linux.kernel.stable/18074{4,5,6}
> >>
> >> I'm just wondering if the Jan's patch is not related to blocked
> >> processes in following trace. It very hard to hit it and I don't have
> >> any reproducer.
> >
> > This looks like a different issue. Does the machine recover itself or is it
> > a hard hang and you have to press a reset button?
>
> The machine is bit bigger than I have pretend. It's 18 vcpu with 160 GB
> ram and machine has dedicated mount point only for PostgreSQL data.
>
> Nevertheless, I was able always to ssh to the machine, so machine itself
> was not in hard hang and ext4 mostly gets recover by itself (it took
> 30min). But I have seen situation, were every process who 'touch' the ext4
> goes immediately to D state and does not recover even after hour.

If such situation happens, can you run 'echo w >/proc/sysrq-trigger' to
dump stuck processes and also run 'iostat -x 1' for a while to see how much
IO is happening in the system? That should tell us more.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2016-06-16 14:43:02

by Nikola Pajkovsky

[permalink] [raw]
Subject: Re: xfstests generic/130 hang with non-4k block size ext4 on 4.7-rc1 kernel

Jan Kara <[email protected]> writes:

> On Fri 10-06-16 07:52:56, Nikola Pajkovsky wrote:
>> Jan Kara <[email protected]> writes:
>> > On Thu 09-06-16 09:23:29, Nikola Pajkovsky wrote:
>> >> Holger Hoffstätte <[email protected]> writes:
>> >>
>> >> > On Wed, 08 Jun 2016 14:56:31 +0200, Jan Kara wrote:
>> >> > (snip)
>> >> >> Attached patch fixes the issue for me. I'll submit it once a full xfstests
>> >> >> run finishes for it (which may take a while as our server room is currently
>> >> >> moving to a different place).
>> >> >>
>> >> >> Honza
>> >> >> --
>> >> >> Jan Kara <[email protected]>
>> >> >> SUSE Labs, CR
>> >> >> From 3a120841a5d9a6c42bf196389467e9e663cf1cf8 Mon Sep 17 00:00:00 2001
>> >> >> From: Jan Kara <[email protected]>
>> >> >> Date: Wed, 8 Jun 2016 10:01:45 +0200
>> >> >> Subject: [PATCH] ext4: Fix deadlock during page writeback
>> >> >>
>> >> >> Commit 06bd3c36a733 (ext4: fix data exposure after a crash) uncovered a
>> >> >> deadlock in ext4_writepages() which was previously much harder to hit.
>> >> >> After this commit xfstest generic/130 reproduces the deadlock on small
>> >> >> filesystems.
>> >> >
>> >> > Since you marked this for -stable, just a heads-up that the previous patch
>> >> > for the data exposure was rejected from -stable (see [1]) because it
>> >> > has the mismatching "!IS_NOQUOTA(inode) &&" line, which didn't exist
>> >> > until 4.6. I removed it locally but Greg probably wants an official patch.
>> >> >
>> >> > So both this and the previous patch need to be submitted.
>> >> >
>> >> > [1] http://permalink.gmane.org/gmane.linux.kernel.stable/18074{4,5,6}
>> >>
>> >> I'm just wondering if the Jan's patch is not related to blocked
>> >> processes in following trace. It very hard to hit it and I don't have
>> >> any reproducer.
>> >
>> > This looks like a different issue. Does the machine recover itself or is it
>> > a hard hang and you have to press a reset button?
>>
>> The machine is bit bigger than I have pretend. It's 18 vcpu with 160 GB
>> ram and machine has dedicated mount point only for PostgreSQL data.
>>
>> Nevertheless, I was able always to ssh to the machine, so machine itself
>> was not in hard hang and ext4 mostly gets recover by itself (it took
>> 30min). But I have seen situation, were every process who 'touch' the ext4
>> goes immediately to D state and does not recover even after hour.
>
> If such situation happens, can you run 'echo w >/proc/sysrq-trigger' to
> dump stuck processes and also run 'iostat -x 1' for a while to see how much
> IO is happening in the system? That should tell us more.


A link to the 'echo w >/proc/sysrq-trigger' output is here, because it's a bit
too big to mail:

http://expirebox.com/download/68c26e396feb8c9abb0485f857ccea3a.html

I was running iotop and there was roughly ~20 KB/s of write traffic.

What was a bit more interesting was looking at

cat /proc/vmstat | egrep "nr_dirty|nr_writeback"

nr_dirty had around 240 and was slowly counting up, but nr_writeback had
~8800 and was stuck for 120s.
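
For reference, a minimal self-contained sketch of that sampling (the helper
name and the canned sample file are mine; on a live system you would point it
at /proc/vmstat once per second):

```shell
#!/bin/sh
# Pull nr_dirty/nr_writeback out of a vmstat-style file. The anchored
# pattern with a trailing space avoids matching nr_dirty_threshold etc.
# The canned sample below mirrors the numbers observed above, so the
# sketch runs anywhere; nr_dirty climbing while nr_writeback stays flat
# suggests IO completions are stuck.
sample_vmstat() {
    grep -E '^nr_(dirty|writeback) ' "$1"
}

cat > /tmp/vmstat.sample <<'EOF'
nr_dirty 240
nr_writeback 8800
EOF

sample_vmstat /tmp/vmstat.sample
```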

--
Nikola

2016-06-20 11:39:53

by Jan Kara

[permalink] [raw]
Subject: Re: xfstests generic/130 hang with non-4k block size ext4 on 4.7-rc1 kernel

On Thu 16-06-16 16:42:58, Nikola Pajkovsky wrote:
> Jan Kara <[email protected]> writes:
>
> > On Fri 10-06-16 07:52:56, Nikola Pajkovsky wrote:
> >> Jan Kara <[email protected]> writes:
> >> > On Thu 09-06-16 09:23:29, Nikola Pajkovsky wrote:
> >> >> Holger Hoffstätte <[email protected]> writes:
> >> >>
> >> >> > On Wed, 08 Jun 2016 14:56:31 +0200, Jan Kara wrote:
> >> >> > (snip)
> >> >> >> Attached patch fixes the issue for me. I'll submit it once a full xfstests
> >> >> >> run finishes for it (which may take a while as our server room is currently
> >> >> >> moving to a different place).
> >> >> >>
> >> >> >> Honza
> >> >> >> --
> >> >> >> Jan Kara <[email protected]>
> >> >> >> SUSE Labs, CR
> >> >> >> From 3a120841a5d9a6c42bf196389467e9e663cf1cf8 Mon Sep 17 00:00:00 2001
> >> >> >> From: Jan Kara <[email protected]>
> >> >> >> Date: Wed, 8 Jun 2016 10:01:45 +0200
> >> >> >> Subject: [PATCH] ext4: Fix deadlock during page writeback
> >> >> >>
> >> >> >> Commit 06bd3c36a733 (ext4: fix data exposure after a crash) uncovered a
> >> >> >> deadlock in ext4_writepages() which was previously much harder to hit.
> >> >> >> After this commit xfstest generic/130 reproduces the deadlock on small
> >> >> >> filesystems.
> >> >> >
> >> >> > Since you marked this for -stable, just a heads-up that the previous patch
> >> >> > for the data exposure was rejected from -stable (see [1]) because it
> >> >> > has the mismatching "!IS_NOQUOTA(inode) &&" line, which didn't exist
> >> >> > until 4.6. I removed it locally but Greg probably wants an official patch.
> >> >> >
> >> >> > So both this and the previous patch need to be submitted.
> >> >> >
> >> >> > [1] http://permalink.gmane.org/gmane.linux.kernel.stable/18074{4,5,6}
> >> >>
> >> >> I'm just wondering if the Jan's patch is not related to blocked
> >> >> processes in following trace. It very hard to hit it and I don't have
> >> >> any reproducer.
> >> >
> >> > This looks like a different issue. Does the machine recover itself or is it
> >> > a hard hang and you have to press a reset button?
> >>
> >> The machine is bit bigger than I have pretend. It's 18 vcpu with 160 GB
> >> ram and machine has dedicated mount point only for PostgreSQL data.
> >>
> >> Nevertheless, I was able always to ssh to the machine, so machine itself
> >> was not in hard hang and ext4 mostly gets recover by itself (it took
> >> 30min). But I have seen situation, were every process who 'touch' the ext4
> >> goes immediately to D state and does not recover even after hour.
> >
> > If such situation happens, can you run 'echo w >/proc/sysrq-trigger' to
> > dump stuck processes and also run 'iostat -x 1' for a while to see how much
> > IO is happening in the system? That should tell us more.
>
>
> Link to 'echo w >/proc/sysrq-trigger' is here, because it's bit bigger
> to mail it.
>
> http://expirebox.com/download/68c26e396feb8c9abb0485f857ccea3a.html

Can you upload it again please? I only got to looking at the file today
and it has already been deleted. Thanks!

> I was running iotop and there was traffic roughly ~20 KB/s write.
>
> What was bit more interesting, was looking at
>
> cat /proc/vmstat | egrep "nr_dirty|nr_writeback"
>
> nr_drity had around 240 and was slowly counting up, but nr_writeback had
> ~8800 and was stuck for 120s.

Hum, interesting. This would suggest that IO completion got stuck for some
reason. Hopefully we'll see more from the stacktraces.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2016-06-20 13:00:03

by Nikola Pajkovsky

[permalink] [raw]
Subject: Re: xfstests generic/130 hang with non-4k block size ext4 on 4.7-rc1 kernel

Jan Kara <[email protected]> writes:

> On Thu 16-06-16 16:42:58, Nikola Pajkovsky wrote:
>> Jan Kara <[email protected]> writes:
>>
>> > On Fri 10-06-16 07:52:56, Nikola Pajkovsky wrote:
>> >> Jan Kara <[email protected]> writes:
>> >> > On Thu 09-06-16 09:23:29, Nikola Pajkovsky wrote:
>> >> >> Holger Hoffstätte <[email protected]> writes:
>> >> >>
>> >> >> > On Wed, 08 Jun 2016 14:56:31 +0200, Jan Kara wrote:
>> >> >> > (snip)
>> >> >> >> Attached patch fixes the issue for me. I'll submit it once a full xfstests
>> >> >> >> run finishes for it (which may take a while as our server room is currently
>> >> >> >> moving to a different place).
>> >> >> >>
>> >> >> >> Honza
>> >> >> >> --
>> >> >> >> Jan Kara <[email protected]>
>> >> >> >> SUSE Labs, CR
>> >> >> >> From 3a120841a5d9a6c42bf196389467e9e663cf1cf8 Mon Sep 17 00:00:00 2001
>> >> >> >> From: Jan Kara <[email protected]>
>> >> >> >> Date: Wed, 8 Jun 2016 10:01:45 +0200
>> >> >> >> Subject: [PATCH] ext4: Fix deadlock during page writeback
>> >> >> >>
>> >> >> >> Commit 06bd3c36a733 (ext4: fix data exposure after a crash) uncovered a
>> >> >> >> deadlock in ext4_writepages() which was previously much harder to hit.
>> >> >> >> After this commit xfstest generic/130 reproduces the deadlock on small
>> >> >> >> filesystems.
>> >> >> >
>> >> >> > Since you marked this for -stable, just a heads-up that the previous patch
>> >> >> > for the data exposure was rejected from -stable (see [1]) because it
>> >> >> > has the mismatching "!IS_NOQUOTA(inode) &&" line, which didn't exist
>> >> >> > until 4.6. I removed it locally but Greg probably wants an official patch.
>> >> >> >
>> >> >> > So both this and the previous patch need to be submitted.
>> >> >> >
>> >> >> > [1] http://permalink.gmane.org/gmane.linux.kernel.stable/18074{4,5,6}
>> >> >>
>> >> >> I'm just wondering if the Jan's patch is not related to blocked
>> >> >> processes in following trace. It very hard to hit it and I don't have
>> >> >> any reproducer.
>> >> >
>> >> > This looks like a different issue. Does the machine recover itself or is it
>> >> > a hard hang and you have to press a reset button?
>> >>
>> >> The machine is bit bigger than I have pretend. It's 18 vcpu with 160 GB
>> >> ram and machine has dedicated mount point only for PostgreSQL data.
>> >>
>> >> Nevertheless, I was able always to ssh to the machine, so machine itself
>> >> was not in hard hang and ext4 mostly gets recover by itself (it took
>> >> 30min). But I have seen situation, were every process who 'touch' the ext4
>> >> goes immediately to D state and does not recover even after hour.
>> >
>> > If such situation happens, can you run 'echo w >/proc/sysrq-trigger' to
>> > dump stuck processes and also run 'iostat -x 1' for a while to see how much
>> > IO is happening in the system? That should tell us more.
>>
>>
>> Link to 'echo w >/proc/sysrq-trigger' is here, because it's bit bigger
>> to mail it.
>>
>> http://expirebox.com/download/68c26e396feb8c9abb0485f857ccea3a.html
>
> Can you upload it again please? I've got to looking at the file only today
> and it is already deleted. Thanks!

http://expirebox.com/download/c010e712e55938435c446cdc01a0b523.html

>> I was running iotop and there was traffic roughly ~20 KB/s write.
>>
>> What was bit more interesting, was looking at
>>
>> cat /proc/vmstat | egrep "nr_dirty|nr_writeback"
>>
>> nr_drity had around 240 and was slowly counting up, but nr_writeback had
>> ~8800 and was stuck for 120s.
>
> Hum, interesting. This would suggest like IO completion got stuck for some
> reason. We'll see more from the stacktraces hopefully.

I monitored /sys/kernel/debug/bdi/253:32/stats for 10 mins at 1-second
intervals. The values were all the same, as follows:

--[ Sun Jun 19 06:11:08 CEST 2016
BdiWriteback: 15840 kB
BdiReclaimable: 32320 kB
BdiDirtyThresh: 0 kB
DirtyThresh: 1048576 kB
BackgroundThresh: 131072 kB
BdiDirtied: 6131163680 kB
BdiWritten: 6130214880 kB
BdiWriteBandwidth: 324948 kBps
b_dirty: 2
b_io: 3
b_more_io: 0
bdi_list: 1
state: c

Maybe those values cause the issue by kicking in writeback too often and
blocking everyone else.

$ sysctl -a | grep dirty | grep -v ratio
vm.dirty_background_bytes = 134217728
vm.dirty_bytes = 1073741824
vm.dirty_expire_centisecs = 1500
vm.dirty_writeback_centisecs = 500
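
For what it's worth, these byte limits are exactly the DirtyThresh
(1048576 kB) and BackgroundThresh (131072 kB) figures in the bdi stats
above; the sysctls are in bytes while the stats report kB:

```shell
#!/bin/sh
# vm.dirty_bytes / vm.dirty_background_bytes are in bytes; dividing by
# 1024 shows they match the kB thresholds in the bdi stats.
dirty_bytes=1073741824
dirty_background_bytes=134217728
echo "DirtyThresh:      $((dirty_bytes / 1024)) kB"             # 1048576 kB
echo "BackgroundThresh: $((dirty_background_bytes / 1024)) kB"  # 131072 kB
```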

I even have the output of the following command, if you're interested.

$ trace-cmd record -e ext4 -e jbd2 -e writeback -e block sleep 600

--
Nikola

2016-06-21 10:23:46

by Jan Kara

[permalink] [raw]
Subject: Re: xfstests generic/130 hang with non-4k block size ext4 on 4.7-rc1 kernel

On Mon 20-06-16 14:59:57, Nikola Pajkovsky wrote:
> Jan Kara <[email protected]> writes:
> > On Thu 16-06-16 16:42:58, Nikola Pajkovsky wrote:
> >> Jan Kara <[email protected]> writes:
> >>
> >> > On Fri 10-06-16 07:52:56, Nikola Pajkovsky wrote:
> >> >> Jan Kara <[email protected]> writes:
> >> >> > On Thu 09-06-16 09:23:29, Nikola Pajkovsky wrote:
> >> >> >> Holger Hoffstätte <[email protected]> writes:
> >> >> >>
> >> >> >> > On Wed, 08 Jun 2016 14:56:31 +0200, Jan Kara wrote:
> >> >> >> > (snip)
> >> >> >> >> Attached patch fixes the issue for me. I'll submit it once a full xfstests
> >> >> >> >> run finishes for it (which may take a while as our server room is currently
> >> >> >> >> moving to a different place).
> >> >> >> >>
> >> >> >> >> Honza
> >> >> >> >> --
> >> >> >> >> Jan Kara <[email protected]>
> >> >> >> >> SUSE Labs, CR
> >> >> >> >> From 3a120841a5d9a6c42bf196389467e9e663cf1cf8 Mon Sep 17 00:00:00 2001
> >> >> >> >> From: Jan Kara <[email protected]>
> >> >> >> >> Date: Wed, 8 Jun 2016 10:01:45 +0200
> >> >> >> >> Subject: [PATCH] ext4: Fix deadlock during page writeback
> >> >> >> >>
> >> >> >> >> Commit 06bd3c36a733 (ext4: fix data exposure after a crash) uncovered a
> >> >> >> >> deadlock in ext4_writepages() which was previously much harder to hit.
> >> >> >> >> After this commit xfstest generic/130 reproduces the deadlock on small
> >> >> >> >> filesystems.
> >> >> >> >
> >> >> >> > Since you marked this for -stable, just a heads-up that the previous patch
> >> >> >> > for the data exposure was rejected from -stable (see [1]) because it
> >> >> >> > has the mismatching "!IS_NOQUOTA(inode) &&" line, which didn't exist
> >> >> >> > until 4.6. I removed it locally but Greg probably wants an official patch.
> >> >> >> >
> >> >> >> > So both this and the previous patch need to be submitted.
> >> >> >> >
> >> >> >> > [1] http://permalink.gmane.org/gmane.linux.kernel.stable/18074{4,5,6}
> >> >> >>
> >> >> >> I'm just wondering if the Jan's patch is not related to blocked
> >> >> >> processes in following trace. It very hard to hit it and I don't have
> >> >> >> any reproducer.
> >> >> >
> >> >> > This looks like a different issue. Does the machine recover itself or is it
> >> >> > a hard hang and you have to press a reset button?
> >> >>
> >> >> The machine is bit bigger than I have pretend. It's 18 vcpu with 160 GB
> >> >> ram and machine has dedicated mount point only for PostgreSQL data.
> >> >>
> >> >> Nevertheless, I was able always to ssh to the machine, so machine itself
> >> >> was not in hard hang and ext4 mostly gets recover by itself (it took
> >> >> 30min). But I have seen situation, were every process who 'touch' the ext4
> >> >> goes immediately to D state and does not recover even after hour.
> >> >
> >> > If such situation happens, can you run 'echo w >/proc/sysrq-trigger' to
> >> > dump stuck processes and also run 'iostat -x 1' for a while to see how much
> >> > IO is happening in the system? That should tell us more.
> >>
> >>
> >> Link to 'echo w >/proc/sysrq-trigger' is here, because it's bit bigger
> >> to mail it.
> >>
> >> http://expirebox.com/download/68c26e396feb8c9abb0485f857ccea3a.html
> >
> > Can you upload it again please? I've got to looking at the file only today
> > and it is already deleted. Thanks!
>
> http://expirebox.com/download/c010e712e55938435c446cdc01a0b523.html

OK, I had a look into the traces and the JBD2 thread just waits for the buffers
it has submitted for IO to complete. The rest is just blocked on that. From
the message "INFO: task jbd2/vdc-8:4710 blocked for more than 120 seconds.
severity=err" we can see that the JBD2 process has been waiting for a
significant amount of time. Now the question is why it takes so long for
the IO to complete - likely not a fs problem but something below - the block
layer or the storage itself.

What is the underlying storage? And what IO scheduler do you use? Seeing
that the device is 'vdc' - that suggests you are running in a guest - is
there anything interesting happening on the host at that moment? Is IO from
other guests / the host stalled at that moment as well?

> >> I was running iotop and there was traffic roughly ~20 KB/s write.
> >>
> >> What was bit more interesting, was looking at
> >>
> >> cat /proc/vmstat | egrep "nr_dirty|nr_writeback"
> >>
> >> nr_drity had around 240 and was slowly counting up, but nr_writeback had
> >> ~8800 and was stuck for 120s.
> >
> > Hum, interesting. This would suggest like IO completion got stuck for some
> > reason. We'll see more from the stacktraces hopefully.
>
> I have monitor /sys/kernel/debug/bdi/253:32/stats for 10 mins per 1 sec.
> Values are all same as follows:
>
> --[ Sun Jun 19 06:11:08 CEST 2016
> BdiWriteback: 15840 kB
> BdiReclaimable: 32320 kB
> BdiDirtyThresh: 0 kB
> DirtyThresh: 1048576 kB
> BackgroundThresh: 131072 kB
> BdiDirtied: 6131163680 kB
> BdiWritten: 6130214880 kB
> BdiWriteBandwidth: 324948 kBps
> b_dirty: 2
> b_io: 3
> b_more_io: 0
> bdi_list: 1
> state: c

OK, so all the IO looks stalled for that period of time.

> Maybe those values can cause issue and kicks in writeback to often and
> block everyone else.
>
> $ sysctl -a | grep dirty | grep -v ratio
> vm.dirty_background_bytes = 134217728
> vm.dirty_bytes = 1073741824
> vm.dirty_expire_centisecs = 1500
> vm.dirty_writeback_centisecs = 500

This looks healthy.

> I even have output of command, if you're interested.
>
> $ trace-cmd record -e ext4 -e jbd2 -e writeback -e block sleep 600

Traces from the block layer may be interesting, but you'd need the trace
started before the hang begins, so that you can see what happened with the IO
that jbd2/vdc-8:4710 is waiting for.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2016-06-22 09:02:58

by Nikola Pajkovsky

[permalink] [raw]
Subject: Re: xfstests generic/130 hang with non-4k block size ext4 on 4.7-rc1 kernel

Jan Kara <[email protected]> writes:

> On Mon 20-06-16 14:59:57, Nikola Pajkovsky wrote:
>> Jan Kara <[email protected]> writes:
>> > On Thu 16-06-16 16:42:58, Nikola Pajkovsky wrote:
>> >> Jan Kara <[email protected]> writes:
>> >>
>> >> > On Fri 10-06-16 07:52:56, Nikola Pajkovsky wrote:
>> >> >> Jan Kara <[email protected]> writes:
>> >> >> > On Thu 09-06-16 09:23:29, Nikola Pajkovsky wrote:
>> >> >> >> Holger Hoffstätte <[email protected]> writes:
>> >> >> >>
>> >> >> >> > On Wed, 08 Jun 2016 14:56:31 +0200, Jan Kara wrote:
>> >> >> >> > (snip)
>> >> >> >> >> Attached patch fixes the issue for me. I'll submit it once a full xfstests
>> >> >> >> >> run finishes for it (which may take a while as our server room is currently
>> >> >> >> >> moving to a different place).
>> >> >> >> >>
>> >> >> >> >> Honza
>> >> >> >> >> --
>> >> >> >> >> Jan Kara <[email protected]>
>> >> >> >> >> SUSE Labs, CR
>> >> >> >> >> From 3a120841a5d9a6c42bf196389467e9e663cf1cf8 Mon Sep 17 00:00:00 2001
>> >> >> >> >> From: Jan Kara <[email protected]>
>> >> >> >> >> Date: Wed, 8 Jun 2016 10:01:45 +0200
>> >> >> >> >> Subject: [PATCH] ext4: Fix deadlock during page writeback
>> >> >> >> >>
>> >> >> >> >> Commit 06bd3c36a733 (ext4: fix data exposure after a crash) uncovered a
>> >> >> >> >> deadlock in ext4_writepages() which was previously much harder to hit.
>> >> >> >> >> After this commit xfstest generic/130 reproduces the deadlock on small
>> >> >> >> >> filesystems.
>> >> >> >> >
>> >> >> >> > Since you marked this for -stable, just a heads-up that the previous patch
>> >> >> >> > for the data exposure was rejected from -stable (see [1]) because it
>> >> >> >> > has the mismatching "!IS_NOQUOTA(inode) &&" line, which didn't exist
>> >> >> >> > until 4.6. I removed it locally but Greg probably wants an official patch.
>> >> >> >> >
>> >> >> >> > So both this and the previous patch need to be submitted.
>> >> >> >> >
>> >> >> >> > [1] http://permalink.gmane.org/gmane.linux.kernel.stable/18074{4,5,6}
>> >> >> >>
>> >> >> >> I'm just wondering if the Jan's patch is not related to blocked
>> >> >> >> processes in following trace. It very hard to hit it and I don't have
>> >> >> >> any reproducer.
>> >> >> >
>> >> >> > This looks like a different issue. Does the machine recover itself or is it
>> >> >> > a hard hang and you have to press a reset button?
>> >> >>
>> >> >> The machine is bit bigger than I have pretend. It's 18 vcpu with 160 GB
>> >> >> ram and machine has dedicated mount point only for PostgreSQL data.
>> >> >>
>> >> >> Nevertheless, I was able always to ssh to the machine, so machine itself
>> >> >> was not in hard hang and ext4 mostly gets recover by itself (it took
>> >> >> 30min). But I have seen situation, were every process who 'touch' the ext4
>> >> >> goes immediately to D state and does not recover even after hour.
>> >> >
>> >> > If such situation happens, can you run 'echo w >/proc/sysrq-trigger' to
>> >> > dump stuck processes and also run 'iostat -x 1' for a while to see how much
>> >> > IO is happening in the system? That should tell us more.
>> >>
>> >>
>> >> The output of 'echo w >/proc/sysrq-trigger' is linked here, because
>> >> it's a bit too big to mail:
>> >>
>> >> http://expirebox.com/download/68c26e396feb8c9abb0485f857ccea3a.html
>> >
>> > Can you upload it again please? I only got around to looking at the
>> > file today and it has already been deleted. Thanks!
>>
>> http://expirebox.com/download/c010e712e55938435c446cdc01a0b523.html
>
> OK, I had a look at the traces and the JBD2 thread is just waiting for the
> buffers it has submitted for IO to complete. The rest is just blocked on
> that. From the message "INFO: task jbd2/vdc-8:4710 blocked for more than
> 120 seconds. severity=err" we can see that the JBD2 process has been
> waiting for a significant amount of time. Now the question is why it takes
> so long for the IO to complete - likely not a fs problem but something
> below - the block layer or the storage itself.
>
> What is the underlying storage? And what IO scheduler do you use? Seeing
> that the device is 'vdc' - that suggests you are running in a guest - is
> there anything interesting happening on the host at that moment? Is IO from
> other guests / the host stalled at that moment as well?

The underlying storage is 24 disks in hw raid6 with a 64k stripe. LVM is
used to manage partitions for the virt guests. Guests see just a block
device, which is formatted in the guest with jsize=2048 and mounted with
rw,noatime,nodiratime,user_xattr,acl.

Two guests are running 3.18.34 with virtio-blk and hence use multiqueue;
if I remember correctly, no IO scheduler is used with multiqueue. Each has
18 vcpus and 160 GB of RAM, and only PostgreSQL uses /dev/vdc.

There are two other, much smaller guests running the standard RHEL6 kernel
with the deadline IO scheduler.

We're plotting IO and read/write throughput from the host, and when
processes get blocked in one guest, we don't see any traffic going down to
the host/raid6. The other guests seem to be running just fine, since we
don't get any blocked-process reports from them.
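[When a stall like this hits, it helps to list exactly which tasks are in D state on a guest; a minimal procfs walk — the "pid comm" output format is my own:]

```shell
#!/bin/sh
# Sketch: list tasks in uninterruptible sleep (D state) by reading
# /proc/<pid>/status. Prints "<pid> <comm>" for each stuck task.
list_dstate() {
    for d in /proc/[0-9]*; do
        # The status line looks like: "State:  D (disk sleep)"
        state=$(awk '/^State:/ { print $2 }' "$d/status" 2>/dev/null)
        if [ "$state" = "D" ]; then
            echo "${d#/proc/} $(cat "$d/comm" 2>/dev/null)"
        fi
    done
}

list_dstate
```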

>> >> I was running iotop and there was roughly ~20 KB/s of write traffic.
>> >>
>> >> What was a bit more interesting was looking at
>> >>
>> >> cat /proc/vmstat | egrep "nr_dirty|nr_writeback"
>> >>
>> >> nr_dirty was around 240 and slowly counting up, but nr_writeback was at
>> >> ~8800 and stayed stuck there for 120s.
>> >
>> > Hum, interesting. This suggests that IO completion got stuck for some
>> > reason. We'll hopefully see more from the stack traces.
>>
>> I have monitored /sys/kernel/debug/bdi/253:32/stats once per second for
>> 10 minutes. The values all stayed the same, as follows:
>>
>> --[ Sun Jun 19 06:11:08 CEST 2016
>> BdiWriteback: 15840 kB
>> BdiReclaimable: 32320 kB
>> BdiDirtyThresh: 0 kB
>> DirtyThresh: 1048576 kB
>> BackgroundThresh: 131072 kB
>> BdiDirtied: 6131163680 kB
>> BdiWritten: 6130214880 kB
>> BdiWriteBandwidth: 324948 kBps
>> b_dirty: 2
>> b_io: 3
>> b_more_io: 0
>> bdi_list: 1
>> state: c
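[The per-second sampling described above can be scripted; a minimal sketch — the 253:32 bdi id is taken from the report, and the 3-iteration default is just for illustration (the original run was 600 samples over ten minutes):]

```shell
#!/bin/sh
# Sketch: timestamped samples of the global dirty/writeback page counters
# and the per-bdi stats. BDI must match the device (253:32 in the report);
# debugfs must be mounted for the bdi stats file to exist.
BDI=${BDI:-253:32}

sample() {
    echo "--[ $(date)"
    grep -E 'nr_dirty|nr_writeback' /proc/vmstat
    if [ -r "/sys/kernel/debug/bdi/$BDI/stats" ]; then
        cat "/sys/kernel/debug/bdi/$BDI/stats"
    fi
}

# Run with e.g. COUNT=600 INTERVAL=1 for the ten-minute window used above
i=0
while [ "$i" -lt "${COUNT:-3}" ]; do
    sample
    sleep "${INTERVAL:-1}"
    i=$((i + 1))
done
```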
>
> OK, so all the IO looks stalled for that period.
>
>> Maybe those values are causing the issue by kicking in writeback too
>> often and blocking everyone else.
>>
>> $ sysctl -a | grep dirty | grep -v ratio
>> vm.dirty_background_bytes = 134217728
>> vm.dirty_bytes = 1073741824
>> vm.dirty_expire_centisecs = 1500
>> vm.dirty_writeback_centisecs = 500
>
> This looks healthy.
>
>> I even have the output of the following command, if you're interested.
>>
>> $ trace-cmd record -e ext4 -e jbd2 -e writeback -e block sleep 600
>
> Traces from the block layer may be interesting, but you'd need the trace
> started before the hang so that you can see what happened to the IO that
> jbd2/vdc-8:4710 is waiting for.
>
> Honza
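[To get the trace started before the hang as suggested above, the recording can be wrapped so it covers the whole window; a minimal sketch — the output path and 600s duration are illustrative, and since recording needs root and trace-cmd installed, DRY_RUN=1 just prints the command:]

```shell
#!/bin/sh
# Sketch: start the event tracing *before* the workload so the resulting
# trace covers the IO the journal thread later waits on.
OUT=${OUT:-/var/tmp/trace.dat}
DURATION=${DURATION:-600}

start_trace() {
    cmd="trace-cmd record -e ext4 -e jbd2 -e writeback -e block -o $OUT sleep $DURATION"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$cmd"    # show what would run, without root or trace-cmd
    else
        $cmd           # inspect afterwards with: trace-cmd report $OUT
    fi
}
```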

--
Nikola