2017-03-01 11:45:52

by Nick Alcock

Subject: v4.7--v4.10+: ext4: repeatable inline-data oops (and fs corruption) caused by msync() of shared writable mmap (with recipe)

[Resend, after the first attempt, from my home address, failed with
endless greylisting followed by "4.5.0 Interactive router timed out"
from all but the lowest-priority MX for vger, and "Name server:
bl-ckh-le.kernel.org.: host not found" for the apparently-nonexistent
lowest-priority MX. Maybe it'll work better from here.]

I first spotted this -- or it spotted me -- back in the v4.7.x days. It
is still present in v4.10.

Here's a replication recipe, given a reasonable rootfs with a compiler
on it, and assuming a blank virtio disk on /dev/vdb:

bash-4.4# mke2fs -t ext4 -O inline_data /dev/vdb
# using stock /etc/mke2fs.conf from e2fsprogs master

bash-4.4# mount /dev/vdb /mnt/boom
bash-4.4# cat > boom.c
/* derived from dovecot's configure script */

#include <string.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>

int main(void) {
    /* exit 0 if a write() made after msync() is visible through the mapping */
    int f = open("conftest.mmap", O_RDWR|O_CREAT|O_TRUNC, 0600);
    void *mem;
    if (f == -1) {
        perror("open()");
        return 1;
    }
    unlink("conftest.mmap");

    /* 2 bytes: the digit plus its terminating NUL */
    write(f, "1", 2);
    mem = mmap(NULL, 2, PROT_READ|PROT_WRITE, MAP_SHARED, f, 0);
    if (mem == MAP_FAILED) {
        perror("mmap()");
        return 1;
    }
    /* dirty the page through the shared mapping, then force writeback */
    strcpy(mem, "2");
    msync(mem, 2, MS_SYNC);
    lseek(f, 0, SEEK_SET);
    write(f, "3", 2);

    return strcmp(mem, "3") == 0 ? 0 : 1;
}

bash-4.4# gcc -O2 -o boom boom.c
bash-4.4# ./boom
[ 205.652124] ------------[ cut here ]------------
[ 205.653692] kernel BUG at fs/ext4/inode.c:2696!
[ 205.655174] invalid opcode: 0000 [#1] SMP
[ 205.656527] Modules linked in:
[ 205.657675] CPU: 1 PID: 151 Comm: boom Not tainted 4.10.0-00006-g7f691c7bbef7-dirty #22
[ 205.660319] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.1-0-g8891697-prebuilt.qemu-project.org 04/01/2014
[ 205.661496] task: ffff88013a325040 task.stack: ffffc90000328000
[ 205.661496] RIP: 0010:ext4_writepages+0xb30/0xcf0
[ 205.661496] RSP: 0018:ffffc9000032bcb8 EFLAGS: 00010287
[ 205.661496] RAX: 0000028410000000 RBX: ffff880139c820c0 RCX: 0000000000000800
[ 205.661496] RDX: 0000000000a82000 RSI: 0000000000000001 RDI: ffff88013a3d4000
[ 205.661496] RBP: ffffc9000032bde0 R08: 0000000000000800 R09: ffff880139c820c0
[ 205.661496] R10: ffff880139c820c0 R11: 0000000000000000 R12: ffff880139cae898
[ 205.661496] R13: ffff880139caea00 R14: ffff88013a3d7800 R15: ffffc9000032be00
[ 205.661496] FS: 00007fc55a32e700(0000) GS:ffff88013fd00000(0000) knlGS:0000000000000000
[ 205.661496] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 205.661496] CR2: 00007fc55a37d000 CR3: 0000000139546000 CR4: 00000000000006e0
[ 205.661496] Call Trace:
[ 205.661496] ? __block_write_begin_int+0x2f2/0x5c0
[ 205.661496] ? ext4_inode_attach_jinode.part.16+0xa0/0xa0
[ 205.661496] ? __set_page_dirty_buffers+0x25/0xc0
[ 205.661496] ? ext4_set_page_dirty+0x49/0xa0
[ 205.661496] ? set_page_dirty+0x5b/0xb0
[ 205.661496] ? block_page_mkwrite+0xc2/0x100
[ 205.661496] ? ext4_page_mkwrite+0xe0/0x4c0
[ 205.661496] do_writepages+0x1e/0x30
[ 205.661496] __filemap_fdatawrite_range+0x71/0x90
[ 205.661496] filemap_write_and_wait_range+0x2a/0x70
[ 205.661496] ext4_sync_file+0xf4/0x390
[ 205.661496] vfs_fsync_range+0x49/0xa0
[ 205.661496] ? find_vma+0x1b/0x70
[ 205.661496] SyS_msync+0x182/0x200
[ 205.661496] entry_SYSCALL_64_fastpath+0x13/0x94
[ 205.661496] RIP: 0033:0x7fc559ea2710
[ 205.661496] RSP: 002b:00007ffec1f76c08 EFLAGS: 00000246 ORIG_RAX: 000000000000001a
[ 205.661496] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fc559ea2710
[ 205.661496] RDX: 0000000000000004 RSI: 0000000000000002 RDI: 00007fc55a37d000
[ 205.661496] RBP: 00007fc55a37d000 R08: 0000000000000003 R09: 0000000000000000
[ 205.661496] R10: 0000000000000305 R11: 0000000000000246 R12: 00000000004006a0
[ 205.661496] R13: 00007ffec1f76d00 R14: 0000000000000000 R15: 0000000000000000
[ 205.661496] Code: 8b 44 24 18 48 c7 c1 38 ea 9e 81 ba a8 09 00 00 48 c7 c6 40 eb 83 81 48 8b 78 28 4c 8b 40 40 e8 37 97 01 00 44 8b 54 24 08 eb ac <0f> 0b 4c 8b 74 24 28 31 db 4c 8b 6c 24 20 4c 8b 7c 24 40 41 f6
[ 205.661496] RIP: ext4_writepages+0xb30/0xcf0 RSP: ffffc9000032bcb8
[ 205.730074] ---[ end trace f8ac10159c3827e3 ]---

./boom is (obviously) now stuck in D state, so the filesystem is not
umountable (except lazily). Further writing to the filesystem in this
state can corrupt it so badly that fsck can't make head or tail of it,
though debugfs can still find hints that it was probably an ext4
filesystem once upon a time.
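
One way to confirm the D state without relying on ps is to read the state field from /proc/<pid>/stat. The following is a small sketch of that; the function names stat_line_state() and task_state() are made up for illustration, and the only assumption about the file is its documented "pid (comm) state ..." layout.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/*
 * Hypothetical helper: pull the state character out of the contents
 * of /proc/<pid>/stat. The command name is wrapped in parentheses and
 * may itself contain spaces or ')', so search for the last ')'.
 * Returns '\0' if the line does not look like a stat line.
 */
char stat_line_state(const char *line)
{
    const char *rparen;

    assert(line != NULL);
    rparen = strrchr(line, ')');
    if (rparen == NULL || rparen[1] != ' ' || rparen[2] == '\0')
        return '\0';
    return rparen[2];
}

/* Hypothetical helper: state of a live task, or '\0' if unreadable. */
char task_state(pid_t pid)
{
    char path[64], line[512];
    char *ok;
    FILE *fp;

    snprintf(path, sizeof(path), "/proc/%ld/stat", (long)pid);
    fp = fopen(path, "r");
    if (fp == NULL)
        return '\0';
    ok = fgets(line, sizeof(line), fp);
    fclose(fp);
    return ok ? stat_line_state(line) : '\0';
}
```

For the stuck process in the trace above (PID 151), task_state(151) would be expected to return 'D' for as long as the task remains in uninterruptible sleep.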


2017-03-13 00:52:36

by Eric Biggers

Subject: Re: v4.7--v4.10+: ext4: repeatable inline-data oops (and fs corruption) caused by msync() of shared writable mmap (with recipe)

On Wed, Mar 01, 2017 at 11:45:52AM +0000, Nick Alcock wrote:
> [Resend, after the first attempt, from my home address, failed with
> endless greylisting followed by "4.5.0 Interactive router timed out"
> from all but the lowest-priority MX for vger, and "Name server:
> bl-ckh-le.kernel.org.: host not found" for the apparently-nonexistent
> lowest-priority MX. Maybe it'll work better from here.]
>
> I first spotted this -- or it spotted me -- back in the v4.7.x days. It
> is still present in v4.10.
>
> Here's a replication recipe, given a reasonable rootfs with a compiler
> on it, and assuming a blank virtio disk on /dev/vdb:
>

Hi Nick, thanks for reporting this. I've sent a patch which should fix this,
and Cc'ed you. This actually seems to have been a bug for a very long time, maybe
even ever since the inline_data feature was introduced. (I was able to
reproduce it in a 3.18 kernel, at least.) I'm not sure why it didn't get
noticed earlier --- maybe hardly anyone ever writes to small files with mmap...

- Eric
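
Since the bug only concerns files whose contents are stored inline, it can help to check whether a given small file actually is. Below is a rough sketch using the FS_IOC_GETFLAGS ioctl and the FS_INLINE_DATA_FL bit from <linux/fs.h>. The helper names are hypothetical, the fallback #define is there for older headers, and it assumes the running kernel reports this flag to user space.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

#ifndef FS_INLINE_DATA_FL
#define FS_INLINE_DATA_FL 0x10000000 /* fallback for older headers */
#endif

/* Hypothetical helper: does this inode flag word carry the inline-data bit? */
int flags_mark_inline_data(unsigned int flags)
{
    return (flags & FS_INLINE_DATA_FL) != 0;
}

/*
 * Hypothetical helper: 1 if `path` reports the inline-data flag,
 * 0 if it does not, -1 if the file cannot be opened or the
 * filesystem does not support FS_IOC_GETFLAGS.
 */
int file_has_inline_data(const char *path)
{
    int flags = 0;
    int fd, rc;

    assert(path != NULL);
    fd = open(path, O_RDONLY | O_NONBLOCK);
    if (fd == -1)
        return -1;
    rc = ioctl(fd, FS_IOC_GETFLAGS, &flags);
    close(fd);
    return rc == -1 ? -1 : flags_mark_inline_data((unsigned int)flags);
}
```

A file only stays inline while it is small enough to fit in the inode, so the result may change as the file grows.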

2017-03-13 23:11:51

by Nick Alcock

Subject: Re: v4.7--v4.10+: ext4: repeatable inline-data oops (and fs corruption) caused by msync() of shared writable mmap (with recipe)

On 13 Mar 2017, Eric Biggers spake thusly:

> On Wed, Mar 01, 2017 at 11:45:52AM +0000, Nick Alcock wrote:
>> [Resend, after the first attempt, from my home address, failed with
>> endless greylisting followed by "4.5.0 Interactive router timed out"
>> from all but the lowest-priority MX for vger, and "Name server:
>> bl-ckh-le.kernel.org.: host not found" for the apparently-nonexistent
>> lowest-priority MX. Maybe it'll work better from here.]
>>
>> I first spotted this -- or it spotted me -- back in the v4.7.x days. It
>> is still present in v4.10.
>>
>> Here's a replication recipe, given a reasonable rootfs with a compiler
>> on it, and assuming a blank virtio disk on /dev/vdb:
>
> Hi Nick, thanks for reporting this. I've sent a patch which should fix this,
> and Cc'ed you. This actually seems to have been a bug for a very long time, maybe

I'll test it. Your timing is supernatural: I was just about to mkfs all
the filesystems on my new server (a once-in-a-decade operation for me)
and was bemoaning the fact that I couldn't turn on inline_data at the
same time. Now I can! (I have good backups so can take suicidally crazy
risks).

> even ever since the inline_data feature was introduced. (I was able to
> reproduce it in a 3.18 kernel, at least.) I'm not sure why it didn't get
> noticed earlier --- maybe hardly anyone ever writes to small files with mmap...

Yeah, I built my /usr/src with it and ran for weeks without hitting it:
it wasn't until I rebuilt most of a distro and hit dovecot that anything
went wrong.

I note that what I saw then was massive filesystem corruption, so
massive that not even tune2fs recognized it as being an ext4 fs
afterwards. Perhaps the thing wrote badness into the journal (possibly
including inline data scribbled over the next inode?) and replayed it
over the fs on the next boot, following which a cascade of increasing
badness ended up eating the entire fs... ah well, I guess it's hard to
know now, months after the fact (though if it's of interest, I still
have an e2image of the corrupted fs lying around!)

--
NULL && (void)

2017-03-13 23:37:05

by Darrick J. Wong

Subject: Re: v4.7--v4.10+: ext4: repeatable inline-data oops (and fs corruption) caused by msync() of shared writable mmap (with recipe)

On Mon, Mar 13, 2017 at 11:11:35PM +0000, Nick Alcock wrote:
> On 13 Mar 2017, Eric Biggers spake thusly:
>
> > On Wed, Mar 01, 2017 at 11:45:52AM +0000, Nick Alcock wrote:
> >> [Resend, after the first attempt, from my home address, failed with
> >> endless greylisting followed by "4.5.0 Interactive router timed out"
> >> from all but the lowest-priority MX for vger, and "Name server:
> >> bl-ckh-le.kernel.org.: host not found" for the apparently-nonexistent
> >> lowest-priority MX. Maybe it'll work better from here.]
> >>
> >> I first spotted this -- or it spotted me -- back in the v4.7.x days. It
> >> is still present in v4.10.
> >>
> >> Here's a replication recipe, given a reasonable rootfs with a compiler
> >> on it, and assuming a blank virtio disk on /dev/vdb:
> >
> > Hi Nick, thanks for reporting this. I've sent a patch which should fix this,
> > and Cc'ed you. This actually seems to have been a bug for a very long time, maybe
>
> I'll test it. Your timing is supernatural: I was just about to mkfs all
> the filesystems on my new server (a once-in-a-decade operation for me)
> and was bemoaning the fact that I couldn't turn on inline_data at the
> same time. Now I can! (I have good backups so can take suicidally crazy
> risks).

Glad to hear you have backups!

I wouldn't turn on inline_data for files, period. It's not as well tested
as it ought to be (clearly). :/

--D

> > even ever since the inline_data feature was introduced. (I was able to
> > reproduce it in a 3.18 kernel, at least.) I'm not sure why it didn't get
> > noticed earlier --- maybe hardly anyone ever writes to small files with mmap...
>
> Yeah, I built my /usr/src with it and ran for weeks without hitting it:
> it wasn't until I rebuilt most of a distro and hit dovecot that anything
> went wrong.
>
> I note that what I saw then was massive filesystem corruption, so
> massive that not even tune2fs recognized it as being an ext4 fs
> afterwards. Perhaps the thing wrote badness into the journal (possibly
> including inline data scribbled over the next inode?) and replayed it
> over the fs on the next boot, following which a cascade of increasing
> badness ended up eating the entire fs... ah well, I guess it's hard to
> know now, months after the fact (though if it's of interest, I still
> have an e2image of the corrupted fs lying around!)
>
> --
> NULL && (void)

2017-03-14 15:18:33

by Nick Alcock

Subject: Re: v4.7--v4.10+: ext4: repeatable inline-data oops (and fs corruption) caused by msync() of shared writable mmap (with recipe)

On 13 Mar 2017, Darrick J. Wong told this:

> I wouldn't turn on inline_data for files, period. It's not as well tested
> as it ought to be (clearly). :/

Bah, if I don't, how will it get tested? :) This machine is newly
commissioned and doesn't have much load or important content yet: it's
just the time to try experimental and potentially dangerous things :)

(I didn't know you could turn it on in any sense other than a 'big
hammer' for everything. I have a great many symlinks, and it's really
those that I'm interested in reducing seeks for...)

--
NULL && (void)