LinuxLists.cc - [PATCH] ext4: don't BUG when truncating encrypted inodes on the orphan list

2017-02-11 03:20:42

Subject: [PATCH] ext4: don't BUG when truncating encrypted inodes on the orphan list

Fix a BUG when the kernel tries to mount a file system constructed as
follows:

echo foo > foo.txt
mke2fs -Fq -t ext4 -O encrypt foo.img 100
debugfs -w foo.img << EOF
write foo.txt a
set_inode_field a i_flags 0x80800
set_super_value s_last_orphan 12
quit
EOF

root@kvm-xfstests:~# mount -o loop foo.img /mnt
[ 160.238770] ------------[ cut here ]------------
[ 160.240106] kernel BUG at /usr/projects/linux/ext4/fs/ext4/inode.c:3874!
[ 160.240106] invalid opcode: 0000 [#1] SMP
[ 160.240106] Modules linked in:
[ 160.240106] CPU: 0 PID: 2547 Comm: mount Tainted: G W 4.10.0-rc3-00034-gcdd33b941b67 #227
[ 160.240106] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1 04/01/2014
[ 160.240106] task: f4518000 task.stack: f47b6000
[ 160.240106] EIP: ext4_block_zero_page_range+0x1a7/0x2b4
[ 160.240106] EFLAGS: 00010246 CPU: 0
[ 160.240106] EAX: 00000001 EBX: f7be4b50 ECX: f47b7dc0 EDX: 00000007
[ 160.240106] ESI: f43b05a8 EDI: f43babec EBP: f47b7dd0 ESP: f47b7dac
[ 160.240106] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[ 160.240106] CR0: 80050033 CR2: bfd85b08 CR3: 34a00680 CR4: 000006f0
[ 160.240106] Call Trace:
[ 160.240106] ext4_truncate+0x1e9/0x3e5
[ 160.240106] ext4_fill_super+0x286f/0x2b1e
[ 160.240106] ? set_blocksize+0x2e/0x7e
[ 160.240106] mount_bdev+0x114/0x15f
[ 160.240106] ext4_mount+0x15/0x17
[ 160.240106] ? ext4_calculate_overhead+0x39d/0x39d
[ 160.240106] mount_fs+0x58/0x115
[ 160.240106] vfs_kern_mount+0x4b/0xae
[ 160.240106] do_mount+0x671/0x8c3
[ 160.240106] ? _copy_from_user+0x70/0x83
[ 160.240106] ? strndup_user+0x31/0x46
[ 160.240106] SyS_mount+0x57/0x7b
[ 160.240106] do_int80_syscall_32+0x4f/0x61
[ 160.240106] entry_INT80_32+0x2f/0x2f
[ 160.240106] EIP: 0xb76b919e
[ 160.240106] EFLAGS: 00000246 CPU: 0
[ 160.240106] EAX: ffffffda EBX: 08053838 ECX: 08052188 EDX: 080537e8
[ 160.240106] ESI: c0ed0000 EDI: 00000000 EBP: 080537e8 ESP: bfa13660
[ 160.240106] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
[ 160.240106] Code: 59 8b 00 a8 01 0f 84 09 01 00 00 8b 07 66 25 00 f0 66 3d 00 80 75 61 89 f8 e8 3e e2 ff ff 84 c0 74 56 83 bf 48 02 00 00 00 75 02 <0f> 0b 81 7d e8 00 10 00 00 74 02 0f 0b 8b 43 04 8b 53 08 31 c9
[ 160.240106] EIP: ext4_block_zero_page_range+0x1a7/0x2b4 SS:ESP: 0068:f47b7dac
[ 160.317241] ---[ end trace d6a773a375c810a5 ]---

The problem is that when the kernel tries to truncate an inode in
ext4_truncate(), it tries to clear any on-disk data beyond i_size.
Without the encryption key, it can't do that, and so it triggers a
BUG.

E2fsck does *not* provide this service, and in practice most file
systems have their orphan list processed by e2fsck, so to avoid
crashing, this patch skips this step if we don't have access to the
encryption key (which is the case when processing the orphan list; in
all other cases, we will have the encryption key, or the kernel
wouldn't have allowed the file to be opened).

An open question is whether the fact that e2fsck isn't clearing the
bytes beyond i_size causing problems --- and if we've lived with it
not doing it for so long, can we drop this from the kernel replay of
the orphan list in all cases (not just when we don't have the key for
encrypted inodes).

Addresses-Google-Bug: #35209576

Signed-off-by: Theodore Ts'o <[email protected]>
---
fs/ext4/inode.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index bc282f9d0969..831d025e59ad 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3944,6 +3944,10 @@ static int ext4_block_truncate_page(handle_t *handle,
unsigned blocksize;
struct inode *inode = mapping->host;

+ /* If we are processing an encrypted inode during orphan list handling */
+ if (!fscrypt_has_encryption_key(inode))
+ return 0;
+
blocksize = inode->i_sb->s_blocksize;
length = blocksize - (offset & (blocksize - 1));

--
2.11.0.rc0.7.gbe5a750

2017-02-11 07:26:57

by Andreas Dilger

[permalink] [raw]

Subject: Re: [PATCH] ext4: don't BUG when truncating encrypted inodes on the orphan list

On Feb 10, 2017, at 8:19 PM, Theodore Ts'o <[email protected]> wrote:
>
> Fix a BUG when the kernel tries to mount a file system constructed as
> follows:
>
> echo foo > foo.txt
> mke2fs -Fq -t ext4 -O encrypt foo.img 100
> debugfs -w foo.img << EOF
> write foo.txt a
> set_inode_field a i_flags 0x80800
> set_super_value s_last_orphan 12
> quit
> EOF
>
> root@kvm-xfstests:~# mount -o loop foo.img /mnt
> [ 160.238770] ------------[ cut here ]------------
> [ 160.240106] kernel BUG at /usr/projects/linux/ext4/fs/ext4/inode.c:3874!
> [ 160.240106] invalid opcode: 0000 [#1] SMP
> [ 160.240106] Modules linked in:
> [ 160.240106] CPU: 0 PID: 2547 Comm: mount Tainted: G W 4.10.0-rc3-00034-gcdd33b941b67 #227
> [ 160.240106] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1 04/01/2014
> [ 160.240106] task: f4518000 task.stack: f47b6000
> [ 160.240106] EIP: ext4_block_zero_page_range+0x1a7/0x2b4
> [ 160.240106] EFLAGS: 00010246 CPU: 0
> [ 160.240106] EAX: 00000001 EBX: f7be4b50 ECX: f47b7dc0 EDX: 00000007
> [ 160.240106] ESI: f43b05a8 EDI: f43babec EBP: f47b7dd0 ESP: f47b7dac
> [ 160.240106] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> [ 160.240106] CR0: 80050033 CR2: bfd85b08 CR3: 34a00680 CR4: 000006f0
> [ 160.240106] Call Trace:
> [ 160.240106] ext4_truncate+0x1e9/0x3e5
> [ 160.240106] ext4_fill_super+0x286f/0x2b1e
> [ 160.240106] ? set_blocksize+0x2e/0x7e
> [ 160.240106] mount_bdev+0x114/0x15f
> [ 160.240106] ext4_mount+0x15/0x17
> [ 160.240106] ? ext4_calculate_overhead+0x39d/0x39d
> [ 160.240106] mount_fs+0x58/0x115
> [ 160.240106] vfs_kern_mount+0x4b/0xae
> [ 160.240106] do_mount+0x671/0x8c3
> [ 160.240106] ? _copy_from_user+0x70/0x83
> [ 160.240106] ? strndup_user+0x31/0x46
> [ 160.240106] SyS_mount+0x57/0x7b
> [ 160.240106] do_int80_syscall_32+0x4f/0x61
> [ 160.240106] entry_INT80_32+0x2f/0x2f
> [ 160.240106] EIP: 0xb76b919e
> [ 160.240106] EFLAGS: 00000246 CPU: 0
> [ 160.240106] EAX: ffffffda EBX: 08053838 ECX: 08052188 EDX: 080537e8
> [ 160.240106] ESI: c0ed0000 EDI: 00000000 EBP: 080537e8 ESP: bfa13660
> [ 160.240106] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
> [ 160.240106] Code: 59 8b 00 a8 01 0f 84 09 01 00 00 8b 07 66 25 00 f0 66 3d 00 80 75 61 89 f8 e8 3e e2 ff ff 84 c0 74 56 83 bf 48 02 00 00 00 75 02 <0f> 0b 81 7d e8 00 10 00 00 74 02 0f 0b 8b 43 04 8b 53 08 31 c9
> [ 160.240106] EIP: ext4_block_zero_page_range+0x1a7/0x2b4 SS:ESP: 0068:f47b7dac
> [ 160.317241] ---[ end trace d6a773a375c810a5 ]---
>
> The problem is that when the kernel tries to truncate an inode in
> ext4_truncate(), it tries to clear any on-disk data beyond i_size.
> Without the encryption key, it can't do that, and so it triggers a
> BUG.
>
> E2fsck does *not* provide this service, and in practice most file
> systems have their orphan list processed by e2fsck, so to avoid
> crashing, this patch skips this step if we don't have access to the
> encryption key (which is the case when processing the orphan list; in
> all other cases, we will have the encryption key, or the kernel
> wouldn't have allowed the file to be opened).
>
> An open question is whether the fact that e2fsck isn't clearing the
> bytes beyond i_size causing problems --- and if we've lived with it
> not doing it for so long, can we drop this from the kernel replay of
> the orphan list in all cases (not just when we don't have the key for
> encrypted inodes).

The reason truncated orphans are on the orphan list is because the
transaction that sets i_size may be restarted if the inode is larger
than can be truncated in a single transaction. If the system crashes
before the truncate finishes then the truncate should be completed
so that old data is not returned if the file is truncated larger again.

I suspect that this inconsistency is hard to detect in most cases,
since it needs both a crash during truncation of a huge file used by
an applications that truncates the file larger than EOF. Apps would
normally just extend i_size by overwriting the old data as needed.

However, if a full e2fsck is run it would increase i_size to include
the blocks beyond EOF which would expose the old data again (the
"inode size is XXX should be YYY" error is not so uncommon), so this
should probably be fixed by e2fsck as well, rather than dropped.

Cheers, Andreas

> Addresses-Google-Bug: #35209576
>
> Signed-off-by: Theodore Ts'o <[email protected]>
> ---
> fs/ext4/inode.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index bc282f9d0969..831d025e59ad 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3944,6 +3944,10 @@ static int ext4_block_truncate_page(handle_t *handle,
> unsigned blocksize;
> struct inode *inode = mapping->host;
>
> + /* If we are processing an encrypted inode during orphan list handling */
> + if (!fscrypt_has_encryption_key(inode))
> + return 0;
> +
> blocksize = inode->i_sb->s_blocksize;
> length = blocksize - (offset & (blocksize - 1));
>
> --
> 2.11.0.rc0.7.gbe5a750
>

Cheers, Andreas

Attachments:

signature.asc (195.00 B)
Message signed with OpenPGP

2017-02-12 02:27:41

by Theodore Ts'o

[permalink] [raw]

Subject: Re: [PATCH] ext4: don't BUG when truncating encrypted inodes on the orphan list

On Sat, Feb 11, 2017 at 12:26:52AM -0700, Andreas Dilger wrote:
> The reason truncated orphans are on the orphan list is because the
> transaction that sets i_size may be restarted if the inode is larger
> than can be truncated in a single transaction. If the system crashes
> before the truncate finishes then the truncate should be completed
> so that old data is not returned if the file is truncated larger again.

Another way of fixing this is at the time when the file is truncated
to a larger size. Of course the other case we need handle is what
happens if there is data after i_size and the file is mmaped.

One advantage of doing when the file is truncated larger again is at
that point we will have the encryption key. In the case of an
encrypted file, both the kernel and e2fsck *can't* zero fill past
i_size if the key is not available. And during the orphan replay the
encryption key won't be available.

The other way to solve the problem would be zero the portion of the
last remaining datablock *first* and journal the data block along with
the initial transaction which sets the i_size in the inode. But that
gets tricky, since all data writes for that last block must not go to
the disk, and then once the journal has been committed we can't write
the block to via the normal page_io routines (since otherwise it might
get overwritten), until we write it back and then revoke the block in
the journal, and the revoke is committed. Messy....

In the mean time, given this is a denial of service attack, and this
is no worse than what e2fsck has been doing for years and years, we
need to skip the zero-fill for encrypted files if we don't have the
key, since crashing is not acceptable.

- Ted

2017-02-12 08:38:23

by Vegard Nossum

[permalink] [raw]

Subject: Re: [PATCH] ext4: don't BUG when truncating encrypted inodes on the orphan list

On 11 February 2017 at 04:19, Theodore Ts'o <[email protected]> wrote:
> Fix a BUG when the kernel tries to mount a file system constructed as
> follows:
>
> echo foo > foo.txt
> mke2fs -Fq -t ext4 -O encrypt foo.img 100
> debugfs -w foo.img << EOF
> write foo.txt a
> set_inode_field a i_flags 0x80800
> set_super_value s_last_orphan 12
> quit
> EOF
>
> root@kvm-xfstests:~# mount -o loop foo.img /mnt
> [ 160.238770] ------------[ cut here ]------------
> [ 160.240106] kernel BUG at /usr/projects/linux/ext4/fs/ext4/inode.c:3874!
> [ 160.240106] invalid opcode: 0000 [#1] SMP
> [ 160.240106] Modules linked in:
> [ 160.240106] CPU: 0 PID: 2547 Comm: mount Tainted: G W 4.10.0-rc3-00034-gcdd33b941b67 #227
> [ 160.240106] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1 04/01/2014
> [ 160.240106] task: f4518000 task.stack: f47b6000
> [ 160.240106] EIP: ext4_block_zero_page_range+0x1a7/0x2b4
> [ 160.240106] EFLAGS: 00010246 CPU: 0
> [ 160.240106] EAX: 00000001 EBX: f7be4b50 ECX: f47b7dc0 EDX: 00000007
> [ 160.240106] ESI: f43b05a8 EDI: f43babec EBP: f47b7dd0 ESP: f47b7dac
> [ 160.240106] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> [ 160.240106] CR0: 80050033 CR2: bfd85b08 CR3: 34a00680 CR4: 000006f0
> [ 160.240106] Call Trace:
> [ 160.240106] ext4_truncate+0x1e9/0x3e5
> [ 160.240106] ext4_fill_super+0x286f/0x2b1e
> [ 160.240106] ? set_blocksize+0x2e/0x7e
> [ 160.240106] mount_bdev+0x114/0x15f
> [ 160.240106] ext4_mount+0x15/0x17
> [ 160.240106] ? ext4_calculate_overhead+0x39d/0x39d
> [ 160.240106] mount_fs+0x58/0x115
> [ 160.240106] vfs_kern_mount+0x4b/0xae
> [ 160.240106] do_mount+0x671/0x8c3
> [ 160.240106] ? _copy_from_user+0x70/0x83
> [ 160.240106] ? strndup_user+0x31/0x46
> [ 160.240106] SyS_mount+0x57/0x7b
> [ 160.240106] do_int80_syscall_32+0x4f/0x61
> [ 160.240106] entry_INT80_32+0x2f/0x2f
> [ 160.240106] EIP: 0xb76b919e
> [ 160.240106] EFLAGS: 00000246 CPU: 0
> [ 160.240106] EAX: ffffffda EBX: 08053838 ECX: 08052188 EDX: 080537e8
> [ 160.240106] ESI: c0ed0000 EDI: 00000000 EBP: 080537e8 ESP: bfa13660
> [ 160.240106] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
> [ 160.240106] Code: 59 8b 00 a8 01 0f 84 09 01 00 00 8b 07 66 25 00 f0 66 3d 00 80 75 61 89 f8 e8 3e e2 ff ff 84 c0 74 56 83 bf 48 02 00 00 00 75 02 <0f> 0b 81 7d e8 00 10 00 00 74 02 0f 0b 8b 43 04 8b 53 08 31 c9
> [ 160.240106] EIP: ext4_block_zero_page_range+0x1a7/0x2b4 SS:ESP: 0068:f47b7dac
> [ 160.317241] ---[ end trace d6a773a375c810a5 ]---
>
> The problem is that when the kernel tries to truncate an inode in
> ext4_truncate(), it tries to clear any on-disk data beyond i_size.
> Without the encryption key, it can't do that, and so it triggers a
> BUG.
>
> E2fsck does *not* provide this service, and in practice most file
> systems have their orphan list processed by e2fsck, so to avoid
> crashing, this patch skips this step if we don't have access to the
> encryption key (which is the case when processing the orphan list; in
> all other cases, we will have the encryption key, or the kernel
> wouldn't have allowed the file to be opened).
>
> An open question is whether the fact that e2fsck isn't clearing the
> bytes beyond i_size causing problems --- and if we've lived with it
> not doing it for so long, can we drop this from the kernel replay of
> the orphan list in all cases (not just when we don't have the key for
> encrypted inodes).
>
> Addresses-Google-Bug: #35209576
>
> Signed-off-by: Theodore Ts'o <[email protected]>
> ---

You may consider adding a Reported-by: and/or a link to the previous discussion:

https://patchwork.ozlabs.org/patch/648817/

Thanks,

Vegard

2017-02-12 22:09:03

by Eric Biggers

[permalink] [raw]

Subject: Re: [PATCH] ext4: don't BUG when truncating encrypted inodes on the orphan list

On Fri, Feb 10, 2017 at 10:19:42PM -0500, Theodore Ts'o wrote:
> Fix a BUG when the kernel tries to mount a file system constructed as
> follows:
>
> echo foo > foo.txt
> mke2fs -Fq -t ext4 -O encrypt foo.img 100
> debugfs -w foo.img << EOF
> write foo.txt a
> set_inode_field a i_flags 0x80800
> set_super_value s_last_orphan 12
> quit
> EOF

I also had to pass '-b 4096' to mke2fs; otherwise mounting the filesystem failed
with "Unsupported blocksize for fs encryption".

(I also added 'zap_block -f a 0 -p 42' so that I could see if/when the unused
portion of the block gets zeroed.)

> The problem is that when the kernel tries to truncate an inode in
> ext4_truncate(), it tries to clear any on-disk data beyond i_size.
> Without the encryption key, it can't do that, and so it triggers a
> BUG.
>
> E2fsck does *not* provide this service, and in practice most file
> systems have their orphan list processed by e2fsck, so to avoid
> crashing, this patch skips this step if we don't have access to the
> encryption key (which is the case when processing the orphan list; in
> all other cases, we will have the encryption key, or the kernel
> wouldn't have allowed the file to be opened).
>
> An open question is whether the fact that e2fsck isn't clearing the
> bytes beyond i_size causing problems --- and if we've lived with it
> not doing it for so long, can we drop this from the kernel replay of
> the orphan list in all cases (not just when we don't have the key for
> encrypted inodes).

e2fsck does clean up the orphan list, which lets the kernel mount the filesystem
without crashing. e2fsck also completes the truncates and is apparently
supposed to zero out the unused portion of the last block, but it doesn't due to
a bug. See release_inode_block() in e2fsck/super.c:

/*
* If part of the last block needs truncating, we do
* it here.
*/
if ((blockcnt == pb->truncate_block) && pb->truncate_offset) {
pb->errcode = io_channel_read_blk64(fs->io, blk, 1,
pb->buf);
if (pb->errcode)
goto return_abort;
memset(pb->buf + pb->truncate_offset, 0,
fs->blocksize - pb->truncate_offset);
pb->errcode = io_channel_write_blk64(fs->io, blk, 1,
pb->buf);
if (pb->errcode)
goto return_abort;
}

'blockcnt' is a logical block number, but 'truncate_block' is the number of
blocks in the file, i.e. 1 plus the last logical block number. So this gets
executed on the block following the one intended, which is useless because that
block then gets freed.

So I think that should be fixed too, but zeroing still wouldn't be appropriate
for encrypted regular files, so it would need to be skipped on such files.

> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index bc282f9d0969..831d025e59ad 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3944,6 +3944,10 @@ static int ext4_block_truncate_page(handle_t *handle,
> unsigned blocksize;
> struct inode *inode = mapping->host;
>
> + /* If we are processing an encrypted inode during orphan list handling */
> + if (!fscrypt_has_encryption_key(inode))
> + return 0;
> +
> blocksize = inode->i_sb->s_blocksize;
> length = blocksize - (offset & (blocksize - 1));

Shouldn't it be:

if (ext4_encrypted_inode(inode) && S_ISREG(inode->i_mode) &&
!fscrypt_has_encryption_key(inode))
return 0;

... since only encrypted regular files should be skipped?

Otherwise, I think this will be okay for now and better than the current
situation, though the underlying problem of handling an interrupted truncate
still needs to be solved. Journalling the last block together with the i_size
change sounds like it may be the right solution, though perhaps difficult to
implement.

- Eric

2017-02-13 00:31:41

by Theodore Ts'o

[permalink] [raw]

Subject: Re: [PATCH] ext4: don't BUG when truncating encrypted inodes on the orphan list

On Sun, Feb 12, 2017 at 09:38:20AM +0100, Vegard Nossum wrote:
> You may consider adding a Reported-by: and/or a link to the previous discussion:
>
> https://patchwork.ozlabs.org/patch/648817/

I'm trying to remember if you always included the git commit of the
kernel you were testing. One of the reasons why I had trouble root
causes some of your crashes was because I wasn't able to map the
addresses back to line numbers. I think I must have missed the git
commit ID, since I see you have historical information for all of
them in your reports.

In any case, what happened is for those reports that weren't caused by
superblock corruption (where you had provided the superblock dump), if
it wasn't immediately obvious from the stack trace, I eventually gave
up, because I had a finite amount of time and energy, and it was too
hard to figure out *where* in the source the kernel was actually
crashing.

As a wishlist item, it would be nice if your web reports included
hotlinks from the addresses / symbol_name+offset to the git tree
sources at the indicated line number. For example:

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/fs/ext4/inode.c?id=07f00f06ba9a5533d6650d46d3e938f6cbeee97e#n3738

One of the crash dump analysis tool at $WORK does that, and it saves a
huge amount of developer time.

Anyway, at least for the ones where there is an explicit BUG_ON with a
line number, I should revisit your reports to see if I might not be
able clear out more of them. (Is there a list of which ones are still
valid?)

But for ones where there is a stack trace without a BUG_INFO or
WARN_ON to give me a line number, any chance you could use addr2line
or otherwise make the kernel (with debugging information compiled in)
available?

Thanks!!

- Ted

2017-02-13 02:58:48

by Vegard Nossum

[permalink] [raw]

Subject: Re: [PATCH] ext4: don't BUG when truncating encrypted inodes on the orphan list

On 13 February 2017 at 01:31, Theodore Ts'o <[email protected]> wrote:
> As a wishlist item, it would be nice if your web reports included
> hotlinks from the addresses / symbol_name+offset to the git tree
> sources at the indicated line number. For example:
>
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/fs/ext4/inode.c?id=07f00f06ba9a5533d6650d46d3e938f6cbeee97e#n3738
>
> One of the crash dump analysis tool at $WORK does that, and it saves a
> huge amount of developer time.
>
> Anyway, at least for the ones where there is an explicit BUG_ON with a
> line number, I should revisit your reports to see if I might not be
> able clear out more of them. (Is there a list of which ones are still
> valid?)
>
> But for ones where there is a stack trace without a BUG_INFO or
> WARN_ON to give me a line number, any chance you could use addr2line
> or otherwise make the kernel (with debugging information compiled in)
> available?

Those reports all include the hotlinks actually. The one for this problem was:

http://139.162.151.198/f/ext4/5bdefda69f39b2f2c56d9b67d5b7d9e2cc8dfd5f/

If you go to the stack trace then any line containing a kernel address
that can be addr2lined appears in green and clicking it should expand
a link to git.kernel.org. It does probably need javascript to work.

I'll get around to rerunning all the testcases soon on a recent kernel.

Vegard

2017-02-13 15:19:30

by Theodore Ts'o

[permalink] [raw]

Subject: Re: [PATCH] ext4: don't BUG when truncating encrypted inodes on the orphan list

On Sun, Feb 12, 2017 at 02:09:00PM -0800, Eric Biggers wrote:
> On Fri, Feb 10, 2017 at 10:19:42PM -0500, Theodore Ts'o wrote:
> > Fix a BUG when the kernel tries to mount a file system constructed as
> > follows:
> >
> > echo foo > foo.txt
> > mke2fs -Fq -t ext4 -O encrypt foo.img 100
> > debugfs -w foo.img << EOF
> > write foo.txt a
> > set_inode_field a i_flags 0x80800
> > set_super_value s_last_orphan 12
> > quit
> > EOF
>
> I also had to pass '-b 4096' to mke2fs; otherwise mounting the filesystem failed
> with "Unsupported blocksize for fs encryption".

Sorry, oops. My original test case didn't have the -O encrypt, and it
works equally well without the -O encrypt.

> e2fsck does clean up the orphan list, which lets the kernel mount the filesystem
> without crashing. e2fsck also completes the truncates and is apparently
> supposed to zero out the unused portion of the last block, but it doesn't due to
> a bug. See release_inode_block() in e2fsck/super.c:
>
> /*
> * If part of the last block needs truncating, we do
> * it here.
> */
> if ((blockcnt == pb->truncate_block) && pb->truncate_offset) {
> pb->errcode = io_channel_read_blk64(fs->io, blk, 1,
> pb->buf);
> if (pb->errcode)
> goto return_abort;
> memset(pb->buf + pb->truncate_offset, 0,
> fs->blocksize - pb->truncate_offset);
> pb->errcode = io_channel_write_blk64(fs->io, blk, 1,
> pb->buf);
> if (pb->errcode)
> goto return_abort;
> }
>
> 'blockcnt' is a logical block number, but 'truncate_block' is the number of
> blocks in the file, i.e. 1 plus the last logical block number. So this gets
> executed on the block following the one intended, which is useless because that
> block then gets freed.

Ah, nice catch, thanks!

> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index bc282f9d0969..831d025e59ad 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -3944,6 +3944,10 @@ static int ext4_block_truncate_page(handle_t *handle,
> > unsigned blocksize;
> > struct inode *inode = mapping->host;
> >
> > + /* If we are processing an encrypted inode during orphan list handling */
> > + if (!fscrypt_has_encryption_key(inode))
> > + return 0;
> > +
> > blocksize = inode->i_sb->s_blocksize;
> > length = blocksize - (offset & (blocksize - 1));
>
> Shouldn't it be:
>
> if (ext4_encrypted_inode(inode) && S_ISREG(inode->i_mode) &&
> !fscrypt_has_encryption_key(inode))
> return 0;
>
> ... since only encrypted regular files should be skipped?

We certainly don't want to add the ext4_encrypted_inode() test, since
this can fail even if -O encrypt is not enabled. As for checking for
regular files, we could potentially fall into this code path for
non-regular files too (symlinks with length greater than 60 bytes come
to mind), and if we don't have the encryption key, there's no *point*
to try to zero beyond i_size --- and if we do, we're going to fail
with a BUG.

> Otherwise, I think this will be okay for now and better than the current
> situation, though the underlying problem of handling an interrupted truncate
> still needs to be solved. Journalling the last block together with the i_size
> change sounds like it may be the right solution, though perhaps difficult to
> implement.

It *is* going to be hard, because the currently only way to _stop_
data journalling for a particular inode safely is to lock the journal
against any updates, flush all previously journalled blocks to their
final location on disk, empty the journal, and then unlock the journal
--- and doing that after any truncate of a file which is not deleted
is a performance non-starter.

I've been looking into this problem for the lazy journalling[1]
patches (in the unstable portion of the patch queue) with respect to
clearing the JOURNAL_DATA flag (via chattr) and...

/*
* Clearing the JOURNAL_DATA flag is *hard* with lazy
* journalling. We can't use jbd2_journal_flush(); instead,
* we must, for all blocks in the file which have been logged
* to the journal, save them to their final location on disk
* and insert revoke records for them, and then do a journal
* commit --- and while this whole procedure is being carried
* out, any attempts to write to the file must be locked out.
* Punt for now.
*/
if ((oldflags & EXT4_JOURNAL_DATA_FL) && !jflag &&
test_opt(inode->i_sb, JOURNAL_LAZY)) {
err = -EOPNOTSUPP;
goto flags_out;
}

[1] https://www.usenix.org/conference/fast17/technical-sessions/presentation/aghayev
(Final camera ready copy not up yet; pre-prints available on request)

This is somewhat easier for the truncate case, since there is only one
block that needs to be written out and revoked after the truncate
takes place, but it is still a going to be somewhat painful, since it
will require a forced flush for every truncate(2) system call to an
encrypted file.

Another way of doing it which might be simpler would be to allocate an
inode flag which if set, indicates that there may be data after i_size
which must be zero'ed, and the kernel can take care of that when it
actually has the key and the file is opened for reading or writing.
(We would need to make sure read-only mounts are handled properly,
since we can't modify the on-disk block or the inode in that case; so
there would be a second code path that would have to be tested
separately.)

- Ted

2017-02-13 15:27:08

by Theodore Ts'o

[permalink] [raw]

Subject: Re: [PATCH] ext4: don't BUG when truncating encrypted inodes on the orphan list

On Mon, Feb 13, 2017 at 03:58:46AM +0100, Vegard Nossum wrote:
> Those reports all include the hotlinks actually. The one for this problem was:
>
> http://139.162.151.198/f/ext4/5bdefda69f39b2f2c56d9b67d5b7d9e2cc8dfd5f/
>
> If you go to the stack trace then any line containing a kernel address
> that can be addr2lined appears in green and clicking it should expand
> a link to git.kernel.org. It does probably need javascript to work.

Nice! I hadn't noticed that. I was running with javascript disabled
(unless the site was on a white list) for a while, so that could be why.

> I'll get around to rerunning all the testcases soon on a recent kernel.

Thanks!

- Ted

2017-02-13 18:20:00

by Eric Biggers

[permalink] [raw]

Subject: Re: [PATCH] ext4: don't BUG when truncating encrypted inodes on the orphan list

On Mon, Feb 13, 2017 at 10:19:27AM -0500, Theodore Ts'o wrote:
>
> > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > index bc282f9d0969..831d025e59ad 100644
> > > --- a/fs/ext4/inode.c
> > > +++ b/fs/ext4/inode.c
> > > @@ -3944,6 +3944,10 @@ static int ext4_block_truncate_page(handle_t *handle,
> > > unsigned blocksize;
> > > struct inode *inode = mapping->host;
> > >
> > > + /* If we are processing an encrypted inode during orphan list handling */
> > > + if (!fscrypt_has_encryption_key(inode))
> > > + return 0;
> > > +
> > > blocksize = inode->i_sb->s_blocksize;
> > > length = blocksize - (offset & (blocksize - 1));
> >
> > Shouldn't it be:
> >
> > if (ext4_encrypted_inode(inode) && S_ISREG(inode->i_mode) &&
> > !fscrypt_has_encryption_key(inode))
> > return 0;
> >
> > ... since only encrypted regular files should be skipped?
>
> We certainly don't want to add the ext4_encrypted_inode() test, since
> this can fail even if -O encrypt is not enabled. As for checking for
> regular files, we could potentially fall into this code path for
> non-regular files too (symlinks with length greater than 60 bytes come
> to mind), and if we don't have the encryption key, there's no *point*
> to try to zero beyond i_size --- and if we do, we're going to fail
> with a BUG.
>

Are you sure? ext4_encrypted_inode() only checks the inode flag; it doesn't
check the superblock flag too. If we check for just
!fscrypt_has_encryption_key(), the zeroing will get skipped for all
*unencrypted* files too.

Also, my suggestion matches the logic in
__ext4_block_zero_page_range() where the BUG() is actually being hit:

if (S_ISREG(inode->i_mode) &&
ext4_encrypted_inode(inode)) {
/* We expect the key to be set. */
BUG_ON(!fscrypt_has_encryption_key(inode));
BUG_ON(blocksize != PAGE_SIZE);
WARN_ON_ONCE(fscrypt_decrypt_page(page->mapping->host,
page, PAGE_SIZE, 0, page->index));
}

- Eric

2017-02-14 05:04:03

by Theodore Ts'o

[permalink] [raw]

Subject: [PATCH -v2] ext4: don't BUG when truncating encrypted inodes on the orphan list

Fix a BUG when the kernel tries to mount a file system constructed as
follows:

echo foo > foo.txt
mke2fs -Fq -t ext4 -O encrypt foo.img 100
debugfs -w foo.img << EOF
write foo.txt a
set_inode_field a i_flags 0x80800
set_super_value s_last_orphan 12
quit
EOF

root@kvm-xfstests:~# mount -o loop foo.img /mnt
[ 160.238770] ------------[ cut here ]------------
[ 160.240106] kernel BUG at /usr/projects/linux/ext4/fs/ext4/inode.c:3874!
[ 160.240106] invalid opcode: 0000 [#1] SMP
[ 160.240106] Modules linked in:
[ 160.240106] CPU: 0 PID: 2547 Comm: mount Tainted: G W 4.10.0-rc3-00034-gcdd33b941b67 #227
[ 160.240106] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1 04/01/2014
[ 160.240106] task: f4518000 task.stack: f47b6000
[ 160.240106] EIP: ext4_block_zero_page_range+0x1a7/0x2b4
[ 160.240106] EFLAGS: 00010246 CPU: 0
[ 160.240106] EAX: 00000001 EBX: f7be4b50 ECX: f47b7dc0 EDX: 00000007
[ 160.240106] ESI: f43b05a8 EDI: f43babec EBP: f47b7dd0 ESP: f47b7dac
[ 160.240106] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[ 160.240106] CR0: 80050033 CR2: bfd85b08 CR3: 34a00680 CR4: 000006f0
[ 160.240106] Call Trace:
[ 160.240106] ext4_truncate+0x1e9/0x3e5
[ 160.240106] ext4_fill_super+0x286f/0x2b1e
[ 160.240106] ? set_blocksize+0x2e/0x7e
[ 160.240106] mount_bdev+0x114/0x15f
[ 160.240106] ext4_mount+0x15/0x17
[ 160.240106] ? ext4_calculate_overhead+0x39d/0x39d
[ 160.240106] mount_fs+0x58/0x115
[ 160.240106] vfs_kern_mount+0x4b/0xae
[ 160.240106] do_mount+0x671/0x8c3
[ 160.240106] ? _copy_from_user+0x70/0x83
[ 160.240106] ? strndup_user+0x31/0x46
[ 160.240106] SyS_mount+0x57/0x7b
[ 160.240106] do_int80_syscall_32+0x4f/0x61
[ 160.240106] entry_INT80_32+0x2f/0x2f
[ 160.240106] EIP: 0xb76b919e
[ 160.240106] EFLAGS: 00000246 CPU: 0
[ 160.240106] EAX: ffffffda EBX: 08053838 ECX: 08052188 EDX: 080537e8
[ 160.240106] ESI: c0ed0000 EDI: 00000000 EBP: 080537e8 ESP: bfa13660
[ 160.240106] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
[ 160.240106] Code: 59 8b 00 a8 01 0f 84 09 01 00 00 8b 07 66 25 00 f0 66 3d 00 80 75 61 89 f8 e8 3e e2 ff ff 84 c0 74 56 83 bf 48 02 00 00 00 75 02 <0f> 0b 81 7d e8 00 10 00 00 74 02 0f 0b 8b 43 04 8b 53 08 31 c9
[ 160.240106] EIP: ext4_block_zero_page_range+0x1a7/0x2b4 SS:ESP: 0068:f47b7dac
[ 160.317241] ---[ end trace d6a773a375c810a5 ]---

The problem is that when the kernel tries to truncate an inode in
ext4_truncate(), it tries to clear any on-disk data beyond i_size.
Without the encryption key, it can't do that, and so it triggers a
BUG.

E2fsck does *not* provide this service, and in practice most file
systems have their orphan list processed by e2fsck, so to avoid
crashing, this patch skips this step if we don't have access to the
encryption key (which is the case when processing the orphan list; in
all other cases, we will have the encryption key, or the kernel
wouldn't have allowed the file to be opened).

An open question is whether the fact that e2fsck isn't clearing the
bytes beyond i_size causing problems --- and if we've lived with it
not doing it for so long, can we drop this from the kernel replay of
the orphan list in all cases (not just when we don't have the key for
encrypted inodes).

Addresses-Google-Bug: #35209576

Signed-off-by: Theodore Ts'o <[email protected]>
---
fs/ext4/inode.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index bc282f9d0969..43ce831cd7c7 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3944,6 +3944,10 @@ static int ext4_block_truncate_page(handle_t *handle,
unsigned blocksize;
struct inode *inode = mapping->host;

+ /* If we are processing an encrypted inode during orphan list handling */
+ if (ext4_encrypted_inode(inode) && !fscrypt_has_encryption_key(inode))=
+ return 0;
+
blocksize = inode->i_sb->s_blocksize;
length = blocksize - (offset & (blocksize - 1));

--
2.11.0.rc0.7.gbe5a750

2017-03-09 14:24:01

by Jan Kara

[permalink] [raw]

Subject: Re: [PATCH] ext4: don't BUG when truncating encrypted inodes on the orphan list

On Sat 11-02-17 21:27:38, Ted Tso wrote:
> On Sat, Feb 11, 2017 at 12:26:52AM -0700, Andreas Dilger wrote:
> > The reason truncated orphans are on the orphan list is because the
> > transaction that sets i_size may be restarted if the inode is larger
> > than can be truncated in a single transaction. If the system crashes
> > before the truncate finishes then the truncate should be completed
> > so that old data is not returned if the file is truncated larger again.
>
> Another way of fixing this is at the time when the file is truncated
> to a larger size. Of course the other case we need handle is what
> happens if there is data after i_size and the file is mmaped.
>
> One advantage of doing when the file is truncated larger again is at
> that point we will have the encryption key. In the case of an
> encrypted file, both the kernel and e2fsck *can't* zero fill past
> i_size if the key is not available. And during the orphan replay the
> encryption key won't be available.
>
> The other way to solve the problem would be zero the portion of the
> last remaining datablock *first* and journal the data block along with
> the initial transaction which sets the i_size in the inode. But that
> gets tricky, since all data writes for that last block must not go to
> the disk, and then once the journal has been committed we can't write
> the block to via the normal page_io routines (since otherwise it might
> get overwritten), until we write it back and then revoke the block in
> the journal, and the revoke is committed. Messy....

Going through some old email... I don't think this would be really
reasonably doable. What would fixup the missing zeroing on orphan cleanup
though is to zero the tail of the last page on readpage, extending
truncate and write beyond EOF. That may be acceptable cost for encrypted
inodes.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2017-03-09 19:21:30

by Andreas Dilger

[permalink] [raw]

Subject: Re: [PATCH] ext4: don't BUG when truncating encrypted inodes on the orphan list

On Mar 9, 2017, at 6:47 AM, Jan Kara <[email protected]> wrote:
>
> On Sat 11-02-17 21:27:38, Ted Tso wrote:
>> On Sat, Feb 11, 2017 at 12:26:52AM -0700, Andreas Dilger wrote:
>>> The reason truncated orphans are on the orphan list is because the
>>> transaction that sets i_size may be restarted if the inode is larger
>>> than can be truncated in a single transaction. If the system crashes
>>> before the truncate finishes then the truncate should be completed
>>> so that old data is not returned if the file is truncated larger again.
>>
>> Another way of fixing this is at the time when the file is truncated
>> to a larger size. Of course the other case we need handle is what
>> happens if there is data after i_size and the file is mmaped.
>>
>> One advantage of doing when the file is truncated larger again is at
>> that point we will have the encryption key. In the case of an
>> encrypted file, both the kernel and e2fsck *can't* zero fill past
>> i_size if the key is not available. And during the orphan replay the
>> encryption key won't be available.
>>
>> The other way to solve the problem would be zero the portion of the
>> last remaining datablock *first* and journal the data block along with
>> the initial transaction which sets the i_size in the inode. But that
>> gets tricky, since all data writes for that last block must not go to
>> the disk, and then once the journal has been committed we can't write
>> the block to via the normal page_io routines (since otherwise it might
>> get overwritten), until we write it back and then revoke the block in
>> the journal, and the revoke is committed. Messy....
>
> Going through some old email... I don't think this would be really
> reasonably doable. What would fixup the missing zeroing on orphan cleanup
> though is to zero the tail of the last page on readpage, extending
> truncate and write beyond EOF. That may be acceptable cost for encrypted
> inodes.

Another option would be to revive the unlink/truncate thread, and dump
the blocks to be truncated over to another (temporary) inode that is put
on the orphan list and will be unlinked. That means the visible truncate
operation can always complete in a single transaction (including the
partial block write), and everything on the orphan list is essentially an
unlink rather than a truncate.

The code wasn't too complex, but we dropped it when extents arrived since
it didn't give a huge performance advantage. That said, there could be a
benefit in terms of code simplification, since there wouldn't be the need
to restart transactions in the middle if the truncate gets too large.

The most recent version I could find is for ext3 in 2.4.29 at:

https://git.hpdd.intel.com/?p=fs/lustre-release.git;a=blob_plain;f=lustre/kernel_patches/patches/ext3-delete_thread-2.4.29.patch;hb=113303973ec9f8484eb2355a1a6ef3c4c7fd6a56

Cheers, Andreas

Attachments:

signature.asc (195.00 B)
Message signed with OpenPGP