2024-05-08 13:37:03

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 218820] New: The empty file occupies incorrect blocks

https://bugzilla.kernel.org/show_bug.cgi?id=218820

Bug ID: 218820
Summary: The empty file occupies incorrect blocks
Product: File System
Version: 2.5
Hardware: All
OS: Linux
Status: NEW
Severity: normal
Priority: P3
Component: ext4
Assignee: [email protected]
Reporter: [email protected]
Regression: No

Created attachment 306275
--> https://bugzilla.kernel.org/attachment.cgi?id=306275&action=edit
reproduce.c

Hi,

I mounted an ext4 image, created a file, and wrote to it, but the blocks
occupied by this file were incorrect after I `truncate` it. I can reproduce
this with the latest linux kernel
https://git.kernel.org/torvalds/t/linux-6.9-rc7.tar.gz


The following is the triggering script:
```
dd if=/dev/zero of=ext4-0.img bs=1M count=120
mkfs.ext4 ext4-0.img
g++ -static reproduce.c
losetup /dev/loop0 ext4-0.img
mkdir /root/mnt
./a.out
stat /root/mnt/a
```

After run the script, you will get the following outputs:
```
File: /root/mnt/a
Size: 0 Blocks: 82 IO Block: 1024 regular empty file
Device: 700h/1792d Inode: 12 Links: 1
Access: (0755/-rwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Context: system_u:object_r:unlabeled_t:s0
Access: 2024-05-08 11:47:48.000000000 +0000
Modify: 2024-05-08 11:47:48.000000000 +0000
Change: 2024-05-08 11:47:48.000000000 +0000
Birth: -
```

The size of file `a` is 0, yet it occupies 82 blocks. Normally, it should only
occupy 2 blocks.

The contents of `reproduce.c` :
```
#include <assert.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
#include <stdarg.h>
#include <stddef.h>
#include <unistd.h>
#include <pthread.h>
#include <errno.h>
#include <dirent.h>

#include <string>

#include <sys/mount.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <sys/wait.h>
#include <sys/xattr.h>
#include <sys/mount.h>
#include <sys/statfs.h>
#include <fcntl.h>

#define ALIGN 4096

void* align_alloc(size_t size) {
void *ptr = NULL;
int ret = posix_memalign(&ptr, ALIGN, size);
if (ret) {
printf("align error\n");
exit(1);
}
return ptr;
}

int main()
{
mount("/dev/loop0", "/root/mnt", "ext4", 0, "");

creat("/root/mnt/a", S_IRWXU | S_IRWXG | S_IROTH | S_IXOTH);
int fd = open("/root/mnt/a", O_RDWR);
mount(NULL, "/root/mnt/", NULL, MS_REMOUNT, "nodelalloc");
sync();

char *buf = (char*)align_alloc(4096*20);
memset(buf, 'a' + 15, 4096*20);
write(fd, buf, 4096*10);

truncate("/root/mnt/a", 0);
close(fd);
return 0;
}
```

--
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.


2024-05-09 03:35:54

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 218820] The empty file occupies incorrect blocks

https://bugzilla.kernel.org/show_bug.cgi?id=218820

--- Comment #1 from Theodore Tso ([email protected]) ---
On Wed, May 08, 2024 at 01:33:56PM +0000, [email protected] wrote:
>
> The following is the triggering script:
> ```
> dd if=/dev/zero of=ext4-0.img bs=1M count=120
> mkfs.ext4 ext4-0.img
> g++ -static reproduce.c
> losetup /dev/loop0 ext4-0.img
> mkdir /root/mnt
> ./a.out
> stat /root/mnt/a
> ```
>
> After run the script, you will get the following outputs:
> ```
> File: /root/mnt/a
> Size: 0 Blocks: 82 IO Block: 1024 regular empty
> file
> Device: 700h/1792d Inode: 12 Links: 1
> Access: (0755/-rwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
> Context: system_u:object_r:unlabeled_t:s0
> Access: 2024-05-08 11:47:48.000000000 +0000
> Modify: 2024-05-08 11:47:48.000000000 +0000
> Change: 2024-05-08 11:47:48.000000000 +0000
> Birth: -
> ```

Thanks for the bug report. What the reproducer script is doing is
opening a file for writing, and then remounting the file system to
disable delayhed allocations. It then writes 40k to the file, and
then truncates the file, and close it.

The reproducer script leaves the file system mounted; if you unmount
the file system, the kernel will issue a warning message:

EXT4-fs (loop0): Inode 13 (0000000082f8ff6c): i_reserved_data_blocks (40) not
cleared!

... and then if you examine the on-disk image, you'll see that the
i_blocks field is correct.


The root cause is that when the file is open, the address_operations
which is instaniated is one which is designed for delayed allocation.
So when 40k is written to the file, although we haven't allocated the
space yet, we *will* allocate the space, so we update in-memory
i_blocks to reflect the to-be-allocated disk space (although we don't
update the on-disk i_blocks until the allocation actually takes place,
so if we crash, the file system stays consistent).

However, some of the *other* logic is done based on whether the
delayed allocation flag is set in the struct super (as opposed to the
address_operations in the struct inode at the time that file is
opened). This includes what happens when we truncate the file, where
in the nodelalloc case, the in-memory and on-disk i_blocks are in
sync, and when the blocks are delallocated, the i_blocks field is
dropped. But since the blocks weren't actually allocated because the
file descriptor was in delalloc mode, there was nothing to
delallocate, so the in-memory i_blocks stayed elevanted.

There are a number ways of fixing this. The simplest is to simply not
allow the delayed allocation mode to be changed on a remount. Since
in the long term, once we fix some performance for some specialized
use case which has caused people to want to disable delayed
allocation, this is the one that probably makes the most sense.

The two other approaches involve a lot more complexity. The first
alternate approach is to iterate over all open files, and resolve any
pending delayed allocations, and then update the address_operations to
use ext4_aops instead of ext4_da_aops, while avoiding the races
involving writes happening while we are trying to do the remount.
Yelch.

The second alternate approach is to treat the delayed allocation
status on a per-inode basis, with a per-inode flag indicating whether
delayed allocation is active, so that the truncate logic stays
consistent with the ext4_da_ops active on file descriptors associated
with the file. But this gets super messy since subsequent file opens
on the inode also need to use the same delayed allocation mode until
the last file descriptor associated with the inode is closed. Double
Yelch.

So preventing the "mount -o remount,nodelalloc" in the reproducer
program is how we'll address this issue.

- Ted

--
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

2024-05-09 03:36:02

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [Bug 218820] New: The empty file occupies incorrect blocks

On Wed, May 08, 2024 at 01:33:56PM +0000, [email protected] wrote:
>
> The following is the triggering script:
> ```
> dd if=/dev/zero of=ext4-0.img bs=1M count=120
> mkfs.ext4 ext4-0.img
> g++ -static reproduce.c
> losetup /dev/loop0 ext4-0.img
> mkdir /root/mnt
> ./a.out
> stat /root/mnt/a
> ```
>
> After run the script, you will get the following outputs:
> ```
> File: /root/mnt/a
> Size: 0 Blocks: 82 IO Block: 1024 regular empty file
> Device: 700h/1792d Inode: 12 Links: 1
> Access: (0755/-rwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
> Context: system_u:object_r:unlabeled_t:s0
> Access: 2024-05-08 11:47:48.000000000 +0000
> Modify: 2024-05-08 11:47:48.000000000 +0000
> Change: 2024-05-08 11:47:48.000000000 +0000
> Birth: -
> ```

Thanks for the bug report. What the reproducer script is doing is
opening a file for writing, and then remounting the file system to
disable delayhed allocations. It then writes 40k to the file, and
then truncates the file, and close it.

The reproducer script leaves the file system mounted; if you unmount
the file system, the kernel will issue a warning message:

EXT4-fs (loop0): Inode 13 (0000000082f8ff6c): i_reserved_data_blocks (40) not cleared!

... and then if you examine the on-disk image, you'll see that the
i_blocks field is correct.


The root cause is that when the file is open, the address_operations
which is instaniated is one which is designed for delayed allocation.
So when 40k is written to the file, although we haven't allocated the
space yet, we *will* allocate the space, so we update in-memory
i_blocks to reflect the to-be-allocated disk space (although we don't
update the on-disk i_blocks until the allocation actually takes place,
so if we crash, the file system stays consistent).

However, some of the *other* logic is done based on whether the
delayed allocation flag is set in the struct super (as opposed to the
address_operations in the struct inode at the time that file is
opened). This includes what happens when we truncate the file, where
in the nodelalloc case, the in-memory and on-disk i_blocks are in
sync, and when the blocks are delallocated, the i_blocks field is
dropped. But since the blocks weren't actually allocated because the
file descriptor was in delalloc mode, there was nothing to
delallocate, so the in-memory i_blocks stayed elevanted.

There are a number ways of fixing this. The simplest is to simply not
allow the delayed allocation mode to be changed on a remount. Since
in the long term, once we fix some performance for some specialized
use case which has caused people to want to disable delayed
allocation, this is the one that probably makes the most sense.

The two other approaches involve a lot more complexity. The first
alternate approach is to iterate over all open files, and resolve any
pending delayed allocations, and then update the address_operations to
use ext4_aops instead of ext4_da_aops, while avoiding the races
involving writes happening while we are trying to do the remount.
Yelch.

The second alternate approach is to treat the delayed allocation
status on a per-inode basis, with a per-inode flag indicating whether
delayed allocation is active, so that the truncate logic stays
consistent with the ext4_da_ops active on file descriptors associated
with the file. But this gets super messy since subsequent file opens
on the inode also need to use the same delayed allocation mode until
the last file descriptor associated with the inode is closed. Double
Yelch.

So preventing the "mount -o remount,nodelalloc" in the reproducer
program is how we'll address this issue.

- Ted

2024-05-09 06:35:24

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 218820] The empty file occupies incorrect blocks

https://bugzilla.kernel.org/show_bug.cgi?id=218820

--- Comment #2 from Chi ([email protected]) ---
I got it! Thank you very much for your detailed explanation.

--
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.