2018-11-09 03:45:04

by Elana Hashman

Subject: Phantom full ext4 root filesystems on 4.1 through 4.14 kernels

Hi Ted,

We've run into a mysterious "phantom" full filesystem issue on our Kubernetes fleet. We initially encountered this issue on kernel 4.1.35, but are still experiencing the problem after upgrading to 4.14.67. Essentially, `df` reports our root filesystems as full and they behave as though they are full, but the "used" space cannot be accounted for. Rebooting the system, remounting the root filesystem read-only and then remounting as read-write, or booting into single-user mode all free up the "used" space. The disk slowly fills up over time, suggesting that there might be some kind of leak; we previously saw this affecting hosts with ~200 days of uptime on the 4.1 kernel, but are now seeing it affect a 4.14 host with only ~70 days of uptime.
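For reference, the remount cycle we use as a workaround is roughly the following (a sketch; on a busy root filesystem the read-only remount may fail or require dropping to single-user mode first):

$ sudo mount -o remount,ro /
$ sudo mount -o remount,rw /
$ df -h /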

Here is some data from an example host, running the 4.14.67 kernel. The root disk is ext4.

$ uname -a
Linux <hostname> 4.14.67-ts1 #1 SMP Wed Aug 29 13:28:25 UTC 2018 x86_64 GNU/Linux
$ grep ' / ' /proc/mounts
/dev/disk/by-uuid/<some-uuid> / ext4 rw,relatime,errors=remount-ro,data=ordered 0 0

`df` reports 0 bytes free:

$ df -h /
Filesystem                     Size  Used Avail Use% Mounted on
/dev/disk/by-uuid/<some-uuid>   50G   48G     0 100% /

Deleted, open files account for almost no disk capacity:

$ sudo lsof -a +L1 /
COMMAND    PID   USER  FD TYPE DEVICE SIZE/OFF NLINK    NODE NAME
java      5313   user   3r REG    8,3  6806312     0 1315847 /var/lib/sss/mc/passwd (deleted)
java      5313   user  11u REG    8,3    55185     0 2494654 /tmp/classpath.1668Gp (deleted)
system_ar 5333   user   3r REG    8,3  6806312     0 1315847 /var/lib/sss/mc/passwd (deleted)
java      5421   user   3r REG    8,3  6806312     0 1315847 /var/lib/sss/mc/passwd (deleted)
java      5421   user  11u REG    8,3   149313     0 2494486 /tmp/java.fzTwWp (deleted)
java      5421 tsdist  12u REG    8,3    55185     0 2500513 /tmp/classpath.7AmxHO (deleted)

`du` can only account for 16GB of file usage:

$ sudo du -hxs /
16G     /

But what is most puzzling is the numbers reported by e2freefrag, which don't add up:

$ sudo e2freefrag /dev/disk/by-uuid/<some-uuid>
Device: /dev/disk/by-uuid/<some-uuid>
Blocksize: 4096 bytes
Total blocks: 13107200
Free blocks: 7778076 (59.3%)

Min. free extent: 4 KB
Max. free extent: 8876 KB
Avg. free extent: 224 KB
Num. free extent: 6098

HISTOGRAM OF FREE EXTENT SIZES:
Extent Size Range :  Free extents   Free Blocks  Percent
    4K...    8K-  :          1205          1205    0.02%
    8K...   16K-  :           980          2265    0.03%
   16K...   32K-  :           653          3419    0.04%
   32K...   64K-  :          1337         15374    0.20%
   64K...  128K-  :           631         14151    0.18%
  128K...  256K-  :           224         10205    0.13%
  256K...  512K-  :           261         23818    0.31%
  512K... 1024K-  :           303         56801    0.73%
    1M...    2M-  :           387        135907    1.75%
    2M...    4M-  :           103         64740    0.83%
    4M...    8M-  :            12         15005    0.19%
    8M...   16M-  :             2          4267    0.05%

This looks like a bug to me; the histogram in the manpage example has percentages that add up to 100% but this doesn't even add up to 5%.
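(As a quick sanity check on the arithmetic, summing the Free Blocks column by hand:

$ echo '1205+2265+3419+15374+14151+10205+23818+56801+135907+64740+15005+4267' | bc
347157

That is only about 347k blocks, roughly 1.3 GiB of free extents, against the 7,778,076 free blocks in the header, i.e. about 4.5%, which matches the Percent column if it is read as a fraction of the header's free-block count.)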

After a reboot, `df` reflects real utilization:

$ df -h /
Filesystem                     Size  Used Avail Use% Mounted on
/dev/disk/by-uuid/<some-uuid>   50G   16G   31G  34% /

We are using overlay2fs for Docker, as well as rbd mounts; I'm not sure how they might interact.

Thanks for your help,

--
Elana Hashman
[email protected]


2018-11-09 04:24:29

by Darrick J. Wong

Subject: Re: Phantom full ext4 root filesystems on 4.1 through 4.14 kernels

On Thu, Nov 08, 2018 at 05:59:18PM +0000, Elana Hashman wrote:
> Hi Ted,
>
> We've run into a mysterious "phantom" full filesystem issue on our Kubernetes fleet. We initially encountered this issue on kernel 4.1.35, but are still experiencing the problem after upgrading to 4.14.67. Essentially, `df` reports our root filesystems as full and they behave as though they are full, but the "used" space cannot be accounted for. Rebooting the system, remounting the root filesystem read-only and then remounting as read-write, or booting into single-user mode all free up the "used" space. The disk slowly fills up over time, suggesting that there might be some kind of leak; we previously saw this affecting hosts with ~200 days of uptime on the 4.1 kernel, but are now seeing it affect a 4.14 host with only ~70 days of uptime.
>
> Here is some data from an example host, running the 4.14.67 kernel. The root disk is ext4.
>
> $ uname -a
> Linux <hostname> 4.14.67-ts1 #1 SMP Wed Aug 29 13:28:25 UTC 2018 x86_64 GNU/Linux
> $ grep ' / ' /proc/mounts
> /dev/disk/by-uuid/<some-uuid> / ext4 rw,relatime,errors=remount-ro,data=ordered 0 0
>
> `df` reports 0 bytes free:
>
> $ df -h /
> Filesystem                     Size  Used Avail Use% Mounted on
> /dev/disk/by-uuid/<some-uuid>   50G   48G     0 100% /

This is very odd. I wonder, how many of those overlayfses are still
mounted on the system at this point? Over in xfs land we've discovered
that overlayfs subtly changes the lifetime behavior of incore inodes,
maybe that's what's going on here? (Pure speculation on my part...)

> Deleted, open files account for almost no disk capacity:
>
> $ sudo lsof -a +L1 /
> COMMAND    PID   USER  FD TYPE DEVICE SIZE/OFF NLINK    NODE NAME
> java      5313   user   3r REG    8,3  6806312     0 1315847 /var/lib/sss/mc/passwd (deleted)
> java      5313   user  11u REG    8,3    55185     0 2494654 /tmp/classpath.1668Gp (deleted)
> system_ar 5333   user   3r REG    8,3  6806312     0 1315847 /var/lib/sss/mc/passwd (deleted)
> java      5421   user   3r REG    8,3  6806312     0 1315847 /var/lib/sss/mc/passwd (deleted)
> java      5421   user  11u REG    8,3   149313     0 2494486 /tmp/java.fzTwWp (deleted)
> java      5421 tsdist  12u REG    8,3    55185     0 2500513 /tmp/classpath.7AmxHO (deleted)
>
> `du` can only account for 16GB of file usage:
>
> $ sudo du -hxs /
> 16G     /
>
> But what is most puzzling is the numbers reported by e2freefrag, which don't add up:
>
> $ sudo e2freefrag /dev/disk/by-uuid/<some-uuid>
> Device: /dev/disk/by-uuid/<some-uuid>
> Blocksize: 4096 bytes
> Total blocks: 13107200
> Free blocks: 7778076 (59.3%)
>
> Min. free extent: 4 KB
> Max. free extent: 8876 KB
> Avg. free extent: 224 KB
> Num. free extent: 6098
>
> HISTOGRAM OF FREE EXTENT SIZES:
> Extent Size Range :  Free extents   Free Blocks  Percent
>     4K...    8K-  :          1205          1205    0.02%
>     8K...   16K-  :           980          2265    0.03%
>    16K...   32K-  :           653          3419    0.04%
>    32K...   64K-  :          1337         15374    0.20%
>    64K...  128K-  :           631         14151    0.18%
>   128K...  256K-  :           224         10205    0.13%
>   256K...  512K-  :           261         23818    0.31%
>   512K... 1024K-  :           303         56801    0.73%
>     1M...    2M-  :           387        135907    1.75%
>     2M...    4M-  :           103         64740    0.83%
>     4M...    8M-  :            12         15005    0.19%
>     8M...   16M-  :             2          4267    0.05%

Clearly a bug in e2freefrag; the percentages are supposed to sum to 100.
Patches soon.

--D

>
> This looks like a bug to me; the histogram in the manpage example has percentages that add up to 100% but this doesn't even add up to 5%.
>
> After a reboot, `df` reflects real utilization:
>
> $ df -h /
> Filesystem                     Size  Used Avail Use% Mounted on
> /dev/disk/by-uuid/<some-uuid>   50G   16G   31G  34% /
>
> We are using overlay2fs for Docker, as well as rbd mounts; I'm not sure how they might interact.
>
> Thanks for your help,
>
> --
> Elana Hashman
> [email protected]

2018-11-09 04:15:08

by Elana Hashman

Subject: RE: Phantom full ext4 root filesystems on 4.1 through 4.14 kernels

-----Original Message-----
From: Reindl Harald <[email protected]>
Sent: Thursday, November 8, 2018 1:14 PM

> and what does "df -hi" say?
> maybe you are simply running out of inodes

We are nowhere near the inode limit:

$ df -hi /
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/disk/by-uuid/<some-uuid> 3.2M 191K 3.0M 6% /

I can't confirm 100% since we rebooted this host and I don't have a candidate demonstrating the problem right now that I can check, but it's something we've checked previously.

- e

2018-11-09 04:27:32

by Reindl Harald

Subject: Re: Phantom full ext4 root filesystems on 4.1 through 4.14 kernels



Am 08.11.18 um 18:59 schrieb Elana Hashman:
> After a reboot, `df` reflects real utilization:
>
> $ df -h /
> Filesystem                                              Size  Used Avail Use% Mounted on
> /dev/disk/by-uuid/<some-uuid>   50G 16G 31G 34% /
>
> We are using overlay2fs for Docker, as well as rbd mounts; I'm not sure how they might interact.

and what does "df -hi" say?
maybe you are simply running out of inodes

2018-12-05 16:26:21

by Elana Hashman

Subject: RE: Phantom full ext4 root filesystems on 4.1 through 4.14 kernels

Okay, let's take a look at another affected host. I have not drained it, just cordoned it, so it's still in Kubernetes service and has running, active pods.

$ uname -a
Linux <hostname> 4.14.67-ts1 #1 SMP Wed Aug 29 13:28:25 UTC 2018 x86_64 GNU/Linux

$ df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/disk/by-uuid/<some-uuid> 50G 46G 1.6G 97% /

$ df -hi /
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/disk/by-uuid/<some-uuid> 3.2M 203K 3.0M 7% /

$ sudo du -hxs /
21G /

$ sudo lsof -a +L1 /
lsof: WARNING: can't stat() fuse file system /srv/kubelet/pods/<some-path>
Output information may be incomplete.
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME
java 6392 user 11u REG 8,3 55185 0 1441946 /tmp/classpath.ln0XhI (deleted)
java 6481 user 11u REG 8,3 149313 0 1441945 /tmp/java.AwFIiw (deleted)
java 6481 user 12u REG 8,3 55185 0 1441951 /tmp/classpath.qrQhkS (deleted)


There are many overlay mounts currently active:

$ mount | grep overlay | wc -l
40

Also some fuse mounts (as mentioned in the lsof warning on this particular machine):

$ mount | grep fuse | wc -l
21

And just to double-check, all these modules are running in-tree:

$ modinfo ext4 | grep intree
intree: Y
$ modinfo overlay | grep intree
intree: Y
$ modinfo fuse | grep intree
intree: Y

- e

-----Original Message-----
From: Darrick J. Wong <[email protected]>
Sent: Thursday, November 8, 2018 1:47 PM
Subject: Re: Phantom full ext4 root filesystems on 4.1 through 4.14 kernels

This is very odd. I wonder, how many of those overlayfses are still mounted on the system at this point? Over in xfs land we've discovered that overlayfs subtly changes the lifetime behavior of incore inodes, maybe that's what's going on here? (Pure speculation on my part...)

2019-01-23 20:05:37

by Thomas Walker

Subject: Re: Phantom full ext4 root filesystems on 4.1 through 4.14 kernels

Unfortunately this still continues to be a persistent problem for us. On another example system:

# uname -a
Linux <hostname> 4.14.67-ts1 #1 SMP Wed Aug 29 13:28:25 UTC 2018 x86_64 GNU/Linux

# df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/disk/by-uuid/<uuid> 50G 46G 1.1G 98% /

# df -hi /
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/disk/by-uuid/<uuid> 3.2M 306K 2.9M 10% /

# du -hsx /
14G /

And confirmed not to be due to sparse files or deleted but still open files.

The most interesting thing that I've been able to find so far is this:

# mount -o remount,ro /
mount: / is busy
# df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/disk/by-uuid/<uuid> 50G 14G 33G 30% /

Something about attempting (and failing) to remount read-only frees up all of the phantom space usage.
Curious whether that sparks ideas in anyone's mind?

I've tried all manner of other things without success. Unmounting all of the overlays. Killing off virtually all of userspace (dropping to single user). Dropping page/inode/dentry caches. Nothing else (short of a reboot) seems to give us the space back.


On Wed, Dec 05, 2018 at 11:26:19AM -0500, Elana Hashman wrote:
> Okay, let's take a look at another affected host. I have not drained it, just
> cordoned it, so it's still in Kubernetes service and has running, active pods.
>

2019-01-24 01:54:41

by Liu Bo

Subject: Re: Phantom full ext4 root filesystems on 4.1 through 4.14 kernels

On Thu, Nov 8, 2018 at 10:11 AM Elana Hashman
<[email protected]> wrote:
>
> Hi Ted,
>
> We've run into a mysterious "phantom" full filesystem issue on our Kubernetes fleet. We initially encountered this issue on kernel 4.1.35, but are still experiencing the problem after upgrading to 4.14.67. Essentially, `df` reports our root filesystems as full and they behave as though they are full, but the "used" space cannot be accounted for. Rebooting the system, remounting the root filesystem read-only and then remounting as read-write, or booting into single-user mode all free up the "used" space. The disk slowly fills up over time, suggesting that there might be some kind of leak; we previously saw this affecting hosts with ~200 days of uptime on the 4.1 kernel, but are now seeing it affect a 4.14 host with only ~70 days of uptime.
>

I wonder if this ext4 filesystem has bigalloc enabled (it can be checked with dumpe2fs -h $disk).
bigalloc is known to leak space, and it was only fixed recently.
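For example (illustrative; substitute the real device):

$ sudo dumpe2fs -h /dev/disk/by-uuid/<some-uuid> 2>/dev/null | grep 'Filesystem features'

and check whether "bigalloc" appears in the feature list.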

thanks,
liubo

> Here is some data from an example host, running the 4.14.67 kernel. The root disk is ext4.
>
> $ uname -a
> Linux <hostname> 4.14.67-ts1 #1 SMP Wed Aug 29 13:28:25 UTC 2018 x86_64 GNU/Linux
> $ grep ' / ' /proc/mounts
> /dev/disk/by-uuid/<some-uuid> / ext4 rw,relatime,errors=remount-ro,data=ordered 0 0
>
> `df` reports 0 bytes free:
>
> $ df -h /
> Filesystem Size Used Avail Use% Mounted on
> /dev/disk/by-uuid/<some-uuid> 50G 48G 0 100% /
>
> Deleted, open files account for almost no disk capacity:
>
> $ sudo lsof -a +L1 /
> COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME
> java 5313 user 3r REG 8,3 6806312 0 1315847 /var/lib/sss/mc/passwd (deleted)
> java 5313 user 11u REG 8,3 55185 0 2494654 /tmp/classpath.1668Gp (deleted)
> system_ar 5333 user 3r REG 8,3 6806312 0 1315847 /var/lib/sss/mc/passwd (deleted)
> java 5421 user 3r REG 8,3 6806312 0 1315847 /var/lib/sss/mc/passwd (deleted)
> java 5421 user 11u REG 8,3 149313 0 2494486 /tmp/java.fzTwWp (deleted)
> java 5421 tsdist 12u REG 8,3 55185 0 2500513 /tmp/classpath.7AmxHO (deleted)
>
> `du` can only account for 16GB of file usage:
>
> $ sudo du -hxs /
> 16G /
>
> But what is most puzzling is the numbers reported by e2freefrag, which don't add up:
>
> $ sudo e2freefrag /dev/disk/by-uuid/<some-uuid>
> Device: /dev/disk/by-uuid/<some-uuid>
> Blocksize: 4096 bytes
> Total blocks: 13107200
> Free blocks: 7778076 (59.3%)
>
> Min. free extent: 4 KB
> Max. free extent: 8876 KB
> Avg. free extent: 224 KB
> Num. free extent: 6098
>
> HISTOGRAM OF FREE EXTENT SIZES:
> Extent Size Range : Free extents Free Blocks Percent
> 4K... 8K- : 1205 1205 0.02%
> 8K... 16K- : 980 2265 0.03%
> 16K... 32K- : 653 3419 0.04%
> 32K... 64K- : 1337 15374 0.20%
> 64K... 128K- : 631 14151 0.18%
> 128K... 256K- : 224 10205 0.13%
> 256K... 512K- : 261 23818 0.31%
> 512K... 1024K- : 303 56801 0.73%
> 1M... 2M- : 387 135907 1.75%
> 2M... 4M- : 103 64740 0.83%
> 4M... 8M- : 12 15005 0.19%
> 8M... 16M- : 2 4267 0.05%
>
> This looks like a bug to me; the histogram in the manpage example has percentages that add up to 100% but this doesn't even add up to 5%.
>
> After a reboot, `df` reflects real utilization:
>
> $ df -h /
> Filesystem Size Used Avail Use% Mounted on
> /dev/disk/by-uuid/<some-uuid> 50G 16G 31G 34% /
>
> We are using overlay2fs for Docker, as well as rbd mounts; I'm not sure how they might interact.
>
> Thanks for your help,
>
> --
> Elana Hashman
> [email protected]

2019-01-24 14:48:31

by Elana Hashman

Subject: RE: Phantom full ext4 root filesystems on 4.1 through 4.14 kernels

From: Liu Bo <[email protected]>

> I wonder if this ext4 filesystem has bigalloc enabled (it can be checked with dumpe2fs -h $disk).
> bigalloc is known to leak space, and it was only fixed recently.

Doesn't look that way:

$ uname -a
Linux <hostname> 4.14.67-ts1 #1 SMP Wed Aug 29 13:28:25 UTC 2018 x86_64 GNU/Linux
$ sudo du -hxs /
20G /
$ df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/disk/by-uuid/<some-uuid> 50G 46G 1.1G 98% /
$ sudo dumpe2fs -h /dev/disk/by-uuid/<some-uuid>
dumpe2fs 1.42.5 (29-Jul-2012)
Filesystem volume name: ROOT
Last mounted on: /
Filesystem UUID: <some-uuid>
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 3276800
Block count: 13107200
Reserved block count: 655360
Free blocks: 9260751
Free inodes: 3078348
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 1020
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
Flex block group size: 16
Filesystem created: Tue Jan 16 18:48:13 2018
Last mount time: Mon Oct 1 20:43:59 2018
Last write time: Mon Oct 1 20:43:58 2018
Mount count: 7
Maximum mount count: -1
Last checked: Tue Jan 16 18:48:13 2018
Check interval: 0 (<none>)
Lifetime writes: 7159 GB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
First orphan inode: 2231477
Default directory hash: half_md4
Directory Hash Seed: 8a6f6d03-37e7-409c-b8c9-d738e0b0a94a
Journal backup: inode blocks
Journal features: journal_incompat_revoke
Journal size: 128M
Journal length: 32768
Journal sequence: 0x02a09116
Journal start: 24688


--
Elana Hashman
[email protected]

2019-06-26 15:24:48

by Thomas Walker

Subject: Re: Phantom full ext4 root filesystems on 4.1 through 4.14 kernels

Sorry to revive a rather old thread, but Elana mentioned that there might have been a related fix recently? Possibly something to do with truncate?
A quick scan of the last month or so turned up https://www.spinics.net/lists/linux-ext4/msg65772.html but none of these seemed obviously applicable to me.
We do still experience this phantom space usage quite frequently (although the remount workaround below has lowered the priority).

On Wed, Jan 23, 2019 at 02:59:22PM -0500, Thomas Walker wrote:
> Unfortunately this still continues to be a persistent problem for us. On another example system:
>
> # uname -a
> Linux <hostname> 4.14.67-ts1 #1 SMP Wed Aug 29 13:28:25 UTC 2018 x86_64 GNU/Linux
>
> # df -h /
> Filesystem Size Used Avail Use% Mounted on
> /dev/disk/by-uuid/<uuid> 50G 46G 1.1G 98% /
>
> # df -hi /
> Filesystem Inodes IUsed IFree IUse% Mounted on
> /dev/disk/by-uuid/<uuid> 3.2M 306K 2.9M 10% /
>
> # du -hsx /
> 14G /
>
> And confirmed not to be due to sparse files or deleted but still open files.
>
> The most interesting thing that I've been able to find so far is this:
>
> # mount -o remount,ro /
> mount: / is busy
> # df -h /
> Filesystem Size Used Avail Use% Mounted on
> /dev/disk/by-uuid/<uuid> 50G 14G 33G 30% /
>
> Something about attempting (and failing) to remount read-only frees up all of the phantom space usage.
> Curious whether that sparks ideas in anyone's mind?
>
> I've tried all manner of other things without success. Unmounting all of the overlays. Killing off virtually all of userspace (dropping to single user). Dropping page/inode/dentry caches. Nothing else (short of a reboot) seems to give us the space back.

2019-07-11 09:24:18

by Jan Kara

Subject: Re: Phantom full ext4 root filesystems on 4.1 through 4.14 kernels

On Wed 26-06-19 11:17:54, Thomas Walker wrote:
> Sorry to revive a rather old thread, but Elana mentioned that there might
> have been a related fix recently? Possibly something to do with
> truncate? A quick scan of the last month or so turned up
> https://www.spinics.net/lists/linux-ext4/msg65772.html but none of these
> seemed obviously applicable to me. We do still experience this phantom
> space usage quite frequently (although the remount workaround below has
> lowered the priority).

I don't recall any fix for this. But seeing that a remount "fixes" the issue
for you, can you try whether one of the following has a similar effect?

1) Try "sync"
2) Try "fsfreeze -f / && fsfreeze -u /"
3) Try "echo 3 >/proc/sys/vm/drop_caches"

Also what is the contents of
/sys/fs/ext4/<problematic-device>/delayed_allocation_blocks
when the issue happens?

Honza

>
> On Wed, Jan 23, 2019 at 02:59:22PM -0500, Thomas Walker wrote:
> > Unfortunately this still continues to be a persistent problem for us. On another example system:
> >
> > # uname -a
> > Linux <hostname> 4.14.67-ts1 #1 SMP Wed Aug 29 13:28:25 UTC 2018 x86_64 GNU/Linux
> >
> > # df -h /
> > Filesystem Size Used Avail Use% Mounted on
> > /dev/disk/by-uuid/<uuid> 50G 46G 1.1G 98% /
> >
> > # df -hi /
> > Filesystem Inodes IUsed IFree IUse% Mounted on
> > /dev/disk/by-uuid/<uuid> 3.2M 306K 2.9M 10% /
> >
> > # du -hsx /
> > 14G /
> >
> > And confirmed not to be due to sparse files or deleted but still open files.
> >
> > The most interesting thing that I've been able to find so far is this:
> >
> > # mount -o remount,ro /
> > mount: / is busy
> > # df -h /
> > Filesystem Size Used Avail Use% Mounted on
> > /dev/disk/by-uuid/<uuid> 50G 14G 33G 30% /
> >
> > Something about attempting (and failing) to remount read-only frees up all of the phantom space usage.
> > Curious whether that sparks ideas in anyone's mind?
> >
> > I've tried all manner of other things without success. Unmounting all of the overlays. Killing off virtually all of userspace (dropping to single user). Dropping page/inode/dentry caches. Nothing else (short of a reboot) seems to give us the space back.
--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-07-11 14:58:02

by Geoffrey Thomas

Subject: RE: Phantom full ext4 root filesystems on 4.1 through 4.14 kernels

On Thursday, July 11, 2019 5:23 AM, Jan Kara <[email protected]> wrote:
> On Wed 26-06-19 11:17:54, Thomas Walker wrote:
> > Sorry to revive a rather old thread, but Elana mentioned that there might
> > have been a related fix recently? Possibly something to do with
> > truncate? A quick scan of the last month or so turned up
> > https://www.spinics.net/lists/linux-ext4/msg65772.html but none of these
> > seemed obviously applicable to me. We do still experience this phantom
> > space usage quite frequently (although the remount workaround below has
> > lowered the priority).
>
> I don't recall any fix for this. But seeing that remount "fixes" the issue
> for you can you try whether one of the following has a similar effect?
>
> 1) Try "sync"
> 2) Try "fsfreeze -f / && fsfreeze -u /"
> 3) Try "echo 3 >/proc/sys/vm/drop_caches"
>
> Also what is the contents of
> /sys/fs/ext4/<problematic-device>/delayed_allocation_blocks
> when the issue happens?

We just had one of these today, and no luck from any of those. delayed_allocation_blocks is 1:

# df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/disk/by-uuid/523c8243-5a25-40eb-8627-f3bbf98ec299 50G 47G 455M 100% /
# ls -l /dev/disk/by-uuid/523c8243-5a25-40eb-8627-f3bbf98ec299
lrwxrwxrwx 1 root root 10 Mar 8 16:03 /dev/disk/by-uuid/523c8243-5a25-40eb-8627-f3bbf98ec299 -> ../../sda3
# cat /sys/fs/ext4/sda3/delayed_allocation_blocks
1
# sync; sync; sync
# df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/disk/by-uuid/523c8243-5a25-40eb-8627-f3bbf98ec299 50G 47G 455M 100% /
# fsfreeze -f /; fsfreeze -u /
# df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/disk/by-uuid/523c8243-5a25-40eb-8627-f3bbf98ec299 50G 47G 455M 100% /
# echo 3 > /proc/sys/vm/drop_caches
[blocks for about 10 seconds]
# df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/disk/by-uuid/523c8243-5a25-40eb-8627-f3bbf98ec299 50G 47G 454M 100% /
# umount /
[blocks for about 2 seconds]
umount: /: device is busy.
(In some cases useful info about processes that use
the device is found by lsof(8) or fuser(1))
# df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/disk/by-uuid/523c8243-5a25-40eb-8627-f3bbf98ec299 50G 16G 32G 33% /
# uname -r
4.14.67-ts1

--
Geoffrey Thomas
[email protected]

2019-07-11 15:48:48

by Jan Kara

Subject: Re: Phantom full ext4 root filesystems on 4.1 through 4.14 kernels

On Thu 11-07-19 14:40:43, Geoffrey Thomas wrote:
> On Thursday, July 11, 2019 5:23 AM, Jan Kara <[email protected]> wrote:
> > On Wed 26-06-19 11:17:54, Thomas Walker wrote:
> > > Sorry to revive a rather old thread, but Elana mentioned that there might
> > > have been a related fix recently? Possibly something to do with
> > > truncate? A quick scan of the last month or so turned up
> > > https://www.spinics.net/lists/linux-ext4/msg65772.html but none of these
> > > seemed obviously applicable to me. We do still experience this phantom
> > > space usage quite frequently (although the remount workaround below has
> > > lowered the priority).
> >
> > I don't recall any fix for this. But seeing that remount "fixes" the issue
> > for you can you try whether one of the following has a similar effect?
> >
> > 1) Try "sync"
> > 2) Try "fsfreeze -f / && fsfreeze -u /"
> > 3) Try "echo 3 >/proc/sys/vm/drop_caches"
> >
> > Also what is the contents of
> > /sys/fs/ext4/<problematic-device>/delayed_allocation_blocks
> > when the issue happens?
>
> We just had one of these today, and no luck from any of those.
> delayed_allocation_blocks is 1:

...

This is very strange because a failed read-only remount (with EBUSY) doesn't
really do more than what "sync; echo 3 >/proc/sys/vm/drop_caches" does. I
suspect there's really some userspace taking up space and cleaning up on
umount. Anyway, once this happens again, can you do:

fsfreeze -f /
e2image -r /dev/disk/by-uuid/523c8243-5a25-40eb-8627-f3bbf98ec299 - | \
xz >some_storage.xz
fsfreeze -u /

some_storage.xz can be on some usb stick or so. It will dump ext4 metadata
to the file. Then please provide some_storage.xz for download somewhere.
Thanks! BTW I'll be on vacation next two weeks so it will take a while to
get to this...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-07-11 17:13:13

by Theodore Ts'o

Subject: Re: Phantom full ext4 root filesystems on 4.1 through 4.14 kernels

Can you try using "df -i" when the file system looks full, and then
reboot, and look at the results of "df -i" afterwards?

Also interesting would be to grab a metadata-only snapshot of the file
system when it is in its mysteriously full state, writing that
snapshot on some other file system *other* than on /dev/sda3:

e2image -r /dev/sda3 /mnt/sda3.e2i

Then run e2fsck on it:

e2fsck -fy /mnt/sda3.e2i

What I'm curious about is how many "orphaned inodes" are reported, and
how much space they are taking up. That will look like this:

% gunzip < /usr/src/e2fsprogs/tests/f_orphan/image.gz > /tmp/foo.img
% e2fsck -fy /tmp/foo.img
e2fsck 1.45.2 (27-May-2019)
Clearing orphaned inode 15 (uid=0, gid=0, mode=040755, size=1024)
Clearing orphaned inode 17 (uid=0, gid=0, mode=0100644, size=0)
Clearing orphaned inode 16 (uid=0, gid=0, mode=040755, size=1024)
Clearing orphaned inode 14 (uid=0, gid=0, mode=0100644, size=69)
Clearing orphaned inode 13 (uid=0, gid=0, mode=040755, size=1024)
...

It's been theorized the bug is in overlayfs, where it's holding inodes
open so the space isn't released. IIRC somebody had reported a
similar problem with overlayfs on top of xfs. (BTW, are you using
overlayfs or aufs with your Docker setup?)

- Ted

2019-07-12 19:19:39

by Thomas Walker

Subject: Re: Phantom full ext4 root filesystems on 4.1 through 4.14 kernels

On Thu, Jul 11, 2019 at 01:10:46PM -0400, Theodore Ts'o wrote:
> Can you try using "df -i" when the file system looks full, and then
> reboot, and look at the results of "df -i" afterwards?

inode usage doesn't change appreciably between the state with "lost" space, after the remount workaround, or after a reboot.

> Also interesting would be to grab a metadata-only snapshot of the file
> system when it is in its mysteriously full state, writing that
> snapshot on some other file system *other* than on /dev/sda3:
>
> e2image -r /dev/sda3 /mnt/sda3.e2i
>
> Then run e2fsck on it:
>
> e2fsck -fy /mnt/sda3.e2i
>
> What I'm curious about is how many "orphaned inodes" are reported, and
> how much space they are taking up. That will look like this:

<..>
Clearing orphaned inode 2360177 (uid=0, gid=0, mode=0100644, size=1035390)
Clearing orphaned inode 2360181 (uid=0, gid=0, mode=0100644, size=185522)
Clearing orphaned inode 2360176 (uid=0, gid=0, mode=0100644, size=1924512)
Clearing orphaned inode 2360180 (uid=0, gid=0, mode=0100644, size=3621978)
Clearing orphaned inode 1048838 (uid=0, gid=4, mode=0100640, size=39006841856)
release_inode_blocks: Corrupt extent header while calling ext2fs_block_iterate for inode 1048838
<..>

Of particular note, the size of inode 1048838 matches the amount of space that we "lost".
A few months ago I was poking at this with kprobes to try to understand what was happening with the attempt to remount read-only, and noticed that it triggered hundreds of:

<...>-78273 [000] .... 5186888.917840: ext4_remove_blocks: dev 8,3 ino 2889535 extent [0(11384832), 2048]from 0 to 2047 partial_cluster 0
<...>-78273 [000] .... 5186888.917841: <stack trace>
=> ext4_ext_truncate
=> ext4_truncate
=> ext4_evict_inode
=> evict
=> iput
=> dentry_unlink_inode
=> __dentry_kill
=> dput.part.23
=> dput
=> SyS_rename
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe

With all the same inode numbers, whose sizes added up to the same amount of space that we had "lost" and got back. But those inodes didn't map to any file or open file handle.
Obviously the inode numbers match here and, if I total up all of the ext4_remove_blocks lines, the number of blocks matches both what fsck reports and what we "lost".
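(For reference, one way to capture the same kind of trace without kprobes is the stock ext4 tracepoint plus the ftrace stacktrace option -- a rough sketch, assuming tracefs is mounted under /sys/kernel/debug/tracing:

# echo 1 > /sys/kernel/debug/tracing/events/ext4/ext4_remove_blocks/enable
# echo 1 > /sys/kernel/debug/tracing/options/stacktrace
# cat /sys/kernel/debug/tracing/trace_pipe

and echo 0 into the same two files afterwards to turn it back off.)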

>
> ...
>
> It's been theorized the bug is in overlayfs, where it's holding inodes
> open so the space isn't released. IIRC somebody had reported a
> similar problem with overlayfs on top of xfs. (BTW, are you using
> overlayfs or aufs with your Docker setup?)
>

Yes, we are using overlayfs and had also heard similar reports. But the ext4 fs here is the root filesystem,
while all of the overlays are using a separate partition using XFS. The only interplay between our root fs and
the docker containers is the occasional ro bind mount from root into the container.

Unfortunately, we've not been able to reproduce this outside of our production plant running real workloads with real data. I did capture a metadata dump (with dirents scrambled) as Jan asked, but I suspect it will take some work to get it past our security team. I can certainly do that if we think there is anything valuable in it, though.


Thanks,
Tom.

2019-07-12 20:29:04

by Theodore Ts'o

Subject: Re: Phantom full ext4 root filesystems on 4.1 through 4.14 kernels

On Fri, Jul 12, 2019 at 03:19:03PM -0400, Thomas Walker wrote:
> Clearing orphaned inode 1048838 (uid=0, gid=4, mode=0100640, size=39006841856)

> Of particular note, ino 1048838 matches the size of the space that we "lost".

Hmmm... what's gid 4? Is that a hint of where the inode might have come from?

Can you try this experiment of e2image... e2fsck, but add a "cp
--sparse" of the e2i file between the e2image and e2fsck steps? Then,
once you can identify the inode that has the huge amount of the
orphaned space, try to grab the first couple of blocks, and then run
"file" on the first part of the file, which might help you identify
where the file came from. Is it an ISO file? etc.
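(A rough sketch of one way to do that once the inode number is known; INO is a
placeholder, and these run against the live device rather than the metadata image:

debugfs -R 'stat <INO>' /dev/sda3
debugfs -R 'cat <INO>' /dev/sda3 2>/dev/null | head -c 1M > /tmp/orphan-head
file /tmp/orphan-head

The stat output shows the size and extent tree; the cat/file pair samples the
start of the data without copying all of it.)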

The goal is to come up with a repeatable way of forcing the failure,
so we can understand the root cause of the "lost space". The fact
that it's an orphaned inode means that something was hanging onto the
inode. The question is what part of the kernel was keeping the ref
count elevated.

Thanks,

- Ted

2019-07-12 22:04:54

by Geoffrey Thomas

Subject: RE: Phantom full ext4 root filesystems on 4.1 through 4.14 kernels

On Friday, July 12, 2019 4:28 PM, Theodore Ts'o <[email protected]> wrote:
> To: Thomas Walker <[email protected]>
> Cc: Geoffrey Thomas <[email protected]>; 'Jan Kara'
> <[email protected]>; '[email protected]' <[email protected]>;
> Darrick J. Wong <[email protected]>
> Subject: Re: Phantom full ext4 root filesystems on 4.1 through 4.14
> kernels
>
> On Fri, Jul 12, 2019 at 03:19:03PM -0400, Thomas Walker wrote:
> > Clearing orphaned inode 1048838 (uid=0, gid=4, mode=0100640,
> size=39006841856)
>
> > Of particular note, ino 1048838 matches the size of the space that we
> "lost".
>
> Hmmm... what's gid 4? Is that a hint of where the inode might have come
> from?

Good call, gid 4 is `adm`. And now that we have an inode number we can see the file's contents; it's from /var/log/account.
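(For the record, the lookup is just:

$ getent group 4

which prints the adm group entry on a Debian system.)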

I bet that this is acct(2) holding onto a reference in some weird way (possibly involving logrotate?), which also explains why we couldn't find a userspace process holding onto the inode. We'll investigate a bit....

Thanks,
--
Geoffrey Thomas
[email protected]

2019-07-25 21:22:53

by Geoffrey Thomas

Subject: RE: Phantom full ext4 root filesystems on 4.1 through 4.14 kernels

On Friday, July 12, 2019 5:47 PM, Geoffrey Thomas <[email protected]> wrote:
> On Friday, July 12, 2019 4:28 PM, Theodore Ts'o <[email protected]> wrote:
> > Hmmm... what's gid 4? Is that a hint of where the inode might have come
> > from?
>
> Good call, gid 4 is `adm`. And now that we have an inode number we can see
> the file's contents, it's from /var/log/account.
>
> I bet that this is acct(2) holding onto a reference in some weird way
> (possibly involving logrotate?), which also explains why we couldn't find
> a userspace process holding onto the inode. We'll investigate a bit....

To close this out - yes, this was process accounting. Debian has a nightly cronjob which rotates the pacct logs, runs `invoke-rc.d acct restart` to reopen the file, and compresses the old log. Due to a stray policy-rc.d file from an old provisioning script, however, the restart was being skipped, and so we were unlinking and compressing the pacct file while the kernel still had it open. So it was the classic problem of an open file handle to a large deleted file, except that the open file handle was being held by the kernel.
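(For anyone checking for the same failure mode on a Debian host, a quick sketch -- per invoke-rc.d(8), --query returns 101 when a policy layer forbids the action:

# ls -l /usr/sbin/policy-rc.d
# invoke-rc.d --query acct restart; echo $?
)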

`accton off` solved our immediate problems and freed the space. I'm not totally sure why a failed umount had that effect, too, but I suppose it turned off process accounting.

It's a little frustrating to me that the file opened by acct(2) doesn't show up to userspace (lsof doesn't seem to find it) - it'd be nice if it could show up in /proc/$some_kernel_thread/fd or somewhere, if possible.

Thanks for the help - the e2image + fsck trick is great!

--
Geoffrey Thomas
[email protected]

2019-07-29 10:11:12

by Jan Kara

Subject: Re: Phantom full ext4 root filesystems on 4.1 through 4.14 kernels

On Thu 25-07-19 21:22:28, Geoffrey Thomas wrote:
> On Friday, July 12, 2019 5:47 PM, Geoffrey Thomas <[email protected]> wrote:
> > On Friday, July 12, 2019 4:28 PM, Theodore Ts'o <[email protected]> wrote:
> > > Hmmm... what's gid 4? Is that a hint of where the inode might have come
> > > from?
> >
> > Good call, gid 4 is `adm`. And now that we have an inode number we can see
> > the file's contents, it's from /var/log/account.
> >
> > I bet that this is acct(2) holding onto a reference in some weird way
> > (possibly involving logrotate?), which also explains why we couldn't find
> > a userspace process holding onto the inode. We'll investigate a bit....
>
> To close this out - yes, this was process accounting. Debian has a nightly cronjob which rotates the pacct logs, runs `invoke-rc.d acct restart` to reopen the file, and compresses the old log. Due to a stray policy-rc.d file from an old provisioning script, however, the restart was being skipped, and so we were unlinking and compressing the pacct file while the kernel still had it open. So it was the classic problem of an open file handle to a large deleted file, except that the open file handle was being held by the kernel.
>
> `accton off` solved our immediate problems and freed the space. I'm not totally sure why a failed umount had that effect, too, but I suppose it turned off process accounting.
>
> It's a little frustrating to me that the file opened by acct(2) doesn't show up to userspace (lsof doesn't seem to find it) - it'd be nice if it could show up in /proc/$some_kernel_thread/fd or somewhere, if possible.
>
> Thanks for the help - the e2image + fsck trick is great!

Glad to hear you were able to solve the problem in the end :)

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-07-29 11:29:15

by Dmitrij Gusev

Subject: ext4 file system is constantly writing to the block device with no activity from the applications, is it a bug?

Hello.

An ext4 file system is constantly writing to the block device with no
activity from the applications; is it a bug?

Write activity is about 64 kbytes (almost always exactly 64 kbytes)
every 1-2 seconds (I discovered it after a RAID sync finished). Please
check the activity log sample below.

I could not find out which process was writing constantly, so I rebooted
into rescue mode and mounted the FS with no application activity guaranteed -
the same write pattern started again, about 64 kB every 1-2 seconds.

The file system was created with the latest e2fsprogs, here are the details:

Linux nexus 4.19.62 #2 SMP Sun Jul 28 17:18:10 CDT 2019 x86_64 AMD
Athlon(tm) Dual Core Processor 5050e AuthenticAMD GNU/Linux

# tune2fs -l /dev/<XXXXXXvg>/home
tune2fs 1.43.1 (08-Jun-2016)  // the file system was created on another system
with the latest tools
Filesystem volume name:   home
Last mounted on:          /home
Filesystem UUID:          XXXXXXXXXXXXXXXXXX
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index
filetype needs_recovery extent 64bit flex_bg sparse_super large_file
huge_file dir_nlink extra_isize metadata_csum
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              21644032
Block count:              2770432000
Reserved block count:     0
Free blocks:              2095699032
Free inodes:              21205431
First block:              0
Block size:               4096
Fragment size:            4096
Group descriptor size:    64
Reserved GDT blocks:      726
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         256
Inode blocks per group:   16
RAID stride:              128
RAID stripe width:        384
Flex block group size:    16
Filesystem created:       Sat Jul 27 16:41:21 2019
Last mount time:          Mon Jul 29 13:47:33 2019
Last write time:          Mon Jul 29 13:47:33 2019
Mount count:              10
Maximum mount count:      -1
Last checked:             Sat Jul 27 16:41:21 2019
Check interval:           0 (<none>)
Lifetime writes:          2615 GB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     32
Desired extra isize:      32
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      XXXXXXXXXXXXXXXXXXXXXXX
Journal backup:           inode blocks
Checksum type:            crc32c
Checksum:                 0x7f0fb170


BlockDev Activity log:

# iostat -kdt 1 /dev/XXXXXXvg/home
Linux 4.19.62 (XXXXX)   07/29/2019      _x86_64_        (2 CPU)

07/29/2019 02:08:03 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read kB_wrtn
dm-5              1.91        15.40        37.77      19309 47364

07/29/2019 02:08:04 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read kB_wrtn
dm-5              1.00         0.00        64.00 0         64

07/29/2019 02:08:05 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read kB_wrtn
dm-5              1.00         0.00        64.00 0         64

07/29/2019 02:08:06 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read kB_wrtn
dm-5              0.00         0.00         0.00 0          0

07/29/2019 02:08:07 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read kB_wrtn
dm-5              1.00         0.00        64.00 0         64

07/29/2019 02:08:08 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read kB_wrtn
dm-5              0.00         0.00         0.00 0          0

07/29/2019 02:08:09 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read kB_wrtn
dm-5              1.00         0.00        64.00 0         64

07/29/2019 02:08:10 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read kB_wrtn
dm-5              1.00         0.00        64.00 0         64

07/29/2019 02:08:11 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read kB_wrtn
dm-5              0.00         0.00         0.00 0          0

07/29/2019 02:08:12 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read kB_wrtn
dm-5              1.00         0.00        64.00 0         64

07/29/2019 02:08:13 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read kB_wrtn
dm-5              0.00         0.00         0.00 0          0

07/29/2019 02:08:14 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read kB_wrtn
dm-5              1.00         0.00        64.00 0         64

07/29/2019 02:08:15 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read kB_wrtn
dm-5              1.00         0.00        64.00 0         64

07/29/2019 02:08:16 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read kB_wrtn
dm-5              0.00         0.00         0.00 0          0

07/29/2019 02:08:17 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read kB_wrtn
dm-5              1.00         0.00        64.00 0         64

07/29/2019 02:08:18 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read kB_wrtn
dm-5              0.00         0.00         0.00 0          0

07/29/2019 02:08:19 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read kB_wrtn
dm-5              1.00         0.00        64.00 0         64

07/29/2019 02:08:20 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read kB_wrtn
dm-5              1.00         0.00        64.00 0         64

07/29/2019 02:08:21 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read kB_wrtn
dm-5              0.00         0.00         0.00 0          0

07/29/2019 02:08:22 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read kB_wrtn
dm-5              1.00         0.00        64.00 0         64

07/29/2019 02:08:23 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read kB_wrtn
dm-5              0.00         0.00         0.00 0          0

07/29/2019 02:08:24 PM
Device:            tps    kB_read/s    kB_wrtn/s    kB_read kB_wrtn
dm-5              1.00         0.00        64.00 0         64


Thank you,

---

Dmitrij Gusev

Systems architect


2019-07-29 13:54:31

by Theodore Ts'o

Subject: Re: ext4 file system is constantly writing to the block device with no activity from the applications, is it a bug?

On Mon, Jul 29, 2019 at 02:18:22PM +0300, Dmitrij Gusev wrote:
>
> An ext4 file system is constantly writing to the block device with no
> activity from the applications; is it a bug?
>
> Write activity is about 64 kbytes (almost always exactly 64 kbytes)
> every 1-2 seconds (I discovered it after a RAID sync finished). Please
> check the activity log sample below.

Is this a freshly created file system? It could be the lazy inode
table initialization. You can suppress it using "mount -o
noinit_itable", but it will leave portions of the inode table
unzeroed, which can lead to confusion if the system crashes and e2fsck
has to try to recover the file system.

Or you can not enable lazy inode table initialization when the file
system is created, using "mke2fs -t ext4 -E lazy_itable_init=0
/dev/XXX". (See the manual page for mke2fs.conf for another way to
turn it off by default.)
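A sketch of the mke2fs.conf approach (relation names per mke2fs.conf(5);
treat this as illustrative, not a drop-in config):

[fs_types]
	ext4 = {
		lazy_itable_init = 0
	}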

Turning off lazy inode table initialization causes mke2fs to take **much**
longer, especially on large RAID arrays. The idea is to trade off
mkfs time with background activity to initialize the inode table when
the file system is mounted. The noinit_itable mount option was added
so that a distro installer can temporarily suppress the background
inode table initialization to speed up the install; but then when the
system is booted, it can run in the background later.
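(Two rough ways to watch the background initialization, both illustrative:

ps ax | grep '[e]xt4lazyinit'
dumpe2fs /dev/XXX 2>/dev/null | grep -c ITABLE_ZEROED

The first shows the kernel worker doing the zeroing; the second counts the
block groups that have already been zeroed.)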


If that's not it, try installing the blktrace package and then run
"btrace /dev/<vg>/home", and see what it reports. For example, here's
the output from running "touch /mnt/test" (comments prefixed by '#'):

# here's the touch process reading the inode...
259,0 2 1 37.115679608 6646 Q RM 4232 + 8 [touch]
259,0 2 2 37.115682891 6646 C RM 4232 + 8 [0]
# here's the journal commit, 5 seconds later
259,0 1 11 42.543705759 6570 Q WS 3932216 + 8 [jbd2/pmem0-8]
259,0 1 12 42.543709184 6570 C WS 3932216 + 8 [0]
259,0 1 13 42.543713049 6570 Q WS 3932224 + 8 [jbd2/pmem0-8]
259,0 1 14 42.543714248 6570 C WS 3932224 + 8 [0]
259,0 1 15 42.543717049 6570 Q WS 3932232 + 8 [jbd2/pmem0-8]
259,0 1 16 42.543718193 6570 C WS 3932232 + 8 [0]
259,0 1 17 42.543720895 6570 Q WS 3932240 + 8 [jbd2/pmem0-8]
259,0 1 18 42.543722028 6570 C WS 3932240 + 8 [0]
259,0 1 19 42.543724806 6570 Q WS 3932248 + 8 [jbd2/pmem0-8]
259,0 1 20 42.543725952 6570 C WS 3932248 + 8 [0]
259,0 1 21 42.543728697 6570 Q WS 3932256 + 8 [jbd2/pmem0-8]
259,0 1 22 42.543729799 6570 C WS 3932256 + 8 [0]
259,0 1 23 42.543745380 6570 Q FWFS 3932264 + 8 [jbd2/pmem0-8]
259,0 1 24 42.543746836 6570 C FWFS 3932264 + 8 [0]
# and here's the writeback to the inode table and superblock,
# 30 seconds later
259,0 1 25 72.836967205 91 Q W 0 + 8 [kworker/u8:3]
259,0 1 26 72.836970861 91 C W 0 + 8 [0]
259,0 1 27 72.836984218 91 Q WM 8 + 8 [kworker/u8:3]
259,0 1 28 72.836985929 91 C WM 8 + 8 [0]
259,0 1 29 72.836992108 91 Q WM 4232 + 8 [kworker/u8:3]
259,0 1 30 72.836993953 91 C WM 4232 + 8 [0]
259,0 1 31 72.837001370 91 Q WM 4360 + 8 [kworker/u8:3]
259,0 1 32 72.837003210 91 C WM 4360 + 8 [0]
259,0 1 33 72.837010993 91 Q WM 69896 + 8 [kworker/u8:3]
259,0 1 34 72.837012564 91 C WM 69896 + 8 [0]

Cheers,

- Ted

2019-07-29 21:14:25

by Dmitrij Gusev

Subject: Re: ext4 file system is constantly writing to the block device with no activity from the applications, is it a bug?

Hello.

Yes, this is the lazy inode table initialization.

After I remounted the FS with the "noinit_itable" option, the activity
stopped (I then remounted it without the option so initialization could continue).

Appreciate your help.

P.S.

Thank you, Theodore and the development team involved for the great ext4
filesystem.

It is a reliable, high-performance file system. It has always served me well
for many years and continues to do so.

I have never lost any data using it, even though some systems experienced
many crashes or power losses.

Sincerely,

Dmitrij

On 2019-07-29 15:55, Theodore Y. Ts'o wrote:
> On Mon, Jul 29, 2019 at 02:18:22PM +0300, Dmitrij Gusev wrote:
>> An ext4 file system is constantly writing to the block device with no
>> activity from the applications; is it a bug?
>>
>> Write activity is about 64 kbytes (almost always exactly 64 kbytes)
>> every 1-2 seconds (I discovered it after a RAID sync finished). Please
>> check the activity log sample below.
> Is this a freshly created file system? It could be the lazy inode
> table initialization. You can suppress it using "mount -o
> noinit_itable", but it will leave portions of the inode table
> unzeroed, which can lead to confusion if the system crashes and e2fsck
> has to try to recover the file system.
>
> Or you can not enable lazy inode table initialization when the file
> system is created, using "mke2fs -t ext4 -E lazy_itable_init=0
> /dev/XXX". (See the manual page for mke2fs.conf for another way to
> turn it off by default.)
>
> Turning off lazy inode table initialization causes mke2fs to take **much**
> longer, especially on large RAID arrays. The idea is to trade off
> mkfs time with background activity to initialize the inode table when
> the file system is mounted. The noinit_itable mount option was added
> so that a distro installer can temporarily suppress the background
> inode table initialization to speed up the install; but then when the
> system is booted, it can run in the background later.
>
>
> If that's not it, try installing the blktrace package and then run
> "btrace /dev/<vg>/home", and see what it reports. For example, here's
> the output from running "touch /mnt/test" (comments prefixed by '#'):
>
> # here's the touch process reading the inode...
> 259,0 2 1 37.115679608 6646 Q RM 4232 + 8 [touch]
> 259,0 2 2 37.115682891 6646 C RM 4232 + 8 [0]
> # here's the journal commit, 5 seconds later
> 259,0 1 11 42.543705759 6570 Q WS 3932216 + 8 [jbd2/pmem0-8]
> 259,0 1 12 42.543709184 6570 C WS 3932216 + 8 [0]
> 259,0 1 13 42.543713049 6570 Q WS 3932224 + 8 [jbd2/pmem0-8]
> 259,0 1 14 42.543714248 6570 C WS 3932224 + 8 [0]
> 259,0 1 15 42.543717049 6570 Q WS 3932232 + 8 [jbd2/pmem0-8]
> 259,0 1 16 42.543718193 6570 C WS 3932232 + 8 [0]
> 259,0 1 17 42.543720895 6570 Q WS 3932240 + 8 [jbd2/pmem0-8]
> 259,0 1 18 42.543722028 6570 C WS 3932240 + 8 [0]
> 259,0 1 19 42.543724806 6570 Q WS 3932248 + 8 [jbd2/pmem0-8]
> 259,0 1 20 42.543725952 6570 C WS 3932248 + 8 [0]
> 259,0 1 21 42.543728697 6570 Q WS 3932256 + 8 [jbd2/pmem0-8]
> 259,0 1 22 42.543729799 6570 C WS 3932256 + 8 [0]
> 259,0 1 23 42.543745380 6570 Q FWFS 3932264 + 8 [jbd2/pmem0-8]
> 259,0 1 24 42.543746836 6570 C FWFS 3932264 + 8 [0]
> # and here's the writeback to the inode table and superblock,
> # 30 seconds later
> 259,0 1 25 72.836967205 91 Q W 0 + 8 [kworker/u8:3]
> 259,0 1 26 72.836970861 91 C W 0 + 8 [0]
> 259,0 1 27 72.836984218 91 Q WM 8 + 8 [kworker/u8:3]
> 259,0 1 28 72.836985929 91 C WM 8 + 8 [0]
> 259,0 1 29 72.836992108 91 Q WM 4232 + 8 [kworker/u8:3]
> 259,0 1 30 72.836993953 91 C WM 4232 + 8 [0]
> 259,0 1 31 72.837001370 91 Q WM 4360 + 8 [kworker/u8:3]
> 259,0 1 32 72.837003210 91 C WM 4360 + 8 [0]
> 259,0 1 33 72.837010993 91 Q WM 69896 + 8 [kworker/u8:3]
> 259,0 1 34 72.837012564 91 C WM 69896 + 8 [0]
>
> Cheers,
>
> - Ted