From: Liu Bo
Date: Wed, 23 Jan 2019 17:54:28 -0800
Subject: Re: Phantom full ext4 root filesystems on 4.1 through 4.14 kernels
To: Elana Hashman
Cc: "tytso@mit.edu", "linux-ext4@vger.kernel.org"

On Thu, Nov 8, 2018 at 10:11 AM Elana Hashman wrote:
>
> Hi Ted,
>
> We've run into a mysterious "phantom" full filesystem issue on our Kubernetes fleet. We initially encountered this issue on kernel 4.1.35, but are still experiencing the problem after upgrading to 4.14.67. Essentially, `df` reports our root filesystems as full and they behave as though they are full, but the "used" space cannot be accounted for. Rebooting the system, remounting the root filesystem read-only and then remounting as read-write, or booting into single-user mode all free up the "used" space. The disk slowly fills up over time, suggesting that there might be some kind of leak; we previously saw this affecting hosts with ~200 days of uptime on the 4.1 kernel, but are now seeing it affect a 4.14 host with only ~70 days of uptime.
>

I wonder if this ext4 filesystem has bigalloc enabled (this can be checked with dumpe2fs -h $disk).

Bigalloc is known to cause a space leak, and that has only been fixed recently.
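For example, something like the following should show whether the feature is set (a quick sketch; here $disk stands for the root block device, e.g. the /dev/disk/by-uuid/... path quoted below):

$ sudo dumpe2fs -h $disk | grep -i 'Filesystem features'

If the "Filesystem features:" line it prints lists "bigalloc", the feature is enabled; tune2fs -l $disk shows the same line.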
thanks,
liubo

> Here is some data from an example host, running the 4.14.67 kernel. The root disk is ext4.
>
> $ uname -a
> Linux 4.14.67-ts1 #1 SMP Wed Aug 29 13:28:25 UTC 2018 x86_64 GNU/Linux
> $ grep ' / ' /proc/mounts
> /dev/disk/by-uuid/ / ext4 rw,relatime,errors=remount-ro,data=ordered 0 0
>
> `df` reports 0 bytes free:
>
> $ df -h /
> Filesystem            Size  Used Avail Use% Mounted on
> /dev/disk/by-uuid/     50G   48G     0 100% /
>
> Deleted, open files account for almost no disk capacity:
>
> $ sudo lsof -a +L1 /
> COMMAND    PID  USER   FD   TYPE DEVICE SIZE/OFF NLINK    NODE NAME
> java      5313  user    3r   REG    8,3  6806312     0 1315847 /var/lib/sss/mc/passwd (deleted)
> java      5313  user   11u   REG    8,3    55185     0 2494654 /tmp/classpath.1668Gp (deleted)
> system_ar 5333  user    3r   REG    8,3  6806312     0 1315847 /var/lib/sss/mc/passwd (deleted)
> java      5421  user    3r   REG    8,3  6806312     0 1315847 /var/lib/sss/mc/passwd (deleted)
> java      5421  user   11u   REG    8,3   149313     0 2494486 /tmp/java.fzTwWp (deleted)
> java      5421 tsdist  12u   REG    8,3    55185     0 2500513 /tmp/classpath.7AmxHO (deleted)
>
> `du` can only account for 16GB of file usage:
>
> $ sudo du -hxs /
> 16G     /
>
> But what is most puzzling is the numbers reported by e2freefrag, which don't add up:
>
> $ sudo e2freefrag /dev/disk/by-uuid/
> Device: /dev/disk/by-uuid/
> Blocksize: 4096 bytes
> Total blocks: 13107200
> Free blocks: 7778076 (59.3%)
>
> Min. free extent: 4 KB
> Max. free extent: 8876 KB
> Avg. free extent: 224 KB
> Num. free extent: 6098
>
> HISTOGRAM OF FREE EXTENT SIZES:
> Extent Size Range :  Free extents   Free Blocks  Percent
>     4K...    8K-  :          1205          1205    0.02%
>     8K...   16K-  :           980          2265    0.03%
>    16K...   32K-  :           653          3419    0.04%
>    32K...   64K-  :          1337         15374    0.20%
>    64K...  128K-  :           631         14151    0.18%
>   128K...  256K-  :           224         10205    0.13%
>   256K...  512K-  :           261         23818    0.31%
>   512K... 1024K-  :           303         56801    0.73%
>     1M...    2M-  :           387        135907    1.75%
>     2M...    4M-  :           103         64740    0.83%
>     4M...    8M-  :            12         15005    0.19%
>     8M...   16M-  :             2          4267    0.05%
>
> This looks like a bug to me; the histogram in the manpage example has percentages that add up to 100%, but this doesn't even add up to 5%.
>
> After a reboot, `df` reflects real utilization:
>
> $ df -h /
> Filesystem            Size  Used Avail Use% Mounted on
> /dev/disk/by-uuid/     50G   16G   31G  34% /
>
> We are using overlay2fs for Docker, as well as rbd mounts; I'm not sure how they might interact.
>
> Thanks for your help,
>
> --
> Elana Hashman
> ehashman@twosigma.com
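As a rough cross-check of the histogram above, summing the Free Blocks column quoted in the report:

$ echo $(( 1205 + 2265 + 3419 + 15374 + 14151 + 10205 + 23818 + 56801 + 135907 + 64740 + 15005 + 4267 ))
347157

That is 347,157 blocks of 4 KiB, roughly 1.3 GiB, or about 4.5% of the 7,778,076 free blocks the e2freefrag header reports, consistent with the "doesn't even add up to 5%" observation.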