2015-09-22 19:13:09

by Pocas, Jamie

Subject: resize2fs stuck in ext4_group_extend with 100% CPU Utilization With Small Volumes

Hi,

I apologize in advance if this is a well-known issue but I don't see it as an open bug in sourceforge.net. I'm not able to open a bug there without permission, so I am writing you here.

I have a very reproducible spin in resize2fs (x86_64) on both CentOS 6 latest rpms and CentOS 7. It will peg one core at 100%. This happens with both e2fsprogs version 1.41.12 on CentOS 6 w/ latest 2.6.32 kernel rpm installed and e2fsprogs version 1.42.9 on CentOS 7 with latest 3.10 kernel rpm installed. The key to reproducing this seems to be when creating small filesystems. For example if I create an ext4 filesystem on a 100MiB disk (or file), and then increase the size of the underlying disk (or file) to say 1GiB, it will spin and consume 100% CPU and not finish even after hours (it should take a few seconds).

Here are the flags used when creating the fs.

mkfs.ext4 -O uninit_bg -E nodiscard,lazy_itable_init=1 -F 0 /dev/sdz

Some of these may not be necessary anymore but were very experimental when I first started testing on CentOS 5 way back. I think all of these options except "nodiscard" are the defaults now anyway. I only use the option because in the application I am using this for, it doesn't make sense to discard the existing devices which are initially zeroed anyway. I suppose with volumes this small it doesn't take much extra time anyway, but I don't want to go down that rat hole. I am not doing anything custom with the number of inodes, smaller blocksize (1k), etc... just what you see above. So it's taking the default settings for those, which maybe are bogus and broken for small volumes nowadays. I don't know.

Here is the stack...

[root@localhost ~]# cat /proc/8403/stack
[<ffffffff8106ee1a>] __cond_resched+0x2a/0x40
[<ffffffff8112860b>] find_lock_page+0x3b/0x80
[<ffffffff8112874f>] find_or_create_page+0x3f/0xb0
[<ffffffff811c8540>] __getblk+0xf0/0x2a0
[<ffffffff811c9ad3>] __bread+0x13/0xb0
[<ffffffffa056098c>] ext4_group_extend+0xfc/0x410 [ext4]
[<ffffffffa05498a0>] ext4_ioctl+0x660/0x920 [ext4]
[<ffffffff811a7372>] vfs_ioctl+0x22/0xa0
[<ffffffff811a7514>] do_vfs_ioctl+0x84/0x580
[<ffffffff811a7a91>] sys_ioctl+0x81/0xa0
[<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

It seems to be sleeping, waiting for a free page, and then sleeping again in the kernel. I don't get ANY output after the version heading prints out, even with the -d debug flags turned up all the way. It's really getting stuck very early on with no I/O going to the disk during this CPU spinning. I don't see anything in the dmesg related to this activity either.

I haven't finished binary searching for the specific boundary where the problem occurs, but I initially noticed that 1GiB and larger always worked and took only a few seconds. Then I stepped down to 500MiB and it hung in the same way. Then stepped up to 750MiB and it works normally. So there is some kind of boundary between 500-750MiB that I haven't found yet.

I understand that these are really small filesystems nowadays other than something that might fit on a CD, but I'm hoping that it's something simple that could probably be fixed easily. I suspect that due to the disk size, there are probably bad or unusual defaults being selected, or there is a structure that is being undersized, or the filesystem dimensions are so unexpected that the conditions the code is waiting for are invalid and will never be satisfied. On that note I am wondering, with disks this small, whether it is relying on the antiquated geometry reporting from the device, because I know that with small virtual disks like these there can sometimes be problems trying to accurately emulate a fake C/H/S geometry, and sometimes rounding down is necessary. I wonder if a mismatch could cause this. I don't want to steer anyone off into the weeds though.

I haven't dug into the code much yet, but I was wondering if anyone had any ideas what could be going on. I think at the very least this is a bug in the ext4 resize code in the kernel itself, because even if the resize2fs program is giving bad parameters, I would not expect user space to be able to trigger this kind of hang.

Regards,
Jamie



2015-09-22 19:33:19

by Eric Sandeen

Subject: Re: resize2fs stuck in ext4_group_extend with 100% CPU Utilization With Small Volumes

On 9/22/15 2:12 PM, Pocas, Jamie wrote:
> Hi,
>
> I apologize in advance if this is a well-known issue but I don't see
> it as an open bug in sourceforge.net. I'm not able to open a bug
> there without permission, so I am writing you here.

the centos bug tracker may be the right place for your distro...

> I have a very reproducible spin in resize2fs (x86_64) on both CentOS
> 6 latest rpms and CentOS 7. It will peg one core at 100%. This
> happens with both e2fsprogs version 1.41.12 on CentOS 6 w/ latest
> 2.6.32 kernel rpm installed and e2fsprogs version 1.42.9 on CentOS 7
> with latest 3.10 kernel rpm installed. The key to reproducing this
> seems to be when creating small filesystems. For example if I create
> an ext4 filesystem on a 100MiB disk (or file), and then increase the
> size of the underlying disk (or file) to say 1GiB, it will spin and
> consume 100% CPU and not finish even after hours (it should take a
> few seconds).
>
> Here are the flags used when creating the fs.
>
> mkfs.ext4 -O uninit_bg -E nodiscard,lazy_itable_init=1 -F 0 /dev/sdz

AFAIK -F doesn't take an argument, is that 0 supposed to be there?

but if I test this:

# truncate --size=100m testfile
# mkfs.ext4 -O uninit_bg -E nodiscard,lazy_itable_init=1 -F testfile
# truncate --size=1g testfile
# mount -o loop testfile mnt
# resize2fs /dev/loop0

that works fine on my rhel7 box, with kernel-3.10.0-229.el7 and
e2fsprogs-1.42.9-7.el7

Do those same steps fail for you?

-Eric

> Some of these may not be necessary anymore but were very experimental
> when I first started testing on CentOS 5 way back. I think all of
> these options except "nodiscard" are the defaults now anyway. I only
> use the option because in the application I am using this for, it
> doesn't make sense to discard the existing devices which are
> initially zeroed anyway. I suppose with volumes this small it doesn't
> take much extra time anyway, but I don't want to go down that rat
> hole. I am not doing anything custom with the number of inodes,
> smaller blocksize (1k), etc... just what you see above. So it's
> taking the default settings for those, which maybe are bogus and
> broken for small volumes nowadays. I don't know.
>
> Here is the stack...
>
> [root@localhost ~]# cat /proc/8403/stack
> [<ffffffff8106ee1a>] __cond_resched+0x2a/0x40
> [<ffffffff8112860b>] find_lock_page+0x3b/0x80
> [<ffffffff8112874f>] find_or_create_page+0x3f/0xb0
> [<ffffffff811c8540>] __getblk+0xf0/0x2a0
> [<ffffffff811c9ad3>] __bread+0x13/0xb0
> [<ffffffffa056098c>] ext4_group_extend+0xfc/0x410 [ext4]
> [<ffffffffa05498a0>] ext4_ioctl+0x660/0x920 [ext4]
> [<ffffffff811a7372>] vfs_ioctl+0x22/0xa0
> [<ffffffff811a7514>] do_vfs_ioctl+0x84/0x580
> [<ffffffff811a7a91>] sys_ioctl+0x81/0xa0
> [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> It seems to be sleeping, waiting for a free page, and then sleeping
> again in the kernel. I don't get ANY output after the version heading
> prints out, even with the -d debug flags turned up all the way. It's
> really getting stuck very early on with no I/O going to the disk
> during this CPU spinning. I don't see anything in the dmesg related
> to this activity either.
>
> I haven't finished binary searching for the specific boundary where
> the problem occurs, but I initially noticed that 1GiB and larger
> always worked and took only a few seconds. Then I stepped down to
> 500MiB and it hung in the same way. Then stepped up to 750MiB and it
> works normally. So there is some kind of boundary between 500-750MiB
> that I haven't found yet.
>
> I understand that these are really small filesystems nowadays other
> than something that might fit on a CD, but I'm hoping that it's
> something simple that could probably be fixed easily. I suspect that
> due to the disk size, there are probably bad or unusual defaults
> being selected, or there is a structure that is being undersized, or
> with unexpected filesystem dimensions such that the conditions it's
> expecting are invalid and will never be satisfied. On that note I am
> wondering with disks this small if it is relying on the antiquated
> geometry reporting from the device because I know that sometimes with
> small virtual disks like there, there can sometimes be problems
> trying to accurately emulate a fake C/H/S geometry with disks this
> small and sometimes rounding down is necessary. I wonder if a
> mismatch could cause this. I don't want to steer anyone off into the
> weeds though.
>
> I haven't dug into the code much yet, but I was wondering if anyone
> had any ideas what could be going on. I think at the very least this
> is a bug in the resize code in the ext4 code in the kernel itself
> because even if the resize2fs program is giving bad parameters, I
> would not expect this type of hang to be able to be initiated from
> user space.
>
> Regards,
> Jamie
>
>


2015-09-22 20:21:01

by Theodore Ts'o

Subject: Re: resize2fs stuck in ext4_group_extend with 100% CPU Utilization With Small Volumes

On Tue, Sep 22, 2015 at 03:12:53PM -0400, Pocas, Jamie wrote:
>
> I have a very reproducible spin in resize2fs (x86_64) on both CentOS
> 6 latest rpms and CentOS 7. It will peg one core at 100%. This
> happens with both e2fsprogs version 1.41.12 on CentOS 6 w/ latest
> 2.6.32 kernel rpm installed and e2fsprogs version 1.42.9 on CentOS 7
> with latest 3.10 kernel rpm installed. The key to reproducing this
> seems to be when creating small filesystems. For example if I create
> an ext4 filesystem on a 100MiB disk (or file), and then increase the
> size of the underlying disk (or file) to say 1GiB, it will spin and
> consume 100% CPU and not finish even after hours (it should take a
> few seconds).

I can't reproduce the problem using a 3.10.88 kernel using e2fsprogs
1.42.12-1.1 as shipped with Debian x86_64 jessie 8.2 release image.
(As found on Google Compute Engine, but it should be the same no
matter what you're using.)

I've attached the repro script I'm using.

The kernel config I'm using is here:

https://git.kernel.org/cgit/fs/ext2/xfstests-bld.git/tree/kernel-configs/ext4-x86_64-config-3.10


I also tried reproducing it on CentOS 6.7 as shipped by Google Compute
Engine:

[root@centos-test tytso]# cat /etc/centos-release
CentOS release 6.7 (Final)
[root@centos-test tytso]# uname -a
Linux centos-test 2.6.32-573.3.1.el6.x86_64 #1 SMP Thu Aug 13 22:55:16 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
[root@centos-test tytso]# rpm -q e2fsprogs
e2fsprogs-1.41.12-22.el6.x86_64

And I can't reproduce it there either.

Can you take a look at my repro script and see if it fails for you?
And if it doesn't, can you adjust it until it does reproduce for you?

Thanks,

- Ted

#!/bin/bash

FS=/tmp/foo.img

cp /dev/null $FS
mke2fs -t ext4 -O uninit_bg -E nodiscard,lazy_itable_init=1 -Fq $FS 100M
truncate -s 1G $FS

DEV=$(losetup -j $FS | awk -F: '{print $1}')
if test -z "$DEV"
then
losetup -f $FS
DEV=$(losetup -j $FS | awk -F: '{print $1}')
fi
if test -z "$DEV"
then
echo "Can't create loop device for $FS"
else
echo "Using loop device $DEV"
CLEANUP_LOOP=yes
fi

e2fsck -p $DEV
mkdir /tmp/mnt$$
mount $DEV /tmp/mnt$$
resize2fs -p $DEV 1G
umount /tmp/mnt$$
e2fsck -fy $DEV

if test "$CLEANUP_LOOP" = "yes"
then
losetup -d $DEV
fi
rmdir /tmp/mnt$$



2015-09-22 20:28:54

by Pocas, Jamie

Subject: RE: resize2fs stuck in ext4_group_extend with 100% CPU Utilization With Small Volumes

Thanks for the prompt reply. Yes the "0" in mkfs was an accidental copy and paste. It's not supposed to be there.

Your sequence works, but it's a tad bit more synthetic than what's really happening in my case. In your example, the backing store (testfile in this case) is being resized using truncate before the contained filesystem is mounted. In my case the underlying device is being grown while the filesystem is mounted. If I do the following instead, which is more analogous to the way that the underlying device is resized at runtime, it reproduces the 100% consumption.

$ truncate --size=100M testfile
# mkfs.ext4 -O uninit_bg -E nodiscard,lazy_itable_init=1 -F testfile
mke2fs 1.42.9 (28-Dec-2013)
Filesystem label=
OS type: Linux
Block size=1024 (log=0)
Fragment size=1024 (log=0)
Stride=0 blocks, Stripe width=0 blocks
25688 inodes, 102400 blocks
5120 blocks (5.00%) reserved for the super user
First data block=1
Maximum filesystem blocks=33685504
13 block groups
8192 blocks per group, 8192 fragments per group
1976 inodes per group
Superblock backups stored on blocks:
8193, 24577, 40961, 57345, 73729

Allocating group tables: done
Writing inode tables: done
Creating journal (4096 blocks): done
Writing superblocks and filesystem accounting information: done

# mount -o loop testfile mnt
# truncate --size=1G testfile
# losetup -c /dev/loop0 ## Cause loop device to reread size of backing file while still online
# resize2fs /dev/loop0
resize2fs 1.42.9 (28-Dec-2013)
Filesystem at /dev/loop0 is mounted on /home/jpocas/source/hulk.1/mnt; on-line resizing required
old_desc_blocks = 1, new_desc_blocks = 8

##... it's hung here spinning at 100%, at least I got SOME output though.
## From another shell I can see the following

# top | head
top - 16:22:53 up 6:02, 6 users, load average: 1.05, 0.80, 0.40
Tasks: 518 total, 2 running, 516 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.6 us, 0.7 sy, 0.0 ni, 98.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 5933160 total, 1864476 free, 1196476 used, 2872208 buff/cache
KiB Swap: 3670012 total, 3670012 free, 0 used. 4403764 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13664 root 20 0 116548 1032 864 R 100.0 0.0 5:54.61 resize2fs
2214 root 20 0 264300 72876 8756 S 6.2 1.2 2:19.58 Xorg
3892 jpocas 20 0 432920 7884 6052 S 6.2 0.1 0:56.68 ibus-x11
#
## BTW, I am not sure why the heading only shows the 1.42.9 on CentOS but I surely have the 1.42.9-7 rpm installed.
# rpm -q e2fsprogs
e2fsprogs-1.42.9-7.el7.x86_64
#

-----Original Message-----
From: Eric Sandeen [mailto:[email protected]]
Sent: Tuesday, September 22, 2015 3:33 PM
To: Pocas, Jamie; [email protected]
Subject: Re: resize2fs stuck in ext4_group_extend with 100% CPU Utilization With Small Volumes

On 9/22/15 2:12 PM, Pocas, Jamie wrote:
> Hi,
>
> I apologize in advance if this is a well-known issue but I don't see
> it as an open bug in sourceforge.net. I'm not able to open a bug there
> without permission, so I am writing you here.

the centos bug tracker may be the right place for your distro...

> I have a very reproducible spin in resize2fs (x86_64) on both CentOS
> 6 latest rpms and CentOS 7. It will peg one core at 100%. This happens
> with both e2fsprogs version 1.41.12 on CentOS 6 w/ latest
> 2.6.32 kernel rpm installed and e2fsprogs version 1.42.9 on CentOS 7
> with latest 3.10 kernel rpm installed. The key to reproducing this
> seems to be when creating small filesystems. For example if I create
> an ext4 filesystem on a 100MiB disk (or file), and then increase the
> size of the underlying disk (or file) to say 1GiB, it will spin and
> consume 100% CPU and not finish even after hours (it should take a few
> seconds).
>
> Here are the flags used when creating the fs.
>
> mkfs.ext4 -O uninit_bg -E nodiscard,lazy_itable_init=1 -F 0 /dev/sdz

AFAIK -F doesn't take an argument, is that 0 supposed to be there?

but if I test this:

# truncate --size=100m testfile
# mkfs.ext4 -O uninit_bg -E nodiscard,lazy_itable_init=1 -F testfile
# truncate --size=1g testfile
# mount -o loop testfile mnt
#resize2fs /dev/loop0

that works fine on my rhel7 box, with kernel-3.10.0-229.el7 and
e2fsprogs-1.42.9-7.el7

Do those same steps fail for you?

-Eric

> Some of these may not be necessary anymore but were very experimental
> when I first started testing on CentOS 5 way back. I think all of
> these options except "nodiscard" are the defaults now anyway. I only
> use the option because in the application I am using this for, it
> doesn't make sense to discard the existing devices which are initially
> zeroed anyway. I suppose with volumes this small it doesn't take much
> extra time anyway, but I don't want to go down that rat hole. I am not
> doing anything custom with the number of inodes, smaller blocksize
> (1k), etc... just what you see above. So it's taking the default
> settings for those, which maybe are bogus and broken for small volumes
> nowadays. I don't know.
>
> Here is the stack...
>
> [root@localhost ~]# cat /proc/8403/stack
> [<ffffffff8106ee1a>] __cond_resched+0x2a/0x40
> [<ffffffff8112860b>] find_lock_page+0x3b/0x80
> [<ffffffff8112874f>] find_or_create_page+0x3f/0xb0
> [<ffffffff811c8540>] __getblk+0xf0/0x2a0
> [<ffffffff811c9ad3>] __bread+0x13/0xb0
> [<ffffffffa056098c>] ext4_group_extend+0xfc/0x410 [ext4]
> [<ffffffffa05498a0>] ext4_ioctl+0x660/0x920 [ext4]
> [<ffffffff811a7372>] vfs_ioctl+0x22/0xa0
> [<ffffffff811a7514>] do_vfs_ioctl+0x84/0x580
> [<ffffffff811a7a91>] sys_ioctl+0x81/0xa0
> [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> It seems to be sleeping, waiting for a free page, and then sleeping
> again in the kernel. I don't get ANY output after the version heading
> prints out, even with the -d debug flags turned up all the way. It's
> really getting stuck very early on with no I/O going to the disk
> during this CPU spinning. I don't see anything in the dmesg related to
> this activity either.
>
> I haven't finished binary searching for the specific boundary where
> the problem occurs, but I initially noticed that 1GiB and larger
> always worked and took only a few seconds. Then I stepped down to
> 500MiB and it hung in the same way. Then stepped up to 750MiB and it
> works normally. So there is some kind of boundary between 500-750MiB
> that I haven't found yet.
>
> I understand that these are really small filesystems nowadays other
> than something that might fit on a CD, but I'm hoping that it's
> something simple that could probably be fixed easily. I suspect that
> due to the disk size, there are probably bad or unusual defaults being
> selected, or there is a structure that is being undersized, or with
> unexpected filesystem dimensions such that the conditions it's
> expecting are invalid and will never be satisfied. On that note I am
> wondering with disks this small if it is relying on the antiquated
> geometry reporting from the device because I know that sometimes with
> small virtual disks like there, there can sometimes be problems trying
> to accurately emulate a fake C/H/S geometry with disks this small and
> sometimes rounding down is necessary. I wonder if a mismatch could
> cause this. I don't want to steer anyone off into the weeds though.
>
> I haven't dug into the code much yet, but I was wondering if anyone
> had any ideas what could be going on. I think at the very least this
> is a bug in the resize code in the ext4 code in the kernel itself
> because even if the resize2fs program is giving bad parameters, I
> would not expect this type of hang to be able to be initiated from
> user space.
>
> Regards,
> Jamie
>
>


2015-09-22 21:27:13

by Pocas, Jamie

Subject: RE: resize2fs stuck in ext4_group_extend with 100% CPU Utilization With Small Volumes

Hi Theodore,

I am not sure if you had a chance to see my reply to Eric yet. I can see you are using the same general approach that Eric was using. The key difference from what I am doing again seems to be that I am resizing the underlying disk *while the filesystem is mounted*. Instead you both are using truncate to grow the disk while the filesystem is not currently mounted, and then mounting it. So maybe there is some fundamental cleanup or fixup that happens during the subsequent mount that doesn't happen if you grow the disk while the filesystem is already online. With the test example, you can do this using 'losetup -c' to force a reread of the size of the underlying file. I can understand why a disk should not shrink while the filesystem is mounted, but in my case I am growing it so the existing FS structure should be unharmed.

Your script works -- caveat I had to fix some line wrap issues probably due to my email client, but it was pretty clear what your intention was.
Here's my modification to your script that reproduces the issue.

#!/bin/bash

FS=/tmp/foo.img

cp /dev/null $FS
mke2fs -t ext4 -O uninit_bg -E nodiscard,lazy_itable_init=1 -Fq $FS 100M

DEV=$(losetup -j $FS | awk -F: '{print $1}')
if test -z "$DEV"
then
losetup -f $FS
DEV=$(losetup -j $FS | awk -F: '{print $1}')
fi

if test -z "$DEV"
then
echo "Can't create loop device for $FS"
else
echo "Using loop device $DEV"
CLEANUP_LOOP=yes
fi

#e2fsck -p $DEV # Not sure if this needs to be commented out. I will have to reboot to find out though.
mkdir /tmp/mnt$$
mount $DEV /tmp/mnt$$
# Grow the backing file *AFTER* we are mounted
truncate -s 1G $FS
# Tell loopback device to rescan the size
losetup -c $DEV
resize2fs -p $DEV 1G
umount /tmp/mnt$$
e2fsck -fy $DEV

if test "$CLEANUP_LOOP" = "yes"
then
losetup -d $DEV
fi
rmdir /tmp/mnt$$

## END OF SCRIPT

Execution looks like this

$ sudo ./repro.sh
[sudo] password for jpocas:
Using loop device /dev/loop0
resize2fs 1.42.9 (28-Dec-2013)
Filesystem at /dev/loop0 is mounted on /tmp/mnt5715; on-line resizing required
old_desc_blocks = 1, new_desc_blocks = 8
## SPINNING 100% CPU!

-----Original Message-----
From: Theodore Ts'o [mailto:[email protected]]
Sent: Tuesday, September 22, 2015 4:21 PM
To: Pocas, Jamie
Cc: [email protected]
Subject: Re: resize2fs stuck in ext4_group_extend with 100% CPU Utilization With Small Volumes

On Tue, Sep 22, 2015 at 03:12:53PM -0400, Pocas, Jamie wrote:
>
> I have a very reproducible spin in resize2fs (x86_64) on both CentOS
> 6 latest rpms and CentOS 7. It will peg one core at 100%. This happens
> with both e2fsprogs version 1.41.12 on CentOS 6 w/ latest
> 2.6.32 kernel rpm installed and e2fsprogs version 1.42.9 on CentOS 7
> with latest 3.10 kernel rpm installed. The key to reproducing this
> seems to be when creating small filesystems. For example if I create
> an ext4 filesystem on a 100MiB disk (or file), and then increase the
> size of the underlying disk (or file) to say 1GiB, it will spin and
> consume 100% CPU and not finish even after hours (it should take a few
> seconds).

I can't reproduce the problem using a 3.10.88 kernel using e2fsprogs
1.42.12-1.1 as shipped with Debian x86_64 jessie 8.2 release image.
(As found on Google Compute Engine, but it should be the same no matter what you're using.)

I've attached the repro script I'm using.

The kernel config I'm using is here:

https://git.kernel.org/cgit/fs/ext2/xfstests-bld.git/tree/kernel-configs/ext4-x86_64-config-3.10


I also tried reproducing it on CentOS 6.7 as shipped by Google Compute
Engine:

[root@centos-test tytso]# cat /etc/centos-release
CentOS release 6.7 (Final)
[root@centos-test tytso]# uname -a
Linux centos-test 2.6.32-573.3.1.el6.x86_64 #1 SMP Thu Aug 13 22:55:16 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
[root@centos-test tytso]# rpm -q e2fsprogs
e2fsprogs-1.41.12-22.el6.x86_64

And I can't reproduce it there either.

Can you take a look at my repro script and see if it fails for you?
And if it doesn't, can you adjust it until it does reproduce for you?

Thanks,

- Ted

#!/bin/bash

FS=/tmp/foo.img

cp /dev/null $FS
mke2fs -t ext4 -O uninit_bg -E nodiscard,lazy_itable_init=1 -Fq $FS 100M
truncate -s 1G $FS

DEV=$(losetup -j $FS | awk -F: '{print $1}')
if test -z "$DEV"
then
losetup -f $FS
DEV=$(losetup -j $FS | awk -F: '{print $1}')
fi
if test -z "$DEV"
then
echo "Can't create loop device for $FS"
else
echo "Using loop device $DEV"
CLEANUP_LOOP=yes
fi

e2fsck -p $DEV
mkdir /tmp/mnt$$
mount $DEV /tmp/mnt$$
resize2fs -p $DEV 1G
umount /tmp/mnt$$
e2fsck -fy $DEV

if test "$CLEANUP_LOOP" = "yes"
then
losetup -d $DEV
fi
rmdir /tmp/mnt$$



2015-09-22 23:02:09

by Theodore Ts'o

Subject: Re: resize2fs stuck in ext4_group_extend with 100% CPU Utilization With Small Volumes

On Tue, Sep 22, 2015 at 04:28:39PM -0400, Pocas, Jamie wrote:
> # mount -o loop testfile mnt
> # truncate --size=1G testfile
> # losetup -c /dev/loop0 ## Cause loop device to reread size of backing file while still online
> # resize2fs /dev/loop0

It looks like the problem is with the loopback driver, and I can
reproduce the problem using 4.3-rc2.

If you don't do *either* the truncate or the resize2fs command in the
above sequence, and then do a "touch mnt/foo ; sync", the sync command
will hang.

The problem is the losetup -c command, which calls the
LOOP_SET_CAPACITY ioctl. The problem is that this causes
bd_set_size() to be called, which has the side effect of forcing the
block size of /dev/loop0 to 4096 --- which is a problem if the file
system is using a 1k block size, and so the block size was properly
set to 1024. This is subsequently causing the buffer cache operations
to hang.

So this will cause a hang:

cp /dev/null /tmp/foo.img
mke2fs -t ext4 /tmp/foo.img 100M
mount -o loop /tmp/foo.img /mnt
losetup -c /dev/loop0
touch /mnt/foo
sync

This will not hang:

cp /dev/null /tmp/foo.img
mke2fs -t ext4 -b 4096 /tmp/foo.img 100M
mount -o loop /tmp/foo.img /mnt
losetup -c /dev/loop0
touch /mnt/foo
sync

And this also explains why you weren't seeing the problem with small
file systems. By default mke2fs uses a block size of 1k for file
systems smaller than 512 MB. This is largely for historical reasons
since there was a time when we worried about optimizing the storage of
every single byte of your 80MB disk (which was all you had on your 40
MHz 80386 :-).

With larger file systems, the block size defaults to 4096, so we don't
run into problems when losetup -c attempts to set the block size ---
which is something that is *not* supposed to change if the block
device is currently mounted. So for example, if you try to run the
command "blockdev --setbsz", it will fail with an EBUSY if the block
device is currently mounted.

So the workaround is to just create the file system with "-b 4096"
when you call mkfs.ext4. This is a good idea if you intend to grow
the file system, since it is far more efficient to use a 4k block
size.

The proper fix in the kernel is to have the loop device check to see
if the block device is currently mounted. If it is, then it needs to
avoid changing the block size (which probably means it will need to
call a modified version of bd_set_size), and the capacity of the block
device needs to be rounded down to the current block size.

(Currently if you set the capacity of the block device to be say, 1MB
plus 2k, and the current block size is 4k, it will change the block
size of the device to be 2k, so that the entire block device is
addressable. If the block device is mounted and the block size is fixed
to 4k, then it must not change the block size --- either up or down.
Instead, it must keep the block size at 4k, and only allow the
capacity to be set to 1MB.)
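
To make the rounding concrete, here is a small userspace sketch (not from
the thread itself) that paraphrases the block-size selection loop in a
3.10-era bd_set_size() from fs/block_dev.c; the 512-byte logical sector
size and 4 KiB page size below are assumptions for the example:

#include <stdio.h>

/*
 * Paraphrase of the loop in a 3.10-era bd_set_size(): starting from the
 * logical sector size, keep doubling while the new device size is still
 * a multiple of the doubled value, capped at the page size.  The result
 * becomes the bdev's "soft" block size, regardless of any block size a
 * mounted file system had already set.
 */
static unsigned pick_soft_block_size(unsigned long long size,
                                     unsigned logical_sector,
                                     unsigned page_size)
{
        unsigned bsize = logical_sector;

        while (bsize < page_size) {
                if (size & bsize)       /* size is not a multiple of 2*bsize */
                        break;
                bsize <<= 1;
        }
        return bsize;
}

int main(void)
{
        /* Assumed: 512-byte logical sectors, 4 KiB pages. */
        printf("1 GiB         -> %u\n",
               pick_soft_block_size(1ULL << 30, 512, 4096));           /* 4096 */
        printf("1 MiB + 2 KiB -> %u\n",
               pick_soft_block_size((1ULL << 20) + 2048, 512, 4096));  /* 2048 */
        return 0;
}

A 1 GiB capacity lands on 4096, which is why LOOP_SET_CAPACITY bumps the
soft block size of a mounted 1k-block ext4 device; the 1 MiB + 2 KiB case
matches the 2k example in the paragraph above.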

Cheers,

- Ted

2015-09-22 23:41:16

by Eric Sandeen

Subject: Re: resize2fs stuck in ext4_group_extend with 100% CPU Utilization With Small Volumes

On 9/22/15 4:26 PM, Pocas, Jamie wrote:
> Hi Theodore,
>
> I am not sure if you had a chance to see my reply to Eric yet. I can
> see you are using the same general approach that Eric was using. The
> key difference from what I am doing again seems to be that I am
> resizing the underlying disk *while the filesystem is mounted*.

Do you see the same problem if you resize a physical disk, not
just with loopback? Sounds like it...

In theory it should be reproducible w/ lvm too, then, I think,
unless there's some issue specific to your block device similar to
what's happening on the loop device.

> Instead you both are using truncate to grow the disk while the
> filesystem is not currently mounted, and then mounting it.

Always worth communicating a testcase in the first email, if you
have one, so we don't have to guess. ;)

thanks,
-Eric


2015-09-23 03:40:36

by Pocas, Jamie

Subject: RE: resize2fs stuck in ext4_group_extend with 100% CPU Utilization With Small Volumes

Yes I am seeing the same problem with physical disks and other types of virtualized disks (e.g. VMware can resize vmdk virtual disks online). Sorry if the initial ambiguity wasted some time. I was trying to come up with the smallest most isolated example that reproduced the issue so I went with the loopback approach since it doesn't have a lot of moving parts or external dependencies and it's easy to make arbitrary sized devices including these small ones. I could care less if loopback didn't work for my intended use but I am happy it is useful in reproducing the issue. Honestly, for my application it's easy to work around by just not allowing devices that small that we will never encounter anyway but I thought I would do my due diligence and report what I think is a bug :)

________________________________________
From: Eric Sandeen [[email protected]]
Sent: Tuesday, September 22, 2015 7:41 PM
To: Pocas, Jamie; Theodore Ts'o
Cc: [email protected]
Subject: Re: resize2fs stuck in ext4_group_extend with 100% CPU Utilization With Small Volumes

On 9/22/15 4:26 PM, Pocas, Jamie wrote:
> Hi Theodore,
>
> I am not sure if you had a chance to see my reply to Eric yet. I can
> see you are using the same general approach that Eric was using. The
> key difference from what I am doing again seems to be that I am
> resizing the underlying disk *while the filesystem is mounted*.

Do you see the same problem if you resize a physical disk, not
just with loopback? Sounds like it...

In theory it should be reproducible w/ lvm too, then, I think,
unless there's some issue specific to your block device similar to
what's happening on the loop device.

> Instead you both are using truncate to grow the disk while the
> filesystem is not currently mounted, and then mounting it.

Always worth communicating a testcase in the first email, if you
have one, so we don't have to guess. ;)

thanks,
-Eric


2015-09-23 04:20:45

by Pocas, Jamie

Subject: RE: resize2fs stuck in ext4_group_extend with 100% CPU Utilization With Small Volumes

Ted, just to add another data point, with some minor adjustments to the script to use xfs instead, such as using "mkfs.xfs -b size=1024" to force 1k blocks, I cannot reproduce the issue and the data block size doesn't change from 1k. This is still using loopback so I am a bit skeptical that the blame is due to the use of a loopback device or filesystems with an initial 1k fs block size. I can see this on other virtualized disks that can be resized online such as VMware virtual disks and remote iSCSI targets. I haven't tried LVM but I suspect that would be another good test. Suffer this small analogy for me and let me know where I am wrong: say hypothetically I expand a small partition (or LVM for that matter). Then I try to use resize2fs to grow the ext filesystem on it. I expect that this
should *not* change the block size of the underlying device (of course not!) nor the filesystem's block size. Is that a correct assumption? I can see that it doesn't change the block size with xfs, nor the underlying device queue parameters for /dev/loop0 either (under /sys/block/loop0/queue).

This use of a relatively tiny volume is not a normal use case for my application so I want to express that this is not a super urgent issue for me to resolve right away. For my purposes I can just disallow using devices that are that small. They are really impractical anyway and this just came up in testing. I just wanted to do my duty and report what I think is a legitimate issue, and maybe validate someone else's frustration if they are having this issue, however small of an edge case this might turn out to be :). I also wasn't sure if it was indicative of a bug on a boundary condition that might happen with other potentially incompatible combinations of mkfs/mount parameters or sizes of volumes that are not validated before use. That would be more serious. I deal more with the block storage itself and so I admit I am not an ext4 expert, hence the possibly bad analogy earlier :). I am willing to take a deeper look into the code and see if I can figure out a patch when I get some more time but I was just picking your brain in case it was something really obvious.

-Jamie


-----Original Message-----
From: Theodore Ts'o [mailto:[email protected]]
Sent: Tuesday, September 22, 2015 7:02 PM
To: Pocas, Jamie
Cc: Eric Sandeen; [email protected]
Subject: Re: resize2fs stuck in ext4_group_extend with 100% CPU Utilization With Small Volumes

On Tue, Sep 22, 2015 at 04:28:39PM -0400, Pocas, Jamie wrote:
> # mount -o loop testfile mnt
> # truncate --size=1G testfile
> # losetup -c /dev/loop0 ## Cause loop device to reread size of backing file while still online
> # resize2fs /dev/loop0

It looks like the problem is with the loopback driver, and I can reproduce the problem using 4.3-rc2.

If you don't do *either* the truncate or the resize2fs command in the above sequence, and then do a "touch mnt/foo ; sync", the sync command will hang.

The problem is the losetup -c command, which calls the LOOP_SET_CAPACITY ioctl. The problem is that this causes
bd_set_size() to be called, which has the side effect of forcing the block size of /dev/loop0 to 4096 --- which is a problem if the file system is using a 1k block size, and so the block size was properly set to 1024. This is subsequently causing the buffer cache operations to hang.

So this will cause a hang:

cp /dev/null /tmp/foo.img
mke2fs -t ext4 /tmp/foo.img 100M
mount -o loop /tmp/foo.img /mnt
losetup -c /dev/loop0
touch /mnt/foo
sync

This will not hang:

cp /dev/null /tmp/foo.img
mke2fs -t ext4 -b 4096 /tmp/foo.img 100M
mount -o loop /tmp/foo.img /mnt
losetup -c /dev/loop0
touch /mnt/foo
sync

And this also explains why you weren't seeing the problem with small file systems. By default mke2fs uses a block size of 1k for file systems smaller than 512 MB. This is largely for historical reasons since there was a time when we worried about optimizing the storage of every single byte of your 80MB disk (which was all you had on your 40 MHz 80386 :-).

With larger file systems, the block size defaults to 4096, so we don't run into problems when losetup -c attempts to set the block size --- which is something that is *not* supposed to change if the block device is currently mounted. So for example, if you try to run the command "blockdev --setbsz", it will fail with an EBUSY if the block device is curently mounted.

So the workaround is to just create the file system with "-b 4096"
when you call mkfs.ext4. This is a good idea if you intend to grow the file system, since it is far more efficient to use a 4k block size.

The proper fix in the kernel is to have the loop device check to see if the block device is currently mounted. If it is, then needs to avoid changing the block size (which probably means it will need to call a modified version of bd_set_size), and the capacity of the block device needs to be rounded-down to the current block size.

(Currently if you set the capacity of the block device to be say, 1MB plus 2k, and the current block size is 4k, it will change the block size of the device to be 2k, so that the entire block device is addressable. If the block device is mount and the block size is fixed to 4k, then it must not change the block size --- either up or down.
Instead, it must keep the block size at 4k, and only allow the capacity to be set to 1MB.)

Cheers,

- Ted

2015-09-23 15:14:09

by Theodore Ts'o

Subject: Re: resize2fs stuck in ext4_group_extend with 100% CPU Utilization With Small Volumes

On Wed, Sep 23, 2015 at 12:20:17AM -0400, Pocas, Jamie wrote:
> Ted, just to add another data point, with some minor adjustments to
> the script to use xfs instead, such as using "mkfs.xfs -b size=1024"
> to force 1k blocks, I cannot reproduce the issue and the data block
> size doesn't change from 1k.

Yes, that's not surprising, because XFS doesn't use the buffer cache
layer. Ext4 does, because that's the basis of how the jbd2 layer
works. It does change the block size as reported by the block device
and which is used by the buffer cache layer, though. (Internally,
this is known as the "soft" block size; it's basically the unit in
which data is cached in the buffer cache layer):

root@kvm-xfstests:~# truncate -s 100M /tmp/foo.img
root@kvm-xfstests:~# mkfs.xfs -b size=1024 /tmp/foo.img
meta-data=/tmp/foo.img isize=512 agcount=4, agsize=25600 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=0
data = bsize=1024 blocks=102400, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal log bsize=1024 blocks=2573, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
root@kvm-xfstests:~# mount -o loop /tmp/foo.img /mnt
root@kvm-xfstests:~# blockdev --getbsz /dev/loop0
1024
root@kvm-xfstests:~# losetup -c /dev/loop0
root@kvm-xfstests:~# blockdev --getbsz /dev/loop0
4096 <--------- BUG, note the change in the block size
root@kvm-xfstests:~# touch /mnt/foo
root@kvm-xfstests:~# sync
<------ The reason why we don't hang is that XFS doesn't use the
<------ buffer cache
root@kvm-xfstests:~# umount /mnt



Also feel free to try my repro, but using "blockdev --getbsz
/dev/loop" before and after the losetup -c command, and note that it
does not hang even though there is no resize2fs in the command
sequence at all:

root@kvm-xfstests:~# cp /dev/null /tmp/foo.img
root@kvm-xfstests:~# truncate -s 100M /tmp/foo.img
root@kvm-xfstests:~# mke2fs -t ext4 /tmp/foo.img
mke2fs 1.43-WIP (18-May-2015)
Discarding device blocks: done
Creating filesystem with 102400 1k blocks and 25688 inodes
Filesystem UUID: 27dfdbbe-f3a9-48a7-abe8-5a52798a9849
Superblock backups stored on blocks:
8193, 24577, 40961, 57345, 73729

Allocating group tables: done
Writing inode tables: done
Creating journal (4096 blocks): done
Writing superblocks and filesystem accounting information: done

root@kvm-xfstests:~# mount -o loop /tmp/foo.img /mnt
root@kvm-xfstests:~# blockdev --getbsz /dev/loop0
1024
root@kvm-xfstests:~# losetup -c /dev/loop0
root@kvm-xfstests:~# blockdev --getbsz /dev/loop0
4096 <------------ BUG
root@kvm-xfstests:~# touch /mnt/foo
<------- Should hang here, even though there is no resize2fs command
<------- If it doesn't hang right away, try typing the "sync" command


> Suffer this small analogy
> for me and let me know where I am wrong: say hypothetically I expand
> a small partition (or LVM for that matter). Then I try to use
> resize2fs to grow the ext filesystem on it. I expect that this
> should *not* change the block size of the underlying device (of
> course not!) nor the filesystem's block size.

The source of the confusion is that there are actually four different
concepts of block/sector size (a small example reading these follows
the list):

* The logical block/sector size of the underlying storage device
- Retrieved via "blockdev --getss /dev/sdXX"
- This is the smallest unit that can be sent to the disk from
the Host OS. If the logical sector size is different from
the physical block size, and the write is smaller than the
physical sector size (see below), then the disk will do a
read-modify-write.
- The file system block size MUST be greater than or equal to
the logical sector size.

* The physical block/sector size of the underlying storage device
- Retrieved via "blockdev --getpbsz /dev/sdXX"
- This is the smallest unit that can be physically written to the
storage media.
- The file system block size SHOULD be greater than or equal
to the physical sector size. (To avoid read-modify-write
operations by the hard drive that would be bad for performance.)

* The "soft" block size of the block device.
- Retrieved via "blockdev --getbsz /dev/sdXX"
- This represents the unit of storage used to cache
data in the buffer cache. This only matters if you are
using buffer cache --- for example, if you are doing
buffered I/O to a block device, or if you are using a file
system such as ext4 which is using buffer cache. Since data
is indexed in the buffer cache by the 3-tuple (block device,
block number, block size), Bad Things happen if you try to
change the block size while the file system is mounted.
Normally, the kernel will prevent you from changing the
block size under these circumstances.

* The file system block size.
- Retrieved by some file-system dependent command. For ext4,
this is "dumpe2fs -h".
- Set at format time. For file systems that use the buffer
cache, the file system driver will automatically set the
"soft" block size of the block device when the file system
is mounted.
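
As a quick illustration (not part of the original thread), the first
three of these can be read programmatically with the same ioctls that
blockdev uses; this is a minimal sketch with most error handling
omitted:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>           /* BLKSSZGET, BLKPBSZGET, BLKBSZGET */

int main(int argc, char **argv)
{
        int fd, logical = 0, soft = 0;
        unsigned int physical = 0;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <block-device>\n", argv[0]);
                return 1;
        }
        fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        ioctl(fd, BLKSSZGET, &logical);         /* blockdev --getss   */
        ioctl(fd, BLKPBSZGET, &physical);       /* blockdev --getpbsz */
        ioctl(fd, BLKBSZGET, &soft);            /* blockdev --getbsz  */
        printf("logical=%d physical=%u soft=%d\n", logical, physical, soft);
        /* The file system block size is on-disk metadata, not a block
         * device property; for ext4, read it with "dumpe2fs -h". */
        close(fd);
        return 0;
}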


Speaking of LVM, I can't reproduce the problem using LVM, at least not
with a 4.3-rc2 kernel:

root@kvm-xfstests:~# pvcreate /dev/vdc
Physical volume "/dev/vdc" successfully created
root@kvm-xfstests:~# vgcreate test /dev/vdc
Volume group "test" successfully created
root@kvm-xfstests:~# lvcreate -L 100M -n small /dev/test
Logical volume "small" created
root@kvm-xfstests:~# mkfs.ext4 -Fq /dev/test/small
root@kvm-xfstests:~# mount -o loop /dev/test/small /mnt
root@kvm-xfstests:~# blockdev --getbsz /dev/loop0
1024
root@kvm-xfstests:~# lvresize -L 1G /dev/test/small
Size of logical volume test/small changed from 100.00 MiB (25 extents) to 1.00 GiB (256 extents).
Logical volume small successfully resized
root@kvm-xfstests:~# blockdev --getbsz /dev/loop0
1024 <------ NO BUG, see the block size has not changed
root@kvm-xfstests:~# lvcreate -L 100M -n small /dev/test^C
root@kvm-xfstests:~# touch /mnt/foo ; sync
root@kvm-xfstests:~# resize2fs /dev/test/small
resize2fs 1.43-WIP (18-May-2015)
Filesystem at /dev/test/small is mounted on /mnt; on-line resizing required
old_desc_blocks = 1, new_desc_blocks = 8
The filesystem on /dev/test/small is now 1048576 (1k) blocks long.
<------ Note that resize2fs works just fine!
root@kvm-xfstests:~# touch /mnt/bar ; sync
root@kvm-xfstests:~# umount /mnt
root@kvm-xfstests:~#

You might see if this works on CentOS; but if it doesn't, I'm pretty
convinced this is a bug outside of ext4, and I've already given you a
workaround --- using "-b 4096" on the command line to mkfs.ext4 or
mke2fs.

Alternatively, here's another workaround; you can modify your
/etc/mke2fs.conf so the "small" and "floppy" stanzas read:

[fs_types]
small = {
blocksize = 4096
inode_size = 128
inode_ratio = 4096
}
floppy = {
blocksize = 4096
inode_size = 128
inode_ratio = 8192
}

I'm pretty certain your failures won't reproduce if you either change
how you call mke2fs for small file systems, or change your
/etc/mke2fs.conf file as shown above.

Cheers,

- Ted

2015-09-23 16:05:08

by Pocas, Jamie

Subject: RE: resize2fs stuck in ext4_group_extend with 100% CPU Utilization With Small Volumes

Interesting. Thanks for the detailed break-down! I don't mind the workaround of using 4k "soft" block size on the filesystem, even for smaller filesystems. Now that I understand better, I think you were on target with your earlier explanation of bd_set_size(). So this means it's not an ext4 bug. I think the online resize of loopback device (or any other block device driver) should use something like the code in check_disk_size_change() instead of bd_set_size(). I will have to test this out. Thanks again.

Regards,
- Jamie

-----Original Message-----
From: Theodore Ts'o [mailto:[email protected]]
Sent: Wednesday, September 23, 2015 11:14 AM
To: Pocas, Jamie
Cc: Eric Sandeen; [email protected]
Subject: Re: resize2fs stuck in ext4_group_extend with 100% CPU Utilization With Small Volumes

On Wed, Sep 23, 2015 at 12:20:17AM -0400, Pocas, Jamie wrote:
> Ted, just to add another data point, with some minor adjustments to
> the script to use xfs instead, such as using "mkfs.xfs -b size=1024"
> to force 1k blocks, I cannot reproduce the issue and the data block
> size doesn't change from 1k.

Yes, that's not surprising, because XFS doesn't use the buffer cache layer. Ext4 does, because that's the basis of how the jbd2 layer works. It does change the block size as reported by the block device and which is used by the buffer cache layer, though. (Internally, this is known as the "soft" block size; it's basically the data in which data is cached in the buffer cache layer):

root@kvm-xfstests:~# truncate -s 100M /tmp/foo.img
root@kvm-xfstests:~# mkfs.xfs -b size=1024 /tmp/foo.img
meta-data=/tmp/foo.img isize=512 agcount=4, agsize=25600 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=0
data = bsize=1024 blocks=102400, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal log bsize=1024 blocks=2573, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
root@kvm-xfstests:~# mount -o loop /tmp/foo.img /mnt
root@kvm-xfstests:~# blockdev --getbsz /dev/loop0
1024
root@kvm-xfstests:~# losetup -c /dev/loop0
root@kvm-xfstests:~# blockdev --getbsz /dev/loop0
4096 <--------- BUG, note the change in the block size
root@kvm-xfstests:~# touch /mnt/foo
root@kvm-xfstests:~# sync
<------ The reason why we don't hang is that XFS doesn't use the
<------ buffer cache
root@kvm-xfstests:~# umount /mnt



Also feel free to try my repro, but using "blockdev --getbsz /dev/loop" before and after the losetup -c command, and note that it does not hang even though there is no resize2fs in the command sequence at all:

root@kvm-xfstests:~# cp /dev/null /tmp/foo.img
root@kvm-xfstests:~# truncate -s 100M /tmp/foo.img
root@kvm-xfstests:~# mke2fs -t ext4 /tmp/foo.img
mke2fs 1.43-WIP (18-May-2015)
Discarding device blocks: done
Creating filesystem with 102400 1k blocks and 25688 inodes
Filesystem UUID: 27dfdbbe-f3a9-48a7-abe8-5a52798a9849
Superblock backups stored on blocks:
8193, 24577, 40961, 57345, 73729

Allocating group tables: done
Writing inode tables: done
Creating journal (4096 blocks): done
Writing superblocks and filesystem accounting information: done

root@kvm-xfstests:~# mount -o loop /tmp/foo.img /mnt
root@kvm-xfstests:~# blockdev --getbsz /dev/loop0
1024
root@kvm-xfstests:~# losetup -c /dev/loop0
root@kvm-xfstests:~# blockdev --getbsz /dev/loop0
4096 <------------ BUG
root@kvm-xfstests:~# touch /mnt/foo
<------- Should hang here, even though there is no resize2fs command
<------- If it doesn't hang right away, try typing the "sync" command


> Suffer this small analogy
> for me and let me know where I am wrong: say hypothetically I expand a
> small partition (or LVM for that matter). Then I try to use resize2fs
> to grow the ext filesystem on it. I expect that this should *not*
> change the block size of the underlying device (of course not!) nor
> the filesystem's block size.

The cause of your misunderstanding is not understanding that there are actually 4 different concepts of block/sector size:

* The logical block/sector size of the underlying storage device
- Retrived via "blockdev --getss /dev/sdXX"
- This is the smallest unit that can be sent to the disk from
the Host OS. If the logical sector size is different from
the physical block size, and write is smaller than the
physical sector size (see below), then the disk will do a
read-modify-write.
- The file system block size MUST be greater than or equal to
the logical sector size.

* The physical block/sector size of the underlying storage device
- Retrived via "blockdev --getpbsz /dev/sdXX"
- This is the smallest unit can be physically written to the
storage media.
- The file system block size SHOULD be greater than or equal
to the logical sector size. (To avoid read-modify-write
operations by the hard drive that will bad for performance.)

* The "soft" block size of the block device.
- Retrived via "blockdev --getbsz /dev/sdXX"
- This represents the units of storage which is used to cache
data in the buffer cache. This only matters if you are
using buffer cache --- for example, if you are doing
buffered I/O to a block device, or if you are using a file
system such as ext4 which is using buffer cache. Since data
is indexed in the buffer cache by the 3-tuple (block device,
block number, block size), Bad Things happen if you try to
change the block size while the file system is mounted.
Normally, the kernel will prevent you from changing the
block size under these circumstances.

* The file system block size.
- Retrieved by some file-system dependent command. For ext4,
this is "dumpe2fs -h".
- Set at format time. For file systems that use the buffer
cache, the file system driver will automatically set the
"soft" block size of the block device when the file system
is mounted.


Speaking of LVM, I can't reproduce the problem using LVM, at least not with a 4.3-rc2 kernel:

root@kvm-xfstests:~# pvcreate /dev/vdc
Physical volume "/dev/vdc" successfully created
root@kvm-xfstests:~# vgcreate test /dev/vdc
Volume group "test" successfully created
root@kvm-xfstests:~# lvcreate -L 100M -n small /dev/test
Logical volume "small" created
root@kvm-xfstests:~# mkfs.ext4 -Fq /dev/test/small
root@kvm-xfstests:~# mount -o loop /dev/test/small /mnt
root@kvm-xfstests:~# blockdev --getbsz /dev/loop0
1024
root@kvm-xfstests:~# lvresize -L 1G /dev/test/small
Size of logical volume test/small changed from 100.00 MiB (25 extents) to 1.00 GiB (256 extents).
Logical volume small successfully resized
root@kvm-xfstests:~# blockdev --getbsz /dev/loop0
1024 <------ NO BUG, see the block size has not changed
root@kvm-xfstests:~# lvcreate -L 100M -n small /dev/test^C
root@kvm-xfstests:~# touch /mnt/foo ; sync
root@kvm-xfstests:~# resize2fs /dev/test/small
resize2fs 1.43-WIP (18-May-2015)
Filesystem at /dev/test/small is mounted on /mnt; on-line resizing required
old_desc_blocks = 1, new_desc_blocks = 8
The filesystem on /dev/test/small is now 1048576 (1k) blocks long.
<------ Note that resize2fs works just fine!
root@kvm-xfstests:~# touch /mnt/bar ; sync
root@kvm-xfstests:~# umount /mnt
root@kvm-xfstests:~#

You might see if this works on CentOS; but if it doesn't, I'm pretty convinced this is a bug outside of ext4, and I've already given you a workaround --- using "-b 4096" on the command line to mkfs.ext4 or mke2fs.

Alternatively, here's another workaround; you can change modify your /etc/mke2fs.conf so the "small" and "floppy" stanzas read:

[fs_types]
small = {
blocksize = 4096
inode_size = 128
inode_ratio = 4096
}
floppy = {
blocksize = 4096
inode_size = 128
inode_ratio = 8192
}

I'm pretty certain your failures won't reproduce if you either change how you call mke2fs for small file systems, or change your /etc/mke2fs.conf file as shown above.

Cheers,

- Ted

2015-09-23 16:59:13

by Theodore Ts'o

Subject: Re: resize2fs stuck in ext4_group_extend with 100% CPU Utilization With Small Volumes

On Wed, Sep 23, 2015 at 12:04:49PM -0400, Pocas, Jamie wrote:
> Interesting. Thanks for the detailed break-down! I don't mind the
> workaround of using 4k "soft" block size on the filesystem, even for
> smaller filesystems. Now that I understand better, I think you were
> on target with your earlier explanation of bd_set_size(). So this
> means it's not an ext4 bug. I think the online resize of loopback
> device (or any other block device driver) should use something like
> the code in check_disk_size_change() instead of bd_set_size(). I
> will have to test this out. Thanks again.

To be clear, the 4k file system block size is an on-disk format thing,
and it will give you better performance (at the cost of increasing
internal fragmentation overhead which can consume more space). It
will cause the soft block size to be set to be 4k when the file system
is mounted, but that's a different thing.

Note that for larger ext4 file systems, or if you are using XFS, the
file system block size will be 4k, so explicitly configuring the
blocksize to 4k isn't anything particularly unusual. It's a change in
the defaults, but I showed you how you can change the defaults by
editing /etc/mke2fs.conf.

Cheers,

- Ted

2015-09-23 18:21:10

by Pocas, Jamie

Subject: RE: resize2fs stuck in ext4_group_extend with 100% CPU Utilization With Small Volumes

I understand the tradeoff. Thanks. I tested a change in the driver from calling bd_set_size() to calling check_disk_size_change() and it is in fact working as expected! I can see that the capacity is increased correctly, and the block size reported by 'blockdev --getbsz' is correctly retained as 1024. Obviously this needs more review and testing, but I think this is a bug with loopback and any other driver that would call bd_set_size() to do an online resize. As far as ext is concerned, consider it case closed. Thanks again for the tips.
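
For reference, a hypothetical, not compile-tested sketch of that idea
against a 3.10-era drivers/block/loop.c; check_disk_size_change() is
static to fs/block_dev.c in these kernels, so its size-only update is
open-coded here, and the helper name is made up for illustration:

#include <linux/fs.h>
#include <linux/blkdev.h>
#include <linux/genhd.h>

/*
 * Hypothetical sketch only (not compile-tested).  The point is what it
 * does NOT do: unlike bd_set_size(), it never touches bd_block_size or
 * i_blkbits, so a mounted 1k-block ext4 keeps its soft block size.
 */
static void loop_grow_bdev(struct gendisk *disk, struct block_device *bdev,
                           loff_t new_bytes)
{
        set_capacity(disk, new_bytes >> 9);     /* gendisk capacity is in 512-byte sectors */

        /* Mirror what check_disk_size_change() does: update only i_size. */
        mutex_lock(&bdev->bd_inode->i_mutex);
        i_size_write(bdev->bd_inode, new_bytes);
        mutex_unlock(&bdev->bd_inode->i_mutex);

        /* Deliberately leave bdev->bd_block_size and i_blkbits alone. */
}

Whether the real fix belongs in the loop driver or in a shared helper
like bd_set_size() itself is exactly the review question raised above.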

-----Original Message-----
From: Theodore Ts'o [mailto:[email protected]]
Sent: Wednesday, September 23, 2015 12:59 PM
To: Pocas, Jamie
Cc: Eric Sandeen; [email protected]
Subject: Re: resize2fs stuck in ext4_group_extend with 100% CPU Utilization With Small Volumes

On Wed, Sep 23, 2015 at 12:04:49PM -0400, Pocas, Jamie wrote:
> Interesting. Thanks for the detailed break-down! I don't mind the
> workaround of using 4k "soft" block size on the filesystem, even for
> smaller filesystems. Now that I understand better, I think you were on
> target with your earlier explanation of bd_set_size(). So this means
> it's not an ext4 bug. I think the online resize of loopback device (or
> any other block device driver) should use something like the code in
> check_disk_size_change() instead of bd_set_size(). I will have to test
> this out. Thanks again.

To be clear, the 4k file system block size is an on-disk format thing, and it will give you better performance (at the cost of increasing internal fragmentation overhead which can consume more space). It will cause the soft block size to be set to be 4k when the file system is mounted, but that's a different thing.

Note that for larger ext4 file systems, or if you are using XFS, the file system block size will be 4k, so explicitly configuring the blocksize to 4k isn't anything particularly unusual. It's a change in the defaults, but I showed you how you can change the defaults by editing /etc/mke2fs.conf.

Cheers,

- Ted