2015-03-25 10:46:27

by Lukas Czerner

[permalink] [raw]
Subject: [PATCH v2] e2fsprogs: Limit number of reserved gdt blocks on small fs

Currently we're unable to online resize very small (smaller than 32 MB)
file systems with 1k block size because there is not enough space in the
journal to put all the reserved gdt blocks.

I tried to find solution in kernel which does not require to reserve
journal credits for all the gdt blocks, but it looks like we have to
have them all in journal to make sure that we atomically either change
them all, or none.

So it leaves us with fixing the root cause which is the fact that mke2fs
is setup by default so that it creates so many gdt blocks that it can't
fit into the journal transaction.

Fix it by limiting the default number of reserved gdt blocks in the
case that it would not fit in the journal transaction by default.
Note that it's still possible to create such file system by specifying
non-default options so I've added mke2fs message that would warn users
that would try to create such file system, but will still allow them to
do so.

We still need to address the kernel problem where such file system would
generate scary kernel warning, but that's for a separate patch.

Also fix output of some tests, because this change slightly alters the
file system layout (specifically number of free blocks and number of
reserved gdt blocks) on very small file systems.

Signed-off-by: Lukas Czerner <[email protected]>
---
lib/ext2fs/initialize.c | 31 ++++++++++++-
misc/mke2fs.c | 27 ++++++++++++
tests/d_loaddump/script | 2 +-
tests/f_desc_size_bad/expect.1 | 2 +-
tests/f_desc_size_bad/expect.2 | 2 +-
tests/f_resize_inode/expect | 44 +++++++++----------
tests/m_mkfs_overhead/script | 2 +-
tests/r_fixup_lastbg/expect | 22 +++++-----
tests/r_fixup_lastbg_big/expect | 26 +++++------
tests/r_resize_inode/expect | 96 ++++++++++++++++++++---------------------
10 files changed, 154 insertions(+), 100 deletions(-)

diff --git a/lib/ext2fs/initialize.c b/lib/ext2fs/initialize.c
index 75fbf8e..180e605 100644
--- a/lib/ext2fs/initialize.c
+++ b/lib/ext2fs/initialize.c
@@ -59,18 +59,45 @@ static unsigned int calc_reserved_gdt_blocks(ext2_filsys fs)
unsigned long max_blocks = 0xffffffff;
unsigned long rsv_groups;
unsigned int rsv_gdb;
+ __u64 blocks_count;
+ int journal_blocks, credits;
+
+ blocks_count = ext2fs_blocks_count(sb);
+ journal_blocks = ext2fs_default_journal_size(blocks_count);

/* We set it at 1024x the current filesystem size, or
* the upper block count limit (2^32), whichever is lower.
*/
- if (ext2fs_blocks_count(sb) < max_blocks / 1024)
- max_blocks = ext2fs_blocks_count(sb) * 1024;
+ if (blocks_count < max_blocks / 1024)
+ max_blocks = blocks_count * 1024;
/*
* ext2fs_div64_ceil() is unnecessary because max_blocks is
* max _GDT_ blocks, which is limited to 32 bits.
*/
rsv_groups = ext2fs_div_ceil(max_blocks - sb->s_first_data_block, bpg);
rsv_gdb = ext2fs_div_ceil(rsv_groups, gdpb) - fs->desc_blocks;
+
+ /*
+ * In online resize we require a huge number of journal
+ * credits mainly because of the reserved gdt blocks. The
+ * exact number of journal credits needed is:
+ * flex_gd->count * 4 + reserved_gdb
+ *
+ * In order to allow online resize on file system with tiny
+ * journal (tiny fs) we have to make sure that the number of
+ * reserved gdt blocks is small enough. If the user specifies
+ * non-default values he can still run into this issue but we
+ * do not want to make that mistake by default.
+ *
+ * journal_blocks / 8 is the maximum amount of credits we're allowed
+ * to use in a single transaction.
+ *
+ */
+ credits = (1 << fs->super->s_log_groups_per_flex) * 4 + rsv_gdb;
+ if (journal_blocks > 0 && credits > journal_blocks / 8)
+ rsv_gdb = (journal_blocks / 8) -
+ (1 << fs->super->s_log_groups_per_flex) * 4;
+
if (rsv_gdb > EXT2_ADDR_PER_BLOCK(sb))
rsv_gdb = EXT2_ADDR_PER_BLOCK(sb);
#ifdef RES_GDT_DEBUG
diff --git a/misc/mke2fs.c b/misc/mke2fs.c
index aeb852f..a6c6a1d 100644
--- a/misc/mke2fs.c
+++ b/misc/mke2fs.c
@@ -3076,6 +3076,33 @@ int main (int argc, char *argv[])
}
if (!quiet)
printf("%s", _("done\n"));
+
+ /*
+ * In online resize we require a huge number of journal
+ * credits mainly because of the reserved_gdt_blocks. The
+ * exact number of journal credits needed is:
+ * flex_gd->count * 4 + reserved_gdb
+ *
+ * Warn user if we will not have enough credits to resize
+ * the file system online.
+ */
+ if (fs_param.s_feature_compat & EXT2_FEATURE_COMPAT_RESIZE_INODE)
+ {
+ unsigned int credits;
+
+ /*
+ * This is the amount we're allowed to use for a
+ * single handle in a transaction
+ */
+ credits = (1 << fs_param.s_log_groups_per_flex) * 4 +
+ fs->super->s_reserved_gdt_blocks;
+ if (credits > journal_blocks / 8) {
+ fprintf(stderr, "%s", _("Warning: Journal is "
+ "not big enough. In this configuration "
+ "the file system will not be able to "
+ "grow online!\n"));
+ }
+ }
}
no_journal:
if (!super_only &&
diff --git a/tests/d_loaddump/script b/tests/d_loaddump/script
index 39727ba..0f5fd25 100644
--- a/tests/d_loaddump/script
+++ b/tests/d_loaddump/script
@@ -13,7 +13,7 @@ dd if=/dev/zero of=$TMPFILE bs=1k count=512 > /dev/null 2>&1

echo "mke2fs -Fq -b 1024 test.img 512" >> $OUT

-$MKE2FS -Fq $TMPFILE 512 > /dev/null 2>&1
+$MKE2FS -Fq $TMPFILE 512
status=$?
echo Exit status is $status >> $OUT

diff --git a/tests/f_desc_size_bad/expect.1 b/tests/f_desc_size_bad/expect.1
index 009ee04..08a8dbc 100644
--- a/tests/f_desc_size_bad/expect.1
+++ b/tests/f_desc_size_bad/expect.1
@@ -7,5 +7,5 @@ Pass 4: Checking reference counts
Pass 5: Checking group summary information

test_filesys: ***** FILE SYSTEM WAS MODIFIED *****
-test_filesys: 11/32 files (0.0% non-contiguous), 801/2048 blocks
+test_filesys: 11/32 files (0.0% non-contiguous), 405/2048 blocks
Exit status is 1
diff --git a/tests/f_desc_size_bad/expect.2 b/tests/f_desc_size_bad/expect.2
index d1429fd..ad1cf50 100644
--- a/tests/f_desc_size_bad/expect.2
+++ b/tests/f_desc_size_bad/expect.2
@@ -3,5 +3,5 @@ Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
-test_filesys: 11/32 files (0.0% non-contiguous), 801/2048 blocks
+test_filesys: 11/32 files (0.0% non-contiguous), 405/2048 blocks
Exit status is 0
diff --git a/tests/f_resize_inode/expect b/tests/f_resize_inode/expect
index db57ed6..95fc4cf 100644
--- a/tests/f_resize_inode/expect
+++ b/tests/f_resize_inode/expect
@@ -12,7 +12,7 @@ Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
-test_filesys: 11/4096 files (0.0% non-contiguous), 2107/16384 blocks
+test_filesys: 11/4096 files (0.0% non-contiguous), 1315/16384 blocks
Exit status is 0
-----------------------------------------------

@@ -25,22 +25,22 @@ Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
-Free blocks count wrong for group #0 (717, counted=718).
+Free blocks count wrong for group #0 (849, counted=850).
Fix? yes

-Free blocks count wrong (14276, counted=14277).
+Free blocks count wrong (15068, counted=15069).
Fix? yes


test_filesys: ***** FILE SYSTEM WAS MODIFIED *****
-test_filesys: 11/4096 files (0.0% non-contiguous), 2107/16384 blocks
+test_filesys: 11/4096 files (0.0% non-contiguous), 1315/16384 blocks
Exit status is 1
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
-test_filesys: 11/4096 files (0.0% non-contiguous), 2107/16384 blocks
+test_filesys: 11/4096 files (0.0% non-contiguous), 1315/16384 blocks
Exit status is 0
-----------------------------------------------

@@ -53,22 +53,22 @@ Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
-Free blocks count wrong for group #0 (717, counted=718).
+Free blocks count wrong for group #0 (849, counted=850).
Fix? yes

-Free blocks count wrong (14276, counted=14277).
+Free blocks count wrong (15068, counted=15069).
Fix? yes


test_filesys: ***** FILE SYSTEM WAS MODIFIED *****
-test_filesys: 11/4096 files (0.0% non-contiguous), 2107/16384 blocks
+test_filesys: 11/4096 files (0.0% non-contiguous), 1315/16384 blocks
Exit status is 1
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
-test_filesys: 11/4096 files (0.0% non-contiguous), 2107/16384 blocks
+test_filesys: 11/4096 files (0.0% non-contiguous), 1315/16384 blocks
Exit status is 0
-----------------------------------------------

@@ -81,22 +81,22 @@ Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
-Free blocks count wrong for group #0 (717, counted=718).
+Free blocks count wrong for group #0 (849, counted=850).
Fix? yes

-Free blocks count wrong (14276, counted=14277).
+Free blocks count wrong (15068, counted=15069).
Fix? yes


test_filesys: ***** FILE SYSTEM WAS MODIFIED *****
-test_filesys: 11/4096 files (0.0% non-contiguous), 2107/16384 blocks
+test_filesys: 11/4096 files (0.0% non-contiguous), 1315/16384 blocks
Exit status is 1
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
-test_filesys: 11/4096 files (0.0% non-contiguous), 2107/16384 blocks
+test_filesys: 11/4096 files (0.0% non-contiguous), 1315/16384 blocks
Exit status is 0
-----------------------------------------------

@@ -111,7 +111,7 @@ Pass 4: Checking reference counts
Pass 5: Checking group summary information

test_filesys: ***** FILE SYSTEM WAS MODIFIED *****
-test_filesys: 11/4096 files (0.0% non-contiguous), 2107/16384 blocks
+test_filesys: 11/4096 files (0.0% non-contiguous), 1315/16384 blocks
Exit status is 1
-----------------------------------------------

@@ -122,28 +122,28 @@ Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
-Block bitmap differences: -(35--258) -(1059--1282) -(3107--3330) -(5155--5378) -(7203--7426) -(9251--9474)
+Block bitmap differences: -(35--126) -(1059--1150) -(3107--3198) -(5155--5246) -(7203--7294) -(9251--9342)
Fix? yes

-Free blocks count wrong for group #0 (718, counted=942).
+Free blocks count wrong for group #0 (850, counted=942).
Fix? yes

-Free blocks count wrong for group #1 (732, counted=956).
+Free blocks count wrong for group #1 (864, counted=956).
Fix? yes

-Free blocks count wrong for group #3 (732, counted=956).
+Free blocks count wrong for group #3 (864, counted=956).
Fix? yes

-Free blocks count wrong for group #5 (732, counted=956).
+Free blocks count wrong for group #5 (864, counted=956).
Fix? yes

-Free blocks count wrong for group #7 (732, counted=956).
+Free blocks count wrong for group #7 (864, counted=956).
Fix? yes

-Free blocks count wrong for group #9 (732, counted=956).
+Free blocks count wrong for group #9 (864, counted=956).
Fix? yes

-Free blocks count wrong (14277, counted=15621).
+Free blocks count wrong (15069, counted=15621).
Fix? yes


diff --git a/tests/m_mkfs_overhead/script b/tests/m_mkfs_overhead/script
index c21da0c..e19e67e 100644
--- a/tests/m_mkfs_overhead/script
+++ b/tests/m_mkfs_overhead/script
@@ -3,7 +3,7 @@ test_description="test bg overhead calculation"
OUT=$test_name.log
EXP=$test_dir/expect
FS_SIZE=1024
-MKE2FS_OPTS="-b 1024 -m 0 -g 256 -N 3745"
+MKE2FS_OPTS="-b 1024 -m 0 -g 256 -N 4513"

MKE2FS_SKIP_PROGRESS=true
MKE2FS_SKIP_CHECK_MSG=true
diff --git a/tests/r_fixup_lastbg/expect b/tests/r_fixup_lastbg/expect
index 96b154a..6514328 100644
--- a/tests/r_fixup_lastbg/expect
+++ b/tests/r_fixup_lastbg/expect
@@ -8,16 +8,16 @@ Creating journal (1024 blocks): done
Writing superblocks and filesystem accounting information: done

Group 2: (Blocks 16385-19999) [INODE_UNINIT, ITABLE_ZEROED]
- Block bitmap at 83 (bg #0 + 82)
- Inode bitmap at 86 (bg #0 + 85)
- Inode table at 295-398 (bg #0 + 294)
+ Block bitmap at 69 (bg #0 + 68)
+ Inode bitmap at 72 (bg #0 + 71)
+ Inode table at 281-384 (bg #0 + 280)
3615 free blocks, 416 free inodes, 0 directories, 416 unused inodes
Free blocks: 16385-19999
Free inodes: 833-1248
-Group 2: (Blocks 16385-19999)
- Block bitmap at 83 (bg #0 + 82)
- Inode bitmap at 86 (bg #0 + 85)
- Inode table at 295-398 (bg #0 + 294)
+Group 2: (Blocks 16385-19999) (EXPECTED 0xe815)
+ Block bitmap at 69 (bg #0 + 68)
+ Inode bitmap at 72 (bg #0 + 71)
+ Inode table at 281-384 (bg #0 + 280)
3615 free blocks, 416 free inodes, 0 directories
Free blocks: 16385-19999
Free inodes: 833-1248
@@ -25,9 +25,9 @@ Resizing the filesystem on test.img to 20004 (1k) blocks.
The filesystem on test.img is now 20004 (1k) blocks long.

Group 2: (Blocks 16385-20003) [INODE_UNINIT]
- Block bitmap at 83 (bg #0 + 82)
- Inode bitmap at 86 (bg #0 + 85)
- Inode table at 295-398 (bg #0 + 294)
+ Block bitmap at 69 (bg #0 + 68)
+ Inode bitmap at 72 (bg #0 + 71)
+ Inode table at 281-384 (bg #0 + 280)
3619 free blocks, 416 free inodes, 0 directories, 416 unused inodes
Free blocks: 16385-20003
Free inodes: 833-1248
@@ -36,4 +36,4 @@ Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
-test.img: 11/1248 files (0.0% non-contiguous), 1517/20004 blocks
+test.img: 11/1248 files (0.0% non-contiguous), 1489/20004 blocks
diff --git a/tests/r_fixup_lastbg_big/expect b/tests/r_fixup_lastbg_big/expect
index edaabaf..3e87acf 100644
--- a/tests/r_fixup_lastbg_big/expect
+++ b/tests/r_fixup_lastbg_big/expect
@@ -8,16 +8,16 @@ Creating journal (1024 blocks): done
Writing superblocks and filesystem accounting information: done

Group 2: (Blocks 16385-19999) [INODE_UNINIT, ITABLE_ZEROED]
- Block bitmap at 83 (bg #0 + 82)
- Inode bitmap at 86 (bg #0 + 85)
- Inode table at 295-398 (bg #0 + 294)
+ Block bitmap at 69 (bg #0 + 68)
+ Inode bitmap at 72 (bg #0 + 71)
+ Inode table at 281-384 (bg #0 + 280)
3615 free blocks, 416 free inodes, 0 directories, 416 unused inodes
Free blocks: 16385-19999
Free inodes: 833-1248
-Group 2: (Blocks 16385-19999)
- Block bitmap at 83 (bg #0 + 82)
- Inode bitmap at 86 (bg #0 + 85)
- Inode table at 295-398 (bg #0 + 294)
+Group 2: (Blocks 16385-19999) (EXPECTED 0xe815)
+ Block bitmap at 69 (bg #0 + 68)
+ Inode bitmap at 72 (bg #0 + 71)
+ Inode table at 281-384 (bg #0 + 280)
3615 free blocks, 416 free inodes, 0 directories
Free blocks: 16385-19999
Free inodes: 833-1248
@@ -27,19 +27,19 @@ Extending the inode table ----------------------------------------
The filesystem on test.img is now 40000 (1k) blocks long.

Group 2: (Blocks 16385-24576) [INODE_UNINIT, BLOCK_UNINIT]
- Block bitmap at 83 (bg #0 + 82)
- Inode bitmap at 86 (bg #0 + 85)
- Inode table at 295-398 (bg #0 + 294)
+ Block bitmap at 69 (bg #0 + 68)
+ Inode bitmap at 72 (bg #0 + 71)
+ Inode table at 281-384 (bg #0 + 280)
8192 free blocks, 416 free inodes, 0 directories, 416 unused inodes
Free blocks: 16385-24576
Free inodes: 833-1248
Group 3: (Blocks 24577-32768) [INODE_UNINIT, ITABLE_ZEROED]
Backup superblock at 24577, Group descriptors at 24578-24578
- Reserved GDT blocks at 24579-24656
- Block bitmap at 413 (bg #0 + 412)
+ Reserved GDT blocks at 24579-24642
+ Block bitmap at 399 (bg #0 + 398)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
-test.img: 11/2080 files (0.0% non-contiguous), 1809/40000 blocks
+test.img: 11/2080 files (0.0% non-contiguous), 1767/40000 blocks
diff --git a/tests/r_resize_inode/expect b/tests/r_resize_inode/expect
index ba1647e..698347f 100644
--- a/tests/r_resize_inode/expect
+++ b/tests/r_resize_inode/expect
@@ -8,7 +8,7 @@ Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
-test_filesys: 11/16384 files (0.0% non-contiguous), 4513/65536 blocks
+test_filesys: 11/16384 files (0.0% non-contiguous), 3325/65536 blocks
Exit status is 0
dumpe2fs test.img
Filesystem volume name: <none>
@@ -23,12 +23,12 @@ Filesystem OS type: Linux
Inode count: 16384
Block count: 65536
Reserved block count: 3276
-Free blocks: 61023
+Free blocks: 62211
Free inodes: 16373
First block: 1
Block size: 1024
Fragment size: 1024
-Reserved GDT blocks: 255
+Reserved GDT blocks: 123
Blocks per group: 1024
Fragments per group: 1024
Inodes per group: 256
@@ -44,19 +44,19 @@ Default directory hash: half_md4

Group 0: (Blocks 1-1024)
Primary superblock at 1, Group descriptors at 2-3
- Reserved GDT blocks at 4-258
- Block bitmap at 259 (+258), Inode bitmap at 260 (+259)
- Inode table at 261-292 (+260)
- 718 free blocks, 245 free inodes, 2 directories
- Free blocks: 307-1024
+ Reserved GDT blocks at 4-126
+ Block bitmap at 127 (+126), Inode bitmap at 128 (+127)
+ Inode table at 129-160 (+128)
+ 850 free blocks, 245 free inodes, 2 directories
+ Free blocks: 175-1024
Free inodes: 12-256
Group 1: (Blocks 1025-2048)
Backup superblock at 1025, Group descriptors at 1026-1027
- Reserved GDT blocks at 1028-1282
- Block bitmap at 1283 (+258), Inode bitmap at 1284 (+259)
- Inode table at 1285-1316 (+260)
- 732 free blocks, 256 free inodes, 0 directories
- Free blocks: 1317-2048
+ Reserved GDT blocks at 1028-1150
+ Block bitmap at 1151 (+126), Inode bitmap at 1152 (+127)
+ Inode table at 1153-1184 (+128)
+ 864 free blocks, 256 free inodes, 0 directories
+ Free blocks: 1185-2048
Free inodes: 257-512
Group 2: (Blocks 2049-3072)
Block bitmap at 2049 (+0), Inode bitmap at 2050 (+1)
@@ -66,11 +66,11 @@ Group 2: (Blocks 2049-3072)
Free inodes: 513-768
Group 3: (Blocks 3073-4096)
Backup superblock at 3073, Group descriptors at 3074-3075
- Reserved GDT blocks at 3076-3330
- Block bitmap at 3331 (+258), Inode bitmap at 3332 (+259)
- Inode table at 3333-3364 (+260)
- 732 free blocks, 256 free inodes, 0 directories
- Free blocks: 3365-4096
+ Reserved GDT blocks at 3076-3198
+ Block bitmap at 3199 (+126), Inode bitmap at 3200 (+127)
+ Inode table at 3201-3232 (+128)
+ 864 free blocks, 256 free inodes, 0 directories
+ Free blocks: 3233-4096
Free inodes: 769-1024
Group 4: (Blocks 4097-5120)
Block bitmap at 4097 (+0), Inode bitmap at 4098 (+1)
@@ -80,11 +80,11 @@ Group 4: (Blocks 4097-5120)
Free inodes: 1025-1280
Group 5: (Blocks 5121-6144)
Backup superblock at 5121, Group descriptors at 5122-5123
- Reserved GDT blocks at 5124-5378
- Block bitmap at 5379 (+258), Inode bitmap at 5380 (+259)
- Inode table at 5381-5412 (+260)
- 732 free blocks, 256 free inodes, 0 directories
- Free blocks: 5413-6144
+ Reserved GDT blocks at 5124-5246
+ Block bitmap at 5247 (+126), Inode bitmap at 5248 (+127)
+ Inode table at 5249-5280 (+128)
+ 864 free blocks, 256 free inodes, 0 directories
+ Free blocks: 5281-6144
Free inodes: 1281-1536
Group 6: (Blocks 6145-7168)
Block bitmap at 6145 (+0), Inode bitmap at 6146 (+1)
@@ -94,11 +94,11 @@ Group 6: (Blocks 6145-7168)
Free inodes: 1537-1792
Group 7: (Blocks 7169-8192)
Backup superblock at 7169, Group descriptors at 7170-7171
- Reserved GDT blocks at 7172-7426
- Block bitmap at 7427 (+258), Inode bitmap at 7428 (+259)
- Inode table at 7429-7460 (+260)
- 732 free blocks, 256 free inodes, 0 directories
- Free blocks: 7461-8192
+ Reserved GDT blocks at 7172-7294
+ Block bitmap at 7295 (+126), Inode bitmap at 7296 (+127)
+ Inode table at 7297-7328 (+128)
+ 864 free blocks, 256 free inodes, 0 directories
+ Free blocks: 7329-8192
Free inodes: 1793-2048
Group 8: (Blocks 8193-9216)
Block bitmap at 8193 (+0), Inode bitmap at 8194 (+1)
@@ -108,11 +108,11 @@ Group 8: (Blocks 8193-9216)
Free inodes: 2049-2304
Group 9: (Blocks 9217-10240)
Backup superblock at 9217, Group descriptors at 9218-9219
- Reserved GDT blocks at 9220-9474
- Block bitmap at 9475 (+258), Inode bitmap at 9476 (+259)
- Inode table at 9477-9508 (+260)
- 732 free blocks, 256 free inodes, 0 directories
- Free blocks: 9509-10240
+ Reserved GDT blocks at 9220-9342
+ Block bitmap at 9343 (+126), Inode bitmap at 9344 (+127)
+ Inode table at 9345-9376 (+128)
+ 864 free blocks, 256 free inodes, 0 directories
+ Free blocks: 9377-10240
Free inodes: 2305-2560
Group 10: (Blocks 10241-11264)
Block bitmap at 10241 (+0), Inode bitmap at 10242 (+1)
@@ -206,11 +206,11 @@ Group 24: (Blocks 24577-25600)
Free inodes: 6145-6400
Group 25: (Blocks 25601-26624)
Backup superblock at 25601, Group descriptors at 25602-25603
- Reserved GDT blocks at 25604-25858
- Block bitmap at 25859 (+258), Inode bitmap at 25860 (+259)
- Inode table at 25861-25892 (+260)
- 732 free blocks, 256 free inodes, 0 directories
- Free blocks: 25893-26624
+ Reserved GDT blocks at 25604-25726
+ Block bitmap at 25727 (+126), Inode bitmap at 25728 (+127)
+ Inode table at 25729-25760 (+128)
+ 864 free blocks, 256 free inodes, 0 directories
+ Free blocks: 25761-26624
Free inodes: 6401-6656
Group 26: (Blocks 26625-27648)
Block bitmap at 26625 (+0), Inode bitmap at 26626 (+1)
@@ -220,11 +220,11 @@ Group 26: (Blocks 26625-27648)
Free inodes: 6657-6912
Group 27: (Blocks 27649-28672)
Backup superblock at 27649, Group descriptors at 27650-27651
- Reserved GDT blocks at 27652-27906
- Block bitmap at 27907 (+258), Inode bitmap at 27908 (+259)
- Inode table at 27909-27940 (+260)
- 732 free blocks, 256 free inodes, 0 directories
- Free blocks: 27941-28672
+ Reserved GDT blocks at 27652-27774
+ Block bitmap at 27775 (+126), Inode bitmap at 27776 (+127)
+ Inode table at 27777-27808 (+128)
+ 864 free blocks, 256 free inodes, 0 directories
+ Free blocks: 27809-28672
Free inodes: 6913-7168
Group 28: (Blocks 28673-29696)
Block bitmap at 28673 (+0), Inode bitmap at 28674 (+1)
@@ -354,11 +354,11 @@ Group 48: (Blocks 49153-50176)
Free inodes: 12289-12544
Group 49: (Blocks 50177-51200)
Backup superblock at 50177, Group descriptors at 50178-50179
- Reserved GDT blocks at 50180-50434
- Block bitmap at 50435 (+258), Inode bitmap at 50436 (+259)
- Inode table at 50437-50468 (+260)
- 732 free blocks, 256 free inodes, 0 directories
- Free blocks: 50469-51200
+ Reserved GDT blocks at 50180-50302
+ Block bitmap at 50303 (+126), Inode bitmap at 50304 (+127)
+ Inode table at 50305-50336 (+128)
+ 864 free blocks, 256 free inodes, 0 directories
+ Free blocks: 50337-51200
Free inodes: 12545-12800
Group 50: (Blocks 51201-52224)
Block bitmap at 51201 (+0), Inode bitmap at 51202 (+1)
--
1.8.3.1



2015-04-24 21:51:48

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH v2] e2fsprogs: Limit number of reserved gdt blocks on small fs

On 3/25/15 5:46 AM, Lukas Czerner wrote:
> Currently we're unable to online resize very small (smaller than 32 MB)
> file systems with 1k block size because there is not enough space in the
> journal to put all the reserved gdt blocks.

So, I'll get to the patch review if I need to, but this all seemed a little
odd; this is a regression, so do we really need to restrict things at mkfs
time?

On the userspace side, things were ok until:

9f6ba88 resize2fs: add support for new in-kernel online resize ioctl

and even with that, on the kernelspace side, things were ok until:

8f7d89f jbd2: transaction reservation support

I guess I'm trying to understand why that jbd2 commit regressed this.
I've not been paying enough attention to ext4 lately. ;)

I mean, the threshold got chopped in half:

- if (nblocks > journal->j_max_transaction_buffers) {
+ /*
+ * 1/2 of transaction can be reserved so we can practically handle
+ * only 1/2 of maximum transaction size per operation
+ */
+ if (WARN_ON(blocks > journal->j_max_transaction_buffers / 2)) {
printk(KERN_ERR "JBD2: %s wants too many credits (%d > %d)\n",
- current->comm, nblocks,
- journal->j_max_transaction_buffers);
+ current->comm, blocks,
+ journal->j_max_transaction_buffers / 2);
return -ENOSPC;
}

so it's clear why the behavior changed, I guess, but it feels like I
must be missing something here.

The reproducer, for those playing along at home, is something like:

mkfs.ext4 /dev/sda 20M
mount /dev/sda /mnt/test
resize2fs /dev/sda 200M

-Eric



2015-04-25 04:25:10

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH v2] e2fsprogs: Limit number of reserved gdt blocks on small fs

On Apr 24, 2015, at 3:51 PM, Eric Sandeen <[email protected]> wrote:
> On 3/25/15 5:46 AM, Lukas Czerner wrote:
>> Currently we're unable to online resize very small (smaller than 32 MB)
>> file systems with 1k block size because there is not enough space in the
>> journal to put all the reserved gdt blocks.
>
> So, I'll get to the patch review if I need to, but this all seemed a little
> odd; this is a regression, so do we really need to restrict things at mkfs
> time?
>
> On the userspace side, things were ok until:
>
> 9f6ba88 resize2fs: add support for new in-kernel online resize ioctl
>
> and even with that, on the kernelspace side, things were ok until:
>
> 8f7d89f jbd2: transaction reservation support
>
> I guess I'm trying to understand why that jbd2 commit regressed this.
> I've not been paying enough attention to ext4 lately. ;)
>
> I mean, the threshold got chopped in half:
>
> - if (nblocks > journal->j_max_transaction_buffers) {
> + /*
> + * 1/2 of transaction can be reserved so we can practically handle
> + * only 1/2 of maximum transaction size per operation
> + */
> + if (WARN_ON(blocks > journal->j_max_transaction_buffers / 2)) {
> printk(KERN_ERR "JBD2: %s wants too many credits (%d > %d)\n",
> - current->comm, nblocks,
> - journal->j_max_transaction_buffers);
> + current->comm, blocks,
> + journal->j_max_transaction_buffers / 2);
> return -ENOSPC;
> }
>
> so it's clear why the behavior changed, I guess, but it feels like I
> must be missing something here.

Is there some way to reserve these journal blocks only in the case of
delalloc usage? This has caused a performance regression with Lustre
servers on 3.10 kernels because the journal commits twice as often.
We've worked around this for now by doubling the journal size, but it
seems a bit of a hack since we can never use the whole journal anymore.

> The reproducer, for those playing along at home, is something like:
>
> mkfs.ext4 /dev/sda 20M
> mount /dev/sda /mnt/test
> resize2fs /dev/sda 200M
>
> -Eric
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html


Cheers, Andreas






2015-04-27 16:14:57

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v2] e2fsprogs: Limit number of reserved gdt blocks on small fs

On Fri 24-04-15 22:25:06, Andreas Dilger wrote:
> On Apr 24, 2015, at 3:51 PM, Eric Sandeen <[email protected]> wrote:
> > On 3/25/15 5:46 AM, Lukas Czerner wrote:
> >> Currently we're unable to online resize very small (smaller than 32 MB)
> >> file systems with 1k block size because there is not enough space in the
> >> journal to put all the reserved gdt blocks.
> >
> > So, I'll get to the patch review if I need to, but this all seemed a little
> > odd; this is a regression, so do we really need to restrict things at mkfs
> > time?
> >
> > On the userspace side, things were ok until:
> >
> > 9f6ba88 resize2fs: add support for new in-kernel online resize ioctl
> >
> > and even with that, on the kernelspace side, things were ok until:
> >
> > 8f7d89f jbd2: transaction reservation support
> >
> > I guess I'm trying to understand why that jbd2 commit regressed this.
> > I've not been paying enough attention to ext4 lately. ;)
> >
> > I mean, the threshold got chopped in half:
> >
> > - if (nblocks > journal->j_max_transaction_buffers) {
> > + /*
> > + * 1/2 of transaction can be reserved so we can practically handle
> > + * only 1/2 of maximum transaction size per operation
> > + */
> > + if (WARN_ON(blocks > journal->j_max_transaction_buffers / 2)) {
> > printk(KERN_ERR "JBD2: %s wants too many credits (%d > %d)\n",
> > - current->comm, nblocks,
> > - journal->j_max_transaction_buffers);
> > + current->comm, blocks,
> > + journal->j_max_transaction_buffers / 2);
> > return -ENOSPC;
> > }
> >
> > so it's clear why the behavior changed, I guess, but it feels like I
> > must be missing something here.
>
> Is there some way to reserve these journal blocks only in the case of
> delalloc usage? This has caused a performance regression with Lustre
> servers on 3.10 kernels because the journal commits twice as often.
> We've worked around this for now by doubling the journal size, but it
> seems a bit of a hack since we can never use the whole journal anymore.
Hum, so the above hunk only limits maximum number of credits used by a
single handle. Multiple handles can still consume upto maximum transaction
size buffers (at least that's the intention :). So I don't see how that can
cause the problem you describe. What can happen though is that there are
quite a few outstanding reserved handles and so we have to reserve space
for them in the running transaction. Do you use dioread_nolock option? That
enables the use of reserved handles in ext4 for conversion of unwritten
extents...

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-04-27 16:23:25

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH v2] e2fsprogs: Limit number of reserved gdt blocks on small fs

On 4/27/15 11:14 AM, Jan Kara wrote:
> On Fri 24-04-15 22:25:06, Andreas Dilger wrote:
>> On Apr 24, 2015, at 3:51 PM, Eric Sandeen <[email protected]> wrote:
>>> On 3/25/15 5:46 AM, Lukas Czerner wrote:
>>>> Currently we're unable to online resize very small (smaller than 32 MB)
>>>> file systems with 1k block size because there is not enough space in the
>>>> journal to put all the reserved gdt blocks.
>>>
>>> So, I'll get to the patch review if I need to, but this all seemed a little
>>> odd; this is a regression, so do we really need to restrict things at mkfs
>>> time?
>>>
>>> On the userspace side, things were ok until:
>>>
>>> 9f6ba88 resize2fs: add support for new in-kernel online resize ioctl
>>>
>>> and even with that, on the kernelspace side, things were ok until:
>>>
>>> 8f7d89f jbd2: transaction reservation support
>>>
>>> I guess I'm trying to understand why that jbd2 commit regressed this.
>>> I've not been paying enough attention to ext4 lately. ;)
>>>
>>> I mean, the threshold got chopped in half:
>>>
>>> - if (nblocks > journal->j_max_transaction_buffers) {
>>> + /*
>>> + * 1/2 of transaction can be reserved so we can practically handle
>>> + * only 1/2 of maximum transaction size per operation
>>> + */
>>> + if (WARN_ON(blocks > journal->j_max_transaction_buffers / 2)) {
>>> printk(KERN_ERR "JBD2: %s wants too many credits (%d > %d)\n",
>>> - current->comm, nblocks,
>>> - journal->j_max_transaction_buffers);
>>> + current->comm, blocks,
>>> + journal->j_max_transaction_buffers / 2);
>>> return -ENOSPC;
>>> }
>>>
>>> so it's clear why the behavior changed, I guess, but it feels like I
>>> must be missing something here.
>>
>> Is there some way to reserve these journal blocks only in the case of
>> delalloc usage? This has caused a performance regression with Lustre
>> servers on 3.10 kernels because the journal commits twice as often.
>> We've worked around this for now by doubling the journal size, but it
>> seems a bit of a hack since we can never use the whole journal anymore.
> Hum, so the above hunk only limits maximum number of credits used by a
> single handle. Multiple handles can still consume upto maximum transaction
> size buffers (at least that's the intention :). So I don't see how that can
> cause the problem you describe. What can happen though is that there are
> quite a few outstanding reserved handles and so we have to reserve space
> for them in the running transaction. Do you use dioread_nolock option? That
> enables the use of reserved handles in ext4 for conversion of unwritten
> extents...

You're probably asking Andreas, but just in case, for my testcase, it's
all defaults & standard options.

i.e. just this fails, after the above commit, whereas it worked before.

mkfs.ext4 /dev/sda 20M
mount /dev/sda /mnt/test
resize2fs /dev/sda 200M

-Eric




2015-04-28 12:21:08

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v2] e2fsprogs: Limit number of reserved gdt blocks on small fs

On Mon 27-04-15 11:23:19, Eric Sandeen wrote:
> On 4/27/15 11:14 AM, Jan Kara wrote:
> > On Fri 24-04-15 22:25:06, Andreas Dilger wrote:
> >> On Apr 24, 2015, at 3:51 PM, Eric Sandeen <[email protected]> wrote:
> >>> On 3/25/15 5:46 AM, Lukas Czerner wrote:
> >>>> Currently we're unable to online resize very small (smaller than 32 MB)
> >>>> file systems with 1k block size because there is not enough space in the
> >>>> journal to put all the reserved gdt blocks.
> >>>
> >>> So, I'll get to the patch review if I need to, but this all seemed a little
> >>> odd; this is a regression, so do we really need to restrict things at mkfs
> >>> time?
> >>>
> >>> On the userspace side, things were ok until:
> >>>
> >>> 9f6ba88 resize2fs: add support for new in-kernel online resize ioctl
> >>>
> >>> and even with that, on the kernelspace side, things were ok until:
> >>>
> >>> 8f7d89f jbd2: transaction reservation support
> >>>
> >>> I guess I'm trying to understand why that jbd2 commit regressed this.
> >>> I've not been paying enough attention to ext4 lately. ;)
> >>>
> >>> I mean, the threshold got chopped in half:
> >>>
> >>> - if (nblocks > journal->j_max_transaction_buffers) {
> >>> + /*
> >>> + * 1/2 of transaction can be reserved so we can practically handle
> >>> + * only 1/2 of maximum transaction size per operation
> >>> + */
> >>> + if (WARN_ON(blocks > journal->j_max_transaction_buffers / 2)) {
> >>> printk(KERN_ERR "JBD2: %s wants too many credits (%d > %d)\n",
> >>> - current->comm, nblocks,
> >>> - journal->j_max_transaction_buffers);
> >>> + current->comm, blocks,
> >>> + journal->j_max_transaction_buffers / 2);
> >>> return -ENOSPC;
> >>> }
> >>>
> >>> so it's clear why the behavior changed, I guess, but it feels like I
> >>> must be missing something here.
> >>
> >> Is there some way to reserve these journal blocks only in the case of
> >> delalloc usage? This has caused a performance regression with Lustre
> >> servers on 3.10 kernels because the journal commits twice as often.
> >> We've worked around this for now by doubling the journal size, but it
> >> seems a bit of a hack since we can never use the whole journal anymore.
> > Hum, so the above hunk only limits maximum number of credits used by a
> > single handle. Multiple handles can still consume upto maximum transaction
> > size buffers (at least that's the intention :). So I don't see how that can
> > cause the problem you describe. What can happen though is that there are
> > quite a few outstanding reserved handles and so we have to reserve space
> > for them in the running transaction. Do you use dioread_nolock option? That
> > enables the use of reserved handles in ext4 for conversion of unwritten
> > extents...
>
> You're probably asking Andreas, but just in case, for my testcase, it's
> all defaults & standard options.
>
> i.e. just this fails, after the above commit, whereas it worked before.
>
> mkfs.ext4 /dev/sda 20M
> mount /dev/sda /mnt/test
> resize2fs /dev/sda 200M
Yeah, I understand your failure - transaction reservation has reduced
max transaction size to a half. After that your fs resize exceeds max
transaction size and we are in trouble. I'd prefer solution for that to be
in resize code though because it's really a corner case and I wouldn't like
to slow down the common transaction start path for it...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-04-28 12:24:54

by Lukas Czerner

[permalink] [raw]
Subject: Re: [PATCH v2] e2fsprogs: Limit number of reserved gdt blocks on small fs

On Tue, 28 Apr 2015, Jan Kara wrote:

> Date: Tue, 28 Apr 2015 14:21:02 +0200
> From: Jan Kara <[email protected]>
> To: Eric Sandeen <[email protected]>
> Cc: Jan Kara <[email protected]>, Andreas Dilger <[email protected]>,
> Lukas Czerner <[email protected]>, [email protected]
> Subject: Re: [PATCH v2] e2fsprogs: Limit number of reserved gdt blocks on
> small fs
>
> On Mon 27-04-15 11:23:19, Eric Sandeen wrote:
> > On 4/27/15 11:14 AM, Jan Kara wrote:
> > > On Fri 24-04-15 22:25:06, Andreas Dilger wrote:
> > >> On Apr 24, 2015, at 3:51 PM, Eric Sandeen <[email protected]> wrote:
> > >>> On 3/25/15 5:46 AM, Lukas Czerner wrote:
> > >>>> Currently we're unable to online resize very small (smaller than 32 MB)
> > >>>> file systems with 1k block size because there is not enough space in the
> > >>>> journal to put all the reserved gdt blocks.
> > >>>
> > >>> So, I'll get to the patch review if I need to, but this all seemed a little
> > >>> odd; this is a regression, so do we really need to restrict things at mkfs
> > >>> time?
> > >>>
> > >>> On the userspace side, things were ok until:
> > >>>
> > >>> 9f6ba88 resize2fs: add support for new in-kernel online resize ioctl
> > >>>
> > >>> and even with that, on the kernelspace side, things were ok until:
> > >>>
> > >>> 8f7d89f jbd2: transaction reservation support
> > >>>
> > >>> I guess I'm trying to understand why that jbd2 commit regressed this.
> > >>> I've not been paying enough attention to ext4 lately. ;)
> > >>>
> > >>> I mean, the threshold got chopped in half:
> > >>>
> > >>> - if (nblocks > journal->j_max_transaction_buffers) {
> > >>> + /*
> > >>> + * 1/2 of transaction can be reserved so we can practically handle
> > >>> + * only 1/2 of maximum transaction size per operation
> > >>> + */
> > >>> + if (WARN_ON(blocks > journal->j_max_transaction_buffers / 2)) {
> > >>> printk(KERN_ERR "JBD2: %s wants too many credits (%d > %d)\n",
> > >>> - current->comm, nblocks,
> > >>> - journal->j_max_transaction_buffers);
> > >>> + current->comm, blocks,
> > >>> + journal->j_max_transaction_buffers / 2);
> > >>> return -ENOSPC;
> > >>> }
> > >>>
> > >>> so it's clear why the behavior changed, I guess, but it feels like I
> > >>> must be missing something here.
> > >>
> > >> Is there some way to reserve these journal blocks only in the case of
> > >> delalloc usage? This has caused a performance regression with Lustre
> > >> servers on 3.10 kernels because the journal commits twice as often.
> > >> We've worked around this for now by doubling the journal size, but it
> > >> seems a bit of a hack since we can never use the whole journal anymore.
> > > Hum, so the above hunk only limits maximum number of credits used by a
> > > single handle. Multiple handles can still consume upto maximum transaction
> > > size buffers (at least that's the intention :). So I don't see how that can
> > > cause the problem you describe. What can happen though is that there are
> > > quite a few outstanding reserved handles and so we have to reserve space
> > > for them in the running transaction. Do you use dioread_nolock option? That
> > > enables the use of reserved handles in ext4 for conversion of unwritten
> > > extents...
> >
> > You're probably asking Andreas, but just in case, for my testcase, it's
> > all defaults & standard options.
> >
> > i.e. just this fails, after the above commit, whereas it worked before.
> >
> > mkfs.ext4 /dev/sda 20M
> > mount /dev/sda /mnt/test
> > resize2fs /dev/sda 200M
> Yeah, I understand your failure - transaction reservation has reduced
> max transaction size to a half. After that your fs resize exceeds max
> transaction size and we are in trouble. I'd prefer solution for that to be
> in resize code though because it's really a corner case and I wouldn't like
> to slow down the common transaction start path for it...

Hi Jan,

if you have not already, please see the patch which started the
discussion.

Also I think that aside from the userspace fix, I need to add some
safety to the kernel as well so that people who do run into this
know what's going on.

Thanks!
-Lukas

>
> Honza
>

2015-04-28 15:46:20

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH v2] e2fsprogs: Limit number of reserved gdt blocks on small fs

On 4/28/15 7:24 AM, Luk?? Czerner wrote:
> On Tue, 28 Apr 2015, Jan Kara wrote:
>
>> Date: Tue, 28 Apr 2015 14:21:02 +0200
>> From: Jan Kara <[email protected]>
>> To: Eric Sandeen <[email protected]>
>> Cc: Jan Kara <[email protected]>, Andreas Dilger <[email protected]>,
>> Lukas Czerner <[email protected]>, [email protected]
>> Subject: Re: [PATCH v2] e2fsprogs: Limit number of reserved gdt blocks on
>> small fs
>>
>> On Mon 27-04-15 11:23:19, Eric Sandeen wrote:
>>> On 4/27/15 11:14 AM, Jan Kara wrote:
>>>> On Fri 24-04-15 22:25:06, Andreas Dilger wrote:
>>>>> On Apr 24, 2015, at 3:51 PM, Eric Sandeen <[email protected]> wrote:
>>>>>> On 3/25/15 5:46 AM, Lukas Czerner wrote:
>>>>>>> Currently we're unable to online resize very small (smaller than 32 MB)
>>>>>>> file systems with 1k block size because there is not enough space in the
>>>>>>> journal to put all the reserved gdt blocks.
>>>>>>
>>>>>> So, I'll get to the patch review if I need to, but this all seemed a little
>>>>>> odd; this is a regression, so do we really need to restrict things at mkfs
>>>>>> time?
>>>>>>
>>>>>> On the userspace side, things were ok until:
>>>>>>
>>>>>> 9f6ba88 resize2fs: add support for new in-kernel online resize ioctl
>>>>>>
>>>>>> and even with that, on the kernelspace side, things were ok until:
>>>>>>
>>>>>> 8f7d89f jbd2: transaction reservation support
>>>>>>
>>>>>> I guess I'm trying to understand why that jbd2 commit regressed this.
>>>>>> I've not been paying enough attention to ext4 lately. ;)
>>>>>>
>>>>>> I mean, the threshold got chopped in half:
>>>>>>
>>>>>> - if (nblocks > journal->j_max_transaction_buffers) {
>>>>>> + /*
>>>>>> + * 1/2 of transaction can be reserved so we can practically handle
>>>>>> + * only 1/2 of maximum transaction size per operation
>>>>>> + */
>>>>>> + if (WARN_ON(blocks > journal->j_max_transaction_buffers / 2)) {
>>>>>> printk(KERN_ERR "JBD2: %s wants too many credits (%d > %d)\n",
>>>>>> - current->comm, nblocks,
>>>>>> - journal->j_max_transaction_buffers);
>>>>>> + current->comm, blocks,
>>>>>> + journal->j_max_transaction_buffers / 2);
>>>>>> return -ENOSPC;
>>>>>> }
>>>>>>
>>>>>> so it's clear why the behavior changed, I guess, but it feels like I
>>>>>> must be missing something here.
>>>>>
>>>>> Is there some way to reserve these journal blocks only in the case of
>>>>> delalloc usage? This has caused a performance regression with Lustre
>>>>> servers on 3.10 kernels because the journal commits twice as often.
>>>>> We've worked around this for now by doubling the journal size, but it
>>>>> seems a bit of a hack since we can never use the whole journal anymore.
>>>> Hum, so the above hunk only limits maximum number of credits used by a
>>>> single handle. Multiple handles can still consume upto maximum transaction
>>>> size buffers (at least that's the intention :). So I don't see how that can
>>>> cause the problem you describe. What can happen though is that there are
>>>> quite a few outstanding reserved handles and so we have to reserve space
>>>> for them in the running transaction. Do you use dioread_nolock option? That
>>>> enables the use of reserved handles in ext4 for conversion of unwritten
>>>> extents...
>>>
>>> You're probably asking Andreas, but just in case, for my testcase, it's
>>> all defaults & standard options.
>>>
>>> i.e. just this fails, after the above commit, whereas it worked before.
>>>
>>> mkfs.ext4 /dev/sda 20M
>>> mount /dev/sda /mnt/test
>>> resize2fs /dev/sda 200M
>> Yeah, I understand your failure - transaction reservation has reduced
>> max transaction size to a half. After that your fs resize exceeds max
>> transaction size and we are in trouble. I'd prefer solution for that to be
>> in resize code though because it's really a corner case and I wouldn't like
>> to slow down the common transaction start path for it...
>
> Hi Jan,
>
> if you have not already, please see the patch which started the
> discussion.

It just doesn't feel right to me to place limits on fs geometry for this
reason. We could, but it seems like a stopgap. Can we modify the
resize code so that it doesn't need such a large transaction?

In the "old" resize world, I think we played tricks with restarting
transactions, because until the resize was complete and superblocks were
updated, it was ok if we lost updates to new block groups (or something
like that...) i.e. we didn't need one giant atomic update of the filesystem.

Maybe we can do something similar here? I've kind of lost track
of how resize is working now, TBH.

-Eric

2015-04-29 10:10:31

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v2] e2fsprogs: Limit number of reserved gdt blocks on small fs

On Tue 28-04-15 10:46:12, Eric Sandeen wrote:
> On 4/28/15 7:24 AM, Lukáš Czerner wrote:
> > On Tue, 28 Apr 2015, Jan Kara wrote:
> >
> >> Date: Tue, 28 Apr 2015 14:21:02 +0200
> >> From: Jan Kara <[email protected]>
> >> To: Eric Sandeen <[email protected]>
> >> Cc: Jan Kara <[email protected]>, Andreas Dilger <[email protected]>,
> >> Lukas Czerner <[email protected]>, [email protected]
> >> Subject: Re: [PATCH v2] e2fsprogs: Limit number of reserved gdt blocks on
> >> small fs
> >>
> >> On Mon 27-04-15 11:23:19, Eric Sandeen wrote:
> >>> On 4/27/15 11:14 AM, Jan Kara wrote:
> >>>> On Fri 24-04-15 22:25:06, Andreas Dilger wrote:
> >>>>> On Apr 24, 2015, at 3:51 PM, Eric Sandeen <[email protected]> wrote:
> >>>>>> On 3/25/15 5:46 AM, Lukas Czerner wrote:
> >>>>>>> Currently we're unable to online resize very small (smaller than 32 MB)
> >>>>>>> file systems with 1k block size because there is not enough space in the
> >>>>>>> journal to put all the reserved gdt blocks.
> >>>>>>
> >>>>>> So, I'll get to the patch review if I need to, but this all seemed a little
> >>>>>> odd; this is a regression, so do we really need to restrict things at mkfs
> >>>>>> time?
> >>>>>>
> >>>>>> On the userspace side, things were ok until:
> >>>>>>
> >>>>>> 9f6ba88 resize2fs: add support for new in-kernel online resize ioctl
> >>>>>>
> >>>>>> and even with that, on the kernelspace side, things were ok until:
> >>>>>>
> >>>>>> 8f7d89f jbd2: transaction reservation support
> >>>>>>
> >>>>>> I guess I'm trying to understand why that jbd2 commit regressed this.
> >>>>>> I've not been paying enough attention to ext4 lately. ;)
> >>>>>>
> >>>>>> I mean, the threshold got chopped in half:
> >>>>>>
> >>>>>> - if (nblocks > journal->j_max_transaction_buffers) {
> >>>>>> + /*
> >>>>>> + * 1/2 of transaction can be reserved so we can practically handle
> >>>>>> + * only 1/2 of maximum transaction size per operation
> >>>>>> + */
> >>>>>> + if (WARN_ON(blocks > journal->j_max_transaction_buffers / 2)) {
> >>>>>> printk(KERN_ERR "JBD2: %s wants too many credits (%d > %d)\n",
> >>>>>> - current->comm, nblocks,
> >>>>>> - journal->j_max_transaction_buffers);
> >>>>>> + current->comm, blocks,
> >>>>>> + journal->j_max_transaction_buffers / 2);
> >>>>>> return -ENOSPC;
> >>>>>> }
> >>>>>>
> >>>>>> so it's clear why the behavior changed, I guess, but it feels like I
> >>>>>> must be missing something here.
> >>>>>
> >>>>> Is there some way to reserve these journal blocks only in the case of
> >>>>> delalloc usage? This has caused a performance regression with Lustre
> >>>>> servers on 3.10 kernels because the journal commits twice as often.
> >>>>> We've worked around this for now by doubling the journal size, but it
> >>>>> seems a bit of a hack since we can never use the whole journal anymore.
> >>>> Hum, so the above hunk only limits maximum number of credits used by a
> >>>> single handle. Multiple handles can still consume upto maximum transaction
> >>>> size buffers (at least that's the intention :). So I don't see how that can
> >>>> cause the problem you describe. What can happen though is that there are
> >>>> quite a few outstanding reserved handles and so we have to reserve space
> >>>> for them in the running transaction. Do you use dioread_nolock option? That
> >>>> enables the use of reserved handles in ext4 for conversion of unwritten
> >>>> extents...
> >>>
> >>> You're probably asking Andreas, but just in case, for my testcase, it's
> >>> all defaults & standard options.
> >>>
> >>> i.e. just this fails, after the above commit, whereas it worked before.
> >>>
> >>> mkfs.ext4 /dev/sda 20M
> >>> mount /dev/sda /mnt/test
> >>> resize2fs /dev/sda 200M
> >> Yeah, I understand your failure - transaction reservation has reduced
> >> max transaction size to a half. After that your fs resize exceeds max
> >> transaction size and we are in trouble. I'd prefer solution for that to be
> >> in resize code though because it's really a corner case and I wouldn't like
> >> to slow down the common transaction start path for it...
> >
> > Hi Jan,
> >
> > if you have not already, please see the patch which started the
> > discussion.
>
> It just doesn't feel right to me to place limits on fs geometry for this
> reason. We could, but it seems like a stopgap. Can we modify the
> resize code so that it doesn't need such a large transaction?
>
> In the "old" resize world, I think we played tricks with restarting
> transactions, because until the resize was complete and superblocks were
> updated, it was ok if we lost updates to new block groups (or something
> like that...) i.e. we didn't need one giant atomic update of the filesystem.
>
> Maybe we can do something similar here? I've kind of lost track
> of how resize is working now, TBH.
Well, the fact is we have to limit some fs parameters so that code can be
reasonably simple. IMO already supporting 20 MB filesystem with a journal
is a stretch and choice of number of reserved blocks looks arbitrary but
I guess we shouldn't regress if possible.

So 20 MB filesystem will get 1 MB journal. That means
j_max_transaction_size 256 and thus we allow at most 128 credits in a single
handle. The filesystem has 79 reserved GDT blocks and flex group size 16 so
when resizing to 200 MB when adding full flex group, the resize code in
ext4_flex_group_add() wants a handle with 4*16 + 79 credits which is too
much (143).

That being said the calculation in ext4_flex_group_add() looks too
pessimistic (in the flex_gd->count * 4 part). Sure we need to modify resize
inode and its dindirect block but that's common for all the groups,
similarly for superblock so we can change that to (3 + flex_gd->count) and
even that is too pessimistic since we have EXT4_DESC_PER_BLOCK(sb)
descriptors in one block. So we could shrink it to (3 + 1 + (flex_gd->count +
EXT4_DESC_PER_BLOCK(sb) - 1) / EXT4_DESC_PER_BLOCK(sb)). That will shrink
the total credit estimate down to 84 for that filesystem.

The attached patch fixes the issue for me. Thoughts?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR


Attachments:
(No filename) (6.13 kB)
0001-ext4-Fix-growing-of-tiny-filesystems.patch (1.92 kB)
Download all attachments

2015-04-29 19:50:59

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH v2] e2fsprogs: Limit number of reserved gdt blocks on small fs

On 4/29/15 5:10 AM, Jan Kara wrote:
> On Tue 28-04-15 10:46:12, Eric Sandeen wrote:

...

>> Maybe we can do something similar here? I've kind of lost track
>> of how resize is working now, TBH.
> Well, the fact is we have to limit some fs parameters so that code can be
> reasonably simple. IMO already supporting 20 MB filesystem with a journal
> is a stretch and choice of number of reserved blocks looks arbitrary but
> I guess we shouldn't regress if possible.
>
> So 20 MB filesystem will get 1 MB journal. That means
> j_max_transaction_size 256 and thus we allow at most 128 credits in a single
> handle. The filesystem has 79 reserved GDT blocks and flex group size 16 so
> when resizing to 200 MB when adding full flex group, the resize code in
> ext4_flex_group_add() wants a handle with 4*16 + 79 credits which is too
> much (143).

*nod*

> That being said the calculation in ext4_flex_group_add() looks too
> pessimistic (in the flex_gd->count * 4 part). Sure we need to modify resize
> inode and its dindirect block but that's common for all the groups,
> similarly for superblock so we can change that to (3 + flex_gd->count) and
> even that is too pessimistic since we have EXT4_DESC_PER_BLOCK(sb)
> descriptors in one block. So we could shrink it to (3 + 1 + (flex_gd->count +
> EXT4_DESC_PER_BLOCK(sb) - 1) / EXT4_DESC_PER_BLOCK(sb)). That will shrink
> the total credit estimate down to 84 for that filesystem.
>
> The attached patch fixes the issue for me. Thoughts?
>
> Honza

I think I'm glad you understand it better than me. :) I was surprised at
the large reservation but hadn't worked out why it was a problem.


> From 37ecb96f1ab1e57bc5c1663c1662d658df2f84fd Mon Sep 17 00:00:00 2001
> From: Jan Kara <[email protected]>
> Date: Wed, 29 Apr 2015 11:46:31 +0200
> Subject: [PATCH] ext4: Fix growing of tiny filesystems
>
> The estimate of necessary transaction credits in ext4_flex_group_add()
> is too pessimistic. It reserves credit for sb, resize inode, and resize
> inode dindirect block for each group added in a flex group although they
> are always the same block and thus it is enough to account them only
> once. Also the number of modified GDT block is overestimated since we
> fit EXT4_DESC_PER_BLOCK(sb) descriptors in one block.
>
> Make the estimation more precise. That reduces number of requested
> credits enough that we can grow 20 MB filesystem (which has 1 MB
> journal, 79 reserved GDT blocks, and flex group size 16 by default).
>
> Signed-off-by: Jan Kara <[email protected]>
> ---
> fs/ext4/resize.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
> index 8a8ec6293b19..15b4b3605859 100644
> --- a/fs/ext4/resize.c
> +++ b/fs/ext4/resize.c
> @@ -1432,12 +1432,15 @@ static int ext4_flex_group_add(struct super_block *sb,
> goto exit;
> /*
> * We will always be modifying at least the superblock and GDT
> - * block. If we are adding a group past the last current GDT block,
> + * blocks. If we are adding a group past the last current GDT block,
> * we will also modify the inode and the dindirect block. If we
> * are adding a group with superblock/GDT backups we will also
> * modify each of the reserved GDT dindirect blocks.
> */
> - credit = flex_gd->count * 4 + reserved_gdb;
> + credit = 3; /* sb, resize inode, resize inode dindirect */

yay for excellent comments :)

> + credit += 1 + (flex_gd->count + EXT4_DESC_PER_BLOCK(sb) - 1) /
> + EXT4_DESC_PER_BLOCK(sb); /* GDT blocks */

maybe credit += 1 + DIV_ROUND_UP(flex_gd->count, EXT4_DESC_PER_BLOCK(sb)) ?

> + credit += reserved_gdb; /* Reserved GDT dindirect blocks */
> handle = ext4_journal_start_sb(sb, EXT4_HT_RESIZE, credit);
> if (IS_ERR(handle)) {
> err = PTR_ERR(handle);

Thanks!
-Eric