Hi
Kernel is 2.6.33-rc1-00366-g2f99f5c
The ext4 driver is mounting an ext3 filesystem.
kernel BUG at fs/ext4/inode.c:1063!
\|/ ____ \|/
"@'/ .. \`@"
/_| \__/ |_\
\__U_/
flush-8:0(1137): Kernel bad sw trap 5 [#1]
TSTATE: 0000000080001603 TPC: 0000000000544fb4 TNPC: 0000000000544fb8
Y: 00000000 Not tainted
TPC: <ext4_get_blocks+0x3f4/0x400>
g0: fffff800dc862fe0 g1: 0000000000000001 g2: 0000000000000001 g3: fffff800dc85f9c1
g4: fffff800df305880 g5: 2e68006800002e68 g6: fffff800dc860000 g7: 0000000000833e08
o0: 00000000007b0cb8 o1: 0000000000000427 o2: 0000000000000040 o3: 0000000000000040
o4: fffff800dc8630a0 o5: 000000000000000d sp: fffff800dc8628d1 ret_pc: 0000000000544fac
RPC: <ext4_get_blocks+0x3ec/0x400>
l0: 0000000000000002 l1: fffff800c2674000 l2: fffff800c26740d0 l3: 0000000000000001
l4: fffff800df39d2d0 l5: 00000000000003f9 l6: 0006000000000000 l7: 0000000000001ffe
i0: 0000000000000040 i1: fffff800c2674130 i2: 000000000000064c i3: fffff800c26745a8
i4: fffff800dc863308 i5: 0000000000000003 i6: fffff800dc8629a1 i7: 0000000000545380
I7: <mpage_da_map_blocks+0x80/0x800>
Disabling lock debugging due to kernel taint
Caller[0000000000545380]: mpage_da_map_blocks+0x80/0x800
Caller[00000000005461c0]: mpage_add_bh_to_extent+0x40/0x100
Caller[000000000054642c]: __mpage_da_writepage+0x1ac/0x220
Caller[00000000004a951c]: write_cache_pages+0x19c/0x380
Caller[0000000000545d7c]: ext4_da_writepages+0x27c/0x680
Caller[00000000004a978c]: do_writepages+0x2c/0x60
Caller[00000000004f94cc]: writeback_single_inode+0xcc/0x3c0
Caller[00000000004fa3d8]: writeback_inodes_wb+0x338/0x500
Caller[00000000004fa6e8]: wb_writeback+0x148/0x220
Caller[00000000004fab00]: wb_do_writeback+0x240/0x260
Caller[00000000004fab8c]: bdi_writeback_task+0x6c/0xc0
Caller[00000000004b6f50]: bdi_start_fn+0x70/0xe0
Caller[000000000047030c]: kthread+0x6c/0x80
Caller[000000000042bc9c]: kernel_thread+0x3c/0x60
Caller[0000000000470408]: kthreadd+0xe8/0x160
Instruction DUMP: 92102427 7ffb95a5 901220b8 <91d02005> 01000000 01000000 9de3bf40 11002096 a4100018
note: flush-8:0[1137] exited with preempt_count 1
2009/12/25 Alexander Beregalov <[email protected]>:
> Hi
>
> Kernel is 2.6.33-rc1-00366-g2f99f5c
> Ext4 mounts ext3 filesystem
>
>
>
> kernel BUG at fs/ext4/inode.c:1063!
It seems I can easily reproduce it.
But I can't compile 2.6.33-rc2 :)
scripts/kconfig/conf -s arch/sparc/Kconfig
CHK include/linux/version.h
CHK include/generated/utsrelease.h
CALL scripts/checksyscalls.sh
CHK include/generated/compile.h
GZIP kernel/config_data.gz
CC fs/configfs/inode.o
IKCFG kernel/config_data.h
LD [M] fs/btrfs/btrfs.o
CC kernel/configs.o
fs/btrfs/sysfs.o: file not recognized: File truncated
make[2]: *** [fs/btrfs/btrfs.o] Error 1
make[1]: *** [fs/btrfs] Error 2
make[1]: *** Waiting for unfinished jobs....
On Fri, Dec 25, 2009 at 01:28:34AM +0300, Alexander Beregalov wrote:
>
> Kernel is 2.6.33-rc1-00366-g2f99f5c
> Ext4 mounts ext3 filesystem
>
> kernel BUG at fs/ext4/inode.c:1063!
OK, that's this BUG which is triggering:
        if (mdb_free) {
                /* Account for allocated meta_blocks */
                mdb_claim = EXT4_I(inode)->i_allocated_meta_blocks;
                BUG_ON(mdb_free < mdb_claim);   <------- BUG triggered
                mdb_free -= mdb_claim;
Can you replicate this? If so, I'd like to ask you to replicate with
the following debugging patch applied:
--- fs/ext4/inode.c 2009-12-24 17:55:03.736366001 -0500
+++ fs/ext4/inode.c.new 2009-12-24 18:02:58.556366024 -0500
@@ -1060,6 +1060,10 @@
         if (mdb_free) {
                 /* Account for allocated meta_blocks */
                 mdb_claim = EXT4_I(inode)->i_allocated_meta_blocks;
+                if (mdb_free < mdb_claim)
+                        ext4_msg(inode->i_sb, KERN_ERR, "inode #%lu: "
+                                 "mdb_free (%d) < mdb_claim (%d) BUG\n",
+                                 inode->i_ino, mdb_free, mdb_claim);
                 BUG_ON(mdb_free < mdb_claim);
                 mdb_free -= mdb_claim;
Then once you get the inode number (suppose it's 12345), please send
the output of the following debugfs commands:
debugfs: stat <12345>
debugfs: ncheck 12345
Thanks!!
- Ted
On Thu, Dec 24, 2009 at 06:05:12PM -0500, [email protected] wrote:
> On Fri, Dec 25, 2009 at 01:28:34AM +0300, Alexander Beregalov wrote:
> >
> > Kernel is 2.6.33-rc1-00366-g2f99f5c
> > Ext4 mounts ext3 filesystem
> >
> > kernel BUG at fs/ext4/inode.c:1063!
>
> OK, that's this BUG which is triggering:
>
> if (mdb_free) {
> /* Account for allocated meta_blocks */
> mdb_claim = EXT4_I(inode)->i_allocated_meta_blocks;
> BUG_ON(mdb_free < mdb_claim); <------- BUG triggered
> mdb_free -= mdb_claim;
>
> Can you replicate this? If so, I'd like to ask you to replicate with
> the following debugging patch applied:
Here's a revised version of the patch which avoids the BUG_ON, which
should make it less annoying. We should really figure out what's
going on and fix it, though. It may be fixed by the recently pushed
quota race fixes, or at least there's a good chance that it's related
to an ext4 quota-related WARN_ON that people have been complaining
about.
- Ted
--- /tmp/inode.c 2009-12-24 17:55:03.736366001 -0500
+++ /tmp/inode.c.new 2009-12-24 18:13:07.716366002 -0500
@@ -1060,8 +1060,14 @@
         if (mdb_free) {
                 /* Account for allocated meta_blocks */
                 mdb_claim = EXT4_I(inode)->i_allocated_meta_blocks;
-                BUG_ON(mdb_free < mdb_claim);
-                mdb_free -= mdb_claim;
+                if (mdb_free < mdb_claim) {
+                        ext4_msg(inode->i_sb, KERN_ERR, "inode #%lu: "
+                                 "mdb_free (%d) < mdb_claim (%d) BUG\n",
+                                 inode->i_ino, mdb_free, mdb_claim);
+                        WARN_ON(1);
+                        mdb_free = 0;
+                } else
+                        mdb_free -= mdb_claim;
                 /* update fs dirty blocks counter */
                 percpu_counter_sub(&sbi->s_dirtyblocks_counter, mdb_free);
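
For readers following along, the control flow of the revised hunk can be
modelled in userspace. This is only a sketch, not kernel code: the names
settle_meta_reservation and struct md_release are made up for illustration,
ext4_msg() and WARN_ON() are stood in for by fprintf(), and the percpu
counter update is left out.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical result type: how many reserved metadata blocks to release,
 * and whether the accounting was found to be inconsistent. */
struct md_release {
    int to_release;
    int inconsistent;
};

/* Userspace model of the patched branch. mdb_free is the number of
 * reserved-but-unneeded metadata blocks, mdb_claim the number already
 * allocated. */
struct md_release settle_meta_reservation(int mdb_free, int mdb_claim)
{
    struct md_release r = { 0, 0 };

    assert(mdb_free >= 0 && mdb_claim >= 0);

    if (mdb_free == 0)          /* mirrors "if (mdb_free)" in the hunk */
        return r;

    if (mdb_free < mdb_claim) {
        /* The original code would BUG() here; the revised patch logs,
         * warns, and clamps so the counter is not driven negative. */
        fprintf(stderr, "mdb_free (%d) < mdb_claim (%d)\n",
                mdb_free, mdb_claim);
        r.inconsistent = 1;
        r.to_release = 0;
    } else {
        r.to_release = mdb_free - mdb_claim;
    }
    return r;
}
```

Calling settle_meta_reservation(3, 1) releases 2 blocks; any call with
mdb_free smaller than mdb_claim releases nothing and flags the
inconsistency instead of stopping the machine.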
Alexander Beregalov <[email protected]> writes:
> 2009/12/25 Alexander Beregalov <[email protected]>:
>> Hi
>>
>> Kernel is 2.6.33-rc1-00366-g2f99f5c
>> Ext4 mounts ext3 filesystem
>>
>>
>>
>> kernel BUG at fs/ext4/inode.c:1063!
>
> It seems I can easily reproduce it.
> But I can't compile 2.6.33-rc2 :)
>
> scripts/kconfig/conf -s arch/sparc/Kconfig
> CHK include/linux/version.h
> CHK include/generated/utsrelease.h
> CALL scripts/checksyscalls.sh
> CHK include/generated/compile.h
> GZIP kernel/config_data.gz
> CC fs/configfs/inode.o
> IKCFG kernel/config_data.h
> LD [M] fs/btrfs/btrfs.o
> CC kernel/configs.o
> fs/btrfs/sysfs.o: file not recognized: File truncated
This happens because of delayed allocation. Whenever a BUG or an
unexpected power-off happens during a build, object files usually end
up broken. IMHO this is an expected issue. Since your test case is a
kernel compilation, just recompile from the beginning:
# make clean; make -j4
Strange: I have been living with the quota patches on my notebook for
more than a month (two weeks with the version committed to the quota
git tree), with and without quota, and this has never happened.
Currently I'm trying to reproduce the bug on 2.6.33-rc2.
Please keep me in Cc, because it seems the bug was introduced
(or just triggered) by my quota patches.
> make[2]: *** [fs/btrfs/btrfs.o] Error 1
> make[1]: *** [fs/btrfs] Error 2
> make[1]: *** Waiting for unfinished jobs....
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>> It seems I can easily reproduce it.
>> But I can't compile 2.6.33-rc2 :)
>>
>> scripts/kconfig/conf -s arch/sparc/Kconfig
>> CHK include/linux/version.h
>> CHK include/generated/utsrelease.h
>> CALL scripts/checksyscalls.sh
>> CHK include/generated/compile.h
>> GZIP kernel/config_data.gz
>> CC fs/configfs/inode.o
>> IKCFG kernel/config_data.h
>> LD [M] fs/btrfs/btrfs.o
>> CC kernel/configs.o
>> fs/btrfs/sysfs.o: file not recognized: File truncated
> This happens because of delayed allocation. Each time BUG or
> unexpected power off happens during object files usually becomes
> broken. IMHO this is expected issue. Just recompile from beginning
> # make clean; make -j4
It does not help, it still fails.
I will try to crosscompile the kernel with Ted's patch on another host.
> As soon as your testcase is kernel compilation.
> Strange i'm living with quota patches on my notebook more than
> a month( two weeks with the version committed to quota git tree)
> with and without quota . But this never happens.
> Currently i'm trying to reproduce the bug on 2.6.33-rc2
> Please add keep me in cc because seems the bug was introduced
> (or just triggered) by my quota patches.
>> make[2]: *** [fs/btrfs/btrfs.o] Error 1
>> make[1]: *** [fs/btrfs] Error 2
>> make[1]: *** Waiting for unfinished jobs....
>
Alexander Beregalov <[email protected]> writes:
>>> It seems I can easily reproduce it.
>>> But I can't compile 2.6.33-rc2 :)
BTW, which git commit (sha1) did you use to reproduce the bug?
(2.6.33-rc1 HEAD does not have this BUG_ON.)
It is important for me to know this; alternatively, just post your
fs/ext4/inode.c file.
>>>
>>> scripts/kconfig/conf -s arch/sparc/Kconfig
>>> CHK include/linux/version.h
>>> CHK include/generated/utsrelease.h
>>> CALL scripts/checksyscalls.sh
>>> CHK include/generated/compile.h
>>> GZIP kernel/config_data.gz
>>> CC fs/configfs/inode.o
>>> IKCFG kernel/config_data.h
>>> LD [M] fs/btrfs/btrfs.o
>>> CC kernel/configs.o
>>> fs/btrfs/sysfs.o: file not recognized: File truncated
>> This happens because of delayed allocation. Each time BUG or
>> unexpected power off happens during object files usually becomes
>> broken. IMHO this is expected issue. Just recompile from beginning
>> # make clean; make -j4
>
> It does not help, it still fails.
Strange again; please run fsck. What about compiling from the very
beginning (starting from unpacking the tarball from kernel.org)?
Or maybe compile it on another filesystem (ext3, or ext4 with the
nodelalloc option).
> I will try to crosscompile the kernel with Ted's patch on another host.
>
Sadly, I still cannot reproduce your bug.
So far I have tested the following configurations:
system   : 2.6.33-rc2, x86 two-core CPU with 2GB of RAM
block dev: real SATA drive, loopdev over tmpfs
mkfs     : 4k and 1k blocksize
mount    : w/o quota, quota, journaled quota
quota    : both ON and OFF states
fs-load  : - fsstress with 1, 4, 16, 32 concurrent tasks
           - kernel compilation -j4, -j32
           - In fact, my mail-dir is currently under quota control.
Please clarify your use case:
0) Your system specification: cpu_num, mem_size, page_size (I guess 8k),
   block device.
1) mkfs options
2) mount options
3) quota options (if any)
4) your fs load test case
5) How long does it take you to reproduce the bug?
>> As soon as your testcase is kernel compilation.
>> Strange i'm living with quota patches on my notebook more than
>> a month( two weeks with the version committed to quota git tree)
>> with and without quota . But this never happens.
>> Currently i'm trying to reproduce the bug on 2.6.33-rc2
>> Please add keep me in cc because seems the bug was introduced
>> (or just triggered) by my quota patches.
>>> make[2]: *** [fs/btrfs/btrfs.o] Error 1
>>> make[1]: *** [fs/btrfs] Error 2
>>> make[1]: *** Waiting for unfinished jobs....
>>
It seems Dmitry Torokhov has the same issue, Cc'ed.
2009/12/26 Dmitry Monakhov <[email protected]>:
> Alexander Beregalov <[email protected]> writes:
>
>>>> It seems I can easily reproduce it.
>>>> But I can't compile 2.6.33-rc2 :)
> BTW what sha1 of the git-commit you have used to reproduce
> the bug (2.6.33-rc1 HEAD has no this BUG_ON).
> This is important to me to know it, or just post the
> fs/ext4/inode.c file.
It was in the first post: 2f99f5c.
There is only an OCFS update between it and -rc2.
>>>>
>>>> scripts/kconfig/conf -s arch/sparc/Kconfig
>>>> CHK include/linux/version.h
>>>> CHK include/generated/utsrelease.h
>>>> CALL scripts/checksyscalls.sh
>>>> CHK include/generated/compile.h
>>>> GZIP kernel/config_data.gz
>>>> CC fs/configfs/inode.o
>>>> IKCFG kernel/config_data.h
>>>> LD [M] fs/btrfs/btrfs.o
>>>> CC kernel/configs.o
>>>> fs/btrfs/sysfs.o: file not recognized: File truncated
>>> This happens because of delayed allocation. Each time BUG or
>>> unexpected power off happens during object files usually becomes
>>> broken. IMHO this is expected issue. Just recompile from beginning
>>> # make clean; make -j4
>>
>> It does not help, it still fails.
> Again strange, please run fsck. What about compile it from very
> beginning (start from unpacking tar-ball from kernel.org)
> Or may be compile it on another file-system(ext3 or
> ext4 with nodelalloc option)
I tried fsck; it did not find any problems, and the kernel build still fails afterwards.
>> I will try to crosscompile the kernel with Ted's patch on another host.
Here is the output of 2.6.33-rc2 plus Ted's patch:
EXT4-fs (sda1): inode #1387643: mdb_free (1) < mdb_claim (2) BUG
------------[ cut here ]------------
WARNING: at fs/ext4/inode.c:1067 ext4_get_blocks+0x3f0/0x440()
Modules linked in:
Call Trace:
[0000000000456bb0] warn_slowpath_common+0x50/0xa0
[0000000000456c1c] warn_slowpath_null+0x1c/0x40
[0000000000545010] ext4_get_blocks+0x3f0/0x440
[0000000000545420] mpage_da_map_blocks+0x80/0x800
[0000000000546260] mpage_add_bh_to_extent+0x40/0x100
[00000000005464cc] __mpage_da_writepage+0x1ac/0x220
[00000000004a957c] write_cache_pages+0x19c/0x380
[0000000000545e1c] ext4_da_writepages+0x27c/0x680
[00000000004a97ec] do_writepages+0x2c/0x60
[00000000004f952c] writeback_single_inode+0xcc/0x3c0
[00000000004fa438] writeback_inodes_wb+0x338/0x500
[00000000004fa748] wb_writeback+0x148/0x220
[00000000004fab60] wb_do_writeback+0x240/0x260
[00000000004fabec] bdi_writeback_task+0x6c/0xc0
[00000000004b6fb0] bdi_start_fn+0x70/0xe0
[000000000047036c] kthread+0x6c/0x80
---[ end trace 46a56c443941c84d ]---
>>
> It is sad, but i still can not reproduce your bug.
> At this time i've tested following configurations:
> system : 2.6.33-rc2, x86 two cores cpu with 2GB of ram
> block dev: real sata drive, loopdev over tmpfs
> mkfs : 4k and 1k blocksize
> mount : w/o quota, quota, journaled quota
> quota : both ON and OFF states
> fs-load : - fsstress with 1,4,16,32 concurrent tasks
> - kernel compilation -j4, -j32
> - In fact currently my mail-dir is under quota control.
> Please clarify your use-case:
> 0) Your system speciffication: cpu_num, mem_size, page_size(i guess 8k)
> block device.
UltraSPARC IIe, UP, 2 GB RAM, 8 KB pages, real SCSI disk (sym53c8xx driver)
> 1) mkfs options
I do not remember.
Perhaps dumpe2fs can help
root@v120 ~ # dumpe2fs -h /dev/sda1
dumpe2fs 1.41.9 (22-Aug-2009)
Filesystem volume name: <none>
Last mounted on: /
Filesystem UUID: b34f302e-78a3-4f80-bae6-31639456216c
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery sparse_super large_file
Filesystem flags: signed_directory_hash
Default mount options: (none)
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 2113536
Block count: 8448000
Reserved block count: 422400
Free blocks: 6661110
Free inodes: 1861302
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 1021
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
Filesystem created: Tue Nov 10 00:44:17 2009
Last mount time: Sun Dec 27 20:05:48 2009
Last write time: Sat Dec 26 10:59:00 2009
Mount count: 3
Maximum mount count: 21
Last checked: Sat Dec 26 06:07:50 2009
Check interval: 15552000 (6 months)
Next check after: Thu Jun 24 07:07:50 2010
Lifetime writes: 30 GB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: ae1ec2f1-0f86-4f26-ace5-eb656fd25709
Journal backup: inode blocks
Journal size: 128M
> 2) mount options
noatime
> 3) quota options (if any)
No
> 4) your fs load test-case
Have not tried to find a simpler testcase yet.
make CROSS_COMPILE="ccache sparc64-unknown-linux-gnu-" -j4 zImage modules
Hm, perhaps ccache is the real trigger of the problem.
> 5) How long does it takes you to reproduce the bug.
A few seconds (~5)
On Sun, Dec 27, 2009 at 11:32:25PM +0300, Alexander Beregalov wrote:
> It seems Dmitry Torokhov has the same issue, Cc'ed.
>
> 2009/12/26 Dmitry Monakhov <[email protected]>:
> > Alexander Beregalov <[email protected]> writes:
> >
> >>>> It seems I can easily reproduce it.
> >>>> But I can't compile 2.6.33-rc2 :)
> > BTW what sha1 of the git-commit you have used to reproduce
> > the bug (2.6.33-rc1 HEAD has no this BUG_ON).
> > This is important to me to know it, or just post the
> > fs/ext4/inode.c file.
>
> It was in the first post - 2f99f5c
> There is only OCFS update between it and -rc2.
>
> >>>>
> >>>> scripts/kconfig/conf -s arch/sparc/Kconfig
> >>>>   CHK     include/linux/version.h
> >>>>   CHK     include/generated/utsrelease.h
> >>>>   CALL    scripts/checksyscalls.sh
> >>>>   CHK     include/generated/compile.h
> >>>>   GZIP    kernel/config_data.gz
> >>>>   CC      fs/configfs/inode.o
> >>>>   IKCFG   kernel/config_data.h
> >>>>   LD [M]  fs/btrfs/btrfs.o
> >>>>   CC      kernel/configs.o
> >>>> fs/btrfs/sysfs.o: file not recognized: File truncated
> >>> This happens because of delayed allocation. Each time BUG or
> >>> unexpected power off happens during object files usually becomes
> >>> broken. IMHO this is expected issue. Just recompile from beginning
> >>> # make clean; make -j4
> >>
> >> It does not help, it still fails.
> > Again strange, please run fsck. What about compile it from very
> > beginning (start from unpacking tar-ball from kernel.org)
> > Or may be compile it on another file-system(ext3 or
> > ext4 with nodelalloc option)
>
> I tried fsck, it did not find any problem, kernel build still fails after it.
>
Are you using ccache? I do, and all the breakage is hidden there (so
"make clean" does not help). Just clean your cache and you should be
good to go.
> >> I will try to crosscompile the kernel with Ted's patch on another host.
>
> Here is output of 2.6.33-rc2 plus Ted's patch
>
> EXT4-fs (sda1): inode #1387643: mdb_free (1) < mdb_claim (2) BUG
>
>
> >>
> > It is sad, but i still can not reproduce your bug.
It happens to me as soon as a moderate load is put on an ext3 fs
mounted with the ext4 driver.
> > At this time i've tested following configurations:
> > system   : 2.6.33-rc2, x86 two cores cpu with 2GB of ram
> > block dev: real sata drive, loopdev over tmpfs
> > mkfs     : 4k and 1k blocksize
> > mount    : w/o quota, quota, journaled quota
> > quota    : both ON and OFF states
> > fs-load  : - fsstress with 1,4,16,32 concurrent tasks
> >            - kernel compilation -j4, -j32
> >            - In fact currently my mail-dir is under quota control.
> > Please clarify your use-case:
> > 0) Your system speciffication: cpu_num, mem_size, page_size(i guess 8k)
> >    block device.
> UltraSparc IIe, UP, 2Gb, 8kb, real SCSI disk (sym53c8xx driver)
> > 1) mkfs options
> I do not remember.
> Perhaps dumpe2fs can help
>
>
>
> > 2) mount options
> noatime
> > 3) quota options (if any)
> No
> > 4) your fs load test-case
> Have not tried to find a simpler testcase yet.
> make CROSS_COMPILE="ccache sparc64-unknown-linux-gnu-" -j4 zImage modules
>
> Hm, perhaps ccache is the real trigger of the problem.
>
> > 5) How long does it takes you to reproduce the bug.
> Few seconds (~5)
--
Dmitry
On Sun, Dec 27, 2009 at 11:32:25PM +0300, Alexander Beregalov wrote:
> >> I will try to crosscompile the kernel with Ted's patch on another host.
>
> Here is output of 2.6.33-rc2 plus Ted's patch
>
> EXT4-fs (sda1): inode #1387643: mdb_free (1) < mdb_claim (2) BUG
OK, can you give me the output of:
debugfs /dev/sda1
debugfs: stat <1387643>
debugfs: ncheck 1387643
debugfs: quit
Thanks!!
- Ted
2009/12/28 <[email protected]>:
> On Sun, Dec 27, 2009 at 11:32:25PM +0300, Alexander Beregalov wrote:
>> >> I will try to crosscompile the kernel with Ted's patch on another host.
>>
>> Here is output of 2.6.33-rc2 plus Ted's patch
>>
>> EXT4-fs (sda1): inode #1387643: mdb_free (1) < mdb_claim (2) BUG
>
> OK, can you give me the output of:
>
> debugfs /dev/sda1
> debugfs: stat <1387643>
> debugfs: ncheck 1387643
> debugfs: quit
Cleaning the ccache does not help.
debugfs 1.41.9 (22-Aug-2009)
debugfs: stat <1387643>
Inode: 1387643 Type: regular Mode: 0644 Flags: 0x0
Generation: 2004186252 Version: 0x00000000:00000001
User: 1000 Group: 1003 Size: 11028803
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 21576
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x4b37d65c:04b0870a -- Mon Dec 28 00:49:16 2009
atime: 0x4b37d65a:2803e3e5 -- Mon Dec 28 00:49:14 2009
mtime: 0x4b37d65c:04b0870a -- Mon Dec 28 00:49:16 2009
crtime: 0x5ad6374b:2803e3e5 -- Tue Apr 17 22:04:59 2018
Size of extra inode fields: 28
BLOCKS:
(0-11):172032-172043, (IND):165412, (12-1035):172044-173067,
(DIND):165380, (IND):165381, (1036-2047):173068-174079,
(2048-2059):188416-188427, (IND):165414, (2060-2692):188428-189060
TOTAL: 2697
debugfs: ncheck 1387643
Inode Pathname
1387643 /home/alexb/linux-2.6/kernel/built-in.o
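
The numbers in this stat output are self-consistent for a file mapped with
ext3-style indirect blocks, which can be checked with a few lines of C.
This is a sketch under stated assumptions: 4096-byte blocks (as reported by
dumpe2fs earlier in the thread), 4-byte block pointers, 12 direct pointers
in the inode, and no holes in the file. The function names are made up for
the example, and sizes beyond the triple-indirect limit are not validated.

```c
#include <assert.h>
#include <stdint.h>

#define NDIR_BLOCKS 12  /* direct block pointers in an ext2/ext3 inode */

/* Number of data blocks needed to hold size_bytes. */
uint64_t blocks_for_size(uint64_t size_bytes, uint32_t block_size)
{
    return (size_bytes + block_size - 1) / block_size;
}

/* Number of indirect (metadata) blocks needed to map data_blocks
 * contiguous logical blocks with the classic indirect scheme. */
uint64_t indirect_blocks_needed(uint64_t data_blocks, uint32_t block_size)
{
    assert(block_size >= 1024 && block_size % 4 == 0);

    uint64_t ptrs = block_size / 4;     /* 32-bit pointers per block */
    uint64_t dind_cap = ptrs * ptrs;    /* blocks reachable via DIND */
    uint64_t meta = 0;

    if (data_blocks <= NDIR_BLOCKS)
        return 0;
    data_blocks -= NDIR_BLOCKS;

    /* Single indirect: one IND block maps up to ptrs data blocks. */
    meta += 1;
    if (data_blocks <= ptrs)
        return meta;
    data_blocks -= ptrs;

    /* Double indirect: one DIND block plus one IND per ptrs data blocks. */
    uint64_t n = data_blocks < dind_cap ? data_blocks : dind_cap;
    meta += 1 + (n + ptrs - 1) / ptrs;
    if (data_blocks <= dind_cap)
        return meta;
    data_blocks -= dind_cap;

    /* Triple indirect: one TIND, then DIND and IND blocks as needed. */
    meta += 1 + (data_blocks + dind_cap - 1) / dind_cap
              + (data_blocks + ptrs - 1) / ptrs;
    return meta;
}
```

With Size 11028803 this gives 2693 data blocks plus 4 indirect blocks (one
IND, one DIND, and two IND under it), i.e. 2697 blocks in total, which
matches "TOTAL: 2697" and Blockcount 21576 (2697 * 8 sectors of 512 bytes).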
~/linux-2.6 $ rm kernel/built-in.o
~/linux-2.6 $ sync
~/linux-2.6 $ kmake
CHK include/linux/version.h
CHK include/generated/utsrelease.h
CALL scripts/checksyscalls.sh
CHK include/generated/compile.h
LD kernel/built-in.o
LD [M] fs/btrfs/btrfs.o
fs/btrfs/relocation.o: file not recognized: File truncated
kernel BUG at fs/ext4/inode.c:1063!
> >> Here is output of 2.6.33-rc2 plus Ted's patch
> >>
> >> EXT4-fs (sda1): inode #1387643: mdb_free (1) < mdb_claim (2) BUG
> >
OK, I've been able to reproduce the problem using xfsqa test #74
(fstest) when an ext3 file system is mounted with the ext4 file system
driver. I was then able to bisect it down to commit d21cd8f1, which
was introduced between 2.6.33-rc1 and 2.6.33-rc2 as part of the
quota/ext4 patch series pushed by Jan.
I then tested v2.6.33-rc2 with commit d21cd8 reverted, and I was not
able to replicate the BUG. More investigation is needed, but weighing
the potential quota deadlock against the apparently
fairly-easy-to-replicate BUG, we should probably just revert this
commit for now if we can't find a better fix fairly quickly.
- Ted
commit d21cd8f163ac44b15c465aab7306db931c606908
Author: Dmitry Monakhov <[email protected]>
AuthorDate: Thu Dec 10 03:31:45 2009 +0000
Commit: Jan Kara <[email protected]>
CommitDate: Wed Dec 23 13:44:12 2009 +0100
ext4: Fix potential quota deadlock
We have to delay vfs_dq_claim_space() until allocation context destruction.
Currently we have following call-trace:
ext4_mb_new_blocks()
/* task is already holding ac->alloc_semp */
->ext4_mb_mark_diskspace_used
->vfs_dq_claim_space() /* acquire dqptr_sem here. Possible deadlock */
->ext4_mb_release_context() /* drop ac->alloc_semp here */
Let's move quota claiming to ext4_da_update_reserve_space()
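
The ordering problem described above can be illustrated with a toy
lock-order checker in the spirit of lockdep. This is a sketch, not kernel
code: lock_graph, lock_graph_add and the enum names are invented for
illustration, and the edges a caller feeds in are an assumption about which
lock was held while another was taken (for this report, one reading of the
dependency chain in the lockdep output quoted in this commit message).

```c
#include <assert.h>
#include <string.h>

#define MAX_LOCKS 8

/* Illustrative indices for the lock classes named in the report. */
enum { DQPTR_SEM, QUOTA_I_MUTEX, I_DATA_SEM, ALLOC_SEM };

/* edge[a][b] != 0 means "a was held while b was acquired". */
struct lock_graph {
    unsigned char edge[MAX_LOCKS][MAX_LOCKS];
};

void lock_graph_init(struct lock_graph *g)
{
    memset(g, 0, sizeof(*g));
}

/* Depth-first search: is 'to' reachable from 'from' along recorded edges? */
static int reachable(const struct lock_graph *g, int from, int to,
                     unsigned char *seen)
{
    if (from == to)
        return 1;
    seen[from] = 1;
    for (int next = 0; next < MAX_LOCKS; next++)
        if (g->edge[from][next] && !seen[next] &&
            reachable(g, next, to, seen))
            return 1;
    return 0;
}

/* Record "held -> acquiring". Returns 1 if the new edge closes a cycle
 * (a possible deadlock), 0 otherwise. The edge is recorded either way. */
int lock_graph_add(struct lock_graph *g, int held, int acquiring)
{
    unsigned char seen[MAX_LOCKS] = { 0 };

    assert(held >= 0 && held < MAX_LOCKS);
    assert(acquiring >= 0 && acquiring < MAX_LOCKS);

    int cycle = reachable(g, acquiring, held, seen);
    g->edge[held][acquiring] = 1;
    return cycle;
}
```

Feeding in DQPTR_SEM -> QUOTA_I_MUTEX, QUOTA_I_MUTEX -> I_DATA_SEM and
I_DATA_SEM -> ALLOC_SEM returns 0 each time; adding ALLOC_SEM -> DQPTR_SEM,
which corresponds to claiming quota while the allocation semaphore is held,
returns 1 because it closes the cycle.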
=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.32-rc7 #18
-------------------------------------------------------
write-truncate-/3465 is trying to acquire lock:
(&s->s_dquot.dqptr_sem){++++..}, at: [<c025e73b>] dquot_claim_space+0x3b/0x1b0
but task is already holding lock:
(&meta_group_info[i]->alloc_sem){++++..}, at: [<c02ce962>] ext4_mb_load_buddy+0xb2/0x370
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (&meta_group_info[i]->alloc_sem){++++..}:
[<c017d04b>] __lock_acquire+0xd7b/0x1260
[<c017d5ea>] lock_acquire+0xba/0xd0
[<c0527191>] down_read+0x51/0x90
[<c02ce962>] ext4_mb_load_buddy+0xb2/0x370
[<c02d0c1c>] ext4_mb_free_blocks+0x46c/0x870
[<c029c9d3>] ext4_free_blocks+0x73/0x130
[<c02c8cfc>] ext4_ext_truncate+0x76c/0x8d0
[<c02a8087>] ext4_truncate+0x187/0x5e0
[<c01e0f7b>] vmtruncate+0x6b/0x70
[<c022ec02>] inode_setattr+0x62/0x190
[<c02a2d7a>] ext4_setattr+0x25a/0x370
[<c022ee81>] notify_change+0x151/0x340
[<c021349d>] do_truncate+0x6d/0xa0
[<c0221034>] may_open+0x1d4/0x200
[<c022412b>] do_filp_open+0x1eb/0x910
[<c021244d>] do_sys_open+0x6d/0x140
[<c021258e>] sys_open+0x2e/0x40
[<c0103100>] sysenter_do_call+0x12/0x32
-> #2 (&ei->i_data_sem){++++..}:
[<c017d04b>] __lock_acquire+0xd7b/0x1260
[<c017d5ea>] lock_acquire+0xba/0xd0
[<c0527191>] down_read+0x51/0x90
[<c02a5787>] ext4_get_blocks+0x47/0x450
[<c02a74c1>] ext4_getblk+0x61/0x1d0
[<c02a7a7f>] ext4_bread+0x1f/0xa0
[<c02bcddc>] ext4_quota_write+0x12c/0x310
[<c0262d23>] qtree_write_dquot+0x93/0x120
[<c0261708>] v2_write_dquot+0x28/0x30
[<c025d3fb>] dquot_commit+0xab/0xf0
[<c02be977>] ext4_write_dquot+0x77/0x90
[<c02be9bf>] ext4_mark_dquot_dirty+0x2f/0x50
[<c025e321>] dquot_alloc_inode+0x101/0x180
[<c029fec2>] ext4_new_inode+0x602/0xf00
[<c02ad789>] ext4_create+0x89/0x150
[<c0221ff2>] vfs_create+0xa2/0xc0
[<c02246e7>] do_filp_open+0x7a7/0x910
[<c021244d>] do_sys_open+0x6d/0x140
[<c021258e>] sys_open+0x2e/0x40
[<c0103100>] sysenter_do_call+0x12/0x32
-> #1 (&sb->s_type->i_mutex_key#7/4){+.+...}:
[<c017d04b>] __lock_acquire+0xd7b/0x1260
[<c017d5ea>] lock_acquire+0xba/0xd0
[<c0526505>] mutex_lock_nested+0x65/0x2d0
[<c0260c9d>] vfs_load_quota_inode+0x4bd/0x5a0
[<c02610af>] vfs_quota_on_path+0x5f/0x70
[<c02bc812>] ext4_quota_on+0x112/0x190
[<c026345a>] sys_quotactl+0x44a/0x8a0
[<c0103100>] sysenter_do_call+0x12/0x32
-> #0 (&s->s_dquot.dqptr_sem){++++..}:
[<c017d361>] __lock_acquire+0x1091/0x1260
[<c017d5ea>] lock_acquire+0xba/0xd0
[<c0527191>] down_read+0x51/0x90
[<c025e73b>] dquot_claim_space+0x3b/0x1b0
[<c02cb95f>] ext4_mb_mark_diskspace_used+0x36f/0x380
[<c02d210a>] ext4_mb_new_blocks+0x34a/0x530
[<c02c83fb>] ext4_ext_get_blocks+0x122b/0x13c0
[<c02a5966>] ext4_get_blocks+0x226/0x450
[<c02a5ff3>] mpage_da_map_blocks+0xc3/0xaa0
[<c02a6ed6>] ext4_da_writepages+0x506/0x790
[<c01de272>] do_writepages+0x22/0x50
[<c01d766d>] __filemap_fdatawrite_range+0x6d/0x80
[<c01d7b9b>] filemap_flush+0x2b/0x30
[<c02a40ac>] ext4_alloc_da_blocks+0x5c/0x60
[<c029e595>] ext4_release_file+0x75/0xb0
[<c0216b59>] __fput+0xf9/0x210
[<c0216c97>] fput+0x27/0x30
[<c02122dc>] filp_close+0x4c/0x80
[<c014510e>] put_files_struct+0x6e/0xd0
[<c01451b7>] exit_files+0x47/0x60
[<c0146a24>] do_exit+0x144/0x710
[<c0147028>] do_group_exit+0x38/0xa0
[<c0159abc>] get_signal_to_deliver+0x2ac/0x410
[<c0102849>] do_notify_resume+0xb9/0x890
[<c01032d2>] work_notifysig+0x13/0x21
other info that might help us debug this:
3 locks held by write-truncate-/3465:
#0: (jbd2_handle){+.+...}, at: [<c02e1f8f>] start_this_handle+0x38f/0x5c0
#1: (&ei->i_data_sem){++++..}, at: [<c02a57f6>] ext4_get_blocks+0xb6/0x450
#2: (&meta_group_info[i]->alloc_sem){++++..}, at: [<c02ce962>] ext4_mb_load_buddy+0xb2/0x370
stack backtrace:
Pid: 3465, comm: write-truncate- Not tainted 2.6.32-rc7 #18
Call Trace:
[<c0524cb3>] ? printk+0x1d/0x22
[<c017ac9a>] print_circular_bug+0xca/0xd0
[<c017d361>] __lock_acquire+0x1091/0x1260
[<c016bca2>] ? sched_clock_local+0xd2/0x170
[<c0178fd0>] ? trace_hardirqs_off_caller+0x20/0xd0
[<c017d5ea>] lock_acquire+0xba/0xd0
[<c025e73b>] ? dquot_claim_space+0x3b/0x1b0
[<c0527191>] down_read+0x51/0x90
[<c025e73b>] ? dquot_claim_space+0x3b/0x1b0
[<c025e73b>] dquot_claim_space+0x3b/0x1b0
[<c02cb95f>] ext4_mb_mark_diskspace_used+0x36f/0x380
[<c02d210a>] ext4_mb_new_blocks+0x34a/0x530
[<c02c601d>] ? ext4_ext_find_extent+0x25d/0x280
[<c02c83fb>] ext4_ext_get_blocks+0x122b/0x13c0
[<c016bca2>] ? sched_clock_local+0xd2/0x170
[<c016be60>] ? sched_clock_cpu+0x120/0x160
[<c016beef>] ? cpu_clock+0x4f/0x60
[<c0178fd0>] ? trace_hardirqs_off_caller+0x20/0xd0
[<c052712c>] ? down_write+0x8c/0xa0
[<c02a5966>] ext4_get_blocks+0x226/0x450
[<c016be60>] ? sched_clock_cpu+0x120/0x160
[<c016beef>] ? cpu_clock+0x4f/0x60
[<c017908b>] ? trace_hardirqs_off+0xb/0x10
[<c02a5ff3>] mpage_da_map_blocks+0xc3/0xaa0
[<c01d69cc>] ? find_get_pages_tag+0x16c/0x180
[<c01d6860>] ? find_get_pages_tag+0x0/0x180
[<c02a73bd>] ? __mpage_da_writepage+0x16d/0x1a0
[<c01dfc4e>] ? pagevec_lookup_tag+0x2e/0x40
[<c01ddf1b>] ? write_cache_pages+0xdb/0x3d0
[<c02a7250>] ? __mpage_da_writepage+0x0/0x1a0
[<c02a6ed6>] ext4_da_writepages+0x506/0x790
[<c016beef>] ? cpu_clock+0x4f/0x60
[<c016bca2>] ? sched_clock_local+0xd2/0x170
[<c016be60>] ? sched_clock_cpu+0x120/0x160
[<c016be60>] ? sched_clock_cpu+0x120/0x160
[<c02a69d0>] ? ext4_da_writepages+0x0/0x790
[<c01de272>] do_writepages+0x22/0x50
[<c01d766d>] __filemap_fdatawrite_range+0x6d/0x80
[<c01d7b9b>] filemap_flush+0x2b/0x30
[<c02a40ac>] ext4_alloc_da_blocks+0x5c/0x60
[<c029e595>] ext4_release_file+0x75/0xb0
[<c0216b59>] __fput+0xf9/0x210
[<c0216c97>] fput+0x27/0x30
[<c02122dc>] filp_close+0x4c/0x80
[<c014510e>] put_files_struct+0x6e/0xd0
[<c01451b7>] exit_files+0x47/0x60
[<c0146a24>] do_exit+0x144/0x710
[<c017b163>] ? lock_release_holdtime+0x33/0x210
[<c0528137>] ? _spin_unlock_irq+0x27/0x30
[<c0147028>] do_group_exit+0x38/0xa0
[<c017babb>] ? trace_hardirqs_on+0xb/0x10
[<c0159abc>] get_signal_to_deliver+0x2ac/0x410
[<c0102849>] do_notify_resume+0xb9/0x890
[<c0178fd0>] ? trace_hardirqs_off_caller+0x20/0xd0
[<c017b163>] ? lock_release_holdtime+0x33/0x210
[<c0165b50>] ? autoremove_wake_function+0x0/0x50
[<c017ba54>] ? trace_hardirqs_on_caller+0x134/0x190
[<c017babb>] ? trace_hardirqs_on+0xb/0x10
[<c0300ba4>] ? security_file_permission+0x14/0x20
[<c0215761>] ? vfs_write+0x131/0x190
[<c0214f50>] ? do_sync_write+0x0/0x120
[<c0103115>] ? sysenter_do_call+0x27/0x32
[<c01032d2>] work_notifysig+0x13/0x21
CC: Theodore Ts'o <[email protected]>
Signed-off-by: Dmitry Monakhov <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
On Sun, Dec 27, 2009 at 10:51:59PM -0500, [email protected] wrote:
> OK, I've been able to reproduce the problem using xfsqa test #74
> (fstest) when an ext3 file system is mounted with the ext4 file system
> driver. I was then able to bisect it down to commit d21cd8f6, which
> was introduced between 2.6.33-rc1 and 2.6.33-rc2, as part of
> quota/ext4 patch series pushed by Jan.
OK, here's a patch which I think should avoid the BUG in
fs/ext4/inode.c. It should fix the regression, but in the long run we
need to pretty seriously rethink how we account for the need for
potentially new meta-data blocks when doing delayed allocation.
The remaining problem with this machinery is that
ext4_da_update_reserve_space() and ext4_da_release_space() both try to
calculate how many metadata blocks will potentially be required by
calling ext4_calc_metadata_amount() on the number of delayed
allocation blocks found in i_reserved_data_blocks. The problem is
that ext4_calc_metadata_amount() assumes that the blocks passed to it
are contiguous, and what might be left remaining to be written in the
page cache could be anything but contiguous. This is a problem which
has always been there, so it's not a regression per se; just a design
flaw.
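To make the contiguity assumption concrete, here is a toy userspace sketch. It is not the kernel's ext4_calc_metadata_amount(); the function names, the single level of index blocks, and the value 1024 for pointers per block are all assumptions made for illustration. It compares a count-only estimate with the number of leaf index blocks actually touched by a given (sorted) set of logical block numbers.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical count-only estimate: treats the n pending blocks as one
 * contiguous run, so they share leaf index blocks as densely as possible.
 */
static unsigned long estimate_contiguous(unsigned long n,
                                         unsigned long ptrs_per_block)
{
    assert(ptrs_per_block > 0);
    return (n + ptrs_per_block - 1) / ptrs_per_block;
}

/*
 * Counts the distinct leaf index blocks touched by the given logical
 * block numbers. The array must be sorted in ascending order. Only one
 * level of index blocks is modelled; deeper levels are ignored.
 */
static unsigned long count_leaf_blocks(const unsigned long *lblk, size_t n,
                                       unsigned long ptrs_per_block)
{
    unsigned long count = 0;
    size_t i;

    assert(ptrs_per_block > 0);
    for (i = 0; i < n; i++) {
        /* A new leaf is needed whenever we cross into a new index range. */
        if (i == 0 ||
            lblk[i] / ptrs_per_block != lblk[i - 1] / ptrs_per_block)
            count++;
    }
    return count;
}
```

In this model, four pending blocks at logical offsets 0, 5000, 10000 and 15000 touch four different leaf blocks, while the count-only estimate for four blocks is one. That gap is the under-estimate described above.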
The patch below should fix the regression caused by commit d21cd8f,
but we need to look much more closely to find a better way of
accounting for the potential need for metadata for inodes facing
delayed allocation. Could people who are having problems with the BUG
in line 1063 of fs/ext4/inode.c try this patch?
Thanks!!
- Ted
commit 48b71e562ecd35ab12f6b6420a92fb3c9145da92
Author: Theodore Ts'o <[email protected]>
Date: Wed Dec 30 00:04:04 2009 -0500
ext4: Patch up how we claim metadata blocks for quota purposes
Commit d21cd8f triggered a BUG in the function
ext4_da_update_reserve_space() found in fs/ext4/inode.c, which was
caused by the fact that ext4_calc_metadata_amount() can over-estimate how
many metadata blocks will be needed, especially when using direct
block-mapped files. Work around this by not claiming any more
metadata blocks than we are prepared to claim at this point.
Signed-off-by: "Theodore Ts'o" <[email protected]>
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3e3b454..d6e84b4 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1058,14 +1058,23 @@ static void ext4_da_update_reserve_space(struct inode *inode, int used)
mdb_free = EXT4_I(inode)->i_reserved_meta_blocks - mdb;
if (mdb_free) {
- /* Account for allocated meta_blocks */
+ /*
+ * Account for allocated meta_blocks; it is possible
+ * for us to have allocated more meta blocks than we
+ * are prepared to free at this point. This is
+ * because ext4_calc_metadata_amount can over-estimate
+ * how many blocks are still needed. So we may not be
+ * able to claim all of the allocated meta blocks
+ * right away. The accounting will work out in the end...
+ */
mdb_claim = EXT4_I(inode)->i_allocated_meta_blocks;
- BUG_ON(mdb_free < mdb_claim);
+ if (mdb_free < mdb_claim)
+ mdb_claim = mdb_free;
mdb_free -= mdb_claim;
/* update fs dirty blocks counter */
percpu_counter_sub(&sbi->s_dirtyblocks_counter, mdb_free);
- EXT4_I(inode)->i_allocated_meta_blocks = 0;
+ EXT4_I(inode)->i_allocated_meta_blocks -= mdb_claim;
EXT4_I(inode)->i_reserved_meta_blocks = mdb;
}
@@ -1845,7 +1854,7 @@ repeat:
static void ext4_da_release_space(struct inode *inode, int to_free)
{
struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
- int total, mdb, mdb_free, release;
+ int total, mdb, mdb_free, mdb_claim, release;
if (!to_free)
return; /* Nothing to release, exit */
@@ -1874,6 +1883,16 @@ static void ext4_da_release_space(struct inode *inode, int to_free)
BUG_ON(mdb > EXT4_I(inode)->i_reserved_meta_blocks);
mdb_free = EXT4_I(inode)->i_reserved_meta_blocks - mdb;
+ if (mdb_free) {
+ /* Account for allocated meta_blocks */
+ mdb_claim = EXT4_I(inode)->i_allocated_meta_blocks;
+ if (mdb_free < mdb_claim)
+ mdb_claim = mdb_free;
+ mdb_free -= mdb_claim;
+
+ EXT4_I(inode)->i_allocated_meta_blocks -= mdb_claim;
+ }
+
release = to_free + mdb_free;
/* update fs dirty blocks counter for truncate case */
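A small userspace model of the clamping in the first hunk may help when reasoning about the counters. This is only a sketch: the struct and function names are hypothetical stand-ins for the per-inode fields, and locking, the dirty-blocks counter and the quota calls are left out.

```c
#include <assert.h>

/* Hypothetical stand-in for the per-inode delalloc counters. */
struct da_counters {
    int reserved_meta;   /* models i_reserved_meta_blocks */
    int allocated_meta;  /* models i_allocated_meta_blocks */
};

/* What one accounting pass decided to do. */
struct da_result {
    int claimed;   /* metadata blocks claimed against quota */
    int released;  /* surplus reservation that can be dropped */
};

/*
 * mdb_needed models the value returned by the metadata estimate for the
 * data blocks that are still pending.
 */
static struct da_result settle_meta(struct da_counters *c, int mdb_needed)
{
    struct da_result r = { 0, 0 };
    int mdb_free, mdb_claim;

    assert(mdb_needed <= c->reserved_meta);
    mdb_free = c->reserved_meta - mdb_needed;

    if (mdb_free) {
        mdb_claim = c->allocated_meta;
        /* Clamp instead of BUG_ON(): claim only what is being freed. */
        if (mdb_free < mdb_claim)
            mdb_claim = mdb_free;
        mdb_free -= mdb_claim;

        /* The unclaimed remainder stays pending for a later pass. */
        c->allocated_meta -= mdb_claim;
        c->reserved_meta = mdb_needed;

        r.claimed = mdb_claim;
        r.released = mdb_free;
    }
    return r;
}
```

With reserved_meta = 5, allocated_meta = 3 and mdb_needed = 4, the old BUG_ON(mdb_free < mdb_claim) would have fired (1 < 3). In this model one block is claimed and two remain in allocated_meta.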
[email protected] writes:
> On Sun, Dec 27, 2009 at 10:51:59PM -0500, [email protected] wrote:
>> OK, I've been able to reproduce the problem using xfsqa test #74
>> (fstest) when an ext3 file system is mounted with the ext4 file system
>> driver. I was then able to bisect it down to commit d21cd8f6, which
>> was introduced between 2.6.33-rc1 and 2.6.33-rc2, as part of
>> quota/ext4 patch series pushed by Jan.
>
> OK, here's a patch which I think should avoid the BUG in
> fs/ext4/inode.c. It should fix the regression, but in the long run we
> need to pretty seriously rethink how we account for the need for
> potentially new meta-data blocks when doing delayed allocation.
>
> The remaining problem with this machinery is that
> ext4_da_update_reserve_space() and ext4_da_release_space() both try to
> calculate how many metadata blocks will potentially be required by
> calling ext4_calc_metadata_amount() on the number of delayed
> allocation blocks found in i_reserved_data_blocks. The problem is
> that ext4_calc_metadata_amount() assumes that the blocks passed to it
> are contiguous, and what might be left remaining to be written in the
> page cache could be anything but contiguous. This is a problem which
> has always been there, so it's not a regression per se; just a design
> flaw.
Hello, I've finally been able to reproduce the issue. I agree with your
diagnosis. But while looking into the code I found some questions;
see later in this message.
>
> The patch below should fix the regression caused by commit d21cd8f,
> but we need to look much more closely to find a better way of
> accounting for the potential need for metadata for inodes facing
> delayed allocation. Could people who are having problems with the BUG
> in line 1063 of fs/ext4/inode.c try this patch?
>
> Thanks!!
>
> - Ted
>
>
> commit 48b71e562ecd35ab12f6b6420a92fb3c9145da92
> Author: Theodore Ts'o <[email protected]>
> Date: Wed Dec 30 00:04:04 2009 -0500
>
> ext4: Patch up how we claim metadata blocks for quota purposes
>
> Commit d21cd8f triggered a BUG in the function
> ext4_da_update_reserve_space() found in fs/ext4/inode.c, which was
> caused by the fact that ext4_calc_metadata_amount() can over-estimate how
> many metadata blocks will be needed, especially when using direct
> block-mapped files. Work around this by not claiming any more
> metadata blocks than we are prepared to claim at this point.
>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 3e3b454..d6e84b4 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1058,14 +1058,23 @@ static void ext4_da_update_reserve_space(struct inode *inode, int used)
> mdb_free = EXT4_I(inode)->i_reserved_meta_blocks - mdb;
>
> if (mdb_free) {
> - /* Account for allocated meta_blocks */
> + /*
> + * Account for allocated meta_blocks; it is possible
> + * for us to have allocated more meta blocks than we
> + * are prepared to free at this point. This is
> + * because ext4_calc_metadata_amount can over-estimate
> + * how many blocks are still needed. So we may not be
> + * able to claim all of the allocated meta blocks
> + * right away. The accounting will work out in the end...
> + */
> mdb_claim = EXT4_I(inode)->i_allocated_meta_blocks;
> - BUG_ON(mdb_free < mdb_claim);
> + if (mdb_free < mdb_claim)
> + mdb_claim = mdb_free;
> mdb_free -= mdb_claim;
>
> /* update fs dirty blocks counter */
> percpu_counter_sub(&sbi->s_dirtyblocks_counter, mdb_free);
> - EXT4_I(inode)->i_allocated_meta_blocks = 0;
> + EXT4_I(inode)->i_allocated_meta_blocks -= mdb_claim;
> EXT4_I(inode)->i_reserved_meta_blocks = mdb;
> }
>
> @@ -1845,7 +1854,7 @@ repeat:
> static void ext4_da_release_space(struct inode *inode, int to_free)
> {
> struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> - int total, mdb, mdb_free, release;
> + int total, mdb, mdb_free, mdb_claim, release;
>
> if (!to_free)
> return; /* Nothing to release, exit */
> @@ -1874,6 +1883,16 @@ static void ext4_da_release_space(struct inode *inode, int to_free)
> BUG_ON(mdb > EXT4_I(inode)->i_reserved_meta_blocks);
> mdb_free = EXT4_I(inode)->i_reserved_meta_blocks - mdb;
>
> + if (mdb_free) {
> + /* Account for allocated meta_blocks */
> + mdb_claim = EXT4_I(inode)->i_allocated_meta_blocks;
> + if (mdb_free < mdb_claim)
> + mdb_claim = mdb_free;
> + mdb_free -= mdb_claim;
> +
> + EXT4_I(inode)->i_allocated_meta_blocks -= mdb_claim;
> + }
> +
It seems that this is not enough.
Just imagine we have the following call trace:
userspace pwrite(fd, d, 1000, off)
->ext4_da_reserve_space(inode, 1000)
->dq_reserve_space(1000 + md_needed)
userspace ftruncate(fd, off) /* "off" is the same as in pwrite call */
->ext4_da_invalidatepage()
->ext4_da_page_release_reservation()
->ext4_da_release_space()
<<< And we decrease ->i_allocated_meta_blocks only if (mdb_free > 0)
userspace close(fd)
So reserved metadata blocks will leak. I'm able to reproduce it like this:
quotacheck -cu /mnt
quotaon /mnt
fsstress -p 16 -d /mnt -l999999999 -n99999999&
sleep 180
killall -9 fsstress
sync; sync;
cp /mnt/aquota.user q1
quotaoff /mnt
quotacheck -cu /mnt/ # recalculate real quota usage.
cp /mnt/aquota.user q2
diff -up q1 q2 # in my case I found 1 block leaked.
IMHO we could drop i_allocated_meta_blocks in ext4_release_file().
But while looking into this function I found another question,
about locking:
static int ext4_release_file(struct inode *inode, struct file *filp)
{
if (EXT4_I(inode)->i_state & EXT4_STATE_DA_ALLOC_CLOSE) {
ext4_alloc_da_blocks(inode);
EXT4_I(inode)->i_state &= ~EXT4_STATE_DA_ALLOC_CLOSE;
<<< It seems that the i_state modification must be protected by i_mutex,
but currently the caller doesn't have to hold it.
.....
}
> release = to_free + mdb_free;
>
> /* update fs dirty blocks counter for truncate case */
On Wed, Dec 30, 2009 at 04:18:09PM +0300, Dmitry Monakhov wrote:
> Hello, I've finally been able to reproduce the issue. I agree with your
> diagnosis. But while looking into the code I found some questions;
> see later in this message.
The simplest way of reproducing the BUG() that I've found is:
mke2fs -t ext3 /dev/XXX
mount /dev/XXX /mnt
dd if=/dev/zero of=/mnt/big-file bs=1024k count=16
sync
Unfortunately, this is all too easy for users to stumble across, so we
need to fix this ASAP. :-(
> It seems that this is not enough.
> Just imagine we have the following call trace:
>
> userspace pwrite(fd, d, 1000, off)
> ->ext4_da_reserve_space(inode, 1000)
> ->dq_reserve_space(1000 + md_needed)
> userspace ftruncate(fd, off) /* "off" is the same as in pwrite call */
> ->ext4_da_invalidatepage()
> ->ext4_da_page_release_reservation()
> ->ext4_da_release_space()
> <<< And we decrease ->i_allocated_meta_blocks only if (mdb_free > 0)
> userspace close(fd)
I don't think this is the problem. After we do the truncate, we will
be calling ext4_da_release_space with a value that should cause us to
call ext4_calc_metadata_amount with 0, so it will return 0. At that
point, we will have some i_reserved_meta_blocks to free. The
problem is that in ext4_da_release_space(), I forgot to call
vfs_dq_claim_block(mdb_claim). That was probably the cause of the leak.
In any case, here's a patch which also fixes the blatant
under-estimation of the number of metadata blocks that could be needed
if the process is writing random blocks into a sparse file.
Unfortunately, especially for non-extent mapped files, we *very* badly
over-estimate how many indirect blocks will be necessary: we assume
each data block requires 2-3 indirect blocks(!!!). Guessing exactly
how many metadata blocks will be necessary when doing delayed
allocation is painful, and I'm very tempted to simply change the quota
system to not include metadata blocks at all. The only thing stopping
me from doing this is we'd also need to make synchronized changes to
userspace programs like quotacheck.
Care to give this a spin? BTW, are you testing with lockdep enabled?
I'm reliably getting a LOCKDEP complaint any time I use quotas, either
normal quotas or journalled quotas, and if I use normal quotas, I get
a lot of complaints from the JBD layer about dirty metadata buffers
that aren't part of a transaction belonging to the aquota.user file.
(What's up with that? I thought if you weren't using journalled
quotas the file that should be used is quota.user?) In any case, it
looks like the quota code in ext4 needs more attention, and we may
want to check and see if any of these bugs are also turning up in the
ext3 code, or were introduced as part of the ext4 enhancements.
(Clearly the problems associated with quota and delalloc are ext4
specific.)
- Ted
commit ef627929781c98113e6ae93f159dd3c12a884ad8
Author: Theodore Ts'o <[email protected]>
Date: Wed Dec 30 00:04:04 2009 -0500
ext4: Patch up how we claim metadata blocks for quota purposes
Commit d21cd8f triggered a BUG in the function
ext4_da_update_reserve_space() found in fs/ext4/inode.c. The root
cause of this BUG() is that ext4_calc_metadata_amount() can severely
over-estimate how many metadata blocks will be needed, especially
when using direct block-mapped files.
In addition, it can also badly *under* estimate how much space is
needed, since ext4_calc_metadata_amount() assumes that the blocks are
contiguous, and this is not always true. If the application is
writing blocks to a sparse file, the number of metadata blocks
necessary can be severely underestimated by the functions
ext4_da_reserve_space(), ext4_da_update_reserve_space() and
ext4_da_release_space().
Unfortunately, doing this right means that we need to massively
over-estimate the amount of free space needed. So in some cases we
may need to force the inode to be written to disk asynchronously in
the hope that we don't get spurious quota failures.
Signed-off-by: "Theodore Ts'o" <[email protected]>
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3e3b454..84eeb8f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1043,43 +1043,47 @@ static int ext4_calc_metadata_amount(struct inode *inode, int blocks)
return ext4_indirect_calc_metadata_amount(inode, blocks);
}
+/*
+ * Called with i_data_sem down, which is important since we can call
+ * ext4_discard_preallocations() from here.
+ */
static void ext4_da_update_reserve_space(struct inode *inode, int used)
{
struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
- int total, mdb, mdb_free, mdb_claim = 0;
-
- spin_lock(&EXT4_I(inode)->i_block_reservation_lock);
- /* recalculate the number of metablocks still need to be reserved */
- total = EXT4_I(inode)->i_reserved_data_blocks - used;
- mdb = ext4_calc_metadata_amount(inode, total);
-
- /* figure out how many metablocks to release */
- BUG_ON(mdb > EXT4_I(inode)->i_reserved_meta_blocks);
- mdb_free = EXT4_I(inode)->i_reserved_meta_blocks - mdb;
-
- if (mdb_free) {
- /* Account for allocated meta_blocks */
- mdb_claim = EXT4_I(inode)->i_allocated_meta_blocks;
- BUG_ON(mdb_free < mdb_claim);
- mdb_free -= mdb_claim;
-
- /* update fs dirty blocks counter */
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ int mdb_free = 0;
+
+ spin_lock(&ei->i_block_reservation_lock);
+ if (unlikely(used > ei->i_reserved_data_blocks)) {
+ ext4_msg(inode->i_sb, KERN_NOTICE, "%s: ino %lu, used %d "
+ "with only %d reserved data blocks\n",
+ __func__, inode->i_ino, used,
+ ei->i_reserved_data_blocks);
+ WARN_ON(1);
+ used = ei->i_reserved_data_blocks;
+ }
+
+ /* Update per-inode reservations */
+ ei->i_reserved_data_blocks -= used;
+ used += ei->i_allocated_meta_blocks;
+ ei->i_reserved_meta_blocks -= ei->i_allocated_meta_blocks;
+ ei->i_allocated_meta_blocks = 0;
+ percpu_counter_sub(&sbi->s_dirtyblocks_counter, used);
+
+ if (ei->i_reserved_data_blocks == 0) {
+ /*
+ * We can release all of the reserved metadata blocks
+ * only when we have written all of the delayed
+ * allocation blocks.
+ */
+ mdb_free = ei->i_allocated_meta_blocks;
percpu_counter_sub(&sbi->s_dirtyblocks_counter, mdb_free);
- EXT4_I(inode)->i_allocated_meta_blocks = 0;
- EXT4_I(inode)->i_reserved_meta_blocks = mdb;
+ ei->i_allocated_meta_blocks = 0;
}
-
- /* update per-inode reservations */
- BUG_ON(used > EXT4_I(inode)->i_reserved_data_blocks);
- EXT4_I(inode)->i_reserved_data_blocks -= used;
- percpu_counter_sub(&sbi->s_dirtyblocks_counter, used + mdb_claim);
spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
- vfs_dq_claim_block(inode, used + mdb_claim);
-
- /*
- * free those over-booking quota for metadata blocks
- */
+ /* Update quota subsystem */
+ vfs_dq_claim_block(inode, used);
if (mdb_free)
vfs_dq_release_reservation_block(inode, mdb_free);
@@ -1088,7 +1092,8 @@ static void ext4_da_update_reserve_space(struct inode *inode, int used)
* there aren't any writers on the inode, we can discard the
* inode's preallocations.
*/
- if (!total && (atomic_read(&inode->i_writecount) == 0))
+ if ((ei->i_reserved_data_blocks == 0) &&
+ (atomic_read(&inode->i_writecount) == 0))
ext4_discard_preallocations(inode);
}
@@ -1801,7 +1806,8 @@ static int ext4_da_reserve_space(struct inode *inode, int nrblocks)
{
int retries = 0;
struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
- unsigned long md_needed, mdblocks, total = 0;
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ unsigned long md_needed, md_reserved, total = 0;
/*
* recalculate the amount of metadata blocks to reserve
@@ -1809,35 +1815,44 @@ static int ext4_da_reserve_space(struct inode *inode, int nrblocks)
* worse case is one extent per block
*/
repeat:
- spin_lock(&EXT4_I(inode)->i_block_reservation_lock);
- total = EXT4_I(inode)->i_reserved_data_blocks + nrblocks;
- mdblocks = ext4_calc_metadata_amount(inode, total);
- BUG_ON(mdblocks < EXT4_I(inode)->i_reserved_meta_blocks);
-
- md_needed = mdblocks - EXT4_I(inode)->i_reserved_meta_blocks;
+ spin_lock(&ei->i_block_reservation_lock);
+ md_reserved = ei->i_reserved_meta_blocks;
+ md_needed = ext4_calc_metadata_amount(inode, nrblocks);
total = md_needed + nrblocks;
- spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
+ spin_unlock(&ei->i_block_reservation_lock);
/*
* Make quota reservation here to prevent quota overflow
* later. Real quota accounting is done at pages writeout
* time.
*/
- if (vfs_dq_reserve_block(inode, total))
+ if (vfs_dq_reserve_block(inode, total)) {
+ /*
+ * We tend to badly over-estimate the amount of
+ * metadata blocks which are needed, so if we have
+ * reserved any metadata blocks, try to force out the
+ * inode and see if we have any better luck.
+ */
+ if (md_reserved && retries++ <= 3)
+ goto retry;
return -EDQUOT;
+ }
if (ext4_claim_free_blocks(sbi, total)) {
vfs_dq_release_reservation_block(inode, total);
if (ext4_should_retry_alloc(inode->i_sb, &retries)) {
+ retry:
+ if (md_reserved)
+ write_inode_now(inode, (retries == 3));
yield();
goto repeat;
}
return -ENOSPC;
}
- spin_lock(&EXT4_I(inode)->i_block_reservation_lock);
- EXT4_I(inode)->i_reserved_data_blocks += nrblocks;
- EXT4_I(inode)->i_reserved_meta_blocks += md_needed;
- spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
+ spin_lock(&ei->i_block_reservation_lock);
+ ei->i_reserved_data_blocks += nrblocks;
+ ei->i_reserved_meta_blocks += md_needed;
+ spin_unlock(&ei->i_block_reservation_lock);
return 0; /* success */
}
@@ -1845,49 +1860,45 @@ repeat:
static void ext4_da_release_space(struct inode *inode, int to_free)
{
struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
- int total, mdb, mdb_free, release;
+ struct ext4_inode_info *ei = EXT4_I(inode);
if (!to_free)
return; /* Nothing to release, exit */
spin_lock(&EXT4_I(inode)->i_block_reservation_lock);
- if (!EXT4_I(inode)->i_reserved_data_blocks) {
+ if (unlikely(to_free > ei->i_reserved_data_blocks)) {
/*
- * if there is no reserved blocks, but we try to free some
- * then the counter is messed up somewhere.
- * but since this function is called from invalidate
- * page, it's harmless to return without any action
+ * if there aren't enough reserved blocks, then the
+ * counter is messed up somewhere. Since this
+ * function is called from invalidate page, it's
+ * harmless to return without any action.
*/
- printk(KERN_INFO "ext4 delalloc try to release %d reserved "
- "blocks for inode %lu, but there is no reserved "
- "data blocks\n", to_free, inode->i_ino);
- spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
- return;
+ ext4_msg(inode->i_sb, KERN_NOTICE, "ext4_da_release_space: "
+ "ino %lu, to_free %d with only %d reserved "
+ "data blocks\n", inode->i_ino, to_free,
+ ei->i_reserved_data_blocks);
+ WARN_ON(1);
+ to_free = ei->i_reserved_data_blocks;
}
+ ei->i_reserved_data_blocks -= to_free;
- /* recalculate the number of metablocks still need to be reserved */
- total = EXT4_I(inode)->i_reserved_data_blocks - to_free;
- mdb = ext4_calc_metadata_amount(inode, total);
-
- /* figure out how many metablocks to release */
- BUG_ON(mdb > EXT4_I(inode)->i_reserved_meta_blocks);
- mdb_free = EXT4_I(inode)->i_reserved_meta_blocks - mdb;
-
- release = to_free + mdb_free;
-
- /* update fs dirty blocks counter for truncate case */
- percpu_counter_sub(&sbi->s_dirtyblocks_counter, release);
+ if (ei->i_reserved_data_blocks == 0) {
+ /*
+ * We can release all of the reserved metadata blocks
+ * only when we have written all of the delayed
+ * allocation blocks.
+ */
+ to_free += ei->i_allocated_meta_blocks;
+ ei->i_allocated_meta_blocks = 0;
+ }
- /* update per-inode reservations */
- BUG_ON(to_free > EXT4_I(inode)->i_reserved_data_blocks);
- EXT4_I(inode)->i_reserved_data_blocks -= to_free;
+ /* update fs dirty blocks counter */
+ percpu_counter_sub(&sbi->s_dirtyblocks_counter, to_free);
- BUG_ON(mdb > EXT4_I(inode)->i_reserved_meta_blocks);
- EXT4_I(inode)->i_reserved_meta_blocks = mdb;
spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
- vfs_dq_release_reservation_block(inode, release);
+ vfs_dq_release_reservation_block(inode, to_free);
}
static void ext4_da_page_release_reservation(struct page *page,
On Wed, Dec 30, 2009 at 04:18:09PM +0300, Dmitry Monakhov wrote:
>
> IMHO we could drop i_allocated_meta_blocks in ext4_release_file().
> But while looking into this function I found another question,
> about locking:
> static int ext4_release_file(struct inode *inode, struct file *filp)
> {
> if (EXT4_I(inode)->i_state & EXT4_STATE_DA_ALLOC_CLOSE) {
> ext4_alloc_da_blocks(inode);
> EXT4_I(inode)->i_state &= ~EXT4_STATE_DA_ALLOC_CLOSE;
> <<< It seems that the i_state modification must be protected by i_mutex,
> but currently the caller doesn't have to hold it.
(I'm answering this in a separate message since it really is a
separate question).
Yeah, that looks like a problem --- and it exists in more than just
this one place. Unfortunately using i_mutex to protect updates to
i_state is a bit heavyweight. What I'm thinking about doing is
converting all of the references to the i_state flags to use set_bit,
clear_bit, and test_bit, since this will allow us to safely and
cleanly set/clear/test individual bits.
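As a rough userspace analogue of that idea (not the kernel's set_bit/clear_bit/test_bit, which are a separate API), the sketch below uses C11 atomics so that each flag update is a single atomic read-modify-write instead of an unlocked "state &= ~FLAG". The helper names and the bit number are made up for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit number standing in for a flag such as
 * EXT4_STATE_DA_ALLOC_CLOSE; the real value is not taken from the source. */
#define STATE_DA_ALLOC_CLOSE 3

/* Atomically set one bit; concurrent updates to other bits are preserved. */
static void state_set_bit(atomic_ulong *state, unsigned int bit)
{
    assert(bit < 8 * sizeof(unsigned long));
    atomic_fetch_or(state, 1UL << bit);
}

/* Atomically clear one bit without a separate load/modify/store window. */
static void state_clear_bit(atomic_ulong *state, unsigned int bit)
{
    assert(bit < 8 * sizeof(unsigned long));
    atomic_fetch_and(state, ~(1UL << bit));
}

/* Read the current value of one bit. */
static bool state_test_bit(atomic_ulong *state, unsigned int bit)
{
    assert(bit < 8 * sizeof(unsigned long));
    return (atomic_load(state) >> bit) & 1UL;
}
```

The point of the design is that two threads touching different flags in the same word can no longer overwrite each other's update, which is the hazard with a plain read-modify-write on i_state when no lock is held.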
A quick audit of ext3 seems to show this is potentially a problem with
ext3 as well (specifically, in fs/ext3/xattr.c's use of
EXT3_STATE_XATTR).
- Ted