2009-03-12 16:54:47

by Thiemo Nagel

Subject: BUG at fs/ext4/mballoc.c:3295

Hello,

the following can be observed reproducibly with 2.6.29-rc7 when filling
up a very small (2MB) file system:

dd if=/dev/zero of=/file/inside/small/filesystem

hangs, dmesg output is:

[ 602.831279] EXT4-fs: barriers enabled
[ 602.862751] EXT4-fs warning: mounting fs with errors, running e2fsck is recommended
[ 602.864165] kjournald2 starting: pid 5414, dev loop0:8, commit interval 5 seconds
[ 602.864318] EXT4 FS on loop0, internal journal on loop0:8
[ 602.864325] EXT4-fs: delayed allocation enabled
[ 602.864348] EXT4-fs: file extents enabled
[ 602.864669] EXT4-fs: mballoc enabled
[ 602.864674] EXT4-fs: recovery complete.
[ 602.869466] EXT4-fs: mounted filesystem loop0 with ordered data mode
[ 623.000911] JBD: barrier-based sync failed on loop0:8 - disabling barriers
[ 633.432299] ------------[ cut here ]------------
[ 633.432329] kernel BUG at fs/ext4/mballoc.c:3295!
[ 633.432353] invalid opcode: 0000 [#1] PREEMPT SMP
[ 633.432382] last sysfs file: /sys/power/state
[ 633.432388] Modules linked in: ext4 jbd2 crc16 cpufreq_ondemand
cpufreq_userspace cpufreq_powersave acpi_cpufreq speedstep_lib
freq_table toshiba loop snd_intel8x0 snd_ac97_codec ac97_bus ipw2200
snd_pcm libipw snd_timer psmouse snd soundcore lib80211 snd_page_alloc
toshiba_acpi rfkill backlight input_polldev ac battery button evdev ext3
jbd mbcache usbhid sd_mod ata_generic ata_piix libata ehci_hcd uhci_hcd
e100 mii scsi_mod usbcore thermal processor fan thermal_sys
[ 633.432669]
[ 633.432676] Pid: 5475, comm: dd Not tainted (2.6.29-rc7-dbg #1) TECRA A3X
[ 633.432701] EIP: 0060:[<e04486bc>] EFLAGS: 00210202 CPU: 0
[ 633.432772] EIP is at ext4_mb_normalize_request+0x712/0x7bd [ext4]
[ 633.432796] EAX: 00000001 EBX: 00000200 ECX: 00000001 EDX: cf950738
[ 633.432803] ESI: 00000000 EDI: cf94a3f0 EBP: cd07fa98 ESP: cd07fa38
[ 633.432827] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[ 633.432852] Process dd (pid: 5475, ti=cd07e000 task=dd092d00 task.ti=cd07e000)
[ 633.432859] Stack:
[ 633.432863] 00000002 00000001 00000000 e044853a 0005e800 00000000 cd07fb6c cf950738
[ 633.432918] 00000200 00000000 cf94a3f0 0000000a 00000200 00000000 00000000 00000000
[ 633.432958] cf94a0ac cf94a184 cf94a184 000001da 00000002 00000000 dcdf0c00 cf94a184
[ 633.433016] Call Trace:
[ 633.433021] [<e044853a>] ? ext4_mb_normalize_request+0x590/0x7bd [ext4]
[ 633.433047] [<e044d5e8>] ? ext4_mb_new_blocks+0x1c3/0x480 [ext4]
[ 633.433047] [<e04455fa>] ? ext4_ext_get_blocks+0xc06/0xe7c [ext4]
[ 633.433047] [<c0315ccd>] ? _spin_unlock+0x27/0x3c
[ 633.433047] [<e007da49>] ? find_revoke_record+0x94/0xa3 [jbd2]
[ 633.433047] [<e007e10d>] ? jbd2_journal_cancel_revoke+0x11a/0x15c [jbd2]
[ 633.433047] [<c014f580>] ? __lock_acquire+0x475/0x5e0
[ 633.433047] [<e04343a2>] ? ext4_get_blocks_wrap+0xf0/0x214 [ext4]
[ 633.433047] [<e0434bb4>] ? ext4_get_block+0xba/0xf1 [ext4]
[ 633.433047] [<c01c1954>] ? __block_prepare_write+0x15b/0x352
[ 633.433047] [<c0179320>] ? find_lock_page+0x33/0x6b
[ 633.433047] [<c01c1ce7>] ? block_write_begin+0x7b/0xd7
[ 633.433047] [<e0434afa>] ? ext4_get_block+0x0/0xf1 [ext4]
[ 633.433047] [<e042f416>] ? ext4_write_begin+0xde/0x1dd [ext4]
[ 633.433047] [<e0434afa>] ? ext4_get_block+0x0/0xf1 [ext4]
[ 633.433047] [<e042faab>] ? ext4_da_write_begin+0x10f/0x227 [ext4]
[ 633.433047] [<c0178ac1>] ? generic_file_buffered_write+0xd9/0x27c
[ 633.433047] [<c017a503>] ? __generic_file_aio_write_nolock+0x4b3/0x507
[ 633.433047] [<c017a5b2>] ? generic_file_aio_write+0x5b/0xb9
[ 633.433047] [<c012ec77>] ? __do_softirq+0x68/0x163
[ 633.433047] [<c014d207>] ? validate_chain+0x12e/0x1044
[ 633.433047] [<e042cd05>] ? ext4_file_write+0xd0/0x152 [ext4]
[ 633.433047] [<c014fea2>] ? mark_held_locks+0x4f/0x66
[ 633.433047] [<c01a1eb5>] ? do_sync_write+0xc4/0x109
[ 633.433047] [<c013dadd>] ? autoremove_wake_function+0x0/0x41
[ 633.433047] [<c01fa634>] ? security_file_permission+0xf/0x11
[ 633.433047] [<c01a20b1>] ? rw_verify_area+0xb0/0xd3
[ 633.433047] [<c0315ccd>] ? _spin_unlock+0x27/0x3c
[ 633.433047] [<c01a216b>] ? vfs_write+0x97/0x14e
[ 633.433047] [<c0103237>] ? sysenter_exit+0xf/0x1a
[ 633.433047] [<c01a1df1>] ? do_sync_write+0x0/0x109
[ 633.433047] [<c01a2963>] ? sys_write+0x3d/0x61
[ 633.433047] [<c01031fb>] ? sysenter_do_call+0x12/0x3f
[ 633.433047] Code: 26 8b 55 bc 8b 42 04 8b 80 7c 02 00 00 8b 40 08 b9 01 00 00 00 83 fe 00 7f 0b 7c 04 39 c3 73 05 b9 00 00 00 00 89 c8 85 c0 74 04 <0f> 0b eb fe 8b 7d dc 8b 4d bc 89 79 18 89 59 24 8b 45 b8 8b 50
[ 633.433047] EIP: [<e04486bc>] ext4_mb_normalize_request+0x712/0x7bd [ext4] SS:ESP 0068:cd07fa38
[ 633.443710] ---[ end trace 7b79a7a8035c66f0 ]---


IIRC the small filesystem was created like this:

dd if=/dev/zero of=image.ext4 bs=1M count=2
/sbin/mkfs.ext4 -v -F -b 1024 -m 0 -g 512 -G 4 -I 128 -N 1 \
-O large_file,dir_index,flex_bg,extent,sparse_super image.ext4

Kind regards,

Thiemo


2009-03-12 17:08:10

by Eric Sandeen

Subject: Re: BUG at fs/ext4/mballoc.c:3295

Thiemo Nagel wrote:
> Hello,
>
> the following can be observed reproducibly with 2.6.29-rc7 when filling
> up a very small (2MB) file system:
>
> dd if=/dev/zero of=/file/inside/small/filesystem
>
> hangs, dmesg output is:
>
> [ 602.831279] EXT4-fs: barriers enabled
> [ 602.862751] EXT4-fs warning: mounting fs with errors, running e2fsck is recommended
> [ 602.864165] kjournald2 starting: pid 5414, dev loop0:8, commit interval 5 seconds
> [ 602.864318] EXT4 FS on loop0, internal journal on loop0:8
> [ 602.864325] EXT4-fs: delayed allocation enabled
> [ 602.864348] EXT4-fs: file extents enabled
> [ 602.864669] EXT4-fs: mballoc enabled
> [ 602.864674] EXT4-fs: recovery complete.
> [ 602.869466] EXT4-fs: mounted filesystem loop0 with ordered data mode
> [ 623.000911] JBD: barrier-based sync failed on loop0:8 - disabling barriers
> [ 633.432299] ------------[ cut here ]------------
> [ 633.432329] kernel BUG at fs/ext4/mballoc.c:3295!

I don't see it:

# dd if=/dev/zero of=2mbfile bs=1M count=2
# mkfs.ext4 -F 2mbfile
# mount -o loop 2mbfile mnt/
# dd if=/dev/zero of=mnt/file
dd: writing to `mnt/file': No space left on device
1917+0 records in
1916+0 records out
980992 bytes (981 kB) copied, 0.0162723 s, 60.3 MB/s

is this more or less what you did?

-Eric

2009-03-12 17:13:59

by Thiemo Nagel

Subject: Re: BUG at fs/ext4/mballoc.c:3295

> I don't see it:
>
> # dd if=/dev/zero of=2mbfile bs=1M count=2
> # mkfs.ext4 -F 2mbfile
> # mount -o loop 2mbfile mnt/
> # dd if=/dev/zero of=mnt/file
> dd: writing to `mnt/file': No space left on device
> 1917+0 records in
> 1916+0 records out
> 980992 bytes (981 kB) copied, 0.0162723 s, 60.3 MB/s
>
> is this more or less what you did?

Yes, except that I used the following parameters for mkfs.ext4 (IIRC):

/sbin/mkfs.ext4 -v -F -b 1024 -m 0 -g 512 -G 4 -I 128 -N 1 \
-O large_file,dir_index,flex_bg,extent,sparse_super image.ext4

Kind regards,

Thiemo Nagel

2009-03-12 17:17:03

by Eric Sandeen

Subject: Re: BUG at fs/ext4/mballoc.c:3295

Thiemo Nagel wrote:
>> I don't see it:
>>
>> # dd if=/dev/zero of=2mbfile bs=1M count=2
>> # mkfs.ext4 -F 2mbfile
>> # mount -o loop 2mbfile mnt/
>> # dd if=/dev/zero of=mnt/file
>> dd: writing to `mnt/file': No space left on device
>> 1917+0 records in
>> 1916+0 records out
>> 980992 bytes (981 kB) copied, 0.0162723 s, 60.3 MB/s
>>
>> is this more or less what you did?
>
> Yes, except that I used the following parameters for mkfs.ext4 (IIRC):
>
> /sbin/mkfs.ext4 -v -F -b 1024 -m 0 -g 512 -G 4 -I 128 -N 1 \
> -O large_file,dir_index,flex_bg,extent,sparse_super image.ext4
>
> Kind regards,
>
> Thiemo Nagel

Bingo, thanks :)

-Eric

2009-03-12 18:47:11

by Eric Sandeen

Subject: [PATCH] fix bogus BUG_ONs in mballoc code

Thiemo Nagel reported that:

# dd if=/dev/zero of=image.ext4 bs=1M count=2
# mkfs.ext4 -v -F -b 1024 -m 0 -g 512 -G 4 -I 128 -N 1 \
-O large_file,dir_index,flex_bg,extent,sparse_super image.ext4
# mount -o loop image.ext4 mnt/
# dd if=/dev/zero of=mnt/file

oopsed, with a BUG_ON in ext4_mb_normalize_request because
size == EXT4_BLOCKS_PER_GROUP

It appears to me (esp. after talking to Andreas) that the BUG_ON
is bogus; a request of exactly EXT4_BLOCKS_PER_GROUP should
be allowed, though larger sizes do indicate a problem.

Fix that and another (apparently rare) codepath with a similar check.

Reported-by: Thiemo Nagel <[email protected]>
Signed-off-by: Eric Sandeen <[email protected]>
---

Index: linux-2.6/fs/ext4/mballoc.c
===================================================================
--- linux-2.6.orig/fs/ext4/mballoc.c
+++ linux-2.6/fs/ext4/mballoc.c
@@ -1447,7 +1447,7 @@ static void ext4_mb_measure_extent(struc
 	struct ext4_free_extent *gex = &ac->ac_g_ex;

 	BUG_ON(ex->fe_len <= 0);
-	BUG_ON(ex->fe_len >= EXT4_BLOCKS_PER_GROUP(ac->ac_sb));
+	BUG_ON(ex->fe_len > EXT4_BLOCKS_PER_GROUP(ac->ac_sb));
 	BUG_ON(ex->fe_start >= EXT4_BLOCKS_PER_GROUP(ac->ac_sb));
 	BUG_ON(ac->ac_status != AC_STATUS_CONTINUE);

@@ -3292,7 +3292,7 @@ ext4_mb_normalize_request(struct ext4_al
 	}
 	BUG_ON(start + size <= ac->ac_o_ex.fe_logical &&
 			start > ac->ac_o_ex.fe_logical);
-	BUG_ON(size <= 0 || size >= EXT4_BLOCKS_PER_GROUP(ac->ac_sb));
+	BUG_ON(size <= 0 || size > EXT4_BLOCKS_PER_GROUP(ac->ac_sb));

 	/* now prepare goal request */
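
The off-by-one that the patch corrects can be seen in isolation with a small userspace sketch. The two helper names below are hypothetical stand-ins for the old and new BUG_ON conditions, not kernel code, and blocks_per_group stands in for EXT4_BLOCKS_PER_GROUP(sb). With -g 512, a request for exactly 512 blocks trips the old condition but not the patched one.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the pre-patch condition:
 * a request of exactly one full group is treated as a bug. */
bool old_check_trips(long size, long blocks_per_group)
{
    return size <= 0 || size >= blocks_per_group;
}

/* Hypothetical stand-in for the patched condition:
 * only requests larger than one full group are treated as a bug. */
bool new_check_trips(long size, long blocks_per_group)
{
    return size <= 0 || size > blocks_per_group;
}
```

For the reported filesystem, old_check_trips(512, 512) is true while new_check_trips(512, 512) is false; both still reject a size of 0 or 513.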



2009-03-13 00:39:04

by Theodore Ts'o

Subject: Re: [PATCH] fix bogus BUG_ONs in mballoc code

On Thu, Mar 12, 2009 at 01:46:57PM -0500, Eric Sandeen wrote:
> Thiemo Nagel reported that:
>
> # dd if=/dev/zero of=image.ext4 bs=1M count=2
> # mkfs.ext4 -v -F -b 1024 -m 0 -g 512 -G 4 -I 128 -N 1 \
> -O large_file,dir_index,flex_bg,extent,sparse_super image.ext4
> # mount -o loop image.ext4 mnt/
> # dd if=/dev/zero of=mnt/file
>
> oopsed, with a BUG_ON in ext4_mb_normalize_request because
> size == EXT4_BLOCKS_PER_GROUP
>
> It appears to me (esp. after talking to Andreas) that the BUG_ON
> is bogus; a request of exactly EXT4_BLOCKS_PER_GROUP should
> be allowed, though larger sizes do indicate a problem.

Clearly we should make this change to avoid the BUG_ON; but stupid
question, why shouldn't we allow sizes larger than
EXT4_BLOCKS_PER_GROUP?

Especially with flex_bg, it is possible for an allocation size >
EXT4_BLOCKS_PER_GROUP to be satisfied, especially if the filesystem
isn't that full yet, and it might even make sense to request a larger
allocation for video files that are getting preallocated, for
example....

- Ted

2009-03-13 01:09:47

by Theodore Ts'o

Subject: Re: [PATCH] fix bogus BUG_ONs in mballoc code

On Thu, Mar 12, 2009 at 01:46:57PM -0500, Eric Sandeen wrote:
> Thiemo Nagel reported that:
>
> # dd if=/dev/zero of=image.ext4 bs=1M count=2
> # mkfs.ext4 -v -F -b 1024 -m 0 -g 512 -G 4 -I 128 -N 1 \
> -O large_file,dir_index,flex_bg,extent,sparse_super image.ext4
> # mount -o loop image.ext4 mnt/
> # dd if=/dev/zero of=mnt/file
>
> oopsed, with a BUG_ON in ext4_mb_normalize_request because
> size == EXT4_BLOCKS_PER_GROUP
>
> It appears to me (esp. after talking to Andreas) that the BUG_ON
> is bogus; a request of exactly EXT4_BLOCKS_PER_GROUP should
> be allowed, though larger sizes do indicate a problem.
>
> Fix that and another (apparently rare) codepath with a similar check.

Hmm.... is this at all likely to happen with standard ext4
filesystem parameters? Or was this triggered because of the
artificially set -g 512 parameter? The question is whether we should
try pushing this to Linus at this point, or let this wait until the
merge window opens.

Opinions?

- Ted

2009-03-13 02:08:09

by Eric Sandeen

Subject: Re: [PATCH] fix bogus BUG_ONs in mballoc code

Theodore Tso wrote:
> On Thu, Mar 12, 2009 at 01:46:57PM -0500, Eric Sandeen wrote:
>> Thiemo Nagel reported that:
>>
>> # dd if=/dev/zero of=image.ext4 bs=1M count=2
>> # mkfs.ext4 -v -F -b 1024 -m 0 -g 512 -G 4 -I 128 -N 1 \
>> -O large_file,dir_index,flex_bg,extent,sparse_super image.ext4
>> # mount -o loop image.ext4 mnt/
>> # dd if=/dev/zero of=mnt/file
>>
>> oopsed, with a BUG_ON in ext4_mb_normalize_request because
>> size == EXT4_BLOCKS_PER_GROUP
>>
>> It appears to me (esp. after talking to Andreas) that the BUG_ON
>> is bogus; a request of exactly EXT4_BLOCKS_PER_GROUP should
>> be allowed, though larger sizes do indicate a problem.
>>
>> Fix that and another (apparently rare) codepath with a similar check.
>
> Hmm.... is this at all likely to happen with standard ext4
> filesystem parameters? Or was this triggered because of the
> artificially set -g 512 parameter? The question is whether we should
> try pushing this to Linus at this point, or let this wait until the
> merge window opens.
>
> Opinions?
>
> - Ted

I wondered the same thing, and will admit to probably not digging deep
enough on this one. I think the fix is ok as is but you are asking the
right questions. Maybe a clusterfs mballoc expert can chime in and save
us some time? :)

-Eric

2009-03-13 11:10:11

by Andreas Dilger

Subject: Re: [PATCH] fix bogus BUG_ONs in mballoc code

On Mar 12, 2009 20:38 -0400, Theodore Ts'o wrote:
> On Thu, Mar 12, 2009 at 01:46:57PM -0500, Eric Sandeen wrote:
> > Thiemo Nagel reported that:
> >
> > # dd if=/dev/zero of=image.ext4 bs=1M count=2
> > # mkfs.ext4 -v -F -b 1024 -m 0 -g 512 -G 4 -I 128 -N 1 \
> > -O large_file,dir_index,flex_bg,extent,sparse_super image.ext4
> > # mount -o loop image.ext4 mnt/
> > # dd if=/dev/zero of=mnt/file
> >
> > oopsed, with a BUG_ON in ext4_mb_normalize_request because
> > size == EXT4_BLOCKS_PER_GROUP
> >
> > It appears to me (esp. after talking to Andreas) that the BUG_ON
> > is bogus; a request of exactly EXT4_BLOCKS_PER_GROUP should
> > be allowed, though larger sizes do indicate a problem.
>
> Clearly we should make this change to avoid the BUG_ON; but stupid
> question, why shouldn't we allow sizes larger than
> EXT4_BLOCKS_PER_GROUP?
>
> Especially with flex_bg, it is possible for an allocation size >
> > EXT4_BLOCKS_PER_GROUP to be satisfied, especially if the filesystem
> isn't that full yet, and it might even make sense to request a larger
> allocation for video files that are getting preallocated, for
> example....

There are two reasons that we can't have too-large mballoc allocations:
- mballoc works on a per-group basis, so the most blocks that it can
allocate at a time is BLOCKS_PER_GROUP.
- the on-disk extent format cannot map more than 128MB at a time, which
is equal to the group size at 4kB blocksize.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
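
The 128MB figure Andreas cites can be checked with a little arithmetic. The sketch below rests on two assumptions that are not stated in the thread: that the default group size is 8 * blocksize blocks (one bitmap block, one bit per block), and that a single initialized extent can cover at most 32768 blocks (a 16-bit length field with the upper half reserved for uninitialized extents). The function and macro names are illustrative only.

```c
#include <stdint.h>

/* Assumed cap on the length of one initialized extent, in blocks. */
#define MAX_INIT_EXTENT_BLOCKS (1ULL << 15)

/* Assumed default group size: one bitmap block of block_size bytes,
 * one bit per block. */
uint64_t default_blocks_per_group(uint32_t block_size)
{
    return 8ULL * block_size;
}

/* Bytes covered by one default-sized block group. */
uint64_t default_group_bytes(uint32_t block_size)
{
    return default_blocks_per_group(block_size) * block_size;
}

/* Bytes that a single maximum-length initialized extent can map. */
uint64_t max_extent_bytes(uint32_t block_size)
{
    return MAX_INIT_EXTENT_BLOCKS * block_size;
}
```

Under these assumptions, at 4kB blocksize both default_group_bytes(4096) and max_extent_bytes(4096) come to 134217728 bytes (128MB). At 1kB blocksize the default group shrinks to 8MB while an extent could still map 32MB, so the per-group limit is the tighter of the two.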