2008-08-17 12:53:14

by Theodore Ts'o

[permalink] [raw]
Subject: ENOSPC error returned too late


While trying to track down kernel BZ #11341 (which is really an
e2fsprogs bug, not a 2.6.26 regression) I created a fresh ext4
filesystem which was 5GB in length, and then filled it to ENOSPC via:

dd if=/dev/zero of=/mnt/bigfile bs=1024 count=10485760

Unfortunately, it sent the ENOSPC few blocks too late, so the following
appeared in dmesg, and an attempt to umount the filesystem hung in D
wait. :-(

- Ted


[63518.930337] mpage_da_map_blocks block allocation failed for inode 12 at logical offset 1254547 with max blocks 2 with error -28
[63518.930337] This should not happen.!! Data will be lost
[63518.930337] ------------[ cut here ]------------
[63518.930337] kernel BUG at fs/buffer.c:1858!
[63518.930337] invalid opcode: 0000 [#1] SMP
[63518.930337] Modules linked in: af_packet arc4 ecb iwlagn iwlcore mac80211 cfg80211 i915 drm nfsd auth_rpcgss exportfs kvm_intel kvm autofs4 pci_slot container wmi cpufreq_stats cpufreq_conservative cpufreq_userspace cpufreq_powersave nfs lockd sunrpc iptable_filter ip_tables x_tables deflate zlib_deflate ctr twofish twofish_common camellia serpent blowfish des_generic cbc aes_i586 aes_generic xcbc rmd160 sha256_generic sha1_generic crypto_null crypto_blkcipher af_key ext3 jbd uinput acpi_cpufreq loop snd_hda_intel snd_pcm_oss snd_mixer_oss serio_raw snd_pcm snd_page_alloc snd_hwdep psmouse battery snd_seq_dummy video output snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device button snd soundcore intel_agp thinkpad_acpi rfkill led_class nvram evdev id
e_pci_generic piix ide_core sg sd_mod crc_t10dif ata_piix ata_generic ahci libata scsi_mod dock dm_mirror dm_log dm_snapshot dm_mod thermal processor fan fuse ext4dev mbcache jbd2 crc16
[63518.930337]
[63518.930337] Pid: 15028, comm: sync Not tainted (2.6.27-rc3-00038-gd6f7b5a #56)
[63518.930337] EIP: 0060:[<c01b0fa9>] EFLAGS: 00010246 CPU: 0
[63518.930337] EIP is at __block_prepare_write+0x20/0x2b1
[63518.930337] EAX: 00000000 EBX: d3a04604 ECX: 00000000 EDX: 00000000
[63518.930337] ESI: c6070730 EDI: 00000000 EBP: e2d71cbc ESP: e2d71c7c
[63518.930337] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[63518.930337] Process sync (pid: 15028, ti=e2d71000 task=e81c8000 task.ti=e2d71000)
[63518.930337] Stack: c014653b d3a044d8 00000246 f74d8ba8 c01b123f e2d71cb4 c0168845 f74d8ba8
[63518.930337] 00000246 00000000 00000246 00000000 00001000 d3a04604 c6070730 00000000
[63518.930337] e2d71cd4 c01b1258 00001000 f88b7035 00000000 00001000 e2d71cf8 f88b8086
[63518.930337] Call Trace:
[63518.930337] [<c014653b>] ? trace_hardirqs_on+0xb/0xd
[63518.930337] [<c01b123f>] ? block_prepare_write+0x5/0x36
[63518.930337] [<c0168845>] ? ftrace_record_ip+0x1b3/0x1cc
[63518.930337] [<c01b1258>] ? block_prepare_write+0x1e/0x36
[63518.930337] [<f88b7035>] ? ext4_normal_get_block_write+0x0/0x53 [ext4dev]
[63518.930337] [<f88b8086>] ? ext4_da_writepage+0x89/0x138 [ext4dev]
[63518.930337] [<f88b7035>] ? ext4_normal_get_block_write+0x0/0x53 [ext4dev]
[63518.930337] [<f88b61cc>] ? mpage_da_submit_io+0x7a/0xab [ext4dev]
[63518.930337] [<f88b62ca>] ? __mpage_da_writepage+0x3c/0x171 [ext4dev]
[63518.930337] [<c01463fc>] ? mark_held_locks+0x45/0x5c
[63518.930337] [<c014653b>] ? trace_hardirqs_on+0xb/0xd
[63518.930337] [<c01464fb>] ? trace_hardirqs_on_caller+0xe8/0x11d
[63518.930337] [<c014653b>] ? trace_hardirqs_on+0xb/0xd
[63518.930337] [<c0173720>] ? write_cache_pages+0x155/0x261
[63518.930337] [<f88b628e>] ? __mpage_da_writepage+0x0/0x171 [ext4dev]
[63518.930337] [<f885d733>] ? jbd2_journal_start+0xce/0xf2 [jbd2]
[63518.930337] [<f88b8773>] ? ext4_da_writepages+0x186/0x23f [ext4dev]
[63518.930337] [<c01089e0>] ? native_sched_clock+0x97/0xb5
[63518.930337] [<c0108768>] ? sched_clock+0xb/0x1c
[63518.930337] [<c0144908>] ? lock_release_holdtime+0x43/0x48
[63518.930337] [<f88b9260>] ? ext4_da_get_block_write+0x0/0x13e [ext4dev]
[63518.930337] [<f88b85ed>] ? ext4_da_writepages+0x0/0x23f [ext4dev]
[63518.930337] [<c0173880>] ? do_writepages+0x28/0x39
[63518.930337] [<c01abcaf>] ? __writeback_single_inode+0x130/0x222
[63518.930337] [<c01ac0b3>] ? generic_sync_sb_inodes+0x195/0x25f
[63518.930337] [<c01ac18a>] ? sync_sb_inodes+0xd/0xf
[63518.930337] [<c01ac204>] ? sync_inodes_sb+0x78/0x80
[63518.930337] [<c01ac267>] ? __sync_inodes+0x5b/0xa1
[63518.930337] [<c01ac2c6>] ? sync_inodes+0x19/0x31
[63518.930337] [<c01ae847>] ? do_sync+0x19/0x5f
[63518.930337] [<c01ae89f>] ? sys_sync+0x12/0x17
[63518.930337] [<c0103a63>] ? sysenter_do_call+0x12/0x3f
[63518.930337] =======================
[63518.930337] Code: ff 8d 65 f4 89 f0 5b 5e 5f 5d c3 55 89 e5 57 56 53 83 ec 34 0f 1f 44 00 00 89 45 c4 89 d0 89 d6 89 cf e8 21 dd ff ff 85 c0 75 04 <0f> 0b eb fe 81 ff 00 10 00 00 76 04 0f 0b eb fe 81 7d 08 00 10
[63518.930337] EIP: [<c01b0fa9>] __block_prepare_write+0x20/0x2b1 SS:ESP 0068:e2d71c7c
[63518.931648] ---[ end trace fa5a90ec7832f4fc ]---


2008-08-18 09:30:32

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: ENOSPC error returned too late

On Sun, Aug 17, 2008 at 08:53:11AM -0400, Theodore Ts'o wrote:
>
> While trying to track down kernel BZ #11341 (which is really an
> e2fsprogs bug, not a 2.6.26 regression) I created a fresh ext4
> filesystem which was 5GB in length, and then filled it to ENOSPC via:
>
> dd if=/dev/zero of=/mnt/bigfile bs=1024 count=10485760
>
> Unfortunately, it sent the ENOSPC few blocks too late, so the following
> appeared in dmesg, and an attempt to umount the filesystem hung in D
> wait. :-(
>
> - Ted
>
>
> [63518.930337] mpage_da_map_blocks block allocation failed for inode 12 at logical offset 1254547 with max blocks 2 with error -28
> [63518.930337] This should not happen.!! Data will be lost

I will look into this. Are we really out of free space or is it that
mballoc is not able to find free space ?. I am not able to reproduce
this here. Will try more.



> [63518.930337] ------------[ cut here ]------------
> [63518.930337] kernel BUG at fs/buffer.c:1858!

This should be fixed by not doing a submit_io when we failed get_block.
The below is an updated ext4_invalidate_pages_when_delalloc_alloc_fail.patch
I also tried another patch (with random failures injected )which did
remove_mapping(mapping, page) after block_invalidate. But I still find
some pages in the radix tree even though we called remove_mapping. End I
found the below patch simple. The shrink_page_list will find the page clean
and should remove the page from the radix tree if everything is correct.

ext4: invalidate pages if delalloc block allocation fails.

From: Aneesh Kumar K.V <[email protected]>

We are a bit agressive in invalidating all the pages. But
it is ok because we really don't know why the block allocation
failed and it is better to come of the writeback path
so that user can look for more info.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---

fs/ext4/inode.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++--------
1 files changed, 73 insertions(+), 12 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index fd14d6b..c5e892c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1783,6 +1783,39 @@ static inline void __unmap_underlying_blocks(struct inode *inode,
unmap_underlying_metadata(bdev, bh->b_blocknr + i);
}

+static void ext4_da_block_invalidatepages(struct mpage_da_data *mpd,
+ sector_t logical, long blk_cnt)
+{
+ int nr_pages, i;
+ pgoff_t index, end;
+ struct pagevec pvec;
+ struct inode *inode = mpd->inode;
+ struct address_space *mapping = inode->i_mapping;
+
+ index = logical >> (PAGE_CACHE_SHIFT - inode->i_blkbits);
+ end = (logical + blk_cnt - 1) >>
+ (PAGE_CACHE_SHIFT - inode->i_blkbits);
+ while (index <= end) {
+ nr_pages = pagevec_lookup(&pvec, mapping, index, PAGEVEC_SIZE);
+ if (nr_pages == 0)
+ break;
+ for (i = 0; i < nr_pages; i++) {
+ struct page *page = pvec.pages[i];
+ index = page->index;
+ if (index > end)
+ break;
+ index++;
+
+ BUG_ON(!PageLocked(page));
+ BUG_ON(PageWriteback(page));
+ block_invalidatepage(page, 0);
+ ClearPageUptodate(page);
+ unlock_page(page);
+ }
+ }
+ return;
+}
+
/*
* mpage_da_map_blocks - go through given space
*
@@ -1792,7 +1825,7 @@ static inline void __unmap_underlying_blocks(struct inode *inode,
* The function skips space we know is already mapped to disk blocks.
*
*/
-static void mpage_da_map_blocks(struct mpage_da_data *mpd)
+static int mpage_da_map_blocks(struct mpage_da_data *mpd)
{
int err = 0;
struct buffer_head *lbh = &mpd->lbh;
@@ -1803,7 +1836,7 @@ static void mpage_da_map_blocks(struct mpage_da_data *mpd)
* We consider only non-mapped and non-allocated blocks
*/
if (buffer_mapped(lbh) && !buffer_delay(lbh))
- return;
+ return 0;

new.b_state = lbh->b_state;
new.b_blocknr = 0;
@@ -1814,10 +1847,38 @@ static void mpage_da_map_blocks(struct mpage_da_data *mpd)
* to write simply return
*/
if (!new.b_size)
- return;
+ return 0;
err = mpd->get_block(mpd->inode, next, &new, 1);
- if (err)
- return;
+ if (err) {
+
+ /* If get block returns with error
+ * we simply return. Later writepage
+ * will redirty the page and writepages
+ * will find the dirty page again
+ */
+ if (err == -EAGAIN)
+ return;
+ /*
+ * get block failure will cause us
+ * to loop in writepages. Because
+ * a_ops->writepage won't be able to
+ * make progress. The page will be redirtied
+ * by writepage and writepages will again
+ * try to write the same.
+ */
+ printk(KERN_EMERG "%s block allocation failed for inode %lu "
+ "at logical offset %llu with max blocks "
+ "%zd with error %d\n",
+ __func__, mpd->inode->i_ino,
+ (unsigned long long)next,
+ lbh->b_size >> mpd->inode->i_blkbits, err);
+ printk(KERN_EMERG "This should not happen.!! "
+ "Data will be lost\n");
+ /* invlaidate all the pages */
+ ext4_da_block_invalidatepages(mpd, next,
+ lbh->b_size >> mpd->inode->i_blkbits);
+ return err;
+ }
BUG_ON(new.b_size == 0);

if (buffer_new(&new))
@@ -1830,7 +1891,7 @@ static void mpage_da_map_blocks(struct mpage_da_data *mpd)
if (buffer_delay(lbh) || buffer_unwritten(lbh))
mpage_put_bnr_to_bhs(mpd, next, &new);

- return;
+ return 0;
}

#define BH_FLAGS ((1 << BH_Uptodate) | (1 << BH_Mapped) | \
@@ -1880,8 +1941,8 @@ static void mpage_add_bh_to_extent(struct mpage_da_data *mpd,
* We couldn't merge the block to our extent, so we
* need to flush current extent and start new one
*/
- mpage_da_map_blocks(mpd);
- mpage_da_submit_io(mpd);
+ if (mpage_da_map_blocks(mpd) == 0)
+ mpage_da_submit_io(mpd);
mpd->io_done = 1;
return;
}
@@ -1923,8 +1984,8 @@ static int __mpage_da_writepage(struct page *page,
* and start IO on them using writepage()
*/
if (mpd->next_page != mpd->first_page) {
- mpage_da_map_blocks(mpd);
- mpage_da_submit_io(mpd);
+ if (mpage_da_map_blocks(mpd) == 0)
+ mpage_da_submit_io(mpd);
/*
* skip rest of the page in the page_vec
*/
@@ -2027,8 +2088,8 @@ static int mpage_da_writepages(struct address_space *mapping,
* Handle last extent of pages
*/
if (!mpd.io_done && mpd.next_page != mpd.first_page) {
- mpage_da_map_blocks(&mpd);
- mpage_da_submit_io(&mpd);
+ if (mpage_da_map_blocks(&mpd) == 0)
+ mpage_da_submit_io(&mpd);
}

wbc->nr_to_write = to_write - mpd.pages_written;

2008-08-18 10:51:49

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: ENOSPC error returned too late

On Mon, Aug 18, 2008 at 03:00:26PM +0530, Aneesh Kumar K.V wrote:
> On Sun, Aug 17, 2008 at 08:53:11AM -0400, Theodore Ts'o wrote:
> >
> > While trying to track down kernel BZ #11341 (which is really an
> > e2fsprogs bug, not a 2.6.26 regression) I created a fresh ext4
> > filesystem which was 5GB in length, and then filled it to ENOSPC via:
> >
> > dd if=/dev/zero of=/mnt/bigfile bs=1024 count=10485760
> >
> > Unfortunately, it sent the ENOSPC few blocks too late, so the following
> > appeared in dmesg, and an attempt to umount the filesystem hung in D
> > wait. :-(
> >
> > - Ted
> >
> >
> > [63518.930337] mpage_da_map_blocks block allocation failed for inode 12 at logical offset 1254547 with max blocks 2 with error -28
> > [63518.930337] This should not happen.!! Data will be lost
>
> I will look into this. Are we really out of free space or is it that
> mballoc is not able to find free space ?. I am not able to reproduce
> this here. Will try more.
>
>

How about

commit 08085a642ac6e7c3bfae0b6daf538e64a99695b2
Author: Aneesh Kumar K.V <[email protected]>
Date: Mon Aug 18 16:20:39 2008 +0530

ext4: make sure ext4_has_free_blocks return 0 for ENOSPC

Fix ext4_has_free_blocks to return 0 when we don't have
enought space.

Signed-off-by: Aneesh Kumar K.V <[email protected]>

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 27323cd..dfe2d4f 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -1628,6 +1628,9 @@ ext4_fsblk_t ext4_has_free_blocks(struct ext4_sb_info *sbi,
free_blocks =
percpu_counter_sum_and_set(&sbi->s_freeblocks_counter);
#endif
+ if (free_blocks <= root_blocks)
+ /* we don't have free space */
+ return 0;
if (free_blocks - root_blocks < nblocks)
return free_blocks - root_blocks;
return nblocks;

2008-08-18 13:30:50

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ENOSPC error returned too late

On Mon, Aug 18, 2008 at 04:21:40PM +0530, Aneesh Kumar K.V wrote:
> --- a/fs/ext4/balloc.c
> +++ b/fs/ext4/balloc.c
> @@ -1628,6 +1628,9 @@ ext4_fsblk_t ext4_has_free_blocks(struct ext4_sb_info *sbi,
> free_blocks =
> percpu_counter_sum_and_set(&sbi->s_freeblocks_counter);
> #endif
> + if (free_blocks <= root_blocks)
> + /* we don't have free space */
> + return 0;
> if (free_blocks - root_blocks < nblocks)
> return free_blocks - root_blocks;
> return nblocks;

Thanks, I'll try it out. Given that ext4_fsblk_t is an unsigned long
(BTW, this is one of the reasons why the common coding style guide
dislikes typedefs; it hides this kind of errors), it's obviously a
good patch to add regardless of whether it solves are problem, since
if freeblocks < root_blocks, the following condition is going to do
the Wrong Thing.

Signed-off-by: "Theodore Ts'o" <[email protected]>

- Ted


2008-08-18 16:44:55

by Mingming Cao

[permalink] [raw]
Subject: Re: ENOSPC error returned too late


在 2008-08-18一的 09:30 -0400,Theodore Tso写道:
> On Mon, Aug 18, 2008 at 04:21:40PM +0530, Aneesh Kumar K.V wrote:
> > --- a/fs/ext4/balloc.c
> > +++ b/fs/ext4/balloc.c
> > @@ -1628,6 +1628,9 @@ ext4_fsblk_t ext4_has_free_blocks(struct ext4_sb_info *sbi,
> > free_blocks =
> > percpu_counter_sum_and_set(&sbi->s_freeblocks_counter);
> > #endif
> > + if (free_blocks <= root_blocks)
> > + /* we don't have free space */
> > + return 0;
> > if (free_blocks - root_blocks < nblocks)
> > return free_blocks - root_blocks;
> > return nblocks;
>
> Thanks, I'll try it out. Given that ext4_fsblk_t is an unsigned long
> (BTW, this is one of the reasons why the common coding style guide
> dislikes typedefs; it hides this kind of errors), it's obviously a
> good patch to add regardless of whether it solves are problem, since
> if freeblocks < root_blocks, the following condition is going to do
> the Wrong Thing.
>

Thanks for catching this Aneesh, agreed.

Signed-off-by: Mingming Cao <[email protected]>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
>
> - Ted
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html