The following BUG occurred today while compiling gcc, with 2.6.37-rc1+.
More precisely, commit 7fe19da4ca38 ("preempt: fix kernel build with
!CONFIG_BKL") with http://permalink.gmane.org/gmane.linux.nfs/36521
applied on top. It basically took out the whole system.
------------[ cut here ]------------
kernel BUG at /scratch_space/linux-2.6/fs/ext4/page-io.c:146!
invalid opcode: 0000 [#1] PREEMPT SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:1d.7/usb8/8-3/8-3:1.0/uevent
CPU 0
Modules linked in: nls_iso8859_1 nls_cp437 vfat fat nfs nfs_acl bridge stp llc autofs4 nfsd lockd sunrpc exportfs ipv6 iptable_filter iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables x_tables snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_timer snd soundcore snd_page_alloc sg evdev usb_storage ext2 ehci_hcd sr_mod cdrom loop tun acpi_cpufreq mperf arc4 ecb crypto_blkcipher cryptomgr aead crypto_algapi rt2800pci rt2800lib crc_ccitt rt2x00pci rt2x00lib mac80211 cfg80211 eeprom_93cx6 e1000e
Pid: 30058, comm: ranlib Not tainted 2.6.37-rc1-00004-g1c7d46a #60 WG43M/Aspire X3810
RIP: 0010:[<ffffffff81107409>] [<ffffffff81107409>] ext4_init_io_end+0x3c/0x72
RSP: 0018:ffff880016a4b788 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff880062e2abe0 RCX: 0000000000000000
RDX: ffffea0002398c88 RSI: ffff8800018c3948 RDI: ffff880016a4a000
RBP: ffff880016a4b798 R08: 0000000000000023 R09: 0000000000008000
R10: ffff88013beaace0 R11: ffff88013beaace0 R12: ffff8800018c3948
R13: ffff880016a4b918 R14: ffff880100cd5f30 R15: ffffea0002398c88
FS: 00002abaac327b20(0000) GS:ffff8800b7a00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f573b7a8000 CR3: 000000008d3ed000 CR4: 00000000000406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ranlib (pid: 30058, threadinfo ffff880016a4a000, task ffff880016b59640)
Stack:
ffff8800a59999c0 ffffea0002398c88 ffff880016a4b818 ffffffff811075b1
ffff88013e028000 ffff880016a4bc68 0000100016b59640 ffffea0002398c88
ffff8800a59999c0 ffff880062e2abe0 0000100000008000 0000000000000080
Call Trace:
[<ffffffff811075b1>] ext4_bio_write_page+0x172/0x307
[<ffffffff811033a7>] mpage_da_submit_io+0x2f9/0x37b
[<ffffffff811068d7>] mpage_da_map_and_submit+0x2cc/0x2e2
[<ffffffff811069b3>] mpage_add_bh_to_extent+0xc6/0xd5
[<ffffffff81106c66>] write_cache_pages_da+0x2a4/0x3ac
[<ffffffff81107044>] ext4_da_writepages+0x2d6/0x44d
[<ffffffff81087910>] do_writepages+0x1c/0x25
[<ffffffff810810a4>] __filemap_fdatawrite_range+0x4b/0x4d
[<ffffffff810815f5>] filemap_fdatawrite_range+0xe/0x10
[<ffffffff81122a2e>] jbd2_journal_begin_ordered_truncate+0x7b/0xa2
[<ffffffff8110615d>] ext4_evict_inode+0x57/0x24c
[<ffffffff810c14a3>] evict+0x22/0x92
[<ffffffff810c1a3d>] iput+0x212/0x249
[<ffffffff810bdf16>] dentry_iput+0xa1/0xb9
[<ffffffff810bdf6b>] d_kill+0x3d/0x5d
[<ffffffff810be613>] dput+0x13a/0x147
[<ffffffff810b990d>] sys_renameat+0x1b5/0x258
[<ffffffff81145f71>] ? _atomic_dec_and_lock+0x2d/0x4c
[<ffffffff810b2950>] ? cp_new_stat+0xde/0xea
[<ffffffff810b29c1>] ? sys_newlstat+0x2d/0x38
[<ffffffff810b99c6>] sys_rename+0x16/0x18
[<ffffffff81002a2b>] system_call_fastpath+0x16/0x1b
Code: e8 97 4c fa ff 49 89 c4 48 85 c0 74 4c 31 c0 b9 1a 01 00 00 4c 89 e7 f3 ab 48 89 df e8 1f 95 fb ff 49 89 44 24 10 48 85 c0 75 04 <0f> 0b eb fe 49 8d 44 24 40 49 c7 44 24 38 00 05 00 00 49 89 44
RIP [<ffffffff81107409>] ext4_init_io_end+0x3c/0x72
RSP <ffff880016a4b788>
---[ end trace ddc79adad95a6879 ]---
--
Nick Bowler, Elliptic Technologies (http://www.elliptictech.com/)
On Tue, Nov 02, 2010 at 04:20:13PM -0400, Nick Bowler wrote:
> The following BUG occurred today while compiling gcc, with 2.6.37-rc1+.
> More precisely, commit 7fe19da4ca38 ("preempt: fix kernel build with
> !CONFIG_BKL") with http://permalink.gmane.org/gmane.linux.nfs/36521
> applied on top. It basically took out the whole system.
Hi Nick,
How easily can you reproduce this bug? I'm pretty sure I know what
caused it, but it's always good to get confirmation that a patch
addresses the reported bug. If it happens to you fairly often, can
you try out this patch and let me know whether it fixes the bug for
you?
Many thanks for reporting the bug,
- Ted
commit 684e31c9cd47fee65413d5ef620013d03921abb9
Author: Theodore Ts'o <[email protected]>
Date: Wed Nov 3 14:12:22 2010 -0400
ext4: handle writeback of inodes which are being freed
The following BUG can occur when an inode which is getting freed when
it still has dirty pages outstanding, and it gets deleted (in this
because it was the target of a rename). In ordered mode, we need to
make sure the data pages are written just in case we crash before the
rename (or unlink) is committed. If the inode is being freed then
when we try to igrab the inode, we end up tripping the BUG_ON at
fs/ext4/page-io.c:146.
In this case, don't need to do any of the endio handling, so we can
drop the BUG_ON and then arrange to make ext4_bio_end() handle this
case appropriately by releasing the pages and ending the writeback on
those pages, and then returning early without queueing the io_end
structure on the workqueue.
kernel BUG at /scratch_space/linux-2.6/fs/ext4/page-io.c:146!
Call Trace:
[<ffffffff811075b1>] ext4_bio_write_page+0x172/0x307
[<ffffffff811033a7>] mpage_da_submit_io+0x2f9/0x37b
[<ffffffff811068d7>] mpage_da_map_and_submit+0x2cc/0x2e2
[<ffffffff811069b3>] mpage_add_bh_to_extent+0xc6/0xd5
[<ffffffff81106c66>] write_cache_pages_da+0x2a4/0x3ac
[<ffffffff81107044>] ext4_da_writepages+0x2d6/0x44d
[<ffffffff81087910>] do_writepages+0x1c/0x25
[<ffffffff810810a4>] __filemap_fdatawrite_range+0x4b/0x4d
[<ffffffff810815f5>] filemap_fdatawrite_range+0xe/0x10
[<ffffffff81122a2e>] jbd2_journal_begin_ordered_truncate+0x7b/0xa2
[<ffffffff8110615d>] ext4_evict_inode+0x57/0x24c
[<ffffffff810c14a3>] evict+0x22/0x92
[<ffffffff810c1a3d>] iput+0x212/0x249
[<ffffffff810bdf16>] dentry_iput+0xa1/0xb9
[<ffffffff810bdf6b>] d_kill+0x3d/0x5d
[<ffffffff810be613>] dput+0x13a/0x147
[<ffffffff810b990d>] sys_renameat+0x1b5/0x258
[<ffffffff81145f71>] ? _atomic_dec_and_lock+0x2d/0x4c
[<ffffffff810b2950>] ? cp_new_stat+0xde/0xea
[<ffffffff810b29c1>] ? sys_newlstat+0x2d/0x38
[<ffffffff810b99c6>] sys_rename+0x16/0x18
[<ffffffff81002a2b>] system_call_fastpath+0x16/0x1b
Reported-by: Nick Bowler <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 46a7d6a..4f1b8cd 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -143,7 +143,6 @@ ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags)
if (io) {
memset(io, 0, sizeof(*io));
io->inode = igrab(inode);
- BUG_ON(!io->inode);
INIT_WORK(&io->work, ext4_end_io_work);
INIT_LIST_HEAD(&io->list);
}
@@ -171,35 +170,15 @@ static void ext4_end_bio(struct bio *bio, int error)
struct workqueue_struct *wq;
struct inode *inode;
unsigned long flags;
- ext4_fsblk_t err_block;
int i;
BUG_ON(!io_end);
- inode = io_end->inode;
bio->bi_private = NULL;
bio->bi_end_io = NULL;
if (test_bit(BIO_UPTODATE, &bio->bi_flags))
error = 0;
- err_block = bio->bi_sector >> (inode->i_blkbits - 9);
bio_put(bio);
- if (!(inode->i_sb->s_flags & MS_ACTIVE)) {
- pr_err("sb umounted, discard end_io request for inode %lu\n",
- io_end->inode->i_ino);
- ext4_free_io_end(io_end);
- return;
- }
-
- if (error) {
- io_end->flag |= EXT4_IO_END_ERROR;
- ext4_warning(inode->i_sb, "I/O error writing to inode %lu "
- "(offset %llu size %ld starting block %llu)",
- inode->i_ino,
- (unsigned long long) io_end->offset,
- (long) io_end->size,
- (unsigned long long) err_block);
- }
-
for (i = 0; i < io_end->num_io_pages; i++) {
struct page *page = io_end->pages[i]->p_page;
struct buffer_head *bh, *head;
@@ -254,9 +233,30 @@ static void ext4_end_bio(struct bio *bio, int error)
if (!partial_write)
SetPageUptodate(page);
}
-
io_end->num_io_pages = 0;
+ if ((inode = io_end->inode) == NULL)
+ goto no_work;
+
+ if (!(inode->i_sb->s_flags & MS_ACTIVE)) {
+ pr_err("sb umounted, discard end_io request for inode %lu\n",
+ io_end->inode->i_ino);
+ no_work:
+ ext4_free_io_end(io_end);
+ return;
+ }
+
+ if (error) {
+ io_end->flag |= EXT4_IO_END_ERROR;
+ ext4_warning(inode->i_sb, "I/O error writing to inode %lu "
+ "(offset %llu size %ld starting block %llu)",
+ inode->i_ino,
+ (unsigned long long) io_end->offset,
+ (long) io_end->size,
+ (unsigned long long)
+ bio->bi_sector >> (inode->i_blkbits - 9));
+ }
+
/* Add the io_end to per-inode completed io list*/
spin_lock_irqsave(&EXT4_I(inode)->i_completed_io_lock, flags);
list_add_tail(&io_end->list, &EXT4_I(inode)->i_completed_io_list);
On 11/2/10 4:20 PM, Nick Bowler wrote:
> The following BUG occurred today while compiling gcc, with 2.6.37-rc1+.
> More precisely, commit 7fe19da4ca38 ("preempt: fix kernel build with
> !CONFIG_BKL") with http://permalink.gmane.org/gmane.linux.nfs/36521
> applied on top. It basically took out the whole system.
>
> ------------[ cut here ]------------
> kernel BUG at /scratch_space/linux-2.6/fs/ext4/page-io.c:146!
138 ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags)
139 {
140 ext4_io_end_t *io = NULL;
141
142 io = kmem_cache_alloc(io_end_cachep, flags);
143 if (io) {
144 memset(io, 0, sizeof(*io));
145 io->inode = igrab(inode);
146 BUG_ON(!io->inode);
igrab can fail if it's being torn down:
/*
* Handle the case where s_op->clear_inode is not been
* called yet, and somebody is calling igrab
* while the inode is getting freed.
*/
inode = NULL;
and boom.
-Eric
> invalid opcode: 0000 [#1] PREEMPT SMP
> last sysfs file: /sys/devices/pci0000:00/0000:00:1d.7/usb8/8-3/8-3:1.0/uevent
> CPU 0
> Modules linked in: nls_iso8859_1 nls_cp437 vfat fat nfs nfs_acl bridge stp llc autofs4 nfsd lockd sunrpc exportfs ipv6 iptable_filter iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables x_tables snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_timer snd soundcore snd_page_alloc sg evdev usb_storage ext2 ehci_hcd sr_mod cdrom loop tun acpi_cpufreq mperf arc4 ecb crypto_blkcipher cryptomgr aead crypto_algapi rt2800pci rt2800lib crc_ccitt rt2x00pci rt2x00lib mac80211 cfg80211 eeprom_93cx6 e1000e
>
> Pid: 30058, comm: ranlib Not tainted 2.6.37-rc1-00004-g1c7d46a #60 WG43M/Aspire X3810
> RIP: 0010:[<ffffffff81107409>] [<ffffffff81107409>] ext4_init_io_end+0x3c/0x72
> RSP: 0018:ffff880016a4b788 EFLAGS: 00010246
> RAX: 0000000000000000 RBX: ffff880062e2abe0 RCX: 0000000000000000
> RDX: ffffea0002398c88 RSI: ffff8800018c3948 RDI: ffff880016a4a000
> RBP: ffff880016a4b798 R08: 0000000000000023 R09: 0000000000008000
> R10: ffff88013beaace0 R11: ffff88013beaace0 R12: ffff8800018c3948
> R13: ffff880016a4b918 R14: ffff880100cd5f30 R15: ffffea0002398c88
> FS: 00002abaac327b20(0000) GS:ffff8800b7a00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f573b7a8000 CR3: 000000008d3ed000 CR4: 00000000000406f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process ranlib (pid: 30058, threadinfo ffff880016a4a000, task ffff880016b59640)
> Stack:
> ffff8800a59999c0 ffffea0002398c88 ffff880016a4b818 ffffffff811075b1
> ffff88013e028000 ffff880016a4bc68 0000100016b59640 ffffea0002398c88
> ffff8800a59999c0 ffff880062e2abe0 0000100000008000 0000000000000080
> Call Trace:
> [<ffffffff811075b1>] ext4_bio_write_page+0x172/0x307
> [<ffffffff811033a7>] mpage_da_submit_io+0x2f9/0x37b
> [<ffffffff811068d7>] mpage_da_map_and_submit+0x2cc/0x2e2
> [<ffffffff811069b3>] mpage_add_bh_to_extent+0xc6/0xd5
> [<ffffffff81106c66>] write_cache_pages_da+0x2a4/0x3ac
> [<ffffffff81107044>] ext4_da_writepages+0x2d6/0x44d
> [<ffffffff81087910>] do_writepages+0x1c/0x25
> [<ffffffff810810a4>] __filemap_fdatawrite_range+0x4b/0x4d
> [<ffffffff810815f5>] filemap_fdatawrite_range+0xe/0x10
> [<ffffffff81122a2e>] jbd2_journal_begin_ordered_truncate+0x7b/0xa2
> [<ffffffff8110615d>] ext4_evict_inode+0x57/0x24c
> [<ffffffff810c14a3>] evict+0x22/0x92
> [<ffffffff810c1a3d>] iput+0x212/0x249
> [<ffffffff810bdf16>] dentry_iput+0xa1/0xb9
> [<ffffffff810bdf6b>] d_kill+0x3d/0x5d
> [<ffffffff810be613>] dput+0x13a/0x147
> [<ffffffff810b990d>] sys_renameat+0x1b5/0x258
> [<ffffffff81145f71>] ? _atomic_dec_and_lock+0x2d/0x4c
> [<ffffffff810b2950>] ? cp_new_stat+0xde/0xea
> [<ffffffff810b29c1>] ? sys_newlstat+0x2d/0x38
> [<ffffffff810b99c6>] sys_rename+0x16/0x18
> [<ffffffff81002a2b>] system_call_fastpath+0x16/0x1b
> Code: e8 97 4c fa ff 49 89 c4 48 85 c0 74 4c 31 c0 b9 1a 01 00 00 4c 89 e7 f3 ab 48 89 df e8 1f 95 fb ff 49 89 44 24 10 48 85 c0 75 04 <0f> 0b eb fe 49 8d 44 24 40 49 c7 44 24 38 00 05 00 00 49 89 44
> RIP [<ffffffff81107409>] ext4_init_io_end+0x3c/0x72
> RSP <ffff880016a4b788>
> ---[ end trace ddc79adad95a6879 ]---
>
On 2010-11-03 14:14 -0400, Ted Ts'o wrote:
> On Tue, Nov 02, 2010 at 04:20:13PM -0400, Nick Bowler wrote:
> > The following BUG occurred today while compiling gcc, with 2.6.37-rc1+.
> > More precisely, commit 7fe19da4ca38 ("preempt: fix kernel build with
> > !CONFIG_BKL") with http://permalink.gmane.org/gmane.linux.nfs/36521
> > applied on top. It basically took out the whole system.
>
> Hi Nick,
>
> How easily can you reproduce this bug? I'm pretty sure I know what
> caused it, but it's always good to get confirmation that a patch
> addresses the reported bug. If it happens to you fairly often, can
> you try out this patch and let me know whether it fixes the bug for
> you?
I only encountered it the one time, but I haven't tried compiling gcc
since it blew up the first time (there seems to be no problem compiling
linux, for instance). I will try it again now with and without the
patch.
--
Nick Bowler, Elliptic Technologies (http://www.elliptictech.com/)
On 2010-11-03 14:22 -0400, Nick Bowler wrote:
> On 2010-11-03 14:14 -0400, Ted Ts'o wrote:
> > How easily can you reproduce this bug? I'm pretty sure I know what
> > caused it, but it's always good to get confirmation that a patch
> > addresses the reported bug. If it happens to you fairly often, can
> > you try out this patch and let me know whether it fixes the bug for
> > you?
>
> I only encountered it the one time, but I haven't tried compiling gcc
> since it blew up the first time (there seems to be no problem compiling
> linux, for instance). I will try it again now with and without the
> patch.
OK, it's 100% reproducible: the kernel BUGs, without fail, every time I
do 'make install' in the gcc build tree. After applying the patch, it
seems that the original BUG is gone, but now there's a new one:
------------[ cut here ]------------
kernel BUG at /scratch_space/linux-2.6/fs/inode.c:1405!
invalid opcode: 0000 [#1] PREEMPT SMP
last sysfs file: /sys/devices/virtual/vtconsole/vtcon1/uevent
CPU 1
Modules linked in: netconsole nfs nfs_acl bridge stp llc autofs4 nfsd lockd sunrpc exportfs ipv6 iptable_filter iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables x_tables snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_timer snd soundcore snd_page_alloc sg evdev usb_storage ext2 ehci_hcd sr_mod cdrom loop tun acpi_cpufreq mperf arc4 ecb crypto_blkcipher cryptomgr aead crypto_algapi rt2800pci rt2800lib crc_ccitt rt2x00pci rt2x00lib mac80211 cfg80211 eeprom_93cx6 e1000e
Pid: 273, comm: kworker/1:1 Not tainted 2.6.37-rc1-00005-gdd0ce84 #77 WG43M/Aspire X3810
RIP: 0010:[<ffffffff810c1847>] [<ffffffff810c1847>] iput+0x1c/0x249
RSP: 0018:ffff88013ff1fdc0 EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff88012d3baf78 RCX: ffff88012d3bb220
RDX: ffff880021f2ac10 RSI: 0000000000000296 RDI: ffff88012d3baf78
RBP: ffff88013ff1fdd0 R08: 0000000000000000 R09: ffff88013ee95900
R10: ffff8800b7a8de70 R11: ffff88013ff1fda0 R12: 0000000000000000
R13: ffff88012d3bb230 R14: ffff880021f2bde8 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff8800b7a80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002b121db17a20 CR3: 000000013972d000 CR4: 00000000000406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kworker/1:1 (pid: 273, threadinfo ffff88013ff1e000, task ffff88013feec2c0)
Stack:
ffff880021f2bdb0 0000000000000000 ffff88013ff1fe00 ffffffff8110783d
ffff880021f2bdb0 ffff88012d3bb038 ffff88012d3bb230 ffff880021f2bde8
ffff88013ff1fe30 ffffffff81107b9a ffff88013fce6700 ffff8800b7a8da80
Call Trace:
[<ffffffff8110783d>] ext4_free_io_end+0x84/0x9c
[<ffffffff81107b9a>] ext4_end_io_work+0x81/0x8a
[<ffffffff81107b19>] ? ext4_end_io_work+0x0/0x8a
[<ffffffff8104970c>] process_one_work+0x1a8/0x286
[<ffffffff8104b1f9>] worker_thread+0x136/0x255
[<ffffffff8104b0c3>] ? worker_thread+0x0/0x255
[<ffffffff8104e2be>] kthread+0x7d/0x85
[<ffffffff81003754>] kernel_thread_helper+0x4/0x10
[<ffffffff8104e241>] ? kthread+0x0/0x85
[<ffffffff81003750>] ? kernel_thread_helper+0x0/0x10
Code: 29 08 f9 ff 48 83 c4 18 5b 41 5c 41 5d c9 c3 55 48 85 ff 48 89 e5 41 54 53 48 89 fb 0f 84 31 02 00 00 f6 87 e0 01 00 00 40 74 04 <0f> 0b eb fe 48 8d 7f 58 48 c7 c6 50 22 73 81 e8 f1 46 08 00 85
RIP [<ffffffff810c1847>] iput+0x1c/0x249
RSP <ffff88013ff1fdc0>
---[ end trace 4ed7b09a97b06d55 ]---
--
Nick Bowler, Elliptic Technologies (http://www.elliptictech.com/)
On Wed, Nov 03, 2010 at 02:14:21PM -0400, Eric Sandeen wrote:
> On 11/2/10 4:20 PM, Nick Bowler wrote:
> > The following BUG occurred today while compiling gcc, with 2.6.37-rc1+.
> > More precisely, commit 7fe19da4ca38 ("preempt: fix kernel build with
> > !CONFIG_BKL") with http://permalink.gmane.org/gmane.linux.nfs/36521
> > applied on top. It basically took out the whole system.
> >
> > ------------[ cut here ]------------
> > kernel BUG at /scratch_space/linux-2.6/fs/ext4/page-io.c:146!
>
> 138 ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags)
> 139 {
> 140 ext4_io_end_t *io = NULL;
> 141
> 142 io = kmem_cache_alloc(io_end_cachep, flags);
> 143 if (io) {
> 144 memset(io, 0, sizeof(*io));
> 145 io->inode = igrab(inode);
> 146 BUG_ON(!io->inode);
>
> igrab can fail if it's being torn down:
>
> /*
> * Handle the case where s_op->clear_inode is not been
> * called yet, and somebody is calling igrab
> * while the inode is getting freed.
> */
> inode = NULL;
>
> and boom.
Oh, nasty.
FWIW, the XFS code this was copied from doesn't have this problem
because the struct inode is not tagged for reclaim in
->destroy_inode until all writeback IO is completed. We keep a
separate active ioend reference count in the struct xfs_inode, and
the inode is never freed while there are still active IO references
(see the xfs_ioend_wait() call in xfs_fs_destroy_inode).
Hence the XFS ->writepage path does not need to take inode
references to handle the possibility of an inode being freed from
under it because the inode lifecycle model guarantees it
cannot occur. Perhaps ext4 needs to copy more from XFS.... ;)
BTW, io_end_cachep() probably should use a mempool (like the
equivalent XFS ioend slab cache), otherwise ext4 won't be able to
make writeback progress in OOM conditions and will avoid needing to
handle ENOMEM errors in ->writepage.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Tue, 2010-11-02 at 16:20 -0400, Nick Bowler wrote:
> ------------[ cut here ]------------
> kernel BUG at /scratch_space/linux-2.6/fs/ext4/page-io.c:146!
> invalid opcode: 0000 [#1] PREEMPT SMP
Confirmed, it's an issue here too.
I have just hit this one 2 times in a row within 20 minutes. Please CC
me if you need any fix tested; looks like I'm off to 2.6.26 until this
is fixed.
Linux prometheus 2.6.37-rc1-00040-g0660a9b #1 SMP PREEMPT Fri Nov 5
22:09:58 GMT 2010 x86_64 Six-Core AMD Opteron(tm) Processor 2435
AuthenticAMD GNU/Linux
I can provide .config if needed; have digital photos of the oops if they
are deemed useful. The BUG hit at the exact same line as you though.
Regards,
--
Tony Vroon
UNIX systems administrator
London Internet Exchange Ltd, Trinity Court, Trinity Street,
Peterborough, PE1 1DA
Registered in England number 3137929
E-Mail: [email protected]
On Nov 5, 2010, at 7:09 PM, Tony Vroon wrote:
> On Tue, 2010-11-02 at 16:20 -0400, Nick Bowler wrote:
>> ------------[ cut here ]------------
>> kernel BUG at /scratch_space/linux-2.6/fs/ext4/page-io.c:146!
>> invalid opcode: 0000 [#1] PREEMPT SMP
>
> Confirmed, it's an issue here too.
> I have just hit this one 2 times in a row within 20 minutes. Please CC
> me if you need any fix tested; looks like I'm off to 2.6.26 until this
> is fixed.
>
Hi, my apologies for not having a lot of time to look at this. I've been at the Linux Plumber's Conference and Kernel Summit this whole week.
What did you have running which triggered this, and can you try the patch that I sent in this patch, and see if you get the same other BUG_ON? And if you have the stack trace easily to hand, it would be useful to make sure that part is the same as well.
Many thanks,
-- Ted
On Fri, 2010-11-05 at 20:33 -0400, Theodore Tso wrote:
> What did you have running which triggered this, and can you try the patch that I sent in this patch, and see if you get the same other BUG_ON?
Updating /cvs/gentoo-x86 seemed to trigger it; it is a rather large
collection of small files. Basic contents visible here:
http://sources.gentoo.org/cgi-bin/viewvc.cgi/gentoo-x86/
Unlike Nick however, I do not have a specific "this does it 100% of the
time" command. At least, I updated CVS again after it died roughly 90%
in; after deleting the directories CVS was unhappy with the sync
completed.
Syncing CVS too often would be a bit unfair on the server at the other
end, so I will try this again tomorrow.
> And if you have the stack trace easily to hand, it would be useful to make sure that part is the same as well.
I will sift through my quick snaps now to see if it is legible enough to
type it out for you. More news in half an hour or so.
> Many thanks,
> -- Ted
Regards,
--
Tony Vroon
UNIX systems administrator
London Internet Exchange Ltd, Trinity Court, Trinity Street,
Peterborough, PE1 1DA
Registered in England number 3137929
E-Mail: [email protected]
On Fri, 2010-11-05 at 20:33 -0400, Theodore Tso wrote:
> And if you have the stack trace easily to hand, it would be useful to make sure that part is the same as well.
Seemingly identical between 2.6.36-09452-g2d10d87 &
2.6.37-rc1-00040-g0660a9b, both had cvs as the active process.
This one is from 2.6.37-rc1-00040-g0660a9b:
Call Trace:
[<ffffffff810e6716>] ? ext4_bio_write_page+0x176/0x30a
[<ffffffff810e25db>] ? mpage_da_submit_io+0x2ad/0x31e
[<ffffffff8106d034>] ? pagevec_lookup_tag+0x18/0x1f
[<ffffffff810e5c56>] ? write_cache_pages_da+0xad/0x36d
[<ffffffff810e5ac3>] ? mpage_da_map_and_submit+0x2c4/0x2da
[<ffffffff810edbba>] ? ext4_journal_start_sb+0xd4/0x108
[<ffffffff810269ac>] ? should_resched+0x5/0x26
[<ffffffff810a2f26>] ? _cond_resched+0x9/0x1e
[<ffffffff810e61ee>] ? ext4_da_writepages+0x2d8/0x422
[<ffffffff810b3471>] ? __getblk+0x1c/0x2d4
[<ffffffff81065cad>] ? __filemap_fdatawrite_range+0x48/0x4d
[<ffffffff810ff15c>] ? jbd2_journal_begin_ordered_truncated+0x72/0x9b
[<ffffffff810bb060>] ? fsnotify_clear_marks_by_inode+0x8a/0xc7
[<ffffffff810e5359>] ? ext4_evict_inode+0x56/0x241
[<ffffffff810a4c62>] ? evict+0x1a/0x89
[<ffffffff810a51dd>] ? iput+0x20d/0x246
[<ffffffff8109d85f>] ? do_unlinkat+0xf4/0x14c
[<ffffffff810029ab>] ? system_call_fastpath+0x16/0x1b
> Many thanks,
> -- Ted
Regards,
--
Tony Vroon
UNIX systems administrator
London Internet Exchange Ltd, Trinity Court, Trinity Street,
Peterborough, PE1 1DA
Registered in England number 3137929
E-Mail: [email protected]
On Sat, Nov 06, 2010 at 12:39:36AM +0000, Tony Vroon wrote:
> Unlike Nick however, I do not have a specific "this does it 100% of the
> time" command. At least, I updated CVS again after it died roughly 90%
> in; after deleting the directories CVS was unhappy with the sync
> completed.
> Syncing CVS too often would be a bit unfair on the server at the other
> end, so I will try this again tomorrow.
Fair enough. I may try to build gcc and use that as a repro case. My
attempts to build quick synthetic repro's haven't been successful to
date.
- Ted
On Wed, Nov 03, 2010 at 04:53:06PM -0400, Nick Bowler wrote:
>
> OK, it's 100% reproducible: the kernel BUGs, without fail, every time I
> do 'make install' in the gcc build tree. After applying the patch, it
> seems that the original BUG is gone, but now there's a new one:
OK, I think I have a new version of the patch that should fix both the
original and the second BUG_ON which you reported. Unfortunately,
I've not been able to reproduce the BUG_ON, even by downloading gcc
4.5.1, configuring to build just gcc, and then doing a "make install".
I'm not sure what languages you had enabled, or how much memory you
have on your machine, etc., all of which might have made it harder for
me to repro the bug. Still, I think I now understand what's causing
it.
Can you give this a try and let me know if this fixes things for you?
- Ted
>From 94acf883b316f50b38018f710341a943cdd38d0d Mon Sep 17 00:00:00 2001
From: Theodore Ts'o <[email protected]>
Date: Sun, 7 Nov 2010 16:11:42 -0500
Subject: [PATCH -v2] ext4: handle writeback of inodes which are being freed
The following BUG can occur when an inode which is getting freed when
it still has dirty pages outstanding, and it gets deleted (in this
because it was the target of a rename). In ordered mode, we need to
make sure the data pages are written just in case we crash before the
rename (or unlink) is committed. If the inode is being freed then
when we try to igrab the inode, we end up tripping the BUG_ON at
fs/ext4/page-io.c:146.
In this case, don't need to do any of the endio handling, so we can
drop the BUG_ON and then arrange to make ext4_bio_end() handle this
case appropriately by releasing the pages and ending the writeback on
those pages, and then returning early without queueing the io_end
structure on the workqueue.
kernel BUG at /scratch_space/linux-2.6/fs/ext4/page-io.c:146!
Call Trace:
[<ffffffff811075b1>] ext4_bio_write_page+0x172/0x307
[<ffffffff811033a7>] mpage_da_submit_io+0x2f9/0x37b
[<ffffffff811068d7>] mpage_da_map_and_submit+0x2cc/0x2e2
[<ffffffff811069b3>] mpage_add_bh_to_extent+0xc6/0xd5
[<ffffffff81106c66>] write_cache_pages_da+0x2a4/0x3ac
[<ffffffff81107044>] ext4_da_writepages+0x2d6/0x44d
[<ffffffff81087910>] do_writepages+0x1c/0x25
[<ffffffff810810a4>] __filemap_fdatawrite_range+0x4b/0x4d
[<ffffffff810815f5>] filemap_fdatawrite_range+0xe/0x10
[<ffffffff81122a2e>] jbd2_journal_begin_ordered_truncate+0x7b/0xa2
[<ffffffff8110615d>] ext4_evict_inode+0x57/0x24c
[<ffffffff810c14a3>] evict+0x22/0x92
[<ffffffff810c1a3d>] iput+0x212/0x249
[<ffffffff810bdf16>] dentry_iput+0xa1/0xb9
[<ffffffff810bdf6b>] d_kill+0x3d/0x5d
[<ffffffff810be613>] dput+0x13a/0x147
[<ffffffff810b990d>] sys_renameat+0x1b5/0x258
[<ffffffff81145f71>] ? _atomic_dec_and_lock+0x2d/0x4c
[<ffffffff810b2950>] ? cp_new_stat+0xde/0xea
[<ffffffff810b29c1>] ? sys_newlstat+0x2d/0x38
[<ffffffff810b99c6>] sys_rename+0x16/0x18
[<ffffffff81002a2b>] system_call_fastpath+0x16/0x1b
Reported-by: Nick Bowler <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/page-io.c | 45 ++++++++++++++++++++++-----------------------
1 files changed, 22 insertions(+), 23 deletions(-)
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 46a7d6a..94bbdb4 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -143,7 +143,6 @@ ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags)
if (io) {
memset(io, 0, sizeof(*io));
io->inode = igrab(inode);
- BUG_ON(!io->inode);
INIT_WORK(&io->work, ext4_end_io_work);
INIT_LIST_HEAD(&io->list);
}
@@ -171,35 +170,15 @@ static void ext4_end_bio(struct bio *bio, int error)
struct workqueue_struct *wq;
struct inode *inode;
unsigned long flags;
- ext4_fsblk_t err_block;
int i;
BUG_ON(!io_end);
- inode = io_end->inode;
bio->bi_private = NULL;
bio->bi_end_io = NULL;
if (test_bit(BIO_UPTODATE, &bio->bi_flags))
error = 0;
- err_block = bio->bi_sector >> (inode->i_blkbits - 9);
bio_put(bio);
- if (!(inode->i_sb->s_flags & MS_ACTIVE)) {
- pr_err("sb umounted, discard end_io request for inode %lu\n",
- io_end->inode->i_ino);
- ext4_free_io_end(io_end);
- return;
- }
-
- if (error) {
- io_end->flag |= EXT4_IO_END_ERROR;
- ext4_warning(inode->i_sb, "I/O error writing to inode %lu "
- "(offset %llu size %ld starting block %llu)",
- inode->i_ino,
- (unsigned long long) io_end->offset,
- (long) io_end->size,
- (unsigned long long) err_block);
- }
-
for (i = 0; i < io_end->num_io_pages; i++) {
struct page *page = io_end->pages[i]->p_page;
struct buffer_head *bh, *head;
@@ -254,9 +233,30 @@ static void ext4_end_bio(struct bio *bio, int error)
if (!partial_write)
SetPageUptodate(page);
}
-
io_end->num_io_pages = 0;
+ if ((inode = io_end->inode) == NULL)
+ goto no_work;
+
+ if (!(inode->i_sb->s_flags & MS_ACTIVE)) {
+ pr_err("sb umounted, discard end_io request for inode %lu\n",
+ io_end->inode->i_ino);
+ no_work:
+ ext4_free_io_end(io_end);
+ return;
+ }
+
+ if (error) {
+ io_end->flag |= EXT4_IO_END_ERROR;
+ ext4_warning(inode->i_sb, "I/O error writing to inode %lu "
+ "(offset %llu size %ld starting block %llu)",
+ inode->i_ino,
+ (unsigned long long) io_end->offset,
+ (long) io_end->size,
+ (unsigned long long)
+ bio->bi_sector >> (inode->i_blkbits - 9));
+ }
+
/* Add the io_end to per-inode completed io list*/
spin_lock_irqsave(&EXT4_I(inode)->i_completed_io_lock, flags);
list_add_tail(&io_end->list, &EXT4_I(inode)->i_completed_io_list);
@@ -305,7 +305,6 @@ static int io_submit_init(struct ext4_io_submit *io,
bio->bi_private = io->io_end = io_end;
bio->bi_end_io = ext4_end_bio;
- io_end->inode = inode;
io_end->offset = (page->index << PAGE_CACHE_SHIFT) + bh_offset(bh);
io->io_bio = bio;
--
1.7.3.1
On Sun, 2010-11-07 at 16:21 -0500, Ted Ts'o wrote:
> OK, I think I have a new version of the patch that should fix both the
> original and the second BUG_ON which you reported.
> Subject: [PATCH -v2] ext4: handle writeback of inodes which are being freed
Tested-By: Tony Vroon <[email protected]>
This survived a 'cvs up' in /cvs/gentoo-x86 with only the CVS directory
in place; a workload that died at roughly 90% before. I did my best to
recreate the heavy I/O conditions that existed when it died there
before, with Evolution being opened just as cvs starts the first writes.
I opened Transmission-gtk with a few LiveDVD ISOs just for good measure.
The system was loaded enough for "su -" to stall for 20 seconds at some
points but ext4 survived now.
I do believe you have found the culprit, many thanks :)
Regards,
--
Tony Vroon
UNIX systems administrator
London Internet Exchange Ltd, Trinity Court, Trinity Street,
Peterborough, PE1 1DA
Registered in England number 3137929
E-Mail: [email protected]
On Sun, Nov 07, 2010 at 04:21:43PM -0500, Ted Ts'o wrote:
> On Wed, Nov 03, 2010 at 04:53:06PM -0400, Nick Bowler wrote:
> >
> > OK, it's 100% reproducible: the kernel BUGs, without fail, every time I
> > do 'make install' in the gcc build tree. After applying the patch, it
> > seems that the original BUG is gone, but now there's a new one:
>
> OK, I think I have a new version of the patch that should fix both the
> original and the second BUG_ON which you reported. Unfortunately,
> I've not been able to reproduce the BUG_ON, even by downloading gcc
> 4.5.1, configuring to build just gcc, and then doing a "make install".
> I'm not sure what languages you had enabled, or how much memory you
> have on your machine, etc., all of which might have made it harder for
> me to repro the bug. Still, I think I now understand what's causing
> it.
>
> Can you give this a try and let me know if this fixes things for you?
>
> - Ted
>
> From 94acf883b316f50b38018f710341a943cdd38d0d Mon Sep 17 00:00:00 2001
> From: Theodore Ts'o <[email protected]>
> Date: Sun, 7 Nov 2010 16:11:42 -0500
> Subject: [PATCH -v2] ext4: handle writeback of inodes which are being freed
>
> The following BUG can occur when an inode which is getting freed when
> it still has dirty pages outstanding, and it gets deleted (in this
> because it was the target of a rename). In ordered mode, we need to
> make sure the data pages are written just in case we crash before the
> rename (or unlink) is committed. If the inode is being freed then
> when we try to igrab the inode, we end up tripping the BUG_ON at
> fs/ext4/page-io.c:146.
>
> In this case, don't need to do any of the endio handling, so we can
> drop the BUG_ON and then arrange to make ext4_bio_end() handle this
> case appropriately by releasing the pages and ending the writeback on
> those pages, and then returning early without queueing the io_end
> structure on the workqueue.
>
> kernel BUG at /scratch_space/linux-2.6/fs/ext4/page-io.c:146!
> Call Trace:
> [<ffffffff811075b1>] ext4_bio_write_page+0x172/0x307
> [<ffffffff811033a7>] mpage_da_submit_io+0x2f9/0x37b
> [<ffffffff811068d7>] mpage_da_map_and_submit+0x2cc/0x2e2
> [<ffffffff811069b3>] mpage_add_bh_to_extent+0xc6/0xd5
> [<ffffffff81106c66>] write_cache_pages_da+0x2a4/0x3ac
> [<ffffffff81107044>] ext4_da_writepages+0x2d6/0x44d
> [<ffffffff81087910>] do_writepages+0x1c/0x25
> [<ffffffff810810a4>] __filemap_fdatawrite_range+0x4b/0x4d
> [<ffffffff810815f5>] filemap_fdatawrite_range+0xe/0x10
> [<ffffffff81122a2e>] jbd2_journal_begin_ordered_truncate+0x7b/0xa2
> [<ffffffff8110615d>] ext4_evict_inode+0x57/0x24c
> [<ffffffff810c14a3>] evict+0x22/0x92
> [<ffffffff810c1a3d>] iput+0x212/0x249
> [<ffffffff810bdf16>] dentry_iput+0xa1/0xb9
> [<ffffffff810bdf6b>] d_kill+0x3d/0x5d
> [<ffffffff810be613>] dput+0x13a/0x147
> [<ffffffff810b990d>] sys_renameat+0x1b5/0x258
> [<ffffffff81145f71>] ? _atomic_dec_and_lock+0x2d/0x4c
> [<ffffffff810b2950>] ? cp_new_stat+0xde/0xea
> [<ffffffff810b29c1>] ? sys_newlstat+0x2d/0x38
> [<ffffffff810b99c6>] sys_rename+0x16/0x18
> [<ffffffff81002a2b>] system_call_fastpath+0x16/0x1b
>
> Reported-by: Nick Bowler <[email protected]>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
> ---
> fs/ext4/page-io.c | 45 ++++++++++++++++++++++-----------------------
> 1 files changed, 22 insertions(+), 23 deletions(-)
>
> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index 46a7d6a..94bbdb4 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -143,7 +143,6 @@ ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags)
> if (io) {
> memset(io, 0, sizeof(*io));
> io->inode = igrab(inode);
> - BUG_ON(!io->inode);
> INIT_WORK(&io->work, ext4_end_io_work);
> INIT_LIST_HEAD(&io->list);
> }
> @@ -171,35 +170,15 @@ static void ext4_end_bio(struct bio *bio, int error)
> struct workqueue_struct *wq;
> struct inode *inode;
> unsigned long flags;
> - ext4_fsblk_t err_block;
> int i;
>
> BUG_ON(!io_end);
> - inode = io_end->inode;
> bio->bi_private = NULL;
> bio->bi_end_io = NULL;
> if (test_bit(BIO_UPTODATE, &bio->bi_flags))
> error = 0;
> - err_block = bio->bi_sector >> (inode->i_blkbits - 9);
> bio_put(bio);
>
> - if (!(inode->i_sb->s_flags & MS_ACTIVE)) {
> - pr_err("sb umounted, discard end_io request for inode %lu\n",
> - io_end->inode->i_ino);
> - ext4_free_io_end(io_end);
> - return;
> - }
> -
> - if (error) {
> - io_end->flag |= EXT4_IO_END_ERROR;
> - ext4_warning(inode->i_sb, "I/O error writing to inode %lu "
> - "(offset %llu size %ld starting block %llu)",
> - inode->i_ino,
> - (unsigned long long) io_end->offset,
> - (long) io_end->size,
> - (unsigned long long) err_block);
> - }
> -
> for (i = 0; i < io_end->num_io_pages; i++) {
> struct page *page = io_end->pages[i]->p_page;
> struct buffer_head *bh, *head;
> @@ -254,9 +233,30 @@ static void ext4_end_bio(struct bio *bio, int error)
> if (!partial_write)
> SetPageUptodate(page);
> }
> -
> io_end->num_io_pages = 0;
>
> + if ((inode = io_end->inode) == NULL)
> + goto no_work;
> +
> + if (!(inode->i_sb->s_flags & MS_ACTIVE)) {
> + pr_err("sb umounted, discard end_io request for inode %lu\n",
> + io_end->inode->i_ino);
> + no_work:
> + ext4_free_io_end(io_end);
> + return;
> + }
Doesn't this mean you are tossing away unwritten extent conversion
processing when IO is issued on any inode in the I_WILL_FREE/I_FREEING
state, or completing IO after the unmount is in progress?
Cheers,
Dave.
> +
> + if (error) {
> + io_end->flag |= EXT4_IO_END_ERROR;
> + ext4_warning(inode->i_sb, "I/O error writing to inode %lu "
> + "(offset %llu size %ld starting block %llu)",
> + inode->i_ino,
> + (unsigned long long) io_end->offset,
> + (long) io_end->size,
> + (unsigned long long)
> + bio->bi_sector >> (inode->i_blkbits - 9));
> + }
> +
> /* Add the io_end to per-inode completed io list*/
> spin_lock_irqsave(&EXT4_I(inode)->i_completed_io_lock, flags);
> list_add_tail(&io_end->list, &EXT4_I(inode)->i_completed_io_list);
> @@ -305,7 +305,6 @@ static int io_submit_init(struct ext4_io_submit *io,
> bio->bi_private = io->io_end = io_end;
> bio->bi_end_io = ext4_end_bio;
>
> - io_end->inode = inode;
> io_end->offset = (page->index << PAGE_CACHE_SHIFT) + bh_offset(bh);
>
> io->io_bio = bio;
> --
> 1.7.3.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
--
Dave Chinner
[email protected]
On Mon, Nov 08, 2010 at 12:19:19PM +1100, Dave Chinner wrote:
>
> Doesn't this mean you are tossing away unwritten extent conversion
> processing when IO is issued on any inode in the I_WILL_FREE/I_FREEING
> state, or completing IO after the unmount is in progress?
Yeah, thanks for pointing that out. Clearly I must have been
sleep-deprived, since I had managed to convince myself we'd only end
up in this situation when the inode had been deleted (as opposed to
the situation where there's so much memory pressure that we're pushing
out an inode that still has writebacks pending).
Let's try for patch #3:
>From 862ded6f705884fe08ebad63580aec240635d91d Mon Sep 17 00:00:00 2001
From: Theodore Ts'o <[email protected]>
Date: Sun, 7 Nov 2010 23:44:52 -0500
Subject: [PATCH -v3] ext4: handle writeback of inodes which are being freed
The following BUG can occur when an inode which is getting freed when
it still has dirty pages outstanding, and it gets deleted (in this
because it was the target of a rename). In ordered mode, we need to
make sure the data pages are written just in case we crash before the
rename (or unlink) is committed. If the inode is being freed then
when we try to igrab the inode, we end up tripping the BUG_ON at
fs/ext4/page-io.c:146.
To solve this problem, we need to keep track of the number of io
callbacks which are pending, and avoid destroying the inode until they
have all been completed. That way we don't have to bump the inode
count to keep the inode from being destroyed; an approach which
doesn't work because the count could have already been dropped down to
zero before the inode writeback has started (at which point we're not
allowed to bump the count back up to 1, since it's already started
getting freed).
Thanks to Dave Chinner for suggesting this approach, which is also
used by XFS.
kernel BUG at /scratch_space/linux-2.6/fs/ext4/page-io.c:146!
Call Trace:
[<ffffffff811075b1>] ext4_bio_write_page+0x172/0x307
[<ffffffff811033a7>] mpage_da_submit_io+0x2f9/0x37b
[<ffffffff811068d7>] mpage_da_map_and_submit+0x2cc/0x2e2
[<ffffffff811069b3>] mpage_add_bh_to_extent+0xc6/0xd5
[<ffffffff81106c66>] write_cache_pages_da+0x2a4/0x3ac
[<ffffffff81107044>] ext4_da_writepages+0x2d6/0x44d
[<ffffffff81087910>] do_writepages+0x1c/0x25
[<ffffffff810810a4>] __filemap_fdatawrite_range+0x4b/0x4d
[<ffffffff810815f5>] filemap_fdatawrite_range+0xe/0x10
[<ffffffff81122a2e>] jbd2_journal_begin_ordered_truncate+0x7b/0xa2
[<ffffffff8110615d>] ext4_evict_inode+0x57/0x24c
[<ffffffff810c14a3>] evict+0x22/0x92
[<ffffffff810c1a3d>] iput+0x212/0x249
[<ffffffff810bdf16>] dentry_iput+0xa1/0xb9
[<ffffffff810bdf6b>] d_kill+0x3d/0x5d
[<ffffffff810be613>] dput+0x13a/0x147
[<ffffffff810b990d>] sys_renameat+0x1b5/0x258
[<ffffffff81145f71>] ? _atomic_dec_and_lock+0x2d/0x4c
[<ffffffff810b2950>] ? cp_new_stat+0xde/0xea
[<ffffffff810b29c1>] ? sys_newlstat+0x2d/0x38
[<ffffffff810b99c6>] sys_rename+0x16/0x18
[<ffffffff81002a2b>] system_call_fastpath+0x16/0x1b
Reported-by: Nick Bowler <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/ext4.h | 2 +
fs/ext4/page-io.c | 70 ++++++++++++++++++++++++++++++++++-------------------
fs/ext4/super.c | 2 +
3 files changed, 49 insertions(+), 25 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 8b5dd63..670d134 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -858,6 +858,7 @@ struct ext4_inode_info {
spinlock_t i_completed_io_lock;
/* current io_end structure for async DIO write*/
ext4_io_end_t *cur_aio_dio;
+ atomic_t i_ioend_count; /* Number of outstanding io_end structs */
/*
* Transactions that contain inode's metadata needed to complete
@@ -2060,6 +2061,7 @@ extern int ext4_move_extents(struct file *o_filp, struct file *d_filp,
/* page-io.c */
extern int __init ext4_init_pageio(void);
extern void ext4_exit_pageio(void);
+extern void ext4_ioend_wait(struct inode *);
extern void ext4_free_io_end(ext4_io_end_t *io);
extern ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags);
extern int ext4_end_io_nolock(ext4_io_end_t *io);
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 46a7d6a..432a233 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -32,8 +32,14 @@
static struct kmem_cache *io_page_cachep, *io_end_cachep;
+#define WQ_HASH_SZ 37
+#define to_ioend_wq(v) (&ioend_wq[((unsigned long)v) % WQ_HASH_SZ])
+static wait_queue_head_t ioend_wq[WQ_HASH_SZ];
+
int __init ext4_init_pageio(void)
{
+ int i;
+
io_page_cachep = KMEM_CACHE(ext4_io_page, SLAB_RECLAIM_ACCOUNT);
if (io_page_cachep == NULL)
return -ENOMEM;
@@ -42,6 +48,8 @@ int __init ext4_init_pageio(void)
kmem_cache_destroy(io_page_cachep);
return -ENOMEM;
}
+ for (i = 0; i < WQ_HASH_SZ; i++)
+ init_waitqueue_head(&ioend_wq[i]);
return 0;
}
@@ -52,9 +60,17 @@ void ext4_exit_pageio(void)
kmem_cache_destroy(io_page_cachep);
}
+void ext4_ioend_wait(struct inode *inode)
+{
+ wait_queue_head_t *wq = to_ioend_wq(inode);
+
+ wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_ioend_count) == 0));
+}
+
void ext4_free_io_end(ext4_io_end_t *io)
{
int i;
+ wait_queue_head_t *wq;
BUG_ON(!io);
if (io->page)
@@ -69,7 +85,10 @@ void ext4_free_io_end(ext4_io_end_t *io)
}
}
io->num_io_pages = 0;
- iput(io->inode);
+ wq = to_ioend_wq(io->inode);
+ if (atomic_dec_and_test(&EXT4_I(io->inode)->i_ioend_count) &&
+ waitqueue_active(wq))
+ wake_up_all(wq);
kmem_cache_free(io_end_cachep, io);
}
@@ -142,8 +161,8 @@ ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags)
io = kmem_cache_alloc(io_end_cachep, flags);
if (io) {
memset(io, 0, sizeof(*io));
- io->inode = igrab(inode);
- BUG_ON(!io->inode);
+ atomic_inc(&EXT4_I(inode)->i_ioend_count);
+ io->inode = inode;
INIT_WORK(&io->work, ext4_end_io_work);
INIT_LIST_HEAD(&io->list);
}
@@ -171,35 +190,15 @@ static void ext4_end_bio(struct bio *bio, int error)
struct workqueue_struct *wq;
struct inode *inode;
unsigned long flags;
- ext4_fsblk_t err_block;
int i;
BUG_ON(!io_end);
- inode = io_end->inode;
bio->bi_private = NULL;
bio->bi_end_io = NULL;
if (test_bit(BIO_UPTODATE, &bio->bi_flags))
error = 0;
- err_block = bio->bi_sector >> (inode->i_blkbits - 9);
bio_put(bio);
- if (!(inode->i_sb->s_flags & MS_ACTIVE)) {
- pr_err("sb umounted, discard end_io request for inode %lu\n",
- io_end->inode->i_ino);
- ext4_free_io_end(io_end);
- return;
- }
-
- if (error) {
- io_end->flag |= EXT4_IO_END_ERROR;
- ext4_warning(inode->i_sb, "I/O error writing to inode %lu "
- "(offset %llu size %ld starting block %llu)",
- inode->i_ino,
- (unsigned long long) io_end->offset,
- (long) io_end->size,
- (unsigned long long) err_block);
- }
-
for (i = 0; i < io_end->num_io_pages; i++) {
struct page *page = io_end->pages[i]->p_page;
struct buffer_head *bh, *head;
@@ -254,9 +253,30 @@ static void ext4_end_bio(struct bio *bio, int error)
if (!partial_write)
SetPageUptodate(page);
}
-
io_end->num_io_pages = 0;
+ if ((inode = io_end->inode) == NULL)
+ goto no_work;
+
+ if (!(inode->i_sb->s_flags & MS_ACTIVE)) {
+ pr_err("sb umounted, discard end_io request for inode %lu\n",
+ io_end->inode->i_ino);
+ no_work:
+ ext4_free_io_end(io_end);
+ return;
+ }
+
+ if (error) {
+ io_end->flag |= EXT4_IO_END_ERROR;
+ ext4_warning(inode->i_sb, "I/O error writing to inode %lu "
+ "(offset %llu size %ld starting block %llu)",
+ inode->i_ino,
+ (unsigned long long) io_end->offset,
+ (long) io_end->size,
+ (unsigned long long)
+ bio->bi_sector >> (inode->i_blkbits - 9));
+ }
+
/* Add the io_end to per-inode completed io list*/
spin_lock_irqsave(&EXT4_I(inode)->i_completed_io_lock, flags);
list_add_tail(&io_end->list, &EXT4_I(inode)->i_completed_io_list);
@@ -305,8 +325,8 @@ static int io_submit_init(struct ext4_io_submit *io,
bio->bi_private = io->io_end = io_end;
bio->bi_end_io = ext4_end_bio;
- io_end->inode = inode;
io_end->offset = (page->index << PAGE_CACHE_SHIFT) + bh_offset(bh);
+ io_end->flag = EXT4_IO_END_UNWRITTEN;
io->io_bio = bio;
io->io_op = (wbc->sync_mode == WB_SYNC_ALL ?
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 04352e9..45653af 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -828,12 +828,14 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
ei->cur_aio_dio = NULL;
ei->i_sync_tid = 0;
ei->i_datasync_tid = 0;
+ atomic_set(&ei->i_ioend_count, 0);
return &ei->vfs_inode;
}
static void ext4_destroy_inode(struct inode *inode)
{
+ ext4_ioend_wait(inode);
if (!list_empty(&(EXT4_I(inode)->i_orphan))) {
ext4_msg(inode->i_sb, KERN_ERR,
"Inode %lu (%p): orphan list check failed!",
--
1.7.3.1
Sigh... and an extra bogus line slipped through. This one passes
fsstress on both 1k and 4k block sizes.
- Ted
>From 7e619bed3311b843238c03aa5090a576e02f097b Mon Sep 17 00:00:00 2001
From: Theodore Ts'o <[email protected]>
Date: Mon, 8 Nov 2010 00:36:53 -0500
Subject: [PATCH -v4] ext4: handle writeback of inodes which are being freed
The following BUG can occur when an inode which is getting freed when
it still has dirty pages outstanding, and it gets deleted (in this
because it was the target of a rename). In ordered mode, we need to
make sure the data pages are written just in case we crash before the
rename (or unlink) is committed. If the inode is being freed then
when we try to igrab the inode, we end up tripping the BUG_ON at
fs/ext4/page-io.c:146.
To solve this problem, we need to keep track of the number of io
callbacks which are pending, and avoid destroying the inode until they
have all been completed. That way we don't have to bump the inode
count to keep the inode from being destroyed; an approach which
doesn't work because the count could have already been dropped down to
zero before the inode writeback has started (at which point we're not
allowed to bump the count back up to 1, since it's already started
getting freed).
Thanks to Dave Chinner for suggesting this approach, which is also
used by XFS.
kernel BUG at /scratch_space/linux-2.6/fs/ext4/page-io.c:146!
Call Trace:
[<ffffffff811075b1>] ext4_bio_write_page+0x172/0x307
[<ffffffff811033a7>] mpage_da_submit_io+0x2f9/0x37b
[<ffffffff811068d7>] mpage_da_map_and_submit+0x2cc/0x2e2
[<ffffffff811069b3>] mpage_add_bh_to_extent+0xc6/0xd5
[<ffffffff81106c66>] write_cache_pages_da+0x2a4/0x3ac
[<ffffffff81107044>] ext4_da_writepages+0x2d6/0x44d
[<ffffffff81087910>] do_writepages+0x1c/0x25
[<ffffffff810810a4>] __filemap_fdatawrite_range+0x4b/0x4d
[<ffffffff810815f5>] filemap_fdatawrite_range+0xe/0x10
[<ffffffff81122a2e>] jbd2_journal_begin_ordered_truncate+0x7b/0xa2
[<ffffffff8110615d>] ext4_evict_inode+0x57/0x24c
[<ffffffff810c14a3>] evict+0x22/0x92
[<ffffffff810c1a3d>] iput+0x212/0x249
[<ffffffff810bdf16>] dentry_iput+0xa1/0xb9
[<ffffffff810bdf6b>] d_kill+0x3d/0x5d
[<ffffffff810be613>] dput+0x13a/0x147
[<ffffffff810b990d>] sys_renameat+0x1b5/0x258
[<ffffffff81145f71>] ? _atomic_dec_and_lock+0x2d/0x4c
[<ffffffff810b2950>] ? cp_new_stat+0xde/0xea
[<ffffffff810b29c1>] ? sys_newlstat+0x2d/0x38
[<ffffffff810b99c6>] sys_rename+0x16/0x18
[<ffffffff81002a2b>] system_call_fastpath+0x16/0x1b
Reported-by: Nick Bowler <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/ext4.h | 2 +
fs/ext4/page-io.c | 69 +++++++++++++++++++++++++++++++++-------------------
fs/ext4/super.c | 2 +
3 files changed, 48 insertions(+), 25 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 8b5dd63..670d134 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -858,6 +858,7 @@ struct ext4_inode_info {
spinlock_t i_completed_io_lock;
/* current io_end structure for async DIO write*/
ext4_io_end_t *cur_aio_dio;
+ atomic_t i_ioend_count; /* Number of outstanding io_end structs */
/*
* Transactions that contain inode's metadata needed to complete
@@ -2060,6 +2061,7 @@ extern int ext4_move_extents(struct file *o_filp, struct file *d_filp,
/* page-io.c */
extern int __init ext4_init_pageio(void);
extern void ext4_exit_pageio(void);
+extern void ext4_ioend_wait(struct inode *);
extern void ext4_free_io_end(ext4_io_end_t *io);
extern ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags);
extern int ext4_end_io_nolock(ext4_io_end_t *io);
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 46a7d6a..3e25a88 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -32,8 +32,14 @@
static struct kmem_cache *io_page_cachep, *io_end_cachep;
+#define WQ_HASH_SZ 37
+#define to_ioend_wq(v) (&ioend_wq[((unsigned long)v) % WQ_HASH_SZ])
+static wait_queue_head_t ioend_wq[WQ_HASH_SZ];
+
int __init ext4_init_pageio(void)
{
+ int i;
+
io_page_cachep = KMEM_CACHE(ext4_io_page, SLAB_RECLAIM_ACCOUNT);
if (io_page_cachep == NULL)
return -ENOMEM;
@@ -42,6 +48,8 @@ int __init ext4_init_pageio(void)
kmem_cache_destroy(io_page_cachep);
return -ENOMEM;
}
+ for (i = 0; i < WQ_HASH_SZ; i++)
+ init_waitqueue_head(&ioend_wq[i]);
return 0;
}
@@ -52,9 +60,17 @@ void ext4_exit_pageio(void)
kmem_cache_destroy(io_page_cachep);
}
+void ext4_ioend_wait(struct inode *inode)
+{
+ wait_queue_head_t *wq = to_ioend_wq(inode);
+
+ wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_ioend_count) == 0));
+}
+
void ext4_free_io_end(ext4_io_end_t *io)
{
int i;
+ wait_queue_head_t *wq;
BUG_ON(!io);
if (io->page)
@@ -69,7 +85,10 @@ void ext4_free_io_end(ext4_io_end_t *io)
}
}
io->num_io_pages = 0;
- iput(io->inode);
+ wq = to_ioend_wq(io->inode);
+ if (atomic_dec_and_test(&EXT4_I(io->inode)->i_ioend_count) &&
+ waitqueue_active(wq))
+ wake_up_all(wq);
kmem_cache_free(io_end_cachep, io);
}
@@ -142,8 +161,8 @@ ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags)
io = kmem_cache_alloc(io_end_cachep, flags);
if (io) {
memset(io, 0, sizeof(*io));
- io->inode = igrab(inode);
- BUG_ON(!io->inode);
+ atomic_inc(&EXT4_I(inode)->i_ioend_count);
+ io->inode = inode;
INIT_WORK(&io->work, ext4_end_io_work);
INIT_LIST_HEAD(&io->list);
}
@@ -171,35 +190,15 @@ static void ext4_end_bio(struct bio *bio, int error)
struct workqueue_struct *wq;
struct inode *inode;
unsigned long flags;
- ext4_fsblk_t err_block;
int i;
BUG_ON(!io_end);
- inode = io_end->inode;
bio->bi_private = NULL;
bio->bi_end_io = NULL;
if (test_bit(BIO_UPTODATE, &bio->bi_flags))
error = 0;
- err_block = bio->bi_sector >> (inode->i_blkbits - 9);
bio_put(bio);
- if (!(inode->i_sb->s_flags & MS_ACTIVE)) {
- pr_err("sb umounted, discard end_io request for inode %lu\n",
- io_end->inode->i_ino);
- ext4_free_io_end(io_end);
- return;
- }
-
- if (error) {
- io_end->flag |= EXT4_IO_END_ERROR;
- ext4_warning(inode->i_sb, "I/O error writing to inode %lu "
- "(offset %llu size %ld starting block %llu)",
- inode->i_ino,
- (unsigned long long) io_end->offset,
- (long) io_end->size,
- (unsigned long long) err_block);
- }
-
for (i = 0; i < io_end->num_io_pages; i++) {
struct page *page = io_end->pages[i]->p_page;
struct buffer_head *bh, *head;
@@ -254,9 +253,30 @@ static void ext4_end_bio(struct bio *bio, int error)
if (!partial_write)
SetPageUptodate(page);
}
-
io_end->num_io_pages = 0;
+ if ((inode = io_end->inode) == NULL)
+ goto no_work;
+
+ if (!(inode->i_sb->s_flags & MS_ACTIVE)) {
+ pr_err("sb umounted, discard end_io request for inode %lu\n",
+ io_end->inode->i_ino);
+ no_work:
+ ext4_free_io_end(io_end);
+ return;
+ }
+
+ if (error) {
+ io_end->flag |= EXT4_IO_END_ERROR;
+ ext4_warning(inode->i_sb, "I/O error writing to inode %lu "
+ "(offset %llu size %ld starting block %llu)",
+ inode->i_ino,
+ (unsigned long long) io_end->offset,
+ (long) io_end->size,
+ (unsigned long long)
+ bio->bi_sector >> (inode->i_blkbits - 9));
+ }
+
/* Add the io_end to per-inode completed io list*/
spin_lock_irqsave(&EXT4_I(inode)->i_completed_io_lock, flags);
list_add_tail(&io_end->list, &EXT4_I(inode)->i_completed_io_list);
@@ -305,7 +325,6 @@ static int io_submit_init(struct ext4_io_submit *io,
bio->bi_private = io->io_end = io_end;
bio->bi_end_io = ext4_end_bio;
- io_end->inode = inode;
io_end->offset = (page->index << PAGE_CACHE_SHIFT) + bh_offset(bh);
io->io_bio = bio;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 04352e9..45653af 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -828,12 +828,14 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
ei->cur_aio_dio = NULL;
ei->i_sync_tid = 0;
ei->i_datasync_tid = 0;
+ atomic_set(&ei->i_ioend_count, 0);
return &ei->vfs_inode;
}
static void ext4_destroy_inode(struct inode *inode)
{
+ ext4_ioend_wait(inode);
if (!list_empty(&(EXT4_I(inode)->i_orphan))) {
ext4_msg(inode->i_sb, KERN_ERR,
"Inode %lu (%p): orphan list check failed!",
--
1.7.3.1
On Mon, Nov 08, 2010 at 12:00:52AM -0500, Ted Ts'o wrote:
> On Mon, Nov 08, 2010 at 12:19:19PM +1100, Dave Chinner wrote:
> >
> > Doesn't this mean you are tossing away unwritten extent conversion
> > processing when IO is issued on any inode in the I_WILL_FREE/I_FREEING
> > state, or completing IO after the unmount is in progress?
>
> Yeah, thanks for pointing that out. Clearly I must have been
> sleep-deprived, since I had managed to convince myself we'd only end
> up in this situation when the inode had been deleted (as opposed to
> the situation where there's so much memory pressure that we're pushing
> out an inode that still has writebacks pending).
....
> @@ -254,9 +253,30 @@ static void ext4_end_bio(struct bio *bio, int error)
> if (!partial_write)
> SetPageUptodate(page);
> }
> -
> io_end->num_io_pages = 0;
>
> + if ((inode = io_end->inode) == NULL)
> + goto no_work;
> +
> + if (!(inode->i_sb->s_flags & MS_ACTIVE)) {
> + pr_err("sb umounted, discard end_io request for inode %lu\n",
> + io_end->inode->i_ino);
> + no_work:
> + ext4_free_io_end(io_end);
> + return;
> + }
These checks are no longer necessary.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On 2010-11-08 01:05 -0500, Ted Ts'o wrote:
> Subject: [PATCH -v4] ext4: handle writeback of inodes which are being freed
After several successful 'make install' tests with this patch:
Tested-by: Nick Bowler <[email protected]>
Thanks.
--
Nick Bowler, Elliptic Technologies (http://www.elliptictech.com/)
On Mon, Nov 08, 2010 at 11:28:32AM -0500, Nick Bowler wrote:
>
> After several successful 'make install' tests with this patch:
>
> Tested-by: Nick Bowler <[email protected]>
>
> Thanks.
Thank you for reporting the bug and testing the patches!!
- Ted