2008-05-12 17:21:21

by Nick Dokos

[permalink] [raw]
Subject: Soft lockups in mballoc code

I keep getting soft lockups in mballoc code, running fsstress.
I'm running a 2.6.25.2 kernel, with 1.40.8 e2fsprogs. Here
is sample console output on a 2-cpu Opteron box with a 2Tb filesystem -
the two stack traces keep repeating every minute or so and the only
way to get out is to reset the box:

---------------------------------------------------------------------------
[[email protected] fsstress]# /home/nick/src/fsstress/fsstress -d /mnt2/fsstress -l 10 -n 100 -p 20
seed = 1211090909
BUG: soft lockup - CPU#1 stuck for 61s! [fsstress:3974]
CPU 1:
Modules linked in: qla2xxx ext4dev jbd2 crc16 ipv6 autofs4 hidp rfcomm l2cap bluetooth sunrpc cpufreq_ondemand dm_mirror dm_multipath dm_mod sbs sbshc battery acpi_memhotplug ac parport_pc lp parport sg floppy ide_cd_mod serio_raw cdrom button tg3 amd_rng k8temp shpchp i2c_amd756 pcspkr hwmon i2c_core scsi_transport_fc mptspi mptscsih scsi_transport_spi mptbase cciss sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd [last unloaded: qla2xxx]
Pid: 3974, comm: fsstress Not tainted 2.6.25.2 #1
RIP: 0010:[<ffffffff88412790>] [<ffffffff88412790>] :ext4dev:ext4_mb_new_blocks+0x7bd/0x1618
RSP: 0018:ffff8101f304b898 EFLAGS: 00000202
RAX: 00000000ffffffff RBX: ffff8101f4c86918 RCX: 0000000000000007
RDX: ffff8100f31d35c0 RSI: 00000000000029fd RDI: ffffe200034f6548
RBP: 00000000000014fe R08: 0000000000000000 R09: ffff8101f4cbd280
R10: 0000000000000001 R11: ffffffff884174c0 R12: 0000000000000f00
R13: ffff8100f2de8b58 R14: ffffffff883feadd R15: ffff8101f4c622b8
FS: 00007f78917726f0(0000) GS:ffff8101ff6b3f40(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000003dca80b9a0 CR3: 00000001f3036000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

Call Trace:
[<ffffffff8841274d>] ? :ext4dev:ext4_mb_new_blocks+0x77a/0x1618
[<ffffffff8022779f>] ? set_next_entity+0x23/0x73
[<ffffffff8026000c>] ? compat_irq_chip_set_default_handler+0x5/0x11
[<ffffffff883fbd99>] ? :ext4dev:ext4_new_blocks+0x6a/0x88
[<ffffffff883fefb5>] ? :ext4dev:ext4_get_blocks_handle+0x355/0x84f
[<ffffffff883ff773>] ? :ext4dev:ext4_get_block+0x0/0xf9
[<ffffffff802aa74d>] ? __find_get_block+0x148/0x158
[<ffffffff883ff5c0>] ? :ext4dev:ext4_get_blocks_wrap+0x111/0x135
[<ffffffff883ff82d>] ? :ext4dev:ext4_get_block+0xba/0xf9
[<ffffffff802ab509>] ? __block_prepare_write+0x19d/0x36c
[<ffffffff883ff773>] ? :ext4dev:ext4_get_block+0x0/0xf9
[<ffffffff802ab775>] ? block_write_begin+0x78/0xc9
[<ffffffff88400d67>] ? :ext4dev:ext4_write_begin+0xeb/0x1b1
[<ffffffff883ff773>] ? :ext4dev:ext4_get_block+0x0/0xf9
[<ffffffff80292ab5>] ? __page_symlink+0x44/0xf7
[<ffffffff8840491d>] ? :ext4dev:ext4_symlink+0x136/0x1db
[<ffffffff80293915>] ? vfs_symlink+0xbf/0x12e
[<ffffffff80295d27>] ? sys_symlinkat+0x87/0xd4
[<ffffffff8025cbdf>] ? audit_syscall_exit+0x2e4/0x303
[<ffffffff8020bdae>] ? tracesys+0x71/0xe1
[<ffffffff8020be19>] ? tracesys+0xdc/0xe1

BUG: soft lockup - CPU#0 stuck for 61s! [fsstress:3984]
CPU 0:
Modules linked in: qla2xxx ext4dev jbd2 crc16 ipv6 autofs4 hidp rfcomm l2cap bluetooth sunrpc cpufreq_ondemand dm_mirror dm_multipath dm_mod sbs sbshc battery acpi_memhotplug ac parport_pc lp parport sg floppy ide_cd_mod serio_raw cdrom button tg3 amd_rng k8temp shpchp i2c_amd756 pcspkr hwmon i2c_core scsi_transport_fc mptspi mptscsih scsi_transport_spi mptbase cciss sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd [last unloaded: qla2xxx]
Pid: 3984, comm: fsstress Not tainted 2.6.25.2 #1
RIP: 0010:[<ffffffff884145e0>] [<ffffffff884145e0>] :ext4dev:ext4_mb_discard_inode_preallocations+0x1c6/0x27b
RSP: 0018:ffff8101f3521e18 EFLAGS: 00000202
RAX: 00000000ffffffff RBX: ffff8100f2def470 RCX: 0000000000000007
RDX: ffff8100f31d35c0 RSI: 00000000000014fe RDI: ffff8100f2dbd470
RBP: 0000000000000000 R08: ffff81000900ec80 R09: 0000000000001000
R10: ffff8100f2debae8 R11: ffff8101fc9d8000 R12: ffff8101f4c8bcc0
R13: 0000000000001000 R14: 0000000000000000 R15: 000000000a7f0000
FS: 00007f78917726f0(0000) GS:ffffffff805b9000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000003dca8c2200 CR3: 00000001f4023000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

Call Trace:
[<ffffffff88414591>] :ext4dev:ext4_mb_discard_inode_preallocations+0x177/0x27b
[<ffffffff883fbe50>] :ext4dev:ext4_discard_reservation+0x24/0x68
[<ffffffff883fc872>] :ext4dev:ext4_release_file+0x2d/0x6f
[<ffffffff8028d243>] __fput+0xb1/0x17e
[<ffffffff8028ab77>] filp_close+0x5d/0x65
[<ffffffff8028bcee>] sys_close+0x7e/0xb7
[<ffffffff8020be19>] tracesys+0xdc/0xe1



...

---------------------------------------------------------------------------

Known bug? Is there a fix somewhere?

Let me know if more information is needed.

Thanks very much,
Nick


2008-05-14 17:18:46

by Mingming Cao

[permalink] [raw]
Subject: Re: Soft lockups in mballoc code

On Mon, 2008-05-12 at 13:21 -0400, Nick Dokos wrote:
> I keep getting soft lockups in mballoc code, running fsstress.
> I'm running a 2.6.25.2 kernel, with 1.40.8 e2fsprogs. Here
> is sample console output on a 2-cpu Opteron box with a 2Tb filesystem -
> the two stack traces keep repeating every minute or so and the only
> way to get out is to reset the box:
>
Nick, thanks for reporting this. I remember saw soft lockup issue with
mballoc on 2.6.25-rc kernels. Could you run the latest 2.6.26-rc2
kernel? There are lots of bug fixes have been pushed to 2.6.26-rc1.
Aneesh, do you know if this is a new issue?

Mingming
> ---------------------------------------------------------------------------
> [[email protected] fsstress]# /home/nick/src/fsstress/fsstress -d /mnt2/fsstress -l 10 -n 100 -p 20
> seed = 1211090909
> BUG: soft lockup - CPU#1 stuck for 61s! [fsstress:3974]
> CPU 1:
> Modules linked in: qla2xxx ext4dev jbd2 crc16 ipv6 autofs4 hidp rfcomm l2cap bluetooth sunrpc cpufreq_ondemand dm_mirror dm_multipath dm_mod sbs sbshc battery acpi_memhotplug ac parport_pc lp parport sg floppy ide_cd_mod serio_raw cdrom button tg3 amd_rng k8temp shpchp i2c_amd756 pcspkr hwmon i2c_core scsi_transport_fc mptspi mptscsih scsi_transport_spi mptbase cciss sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd [last unloaded: qla2xxx]
> Pid: 3974, comm: fsstress Not tainted 2.6.25.2 #1
> RIP: 0010:[<ffffffff88412790>] [<ffffffff88412790>] :ext4dev:ext4_mb_new_blocks+0x7bd/0x1618
> RSP: 0018:ffff8101f304b898 EFLAGS: 00000202
> RAX: 00000000ffffffff RBX: ffff8101f4c86918 RCX: 0000000000000007
> RDX: ffff8100f31d35c0 RSI: 00000000000029fd RDI: ffffe200034f6548
> RBP: 00000000000014fe R08: 0000000000000000 R09: ffff8101f4cbd280
> R10: 0000000000000001 R11: ffffffff884174c0 R12: 0000000000000f00
> R13: ffff8100f2de8b58 R14: ffffffff883feadd R15: ffff8101f4c622b8
> FS: 00007f78917726f0(0000) GS:ffff8101ff6b3f40(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 0000003dca80b9a0 CR3: 00000001f3036000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>
> Call Trace:
> [<ffffffff8841274d>] ? :ext4dev:ext4_mb_new_blocks+0x77a/0x1618
> [<ffffffff8022779f>] ? set_next_entity+0x23/0x73
> [<ffffffff8026000c>] ? compat_irq_chip_set_default_handler+0x5/0x11
> [<ffffffff883fbd99>] ? :ext4dev:ext4_new_blocks+0x6a/0x88
> [<ffffffff883fefb5>] ? :ext4dev:ext4_get_blocks_handle+0x355/0x84f
> [<ffffffff883ff773>] ? :ext4dev:ext4_get_block+0x0/0xf9
> [<ffffffff802aa74d>] ? __find_get_block+0x148/0x158
> [<ffffffff883ff5c0>] ? :ext4dev:ext4_get_blocks_wrap+0x111/0x135
> [<ffffffff883ff82d>] ? :ext4dev:ext4_get_block+0xba/0xf9
> [<ffffffff802ab509>] ? __block_prepare_write+0x19d/0x36c
> [<ffffffff883ff773>] ? :ext4dev:ext4_get_block+0x0/0xf9
> [<ffffffff802ab775>] ? block_write_begin+0x78/0xc9
> [<ffffffff88400d67>] ? :ext4dev:ext4_write_begin+0xeb/0x1b1
> [<ffffffff883ff773>] ? :ext4dev:ext4_get_block+0x0/0xf9
> [<ffffffff80292ab5>] ? __page_symlink+0x44/0xf7
> [<ffffffff8840491d>] ? :ext4dev:ext4_symlink+0x136/0x1db
> [<ffffffff80293915>] ? vfs_symlink+0xbf/0x12e
> [<ffffffff80295d27>] ? sys_symlinkat+0x87/0xd4
> [<ffffffff8025cbdf>] ? audit_syscall_exit+0x2e4/0x303
> [<ffffffff8020bdae>] ? tracesys+0x71/0xe1
> [<ffffffff8020be19>] ? tracesys+0xdc/0xe1
>
> BUG: soft lockup - CPU#0 stuck for 61s! [fsstress:3984]
> CPU 0:
> Modules linked in: qla2xxx ext4dev jbd2 crc16 ipv6 autofs4 hidp rfcomm l2cap bluetooth sunrpc cpufreq_ondemand dm_mirror dm_multipath dm_mod sbs sbshc battery acpi_memhotplug ac parport_pc lp parport sg floppy ide_cd_mod serio_raw cdrom button tg3 amd_rng k8temp shpchp i2c_amd756 pcspkr hwmon i2c_core scsi_transport_fc mptspi mptscsih scsi_transport_spi mptbase cciss sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd [last unloaded: qla2xxx]
> Pid: 3984, comm: fsstress Not tainted 2.6.25.2 #1
> RIP: 0010:[<ffffffff884145e0>] [<ffffffff884145e0>] :ext4dev:ext4_mb_discard_inode_preallocations+0x1c6/0x27b
> RSP: 0018:ffff8101f3521e18 EFLAGS: 00000202
> RAX: 00000000ffffffff RBX: ffff8100f2def470 RCX: 0000000000000007
> RDX: ffff8100f31d35c0 RSI: 00000000000014fe RDI: ffff8100f2dbd470
> RBP: 0000000000000000 R08: ffff81000900ec80 R09: 0000000000001000
> R10: ffff8100f2debae8 R11: ffff8101fc9d8000 R12: ffff8101f4c8bcc0
> R13: 0000000000001000 R14: 0000000000000000 R15: 000000000a7f0000
> FS: 00007f78917726f0(0000) GS:ffffffff805b9000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 0000003dca8c2200 CR3: 00000001f4023000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>
> Call Trace:
> [<ffffffff88414591>] :ext4dev:ext4_mb_discard_inode_preallocations+0x177/0x27b
> [<ffffffff883fbe50>] :ext4dev:ext4_discard_reservation+0x24/0x68
> [<ffffffff883fc872>] :ext4dev:ext4_release_file+0x2d/0x6f
> [<ffffffff8028d243>] __fput+0xb1/0x17e
> [<ffffffff8028ab77>] filp_close+0x5d/0x65
> [<ffffffff8028bcee>] sys_close+0x7e/0xb7
> [<ffffffff8020be19>] tracesys+0xdc/0xe1
>
>
>
> ...
>
> ---------------------------------------------------------------------------
>
> Known bug? Is there a fix somewhere?
>
> Let me know if more information is needed.
>
> Thanks very much,
> Nick
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html


2008-05-14 17:45:39

by Nick Dokos

[permalink] [raw]
Subject: Re: Soft lockups in mballoc code

Mingming Cao <[email protected]> wrote:

> On Mon, 2008-05-12 at 13:21 -0400, Nick Dokos wrote:
> > I keep getting soft lockups in mballoc code, running fsstress.
> > I'm running a 2.6.25.2 kernel, with 1.40.8 e2fsprogs. Here
> > is sample console output on a 2-cpu Opteron box with a 2Tb filesystem -
> > the two stack traces keep repeating every minute or so and the only
> > way to get out is to reset the box:
> >
> Nick, thanks for reporting this. I remember saw soft lockup issue with
> mballoc on 2.6.25-rc kernels. Could you run the latest 2.6.26-rc2
> kernel? There are lots of bug fixes have been pushed to 2.6.26-rc1.
> Aneesh, do you know if this is a new issue?
>

Hi Mingming,

I'll do so at the earliest opportunity. In the meantime, I wanted to let
you know that I tried the (stable) 2.6.25.3 kernel on x86-64 and encountered the
same problem, but I also tried it on an IA64 system (again with the
2.6.25.3 kernel) and did *not* encounter the problem. Not sure what this
means yet, but I'll let you know as soon as I get some more information.

Thanks very much,
Nick


2008-05-15 09:38:27

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: Soft lockups in mballoc code

On Mon, May 12, 2008 at 01:21:35PM -0400, Nick Dokos wrote:
> I keep getting soft lockups in mballoc code, running fsstress.
> I'm running a 2.6.25.2 kernel, with 1.40.8 e2fsprogs. Here
> is sample console output on a 2-cpu Opteron box with a 2Tb filesystem -
> the two stack traces keep repeating every minute or so and the only
> way to get out is to reset the box:

What is the block size with which the file system is created. I would
also suggest testing with the latest kernel and see if you can
reproduce the same. While testing with the latest kernel please also
use the below patch.

http://repo.or.cz/w/ext4-patch-queue.git?a=blob;f=ext4-mballoc-fix-mb-_nornalize_request-for-1K-blksz.patch;h=f7b91a6186b7b469ad798ddc95e76ece69b47c24;hb=HEAD


-aneesh

2008-05-19 14:47:33

by Solofo.Ramangalahy

[permalink] [raw]
Subject: Re: Soft lockups in mballoc code

Nick Dokos writes:
> In the meantime, I wanted to let
> you know that I tried the (stable) 2.6.25.3 kernel on x86-64 and encountered the
> same problem, but I also tried it on an IA64 system (again with the
> 2.6.25.3 kernel) and did *not* encounter the problem.

Also reproduced (or similar) on 2.6.25.4.
I will try to get a more focused test case...

--
solofo

BUG: soft lockup - CPU#0 stuck for 61s! [fsstress:16194]
CPU 0:
Modules linked in: ext4dev jbd2 crc16
Pid: 16194, comm: fsstress Not tainted 2.6.25.4 #1
RIP: 0010:[<ffffffff8802d5b7>] [<ffffffff8802d5b7>] :ext4dev:ext4_mb_new_blocks+0x8e4/0x154d
RSP: 0018:ffff81011c1d97a8 EFLAGS: 00000202
RAX: 00000000ffffffff RBX: ffff810112de5c60 RCX: 0000000000000007
RDX: ffff81012d7789c0 RSI: 0000000000000008 RDI: ffffe200070fb180
RBP: ffff81011c1d9888 R08: 0000000000000003 R09: d37a6f4de9bd37a7
R10: 0000000000000173 R11: ffff810127fd8000 R12: ffff81012d7789c0
R13: ffff8100807c1000 R14: ffff81011c1d8000 R15: ffffffff808a6748
FS: 00007fcfcd3f46e0(0000) GS:ffffffff80782000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fffd5409fa8 CR3: 00000001241db000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

Call Trace:
[<ffffffff8802d571>] ? :ext4dev:ext4_mb_new_blocks+0x89e/0x154d
[<ffffffff802821cd>] ? __kmalloc+0x12e/0x159
[<ffffffff880267a1>] ? :ext4dev:ext4_ext_find_extent+0x71/0x255
[<ffffffff88028c9c>] ? :ext4dev:ext4_ext_get_blocks+0x77a/0x962
[<ffffffff8035a6bd>] ? __down_write_trylock+0x15/0x44
[<ffffffff80210ee3>] ? native_sched_clock+0x44/0x5d
[<ffffffff8024a891>] ? mark_held_locks+0x58/0x72
[<ffffffff8801a703>] ? :ext4dev:ext4_get_blocks_wrap+0xed/0x135
[<ffffffff8801a967>] ? :ext4dev:ext4_get_block+0xa3/0xe2
[<ffffffff802ab900>] ? __blockdev_direct_IO+0x544/0xa0b
[<ffffffff8801c450>] ? :ext4dev:ext4_direct_IO+0x11e/0x1b0
[<ffffffff8801a8c4>] ? :ext4dev:ext4_get_block+0x0/0xe2
[<ffffffff8025f516>] ? generic_file_direct_IO+0xcd/0x103
[<ffffffff802a2157>] ? __mark_inode_dirty+0xe2/0x18b
[<ffffffff8025f5ac>] ? generic_file_direct_write+0x60/0xfd
[<ffffffff8025ff14>] ? __generic_file_aio_write_nolock+0x28f/0x37a
[<ffffffff8024aa3c>] ? trace_hardirqs_on+0xfa/0x125
[<ffffffff80260065>] ? generic_file_aio_write+0x66/0xc6
[<ffffffff880178d8>] ? :ext4dev:ext4_file_write+0xa0/0x125
[<ffffffff80286da0>] ? do_sync_write+0xd1/0x118
[<ffffffff8024a891>] ? mark_held_locks+0x58/0x72
[<ffffffff80240b96>] ? autoremove_wake_function+0x0/0x2e
[<ffffffff80210ee3>] ? native_sched_clock+0x44/0x5d
[<ffffffff8057c4a1>] ? __mutex_unlock_slowpath+0xfe/0x109
[<ffffffff8024aa3c>] ? trace_hardirqs_on+0xfa/0x125
[<ffffffff802874f1>] ? vfs_write+0xad/0x136
[<ffffffff80287a28>] ? sys_write+0x45/0x6e
[<ffffffff8020af7b>] ? system_call_after_swapgs+0x7b/0x80

BUG: soft lockup - CPU#2 stuck for 61s! [fsstress:16438]
CPU 2:
Modules linked in: ext4dev jbd2 crc16
Pid: 16438, comm: fsstress Not tainted 2.6.25.4 #1
RIP: 0010:[<ffffffff8802d5b4>] [<ffffffff8802d5b4>] :ext4dev:ext4_mb_new_blocks+0x8e1/0x154d
RSP: 0018:ffff8101211917a8 EFLAGS: 00000202
RAX: 00000000ffffffff RBX: ffff81010246d7c8 RCX: 0000000000000007
RDX: ffff81012d7789c0 RSI: 0000000000000008 RDI: ffffe200070fb180
RBP: ffff810121191888 R08: 0000000000000003 R09: d37a6f4de9bd37a7
R10: 0000000000000173 R11: ffff810127fd8000 R12: ffff81012d7789c0
R13: ffff810080893000 R14: ffff810121190000 R15: ffffffff808a6748
FS: 00007fcfcd3f46e0(0000) GS:ffff81012faeee40(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fffd5409fa8 CR3: 0000000121194000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

Call Trace:
[<ffffffff8802d571>] ? :ext4dev:ext4_mb_new_blocks+0x89e/0x154d
[<ffffffff88028c9c>] ? :ext4dev:ext4_ext_get_blocks+0x77a/0x962
[<ffffffff8035a6bd>] ? __down_write_trylock+0x15/0x44
[<ffffffff80210ee3>] ? native_sched_clock+0x44/0x5d
[<ffffffff8024a891>] ? mark_held_locks+0x58/0x72
[<ffffffff8801a703>] ? :ext4dev:ext4_get_blocks_wrap+0xed/0x135
[<ffffffff8801a967>] ? :ext4dev:ext4_get_block+0xa3/0xe2
[<ffffffff802ab900>] ? __blockdev_direct_IO+0x544/0xa0b
[<ffffffff8801c450>] ? :ext4dev:ext4_direct_IO+0x11e/0x1b0
[<ffffffff8801a8c4>] ? :ext4dev:ext4_get_block+0x0/0xe2
[<ffffffff8025f516>] ? generic_file_direct_IO+0xcd/0x103
[<ffffffff802a2157>] ? __mark_inode_dirty+0xe2/0x18b
[<ffffffff8025f5ac>] ? generic_file_direct_write+0x60/0xfd
[<ffffffff8025ff14>] ? __generic_file_aio_write_nolock+0x28f/0x37a
[<ffffffff8024aa3c>] ? trace_hardirqs_on+0xfa/0x125
[<ffffffff80260065>] ? generic_file_aio_write+0x66/0xc6
[<ffffffff880178d8>] ? :ext4dev:ext4_file_write+0xa0/0x125
[<ffffffff80286da0>] ? do_sync_write+0xd1/0x118
[<ffffffff8024a891>] ? mark_held_locks+0x58/0x72
[<ffffffff80240b96>] ? autoremove_wake_function+0x0/0x2e
[<ffffffff80210ee3>] ? native_sched_clock+0x44/0x5d
[<ffffffff8057c4a1>] ? __mutex_unlock_slowpath+0xfe/0x109
[<ffffffff8024aa3c>] ? trace_hardirqs_on+0xfa/0x125
[<ffffffff802874f1>] ? vfs_write+0xad/0x136
[<ffffffff80287a28>] ? sys_write+0x45/0x6e
[<ffffffff8020af7b>] ? system_call_after_swapgs+0x7b/0x80

BUG: soft lockup - CPU#3 stuck for 61s! [fsstress:15776]
CPU 3:
Modules linked in: ext4dev jbd2 crc16
Pid: 15776, comm: fsstress Not tainted 2.6.25.4 #1
RIP: 0010:[<ffffffff8802d5b7>] [<ffffffff8802d5b7>] :ext4dev:ext4_mb_new_blocks+0x8e4/0x154d
RSP: 0018:ffff8101240457a8 EFLAGS: 00000202
RAX: 00000000ffffffff RBX: ffff810112de5bb8 RCX: 0000000000000007
RDX: ffff81012d7789c0 RSI: 0000000000000008 RDI: ffffe200070fb180
RBP: ffff810124045888 R08: 0000000000000003 R09: d37a6f4de9bd37a7
R10: 0000000000000173 R11: ffff810127fd8000 R12: ffff81012d7789c0
R13: ffff8100808fc000 R14: ffff810124044000 R15: ffffffff808a6748
FS: 00007fcfcd3f46e0(0000) GS:ffff81012faee5c0(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fcfcd3f2000 CR3: 00000001164f5000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

Call Trace:
[<ffffffff8802d571>] ? :ext4dev:ext4_mb_new_blocks+0x89e/0x154d
[<ffffffff880268b4>] ? :ext4dev:ext4_ext_find_extent+0x184/0x255
[<ffffffff88028c9c>] ? :ext4dev:ext4_ext_get_blocks+0x77a/0x962
[<ffffffff8035a6bd>] ? __down_write_trylock+0x15/0x44
[<ffffffff80210ee3>] ? native_sched_clock+0x44/0x5d
[<ffffffff8024a891>] ? mark_held_locks+0x58/0x72
[<ffffffff8801a703>] ? :ext4dev:ext4_get_blocks_wrap+0xed/0x135
[<ffffffff8801a967>] ? :ext4dev:ext4_get_block+0xa3/0xe2
[<ffffffff802ab900>] ? __blockdev_direct_IO+0x544/0xa0b
[<ffffffff8801c450>] ? :ext4dev:ext4_direct_IO+0x11e/0x1b0
[<ffffffff8801a8c4>] ? :ext4dev:ext4_get_block+0x0/0xe2
[<ffffffff8025f516>] ? generic_file_direct_IO+0xcd/0x103
[<ffffffff802a2157>] ? __mark_inode_dirty+0xe2/0x18b
[<ffffffff8025f5ac>] ? generic_file_direct_write+0x60/0xfd
[<ffffffff8025ff14>] ? __generic_file_aio_write_nolock+0x28f/0x37a
[<ffffffff8024aa3c>] ? trace_hardirqs_on+0xfa/0x125
[<ffffffff8026a2f6>] ? zone_statistics+0x41/0x95
[<ffffffff80260065>] ? generic_file_aio_write+0x66/0xc6
[<ffffffff880178d8>] ? :ext4dev:ext4_file_write+0xa0/0x125
[<ffffffff80286da0>] ? do_sync_write+0xd1/0x118
[<ffffffff8026e5aa>] ? handle_mm_fault+0x2a9/0x6dc
[<ffffffff80240b96>] ? autoremove_wake_function+0x0/0x2e
[<ffffffff8026e9c0>] ? handle_mm_fault+0x6bf/0x6dc
[<ffffffff8035a6ff>] ? __up_read+0x13/0x8a
[<ffffffff80210ee3>] ? native_sched_clock+0x44/0x5d
[<ffffffff802874f1>] ? vfs_write+0xad/0x136
[<ffffffff80287a28>] ? sys_write+0x45/0x6e
[<ffffffff8020af7b>] ? system_call_after_swapgs+0x7b/0x80

BUG: soft lockup - CPU#1 stuck for 61s! [fsstress:15786]
CPU 1:
Modules linked in: ext4dev jbd2 crc16
Pid: 15786, comm: fsstress Not tainted 2.6.25.4 #1
RIP: 0010:[<ffffffff8802d5b7>] [<ffffffff8802d5b7>] :ext4dev:ext4_mb_new_blocks+0x8e4/0x154d
RSP: 0018:ffff810122c0b6e8 EFLAGS: 00000202
RAX: 00000000ffffffff RBX: ffff810112de5d08 RCX: 0000000000000007
RDX: ffff81012d7789c0 RSI: 0000000000000008 RDI: ffffe200070fb180
RBP: ffff810122c0b7c8 R08: 0000000000000003 R09: d37a6f4de9bd37a7
R10: 0000000000000173 R11: ffff810127fd8000 R12: ffff81012d7789c0
R13: ffff81008082a000 R14: ffff810122c0a000 R15: ffffffff808a6748
FS: 00007fcfcd3f46e0(0000) GS:ffff81012fc08bc0(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fffd5409c88 CR3: 000000010d41e000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

Call Trace:
[<ffffffff8802d571>] ? :ext4dev:ext4_mb_new_blocks+0x89e/0x154d
[<ffffffff802821cd>] ? __kmalloc+0x12e/0x159
[<ffffffff880267a1>] ? :ext4dev:ext4_ext_find_extent+0x71/0x255
[<ffffffff88028c9c>] ? :ext4dev:ext4_ext_get_blocks+0x77a/0x962
[<ffffffff8035a6bd>] ? __down_write_trylock+0x15/0x44
[<ffffffff80210ee3>] ? native_sched_clock+0x44/0x5d
[<ffffffff8024a891>] ? mark_held_locks+0x58/0x72
[<ffffffff8801a703>] ? :ext4dev:ext4_get_blocks_wrap+0xed/0x135
[<ffffffff8801a967>] ? :ext4dev:ext4_get_block+0xa3/0xe2
[<ffffffff802a6a47>] ? __block_prepare_write+0x1bb/0x3ca
[<ffffffff8801a8c4>] ? :ext4dev:ext4_get_block+0x0/0xe2
[<ffffffff880041b3>] ? :jbd2:jbd2_journal_start+0xe5/0x108
[<ffffffff802a6cf3>] ? block_write_begin+0x78/0xc9
[<ffffffff8801bf04>] ? :ext4dev:ext4_write_begin+0xeb/0x1aa
[<ffffffff8801a8c4>] ? :ext4dev:ext4_get_block+0x0/0xe2
[<ffffffff8025f799>] ? generic_file_buffered_write+0x150/0x63c
[<ffffffff88003612>] ? :jbd2:jbd2_journal_stop+0x28f/0x29f
[<ffffffff80210ee3>] ? native_sched_clock+0x44/0x5d
[<ffffffff8025ffcb>] ? __generic_file_aio_write_nolock+0x346/0x37a
[<ffffffff8024aa3c>] ? trace_hardirqs_on+0xfa/0x125
[<ffffffff80260065>] ? generic_file_aio_write+0x66/0xc6
[<ffffffff880178d8>] ? :ext4dev:ext4_file_write+0xa0/0x125
[<ffffffff80286da0>] ? do_sync_write+0xd1/0x118
[<ffffffff8024a891>] ? mark_held_locks+0x58/0x72
[<ffffffff80240b96>] ? autoremove_wake_function+0x0/0x2e
[<ffffffff80210ee3>] ? native_sched_clock+0x44/0x5d
[<ffffffff8057c4a1>] ? __mutex_unlock_slowpath+0xfe/0x109
[<ffffffff8024aa3c>] ? trace_hardirqs_on+0xfa/0x125
[<ffffffff802874f1>] ? vfs_write+0xad/0x136
[<ffffffff80287a28>] ? sys_write+0x45/0x6e
[<ffffffff8020af7b>] ? system_call_after_swapgs+0x7b/0x80


2008-05-26 14:28:11

by Nick Dokos

[permalink] [raw]
Subject: Re: Soft lockups in mballoc code

[email protected] wrote:

> I have not been able to reproduce it on 2.6.26-rc3 and
> 2.6.25-14.fc9.x86_64 (fedora 9 kernel).
>
> besides, on 2.6.25.4, it was much more longer than
> > # /home/nick/src/fsstress/fsstress -d /mnt2/fsstress -l 10 -n 100 -p 20
> to reproduce (I also changed -n and -p).
>
> Could you confirm that you still easily see the soft lockups?
> If yes, could you attach your x86-64 .config?
> (On my side, I have been using defconfig + ext4dev:
> > CONFIG_EXT4DEV_FS=m
> > CONFIG_EXT4DEV_FS_XATTR=y
> > CONFIG_EXT4DEV_FS_POSIX_ACL=y
> > CONFIG_EXT4DEV_FS_SECURITY=y
> )

No, I am not seeing them any more: it does seem to be a config problem.

I got my original config from the distro that I was running on the box
at the time (RHEL5.1), doing "make oldconfig" and adding ext4dev. I was
trying to get 2.6.26-rc3 going with the same method, but I was running
into severe problems, even without ext4 in the picture (soft lockups
whenever any significant disk IO was done: since the configuration
included a readahead init script for optimization, the machine was
getting hung late in the boot.)

So I got a config file from a colleague's setup and that got me past the
2.6.26-rc3 problem. I also had no problem with ext4 on top of that. I
went back and tried a similar config file on 2.6.25.3 and that also did
not show the problem. So it seems to be entirely caused by my bad choice
of a config file.

As a matter of good practice, is defconfig+ext4dev the best way of
producing a kernel for ext4 testing purposes? Are there any settings
that should be added/delete/modified from the default?

Thanks,
Nick



2008-05-26 20:39:29

by Solofo.Ramangalahy

[permalink] [raw]
Subject: Re: Soft lockups in mballoc code

Nick Dokos writes:
> No, I am not seeing them any more

Good!

> So it seems to be entirely caused by my bad choice of a config file.

I don't see anything wrong with what you did.

> As a matter of good practice, is defconfig+ext4dev the best way of
> producing a kernel for ext4 testing purposes?

No, I don't think so. I have used it because this is reasonably fast
to compile.

I don't think there is a best single .config for testing ext4 (but
let's see what other think).

Using .config from distributions is fine since it is the way some
users get ext4 (that's why I tested Fedora 9 too).
Using other ways is also fine since it increases testing coverage.

> Are there any settings
> that should be added/delete/modified from the default?

Yes, for example, defconfig lacks options from the "Kernel Hacking"
section for testability which are mentionned e.g. in
Documentation/SubmitChecklist.

--
solofo