2011-04-15 02:06:36

by Michael Guntsche

Subject: 2.6.39 Block layer regression was [Bug] Boot hangs with 2.6.39-rc[123]]

Good morning again,

> Hi list,
>
> Apparently all the 2.6.39-rc kernels up to current git HEAD hang on my setup
> here during boot. They survive the initramfs step, but after or during the root
> file system mount it just hangs.
> This is a software RAID-5->LVM setup. The machine itself boots without
> any problems with a 2.6.38 kernel.
>
> The hardware:
> processor : 0
> vendor_id : GenuineIntel
> cpu family : 15
> model : 1
> model name : Intel(R) Pentium(R) 4 CPU 1.70GHz
> stepping : 2
> cpu MHz : 1699.523
> cache size : 256 KB
> fdiv_bug : no
> hlt_bug : no
> f00f_bug : no
> coma_bug : no
> fpu : yes
> fpu_exception : yes
> cpuid level : 2
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pebs bts
> bogomips : 3399.04
> clflush size : 64
> cache_alignment : 128
> address sizes : 36 bits physical, 32 bits virtual
> power management:
>
<boot messages removed>

> md0: detected capacity change from 0 to 57475072
> md0:
> md: md1 stopped.
> md: bind<sdd2>
> md: bind<sde2>
> md: bind<sdf2>
> md: bind<sda2>
> md: bind<sdc2>
> md/raid:md1: device sdc2 operational as raid disk 0
> md/raid:md1: device sda2 operational as raid disk 4
> md/raid:md1: device sdf2 operational as raid disk 3
> md/raid:md1: device sde2 operational as raid disk 2
> md/raid:md1: device sdd2 operational as raid disk 1
> md/raid:md1: allocated 5258kB
> md/raid:md1: raid level 5 active with 5 out of 5 devices, algorithm 2
> created bitmap (4 pages) for device md1
> md1: bitmap initialized from disk: read 1/1 pages, set 0 bits
> md1: detected capacity change from 0 to 2000189128704
> md1: unknown partition table
> device fsid d64bac2f4cf259ba-326da6041632dd9c devid 1 transid 557 /dev/sdb1
> end_request: I/O error, dev fd0, sector 0
> EXT3-fs: barriers not enabled
> kjournald starting. Commit interval 5 seconds
> EXT3-fs (dm-0): mounted filesystem with writeback data mode
> udev[1333]: starting version 167
> input: Power Button as /devices/LNXSYSTM:00/device:00/PNP0C0C:00/input/input2
> ACPI: Power Button [PWRB]
> input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input3
> ACPI: Power Button [PWRF]
> parport_pc 00:08: reported by Plug and Play ACPI
> parport0: PC-style at 0x378, irq 7 [PCSPP,TRISTATE,EPP]
> intel_rng: FWH not detected
> Linux agpgart interface v0.103
> i801_smbus 0000:00:1f.3: PCI INT B -> GSI 17 (level, low) -> IRQ 17
> agpgart-intel 0000:00:00.0: Intel 865 Chipset
> agpgart-intel 0000:00:00.0: AGP aperture is 64M @ 0xf8000000
> Intel ICH 0000:00:1f.5: PCI INT B -> GSI 17 (level, low) -> IRQ 17
> intel8x0: white list rate for 1043:80f3 is 48000
>
> Here the mount hangs. Apparently after root was mounted successfully.
>
> SysRq : Show Blocked State
> task PC stack pid father
> mount D f4cfde00 0 1765 1745 0x00000000
> f4c23dd0 00000086 f44aa800 f4cfde00 f44aa500 0000004d f4c23d74 c11a35f9
> f4c23d80 f4c23dd0 f4431760 8416416f 00000004 f4431760 00000041 ffffffff
> 00000001 00000008 00100800 f4431760 00000008 00000000 e8da47f8 00000000
> Call Trace:
> [<c11a35f9>] ? dm_request+0x8a/0xf6
> [<c105c63b>] ? mempool_alloc_slab+0xe/0x10
> [<c12197b9>] io_schedule+0x2b/0x42
> [<c109cbad>] sleep_on_buffer+0x8/0xc
> [<c1219ad3>] __wait_on_bit+0x43/0x60
> [<c109cba5>] ? __wait_on_buffer+0x27/0x27
> [<c1219b46>] out_of_line_wait_on_bit+0x56/0x5e
> [<c109cba5>] ? __wait_on_buffer+0x27/0x27
> [<c1038e76>] ? autoremove_wake_function+0x36/0x36
> [<c109cba3>] __wait_on_buffer+0x25/0x27
> [<c109d367>] __sync_dirty_buffer+0x6b/0x8b
> [<c109d394>] sync_dirty_buffer+0xd/0xf
> [<fa8d7d80>] ext3_commit_super.clone.24+0x4b/0xb0 [ext3]
> [<c10ed9c7>] ? match_token+0xfd/0x1a2
> [<fa8d8585>] ext3_setup_super+0xb4/0x20d [ext3]
> [<fa8d776d>] ? parse_options+0x32/0x4b0 [ext3]
> [<c121a2e0>] ? __mutex_lock_slowpath+0x124/0x1b2
> [<c121a379>] ? mutex_lock+0xb/0x22
> [<fa8da100>] ext3_remount+0x1df/0x27e [ext3]
> [<fa8d9f21>] ? ext3_fill_super+0x1843/0x1843 [ext3]
> [<c10811d9>] do_remount_sb+0x7d/0xf6
> [<c1094379>] do_mount+0x44f/0x633
> [<c1093e9e>] ? copy_mount_options+0x8d/0xee
> [<c10947bf>] sys_mount+0x66/0x9b
> [<c121af1d>] syscall_call+0x7/0xb
> Sched Debug Version: v0.10, 2.6.39-rc3+ #2
> ktime : 37933.056186
> sched_clk : 38008.731563
> cpu_clk : 37806.036844
> jiffies : 4294901746
> sched_clock_stable : 0
>
> sysctl_sched
> .sysctl_sched_latency : 6.000000
> .sysctl_sched_min_granularity : 0.750000
> .sysctl_sched_wakeup_granularity : 1.000000
> .sysctl_sched_child_runs_first : 0
> .sysctl_sched_features : 7279
> .sysctl_sched_tunable_scaling : 1 (logaritmic)
>
> cpu#0, 1699.419 MHz
> .nr_running : 0
> .load : 0
> .nr_switches : 19596
> .nr_load_updates : 3167
> .nr_uninterruptible : 1
> .next_balance : 0.000000
> .curr->pid : 0
> .clock : 37680.038704
> .cpu_load[0] : 0
> .cpu_load[1] : 0
> .cpu_load[2] : 0
> .cpu_load[3] : 4
> .cpu_load[4] : 17
> .yld_count : 3
> .sched_switch : 0
> .sched_count : 19620
> .sched_goidle : 1555
> .ttwu_count : 1
> .ttwu_local : 1
> .bkl_count : 0
>
> cfs_rq[0]:
> .exec_clock : 6548.484099
> .MIN_vruntime : 0.000001
> .min_vruntime : 6269.914000
> .max_vruntime : 0.000001
> .spread : 0.000000
> .spread0 : 0.000000
> .nr_spread_over : 0
> .nr_running : 0
> .load : 0
>
> rt_rq[0]:
> .rt_nr_running : 0
> .rt_throttled : 0
> .rt_time : 200.624789
> .rt_runtime : 950.000000
>
> runnable tasks:
> task PID tree-key switches prio exec-runtime sum-exec sum-sleep
> ----------------------------------------------------------------------------------------------------------

After talking to Dave Chinner I looked at the block layer merges. I ended
up on

6c510389005 Merge branch 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block

Starting with this merge I see the problems. For testing I compiled the
previous merge

3dab04e6978 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-mn10300

And apart from a futex error that gets fixed later on, I was able to
mount all partitions and boot.
As I said initially, this happens with current git HEAD as well, so it has
not been fixed yet. Since not everyone is seeing this, I wonder if it is
related to my setup here (SATA -> RAID5 -> LVM)?

Kind regards,
Michael Guntsche


2011-04-15 03:26:27

by Linus Torvalds

Subject: Re: 2.6.39 Block layer regression was [Bug] Boot hangs with 2.6.39-rc[123]]

On Thu, Apr 14, 2011 at 7:06 PM, Michael Guntsche <[email protected]> wrote:
>
> After talking to Dave Chinner I looked at the block layer merges. I ended
> up on
>
> 6c510389005 Merge branch 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block
>
> Starting with this merge I see the problems.

Ok, so that's not very surprising. It's the new per-thread plugging,
and yes, there's clearly something broken with regards to MD/DM.

And I have a suspicion.

Jens - tell me if I'm wrong, but look at the crazy plug flushing code:

void __blk_flush_plug(struct task_struct *tsk, struct blk_plug *plug)
{
	__blk_finish_plug(tsk, plug);
	tsk->plug = plug;
}

and explain that idiotic __blk_finish_plug() logic to me:

static void __blk_finish_plug(struct task_struct *tsk, struct blk_plug *plug)
{
	flush_plug_list(plug);

	if (plug == tsk->plug)
		tsk->plug = NULL;
}

and in particular the "set it to NULL, only to then set it back
again". That code makes no sense. __blk_finish_plug() is only ever
called with "plug" being "tsk->plug", and afaik nothing will ever
modify a non-NULL plug (if it is a nested plug, it would never be
added to the task) _except_ for that __blk_finish_plug(). No? So it
sets it to NULL, and then immediately the caller will set it back
again.

What's the thinking there? It looks very confused to me.

Now, clearly RAID seems to be involved in the problem? The main thing
with that would be that the execution of the requests would tend to
generate new requests, that go back on the plug queue. Yes? And the
loop in flush_plug_list() means that they all should get flushed out,
I assume. But something clearly isn't working, and it does seem to be
about the RAID kind of setup. So either they didn't get put on the
plug queue, or the task got a new plug (which _wasn't_ flushed).

Because we're clearly waiting for some request that hasn't completed.
Where in the plug queues would it be hiding?

The whole block layer plugging looks to be the major problem of the 39
cycle. Jens, pls explain.

Linus

2011-04-15 04:22:58

by Christoph Hellwig

Subject: Re: 2.6.39 Block layer regression was [Bug] Boot hangs with 2.6.39-rc[123]]

On Thu, Apr 14, 2011 at 08:25:33PM -0700, Linus Torvalds wrote:
> What's the thinking there? It looks very confused to me.

It is. I sent a patch a couple of days ago to fix it.

> Now, clearly RAID seems to be involved in the problem? The main thing
> with that would be that the execution of the requests would tend to
> generate new requests, that go back on the plug queue. Yes? And the
> loop in flush_plug_list() means that they all should get flushed out,
> I assume. But something clearly isn't working, and it does seem to be
> about the RAID kind of setup. So either they didn't get put on the
> plug queue, or the task got a new plug (which _wasn't_ flushed).
>
> Because we're clearly waiting for some request that hasn't completed.
> Where in the plug queues would it be hiding?

There's a thread where Neil explains what the problem with MD is - it
needs a callback on unplug time to generate e.g. the write intent bitmap
or as large as possible writes for RAID5. Jens and Neil have been
looking into it.

2011-04-15 06:13:24

by Jens Axboe

Subject: Re: 2.6.39 Block layer regression was [Bug] Boot hangs with 2.6.39-rc[123]]

On 2011-04-15 06:22, Christoph Hellwig wrote:
> On Thu, Apr 14, 2011 at 08:25:33PM -0700, Linus Torvalds wrote:
>> What's the thinking there? It looks very confused to me.
>
> It is. I sent a patch a couple of days ago to fix it.

Yeah thanks for that, I agree it looks a bit confusing as-is. I'll queue
it up.

>> Now, clearly RAID seems to be involved in the problem? The main thing
>> with that would be that the execution of the requests would tend to
>> generate new requests, that go back on the plug queue. Yes? And the
>> loop in flush_plug_list() means that they all should get flushed out,
>> I assume. But something clearly isn't working, and it does seem to be
>> about the RAID kind of setup. So either they didn't get put on the
>> plug queue, or the task got a new plug (which _wasn't_ flushed).
>>
>> Because we're clearly waiting for some request that hasn't completed.
>> Where in the plug queues would it be hiding?
>
> There's a thread where Neil explains what the problem with MD is - it
> needs a callback on unplug time to generate e.g. the write intent bitmap
> or as large as possible writes for RAID5. Jens and Neil have been
> looking into it.

I think we are done, Neil just needs to rebase around the current
for-linus and then we should expedite things in.

--
Jens Axboe

2011-04-21 17:53:53

by Christian Kujau

Subject: Re: 2.6.39 Block layer regression was [Bug] Boot hangs with 2.6.39-rc[123]]

On Fri, 15 Apr 2011 at 08:13, Jens Axboe wrote:
> I think we are done, Neil just needs to rebase around the current
> for-linus and then we should expedite things in.

Whatever this was: has it been pushed to Linus yet? I'm on 2.6.39-rc4+
(from yesterday) and I too experience a hang during boot (and other
strange things with udev[0])

Thanks,
Christian.

[0] http://nerdbynature.de/bits/2.6.39-rc4/
--
BOFH excuse #208:

Your mail is being routed through Germany ... and they're censoring us.

2011-04-21 18:00:35

by Linus Torvalds

Subject: Re: 2.6.39 Block layer regression was [Bug] Boot hangs with 2.6.39-rc[123]]

On Thu, Apr 21, 2011 at 10:53 AM, Christian Kujau <[email protected]> wrote:
> On Fri, 15 Apr 2011 at 08:13, Jens Axboe wrote:
>> I think we are done, Neil just needs to rebase around the current
>> for-linus and then we should expedite things in.
>
> Whatever this was: has it been pushed to Linus yet? I'm on 2.6.39-rc4+
> (from yesterday) and I too experience a hang during boot (and other
> strange things with udev[0])

As of right now, all _known_ block bugs should be fixed in the stuff I
just pushed out, with the current top-of-tree being the merge of the
IDE CD endless loop issue and the NULL ptr oops at elevator change
time:

Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block

* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
ide: unexport DISK_EVENT_MEDIA_CHANGE for ide-gd and ide-cd
block: don't propagate unlisted DISK_EVENTs to userland
elevator: check for ELEVATOR_INSERT_SORT_MERGE in !elvpriv case too

but maybe yours is something else.

That said, that udevd message of yours would imply a full root
filesystem, which can cause any amount of issues at boot time..

Linus

2011-04-21 18:30:19

by Christian Kujau

Subject: Re: 2.6.39 Block layer regression was [Bug] Boot hangs with 2.6.39-rc[123]]

Linus Torvalds wrote on 2011-04-21 10:59 :
> As of right now, all _known_ block bugs should be fixed in the stuff I
> just pushed out, with the current top-of-tree being the merge of the
> IDE CD endless loop issue and the NULL ptr oops at elevator change
> time:

OK, thanks.

> That said, that udevd message of yours would imply a full root
> filesystem, which can cause any amount of issues at boot time..

My /dev is mounted as udev and is 10 MB in size, which is more than
enough for kernels < 2.6.39. But with 2.6.39 something changed:
/dev/.udev/queue.bin grows to 7 MB and more, udevd spins like crazy,
and the OOM killer is reaping processes.

I'm currently in the middle of a bisect, with 11 steps to go on this
PowerBook G4 :-\

Thanks,
Christian.
--
make bzImage, not war

2011-04-21 18:36:56

by Jens Axboe

Subject: Re: 2.6.39 Block layer regression was [Bug] Boot hangs with 2.6.39-rc[123]]

On 2011-04-21 20:30, Christian Kujau wrote:
> Linus Torvalds wrote on 2011-04-21 10:59 :
>> As of right now, all _known_ block bugs should be fixed in the stuff I
>> just pushed out, with the current top-of-tree being the merge of the
>> IDE CD endless loop issue and the NULL ptr oops at elevator change
>> time:
>
> OK, thanks.
>
>> That said, that udevd message of yours would imply a full root
>> filesystem, which can cause any amount of issues at boot time..
>
> My /dev is mounted as udev and is 10MB in size, which is more than
> enough for kernels < 2.6.39. But with 2.6.39 something changed and
> /dev/.udev/queue.bin is growing to 7 MB and more, udevd is spinning like
> crazy and the OOM is reaping processes.
>
> I'm currently in the middle of a bisect, with 11 steps to go on this
> PowerBook G4 :-\

Please try Linus -git as of now, it could be the excessive udev events
for media change that is causing your problem.

--
Jens Axboe

2011-04-21 20:53:37

by Christian Kujau

Subject: Re: 2.6.39 Block layer regression was [Bug] Boot hangs with 2.6.39-rc[123]]

On Thu, 21 Apr 2011 at 20:36, Jens Axboe wrote:
> Please try Linus -git as of now, it could be the excessive udev events
> for media change that is causing your problem.

I'm running latest -git (91e8549..), which includes your for-linus branch
and it's running fine so far (1h:30min uptime).

@Ralf: can you try this as well, maybe it'll fix #622352 for you too?

Thanks,
Christian.
--
BOFH excuse #382:

Someone was smoking in the computer room and set off the halon systems.