Hi,
After a git pull earlier today, I've had oopsen whilst booting the system twice today:
A photo of the oops message can be viewed at:
http://img231.imageshack.us/img231/8931/dscn0610.jpg
I haven't used the 2.6.30-rc kernels very much, so I don't have a clue when the problem was
introduced. I have no problems at all with 2.6.29.4.
Happy to help solve this - just let me know what else I can do.
Thanks
--
No, Sir; there is nothing which has yet been contrived by man, by which so much happiness is
produced as by a good tavern or inn - Doctor Samuel Johnson
Sorry, I've just realised what a poor report my earlier email was. I should also have said that the
kernel has booted successfully. At the time of reporting I had had two panics. I've just done
another git pull, so the kernel is bang up to date with kernel.org, but after building and
installing, a reboot resulted in another panic. The system then booted OK after switching it off
and on again.
Please cc me on any reply, I'm not subscribed.
On Saturday 06 June 2009, Chris Clayton wrote:
> Hi,
>
> After a git pull earlier today, I've had oopsen whilst booting the system
> twice today: A photo of the oops message can be viewed at:
>
> http://img231.imageshack.us/img231/8931/dscn0610.jpg
>
> I haven't used the 2.6.30-rc kernels very much, so I don't have a clue when
> the problem was introduced. I have no problems at all with 2.6.29.4.
>
> Happy to help solve this - just let me know what else I can do.
>
> Thanks
Hello Chris,
On Sat, 2009-06-06 at 22:15 +0100, Chris Clayton wrote:
> Sorry, I've just realised what a poor report my earlier email was. I should also have said that the
> kernel has booted successfully. At the time of reporting I had had two panics. I've just done
> another git pull, so the kernel is bang up to date with kernel.org, but after building and
> installing, a reboot resulted in another panic. The system then booted OK after switching it off
> and on again.
>
> Please cc me on any reply, I'm not subscribed.
Can you provide your .config and the dmesg output of both a working and
a non-working kernel?
Thanks,
--
JSR
2009/6/7 Jaswinder Singh Rajput <[email protected]>:
> Hello Chris,
>
Hello Jaswinder, and thanks for the reply.
> Can you provide your .config and dmesg of working and non-working
> kernel.
>
> Thanks,
> --
> JSR
>
>
Sorry for the delay in replying - I've been bisecting but ended up with:
commit 60a0cd528d761c50d3a0a49e8fbaf6a87e64254a
Merge: e25e092 8e35961
Author: Linus Torvalds <[email protected]>
Commit: Linus Torvalds <[email protected]>
Merge branch 'merge' of
git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
powerpc/mm: Fix broken MMU PID stealing on !SMP
which seems wrong given I have an x86 machine. I guess I didn't try
enough reboots before reporting a kernel as good.
However, I did get two bad boots. The dmesg for a good boot and config
are attached. As I don't get to user space on a bad boot, I can't
provide a dmesg, but a photograph of the panic can be viewed at
http://img99.imageshack.us/my.php?image=dscn0617b.jpg
I'll start bisecting again, but allow many more reboots before deciding
a kernel is good.
Let me know if I can provide any additional information.
Thanks
On Sun, 2009-06-07 at 19:38 +0100, Chris Clayton wrote:
> Sorry for delay in replying - I've been bisecting but ended up with:
>
> commit 60a0cd528d761c50d3a0a49e8fbaf6a87e64254a
> Merge: e25e092 8e35961
> Author: Linus Torvalds <[email protected]>
> Commit: Linus Torvalds <[email protected]>
>
> Merge branch 'merge' of
> git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
>
> * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
> powerpc/mm: Fix broken MMU PID stealing on !SMP
>
Based on this I added some CCs.
> which seems wrong given I have an x86 machine. I guess I didn't try
> enough reboots before reporting a kernel as good.
>
> However, I did get two bad boots. The dmesg for a good boot and config
> are attached. As I don't get to user space on a bad boot, I can't
> provide a dmesg, but a photograph of the panic can be viewed at
> http://img99.imageshack.us/my.php?image=dscn0617b.jpg
>
The error message will be on the previous screen. Can you page up and
capture that output?
And the .config is the same for Good and Bad.
--
JSR
On Mon, June 8, 2009 8:31 am, Jaswinder Singh Rajput wrote:
> On Sun, 2009-06-07 at 19:38 +0100, Chris Clayton wrote:
>> 2009/6/7 Jaswinder Singh Ra
>> >> > http://img231.imageshack.us/img231/8931/dscn0610.jpg
This message says that it found a vfat filesystem on 8:3x (I cannot see
what digit should be 'x'). That is probably sdc1 or sdc2. Maybe even
sdc6 or sdc7.
However the vfat filesystem didn't have /sbin/init.
>> http://img99.imageshack.us/my.php?image=dscn0617b.jpg
This one says it couldn't find anything at 8,22, which I think
should be sdb6.
It also shows that you have an sdc6, but sdb only goes up to sdb3.
So it seems that your disk drives have changed name - not a wholly
unexpected event these days.
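The device-number reasoning above can be checked mechanically. A minimal sketch, assuming the conventional Linux numbering for SCSI/libata discs (block major 8, 16 minors per disc, with minor 0 of each group being the whole disc); the function name sd_name is made up for illustration:

```python
import string

SD_MAJOR = 8           # block major used for the first 16 sd discs
MINORS_PER_DISC = 16   # minor 0 of each group is the whole disc, 1..15 are partitions

def sd_name(major, minor):
    """Translate a (major, minor) pair such as 8,22 into a name such as sdb6."""
    if major != SD_MAJOR or not 0 <= minor < 256:
        raise ValueError("only block major 8 (sda..sdp) is handled in this sketch")
    disc, partition = divmod(minor, MINORS_PER_DISC)
    name = "sd" + string.ascii_lowercase[disc]
    return name if partition == 0 else name + str(partition)

print(sd_name(8, 22))                          # the 8,22 from the second photo
print([sd_name(8, m) for m in range(30, 40)])  # every candidate for 8:3x
```

Under that assumption 8,22 decodes to sdb6, and the 8:3x candidates run from the tail of sdb through sdc7, which matches the guesses above.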
We now need answers to questions like:
 - what device do you expect the root filesystem to be on?
 - how is the kernel being told this? Maybe it is hard coded
   into your initrd. Knowing which distro and what /etc/fstab
   says might help (though it wouldn't help me, I'm just about out
   of my depth at this point)
Maybe if you changed /etc/fstab to mount by uuid instead of hardcoding
e.g. /dev/sdb3, and then ran "mkinitramfs" or whatever, it might work.
Good luck,
NeilBrown
Hi Neil,
Thanks for the reply.
2009/6/7 NeilBrown <[email protected]>:
> On Mon, June 8, 2009 8:31 am, Jaswinder Singh Rajput wrote:
>> On Sun, 2009-06-07 at 19:38 +0100, Chris Clayton wrote:
>>> 2009/6/7 Jaswinder Singh Ra
>>> >> > http://img231.imageshack.us/img231/8931/dscn0610.jpg
>
> This message says that it found a vfat filesystem on 8:3x (I cannot see
> what digit should be 'x'). That is probably sdc1 or sdc2. Maybe even
> sdc6 or sdc7.
> However the vfat filesystem didn't have /sbin/init.
>
>>> http://img99.imageshack.us/my.php?image=dscn0617b.jpg
>
> This one says it couldn't find anything at 8,22, which I think
> should be sdb6.
> It also shows that you have an sdc6, but sdb only goes up to sdb3.
>
> So it seems that your disk drives have changed name - not a wholly
> unexpected event these days.
>
> We now need answers to questions like:
>  - what device do you expect the root filesystem to be on?
>  - how is the kernel being told this? Maybe it is hard coded
>    into your initrd. Knowing which distro and what /etc/fstab
>    says might help (though it wouldn't help me, I'm just about out
>    of my depth at this point)
> Maybe if you changed /etc/fstab to mount by uuid instead of hardcoding
> e.g. /dev/sdb3, and then ran "mkinitramfs" or whatever, it might work.
>
Yes, I've just been looking at the photographs of the panics again and
I've noticed that two of my discs are being detected in the "wrong
order". There are three HDDs. The first, /dev/sda, is the master on
the first IDE port and contains sda1..sda7. The second, normally
/dev/sdb, is the slave on that port and contains sdb1..sdb6. The
third, normally /dev/sdc, is attached to the first SATA port and
contains sdc1..sdc3. The second photograph I posted shows that sdb and
sdc have been reversed. The disc that is normally /dev/sdb does indeed
have a FAT32 filesystem in its first partition.
By the way, I should have said that in between the panics that the two
photographs show, I copied the contents of /dev/sdc1, which I normally
boot from, to /dev/sdb6, so that I minimised the risk to sdc1 in the
reboot festival that bisecting would involve. I also, of course,
changed the name of the root partition that is passed to the kernel by
GRUB and amended /etc/fstab on /dev/sdb6. That's why the partitions
shown in the photographs seem inconsistent. Sorry I forgot to mention
that - I really shouldn't do these things late at night :-).
As I indicate above, when booting the partition I have set up to do
this bisecting, I expect the root filesystem to be on /dev/sdb6. As I
also indicate, this information is passed to the kernel through GRUB's
/boot/grub/menu.lst. The kernel is configured specifically for my
system and the drivers needed to boot the system are built in to the
kernel, so I don't use an initrd. IIRC, that's the way Slackware is
installed today, except, of course, it's a big fat kernel with all
drivers needed to boot any system built in. I could be wrong on that
though, it's a while since I installed
As to the distro, it used to be (the now defunct) Peanut Linux, which
was derived from Slackware. However, it's years since I installed it
and I have upgraded just about everything in user space and added many
other things (udev, dbus...). I don't think that makes any difference
here, though, because we don't get as far as user space. On a
successful boot, the system is stable and runs trouble-free for
several hours a day, every day.
Hope this helps.
I'm a good way through bisecting again and this time the system has to
boot without a panic 100 times before I mark a kernel as good. I'll
post the result later.
Thanks
> Good luck,
> NeilBrown
>
>
2009/6/8 Chris Clayton <[email protected]>:
> I'm a good way through bisecting again and this time the system has to
> boot without a panic 100 times before I mark a kernel as good. I'll
> post the result later.
>
Finally got to the end of the bisection/reboot festival. I ended up here:
[chris:~/kernel/linux-2.6]$ git bisect good
d5a877e8dd409d8c702986d06485c374b705d340 is first bad commit
commit d5a877e8dd409d8c702986d06485c374b705d340
Author: James Bottomley <[email protected]>
Date: Sun May 24 13:03:43 2009 -0700
async: make sure independent async domains can't accidentally entangle
The problem occurs when async_synchronize_full_domain() is called when
the async_pending list is not empty. This will cause lowest_running()
to return the cookie of the first entry on the async_pending list, which
might be nothing at all to do with the domain being asked for and thus
cause the domain synchronization to wait for an unrelated domain. This
can cause a deadlock if domain synchronization is used from one domain
to wait for another.
Fix by running over the async_pending list to see if any pending items
actually belong to our domain (and return their cookies if they do).
Signed-off-by: James Bottomley <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
:040000 040000 fab1e0c06572605a7015061db4a7e0a77c04fa91 34252dbb7fed3942f5952c25639564bbd77357da M kernel
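The behaviour described in that commit message can be modelled outside the kernel. What follows is only a toy model built from the wording of the commit message, not the code in the kernel tree; the Entry tuple, the domain names "A" and "B" and both function names are invented for illustration:

```python
from collections import namedtuple

# Toy stand-in for a queued async item: its cookie and the domain it belongs to.
Entry = namedtuple("Entry", ["cookie", "domain"])

def lowest_running_before(running, pending, domain, next_cookie):
    """Behaviour the commit message describes as the problem."""
    mine = [e.cookie for e in running if e.domain == domain]
    if mine:
        return min(mine)
    if pending:
        # The first pending entry wins, whichever domain it belongs to.
        return pending[0].cookie
    return next_cookie  # nothing outstanding

def lowest_running_after(running, pending, domain, next_cookie):
    """Behaviour after the fix: only pending items of our own domain count."""
    mine = [e.cookie for e in running if e.domain == domain]
    if mine:
        return min(mine)
    for e in pending:
        if e.domain == domain:
            return e.cookie
    return next_cookie

pending = [Entry(cookie=7, domain="A"), Entry(cookie=9, domain="B")]
print(lowest_running_before([], pending, "B", next_cookie=10))
print(lowest_running_after([], pending, "B", next_cookie=10))
```

In this model a caller synchronising domain B is handed cookie 7 (which belongs to A) under the old rule and cookie 9 under the new one, which is the entanglement the commit sets out to remove.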
I can't claim to know what the change actually means, but the change
seems to be a much better candidate than my previous bisection outcome
where I required only 20 "panicless" boots to regard the kernel as
good. As I said earlier today, this time I required 100 such boots.
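The jump from 20 to 100 boots matters because the panic is intermittent. As a rough sketch, assuming each boot of a bad kernel panics independently with some probability p (a made-up parameter; the real rate on this machine isn't known), the chance of wrongly marking a bad kernel as good after n clean boots is (1 - p)^n:

```python
import math

def false_good_probability(p_panic, clean_boots):
    """Chance that a bad kernel survives clean_boots boots in a row."""
    return (1.0 - p_panic) ** clean_boots

def boots_needed(p_panic, max_false_good):
    """Smallest number of clean boots that keeps the false-good chance within the limit."""
    return math.ceil(math.log(max_false_good) / math.log(1.0 - p_panic))

p = 0.05  # hypothetical panic rate per boot, for illustration only
for n in (20, 100):
    print(n, round(false_good_probability(p, n), 4))
print(boots_needed(p, 0.01))
```

With a rate like that, 20 clean boots would still leave a sizeable chance of a wrong "good" verdict at each bisection step, while 100 would leave very little; a rarer panic would need correspondingly more boots.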
I'll revert that change, give the new kernel the reboot treatment :-)
and report back later.
Chris
Hello Chris,
On Mon, 2009-06-08 at 11:58 +0100, Chris Clayton wrote:
> Finally got to the end of the bisection/reboot festival. I ended up here:
>
> [chris:~/kernel/linux-2.6]$ git bisect good
> d5a877e8dd409d8c702986d06485c374b705d340 is first bad commit
> commit d5a877e8dd409d8c702986d06485c374b705d340
> Author: James Bottomley <[email protected]>
> Date: Sun May 24 13:03:43 2009 -0700
>
> async: make sure independent async domains can't accidentally entangle
>
> The problem occurs when async_synchronize_full_domain() is called when
> the async_pending list is not empty. This will cause lowest_running()
> to return the cookie of the first entry on the async_pending list, which
> might be nothing at all to do with the domain being asked for and thus
> cause the domain synchronization to wait for an unrelated domain. This
> can cause a deadlock if domain synchronization is used from one domain
> to wait for another.
>
> Fix by running over the async_pending list to see if any pending items
> actually belong to our domain (and return their cookies if they do).
>
> Signed-off-by: James Bottomley <[email protected]>
> Signed-off-by: Arjan van de Ven <[email protected]>
> Signed-off-by: Linus Torvalds <[email protected]>
>
> :040000 040000 fab1e0c06572605a7015061db4a7e0a77c04fa91
> 34252dbb7fed3942f5952c25639564bbd77357da M kernel
>
> I can't claim to know what the change actually means, but the change
> seems to be a much better candidate than my previous bisection outcome
> where I required only 20 "panicless" boots to regard the kernel as
> good. As I said earlier today, this time I required 100 such boots.
>
> I'll revert that change, give the new kernel the reboot treatment :-)
> and report back later.
>
Good work. Please also share this info with the others who signed off
on that commit; I'm adding them to CC.
Thanks,
--
JSR
2009/6/8 Jaswinder Singh Rajput <[email protected]>:
> Hello Chris,
>
> Good work. Please also share this info with other signed-off members, So
> adding CC.
>
OK. I reversed that change and built and installed the kernel. It has
withstood 100 reboots without a panic. Additionally, I pulled the
latest changes (that will be rc8-git5, I think) from kernel.org,
reversed the change to that kernel and built and installed it. That
too withstood 100 reboots without a panic.
Let me know if there's anything else I can do to help fix this.
Chris
On Mon, 2009-06-08 at 09:08 +0100, Chris Clayton wrote:
> Hi Neil,
>
> Thanks for the reply.
>
> Yes, I've just been looking at the photographs of the panics again and
> I've noticed that two of my discs are being detected in the "wrong
> order". There are three HDDs. The first, /dev/sda, is the master on
> the first IDE port and contains sda1..sda7. The second, normally
> /dev/sdb, is the slave on that port and contains sdb1..sdb6. The
> third, normally /dev/sdc, is attached to the first SATA port and
> contains sdc1..sdc3. The second photograph I posted shows that sdb and
> sdc have been reversed. The first partition on the disc that is
> normally /dev/sdb does indeed have a FAT32 filesystem in the first
> partition.
>
> By the way, I should have said that in between the panics that the two
> photographs show, I copied contents of /dev/sdc1, which I normally
> boot from, to /dev/sdb6, so that I minimised the risk to sdc1 in the
> reboot festival that bisecting would involve. I also, of course,
> changed the name of the root partition that is passed to the kernel by
> GRUB and amended /etc/fstab on /dev/sdb6. That's why the partitions
> shown in the photographs seem inconsistent. Sorry I forgot to mention
> that - I really shouldn't do these things late at night :-).
Actually, you can save yourself a lot of pain by mounting by label
instead ... that way both grub and fstab will find your root disc even
if it has swapped order.
> As I indicate above, when booting the partition I have set up to do
> this bisecting, I expect the root filesystem to be on /dev/hdb6. As I
> also indicate, this information is passed to the kernel through GRUB's
> /boot/grub/menu.lst. The kernel is configured specifically for my
> system and the drivers needed to boot the system are built in to the
> kernel, so I don't use an initrd. IIRC, that's the way Slackware is
> installed today, except, of course, it's a big fat kernel with all
> drivers needed to boot any system built in. I could be wrong on that
> though, it's a while since I installed
The fact that a slave on the first channel is detected after the SATA
indicates a problem with async probing. What are the two drivers for
these?
James
Hi James,
2009/6/8 James Bottomley <[email protected]>:
> The fact that a slave on the first channel is detected after the SATA
> indicates a problem with async probing.  What are the two drivers for
> these?
>
I think this is the relevant part of my .config:
CONFIG_ATA=y
# CONFIG_ATA_NONSTANDARD is not set
CONFIG_ATA_ACPI=y
# CONFIG_SATA_PMP is not set
# CONFIG_SATA_AHCI is not set
# CONFIG_SATA_SIL24 is not set
CONFIG_ATA_SFF=y
# CONFIG_SATA_SVW is not set
CONFIG_ATA_PIIX=y <<<<=================
# CONFIG_SATA_MV is not set
# CONFIG_SATA_NV is not set
# CONFIG_PDC_ADMA is not set
# CONFIG_SATA_QSTOR is not set
# CONFIG_SATA_PROMISE is not set
# CONFIG_SATA_SX4 is not set
# CONFIG_SATA_SIL is not set
# CONFIG_SATA_SIS is not set
# CONFIG_SATA_ULI is not set
# CONFIG_SATA_VIA is not set
# CONFIG_SATA_VITESSE is not set
# CONFIG_SATA_INIC162X is not set
The ATA/(E)IDE driver is disabled.
Chris
> James
>
Sorry James, I forgot to ask...
2009/6/8 James Bottomley <[email protected]>:
> Actually, you can save yourself a lot of pain by mounting by label
> instead ... that way both grub and fstab will find your root disc even
> if it has swapped order.
>
Would I be right in assuming from this that "out-of-order" detection
is expected behaviour? If so, I'll fix up my system and shut up :-) I
suspect I may not be the only person on the planet who specifies the
root filesystem in this way, though.
Thanks
On Mon, 2009-06-08 at 16:17 +0100, Chris Clayton wrote:
> > Actually, you can save yourself a lot of pain by mounting by label
> > instead ... that way both grub and fstab will find your root disc even
> > if it has swapped order.
> >
>
> Would I be right in assuming from this that "out-of-order" detection
> is expected behaviour? If so, I'll fix up my system and shut up :-) I
> suspect I may not be the only person on the planet who specifies the
> root filesystem in this way, though.
Yes and no ... yes generally because parallel asynchronous probing does
relax the ordering rules, so if you have multiple devices the ordering
can easily change. No in your particular case because you have a single
ata_piix and even with async enabled it should probe PATA first (master
and slave) followed by SATA. The fact that the PATA and SATA probes
didn't synchronise is a bug somewhere, but I'm not sure where.
Could you fix your system to do the label mounting and post the full
boot log where it's out of sequence? I have a nasty feeling we might
have got the ata probe order correct only to be thrown out of order by
sd driver attachment.
James
On Mon, 8 Jun 2009, Chris Clayton wrote:
>
> OK. I reversed that change and built and installed the kernel. It has
> withstood 100 reboots without a panic. Additionally, I pulled the
> latest changes (that will be rc8-git5, I think) from kernel.org,
> reversed the change to that kernel and built and installed it. That
> too withstood 100 reboots without a panic.
>
> Let me know if there's anything else I can do to help fix this.
That's already pretty convincing.
James, Arjan? The original oops message is here (a jpg screen capture,
unable to open initial console):
http://lkml.org/lkml/2009/6/6/142
and it's this bug entry:
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=13474
Subject : Oops whilst booting
Submitter : Chris Clayton <[email protected]>
Date : 2009-06-06 18:59 (2 days old)
References : http://marc.info/?l=linux-kernel&m=124431487924254&w=4
and now bisected down to
>> commit d5a877e8dd409d8c702986d06485c374b705d340
>> Author: James Bottomley <[email protected]>
>> Date:   Sun May 24 13:03:43 2009 -0700
>>
>>     async: make sure independent async domains can't accidentally entangle
please advise. Otherwise I'll have to revert.
Linus
On Mon, 2009-06-08 at 09:21 -0700, Linus Torvalds wrote:
> please advise. Otherwise I'll have to revert.
The root cause is a reordering of the devices caused by the async code.
I suspect it's a bug in async that was obscured by the old behaviour of
async_synchronize.. (or it's a bug in the new code) ... how long do I
have to find out which?
James
On Mon, 2009-06-08 at 16:51 +0000, James Bottomley wrote:
> The root cause is a reordering of the devices caused by the async code.
>
> I suspect it's a bug in async that was obscured by the old behaviour of
> async_synchronize.. (or it's a bug in the new code) ... how long do I
> have to find out which?
>
Either reverting your patch or returning early like this also fixes
Chris's problem:
diff --git a/kernel/async.c b/kernel/async.c
index 94dd36f..3b492cb 100644
--- a/kernel/async.c
+++ b/kernel/async.c
@@ -96,15 +96,13 @@ static async_cookie_t  __lowest_in_progress(struct list_head *running)
 	if (!list_empty(running)) {
 		entry = list_first_entry(running,
 			struct async_entry, list);
-		ret = entry->cookie;
+		return entry->cookie;
 	}

 	if (!list_empty(&async_pending)) {
 		list_for_each_entry(entry, &async_pending, list)
-			if (entry->running == running) {
-				ret = entry->cookie;
-				break;
-			}
+			if (entry->running == running)
+				return entry->cookie;
 	}

 	return ret;
On Mon, 2009-06-08 at 09:21 -0700, Linus Torvalds wrote:
> please advise. Otherwise I'll have to revert.
I think it's a bug in the async code. It's providing cookies too high
because it doesn't stop after it finds a running entry.
Can we try this as the fix?
James
---
diff --git a/kernel/async.c b/kernel/async.c
index 5054030..e4909ee 100644
--- a/kernel/async.c
+++ b/kernel/async.c
@@ -97,7 +97,7 @@ static async_cookie_t __lowest_in_progress(struct list_head *running)
 	if (!list_empty(running)) {
 		entry = list_first_entry(running,
 			struct async_entry, list);
-		ret = entry->cookie;
+		return entry->cookie;
 	}

 	if (!list_empty(&async_pending)) {
On Mon, 8 Jun 2009, James Bottomley wrote:
>
> The root cause is a reordering of the devices caused by the async code.
That's NULL information.
OF COURSE the root cause is the async code. We know that. We're looking
for the specifics.
In particular, before that commit, at most you will wait for too _much_.
In other words, it's a "good" wait.
Your commit caused it to wait for less, and that then showed a bug. Not
all that surprising - it's now not waiting enough.
You tried to avoid a deadlock situation of waiting for too much, but you
avoided the deadlock by now waiting for too little.
I also think that your code is simply buggy. As far as I can tell, in the
case of having both running and pending events, you'll always pick the
pending cookie. But it's the _running_ cookie that has the lower event
number, isn't it?
I dunno. It all looks very fishy to me.
Linus
Linus Torvalds wrote:
>
> On Mon, 8 Jun 2009, James Bottomley wrote:
>> The root cause is a reordering of the devices caused by the async code.
>
> That's NULL information.
>
> OF COURSE the root cause is the async code. We know that. We're looking
> for the specifics.
>
> In particular, before that commit, at most you will wait for too _much_.
> In other words, it's a "good" wait.
>
> Your commit caused it to wait for less, and that then showed a bug. Not
> all that surprising - it's now not waiting enough.
>
> You tried to avoid a deadlock situation of waiting for too much, but you
> avoided the deadlock by now waiting for too little.
>
> I also think that your code is simply buggy. As far as I can tell, in the
> case of having both running and pending events, you'll always pick the
> pending cookie. But it's the _running_ cookie that has the lower event
> number, isn't it?
>
> I dunno. It all looks very fishy to me.
>
That's likely my screwup, not James's.
The patch looks OK to me; it should indeed fix the problem
(and is simpler than the idea I had of using min()).
On Mon, 2009-06-08 at 10:21 -0700, Linus Torvalds wrote:
>
> On Mon, 8 Jun 2009, James Bottomley wrote:
> >
> > The root cause is a reordering of the devices caused by the async code.
>
> That's NULL information.
>
> OF COURSE the root cause is the async code. We know that. We're looking
> for the specifics.
>
> In particular, before that commit, at most you will wait for too _much_.
> In other words, it's a "good" wait.
>
> Your commit caused it to wait for less, and that then showed a bug. Not
> all that surprising - it's now not waiting enough.
right ... my question was whether this exposed an existing bug that was
hidden by the waiting too much. Actually, I audited all the async code
and that's impossible: we don't actually have any async domains at all
(except for the spurious superblock s_async_list, which never gets
anything added to its runqueue), so it must be a bug in the code.
> You tried to avoid a deadlock situation of waiting for too much, but you
> avoided the deadlock by now waiting for too little.
>
> I also think that your code is simply buggy. As far as I can tell, in the
> case of having both running and pending events, you'll always pick the
> pending cookie. But it's the _running_ cookie that has the lower event
> number, isn't it?
Yes, see later fix. Assuming we get confirmation from the reporter, we
should be good to go.
> I dunno. It all looks very fishy to me.
Well, the other option is to revert the fix ... since there is no other
separated domain, there's nothing really to fix ... the original code
that showed the problem was a SCSI feature tree conversion of our
current async scanning code to the async infrastructure which used a
separate domain.
James
On Mon, 8 Jun 2009, James Bottomley wrote:
>
> I think it's a bug in the async code. It's providing cookies too high
> because it doesn't stop after it finds a running entry.
>
> Can we try this as the fix?
Ok, this looks likely.
That said, why doesn't that function look like this?
Linus
---
 kernel/async.c |   15 +++++----------
 1 files changed, 5 insertions(+), 10 deletions(-)

diff --git a/kernel/async.c b/kernel/async.c
index 5054030..27235f5 100644
--- a/kernel/async.c
+++ b/kernel/async.c
@@ -92,23 +92,18 @@ extern int initcall_debug;
 static async_cookie_t  __lowest_in_progress(struct list_head *running)
 {
 	struct async_entry *entry;
-	async_cookie_t ret = next_cookie; /* begin with "infinity" value */

 	if (!list_empty(running)) {
 		entry = list_first_entry(running,
 			struct async_entry, list);
-		ret = entry->cookie;
+		return entry->cookie;
 	}

-	if (!list_empty(&async_pending)) {
-		list_for_each_entry(entry, &async_pending, list)
-			if (entry->running == running) {
-				ret = entry->cookie;
-				break;
-			}
-	}
+	list_for_each_entry(entry, &async_pending, list)
+		if (entry->running == running)
+			return entry->cookie;

-	return ret;
+	return next_cookie;	/* "infinity" value */
 }

 static async_cookie_t  lowest_in_progress(struct list_head *running)
2009/6/8 Jaswinder Singh Rajput <[email protected]>:
> Either reverting your patch or returning early like this also fixes
> Chris's problem:
>
> diff --git a/kernel/async.c b/kernel/async.c
> index 94dd36f..3b492cb 100644
> --- a/kernel/async.c
> +++ b/kernel/async.c
> @@ -96,15 +96,13 @@ static async_cookie_t  __lowest_in_progress(struct list_head *running)
>  	if (!list_empty(running)) {
>  		entry = list_first_entry(running,
>  			struct async_entry, list);
> -		ret = entry->cookie;
> +		return entry->cookie;
>  	}
>
>  	if (!list_empty(&async_pending)) {
>  		list_for_each_entry(entry, &async_pending, list)
> -			if (entry->running == running) {
> -				ret = entry->cookie;
> -				break;
> -			}
> +			if (entry->running == running)
> +				return entry->cookie;
>  	}
>
>  	return ret;
>
>
>
I can confirm that a kernel built with Jaswinder's patch applied
survived 200 boots without a panic.
Chris
On Mon, 2009-06-08 at 10:42 -0700, Linus Torvalds wrote:
>
> On Mon, 8 Jun 2009, James Bottomley wrote:
> >
> > I think it's a bug in the async code. It's providing cookies too high
> > because it doesn't stop after it finds a running entry.
> >
> > Can we try this as the fix?
>
> Ok, this looks likely.
>
> That said, why doesn't that function look like this?
That's probably a better style ... or simply put an if ... else if to
make it exclusive.
James
On Mon, 8 Jun 2009, Chris Clayton wrote:
>
> I can confirm that a kernel built with Jaswinder's patch applied
> survived 200 boots without a panic.
Ok, goodie.
Can you confirm that the further cleanup (removing the pointless 'ret'
variable and the useless empty checking around 'for_each_entry') also
works for you?
Linus
---
 kernel/async.c |   15 +++++----------
 1 files changed, 5 insertions(+), 10 deletions(-)

diff --git a/kernel/async.c b/kernel/async.c
index 5054030..27235f5 100644
--- a/kernel/async.c
+++ b/kernel/async.c
@@ -92,23 +92,18 @@ extern int initcall_debug;
 static async_cookie_t  __lowest_in_progress(struct list_head *running)
 {
 	struct async_entry *entry;
-	async_cookie_t ret = next_cookie; /* begin with "infinity" value */

 	if (!list_empty(running)) {
 		entry = list_first_entry(running,
 			struct async_entry, list);
-		ret = entry->cookie;
+		return entry->cookie;
 	}

-	if (!list_empty(&async_pending)) {
-		list_for_each_entry(entry, &async_pending, list)
-			if (entry->running == running) {
-				ret = entry->cookie;
-				break;
-			}
-	}
+	list_for_each_entry(entry, &async_pending, list)
+		if (entry->running == running)
+			return entry->cookie;

-	return ret;
+	return next_cookie;	/* "infinity" value */
 }

 static async_cookie_t  lowest_in_progress(struct list_head *running)
2009/6/8 James Bottomley <[email protected]>:
> I think it's a bug in the async code.  It's providing cookies too high
> because it doesn't stop after it finds a running entry.
>
> Can we try this as the fix?
>
> James
>
> ---
>
> diff --git a/kernel/async.c b/kernel/async.c
> index 5054030..e4909ee 100644
> --- a/kernel/async.c
> +++ b/kernel/async.c
> @@ -97,7 +97,7 @@ static async_cookie_t  __lowest_in_progress(struct list_head *running)
>  	if (!list_empty(running)) {
>  		entry = list_first_entry(running,
>  			struct async_entry, list);
> -		ret = entry->cookie;
> +		return entry->cookie;
>  	}
>
>  	if (!list_empty(&async_pending)) {
>
I can also confirm that a kernel with this patch applied has withstood
the 100-boot torture. I'll try Linus's version now and report back
asap.
Chris
Linus,
2009/6/8 Linus Torvalds <[email protected]>:
>
>
> On Mon, 8 Jun 2009, Chris Clayton wrote:
>>
>> I can confirm that a kernel built with Jaswinder's patch applied
>> survived 200 boots without a panic.
>
> Ok, goodie.
>
> Can you confirm that the further cleanup (removing the pointless 'ret'
> variable and the useless empty checking around 'for_each_entry') also
> works for you?
>
>                 Linus
>
> ---
>  kernel/async.c |   15 +++++----------
>  1 files changed, 5 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/async.c b/kernel/async.c
> index 5054030..27235f5 100644
> --- a/kernel/async.c
> +++ b/kernel/async.c
> @@ -92,23 +92,18 @@ extern int initcall_debug;
>  static async_cookie_t  __lowest_in_progress(struct list_head *running)
>  {
>  	struct async_entry *entry;
> -	async_cookie_t ret = next_cookie; /* begin with "infinity" value */
>
>  	if (!list_empty(running)) {
>  		entry = list_first_entry(running,
>  			struct async_entry, list);
> -		ret = entry->cookie;
> +		return entry->cookie;
>  	}
>
> -	if (!list_empty(&async_pending)) {
> -		list_for_each_entry(entry, &async_pending, list)
> -			if (entry->running == running) {
> -				ret = entry->cookie;
> -				break;
> -			}
> -	}
> +	list_for_each_entry(entry, &async_pending, list)
> +		if (entry->running == running)
> +			return entry->cookie;
>
> -	return ret;
> +	return next_cookie;	/* "infinity" value */
>  }
>
>  static async_cookie_t  lowest_in_progress(struct list_head *running)
>
Yes, rc8-git5 with your patch applied has booted 100 times without a panic.
May I add that the people who thought of, designed and implemented
kexec should have large and shiny medals pinned to their chests.
Well over 1000 kernel boots have been executed on my PC today and, if
I hadn't been able to do that automatically with a few lines of script
at the head of /etc/rc.d/rc.local, I would have been bleary-eyed
before noon :-)
Chris
On Monday 08 June 2009, Chris Clayton wrote:
> Yes, rc8-git5 with your patch applied has booted 100 times without a panic.
>
...so I should have added:
Tested-by: Chris Clayton <[email protected]>