2001-04-16 04:48:49

by linas

Subject: lilo + raid + kernel-2.4.x failure to boot


Hi,

Another zinger that I am sending to LKML because I don't know where else to send it ...

I've discovered a deadly combination of kernel & lilo (and raid). This may be a pure
lilo bug, but I assume that the kernel+raid aids & abets the problem...:


I am running kernel-2.4.x. Two ide hard drives, with partitions 1,5,6,7,8 in use.
The partitions on the two drives are mirrored using RAID-1 to create /dev/md1, /dev/md5,
/dev/md6, etc. The root fs is on /dev/md1. Thus, lilo.conf looks like:


boot=/dev/md1
map=/boot/map
install=/boot/boot.b
prompt
timeout=50
linear
default=linux

image=/boot/vmlinuz-2.4.2
        label=linux
        read-only
        root=/dev/md1

For nearly a year, this combo has worked just fine (running 2.3.99 back then).

Just fine, that is, using the redhat-6.2 rpm for LILO, i.e. version
lilo-0.21-15.i386.rpm which reports itself to be:

% /sbin/lilo -V
LILO version 21

Recently, this machine went over to debian-unstable from redhat:

% dpkg -s lilo
Package: lilo
Status: install ok installed
Priority: important
Section: base
Installed-Size: 271
Maintainer: Russell Coker <[email protected]>
Version: 1:21.7-3
Depends: libc6 (>= 2.2.1-2), debconf (>= 0.2.26), logrotate

The debian version of lilo writes a boot sector that hangs hard for the above
kernel+raid+lilo.conf configuration. Specifically, it prints

LIL-

after a reboot and goes no further. Needless to say, recovery was painful. But I was able
to verify that the redhat lilo rpm always functioned correctly, and the debian-unstable dpkg
always hung in this way. Although at one point, during my twisting & turning, I got the
debian lilo to get only as far as LI before hanging. I have no idea what I did differently
to get that as opposed to LIL-.


BTW, I noticed that oddly, every time I ran lilo, and then ran lilo -q -v -v, it reported
different sector numbers for the kernel images. This freaked me out at first, but I came to
accept it as normal: doesn't affect bootability. But is this really w.a.d? (I was
assuming, apparently erroneously, that lilo -q -v -v was reporting the physical location
of the kernel image on the disk; but since the numbers bounce around, that can't be right.
Or is this just weird bios head/cyl/sect math flakiness?)


--linas







2001-04-16 08:16:22

by Andreas Dilger

Subject: Re: lilo + raid + kernel-2.4.x failure to boot

Linas Vepstas writes:
> BTW, I noticed that oddly, every time I ran lilo, and then ran
> lilo -q -v -v, it reported different sector numbers for the kernel
> images. This freaked me out at first, but I came to accept it as normal:
> doesn't affect bootability. But is this really w.a.d? (I was assuming,
> apparently erroneously, that lilo -q -v -v was reporting the physical
> location of the kernel image on the disk; but since the numbers bounce
> around, that can't be right. Or is this just weird bios head/cyl/sect
> math flakiness?)

No, I noticed this behaviour as well. You run "lilo -v 5" once, you get
one set of numbers; you run it a second time, you get another set of
numbers. It repeats every 2 lilo runs. I believe it has something to
do with the map file, and keeping a backup copy thereof.

Sorry, this doesn't help your RAID problem.

Note, can you boot from one of the separate RAID drives with the Debian
LILO directly? Have you tried CHS, LBA32, and linear options to lilo?

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-04-16 18:49:34

by Russell Coker

Subject: Re: lilo + raid + kernel-2.4.x failure to boot

On Monday 16 April 2001 06:47, Linas Vepstas wrote:
> I am running kernel-2.4.x. Two ide hard drives, with partitions 1,5,6,7,8
> in use. The partitions on the two drives are mirrored using RAID-1 to
> create /dev/md1, /dev/md5, /dev/md6, etc. The root fs is on /dev/md1.

What partitions are used to make /dev/md1?

> Thus, lilo.conf looks like:
>
>
> boot=/dev/md1

All my use of lilo and RAID is with boot=/dev/hda.
I guess the above should work if the system is set up to look for a boot
record on /dev/hda1 (or whichever is the name of a part of the RAID-1 mirror)
by having it marked active and having the Debian MBR, the DOS "fdisk /mbr" or
something similar. But why would you want to? Is the aim of this to enable
swapping /dev/hda and /dev/hdc (or whichever drives comprise the RAID-1)
without re-running LILO?

> % dpkg -s lilo
> Package: lilo
> Status: install ok installed
> Priority: important
> Section: base
> Installed-Size: 271
> Maintainer: Russell Coker <[email protected]>
> Version: 1:21.7-3
> Depends: libc6 (>= 2.2.1-2), debconf (>= 0.2.26), logrotate
>
> The debian version of lilo writes a boot sector that hangs hard for the
> above kernel+raid+lilo.conf configuration: specifically:
>
> LIL- after a reboot. Needless to say, recovery was painful. But I
> was able to verify that the redhat lilo rpm always functioned correctly,
> and the debian-unstable dpkg always hung in this way. Although at one
> point, during my twisting & turning, I got the debian lilo to get to only
> LI before hanging. I have no idea of what I did different to get to that
> as opposed to LIL-

From Manual.txt.gz:
  LI    The first stage boot loader was able to load the second stage boot
        loader, but has failed to execute it. This can either be caused by a
        geometry mismatch or by moving /boot/boot.b without running the map
        installer.
  LIL-  The descriptor table is corrupt. This can either be caused by a
        geometry mismatch or by moving /boot/map without running the map
        installer.

The error "LI" is easy to cause. Just do mv /boot/boot.b.backup /boot/boot.b
...

As for "LIL-": are you sure that everything is fine with your geometry?
Maybe your BIOS and the kernel have different ideas about how things are
supposed to be? I imagine that you installed a newer kernel etc. at the time
of your upgrade from Red Hat to Debian, so this could be a partial cause.
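
To make "geometry mismatch" concrete, here is a rough sketch of the
cylinder/head/sector arithmetic involved. It is not taken from the LILO
source. The two geometries (16 heads vs. 255 heads, 63 sectors per track)
and the function names are made up for illustration. It models the case
where an address is turned into CHS form using one geometry and then
interpreted by something that assumes another:

```python
# Sketch only: standard CHS <-> linear sector arithmetic, with two
# hypothetical geometries standing in for "what the kernel reported when
# the map was written" and "what the BIOS assumes at boot time".

def lba_to_chs(lba, heads, sectors_per_track):
    """Convert a linear sector number to (cylinder, head, sector).

    Sectors are 1-based, cylinders and heads are 0-based.
    """
    cylinder, rem = divmod(lba, heads * sectors_per_track)
    head, sector0 = divmod(rem, sectors_per_track)
    return cylinder, head, sector0 + 1


def chs_to_lba(cylinder, head, sector, heads, sectors_per_track):
    """Convert (cylinder, head, sector) back to a linear sector number."""
    return (cylinder * heads + head) * sectors_per_track + (sector - 1)


# Hypothetical values, chosen only to show the effect.
KERNEL_VIEW = (16, 63)   # (heads, sectors per track) seen at install time
BIOS_VIEW = (255, 63)    # (heads, sectors per track) assumed at boot time


def sector_actually_read(lba):
    """Sector reached if CHS is computed with one geometry and read with another."""
    chs = lba_to_chs(lba, *KERNEL_VIEW)
    return chs_to_lba(*chs, *BIOS_VIEW)


if __name__ == "__main__":
    for wanted in (0, 62, 1008, 100000):
        print(wanted, lba_to_chs(wanted, *KERNEL_VIEW), sector_actually_read(wanted))
```

Within the first track the two views agree, so the boot sector itself can
load fine while anything further out on the disk is read from the wrong
place. Whether this is what is happening on the machine in question is a
guess; the sketch only shows why the manual lists geometry as a possible
cause.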

> BTW, I noticed that oddly, every time I ran lilo, and then ran lilo -q -v
> -v, it reported different sector numbers for the kernel images. This
> freaked me out at first, but I came to accept it as normal: doesn't affect
> bootability. But is this really w.a.d? (I was assuming, apparently
> erroneously, that lilo -q -v -v was reporting the physical location of the
> kernel image on the disk; but since the numbers bounce around, that can't
> be right. Or is this just weird bios head/cyl/sect math flakiness?)

I'll leave that for John to answer.

--
http://www.coker.com.au/bonnie++/ Bonnie++ hard drive benchmark
http://www.coker.com.au/postal/ Postal SMTP/POP benchmark
http://www.coker.com.au/projects.html Projects I am working on
http://www.coker.com.au/~russell/ My home page