LinuxLists.cc - Faster reboots (and a better way of taking crashdumps?)

Subject: Re: Faster reboots (and a better way of taking crashdumps?)

> for 2, the BIOS sets the hardware to a known state,
> or if you can trigger *the* hardware reset line,
> which will also do that, then you're going through
> the BIOS again. Now if you made your own bios...
> see http://www.linuxbios.org.

I need to avoid going through the BIOS ... this is a
multiquad NUMA machine, and it doesn't take kindly
to the reboot through the BIOS for various reasons.
It also takes about 4 minutes, which is a pain ;-)

I have source code access to our BIOS if I really wanted,
I just want to avoid modifying it if possible.

> there are patches where a kernel can load another
> kernel, also.

Hmmm ... sounds interesting ... any pointers?

> As for taking crashdumps on the way up, I believe
> (SGI's ?) linux kernel crash dumps does *exactly*
> this.

I was under the impression that most BIOSes reset
memory on reboot, so this was impossible during a
BIOS reboot?

M.

2002-04-06 02:56:10

by Jeremy Jackson

Subject: Re: Faster reboots (and a better way of taking crashdumps?)

----- Original Message -----
From: "Martin J. Bligh" <[email protected]>
Sent: Friday, April 05, 2002 5:48 PM

> I need to avoid going through the BIOS ... this is a
> multiquad NUMA machine, and it doesn't take kindly
> to the reboot through the BIOS for various reasons.
> It also takes about 4 minutes, which is a pain ;-)
>
> I have source code access to our BIOS if I really wanted,
> I just want to avoid modifying it if possible.

well keep in mind that the fastest LinuxBIOS boot is 3 seconds...
a large part of the boot time on most PCs is the BIOS setting up
DOS support and painting silly logos on the screen, all of which
can go away. I'm guessing your NUMA system has a bit more
to do at this stage due to the hardware, but still...

>
> > there are patches where a kernel can load another
> > kernel, also.
>
> Hmmm ... sounds interesting ... any pointers?

LOBOS,

http://www.acl.lanl.gov/linuxbios/papers/index.html
http://www.usenix.org/publications/library/proceedings/als2000/minnichLOBOS.html

two kernel monte,

http://www.scyld.com/products/beowulf/software/monte.html

There's also suspend to disk support, which is closely related.
Kind of a restartable crash dump, without the crash:

http://falcon.sch.bme.hu/~seasons/linux/swsusp.html

>
> > As for taking crashdumps on the way up, I believe
> > (SGI's ?) linux kernel crash dumps does *exactly*
> > this.
>
> I was under the impression that most BIOSes reset
> memory on reboot, so this was impossible during a
> BIOS reboot?

oss.sgi.com seems to be down today... but iirc,
it doesn't boot through bios, but stashes some critical state
in a buffer previously reserved, uses one of the above methods
to boot a new kernel, lets this kernel do the dump, then boots
through the bios to make sure hardware is completely restored
after the crash. I'm sure it could be tailored to suit though.

I'm currently researching combining the two, to create a LinuxBIOS
firmware debug console, which will allow a complete crash dump to
be taken after a hardware reset, with the smallest possible Heisenberg
effect, aside from a hardware debugger.

Basically when the kernel panics, it dumps the CPU registers and
resets the CPU. The firmware console makes no alterations to
the state of the hardware, instead running a modified kgdb stub
like routine, possibly without even touching RAM. Also, if an SMI
button is available, this can be used as a hardware break switch,
allowing panics to use an even less invasive HLT instruction.

Jeremy

2002-04-06 17:09:56

by Byron Stanoszek

Subject: Re: Faster reboots - calling _start?

On Fri, 5 Apr 2002, Jeremy Jackson wrote:

> ----- Original Message -----
> From: "Martin J. Bligh" <[email protected]>
> Sent: Friday, April 05, 2002 5:48 PM
>
> > I need to avoid going through the BIOS ... this is a
> > multiquad NUMA machine, and it doesn't take kindly
> > to the reboot through the BIOS for various reasons.
> > It also takes about 4 minutes, which is a pain ;-)
> >
> > I have source code access to our BIOS if I really wanted,
> > I just want to avoid modifying it if possible.
>
> well keep in mind that the fastest LinuxBIOS boot is 3 seconds...
> a large part of the boot time on most PCs is the BIOS setting up
> DOS support and painting silly logos on the screen, all of which
> can go away. I'm guessing your NUMA system has a bit more
> to do at this stage due to the hardware, but still...

Wouldn't it be easier to just ljmp to the start address of the kernel in memory
(the address after the bootloader has done its thing), effectively restarting
the kernel from line 1? Or is there an issue with some hardware being in an
invalid state when doing this?

Maybe Eric Biederman can comment on this since he's adding new functionality to
the boot loader..

-Byron

--
Byron Stanoszek Ph: (330) 644-3059
Systems Programmer Fax: (330) 644-8110
Commercial Timesharing Inc. Email: [email protected]

2002-04-06 17:39:37

by Alan

Subject: Re: Faster reboots (and a better way of taking crashdumps?)

> What's to stop me rebooting by having machine_restart load
> the first sector of the first disk (as the BIOS does), where
> the LILO code should be, and just jumping to it?

In theory nothing

> 1. Are there tables that are created by the BIOS that we
> destroy during Linux runtime? mps tables spring to mind -
> I can't see where we preserve them ...

They should be in E820 reserved pages anyway and we do keep them and the
EBDA safe. You will, however, have blown away ACPI pages marked as disposable.
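
A minimal sketch of the E820 point, assuming the firmware map is available as a plain array. The type numbers are the standard E820 ones (1 usable RAM, 2 reserved, 3 ACPI reclaimable, 4 ACPI NVS); the struct layout and the helper name e820_type_of are made up for illustration and are not the kernel's own code.

```c
#include <stddef.h>
#include <stdint.h>

/* Standard E820 region types. */
enum {
    E820_RAM      = 1,  /* usable memory */
    E820_RESERVED = 2,  /* firmware-owned; the OS leaves it alone */
    E820_ACPI     = 3,  /* ACPI reclaimable ("disposable" once parsed) */
    E820_NVS      = 4   /* ACPI non-volatile storage */
};

/* Illustrative layout only (an assumption, not the kernel structure). */
struct e820_entry {
    uint64_t addr;  /* start of the region */
    uint64_t size;  /* length of the region in bytes */
    uint32_t type;  /* one of the E820_* values above */
};

/*
 * Hypothetical helper: return the type of the single map entry that
 * fully contains [addr, addr + len), or 0 if no entry contains it.
 */
static uint32_t e820_type_of(const struct e820_entry *map, size_t n,
                             uint64_t addr, uint64_t len)
{
    for (size_t i = 0; i < n; i++) {
        uint64_t start = map[i].addr;
        uint64_t end = start + map[i].size;

        /* Written this way so addr + len cannot overflow. */
        if (addr >= start && addr <= end && len <= end - addr)
            return map[i].type;
    }
    return 0;
}
```

A table that this reports as type 2 is in the "kept safe" category; one reported as type 3 is the case the ACPI warning is about.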

> 2. Things that are reset by reboot that we don't reset during
> normal kernel boot?

Possibly. I wouldn't like to hand control back to the BIOS but the kernel
ought to be ok with itself.

Alan

2002-04-06 20:07:38

Subject: Re: Faster reboots - calling _start?

Byron Stanoszek <[email protected]> writes:

> On Fri, 5 Apr 2002, Jeremy Jackson wrote:
>
> > ----- Original Message -----
> > From: "Martin J. Bligh" <[email protected]>
> > Sent: Friday, April 05, 2002 5:48 PM
> >
> > > I need to avoid going through the BIOS ... this is a
> > > multiquad NUMA machine, and it doesn't take kindly
> > > to the reboot through the BIOS for various reasons.
> > > It also takes about 4 minutes, which is a pain ;-)
> > >
> > > I have source code access to our BIOS if I really wanted,
> > > I just want to avoid modifying it if possible.
> >
> > well keep in mind that the fastest LinuxBIOS boot is 3 seconds...
[ to login prompt ]

> > a large part of the boot time on most PCs is the BIOS setting up
> > DOS support and painting silly logos on the screen, all of which
> > can go away. I'm guessing your NUMA system has a bit more
> > to do at this stage due to the hardware, but still...

Especially given that I can load the kernel on a dual P4 xeon system in
3 seconds from power on. But traditional BIOSes are slow, and getting
slower for unknown reasons.

> Wouldn't it be easier to just ljmp to the start address of the kernel in memory
> (the address after the bootloader has done its thing), effectively restarting
> the kernel from line 1? Or is there an issue with some hardware being in an
> invalid state when doing this?
>
> Maybe Eric Biederman can comment on this since he's adding new functionality to
> the boot loader..

I've also written the patch that allows this.

In the general case where you want to boot another version of the kernel,
you must rerun the BIOS query code, in case your new kernel can handle your
hardware better than the previous version.

There are bad cases where the kernel can leave the hardware in a state
that either confuses the BIOS or confuses the drivers when they load.
This is rare and being worked on. But it is an independent problem.

http://download.lnxi.com/pub/src/linux-kernel-patches/kexec/linux-2.5.7.kexec.diff
http://download.lnxi.com/pub/src/mkelfImage/elfboottools-2.0.tar.gz

And despite the name I can kexec plain bzImages, now.

Eric

2002-04-06 21:36:38

Subject: Re: Faster reboots - calling _start?

> Wouldn't it be easier to just ljmp to the start address of the kernel in
> memory (the address after the bootloader has done its thing), effectively
> restarting the kernel from line 1? Or is there an issue with some
> hardware being in an invalid state when doing this?

Two issues with that:

1. I want to be able to boot a different kernel on reboot - this
is a development machine.

2. I believe we free all the __init stuff around the end of
start_kernel, so the initial functions and data just aren't
there any more ... of course that could be changed, but it's
both a more major change than I really want to do, and it still
doesn't solve (1) ;-)

M.

2002-04-06 22:04:11

Subject: Re: Faster reboots - calling _start?

"Martin J. Bligh" <[email protected]> writes:

> > Wouldn't it be easier to just ljmp to the start address of the kernel in
> > memory (the address after the bootloader has done its thing), effectively
> > restarting the kernel from line 1? Or is there an issue with some
> > hardware being in an invalid state when doing this?
>
> Two issues with that:
>
> 1. I want to be able to boot a different kernel on reboot - this
> is a development machine.
>
> 2. I believe we free all the __init stuff around the end of
> start_kernel, so the initial functions and data just aren't
> there any more ... of course that could be changed, but it's
> both a more major change than I really want to do, and it still
> doesn't solve (1) ;-)

Seriously, check out my code; it should just work unless there are special
APIC shutdown rules for NUMAQ machines.

Eric

2002-04-06 23:34:18

Subject: Re: Faster reboots (and a better way of taking crashdumps?)

"Martin J. Bligh" <[email protected]> writes:

> My real motivation for this isn't actually faster reboots,
> it's rebooting at all - I have some strange hardware that
> won't do init 6 in traditional ways ... but it might mean
> a faster reboot for others.
>
> What's to stop me rebooting by having machine_restart load
> the first sector of the first disk (as the BIOS does), where
> the LILO code should be, and just jumping to it?

Be very careful with loading a boot sector. The problem is
that lilo will ask the BIOS to drive the disk, and the disk
is almost certainly in a different state than when the BIOS left it,
and the BIOS hasn't been given a reset state command. Without letting
the BIOS know you did something strange you are going out and looking
for trouble.

But if you can load a boot sector you can just about as easily load
the whole kernel, which on startup will only ask the BIOS for hardware
information and not ask it to drive the hardware (which should be safe).

> 1. Are there tables that are created by the BIOS that we
> destroy during Linux runtime? mps tables spring to mind -
> I can't see where we preserve them ...

Generally MPS tables are in regions of memory that we preserve anyway.

> 2. Things that are reset by reboot that we don't reset during
> normal kernel boot?

A sane BIOS will toggle the board level reset line on reboot.
Not all of them do, but that makes it look like a fresh boot, with
a negligible speed penalty.

> As a side effect, this means we could potentially take
> crashdumps on the way up, rather than the way down, so
> the kernel is more likely to be in a working state (we'd
> have to load a minimal kernel / crashdumper to take the
> dump first ... this is similar to what we did with PTX).

I guess if you were careful you could. The fact that you can't rely
on the BIOS to drive the hardware is significant though.

Eric

2002-04-07 01:31:38

Subject: Re: Faster reboots (and a better way of taking crashdumps?)

> Be very careful with loading a boot sector. The problem is
> that lilo will ask the BIOS to drive the disk, and the disk
> is almost certainly in a different state than when the BIOS left it,
> and the BIOS hasn't been given a reset state command. Without letting
> the BIOS know you did something strange you are going out and looking
> for trouble.

Good point. I would need to jump back to 16 bit mode (which there's
already code to do) and reset the disk.

> But if you can load a boot sector you can just about as easily load
> the whole kernel, which on startup will only ask the BIOS hardware
> information and not to drive the hardware (which should be safe).

Mmmm ... I'd have to reimplement most of the bootloader in order to
read the mapped blocks off disk and duplicate the use of lilo.conf,
etc to make it usable ... not too attractive an option ... I think
it'd be easier to reutilise what we have already.

>> 2. Things that are reset by reboot that we don't reset during
>> normal kernel boot?
>
> A sane BIOS will toggle the board level reset line on reboot.
> Not all of them do, but that makes it look like a fresh boot, with
> a negligible speed penalty.

I know that, but what I mean is that I'm *not* going to get
this reset if I just jump back to the init point ... I was
trying to work out what kind of trouble that would cause.

> Seriously check out my code it should just work unless there are

OK, I took a very brief scan of just the descriptions of your
patches - looks like the main thing you're doing is creating
a 32 bit kernel entry point, right? So above and beyond that
I'd have to rework the LILO code to work in 32 bit, which
probably isn't that hard now I think about it ... all the hard
stuff is actually done by the command line binary, so maybe ...

> special apic shutdown rules for NUMAQ machines.

The APICs should be OK ... the interconnect firmware sets them
up, and Linux never changes them, so everything *should* be OK
in theory. Of course if it ever gets screwed up, reboot won't
fix it, but then I can't reboot at all right now, so ... ;-)

On the other hand, I don't reset the processors fully (I have
to use NMI to boot rather than the INIT, INIT, STARTUP sequence),
which seems to be asking for trouble ;-(

M.

2002-04-07 01:35:15

Subject: Re: Faster reboots (and a better way of taking crashdumps?)

>> 1. Are there tables that are created by the BIOS that we
>> destroy during Linux runtime? mps tables spring to mind -
>> I can't see where we preserve them ...
>
> They should be in E820 reserved pages anyway and we do keep them
> and the EBDA safe.

Ah, OK. I will have to check the BIOS is doing this correctly,
since I hacked it to move the MPS tables to a different place
(below 8Mb). I should really fix that using a fixmap or something
anyway ...

> You will however have blown away ACPI pages marked as disposable

Pah, ACPI ;-) I don't have ACPI on these machines, but it would
be needed for a more general solution - sounds easy enough to fix
anyway - we just keep them and mark them reserved during the Linux
ACPI parse, I think.

Thanks,

M.

2002-04-07 02:56:45

Subject: Re: Faster reboots (and a better way of taking crashdumps?)

"Martin J. Bligh" <[email protected]> writes:

> >> 2. Things that are reset by reboot that we don't reset during
> >> normal kernel boot?
> >
> > A sane BIOS will toggle the board level reset line on reboot.
> > Not all of them do, but that makes it look like a fresh boot, with
> > a negligible speed penalty.
>
> I know that, but what I mean is that I'm *not* going to get
> this reset if I just jump back to the init point ... I was
> trying to work out what kind of trouble that would cause.

I didn't mean to obfuscate the issue. There is some trouble you can
get into, but most of it is when asking the BIOS to do things for
you. Beyond that there are a few rare devices that the kernel doesn't
reset right. Turning off bus master is the primary issue. There
is work in the linux power management infrastructure that will handle
most of this long term.

> > Seriously check out my code it should just work unless there are
>
> OK, I took a very brief scan of just the descriptions of your
> patches - looks like the main thing you're doing is creating
> a 32 bit kernel entry point, right? So above and beyond that
> I'd have to rework the LILO code to work in 32 bit, which
> probably isn't that hard now I think about it ... all the hard
> stuff is actually done by the command line binary, so maybe ...

Sorry wrong set of patches. I'm not pushing my kexec stuff quite as
hard. It is a little lower on my priority list. Basically the work
is to allow Linux to boot Linux directly.

ftp://download.lnxi.com/pub/src/linux-kernel-patches/linux-2.5.7.kexec.diff
ftp://download.lnxi.com/pub/src/linux-kernel-patches/kexec-2.5.7.kexec.log
ftp://download.lnxi.com/pub/src/mkelfImage/elfboottools-2.0.tar.gz
Type make and see objdir/build/sbin/kexec (works with bzImages).

The basic kernel interface that is added is:

struct segment {
        void *buffer;
        void *dest_addr;
        size_t len;
};
int kexec(void *start, int nr_segments, struct segment *segments);

This appears to be an interface I can support on every Linux platform.
It works on x86, and I have a proof of concept alpha port.

The initial transfer of control happens on the bootstrap processor, in
32bit protected mode with paging disabled. All of the segment
registers are set to a flat 32bit segment with a base address of zero.
And %esp points to an area that is good for a small stack.
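
A minimal sketch of how a loader might prepare the segment list for this interface, assuming buffer is where the data currently sits and dest_addr is where it should land before control is transferred (my reading of the field names, not something confirmed here). The helper segments_disjoint is hypothetical, and nothing below actually calls kexec.

```c
#include <stddef.h>
#include <stdint.h>

/* Mirrors the struct from the interface above. */
struct segment {
    void *buffer;     /* assumed: current location of the data */
    void *dest_addr;  /* assumed: where the data should end up */
    size_t len;       /* number of bytes in this segment */
};

/*
 * Hypothetical sanity check a loader could run before handing the
 * list over: returns 1 if no two destination ranges overlap, else 0.
 */
static int segments_disjoint(const struct segment *segs, int nr_segments)
{
    for (int i = 0; i < nr_segments; i++) {
        uintptr_t a0 = (uintptr_t)segs[i].dest_addr;
        uintptr_t a1 = a0 + segs[i].len;

        for (int j = i + 1; j < nr_segments; j++) {
            uintptr_t b0 = (uintptr_t)segs[j].dest_addr;
            uintptr_t b1 = b0 + segs[j].len;

            /* Two half-open ranges overlap if each starts before the other ends. */
            if (a0 < b1 && b0 < a1)
                return 0;
        }
    }
    return 1;
}
```

A loader would fill one entry per piece of the new image (kernel text, parameters and so on), run a check like this, and only then make the call with the entry point as start.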

The rest is left up to the controlling program. The major discussion
with this happened a while ago. There was a basic agreement on what
the system call interface should look like, but it really hasn't
progressed much since then. There are not a lot of booting experts I
can bang heads with :)

> > special apic shutdown rules for NUMAQ machines.
>
> The APICs should be OK ... the interconnect firmware sets them
> up, and Linux never changes them, so everything *should* be OK
> in theory. Of course if it ever gets screwed up, reboot won't
> fix it, but then I can't reboot at all right now, so ... ;-)

I have code in there to fix things up for the SMP case; you might want
to double check that doesn't mess up the NUMAQ case.

> On the other hand, I don't reset the processors fully (I have
> to use NMI to boot rather than the INIT, INIT, STARTUP sequence),
> which seems to be asking for trouble ;-(

Well, even the INIT assert, INIT deassert, STARTUP sequence doesn't
reset the processors fully. Things like the MTRRs remain set up, so
I wouldn't worry about it too much, assuming the NMI can reliably
get the processor to a kernel entry point.

Eric

2002-04-07 04:00:10

Subject: Re: Faster reboots (and a better way of taking crashdumps?)

> Sorry wrong set of patches. I'm not pushing my kexec stuff quite as
> hard. It is a little lower on my priority list. Basically the work
> is to allow Linux to boot Linux directly.

Looks great - thanks. I'll try this out next week sometime. 2.5.7
doesn't work on these boxes for some reason we haven't debugged yet,
so I'll try this on 2.4 ... if that doesn't work, we'll have to get
2.5 working first ;-)

M.

2002-04-07 04:19:29

Subject: Re: Faster reboots (and a better way of taking crashdumps?)

"Martin J. Bligh" <[email protected]> writes:

> > Sorry wrong set of patches. I'm not pushing my kexec stuff quite as
> > hard. It is a little lower on my priority list. Basically the work
> > is to allow Linux to boot Linux directly.
>
> Looks great - thanks. I'll try this out next week sometime. 2.5.7
> doesn't work on these boxes for some reason we haven't debugged yet,
> so I'll try this on 2.4 ... if that doesn't work, we'll have to get
> 2.5 working first ;-)

2.4 should work. I'm just trying to get this all in the kernel
and 2.5 is the more appropriate location to work against.

Eric

2002-04-08 07:19:40

Subject: Re: Faster reboots (and a better way of taking crashdumps?)

Jeremy Jackson wrote:
>
> ----- Original Message -----
> From: "Martin J. Bligh" <[email protected]>
> Sent: Friday, April 05, 2002 5:48 PM
>
> > I need to avoid going through the BIOS ... this is a
> > multiquad NUMA machine, and it doesn't take kindly
> > to the reboot through the BIOS for various reasons.
> > It also takes about 4 minutes, which is a pain ;-)
> >
> > I have source code access to our BIOS if I really wanted,
> > I just want to avoid modifying it if possible.
>
> well keep in mind that the fastest LinuxBIOS boot is 3 seconds...
> a large part of the boot time on most PCs is the BIOS setting up
> DOS support and painting silly logos on the screen, all of which
> can go away. I'm guessing your NUMA system has a bit more
> to do at this stage due to the hardware, but still...
>
> >
> > > there are patches where a kernel can load another
> > > kernel, also.
> >
> > Hmmm ... sounds interesting ... any pointers?
>
> LOBOS,
>
> http://www.acl.lanl.gov/linuxbios/papers/index.html
> http://www.usenix.org/publications/library/proceedings/als2000/minnichLOBOS.html
>
> two kernel monte,
>
> http://www.scyld.com/products/beowulf/software/monte.html
>
> There's also suspend to disk support, which is closely related.
> Kind of a restartable crash dump, without the crash:
>
> http://falcon.sch.bme.hu/~seasons/linux/swsusp.html
>
> >
> > > As for taking crashdumps on the way up, I believe
> > > (SGI's ?) linux kernel crash dumps does *exactly*
> > > this.
> >
> > I was under the impression that most BIOSes reset
> > memory on reboot, so this was impossible during a
> > BIOS reboot?
>
> oss.sgi.com seems to be down today... but iirc,
> it doesn't boot through bios, but stashes some critical state
> in a buffer previously reserved, uses one of the above methods
> to boot a new kernel, lets this kernel do the dump, then boots
> through the bios to make sure hardware is completely restored
> after the crash. I'm sure it could be tailored to suit though.

What you just described is the way Mission Critical's
crash dump facility implements this. It makes use of bootimg,
Werner Almesberger's implementation of linux-boots-linux, to
soft reboot a new kernel without going through the BIOS on x86
machines.

SGI lkcd doesn't do it that way today (btw, the lkcd project
has now shifted to sourceforge, so the right place to look at
is lkcd.sourceforge.net), but we are actually working on integrating
this mechanism from Mission Critical's mcore into lkcd, with
some enhancements/improvements.

I'm cc'ing the lkcd-devel mailing list which is where we discuss
such stuff.

>
> I'm currently researching combining the two, to create a LinuxBIOS
> firmware debug console, which will allow complete crash dump to
> be taken after a hardware reset, with the smallest possible Heisenberg
> effect, aside from a hardware debugger.

So how is the actual writeout accomplished? (Via LinuxBIOS?)

>
> Basically when the kernel panics, it dumps the CPU registers and
> resets the CPU. The firmware console makes no alterations to
> the state of the hardware, instead running a modified kgdb stub
> like routine, possibly without even touching RAM. Also, if an SMI
> button is available, this can be used as a hardware break switch,
> allowing panics to use an even less invasive HLT instruction.
>
> Jeremy
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2002-04-08 14:29:56

Subject: Re: Faster reboots (and a better way of taking crashdumps?)

"Eric W. Biederman" wrote:
>
> "Martin J. Bligh" <[email protected]> writes:
>
> > >> 2. Things that are reset by reboot that we don't reset during
> > >> normal kernel boot?
> > >
> > > A sane BIOS will toggle the board level reset line on reboot.
> > > Not all of them do, but that makes it look like a fresh boot, with
> > > a negligible speed penalty.
> >
> > I know that, but what I mean is that I'm *not* going to get
> > this reset if I just jump back to the init point ... I was
> > trying to work out what kind of trouble that would cause.
>
> I didn't mean to obfuscate the issue. There is some trouble you can
> get into, but most of it is when asking the BIOS to do things for
> you. Beyond that there are a few rare devices that the kernel doesn't
> reset right. Turning off bus master is the primary issue. There
> is work in the linux power management infrastructure that will handle
> most of this long term.
>
> > > Seriously check out my code it should just work unless there are
> >
> > OK, I took a very brief scan of just the descriptions of your
> > patches - looks like the main thing you're doing is creating
> > a 32 bit kernel entry point, right? So above and beyond that
> > I'd have to rework the LILO code to work in 32 bit, which
> > probably isn't that hard now I think about it ... all the hard
> > stuff is actually done by the command line binary, so maybe ...
>
> Sorry wrong set of patches. I'm not pushing my kexec stuff quite as
> hard. It is a little lower on my priority list. Basically the work
> is to allow Linux to boot Linux directly.

I have been trying to look through this in terms of how it compares with
alternate projects (bootimg, monte etc). As I mentioned in an earlier
mail, crash dump (mcore) relies on bootimg, and I'm trying to decide if
there could be advantages in using your kexec stuff. My main concern, of
course, is with regard to these BIOS dependent/related issues,
since at the time of a crash dump we may not be in quite a "friendly
state". I guess some of the Linux power management infrastructure or driverfs
should help with sane resets etc. in the long run (I'm not saying it's
straightforward :)). As such, how far does your implementation address
some of this BIOS/h/w state handling better?

BTW, some of your other boot enhancements, like being able to find out
which memory areas were used or overwritten during bootup, sound useful
to me, in being able to estimate the footprint of early boot and
avoiding using those portions of memory for saving any state (because
boot could stomp over them). It's good to be able to do this in a generic
way, rather than have the dump code be aware of the ranges for every
architecture.

>
> ftp://download.lnxi.com/pub/src/linux-kernel-patches/linux-2.5.7.kexec.diff
> ftp://download.lnxi.com/pub/src/linux-kernel-patches/kexec-2.5.7.kexec.log
> ftp://download.lnxi.com/pub/src/mkelfImage/elfboottools-2.0.tar.gz
> type make and see objdir/build/sbin/kexec (work with bzImages)

I don't seem to be able to access these urls.
The patches I downloaded were from
ftp://download.lnxi.com/pub/src/kexec/* (with the same names). Are these
the right ones ? (your last note mentioned those, but you are saying
that these are the wrong set ... so now I'm a little confused)

Is there one single grand rollup patch with all of the function which I
should look through or try out ?

>
> The basic kernel interface that is added is:
>
> struct segment {
> void *buffer;
> void *dest_addr;
> size_t len;
> };
> int kexec(void *start, int nr_segments, struct segment *segments);
>

Yes, it's a good idea to split up the load stage and actual boot/exec
stage. Crash dump needs to have it that way too (the second image
preloaded in advance since we don't want to do any i/o at that point).

> This appears to be an interface I can support on every linux platform.
> it works on x86, and I have a proof of concept alpha port.
>
> The initial transfer of control happens on the bootstrap processor, in
> 32bit protected mode with paging disabled. All of the segment
> registers are set to a flat 32bit segment with a base address of zero.
> And %esp points to an area that is good for a small stack.
>
> The rest is left up to the controlling program. The major discussion
> with this happened a while ago. There was a basic agreement on what
> the system call interface should look like, but it really hasn't
> progressed much since then. There are not a lot of booting experts I
> can bang heads with :)
>
> > > special apic shutdown rules for NUMAQ machines.
> >
> > The APICs should be OK ... the interconnect firmware sets them
> > up, and Linux never changes them, so everything *should* be OK
> > in theory. Of course if it ever gets screwed up, reboot won't
> > fix it, but then I can't reboot at all right now, so ... ;-)
>
> I have code in there to fix things up for the SMP case; you might want
> to double check that doesn't mess up the NUMAQ case.
>
> > On the other hand, I don't reset the processors fully (I have
> > to use NMI to boot rather than the INIT, INIT, STARTUP sequence),
> > which seems to be asking for trouble ;-(
>
> Well, even the INIT assert, INIT deassert, STARTUP sequence doesn't
> reset the processors fully. Things like the MTRRs remain set up, so
> I wouldn't worry about it too much, assuming the NMI can reliably
> get the processor to a kernel entry point.
>
> Eric

2002-04-08 17:16:13

Subject: Re: Faster reboots (and a better way of taking crashdumps?)

Suparna Bhattacharya <[email protected]> writes:

> I have been trying to look through this in terms of how it compares with
> alternate projects (bootimg, monte etc). As I mentioned in an earlier
> mail, crash dump (mcore) relies on bootimg, and I'm trying to decide if
> there could be advantages in using your kexec stuff.

My target is to submit the kexec stuff to Linus. I seem to be the
only one really actively working on it at this time. I believe my
code is the most mature at the moment. The bottom line is the system
call needs to get into the kernel.

With respect to bootimg, there is a strong similarity in how things are
done. The big difference is that the bootimg interface does everything
per page in asking the kernel where to put things, and my kexec call
does everything with extents. This means the kexec data structures
are usually much smaller, plus I rely on odd things like PAGE_SIZE.
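
A small worked sketch of why extents give smaller data structures than a per-page description, assuming the image's destination is given as a sorted list of page frame numbers. The helper count_extents is hypothetical and is not taken from bootimg or from the kexec patch.

```c
#include <stddef.h>

/*
 * Hypothetical helper: given page frame numbers sorted in ascending
 * order, count how many contiguous runs (extents) they form. A
 * per-page interface needs n descriptors; an extent interface needs
 * only as many as this function returns.
 */
static size_t count_extents(const unsigned long *pfn, size_t n)
{
    size_t extents = 0;

    for (size_t i = 0; i < n; i++) {
        /* A new extent starts at the first page and after every gap. */
        if (i == 0 || pfn[i] != pfn[i - 1] + 1)
            extents++;
    }
    return extents;
}
```

For example, pages 16, 17, 18, 19, 64 and 65 need six per-page descriptors but only two extents, and a kernel image loaded as one contiguous block needs a single extent however large it is.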

As for monte, I can boot other things than the Linux kernel. I'm much
better at doing the work than publicizing it, so my variant isn't quite
as well known. That, plus I came late to the game.

> My main concern, of
> course, is with regard to these BIOS dependent/related issues,
> since at the time of a crash dump we may not be in quite a "friendly
> state". I guess some of the Linux power management infrastructure or driverfs
> should help with sane resets etc. in the long run (I'm not saying it's
> straightforward :)). As such, how far does your implementation address
> some of this BIOS/h/w state handling better?

My code works in SMP. I call the reboot notifier.
I probably should run through the pci bus and disable bus masters, but
I don't right now.
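
A minimal sketch of what "disable bus masters" means at the register level, working on an in-memory snapshot of a device's PCI configuration space rather than on real hardware. The offset (0x04) and bit (bit 2) are the standard PCI command register definitions; the helper clear_bus_master is made up for illustration and is not what the patch does.

```c
#include <stdint.h>

#define PCI_COMMAND        0x04    /* config-space offset of the command register */
#define PCI_COMMAND_MASTER 0x0004  /* bit 2: device may act as a bus master */

/*
 * Hypothetical helper: clear the bus master enable bit in a snapshot
 * of config space. The 16-bit command register is stored little-endian.
 */
static void clear_bus_master(uint8_t *cfg)
{
    uint16_t cmd = (uint16_t)(cfg[PCI_COMMAND] | (cfg[PCI_COMMAND + 1] << 8));

    cmd &= (uint16_t)~PCI_COMMAND_MASTER;

    cfg[PCI_COMMAND] = (uint8_t)(cmd & 0xff);
    cfg[PCI_COMMAND + 1] = (uint8_t)(cmd >> 8);
}
```

With that bit cleared a device can no longer start DMA on its own, which is the point of doing it before jumping into a new kernel.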

> BTW, some of your other boot enhancements, like being able to find out
> which memory areas were used or overwritten during bootup, sound useful
> to me, in being able to estimate the footprint of early boot and
> avoiding using those portions of memory for saving any state (because
> boot could stomp over them). It's good to be able to do this in a generic
> way, rather than have the dump code be aware of the ranges for every
> architecture.

That is why I am a fan of ELF kernel images. There is a lot of
reasonable resistance to change in that department but it is fairly
sane.

> > ftp://download.lnxi.com/pub/src/linux-kernel-patches/linux-2.5.7.kexec.diff
> > ftp://download.lnxi.com/pub/src/linux-kernel-patches/kexec-2.5.7.kexec.log
> > ftp://download.lnxi.com/pub/src/mkelfImage/elfboottools-2.0.tar.gz
> > type make and see objdir/build/sbin/kexec (work with bzImages)
>
> I don't seem to be able to access these urls.
> The patches I downloaded were from
> ftp://download.lnxi.com/pub/src/kexec/* (with the same names). Are these
> the right ones ? (your last note mentioned those, but you are saying
> that these are the wrong set ... so now I'm a little confused)

O.k. My directory structure is just too deep; I can't type it straight.
I meant:
ftp://download.lnxi.com/pub/src/linux-kernel-patches/kexec/linux-2.5.7.kexec.diff
ftp://download.lnxi.com/pub/src/linux-kernel-patches/kexec/kexec-2.5.7.kexec.log

And you probably meant:
ftp://downalod.lnxi.com/pub/src/linux-kernel-patches/kexec.

My other code for cleaning up the boot process is in:
ftp://download.lnxi.com/pub/src/linux-kernel-patches/boot/linux-2.5.7.boot.diff

> Is there one single grand rollup patch with all of the function which I
> should look through or try out ?

The kexec.diff (instead of the 3 sub patches) is as close as I have
gotten.

> > The basic kernel interface that is added is:
> >
> > struct segment {
> > void *buffer;
> > void *dest_addr;
> > size_t len;
> > };
> > int kexec(void *start, int nr_segments, struct segment *segments);
> >
>
> Yes, its a good idea to split up the load stage and actual boot/exec
> stage. Crash dump needs to have it that way too (the second image
> preloaded in advance since we don't want to do any i/o at that
> point).

Interesting. After a lot of discussion this was essentially the
interface we all agreed upon. Preloading wasn't what I was thinking
but it works in that sense as well. At least as long as you mlock
buffers.
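
A minimal sketch of the mlock step, assuming the segment buffers stay in user space between the load stage and the exec stage. The helper name lock_segments is made up for illustration; mlock() and munlock() are the standard calls from <sys/mman.h>. The kexec() call itself is not shown, since it only exists with the patch applied.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Laid out like the struct segment proposed in this thread. */
struct segment {
	void *buffer;
	void *dest_addr;
	size_t len;
};

/*
 * Hypothetical helper: pin every source buffer so it cannot be paged out
 * before the exec stage. Returns 0 on success, -1 if any mlock() fails,
 * in which case the locks taken so far are released again.
 */
static int lock_segments(const struct segment *segs, int nr)
{
	assert(nr >= 0);
	for (int i = 0; i < nr; i++) {
		if (mlock(segs[i].buffer, segs[i].len) != 0) {
			while (--i >= 0)
				munlock(segs[i].buffer, segs[i].len);
			return -1;
		}
	}
	return 0;
}
```

Whether mlock() succeeds depends on the caller's privileges and RLIMIT_MEMLOCK, so a real loader would have to handle the -1 case.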

The most important piece left with this is to get it accepted into
the kernel so people can count on a stable system call interface.

Eric

2002-04-09 15:25:08

[permalink] [raw]

Subject: Re: Faster reboots (and a better way of taking crashdumps?)

On Mon, Apr 08, 2002 at 11:09:26AM -0600, Eric W. Biederman wrote:
> Suparna Bhattacharya <[email protected]> writes:
>
> > I have been trying look through this in terms of how it compares with
> > alternate projects (bootimg, monte etc). As I mentioned in an earlier
> > mail, crash dump (mcore) relies on bootimg, and I'm trying to decide if
> > there
> > could be advantages in using your kexec stuff.
>
> My target is to submit the kexec stuff to Linus. I seem to be the
> only one really actively working on it at this time. I believe my
> code is the most mature at the moment. The bottom line is the system
> call needs to get into the kernel.
>
> With respect to bootimg there is a strong similarity in how things are
> done. The big difference is that the bootimg interface does everything
> per page in asking the kernel where to put things, and my kexec call
> does everything with extents. Which means the kexec data structures
> are usually much smaller, plus I rely on odd things like PAGE_SIZE.

OK.

>
> As for monte I can boot other things than the linux kernel. I'm much
> better at doing the work than publicizing it so my variant isn't quite
> as well known. That plus I came late to the game.
>

I'm not sure if I got this right, but unlike bootimg, monte seems
to prefer going through the early real mode setup code (unless one
specifies skip_setup), and also resets the video mode. At first I
thought some of your querybios stuff achieves a similar effect,
but then is that for linux bios ?

>
> > My main concern of
> > course is with regard to these BIOS dependent/related issues
> > since at the time of a crash dump we may not be in quite a "friendly
> > state". Guess some the linux power mgmt infrastructure or driverfs
> > should help with sane resets etc (I'm not saying its straightforward
> > :)).
> > in the long run. As such how far does your implementation address
> > some of this BIOS/h/w state handling better ?
>
> My code works in SMP. I call the reboot notifier.
> I probably should run through the pci bus and disable bus masters, but
> I don't right now.

The crash dump code with bootimg seems to work on smp
Yes, I noticed the reboot notifier part in your code.
Disabling the busmaster might be required (monte seems to do that)

>
> > BTW, some of your other boot enhancements like being able to find out
> > which memory areas were used or overwritten during bootup sound useful
> > to me, in being able to estimate the footprint of early boot and
> > avoiding
> > using those portions of memory for saving any state (because boot could
> > stomp over them). Its good to be able to do this in a generic way,
> > rather
> > than have the dump code be aware of the ranges for every architecture.
>
> That is why I am a fan of ELF kernel images. There is a lot of
> reasonable resistance to change in that department but it is fairly
> sane.
>
> > > ftp://download.lnxi.com/pub/src/linux-kernel-patches/linux-2.5.7.kexec.diff
> > > ftp://download.lnxi.com/pub/src/linux-kernel-patches/kexec-2.5.7.kexec.log
> > > ftp://download.lnxi.com/pub/src/mkelfImage/elfboottools-2.0.tar.gz
> > > type make and see objdir/build/sbin/kexec (work with bzImages)
> >
> > I don't seem to be able to access these urls.
> > The patches I downloaded were from
> > ftp://download.lnxi.com/pub/src/kexec/* (with the same names). Are these
> > the right ones ? (your last note mentioned those, but you are saying
> > that these are the wrong set ... so now I'm a little confused)
>
>
> O.k. My directory structure is just too deep; I can't type it straight.
> I meant:
> ftp://download.lnxi.com/pub/src/linux-kernel-patches/kexec/linux-2.5.7.kexec.diff
> ftp://download.lnxi.com/pub/src/linux-kernel-patches/kexec/kexec-2.5.7.kexec.log
>
> And you probably meant:
> ftp://downalod.lnxi.com/pub/src/linux-kernel-patches/kexec.

Yes indeed (except for the spelling of download above :))
Looks like I can't type straight either :)

>
> My other code for cleaning up the boot process is in:
> ftp://download.lnxi.com/pub/src/linux-kernel-patches/boot/linux-2.5.7.boot.diff

Ok got it.
Have to get over the tools too and give it a shot.

>
> > Is there one single grand rollup patch with all of the function which I
> > should look through or try out ?
>
> The kexec.diff (instead of the 3 sub patches) is as close as I have
> gotten.
>
> > > The basic kernel interface that is added is:
> > >
> > > struct segment {
> > > void *buffer;
> > > void *dest_addr;
> > > size_t len;
> > > };
> > > int kexec(void *start, int nr_segments, struct segment *segments);
> > >
> >
> > Yes, its a good idea to split up the load stage and actual boot/exec
> > stage. Crash dump needs to have it that way too (the second image
> > preloaded in advance since we don't want to do any i/o at that
> > point).
>
> Interesting. After a lot of discussion this was essentially the
> interface we all agreed upon. Preloading wasn't what I was thinking
> but it works in that sense as well. At least as long as you mlock
> buffers.
>
> The most important piece left with this is to get it accepted into
> the kernel so people can count on a stable system call interface.
>
> Eric

2002-04-10 15:47:41

[permalink] [raw]

Subject: Re: Faster reboots (and a better way of taking crashdumps?)

Suparna Bhattacharya <[email protected]> writes:

> On Mon, Apr 08, 2002 at 11:09:26AM -0600, Eric W. Biederman wrote:
> > Suparna Bhattacharya <[email protected]> writes:
> >
> > > I have been trying look through this in terms of how it compares with
> > > alternate projects (bootimg, monte etc). As I mentioned in an earlier
> > > mail, crash dump (mcore) relies on bootimg, and I'm trying to decide if
> > > there
> > > could be advantages in using your kexec stuff.
> >
> > My target is to submit the kexec stuff to Linus. I seem to be the
> > only one really actively working on it at this time. I believe my
> > code is the most mature at the moment. The bottom line is the system
> > call needs to get into the kernel.
> >
> > With respect to bootimg there is a strong similarity in how things are
> > done. The big difference is that the bootimg interface does everything
> > per page in asking the kernel where to put things, and my kexec call
> > does everything with extents. Which means the kexec data structures
> > are usually much smaller, plus I rely on odd things like PAGE_SIZE.
>
> OK.
>
> >
> > As for monte I can boot other things than the linux kernel. I'm much
> > better at doing the work than publicizing it so my variant isn't quite
> > as well known. That plus I came late to the game.
> >
>
> I'm not sure if I got this right, but unlike bootimg, monte seems
> to prefer going through the early real mode setup code (unless one
> specifies skip_setup), and also resets the video mode.

Going through the early setup code is good. I do this as well,
though I have tried the other route as well.

Resetting the video mode like monte does is questionable.

The basic point though is that the monte kernel interface is not set
up to support anything but the linux kernel. The bootimg interface
is fairly general, the user space just happens to be a little
immature.

As for skipping the real mode setup code, I prefer to do that cleanly
when it is needed.

> At first I
> thought some of your querybios stuff achieves a similar effect,
> but then is that for linux bios ?

Yes that is primarily for linuxbios. But that is when it is necessary
to skip the real mode setup. But all you have to do is specify a
mem=xyz line and you also skip the real mode setup, if you feel like
it.

> > > My main concern of
> > > course is with regard to these BIOS dependent/related issues
> > > since at the time of a crash dump we may not be in quite a "friendly
> > > state". Guess some the linux power mgmt infrastructure or driverfs
> > > should help with sane resets etc (I'm not saying its straightforward
> > > :)).
> > > in the long run. As such how far does your implementation address
> > > some of this BIOS/h/w state handling better ?
> >
> > My code works in SMP. I call the reboot notifier.
> > I probably should run through the pci bus and disable bus masters, but
> > I don't right now.
>
> The crash dump code with bootimg seems to work on smp

Unless I missed something the Linux kernel won't work on smp though.
It is a matter of resetting the state of the apics, and ensuring you
are running on the first processor. I don't believe bootimg did/does that.

> Yes, I noticed the reboot notifier part in your code.
> Disabling the busmaster might be required (monte seems to do that)

In general yes. There are some interesting side effects though.
Going through the pci bus and shutting off bus masters is a good
first approximation of what needs to happen.

In general though (a) there are buggy devices that can hang the system
if you treat them incorrectly. (b) Sometimes you need to do more than
just shut down bus master DMA.

Which is why I have for the most part been holding off.
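
For reference, the "first approximation" amounts to clearing the bus master enable bit (bit 2) in each device's 16-bit PCI command register at config-space offset 0x04. A sketch of just that bit manipulation follows; the byte array stands in for real config-space access and the helper name is made up, so this is an illustration rather than what monte or kexec actually do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_COMMAND        0x04 /* config-space offset of the command register */
#define PCI_COMMAND_MASTER 0x4  /* bus master enable bit */

/*
 * Hypothetical helper: clear bus mastering in a copy of a device's
 * config-space header. cfg is a plain byte array used for illustration.
 */
static void disable_bus_master(uint8_t *cfg)
{
	assert(cfg != NULL);

	/* The command register is 16 bits wide and little-endian. */
	uint16_t cmd = (uint16_t)(cfg[PCI_COMMAND] | (cfg[PCI_COMMAND + 1] << 8));

	cmd &= (uint16_t)~PCI_COMMAND_MASTER;

	cfg[PCI_COMMAND] = (uint8_t)(cmd & 0xff);
	cfg[PCI_COMMAND + 1] = (uint8_t)(cmd >> 8);
}
```

This leaves every other command bit alone, which is also why it does not help with cases (a) and (b) above.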

> > And you probably meant:
> > ftp://downalod.lnxi.com/pub/src/linux-kernel-patches/kexec.
>
> Yes indeed (except for the spelling of download above :))
> Looks like I can't type straight either :)

:)

> > My other code for cleaning up the boot process is in:
> >
> ftp://download.lnxi.com/pub/src/linux-kernel-patches/boot/linux-2.5.7.boot.diff
>
>
> Ok got it.
> Have to get over the tools too and give it a shot.

Thanks. Please holler if you have problems. I really need
to look at building a debugging strategy for this code. I have gotten
some failure reports but so far it is hard to track down why any of
it has problems.

Eric

2002-04-10 17:58:25

by Andy Pfiffer

[permalink] [raw]

Subject: Re: Faster reboots (and a better way of taking crashdumps?)

On Wed, 2002-04-10 at 08:40, Eric W. Biederman wrote:

> Unless I missed something the Linux kernel won't work on smp though.
> It is a matter of resetting the state of the apics, and ensuring you
> are running on the first processor. I don't believe bootimg did/does that.
>

The copy of bootimg that I have makes no effort to offline CPU's or
reset the APICs. If there is a newer version, I could not find it.

I have tried 3 different solutions for Linux-reloading-linux
(bootimg, two-kernel monte, and kexec), and none of them fully support
the kinds of enterprise-class systems we (OSDL) care about:

1. multiprocessor x86 (p3, p4, +xeons) with APICs
2. >4GB memory
3. CPU hotplug
4. device hotplug
5. >= 2.5.x kernel

In fact, I have yet to find any variation of linux-loading-linux that
works at all on the 2-way P4-Xeon under my desk or the 8-way P3-Xeon in
the lab. The only system I have ever seen Two Kernel Monte work on here
is a Celeron-based machine in a nearby cube.

Why do we care about this? Rebooting these kinds of systems can take
several minutes, and in my sample of the systems in the lab, ~80% of the
reboot time is spent slogging through the platform's firmware, ~20% of
the time is spent between LILO and login:. 80% of several minutes is
often greater than the allowable annual downtime for some enterprise
systems.
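
To put rough numbers on that, here is a sketch of the arithmetic. The "five nines" (99.999%) availability target and the five-minute reboot are assumptions chosen for illustration, not measurements from the lab; only the 80% firmware share comes from the sample above.

```c
#include <assert.h>

/* Minutes of downtime per 365-day year allowed at a given availability (0..1). */
static double annual_downtime_budget_minutes(double availability)
{
	const double minutes_per_year = 365.0 * 24.0 * 60.0; /* 525600 */

	assert(availability >= 0.0 && availability <= 1.0);
	return (1.0 - availability) * minutes_per_year;
}

/* Minutes of a single reboot spent in firmware, given the firmware share (0..1). */
static double firmware_minutes(double reboot_minutes, double firmware_share)
{
	assert(firmware_share >= 0.0 && firmware_share <= 1.0);
	return reboot_minutes * firmware_share;
}
```

With those assumed inputs the yearly budget is a little over 5 minutes and one reboot spends about 4 of them in firmware; at 99.9999% the budget shrinks to roughly half a minute.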

What about LinuxBIOS? While an attractive solution for many, it is a
long, uphill battle to add support for chipset after chipset, and
motherboard after motherboard.

The >4GB of memory problem is an interesting quirk -- if the
linux-loading-linux implementation assumes that it can perform the final
copy in 32-bit protected mode *without* paging enabled, it won't
reliably work on >4GB systems.

> In general yes. There are some interesting side effects though.
> Going through the pci bus and shutting off bus masters is a good
> first approximation of what needs to happen.
>

The new device model from Pat ([email protected]) is probably the best way
to go here; you'll be able to walk the driver tree and reliably turn off
devices.

For the CPU side of things, the CPU hotplug work looks promising as
well.

Andy

2002-04-11 03:49:25

by Jeremy Jackson

[permalink] [raw]

Subject: Re: Faster reboots (and a better way of taking crashdumps?)

Should this go off lkml?

----- Original Message -----
From: "Suparna Bhattacharya" <[email protected]>
Sent: Monday, April 08, 2002 12:20 AM

> > I'm currently researching combining the two, to create a LinuxBIOS
> > firmware debug console, which will allow complete crash dump to
> > be taken after a hardware reset, with the smallest possible Heisenberg
> > effect, aside from a hardware debugger.
>
> So how is the actual writeout accomplished ?(via LinuxBIOS ?)

well it's just an idea right now. In order to do this from code in rom,
I imagine it would just dump physical memory to a raw partition,
using polling ide drivers in LinuxBIOS. This is probably a step
backwards, compared to modern crash dumps, but it would
allow zero alteration of memory.

It may be possible to do with a standard flash size of 128KiB,
though, which would allow virtually all motherboards to support it.

Jeremy

2002-04-11 13:56:04

[permalink] [raw]

Subject: Re: Faster reboots (and a better way of taking crashdumps?)

On Wed, Apr 10, 2002 at 09:40:44AM -0600, Eric W. Biederman wrote:
> Suparna Bhattacharya <[email protected]> writes:
>
> > On Mon, Apr 08, 2002 at 11:09:26AM -0600, Eric W. Biederman wrote:
> > > Suparna Bhattacharya <[email protected]> writes:
> > >
> > > > I have been trying look through this in terms of how it compares with
> > > > alternate projects (bootimg, monte etc). As I mentioned in an earlier
> > > > mail, crash dump (mcore) relies on bootimg, and I'm trying to decide if
> > > > there
> > > > could be advantages in using your kexec stuff.
> > >
> > > My target is to submit the kexec stuff to Linus. I seem to be the
> > > only one really actively working on it at this time. I believe my
> > > code is the most mature at the moment. The bottom line is the system
> > > call needs to get into the kernel.
> > >
> > > With respect to bootimg there is a strong similarity in how things are
> > > done. The big difference is that the bootimg interface does everything
> > > per page in asking the kernel where to put things, and my kexec call
> > > does everything with extents. Which means the kexec data structures
> > > are usually much smaller, plus I rely on odd things like PAGE_SIZE.
> >
> > OK.
> >
> > >
> > > As for monte I can boot other things than the linux kernel. I'm much
> > > better at doing the work than publicizing it so my variant isn't quite
> > > as well known. That plus I came late to the game.
> > >
> >
> > I'm not sure if I got this right, but unlike bootimg, monte seems
> > to prefer going through the early real mode setup code (unless one
> > specifies skip_setup), and also resets the video mode.
>
> Going through the early setup code is good. I do this as well,
> though I have tried the other route as well.

OK. I hadn't looked at your user space pieces earlier, where you
patch in the code to jump to real-mode etc.
(from my perspective this seems to be the main difference vs bootimg)

>
> Resetting the video mode like monte does is questionable.

Could you explain that a bit ?

>
> The basic point though is that the monte kernel interface is not set
> up to support anything but the linux kernel. The bootimg interface
> is fairly general, the user space just happens to be a little
> immature.

I wasn't really looking at the system call interface as yet. That's
important of course, but I first wanted to understand the actual
boot mechanism and the degree of reliability plus any known
limitations.

The interface which is interesting to me ATM is actually the lower
level in-kernel interface to initiate the boot with a preloaded image
(i.e. after the segments are loaded into a kimage).

>
> As for skipping the real mode setup code, I prefer to do that cleanly
> when it is needed.
>
> > At first I
> > thought some of your querybios stuff achieves a similar effect,
> > but then is that for linux bios ?
>
> Yes that is primarily for linuxbios. But that is when it is necessary
> to skip the real mode setup. But all you have to do is specify a

OK. Now that I have seen your bzImage kexec userland code, I see the
missing link. When I'd looked at the older patches around
the time of your elfboot announcement, I couldn't locate the right
user space pieces, so things weren't clear.

> mem=xyz line and you also skip the real mode setup, if you feel like
> it.
>
> > > > My main concern of
> > > > course is with regard to these BIOS dependent/related issues
> > > > since at the time of a crash dump we may not be in quite a "friendly
> > > > state". Guess some the linux power mgmt infrastructure or driverfs
> > > > should help with sane resets etc (I'm not saying its straightforward
> > > > :)).
> > > > in the long run. As such how far does your implementation address
> > > > some of this BIOS/h/w state handling better ?
> > >
> > > My code works in SMP. I call the reboot notifier.
> > > I probably should run through the pci bus and disable bus masters, but
> > > I don't right now.
> >
> > The crash dump code with bootimg seems to work on smp
>
> Unless I missed something the Linux kernel won't work on smp though.
> It is a matter of resetting the state of the apics, and ensuring you
> are running on the first processor. I don't believe bootimg did/does that.

What I tried out was the MCLX crash dump implementation using bootimg
and that did work on a 2-way. This has some modifications to run on
the boot_cpu, and also to setup the local APIC and program the LVT0
register. (The pure bootimg patch I had was pretty old, so never
tried that out separately).

>
> > Yes, I noticed the reboot notifier part in your code.
> > Disabling the busmaster might be required (monte seems to do that)
>
> In general yes. There are some interesting side effects though.
> Going through the pci bus and shutting off bus masters is a good
> first approximation of what needs to happen.
>
> In general though (a) there are buggy devices that can hang the system
> if you treat them incorrectly. (b) Sometimes you need to do more than
> just shut down bus master DMA.
>
> Which is why I have for the most part been holding off.
>
> > > And you probably meant:
> > > ftp://downalod.lnxi.com/pub/src/linux-kernel-patches/kexec.
> >
> > Yes indeed (except for the spelling of download above :))
> > Looks like I can't type straight either :)
>
> :)
>
> > > My other code for cleaning up the boot process is in:
> > >
> > ftp://download.lnxi.com/pub/src/linux-kernel-patches/boot/linux-2.5.7.boot.diff
> >
> >
> > Ok got it.
> > Have to get over the tools too and give it a shot.
>
> Thanks. Please holler if you have problems. I really need

I ran into some errors when building elfboottools.
EM_486 is reported to be undeclared. I think it must somehow be
picking up the wrong elf.h, but didn't dig around too much
into the makefiles.

> to look at building a debugging strategy for this code. I have gotten
> some failure reports but so far it is hard to track down why any of
> it has problems.
>
> Eric

Regards
Suparna

2002-04-11 14:14:28

[permalink] [raw]

Subject: Re: Faster reboots (and a better way of taking crashdumps?)

On Wed, Apr 10, 2002 at 10:58:42AM -0700, Andy Pfiffer wrote:
> On Wed, 2002-04-10 at 08:40, Eric W. Biederman wrote:
>
> > Unless I missed something the Linux kernel won't work on smp though.
> > It is a matter of resetting the state of the apics, and ensuring you
> > are running on the first processor. I don't believe bootimg did/does that.
> >
>
> The copy of bootimg that I have makes no effort to offline CPU's or
> reset the APICs. If there is a newer version, I could not find it.

Not the old bootimg code that I had found, but the mclx crash dump
code based on bootimg includes these modifications.

Something like this runs on all cpu's as part of the crash code,
where machine_restart calls bootimg directly if configured.

+/*
+ * If we are not the panicking thread, we simply halt. Otherwise,
+ * we take care of calling the reboot code.
+ */
+#ifdef CONFIG_SMP
+	if (!boot_cpu) {
+		stop_this_cpu(NULL);
+		/* NOTREACHED */
+	}
+#endif
+
+	machine_restart(NULL);

And the following code added along the init path:

diff -urN linux-2.4.17-vanilla/init/main.c linux-2.4.17-mcore/init/main.c
--- linux-2.4.17-vanilla/init/main.c Fri Dec 21 12:42:04 2001
+++ linux-2.4.17-mcore/init/main.c Fri Jan 11 11:04:52 2002
@@ -580,6 +593,15 @@

 	kmem_cache_init();
 	sti();
+#if defined(CONFIG_BOOTIMG) && defined(CONFIG_X86_LOCAL_APIC)
+	/* If we don't make sure the APIC is enabled, AND the LVT0
+	 * register is programmed properly, we won't get timer interrupts
+	 */
+	setup_local_APIC();
+
+	value = apic_read(APIC_LVT0);
+	apic_write_around(APIC_LVT0, value & ~APIC_LVT_MASKED);
+#endif
 	calibrate_delay();
 #ifdef CONFIG_BLK_DEV_INITRD
 	if (initrd_start && !initrd_below_start_ok &&

>
> I have tried 3 different solutions for Linux-reloading-linux
> (bootimg, two-kernel monte, and kexec), and none of them fully support
> the kinds of enterprise-class systems we (OSDL) care about:
>
> 1. multiprocessor x86 (p3, p4, +xeons) with APICs
> 2. >4GB memory
> 3. CPU hotplug
> 4. device hotplug
> 5. >= 2.5.x kernel
>
> In fact, I have yet to find any variation of linux-loading-linux that
> works at all on the 2-way P4-Xeon under my desk or the 8-way P3-Xeon in
> the lab. The only system I have ever seen Two Kernel Monte work on here
> is a Celeron-based machine in a nearby cube.
>
> Why do we care about this? Rebooting these kinds of systems can take
> several minutes, and in my sample of the systems in the lab, ~80% of the
> reboot time is spent slogging through the platform's firmware, ~20% of
> the time is spent between LILO and login:. 80% of several minutes is
> often greater than the allowable annual downtime for some enterprise
> systems.
>
> What about LinuxBIOS? While an attractive solution for many, it is a
> long, uphill battle to add support for chipset after chipset, and
> motherboard after motherboard.
>
> The >4GB of memory problem is an interesting quirk -- if the
> linux-loading-linux implementation assumes that it can perform the final
> copy in 32-bit protected mode *without* paging enabled, it won't
> reliably work on >4GB systems.

Isn't the image copied into kernel pages/buffers within allowable ranges
first (when loading the image) ?

>
> > In general yes. There are some interesting side effects though.
> > Going through the pci bus and shutting off bus masters is a good
> > first approximation of what needs to happen.
> >
>
> The new device model from Pat ([email protected]) is probably the best way
> to go here; you'll be able to walk the driver tree and reliably turn off
> devices.

Yes, I had discussed this with him sometime back to understand if his
model would support what we need. Conditions are more stringent or
less reliable in a crash scenario, but still this looks like the
right direction.

>
> For the CPU side of things, the CPU hotplug work looks promising as
> well.
>
> Andy
>

2002-04-11 14:27:22

[permalink] [raw]

Subject: Re: Faster reboots (and a better way of taking crashdumps?)

On Wed, Apr 10, 2002 at 08:47:17PM -0700, Jeremy Jackson wrote:
> Should this go off lkml?

Should be ok to continue - I just wanted to make sure lkcd-devel
is in the loop too, since many of us don't monitor lkml that
closely all the time.

>
> ----- Original Message -----
> From: "Suparna Bhattacharya" <[email protected]>
> Sent: Monday, April 08, 2002 12:20 AM
>
> > > I'm currently researching combining the two, to create a LinuxBIOS
> > > firmware debug console, which will allow complete crash dump to
> > > be taken after a hardware reset, with the smallest possible Heisenberg
> > > effect, aside from a hardware debugger.
> >
> > So how is the actual writeout accomplished ?(via LinuxBIOS ?)
>
> well it's just an idea right now. In order to do this from code in rom,
> I imagine it would just dump physical memory to a raw partition,
> using polling ide drivers in LinuxBIOS. This is probably a step

OK. There have been plans to do this via polling drivers in software,
but guess if you can handle it at the LinuxBIOS level the impact may
be still lower.

> backwards, compared to modern crash dumps, but it would
> allow zero alteration of memory.
>
> It may be possible to do with a standard flash size of 128KiB,
> though, which would allow virtually all motherboards to support it.
>
> Jeremy

2002-04-11 15:15:37

[permalink] [raw]

Subject: Re: Faster reboots (and a better way of taking crashdumps?)

Andy Pfiffer <[email protected]> writes:

> On Wed, 2002-04-10 at 08:40, Eric W. Biederman wrote:
>
> > Unless I missed something the Linux kernel won't work on smp though.
> > It is a matter of resetting the state of the apics, and ensuring you
> > are running on the first processor. I don't believe bootimg did/does that.
> >
>
> The copy of bootimg that I have makes no effort to offline CPU's or
> reset the APICs. If there is a newer version, I could not find it.
>
> I have tried 3 different solutions for Linux-reloading-linux
> (bootimg, two-kernel monte, and kexec), and none of them fully support
> the kinds of enterprise-class systems we (OSDL) care about:
>
> 1. multiprocessor x86 (p3, p4, +xeons) with APICs
> 2. >4GB memory
> 3. CPU hotplug
> 4. device hotplug
> 5. >= 2.5.x kernel

kexec should handle:
1. multiprocessor x86 (p3, p4, +xeons) with APICs
2. >4GB memory
3. >= 2.5.x kernel
4. potentially non-x86.

> In fact, I have yet to find any variation of linux-loading-linux that
> works at all on the 2-way P4-Xeon under my desk or the 8-way P3-Xeon in
> the lab. The only system I have ever seen Two Kernel Monte work on here
> is a Celeron-based machine in a nearby cube.

Interesting. I know it runs on the 2-way P4-Xeon under my
desk. So maybe it is a compiler bug, or some weird firmware case I
don't handle correctly.

> The >4GB of memory problem is an interesting quirk -- if the
> linux-loading-linux implementation assumes that it can perform the final
> copy in 32-bit protected mode *without* paging enabled, it won't
> reliably work on >4GB systems.

Sure it will, if it only allocates the memory from the low 4GB. In fact
my kexec code makes certain to allocate memory from the kernel address
space: get_free_page() in ZONE_NORMAL, which is the low 896MB. So
there shouldn't be a problem. This was done very deliberately so it
would work on these kinds of systems.
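
The constraint can be written down as a simple range check: with paging off in 32-bit protected mode, only physical addresses below 2^32 are reachable, so both source and destination of the final copy have to sit entirely under that limit. The helper below is a sketch with a made-up name, not code from the kexec patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LIMIT_4G (1ULL << 32) /* first physical address NOT reachable unpaged */

/*
 * Hypothetical helper: true if the physical range [addr, addr + len)
 * lies entirely below 4GB. Written to avoid overflow in addr + len.
 */
static bool reachable_without_paging(uint64_t addr, uint64_t len)
{
	assert(len > 0);
	return addr < LIMIT_4G && len <= LIMIT_4G - addr;
}
```

Anything allocated from the low 896MB passes this check trivially, which is the point of allocating from ZONE_NORMAL.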

> > In general yes. There are some interesting side effects though.
> > Going through the pci bus and shutting off bus masters is a good
> > first approximation of what needs to happen.
> >
>
> The new device model from Pat ([email protected]) is probably the best way
> to go here; you'll be able to walk the driver tree and reliably turn off
> devices.

I totally agree. Walking the driver tree is exactly what I want.
Disabling bus masters is just a quick hack to rule out a DMA killing
your linux booting linux.

> For the CPU side of things, the CPU hotplug work looks promising as
> well.

Interesting. So far I haven't seen a system that supports CPU hot
plug, on x86 so I have no clue here.

Eric

2002-04-11 15:42:34

[permalink] [raw]

Subject: Re: Faster reboots (and a better way of taking crashdumps?)

Suparna Bhattacharya <[email protected]> writes:

> OK. I hadn't looked at your user space pieces earlier, where you
> patch in the code to jump to real-mode etc.
> (from my perspective this seems to be the main difference vs bootimg)

I have borrowed techniques from all over. And I will go with whatever
works. In this case I don't see why bootimg wasn't doing that.

> > Resetting the video mode like monte does is questionable.
>
> Could you explain that a bit ?

1) You should have done proper video shutdown before you left the
previous kernel.
2) You might not have a video card.
3) You might not have a working BIOS.
4) It is generally a very good idea to make a division of who controls
what hardware. And after the initial boot and the kernel has
driven the hardware it isn't reliable to give control back to the
BIOS. We may have changed state enough to confuse the BIOS.

Or in summary it is generally unnecessary so why the heck are we doing
it.

> > The basic point though is that the monte kernel interface is not set
> > up to support anything but the linux kernel. The bootimg interface
> > is fairly general, the user space just happens to be a little
> > immature.
>
> I wasn't really looking at the system call interface as yet. That's
> important of course, but I first wanted to understand the actual
> boot mechanism and the degree of reliability plus any known
> limitations.
>
> The interface which is interesting to me ATM is actually the lower
> level in-kernel interface to initiate the boot with a preloaded image
> (i.e after the segments are loaded into a kimage).

O.k. Any comments on that interface?

> > As for skipping the real mode setup code, I prefer to do that cleanly
> > when it is needed.
> >
> > > At first I
> > > thought some of your querybios stuff achieves a similar effect,
> > > but then is that for linux bios ?
> >
> > Yes that is primarily for linuxbios. But that is when it is necessary
> > to skip the real mode setup. But all you have to do is specify a
>
> OK. Now that I have seen your bzImage kexec userland code, I see the
> missing link. When I'd looked at the older patches around
> the time of your elfboot announcement, I couldn't locate the right
> user space pieces, so things weren't clear.

O.k. Given that I keep evolving this, that is understandable.
Sorry about that.

> > Unless I missed something the Linux kernel won't work on smp though.
> > It is a matter of resetting the state of the apics, and ensuring you
> > are running on the first processor. I don't believe bootimg did/does that.
>
> What I tried out was the MCLX crash dump implementation using bootimg
> and that did work on a 2-way. This has some modifications to run on
> the boot_cpu, and also to setup the local APIC and program the LVT0
> register. (The pure bootimg patch I had was pretty old, so never
> tried that out separately).

O.k. cool. I haven't really looked at that.

> I ran into some errors when building elfboottools.
> EM_486 is reported to be undeclared. I think it must somehow be
> picking up the wrong elf.h, but didn't dig around too much
> into the makefiles.

Version 2.0 is an early beta. Some idiot yanked EM_486 and a couple
of other symbols out of elf.h from glibc, even though the ELF spec says
EM_486 at least should be there. In any case that is just a debugging
bit and you can safely disable those. Or do a make -k and compile the
kexec piece, but not the kparam, which isn't really relevant.

You have a newer version of glibc than I do. For the next rev I will
supply my own elf.h. The ABI at least is stable.
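
A minimal sketch of the kind of fallback a private elf.h could carry,
so the build no longer depends on which glibc headers are installed.
The value 6 is assumed here to be the e_machine number historically
assigned to EM_486, and is_ia32_machine() is a hypothetical helper for
illustration only, not a function taken from elfboottools:

```c
#include <elf.h>

/* Assumption: 6 is the e_machine value historically assigned to EM_486.
 * Only define it when the installed elf.h no longer provides it. */
#ifndef EM_486
#define EM_486 6
#endif

/* Hypothetical helper: nonzero if e_machine names a 32-bit x86 target. */
static int is_ia32_machine(unsigned int e_machine)
{
    return e_machine == EM_386 || e_machine == EM_486;
}
```

With a guard like this, a debugging check that mentions EM_486 compiles
against both older and newer headers without being commented out.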

Eric

2002-04-12 14:48:41

Subject: Re: Faster reboots (and a better way of taking crashdumps?)

On Thu, Apr 11, 2002 at 09:35:34AM -0600, Eric W. Biederman wrote:
> Suparna Bhattacharya <[email protected]> writes:
>
>
> > OK. I hadn't looked at your user space pieces earlier, where you
> > patch in the code to jump to real-mode etc.
> > (from my perspective this seems to be the main difference vs bootimg)
>
> I have borrowed techniques from all over. And I will go with whatever
> works. In this case I don't see why bootimg wasn't doing that.
>
> > > Resetting the video mode like monte does is questionable.
> >
> > Could you explain that a bit ?
>
> 1) You should have done proper video shutdown before you left the
> previous kernel.
> 2) You might not have a video card.
> 3) You might not have a working BIOS.
> 4) It is generally a very good idea to make a division of who controls
> what hardware. And after the initial boot and the kernel has
> driven the hardware it isn't reliable to give control back to the
> BIOS. We may have changed state enough to confuse the BIOS.
>
> Or in summary it is generally unnecessary so why the heck are we doing
> it.

OK. The crash with bootimg implementation was known to have occasional
difficulties when boot got triggered (via panic) while in X -- sometimes
having the console messed up, and on some occasions even hangs.
How has your experience been - no problems in kexec'ing from X,
anytime ?

>
> > > The basic point though is that the monte kernel interface is not set
> > > up to support anything but the linux kernel. The bootimg interface
> > > is fairly general, the user space just happens to be a little
> > > immature.
> >
> > I wasn't really looking at the system call interface as yet. That's
> > > important of course, but I first wanted to understand the actual
> > boot mechanism and the degree of reliability plus any known
> > limitations.
> >
> > The interface which is interesting to me ATM is actually the lower
> > > level in-kernel interface to initiate the boot with a preloaded image
> > > (i.e. after the segments are loaded into a kimage).
>
> O.k. Any comments on that interface?

That would be machine_kexec, and the kimage struct, right ?
So far looks ok, though I haven't looked at it critically. One thing
that I require is a way to pass information across boots -
maybe that could be done through command line parameters to the new
kernel.

>
> > > As for skipping the real mode setup code, I prefer to do that cleanly
> > > when it is needed.
> > >
> > > > At first I
> > > > thought some of your querybios stuff achieves a similar effect,
> > > > but then is that for linux bios ?
> > >
> > > Yes that is primarily for linuxbios. But that is when it is necessary
> > > to skip the real mode setup. But all you have to do is specify a
> >
> > OK. Now that I have seen your bzImage kexec userland code, I see the
> > missing link. When I'd looked at the older patches around
> > > the time of our elfboot announcement, I couldn't locate the right
> > user space pieces, so things weren't clear.
>
> O.k. Given that I keep evolving this, that is understandable.
> Sorry about that.
>
> > > Unless I missed something the Linux kernel won't work on smp though.
> > > It is a matter of resetting the state of the apics, and ensuring you
> > > are running on the first processor. I don't believe bootimg did/does that.
> >
> > What I tried out was the MCLX crash dump implementation using bootimg
> > and that did work on a 2-way. This has some modifications to run on
> > the boot_cpu, and also to setup the local APIC and program the LVT0
> > register. (The pure bootimg patch I had was pretty old, so never
> > tried that out separately).
>
> O.k. cool. I haven't really looked at that.
>
> > I ran into some errors when building elfboottools.
> > EM_486 is reported to be undeclared. I think it must somehow be
> > picking up the wrong elf.h, but didn't dig around too much
> > into the makefiles.
>
> Version 2.0 is an early beta. Some idiot yanked EM_486 and a couple
> of other symbols out of elf.h from glibc, even though the ELF spec says
> EM_486 at least should be there. In any case that is just a debugging
> bit and you can safely disable those. Or do a make -k and compile the
> kexec piece, but not the kparam, which isn't really relevant.

I commented out the EM_486 check from the kexec code, and it built
cleanly. I was able to boot a 2-way system with it, though it seemed
to take a while, perhaps more so because I didn't seem to be getting
any of the bootup/startup messages on my console. In one case there were
some INIT respawning messages that came up. Not sure the fact that
I'm using a serial console matters.

Regards
Suparna

2002-04-12 18:06:39

Subject: Re: Faster reboots (and a better way of taking crashdumps?)

Suparna Bhattacharya <[email protected]> writes:

> OK. The crash with bootimg implementation was known to have occasional
> difficulties when boot got triggered (via panic) while in X -- sometimes
> having the console messed up, and on some occasions even hangs.
> How has your experience been - no problems in kexec'ing from X,
> anytime ?

So far I haven't tried it from X. Most of my test machines don't have
video. When working on a good machine you should be able to return
the video to the state you got it.

I'm not quite certain how to handle the crash dump case.

> That would be machine_kexec, and the kimage struct, right ?
> So far looks ok, though I haven't looked at it critically. One thing
> that I require is a way to pass information across boots -
> maybe that could be done through command line parameters to the new
> kernel.

A command line would work. You can load arbitrary segments, so you can
pass anything that is needed from the user space side.
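
As a rough sketch of the command line route, the newly booted side
could pull a key=value item back out of its command line string. The
function name cmdline_get() and the "dumpinfo" key used in the note
below are hypothetical, chosen only for illustration; nothing here is
taken from the kexec patches themselves:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: find "key=value" among the space-separated items
 * of cmdline and copy the value into out.
 * Returns 0 on success, -1 if the key is absent or the value won't fit. */
static int cmdline_get(const char *cmdline, const char *key,
                       char *out, size_t outlen)
{
    size_t klen = strlen(key);
    const char *p = cmdline;

    while (*p) {
        const char *tok;
        size_t tlen;

        /* Skip separators between items. */
        while (*p == ' ' || *p == '\n')
            p++;
        tok = p;
        while (*p && *p != ' ' && *p != '\n')
            p++;
        tlen = (size_t)(p - tok);

        /* Match only a whole key followed by '='. */
        if (tlen > klen && strncmp(tok, key, klen) == 0 && tok[klen] == '=') {
            size_t vlen = tlen - klen - 1;
            if (vlen >= outlen)
                return -1;
            memcpy(out, tok + klen + 1, vlen);
            out[vlen] = '\0';
            return 0;
        }
    }
    return -1;
}
```

For example, given "root=xxx console=ttyS0,9600 dumpinfo=0x1000",
asking for "dumpinfo" yields the string "0x1000". Anything larger than
a few short values would more naturally go in a separate segment.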

> > Version 2.0 is an early beta. Some idiot yanked EM_486 and a couple
> > of other symbols out of elf.h from glibc, even though the ELF spec says
> > EM_486 at least should be there. In any case that is just a debugging
> > bit and you can safely disable those. Or do a make -k and compile the
> > kexec piece, but not the kparam, which isn't really relevant.
>
> I commented out the EM_486 check from the kexec code, and it built
> cleanly. I was able to boot a 2-way system with it, though it seemed
> to take a while, perhaps more so because I didn't seem to be getting
> any of the bootup/startup messages on my console. In one case there were
> some INIT respawning messages that came up. Not sure the fact that
> I'm using a serial console matters.

A serial console is my usual test case, so that shouldn't affect
anything. I'm glad that it worked.

For the speed difference my hunch is perhaps you didn't specify your
normal kernel command line on the command line.

Usually I do something like:

    kexec bzImage root=xxx console=ttyS0,9600 blah, blah, blah.
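
One way to avoid forgetting the usual options is to reuse the command
line of the running kernel. The sketch below only builds the invocation
string in the form shown above; build_kexec_command() is a hypothetical
helper, and reading the text (for instance from /proc/cmdline) is left
to the caller:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format "kexec <image> <cmdline>" into out,
 * dropping trailing newlines and spaces from cmdline.
 * Returns 0 on success, -1 if the result does not fit in out. */
static int build_kexec_command(const char *image, const char *cmdline,
                               char *out, size_t outlen)
{
    size_t len = strlen(cmdline);
    int n;

    /* /proc/cmdline style input ends in a newline; trim it. */
    while (len > 0 && (cmdline[len - 1] == '\n' || cmdline[len - 1] == ' '))
        len--;

    n = snprintf(out, outlen, "kexec %s %.*s", image, (int)len, cmdline);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

Passing "bzImage" and "root=xxx console=ttyS0,9600\n" produces the
same command as the example above, minus the trailing newline.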

Mostly kexec is supposed to be the simple test client instead of a
full fledged interface.

Eric

2002-04-15 10:06:51