I have been working on this quietly and now my pieces generally work,
and I am down to dotting the i's, and crossing the t's. As my start
of pushing for inclusion I want to get some feedback (in case I have
missed something), and to give some feedback so people understand what
I am doing.
First the connections.
- I have a patch I am maintaining (kexec) for booting linux and other
OS's from linux. This patch is not x86 specific (it does take
architecture specific code). It takes as input an ELF file.
- I am working on a native LinuxBIOS port of Linux, and LinuxBIOS
takes as an input an ELF formatted kernel.
- I regularly need to network boot with etherboot, and an ELF formatted
kernel can be used natively. bzImage needs help.
- It is a pain when switching platforms (alpha/ia32/xxx) to have to
completely change your toolchain for booting the linux kernel.
- I have a patch that makes the x86 linux kernel natively ELF
bootable.
What a bootable ELF formatted kernel is.
- A list of segments that say load this chunk of file at this physical
address.
- A 32bit entry point (64bit on 64bit platforms).
- Code at that entry point to query from the firmware/BIOS the
information the kernel needs.
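As a rough illustration of the first two points, here is a minimal sketch of how a loader might read the entry point and the "load this chunk at this physical address" list out of a 32-bit little-endian ELF image. It follows the generic ELF32 header layout only; the names `parse_elf32`, `Segment` and `make_demo_image` are made up for this example and are not taken from any of the patches.

```python
import struct
from typing import List, NamedTuple, Tuple

EHDR = struct.Struct("<16sHHIIIIIHHHHHH")  # Elf32_Ehdr, little-endian, 52 bytes
PHDR = struct.Struct("<8I")                # Elf32_Phdr, 32 bytes
PT_LOAD = 1


class Segment(NamedTuple):
    offset: int   # where the chunk starts in the file
    paddr: int    # physical address to load it at
    filesz: int   # bytes to copy from the file
    memsz: int    # bytes occupied in memory (the remainder is zero filled)


def parse_elf32(image: bytes) -> Tuple[int, List[Segment]]:
    """Return (entry point, PT_LOAD segments) of a 32-bit LE ELF image."""
    fields = EHDR.unpack_from(image, 0)
    ident = fields[0]
    if ident[:4] != b"\x7fELF" or ident[4] != 1 or ident[5] != 1:
        raise ValueError("not a 32-bit little-endian ELF image")
    entry, phoff = fields[4], fields[5]
    phentsize, phnum = fields[9], fields[10]
    segments = []
    for i in range(phnum):
        (p_type, p_offset, _vaddr, p_paddr,
         p_filesz, p_memsz, _flags, _align) = PHDR.unpack_from(
            image, phoff + i * phentsize)
        if p_type == PT_LOAD:
            segments.append(Segment(p_offset, p_paddr, p_filesz, p_memsz))
    return entry, segments


def make_demo_image() -> bytes:
    """Build a tiny made-up image: one loadable segment at 1MB."""
    payload = b"\x90\x90\x90\xf4"  # placeholder bytes, never executed here
    ident = b"\x7fELF\x01\x01\x01" + bytes(9)
    ehdr = EHDR.pack(ident, 2, 3, 1, 0x100000, EHDR.size, 0, 0,
                     EHDR.size, PHDR.size, 1, 0, 0, 0)
    phdr = PHDR.pack(PT_LOAD, EHDR.size + PHDR.size, 0x100000, 0x100000,
                     len(payload), 16, 5, 4096)
    return ehdr + phdr + payload


entry, segments = parse_elf32(make_demo_image())
print(hex(entry), segments)
```

The third point (code at the entry point that queries the firmware) lives inside the loaded image and is not something a loader like this needs to know about.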
My next step is to integrate all of my pieces and do some cleanup but
I have functioning code for everything.
An x86 ELF bootable kernel:
ftp://download.lnxi.com/pub/src/linux-kernel-patches/linux-2.4.17.elf_bootable.diff
A patch to boot an arbitrary static ELF executable.
ftp://download.lnxi.com/pub/src/linux-kernel-patches/linux-2.4.17.eb-kexec-apic-lb-mtd2.diff
A kernel fix to do proper SMP shutdown so that you can kexec on a SMP kernel
ftp://download.lnxi.com/pub/src/linux-kernel-patches/linux-2.4.17.eb-apic-lb-mtd2.diff
A kernel patch that implements some minimal LinuxBIOS support.
ftp://download.lnxi.com/pub/src/linux-kernel-patches/linux-2.4.18-pre7.linuxbios.diff
A standalone executable to adapt an existing x86 bzImage to be an ELF
bootable kernel.
ftp://download.lnxi.com/pub/src/mkelfImage/mkelfImage-1.12.tar.gz
A first draft of a specification that goes into detail about how an
ELF image is interpreted, and extensions that can be added so the
bootloader name, the bootloader version, and similar interesting but
functionally unnecessary information can be passed to the loaded
image, and so the bootloader can find out similar kinds of information
about the ELF executable it is loading.
ftp://download.lnxi.com/pub/src/linuxbios/specifications/draft1_elf_boot_proposal.txt
For what it is worth I have gotten fairly positive feedback from both
the Etherboot and the LinuxBIOS communities so far. And etherboot
and memtest86 have both been modified already to be ELF bootable. And
there is current work that gets a long ways with Plan9.
My kexec work is direct competition to two kernel monte, bootimage,
lobos. I have been using it in production for about a year, and I
haven't encountered real problems. The biggest issue I have had is
with the kernel not properly shutting down devices.
In the short term shutting down devices is trivially handled by
umounting filesystems, downing ethernet devices, and calling the
reboot notifier chain. Long term I need to call the module_exit
routines but they need a little sorting out before I can use them
during reboot. In particular calling any module_exit routine that clears
pm_power_off is a no-no.
Also while it should work in most cases any loaded ELF image that
starts using BIOS/firmware services to drive the hardware is on its
own. Putting all devices back into a state that matches the
firmware's cached state is a poorly defined problem. However normal
firmware calls that ask it for the memory size or IRQ routing
information should work correctly.
More on etherboot can be found at:
http://www.etherboot.org
More on LinuxBIOS can be found at:
http://www.linuxbios.org
Eric
"Eric W. Biederman" wrote:
>
> A kernel fix to do proper SMP shutdown so that you can kexec on a SMP kernel
Oh man, you rock. I spent about ten hours last weekend
trying to teach 2-kernel-monte to do this. But the damn
XT clock refused to deliver interrupts to the secondary
kernel when the primary had local APIC enabled :(
On uniprocessor, you can type `sudo monte /boot/bzImage'
and get to `decompressing linux' in two seconds flat. (Having
journalling filesystems rather helps with this trick). It's
lovely.
> The biggest issue I have had is
> with the kernel not properly shutting down devices.
Monte just disables all busmastering on the PCI devices...
> In the short term shutting down devices is trivially handled by
> umounting filesystems, downing ethernet devices, and calling the
> reboot notifier chain. Long term I need to call the module_exit
> routines but they need a little sorting out before I can use them
> during reboot. In particular calling any module_exit routine that clears
> pm_power_off is a no-no.
module_exit() routines for statically-linked drivers often
don't exist - they're in .text.exit. I guess you can just
move .text.exit out of the /DISCARD/ section in vmlinux.lds.
Also, take a look at user-mode-linux's do_exitcalls()
implementation - there's no clear reason why that shouldn't
be mainstreamed.
It would be convenient to be able to directly boot a bzImage,
but I guess elf is workable.
Great work, and thanks! I look forward to 2-second SMP
reboots.
Followup to: <[email protected]>
By author: [email protected] (Eric W. Biederman)
In newsgroup: linux.dev.kernel
>
> - Code at that entry point to query from the firmware/BIOS the
> information the kernel needs.
>
How do you query from the 16-bit firmware/BIOS at the 32-bit
entrypoint? Or is it that you have a table, fixed by protocol, of
what information is available (so we're basically fucked when
something needs to change)?
-hpa
--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt <[email protected]>
On Wed, 30 Jan 2002 13:19:17 -0800,
Andrew Morton <[email protected]> wrote:
>"Eric W. Biederman" wrote:
>> In the short term shutting down devices is trivially handled by
>> umounting filesystems, downing ethernet devices, and calling the
>> reboot notifier chain. Long term I need to call the module_exit
>> routines but they need a little sorting out before I can use them
>> during reboot. In particular calling any module_exit routine that clears
>> pm_power_off is a no-no.
>
>module_exit() routines for statically-linked drivers often
>don't exist - they're in .text.exit. I guess you can just
>move .text.exit out of the /DISCARD/ section in vmlinux.lds.
Sounds like a generalization of device hot plugging, which has already
solved this problem. Turn on CONFIG_HOTPLUG and module_exit()
functions are in .text instead of .text.exit, no need to fiddle with
vmlinux.lds.
"H. Peter Anvin" <[email protected]> writes:
> Followup to: <[email protected]>
> By author: [email protected] (Eric W. Biederman)
> In newsgroup: linux.dev.kernel
> >
> > - Code at that entry point to query from the firmware/BIOS the
> > information the kernel needs.
> >
>
> How do you query from the 16-bit firmware/BIOS at the 32-bit
> entrypoint? Or is it that you have a table, fixed by protocol, of
> what information is available (so we're basically fucked when
> something needs to change)?
I drop back into real mode. Run the existing query code, (I had to factor
setup.S but the queries are 100% the same) and then I climb back to
32bit mode. But I do it from setup_arch() so if I don't have a
pcbios, I can skip all that nasty business.
I did it the wrong way (fixed table) initially and after some
conversations with you and some thinking I changed it around.
There is nothing outside the ELF specification that needs to be used.
All I depend on is having flat 32bit code and data segments, initially
loaded in %cs and %ds. So basically anyone's x86 ELF bootloader should
work.
I have defined a fully optional table of tagged elements, so the
bootloader can tell me what kind of firmware I have. All it passes
besides that is the bootloader name and the bootloader version. So
you must do the bios queries yourself.
Eric
Andrew Morton <[email protected]> writes:
> On uniprocessor, you can type `sudo monte /boot/bzImage'
> and get to `decompressing linux' in two seconds flat. (Having
> journalling filesystems rather helps with this trick). It's
> lovely.
Hmm. That sounds a little slow to me. That is about what I get
with LinuxBIOS from when I flip the power switch, and network boot.
But that is enough enjoyment of speed.
> > The biggest issue I have had is
> > with the kernel not properly shutting down devices.
>
> Monte just disables all busmastering on the PCI devices...
That might be a useful addition, as it will probably work for most
devices. However it doesn't handle non-PCI devices. And it doesn't
handle strange devices that need a different shutdown.
With module_exit() I am quite certain the linux driver can find the
device and set it up again, because otherwise you couldn't insert,
remove, and reinsert the code as a module.
> module_exit() routines for statically-linked drivers often
> don't exist - they're in .text.exit. I guess you can just
> move .text.exit out of the /DISCARD/ section in vmlinux.lds.
> Also, take a look at user-mode-linux's do_exitcalls()
> implementation - there's no clear reason why that shouldn't
> be mainstreamed.
I like the other suggestion of extending the Hot-plug infrastructure.
In that case I just need to figure out how to logically Hot-unplug all
the devices in the system. That may be better than a
do_exitcalls()... As it automatically gets the discrimination right.
> It would be convenient to be able to directly boot a bzImage,
> but I guess elf is workable.
Well that is directly booting vmlinux, and it doesn't lock you into
booting the linux kernel which is very important to me.
> Great work, and thanks! I look forward to 2-second SMP
> reboots.
I'd love to hear how it goes.
Eric
Eric W. Biederman wrote:
>
>>It would be convenient to be able to directly boot a bzImage,
>>but I guess elf is workable.
>
> Well that is directly booting vmlinux, and it doesn't lock you into
> booting the linux kernel which is very important to me.
>
Neither is bzImage... you can use bzImage format for other things. My
main worry about this is that it locks you into booting a single image
(as opposed to kernel+initramfs, the latter of which can be composed on
the fly if desirable), which is a huge step backwards IMNSHO.
-hpa
On 30 Jan 2002 19:42:14 -0700,
[email protected] (Eric W. Biederman) wrote:
>I like the other suggestion of extending the Hot-plug infrastructure.
>In that case I just need to figure out how to logically Hot-unplug all
>the devices in the system. That may be better than a
>do_exitcalls()... As it automatically gets the discrimination right.
In an ideal world, it should be enough to call the module_exit()
functions in reverse order to the module_init(), LIFO. But check with
the hotplug list, they have done most of the work on this problem.
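To make the LIFO idea concrete, here is a toy sketch (not kernel code) of a registry that records exit routines in the order the init routines ran and then calls them in reverse at shutdown. `DriverRegistry` and the driver names are invented for the illustration.

```python
from typing import Callable, List, Tuple


class DriverRegistry:
    """Remember exit routines in init order; run them last-in, first-out."""

    def __init__(self) -> None:
        self._exit_stack: List[Tuple[str, Callable[[], None]]] = []

    def register(self, name: str, init: Callable[[], None],
                 exit_: Callable[[], None]) -> None:
        init()                                  # analogue of module_init()
        self._exit_stack.append((name, exit_))  # remember the matching exit

    def shutdown(self) -> List[str]:
        """Call every exit routine in reverse init order; return the order."""
        order = []
        while self._exit_stack:
            name, exit_ = self._exit_stack.pop()
            exit_()                             # analogue of module_exit()
            order.append(name)
        return order


log: List[str] = []
registry = DriverRegistry()
for drv in ("pci", "ide", "eth0"):
    registry.register(drv,
                      init=lambda d=drv: log.append("init " + d),
                      exit_=lambda d=drv: log.append("exit " + d))
shutdown_order = registry.shutdown()
print(shutdown_order)
```

A bus driver that was initialized first is torn down last, so devices sitting on it are stopped before the bus itself goes away.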
[email protected]
"H. Peter Anvin" <[email protected]> writes:
> Neither is bzImage... you can use bzImage format for other things.
Agreed. Though the bzImage format does have a smaller set of programs
it can handle than an ELF image, and is a little harder to set up.
In particular many simple stand alone programs are ELF bootable by
accident.
> My main worry about this is that it locks you into booting a single
> image (as opposed to kernel+initramfs, the latter of which can be
> composed on the fly if desirable), which is a huge step backwards
> IMNSHO.
Compose on the fly requires some amount of knowledge of what you are
booting. You cannot edit an arbitrary image, and know it will work.
The ELF file format does still help in that case though as it exports
which addresses are already taken.
Beyond that ELF has a .note section and more usefully a PT_NOTE
program header that points at the .note section. Using ELF notes I
can export information like where initrd_start and initrd_size are in
the ELF image, so they can be modified.
ELF notes are composed of 3 pieces. name (string), descriptor (string)
and type. name is a label defining who defined the note (we can use
Linux). type is a 32 bit binary integer, so there is plenty of room
for everything we need. descriptor we can use for the contents.
Through this we can also export things like the command line. I have
prototyped this with my mkelfImage utility and it doesn't look too hard
to handle. And I use a checksum it adds to verify the image's integrity
in LinuxBIOS.
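For readers who have not looked at ELF notes before, a small sketch of the generic on-disk layout may help: three 32-bit words (name size, descriptor size, type), then the NUL-terminated name and the descriptor, each padded to a 4-byte boundary. The type numbers `NOTE_COMMAND_LINE` and `NOTE_INITRD` below are purely hypothetical placeholders and are not the values from the draft proposal or from mkelfImage.

```python
import struct
from typing import Iterator, Tuple

# Hypothetical note types, chosen only for this illustration.
NOTE_COMMAND_LINE = 1
NOTE_INITRD = 2


def _pad4(data: bytes) -> bytes:
    """Pad with NUL bytes up to the next 4-byte boundary."""
    return data + b"\0" * (-len(data) % 4)


def pack_note(name: str, note_type: int, desc: bytes) -> bytes:
    """Encode one ELF note (little-endian)."""
    name_bytes = name.encode("ascii") + b"\0"   # namesz counts the NUL
    header = struct.pack("<III", len(name_bytes), len(desc), note_type)
    return header + _pad4(name_bytes) + _pad4(desc)


def iter_notes(blob: bytes) -> Iterator[Tuple[str, int, bytes]]:
    """Walk the contents of a PT_NOTE segment, yielding (name, type, desc)."""
    pos = 0
    while pos + 12 <= len(blob):
        namesz, descsz, note_type = struct.unpack_from("<III", blob, pos)
        pos += 12
        name = blob[pos:pos + namesz].rstrip(b"\0").decode("ascii")
        pos += (namesz + 3) & ~3
        desc = blob[pos:pos + descsz]
        pos += (descsz + 3) & ~3
        yield name, note_type, desc


blob = (pack_note("Linux", NOTE_COMMAND_LINE, b"root=/dev/nfs ip=dhcp\0")
        + pack_note("Linux", NOTE_INITRD, struct.pack("<II", 0x800000, 0x40000)))
for name, note_type, desc in iter_notes(blob):
    print(name, note_type, desc)
```

Because every note carries its own sizes, a loader that does not understand a given type can step over it, which is what keeps the notes optional.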
Beyond that I can still build a bootable bzImage with my patched
kernel. setup_arch() looks, sees that we have already queried the
BIOS and skips that step. And the code used to query the BIOS is
exactly the same in both cases; it just has a different delivery
mechanism.
Beyond that having the ability to do the composition before booting is
important for the network booting case, as it makes network bootloaders
simpler. For network booting having ramdisks actually tends to be
more important than in regular booting because you can't count on
anything else being there on the machine. So I have no intention of
not supporting initramfs.
My kexec system call directly supports ELF images simply because it
makes debugging easier. It means I don't have to compute everything
on the fly. I can go back and look to see what I did wrong or I can
try the same image on a known good bootloader.
I am reluctant to go with a bootimg like interface because having a
standard format encourages people to standardize. Though a good
argument can persuade me. I don't lose any flexibility in comparison
to bootimg because composing files on the fly is not significantly
harder than composing a bootable image in ram.
Please tell me if I haven't clearly answered your concerns about
being locked into a single image.
Eric
Eric W. Biederman wrote:
>
> I am reluctant to go with a bootimg like interface because having a
> standard format encourages people to standardize. Though a good
> argument can persuade me. I don't lose any flexibility in comparison
> to bootimg because composing files on the fly is not significantly
> harder than composing a bootable image in ram.
>
> Please tell me if I haven't clearly answered your concerns about
> being locked into a single image.
>
I have to think about it. I'm not convinced that this particular
flavour of standardization is a step in the right direction -- in fact,
it is *guaranteed* to provide significant additional complexity for
bootloaders, and bzImage support is still going to have to be provided
for the foreseeable future. Since you express that it will basically be
necessary to stitch the ELF file together on the fly I don't see much
point, quite frankly; it seems like extra complexity for no good reason.
-hpa
"H. Peter Anvin" <[email protected]> writes:
> Eric W. Biederman wrote:
>
> > I am reluctant to go with a bootimg like interface because having a
> > standard format encourages people to standardize. Though a good
> > argument can persuade me. I don't lose any flexibility in comparison
> > to bootimg because composing files on the fly is not significantly
> > harder than composing a bootable image in ram.
> >
> > Please tell me if I haven't clearly answered your concerns about
> > being locked into a single image.
> >
>
>
> I have to think about it. I'm not convinced that this particular flavour of
> standardization is a step in the right direction -- in fact, it is *guaranteed*
> to provide significant additional complexity for bootloaders, and bzImage
> support is still going to have to be provided for the foreseeable
> future.
It is not my intention to deprecate bzImage at this time. But instead
to provide an alternative.
> Since you express that it will basically be necessary to stitch the
> ELF file together on the fly I don't see much point, quite frankly;
> it seems like extra complexity for no good reason.
This part is only for people using the kernel as a bootloader. My
apologies if I didn't make it clear.
Let me clarify a little with usage. I have three cases I am targeting.
1) LinuxBIOS.
I need a kernel format that doesn't unconditionally do 16bit BIOS
queries, as there is no 16bit code in LinuxBIOS. bzImage doesn't
work.
2) Portability.
For this I start with some simple bootstrap program that loads the
kernel. And then I have the kernel driving devices and acting as a
super bootloader. Something like Grub but with the full power
linux can bring to bear on the issue. This is all I need to
standardize the user booting experience on multiple platforms.
3) Network Booting.
There is not much chance to change bootroms once they are flashed,
so they, like LinuxBIOS, need a general purpose interface that can
handle whatever they need to boot.
On the dynamic stitching/initramfs side, the code should really be no
more complex than the current bzImage loading with ramdisks. And if
bootloaders want to keep it simple they can drop any optional ELF
features, which should be much simpler than today's bzImage.
Now I will shut up and let you digest this.
Eric
Eric W. Biederman wrote:
>
> 3) Network Booting.
> There is not much chance to change bootroms once they are flashed
> so they like LinuxBIOS need a general purpose interface, so that can
> handle whatever they need to boot.
>
On this particular subject, I should point out that there is a standard
for network bootroms on i386 platforms -- PXE. Most PXE implementations
out there suck rocks, but that's orthogonal -- they're still a lot
easier to use than coming up with your own (and they will boot, ahem,
other operating systems as well.)
Now, PXE is pretty limited and usually boots a second-stage bootloader.
That being said, it would be great to get an Open Source PXE
implementation and driver collection. I was hoping NILO
(http://www.nilo.org/) would be it, but it seems to not be going
anywhere. I was for a while considering trying to turn Etherboot into a
PXE kit.
-hpa
On Wed, Jan 30, 2002 at 08:41:04PM -0800, H. Peter Anvin wrote:
> Eric W. Biederman wrote:
>
> >
> > I am reluctant to go with a bootimg like interface because having a
> > standard format encourages people to standardize. Though a good
> > argument can persuade me. I don't lose any flexibility in comparison
> > to bootimg because composing files on the fly is not significantly
> > harder than composing a bootable image in ram.
> >
> > Please tell me if I haven't clearly answered your concerns about
> > being locked into a single image.
> >
>
>
> I have to think about it. I'm not convinced that this particular
> flavour of standardization is a step in the right direction -- in fact,
> it is *guaranteed* to provide significant additional complexity for
> bootloaders, and bzImage support is still going to have to be provided
> for the foreseeable future. Since you express that it will basically be
> necessary to stitch the ELF file together on the fly I don't see much
> point, quite frankly; it seems like extra complexity for no good reason.
I'm inclined to agree with Peter here.
For Linux, placing an initrd image with ELF is seriously problematic
since only the boot loader will have the necessary information to know
where to put it.
(Note: I suppose initramfs will make some or all of this go away but
as long as initrds are around - which I think will be a long while,
these problems exist.)
- It needs to be placed far enough away from the kernel to avoid
getting overwritten when the kernel starts allocating memory for
stuff at boot time. I had this problem myself with two kernel
monte and a fixed load address for initrds.
- It needs to be placed on free memory. The boot loader would
potentially have access to memory maps to know where not to place
the initrd. I don't know if any do this at this point but I have
had problems with bootloaders dropping initrds in reserved
regions. I ended up switching from syslinux to LILO (or maybe it
was the other way) to get around that.
As soon as you throw multiple architectures into the mix, I think
anything you think you can assume about what memory is reasonable
to use disappears.
Two kernel monte still has this problem. Reading the E820 map or
something like it is on my to-do list somewhere.
I don't think that being able to compose an elf image on the fly would
solve either of these problems since the boot loader needs to make a
relocation decision. Running a linker on the fly would be pretty
nasty anyway, IMO. Nobody is suggesting any kind of dynamic linking
in the boot loader are they?
I like some of the patches that change Linux so you enter in 32 bit
protected mode. Switching back to 16bit mode might cause some trouble
on boards with screwy BIOSes. I did the "stay in protected mode"
thing for two kernel monte because some BIOSes would get hung if I
switch to protected mode and then back again before calling the APM
setup. The trick there is to keep the setup information from the real
mode code and use it for the next kernel. Gross, if you ask me.
I think ELF is overkill for what you're doing with it. It's an
established format but so what? It's not like you'll be able to take
advantage of a large existing code base. Sure, ld exists to create your
image but that's not the hard part. The boot loaders would all have
to include new code for this. Also, your patch (for x86 only, it
seems. didn't you have alpha support too?) is far from trivial. In
short, I don't see how using ELF (or creating a new boot format at
all) is going to save much, if any, work.
I have this funny feeling some of the initrd discussion might have
been discussed/addressed elsewhere. I hope I'm not too out of touch
:)
Anyway, there's $.02.
- Erik
P.S. The two second delay in two kernel monte is intentional and
easily removed. It's there to let me glimpse the messages
before it actually does the reset.
Eric W. Biederman wrote:
>
> From my experience PXE is not easier to use than coming up with my
> own. At least not for machines that regularly need to network boot.
> And many motherboard manufacturers are happy to replace their PXE
> option rom with an etherboot option rom.
>
> And besides not working well PXE is overly complicated, and
> intricately tied to the x86 BIOS. I would rather simply follow the
> good internet RFC's and work on filling in the one missing piece. A
> file format. And the ELF file format works very well.
>
> Besides all of that I regularly network boot LinuxBIOS which
> PXE can't cope with.
>
> Using etherboot I don't need a second stage bootloader. Etherboot
> does work well. I don't need anything beyond vanilla DHCP and TFTP
> (the standards for network booting). The research into how to do it
> has really been done. And I can work on interesting things like
> adding end to end checksums of the image I am booting.
>
Etherboot requires a specific other driver. The problem with what
you're proposing -- and let me get it very clear here, it's a huge
problem -- is that you have no device-independent access to the boot
medium (in this case, the network) once you have loaded the initial boot
program. This is an enormous drawback.
That's the thing with PXE and the BIOS too, for that matter: they might
be specs done by monkeys, but when it really counts, what you need is
really there (modulo bugs, but that applies to everything.)
-hpa
"H. Peter Anvin" <[email protected]> writes:
> Etherboot requires a specific other driver. The problem with what you're
> proposing -- and let me get it very clear here, it's a huge problem -- is that
> you have no device-independent access to the boot medium (in this case, the
> network) once you have loaded the initial boot program. This is an enormous
> drawback.
First it isn't an immediate problem because with an ELF image you
don't need a multistage bootloader. I can represent everything in one
file. Essentially a sparse coredump.
If you are writing an intermediate loader it is a problem. An
intermediate loader needs OS services, and if you don't have those
services you are in trouble. For this purpose it is fair to call the
x86 BIOS, EFI, the SRM, and open firmware OS's.
Personally when I want an OS I would like to use Linux. So I have
written a patch that allows me to load another OS from Linux, so in
those cases when I am writing an intermediate boot loader I can use
Linux. I admit I haven't gracefully solved all of the bootstrapping
cases but that is just a matter of time.
> That's the thing with PXE and the BIOS too, for that matter: they might be specs
> done by monkeys, but when it really counts, what you need is really there
> (modulo bugs, but that applies to everything.)
Except PXE isn't always there. In fact PXE is usually absent. If it
was always there on x86 this would be a good argument. For those
cases when I can't get firmware to do my network booting I put the
Linux kernel on a floppy or a cd or whatever the firmware can boot off
of and network boot with that.
I do agree that having specs, even when done by monkeys, is good when
they are widely implemented.
But I see no reason why the open source community shouldn't drive the
specs. We have as much right as Intel or any other self appointed
committee. And open source is a great tool for providing de facto
implementations.
And I have been doing this for over a year in production on thousands
of machines so I do know that PXE is by no means necessary.
Eric
Eric W. Biederman wrote:
>
> First it isn't an immediate problem because with an ELF image you
> don't need a multistage bootloader. I can represent everything in one
> file. Essentially a sparse coredump.
>
For what definition of "everything"? I'm not being facetious, I think
it's a fundamentally impossible statement to make, especially if the
bootloader is interactive.
> If you are writing an intermediate loader it is a problem. An
> intermediate loader needs OS services, and if you don't have those
> services you are in trouble. For this purpose it is fair to call the
> x86 BIOS, EFI, the SRM, and open firmware OS's.
>
> Personally when I want an OS I would like to use Linux. So I have
> written a patch that allows me to load another OS from Linux, so in
> those cases when I am writing an intermediate boot loader I can use
> Linux. I admit I haven't gracefully solved all of the bootstrapping
> cases but that is just a matter of time.
Your intermediate Linux still needs to have all its devices. I was
working for a while on a project to create an "intermediate Linux" which
wouldn't claim to own the world but instead callback to the firmware; this
would be a port of Linux to a sort of pseudo-architecture. I got a bit
too busy, but I think it's a valid program.
>>That's the thing with PXE and the BIOS too, for that matter: they might be specs
>>done by monkeys, but when it really counts, what you need is really there
>>(modulo bugs, but that applies to everything.)
>
> Except PXE isn't always there. In fact PXE is usually absent. If it
> was always there on x86 this would be a good argument. For those
> cases when I can't get firmware to do my network booting I put the
> Linux kernel on a floppy or a cd or whatever the firmware can boot off
> of and network boot with that.
PXE is there on the vast majority of all modern (1999 or later) network
cards. Those that aren't either have a socket or don't have any
provisions for network firmware whatsoever, as a rule.
> I do agree that having specs even when done by monkeys are good when
> they are widely implemented.
>
> But I see no reason why the open source community shouldn't drive the
> specs.
We could have, but we didn't. I DO NOT want to launch a competing spec.
> We have as much right as Intel or any other self appointed
> committee. And open source is a great tool for providing de facto
> implementations.
Yes, but we're several years too late. PXE is the accepted spec, for as
much as it sucks (in its defense, the latest version of the PXE spec
actually does most of what you want it to be able to do, although few if
any commercial PXE implementations get these right in my experience).
I would *LOVE* to see a high-quality Open Source PXE implementation, for
several reasons:
a) To support older, socket-carrying network cards;
b) To put on a floppy or flash into your network ROM if your commercial
PXE firmware is too broken to live.
> And I have been doing this for over a year in production on thousands
> of machines so I do know that PXE is by no means necessary.
There are a lot of things that are "by no means necessary" if you look at
any one individual user.
It's fundamentally about creating pervasive interfaces. My problem with
several of your proposals is that they make well-established interfaces
*less* pervasive, which is a huge step in the wrong direction.
-hpa
"Erik A. Hendriks" <[email protected]> writes:
> I'm inclined to agree with Peter here.
>
> For Linux, placing an initrd image with ELF is seriously problematic
> since only the boot loader will have the necessary information to know
> where to put it.
The bootloader barely has a clue about where in memory the ramdisk
should go. It just places it up very high in memory and hopes the
kernel won't overwrite it. Which is very interesting because except
for the bootmem allocator the current kernel allocates from high
memory down.
> (Note: I suppose initramfs will make some or all of this go away but
> as long as initrds are around - which I think will be a long while,
> these problems exist.)
>
> - It needs to be placed far enough away from the kernel to avoid
> getting overwritten when the kernel starts allocating memory for
> stuff at boot time. I had this problem myself with two kernel
> monte and a fixed load address for initrds.
I had that problem and I fixed the kernel. All that isn't fixed yet
is the bootmem allocator. Since 2.4.10 anything above 8MB is safe.
For the final version of my patch I intend to make it so that the
kernel has no problem with anything placed outside its ELF segments.
Which basically means statically allocating the bootmem bitmap, or
possibly the bootmem range array.
> - It needs to be placed on free memory. The boot loader would
> potentially have access to memory maps to know where not to place
> the initrd. I don't know if any do this at this point but I have
> had problems with bootloaders dropping initrds in reserved
> regions. I ended up switching from syslinux to LILO (or maybe it
> was the other way) to get around that.
I agree that is an issue. And etherboot looks and if the kernel is
attempting to load in reserved memory it says sorry I can't load
that. Not brilliant but since that case virtually never happens it
isn't a problem. For a given architecture the memory layout is pretty
much fixed.
> As soon as you throw multiple architectures into the mix, I think
> anything you think you can assume about what memory is reasonable
> to use disappears.
Sort of. The assumptions change per architecture. But I haven't
heard of an architecture where some addresses are not safe.
> Two kernel monte still has this problem. Reading the E820 map or
> something like it is on my to-do list somewhere.
>
> I don't think that being able to compose an elf image on the fly would
> solve either of these problems since the boot loader needs to make a
> relocation decision. Running a linker on the fly would be pretty
> nasty anyway, IMO. Nobody is suggesting any kind of dynamic linking
> in the boot loader are they?
No. Dynamic linking with relocation is nasty.
First with a fixed kernel a relocation decision isn't necessary. And
even if it is all you need is essentially the e820 map which should be
fairly straight forward to export to user space, and verify in kernel
space. One of my todo items is to actually verify the load
addresses against addresses where we have ram, and refuse the image
if necessary.
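The check I have in mind could look roughly like the sketch below. It assumes the memory map is available as (start, length, type) tuples in the style of the e820 map, with type 1 meaning usable RAM; `segment_fits_in_ram`, `image_is_loadable` and the example map are made up for illustration.

```python
from typing import Iterable, List, Tuple

E820_RAM = 1       # usable RAM
E820_RESERVED = 2  # reserved by the firmware

MemoryMap = List[Tuple[int, int, int]]  # (start, length, type)


def segment_fits_in_ram(paddr: int, memsz: int, memory_map: MemoryMap) -> bool:
    """True if [paddr, paddr + memsz) lies entirely inside one RAM range.

    Simplification: a segment that straddles two adjacent RAM ranges is
    rejected, because ranges are not merged first.
    """
    end = paddr + memsz
    return any(kind == E820_RAM and start <= paddr and end <= start + length
               for start, length, kind in memory_map)


def image_is_loadable(segments: Iterable[Tuple[int, int]],
                      memory_map: MemoryMap) -> bool:
    """Refuse the image if any (paddr, memsz) segment misses RAM."""
    return all(segment_fits_in_ram(paddr, memsz, memory_map)
               for paddr, memsz in segments)


# A made-up map for a 128MB machine.
example_map: MemoryMap = [
    (0x00000000, 0x0009FC00, E820_RAM),
    (0x0009FC00, 0x00000400, E820_RESERVED),
    (0x000F0000, 0x00010000, E820_RESERVED),
    (0x00100000, 0x07F00000, E820_RAM),
]

print(image_is_loadable([(0x100000, 0x200000)], example_map))
print(image_is_loadable([(0x9F000, 0x2000)], example_map))
```

The first image sits in RAM above 1MB and is accepted; the second runs into the reserved region below 640K and is refused.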
> I like some of the patches that change Linux so you enter in 32 bit
> protected mode. Switching back to 16bit mode might cause some trouble
> on boards with screwy BIOSes. I did the "stay in protected mode"
> thing for two kernel monte because some BIOSes would get hung if I
> switch to protected mode and then back again before calling the APM
> setup. The trick there is to keep the setup information from the real
> mode code and use it for the next kernel. Gross, if you ask me.
Agreed. I haven't seen the nasty APM code, so I'm not certain how to
comment. In my case, the only switch I have is back to real mode from
protected mode in the loaded image, which I believe works just fine.
Perhaps the first kernel would have to call the APM setup stuff to
make this stable; I'm not certain.
> I think ELF is overkill for what you're doing with it.
Static ELF executables are trivial. I went looking for a portable
format that had just a start address, and load this chunk of file into
this location in memory. ELF was the cleanest example of that I could
find. It has a few extra fields I can ignore but otherwise it works.
> It's an
> established format but so what? It's not like you'll be able to take
> advantage of large existing code base. Sure, ld exists to create your
> image but that's not the hard part. The boot loaders would all have
> to include new code for this.
Only if we stopped supporting bzImage, which I have no current
intentions of doing.
Not that I'm worried: the baseline code to load an ELF image really is
trivial. All you really have to add to a bootloader is an extra for
loop.
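As an illustration of how small that loop can be, here is a sketch in
C. The struct mirrors the field layout of the 32-bit ELF program
header; the function name load_segments and the "mem" buffer that
stands in for physical memory are assumptions made so the example can
run in user space. This is not the code from any existing bootloader.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PT_LOAD 1u   /* program header type for a loadable segment */

/* Same field layout as the 32-bit ELF program header (Elf32_Phdr). */
struct phdr32 {
    uint32_t p_type, p_offset, p_vaddr, p_paddr;
    uint32_t p_filesz, p_memsz, p_flags, p_align;
};

/*
 * Copy every PT_LOAD segment of an in-memory file image to its
 * destination. "mem" stands in for physical memory starting at
 * address 0; a real loader would write to p_paddr directly.
 * Returns 0 on success, -1 if a segment does not fit.
 */
int load_segments(const uint8_t *file, uint32_t file_len,
                  const struct phdr32 *ph, unsigned nph,
                  uint8_t *mem, uint32_t mem_len)
{
    unsigned i;

    assert(file != NULL && ph != NULL && mem != NULL);
    for (i = 0; i < nph; i++) {            /* the "extra for loop" */
        if (ph[i].p_type != PT_LOAD)
            continue;
        if (ph[i].p_filesz > ph[i].p_memsz ||
            ph[i].p_offset > file_len ||
            ph[i].p_filesz > file_len - ph[i].p_offset ||
            ph[i].p_paddr > mem_len ||
            ph[i].p_memsz > mem_len - ph[i].p_paddr)
            return -1;
        memcpy(mem + ph[i].p_paddr, file + ph[i].p_offset, ph[i].p_filesz);
        /* Bytes past p_filesz (typically .bss) are zero filled. */
        memset(mem + ph[i].p_paddr + ph[i].p_filesz, 0,
               ph[i].p_memsz - ph[i].p_filesz);
    }
    return 0;
}
```

After the loop succeeds, the loader jumps to the entry point recorded
in the ELF header.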
> Also, your patch (for x86 only, it
> seems. didn't you have alpha support too?) is far from trivial.
The only one of my patches that is x86 only at the moment is the patch
to build an ELF bootable Linux kernel image. I still support alpha in
my kexec work. Though I need to go back and test to see if it is
still working.
And my patch is far from trivial. But most of the code is in the
scatter/gather algorithm for loading the kernel into pages the kernel
has free. And then going through those pages to make certain the
assembly code can do a simple memcpy from that list of pages to their
destination address. Monte does something very similar. The nice
thing is that code is architecture independent so it can be trivially
reused.
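A simplified sketch of the idea in C, not the algorithm from the
patch itself. The struct page_move type, the page-number addressing
and both function names are hypothetical. The check verifies that a
plain in-order copy never overwrites a page that is still needed as a
source later in the list.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096u

/* Hypothetical list entry: copy page number src to page number dst. */
struct page_move {
    size_t src;
    size_t dst;
};

/*
 * An in-order copy is safe if no destination written at step i is
 * still needed as a source at a later step j > i.
 */
int copy_order_is_safe(const struct page_move *mv, size_t n)
{
    size_t i, j;

    assert(mv != NULL || n == 0);
    for (i = 0; i < n; i++)
        for (j = i + 1; j < n; j++)
            if (mv[j].src == mv[i].dst)
                return 0;
    return 1;
}

/* The simple copy loop the final, architecture specific code would run. */
void copy_pages(unsigned char *mem, const struct page_move *mv, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        memmove(mem + mv[i].dst * PAGE_SIZE,
                mem + mv[i].src * PAGE_SIZE, PAGE_SIZE);
}
```

If copy_order_is_safe() returns 0, the list has to be reordered or
the conflicting source page moved elsewhere before the final copy.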
Also the code is essentially independent from most of the rest of the
kernel. So once it gets stable very little needs to change. You have
doubtless observed this already.
> In
> short, I don't see how using ELF (or creating a new boot format at
> all) is going to save much, if any, work.
I already need a new format for LinuxBIOS, because I can't use
bzImages.
> I have this funny feeling some of the initrd discussion might have
> been discussed/addressed elsewhere. I hope I'm not too out of touch
> :)
My take on the current x86 boot protocol is that it is flawed with
respect to ramdisks. When it was initially implemented the kernel was
assigned no responsibility for not overwriting the ramdisk and the
trick of putting the ramdisk as high up in memory as possible was
developed. Later, when the top of memory was too high for the kernel,
the boot protocol was fixed again, and correspondingly the
bootloaders.
Personally I think all of that is just flawed. The bootloaders should
be simple. They should be able to load the ramdisk at a fixed address
(assuming the memory isn't reserved). And they shouldn't need to be
changed every time the kernel has a problem.
> Anyway, there's $.02.
Thanks.
> - Erik
>
> P.S. The two second delay in two kernel monte is intentional and
> easily removed. It's there to let me glimpse the messages
> before it actually does the reset.
Cool. That makes me feel a lot better about monte. Not that I ever
really had a problem with it.
My main purpose in this conversation is to get people familiar with
what I'm doing, to see if there really are areas that I have missed,
and to convince people I intend to support this for the long term. So
I can get Linux booting Linux functionality into the kernel.
My code still needs some work of course, but all of the major elements
are there.
Eric
Eric W. Biederman wrote:
>
> I already need a new format for LinuxBIOS, because I can't use
> bzImages.
>
[...]
>
> Personally I think all of that is just flawed. The bootloaders should
> be simple. They should be able to load the ramdisk at a fixed address
> (assuming the memory isn't reserved). And they shouldn't need to be
> changed every time the kernel has a problem.
>
And your solution is to come up with a new format that is (a) MORE
complex, (b) DIFFERENT, (c) incompatible?
Give me a break. You have just added so much complexity it's not even
funny. I can guarantee you that people *WILL* ask for every single
existing bootloader out there to support your new format. It's a support
nightmare for all of us that write bootloaders, and not just for you.
If your complaint is about the lack of a 32-bit entrypoint such can
probably be added to the existing format (it would require
Oh, and as far as "simple" is concerned, I should let you know that when I
work on syslinux, I count bytes. The existing protocol is definitely
suboptimal in that respect (this is due to some severe mistakes which were
done in the original initrd implementation), but we're stuck with it. All
you're accomplishing is creating two completely incompatible formats,
*BOTH* of which will need to be supported for the foreseeable future.
-hpa
On 31 Jan 2002 16:36:27 -0700,
[email protected] (Eric W. Biederman) wrote:
>"Erik A. Hendriks" <[email protected]> writes:
>Sort of. The assumptions change per architecture. But I haven't
>heard of an architecture where some addresses are not safe.
NUMA boxes with discontiguous physical memory. You may not boot off
node 0. Whichever node you boot from may not be able to see all of
physical memory yet, the cross node directories may not be set up.
The boot node may not even have physical address 0. I don't say that
these boxes exist yet but they are possible with discontiguous memory
architectures.
>Dynamic linking with relocation is nasty.
I have some code in insmod that you can use ... Nasty is an
understatement.
On Thu, Jan 31, 2002 at 02:03:09PM +1100, Keith Owens wrote:
> On 30 Jan 2002 19:42:14 -0700,
> [email protected] (Eric W. Biederman) wrote:
> >I like the other suggestion of extending the Hot-plug infrastructure.
> >In that case I just need to figure out how to logically Hot-unplug all
> >the devices in the system. That may be better than a
> >do_exitcalls()... As it automatically gets the discrimination right.
>
> In an ideal world, it should be enough to call the module_exit()
> functions in reverse order to the module_init(), LIFO. But check with
> the hotplug list, they have done most of the work on this problem.
Actually, no we haven't :)
We punt on the "do we remove the driver when the device disappears"
issue, and instead ignore the hotplug REMOVE events right now.
So far, no one's complained.
But to unplug all of the devices in the system in the proper order,
you're probably going to have to use the driverfs tree that is slowly
taking shape. That's the only representation of all physical devices in
the system that shows the topology correctly.
Hope this helps,
greg k-h
In my ideal world here is how it works.
The firmware bootloader reads an ELF header from the first sector of
the disk. Looking at the ELF header it knows where everything else is
and loads the rest of the operating system and jumps to it.
There is a problem with that: I cannot select among multiple images to
boot. So I install a different ELF image with the ELF header on the
first sector of my hard drive. This second image, instead of being a
stripped down, small and limited piece of code, is a distribution of my OS
dedicated to the task of helping me decide which image to boot. Total
freedom.
And I have this working now without sweating the space in 64K of
firmware.
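The first step described above, reading an ELF header from the first
sector, can be sketched in C as follows. The byte offsets are those of
the standard 32-bit ELF header; the 512 byte sector size, the
struct boot_info type and the function name are assumptions for the
sake of the example, and only the 32-bit little-endian case is
handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE 512u

/* Fields a loader needs from a 32-bit little-endian ELF header. */
struct boot_info {
    uint32_t entry;      /* e_entry: address to jump to */
    uint32_t phoff;      /* e_phoff: file offset of the program headers */
    uint16_t phentsize;  /* e_phentsize: size of one program header */
    uint16_t phnum;      /* e_phnum: number of program headers */
};

static uint16_t get_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t get_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 0 and fills *out if the sector starts with a usable header. */
int parse_first_sector(const uint8_t *sector, struct boot_info *out)
{
    assert(sector != NULL && out != NULL);
    if (sector[0] != 0x7f || sector[1] != 'E' ||
        sector[2] != 'L' || sector[3] != 'F')
        return -1;                     /* not an ELF file */
    if (sector[4] != 1 || sector[5] != 1)
        return -1;                     /* not 32-bit, little endian */
    out->entry     = get_le32(sector + 24);
    out->phoff     = get_le32(sector + 28);
    out->phentsize = get_le16(sector + 42);
    out->phnum     = get_le16(sector + 44);
    return 0;
}
```

With phoff, phentsize and phnum in hand the firmware knows where the
program headers are, and from those where the rest of the image lives.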
"H. Peter Anvin" <[email protected]> writes:
> Eric W. Biederman wrote:
>
> >
> > First it isn't an immediate problem because with an ELF image you
> > don't need a multistage bootloader. I can represent everything in one
> > file. Essentially a sparse coredump.
>
>
> For what definition of "everything"? I'm not being facetious, I think
> it's a fundamentally impossible statement to make, especially if the
> bootloader is interactive.
I know of no case for which a free-standing program or operating
system cannot be represented as an ELF image. Essentially the
non-interactive case.
For the interactive case I agree you can't please everyone.
But if it is trivial to write the interactive bootloader, with size
being the only problem with naive implementations I don't see a
problem. Though I suspect a bootloader running X will be overkill :)
> > If you are writing an intermediate loader it is a problem. An
> > intermediate loader needs OS services, and if you don't have those
> > services you are in trouble. For this purpose it is fair to call the
> > x86 BIOS, EFI, the SRM, and open firmware OS's.
> >
> > Personally when I want an OS I would like to use Linux. So I have
> > written a patch that allows me to load another OS from Linux, so in
> > those cases when I am writing an intermediate boot loader I can use
> > Linux. I admit I haven't gracefully solved all of the bootstrapping
> > cases but that is just a matter of time.
>
>
> Your intermediate Linux still needs to have all its devices.
Yep sure does.
> I was
> working for a while on a project to create an "intermediate Linux" which
> wouldn't claim to own the world but instead callback to the firmware; this
> would be a port of Linux to a sort of pseudo-architecture. I got a bit
> too busy, but I think it's a valid program.
I agree. But I also think it is valid to directly drive the devices
if you never plan on having the firmware driving them again. Or you at
least issue a firmware device reinitialization command before having
the firmware drive the device.
[snip PXE It doesn't matter if we agree]
> It's fundamentally about creating pervasive interfaces. My problem with
> several of your proposals is that they make well-established interfaces
> *less* pervasive, which is a huge step in the wrong direction.
I agree. Except many of the pervasive interfaces today, do not
address my needs.
In particular an x86 BIOS is terrible for clusters, where you have
hundreds of thousands of machines that you wish to manage remotely. I
have tried. I have worked with board vendors and BIOS manufacturers,
and at their best they can barely manage serial console support. And
they really don't help with anything beyond that. I have done a
better job with less code, in less time by rewriting the BIOS from
scratch.
So I am going to push for architecture neutral standards, that have
the potential to be even more pervasive than today's interfaces, because I
already need something different. And a core part of that for me is
the ELF file format. As I see it a bootloader saying everything is
ELF is as simplifying as the unix everything is a file concept.
Eric
"H. Peter Anvin" <[email protected]> writes:
> Eric W. Biederman wrote:
>
> >
> > I already need a new format for LinuxBIOS, because I can't use
> > bzImages.
> >
>
> [...]
>
> >
> > Personally I think all of that is just flawed. The bootloaders should
> > be simple. They should be able to load the ramdisk at a fixed address
> > (assuming the memory isn't reserved). And they shouldn't need to be
> > changed every time the kernel has a problem.
> >
>
>
> And your solution is to come up with a new format that is (a) MORE
> complex, (b) DIFFERENT, (c) incompatible?
a) More complex. Just barely.
b) Yep.
c) Yep.
d) More flexible and can handle whatever it needs to do tomorrow,
without mods.
All firmware has its own different format. With LinuxBIOS things are
simple enough that you can add a make target instead of having to
write a specific bootloader to support it.
> Give me a break. You have just added so much complexity it's not even
> funny.
The only thing I have seen that is terribly complex is when I wrote a
thou shalt paper, my draft specification, spelling out in
excruciating detail how everything needs to be set up. That takes a
simple idea and makes it sound terribly complex. I need to rewrite
that before I get much farther. I added flexibility, not complexity.
In practice all I have added is a 32bit entry point and a for loop.
> I can guarantee you that people *WILL* ask for every single
> existing bootloader out there to support your new format. It's a support
> nightmare for all of us that write bootloaders, and not just for
> you.
Hmm. Supporting a format that has needed no changes since its 1.0
release a decade ago is going to be a support nightmare?
I know it is slightly more general than the current x86 home grown
solution so that is a bit harder.
Having a format that allows a cross platform implementation of rdev is
a support nightmare?
I am in no way promoting using any of the ELF relocation support. I
have made that clear, right?
> If your complaint is about the lack of a 32-bit entrypoint such can
> probably be added to the existing format (it would require
If adding a 32bit entry point to bzImage is really a more palatable
solution I can probably handle that. But it will take some
convincing.
And then I will go out and write a utility to convert that to an ELF
image, for when I want to network boot or load it from LinuxBIOS. But
in that sense it will be just a bootloader. So it will be less likely to
generate demand that everyone else support the ELF format for booting...
> Oh, and as far as "simple" is concerned, I should let you know that when I
> work on syslinux, I count bytes.
O.k. for a size comparison. I just built the LinuxBIOS elf bootloader
with a floppy driver for reading the disk, and a serial driver for
writing output, all 32 bit C code. With debugging messages compiled
in it was ~= 11K with debugging messages compiled out it was ~= 7K.
That is a very general bootloader that has practically every optional
feature implemented, and there has been no real attempt at space
optimization, and the drivers are direct to hardware drivers.
Admittedly it is a noninteractive bootloader.
Is that simple enough?
> The existing protocol is definitely
> suboptimal in that respect (this is due to some severe mistakes which were
> done in the original initrd implementation), but we're stuck with it. All
> you're accomplishing is creating two completely incompatible formats,
> *BOTH* of which will need to be supported for the foreseeable future.
And on that score I can do deeper work. Would it be more palatable
if there were utilities to convert from one format to the other
without loss of information?
Eric
Eric W. Biederman wrote:
>
> a) More complex. Just barely.
> b) Yep.
> c) Yep.
> d) More flexible and can handle whatever it needs to do tomorrow,
> without mods.
>
> All firmware has its own different format. With LinuxBIOS things are
> simple enough that you can add a make target instead of having to
> write a specific bootloader to support it.
>
I cannot agree with you on (d). In fact, you have already brushed off
multiple issues I have raised, such as interactivity.
I don't particularly care if you want to make this the LinuxBIOS
platform-specific whatever, but when you start talking about making it
the grand universal thing, you have me seriously concerned.
-hpa
"H. Peter Anvin" <[email protected]> writes:
> Eric W. Biederman wrote:
>
> > a) More complex. Just barely.
> > b) Yep.
> > c) Yep.
> > d) More flexible and can handle whatever it needs to do tomorrow,
> > without mods.
> >
> > All firmware has its own different format. With LinuxBIOS things are
> > simple enough that you can add a make target instead of having to
> > write a specific bootloader to support it.
> >
>
>
> I cannot agree with you on (d). In fact, you have already brushed off
> multiple issues I have raised, such as interactivity.
What is magic about interactivity? What makes this a different
problem? We approach booting from totally different perspectives,
which makes communicating clearly hard.
If you spell out individual problems I will show you how I would solve
them.
I believe I have done my due diligence and provided appropriate
channels for everything, but I'm willing to test it.
> I don't particularly care if you want to make this the LinuxBIOS
> platform-specific whatever, but when you start talking about making it the grand
> universal thing, you have me seriously concerned.
Hey it isn't wrong to dream of world domination :)
Eric
Eric W. Biederman wrote:
>
> What is magic about interactivity? What makes this a different
> problem? We approach booting from totally different perspectives,
> which makes communicating clearly hard.
>
> If you spell out individual problems I will show you how I would solve
> them.
>
It makes it a very different problem because YOU DON'T KNOW WHAT YOU'RE
BOOTING UNTIL THE USER TELLS YOU.
In fact, depending on just exactly what you're doing, you might not even
know what you're booting until you have already gotten several items
downloaded (consider, for example, a device-probing bootloader.)
Therefore, the bootloader must be able to obtain boot medium services
not just once and for all, but on a back-and-forth basis. There needs
to be an API between the boot loader and the firmware, and just
"stuffing it into memory" doesn't count.
-hpa
"H. Peter Anvin" <[email protected]> writes:
> Eric W. Biederman wrote:
>
> > What is magic about interactivity? What makes this a different
> > problem? We approach booting from totally different perspectives,
> > which makes communicating clearly hard.
> >
> > If you spell out individual problems I will show you how I would solve
> > them.
> >
>
>
> It makes it a very different problem because YOU DON'T KNOW WHAT YOU'RE BOOTING
> UNTIL THE USER TELLS YOU.
>
> In fact, depending on just exactly what you're doing, you might not even know
> what you're booting until you have already gotten several items downloaded
> (consider, for example, a device-probing bootloader.)
>
> Therefore, the bootloader must be able to obtain boot medium services not just
> once and for all, but on a back-and-forth basis. There needs to be an API
> between the boot loader and the firmware, and just "stuffing it into memory"
> doesn't count.
If you are correct, then there is a fundamental design problem with my
Linux Booting Linux code. Because that is exactly what I do. I stuff
the kernel in memory and jump to it. Once the new kernel starts there is
no back and forth. _Please_ help me understand why this back and
forth is needed.
Here is my experience. Non-interactive etherboot, doesn't know what
it is booting, or where it is booting from until the DHCP server tells
it. Then it gets a file from a TFTP server and boots that.
When booting the Linux kernel it never attempts to do a back and forth
via the firmware to the boot medium. Instead someone has a clue about
what the boot medium was and it mounts that medium using its own
drivers. Booting a rescue cd is a good example.
_Please_ help me find the flaw in my understanding.
Eric W. Biederman wrote:
>
> If you are correct, then there is a fundamental design problem with my
> Linux Booting Linux code. Because that is exactly what I do. I stuff
> the kernel in memory and jump to it. Once the new kernel starts there is
> no back and forth. _Please_ help me understand why this back and
> forth is needed.
>
> Here is my experience. Non-interactive etherboot, doesn't know what
> it is booting, or where it is booting from until the DHCP server tells
> it. Then it gets a file from a TFTP server and boots that.
>
This is one subcase, but one of many.
>
> When booting the Linux kernel it never attempts to do a back and forth
> via the firmware to the boot medium. Instead someone has a clue about
> what the boot medium was and it mounts that medium using its own
> drivers. Booting a rescue cd is a good example.
>
> _Please_ help me find the flaw in my understanding.
>
The flaw in your understanding comes in when you want to run maintenance
on a system, reinstall it, install a system for which you don't have
drivers, etc. Otherwise you're basically requiring the memory on the
target system to contain every driver that could possibly exist, not
just today but in the future.
-hpa
"H. Peter Anvin" <[email protected]> writes:
> The flaw in your understanding comes in when you want to run maintenance on a
> system, reinstall it, install a system for which you don't have drivers, etc.
> Otherwise you're basically requiring the memory on the target system to contain
> every driver that could possibly exist, not just today but in the future.
It does seem to be a requirement to contain every driver for at least
one class of devices in RAM. So it looks like firmware, to support
ease of administration, needs to support loading both user-space
programs and kernel-space programs. At a first approximation I
agree, but I need to think on this some more.
For the acceptance of a Linux booting Linux patch this case isn't a
problem because it is only an enabler, that is good news.
For the firmware case with LinuxBIOS that is an element I had not
thought of. And it is a problem because it hinders automation. The
only place a firm line has been drawn so far is where LinuxBIOS
stops, and where the in firmware bootloader begins. Embedded systems
don't in general need or want the flexibility a full general purpose
firmware provides so can be ignored, but for general purpose systems
this is an issue. The original design called for using the Linux
kernel (in which case a trivial answer would be forthcoming) but in
many instances you cannot compile the Linux kernel small enough to be
useful.
It can be argued that general purpose systems have enough ram that
putting drivers for all mass produced devices in ram is possible, and
practical. But that is a cop out.
Eric
Eric W. Biederman wrote:
>
> It can be argued that general purpose systems have enough ram that
> putting drivers for all mass produced devices in ram is possible, and
> practical. But that is a cop out.
>
Indeed. Worse, it may not be possible for the *boot medium* to hold all
those devices...
-hpa
"H. Peter Anvin" <[email protected]> writes:
> Eric W. Biederman wrote:
>
> > It can be argued that general purpose systems have enough ram that
> > putting drivers for all mass produced devices in ram is possible, and
> > practical. But that is a cop out.
> >
>
>
> Indeed. Worse, it may not be possible for the *boot medium* to hold all those
> devices...
O.k. I have been thinking about this some more, and I have come up with a couple
of alternate solutions.
The simplest is the observation that right now 10MB is about what it
takes to hold every Linux driver out there. So all you really need is
a 16MB system, to avoid a device probing loader. And probably
noticeably less than that. The only systems I see having real
problems are old systems where device enumeration is not reliable, and
require human intervention anyway.
A second is to just make certain there is some kind of fallback path
so if the image is too large have a way to load a smaller one. When
you consider that older systems had less memory it has a reasonable
chance of working properly.
My final and favorite is to take an ELF image, define a couple of ELF
note types, and add a bunch of those notes saying which pieces are
hardware dependent. So a smart ELF loader can prune the image as it
is loaded, and a stupid one will just attempt to load everything. And
with the setup for this not being bootloader specific it will probably
encourage device pruning loaders.
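A rough sketch in C of the decision such a smart loader would make.
No ELF note type of this kind is defined anywhere; struct hw_note is a
hypothetical decoded form of such a note, and the use of PCI
vendor/device IDs to name hardware is an assumption made only for
illustration. The sketch covers the load-or-skip decision alone and
says nothing about whether the pruned image still runs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical decoded form of one "hardware dependent" note: segment
 * number seg is only needed when PCI device vendor:device is present.
 */
struct hw_note {
    unsigned seg;
    uint16_t vendor;
    uint16_t device;
};

struct pci_id {
    uint16_t vendor;
    uint16_t device;
};

static int device_present(const struct pci_id *found, size_t nfound,
                          uint16_t vendor, uint16_t device)
{
    size_t i;

    for (i = 0; i < nfound; i++)
        if (found[i].vendor == vendor && found[i].device == device)
            return 1;
    return 0;
}

/*
 * Decide whether a smart loader should load segment seg. Segments no
 * note mentions are always loaded. A segment that notes do mention is
 * loaded only if at least one of its devices was found. A loader that
 * ignores the notes loads everything, which is the "stupid" case.
 */
int should_load(unsigned seg, const struct hw_note *notes, size_t nnotes,
                const struct pci_id *found, size_t nfound)
{
    size_t i;
    int mentioned = 0;

    assert(notes != NULL || nnotes == 0);
    assert(found != NULL || nfound == 0);
    for (i = 0; i < nnotes; i++) {
        if (notes[i].seg != seg)
            continue;
        mentioned = 1;
        if (device_present(found, nfound, notes[i].vendor, notes[i].device))
            return 1;
    }
    return !mentioned;
}
```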
Am I being optimistic or are there any pressing cases for callbacks to
the firmware?
Eric
Eric W. Biederman wrote:
>
> The simplest is the observation that right now 10MB is about what it
> takes to hold every Linux driver out there. So all you really need is
> a 16MB system, to avoid a device probing loader. And probably
> noticeably less than that. The only systems I see having real
> problems are old systems where device enumeration is not reliable, and
> require human intervention anyway.
>
A floppy disk is 1.44 MB.
-hpa
Eric W. Biederman wrote:
>
> O.k. I have been thinking about this some more, and I have come up with a couple
> of alternate solutions.
>
> The simplest is the observation that right now 10MB is about what it
> takes to hold every Linux driver out there. So all you really need is
> a 16MB system, to avoid a device probing loader. And probably
> noticeably less than that. The only systems I see having real
> problems are old systems where device enumeration is not reliable, and
> require human intervention anyway.
>
> A second is to just make certain there is some kind of fallback path
> so if the image is too large have a way to load a smaller one. When
> you consider that older systems had less memory it has a reasonable
> chance of working properly.
>
> My final and favorite is to take an ELF image, define a couple of ELF
> note types, and add a bunch of those notes saying which pieces are
> hardware dependent. So a smart ELF loader can prune the image as it
> is loaded, and a stupid one will just attempt to load everything. And
> with the setup for this not being bootloader specific it will probably
> encourage device pruning loaders.
>
> Am I being optimistic or are there any pressing cases for callbacks to
> the firmware?
>
Ok, now let me ask the question that hopefully should be obvious to
everyone now...
WHAT'S THE POINT?
All you're doing is an awfully complex song and dance to *avoid*
implementing a solution that, while imperfect, is thoroughly established
and has worked for 20 years.
-hpa
On Sunday 03 February 2002 02:39 pm, H. Peter Anvin wrote:
> Eric W. Biederman wrote:
> > The simplest is the observation that right now 10MB is about what it
> > takes to hold every Linux driver out there. So all you really need is
> > a 16MB system, to avoid a device probing loader. And probably
> > noticeably less than that. The only systems I see having real
> > problems are old systems where device enumeration is not reliable, and
> > require human intervention anyway.
>
> A floppy disk is 1.44 MB.
And el-torito bootable CDs basically glue a floppy image onto the front of
the CD and lie to the bios to say "oh yeah, I'm a floppy, boot from me".
Luckily, they can use the old 2.88 "extended density" floppy standard IBM
tried to launch years ago which never got anywhere, but which most BIOS's
recognize. But that's still a fairly small place to try to stick a whole
system...
> -hpa
Rob
Rob Landley wrote:
>
> And el-torito bootable CDs basically glue a floppy image onto the front of
> the CD and lie to the bios to say "oh yeah, I'm a floppy, boot from me".
> Luckily, they can use the old 2.88 "extended density" floppy standard IBM
> tried to launch years ago which never got anywhere, but which most BIOS's
> recognize. But that's still a fairly small place to try to stick a whole
> system...
>
They can be; they can also run in a mode where they can access arbitrary
blocks on the CD (ISOLINUX runs in this mode.)
-hpa
Rob Landley wrote:
>
> You can pivot_root after the bios hands control over to the kernel, sure.
> But if the bios can actually boot from arbitrary blocks on the CD before the
> kernel takes over, this is news to me. And for the kernel to read from the
> CD, it needs its drivers already loaded for it, so they have to be in that
> 2.88 megs somewhere. (Statically linked, ramdisk, etc.)
>
No, the boot specification allows direct access to the CD. See the El
Torito specification, specifically the parts that talk about "no
emulation" mode.
> I was just pointing out that small boot environments weren't going away any
> time soon, even if floppy drivers were to finally manage it. When you
> install your system, the initial image you bootstrap from is generally tiny.
>
> Now I'm not so familiar with that etherboot stuff, intel's whatsis
> specification (PXE?) for sucking a bootable image through the network. All
> I've ever seen that boot is a floppy image, but I don't know if that's a
> limitation in the spec or just the way people are using it...
That's just the way *some* people are using it. Look at PXELINUX for
something that doesn't. PXELINUX can use the UDP API provided by the
PXE specification to download arbitrary files, specified at runtime, via
TFTP.
-hpa
On Sunday 03 February 2002 05:24 pm, H. Peter Anvin wrote:
> Rob Landley wrote:
> > And el-torito bootable CDs basically glue a floppy image onto the front
> > of the CD and lie to the bios to say "oh yeah, I'm a floppy, boot from
> > me". Luckily, they can use the old 2.88 "extended density" floppy
> > standard IBM tried to launch years ago which never got anywhere, but
> > which most BIOS's recognize. But that's still a fairly small place to
> > try to stick a whole system...
>
> They can be; they can also run in a mode where they can access arbitrary
> blocks on the CD (ISOLINUX runs in this mode.)
>
> -hpa
You can pivot_root after the bios hands control over to the kernel, sure.
But if the bios can actually boot from arbitrary blocks on the CD before the
kernel takes over, this is news to me. And for the kernel to read from the
CD, it needs its drivers already loaded for it, so they have to be in that
2.88 megs somewhere. (Statically linked, ramdisk, etc.)
I was just pointing out that small boot environments weren't going away any
time soon, even if floppy drivers were to finally manage it. When you
install your system, the initial image you bootstrap from is generally tiny.
Now I'm not so familiar with that etherboot stuff, intel's whatsis
specification (PXE?) for sucking a bootable image through the network. All
I've ever seen that boot is a floppy image, but I don't know if that's a
limitation in the spec or just the way people are using it...
And of course you could always do some variant of two kernel monte...
Rob
On Sunday 03 February 2002 06:01 pm, H. Peter Anvin wrote:
> Rob Landley wrote:
> > You can pivot_root after the bios hands control over to the kernel, sure.
> > But if the bios can actually boot from arbitrary blocks on the CD before
> > the kernel takes over, this is news to me. And for the kernel to read
> > from the CD, it needs its drivers already loaded for it, so they have to
> > be in that 2.88 megs somewhere. (Statically linked, ramdisk, etc.)
>
> No, the boot specification allows direct access to the CD. See the El
> Torito specification, specifically the parts that talk about "no
> emulation" mode.
Thanks. I'd missed that.
> > Now I'm not so familiar with that etherboot stuff, intel's whatsis
> > specification (PXE?) for sucking a bootable image through the network.
> > All I've ever seen that boot is a floppy image, but I don't know if
> > that's a limitation in the spec or just the way people are using it...
>
> That's just the way *some* people are using it. Look at PXELINUX for
> something that doesn't. PXELINUX can use the UDP API provided by the
> PXE specification to download arbitrary files, specified at runtime, via
> TFTP.
That one I suspected. (I used the TFTP setup on solaris and power PC years
ago, just didn't know if x86 bioses had caught up. Most of my BIOS
experience in the past few years has been with Dell machines, so my
expectations may have been unnecessarily lowered... :) On the other hand,
very few systems in the field seem to boot and install entirely through the
network unless they're meant to be diskless...
> -hpa
Rob
Rob Landley wrote:
>
> On the other hand,
> very few systems in the field seem to boot and install entirely through the
> network unless they're meant to be diskless...
>
Most systems built in 1999 or later seem to support diskless booting, with
the main exception being laptops.
-hpa
On 03 Feb 2002 11:43:08 -0700,
[email protected] (Eric W. Biederman) wrote:
>O.k. I have been thinking about this some more, and I have come up with a couple
>of alternate solutions....
>My final and favorite is to take an ELF image, define a couple of ELF
>note types, and add a bunch those notes saying which pieces are
>hardware dependent. So a smart ELF loader can prune the image as it
>is loaded, and a stupid one will just attempt to load everything. And
>with the setup for this not being bootloader specific it will probably
>encourage device pruning loaders.
That is not an ELF loader, it is an ELF *linker*. The vmlinux image
has had all the relocations fixed up, you no longer have the data
required to discard sections. To prune hardware dependent pieces means
moving data around and adjusting relocation entries. You have to go
back one stage, to the individual objects, and that means linking.
Dynamically linking the kernel is hard. You just reinvented insmod,
look at all the arch dependent linking code that has to cope with.
I suppose that you could put each hardware dependent piece in its own
section and ensure that the rest of the kernel did not refer to those
sections directly. The rest of the kernel would have to access the
hardware dependent code via a table that was fixed up by the loader.
In addition the optional sections would have to be position independent
code because it would not be known at kbuild time where the code would
finally be loaded and run. Seems like an awful lot of work.
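To make the table idea concrete, here is a minimal sketch in C. Everything
in it is hypothetical: the hw_slot structure, fixup_slot(),
init_present_hw() and the dummy init routines are made up for illustration
and come from no kernel tree. It only shows the rest of the kernel reaching
optional pieces through slots that a loader either fills in or leaves NULL,
and it ignores the position independence problem entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: entry point of one optional, hardware dependent piece. */
typedef int (*hw_init_fn)(void);

struct hw_slot {
    const char *name;   /* illustrative label only */
    hw_init_fn  init;   /* NULL if the loader pruned this piece */
};

/* Stand-ins for code that would live in the optional sections. */
int dummy_ide_init(void) { return 0; }
int dummy_net_init(void) { return 0; }

/* Loader side: record where a piece ended up, or NULL if it was dropped. */
void fixup_slot(struct hw_slot *table, size_t n, size_t idx, hw_init_fn where)
{
    assert(idx < n);
    table[idx].init = where;
}

/*
 * Kernel side: never reference the optional code directly, only through
 * the table. Returns how many pieces were present and initialised.
 */
size_t init_present_hw(const struct hw_slot *table, size_t n)
{
    size_t started = 0;

    for (size_t i = 0; i < n; i++) {
        if (table[i].init != NULL && table[i].init() == 0)
            started++;
    }
    return started;
}
```

A loader that drops a piece simply never calls fixup_slot() for it, and the
core code skips the empty slot.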
On Sun, 3 Feb 2002, H. Peter Anvin wrote:
> Rob Landley wrote:
>
> >
> > You can pivot_root after the bios hands control over to the kernel, sure.
> > But if the bios can actually boot from arbitrary blocks on the CD before the
> > kernel takes over, this is news to me. And for the kernel to read from the
> > CD, it needs its drivers already loaded for it, so they have to be in that
> > 2.88 megs somewhere. (Statically linked, ramdisk, etc.)
> >
>
>
> No, the boot specification allows direct access to the CD. See the El
> Torito specification, specifically the parts that talk about "no
> emulation" mode.
Is this the -hard-disk-boot option of mkisofs?
.TM.
Hi guys,
sorry for joining the discussion so late, but I'm quite busy these days
(about to move around half of the planet, to start). I followed your
discussion, and, while I haven't looked at the current kexec code, from
what I've read, it should be in great shape, and I'm looking forward to
seeing this in the kernel soon. I keep on running out of time for bootimg,
so I guess there's no competition now anyway. And I think it is very
important that we have a solid Linux-boots-Linux solution soon, in
order to counter the trends in growing boot loader complexity. (IMHO,
GRUB is an excellent example of impressive craftsmanship, but with a few
rather fundamental flaws in the overall design. Hell, I think even
LILO is too complex :-)
Having said this, I still think a bootimg-like API would be better than
a file based API. So, concerning this thread:
H. Peter Anvin wrote:
> Therefore, the bootloader must be able to obtain boot medium services
> not just once and for all, but on a back-and-forth basis. There needs
> to be an API between the boot loader and the firmware, and just
> "stuffing it into memory" doesn't count.
Hmm, I'm not sure about this point of going back and forth, and how it
relates to the design of the kernel boot loader (kexec, bootimg, etc.).
Of course, the boot loader should - before actually booting - be able
to probe hardware, execute a set of policy decisions (e.g. put the
driver for the punch card reader into the initrd), etc.
But this doesn't really affect whether a memory-based interface
(bootimg) or a file-based interface (the rest of the world :-) is
better.
My assumption is that you need to do some processing before telling the
kernel to reboot, and that you want to be able to do such processing in
user space (e.g. extract the current memory map and pass it to the new
kernel, forward the results of bus scans, create a RAM disk with driver
modules on the fly, etc.)
In particular, the "old" kernel may need to pass information obtained
from the firmware to the "new" kernel. With decent firmware, there
could also be user-provided data that needs to be propagated, such as
portions of the command line. Well-known information of this type can
be encoded in the kernel, but I think this just leads to bloat, as more
and more policy will have to be encoded there, let alone the packing
and unpacking issues.
That's where I'd see the advantage of an external program. That program
could take care of all these issues, minus a few low-level
synchronization tasks, and then launch the new kernel with all data at
easily accessible locations. Of course this program would have some
strong interdependencies with the kernel, but I think that, with time,
the interfaces would stabilize, as they have done in the case of the
existing boot protocol, etc.
Now, given that there will be a kernel-specific preloader or loader,
the question is whether a file-based interface is really useful. I
understand the point about debugging, but on the other hand, a simple
dump of the data to be passed to the kernel would be sufficient for
this purpose too.
What I don't like about a file-based interface is that it adds an
extra indirection: you must make sure you have a file system, and you
either need a program that generates this file, and another one that
does the rebooting, or you combine both into the same program - and
you're essentially at the point of having a single converter to what
could basically be an arbitrary API, just like bootimg.
Worse yet, the file-based interface kind of conveys the promise that
the preloader is actually not necessary. This creates an incentive to
keep things that way, so more and more policy will have to be added
to the kernel, simply because externalizing it would shatter that
cute "loader-less" image.
Also, in many cases, interactions between the kernel side of the boot
loader and the rest of the kernel would actually be a good thing to
have in user space anyway, e.g. the ability to shut down or
"immobilize" certain devices, or to retrieve device status
information.
There's another problem: a kernel image could in principle be more
generic than the hardware really is (e.g. I wouldn't be surprised if
the boot loader of the IBM mainframe guys knows a thing or two it
doesn't tell the kernel. And we have a similar situation on PCs, with
a very main board specific BIOS). If any of this system-specific
information for the boot loader is persistent, this would have to be
encoded in the ELF image. So you have the mandatory ELF-to-ELF step
again.
As I've shown with bootimg, it's pretty trivial to load all kinds of
formats (including ELF) via a memory-based interface, and you enjoy
considerable freedom in how you generate the data in memory. If you
want, you can even go and modify in-kernel data structures directly,
so you don't need a nice and clean interface for each and every bit
of data, but you can evolve interfaces when necessary.
In cases where a boot (pre)loader program wouldn't be desirable, a
set of library functions could serve the same purpose. In fact, the
boot (pre)loader should be based on them, too.
So, while a file-based interface looks cool, I think a "thin"
memory-based interface will serve us better in the long run.
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Lausanne, CH [email protected] /
/_http://icawww.epfl.ch/almesberger/_____________________________________/
Marco Colombo wrote:
>>
>>No, the boot specification allows direct access to the CD. See the El
>>Torito specification, specifically the parts that talk about "no
>>emulation" mode.
>>
> Is this the -hard-disk-boot option of mkisofs?
>
No, it's -no-emul-boot
-hpa
Werner Almesberger wrote:
[...lots of good stuff...]
>
> Worse yet, the file-based interface kind of conveys the promise that
> the preloader is actually not necessary. This creates an incentive to
> keep things that way, so more and more policy will have to be added
> to the kernel, simply because externalizing it would shatter that
> cute "loader-less" image.
>
[...]
>
> So, while a file-based interface looks cool, I think a "thin"
> memory-based interface will serve us better in the long run.
>
I agree with this sentiment. As far as the file-based interface implying
it is self-contained/loaderless, we can just look at this thread as well
as Linux history -- the boot sector in a Linux kernel is basically
useless and doesn't even work on a lot of hardware today, yet people are
still refusing to deprecate it "because it's so convenient"...
-hpa
Werner Almesberger <[email protected]> writes:
> Hi guys,
>
> Having said this, I still think a bootimg-like API would be better than
> a file based API. So, concerning this thread:
I have come to agree with this sentiment. However I do have a small
issue with the current bootimg api. Everything is done in page sized
chunks. Which feels like it is exporting too much of the current
implementation.
struct boot_image {
    void **image_map;        /* pointers to image pages in user memory */
    int pages;               /* length in pages */
    unsigned long *load_map; /* list of destination pages (physical addr) */
    unsigned long start;     /* jump to this physical address */
    int flags;               /* for future use, must be zero for now */
};
asmlinkage int sys_bootimg(struct boot_image *user_dsc)
My preference goes to something like:
struct segment {
    void *vaddr;         /* virtual address of the data to start with */
    unsigned long paddr; /* physical address to copy this segment to */
    unsigned long size;  /* size in bytes of this segment */
};
asmlinkage int sys_kexec(struct segment *segments, int nr_segments,
                         unsigned long start_addr);
The big difference, I believe, is that you will have far fewer segments
in my scheme, and since everything is at byte granularity the user
space implementation doesn't have to know how the kernel side of
the implementation interacts with pages.
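As a rough illustration of the difference in granularity, the sketch below
fills a three-entry segment list (setup code, kernel, ramdisk) and counts
how many entries a page-granular description of the same data would need.
It is only a sketch under assumptions: the 4K page size, the physical
addresses, and the helper names describe_image() and pages_needed() are
all illustrative, each segment is assumed to start on a page boundary, and
nothing here calls the proposed sys_kexec().

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL   /* assumption: 4K pages, as on x86 */

struct segment {
    void *vaddr;          /* virtual address of the data to start with */
    unsigned long paddr;  /* physical address to copy this segment to */
    unsigned long size;   /* size in bytes of this segment */
};

/* Entries a page-granular description would need for the same data. */
unsigned long pages_needed(const struct segment *segs, int nr_segments)
{
    unsigned long pages = 0;

    assert(nr_segments >= 0);
    for (int i = 0; i < nr_segments; i++)
        pages += (segs[i].size + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
    return pages;
}

/* Fill a three-entry list and return the number of segments used. */
int describe_image(struct segment segs[3],
                   void *setup, unsigned long setup_size,
                   void *kernel, unsigned long kernel_size,
                   void *ramdisk, unsigned long ramdisk_size)
{
    /* The physical addresses below are placeholders, not a boot protocol. */
    segs[0] = (struct segment){ setup,   0x90000UL,  setup_size };
    segs[1] = (struct segment){ kernel,  0x100000UL, kernel_size };
    segs[2] = (struct segment){ ramdisk, 0x800000UL, ramdisk_size };
    return 3;
}
```

For a 5000 byte setup block, a 1000000 byte kernel and a 300000 byte
ramdisk, that is 3 segment entries against 321 page entries.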
I hadn't realized until just a minute ago when I looked again that you
were exporting pages.
> My assumption is that you need to do some processing before telling the
> kernel to reboot, and that you want to be able to do such processing in
> user space (e.g. extract the current memory map and pass it to the new
> kernel, forward the results of bus scans, create a RAM disk with driver
> modules on the fly, etc.)
>
> In particular, the "old" kernel may need to pass information obtained
> from the firmware to the "new" kernel.
Except for the case of Loadlin where the old firmware is destroyed,
and you cannot requery the firmware. You have a more robust solution
if you let the new kernel query the firmware itself. In general it is
not safe to use firmware device drivers, but otherwise you are o.k.
> With decent firmware, there
> could also be user-provided data that needs to be propagated, such as
> portions of the command line. Well-known information of this type can
> be encoded in the kernel, but I think this just leads to bloat, as more
> and more policy will have to be encoded there, let alone the packing
> and unpacking issues.
For the most part I agree, that the bootimg type interface will avoid
bloat. At the same time, some of this information that we would like
to pass is easier to get at in kernel space, oh well.
> Also, in many cases, interactions between the kernel side of the boot
> loader and the rest of the kernel would actually be a good thing to
> have in user space anyway, e.g. the ability to shut down or
> "immobilize" certain devices, or to retrieve device status
> information.
Possibly.
> As I've shown with bootimg, it's pretty trivial to load all kinds of
> formats (including ELF) via a memory-based interface, and you enjoy
> considerable freedom in how you generate the data in memory. If you
> want, you can even go and modify in-kernel data structures directly,
> so you don't need a nice and clean interface for each and every bit
> of data, but you can evolve interfaces when necessary.
I will stop just a moment to say it is extremely nasty to read the ELF
section header instead of the ELF program header for boot purposes.
For an ELF static executable it is totally valid not to have a section
header.
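For what it is worth, a loader that sticks to the program header can be
quite small. The following is a minimal sketch, not the kexec or bootimg
code: it assumes a Linux-style <elf.h>, a 32bit image in host byte order
that is already in memory, and the load_chunk structure and
collect_load_chunks() name are made up for illustration. It never looks at
e_shoff, so an image without a section header table works fine.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record of one loadable chunk of the file. */
struct load_chunk {
    unsigned long file_off;  /* where the data starts in the file */
    unsigned long paddr;     /* physical address to load it at */
    unsigned long filesz;    /* bytes present in the file */
    unsigned long memsz;     /* bytes occupied in memory (>= filesz) */
};

/* Returns the number of PT_LOAD entries stored in out, or -1 on error. */
int collect_load_chunks(const unsigned char *image, size_t len,
                        struct load_chunk *out, int max_out)
{
    Elf32_Ehdr eh;
    int found = 0;

    assert(image != NULL && out != NULL);
    if (len < sizeof eh)
        return -1;
    memcpy(&eh, image, sizeof eh);
    if (memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||
        eh.e_ident[EI_CLASS] != ELFCLASS32 ||
        eh.e_phentsize != sizeof(Elf32_Phdr))
        return -1;
    /* The whole program header table must lie inside the image. */
    if (eh.e_phoff > len ||
        (size_t)eh.e_phnum * sizeof(Elf32_Phdr) > len - eh.e_phoff)
        return -1;

    for (int i = 0; i < eh.e_phnum; i++) {
        Elf32_Phdr ph;

        memcpy(&ph, image + eh.e_phoff + (size_t)i * sizeof ph, sizeof ph);
        if (ph.p_type != PT_LOAD)
            continue;            /* notes etc. are not loaded */
        if (ph.p_offset > len || ph.p_filesz > len - ph.p_offset)
            return -1;           /* segment data runs off the image */
        if (found == max_out)
            return -1;           /* caller's array is too small */
        out[found].file_off = ph.p_offset;
        out[found].paddr    = ph.p_paddr;
        out[found].filesz   = ph.p_filesz;
        out[found].memsz    = ph.p_memsz;
        found++;
    }
    return found;
}
```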
> In cases where a boot (pre)loader program wouldn't be desirable, a
> set of library functions could serve the same purpose. In fact, the
> boot (pre)loader should be based on them, too.
Actually a library of functions to do the conversions sounds very
reasonable.
> So, while a file-based interface looks cool, I think a "thin"
> memory-based interface will serve us better in the long run.
I actually pretty much agree. But mostly because it becomes easier to
do a clean shutdown. If you just wrote a file it is kind of hard to
have a read-only filesystem, plus it looks like it will reduce the
strife a little bit...
Eric
"H. Peter Anvin" <[email protected]> writes:
> Eric W. Biederman wrote:
>
> > The simplest is the observation that right now 10MB is about what it
> > takes to hold every Linux driver out there. So all you really need is
> > a 16MB system, to avoid a device probing loader. And probably
> > noticeably less than that. The only systems I see having real
> > problems are old systems where device enumeration is not reliable, and
> > require human intervention anyway.
> >
>
>
> A floppy disk is 1.44 MB.
Yes floppies are small. The nice thing is that there are only 2 or 3
floppy drivers in the kernel so it is not hard to include access to
the primary boot medium.
Though actually last time I checked you can still fit all of the
kernel's network drivers on a floppy, and it wouldn't surprise me if you
could do the same with cd drivers as well.
For other media you can reduce the number of times you interact with
the user to exactly once, and that is worth handling. For a floppy
you either have enough room for all of your drivers or you don't.
This doesn't really appear to affect the number of user interactions,
and doesn't seem to me to be a case where the presence or absence of
firmware callbacks makes a difference.
The only difference I see with firmware callbacks is whether you are
working from BIOS user space or from kernel user space.
Eric
Keith Owens <[email protected]> writes:
> On 03 Feb 2002 11:43:08 -0700,
> [email protected] (Eric W. Biederman) wrote:
> >O.k. I have been thinking about this some more, and I have come up with a couple
> >of alternate solutions....
> >My final and favorite is to take an ELF image, define a couple of ELF
> >note types, and add a bunch of those notes saying which pieces are
> >hardware dependent. So a smart ELF loader can prune the image as it
> >is loaded, and a stupid one will just attempt to load everything. And
> >with the setup for this not being bootloader specific it will probably
> >encourage device pruning loaders.
>
> That is not an ELF loader, it is an ELF *linker*. The vmlinux image
> has had all the relocations fixed up, you no longer have the data
> required to discard sections. To prune hardware dependent pieces means
> moving data around and adjusting relocation entries. You have to go
> back one stage, to the individual objects, and that means linking.
Not if what you are actually pruning is your cpio archive of modules
that will become your initramfs. I admit insmod then needs to run to
insert those modules.
> Seems like an awful lot of work.
It may actually be, on the setup side. But any solution that is
set up to run on all x86 platforms is a lot of work. On the bootloader
side adding a file to an initramfs is the same complexity as removing one.
Eric
"H. Peter Anvin" <[email protected]> writes:
> Ok, now let me ask the question that hopefully should be obvious to everyone
> now...
>
> WHAT'S THE POINT?
>
> All you're doing is an awfully complex song and dance to *avoid* implementing a
> solution that, while imperfect, is thoroughly established and has worked for 20
> years.
The x86 BIOS doesn't work for me.
As for a complex song and dance: that is the complex song and dance
of double checking my design. I am working through and making certain
there are no important cases that I have missed. And the case of a
kernel that can boot on all machines, but has problems because of
limited memory resources, is one case I admit I had not considered.
Beyond that there are some real advantages from my perspective to
solutions that involve only one transaction. You greatly increase the
odds that your code does not hit a code path that hasn't been tested
and doesn't work.
And I have no intentions of implementing PXE on top of the linux
kernel, just so I can netboot. Which is what I would have to do to
use a 20 year old solution.
Plus, by working with the ELF format, except for the exact differences of
where code loads in memory, I solve the same problem for all linux
platforms, and I just need to port kexec to have it work.
Eric
Alan Cox wrote:
>>>A floppy disk is 1.44 MB.
>>>
>>Yes floppies are small. The nice thing is that there are only 2 or 3
>>floppy drivers in the kernel so it is not hard to include access to
>>the primary boot medium.
>>
>
> Big problems are:
>
> - Floppies are fast becoming optional
> - USB floppies require the entire USB and hotplug layer
> - USB floppies require the scsi layer which is not small either
> - Libretto style non USB/Cardbus PCMCIA floppies are not supported
>
- Some floppies are actually firmware emulations, and you have no real
clue what they actually do.
-hpa
> > A floppy disk is 1.44 MB.
>
> Yes floppies are small. The nice thing is that there are only 2 or 3
> floppy drivers in the kernel so it is not hard to include access to
> the primary boot medium.
Big problems are:
- Floppies are fast becoming optional
- USB floppies require the entire USB and hotplug layer
- USB floppies require the scsi layer which is not small either
- Libretto style non USB/Cardbus PCMCIA floppies are not supported
Eric W. Biederman wrote:
> I have come to agree with this sentiment.
Great !
> However I do have a small issue with the current bootimg api.
> Everything is done in page sized chunks. Which feels like it is
> exporting too much of the current implementation.
Well, it keeps things simple for the kernel, and bootimg(8) needs
to know the target architecture anyway. But there isn't really a
design reason why it would have to use pages, agreed.
> Except for the case of Loadlin where the old firmware is destroyed,
> and you cannot requery the firmware. You have a more robust solution
> if you let the new kernel query the firmware itself.
Yes, I was thinking of
- BIOS does IDE bus scan, boots boot loader kernel
- first kernel does IDE bus scan again, boots real kernel
- real kernel does IDE bus scan again
It should be possible to avoid at least the third IDE bus scan, at
least as an optimization.
> For the most part I agree, that the bootimg type interface will avoid
> bloat. At the same time, some of this information that we would like
> to pass is easier to get at in kernel space, oh well.
You can always look it up in /dev/mem, just like bootimg(1) did :-)
BTW, that's what I like about this approach: incremental development
is much easier this way, and you can hide all the ugly spots in the
library, if necessary.
> I will stop just a moment to say it is extremely nasty to read the ELF
> section header instead of the ELF program header for boot purposes.
> For an ELF static executable it is totally valid not to have a section
> header.
Touché ;-) I admit that I'm not much of an ELF expert. This was just
a surprisingly easy hack, so I was content with what I got.
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Lausanne, CH [email protected] /
/_http://icawww.epfl.ch/almesberger/_____________________________________/
Werner Almesberger wrote:
>
> Well, it keeps things simple for the kernel, and bootimg(8) needs
> to know the target architecture anyway. But there isn't really a
> design reason why it would have to use pages, agreed.
>
I looked at this point at some time, and I found that it made it a lot
easier to write code to permute memory arbitrarily, as may be required.
The reason is that you really want to keep an array that's O(N) in the
size of memory to keep track of where things are, and in order to do that,
realistically, you need to have some reasonably large granularity -- 4K
pages are just about right.
Of course, maybe I was just using a dumb algorithm... :)
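For illustration only, an array of that kind might look like the sketch
below. This is not the code referred to above; the frame_map structure and
its helpers are hypothetical, and a 4K page is assumed. The point is just
the arithmetic: one entry per page frame means 16384 entries for 64MB,
where byte granularity would need one entry per byte.

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_PAGE_SHIFT 12   /* assumption: 4K pages */

/* One entry per physical page frame: which image page lives there now. */
struct frame_map {
    long *owner;               /* -1 means the frame holds no image page */
    unsigned long nr_frames;
};

/* Allocate a map covering mem_bytes of memory; returns 0 on success. */
int frame_map_init(struct frame_map *m, unsigned long mem_bytes)
{
    assert(m != NULL);
    m->nr_frames = mem_bytes >> SKETCH_PAGE_SHIFT;
    if (m->nr_frames == 0)
        return -1;
    m->owner = malloc(m->nr_frames * sizeof *m->owner);
    if (m->owner == NULL)
        return -1;
    for (unsigned long i = 0; i < m->nr_frames; i++)
        m->owner[i] = -1;
    return 0;
}

/* Record that image page number image_page now sits at paddr. */
void frame_map_place(struct frame_map *m, unsigned long paddr, long image_page)
{
    unsigned long frame = paddr >> SKETCH_PAGE_SHIFT;

    assert(frame < m->nr_frames);
    m->owner[frame] = image_page;
}

/* Which image page, if any, occupies the frame containing paddr? */
long frame_map_lookup(const struct frame_map *m, unsigned long paddr)
{
    unsigned long frame = paddr >> SKETCH_PAGE_SHIFT;

    assert(frame < m->nr_frames);
    return m->owner[frame];
}
```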
-hpa
"H. Peter Anvin" <[email protected]> writes:
> Werner Almesberger wrote:
>
> >
> > Well, it keeps things simple for the kernel, and bootimg(8) needs
> > to know the target architecture anyway. But there isn't really a
> > design reason why it would have to use pages, agreed.
> >
>
>
> I looked at this point at some time, and I found that it made it a lot
> easier to write code to permute memory arbitrarily, as may be required.
> The reason is that you really want to keep an array that's O(N) in the
> size of memory to keep track of where things are, and in order to do that,
> realistically, you need to have some reasonably large granularity -- 4K
> pages are just about right.
On the kernel side I still plan to use pages, though my ideal case
would be to allocate one great big slab of non-conflicting memory, and
just copy everything to where it needs to go.
On the user space side what I am proposing actually increases the
granularity quite a bit. For a linux kernel with a ramdisk you should
only need to pass the kernel 3 segments. (Assuming everything is
contiguous in user space memory). The setup code, the kernel, and the
ramdisk.
> Of course, maybe I was just using a dumb algorithm... :)
Perhaps. So far I don't need an array that is O(N) in the size of
memory, just O(N) in the size of the image I am copying. The
permutations that are necessary to avoid conflicts in the pathological
cases are a pain. But I've already done that...
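To show what kind of conflict is meant, here is a small sketch. It is not
the kexec algorithm, which is not described in this thread; the copy_job
structure and both function names are made up. It only detects the
pathological case where copying one piece to its destination would
overwrite the source of a piece that has not been copied yet. The cost
depends on the number of pieces, not on the size of memory. Address
wrap-around is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one piece of the image to be moved. */
struct copy_job {
    unsigned long src;   /* physical address the data sits at now */
    unsigned long dst;   /* physical address it must end up at */
    unsigned long size;  /* length in bytes */
};

/* Do the byte ranges [a, a+a_size) and [b, b+b_size) intersect? */
int ranges_overlap(unsigned long a, unsigned long a_size,
                   unsigned long b, unsigned long b_size)
{
    if (a_size == 0 || b_size == 0)
        return 0;
    return a < b + b_size && b < a + a_size;
}

/*
 * Count the pairs where copying job i (in list order) would clobber the
 * still uncopied source of a later job j. Zero means a plain in-order
 * copy is safe.
 */
int count_conflicts(const struct copy_job *jobs, int n)
{
    int conflicts = 0;

    assert(jobs != NULL || n == 0);
    for (int i = 0; i < n; i++) {
        for (int j = i + 1; j < n; j++) {
            if (ranges_overlap(jobs[i].dst, jobs[i].size,
                               jobs[j].src, jobs[j].size))
                conflicts++;
        }
    }
    return conflicts;
}
```

In the simple cases reordering the jobs is enough to get the count to
zero; the painful ones are those where no ordering works and pages have
to be moved out of the way first.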
Anyway now it's back to the trenches...
Eric