Hi Linus,
Here is the list of features which are being actively
pushed, not NAK'ed, and are not in 2.5.45. There are 13 of them, as
appropriate for Halloween.
Most were submitted repeatedly *well* before the freeze. It'd
be nice for you to give feedback, and decide which ones (if any) are
still up for review.
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
From: http://www.kernel.org/pub/linux/kernel/people/rusty/2.6-not-in-yet/
Rusty's Remarkably Unreliable List of Pending 2.6 Features
[aka. Rusty's Snowball List]
A: Author
M: lkml posting describing patch
D: Download URL
S: Size of patch, number of files altered (source/config), number of new files.
X: Impact summary (only parts of patch which alter existing source files, not config/make files)
T: Diffstat of whole patch
N: Random notes
In rough order of invasiveness (number of altered source files):
In-kernel Module Loader and Unified parameter support
A: Rusty Russell
D: http://www.kernel.org/pub/linux/kernel/people/rusty/patches/Module/
S: 841 kbytes, 302/36 files altered, 22 new
T: Diffstat
X: Summary patch (598k)
N: Requires new modutils
Fbdev Rewrite
A: James Simmons
M: http://www.uwsg.iu.edu/hypermail/linux/kernel/0111.3/1267.html
D: http://phoenix.infradead.org/~jsimmons/fbdev.diff.gz
S: 4852 kbytes, 168/29 files altered, 124 new
T: Diffstat
X: Summary patch (182k)
Linux Trace Toolkit (LTT)
A: Karim Yaghmour
M: http://www.uwsg.iu.edu/hypermail/linux/kernel/0204.1/0832.html
M: http://marc.theaimsgroup.com/?l=linux-kernel&m=103491640202541&w=2
M: http://marc.theaimsgroup.com/?l=linux-kernel&m=103423004321305&w=2
M: http://marc.theaimsgroup.com/?l=linux-kernel&m=103247532007850&w=2
D: http://opersys.com/ftp/pub/LTT/ExtraPatches/patch-ltt-linux-2.5.44-vanilla-021026-2.2.bz2
S: 257 kbytes, 67/4 files altered, 9 new
T: Diffstat
X: Summary patch (90k)
statfs64
A: Peter Chubb
M: http://marc.theaimsgroup.com/?l=linux-kernel&m=103490436228016&w=2
D: http://marc.theaimsgroup.com/?l=linux-kernel&m=103490436228016&w=2
S: 42 kbytes, 53/0 files altered, 1 new
T: Diffstat
X: Summary patch (32k)
ext2/ext3 ACLs and Extended Attributes
A: Ted Ts'o
M: http://lists.insecure.org/lists/linux-kernel/2002/Oct/6787.html
B: bk://extfs.bkbits.net/extfs-2.5-update
D: http://thunk.org/tytso/linux/extfs-2.5/
S: 497 kbytes, 96/34 files altered, 34 new
T: Diffstat
X: Summary patch (167k)
ucLinux Patch (MMU-less support)
A: Greg Ungerer
M: http://lwn.net/Articles/11016/
D: http://www.uclinux.org/pub/uClinux/uClinux-2.5.x/linux-2.5.44uc3.patch.gz
S: 2218 kbytes, 25/34 files altered, 429 new
T: Diffstat
X: Summary patch (40k)
Crash Dumping (LKCD)
A: Matt Robinson, LKCD team
M: http://lists.insecure.org/lists/linux-kernel/2002/Oct/8552.html
D: http://lkcd.sourceforge.net/download/latest/
S: 18479 kbytes, 18/10 files altered, 10 new
T: Diffstat
X: Summary patch (18k)
POSIX Timer API
A: George Anzinger
M: http://marc.theaimsgroup.com/?l=linux-kernel&m=103553654329827&w=2
D: http://unc.dl.sourceforge.net/sourceforge/high-res-timers/hrtimers-posix-2.5.44-1.0.patch
S: 66 kbytes, 18/2 files altered, 4 new
T: Diffstat
X: Summary patch (21k)
Hotplug CPU Removal Support
A: Rusty Russell
D: http://www.kernel.org/pub/linux/kernel/people/rusty/patches/Hotcpu/hotcpu-cpudown.patch.gz
S: 32 kbytes, 16/0 files altered, 0 new
T: Diffstat
X: Summary patch (29k)
Hires Timers
A: George Anzinger
M: http://marc.theaimsgroup.com/?l=linux-kernel&m=103557676007653&w=2
M: http://marc.theaimsgroup.com/?l=linux-kernel&m=103557677207693&w=2
M: http://marc.theaimsgroup.com/?l=linux-kernel&m=103558349714128&w=2
D: http://unc.dl.sourceforge.net/sourceforge/high-res-timers/hrtimers-core-2.5.44-1.0.patch
D: http://unc.dl.sourceforge.net/sourceforge/high-res-timers/hrtimers-i386-2.5.44-1.0.patch
D: http://unc.dl.sourceforge.net/sourceforge/high-res-timers/hrtimers-hrposix-2.5.44-1.1.patch
S: 132 kbytes, 15/4 files altered, 10 new
T: Diffstat
X: Summary patch (44k)
N: Requires POSIX Timer API patch
EVMS
A: EVMS Team
M: http://www.uwsg.iu.edu/hypermail/linux/kernel/0208.0/0109.html
D: http://evms.sourceforge.net/patches/2.5.44/
S: 1101 kbytes, 7/10 files altered, 44 new
T: Diffstat
X: Summary patch (4k)
initramfs
A: Al Viro
M: http://www.cs.helsinki.fi/linux/linux-kernel/2001-30/0110.html
D: ftp://ftp.math.psu.edu/pub/viro/N0-initramfs-C21
S: 16 kbytes, 5/1 files altered, 2 new
T: Diffstat
X: Summary patch (5k)
Kernel Probes
A: Vamsi Krishna S
M: lists.insecure.org/linux-kernel/2002/Aug/1299.html
D: http://www.kernel.org/pub/linux/kernel/people/rusty/patches/Misc/kprobes.patch.gz
S: 18 kbytes, 4/2 files altered, 4 new
T: Diffstat
X: Summary patch (5k)
On Thu, 31 Oct 2002, Rusty Russell wrote:
>
> Here is the list of features which are being actively
> pushed, not NAK'ed, and are not in 2.5.45. There are 13 of them, as
> appropriate for Halloween.
I'm unlikely to be able to merge everything by tomorrow, so I will
consider tomorrow a submission deadline to me, rather than a merge
deadline. That said, I merged everything I'm sure I want to merge today,
and the rest I simply haven't had time to look at very much.
> In-kernel Module Loader and Unified parameter support
This apparently breaks things like DRI, which I'm fairly unhappy about,
since I think 3D is important.
> Fbdev Rewrite
This one is just huge, and I have little personal judgement on it.
> Linux Trace Toolkit (LTT)
I don't know what this buys us.
> statfs64
I haven't even seen it.
> ext2/ext3 ACLs and Extended Attributes
I don't know why people still want ACL's. There were noises about them for
samba, but I've not heard anything since. Are vendors using this?
> ucLinux Patch (MMU-less support)
I've seen this, it looks pretty ok.
> Crash Dumping (LKCD)
This is definitely a vendor-driven thing. I don't believe it has any
relevance unless vendors actively support it.
> POSIX Timer API
I think I'll do at least the API, but there were some questions about the
config options here, I think.
> Hotplug CPU Removal Support
No objections, but very little visibility into it either.
> Hires Timers
This one is likely another "vendor push" thing.
> EVMS
Not for the feature freeze, there are some noises that imply that SuSE may
push it in their kernels.
> initramfs
I want this.
> Kernel Probes
Probably.
Linus
On Wed, 30 Oct 2002, Linus Torvalds wrote:
> > ext2/ext3 ACLs and Extended Attributes
>
> I don't know why people still want ACL's. There were noises about them for
> samba, but I've not heard anything since. Are vendors using this?
Because People Are Stupid(tm). Because it's cheaper to put "ACL support: yes"
in the feature list under "Security" than to make sure that userland can cope
with anything more complex than "Me Og. Og see directory. Directory Og's.
Nobody change it". C.f. snake oil, P.T.Barnum and esp. LSM users
In message <[email protected]> you write:
>
> On Thu, 31 Oct 2002, Rusty Russell wrote:
> >
> > Here is the list of features which are being actively
> > pushed, not NAK'ed, and are not in 2.5.45. There are 13 of them, as
> > appropriate for Halloween.
>
> I'm unlikely to be able to merge everything by tomorrow, so I will
> consider tomorrow a submission deadline to me, rather than a merge
> deadline. That said, I merged everything I'm sure I want to merge today,
> and the rest I simply haven't had time to look at very much.
>
> > In-kernel Module Loader and Unified parameter support
>
> This apparently breaks things like DRI, which I'm fairly unhappy about,
> since I think 3D is important.
Yes, the patch stubs out inter_module_*, in favor of get_symbol() &
put_symbol().
This breaks the three users: one in drivers/mtd/ and two in
drivers/char/drm/. I have a patch which fixes them (untested), or I
can simply put the inter_module_* code back in.
> > Fbdev Rewrite
>
> This one is just huge, and I have little personal judgement on it.
It's been around for a while. Geert, Russell?
> > Linux Trace Toolkit (LTT)
>
> I don't know what this buys us.
Haven't looked at it.
> > statfs64
>
> I haven't even seen it.
It's fairly old, but Peter Chubb said there was some vendor interest
for v. large devices. Peter?
> > ext2/ext3 ACLs and Extended Attributes
>
> I don't know why people still want ACL's. There were noises about them for
> samba, but I've not heard anything since. Are vendors using this?
SAMBA needs them, which is why serious Samba boxes use XFS. Tridge,
Ted?
> > Hotplug CPU Removal Support
>
> No objections, but very little visibility into it either.
The controls are in driverfs etc, and that's always been in flux. 8(
The rest is v. small, basically extending ksoftirqd, workqueues and
migration threads to disable them. Then it's all arch-specific.
> > Hires Timers
>
> This one is likely another "vendor push" thing.
>
> > EVMS
>
> Not for the feature freeze, there are some noises that imply that SuSE may
> push it in their kernels.
They have, IIRC. Interestingly, it was less invasive (existing source
touched) than the LVM2/DM patch you merged.
> > initramfs
>
> I want this.
Good. The big payoff is moving stuff out of the kernel, which can't
really be done in a stable series.
> > Kernel Probes
>
> Probably.
Sent.
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
On Wed, 30 Oct 2002, Linus Torvalds wrote:
> On Thu, 31 Oct 2002, Rusty Russell wrote:
> > ext2/ext3 ACLs and Extended Attributes
>
> I don't know why people still want ACL's. There were noises about them for
> samba, but I've not heard anything since. Are vendors using this?
Yes, people use it. Not quite sure why though, I guess ACLs
buy some flexibility over the user/group/other model but if
the "unlimited groups" patch goes in (is in?) I'm happy ;)
Personally I do think either the unlimited groups patch or
ACLs are needed in order to sanely run a large anoncvs setup.
regards,
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
Current spamtrap: [email protected]
Linus Torvalds wrote:
> > Linux Trace Toolkit (LTT)
>
> I don't know what this buys us.
How about being able to:
- Debug synchronization problems among processes (there is no other
tool to do this, not gdb, not strace, not printf, ...)
- Measure exact time spent waiting for kernel and which other
processes a process had to wait for.
- Measure exact time it takes for an interrupt's effects to propagate
throughout the entire system.
- Understand exactly how the system behaves in response to input (what is
the exact sequence of processes that run when I press a key).
- Identify sporadic problems in very saturated systems. (thousands
of servers and one of them is doing weird stuff).
- etc.
Providing system tracing is a necessity for any sort of complex
application development and system monitoring. Some people simply
can't use Linux without this sort of tool and I am at pains to
explain to them why they actually have to patch their kernel to
be able to debug their inter-process synchronization problems.
Users don't have to patch their kernel to use gdb and I don't
see why they should need to patch their kernel to understand how
their various processes interact with the kernel and vice-versa.
Karim
===================================================
Karim Yaghmour
[email protected]
Embedded and Real-Time Linux Expert
===================================================
On Thu, Oct 31, 2002 at 02:00:31PM +1100, Rusty Russell wrote:
> > I don't know why people still want ACL's. There were noises about them for
> > samba, but I've not heard anything since. Are vendors using this?
>
> SAMBA needs them, which is why serious Samba boxes use XFS. Tridge,
> Ted?
XFS doesn't have ACLs either in plain 2.5.
> > Not for the feature freeze, there are some noises that imply that SuSE may
> > push it in their kernels.
>
> They have, IIRC. Interestingly, it was less invasive (existing source
> touched) than the LVM2/DM patch you merged.
But that's only because dm added stuff to the generic code where we
told it to. EVMS is a lot more code than dm, and it adds new discovery
code at the same time as we start moving stuff _out_ of the kernel
to initramfs.
If "SuSE has merged it" counts as a criterion, any IBM patch posted here
should get in; coming from Big Blue seems to be a basic merge criterion
in Nuernberg :)
* Rik van Riel ([email protected]) wrote:
> On Wed, 30 Oct 2002, Linus Torvalds wrote:
> > On Thu, 31 Oct 2002, Rusty Russell wrote:
>
> > > ext2/ext3 ACLs and Extended Attributes
> >
> > I don't know why people still want ACL's. There were noises about them for
> > samba, but I've not heard anything since. Are vendors using this?
>
> Yes, people use it. Not quite sure why though, I guess ACLs
> buy some flexibility over the user/group/other model but if
> the "unlimited groups" patch goes in (is in?) I'm happy ;)
>
> Personally I do think either the unlimited groups patch or
> ACLs are needed in order to sanely run a large anoncvs setup.
The feeling I got on this was the ability to let users define their own
groups. Perhaps I'm not following it closely enough but that was the
impression I got in terms of "what this does for us"; I'm probably
missing other things. Just that ability would be nice in my view
though. Isn't it something that's been in AFS for a long time too?
I've got a few friends who've played with AFS before (at CMU and the
like) and really enjoyed the ACLs there.
Just my thoughts,
Stephen
On Wed, 2002-10-30 at 20:31, Linus Torvalds wrote:
>
> On Thu, 31 Oct 2002, Rusty Russell wrote:
> >
> > Here is the list of features which are being actively
> > pushed, not NAK'ed, and are not in 2.5.45. There are 13 of them, as
> > appropriate for Halloween.
>
> I'm unlikely to be able to merge everything by tomorrow, so I will
> consider tomorrow a submission deadline to me, rather than a merge
> deadline. That said, I merged everything I'm sure I want to merge today,
> and the rest I simply haven't had time to look at very much.
>
>
> > ext2/ext3 ACLs and Extended Attributes
>
> I don't know why people still want ACL's. There were noises about them for
> samba, but I've not heard anything since. Are vendors using this?
>
There are a fair number of NAS vendors who do linux boxes with Samba
and XFS because of the ACL support, Quantum being the one Tridge now
works for by the way. The reason they want it is so they can support
the features NT folks are used to having in their file servers.
Now, we could just let the NT folks use NT servers instead....
Even getting XFS ACLs running in 2.5 requires part of this patch set.
Steve
> XFS doesn't have ACLs either in plain 2.5.
The existing NAS boxes that use Linux and XFS tend to base their
kernels on the 2.4-xfs tree from cvs on sgi.com. It works well and the
SGI guys have been very good about fixing problems when they crop up.
I think that the biggest beneficiary of adding extended attributes and
ACLs into ext3 for 2.6 would be more casual users (home, small office
etc) as they will then be able to use ACLs in Samba without the pain
of switching to a different kernel.
Cheers, Tridge
--
http://samba.org/~tridge/
> > > ext2/ext3 ACLs and Extended Attributes
> >
> > I don't know why people still want ACL's. There were noises about them for
> > samba, but I've not heard anything since. Are vendors using this?
>
> SAMBA needs them, which is why serious Samba boxes use XFS. Tridge,
> Ted?
oh yes, all the Linux based storage appliances use ACLs. Posix ACLs
aren't ideal for Samba, but they are *much* better than having no ACLs
at all. The Posix ACL code has been in Samba for a long time (getting
close to 3 years now?).
Eventually I'd like to see a combination of LSM with a new ACL system
give the ability to support full NT ACLs on Linux (which is also
needed for full nfsv4 support), but that is way too much to do for
the 2.6 kernel.
For the majority of windows users the mapping Samba does internally
between Posix ACLs and NT ACLs is sufficient for now.
I think that it would be a very good thing for Posix ACLs to be
included in the 2.6 kernel, especially in ext3.
Extended attributes are also important as they give a place to store
all the extra DOS info that has no other logical place in a posix
filesystem. For example, we can put the 'read only', 'archive', 'hidden'
and 'system' attributes there. If we don't have extended attributes
then we need to use a nasty kludge where these map to various unix
permission bits, but the mapping is terrible and doesn't give the
correct semantics (especially for things like read only on
directories).
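To make the extended-attribute approach described above concrete, here is a minimal sketch in C. It assumes a Linux system whose libc provides <sys/xattr.h> with setxattr() and getxattr(). The attribute name user.dosattr.example and all helper names are hypothetical, chosen only for this illustration; they are not taken from Samba or from the ext2/ext3 patch. The bit values are the classic DOS/FAT attribute bits.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/xattr.h>

/* Classic DOS/FAT attribute bits. */
#define DOS_READONLY 0x01u
#define DOS_HIDDEN   0x02u
#define DOS_SYSTEM   0x04u
#define DOS_ARCHIVE  0x20u

/* Hypothetical attribute name, used only in this sketch. */
#define DOS_XATTR_NAME "user.dosattr.example"

/* Encode the flag byte as a short hex string such as "0x21".
 * Returns the number of characters written (excluding the NUL). */
int dos_attr_encode(unsigned flags, char *buf, size_t len)
{
    return snprintf(buf, len, "0x%02x", flags & 0xffu);
}

/* Parse a string produced by dos_attr_encode(). Returns 0 on success,
 * -1 if the string is not a hex number that fits in one byte. */
int dos_attr_decode(const char *buf, unsigned *flags)
{
    char *end;
    unsigned long v = strtoul(buf, &end, 16);

    if (end == buf || *end != '\0' || v > 0xffUL)
        return -1;
    *flags = (unsigned)v;
    return 0;
}

/* Store the flags on a file; fails with -1 if the filesystem has no
 * extended attribute support, which is the case under discussion. */
int dos_attr_store(const char *path, unsigned flags)
{
    char buf[8];
    int n = dos_attr_encode(flags, buf, sizeof buf);

    return setxattr(path, DOS_XATTR_NAME, buf, (size_t)n, 0);
}

/* Read the flags back from a file. Returns 0 on success, -1 on error. */
int dos_attr_load(const char *path, unsigned *flags)
{
    char buf[8];
    ssize_t n = getxattr(path, DOS_XATTR_NAME, buf, sizeof buf - 1);

    if (n < 0)
        return -1;
    buf[n] = '\0';
    return dos_attr_decode(buf, flags);
}
```

With something like this, 'read only' on a directory is kept as its own bit instead of being folded into the unix write-permission bits, so the DOS semantics do not have to be guessed back from the mode.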
My main concern with using extended attributes in this way is
performance. My experience with XFS is that as soon as you start
adding extended attributes then the performance drops a lot, but I
haven't tested performance with the ext3 extended attributes so maybe
they don't have the same problem.
Cheers, Tridge
--
http://samba.org/~tridge/
On Oct 30, 2002 18:31 -0800, Linus Torvalds wrote:
> On Thu, 31 Oct 2002, Rusty Russell wrote:
> > ext2/ext3 ACLs and Extended Attributes
>
> I don't know why people still want ACL's. There were noises about them for
> samba, but I've not heard anything since. Are vendors using this?
I don't really care about ACLs so much one way or the other, but we
DEFINITELY use EAs with Lustre, so at the minimum if we could have
that part of the changes I'd be happy.
Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/
I'm kind of new here, but I'll present my case in hope that someone
listens to me.
On Wed, 30 Oct 2002, Linus Torvalds wrote:
> On Thu, 31 Oct 2002, Rusty Russell wrote:
>
> > Crash Dumping (LKCD)
>
> This is definitely a vendor-driven thing. I don't believe it has any
> relevance unless vendors actively support it.
This is something that we're just starting to use in my department in
Purdue - we work with clustering, and LKCD will let us determine why our
nodes decide to kernel panic since it's generally not worthwhile to
connect a head to each machine.
I see LKCD as having a big impact by allowing kernels to be debugged after
they have panic'd (and thus don't send out a message to syslog). It can
especially be useful in compute farms, or other scenarios where it's
difficult or cost prohibitive to connect a console (or console server) to
each individual machine.
> > EVMS
>
> Not for the feature freeze, there are some noises that imply that SuSE may
> push it in their kernels.
I think that the integration between RAID and LVM is a good thing, and
EVMS's 'plug-in module' architecture will help tremendously to bring
interoperation with other systems' volume management subsystems.
Specifically, the interoperation with IBM's JFS LVM and MS's LVM will be
helpful for people trying to migrate their servers over from those OS's to
GNU/Linux.
-- Pat
Purdue University ITAP/RCS
Information Technology at Purdue
Research Computing and Storage
http://www-rcd.cc.purdue.edu
On Thu, 31 Oct 2002, Christoph Hellwig wrote:
> On Wed, Oct 30, 2002 at 11:20:42PM -0500, Patrick Finnegan wrote:
> > Specifically, the interoperation with IBM's JFS LVM and MS's LVM will be
>
> JFS has no lvm, it just sits on any blockdevice. The support for Windows
> dynamic disks actually layers on top of the MD driver.
To be more specific, I'm talking about AIX's JFS, not linux's JFS...
--
Purdue University ITAP/RCS
Information Technology at Purdue
Research Computing and Storage
http://www-rcd.cc.purdue.edu
http://dilbert.com/comics/dilbert/archive/images/dilbert2040637020924.gif
On Wed, Oct 30, 2002 at 11:20:42PM -0500, Patrick Finnegan wrote:
> Specifically, the interoperation with IBM's JFS LVM and MS's LVM will be
JFS has no lvm, it just sits on any blockdevice. The support for Windows
dynamic disks actually layers on top of the MD driver.
On Wed, 2002-10-30 at 19:31, Linus Torvalds wrote:
>
> > ext2/ext3 ACLs and Extended Attributes
>
> I don't know why people still want ACL's. There were noises about them for
> samba, but I've not heard anything since. Are vendors using this?
>
I teach Linux classes to corporate IT guys (~300 or so this year) and
many of them are migrating from Solaris or deploying Linux along side
Solaris.
Solaris has had ACLs since 2.5.1 (1996), and EAs since 2.9 (May 2002).
Having ACL in Linux is a VERY COMMON REQUEST that I hear from the
students.
FWIW.
Dax Kelson
Guru Labs
Once again, here is my kexec patch, updated to work with 2.5.45.
sys_kexec is a system call that allows linux to act as a bootloader for
another arbitrary kernel.
What the code does:
It copies data from user space, into buffers in kernel space.
The buffers in kernel space are rearranged so that later I can use
a simple memcpy to put the data in the page at its final destination.
The device_shutdown, and the reboot notifier are called.
- This ensures the hardware devices are in a quiescent state
so I do not have to worry about them messing up the transfer of control.
The final copy routine is copied to a buffer that won't get stomped.
The machine is placed into 32bit protected mode with paging disabled.
The final copy routine copies the buffers to their final destination
(which is normally very close to where the current kernel is running).
The final copy routine jumps to the newly loaded kernel image.
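The final copy step above can be sketched in portable C. This is a user-space simulation written for illustration, not code from the patch: "physical memory" is a plain byte array, addresses are offsets into it, and the names relocate_pages, put_entry and SIM_PAGE_SIZE are made up for this sketch. Only the IND_* flag values and the order in which they are tested come from the patch itself (include/linux/kexec.h and relocate_kernel.S).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SIM_PAGE_SIZE   4096u   /* hypothetical name; i386 page size */
#define IND_DESTINATION 0x1u    /* entry sets the current destination page */
#define IND_INDIRECTION 0x2u    /* entry points at the next page of entries */
#define IND_DONE        0x4u    /* end of the list */
#define IND_SOURCE      0x8u    /* entry names a page to copy */

/* Write one 32-bit entry (page address | flag) at byte offset `off`. */
void put_entry(unsigned char *mem, size_t off, uint32_t value)
{
    memcpy(mem + off, &value, sizeof value);
}

/* Walk the entry list that starts at offset `first_entry` in `mem`,
 * copying each source page to the current destination, which then
 * advances by one page. Returns the number of pages copied, or -1 if
 * the list runs outside the simulated memory. */
int relocate_pages(unsigned char *mem, size_t mem_size, uint32_t first_entry)
{
    uint32_t pos = first_entry;  /* plays the role of %ebx */
    uint32_t dest = 0;           /* plays the role of %edi */
    int copied = 0;

    assert(mem != NULL);
    for (;;) {
        uint32_t entry, addr;

        if ((size_t)pos + sizeof entry > mem_size)
            return -1;
        memcpy(&entry, mem + pos, sizeof entry);
        pos += (uint32_t)sizeof entry;
        addr = entry & ~(SIM_PAGE_SIZE - 1u);   /* strip the flag bits */

        if (entry & IND_DESTINATION) {
            dest = addr;
        } else if (entry & IND_INDIRECTION) {
            pos = addr;                         /* continue in another page */
        } else if (entry & IND_DONE) {
            return copied;
        } else if (entry & IND_SOURCE) {
            if ((size_t)addr + SIM_PAGE_SIZE > mem_size ||
                (size_t)dest + SIM_PAGE_SIZE > mem_size)
                return -1;
            memmove(mem + dest, mem + addr, SIM_PAGE_SIZE);
            dest += SIM_PAGE_SIZE;
            copied++;
        }
        /* entries with none of the known flags are skipped */
    }
}
```

The point of the flag-tagged list is that the copy routine needs no allocator, no page tables and no knowledge of the image format: the kernel prepares the list while it is still fully running, and the last stage only follows it.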
At this point the interface is fixed. Anything additional that needs
to happen, can be done in user space by adding a stub routine that
gets called before the loaded kernel is called. In particular I can
directly execute a bzImage which has a 16bit real mode interface.
There is kernel work left to get the device drivers to tell their
devices to shut up (device_shutdown). But device_shutdown already
exists; I just have a good test case for it.
Except for the final copy which is very machine specific the rest of
the code is generic and has actually been tested on alpha. Eventually
I am hoping for ports to other platforms but I am concentrating on x86
so I can do a quality job.
There has been testing and review on the Linux kernel mailing list,
starting with a review of the syscall interface about six months ago,
and people have been testing to be certain they can use the code. While
not all of the bugs are worked out in the user space code, the system
call is solid.
Everything is configurable so there should be no footprint increase
for people who do not want this functionality.
Eric
MAINTAINERS | 7
arch/i386/Kconfig | 17 +
arch/i386/kernel/Makefile | 1
arch/i386/kernel/entry.S | 1
arch/i386/kernel/machine_kexec.c | 142 +++++++++
arch/i386/kernel/relocate_kernel.S | 99 ++++++
include/asm-i386/kexec.h | 25 +
include/asm-i386/unistd.h | 1
include/linux/kexec.h | 48 +++
kernel/Makefile | 1
kernel/kexec.c | 577 +++++++++++++++++++++++++++++++++++++
kernel/sys.c | 61 +++
12 files changed, 980 insertions
diff -uNr linux-2.5.45/MAINTAINERS linux-2.5.45.x86kexec/MAINTAINERS
--- linux-2.5.45/MAINTAINERS Wed Oct 30 19:58:03 2002
+++ linux-2.5.45.x86kexec/MAINTAINERS Wed Oct 30 21:05:37 2002
@@ -934,6 +934,13 @@
W: http://www.cse.unsw.edu.au/~neilb/patches/linux-devel/
S: Maintained
+KEXEC
+P: Eric Biederman
+M: [email protected]
+M: [email protected]
+L: [email protected]
+S: Maintained
+
LANMEDIA WAN CARD DRIVER
P: Andrew Stanley-Jones
M: [email protected]
diff -uNr linux-2.5.45/arch/i386/Kconfig linux-2.5.45.x86kexec/arch/i386/Kconfig
--- linux-2.5.45/arch/i386/Kconfig Wed Oct 30 19:58:04 2002
+++ linux-2.5.45.x86kexec/arch/i386/Kconfig Wed Oct 30 21:40:22 2002
@@ -784,6 +784,23 @@
depends on (SMP || PREEMPT) && X86_CMPXCHG
default y
+config KEXEC
+ bool "kexec system call (EXPERIMENTAL)"
+ depends on EXPERIMENTAL
+ help
+ kexec is a system call that implements the ability to shut down your
+ current kernel, and to start another kernel. It is like a reboot
+ but it is independent of the system firmware. And like a reboot
+ you can start any kernel with it, not just Linux.
+
+ The name comes from the similarity to the exec system call.
+
+ It is an ongoing process to be certain the hardware in a machine
+ is properly shut down, so do not be surprised if this code does not
+ initially work for you. It may help to enable device hotplugging
+ support. As of this writing the exact hardware interface is
+ strongly in flux, so no good recommendation can be made.
+
endmenu
diff -uNr linux-2.5.45/arch/i386/kernel/Makefile linux-2.5.45.x86kexec/arch/i386/kernel/Makefile
--- linux-2.5.45/arch/i386/kernel/Makefile Sat Oct 19 00:57:56 2002
+++ linux-2.5.45.x86kexec/arch/i386/kernel/Makefile Wed Oct 30 21:05:43 2002
@@ -25,6 +25,7 @@
obj-$(CONFIG_X86_MPPARSE) += mpparse.o
obj-$(CONFIG_X86_LOCAL_APIC) += apic.o nmi.o
obj-$(CONFIG_X86_IO_APIC) += io_apic.o
+obj-$(CONFIG_KEXEC) += machine_kexec.o relocate_kernel.o
obj-$(CONFIG_SOFTWARE_SUSPEND) += suspend.o
obj-$(CONFIG_X86_NUMAQ) += numaq.o
obj-$(CONFIG_PROFILING) += profile.o
diff -uNr linux-2.5.45/arch/i386/kernel/entry.S linux-2.5.45.x86kexec/arch/i386/kernel/entry.S
--- linux-2.5.45/arch/i386/kernel/entry.S Wed Oct 30 19:58:04 2002
+++ linux-2.5.45.x86kexec/arch/i386/kernel/entry.S Wed Oct 30 21:06:39 2002
@@ -740,6 +740,7 @@
.long sys_epoll_create
.long sys_epoll_ctl /* 255 */
.long sys_epoll_wait
+ .long sys_kexec
.rept NR_syscalls-(.-sys_call_table)/4
diff -uNr linux-2.5.45/arch/i386/kernel/machine_kexec.c linux-2.5.45.x86kexec/arch/i386/kernel/machine_kexec.c
--- linux-2.5.45/arch/i386/kernel/machine_kexec.c Wed Dec 31 17:00:00 1969
+++ linux-2.5.45.x86kexec/arch/i386/kernel/machine_kexec.c Wed Oct 30 21:05:43 2002
@@ -0,0 +1,142 @@
+#include <linux/config.h>
+#include <linux/mm.h>
+#include <linux/kexec.h>
+#include <linux/delay.h>
+#include <asm/pgtable.h>
+#include <asm/pgalloc.h>
+#include <asm/tlbflush.h>
+#include <asm/io.h>
+#include <asm/apic.h>
+
+
+/*
+ * machine_kexec
+ * =======================
+ */
+
+
+static void set_idt(void *newidt, __u16 limit)
+{
+ unsigned char curidt[6];
+
+ /* ia32 supports unaligned loads & stores */
+ (*(__u16 *)(curidt)) = limit;
+ (*(__u32 *)(curidt +2)) = (unsigned long)(newidt);
+
+ __asm__ __volatile__ (
+ "lidt %0\n"
+ : : "m" (curidt) /* input operand: lidt only reads the descriptor */
+ );
+}
+
+
+static void set_gdt(void *newgdt, __u16 limit)
+{
+ unsigned char curgdt[6];
+
+ /* ia32 supports unaligned loads & stores */
+ (*(__u16 *)(curgdt)) = limit;
+ (*(__u32 *)(curgdt +2)) = (unsigned long)(newgdt);
+
+ __asm__ __volatile__ (
+ "lgdt %0\n"
+ : : "m" (curgdt) /* input operand: lgdt only reads the descriptor */
+ );
+}
+
+static void load_segments(void)
+{
+#define __STR(X) #X
+#define STR(X) __STR(X)
+
+ __asm__ __volatile__ (
+ "\tljmp $"STR(__KERNEL_CS)",$1f\n"
+ "\t1:\n"
+ "\tmovl $"STR(__KERNEL_DS)",%eax\n"
+ "\tmovl %eax,%ds\n"
+ "\tmovl %eax,%es\n"
+ "\tmovl %eax,%fs\n"
+ "\tmovl %eax,%gs\n"
+ "\tmovl %eax,%ss\n"
+ );
+#undef STR
+#undef __STR
+}
+
+static void identity_map_page(unsigned long address)
+{
+ /* This code is x86 specific...
+ * general purpose code must be more careful
+ * of caches and tlbs...
+ */
+ pgd_t *pgd;
+ pmd_t *pmd;
+ struct mm_struct *mm = current->mm;
+ spin_lock(&mm->page_table_lock);
+
+ pgd = pgd_offset(mm, address);
+ pmd = pmd_alloc(mm, pgd, address);
+
+ if (pmd) {
+ pte_t *pte = pte_alloc_map(mm, pmd, address);
+ if (pte) {
+ set_pte(pte,
+ mk_pte(virt_to_page(phys_to_virt(address)),
+ PAGE_SHARED));
+ __flush_tlb_one(address);
+ }
+ }
+ spin_unlock(&mm->page_table_lock);
+}
+
+
+typedef void (*relocate_new_kernel_t)(
+ unsigned long indirection_page, unsigned long reboot_code_buffer,
+ unsigned long start_address);
+
+const extern unsigned char relocate_new_kernel[];
+extern void relocate_new_kernel_end(void);
+const extern unsigned int relocate_new_kernel_size;
+
+void machine_kexec(struct kimage *image)
+{
+ unsigned long *indirection_page;
+ void *reboot_code_buffer;
+ relocate_new_kernel_t rnk;
+
+ /* Interrupts aren't acceptable while we reboot */
+ local_irq_disable();
+ reboot_code_buffer = image->reboot_code_buffer;
+ indirection_page = phys_to_virt(image->head & PAGE_MASK);
+
+ identity_map_page(virt_to_phys(reboot_code_buffer));
+
+ /* copy it out */
+ memcpy(reboot_code_buffer, relocate_new_kernel,
+ relocate_new_kernel_size);
+
+ /* The segment registers are funny things: they are
+ * automatically loaded from a table in memory whenever you
+ * set them to a specific selector, but this table is never
+ * accessed again until you set the segment to a different selector.
+ *
+ * The more common model is a cache, where the behind-the-scenes
+ * work is done, but is also dropped at arbitrary
+ * times.
+ *
+ * I take advantage of this here by force loading the
+ * segments, before I zap the gdt with an invalid value.
+ */
+ load_segments();
+ /* The gdt & idt are now invalid.
+ * If you want to load them you must set up your own idt & gdt.
+ */
+ set_gdt(phys_to_virt(0),0);
+ set_idt(phys_to_virt(0),0);
+
+ /* now call it */
+ rnk = (relocate_new_kernel_t) virt_to_phys(reboot_code_buffer);
+ (*rnk)(virt_to_phys(indirection_page), virt_to_phys(reboot_code_buffer),
+ image->start);
+}
+
diff -uNr linux-2.5.45/arch/i386/kernel/relocate_kernel.S linux-2.5.45.x86kexec/arch/i386/kernel/relocate_kernel.S
--- linux-2.5.45/arch/i386/kernel/relocate_kernel.S Wed Dec 31 17:00:00 1969
+++ linux-2.5.45.x86kexec/arch/i386/kernel/relocate_kernel.S Wed Oct 30 21:05:43 2002
@@ -0,0 +1,99 @@
+#include <linux/config.h>
+#include <linux/linkage.h>
+
+ /* Must be relocatable PIC code callable as a C function, that once
+ * it starts can not use the previous process's stack.
+ *
+ */
+ .globl relocate_new_kernel
+relocate_new_kernel:
+ /* read the arguments and say goodbye to the stack */
+ movl 4(%esp), %ebx /* indirection_page */
+ movl 8(%esp), %ebp /* reboot_code_buffer */
+ movl 12(%esp), %edx /* start address */
+
+ /* zero out flags, and disable interrupts */
+ pushl $0
+ popfl
+
+ /* set a new stack at the bottom of our page... */
+ lea 4096(%ebp), %esp
+
+ /* store the parameters back on the stack */
+ pushl %edx /* store the start address */
+
+ /* Set cr0 to a known state:
+ * 31 0 == Paging disabled
+ * 18 0 == Alignment check disabled
+ * 16 0 == Write protect disabled
+ * 3 0 == No task switch
+ * 2 0 == Don't do FP software emulation.
+ * 0 1 == Protected mode enabled
+ */
+ movl %cr0, %eax
+ andl $~((1<<31)|(1<<18)|(1<<16)|(1<<3)|(1<<2)), %eax
+ orl $(1<<0), %eax
+ movl %eax, %cr0
+ jmp 1f
+1:
+
+ /* Flush the TLB (needed?) */
+ xorl %eax, %eax
+ movl %eax, %cr3
+
+ /* Do the copies */
+ cld
+0: /* top, read another word for the indirection page */
+ movl %ebx, %ecx
+ movl (%ebx), %ecx
+ addl $4, %ebx
+ testl $0x1, %ecx /* is it a destination page */
+ jz 1f
+ movl %ecx, %edi
+ andl $0xfffff000, %edi
+ jmp 0b
+1:
+ testl $0x2, %ecx /* is it an indirection page */
+ jz 1f
+ movl %ecx, %ebx
+ andl $0xfffff000, %ebx
+ jmp 0b
+1:
+ testl $0x4, %ecx /* is it the done indicator */
+ jz 1f
+ jmp 2f
+1:
+ testl $0x8, %ecx /* is it the source indicator */
+ jz 0b /* Ignore it otherwise */
+ movl %ecx, %esi /* For every source page do a copy */
+ andl $0xfffff000, %esi
+
+ movl $1024, %ecx
+ rep ; movsl
+ jmp 0b
+
+2:
+
+ /* To be certain of avoiding problems with self modifying code
+ * I need to execute a serializing instruction here.
+ * So I flush the TLB, it's handy, and not processor dependent.
+ */
+ xorl %eax, %eax
+ movl %eax, %cr3
+
+ /* set all of the registers to known values */
+ /* leave %esp alone */
+
+ xorl %eax, %eax
+ xorl %ebx, %ebx
+ xorl %ecx, %ecx
+ xorl %edx, %edx
+ xorl %esi, %esi
+ xorl %edi, %edi
+ xorl %ebp, %ebp
+ ret
+relocate_new_kernel_end:
+
+ .globl relocate_new_kernel_size
+relocate_new_kernel_size:
+ .long relocate_new_kernel_end - relocate_new_kernel
diff -uNr linux-2.5.45/include/asm-i386/kexec.h linux-2.5.45.x86kexec/include/asm-i386/kexec.h
--- linux-2.5.45/include/asm-i386/kexec.h Wed Dec 31 17:00:00 1969
+++ linux-2.5.45.x86kexec/include/asm-i386/kexec.h Wed Oct 30 21:05:43 2002
@@ -0,0 +1,25 @@
+#ifndef _I386_KEXEC_H
+#define _I386_KEXEC_H
+
+#include <asm/fixmap.h>
+
+/*
+ * KEXEC_SOURCE_MEMORY_LIMIT maximum page get_free_page can return.
+ * I.e. Maximum page that is mapped directly into kernel memory,
+ * and kmap is not required.
+ *
+ * Someone correct me if FIXADDR_START - PAGE_OFFSET is not the correct
+ * calculation for the amount of memory directly mappable into the
+ * kernel memory space.
+ */
+
+/* Maximum physical address we can use pages from */
+#define KEXEC_SOURCE_MEMORY_LIMIT (FIXADDR_START - PAGE_OFFSET)
+/* Maximum address we can reach in physical address mode */
+#define KEXEC_DESTINATION_MEMORY_LIMIT (-1UL)
+
+#define KEXEC_REBOOT_CODE_SIZE 4096
+#define KEXEC_REBOOT_CODE_ALIGN 0
+
+
+#endif /* _I386_KEXEC_H */
diff -uNr linux-2.5.45/include/asm-i386/unistd.h linux-2.5.45.x86kexec/include/asm-i386/unistd.h
--- linux-2.5.45/include/asm-i386/unistd.h Wed Oct 30 19:58:25 2002
+++ linux-2.5.45.x86kexec/include/asm-i386/unistd.h Wed Oct 30 21:07:27 2002
@@ -261,6 +261,7 @@
#define __NR_sys_epoll_create 254
#define __NR_sys_epoll_ctl 255
#define __NR_sys_epoll_wait 256
+#define __NR_sys_kexec 257
/* user-visible error numbers are in the range -1 - -124: see <asm-i386/errno.h> */
diff -uNr linux-2.5.45/include/linux/kexec.h linux-2.5.45.x86kexec/include/linux/kexec.h
--- linux-2.5.45/include/linux/kexec.h Wed Dec 31 17:00:00 1969
+++ linux-2.5.45.x86kexec/include/linux/kexec.h Wed Oct 30 21:05:43 2002
@@ -0,0 +1,48 @@
+#ifndef LINUX_KEXEC_H
+#define LINUX_KEXEC_H
+
+#ifdef CONFIG_KEXEC
+#include <linux/types.h>
+#include <asm/kexec.h>
+
+/*
+ * This structure is used to hold the arguments that are used when loading
+ * kernel binaries.
+ */
+
+typedef unsigned long kimage_entry_t;
+#define IND_DESTINATION 0x1
+#define IND_INDIRECTION 0x2
+#define IND_DONE 0x4
+#define IND_SOURCE 0x8
+
+struct kimage {
+ kimage_entry_t head;
+ kimage_entry_t *entry;
+ kimage_entry_t *last_entry;
+
+ unsigned long destination;
+ unsigned long offset;
+
+ unsigned long start;
+ void *reboot_code_buffer;
+};
+
+/* kexec helper functions */
+void kimage_init(struct kimage *image);
+void kimage_free(struct kimage *image);
+
+struct kexec_segment {
+ void *buf;
+ size_t bufsz;
+ void *mem;
+ size_t memsz;
+};
+
+/* kexec interface functions */
+extern void machine_kexec(struct kimage *image);
+extern int do_kexec(unsigned long entry, long nr_segments,
+ struct kexec_segment *segments, struct kimage *image);
+#endif
+#endif /* LINUX_KEXEC_H */
+
diff -uNr linux-2.5.45/kernel/Makefile linux-2.5.45.x86kexec/kernel/Makefile
--- linux-2.5.45/kernel/Makefile Fri Oct 18 11:59:29 2002
+++ linux-2.5.45.x86kexec/kernel/Makefile Wed Oct 30 21:05:43 2002
@@ -21,6 +21,7 @@
obj-$(CONFIG_CPU_FREQ) += cpufreq.o
obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
obj-$(CONFIG_SOFTWARE_SUSPEND) += suspend.o
+obj-$(CONFIG_KEXEC) += kexec.o
ifneq ($(CONFIG_IA64),y)
# According to Alan Modra <[email protected]>, the -fno-omit-frame-pointer is
diff -uNr linux-2.5.45/kernel/kexec.c linux-2.5.45.x86kexec/kernel/kexec.c
--- linux-2.5.45/kernel/kexec.c Wed Dec 31 17:00:00 1969
+++ linux-2.5.45.x86kexec/kernel/kexec.c Wed Oct 30 21:31:20 2002
@@ -0,0 +1,577 @@
+#include <linux/mm.h>
+#include <linux/file.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/version.h>
+#include <linux/compile.h>
+#include <linux/kexec.h>
+#include <net/checksum.h>
+#include <asm/page.h>
+#include <asm/uaccess.h>
+#include <asm/io.h>
+
+/* As designed kexec can only use memory that you don't
+ * need kmap to access: memory that you can use virt_to_phys()
+ * on and allocate with get_free_page.
+ *
+ * In the best case you need one page for the transition from
+ * virtual to physical memory. And this page must be identity
+ * mapped. Which pretty much leaves you with pages < PAGE_OFFSET
+ * as you can only mess with user pages.
+ *
+ * As the only subset of memory that it is easy to restrict allocation
+ * to is the physical memory mapped into the kernel, I do that
+ * with get_free_page and hope it is enough.
+ *
+ * I don't know of a good way to calculate which pages get_free_page
+ * will return independent of architecture, so I depend on
+ * <asm/kexec.h> to properly set
+ * KEXEC_SOURCE_MEMORY_LIMIT and KEXEC_DESTINATION_MEMORY_LIMIT
+ *
+ */
+
+void kimage_init(struct kimage *image)
+{
+ memset(image, 0, sizeof(*image));
+ image->head = 0;
+ image->entry = &image->head;
+ image->last_entry = &image->head;
+}
+static int kimage_add_entry(struct kimage *image, kimage_entry_t entry)
+{
+ if (image->offset != 0) {
+ image->entry++;
+ }
+ if (image->entry == image->last_entry) {
+ kimage_entry_t *ind_page;
+ ind_page = (void *)__get_free_page(GFP_KERNEL);
+ if (!ind_page) {
+ return -ENOMEM;
+ }
+ *image->entry = virt_to_phys(ind_page) | IND_INDIRECTION;
+ image->entry = ind_page;
+ image->last_entry =
+ ind_page + ((PAGE_SIZE/sizeof(kimage_entry_t)) - 1);
+ }
+ *image->entry = entry;
+ image->entry++;
+ image->offset = 0;
+ return 0;
+}
+
+static int kimage_verify_destination(unsigned long destination)
+{
+ int result;
+
+ /* Assume the page is bad unless we pass the checks */
+ result = -EADDRNOTAVAIL;
+
+ if (destination >= KEXEC_DESTINATION_MEMORY_LIMIT) {
+ goto out;
+ }
+
+ /* NOTE: The caller is responsible for making certain we
+ * don't attempt to load the new image into invalid or
+ * reserved areas of RAM.
+ */
+ result = 0;
+out:
+ return result;
+}
+
+static int kimage_set_destination(
+ struct kimage *image, unsigned long destination)
+{
+ int result;
+ destination &= PAGE_MASK;
+ result = kimage_verify_destination(destination);
+ if (result) {
+ return result;
+ }
+ result = kimage_add_entry(image, destination | IND_DESTINATION);
+ if (result == 0) {
+ image->destination = destination;
+ }
+ return result;
+}
+
+
+static int kimage_add_page(struct kimage *image, unsigned long page)
+{
+ int result;
+ page &= PAGE_MASK;
+ result = kimage_verify_destination(image->destination);
+ if (result) {
+ return result;
+ }
+ result = kimage_add_entry(image, page | IND_SOURCE);
+ if (result == 0) {
+ image->destination += PAGE_SIZE;
+ }
+ return result;
+}
+
+
+static int kimage_terminate(struct kimage *image)
+{
+ int result;
+ result = kimage_add_entry(image, IND_DONE);
+ if (result == 0) {
+ /* Point at the terminating element */
+ image->entry--;
+ }
+ return result;
+}
+
+#define for_each_kimage_entry(image, ptr, entry) \
+ for (ptr = &image->head; (entry = *ptr) && !(entry & IND_DONE); \
+ ptr = (entry & IND_INDIRECTION)? \
+ phys_to_virt((entry & PAGE_MASK)): ptr +1)
+
+void kimage_free(struct kimage *image)
+{
+ kimage_entry_t *ptr, entry;
+ kimage_entry_t ind = 0;
+ for_each_kimage_entry(image, ptr, entry) {
+ if (entry & IND_INDIRECTION) {
+ /* Free the previous indirection page */
+ if (ind & IND_INDIRECTION) {
+ free_page((unsigned long)phys_to_virt(ind & PAGE_MASK));
+ }
+ /* Save this indirection page until we are
+ * done with it.
+ */
+ ind = entry;
+ }
+ else if (entry & IND_SOURCE) {
+ free_page((unsigned long)phys_to_virt(entry & PAGE_MASK));
+ }
+ }
+}
+
+static int kimage_is_destination_page(
+ struct kimage *image, unsigned long page)
+{
+ kimage_entry_t *ptr, entry;
+ unsigned long destination;
+ destination = 0;
+ page &= PAGE_MASK;
+ for_each_kimage_entry(image, ptr, entry) {
+ if (entry & IND_DESTINATION) {
+ destination = entry & PAGE_MASK;
+ }
+ else if (entry & IND_SOURCE) {
+ if (page == destination) {
+ return 1;
+ }
+ destination += PAGE_SIZE;
+ }
+ }
+ return 0;
+}
+
+static int kimage_get_unused_area(
+ struct kimage *image, unsigned long size, unsigned long align,
+ unsigned long *area)
+{
+ /* Walk through mem_map and find the first chunk of
+ * unused memory that is at least size bytes long.
+ */
+ /* Since the kernel plays with PG_reserved, mem_map is less
+ * than ideal for this purpose, but it will give us a correct
+ * conservative estimate of what we need to do.
+ */
+ /* For now we take advantage of the fact that all kernel pages
+ * are marked with PG_reserved to allocate a large
+ * contiguous area for the reboot code buffer.
+ */
+ unsigned long addr;
+ unsigned long start, end;
+ unsigned long mask;
+ mask = ((1 << align) -1);
+ start = end = PAGE_SIZE;
+ for(addr = PAGE_SIZE; addr < KEXEC_SOURCE_MEMORY_LIMIT; addr += PAGE_SIZE) {
+ struct page *page;
+ unsigned long aligned_start;
+ page = virt_to_page(phys_to_virt(addr));
+ if (PageReserved(page) ||
+ kimage_is_destination_page(image, addr)) {
+ /* The current page is reserved so the start &
+ * end of the next area must be at least at the
+ * next page.
+ */
+ start = end = addr + PAGE_SIZE;
+ }
+ else {
+ /* O.k. The current page isn't reserved
+ * so push up the end of the area.
+ */
+ end = addr;
+ }
+ aligned_start = (start + mask) & ~mask;
+ if (aligned_start > start) {
+ continue;
+ }
+ if (aligned_start > end) {
+ continue;
+ }
+ if (end - aligned_start >= size) {
+ *area = aligned_start;
+ return 0;
+ }
+ }
+ *area = 0;
+ return -ENOSPC;
+}
+
+static kimage_entry_t *kimage_dst_conflict(
+ struct kimage *image, unsigned long page, kimage_entry_t *limit)
+{
+ kimage_entry_t *ptr, entry;
+ unsigned long destination = 0;
+ for_each_kimage_entry(image, ptr, entry) {
+ if (ptr == limit) {
+ return 0;
+ }
+ else if (entry & IND_DESTINATION) {
+ destination = entry & PAGE_MASK;
+ }
+ else if (entry & IND_SOURCE) {
+ if (page == destination) {
+ return ptr;
+ }
+ destination += PAGE_SIZE;
+ }
+ }
+ return 0;
+}
+
+static kimage_entry_t *kimage_src_conflict(
+ struct kimage *image, unsigned long destination, kimage_entry_t *limit)
+{
+ kimage_entry_t *ptr, entry;
+ for_each_kimage_entry(image, ptr, entry) {
+ unsigned long page;
+ if (ptr == limit) {
+ return 0;
+ }
+ else if (entry & IND_DESTINATION) {
+ /* nop */
+ }
+ else if (entry & IND_DONE) {
+ /* nop */
+ }
+ else {
+ /* SOURCE & INDIRECTION */
+ page = entry & PAGE_MASK;
+ if (page == destination) {
+ return ptr;
+ }
+ }
+ }
+ return 0;
+}
+
+static int kimage_get_off_destination_pages(struct kimage *image)
+{
+ kimage_entry_t *ptr, *cptr, entry;
+ unsigned long buffer, page;
+ unsigned long destination = 0;
+
+ /* Here we implement safeguards to ensure that
+ * a source page is not copied to its destination
+ * page before the data on the destination page is
+ * no longer useful.
+ *
+ * To make it work we actually wind up with a
+ * stronger condition. For every page considered
+ * it is either its own destination page or it is
+ * not a destination page of any page considered.
+ *
+ * Invariants
+ * 1. buffer is not a destination of a previous page.
+ * 2. page is not a destination of a previous page.
+ * 3. destination is not a previous source page.
+ *
+ * Result: Either a source page and a destination page
+ * are the same or the page is not a destination page.
+ *
+ * These checks could be done when we allocate the pages,
+ * but doing it as a final pass allows us more freedom
+ * on how we allocate pages.
+ *
+ * Also while the checks are necessary, in practice nothing
+ * happens. The destination kernel wants to sit in the
+ * same physical addresses as the current kernel so we never
+ * actually allocate a destination page.
+ *
+ * BUGS: This is an O(N^2) algorithm.
+ */
+
+
+ buffer = __get_free_page(GFP_KERNEL);
+ if (!buffer) {
+ return -ENOMEM;
+ }
+ buffer = virt_to_phys((void *)buffer);
+ for_each_kimage_entry(image, ptr, entry) {
+ /* Here we check to see if an allocated page */
+ kimage_entry_t *limit;
+ if (entry & IND_DESTINATION) {
+ destination = entry & PAGE_MASK;
+ }
+ else if (entry & IND_INDIRECTION) {
+ /* Indirection pages must include all of their
+ * contents in limit checking.
+ */
+ limit = phys_to_virt(page + PAGE_SIZE - sizeof(*limit));
+ }
+ if (!((entry & IND_SOURCE) | (entry & IND_INDIRECTION))) {
+ continue;
+ }
+ page = entry & PAGE_MASK;
+ limit = ptr;
+
+ /* See if a previous page has the current page as its
+ * destination.
+ * i.e. invariant 2
+ */
+ cptr = kimage_dst_conflict(image, page, limit);
+ if (cptr) {
+ unsigned long cpage;
+ kimage_entry_t centry;
+ centry = *cptr;
+ cpage = centry & PAGE_MASK;
+ memcpy(phys_to_virt(buffer), phys_to_virt(page), PAGE_SIZE);
+ memcpy(phys_to_virt(page), phys_to_virt(cpage), PAGE_SIZE);
+ *cptr = page | (centry & ~PAGE_MASK);
+ *ptr = buffer | (entry & ~PAGE_MASK);
+ buffer = cpage;
+ }
+ if (!(entry & IND_SOURCE)) {
+ continue;
+ }
+
+ /* See if a previous page is our destination page.
+ * If so claim it now.
+ * i.e. invariant 3
+ */
+ cptr = kimage_src_conflict(image, destination, limit);
+ if (cptr) {
+ unsigned long cpage;
+ kimage_entry_t centry;
+ centry = *cptr;
+ cpage = centry & PAGE_MASK;
+ memcpy(phys_to_virt(buffer), phys_to_virt(cpage), PAGE_SIZE);
+ memcpy(phys_to_virt(cpage), phys_to_virt(page), PAGE_SIZE);
+ *cptr = buffer | (centry & ~PAGE_MASK);
+ *ptr = cpage | ( entry & ~PAGE_MASK);
+ buffer = page;
+ }
+ /* If the buffer is my destination page do the copy now
+ * i.e. invariant 3 & 1
+ */
+ if (buffer == destination) {
+ memcpy(phys_to_virt(buffer), phys_to_virt(page), PAGE_SIZE);
+ *ptr = buffer | (entry & ~PAGE_MASK);
+ buffer = page;
+ }
+ }
+ free_page((unsigned long)phys_to_virt(buffer));
+ return 0;
+}
+
+static int kimage_add_empty_pages(struct kimage *image,
+ unsigned long len)
+{
+ unsigned long pos;
+ int result;
+ for(pos = 0; pos < len; pos += PAGE_SIZE) {
+ char *page;
+ result = -ENOMEM;
+ page = (void *)__get_free_page(GFP_KERNEL);
+ if (!page) {
+ goto out;
+ }
+ result = kimage_add_page(image, virt_to_phys(page));
+ if (result) {
+ goto out;
+ }
+ }
+ result = 0;
+ out:
+ return result;
+}
+
+
+static int kimage_load_segment(struct kimage *image,
+ struct kexec_segment *segment)
+{
+ unsigned long mstart;
+ int result;
+ unsigned long offset;
+ unsigned long offset_end;
+ unsigned char *buf;
+
+ result = 0;
+ buf = segment->buf;
+ mstart = (unsigned long)segment->mem;
+
+ offset_end = segment->memsz;
+
+ result = kimage_set_destination(image, mstart);
+ if (result < 0) {
+ goto out;
+ }
+ for(offset = 0; offset < segment->memsz; offset += PAGE_SIZE) {
+ char *page;
+ size_t size, leader;
+ page = (char *)__get_free_page(GFP_KERNEL);
+ if (page == 0) {
+ result = -ENOMEM;
+ goto out;
+ }
+ result = kimage_add_page(image, virt_to_phys(page));
+ if (result < 0) {
+ goto out;
+ }
+ if (segment->bufsz < offset) {
+ /* We are past the end zero the whole page */
+ memset(page, 0, PAGE_SIZE);
+ continue;
+ }
+ size = PAGE_SIZE;
+ leader = 0;
+ if ((offset == 0)) {
+ leader = mstart & ~PAGE_MASK;
+ }
+ if (leader) {
+ /* We are on the first page zero the unused portion */
+ memset(page, 0, leader);
+ size -= leader;
+ page += leader;
+ }
+ if (size > (segment->bufsz - offset)) {
+ size = segment->bufsz - offset;
+ }
+ result = copy_from_user(page, buf + offset, size);
+ if (result) {
+ result = (result < 0)?result : -EIO;
+ goto out;
+ }
+ if (size < (PAGE_SIZE - leader)) {
+ /* zero the trailing part of the page */
+ memset(page + size, 0, (PAGE_SIZE - leader) - size);
+ }
+ }
+ out:
+ return result;
+}
+
+
+/* do_kexec loads the new kernel image; machine_kexec() starts it
+ */
+int do_kexec(unsigned long start, long nr_segments,
+ struct kexec_segment *arg_segments, struct kimage *image)
+{
+ struct kexec_segment *segments;
+ size_t segment_bytes;
+ int i;
+
+ int result;
+ unsigned long reboot_code_buffer;
+ kimage_entry_t *end;
+
+ /* Initialize variables */
+ segments = 0;
+
+ /* A segment count of zero or less is invalid. */
+ if (nr_segments <= 0) {
+ result = -EINVAL;
+ goto out;
+ }
+ segment_bytes = nr_segments * sizeof(*segments);
+ segments = kmalloc(segment_bytes, GFP_KERNEL);
+ if (segments == 0) {
+ result = -ENOMEM;
+ goto out;
+ }
+ if (copy_from_user(segments, arg_segments, segment_bytes)) {
+ result = -EFAULT;
+ goto out;
+ }
+
+ /* Read in the data from user space */
+ image->start = start;
+ for(i = 0; i < nr_segments; i++) {
+ result = kimage_load_segment(image, &segments[i]);
+ if (result) {
+ goto out;
+ }
+ }
+
+ /* Terminate early so I can get a place holder. */
+ result = kimage_terminate(image);
+ if (result)
+ goto out;
+ end = image->entry;
+
+ /* Usage of the reboot code buffer is subtle. We first
+ * find a contiguous area of ram, that is not one
+ * of our destination pages. We do not allocate the ram.
+ *
+ * The algorithm to make certain we do not have address
+ * conflicts requires each destination region to have some
+ * backing store so we allocate arbitrary source pages.
+ *
+ * Later in machine_kexec when we copy data to the
+ * reboot_code_buffer it still may be allocated for other
+ * purposes, but we do know there are no source or destination
+ * pages in that area. And since the rest of the kernel
+ * is already shutdown those pages are free for use,
+ * regardless of their page->count values.
+ *
+ * The kernel mapping of the reboot code buffer is passed to
+ * the machine dependent code. If it needs something else
+ * it is free to set that up.
+ */
+ result = kimage_get_unused_area(
+ image, KEXEC_REBOOT_CODE_SIZE, KEXEC_REBOOT_CODE_ALIGN,
+ &reboot_code_buffer);
+ if (result)
+ goto out;
+
+ /* Allocating pages we should never need is silly but the
+ * code won't work correctly unless we have dummy pages to
+ * work with.
+ */
+ result = kimage_set_destination(image, reboot_code_buffer);
+ if (result)
+ goto out;
+ result = kimage_add_empty_pages(image, KEXEC_REBOOT_CODE_SIZE);
+ if (result)
+ goto out;
+ image->reboot_code_buffer = phys_to_virt(reboot_code_buffer);
+
+ result = kimage_terminate(image);
+ if (result)
+ goto out;
+
+ result = kimage_get_off_destination_pages(image);
+ if (result)
+ goto out;
+
+ /* Now hide the extra source pages for the reboot code buffer.
+ */
+ image->entry = end;
+ result = kimage_terminate(image);
+ if (result)
+ goto out;
+
+ result = 0;
+ out:
+ /* cleanup and exit */
+ if (segments) kfree(segments);
+ return result;
+}
+
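For readers trying to picture the list that kimage_add_entry() builds, here is a small user-space sketch of the same chaining idea. It is not the kernel code: the page pool inside the struct stands in for __get_free_page(), a page's "physical address" is simply its byte offset in that pool, and the 64-byte page size and the function names are assumptions made for the example. It also leaves out the image->offset handling from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Flag values copied from the patch's include/linux/kexec.h. */
#define IND_DESTINATION 0x1
#define IND_INDIRECTION 0x2
#define IND_DONE        0x4
#define IND_SOURCE      0x8

typedef uint32_t entry_t;	/* 32 bits wide, like unsigned long on i386 */

/* Hypothetical model values, not taken from the patch. */
#define MODEL_PAGE_SIZE  64u
#define MODEL_PAGE_MASK  (~(entry_t)(MODEL_PAGE_SIZE - 1))
#define MODEL_POOL_PAGES 8u
#define SLOTS_PER_PAGE   (MODEL_PAGE_SIZE / sizeof(entry_t))

struct entry_list {
	entry_t head;		/* link to the first page of entries */
	entry_t *entry;		/* next slot to fill */
	entry_t *last_entry;	/* slot reserved for the link to the next page */
	entry_t pool[MODEL_POOL_PAGES * SLOTS_PER_PAGE];	/* page allocator stand-in */
	size_t pool_used;	/* pages handed out so far */
};

static void list_init(struct entry_list *list)
{
	memset(list, 0, sizeof(*list));
	list->entry = &list->head;
	list->last_entry = &list->head;
}

/* Append one entry, chaining in a fresh page when the current one is full. */
static int list_add(struct entry_list *list, entry_t value)
{
	if (list->entry == list->last_entry) {
		entry_t *page;

		if (list->pool_used == MODEL_POOL_PAGES)
			return -1;	/* the kernel code returns -ENOMEM here */
		page = list->pool + list->pool_used++ * SLOTS_PER_PAGE;
		/* The link holds the page's byte offset in the pool plus the flag. */
		*list->entry = (entry_t)((size_t)(page - list->pool) *
					 sizeof(entry_t)) | IND_INDIRECTION;
		list->entry = page;
		list->last_entry = page + (SLOTS_PER_PAGE - 1);
	}
	*list->entry++ = value;
	return 0;
}

/* Count the entries that carry the given flag, stopping at IND_DONE. */
static size_t list_count(const struct entry_list *list, entry_t flag)
{
	const entry_t *ptr = &list->head;
	size_t count = 0;
	entry_t entry;

	while ((entry = *ptr) != 0 && !(entry & IND_DONE)) {
		if (entry & flag)
			count++;
		if (entry & IND_INDIRECTION)
			ptr = list->pool + (entry & MODEL_PAGE_MASK) / sizeof(entry_t);
		else
			ptr++;
	}
	return count;
}

/* Add 20 source entries and a terminator, then check what the walk sees. */
static int list_demo(void)
{
	struct entry_list list;
	entry_t i;

	list_init(&list);
	for (i = 0; i < 20; i++) {
		if (list_add(&list, (100 + i) * MODEL_PAGE_SIZE | IND_SOURCE))
			return -1;
	}
	if (list_add(&list, IND_DONE))
		return -1;
	assert(list.pool_used == 2);
	if (list_count(&list, IND_SOURCE) != 20)
		return -1;
	if (list_count(&list, IND_INDIRECTION) != 2)
		return -1;
	return 0;
}
```

With 16 slots per model page and the last slot kept for the link, the 21 entries need two pages, so the walk sees two indirection links: the head and the link at the end of the first page. list_demo() should return 0.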
diff -uNr linux-2.5.45/kernel/sys.c linux-2.5.45.x86kexec/kernel/sys.c
--- linux-2.5.45/kernel/sys.c Fri Oct 18 11:59:29 2002
+++ linux-2.5.45.x86kexec/kernel/sys.c Wed Oct 30 21:45:37 2002
@@ -16,6 +16,7 @@
#include <linux/init.h>
#include <linux/highuid.h>
#include <linux/fs.h>
+#include <linux/kexec.h>
#include <linux/workqueue.h>
#include <linux/device.h>
#include <linux/times.h>
@@ -430,6 +431,66 @@
unlock_kernel();
return 0;
}
+
+#ifdef CONFIG_KEXEC
+/*
+ * Exec Kernel system call: for obvious reasons only root may call it.
+ *
+ * This call breaks up into three pieces.
+ * - A generic part which loads the new kernel from the current
+ * address space, and very carefully places the data in the
+ * allocated pages.
+ *
+ * - A generic part that interacts with the kernel and tells all of
+ * the devices to shut down. Preventing on-going dmas, and placing
+ * the devices in a consistent state so a later kernel can
+ * reinitialize them.
+ *
+ * - A machine specific part that includes the syscall number
+ * and then copies the image to its final destination, and
+ * jumps into the image at entry.
+ *
+ * kexec does not sync, or unmount filesystems so if you need
+ * that to happen you need to do that yourself.
+ */
+asmlinkage long sys_kexec(unsigned long entry, long nr_segments,
+ struct kexec_segment *segments)
+{
+ /* Am I using too much stack space here? */
+ struct kimage image;
+ int result;
+
+ /* We only trust the superuser with rebooting the system. */
+ if (!capable(CAP_SYS_BOOT))
+ return -EPERM;
+
+ lock_kernel();
+ kimage_init(&image);
+ result = do_kexec(entry, nr_segments, segments, &image);
+ if (result) {
+ kimage_free(&image);
+ unlock_kernel();
+ return result;
+ }
+
+ /* The point of no return is here... */
+ notifier_call_chain(&reboot_notifier_list, SYS_RESTART, NULL);
+ system_running = 0;
+ device_shutdown();
+ printk(KERN_EMERG "Starting new kernel\n");
+ machine_kexec(&image);
+ /* We never get here but... */
+ kimage_free(&image);
+ unlock_kernel();
+ return -EINVAL;
+}
+#else
+asmlinkage long sys_kexec(unsigned long entry, long nr_segments,
+ struct kexec_segment *segments)
+{
+ return -ENOSYS;
+}
+#endif /* CONFIG_KEXEC */
static void deferred_cad(void *dummy)
{
Linus Torvalds wrote:
> > Crash Dumping (LKCD)
>
> This is definitely a vendor-driven thing. I don't believe it has any
> relevance unless vendors actively support it.
There are people within IBM in Germany, India and England, as well as
a number of companies (Intel, NEC, Hitachi, Fujitsu), as well as SGI
that are PAID to support this. In addition, Global Services at IBM
uses this as a front-line method for resolving customer problems.
If you're looking for names of people to sign up to support it
(both vendors and non-vendors), I can make that list up for you.
There are a number of us (developers, support staff, and other
interested parties) who bend over backwards, day in and day out
to make sure this stuff works and helps people, even if it isn't
kernel developers (directly -- indirectly, you get bug reports that
are sane and useful).
It's not sexy kernel stuff, but it is very important, and if you'd
like, I can have representatives from at least 10 major corporations
(Fortune 500 companies) contact you to request that this go in.
We're generating 2.5.45 patches now, and we ask that you include
the patches when they are posted.
I don't know what else to say except that people really want this
stuff and all of us in the LKCD community work really hard together
to make this project useful for everyone.
Please include this in your next snapshot.
--Matt
P.S. Copying some of the users and developers.
On Wed, Oct 30, 2002 at 10:19:54PM -0500, [email protected] wrote:
> Eventually I'd like to see a combination of LSM with a new ACL
> system give the ability to support full NT ACLs on Linux (which is
> also needed for full nfsv4 support), but that is way too much to do
> for the 2.6 kernel.
Add bloat to make windows clients happy?
> Extended attributes are also important as they give a place to store
> all the extra DOS info that has no other logical place in a posix
> filesystem. For example, we can put the 'read only', 'archive',
> 'hidden' and 'system' attributes there. If we don't have extended
> attributes then we need to use a nasty kludge where these map to
> various unix permission bits, but the mapping is terrible and
> doesn't give the correct semantics (especially for things like read
> only on directories).
More bloat that doesn't really solve Linux problems... sounds like nasty
hacks to make winduhs hacks work better.
Don't get me wrong, I'm not against sane ACLs (POSIX ACLs are not) or
EAs, but justification of "it makes windows clients easier" is pretty
horrendous IMO.
I'd at some point like to see decent ACLs, but I don't want to
see 'windows ACLs' and all the SID nonsense.
--cw
On Thu, Oct 31, 2002 at 01:06:54AM -0200, Rik van Riel wrote:
> Personally I do think either the unlimited groups patch or ACLs are
> needed in order to sanely run a large anoncvs setup.
Processes need to be a member of 20+ groups to make anoncvs work?
Sounds like anoncvs is broken then.
--cw
On Wed, 2002-10-30 at 23:22, Chris Wedgwood wrote:
> On Thu, Oct 31, 2002 at 01:06:54AM -0200, Rik van Riel wrote:
>
> > Personally I do think either the unlimited groups patch or ACLs are
> > needed in order to sanely run a large anoncvs setup.
>
> Processes need to be a member of 20+ groups to make anoncvs work?
> Sounds like anoncvs is broken then.
Technically speaking you can achieve ACL like permissions/behavior using
the historical UNIX security model by creating a group EACH time you run
into a unique case permission scenario.
Without ACLs, if Sally, Joe and Bill need rw access to a file/dir, just
create another group with just those three people in. Over time, of
course, this leads to massive group proliferation. Without Tim Hockin's
patch, 32 groups is maximum number of groups a user can be a member of.
Dax
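To see how close a given account is to the group limit discussed above, a process can ask for its own supplementary group count and the system limit. The following is a minimal sketch using getgroups() and sysconf(); the helper names are made up for this example, and the limit it reports depends on the running kernel and C library rather than being fixed at 32.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Number of supplementary groups of the calling process, or -1 on error. */
static int supplementary_group_count(void)
{
	/* With a size of 0, getgroups() only reports the count. */
	return getgroups(0, NULL);
}

/* System limit on supplementary groups per process, or -1 if unavailable. */
static long supplementary_group_limit(void)
{
	return sysconf(_SC_NGROUPS_MAX);
}

/* Print both values; returns 0 on success, -1 if either query failed. */
static int report_group_usage(FILE *out)
{
	int count = supplementary_group_count();
	long limit = supplementary_group_limit();

	assert(out != NULL);
	if (count < 0 || limit < 0)
		return -1;
	fprintf(out, "member of %d supplementary groups, limit %ld\n",
		count, limit);
	return 0;
}
```

Calling report_group_usage(stdout) from a main() prints one line for the current process. The count reflects the group list the process was started with, not whatever the group database says right now.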
On Wed, Oct 30, 2002 at 11:48:23PM -0700, Dax Kelson wrote:
> Technically speaking you can achieve ACL like permissions/behavior
> using the historical UNIX security model by creating a group EACH
> time you run into a unique case permission scenario.
I'm not arguing against this... I'm claiming POSIX ACLs are mostly
brain-dead and almost worthless (broken by committee pressure and too
many people making stupid concessions).
If we must have ACLs, why not do it right?
> Without ACLs, if Sally, Joe and Bill need rw access to a file/dir,
> just create another group with just those three people in. Over
> time, of course, this leads to massive group proliferation. Without
> Tim Hockin's patch, 32 groups is maximum number of groups a user can
> be a member of.
How many people actually need this level of complexity?
Why are we adding all this shit and bloat because of perceived
problems most people don't have? What next, some kind of misdesigned
in-kernel CryptoAPI?
--cw
On 30 Oct 2002, Dax Kelson wrote:
> Without ACLs, if Sally, Joe and Bill need rw access to a file/dir, just
> create another group with just those three people in. Over time, of
If Sally, Joe and Bill need rw access to a directory, and Joe and Bill
are using existing userland (any OS I'd seen), then Sally can easily
fuck them into the next month and not in a good way.
_That_ is the real problem. Until that is solved (i.e. until all
userland is written up to the standards allegedly followed in writing
suid-root programs wrt hostile filesystem modifications) NO mechanism
will help you. ACLs, huge groups, whatever - setups with that sort
of access allowed are NOT SUSTAINABLE with the current userland(s).
On Thu, 2002-10-31 at 00:10, Alexander Viro wrote:
>
>
> On 30 Oct 2002, Dax Kelson wrote:
>
> > Without ACLs, if Sally, Joe and Bill need rw access to a file/dir, just
> > create another group with just those three people in. Over time, of
>
> If Sally, Joe and Bill need rw access to a directory, and Joe and Bill
> are using existing userland (any OS I'd seen), then Sally can easily
> fuck them into the next month and not in a good way.
I think the normal intent is to let Sally, Joe, and Bill have their own
private directory protected from THE REST OF THE USERS.
If a member of your trusted circle goes rogue, then, yup you are screwed
for the moment. It shouldn't last a whole month though.
That is what backups, and employment termination is for.
Dax
On 31 Oct 2002, Dax Kelson wrote:
> I think the normal intent is to let Sally, Joe, and Bill have their own
> private directory protected from THE REST OF THE USERS.
>
> If a member of your trusted circle goes rogue, then, yup you are screwed
> for the moment. It shouldn't last a whole month though.
>
> That is what backups, and employment termination is for.
Then give them all the same account and be done with that. Effect will
be the same.
On Wed, Oct 30, 2002 at 06:31:36PM -0800, you [Linus Torvalds] wrote:
>
> > Crash Dumping (LKCD)
>
> This is definitely a vendor-driven thing. I don't believe it has any
> relevance unless vendors actively support it.
I don't think this is just a vendor thing. Currently, linux doesn't have any
way of saving the crash dump when the box crashes. So if it crashes, the
user needs to write the oops down by hand (error prone, the interesting part
has often scrolled off screen), or attach a serial console (then he needs to
reproduce it - not always possible, and actually majority of people (home
users) don't have second box and the cable. Nor the motivation.)
So, imho some way of semi-automatically saving the dumps is needed. If
vendors even support it - great - but it has value to mainline kernel as
well, as people can submit more accurate error reports. Besides, if it goes
in mainline, I believe vendors are likely to support it. (Why wouldn't they?
Currently there just isn't a standard way of doing this.)
There are a bunch of patches for this sort of thing (Willy Tarreau's
kmsgdump for dumping to floppy, Ingo's netconsole, Rusty's oopser for
dumping to ide device...), but lkcd is a more general framework, and can
support different ways of dumping.
I know you are not keen on kernel debuggers, but I can't see what's
fundamentally wrong with being able to save the crucial info when a crash
happens...
-- v --
[email protected]
On Thu, 31 Oct 2002, Ville Herva wrote:
> On Wed, Oct 30, 2002 at 06:31:36PM -0800, you [Linus Torvalds] wrote:
> > > Crash Dumping (LKCD)
> >
> > This is definitely a vendor-driven thing. I don't believe it has any
> > relevance unless vendors actively support it.
>
> I don't think this is just a vendor thing. Currently, linux doesn't have any
> way of saving the crash dump when the box crashes. So if it crashes, the
> user needs to write the oops down by hand (error prone, the interesting part
> has often scrolled off screen), or attach a serial console (then he needs to
> reproduce it - not always possible, and actually majority of people (home
> users) don't have second box and the cable. Nor the motivation.)
Except on m68k, where we've had a feature to store all kernel messages in an
unused portion of memory (e.g. some Chip RAM on Amiga) and recover them after
reboot since ages.
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
On Thu, Oct 31, 2002 at 10:23:32AM +0100, you [Geert Uytterhoeven] wrote:
>
> Except on m68k, where we've had a feature to store all kernel messages in an
> unused portion of memory (e.g. some Chip RAM on Amiga) and recover them after
> reboot since ages.
There was a similar thing for x86 as well:
http://www.tux.org/hypermail/linux-kernel/1999week27/0782.html
Of course it never went to mainline (and I don't know how well it worked.)
From what I understand, lkcd can support such a method easily.
-- v --
[email protected]
> Yes, people use it. Not quite sure why though, I guess ACLs
> buy some flexibility over the user/group/other model but if
> the "unlimited groups" patch goes in (is in?) I'm happy ;)
Correct me if I'm wrong but I believe a process has to be
restarted to have its group membership list changed?
That's a huge difference from ACL behavior which allows for changes to
file access rights without the need to restart the accessing process.
--
Leszek.
-- [email protected] 2:480/33.7 -- REAL programmers use INTEGERS --
-- speaking just for myself...
On Wed, 2002-10-30 at 21:31, Linus Torvalds wrote:
> > ext2/ext3 ACLs and Extended Attributes
>
> I don't know why people still want ACL's. There were noises about them for
> samba, but I've not heard anything since. Are vendors using this?
>
I am sure I don't count (not being a vendor), but Intermezzo offers
support for this (they are waiting on feature freeze to redo it to 2.5
according to an email I have). I want this stuff. Yes, u+g+w is nice,
but good ACLs are even better. Please, if this is technically correct
in implementation, do put it in.
Thank you,
Trever
On Thu, Oct 31, 2002 at 02:00:31PM +1100, Rusty Russell wrote:
> > > EVMS
> >
> > Not for the feature freeze, there are some noises that imply that SuSE may
> > push it in their kernels.
>
> They have, IIRC. Interestingly, it was less invasive (existing source
> touched) than the LVM2/DM patch you merged.
FUD. I added to three areas of existing code:
i) Every man and his dog uses mempools in conjunction with slabs, so
rather than having everyone redefining their own alloc/free
functions I added the following huge functions to mempool.c. In no
way were they mandatory.
/*
* A commonly used alloc and free fn.
*/
void *mempool_alloc_slab(int gfp_mask, void *pool_data)
{
kmem_cache_t *mem = (kmem_cache_t *) pool_data;
return kmem_cache_alloc(mem, gfp_mask);
}
void mempool_free_slab(void *element, void *pool_data)
{
kmem_cache_t *mem = (kmem_cache_t *) pool_data;
kmem_cache_free(mem, element);
}
ii) vcalloc, this *didn't* get merged, and will probably end up getting
moved into dm.h.
iii) ioctl32 support: people have argued against an ioctl interface,
and I'm inclined to agree with them, which is why I'm going to
publish an fs interface shortly. However, given that we are
currently using an ioctl interface how do we avoid adding support for
32bit userland/64 kernel space ? If EVMS isn't touching these
files does that mean they're not supporting these architectures ?
arch/mips64/kernel/ioctl32.c
arch/ppc64/kernel/ioctl32.c
arch/s390x/kernel/ioctl32.c
arch/sparc64/kernel/ioctl32.c
So given that (ii) didn't get merged, which of (i) and (iii) were you
objecting to ?
- Joe
On Thu, 31 Oct 2002, Rusty Russell wrote:
> In message <[email protected]> you write:
> > On Thu, 31 Oct 2002, Rusty Russell wrote:
> > > Fbdev Rewrite
> >
> > This one is just huge, and I have little personal judgement on it.
>
> It's been around for a while. Geert, Russell?
It's huge because it moves a lot of files around:
1. drivers/char/agp/ -> drivers/video/agp/
2. drivers/char/drm/ -> drivers/video/drm/
3. console related files in drivers/video/ -> drivers/video/console/
(1) and (2) should be reverted, but apparently they aren't reverted in the
patch at http://phoenix.infradead.org/~jsimmons/fbdev.diff.gz yet. The patch
also seems to remove some drivers. Haven't checked the bk repo yet.
James, can you please fix that (and the .Config files)?
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
> > POSIX Timer API
>
> I think I'll do at least the API, but there were some questions about the
> config options here, I think.
I think george just posted a config optionless patch.
WOOHOO! Thanks!
>
> > Hires Timers
>
> This one is likely another "vendor push" thing.
>
I work for a vendor who really wants this.
We have customers who demand it.
I am sure we are not alone (mvista? concurrent? any embedded space people for
whom 10msec is not good enough and the extra overhead of a higher frequency
fixed interval timer is unacceptable please speak up, if we don't get it in
now, we probably won't get it for 2 years.)
--
/**************************************************
** Mark Salisbury || [email protected] **
**************************************************/
Linus Torvalds wrote:
>>Linux Trace Toolkit (LTT)
> I don't know what this buys us.
I'd like to add a request for this to be in mainstream. The benefits
have already been stated in this thread, and it has been used here to
good effect.
>>Crash Dumping (LKCD)
> This is definitely a vendor-driven thing. I don't believe it has any
> relevance unless vendors actively support it.
I'd like to see this too. The more debug information the better as far
as I'm concerned.
>>Hires Timers
> This one is likely another "vendor push" thing.
It doesn't hurt performance when turned off, and allows for
finer-grained timing when turned on. What's not to like? I can't
comment on the actual code, but I really like the idea.
Chris
--
Chris Friesen | MailStop: 043/33/F10
Nortel Networks | work: (613) 765-0557
3500 Carling Avenue | fax: (613) 765-2986
Nepean, ON K2H 8E9 Canada | email: [email protected]
Joe Thornber wrote:
>ii) vcalloc, this *didn't* get merged, and will probably end up getting
> moved into dm.h.
>
Yeah, historically we have avoided things like this.
kcalloc gets proposed every year or so too.
>iii) ioctl32 support: people have argued against an ioctl interface,
> and I'm inclined to agree with them, which is why I'm going to
> publish an fs interface shortly. However, given that we are
> currently using an ioctl interface how do we avoid adding support for
> 32bit userland/64 kernel space ? If EVMS isn't touching these
> files does that mean they're not supporting these architectures ?
>
> arch/mips64/kernel/ioctl32.c
> arch/ppc64/kernel/ioctl32.c
> arch/s390x/kernel/ioctl32.c
> arch/sparc64/kernel/ioctl32.c
>
>
Well, I'll note that ALSA compartmentalizes their ioctl32 handling
within their own subsystem, which seems like a decent solution.
That said, [maybe I'm biased <g>], using an fs interface allows one to
completely eliminate an ioctl32 interface. That would be the direction
I would greatly prefer by the time 2.5.x hits the code freeze.
Best regards, and congrats for getting it merged,
Jeff
Chris Wedgwood wrote:
>problems most people don't have? What next, some kind of misdesigned
>in-kernel CryptoAPI?
>
>
Ok, I'll allow myself to be trolled.
What's wrong with our current 2.5.45 crypto api?
On Thu, 2002-10-31 at 14:26, Jeff Garzik wrote:
> Yeah, historically we have avoided things like this.
> kcalloc gets proposed every year or so too.
I would like to see both of these in because tons of kernel fixing that
has been done through audits has been about
get_user(a, ...)
kmalloc(a * sizeof(b), ..)
We end up with loads of ugly > MAXINT/sizeof(foo) if checks in the code
that ought to be in one place
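
A minimal user-space sketch of the "one place" idea above, assuming a hypothetical helper named checked_calloc (this is not the kernel's API, and malloc stands in for kmalloc). It refuses any request where the element count times the element size would wrap around, so callers no longer need their own MAXINT/sizeof(foo) checks:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper for illustration only: allocate zeroed room for
 * n elements of the given size, refusing requests whose total byte
 * count would overflow size_t.
 */
static void *checked_calloc(size_t n, size_t size)
{
    void *p;

    /* The single overflow check that callers would otherwise repeat. */
    if (size != 0 && n > SIZE_MAX / size)
        return NULL;

    p = malloc(n * size);
    if (p != NULL)
        memset(p, 0, n * size);
    return p;
}
```

A caller that copied a count from user space would pass it straight to the helper and only test the result for NULL, in the same way it already has to for an ordinary allocation failure.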
On Thu, Oct 31, 2002 at 02:39:23AM +0000, Linus Torvalds wrote:
>
> On Thu, 31 Oct 2002, Rusty Russell wrote:
> >
> > Here is the list of features which are being actively
> > pushed, not NAK'ed, and are not in 2.5.45. There are 13 of them, as
> > appropriate for Halloween.
>
> I'm unlikely to be able to merge everything by tomorrow, so I will
> consider tomorrow a submission deadline to me, rather than a merge
> deadline. That said, I merged everything I'm sure I want to merge today,
> and the rest I simply haven't had time to look at very much.
>
>
> > Crash Dumping (LKCD)
>
> This is definitely a vendor-driven thing. I don't believe it has any
> relevance unless vendors actively support it.
>
Linus,
I wish you could have made it to the OLS RAS BOF and seen this for
yourself - the vendor support, the need and the drive towards a
unified and flexible dumping framework.
The problem with dump has not been lack of vendor interest. There
wouldn't have been multiple dump type implementations floating around
if there wasn't a need -- LKCD, Mission Critical dump, Ingo's
network dump, kmsgdump, Rusty's oops dumper to cite some. The difficulty
has been technical and hence the diversity of approaches that different
projects came up with to tackle the problem (arising from slightly
different priorities and environments in each case). The second has
been related to preferences in the kind of user level analysis tools.
And the LKCD project has been evolving to address these very
problems to bring the best of these worlds together and also allow
flexibility on the choice of analysis tools !
Mission critical Linux project code base for example is now being
maintained as part of the LKCD project. Either lcrash or mission
critical linux crash can be used for analysing LKCD dumps.
And on the kernel side of things:
(a) The dump driver interface in LKCD has been specifically
designed to enable different kinds of dumping mechanisms and
targets to be supported -- generic block, network dump ,
polled-IDE (Rusty style) etc, even alternate dump targets failover
and multiple dump devices in the future if required. We are also
experimenting with a memory dump driver to save dump to memory
and dump after a memory preserving soft-boot, reusing the mission
critical mcore technique.
(b) Selective dumping, for different levels of dump data - one
option that was added recently would dump all kernel pages
and is likely to be commonly used (gzip compressed dump). It's
pretty easy to extend to more selectivity or different levels
and the dump also occurs in passes from more critical data to
less critical.
(The page in use flag was added to help with this)
(c) The core pieces which touch the kernel as such just add basic
infrastructure that is needed in the kernel for any dumping
facility. Includes:
- Enabling IPI to collect CPU state on all processors in the
system right when dump is triggered (may not be a normal
situation, so NMIs where supported are the best option)
- Ability to quiesce (silence) the system before dumping
(and if in non-disruptive mode, then restore it back)
- Calls into dump from kernel paths (panic, oops, sysrq
etc).
- Exports of symbols to help with physical memory
traversal and verification
As Matt has said there is an active development community behind
LKCD and a lot of the drive for that has come from companies who use it
and are really hoping hard that it becomes part of the mainline.
BTW, the code has also been scrutinised and reviewed over
lkml as well and undergone iterations of releases following
that. Anything else there that you think needs to be fixed please
do let us know.
Regards
Suparna
>> Linux Trace Toolkit (LTT)
>
>I don't know what this buys us.
If you consider developer productivity useful then LTT has definite
benefits especially when combined with kprobes. With the two it is possible
to implant tracepoints without having to code up specific printks: kprobes
can be used to implant a probe, the probe handler can call LTT to record
the event.
Why call LTT instead of having a printk in the probe handler? - for
performance reasons, for latency reasons, because kprobes can implant
probes absolutely anywhere in the system, for analysis reasons - LTT trace
data can be post processed and massaged in a number of ways using the
visualizer tools. Yes you can do some of this using printk directly, but
you can be into a whole heap more work and it will certainly take longer to
implant a temporary tracepoint, recompile, run, remove, recompile than using
the dynamic trace technique of LTT+kprobes.
Richard
>> Crash Dumping (LKCD)
>
>This is definitely a vendor-driven thing. I don't believe it has any
>relevance unless vendors actively support it.
I can't argue with the fact you want to view lkcd this way. However as a
developer I have found a crash dump facility indispensable for certain
problems, particularly those that involve multiple processors where to use
more invasive techniques such as an interactive debugger can make the
problem unreproducible. It's also worth pointing out that each of the
serviceability tools (dump, trace, probes) complements each other. They are
ever so much more powerful when used as a set: lkcd can capture a trace
buffer, whose contents would otherwise be lost; kprobes enables LTT to
implant tracepoints dynamically; kprobes + lkcd allows a crash dump to be
triggered for complex and specific conditions that are difficult to
reproduce. Without such tools, data gathering for complex problems becomes
a problem in itself. A problem doesn't necessarily have to be reproducible
to make it necessary to solve.
Richard
On 2002-10-31T14:56:27,
Richard J Moore <[email protected]> said:
> >> Crash Dumping (LKCD)
> >This is definitely a vendor-driven thing. I don't believe it has any
> >relevance unless vendors actively support it.
As time to repair is critical for availability (obviously) and having a good
crash dump will help reduce this, I'd also like to point out that such a
dumping framework is very important. Please, merge it.
Sincerely,
Lars Marowsky-Brée <[email protected]>
--
Principal Squirrel
SuSE Labs - Research & Development, SuSE Linux AG
"If anything can go wrong, it will." "Chance favors the prepared (mind)."
-- Capt. Edward A. Murphy -- Louis Pasteur
On Wed, 30 Oct 2002, Matt D. Robinson wrote:
> Linus Torvalds wrote:
> > > Crash Dumping (LKCD)
> >
> > This is definitely a vendor-driven thing. I don't believe it has any
> > relevance unless vendors actively support it.
>
> There are people within IBM in Germany, India and England, as well as
> a number of companies (Intel, NEC, Hitachi, Fujitsu), as well as SGI
> that are PAID to support this.
That's fine. And since they are paid to support it, they can apply the
patches.
What I'm saying by "vendor driven" is that it has no relevance for the
standard kernel, and since it has no relevance to that, then I have no
incentives to merge it. The crash dump is only useful with people who
actively look at the dumps, and I don't know _anybody_ outside of the
specialized vendors you mention who actually do that.
I will merge it when there are real users who want it - usually as a
result of having gotten used to it through a vendor who supports it. (And
by "support" I do not mean "maintain the patches", but "actively uses it"
to work out the users problems or whatever).
Horse before the cart and all that thing.
People have to realize that my kernel is not for random new features. The
stuff I consider important are things that people use on their own, or
stuff that is the base for other work. Quite often I want vendors to merge
patches _they_ care about long long before I will merge them (examples of
this are quite common, things like reiserfs and ext3 etc).
THAT is what I mean by vendor-driven. If vendors decide they really want
the patches, and I actually start seeing noises on linux-kernel or getting
requests for it being merged from _users_ rather than developers, then
that means that the vendor is on to something.
Linus
Richard J Moore wrote:
> With the two it is possible to implant tracepoints without having to
> code up specific printks: kprobes can be used to implant a probe,
> the probe handler can call LTT to record the event.
Hey, that _is_ useful. Me like. Me spent many times wondering what
gets called when, and hunting heisenbugs masked by printk slowness.
-- Jamie
Linus,
LTT is one step in allowing Linux to continue to move towards being a
viable alternative for more than just hackers. It is part of a larger
effort to provide reliability and serviceability. Concretely it allows
application/subsystem programmers to understand the performance of their
applications and the system. I should note, it also allows people to
improve kernel behavior as well. As we have communicated in the past, the
ability to gather and analyze this data is vital. From my correspondences
with Ingo
"If you care about performance you will want to trace. On two previous
kernels I have worked on I've heard this comment ["we don't need tracing"].
Once the infrastructure was in it was used and appreciated." There were
world-class programmers involved in these projects that did not see the
value of such infrastructure until they were able to use it.
I think Karim provided a list of possible uses, there are countless
applications of this - I'll list some more:
seeing where unexplained idle time is occurring
understanding where interrupt processing time is going
understanding interactions between applications - which is running when
etc etc etc
If you look around the kernel, subsystems, and applications, you will find
growing numbers of one-off-ways of gathering this information. Providing a
unified way for different developers to communicate about performance will
significantly improve the ability to performance debug different
applications, drivers, system/application interaction, etc.
LTT has existed for a long time now and recent additions have been well
motivated: For a while now I have been working with the RAS team at IBM and
with Karim Yaghmour to streamline LTT and make it perform well on MPs. We
have addressed all the concerns raised by yourself, Ingo, and others from
previous postings. If there remains concern, it is also possible for one
to disable tracing. Some of the features we put into LTT came from ideas
we prototyped in K42 (http://www.research.ibm.com/K42) which in turn was developed
based on my experience writing a tracing infrastructure for IRIX while
working for SGI, and others' experiences with AIX's tracing facilities.
LTT is a valuable aspect in allowing developers using Linux to understand
their application's and the system's behavior. It serves to strengthen
Linux's RAS capabilities and would be great to get included into 2.5.
Thanks.
Thank you.
Robert Wisniewski
The K42 MP OS Project
Advanced Operating Systems
Scalable Parallel Systems
IBM T.J. Watson Research Center
914-945-3181
http://www.research.ibm.com/K42/
[email protected]
Alexander Viro wrote:
>On 31 Oct 2002, Dax Kelson wrote:
>
>>I think the normal intent is to let Sally, Joe, and Bill have their own
>>private directory protected from THE REST OF THE USERS.
>>
>>If a member of your trusted circle goes rogue, then, yup you are screwed
>>for the moment. It shouldn't last a whole month though.
>>
>>That is what backups, and employment termination is for.
>>
>>
>
>Then give them all the same account and be done with that. Effect will
>be the same.
>
>
Unless I'm missing something, that only works if all the users need
*exactly* the same permissions to all files, which isn't a good assumption.
Example: Sally is an accountant, Joe and Bill are engineers.
Bill and Joe are working on a project, and Sally is cost control for
that project - they all need access to the project files. Bill and Joe
do not need access to officer salary data, but Sally does. Bill and Joe
need access to other projects (not necessarily the same ones), but Sally
doesn't. Oops.
- Steve
I don't mean to pick on LTT, I haven't used it, it may be the best thing
since sliced bread.
I can tell you how to present this and any other feature similar to this
in a way which would make me a lot more willing to accept it, which
presupposes I'm doing Linus' job which of course I am not. However,
it's likely that Linus has similar views but he gets to chime in and
speak for himself.
All of these tools/features/whatever add some cost. The cost can be
measured in lots of different ways:
- lines of code
- lines of code which can't be configed out
- call depth increases
- stack size increases
- cache foot print increases
- parallelism (think preempt)
- interface changes
I suspect there are other metrics and it would be very cool if others would
chime in with their pet peeves.
What would be cool is if there was some way to quantify as much as possible
of the accepted set of costs so that that could be balanced against the
value of the change, right?
The one that always gets me is
"I've added feature XYZ, I benchmarked it with <whatever, usually
LMbench> and it didn't make a difference"
That is almost certainly misleading. The real thing you want to do
is quantify the actual costs because there can be non-zero costs that
do not show up in benchmarks. For example, suppose that the benchmark
neatly fits in the onchip caches and it only uses 1/2 of those caches.
Your change could increase the cache foot print to just fill the caches,
the benchmark says no difference, you declare success and move on.
The problem is that almost all changes are good enough that they match
this description. Measuring them in isolation doesn't tell us enough.
If I combine two changes, both of which use up 1/2 the cache, there is
no longer any room for anything else in the cache.
I'd love to see a trend where patch requests for any non-trivial patch
included before/after data for the above metrics (and any others that
people see as useful). I'd love to see some people taking just one of
the above and making a tool which measures that metric. Then we combine
the tools into a "patch measurement suite" and start prefixing patches
with
Code changes:
+1234 -5678 = -4444 (all code)
+123 -567 = -444 (all code subject to CONFIG_XYZ)
Call depth:
+2 for read()
+2 for write()
no change for all other system calls
Stack size:
+2099 bytes for read()/write() path
Cache misses:
No change for benchmark1, 2, 3
12,000 data read misses for lat_ctx ....
Etc.
What does the list think of this?
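
As a rough illustration of the first metric in the sample report above (the "Code changes" line), here is a minimal sketch that counts added and removed lines in unified-diff text. The names diff_counts and count_diff_lines are hypothetical, and the sketch assumes that a leading '+' or '-' marks a changed line while lines starting with "+++ " or "--- " are file headers, so a removed line whose own text begins with "-- " would be miscounted:

```c
#include <string.h>

/* Hypothetical result type for this sketch. */
struct diff_counts {
    long added;
    long removed;
};

/*
 * Count added and removed lines in a unified diff held in memory.
 * File header lines ("+++ " and "--- ") are skipped.
 */
static struct diff_counts count_diff_lines(const char *diff)
{
    struct diff_counts c = { 0, 0 };
    const char *line = diff;

    while (*line != '\0') {
        const char *end = strchr(line, '\n');

        if (line[0] == '+' && strncmp(line, "+++ ", 4) != 0)
            c.added++;
        else if (line[0] == '-' && strncmp(line, "--- ", 4) != 0)
            c.removed++;

        if (end == NULL)
            break;          /* last line had no trailing newline */
        line = end + 1;
    }
    return c;
}
```

The net figure in the report would then be added minus removed. The other metrics (call depth, stack size, cache misses) need real measurement tools and are not covered by this sketch.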
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
On Wed, Oct 30, 2002 at 09:43:29PM -0500, Alexander Viro wrote:
>
>
> On Wed, 30 Oct 2002, Linus Torvalds wrote:
>
> > > ext2/ext3 ACLs and Extended Attributes
> >
> > I don't know why people still want ACL's. There were noises about them for
> > samba, but I've not heard anything since. Are vendors using this?
>
> Because People Are Stupid(tm). Because it's cheaper to put "ACL support: yes"
> in the feature list under "Security" than to make sure that userland can cope
> with anything more complex than "Me Og. Og see directory. Directory Og's.
> Nobody change it". C.f. snake oil, P.T.Barnum and esp. LSM users
It's nearly useless in a Unix-only context, true, however there's a rather
serious impedance mismatch for serving files to Windows that this
addresses. Emulating ACLs on the fly with groups to fit into the
Windows model is mostly doable but ain't pretty.
--
"Love the dolphins," she advised him. "Write by W.A.S.T.E.."
On Thu, 31 Oct 2002, Stephen Wille Padnos wrote:
> >Then give them all the same account and be done with that. Effect will
> >be the same.
> >
> >
>
> Unless I'm missing something, that only works if all the users need
> *exactly* the same permissions to all files, which isn't a good assumption.
That's the point. In practice shared writable access to a directory can be
easily elevated to full control of each others' accounts, since most of
userland code is written in implicit assumption that nothing bad happens with
directory structure under it. And there is nothing kernel can do about that -
attacker does action you had explicitly allowed and your program goes bonkers
since it can't cope with that. Mechanism used to allow that action doesn't
enter the picture - be it ACLs, groups or something else.
An excellent engineering practice but extremely difficult to do. This is
the holy-grail of software design and I don't think it would work for an
extremely loosely connected set of developers.
There is no central control of the system (or chain of accountability) and
that knocks down the practicality of this plan. It would work extremely
well in another project, though.
} What does the list think of this?
Linus Torvalds <[email protected]> writes:
>> ext2/ext3 ACLs and Extended Attributes
>I don't know why people still want ACL's. There were noises about them for
>samba, but I've not heard anything since. Are vendors using this?
CIFS/SMB. Replacing Windows Fileservers. Supporting the required Windows
semantics. World domination.
That's one patch I personally consider really important. Getting the API in
place and a couple of FSses supporting it. The rest is up to user space.
Regards
Henning
--
Dipl.-Inf. (Univ.) Henning P. Schmiedehausen -- Geschaeftsfuehrer
INTERMETA - Gesellschaft fuer Mehrwertdienste mbH [email protected]
Am Schwabachgrund 22 Fon.: 09131 / 50654-0 [email protected]
D-91054 Buckenhof Fax.: 09131 / 50654-20
Larry McVoy writes:
> I don't mean to pick on LTT, I haven't used it, it may be the best thing
> since sliced bread.
...
> > The one that always gets me is
>
> "I've added feature XYZ, I benchmarked it with <whatever, usually
> LMbench> and it didn't make a difference"
Larry,
You're right - whoever wrote that useless LMbench anyway :-)
I agree it would be great to have a tool that allows us to gather
information on some of what you suggest below - but it's hard - people in
software engineering have been working on such things for a long time.
Further, what you mention below does not make sense in isolation. For
example a package could add 1000 lines of code and have almost no impact,
while another 10 lines of code could make a huge difference. So while the
below metrics are fine, without arguing about the expected impact they're
not necessarily helpful.
That's why benchmarks are still helpful as they are indicative of what
expected performance might be. If you're trying to get at maintainability
then I might (being a K42 convert) argue for a different strategy
altogether.
So what about LTT then. Well sure enough we did run LMbench as well as some other
tests. We ran a kernel compile, a tar, and LMbench - and posted results to
lkml. While this hardly represents all possibilities, showing little
performance impact on these is a positive statement about impact on other
applications.
To address some of the list below:
lines of code: a lot - almost all can be configed out,
call depth increase: we can analyze - complicated since while it is a
couple levels - other calls in the code may be to
cache footprint: how? - simulate? this is tough - qualitatively I think for
ltt is small because the same code is used across all trace
events. And less frequent trace events won't interfere
parallelism: not quite sure what you mean here - we now have a non-blocking
lockless scheme to address what I think the concern is here
interface changes: I argue very very positive - as in my letter to Linus
getting various developers to talk about performance
with a common mechanism would be a big win
I'm sure this doesn't fully address your concerns - but if others feel some
of the below numbers are really important we can certainly go about getting
more accurate results than my above off-the-cuff info.
Robert Wisniewski
The K42 MP OS Project
Advanced Operating Systems
Scalable Parallel Systems
IBM T.J. Watson Research Center
914-945-3181
http://www.research.ibm.com/K42/
[email protected]
----
Larry McVoy writes:
> I don't mean to pick on LTT, I haven't used it, it may be the best thing
> since sliced bread.
>
> I can tell you how to present this and any other feature similar to this
> in a way which would make me a lot more willing to accept it, which
> presupposes I'm doing Linus' job which of course I am not. However,
> it's likely that Linus has similar views but he gets to chime in and
> speak for himself.
>
> All of these tools/features/whatever add some cost. The cost can be
> measured in lots of different ways:
>
> - lines of code
> - lines of code which can't be configed out
> - call depth increases
> - stack size increases
> - cache foot print increases
> - parallelism (think preempt)
> - interface changes
>
> I suspect there are other metrics and it would be very cool if others would
> chime in with their pet peeves.
>
> What would be cool is if there was some way to quantify as much as possible
> of the accepted set of costs so that that could be balanced against the
> value of the change, right?
>
> The one that always gets me is
>
> "I've added feature XYZ, I benchmarked it with <whatever, usually
> LMbench> and it didn't make a difference"
>
> That is almost certainly misleading. The real thing you want to do
> is quantify the actual costs because there can be non-zero costs that
> do not show up in benchmarks. For example, suppose that the benchmark
> neatly fits in the onchip caches and it only uses 1/2 of those caches.
> Your change could increase the cache foot print to just fill the caches,
> the benchmark says no difference, you declare success and move on.
> The problem is that almost all changes are good enough that they match
> this description. Measuring them in isolation doesn't tell us enough.
> If I combine two changes, both of which use up 1/2 the cache, there is
> no longer any room for anything else in the cache.
>
> I'd love to see a trend where patch requests for any non-trivial patch
> included before/after data for the above metrics (and any others that
> people see as useful). I'd love to see some people taking just one of
> the above and making a tool which measures that metric. Then we combine
> the tools into a "patch measurement suite" and start prefixing patches
> with
>
> Code changes:
> +1234 -5678 = -4444 (all code)
> +123 -567 = -444 (all code subject to CONFIG_XYZ)
>
> Call depth:
> +2 for read()
> +2 for write()
> no change for all other system calls
>
> Stack size:
> +2099 bytes for read()/write() path
>
> Cache misses:
> No change for benchmark1, 2, 3
> 12,000 data read misses for lat_ctx ....
>
> Etc.
>
> What does the list think of this?
> --
> ---
> Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
On Oct 30, 6:31pm, Linus Torvalds wrote:
} Subject: Re: What's left over.
> > ext2/ext3 ACLs and Extended Attributes
>
> I don't know why people still want ACL's. There were noises about
> them for samba, but I've not heard anything since. Are vendors using
> this?
I can offer a perspective from someone who has been struggling to get
Linux competitive in real-life enterprise situations.
ACL's are an issue for Linux (and Samba) in order for the combination
to sustain competitiveness against Novell and NT in the desktop
fileservices domain. The harsh reality of life is that file and
document sharing is a way of life in the environments where Novell
dominates. The appearance of ACL's and desktop support for their
management in NT would tend to confirm this.
Without the granularity of ACL's it becomes too difficult to establish
the types of permission environments needed to support what most
administrative and department support personnel (ie, secretaries) seem
to desire.
The patches also begin implementing a common API framework which
multiple filesystems seem to be able to leverage. At least the rumor
appears to be that the infrastructure allows common toolsets to be
used for both ext2/3, XFS and perhaps other filesystems which want to
implement ACL's.
It's a compilation option and if set to default minimizes the impact on
people who don't need or want the infrastructure. Ted also has his
fingers in the project which probably means that it isn't going to get
neglected.
Just my 2 cents.
Best wishes for a productive weekend to everyone.
Greg
}-- End of excerpt from Linus Torvalds
As always,
Dr. G.W. Wettstein, Ph.D. Enjellic Systems Development, LLC.
4206 N. 19th Ave. Specializing in information infra-structure
Fargo, ND 58102 development.
PH: 701-281-4950 WWW: http://www.enjellic.com
FAX: 701-281-3949 EMAIL: [email protected]
------------------------------------------------------------------------------
"Open source code is not guaranteed nor does it come with a warranty."
-- the Alexis de Tocqueville Institute
"I guess that's in contrast to proprietary software, which comes with
a money-back guarantee, and free on-site repairs if any bugs are found."
-- Rary
* Oliver Xymoron ([email protected]) wrote:
> On Wed, Oct 30, 2002 at 09:43:29PM -0500, Alexander Viro wrote:
> > Because People Are Stupid(tm). Because it's cheaper to put "ACL support: yes"
> > in the feature list under "Security" than to make sure that userland can cope
> > with anything more complex than "Me Og. Og see directory. Directory Og's.
> > Nobody change it". C.f. snake oil, P.T.Barnum and esp. LSM users
>
> It's nearly useless in a Unix-only context, true, however there's a rather
> serious impedance mismatch for serving files to Windows that this
> addresses. Emulating ACLs on the fly with groups to fit into the
> Windows model is mostly doable but ain't pretty.
It's only nearly useless if you have some desire as an admin to
constantly be creating groups and changing group lists for users. This
is not a feature which is useful only when serving files to Windows
machines, not even nearly. AFS, Solaris, Irix etc have support for ACLs
and have a great deal of people who use them. The simple yet common
situation of one user who wants to give even just read access to
another specific user for a given file is a pain in the ass to deal with
given the current structure.
Stephen
On Thu, 31 Oct 2002, Stephen Frost wrote:
> So you're not really arguing against ACLs, you're complaining that
> userspace is broken when there's shared write access. That's fine,
> userspace should be fixed, inclusion of ACLs into the kernel shouldn't
> be denied because of this. ACLs should be optional, of course, and if
> you want them some really noisy warnings about the problems of shared
> writeable area with current userspace tools. Of course, that same
> warning should probably be included in 'groupadd'.
No. I'm saying that ACLs do not have a point until at least basic
userland gets ready for setups people want ACLs for. Adding features that
can't be used until $BIG_WORK is done is idiocy in the best case and
danger in the worst. Especially since $BIG_WORK does not depend on these
features.
Rusty Russell <[email protected]> writes:
> > > statfs64
> >
> > I haven't even seen it.
>
> It's fairly old, but Peter Chubb said there was some vendor interest
> for v. large devices. Peter?
statfs64 is needed when you want to access large NFS servers (>2TB is
becoming quite common for NAS) and want to have working "df" for them.
Currently it is scaled by wsize==blocksize, so it only breaks when
fileserversize/wsize > 2^31. For 1KB wsize it breaks with 2TB, with
4KB with 8TB etc. While 1KB wsize is arguably stupid (but happens sometimes
in practice), 8TB is not an unrealistic size for an NFS server these
days.
I did a hack to scale the NFS block size in stat to make sure it fits
into 31bit, but statfs64 would be the correct solution for it really.
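The arithmetic above can be checked with a short sketch. It assumes the block count is limited to 31 bits, as the message describes. The scale_to_31bit() helper is purely hypothetical: it shows one plausible way to scale the reported block size, and is not the actual NFS hack mentioned here.

```python
# Block counts must stay below 2^31 to fit the 31-bit field described above.
MAX_BLOCKS = 2**31

KIB = 1024
TIB = 1024**4


def overflow_threshold(wsize):
    """Smallest server size in bytes whose block count reaches 2^31."""
    return MAX_BLOCKS * wsize


def scale_to_31bit(total_bytes, block_size):
    """Hypothetical scaling: double the block size until the count fits."""
    blocks = total_bytes // block_size
    while blocks >= MAX_BLOCKS:
        block_size *= 2
        blocks = total_bytes // block_size
    return block_size, blocks


# Print the breaking point for a few wsize values.
for wsize_kib in (1, 4, 8):
    limit = overflow_threshold(wsize_kib * KIB)
    print(f"wsize {wsize_kib}KB: breaks at {limit // TIB}TB")
```

Scaling like this keeps "df" working at the cost of reporting a block size that is not the real wsize, which is why statfs64 is the cleaner fix.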
Also I would like to propose the nanosecond stat patches. It doesn't add
new system calls, but just uses spare fields in the existing stat64
structure and closes a hole in make.
-Andi
* Alexander Viro ([email protected]) wrote:
> On Thu, 31 Oct 2002, Stephen Wille Padnos wrote:
> > Unless I'm missing something, that only works if all the users need
> > *exactly* the same permissions to all files, which isn't a good assumption.
>
> That's the point. In practice shared writable access to a directory can be
> easily elevated to full control of each others' accounts, since most of
> userland code is written in implicit assumption that nothing bad happens with
> directory structure under it. And there is nothing kernel can do about that -
> attacker does action you had explicitly allowed and your program goes bonkers
> since it can't cope with that. Mechanism used to allow that action doesn't
> enter the picture - be it ACLs, groups or something else.
So you're not really arguing against ACLs, you're complaining that
userspace is broken when there's shared write access. That's fine,
userspace should be fixed, inclusion of ACLs into the kernel shouldn't
be denied because of this. ACLs should be optional, of course, and if
you want them some really noisy warnings about the problems of shared
writeable area with current userspace tools. Of course, that same
warning should probably be included in 'groupadd'.
Stephen
On Thu, 31 Oct 2002, Linus Torvalds wrote:
>
> On Wed, 30 Oct 2002, Matt D. Robinson wrote:
>
> > Linus Torvalds wrote:
> > > > Crash Dumping (LKCD)
> > >
> > > This is definitely a vendor-driven thing. I don't believe it has any
> > > relevance unless vendors actively support it.
> >
> > There are people within IBM in Germany, India and England, as well as
> > a number of companies (Intel, NEC, Hitachi, Fujitsu), as well as SGI
> > that are PAID to support this.
To add to that list, here at Purdue University, we actively look at crash
dumps on other architectures, such as IBM AIX, and are starting to do the
same on Linux machines, after discovery of LKCD.
> What I'm saying by "vendor driven" is that it has no relevance for the
> standard kernel, and since it has no relevance to that, then I have no
> incentives to merge it. The crash dump is only useful with people who
> actively look at the dumps, and I don't know _anybody_ outside of the
> specialized vendors you mention who actually do that.
This has much relevance for the standard kernel, as much relevance as gdb
has for people using applications. While a majority of non-techno-geek
end-users probably don't care about the patch, I'm certain that there are
plenty of organizations out there like Purdue that WANT lkcd to become a
standard part of the Linux kernel. Until then, we're forced to do our
own kernel patching every time we push out a new kernel.
> I will merge it when there are real users who want it - usually as a
> result of having gotten used to it through a vendor who supports it. (And
> by "support" I do not mean "maintain the patches", but "actively uses it"
> to work out the users problems or whatever).
We actively use it.
> People have to realize that my kernel is not for random new features. The
> stuff I consider important are things that people use on their own, or
> stuff that is the base for other work. Quite often I want vendors to merge
> patches _they_ care about long long before I will merge them (examples of
> this are quite common, things like reiserfs and ext3 etc).
LKCD isn't a 'random new feature'. It's something that is present in
nearly every other "Unix" on the market. (Yes I know Unix != Linux). It's
a feature that should have been integrated by now IMHO.
> THAT is what I mean by vendor-driven. If vendors decide they really want
> the patches, and I actually start seeing noises on linux-kernel or getting
> requests for it being merged from _users_ rather than developers, then
> that means that the vendor is on to something.
Again, we're the end-user, not the vendor, and we're trying to drive to
have it included. I've talked with other sys admins in my department
here at Purdue, and have gotten a unanimous response that "It would be a
good and useful feature to have."
Pat
--
Purdue University ITAP/RCS
Information Technology at Purdue
Research Computing and Storage
http://www-rcd.cc.purdue.edu
http://dilbert.com/comics/dilbert/archive/images/dilbert2040637020924.gif
I'm a user, and I request that LKCD get merged into the kernel. :-)
On Thu, Oct 31, 2002 at 07:46:08AM -0800, Linus Torvalds wrote:
> What I'm saying by "vendor driven" is that it has no relevance for the
> standard kernel, and since it has no relevance to that, then I have no
> incentives to merge it. The crash dump is only useful with people who
> actively look at the dumps, and I don't know _anybody_ outside of the
> specialized vendors you mention who actually do that.
I actively look at LKCD dumps. I have no affiliation with SGI, IBM, or any
of the previously mentioned companies. I'm not aware of any vendors providing
pre-patched kernels with LKCD; right now my only option for reasonable crash
data is to patch and build my own kernel.
> I will merge it when there are real users who want it - usually as a
> result of having gotten used to it through a vendor who supports it. (And
> by "support" I do not mean "maintain the patches", but "actively uses it"
> to work out the users problems or whatever).
Here at Purdue University we're building several Linux clusters. LKCD is
most useful to help find in-kernel problems. Most of the time our crashes
are due to a flakey stick of RAM or a dying disk (or controller), but LKCD
dumps are still useful. With a crash dump I can analyze the cause of the
crash after the fact, but without a dump my only option to get _any_ crash
data is to leave a console plugged into each node of my clusters.
Do you feel like donating a 700-port console server? Right, so it's LKCD
for me then.
> People have to realize that my kernel is not for random new features. The
> stuff I consider important are things that people use on their own, or
> stuff that is the base for other work. Quite often I want vendors to merge
> patches _they_ care about long long before I will merge them (examples of
> this are quite common, things like reiserfs and ext3 etc).
>
> THAT is what I mean by vendor-driven. If vendors decide they really want
> the patches, and I actually start seeing noises on linux-kernel or getting
> requests for it being merged from _users_ rather than developers, then
> that means that the vendor is on to something.
I understand that Linux can't have random new features (especially going into
a feature-freeze). However, any additions that provide better debugging info
are (in my opinion, at any rate) worth it. Every other UNIX I've used (with
the possible exception of an early Ultrix) has some facility to inspect the
kernel - all have _at_least_ dumps that get written to a swap disk on a crash
and many have an in-core debugger. Running gdb on a live kernel from a
remote machine isn't unheard of, at least with other OSes. Unfortunately,
the only aid you'll get in debugging a Linux kernel is the source code. Sure,
you can add a mess of printk's all over suspect code, and yes, the console
gets a register dump on a panic, but that really isn't enough. Sometimes
it's nice to be able to walk through the kernel's data structures and figure
out just what was going on when things died. I get this with LKCD.
To that end, it'd be nice if the trace toolkit and SGI's kernel debugger were
added. No, I haven't used them, but then I don't do much kernel development
either. I'd bet that LTT and the kernel debugger would be very useful to
those who do, though.
--
Mike Shuey
[ Ok, this is a really serious email. If you don't get it, don't bother
emailing me. Instead, think about it for an hour, and if you still don't
get it, ask somebody you know to explain it to you. ]
On Thu, 31 Oct 2002, Matt D. Robinson wrote:
>
> Sure, but why should they have to? What technical reason is there
> for not including it, Linus?
There are many:
- bloat kills:
My job is saying "NO!"
In other words: the question is never EVER "Why shouldn't it be
accepted?", but it is always "Why do we really not want to live
without this?"
- included features kill off (potentially better) projects.
There's a big "inertia" to features. It's often better to keep
features _off_ the standard kernel if they may end up being
further developed in totally new directions.
In particular when it comes to this project, I'm told about
"netdump", which doesn't try to dump to a disk, but over the net.
And quite frankly, my immediate reaction is to say "Hell, I
_never_ want the dump touching my disk, but over the network
sounds like a great idea".
To me this says "LKCD is stupid". Which means that I'm not going to apply
it, and I'm going to need some real reason to do so - ie being proven
wrong in the field.
(And don't get me wrong - I don't mind getting proven wrong. I change my
opinions the way some people change underwear. And I think that's ok).
> I completely don't understand your reasoning here.
Tough. That's YOUR problem.
Linus
On Thu, 31 Oct 2002, Linus Torvalds wrote:
|>On Wed, 30 Oct 2002, Matt D. Robinson wrote:
|>That's fine. And since they are paid to support it, they can apply the
|>patches.
Sure, but why should they have to? What technical reason is there
for not including it, Linus?
I completely don't understand your reasoning here. I use it for my
home, not for work, and that's important for me. And not everyone
can spend their evenings rolling up the next set of patches for
a distribution. Yes, vendors want it, they need it, but there are
plenty of people like me that want this in too!
We want to see this in the kernel, frankly, because it's a pain
in the butt keeping up with your kernel revisions and everything
else that goes in that changes. And I'm sure SuSE, UnitedLinux and
(hopefully) Red Hat don't want to spend their time having to roll
this stuff in each and every time you roll a new kernel.
I mean, PLEASE, Linus, what do we have to do? There are so many
interests in this stuff, and I really, truly don't get what's wrong
with putting this in the kernel?
Have you looked at it? Have you looked at how it is now structured
to be non-invasive? How it will allow other kernel developers to
generate their own dumping methods? I mean, we sent you E-mails
weeks ago, and you didn't respond to any of them with even a word
of acknowledgement of receipt.
|>What I'm saying by "vendor driven" is that it has no relevance for the
|>standard kernel, and since it has no relevance to that, then I have no
|>incentives to merge it. The crash dump is only useful with people who
|>actively look at the dumps, and I don't know _anybody_ outside of the
|>specialized vendors you mention who actually do that.
I do. Others like myself do. And not just for development
purposes. I don't like to see my system crash after installing one
of your new kernels and not be able to figure out what's wrong.
The nice thing is that LKCD is there, it works, and I can just look
at the crash report instead of wishing that my console buffer
didn't just scroll off. Oh, I know, I'll just wait for it to
happen again ... yeah, like that's real intelligent.
|>I will merge it when there are real users who want it - usually as a
|>result of having gotten used to it through a vendor who supports it. (And
|>by "support" I do not mean "maintain the patches", but "actively uses it"
|>to work out the users problems or whatever).
|>
|>Horse before the cart and all that thing.
|>
|>People have to realize that my kernel is not for random new features. The
|>stuff I consider important are things that people use on their own, or
|>stuff that is the base for other work. Quite often I want vendors to merge
|>patches _they_ care about long long before I will merge them (examples of
|>this are quite common, things like reiserfs and ext3 etc).
Other vendors have merged LKCD a long time ago and use it, and
expect it to be there. And users like myself find it valuable on
their desktops, their servers, etc. I mean, there's someone using
this at Purdue that's responded to you, just another kernel user
that likes to have this stuff there automatically.
|>THAT is what I mean by vendor-driven. If vendors decide they really want
|>the patches, and I actually start seeing noises on linux-kernel or getting
|>requests for it being merged from _users_ rather than developers, then
|>that means that the vendor is on to something.
TurboLinux, MontaVista, Veritas, SuSE, and UnitedLinux have LKCD.
With the most recent changes, I think Red Hat can put LKCD in now
such that it isn't invasive to their distribution.
I think SuSE has already expressed a desire to have this in. If
you want to hear from others, I'll ask them to respond to you.
|> Linus
--Matt
Hello Larry,
First, thanks for your feedback.
I understand and share your concern about the use of micro-benchmarks
to qualify/quantify the impact of additional code on the kernel. This is
precisely the reason why I chose not to use micro-benchmarks in the
Usenix article I presented about LTT at the 2000 annual technical
conference. I was surprised to see some of the selection committee
members actually come up to me and say: "I'm so glad to see a paper
that doesn't use micro-benchmarks."
That's why we elected to create 2 separate sets of benchmarks, one
using real-life applications (kernel build, bzip2, etc.) and one
using LMbench. Personally, I would have been satisfied with just the
real-life applications, but I know that many folks on the LKML want
to see LMbench numbers, so we included those too. That said, I find
it very positive that you keep a healthy dose of self-criticism towards
your own tool, this is exactly the kind of stuff that makes LMbench so
good. So too is it with LTT. I've always been on the lookout for
reducing costs here and there while achieving maximal functionality.
Fortunately, repeated testing and analysis on LTT by many parties
using many tools have confirmed that the current LTT has very low
impact on many fronts, including static code modifications.
So, for example, we had one example run of LMbench where we ran kernel
compiles in the background (i.e. a script restarted the kernel
compile every time it ended). To make it as simple as possible, here's
the elapsed time taken to run LMbench on a 4x SMP system in the various
configurations:
---------------------------------------------------------------------
vanilla                                          14:27
vanilla+ltt+ltt off                              14:26
vanilla+ltt+ltt on                               14:31
vanilla+ltt+ltt on+daemon on                     14:32
vanilla+ltt+ltt on+kernel compile                15:03
vanilla+ltt+ltt on+kernel compiles+daemon on     15:13
---------------------------------------------------------------------
As you can see, the differences in percentages are all within the 2%
range we mentioned earlier.
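For readers who want to redo the percentages, here is a minimal sketch using only the times in the table above. The table has no LTT-free run with background kernel compiles, so the sketch compares the last two rows with each other (the cost of the daemon under load) rather than with plain vanilla. The helper names are illustrative, not part of LTT or LMbench.

```python
def to_seconds(mmss):
    """Convert an elapsed time written as MM:SS into seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def overhead_percent(baseline, measured):
    """Relative change of `measured` against `baseline`, in percent."""
    base = to_seconds(baseline)
    return 100.0 * (to_seconds(measured) - base) / base


# Times copied from the table above.
idle_runs = [
    ("vanilla+ltt+ltt off", "14:26"),
    ("vanilla+ltt+ltt on", "14:31"),
    ("vanilla+ltt+ltt on+daemon on", "14:32"),
]

# Idle-system runs are compared against the unpatched kernel.
for label, elapsed in idle_runs:
    print(f"{label}: {overhead_percent('14:27', elapsed):+.2f}% vs vanilla")

# Loaded runs are compared with each other: tracing on, daemon off vs on.
print(f"daemon under load: {overhead_percent('15:03', '15:13'):+.2f}%")
```

Each of these comparisons stays under 2%, which matches the range quoted in the message.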
To address the specific metrics you mentioned:
> Code changes:
We've posted diffstats with every patch we published on the LKML.
> Call depth:
We're talking 3 for syscalls and 2 for all other events in order to
reach the core tracing function proper (this could easily be reduced
by 1 if it's really a problem). Add 1 for the locking scheme and 3 for
the non-locking scheme. I'm not counting the calls we make to kernel
services, which somewhat goes to show that this is a flawed measure
because I've never seen any thorough analysis of call depths for
kernel services. Can't say that it wouldn't be an interesting
research project to see someone do that for the entire kernel, we
may find some interesting results.
> Stack size:
This really depends on the quantity of data being passed to the tracer,
which varies greatly from one event to the other. I can say this, however:
in all the testing I've seen done on LTT in the past, there has never
been a stack problem. This isn't an invitation for being reckless. I am
aware of stack issues and have been on the lookout for any related
problem.
> Cache misses:
Bob has said it best. I think the best that we can do about this is
to follow the known-to-be-good guidelines about cache interference.
The discussion Ingo and Bob had on this issue in relation to LTT,
for example, shows that we've thought this through.
Beyond everything I've said above, I'd invite you to download LTT and
try it out. I'm sure you'll see why this is important for Linux users.
BTW, while I'm on the subject of LMbench, I've been trying to find a
way to run it on an embedded system. The problem is that this thing
needs a compiler and that would mean having to cross-compile gcc itself
and so on, which creates storage problems etc. Are there any plans to
make a mini-LMbench?
Thanks again,
Karim
===================================================
Karim Yaghmour
[email protected]
Embedded and Real-Time Linux Expert
===================================================
On Thu, 31 Oct 2002, Alexander Viro wrote:
>
> No. I'm saying that ACLs do not have a point until at least basic
> userland gets ready for setups people want ACLs for. Adding features that
> can't be used until $BIG_WORK is done is idiocy in the best case and
> danger in the worst. Especially since $BIG_WORK does not depend on these
> features.
I think samba alone counts as enough user-land usage.
And if it turns out nobody else ever wants to use them, that's fine too.
Linus
On Thu, 31 Oct 2002, Linus Torvalds wrote:
|>[ Ok, this is a really serious email. If you don't get it, don't bother
|> emailing me. Instead, think about it for an hour, and if you still don't
|> get it, ask somebody you know to explain it to you. ]
Thanks for the response. I don't think I need an hour. This is
pretty simple.
|>On Thu, 31 Oct 2002, Matt D. Robinson wrote:
|>>
|>> Sure, but why should they have to? What technical reason is there
|>> for not including it, Linus?
|>
|>There are many:
|>
|> - bloat kills:
|>
|> My job is saying "NO!"
|>
|> In other words: the question is never EVER "Why shouldn't it be
|> accepted?", but it is always "Why do we really not want to live
|> without this?"
This isn't bloat. If you want, it can be built as a module, and
not as part of your kernel. How can that be bloat? People who
build kernels can optionally build it in, but we're not asking
that it be turned on by default, rather, built as a module so
people can load it if they want to. We made it into a module
because 18 months ago you complained about it being bloat. We
addressed your concerns.
Some people, particularly large SSI configurations, can't live
without this. You shouldn't crash once. Crashing twice, or
more often, is inexcusable.
|> - included features kill off (potentially better) projects.
|>
|> There's a big "inertia" to features. It's often better to keep
|> features _off_ the standard kernel if they may end up being
|> further developed in totally new directions.
I can't argue against this ... to do so would mean that you don't
accept any new features for 2.5, and there are a lot of projects
like mine that need to go in, although I do understand your concerns.
|> In particular when it comes to this project, I'm told about
|> "netdump", which doesn't try to dump to a disk, but over the net.
|> And quite frankly, my immediate reaction is to say "Hell, I
|> _never_ want the dump touching my disk, but over the network
|> sounds like a great idea".
We've integrated the "netdump" capabilities as a dump method
for LKCD. It's an option for dumping, just like all the other
dump methods available to people. Want to dump to disk? Use
LKCD. Want to dump on the network? USE LKCD. What's wrong
with that?
We have a net dump method from Mohammed Abbas (modified from Ingo's
netconsole dump) that allows you to dump across the network.
It integrates into LKCD beautifully. If you want that patch with
the rest of our LKCD patches, we can include it, no problem.
|>To me this says "LKCD is stupid". Which means that I'm not going to apply
|>it, and I'm going to need some real reason to do so - ie being proven
|>wrong in the field.
Hopefully some of this changes your mind.
|>(And don't get me wrong - I don't mind getting proven wrong. I change my
|>opinions the way some people change underwear. And I think that's ok).
|>
|>> I completely don't understand your reasoning here.
|>
|>Tough. That's YOUR problem.
It is. I lose sleep because this is my problem. I lose time on
the weekends because this is my problem.
If you've _reviewed_ the LKCD patches and still have the opinions
you've mentioned above, then I'll consider this your position and
be done with it. Otherwise, please accept the code.
We'll keep doing our best to keep up with your kernels in the
meantime.
|> Linus
--Matt
Alexander Viro writes:
> On Thu, 31 Oct 2002, Stephen Wille Padnos wrote:
>
> > >Then give them all the same account and be done with that. Effect will
> > >be the same.
> >
> > Unless I'm missing something, that only works if all the users need
> > *exactly* the same permissions to all files, which isn't a good assumption.
>
> That's the point. In practice shared writable access to a directory
> can be easily elevated to full control of each others' accounts,
^^^^^^
While that may be true in theory, in practice it's not necessarily the
case. Many people don't have the expertise to make use of such
exploits. And before you say that they can download a pre-cooked
exploit kit, let me tell you that there are plenty of people who don't
have the time or inclination to do that.
I've seen you talk about these kinds of things before, and you always
seem to be talking about the typical nightmarish undergrad CS lab
where the kids spend all their time trying to crack each other and the
system. And I'm not saying that these don't exist: I've seen it.
But there are other environments (say a research lab with grad
students, post-docs and faculty) where the inhabitants either don't
have the skills or don't have the interest in cracking accounts.
Everyone is too busy doing their own research. Cracking the mysteries
of the universe seems to be more interesting.
So group write access and ACL's *can* lead to wanton cracking, but for
many environments it's not an issue. For many, the dangers lie outside
the firewall, not inside.
Note that I'm not specifically advocating ACL's, I'm just letting you
know that the problem you're concerned about is, for good reason, not
a problem for everyone.
I will note that one appealing aspect of ACL's is that they do not
require administrator intervention. That's good for a user who just
wants to set something up without having to wait for the sysadmin.
It's also good for the sysadmin (excepting control freaks) who doesn't
want to do things that the users can (or should) actually be doing by
themselves.
Regards,
Richard....
Permanent: [email protected]
Current: [email protected]
Note that as far as ACL's go, enough people have convinced me that we want
them, with clear real-life issues. So don't worry about them, I'll merge
it.
Linus
On Thu, 31 Oct 2002, Matt D. Robinson wrote:
>
> This isn't bloat. If you want, it can be built as a module, and
> not as part of your kernel. How can that be bloat?
I don't care one _whit_ about the size of the binary. I don't maintain
binaries, and the binary can be gigabytes for all I care.
The only thing I care about is source code. So the "build it as a module
and it is not bloat" argument is a total nonsense thing as far as I'm
concerned.
Anyway, new code is always bloat to me, unless I see people using them.
Guys, why do you even bother trying to convince me? If you are right, you
will be able to convince other people, and that's the whole point of open
source.
Being "vendor-driven" is _not_ a bad thing. It only means that _I_ am not
personally convinced. I'm only one person.
Linus
On Thu, 31 Oct 2002, Linus Torvalds wrote:
> What I'm saying by "vendor driven" is that it has no relevance for the
> standard kernel, and since it has no relevance to that, then I have no
> incentives to merge it. The crash dump is only useful with people who
> actively look at the dumps, and I don't know _anybody_ outside of the
> specialized vendors you mention who actually do that.
Unfortunately the vast majority of the customers I deal with
buy a distribution and then put a kernel from kernel.org
on. I believe this comes about because of either needing fixes
or functions that appear in later kernels that have not made
it to the distribution kernels yet.
Even if the distribution included LKCD in their kernel,
I lose lots of debug ability once customers switch over to
kernel.org and no longer have the LKCD patch.
Thus we are currently left with having to maintain LKCD patches for
many arbitrary kernel.org kernels and convince customers to apply
it BEFORE they start encountering problems that we'll have to look at.
Application of patches that aren't automatically included in kernel.org
rarely happens with our customer set (before problems occur),
no matter how much we flag the issue to them up front.
I realize that while my current capacity makes me fall into
the 'vendor' support you speak of, I believe I am actually
advocating its inclusion on behalf of real live customers.
Vendors can and do actually help linux development, by screening,
researching fixes, and/or directly fixing lots of customer
problems that you never have to deal with. To do that, LKCD
is the debug weapon of choice.
I request you reconsider the inclusion of LKCD.
Regards, Dave
Mail : [email protected] Phone : 512-838-8248
On Thu, Oct 31, 2002 at 09:38:41AM -0800, Linus Torvalds wrote:
>
> Note that as far as ACL's go, enough people have convinced me that we want
> them, with clear real-life issues. So don't worry about them, I'll merge
> it.
Ok, so now let's work on a Documentation/filesystems patch pointing
out a few of the common pitfalls, as I definitely agree they invite
some grave mistakes and are best avoided in most scenarios.
- /tmp-style symlink issues on shared directories
- vast majority of software (including security tools) ACL-unaware
- much harder to check for correctness
Al, I'm sure you have more..
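As an illustration of the first pitfall, here is a minimal sketch of creating a file in a directory that other users can write to. It assumes a POSIX system where O_EXCL and O_NOFOLLOW are available. The function and file names are made up for the example, and this covers only file creation, not the wider directory-structure races discussed in this thread.

```python
import os
import tempfile


def create_private_file(directory, name):
    """Create a new file, refusing to reuse or follow an existing entry."""
    path = os.path.join(directory, name)
    # O_CREAT|O_EXCL fails if the name already exists, even as a symlink;
    # O_NOFOLLOW is kept as a second guard on the final path component.
    flags = os.O_WRONLY | os.O_CREAT | os.O_EXCL | os.O_NOFOLLOW
    return os.open(path, flags, 0o600)


def demo():
    """Simulate another user planting a symlink before we create the file."""
    with tempfile.TemporaryDirectory() as shared:
        target = os.path.join(shared, "victim")
        os.symlink(target, os.path.join(shared, "report.tmp"))
        try:
            os.close(create_private_file(shared, "report.tmp"))
            return "created"
        except FileExistsError:
            return "refused"


print(demo())
```

A program that opens the same name with plain O_CREAT would instead write through the planted link, whether the write access was granted by group or by ACL.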
--
"Love the dolphins," she advised him. "Write by W.A.S.T.E.."
Trever L. Adams wrote:
> On Wed, 2002-10-30 at 21:31, Linus Torvalds wrote:
>
>> > ext2/ext3 ACLs and Extended Attributes
>>
>> I don't know why people still want ACL's. There were noises about them
>> for samba, but I've not heard anything since. Are vendors using this?
>>
>
> I am sure I don't count (not being a vendor), but Intermezzo offers
> support for this (they are waiting on feature freeze to redo it to 2.5
> according to an email I have). I want this stuff. Yes, u+g+w is nice,
> but good ACLs are even better. Please, if this is technically correct
> in implementation, do put it in.
>
I agree, having them is far better than the standard u+g+w that's been
around for ages. I think it gives the "finer" grain of control over your
system that a lot of users may desire. Not to mention the fact that ACL's
are well supported by the recently merged XFS. If I'm not mistaken, AFS
uses them as well. I *really* don't see the overhead cost here in terms of
compiled kernel size when they are turned off. As for the size of the
source tarball, who cares? People should quit whining about the size of
the sources and get over it! Storage is cheap and broadband is in
widespread use.
Cheers,
Nicholas
On Thu, 2002-10-31 at 18:10, Chris Friesen wrote:
> > To me this says "LKCD is stupid". Which means that I'm not going to apply
> > it, and I'm going to need some real reason to do so - ie being proven
> > wrong in the field.
>
> How do you deal with netdump when your network driver is what caused the
> crash?
Netdump drives the system itself. Any dump driver has to, as it can't
assume the system is in a remotely sane state.
On Thu, 31 Oct 2002 09:54:54 -0800 (PST), Linus Torvalds
<[email protected]> wrote:
>Guys, why do you even bother trying to convince me? If you are right, you
>will be able to convince other people, and that's the whole point of open
>source.
>
>Being "vendor-driven" is _not_ a bad thing. It only means that _I_ am not
>personally convinced. I'm only one person.
It sounds to me like there needs to be L-K traffic when problems are
solved using LKCD.
Personally I love crash dumps... in 33 years of computing I have spent
a total of 1-2 years doing nothing but enhancing and developing
post-processing facilities. The true benefit is not just the "crashed
here, add a null check" nonsense. It is the ability to examine the
whole system state. With an inboard trace table, you can even go back
in time. You can look at call stacks, locks held, state of allocated
memory, etc etc. If you save callstacks and time with allocated
memory, you can track down storage growth problems. I have spent weeks
winkling problems out of crash dumps, solving problems the developers
didn't even know existed.
With the right facility you can take crash dump snapshots and keep on
running. It is a great tool for understanding a system.
But until there is a flow of results - good quality fixes - resulting
from such analysis, I can see exactly why LT is doubtful.
john alvord
On Thu, Oct 31, 2002 at 09:31:09AM -0500, Jeff Garzik wrote:
> What's wrong with our current 2.5.45 crypto api?
It's synchronous and assumes everything is synchronous. Lots of
hardware (most) doesn't work that way.
--cw
On Thu, 31 Oct 2002, Chris Friesen wrote:
>
> How do you deal with netdump when your network driver is what caused the
> crash?
Actually, from a driver perspective, _the_ most likely driver to crash is
the disk driver.
That's from years of experience. The network drivers are a lot simpler,
the hardware is simpler and more standardized, and doesn't do as many
things. It's just plain _easier_ to write a network driver than a disk
driver.
Ask anybody who has done both.
But that's not the real issue. The real issue is that I have no personal
incentives to try to merge the thing, and as a result I think I'm the
wrong person to do so. I've told people over and over again that I think
this is a "vendor merge", and I'm fed up with people not _getting_ it.
Don't bother to ask me to merge the thing, that only makes me get even
more fed up with the whole discussion. This is open source, guys. Anybody
can merge it. Because I don't particularly believe in it doesn't mean that
it cannot be used. It only means that I want to see users flock to it and
show my beliefs wrong.
Linus
Linus Torvalds wrote:
>
> [ lkcd ]
>
We'll be spending the next six months stabilising and hardening
the used-to-be-2.5 kernel. If grunts like me can get hold of a
copy of the other person's kernel image from time-of-crash, that
has a ton of value.
(Disclaimer: I've never used lkcd. I'm assuming that it's
possible to gdb around in a dump)
> In particular when it comes to this project, I'm told about
> "netdump", which doesn't try to dump to a disk, but over the net.
It could help. But like serial console, the random person whose
kernel just died often can't be bothered setting it up, or simply
doesn't have the gear, or the crash is not repeatable.
So. _If_ lkcd gives me gdb-able images from time-of-crash, I'd
like it please. And I'm the grunt who spent nearly two years
doing not much else apart from working 2.3/2.4 oops reports.
Oh, and as Rusty has pointed out, we lose a *lot* of oops reports
because users are in X and the backtrace doesn't make it to the
logs. Rusty has a little app which dumps just the oops report to
disk somewhere. Want that too.
Chris Wedgwood wrote:
> On Wed, Oct 30, 2002 at 11:48:23PM -0700, Dax Kelson wrote:
>
>> Technically speaking you can achieve ACL like permissions/behavior
>> using the historical UNIX security model by creating a group EACH
>> time you run into a unique case permission scenario.
>
> I'm not arguing against this... I'm claiming POSIX ACLs are mostly
> brain-dead and almost worthless (broken by committee pressure and too
> many people making stupid concessions).
>
> If we must have ACLs, why not do it right?
>
>> Without ACLs, if Sally, Joe and Bill need rw access to a file/dir,
>> just create another group with just those three people in. Over
>> time, of course, this leads to massive group proliferation. Without
>> Tim Hockin's patch, 32 groups is maximum number of groups a user can
>> be a member of.
>
> How many people actually need this level of complexity?
>
> Why are we adding all this shit and bloat because of perceived
> problems most people don't have? What next, some kind of misdesigned
> in-kernel CryptoAPI?
Get over it! If you haven't noticed, CryptoAPI is merged already. The only
bloat ACLs cause is the size of the source tarball. If your connection is
slow or you are out of diskspace, too bad! I'm sure I'm not the only one
who is tired of hearing people whine about "bloat" wrt the sources and
demanding that features they don't use be ignored. No one (non-core)
feature will be useful to everyone, that is a given fact. The point is
that while you see no use for it, there are many others out there who do.
ACLs are something which have existed in the Solaris/BSD world for a long
time now, and people who admin these boxen find ACLs to be quite
useful.
Cheers,
Nicholas
On Thu, 31 Oct 2002, Oliver Xymoron wrote:
>
> Perhaps not the best analogy.
Heh. I like my analogies bad. The best analogies should make you go
"huh!" - kind of like a pink poodle in a tutu.
Linus
On Thu, 31 Oct 2002, Linus Torvalds wrote:
>
> On Thu, 31 Oct 2002, Matt D. Robinson wrote:
> >
> > This isn't bloat. If you want, it can be built as a module, and
> > not as part of your kernel. How can that be bloat?
>
> I don't care one _whit_ about the size of the binary. I don't maintain
> binaries, and the binary can be gigabytes for all I care.
>
> The only thing I care about is source code. So the "build it as a module
> and it is not bloat" argument is a total nonsense thing as far as I'm
> concerned.
So, you don't like bloat, such as having 22 different file systems (only
including the ones that can be placed on disk, not things like devfs or
smbfs...). That's more filesystems than I have dollars in my wallet at
the moment. For the amount of utility that this code provides, it's
definitely not 'bloat'.
> Anyway, new code is always bloat to me, unless I see people using them.
HEY!!! WE'RE USING IT!!!
> Guys, why do you even bother trying to convince me? If you are right, you
> will be able to convince other people, and that's the whole point of open
> source.
Now this sounds more like something I'd hear from Sun trying to get a fix
for a version of Solaris without having to buy a new one. I thought the
whole point of Free Software was sharing with the community, and doing
what's best for the community.
> Being "vendor-driven" is _not_ a bad thing. It only means that _I_ am not
> personally convinced. I'm only one person.
That's the same as claiming that George W. Bush is just one person....
So I'll plead yet again, please add LKCD!
Pat
--
Purdue University ITAP/RCS
Information Technology at Purdue
Research Computing and Storage
http://www-rcd.cc.purdue.edu
http://dilbert.com/comics/dilbert/archive/images/dilbert2040637020924.gif
On Thu, Oct 31, 2002 at 09:25:21AM -0800, Linus Torvalds wrote:
> (And don't get me wrong - I don't mind getting proven wrong. I change my
> opinions the way some people change underwear. And I think that's ok).
As in 'sometimes not even when hundreds of people start haranguing me
about it in public forums'?
Perhaps not the best analogy.
--
"Love the dolphins," she advised him. "Write by W.A.S.T.E.."
Linus Torvalds wrote:
> In particular when it comes to this project, I'm told about
> "netdump", which doesn't try to dump to a disk, but over the net.
> And quite frankly, my immediate reaction is to say "Hell, I
> _never_ want the dump touching my disk, but over the network
> sounds like a great idea".
>
> To me this says "LKCD is stupid". Which means that I'm not going to apply
> it, and I'm going to need some real reason to do so - ie being proven
> wrong in the field.
How do you deal with netdump when your network driver is what caused the
crash?
Ideally I would like to see a dump framework that can have a number of
possible dump targets. We should be able to dump to any combination of
network, serial, disk, flash, unused ram that isn't wiped over restarts,
etc...
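A minimal sketch of the "several dump targets" idea, assuming a hypothetical interface: `struct dump_target`, `dump_register_target()` and `dump_write_all()` are illustrative names, not part of LKCD, netdump, or any kernel API. The point it shows is that each target is tried independently, so a failing network target does not stop a disk target from receiving the same data.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical interface: one entry per place a dump can be sent. */
struct dump_target {
    const char *name;
    /* Returns 0 on success, nonzero if the target could not take the data. */
    int (*write)(const void *buf, size_t len);
};

#define MAX_TARGETS 8
static struct dump_target *targets[MAX_TARGETS];
static size_t ntargets;

/* Register a target; returns 0 on success, -1 if the table is full. */
int dump_register_target(struct dump_target *t)
{
    if (ntargets == MAX_TARGETS)
        return -1;
    targets[ntargets++] = t;
    return 0;
}

/*
 * Offer the buffer to every registered target and keep going after
 * failures, so one dead target does not block the others.
 * Returns the number of targets that accepted the data.
 */
size_t dump_write_all(const void *buf, size_t len)
{
    size_t ok = 0;
    for (size_t i = 0; i < ntargets; i++)
        if (targets[i]->write(buf, len) == 0)
            ok++;
    return ok;
}

/* Two stand-in targets for illustration: one that fails, one that works. */
static int net_write(const void *buf, size_t len)
{
    (void)buf;
    (void)len;
    return -1; /* pretend the network driver is what crashed */
}

static int disk_write(const void *buf, size_t len)
{
    (void)buf;
    printf("disk: wrote %zu bytes\n", len);
    return 0;
}

struct dump_target net_target  = { "net",  net_write  };
struct dump_target disk_target = { "disk", disk_write };
```

With both stand-in targets registered, `dump_write_all()` reports one success: the network target refuses the data and the disk target still gets it. A real implementation would also have to run with most of the kernel in an unknown state, which this user-space sketch does not attempt to model.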
Chris
--
Chris Friesen | MailStop: 043/33/F10
Nortel Networks | work: (613) 765-0557
3500 Carling Avenue | fax: (613) 765-2986
Nepean, ON K2H 8E9 Canada | email: [email protected]
On Thu, 31 Oct 2002, Dave Craft wrote:
> On Thu, 31 Oct 2002, Linus Torvalds wrote:
>
> > What I'm saying by "vendor driven" is that it has no relevance for the
> > standard kernel, and since it has no relevance to that, then I have no
> > incentives to merge it. The crash dump is only useful with people who
> > actively look at the dumps, and I don't know _anybody_ outside of the
> > specialized vendors you mention who actually do that.
>
> Unfortunately the vast majority of the customers I deal with
> buy a distribution and then put a kernel from kernel.org
> on. I believe this comes about because of either needing fixes
> or function that appear in later kernels that have not made
> it to the distributions kernels yet.
>
> Even if the distribution included LKCD in their kernel,
> I lose lots of debug ability once customers switch over to
> kernel.org and no longer have the LKCD patch.
>
> Thus we are currently left with having to maintain LKCD patches for
> many arbitrary kernel.org kernels and convince customers to apply
> it BEFORE they start encountering problems that we'll have to look at.
> Application of patches that aren't automatically included in kernel.org
> rarely happens with our customer set (before problems occur),
> no matter how much we flag the issue to them up front.
So, this is precisely where something like OSDL's Carrier Grade and Data
Center working groups can come into play, amazingly enough.
By now, nearly everyone has heard about the working groups and nearly
every developer that has, despises them. Even I resist association with
them. But, they can have some real value to the vendors and the OEMs in
exactly the way you describe.
Take for example DCL. It's a kernel tree with several base patches
intended to make Linux better in the data center. The base is not fancy,
and includes things like LKCD and kdb (I think). It's actively maintained
and updated more often than Linus makes a release (by virtue of
bitkeeper).
The intent is to later have multiple child trees that implement features
for a specific application space (e.g. databases), while maintaining the
same base set of features. People wishing to use the most recent kernel
with those features can use the DCL tree directly. Or an OEM FAE can use
the tree to build something for the vendor, or add extra features.
Note that it's not a distribution. We don't even make real releases, since
we don't create tarballs or patches (it's only in BK, which actually kinda
sucks). It's merely a means to have these features actively maintained and
kept in synch.
And really, that's what everyone wants. Linus doesn't want the features,
as don't other developers, regardless of the Buzzword or Coolness factors.
Some vendors and users do want them. The developers of the features and
distributors of features don't want to deal with the tedium and pain of
updating patches each and every release.
In the end, it comes down to the fact that Linus's tree is Linus's tree.
Other people can have their trees. I'm not going to tell you to go off and
make your own if you want those features so bad, because I know what a
pain in the ass it is, and I know having someone else do it is a lot
easier.
DCL and CGL have their trees, for purposes probably very very similar to
what your customers need. I encourage you to check them out and work with
them (or talk to people in your company that are). Try and make it work,
and everyone can be happy (relatively). And, if DCL and CGL aren't
satisfying the space that you need, please speak up to OSDL and the
working groups. People are listening, and willing to take your suggestions
into consideration.
Relevant URLs:
http://osdl.org/projects/cgl/
http://osdl.org/projects/dcl/
-pat "kissing serious butt" mochel
On Thu, 31 Oct 2002, Linus Torvalds wrote:
> In particular when it comes to this project, I'm told about
> "netdump", which doesn't try to dump to a disk, but over the net.
And guess what ? Netdump is one of various LKCD dump methods ...
regards,
Rik
--
A: No.
Q: Should I include quotations after my reply?
http://www.surriel.com/ http://distro.conectiva.com/
On Thu, 2002-10-31 at 18:28, Nicholas Wourms wrote:
> > problems most people don't have? What next, some kind of misdesigned
> > in-kernel CryptoAPI?
>
> Get over it! If you haven't noticed, CryptoAPI is merged already. The only
Chris is right that crypto api is misdesigned if we want to use hardware
cryptocards
On Thu, 31 Oct 2002, Chris Wedgwood wrote:
>
> It's synchronous and assumes everything is synchronous. Lots of
> hardware (most) doesn't work that way.
Think of it another way: many users will likely _require_ atomic
encryption / decryption (done in softirq contexts etc), and thus a
synchronous interface. Also, it simplifies the code and makes it more
efficient.
Any hardware that needs to go off and think about how to encrypt something
sounds like it's so slow as to be unusable. I suspect that anything that
is over the PCI bus is already so slow (even if it adds no extra cycles of
its own) that you're better off using the CPU for the encryption rather
than some external hardware.
In short, from what I can tell, there is no huge actual reason to ever
allow an asynchronous interface. Such interfaces are likely fine for things
like network cards that can do encryption on their own on outgoing or
incoming packets, but that is not a general-purpose encryption engine, and
would not merit being part of an encryption library anyway.
[ Such a card is just a way to _avoid_ using the encryption library - the
same way we can avoid using the checksumming stuff for network cards
that can do their own checksums ]
We'll see. I'd rather have a simpler interface that works for all relevant
cases today, and then if external crypto chips end up being common and
sufficiently efficient, we can always re-consider. Are the DMA-over-PCI
roundtrip (and resulting cache invalidations) overheads really worth the
extra hardware?
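To make the two interface styles under discussion concrete, here is a minimal user-space sketch. Everything in it is an assumption for illustration: `encrypt_sync()`, `encrypt_submit()`, `encrypt_poll()` and `struct crypt_request` are hypothetical names and are not the in-kernel CryptoAPI, and the XOR transform is a placeholder, not real cryptography. The synchronous call has finished by the time it returns; the asynchronous one only queues the work and reports completion later.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy XOR transform standing in for a real cipher; NOT real cryptography. */
static void xor_transform(uint8_t *buf, size_t len, uint8_t key)
{
    for (size_t i = 0; i < len; i++)
        buf[i] ^= key;
}

/* Synchronous style: the data is transformed by the time the call returns,
 * so it can be used from contexts that cannot sleep. */
void encrypt_sync(uint8_t *buf, size_t len, uint8_t key)
{
    xor_transform(buf, len, key);
}

/* Asynchronous style: the caller submits a request and is told later,
 * through a callback, that the work is done. */
struct crypt_request {
    uint8_t *buf;
    size_t len;
    uint8_t key;
    void (*done)(struct crypt_request *req); /* may be NULL */
    int completed;
};

#define QUEUE_LEN 4
static struct crypt_request *queue[QUEUE_LEN];
static size_t queued;

/* Queue a request; returns 0 on success, -1 if the queue is full. */
int encrypt_submit(struct crypt_request *req)
{
    if (queued == QUEUE_LEN)
        return -1;
    req->completed = 0;
    queue[queued++] = req;
    return 0;
}

/* Stand-in for the hardware finishing its work, e.g. from an interrupt. */
void encrypt_poll(void)
{
    for (size_t i = 0; i < queued; i++) {
        struct crypt_request *req = queue[i];
        xor_transform(req->buf, req->len, req->key);
        req->completed = 1;
        if (req->done)
            req->done(req);
    }
    queued = 0;
}
```

The asynchronous variant needs a request structure, a queue, and a completion path that the synchronous one avoids, which is the simplicity argument made above. The counter-argument in the thread is that offload hardware is usually driven through exactly such a queue.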
Linus
On Thu, 2002-10-31 at 17:13, Michael Shuey wrote:
> I'm a user, and I request that LKCD get merged into the kernel. :-)
> Do you feel like donating a 700-port console server? Right, so it's LKCD
> for me then.
Wouldn't you rather they neatly tftp'd dumps to a nominated central
server which noticed the arrival, did the initial processing with a perl
script and mailed you a summary ?
On Thu, 31 Oct 2002, Nicholas Wourms wrote:
> slow or you are out of diskspace, too bad! I'm sure I'm not the only one
> who is tired of hearing people whine about "bloat" wrt the sources and
> demanding that features they don't use be ignored. No one (non-core)
One look at the From:
understanding has blossomed
.procmailrc grows
Alan Cox wrote:
> On Thu, 2002-10-31 at 18:28, Nicholas Wourms wrote:
>
>>>problems most people don't have? What next, some kind of misdesigned
>>>in-kernel CryptoAPI?
>>
>>Get over it! If you haven't noticed, CryptoAPI is merged already. The only
>
>
> Chris is right that crypto api is misdesigned if we want to use hardware
> cryptocards
>
Alan,
Thanks for setting me straight, your assertion is correct,
of course. I was under the impression that the CryptoAPI
code was merged initially for IPSEC support and would be
revamped and expanded at a later date to support a wide
variety of interfaces?
Cheers,
Nicholas
On Thu, 31 Oct 2002, Linus Torvalds wrote:
>
> On Wed, 30 Oct 2002, Matt D. Robinson wrote:
>
> > Linus Torvalds wrote:
> > > > Crash Dumping (LKCD)
> > >
> > > This is definitely a vendor-driven thing. I don't believe it has any
> > > relevance unless vendors actively support it.
> >
> > There are people within IBM in Germany, India and England, as well as
> > a number of companies (Intel, NEC, Hitachi, Fujitsu), as well as SGI
> > that are PAID to support this.
Add 3PAR and probably a number of other small companies given the traffic
on the lists. Anyone building a new product on Linux and mucking
around inside the kernel, and having more than a handful of developers
is going to want LKCD, or Mission Critical's mcore, or netdump, or
something like it.
It's a shame that right out of the gate they'll have to spend time
figuring out which of these solutions work for them. I spent at least
a month of my life just looking at what's out there, and trying to make
each of them work with our product. It'd be nice if that time were
spent on making new "cool stuff".
Since then, we've put significant amounts of work into making LKCD
reliable on our system, and it's been incredibly useful in our
development. It's going to prove invaluable supporting our stuff in
the field.
> What I'm saying by "vendor driven" is that it has no relevance for the
> standard kernel, and since it has no relevance to that, then I have no
> incentives to merge it. The crash dump is only useful with people who
> actively look at the dumps, and I don't know _anybody_ outside of the
> specialized vendors you mention who actually do that.
>
> I will merge it when there are real users who want it - usually as a
> result of having gotten used to it through a vendor who supports it. (And
> by "support" I do not mean "maintain the patches", but "actively uses it"
> to work out the users problems or whatever).
If you asked me if 3PAR is a "vendor" or a "user" I'd have to say "yes".
As a vendor we sell our system to customers. They could not care less
that LKCD is in the linux kernel distribution. All they care about is
that we fix their problems as fast as possible. They probably have
no idea that this is the underlying technology, so you will never
hear from them about us.
However, we also use linux for desktops, build servers, database servers, etc.
When we have problems with these systems, we'd LOVE to be able to use the
same expertise and technology which we've developed for our system, but
all too often we find that someone just grabbed a Redhat 7.x disk or
standard debian distro to build the system.
So as a "user" I'm asking the distribution vendors, please make it easy
for me to use the same damn tools everywhere by providing some sort
of common crash dump mechanism. It'll make it easier for me to consider new
hardware, new software, etc. One thing that's awesome is Dave Anderson's
"crash" tool. It works with LKCD dumps, netdump dumps, etc. It's an example
of a tool which has leveraged all the different dump communities.
As a "vendor" please put LKCD or something like it into the main line
kernel. LKCD works. It has an active developer community which has
been extending it to work over networks, onto disks, developing new
analysis tools, etc. If we can settle on one such tool, we'll get
more cool stuff like lock analyzers, etc. Until then, we WILL keep
re-inventing the wheel because this is one of the first steps to
collect significant amounts of real data.
-castor
On Thu, Oct 31, 2002 at 10:49:10AM -0800, Linus Torvalds wrote:
> Any hardware that needs to go off and think about how to encrypt
> something sounds like it's so slow as to be unusable. I suspect that
> anything that is over the PCI bus is already so slow (even if it
> adds no extra cycles of its own) that you're better off using the
> CPU for the encryption rather than some external hardware.
Except almost all hardware out there that does this stuff is async to
some extent...
I'm just speaking as someone who has (sadly) done this a couple of
times already for commercial real-world products. I'm no expert, I
don't claim to be and admit there is still plenty to learn...
... that said, having access to lots of hardware, both our own and
other people's, almost all of it needs to be driven asynchronously to
get good performance (or by a large number of threads).
On Thu, 2002-10-31 at 10:45, Patrick Mochel wrote:
>
> So, this is precisely where something like OSDL's Carrier Grade and Data
> Center working groups can come into play, amazingly enough.
>
> By now, nearly everyone has heard about the working groups and nearly
> every developer that has, despises them. Even I resist association with
> them. But, they can have some real value to the vendors and the OEMs in
> exactly the way you describe.
>
> Take for example DCL. It's a kernel tree with several base patches
> intended to make Linux better in the data center. The base is not fancy,
> and includes things like LKCD and kdb (I think). It's actively maintained
> and updated more often than Linus makes a release (by virtue of
> bitkeeper).
LKCD is in and I try to keep it up to date with the patch stream.
KDB is not in yet, because the current posted patches are not up to date
to apply cleanly against 2.5.44 or 2.5.45.
> The intent is to later have multiple child trees that implement features
> for a specific application space (e.g. databases), while maintaining the
> same base set of features. People wishing to use the most recent kernel
> with those features can use the DCL tree directly. Or an OEM FAE can use
> the tree to build something for the vendor, or add extra features.
CGL hasn't decided what they want to change to.
DCL is going to have one tree focused on databases.
> Note that it's not a distribution. We don't even make real releases, since
> we don't create tarballs or patches (it's only in BK, which actually kinda
> sucks). It's merely a means to have these features actively maintained and
> kept in synch.
For DCL there is both a bitkeeper tree bk://bk.osdl.org/dcl-2.5 and
regular snapshots available on sourceforge
http://osdldcl.sourceforge.net
> And really, that's what everyone wants. Linus doesn't want the features,
> as don't other developers, regardless of the Buzzword or Coolness factors.
> Some vendors and users do want them. The developers of the features and
> distributors of features don't want to deal with the tedium and pain of
> updating patches each and every release.
>
> In the end, it comes down to the fact that Linus's tree is Linus's tree.
> Other people can have their trees. I'm not going to tell you go off and
> make your own if you want those features so bad, because I know what a
> pain in the ass it is, and I know having someone else do it is a lot
> easier.
>
FYI the criteria I apply for what goes into DCL is:
* Applies to large systems and databases
* Vendor support
* Conforms to Linux standard style
* Active project and maintainer that accepts feedback
* Community reaction has been mostly positive.
> DCL and CGL have their trees, for purposes probably very very similar to
> what your customers need. I encourage you to check them out and work with
> them (or talk to people in your company that are). Try and make it work,
> and everyone can be happy (relatively). And, if DCL and CGL aren't
> satisfying the space that you need, please speak up to OSDL and the
> working groups. People are listening, and willing to take your suggestions
> into consideration.
>
> Relevant URLs:
>
> http://osdl.org/projects/cgl/
> http://osdl.org/projects/dcl/
Stephen Hemminger
Data Center Linux (DCL) Maintainer/Coordinator
Alexander Viro wrote:
>
> On Thu, 31 Oct 2002, Nicholas Wourms wrote:
>
>
>>slow or you are out of diskspace, too bad! I'm sure I'm not the only one
>>who is tired of hearing people whine about "bloat" wrt the sources and
>>demanding that features they don't use be ignored. No one (non-core)
>
>
> One look at the From:
> understanding has blossomed
> .procmailrc grows
>
Your point is?
On Thu, Oct 31, 2002 at 07:04:31PM +0000, Alan Cox wrote:
> On Thu, 2002-10-31 at 17:13, Michael Shuey wrote:
> > I'm a user, and I request that LKCD get merged into the kernel. :-)
> > Do you feel like donating a 700-port console server? Right, so it's LKCD
> > for me then.
>
> Wouldn't you rather they neatly tftp'd dumps to a nominated central
> server which noticed the arrival, did the initial processing with a perl
> script and mailed you a summary ?
Generally speaking, no.
A tftp server doesn't provide enough security (specifically authentication).
It would need to be accessible from clusters in multiple buildings and on
multiple networks (some of which must be public).
I've seen more network adapter issues than drive controller issues. In
particular, some vendors (Compaq, listen up) can't implement an eepro100 to
save their asses, especially on older hardware.
From time to time bandwidth issues and/or network splits can prevent dumps
from being reliably delivered.
Right now we use the presence of a local dump to indicate that a machine
should not join the PBS pool (and begin to run more jobs) on a reboot. I'd
rather not have the nodes check a central server to see if it's okay to run
jobs. And no, I don't want machines to stay down after a crash - many nodes
are in distant corners of campus and it's cold outside. :-) If I can fix the
problem through software I'd prefer that the problematic host be up, rather
than having to walk over to it just to hit reset and load a new kernel.
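The "presence of a local dump keeps the node out of the pool" check described above could look roughly like the sketch below. It is an illustration under assumptions: `has_local_dump()` and `may_join_pool()` are hypothetical helper names, the dump directory is whatever the site has configured, and nothing here is taken from LKCD or PBS themselves.

```c
#include <dirent.h>
#include <string.h>

/*
 * Returns 1 if `dir` contains at least one entry other than "." and "..",
 * 0 if it is empty, and -1 if it cannot be opened.
 */
int has_local_dump(const char *dir)
{
    DIR *d = opendir(dir);
    if (!d)
        return -1;

    int found = 0;
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (strcmp(e->d_name, ".") != 0 && strcmp(e->d_name, "..") != 0) {
            found = 1;
            break;
        }
    }
    closedir(d);
    return found;
}

/*
 * Policy from the text: only join the job pool when no dump is present.
 * An unreadable directory is treated as "do not join", to stay cautious.
 */
int may_join_pool(const char *dump_dir)
{
    return has_local_dump(dump_dir) == 0;
}
```

A boot script would call `may_join_pool()` with the site's dump directory before starting the batch daemon. Because the check is purely local, the node does not depend on a central server being reachable, which is the property asked for above.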
That said, it would be really nice if LKCD would log dumps to both the swap
device and to a remote server. That way if the machine crashed because of
disk failure I'd still have an uncorrupted dump image (and could then notice
all the little errors coming back out of the swap device). A tool to
automatically analyze a dump and email back summaries would be much more
useful, though. If someone were to write such a widget, that'd be swell. :-)
Right now I'm less concerned with getting dumps to exactly the right place
and a bit more concerned with getting dumps in the main kernel at all.
--
Mike Shuey
On Thu, 31 Oct 2002, Andrew Morton wrote:
>
> We'll be spending the next six months stabilising and hardening
> the used-to-be-2.5 kernel. If grunts like me can get hold of a
> copy of the other person's kernel image from time-of-crash, that
> has a ton of value.
Exactly. Sometimes you don't even need the dump itself. The user
who has the dump just types lcrash and report -w file.txt, and
lcrash writes a consolidated report with the most interesting
information from the dump to file.txt. He can then send it
to you, and you have much of the information that is often missing
from problem reports.
The report consists of: time when the dump was created, time
when the report was created, the architecture, the hostname,
kernel version and compile time, the kernel (dmesg) buffer
with all the oopses logged into it, a short task list with
process address, IDs, state, flags, cpu and process name,
and finally a full CPU dump of every CPU with all registers,
current process and function and a symbolic stack backtrace
of the CPU.
Sometimes this is all you need to know and if you need to
know e.g. the stack backtrace of a not running process at
the time of the dump, you can ask the user to simply run
trace <process address> and lcrash gives you the backtrace
of this process:
lcrash> t[race] 0x1408000
================================================================
STACK TRACE FOR TASK: 0x1408000 (kjournald)
STACK:
0 schedule+894 [0x3164e]
1 interruptible_sleep_on+174 [0x31eae]
2 journal_revoke+<ERROR> [0x10889c0c]
3 kernel_thread+70 [0x18c1e]
Showing the full task struct, a sub-struct or a field is also simple:
p[rint] ((struct task_struct *)0x1408000)->pending
struct sigpending {
head = (nil)
tail = 0x1598700
signal = sigset_t {
sig = {
[0] 0
}
}
}
"feels" a bit like gdb
> (Disclaimer: I've never used lkcd. I'm assuming that it's
> possible to gdb around in a dump)
I don't know if there is an lkcd->ELF core converter yet, but
it should be doable. However, lcrash is quite powerful: it comes
with sial, an integrated C interpreter that permits easy access to the
symbol and type information. It lets you write code like this:
void
showprocs()
{
    struct proc* p;

    /* walk the process list starting at the "procs" symbol */
    for(p=*(struct proc**)procs; p; p=p->p_next) {
        /* do something... */
    }
}
Looks nice... :-)
I also don't know if (k)gdb knows about tasks; lcrash at least
knows about them, and this may help when you look into a specific
task (I'm not an expert).
Of course lcrash can also do disassembling, find symbols, and so on.
> So. _If_ lkcd gives me gdb-able images from time-of-crash, I'd
> like it please. And I'm the grunt who spent nearly two years
> doing not much else apart from working 2.3/2.4 oops reports.
Maybe the lkcd people can do so, but I think they can also give
a hands-on workshop on lcrash.
You can also use lcrash on a running system to browse around,
learn, and save dumps from it without interrupting it. You
just need lcrash, the System.map and the Kerntypes file from the
kernel to use it.
> Oh, and as Rusty has pointed out, we lose a *lot* of oops reports
> because users are in X and the backtrace doesn't make it to the
> logs.
Yep, I think it would be good even if Linus just accepts the
infrastructure patch of lkcd which needs to be in the kernel.
The favourite dump method module can then be downloaded, compiled,
installed and configured without much pain. I think that people
can then start using it in a broader range without the hassle of
needing to patch and boot a special kernel.
Bernd
PS: lcrash is only one of the many frontends, as I've read in
this thread, there is also Dave Anderson's "crash" tool which
works with LKCD dumps, netdump dumps, etc. There is also qlcrash,
a Qt frontend for lcrash for people who like to click!
> > > On Thu, 31 Oct 2002, Rusty Russell wrote:
> > > > Fbdev Rewrite
> > >
> > > This one is just huge, and I have little personal judgement on it.
> >
> > It's been around for a while. Geert, Russell?
>
> It's huge because it moves a lot of files around:
> 1. drivers/char/agp/ -> drivers/video/agp/
> 2. drivers/char/drm/ -> drivers/video/drm/
> 3. console related files in drivers/video/ -> drivers/video/console/
>
> (1) and (2) should be reverted, but apparently they aren't reverted in the
> patch at http://phoenix.infradead.org/~jsimmons/fbdev.diff.gz yet. The patch
> also seems to remove some drivers. Haven't checked the bk repo yet.
>
> James, can you please fix that (and the .Config files)?
Done. I have a new version of that patch at the same place. It is against
2.5.45.
http://phoenix.infradead.org/~jsimmons/fbdev.diff.gz
It's still pretty big. We can save the moving of the agp code for post
halloween.
Alan Cox wrote:
>On Thu, 2002-10-31 at 18:28, Nicholas Wourms wrote:
>
>
>>>problems most people don't have? What next, some kind of misdesigned
>>>in-kernel CryptoAPI?
>>>
>>>
>>Get over it! If you haven't noticed, CryptoAPI is merged already. The only
>>
>>
>
>Chris is right that crypto api is misdesigned if we want to use hardware
>cryptocards
>
>
I'll reserve judgement until we actually get access to some decent [made
in the past few years] hardware crypto cards, and take a hard look at
their PCI bus utilization... until then it is mostly vague handwaving...
[vendors - any takers?]
Linus Torvalds wrote:
> In particular when it comes to this project, I'm told about
> "netdump", which doesn't try to dump to a disk, but over the net.
> And quite frankly, my immediate reaction is to say "Hell, I
> _never_ want the dump touching my disk, but over the network
> sounds like a great idea".
>
>
[yes, I realize the LKCD merge debate is over, bear with me :)]
I'm sort of in the middle on this issue: The existence of netdump does
not imply that disk dumps are a bad thing.
netdumps require a net dump server, and it is simply not realistic at
all to assume that users seeing crashes will always have a netdump
server set up in advance, or even have multiple machines to make that
possible. Disk dumps are valuable because their requirements are very
low, and because of all the user-support reasons that Andrew Morton
mentioned in this thread.
That said, I used to be an LKCD cheerleader until a couple people made
some good points to me: it is not nearly low-level enough to truly be
of use in crash situations. netdump can work if your interrupts are
hosed/screaming, and various mid-layers are dying. For LKCD to be of
any use, it needs to _skip_ the block layer and talk directly to
low-level drivers.
So, I think the stock kernel does need some form of disk dumping,
regardless of any presence/absence of netdump. But LKCD isn't there yet...
Jeff
On Thu, 31 Oct 2002, Linus Torvalds wrote:
> - included features kill off (potentially better) projects.
>
> There's a big "inertia" to features. It's often better to keep
> features _off_ the standard kernel if they may end up being
> further developed in totally new directions.
>
> In particular when it comes to this project, I'm told about
> "netdump", which doesn't try to dump to a disk, but over the net.
> And quite frankly, my immediate reaction is to say "Hell, I
> _never_ want the dump touching my disk, but over the network
> sounds like a great idea".
>
> To me this says "LKCD is stupid". Which means that I'm not going to apply
> it, and I'm going to need some real reason to do so - ie being proven
> wrong in the field.
>
> (And don't get me wrong - I don't mind getting proven wrong. I change my
> opinions the way some people change underwear. And I think that's ok).
It would be most unfortunate if the existence of netdump is used as a
reason to deny LKCD's inclusion, or to simply dismiss LKCD as stupid.
On Thu, 31 Oct 2002, Matt D. Robinson wrote:
> We want to see this in the kernel, frankly, because it's a pain
> in the butt keeping up with your kernel revisions and everything
> else that goes in that changes. And I'm sure SuSE, UnitedLinux and
> (hopefully) Red Hat don't want to spend their time having to roll
> this stuff in each and every time you roll a new kernel.
While Red Hat advocates Ingo's netdump option, we have customer
requests that are requiring us to look at LKCD disk-based dumps as an
alternative, co-existing dump mechanism. Since the two methods are not mutually
exclusive, LKCD will never kill off netdump -- nor certainly vice-versa. We're
all just looking for a better means to be able to
provide support to our customers, not to mention its value as a
development aid.
Dave Anderson
Red Hat, Inc.
On Wed, 2002-10-30 at 19:28, Stephen Frost wrote:
> The feeling I got on this was the ability to let users define their own
> groups. Perhaps I'm not following it closely enough but that was the
> impression I got in terms of "what this does for us"; I'm probably
> missing other things. Just that ability would be nice in my view
> though. Isn't it something that's been in AFS for a long time too?
> I've got a few friends who've played with AFS before (at CMU and the
> like) and really enjoyed the ACLs there.
Yea, I haven't looked at the submitted implementation, but at CMU ACLs
were critical to be able to selectively share data between a very large
set of users w/o bugging an administrator. Given multiple classes per
semester which had multiple group projects, where you may have different
groups for each project, I have no clue how anyone would be able to
handle the (unix)group management required. ACLs let the users
themselves manage what people got what access to their data.
How else can I fix my partner's bugs (or vice-versa), give the clumsy TA
read only access, and let the cheat across the hall figure it out for
himself? (There may very well be a good solution to this w/o ACLs but
I've not seen it in use.)
So yea, I'd love to see a common ACLs API.
-john
On Thu, Oct 31, 2002 at 03:59:34PM -0500, Dave Anderson wrote:
>
> > To me this says "LKCD is stupid". Which means that I'm not going to apply
> > it, and I'm going to need some real reason to do so - ie being proven
> > wrong in the field.
> >
> > (And don't get me wrong - I don't mind getting proven wrong. I change my
> > opinions the way some people change underwear. And I think that's ok).
>
> It would be most unfortunate if the existence of netdump is used as a
> reason to deny LKCD's inclusion, or to simply dismiss LKCD as stupid.
What he really wants is for Andrew or Alan or someone else he trusts
to merge it, get actual field results, and declare it useful. If
people start visibly passing around crash dump results on l-k and
solving problems with them, that'll help too. Until then all he has is
his gut feel to go on.
--
"Love the dolphins," she advised him. "Write by W.A.S.T.E.."
[ Cc: trimmed ]
john stultz wrote:
> groups for each project, I have no clue how anyone would be able to
> handle the (unix)group management required. ACLs let the users
> themselves manage what people got what access to their data.
Note that POSIX ACLs don't seem to solve this either: they only
let you control access in terms of existing users or groups.
IMHO, this is one of the standard pitfalls of ACLs: if they don't
let you aggregate information, you quickly end up with huge ACLs
hanging off every file, and each of those ACLs wants to be
carefully maintained. I've seen a lot of this in my VMS days.
(Unix is a bit better, because you can control access at a
directory level, while VMS needs the ACL on each file, because
you can open files directly by VMS' equivalent to an inode
number, without traversing the directory hierarchy. Of course,
many users didn't know that :-)
To make ACLs truly scalable, it would be nice to be able to
express permissions in terms of access to other filesystem
objects. E.g. "everybody who can read file ~me/acls/my_friends
can write the directory on which this ACE hangs". This should
work like a symlink, i.e. if I add new friends to my_friends, I
don't have to update all my ACLs.
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/
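The "ACE that works like a symlink" idea above can be sketched as a toy
model. This is only an illustration under stated assumptions: the Node
class, the allowed() helper and the (target, needed, granted) link tuple
are hypothetical names invented for this sketch, and nothing here
reflects the POSIX ACL patches under discussion.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical filesystem object carrying a toy ACL."""
    owner: str
    # Direct entries: user name -> set of rights, e.g. {"jarka": {"r"}}.
    users: dict = field(default_factory=dict)
    # Link entries: (target node, right needed on target, rights granted here).
    links: list = field(default_factory=list)


def allowed(node, user, right, _path=frozenset()):
    """Return True if `user` holds `right` on `node`.

    Entries only ever add rights, so a missing target (None) or a
    reference cycle simply grants nothing.
    """
    if node is None or (id(node), right) in _path:
        return False
    if user == node.owner or right in node.users.get(user, ()):
        return True
    path = _path | {(id(node), right)}
    return any(
        right in granted and allowed(target, user, needed, path)
        for target, needed, granted in node.links
    )


# "Everybody who can read my_friends can write shared_dir."
my_friends = Node(owner="me", users={"jarka": {"r"}})
shared_dir = Node(owner="me", links=[(my_friends, "r", {"r", "w"})])
```

Adding a user to my_friends.users changes the answer for shared_dir
without touching shared_dir itself, which is the aggregation property
being asked for.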
On 10/31, Matt D. Robinson said something like:
> On Thu, 31 Oct 2002, Linus Torvalds wrote:
> |>On Wed, 30 Oct 2002, Matt D. Robinson wrote:
> |>That's fine. And since they are paid to support it, they can apply the
> |>patches.
>
> We want to see this in the kernel, frankly, because it's a pain
> in the butt keeping up with your kernel revisions and everything
> else that goes in that changes. And I'm sure SuSE, UnitedLinux and
> (hopefully) Red Hat don't want to spend their time having to roll
> this stuff in each and every time you roll a new kernel.
I share some of your sentiment, but honestly, think about it.
Linus has to "keep up" with all the changes coming into his inbox as
well, and the more features, the more breakage that can happen when
Linus accepts a patch.
Really, Linus wants to push some of his maintenance overhead to distros,
who get paid to do it, but also to provide sexy bullet point items for
users, so they buy "Linux" stuff.
You try to find a better balance.
--
Shawn Leas
[email protected]
I installed a skylight in my apartment...
The people who live above me are furious!
-- Stephen Wright
Hi!
> > Without ACLs, if Sally, Joe and Bill need rw access to a file/dir, just
> > create another group with just those three people in. Over time, of
>
> If Sally, Joe and Bill need rw access to a directory, and Joe and Bill
> are using existing userland (any OS I'd seen), then Sally can easily
> fuck them into the next month and not in a good way.
Do you mean symlink attack?
> _That_ is the real problem. Until that is solved (i.e. until all
> userland is written up to the standards allegedly followed in writing
> suid-root programs wrt hostile filesystem modifications) NO mechanism
> will help you. ACLs, huge groups, whatever - setups with that sort
> of access allowed are NOT SUSTAINABLE with the current userland(s).
So userland needs to be improved. It already needs those modifications
because of /tmp. Is there any new issue there?
Pavel
--
When do you have heart between your knees?
Hi!
> > > ext2/ext3 ACLs and Extended Attributes
> >
> > I don't know why people still want ACL's. There were noises about them for
> > samba, but I've not heard anything since. Are vendors using this?
>
> Because People Are Stupid(tm). Because it's cheaper to put "ACL support: yes"
> in the feature list under "Security" than to make sure that userland can cope
> with anything more complex than "Me Og. Og see directory. Directory Og's.
> Nobody change it". C.f. snake oil, P.T.Barnum and esp. LSM users
Okay... Have ~/bin/phonebook and I'd like it to be rw- to me, r-- to
jarka and mj, and --- to everyone else. How do I do that without ACLs?
Adding a group is root-only operation.
This seems like a pretty common situation to me, and current solutions
are not nice. [I guess ~/bin/ with --x and
~/bin/my-secret-password-only-jarka-and-mj-knows/phonebook would solve
the problem, but...!]
Pavel
--
When do you have heart between your knees?
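The phonebook case above (rw- for the owner, r-- for jarka and mj, ---
for everyone else) can be sketched with a simplified model of the access
check order described in acl(5): owner first, then named users, then
groups, then other, with the mask limiting named-user and group entries.
The Acl class, may_access() and the group name "users" are hypothetical
names for this sketch, not an API from the ext2/ext3 ACL patches.

```python
from dataclasses import dataclass, field


@dataclass
class Acl:
    """Toy POSIX-style ACL; permissions are strings such as "rw-"."""
    owner: str
    group: str
    owner_perms: str
    group_perms: str
    other_perms: str
    named_users: dict = field(default_factory=dict)
    named_groups: dict = field(default_factory=dict)
    mask: str = "rwx"


def _bits(perms, mask="rwx"):
    """Permission characters present in both `perms` and `mask`."""
    return {p for p in perms if p != "-" and p in mask}


def may_access(acl, user, groups, wanted):
    """Check `wanted` (e.g. "r" or "rw") for `user` in group set `groups`."""
    wanted = set(wanted)
    if user == acl.owner:
        return wanted <= _bits(acl.owner_perms)
    if user in acl.named_users:
        return wanted <= _bits(acl.named_users[user], acl.mask)
    entries = []
    if acl.group in groups:
        entries.append(acl.group_perms)
    entries += [p for g, p in acl.named_groups.items() if g in groups]
    if entries:
        # One matching group entry must grant everything that is wanted.
        return any(wanted <= _bits(p, acl.mask) for p in entries)
    return wanted <= _bits(acl.other_perms)


phonebook = Acl(
    owner="pavel", group="users",
    owner_perms="rw-", group_perms="---", other_perms="---",
    named_users={"jarka": "r--", "mj": "r--"},
    mask="r--",
)
```

No new group is needed here: the two named-user entries carry the extra
access, which is the point being made about group creation being a
root-only operation.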
Jeff Garzik wrote:
> That said, I used to be an LKCD cheerleader until a couple people made
> some good points to me: it is not nearly low-level enough to truly be
> of use in crash situations.
I'm not so convinced about this. I like the Mission Critical
approach: save the dump to memory, then either boot through the
firmware or through bootimg (nowadays, that would be kexec),
then retrieve the dump from memory, and do whatever you like
with it.
The huge advantage here is that you don't need a ton of
specialized dump drivers and/or have much of the original kernel
infrastructure to be in a usable state. The rebooted system will
typically be stable enough to offer the full range of utilities,
including up to date drivers for all possible devices, so you
can safely write to disk, scp all the mess to your support
critter, or post an automatic flame to linux-kernel :-)
The weak points of the Mission Critical design are that early
memory allocation in the kernel needs to be tightly controlled,
that architectures that wipe CPU caches on reboot need to
commit them to memory before the firmware restart, and that
drivers need to be able to recover from an "unclean" hardware
state. (I think we'll see much of the latter happen as kexec
advances. The other two issues aren't really special.)
Actually, at the RAS BOF I thought that IBM were developing LKCD
in this direction, and had also eliminated a few not so elegant
choices of Mission Critical's original design. I haven't looked
at the LKCD code, but the descriptions sound as if all the
special-case cruft seems to be back again, which I would find a
little disappointing.
There might be a case for specialized low-overhead dump handlers
for small embedded systems and such, but they're probably better
maintained outside of the mainstream kernel. (They're more like
firmware anyway.)
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/
On Thu, 2002-10-31 at 13:49, Werner Almesberger wrote:
> john stultz wrote:
> > groups for each project, I have no clue how anyone would be able to
> > handle the (unix)group management required. ACLs let the users
> > themselves manage what people got what access to their data.
>
> Note that POSIX ACLs don't seem to solve this either: they only
> let you control access in terms of existing users or groups.
I've never looked at the POSIX ACL spec, so forgive my ignorance.
> IMHO, this is one of the standard pitfalls of ACLs: if they don't
> let you aggregate information, you quickly end up with huge ACLs
> hanging off every file, and each of those ACLs wants to be
> carefully maintained. I've seen a lot of this in my VMS days.
> (Unix is a bit better, because you can control access at a
> directory level, while VMS needs the ACL on each file, because
> you can open files directly by VMS' equivalent to an inode
> number, without traversing the directory hierarchy. Of course,
> many users didn't know that :-)
While it would be nice to have user-definable ACL groups ("my friends"
or "History 255 TAs") in addition to existing users and groups, I still
don't find this to be critical. Sure, it adds (possibly quite a bit of)
extra data to every file, but it gives you the granularity you need for
the situation I described. It seems like user-definable ACL groups
would be a nice extra feature on top of existing users or groups, but
not a necessity.
> To make ACLs truly scalable, it would be nice to be able to
> express permissions in terms of access to other filesystem
> objects. E.g. "everybody who can read file ~me/acls/my_friends
> can write the directory on which this ACE hangs". This should
> work like a symlink, i.e. if I add new friends to my_friends, I
> don't have to update all my ACLs.
Ugh, that seems dangerous. Too many forgotten ACL links and then I could
accidentally give a vague acquaintance access to all my data meant for
close friends.
Regardless, while ACLs do result in extra data per file being used, it
is my understanding that ACLs allow you to solve problems that currently
aren't solvable w/o administrator intervention. In my experience using
them w/ AFS, they have been extremely useful.
-john
On Thu 31/10/2002 at 23:57, Pavel Machek wrote:
> This seems like a pretty common situation to me, and current solutions
> are not nice. [I guess ~/bin/ with --x and
> ~/bin/my-secret-password-only-jarka-and-mj-knows/phonebook would solve
> the problem, but...!]
Can't even this be spied from /proc/*/fd ?
> THAT is what I mean by vendor-driven. If vendors decide they
> really want the patches, and I actually start seeing noises on
> linux-kernel or getting
> requests for it being merged from _users_ rather than developers, then
> that means that the vendor is on to something.
I am a user and I use it; I'd like it. I am a developer and I use it. I'd
love it. Forget my intel.com paying my paycheck.
Inaky Perez-Gonzalez -- Not speaking for Intel - opinions are my own [or my
fault]
john stultz wrote:
> Ugh, that seems dangerous. Too many forgotten ACL links and then I could
> accidentally give a vague acquaintance access to all my data meant for
> close friends.
The idea is that you'd typically have (a) (small number of) specific
location(s) where you keep your files representing groups, e.g.
$HOME/acls/ for your personal lists, maybe ~project/acls/ for
projects, etc.
If you already think this is dangerous, then you should be
terrified by regular, non-aggregateable ACLs ;-)
I'm not saying that ACLs aren't useful, only that the lack of
aggregateability makes them hard to maintain, so that people
frequently fall back to setup scripts that simply re-create
their ACL configuration. Once you're at this point, ACLs have
lost much of their usefulness, and you might as well use some
suid program that creates groups for you.
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/
Hi!
> > This seems like a pretty common situation to me, and current solutions
> > are not nice. [I guess ~/bin/ with --x and
> > ~/bin/my-secret-password-only-jarka-and-mj-knows/phonebook would solve
> > the problem, but...!]
>
> Can't even this be spied from /proc/*/fd ?
Not sure... It's true that if users are not careful (i.e. do
cat ~/bin/my-secret-password-only-jarka-and-mj-knows/phonebook)
it can be seen in ps -aux ;-).
Pavel
--
When do you have heart between your knees?
On Thu, 31 Oct 2002, Shawn wrote:
>
> Linus has to "keep up" with all the changes coming into his inbox as
> well, and the more features, the more breakage that can happen when
> Linus accepts a patch.
Yes, but lkcd differs from the other changes because it can make life
easier for people who don't need the patch in the first place,
and it can help quality and shorten the time to fix bugs.
If someone triggers a problem, one can take a free partition or set up
a network dump server and keep running. If it happens again, there is a
good chance that all that is needed to fix the problem is in the dump,
the System.map and the Kerntypes file from the kernel, which can quite
easily be consolidated into a report with symbolic stack traces of the
CPUs and tasks.
Original source, patches and configuration options are good for
analysis but not required if the Kerntypes file is there. The
config options could even be read from the dump if that feature
were wanted. :-)
> Really, Linus wants to push some of his maintenance overhead to distros,
> who get paid to do it, but also to provide sexy bullet point items for
> users, so they buy "Linux" stuff.
Sure, but the work of the distros could be even better if the base
kernel had lkcd, LTT and dprobes (you don't have to enable them if
you don't need them), because then they would have more resources
to do other, even more useful things. But it's up to whoever
merges the stuff.
Bernd
In message <[email protected]> you write:
> On Thu, Oct 31, 2002 at 02:00:31PM +1100, Rusty Russell wrote:
> > They have, IIRC. Interestingly, it was less invasive (existing source
> > touched) than the LVM2/DM patch you merged.
>
> FUD. I added to three areas of existing code:
[ 40-line detailed explanation snipped ]
Woah! War's over dude! We won!
I used Rusty's Unreliable Intrusiveness-o-meter (number of existing
non-config files touched), as I said.
I didn't read code or anything so unscientific or accurate. But both
DM and EVMS were way down on the "intrusiveness" list.
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
In message <[email protected]> you write:
> Ideally I would like to see a dump framework that can have a number of
> possible dump targets. We should be able to dump to any combination of
> network, serial, disk, flash, unused ram that isn't wiped over restarts,
> etc...
Both the lkcd and ide mini-oopser have that (although the mini-oopser
has only x86-ide for now).
The mini-oopser has different aims than LKCD: they want to debug one
system, I want to make sure we're reaping OOPS reports from those 99%
of desktop users who run X and simply reboot when their machine
crashes once a month.
I did *not* put the mini-oopser on the Snowball list, because I don't
have time to polish it.
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
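The "any combination of dump targets" idea quoted above can be sketched
roughly as follows. This is a minimal illustration, assuming a made-up
DumpTarget interface; it is not the LKCD or mini-oopser code, and all
class and function names are hypothetical.

```python
class DumpTarget:
    """Hypothetical interface: one place a crash dump can be written."""
    name = "abstract"

    def write(self, data: bytes) -> None:
        raise NotImplementedError


class RamTarget(DumpTarget):
    """Stands in for memory that survives a restart."""
    name = "ram"

    def __init__(self):
        self.saved = b""

    def write(self, data: bytes) -> None:
        self.saved = data


class DeadDiskTarget(DumpTarget):
    """Stands in for a device that no longer responds after the crash."""
    name = "disk"

    def write(self, data: bytes) -> None:
        raise IOError("device not responding")


def dump(data: bytes, targets):
    """Try every configured target; return the names that succeeded."""
    succeeded = []
    for target in targets:
        try:
            target.write(data)
        except Exception:
            # A failing target must not stop the remaining ones.
            continue
        succeeded.append(target.name)
    return succeeded
```

The design choice worth noting is that each target fails independently,
so a dead disk does not prevent a dump from reaching RAM or the network.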
> > Fbdev Rewrite
>
> This one is just huge, and I have little personal judgement on it.
The size has been cut in half now that the issue of AGP being initialized
too late is on hold. We can discuss that move post-halloween. All that is
in the fbdev tree are fbdev changes. So it is safe to pull it.
On Thu, 2002-10-31 at 14:54, Werner Almesberger wrote:
> john stultz wrote:
> > Ugh, that seems dangerous. Too many forgotten ACL links and then I could
> > accidentally give a vague acquaintance access to all my data meant for
> > close friends.
>
> The idea is that you'd typically have (a) (small number of) specific
> location(s) where you keep your files representing groups, e.g.
> $HOME/acls/ for your personal lists, maybe ~project/acls/ for
> projects, etc.
Oh! Ok, that's exactly like the user-definable ACL groups I was
describing. My mistake, I thought you were suggesting some crazy ACL
symlink like: "Make file foo's ACL be the same as file blah's ACL" and
if I then go and add some untrusted user to blah's ACL it would then
automatically change foo's ACL. That just seemed a bit out there, but it
was just my mis-interpretation. Sorry :)
> If you already think this is dangerous, then you should be
> terrified by regular, non-aggregateable ACLs ;-)
Eh, as long as the ACLs are per-file, I can't ever accidentally give
access to a file I didn't mean to. The corner cases of "remove my
ex-friend from all my files" could be annoying, but could be done w/ the
equiv of chgrp -r
> I'm not saying that ACLs aren't useful, only that the lack of
> aggregateability makes them hard to maintain, so that people
> frequently fall back to setup scripts that simply re-create
> their ACL configuration. Once you're at this point, ACLs have
> lost much of their usefulness, and you might as well use some
> suid program that creates groups for you.
Hmmm. I'm way out of my realm of competency here. I just know ACLs were
*really* useful w/ AFS.
I probably should just go read the specs. Anyone have a pointer, or care
to explain what the differences are between AFS's ACLs and POSIX ACLs?
thanks
-john
On Fri, 1 Nov 2002, Rusty Russell wrote:
|>The mini-oopser has different aims than LKCD: they want to debug one
|>system, I want to make sure we're reaping OOPS reports from those 99%
|>of desktop users who run X and simply reboot when their machine
|>crashes once a month.
I'd like to incorporate the mini-oopser as an LKCD dump method.
I'll chat with you off-line about this. Shouldn't be that
difficult to do.
|>I did *not* put the mini-oopser on the Snowball list, because I don't
|>have time to polish it.
|>
|>Rusty.
Thanks,
--Matt
|>On Thu, 31 Oct 2002, Matt D. Robinson wrote:
|>> We want to see this in the kernel, frankly, because it's a pain
|>> in the butt keeping up with your kernel revisions and everything
|>> else that goes in that changes. And I'm sure SuSE, UnitedLinux and
|>> (hopefully) Red Hat don't want to spend their time having to roll
|>> this stuff in each and every time you roll a new kernel.
|>
|>While Red Hat advocates Ingo's netdump option, we have customer
|>requests that are requiring us to look at LKCD disk-based dumps as an
|>alternative, co-existing dump mechanism. Since the two methods are
|>not mutually exclusive, LKCD will never kill off netdump -- nor
|>certainly vice-versa. We're all just looking for a better means
|>to be able to provide support to our customers, not to mention
|>its value as a development aid.
I think you and I are in agreement (as has always been the case in the
past), Dave. LKCD is meant to create a base for disk, network,
or any dump method. If Red Hat wants netdump to be the primary
dumping method, that's Red Hat's decision, and more power to
them. If SuSE wants disk dumps, that's SuSE's decision. But
for both of them to have to roll their own every single release
or kernel upgrade is unproductive.
What's most concerning about this entire discussion is that I
bet < 20% of the people discussing this have actually LOOKED at
the LKCD patches to see whether or not this is as invasive,
difficult, bloated, or anything negative. We've spent over a
month now posting them, getting comments, responding to all of
the comments, making sure feedback is accounted for and
responded to, only to get an "LKCD is stupid" type response.
--Matt
john stultz wrote:
> I thought you were suggesting some crazy ACL
> symlink like: "Make file foo's ACL be the same as file blah's ACL" and
> if I then go and add some untrusted user to blah's ACL it would then
> automatically change foo's ACL.
Well, with "foo" getting the ACL from "bar", changing the ACL of
"bar" would change "foo", but not vice versa. Of course, the idea
is that you're careful when changing "bar", just like you'd be
careful with your SSH keys.
> Eh, as long as the ACLs are per-file, I can't ever accidentally give
> access to a file I didn't mean to. The corner cases of "remove my
> ex-friend from all my files" could be annoying, but could be done w/ the
> equiv of chgrp -r
chgrp -r gets nasty if you have files which are stored off-line.
On the other hand, using the concept that ACEs add rights, but
never take them away, even an off-line "ACL link target" would
fail on the safe side, by not adding more rights.
> I probably should just go read the specs. Anyone have a pointer, or care
> to explain what the differences are between AFS's ACLs and POSIX ACLs?
I've forgotten most things I knew about AFS ACLs (I used them at
IBM about eight years ago), but http://acl.bestbits.at/ and in
particular http://acl.bestbits.at/cgi-man/acl.5 seem to have
everything about POSIX ACLs. They're not very complicated.
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/
On Thu, 31 Oct 2002, Jeff Garzik wrote:
|>Linus Torvalds wrote:
|>[yes, I realize the LKCD merge debate is over, bear with me :)]
For Linus, it is.
|>That said, I used to be an LKCD cheerleader until a couple people made
|>some good points to me: it is not nearly low-level enough to truly be
|>of use in crash situations. netdump can work if your interrupts are
|>hosed/screaming, and various mid-layers are dying. For LKCD to be of
|>any use, it needs to _skip_ the block layer and talk directly to
|>low-level drivers.
Just to clarify, LKCD is NOT block based dumping, OR net based
dumping, or anything. It's an infrastructure for dumping that
lets you, the user, the distributor, the customer, whatever,
make the decision for what's right for you. Yes, we provide
disk based dumping now, and are including the net dump code
very soon, as well as some of these other smaller dump methods.
Has ANYONE other than Christoph and Stephen H. done a full review of
the LKCD patch set before commenting? Or are people just making
this stuff up as they go along? A ton of things have changed
over the past year just because people complained about only doing
disk dumping. And then to hear this ...
|>So, I think the stock kernel does need some form of disk dumping,
|>regardless of any presence/absence of netdump. But LKCD isn't
|>there yet...
Please read the patches and decide again. If you want the latest
net dump patch, let me know.
|> Jeff
--Matt
On Thu, 31 Oct 2002, Shawn wrote:
|>On 10/31, Matt D. Robinson said something like:
|>> On Thu, 31 Oct 2002, Linus Torvalds wrote:
|>> |>On Wed, 30 Oct 2002, Matt D. Robinson wrote:
|>> |>That's fine. And since they are paid to support it, they can apply the
|>> |>patches.
|>>
|>> We want to see this in the kernel, frankly, because it's a pain
|>> in the butt keeping up with your kernel revisions and everything
|>> else that goes in that changes. And I'm sure SuSE, UnitedLinux and
|>> (hopefully) Red Hat don't want to spend their time having to roll
|>> this stuff in each and every time you roll a new kernel.
|>
|>I share some of your sentiment, but honestly, think about it.
|>
|>Linus has to "keep up" with all the changes coming into his inbox as
|>well, and the more features, the more breakage that can happen when
|>Linus accepts a patch.
Uh ... have you read the patches? Do you see how few the
changes are to non-dump code? Do you know that most of those
changes only get triggered in a crash situation anyway?
Breakage occurs when people change code areas that are used
all the time, like VM, network, block layer, etc.
Look at the patches and tell me where we are causing overhead
and serious potential breakage. If you find problems,
then tell us, don't just comment on breakage scenarios.
|>Really, Linus wants to push some of his maintenance overhead to distros,
|>who get paid to do it, but also to provide sexy bullet point items for
|>users, so they buy "Linux" stuff.
Sure, then remove all of the extra filesystems, sound drivers,
etc., that are bulking up the kernel distribution now and give
them to the distributors to include.
|>You try to find a better balance.
If I could think of a better balance to ease his load, I would.
He's already made his mind up. It doesn't mean it won't end up
merged by someone else (or everyone else for that matter).
--Matt
Matt D. Robinson wrote:
>On Thu, 31 Oct 2002, Jeff Garzik wrote:
>|>Linus Torvalds wrote:
>|>[yes, I realize the LKCD merge debate is over, bear with me :)]
>
>For Linus, it is.
>
>|>That said, I used to be an LKCD cheerleader until a couple people made
>|>some good points to me: it is not nearly low-level enough to truly be
>|>of use in crash situations. netdump can work if your interrupts are
>|>hosed/screaming, and various mid-layers are dying. For LKCD to be of
>|>any use, it needs to _skip_ the block layer and talk directly to
>|>low-level drivers.
>
>Just to clarify, LKCD is NOT block based dumping, OR net based
>dumping, or anything. It's an infrastructure for dumping that
>lets you, the user, the distributor, the customer, whatever,
>make the decision for what's right for you. Yes, we provide
>disk based dumping now, and are including the net dump code
>very soon, as well as some of these other smaller dump methods.
>
>Has ANYONE other than Christoph and Stephen H. done a full review of
>the LKCD patch set before commenting? Or are people just making
>this stuff up as they go along? A ton of things have changed
>over the past year just because people complained about only doing
>disk dumping. And then to hear this ...
>
>
You are confusing review with perspective. I've read
http://lkcd.sourceforge.net/download/latest/ before, and just checked it
again tonight before posting.
My view is: LKCD becomes useful to merge when the average user can do
"safe" disk dumps. netdumps are better for corporate customers, but for
average users, disk dumps are _the_ method which is easiest, most
accessible, and thus most helpful to kernel hackers debugging their
problems. LKCD has a dump block dev driver, but it's not even close to
being low-level enough to be "safe".
Re-read my other post(s) -- I have said repeatedly that LKCD's
infrastructure is decent. But it's completely pointless to merge a
decent infrastructure unless the users are up to snuff. It's much
smarter to keep the infrastructure out of the kernel until the low-level
dump drivers are hammered out and stable, because that gives you more
freedom to change the API.
>|>So, I think the stock kernel does need some form of disk dumping,
>|>regardless of any presence/absence of netdump. But LKCD isn't
>|>there yet...
>
>Please read the patches and decide again. If you want the latest
>net dump patch, let me know.
>
>
I have. Nothing has changed. Stable, polling, low-level disk dumps are
not in the LKCD patches.
IMO, net dump is what corporate customers and network admins want. And
overall, net dumps are probably easier and much safer than disk dumps,
from an implementor's perspective. However, disk dumps are what the
average kernel hacker will find most useful, because it is the easiest
for end users, and thus will generate a higher number of quality bug
reports.
Jeff
In message <[email protected]> you write:
> I did a hack to scale the NFS block size in stat to make sure it fits
> into 31 bits, but statfs64 would be the correct solution for it really.
AFAICT the patches are not in shape at the moment, so I don't think it
fits "actively being pushed": unless someone chimes in, I'm removing
it.
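For readers unfamiliar with the scaling hack mentioned in the quoted
text, here is a rough sketch of the general idea: report a coarser block
size so that the block count fits in a signed 32-bit field. The function
name and the doubling strategy are assumptions made for illustration;
the actual NFS change may work differently.

```python
LIMIT = 2**31 - 1  # largest value a signed 32-bit count can hold


def scale_statfs(block_size, total_blocks, limit=LIMIT):
    """Double the block size until the block count fits under `limit`.

    Integer division rounds down, so the reported capacity never grows.
    """
    while total_blocks > limit:
        block_size *= 2
        total_blocks //= 2
    return block_size, total_blocks


# 2**33 blocks of 512 bytes do not fit; the scaled pair does.
scaled = scale_statfs(512, 2**33)
```

A 64-bit statfs avoids the trick entirely, because the real block size
and count can be reported unchanged.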
> Also I would like to propose the nanosecond stat patches. It doesn't add
> new system calls, but just uses spare fields in the existing stat64
> structure and closes a hole in make.
OK, I've added this one: sorry for missing it. You might want to
split this into "core" and then updated the filesystems via their
maintainers during the freeze though: it's one *big* patch as it
stands.
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
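One plausible reading of the "hole in make" mentioned above is the
same-second case: with one-second timestamps, a source edited in the
same second its target was built looks up to date. The sketch below
assumes that reading; the helper names are hypothetical, and timestamps
are passed in as integer nanoseconds (in Python, os.stat() exposes these
as st_mtime_ns).

```python
NSEC_PER_SEC = 1_000_000_000


def stale_by_seconds(source_mtime_ns, target_mtime_ns):
    """Rebuild decision when only whole seconds are stored."""
    return source_mtime_ns // NSEC_PER_SEC > target_mtime_ns // NSEC_PER_SEC


def stale_by_nanoseconds(source_mtime_ns, target_mtime_ns):
    """Rebuild decision with full nanosecond timestamps."""
    return source_mtime_ns > target_mtime_ns


# Target built at t = 100.2 s, source edited again at t = 100.7 s.
target_ns = 100 * NSEC_PER_SEC + 200_000_000
source_ns = 100 * NSEC_PER_SEC + 700_000_000
```

With these values the second-granularity check misses the edit, while
the nanosecond check asks for a rebuild.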
On Thu, 31 Oct 2002, Jeff Garzik wrote:
|>Re-read my other post(s) -- I have said repeatedly that LKCD's
|>infrastructure is decent. But it's completely pointless to merge a
|>decent infrastructure unless the users are up to snuff. It's much
|>smarter to keep the infrastructure out of the kernel until the low-level
|>dump drivers are hammered out and stable, because that gives you more
|>freedom to change the API.
This is where we disagree. Without the base infrastructure, this
becomes an even larger and larger patch which needs testing and
verification with a massive number of configurations for each new
kernel release. Do you know how much testing we go through for each
new kernel release? Do you know that we actually try this stuff
out with panic(), die(), interrupt and sysrq() dumps before we send
it off? Do you know we try this for SMP and UP?
If Linus would at least take the infrastructure patches and leave
out the drivers/dump code, that might be a good start. Just take
the base code. Just take the patches for panic.c, dump_ipi(), or
the rest of the other base kernel components. But no. Instead,
Linus just says "LKCD is stupid".
I also think you have completely misrepresented the LKCD user base,
but I'm sure our opinion on who those LKCD users are is different
and it's pointless to argue one person's experiences over another's.
I hate Linus' ego, I hate this whole damn discussion, and I find
it very irritating that I have to go through this process after
many people have created, enhanced and used LKCD for three years,
and this is where we're at.
To spend the last month and a half finalizing things for Linus,
sending this to him on multiple occasions, asking for his comments
and inclusion, asking for his feedback (as well as others), and
not hearing _one damn word_ from Linus all that time, and for him
to wait until now to just say "LKCD is stupid" is insulting.
--Matt
In message <[email protected]> you
write:
> On Fri, 1 Nov 2002, Rusty Russell wrote:
> |>The mini-oopser has different aims than LKCD: they want to debug one
> |>system, I want to make sure we're reaping OOPS reports from those 99%
> |>of desktop users who run X and simply reboot when their machine
> |>crashes once a month.
>
> I'd like to incorporate the mini-oopser as an LKCD dump method.
> I'll chat with you off-line about this. Shouldn't be that
> difficult to do.
That would defeat the "mini" part 8)
Cheers,
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
In article <[email protected]>,
Matt D. Robinson <[email protected]> wrote:
>
>To spend the last month and a half finalizing things for Linus,
>sending this to him on multiple occasions, asking for his comments
>and inclusion, asking for his feedback (as well as others), and
>not hearing _one damn word_ from Linus all that time, and for him
>to wait until now to just say "LKCD is stupid" is insulting.
You got to hear my comment now, several times: convince somebody _else_.
But no, it wasn't the answer you wanted. So you refuse to listen. And
yes, I get irritated too. So right now I won't touch LKCD with a
ten-foot pole, if only because I've been mail-bombed by people who argue
for it when I have better things to do than to explain myself over and
over again.
What's so hard to understand about the "vendor-driven" thing, and why do
people continue to argue about it?
Linus
On Fri, 1 Nov 2002, Linus Torvalds wrote:
> But no, it wasn't the answer you wanted. So you refuse to listen. And
> yes, I get irritated too. So right now I won't touch LKCD with a
> ten-foot pole, if only because I've been mail-bombed by people who argue
> for it when I have better things to do than to explain myself over and
> over again.
Maybe it's because users are wanting it in the mainline kernel... Notice
I said 'users' not 'vendors' or 'the code's maintainers'.
> What's so hard to understand about the "vendor-driven" thing, and why do
> people continue to argue about it?
Because I'm not a vendor, and I want it.
Pat
--
Purdue University ITAP/RCS
Information Technology at Purdue
Research Computing and Storage
http://www-rcd.cc.purdue.edu
http://dilbert.com/comics/dilbert/archive/images/dilbert2040637020924.gif
On 31 Oct 2002, Alan Cox wrote:
> Chris is right that crypto api is misdesigned if we want to use hardware
> cryptocards
Hardware support was not an initial goal, as the requirements are not yet
fully known.
From Documentation/crypto/api-intro.txt:
An asynchronous scheduling interface is in planning but not yet
implemented, as we need to further analyze the requirements of all of
the possible hardware scenarios (e.g. IPsec NIC offload).
Hardware accelerators are generally a known issue, with already proven
solutions (e.g. the OpenBSD crypto queue). We don't know much about IPSec
NIC offload yet, however.
- James
--
James Morris
<[email protected]>
On Thu, 31 Oct 2002, Linus Torvalds wrote:
>
> On Wed, 30 Oct 2002, Matt D. Robinson wrote:
>
> > Linus Torvalds wrote:
> > > > Crash Dumping (LKCD)
> > >
> > > This is definitely a vendor-driven thing. I don't believe it has any
> > > relevance unless vendors actively support it.
> >
> > There are people within IBM in Germany, India and England, as well as
> > a number of companies (Intel, NEC, Hitachi, Fujitsu), as well as SGI
> > that are PAID to support this.
>
> That's fine. And since they are paid to support it, they can apply the
> patches.
>
> What I'm saying by "vendor driven" is that it has no relevance for the
> standard kernel, and since it has no relevance to that, then I have no
> incentives to merge it. The crash dump is only useful with people who
> actively look at the dumps, and I don't know _anybody_ outside of the
> specialized vendors you mention who actually do that.
You're not listening! Screw the vendors! The users want this enough to
be patching it into their kernels now.
>
> I will merge it when there are real users who want it - usually as a
> result of having gotten used to it through a vendor who supports it. (And
> by "support" I do not mean "maintain the patches", but "actively uses it"
> to work out the users problems or whatever).
Did you not read the input from the developers? From the people who have
headless clusters?
I have Linux systems in fifteen locations, six states, four timezones.
They oops from time to time, and I can't get any clue why, because (a)
they have no console, (b) most are in secure locations like locked wiring
closets with no one to read a console, and (c) the systems are thousands
of miles away. I don't need a debugger, I'd love to just have ksymoops
output! And given the reality of using the network, I don't make kcore
world readable, I'm not about to send that information over a few
thousand miles of open net to save writing it to disk.
I also have Solaris and AIX servers, and if they go down I send a crash
dump to the vendor who can then provide support. Big difference. Visible
even to management, who see a real support issue.
>
> Horse before the cart and all that thing.
>
> People have to realize that my kernel is not for random new features.
Supportability is not a "random new feature," it's something which was
developed because users had a need (not by a vendor looking for a feature
to advertise), and if you would read the mail it's mostly coming from
people who want to use the feature. This is a whole new kernel series, it
will be stable a hell of a lot sooner if people can find problems!
Notice that developers want it, vendors want to provide it, and end
users want to be able to get support. In fact, other than one person who
had doubts about the implementation being optimal, your voice is the only
one I hear against it. That should tell you something.
Sometimes the best way to lead is to look at where everyone is going on
their own, jump in front, and yell "Follow me!" a few times. If you put
half the energy into improving the implementation that you put into
telling us we're all wrong it would be a better kernel.
On Thu, 31 Oct 2002, Linus Torvalds wrote:
>
> [ Ok, this is a really serious email. If you don't get it, don't bother
> emailing me. Instead, think about it for an hour, and if you still don't
> get it, ask somebody you know to explain it to you. ]
>
> On Thu, 31 Oct 2002, Matt D. Robinson wrote:
> >
> > Sure, but why should they have to? What technical reason is there
> > for not including it, Linus?
>
> There are many:
>
> - bloat kills:
>
> My job is saying "NO!"
>
> In other words: the question is never EVER "Why shouldn't it be
> accepted?", but it is always "Why do we really not want to live
> without this?"
I suspect that you have not had to make any significant part of your
living administering systems, certainly not recently. Lack of this tool is
a one-to-one mapping to "no clue" if you can't get information from the
console.
> - included features kill off (potentially better) projects.
>
> There's a big "inertia" to features. It's often better to keep
> features _off_ the standard kernel if they may end up being
> further developed in totally new directions.
Yes, you can clearly see how that worked with ext2 stifling development
of... wait a minute, rethink that argument. This feature is years old, and
seems to be ready to add new destinations for the data, disk, net, high
memory, what else is there? Once the data is saved people will be able to
develop any additional tools they want to read the raw data.
> In particular when it comes to this project, I'm told about
> "netdump", which doesn't try to dump to a disk, but over the net.
> And quite frankly, my immediate reaction is to say "Hell, I
> _never_ want the dump touching my disk, but over the network
> sounds like a great idea".
You have this idea that the dump will go over a high reliability path,
and that's an option, but not in all cases true.
> To me this says "LKCD is stupid". Which means that I'm not going to apply
> it, and I'm going to need some real reason to do so - ie being proven
> wrong in the field.
You've been proven wrong, you just don't want to look at the proof! You
can't say it doesn't work, it does. You can't say the {users, vendors,
developers} don't want it, because they do. You can't say it's untested,
it's been in use for several years, and you seem willing to take reiser4,
which isn't even finished yet!
> (And don't get me wrong - I don't mind getting proven wrong. I change my
> opinions the way some people change underwear. And I think that's ok).
If you really believed the stuff you say you'd put it in and promise to
take it out if people didn't find it useful or there were inherent
limitations. It would probably take 10-30% off the time to a stable
release.
> > I completely don't understand your reasoning here.
>
> Tough. That's YOUR problem.
--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
On Thu, 31 Oct 2002, Linus Torvalds wrote:
>
> On Thu, 31 Oct 2002, Chris Friesen wrote:
> >
> > How do you deal with netdump when your network driver is what caused the
> > crash?
>
> Actually, from a driver perspective, _the_ most likely driver to crash is
> the disk driver.
>
> That's from years of experience. The network drivers are a lot simpler,
> the hardware is simpler and more standardized, and doesn't do as many
> things. It's just plain _easier_ to write a network driver than a disk
> driver.
>
> Ask anybody who has done both.
From the standpoint of just the driver that's true. However, the remote
machine and all the network bits between them are a string of single
points of failure. Isn't it good that both disk and network can be
supported?
--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
On Fri, 1 Nov 2002, Bill Davidsen wrote:
>
> If you really believed the stuff you say you'd put it in and promise to
> take it out if people didn't find it useful or there were inherent
> limitations.
This never works. Be honest. Nobody takes out features, they are stuck
once they get in. Which is exactly why my job is to say "no", and why
there is no "accepted unless proven bad".
> It would probably take 10-30% off the time to a stable release.
Talk is cheap.
I've not seen a _single_ bug-report with a fix that attributed the
existing LKCD patches. I might be more impressed if I had.
The basic issue is that we don't put patches in in the hope that they will
prove themselves later. Your argument is fundamentally flawed.
Linus
On Thu, 31 Oct 2002, Linus Torvalds wrote:
>
> On Fri, 1 Nov 2002, Bill Davidsen wrote:
> >
> > If you really believed the stuff you say you'd put it in and promise to
> > take it out if people didn't find it useful or there were inherent
> > limitations.
>
> This never works. Be honest. Nobody takes out features, they are stuck
> once they get in. Which is exactly why my job is to say "no", and why
> there is no "accepted unless proven bad".
>
> > It would probably take 10-30% off the time to a stable release.
>
> Talk is cheap.
>
> I've not seen a _single_ bug-report with a fix that attributed the
> existing LKCD patches. I might be more impressed if I had.
Maybe people don't bother to spell out how they got there. Here's one.
-castor
:: Newsgroups: mlist.linux.kernel
:: Date: Mon, 17 Dec 2001 09:48:53 -0800 (PST)
:: From: Castor Fu <[email protected]>
:: X-To: <[email protected]>
:: Subject: i386 machine_restart unsafe in interrupt context
:: Message-ID: <linux.kernel.Pine.LNX.4.33.0112170935520.1623-100000@marais.SOMEWHERE>
:: MIME-Version: 1.0
:: Content-Type: TEXT/PLAIN; charset=US-ASCII
:: Approved: [email protected]
:: Lines: 27
::
::
:: I have a problem where systems fail to reboot on panic(). I've resolved
:: it by changing smp_send_stop() to use an NMI (like the KDB patch does to
:: manage communication).
::
:: The source of the problem is that the panic path has the following:
::
:: panic()
:: machine_restart()
:: machine_real_restart()
:: smp_send_stop()
:: smp_call_function()
::
:: and smp_call_function() is not safe in an interrupt context.
::
:: I imagine people might want to handle this differently, but I'd be
:: happy to send diffs if there's interest. It may be that there are enough
:: cases like this that smp_call_function might want a version that
:: uses an NMI. . .
::
:: -Castor Fu
:: [email protected]
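The hazard in the call chain above can be illustrated with a toy user-space model. This is a sketch, not kernel code: it assumes that smp_call_function() has to take a lock before signalling other CPUs, and that panic() fires in an interrupt that preempted code already holding that lock. The names call_lock, smp_call_function_model and nmi_stop_model are hypothetical, and the NMI path is modelled simply as one that needs no lock.

```python
import threading

# Hypothetical stand-in for a lock guarding cross-CPU calls.
call_lock = threading.Lock()


def smp_call_function_model():
    """Model a cross-CPU call that must take call_lock first.

    Returns True if the call could proceed, False if it would spin forever
    because the interrupted code already holds the lock.
    """
    got_it = call_lock.acquire(blocking=False)
    if got_it:
        call_lock.release()
    return got_it


def nmi_stop_model():
    """Model an NMI-based stop: it takes no lock, so it always proceeds."""
    return True


def panic_from_interrupt(stop_other_cpus):
    """Simulate panic() running in an interrupt that preempted a lock holder."""
    with call_lock:                # the interrupted code holds the lock...
        return stop_other_cpus()   # ...and the handler runs on top of it


would_hang = not panic_from_interrupt(smp_call_function_model)
nmi_ok = panic_from_interrupt(nmi_stop_model)
print("lock-based stop would hang:", would_hang)
print("NMI-based stop proceeds:", nmi_ok)
```

In this model the lock-based path can never make progress from the interrupt, while the lock-free path can, which is the shape of the fix described in the message.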
On Fri, Nov 01, 2002 at 08:14:16AM +1100, Rusty Russell wrote:
> In message <[email protected]> you write:
> > On Thu, Oct 31, 2002 at 02:00:31PM +1100, Rusty Russell wrote:
> > > They have, IIRC. Interestingly, it was less invasive (existing source
> > > touched) than the LVM2/DM patch you merged.
> >
> > FUD. I added to three areas of existing code:
>
> [ 40-line detailed explanation snipped ]
>
> Woah! War's over dude! We won!
:)
Sorry, it wasn't meant to be an aggressive email. However comments
like this do get picked up out of context and passed around until they
become the accepted truth. I'm still trying to work out where the 'dm
can't handle mirroring or raid' rumour came from.
- Joe
> Talk is cheap.
>
> I've not seen a _single_ bug-report with a fix that attributed the
> existing LKCD patches. I might be more impressed if I had.
>
> The basic issue is that we don't put patches in in the hope that they will
> prove themselves later. Your argument is fundamentally flawed.
comment from userspace:
I'm going to have to side with Linus here despite my desire to see LKCD merged.
However, we need to show him the money. This means:
* making sure that the patches are kept up to date
* keep the LKCD patches in the list/community spotlight in a positive
manner ("please test this!", or "please use this when
looking for help debugging a system problem"). Perhaps
a 2.5.x-lkcd bk tree or something like that.
* make documentation/HOWTO's available for folks so that
they'll know how to generate a crashdump
and run some utilities against it to generate
a synopsis which can be submitted for debugging
* most important: squash a whole lot of bugs with
said dumps!
If it becomes apparent through empirical data that crash dumps are a useful
tool, I'm sure that Linus will become far more amenable. Until then, let's let
him handle all of his other work which needs to get done.
-- craig
.- ... . -.-. .-. . - -- . ... ... .- --. .
Craig I. Hagan
hagan(at)cih.com
Bill Davidsen <[email protected]> writes:
> You're not listening! Screw the vendors! The users want this enough to
^^^^^^^^^^^^^^^^^^
>be patching it into their kernels now.
[...]
> I also have Solaris and AIX servers, and if they go down I send a crash
>dump to the vendor who can then provide support. Big difference. Visible
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
q.e.d. End of Discussion.
Regards
Henning
--
Dipl.-Inf. (Univ.) Henning P. Schmiedehausen -- Geschaeftsfuehrer
INTERMETA - Gesellschaft fuer Mehrwertdienste mbH [email protected]
Am Schwabachgrund 22 Fon.: 09131 / 50654-0 [email protected]
D-91054 Buckenhof Fax.: 09131 / 50654-20
Patrick Finnegan <[email protected]> writes:
>> What's so hard to understand about the "vendor-driven" thing, and why do
>> people continue to argue about it?
>Because I'm not a vendor, and I want it.
So get your vendor to integrate it.
You don't have a vendor, but roll your own kernels? Tough, so you
are a "vendor". Surprise, surprise.
Replace "vendor" with "people who roll up and distribute kernels". So
one vendor (Linus) refuses to integrate LKCD. Tough. Use another
one. Think USP here. Think diversity. Think competition. Maybe "that
vendor" (Linus) will catch up one day. Maybe not. Maybe "competition"
is not on his agenda. So what?
Get SuSE. They will integrate everything and their grand mother in
their kernels.
Gee, most people seem to think that "vendor" means "big evil
corporation in Redmond, WA".
Regards
Henning
--
Dipl.-Inf. (Univ.) Henning P. Schmiedehausen -- Geschaeftsfuehrer
INTERMETA - Gesellschaft fuer Mehrwertdienste mbH [email protected]
Am Schwabachgrund 22 Fon.: 09131 / 50654-0 [email protected]
D-91054 Buckenhof Fax.: 09131 / 50654-20
In article <1036103335.25512.40.camel@bip>,
Xavier Bestel <[email protected]> wrote:
>On Thu, 31/10/2002 at 23:57, Pavel Machek wrote:
>
>> This seems like a pretty common situation to me, and current solutions
>> are not nice. [I guess ~/bin/ with --x and
>> ~/bin/my-secret-password-only-jarka-and-mj-knows/phonebook would solve
>> the problem, but...!]
>
>Can't even this be spied from /proc/*/fd ?
Or ptrace, /proc/pid/mem, etc. If you can execute a binary, it
has to be loaded into memory in a process running as you, so
you can read it.
Mike.
On Fri, 2002-11-01 at 06:34, Bill Davidsen wrote:
> From the standpoint of just the driver that's true. However, the remote
> machine and all the network bits between them are a string of single
> points of failure. Isn't it good that both disk and network can be
> supported.
My concerns are solely with things like the correctness of the disk
dumper. It's obviously a good way to do a lot more damage if it isn't done
carefully. Quite clearly your dump system wants to support multiple dump
targets so you can dump to PCI battery-backed RAM, down the parallel
port to an analysing box, etc.
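A minimal sketch of the multiple-target idea, assuming each dump target can be reduced to a function that either succeeds or fails. The target names and the dump_to_first_working helper are hypothetical and do not come from the LKCD patches.

```python
from typing import Callable, List, Optional, Tuple

# A target is a (name, writer) pair; the writer returns True on success.
DumpTarget = Tuple[str, Callable[[bytes], bool]]


def dump_to_first_working(targets: List[DumpTarget], image: bytes) -> Optional[str]:
    """Try each target in order and return the name of the first that works."""
    for name, write in targets:
        try:
            if write(image):
                return name
        except OSError:
            # This target is broken (e.g. its driver crashed); try the next.
            continue
    return None


def broken_disk(image: bytes) -> bool:
    raise OSError("disk driver is the thing that crashed")


def unreachable_network(image: bytes) -> bool:
    return False  # no route to the dump server


saved = bytearray()


def battery_backed_ram(image: bytes) -> bool:
    saved.extend(image)
    return True


targets = [
    ("disk", broken_disk),
    ("network", unreachable_network),
    ("battery-backed RAM", battery_backed_ram),
]
chosen = dump_to_first_working(targets, b"oops image")
print("dump written to:", chosen)
```

The ordering of the list is the policy: whichever target is considered least likely to do damage goes first, and a failure in one does not prevent trying the others.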
>>>>> "PI" == Perez-Gonzalez, Inaky <[email protected]> writes:
>> THAT is what I mean by vendor-driven. If vendors decide they
>> really want the patches, and I actually start seeing noises on
>> linux-kernel or getting
>> requests for it being merged from _users_ rather than developers, then
>> that means that the vendor is on to something.
For what it is worth, CERN has been using LKCD kernels for the last
six months or so, enabled mostly on headless farm machines (but the
kernels get deployed to desktops as well). Please consider including
it into the mainstream kernel.
Jan Iven
Linux support / CERN
On Fri, 2002-11-01 at 06:36, Linus Torvalds wrote:
> This never works. Be honest. Nobody takes out features, they are stuck
> once they get in.
Linus I've asked a couple of times about killing sound/oss off now ALSA
is integrated 8) While you are on the rant how about that ;)
On Fri, 2002-11-01 at 06:27, Bill Davidsen wrote:
> You're not listening! Screw the vendors! The users want this enough to
> be patching it into their kernels now.
Welcome to free software. If you can make a case for it go sell people
suitable kernels, build an "LKCD kernel site" whatever.
On Thu, 2002-10-31 at 21:02, Jeff Garzik wrote:
> hosed/screaming, and various mid-layers are dying. For LKCD to be of
> any use, it needs to _skip_ the block layer and talk directly to
> low-level drivers.
Rusty wrote a polled IDE driver that should handle some subset of that.
On Fri, 1 Nov 2002, Craig I. Hagan wrote:
> > Talk is cheap.
> >
> > I've not seen a _single_ bug-report with a fix that attributed the
> > existing LKCD patches. I might be more impressed if I had.
> >
> > The basic issue is that we don't put patches in in the hope that they will
> > prove themselves later. Your argument is fundamentally flawed.
>
> comment from userspace:
>
> I'm going to have to side with Linus here despite my desire to see LKCD
> merged.
I'll have to disagree with what you're saying, because:
> However, we need to show him the money. This means:
>
> * making sure that the patches are kept up to date
They are being kept up to date, and apparently have been for some time.
> * keep the LKCD patches in the list/community spotlight in a positive
> manner ("please test this!", or "please use this when
> looking for help debugging a system problem"). Perhaps
> a 2.5.x-lkcd bk tree or something like that.
Umm, and the difference between maintaining a set of patches per kernel
version and something using bitkeeper (or heaven forbid, CVS)? Even
Linus didn't start using source code management until somewhat
recently.
> * make documentation/HOWTO's available for folks so that
> they'll know how to generate a crashdump
> and run some utilities against it to generate
> a synopsis which can be submitted for debugging
Have you seen http://lkcd.sf.net ? They have that there. I've
successfully walked through their well-written tutorials and produced
crashdumps from machines that have failed.
> * most important: squash a whole lot of bugs with
> said dumps!
Perhaps people are but they're not posting the bugs to the list...
> If it becomes apparent through empirical data that crash dumps are a useful
> tool, I'm sure that Linus will become far more amenable. Until then, let's let
> him handle all of his other work which needs to get done.
The data is there, perhaps not for Linux, but for other Unixes -
including ones like the BSDs. Crashdumps are an invaluable resource for
finding bugs that involve things like hardware that doesn't conform
exactly to specs, or deadlocks, or...
Pat
--
Purdue University ITAP/RCS
Information Technology at Purdue
Research Computing and Storage
http://www-rcd.cc.purdue.edu
http://dilbert.com/comics/dilbert/archive/images/dilbert2040637020924.gif
On Fri, 1 Nov 2002, Henning P. Schmiedehausen wrote:
> Patrick Finnegan <[email protected]> writes:
>
> >Because I'm not a vendor, and I want it.
>
> You don't have a vendor, but roll your own kernels? Tough, so you
> are a "vendor". Surprise, surprise.
>
> Replace "vendor" with "people who roll up and distribute kernels". So
> one vendor (Linus) refuses to integrate LKCD. Tough. Use another
I'm confused, you just said (1) I'm a vendor and then (2) Linus is my
vendor. And besides, we don't distribute the kernels - we install them on
our own machines, and say 'done'. The lack of distribution (at least IMO)
should make us not be a vendor.
> one. Think USP here. Think diversity. Think competition. Maybe "that
> vendor" (Linus) will catch up one day. Maybe not. Maybe "competition"
> is not on his agenda. So what?
This isn't about competition. It's about integrating a core useful
feature that has been shown to be empirically useful by every other person
who writes an OS kernel.
> Get SuSE. They will integrate everything and their grand mother in
> their kernels.
That's not really an option at the moment. We have a distro vendor
(RedHat) and were dissatisfied with its kernels so we are trying to use
*the*official* kernel (Linus's kernel).
> Gee, most people seem to think that "vendor" means "big evil
> corporation in Redmond, WA".
No, vendor == people who sold or gave us the software. Right now, Linus is
acting like he's a big evil corporation that won't add the change no
matter what we say:
On Thu, 31 Oct 2002, Linus Torvalds wrote:
> On Thu, 31 Oct 2002, Matt D. Robinson wrote:
> >
> > Sure, but why should they have to? What technical reason is there
> > for not including it, Linus?
<snipped reasons that are imho incorrect>
> To me this says "LKCD is stupid". Which means that I'm not going to
> apply it
On Thu, 31 Oct 2002, Linus Torvalds wrote:
> Don't bother to ask me to merge the thing, that only makes me get even
> more fed up with the whole discussion.
On Thu, 31 Oct 2002, Linus Torvalds wrote:
> And imnsho, debugging the kernel on a source level is the way to do it.
>
> Which is why it's not going to be me who merges it.
On Fri, 1 Nov 2002, Linus Torvalds wrote:
> You got to hear my comment now, several times: convince somebody _else_.
<snip>
> What's so hard to understand about the "vendor-driven" thing, and why do
> people continue to argue about it?
You know, considering the volume of people on this list that have been
saying "I want it, Linus, please integrate it" and:
On Thu, 31 Oct 2002, Matt D. Robinson wrote:
> I hate Linus' ego, I hate this whole damn discussion, and I find
> it very irritating that I have to go through this process after
> many people have created, enhanced and used LKCD for three years,
> and this is where we're at.
>
> To spend the last month and a half finalizing things for Linus,
> sending this to him on multiple occasions, asking for his comments
> and inclusion, asking for his feedback (as well as others), and
> not hearing _one damn word_ from Linus all that time, and for him
> to wait until now to just say "LKCD is stupid" is insulting.
You know, pissing off core developers of projects that have been shown to
be (1) desired (2) potentially useful in Linux, even as an aid to other
Linux subsystem developers and (3) empirically shown to be useful for other
Free *nixes such as the BSDs, is not what I would be doing as a project
maintainer. Of course, I'm not Linus...
Pat
--
Purdue University ITAP/RCS
Information Technology at Purdue
Research Computing and Storage
http://www-rcd.cc.purdue.edu
http://dilbert.com/comics/dilbert/archive/images/dilbert2040637020924.gif
On Fri, 1 Nov 2002, Patrick Finnegan wrote:
> No, vendor == people who sold or gave us the software. Right now, Linus is
> acting like he's a big evil corporation that won't add the change no
> matter what we say:
... to his tree. Geez, why could that be? Maybe because you don't have
any rights to decide what patches anybody else applies to their trees?
It's not a fscking public service. Linus has full control over his
tree. You have equally full control over your tree. Linus can't
tell you what patches to apply in your tree. You can't tell Linus
what patches he should apply to his.
"I'm not satisfied with this tree, I'll try that one" is perfectly OK.
"I'm not satisfied with either, so bend the fsck over and change your
tree the way I want" is _NOT_.
In article <[email protected]>,
Chris Wedgwood <[email protected]> wrote:
>On Thu, Oct 31, 2002 at 10:49:10AM -0800, Linus Torvalds wrote:
>
>> Any hardware that needs to go off and think about how to encrypt
>> something sounds like it's so slow as to be unusable. I suspect that
>> anything that is over the PCI bus is already so slow (even if it
>> adds no extra cycles of its own) that you're better off using the
>> CPU for the encryption rather than some external hardware.
>
>Except almost all hardware out there that does this stuff is async to
>some extent...
That's not my argument. I realize that external hardware on a PCI bus
_has_ to be asynchronous, simply because it is so slow.
The question I have is whether such external hardware is even worth it
any more for any standard crypto work. With a regular PCI bus
fundamentally limiting throughput to something like a maximum of 66MB/s
(copy-in and copy-out, and that's so theoretical that it's not even
funny - I'd be surprised if RL throughput copying back and forth over a
PCI bus is more than 25-30MB/s), I suspect that you can do most crypto
faster on the CPU directly these days.
Maybe not. The only numbers I have is the slowness of PCI.
Linus
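The 66MB/s figure can be reproduced with back-of-the-envelope arithmetic, assuming a 32-bit bus clocked at a nominal 33 MHz (the message does not state these parameters) and data crossing the bus once in each direction. The 25-30MB/s range is the guess given in the message itself, not a measurement.

```python
BUS_WIDTH_BYTES = 4          # assumed: 32-bit PCI
BUS_CLOCK_HZ = 33_000_000    # assumed: nominal 33 MHz

peak_mb_s = BUS_WIDTH_BYTES * BUS_CLOCK_HZ / 1_000_000
# Data goes to the card and back, so each payload byte crosses the bus twice.
round_trip_mb_s = peak_mb_s / 2

# Real-life guess quoted in the message, in MB/s.
guess_low, guess_high = 25, 30

print(f"theoretical one-way peak: {peak_mb_s:.0f} MB/s")
print(f"copy-in plus copy-out ceiling: {round_trip_mb_s:.0f} MB/s")
print(f"guessed real-life share of ceiling: "
      f"{guess_low / round_trip_mb_s:.0%} to {guess_high / round_trip_mb_s:.0%}")
```

Under these assumptions, any cipher the CPU can run faster than the round-trip ceiling gains nothing from a card on such a bus, which is the comparison the message is making.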
On Fri, 1 Nov 2002, Patrick Finnegan wrote:
>
[SNIPPED...]
> You know, pissing off core developers of projects that have been shown to
> be (1) desired (2) potentially useful in Linux, even as an aid to other
> Linux subsystem developers and (3) empirically shown to be useful for other
> Free *nixes such as the BSDs, is not what I would be doing as a project
> maintainer. Of course, I'm not Linus...
>
> Pat
Maybe somebody should at least say what it is that is:
"(1) desired (2) potentially useful in Linux, even as an aid to
other..."
It might be that you guys are so close to the project that you
lose sight of the fact that others, including Linus, might not
understand how important it is. It is quite possible that somebody
has developed a lot of excellent code that has absolutely no use
to anybody except a small group of intellectuals who use the
kernel to write poetry. In that case, regardless of how excellent
it is, it really should not be in the standard kernel. OTOH, it
might be useful to the whole world, but nobody has bothered to
explain how this may be so.
Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
Bush : The Fourth Reich of America
On Fri, 1 Nov 2002, Alexander Viro wrote:
> On Fri, 1 Nov 2002, Patrick Finnegan wrote:
>
> > No, vendor == people who sold or gave us the software. Right now, Linus is
> > acting like he's a big evil corporation that won't add the change no
> > matter what we say:
>
> ... to his tree. Geez, why could that be? Maybe because you don't have
> any rights to decide what patches anybody else applies to their trees?
>
> It's not a fscking public service. Linus has full control over his
> tree. You have equally full control over your tree. Linus can't
> tell you what patches to apply in your tree. You can't tell Linus
> what patches he should apply to his.
>
> "I'm not satisfied with this tree, I'll try that one" is perfectly OK.
> "I'm not satisfied with either, so bend the fsck over and change your
> tree the way I want" is _NOT_.
Yes, I recognise it's his right. But what bothers me is that he says "I
want users to say they want it" and when users say they want it he says
"It's a vendor thing, no users want it."
Linus, if you say you're going to listen, please try and listen. This is
annoying and dissatisfying to all of us when you say you'll listen and you
blatantly ignore people. Your tree is your tree, for now it's going to be
patching our own kernel, and then possibly moving to another vendor who
listens to their users.
Pat
--
Purdue University ITAP/RCS
Information Technology at Purdue
Research Computing and Storage
http://www-rcd.cc.purdue.edu
http://dilbert.com/comics/dilbert/archive/images/dilbert2040637020924.gif
On Fri, Nov 01, 2002 at 03:25:01PM +0000, Linus Torvalds wrote:
> (copy-in and copy-out, and that's so theoretical that it's not even
> funny - I'd be surprised if RL throughput copying back and forth over a
> PCI bus is more than 25-30MB/s), I suspect that you can do most crypto
> faster on the CPU directly these days.
I'd be amazed if current CPUs would be able to do asymmetric encryption at
anywhere within an order of magnitude of those rates.
Symmetric encryption is something else. This is the reason many encryption
products (ie, pgp) only use asymmetric encryption for encrypting a symmetric
session key, and not encrypting the entire message.
Regards,
bert hubert
--
http://www.PowerDNS.com Versatile DNS Software & Services
http://lartc.org Linux Advanced Routing & Traffic Control HOWTO
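The hybrid scheme described above, where the slow asymmetric step only wraps a short session key and a fast symmetric cipher handles the message, can be sketched as follows. This is a toy for illustration and is not secure: the RSA modulus is built from two small Mersenne primes, there is no padding, and a SHA-256 counter keystream stands in for a real symmetric cipher. None of the names below come from PGP or the kernel crypto API.

```python
import hashlib
import secrets

# Toy RSA key from two Mersenne primes. Far too small for real use.
P = 2**61 - 1
Q = 2**31 - 1
N = P * Q
E = 65537
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent (needs Python 3.8+)


def keystream_xor(key: bytes, data: bytes) -> bytes:
    """XOR data with a SHA-256 counter keystream (symmetric stand-in)."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ p for b, p in zip(chunk, pad))
    return bytes(out)


def hybrid_encrypt(message: bytes):
    """Wrap a fresh session key with RSA, encrypt the message symmetrically."""
    session_key = secrets.token_bytes(8)  # 64 bits, always smaller than N
    wrapped = pow(int.from_bytes(session_key, "big"), E, N)
    return wrapped, keystream_xor(session_key, message)


def hybrid_decrypt(wrapped: int, ciphertext: bytes) -> bytes:
    """Unwrap the session key with the private exponent, then decrypt."""
    session_key = pow(wrapped, D, N).to_bytes(8, "big")
    return keystream_xor(session_key, ciphertext)


message = b"only the session key goes through the slow asymmetric step" * 4
wrapped_key, ciphertext = hybrid_encrypt(message)
recovered = hybrid_decrypt(wrapped_key, ciphertext)
print("round trip ok:", recovered == message)
```

However long the message is, exactly one modular exponentiation is needed on each side; everything else runs at symmetric-cipher speed.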
On Fri, Nov 01, 2002 at 03:25:01PM +0000, Linus Torvalds wrote:
> The question I have is whether such external hardware is even worth it
> any more for any standard crypto work. With a regular PCI bus
> fundamentally limiting throughput to something like a maximum of 66MB/s
> (copy-in and copy-out, and that's so theoretical that it's not even
> funny - I'd be surprised if RL throughput copying back and forth over a
> PCI bus is more than 25-30MB/s), I suspect that you can do most crypto
> faster on the CPU directly these days.
This may be true of a typical workstation or large server, but your router
may not have such a modern CPU in it. Crypto accelerators are likely a
much bigger win on embedded routers or other small appliances with CPUs such
as the AMD Elan or other 486 to Pentium class processors.
-- Gerald
On 11/01/02 23:25, Linus Torvalds wrote:
> In article <[email protected]>,
> Chris Wedgwood <[email protected]> wrote:
>
>>On Thu, Oct 31, 2002 at 10:49:10AM -0800, Linus Torvalds wrote:
>>
>>
>>>Any hardware that needs to go off and think about how to encrypt
>>>something sounds like it's so slow as to be unusable. I suspect that
>>>anything that is over the PCI bus is already so slow (even if it
>>>adds no extra cycles of its own) that you're better off using the
>>>CPU for the encryption rather than some external hardware.
>>
>>Except almost all hardware out there that does this stuff is async to
>>some extent...
>
>
> That's not my argument. I realize that external hardware on a PCI bus
> _has_ to be asynchronous, simply because it is so slow.
>
> The question I have is whether such external hardware is even worth it
> any more for any standard crypto work. With a regular PCI bus
> fundamentally limiting throughput to something like a maximum of 66MB/s
> (copy-in and copy-out, and that's so theoretical that it's not even
> funny - I'd be surprised if RL throughput copying back and forth over a
> PCI bus is more than 25-30MB/s), I suspect that you can do most crypto
> faster on the CPU directly these days.
>
> Maybe not. The only numbers I have is the slowness of PCI.
A 1GHz PIII will do about 8MBytes/sec of 3DES.
Plug a 2.4Gb/s Broadcom crypto chip into a 64-bit PCI-X slot with the
same CPU and you should be capable of doing at least 10 times that.
Stuff like RSA is much slower (and benefits more from hardware).
BTW - there are some outdated cryptolib patches with an async
interface around somewhere (along with patches for freeswan to use
the async api).
I guess the crypto guys like Chris will add the async API if they need
it (which they do, I think ;).
~mc
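The figures in this message can be put side by side with a little arithmetic, assuming the chip's rating is 2.4 gigabits per second and that 1 MB is 10^6 bytes. The 1000 MB payload is an arbitrary illustration, not a number from the thread.

```python
CPU_3DES_MB_S = 8.0                 # 1GHz PIII figure from the message
CHIP_RATED_BITS_S = 2_400_000_000   # assumed reading of the chip's rating

chip_rated_mb_s = CHIP_RATED_BITS_S / 8 / 1_000_000
claimed_mb_s = 10 * CPU_3DES_MB_S   # "at least 10 times that"

payload_mb = 1000                   # arbitrary example size
cpu_seconds = payload_mb / CPU_3DES_MB_S
offload_seconds = payload_mb / claimed_mb_s

print(f"chip rating: {chip_rated_mb_s:.0f} MB/s")
print(f"claimed offload rate: {claimed_mb_s:.0f} MB/s")
print(f"{payload_mb} MB of 3DES on the CPU: {cpu_seconds:.1f} s")
print(f"{payload_mb} MB of 3DES offloaded: {offload_seconds:.1f} s")
```

The claimed tenfold gain is well below the chip's rated throughput under these assumptions, so the estimate in the message leaves room for bus and driver overhead.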
What I'm going to say may not be popular, and probably won't win me
friends, but here it is anyhow:
On Fri, 1 Nov 2002, Alexander Viro wrote:
> On Fri, 1 Nov 2002, Patrick Finnegan wrote:
>
> > No, vendor == people who sold or gave us the software. Right now, Linus is
> > acting like he's a big evil corporation that won't add the change no
> > matter what we say:
>
> ... to his tree. Geez, why could that be? Maybe because you don't have
> any rights to decide what patches anybody else applies to their trees?
>
> It's not a fscking public service. Linus has full control over his
> tree. You have equally full control over your tree. Linus can't
> tell you what patches to apply in your tree. You can't tell Linus
> what patches he should apply to his.
I'm sorry it _is_ a public service. Once tens of people started
contributing to it, it became one. This is like saying that the
Washington Monument belongs to the people that maintain it, any building
belongs to the repair crews and janitors. I'm not saying that Linus is
necessarily a janitor, but when you consider how much of the Linux kernel
that he didn't write, you may realize that it's not just his kernel. It
also belongs to every single person that has written even a single
line of code in it.
BTW, "My opinions do not represent the opinions of my employer" for at
least this email..
Pat
--
Purdue University ITAP/RCS
Information Technology at Purdue
Research Computing and Storage
http://www-rcd.cc.purdue.edu
http://dilbert.com/comics/dilbert/archive/images/dilbert2040637020924.gif
On Fri Nov 01, 2002 at 03:25:01PM +0000, Linus Torvalds wrote:
> funny - I'd be surprised if RL throughput copying back and forth over a
> PCI bus is more than 25-30MB/s), I suspect that you can do most crypto
> faster on the CPU directly these days.
>
> Maybe not. The only numbers I have is the slowness of PCI.
It may be faster on your beefy 8 CPU boxes. But many people are
creating, for example, little wireless access points with 200 MHz
StrongARM CPUs and similar little devices that lack the major CPU
horsepower of big-iron systems. Such boxes would be far better
off offloading crypto to a little crypto chip, right?
-Erik
--
Erik B. Andersen http://codepoet-consulting.com/
--This message was written using 73% post-consumer electrons--
On Fri, Nov 01, 2002 at 11:16:20AM -0500, Patrick Finnegan wrote:
> On Fri, 1 Nov 2002, Alexander Viro wrote:
> > It's not a fscking public service. Linus has full control over his
> > tree. You have equally full control over your tree. Linus can't
> > tell you what patches to apply in your tree. You can't tell Linus
> > what patches he should apply to his.
>
> I'm sorry it _is_ a public service. Once tens of people started
> contributing to it, it became one.
Pat, the public service that Linus provides is doing exactly what he does.
He's acting as a filter. You may or may not agree with the things he
lets in or does not. That's fine, if you think you can do a better job
you have that option. I can imagine your answer is "I think he's doing
a fine job except for my project which isn't getting in" or something
like that. That's a bummer for you but keep the big picture in mind.
Linus is the glue which keeps the Linux world from turning into the
BSD mess. He is the acknowledged leader. Without him we have a bunch
of semi-leaders, with him we have a real leader. The fact that Linus
is here, leading this herd of cats, is a gift to the world. Try and
imagine Linux without him, it's not a pretty picture.
So figure out a way to work with him, don't stress him out, he's a
critical resource without a viable replacement.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
Larry McVoy writes:
> On Fri, Nov 01, 2002 at 11:16:20AM -0500, Patrick Finnegan wrote:
>> On Fri, 1 Nov 2002, Alexander Viro wrote:
>> > It's not a fscking public service. Linus has full control over his
>> > tree. You have equally full control over your tree. Linus can't
>> > tell you what patches to apply in your tree. You can't tell Linus
>> > what patches he should apply to his.
>>
>> I'm sorry it _is_ a public service. Once tens of people started
>> contributing to it, it became one.
>
> Pat, the public service that Linus provides is doing exactly what he does.
> He's acting as a filter. You may or may not agree with the things he
> lets in or does not. That's fine, if you think you can do a better job
> you have that option. I can imagine your answer is "I think he's doing
> a fine job except for my project which isn't getting in" or something
> like that. That's a bummer for you but keep the big picture in mind.
> Linus is the glue which keeps the Linux world from turning into the
> BSD mess. He is the acknowledged leader. Without him we have a bunch
> of semi-leaders, with him we have a real leader. The fact that Linus
> is here, leading this herd of cats, is a gift to the world. Try and
> imagine Linux without him, it's not a pretty picture.
>
What, something like:
Virox
Hellwigix
Alanix
KHix
eeewww, I can't bring myself to think about it
> So figure out a way to work with him, don't stress him out, he's a
> critical resource without a viable replacement.
> --
> ---
> Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
>> The fact that Linus is here, leading this herd of cats,
>> is a gift to the world. Try and imagine Linux without
>> him, it's not a pretty picture.
>
> What something like:
> Virox...
That's actually a pretty cool name.
> ...Alanix
Sounds too much like a Canadian musician.
Oh well, back to hacking hairballs. Meow.
Paul Fulghum, [email protected]
Microgate Corporation, http://www.microgate.com
On Fri, Nov 01, 2002 at 10:50:45AM -0500, Gerald Britton wrote:
> On Fri, Nov 01, 2002 at 03:25:01PM +0000, Linus Torvalds wrote:
> > The question I have is whether such external hardware is even worth it
> > any more for any standard crypto work. With a regular PCI bus
> > fundamentally limiting throughput to something like a maximum of 66MB/s
> > (copy-in and copy-out, and that's so theoretical that it's not even
> > funny - I'd be surprised if RL throughput copying back and forth over a
> > PCI bus is more than 25-30MB/s), I suspect that you can do most crypto
> > faster on the CPU directly these days.
>
> This may be true of a typical workstation or large server, but your router
> may not have such a modern CPU in it. Crypto accelerators are likely a
> much bigger win on embedded routers or other small appliances with CPUs such
> as the AMD Elan or other 486 to Pentium class processors.
Yes, and as a tangent, the same class of embedded devices also benefit
from TCP/IP offload facilities. The same argument against a crypto-api
supporting crypto hardware has been used in the past to argue against
a Linux kernel TCP/IP hardware offload layer. The argument is
completely invalid once one considers the typically lower speed of an
embedded processor going into a crypto or network-edge device.
Even better, synthesizable SoC designs like the IBM PPC4xx and reconfigurable
processor architectures have further opened up the concept of an on-chip
crypto or TCP/IP offload macro cell, which virtually eliminates PCI
speed/latency concerns for these assist engines. It should be no
surprise that embedded Linux is highly desired in these application
specific processors.
Regards,
--
Matt Porter
[email protected]
This is Linux Country. On a quiet night, you can hear Windows reboot.
On Fri, 1 Nov 2002, Patrick Finnegan wrote:
> What I'm going to say may not be popular, and probably won't win me
> friends, but here it is anyhow:
>
> On Fri, 1 Nov 2002, Alexander Viro wrote:
>
> > On Fri, 1 Nov 2002, Patrick Finnegan wrote:
> >
> > > No, vendor == people who sold or gave us the software. Right now, Linus is
> > > acting like he's a big evil corporation that won't add the change no
> > > matter what we say:
> >
> > ... to his tree. Geez, why could that be? Maybe because you don't have
> > any rights to decide what patches does anybody else apply to their trees?
> >
> > It's not a fscking public service. Linus has full control over his
> > tree. You have equally full control over your tree. Linus can't
> > tell you what patches to apply in your tree. You can't tell Linus
> > what patches he should apply to his.
>
> I'm sorry it _is_ a public service. Once tens of people started
> contributing to it, it became one. This is like saying that the
> Washington Monument belongs to the people that maintain it, any building
> belongs to the repair crews and janitors.
But then would you agree to let anybody, and I mean anybody, come along
with a "good idea" for altering the Washington Monument and do
what they want?
> I'm not saying that Linus is
> necessarily a janitor, but when you consider how much of the Linux kernel
> that he didn't write, you may realize that it's not just his kernel. It
> also belongs to every single person that has written even a single
> line of code in it.
It is _his_ copy of the kernel, just as you have your own copy.
Linus' tree is known to be the main reference tree, no more.
If your patch is so valuable (and I don't mean it's not), you should be able
to convince vendors to include it in their own tree. If _then_ it happens
to be a major feature with a large user base I'm sure it'll make the
reference tree. But in the meantime a few scattered users isn't enough.
Nicolas
On Fri, 1 Nov 2002, Patrick Finnegan wrote:
>
> I'm sorry it _is_ a public service. Once tens of people started
> contributing to it, it became one. This is like saying that the
> Washington Monument belongs to the people that maintain it, any building
> belongs to the repair crews and janitors. I'm not saying that Linus is
> necessarily a janitor, but when you consider how much of the Linux kernel
> that he didn't write, you may realize that it's not just his kernel. It
> also belongs to every single person that has written even a single
> line of code in it.
>
The logic you seem to be missing is that the Washington Monument is a
physical object. Linus's source tree is a collection of "copied" parts
from other people's source trees. You obviously see his source copy
as special, more so than, say, my copy. This is true _ONLY_ because
Linus's copy commands more respect than yours or mine.
If you think about it, the respect Linus's copy has is _PURELY_
the result of his past _choices_ over how he maintains it.
In effect you are saying:
Patrick: "Everyone trusts your source tree, I think LKCD
is SUPER DUPER important and should get the exposure and trust
that being in your tree commands."
Linus: "I think LKCD is a bad idea, until I am convinced otherwise I
will not merge it."
Patrick: "You are wrong, LKCD should be in your copy of the kernel source.
It is your Job Linus, to add things to _your_ copy which others find
important, what you think is secondary."
You cannot have it both ways, either Linus's tree is a dumping
grounds for all ideas (both good and bad) or it is a place for good
ideas (good defined by Linus) where people who trust Linus's judgment can
work from.
In truth you can have it both ways. Take Linus's existing copy, add the
features you think are important. If your choices prove to be superior,
you can expect that people (over time) will begin to trust/respect your
copy more than Linus's.
--
Shane R. Stixrud "Nothing would please me more than being able to
[email protected] hire ten programmers and deluge the hobby market
with good software." -- Bill Gates 1976
We are still waiting ....
On Fri, Nov 01, 2002 at 01:26:44PM +0000, Alan Cox wrote:
> My concerns are solely with things like the correctness of the disk
> dumper. It's obviously a good way to do a lot more damage if it isn't done
> carefully.
I always liked the AIX dumper choices. You could either dump to
the swap area (and startup detects the dump and moves it to the
filesystem before swapon) or provide a dedicated dump partition. The
latter was preferred.
Either of these methods merely requires the dumper to correctly
write to one disk partition. This is about as simple as you are going
to get in disk dumping.
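The bookkeeping side of the dump-to-swap scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `DUMP_MAGIC` signature, the header layout and both function names are hypothetical and are not taken from AIX, LKCD or any real dump format, and an in-memory buffer stands in for the partition. It models only "write sequentially to one partition, detect it at startup before swapon", not the low-level disk write itself.

```python
import io

DUMP_MAGIC = b"DUMPHDR1"  # hypothetical signature, not a real dump format

def write_dump(partition, memory_image):
    """Crash side: write magic, an 8-byte length, then the image, in order."""
    partition.seek(0)
    partition.write(DUMP_MAGIC)
    partition.write(len(memory_image).to_bytes(8, "little"))
    partition.write(memory_image)

def recover_dump(partition):
    """Boot side: return the saved image if a dump is present, else None.

    The magic is cleared afterwards so the same dump is not picked up twice.
    """
    partition.seek(0)
    if partition.read(len(DUMP_MAGIC)) != DUMP_MAGIC:
        return None
    size = int.from_bytes(partition.read(8), "little")
    image = partition.read(size)
    partition.seek(0)
    partition.write(b"\0" * len(DUMP_MAGIC))
    return image

swap = io.BytesIO(bytes(64))          # stand-in for the swap/dump partition
write_dump(swap, b"panic: example")
saved = recover_dump(swap)            # what startup would move to the filesystem
```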
Joel
--
"You must remember this:
A kiss is just a kiss,
A sigh is just a sigh.
The fundamental rules apply
As time goes by."
Joel Becker
Senior Member of Technical Staff
Oracle Corporation
E-mail: [email protected]
Phone: (650) 506-8127
On Fri, 1 Nov 2002 10:23:01 -0800 (PST), "Shane R. Stixrud"
<[email protected]> wrote:
>
>On Fri, 1 Nov 2002, Patrick Finnegan wrote:
>>
>> I'm sorry it _is_ a public service. Once tens of people started
>> contributing to it, it became one. This is like saying that the
>> Washington Monument belongs to the people that maintain it, any building
>> belongs to the repair crews and janitors. I'm not saying that Linus is
>> necessarily a janitor, but when you consider how much of the Linux kernel
>> that he didn't write, you may realize that it's not just his kernel. It
>> also belongs to every single person that has written even a single
>> line of code in it.
>>
>
>The logic you seem to be missing is, the Washington Monument is a
>physical object. Linus's source tree is a collection of "copied" parts
>from other people's source trees. You obviously see his source copy
>as special, more so than, say, my copy. This is true _ONLY_ because
>Linus's copy commands more respect than yours or mine.
>If you think about it, the respect Linus's copy has is _PURELY_
>the result of his past _choices_ over how he maintains it.
>
>
>In effect you are saying:
>
>Patrick: "Everyone trusts your source tree, I think LKCD
>is SUPER DUPER important and should get the exposure and trust
>that being in your tree commands."
>
>Linus: "I think LKCD is a bad idea, until I am convinced otherwise I
>will not merge it."
>
>Patrick: "You are wrong, LKCD should be in your copy of the kernel source.
>It is your Job Linus, to add things to _your_ copy which others find
>important, what you think is secondary."
>
>
>You cannot have it both ways, either Linus's tree is a dumping
>grounds for all ideas (both good and bad) or it is a place for good
>ideas (good defined by Linus) where people who trust Linus's judgment can
>work from.
>
>In truth you can have it both ways. Take Linus's existing copy, add the
>features you think are important. If your choices prove to be superior,
>you can expect that people (over time) will begin to trust/respect your
>copy more than Linus's.
This also explains why Linus said it was a vendor push situation. If
vendors pick it up, find it useful (as I am sure they will), and tell
Linus about that usage... LKCD will become part of the mainline tree.
I suspect for most vendors, it would be part of their extra cost
"server" package and the Linux/390 package... It clearly has the
potential to enhance service and buyers of server packages need it.
If along the way, significant numbers of "big users" like Purdue adopt
it, use it, and reflect back to L-K the diagnostic successes and fixes
which result, that could speed the decision. If Linus has a tough bug,
installs LKCD, sends the dump to a wizard and gets a fix, that would
definitely speed the decision.
john alvord
On 11/01, Larry McVoy said something like:
> On Fri, Nov 01, 2002 at 11:16:20AM -0500, Patrick Finnegan wrote:
> > On Fri, 1 Nov 2002, Alexander Viro wrote:
> > > It's not a fscking public service. Linus has full control over his
> > > tree. You have equally full control over your tree. Linus can't
> > > tell you what patches to apply in your tree. You can't tell Linus
> > > what patches he should apply to his.
> >
> > I'm sorry it _is_ a public service. Once tens of people started
> > contributing to it, it became one.
>
> Pat, the public service that Linus provides is doing exactly what he does.
> He's acting as a filter. You may or may not agree with the things he
cat name-your.patch | Linus --please-dont-delete-your-inbox-again
--
Shawn Leas
[email protected]
My friend has a baby. I'm recording all the noises he makes so later I can
ask him what he meant.
-- Stephen Wright
On Fri, 1 Nov 2002, Joel Becker wrote:
>
> I always liked the AIX dumper choices. You could either dump to
> the swap area (and startup detects the dump and moves it to the
> filesystem before swapon) or provide a dedicated dump partition. The
> latter was preferred.
> Either of these methods merely requires the dumper to correctly
> write to one disk partition. This is about as simple as you are going
> to get in disk dumping.
Ehh.. That was on closed hardware that was largely designed with and for
the OS.
Alan isn't worried about the "which sector do I write" kind of thing.
That's the trivial part. Alan is worried about the fact that once you know
which sector to write, actually _doing_ so is a really hard thing. You
have bounce buffers, you have exceedingly complex drivers that work
differently in PIO and DMA modes and are more likely than not the _cause_
of a number of problems etc.
And you have a situation where interrupts are not likely to work well
(because you crashed with various locks held), so the regular driver
simply isn't likely to work all that well.
And you have a situation where there are hundreds of different kinds of
device drivers for the disk.
In other words, the AIX situation isn't even _remotely_ comparable. A
large portion of the complexity in the PC stability space is in device
drivers. It's the thing I worry most about for 2.6.x stabilization, by
_far_.
And if you get these things wrong, you're quite likely to stomp on your
disk. Hard. You may be trying to write the swap partition, but if the
driver gets confused, you just overwrote all your important data. At which
point it doesn't matter if your filesystem is journaling or not, since you
just potentially overwrote it.
In other words: it's a huge risk to play with the disk when the system is
already known to be unstable. The disk drivers tend to be one of the main
issues even when everything else is _stable_, for chrissake!
To add insult to injury, you will not be able to actually _test_ any of
the real error paths in real life. Sure, you will be able to test forced
dumps on _your_ hardware, but while that is fine in the AIX model ("we
control the hardware, and charge the user five times what it is worth"),
again that doesn't mean _squat_ in the PC hardware space.
See?
Linus
On 11/01, Shawn said something like:
> > Pat, the public service that Linus provides is doing exactly what he does.
> > He's acting as a filter. You may or may not agree with the things he
>
> cat name-your.patch | Linus --please-dont-delete-your-inbox-again
Maybe "piping" things to Linus is a little rude... :O
--
Shawn Leas
[email protected]
While I was gone, somebody rearranged on the furniture in my
bedroom. They put it in _exactly_ the same place it was.
When I told my roommate, he said: Do I know you?
-- Stephen Wright
On Friday 01 November 2002 11:18 am, Linus Torvalds wrote:
> To add insult to injury, you will not be able to actually _test_ any of
> the real error paths in real life. Sure, you will be able to test forced
> dumps on _your_ hardware, but while that is fine in the AIX model ("we
> control the hardware, and charge the user five times what it is worth"),
> again that doesn't mean _squat_ in the PC hardware space.
On the other hand, ISC's system 5 r3 ran on commodity x86 hardware and the
crash dumper worked on the various disk hardware I had occasion to use it on
(mfm, scsi, ide), although one did need to make sure swap was larger than ram
or bad things would happen. 8-{.
On Fri, 1 Nov 2002, Linus Torvalds wrote:
|>Alan isn't worried about the "which sector do I write" kind of thing.
|>That's the trivial part. Alan is worried about the fact that once you know
|>which sector to write, actually _doing_ so is a really hard thing. You
|>have bounce buffers, you have exceedingly complex drivers that work
|>differently in PIO and DMA modes and are more likely than not the _cause_
|>of a number of problems etc.
[ preamble - this is only a technical discussion, I'm interested
in feedback on what we can improve upon ]
I agree with you. We'd prefer to have a better low-level driver
primitive sitting on top of two low-level disk drivers (IDE and
SCSI). Fundamentally, though, this is difficult to do:
0) There's a lot of early stuff you take risks with, such as the
partition size (assuming you can probe it), knowing that it
hasn't changed since boot, and pre-allocating buffers for disk
I/O operations. You always take the partition risk no matter
what.
1) You have to establish that the IDE or SCSI device can be reset
into an appropriate seek/write mode -- if a DMA operation
to the drive fails, and you can't reset the drive, you may be stuck.
2) Once the hardware reports back success, it is a matter of how
you write the blocks. I once wrote the low-level IDE driver
below request structures, writing sequentially to the drive,
and ran into occasional drive lock-ups while writing during
interrupt crashes. This was more likely due to my inexperience
with the IDE driver than anything else.
|>And you have a situation where interrupts are not likely to work well
|>(because you crashed with various locks held), so the regular driver
|>simply isn't likely to work all that well.
This is simply an avoidance of certain code paths. We saw this
problem earlier in 2.2 using kiobufs and got around it for the
most part by doing our best to avoid the io_request_lock. That's
why we haven't seen the lock contention problems for 2.5.
|>And you have a situation where there are hundreds of different kinds of
|>device drivers for the disk.
This is the biggest problem, absolutely. Our idea moving forward
was to create a _dump() primitive with drivers that allows you to
determine, upon configuration of a disk dump device, whether
the low-level driver supports dumping. I suggested this
to Al Viro a long time ago on this list, but it didn't go anywhere.
That way the driver itself knows that it can support a low-level
page-write method. If it doesn't, you can't use disk dumping to
that device.
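As a rough model of that idea, the sketch below checks for a dump primitive when the dump device is configured, so an unsupported driver is refused up front rather than discovered at crash time. It is written in Python purely for illustration. `DiskDriver`, `dump_page`, `configure_dump_device` and the driver names are hypothetical and do not correspond to any kernel or LKCD interface.

```python
class DiskDriver:
    """Hypothetical low-level driver; dump_page is None when unsupported."""
    def __init__(self, name, dump_page=None):
        self.name = name
        self.dump_page = dump_page

def configure_dump_device(driver):
    """Accept a driver as dump target only if it offers a dump primitive."""
    if driver.dump_page is None:
        raise ValueError(f"{driver.name}: driver has no dump primitive")
    return driver

written = []

def polled_write(sector, page):
    """Stand-in for a driver's simple page-write method."""
    written.append((sector, page))

supported = DiskDriver("driver-with-dump", dump_page=polled_write)
unsupported = DiskDriver("driver-without-dump")

target = configure_dump_device(supported)
target.dump_page(0, b"first page")
```

The design choice being illustrated is only the early capability check: the driver itself declares whether it can do a low-level page write, and configuration fails for any driver that cannot.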
I'm willing to re-open this effort.
|>And if you get these things wrong, you're quite likely to stomp on your
|>disk. Hard. You may be trying to write the swap partition, but if the
|>driver gets confused, you just overwrote all your important data. At which
|>point it doesn't matter if your filesystem is journaling or not, since you
|>just potentially overwrote it.
We haven't seen this before, but it is always a possibility for any
dump scenario. That's why some choose netdump instead. :)
|>In other words: it's a huge risk to play with the disk when the system is
|>already known to be unstable. The disk drivers tend to be one of the main
|>issues even when everything else is _stable_, for chrissake!
|>
|>To add insult to injury, you will not be able to actually _test_ any of
|>the real error paths in real life. Sure, you will be able to test forced
|>dumps on _your_ hardware, but while that is fine in the AIX model ("we
|>control the hardware, and charge the user five times what it is worth"),
|>again that doesn't mean _squat_ in the PC hardware space.
We have actually done a lot of testing with injection of failures
into the middle of VM, network drivers, etc., in conjunction with
disk dumping. Certainly it doesn't cover all the cases, but nothing
ever will.
|> Linus
--Matt
One question I have is how much of the driver problem you refer to is
because of optimizations that the various drivers have. Could you fall
back to the simplest, works-with-everything,
all-timeouts-longer-than-the-slowest-disk slug of a driver that could be
used to do this dump?
David Lang
On Fri, 1 Nov 2002, Linus Torvalds wrote:
> Alan isn't worried about the "which sector do I write" kind of thing.
> That's the trivial part. Alan is worried about the fact that once you know
> which sector to write, actually _doing_ so is a really hard thing. You
> have bounce buffers, you have exceedingly complex drivers that work
> differently in PIO and DMA modes and are more likely than not the _cause_
> of a number of problems etc.
>
> And you have a situation where interrupts are not likely to work well
> (because you crashed with various locks held), so the regular driver
> simply isn't likely to work all that well.
>
> And you have a situation where there are hundreds of different kinds of
> device drivers for the disk.
>
> In other words, the AIX situation isn't even _remotely_ comparable. A
> large portion of the complexity in the PC stability space is in device
> drivers. It's the thing I worry most about for 2.6.x stabilization, by
> _far_.
>
> And if you get these things wrong, you're quite likely to stomp on your
> disk. Hard. You may be trying to write the swap partition, but if the
> driver gets confused, you just overwrote all your important data. At which
> point it doesn't matter if your filesystem is journaling or not, since you
> just potentially overwrote it.
>
> In other words: it's a huge risk to play with the disk when the system is
> already known to be unstable. The disk drivers tend to be one of the main
> issues even when everything else is _stable_, for chrissake!
>
> To add insult to injury, you will not be able to actually _test_ any of
> the real error paths in real life. Sure, you will be able to test forced
> dumps on _your_ hardware, but while that is fine in the AIX model ("we
> control the hardware, and charge the user five times what it is worth"),
> again that doesn't mean _squat_ in the PC hardware space.
>
> See?
>
> Linus
On Fri, 1 Nov 2002, Linus Torvalds wrote:
> On Fri, 1 Nov 2002, Joel Becker wrote:
> >
> > I always liked the AIX dumper choices. You could either dump to
> > the swap area (and startup detects the dump and moves it to the
> > filesystem before swapon) or provide a dedicated dump partition. The
> > latter was preferred.
>
> Ehh.. That was on closed hardware that was largely designed with and for
> the OS.
>...
> In other words: it's a huge risk to play with the disk when the system is
> already known to be unstable. The disk drivers tend to be one of the main
> issues even when everything else is _stable_, for chrissake!
>
> To add insult to injury, you will not be able to actually _test_ any of
> the real error paths in real life. Sure, you will be able to test forced
> dumps on _your_ hardware, but while that is fine in the AIX model ("we
> control the hardware, and charge the user five times what it is worth"),
> again that doesn't mean _squat_ in the PC hardware space.
I dealt with crash dumps quite a lot over 10 years with SCO UNIX,
OpenServer and UnixWare, which were addressing the PC market, not
their own hardware.
It's a real worry that writing a crash dump to disk might stomp in the
wrong place, but I don't recall it ever happening in practice. But
occasionally, yes, a dump was not generated at all, or not completed.
Of course, you could argue that SCO's disk drivers were more stable :-)
which might or might not be a compliment to them.
Hugh
Linus Torvalds <[email protected]> :
[...]
> Maybe not. The only numbers I have is the slowness of PCI.
Issue 'openssl speed' and wait for more numbers.
Short-lived hybrid sessions kill (not that this or any of the current
reasons for asynchronous crypto really matters imho).
Instant benchmark:
                   sign    verify    sign/s verify/s
rsa 1024 bits   0.0148s   0.0008s     67.7   1198.6   (PIV 2GHz)
rsa 1024 bits   0.0478s   0.0026s     20.9    381.6   (PII 350MHz)
The 'numbers' are in 1000s of bytes per second processed.
type              8 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
des ede3         3930.00k     4027.43k     4032.30k     4002.19k     3973.12k   (PIV)
des ede3         1058.51k     1061.25k     1090.70k     1097.44k     1091.36k   (PII)
blowfish is ~10x faster btw.
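For readers who want the figures above in the units used elsewhere in the thread, here is a small Python sketch. It uses only numbers from the benchmark (the 8192-byte des ede3 column and the RSA sign/s column) plus Linus's 25MB/s lower guess for real-life PCI throughput. The dictionary and variable names are assumptions for illustration, and "k" is taken as 1000 bytes, as the benchmark states.

```python
# 3DES throughput at 8192-byte blocks, in 1000s of bytes per second
des3_kbytes_per_s = {"PIV": 3973.12, "PII": 1091.36}
# RSA 1024-bit signatures per second
rsa_sign_per_s = {"PIV": 67.7, "PII": 20.9}

def kbytes_to_mbytes(kbytes_per_s):
    """Convert 1000s of bytes/s to MB/s (decimal units)."""
    return kbytes_per_s / 1000

piv_des3_mb = kbytes_to_mbytes(des3_kbytes_per_s["PIV"])
pii_des3_mb = kbytes_to_mbytes(des3_kbytes_per_s["PII"])
des3_speedup = des3_kbytes_per_s["PIV"] / des3_kbytes_per_s["PII"]
rsa_speedup = rsa_sign_per_s["PIV"] / rsa_sign_per_s["PII"]
pci_low_guess_mb = 25.0  # lower end of Linus's real-life PCI estimate

print(f"3DES: PIV {piv_des3_mb:.2f} MB/s, PII {pii_des3_mb:.2f} MB/s")
print(f"PIV vs PII: 3DES x{des3_speedup:.1f}, RSA sign x{rsa_speedup:.1f}")
print("3DES on the PIV is below the PCI guess:", piv_des3_mb < pci_low_guess_mb)
```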
--
Ueimor
In message <[email protected]> you write:
> On Thu, 2002-10-31 at 21:02, Jeff Garzik wrote:
> > hosed/screaming, and various mid-layers are dying. For LKCD to be of
> > any use, it needs to _skip_ the block layer and talk directly to
> > low-level drivers.
>
> Rusty wrote a polled IDE driver that should handle some subset of that
Yes, patch has bitrotted but updating should be trivial. There's
enough there that you get the idea though: frankly, it's noninvasive
enough for entry during the 2.6.x series, so it's been down on my
list:
http://www.kernel.org/pub/linux/kernel/people/rusty/patches/Misc/oopser.patch.gz
I'd love someone to take this for a spin and tweak it up...
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
[ Cc: trimmed ]
David Lang wrote:
> One question I have is how much of the driver problem you refer to is
> becouse of optimizations that the various drivers have, could you fall
> back to the simplest, works-with-everything,
> all-timeouts-longer-then-the-slowest-disk slug of a driver that could be
> used to do this dump?
Welcome to the wonderful world of code duplication. And don't forget
the "simplified" TCP/IP stack for network dumps. Uh, USB-attached
storage, anyone ? :-)
Special-case dump drivers make perfect sense in isolated cases (e.g.
narrowly specified boxes) or as a band-aid solution.
But for a general solution, it seems more appropriate to me to solve
the problem of moving the kernel data from the damaged system to an
intact system only once, e.g. using the MCORE approach, than over
and over again for all possible types of hardware and attachment
methods.
The only inherent weakness I see in MCORE is the need to reliably
reset a device, either to the point where it is operational (if
used in the process of dumping), or at least to the point where it
doesn't get in the way (if not used for the dump, e.g. video, HID,
etc.).
But this should still be significantly easier than introducing
"dumb" versions for all drivers. Besides, having a way for cleanly
shutting down or resetting devices is desirable in other contexts,
too (e.g. kexec).
- Werner (disclaimer: not affiliated with Mission Critical Linux,
any vendor, or any other form of gainful employment)
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/
Werner Almesberger wrote:
> But for a general solution, it seems more appropriate to me to solve
> the problem of moving the kernel data from the damaged system to an
> intact system only once, e.g. using the MCORE approach, than over
> and over again for all possible types of hardware and attachment
> methods.
This is just a random tangential thought here, but FWIW:
Why not just have a simple backup stripped-down "hardened" copy of Linux
lying around in a physical RAM region not used by the copy of Linux
actually running. Granted the running Linux doesn't do random physical
accesses when dying, the crash handler could then just boot that
secondary Linux which would then have a RAM disk containing the
appropriate scripts and binaries to handle the actual crash. Given the
cost of RAM these days, reserving a MB or two for this purpose should
probably not be that bad.
Karim
===================================================
Karim Yaghmour
[email protected]
Embedded and Real-Time Linux Expert
===================================================
Karim Yaghmour wrote:
> Why not just have a simple backup stripped-down "hardened" copy of Linux
> lying around in a physical RAM region not used by the copy of Linux
> actually running.
Congratulations, you've just re-invented MCORE :-) That's exactly
what they do on systems where rebooting through the firmware
doesn't preserve RAM.
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/
Werner Almesberger wrote:
> Karim Yaghmour wrote:
> > Why not just have a simple backup stripped-down "hardened" copy of Linux
> > lying around in a physical RAM region not used by the copy of Linux
> > actually running.
>
> Congratulations, you've just re-invented MCORE :-) That's exactly
> what they do on systems where rebooting through the firmware
> doesn't preserve RAM.
Oh well, can't have a freshmeat db in my head I guess ;) That said,
I like this approach since you don't need to care about new drivers
and so on ... but since it's already out there I guess its
advantages have been covered elsewhere ...
Karim
===================================================
Karim Yaghmour
[email protected]
Embedded and Real-Time Linux Expert
===================================================
On Thu, 2002-10-31 at 10:15, Andrew Morton wrote:
> (Disclaimer: I've never used lkcd. I'm assuming that it's
> possible to gdb around in a dump)
I updated Dave Anderson's (Mission Critical) crash code to work
with LKCD core dumps when I updated LKCD to support the ia64.
Dave's crash code uses gdb as a command interpreter. It's not quite
as flexible as using gdb macros on core dumps but it's very close
and has lots of support for various kernel structures. For example,
you can't just have ddd walk through data structures by simply
clicking on pointers in data structures like you normally can.
>
> > In particular when it comes to this project, I'm told about
> > "netdump", which doesn't try to dump to a disk, but over the net.
>
> It could help. But like serial console, the random person whose
> kernel just died often can't be bothered setting it up, or simply
> doesn't have the gear, or the crash is not repeatable.
Yes, ideally I'd like to have an integration between live gdb stub
debugging and crash debugging. I'd like to even be able to use ddd/gdb
on a core file and simulate execution. When using gdb on the kernel
I've found it nice to move the cursor over the PC and move it to the
end of panic(). Then single step back out of panic and re-execute
the code that returned the error code that caused us to decide to panic.
Doing this in asm language with an asm debugger is too difficult for
most folks.
I really liked HP's kwdb approach. kwdb has a tiny TCP/IP stack and
has direct hooks into the trap vectors like a normal kgdb stub. The
nice thing is you can attach to a crash system over the internet
from anywhere in the world to debug the panic. I wasn't able to get
HP to release the kwdb gdb stub into the public domain. The gdb hacks
are available at:
http://h21007.www2.hp.com/dspp/tech/tech_TechSoftwareDetailPage_IDX/1,1703,257,00.html
but are based on a very old version of gdb and ia64 libraries.
> So. _If_ lkcd gives me gdb-able images from time-of-crash, I'd
> like it please. And I'm the grunt who spent nearly two years
> doing not much else apart from working 2.3/2.4 oops reports.
You can snarf a copy from:
ftp://people.redhat.com/anderson
One area that I'm not sure of is if the lkcd kernel changes are a
problem with the kgdb patch (http://kgdb.sourceforge.net/). Perhaps
I can check into that in the near future.
I'd prefer to have both kgdb (http://kgdb.sourceforge.net/)
remote debugging and kgdb crash support available in stock kernels
like the BSD kernels (NetBSD, FreeBSD). I don't know why the kgdb
stub wasn't integrated into the kernel for the ia32 and ia64 platforms.
I suppose for reasons like we are hearing now on the LKCD kernel hooks.
The current LKCD code is at least a step in that direction.
--
piet@http://www.piet.net
On Fri, 1 Nov 2002, Craig I. Hagan wrote:
> If it becomes apparent through empirical data that crash dumps are a useful
> tool, I'm sure that Linus will become far more amenable. Until then, let's let
> him handle all of his other work which needs to get done.
Since he doesn't have the problem he will ignore the proof. Better be sure
we can generate ksymoops reports from the dump, so we can post them asking
for help. Anything else will get the old "I don't use that tool, can't
help." Or like Nvidia problems the "try it without the crash dump code,"
routine.
--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
On 1 Nov 2002, Alan Cox wrote:
> On Fri, 2002-11-01 at 06:36, Linus Torvalds wrote:
> > This never works. Be honest. Nobody takes out features, they are stuck
> > once they get in.
>
> Linus I've asked a couple of times about killing sound/oss off now ALSA
> is integrated 8) While you are on the rant how about that ;)
Good point, that continues to disprove the theory that having one thing in
the kernel prevents development of a similar feature.
--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
On Fri, 1 Nov 2002, Steven King wrote:
> On Friday 01 November 2002 11:18 am, Linus Torvalds wrote:
>
> > To add insult to injury, you will not be able to actually _test_ any of
> > the real error paths in real life. Sure, you will be able to test forced
> > dumps on _your_ hardware, but while that is fine in the AIX model ("we
> > control the hardware, and charge the user five times what it is worth"),
> > again that doesn't mean _squat_ in the PC hardware space.
>
> On the other hand, ISC's system 5 r3 ran on commodity x86 hardware and the
> crash dumper worked on the various disk hardware I had occasion to use it on
> (mfm, scsi, ide), although one did need to make sure swap was larger than ram
> or bad things would happen. 8-{.
The thing is that Solaris, AIX, and ISC are written by commercial
companies, they realize that customers need to be able to debug systems
which don't have a screen, a serial printer, etc. They do have disk.
I was hoping Alan would push Redhat to put this in their Linux so we
could resolve some of the ongoing problems which don't write an oops to a
log, but I guess none of the developers has to actually support production
servers and find out why they crash.
--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
On Sat, 2 Nov 2002, Bill Davidsen wrote:
> The thing is that Solaris, AIX, and ISC are written by commercial
> companies, they realize that customers need to be able to debug systems
> which don't have a screen, a serial printer, etc. They do have disk.
>
> I was hoping Alan would push Redhat to put this in their Linux so we
> could resolve some of the ongoing problems which don't write an oops to a
> log, but I guess none of the developers has to actually support production
> servers and find out why they crash.
Perhaps i'm being grossly naive here, but do none of these presumably x86
production servers have a serial port? Not even PCI/ISA slots to
add one? Serial would catch most of your oopsen anyway, and if you were
borked enough that serial couldn't get the entire output, i somehow doubt
dumping to disk could manage. And no i don't see anything wrong nor
consider it studly to use oopses only for debugging...
Zwane
--
function.linuxpower.ca
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On Fri, 1 Nov 2002 13:01, Matt D. Robinson wrote:
<snip>
> Uh ... have you read the patches? Do you see how few the
> changes are to non-dump code? Do you know that most of those
> changes only get triggered in a crash situation anyway?
I applied the patches, and reported some issues.
http://marc.theaimsgroup.com/?l=linux-kernel&m=103520434201014&w=2
I see no signs that any of them have been addressed, although I haven't tried
a really recent set.
> Breakage occurs when people change code areas that are used
> all the time, like VM, network, block layer, etc.
Actually, this is the area that Linux is best at. If you break it, some poor
sod will hit the problem, and you'll know really soon.
> Look at the patches and tell me where we are causing overhead
> and serious potential breakage. If you find problems,
> then tell us, don't just comment on breakage scenarios.
I'm a fairly typical user - I just have a couple of desktop machines and a
server/firewall.
I don't have 700 nodes in a cluster, and when my machines break, it's normally
something I did. Sometimes the desktop locks up (say every second month,
unless I'm dicking with the kernel), but I reboot and everything is happy.
LKCD doesn't really seem to do anything for me - it wouldn't really worry me
if it went in (since I don't have to maintain it - it isn't near any of my
code), but I'd really prefer that having the _CONFIG option set to N didn't
make the kernel any bigger, or change any code paths.
Is this unreasonable?
Brad
BTW: I admit that I'd be pretty pissed if Linus said that my code was
"stupid", but life isn't reasonable or fair. Take a few days off LKCD, go for
a few walks, and worry about how to get it integrated after that.
- --
http://linux.conf.au. 22-25Jan2003. Perth, Aust. I'm registered. Are you?
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: For info see http://www.gnupg.org
iD8DBQE9w6rCW6pHgIdAuOMRAlI5AJ48ELVdExIeCr5C5HtDpU5+1ZnuBQCdEji0
t4q2NjZQVGEumrz6b+CqEEs=
=xtYY
-----END PGP SIGNATURE-----
Hi!
> > I'm a user, and I request that LKCD get merged into the kernel. :-)
> > Do you feel like donating a 700-port console server? Right, so it's LKCD
> > for me then.
>
> Wouldn't you rather they neatly tftp'd dumps to a nominated central
> server which noticed the arrival, did the initial processing with a perl
> script and mailed you a summary ?
Out of interest, what does such "initial processing" look like?
Of course I'd like perl script to tell me
"hey, at vicam.c:715 you are freeing memory that is still in use by
usb.c; that crashed your machine 5 times during last week",
but I guess your perl scripts can't do that, right?
Pavel
--
When do you have heart between your knees?
On Fri, Nov 01, 2002 at 11:25:04PM +0100, Pavel Machek wrote:
> > Wouldn't you rather they neatly tftp'd dumps to a nominated central
> > server which noticed the arrival, did the initial processing with a perl
> > script and mailed you a summary ?
>
> Out of interest, what does such "initial processing" look like?
Toss an email to root and the operations staff including the name of the
machine that crashed and the output of lcrash's "report" command, as well
as the location of the dumps (ie, where they were saved on the machine that
died and where they are on an optional netdump server).
--
Mike Shuey
On Sat, 2002-11-02 at 05:17, Bill Davidsen wrote:
> I was hoping Alan would push Redhat to put this in their Linux so we
> could resolve some of the ongoing problems which don't write an oops to a
> log, but I guess none of the developers has to actually support production
> servers and find out why they crash.
I think several Red Hat people would disagree very strongly. Red Hat
shipped with the kernel symbol decoding oops reporter for a good reason,
and also acquired netdump for a good reason.
On Sat, 2002-11-02 at 05:00, Bill Davidsen wrote:
> > Linus I've asked a couple of times about killing sound/oss off now ALSA
> > is integrated 8) While you are on the rant how about that ;)
>
> Good point, that continues to disprove the theory that having one thing in
> the kernel prevents development of a similar feature.
It's preventing testing and it's making parallel fixing hard to manage.
I'd really like to kill off the OSS drivers to make sure the ALSA ones
are tested and anything only in OSS does get ported over.
[email protected] (Matt D. Robinson) wrote on 01.11.02 in <[email protected]>:
> On Fri, 1 Nov 2002, Linus Torvalds wrote:
> |>And if you get these things wrong, you're quite likely to stomp on your
> |>disk. Hard. You may be trying to write the swap partition, but if the
> |>driver gets confused, you just overwrote all your important data. At which
> |>point it doesn't matter if your filesystem is journaling or not, since you
> |>just potentially overwrote it.
>
> We haven't seen this before, but it is always a possibility for any
> dump scenario. That's why some choose netdump instead. :)
*If* you want safe dumping to a partition, it seems wrong to me to try to
figure that out after the crash.
Instead,
* configure the crash space with a user-mode app or possibly a kernel
command line arg
* Whenever repartitioning, check if the crash dump partition is affected,
and if so, clear it until it is explicitly reconfigured
* Save a good checksum (say, md5 or sha1) of the crash partition config,
and only dump if that checksum checks out
You might want to checksum even more than that, of course :-)
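
The three steps above can be sketched in user-space C. This is only an illustration under stated assumptions: the struct and function names (dump_target, dump_configure, dump_on_repartition, dump_target_valid) are hypothetical and not taken from LKCD, and a 32-bit FNV-1a hash stands in for the md5 or sha1 suggested in the message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of the configured dump area. */
struct dump_target {
    uint32_t disk_id;       /* which disk holds the dump area */
    uint64_t start_sector;  /* first sector of the dump area */
    uint64_t nr_sectors;    /* length of the dump area */
};

/* Target plus the checksum recorded when it was configured. */
struct dump_config {
    struct dump_target target;
    uint32_t checksum;
    int configured;         /* 0 until explicitly configured */
};

/* FNV-1a over the 8 bytes of v; a stand-in for md5/sha1. */
static uint32_t fnv1a_mix(uint32_t h, uint64_t v)
{
    for (int i = 0; i < 8; i++) {
        h ^= (uint32_t)(v & 0xff);
        h *= 16777619u;
        v >>= 8;
    }
    return h;
}

static uint32_t dump_target_checksum(const struct dump_target *t)
{
    uint32_t h = 2166136261u;
    h = fnv1a_mix(h, t->disk_id);
    h = fnv1a_mix(h, t->start_sector);
    h = fnv1a_mix(h, t->nr_sectors);
    return h;
}

/* Step 1: explicit configuration, done long before any crash. */
static void dump_configure(struct dump_config *c, const struct dump_target *t)
{
    assert(c != NULL && t != NULL);
    c->target = *t;
    c->checksum = dump_target_checksum(t);
    c->configured = 1;
}

/* Step 2: a repartition that overlaps the dump area clears the config. */
static void dump_on_repartition(struct dump_config *c, uint32_t disk_id,
                                uint64_t start, uint64_t nr)
{
    uint64_t cfg_end = c->target.start_sector + c->target.nr_sectors;

    if (!c->configured || c->target.disk_id != disk_id)
        return;
    if (start < cfg_end && c->target.start_sector < start + nr) {
        c->configured = 0;
        c->checksum = 0;
    }
}

/* Step 3: at crash time, dump only if the stored checksum still matches. */
static int dump_target_valid(const struct dump_config *c)
{
    return c->configured &&
           c->checksum == dump_target_checksum(&c->target);
}
```

The point of the design is that the crash path only has to recompute one checksum and compare; all the decisions about where it is safe to write were made while the system was still healthy.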
But there's certainly a reason Netware liked to crash dump to a series of
floppies - too bad those are much too small for today's machines. When
floppy sizes stopped being slightly larger than standard RAM sizes[*], the
computing public lost big time, and we haven't recovered from that.
[*] Apple ][+: 48 KB RAM, 140 KB floppy. IBM PC: 640 KB RAM, 1.2 MB
floppy. (Yes, I know there were other combinations as well.) Where's my
approximately-1-GB floppy that everyone and their aunt have installed
today? No, CD writers are *not* universal. And burn-once CDs aren't much
like floppies.
Of course, the same problem exists with general backup technology - tape
the size of modern disks is not really affordable anymore.
MfG Kai
Then why do we need 'non-repudiation' w/r/t certificates? Isn't the
idea to provide a way to isolate "bugs" in the "security" system. If
something is written to a file by the group signon, who wrote it?
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of
> Alexander Viro
> Sent: Wednesday, October 30, 2002 11:43 PM
> Then give them all the same account and be done with that. Effect will
> be the same.
Em Sat, Nov 02, 2002 at 12:00:18AM -0500, Bill Davidsen escreveu:
> On 1 Nov 2002, Alan Cox wrote:
>
> > On Fri, 2002-11-01 at 06:36, Linus Torvalds wrote:
> > > This never works. Be honest. Nobody takes out features, they are stuck
> > > once they get in.
> >
> > Linus I've asked a couple of times about killing sound/oss off now ALSA
> > is integrated 8) While you are on the rant how about that ;)
>
> Good point, that continues to disprove the theory that having one thing in
> the kernel prevents development of a similar feature.
SPX was also removed (hey, it never worked anyway) and probably econet and
ATM will be removed as well if nobody jumps to fix it (I mean net/atm, not
drivers/atm, but I'm not sure the latter will be useful without the former).
- Arnaldo
On Fri, 1 Nov 2002, Hugh Dickins wrote:
> I dealt with crash dumps quite a lot over 10 years with SCO UNIX,
> OpenServer and UnixWare: which were addressing the PC market, not
> own hardware.
>
> It's a real worry that writing a crash dump to disk might stomp in the
> wrong place, but I don't recall it ever happening in practice. But
> occasionally, yes, a dump was not generated at all, or not completed.
IIRC, some years ago wuarchive.wustl.edu went down for a few days because the
machine panicked and dumped to the wrong partition...
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
On Sat, 2 Nov 2002, Brad Hards wrote:
|>I applied the patches, and reported some issues.
|>http://marc.theaimsgroup.com/?l=linux-kernel&m=103520434201014&w=2
|>I see no signs that any of them have been addressed, although I haven't tried
|>a really recent set.
We did put your fixes in, if they don't work, let me know.
|>LKCD doesn't really seem to do anything for me - it wouldn't really worry me
|>if it went in (since I don't have to maintain it - it isn't near any of my
|>code), but I'd really prefer that having the _CONFIG option set to N didn't
|>make the kernel any bigger, or change any code paths.
|>
|>Is this unreasonable?
Absolutely not. I would expect most people to not use it, and I
would hope that most distributions would build it as a module but
not turn it on (unless they really wanted it on by default).
|>Brad
|>
|>BTW: I admit that I'd be pretty pissed if Linus said that my code was
|>"stupid", but life isn't reasonable or fair. Take a few days off LKCD, go for
|>a few walks, and worry about how to get it integrated after that.
It's neither here nor there anymore. I think if companies like
Red Hat don't want it turned on, that's fine, but they should at
least allow their customers to have it available to them for
use, if that's what they want.
Of course, I'm not going to go through all the reasons why there's
a major disconnect between Linux distributions and hardware vendors,
but suffice it to say that's the root of the problem here.
--Matt
Em Sat, Nov 02, 2002 at 08:19:17PM +0100, [email protected] escreveu:
> > SPX was also removed (hey, it never worked anyway) and probably econet and
> > ATM will be removed as well if nobody jumps to fix it (I mean net/atm, not
> > drivers/atm, but I'm not sure the latter will be useful without the former).
>
> What's the deadline ?
Plan was for 2.6.0
- Arnaldo
Em Sat, Nov 02, 2002 at 08:32:23PM +0100, [email protected] escreveu:
> Arnaldo Carvalho de Melo <[email protected]> :
> > Em Sat, Nov 02, 2002 at 08:19:17PM +0100, [email protected] escreveu:
> [...]
> > > What's the deadline ?
> >
> > Plan was for 2.6.0
>
> :o)
> Is there a lower bound for its estimated arrival date ?
:-) I think that if you state that you plan to work on it RSN we can forget
about removing it for now.
- Arnaldo
Arnaldo Carvalho de Melo <[email protected]> :
> Em Sat, Nov 02, 2002 at 08:19:17PM +0100, [email protected] escreveu:
[...]
> > What's the deadline ?
>
> Plan was for 2.6.0
:o)
Is there a lower bound for its estimated arrival date ?
--
Ueimor
[Cc: changed]
Arnaldo Carvalho de Melo <[email protected]> :
> Em Sat, Nov 02, 2002 at 12:00:18AM -0500, Bill Davidsen escreveu:
[...]
> > Good point, that continues to disprove the theory that having one thing in
> > the kernel prevents development of a similar feature.
>
> SPX was also removed (hey, it never worked anyway) and probably econet and
> ATM will be removed as well if nobody jumps to fix it (I mean net/atm, not
> drivers/atm, but I'm not sure the latter will be useful without the former).
What's the deadline ?
--
Ueimor
On Sat, 2002-11-02 at 18:55, Arnaldo Carvalho de Melo wrote:
> SPX was also removed (hey, it never worked anyway) and probably econet and
> ATM will be removed as well if nobody jumps to fix it (I mean net/atm, not
> drivers/atm, but I'm not sure the latter will be useful without the former).
ATM is actively used by large numbers of people [1]. It's in the fix
rather than remove category. Econet should be trivial and might as well
just be marked CONFIG_OBSOLETE until someone does deal with it.
Alan
[1] PPPoATM is used for a large number of DSL connections
Em Sat, Nov 02, 2002 at 08:31:29PM +0000, Alan Cox escreveu:
> On Sat, 2002-11-02 at 18:55, Arnaldo Carvalho de Melo wrote:
> > SPX was also removed (hey, it never worked anyway) and probably econet and
> > ATM will be removed as well if nobody jumps to fix it (I mean net/atm, not
> > drivers/atm, but I'm not sure the latter will be useful without the former).
> ATM is actively used by large numbers of people [1]. It's in the fix
> rather than remove category. Econet should be trivial and might as well
> just be marked CONFIG_OBSOLETE until someone does deal with it.
Oh, cool, way more motivation to fix that stuff 8)
Arnaldo Carvalho de Melo <[email protected]> :
[...]
> :-) I think that if you state that you plan to work on it RSN we can forget
> about removing it for now.
$*@#&%@ !
Will have to set up a burnproof testbed for ATM then.
--
Ueimor
On Sat, Nov 02, 2002 at 09:35:17AM -0800, LA Walsh wrote:
> Then why do we need 'non-repudiation' w/r/t certificates?
we dont
--cw
"Matt D. Robinson" <[email protected]> dijo:
[...]
> This isn't bloat. If you want, it can be built as a module, and
> not as part of your kernel. How can that be bloat? People who
> build kernels can optionally build it in, but we're not asking
> that it be turned on by default, rather, built as a module so
> people can load it if they want to. We made it into a module
> because 18 months ago you complained about it being bloat. We
> addressed your concerns.
Bloat is not just RAM/CPU/... usage when in use, it is much more about
developers who have to understand, work with, and consider how to use or
interface with the new code. Even more so when it is not builtin, as this
creates _two_ scenarios to consider.
This is the sense of "bloat" that Linus is most worried about (and very
rightly so, IMVHO). At least that is my observation over the years.
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513
I'm not sure I understand your point, Horst. There are four
primary mechanisms which would invoke a dump:
die() (or die_if_kernel())
panic()
interrupt-driven dumps
sysrq()
Assuming you call these functions, there is a single dump()
call that will perform dumping, provided the dump_function_ptr
(which is assigned when the dump module is loaded) is set.
dump() is a simple function that basically says:
    static inline void dump(char * str, struct pt_regs * regs)
    {
        if (dump_function_ptr) {
            dump_function_ptr((char *)str, regs);
        }
    }
str is for the panic() string, and regs are so you can create
a proper stack trace for the failing task on the correct CPU.
I don't see how that can be attributed to bloating the kernel.
If you don't panic(), the code is never invoked. If you don't load
the dump module, dump_function_ptr isn't assigned. It's meant
to be non-invasive, off to the side and called when required
(or requested).
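
The hook described above can be modelled outside the kernel to show the "does nothing until a module registers" behaviour. This is a sketch under assumptions: dump_register(), dump_unregister() and the demo handler are hypothetical names introduced here, struct pt_regs is left opaque, and only dump() and dump_function_ptr come from the message itself.

```c
#include <assert.h>
#include <stddef.h>

struct pt_regs;  /* opaque here; the kernel defines the real layout */

typedef void (*dump_fn_t)(const char *str, struct pt_regs *regs);

/* NULL until a dump module is loaded. */
static dump_fn_t dump_function_ptr;

/* Hypothetical: what a dump module would do at load time. */
static int dump_register(dump_fn_t fn)
{
    assert(fn != NULL);
    if (dump_function_ptr)
        return -1;          /* another dump handler is already active */
    dump_function_ptr = fn;
    return 0;
}

/* Hypothetical: what a dump module would do at unload time. */
static void dump_unregister(void)
{
    dump_function_ptr = NULL;
}

/* The hook called from die(), panic(), sysrq and so on. */
static inline void dump(const char *str, struct pt_regs *regs)
{
    if (dump_function_ptr)
        dump_function_ptr(str, regs);
}

/* Stand-in for a loaded dump module: it only counts invocations. */
static int demo_dump_calls;

static void demo_dump_handler(const char *str, struct pt_regs *regs)
{
    (void)str;
    (void)regs;
    demo_dump_calls++;
}
```

With no handler registered, the cost on the panic path is a single pointer test, which is the non-invasiveness argument being made here.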
There is some additional code put in the kernel to disable
interrupts, quiesce the system, and I think there are a few projects
that can probably use the same code base (such as the suspend-to-ram
project, which I was just informed about). All of that is called
within the dump driver itself, otherwise it sits quietly off to
the side, never getting called.
Using the dump driver infrastructure is like writing any plain-jane
driver. You set up the _open(), _close(), etc., functions,
assigning the ops table based on the dump method you want to use
(disk, network, mini-oopser, etc.) This isn't that difficult,
and it should only be loaded for those customer systems that want
a specific dump style.
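
As a rough illustration of the ops-table idea, the sketch below selects a table of function pointers by dump method name. Everything in it is an assumption made for the example: struct dump_ops, dump_select(), dump_run() and the stub back ends are hypothetical and are not the actual LKCD interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-method operations table. */
struct dump_ops {
    const char *name;
    int (*open)(void);
    int (*write)(const void *buf, size_t len);
    int (*close)(void);
};

/* Bytes "written" by each stub back end. */
static size_t disk_bytes, net_bytes;

static int stub_open(void)  { return 0; }
static int stub_close(void) { return 0; }

static int disk_write(const void *buf, size_t len)
{
    (void)buf;
    disk_bytes += len;
    return 0;
}

static int net_write(const void *buf, size_t len)
{
    (void)buf;
    net_bytes += len;
    return 0;
}

static const struct dump_ops dump_methods[] = {
    { "disk",    stub_open, disk_write, stub_close },
    { "network", stub_open, net_write,  stub_close },
};

/* Pick the ops table for the requested dump method, or NULL if unknown. */
static const struct dump_ops *dump_select(const char *method)
{
    assert(method != NULL);
    for (size_t i = 0; i < sizeof(dump_methods) / sizeof(dump_methods[0]); i++)
        if (strcmp(dump_methods[i].name, method) == 0)
            return &dump_methods[i];
    return NULL;
}

/* Drive one dump through whichever back end was selected. */
static int dump_run(const struct dump_ops *ops, const void *buf, size_t len)
{
    int err;

    if (!ops)
        return -1;
    err = ops->open();
    if (err)
        return err;
    err = ops->write(buf, len);
    ops->close();           /* always close once open succeeded */
    return err;
}
```

The caller never needs to know which back end is in use, so adding another dump style means adding one more table entry rather than touching the panic path.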
--Matt
Standard disclaimer: I'm not trying anymore to get this into the
kernel at this time (via Linus). This is purely for educating
those that aren't familiar with crash dumping for Linux.
On Sat, 2 Nov 2002, Horst von Brand wrote:
|>"Matt D. Robinson" <[email protected]> dijo:
|>
|>[...]
|>
|>> This isn't bloat. If you want, it can be built as a module, and
|>> not as part of your kernel. How can that be bloat? People who
|>> build kernels can optionally build it in, but we're not asking
|>> that it be turned on by default, rather, built as a module so
|>> people can load it if they want to. We made it into a module
|>> because 18 months ago you complained about it being bloat. We
|>> addressed your concerns.
|>
|>Bloat is not just RAM/CPU/... usage when in use, it is much more about
|>developers who have to understand, work with, and consider how to use or
|>interface with the new code. Even more so when it is not builtin, as this
|>creates _two_ scenarios to consider.
|>
|>This is the sense of "bloat" that Linus is most worried about (and very
|>rightly so, IMVHO). At least that is my observation over the years.
|>
--
On 2 Nov 2002, Alan Cox wrote:
|>On Sat, 2002-11-02 at 05:17, Bill Davidsen wrote:
|>> I was hoping Alan would push Redhat to put this in their Linux so we
|>> could resolve some of the ongoing problems which don't write an oops to a
|>> log, but I guess none of the developers has to actually support production
|>> servers and find out why they crash.
|>
|>I think several Red Hat people would disagree very strongly. Red Hat
|>shipped with the kernel symbol decoding oops reporter for a good reason,
|>and also acquired netdump for a good reason.
It would be great if crash dumping were an option, at the very least
to unify the netdump, oops reporter and disk dumping (for those that
want it) into a single infrastructure. Long term, that's probably
where this is going anyway. It takes away the religious "who is right"
argument, which is fundamentally silly.
Maybe one day. I think quite a few Red Hat customers would
appreciate it.
--Matt
P.S. IBM shouldn't have signed a contract with Red Hat without
requiring certain features in Red Hat's OS(es). Pushing for
LKCD, kprobes, LTT, etc., wouldn't be on this list for a whole
variety of cases if that had been done in the first place.
P.S. As an aside, too many engineers try and make product marketing
decisions at Red Hat. I personally think that's really bad for
their business model as a whole (and I'm not referring to LKCD).
On Sun, 2002-11-03 at 01:24, Matt D. Robinson wrote:
> P.S. IBM shouldn't have signed a contract with Red Hat without
> requiring certain features in Red Hat's OS(es). Pushing for
> LKCD, kprobes, LTT, etc., wouldn't be on this list for a whole
> variety of cases if that had been done in the first place.
I would hope IBM have more intelligence than to attempt to destroy the
product by trying to force all sorts of junk into it. The Linux world
has a process for filtering crap; it isn't IBM applying force. That path
leads to Star Office 5.2, Netscape 4 and other similar scales of horror
code that become unmaintainably bad.
> P.S. As an aside, too many engineers try and make product marketing
> decisions at Red Hat. I personally think that's really bad for
> their business model as a whole (and I'm not referring to LKCD).
You think things like EVMS are a product marketing decision. I'm very
glad you don't run a Linux distro. It would turn into something like the
old 3com rapops rather rapidly by your models (3com rapops btw ceased to
exist and for good reasons)
Alan
On Sat, Nov 02, 2002 at 05:24:17PM -0800, Matt D. Robinson wrote:
> P.S. IBM shouldn't have signed a contract with Red Hat without
> requiring certain features in Red Hat's OS(es). Pushing for
> LKCD, kprobes, LTT, etc., wouldn't be on this list for a whole
> variety of cases if that had been done in the first place.
Bah, it's enough that IBMs money totally fucked up the tree of one popular
distribution..
On 3 Nov 2002, Alan Cox wrote:
|>On Sun, 2002-11-03 at 01:24, Matt D. Robinson wrote:
|>> P.S. IBM shouldn't have signed a contract with Red Hat without
|>> requiring certain features in Red Hat's OS(es). Pushing for
|>> LKCD, kprobes, LTT, etc., wouldn't be on this list for a whole
|>> variety of cases if that had been done in the first place.
|>
|>I would hope IBM have more intelligence than to attempt to destroy the
|>product by trying to force all sorts of junk into it. The Linux world
|>has a process for filtering crap; it isn't IBM applying force. That path
|>leads to Star Office 5.2, Netscape 4 and other similar scales of horror
|>code that become unmaintainably bad.
I think you misunderstand me. If IBM considers a feature to be useful,
they should require distributions to put it into a release from a contractual
standpoint. That doesn't mean Red Hat has to put it into all their
distributions -- it just means they have to produce something that
IBM wants. If nobody else uses it, that's fine. IBM gets what they
want, and Red Hat gets what they want. End of story.
You're looking at this from an engineering perspective and open source
philosophy rather than a business unit at a company like IBM might look
at it. That's not a bad thing to do, but the two concepts are very
different from each other. The Linux world may filter "crap", which
is great, but some of that "crap" is important to companies like IBM,
and if they were smart they'd use their leverage ($$$) to make sure the
"crap" ends up in the products they care to use/support. The rest of
Linux can do whatever it wants, doing things the "Linux world" way.
|>> P.S. As an aside, too many engineers try and make product marketing
|>> decisions at Red Hat. I personally think that's really bad for
|>> their business model as a whole (and I'm not referring to LKCD).
|>
|>You think things like EVMS are a product marketing decision. I'm very
|>glad you don't run a Linux distro. It would turn into something like the
|>old 3com rapops rather rapidly by your models (3com rapops btw ceased to
|>exist and for good reasons)
Again, I wasn't mentioning any product in particular. Making decisions
like GPL-only as an engineering philosophy rather than as a product
marketing decision are more problematic than looking at EVMS vs. anything
else as a question of which is technically better.
But again, that's a complete aside and would probably open up a plethora
of opinions from people who care about both sides of that argument, and
would inevitably head down a rathole infinitely deep.
--Matt
On 1 Nov 2002, Alan Cox wrote:
> On Fri, 2002-11-01 at 06:34, Bill Davidsen wrote:
> > From the standpoint of just the driver that's true. However, the remote
> > machine and all the network bits between them are a string of single
> > points of failure. Isn't it good that both disk and network can be
> > supported.
>
> My concerns are solely with things like the correctness of the disk
> dumper. It's obviously a good way to do a lot more damage if it isn't done
> carefully. Quite clearly your dump system wants to support multiple dump
> targets so you can dump to pci battery backed ram, down the parallel
> port to an analysing box etc
Quite clearly SCO, Sun, and IBM have been doing this for years without
offering dozens of options. I don't need it to sing and dance, I just need
a way to put the dump where I can find it. I'm not going to put another
box in at the end of a serial or parallel port, I don't have NVram, I do
have lots of disk, and so does almost everyone else. I have remote
systems in wiring closets all over the country (all four time zones). They
are at the end of open net connections, unreliable and untrusted. I don't
want to bet that I have a working VPN, or that I can safely send all that
data without it being read by someone other than me.
The AIX support has a group just to beat on dumps customers send. What
more evidence is needed that people can and do use the capability?
I had hoped that someone would do this for Linux, I never dreamed that
it would be kept out of the kernel by people who clearly don't understand
the problems of distributed and clustered headless systems.
I guess the development folks are working on more important things like
xiafs and morse code dumps to the keyboard LEDs.
--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
On Sat, 2 Nov 2002, Zwane Mwaikambo wrote:
> On Sat, 2 Nov 2002, Bill Davidsen wrote:
>
> > The thing is that Solaris, AIX, and ISC are written by commercial
> > companies, they realize that customers need to be able to debug systems
> > which don't have a screen, a serial printer, etc. They do have disk.
> >
> > I was hoping Alan would push Redhat to put this in their Linux so we
> > could resolve some of the ongoing problems which don't write an oops to a
> > log, but I guess none of the developers has to actually support production
> > servers and find out why they crash.
>
> Perhaps i'm being grossly naive here, but do none of these presumably x86
> production servers have a serial port? Not even PCI/ISA slots to
> add one? Serial would catch most of your oopsen anyway, and if you were
> borked enough that serial couldn't get the entire output, i somehow doubt
> dumping to disk could manage. And no i don't see anything wrong nor
> consider it studly to use oopses only for debugging...
I have distributed servers in 15 locations, six states, four timezones. In
secure unattended locations like wiring closets. What do I do with the
serial port? Do I double my colocation costs and have another system there
to listen? Is the code on a sick system going to dial the modem on the
serial line and establish a connection?
I have a mix of Linux, Solaris, and AIX systems deployed, and only the
Linux systems don't have this capability. Actually for the most part only
the Linux systems NEED it, that's another problem, but reliability would
go up if I could see the problem.
--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
On 3 Nov 2002, Alan Cox wrote:
> I would hope IBM have more intelligence than to attempt to destroy the
> product by trying to force all sorts of junk into it. The Linux world
> has a process for filtering crap, it isnt IBM applying force. That path
> leads to Star Office 5.2, Netscape 4 and other similar scales of horror
> code that become unmaintainably bad.
If you define "unmaintainably bad" as "having features you don't need"
then I agree. But since dump to disk is in almost every other commercial
UNIX, maybe someone would question why it's good for others but not for
Linux.
I can agree on stuff the non-hacker wouldn't use, but that is exactly who
uses the crash dump in AIX, the person who wants to send a compressed dump
and money to IBM and get back a fix. Netdump assumes external resources
and a functional secure network (is the dump encrypted and I missed it?)
which home users surely don't have, and remote servers often lack as well.
--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
On Sun, Nov 03, 2002 at 08:48:30AM -0500, Bill Davidsen wrote:
> On 1 Nov 2002, Alan Cox wrote:
>
> > On Fri, 2002-11-01 at 06:34, Bill Davidsen wrote:
> > > From the standpoint of just the driver that's true. However, the remote
> > > machine and all the network bits between them are a string of single
> > > points of failure. Isn't it good that both disk and network can be
> > > supported.
> >
> > My concerns are solely with things like the correctness of the disk
> > dumper. Its obviously a good way to do a lot more damage if it isnt done
> > carefully. Quite clearly your dump system wants to support multiple dump
> > targets so you can dump to pci battery backed ram, down the parallel
> > port to an analysing box etc
>
> Quite clearly SCO, Sun, and IBM have been doing this for years without
> offering dozens of options. I don't need it to sing and dance, I just need
> a way to put the dump where I can find it. I'm not going to put another
> box in at the end of a serial or parallel port, I don't have NVram, I do
> have lots of disk, and so does almost everyone else. I have remote
> systems in wiring closets all over the country (all four time zones). They
> are at the end of open net connections, unreliable and untrusted. I don't
> want to bet that I have a working VPN, or that I can safely send all that
> data without it being read by someone other than me.
>
> The AIX support has a group just to beat on dumps customers send. What
> more evidence is needed that people can and do use the capability.
>
> I had hoped that someone would do this for Linux, I never dreamed that
You paid someone for this for AIX. So the solution is obvious for Linux.
In article <[email protected]> you wrote:
> If you define "unmaintainably bad" as "having features you don't need"
> then I agree. But since dump to disk is in almost every other commercial
> UNIX, maybe someone would question why it's good for others but not for
> Linux.
It is even in FreeBSD or Windows > ME
Greetings
Bernd
On Sun, 2002-11-03 at 14:33, Bill Davidsen wrote:
> If you define "unmaintainably bad" as "having features you don't need"
> then I agree. But since dump to disk is in almost every other commercial
> UNIX, maybe someone would question why it's good for others but not for
> Linux.
It isnt about features, its about clean maintainable code. netdump to me
doesnt mean no dump to disk option. In fact I'd rather like to be able
to insmod dump-foo.o. The correctness issues are hard but if the
dump-foo is standalone, resets the hardware and has an SHA integrity
check then it can be done (think of it as a post crash variant of the
trusted computing TCB verification problem)
> uses the crash dump in AIX, the person who wants to send a compressed dump
> and money to IBM and get back a fix. Netdump assumes external resources
Lots of interesting legal issues but yes you can do it sometimes (DMCA,
privacy, financial duties sometimes make it horribly complex). Even in
the case where you only dump the oops its still valuable.
> and a functional secure network (is the dump encrypted and I missed it?)
> which home users surely don't have, and remote servers oftem lack as well.
Encrypting the dump with the new crypto lib in the kernel would be easy,
right now it doesnt.
My disk dump concerns are purely those of correctness. That means
1. After loading the module getting the block list for the dump target
2. Resetting and scratch initializing the dump device
3. Not relying on any code outside of the dump TCB that may have
been corrupted
4. At dump time turning off all bus masters, doing the dump TCB
verification and then dumping
Most of the pieces already exist.
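The four steps above can be sketched in user space as a rough model. This
is only an illustration, not code from LKCD or netdump: the function names,
the use of SHA-256 for the "SHA integrity check", and the flat list of block
numbers are all assumptions made for the example.

```python
import hashlib


def record_digest(dump_code):
    """Step 1 (load time): hash the dump code image while it is known good."""
    return hashlib.sha256(dump_code).hexdigest()


def verify_before_dump(dump_code, trusted_digest):
    """Step 4 (crash time): re-hash the image and refuse to run if it changed."""
    return hashlib.sha256(dump_code).hexdigest() == trusted_digest


def plan_dump(memory, block_list, block_size):
    """Map memory onto the blocks recorded at load time.

    Nothing is looked up at crash time, so no filesystem code outside the
    dump TCB is needed to decide where the data goes.
    """
    chunks = [memory[i:i + block_size] for i in range(0, len(memory), block_size)]
    if len(chunks) > len(block_list):
        raise ValueError("dump target too small for this memory image")
    return list(zip(block_list, chunks))


# Hypothetical stand-ins for the dump module image and the saved block list.
image = b"\x90" * 64
digest = record_digest(image)
corrupted = image[:10] + b"\x00" + image[11:]
```

The point of the model is the ordering: the block list and the digest are
both fixed while the system is healthy, and the crash path only compares
and copies.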
On 3 Nov 2002, Alan Cox wrote:
|>Encrypting the dump with the new crypto lib in the kernel would be easy,
|>right now it doesnt.
Piece of cake. It's like adding a dump compression module. You
can load dump_gzip.o or dump_rle.o to specify the kind of compression
you want to use. dump_crypto.o would be the same kind of thing. Just
add another flag and away you go.
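A minimal sketch of that "load a module, set a flag" arrangement, written as
a user-space model. The flag constants and the registry function are
hypothetical names invented for this example (they are not LKCD's actual
interface), and zlib stands in for the gzip module.

```python
import zlib


def rle_compress(data):
    """Run-length encode as (count, byte) pairs, with runs capped at 255."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and run < 255 and data[i + run] == data[i]:
            run += 1
        out += bytes((run, data[i]))
        i += run
    return bytes(out)


def rle_decompress(data):
    """Expand (count, byte) pairs back into the original bytes."""
    out = bytearray()
    for j in range(0, len(data), 2):
        out += bytes([data[j + 1]]) * data[j]
    return bytes(out)


# Hypothetical flag values; a crypto module would register one more entry.
DUMP_COMPRESS_NONE, DUMP_COMPRESS_RLE, DUMP_COMPRESS_GZIP = 0, 1, 2
COMPRESSORS = {}


def register_compressor(flag, compress, decompress):
    """Model of what loading a compression module would do."""
    COMPRESSORS[flag] = (compress, decompress)


register_compressor(DUMP_COMPRESS_NONE, bytes, bytes)
register_compressor(DUMP_COMPRESS_RLE, rle_compress, rle_decompress)
register_compressor(DUMP_COMPRESS_GZIP, zlib.compress, zlib.decompress)


def write_page(page, flag):
    """Pass one page through whichever compressor the flag selects."""
    compress, _ = COMPRESSORS[flag]
    return compress(page)
```

The dump path only ever calls write_page(); adding a new transform means
registering one more pair of functions under a new flag.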
|>My disk dump concerns are purely those of correctness. That means
|>
|> [ ... ]
|>
|>Most of the pieces already exist.
It's just a matter of time, then.
--Matt
On Friday 01 November 2002 16:16, Patrick Finnegan wrote:
> > It's not a fscking public service. Linus has full control over his
> > tree. You have equally full control over your tree. Linus can't
> > tell you what patches to apply in your tree. You can't tell Linus
> > what patches he should apply to his.
>
> I'm sorry it _is_ a public service. Once tens of people started
> contributing to it, it became one. This is like saying that the
> Washington Monument belongs to the people that maintain it, any building
> belongs to the repair crews and janitors.
You pay taxes to support the washington monument. When's the last time you
paid a tax to Linus?
> I'm not saying that Linus is
> necessarily a janitor, but when you consider how much of the Linux kernel
> that he didn't write, you may realize that it's not just his kernel.
He's the editor of a periodical publication. A cross between an academic
technical journal which people contribute to for professional reasons, and a
hobbyist fanzine that people contribute to 'cause it's cool. This is not a
new thing, there are real-world precedents for this sort of relationship
going back hundreds of years, to the invention of the printing press...
Linus's editorial decisions are as final and unappealable as any other
editorial decision at a magazine or newspaper. You can publish your article
elsewhere, and if it doesn't have the same prestige as the Harvard Law Review
or the New England Journal of Medicine, tough. They said no.
And like ALL editors, his job isn't to write a significant portion of the
articles in the publication, but to be a Sturgeon's Law filter throwing out
99% of the submissions in the slush pile, correcting the spelling and grammar
of the remaining few, and trying to stitch them together into a coherent
whole.
Go track down somebody with a Journalism degree if you want to understand
Linus's job.
> It
> also belongs to every single person that has written even a single
> line of code in it.
If you get an article published in Time magazine, and you say that this gives
you the right to print your own copies of Time and distribute them yourself,
Time's lawyers are going to come after you.
The GPL gives you the ability to do this, but it doesn't obligate the
publication's editor to listen to you. If next month's issue contains a huge
rebuttal to one of your articles, calling you a boogerhead, tough. The
editor doesn't owe you anything as a previous contributor, and certainly
doesn't owe you anything as someone from whom he did NOT take a submission.
What Linus basically said was that if a significant number of distributions
integrated it, he might take another look at the thing in the future. But
wasn't going into 2.5.
Now, thanks to people pestering him beyond the Annoyance Event Horizon, he's
got his fingers in his ears. Congratulations. Hopefully, he'll calm down a
bit in a few months, but there's no guarantee. In the mean time, the most
productive thing to do is drop the topic and work on the Red Hat, SuSE, and
Debian guys. (Mandrake feeds from Red Hat, and SuSE is now making kernels
for Conectiva and TurboLinux. Gentoo and Slackware might be good to bug as
well...)
See if you can convince Alan Cox to pick up your patch. That'll get you Red
Hat, and the single largest concentration of roll-your-own kernel guys
outside of Linus's own tree.
Rob
--
http://penguicon.sf.net - Terry Pratchett, Eric Raymond, Pete Abrams, Illiad,
CmdrTaco, liquid nitrogen ice cream, and caffienated jello. Well why not?
First I want to apologize to anyone I've pissed off too badly with this.
Another note - I have no relation to the LKCD developers, other than a
very satisfied, and sometimes excessively vehement, user. I was about to
respond to this message in detail, but I don't need to put more Magnesium
on the flames.
Pat
On Mon, 4 Nov 2002, Rob Landley wrote:
> On Friday 01 November 2002 16:16, Patrick Finnegan wrote:
>
> > > It's not a fscking public service. Linus has full control over his
> > > tree. You have equally full control over your tree. Linus can't
> > > tell you what patches to apply in your tree. You can't tell Linus
> > > what patches he should apply to his.
> >
> > I'm sorry it _is_ a public service. Once tens of people started
> > contributing to it, it became one. This is like saying that the
> > Washington Monument belongs to the people that maintain it, any building
> > belongs to the repair crews and janitors.
>
> You pay taxes to support the washington monument. When's the last time you
> paid a tax to Linus?
>
> > I'm not saying that Linus is
> > necessarily a janitor, but when you consider how much of the Linux kernel
> > that he didn't write, you may realize that it's not just his kernel.
>
> He's the editor of a periodical publication. A cross between an academic
> technical journal which people contribute to for professional reasons, and a
> hobbyist fanzine that people contribute to 'cause it's cool. This is not a
> new thing, there are real-world precedents for this sort of relationship
> going back hundreds of years, to the invention of the printing press...
>
> Linus's editorial decisions are as final and unappealable as any other
> editorial decision at a magazine or newspaper. You can publish your article
> elsewhere, and if it doesn't have the same prestige as the Harvard Law Review
> or the New England Journal of Medicine, tough. They said no.
>
> And like ALL editors, his job isn't to write a significant portion of the
> articles in the publication, but to be a Sturgeon's Law filter throwing out
> 99% of the submissions in the slush pile, correcting the spelling and grammar
> of the remaining few, and trying to stitch them together into a coherent
> whole.
>
> Go track down somebody with a Journalism degree if you want to understand
> Linus's job.
>
> > It
> > also belongs to every single person that has written even a single
> > line of code in it.
>
> If you get an article published in Time magazine, and you say that this gives
> you the right to print your own copies of Time and distribute them yourself,
> Time's lawyers are going to come after you.
>
> The GPL gives you the ability to do this, but it doesn't obligate the
> publication's editor to listen to you. If next month's issue contains a huge
> rebuttal to one of your articles, calling you a boogerhead, tough. The
> editor doesn't owe you anything as a previous contributor, and certainly
> doesn't owe you anything as someone from whom he did NOT take a submission.
>
> What Linus basically said was that if a significant number of distributions
> integrated it, he might take another look at the thing in the future. But
> it wasn't going into 2.5.
>
> Now, thanks to people pestering him beyond the Annoyance Event Horizon, he's
> got his fingers in his ears. Congratulations. Hopefully, he'll calm down a
> bit in a few months, but there's no guarantee. In the mean time, the most
> productive thing to do is drop the topic and work on the Red Hat, SuSE, and
> Debian guys. (Mandrake feeds from Red Hat, and SuSE is now making kernels
> for Conectiva and TurboLinux. Gentoo and Slackware might be good to bug as
> well...)
>
> See if you can convince Alan Cox to pick up your patch. That'll get you Red
> Hat, and the single largest concentration of roll-your-own kernel guys
> outside of Linus's own tree.
>
> Rob
--
Purdue University ITAP/RCS
Information Technology at Purdue
Research Computing and Storage
http://www-rcd.cc.purdue.edu
http://dilbert.com/comics/dilbert/archive/images/dilbert2040637020924.gif
Hugh Dickins <[email protected]> said:
[...]
> I dealt with crash dumps quite a lot over 10 years with SCO UNIX,
> OpenServer and UnixWare: which were addressing the PC market, not
> own hardware.
What I remember about hardware compatibility for SCO Unix and Solaris on
ia32 is _not_ funny. Lightyears from what Linux handles today without
breaking a sweat.
> It's a real worry that writing a crash dump to disk might stomp in the
> wrong place, but I don't recall it ever happening in practice. But
> occasionally, yes, a dump was not generated at all, or not completed.
How do you test that? Not in some contrived situation, under real crashes.
Don't just consider crashes in the official $DISTRIBUTION kernel, but in
Linus' BK tree, or some of the random, two-or-three-letter-trees of the day
(_that_ is where crashes happen, _that_ is where the info would be most
valuable). It gets _real_ hairy _real_ fast to make sure you don't scribble
over /home or /etc on the user's disk...
> Of course, you could argue that SCO's disk drivers were more stable :-)
If you only handle a few, thoroughly tested, high-end controllers and
disks, that is not too hard to do.
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513
On Sat, 2 Nov 2002, Horst von Brand wrote:
> Hugh Dickins <[email protected]> said:
>
> > It's a real worry that writing a crash dump to disk might stomp in the
> > wrong place, but I don't recall it ever happening in practice. But
> > occasionally, yes, a dump was not generated at all, or not completed.
>
> How do you test that? Not in some contrived situation, under real crashes.
Sorry for being unclear: by "in practice" I meant "under real crashes" i.e.
I was referring more to what we heard back from users than my own testing.
Hugh
On Monday 04 November 2002 14:58, Patrick Finnegan wrote:
> First I want to apologize to anyone I've pissed off too badly with this.
Sorry, didn't mean to bring up an old issue. DSL was out over the weekend
here (thank you Southwestern Bell), so some stuff queued up in my laptop's
outbox...
Rob
--
http://penguicon.sf.net - Terry Pratchett, Eric Raymond, Pete Abrams, Illiad,
CmdrTaco, liquid nitrogen ice cream, and caffienated jello. Well why not?
On Thursday 31 October 2002 07:21, Chris Wedgwood wrote:
> Don't get me wrong, I'm not against sane ACLs (POSIX ACLs are not) or
> EAs [...]
POSIX ACLs are more complicated than what would be inherently necessary, if we
were in a situation where we could design from scratch. Unfortunately we are
not in that situation. I've heard dozens of people complain about POSIX ACLs
(and other kinds as well); nobody was able to come up with something truly
better so far.
--Andreas.
On Friday 01 November 2002 01:54, john stultz wrote:
> I probably should just go read the specs. Anyone have a pointer, or care
> to explain what the differences are between AFS's ACLs and POSIX ACLs?
POSIX 1003.1e draft 17 (withdrawn) is available at
<http://wt.xpilot.org/publications/posix.1e/>.
--Andreas.
On Thu, Oct 31, 2002 at 07:37:05PM -0300, Werner Almesberger wrote:
> Jeff Garzik wrote:
> > That said, I used to be an LKCD cheerleader until a couple people made
> > some good points to me: it is not nearly low-level enough to truly be
> > of use in crash situations.
>
> I'm not so convinced about this. I like the Mission Critical
> approach: save the dump to memory, then either boot through the
> firmware or through bootimg (nowadays, that would be kexec),
> then retrieve the dump from memory, and do whatever you like
> with it.
>
> The huge advantage here is that you don't need a ton of
> specialized dump drivers and/or have much of the original kernel
> infrastructure to be in a usable state. The rebooted system will
> typically be stable enough to offer the full range of utilities,
> including up to date drivers for all possible devices, so you
> can safely write to disk, scp all the mess to your support
> critter, or post an automatic flame to linux-kernel :-)
>
> The weak points of the Mission Critical design are that early
> memory allocation in the kernel needs to be tightly controlled,
> that architectures that wipe CPU caches on reboot need to
> commit them to memory before the firmware restart, and that
> drivers need to be able to recover from an "unclean" hardware
> state. (I think we'll see much of the latter happen as kexec
> advances. The other two issues aren't really special.)
>
> Actually, at the RAS BOF I thought that IBM were developing LKCD
> in this direction, and had also eliminated a few not so elegant
> choices of Mission Critical's original design. I haven't looked
Yes, we are putting that in as one of the alternative dump targets
available. I have done quite a bit of work on that implementing the
ideas we talked about at OLS, and that's what I've been referring
to as the memory dump target. Its not quite ready yet and we
need something like kexec to be available which we can use on Intel
systems to achieve the softboot (the acceptance status of that still
doesn't seem to be clear), so I was looking at this as a
follow-on thing once the core infrastructure is there. More so
because we probably need to give it some time to stabilize and try
it on different environments and look at the issues you mention.
Why do we even consider the other options when we are doing
this already ? Well, as we discussed earlier there's non-disruptive dumps
for one, where this wouldn't work. The other is that before overwriting
memory we need to be able to stop all activity in the system for certain
(system may appear hung/locked up) and I'm not fully certain about
how to do this for all environments. Maybe an answer lies in
rethinking some parts of the algorithm a bit.
Also having the interface allows people to develop more specific/
reliable solutions for their environment. So we do not necessitate
code duplication, but if something exists, then the infrastructure
can use it.
The general feeling here is that a one solution fits all thing
may not work best right now ... and hence the focus on an interface
based approach that gives us the needed flexibility.
> at the LKCD code, but the descriptions sound as if all the
> special-case cruft seems to be back again, which I would find a
> little disappointing.
Hope that helps a bit.
Regards
Suparna
>
> There might be a case for specialized low-overhead dump handlers
> for small embedded systems and such, but they're probably better
> maintained outside of the mainstream kernel. (They're more like
> firmware anyway.)
>
> - Werner
>
> --
> _________________________________________________________________________
> / Werner Almesberger, Buenos Aires, Argentina [email protected] /
> /_http://www.almesberger.net/____________________________________________/
>
--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Labs, India
By the way, let's not forget Eric Biederman's kexec. While not
perfect, it's definitely usable, and looks good enough for
inclusion as an experimental feature.
As to why we need it, I've explained this in my OLS 2000 paper,
sections 2.6 and 5:
http://www.almesberger.net/cv/papers/ols2k-9.ps
My approach was called "bootimg". kexec is similar, but does a few
things related to page sorting/moving better, and it's much smarter
about quiescing the system before trying to reboot.
I view kexec as an "enabler", much like initrd, which had to be
part of the kernel for a while before people started to figure out
how to use it. (At this year's OLS, somebody told me they just
"discovered" initrd and are now using it. Oh well, it's only been
around for six years ;-)
It should be "experimental", because some compatibility issues
still have to be addressed, but most of this can be done in user
space, and shouldn't require significant changes in the kernel
part of kexec, or in its interface to user space.
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/
On Sun, 3 Nov 2002 [email protected] wrote:
> On Sun, Nov 03, 2002 at 08:48:30AM -0500, Bill Davidsen wrote:
> > Quite clearly SCO, Sun, and IBM have been doing this for years without
> > offering dozens of options. I don't need it to sing and dance, I just need
> > a way to put the dump where I can find it. I'm not going to put another
> > box in at the end of a serial or parallel port, I don't have NVram, I do
> > have lots of disk, and so does almost everyone else. I have remote
> > systems in wiring closets all over the country (all four time zones). They
> > are at the end of open net connections, unreliable and untrusted. I don't
> > want to bet that I have a working VPN, or that I can safely send all that
> > data without it being read by someone other than me.
> >
> > The AIX support has a group just to beat on dumps customers send. What
> > more evidence is needed that people can and do use the capability.
> You paid someone for this for AIX. So the solution is obvious for Linux.
No, it's included in AIX, SCO and Solaris. And analysis is included in
support contracts. With all the stuff added to Linux to keep up with both
M$ and commercial UNIX, I can't imagine why anyone would be against this.
At least anyone who wanted Linux to compete in the commercial server
market.
--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
On Tue, Nov 05, 2002 at 02:29:43PM -0300, Werner Almesberger wrote:
> I view kexec as an "enabler", much like initrd, which had to be
> part of the kernel for a while before people started to figure out
> how to use it. (At this year's OLS, somebody told me they just
> "discovered" initrd and are now using it. Oh well, it's only been
> around for six years ;-)
kexec is also a great enabler for a non-intrusive kernel dump
facility done correctly by booting into a new kernel image (which
avoids the whole difficulty on x86 with BIOSes wiping out RAM at
reboot).
-ben
On Tue, Nov 05, 2002 at 12:09:17PM -0500, Bill Davidsen wrote:
> On Sun, 3 Nov 2002 [email protected] wrote:
>
> > On Sun, Nov 03, 2002 at 08:48:30AM -0500, Bill Davidsen wrote:
> > > Quite clearly SCO, Sun, and IBM have been doing this for years without
> > > offering dozens of options. I don't need it to sing and dance, I just need
> > > a way to put the dump where I can find it. I'm not going to put another
> > > box in at the end of a serial or parallel port, I don't have NVram, I do
> > > have lots of disk, and so does almost everyone else. I have remote
> > > systems in wiring closets all over the country (all four time zones). They
> > > are at the end of open net connections, unreliable and untrusted. I don't
> > > want to bet that I have a working VPN, or that I can safely send all that
> > > data without it being read by someone other than me.
> > >
> > > The AIX support has a group just to beat on dumps customers send. What
> > > more evidence is needed that people can and do use the capability.
>
> > You paid someone for this for AIX. So the solution is obvious for Linux.
>
> No, it's included in AIX, SCO and Solaris. And analysis is included in
None of those are free.
> support contracts. With all the stuff added to Linux to keep up with both
> M$ and commercial UNIX, I can't imagine why anyone would be against this.
> At least anyone who wanted Linux to compete in the commercial server
> market.
So buy your Linux from a vendor who supports it.
Suparna Bhattacharya wrote:
> Yes, we are putting [MCORE] in as one of the alternative dump targets
> available.
Great !
> Its not quite ready yet and we need something like kexec to be
> available which we can use on Intel systems to achieve the softboot
> (the acceptance status of that still doesn't seem to be clear),
Yes, I've just checked with Eric, and he hasn't received any
indication from Linus so far. I posted a reminder to linux-kernel.
I'd really hate to see kexec miss 2.6.
> Why do we even consider the other options when we are doing
> this already ? Well, as we discussed earlier there's non-disruptive
> dumps for one, where this wouldn't work.
But they're very different anyway, aren't they ? I mean, you could
even implement them (well, almost) from user space, with today's
kernels.
> The other is that before overwriting
> memory we need to be able to stop all activity in the system for certain
> (system may appear hung/locked up) and I'm not fully certain about
> how to do this for all environments. Maybe an answer lies in
> rethinking some parts of the algorithm a bit.
This is certainly the hairiest part, yes. I think we have about
four types of devices/elements to worry about:
- those that just sit there, and never talk unless spoken to
- those that may generate interrupts
- those that DMA if you ask them nicely
- those that DMA when they feel like it (e.g. copy an incoming
network packet to the next buffer in the free list)
The latter are the real problem. I see the following possibilities
for dealing with them:
- faith-based computing: pray that nothing bad will befall your
system :-)
- de-activate them individually. There should be a lot of work
that can be shared with power management. And that's one of
the reasons why I think the memory target should be available
early, or convergence will take forever.
- try to reset them, without necessarily knowing what they are
or what they do. I don't know if there is a useful way of
resetting the PCI bus by software, but if there is one, this
looks like the most promising strategy to me, even if it may
be somewhat lacking in elegance.
- if all else fails, maybe introduce an "unsafe" memory type
for potential DMA targets of unpredictable devices, that will
not be re-used. I hope we won't need this, though. (But in case
such a memory type gets introduced by the memory-scrubbers, at
least you could blame _them_ :-)
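As a rough illustration of the last fallback, here is a toy model of the
four device classes and of an "unsafe" page set that is withheld from
re-use. Everything in it is assumed for the sake of the example: the class
names, the dictionary layout, and the idea that each device's possible DMA
target pages are known in advance.

```python
from enum import Enum


class DevClass(Enum):
    PASSIVE = 1          # just sits there, never talks unless spoken to
    INTERRUPTS = 2       # may generate interrupts
    DMA_ON_REQUEST = 3   # DMAs if you ask nicely
    DMA_UNSOLICITED = 4  # DMAs when it feels like it


def reusable_pages(all_pages, devices):
    """Return the pages that are safe to overwrite with the new kernel.

    Pages that an unquiesced, unsolicited-DMA device could still write to
    are treated as "unsafe" and left out.
    """
    unsafe = set()
    for dev in devices:
        if dev["cls"] is DevClass.DMA_UNSOLICITED and not dev["quiesced"]:
            unsafe.update(dev["dma_pages"])
    return [p for p in all_pages if p not in unsafe]


# Hypothetical device table: one NIC that could not be silenced, one that
# could, and a passive device with no DMA at all.
devices = [
    {"cls": DevClass.DMA_UNSOLICITED, "quiesced": False, "dma_pages": [2, 3]},
    {"cls": DevClass.DMA_UNSOLICITED, "quiesced": True, "dma_pages": [5]},
    {"cls": DevClass.PASSIVE, "quiesced": False, "dma_pages": []},
]
```

Only the first device costs any memory here; successfully de-activating or
resetting a device returns its pages to the usable pool.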
> The general feeling here is that a one solution fits all thing
> may not work best right now ... and hence the focus on an interface
> based approach that gives us the needed flexibility.
Yes, this is certainly important. I just think that the "memory
target" concept is closer to a general solution than disk dumps.
But there are always those 5% with special needs, and it's good
if they can use the same framework.
Thanks,
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/
> By the way, let's not forget Eric Biederman's kexec. While not
> perfect, it's definitely usable, and looks good enough for
> inclusion as an experimental feature.
Another me too for this feature. I really want to be able to use this
on the large NUMA boxes - it takes me 5 minutes to do a full reboot
cycle, and I can't even do an init 6 due to some firmware complications,
I have to do init 0, power off, power on, boot, etc. Whilst I have
a remote power interface, it's still a pain in the butt.
kexec would be Nirvana ;-)
M.
On 3 Nov 2002, Alan Cox wrote:
> On Sun, 2002-11-03 at 14:33, Bill Davidsen wrote:
> > If you define "unmaintainably bad" as "having features you don't need"
> > then I agree. But since dump to disk is in almost every other commercial
> > UNIX, maybe someone would question why it's good for others but not for
> > Linux.
>
> It isnt about features, its about clean maintainable code. netdump to me
> doesnt mean no dump to disk option. In fact I'd rather like to be able
> to insmod dump-foo.o. The correctness issues are hard but if the
> dump-foo is standalone, resets the hardware and has an SHA integrity
> check then it can be done (think of it as a post crash variant of the
> trusted computing TCB verification problem)
I certainly don't disagree, but the one critical problem is writing the
dump to the right place, or at least not writing to the wrong place. I'd
love to have disk, net, NVram, whatever choices, but disk is the one which
would help the most. AIX and ISC have dump to swap, and the swapon copies
the data back or clears it, with a fresh O/S load to ensure writing the
right place.
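The "swapon copies the data back or clears it" step might look roughly like
this in a user-space model. The header layout (an 8-byte magic followed by
an 8-byte little-endian length) is invented for the example and is not the
AIX or ISC on-disk format.

```python
import io

DUMP_MAGIC = b"DUMPv0\x00\x00"  # hypothetical 8-byte signature
HEADER_LEN = 16                  # magic + 8-byte little-endian length


def recover_dump(swap, save):
    """Run by the freshly booted system before the swap area is enabled.

    If the area starts with a dump header, copy the dump out to `save` and
    clear the header so the same dump is not recovered twice. Returns True
    when a dump was found.
    """
    swap.seek(0)
    header = swap.read(HEADER_LEN)
    if header[:8] != DUMP_MAGIC:
        return False
    length = int.from_bytes(header[8:16], "little")
    save.write(swap.read(length))
    swap.seek(0)
    swap.write(b"\x00" * HEADER_LEN)
    return True


# In-memory stand-ins for the swap device and the saved dump file.
swap = io.BytesIO(DUMP_MAGIC + (5).to_bytes(8, "little") + b"hello" + b"rest")
saved = io.BytesIO()
```

Because the copy is done by a fresh O/S load, the crashed kernel only has
to write to one pre-arranged place, and the filesystem write happens on a
system that is known to be sane.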
> > uses the crash dump in AIX, the person who wants to send a compressed dump
> > and money to IBM and get back a fix. Netdump assumes external resources
>
> Lots of interesting legal issues but yes you can do it sometimes (DMCA,
> privacy, financial duties sometimes make it horribly complex). Even in
> the case where you only dump the oops its still valuable.
Agreed, I would think about doing that with a mail server. But even an
oops like ksymoops would be helpful. I started on systems with dumps,
ksymoops is wonderful by comparison.
> > and a functional secure network (is the dump encrypted and I missed it?)
> > which home users surely don't have, and remote servers often lack as well.
>
> Encrypting the dump with the new crypto lib in the kernel would be easy,
> right now it doesnt.
>
> My disk dump concerns are purely those of correctness. That means
>
> 1. After loading the module getting the block list for the dump target
That could all be built as part of init, clearly you can't depend on
demand loading the module.
> 2. Resetting and scratch initializing the dump device
If the modules are to be really self-sufficient it would have to include
the driver. I'll let someone tell me that's not always the case if the
driver can have its own data area.
> 3. Not relying on any code outside of the dump TCB that may have
> been corrupted
Yes, although with separate code, stack and data that's less likely. In
the bad old days self-modifying code was common.
> 4. At dump time turning off all bus masters, doing the dump TCB
> verification and then dumping
The first part of that looks medium hard, particularly if the code has to
be part of the dump module.
> Most of the pieces already exist.
Clearly it can be done even better than the current implementation, and
given an interface standard a replacement in the whole could be done.
--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
On Tue, 2002-11-05 at 18:00, Werner Almesberger wrote:
> Yes, I've just checked with Eric, and he hasn't received any
> indication from Linus so far. I posted a reminder to linux-kernel.
> I'd really hate to see kexec miss 2.6.
Let me ask the same dumb question - what does kexec need that a dumper
doesn't. In other words given reboot/trap hooks can kexec happily live
as a standalone module ?
Alan Cox wrote:
> Let me ask the same dumb question - what does kexec need that a dumper
> doesn't.
kexec needs:
- a system call to set it up
- a way to silence devices (difference to dumper: kexec normally
operates under an intact system, so it's more similar to, say,
swsusp. But I expect that cleaning up device power management
would also clear the path for more reliable dumpers.)
- a bit of glue, e.g. to switch to "real mode", etc. AFAIK, none
of this touches other code, but there are of course some
assumptions here on what other code provides or does.
- device drivers that can bring silent devices back to life
(normally, device drivers do this already, but kexec may
uncover dormant bugs in this area)
Since recent kernels already preserve memory areas with BIOS data,
kexec is actually quite a bit less intrusive than bootimg was.
> In other words given reboot/trap hooks can kexec happily live
> as a standalone module ?
"Module", as in "software package": yes, the main problem spot
would be the system call allocation, which is also inconvenient
for other developers. By the way, kexec does not tap into the
kernel's reboot process, so no such hooks are needed. If LKCD
wants to use kexec, it can do whatever it does to be invoked at
the time of a crash, and then call kexec directly.
"Module", as in "loadable kernel module": I think so, although
it's currently "bool", not "tristate". Also, you'd have the
system call issue again.
So not merging it is mainly inconvenient to use, adds the system
call allocation as a continuous annoyance, and makes it a little
harder to work on the related infrastructure. But then, despite
being somewhat obscure, bootimg and kexec have been in use for
years, the design is about as lean as it can get, and it's cool.
What more could you ask for ? :-)
What kexec needs now is more exposure, so that the BIOS
compatibility issues get noticed and fixed, it is ported to other
architectures, and that more people can start figuring out how to
use it, and how to build a boot environment. Then, maybe in a
year or two, we can send "Methuselah" LILO and "nice little OS"
GRUB off to their well-deserved retirement.
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/
On Tue, 2002-11-05 at 19:19, Werner Almesberger wrote:
> kexec needs:
> - a system call to set it up
Device, file, insmod...
> - a way to silence devices (difference to dumper: kexec normally
> operates under an intact system, so it's more similar to, say,
> swsusp. But I expect that cleaning up device power management
> would also clear the path for more reliable dumpers.)
So you need to register with the power management as the last thing to
be suspended and do a suspend before kexec.
> So not merging it is mainly inconvenient to use, adds the system
> call allocation as a continuous annoyance, and makes it a little
> harder to work on the related infrastructure. But then, despite
> being somewhat obscure, bootimg and kexec have been in use for
> years, the design is about as lean as it can get, and it's cool.
> What more could you ask for ? :-)
I'm mostly worried about how to make these things fit the least
intrusively into the kernel.
Alan Cox wrote:
>> - a system call to set it up
>
> Device, file, insmod...
I don't know what Eric thinks about using something other than a
system call, but I think he made a quite reasonable choice.
The data structure isn't entirely trivial, so a misc device plus
ioctl would be a bit on the ugly side. I vaguely remember having
proposed something like this a while ago (may have been for
pivot_root), and everybody went "noooo!!" ;-)
insmod would be possible, although with a rather unusual parameter
passing scheme. Also, when using kexec from inside the kernel (e.g.
MCORE), the insmod solution would have to split kexec into the
interface and the kexec core.
But yes, there's always a means to avoid adding a new system
call. /dev/syscall with an ioctl
struct syscall_ioctl {
	const char *symbol_name;
	va_list ap;
};
anyone ? :-) (Implementing it might be a bit of a challenge :)
> So you need to register with the power management as the last thing to
> be suspended and do a suspend before kexec.
Well, kexec just calls device_shutdown. The problem isn't the
interface, it's that device_shutdown apparently doesn't work too
well (devices not supporting it, some semantics mixup, etc.).
But this is general infrastructure work, that should be done
with or without kexec.
> I'm mostly worried about how to make these things fit the least
> intrusively into the kernel.
Just look at Eric's kexec patch. It isn't particularly intrusive:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103604471723358&w=2
(For 2.5.45. The patch fails for 2.5.46, because new system calls
were added ...)
- Werner
On Tue, 2002-11-05 at 11:19, Werner Almesberger wrote:
> Alan Cox wrote:
> > Let me ask the same dumb question - what does kexec need that a dumper
> > doesn't.
>
> kexec needs:
> - a system call to set it up
> - a way to silence devices <snip>
<snip>
> - a bit of glue <snip>
> - device drivers that can bring silent devices back to life
<snip>
> > In other words given reboot/trap hooks can kexec happily live
> > as a standalone module ?
You could probably skip the system call to set it up. Example: I could
imagine a bizarre set of pseudo-devices:
# insmod kexec
# cat bzImage > /proc/kexec/next-image
# echo "root=805" > /proc/kexec/next-cmndline
# echo 1 > /proc/kexec/reboot
and hide away that dirty little sequence with a nice kexec(3) library
routine.
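A minimal user-space sketch of such a library routine, assuming the pseudo-files behave as in the sequence above. The function names `kexec_stage` and `kexec_reboot` and the `dir` parameter are hypothetical and exist only for illustration; the file names are the ones from the example. Nothing here is an existing interface.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write len bytes from buf to the file dir/name; 0 on success, -1 on error. */
static int write_file(const char *dir, const char *name,
                      const void *buf, size_t len)
{
    char path[512];
    FILE *f;
    int n, ok;

    assert(dir && name && buf);
    n = snprintf(path, sizeof(path), "%s/%s", dir, name);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;
    f = fopen(path, "wb");
    if (!f)
        return -1;
    ok = fwrite(buf, 1, len, f) == len;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Hypothetical: stage the next image and its command line under dir. */
int kexec_stage(const char *dir, const void *image, size_t image_len,
                const char *cmdline)
{
    assert(cmdline != NULL);
    if (write_file(dir, "next-image", image, image_len) != 0)
        return -1;
    return write_file(dir, "next-cmndline", cmdline, strlen(cmdline));
}

/* Hypothetical: trigger the reboot, like "echo 1 > reboot". */
int kexec_reboot(const char *dir)
{
    return write_file(dir, "reboot", "1\n", 2);
}
```

A caller would pass "/proc/kexec" as dir. Keeping staging and triggering in separate functions lets the image be written well before the reboot is requested.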
The Two Kernel Monte trick (which, when insmod'ed, rewrote the kernel's
function pointers for sys_reboot) was also effective, but that
apparently isn't an option any longer.
> What kexec needs now is more exposure, so that the BIOS
> compatibility issues get noticed and fixed, it is ported to other
> architectures, and that more people can start figuring out how to
> use it, and how to build a boot environment.
I'll 2nd that sentiment, and add another big one: fixing (apparent)
problems with drivers and chipset-munging code, so that devices can be
reliably re-probed/re-inited/etc. after the reboot.
Long term, I think it would be advantageous to be able to avoid SCSI and
other time consuming device probes for the common and simple reboot case
of 1) the currently running kernel is being rebooted, and 2) no changes
to the device configuration have occurred. Shouldn't we be able to "save
away" what is in sysfs, and then re-inject that state after a fast
reboot?
Andy
Andy Pfiffer wrote:
> You could probably skip the system call to set it up.
Yes, yes, there are many ways to do this. This isn't the issue. The
questions regarding this are:
- is kexec allowed to use a system call ?
- if yes, is a system call the technically right solution ?
- if yes, is it a practical solution ?
So far, it hasn't been considered inherently wrong to use system
calls, even for highly Linux-specific functions, and even if they
aren't performance-critical (just think of pivot_root). (*)
If this perception has changed, such a change of policy would also
affect kexec, but then we don't need to discuss kexec but the
policy change. (I don't know - is such a change in the air ?)
(*) By the way, I remember now where I brought up some hack for
avoiding the use of a system call - it was for bootimg :-)
Now, if we assume that it's okay for kexec to use a system call,
the next question is whether kexec should indeed use it, i.e.
whether a system call makes sense for what it is trying to do.
Since there are no device files or network elements naturally
involved here (i.e. other major kernel function interfaces),
the answer seems to be "yes".
Last but not least, we need to decide whether using a system
call would be painful for Eric or for kexec users. This would be
the case if kexec isn't merged, and the kexec patch would need
frequent updates because system calls have changed.
I understand Alan's question as the "what if ... ?" type. If
kexec is indeed rejected for merging, it may make sense to change
the interface to something which may be technically less elegant,
but which makes patch maintenance easier to handle.
> I'll 2nd that sentiment, and add another big one: fixing (apparent)
> problems with drivers and chipset-munging code, so that devices can be
> reliably re-probed/re-inited/etc. after the reboot.
Yes, kexec is likely to turn up a few problems in this area, too.
Right now, we only hear about such issues if some BIOS lets
something slip through. With kexec, such problems should show up
sooner.
> Long term, I think it would be advantageous to be able to avoid SCSI and
> other time consuming device probes
Definitely. May I refer you to my booting paper, which discusses
all this in section 5 ? :-)
- Werner
On Tue, 5 Nov 2002, Werner Almesberger wrote:
> Now, if we assume that it's okay for kexec to use a system call,
> the next question is whether kexec should indeed use it, i.e.
> whether a system call makes sense for what it is trying to do.
> Since there are no device files or network elements naturally
> involved here (i.e. other major kernel function interfaces),
> the answer seems to be "yes".
That's not obvious. By the same logic, we would need syscalls for
turning off overcommit, etc., etc.
FWIW, I suspect that
open("/proc/image", O_EXCL|O_WRONLY);
bunch of lseek()/write()
close()
would be more natural - definitely easier to understand than arguments of
your sys_kexec(). It's easy to switch from your code to that - you
put initialization into ->open(), pulling segments from userland into
->write(), use default ->lseek() and do actual work on ->close() if
no errors had happened. file->private_data will point to intermediate
state you need.
After all, that's what happens - you form an image, writing to it user-supplied
data from given buffers at given offsets and when you are done with that you
commit the changes. IMO special syscall is less natural match for that
than sequence above - commit-on-close is not something unusual, so it matches
the semantics of all syscalls involved...
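Seen from user space, the sequence above could look roughly like the following sketch. The `struct image_chunk` type and the function names are hypothetical; the only assumption taken from the proposal is that a file such as /proc/image accepts lseek()/write() pairs and commits on close().

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* One piece of the image: len bytes from buf, placed at offset. */
struct image_chunk {
    off_t offset;
    const void *buf;
    size_t len;
};

/* Write each chunk at its offset; 0 on success, -1 on the first failure. */
int write_chunks(int fd, const struct image_chunk *chunks, size_t n)
{
    size_t i;

    assert(chunks != NULL || n == 0);
    for (i = 0; i < n; i++) {
        const char *p = chunks[i].buf;
        size_t left = chunks[i].len;

        if (lseek(fd, chunks[i].offset, SEEK_SET) == (off_t)-1)
            return -1;
        while (left > 0) {
            ssize_t w = write(fd, p, left);
            if (w <= 0)
                return -1;  /* retry on EINTR omitted for brevity */
            p += w;
            left -= (size_t)w;
        }
    }
    return 0;
}

/* open(), a bunch of lseek()/write(), close(). */
int load_image(const char *path, int flags,
               const struct image_chunk *chunks, size_t n)
{
    int fd = open(path, flags, 0600);
    int rc;

    if (fd < 0)
        return -1;
    rc = write_chunks(fd, chunks, n);
    /* Under commit-on-close, a failed close() is treated as a failed load. */
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

For the proposed file the call would be load_image("/proc/image", O_EXCL | O_WRONLY, chunks, n). The flags are a parameter only so that the same code can be tried against an ordinary file.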
Alexander Viro wrote:
> That's not obvious. By the same logic, we would need syscalls for
> turning off overcommit, etc., etc.
Okay okay, add file system specific ioctls and sysctl to my list
of alternative mechanisms :-)
> FWIW, I suspect that
> open("/proc/image", O_EXCL|O_WRONLY);
> bunch of lseek()/write()
> close()
Hmm, interesting. Yes, that should work. One would of course have
to retain the current interface for in-kernel use (e.g. MCORE), but
that's probably okay. Let's see what Eric thinks about it - it's
his code after all.
- Werner
Alan Cox <[email protected]> writes:
> On Tue, 2002-11-05 at 18:00, Werner Almesberger wrote:
> > Yes, I've just checked with Eric, and he hasn't received any
> > indication from Linus so far. I posted a reminder to linux-kernel.
> > I'd really hate to see kexec miss 2.6.
>
> Let me ask the same dumb question - what does kexec need that a dumper
> doesn't. In other words given reboot/trap hooks can kexec happily live
> as a standalone module ?
Kexec primarily needs the reboot/trap hooks in working order, and exported,
for it to live externally to the kernel.
Currently the reboot_notifier call chain is private to sys.c, and is not
exported even to other parts of the kernel.
Even together, device_shutdown and the reboot_notifier do not properly shut down
the CPUs on an SMP system.
Plus we are missing quite a few ->shutdown methods at random places in the kernel, and if
kexec is not easily available someone might not get around to writing
and debugging them.
Plus a system call seems the natural interface for something that
appears to be a reboot.
Eric
Alexander Viro <[email protected]> writes:
> On Tue, 5 Nov 2002, Werner Almesberger wrote:
>
> > Now, if we assume that it's okay for kexec to use a system call,
> > the next question is whether kexec should indeed use it, i.e.
> > whether a system call makes sense for what it is trying to do.
> > Since there are no device files or network elements naturally
> > involved here (i.e. other major kernel function interfaces),
> > the answer seems to be "yes".
>
> That's not obvious. By the same logic, we would need syscalls for
> turning off overcommit, etc., etc.
>
> FWIW, I suspect that
> open("/proc/image", O_EXCL|O_WRONLY);
> bunch of lseek()/write()
> close()
> would be more natural - definitely easier to understand than arguments of
> your sys_kexec(). It's easy to switch from your code to that - you
> put initialization into ->open(), pulling segments from userland into
> ->write(), use default ->lseek() and do actual work on ->close() if
> no errors had happened. file->private_data will point to intermediate
> state you need.
>
> After all, that's what happens - you form an image, writing to it user-supplied
> data from given buffers at given offsets and when you are done with that you
> commit the changes. IMO special syscall is less natural match for that
> than sequence above - commit-on-close is not something unusual, so it matches
> the semantics of all syscalls involved...
First take a look at a ELF header. There is a one to one mapping between
the arguments to kexec and the segments found there.
Second lseek()/write() pairs do not have the capacity to specify holes/bss
segments kexec does, so it would not be a 1 to 1 transform. But I can
live without holes.
Third I am not fully certain it makes sense to implement a function that will
boot into a user specified image remotely. If the export process has
too many capabilities we could be in trouble.
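To illustrate the one-to-one mapping with ELF segments mentioned above, here is a small sketch. The `struct load_segment` type and the function name are hypothetical and are not taken from the kexec patch; `struct load_phdr` is just a subset of the ELF32 program header fields. A segment whose memory size exceeds its file size is the hole/bss case that plain lseek()/write() pairs cannot express.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PT_LOAD_TYPE 1u  /* value of PT_LOAD in the ELF specification */

/* Subset of the ELF32 program header fields that matter here. */
struct load_phdr {
    uint32_t p_type;
    uint32_t p_offset;  /* where the segment's bytes start in the file */
    uint32_t p_paddr;   /* physical load address */
    uint32_t p_filesz;  /* bytes present in the file */
    uint32_t p_memsz;   /* bytes occupied in memory, >= p_filesz */
};

/* Hypothetical per-segment argument for a loader call. */
struct load_segment {
    const void *buf;    /* source bytes */
    size_t bufsz;       /* number of source bytes */
    uint32_t mem;       /* destination physical address */
    size_t memsz;       /* destination size; the excess over bufsz is bss */
};

/* Turn every loadable program header into one segment; returns the count. */
size_t phdrs_to_segments(const unsigned char *file,
                         const struct load_phdr *ph, size_t nph,
                         struct load_segment *out)
{
    size_t i, n = 0;

    assert(file && ph && out);
    for (i = 0; i < nph; i++) {
        if (ph[i].p_type != PT_LOAD_TYPE)
            continue;
        out[n].buf = file + ph[i].p_offset;
        out[n].bufsz = ph[i].p_filesz;
        out[n].mem = ph[i].p_paddr;
        out[n].memsz = ph[i].p_memsz;
        n++;
    }
    return n;
}
```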
Are you arguing for more /proc files? Where does the magic file come
from? I cannot request the allocation of a device number because the
allocation was frozen before 2.4 started. Though char 1 minor 11
seems the obvious choice. Or should it be a magic file in sysfs
instead of procfs? All of these require the code to live someplace
where I need to allocate a place in the namespace. So there is no
inherent advantage over a system call. And unless someone exports the
hooks to properly shutdown the system to modules it is useless.
Given that this is a seldom used system function I agree that it does not
need to be optimized.
I do not have any problem with changing the interface to something
more palatable to other kernel developers. But I will only do it for
one of two reasons: either my patch will not get accepted in any
reasonable time frame and the change makes maintenance easier for me, or
the change makes the interface palatable for acceptance into the kernel.
Neither currently appears to be the case.
----------
Now to dig into the heart of the issue.
I could write the new kernel image into /dev/mem and just jump to
it. Because that is really all I want an interface to do. There
are several practical problems, with something quite that simple.
No kernel shutdown code is run, so I am left with processors flying
all over the place, devices doing all manner of things, after their
device drivers have walked away. Something needs to put the system in
a quiescent state. The fix I call the reboot notifiers, and
device_shutdown. (And then implement a bunch of ->shutdown() methods)
As we all know writing to /dev/mem is not safe because the memory is
being used for other things. So I need a way to safely use memory
during the transition, from one kernel to another.
Personally I would love to be able to allocate one big contiguous
buffer that the kernel is not using and neither is the image I will
eventually load. Then I could just memcpy from that buffer and I
would be done.
Alas memory management in the kernel is done in pages, and can be
fragmented after running for many moons. So I need to allocate all of
my memory in pages, and I need to let the kernel know where it will
all eventually live so I can correctly order the memcpy operations.
Once all the memory copying is sorted out I need to jump to the new
kernel (a kernel being anything that runs without an OS). Logically
all you should have to do is do a single jump instruction but in
practice there is much more that has to be done. The kernel when it
loads up looks around and enables all sorts of cpu optimizations so
the kernel runs as well as possible on the processor. The new kernel
image needs to be given a least common denominator interface so it can
enable what it is prepared to take advantage of. In addition to what
the normal shutdown path can accomplish, on x86 this involves disabling
paging, changing the gdt, changing the idt, and possibly disabling
SMP. It should be possible to enhance device_shutdown so it can
properly disable SMP, though whether that will happen is still up in the
air.
-----------------------------------------
So kexec needs:
- An allocated slot in some namespace.
- The ability to request the kernel devices shut themselves down.
- Buffers that are safe to use.
- The ability to transition the cpu into a state that is suitable
for jumping to another kernel.
- Awareness of its existence.
To some extent every piece of this is intimately tied to the kernel
implementation, from the ability to modify page tables, when jumping
to a new kernel, to the best algorithm for finding a safe memory
buffer, to the proper way to shutdown devices this week, and being
intimately tied to the kernel the code needs to find a home in the
kernel.
Eric
Alan Cox <[email protected]> writes:
> On Tue, 2002-11-05 at 18:00, Werner Almesberger wrote:
> > Yes, I've just checked with Eric, and he hasn't received any
> > indication from Linus so far. I posted a reminder to linux-kernel.
> > I'd really hate to see kexec miss 2.6.
>
> Let me ask the same dumb question - what does kexec need that a dumper
> doesn't. In other words given reboot/trap hooks can kexec happily live
> as a standalone module ?
In replying to another post by Al Viro I managed to think this through.
kexec needs:
- An allocated slot in some namespace.
- The ability to request the kernel devices shut themselves down.
- Buffers that are safe to use.
- The ability to transition the cpu into a state that is suitable
for jumping to another kernel.
- Awareness of its existence.
Most of this code is intimate with how the kernel currently behaves
and needs at least minor adjustments for things like living in PAE
mode.
The safe buffers a kernel can probably avoid.
I cannot see the core functionality of kexec ever living happily as a
standalone module. The kexec code accomplishes nothing. If there is
something useful it does it can probably be moved elsewhere and the
line count reduced. The entire code base is basically obsessed with
getting safe temporary buffers for the new kernel to live in, and
given improvements to how the kernel allocates memory that are
theoretically possible with rmap I could remove that code as well.
The only thing that keeps kexec at all maintainable outside the kernel
is that big fundamental changes do not happen often. But the kernel
must be tracked, closely. I don't see that as a recipe for a
standalone module. I can barely even see making the code a module, or
what the point would be.
The reason kmonte fails in so many cases where kexec succeeds is
precisely because kmonte is a module.
If we add machine_kexec, or something very similar but more
generalized, to the list of exported functions, perhaps kexec could
just have the buffer allocation code and live happily outside of the
kernel. But as it is, if we want to factor kexec into pieces so one
piece can live happily as a standalone module it will take some
serious design work, and a total rethink of the implementation. And
we will still have to add code to the kernel.
Eric
And the question I was building up to, but forgot to ask.
Given that the kexec code is tied intimately to the kernel
implementation.
Given that there is no real advantage in an incremental write
model for kexec users (except not needing to allocate a syscall
number).
Do you see a better way to structure the kexec interface?
Another file in proc, not carefully placed, is just a hair better than
an ioctl. Using /proc is not desirable because there are uses of
kexec that need a very small kernel, and /proc is a pig that is otherwise
useless size bloat.
For some uses including the one that drove me to write it CONFIG_KEXEC
and CONFIG_TINY will both be defined.
Eric
On 5 Nov 2002, Eric W. Biederman wrote:
>
> In replying to another post by Al Viro I managed to think this through.
> kexec needs:
Note that kexec doesn't bother me at all, and I might find myself using it
myself.
From a sanity standpoint, I think the thing already _has_ a system call,
though: clearly "sys_reboot()" is the place to add a case for "reboot into
this image". No? That's where we shut down devices anyway, and it's the
sane place to say "reboot into the kexec image"
Which still leaves you with a real sys_kexec() to actually _load_ the
image, of course. I think loading of the image should be a totally
separate event from the actual booting of the image, since we may want to
load the image early, then do various user-level shutdown (unmounting
etc), and then reboot.
Right now the kexec() stuff seems to mix up the loading and rebooting, but
I didn't take a very deep look, maybe I'm wrong.
Anyway, I don't really get why the kexec() system call would not just be
	void *kexec_image = NULL;
	unsigned long kexec_size;

	int sys_kexec(void *uaddr, size_t len)
	{
		void *new;

		/* CAP_SYS_BOOT is the capability sys_reboot() checks */
		if (!capable(CAP_SYS_BOOT))
			return -EPERM;

		/* Get rid of old image if any.. */
		if (kexec_image) {
			vfree(kexec_image);
			kexec_image = NULL;
		}

		/* Zero length just meant "get rid of it" */
		if (!len)
			return 0;

		if (!access_ok(VERIFY_READ, uaddr, len))
			return -EFAULT;

		new = vmalloc(len);
		if (!new)
			return -ENOMEM;

		/* copy_from_user() returns the number of bytes left uncopied */
		if (copy_from_user(new, uaddr, len)) {
			vfree(new);
			return -EFAULT;
		}

		kexec_image = new;
		kexec_size = len;
		return 0;
	}
and be done with it that way? Then the actual "reboot" (and that would be
in the existing "sys_reboot()") basically just does something like
memcpy(kernelbase, kexec_image, kexec_size);
at the very end (while obviously having to be careful about itself being
out of the way. It can avoid the page table issue by using the "page *"
array that vmalloc uses internally anyway: see "area->pages[]" in
vmalloc).
Note that the two-phase boot means that you can load the new kernel early,
which allows you to later on use it for oops handling (it's a bit late to
try to set up the kernel to be loaded at that time ;)
Linus
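The two-phase idea (load early, boot later) can be modelled in user space as below. This is only a sketch under stated assumptions: malloc()/free() stand in for vmalloc()/vfree(), a caller-supplied buffer stands in for the kernel base, and the names `stage_image` and `boot_staged` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static void *staged_image;   /* NULL when nothing is staged */
static size_t staged_size;

/* Phase 1: replace any staged image with a copy of buf; len 0 just clears. */
int stage_image(const void *buf, size_t len)
{
    void *copy;

    free(staged_image);
    staged_image = NULL;
    staged_size = 0;

    if (len == 0)
        return 0;
    assert(buf != NULL);
    copy = malloc(len);
    if (!copy)
        return -1;
    memcpy(copy, buf, len);
    staged_image = copy;
    staged_size = len;
    return 0;
}

/*
 * Phase 2: copy the staged image to dest. Returns the number of bytes
 * copied, or 0 if nothing is staged or dest is too small.
 */
size_t boot_staged(void *dest, size_t dest_size)
{
    if (!staged_image || staged_size > dest_size)
        return 0;
    memcpy(dest, staged_image, staged_size);
    return staged_size;
}
```

Because the second phase only copies memory that was prepared earlier, it needs no allocation and no access to user space, which is the property that matters for the oops-handling case.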
On Tue, Nov 05, 2002 at 10:25:35PM -0800, Linus Torvalds wrote:
>
> On 5 Nov 2002, Eric W. Biederman wrote:
> >
> > In replying to another post by Al Viro I managed to think this through.
> > kexec needs:
>
> Note that kexec doesn't bother me at all, and I might find myself using it
> myself.
>
> From a sanity standpoint, I think the thing already _has_ a system call,
> though: clearly "sys_reboot()" is the place to add a case for "reboot into
> this image". No? That's where we shut down devices anyway, and it's the
> sane place to say "reboot into the kexec image"
>
> Which still leaves you with a real sys_kexec() to actually _load_ the
> image, of course. I think loading of the image should be a totally
> separate event from the actual booting of the image, since we may want to
> load the image early, then do various user-level shutdown (unmounting
> etc), and then reboot.
>
> Right now the kexec() stuff seems to mix up the loading and rebooting, but
> I didn't take a very deep look, maybe I'm wrong.
>
> Anyway, I don't really get why the kexec() system call would not just be
>
> 	void *kexec_image = NULL;
> 	unsigned long kexec_size;
>
> 	int sys_kexec(void *uaddr, size_t len)
> 	{
> 		void *new;
>
> 		/* CAP_SYS_BOOT is the capability sys_reboot() checks */
> 		if (!capable(CAP_SYS_BOOT))
> 			return -EPERM;
>
> 		/* Get rid of old image if any.. */
> 		if (kexec_image) {
> 			vfree(kexec_image);
> 			kexec_image = NULL;
> 		}
>
> 		/* Zero length just meant "get rid of it" */
> 		if (!len)
> 			return 0;
>
> 		if (!access_ok(VERIFY_READ, uaddr, len))
> 			return -EFAULT;
>
> 		new = vmalloc(len);
> 		if (!new)
> 			return -ENOMEM;
>
> 		/* copy_from_user() returns the number of bytes left uncopied */
> 		if (copy_from_user(new, uaddr, len)) {
> 			vfree(new);
> 			return -EFAULT;
> 		}
>
> 		kexec_image = new;
> 		kexec_size = len;
> 		return 0;
> 	}
>
> and be done with it that way? Then the actual "reboot" (and that would be
> in the existing "sys_reboot()") basically just does something like
>
> memcpy(kernelbase, kexec_image, kexec_size);
>
> at the very end (while obviously having to be careful about itself being
> out of the way. It can avoid the page table issue by using the "page *"
> array that vmalloc uses internally anyway: see "area->pages[]" in
> vmalloc).
>
> Note that the two-phase boot means that you can load the new kernel early,
> which allows you to later on use it for oops handling (it's a bit late to
> try to set up the kernel to be loaded at that time ;)
Yes, that's exactly what we need to support a soft-boot based dump
mechanism, much like the Mission Critical folks split up the bootimg
syscall to do the early load on a sane system, and the actual soft-boot
at crash time. And it fits in naturally as you point out ..
Regards
Suparna
>
> Linus
>
--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Labs, India
Linus Torvalds <[email protected]> writes:
> On 5 Nov 2002, Eric W. Biederman wrote:
> >
> > In replying to another post by Al Viro I managed to think this through.
> > kexec needs:
>
> Note that kexec doesn't bother me at all, and I might find myself using it
> myself.
Good. Just before I saw this message I sent you my patch ported to 2.5.46,
and from the feedback on this one it looks like people would
appreciate a tweak or two.
> From a sanity standpoint, I think the thing already _has_ a system call,
> though: clearly "sys_reboot()" is the place to add a case for "reboot into
> this image". No? That's where we shut down devices anyway, and it's the
> sane place to say "reboot into the kexec image"
>
> Which still leaves you with a real sys_kexec() to actually _load_ the
> image, of course. I think loading of the image should be a totally
> separate event from the actual booting of the image, since we may want to
> load the image early, then do various user-level shutdown (unmounting
> etc), and then reboot.
That sounds reasonable to me. Especially as that lines up a little more
with what the mcore people want as well. Until today I hadn't realized
they were using a spare current to process oopses. For just booting
another kernel all of the staging can currently be done by reading the
new kernel into your process before calling the user-level shutdown code.
> Right now the kexec() stuff seems to mix up the loading and rebooting, but
> I didn't take a very deep look, maybe I'm wrong.
It currently happens all in one step because I had never gotten
feedback that people wanted it in two steps.
> Note that the two-phase boot means that you can load the new kernel early,
> which allows you to later on use it for oops handling (it's a bit late to
> try to set up the kernel to be loaded at that time ;)
Given that, it is definitely a good idea to split the patch up into two
pieces. And a kernel for oops handling should work once all of the other
problems are resolved.
> Anyway, I don't really get why the kexec() system call would not just be
>
> 	void *kexec_image = NULL;
> 	unsigned long kexec_size;
>
> 	int sys_kexec(void *uaddr, size_t len)
> 	{
> 		void *new;
>
> 		/* CAP_SYS_BOOT is the capability sys_reboot() checks */
> 		if (!capable(CAP_SYS_BOOT))
> 			return -EPERM;
>
> 		/* Get rid of old image if any.. */
> 		if (kexec_image) {
> 			vfree(kexec_image);
> 			kexec_image = NULL;
> 		}
>
> 		/* Zero length just meant "get rid of it" */
> 		if (!len)
> 			return 0;
>
> 		if (!access_ok(VERIFY_READ, uaddr, len))
> 			return -EFAULT;
>
> 		new = vmalloc(len);
> 		if (!new)
> 			return -ENOMEM;
>
> 		/* copy_from_user() returns the number of bytes left uncopied */
> 		if (copy_from_user(new, uaddr, len)) {
> 			vfree(new);
> 			return -EFAULT;
> 		}
>
> 		kexec_image = new;
> 		kexec_size = len;
> 		return 0;
> 	}
>
> and be done with it that way? Then the actual "reboot" (and that would be
> in the existing "sys_reboot()") basically just does something like
>
> memcpy(kernelbase, kexec_image, kexec_size);
>
> at the very end (while obviously having to be careful about itself being
> out of the way. It can avoid the page table issue by using the "page *"
> array that vmalloc uses internally anyway: see "area->pages[]" in
> vmalloc).
Using area->pages[] is an interesting idea.
From my current interface, this is missing the following pieces.
1) The address or addresses to load the new kernel at. (Think kernel + ramdisk)
2) The address to jump to start the new kernel.
3) My interesting buffer handling.
The question is how much of that do we need.
Thinking out loud, and hopefully answering your question.
- We need a working stack when the new kernel is jumped to so PIC code
can exist at the entry point.
- An oops processing kernel needs to load at an address other than 1MB,
or at the very least its boot sequence needs to squirrel away the
old contents of the kernel text and data segments, which reside at
1MB, before it moves down to 1MB.
- When we transfer control to the trampoline in machine_kexec we need
to be able to refer to everything with physical addresses.
- I do not see a way out of running my buffer verifier algorithm.
The problem is that I do not want to put complex logic in the
assembly machine_kexec trampoline. So I want to be able to pass
it something it can just memcpy to its final resting place. Which
means the buffer pages either need to be the final resting place of
the new kernel (ideal) or must not be pages that are part of the final
resting place.
- I can dig up area->pages[] but I don't see vmalloc buying me
anything. Doing the copies and allocations a page at a time is not
hard. I have to sort the contents of the pages, and where they
are located so I need to undo the virtual mapping.
area->pages is all struct page *, which is most inconvenient
when you are tearing down page tables, I would need to put the pages
into another data structure that just had the page frame number or
physical page address anyway.
- Once I am using my own data structure to track the pages, and I am
already vetting the pages for safe locations, going the rest of the
way to my current interface is not a big step, and I have already
tested that code.
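The buffer rule in the list above (a buffer page is either already at its final resting place, or it is not part of anyone's final resting place) can be checked with a sketch like the following. The `struct page_move` type and the function names are hypothetical, and the quadratic scan is chosen for clarity, not for speed; this is not the verifier from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One page of the new image: where it sits now and where it must end up. */
struct page_move {
    unsigned long src_pfn;
    unsigned long dst_pfn;
};

/* True if pfn is the destination of any page in the list. */
static bool is_destination(const struct page_move *m, size_t n,
                           unsigned long pfn)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (m[i].dst_pfn == pfn)
            return true;
    return false;
}

/* True if a plain page-by-page memcpy can never clobber a pending source. */
bool layout_is_safe(const struct page_move *m, size_t n)
{
    size_t i;

    assert(m != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (m[i].src_pfn == m[i].dst_pfn)
            continue;       /* already in its final resting place */
        if (is_destination(m, n, m[i].src_pfn))
            return false;   /* some copy would overwrite this source */
    }
    return true;
}
```

When the check passes, the copies can run in any order, which is what keeps the trampoline down to a simple memcpy loop.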
So either I have blinders on, or there is very little percentage in
changing how I load an image. But to make the oops processing easier
I will split it up into two parts.
Then I guess the reasonable thing to do is to modify sys_reboot to
call machine_kexec instead of machine_restart when a kexec_image is
present. Or should I add another magic number, and another case to
sys_reboot?
 	case LINUX_REBOOT_CMD_RESTART:
 		notifier_call_chain(&reboot_notifier_list, SYS_RESTART, NULL);
 		system_running = 0;
 		device_shutdown();
 		printk(KERN_EMERG "Restarting system.\n");
+		if (kexec_image)
+			machine_kexec(kexec_image);
 		machine_restart(NULL);
 		break;
O.k. In the next couple of days I will split the loading, and
executing phase of my kexec code into parts, and resubmit the code.
Then we can dig in on what it takes to make kexec run stably.
Eric
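The two options raised above for sys_reboot (reuse the restart command when an image is staged, or add a separate magic number) differ mainly in how explicit the caller has to be. A small user-space model, with hypothetical command codes and function names that are not the kernel's:

```c
#include <assert.h>
#include <stddef.h>

enum restart_path { PATH_RESTART, PATH_KEXEC };

/* Hypothetical command codes; they stand in for reboot magic numbers. */
enum reboot_cmd { CMD_RESTART, CMD_KEXEC };

/* Option 1: the plain restart command is redirected if an image is staged. */
enum restart_path pick_implicit(enum reboot_cmd cmd, const void *staged)
{
    assert(cmd == CMD_RESTART);
    return staged ? PATH_KEXEC : PATH_RESTART;
}

/*
 * Option 2: a separate command selects kexec explicitly.
 * Returns 0 and sets *path, or -1 if kexec is requested with nothing staged.
 */
int pick_explicit(enum reboot_cmd cmd, const void *staged,
                  enum restart_path *path)
{
    assert(path != NULL);
    if (cmd == CMD_KEXEC) {
        if (!staged)
            return -1;
        *path = PATH_KEXEC;
        return 0;
    }
    *path = PATH_RESTART;
    return 0;
}
```

With the second option a staged image (for example one kept around for oops handling) does not change what an ordinary restart does, at the cost of one more command value.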
On Wed, Nov 06, 2002 at 12:48:36AM -0700, Eric W. Biederman wrote:
> Linus Torvalds <[email protected]> writes:
>
> > On 5 Nov 2002, Eric W. Biederman wrote:
> > >
> > > In replying to another post by Al Viro I managed to think this through.
> > > kexec needs:
> >
> > Note that kexec doesn't bother me at all, and I might find myself using it
> > myself.
>
> Good. Just before I saw this message I sent you my patch ported to 2.5.46,
> and from the feed back on this one it looks like people would
> appreciate a tweak or two.
>
>
> That sounds reasonable to me. Especially as that lines up a little more
> with what the mcore people want as well. Until today I hadn't realized
> they were using a spare current to process oopses. For just booting
> another kernel all of the staging can currently be done by reading the
> new kernel into your process before calling the user-level shutdown code.
>
> > Right now the kexec() stuff seems to mix up the loading and rebooting, but
> > I didn't take a very deep look, maybe I'm wrong.
>
> It currently happens all in one step because I had never gotten
> feedback that people wanted it in two steps.
I'd mentioned it a few times in the context of mcore, but probably
didn't explain myself clearly enough then.
>
> > Note that the two-phase boot means that you can load the new kernel early,
> > which allows you to later on use it for oops handling (it's a bit late to
> > try to set up the kernel to be loaded at that time ;)
>
> Given that it is definitely a good idea to split the patch up into two
> pieces. And a kernel for oops handling should work once all of other
> problems are resolved.
Yes, this fits the model we need.
>
> The question is how much of that do we need.
>
> Thinking out loud, and hopefully answering your question.
> - We need a working stack when the new kernel is jumped to so PIC code
> can exist at the entry point.
>
> - An oops processing kernel needs to load at an address other than 1MB,
> or at the very least it's boot sequence needs to squirrel away the
> old contents of the kernel text and data segments, which reside at
> 1MB, before it moves down to 1MB.
Yes, that bit of memory save logic exists in the mcore mechanism. These
pages are saved away in compressed form in memory and written out
later after dump.
Now to keep these pages from being used by the new kernel until
the dump is safely written out to disk, mcore patches some of
the initialization code to mark these pages (containing the saved
dump) as reserved.
> O.k. In the next couple of days I will split the loading, and
> executing phase of my kexec code into parts, and resubmit the code.
Great !
> The we can dig in on what it takes to make kexec run stably.
>
Regards
Suparna
--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Labs, India
Linus Torvalds <[email protected]> writes:
> >From a sanity standpoint, I think the thing already _has_ a system call,
> though: clearly "sys_reboot()" is the place to add a case for "reboot into
> this image". No? That's where we shut down devices anyway, and it's the
> sane place to say "reboot into the kexec image"
When kexec is separated into two pieces I agree. As I had it
initially in one step it does not look at all like reboot. Now I
just need to think up a new magic number for sys_reboot.
[snip wonderful vision of the theoretical simplicity of sys_kexec].
In case I was not sufficiently clear last night: it could be as
simple as your example code if I replaced vmalloc by
__get_free_pages/alloc_pages, and allocated a large contiguous area of
ram. But MAX_ORDER limits me to 8MB images, and allocating an 8MB
chunk is unreliable, and even a 2MB chunk is dangerous.
So I must use some form of scatter/gather list of pages, like
area->pages[], to make it work. Things get tricky because I gather
(via memcpy) the pages at a location that potentially overlaps the
source pages. So I must walk through the list of pages making certain
that when I gather (memcpy) the buffer pages into their final location I
will not stomp on a buffer page I have not come to yet. Correctly
doing that untangling is where the complexity in kernel/kexec.c comes
from.
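To make the overlap hazard concrete, here is a small user-space sketch. It treats memory as numbered page frames and assumes piece i of the image must land in frame dst_start + i; the function name and that layout are assumptions for illustration, not the logic in kernel/kexec.c. It only detects the first copy that would overwrite a buffer page not yet copied; it does not untangle anything.

```c
#include <assert.h>
#include <stddef.h>

/*
 * src[i] is the page frame currently holding piece i of the image;
 * piece i must end up in frame dst_start + i.  Copying in index order,
 * copy i is unsafe if its destination frame still holds a later piece.
 * Returns the index of the first unsafe copy, or -1 if the order is safe.
 */
static int first_unsafe_copy(const unsigned long *src, size_t n,
			     unsigned long dst_start)
{
	size_t i, j;

	assert(src != NULL || n == 0);
	for (i = 0; i < n; i++) {
		unsigned long dst = dst_start + i;

		for (j = i + 1; j < n; j++)
			if (src[j] == dst)
				return (int)i;
	}
	return -1;
}
```

For example, with buffer pages in frames 10, 11 and 3 and a destination starting at frame 2, the second copy targets frame 3, which still holds the third piece, so a plain in-order memcpy would destroy it.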
Eric
Oliver Xymoron <[email protected]> writes:
> - /tmp-style symlink issues on shared directories
> - vast majority of software (including security tools) ACL-unaware
> - much harder to check for correctness
- surprising inheritance of the ACL of the directory
This is a known problem in NTFS land, and some people suggest that
per-directory ACLs are enough for everyone for exactly this reason.
On Wed, Nov 06, 2002 at 12:48:36AM -0700, Eric W. Biederman wrote:
>
> Then I guess the reasonable thing to do is to modify sys_reboot to
> call machine_kexec instead of machine_restart when a kexec_image is
> present. Or should I add another magic number, and another case to
> sys_reboot?
Given that "bird's-eye" description, why not make a "normal" restart
a particular case of kexec where you just have one kernel loaded
from external storage? It does not seem to be that much
different, although some issues are skipped or taken for granted. Or
am I talking nonsense?
Michal
On Wednesday 06 November 2002 04:07, Eric W. Biederman wrote:
> Personally I would love to be able to allocate one big contiguous
> buffer that the kernel is not using and neither is the image I will
> eventually load. Then I could just memcpy from that buffer and I
> would be done.
>
> Alas memory management in the kernel is done in pages, and can be
> fragmented after running for many moons. So I need to allocate all of
> my memory in pages, and I need to let the kernel know where it will
> all eventually live so I can correctly order the memcpy operations.
Reverse Mappings are cool, and one reason they're cool is, in theory, you can
grab a page of physical memory away from something else. In theory code
could be written to ask the kernel "could you please swap this the heck out,
pin the page in memory, and give it to me instead now?" And it can refuse
("it's already pinned by something else, maybe it's a kernel page, go away"),
it can block a bit ("gotta flush it to disk, wait until DMA is done"), or it
could immediately comply ("it was a clean buffer, have it, keep it, stuff it
and mount it on the wall for all I care...").
This means you can retroactively get contiguous areas of memory by shoving
stuff aside. If it's in use, it'll swap back in immediately. (An obvious
optimization occurs, but that's not necessary for minimal functionality.)
So the whole problem of needing contiguous areas of memory could, in
theory, be addressed using RMAP.
--
http://penguicon.sf.net - Terry Pratchett, Eric Raymond, Pete Abrams, Illiad,
CmdrTaco, liquid nitrogen ice cream, and caffienated jello. Well why not?
Werner Almesberger <[email protected]> writes:
> Alexander Viro wrote:
> > That's not obvious. By the same logics, we would need syscalls for
> > turning off overcommit, etc., etc.
>
> Okay okay, add file system specific ioctls and sysctl to my list
> of alternative mechanisms :-)
>
> > FWIW, I suspect that
> > open("/proc/image", O_EXCL|O_WRONLY);
> > bunch of lseek()/write()
> > close()
>
> Hmm, interesting. Yes, that should work. One would of course have
> to retain the current interface for in-kernel use (e.g. MCORE), but
> that's probably okay. Let's see what Eric thinks about it - it's
> his code after all.
For the record my opinion is there is extra code bloat but it is ok
if it is built as kexecfs. Any other way of getting a magic file
to work with seems currently insane.
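For reference, the "bunch of lseek()/write()" that Al's suggestion implies could look roughly like this from user space. The helper name write_segment is invented here, and /proc/image is only the hypothetical file name from the quoted message; the sketch works on whatever descriptor the caller has opened.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Place one segment of the image at a given offset of an already open
 * file descriptor.  Returns 0 on success or a negative errno value.
 */
static int write_segment(int fd, off_t offset, const void *buf, size_t len)
{
	const char *p = buf;

	assert(fd >= 0 && (buf != NULL || len == 0));
	if (lseek(fd, offset, SEEK_SET) == (off_t)-1)
		return -errno;
	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			return -errno;
		}
		if (n == 0)
			return -EIO;	/* no progress, give up */
		p += n;
		len -= (size_t)n;
	}
	return 0;
}
```

A loader would call this once per segment between open() and close(). Whether the offsets mean physical load addresses is exactly the kind of "magic" such an interface would have to define.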
Eric
I am now officially grumpy. From a code perspective splitting kexec
into two phases, load and execute, is a simple change to make. From a
semantics standpoint things get ugly, and messy. And that means I
can't just dash off another patch.
There are currently 2 cases that it would be nice to have work.
1) Load a new kernel and immediately execute it.
2) Load a new kernel and execute it on panic.
At first glance splitting the code into load and execute phases allows
us to use one mechanism to accomplish both goals. In practice
that does not work. There are 2 problems.
panic does not call sys_reboot; it rolls that functionality by hand.
And to a certain extent it makes sense for panic to take a different
path because we know something is badly wrong so we need to be extra
careful.
In staging the image we allocate a whole pile of pages, and keep them
locked in place, potentially waiting for years until the machine
reboots or panics. This memory is not accounted for anywhere so no
one can see that we have it allocated, which makes debugging hard.
Additionally in locking up megabytes for a long period of time we
create unsolvable fragmentation issues for the mm layer to deal with.
In a unified design I can buffer the image in the anonymous pages of a
user space process just as well as I can in locked down kernel memory.
So factoring sys_kexec into load and execute pieces only helps for
executing the new image on a kernel panic, and that case does not
actually work.
So currently factoring kexec looks like a pointless exercise, that
will just lead to more pain.
I am left with the following questions.
- How should the pages allocated to an early loaded image be accounted
for?
- How do we avoid making life hard for the mm system with an early
loaded image?
- Is it safe to call sys_reboot from panic?
- Can we simply factor out the sequence:
notifier_call_chain(&reboot_notifier_list, SYS_RESTART, NULL);
system_running = 0;
device_shutdown();
And place it into its own subroutine?
- What does the current mcore implementation do? And is that a good
model to follow to resolve some of these issues?
Eric
Eric W. Biederman wrote:
[ Al's FS-based kexec interface ]
> For the record my opinion is there is extra code bloat but it is ok
> if it is built as kexecfs. Any other way of getting a magic file
> to work with seems currently insane.
Yes, such an interface change would only make sense if you couldn't
get the system call, or if there would actually be a useful way for
setting up kexec using "third party" programs. But it seems unlikely
to me that somebody could get all the magic right just by using dd.
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/
On 7 Nov 2002, Eric W. Biederman wrote:
>
> There are currently 2 cases that it would be nice to have work.
> 1) Load a new kernel and immediately execute it.
> 2) Load a new kernel and execute it on panic.
I really don't think (1) is _ever_ a valid thing to do.
The fact is, loading a new kernel wants filesystems and a fully working
system. While executing it wants the filesystems quiescent.
> panic does not call sys_reboot it rolls that functionality by hand.
Forget about panic for now. It's a design issue - it should be possible to
work, but somebody else can do it if the infrastructure is done right.
> In a unified design I can buffer the image in the anonymous pages of a
> user space process just as well as I can in locked down kernel memory.
And in a unified design, I won't apply the patches. It's that simple.
Linus
On 7 Nov 2002, Eric W. Biederman wrote:
>
> In staging the image we allocate a whole pile of pages, and keep them
> locked in place. Waiting for years potentially until the machine
> reboots or panics. This memory is not accounted for anywhere so no
> one can see that we have it allocated, which makes debugging hard.
So how about facing the fact that my "vmalloc()" approach actually solves
all these issues. The memory is visible to the rest of the system (few
things care about it right now, but it _is_ accounted for and things like
/dev/kmem will actually see it etc).
And the vmalloc() approach is even portable, so one of the two phases is
something that is totally generic (and the second phase is almost totally
architecture-dependent anyway).
Linus
On Thu, 2002-11-07 at 00:50, Eric W. Biederman wrote:
> In staging the image we allocate a whole pile of pages, and keep them
> locked in place. Waiting for years potentially until the machine
> reboots or panics. This memory is not accounted for anywhere so no
> one can see that we have it allocated, which makes debugging hard.
> Additionally in locking up megabytes for a long period of time we
> create unsolvable fragmentation issues for the mm layer to deal with.
Just an idea:
Could a new, unrunnable process be created to "hold" the image?
<hand-wave>
Use a hypothetical sys_kexec() to:
1. create an empty process.
2. copy the kernel image and parameters into the process's address
space.
3. put the process to sleep.
</hand-wave>
If it's floating out there for weeks or years, the data could get paged
out and not wired down. It would show up in ps, so you'd have at least
some visibility into the allocation.
Change your mind? Kill the process.
It might be complicated (or unworkable) to handle the panic case
properly, but for the case where a fast reboot is requested by calling
sys_reboot(), one should be able to fault-in and read back the image
from the "kexec holder" process' address space, copying it to the final
destination as you go.
You might even be able to go the next step, and if the process were
crafted carefully, waking it and running it would trigger the "copyin,
quiesce, and go" behavior.
Just a thought.
Andy
On Thu, 2002-11-07 at 11:32, Andy Pfiffer wrote:
> On Thu, 2002-11-07 at 00:50, Eric W. Biederman wrote:
>
> > In staging the image we allocate a whole pile of pages, and keep them
> > locked in place.
> Just an idea:
>
> Could a new, unrunnable process be created to "hold" the image?
>
> <hand-wave>
> Use a hypothetical sys_kexec() to:
> 1. create an empty process.
> 2. copy the kernel image and parameters into the processes' address
> space.
> 3. put the process to sleep.
> </hand-wave>
A further refinement to the above:
1. make sys_kexec() a blocking call. The caller reads the image into
their address space prior to making the call, and passes the same kind
of information (number of segments, segment pointer, etc.) to the
syscall in the same manner. Then, it sets a well-known global variable
that means "there is a kexec image available", and then blocks.
2. add code to sys_reboot() under a CONFIG_KEXEC to check the global
variable in [1) above], and if a kexec image is available, wake the
process in [1) above].
3. the reawakened sys_kexec() then proceeds to copy-in and lay down the
new image in memory, shutdown the devices, and go.
I'm still pondering the kexec-ish reboot after panic() with this kind of
mechanism. Ah well, it's just an idea.
Andy
Andy Pfiffer wrote:
> I'm still pondering the kexec-ish reboot after panic() with this kind of
> mechanism. Ah well, it's just an idea.
Yes, that's where the problems get really nasty. Also, for such
cases, you want the pages to be mlock'ed. Furthermore, you'd
have to tell init about this magic process. (Which would be
tricky, because e.g. sysvinit simply uses kill(-1,...).)
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/
On Thu, 2002-11-07 at 08:50, Eric W. Biederman wrote:
> panic does not call sys_reboot it rolls that functionality by hand.
> And to a certain extent it makes sense for panic to take a different
> path because we know something is badly wrong so we need to be extra
> careful.
However both of them should use the same end point routines and the
hooks should go there
> reboots or panics. This memory is not accounted for anywhere so no
> one can see that we have it allocated, which makes debugging hard.
> Additionally in locking up megabytes for a long period of time we
> create unsolvable fragmentation issues for the mm layer to deal with.
We have an MMU, so if you just do n thousand "get me a page" calls it's
quite happy.
> In a unified design I can buffer the image in the anonymous pages of a
> user space process just as well as I can in locked down kernel memory.
> So factoring sys_kexec in to load and execute pieces only helps for
> executing the new image on a kernel panic, and that case does not
> actually work.
What if your user space is swapped out - you can't page it back in
safely
> - How should the pages allocated to an early loaded image be accounted
> for?
Just get_free_page them - if you can handle it over 4GB then specify
that high pages are fine and kmap them to copy into them - that makes
the VM on giant boxes way happier. For bonus points also adjust the
virtual memory accounting.
> - How do we avoid making life hard for the mm system with an early
> loaded image?
Not really, especially if you are allowing high pages
> - Is it safe to call sys_reboot from panic?
No but both can call sys_machine_restart or whatever
> - Can we simply factor out the sequence:
> notifier_call_chain(&reboot_notifier_list, SYS_RESTART, NULL);
> system_running = 0;
> device_shutdown();
> And place it into it's own subroutine?
Don't do that sequence on a panic IMHO (this is a standing issue, we
should not pass NULL but REBOOT/PANIC/KEXEC/... so the drivers can make
that decision - then we can do it right).
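A toy user-space model of that suggestion, where each handler is told why the system is going down and decides for itself what to do. The enum values, the registration helpers and the two example "drivers" are all made up for this sketch; the real notifier chain API is not used here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reason codes; the mail only names REBOOT/PANIC/KEXEC. */
enum shutdown_reason { REASON_REBOOT, REASON_PANIC, REASON_KEXEC };

/* A handler returns 1 if it did real work for this reason, else 0. */
typedef int (*shutdown_fn)(enum shutdown_reason why);

#define MAX_HANDLERS 8
static shutdown_fn handlers[MAX_HANDLERS];
static size_t nr_handlers;

static int register_shutdown_handler(shutdown_fn fn)
{
	assert(fn != NULL);
	if (nr_handlers == MAX_HANDLERS)
		return -1;
	handlers[nr_handlers++] = fn;
	return 0;
}

/* Call every handler with the reason; returns how many did real work. */
static int call_shutdown_handlers(enum shutdown_reason why)
{
	size_t i;
	int acted = 0;

	for (i = 0; i < nr_handlers; i++)
		acted += handlers[i](why);
	return acted;
}

/* Example driver: flushes on reboot or kexec, stays out of the way on panic. */
static int disk_shutdown(enum shutdown_reason why)
{
	return why == REASON_PANIC ? 0 : 1;
}

/* Example driver: quiesces its hardware whatever the reason. */
static int nic_shutdown(enum shutdown_reason why)
{
	(void)why;
	return 1;
}
```

The point of the design is that the policy lives in the drivers: the caller passes the reason through and does not need to know which devices are safe to touch during a panic.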
Alan
There are two cases I am seeing users wanting.
1) Load a new kernel on panic.
   - Extra care must be taken so what broke the first kernel does
     not break this one, and so that the shards of the old kernel
     do not break it.
   - Care must be taken so that loading the second kernel does not
     erase valuable data that is desirable to place in a crash dump.
   - This kernel cannot live at the same address as the old one, (at
     least not initially).
2) Load a new kernel under normal operating conditions.
   And when you have a normal user space that boils down to:
   - Acquire the kernel you are going to boot.
   - Run the user space shutdown scripts, so the system is in
     a clean state.
   - Execute the new kernel.
   - The normal case is that the newly loaded kernel will live at the
     same physical location where the current kernel lives.
Currently my code handles starting a new kernel under normal operating
conditions. There are currently two ways I can implement a clean user
space shutdown without needing locked buffers in the kernel until the
very last moment.
Method 1 (This works today with my sample user space):
- copy the kernel to /newkernel
- init 6
- if [ -r /newkernel ]; then
      /sbin/kexec /newkernel
  else
      /sbin/reboot
  fi
- /sbin/kexec reads in /newkernel
- /newkernel is parsed to figure out how it should be loaded
- sys_kexec is called to copy the kernel from user space anonymous
  memory to temporary kernel buffers.
Method 2 (For people with read only roots):
- /sbin/delayed_kexec /path/to/new/kernel
- Read in the /path/to/new/kernel into anonymous pages
- Parse it and figure out how it should be loaded
- Run the shutdown scripts from /etc/rc6.d/
- Call sys_kexec, which will copy the data from user space anonymous
  pages, to kernel space.
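The "read the kernel into anonymous pages" step common to both methods can be sketched in plain C as below. The function name read_image is an assumption, and heap memory from realloc stands in for the anonymous pages; the parsing and the sys_kexec call are deliberately left out.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Read a whole file into heap memory.  On success returns the buffer
 * (caller frees it) and stores its length in *size_out; returns NULL
 * if the file cannot be opened or read, or memory runs out.
 */
static void *read_image(const char *path, size_t *size_out)
{
	FILE *f;
	char *buf = NULL;
	size_t used = 0, cap = 0;

	assert(path != NULL && size_out != NULL);
	f = fopen(path, "rb");
	if (!f)
		return NULL;
	for (;;) {
		size_t n;

		if (used == cap) {
			/* Grow geometrically so large images stay cheap. */
			size_t ncap = cap ? cap * 2 : 4096;
			char *nbuf = realloc(buf, ncap);

			if (!nbuf)
				goto fail;
			buf = nbuf;
			cap = ncap;
		}
		n = fread(buf + used, 1, cap - used, f);
		used += n;
		if (n == 0)
			break;	/* end of file or read error */
	}
	if (ferror(f))
		goto fail;
	fclose(f);
	*size_out = used;
	return buf;
fail:
	free(buf);
	fclose(f);
	return NULL;
}
```

Once the image is held this way the original file is no longer needed, which is what lets Method 2 work with a root file system that is read only or already unmounted.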
This is to just make it clear that I am not working from a
FUNDAMENTALLY BROKEN interface, nor from a broken model of machine
maintenance. I am quite willing to make changes assuming I understand
what is gained with the change.
What I currently support is a moderately nice interface, that has the
kernel doing as much as it can without being bogged down in the
specific details of any one file format, or needing something besides
a 32bit entry point to jump to.
I model an image as a set of segments of physical memory. And I copy
the image loaded with sys_kexec to its final location, before jumping
to the new image. There are two reasons for this. It takes 3
segments to load a bzImage (setup.S, vmlinux, and an initrd). And an
arbitrary number of segments maps cleanly to a static ELF binary.
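One way to picture "a set of segments of physical memory" is the structure below, together with a sanity check a loader might run before accepting an image. The field names, the 4096-byte page size and the page-alignment rule are assumptions of this sketch; the mail does not show the actual structure.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL

/* Hypothetical layout: one chunk of the image and where it must land. */
struct segment {
	const void *buf;	/* data in the caller's memory */
	size_t bufsz;		/* bytes of data to copy */
	unsigned long mem;	/* physical destination address */
	size_t memsz;		/* bytes reserved at the destination */
};

/*
 * Returns 0 if every segment is page aligned, its data fits the
 * destination, and no two destinations overlap; -1 otherwise.
 */
static int check_segments(const struct segment *seg, size_t n)
{
	size_t i, j;

	assert(seg != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (seg[i].mem % SKETCH_PAGE_SIZE ||
		    seg[i].memsz % SKETCH_PAGE_SIZE)
			return -1;
		if (seg[i].bufsz > seg[i].memsz)
			return -1;
		for (j = 0; j < i; j++) {
			unsigned long a = seg[i].mem, b = seg[j].mem;

			/* Half-open ranges [a, a+memsz) and [b, b+memsz). */
			if (a < b + seg[j].memsz && b < a + seg[i].memsz)
				return -1;
		}
	}
	return 0;
}
```

A bzImage would be described by three such entries and a static ELF binary by one entry per loadable program header, which is why an arbitrary segment count covers both.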
There is only one difficult case: what happens when the buffers the
kernel allocates are physically in one of the segments of memory of
the new kernel image? This is possible especially for the initrd, which
is loaded at the end of memory.
I then use the following algorithm to sort the potential mess out
before I jump to the new code. And since this code depends on
swapping the contents of pages, knowing the physical location of
the pages, and is not limited to 128MB, I am reluctant to look at a
vmalloc variant. I can more easily get my pages from a slab if I need
to report I have them.
static int kimage_get_off_destination_pages(struct kimage *image)
{
	kimage_entry_t *ptr, *cptr, entry;
	unsigned long buffer, page;
	unsigned long destination = 0;

	/* Here we implement safeguards to ensure that
	 * a source page is not copied to its destination
	 * page before the data on the destination page is
	 * no longer useful.
	 *
	 * To make it work we actually wind up with a
	 * stronger condition. For every page considered
	 * it is either its own destination page or it is
	 * not a destination page of any page considered.
	 *
	 * Invariants
	 * 1. buffer is not a destination of a previous page.
	 * 2. page is not a destination of a previous page.
	 * 3. destination is not a previous source page.
	 *
	 * Result: Either a source page and a destination page
	 * are the same or the page is not a destination page.
	 *
	 * These checks could be done when we allocate the pages,
	 * but doing it as a final pass allows us more freedom
	 * on how we allocate pages.
	 *
	 * Also while the checks are necessary, in practice nothing
	 * happens. The destination kernel wants to sit in the
	 * same physical addresses as the current kernel so we never
	 * actually allocate a destination page.
	 *
	 * BUGS: This is an O(N^2) algorithm.
	 */

	buffer = __get_free_page(GFP_KERNEL);
	if (!buffer) {
		return -ENOMEM;
	}
	buffer = virt_to_phys((void *)buffer);
	for_each_kimage_entry(image, ptr, entry) {
		/* Here we check to see if an allocated page */
		kimage_entry_t *limit;
		if (entry & IND_DESTINATION) {
			destination = entry & PAGE_MASK;
		}
		else if (entry & IND_INDIRECTION) {
			/* Indirection pages must include all of their
			 * contents in limit checking.
			 */
			limit = phys_to_virt(page + PAGE_SIZE - sizeof(*limit));
		}
		if (!((entry & IND_SOURCE) | (entry & IND_INDIRECTION))) {
			continue;
		}
		page = entry & PAGE_MASK;
		limit = ptr;

		/* See if a previous page has the current page as its
		 * destination.
		 * i.e. invariant 2
		 */
		cptr = kimage_dst_conflict(image, page, limit);
		if (cptr) {
			unsigned long cpage;
			kimage_entry_t centry;
			centry = *cptr;
			cpage = centry & PAGE_MASK;
			memcpy(phys_to_virt(buffer), phys_to_virt(page), PAGE_SIZE);
			memcpy(phys_to_virt(page), phys_to_virt(cpage), PAGE_SIZE);
			*cptr = page | (centry & ~PAGE_MASK);
			*ptr = buffer | (entry & ~PAGE_MASK);
			buffer = cpage;
		}
		if (!(entry & IND_SOURCE)) {
			continue;
		}

		/* See if a previous page is our destination page.
		 * If so claim it now.
		 * i.e. invariant 3
		 */
		cptr = kimage_src_conflict(image, destination, limit);
		if (cptr) {
			unsigned long cpage;
			kimage_entry_t centry;
			centry = *cptr;
			cpage = centry & PAGE_MASK;
			memcpy(phys_to_virt(buffer), phys_to_virt(cpage), PAGE_SIZE);
			memcpy(phys_to_virt(cpage), phys_to_virt(page), PAGE_SIZE);
			*cptr = buffer | (centry & ~PAGE_MASK);
			*ptr = cpage | ( entry & ~PAGE_MASK);
			buffer = page;
		}

		/* If the buffer is my destination page do the copy now
		 * i.e. invariant 3 & 1
		 */
		if (buffer == destination) {
			memcpy(phys_to_virt(buffer), phys_to_virt(page), PAGE_SIZE);
			*ptr = buffer | (entry & ~PAGE_MASK);
			buffer = page;
		}
	}
	free_page((unsigned long)phys_to_virt(buffer));
	return 0;
}
static kimage_entry_t *kimage_dst_conflict(
	struct kimage *image, unsigned long page, kimage_entry_t *limit)
{
	kimage_entry_t *ptr, entry;
	unsigned long destination = 0;
	for_each_kimage_entry(image, ptr, entry) {
		if (ptr == limit) {
			return 0;
		}
		else if (entry & IND_DESTINATION) {
			destination = entry & PAGE_MASK;
		}
		else if (entry & IND_SOURCE) {
			if (page == destination) {
				return ptr;
			}
			destination += PAGE_SIZE;
		}
	}
	return 0;
}
static kimage_entry_t *kimage_src_conflict(
	struct kimage *image, unsigned long destination, kimage_entry_t *limit)
{
	kimage_entry_t *ptr, entry;
	for_each_kimage_entry(image, ptr, entry) {
		unsigned long page;
		if (ptr == limit) {
			return 0;
		}
		else if (entry & IND_DESTINATION) {
			/* nop */
		}
		else if (entry & IND_DONE) {
			/* nop */
		}
		else {
			/* SOURCE & INDIRECTION */
			page = entry & PAGE_MASK;
			if (page == destination) {
				return ptr;
			}
		}
	}
	return 0;
}
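The code above relies on each entry packing a page address in its high bits and a type flag in its low bits, with a DESTINATION entry setting the running target and every SOURCE entry advancing it by one page. A stand-alone sketch of that bookkeeping over a flat list follows. The SK_* flag values and the flat, DONE-terminated layout are assumptions made here; indirection pages are left out entirely.

```c
#include <assert.h>
#include <stddef.h>

#define SK_PAGE_SIZE	4096UL
#define SK_PAGE_MASK	(~(SK_PAGE_SIZE - 1))

/* Flag values are assumptions for this sketch; the mail does not show them. */
#define SK_DESTINATION	0x1UL
#define SK_DONE		0x4UL
#define SK_SOURCE	0x8UL

typedef unsigned long sk_entry_t;

/*
 * Walk a flat, DONE-terminated list and report where source page `page`
 * is headed.  Returns 0 and stores the destination on success, -1 if
 * the page is not a source in the list.
 */
static int sk_destination_of(const sk_entry_t *list, unsigned long page,
			     unsigned long *dest_out)
{
	unsigned long destination = 0;
	const sk_entry_t *ptr;

	assert(list != NULL && dest_out != NULL);
	for (ptr = list; !(*ptr & SK_DONE); ptr++) {
		if (*ptr & SK_DESTINATION) {
			/* Start of a new run of consecutive target pages. */
			destination = *ptr & SK_PAGE_MASK;
		} else if (*ptr & SK_SOURCE) {
			if ((*ptr & SK_PAGE_MASK) == page) {
				*dest_out = destination;
				return 0;
			}
			/* Each source consumes one destination page. */
			destination += SK_PAGE_SIZE;
		}
	}
	return -1;
}
```

This is the same running-destination walk that kimage_dst_conflict performs, turned around to answer "where does this source page go" rather than "which source page goes here".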
Having had time to digest the idea of starting a new kernel on panic,
I can now make some observations on what I believe it would take to
make it as robust as possible.
- On panic because random pieces of the kernel may be broken we want
to use as little of the kernel as possible.
- Therefore machine_kexec should not allocate any memory, and as
quickly as possible it should transition to the new kernel
- So a new page table should be sitting around with the new kernel
already mapped, and likewise other important tables like the
gdt, and the idt, should be pre-allocated.
- Then machine_kexec can just switch stacks, page tables, and other
machine dependent tables and jump to the new kernel.
- The load stage needs to load everything at the physical location it
will initially run at. This would likely need support from rmap.
- The load stage needs to preallocate page tables and buffers.
- The load stage would likely work easiest by either requiring a mem=xxx
line, reserving some of physical memory for the new kernel. Or
alternatively using some rmap support to clear out a swath of
physical memory the new kernel can be loaded into.
- The new kernel loaded on panic must know about the previous kernel,
and have various restrictions because of that.
Supporting a kernel loaded from a normal environment is a rather
different problem.
- It cannot be loaded at its run location (The current kernel is
sitting there).
- It should not need to know about the previously executing kernel.
- Work can be done in machine_kexec to allocate memory so everything
does not need to be preallocated.
- I can safely use multiple calls to the page allocator instead of
needing a special mechanism to allocate memory.
And now I go back to the silly exercise of factoring my code so the
new kernel can be kept in locked kernel memory, instead of in a file
while the shutdown scripts are run.
Unless the linux kernel is modified to copy itself to the top of
physical memory when it loads I have trouble seeing how any of this
will help make the panic case easier to implement.
Eric
On 9 Nov 2002, Eric W. Biederman wrote:
>
> Currently my code handles starting a new kernel under normal operating
> conditions. There are currently two ways I can implement a clean user
> space shutdown with out needing locked buffers in the kernel until the
> very last moment.
PLEASE tell me why you don't just use the 20-line "vmalloc()" function I
already wrote for you?
It works for all cases - and since you do need to load the kernel into
memory anyway, it's not using any more memory than your existing code. And
it's infinitely more flexible to have a clearly separated load-process,
than having to have some load happen at reboot time (whether by init or by
anything else).
And since the kernel is fully working at the load time, you can even do
things like swap out pages in order to make room for the kernel at the
right place. So you can even do something like this:
	int alloc_kernel_pages(unsigned long *array, int nr_pages,
			       unsigned long min_address)
	{
		void *bad_page_list = NULL;
		int i = 0, retval;

		while (i < nr_pages) {
			unsigned long page = __get_free_page(GFP_USER);
			if (!page)
				goto oom;
			if (page < min_address) {
				*(void **)page = bad_page_list;
				bad_page_list = (void *)page;
				continue;
			}
			array[i] = page;
			i++;
		}
		retval = 0;
	out:
		while (bad_page_list) {
			unsigned long page = (unsigned long) bad_page_list;
			bad_page_list = *(void **)bad_page_list;
			free_page(page);
		}
		return retval;
	oom:
		while (i > 0)
			free_page(array[--i]);
		retval = -ENOMEM;
		goto out;
	}
and now you are guaranteed that all the allocated pages are above a
certain mark (change the "min_address" to be a "validity callback" or
whatever if you want to be fancy and allow arbitrary rules, which is good
if you want to allow pages in the low 1M on x86, for example), which means
that your final reboot stage is _much_much_ simpler and you don't ever
have to worry about overlap.
Use one of the pages to allocate the memcpy() trampoline and the actual
data structures used for the copy, for example. Use the rest for the
actual kernel data.
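The "validity callback" variant mentioned above can be tried out in user space with a toy allocator. Everything below (the 16-frame pool, toy_get_free_frame, alloc_ok_frames, the example rule) is invented for illustration and only mirrors the shape of alloc_kernel_pages; rejected frames are remembered in a small array rather than threaded through the pages themselves.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_FRAMES 16

/* Toy allocator: frame numbers 0..15, handed out lowest-free-first. */
static unsigned char frame_used[POOL_FRAMES];

static long toy_get_free_frame(void)
{
	long f;

	for (f = 0; f < POOL_FRAMES; f++)
		if (!frame_used[f]) {
			frame_used[f] = 1;
			return f;
		}
	return -1;	/* out of memory */
}

static void toy_free_frame(long f)
{
	assert(f >= 0 && f < POOL_FRAMES && frame_used[f]);
	frame_used[f] = 0;
}

typedef int (*frame_ok_fn)(long frame);

/*
 * Keep allocating until nr acceptable frames are collected.  Rejected
 * frames are held until the end so the allocator cannot hand them out
 * again, then released.  Returns 0 on success, -1 if the pool runs dry.
 */
static int alloc_ok_frames(long *array, int nr, frame_ok_fn ok)
{
	long rejected[POOL_FRAMES];
	int nr_rejected = 0, i = 0, ret = 0;

	while (i < nr) {
		long f = toy_get_free_frame();

		if (f < 0) {
			while (i > 0)
				toy_free_frame(array[--i]);
			ret = -1;
			break;
		}
		if (ok(f))
			array[i++] = f;
		else
			rejected[nr_rejected++] = f;
	}
	while (nr_rejected > 0)
		toy_free_frame(rejected[--nr_rejected]);
	return ret;
}

/* Example rule: stay clear of the frames the new kernel will occupy. */
static int above_frame_4(long frame)
{
	return frame >= 4;
}
```

Holding on to the rejected frames until the loop finishes is the important detail: freeing them immediately would let the allocator return the same unsuitable frame over and over.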
Keep it simple.
Linus
{warning: cc: list too large :}
On 9 Nov 2002, Eric W. Biederman wrote:
| There are two cases I am seeing users wanting.
| 1) Load a new kernel on panic.
| - Extra care must be taken so what broke the first kernel does
| not break this one, and so that the shards of the old kernel
| do not break it.
| - Care must be taken so that loading the second kernel does not
| erase valuable data that is desirable to place in a crash dump.
| - This kernel cannot live at the same address as the old one, (at
| least not initially).
Conceptually we would like a new kernel on panic, although
I doubt that it's normally safe to "load a new kernel on panic."
Or maybe it depends on the definition of "load."
What I'm trying to say is that I think the new kernel must
already be loaded when the panic happens.
Is that what you describe later (below)?
| 2) Load a new kernel under normal operating conditions.
| And when you have a normal user space that boils down to:
| - Acquire the kernel you are going to boot.
| - Run the user space shutdown scripts, so the system is in
| a clean state.
| - Execute the new kernel.
| - The normal case is that the newly loaded kernel will live at the
| same physical location where the current kernel lives.
|
|
| Currently my code handles starting a new kernel under normal operating
| conditions. There are currently two ways I can implement a clean user
| space shutdown with out needing locked buffers in the kernel until the
| very last moment.
|
| Method 1 (This works today with my sample user space):
| - copy the kernel to /newkernel
| - init 6
| - if [ -r /newkernel ]; then
| /sbin/kexec /newkernel
| else
| /sbin/reboot
| fi
| - /sbin/kexec reads in /newkernel
| - /newkernel is parsed to figure out how it should be loaded
| - sys_kexec is called to copy the kernel from user space anonymous
| memory to temporary kernel buffers.
|
| Method 2 (For people with read only roots):
| - /sbin/delayed_kexec /path/to/new/kernel
| - Read in the /path/to/new/kernel into anonymous pages
| - Parse it and figure out how it should be loaded
| - Run the shutdown scripts from /etc/rc6.d/
| - Call sys_kexec, which will copy the data from user space anonymous
| pages, to kernel space.
|
| This is to just make it clear that I am not working from a
| FUNDAMENTALLY BROKEN interface, nor from a broken model of machine
| maintenance. I am quite willing to make changes assuming I understand
| what is gained with the change.
|
|
| What I currently support is a moderately nice interface, that has the
| kernel doing as much as it can without being bogged down in the
| specific details in any one file format, or needing something besides
| a 32bit entry point to jump to.
|
| I model an image as a set of segments of physical memory. And I copy
| the image loaded with sys_kexec to it's final location, before jumping
| to the new image. There are two reasons for this. It takes 3
| segments to load a bzImage (setup.S, vmlinux, and an initrd). And an
| arbitrary number of segments maps cleanly to a static ELF binary.
|
| There is only one difficult case. What happens when the buffers the
| kernel allocates are physically in one of the segments of memory of
| the new kernel image. Possible especially for the initrd which is
| loaded at the end of memory.
|
| I then use the following algorithm to sort the potential mess out
| before I jump to the new code. And since this code depends on
| swapping the contents of pages, knowing the physical location of
| the pages, and is not limited to 128MB I am reluctant to look a
| vmalloc variant. I can more get my pages from a slab if I need to
| report I have them.
|
[code deleted]
|
| Having had time to digest the idea of starting a new kernel on panic
| I can now make some observations and what I believe it would take to
| make it as robust as possible.
|
| - On panic because random pieces of the kernel may be broken we want
| to use as little of the kernel as possible.
|
| - Therefore machine_kexec should not allocate any memory, and as
| quickly as possible it should transition to the new kernel
|
| - So a new page table should be sitting around with the new kernel
| already mapped, and likewise other important tables like the
| gdt, and the idt, should be pre-allocated.
|
| - Then machine_kexec can just switch stacks, page tables, and other
| machine dependent tables and jump to the new kernel.
|
| - The load stage needs to load everything at the physical location it
| will initially run at. This would likely need support from rmap.
|
| - The load stage needs to preallocate page tables and buffers.
|
| - The load stage would likely work easiest by either requiring a mem=xxx
| line, reserving some of physical memory for the new kernel. Or
| alternatively using some rmap support to clear out a swath of
| physical memory the new kernel can be loaded into.
|
| - The new kernel loaded on panic must know about the previous kernel,
| and have various restrictions because of that.
|
|
| Supporting a kernel loaded from a normal environment is a rather
| different problem.
|
| - It cannot be loaded at its run location (The current kernel is
| sitting there).
|
| - It should not need to know about the previously executing kernel.
|
| - Work can be done in machine_kexec to allocate memory so everything
| does not need to be pre allocated.
|
| - I can safely use multiple calls to the page allocator instead of
| needing a special mechanism to allocate memory.
|
|
| And now I go back to the silly exercise of factoring my code so the
| new kernel can be kept in locked kernel memory, instead of in a file
| while the shutdown scripts are run.
|
| Unless the linux kernel is modified to copy itself to the top of
| physical memory when it loads I have trouble seeing how any of this
| will help make the panic case easier to implement.
|
| Eric
| -
--
~Randy
Eric W. Biederman wrote:
> - Extra care must be taken so what broke the first kernel does
> not break this one, and so that the shards of the old kernel
> do not break it.
For this, you should checksum the data that you've pre-loaded, and
verify it before rebooting. If the pre-loaded kernel has been hit,
you just do a normal reboot. (In the case of a bzImage, you'd
probably fail uncompression anyway.)
Alternatively, you could also wire this into the uncompression
functions (i.e. reboot if bzImage or initrd don't uncompress
cleanly), but this would be more intrusive.
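A minimal sketch of the checksum-and-verify step suggested above. CRC-32 is an assumption here, since no particular checksum is named in the mail; crc32_sketch, choose_reboot_path, and the enum values are hypothetical names used only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320). Using CRC-32 is an
 * assumption; any checksum would illustrate the same idea. */
uint32_t crc32_sketch(const unsigned char *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ ((crc & 1u) ? 0xEDB88320u : 0u);
    }
    return crc ^ 0xFFFFFFFFu;
}

enum reboot_path { REBOOT_NORMAL, REBOOT_KEXEC };

/* Compare the pre-loaded image against the checksum recorded at load
 * time; fall back to a normal reboot if the image has been hit. */
enum reboot_path choose_reboot_path(const unsigned char *image, size_t len,
                                    uint32_t recorded)
{
    return crc32_sketch(image, len) == recorded ? REBOOT_KEXEC
                                                : REBOOT_NORMAL;
}
```

The checksum would be recorded once when the image is pre-loaded and checked again just before the jump; the same check could cover memory copied aside for a crash dump.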
> - Care must be taken so that loading the second kernel does not
> erase valuable data that is desirable to place in a crash dump.
Or copy all "interesting" memory to a safe place before the kexec.
I don't quite like the idea of building a kernel that "knows" which
addresses it isn't supposed to touch, and I think being able to use
the same kernel binary for regular and panic use would be a
desirable feature.
Also, firmware may not give you the choice of preserving all memory,
so you need that "copy memory to a safe place" functionality anyway.
Furthermore, you most likely want to checksum that memory, too.
But ... I think you're designing too far ahead. The "load kernel on
panic" part isn't trivial, and I think it would be better to tackle
this in a second phase. For now, having a reasonably generic kexec
mechanism would be all that's needed in terms of building blocks.
> Method 2 (For people with read only roots):
> - /sbin/delayed_kexec /path/to/new/kernel
> - Read in the /path/to/new/kernel into anonymous pages
There's no delayed_kexec in kexec-tools 1.4, so let me guess how
this would work: as far as I know, there's no way for regular
user space to create a persistent unreferenced memory object, so
you'd probably load the data, perhaps mlock the pages, and then
fork a process that keeps the data in memory. Then, this process
would probably call sys_kexec upon reception of a signal, or
such.
Unfortunately, init assumes that it can SIGKILL all non-init
processes (that is, all processes with PID != 1). Worse yet, this
assumption makes sense, because walking the process list and
killing each of them individually would be racy.
So you'd either have to add this race condition to init, add some
magic to make this type of killing atomic, teach the kernel that
your kexec memory keeper process is somehow magic too, or merge
kexec into init. Not nice.
> I then use the following algorithm to sort the potential mess out
> before I jump to the new code.
I like this approach. It gives you complete freedom of where to
load data. This also makes it future-proof. But I don't see the
reason why you couldn't do the same thing with vmalloc. Using
vmalloc may actually simplify your code a little.
> Having had time to digest the idea of starting a new kernel on panic
> I can now make some observations and what I believe it would take to
> make it as robust as possible.
That pretty much sums it up, yes. But as I've said, this isn't
really something that needs to be implemented at the same time
as the basic kexec functionality. A two-phase kexec with
unrestricted copying capabilities should be a good enough
building block that only minor changes, if any, would be needed
when adding kexec-on-panic.
> And now I go back to the silly exercise of factoring my code so the
> new kernel can be kept in locked kernel memory, instead of in a file
> while the shutdown scripts are run.
Not silly :-)
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/
Linus Torvalds <[email protected]> writes:
> On 9 Nov 2002, Eric W. Biederman wrote:
> >
> > Currently my code handles starting a new kernel under normal operating
> > conditions. There are currently two ways I can implement a clean user
> > space shutdown without needing locked buffers in the kernel until the
> > very last moment.
>
> PLEASE tell me why you don't just use the 20-line "vmalloc()" function I
> already wrote for you?
The reasons I don't jump on board:
- It does not handle multiple segments.
Without multiple segments I think I simply leave out essential complexity
of the problem. A bzImage has at least 2.
- vmalloc is artificially limited to 128MB.
- I still have to run code to prevent imperfect overlaps. A perfect
overlap being a source buffer living in its destination address.
- I still have to run code to find the physical addresses of the
pages, and locate those in non-destination pages, and form a linked
list of pages for that.
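A minimal sketch of the perfect/imperfect overlap distinction from the list above, assuming page-granular physical addresses. The struct, the enum, and both function names are hypothetical and are not taken from the kexec patch.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL

/* One destination range of the new image (hypothetical layout). */
struct dest_segment {
    unsigned long start;   /* physical start address */
    unsigned long len;     /* length in bytes */
};

enum overlap_kind { OVERLAP_NONE, OVERLAP_PERFECT, OVERLAP_IMPERFECT };

/* True if the page starting at `page` intersects any destination range. */
bool page_in_destination(const struct dest_segment *segs, size_t nr,
                         unsigned long page)
{
    for (size_t i = 0; i < nr; i++) {
        unsigned long end = segs[i].start + segs[i].len;

        if (page < end && page + SKETCH_PAGE_SIZE > segs[i].start)
            return true;
    }
    return false;
}

/* A source page that already sits at its own destination is a perfect
 * overlap and needs no copy. A source page inside some other part of a
 * destination range is an imperfect overlap and must be moved first. */
enum overlap_kind classify_source_page(const struct dest_segment *segs,
                                       size_t nr, unsigned long src_page,
                                       unsigned long dst_page)
{
    if (src_page == dst_page)
        return OVERLAP_PERFECT;
    if (page_in_destination(segs, nr, src_page))
        return OVERLAP_IMPERFECT;
    return OVERLAP_NONE;
}
```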
> It works for all cases - and since you do need to load the kernel into
> memory anyway, it's not using any more memory than your existing code. And
> it's infinitely more flexible to have a clearly separated load-process,
> than having to have some load happen at reboot time (whether by init or by
> anything else).
I am trying to process it but I don't see why having the load happen
as a separate syscall is clearer. Having it happen as a separate
architecture independent function I understand.
asmlinkage long sys_kexec(unsigned long entry, long nr_segments,
        struct kexec_segment *segments)
{
        /* Am I using too much stack space here? */
        struct kimage image;
        int result;
        /* We only trust the superuser with rebooting the system. */
        if (!capable(CAP_SYS_BOOT))
                return -EPERM;
        lock_kernel();
        //// This chunk does the load and there is no kernel shutdown code
        //// run yet.
        kimage_init(&image);
        result = do_kexec(entry, nr_segments, segments, &image);
        if (result) {
                kimage_free(&image);
                unlock_kernel();
                return result;
        }
        //// ----------- I can snip here for your two syscall version -----------
        //// This part is the kernel shutdown
        /* The point of no return is here... */
        notifier_call_chain(&reboot_notifier_list, SYS_RESTART, NULL);
        system_running = 0;
        device_shutdown();
        printk(KERN_EMERG "Starting new kernel\n");
        //// And here is where I start the new kernel.
        machine_kexec(&image);
}
>
> And since the kernel is fully working at the load time, you can even do
> things like swap out pages in order to make room for the kernel at the
> right place. So you can even do something like this:
I have clearly separated load code, that runs before any of the kernel
starts to shutdown. Until it completes successfully I do not start
to shutdown the kernel. My user space is shut down but that is a
different story.
Swapping out pages is nice, but when user space is shutdown there
shouldn't be any extra pages in the kernel to swap out, and if you are
that tight on memory that you need to swap it won't work, anyway.
> int alloc_kernel_pages(unsigned long *array, int nr_pages,
>         unsigned long min_address)
> {
>         void *bad_page_list = NULL;
>         int i = 0, retval;
>
>         while (i < nr_pages) {
>                 unsigned long page = __get_free_page(GFP_USER);
>
>                 if (!page)
>                         goto oom;
>
>                 if (page < min_address) {
>                         *(void **)page = bad_page_list;
>                         bad_page_list = (void *)page;
>                         continue;
>                 }
>                 array[i] = page;
>                 i++;
>         }
>         retval = 0;
> out:
>         while (bad_page_list) {
>                 unsigned long page = (unsigned long) bad_page_list;
>                 bad_page_list = *(void **)bad_page_list;
>                 free_page(page);
>         }
>         return retval;
> oom:
>         while (i > 0)
>                 free_page(array[--i]);
>         retval = -ENOMEM;
>         goto out;
> }
Which is a good algorithm but it has the potential to allocate a lot
of extra pages, and I have implemented it in the past. Its
worst case is just nasty.
My current code allocates at most 1 extra page and works gracefully if
it happens to allocate the pages it really wanted to use. It is just
a hair more complex, and it makes everything else very simple.
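The extra-allocation cost of the allocate-and-reject approach can be seen in a small user-space model. This is only a sketch: fake_allocator hands out page frame numbers in increasing order (real allocators give no such ordering), rejected pages are merely counted instead of being chained and freed, and every name below is hypothetical.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy allocator: returns page frame numbers 0, 1, 2, ... up to nr_pfns. */
struct fake_allocator {
    unsigned long next_pfn;
    unsigned long nr_pfns;
};

/* Predicate deciding whether an allocated page is acceptable. */
typedef bool (*page_ok_fn)(unsigned long pfn, const void *ctx);

/* Example predicate: accept only pages at or above a minimum pfn. */
bool pfn_at_least(unsigned long pfn, const void *ctx)
{
    return pfn >= *(const unsigned long *)ctx;
}

/* Keep allocating until nr_pages acceptable pages are collected.
 * *rejected reports how many pages had to be taken and set aside. */
int alloc_filtered(struct fake_allocator *a, unsigned long *array,
                   int nr_pages, page_ok_fn ok, const void *ctx,
                   int *rejected)
{
    int i = 0;

    *rejected = 0;
    while (i < nr_pages) {
        if (a->next_pfn >= a->nr_pfns)
            return -ENOMEM;          /* allocator exhausted */
        unsigned long pfn = a->next_pfn++;
        if (!ok(pfn, ctx)) {
            (*rejected)++;           /* held until the loop finishes */
            continue;
        }
        array[i++] = pfn;
    }
    return 0;
}
```

In this model, asking for 4 pages with a minimum pfn of 16 sets aside 16 pages before the first acceptable one turns up, which is the kind of worst case being objected to.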
> and now you are guaranteed that all the allocated pages are above a
> certain mark (change the "min_address" to be a "validity callback" or
> whatever if you want to be fancy and allow arbitrary rules, which is good
> if you want to allow pages in the low 1M on x86, for example), which means
> that your final reboot stage is _much_much_ simpler and you don't ever
> have to worry about overlap.
Exactly and that is why I do it where I do it. In the C load code.
In the kernel so it has to be written only once.
> Use one of the pages to allocate the memcpy() trampoline and the actual
> data structures used for the copy, for example. Use the rest for the
> actual kernel data.
>
> Keep it simple.
Yep.
After loading everything I have a total of 243 lines of code.
100 lines of assembly doing the copies in the trampoline.
143 lines of C modifying the page tables, the gdt, and the idt,
copying the trampoline to the correct place, and going for it.
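What the trampoline's copy loop has to do can be sketched in portable C, assuming the load stage has already produced a linked list of page copies in which no source page lies inside a destination area. The real code is assembly running on physical addresses; the 16-byte page size, copy_entry, and run_copy_list are hypothetical simplifications.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_BYTES 16   /* tiny page size, for illustration only */

/* One pending copy: page index `src` goes to page index `dst`. */
struct copy_entry {
    size_t src;
    size_t dst;
    const struct copy_entry *next;
};

/* Walk the list and move each page to its final location. Because the
 * load stage resolved all conflicts, copying in list order never
 * overwrites a page that is still waiting to be copied. */
void run_copy_list(unsigned char *mem, const struct copy_entry *e)
{
    for (; e != NULL; e = e->next) {
        if (e->src == e->dst)
            continue;                /* already in place */
        memcpy(mem + e->dst * SKETCH_PAGE_BYTES,
               mem + e->src * SKETCH_PAGE_BYTES,
               SKETCH_PAGE_BYTES);
    }
}
```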
And despite my utter puzzlement on why you want the syscall cut in two.
I will now go cut along the dotted line. If that is all it takes to
have peace I can do that. A sore head from all of the scratching
trying to figure out why it needs to be cut in two, but I can cut
sys_kexec in two.
Eric
On Sat, 2002-11-09 at 23:05, Eric W. Biederman wrote:
> There are two cases I am seeing users wanting.
> 1) Load a new kernel on panic.
Load a new *something* on panic. That something might be a new kernel
but it might also be a kernel dump system like LKCD or a debugger front
end for something like kdb, or a network dump module, or ...
Alan
On Sun, 2002-11-10 at 01:37, Eric W. Biederman wrote:
> The reasons I don't jump on board:
> - It does not handle multiple segments.
> Without multiple segments I think I simply leave out essential complexity
> of the problem. A bzImage has at least 2.
That's a matter for user space and the unpacker
> - vmalloc is artificially limited to 128MB.
Just grabbing a load of pages and using kmap/scatter gather by hand is
not
Alan Cox <[email protected]> writes:
> On Sat, 2002-11-09 at 23:05, Eric W. Biederman wrote:
> > There are two cases I am seeing users wanting.
> > 1) Load a new kernel on panic.
>
> Load a new *something* on panic. That something might be a new kernel
> but it might also be a kernel dump system like LKCD or a debugger front
> end for something like kdb, or a network dump module, or ...
And if it isn't a kernel why not load it as a module? The code
has to come preloaded anyway.
Eric
Alan Cox <[email protected]> writes:
> On Sun, 2002-11-10 at 01:37, Eric W. Biederman wrote:
> > The reasons I don't jump on board:
> > - It does not handle multiple segments.
> > Without multiple segments I think I simply leave out essential complexity
> > of the problem. A bzImage has at least 2.
>
> That's a matter for user space and the unpacker
>
> > - vmalloc is artificially limited to 128MB.
>
> Just grabbing a load of pages and using kmap/scatter gather by hand is
> not
To use kmapped memory I need to setup a page table to do the final copy.
And to setup a page table I need to know where the memory is going to be copied
to.
So my gut impression at least says an interface that ignores where
the image wants to live just adds complexity in other places, and
makes for an interface that is hard to maintain long term, because
you depend on a lot of kernel implementation details, which are likely
to change in arbitrary ways.
Eric
Eric W. Biederman wrote:
> So my gut impression at least says an interface that ignores where
> the image wants to live just adds complexity in other places,
Linus' alloc_kernel_pages function would actually be able to handle
this, provided that the "validity callback" checks if the allocated
page happens to be in one of the destination areas.
I'm not so sure if this implementation is really that much more
compact than your current conflict resolution, though. Also, it may
be hairy in scenarios where you actually expect to fill more than
50% of system memory. (But your concerns about a 128MB limit scare
me, too. I realize that people have taken initrds to extremes I
never quite imagined, but that still looks a little excessive :-)
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/
"Randy.Dunlap" <[email protected]> writes:
> {warning: cc: list too large :}
>
> On 9 Nov 2002, Eric W. Biederman wrote:
>
> | There are two cases I am seeing users wanting.
> | 1) Load a new kernel on panic.
> | - Extra care must be taken so what broke the first kernel does
> | not break this one, and so that the shards of the old kernel
> | do not break it.
> | - Care must be taken so that loading the second kernel does not
> | erase valuable data that is desirable to place in a crash dump.
> | - This kernel cannot live at the same address as the old one, (at
> | least not initially).
>
> Conceptually we would like a new kernel on panic, although
> I doubt that it's normally safe to "load a new kernel on panic."
> Or maybe it depends on the definition of "load."
>
> What I'm trying to say is that I think the new kernel must
> already be loaded when the panic happens.
> Is that what you describe later (below)?
Yes that was my meaning. The new kernel must be preloaded.
And only started on panic.
Werner Almesberger <[email protected]> writes:
>
> But ... I think you're designing too far ahead. The "load kernel on
> panic" part isn't trivial, and I think it would be better to tackle
> this in a second phase. For now, having a reasonably generic kexec
> mechanism would be all that's needed in terms of building blocks.
I'm not designing yet, just looking, and what I see says that it
does not much resemble the non-panic case.
> > Method 2 (For people with read only roots):
> > - /sbin/delayed_kexec /path/to/new/kernel
> > - Read in the /path/to/new/kernel into anonymous pages
>
> There's no delayed_kexec in kexec-tools 1.4, so let me guess how
> this would work: as far as I know, there's no way for regular
> user space to create a persistent unreferenced memory object, so
> you'd probably load the data, perhaps mlock the pages, and then
> fork a process that keeps the data in memory. Then, this process
> would probably call sys_kexec upon reception of a signal, or
> such.
What I was thinking is that the process would fork and exec
something like "/etc/rc 6" or maybe "/etc/rc 7" to be clean.
And that script would do all of the user space shutdown.
No need to mlock any pages, or hack init, or special hacks.
Just user space cleanly shutting itself down.
>
> > I then use the following algorithm to sort the potential mess out
> > before I jump to the new code.
>
> I like this approach. It gives you complete freedom of where to
> load data. This also makes it future-proof. But I don't see the
> reason why you couldn't do the same thing with vmalloc. Using
> vmalloc may actually simplify your code a little.
Mostly it's a bird in the hand versus a bird in the bush. I simply
see nowhere that vmalloc makes my code simpler.
> > Having had time to digest the idea of starting a new kernel on panic
> > I can now make some observations and what I believe it would take to
> > make it as robust as possible.
>
> That pretty much sums it up, yes. But as I've said, this isn't
> really something that needs to be implemented at the same time
> as the basic kexec functionality. A two-phase kexec with
> unrestricted copying capabilities should be a good enough
> building block that only minor changes, if any, would be needed
> when adding kexec-on-panic.
My feel is that kexec-on-panic is a rather different problem. Which
is why I thought it all through, to see if they felt close. At the
very least you almost need to know that it is the same.
>
> > And now I go back to the silly exercise of factoring my code so the
> > new kernel can be kept in locked kernel memory, instead of in a file
> > while the shutdown scripts are run.
>
> Not silly :-)
Except for the part about getting Linus to accept it I don't see
the advantage. kexec-on-panic looks different enough that I don't
think it will help at all with that case.
Eric
On 9 Nov 2002, Eric W. Biederman wrote:
>
> And despite my utter puzzlement on why you want the syscall cut in two.
I'm amazed about your puzzlement, since everybody else seems to get my
arguments, but as long as you play along I don't much care.
I will explain once more why it needs to be cut into two, even if you're
apparently willing to do it even without understanding:
When you reboot, you often cannot load the image.
This is _trivially_ true for panics or things like
- I don't understand why you do not want to accept this. Even if
your code doesn't even _handle_ panics, it's so obvious that
this is true that I don't understand why you want a limitation
in your particular current implementation to be a fundamental
flaw of the whole idea.
But it is _also_ true for any standard setup where you don't have
a special "init" that knows about loading the kernel, and where to
load it from.
- Do you want to rewrite every "init" setup out there, adding
some way to tell init where to load the kernel from?
Or do you want to just split the thing in two, so that you can
load the kernel _before_ you ask init to shut down, and just
happily use bog-standard tools that everybody is already
familiar with..
The two-part loader can clearly handle both cases. And if _you_ don't want
a two-part loader, you can do exactly what you do now by just doing two
system calls.
As to vmalloc - I don't actually much care how the first and second parts
are implemented. I suggested a vmalloc()-like approach just because your
patch looks unnecessarily complicated to me. But while I am convinced that
the two-phase loading/exec is absolutely the way to do it, the actual
low-level implementation is just a detail.
Linus
Werner Almesberger <[email protected]> writes:
> Eric W. Biederman wrote:
> > So my gut impression at least says an interface that ignores where
> > the image wants to live just adds complexity in other places,
>
> Linus' alloc_kernel_pages function would actually be able to handle
> this, provided that the "validity callback" checks if the allocated
> page happens to be in one of the destination areas.
>
> I'm not so sure if this implementation is really that much more
> compact than your current conflict resolution, though. Also, it may
> be hairy in scenarios where you actually expect to fill more than
> 50% of system memory. (But your concerns about a 128MB limit scare
> me, too. I realize that people have taken initrds to extremes I
> never quite imagined, but that still looks a little excessive :-)
I have not heard of more than about 90MB. One of the things I would
not be surprised to see in the next couple of years as memory gets
cheaper is diskless systems that don't even bother doing NFS root and
just put everything in an initrd. But that is not the main concern.
Since there are more polite ways of allocating memory already
implemented. Sucking up a 16MB hunk of someone's vmalloc space is
quite rude. Currently the limit is pretty much 50% of system memory
or 1GB whichever is less because the code must be loaded into user
space first, and I don't touch high memory. Although I guess if it
was mmaped read only the limit may be higher.
I don't expect to come close to using all of system memory
except on limited memory systems. But it is always nice to be
polite.
Eric
Eric W. Biederman wrote:
> What I was thinking is that the process would fork and exec
> something like "/etc/rc 6" or maybe "/etc/rc 7" to be clean.
> And that script would do all of the user space shutdown.
Yes, but init also does a kill(-1,...) to get rid of all processes,
before the last steps of system shutdown. So you have to somehow
make your "page holding" process survive beyond this point.
> My feel is that kexec-on-panic is a rather different problem.
You make it a different problem by assuming that you'd have a
kernel that is specifically built for running at a "safe"
location. If you assume that you're just using your normal
kernel, the two problems converge again. There are still a
few things that are different, like the checksumming, but
they can safely be added at a later time.
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/
On 9 Nov 2002, Eric W. Biederman wrote:
>
> What I was thinking is that the process would fork and exec
> something like "/etc/rc 6" or maybe "/etc/rc 7" to be clean.
> And that script would do all of the user space shutdown.
>
> No need to mlock any pages, or hack init, or special hacks.
> Just user space cleanly shutting itself down.
Ehh.. You do realize that the above doesn't actually _work_?
First off, "all the user space shutdown" includes things like turning off
networking. Oh, and if you're on a NFS-root system, your process is now
officially _toast_.
Unless you do the "mlockall()" etc that you seem to think that you don't
need, that is.
In other words: oh, yes, you do need those mlock() calls.
And never mind the fact that everybody has a slightly different "init"
setup, so executing "/etc/rc 6" may not actually _do_ anything. So now you
need to learn about all the different initscripts, and get that right.
And btw, thanks to the mlockall() requirements, you actually end up
pinning more memory than the two-phase approach ever would have done while
you do all this.
You then need to do the pre-loading anyway for the "kexec-on-panic" case
that you think is so different (I don't understand why you think a reboot
is different from another reboot, but whatever). So now you not only
maintain something that knows about many different init scripts and uses
more memory, it also ends up doing the same reboot thing at least two
different ways.
-- meanwhile, in another universe --
With the two-way separation, you don't have any of these problems. Your
maintenance nightmare has now become _one_ added script:
/etc/rc.d/rc6.d/K00loadkernel
and people using other init script variants can trivially add a script to
match their setup. Done. No maintenance headache, no special init
binaries, no torn-out-hair.
And by adding a startup script, you can have a _different_ small "debug
dump" kernel loaded early, so that if the machine reboots without going
through the controlled sequence, it will automatically boot into that
debug kernel..
Linus
Werner Almesberger <[email protected]> writes:
> Eric W. Biederman wrote:
> > What I was thinking is that the process would fork and exec
> > something like "/etc/rc 6" or maybe "/etc/rc 7" to be clean.
> > And that script would do all of the user space shutdown.
>
> Yes, but init also does a kill(-1,...) to get rid of all processes,
> before the last steps of system shutdown. So you have to somehow
> make your "page holding" process survive beyond this point.
True. But it is just as easy to drop the file into something like
ramfs. Or a file on the read-only root filesystem. Now
that we can, having shutdown do a pivot_root and totally unmount
the root filesystem is probably a good idea.
> > My feel is that kexec-on-panic is a rather different problem.
>
> You make it a different problem by assuming that you'd have a
> kernel that is specifically built for running at a "safe"
> location.
Well, at least the part that cleans up after the running kernel. That is
what I think it takes to make it stable. Perhaps I am wrong, but
I think getting other architectures stable is very hard.
> If you assume that you're just using your normal
> kernel, the two problems converge again. There are still a
> few things that are different, like the checksumming, but
> they can safely be added at a later time.
I guess I can be proven wrong.
Eric
Linus Torvalds <[email protected]> writes:
> On 9 Nov 2002, Eric W. Biederman wrote:
> >
> > And despite my utter puzzlement on why you want the syscall cut in two.
>
> I'm amazed about your puzzlement, since everybody else seems to get my
> arguments, but as long as you play along I don't much care.
>
> I will explain once more why it needs to be cut into two, even if you're
> apparently willing to do it even without understanding:
>
> When you reboot, you often cannot load the image.
>
> This is _trivially_ true for panics or things like
That the load needs to be separate for handling panics is trivially
true. I simply have a very hard time believing that the load you want
for the normal case will be the load you want for a panic. I think
I want to be much more paranoid in preparing for the kernel to blow
up. And I want to move data around much more carefully. And that
care adds restrictions I don't want for the normal case.
So splitting it up to prepare for panic handling just looks like over
design.
> But it is _also_ true for any standard setup where you don't have
> a special "init" that knows about loading the kernel, and where to
> load it from.
>
> - Do you want to rewrite every "init" setup out there, adding
> some way to tell init where to load the kernel from?
>
> Or do you want to just split the thing in two, so that you can
> load the kernel _before_ you ask init to shut down, and just
> happily use bog-standard tools that everybody is already
> familiar with..
When you can change the init setup with just a couple of lines of
shell script that checks whether a file exists in a magic location (say a special
ramfs or tmpfs), I guess it does not look hard to me.
> The two-part loader can clearly handle both cases. And if _you_ don't want
> a two-part loader, you can do exactly what you do now by just doing two
> system calls.
Right which is why I don't much care, so long as I don't have to test
reboot on panic any time soon...
I doubt we will see eye to eye on this one. So I will now finish up
the patch to split this all up.
> As to vmalloc - I don't actually much care how the first and second parts
> are implemented. I suggested a vmalloc()-like approach just because your
> patch looks unnecessarily complicated to me.
I'd love to make it simpler as well if I saw a clear opportunity that
wasn't just moving the complexity somewhere else. But when I really
look at it I think that I am just wordy.
Eric
Hi!
> > Yes, we are putting [MCORE] in as one of the alternative dump targets
> > available.
>
> Great !
>
> > Its not quite ready yet and we need something like kexec to be
> > available which we can use on Intel systems to achieve the softboot
> > (the acceptance status of that still doesn't seem to be clear),
>
> Yes, I've just checked with Eric, and he hasn't received any
> indication from Linus so far. I posted a reminder to linux-kernel.
> I'd really hate to see kexec miss 2.6.
>
> > Why do we even consider the other options when we are doing
> > this already ? Well, as we discussed earlier there's non-disruptive
> > dumps for one, where this wouldn't work.
>
> But they're very different anyway, aren't they ? I mean, you could
> even implement them (well, almost) from user space, with today's
> kernels.
>
> > The other is that before overwriting
> > memory we need to be able to stop all activity in the system for certain
> > (system may appear hung/locked up) and I'm not fully certain about
> > how to do this for all environments. Maybe an answer lies in
> > rethinking some parts of the algorithm a bit.
>
> This is certainly the hairiest part, yes. I think we have about
> four types of devices/elements to worry about:
>
> - those that just sit there, and never talk unless spoken to
> - those that may generate interrupts
> - those that DMA if you ask them nicely
> - those that DMA when they feel like it (e.g. copy an incoming
> network packet to the next buffer in the free list)
>
> The latter are the real problem. I see the following possibilities
> for dealing with them:
>
> - faith-based computing: pray that nothing bad will befall your
> system :-)
> - de-activate them individually. There should be a lot of work
> that can be shared with power management. And that's one of
> the reasons why I think the memory target should be available
> early, or convergence will take forever.
I have very similar problem in swsusp (need to deactivate DMA
devices), and driverfs^H^H^H^H^Hsysfs framework seems to be suitable
for that.
Pavel
--
Worst form of spam? Adding advertisment signatures ala sourceforge.net.
What goes next? Inserting advertisment *into* email?
On Sun, 2002-11-10 at 02:16, Eric W. Biederman wrote:
> To use kmapped memory I need to setup a page table to do the final copy.
> And to setup a page table I need to know where the memory is going to be copied
> to.
And ?
I find it hard to believe you can't drive an MMU if you can write code
that boots one Linux from another
On Sun, 2002-11-10 at 02:58, Eric W. Biederman wrote:
> > What I'm trying to say is that I think the new kernel must
> > already be loaded when the panic happens.
> > Is that what you describe later (below)?
>
> Yes that was my meaning. The new kernel must be preloaded.
> And only started on panic.
Another question from the point of view of unifying things. What is
wrong with
insmod kexec
creates /dev/kexec (or kexecfs if you are Al Viro)
hooks the reboot and panic final notifiers
user copies file to /dev/kexec (which stuffs it into ram)
reboot
kexec module handler jumps to the first page of the
kexec data in a defined state assuming it's PIC
At which point we have clearly reduced kexec/oops reporter/lkcd/netdump
to a single common tiny interface.
On Sun, 2002-11-10 at 02:18, Eric W. Biederman wrote:
> > Load a new *something* on panic. That something might be a new kernel
> > but it might also be a kernel dump system like LKCD or a debugger front
> > end for something like kdb, or a network dump module, or ...
>
> And if it isn't a kernel why not load it as a module? The code
> has to come preloaded anyway.
You may want to load it as a module or via syscall request. Doesn't
matter which really. But you do want all the intelligence in the loaded
code not in the reboot stub of the dying code.
Alan Cox <[email protected]> writes:
> On Sun, 2002-11-10 at 02:16, Eric W. Biederman wrote:
> > To use kmapped memory I need to setup a page table to do the final copy.
> > And to setup a page table I need to know where the memory is going to be copied
> > to.
>
> And ?
>
> I find it hard to believe you can't drive an MMU if you can write code
> that boots one Linux from another
One of the simplifying things I do between OSes is turn off the MMU, or
at least give it a trivial 1-1 mapping with physical memory.
If all of that memory is going to sit there forever, it probably makes sense
to be high memory capable. But for the first rev of this I won't be.
Addresses > 4GB are a major pain to work with on x86.
But I do have a test machine that can reproduce that so I can test for
strange bugs. I added a BIOS option to put all but 512M out of 4GB
above the 4GB line.
Eric
O.k. Here is the split-up version of my kexec patch.
Added are
sys_reboot(LINUX_REBOOT_CMD_KEXEC)
sys_kexec_load(unsigned long entry, unsigned long nr_segments, struct kexec_segment *segments, unsigned long flags);
The flags field is currently enforced to be zero, but it leaves the window open to tweak
what the load does for the panic case.
Currently (because of missing hardware shutdown code) the code is only close to stable
on UP without APICs.
Generating a patch to cleanly shut down the APICs, and releasing sample user-space code,
are the next steps.
Eric
MAINTAINERS | 7
arch/i386/Kconfig | 17
arch/i386/kernel/Makefile | 1
arch/i386/kernel/entry.S | 1
arch/i386/kernel/machine_kexec.c | 142 ++++++++
arch/i386/kernel/relocate_kernel.S | 99 +++++
include/asm-i386/kexec.h | 25 +
include/asm-i386/unistd.h | 1
include/linux/kexec.h | 46 ++
include/linux/reboot.h | 2
kernel/Makefile | 1
kernel/kexec.c | 643 +++++++++++++++++++++++++++++++++++++
kernel/sys.c | 23 +
13 files changed, 1008 insertions
diff -uNr linux-2.5.46-bk6/MAINTAINERS linux-2.5.46-bk6.x86kexec/MAINTAINERS
--- linux-2.5.46-bk6/MAINTAINERS Sun Nov 10 10:04:38 2002
+++ linux-2.5.46-bk6.x86kexec/MAINTAINERS Sun Nov 10 10:05:32 2002
@@ -968,6 +968,13 @@
W: http://www.cse.unsw.edu.au/~neilb/patches/linux-devel/
S: Maintained
+KEXEC
+P: Eric Biederman
+M: [email protected]
+M: [email protected]
+L: [email protected]
+S: Maintained
+
LANMEDIA WAN CARD DRIVER
P: Andrew Stanley-Jones
M: [email protected]
diff -uNr linux-2.5.46-bk6/arch/i386/Kconfig linux-2.5.46-bk6.x86kexec/arch/i386/Kconfig
--- linux-2.5.46-bk6/arch/i386/Kconfig Sun Nov 10 10:04:38 2002
+++ linux-2.5.46-bk6.x86kexec/arch/i386/Kconfig Sun Nov 10 10:05:32 2002
@@ -784,6 +784,23 @@
depends on (SMP || PREEMPT) && X86_CMPXCHG
default y
+config KEXEC
+ bool "kexec system call (EXPERIMENTAL)"
+ depends on EXPERIMENTAL
+ help
+ kexec is a system call that implements the ability to shut down your
+ current kernel, and to start another kernel. It is like a reboot
+ but it is independent of the system firmware. And like a reboot
+ you can start any kernel with it, not just Linux.
+
+ The name comes from the similarity to the exec system call.
+
+ It is an ongoing process to be certain the hardware in a machine
+ is properly shut down, so do not be surprised if this code does not
+ initially work for you. It may help to enable device hotplugging
+ support. As of this writing the exact hardware interface is
+ strongly in flux, so no good recommendation can be made.
+
endmenu
diff -uNr linux-2.5.46-bk6/arch/i386/kernel/Makefile linux-2.5.46-bk6.x86kexec/arch/i386/kernel/Makefile
--- linux-2.5.46-bk6/arch/i386/kernel/Makefile Sun Nov 10 10:04:38 2002
+++ linux-2.5.46-bk6.x86kexec/arch/i386/kernel/Makefile Sun Nov 10 10:05:32 2002
@@ -24,6 +24,7 @@
obj-$(CONFIG_X86_MPPARSE) += mpparse.o
obj-$(CONFIG_X86_LOCAL_APIC) += apic.o nmi.o
obj-$(CONFIG_X86_IO_APIC) += io_apic.o
+obj-$(CONFIG_KEXEC) += machine_kexec.o relocate_kernel.o
obj-$(CONFIG_SOFTWARE_SUSPEND) += suspend.o
obj-$(CONFIG_X86_NUMAQ) += numaq.o
obj-$(CONFIG_PROFILING) += profile.o
diff -uNr linux-2.5.46-bk6/arch/i386/kernel/entry.S linux-2.5.46-bk6.x86kexec/arch/i386/kernel/entry.S
--- linux-2.5.46-bk6/arch/i386/kernel/entry.S Sun Nov 10 10:04:38 2002
+++ linux-2.5.46-bk6.x86kexec/arch/i386/kernel/entry.S Sun Nov 10 10:05:32 2002
@@ -743,6 +743,7 @@
.long sys_epoll_ctl /* 255 */
.long sys_epoll_wait
.long sys_remap_file_pages
+ .long sys_kexec_load
.rept NR_syscalls-(.-sys_call_table)/4
diff -uNr linux-2.5.46-bk6/arch/i386/kernel/machine_kexec.c linux-2.5.46-bk6.x86kexec/arch/i386/kernel/machine_kexec.c
--- linux-2.5.46-bk6/arch/i386/kernel/machine_kexec.c Wed Dec 31 17:00:00 1969
+++ linux-2.5.46-bk6.x86kexec/arch/i386/kernel/machine_kexec.c Sun Nov 10 10:05:32 2002
@@ -0,0 +1,142 @@
+#include <linux/config.h>
+#include <linux/mm.h>
+#include <linux/kexec.h>
+#include <linux/delay.h>
+#include <asm/pgtable.h>
+#include <asm/pgalloc.h>
+#include <asm/tlbflush.h>
+#include <asm/io.h>
+#include <asm/apic.h>
+
+
+/*
+ * machine_kexec
+ * =======================
+ */
+
+
+static void set_idt(void *newidt, __u16 limit)
+{
+ unsigned char curidt[6];
+
+ /* ia32 supports unaligned loads & stores */
+ (*(__u16 *)(curidt)) = limit;
+ (*(__u32 *)(curidt +2)) = (unsigned long)(newidt);
+
+ __asm__ __volatile__ (
+ "lidt %0\n"
+ : : "m" (curidt)
+ );
+}
+
+
+static void set_gdt(void *newgdt, __u16 limit)
+{
+ unsigned char curgdt[6];
+
+ /* ia32 supports unaligned loads & stores */
+ (*(__u16 *)(curgdt)) = limit;
+ (*(__u32 *)(curgdt +2)) = (unsigned long)(newgdt);
+
+ __asm__ __volatile__ (
+ "lgdt %0\n"
+ : : "m" (curgdt)
+ );
+}
+
+static void load_segments(void)
+{
+#define __STR(X) #X
+#define STR(X) __STR(X)
+
+ __asm__ __volatile__ (
+ "\tljmp $"STR(__KERNEL_CS)",$1f\n"
+ "\t1:\n"
+ "\tmovl $"STR(__KERNEL_DS)",%eax\n"
+ "\tmovl %eax,%ds\n"
+ "\tmovl %eax,%es\n"
+ "\tmovl %eax,%fs\n"
+ "\tmovl %eax,%gs\n"
+ "\tmovl %eax,%ss\n"
+ );
+#undef STR
+#undef __STR
+}
+
+static void identity_map_page(unsigned long address)
+{
+ /* This code is x86 specific...
+ * general purpose code must be more careful
+ * of caches and tlbs...
+ */
+ pgd_t *pgd;
+ pmd_t *pmd;
+ struct mm_struct *mm = current->mm;
+ spin_lock(&mm->page_table_lock);
+
+ pgd = pgd_offset(mm, address);
+ pmd = pmd_alloc(mm, pgd, address);
+
+ if (pmd) {
+ pte_t *pte = pte_alloc_map(mm, pmd, address);
+ if (pte) {
+ set_pte(pte,
+ mk_pte(virt_to_page(phys_to_virt(address)),
+ PAGE_SHARED));
+ __flush_tlb_one(address);
+ }
+ }
+ spin_unlock(&mm->page_table_lock);
+}
+
+
+typedef void (*relocate_new_kernel_t)(
+ unsigned long indirection_page, unsigned long reboot_code_buffer,
+ unsigned long start_address);
+
+extern const unsigned char relocate_new_kernel[];
+extern void relocate_new_kernel_end(void);
+extern const unsigned int relocate_new_kernel_size;
+
+void machine_kexec(struct kimage *image)
+{
+ unsigned long *indirection_page;
+ void *reboot_code_buffer;
+ relocate_new_kernel_t rnk;
+
+ /* Interrupts aren't acceptable while we reboot */
+ local_irq_disable();
+ reboot_code_buffer = image->reboot_code_buffer;
+ indirection_page = phys_to_virt(image->head & PAGE_MASK);
+
+ identity_map_page(virt_to_phys(reboot_code_buffer));
+
+ /* copy it out */
+ memcpy(reboot_code_buffer, relocate_new_kernel,
+ relocate_new_kernel_size);
+
+ /* The segment registers are funny things, they are
+ * automatically loaded from a table in memory whenever you
+ * set them to a specific selector, but this table is never
+ * accessed again until you set the segment to a different selector.
+ *
+ * The more common model is a cache, where the behind
+ * the scenes work is done, but is also dropped at arbitrary
+ * times.
+ *
+ * I take advantage of this here by force loading the
+ * segments, before I zap the gdt with an invalid value.
+ */
+ load_segments();
+ /* The gdt & idt are now invalid.
+ * If you want to load them you must set up your own idt & gdt.
+ */
+ set_gdt(phys_to_virt(0),0);
+ set_idt(phys_to_virt(0),0);
+
+ /* now call it */
+ rnk = (relocate_new_kernel_t) virt_to_phys(reboot_code_buffer);
+ (*rnk)(virt_to_phys(indirection_page), virt_to_phys(reboot_code_buffer),
+ image->start);
+}
+
diff -uNr linux-2.5.46-bk6/arch/i386/kernel/relocate_kernel.S linux-2.5.46-bk6.x86kexec/arch/i386/kernel/relocate_kernel.S
--- linux-2.5.46-bk6/arch/i386/kernel/relocate_kernel.S Wed Dec 31 17:00:00 1969
+++ linux-2.5.46-bk6.x86kexec/arch/i386/kernel/relocate_kernel.S Sun Nov 10 10:05:32 2002
@@ -0,0 +1,99 @@
+#include <linux/config.h>
+#include <linux/linkage.h>
+
+ /* Must be relocatable PIC code callable as a C function, that once
+ * it starts cannot use the previous process's stack.
+ *
+ */
+ .globl relocate_new_kernel
+relocate_new_kernel:
+ /* read the arguments and say goodbye to the stack */
+ movl 4(%esp), %ebx /* indirection_page */
+ movl 8(%esp), %ebp /* reboot_code_buffer */
+ movl 12(%esp), %edx /* start address */
+
+ /* zero out flags, and disable interrupts */
+ pushl $0
+ popfl
+
+ /* set a new stack at the bottom of our page... */
+ lea 4096(%ebp), %esp
+
+ /* store the parameters back on the stack */
+ pushl %edx /* store the start address */
+
+ /* Set cr0 to a known state:
+ * 31 0 == Paging disabled
+ * 18 0 == Alignment check disabled
+ * 16 0 == Write protect disabled
+ * 3 0 == No task switch
+ * 2 0 == Don't do FP software emulation.
+ * 0 1 == Protected mode enabled
+ */
+ movl %cr0, %eax
+ andl $~((1<<31)|(1<<18)|(1<<16)|(1<<3)|(1<<2)), %eax
+ orl $(1<<0), %eax
+ movl %eax, %cr0
+ jmp 1f
+1:
+
+ /* Flush the TLB (needed?) */
+ xorl %eax, %eax
+ movl %eax, %cr3
+
+ /* Do the copies */
+ cld
+0: /* top, read another word from the indirection page */
+ movl %ebx, %ecx
+ movl (%ebx), %ecx
+ addl $4, %ebx
+ testl $0x1, %ecx /* is it a destination page */
+ jz 1f
+ movl %ecx, %edi
+ andl $0xfffff000, %edi
+ jmp 0b
+1:
+ testl $0x2, %ecx /* is it an indirection page */
+ jz 1f
+ movl %ecx, %ebx
+ andl $0xfffff000, %ebx
+ jmp 0b
+1:
+ testl $0x4, %ecx /* is it the done indicator */
+ jz 1f
+ jmp 2f
+1:
+ testl $0x8, %ecx /* is it the source indicator */
+ jz 0b /* Ignore it otherwise */
+ movl %ecx, %esi /* For every source page do a copy */
+ andl $0xfffff000, %esi
+
+ movl $1024, %ecx
+ rep ; movsl
+ jmp 0b
+
+2:
+
+ /* To be certain of avoiding problems with self modifying code
+ * I need to execute a serializing instruction here.
+ * So I flush the TLB, it's handy, and not processor dependent.
+ */
+ xorl %eax, %eax
+ movl %eax, %cr3
+
+ /* set all of the registers to known values */
+ /* leave %esp alone */
+
+ xorl %eax, %eax
+ xorl %ebx, %ebx
+ xorl %ecx, %ecx
+ xorl %edx, %edx
+ xorl %esi, %esi
+ xorl %edi, %edi
+ xorl %ebp, %ebp
+ ret
+relocate_new_kernel_end:
+
+ .globl relocate_new_kernel_size
+relocate_new_kernel_size:
+ .long relocate_new_kernel_end - relocate_new_kernel
diff -uNr linux-2.5.46-bk6/include/asm-i386/kexec.h linux-2.5.46-bk6.x86kexec/include/asm-i386/kexec.h
--- linux-2.5.46-bk6/include/asm-i386/kexec.h Wed Dec 31 17:00:00 1969
+++ linux-2.5.46-bk6.x86kexec/include/asm-i386/kexec.h Sun Nov 10 10:05:32 2002
@@ -0,0 +1,25 @@
+#ifndef _I386_KEXEC_H
+#define _I386_KEXEC_H
+
+#include <asm/fixmap.h>
+
+/*
+ * KEXEC_SOURCE_MEMORY_LIMIT maximum page get_free_page can return.
+ * I.e. Maximum page that is mapped directly into kernel memory,
+ * and kmap is not required.
+ *
+ * Someone correct me if FIXADDR_START - PAGE_OFFSET is not the correct
+ * calculation for the amount of memory directly mappable into the
+ * kernel memory space.
+ */
+
+/* Maximum physical address we can use pages from */
+#define KEXEC_SOURCE_MEMORY_LIMIT (FIXADDR_START - PAGE_OFFSET)
+/* Maximum address we can reach in physical address mode */
+#define KEXEC_DESTINATION_MEMORY_LIMIT (-1UL)
+
+#define KEXEC_REBOOT_CODE_SIZE 4096
+#define KEXEC_REBOOT_CODE_ALIGN 0
+
+
+#endif /* _I386_KEXEC_H */
diff -uNr linux-2.5.46-bk6/include/asm-i386/unistd.h linux-2.5.46-bk6.x86kexec/include/asm-i386/unistd.h
--- linux-2.5.46-bk6/include/asm-i386/unistd.h Tue Nov 5 19:03:51 2002
+++ linux-2.5.46-bk6.x86kexec/include/asm-i386/unistd.h Sun Nov 10 10:05:32 2002
@@ -262,6 +262,7 @@
#define __NR_sys_epoll_ctl 255
#define __NR_sys_epoll_wait 256
#define __NR_remap_file_pages 257
+#define __NR_sys_kexec_load 258
/* user-visible error numbers are in the range -1 - -124: see <asm-i386/errno.h> */
diff -uNr linux-2.5.46-bk6/include/linux/kexec.h linux-2.5.46-bk6.x86kexec/include/linux/kexec.h
--- linux-2.5.46-bk6/include/linux/kexec.h Wed Dec 31 17:00:00 1969
+++ linux-2.5.46-bk6.x86kexec/include/linux/kexec.h Sun Nov 10 10:05:32 2002
@@ -0,0 +1,46 @@
+#ifndef LINUX_KEXEC_H
+#define LINUX_KEXEC_H
+
+#if CONFIG_KEXEC
+#include <linux/types.h>
+#include <asm/kexec.h>
+
+/*
+ * This structure is used to hold the arguments that are used when loading
+ * kernel binaries.
+ */
+
+typedef unsigned long kimage_entry_t;
+#define IND_DESTINATION 0x1
+#define IND_INDIRECTION 0x2
+#define IND_DONE 0x4
+#define IND_SOURCE 0x8
+
+struct kimage {
+ kimage_entry_t head;
+ kimage_entry_t *entry;
+ kimage_entry_t *last_entry;
+
+ unsigned long destination;
+ unsigned long offset;
+
+ unsigned long start;
+ void *reboot_code_buffer;
+};
+
+struct kexec_segment {
+ void *buf;
+ size_t bufsz;
+ void *mem;
+ size_t memsz;
+};
+
+/* kexec interface functions */
+extern void machine_kexec(struct kimage *image);
+extern asmlinkage long sys_kexec_load(unsigned long entry, unsigned long nr_segments,
+ struct kexec_segment *segments, unsigned long flags);
+extern struct kimage *kexec_image;
+extern spinlock_t kexec_image_lock;
+#endif
+#endif /* LINUX_KEXEC_H */
+
diff -uNr linux-2.5.46-bk6/include/linux/reboot.h linux-2.5.46-bk6.x86kexec/include/linux/reboot.h
--- linux-2.5.46-bk6/include/linux/reboot.h Fri Oct 11 22:22:47 2002
+++ linux-2.5.46-bk6.x86kexec/include/linux/reboot.h Sun Nov 10 10:05:32 2002
@@ -21,6 +21,7 @@
* POWER_OFF Stop OS and remove all power from system, if possible.
* RESTART2 Restart system using given command string.
* SW_SUSPEND Suspend system using Software Suspend if compiled in
+ * KEXEC Restart the system using a different kernel.
*/
#define LINUX_REBOOT_CMD_RESTART 0x01234567
@@ -30,6 +31,7 @@
#define LINUX_REBOOT_CMD_POWER_OFF 0x4321FEDC
#define LINUX_REBOOT_CMD_RESTART2 0xA1B2C3D4
#define LINUX_REBOOT_CMD_SW_SUSPEND 0xD000FCE2
+#define LINUX_REBOOT_CMD_KEXEC 0x45584543
#ifdef __KERNEL__
diff -uNr linux-2.5.46-bk6/kernel/Makefile linux-2.5.46-bk6.x86kexec/kernel/Makefile
--- linux-2.5.46-bk6/kernel/Makefile Fri Oct 18 11:59:29 2002
+++ linux-2.5.46-bk6.x86kexec/kernel/Makefile Sun Nov 10 10:05:32 2002
@@ -21,6 +21,7 @@
obj-$(CONFIG_CPU_FREQ) += cpufreq.o
obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
obj-$(CONFIG_SOFTWARE_SUSPEND) += suspend.o
+obj-$(CONFIG_KEXEC) += kexec.o
ifneq ($(CONFIG_IA64),y)
# According to Alan Modra <[email protected]>, the -fno-omit-frame-pointer is
diff -uNr linux-2.5.46-bk6/kernel/kexec.c linux-2.5.46-bk6.x86kexec/kernel/kexec.c
--- linux-2.5.46-bk6/kernel/kexec.c Wed Dec 31 17:00:00 1969
+++ linux-2.5.46-bk6.x86kexec/kernel/kexec.c Sun Nov 10 10:05:32 2002
@@ -0,0 +1,643 @@
+#include <linux/mm.h>
+#include <linux/file.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/version.h>
+#include <linux/compile.h>
+#include <linux/kexec.h>
+#include <linux/spinlock.h>
+#include <net/checksum.h>
+#include <asm/page.h>
+#include <asm/uaccess.h>
+#include <asm/io.h>
+
+/* As designed kexec can only use the memory that you don't
+ * need to use kmap to access. Memory that you can use virt_to_phys()
+ * on and call get_free_page to allocate.
+ *
+ * In the best case you need one page for the transition from
+ * virtual to physical memory. And this page must be identity
+ * mapped. Which pretty much leaves you with pages < PAGE_OFFSET
+ * as you can only mess with user pages.
+ *
+ * As the only subset of memory that it is easy to restrict allocation
+ * to is the physical memory mapped into the kernel, I do that
+ * with get_free_page and hope it is enough.
+ *
+ * I don't know of a good way to calculate which pages get_free_page
+ * will return independent of architecture so I depend on
+ * <asm/kexec.h> to properly set
+ * KEXEC_SOURCE_MEMORY_LIMIT and KEXEC_DESTINATION_MEMORY_LIMIT
+ *
+ */
+
+static struct kimage *kimage_alloc(void)
+{
+ struct kimage *image;
+ image = kmalloc(sizeof(*image), GFP_KERNEL);
+ if (!image)
+ return 0;
+ memset(image, 0, sizeof(*image));
+ image->head = 0;
+ image->entry = &image->head;
+ image->last_entry = &image->head;
+ return image;
+}
+static int kimage_add_entry(struct kimage *image, kimage_entry_t entry)
+{
+ if (image->offset != 0) {
+ image->entry++;
+ }
+ if (image->entry == image->last_entry) {
+ kimage_entry_t *ind_page;
+ ind_page = (void *)__get_free_page(GFP_KERNEL);
+ if (!ind_page) {
+ return -ENOMEM;
+ }
+ *image->entry = virt_to_phys(ind_page) | IND_INDIRECTION;
+ image->entry = ind_page;
+ image->last_entry =
+ ind_page + ((PAGE_SIZE/sizeof(kimage_entry_t)) - 1);
+ }
+ *image->entry = entry;
+ image->entry++;
+ image->offset = 0;
+ return 0;
+}
+
+static int kimage_verify_destination(unsigned long destination)
+{
+ int result;
+
+ /* Assume the page is bad unless we pass the checks */
+ result = -EADDRNOTAVAIL;
+
+ if (destination >= KEXEC_DESTINATION_MEMORY_LIMIT) {
+ goto out;
+ }
+
+ /* NOTE: The caller is responsible for making certain we
+ * don't attempt to load the new image into invalid or
+ * reserved areas of RAM.
+ */
+ result = 0;
+out:
+ return result;
+}
+
+static int kimage_set_destination(
+ struct kimage *image, unsigned long destination)
+{
+ int result;
+ destination &= PAGE_MASK;
+ result = kimage_verify_destination(destination);
+ if (result) {
+ return result;
+ }
+ result = kimage_add_entry(image, destination | IND_DESTINATION);
+ if (result == 0) {
+ image->destination = destination;
+ }
+ return result;
+}
+
+
+static int kimage_add_page(struct kimage *image, unsigned long page)
+{
+ int result;
+ page &= PAGE_MASK;
+ result = kimage_verify_destination(image->destination);
+ if (result) {
+ return result;
+ }
+ result = kimage_add_entry(image, page | IND_SOURCE);
+ if (result == 0) {
+ image->destination += PAGE_SIZE;
+ }
+ return result;
+}
+
+
+static int kimage_terminate(struct kimage *image)
+{
+ int result;
+ result = kimage_add_entry(image, IND_DONE);
+ if (result == 0) {
+ /* Point at the terminating element */
+ image->entry--;
+ }
+ return result;
+}
+
+#define for_each_kimage_entry(image, ptr, entry) \
+ for (ptr = &image->head; (entry = *ptr) && !(entry & IND_DONE); \
+ ptr = (entry & IND_INDIRECTION)? \
+ phys_to_virt((entry & PAGE_MASK)): ptr +1)
+
+static void kimage_free(struct kimage *image)
+{
+ kimage_entry_t *ptr, entry;
+ kimage_entry_t ind = 0;
+ if (!image)
+ return;
+ for_each_kimage_entry(image, ptr, entry) {
+ if (entry & IND_INDIRECTION) {
+ /* Free the previous indirection page */
+ if (ind & IND_INDIRECTION) {
+ free_page((unsigned long)phys_to_virt(ind & PAGE_MASK));
+ }
+ /* Save this indirection page until we are
+ * done with it.
+ */
+ ind = entry;
+ }
+ else if (entry & IND_SOURCE) {
+ free_page((unsigned long)phys_to_virt(entry & PAGE_MASK));
+ }
+ }
+ kfree(image);
+}
+
+static int kimage_is_destination_page(
+ struct kimage *image, unsigned long page)
+{
+ kimage_entry_t *ptr, entry;
+ unsigned long destination;
+ destination = 0;
+ page &= PAGE_MASK;
+ for_each_kimage_entry(image, ptr, entry) {
+ if (entry & IND_DESTINATION) {
+ destination = entry & PAGE_MASK;
+ }
+ else if (entry & IND_SOURCE) {
+ if (page == destination) {
+ return 1;
+ }
+ destination += PAGE_SIZE;
+ }
+ }
+ return 0;
+}
+
+static int kimage_get_unused_area(
+ struct kimage *image, unsigned long size, unsigned long align,
+ unsigned long *area)
+{
+ /* Walk through mem_map and find the first chunk of
+ * unused memory that is at least size bytes long.
+ */
+ /* Since the kernel plays with PG_reserved, mem_map is less
+ * than ideal for this purpose, but it will give us a correct
+ * conservative estimate of what we need to do.
+ */
+ /* For now we take advantage of the fact that all kernel pages
+ * are marked with PG_reserved to allocate a large
+ * contiguous area for the reboot code buffer.
+ */
+ unsigned long addr;
+ unsigned long start, end;
+ unsigned long mask;
+ mask = ((1 << align) -1);
+ start = end = PAGE_SIZE;
+ for(addr = PAGE_SIZE; addr < KEXEC_SOURCE_MEMORY_LIMIT; addr += PAGE_SIZE) {
+ struct page *page;
+ unsigned long aligned_start;
+ page = virt_to_page(phys_to_virt(addr));
+ if (PageReserved(page) ||
+ kimage_is_destination_page(image, addr)) {
+ /* The current page is reserved so the start &
+ * end of the next area must be at least at the
+ * next page.
+ */
+ start = end = addr + PAGE_SIZE;
+ }
+ else {
+ /* O.k. The current page isn't reserved
+ * so push up the end of the area.
+ */
+ end = addr;
+ }
+ aligned_start = (start + mask) & ~mask;
+ if (aligned_start > start) {
+ continue;
+ }
+ if (aligned_start > end) {
+ continue;
+ }
+ if (end - aligned_start >= size) {
+ *area = aligned_start;
+ return 0;
+ }
+ }
+ *area = 0;
+ return -ENOSPC;
+}
+
+static kimage_entry_t *kimage_dst_conflict(
+ struct kimage *image, unsigned long page, kimage_entry_t *limit)
+{
+ kimage_entry_t *ptr, entry;
+ unsigned long destination = 0;
+ for_each_kimage_entry(image, ptr, entry) {
+ if (ptr == limit) {
+ return 0;
+ }
+ else if (entry & IND_DESTINATION) {
+ destination = entry & PAGE_MASK;
+ }
+ else if (entry & IND_SOURCE) {
+ if (page == destination) {
+ return ptr;
+ }
+ destination += PAGE_SIZE;
+ }
+ }
+ return 0;
+}
+
+static kimage_entry_t *kimage_src_conflict(
+ struct kimage *image, unsigned long destination, kimage_entry_t *limit)
+{
+ kimage_entry_t *ptr, entry;
+ for_each_kimage_entry(image, ptr, entry) {
+ unsigned long page;
+ if (ptr == limit) {
+ return 0;
+ }
+ else if (entry & IND_DESTINATION) {
+ /* nop */
+ }
+ else if (entry & IND_DONE) {
+ /* nop */
+ }
+ else {
+ /* SOURCE & INDIRECTION */
+ page = entry & PAGE_MASK;
+ if (page == destination) {
+ return ptr;
+ }
+ }
+ }
+ return 0;
+}
+
+static int kimage_get_off_destination_pages(struct kimage *image)
+{
+ kimage_entry_t *ptr, *cptr, entry;
+ unsigned long buffer, page;
+ unsigned long destination = 0;
+
+ /* Here we implement safeguards to ensure that
+ * a source page is not copied to its destination
+ * page before the data on the destination page is
+ * no longer useful.
+ *
+ * To make it work we actually wind up with a
+ * stronger condition. For every page considered
+ * it is either its own destination page or it is
+ * not a destination page of any page considered.
+ *
+ * Invariants
+ * 1. buffer is not a destination of a previous page.
+ * 2. page is not a destination of a previous page.
+ * 3. destination is not a previous source page.
+ *
+ * Result: Either a source page and a destination page
+ * are the same or the page is not a destination page.
+ *
+ * These checks could be done when we allocate the pages,
+ * but doing it as a final pass allows us more freedom
+ * on how we allocate pages.
+ *
+ * Also while the checks are necessary, in practice nothing
+ * happens. The destination kernel wants to sit in the
+ * same physical addresses as the current kernel so we never
+ * actually allocate a destination page.
+ *
+ * BUGS: This is an O(N^2) algorithm.
+ */
+
+
+ buffer = __get_free_page(GFP_KERNEL);
+ if (!buffer) {
+ return -ENOMEM;
+ }
+ buffer = virt_to_phys((void *)buffer);
+ for_each_kimage_entry(image, ptr, entry) {
+ /* Here we check to see if an allocated page */
+ kimage_entry_t *limit;
+ if (entry & IND_DESTINATION) {
+ destination = entry & PAGE_MASK;
+ }
+ else if (entry & IND_INDIRECTION) {
+ /* Indirection pages must include all of their
+ * contents in limit checking.
+ */
+ limit = phys_to_virt(page + PAGE_SIZE - sizeof(*limit));
+ }
+ if (!((entry & IND_SOURCE) | (entry & IND_INDIRECTION))) {
+ continue;
+ }
+ page = entry & PAGE_MASK;
+ limit = ptr;
+
+ /* See if a previous page has the current page as its
+ * destination.
+ * i.e. invariant 2
+ */
+ cptr = kimage_dst_conflict(image, page, limit);
+ if (cptr) {
+ unsigned long cpage;
+ kimage_entry_t centry;
+ centry = *cptr;
+ cpage = centry & PAGE_MASK;
+ memcpy(phys_to_virt(buffer), phys_to_virt(page), PAGE_SIZE);
+ memcpy(phys_to_virt(page), phys_to_virt(cpage), PAGE_SIZE);
+ *cptr = page | (centry & ~PAGE_MASK);
+ *ptr = buffer | (entry & ~PAGE_MASK);
+ buffer = cpage;
+ }
+ if (!(entry & IND_SOURCE)) {
+ continue;
+ }
+
+ /* See if a previous page is our destination page.
+ * If so claim it now.
+ * i.e. invariant 3
+ */
+ cptr = kimage_src_conflict(image, destination, limit);
+ if (cptr) {
+ unsigned long cpage;
+ kimage_entry_t centry;
+ centry = *cptr;
+ cpage = centry & PAGE_MASK;
+ memcpy(phys_to_virt(buffer), phys_to_virt(cpage), PAGE_SIZE);
+ memcpy(phys_to_virt(cpage), phys_to_virt(page), PAGE_SIZE);
+ *cptr = buffer | (centry & ~PAGE_MASK);
+ *ptr = cpage | ( entry & ~PAGE_MASK);
+ buffer = page;
+ }
+ /* If the buffer is my destination page do the copy now
+ * i.e. invariant 3 & 1
+ */
+ if (buffer == destination) {
+ memcpy(phys_to_virt(buffer), phys_to_virt(page), PAGE_SIZE);
+ *ptr = buffer | (entry & ~PAGE_MASK);
+ buffer = page;
+ }
+ }
+ free_page((unsigned long)phys_to_virt(buffer));
+ return 0;
+}
+
+static int kimage_add_empty_pages(struct kimage *image,
+ unsigned long len)
+{
+ unsigned long pos;
+ int result;
+ for(pos = 0; pos < len; pos += PAGE_SIZE) {
+ char *page;
+ result = -ENOMEM;
+ page = (void *)__get_free_page(GFP_KERNEL);
+ if (!page) {
+ goto out;
+ }
+ result = kimage_add_page(image, virt_to_phys(page));
+ if (result) {
+ goto out;
+ }
+ }
+ result = 0;
+ out:
+ return result;
+}
+
+
+static int kimage_load_segment(struct kimage *image,
+ struct kexec_segment *segment)
+{
+ unsigned long mstart;
+ int result;
+ unsigned long offset;
+ unsigned long offset_end;
+ unsigned char *buf;
+
+ result = 0;
+ buf = segment->buf;
+ mstart = (unsigned long)segment->mem;
+
+ offset_end = segment->memsz;
+
+ result = kimage_set_destination(image, mstart);
+ if (result < 0) {
+ goto out;
+ }
+ for(offset = 0; offset < segment->memsz; offset += PAGE_SIZE) {
+ char *page;
+ size_t size, leader;
+ page = (char *)__get_free_page(GFP_KERNEL);
+ if (page == 0) {
+ result = -ENOMEM;
+ goto out;
+ }
+ result = kimage_add_page(image, virt_to_phys(page));
+ if (result < 0) {
+ goto out;
+ }
+ if (segment->bufsz < offset) {
+ /* We are past the end zero the whole page */
+ memset(page, 0, PAGE_SIZE);
+ continue;
+ }
+ size = PAGE_SIZE;
+ leader = 0;
+ if ((offset == 0)) {
+ leader = mstart & ~PAGE_MASK;
+ }
+ if (leader) {
+ /* We are on the first page zero the unused portion */
+ memset(page, 0, leader);
+ size -= leader;
+ page += leader;
+ }
+ if (size > (segment->bufsz - offset)) {
+ size = segment->bufsz - offset;
+ }
+ result = copy_from_user(page, buf + offset, size);
+ if (result) {
+ result = (result < 0)?result : -EIO;
+ goto out;
+ }
+ if (size < (PAGE_SIZE - leader)) {
+ /* zero the trailing part of the page */
+ memset(page + size, 0, (PAGE_SIZE - leader) - size);
+ }
+ }
+ out:
+ return result;
+}
+
+
+/* do_kexec executes a new kernel
+ */
+static int do_kexec(unsigned long start, unsigned long nr_segments,
+ struct kexec_segment *arg_segments, struct kimage *image)
+{
+ struct kexec_segment *segments;
+ size_t segment_bytes;
+ int i;
+
+ int result;
+ unsigned long reboot_code_buffer;
+ kimage_entry_t *end;
+
+ /* Initialize variables */
+ segments = 0;
+
+ segment_bytes = nr_segments * sizeof(*segments);
+ segments = kmalloc(segment_bytes, GFP_KERNEL);
+ if (segments == 0) {
+ result = -ENOMEM;
+ goto out;
+ }
+ if (copy_from_user(segments, arg_segments, segment_bytes)) {
+ result = -EFAULT;
+ goto out;
+ }
+
+ /* Read in the data from user space */
+ image->start = start;
+ for(i = 0; i < nr_segments; i++) {
+ result = kimage_load_segment(image, &segments[i]);
+ if (result) {
+ goto out;
+ }
+ }
+
+ /* Terminate early so I can get a place holder. */
+ result = kimage_terminate(image);
+ if (result)
+ goto out;
+ end = image->entry;
+
+ /* Usage of the reboot code buffer is subtle. We first
+ * find a contiguous area of ram, that is not one
+ * of our destination pages. We do not allocate the ram.
+ *
+ * The algorithm to make certain we do not have address
+ * conflicts requires each destination region to have some
+ * backing store so we allocate arbitrary source pages.
+ *
+ * Later in machine_kexec when we copy data to the
+ * reboot_code_buffer it still may be allocated for other
+ * purposes, but we do know there are no source or destination
+ * pages in that area. And since the rest of the kernel
+ * is already shutdown those pages are free for use,
+ * regardless of their page->count values.
+ *
+ * The kernel mapping of the reboot code buffer is passed to
+ * the machine dependent code. If it needs something else
+ * it is free to set that up.
+ */
+ result = kimage_get_unused_area(
+ image, KEXEC_REBOOT_CODE_SIZE, KEXEC_REBOOT_CODE_ALIGN,
+ &reboot_code_buffer);
+ if (result)
+ goto out;
+
+ /* Allocating pages we should never need is silly but the
+ * code won't work correctly unless we have dummy pages to
+ * work with.
+ */
+ result = kimage_set_destination(image, reboot_code_buffer);
+ if (result)
+ goto out;
+ result = kimage_add_empty_pages(image, KEXEC_REBOOT_CODE_SIZE);
+ if (result)
+ goto out;
+ image->reboot_code_buffer = phys_to_virt(reboot_code_buffer);
+
+ result = kimage_terminate(image);
+ if (result)
+ goto out;
+
+ result = kimage_get_off_destination_pages(image);
+ if (result)
+ goto out;
+
+ /* Now hide the extra source pages for the reboot code buffer.
+ */
+ image->entry = end;
+ result = kimage_terminate(image);
+ if (result)
+ goto out;
+
+ result = 0;
+ out:
+ /* cleanup and exit */
+ if (segments) kfree(segments);
+ return result;
+}
+
+
+/*
+ * Exec Kernel system call: for obvious reasons only root may call it.
+ *
+ * This call breaks up into three pieces.
+ * - A generic part which loads the new kernel from the current
+ * address space, and very carefully places the data in the
+ * allocated pages.
+ *
+ * - A generic part that interacts with the kernel and tells all of
+ * the devices to shut down. Preventing on-going dmas, and placing
+ * the devices in a consistent state so a later kernel can
+ * reinitialize them.
+ *
+ * - A machine specific part that includes the syscall number
+ * and then copies the image to its final destination, and
+ * jumps into the image at entry.
+ *
+ * kexec does not sync, or unmount filesystems so if you need
+ * that to happen you need to do that yourself.
+ */
+struct kimage *kexec_image = 0;
+spinlock_t kexec_image_lock = SPIN_LOCK_UNLOCKED;
+
+asmlinkage long sys_kexec_load(unsigned long entry, unsigned long nr_segments,
+ struct kexec_segment *segments, unsigned long flags)
+{
+ /* Am I using too much stack space here? */
+ struct kimage *image, *old_image;
+ int result;
+
+ /* We only trust the superuser with rebooting the system. */
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ /* In case we need just a little bit of special behavior for
+ * reboot on panic
+ */
+ if (flags != 0)
+ return -EINVAL;
+
+ image = 0;
+ if (nr_segments > 0) {
+ image = kimage_alloc();
+ if (!image) {
+ return -ENOMEM;
+ }
+ result = do_kexec(entry, nr_segments, segments, image);
+ if (result) {
+ kimage_free(image);
+ return result;
+ }
+ }
+
+ spin_lock(&kexec_image_lock);
+ old_image = kexec_image;
+ kexec_image = image;
+ spin_unlock(&kexec_image_lock);
+
+ kimage_free(old_image);
+ return 0;
+}
diff -uNr linux-2.5.46-bk6/kernel/sys.c linux-2.5.46-bk6.x86kexec/kernel/sys.c
--- linux-2.5.46-bk6/kernel/sys.c Tue Nov 5 19:03:56 2002
+++ linux-2.5.46-bk6.x86kexec/kernel/sys.c Sun Nov 10 10:05:32 2002
@@ -16,6 +16,7 @@
#include <linux/init.h>
#include <linux/highuid.h>
#include <linux/fs.h>
+#include <linux/kexec.h>
#include <linux/workqueue.h>
#include <linux/device.h>
#include <linux/times.h>
@@ -206,6 +207,7 @@
cond_syscall(sys_lookup_dcookie)
cond_syscall(sys_swapon)
cond_syscall(sys_swapoff)
+cond_syscall(sys_kexec_load)
static int set_one_prio(struct task_struct *p, int niceval, int error)
{
@@ -414,6 +416,27 @@
machine_restart(buffer);
break;
+#ifdef CONFIG_KEXEC
+ case LINUX_REBOOT_CMD_KEXEC:
+ {
+ struct kimage *image;
+ spin_lock(&kexec_image_lock);
+ image = kexec_image;
+ if (!image || arg) {
+ spin_unlock(&kexec_image_lock);
+ unlock_kernel();
+ return -EINVAL;
+ }
+ notifier_call_chain(&reboot_notifier_list, SYS_RESTART, NULL);
+ system_running = 0;
+ device_shutdown();
+ printk(KERN_EMERG "Starting new kernel\n");
+ machine_kexec(image);
+ /* We never get here... */
+ spin_unlock(&kexec_image_lock);
+ break;
+ }
+#endif
#ifdef CONFIG_SOFTWARE_SUSPEND
case LINUX_REBOOT_CMD_SW_SUSPEND:
if (!software_suspend_enabled) {
Alan Cox <[email protected]> writes:
> On Sun, 2002-11-10 at 02:58, Eric W. Biederman wrote:
> > > What I'm trying to say is that I think the new kernel must
> > > already be loaded when the panic happens.
> > > Is that what you describe later (below)?
> >
> > Yes that was my meaning. The new kernel must be preloaded.
> > And only started on panic.
>
> Another question from the point of view of unifying things. What is
> wrong with
>
> insmod kexec
> creates /dev/kexec (or kexecfs if you are Al Viro)
> hooks the reboot and panic final notifiers
> user copies file to /dev/kexec (which stuffs it into ram)
>
> reboot
> kexec module handler jumps to the first page of the
> kexec data in a defined state assuming it's PIC
>
>
> At which point we have clearly reduced kexec/oops reporter/lkcd/netdump
> to a single common tiny interface.
It would take a special hook that ran after the notifiers and
device_shutdown. At least in the normal case, running what shutdown
code we can is fairly important. And hooking the notifier lists
would not give a guarantee of going last.
There is a long way to go in working with device drivers to even get
the easy kexec case working stably, in non-special circumstances.
The kernel gets there fine, but it does not cope well with the APICs
activated and the legacy PIC disabled during bootup.
The additional device shutdown code is useful even in the normal
reboot path. Most BIOSes don't care, but it should fix a few problems
with BIOSes that are not as paranoid about the state of the system as
they should be when reboot is called. Little things like always
shutting down on the bootstrap CPU are on my todo list.
Eric
Hi!
> > > Let me ask the same dumb question - what does kexec need that a dumper
> > > doesn't.
> >
> > kexec needs:
> > - a system call to set it up
> > - a way to silence devices <snip>
> <snip>
> > - a bit of glue <snip>
> > - device drivers that can bring silent devices back to life
> <snip>
>
> > > In other words given reboot/trap hooks can kexec happily live
> > > as a standalone module ?
>
> You could probably skip the system call to set it up. Example: I could
> imagine a bizarre set of pseudo-devices:
>
> # insmod kexec
> # cat bzImage > /proc/kexec/next-image
> # echo "root=805" > /proc/kexec/next-cmndline
> # echo 1 > /proc/kexec/reboot
>
> and hide away that dirty little sequence with a nice kexec(3) library
> routine.
Actually, sys_reboot has a void * parameter. Reusing it as a "next-image"
char * seems okay to me.
Pavel
--
Worst form of spam? Adding advertisment signatures ala sourceforge.net.
What goes next? Inserting advertisment *into* email?
Von Eric W. Biederman:
> + but it is indepedent of the system firmware. And like a reboot the
> + you can start any kernel with it not just Linux.
What about this one?
s/ the$//
Eike
Pavel Machek <[email protected]> writes:
> I have very similar problem in swsusp (need to deactivate DMA
> devices), and driverfs^H^H^H^H^Hsysfs framework seems to be suitable
> for that.
Yes. The problem and the solutions are very similar. Because you are
restoring the kernel code I don't think we can use the same functions,
but similar work needs to be done. The correct hook for reboots,
halts, kexec, and other cases where the kernel is going away is
device_shutdown which currently calls device->shutdown(). Since the
implementation has changed recently to avoid other problems no one
actually implements the shutdown method at the moment. Once that
happens we can probably kill the reboot notifiers. But there is a lot
of driver work to do on that score.
Eric
On 7 Nov 2002, Andy Pfiffer wrote:
> Just an idea:
>
> Could a new, unrunnable process be created to "hold" the image?
>
> <hand-wave>
> Use a hypothetical sys_kexec() to:
> 1. create an empty process.
> 2. copy the kernel image and parameters into the processes' address
> space.
> 3. put the process to sleep.
> </hand-wave>
>
> If it's floating out there for weeks or years, the data could get paged
> out and not wired down. It would show up in ps, so you'd have at least
> some visibility into the allocation.
The only problem is that if you wanted it to run on panic, you really
couldn't trust the burning embers of a dying kernel to pull in the pages
and run them. I'd actually hope the init (and some cleanup??) code would
be there to get the new kernel going. Where kernel could be something
other than another kernel, hopefully.
--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
Linus Torvalds <[email protected]> writes:
> On 9 Nov 2002, Eric W. Biederman wrote:
> >
> > And despite my utter puzzlement on why you want the syscall cut in two.
>
> I'm amazed about your puzzlement, since everybody else seem to get my
> arguments, but as long as you play along I don't much care.
I think this comes from being the guy down in the trenches implementing
the code. It is sometimes hard to look up far enough to have design
discussions.
I totally agree that having a load/exec split is the right
approach now that I can imagine an implementation where the code will
actually work for the panic case. Before it felt like lying. Doing
the split-up, promising that kexec on panic will work eventually,
when I could not even see it as a possibility was at the core of my
objections.
What brought me around is that I can add a flag field to kexec_load.
With that flag field I can tell the kernel: please step extra carefully,
this code will be used to handle kexec on panic. Without that I may
be up a creek without a paddle for figuring out how to debug that code.
To be able to support this at all I have had to be very creative in
inventing debugging code, which is why I have the serial console
program kexec_test. It provides visibility into what is happening
when nothing else will. With that and memtest86, which will occasionally
catch DMAs that have not been stopped (memory errors on good RAM), I
at least have a place to start rather than a blank screen when
guessing why the new kernel did not start up.
Eric
kexec is a set of system calls that allows you to load another kernel
from the currently executing Linux kernel. The current implementation
has only been tested, and had the kinks worked out, on x86, but the
generic code (kexec_load) should work on any architecture.
Some machines have BIOSes that are either extremely slow to reboot,
or that cannot reliably perform a reboot. In that case kexec
may be the only way to reboot in a reliable and timely
manner.
The patch is archived at:
http://www.xmission.com/~ebiederm/files/kexec/
And is currently kept in two pieces.
The pure system call.
http://www.xmission.com/~ebiederm/files/kexec/linux-2.5.47.x86kexec.diff
And the set of hardware fixes known to help kexec.
http://www.xmission.com/~ebiederm/files/kexec/linux-2.5.47.x86kexec-hwfixes.diff
A compatible user space is at:
http://www.xmission.com/~ebiederm/files/kexec/kexec-tools-1.5.tar.gz
This code boots either a static ELF executable or a bzImage.
A kernel reformatter that bypasses setup.S in favor of a version that
uses fewer BIOS calls (increasing the reliability) is at:
ftp://ftp.lnxi.com/pub/mkelfImage/mkelfImage-1.18.tar.gz
In bug reports please include the serial console output of
kexec kexec_test. kexec_test exercises most of the interesting code
paths that are needed to load a kernel (mainly BIOS calls) with lots
of debugging print statements, so hangs can easily be detected.
To be polite to your user space there are now options:
--load (which just loads the new kernel)
--exec (which starts a previously loaded kernel).
I expect to integrate more gracefully with init as time goes on, but
this is what I can do in a timely manner.
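As a rough illustration of the load/exec split seen from user space, here is
a minimal sketch. The structure layout, the i386 syscall number (258) and the
reboot command value are copied from the patch below; make_segment() and the
4096-byte page rounding are assumptions made for this example and are not
taken from kexec-tools. Nothing in the sketch actually enters the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Values copied from the patch (i386 only). */
#define NR_SYS_KEXEC_LOAD      258
#define LINUX_REBOOT_CMD_KEXEC 0x45584543UL

/* Assumed page size for the sketch; i386 uses 4096-byte pages. */
#define SKETCH_PAGE_SIZE 4096UL

/* Same layout as struct kexec_segment in include/linux/kexec.h. */
struct kexec_segment {
	void *buf;     /* source buffer in the caller's address space */
	size_t bufsz;  /* number of bytes to copy from buf */
	void *mem;     /* destination address for the segment */
	size_t memsz;  /* size of the destination area */
};

/*
 * Hypothetical helper: describe one segment and round the destination
 * size up to a whole number of pages. The rounding is an assumption of
 * this sketch, chosen because the loader walks memsz a page at a time.
 */
static struct kexec_segment make_segment(void *buf, size_t bufsz,
					 unsigned long mem)
{
	struct kexec_segment seg;

	assert(buf != NULL && bufsz > 0);
	seg.buf = buf;
	seg.bufsz = bufsz;
	seg.mem = (void *)mem;
	seg.memsz = (bufsz + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
	return seg;
}

/*
 * The two steps, shown as comments only:
 *
 *   --load: syscall(NR_SYS_KEXEC_LOAD, entry, nr_segments, segments, 0);
 *   --exec: the reboot system call with command LINUX_REBOOT_CMD_KEXEC
 *
 * flags must be 0 in this version of the patch.
 */
```

Keeping the two steps apart means the image can be staged long before it is
started, which is what the panic case discussed in this thread needs.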
Without applying the hardware fixes you must build a kernel that is
uniprocessor and does not use an APIC, to have a chance at this code
working. Cleaning up various hardware fixes and getting them
integrated into the kernel is the next step.
Hopefully this has an interface Linus likes now.
Eric
MAINTAINERS | 7
arch/i386/Kconfig | 17
arch/i386/kernel/Makefile | 1
arch/i386/kernel/entry.S | 1
arch/i386/kernel/machine_kexec.c | 142 ++++++++
arch/i386/kernel/relocate_kernel.S | 99 +++++
include/asm-i386/kexec.h | 25 +
include/asm-i386/unistd.h | 1
include/linux/kexec.h | 46 ++
include/linux/reboot.h | 2
kernel/Makefile | 1
kernel/kexec.c | 643 +++++++++++++++++++++++++++++++++++++
kernel/sys.c | 23 +
13 files changed, 1008 insertions
diff -uNr linux-2.5.47/MAINTAINERS linux-2.5.47.x86kexec/MAINTAINERS
--- linux-2.5.47/MAINTAINERS Mon Nov 11 00:22:33 2002
+++ linux-2.5.47.x86kexec/MAINTAINERS Mon Nov 11 00:24:07 2002
@@ -968,6 +968,13 @@
W: http://www.cse.unsw.edu.au/~neilb/patches/linux-devel/
S: Maintained
+KEXEC
+P: Eric Biederman
+M: [email protected]
+M: [email protected]
+L: [email protected]
+S: Maintained
+
LANMEDIA WAN CARD DRIVER
P: Andrew Stanley-Jones
M: [email protected]
diff -uNr linux-2.5.47/arch/i386/Kconfig linux-2.5.47.x86kexec/arch/i386/Kconfig
--- linux-2.5.47/arch/i386/Kconfig Mon Nov 11 00:22:33 2002
+++ linux-2.5.47.x86kexec/arch/i386/Kconfig Mon Nov 11 00:26:52 2002
@@ -784,6 +784,23 @@
depends on (SMP || PREEMPT) && X86_CMPXCHG
default y
+config KEXEC
+ bool "kexec system call (EXPERIMENTAL)"
+ depends on EXPERIMENTAL
+ help
+ kexec is a system call that implements the ability to shut down your
+ current kernel, and to start another kernel. It is like a reboot
+ but it is independent of the system firmware. And like a reboot
+ you can start any kernel with it, not just Linux.
+
+ The name comes from the similarity to the exec system call.
+
+ It is an ongoing process to be certain the hardware in a machine
+ is properly shut down, so do not be surprised if this code does not
+ initially work for you. It may help to enable device hotplugging
+ support. As of this writing the exact hardware interface is
+ strongly in flux, so no good recommendation can be made.
+
endmenu
diff -uNr linux-2.5.47/arch/i386/kernel/Makefile linux-2.5.47.x86kexec/arch/i386/kernel/Makefile
--- linux-2.5.47/arch/i386/kernel/Makefile Mon Nov 11 00:22:33 2002
+++ linux-2.5.47.x86kexec/arch/i386/kernel/Makefile Mon Nov 11 00:24:07 2002
@@ -24,6 +24,7 @@
obj-$(CONFIG_X86_MPPARSE) += mpparse.o
obj-$(CONFIG_X86_LOCAL_APIC) += apic.o nmi.o
obj-$(CONFIG_X86_IO_APIC) += io_apic.o
+obj-$(CONFIG_KEXEC) += machine_kexec.o relocate_kernel.o
obj-$(CONFIG_SOFTWARE_SUSPEND) += suspend.o
obj-$(CONFIG_X86_NUMAQ) += numaq.o
obj-$(CONFIG_PROFILING) += profile.o
diff -uNr linux-2.5.47/arch/i386/kernel/entry.S linux-2.5.47.x86kexec/arch/i386/kernel/entry.S
--- linux-2.5.47/arch/i386/kernel/entry.S Mon Nov 11 00:22:33 2002
+++ linux-2.5.47.x86kexec/arch/i386/kernel/entry.S Mon Nov 11 00:24:07 2002
@@ -743,6 +743,7 @@
.long sys_epoll_ctl /* 255 */
.long sys_epoll_wait
.long sys_remap_file_pages
+ .long sys_kexec_load
.rept NR_syscalls-(.-sys_call_table)/4
diff -uNr linux-2.5.47/arch/i386/kernel/machine_kexec.c linux-2.5.47.x86kexec/arch/i386/kernel/machine_kexec.c
--- linux-2.5.47/arch/i386/kernel/machine_kexec.c Wed Dec 31 17:00:00 1969
+++ linux-2.5.47.x86kexec/arch/i386/kernel/machine_kexec.c Mon Nov 11 00:24:07 2002
@@ -0,0 +1,142 @@
+#include <linux/config.h>
+#include <linux/mm.h>
+#include <linux/kexec.h>
+#include <linux/delay.h>
+#include <asm/pgtable.h>
+#include <asm/pgalloc.h>
+#include <asm/tlbflush.h>
+#include <asm/io.h>
+#include <asm/apic.h>
+
+
+/*
+ * machine_kexec
+ * =======================
+ */
+
+
+static void set_idt(void *newidt, __u16 limit)
+{
+ unsigned char curidt[6];
+
+ /* ia32 supports unaligned loads & stores */
+ (*(__u16 *)(curidt)) = limit;
+ (*(__u32 *)(curidt +2)) = (unsigned long)(newidt);
+
+ __asm__ __volatile__ (
+ "lidt %0\n"
+ : : "m" (curidt)
+ );
+}
+
+
+static void set_gdt(void *newgdt, __u16 limit)
+{
+ unsigned char curgdt[6];
+
+ /* ia32 supports unaligned loads & stores */
+ (*(__u16 *)(curgdt)) = limit;
+ (*(__u32 *)(curgdt +2)) = (unsigned long)(newgdt);
+
+ __asm__ __volatile__ (
+ "lgdt %0\n"
+ : : "m" (curgdt)
+ );
+}
+
+static void load_segments(void)
+{
+#define __STR(X) #X
+#define STR(X) __STR(X)
+
+ __asm__ __volatile__ (
+ "\tljmp $"STR(__KERNEL_CS)",$1f\n"
+ "\t1:\n"
+ "\tmovl $"STR(__KERNEL_DS)",%eax\n"
+ "\tmovl %eax,%ds\n"
+ "\tmovl %eax,%es\n"
+ "\tmovl %eax,%fs\n"
+ "\tmovl %eax,%gs\n"
+ "\tmovl %eax,%ss\n"
+ );
+#undef STR
+#undef __STR
+}
+
+static void identity_map_page(unsigned long address)
+{
+ /* This code is x86 specific...
+ * general purpose code must be more careful
+ * of caches and tlbs...
+ */
+ pgd_t *pgd;
+ pmd_t *pmd;
+ struct mm_struct *mm = current->mm;
+ spin_lock(&mm->page_table_lock);
+
+ pgd = pgd_offset(mm, address);
+ pmd = pmd_alloc(mm, pgd, address);
+
+ if (pmd) {
+ pte_t *pte = pte_alloc_map(mm, pmd, address);
+ if (pte) {
+ set_pte(pte,
+ mk_pte(virt_to_page(phys_to_virt(address)),
+ PAGE_SHARED));
+ __flush_tlb_one(address);
+ }
+ }
+ spin_unlock(&mm->page_table_lock);
+}
+
+
+typedef void (*relocate_new_kernel_t)(
+ unsigned long indirection_page, unsigned long reboot_code_buffer,
+ unsigned long start_address);
+
+const extern unsigned char relocate_new_kernel[];
+extern void relocate_new_kernel_end(void);
+const extern unsigned int relocate_new_kernel_size;
+
+void machine_kexec(struct kimage *image)
+{
+ unsigned long *indirection_page;
+ void *reboot_code_buffer;
+ relocate_new_kernel_t rnk;
+
+ /* Interrupts aren't acceptable while we reboot */
+ local_irq_disable();
+ reboot_code_buffer = image->reboot_code_buffer;
+ indirection_page = phys_to_virt(image->head & PAGE_MASK);
+
+ identity_map_page(virt_to_phys(reboot_code_buffer));
+
+ /* copy it out */
+ memcpy(reboot_code_buffer, relocate_new_kernel,
+ relocate_new_kernel_size);
+
+ /* The segment registers are funny things, they are
+ * automatically loaded from a table in memory whenever you
+ * set them to a specific selector, but this table is never
+ * accessed again until you set the segment to a different selector.
+ *
+ * The more common model is a cache, where the behind-the-scenes
+ * work is done, but is also dropped at arbitrary
+ * times.
+ *
+ * I take advantage of this here by force loading the
+ * segments, before I zap the gdt with an invalid value.
+ */
+ load_segments();
+ /* The gdt & idt are now invalid.
+ * If you want to load them you must set up your own idt & gdt.
+ */
+ set_gdt(phys_to_virt(0),0);
+ set_idt(phys_to_virt(0),0);
+
+ /* now call it */
+ rnk = (relocate_new_kernel_t) virt_to_phys(reboot_code_buffer);
+ (*rnk)(virt_to_phys(indirection_page), virt_to_phys(reboot_code_buffer),
+ image->start);
+}
+
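For readers unfamiliar with the 6-byte buffer that set_idt() and set_gdt()
fill in above: it is the operand that lidt/lgdt read in 32-bit mode, a 16-bit
limit followed by a 32-bit base. The sketch below builds the same layout with
explicit byte stores so it runs anywhere; pack_table_desc() is a made-up name
for this example and no privileged instruction is executed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Build the 6-byte lidt/lgdt operand:
 *   bytes 0-1: 16-bit table limit, little-endian
 *   bytes 2-5: 32-bit linear base address, little-endian
 */
static void pack_table_desc(unsigned char desc[6], uint32_t base,
			    uint16_t limit)
{
	assert(desc != NULL);
	desc[0] = (unsigned char)(limit & 0xff);
	desc[1] = (unsigned char)((limit >> 8) & 0xff);
	desc[2] = (unsigned char)(base & 0xff);
	desc[3] = (unsigned char)((base >> 8) & 0xff);
	desc[4] = (unsigned char)((base >> 16) & 0xff);
	desc[5] = (unsigned char)((base >> 24) & 0xff);
}
```

machine_kexec() passes a limit of 0, which leaves no room for even one
complete 8-byte descriptor, so the tables are unusable until the new kernel
installs its own.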
diff -uNr linux-2.5.47/arch/i386/kernel/relocate_kernel.S linux-2.5.47.x86kexec/arch/i386/kernel/relocate_kernel.S
--- linux-2.5.47/arch/i386/kernel/relocate_kernel.S Wed Dec 31 17:00:00 1969
+++ linux-2.5.47.x86kexec/arch/i386/kernel/relocate_kernel.S Mon Nov 11 00:24:07 2002
@@ -0,0 +1,99 @@
+#include <linux/config.h>
+#include <linux/linkage.h>
+
+ /* Must be relocatable PIC code callable as a C function, that once
+ * it starts cannot use the previous process's stack.
+ *
+ */
+ .globl relocate_new_kernel
+relocate_new_kernel:
+ /* read the arguments and say goodbye to the stack */
+ movl 4(%esp), %ebx /* indirection_page */
+ movl 8(%esp), %ebp /* reboot_code_buffer */
+ movl 12(%esp), %edx /* start address */
+
+ /* zero out flags, and disable interrupts */
+ pushl $0
+ popfl
+
+ /* set a new stack at the bottom of our page... */
+ lea 4096(%ebp), %esp
+
+ /* store the parameters back on the stack */
+ pushl %edx /* store the start address */
+
+ /* Set cr0 to a known state:
+ * 31 0 == Paging disabled
+ * 18 0 == Alignment check disabled
+ * 16 0 == Write protect disabled
+ * 3 0 == No task switch
+ * 2 0 == Don't do FP software emulation.
+ * 0 1 == Protected mode enabled
+ */
+ movl %cr0, %eax
+ andl $~((1<<31)|(1<<18)|(1<<16)|(1<<3)|(1<<2)), %eax
+ orl $(1<<0), %eax
+ movl %eax, %cr0
+ jmp 1f
+1:
+
+ /* Flush the TLB (needed?) */
+ xorl %eax, %eax
+ movl %eax, %cr3
+
+ /* Do the copies */
+ cld
+0: /* top, read another word from the indirection page */
+ movl %ebx, %ecx
+ movl (%ebx), %ecx
+ addl $4, %ebx
+ testl $0x1, %ecx /* is it a destination page */
+ jz 1f
+ movl %ecx, %edi
+ andl $0xfffff000, %edi
+ jmp 0b
+1:
+ testl $0x2, %ecx /* is it an indirection page */
+ jz 1f
+ movl %ecx, %ebx
+ andl $0xfffff000, %ebx
+ jmp 0b
+1:
+ testl $0x4, %ecx /* is it the done indicator */
+ jz 1f
+ jmp 2f
+1:
+ testl $0x8, %ecx /* is it the source indicator */
+ jz 0b /* Ignore it otherwise */
+ movl %ecx, %esi /* For every source page do a copy */
+ andl $0xfffff000, %esi
+
+ movl $1024, %ecx
+ rep ; movsl
+ jmp 0b
+
+2:
+
+ /* To be certain of avoiding problems with self modifying code
+ * I need to execute a serializing instruction here.
+ * So I flush the TLB, it's handy, and not processor dependent.
+ */
+ xorl %eax, %eax
+ movl %eax, %cr3
+
+ /* set all of the registers to known values */
+ /* leave %esp alone */
+
+ xorl %eax, %eax
+ xorl %ebx, %ebx
+ xorl %ecx, %ecx
+ xorl %edx, %edx
+ xorl %esi, %esi
+ xorl %edi, %edi
+ xorl %ebp, %ebp
+ ret
+relocate_new_kernel_end:
+
+ .globl relocate_new_kernel_size
+relocate_new_kernel_size:
+ .long relocate_new_kernel_end - relocate_new_kernel
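The copy loop in relocate_new_kernel can be hard to follow in assembly. Below
is a user-space sketch of the same walk over the entry list, using a byte
array in place of physical memory. The IND_* values are the ones from
include/linux/kexec.h in this patch; sim_relocate(), sim_put_entry() and the
"ram" array are invented for this illustration, and offsets into the array
stand in for physical addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define IND_DESTINATION 0x1u
#define IND_INDIRECTION 0x2u
#define IND_DONE        0x4u
#define IND_SOURCE      0x8u

#define SIM_PAGE_SIZE 4096u
#define SIM_PAGE_MASK (~(SIM_PAGE_SIZE - 1u))

/* Store one 32-bit list entry at byte offset off in the simulated RAM. */
static void sim_put_entry(unsigned char *ram, uint32_t off, uint32_t entry)
{
	memcpy(ram + off, &entry, sizeof(entry));
}

/*
 * Walk the entry list starting at ind_page and perform the copies,
 * testing the flag bits in the same order as the assembly does.
 * Returns the number of pages copied.
 */
static unsigned sim_relocate(unsigned char *ram, uint32_t ind_page)
{
	uint32_t pos = ind_page; /* plays the role of %ebx */
	uint32_t dst = 0;        /* plays the role of %edi */
	unsigned copied = 0;

	assert(ram != NULL);
	for (;;) {
		uint32_t entry;

		memcpy(&entry, ram + pos, sizeof(entry));
		pos += sizeof(entry);

		if (entry & IND_DESTINATION) {
			/* following source pages land here */
			dst = entry & SIM_PAGE_MASK;
		} else if (entry & IND_INDIRECTION) {
			/* continue reading entries from another page */
			pos = entry & SIM_PAGE_MASK;
		} else if (entry & IND_DONE) {
			return copied;
		} else if (entry & IND_SOURCE) {
			/* copy one page; the destination advances */
			memmove(ram + dst, ram + (entry & SIM_PAGE_MASK),
				SIM_PAGE_SIZE);
			dst += SIM_PAGE_SIZE;
			copied++;
		}
		/* entries with none of the four flags are skipped */
	}
}
```

As in the assembly, there is no bounds checking: a list without an IND_DONE
entry would run off the end, so the sketch is only meaningful for a
well-formed list.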
diff -uNr linux-2.5.47/include/asm-i386/kexec.h linux-2.5.47.x86kexec/include/asm-i386/kexec.h
--- linux-2.5.47/include/asm-i386/kexec.h Wed Dec 31 17:00:00 1969
+++ linux-2.5.47.x86kexec/include/asm-i386/kexec.h Mon Nov 11 00:24:07 2002
@@ -0,0 +1,25 @@
+#ifndef _I386_KEXEC_H
+#define _I386_KEXEC_H
+
+#include <asm/fixmap.h>
+
+/*
+ * KEXEC_SOURCE_MEMORY_LIMIT maximum page get_free_page can return.
+ * I.e. Maximum page that is mapped directly into kernel memory,
+ * and kmap is not required.
+ *
+ * Someone correct me if FIXADDR_START - PAGE_OFFSET is not the correct
+ * calculation for the amount of memory directly mappable into the
+ * kernel memory space.
+ */
+
+/* Maximum physical address we can use pages from */
+#define KEXEC_SOURCE_MEMORY_LIMIT (FIXADDR_START - PAGE_OFFSET)
+/* Maximum address we can reach in physical address mode */
+#define KEXEC_DESTINATION_MEMORY_LIMIT (-1UL)
+
+#define KEXEC_REBOOT_CODE_SIZE 4096
+#define KEXEC_REBOOT_CODE_ALIGN 0
+
+
+#endif /* _I386_KEXEC_H */
diff -uNr linux-2.5.47/include/asm-i386/unistd.h linux-2.5.47.x86kexec/include/asm-i386/unistd.h
--- linux-2.5.47/include/asm-i386/unistd.h Tue Nov 5 19:03:51 2002
+++ linux-2.5.47.x86kexec/include/asm-i386/unistd.h Mon Nov 11 00:24:07 2002
@@ -262,6 +262,7 @@
#define __NR_sys_epoll_ctl 255
#define __NR_sys_epoll_wait 256
#define __NR_remap_file_pages 257
+#define __NR_sys_kexec_load 258
/* user-visible error numbers are in the range -1 - -124: see <asm-i386/errno.h> */
diff -uNr linux-2.5.47/include/linux/kexec.h linux-2.5.47.x86kexec/include/linux/kexec.h
--- linux-2.5.47/include/linux/kexec.h Wed Dec 31 17:00:00 1969
+++ linux-2.5.47.x86kexec/include/linux/kexec.h Mon Nov 11 00:24:07 2002
@@ -0,0 +1,46 @@
+#ifndef LINUX_KEXEC_H
+#define LINUX_KEXEC_H
+
+#if CONFIG_KEXEC
+#include <linux/types.h>
+#include <asm/kexec.h>
+
+/*
+ * This structure is used to hold the arguments that are used when loading
+ * kernel binaries.
+ */
+
+typedef unsigned long kimage_entry_t;
+#define IND_DESTINATION 0x1
+#define IND_INDIRECTION 0x2
+#define IND_DONE 0x4
+#define IND_SOURCE 0x8
+
+struct kimage {
+ kimage_entry_t head;
+ kimage_entry_t *entry;
+ kimage_entry_t *last_entry;
+
+ unsigned long destination;
+ unsigned long offset;
+
+ unsigned long start;
+ void *reboot_code_buffer;
+};
+
+struct kexec_segment {
+ void *buf;
+ size_t bufsz;
+ void *mem;
+ size_t memsz;
+};
+
+/* kexec interface functions */
+extern void machine_kexec(struct kimage *image);
+extern asmlinkage long sys_kexec_load(unsigned long entry,
+ unsigned long nr_segments, struct kexec_segment *segments, unsigned long flags);
+extern struct kimage *kexec_image;
+extern spinlock_t kexec_image_lock;
+#endif
+#endif /* LINUX_KEXEC_H */
+
diff -uNr linux-2.5.47/include/linux/reboot.h linux-2.5.47.x86kexec/include/linux/reboot.h
--- linux-2.5.47/include/linux/reboot.h Fri Oct 11 22:22:47 2002
+++ linux-2.5.47.x86kexec/include/linux/reboot.h Mon Nov 11 00:24:07 2002
@@ -21,6 +21,7 @@
* POWER_OFF Stop OS and remove all power from system, if possible.
* RESTART2 Restart system using given command string.
* SW_SUSPEND Suspend system using Software Suspend if compiled in
+ * KEXEC Restart the system using a different kernel.
*/
#define LINUX_REBOOT_CMD_RESTART 0x01234567
@@ -30,6 +31,7 @@
#define LINUX_REBOOT_CMD_POWER_OFF 0x4321FEDC
#define LINUX_REBOOT_CMD_RESTART2 0xA1B2C3D4
#define LINUX_REBOOT_CMD_SW_SUSPEND 0xD000FCE2
+#define LINUX_REBOOT_CMD_KEXEC 0x45584543
#ifdef __KERNEL__
diff -uNr linux-2.5.47/kernel/Makefile linux-2.5.47.x86kexec/kernel/Makefile
--- linux-2.5.47/kernel/Makefile Fri Oct 18 11:59:29 2002
+++ linux-2.5.47.x86kexec/kernel/Makefile Mon Nov 11 00:24:07 2002
@@ -21,6 +21,7 @@
obj-$(CONFIG_CPU_FREQ) += cpufreq.o
obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
obj-$(CONFIG_SOFTWARE_SUSPEND) += suspend.o
+obj-$(CONFIG_KEXEC) += kexec.o
ifneq ($(CONFIG_IA64),y)
# According to Alan Modra <[email protected]>, the -fno-omit-frame-pointer is
diff -uNr linux-2.5.47/kernel/kexec.c linux-2.5.47.x86kexec/kernel/kexec.c
--- linux-2.5.47/kernel/kexec.c Wed Dec 31 17:00:00 1969
+++ linux-2.5.47.x86kexec/kernel/kexec.c Mon Nov 11 00:24:07 2002
@@ -0,0 +1,643 @@
+#include <linux/mm.h>
+#include <linux/file.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/version.h>
+#include <linux/compile.h>
+#include <linux/kexec.h>
+#include <linux/spinlock.h>
+#include <net/checksum.h>
+#include <asm/page.h>
+#include <asm/uaccess.h>
+#include <asm/io.h>
+
+/* As designed kexec can only use the memory that you don't
+ * need to use kmap to access. That is, memory that you can use
+ * virt_to_phys() on and call get_free_page to allocate.
+ *
+ * In the best case you need one page for the transition from
+ * virtual to physical memory. And this page must be identity
+ * mapped. Which pretty much leaves you with pages < PAGE_OFFSET
+ * as you can only mess with user pages.
+ *
+ * As the only subset of memory that it is easy to restrict allocation
+ * to is the physical memory mapped into the kernel, I do that
+ * with get_free_page and hope it is enough.
+ *
+ * I don't know of a good way to calculate which pages get_free_page
+ * will return independent of architecture so I depend on
+ * <asm/kexec.h> to properly set
+ * KEXEC_SOURCE_MEMORY_LIMIT and KEXEC_DESTINATION_MEMORY_LIMIT
+ *
+ */
+
+static struct kimage *kimage_alloc(void)
+{
+ struct kimage *image;
+ image = kmalloc(sizeof(*image), GFP_KERNEL);
+ if (!image)
+ return 0;
+ memset(image, 0, sizeof(*image));
+ image->head = 0;
+ image->entry = &image->head;
+ image->last_entry = &image->head;
+ return image;
+}
+static int kimage_add_entry(struct kimage *image, kimage_entry_t entry)
+{
+ if (image->offset != 0) {
+ image->entry++;
+ }
+ if (image->entry == image->last_entry) {
+ kimage_entry_t *ind_page;
+ ind_page = (void *)__get_free_page(GFP_KERNEL);
+ if (!ind_page) {
+ return -ENOMEM;
+ }
+ *image->entry = virt_to_phys(ind_page) | IND_INDIRECTION;
+ image->entry = ind_page;
+ image->last_entry =
+ ind_page + ((PAGE_SIZE/sizeof(kimage_entry_t)) - 1);
+ }
+ *image->entry = entry;
+ image->entry++;
+ image->offset = 0;
+ return 0;
+}
+
+static int kimage_verify_destination(unsigned long destination)
+{
+ int result;
+
+ /* Assume the page is bad unless we pass the checks */
+ result = -EADDRNOTAVAIL;
+
+ if (destination >= KEXEC_DESTINATION_MEMORY_LIMIT) {
+ goto out;
+ }
+
+ /* NOTE: The caller is responsible for making certain we
+ * don't attempt to load the new image into invalid or
+ * reserved areas of RAM.
+ */
+ result = 0;
+out:
+ return result;
+}
+
+static int kimage_set_destination(
+ struct kimage *image, unsigned long destination)
+{
+ int result;
+ destination &= PAGE_MASK;
+ result = kimage_verify_destination(destination);
+ if (result) {
+ return result;
+ }
+ result = kimage_add_entry(image, destination | IND_DESTINATION);
+ if (result == 0) {
+ image->destination = destination;
+ }
+ return result;
+}
+
+
+static int kimage_add_page(struct kimage *image, unsigned long page)
+{
+ int result;
+ page &= PAGE_MASK;
+ result = kimage_verify_destination(image->destination);
+ if (result) {
+ return result;
+ }
+ result = kimage_add_entry(image, page | IND_SOURCE);
+ if (result == 0) {
+ image->destination += PAGE_SIZE;
+ }
+ return result;
+}
+
+
+static int kimage_terminate(struct kimage *image)
+{
+ int result;
+ result = kimage_add_entry(image, IND_DONE);
+ if (result == 0) {
+ /* Point at the terminating element */
+ image->entry--;
+ }
+ return result;
+}
+
+#define for_each_kimage_entry(image, ptr, entry) \
+ for (ptr = &image->head; (entry = *ptr) && !(entry & IND_DONE); \
+ ptr = (entry & IND_INDIRECTION)? \
+ phys_to_virt((entry & PAGE_MASK)): ptr +1)
+
+static void kimage_free(struct kimage *image)
+{
+ kimage_entry_t *ptr, entry;
+ kimage_entry_t ind = 0;
+ if (!image)
+ return;
+ for_each_kimage_entry(image, ptr, entry) {
+ if (entry & IND_INDIRECTION) {
+ /* Free the previous indirection page */
+ if (ind & IND_INDIRECTION) {
+ free_page((unsigned long)phys_to_virt(ind & PAGE_MASK));
+ }
+ /* Save this indirection page until we are
+ * done with it.
+ */
+ ind = entry;
+ }
+ else if (entry & IND_SOURCE) {
+ free_page((unsigned long)phys_to_virt(entry & PAGE_MASK));
+ }
+ }
+ kfree(image);
+}
+
+static int kimage_is_destination_page(
+ struct kimage *image, unsigned long page)
+{
+ kimage_entry_t *ptr, entry;
+ unsigned long destination;
+ destination = 0;
+ page &= PAGE_MASK;
+ for_each_kimage_entry(image, ptr, entry) {
+ if (entry & IND_DESTINATION) {
+ destination = entry & PAGE_MASK;
+ }
+ else if (entry & IND_SOURCE) {
+ if (page == destination) {
+ return 1;
+ }
+ destination += PAGE_SIZE;
+ }
+ }
+ return 0;
+}
+
+static int kimage_get_unused_area(
+ struct kimage *image, unsigned long size, unsigned long align,
+ unsigned long *area)
+{
+ /* Walk through mem_map and find the first chunk of
+ * unused memory that is at least size bytes long.
+ */
+ /* Since the kernel plays with PG_reserved, mem_map is less
+ * than ideal for this purpose, but it will give us a correct
+ * conservative estimate of what we need to do.
+ */
+ /* For now we take advantage of the fact that all kernel pages
+ * are marked with PG_reserved to allocate a large
+ * contiguous area for the reboot code buffer.
+ */
+ unsigned long addr;
+ unsigned long start, end;
+ unsigned long mask;
+ mask = ((1 << align) -1);
+ start = end = PAGE_SIZE;
+ for(addr = PAGE_SIZE; addr < KEXEC_SOURCE_MEMORY_LIMIT; addr += PAGE_SIZE) {
+ struct page *page;
+ unsigned long aligned_start;
+ page = virt_to_page(phys_to_virt(addr));
+ if (PageReserved(page) ||
+ kimage_is_destination_page(image, addr)) {
+ /* The current page is reserved so the start &
+ * end of the next area must be at least at the
+ * next page.
+ */
+ start = end = addr + PAGE_SIZE;
+ }
+ else {
+ /* O.k. The current page isn't reserved
+ * so push up the end of the area.
+ */
+ end = addr;
+ }
+ aligned_start = (start + mask) & ~mask;
+ if (aligned_start > start) {
+ continue;
+ }
+ if (aligned_start > end) {
+ continue;
+ }
+ if (end - aligned_start >= size) {
+ *area = aligned_start;
+ return 0;
+ }
+ }
+ *area = 0;
+ return -ENOSPC;
+}
+
+static kimage_entry_t *kimage_dst_conflict(
+ struct kimage *image, unsigned long page, kimage_entry_t *limit)
+{
+ kimage_entry_t *ptr, entry;
+ unsigned long destination = 0;
+ for_each_kimage_entry(image, ptr, entry) {
+ if (ptr == limit) {
+ return 0;
+ }
+ else if (entry & IND_DESTINATION) {
+ destination = entry & PAGE_MASK;
+ }
+ else if (entry & IND_SOURCE) {
+ if (page == destination) {
+ return ptr;
+ }
+ destination += PAGE_SIZE;
+ }
+ }
+ return 0;
+}
+
+static kimage_entry_t *kimage_src_conflict(
+ struct kimage *image, unsigned long destination, kimage_entry_t *limit)
+{
+ kimage_entry_t *ptr, entry;
+ for_each_kimage_entry(image, ptr, entry) {
+ unsigned long page;
+ if (ptr == limit) {
+ return 0;
+ }
+ else if (entry & IND_DESTINATION) {
+ /* nop */
+ }
+ else if (entry & IND_DONE) {
+ /* nop */
+ }
+ else {
+ /* SOURCE & INDIRECTION */
+ page = entry & PAGE_MASK;
+ if (page == destination) {
+ return ptr;
+ }
+ }
+ }
+ return 0;
+}
+
+static int kimage_get_off_destination_pages(struct kimage *image)
+{
+ kimage_entry_t *ptr, *cptr, entry;
+ unsigned long buffer, page;
+ unsigned long destination = 0;
+
+ /* Here we implement safeguards to ensure that
+ * a source page is not copied to its destination
+ * page before the data on the destination page is
+ * no longer useful.
+ *
+ * To make it work we actually wind up with a
+ * stronger condition. For every page considered
+ * it is either its own destination page or it is
+ * not a destination page of any page considered.
+ *
+ * Invariants
+ * 1. buffer is not a destination of a previous page.
+ * 2. page is not a destination of a previous page.
+ * 3. destination is not a previous source page.
+ *
+ * Result: Either a source page and a destination page
+ * are the same or the page is not a destination page.
+ *
+ * These checks could be done when we allocate the pages,
+ * but doing it as a final pass allows us more freedom
+ * on how we allocate pages.
+ *
+ * Also while the checks are necessary, in practice nothing
+ * happens. The destination kernel wants to sit in the
+ * same physical addresses as the current kernel so we never
+ * actually allocate a destination page.
+ *
+ * BUGS: This is an O(N^2) algorithm.
+ */
+
+
+ buffer = __get_free_page(GFP_KERNEL);
+ if (!buffer) {
+ return -ENOMEM;
+ }
+ buffer = virt_to_phys((void *)buffer);
+ for_each_kimage_entry(image, ptr, entry) {
+ /* Here we check to see if an allocated page */
+ kimage_entry_t *limit;
+ if (entry & IND_DESTINATION) {
+ destination = entry & PAGE_MASK;
+ }
+ else if (entry & IND_INDIRECTION) {
+ /* Indirection pages must include all of their
+ * contents in limit checking.
+ */
+ limit = phys_to_virt(page + PAGE_SIZE - sizeof(*limit));
+ }
+ if (!((entry & IND_SOURCE) | (entry & IND_INDIRECTION))) {
+ continue;
+ }
+ page = entry & PAGE_MASK;
+ limit = ptr;
+
+ /* See if a previous page has the current page as its
+ * destination.
+ * i.e. invariant 2
+ */
+ cptr = kimage_dst_conflict(image, page, limit);
+ if (cptr) {
+ unsigned long cpage;
+ kimage_entry_t centry;
+ centry = *cptr;
+ cpage = centry & PAGE_MASK;
+ memcpy(phys_to_virt(buffer), phys_to_virt(page), PAGE_SIZE);
+ memcpy(phys_to_virt(page), phys_to_virt(cpage), PAGE_SIZE);
+ *cptr = page | (centry & ~PAGE_MASK);
+ *ptr = buffer | (entry & ~PAGE_MASK);
+ buffer = cpage;
+ }
+ if (!(entry & IND_SOURCE)) {
+ continue;
+ }
+
+ /* See if a previous page is our destination page.
+ * If so claim it now.
+ * i.e. invariant 3
+ */
+ cptr = kimage_src_conflict(image, destination, limit);
+ if (cptr) {
+ unsigned long cpage;
+ kimage_entry_t centry;
+ centry = *cptr;
+ cpage = centry & PAGE_MASK;
+ memcpy(phys_to_virt(buffer), phys_to_virt(cpage), PAGE_SIZE);
+ memcpy(phys_to_virt(cpage), phys_to_virt(page), PAGE_SIZE);
+ *cptr = buffer | (centry & ~PAGE_MASK);
+ *ptr = cpage | ( entry & ~PAGE_MASK);
+ buffer = page;
+ }
+ /* If the buffer is my destination page do the copy now
+ * i.e. invariant 3 & 1
+ */
+ if (buffer == destination) {
+ memcpy(phys_to_virt(buffer), phys_to_virt(page), PAGE_SIZE);
+ *ptr = buffer | (entry & ~PAGE_MASK);
+ buffer = page;
+ }
+ }
+ free_page((unsigned long)phys_to_virt(buffer));
+ return 0;
+}
+
+static int kimage_add_empty_pages(struct kimage *image,
+ unsigned long len)
+{
+ unsigned long pos;
+ int result;
+ for(pos = 0; pos < len; pos += PAGE_SIZE) {
+ char *page;
+ result = -ENOMEM;
+ page = (void *)__get_free_page(GFP_KERNEL);
+ if (!page) {
+ goto out;
+ }
+ result = kimage_add_page(image, virt_to_phys(page));
+ if (result) {
+ goto out;
+ }
+ }
+ result = 0;
+ out:
+ return result;
+}
+
+
+static int kimage_load_segment(struct kimage *image,
+ struct kexec_segment *segment)
+{
+ unsigned long mstart;
+ int result;
+ unsigned long offset;
+ unsigned long offset_end;
+ unsigned char *buf;
+
+ result = 0;
+ buf = segment->buf;
+ mstart = (unsigned long)segment->mem;
+
+ offset_end = segment->memsz;
+
+ result = kimage_set_destination(image, mstart);
+ if (result < 0) {
+ goto out;
+ }
+ for(offset = 0; offset < segment->memsz; offset += PAGE_SIZE) {
+ char *page;
+ size_t size, leader;
+ page = (char *)__get_free_page(GFP_KERNEL);
+ if (page == 0) {
+ result = -ENOMEM;
+ goto out;
+ }
+ result = kimage_add_page(image, virt_to_phys(page));
+ if (result < 0) {
+ goto out;
+ }
+ if (segment->bufsz < offset) {
+ /* We are past the end of the buffer; zero the whole page */
+ memset(page, 0, PAGE_SIZE);
+ continue;
+ }
+ size = PAGE_SIZE;
+ leader = 0;
+ if (offset == 0) {
+ leader = mstart & ~PAGE_MASK;
+ }
+ if (leader) {
+ /* We are on the first page; zero the unused leading portion */
+ memset(page, 0, leader);
+ size -= leader;
+ page += leader;
+ }
+ if (size > (segment->bufsz - offset)) {
+ size = segment->bufsz - offset;
+ }
+ result = copy_from_user(page, buf + offset, size);
+ if (result) {
+ result = -EFAULT; /* copy_from_user returns bytes not copied, never < 0 */
+ goto out;
+ }
+ if (size < (PAGE_SIZE - leader)) {
+ /* zero the trailing part of the page */
+ memset(page + size, 0, (PAGE_SIZE - leader) - size);
+ }
+ }
+ out:
+ return result;
+}
+
+
+/* do_kexec executes a new kernel
+ */
+static int do_kexec(unsigned long start, unsigned long nr_segments,
+ struct kexec_segment *arg_segments, struct kimage *image)
+{
+ struct kexec_segment *segments;
+ size_t segment_bytes;
+ int i;
+
+ int result;
+ unsigned long reboot_code_buffer;
+ kimage_entry_t *end;
+
+ /* Initialize variables */
+ segments = 0;
+
+ segment_bytes = nr_segments * sizeof(*segments);
+ segments = kmalloc(segment_bytes, GFP_KERNEL);
+ if (segments == 0) {
+ result = -ENOMEM;
+ goto out;
+ }
+ result = -EFAULT; /* copy_from_user returns bytes left uncopied */
+ if (copy_from_user(segments, arg_segments, segment_bytes)) {
+ goto out;
+ }
+
+ /* Read in the data from user space */
+ image->start = start;
+ for(i = 0; i < nr_segments; i++) {
+ result = kimage_load_segment(image, &segments[i]);
+ if (result) {
+ goto out;
+ }
+ }
+
+ /* Terminate early so I can get a place holder. */
+ result = kimage_terminate(image);
+ if (result)
+ goto out;
+ end = image->entry;
+
+ /* Usage of the reboot code buffer is subtle. We first
+ * find a contiguous area of ram that is not one
+ * of our destination pages. We do not allocate the ram.
+ *
+ * The algorithm to make certain we do not have address
+ * conflicts requires each destination region to have some
+ * backing store so we allocate arbitrary source pages.
+ *
+ * Later in machine_kexec when we copy data to the
+ * reboot_code_buffer it still may be allocated for other
+ * purposes, but we do know there are no source or destination
+ * pages in that area. And since the rest of the kernel
+ * is already shutdown those pages are free for use,
+ * regardless of their page->count values.
+ *
+ * The kernel mapping of the reboot code buffer is passed to
+ * the machine dependent code. If it needs something else
+ * it is free to set that up.
+ */
+ result = kimage_get_unused_area(
+ image, KEXEC_REBOOT_CODE_SIZE, KEXEC_REBOOT_CODE_ALIGN,
+ &reboot_code_buffer);
+ if (result)
+ goto out;
+
+ /* Allocating pages we should never need is silly but the
+ * code won't work correctly unless we have dummy pages to
+ * work with.
+ */
+ result = kimage_set_destination(image, reboot_code_buffer);
+ if (result)
+ goto out;
+ result = kimage_add_empty_pages(image, KEXEC_REBOOT_CODE_SIZE);
+ if (result)
+ goto out;
+ image->reboot_code_buffer = phys_to_virt(reboot_code_buffer);
+
+ result = kimage_terminate(image);
+ if (result)
+ goto out;
+
+ result = kimage_get_off_destination_pages(image);
+ if (result)
+ goto out;
+
+ /* Now hide the extra source pages for the reboot code buffer.
+ */
+ image->entry = end;
+ result = kimage_terminate(image);
+ if (result)
+ goto out;
+
+ result = 0;
+ out:
+ /* cleanup and exit */
+ if (segments) kfree(segments);
+ return result;
+}
+
+
+/*
+ * Exec Kernel system call: for obvious reasons only root may call it.
+ *
+ * This call breaks up into three pieces.
+ * - A generic part which loads the new kernel from the current
+ * address space, and very carefully places the data in the
+ * allocated pages.
+ *
+ * - A generic part that interacts with the kernel and tells all of
+ * the devices to shut down, preventing ongoing DMA and placing
+ * the devices in a consistent state so a later kernel can
+ * reinitialize them.
+ *
+ * - A machine specific part that includes the syscall number
+ * and then copies the image to its final destination, and
+ * jumps into the image at entry.
+ *
+ * kexec does not sync, or unmount filesystems so if you need
+ * that to happen you need to do that yourself.
+ */
+struct kimage *kexec_image = 0;
+spinlock_t kexec_image_lock = SPIN_LOCK_UNLOCKED;
+
+asmlinkage long sys_kexec_load(unsigned long entry, unsigned long nr_segments,
+ struct kexec_segment *segments, unsigned long flags)
+{
+ /* Am I using too much stack space here? */
+ struct kimage *image, *old_image;
+ int result;
+
+ /* We only trust the superuser with rebooting the system. */
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ /* In case we need just a little bit of special behavior for
+ * reboot on panic
+ */
+ if (flags != 0)
+ return -EINVAL;
+
+ image = 0;
+ if (nr_segments > 0) {
+ image = kimage_alloc();
+ if (!image) {
+ return -ENOMEM;
+ }
+ result = do_kexec(entry, nr_segments, segments, image);
+ if (result) {
+ kimage_free(image);
+ return result;
+ }
+ }
+
+ spin_lock(&kexec_image_lock);
+ old_image = kexec_image;
+ kexec_image = image;
+ spin_unlock(&kexec_image_lock);
+
+ kimage_free(old_image);
+ return 0;
+}
diff -uNr linux-2.5.47/kernel/sys.c linux-2.5.47.x86kexec/kernel/sys.c
--- linux-2.5.47/kernel/sys.c Tue Nov 5 19:03:56 2002
+++ linux-2.5.47.x86kexec/kernel/sys.c Mon Nov 11 00:24:07 2002
@@ -16,6 +16,7 @@
#include <linux/init.h>
#include <linux/highuid.h>
#include <linux/fs.h>
+#include <linux/kexec.h>
#include <linux/workqueue.h>
#include <linux/device.h>
#include <linux/times.h>
@@ -206,6 +207,7 @@
cond_syscall(sys_lookup_dcookie)
cond_syscall(sys_swapon)
cond_syscall(sys_swapoff)
+cond_syscall(sys_kexec_load)
static int set_one_prio(struct task_struct *p, int niceval, int error)
{
@@ -414,6 +416,27 @@
machine_restart(buffer);
break;
+#ifdef CONFIG_KEXEC
+ case LINUX_REBOOT_CMD_KEXEC:
+ {
+ struct kimage *image;
+ spin_lock(&kexec_image_lock);
+ image = kexec_image;
+ if (!image || arg) {
+ spin_unlock(&kexec_image_lock);
+ unlock_kernel();
+ return -EINVAL;
+ }
+ notifier_call_chain(&reboot_notifier_list, SYS_RESTART, NULL);
+ system_running = 0;
+ device_shutdown();
+ printk(KERN_EMERG "Starting new kernel\n");
+ machine_kexec(image);
+ /* We never get here... */
+ spin_unlock(&kexec_image_lock);
+ break;
+ }
+#endif
#ifdef CONFIG_SOFTWARE_SUSPEND
case LINUX_REBOOT_CMD_SW_SUSPEND:
if (!software_suspend_enabled) {
On Mon, 2002-11-11 at 10:15, Eric W. Biederman wrote:
> kexec is a set of system calls that allows you to load another kernel
> from the currently executing Linux kernel.
> And is currently kept in two pieces.
> The pure system call.
> http://www.xmission.com/~ebiederm/files/kexec/linux-2.5.47.x86kexec.diff
FYI: that patch applies cleanly to pure 2.5.47 (bk [email protected]).
The current front of the tree does not patch 100% cleanly (conflicts
with recent module changes).
Results on my usual problem machine:
# ./kexec-1.5 ./kexec_test-1.5
Shutting down devices
Debug: sleeping function called from illegal context at include/asm/semaphore.h9
Call Trace: [<c011a698>] [<c0216193>] [<c012b165>] [<c0132dec>] [<c0140357> Starting new kernel
kexec_test 1.5 starting...
eax: 0E1FB007 ebx: 00001078 ecx: 00000000 edx: 00000000
esi: 00000000 edi: 00000000 esp: 00000000 ebp: 00000000
idt: 00000000 C0000000
gdt: 00000000 C0000000
Switching descriptors.
Descriptors changed.
Legacy pic setup.
In real mode.
<hang>
Sorry about the linewrap.
Same as last time, but the good news is that splitting the load and reboot
operations works as expected.
> And the set of hardware fixes known to help kexec.
> http://www.xmission.com/~ebiederm/files/kexec/linux-2.5.47.x86kexec-hwfixes.diff
Missing or inaccessible. I'll try some duct tape and the
linux-2.5.44.x86kexec-hwfixes.diff and see what happens.
Confirming some earlier suspicions:
CONFIG_SMP=y
CONFIG_X86_GOOD_APIC=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
Last time I tried to run a UP kernel (and no APIC support) on this system
it wasn't pretty. I'll add that to my list of combinations to try.
And as always:
% lspci
00:00.0 Host bridge: ServerWorks CNB20LE Host Bridge (rev 06)
00:00.1 Host bridge: ServerWorks CNB20LE Host Bridge (rev 06)
00:01.0 VGA compatible controller: S3 Inc. Savage 4 (rev 04)
00:09.0 Ethernet controller: Intel Corp. 82557/8/9 [Ethernet Pro 100] (rev 08)
00:0f.0 ISA bridge: ServerWorks OSB4 South Bridge (rev 50)
00:0f.1 IDE interface: ServerWorks OSB4 IDE Controller
00:0f.2 USB Controller: ServerWorks OSB4/CSB5 OHCI USB Controller (rev 04)
01:03.0 SCSI storage controller: Adaptec AIC-7892P U160/m (rev 02)
% cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 8
model name : Pentium III (Coppermine)
stepping : 10
cpu MHz : 799.957
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips : 1576.96
%
Andy Pfiffer <[email protected]> writes:
> On Mon, 2002-11-11 at 10:15, Eric W. Biederman wrote:
> > kexec is a set of system calls that allows you to load another kernel
> > from the currently executing Linux kernel.
>
> > And is currently kept in two pieces.
> > The pure system call.
> > http://www.xmission.com/~ebiederm/files/kexec/linux-2.5.47.x86kexec.diff
>
> FYI: that patch applies cleanly to pure 2.5.47 (bk [email protected]).
>
> The current front of the tree does not patch 100% cleanly (conflicts
> with recent module changes).
I will have to take a look next time a snapshot is uploaded. bk and I have
not yet become friends.
> Results on my usual problem machine:
>
> # ./kexec-1.5 ./kexec_test-1.5
> Shutting down devices
> Debug: sleeping function called from illegal context at include/asm/semaphore.h9
>
> Call Trace: [<c011a698>] [<c0216193>] [<c012b165>] [<c0132dec>] [<c0140357>
Hmm. I wonder what is doing that. Do you have the semaphore problem on a normal reboot?
> Starting new kernel
>
> kexec_test 1.5 starting...
> eax: 0E1FB007 ebx: 00001078 ecx: 00000000 edx: 00000000
> esi: 00000000 edi: 00000000 esp: 00000000 ebp: 00000000
> idt: 00000000 C0000000
> gdt: 00000000 C0000000
> Switching descriptors.
> Descriptors changed.
> Legacy pic setup.
> In real mode.
> <hang>
Yep it works until it runs into your apics that are not shutdown.
That looks like one of the next things to tackle.
> Same as last time, but the good news is that splitting the load and reboot
> operations works as expected.
That is what my test machine said as well. But the confirmation is nice.
And it definitely means I uploaded a working sample user space.
> > And the set of hardware fixes known to help kexec.
> >
> http://www.xmission.com/~ebiederm/files/kexec/linux-2.5.47.x86kexec-hwfixes.diff
>
>
> Missing or inaccessible. I'll try some duct tape and the
> linux-2.5.44.x86kexec-hwfixes.diff and see what happens.
The .47 version is pretty much just a forward port. It is uploaded now.
My apologies for not getting to it earlier.
The challenge with the apic shutdown is that currently the apics are not
in the device tree so that needs to happen before I can submit a good version
for 2.5.x
> Confirming some earlier suspicions:
> CONFIG_SMP=y
> CONFIG_X86_GOOD_APIC=y
> CONFIG_X86_LOCAL_APIC=y
> CONFIG_X86_IO_APIC=y
>
> Last time I tried to run a UP kernel (and no APIC support) on this system
> it wasn't pretty. I'll add that to my list of combinations to try.
I would not worry about it too much. I'm just happy my tools are good enough
that with a little thinking I can figure out what the problem is.
Getting there was hard.
Eric
On Mon, 2002-11-11 at 23:22, Eric W. Biederman wrote:
> > On Mon, 2002-11-11 at 10:15, Eric W. Biederman wrote:
> > > kexec is a set of system calls that allows you to load another kernel
> > > from the currently executing Linux kernel.
> >
> > Results on my usual problem machine:
> >
> > # ./kexec-1.5 ./kexec_test-1.5
> > Shutting down devices
> > Debug: sleeping function called from illegal context at include/asm/semaphore.h9
> >
> > Call Trace: [<c011a698>] [<c0216193>] [<c012b165>] [<c0132dec>] [<c0140357>
>
> Hmm. I wonder what is doing that. Do you have the semaphore problem on a normal reboot?
No clue as of yet. I do not see this information during a normal
reboot.
> > Starting new kernel
> >
> > kexec_test 1.5 starting...
> > eax: 0E1FB007 ebx: 00001078 ecx: 00000000 edx: 00000000
> > esi: 00000000 edi: 00000000 esp: 00000000 ebp: 00000000
> > idt: 00000000 C0000000
> > gdt: 00000000 C0000000
> > Switching descriptors.
> > Descriptors changed.
> > Legacy pic setup.
> > In real mode.
> > <hang>
>
> Yep it works until it runs into your apics that are not shutdown.
> That looks like one of the next things to tackle.
I used the linux-2.5.44.x86kexec-hwfixes.diff (it applied cleanly to
pure 2.5.47 + kexec); I'll try your updated version soon if there are
any major differences.
> The challenge with the apic shutdown is that currently the apics are not
> in the device tree so that needs to happen before I can submit a good version
> for 2.5.x
>
>
> > Confirming some earlier suspicions:
> > CONFIG_SMP=y
> > CONFIG_X86_GOOD_APIC=y
> > CONFIG_X86_LOCAL_APIC=y
> > CONFIG_X86_IO_APIC=y
> >
> > Last time I tried to run a UP kernel (and no APIC support) on this system
> > it wasn't pretty. I'll add that to my list of combinations to try.
On this same system, I reconfigured and tried this:
# CONFIG_SMP is not set
CONFIG_X86_GOOD_APIC=y
# CONFIG_X86_UP_APIC is not set
# CONFIG_X86_LOCAL_APIC is not set
# CONFIG_X86_IO_APIC is not set
None of the "ordinary" APIC initialization messages were output during
the regular BIOS->LILO boot of this kernel.
Using kexec on this kernel to run kexec_test-1.5 stops in the same way:
# ./kexec-1.5 --debug ./kexec_test-1.5
Shutting down devices
Debug: sleeping function called from illegal context at
include/asm/semaphore.h9Call Trace: [<c0113f7c>] [<c01ec123>]
[<c0120af2>] [<c0130d5d>] [<c0130d5d> Starting new kernel
kexec_test 1.5 starting...
eax: 0E1FB007 ebx: 00001078 ecx: 00000000 edx: 00000000
esi: 00000000 edi: 00000000 esp: 00000000 ebp: 00000000
idt: 00000000 C0000000
gdt: 00000000 C0000000
Switching descriptors.
Descriptors changed.
Legacy pic setup.
In real mode.
<hang>
So, does this information suggest looking somewhere other than APIC
shutdown?
Andy
Andy Pfiffer <[email protected]> writes:
> On Mon, 2002-11-11 at 23:22, Eric W. Biederman wrote:
> > > On Mon, 2002-11-11 at 10:15, Eric W. Biederman wrote:
> > > > kexec is a set of system calls that allows you to load another kernel
> > > > from the currently executing Linux kernel.
> > >
>
> > > Results on my usual problem machine:
> > >
> > > # ./kexec-1.5 ./kexec_test-1.5
> > > Shutting down devices
> > > Debug: sleeping function called from illegal context at
> include/asm/semaphore.h9
>
> > >
> > > Call Trace: [<c011a698>] [<c0216193>] [<c012b165>] [<c0132dec>] [<c0140357>
> >
> > Hmm. I wonder what is doing that. Do you have the semaphore problem on a
> normal reboot?
>
>
> No clue as of yet. I do not see this information during a normal
> reboot.
Doh. I must compile that debugging in when I am testing. I introduced a spin lock,
and then I called a function that might sleep... Though I am puzzled by what
in the device_shutdown and reboot notifier path is actually sleeping
but that is academic.
Next version will use a semaphore to be polite.
I should have asked where those addresses mapped to in your
system.map.
Anyway, this is one of the reasons I grumble about splitting it: more global
variables that have to be protected, and more chances to fumble
something. Oh, well.
> > > Starting new kernel
> > >
> > > kexec_test 1.5 starting...
> > > eax: 0E1FB007 ebx: 00001078 ecx: 00000000 edx: 00000000
> > > esi: 00000000 edi: 00000000 esp: 00000000 ebp: 00000000
> > > idt: 00000000 C0000000
> > > gdt: 00000000 C0000000
> > > Switching descriptors.
> > > Descriptors changed.
> > > Legacy pic setup.
> > > In real mode.
> > > <hang>
> >
> > Yep it works until it runs into your apics that are not shutdown.
> > That looks like one of the next things to tackle.
>
> I used the linux-2.5.44.x86kexec-hwfixes.diff (it applied cleanly to
> pure 2.5.47 + kexec); I'll try your updated version soon if there are
> any major differences.
I don't think there is anything significant.
> > The challenge with the apic shutdown is that currently the apics are not
> > in the device tree so that needs to happen before I can submit a good version
> > for 2.5.x
> >
> >
> > > Confirming some earlier suspicions:
> > > CONFIG_SMP=y
> > > CONFIG_X86_GOOD_APIC=y
> > > CONFIG_X86_LOCAL_APIC=y
> > > CONFIG_X86_IO_APIC=y
> > >
> > > Last time I tried to run a UP kernel (and no APIC support) on this system
> > > it wasn't pretty. I'll add that to my list of combinations to try.
>
> On this same system, I reconfigured and tried this:
> # CONFIG_SMP is not set
> CONFIG_X86_GOOD_APIC=y
> # CONFIG_X86_UP_APIC is not set
> # CONFIG_X86_LOCAL_APIC is not set
> # CONFIG_X86_IO_APIC is not set
>
> None of the "ordinary" APIC initialization messages were output during
> the regular BIOS->LILO boot of this kernel.
>
> So, does this information suggest looking somewhere other than APIC
> shutdown?
I am not certain. All that is certain is there is an unhandled
interrupt.
Anyway the next step will be to enter the Linux kernel in 32bit mode
so I can avoid the whole mess of getting the BIOS working again. That
should avoid most of these complications as I will be able to skip
the whole step of enabling interrupts.
Eric
O.k. and now a version that applies cleanly to
v2.5.47-bk2 aka [email protected]
I killed all of the locks and used xchg. That is what I really wanted
anyway.
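For anyone unfamiliar with the pattern, here is a minimal user-space sketch of replacing a shared image pointer with one atomic exchange instead of taking a lock. It uses C11 atomic_exchange as a stand-in for the kernel's xchg(); struct image, current_image and swap_image are hypothetical names for illustration and are not taken from the patch.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct kimage; only the pointer matters here. */
struct image {
    int id;
};

/* The single shared slot holding the currently loaded image (or NULL). */
static _Atomic(struct image *) current_image = NULL;

/*
 * Install new_image (NULL means "unload") and return whatever was there
 * before. The exchange is one atomic step, so each old pointer is handed
 * back to exactly one caller, who is then responsible for freeing it.
 */
static struct image *swap_image(struct image *new_image)
{
    return atomic_exchange(&current_image, new_image);
}
```

In this sketch the caller frees the returned pointer after the swap, which mirrors the load path's "install the new image, then free the old one" order without any lock held across the free.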
Linus care to comment on anything you see wrong?
Eric
MAINTAINERS | 7
arch/i386/Kconfig | 17
arch/i386/kernel/Makefile | 1
arch/i386/kernel/entry.S | 1
arch/i386/kernel/machine_kexec.c | 142 ++++++++
arch/i386/kernel/relocate_kernel.S | 99 +++++
include/asm-i386/kexec.h | 25 +
include/asm-i386/unistd.h | 1
include/linux/kexec.h | 45 ++
include/linux/reboot.h | 2
kernel/Makefile | 1
kernel/kexec.c | 640 +++++++++++++++++++++++++++++++++++++
kernel/sys.c | 23 +
13 files changed, 1004 insertions
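As a reading aid for relocate_kernel.S in the patch below, here is a user-space C model of the loop that walks the entry list and copies pages. The IND_* flag values match include/linux/kexec.h in this patch; walk_entries, entry_t, PAGE_SZ and ADDR_MASK are names made up for this sketch, and it assumes every page and every entry list is 4096-byte aligned so that the low bits are free to carry flags.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SZ          4096u
#define IND_DESTINATION  0x1
#define IND_INDIRECTION  0x2
#define IND_DONE         0x4
#define IND_SOURCE       0x8
#define ADDR_MASK        (~(uintptr_t)(PAGE_SZ - 1))

typedef uintptr_t entry_t;

/*
 * Walk the entry list starting at ptr and return the number of pages copied.
 * Each entry is a page-aligned address with one flag in its low bits:
 *   DESTINATION  set the address the next source page is copied to
 *   INDIRECTION  continue reading entries from another page
 *   DONE         stop
 *   SOURCE       copy one page to the destination, then advance it
 * Entries with none of these flags are skipped, as in the assembly.
 */
static unsigned walk_entries(const entry_t *ptr)
{
    unsigned char *dest = NULL;
    unsigned copied = 0;

    for (;;) {
        entry_t e = *ptr++;

        if (e & IND_DESTINATION) {
            dest = (unsigned char *)(e & ADDR_MASK);
        } else if (e & IND_INDIRECTION) {
            ptr = (const entry_t *)(e & ADDR_MASK);
        } else if (e & IND_DONE) {
            return copied;
        } else if (e & IND_SOURCE) {
            /* Sketch assumption: a destination entry came first. */
            assert(dest != NULL);
            memcpy(dest, (const void *)(e & ADDR_MASK), PAGE_SZ);
            dest += PAGE_SZ;   /* rep movsl leaves %edi one page further on */
            copied++;
        }
    }
}
```

Consecutive source entries land on consecutive destination pages, so a segment needs only one destination entry followed by its source pages.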
diff -uNr linux-2.5.47-bk2/MAINTAINERS linux-2.5.47-bk2.x86kexec/MAINTAINERS
--- linux-2.5.47-bk2/MAINTAINERS Mon Nov 11 00:22:33 2002
+++ linux-2.5.47-bk2.x86kexec/MAINTAINERS Wed Nov 13 06:08:52 2002
@@ -968,6 +968,13 @@
W: http://www.cse.unsw.edu.au/~neilb/patches/linux-devel/
S: Maintained
+KEXEC
+P: Eric Biederman
+M: [email protected]
+M: [email protected]
+L: [email protected]
+S: Maintained
+
LANMEDIA WAN CARD DRIVER
P: Andrew Stanley-Jones
M: [email protected]
diff -uNr linux-2.5.47-bk2/arch/i386/Kconfig linux-2.5.47-bk2.x86kexec/arch/i386/Kconfig
--- linux-2.5.47-bk2/arch/i386/Kconfig Wed Nov 13 06:08:11 2002
+++ linux-2.5.47-bk2.x86kexec/arch/i386/Kconfig Wed Nov 13 06:08:52 2002
@@ -784,6 +784,23 @@
depends on (SMP || PREEMPT) && X86_CMPXCHG
default y
+config KEXEC
+ bool "kexec system call (EXPERIMENTAL)"
+ depends on EXPERIMENTAL
+ help
+ kexec is a system call that implements the ability to shut down your
+ current kernel, and to start another kernel. It is like a reboot
+ but it is independent of the system firmware. And like a reboot
+ you can start any kernel with it, not just Linux.
+
+ The name comes from the similarity to the exec system call.
+
+ It is an ongoing process to be certain the hardware in a machine
+ is properly shut down, so do not be surprised if this code does not
+ initially work for you. It may help to enable device hotplugging
+ support. As of this writing the exact hardware interface is
+ strongly in flux, so no good recommendation can be made.
+
endmenu
diff -uNr linux-2.5.47-bk2/arch/i386/kernel/Makefile linux-2.5.47-bk2.x86kexec/arch/i386/kernel/Makefile
--- linux-2.5.47-bk2/arch/i386/kernel/Makefile Wed Nov 13 06:08:11 2002
+++ linux-2.5.47-bk2.x86kexec/arch/i386/kernel/Makefile Wed Nov 13 06:09:36 2002
@@ -24,6 +24,7 @@
obj-$(CONFIG_X86_MPPARSE) += mpparse.o
obj-$(CONFIG_X86_LOCAL_APIC) += apic.o nmi.o
obj-$(CONFIG_X86_IO_APIC) += io_apic.o
+obj-$(CONFIG_KEXEC) += machine_kexec.o relocate_kernel.o
obj-$(CONFIG_SOFTWARE_SUSPEND) += suspend.o suspend_asm.o
obj-$(CONFIG_X86_NUMAQ) += numaq.o
obj-$(CONFIG_PROFILING) += profile.o
diff -uNr linux-2.5.47-bk2/arch/i386/kernel/entry.S linux-2.5.47-bk2.x86kexec/arch/i386/kernel/entry.S
--- linux-2.5.47-bk2/arch/i386/kernel/entry.S Wed Nov 13 06:08:11 2002
+++ linux-2.5.47-bk2.x86kexec/arch/i386/kernel/entry.S Wed Nov 13 06:08:52 2002
@@ -743,6 +743,7 @@
.long sys_epoll_ctl /* 255 */
.long sys_epoll_wait
.long sys_remap_file_pages
+ .long sys_kexec_load
.rept NR_syscalls-(.-sys_call_table)/4
diff -uNr linux-2.5.47-bk2/arch/i386/kernel/machine_kexec.c linux-2.5.47-bk2.x86kexec/arch/i386/kernel/machine_kexec.c
--- linux-2.5.47-bk2/arch/i386/kernel/machine_kexec.c Wed Dec 31 17:00:00 1969
+++ linux-2.5.47-bk2.x86kexec/arch/i386/kernel/machine_kexec.c Wed Nov 13 06:08:52 2002
@@ -0,0 +1,142 @@
+#include <linux/config.h>
+#include <linux/mm.h>
+#include <linux/kexec.h>
+#include <linux/delay.h>
+#include <asm/pgtable.h>
+#include <asm/pgalloc.h>
+#include <asm/tlbflush.h>
+#include <asm/io.h>
+#include <asm/apic.h>
+
+
+/*
+ * machine_kexec
+ * =======================
+ */
+
+
+static void set_idt(void *newidt, __u16 limit)
+{
+ unsigned char curidt[6];
+
+ /* ia32 supports unaligned loads & stores */
+ (*(__u16 *)(curidt)) = limit;
+ (*(__u32 *)(curidt +2)) = (unsigned long)(newidt);
+
+ __asm__ __volatile__ (
+ "lidt %0\n"
+ : : "m" (curidt)
+ );
+}
+
+
+static void set_gdt(void *newgdt, __u16 limit)
+{
+ unsigned char curgdt[6];
+
+ /* ia32 supports unaligned loads & stores */
+ (*(__u16 *)(curgdt)) = limit;
+ (*(__u32 *)(curgdt +2)) = (unsigned long)(newgdt);
+
+ __asm__ __volatile__ (
+ "lgdt %0\n"
+ : : "m" (curgdt)
+ );
+}
+
+static void load_segments(void)
+{
+#define __STR(X) #X
+#define STR(X) __STR(X)
+
+ __asm__ __volatile__ (
+ "\tljmp $"STR(__KERNEL_CS)",$1f\n"
+ "\t1:\n"
+ "\tmovl $"STR(__KERNEL_DS)",%eax\n"
+ "\tmovl %eax,%ds\n"
+ "\tmovl %eax,%es\n"
+ "\tmovl %eax,%fs\n"
+ "\tmovl %eax,%gs\n"
+ "\tmovl %eax,%ss\n"
+ );
+#undef STR
+#undef __STR
+}
+
+static void identity_map_page(unsigned long address)
+{
+ /* This code is x86 specific...
+ * general purpose code must be more careful
+ * of caches and tlbs...
+ */
+ pgd_t *pgd;
+ pmd_t *pmd;
+ struct mm_struct *mm = current->mm;
+ spin_lock(&mm->page_table_lock);
+
+ pgd = pgd_offset(mm, address);
+ pmd = pmd_alloc(mm, pgd, address);
+
+ if (pmd) {
+ pte_t *pte = pte_alloc_map(mm, pmd, address);
+ if (pte) {
+ set_pte(pte,
+ mk_pte(virt_to_page(phys_to_virt(address)),
+ PAGE_SHARED));
+ __flush_tlb_one(address);
+ }
+ }
+ spin_unlock(&mm->page_table_lock);
+}
+
+
+typedef void (*relocate_new_kernel_t)(
+ unsigned long indirection_page, unsigned long reboot_code_buffer,
+ unsigned long start_address);
+
+const extern unsigned char relocate_new_kernel[];
+extern void relocate_new_kernel_end(void);
+const extern unsigned int relocate_new_kernel_size;
+
+void machine_kexec(struct kimage *image)
+{
+ unsigned long *indirection_page;
+ void *reboot_code_buffer;
+ relocate_new_kernel_t rnk;
+
+ /* Interrupts aren't acceptable while we reboot */
+ local_irq_disable();
+ reboot_code_buffer = image->reboot_code_buffer;
+ indirection_page = phys_to_virt(image->head & PAGE_MASK);
+
+ identity_map_page(virt_to_phys(reboot_code_buffer));
+
+ /* copy it out */
+ memcpy(reboot_code_buffer, relocate_new_kernel,
+ relocate_new_kernel_size);
+
+ /* The segment registers are funny things: they are
+ * automatically loaded from a table in memory whenever you
+ * set them to a specific selector, but this table is never
+ * accessed again until you set the segment to a different selector.
+ *
+ * The more common model is a cache, where the behind-the-scenes
+ * work is done for you but may also be dropped at arbitrary
+ * times.
+ *
+ * I take advantage of this here by force loading the
+ * segments, before I zap the gdt with an invalid value.
+ */
+ load_segments();
+ /* The gdt & idt are now invalid.
+ * If you want to load them you must set up your own idt & gdt.
+ */
+ set_gdt(phys_to_virt(0),0);
+ set_idt(phys_to_virt(0),0);
+
+ /* now call it */
+ rnk = (relocate_new_kernel_t) virt_to_phys(reboot_code_buffer);
+ (*rnk)(virt_to_phys(indirection_page), virt_to_phys(reboot_code_buffer),
+ image->start);
+}
+
diff -uNr linux-2.5.47-bk2/arch/i386/kernel/relocate_kernel.S linux-2.5.47-bk2.x86kexec/arch/i386/kernel/relocate_kernel.S
--- linux-2.5.47-bk2/arch/i386/kernel/relocate_kernel.S Wed Dec 31 17:00:00 1969
+++ linux-2.5.47-bk2.x86kexec/arch/i386/kernel/relocate_kernel.S Wed Nov 13 06:08:52 2002
@@ -0,0 +1,99 @@
+#include <linux/config.h>
+#include <linux/linkage.h>
+
+ /* Must be relocatable PIC code callable as a C function, that once
+ * it starts cannot use the previous process's stack.
+ *
+ */
+ .globl relocate_new_kernel
+relocate_new_kernel:
+ /* read the arguments and say goodbye to the stack */
+ movl 4(%esp), %ebx /* indirection_page */
+ movl 8(%esp), %ebp /* reboot_code_buffer */
+ movl 12(%esp), %edx /* start address */
+
+ /* zero out flags, and disable interrupts */
+ pushl $0
+ popfl
+
+ /* set a new stack at the bottom of our page... */
+ lea 4096(%ebp), %esp
+
+ /* store the parameters back on the stack */
+ pushl %edx /* store the start address */
+
+ /* Set cr0 to a known state:
+ * 31 0 == Paging disabled
+ * 18 0 == Alignment check disabled
+ * 16 0 == Write protect disabled
+ * 3 0 == No task switch
+ * 2 0 == Don't do FP software emulation.
+ * 0 1 == Protected mode enabled
+ */
+ movl %cr0, %eax
+ andl $~((1<<31)|(1<<18)|(1<<16)|(1<<3)|(1<<2)), %eax
+ orl $(1<<0), %eax
+ movl %eax, %cr0
+ jmp 1f
+1:
+
+ /* Flush the TLB (needed?) */
+ xorl %eax, %eax
+ movl %eax, %cr3
+
+ /* Do the copies */
+ cld
+0: /* top, read another word from the indirection page */
+ movl %ebx, %ecx
+ movl (%ebx), %ecx
+ addl $4, %ebx
+ testl $0x1, %ecx /* is it a destination page */
+ jz 1f
+ movl %ecx, %edi
+ andl $0xfffff000, %edi
+ jmp 0b
+1:
+ testl $0x2, %ecx /* is it an indirection page */
+ jz 1f
+ movl %ecx, %ebx
+ andl $0xfffff000, %ebx
+ jmp 0b
+1:
+ testl $0x4, %ecx /* is it the done indicator */
+ jz 1f
+ jmp 2f
+1:
+ testl $0x8, %ecx /* is it the source indicator */
+ jz 0b /* Ignore it otherwise */
+ movl %ecx, %esi /* For every source page do a copy */
+ andl $0xfffff000, %esi
+
+ movl $1024, %ecx
+ rep ; movsl
+ jmp 0b
+
+2:
+
+ /* To be certain of avoiding problems with self modifying code
+ * I need to execute a serializing instruction here.
+ * So I flush the TLB, it's handy, and not processor dependent.
+ */
+ xorl %eax, %eax
+ movl %eax, %cr3
+
+ /* set all of the registers to known values */
+ /* leave %esp alone */
+
+ xorl %eax, %eax
+ xorl %ebx, %ebx
+ xorl %ecx, %ecx
+ xorl %edx, %edx
+ xorl %esi, %esi
+ xorl %edi, %edi
+ xorl %ebp, %ebp
+ ret
+relocate_new_kernel_end:
+
+ .globl relocate_new_kernel_size
+relocate_new_kernel_size:
+ .long relocate_new_kernel_end - relocate_new_kernel
diff -uNr linux-2.5.47-bk2/include/asm-i386/kexec.h linux-2.5.47-bk2.x86kexec/include/asm-i386/kexec.h
--- linux-2.5.47-bk2/include/asm-i386/kexec.h Wed Dec 31 17:00:00 1969
+++ linux-2.5.47-bk2.x86kexec/include/asm-i386/kexec.h Wed Nov 13 06:08:52 2002
@@ -0,0 +1,25 @@
+#ifndef _I386_KEXEC_H
+#define _I386_KEXEC_H
+
+#include <asm/fixmap.h>
+
+/*
+ * KEXEC_SOURCE_MEMORY_LIMIT maximum page get_free_page can return.
+ * I.e. Maximum page that is mapped directly into kernel memory,
+ * and kmap is not required.
+ *
+ * Someone correct me if FIXADDR_START - PAGE_OFFSET is not the correct
+ * calculation for the amount of memory directly mappable into the
+ * kernel memory space.
+ */
+
+/* Maximum physical address we can use pages from */
+#define KEXEC_SOURCE_MEMORY_LIMIT (FIXADDR_START - PAGE_OFFSET)
+/* Maximum address we can reach in physical address mode */
+#define KEXEC_DESTINATION_MEMORY_LIMIT (-1UL)
+
+#define KEXEC_REBOOT_CODE_SIZE 4096
+#define KEXEC_REBOOT_CODE_ALIGN 0
+
+
+#endif /* _I386_KEXEC_H */
diff -uNr linux-2.5.47-bk2/include/asm-i386/unistd.h linux-2.5.47-bk2.x86kexec/include/asm-i386/unistd.h
--- linux-2.5.47-bk2/include/asm-i386/unistd.h Tue Nov 5 19:03:51 2002
+++ linux-2.5.47-bk2.x86kexec/include/asm-i386/unistd.h Wed Nov 13 06:08:52 2002
@@ -262,6 +262,7 @@
#define __NR_sys_epoll_ctl 255
#define __NR_sys_epoll_wait 256
#define __NR_remap_file_pages 257
+#define __NR_sys_kexec_load 258
/* user-visible error numbers are in the range -1 - -124: see <asm-i386/errno.h> */
diff -uNr linux-2.5.47-bk2/include/linux/kexec.h linux-2.5.47-bk2.x86kexec/include/linux/kexec.h
--- linux-2.5.47-bk2/include/linux/kexec.h Wed Dec 31 17:00:00 1969
+++ linux-2.5.47-bk2.x86kexec/include/linux/kexec.h Wed Nov 13 06:08:52 2002
@@ -0,0 +1,45 @@
+#ifndef LINUX_KEXEC_H
+#define LINUX_KEXEC_H
+
+#if CONFIG_KEXEC
+#include <linux/types.h>
+#include <asm/kexec.h>
+
+/*
+ * This structure is used to hold the arguments that are used when loading
+ * kernel binaries.
+ */
+
+typedef unsigned long kimage_entry_t;
+#define IND_DESTINATION 0x1
+#define IND_INDIRECTION 0x2
+#define IND_DONE 0x4
+#define IND_SOURCE 0x8
+
+struct kimage {
+ kimage_entry_t head;
+ kimage_entry_t *entry;
+ kimage_entry_t *last_entry;
+
+ unsigned long destination;
+ unsigned long offset;
+
+ unsigned long start;
+ void *reboot_code_buffer;
+};
+
+struct kexec_segment {
+ void *buf;
+ size_t bufsz;
+ void *mem;
+ size_t memsz;
+};
+
+/* kexec interface functions */
+extern void machine_kexec(struct kimage *image);
+extern asmlinkage long sys_kexec(unsigned long entry, long nr_segments,
+ struct kexec_segment *segments);
+extern struct kimage *kexec_image;
+#endif
+#endif /* LINUX_KEXEC_H */
+
diff -uNr linux-2.5.47-bk2/include/linux/reboot.h linux-2.5.47-bk2.x86kexec/include/linux/reboot.h
--- linux-2.5.47-bk2/include/linux/reboot.h Fri Oct 11 22:22:47 2002
+++ linux-2.5.47-bk2.x86kexec/include/linux/reboot.h Wed Nov 13 06:08:52 2002
@@ -21,6 +21,7 @@
* POWER_OFF Stop OS and remove all power from system, if possible.
* RESTART2 Restart system using given command string.
* SW_SUSPEND Suspend system using Software Suspend if compiled in
+ * KEXEC Restart the system using a different kernel.
*/
#define LINUX_REBOOT_CMD_RESTART 0x01234567
@@ -30,6 +31,7 @@
#define LINUX_REBOOT_CMD_POWER_OFF 0x4321FEDC
#define LINUX_REBOOT_CMD_RESTART2 0xA1B2C3D4
#define LINUX_REBOOT_CMD_SW_SUSPEND 0xD000FCE2
+#define LINUX_REBOOT_CMD_KEXEC 0x45584543
#ifdef __KERNEL__
diff -uNr linux-2.5.47-bk2/kernel/Makefile linux-2.5.47-bk2.x86kexec/kernel/Makefile
--- linux-2.5.47-bk2/kernel/Makefile Wed Nov 13 06:08:13 2002
+++ linux-2.5.47-bk2.x86kexec/kernel/Makefile Wed Nov 13 06:08:52 2002
@@ -21,6 +21,7 @@
obj-$(CONFIG_CPU_FREQ) += cpufreq.o
obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
obj-$(CONFIG_SOFTWARE_SUSPEND) += suspend.o
+obj-$(CONFIG_KEXEC) += kexec.o
ifneq ($(CONFIG_IA64),y)
# According to Alan Modra <[email protected]>, the -fno-omit-frame-pointer is
diff -uNr linux-2.5.47-bk2/kernel/kexec.c linux-2.5.47-bk2.x86kexec/kernel/kexec.c
--- linux-2.5.47-bk2/kernel/kexec.c Wed Dec 31 17:00:00 1969
+++ linux-2.5.47-bk2.x86kexec/kernel/kexec.c Wed Nov 13 06:08:52 2002
@@ -0,0 +1,640 @@
+#include <linux/mm.h>
+#include <linux/file.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/version.h>
+#include <linux/compile.h>
+#include <linux/kexec.h>
+#include <linux/spinlock.h>
+#include <net/checksum.h>
+#include <asm/page.h>
+#include <asm/uaccess.h>
+#include <asm/io.h>
+#include <asm/system.h>
+
+/* As designed kexec can only use the memory that you don't
+ * need to use kmap to access. Memory that you can use virt_to_phys()
+ * on and call get_free_page to allocate.
+ *
+ * In the best case you need one page for the transition from
+ * virtual to physical memory. And this page must be identity
+ * mapped. Which pretty much leaves you with pages < PAGE_OFFSET
+ * as you can only mess with user pages.
+ *
+ * As the only subset of memory that it is easy to restrict allocation
+ * to is the physical memory mapped into the kernel, I do that
+ * with get_free_page and hope it is enough.
+ *
+ * I don't know of a good way to calculate which pages get_free_page
+ * will return independent of architecture so I depend on
+ * <asm/kexec.h> to properly set
+ * KEXEC_SOURCE_MEMORY_LIMIT and KEXEC_DESTINATION_MEMORY_LIMIT
+ *
+ */
+
+static struct kimage *kimage_alloc(void)
+{
+ struct kimage *image;
+ image = kmalloc(sizeof(*image), GFP_KERNEL);
+ if (!image)
+ return 0;
+ memset(image, 0, sizeof(*image));
+ image->head = 0;
+ image->entry = &image->head;
+ image->last_entry = &image->head;
+ return image;
+}
+static int kimage_add_entry(struct kimage *image, kimage_entry_t entry)
+{
+ if (image->offset != 0) {
+ image->entry++;
+ }
+ if (image->entry == image->last_entry) {
+ kimage_entry_t *ind_page;
+ ind_page = (void *)__get_free_page(GFP_KERNEL);
+ if (!ind_page) {
+ return -ENOMEM;
+ }
+ *image->entry = virt_to_phys(ind_page) | IND_INDIRECTION;
+ image->entry = ind_page;
+ image->last_entry =
+ ind_page + ((PAGE_SIZE/sizeof(kimage_entry_t)) - 1);
+ }
+ *image->entry = entry;
+ image->entry++;
+ image->offset = 0;
+ return 0;
+}
+
+static int kimage_verify_destination(unsigned long destination)
+{
+ int result;
+
+ /* Assume the page is bad unless we pass the checks */
+ result = -EADDRNOTAVAIL;
+
+ if (destination >= KEXEC_DESTINATION_MEMORY_LIMIT) {
+ goto out;
+ }
+
+ /* NOTE: The caller is responsible for making certain we
+ * don't attempt to load the new image into invalid or
+ * reserved areas of RAM.
+ */
+ result = 0;
+out:
+ return result;
+}
+
+static int kimage_set_destination(
+ struct kimage *image, unsigned long destination)
+{
+ int result;
+ destination &= PAGE_MASK;
+ result = kimage_verify_destination(destination);
+ if (result) {
+ return result;
+ }
+ result = kimage_add_entry(image, destination | IND_DESTINATION);
+ if (result == 0) {
+ image->destination = destination;
+ }
+ return result;
+}
+
+
+static int kimage_add_page(struct kimage *image, unsigned long page)
+{
+ int result;
+ page &= PAGE_MASK;
+ result = kimage_verify_destination(image->destination);
+ if (result) {
+ return result;
+ }
+ result = kimage_add_entry(image, page | IND_SOURCE);
+ if (result == 0) {
+ image->destination += PAGE_SIZE;
+ }
+ return result;
+}
+
+
+static int kimage_terminate(struct kimage *image)
+{
+ int result;
+ result = kimage_add_entry(image, IND_DONE);
+ if (result == 0) {
+ /* Point at the terminating element */
+ image->entry--;
+ }
+ return result;
+}
+
+#define for_each_kimage_entry(image, ptr, entry) \
+ for (ptr = &image->head; (entry = *ptr) && !(entry & IND_DONE); \
+ ptr = (entry & IND_INDIRECTION)? \
+ phys_to_virt((entry & PAGE_MASK)): ptr +1)
+
+static void kimage_free(struct kimage *image)
+{
+ kimage_entry_t *ptr, entry;
+ kimage_entry_t ind = 0;
+ if (!image)
+ return;
+ for_each_kimage_entry(image, ptr, entry) {
+ if (entry & IND_INDIRECTION) {
+ /* Free the previous indirection page */
+ if (ind & IND_INDIRECTION) {
+ free_page((unsigned long)phys_to_virt(ind & PAGE_MASK));
+ }
+ /* Save this indirection page until we are
+ * done with it.
+ */
+ ind = entry;
+ }
+ else if (entry & IND_SOURCE) {
+ free_page((unsigned long)phys_to_virt(entry & PAGE_MASK));
+ }
+ }
+ kfree(image);
+}
+
+static int kimage_is_destination_page(
+ struct kimage *image, unsigned long page)
+{
+ kimage_entry_t *ptr, entry;
+ unsigned long destination;
+ destination = 0;
+ page &= PAGE_MASK;
+ for_each_kimage_entry(image, ptr, entry) {
+ if (entry & IND_DESTINATION) {
+ destination = entry & PAGE_MASK;
+ }
+ else if (entry & IND_SOURCE) {
+ if (page == destination) {
+ return 1;
+ }
+ destination += PAGE_SIZE;
+ }
+ }
+ return 0;
+}
+
+static int kimage_get_unused_area(
+ struct kimage *image, unsigned long size, unsigned long align,
+ unsigned long *area)
+{
+ /* Walk through mem_map and find the first chunk of
+ * unused memory that is at least size bytes long.
+ */
+ /* Since the kernel plays with PG_reserved, mem_map is less
+ * than ideal for this purpose, but it will give us a correct
+ * conservative estimate of what we need to do.
+ */
+ /* For now we take advantage of the fact that all kernel pages
+ * are marked with PG_reserved to allocate a large
+ * contiguous area for the reboot code buffer.
+ */
+ unsigned long addr;
+ unsigned long start, end;
+ unsigned long mask;
+ mask = ((1 << align) -1);
+ start = end = PAGE_SIZE;
+ for(addr = PAGE_SIZE; addr < KEXEC_SOURCE_MEMORY_LIMIT; addr += PAGE_SIZE) {
+ struct page *page;
+ unsigned long aligned_start;
+ page = virt_to_page(phys_to_virt(addr));
+ if (PageReserved(page) ||
+ kimage_is_destination_page(image, addr)) {
+ /* The current page is reserved so the start &
+ * end of the next area must be at least at the
+ * next page.
+ */
+ start = end = addr + PAGE_SIZE;
+ }
+ else {
+ /* O.k. The current page isn't reserved
+ * so push up the end of the area.
+ */
+ end = addr;
+ }
+ aligned_start = (start + mask) & ~mask;
+ if (aligned_start > start) {
+ continue;
+ }
+ if (aligned_start > end) {
+ continue;
+ }
+ if (end - aligned_start >= size) {
+ *area = aligned_start;
+ return 0;
+ }
+ }
+ *area = 0;
+ return -ENOSPC;
+}
+
+static kimage_entry_t *kimage_dst_conflict(
+ struct kimage *image, unsigned long page, kimage_entry_t *limit)
+{
+ kimage_entry_t *ptr, entry;
+ unsigned long destination = 0;
+ for_each_kimage_entry(image, ptr, entry) {
+ if (ptr == limit) {
+ return 0;
+ }
+ else if (entry & IND_DESTINATION) {
+ destination = entry & PAGE_MASK;
+ }
+ else if (entry & IND_SOURCE) {
+ if (page == destination) {
+ return ptr;
+ }
+ destination += PAGE_SIZE;
+ }
+ }
+ return 0;
+}
+
+static kimage_entry_t *kimage_src_conflict(
+ struct kimage *image, unsigned long destination, kimage_entry_t *limit)
+{
+ kimage_entry_t *ptr, entry;
+ for_each_kimage_entry(image, ptr, entry) {
+ unsigned long page;
+ if (ptr == limit) {
+ return 0;
+ }
+ else if (entry & IND_DESTINATION) {
+ /* nop */
+ }
+ else if (entry & IND_DONE) {
+ /* nop */
+ }
+ else {
+ /* SOURCE & INDIRECTION */
+ page = entry & PAGE_MASK;
+ if (page == destination) {
+ return ptr;
+ }
+ }
+ }
+ return 0;
+}
+
+static int kimage_get_off_destination_pages(struct kimage *image)
+{
+ kimage_entry_t *ptr, *cptr, entry;
+ unsigned long buffer, page;
+ unsigned long destination = 0;
+
+ /* Here we implement safeguards to ensure that
+ * a source page is not copied to its destination
+ * page before the data on the destination page is
+ * no longer useful.
+ *
+ * To make it work we actually wind up with a
+ * stronger condition. For every page considered
+ * it is either its own destination page or it is
+ * not a destination page of any page considered.
+ *
+ * Invariants
+ * 1. buffer is not a destination of a previous page.
+ * 2. page is not a destination of a previous page.
+ * 3. destination is not a previous source page.
+ *
+ * Result: Either a source page and a destination page
+ * are the same or the page is not a destination page.
+ *
+ * These checks could be done when we allocate the pages,
+ * but doing it as a final pass allows us more freedom
+ * on how we allocate pages.
+ *
+ * Also while the checks are necessary, in practice nothing
+ * happens. The destination kernel wants to sit in the
+ * same physical addresses as the current kernel so we never
+ * actually allocate a destination page.
+ *
+ * BUGS: This is a O(N^2) algorithm.
+ */
+
+
+ buffer = __get_free_page(GFP_KERNEL);
+ if (!buffer) {
+ return -ENOMEM;
+ }
+ buffer = virt_to_phys((void *)buffer);
+ for_each_kimage_entry(image, ptr, entry) {
+ /* Here we check to see if an allocated page */
+ kimage_entry_t *limit;
+ if (entry & IND_DESTINATION) {
+ destination = entry & PAGE_MASK;
+ }
+ else if (entry & IND_INDIRECTION) {
+ /* Indirection pages must include all of their
+ * contents in limit checking.
+ */
+ limit = phys_to_virt(page + PAGE_SIZE - sizeof(*limit));
+ }
+ if (!((entry & IND_SOURCE) | (entry & IND_INDIRECTION))) {
+ continue;
+ }
+ page = entry & PAGE_MASK;
+ limit = ptr;
+
+ /* See if a previous page has the current page as its
+ * destination.
+ * i.e. invariant 2
+ */
+ cptr = kimage_dst_conflict(image, page, limit);
+ if (cptr) {
+ unsigned long cpage;
+ kimage_entry_t centry;
+ centry = *cptr;
+ cpage = centry & PAGE_MASK;
+ memcpy(phys_to_virt(buffer), phys_to_virt(page), PAGE_SIZE);
+ memcpy(phys_to_virt(page), phys_to_virt(cpage), PAGE_SIZE);
+ *cptr = page | (centry & ~PAGE_MASK);
+ *ptr = buffer | (entry & ~PAGE_MASK);
+ buffer = cpage;
+ }
+ if (!(entry & IND_SOURCE)) {
+ continue;
+ }
+
+ /* See if a previous page is our destination page.
+ * If so claim it now.
+ * i.e. invariant 3
+ */
+ cptr = kimage_src_conflict(image, destination, limit);
+ if (cptr) {
+ unsigned long cpage;
+ kimage_entry_t centry;
+ centry = *cptr;
+ cpage = centry & PAGE_MASK;
+ memcpy(phys_to_virt(buffer), phys_to_virt(cpage), PAGE_SIZE);
+ memcpy(phys_to_virt(cpage), phys_to_virt(page), PAGE_SIZE);
+ *cptr = buffer | (centry & ~PAGE_MASK);
+ *ptr = cpage | ( entry & ~PAGE_MASK);
+ buffer = page;
+ }
+ /* If the buffer is my destination page do the copy now
+ * i.e. invariant 3 & 1
+ */
+ if (buffer == destination) {
+ memcpy(phys_to_virt(buffer), phys_to_virt(page), PAGE_SIZE);
+ *ptr = buffer | (entry & ~PAGE_MASK);
+ buffer = page;
+ }
+ }
+ free_page((unsigned long)phys_to_virt(buffer));
+ return 0;
+}
+
+static int kimage_add_empty_pages(struct kimage *image,
+ unsigned long len)
+{
+ unsigned long pos;
+ int result;
+ for(pos = 0; pos < len; pos += PAGE_SIZE) {
+ char *page;
+ result = -ENOMEM;
+ page = (void *)__get_free_page(GFP_KERNEL);
+ if (!page) {
+ goto out;
+ }
+ result = kimage_add_page(image, virt_to_phys(page));
+ if (result) {
+ goto out;
+ }
+ }
+ result = 0;
+ out:
+ return result;
+}
+
+
+static int kimage_load_segment(struct kimage *image,
+ struct kexec_segment *segment)
+{
+ unsigned long mstart;
+ int result;
+ unsigned long offset;
+ unsigned long offset_end;
+ unsigned char *buf;
+
+ result = 0;
+ buf = segment->buf;
+ mstart = (unsigned long)segment->mem;
+
+ offset_end = segment->memsz;
+
+ result = kimage_set_destination(image, mstart);
+ if (result < 0) {
+ goto out;
+ }
+ for(offset = 0; offset < segment->memsz; offset += PAGE_SIZE) {
+ char *page;
+ size_t size, leader;
+ page = (char *)__get_free_page(GFP_KERNEL);
+ if (page == 0) {
+ result = -ENOMEM;
+ goto out;
+ }
+ result = kimage_add_page(image, virt_to_phys(page));
+ if (result < 0) {
+ goto out;
+ }
+ if (segment->bufsz < offset) {
+ /* We are past the end zero the whole page */
+ memset(page, 0, PAGE_SIZE);
+ continue;
+ }
+ size = PAGE_SIZE;
+ leader = 0;
+ if ((offset == 0)) {
+ leader = mstart & ~PAGE_MASK;
+ }
+ if (leader) {
+ /* We are on the first page zero the unused portion */
+ memset(page, 0, leader);
+ size -= leader;
+ page += leader;
+ }
+ if (size > (segment->bufsz - offset)) {
+ size = segment->bufsz - offset;
+ }
+ result = copy_from_user(page, buf + offset, size);
+ if (result) {
+ result = (result < 0)?result : -EIO;
+ goto out;
+ }
+ if (size < (PAGE_SIZE - leader)) {
+ /* zero the trailing part of the page */
+ memset(page + size, 0, (PAGE_SIZE - leader) - size);
+ }
+ }
+ out:
+ return result;
+}
+
+
+/* do_kexec executes a new kernel
+ */
+static int do_kexec(unsigned long start, unsigned long nr_segments,
+ struct kexec_segment *arg_segments, struct kimage *image)
+{
+ struct kexec_segment *segments;
+ size_t segment_bytes;
+ int i;
+
+ int result;
+ unsigned long reboot_code_buffer;
+ kimage_entry_t *end;
+
+ /* Initialize variables */
+ segments = 0;
+
+ segment_bytes = nr_segments * sizeof(*segments);
+ segments = kmalloc(segment_bytes, GFP_KERNEL);
+ if (segments == 0) {
+ result = -ENOMEM;
+ goto out;
+ }
+ result = copy_from_user(segments, arg_segments, segment_bytes);
+ if (result) {
+ goto out;
+ }
+
+ /* Read in the data from user space */
+ image->start = start;
+ for(i = 0; i < nr_segments; i++) {
+ result = kimage_load_segment(image, &segments[i]);
+ if (result) {
+ goto out;
+ }
+ }
+
+ /* Terminate early so I can get a place holder. */
+ result = kimage_terminate(image);
+ if (result)
+ goto out;
+ end = image->entry;
+
+ /* Usage of the reboot code buffer is subtle. We first
+ * find a contiguous area of ram, that is not one
+ * of our destination pages. We do not allocate the ram.
+ *
+ * The algorithm to make certain we do not have address
+ * conflicts requires each destination region to have some
+ * backing store so we allocate arbitrary source pages.
+ *
+ * Later in machine_kexec when we copy data to the
+ * reboot_code_buffer it still may be allocated for other
+ * purposes, but we do know there are no source or destination
+ * pages in that area. And since the rest of the kernel
+ * is already shut down those pages are free for use,
+ * regardless of their page->count values.
+ *
+ * The kernel mapping of the reboot code buffer is passed to
+ * the machine dependent code. If it needs something else
+ * it is free to set that up.
+ */
+ result = kimage_get_unused_area(
+ image, KEXEC_REBOOT_CODE_SIZE, KEXEC_REBOOT_CODE_ALIGN,
+ &reboot_code_buffer);
+ if (result)
+ goto out;
+
+ /* Allocating pages we should never need is silly but the
+ * code won't work correctly unless we have dummy pages to
+ * work with.
+ */
+ result = kimage_set_destination(image, reboot_code_buffer);
+ if (result)
+ goto out;
+ result = kimage_add_empty_pages(image, KEXEC_REBOOT_CODE_SIZE);
+ if (result)
+ goto out;
+ image->reboot_code_buffer = phys_to_virt(reboot_code_buffer);
+
+ result = kimage_terminate(image);
+ if (result)
+ goto out;
+
+ result = kimage_get_off_destination_pages(image);
+ if (result)
+ goto out;
+
+ /* Now hide the extra source pages for the reboot code buffer.
+ */
+ image->entry = end;
+ result = kimage_terminate(image);
+ if (result)
+ goto out;
+
+ result = 0;
+ out:
+ /* cleanup and exit */
+ if (segments) kfree(segments);
+ return result;
+}
+
+
+/*
+ * Exec Kernel system call: for obvious reasons only root may call it.
+ *
+ * This call breaks up into three pieces.
+ * - A generic part which loads the new kernel from the current
+ * address space, and very carefully places the data in the
+ * allocated pages.
+ *
+ * - A generic part that interacts with the kernel and tells all of
+ * the devices to shut down. Preventing on-going dmas, and placing
+ * the devices in a consistent state so a later kernel can
+ * reinitialize them.
+ *
+ * - A machine specific part that includes the syscall number
+ * and then copies the image to its final destination, and
+ * jumps into the image at entry.
+ *
+ * kexec does not sync, or unmount filesystems so if you need
+ * that to happen you need to do that yourself.
+ */
+struct kimage *kexec_image = 0;
+
+asmlinkage long sys_kexec_load(unsigned long entry, unsigned long nr_segments,
+ struct kexec_segment *segments, unsigned long flags)
+{
+ /* Am I using too much stack space here? */
+ struct kimage *image, *old_image;
+ int result;
+
+ /* We only trust the superuser with rebooting the system. */
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ /* In case we need just a little bit of special behavior for
+ * reboot on panic
+ */
+ if (flags != 0)
+ return -EINVAL;
+
+ image = 0;
+ if (nr_segments > 0) {
+ image = kimage_alloc();
+ if (!image) {
+ return -ENOMEM;
+ }
+ result = do_kexec(entry, nr_segments, segments, image);
+ if (result) {
+ kimage_free(image);
+ return result;
+ }
+ }
+
+ old_image = xchg(&kexec_image, image);
+
+ kimage_free(old_image);
+ return 0;
+}
diff -uNr linux-2.5.47-bk2/kernel/sys.c linux-2.5.47-bk2.x86kexec/kernel/sys.c
--- linux-2.5.47-bk2/kernel/sys.c Wed Nov 13 06:08:13 2002
+++ linux-2.5.47-bk2.x86kexec/kernel/sys.c Wed Nov 13 06:08:52 2002
@@ -16,6 +16,7 @@
#include <linux/init.h>
#include <linux/highuid.h>
#include <linux/fs.h>
+#include <linux/kexec.h>
#include <linux/workqueue.h>
#include <linux/device.h>
#include <linux/times.h>
@@ -206,6 +207,7 @@
cond_syscall(sys_lookup_dcookie)
cond_syscall(sys_swapon)
cond_syscall(sys_swapoff)
+cond_syscall(sys_kexec_load)
cond_syscall(sys_init_module)
cond_syscall(sys_delete_module)
@@ -416,6 +418,27 @@
machine_restart(buffer);
break;
+#ifdef CONFIG_KEXEC
+ case LINUX_REBOOT_CMD_KEXEC:
+ {
+ struct kimage *image;
+ if (arg) {
+ unlock_kernel();
+ return -EINVAL;
+ }
+ image = xchg(&kexec_image, 0);
+ if (!image) {
+ unlock_kernel();
+ return -EINVAL;
+ }
+ notifier_call_chain(&reboot_notifier_list, SYS_RESTART, NULL);
+ system_running = 0;
+ device_shutdown();
+ printk(KERN_EMERG "Starting new kernel\n");
+ machine_kexec(image);
+ break;
+ }
+#endif
#ifdef CONFIG_SOFTWARE_SUSPEND
case LINUX_REBOOT_CMD_SW_SUSPEND:
if (!software_suspend_enabled) {
On Wed, Nov 13, 2002 at 06:26:29AM -0700, Eric W. Biederman wrote:
>
> O.k. and now a version that applies cleanly to
> v2.5.47-bk2 aka [email protected]
>
BTW, results similar to Andy on my SMP system (the same problem
machine we'd talked about earlier). Same problem ?
with 2.5.47-bk2
+ kexec patch for 2.5.47-bk2 attached in your mail
+ linux-2.5.47.x86kexec-hwfixes
and using
kexec-tools-1.5
Results of kexec kexec_test
[root@llm01 root]# Synchronizing SCSI caches:
Shutting down devices
Starting new kernel
kexec_test 1.5 starting...
eax: 0E1FB007 ebx: 00001078 ecx: 00000000 edx: 00000000
esi: 00000000 edi: 00000000 esp: 00000000 ebp: 00000000
idt: 00000000 C0000000
gdt: 00000000 C0000000
Switching descriptors.
Descriptors changed.
Legacy pic setup.
In real mode.
<hang>
What would be the best way to pass a parameter or address from the
current kernel to the kernel being booted (e.g. log buffer address
or crash dump buffer etc) ? Should this be part of the interface,
i.e. could/would it make sense for kexec to support this (rather
than our having to go and try to fix up kernel parameters ourselves,
or designate a fixed address for this) ? Also thinking
about other arch support for kexec in the future ...
Regards
Suparna
Suparna Bhattacharya <[email protected]> writes:
> On Wed, Nov 13, 2002 at 06:26:29AM -0700, Eric W. Biederman wrote:
> >
> > O.k. and now a version that applies cleanly to
> > v2.5.47-bk2 aka [email protected]
> >
>
> BTW, results similar to Andy on my SMP system (the same problem
> machine we'd talked about earlier). Same problem ?
Something like that. The good news is that the image is being
loaded; the bad news is the BIOS doesn't work, and so the kernel's
initial setup code isn't working.
Hopefully this weekend I can do the work in user space to bypass
the BIOS altogether for booting a kernel. That should make the whole
thing easier to use.
> with 2.5.47-bk2
> + kexec patch for 2.5.47-bk2 attached in your mail
> + linux-2.5.47.x86kexec-hwfixes
> and using
> kexec-tools-1.5
>
> Results of kexec kexec_test
>
> [root@llm01 root]# Synchronizing SCSI caches:
> Shutting down devices
> Starting new kernel
> kexec_test 1.5 starting...
> eax: 0E1FB007 ebx: 00001078 ecx: 00000000 edx: 00000000
> esi: 00000000 edi: 00000000 esp: 00000000 ebp: 00000000
> idt: 00000000 C0000000
> gdt: 00000000 C0000000
> Switching descriptors.
> Descriptors changed.
> Legacy pic setup.
> In real mode.
> <hang>
>
> What would be the best way to pass a parameter or address from the
> current kernel to the kernel being booted (e.g. log buffer address
> or crash dump buffer etc) ? Should this be part of the interface,
> i.e. could/would it make sense for kexec to support this (rather
> than our having to go and try to fix up kernel parameters ourselves,
> or designate a fixed address for this) ? Also thinking
> about other arch support for kexec in the future ...
The current interface says load image X at location Y, and entry
at point Z. Given that every little situation wants a slightly
different tweak I don't think a specific feature in the kernel is
needed. The user space binaries can incorporate all of the
interesting logic.
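
To make "load image X at location Y" concrete, here is a minimal user-space sketch of how a loader might describe one segment. The struct mirrors struct kexec_segment from the patch's include/linux/kexec.h; make_segment() is a hypothetical helper name, not part of kexec-tools, and the sketch deliberately stops short of making the system call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors struct kexec_segment from the patch's include/linux/kexec.h. */
struct kexec_segment {
    void *buf;      /* user-space buffer holding the segment data */
    size_t bufsz;   /* number of bytes to copy from buf */
    void *mem;      /* physical address the data should end up at */
    size_t memsz;   /* bytes claimed at mem; the part past bufsz is zeroed */
};

/* Hypothetical helper: describe "load buf at physical address phys". */
static struct kexec_segment make_segment(void *buf, size_t bufsz,
                                         uintptr_t phys, size_t memsz)
{
    struct kexec_segment seg;

    /* The kernel side zero-fills beyond bufsz, so memsz must cover bufsz. */
    assert(memsz >= bufsz);
    seg.buf = buf;
    seg.bufsz = bufsz;
    seg.mem = (void *)phys;
    seg.memsz = memsz;
    return seg;
}
```

A loader would build an array of these and hand it, together with the entry point Z, to sys_kexec_load(entry, nr_segments, segments, 0); a later reboot with LINUX_REBOOT_CMD_KEXEC then starts the image. Everything else (bzImage parsing, parameter fix-ups) stays in user space.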
Eric
Suparna Bhattacharya wrote:
> What would be the best way to pass a parameter or address from the
> current kernel to the kernel being booted (e.g. log buffer address
> or crash dump buffer etc) ?
At the moment, perhaps the initrd mechanism might be a useful
interface for this. You'd just leave some space either at the
beginning or at the end of the real initrd (if there's one),
and put your data there.
Afterwards, you can extract it either from the kernel, or even
from user space through /dev/initrd (with "noinitrd")
Advantages:
- fairly non-intrusive
- (almost ?) all platforms support this way of handling "some
object in memory"
- easy to play with from user space
Drawbacks:
- needs synchronization with existing uses of initrd
- a bit hackish
I'd expect that there will be eventually a number of things that
get passed from old to new kernels (e.g. crash data, device scan
results, etc.), so it may be useful to delay designing a "clean"
interface (for this, I expect some TLV structure in the initrd
area would make most sense) until more of those things have
shown up.
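
As an illustration only, a TLV area of the kind mentioned above could look roughly like the sketch below. Nothing here is an agreed format: the header layout, the type numbers and the names tlv_put() and tlv_find() are all hypothetical, and the sketch assumes the old and new kernel run on the same machine, so host byte order is used.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record header: a type tag followed by the value length. */
struct tlv_hdr {
    uint32_t type;
    uint32_t len;
};

/* Append one record at offset off in an area of cap bytes.
 * Returns the new end offset, or 0 if the record does not fit. */
static size_t tlv_put(unsigned char *area, size_t cap, size_t off,
                      uint32_t type, const void *val, uint32_t len)
{
    struct tlv_hdr h = { type, len };

    if (off > cap || cap - off < sizeof(h) || cap - off - sizeof(h) < len)
        return 0;
    memcpy(area + off, &h, sizeof(h));
    memcpy(area + off + sizeof(h), val, len);
    return off + sizeof(h) + len;
}

/* Find the first record of the given type in the first used bytes.
 * Returns a pointer to its value and stores the length in *len,
 * or returns NULL if the type is absent or a record is truncated. */
static const unsigned char *tlv_find(const unsigned char *area, size_t used,
                                     uint32_t type, uint32_t *len)
{
    size_t off = 0;

    while (used - off >= sizeof(struct tlv_hdr)) {
        struct tlv_hdr h;

        memcpy(&h, area + off, sizeof(h));
        off += sizeof(h);
        if (h.len > used - off)
            return NULL;
        if (h.type == type) {
            *len = h.len;
            return area + off;
        }
        off += h.len;
    }
    return NULL;
}
```

The old kernel (or user space) would fill the area with tlv_put() before the handoff, and the new side would pick out the records it understands with tlv_find() and skip the rest, which is what makes the format easy to extend later.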
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/
The kernel interface has finally stabilized enough that I managed to put
some work into the user space side of things.
The new release is at:
http://www.xmission.com/~ebiederm/kexec-tools-1.6.tar.gz
The interface is now more like reboot, so you probably want to change
your shutdown scripts or use kexec --force.
And by default it now enters the kernel in 32bit mode so it should avoid
interrupt controller problems, and work for more people, in more strange
situations.
Eric
[email protected] (Eric W. Biederman) writes:
> The kernel interface has finally stabilized enough that I managed to put
> some work into the user space side of things.
>
> The new release is at:
> http://www.xmission.com/~ebiederm/kexec-tools-1.6.tar.gz
Make that:
http://www.xmission.com/~ebiederm/files/kexec/kexec-tools-1.6.tar.gz
And the latest patches can be found at:
http://www.xmission.com/~ebiederm/files/kexec/
The basic breakout is
linux-2.5.47.x86kexec.diff is the core patch.
linux-2.5.47.x86kexec-hwfixes.diff
applies on top and has some hardware fixes to the
kernel shutdown code that make things work better.
Mostly this is the code to get SMP to shut down properly.
And it looks like .48 is out so I need to do another patch update.
Eric
kexec is a set of system calls that allows you to load another kernel
from the currently executing Linux kernel. The current implementation
has only been tested, and had the kinks worked out, on x86, but the
generic code should work on any architecture.
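
For readers who have not looked at the generic code: the image is described by a list of tagged page entries (the IND_* flags in include/linux/kexec.h). Below is a rough user-space simulation of how such a list is consumed, inferred from the list walkers in kernel/kexec.c rather than from relocate_kernel.S. It uses a toy 64-byte page and byte offsets into an array instead of physical addresses; SIM_PAGE_SIZE, store_entry(), load_entry() and sim_relocate() are made-up names for this sketch.

```c
#include <assert.h>
#include <string.h>

#define SIM_PAGE_SIZE 64UL                  /* toy page size for the sketch */
#define SIM_PAGE_MASK (~(SIM_PAGE_SIZE - 1))

/* Flag values as defined in the patch's include/linux/kexec.h. */
#define IND_DESTINATION 0x1
#define IND_INDIRECTION 0x2
#define IND_DONE        0x4
#define IND_SOURCE      0x8

typedef unsigned long kimage_entry_t;

/* Write one list entry into the simulated memory at byte offset addr. */
static void store_entry(unsigned char *mem, unsigned long addr,
                        kimage_entry_t e)
{
    memcpy(mem + addr, &e, sizeof(e));
}

/* Read one list entry from the simulated memory at byte offset addr. */
static kimage_entry_t load_entry(const unsigned char *mem, unsigned long addr)
{
    kimage_entry_t e;

    memcpy(&e, mem + addr, sizeof(e));
    return e;
}

/* Replay the entry list starting at head and return the pages copied.
 * head must be an IND_INDIRECTION entry naming the first list page,
 * which is how kimage_add_entry() sets up image->head. */
static unsigned long sim_relocate(unsigned char *mem, kimage_entry_t head)
{
    unsigned long next = 0, dest = 0, copied = 0;
    kimage_entry_t entry = head;

    assert(head & IND_INDIRECTION);
    while (entry && !(entry & IND_DONE)) {
        unsigned long addr = entry & SIM_PAGE_MASK;

        if (entry & IND_INDIRECTION) {
            next = addr;            /* continue reading at this list page */
        } else if (entry & IND_DESTINATION) {
            dest = addr;            /* following pages are copied here */
        } else if (entry & IND_SOURCE) {
            memmove(mem + dest, mem + addr, SIM_PAGE_SIZE);
            dest += SIM_PAGE_SIZE;
            copied++;
        }
        entry = load_entry(mem, next);
        next += sizeof(kimage_entry_t);
    }
    return copied;
}
```

The point of the format is that the final copy loop needs no allocator and no page tables of its own: it only follows entries, so it can run from a single identity-mapped page after the rest of the kernel has been shut down.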
Could I get some feedback on where this works and where this breaks?
With the maturation of kexec-tools to skip attempting BIOS calls,
I expect a new Linux kernel to load for most people. Though I
also expect some device drivers will not reinitialize after the reboot.
The patch is archived at:
http://www.xmission.com/~ebiederm/files/kexec/
And is currently kept in two pieces.
The pure system call.
http://www.xmission.com/~ebiederm/files/kexec/linux-2.5.48.x86kexec.diff
And the set of hardware fixes known to help kexec.
http://www.xmission.com/~ebiederm/files/kexec/linux-2.5.48.x86kexec-hwfixes.diff
A compatible user space is at:
http://www.xmission.com/~ebiederm/files/kexec/kexec-tools-1.7.tar.gz
This code boots either a static ELF executable or a bzImage.
As of version 1.6 /sbin/kexec now works much more like /sbin/reboot.
It is recommended that you place /sbin/kexec -e in /etc/init.d/reboot
just before the call to /sbin/reboot. If you haven't called
/sbin/kexec previously it will fail, and you can then call
/sbin/reboot. Given the similarity it is now the plan to merge
reboot via kexec into /sbin/reboot.
One bug was fixed in the move to 2.5.48. Previously I had failed to
clear PAE and PSE in the kernel. This caused reboot failures when
CONFIG_HIGHMEM_64G was enabled, as the new kernel would fail when
enabling paging, as these bits remained set. Is %cr4 present on all
386+ intel cpus, or do I need to conditionalize the code that accesses
it?
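
In terms of register bits, the fix described above amounts to masking two flags out of %cr4 before the handoff. A minimal sketch follows; the bit values are the architectural ones for PSE and PAE, but cr4_for_handoff() is a made-up name, and the privileged read and write of %cr4 itself is left out so the logic can be checked in user space.

```c
/* Architectural CR4 bit positions on x86. */
#define X86_CR4_PSE 0x00000010UL   /* page size extensions (4MB pages) */
#define X86_CR4_PAE 0x00000020UL   /* physical address extension */

/* Hypothetical helper: the %cr4 value to load before jumping to the
 * new kernel, with PSE and PAE cleared and all other bits untouched. */
static unsigned long cr4_for_handoff(unsigned long cr4)
{
    return cr4 & ~(X86_CR4_PSE | X86_CR4_PAE);
}
```

Leaving the other bits alone is a choice made for this sketch; the actual patch may well reset more of the register.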
As of version 1.6 /sbin/kexec, when presented with a bzImage, by default
avoids all BIOS calls and jumps directly to the kernel's 32 bit entry
point. The information it would usually get from the BIOS is instead
collected from the current kernel. Accurately getting things like
the BIOS memory map from the current kernel is a challenge that still
needs to be addressed. Safe defaults have been provided for the cases
where I do not currently have good code to gather the information from
the running kernel.
In bug reports please include the serial console output of
kexec kexec_test. kexec_test exercises most of the interesting code
paths that are needed to load a kernel (mainly BIOS calls) with lots
of debugging print statements, so hangs can easily be detected.
Eric
MAINTAINERS | 7
arch/i386/Kconfig | 17
arch/i386/kernel/Makefile | 1
arch/i386/kernel/entry.S | 2
arch/i386/kernel/machine_kexec.c | 142 ++++++++
arch/i386/kernel/relocate_kernel.S | 107 ++++++
include/asm-i386/kexec.h | 25 +
include/asm-i386/unistd.h | 2
include/linux/kexec.h | 45 ++
include/linux/reboot.h | 2
kernel/Makefile | 1
kernel/kexec.c | 640 +++++++++++++++++++++++++++++++++++++
kernel/sys.c | 23 +
13 files changed, 1012 insertions, 2 deletions
diff -uNr linux-2.5.48/MAINTAINERS linux-2.5.48.x86kexec/MAINTAINERS
--- linux-2.5.48/MAINTAINERS Mon Nov 11 00:22:33 2002
+++ linux-2.5.48.x86kexec/MAINTAINERS Sun Nov 17 22:53:09 2002
@@ -968,6 +968,13 @@
W: http://www.cse.unsw.edu.au/~neilb/patches/linux-devel/
S: Maintained
+KEXEC
+P: Eric Biederman
+M: [email protected]
+M: [email protected]
+L: [email protected]
+S: Maintained
+
LANMEDIA WAN CARD DRIVER
P: Andrew Stanley-Jones
M: [email protected]
diff -uNr linux-2.5.48/arch/i386/Kconfig linux-2.5.48.x86kexec/arch/i386/Kconfig
--- linux-2.5.48/arch/i386/Kconfig Sun Nov 17 22:51:14 2002
+++ linux-2.5.48.x86kexec/arch/i386/Kconfig Sun Nov 17 22:53:09 2002
@@ -784,6 +784,23 @@
depends on (SMP || PREEMPT) && X86_CMPXCHG
default y
+config KEXEC
+ bool "kexec system call (EXPERIMENTAL)"
+ depends on EXPERIMENTAL
+ help
+ kexec is a system call that implements the ability to shut down your
+ current kernel, and to start another kernel. It is like a reboot
+ but it is independent of the system firmware. And like a reboot
+ you can start any kernel with it, not just Linux.
+
+ The name comes from the similarity to the exec system call.
+
+ It is an ongoing process to be certain the hardware in a machine
+ is properly shut down, so do not be surprised if this code does not
+ initially work for you. It may help to enable device hotplugging
+ support. As of this writing the exact hardware interface is
+ strongly in flux, so no good recommendation can be made.
+
endmenu
diff -uNr linux-2.5.48/arch/i386/kernel/Makefile linux-2.5.48.x86kexec/arch/i386/kernel/Makefile
--- linux-2.5.48/arch/i386/kernel/Makefile Sun Nov 17 22:51:14 2002
+++ linux-2.5.48.x86kexec/arch/i386/kernel/Makefile Sun Nov 17 22:53:09 2002
@@ -24,6 +24,7 @@
obj-$(CONFIG_X86_MPPARSE) += mpparse.o
obj-$(CONFIG_X86_LOCAL_APIC) += apic.o nmi.o
obj-$(CONFIG_X86_IO_APIC) += io_apic.o
+obj-$(CONFIG_KEXEC) += machine_kexec.o relocate_kernel.o
obj-$(CONFIG_SOFTWARE_SUSPEND) += suspend.o suspend_asm.o
obj-$(CONFIG_X86_NUMAQ) += numaq.o
obj-$(CONFIG_PROFILING) += profile.o
diff -uNr linux-2.5.48/arch/i386/kernel/entry.S linux-2.5.48.x86kexec/arch/i386/kernel/entry.S
--- linux-2.5.48/arch/i386/kernel/entry.S Sun Nov 17 22:51:14 2002
+++ linux-2.5.48.x86kexec/arch/i386/kernel/entry.S Sun Nov 17 22:56:43 2002
@@ -768,7 +768,7 @@
.long sys_epoll_wait
.long sys_remap_file_pages
.long sys_set_tid_address
-
+ .long sys_kexec_load
.rept NR_syscalls-(.-sys_call_table)/4
.long sys_ni_syscall
diff -uNr linux-2.5.48/arch/i386/kernel/machine_kexec.c linux-2.5.48.x86kexec/arch/i386/kernel/machine_kexec.c
--- linux-2.5.48/arch/i386/kernel/machine_kexec.c Wed Dec 31 17:00:00 1969
+++ linux-2.5.48.x86kexec/arch/i386/kernel/machine_kexec.c Sun Nov 17 22:53:09 2002
@@ -0,0 +1,142 @@
+#include <linux/config.h>
+#include <linux/mm.h>
+#include <linux/kexec.h>
+#include <linux/delay.h>
+#include <asm/pgtable.h>
+#include <asm/pgalloc.h>
+#include <asm/tlbflush.h>
+#include <asm/io.h>
+#include <asm/apic.h>
+
+
+/*
+ * machine_kexec
+ * =======================
+ */
+
+
+static void set_idt(void *newidt, __u16 limit)
+{
+ unsigned char curidt[6];
+
+ /* ia32 supports unaligned loads & stores */
+ (*(__u16 *)(curidt)) = limit;
+ (*(__u32 *)(curidt +2)) = (unsigned long)(newidt);
+
+ __asm__ __volatile__ (
+ "lidt %0\n"
+ : /* no outputs */ : "m" (curidt)
+ );
+};
+
+
+static void set_gdt(void *newgdt, __u16 limit)
+{
+ unsigned char curgdt[6];
+
+ /* ia32 supports unaligned loads & stores */
+ (*(__u16 *)(curgdt)) = limit;
+ (*(__u32 *)(curgdt +2)) = (unsigned long)(newgdt);
+
+ __asm__ __volatile__ (
+ "lgdt %0\n"
+ : /* no outputs */ : "m" (curgdt)
+ );
+};
+
+static void load_segments(void)
+{
+#define __STR(X) #X
+#define STR(X) __STR(X)
+
+ __asm__ __volatile__ (
+ "\tljmp $"STR(__KERNEL_CS)",$1f\n"
+ "\t1:\n"
+ "\tmovl $"STR(__KERNEL_DS)",%eax\n"
+ "\tmovl %eax,%ds\n"
+ "\tmovl %eax,%es\n"
+ "\tmovl %eax,%fs\n"
+ "\tmovl %eax,%gs\n"
+ "\tmovl %eax,%ss\n"
+ );
+#undef STR
+#undef __STR
+}
+
+static void identity_map_page(unsigned long address)
+{
+ /* This code is x86 specific...
+ * general purpose code must be more careful
+ * of caches and tlbs...
+ */
+ pgd_t *pgd;
+ pmd_t *pmd;
+ struct mm_struct *mm = current->mm;
+ spin_lock(&mm->page_table_lock);
+
+ pgd = pgd_offset(mm, address);
+ pmd = pmd_alloc(mm, pgd, address);
+
+ if (pmd) {
+ pte_t *pte = pte_alloc_map(mm, pmd, address);
+ if (pte) {
+ set_pte(pte,
+ mk_pte(virt_to_page(phys_to_virt(address)),
+ PAGE_SHARED));
+ __flush_tlb_one(address);
+ }
+ }
+ spin_unlock(&mm->page_table_lock);
+}
+
+
+typedef void (*relocate_new_kernel_t)(
+ unsigned long indirection_page, unsigned long reboot_code_buffer,
+ unsigned long start_address);
+
+const extern unsigned char relocate_new_kernel[];
+extern void relocate_new_kernel_end(void);
+const extern unsigned int relocate_new_kernel_size;
+
+void machine_kexec(struct kimage *image)
+{
+ unsigned long *indirection_page;
+ void *reboot_code_buffer;
+ relocate_new_kernel_t rnk;
+
+ /* Interrupts aren't acceptable while we reboot */
+ local_irq_disable();
+ reboot_code_buffer = image->reboot_code_buffer;
+ indirection_page = phys_to_virt(image->head & PAGE_MASK);
+
+ identity_map_page(virt_to_phys(reboot_code_buffer));
+
+ /* copy it out */
+ memcpy(reboot_code_buffer, relocate_new_kernel,
+ relocate_new_kernel_size);
+
+ /* The segment registers are funny things: they are
+ * automatically loaded from a table in memory whenever you
+ * set them to a specific selector, but this table is never
+ * accessed again until you set the segment to a different selector.
+ *
+ * The more common model is a cache, where the behind-the-scenes
+ * work is done but the result can also be dropped at arbitrary
+ * times.
+ *
+ * I take advantage of this here by force loading the
+ * segments, before I zap the gdt with an invalid value.
+ */
+ load_segments();
+ /* The gdt & idt are now invalid.
+ * If you want to load them you must set up your own idt & gdt.
+ */
+ set_gdt(phys_to_virt(0),0);
+ set_idt(phys_to_virt(0),0);
+
+ /* now call it */
+ rnk = (relocate_new_kernel_t) virt_to_phys(reboot_code_buffer);
+ (*rnk)(virt_to_phys(indirection_page), virt_to_phys(reboot_code_buffer),
+ image->start);
+}
+
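For readers following set_idt() and set_gdt() above: both build the 6-byte
operand that lidt/lgdt read, a 16-bit limit followed by a 32-bit base. The
fragment below is only a user-space sketch of that byte layout, not part of
the patch. The name pack_table_descriptor is hypothetical, and it writes the
bytes one at a time instead of relying on unaligned stores.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/*
 * Hypothetical helper (not in the patch): fill the 6-byte operand that
 * lidt/lgdt read. Bytes 0-1 hold the limit, bytes 2-5 the 32-bit base,
 * both in little-endian order as on ia32.
 */
static void pack_table_descriptor(uint8_t out[6], uint32_t base, uint16_t limit)
{
    assert(out != NULL);
    out[0] = (uint8_t)(limit & 0xff);
    out[1] = (uint8_t)(limit >> 8);
    out[2] = (uint8_t)(base & 0xff);
    out[3] = (uint8_t)((base >> 8) & 0xff);
    out[4] = (uint8_t)((base >> 16) & 0xff);
    out[5] = (uint8_t)(base >> 24);
}
```

With a limit of 0, as machine_kexec() passes, the table has no usable
entries, which is why the comment there calls the gdt and idt invalid.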
diff -uNr linux-2.5.48/arch/i386/kernel/relocate_kernel.S linux-2.5.48.x86kexec/arch/i386/kernel/relocate_kernel.S
--- linux-2.5.48/arch/i386/kernel/relocate_kernel.S Wed Dec 31 17:00:00 1969
+++ linux-2.5.48.x86kexec/arch/i386/kernel/relocate_kernel.S Sun Nov 17 23:58:29 2002
@@ -0,0 +1,107 @@
+#include <linux/config.h>
+#include <linux/linkage.h>
+
+ /* Must be relocatable PIC code callable as a C function that, once
+ * it starts, cannot use the previous process's stack.
+ *
+ */
+ .globl relocate_new_kernel
+relocate_new_kernel:
+ /* read the arguments and say goodbye to the stack */
+ movl 4(%esp), %ebx /* indirection_page */
+ movl 8(%esp), %ebp /* reboot_code_buffer */
+ movl 12(%esp), %edx /* start address */
+
+ /* zero out flags, and disable interrupts */
+ pushl $0
+ popfl
+
+ /* set a new stack at the top of our page; it grows down... */
+ lea 4096(%ebp), %esp
+
+ /* store the parameters back on the stack */
+ pushl %edx /* store the start address */
+
+ /* Set cr0 to a known state:
+ * 31 0 == Paging disabled
+ * 18 0 == Alignment check disabled
+ * 16 0 == Write protect disabled
+ * 3 0 == No task switch
+ * 2 0 == Don't do FP software emulation.
+ * 0 1 == Protected mode enabled
+ */
+ movl %cr0, %eax
+ andl $~((1<<31)|(1<<18)|(1<<16)|(1<<3)|(1<<2)), %eax
+ orl $(1<<0), %eax
+ movl %eax, %cr0
+
+ /* Set cr4 to a known state:
+ * Setting everything to zero seems safe.
+ */
+ movl %cr4, %eax
+ andl $0, %eax
+ movl %eax, %cr4
+
+ jmp 1f
+1:
+
+ /* Flush the TLB (needed?) */
+ xorl %eax, %eax
+ movl %eax, %cr3
+
+ /* Do the copies */
+ cld
+0: /* top, read another word from the indirection page */
+ movl %ebx, %ecx
+ movl (%ebx), %ecx
+ addl $4, %ebx
+ testl $0x1, %ecx /* is it a destination page */
+ jz 1f
+ movl %ecx, %edi
+ andl $0xfffff000, %edi
+ jmp 0b
+1:
+ testl $0x2, %ecx /* is it an indirection page */
+ jz 1f
+ movl %ecx, %ebx
+ andl $0xfffff000, %ebx
+ jmp 0b
+1:
+ testl $0x4, %ecx /* is it the done indicator */
+ jz 1f
+ jmp 2f
+1:
+ testl $0x8, %ecx /* is it the source indicator */
+ jz 0b /* Ignore it otherwise */
+ movl %ecx, %esi /* For every source page do a copy */
+ andl $0xfffff000, %esi
+
+ movl $1024, %ecx
+ rep ; movsl
+ jmp 0b
+
+2:
+
+ /* To be certain of avoiding problems with self modifying code
+ * I need to execute a serializing instruction here.
+ * So I flush the TLB, it's handy, and not processor dependent.
+ */
+ xorl %eax, %eax
+ movl %eax, %cr3
+
+ /* set all of the registers to known values */
+ /* leave %esp alone */
+
+ xorl %eax, %eax
+ xorl %ebx, %ebx
+ xorl %ecx, %ecx
+ xorl %edx, %edx
+ xorl %esi, %esi
+ xorl %edi, %edi
+ xorl %ebp, %ebp
+ ret
+relocate_new_kernel_end:
+
+ .globl relocate_new_kernel_size
+relocate_new_kernel_size:
+ .long relocate_new_kernel_end - relocate_new_kernel
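The copy loop in relocate_new_kernel can be hard to follow in assembly. The
following is a rough C model of it, offered as a sketch only: "physical
memory" is a flat byte array, addresses are offsets into it, and the names
relocate_model, read_entry and MODEL_PAGE_SIZE are made up for illustration.
The flag values are the IND_* constants from include/linux/kexec.h.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096u
#define IND_DESTINATION 0x1u
#define IND_INDIRECTION 0x2u
#define IND_DONE        0x4u
#define IND_SOURCE      0x8u

/* Read one 32-bit list entry from the modelled memory. */
static uint32_t read_entry(const uint8_t *mem, uint32_t addr)
{
    uint32_t value;
    memcpy(&value, mem + addr, sizeof(value));
    return value;
}

/*
 * Walk the entry list starting at list_addr and perform the page copies.
 * Returns the number of pages copied. Like the assembly, it ignores
 * entries with no known flag and stops only at IND_DONE.
 */
static unsigned relocate_model(uint8_t *mem, uint32_t list_addr)
{
    uint32_t dest = 0;
    unsigned copied = 0;

    assert(mem != NULL);
    for (;;) {
        uint32_t entry = read_entry(mem, list_addr);
        uint32_t page = entry & ~(MODEL_PAGE_SIZE - 1);

        list_addr += (uint32_t)sizeof(entry);
        if (entry & IND_DESTINATION) {
            dest = page;              /* next copy lands here */
        } else if (entry & IND_INDIRECTION) {
            list_addr = page;         /* continue in another list page */
        } else if (entry & IND_DONE) {
            return copied;
        } else if (entry & IND_SOURCE) {
            memmove(mem + dest, mem + page, MODEL_PAGE_SIZE);
            dest += MODEL_PAGE_SIZE;  /* rep movsl advances %edi the same way */
            copied++;
        }
    }
}
```

The model leaves out everything else the real routine does (stack switch,
cr0/cr3/cr4 setup, register clearing); it only shows how the four entry
types drive the copy.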
diff -uNr linux-2.5.48/include/asm-i386/kexec.h linux-2.5.48.x86kexec/include/asm-i386/kexec.h
--- linux-2.5.48/include/asm-i386/kexec.h Wed Dec 31 17:00:00 1969
+++ linux-2.5.48.x86kexec/include/asm-i386/kexec.h Sun Nov 17 22:53:09 2002
@@ -0,0 +1,25 @@
+#ifndef _I386_KEXEC_H
+#define _I386_KEXEC_H
+
+#include <asm/fixmap.h>
+
+/*
+ * KEXEC_SOURCE_MEMORY_LIMIT maximum page get_free_page can return.
+ * I.e. Maximum page that is mapped directly into kernel memory,
+ * and kmap is not required.
+ *
+ * Someone correct me if FIXADDR_START - PAGE_OFFSET is not the correct
+ * calculation for the amount of memory directly mappable into the
+ * kernel memory space.
+ */
+
+/* Maximum physical address we can use pages from */
+#define KEXEC_SOURCE_MEMORY_LIMIT (FIXADDR_START - PAGE_OFFSET)
+/* Maximum address we can reach in physical address mode */
+#define KEXEC_DESTINATION_MEMORY_LIMIT (-1UL)
+
+#define KEXEC_REBOOT_CODE_SIZE 4096
+#define KEXEC_REBOOT_CODE_ALIGN 0
+
+
+#endif /* _I386_KEXEC_H */
diff -uNr linux-2.5.48/include/asm-i386/unistd.h linux-2.5.48.x86kexec/include/asm-i386/unistd.h
--- linux-2.5.48/include/asm-i386/unistd.h Sun Nov 17 22:51:25 2002
+++ linux-2.5.48.x86kexec/include/asm-i386/unistd.h Sun Nov 17 22:54:03 2002
@@ -263,7 +263,7 @@
#define __NR_sys_epoll_wait 256
#define __NR_remap_file_pages 257
#define __NR_set_tid_address 258
-
+#define __NR_sys_kexec_load 259
/* user-visible error numbers are in the range -1 - -124: see <asm-i386/errno.h> */
diff -uNr linux-2.5.48/include/linux/kexec.h linux-2.5.48.x86kexec/include/linux/kexec.h
--- linux-2.5.48/include/linux/kexec.h Wed Dec 31 17:00:00 1969
+++ linux-2.5.48.x86kexec/include/linux/kexec.h Sun Nov 17 22:53:09 2002
@@ -0,0 +1,45 @@
+#ifndef LINUX_KEXEC_H
+#define LINUX_KEXEC_H
+
+#if CONFIG_KEXEC
+#include <linux/types.h>
+#include <asm/kexec.h>
+
+/*
+ * This structure is used to hold the arguments that are used when loading
+ * kernel binaries.
+ */
+
+typedef unsigned long kimage_entry_t;
+#define IND_DESTINATION 0x1
+#define IND_INDIRECTION 0x2
+#define IND_DONE 0x4
+#define IND_SOURCE 0x8
+
+struct kimage {
+ kimage_entry_t head;
+ kimage_entry_t *entry;
+ kimage_entry_t *last_entry;
+
+ unsigned long destination;
+ unsigned long offset;
+
+ unsigned long start;
+ void *reboot_code_buffer;
+};
+
+struct kexec_segment {
+ void *buf;
+ size_t bufsz;
+ void *mem;
+ size_t memsz;
+};
+
+/* kexec interface functions */
+extern void machine_kexec(struct kimage *image);
+extern asmlinkage long sys_kexec_load(unsigned long entry,
+ unsigned long nr_segments, struct kexec_segment *segments, unsigned long flags);
+extern struct kimage *kexec_image;
+#endif
+#endif /* LINUX_KEXEC_H */
+
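struct kexec_segment above carries two sizes: bufsz bytes are supplied by
user space, and memsz bytes are occupied at the destination, with the
difference zero filled. Below is a small sketch of how that splits into
per-page work, assuming the destination address is page aligned (the patch
also handles an unaligned first page, which is left out here). The names
segment_pages, plan_page and struct page_plan are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u

struct page_plan {
    size_t copy_bytes; /* bytes taken from the user buffer */
    size_t zero_bytes; /* bytes cleared after them */
};

/* Number of pages needed to cover memsz bytes at the destination. */
static size_t segment_pages(size_t memsz)
{
    return (memsz + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE;
}

/* Work for one page of a segment whose user buffer holds bufsz bytes. */
static struct page_plan plan_page(size_t bufsz, size_t page_index)
{
    size_t offset = page_index * MODEL_PAGE_SIZE;
    struct page_plan plan = { 0, MODEL_PAGE_SIZE }; /* past the buffer: all zero */

    if (offset < bufsz) {
        size_t left = bufsz - offset;

        plan.copy_bytes = left < MODEL_PAGE_SIZE ? left : MODEL_PAGE_SIZE;
        plan.zero_bytes = MODEL_PAGE_SIZE - plan.copy_bytes;
    }
    return plan;
}
```

For example, a segment with bufsz 5000 and memsz 12288 needs three pages:
one full copy, one page with 904 copied bytes and 3192 zeroed, and one page
that is entirely zero.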
diff -uNr linux-2.5.48/include/linux/reboot.h linux-2.5.48.x86kexec/include/linux/reboot.h
--- linux-2.5.48/include/linux/reboot.h Fri Oct 11 22:22:47 2002
+++ linux-2.5.48.x86kexec/include/linux/reboot.h Sun Nov 17 22:53:09 2002
@@ -21,6 +21,7 @@
* POWER_OFF Stop OS and remove all power from system, if possible.
* RESTART2 Restart system using given command string.
* SW_SUSPEND Suspend system using Software Suspend if compiled in
+ * KEXEC Restart the system using a different kernel.
*/
#define LINUX_REBOOT_CMD_RESTART 0x01234567
@@ -30,6 +31,7 @@
#define LINUX_REBOOT_CMD_POWER_OFF 0x4321FEDC
#define LINUX_REBOOT_CMD_RESTART2 0xA1B2C3D4
#define LINUX_REBOOT_CMD_SW_SUSPEND 0xD000FCE2
+#define LINUX_REBOOT_CMD_KEXEC 0x45584543
#ifdef __KERNEL__
diff -uNr linux-2.5.48/kernel/Makefile linux-2.5.48.x86kexec/kernel/Makefile
--- linux-2.5.48/kernel/Makefile Sun Nov 17 22:51:26 2002
+++ linux-2.5.48.x86kexec/kernel/Makefile Sun Nov 17 22:53:09 2002
@@ -21,6 +21,7 @@
obj-$(CONFIG_CPU_FREQ) += cpufreq.o
obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
obj-$(CONFIG_SOFTWARE_SUSPEND) += suspend.o
+obj-$(CONFIG_KEXEC) += kexec.o
ifneq ($(CONFIG_IA64),y)
# According to Alan Modra <[email protected]>, the -fno-omit-frame-pointer is
diff -uNr linux-2.5.48/kernel/kexec.c linux-2.5.48.x86kexec/kernel/kexec.c
--- linux-2.5.48/kernel/kexec.c Wed Dec 31 17:00:00 1969
+++ linux-2.5.48.x86kexec/kernel/kexec.c Sun Nov 17 22:53:09 2002
@@ -0,0 +1,640 @@
+#include <linux/mm.h>
+#include <linux/file.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/version.h>
+#include <linux/compile.h>
+#include <linux/kexec.h>
+#include <linux/spinlock.h>
+#include <net/checksum.h>
+#include <asm/page.h>
+#include <asm/uaccess.h>
+#include <asm/io.h>
+#include <asm/system.h>
+
+/* As designed, kexec can only use memory that you don't
+ * need kmap to access: memory that you can use virt_to_phys()
+ * on and call get_free_page to allocate.
+ *
+ * In the best case you need one page for the transition from
+ * virtual to physical memory. And this page must be identity
+ * mapped. Which pretty much leaves you with pages < PAGE_OFFSET
+ * as you can only mess with user pages.
+ *
+ * As the only subset of memory that it is easy to restrict allocation
+ * to is the physical memory mapped into the kernel, I do that
+ * with get_free_page and hope it is enough.
+ *
+ * I don't know of a good way to calculate which pages get_free_page
+ * will return independent of architecture so I depend on
+ * <asm/kexec.h> to properly set
+ * KEXEC_SOURCE_MEMORY_LIMIT and KEXEC_DESTINATION_MEMORY_LIMIT
+ *
+ */
+
+static struct kimage *kimage_alloc(void)
+{
+ struct kimage *image;
+ image = kmalloc(sizeof(*image), GFP_KERNEL);
+ if (!image)
+ return 0;
+ memset(image, 0, sizeof(*image));
+ image->head = 0;
+ image->entry = &image->head;
+ image->last_entry = &image->head;
+ return image;
+}
+static int kimage_add_entry(struct kimage *image, kimage_entry_t entry)
+{
+ if (image->offset != 0) {
+ image->entry++;
+ }
+ if (image->entry == image->last_entry) {
+ kimage_entry_t *ind_page;
+ ind_page = (void *)__get_free_page(GFP_KERNEL);
+ if (!ind_page) {
+ return -ENOMEM;
+ }
+ *image->entry = virt_to_phys(ind_page) | IND_INDIRECTION;
+ image->entry = ind_page;
+ image->last_entry =
+ ind_page + ((PAGE_SIZE/sizeof(kimage_entry_t)) - 1);
+ }
+ *image->entry = entry;
+ image->entry++;
+ image->offset = 0;
+ return 0;
+}
+
+static int kimage_verify_destination(unsigned long destination)
+{
+ int result;
+
+ /* Assume the page is bad unless we pass the checks */
+ result = -EADDRNOTAVAIL;
+
+ if (destination >= KEXEC_DESTINATION_MEMORY_LIMIT) {
+ goto out;
+ }
+
+ /* NOTE: The caller is responsible for making certain we
+ * don't attempt to load the new image into invalid or
+ * reserved areas of RAM.
+ */
+ result = 0;
+out:
+ return result;
+}
+
+static int kimage_set_destination(
+ struct kimage *image, unsigned long destination)
+{
+ int result;
+ destination &= PAGE_MASK;
+ result = kimage_verify_destination(destination);
+ if (result) {
+ return result;
+ }
+ result = kimage_add_entry(image, destination | IND_DESTINATION);
+ if (result == 0) {
+ image->destination = destination;
+ }
+ return result;
+}
+
+
+static int kimage_add_page(struct kimage *image, unsigned long page)
+{
+ int result;
+ page &= PAGE_MASK;
+ result = kimage_verify_destination(image->destination);
+ if (result) {
+ return result;
+ }
+ result = kimage_add_entry(image, page | IND_SOURCE);
+ if (result == 0) {
+ image->destination += PAGE_SIZE;
+ }
+ return result;
+}
+
+
+static int kimage_terminate(struct kimage *image)
+{
+ int result;
+ result = kimage_add_entry(image, IND_DONE);
+ if (result == 0) {
+ /* Point at the terminating element */
+ image->entry--;
+ }
+ return result;
+}
+
+#define for_each_kimage_entry(image, ptr, entry) \
+ for (ptr = &image->head; (entry = *ptr) && !(entry & IND_DONE); \
+ ptr = (entry & IND_INDIRECTION)? \
+ phys_to_virt((entry & PAGE_MASK)): ptr +1)
+
+static void kimage_free(struct kimage *image)
+{
+ kimage_entry_t *ptr, entry;
+ kimage_entry_t ind = 0;
+ if (!image)
+ return;
+ for_each_kimage_entry(image, ptr, entry) {
+ if (entry & IND_INDIRECTION) {
+ /* Free the previous indirection page */
+ if (ind & IND_INDIRECTION) {
+ free_page((unsigned long)phys_to_virt(ind & PAGE_MASK));
+ }
+ /* Save this indirection page until we are
+ * done with it.
+ */
+ ind = entry;
+ }
+ else if (entry & IND_SOURCE) {
+ free_page((unsigned long)phys_to_virt(entry & PAGE_MASK));
+ }
+ }
+ kfree(image);
+}
+
+static int kimage_is_destination_page(
+ struct kimage *image, unsigned long page)
+{
+ kimage_entry_t *ptr, entry;
+ unsigned long destination;
+ destination = 0;
+ page &= PAGE_MASK;
+ for_each_kimage_entry(image, ptr, entry) {
+ if (entry & IND_DESTINATION) {
+ destination = entry & PAGE_MASK;
+ }
+ else if (entry & IND_SOURCE) {
+ if (page == destination) {
+ return 1;
+ }
+ destination += PAGE_SIZE;
+ }
+ }
+ return 0;
+}
+
+static int kimage_get_unused_area(
+ struct kimage *image, unsigned long size, unsigned long align,
+ unsigned long *area)
+{
+ /* Walk through mem_map and find the first chunk of
+ * unused memory that is at least size bytes long.
+ */
+ /* Since the kernel plays with PG_reserved, mem_map is less
+ * than ideal for this purpose, but it will give us a correct
+ * conservative estimate of what we need to do.
+ */
+ /* For now we take advantage of the fact that all kernel pages
+ * are marked with PG_reserved to allocate a large
+ * contiguous area for the reboot code buffer.
+ */
+ unsigned long addr;
+ unsigned long start, end;
+ unsigned long mask;
+ mask = ((1 << align) -1);
+ start = end = PAGE_SIZE;
+ for(addr = PAGE_SIZE; addr < KEXEC_SOURCE_MEMORY_LIMIT; addr += PAGE_SIZE) {
+ struct page *page;
+ unsigned long aligned_start;
+ page = virt_to_page(phys_to_virt(addr));
+ if (PageReserved(page) ||
+ kimage_is_destination_page(image, addr)) {
+ /* The current page is reserved so the start &
+ * end of the next area must be at least at the
+ * next page.
+ */
+ start = end = addr + PAGE_SIZE;
+ }
+ else {
+ /* O.k. The current page isn't reserved
+ * so push up the end of the area.
+ */
+ end = addr;
+ }
+ aligned_start = (start + mask) & ~mask;
+ if (aligned_start > start) {
+ continue;
+ }
+ if (aligned_start > end) {
+ continue;
+ }
+ if (end - aligned_start >= size) {
+ *area = aligned_start;
+ return 0;
+ }
+ }
+ *area = 0;
+ return -ENOSPC;
+}
+
+static kimage_entry_t *kimage_dst_conflict(
+ struct kimage *image, unsigned long page, kimage_entry_t *limit)
+{
+ kimage_entry_t *ptr, entry;
+ unsigned long destination = 0;
+ for_each_kimage_entry(image, ptr, entry) {
+ if (ptr == limit) {
+ return 0;
+ }
+ else if (entry & IND_DESTINATION) {
+ destination = entry & PAGE_MASK;
+ }
+ else if (entry & IND_SOURCE) {
+ if (page == destination) {
+ return ptr;
+ }
+ destination += PAGE_SIZE;
+ }
+ }
+ return 0;
+}
+
+static kimage_entry_t *kimage_src_conflict(
+ struct kimage *image, unsigned long destination, kimage_entry_t *limit)
+{
+ kimage_entry_t *ptr, entry;
+ for_each_kimage_entry(image, ptr, entry) {
+ unsigned long page;
+ if (ptr == limit) {
+ return 0;
+ }
+ else if (entry & IND_DESTINATION) {
+ /* nop */
+ }
+ else if (entry & IND_DONE) {
+ /* nop */
+ }
+ else {
+ /* SOURCE & INDIRECTION */
+ page = entry & PAGE_MASK;
+ if (page == destination) {
+ return ptr;
+ }
+ }
+ }
+ return 0;
+}
+
+static int kimage_get_off_destination_pages(struct kimage *image)
+{
+ kimage_entry_t *ptr, *cptr, entry;
+ unsigned long buffer, page;
+ unsigned long destination = 0;
+
+ /* Here we implement safeguards to ensure that
+ * a source page is not copied to its destination
+ * page before the data on the destination page is
+ * no longer useful.
+ *
+ * To make it work we actually wind up with a
+ * stronger condition. For every page considered
+ * it is either its own destination page or it is
+ * not a destination page of any page considered.
+ *
+ * Invariants
+ * 1. buffer is not a destination of a previous page.
+ * 2. page is not a destination of a previous page.
+ * 3. destination is not a previous source page.
+ *
+ * Result: Either a source page and a destination page
+ * are the same or the page is not a destination page.
+ *
+ * These checks could be done when we allocate the pages,
+ * but doing it as a final pass allows us more freedom
+ * on how we allocate pages.
+ *
+ * Also while the checks are necessary, in practice nothing
+ * happens. The destination kernel wants to sit in the
+ * same physical addresses as the current kernel so we never
+ * actually allocate a destination page.
+ *
+ * BUGS: This is an O(N^2) algorithm.
+ */
+
+
+ buffer = __get_free_page(GFP_KERNEL);
+ if (!buffer) {
+ return -ENOMEM;
+ }
+ buffer = virt_to_phys((void *)buffer);
+ for_each_kimage_entry(image, ptr, entry) {
+ /* Here we check to see if an allocated page */
+ kimage_entry_t *limit;
+ if (entry & IND_DESTINATION) {
+ destination = entry & PAGE_MASK;
+ }
+ else if (entry & IND_INDIRECTION) {
+ /* Indirection pages must include all of their
+ * contents in limit checking.
+ */
+ limit = phys_to_virt(page + PAGE_SIZE - sizeof(*limit));
+ }
+ if (!((entry & IND_SOURCE) | (entry & IND_INDIRECTION))) {
+ continue;
+ }
+ page = entry & PAGE_MASK;
+ limit = ptr;
+
+ /* See if a previous page has the current page as its
+ * destination.
+ * i.e. invariant 2
+ */
+ cptr = kimage_dst_conflict(image, page, limit);
+ if (cptr) {
+ unsigned long cpage;
+ kimage_entry_t centry;
+ centry = *cptr;
+ cpage = centry & PAGE_MASK;
+ memcpy(phys_to_virt(buffer), phys_to_virt(page), PAGE_SIZE);
+ memcpy(phys_to_virt(page), phys_to_virt(cpage), PAGE_SIZE);
+ *cptr = page | (centry & ~PAGE_MASK);
+ *ptr = buffer | (entry & ~PAGE_MASK);
+ buffer = cpage;
+ }
+ if (!(entry & IND_SOURCE)) {
+ continue;
+ }
+
+ /* See if a previous page is our destination page.
+ * If so claim it now.
+ * i.e. invariant 3
+ */
+ cptr = kimage_src_conflict(image, destination, limit);
+ if (cptr) {
+ unsigned long cpage;
+ kimage_entry_t centry;
+ centry = *cptr;
+ cpage = centry & PAGE_MASK;
+ memcpy(phys_to_virt(buffer), phys_to_virt(cpage), PAGE_SIZE);
+ memcpy(phys_to_virt(cpage), phys_to_virt(page), PAGE_SIZE);
+ *cptr = buffer | (centry & ~PAGE_MASK);
+ *ptr = cpage | ( entry & ~PAGE_MASK);
+ buffer = page;
+ }
+ /* If the buffer is my destination page do the copy now
+ * i.e. invariant 3 & 1
+ */
+ if (buffer == destination) {
+ memcpy(phys_to_virt(buffer), phys_to_virt(page), PAGE_SIZE);
+ *ptr = buffer | (entry & ~PAGE_MASK);
+ buffer = page;
+ }
+ }
+ free_page((unsigned long)phys_to_virt(buffer));
+ return 0;
+}
+
+static int kimage_add_empty_pages(struct kimage *image,
+ unsigned long len)
+{
+ unsigned long pos;
+ int result;
+ for(pos = 0; pos < len; pos += PAGE_SIZE) {
+ char *page;
+ result = -ENOMEM;
+ page = (void *)__get_free_page(GFP_KERNEL);
+ if (!page) {
+ goto out;
+ }
+ result = kimage_add_page(image, virt_to_phys(page));
+ if (result) {
+ goto out;
+ }
+ }
+ result = 0;
+ out:
+ return result;
+}
+
+
+static int kimage_load_segment(struct kimage *image,
+ struct kexec_segment *segment)
+{
+ unsigned long mstart;
+ int result;
+ unsigned long offset;
+ unsigned long offset_end;
+ unsigned char *buf;
+
+ result = 0;
+ buf = segment->buf;
+ mstart = (unsigned long)segment->mem;
+
+ offset_end = segment->memsz;
+
+ result = kimage_set_destination(image, mstart);
+ if (result < 0) {
+ goto out;
+ }
+ for(offset = 0; offset < segment->memsz; offset += PAGE_SIZE) {
+ char *page;
+ size_t size, leader;
+ page = (char *)__get_free_page(GFP_KERNEL);
+ if (page == 0) {
+ result = -ENOMEM;
+ goto out;
+ }
+ result = kimage_add_page(image, virt_to_phys(page));
+ if (result < 0) {
+ goto out;
+ }
+ if (segment->bufsz < offset) {
+ /* We are past the end zero the whole page */
+ memset(page, 0, PAGE_SIZE);
+ continue;
+ }
+ size = PAGE_SIZE;
+ leader = 0;
+ if (offset == 0) {
+ leader = mstart & ~PAGE_MASK;
+ }
+ if (leader) {
+ /* We are on the first page zero the unused portion */
+ memset(page, 0, leader);
+ size -= leader;
+ page += leader;
+ }
+ if (size > (segment->bufsz - offset)) {
+ size = segment->bufsz - offset;
+ }
+ result = copy_from_user(page, buf + offset, size);
+ if (result) {
+ result = (result < 0)?result : -EIO;
+ goto out;
+ }
+ if (size < (PAGE_SIZE - leader)) {
+ /* zero the trailing part of the page */
+ memset(page + size, 0, (PAGE_SIZE - leader) - size);
+ }
+ }
+ out:
+ return result;
+}
+
+
+/* do_kexec executes a new kernel
+ */
+static int do_kexec(unsigned long start, unsigned long nr_segments,
+ struct kexec_segment *arg_segments, struct kimage *image)
+{
+ struct kexec_segment *segments;
+ size_t segment_bytes;
+ int i;
+
+ int result;
+ unsigned long reboot_code_buffer;
+ kimage_entry_t *end;
+
+ /* Initialize variables */
+ segments = 0;
+
+ segment_bytes = nr_segments * sizeof(*segments);
+ segments = kmalloc(segment_bytes, GFP_KERNEL);
+ if (segments == 0) {
+ result = -ENOMEM;
+ goto out;
+ }
+ if (copy_from_user(segments, arg_segments, segment_bytes)) {
+ result = -EFAULT;
+ goto out;
+ }
+
+ /* Read in the data from user space */
+ image->start = start;
+ for(i = 0; i < nr_segments; i++) {
+ result = kimage_load_segment(image, &segments[i]);
+ if (result) {
+ goto out;
+ }
+ }
+
+ /* Terminate early so I can get a place holder. */
+ result = kimage_terminate(image);
+ if (result)
+ goto out;
+ end = image->entry;
+
+ /* Usage of the reboot code buffer is subtle. We first
+ * find a contiguous area of ram, that is not one
+ * of our destination pages. We do not allocate the ram.
+ *
+ * The algorithm to make certain we do not have address
+ * conflicts requires each destination region to have some
+ * backing store so we allocate arbitrary source pages.
+ *
+ * Later in machine_kexec when we copy data to the
+ * reboot_code_buffer it still may be allocated for other
+ * purposes, but we do know there are no source or destination
+ * pages in that area. And since the rest of the kernel
+ * is already shutdown those pages are free for use,
+ * regardless of their page->count values.
+ *
+ * The kernel mapping of the reboot code buffer is passed to
+ * the machine dependent code. If it needs something else
+ * it is free to set that up.
+ */
+ result = kimage_get_unused_area(
+ image, KEXEC_REBOOT_CODE_SIZE, KEXEC_REBOOT_CODE_ALIGN,
+ &reboot_code_buffer);
+ if (result)
+ goto out;
+
+ /* Allocating pages we should never need is silly but the
+ * code won't work correctly unless we have dummy pages to
+ * work with.
+ */
+ result = kimage_set_destination(image, reboot_code_buffer);
+ if (result)
+ goto out;
+ result = kimage_add_empty_pages(image, KEXEC_REBOOT_CODE_SIZE);
+ if (result)
+ goto out;
+ image->reboot_code_buffer = phys_to_virt(reboot_code_buffer);
+
+ result = kimage_terminate(image);
+ if (result)
+ goto out;
+
+ result = kimage_get_off_destination_pages(image);
+ if (result)
+ goto out;
+
+ /* Now hide the extra source pages for the reboot code buffer.
+ */
+ image->entry = end;
+ result = kimage_terminate(image);
+ if (result)
+ goto out;
+
+ result = 0;
+ out:
+ /* cleanup and exit */
+ if (segments) kfree(segments);
+ return result;
+}
+
+
+/*
+ * Exec Kernel system call: for obvious reasons only root may call it.
+ *
+ * This call breaks up into three pieces.
+ * - A generic part which loads the new kernel from the current
+ * address space, and very carefully places the data in the
+ * allocated pages.
+ *
+ * - A generic part that interacts with the kernel and tells all of
+ * the devices to shut down. Preventing on-going dmas, and placing
+ * the devices in a consistent state so a later kernel can
+ * reinitialize them.
+ *
+ * - A machine specific part that includes the syscall number
+ * and then copies the image to its final destination, and
+ * jumps into the image at entry.
+ *
+ * kexec does not sync, or unmount filesystems so if you need
+ * that to happen you need to do that yourself.
+ */
+struct kimage *kexec_image = 0;
+
+asmlinkage long sys_kexec_load(unsigned long entry, unsigned long nr_segments,
+ struct kexec_segment *segments, unsigned long flags)
+{
+ /* Am I using too much stack space here? */
+ struct kimage *image, *old_image;
+ int result;
+
+ /* We only trust the superuser with rebooting the system. */
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ /* In case we need just a little bit of special behavior for
+ * reboot on panic
+ */
+ if (flags != 0)
+ return -EINVAL;
+
+ image = 0;
+ if (nr_segments > 0) {
+ image = kimage_alloc();
+ if (!image) {
+ return -ENOMEM;
+ }
+ result = do_kexec(entry, nr_segments, segments, image);
+ if (result) {
+ kimage_free(image);
+ return result;
+ }
+ }
+
+ old_image = xchg(&kexec_image, image);
+
+ kimage_free(old_image);
+ return 0;
+}
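The entry list that kimage_add_entry() builds and for_each_kimage_entry()
walks is easier to see in isolation. What follows is a user-space sketch
under stated assumptions, not the kernel code: pages come from a small fixed
pool, a "physical address" is a byte offset into that pool, entries are 32
bits wide as on i386, and all names (list_model, list_add, list_count,
list_demo_pages) are invented for the example. The image->offset handling
and the entry-- done by kimage_terminate() are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096u
#define MODEL_PAGE_MASK (~(MODEL_PAGE_SIZE - 1))
#define IND_INDIRECTION 0x2u
#define IND_DONE        0x4u
#define IND_SOURCE      0x8u
#define SLOTS      (MODEL_PAGE_SIZE / sizeof(uint32_t)) /* entries per page */
#define POOL_PAGES 8u

struct list_model {
    uint32_t pool[POOL_PAGES][SLOTS]; /* stand-in for physical memory */
    unsigned used;                    /* pool pages handed out so far */
    uint32_t head;                    /* first link, outside any page */
    uint32_t *entry;                  /* next slot to fill */
    uint32_t *last_entry;             /* slot reserved for the next link */
};

/* Model of phys_to_virt(): map a pool offset back to a pointer. */
static uint32_t *model_phys_to_virt(struct list_model *l, uint32_t phys)
{
    return l->pool[phys / MODEL_PAGE_SIZE];
}

static void list_init(struct list_model *l)
{
    memset(l, 0, sizeof(*l));
    l->entry = &l->head;
    l->last_entry = &l->head;
}

/* Append one entry; chain in a fresh page when the current one is full. */
static int list_add(struct list_model *l, uint32_t value)
{
    if (l->entry == l->last_entry) {
        uint32_t phys;

        if (l->used == POOL_PAGES)
            return -1;                /* models -ENOMEM */
        phys = l->used++ * MODEL_PAGE_SIZE;
        *l->entry = phys | IND_INDIRECTION;
        l->entry = model_phys_to_virt(l, phys);
        l->last_entry = l->entry + (SLOTS - 1);
    }
    *l->entry++ = value;
    return 0;
}

/* Count entries carrying flag, following links until IND_DONE or a 0 entry. */
static size_t list_count(struct list_model *l, uint32_t flag)
{
    uint32_t *ptr = &l->head, e;
    size_t n = 0;

    while ((e = *ptr) != 0 && !(e & IND_DONE)) {
        if (e & flag)
            n++;
        ptr = (e & IND_INDIRECTION) ?
            model_phys_to_virt(l, e & MODEL_PAGE_MASK) : ptr + 1;
    }
    return n;
}

/* Build n source entries plus IND_DONE; return pages chained, 0 on failure. */
static size_t list_demo_pages(size_t n)
{
    static struct list_model l;
    size_t i;

    list_init(&l);
    for (i = 0; i < n; i++)
        if (list_add(&l, (uint32_t)(0x100000u + i * MODEL_PAGE_SIZE) | IND_SOURCE))
            return 0;
    if (list_add(&l, IND_DONE))
        return 0;
    assert(list_count(&l, IND_SOURCE) == n);
    return list_count(&l, IND_INDIRECTION);
}
```

Each page holds SLOTS - 1 payload entries because its last slot is kept for
the link to the next page, which matches how last_entry is set in
kimage_add_entry().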
diff -uNr linux-2.5.48/kernel/sys.c linux-2.5.48.x86kexec/kernel/sys.c
--- linux-2.5.48/kernel/sys.c Sun Nov 17 22:51:26 2002
+++ linux-2.5.48.x86kexec/kernel/sys.c Sun Nov 17 22:53:09 2002
@@ -16,6 +16,7 @@
#include <linux/init.h>
#include <linux/highuid.h>
#include <linux/fs.h>
+#include <linux/kexec.h>
#include <linux/workqueue.h>
#include <linux/device.h>
#include <linux/times.h>
@@ -206,6 +207,7 @@
cond_syscall(sys_lookup_dcookie)
cond_syscall(sys_swapon)
cond_syscall(sys_swapoff)
+cond_syscall(sys_kexec_load)
cond_syscall(sys_init_module)
cond_syscall(sys_delete_module)
@@ -416,6 +418,27 @@
machine_restart(buffer);
break;
+#ifdef CONFIG_KEXEC
+ case LINUX_REBOOT_CMD_KEXEC:
+ {
+ struct kimage *image;
+ if (arg) {
+ unlock_kernel();
+ return -EINVAL;
+ }
+ image = xchg(&kexec_image, 0);
+ if (!image) {
+ unlock_kernel();
+ return -EINVAL;
+ }
+ notifier_call_chain(&reboot_notifier_list, SYS_RESTART, NULL);
+ system_running = 0;
+ device_shutdown();
+ printk(KERN_EMERG "Starting new kernel\n");
+ machine_kexec(image);
+ break;
+ }
+#endif
#ifdef CONFIG_SOFTWARE_SUSPEND
case LINUX_REBOOT_CMD_SW_SUSPEND:
if (!software_suspend_enabled) {
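Both sys_kexec_load() and the LINUX_REBOOT_CMD_KEXEC case hand the image
over with xchg(), so a staged image is either replaced or consumed exactly
once. The sketch below models that handoff in user space with C11 atomics.
It is an illustration only: atomic_exchange stands in for the kernel's
xchg(), and the names image_model, stage_image and claim_image are
hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct image_model { int id; };

/* The single staged image, like kexec_image in the patch. */
static _Atomic(struct image_model *) staged_image;

/*
 * Stage a new image (or NULL to unload). Returns the image it displaced
 * so the caller can free it, as sys_kexec_load() does with old_image.
 */
static struct image_model *stage_image(struct image_model *next)
{
    return atomic_exchange(&staged_image, next);
}

/*
 * Take the staged image for booting. A second call sees NULL, which the
 * reboot path turns into -EINVAL.
 */
static struct image_model *claim_image(void)
{
    return atomic_exchange(&staged_image, NULL);
}
```

The same pattern explains the session below: the loader stages an image
first, and the reboot call later claims it.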
# ./kexec-1.7 --force --debug "--command-line=ro root=805 console=ttyS0,9600n8" ./linux-2.5
FIXME assuming 64M of ram
setup16_end: 00091b1f
FIXME assuming 64M of ram
Synchronizing SCSI caches:
Shutting down devices
Starting new kernel
Linux version 2.5.48 (andyp@joe) (gcc version 2.95.3 20010315 (SuSE)) #1 Mon Nov 18 15:03:14 PST 2002
Video mode to be used for restore is ffff
BIOS-provided physical RAM map:
BIOS-e820: 0000000000001000 - 000000000009ffff (usable)
BIOS-e820: 0000000000100000 - 0000000003ffffff (usable)
63MB LOWMEM available.
hm, page 00000000 reserved twice.
On node 0 totalpages: 16383
DMA zone: 4096 pages, LIFO batch:1
Normal zone: 12287 pages, LIFO batch:2
HighMem zone: 0 pages, LIFO batch:1
IBM machine detected. Enabling interrupts during APM calls.
IBM machine detected. Disabling SMBus accesses.
Building zonelist for node : 0
Kernel command line: ro root=805 console=ttyS0,9600n8
Initializing CPU#0
Detected 799.717 MHz processor.
Console: colour VGA+ 80x25
Calibrating delay loop... 1581.05 BogoMIPS
Memory: 60868k/65532k available (2087k kernel code, 4204k reserved, 825k data, 304k init, 0k highmem)
Security Scaffold v1.0.0 initialized
Dentry cache hash table entries: 8192 (order: 4, 65536 bytes)
Inode-cache hash table entries: 4096 (order: 3, 32768 bytes)
Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
-> /dev
-> /dev/console
-> /root
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 256K
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
CPU: Intel Pentium III (Coppermine) stepping 0a
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Checking 'hlt' instruction... OK.
POSIX conformance testing by UNIFIX
Linux NET4.0 for Linux 2.4
Based upon Swansea University Computer Society NET3.039
Initializing RT netlink socket
mtrr: v2.0 (20020519)
Linux Plug and Play Support v0.9 (c) Adam Belay
PCI: PCI BIOS revision 2.10 entry at 0xfd5dc, last bus=1
PCI: Using configuration type 1
BIO: pool of 256 setup, 14Kb (56 bytes/bio)
biovec pool[0]: 1 bvecs: 116 entries (12 bytes)
biovec pool[1]: 4 bvecs: 116 entries (48 bytes)
biovec pool[2]: 16 bvecs: 58 entries (192 bytes)
biovec pool[3]: 64 bvecs: 29 entries (768 bytes)
biovec pool[4]: 128 bvecs: 14 entries (1536 bytes)
biovec pool[5]: 256 bvecs: 7 entries (3072 bytes)
block request queues:
112 requests per read queue
112 requests per write queue
8 requests per batch
enter congestion at 27
exit congestion at 29
isapnp: Scanning for PnP cards...
isapnp: No Plug & Play device found
drivers/usb/core/usb.c: registered new driver usbfs
drivers/usb/core/usb.c: registered new driver hub
PCI: Probing PCI hardware
PCI: Probing PCI hardware (bus 00)
PCI: Discovered peer bus 01
Starting kswapd
aio_setup: sizeof(struct page) = 40
[c3fb2040] eventpoll: successfully initialized.
Journalled Block Device driver loaded
Installing knfsd (copyright (C) 1996 [email protected]).
udf: registering filesystem
Capability LSM initialized
Serial: 8250/16550 driver $Revision: 1.90 $ IRQ sharing disabled
ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
parport0: PC-style at 0x378 [PCSPP]
pty: 256 Unix98 ptys configured
lp0: using parport0 (polling).
Linux agpgart interface v0.99 (c) Jeff Hartmann
agpgart: Maximum main memory to use for agp memory: 27M
agpgart: unable to determine aperture size.
agpgart: Maximum main memory to use for agp memory: 27M
agpgart: unable to determine aperture size.
[drm] Initialized radeon 1.7.0 20020828 on minor 0
Floppy drive(s): fd0 is 1.44M
FDC 0 is a National Semiconductor PC87306
Intel(R) PRO/100 Network Driver - version 2.1.24-k2
Copyright (c) 2002 Intel Corporation
e100: eth0: Intel(R) PRO/100+ Server Adapter (PILA8470B)
Mem:0xfeb7f000 IRQ:11 Speed:0 Mbps Dx:N/A
Hardware receive checksums enabled
cpu cycle saver enabled
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
hda: LG CD-ROM CRD-8484B, ATAPI CD/DVD-ROM drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
hda: ATAPI 48X CD-ROM drive, 128kB Cache
Uniform CD-ROM driver Revision: 3.12
end_request: I/O error, dev hda, sector 0
SCSI subsystem driver Revision: 1.00
PCI: Enabling device 01:03.0 (0156 -> 0157)
scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.4
<Adaptec aic7892 Ultra160 SCSI adapter>
aic7892: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs
(scsi0:A:0): 160.000MB/s transfers (80.000MHz DT, offset 31, 16bit)
Vendor: IBM-PSG Model: ST318436LC !# Rev: 3281
Type: Direct-Access ANSI SCSI revision: 03
(scsi0:A:1): 160.000MB/s transfers (80.000MHz DT, offset 31, 16bit)
Vendor: IBM-PSG Model: ST318436LC !# Rev: 3281
Type: Direct-Access ANSI SCSI revision: 03
Vendor: IBM Model: YGLv3 S2 Rev: 0
Type: Processor ANSI SCSI revision: 02
scsi0:A:0:0: Tagged Queuing enabled. Depth 64
SCSI device sda: drive cache: write through
SCSI device sda: 35548320 512-byte hdwr sectors (18201 MB)
sda: sda1 sda2 < sda5 sda6 sda7 sda8 sda9 sda10 >
Attached scsi disk sda at scsi0, channel 0, id 0, lun 0
scsi0:A:1:0: Tagged Queuing enabled. Depth 64
SCSI device sdb: drive cache: write through
SCSI device sdb: 35548320 512-byte hdwr sectors (18201 MB)
sdb: sdb1
Attached scsi disk sdb at scsi0, channel 0, id 1, lun 0
Attached scsi generic sg2 at scsi0, channel 0, id 8, lun 0, type 3
Initializing USB Mass Storage driver...
drivers/usb/core/usb.c: registered new driver usb-storage
USB Mass Storage support registered.
mice: PS/2 mouse device common for all mice
input: ImPS/2 Generic Wheel Mouse on isa0060/serio1
serio: i8042 AUX port at 0x60,0x64 irq 12
input: AT Set 2 keyboard on isa0060/serio0
serio: i8042 KBD port at 0x60,0x64 irq 1
Advanced Linux Sound Architecture Driver Version 0.9.0rc5 (Sun Nov 10 19:48:18 2002 UTC).
request_module[snd-card-0]: not ready
request_module[snd-card-1]: not ready
request_module[snd-card-2]: not ready
request_module[snd-card-3]: not ready
request_module[snd-card-4]: not ready
request_module[snd-card-5]: not ready
request_module[snd-card-6]: not ready
request_module[snd-card-7]: not ready
ALSA device list:
No soundcards found.
NET4: Linux TCP/IP 1.0 for NET4.0
IP: routing cache hash table of 512 buckets, 4Kbytes
TCP: Hash tables configured (established 4096 bind 4096)
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
kjournald starting. Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
VFS: Mounted root (ext3 filesystem) readonly.
Freeing unused kernel memory: 304k freed
INIT: version 2.82 booting
Running /etc/init.d/boot
Mounting /proc device done
Mounting /dev/pts
blogd: console=/dev/console, stdin=/dev/console, must differ, boot logging disabled
showconsole: Warning: the ioctl TIOCGDEV is not known by the kernel
Adding 530104k swap on /dev/sda6. Priority:42 extents:1
Activating swap-devices in /etc/fstab... done
showconsole: Warning: the ioctl TIOCGDEV is not known by the kernel
Checking file systems...
fsck 1.26 (3-Feb-2002)
/dev/sda5: clean, 16935/66264 files, 104836/265041 blocks
/dev/sda1: clean, 55/10040 files, 24115/40131 blocks
/dev/sdb1: clean, 11/2223872 files, 78008/4441964 blocks
/dev/sda10: clean, 523256/1198208 files, 2052639/2393677 blocks
/dev/sda9: clean, 51895/263296 files, 310582/526120 blocks
/dev/sda8: clean, 140195/525888 files, 590977/1050241 blocks
/dev/sda7: clean, 2747/131616 files, 111363/263056 blocks done
EXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,5), internal journal
Setting up /lib/modules/2.5.48 failed
Mounting local file systems...
kjournald starting. Commit interval 5 seconds
proc on /proc type proc (rw)
EXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,17), internal journal
devpts on /dev/pts type devpts (rw,mode=0620,gid=5)
EXT3-fs: mounted filesystem with ordered data mode.
/dev/sdb1 on /2nd type ext3 (rw)
kjournald starting. Commit interval 5 seconds
/dev/sda1 on /boot type ext2 (rw)
EXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,10), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
/dev/sda10 on /home type ext3 (rw)
kjournald starting. Commit interval 5 seconds
EXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,9), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
/dev/sda9 on /opt type ext3 (rw)
kjournald starting. Commit interval 5 seconds
EXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,8), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
/dev/sda8 on /usr type ext3 (rw)
kjournald starting. Commit interval 5 seconds
EXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,7), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
/dev/sda7 on /var type ext3 (rw) done
Restore device permissions done
Activating remaining swap-devices in /etc/fstab... done
Setting up the CMOS clock done
Setting up timezone data done
Configuring serial ports...
ttyS0 at 0x03f8 (irq = 4) is a 16550A
ttyS1 at 0x02f8 (irq = 3) is a 16550A
Configured serial ports done
Setting up hostname 'joe' done
Setting up loopback interface done
Creating /var/log/boot.msg done
showconsole: Warning: the ioctl TIOCGDEV is not known by the kernel
INIT: Entering runlevel: 5
blogd: console=/dev/console, stdin=/dev/console, must differ, boot logging disabled
Master Resource Control: previous runlevel: N, switching to runlevel:5
Starting personal-firewall (initial) [not active] unused
Initializing random number generator done
Setting up network interfaces:
lo done
eth0 (DHCP) IP address: 172.20.1.38 done
Starting syslog services done
Starting hotplugging services [ net pci usb ] failed
Starting hardware scan on boot done
e100: eth0 NIC Link is Up 100 Mbps Full duplex
Starting RPC portmap daemon done
Starting SSH daemon done
Starting sound driver: already running done
Starting service at daemon done
Initializing SMTP port (sendmail) done
Loading keymap qwerty/us.map.gz done
Loading compose table winkeys shiftctrl latin1.add done
Loading console font lat1-16.psfu done
Loading screenmap none done
Setting up console ttys done
Starting service kdm done
Starting CRON daemon done
Starting Name Service Cache Daemon done
Starting inetd done
Starting personal-firewall (final) [not active] unused
Master Resource Control: runlevel 5 has been reached
Failed services in runlevel 5: hotplug
Skipped services in runlevel 5: personal-firewall.initial splash personal-firewall.final
Eric W. Biederman wrote:
> kexec is a set of system calls that allows you to load another kernel
> from the currently executing Linux kernel. The current implementation
> has only been tested, and had the kinks worked out on x86, but the
> generic code should work on any architecture.
>
> Could I get some feedback on where this works and where this breaks?
> With the maturation of kexec-tools to skip attempting BIOS calls,
> I expect a new Linux kernel to load for most people. Though I
> also expect some device drivers will not reinitialize after the reboot.
I give it a big thumbs-up. Between the NUMAQs and the big xSeries
machines, we have a lot of slow rebooters. The 16GB Intel boxes take
about 5 minutes to get back to the bootloader after a reboot, and
the 4 and 8-quad NUMAQ's take closer to 10.
The IBM machines I've tried it on are a 4-way and 8-way PIII. They
both have aic7xxx cards and the 8-way has a ServeRAID 4 controller.
They have a collection of acenic, e1000, pcnet32 and eepro100 net
cards. All seem to work just fine.
The NUMAQ is another story, though. I get nothing after "Starting new
kernel". But, I wasn't expecting much. The NUMAQ is pretty weird
hardware and god knows what is actually happening. I'll try it some
more when I'm more confident in what I'm doing.
What's the deal with "FIXME assuming 64M of ram"? I was a little
surprised when my 16GB machine started to OOM as I did a "make -j8
bzImage" :) Why is it that you need the memory size at load time?
--
Dave Hansen
Dave Hansen writes:
> Eric W. Biederman wrote:
> > kexec is a set of system calls that allows you to load another kernel
> > from the currently executing Linux kernel. The current implementation
> > has only been tested, and had the kinks worked out on x86, but the
> > generic code should work on any architecture.
> > Could I get some feedback on where this works and where this breaks?
> > With the maturation of kexec-tools to skip attempting BIOS calls,
> > I expect a new Linux kernel to load for most people. Though I
> > also expect some device drivers will not reinitialize after the reboot.
>
> I give it a big thumbs-up.
And you thought I was kidding when I said it was mostly working :)
> Between the NUMAQs and the big xSeries machines, we
> have a lot of slow rebooters. The 16GB Intel boxes take about 5 minutes to
> get back to the bootloader after a reboot, and the 4 and 8-quad NUMAQ's take
> closer to 10.
Wow. 10 minutes is a pain. That certainly explains your interest...
> The IBM machines I've tried it on are a 4-way and 8-way PIII. They both have
> aic7xxx cards and the 8-way has a ServeRAID 4 controller. They have a collection
> of acenic, e1000, pcnet32 and eepro100 net cards. All seem to work just fine.
>
> The NUMAQ is another story, though. I get nothing after "Starting new kernel".
> But, I wasn't expecting much. The NUMAQ is pretty weird hardware and god knows
> what is actually happening. I'll try it some more when I'm more confident in
> what I'm doing.
I suspect the hardware shutdown and start up logic for NUMAQ cpus needs some
special handling. Does kexec_test not print anything, or were you not patient
enough?
> What's the deal with "FIXME assuming 64M of ram"? I was a little surprised when
> my 16GB machine started to OOM as I did a "make -j8 bzImage" :) Why is it that
> you need the memory size at load time?
Small steps. When I bypass the BIOS I need to get all of the information
the kernel normally would get from the BIOS from someplace else. Currently
you can use the "mem=" kernel command line parameter. Or you can dig through
/proc/iomem, /proc/meminfo and other places to get the BIOS's memory map.
There isn't a really good source, so I started with something that would work,
and I will work the user space tools up to something that works well.
I will happily accept patches :)
Eric
Andy Pfiffer writes:
> On Mon, 2002-11-18 at 00:53, Eric W. Biederman wrote:
> > kexec is a set of system calls that allows you to load another kernel
> > from the currently executing Linux kernel. The current implementation
> > has only been tested, and had the kinks worked out on x86, but the
> > generic code should work on any architecture.
>
> Great News, Eric. For the first time *ever* I got a kexec reboot to
> work on my most troublesome machine (see below).
Cool. I was pretty certain it would get into Linux but the fact the device
drivers are not hanging up is a real plus.
> Current .config settings:
> # CONFIG_SMP is not set
> CONFIG_X86_GOOD_APIC=y
> # CONFIG_X86_UP_APIC is not set
> CONFIG_KEXEC=y
>
> Oddly, kexec_test still hangs.
> # ./kexec-1.7 --force ./kexec_test-1.7
[snip...]
> <hang>
Yep. I really haven't tracked and fixed the cause of the hang,
I just avoided the issue entirely. Eventually I will come back
and look into what it takes to improve the odds of having BIOS calls
work. --real-mode restores the old kexec behavior.
All of the real changes were to the user space code. The kernel
patch stayed the same.
> Complete kernel boot-up log attached below. I'm going to try to find my
> other 576MB of RAM with the right command-line magic... ;^)
Or you can write a routine to gather that information dynamically and send
me a patch for /sbin/kexec. Though it may take another proc file to do
that one properly.
Eric
> I suspect the hardware shutdown and start up logic for NUMAQ cpus
> needs some special handling.
Almost certainly ;-) One of the main things I do differently on boot
is to use NMIs rather than the normal INIT/STARTUP sequence to bootstrap
CPUs with .... thus they aren't as thoroughly reset. Things like clearing
down the local APIC state (but NOT the LDR) and clearing down the IO-APICs
will be especially important. I haven't looked at your code yet to see
exactly what it does here though.
> Small steps. When I bypass the BIOS I need to get all of the information
> the kernel normally would get from the BIOS from someplace else. Currently
> you can use the "mem=" kernel command line parameter. Or you can dig through
> /proc/iomem, /proc/meminfo and other places to get the BIOS's memory map.
> There isn't a really good source, so I started with something that would work,
> and I will work the user space tools up to something that works well.
>
> I will happily accept patches :)
Sounds like we should just export back to you the value we parsed from
the BIOS from the existing boot, no? I'll see if I can make you a patch
to do that ...
M.
Eric W. Biederman wrote:
> Dave Hansen writes:
>>The NUMAQ is another story, though. I get nothing after "Starting new kernel".
>>But, I wasn't expecting much. The NUMAQ is pretty weird hardware and god knows
>>what is actually happening. I'll try it some more when I'm more confident in
>>what I'm doing.
>
> I suspect the hardware shutdown and start up logic for NUMAQ cpus needs some
> special handling. Does kexec_test not print anything, or were you not patient
> enough?
Starting new kernel
kexec_test 1.6 starting...
eax: 0E1FB007 ebx: 0000111C ecx: 00000000 edx: 00000000
esi: 00000000 edi: 00000000 esp: 00000000 ebp: 00000000
idt: 00000000 C0000000
gdt: 0000006F 000010A0
Switching descriptors.
Descriptors changed.
Legacy pic setup.
In real mode.
Interrupts enabled.
Base memory size: 027E
A20 disabled.
E820 Memory Map.
000000000009FC00 @ 0000000000000000 type: 00000001
00000000EFF00000 @ 0000000000100000 type: 00000001
0000000000180000 @ 00000000FFE80000 type: 00000002
0000000000009000 @ 00000000FEC00000 type: 00000002
0000000100000000 @ 0000000100000000 type: 00000001
E801 Memory size: 003D7400
Mem88 Memory size: FC00
Testing for APM.
APM test done.
Equiptment list: 4426
Sysdesc: F000:E6F5
Video type: VGA
Cursor Position(Row,Column): 0018 0000
Video Mode: 0003
Setting auto repeat rate done
DASD type: 0300 00FAC53F
EDD: ok
A20 enabled
Interrupts disabled.
In protected mode.
Halting.
>>What's the deal with "FIXME assuming 64M of ram"? I was a little surprised when
>>my 16GB machine started to OOM as I did a "make -j8 bzImage" :) Why is it that
>>you need the memory size at load time?
>
> Small steps. When I bypass the BIOS I need to get all of the information
> the kernel normally would get from the BIOS from someplace else. Currently
> you can use the "mem=" kernel command line parameter. Or you can dig through
> /proc/iomem, /proc/meminfo and other places to get the BIOS's memory map.
> There isn't a really good source, so I started with something that would work,
> and I will work the user space tools up to something that works well.
I have a couple of ideas. But, first, is it hard to reconstruct the
memory map? Will all 1GB systems have the same memory map? Do you
have documentation of the format? I don't think that any of these
qualify as the "right thing". But, as hacks, they should keep me
happy for a bit.
For now, I can write a quick script to fix it:
--command-line="`memscript`"
Until it is working, a --hack-mem option might be a good idea.
Perhaps we could just save a copy off when the kernel loads for the
first time. If we export it somewhere, the kexec executable can just
copy it. For now, we can just printk it and paste it into each
version of kexec that we compile.
--
Dave Hansen
On Tue, 2002-11-19 at 02:25, Eric W. Biederman wrote:
> > Complete kernel boot-up log attached below. I'm going to try to find my
> > other 576MB of RAM with the right command-line magic... ;^)
>
> Or you can write a routine to gather that information dynamically and send
> me a patch for /sbin/kexec. Though it may take another proc file to do
> that one properly.
>
> Eric
Just to make sure I understand the problem. Until we can make all
boot-time BIOS calls work, we need a way to:
1) capture the initial memory map used by the kernel, and
2) supply that information to the to-be-run image.
On my system, the e820 map looks like this (from full reboot):
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009dc00 (usable)
BIOS-e820: 000000000009dc00 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 0000000027fed140 (usable)
BIOS-e820: 0000000027fed140 - 0000000027ff0000 (ACPI data)
BIOS-e820: 0000000027ff0000 - 0000000028000000 (reserved)
BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
639MB LOWMEM available.
And /proc/iomem looks like this:
00000000-0009dbff : System RAM
0009dc00-0009ffff : reserved
000a0000-000bffff : Video RAM area
000c0000-000c7fff : Video ROM
000ca000-000cb7ff : Extension ROM
000cb800-000cffff : Extension ROM
000f0000-000fffff : System ROM
00100000-27fed13f : System RAM
00100000-00309f9a : Kernel code
00309f9b-003d873f : Kernel data
27fed140-27feffff : ACPI Tables
27ff0000-27ffffff : reserved
effff000-efffffff : Adaptec AIC-7892P U160/m
effff000-efffffff : aic7xxx
f0000000-f7ffffff : S3 Inc. Savage 4
fea00000-feafffff : Intel Corp. 82557/8/9 [Ethernet
fea00000-feafffff : e100
feb7e000-feb7efff : ServerWorks OSB4/CSB5 USB Contro
feb7f000-feb7ffff : Intel Corp. 82557/8/9 [Ethernet
feb7f000-feb7ffff : e100
feb80000-febfffff : S3 Inc. Savage 4
fec00000-ffffffff : reserved
Comparing the two:
Range              e820     /proc/iomem
0000000-0009dbff   usable   System RAM
0100000-27fed140   usable   System RAM
From a sample of 1 system, it looks like we should be able to use any
ranges marked as "System RAM" that are listed in /proc/iomem. Did I miss
something?
I'll see if I can conjure up something...
Andy
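A minimal sketch of the idea in the message above, assuming the /proc/iomem
layout shown there: it keeps only the top-level "System RAM" lines and prints
them with an exclusive end address, the form used by the BIOS-e820 boot
messages. The function name system_ram_ranges and the embedded sample text are
illustrative only and are not part of kexec-tools; Python is used here purely
for brevity.

```python
import re

# Matches only top-level /proc/iomem lines, e.g.
#   "00100000-27fed13f : System RAM"
# Nested entries (such as "Kernel code") are indented in the real file,
# so they do not match a pattern anchored at the start of the line.
_IOMEM_LINE = re.compile(r"^([0-9a-fA-F]+)-([0-9a-fA-F]+) : (.+)$")


def system_ram_ranges(iomem_text):
    """Return [(start, end_inclusive), ...] for top-level "System RAM" entries."""
    ranges = []
    for line in iomem_text.splitlines():
        m = _IOMEM_LINE.match(line)
        if m and m.group(3).strip() == "System RAM":
            ranges.append((int(m.group(1), 16), int(m.group(2), 16)))
    return ranges


# Excerpt of the /proc/iomem contents quoted in the thread.
SAMPLE = """\
00000000-0009dbff : System RAM
0009dc00-0009ffff : reserved
00100000-27fed13f : System RAM
  00100000-00309f9a : Kernel code
  00309f9b-003d873f : Kernel data
27fed140-27feffff : ACPI Tables
"""

if __name__ == "__main__":
    for start, end in system_ram_ranges(SAMPLE):
        # /proc/iomem ends are inclusive; the e820 boot log prints an
        # exclusive end, so add 1 before comparing the two.
        print(f"{start:#010x} - {end + 1:#010x} (usable)")
```

As noted later in the thread, this approach is x86-specific and may not
describe memory above 4GB.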
Andy Pfiffer writes:
> On Tue, 2002-11-19 at 02:25, Eric W. Biederman wrote:
> > > Complete kernel boot-up log attached below. I'm going to try to find my
> > > other 576MB of RAM with the right command-line magic... ;^)
> >
> > Or you can write a routine to gather that information dynamically and send
> > me a patch for /sbin/kexec. Though it may take another proc file to do
> > that one properly.
> >
> > Eric
>
> Just to make sure I understand the problem. Until we can make all
> boot-time BIOS calls work, we need a way to:
A small clarification. BIOS calls will never work 100%. Especially in the
interesting cases like kexec on panic. So entering the kernel in
32-bit mode will continue to be the default mode of operation. This means the
final solution to problems like this needs to be a good one.
> 1) capture the initial memory map used by the kernel, and
> 2) supply that information to the to-be-run image.
>
> On my system, the e820 map looks like this (from full reboot):
> BIOS-provided physical RAM map:
> BIOS-e820: 0000000000000000 - 000000000009dc00 (usable)
> BIOS-e820: 000000000009dc00 - 00000000000a0000 (reserved)
> BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
> BIOS-e820: 0000000000100000 - 0000000027fed140 (usable)
> BIOS-e820: 0000000027fed140 - 0000000027ff0000 (ACPI data)
> BIOS-e820: 0000000027ff0000 - 0000000028000000 (reserved)
> BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
> 639MB LOWMEM available.
>
> And /proc/iomem looks like this:
> 00000000-0009dbff : System RAM
> 0009dc00-0009ffff : reserved
> 000a0000-000bffff : Video RAM area
> 000c0000-000c7fff : Video ROM
> 000ca000-000cb7ff : Extension ROM
> 000cb800-000cffff : Extension ROM
> 000f0000-000fffff : System ROM
> 00100000-27fed13f : System RAM
> 00100000-00309f9a : Kernel code
> 00309f9b-003d873f : Kernel data
> 27fed140-27feffff : ACPI Tables
> 27ff0000-27ffffff : reserved
> effff000-efffffff : Adaptec AIC-7892P U160/m
> effff000-efffffff : aic7xxx
> f0000000-f7ffffff : S3 Inc. Savage 4
> fea00000-feafffff : Intel Corp. 82557/8/9 [Ethernet
> fea00000-feafffff : e100
> feb7e000-feb7efff : ServerWorks OSB4/CSB5 USB Contro
> feb7f000-feb7ffff : Intel Corp. 82557/8/9 [Ethernet
> feb7f000-feb7ffff : e100
> feb80000-febfffff : S3 Inc. Savage 4
> fec00000-ffffffff : reserved
>
> Comparing the two:
> Range              e820     /proc/iomem
> 0000000-0009dbff   usable   System RAM
> 0100000-27fed140   usable   System RAM
>
> From a sample of 1 system, it looks like we should be able to use any
> ranges marked as "System RAM" that are listed in /proc/iomem. Did I miss
> something?
Only that /proc/iomem is only useful this way on x86 and that
it doesn't capture the details of the memory map above 4GB. But
it is much better than only having 4GB of main memory.
Eric
On Tue, 19 Nov 2002, Dave Hansen wrote:
>
> I have a couple of ideas. But, first, is it hard to reconstruct the
> memory map?
Hmm.. You shouldn't need to reconstruct it. It's all there in the
struct e820map e820;
(yeah, we will have modified it to match the setup of the running kernel,
but on the whole it should all be there, no?)
Linus
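For reference, a sketch of how a raw e820 table could be decoded once it
reaches user space. It assumes each entry uses the BIOS INT 15h E820
descriptor layout (64-bit base, 64-bit length, 32-bit type, little-endian,
20 bytes, no padding). No export interface had been agreed on at this point
in the thread, so the raw byte buffer and the name decode_e820 are
hypothetical.

```python
import struct

# One e820 descriptor: base address, length, type.
# Assumption: little-endian and unpadded, i.e. 8 + 8 + 4 = 20 bytes.
E820_ENTRY = struct.Struct("<QQI")

# Type codes used in the boot messages quoted in this thread.
E820_TYPES = {1: "usable", 2: "reserved", 3: "ACPI data", 4: "ACPI NVS"}


def decode_e820(raw):
    """Decode a buffer of packed descriptors into (base, length, type) tuples."""
    if len(raw) % E820_ENTRY.size != 0:
        raise ValueError("buffer is not a whole number of e820 entries")
    return [E820_ENTRY.unpack_from(raw, off)
            for off in range(0, len(raw), E820_ENTRY.size)]


if __name__ == "__main__":
    # Two entries rebuilt from the BIOS-e820 boot log quoted in this thread.
    raw = (E820_ENTRY.pack(0x0, 0x9dc00, 1) +
           E820_ENTRY.pack(0x9dc00, 0xa0000 - 0x9dc00, 2))
    for base, length, typ in decode_e820(raw):
        print(f"{base:016x} - {base + length:016x} ({E820_TYPES.get(typ, typ)})")
```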
"Martin J. Bligh" writes:
> > I suspect the hardware shutdown and start up logic for NUMAQ cpus
> > needs some special handling.
>
> Almost certainly ;-) One of the main things I do differently on boot
> is to use NMIs rather than the normal INIT/STARTUP sequence to bootstrap
> CPUs with .... thus they aren't as thoroughly reset. Things like clearing
> down the local APIC state (but NOT the LDR) and clearing down the IO-APICs
> will be especially important. I haven't looked at your code yet to see
> exactly what it does here though.
That part is in my x86kexec-hwfixes.diff. I have a good first stab
at it that works on most x86 SMPs. But apparently not on NUMAQ.
> > Small steps. When I bypass the BIOS I need to get all of the information
> > the kernel normally would get from the BIOS from someplace else. Currently
> > you can use the "mem=" kernel command line parameter. Or you can dig through
> > /proc/iomem, /proc/meminfo and other places to get the BIOS's memory map.
> > There isn't a really good source, so I started with something that would work,
> > and I will work the user space tools up to something that works well.
> >
> > I will happily accept patches :)
>
> Sounds like we should just export back to you the value we parsed from
> the BIOS from the existing boot, no? I'll see if I can make you a patch
> to do that ...
Yep. But we currently don't export it cleanly...
Eric
Dave Hansen writes:
> Eric W. Biederman wrote:
> > Dave Hansen writes:
> >>The NUMAQ is another story, though. I get nothing after "Starting new kernel".
> >>But, I wasn't expecting much. The NUMAQ is pretty weird hardware and god knows
> >>what is actually happening. I'll try it some more when I'm more confident in
> >>what I'm doing.
> > I suspect the hardware shutdown and start up logic for NUMAQ cpus needs some
> > special handling. Does kexec_test not print anything, or were you not patient
> > enough?
>
> Starting new kernel
> kexec_test 1.6 starting...
[snip successful run of kexec_test]
Hmm. So it looks like you can make BIOS calls on the NUMAQ machine.
It is worth a try to see if "kexec --real-mode bzImage ..." will start
up your kernel. Probably not but at least the basic mechanism of kexec
is working. I would be very surprised if you couldn't at least start
a uniprocessor kernel.
> I have a couple of ideas. But, first, is it hard to reconstruct the memory map?
From your kexec_test run, your memory map...
> E820 Memory Map.
> 000000000009FC00 @ 0000000000000000 type: 00000001
> 00000000EFF00000 @ 0000000000100000 type: 00000001
> 0000000000180000 @ 00000000FFE80000 type: 00000002
> 0000000000009000 @ 00000000FEC00000 type: 00000002
> 0000000100000000 @ 0000000100000000 type: 00000001
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The e820 memory map is printed out on boot up.
>
> Will all 1GB systems have the same memory map?
Most will have pretty much the same memory map, but in general two systems
with the same amount of RAM are not guaranteed to have identical memory maps.
> Do you have documentation of the
> format? I don't think that any of these qualify as the "right thing". But, as
> hacks, they should keep me happy for a bit.
>
> For now, I can write a quick script to fix it: --command-line="`memscript`"
>
> Until it is working a --hack-mem option might be a good idea
>
> Perhaps we could just save a copy off when the kernel loads for the first
> time. If we export it somewhere, the kexec executable can just copy it. For
> now, we can just printk it and paste it into each version of kexec that we
> compile.
Yep, essentially that is what needs to happen.
Eric
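As an illustration of reading the map printed by kexec_test, here is a small
sketch that parses the "E820 Memory Map" lines and totals the usable RAM. The
column meaning (length @ base, then type) is inferred from the values in the
output quoted in this thread rather than from kexec_test documentation, and
the function names are hypothetical.

```python
import re

E820_RAM = 1  # type 1 marks usable RAM in an e820 map

# One kexec_test map line, e.g.
#   "000000000009FC00 @ 0000000000000000 type: 00000001"
# Assumption: the first field is the length and the second is the base.
_MAP_LINE = re.compile(
    r"^([0-9A-Fa-f]{16}) @ ([0-9A-Fa-f]{16}) type: ([0-9A-Fa-f]{8})$")


def parse_kexec_test_e820(text):
    """Return a list of (base, length, type) tuples from kexec_test output."""
    entries = []
    for line in text.splitlines():
        m = _MAP_LINE.match(line.strip())
        if m:
            length, base, typ = (int(g, 16) for g in m.groups())
            entries.append((base, length, typ))
    return entries


def usable_bytes(entries):
    """Sum the lengths of all entries marked as usable RAM."""
    return sum(length for _base, length, typ in entries if typ == E820_RAM)


# The map from the kexec_test run on the NUMAQ, as quoted in this thread.
SAMPLE = """\
E820 Memory Map.
000000000009FC00 @ 0000000000000000 type: 00000001
00000000EFF00000 @ 0000000000100000 type: 00000001
0000000000180000 @ 00000000FFE80000 type: 00000002
0000000000009000 @ 00000000FEC00000 type: 00000002
0000000100000000 @ 0000000100000000 type: 00000001
"""

if __name__ == "__main__":
    entries = parse_kexec_test_e820(SAMPLE)
    print(f"{usable_bytes(entries) // (1024 * 1024)} MB usable")
```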
Linus Torvalds writes:
> On Tue, 19 Nov 2002, Dave Hansen wrote:
> >
> > I have a couple of ideas. But, first, is it hard to reconstruct the
> > memory map?
>
> Hmm.. You shouldn't need to reconstruct it. It's all there in the
>
> struct e820map e820;
>
> (yeah, we will have modified it to match the setup of the running kernel,
> but on the whole it should all be there, no?)
Yep. We just need to get that information out to user space.
Eric
On Tue, Nov 19, 2002 at 10:48:46AM -0700, Eric W. Biederman wrote:
> > struct e820map e820;
> >
> > (yeah, we will have modified it to match the setup of the running kernel,
> > but on the whole it should all be there, no?)
>
> Yep. We just need to get that information out to user space.
Arjan already did this..
http://www.kernelnewbies.org/kernels/rh80/SOURCES/linux-2.4.0-e820.patch
Dave
--
| Dave Jones. http://www.codemonkey.org.uk
| SuSE Labs
>> Just to make sure I understand the problem. Until we can make all
>> boot-time BIOS calls work, we need a way to:
>
> A small clarification. BIOS calls will never work 100%. Especially in the
> interesting cases like kexec on panic. So entering the kernel in
> 32-bit mode will continue to be the default mode of operation. This means the
> final solution to problems like this needs to be a good one.
Do we still have the mpstables and other such initdata around as well?
Or did we destroy those on boot? If we're going to do kexec on panic,
perhaps all these should be checksummed for corruption detection
eventually (not now).
M.
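A rough sketch of the corruption check suggested above: record a checksum for
each saved table at boot, then verify it before trusting the table on a kexec
after panic. CRC32 and the table names are assumptions chosen for
illustration; nothing in this thread specifies which checksum or which tables
would be used.

```python
import zlib


def snapshot(tables):
    """Record a CRC32 for each named table (mapping of name -> bytes)."""
    return {name: zlib.crc32(data) for name, data in tables.items()}


def corrupted(tables, checksums):
    """Return the sorted names whose current CRC32 differs from the snapshot."""
    return sorted(name for name, data in tables.items()
                  if zlib.crc32(data) != checksums.get(name))


if __name__ == "__main__":
    # Hypothetical stand-ins for the saved boot-time data.
    tables = {"e820": b"\x00" * 40, "mptable": b"_MP_" + b"\x00" * 12}
    saved = snapshot(tables)
    tables["mptable"] = b"XXXX" + b"\x00" * 12  # simulate a stray write
    print("corrupted tables:", corrupted(tables, saved))
```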
On Tue, 2002-11-19 at 09:34, Eric W. Biederman wrote:
> Andy Pfiffer writes:
>
> > On Tue, 2002-11-19 at 02:25, Eric W. Biederman wrote:
> > > > Complete kernel boot-up log attached below. I'm going to try to find my
> > > > other 576MB of RAM with the right command-line magic... ;^)
> > >
> > > Or you can write a routine to gather that information dynamically and send
> > > me a patch for /sbin/kexec. Though it may take another proc file to do
> > > that one properly.
> > >
> > > Eric
Hmmm...I seem to be having some trouble setting "mem=" (system hangs).
Maybe multiple "mem=NNNK@0xXXXXXXXX" options won't work.
While I try to figure out what's going on, here's a program ("kargs")
that composes a kernel command line from the contents of
"/proc/cmdline" and "/proc/iomem". It doesn't do as much error
checking as it should...
Usage (sh quoting): kexec --force "--command-line=`kargs`" bzImage
Andy
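The kargs source is not reproduced in this archive. The following is only a
sketch of the approach described above, assuming the "mem=NNNK@0xXXXXXXXX"
form mentioned in the message and a list of inclusive RAM ranges such as the
"System RAM" entries from /proc/iomem. The name compose_cmdline is
hypothetical, and, as Andy notes, it is not established that multiple "mem="
options are accepted.

```python
def compose_cmdline(base_cmdline, ram_ranges):
    """Build a kernel command line with one mem= option per RAM range.

    base_cmdline: the existing command line (e.g. contents of /proc/cmdline).
    ram_ranges:   iterable of (start, end_inclusive) physical address pairs.
    """
    # Drop any mem= options already present so they are not duplicated.
    args = [arg for arg in base_cmdline.split() if not arg.startswith("mem=")]
    for start, end in ram_ranges:
        size_kb = (end - start + 1) // 1024  # round down to whole kilobytes
        args.append(f"mem={size_kb}K@{start:#x}")
    return " ".join(args)


if __name__ == "__main__":
    # Ranges taken from the /proc/iomem listing quoted earlier in the thread.
    ranges = [(0x00000000, 0x0009dbff), (0x00100000, 0x27fed13f)]
    print(compose_cmdline("ro root=805 console=ttyS0,9600n8", ranges))
```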
On Mon, Nov 18, 2002 at 05:10:38PM -0800, Andy Pfiffer wrote:
> On Mon, 2002-11-18 at 00:53, Eric W. Biederman wrote:
> > kexec is a set of system calls that allows you to load another kernel
> > from the currently executing Linux kernel. The current implementation
> > has only been tested, and had the kinks worked out on x86, but the
> > generic code should work on any architecture.
>
> Great News, Eric. For the first time *ever* I got a kexec reboot to
> work on my most troublesome machine (see below).
Same here - preloading the new kernel and issuing kexec -e after
init 1 works on the troublesome SMP system I'd been writing to you
about earlier. Bootimg used to work on this setup, so bypassing the
BIOS calls had the expected effect.
If I issue the call earlier though, it runs into trouble with aic7xxx
reporting interrupts during setup. Guess you know why we are looking
at that case - we eventually need to be able to transition directly at dump
time without a chance to go through user-space shutdown ...
Regards
Suparna
>
> For those looking to replicate:
>
>
> 1. apply these two patches to 2.5.48 (bk Changeset 1.842)
> http://www.xmission.com/~ebiederm/files/kexec/linux-2.5.48.x86kexec.diff
> http://www.xmission.com/~ebiederm/files/kexec/linux-2.5.48.x86kexec-hwfixes.diff
>
> 2. compile this:
> http://www.xmission.com/~ebiederm/files/kexec/kexec-tools-1.7.tar.gz
>
> 3. my recipe for rebooting:
> a) I have a script that I execute by hand after "init 1" to unmount
> my filesystems and then remount / and /boot read-only.
> b) I have the kexec binary installed in /boot.
> c) ./kexec-1.7 --force --debug "--command-line=ro root=805
> console=ttyS0,9600n8" ./linux-2.5
>
> Thanks, Eric!
>
> Andy
>
> # ./kexec-1.7 --force --debug "--command-line=ro root=805 console=ttyS0,9600n8" ./linux-2.5
> FIXME assuming 64M of ram
> setup16_end: 00091b1f
> FIXME assuming 64M of ram
> Synchronizing SCSI caches:
> Shutting down devices
> Starting new kernel
> Linux version 2.5.48 (andyp@joe) (gcc version 2.95.3 20010315 (SuSE)) #1 Mon Nov 18 15:03:14 PST 2002
> Video mode to be used for restore is ffff
> BIOS-provided physical RAM map:
> BIOS-e820: 0000000000001000 - 000000000009ffff (usable)
> BIOS-e820: 0000000000100000 - 0000000003ffffff (usable)
> 63MB LOWMEM available.
> hm, page 00000000 reserved twice.
> On node 0 totalpages: 16383
> DMA zone: 4096 pages, LIFO batch:1
> Normal zone: 12287 pages, LIFO batch:2
> HighMem zone: 0 pages, LIFO batch:1
> IBM machine detected. Enabling interrupts during APM calls.
> IBM machine detected. Disabling SMBus accesses.
> Building zonelist for node : 0
> Kernel command line: ro root=805 console=ttyS0,9600n8
> Initializing CPU#0
> Detected 799.717 MHz processor.
> Console: colour VGA+ 80x25
> Calibrating delay loop... 1581.05 BogoMIPS
> Memory: 60868k/65532k available (2087k kernel code, 4204k reserved, 825k data, 304k init, 0k highmem)
> Security Scaffold v1.0.0 initialized
> Dentry cache hash table entries: 8192 (order: 4, 65536 bytes)
> Inode-cache hash table entries: 4096 (order: 3, 32768 bytes)
> Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
> -> /dev
> -> /dev/console
> -> /root
> CPU: L1 I cache: 16K, L1 D cache: 16K
> CPU: L2 cache: 256K
> Intel machine check architecture supported.
> Intel machine check reporting enabled on CPU#0.
> CPU: Intel Pentium III (Coppermine) stepping 0a
> Enabling fast FPU save and restore... done.
> Enabling unmasked SIMD FPU exception support... done.
> Checking 'hlt' instruction... OK.
> POSIX conformance testing by UNIFIX
> Linux NET4.0 for Linux 2.4
> Based upon Swansea University Computer Society NET3.039
> Initializing RT netlink socket
> mtrr: v2.0 (20020519)
> Linux Plug and Play Support v0.9 (c) Adam Belay
> PCI: PCI BIOS revision 2.10 entry at 0xfd5dc, last bus=1
> PCI: Using configuration type 1
> BIO: pool of 256 setup, 14Kb (56 bytes/bio)
> biovec pool[0]: 1 bvecs: 116 entries (12 bytes)
> biovec pool[1]: 4 bvecs: 116 entries (48 bytes)
> biovec pool[2]: 16 bvecs: 58 entries (192 bytes)
> biovec pool[3]: 64 bvecs: 29 entries (768 bytes)
> biovec pool[4]: 128 bvecs: 14 entries (1536 bytes)
> biovec pool[5]: 256 bvecs: 7 entries (3072 bytes)
> block request queues:
> 112 requests per read queue
> 112 requests per write queue
> 8 requests per batch
> enter congestion at 27
> exit congestion at 29
> isapnp: Scanning for PnP cards...
> isapnp: No Plug & Play device found
> drivers/usb/core/usb.c: registered new driver usbfs
> drivers/usb/core/usb.c: registered new driver hub
> PCI: Probing PCI hardware
> PCI: Probing PCI hardware (bus 00)
> PCI: Discovered peer bus 01
> Starting kswapd
> aio_setup: sizeof(struct page) = 40
> [c3fb2040] eventpoll: successfully initialized.
> Journalled Block Device driver loaded
> Installing knfsd (copyright (C) 1996 okir@monad.swb.de).
> udf: registering filesystem
> Capability LSM initialized
> Serial: 8250/16550 driver $Revision: 1.90 $ IRQ sharing disabled
> ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
> ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
> parport0: PC-style at 0x378 [PCSPP]
> pty: 256 Unix98 ptys configured
> lp0: using parport0 (polling).
> Linux agpgart interface v0.99 (c) Jeff Hartmann
> agpgart: Maximum main memory to use for agp memory: 27M
> agpgart: unable to determine aperture size.
> agpgart: Maximum main memory to use for agp memory: 27M
> agpgart: unable to determine aperture size.
> [drm] Initialized radeon 1.7.0 20020828 on minor 0
> Floppy drive(s): fd0 is 1.44M
> FDC 0 is a National Semiconductor PC87306
> Intel(R) PRO/100 Network Driver - version 2.1.24-k2
> Copyright (c) 2002 Intel Corporation
>
> e100: eth0: Intel(R) PRO/100+ Server Adapter (PILA8470B)
> Mem:0xfeb7f000 IRQ:11 Speed:0 Mbps Dx:N/A
> Hardware receive checksums enabled
> cpu cycle saver enabled
>
> Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
> ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
> hda: LG CD-ROM CRD-8484B, ATAPI CD/DVD-ROM drive
> ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
> hda: ATAPI 48X CD-ROM drive, 128kB Cache
> Uniform CD-ROM driver Revision: 3.12
> end_request: I/O error, dev hda, sector 0
> SCSI subsystem driver Revision: 1.00
> PCI: Enabling device 01:03.0 (0156 -> 0157)
> scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.4
> <Adaptec aic7892 Ultra160 SCSI adapter>
> aic7892: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs
>
> (scsi0:A:0): 160.000MB/s transfers (80.000MHz DT, offset 31, 16bit)
> Vendor: IBM-PSG Model: ST318436LC !# Rev: 3281
> Type: Direct-Access ANSI SCSI revision: 03
> (scsi0:A:1): 160.000MB/s transfers (80.000MHz DT, offset 31, 16bit)
> Vendor: IBM-PSG Model: ST318436LC !# Rev: 3281
> Type: Direct-Access ANSI SCSI revision: 03
> Vendor: IBM Model: YGLv3 S2 Rev: 0
> Type: Processor ANSI SCSI revision: 02
> scsi0:A:0:0: Tagged Queuing enabled. Depth 64
> SCSI device sda: drive cache: write through
> SCSI device sda: 35548320 512-byte hdwr sectors (18201 MB)
> sda: sda1 sda2 < sda5 sda6 sda7 sda8 sda9 sda10 >
> Attached scsi disk sda at scsi0, channel 0, id 0, lun 0
> scsi0:A:1:0: Tagged Queuing enabled. Depth 64
> SCSI device sdb: drive cache: write through
> SCSI device sdb: 35548320 512-byte hdwr sectors (18201 MB)
> sdb: sdb1
> Attached scsi disk sdb at scsi0, channel 0, id 1, lun 0
> Attached scsi generic sg2 at scsi0, channel 0, id 8, lun 0, type 3
> Initializing USB Mass Storage driver...
> drivers/usb/core/usb.c: registered new driver usb-storage
> USB Mass Storage support registered.
> mice: PS/2 mouse device common for all mice
> input: ImPS/2 Generic Wheel Mouse on isa0060/serio1
> serio: i8042 AUX port at 0x60,0x64 irq 12
> input: AT Set 2 keyboard on isa0060/serio0
> serio: i8042 KBD port at 0x60,0x64 irq 1
> Advanced Linux Sound Architecture Driver Version 0.9.0rc5 (Sun Nov 10 19:48:18 2002 UTC).
> request_module[snd-card-0]: not ready
> request_module[snd-card-1]: not ready
> request_module[snd-card-2]: not ready
> request_module[snd-card-3]: not ready
> request_module[snd-card-4]: not ready
> request_module[snd-card-5]: not ready
> request_module[snd-card-6]: not ready
> request_module[snd-card-7]: not ready
> ALSA device list:
> No soundcards found.
> NET4: Linux TCP/IP 1.0 for NET4.0
> IP: routing cache hash table of 512 buckets, 4Kbytes
> TCP: Hash tables configured (established 4096 bind 4096)
> NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
> kjournald starting. Commit interval 5 seconds
> EXT3-fs: mounted filesystem with ordered data mode.
> VFS: Mounted root (ext3 filesystem) readonly.
> Freeing unused kernel memory: 304k freed
> INIT: version 2.82 booting
> Running /etc/init.d/boot
> Mounting /proc device done
> Mounting /dev/ptsblogd: console=/dev/console, stdin=/dev/console, must differ, boot logging disabled
> showconsole: Warning: the ioctl TIOCGDEV is not known by the kerAdding 530104k swap on /dev/sda6. Priority:42 extents:1
> nel
> Activating swap-devices in /etc/fstab... done
> showconsole: Warning: the ioctl TIOCGDEV is not known by the kernel
> Checking file systems...
> fsck 1.26 (3-Feb-2002)
> /dev/sda5: clean, 16935/66264 files, 104836/265041 blocks
> /dev/sda1: clean, 55/10040 files, 24115/40131 blocks
> /dev/sdb1: clean, 11/2223872 files, 78008/4441964 blocks
> /dev/sda10: clean, 523256/1198208 files, 2052639/2393677 blocks
> /dev/sda9: clean, 51895/263296 files, 310582/526120 blocks
> /dev/sda8: clean, 140195/525888 files, 590977/1050241 blocks
> /dev/sda7: clean, EXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,5), 2747/131616 fileinternal journal
> s, 111363/263056 blocks done
> Setting up /lib/modules/2.5.48 failed
> Mounting local file systems...
> kjournald starting. Commit interval 5 seconds
> proc on /proc tyEXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,17), pe proc (rw)
> deinternal journal
> vpts on /dev/ptsEXT3-fs: mounted filesystem with ordered data mode.
> type devpts (rw,mode=0620,gid=5)
> /dev/sdb1 on /2nd type ext3 (kjournald starting. Commit interval 5 seconds
> rw)
> /dev/sda1 oEXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,10), n /boot type extinternal journal
> 2 (rw)
> EXT3-fs: mounted filesystem with ordered data mode.
> /dev/sda10 on /home type ext3 (rw)
> kjournald starting. Commit interval 5 seconds
> EXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,9), internal journal
> EXT3-fs: mounted filesystem with ordered data mode.
> /dev/sda9 on /opt type ext3 (rw)
> kjournald starting. Commit interval 5 seconds
> EXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,8), internal journal
> EXT3-fs: mounted filesystem with ordered data mode.
> /dev/sda8 on /usr type ext3 (rw)
> kjournald starting. Commit interval 5 seconds
> EXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,7), internal journal
> EXT3-fs: mounted filesystem with ordered data mode.
> /dev/sda7 on /var type ext3 (rw) done
> Restore device permissions done
> Activating remaining swap-devices in /etc/fstab... done
> Setting up the CMOS clock done
> Setting up timezone data done
> Configuring serial ports...
> ttyS0 at 0x03f8 (irq = 4) is a 16550A
> ttyS1 at 0x02f8 (irq = 3) is a 16550A
> Configured serial ports done
> Setting up hostname 'joe' done
> Setting up loopback interface done
> Creating /var/log/boot.msg done
> showconsole: Warning: the ioctl TIOCGDEV is not known by the kernel
> INIT: Entering runlevel: 5
> blogd: console=/dev/console, stdin=/dev/console, must differ, boot logging disabled
> Master Resource Control: previous runlevel: N, switching to runlevel:5
> Starting personal-firewall (initial) [not active] unused
> Initializing random number generator done
> Setting up network interfaces:
> lo done
> eth0 (DHCP) IP address: 172.20.1.38 done
> Starting syslog services done
> Starting hotplugging services [ net pci usb ] failed
> Starting hardware scan on boote100: eth0 NIC Link is Up 100 Mbps Full duplex
> done
> Starting RPC portmap daemon done
> Starting SSH daemon done
> Starting sound driver: already running done
> Starting service at daemon done
> Initializing SMTP port (sendmail) done
> Loading keymap qwerty/us.map.gz done
> Loading compose table winkeys shiftctrl latin1.add done
> Loading console font lat1-16.psfu done
> Loading screenmap none done
> Setting up console ttys done
> Starting service kdm done
> Starting CRON daemon done
> Starting Name Service Cache Daemon done
> Starting inetd done
> Starting personal-firewall (final) [not active] unused
> Master Resource Control: runlevel 5 has been reached
> Failed services in runlevel 5: hotplug
> Skipped services in runlevel 5: personal-firewall.initial splash personal-firewall.final
>
--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Labs, India
Suparna Bhattacharya <[email protected]> writes:
> On Mon, Nov 18, 2002 at 05:10:38PM -0800, Andy Pfiffer wrote:
> > On Mon, 2002-11-18 at 00:53, Eric W. Biederman wrote:
> > > kexec is a set of systems call that allows you to load another kernel
> > > from the currently executing Linux kernel. The current implementation
> > > has only been tested, and had the kinks worked out on x86, but the
> > > generic code should work on any architecture.
> >
> > Great News, Eric. For the first time *ever* I got a kexec reboot to
> > work on my most troublesome machine (see below).
>
> Same here - preloading the new kernel and issuing kexec -e after
> init 1 works on the troublesome SMP system I'd earlier been sending
> you earlier. Bootimg used to work on this setup, so bypassing the
> bios calls had the expected effect.
>
> If I issue the call earlier though, it runs into trouble with aic7xxx
> reporting interrupts during setup. Guess you know why we are looking
> at that case - eventually need to be able to transition directly at dump
> time without a chance to go through user-space shutdown ...
The needed hooks are there. You can make certain an appropriate
->shutdown()/reboot_notifier method is present, or you can fix the driver
so it can initialize the device from any random state.
I really don't know what kinds of failures you hope to recover
from with the kexec on panic code, so I really can't comment on
how well things will work. There will always be a set of failures
that are non-recoverable, but that doesn't mean there isn't a useful
subset. Anyway there is certainly plenty of material for you to
experiment with and see what works usefully in practice.
Eric
"Martin J. Bligh" <[email protected]> writes:
> >> Just to make sure I understand the problem. Until we can make all
> >> boot-time BIOS calls work, we need a way to:
> >
> > A small clarification. BIOS calls will never work 100%. Especially in the
> > interesting cases like kexec on panic. So entering the kernel in
> > 32bit mode will continue to be the default mode of. This means the
> > final solution to problems like this needs to be a good one.
>
> Do we still have the mpstables and other such initdata around as well?
The kernel explicitly preserves the MP tables, and all of the other
tables we pick up after we are in 32-bit mode, and leaves them right
where they are. There is no need to do anything to convey them to the
next kernel, as pointers to them are in well-known locations.
Eric
On Fri, Nov 15, 2002 at 11:37:07AM -0300, Werner Almesberger wrote:
> Suparna Bhattacharya wrote:
> > What would be best way to pass a parameter or address from the
> > current kernel to kernel being booted (e.g log buffer address
> > or crash dump buffer etc) ?
>
> At the moment, perhaps the initrd mechanism might be a useful
> interface for this. You'd just leave some space either at the
> beginning or at the end of the real initrd (if there's one),
> and put your data there.
>
> Afterwards, you can extract it either from the kernel, or even
> from user space through /dev/initrd (with "noinitrd")
>
> Advantages:
> - fairly non-intrusive
> - (almost ?) all platforms support this way of handling "some
> object in memory"
> - easy to play with from user space
>
> Drawbacks:
> - needs synchronization with existing uses of initrd
> - a bit hackish
>
> I'd expect that there will be eventually a number of things that
> get passed from old to new kernels (e.g. crash data, device scan
> results, etc.), so it may be useful to delay designing a "clean"
> interface (for this, I expect some TLV structure in the initrd
> area would make most sense) until more of those things have
> shown up.
Yes indeed. At the moment however I was just looking at something
as simple as a single (or more) parameter to pass from an old
kernel to the new one. That parameter could be a scalar value/
variable or denote the address of a control block, or something
requiring more complicated interpretation like you mention.
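To make the TLV idea Werner mentions a bit more concrete, here is a minimal user-space sketch of how type/length/value records could be packed into, and read back from, a spare memory area. Everything in it is an assumption made for illustration: the header layout, the tag values (TLV_LOG_BUF_ADDR, TLV_CRASH_HDR_ADDR) and the function names are hypothetical, and nothing like this exists in kexec or the initrd code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record header: a tag and a payload length, followed by
 * the payload. Host byte order is used, since the old and new kernel
 * run on the same machine. */
struct tlv_hdr {
    uint32_t type;
    uint32_t len;   /* number of payload bytes after the header */
};

/* Made-up tags for the kinds of values discussed in this thread. */
enum { TLV_LOG_BUF_ADDR = 1, TLV_CRASH_HDR_ADDR = 2 };

/* Append one record at offset *off. Returns 0 on success, -1 if the
 * record does not fit into the remaining space. */
static int tlv_put(uint8_t *buf, size_t cap, size_t *off,
                   uint32_t type, const void *val, uint32_t len)
{
    struct tlv_hdr h = { type, len };

    if (*off > cap || cap - *off < sizeof h + (size_t)len)
        return -1;
    memcpy(buf + *off, &h, sizeof h);
    if (len)
        memcpy(buf + *off + sizeof h, val, len);
    *off += sizeof h + len;
    return 0;
}

/* Find the first record with the given tag in the first 'used' bytes.
 * Returns a pointer to its payload, or NULL if it is absent or the
 * area is truncated. */
static const void *tlv_find(const uint8_t *buf, size_t used,
                            uint32_t type, uint32_t *len_out)
{
    size_t off = 0;

    while (used - off >= sizeof(struct tlv_hdr)) {
        struct tlv_hdr h;

        memcpy(&h, buf + off, sizeof h);
        off += sizeof h;
        if (h.len > used - off)
            return NULL;            /* truncated or corrupt record */
        if (h.type == type) {
            *len_out = h.len;
            return buf + off;
        }
        off += h.len;
    }
    return NULL;
}
```

The old kernel (or the user-space loader) would call tlv_put() once per value, and the new kernel would call tlv_find() with the tag it cares about. Unknown tags are skipped, which is what would let new kinds of data be added later without breaking older readers.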
If the parameter is a pointer to an address block, then right now the
code to put it in a place that doesn't get overwritten when the
new kernel loads is left as the responsibility of the caller.
Designing a generic and clean interface for that would require
more thought and is best delayed a bit till we understand all the
needs better. Mcore for example (as you probably know already)
passes a map of affected pages to the new kernel and during early
bootmem initialization those pages (from the previous boot) are
marked as reserved, instead of moving them to a contiguous memory
area. It's just the start of the map (crash header) that's still
passed in at a fixed location (or rather, it's relative to the end of
the current image), and I was looking at a nice way to avoid that.
One way of course is to add a kernel parameter(s) and set this
through user-space (after extracting it from the
kernel .. possibly via kmem) when loading the image (kexec tools
does all the work of filling up the parameter block). Probably
that's what was intended.
Eric, is that correct? BTW, did you have an option (or plan
to add one) in kexec tools to use the current kernel's parameters
and append additional options to it ?
Regards
Suparna
>
> - Werner
>
> --
> _________________________________________________________________________
> / Werner Almesberger, Buenos Aires, Argentina [email protected] /
> /_http://www.almesberger.net/____________________________________________/
--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Labs, India
On Wed, Nov 20, 2002 at 02:17:04AM -0700, Eric W. Biederman wrote:
> Suparna Bhattacharya <[email protected]> writes:
>
> > On Mon, Nov 18, 2002 at 05:10:38PM -0800, Andy Pfiffer wrote:
> > > On Mon, 2002-11-18 at 00:53, Eric W. Biederman wrote:
> > > > kexec is a set of systems call that allows you to load another kernel
> > > > from the currently executing Linux kernel. The current implementation
> > > > has only been tested, and had the kinks worked out on x86, but the
> > > > generic code should work on any architecture.
> > >
> > > Great News, Eric. For the first time *ever* I got a kexec reboot to
> > > work on my most troublesome machine (see below).
> >
> > Same here - preloading the new kernel and issuing kexec -e after
> > init 1 works on the troublesome SMP system I'd earlier been sending
> > you earlier. Bootimg used to work on this setup, so bypassing the
> > bios calls had the expected effect.
> >
> > If I issue the call earlier though, it runs into trouble with aic7xxx
> > reporting interrupts during setup. Guess you know why we are looking
> > at that case - eventually need to be able to transition directly at dump
> > time without a chance to go through user-space shutdown ...
>
> The needed hooks are there. You can make certain an appropriate
> ->shutdown()/reboot_notifier method is present, or you can fix the driver
> so it can initialize the device from any random state.
>
> I really don't know what kinds of failures you hope to recover
> from with the kexec on panic code, so I really can't comment on
> how well things will work. There will always be a set of failures
> that are non-recoverable, but that doesn't mean there isn't a useful
I agree. If we can get as far with this as the situations in which
mcore with bootimg worked (though we never did try that on 2.5, and
I am not sure if it was using the current aic7xxx driver), that
would be a lot - handling of more difficult cases can
happen bit by bit after that. Whatever can be covered is useful even
if it doesn't address all kinds of troublesome situations.
> subset. Anyway there is certainly plenty of material for you to
> experiment with and see what works usefully in practice.
Yes there is, thanks :)
Regards
Suparna
>
> Eric
>
--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Labs, India
Eric W. Biederman wrote:
> The needed hooks are there. You can make certain an appropriate
> ->shutdown()/reboot_notifier method is present, or you can fix the driver
> so it can initialize the device from any random state.
In the case of a crash, you may not be able to use the normal
shutdown, but there may still be pending bus master accesses, e.g.
from an on-going transfer, or free buffers that will eventually
(i.e. there's no use in "waiting for the operation to finish") get
used.
Initializing the device from any state is certainly a good feature,
and it will cure the most visible symptoms, but problems may still
occur if the device decides to scribble over memory after leaving
the original kernel, and before the reset has occurred under the
new kernel. (Or did you mean to initialize before invoking kexec ?)
I see several possible approaches for this:
0) do as bootimg did, and ignore the problem :-)
1) try to call the regular device shutdown. In the case of a
crash, this may hang, or corrupt the system further.
2) add a new callback that just silences the device, without
trying to clean things up. This is probably the best
long-term solution.
3) if there's a way to just reset some or all devices on the
PCI bus without knowing what they are, this should have the
desired effect, while being relatively easy to implement.
(This probably still leaves things like AGP, multi-level PCI
bus structures, non-PCI, etc.)
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/
Werner Almesberger <[email protected]> writes:
> Eric W. Biederman wrote:
> > The needed hooks are there. You can make certain an appropriate
> > ->shutdown()/reboot_notifier method is present, or you can fix the driver
> > so it can initialize the device from any random state.
>
> In the case of a crash, you may not be able to use the normal
> shutdown, but there may still be pending bus master accesses, e.g.
> from an on-going transfer, or free buffers that will eventually
> (i.e. there's no use in "waiting for the operation to finish") get
> used.
>
> Initializing the device from any state is certainly a good feature,
> and it will cure the most visible symptoms, but problems may still
> occur if the device decides to scribble over memory after leaving
> the original kernel, and before the reset has occurred under the
> new kernel. (Or did you mean to initialize before invoking kexec ?
In this case I suspect the best route is to locate the kexec_on_panic
buffers for kexec where we want to use them. Then, in most cases where
a device is scribbling on memory, unless the device was
improperly set up, it isn't scribbling on memory necessary to get
the new kernel going.
> I see several possible approaches for this:
>
> 0) do as bootimg did, and ignore the problem :-)
> 1) try to call the regular device shutdown. In the case of a
> crash, this may hang, or corrupt the system further.
> 2) add a new callback that just silences the device, without
> trying to clean things up. This is probably the best
> long-term solution.
Roughly, that is ->shutdown(). It was separated from the ->remove()
case so that it could be stripped down to a minimal implementation.
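As a rough illustration of that split, here is a user-space toy model, not kernel code. The struct and function names (toy_dev, toy_driver, quiesce_all) are invented for this sketch; the only thing taken from the discussion is the idea that a shutdown callback merely silences the hardware, while a remove callback also tears down host-side state.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a bus-mastering device. */
struct toy_dev {
    int dma_active;       /* device may still write into memory */
    int irq_enabled;      /* device may still raise interrupts */
    int buffers_alloced;  /* host-side resources owned by the driver */
};

/* Hypothetical driver operations, loosely mirroring the two callbacks
 * discussed above. */
struct toy_driver {
    void (*remove)(struct toy_dev *);    /* full teardown */
    void (*shutdown)(struct toy_dev *);  /* only quiesce the hardware */
};

/* Minimal path: stop DMA and interrupts, touch nothing else. */
static void toy_shutdown(struct toy_dev *d)
{
    d->dma_active = 0;
    d->irq_enabled = 0;
}

/* Full path: quiesce first, then release host-side resources. */
static void toy_remove(struct toy_dev *d)
{
    toy_shutdown(d);
    d->buffers_alloced = 0;
}

static const struct toy_driver toy_drv = { toy_remove, toy_shutdown };

/* What a pass just before jumping to the new kernel might do:
 * silence every device, most recently added first. */
static void quiesce_all(const struct toy_driver *drv,
                        struct toy_dev *devs, size_t n)
{
    while (n-- > 0)
        if (drv->shutdown)
            drv->shutdown(&devs[n]);
}
```

After quiesce_all() no device in the array is doing DMA or raising interrupts, but buffers_alloced is left untouched, which is the point: the less the quiesce path does, the more likely it is to be usable when the old kernel is already in trouble.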
Eric
Suparna Bhattacharya <[email protected]> writes:
> Yes indeed. At the moment however I was just looking at something
> as simple as a single (or more) parameter to pass from an old
> kernel to the new one.
Currently we pass all kinds of parameters, the e820 memory map being
one of the significant ones. Though the arch specific locations are
not generally the best ones to use.
> That parameter could be a scalar value/
> variable or denote the address of a control block, or something
> requiring more complicated interpretation like you mention.
> If the parameter is a pointer to an address block right now the
> code to put it in a place that doesn't get overwritten when the
> new kernel loads is left as the responsibility of the caller.
> Designing a generic and clean interface for that would require
> more thought and is best delayed a bit till we understand all the
> needs better. Mcore for example (as you probably know already)
> passes a map of affected pages to the new kernel and during early
> bootmem initialization those pages (from the previous boot) are
> marked as reserved, instead of moving them to a contiguous memory
> area. Its just the start of the map (crash header) that's still
> passed in as a fixed location (rather its relative to the end of
> the current image) and I was looking at a nice way to avoid that.
When you can do it, passing tables at a fixed or relatively fixed
address is a powerful way to do things, at least when they are
supposed to have a long lifetime. I'm not quite certain about
a temporary solution.
> One way of course is to add a kernel parameter(s) and set this
> through user-space (after extracting it from the
> kernel .. possibly via kmem) when loading the image (kexec tools
> does all the work of filling up the parameter block). Probably
> that's what was intended.
>
> Eric, Is that correct ?
Yes. Getting the information down to user space and then putting
it in the kernel is a reasonable thing to do.
> BTW, did you have an option (or plan
> to add one) in kexec tools to use the current kernel's parameters
> and append additional options to it ?
For command line arguments that is trivial:
--command-line="`cat /proc/cmdline` extra arguments"
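For anyone who wants to do the same from a program rather than the shell, the sketch below joins the running kernel's command line with extra arguments. The function names build_cmdline and read_proc_cmdline are made up for this example and are not part of kexec-tools; the only real interface used is the /proc/cmdline file.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Join the current command line and the extra arguments with a single
 * space, dropping the trailing newline that /proc/cmdline carries.
 * Returns 0 on success, -1 if the result does not fit into out. */
static int build_cmdline(const char *current, const char *extra,
                         char *out, size_t cap)
{
    size_t n = strlen(current);
    int w;

    while (n > 0 && (current[n - 1] == '\n' || current[n - 1] == ' '))
        n--;
    w = snprintf(out, cap, "%.*s%s%s", (int)n, current,
                 (n && *extra) ? " " : "", extra);
    return (w < 0 || (size_t)w >= cap) ? -1 : 0;
}

/* Read /proc/cmdline into buf as a NUL-terminated string.
 * Returns 0 on success, -1 if the file cannot be opened. */
static int read_proc_cmdline(char *buf, size_t cap)
{
    FILE *f;
    size_t n;

    if (cap == 0)
        return -1;
    f = fopen("/proc/cmdline", "r");
    if (!f)
        return -1;
    n = fread(buf, 1, cap - 1, f);
    fclose(f);
    buf[n] = '\0';
    return 0;
}
```

A caller would read the current line with read_proc_cmdline(), pass it through build_cmdline() together with the extra arguments, and hand the result to kexec via --command-line.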
For the rest it would require a little more work, as all of the
kernel's current parameters are not currently preserved. But my basic
take is that I would rather derive/create the parameters to the new
kernel than just copy them from some fixed location. Then passing
the current values just becomes a matter of policy, which the user can
control.
For me it is important to be able to boot new kernels, and things
other than linux. And especially in those cases the policy needs to
be driven from user space, as there is no real standardization of
parameters or what can be passed. Nor is there much desire among
the various kernel authors, and bootloader authors to come up with a
standard format they all can use. A good proposal with an unchanging
story and years of history behind it may eventually change some
minds, but I'm not holding my breath.
So beyond what functionality is currently there, I am not really
enthusiastic about optimizing the "do what I just did" case. For me
that is not an especially interesting case.
Eric
kexec-tools-1.8 is now available at:
http://www.xmission.com/~ebiederm/files/kexec/kexec-tools-1.8.tar.gz
Dave Hansen has a patch that allows /proc/iomem to export resources
above 4GB which is needed on machines with > 4GB of RAM.
Changes:
- /proc/iomem is now parsed so the new kernel's memory map should be correct.
- initrds are now actually read into memory so they should work, as well.
That should make kexec quite useable.
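For readers who have not looked at /proc/iomem, the sketch below shows one way its "start-end : name" lines could be parsed with 64-bit addresses, so that ranges above 4GB survive. This is not the code in kexec-tools; the struct and function names (mem_range, parse_iomem_line, total_system_ram) are assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One resource range; 64-bit fields so ranges above 4GB fit. */
struct mem_range {
    unsigned long long start, end;   /* both inclusive */
    char name[64];
};

/* Parse a line of the form "start-end : name" (hex addresses,
 * optional leading spaces). Returns 1 on success, 0 otherwise. */
static int parse_iomem_line(const char *line, struct mem_range *r)
{
    return sscanf(line, " %llx-%llx : %63[^\n]",
                  &r->start, &r->end, r->name) == 3;
}

/* Add up the sizes of all top-level "System RAM" ranges in a stream
 * that uses the /proc/iomem format. Indented lines are nested
 * resources and are skipped. */
static unsigned long long total_system_ram(FILE *f)
{
    char line[256];
    struct mem_range r;
    unsigned long long total = 0;

    while (fgets(line, sizeof line, f)) {
        if (line[0] == ' ')
            continue;
        if (parse_iomem_line(line, &r) &&
            strcmp(r.name, "System RAM") == 0)
            total += r.end - r.start + 1;
    }
    return total;
}
```

Opening /proc/iomem with fopen() and passing the stream to total_system_ram() gives the amount of RAM the running kernel reports. A loader would keep the individual ranges rather than the sum in order to build the memory map for the next kernel.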
The syscall:
http://www.xmission.com/~ebiederm/files/kexec/linux-2.5.48.x86kexec.diff
and the fixes
http://www.xmission.com/~ebiederm/files/kexec/linux-2.5.48.x86kexec-hwfixes.diff
continue to apply to 2.5.50 so I have not updated them.
The archive is at:
http://www.xmission.com/~ebiederm/files/kexec/
My apologies for not getting this out sooner. Along with the holidays I have been
battling a cold...
Eric
It booted on my first try, even with the 64-bit /proc/iomem changes.
I tried it on machines with 16GB and 1GB of RAM. (insert clapping here)
--
Dave Hansen
[email protected]
Dave Hansen <[email protected]> writes:
> It booted on my first try, even with the 64-bit /proc/iomem changes. I tried it
> on machines with 16GB and 1GB of RAM. (insert clapping here)
Thanks. The code for reading /proc/iomem was modeled after
Andy Pfiffer's work and your earlier patch. I just cleaned them
up and integrated them cleanly with my existing code base.
I guess that means I should shake off the bit rot and resubmit
to Linus.
Eric
I got around to trying it on a NUMA-Q again. It makes it well into
the kernel this time. I've been getting some strange CPU numbering
problems, but that was happening to a lesser extent before I threw
kexec in there.
Right now it's dying in the memory allocator, but that is probably
just something that didn't get initialized right, or some cross-quad
memory that isn't set up right.
I would really like to see this go into 2.5. The fact that it gets
this far on something as exotic as a NUMA-Q is a tribute to its
maturity.
--
Dave Hansen
[email protected]
On 2002-12-02 at 04:41:34, Eric wrote:
>kexec-tools-1.8 is now available at :
>http://www.xmission.com/~ebiederm/files/kexec/kexec-tools-1.8.tar.gz
I can't use the kexec program in the package to load a bzImage file. The
following simple patch makes it work.
diff -ru kexec-tools-1.8-orig/kexec/kexec.c kexec-tools-1.8/kexec/kexec.c
--- kexec-tools-1.8-orig/kexec/kexec.c Mon Jan 13 11:21:28 2003
+++ kexec-tools-1.8/kexec/kexec.c Mon Jan 13 11:21:50 2003
@@ -159,7 +159,7 @@
}
for(i = 0; i < file_types; i++) {
if (type && (strcmp(type, file_type[i].name) != 0)) {
- break;
+ continue;
}
if (file_type[i].probe(fp_kernel) > 0) {
break;
Michael
Not speaking for Intel, opinions are my own.
"Fu, Michael" <[email protected]> writes:
> On 2002-12-02 at 04:41:34, Eric wrote:
> >kexec-tools-1.8 is now available at :
> >http://www.xmission.com/~ebiederm/files/kexec/kexec-tools-1.8.tar.gz
>
> I can't use the kexec program in the package to load a bzImage file. The
> following simple patch make it work.
Thanks. I rarely force the type, so it looks like this bug slipped by.
Eric
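To spell out why the one-word change in the patch above matters only when the type is forced, here is a small stand-alone sketch of the same loop shape. The table entries and the probe functions are toy placeholders (a real bzImage probe checks the boot header, not the first bytes of the file), and find_type is a name made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct file_type {
    const char *name;
    int (*probe)(const char *image);   /* > 0 when the image matches */
};

/* Toy probes that only look at the first bytes of the buffer. */
static int probe_elf(const char *img)
{
    return strncmp(img, "\177ELF", 4) == 0;
}

static int probe_bzimage(const char *img)
{
    return strncmp(img, "bzImage", 7) == 0;
}

/* Placeholder type table; names are illustrative only. */
static const struct file_type file_type[] = {
    { "elf",     probe_elf },
    { "bzImage", probe_bzimage },
};
#define FILE_TYPES (sizeof file_type / sizeof file_type[0])

/* Return the index of the matching type, or -1 if none matches.
 * 'type' may be NULL, meaning "try every type in turn". */
static int find_type(const char *type, const char *image)
{
    size_t i;

    for (i = 0; i < FILE_TYPES; i++) {
        if (type && strcmp(type, file_type[i].name) != 0)
            continue;   /* a "break" here would leave the loop at the
                         * first name mismatch, so a forced type that
                         * is not entry 0 would never be probed */
        if (file_type[i].probe(image) > 0)
            return (int)i;
    }
    return -1;
}
```

With type set to NULL the name check is never true, so "break" and "continue" behave the same and every entry is probed. Once a type is forced, only "continue" lets the loop reach an entry that is not first in the table.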