Hi Andrew,
The patches that follow contain the kexec based crash dumping implementation.
Based on feedback received last time, we have made several changes. Some of
them are:
- The dumping kernel now boots from a non-default location. This is possible
due to Eric's patch which allows i386 kernels to boot from a non-default
location. This change means that we need two different kernels to get this
setup. The documentation patch has complete details on how to do this.
- We can now choose whether or not to dump from panic. The documentation
patch has details on this as well.
- The linear view is now called oldmem.
- Changes as per the code review comments from the previous posting.
The patches correspond to 2.6.9-rc1-mm5.
Kindly review these patches and let me know your thoughts.
Regards, Hari
--
Hariprasad Nellitheertha
Linux Technology Center
India Software Labs
IBM India, Bangalore
Regards, Hari
--
Hariprasad Nellitheertha
Linux Technology Center
India Software Labs
IBM India, Bangalore
Regards, Hari
--
Hariprasad Nellitheertha
Linux Technology Center
India Software Labs
IBM India, Bangalore
Regards, Hari
--
Hariprasad Nellitheertha
Linux Technology Center
India Software Labs
IBM India, Bangalore
Regards, Hari
--
Hariprasad Nellitheertha
Linux Technology Center
India Software Labs
IBM India, Bangalore
Regards, Hari
--
Hariprasad Nellitheertha
Linux Technology Center
India Software Labs
IBM India, Bangalore
Regards, Hari
--
Hariprasad Nellitheertha
Linux Technology Center
India Software Labs
IBM India, Bangalore
Hariprasad Nellitheertha <[email protected]> writes:
> Hi Andrew,
>
> The patches that follow contain the kexec based crash dumping implementation.
> Based on feedback received last time, we have made several changes. Some of
> them are:
>
> - The dumping kernel now boots from a non-default location. This is possible
> due to Eric's patch which allows i386 kernels to boot from a non-default
> location. This change means that we need two different kernels to get this
> setup. The documentation patch has complete details on how to do this.
Cool. I am glad you could get this going this should make things
easier.
> - We can now choose whether or not to dump from panic. The documentation
> patch has details on this as well.
> - The linear view is now called oldmem.
> - Changes as per the code review comments from the previous posting.
>
> The patches correspond to 2.6.9-rc1-mm5.
>
> Kindly review these patches and let me know your thoughts.
I will start with my standard objections about it still being
incompatible with other uses of kexec.
More later when I get a chance. Things seem to slowly be going
in the right direction.
Eric
I'll add these to -mm. Minor things:
> +config BACKUP_BASE
> + int "location of the crash dumps backup region"
> + depends on CRASH_DUMP
> + default 16
> + help
> + This is the location where the second kernel will boot from.
> +
> +config BACKUP_SIZE
> + int "Size of the crash dumps backup region"
> + depends on CRASH_DUMP
> + range 16 64
> + default 32
> + help
> + The size of the second kernel's memory.
You should tell the user the units of the parameter. So
"location of the crash dumps backup region (MB)"
would be nice.
> +++ linux-2.6.9-rc1-hari/include/asm-i386/crash_dump.h 2004-09-15 17:36:30.000000000 +0530
> @@ -0,0 +1,44 @@
> +/* asm-i386/crash_dump.h */
> +#include <linux/bootmem.h>
> +
> +extern unsigned int dump_enabled;
> +
> +void __crash_relocate_mem(unsigned long, unsigned long);
> +unsigned long __init find_max_low_pfn(void);
> +void __init find_max_pfn(void);
> +
> +extern unsigned int crashed;
Should the above declarations be inside CONFIG_CRASH_DUMP? We don't want
to leave them in scope if they don't exist.
> +static inline void crash_machine_kexec(void)
> +{
> + struct kimage *image;
> +
> + if ((!crash_dump_on) || (crashed))
> + return;
> +
> + image = xchg(&kexec_image, 0);
> + if (image) {
> + crashed = 1;
> + device_shutdown();
> + printk(KERN_EMERG "kexec: opening parachute\n");
> + machine_kexec(image);
> + while (1);
> + } else {
> + printk(KERN_EMERG "kexec: No kernel image loaded!\n");
> + }
> +}
I don't see why this is inlined??
> +#ifdef CONFIG_CRASH_DUMP
> +/*
> + * Enable kexec reboot upon panic; for dumping
> + */
> +static ssize_t write_crash_dump_on(struct file *file, const char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + if (count) {
> + if (get_user(crash_dump_on, buf))
> + return -EFAULT;
> + }
> + return count;
> +}
> +
> +static struct file_operations proc_crash_dump_on_operations = {
> + .write = write_crash_dump_on,
> +};
> +#endif
> +
> struct proc_dir_entry *proc_root_kcore;
>
> static void create_seq_entry(char *name, mode_t mode, struct file_operations *f)
> @@ -663,6 +683,11 @@ void __init proc_misc_init(void)
> if (entry)
> entry->proc_fops = &proc_sysrq_trigger_operations;
> #endif
> +#ifdef CONFIG_CRASH_DUMP
> + entry = create_proc_entry("kexec-dump", S_IWUSR, NULL);
> + if (entry)
> + entry->proc_fops = &proc_crash_dump_on_operations;
> +#endif
> #ifdef CONFIG_LOCKMETER
> entry = create_proc_entry("lockmeter", S_IWUSR | S_IRUGO, NULL);
> if (entry) {
Please do all the above in a crashdump-specific file, not inside proc_misc.c
Hariprasad Nellitheertha <[email protected]> wrote:
>
> +/*
> + * Copy a page from "oldmem". For this page, there is no pte mapped
> + * in the current kernel. We stitch up a pte, similar to kmap_atomic.
> + */
> +static inline ssize_t copy_oldmem_page(unsigned long pfn,
> + char *buf, size_t csize, int userbuf)
Again, why inline this function?
Hariprasad Nellitheertha <[email protected]> wrote:
>
> +/*
> + * Reads from the oldmem device from given offset till
> + * given count
> + */
> +static ssize_t read_from_oldmem(char *buf, size_t count,
> + loff_t *ppos, int userbuf)
> +{
> + unsigned long pfn, p = *ppos;
> + size_t read = 0;
> +
> + pfn = p / PAGE_SIZE;
> + p = p % PAGE_SIZE;
> +
> + if (pfn > saved_max_pfn) {
> + read = -EINVAL;
> + goto done;
> + }
> +
> + if (count > PAGE_SIZE - p)
> + count = PAGE_SIZE - p;
> +
> + if (copy_oldmem_page(pfn, buf, count, userbuf)) {
> + read = -EFAULT;
> + goto done;
> + }
> +
> + *ppos += count;
hm, what's going on here? *ppos is a loff_t but you've copied it
into a 32-bit local prior to calculating the pfn.
Hariprasad Nellitheertha <[email protected]> wrote:
>
> +#ifdef CONFIG_CRASH_DUMP
> + if (dump_enabled) {
> + proc_vmcore = create_proc_entry("vmcore", S_IRUSR, NULL);
> + if (proc_vmcore) {
> + proc_vmcore->proc_fops = &proc_vmcore_operations;
> + proc_vmcore->size =
> + (size_t)(saved_max_pfn << PAGE_SHIFT);
> + }
> + }
> +#endif
Again, please try to move this out of procfs and into a crashdump-specific file.
Hariprasad Nellitheertha <[email protected]> wrote:
>
> -static int notesize(struct memelfnote *en)
> +int notesize(struct memelfnote *en)
> {
> int sz;
>
> @@ -129,7 +129,7 @@ static int notesize(struct memelfnote *e
> /*
> * store a note in the header buffer
> */
> -static char *storenote(struct memelfnote *men, char *bufp)
> +char *storenote(struct memelfnote *men, char *bufp)
As you're giving these kernel-wide scope, please also rename them
to elf_notesize() and elf_storenote().
Hariprasad Nellitheertha <[email protected]> wrote:
>
> +void __crash_dump_stop_cpus(void)
> +{
> + int i, cpu = smp_processor_id();
> + int other_cpus = num_online_cpus()-1;
> +
> + if (other_cpus > 0) {
> + atomic_set(&waiting_for_dump_ipi, other_cpus);
> +
> + for (i = 0; i < NR_CPUS; i++)
> + crash_dump_expect_ipi[i] = (i != cpu && cpu_online(i));
> +
> + set_nmi_callback(crash_dump_nmi_callback);
> + /* Ensure the new callback function is set before sending
> + * out the IPI
> + */
> + wmb();
> +
> + crash_dump_send_ipi();
> + while (atomic_read(&waiting_for_dump_ipi) > 0)
> + cpu_relax();
> +
> + unset_nmi_callback();
> + } else {
> + local_irq_disable();
> + disable_local_APIC();
> + local_irq_enable();
> + }
> +}
Is dodgy wrt CPU hotplug, but there's not a lot we can do about that
in this context, I expect. Which is a shame, given that CPU hotplug
is a likely time at which to be taking a crashdump ;)
Rusty, you may like to review these patches...
On Wed, Sep 15, 2004 at 02:27:22PM -0700, Andrew Morton wrote:
> Hariprasad Nellitheertha <[email protected]> wrote:
> > +void __crash_dump_stop_cpus(void)
> > +{
> > + int i, cpu = smp_processor_id();
> > + int other_cpus = num_online_cpus()-1;
> > +
> > + if (other_cpus > 0) {
> > + atomic_set(&waiting_for_dump_ipi, other_cpus);
> > +
> > + for (i = 0; i < NR_CPUS; i++)
> > + crash_dump_expect_ipi[i] = (i != cpu && cpu_online(i));
> > +
> > + set_nmi_callback(crash_dump_nmi_callback);
> > + /* Ensure the new callback function is set before sending
> > + * out the IPI
> > + */
> > + wmb();
> > +
> > + crash_dump_send_ipi();
> > + while (atomic_read(&waiting_for_dump_ipi) > 0)
> > + cpu_relax();
> > +
> > + unset_nmi_callback();
> > + } else {
> > + local_irq_disable();
> > + disable_local_APIC();
> > + local_irq_enable();
> > + }
> > +}
>
> Is dodgy wrt CPU hotplug, but there's not a lot we can do about that
> in this context, I expect. Which is a shame, given that CPU hotplug
> is a likely time at which to be taking a crashdump ;)
If Hari disables preemption during this entire section of code,
he should be safe from CPU hotplug, AFAICS. The stop machine
threads will never get to run on that CPU.
Thanks
Dipankar
On Thu, Sep 16, 2004 at 08:41:13AM +0000, Dipankar Sarma wrote:
> On Wed, Sep 15, 2004 at 02:27:22PM -0700, Andrew Morton wrote:
> > Is dodgy wrt CPU hotplug, but there's not a lot we can do about that
> > in this context, I expect. Which is a shame, given that CPU hotplug
> > is a likely time at which to be taking a crashdump ;)
>
> If Hari disables preemption during this entire section of code,
> he should be safe from CPU hotplug, AFAICS. The stop machine
> threads will never get to run on that CPU.
This will work for CPU remove, not CPU add, since the later
is not atomic (yet).
Rusty, do you think it would be worthwhile making CPU add atomic?
I can give it a shot :)
--
Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs,
Bangalore, INDIA - 560017
Srivatsa Vaddagiri <[email protected]> writes:
> On Thu, Sep 16, 2004 at 08:41:13AM +0000, Dipankar Sarma wrote:
> > On Wed, Sep 15, 2004 at 02:27:22PM -0700, Andrew Morton wrote:
> > > Is dodgy wrt CPU hotplug, but there's not a lot we can do about that
> > > in this context, I expect. Which is a shame, given that CPU hotplug
> > > is a likely time at which to be taking a crashdump ;)
> >
> > If Hari disables preemption during this entire section of code,
> > he should be safe from CPU hotplug, AFAICS. The stop machine
> > threads will never get to run on that CPU.
>
> This will work for CPU remove, not CPU add, since the later
> is not atomic (yet).
>
> Rusty, do you think it would be worthwhile making CPU add atomic?
> I can give it a shot :)
The whole section should be run with interrupts disabled,
so disabling preemption should be trivial.
Beyond this the code needs a bogomips based timeout, (we can't use external
time sources just the delay loop). Any of the other cpu's could have
crashed at this point, or our accounting data structures could be
currupt.
Putting the initial call inside of machine_kexec is also bogus.
crash_machine_kexec needs to call crash_stop_cpus and then
machine_kexec. machine_kexec is currently factored as a perfectly
reusable piece of code let's not mess that up.
Eric
Hariprasad Nellitheertha <[email protected]> writes:
> Regards, Hari
> --
> Hariprasad Nellitheertha
> Linux Technology Center
> India Software Labs
> IBM India, Bangalore
>
>
>
> This patch contains the code that does the memory preserving reboot. It
> copies over the first 640k into a backup region before handing over to
> kexec. The second kernel will boot using only the backup region.
Do you know what the kernel does with the low 1M?
Nothing in the hardware architecture requires us to use the
low 1M. So I think we would be safer if we could track down
and remove this dependency.
In general I agree that we need to be prepared to save some of the
original machine state, because some architectures give special
meaning to addresses in memory. But x86 is not one of those.
Perhaps the proper abstraction is to add a use_mem= variable
that simply tells us which memory addresses we can use.
By still doing some copying we run into the problem, of
potentially running out of memory areas where ongoing DMA
transfers may be happening. So this is worth
tracking down.
> diff -puN /dev/null include/linux/crash_dump.h
> --- /dev/null 2003-01-30 15:54:37.000000000 +0530
> +++ linux-2.6.9-rc1-hari/include/linux/crash_dump.h 2004-09-15
> 17:36:30.000000000 +0530
>
> @@ -0,0 +1,28 @@
> +#include <linux/kexec.h>
> +#include <linux/smp_lock.h>
> +#include <linux/device.h>
> +#include <asm/crash_dump.h>
> +
> +#ifdef CONFIG_CRASH_DUMP
Why is this function an inline in a header file?
I know we only call it once but still unnecessary code
in a header file seems silly.
> +extern int crash_dump_on;
> +static inline void crash_machine_kexec(void)
> +{
> + struct kimage *image;
> +
> + if ((!crash_dump_on) || (crashed))
> + return;
> +
> + image = xchg(&kexec_image, 0);
You are still not using a different global variable here.
With a separate kexec_crash_image variable if the image is present
we do a crash dump. That should allow you to remove the
crash_dump_on variable and the test above.
> + if (image) {
> + crashed = 1;
We should not need the magic global variable crashed. We can
just call the one or two functions needed from crash_machine_kexec.
> + device_shutdown();
Why are you calling device_shutdown here?
> + printk(KERN_EMERG "kexec: opening parachute\n");
To wrap things prettily we should probably add a machine_crash_shutdown()
and place in machine_crash_shutdown whatever architecture specific magic needs
to happen.
> + machine_kexec(image);
> + while (1);
The while look is unnecessary machine_kexec cannot
return.
> + } else {
> + printk(KERN_EMERG "kexec: No kernel image loaded!\n");
> + }
> +}
> +#else
> +#define crash_machine_kexec() do { } while(0)
> +#endif
Hi Eric,
On Sun, Sep 19, 2004 at 02:37:18PM -0600, Eric W. Biederman wrote:
> Hariprasad Nellitheertha <[email protected]> writes:
>
> > This patch contains the code that does the memory preserving reboot. It
> > copies over the first 640k into a backup region before handing over to
> > kexec. The second kernel will boot using only the backup region.
>
> Do you know what the kernel does with the low 1M?
>
> Nothing in the hardware architecture requires us to use the
> low 1M. So I think we would be safer if we could track down
> and remove this dependency.
>
> In general I agree that we need to be prepared to save some of the
> original machine state, because some architectures give special
> meaning to addresses in memory. But x86 is not one of those.
>
> Perhaps the proper abstraction is to add a use_mem= variable
> that simply tells us which memory addresses we can use.
>
> By still doing some copying we run into the problem, of
> potentially running out of memory areas where ongoing DMA
> transfers may be happening. So this is worth
> tracking down.
I am trying to track this down. I tried moving the first segment of vmlinux
into the reserved section by modifying kexec-tools. This is the command line
argument segment. It still seems to need the first few kilobytes, though.
Eliminating this is definitely needed so we can avoid using the first
kernel's region completely.
Also, I will make the changes in the rest of the patch as per your review
comments.
Regards, Hari
--
Hariprasad Nellitheertha
Linux Technology Center
India Software Labs
IBM India, Bangalore
Hariprasad Nellitheertha <[email protected]> writes:
> Hi Eric,
>
> On Sun, Sep 19, 2004 at 02:37:18PM -0600, Eric W. Biederman wrote:
> > Hariprasad Nellitheertha <[email protected]> writes:
> >
> > > This patch contains the code that does the memory preserving reboot. It
> > > copies over the first 640k into a backup region before handing over to
> > > kexec. The second kernel will boot using only the backup region.
> >
> > Do you know what the kernel does with the low 1M?
> >
> > Nothing in the hardware architecture requires us to use the
> > low 1M. So I think we would be safer if we could track down
> > and remove this dependency.
> >
> > In general I agree that we need to be prepared to save some of the
> > original machine state, because some architectures give special
> > meaning to addresses in memory. But x86 is not one of those.
> >
> > Perhaps the proper abstraction is to add a use_mem= variable
> > that simply tells us which memory addresses we can use.
> >
> > By still doing some copying we run into the problem, of
> > potentially running out of memory areas where ongoing DMA
> > transfers may be happening. So this is worth
> > tracking down.
>
> I am trying to track this down. I tried moving the first segment of vmlinux
> into the reserved section by modifying kexec-tools. This is the command line
> argument segment. It still seems to need the first few kilobytes, though.
Right that is being automatically placed there.
For testing it should not be too hard to hard code it at someplace
appropriate.
> Eliminating this is definitely needed so we can avoid using the first
> kernel's region completely.
>
> Also, I will make the changes in the rest of the patch as per your review
> comments.
Thanks.