Allows annotation of panics to include platform information. It's no big
deal to collect information, but way helpful when you are collecting
failure reports from a eventual base of millions of systems deployed in
other people's homes.
One of the biggest reasons this is an RFC is that I'm uncomfortable with
putting the pseudo-file that holds the annotation information in /proc.
Different layers of the software stack may drop dynamic information, such
as DHCP-supplied IP addresses, in here as they come up. This means it's
necessary to be able to append to the end of the annotation, so this looks
much more like a real file than a sysctl file. It also has multiple lines,
which doesn't look a sysctl file. Annotation can be viewed as a debug thing,
so maybe it belongs in debugfs, but people seem to be doing somewhat different
things with that filesystem.
So, suggestions on this issue, and any others are most welcome. If there a
better way to do this, I'll be happy to use it.
Signed-off-by: David VomLehn <[email protected]>
---
fs/proc/Makefile | 1 +
fs/proc/panic-note.c | 293 ++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/kernel.h | 7 +
kernel/panic.c | 1 +
lib/Kconfig.debug | 8 ++
5 files changed, 310 insertions(+), 0 deletions(-)
diff --git a/fs/proc/Makefile b/fs/proc/Makefile
index 11a7b5c..486d273 100644
--- a/fs/proc/Makefile
+++ b/fs/proc/Makefile
@@ -26,3 +26,4 @@ proc-$(CONFIG_PROC_VMCORE) += vmcore.o
proc-$(CONFIG_PROC_DEVICETREE) += proc_devtree.o
proc-$(CONFIG_PRINTK) += kmsg.o
proc-$(CONFIG_PROC_PAGE_MONITOR) += page.o
+proc-$(CONFIG_PANIC_NOTE) += panic-note.o
diff --git a/fs/proc/panic-note.c b/fs/proc/panic-note.c
new file mode 100644
index 0000000..449c5ef
--- /dev/null
+++ b/fs/proc/panic-note.c
@@ -0,0 +1,293 @@
+/*
+ * panic-note.c
+ *
+ * Allow a blob to be registered with the kernel that will be printed if
+ * the kernel panics.
+ *
+ * Copyright (C) 2009 Cisco Systems, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+/* Open issues:
+ * Where should the panic_note file be created? It's almost like a sysctl,
+ * but doesn't follow the same rules. When you write to a sysctl file, the
+ * previous data is replaced. When you write to the panic_note file, you
+ * can append to the end of the existing data.
+ */
+
+#include <linux/semaphore.h>
+#include <linux/fs.h>
+#include <linux/proc_fs.h>
+#include <linux/module.h>
+#include <linux/uaccess.h>
+
+/* Maximum size, in bytes, allowed for the blob. Having this limit prevents
+ * an inadvertant denial of service attack that might happen if someone with
+ * root privileges was automatically generating this note and the generator
+ * had an infinite loop. Perhaps this is more of a a denial of service
+ * suicide. */
+#define PANIC_NOTE_SIZE (PAGE_SIZE * 4)
+
+/*
+ * struct panic_note_data - Information about the panic note
+ * @n: Number of bytes in the note
+ * @p: Pointer to the data in the note
+ * @sem: Semaphore controlling access to data in the note
+ */
+struct panic_note_state {
+ size_t n;
+ void *p;
+ struct rw_semaphore sem;
+};
+
+static struct panic_note_state panic_note_state = {
+ 0, NULL, __RWSEM_INITIALIZER(panic_note_state.sem)
+};
+static const struct file_operations panic_note_fops;
+static struct inode_operations panic_note_iops;
+static struct proc_dir_entry *panic_note_entry;
+
+/*
+ * panic_note_print - display the panic note
+ * @priority: Printk priority to use, e.g. KERN_EMERG
+ */
+void panic_note_print()
+{
+ int i;
+ int linelen;
+
+ /* We skip the semaphore stuff because we're in a panic situation and
+ * the scheduler isn't available in case the semaphore is already owned
+ * by someone else */
+ for (i = 0; i < panic_note_state.n; i += linelen) {
+ const char *p;
+ int remaining;
+ const char *nl;
+
+ p = panic_note_state.p + i;
+ remaining = panic_note_state.n - i;
+
+ nl = memchr(p, '\n', remaining);
+
+ if (nl == NULL) {
+ linelen = remaining;
+ pr_emerg("%.*s\n", linelen, p);
+ } else {
+ linelen = nl - p + 1;
+ pr_emerg("%.*s", linelen, p);
+ }
+ }
+}
+
+/*
+ * read_write_size - calculate the limited copy_to_user/copy_from_user count
+ * @nbytes: The number of bytes requested
+ * @pos: Offset, in bytes, into the file
+ * @size: Maximum I/O offset, in bytes. For a read, this is the actual
+ * number of bytes in the file, since you can't read past
+ * the end. Writes can be done after the number of bytes in the
+ * file, so this is the maximum possible file size, minus one.
+ *
+ * Returns the number of bytes to copy.
+ */
+static ssize_t read_write_size(size_t nbytes, loff_t pos, size_t size)
+{
+ ssize_t retval;
+
+ if (pos >= size)
+ retval = 0;
+
+ else {
+ retval = size - pos;
+ if (retval > nbytes)
+ retval = nbytes;
+ }
+
+ return retval;
+}
+
+/*
+ * panic_note_read - return data from the panic note
+ * @filp: Pointer to information on the file
+ * @buf: Pointer, in user space, to the buffer in which to return the
+ * data
+ * @nbytes: Number of bytes requested
+ * @ppos: Pointer to file position
+ *
+ * Returns the number of bytes actually transferred, or a negative errno
+ * value if none could be transferred.
+ */
+static ssize_t panic_note_read(struct file *filp, char __user *buf,
+ size_t nbytes, loff_t *ppos)
+{
+ ssize_t retval;
+ ssize_t result;
+
+ down_read(&panic_note_state.sem);
+ panic_note_entry->size = panic_note_state.n;
+ retval = read_write_size(nbytes, *ppos, panic_note_state.n);
+
+ if (retval > 0) {
+ result = copy_to_user(buf, panic_note_state.p + *ppos, retval);
+
+ if (result != 0)
+ retval = -EFAULT;
+ else
+ *ppos += retval;
+ }
+ up_read(&panic_note_state.sem);
+
+ return retval;
+}
+
+/*
+ * panic_note_write - store data in the panic note
+ * @filp: Pointer to information on the file
+ * @buf: Pointer, in user space, to the buffer from which to retrieve the
+ * data
+ * @nbytes: Number of bytes requested
+ * @ppos: Pointer to file position
+ *
+ * Returns the number of bytes actually transferred, or a negative errno
+ * value if none could be transferred.
+ */
+static ssize_t panic_note_write(struct file *filp, const char __user *buf,
+ size_t nbytes, loff_t *ppos)
+{
+ ssize_t retval;
+ ssize_t result;
+ loff_t pos;
+
+ down_write(&panic_note_state.sem);
+
+ /* If the O_APPEND flag is set, ignore the current position and
+ * add to the end. */
+ pos = ((filp->f_flags & O_APPEND) == 0) ? *ppos : panic_note_state.n;
+
+ retval = read_write_size(nbytes, pos, PANIC_NOTE_SIZE);
+
+ if (retval == 0)
+ retval = -ENOSPC;
+ else {
+ /* If we have a hole, fill it with zeros */
+ if (pos > panic_note_state.n)
+ memset(panic_note_state.p + panic_note_state.n,
+ 0, pos - panic_note_state.n);
+
+ /* Fetch what was written from user space */
+ result = copy_from_user(panic_note_state.p + pos, buf,
+ retval);
+
+ if (result != 0)
+ retval = -EFAULT;
+ else {
+
+ /* If we now have more bytes than we did, grow the
+ * size */
+ if (pos + retval > panic_note_state.n) {
+ struct inode *inode;
+ inode = filp->f_path.dentry->d_inode;
+ panic_note_state.n = pos + retval;
+ panic_note_entry->size = panic_note_state.n;
+ }
+
+ *ppos = pos + retval;
+ }
+ }
+ up_write(&panic_note_state.sem);
+
+ return retval;
+}
+
+static int panic_note_open(struct inode *inode, struct file *filp)
+{
+ filp->f_op = &panic_note_fops;
+ inode->i_op = &panic_note_iops;
+ panic_note_entry->size = panic_note_state.n;
+
+ return 0;
+}
+
+static const struct file_operations panic_note_fops = {
+ .owner = THIS_MODULE,
+ .open = panic_note_open,
+ .read = panic_note_read,
+ .write = panic_note_write,
+};
+
+static void panic_note_truncate(struct inode *inode)
+{
+ down_write(&panic_note_state.sem);
+ panic_note_state.n = 0;
+ panic_note_entry->size = panic_note_state.n;
+ up_write(&panic_note_state.sem);
+}
+
+static struct inode_operations panic_note_iops = {
+ .truncate = panic_note_truncate,
+};
+
+static int __init panic_note_init(void)
+{
+ int retval;
+
+ /* This can allocate kernel memory, so we let only the root use
+ * it. */
+ panic_note_entry = create_proc_entry("panic_note", 0600, NULL);
+
+ if (panic_note_entry == NULL) {
+ retval = -ENOMEM;
+ goto error_exit;
+ }
+
+ /* Set up the basic proc file fields */
+ panic_note_entry->proc_fops = &panic_note_fops;
+ panic_note_entry->proc_iops = &panic_note_iops;
+
+ /* Allocate a buffer. Doing so now avoids the possibility that
+ * we won't be able to get when when the kernel runs out of
+ * memory. */
+ panic_note_state.p = kmalloc(PANIC_NOTE_SIZE, GFP_KERNEL);
+
+ if (panic_note_state.p == NULL) {
+ retval = -ENOMEM;
+ goto kmalloc_buf_error;
+ }
+
+ return 0;
+
+kmalloc_buf_error:
+ kfree(panic_note_state.p);
+ panic_note_state.p = NULL;
+
+ remove_proc_entry("panic_note", NULL);
+
+error_exit:
+ return retval;
+}
+
+static int __exit panic_note_cleanup(void)
+{
+ if (panic_note_state.p != NULL)
+ kfree(panic_note_state.p);
+
+ remove_proc_entry("panic_note", NULL);
+
+ return 0;
+}
+
+late_initcall(panic_note_init);
+late_initcall(panic_note_cleanup);
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index f4e3184..86ca4d7 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -312,6 +312,13 @@ extern void add_taint(unsigned flag);
extern int test_taint(unsigned flag);
extern unsigned long get_taint(void);
extern int root_mountflags;
+#ifdef CONFIG_PANIC_NOTE
+extern void panic_note_print(void);
+#else
+static inline void panic_note_print(void)
+{
+}
+#endif
/* Values used for system_state */
extern enum system_states {
diff --git a/kernel/panic.c b/kernel/panic.c
index 96b45d0..513deae 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -70,6 +70,7 @@ NORET_TYPE void panic(const char * fmt, ...)
vsnprintf(buf, sizeof(buf), fmt, args);
va_end(args);
printk(KERN_EMERG "Kernel panic - not syncing: %s\n",buf);
+ panic_note_print();
#ifdef CONFIG_DEBUG_BUGVERBOSE
dump_stack();
#endif
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 30df586..bade7a1 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1045,6 +1045,14 @@ config DMA_API_DEBUG
This option causes a performance degredation. Use only if you want
to debug device drivers. If unsure, say N.
+config PANIC_NOTE
+ bool "Create file for user space data to be reported at panic time"
+ default n
+ help
+ This creates a pseudo-file, named /proc/panic_note, into which
+ user space data can be written. If a panic occurs, the contents
+ of the file will be included in the failure report.
+
source "samples/Kconfig"
source "lib/Kconfig.kgdb"
Sincerely, I don't understand why we should involve the kernel to gather
this kind of information when we can use other (user-space) tools, only
to have "all" in a single report maybe? I think it's a bit weak reason
to include this additional behavior in the kernel.
David VomLehn ha scritto:
> Allows annotation of panics to include platform information. It's no big
> deal to collect information, but way helpful when you are collecting
> failure reports from a eventual base of millions of systems deployed in
> other people's homes.
>
Marco
On Wed, 2009-11-11 at 21:13 -0500, David VomLehn wrote:
> Allows annotation of panics to include platform information. It's no big
> deal to collect information, but way helpful when you are collecting
> failure reports from a eventual base of millions of systems deployed in
> other people's homes.
I'd like to hear a bit more use case motivation on this feature. Also,
why do you want more than a page?
--
http://selenic.com : development and support for Mercurial and Linux
David VomLehn wrote:
> Allows annotation of panics to include platform information. It's no big
> deal to collect information, but way helpful when you are collecting
> failure reports from a eventual base of millions of systems deployed in
> other people's homes.
>
> One of the biggest reasons this is an RFC is that I'm uncomfortable with
> putting the pseudo-file that holds the annotation information in /proc.
> Different layers of the software stack may drop dynamic information, such
> as DHCP-supplied IP addresses, in here as they come up. This means it's
> necessary to be able to append to the end of the annotation, so this looks
> much more like a real file than a sysctl file. It also has multiple lines,
> which doesn't look a sysctl file. Annotation can be viewed as a debug thing,
> so maybe it belongs in debugfs, but people seem to be doing somewhat different
> things with that filesystem.
>
> So, suggestions on this issue, and any others are most welcome. If there a
> better way to do this, I'll be happy to use it.
>
> Signed-off-by: David VomLehn <[email protected]>
> ---
> --- a/kernel/panic.c
> +++ b/kernel/panic.c
> @@ -70,6 +70,7 @@ NORET_TYPE void panic(const char * fmt, ...)
> vsnprintf(buf, sizeof(buf), fmt, args);
> va_end(args);
> printk(KERN_EMERG "Kernel panic - not syncing: %s\n",buf);
> + panic_note_print();
> #ifdef CONFIG_DEBUG_BUGVERBOSE
> dump_stack();
> #endif
Why hook into panic() directly like this, vs. using the panic
notifier list? If you use that, and then put the data handling
magic that you need into your own kernel module that knows how
to interface with the reporting apps that you have, you can
do the whole thing without having to alter existing code, I think.
Paul.
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 30df586..bade7a1 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1045,6 +1045,14 @@ config DMA_API_DEBUG
> This option causes a performance degredation. Use only if you want
> to debug device drivers. If unsure, say N.
>
> +config PANIC_NOTE
> + bool "Create file for user space data to be reported at panic time"
> + default n
> + help
> + This creates a pseudo-file, named /proc/panic_note, into which
> + user space data can be written. If a panic occurs, the contents
> + of the file will be included in the failure report.
> +
> source "samples/Kconfig"
>
> source "lib/Kconfig.kgdb"
(Switched from top posting)
On Thu, Nov 12, 2009 at 01:00:17PM -0500, Marco Stornelli wrote:
> David VomLehn ha scritto:
> > Allows annotation of panics to include platform information. It's no big
> > deal to collect information, but way helpful when you are collecting
> > failure reports from a eventual base of millions of systems deployed in
> > other people's homes.
> >
> Sincerely, I don't understand why we should involve the kernel to gather
> this kind of information when we can use other (user-space) tools, only
> to have "all" in a single report maybe? I think it's a bit weak reason
> to include this additional behavior in the kernel.
Good question. Some more detail on our application might help. In some
situations, we may have no disk and only enough flash for the bootloader.
The kernel is downloaded over the network. When we get to user space, we
initialize a number of things dynamically. For example, we dynamically
compute some MAC address, and most of the IP addresses are obtained with
DHCP. This are very useful to have for panic analysis.
Since there is neither flash nor disk, user space has no place to store
this information, should the kernel panic. When we come back up, we will get
different MAC and IP addresses. Storing them in memory is our only hope.
Fortunately, there is a section of RAM that the bootloader promises not
to overwrite. On a panic, we capture the messages written on the console
and store them in the protected area. If the information from the
/proc file is written as part of the panic, we will capture it, too.
There is a later email suggesting this be done in a panic notifier, and I
think that's a better approach. Then, instead of having this be a /proc file,
we could have a pseudo-device in /dev.
> Marco
On Thu, Nov 12, 2009 at 01:06:51PM -0500, Matt Mackall wrote:
> On Wed, 2009-11-11 at 21:13 -0500, David VomLehn wrote:
> > Allows annotation of panics to include platform information. It's no big
> > deal to collect information, but way helpful when you are collecting
> > failure reports from a eventual base of millions of systems deployed in
> > other people's homes.
>
> I'd like to hear a bit more use case motivation on this feature. Also,
> why do you want more than a page?
Hopefully, I have addressed the first question in my previous email. As
for the second, I doubt there is a need for more than a page. I just picked
a value to start developing with. This is still a work in progress...
> http://selenic.com : development and support for Mercurial and Linux
David VL
On Thu, Nov 12, 2009 at 02:50:41PM -0500, Paul Gortmaker wrote:
> David VomLehn wrote:
>> Allows annotation of panics to include platform information. It's no big
>> deal to collect information, but way helpful when you are collecting
>> failure reports from a eventual base of millions of systems deployed in
>> other people's homes.
...
> Why hook into panic() directly like this, vs. using the panic
> notifier list? If you use that, and then put the data handling
> magic that you need into your own kernel module that knows how
> to interface with the reporting apps that you have, you can
> do the whole thing without having to alter existing code, I think.
I agree--a panic notifier list is probably a better approach.
David VL
On Thu, 12 Nov 2009 16:56:49 -0500
David VomLehn <[email protected]> wrote:
> Good question. Some more detail on our application might help. In some
> situations, we may have no disk and only enough flash for the bootloader.
> The kernel is downloaded over the network. When we get to user space, we
> initialize a number of things dynamically. For example, we dynamically
> compute some MAC address, and most of the IP addresses are obtained with
> DHCP. This are very useful to have for panic analysis.
>
> Since there is neither flash nor disk, user space has no place to store
> this information, should the kernel panic. When we come back up, we will get
> different MAC and IP addresses. Storing them in memory is our only hope.
>
> Fortunately, there is a section of RAM that the bootloader promises not
> to overwrite. On a panic, we capture the messages written on the console
> and store them in the protected area. If the information from the
> /proc file is written as part of the panic, we will capture it, too.
Can't you solve this completely from userspace using phram and mtdoops
instead? I.e., setup two phram areas
modprobe phram 4K@start-of-your-area,4K@start-of-your-area+4K # Can't remember the exact syntax!
you'll then get /dev/mtdX and /dev/mtdX+1 for these two. You can then do
modprobe mtdoops mtddev=/dev/mtdX+1 dump_oops=0
to load mtdoops to catch the panic in the second area, and just write
your userspace messages to /dev/mtdX.
One thing probably have to be fixed though: I don't think phram has a
panic_write, which will be needed by mtdoops to catch the panic - this
should be trivial to add though since it's plain RAM.
// Simon
On Wed, 2009-11-11 at 21:13 -0500, David VomLehn wrote:
> diff --git a/kernel/panic.c b/kernel/panic.c
> index 96b45d0..513deae 100644
> --- a/kernel/panic.c
> +++ b/kernel/panic.c
> @@ -70,6 +70,7 @@ NORET_TYPE void panic(const char * fmt, ...)
> vsnprintf(buf, sizeof(buf), fmt, args);
> va_end(args);
> printk(KERN_EMERG "Kernel panic - not syncing: %s\n",buf);
> + panic_note_print();
> #ifdef CONFIG_DEBUG_BUGVERBOSE
> dump_stack();
> #endif
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
You should use panic notifiers instead. See the panic_notifier_list.
--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
On Thu, 2009-11-12 at 12:06 -0600, Matt Mackall wrote:
> On Wed, 2009-11-11 at 21:13 -0500, David VomLehn wrote:
> > Allows annotation of panics to include platform information. It's no big
> > deal to collect information, but way helpful when you are collecting
> > failure reports from a eventual base of millions of systems deployed in
> > other people's homes.
>
> I'd like to hear a bit more use case motivation on this feature. Also,
> why do you want more than a page?
We also need this kind of functionality. The use case is very simple.
Every time the kernel oopeses, we save the oops information on the flash
using mtdoops module. There is even core support, which should be merged
to 2.6.33, see this patch:
http://git.infradead.org/users/dedekind/l2-mtd-2.6.git/commit/832c3d00e82f267316a2b53634631a1821eebae8
(and there was a corresponding discussion on lkml).
And what we want is to dump information about the user-space environment
at the same time to the oops. Specifically, we want to dump information
about what was the SW build number.
And we want this information to be printed at the same time, because we
cannot run any user-space at the panic time. This information is later
read from the flash and sent via the network to the central place. And
by the time it is sent, the user may have already re-flashed his device
with something else.
So I very much appreciate this patch, although I think it should use the
panic notifiers instead of calling a function directly from the panic.
--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
On Fri, 2009-11-13 at 09:10 +0100, Simon Kagstrom wrote:
> On Thu, 12 Nov 2009 16:56:49 -0500
> David VomLehn <[email protected]> wrote:
>
> > Good question. Some more detail on our application might help. In some
> > situations, we may have no disk and only enough flash for the bootloader.
> > The kernel is downloaded over the network. When we get to user space, we
> > initialize a number of things dynamically. For example, we dynamically
> > compute some MAC address, and most of the IP addresses are obtained with
> > DHCP. This are very useful to have for panic analysis.
> >
> > Since there is neither flash nor disk, user space has no place to store
> > this information, should the kernel panic. When we come back up, we will get
> > different MAC and IP addresses. Storing them in memory is our only hope.
> >
> > Fortunately, there is a section of RAM that the bootloader promises not
> > to overwrite. On a panic, we capture the messages written on the console
> > and store them in the protected area. If the information from the
> > /proc file is written as part of the panic, we will capture it, too.
>
> Can't you solve this completely from userspace using phram and mtdoops
> instead? I.e., setup two phram areas
>
> modprobe phram 4K@start-of-your-area,4K@start-of-your-area+4K # Can't remember the exact syntax!
>
> you'll then get /dev/mtdX and /dev/mtdX+1 for these two. You can then do
>
> modprobe mtdoops mtddev=/dev/mtdX+1 dump_oops=0
>
> to load mtdoops to catch the panic in the second area, and just write
> your userspace messages to /dev/mtdX.
This might work for them, not sure, but not for us. We store panics on
flash, and later they are automatically sent to the panic collection
system via the network. And the complications are:
1. There may be many panics before the device has network access and has
a chance to send the panics.
2. User can re-flash the device with different SW inbetween.
So we really need to print some user-space supplied information during
the panic, and then we store it on flash with mtdoops, and the later,
when the device has network access we send whole bunch of oopses via the
network.
> One thing probably have to be fixed though: I don't think phram has a
> panic_write, which will be needed by mtdoops to catch the panic - this
> should be trivial to add though since it's plain RAM.
--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
On Thu, 2009-11-12 at 23:09 +0100, ext David VomLehn wrote:
> On Thu, Nov 12, 2009 at 02:50:41PM -0500, Paul Gortmaker wrote:
> > David VomLehn wrote:
> >> Allows annotation of panics to include platform information. It's no big
> >> deal to collect information, but way helpful when you are collecting
> >> failure reports from a eventual base of millions of systems deployed in
> >> other people's homes.
> ...
> > Why hook into panic() directly like this, vs. using the panic
> > notifier list? If you use that, and then put the data handling
> > magic that you need into your own kernel module that knows how
> > to interface with the reporting apps that you have, you can
> > do the whole thing without having to alter existing code, I think.
>
> I agree--a panic notifier list is probably a better approach.
>
That's what we currently use:
http://marc.info/?l=linux-kernel&m=125655380512117&w=2
Very simple and does the job.
Requires to rearrange the panic code a bit, though.
Namely, to move notifiers invocation before the call to kmsg_dump().
Dumpers were introduced by Simon Kagstrom here:
http://marc.info/?l=linux-kernel&m=125569530109871&w=2
Regards,
Atal
> David VL
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
(Also fix David Woodhouses address and add Atal)
On Fri, 13 Nov 2009 13:45:48 +0200
Artem Bityutskiy <[email protected]> wrote:
> So we really need to print some user-space supplied information during
> the panic, and then we store it on flash with mtdoops, and the later,
> when the device has network access we send whole bunch of oopses via the
> network.
Yes, I see that your case would have to be handled differently. A
complication (which I believe was discussed before) is that kmsg_dump()
is done before the panic notifiers are called.
The reason I put it there is to have it before crash_kexec(), so I
guess we'll have to take up the discussion on what to do with it. For
me it now seems like it would be OK to move kmsg_dump() down below the
panic notifiers.
If you have a kdump kernel to load, then you will most likely not need
the kmsg dumped data anyway.
// Simon
On Fri, 2009-11-13 at 12:59 +0100, Simon Kagstrom wrote:
> (Also fix David Woodhouses address and add Atal)
>
> On Fri, 13 Nov 2009 13:45:48 +0200
> Artem Bityutskiy <[email protected]> wrote:
>
> > So we really need to print some user-space supplied information during
> > the panic, and then we store it on flash with mtdoops, and the later,
> > when the device has network access we send whole bunch of oopses via the
> > network.
>
> Yes, I see that your case would have to be handled differently. A
> complication (which I believe was discussed before) is that kmsg_dump()
> is done before the panic notifiers are called.
>
> The reason I put it there is to have it before crash_kexec(), so I
> guess we'll have to take up the discussion on what to do with it. For
> me it now seems like it would be OK to move kmsg_dump() down below the
> panic notifiers.
>
> If you have a kdump kernel to load, then you will most likely not need
> the kmsg dumped data anyway.
Yeah, I think this is a separate issue which can be fixed separately.
--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
I think in general the procedure should be: at startup or event (for
example acquired IP address from DHCP) user applications write in flash
(better in persistent ram) a log with a tag or a timestamp or something
like this, when there is a kernel panic, it is captured in a file stored
together the log and when possible the system should send all via
network for example. Are there problems that I can't see to follow this
approach? When David says "...so this looks much more like a real file
than a sysctl file" I quite agree, it seems a normal application/system
log indeed.
Marco
Artem Bityutskiy wrote:
> On Fri, 2009-11-13 at 09:10 +0100, Simon Kagstrom wrote:
>> On Thu, 12 Nov 2009 16:56:49 -0500
>> David VomLehn <[email protected]> wrote:
>>
>>> Good question. Some more detail on our application might help. In some
>>> situations, we may have no disk and only enough flash for the bootloader.
>>> The kernel is downloaded over the network. When we get to user space, we
>>> initialize a number of things dynamically. For example, we dynamically
>>> compute some MAC address, and most of the IP addresses are obtained with
>>> DHCP. This are very useful to have for panic analysis.
>>>
>>> Since there is neither flash nor disk, user space has no place to store
>>> this information, should the kernel panic. When we come back up, we will get
>>> different MAC and IP addresses. Storing them in memory is our only hope.
>>>
>>> Fortunately, there is a section of RAM that the bootloader promises not
>>> to overwrite. On a panic, we capture the messages written on the console
>>> and store them in the protected area. If the information from the
>>> /proc file is written as part of the panic, we will capture it, too.
>> Can't you solve this completely from userspace using phram and mtdoops
>> instead? I.e., setup two phram areas
>>
>> modprobe phram 4K@start-of-your-area,4K@start-of-your-area+4K # Can't remember the exact syntax!
>>
>> you'll then get /dev/mtdX and /dev/mtdX+1 for these two. You can then do
>>
>> modprobe mtdoops mtddev=/dev/mtdX+1 dump_oops=0
>>
>> to load mtdoops to catch the panic in the second area, and just write
>> your userspace messages to /dev/mtdX.
>
> This might work for them, not sure, but not for us. We store panics on
> flash, and later they are automatically sent to the panic collection
> system via the network. And the complications are:
>
> 1. There may be many panics before the device has network access and has
> a chance to send the panics.
> 2. User can re-flash the device with different SW inbetween.
>
> So we really need to print some user-space supplied information during
> the panic, and then we store it on flash with mtdoops, and the later,
> when the device has network access we send whole bunch of oopses via the
> network.
>
>> One thing probably have to be fixed though: I don't think phram has a
>> panic_write, which will be needed by mtdoops to catch the panic - this
>> should be trivial to add though since it's plain RAM.
>
Hi Marko,
On Sat, 2009-11-14 at 09:28 +0100, Marco Stornelli wrote:
> I think in general the procedure should be: at startup or event (for
> example acquired IP address from DHCP) user applications write in flash
> (better in persistent ram) a log with a tag or a timestamp or something
> like this, when there is a kernel panic, it is captured in a file stored
> together the log and when possible the system should send all via
> network for example. Are there problems that I can't see to follow this
> approach? When David says "...so this looks much more like a real file
> than a sysctl file" I quite agree, it seems a normal application/system
> log indeed.
Please, do not top-post, you make it difficult to discuss things by
messing up the context. Be consistent with how other people reply in
this thread, please.
Take a look at my mails where I describe different complications we have
in our system. We really want to have an OOPS/panic + our environment
stuff to go together, at once. This makes things so much simpler.
Really, what is the problem providing this trivial panic-note
capability, where user-space can give the kernel a small buffer, and ask
the kernel to print this buffer at the oops/panic time. Very simple and
elegant, and just solves the problem.
Why perversions with time-stamps, separate storages are needed?
IOW, you suggest a complicated approach, and demand explaining why we do
not go for it. Simply because it is unnecessarily complex. This patch
solves the problem gracefully, and I'd rather demand you to point what
is the technical problem with the patches.
--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
A pair of nit-picks.
On Wed, 2009-11-11 at 21:13 -0500, David VomLehn wrote:
> @@ -0,0 +1,293 @@
> +/*
> + * panic-note.c
No need to type file name there.
> + *
> + * Allow a blob to be registered with the kernel that will be printed if
> + * the kernel panics.
> + *
> + * Copyright (C) 2009 Cisco Systems, Inc.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
> + */
> +
> +/* Open issues:
> + * Where should the panic_note file be created? It's almost like a sysctl,
> + * but doesn't follow the same rules. When you write to a sysctl file, the
> + * previous data is replaced. When you write to the panic_note file, you
> + * can append to the end of the existing data.
> + */
Please, take a look at what is the kernel comment style at
Documentation/CodingStyle. We use
/*
* Multi-line
* comment
*/
style.
> +
> +#include <linux/semaphore.h>
> +#include <linux/fs.h>
> +#include <linux/proc_fs.h>
> +#include <linux/module.h>
> +#include <linux/uaccess.h>
> +
> +/* Maximum size, in bytes, allowed for the blob. Having this limit prevents
> + * an inadvertant denial of service attack that might happen if someone with
> + * root privileges was automatically generating this note and the generator
> + * had an infinite loop. Perhaps this is more of a a denial of service
> + * suicide. */
> +#define PANIC_NOTE_SIZE (PAGE_SIZE * 4)
> +
> +/*
> + * struct panic_note_data - Information about the panic note
> + * @n: Number of bytes in the note
> + * @p: Pointer to the data in the note
> + * @sem: Semaphore controlling access to data in the note
> + */
> +struct panic_note_state {
> + size_t n;
> + void *p;
> + struct rw_semaphore sem;
> +};
> +
> +static struct panic_note_state panic_note_state = {
> + 0, NULL, __RWSEM_INITIALIZER(panic_note_state.sem)
> +};
> +static const struct file_operations panic_note_fops;
> +static struct inode_operations panic_note_iops;
> +static struct proc_dir_entry *panic_note_entry;
> +
> +/*
> + * panic_note_print - display the panic note
> + * @priority: Printk priority to use, e.g. KERN_EMERG
> + */
> +void panic_note_print()
> +{
> + int i;
> + int linelen;
int i, lineline is more compact.
> +
> + /* We skip the semaphore stuff because we're in a panic situation and
> + * the scheduler isn't available in case the semaphore is already owned
> + * by someone else */
> + for (i = 0; i < panic_note_state.n; i += linelen) {
> + const char *p;
> + int remaining;
> + const char *nl;
const char *p, *nl is more compact. And there are several more places.
But this is matter of taste, really.
> +
> + p = panic_note_state.p + i;
> + remaining = panic_note_state.n - i;
> +
> + nl = memchr(p, '\n', remaining);
> +
> + if (nl == NULL) {
> + linelen = remaining;
> + pr_emerg("%.*s\n", linelen, p);
> + } else {
> + linelen = nl - p + 1;
> + pr_emerg("%.*s", linelen, p);
> + }
> + }
> +}
> +
> +/*
> + * read_write_size - calculate the limited copy_to_user/copy_from_user count
> + * @nbytes: The number of bytes requested
> + * @pos: Offset, in bytes, into the file
> + * @size: Maximum I/O offset, in bytes. For a read, this is the actual
> + * number of bytes in the file, since you can't read past
> + * the end. Writes can be done after the number of bytes in the
> + * file, so this is the maximum possible file size, minus one.
> + *
> + * Returns the number of bytes to copy.
> + */
> +static ssize_t read_write_size(size_t nbytes, loff_t pos, size_t size)
> +{
> + ssize_t retval;
> +
> + if (pos >= size)
> + retval = 0;
> +
Unnecessary \n
> + else {
> + retval = size - pos;
> + if (retval > nbytes)
> + retval = nbytes;
> + }
> +
> + return retval;
> +}
--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
2009/11/17 Artem Bityutskiy <[email protected]>:
> Take a look at my mails where I describe different complications we have
> in our system. We really want to have an OOPS/panic + our environment
> stuff to go together, at once. This makes things so much simpler.
>
> Really, what is the problem providing this trivial panic-note
> capability, where user-space can give the kernel a small buffer, and ask
> the kernel to print this buffer at the oops/panic time. Very simple and
> elegant, and just solves the problem.
>
> Why perversions with time-stamps, separate storages are needed?
>
> IOW, you suggest a complicated approach, and demand explaining why we do
> not go for it. Simply because it is unnecessarily complex.
I don't think it's a complicated approach we are talking of a system
log like syslog with a temporal information, nothing more.
> This patch solves the problem gracefully, and I'd rather demand you to point what
> is the technical problem with the patches.
>
Simply because I think that we should avoid to include in the kernel
things we can do in a simply way at user space level. I think this
patch is well done but it's one of the patches that are solutions "for
embedded only", but it's only my opinion.
Marco
On Tue, 2009-11-17 at 13:45 +0100, Marco Stornelli wrote:
> 2009/11/17 Artem Bityutskiy <[email protected]>:
> > Take a look at my mails where I describe different complications we have
> > in our system. We really want to have an OOPS/panic + our environment
> > stuff to go together, at once. This makes things so much simpler.
> >
> > Really, what is the problem providing this trivial panic-note
> > capability, where user-space can give the kernel a small buffer, and ask
> > the kernel to print this buffer at the oops/panic time. Very simple and
> > elegant, and just solves the problem.
> >
> > Why perversions with time-stamps, separate storages are needed?
> >
> > IOW, you suggest a complicated approach, and demand explaining why we do
> > not go for it. Simply because it is unnecessarily complex.
>
> I don't think it's a complicated approach we are talking of a system
> log like syslog with a temporal information, nothing more.
We need to store this information of NAND flash. Implementing logs on
NAND flash is about handling bad blocks, choosing format of records, and
may be even handling wear-levelling. This is not that simple.
And then I have match oops to the userspace environment prints, using I
guess timestamps, which is also about complications in userspace.
> > This patch solves the problem gracefully, and I'd rather demand you to point what
> > is the technical problem with the patches.
> >
>
> Simply because I think that we should avoid to include in the kernel
> things we can do in a simply way at user space level.
If it is much easier to have in the kernel, then this argument does not
work, IMHO.
> I think this
> patch is well done but it's one of the patches that are solutions "for
> embedded only", but it's only my opinion.
Also IMHO, but having embedded-only things is not bad at all.
--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
Artem Bityutskiy <[email protected]> writes:
> On Tue, 2009-11-17 at 13:45 +0100, Marco Stornelli wrote:
>> 2009/11/17 Artem Bityutskiy <[email protected]>:
>> > Take a look at my mails where I describe different complications we have
>> > in our system. We really want to have an OOPS/panic + our environment
>> > stuff to go together, at once. This makes things so much simpler.
>> >
>> > Really, what is the problem providing this trivial panic-note
>> > capability, where user-space can give the kernel a small buffer, and ask
>> > the kernel to print this buffer at the oops/panic time. Very simple and
>> > elegant, and just solves the problem.
>> >
>> > Why perversions with time-stamps, separate storages are needed?
>> >
>> > IOW, you suggest a complicated approach, and demand explaining why we do
>> > not go for it. Simply because it is unnecessarily complex.
>>
>> I don't think it's a complicated approach we are talking of a system
>> log like syslog with a temporal information, nothing more.
>
> We need to store this information of NAND flash. Implementing logs on
> NAND flash is about handling bad blocks, choosing format of records, and
> may be even handling wear-levelling. This is not that simple.
>
> And then I have match oops to the userspace environment prints, using I
> guess timestamps, which is also about complications in userspace.
>
>> > This patch solves the problem gracefully, and I'd rather demand you to point what
>> > is the technical problem with the patches.
>> >
>>
>> Simply because I think that we should avoid to include in the kernel
>> things we can do in a simply way at user space level.
>
> If it is much easier to have in the kernel, then this argument does not
> work, IMHO.
>
>> I think this
>> patch is well done but it's one of the patches that are solutions "for
>> embedded only", but it's only my opinion.
>
> Also IMHO, but having embedded-only things is not bad at all.
Why not use the kdump hook? If you handle a kernel panic that way
you get enhanced reliability and full user space support. All in a hook
that is already present and already works.
Eric
Artem Bityutskiy wrote:
> On Tue, 2009-11-17 at 13:45 +0100, Marco Stornelli wrote:
>> 2009/11/17 Artem Bityutskiy <[email protected]>:
>
> We need to store this information of NAND flash. Implementing logs on
> NAND flash is about handling bad blocks, choosing format of records, and
> may be even handling wear-levelling. This is not that simple.
>
> And then I have match oops to the userspace environment prints, using I
> guess timestamps, which is also about complications in userspace.
>
Indeed my suggestion was to use a persistent ram, not difficult to use.
>>> This patch solves the problem gracefully, and I'd rather demand you to point what
>>> is the technical problem with the patches.
>>>
>> Simply because I think that we should avoid to include in the kernel
>> things we can do in a simply way at user space level.
>
> If it is much easier to have in the kernel, then this argument does not
> work, IMHO.
>
>> I think this
>> patch is well done but it's one of the patches that are solutions "for
>> embedded only", but it's only my opinion.
>
> Also IMHO, but having embedded-only things is not bad at all.
>
In the past other patches are not accepted in main line for this, maybe
you'll be luckier.
Marco
On Tue, Nov 17, 2009 at 10:45:43AM -0500, Eric W. Biederman wrote:
...
> Why not use the kdump hook? If you handle a kernel panic that way
> you get enhanced reliability and full user space support. All in a hook
> that is already present and already works.
I'm a big fan of avoiding reinvention of the wheel--if I can use something
already present, I will. However, I'm not clear about how much of the problem
I'm addressing will be solved by using a kdump hook. If I understand
correctly, you'd still need a pseudo-file somewhere to actually get the data
from user space to kernel space. *Then* you could use a kdump hook to
transfer the data to flash or some memory area that will be retained across
boots. Is this the approach to which you were referring? If so, I have a
couple more questions:
1. In what ways would this be better than, say, a panic_notifier?
2. Where would you suggest tying in? (Particularly since not all architectures
currently support kdump)
> Eric
David VL
David VomLehn <[email protected]> writes:
> On Tue, Nov 17, 2009 at 10:45:43AM -0500, Eric W. Biederman wrote:
> ...
>> Why not use the kdump hook? If you handle a kernel panic that way
>> you get enhanced reliability and full user space support. All in a hook
>> that is already present and already works.
>
> I'm a big fan of avoiding reinvention of the wheel--if I can use something
> already present, I will. However, I'm not clear about how much of the problem
> I'm addressing will be solved by using a kdump hook. If I understand
> correctly, you'd still need a pseudo-file somewhere to actually get the data
> from user space to kernel space. *Then* you could use a kdump hook to
> transfer the data to flash or some memory area that will be retained across
> boots. Is this the approach to which you were referring? If so, I have a
> couple more questions:
>
> 1. In what ways would this be better than, say, a panic_notifier?
A couple of ways.
- You are doing the work in a known good kernel instead of the kernel
that just paniced for some unknown reason.
- All of the control logic is in user space (not the kernel) so you can
potentially do something as simple as "date >> logfile" to get the
date.
> 2. Where would you suggest tying in? (Particularly since not all architectures
> currently support kdump)
No changes are needed kernel side. You just need an appropriate kernel and
initrd for your purpose.
All of the interesting architectures support kexec, and if an
architecture doesn't it isn't hard to add. The architecture specific
part is very simple. A pain to debug initially but very simple.
Eric
On Tue, Nov 17, 2009 at 04:28:22PM -0800, Eric W. Biederman wrote:
> David VomLehn <[email protected]> writes:
>
> > On Tue, Nov 17, 2009 at 10:45:43AM -0500, Eric W. Biederman wrote:
> > ...
> >> Why not use the kdump hook? If you handle a kernel panic that way
> >> you get enhanced reliability and full user space support. All in a hook
> >> that is already present and already works.
...
> > 1. In what ways would this be better than, say, a panic_notifier?
>
> A couple of ways.
>
> - You are doing the work in a known good kernel instead of the kernel
> that just paniced for some unknown reason.
> - All of the control logic is in user space (not the kernel) so you can
> potentially do something as simple as "date >> logfile" to get the
> date.
I think I see better what you're suggesting--passing the info to a kdump
kernel and having it do whatever it wants. I don't think I want to do this,
but I haven't used any of the kexec() stuff, so I may be missing the point.
Some more context:
My application is an embedded one, and one of the big things I need to do
after a failure is to bring up a fully functional kernel ASAP. Once I have
that kernel, I process all of the crash data in user space concurrently with
running my main application. Because I'm embedded, I'm very limited in how
much crash data I can save over a reboot, how much I can store, and how
much I can send to a central collection point. This is good, since it doesn't
take up a lot of resources, but core dumps are out of the question.
As I understand kdump, I would also need to have a second kernel in memory
to do the kdump work. It wouldn't need to be as big is the kernel that
failed, but it would still require a significant amount of memory. On an
embedded system, the idle memory may be a luxury we can't afford.
I think this makes a kdump-based solution difficult, but if it can meet
my requirements, I'd much rather use it (I've been following kdump since
it's inception quite a few years ago, but it hasn't seemed a good match
for embedded Linux). Does this still sound like a good match?
> > 2. Where would you suggest tying in? (Particularly since not all architectures
> > currently support kdump)
>
> No changes are needed kernel side. You just need an appropriate kernel and
> initrd for your purpose.
I think I must still be missing something. I have dynamic data that I want
to preserve as I reboot from a failed kernel to a fresh new kernel. By
the time the fresh kernel starts init, the dynamic data (IP addresses, MAC
addresses) has been re-written with new values. This is why I'm trying to
preserve, but I may be running without disk or flash. This patch doesn't
preserve the data, but it gets it into the kernel so that it can be
preserved. At present, I'm preserving the data in a panic_notifier function,
but I am not wedded to that. At present, the data will be copied to a
section of memory retained across boots, but I know others will want to
write to flash.
> All of the interesting architectures support kexec, and if an
> architecture doesn't it isn't hard to add. The architecture specific
> part is very simple. A pain to debug initially but very simple.
I use MIPS processors, and it looks like it is supported. So long as it's
stable, I'm happy to use it.
> Eric
David VL
On Tue, 2009-11-17 at 16:28 -0800, Eric W. Biederman wrote:
> David VomLehn <[email protected]> writes:
>
> > On Tue, Nov 17, 2009 at 10:45:43AM -0500, Eric W. Biederman wrote:
> > ...
> >> Why not use the kdump hook? If you handle a kernel panic that way
> >> you get enhanced reliability and full user space support. All in a hook
> >> that is already present and already works.
> >
> > I'm a big fan of avoiding reinvention of the wheel--if I can use something
> > already present, I will. However, I'm not clear about how much of the problem
> > I'm addressing will be solved by using a kdump hook. If I understand
> > correctly, you'd still need a pseudo-file somewhere to actually get the data
> > from user space to kernel space. *Then* you could use a kdump hook to
> > transfer the data to flash or some memory area that will be retained across
> > boots. Is this the approach to which you were referring? If so, I have a
> > couple more questions:
> >
> > 1. In what ways would this be better than, say, a panic_notifier?
>
> A couple of ways.
>
> - You are doing the work in a known good kernel instead of the kernel
> that just paniced for some unknown reason.
> - All of the control logic is in user space (not the kernel) so you can
> potentially do something as simple as "date >> logfile" to get the
> date.
>
> > 2. Where would you suggest tying in? (Particularly since not all architectures
> > currently support kdump)
>
> No changes are needed kernel side. You just need an appropriate kernel and
> initrd for your purpose.
>
> All of the interesting architectures support kexec, and if an
> architecture doesn't it isn't hard to add. The architecture specific
> part is very simple. A pain to debug initially but very simple.
As much as I like kexec, it loses on memory footprint by about 100x.
It's not appropriate for all use cases, especially things like
consumer-grade wireless access points and phones.
--
http://selenic.com : development and support for Mercurial and Linux
On Tue, 2009-11-17 at 07:45 -0800, Eric W. Biederman wrote:
> Artem Bityutskiy <[email protected]> writes:
>
> > On Tue, 2009-11-17 at 13:45 +0100, Marco Stornelli wrote:
> >> 2009/11/17 Artem Bityutskiy <[email protected]>:
> >> > Take a look at my mails where I describe different complications we have
> >> > in our system. We really want to have an OOPS/panic + our environment
> >> > stuff to go together, at once. This makes things so much simpler.
> >> >
> >> > Really, what is the problem providing this trivial panic-note
> >> > capability, where user-space can give the kernel a small buffer, and ask
> >> > the kernel to print this buffer at the oops/panic time. Very simple and
> >> > elegant, and just solves the problem.
> >> >
> >> > Why perversions with time-stamps, separate storages are needed?
> >> >
> >> > IOW, you suggest a complicated approach, and demand explaining why we do
> >> > not go for it. Simply because it is unnecessarily complex.
> >>
> >> I don't think it's a complicated approach we are talking of a system
> >> log like syslog with a temporal information, nothing more.
> >
> > We need to store this information of NAND flash. Implementing logs on
> > NAND flash is about handling bad blocks, choosing format of records, and
> > may be even handling wear-levelling. This is not that simple.
> >
> > And then I have match oops to the userspace environment prints, using I
> > guess timestamps, which is also about complications in userspace.
> >
> >> > This patch solves the problem gracefully, and I'd rather demand you to point what
> >> > is the technical problem with the patches.
> >> >
> >>
> >> Simply because I think that we should avoid to include in the kernel
> >> things we can do in a simply way at user space level.
> >
> > If it is much easier to have in the kernel, then this argument does not
> > work, IMHO.
> >
> >> I think this
> >> patch is well done but it's one of the patches that are solutions "for
> >> embedded only", but it's only my opinion.
> >
> > Also IMHO, but having embedded-only things is not bad at all.
>
> Why not use the kdump hook? If you handle a kernel panic that way
> you get enhanced reliability and full user space support. All in a hook
> that is already present and already works.
For some reasons kdump does not work on ARM out of the box. We need to
investigate this.
--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
On Wed, Nov 18, 2009 at 8:53 AM, David VomLehn <[email protected]> wrote:
> On Tue, Nov 17, 2009 at 04:28:22PM -0800, Eric W. Biederman wrote:
>> David VomLehn <[email protected]> writes:
>>
>> > On Tue, Nov 17, 2009 at 10:45:43AM -0500, Eric W. Biederman wrote:
>> > ...
>> >> Why not use the kdump hook? If you handle a kernel panic that way
>> >> you get enhanced reliability and full user space support. All in a hook
>> >> that is already present and already works.
> ...
>> > 1. In what ways would this be better than, say, a panic_notifier?
>>
>> A couple of ways.
>>
>> - You are doing the work in a known good kernel instead of the kernel
>> that just paniced for some unknown reason.
>> - All of the control logic is in user space (not the kernel) so you can
>> potentially do something as simple as "date >> logfile" to get the
>> date.
>
> I think I see better what you're suggesting--passing the info to a kdump
> kernel and having it do whatever it wants. I don't think I want to do this,
> but I haven't used any of the kexec() stuff, so I may be missing the point.
> Some more context:
>
> My application is an embedded one, and one of the big things I need to do
> after a failure is to bring up a fully functional kernel ASAP. Once I have
> that kernel, I process all of the crash data in user space concurrently with
> running my main application. Because I'm embedded, I'm very limited in how
> much crash data I can save over a reboot, how much I can store, and how
> much I can send to a central collection point. This is good, since it doesn't
> take up a lot of resources, but core dumps are out of the question.
I think the problem of kdump is that it uses much memory to hold
the core, i.e. /proc/vmcore, and no way to free it unless using another
reboot. This is why Fedora only does some data-collection in the second
reboot after crash, and then reboots.
I got an idea many days ago, that is providing a way to "delete" /proc/vmcore
in the second reboot, so that we can have enough memory to continue without
another reboot. I am not sure if Eric likes this? Eric?
>
> As I understand kdump, I would also need to have a second kernel in memory
> to do the kdump work. It wouldn't need to be as big is the kernel that
> failed, but it would still require a significant amount of memory. On an
> embedded system, the idle memory may be a luxury we can't afford.
You can use only one kernel, as long as it is relocatable.
>
> I think this makes a kdump-based solution difficult, but if it can meet
> my requirements, I'd much rather use it (I've been following kdump since
> it's inception quite a few years ago, but it hasn't seemed a good match
> for embedded Linux). Does this still sound like a good match?
What do you think about my idea above? If we had that, would kdump
meet your requirements?
>
>> > 2. Where would you suggest tying in? (Particularly since not all architectures
>> > currently support kdump)
>>
>> No changes are needed kernel side. You just need an appropriate kernel and
>> initrd for your purpose.
>
> I think I must still be missing something. I have dynamic data that I want
> to preserve as I reboot from a failed kernel to a fresh new kernel. By
> the time the fresh kernel starts init, the dynamic data (IP addresses, MAC
> addresses) has been re-written with new values. This is why I'm trying to
> preserve, but I may be running without disk or flash. This patch doesn't
> preserve the data, but it gets it into the kernel so that it can be
> preserved. At present, I'm preserving the data in a panic_notifier function,
> but I am not wedded to that. At present, the data will be copied to a
> section of memory retained across boots, but I know others will want to
> write to flash.
>
I believe you can get everything from /proc/vmcore, if you use kexec,
after crash, with some tools like 'crash'.
>> All of the interesting architectures support kexec, and if an
>> architecture doesn't it isn't hard to add. The architecture specific
>> part is very simple. A pain to debug initially but very simple.
>
> I use MIPS processors, and it looks like it is supported. So long as it's
> stable, I'm happy to use it.
MIPS seems to have some kexec() support, but after looking at
arch/mips/kernel/machine_kexec.c, maybe the support is still
broken? But anyway, you are welcome to work on it. :)
Matt Mackall <[email protected]> writes:
> As much as I like kexec, it loses on memory footprint by about 100x.
> It's not appropriate for all use cases, especially things like
> consumer-grade wireless access points and phones.
In general I agree. The cost of a second kernel and initrd can be
prohibitive in the smallest systems, and if you do a crash capture
with using a standalone app that is reinventing the wheel.
That said. I can happily run kdump with only 16M-20M reserved.
So on many systems the cost is affordable.
Eric
David VomLehn <[email protected]> writes:
>> > 2. Where would you suggest tying in? (Particularly since not all architectures
>> > currently support kdump)
>>
>> No changes are needed kernel side. You just need an appropriate kernel and
>> initrd for your purpose.
>
> I think I must still be missing something. I have dynamic data that I want
> to preserve as I reboot from a failed kernel to a fresh new kernel. By
> the time the fresh kernel starts init, the dynamic data (IP addresses, MAC
> addresses) has been re-written with new values. This is why I'm trying to
> preserve, but I may be running without disk or flash. This patch doesn't
> preserve the data, but it gets it into the kernel so that it can be
> preserved. At present, I'm preserving the data in a panic_notifier function,
> but I am not wedded to that. At present, the data will be copied to a
> section of memory retained across boots, but I know others will want to
> write to flash.
For memory retained across reboots, my inclination is to have a trivial
little tty/console driver that writes into that retained memory. You can
open the tty and write your user space data, and then at kernel panic time
the kernel will write it's data.
When people start writing to flash anything short of the most trivial
writes is non-trivial code that requires sleeps etc, etc. Not
something that is particularly reliable after a kernel failure. My
inclination in that case is to use crash_kexec and a small crash
kernel/initrd or if size is a real issue a standalone binary (ideally
Aboot like so different embedded platforms can share).
I'm sensitive about getting anything extra in the path before we call
crash_kexec as that is likely to lead to destabilization of that path.
Eric
Eric W. Biederman wrote:
> Matt Mackall <[email protected]> writes:
>
>> As much as I like kexec, it loses on memory footprint by about 100x.
>> It's not appropriate for all use cases, especially things like
>> consumer-grade wireless access points and phones.
>
> In general I agree. The cost of a second kernel and initrd can be
> prohibitive in the smallest systems, and if you do a crash capture
> with using a standalone app that is reinventing the wheel.
>
> That said. I can happily run kdump with only 16M-20M reserved.
> So on many systems the cost is affordable.
Understood. On some of my systems, the memory budget for the
entire system is 10M. On most systems I work with, it is a
struggle to reserve even 64K for this feature.
-- Tim
=============================
Tim Bird
Architecture Group Chair, CE Linux Forum
Senior Staff Engineer, Sony Corporation of America
=============================
Tim Bird <[email protected]> writes:
> Eric W. Biederman wrote:
>> Matt Mackall <[email protected]> writes:
>>
>>> As much as I like kexec, it loses on memory footprint by about 100x.
>>> It's not appropriate for all use cases, especially things like
>>> consumer-grade wireless access points and phones.
>>
>> In general I agree. The cost of a second kernel and initrd can be
>> prohibitive in the smallest systems, and if you do a crash capture
>> with using a standalone app that is reinventing the wheel.
>>
>> That said. I can happily run kdump with only 16M-20M reserved.
>> So on many systems the cost is affordable.
>
> Understood. On some of my systems, the memory budget for the
> entire system is 10M. On most systems I work with, it is a
> struggle to reserve even 64K for this feature.
crash_kexec is really a glorified jump. It is possible to do a lot in
64K with a standalone application. If reliable capture of kernel
crashes is desirable to an embedded NAND device I expect a semi-general
purpose dedicated application for capturing at least dmesg from the
crashed kernel and write it to a file on a NAND filesystem could be
worth someones time.
On general purpose hardware we use a kernel and an initrd simply to
reduce the development work of supporting everything and the kitchen
sink. My impression is that embedded systems can afford a little more
setup time, and a custom compilation, and that the hardware you would like
to store things too is much more common.
Eric