Here's a patch based on some discussions I had with Thomas
Gleixner at plumbers conference that implements a generic
layer for persistent storage usable to pass tens or hundreds
of kilobytes of data from the dying breath of a crashing
kernel to its successor.
The usage model I'm envisioning is that a platform driver
will register with this code to provide the actual storage.
I've tried to make this interface general, but I'm working
from a sample of one (the ACPI ERST code), so if anyone else
has some persistent store that can't be handled by this code,
speak up and we can put in the necessary tweaks.
My assumptions are that the data that Linux cares about will
be wrapped in some error record structure with a header, and
possibly a footer that the device code needs. So the driver
specifies how much padding to put around a buffer to make
life easy for it. It also specifies the maximum number of
bytes that can be saved in one record.
There are three callback functions from the generic code to
the driver:
"reader" which iterates over all records currently in the
store - returning type, size and a record identifier as
well as the actual data.
"writer" which writes a record with a type to the persistent store
"eraser" which takes a record identifier, and clears that item
from the store.
The Linux user visible interface is via /sys (similar to
the "efivars" interface)
# ls -l /sys/firmware/pstore
total 0
-r--r--r-- 1 root root 0 2010-11-20 11:03 dmesg-0
--w------- 1 root root 0 2010-11-20 11:03 erase
The "type" of error record I mentioned earlier is used to
name the files ... saved console logs from kmsg_dmp() are
named with a "dmesg" prefix as shown above.
Once an error record has been viewed, analysed, saved. The
user can request it to be cleared by writing its name to the
"erase" file:
# echo "dmesg-0" > erase
Answers to a few questions that I think you might ask:
1) "Why do you only allow one platform driver to register?"
I only have one such driver. Adding more is easy from the "read" side
(just collect all the records from all devices and remember where they
came from so you can call the correct "eraser" function). But the "write"
side opens up questions that I don't have good answers for:
- Which device(s) should error records be written to?
All of them? Start with one and move on when it is
full? Write some types of records to one device?
If someone has a machine with multiple persistent storage devices -
then we can talk about how to answer these questions.
2) "Why do you read in all the data from the device when it
registers and save it in memory? Couldn't you just get the
list of records and pick up the data from the device when
the user reads the file?"
I don't think this is going to be very much data, just a few hundred
kilobytes (i.e. less that $0.01 worth of memory, even expensive server
memory). The memory is freed when the record is erased ... which is
likely to be soon after boot.
3) "/sys/firmware/pstore is the wrong pathname for this".
You are probably right. I put it under "firmware" because that's where
the "efivars" driver put its top level directory. In my case the ERST
back end is firmware, so there is some vague logic to it ... but better
suggestions are welcome. Perhaps /sys/devices/platform/pstore?
4) "/sys is the wrong place for this."
Perhaps. I definitely want to use some sort of filesystem interface (so
each record shows up as a file to the user). This seems a lot cleaner
than trying to map the semantics of actual persistent storage devices
onto a character device. The "sysfs_create_bin_file()" API seems very
well designed for this usage. If not /sys, then where? "debugfs"
would work - but not everyone mounts debugfs. Creating a whole new
filesystem for this seems like overkill.
5) "Why is the record identifier type 'u64'?"
This is one place where I knowingly let the ERST implementation bleed
all the way up to the top - it uses 64-bit record numbers. It would be
possible to map these to something smaller like "int" ... but the code
to do so would be far larger than the memory saved. The most common
usage case is likely to be a software crash with just one "dmesg" record.
6) "Is this widely useful? How many systems have persistent storage?"
Although ERST was only added to the ACPI spec earlier this year, it
merely documents existing functionality required for WHEA (Windows
Hardware Error Architecture). So most modern server systems should
have it (my test system has it, and it has a BIOS written in mid 2008).
Sorry desktops & laptops - no love for you here.
No-sign-off-yet-this-is-just-RFC
-Tony
---
diff --git a/drivers/firmware/Kconfig b/drivers/firmware/Kconfig
index e8b6a13..06afe40 100644
--- a/drivers/firmware/Kconfig
+++ b/drivers/firmware/Kconfig
@@ -134,4 +134,16 @@ config ISCSI_IBFT
detect iSCSI boot parameters dynamically during system boot, say Y.
Otherwise, say N.
+config PSTORE
+ tristate "Persistant store support via /sys"
+ default n
+ help
+ This option enables generic access to platform level persistent
+ storage via /sys/firmware/pstore. Only useful if you have a
+ platform level driver that registers with pstore to provide
+ the data, so you probably should just go say "Y" (or "M") to
+ a platform specific persistent store driver (e.g. ACPI_APEI on
+ X86) which will select this for you. If you don't have a platform
+ persistent store driver, say N.
+
endmenu
diff --git a/drivers/firmware/Makefile b/drivers/firmware/Makefile
index 1c3c173..ba19784 100644
--- a/drivers/firmware/Makefile
+++ b/drivers/firmware/Makefile
@@ -11,3 +11,4 @@ obj-$(CONFIG_DMIID) += dmi-id.o
obj-$(CONFIG_ISCSI_IBFT_FIND) += iscsi_ibft_find.o
obj-$(CONFIG_ISCSI_IBFT) += iscsi_ibft.o
obj-$(CONFIG_FIRMWARE_MEMMAP) += memmap.o
+obj-$(CONFIG_PSTORE) += pstore.o
diff --git a/drivers/firmware/pstore.c b/drivers/firmware/pstore.c
new file mode 100644
index 0000000..e11b454
--- /dev/null
+++ b/drivers/firmware/pstore.c
@@ -0,0 +1,313 @@
+/*
+ * Persistent Storage - pstore.c
+ *
+ * Copyright (C) 2010 Intel Corporation <[email protected]>
+ *
+ * This code is the generic layer to export data records from platform
+ * level persistent storage via sysfs.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/atomic.h>
+#include <linux/types.h>
+#include <linux/errno.h>
+#include <linux/init.h>
+#include <linux/kmsg_dump.h>
+#include <linux/module.h>
+#include <linux/pstore.h>
+#include <linux/string.h>
+#include <linux/sysfs.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+
+MODULE_AUTHOR("Tony Luck <[email protected]>");
+MODULE_DESCRIPTION("sysfs interface to persistent storage");
+MODULE_LICENSE("GPL");
+
+static DEFINE_SPINLOCK(pstore_lock);
+static LIST_HEAD(pstore_list);
+static struct kset *pstore_kset;
+
+#define PSTORE_NAMELEN 16
+
+struct pstore_entry {
+ struct bin_attribute attr;
+ char name[PSTORE_NAMELEN];
+ u64 id;
+ int type;
+ int size;
+ struct list_head list;
+ char data[];
+};
+
+static int pstore_create_sysfs_entry(struct pstore_entry *new_pstore);
+
+static struct pstore_info *psinfo;
+
+static char *pstore_buf;
+
+/*
+ * callback from kmsg_dump. (s2,l2) has the most recently
+ * written bytes, older bytes are in (s1,l1). Save as much
+ * as we can from the end of the buffer.
+ */
+static void
+pstore_dump(struct kmsg_dumper *dumper, enum kmsg_dump_reason reason,
+ const char *s1, unsigned long l1,
+ const char *s2, unsigned long l2)
+{
+ unsigned long s1_start, s2_start;
+ unsigned long l1_cpy, l2_cpy;
+ char *dst = pstore_buf + psinfo->header_size;
+
+ /* Don't dump oopses to persistent store */
+ if (reason == KMSG_DUMP_OOPS)
+ return;
+
+ l2_cpy = min(l2, psinfo->data_size);
+ l1_cpy = min(l1, psinfo->data_size - l2_cpy);
+
+ s2_start = l2 - l2_cpy;
+ s1_start = l1 - l1_cpy;
+
+ memcpy(dst, s1 + s1_start, l1_cpy);
+ memcpy(dst + l1_cpy, s2 + s2_start, l2_cpy);
+
+ psinfo->writer(PSTORE_DMESG, pstore_buf, l1_cpy + l2_cpy);
+}
+
+static struct kmsg_dumper pstore_dumper = {
+ .dump = pstore_dump,
+};
+
+/*
+ * platform specific persistent storage driver registers with
+ * us here. Read out all the records right away and install
+ * them in /sys. Register with kmsg_dump to save last part
+ * of console log on panic.
+ */
+int
+pstore_register(struct pstore_info *psi)
+{
+ struct pstore_entry *new_pstore;
+ int rc = 0, type;
+ unsigned long size;
+ u64 id;
+ unsigned long ps_maxsize;
+
+ spin_lock(&pstore_lock);
+ if (psinfo) {
+ spin_unlock(&pstore_lock);
+ return -EBUSY;
+ }
+ psinfo = psi;
+ spin_unlock(&pstore_lock);
+
+ ps_maxsize = psi->header_size + psi->data_size + psi->footer_size;
+ pstore_buf = kzalloc(ps_maxsize, GFP_KERNEL);
+ if (!pstore_buf)
+ return -ENOMEM;
+ for (;;) {
+ if (psi->reader(&id, &type, pstore_buf, &size) <= 0)
+ break;
+ new_pstore = kzalloc(sizeof(struct pstore_entry) + size,
+ GFP_KERNEL);
+ if (!new_pstore) {
+ rc = -ENOMEM;
+ break;
+ }
+ new_pstore->id = id;
+ new_pstore->type = type;
+ new_pstore->size = size;
+ memcpy(new_pstore->data, pstore_buf + psi->header_size, size);
+ if (pstore_create_sysfs_entry(new_pstore)) {
+ kfree(new_pstore);
+ rc = -EINVAL;
+ break;
+ }
+ }
+
+ kobject_uevent(&pstore_kset->kobj, KOBJ_ADD);
+
+ kmsg_dump_register(&pstore_dumper);
+
+ return rc;
+}
+EXPORT_SYMBOL_GPL(pstore_register);
+
+int
+pstore_write(int type, char *buf, unsigned long size)
+{
+ if (!psinfo->writer)
+ return -ENODEV;
+ if (size > psinfo->data_size)
+ return -EFBIG;
+
+ memcpy(pstore_buf + psinfo->header_size, buf, size);
+ return psinfo->writer(type, pstore_buf, size);
+}
+EXPORT_SYMBOL_GPL(pstore_write);
+
+#define to_pstore_entry(obj) container_of(obj, struct pstore_entry, attr)
+
+/*
+ * "read" function for files containing persistent store records
+ */
+static ssize_t pstore_show(struct file *filp, struct kobject *kobj,
+ struct bin_attribute *bin_attr, char *buf,
+ loff_t offset, size_t count)
+{
+ struct pstore_entry *ps = to_pstore_entry(bin_attr);
+
+ return memory_read_from_buffer(buf, count, &offset,
+ ps->data, ps->size);
+}
+
+/*
+ * Erase records by writing their filename to the "erase" file. E.g.
+ * # echo "dmesg-0" > erase
+ */
+static ssize_t pstore_erase(struct file *filp, struct kobject *kobj,
+ struct bin_attribute *bin_attr,
+ char *buf, loff_t pos, size_t count)
+{
+ struct pstore_entry *search_pstore, *n;
+ int len1, len2, found = 0;
+
+ len1 = count;
+ if (buf[len1 - 1] == '\n')
+ len1--;
+
+ spin_lock(&pstore_lock);
+
+ /*
+ * Find this record
+ */
+ list_for_each_entry_safe(search_pstore, n, &pstore_list, list) {
+ len2 = strlen(search_pstore->name);
+ if (len1 == len2 && memcmp(buf, search_pstore->name,
+ len1) == 0) {
+ found = 1;
+ break;
+ }
+ }
+ if (!found) {
+ spin_unlock(&pstore_lock);
+ return -EINVAL;
+ }
+
+ if (psinfo->eraser)
+ if (psinfo->eraser(search_pstore->id)) {
+ spin_unlock(&pstore_lock);
+ return -EIO;
+ }
+
+ list_del(&search_pstore->list);
+
+ spin_unlock(&pstore_lock);
+
+ sysfs_remove_bin_file(&pstore_kset->kobj, &search_pstore->attr);
+
+ return count;
+}
+
+static struct bin_attribute attr_erase = {
+ .attr = {.name = "erase", .mode = 0200},
+ .write = pstore_erase,
+};
+
+static int
+pstore_create_sysfs_entry(struct pstore_entry *new_pstore)
+{
+ static atomic_t next;
+ int error, seq;
+
+ seq = atomic_add_return(1, &next);
+
+ switch (new_pstore->type) {
+ case PSTORE_DMESG:
+ sprintf(new_pstore->name, "dmesg-%d", seq);
+ break;
+ case PSTORE_MCE:
+ sprintf(new_pstore->name, "mce-%d", seq);
+ break;
+ default:
+ sprintf(new_pstore->name, "type%d-%d", new_pstore->type, seq);
+ break;
+ }
+
+ sysfs_attr_init(&new_pstore->attr.attr);
+ new_pstore->attr.size = 0;
+ new_pstore->attr.read = pstore_show;
+ new_pstore->attr.attr.name = new_pstore->name;
+ new_pstore->attr.attr.mode = 0444;
+ error = sysfs_create_bin_file(&pstore_kset->kobj, &new_pstore->attr);
+ if (!error) {
+ spin_lock(&pstore_lock);
+ list_add(&new_pstore->list, &pstore_list);
+ spin_unlock(&pstore_lock);
+ }
+ return error;
+}
+
+static int __init
+pstore_init(void)
+{
+ int error = 0;
+
+ /* Register the pstore directory at /sys/firmware/pstore */
+ pstore_kset = kset_create_and_add("pstore", NULL, firmware_kobj);
+ if (!pstore_kset) {
+ printk(KERN_ERR "pstore: Subsystem registration failed.\n");
+ return -ENOMEM;
+ }
+
+ /*
+ * Add attribute to allow records to be erased from persistent store
+ */
+ error = sysfs_create_bin_file(&pstore_kset->kobj,
+ &attr_erase);
+ if (error) {
+ printk(KERN_ERR "pstore: unable to create 'erase' sysfs file"
+ " due to error %d\n", error);
+ kset_unregister(pstore_kset);
+ }
+
+ return error;
+}
+
+static void __exit
+pstore_exit(void)
+{
+ struct pstore_entry *entry, *n;
+
+ if (psinfo)
+ kmsg_dump_unregister(&pstore_dumper);
+
+ list_for_each_entry_safe(entry, n, &pstore_list, list) {
+ spin_lock(&pstore_lock);
+ list_del(&entry->list);
+ spin_unlock(&pstore_lock);
+ sysfs_remove_bin_file(&pstore_kset->kobj, &entry->attr);
+ }
+ sysfs_remove_bin_file(&pstore_kset->kobj, &attr_erase);
+
+ kset_unregister(pstore_kset);
+
+ kfree(pstore_buf);
+}
+
+module_init(pstore_init);
+module_exit(pstore_exit);
diff --git a/include/linux/pstore.h b/include/linux/pstore.h
new file mode 100644
index 0000000..785ad86
--- /dev/null
+++ b/include/linux/pstore.h
@@ -0,0 +1,54 @@
+/*
+ * Persistent Storage - pstore.h
+ *
+ * Copyright (C) 2010 Intel Corporation <[email protected]>
+ *
+ * This code is the generic layer to export data records from platform
+ * level persistent storage via sysfs.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+#ifndef _LINUX_PSTORE_H
+#define _LINUX_PSTORE_H
+
+/* types */
+#define PSTORE_DMESG 0
+#define PSTORE_MCE 1
+
+struct pstore_info {
+ unsigned long header_size;
+ unsigned long data_size;
+ unsigned long footer_size;
+ int (*reader)(u64 *id, int *type, char *buf, unsigned long *size);
+ int (*writer)(int type, char *buf, unsigned long size);
+ int (*eraser)(u64 id);
+};
+
+#if defined(CONFIG_PSTORE) || defined(CONFIG_PSTORE_MODULE)
+extern int pstore_register(struct pstore_info *);
+extern int pstore_write(int type, char *buf, unsigned long size);
+#else
+static inline int
+pstore_register(struct pstore_info *psi)
+{
+ return -ENODEV;
+}
+static inline int
+pstore_write(int type, char *buf, unsigned long size)
+{
+ return -ENODEV;
+}
+#endif
+
+#endif /*_LINUX_PSTORE_H*/
Hi Tony,
this is actually very cool.
See below for minor nitpicks:
On Sat, Nov 20, 2010 at 03:48:19PM -0800, Luck, Tony wrote:
> Here's a patch based on some discussions I had with Thomas
> Gleixner at plumbers conference that implements a generic
> layer for persistent storage usable to pass tens or hundreds
> of kilobytes of data from the dying breath of a crashing
> kernel to its successor.
>
> The usage model I'm envisioning is that a platform driver
> will register with this code to provide the actual storage.
> I've tried to make this interface general, but I'm working
> from a sample of one (the ACPI ERST code), so if anyone else
> has some persistent store that can't be handled by this code,
> speak up and we can put in the necessary tweaks.
>
> My assumptions are that the data that Linux cares about will
> be wrapped in some error record structure with a header, and
> possibly a footer that the device code needs. So the driver
> specifies how much padding to put around a buffer to make
> life easy for it. It also specifies the maximum number of
> bytes that can be saved in one record.
>
> There are three callback functions from the generic code to
> the driver:
>
> "reader" which iterates over all records currently in the
> store - returning type, size and a record identifier as
> well as the actual data.
>
> "writer" which writes a record with a type to the persistent store
>
> "eraser" which takes a record identifier, and clears that item
> from the store.
>
>
> The Linux user visible interface is via /sys (similar to
> the "efivars" interface)
>
> # ls -l /sys/firmware/pstore
> total 0
> -r--r--r-- 1 root root 0 2010-11-20 11:03 dmesg-0
> --w------- 1 root root 0 2010-11-20 11:03 erase
>
> The "type" of error record I mentioned earlier is used to
> name the files ... saved console logs from kmsg_dmp() are
> named with a "dmesg" prefix as shown above.
>
> Once an error record has been viewed, analysed, saved. The
> user can request it to be cleared by writing its name to the
> "erase" file:
>
> # echo "dmesg-0" > erase
>
> Answers to a few questions that I think you might ask:
>
> 1) "Why do you only allow one platform driver to register?"
> I only have one such driver. Adding more is easy from the "read" side
> (just collect all the records from all devices and remember where they
> came from so you can call the correct "eraser" function). But the "write"
> side opens up questions that I don't have good answers for:
> - Which device(s) should error records be written to?
> All of them? Start with one and move on when it is
> full? Write some types of records to one device?
Maybe based on the error type? We definitely need one device which
should contain all the records, something like main pstore device.
> If someone has a machine with multiple persistent storage devices -
> then we can talk about how to answer these questions.
>
> 2) "Why do you read in all the data from the device when it
> registers and save it in memory? Couldn't you just get the
> list of records and pick up the data from the device when
> the user reads the file?"
> I don't think this is going to be very much data, just a few hundred
> kilobytes (i.e. less that $0.01 worth of memory, even expensive server
> memory). The memory is freed when the record is erased ... which is
> likely to be soon after boot.
This is actually a nice idea from the aspect that when those files
appear in sysfs on boot, a userspace daemon can check for their
existence and tell the user that she has valid error records from the
last crash and she could report them to lkml/hw vendor/...
> 3) "/sys/firmware/pstore is the wrong pathname for this".
> You are probably right. I put it under "firmware" because that's where
> the "efivars" driver put its top level directory. In my case the ERST
> back end is firmware, so there is some vague logic to it ... but better
> suggestions are welcome. Perhaps /sys/devices/platform/pstore?
>
> 4) "/sys is the wrong place for this."
> Perhaps. I definitely want to use some sort of filesystem interface (so
> each record shows up as a file to the user). This seems a lot cleaner
> than trying to map the semantics of actual persistent storage devices
> onto a character device. The "sysfs_create_bin_file()" API seems very
> well designed for this usage. If not /sys, then where? "debugfs"
> would work - but not everyone mounts debugfs. Creating a whole new
> filesystem for this seems like overkill.
No, I think /sys is the right place for it being always present and
all. Besides, for example, all the ACPI tables are there anyway
(/sys/firmware/acpi/tables/) so pstore won't be the first blob there.
/sys/firmware might not be all that sensible if someone comes up with
persistent storage type which is the network but we'll change that then.
> 5) "Why is the record identifier type 'u64'?"
> This is one place where I knowingly let the ERST implementation bleed
> all the way up to the top - it uses 64-bit record numbers. It would be
> possible to map these to something smaller like "int" ... but the code
> to do so would be far larger than the memory saved. The most common
> usage case is likely to be a software crash with just one "dmesg" record.
>
> 6) "Is this widely useful? How many systems have persistent storage?"
> Although ERST was only added to the ACPI spec earlier this year, it
> merely documents existing functionality required for WHEA (Windows
> Hardware Error Architecture). So most modern server systems should
> have it (my test system has it, and it has a BIOS written in mid 2008).
> Sorry desktops & laptops - no love for you here.
>
> No-sign-off-yet-this-is-just-RFC
>
> -Tony
>
> ---
> diff --git a/drivers/firmware/Kconfig b/drivers/firmware/Kconfig
> index e8b6a13..06afe40 100644
> --- a/drivers/firmware/Kconfig
> +++ b/drivers/firmware/Kconfig
> @@ -134,4 +134,16 @@ config ISCSI_IBFT
> detect iSCSI boot parameters dynamically during system boot, say Y.
> Otherwise, say N.
>
> +config PSTORE
> + tristate "Persistant store support via /sys"
> + default n
> + help
> + This option enables generic access to platform level persistent
> + storage via /sys/firmware/pstore. Only useful if you have a
> + platform level driver that registers with pstore to provide
> + the data, so you probably should just go say "Y" (or "M") to
> + a platform specific persistent store driver (e.g. ACPI_APEI on
> + X86) which will select this for you. If you don't have a platform
> + persistent store driver, say N.
> +
> endmenu
> diff --git a/drivers/firmware/Makefile b/drivers/firmware/Makefile
> index 1c3c173..ba19784 100644
> --- a/drivers/firmware/Makefile
> +++ b/drivers/firmware/Makefile
> @@ -11,3 +11,4 @@ obj-$(CONFIG_DMIID) += dmi-id.o
> obj-$(CONFIG_ISCSI_IBFT_FIND) += iscsi_ibft_find.o
> obj-$(CONFIG_ISCSI_IBFT) += iscsi_ibft.o
> obj-$(CONFIG_FIRMWARE_MEMMAP) += memmap.o
> +obj-$(CONFIG_PSTORE) += pstore.o
> diff --git a/drivers/firmware/pstore.c b/drivers/firmware/pstore.c
> new file mode 100644
> index 0000000..e11b454
> --- /dev/null
> +++ b/drivers/firmware/pstore.c
> @@ -0,0 +1,313 @@
> +/*
> + * Persistent Storage - pstore.c
> + *
> + * Copyright (C) 2010 Intel Corporation <[email protected]>
> + *
> + * This code is the generic layer to export data records from platform
> + * level persistent storage via sysfs.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
> + */
> +
> +#include <linux/atomic.h>
> +#include <linux/types.h>
> +#include <linux/errno.h>
> +#include <linux/init.h>
> +#include <linux/kmsg_dump.h>
> +#include <linux/module.h>
> +#include <linux/pstore.h>
> +#include <linux/string.h>
> +#include <linux/sysfs.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +
> +MODULE_AUTHOR("Tony Luck <[email protected]>");
> +MODULE_DESCRIPTION("sysfs interface to persistent storage");
> +MODULE_LICENSE("GPL");
> +
> +static DEFINE_SPINLOCK(pstore_lock);
> +static LIST_HEAD(pstore_list);
> +static struct kset *pstore_kset;
> +
> +#define PSTORE_NAMELEN 16
> +
> +struct pstore_entry {
> + struct bin_attribute attr;
> + char name[PSTORE_NAMELEN];
> + u64 id;
> + int type;
> + int size;
> + struct list_head list;
> + char data[];
> +};
> +
> +static int pstore_create_sysfs_entry(struct pstore_entry *new_pstore);
> +
> +static struct pstore_info *psinfo;
> +
> +static char *pstore_buf;
> +
> +/*
> + * callback from kmsg_dump. (s2,l2) has the most recently
> + * written bytes, older bytes are in (s1,l1). Save as much
> + * as we can from the end of the buffer.
> + */
> +static void
> +pstore_dump(struct kmsg_dumper *dumper, enum kmsg_dump_reason reason,
> + const char *s1, unsigned long l1,
> + const char *s2, unsigned long l2)
> +{
> + unsigned long s1_start, s2_start;
> + unsigned long l1_cpy, l2_cpy;
> + char *dst = pstore_buf + psinfo->header_size;
> +
> + /* Don't dump oopses to persistent store */
> + if (reason == KMSG_DUMP_OOPS)
> + return;
> +
> + l2_cpy = min(l2, psinfo->data_size);
> + l1_cpy = min(l1, psinfo->data_size - l2_cpy);
> +
> + s2_start = l2 - l2_cpy;
> + s1_start = l1 - l1_cpy;
> +
> + memcpy(dst, s1 + s1_start, l1_cpy);
> + memcpy(dst + l1_cpy, s2 + s2_start, l2_cpy);
> +
> + psinfo->writer(PSTORE_DMESG, pstore_buf, l1_cpy + l2_cpy);
> +}
> +
> +static struct kmsg_dumper pstore_dumper = {
> + .dump = pstore_dump,
> +};
> +
> +/*
> + * platform specific persistent storage driver registers with
> + * us here. Read out all the records right away and install
> + * them in /sys. Register with kmsg_dump to save last part
> + * of console log on panic.
> + */
> +int
> +pstore_register(struct pstore_info *psi)
> +{
> + struct pstore_entry *new_pstore;
> + int rc = 0, type;
> + unsigned long size;
> + u64 id;
> + unsigned long ps_maxsize;
Alignment here? Maybe something like this:
struct pstore_entry *new_pstore;
unsigned long ps_maxsize;
int rc = 0, type;
> +
> + spin_lock(&pstore_lock);
> + if (psinfo) {
> + spin_unlock(&pstore_lock);
> + return -EBUSY;
> + }
> + psinfo = psi;
> + spin_unlock(&pstore_lock);
Maybe make this lockless with cmpxchg? OTOH, the spinlock would be
easier when you have multiple persistent storage devices...
> + ps_maxsize = psi->header_size + psi->data_size + psi->footer_size;
> + pstore_buf = kzalloc(ps_maxsize, GFP_KERNEL);
> + if (!pstore_buf)
> + return -ENOMEM;
newline
> + for (;;) {
> + if (psi->reader(&id, &type, pstore_buf, &size) <= 0)
> + break;
> + new_pstore = kzalloc(sizeof(struct pstore_entry) + size,
> + GFP_KERNEL);
> + if (!new_pstore) {
> + rc = -ENOMEM;
> + break;
> + }
> + new_pstore->id = id;
> + new_pstore->type = type;
> + new_pstore->size = size;
> + memcpy(new_pstore->data, pstore_buf + psi->header_size, size);
> + if (pstore_create_sysfs_entry(new_pstore)) {
> + kfree(new_pstore);
> + rc = -EINVAL;
> + break;
> + }
> + }
> +
> + kobject_uevent(&pstore_kset->kobj, KOBJ_ADD);
> +
> + kmsg_dump_register(&pstore_dumper);
> +
> + return rc;
> +}
> +EXPORT_SYMBOL_GPL(pstore_register);
> +
> +int
> +pstore_write(int type, char *buf, unsigned long size)
> +{
> + if (!psinfo->writer)
> + return -ENODEV;
I think you should move this check to the pstore_register() above. We
don't want persistent storage backends which do not implement ->write
anyway since the whole point of them becomes moot, no?
> + if (size > psinfo->data_size)
> + return -EFBIG;
> +
> + memcpy(pstore_buf + psinfo->header_size, buf, size);
> + return psinfo->writer(type, pstore_buf, size);
> +}
> +EXPORT_SYMBOL_GPL(pstore_write);
> +
> +#define to_pstore_entry(obj) container_of(obj, struct pstore_entry, attr)
> +
> +/*
> + * "read" function for files containing persistent store records
> + */
> +static ssize_t pstore_show(struct file *filp, struct kobject *kobj,
> + struct bin_attribute *bin_attr, char *buf,
> + loff_t offset, size_t count)
> +{
> + struct pstore_entry *ps = to_pstore_entry(bin_attr);
> +
> + return memory_read_from_buffer(buf, count, &offset,
> + ps->data, ps->size);
> +}
> +
> +/*
> + * Erase records by writing their filename to the "erase" file. E.g.
> + * # echo "dmesg-0" > erase
> + */
> +static ssize_t pstore_erase(struct file *filp, struct kobject *kobj,
> + struct bin_attribute *bin_attr,
> + char *buf, loff_t pos, size_t count)
> +{
> + struct pstore_entry *search_pstore, *n;
> + int len1, len2, found = 0;
> +
> + len1 = count;
> + if (buf[len1 - 1] == '\n')
> + len1--;
> +
> + spin_lock(&pstore_lock);
> +
> + /*
> + * Find this record
> + */
> + list_for_each_entry_safe(search_pstore, n, &pstore_list, list) {
> + len2 = strlen(search_pstore->name);
> + if (len1 == len2 && memcmp(buf, search_pstore->name,
> + len1) == 0) {
> + found = 1;
> + break;
> + }
> + }
> + if (!found) {
> + spin_unlock(&pstore_lock);
> + return -EINVAL;
> + }
> +
> + if (psinfo->eraser)
> + if (psinfo->eraser(search_pstore->id)) {
> + spin_unlock(&pstore_lock);
> + return -EIO;
> + }
> +
> + list_del(&search_pstore->list);
> +
> + spin_unlock(&pstore_lock);
> +
> + sysfs_remove_bin_file(&pstore_kset->kobj, &search_pstore->attr);
AFAICT, you might want to remove the sysfs file _before_ you remove it
from the pstore list/erase its contents, otherwise concurrent accesses
to it from userspace readers might make us go boom.
> +
> + return count;
> +}
> +
> +static struct bin_attribute attr_erase = {
> + .attr = {.name = "erase", .mode = 0200},
> + .write = pstore_erase,
> +};
> +
> +static int
> +pstore_create_sysfs_entry(struct pstore_entry *new_pstore)
> +{
> + static atomic_t next;
> + int error, seq;
> +
> + seq = atomic_add_return(1, &next);
> +
> + switch (new_pstore->type) {
> + case PSTORE_DMESG:
> + sprintf(new_pstore->name, "dmesg-%d", seq);
> + break;
> + case PSTORE_MCE:
> + sprintf(new_pstore->name, "mce-%d", seq);
> + break;
> + default:
> + sprintf(new_pstore->name, "type%d-%d", new_pstore->type, seq);
> + break;
> + }
> +
> + sysfs_attr_init(&new_pstore->attr.attr);
> + new_pstore->attr.size = 0;
> + new_pstore->attr.read = pstore_show;
> + new_pstore->attr.attr.name = new_pstore->name;
> + new_pstore->attr.attr.mode = 0444;
> + error = sysfs_create_bin_file(&pstore_kset->kobj, &new_pstore->attr);
> + if (!error) {
> + spin_lock(&pstore_lock);
> + list_add(&new_pstore->list, &pstore_list);
> + spin_unlock(&pstore_lock);
> + }
> + return error;
> +}
> +
> +static int __init
> +pstore_init(void)
> +{
> + int error = 0;
> +
> + /* Register the pstore directory at /sys/firmware/pstore */
> + pstore_kset = kset_create_and_add("pstore", NULL, firmware_kobj);
> + if (!pstore_kset) {
> + printk(KERN_ERR "pstore: Subsystem registration failed.\n");
> + return -ENOMEM;
> + }
> +
> + /*
> + * Add attribute to allow records to be erased from persistent store
> + */
> + error = sysfs_create_bin_file(&pstore_kset->kobj,
> + &attr_erase);
> + if (error) {
> + printk(KERN_ERR "pstore: unable to create 'erase' sysfs file"
> + " due to error %d\n", error);
> + kset_unregister(pstore_kset);
> + }
> +
> + return error;
> +}
> +
> +static void __exit
> +pstore_exit(void)
> +{
> + struct pstore_entry *entry, *n;
> +
> + if (psinfo)
> + kmsg_dump_unregister(&pstore_dumper);
> +
> + list_for_each_entry_safe(entry, n, &pstore_list, list) {
> + spin_lock(&pstore_lock);
> + list_del(&entry->list);
> + spin_unlock(&pstore_lock);
> + sysfs_remove_bin_file(&pstore_kset->kobj, &entry->attr);
> + }
> + sysfs_remove_bin_file(&pstore_kset->kobj, &attr_erase);
ditto.
> +
> + kset_unregister(pstore_kset);
> +
> + kfree(pstore_buf);
> +}
> +
> +module_init(pstore_init);
> +module_exit(pstore_exit);
> diff --git a/include/linux/pstore.h b/include/linux/pstore.h
> new file mode 100644
> index 0000000..785ad86
> --- /dev/null
> +++ b/include/linux/pstore.h
> @@ -0,0 +1,54 @@
> +/*
> + * Persistent Storage - pstore.h
> + *
> + * Copyright (C) 2010 Intel Corporation <[email protected]>
> + *
> + * This code is the generic layer to export data records from platform
> + * level persistent storage via sysfs.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
> + */
> +#ifndef _LINUX_PSTORE_H
> +#define _LINUX_PSTORE_H
> +
> +/* types */
> +#define PSTORE_DMESG 0
> +#define PSTORE_MCE 1
maybe PSTORE_TYPE_DMESG/PSTORE_TYPE_MCE ?
> +
> +struct pstore_info {
> + unsigned long header_size;
> + unsigned long data_size;
> + unsigned long footer_size;
> + int (*reader)(u64 *id, int *type, char *buf, unsigned long *size);
> + int (*writer)(int type, char *buf, unsigned long size);
> + int (*eraser)(u64 id);
> +};
> +
> +#if defined(CONFIG_PSTORE) || defined(CONFIG_PSTORE_MODULE)
> +extern int pstore_register(struct pstore_info *);
> +extern int pstore_write(int type, char *buf, unsigned long size);
> +#else
> +static inline int
> +pstore_register(struct pstore_info *psi)
> +{
> + return -ENODEV;
> +}
> +static inline int
> +pstore_write(int type, char *buf, unsigned long size)
> +{
> + return -ENODEV;
> +}
> +#endif
> +
> +#endif /*_LINUX_PSTORE_H*/
> --
Thanks.
--
Regards/Gruss,
Boris.
From: "Luck, Tony" <[email protected]>
Date: Sat, 20 Nov 2010 15:48:19 -0800
> The usage model I'm envisioning is that a platform driver
> will register with this code to provide the actual storage.
> I've tried to make this interface general, but I'm working
> from a sample of one (the ACPI ERST code), so if anyone else
> has some persistent store that can't be handled by this code,
> speak up and we can put in the necessary tweaks.
>
> My assumptions are that the data that Linux cares about will
> be wrapped in some error record structure with a header, and
> possibly a footer that the device code needs. So the driver
> specifies how much padding to put around a buffer to make
> life easy for it. It also specifies the maximum number of
> bytes that can be saved in one record.
Thanks for doing this work Tony.
On sparc64 I can mark an arbitrary region of physical memory
as persistent across soft reboots. So I'll likely use that
to implement these interfaces.
On Sun, Nov 21, 2010 at 1:07 AM, Borislav Petkov <[email protected]> wrote:
> this is actually very cool.
Thanks for looking at it (and more thanks for this endorsement!)
>> 1) "Why do you only allow one platform driver to register?"
>> ? I only have one such driver. ?Adding more is easy from the "read" side
>> ? (just collect all the records from all devices and remember where they
>> ? came from so you can call the correct "eraser" function). ?But the "write"
>> ? side opens up questions that I don't have good answers for:
>> ? - Which device(s) should error records be written to?
>> ? ? All of them? Start with one and move on when it is
>> ? ? full? ?Write some types of records to one device?
>
> Maybe based on the error type? We definitely need one device which
> should contain all the records, something like main pstore device.
But who decides which records go where? And which device gets to be
the "main" one? I don't think that we can usefully do this in the
registration mechanism (how does a driver know that other drivers even
exist?). I continue to want to defer this until someone with two or more
devices on one machine steps forward.
>> 3) "/sys/firmware/pstore is the wrong pathname for this".
>> ? You are probably right. I put it under "firmware" because that's where
>> ? the "efivars" driver put its top level directory. In my case the ERST
>> ? back end is firmware, so there is some vague logic to it ... but better
>> ? suggestions are welcome. Perhaps /sys/devices/platform/pstore?
>>
>> 4) "/sys is the wrong place for this."
>> ? Perhaps. ?I definitely want to use some sort of filesystem interface (so
>> ? each record shows up as a file to the user). This seems a lot cleaner
>> ? than trying to map the semantics of actual persistent storage devices
>> ? onto a character device. ?The "sysfs_create_bin_file()" API seems very
>> ? well designed for this usage. ?If not /sys, then where? ?"debugfs"
>> ? would work - but not everyone mounts debugfs. Creating a whole new
>> ? filesystem for this seems like overkill.
>
> No, I think /sys is the right place for it being always present and
> all. Besides, for example, all the ACPI tables are there anyway
> (/sys/firmware/acpi/tables/) so pstore won't be the first blob there.
Heh! The acpi tables code is where I found out how easy it was to handle
blobs bigger than PAGE_SIZE using memory_read_from_buffer().
>
> /sys/firmware might not be all that sensible if someone comes up with
> persistent storage type which is the network but we'll change that then.
I'd like to get the right place first time - change means having to update
any applications that coded in the pathname.
>> +int
>> +pstore_register(struct pstore_info *psi)
>> +{
>> + ? ? struct ?pstore_entry ? ?*new_pstore;
>> + ? ? int ? ? rc = 0, type;
>> + ? ? unsigned long size;
>> + ? ? u64 ? ? id;
>> + ? ? unsigned long ps_maxsize;
>
> Alignment here? Maybe something like this:
>
> ? ? ? ?struct pstore_entry ? ? *new_pstore;
> ? ? ? ?unsigned long ? ? ? ? ? ps_maxsize;
> ? ? ? ?int ? ? ? ? ? ? ? ? ? ? rc = 0, type;
Are you talking about textual alignment of the declarations? Yours
does look prettier.
>> +
>> + ? ? spin_lock(&pstore_lock);
>> + ? ? if (psinfo) {
>> + ? ? ? ? ? ? spin_unlock(&pstore_lock);
>> + ? ? ? ? ? ? return -EBUSY;
>> + ? ? }
>> + ? ? psinfo = psi;
>> + ? ? spin_unlock(&pstore_lock);
>
> Maybe make this lockless with cmpxchg? OTOH, the spinlock would be
> easier when you have multiple persistent storage devices...
cmpxchg would make the code shorter - I'll try recoding and see if I like
the way it looks.
>> + ? ? ps_maxsize = psi->header_size + psi->data_size + psi->footer_size;
>> + ? ? pstore_buf = kzalloc(ps_maxsize, GFP_KERNEL);
>> + ? ? if (!pstore_buf)
>> + ? ? ? ? ? ? return -ENOMEM;
>
> newline
Yup. Will fix.
>> +int
>> +pstore_write(int type, char *buf, unsigned long size)
>> +{
>> + ? ? if (!psinfo->writer)
>> + ? ? ? ? ? ? return -ENODEV;
>
> I think you should move this check to the pstore_register() above. We
> don't want persistent storage backends which do not implement ->write
> anyway since the whole point of them becomes moot, no?
Doh! Yes, of course. Will fix.
>> + ? ? list_del(&search_pstore->list);
>> +
>> + ? ? spin_unlock(&pstore_lock);
>> +
>> + ? ? sysfs_remove_bin_file(&pstore_kset->kobj, &search_pstore->attr);
>
> AFAICT, you might want to remove the sysfs file _before_ you remove it
> from the pstore list/erase its contents, otherwise concurrent accesses
> to it from userspace readers might make us go boom.
I'll think about the ordering. I have three things to do here:
1) Remove from the pstore_list
2) Call platform driver to erase from store
3) Call sysfs to remove the binfile.
Potentially the erase could fail ... and I should probably notice
that and do something.
>> +/* types */
>> +#define ? ? ?PSTORE_DMESG ? ?0
>> +#define ? ? ?PSTORE_MCE ? ? ?1
>
> maybe PSTORE_TYPE_DMESG/PSTORE_TYPE_MCE ?
Much better (I suck at naming things). Will fix.
-Tony
Hi, Tony,
On Sun, 2010-11-21 at 07:48 +0800, Luck, Tony wrote:
> Here's a patch based on some discussions I had with Thomas
> Gleixner at plumbers conference that implements a generic
> layer for persistent storage usable to pass tens or hundreds
> of kilobytes of data from the dying breath of a crashing
> kernel to its successor.
>
> The usage model I'm envisioning is that a platform driver
> will register with this code to provide the actual storage.
> I've tried to make this interface general, but I'm working
> from a sample of one (the ACPI ERST code), so if anyone else
> has some persistent store that can't be handled by this code,
> speak up and we can put in the necessary tweaks.
>
> My assumptions are that the data that Linux cares about will
> be wrapped in some error record structure with a header, and
> possibly a footer that the device code needs. So the driver
> specifies how much padding to put around a buffer to make
> life easy for it. It also specifies the maximum number of
> bytes that can be saved in one record.
This patch provides a general "read" interface for kmsg_dumper and some
other persistent storage users. Another possible choice is to just
extend the original interface to add persistent store support. For
example, we can add a "read" function in kmsg_dumper, and output the
content of persistent store via extend /dev/kmsg via prefix every line
comes from persistent store or adding some "ioctl" to do that. (But it
seems that nobody likes "ioctl").
> There are three callback functions from the generic code to
> the driver:
>
> "reader" which iterates over all records currently in the
> store - returning type, size and a record identifier as
> well as the actual data.
>
> "writer" which writes a record with a type to the persistent store
I think it is necessary to require this to be NMI safe (in comments?),
because hardware error handler may need to write to persistent storage
in NMI context. Or we can add a "flag" field to let storage provider
advocate its capability of NMI safe.
> "eraser" which takes a record identifier, and clears that item
> from the store.
>
>
> The Linux user visible interface is via /sys (similar to
> the "efivars" interface)
>
> # ls -l /sys/firmware/pstore
> total 0
> -r--r--r-- 1 root root 0 2010-11-20 11:03 dmesg-0
> --w------- 1 root root 0 2010-11-20 11:03 erase
>
> The "type" of error record I mentioned earlier is used to
> name the files ... saved console logs from kmsg_dmp() are
> named with a "dmesg" prefix as shown above.
>
> Once an error record has been viewed, analysed, saved. The
> user can request it to be cleared by writing its name to the
> "erase" file:
>
> # echo "dmesg-0" > erase
>
> Answers to a few questions that I think you might ask:
>
> 1) "Why do you only allow one platform driver to register?"
> I only have one such driver. Adding more is easy from the "read" side
> (just collect all the records from all devices and remember where they
> came from so you can call the correct "eraser" function). But the "write"
> side opens up questions that I don't have good answers for:
> - Which device(s) should error records be written to?
> All of them? Start with one and move on when it is
> full? Write some types of records to one device?
The persistent storage may be full, and the writing may fail. So I think
we can just try to write one by one, until the first success writing.
[...]
> + * Erase records by writing their filename to the "erase" file. E.g.
> + * # echo "dmesg-0" > erase
> + */
> +static ssize_t pstore_erase(struct file *filp, struct kobject *kobj,
> + struct bin_attribute *bin_attr,
> + char *buf, loff_t pos, size_t count)
> +{
> + struct pstore_entry *search_pstore, *n;
> + int len1, len2, found = 0;
> +
> + len1 = count;
> + if (buf[len1 - 1] == '\n')
> + len1--;
> +
> + spin_lock(&pstore_lock);
> +
> + /*
> + * Find this record
> + */
> + list_for_each_entry_safe(search_pstore, n, &pstore_list, list) {
> + len2 = strlen(search_pstore->name);
> + if (len1 == len2 && memcmp(buf, search_pstore->name,
> + len1) == 0) {
> + found = 1;
> + break;
> + }
> + }
> + if (!found) {
> + spin_unlock(&pstore_lock);
> + return -EINVAL;
> + }
> +
> + if (psinfo->eraser)
> + if (psinfo->eraser(search_pstore->id)) {
> + spin_unlock(&pstore_lock);
> + return -EIO;
> + }
> +
> + list_del(&search_pstore->list);
> +
> + spin_unlock(&pstore_lock);
> +
> + sysfs_remove_bin_file(&pstore_kset->kobj, &search_pstore->attr);
It seems that the corresponding memory is not freed after erasing.
Best Regards,
Huang Ying
On Sun, Nov 21, 2010 at 01:47:22PM -0800, Tony Luck wrote:
> >> 1) "Why do you only allow one platform driver to register?"
> >> I only have one such driver. Adding more is easy from the "read" side
> >> (just collect all the records from all devices and remember where they
> >> came from so you can call the correct "eraser" function). But the "write"
> >> side opens up questions that I don't have good answers for:
> >> - Which device(s) should error records be written to?
> >> All of them? Start with one and move on when it is
> >> full? Write some types of records to one device?
> >
> > Maybe based on the error type? We definitely need one device which
> > should contain all the records, something like main pstore device.
>
> But who decides which records go where? And which device gets to be
> the "main" one? I don't think that we can usefully do this in the
> registration mechanism (how does a driver know that other drivers even
> exist?). I continue to want to defer this until someone with two or more
> devices on one machine steps forward.
Yeah, let's wait and see.
[..]
> > /sys/firmware might not be all that sensible if someone comes up with
> > persistent storage type which is the network but we'll change that then.
>
> I'd like to get the right place first time - change means having to update
> any applications that coded in the pathname.
True, true.
> >> +int
> >> +pstore_register(struct pstore_info *psi)
> >> +{
> >> + struct pstore_entry *new_pstore;
> >> + int rc = 0, type;
> >> + unsigned long size;
> >> + u64 id;
> >> + unsigned long ps_maxsize;
> >
> > Alignment here? Maybe something like this:
> >
> > struct pstore_entry *new_pstore;
> > unsigned long ps_maxsize;
> > int rc = 0, type;
>
> Are you talking about textual alignment of the declarations? Yours
> does look prettier.
Yep, textual alignment.
[..]
> >> + list_del(&search_pstore->list);
> >> +
> >> + spin_unlock(&pstore_lock);
> >> +
> >> + sysfs_remove_bin_file(&pstore_kset->kobj, &search_pstore->attr);
> >
> > AFAICT, you might want to remove the sysfs file _before_ you remove it
> > from the pstore list/erase its contents, otherwise concurrent accesses
> > to it from userspace readers might make us go boom.
>
> I'll think about the ordering. I have three things to do here:
> 1) Remove from the pstore_list
> 2) Call platform driver to erase from store
> 3) Call sysfs to remove the binfile.
>
> Potentially the erase could fail ... and I should probably notice
> that and do something.
Right, and I was also wondering what might happen if userspace accesses
the bin attribute before you've removed it from the sysfs hierarchy but
after erasing its backing storage in the firmware?
AFAICT, it could be that those ps->data and ps->size members which are
accessed in pstore_show() (which is called when the bin attribute is
read from /sysfs) after the block of data is erased by platform driver,
might contain crap data and memory_read_from_buffer() could do some
illegal accesses to some kernel memory if not cleared properly by the
->erase() routine.
OTOH, you might want to do the clearing here and leave the platform
driver as dumb as possible by having a lower level __pstore_erase()
wrapper or similar.
Do you see my point? (I could very well be completely offcourse too :))
Thanks.
--
Regards/Gruss,
Boris.
On Mon, Nov 22, 2010 at 08:32:27AM +0100, Borislav Petkov wrote:
> Do you see my point? (I could very well be completely offcourse too :))
Yes I was :)
ps->data still points to the corresponding kernel buffer. And you keep
it around until the driver is removed/machine shutdown.
--
Regards/Gruss,
Boris.
> > "reader" which iterates over all records currently in the
> > store - returning type, size and a record identifier as
> > well as the actual data.
> >
> > "writer" which writes a record with a type to the persistent store
>
> I think it is necessary to require this to be NMI safe (in comments?),
> because hardware error handler may need to write to persistent storage
> in NMI context. Or we can add a "flag" field to let storage provider
> advocate its capability of NMI safe.
One thing we need for some of the driver code sitting in the pending pile
is support for hardware assisted logging, where record numberes are handed
out by a device register access.
> > The "type" of error record I mentioned earlier is used to
> > name the files ... saved console logs from kmsg_dmp() are
> > named with a "dmesg" prefix as shown above.
> >
> > Once an error record has been viewed, analysed, saved. The
> > user can request it to be cleared by writing its name to the
> > "erase" file:
How will fragmentation be handled ?
> The persistent storage may be full, and the writing may fail. So I think
> we can just try to write one by one, until the first success writing.
If you intend to treat it basically like an fs why not just make it a
file system. No more weird echo erase into magic sysfs file node you can
just use "rm".
The logging off an interrupt isn't a big problem either as you can have a
pool of blocks which are preallocated for the log files and more can be
stitched into the files asynchronously after the interrupt logging
events. For that matter you can keep it as a circular log just easily but
present it as a file system.
> [...]
> > + * Erase records by writing their filename to the "erase" file. E.g.
> > + * # echo "dmesg-0" > erase
rm dmesg-0
We have a model for this and filesystem indirected via sysfs commands is
in the daft category.
On Sun, Nov 21, 2010 at 5:59 PM, Huang Ying <[email protected]> wrote:
> This patch provides a general "read" interface for kmsg_dumper and some
> other persistent storage users. Another possible choice is to just
> extend the original interface to add persistent store support. For
> example, we can add a "read" function in kmsg_dumper, and output the
> content of persistent store via extend /dev/kmsg via prefix every line
> comes from persistent store or adding some "ioctl" to do that. (But it
> seems that nobody likes "ioctl").
In Linux (and Unix before it) "everything is a file" ... but this
doesn't work very well if the file has internal structure (e.g.
is made of records that can be individually changed or
deleted). A filesystem seems a much better model.
>> "writer" which writes a record with a type to the persistent store
>
> I think it is necessary to require this to be NMI safe (in comments?),
> because hardware error handler may need to write to persistent storage
> in NMI context. Or we can add a "flag" field to let storage provider
> advocate its capability of NMI safe.
I can add a comment to pstore.h to document the NMI-safe
requirement for the "writer" function.
>> ? - Which device(s) should error records be written to?
>> ? ? All of them? Start with one and move on when it is
>> ? ? full? ?Write some types of records to one device?
>
> The persistent storage may be full, and the writing may fail. So I think
> we can just try to write one by one, until the first success writing.
A good option - if we ever find someone luck enough to have
more than one persistent store device.
>> + ? ? ? sysfs_remove_bin_file(&pstore_kset->kobj, &search_pstore->attr);
>
> It seems that the corresponding memory is not freed after erasing.
Ouch! Good catch. Will add a kfree()
-Tony
On Mon, Nov 22, 2010 at 2:43 AM, Alan Cox <[email protected]> wrote:
>> > "writer" which writes a record with a type to the persistent store
>>
>> I think it is necessary to require this to be NMI safe (in comments?),
>> because hardware error handler may need to write to persistent storage
>> in NMI context. Or we can add a "flag" field to let storage provider
>> advocate its capability of NMI safe.
>
> One thing we need for some of the driver code sitting in the pending pile
> is support for hardware assisted logging, where record numberes are handed
> out by a device register access.
I don't follow you here ... perhaps we have a different idea
of what "record numbers" are? I'm thinking they are purely
an internal implementation detail of a platform level pstore
driver ... but the generic code needs to use them as an
opaque object holding them from the result of a read and
using them when requesting an erase. What usage are
you thinking of that needs this device register access?
>> > Once an error record has been viewed, analysed, saved. The
>> > user can request it to be cleared by writing its name to the
>> > "erase" file:
>
> How will fragmentation be handled ?
This is left as an exercise for the platform level diver (or
in the case of ERST, it is handled by the firmware that
implements ERST). In general this shouldn't be a hard
problem because I think the usual use case will be that
just after boot we will run some processes that look at,
analyse, save and clear all the error records. Different
types of records may be handled by different processes,
so there can be some short term fragmentation when
one type of record has been erased, but the remainder
have not yet been processed.
>> > + * Erase records by writing their filename to the "erase" file. E.g.
>> > + * ? ? # echo "dmesg-0" > erase
>
> rm dmesg-0
>
> We have a model for this and filesystem indirected via sysfs commands is
> in the daft category.
Yup - totally mad. My sysfs-fu isn't up to figuring out how
to get a notification on unlink(2). Is that possible? It would
be a much superior metaphor.
-Tony
On Mon, Nov 22, 2010 at 17:55, Tony Luck <[email protected]> wrote:
> On Sun, Nov 21, 2010 at 5:59 PM, Huang Ying <[email protected]> wrote:
>> This patch provides a general "read" interface for kmsg_dumper and some
>> other persistent storage users. Another possible choice is to just
>> extend the original interface to add persistent store support. For
>> example, we can add a "read" function in kmsg_dumper, and output the
>> content of persistent store via extend /dev/kmsg via prefix every line
>> comes from persistent store or adding some "ioctl" to do that. (But it
>> seems that nobody likes "ioctl").
>
> In Linux (and Unix before it) "everything is a file" ... but this
> doesn't work very well if the file has internal structure (e.g.
> is made of records that can be individually changed or
> deleted). A filesystem seems a much better model.
Any tangents with "pramfs: persistent and protected RAM filesystem"
on linux-embedded?
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
On Mon, Nov 22, 2010 at 10:24 AM, Geert Uytterhoeven
<[email protected]> wrote:
> Any tangents with "pramfs: persistent and protected RAM filesystem"
> on linux-embedded?
That looks nice for building on top on a sane h/w design (a
block of non-volatile RAM). My pstore code works with
uglier things. The ACPI/ERST interface provides ways to
read/write/clear "records" from what might be some NVRAM,
but might also be a channel to a service processor ... and
ERST is what I'd like to see made usable (without subjecting
users to the implementation details).
-Tony
> Here's a patch based on some discussions I had with Thomas
> Gleixner at plumbers conference that implements a generic
> layer for persistent storage usable to pass tens or hundreds
> of kilobytes of data from the dying breath of a crashing
> kernel to its successor.
>
> The usage model I'm envisioning is that a platform driver
> will register with this code to provide the actual storage.
> I've tried to make this interface general, but I'm working
> from a sample of one (the ACPI ERST code), so if anyone else
> has some persistent store that can't be handled by this code,
> speak up and we can put in the necessary tweaks.
I recently posted a patch set for powerpc to capture the most recent
oops or panic message in NVRAM:
http://lists.ozlabs.org/pipermail/linuxppc-dev/2010-November/087032.html
It covers a lot of the same ground, and could be adapted to use your
framework. See below for concerns and suggestions. I'd also be
interested in feedback about the design decisions I mention below.
On powerpc, the amount of NVRAM available for this may be as little
as 1-2 Kbytes. The minimal oops report (with essentially no backtrace)
is about 1800 bytes. See below for implications.
We currently read our NVRAM contents via /dev/nvram and the nvram
command. NVRAM is divided up into several "partitions" -- only
one of which is used for the oops/panic report -- so the user-space
code needs to know about how the partitions are laid out. It also
needs to know how much text we actually wrote to the partition, and
whether or not it's compressed. Since the kernel already knows how
to determine all this, it would probably be more convenient to get
at the oops/panic partition through your /sys interface.
> ...
> 2) "Why do you read in all the data from the device when it
> registers and save it in memory? Couldn't you just get the
> list of records and pick up the data from the device when
> the user reads the file?"
> I don't think this is going to be very much data, just a few hundred
> kilobytes (i.e. less that $0.01 worth of memory, even expensive server
> memory). The memory is freed when the record is erased ... which is
> likely to be soon after boot.
Since the amount of text we capture is so tiny, this is unlikely to be
an issue in my case.
> ...
> 6) "Is this widely useful? How many systems have persistent storage?"
> Although ERST was only added to the ACPI spec earlier this year, it
> merely documents existing functionality required for WHEA (Windows
> Hardware Error Architecture). So most modern server systems should
> have it (my test system has it, and it has a BIOS written in mid 2008).
> Sorry desktops & laptops - no love for you here.
>
Powerpc p Series does, obviously, and we're looking to exploit it in
just this way.
> ...
> +static void
> +pstore_dump(struct kmsg_dumper *dumper, enum kmsg_dump_reason reason,
> + const char *s1, unsigned long l1,
> + const char *s2, unsigned long l2)
> +{
> + unsigned long s1_start, s2_start;
> + unsigned long l1_cpy, l2_cpy;
> + char *dst = pstore_buf + psinfo->header_size;
> +
> + /* Don't dump oopses to persistent store */
Why not? In our case, we capture every oops and panic report, but keep
only the most recent. Seems like catching the last oops could be useful
if your system hangs thereafter and can't be made to panic. I suggest
you pass along the reason (KMSG_DUMP_OOPS or whatever) and let the
callback decide.
You'd have to serialize the oops handling, I guess, in case multiple
CPUs oops simultaneously. (Gotta fix that in my code.)
> + if (reason == KMSG_DUMP_OOPS)
> + return;
> +
> + l2_cpy = min(l2, psinfo->data_size);
> + l1_cpy = min(l1, psinfo->data_size - l2_cpy);
> +
> + s2_start = l2 - l2_cpy;
> + s1_start = l1 - l1_cpy;
> +
> + memcpy(dst, s1 + s1_start, l1_cpy);
> + memcpy(dst + l1_cpy, s2 + s2_start, l2_cpy);
> +
> + psinfo->writer(PSTORE_DMESG, pstore_buf, l1_cpy + l2_cpy);
This assumes that you always want to capture the last psinfo->data_size
bytes of the printk buffer. Given the small capacity of our NVRAM
partition, I handle the case where the whole oops report doesn't fit.
In that case, I sacrifice the end of the oops report to capture the
beginning. Patch #3 in my set is about this.
> ...
> +static int
> +pstore_create_sysfs_entry(struct pstore_entry *new_pstore)
> +{
> ...
> + new_pstore->attr.attr.mode = 0444;
/var/log/messages is typically not readable by everybody. This
appears to circumvent that.
> ...
Thanks.
Jim Keniston
IBM Linux Technology Center
Beaverton, OR
On Mon, Nov 22, 2010 at 4:06 PM, Jim Keniston
<[email protected]> wrote:
>> + /* Don't dump oopses to persistent store */
>
> Why not? ?In our case, we capture every oops and panic report, but keep
> only the most recent. ?Seems like catching the last oops could be useful
> if your system hangs thereafter and can't be made to panic. ?I suggest
> you pass along the reason (KMSG_DUMP_OOPS or whatever) and let the
> callback decide.
My thoughts were that Oops were non-fatal and ended up in /var/log/messages,
so this would be unneeded (this bit of code was copied from one of mtdoops
or ramoops - which does almost the same ... they do have an option to
allow the copy - perhaps I should have copied that bit too?).
> You'd have to serialize the oops handling, I guess, in case multiple
> CPUs oops simultaneously. ?(Gotta fix that in my code.)
Yup - I need to do this too (I only allocate one buffer).
>> + psinfo->writer(PSTORE_DMESG, pstore_buf, l1_cpy + l2_cpy);
>
> This assumes that you always want to capture the last psinfo->data_size
> bytes of the printk buffer. ?Given the small capacity of our NVRAM
> partition, I handle the case where the whole oops report doesn't fit.
> In that case, I sacrifice the end of the oops report to capture the
> beginning. ?Patch #3 in my set is about this.
Yes - I assume here that the last "data_size" bytes will be enough
to be useful. But in your case it most likely won't be. You could
lie about how much space you allow and then include some oops
parsing code to get the vital bits out of what is passed to you. Not
pretty - but it would work. If there are many such devices with limited
capacity, then it would make sense to include this parsing code
in this generic layer.
>> + new_pstore->attr.attr.mode = 0444;
>
> /var/log/messages is typically not readable by everybody. ?This
> appears to circumvent that.
But "dmesg(8)" typically *does* allow any user to see the most recent
part of the console log - so we are not consistent about this.
-Tony
On Mon, Nov 22, 2010 at 5:37 PM, Tony Luck <[email protected]> wrote:
> Yes - I assume here that the last "data_size" bytes will be enough
> to be useful. But in your case it most likely won't be. ?You could
> lie about how much space you allow and then include some oops
> parsing code to get the vital bits out of what is passed to you. Not
> pretty - but it would work. If there are many such devices with limited
> capacity, then it would make sense to include this parsing code
> in this generic layer.
Sorry - finding the good bits from the console log is architecture
dependent ... so it can't be done (easily) in the generic code.
-Tony
On Tue, Nov 23, 2010 at 12:10 PM, Tony Luck <[email protected]> wrote:
> On Mon, Nov 22, 2010 at 5:37 PM, Tony Luck <[email protected]> wrote:
>> Yes - I assume here that the last "data_size" bytes will be enough
>> to be useful. But in your case it most likely won't be. ?You could
>> lie about how much space you allow and then include some oops
>> parsing code to get the vital bits out of what is passed to you. Not
>> pretty - but it would work. If there are many such devices with limited
>> capacity, then it would make sense to include this parsing code
>> in this generic layer.
>
> Sorry - finding the good bits from the console log is architecture
> dependent ... so it can't be done (easily) in the generic code.
FYI:
Now we use ramoops for this purpose. Even tough ramoops uses volatile
memory, it's safe soft-reset and watchdog reset.
To overcome limiataion of ramoops. not persist, I'm working on mmcoops
as persistent oops record.
As the discussed story. it covers the previous oops devices,
mtdoops(drivers/mtd/mtdoops.c), ramoops (drivers/char/ramoops.c) and
mmcoops (drivers/mmc/card/mmc_oops.c).
Actually I tried to enable the lcdoops, display oops message at LCD.
but it doesn't work since not much time to display LCD.
Thank you,
Kyungmin Park
>
> -Tony
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at ?http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at ?http://www.tux.org/lkml/
>
On Mon, 2010-11-22 at 17:37 -0800, Tony Luck wrote:
> On Mon, Nov 22, 2010 at 4:06 PM, Jim Keniston
> <[email protected]> wrote:
> >> + /* Don't dump oopses to persistent store */
> >
> > Why not? In our case, we capture every oops and panic report, but keep
> > only the most recent. Seems like catching the last oops could be useful
> > if your system hangs thereafter and can't be made to panic. I suggest
> > you pass along the reason (KMSG_DUMP_OOPS or whatever) and let the
> > callback decide.
>
> My thoughts were that Oops were non-fatal and ended up in /var/log/messages,
> so this would be unneeded (this bit of code was copied from one of mtdoops
> or ramoops - which does almost the same ... they do have an option to
> allow the copy - perhaps I should have copied that bit too?).
Yes, I'd still vote for that, because:
1) it provides flexibility at very low cost;
2) it could be useful if syslogd and/or klogd and/or the filesystem
holding /var/log are in trouble; and
3) it's helpful because I want to be sure -- even in the face of limited
NVRAM -- to capture the start of an oops that causes a panic.
(3) requires a little more explanation: As far as I can tell, by the
time you're in panic(), there's no way to know that you're panicking
because of an oops. (The oops_in_progress flag doesn't seem to be
intended for this.) But if I get notified at the time of the oops, I
can check the panic_on_oops flag and know that we're GOING to panic, and
set a panicking_on_oops flag for use when I get called back again during
the panic. (No, my patch set doesn't do that yet, because I didn't
figure it out 'til recently.) There's perhaps a more generic solution
to this particular problem, but I may be your only client with such
space constraints.
>
> > You'd have to serialize the oops handling, I guess, in case multiple
> > CPUs oops simultaneously. (Gotta fix that in my code.)
> Yup - I need to do this too (I only allocate one buffer).
>
> >> + psinfo->writer(PSTORE_DMESG, pstore_buf, l1_cpy + l2_cpy);
> >
> > This assumes that you always want to capture the last psinfo->data_size
> > bytes of the printk buffer. Given the small capacity of our NVRAM
> > partition, I handle the case where the whole oops report doesn't fit.
> > In that case, I sacrifice the end of the oops report to capture the
> > beginning. Patch #3 in my set is about this.
>
> Yes - I assume here that the last "data_size" bytes will be enough
> to be useful. But in your case it most likely won't be. You could
> lie about how much space you allow and then include some oops
> parsing code to get the vital bits out of what is passed to you. Not
> pretty - but it would work.
Yeah, in the case of powerpc, a psinfo->data_size value of (say) 8K
would almost certainly include the start of the oops. And then I could
simplify my code quite a bit.
>
> >> + new_pstore->attr.attr.mode = 0444;
> >
> > /var/log/messages is typically not readable by everybody. This
> > appears to circumvent that.
>
> But "dmesg(8)" typically *does* allow any user to see the most recent
> part of the console log - so we are not consistent about this.
You're right, of course. It's the user-mode syslog messages that are
being hidden.
>
> -Tony
Jim