2010-12-13 18:16:10

by Luck, Tony

[permalink] [raw]
Subject: [concept & "good taste" review] persistent store

Linus,

At the Plumbers conference I chatted to Thomas and Peter about this idea,
and got some positive feedback - so I implemented a prototype which has
gone through three revisions on LKML. I've been pounding out the obvious
stupidities that people have pointed out to me (thanks alphabetically to
Alan, Andrew, Boris, Peter and Ying), so now the code is now converging
on some kind of final version ... it's time to check whether it's getting
close to something useful, or whether I've been drinking too much of the
cool-aid.

So before I embark on another round of code nit-picking, I'd like to
get answers to the bigger questions:

Do we want/need this in Linux at all?

Is the overall approach OK, or do I need do this some other way?



The basic idea:
--------------

Most X86 server class systems that are less than a couple of years old
include a small amount of persistent storage - AFAIK this is a WHQL
requirement to get a Windows Server 2008 sticker. The interface to this
storage is via ACPI, which isn't really suitable for a generic interface
since many other architectures are not lucky enough to have ACPI :-) But
they may have persistent storage that they would like to use (David Miller
said he'd be able to use this on sparc64 and Jim Keniston thought he
could adapt some NVRAM code he has for powerpc to use this framework).

I'm using a file system interface to make persistent storage visible
to users. A filesystem seems to be a logical way to do this because we
have one or more "blobs" of data from each crash. The X86 ACPI-ERST store on
my test machine can take almost 8 Kbytes per "record" - which is usually
plenty to see the panic, stack trace, and several lines leading up to it.

Since I think everyone who has persistent store will want to save the
console log - I made the generic part register with "kmsg_dump_register()"
to save & show "dmesg" information. But I would like to also use this
for unrecoverable machine check information - other people may find
interesting ways to use this too.

Early versions used /sys - which turned out to have issues as there was
no way to hook "unlink(2)" on the files - useful as a way to signal that
the data could be erased from the persistent store, so a new filesystem
"pstore" born using ramfs infrastructure (as suggested to me by Peter).

To the user it looks like this:

$ ls -l /dev/pstore
total 8
-r--r--r-- 1 root root 7896 Dec 3 10:56 dmesg-erst-5546531799825383425


Filenames show the "type" of the data ("dmesg" is console log) as well
as which persistent storage device the data came from, and a unique
device specific identifier (which for ERST is a 64-bit number which
should never be re-used - blame the UEFI spec) ... although the file
names may be a bit unwieldy - they are going to be consistent from one
boot to the next - so if you don't erase a record, the same data will
show up with the same filename. The modification time of the file is set
to the time the data was saved to the persistent store. The current code
reserves the prefix "mce" (machine check exception) for files containing
fatal error information. Other uses would reserve their own prefixes so
that user level tools can find the data that they are interested in and
skip stuff intended for other scripts/tools.

After the user has finished looking at the file and got all the data
they need ... E.g.
$ grep RIP: /dev/pstore/dmesg-erst-5546531799825383425
<4>[ 552.268202] RIP: 0010:[<ffffffff812a3a25>] [<ffffffff812a3a25>] sysrq_handle_crash+0x16/0x20

(s)he simply removes the file - which results in a call from the generic
pstore filesystem code to the platform driver to erase this data from
the persistent store.

# rm /dev/pstore/dmesg-erst-5546531799825383425

-Tony

Here's v4 of the generic part of the code - so if the answers to the big
questions were "yes", then you can pick holes in it. The bit I'm most
worried might fail the good taste test is "pstore_writefile()" which
acts like an open+write to push data into a file.

---

Documentation/ABI/testing/pstore | 35 ++++++
fs/Kconfig | 1
fs/Makefile | 1
fs/pstore/Kconfig | 13 ++
fs/pstore/Makefile | 7 +
fs/pstore/inode.c | 219 +++++++++++++++++++++++++++++++++++++++
fs/pstore/internal.h | 5
fs/pstore/platform.c | 208 +++++++++++++++++++++++++++++++++++++
include/linux/magic.h | 1
include/linux/pstore.h | 60 ++++++++++
10 files changed, 550 insertions(+)


diff --git a/Documentation/ABI/testing/pstore b/Documentation/ABI/testing/pstore
new file mode 100644
index 0000000..f1fb2a0
--- /dev/null
+++ b/Documentation/ABI/testing/pstore
@@ -0,0 +1,35 @@
+Where: /dev/pstore/...
+Date: January 2011
+Kernel Version: 2.6.38
+Contact: [email protected]
+Description: Generic interface to platform dependent persistent storage.
+
+ Platforms that provide a mechanism to preserve some data
+ across system reboots can register with this driver to
+ provide a generic interface to show records captured in
+ the dying moments. In the case of a panic the last part
+ of the console log is captured, but other interesting
+ data can also be saved.
+
+ # mount -t pstore - /dev/pstore
+
+ $ ls -l /dev/pstore
+ total 0
+ -r--r--r-- 1 root root 7896 Nov 30 15:38 dmesg-erst-1
+
+ Different users of this interface will result in different
+ filename prefixes. Currently two are defined:
+
+ "dmesg" - saved console log
+ "mce" - architecture dependent data from fatal h/w error
+
+ Once the information in a file has been read, removing
+ the file will signal to the underlying persistent storage
+ device that it can reclaim the space for later re-use.
+
+ $ rm /dev/pstore/dmesg-erst-1
+
+ The expectation is that all files in /dev/pstore
+ will be saved elsewhere and erased from persistent store
+ soon after boot to free up space ready for the next
+ catastrophe.
diff --git a/fs/Kconfig b/fs/Kconfig
index 771f457..2bbe47f 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -188,6 +188,7 @@ source "fs/omfs/Kconfig"
source "fs/hpfs/Kconfig"
source "fs/qnx4/Kconfig"
source "fs/romfs/Kconfig"
+source "fs/pstore/Kconfig"
source "fs/sysv/Kconfig"
source "fs/ufs/Kconfig"
source "fs/exofs/Kconfig"
diff --git a/fs/Makefile b/fs/Makefile
index a7f7cef..db71a5b 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -121,3 +121,4 @@ obj-$(CONFIG_BTRFS_FS) += btrfs/
obj-$(CONFIG_GFS2_FS) += gfs2/
obj-$(CONFIG_EXOFS_FS) += exofs/
obj-$(CONFIG_CEPH_FS) += ceph/
+obj-$(CONFIG_PSTORE) += pstore/
diff --git a/fs/pstore/Kconfig b/fs/pstore/Kconfig
new file mode 100644
index 0000000..867d0ac
--- /dev/null
+++ b/fs/pstore/Kconfig
@@ -0,0 +1,13 @@
+config PSTORE
+ bool "Persistant store support"
+ default n
+ help
+ This option enables generic access to platform level
+ persistent storage via "pstore" filesystem that can
+ be mounted as /dev/pstore. Only useful if you have
+ a platform level driver that registers with pstore to
+ provide the data, so you probably should just go say "Y"
+ (or "M") to a platform specific persistent store driver
+ (e.g. ACPI_APEI on X86) which will select this for you.
+ If you don't have a platform persistent store driver,
+ say N.
diff --git a/fs/pstore/Makefile b/fs/pstore/Makefile
new file mode 100644
index 0000000..760f4bc
--- /dev/null
+++ b/fs/pstore/Makefile
@@ -0,0 +1,7 @@
+#
+# Makefile for the linux pstorefs routines.
+#
+
+obj-y += pstore.o
+
+pstore-objs += inode.o platform.o
diff --git a/fs/pstore/inode.c b/fs/pstore/inode.c
new file mode 100644
index 0000000..bc704ce
--- /dev/null
+++ b/fs/pstore/inode.c
@@ -0,0 +1,219 @@
+/*
+ * Persistent Storage - ramfs parts.
+ *
+ * Copyright (C) 2010 Intel Corporation <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/pagemap.h>
+#include <linux/highmem.h>
+#include <linux/time.h>
+#include <linux/init.h>
+#include <linux/string.h>
+#include <linux/mount.h>
+#include <linux/ramfs.h>
+#include <linux/sched.h>
+#include <linux/magic.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+
+#include "internal.h"
+
+#define pstore_get_inode ramfs_get_inode
+
+static int pstore_unlink(struct inode *dir, struct dentry *dentry)
+{
+ pstore_erase(dentry->d_inode->i_private);
+
+ return simple_unlink(dir, dentry);
+}
+
+static const struct inode_operations pstore_dir_inode_operations = {
+ .lookup = simple_lookup,
+ .unlink = pstore_unlink,
+};
+
+static const struct super_operations pstore_ops = {
+ .statfs = simple_statfs,
+ .drop_inode = generic_delete_inode,
+ .show_options = generic_show_options,
+};
+
+static struct super_block *pstore_sb;
+static struct vfsmount *pstore_mnt;
+
+int pstore_is_mounted(void)
+{
+ return pstore_mnt != NULL;
+}
+
+/*
+ * Set up a file structure as if we had opened this file and
+ * write our data to it.
+ */
+static int pstore_writefile(struct inode *inode, struct dentry *dentry,
+ char *data, size_t size)
+{
+ struct file f;
+ ssize_t n;
+ mm_segment_t old_fs = get_fs();
+
+ memset(&f, '0', sizeof f);
+ f.f_mapping = inode->i_mapping;
+ f.f_path.dentry = dentry;
+ f.f_path.mnt = pstore_mnt;
+ f.f_pos = 0;
+ f.f_op = inode->i_fop;
+ set_fs(KERNEL_DS);
+ n = do_sync_write(&f, data, size, &f.f_pos);
+ set_fs(old_fs);
+
+ return n == size;
+}
+
+/*
+ * Make a regular file in the root directory of our file system.
+ * Load it up with "size" bytes of data from "buf".
+ * Set the mtime & ctime to the date that this record was originally stored.
+ */
+int pstore_mkfile(char *name, char *data, size_t size, struct timespec time,
+ void *private)
+{
+ struct dentry *root = pstore_sb->s_root;
+ struct dentry *dentry;
+ struct inode *inode;
+ int rc;
+
+ rc = -ENOMEM;
+ inode = pstore_get_inode(pstore_sb, root->d_inode, S_IFREG | 0444, 0);
+ if (!inode)
+ goto fail;
+
+ inode->i_private = private;
+
+ mutex_lock(&root->d_inode->i_mutex);
+
+ rc = -ENOSPC;
+ dentry = d_alloc_name(root, name);
+ if (IS_ERR(dentry))
+ goto fail_alloc;
+
+ d_add(dentry, inode);
+
+ mutex_unlock(&root->d_inode->i_mutex);
+
+ if (!pstore_writefile(inode, dentry, data, size))
+ goto fail_write;
+
+ if (time.tv_sec)
+ inode->i_mtime = inode->i_ctime = time;
+
+ return 0;
+
+fail_write:
+ inode->i_nlink--;
+ mutex_lock(&root->d_inode->i_mutex);
+ d_delete(dentry);
+ dput(dentry);
+ mutex_unlock(&root->d_inode->i_mutex);
+ goto fail;
+
+fail_alloc:
+ mutex_unlock(&root->d_inode->i_mutex);
+ iput(inode);
+
+fail:
+ return rc;
+}
+
+int pstore_fill_super(struct super_block *sb, void *data, int silent)
+{
+ struct inode *inode = NULL;
+ struct dentry *root;
+ int err;
+
+ save_mount_options(sb, data);
+
+ pstore_sb = sb;
+
+ sb->s_maxbytes = MAX_LFS_FILESIZE;
+ sb->s_blocksize = PAGE_CACHE_SIZE;
+ sb->s_blocksize_bits = PAGE_CACHE_SHIFT;
+ sb->s_magic = PSTOREFS_MAGIC;
+ sb->s_op = &pstore_ops;
+ sb->s_time_gran = 1;
+
+ inode = pstore_get_inode(sb, NULL, S_IFDIR | 0755, 0);
+ if (!inode) {
+ err = -ENOMEM;
+ goto fail;
+ }
+ /* override ramfs "dir" options so we catch unlink(2) */
+ inode->i_op = &pstore_dir_inode_operations;
+
+ root = d_alloc_root(inode);
+ sb->s_root = root;
+ if (!root) {
+ err = -ENOMEM;
+ goto fail;
+ }
+
+ pstore_get_records();
+
+ return 0;
+fail:
+ iput(inode);
+ return err;
+}
+
+static int pstore_get_sb(struct file_system_type *fs_type,
+ int flags, const char *dev_name, void *data, struct vfsmount *mnt)
+{
+ struct dentry *root;
+
+ root = mount_nodev(fs_type, flags, data, pstore_fill_super);
+ if (IS_ERR(root))
+ return -ENOMEM;
+
+ mnt->mnt_root = root;
+ mnt->mnt_sb = root->d_sb;
+ pstore_mnt = mnt;
+
+ return 0;
+}
+
+static void pstore_kill_sb(struct super_block *sb)
+{
+ kill_litter_super(sb);
+ pstore_sb = NULL;
+ pstore_mnt = NULL;
+}
+
+static struct file_system_type pstore_fs_type = {
+ .name = "pstore",
+ .get_sb = pstore_get_sb,
+ .kill_sb = pstore_kill_sb,
+};
+
+static int __init init_pstore_fs(void)
+{
+ return register_filesystem(&pstore_fs_type);
+}
+module_init(init_pstore_fs)
+
+MODULE_AUTHOR("Tony Luck <[email protected]>");
+MODULE_LICENSE("GPL");
diff --git a/fs/pstore/internal.h b/fs/pstore/internal.h
new file mode 100644
index 0000000..1f274ff
--- /dev/null
+++ b/fs/pstore/internal.h
@@ -0,0 +1,5 @@
+extern void pstore_get_records(void);
+extern int pstore_mkfile(char *name, char *data, size_t size,
+ struct timespec time, void *private);
+extern void pstore_erase(void *private);
+extern int pstore_is_mounted(void);
diff --git a/fs/pstore/platform.c b/fs/pstore/platform.c
new file mode 100644
index 0000000..59939f0
--- /dev/null
+++ b/fs/pstore/platform.c
@@ -0,0 +1,208 @@
+/*
+ * Persistent Storage - platform driver interface parts.
+ *
+ * Copyright (C) 2010 Intel Corporation <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/atomic.h>
+#include <linux/types.h>
+#include <linux/errno.h>
+#include <linux/init.h>
+#include <linux/kmsg_dump.h>
+#include <linux/module.h>
+#include <linux/pstore.h>
+#include <linux/string.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+
+#include "internal.h"
+
+/*
+ * pstore_lock just protects "psinfo" during
+ * calls to pstore_register()
+ */
+static DEFINE_SPINLOCK(pstore_lock);
+static struct pstore_info *psinfo;
+
+#define PSTORE_NAMELEN 64
+
+struct pstore_private {
+ u64 id;
+ int (*erase)(u64);
+};
+
+/*
+ * callback from kmsg_dump. (s2,l2) has the most recently
+ * written bytes, older bytes are in (s1,l1). Save as much
+ * as we can from the end of the buffer.
+ */
+static void pstore_dump(struct kmsg_dumper *dumper,
+ enum kmsg_dump_reason reason,
+ const char *s1, unsigned long l1,
+ const char *s2, unsigned long l2)
+{
+ unsigned long s1_start, s2_start;
+ unsigned long l1_cpy, l2_cpy;
+ char *dst = psinfo->buf;
+
+ /* Don't dump oopses to persistent store */
+ if (reason == KMSG_DUMP_OOPS)
+ return;
+
+ l2_cpy = min(l2, psinfo->bufsize);
+ l1_cpy = min(l1, psinfo->bufsize - l2_cpy);
+
+ s2_start = l2 - l2_cpy;
+ s1_start = l1 - l1_cpy;
+
+ mutex_lock(&psinfo->buf_mutex);
+ memcpy(dst, s1 + s1_start, l1_cpy);
+ memcpy(dst + l1_cpy, s2 + s2_start, l2_cpy);
+
+ psinfo->write(PSTORE_TYPE_DMESG, l1_cpy + l2_cpy);
+ mutex_unlock(&psinfo->buf_mutex);
+}
+
+static struct kmsg_dumper pstore_dumper = {
+ .dump = pstore_dump,
+};
+
+/*
+ * platform specific persistent storage driver registers with
+ * us here. If pstore is already mounted, call the platform
+ * read function right away to populate the file system. If not
+ * then the pstore mount code will call us later to fill out
+ * the file system.
+ *
+ * Register with kmsg_dump to save last part of console log on panic.
+ */
+int pstore_register(struct pstore_info *psi)
+{
+ struct module *owner = psi->owner;
+
+ spin_lock(&pstore_lock);
+ if (psinfo) {
+ spin_unlock(&pstore_lock);
+ return -EBUSY;
+ }
+ psinfo = psi;
+ spin_unlock(&pstore_lock);
+
+ if (owner && !try_module_get(owner)) {
+ psinfo = NULL;
+ return -EINVAL;
+ }
+
+ if (pstore_is_mounted())
+ pstore_get_records();
+
+ kmsg_dump_register(&pstore_dumper);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(pstore_register);
+
+/*
+ * Read all the records from the persistent store. Create and
+ * file files in our filesystem.
+ */
+void pstore_get_records(void)
+{
+ struct pstore_info *psi = psinfo;
+ size_t size;
+ u64 id;
+ enum pstore_type_id type;
+ char name[PSTORE_NAMELEN];
+ struct pstore_private *private;
+ struct timespec time;
+ int failed = 0;
+
+ if (!psi)
+ return;
+
+ mutex_lock(&psinfo->buf_mutex);
+ while ((size = psi->read(&id, &type, &time)) > 0) {
+ switch (type) {
+ case PSTORE_TYPE_DMESG:
+ sprintf(name, "dmesg-%s-%lld", psi->name, id);
+ break;
+ case PSTORE_TYPE_MCE:
+ sprintf(name, "mce-%s-%lld", psi->name, id);
+ break;
+ case PSTORE_TYPE_UNKNOWN:
+ sprintf(name, "unknown-%s-%lld", psi->name, id);
+ break;
+ default:
+ sprintf(name, "type%d-%s-%lld", type, psi->name, id);
+ break;
+ }
+ private = kmalloc(sizeof *private, GFP_KERNEL);
+ if (!private) {
+ failed++;
+ continue;
+ }
+ private->id = id;
+ private->erase = psi->erase;
+ if (pstore_mkfile(name, psi->buf, size, time, private)) {
+ kfree(private);
+ failed++;
+ }
+ }
+ mutex_unlock(&psinfo->buf_mutex);
+
+ if (failed)
+ printk(KERN_WARNING "pstore: failed to load %d record(s) from '%s'\n",
+ failed, psi->name);
+}
+
+/*
+ * Call platform driver to write a record to the
+ * persistent store. We don't worry about making
+ * this visible in the pstore filesystem as the
+ * presumption is that we only save things to the
+ * store in the dying moments of OS failure. Hence
+ * nobody will see the entries in the filesystem.
+ */
+int pstore_write(enum pstore_type_id type, char *buf, size_t size)
+{
+ int ret;
+
+ if (!psinfo)
+ return -ENODEV;
+
+ if (size > psinfo->bufsize)
+ return -EFBIG;
+
+ mutex_lock(&psinfo->buf_mutex);
+ memcpy(psinfo->buf, buf, size);
+ ret = psinfo->write(type, size);
+ mutex_unlock(&psinfo->buf_mutex);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(pstore_write);
+
+/*
+ * When a file is unlinked from our file system we call the
+ * platform driver to erase the record from persistent store.
+ */
+void pstore_erase(void *private)
+{
+ struct pstore_private *p = private;
+
+ p->erase(p->id);
+ kfree(p);
+}
diff --git a/include/linux/magic.h b/include/linux/magic.h
index ff690d0..e87fd5a 100644
--- a/include/linux/magic.h
+++ b/include/linux/magic.h
@@ -26,6 +26,7 @@
#define ISOFS_SUPER_MAGIC 0x9660
#define JFFS2_SUPER_MAGIC 0x72b6
#define ANON_INODE_FS_MAGIC 0x09041934
+#define PSTOREFS_MAGIC 0x6165676C

#define MINIX_SUPER_MAGIC 0x137F /* original minix fs */
#define MINIX_SUPER_MAGIC2 0x138F /* minix fs, 30 char names */
diff --git a/include/linux/pstore.h b/include/linux/pstore.h
new file mode 100644
index 0000000..99bf5aa
--- /dev/null
+++ b/include/linux/pstore.h
@@ -0,0 +1,60 @@
+/*
+ * Persistent Storage - pstore.h
+ *
+ * Copyright (C) 2010 Intel Corporation <[email protected]>
+ *
+ * This code is the generic layer to export data records from platform
+ * level persistent storage via a file system.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+#ifndef _LINUX_PSTORE_H
+#define _LINUX_PSTORE_H
+
+/* types */
+enum pstore_type_id {
+ PSTORE_TYPE_DMESG = 0,
+ PSTORE_TYPE_MCE = 1,
+ PSTORE_TYPE_UNKNOWN = 255
+};
+
+struct pstore_info {
+ struct module *owner;
+ char *name;
+ struct mutex buf_mutex; /* serialize access to 'buf' */
+ char *buf;
+ size_t bufsize;
+ size_t (*read)(u64 *id, enum pstore_type_id *type,
+ struct timespec *time);
+ int (*write)(enum pstore_type_id type, size_t size);
+ int (*erase)(u64 id);
+};
+
+#ifdef CONFIG_PSTORE
+extern int pstore_register(struct pstore_info *);
+extern int pstore_write(enum pstore_type_id type, char *buf, size_t size);
+#else
+static inline int
+pstore_register(struct pstore_info *psi)
+{
+ return -ENODEV;
+}
+static inline int
+pstore_write(enum pstore_type_id type, char *buf, size_t size)
+{
+ return -ENODEV;
+}
+#endif
+
+#endif /*_LINUX_PSTORE_H*/


2010-12-17 01:58:00

by Linus Torvalds

[permalink] [raw]
Subject: Re: [concept & "good taste" review] persistent store

On Mon, Dec 13, 2010 at 10:16 AM, Luck, Tony <[email protected]> wrote:
>
> So before I embark on another round of code nit-picking, I'd like to
> get answers to the bigger questions:
>
> ?Do we want/need this in Linux at all?
>
> ?Is the overall approach OK, or do I need do this some other way?

So I have to say, it seemed interesting at first. I was like "ok, I
don't really see the point of making it a filesystem, but whatever -
it's pretty small".

Then I hit the

/* Don't dump oopses to persistent store */
if (reason == KMSG_DUMP_OOPS)
return;

and I'm like "ok, this is just stupid".

The _only_ valid reason for persistent storage is for things like
oopses that kill the machine. Nothing else matters at all. If the
machine isn't dead, and it's not some critical oops, then why the
_hell_ would you ever use persistent storage? You're much better off
just using the regular disk.

So in the end, I really don't see any point at all to this thing. The
filesystem part is pointless (there are better filesystems around and
everybody has a disk or flash or whatever), and the only real upside
has been explicitly castrated.

Linus

2010-12-17 06:28:38

by Tony Luck

[permalink] [raw]
Subject: Re: [concept & "good taste" review] persistent store

On Thu, Dec 16, 2010 at 5:57 PM, Linus Torvalds
<[email protected]> wrote:
> Then I hit the
>
> ? ? ? ?/* Don't dump oopses to persistent store */
> ? ? ? ?if (reason == KMSG_DUMP_OOPS)
> ? ? ? ? ? ? ? ?return;
>
> and I'm like "ok, this is just stupid".
>
> The _only_ valid reason for persistent storage is for things like
> oopses that kill the machine.

Maybe I misunderstood what "KMSG_DUMP_OOPS" meant ... it
looked to me like this code is used for non-fatal OOPsen - ones
that will be logged to /var/log/messages.

You are right that the whole point of this is to save data from
fatal errors ... if I'm not doing that, then I'm happy to change
things until I achieve that goal.

-Tony

2010-12-17 18:09:57

by Luck, Tony

[permalink] [raw]
Subject: Re: [concept & "good taste" review] persistent store

On Thu, Dec 16, 2010 at 10:28 PM, Tony Luck <[email protected]> wrote:
>> The _only_ valid reason for persistent storage is for things like
>> oopses that kill the machine.
>
> Maybe I misunderstood what "KMSG_DUMP_OOPS" meant ... it
> looked to me like this code is used for non-fatal OOPsen - ones
> that will be logged to /var/log/messages.

Thinking about this a bit more I see my experiments with
this were hopelessly naive. There is no way to know at
"oops" time whether the problem is going to turn out to
be minor or fatal. So the right thing to do here is assume
the worst and squirrel the data away safely just in case
death is imminent.

This means that I need to re-think this part too:
+/*
+ * Call platform driver to write a record to the
+ * persistent store. We don't worry about making
+ * this visible in the pstore filesystem as the
+ * presumption is that we only save things to the
+ * store in the dying moments of OS failure. Hence
+ * nobody will see the entries in the filesystem.
+ */

Since rumors of the death of the kernel may be greatly
exaggerated ... I *do* need to make the newly added
records visible in the pstore file system so that we
don't run into the situation where persistent store
has been filled with a series of minor oops reports
(which are in any case also logged elsewhere) with
no space left to log the fatal message. Adding
them to the file system would give the user a handle
to look at them and erase them.

> So in the end, I really don't see any point at all to this thing.

So does fixing this stupidity provide a useful point for this?

The point I see, is providing a generic layer that any platform
persistent store driver can use by just writing simple
read/write/erase functions ... rather than everyone
having to roll their own, slightly different, versions of
what could be a common interface. There was a
recent attempt to do exactly this with:
drivers/staging/dream/pmem.c
which has currently been dropped - but it may come
back (and my version doesn't have, or need, "ioctl"
operations :-).

-Tony

2010-12-17 18:19:19

by James Bottomley

[permalink] [raw]
Subject: Re: [concept & "good taste" review] persistent store

On Fri, 2010-12-17 at 10:09 -0800, Tony Luck wrote:
> On Thu, Dec 16, 2010 at 10:28 PM, Tony Luck <[email protected]> wrote:
> >> The _only_ valid reason for persistent storage is for things like
> >> oopses that kill the machine.
> >
> > Maybe I misunderstood what "KMSG_DUMP_OOPS" meant ... it
> > looked to me like this code is used for non-fatal OOPsen - ones
> > that will be logged to /var/log/messages.
>
> Thinking about this a bit more I see my experiments with
> this were hopelessly naive. There is no way to know at
> "oops" time whether the problem is going to turn out to
> be minor or fatal. So the right thing to do here is assume
> the worst and squirrel the data away safely just in case
> death is imminent.

To be honest, this is what I'd recommend even if you could tell the
difference. A lot of the oopses I see were triggered by something
non-fatal (usually a WARN_ON()) earlier in the sequence. Without seeing
the preceding WARN_ON() data, the oops is usually terrifically hard to
diagnose (often just a NULL or junk pointer deref).

James

2010-12-17 21:38:46

by Linus Torvalds

[permalink] [raw]
Subject: Re: [concept & "good taste" review] persistent store

On Fri, Dec 17, 2010 at 10:09 AM, Tony Luck <[email protected]> wrote:
>
> So does fixing this stupidity provide a useful point for this?

Yes. However, I still question the "filesystem" part.

Technically, what I think any persistent storage should aim for is to
be a "journal" - not a filesystem. It's most useful as a temporary
area for data _before_ that data actually hits the disk, and once it
has hit the disk (or has been picked up by a network syslog server, of
course), the usefulness of the persistent storage immediately
vanishes.

So I don't really mind having a filesystem interface to that (the
whole "everything is a file" model), but I think it can end up
confusing people about what this thing is useful for. I fear that
people will try to write to it from user space as some kind of
mini-filesystem, and that seems pointless.

Linus

2010-12-17 23:08:10

by Luck, Tony

[permalink] [raw]
Subject: Re: [concept & "good taste" review] persistent store

On Fri, Dec 17, 2010 at 1:38 PM, Linus Torvalds
<[email protected]> wrote:
> Yes. However, I still question the "filesystem" part.
>
> Technically, what I think any persistent storage should aim for is to
> be a "journal" - not a filesystem. It's most useful as a temporary
> area for data _before_ that data actually hits the disk, and once it
> has hit the disk (or has been picked up by a network syslog server, of
> course), the usefulness of the persistent storage immediately
> vanishes.
>
> So I don't really mind having a filesystem interface to that (the
> whole "everything is a file" model), but I think it can end up
> confusing people about what this thing is useful for. I fear that
> people will try to write to it from user space as some kind of
> mini-filesystem, and that seems pointless.

Do we have any good models for "journals" ... apart from
the console log? Ying & I did talk about just using printk
to drop all the saved information onto the console (with
some sort of "previous" prefix on each line to make
it easy to find, and to make sure that someone glancing
at the messages wouldn't worry that that OOPs scrolling
by was happening now). But this seemed like a really bad
idea (especially if someone has enough persistent store to
capture all of __logbuf).

So the downside of "everything is a file" is that we don't
have much infrastructure for things that don't look like
files - and trying to build some results in some special
custom tools being needed to access the data, which
just makes things harder to use.

People trying to write to /dev/pstore will figure out quickly
that you can't do that. There are no "ops" to make new files,
directories, symlinks or even to rename existing ones. You
can overwrite existing files (because I don't trap "open" to
deny them write access - but the 0444 mode is supposed
to be a visual clue to not do that).

-Tony

2010-12-17 23:14:00

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [concept & "good taste" review] persistent store

On 12/17/2010 03:08 PM, Tony Luck wrote:
>
> Do we have any good models for "journals" ... apart from
> the console log? Ying & I did talk about just using printk
> to drop all the saved information onto the console (with
> some sort of "previous" prefix on each line to make
> it easy to find, and to make sure that someone glancing
> at the messages wouldn't worry that that OOPs scrolling
> by was happening now). But this seemed like a really bad
> idea (especially if someone has enough persistent store to
> capture all of __logbuf).
>
> So the downside of "everything is a file" is that we don't
> have much infrastructure for things that don't look like
> files - and trying to build some results in some special
> custom tools being needed to access the data, which
> just makes things harder to use.
>
> People trying to write to /dev/pstore will figure out quickly
> that you can't do that. There are no "ops" to make new files,
> directories, symlinks or even to rename existing ones. You
> can overwrite existing files (because I don't trap "open" to
> deny them write access - but the 0444 mode is supposed
> to be a visual clue to not do that).
>

There are two models I can think of:

1. a file where the head is automatically dropped as space requires.
2. a filesystem where the oldest files are automatically reclaimed.

1 has been implemented in actual systems, 2 is kind of a logical extension.

-hpa

2010-12-17 23:53:57

by Luck, Tony

[permalink] [raw]
Subject: Re: [concept & "good taste" review] persistent store

On Fri, Dec 17, 2010 at 3:11 PM, H. Peter Anvin <[email protected]> wrote:
> There are two models I can think of:
>
> 1. a file where the head is automatically dropped as space requires.
> 2. a filesystem where the oldest files are automatically reclaimed.
>
> 1 has been implemented in actual systems, 2 is kind of a logical extension.

#2 sounds more applicable here (we have some multi-kilobyte
blobs of data, one from each kmsg_dumper invocation - and
it would seem useful to keep them as separate entities)

I'm not sure whether everyone would be happy with this. Imagine you
have a system that gets two OOPs, followed by a full panic - but that
the persistent store only has space for two of the three reports. I think
that most people would want the first OOPs and the panic ... i.e.
drop the middle bit, rather than the oldest bit.

-Tony

2010-12-18 18:30:52

by Linus Torvalds

[permalink] [raw]
Subject: Re: [concept & "good taste" review] persistent store

On Fri, Dec 17, 2010 at 3:53 PM, Tony Luck <[email protected]> wrote:
> On Fri, Dec 17, 2010 at 3:11 PM, H. Peter Anvin <[email protected]> wrote:
>> There are two models I can think of:
>>
>> 1. a file where the head is automatically dropped as space requires.
>> 2. a filesystem where the oldest files are automatically reclaimed.
>>
>> 1 has been implemented in actual systems, 2 is kind of a logical extension.
>
> #2 sounds more applicable here (we have some multi-kilobyte
> blobs of data, one from each kmsg_dumper invocation - and
> it would seem useful to keep them as separate entities)

So I would argue that what we'd want is actually more of a mix of the two.

You want to have a ring of events, and into that ring you also have a
"this event has been read" pointer. And you _never_ overwrite entries
that haven't been read yet, because quite frankly, if you get some
nasty memory corruption, you may end up with a thousand oopses in
rapid succession, and the latter ones are likely to be just fallout
from the earlier ones. So you definitely don't want to overwrite the
earlier ones, because they are more likely to contain the clues about
the actual original cause.

At the same time, you do want to have the capability of saying "I've
seen this", and let it be overwritten. For example, if we end up
teaching syslogd or something like that to use this, syslogd would
write the oops to disk, do a fdatasync() on the oops file, and after
it's stable on disk it can mark it "read".

Also, since this is very much about persistent storage, I think any
events from a previous boot that still exists should be marked "read".
You still want to be able to read them (so marking something "read"
does not mean that it goes away), but if a new oops happens, you don't
want some old entries from long ago to stop it from being written to
persistent storage. So if you don't have any syslogd or any other tool
that saves things to disk, you'd still get the new oopses into
persistent storage.

Doesn't that sound like the best of both worlds?

Linus

2010-12-18 23:06:59

by Tony Luck

[permalink] [raw]
Subject: Re: [concept & "good taste" review] persistent store

On Sat, Dec 18, 2010 at 10:23 AM, Linus Torvalds
<[email protected]> wrote:
> You want to have a ring of events, and into that ring you also have a
> "this event has been read" pointer. And you _never_ overwrite entries
> that haven't been read yet, because quite frankly, if you get some
> nasty memory corruption, you may end up with a thousand oopses in
> rapid succession, and the latter ones are likely to be just fallout
> from the earlier ones. So you definitely don't want to overwrite the
> earlier ones, because they are more likely to contain the clues about
> the actual original cause.
>
> At the same time, you do want to have the capability of saying "I've
> seen this", and let it be overwritten. For example, if we end up
> teaching syslogd or something like that to use this, syslogd would
> write the oops to disk, do a fdatasync() on the oops file, and after
> it's stable on disk it can mark it "read".
>
> Also, since this is very much about persistent storage, I think any
> events from a previous boot that still exists should be marked "read".
> You still want to be able to read them (so marking something "read"
> does not mean that it goes away), but if a new oops happens, you don't
> want some old entries from long ago to stop it from being written to
> persistent storage. So if you don't have any syslogd or any other tool
> that saves things to disk, you'd still get the new oopses into
> persistent storage.
>
> Doesn't that sound like the best of both worlds?

It sounds like an excellent heuristic for how the platform layer
should manage the persistent store when space is tight. But
I think that I can still keep my /dev/pstore filesystem as a
presentation layer to make the bits available to the user in
a device independent way.

Or perhaps the pstore layer can help with the implementation
of the heuristic. It knows what items are in the pstore, so it
could build & maintain the "ring" and pass the id of the least
wanted item down to the platform layer whenever it wants to
write a record ... with the platform layer giving a status to
say whether it had to delete that item to make space for the
new one?

-Tony

2010-12-19 09:18:09

by Borislav Petkov

[permalink] [raw]
Subject: Re: [concept & "good taste" review] persistent store

On Sat, Dec 18, 2010 at 03:06:57PM -0800, Tony Luck wrote:
> > Doesn't that sound like the best of both worlds?
>
> It sounds like an excellent heuristic for how the platform layer
> should manage the persistent store when space is tight. But
> I think that I can still keep my /dev/pstore filesystem as a
> presentation layer to make the bits available to the user in
> a device independent way.
>
> Or perhaps the pstore layer can help with the implementation
> of the heuristic. It knows what items are in the pstore, so it
> could build & maintain the "ring" and pass the id of the least
> wanted item down to the platform layer whenever it wants to
> write a record ... with the platform layer giving a status to
> say whether it had to delete that item to make space for the
> new one?

Before we go and delve into priority-sorting the oopses in the pstore,
let me ask this: how big an actual persistent storage device are we
talking?

Because if it is big enough - for some value of 'big' - we could try to
never let it fill up. If want to save space we might even do something
crazy like compressing the oops info. In the rare event it fills up or
hits some 'almost-full' watermarks, we can kick some userspace daemon
to start writing the oopses to fs and clear the pstore. This all should
happen in the case where all you get is non-critical warnings and the
system is still alive.

However, in the critical cases, you get a single "stream" of oopses with
the first one being the most important one and then you panic. And in
most cases that stream is only a couple of oopses long. For that, the
pstore should be big enough to easily contain it.

[ Btw, if we are able to write that stream to pstore before we panic,
we solve all the problems we have with catching oopses with cameras,
serial consoles, netconsoles, firewire and pen and paper! ]

In that case, when you restart the machine, the same daemon notices the
existence of the oopses and starts writing them to fs, notifies the
users and clears the pstore.

So, I think what we could do is keep our big enough pstore with enough
free space for a bunch of oopses in case we panic. In the remaining
cases, we write them out thus freeing some more space.

Hmmm..

--
Regards/Gruss,
Boris.

2010-12-19 17:01:22

by Florian Mickler

[permalink] [raw]
Subject: Re: [concept & "good taste" review] persistent store

On Sun, 19 Dec 2010 10:17:52 +0100
Borislav Petkov <[email protected]> wrote:

> On Sat, Dec 18, 2010 at 03:06:57PM -0800, Tony Luck wrote:
> > > Doesn't that sound like the best of both worlds?
> >
> > It sounds like an excellent heuristic for how the platform layer
> > should manage the persistent store when space is tight. But
> > I think that I can still keep my /dev/pstore filesystem as a
> > presentation layer to make the bits available to the user in
> > a device independent way.
I agree. As far as I understand, Linus concept could map nicely onto an
fs:

1. marking things read:
Clients can delete files as soon as they have savely scribled
them down somewhere.

2. on boot the persistent storage layer init's any files in an
'archive' directory.

3. if an oops happens, the persistent storage layer shows them in the
pstore-root directory. If no space is left, the
persistent storage layer evicts as much oldest entries from the
'archive' as necessary. as soon as archive is empty but the pstore is
full, pstore stops storing new oopses.

Or not? That way, there would be some sort of journal behaviour, but
also new oopses will evict old(pre-boot) oopses if necessary.
maybe a kernel boot-parameter to disable deleting old oopses could be
necessary in some cases. But I doubt it.

Regards,
Flo

2010-12-19 20:17:55

by Luck, Tony

[permalink] [raw]
Subject: Re: [concept & "good taste" review] persistent store

On Sun, Dec 19, 2010 at 1:17 AM, Borislav Petkov <[email protected]> wrote:
> Before we go and delve into priority-sorting the oopses in the pstore,
> let me ask this: how big an actual persistent storage device are we
> talking?

I'm not sure how big the store is on my system ... the ACPI/ERST
interface on this machine limits each entry to just under 8KB. But
that isn't inherent to to ERST, both larger and smaller values would
be an option. 8K seems quite useful for kmsg_dump purposes as
it grabs a significant number of lines leading up to the oops/panic.

After I dropped the "stupid" part about not saving OOPs, I ran a
test on Friday where I instigated a dozen or so OOPses, and all
were saved without ERST complaining. There is a "how big is the
store" call in the protocol - so the only option is to keep writing
until a failure occurs. I will try this when I'm next in the office,

> Because if it is big enough - for some value of 'big' - we could try to
> never let it fill up.
With the price per GByte of flash memory - the answer *ought* to
be some huge value

> If want to save space we might even do something
> crazy like compressing the oops info. In the rare event it fills up or
> hits some 'almost-full' watermarks, we can kick some userspace daemon
> to start writing the oopses to fs and clear the pstore. This all should
> happen in the case where all you get is non-critical warnings and the
> system is still alive.

This is a good point. In the case that the OOPs that was recorded to
persistent store wasn't fatal - then the normal daemons will log it to
/var/log/messages. So in the general case, if the system finds that
it isn't dead a few seconds after logging something - it is most likely
safe to assume that the persistent store copy isn't vital, as the data
should be available elsewhere.

> However, in the critical cases, you get a single "stream" of oopses with
> the first one being the most important one and then you panic. And in
> most cases that stream is only a couple of oopses long. For that, the
> pstore should be big enough to easily contain it.

Yes. At least it is for my test system. I know I can fit a dozen messages.

> So, I think what we could do is keep our big enough pstore with enough
> free space for a bunch of oopses in case we panic. In the remaining
> cases, we write them out thus freeing some more space.

Some feedback from syslogd (or whatever it is that gets things from
dmesg into /var/log/messages) would help here ... though to be really
useful it might need "fsync" to /var/log/messages, which might not
be a welcome addition.

-Tony

2010-12-20 02:47:27

by Huang, Ying

[permalink] [raw]
Subject: Re: [concept & "good taste" review] persistent store

On Mon, 2010-12-20 at 04:17 +0800, Tony Luck wrote:
> On Sun, Dec 19, 2010 at 1:17 AM, Borislav Petkov <[email protected]> wrote:
> > Before we go and delve into priority-sorting the oopses in the pstore,
> > let me ask this: how big an actual persistent storage device are we
> > talking?
>
> I'm not sure how big the store is on my system ... the ACPI/ERST
> interface on this machine limits each entry to just under 8KB. But
> that isn't inherent to to ERST, both larger and smaller values would
> be an option. 8K seems quite useful for kmsg_dump purposes as
> it grabs a significant number of lines leading up to the oops/panic.

8KB is about 100-200 lines message, sometimes it may be not enough for
all necessary information. But in fact, we can use multiple ERST
records to save one kernel message dumping. So the real limitation is
just total storage capacity.

For journal (or transaction) semantics, I think the overall picture of
system can be as follow,

kernel ---> persistent storage ---> user space daemon ---> disk

1. kernel dump OOPS and PANIC message into persistent storage.
2. user space daemon poll persistent storage and erase the record after
saving it into disk.

So,

- for OOPS messages will not cause system panic, it will go to disk and
will not use up the persistent storage.
- for PANIC message, it will be saved into persistent storage only and
read out/saved to disk/erased after successfully rebooting. (Maybe need
the heuristic to overwrite the latest OOPS if storage is really tight).

The issues are:

- We need a user space daemon, via extending /sbin/syslogd?
- It is a little hard to implement "poll" support for a file system.

Best Regards,
Huang Ying

2010-12-20 07:26:42

by Borislav Petkov

[permalink] [raw]
Subject: Re: [concept & "good taste" review] persistent store

On Sun, Dec 19, 2010 at 12:17:53PM -0800, Tony Luck wrote:
> > So, I think what we could do is keep our big enough pstore with enough
> > free space for a bunch of oopses in case we panic. In the remaining
> > cases, we write them out thus freeing some more space.
>
> Some feedback from syslogd (or whatever it is that gets things from
> dmesg into /var/log/messages) would help here ... though to be really
> useful it might need "fsync" to /var/log/messages, which might not
> be a welcome addition.

Right.

However, AFAICT, all warnings/oopses go to dmesg anyway and from there
to syslog so actually there should be no need to save them in pstore
too, methinks.

IOW, the simple (maybe too simple) algo of the pstore could be something
like:

1. Got a relevant message from kernel, log it.

2. Am I still alive?
|-> yes => Ah ok, non-critical, it should be in the logs anyway, drop last message set from pstore.
|-> no => Well I can't do here anything anyway, I'm dead... note-to-self: Show last messages set to user upon next boot.

So, the pstore should actually be constantly almost empty rather than
almost full. And it definitely is going to contain the last critical
oops which may or may not be in the syslog.

Or am I missing something?

Thanks.

--
Regards/Gruss,
Boris.

2010-12-20 10:48:06

by David Howells

[permalink] [raw]
Subject: Re: [concept & "good taste" review] persistent store

Huang Ying <[email protected]> wrote:

> - for OOPS messages will not cause system panic, it will go to disk and
> will not use up the persistent storage.

You can't guarantee that an oops didn't just kill your ability to actually
write your syslog to disk or out across the network.

David

2010-12-20 10:51:28

by David Howells

[permalink] [raw]
Subject: Re: [concept & "good taste" review] persistent store

Tony Luck <[email protected]> wrote:

> I'm not sure how big the store is on my system ... the ACPI/ERST
> interface on this machine limits each entry to just under 8KB. But
> that isn't inherent to to ERST, both larger and smaller values would
> be an option. 8K seems quite useful for kmsg_dump purposes as
> it grabs a significant number of lines leading up to the oops/panic.

How are things written to the store? Is it a record-based affair?

David

2010-12-20 16:53:03

by Tony Luck

[permalink] [raw]
Subject: Re: [concept & "good taste" review] persistent store

On Mon, Dec 20, 2010 at 2:49 AM, David Howells <[email protected]> wrote:
> How are things written to the store? ?Is it a record-based affair?

Yes, the ERST store is record based ... record format is defined in UEFI.
To save blobs of console logs I just defined a new uuid for a section type
that holds a bunch of bytes for me.

-Tony

2010-12-20 17:19:36

by Tony Luck

[permalink] [raw]
Subject: Re: [concept & "good taste" review] persistent store

On Sun, Dec 19, 2010 at 6:47 PM, Huang Ying <[email protected]> wrote:
> 8KB is about 100-200 lines message, sometimes it may be not enough for
> all necessary information. ?But in fact, we can use multiple ERST
> records to save one kernel message dumping. So the real limitation is
> just total storage capacity.

We can - but it will be a bit messy within my pstore + platform driver
framework (I'd have to have erst tell pstore that it could handle some
larger size, 2x or 3x the actual record size and then break a write
into pieces. Presumably re-assemble on the read side too).

It would be good to get some data from real users to see
whether this is required, or whether ~8K actually is adequate
in enough situations to allow us to keep the code simple.

If anyone with a fairly new X86 server system (up to a
couple of years old ... check to see whether it has ERST
support in BIOS with: "dmesg | grep ERST") wants to test
this, the latest code is in today's linux-next tree, tagged
as "next-20101220". (I have some hope that we might get
this together for the 2.6.38 merge window ... but if not,
then 2.6.39 would be fine too).

-Tony

2010-12-20 17:24:40

by Linus Torvalds

[permalink] [raw]
Subject: Re: [concept & "good taste" review] persistent store

On Sun, Dec 19, 2010 at 11:26 PM, Borislav Petkov <[email protected]> wrote:
>
> IOW, the simple (maybe too simple) algo of the pstore could be something
> like:

Simple?

> 1. Got a relevant message from kernel, log it.
>
> 2. Am I still alive?

Umm. The "am I still alive" question is traditionally called "the
stopping problem", and is considered to be the traditional example of
_least_ simple problem there is. As in "fundamentally unsolvable".

Did we kill X? Did we happen to hold some critical lock when oopsing?
Was it syslogd itself that died and caused nothing further to be
saved, even if the machine otherwise seems to be fine? Or did the
filesystem go into read-only mode due to the problem and the rest of
the system is fine, but the disk is never going to see the messages?

In other words, the problem really is that "am I still alive" thing.
That's a seriously impossible question to answer.

What _can_ be answered is "did somebody write out the oops, then
fsync, and then notify us about it?" But without explicit notification
of "yeah, it really is saved off somewhere else", we really can't
tell.

We could do heuristics, of course, and they might even work in
practice (like "flush after half an hour if there has been actual work
done and the machine is clearly making progress").

Linus

2010-12-20 18:58:21

by Borislav Petkov

[permalink] [raw]
Subject: Re: [concept & "good taste" review] persistent store

On Mon, Dec 20, 2010 at 09:18:25AM -0800, Linus Torvalds wrote:
> On Sun, Dec 19, 2010 at 11:26 PM, Borislav Petkov <[email protected]> wrote:
> >
> > IOW, the simple (maybe too simple) algo of the pstore could be something
> > like:
>
> Simple?
>
> > 1. Got a relevant message from kernel, log it.
> >
> > 2. Am I still alive?
>
> Umm. The "am I still alive" question is traditionally called "the
> stopping problem", and is considered to be the traditional example of
> _least_ simple problem there is. As in "fundamentally unsolvable".

Yeah, I meant simple in the sense of only two steps required.

> Did we kill X? Did we happen to hold some critical lock when oopsing?
> Was it syslogd itself that died and caused nothing further to be
> saved, even if the machine otherwise seems to be fine? Or did the
> filesystem go into read-only mode due to the problem and the rest of
> the system is fine, but the disk is never going to see the messages?
>
> In other words, the problem really is that "am I still alive" thing.
> That's a seriously impossible question to answer.

Maybe we should rephrase this as "am I still alive and well," for a
specific definition of well.

> What _can_ be answered is "did somebody write out the oops, then
> fsync, and then notify us about it?" But without explicit notification
> of "yeah, it really is saved off somewhere else", we really can't
> tell.

That could work, I should look deeper into that.

> We could do heuristics, of course, and they might even work in
> practice (like "flush after half an hour if there has been actual work
> done and the machine is clearly making progress").

Yes, this was exactly what I was trying to say! Do something in a
watchdog handler path that shows that we actually made progress. But
you're right, we'd still need the notification. My look at "did we make
a progress" was too simple and there _are_ nuances which need to be
accounted for.

Thanks.

--
Regards/Gruss,
Boris.

2010-12-20 21:09:45

by Luck, Tony

[permalink] [raw]
Subject: Re: [concept & "good taste" review] persistent store

On Mon, Dec 20, 2010 at 10:58 AM, Borislav Petkov <[email protected]> wrote:
> Yes, this was exactly what I was trying to say! Do something in a
> watchdog handler path that shows that we actually made progress. But
> you're right, we'd still need the notification. My look at "did we make
> a progress" was too simple and there _are_ nuances which need to be
> accounted for.

My view of the notification is:
1) Oops happens, gets written to persistent store.
2) fs/pstore code makes a file appear in /dev/pstore
3) Daemon that's watching /dev/pstore sees new file
4) File is read, copied some place safe and fsync'd
5) Daemon unlinks file from /dev/pstore
6) fs/pstore code tells platform level to erase record

-Tony

2010-12-21 00:41:25

by Huang, Ying

[permalink] [raw]
Subject: Re: [concept & "good taste" review] persistent store

On Mon, 2010-12-20 at 18:46 +0800, David Howells wrote:
> Huang Ying <[email protected]> wrote:
>
> > - for OOPS messages will not cause system panic, it will go to disk and
> > will not use up the persistent storage.
>
> You can't guarantee that an oops didn't just kill your ability to actually
> write your syslog to disk or out across the network.

I do not need to guarantee that. If the OOPS message can not be written
to disk, just keeping it in persistent storage, and that is the very
value of persistent storage. But for OOPS can go to disk safely, we do
not need to waste persistent storage for it.

Best Regards,
Huang Ying

2010-12-21 00:48:21

by Huang, Ying

[permalink] [raw]
Subject: Re: [concept & "good taste" review] persistent store

On Tue, 2010-12-21 at 01:19 +0800, Tony Luck wrote:
> On Sun, Dec 19, 2010 at 6:47 PM, Huang Ying <[email protected]> wrote:
> > 8KB is about 100-200 lines message, sometimes it may be not enough for
> > all necessary information. But in fact, we can use multiple ERST
> > records to save one kernel message dumping. So the real limitation is
> > just total storage capacity.
>
> We can - but it will be a bit messy within my pstore + platform driver
> framework (I'd have to have erst tell pstore that it could handle some
> larger size, 2x or 3x the actual record size and then break a write
> into pieces. Presumably re-assemble on the read side too).

If persistent storage driver (such as ERST) implements kmsg_dumper by
themselves, that should be easier to use multiple records for one
message dumping request.

Best Regards,
Huang Ying

2010-12-21 05:13:36

by Luck, Tony

[permalink] [raw]
Subject: Re: [concept & "good taste" review] persistent store

On Mon, Dec 20, 2010 at 4:48 PM, Huang Ying <[email protected]> wrote:
> If persistent storage driver (such as ERST) implements kmsg_dumper by
> themselves, that should be easier to use multiple records for one
> message dumping request.

Or I can build it into the fs/pstore layer. If the record size looks too
small[*], loop writing multiple records (starting with the tail of the buffer,
so that if the persistent store fills up, we are sure to have the stack
back trace and register dump).

-Tony

[*] "Too small" may be an architecture dependent value. Archs
with a large number of registers (like ia64) might need more space
than those with fewer registers.

2010-12-21 07:43:08

by Borislav Petkov

[permalink] [raw]
Subject: Re: [concept & "good taste" review] persistent store

On Mon, Dec 20, 2010 at 09:13:33PM -0800, Tony Luck wrote:
> On Mon, Dec 20, 2010 at 4:48 PM, Huang Ying <[email protected]> wrote:
> > If persistent storage driver (such as ERST) implements kmsg_dumper by
> > themselves, that should be easier to use multiple records for one
> > message dumping request.
>
> Or I can build it into the fs/pstore layer. If the record size looks too
> small[*], loop writing multiple records (starting with the tail of the buffer,
> so that if the persistent store fills up, we are sure to have the stack
> back trace and register dump).

Wait a second, isn't this loop writing actually what we want?!

I mean, if the pstore is big enough to contain, say the last ca. 50 or
so oops messages (this size should suffice even for OOMkill oopses),
we do make sure that we always have the last set of oopses belonging
together and thus if we panic, we have also the relevant ones.

[ IIRC, each ERST record can contain roughly 8K, and we want 50 of
those so having a 1M flash on the platform isn't science fiction
anymore, no? ]

By loop writing it, we also make sure we overwrite the oldest oopses
which, if the pstore is big enough, shouldn't have anything to do with
the next set to be written.

So, basically designing the pstore as a big enough circular buffer
should suffice, no?

This would even alleviate the need for the notification heuristics and
writeout to userspace.

Hmm...

--
Regards/Gruss,
Boris.

2010-12-21 10:13:32

by David Howells

[permalink] [raw]
Subject: Re: [concept & "good taste" review] persistent store

Huang Ying <[email protected]> wrote:

> > > - for OOPS messages will not cause system panic, it will go to disk and
> > > will not use up the persistent storage.
> >
> > You can't guarantee that an oops didn't just kill your ability to actually
> > write your syslog to disk or out across the network.
>
> I do not need to guarantee that. If the OOPS message can not be written
> to disk, just keeping it in persistent storage, and that is the very
> value of persistent storage. But for OOPS can go to disk safely, we do
> not need to waste persistent storage for it.

My point is how do you know an oops message will actually manage to get to
disk? There's a userspace program (syslogd) between the kernel log and the
disk or network.

David

2010-12-22 00:27:02

by Huang, Ying

[permalink] [raw]
Subject: Re: [concept & "good taste" review] persistent store

On Tue, 2010-12-21 at 18:10 +0800, David Howells wrote:
> Huang Ying <[email protected]> wrote:
>
> > > > - for OOPS messages will not cause system panic, it will go to disk and
> > > > will not use up the persistent storage.
> > >
> > > You can't guarantee that an oops didn't just kill your ability to actually
> > > write your syslog to disk or out across the network.
> >
> > I do not need to guarantee that. If the OOPS message can not be written
> > to disk, just keeping it in persistent storage, and that is the very
> > value of persistent storage. But for OOPS can go to disk safely, we do
> > not need to waste persistent storage for it.
>
> My point is how do you know an oops message will actually manage to get to
> disk? There's a userspace program (syslogd) between the kernel log and the
> disk or network.

The user space program (syslogd) is in my big picture, it will guarantee
an oops meesage actually go to disk via something like fsync. After
doing that, the user space program can erase the corresponding record in
persistent storage to free the space. So all in all, oops messages not
causing system panic or disk error will go to disk eventually and being
freed and will not use up the persistent storage.

Best Regards,
Huang Ying

2010-12-22 00:33:23

by David Howells

[permalink] [raw]
Subject: Re: [concept & "good taste" review] persistent store

Huang Ying <[email protected]> wrote:

> The user space program (syslogd) is in my big picture, it will guarantee
> an oops meesage actually go to disk via something like fsync. After
> doing that, the user space program can erase the corresponding record in
> persistent storage to free the space. So all in all, oops messages not
> causing system panic or disk error will go to disk eventually and being
> freed and will not use up the persistent storage.

I see. So you rely on fsync() to hang forever if the message can't be written
to disk because an oops killed the write path?

David

2010-12-22 00:45:26

by Huang, Ying

[permalink] [raw]
Subject: Re: [concept & "good taste" review] persistent store

On Wed, 2010-12-22 at 08:32 +0800, David Howells wrote:
> Huang Ying <[email protected]> wrote:
>
> > The user space program (syslogd) is in my big picture, it will guarantee
> > an oops meesage actually go to disk via something like fsync. After
> > doing that, the user space program can erase the corresponding record in
> > persistent storage to free the space. So all in all, oops messages not
> > causing system panic or disk error will go to disk eventually and being
> > freed and will not use up the persistent storage.
>
> I see. So you rely on fsync() to hang forever if the message can't be written
> to disk because an oops killed the write path?

Yes. Or fsync() report error.

Best Regards,
Huang Ying

2010-12-22 00:55:31

by David Lang

[permalink] [raw]
Subject: Re: [concept & "good taste" review] persistent store

On Wed, 22 Dec 2010, Huang Ying wrote:

> On Tue, 2010-12-21 at 18:10 +0800, David Howells wrote:
>> Huang Ying <[email protected]> wrote:
>>
>>>>> - for OOPS messages will not cause system panic, it will go to disk and
>>>>> will not use up the persistent storage.
>>>>
>>>> You can't guarantee that an oops didn't just kill your ability to actually
>>>> write your syslog to disk or out across the network.
>>>
>>> I do not need to guarantee that. If the OOPS message can not be written
>>> to disk, just keeping it in persistent storage, and that is the very
>>> value of persistent storage. But for OOPS can go to disk safely, we do
>>> not need to waste persistent storage for it.
>>
>> My point is how do you know an oops message will actually manage to get to
>> disk? There's a userspace program (syslogd) between the kernel log and the
>> disk or network.
>
> The user space program (syslogd) is in my big picture, it will guarantee
> an oops meesage actually go to disk via something like fsync. After
> doing that, the user space program can erase the corresponding record in
> persistent storage to free the space. So all in all, oops messages not
> causing system panic or disk error will go to disk eventually and being
> freed and will not use up the persistent storage.

almost nobody runs syslog with a fsync after each message anymore. the
problem is that doing so reduced throughput so much that you ended up
loosing more messages (and causing processes to block, resulting in
user-visible problems) because the messages had to queue up for
processing.

so if you want to record critical messages and be guaranteed that they are
on disk, you will be needing a specific application, and not just using
standard syslog.

David Lang

2010-12-22 07:34:49

by Luck, Tony

[permalink] [raw]
Subject: Re: [concept & "good taste" review] persistent store

On Tue, Dec 21, 2010 at 4:53 PM, <[email protected]> wrote:
> almost nobody runs syslog with a fsync after each message anymore. the
> problem is that doing so reduced throughput so much that you ended up
> loosing more messages (and causing processes to block, resulting in
> user-visible problems) because the messages had to queue up for processing.
>
> so if you want to record critical messages and be guaranteed that they are
> on disk, you will be needing a specific application, and not just using
> standard syslog.

Here we are talking about OOPS situations ... when we have an
oops we will log it to persistent store and make sure we keep it
until something else says that the information has been saved
somewhere else.

Not sure how difficult it would be to have syslog parse what it
copies to look for oops signatures and
1) Issue an fsync()
2) Tell the persistent store to drop the record.

Perhaps some other logging mechanism would be better?
I outlined my thoughts for this yesterday:
> 1) Oops happens, gets written to persistent store.
> 2) fs/pstore code makes a file appear in /dev/pstore
> 3) Daemon that's watching /dev/pstore sees new file
> 4) File is read, copied some place safe and fsync'd
> 5) Daemon unlinks file from /dev/pstore
> 6) fs/pstore code tells platform level to erase record

Since this is only for "oops" class events, I don't think that
the fsync should be a performance problem (you really have
much bigger problems if there are that many oopses!)

-Tony