2010-12-06 18:29:48

by Luck, Tony

[permalink] [raw]
Subject: [RFC] persistent store (version 3) (part 1 of 2)

Generic code for a "persistent store" file system as an interface
to provide a friendly user level ABI to platform specific drivers
for that access to some form of non-volatile storage that can be
used to save the dying words of an OS instance and make them
available after the reset/reboot.

Signed-off-by: Tony Luck <[email protected]>

---

Part 2 (smaller Cc: list) has the changes to the X86 ACPI/APEI/ERST
code that makes use of this.

Version 3 drops use of /sys and the "daft" "erase" file that is used
to clear entries from the persistent store. This version uses the much
more natural "rm file" to delete entries from the platform store. This
code is based on fs/ramfs - with most operations removed (cannot make
new files, directories, symlinks, hook for unlink to pass the erase
command down to the platform layer.

The bit that I think needs most eyeballs is pstore_mkfile() where I push
the data from the persistent store into the file. I couldn't find any
examples where other kernel code does this to copy - so I made it up.
Questions I have:
1) Is "struct file" too big to be on the stack? I can change it to kmalloc() it.
2) Did I get all the bits I need to fake a write to the file?
3) Is this whole thinng OK, or is there a better way? Peter had suggested
that I use aops->write_begin() & aops->write_end(). But I got lost in
the fs/vm code calling sequence working out what I'd need to set up
before calling these - using do_sync_write() seemed a lot easier, but
I've learned that "easy" is quite often not "right" :-)

Here's what the user sees now:

# ls -l /dev/pstore
total 16
-r--r--r-- 1 root root 7896 Dec 3 10:56 dmesg-erst-5546531799825383425
# grep RIP: /dev/pstore/dmesg-erst-5546531799825383425
<4>[ 552.268202] RIP: 0010:[<ffffffff812a3a25>] [<ffffffff812a3a25>] sysrq_handle_crash+0x16/0x20
# rm /dev//pstore/dmesg-erst-5546531799825383425


diff --git a/Documentation/ABI/testing/pstore b/Documentation/ABI/testing/pstore
new file mode 100644
index 0000000..717fddb
--- /dev/null
+++ b/Documentation/ABI/testing/pstore
@@ -0,0 +1,35 @@
+Where: /dev/pstore/...
+Date: January 2011
+Kernel Version: 2.6.38
+Contact: [email protected]
+Description: Generic interface to platform dependent persistent storage.
+
+ Platforms that provide a mechanism to preserve some data
+ across system reboots can register with this driver to
+ provide a generic interface to show records captured in
+ the dying moments. In the case of a panic() the last part
+ of the console log is captured, but other interesting
+ data can also be saved.
+
+ # mount -t pstore - /dev/pstore
+
+ $ ls -l /dev/pstore
+ total 0
+ -r--r--r-- 1 root root 7896 Nov 30 15:38 dmesg-erst-1
+
+ Different users of this interface will result in different
+ filename prefixes. Currently two are defined:
+
+ "dmesg" - saved console log
+ "mce" - architecture dependent data from fatal h/w error
+
+ Once the information in a file has been read, removing
+ the file will signal to the underlying persistent storage
+ device that it can reclaim the space for later re-use.
+
+ $ rm /dev/pstore/dmesg-erst-1
+
+ The expectation is that all files in /dev/pstore
+ will be saved elsewhere and erased from persistent store
+ soon after boot to free up space ready for the next
+ catastrophe.
diff --git a/fs/Kconfig b/fs/Kconfig
index 771f457..2bbe47f 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -188,6 +188,7 @@ source "fs/omfs/Kconfig"
source "fs/hpfs/Kconfig"
source "fs/qnx4/Kconfig"
source "fs/romfs/Kconfig"
+source "fs/pstore/Kconfig"
source "fs/sysv/Kconfig"
source "fs/ufs/Kconfig"
source "fs/exofs/Kconfig"
diff --git a/fs/Makefile b/fs/Makefile
index a7f7cef..db71a5b 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -121,3 +121,4 @@ obj-$(CONFIG_BTRFS_FS) += btrfs/
obj-$(CONFIG_GFS2_FS) += gfs2/
obj-$(CONFIG_EXOFS_FS) += exofs/
obj-$(CONFIG_CEPH_FS) += ceph/
+obj-$(CONFIG_PSTORE) += pstore/
diff --git a/fs/pstore/Kconfig b/fs/pstore/Kconfig
new file mode 100644
index 0000000..867d0ac
--- /dev/null
+++ b/fs/pstore/Kconfig
@@ -0,0 +1,13 @@
+config PSTORE
+ bool "Persistant store support"
+ default n
+ help
+ This option enables generic access to platform level
+ persistent storage via "pstore" filesystem that can
+ be mounted as /dev/pstore. Only useful if you have
+ a platform level driver that registers with pstore to
+ provide the data, so you probably should just go say "Y"
+ (or "M") to a platform specific persistent store driver
+ (e.g. ACPI_APEI on X86) which will select this for you.
+ If you don't have a platform persistent store driver,
+ say N.
diff --git a/fs/pstore/Makefile b/fs/pstore/Makefile
new file mode 100644
index 0000000..760f4bc
--- /dev/null
+++ b/fs/pstore/Makefile
@@ -0,0 +1,7 @@
+#
+# Makefile for the linux pstorefs routines.
+#
+
+obj-y += pstore.o
+
+pstore-objs += inode.o platform.o
diff --git a/fs/pstore/inode.c b/fs/pstore/inode.c
new file mode 100644
index 0000000..5673b83
--- /dev/null
+++ b/fs/pstore/inode.c
@@ -0,0 +1,185 @@
+/*
+ * Persistent Storage - ramfs parts.
+ *
+ * Copyright (C) 2010 Intel Corporation <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/pagemap.h>
+#include <linux/highmem.h>
+#include <linux/time.h>
+#include <linux/init.h>
+#include <linux/string.h>
+#include <linux/mount.h>
+#include <linux/ramfs.h>
+#include <linux/sched.h>
+#include <linux/magic.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+
+#include "internal.h"
+
+#define pstore_get_inode ramfs_get_inode
+
+static int pstore_unlink(struct inode *dir, struct dentry *dentry)
+{
+ pstore_erase(dentry->d_inode->i_private);
+
+ return simple_unlink(dir, dentry);
+}
+
+static const struct inode_operations pstore_dir_inode_operations = {
+ .lookup = simple_lookup,
+ .unlink = pstore_unlink,
+};
+
+static const struct super_operations pstore_ops = {
+ .statfs = simple_statfs,
+ .drop_inode = generic_delete_inode,
+ .show_options = generic_show_options,
+};
+
+static struct super_block *pstore_sb;
+static struct vfsmount *pstore_mnt;
+
+int pstore_is_mounted(void)
+{
+ return pstore_mnt != NULL;
+}
+
+/*
+ * Make a regular file in the root directory of our file system.
+ * Load it up with "size" bytes of data from "buf".
+ * Set the mtime & ctime to the date that this record was originally stored.
+ */
+int pstore_mkfile(char *name, char *data, size_t size, struct timespec time,
+ void *private)
+{
+ struct dentry *root = pstore_sb->s_root;
+ struct dentry *dentry;
+ struct inode *inode;
+ struct file f;
+ ssize_t n;
+ mm_segment_t old_fs = get_fs();
+
+ inode = pstore_get_inode(pstore_sb, root->d_inode, S_IFREG | 0444, 0);
+ if (!inode)
+ return -ENOMEM;
+
+ mutex_lock(&root->d_inode->i_mutex);
+
+ inode->i_private = private;
+
+ dentry = d_alloc_name(root, name);
+ if (!IS_ERR(dentry))
+ d_add(dentry, inode);
+
+ mutex_unlock(&root->d_inode->i_mutex);
+
+ memset(&f, '0', sizeof f);
+ f.f_mapping = inode->i_mapping;
+ f.f_path.dentry = dentry;
+ f.f_path.mnt = pstore_mnt;
+ f.f_pos = 0;
+ f.f_op = inode->i_fop;
+ set_fs(KERNEL_DS);
+ n = do_sync_write(&f, data, size, &f.f_pos);
+ set_fs(old_fs);
+
+ if (time.tv_sec)
+ inode->i_mtime = inode->i_ctime = time;
+
+ return (n == size) ? 0 : -EIO;
+}
+
+int pstore_fill_super(struct super_block *sb, void *data, int silent)
+{
+ struct inode *inode = NULL;
+ struct dentry *root;
+ int err;
+
+ save_mount_options(sb, data);
+
+ pstore_sb = sb;
+
+ sb->s_maxbytes = MAX_LFS_FILESIZE;
+ sb->s_blocksize = PAGE_CACHE_SIZE;
+ sb->s_blocksize_bits = PAGE_CACHE_SHIFT;
+ sb->s_magic = PSTOREFS_MAGIC;
+ sb->s_op = &pstore_ops;
+ sb->s_time_gran = 1;
+
+ inode = pstore_get_inode(sb, NULL, S_IFDIR | 0755, 0);
+ if (!inode) {
+ err = -ENOMEM;
+ goto fail;
+ }
+ /* override ramfs "dir" options so we catch unlink(2) */
+ inode->i_op = &pstore_dir_inode_operations;
+
+ root = d_alloc_root(inode);
+ sb->s_root = root;
+ if (!root) {
+ err = -ENOMEM;
+ goto fail;
+ }
+
+ pstore_get_records();
+
+ return 0;
+fail:
+ iput(inode);
+ return err;
+}
+
+static int pstore_get_sb(struct file_system_type *fs_type,
+ int flags, const char *dev_name, void *data, struct vfsmount *mnt)
+{
+ struct dentry *root;
+
+ root = mount_nodev(fs_type, flags, data, pstore_fill_super);
+ if (IS_ERR(root))
+ return -ENOMEM;
+
+ mnt->mnt_root = root;
+ mnt->mnt_sb = root->d_sb;
+ pstore_mnt = mnt;
+
+ return 0;
+}
+
+static void pstore_kill_sb(struct super_block *sb)
+{
+ kill_litter_super(sb);
+ pstore_sb = NULL;
+ pstore_mnt = NULL;
+}
+
+static struct file_system_type pstore_fs_type = {
+ .name = "pstore",
+ .get_sb = pstore_get_sb,
+ .kill_sb = pstore_kill_sb,
+};
+
+static int __init init_pstore_fs(void)
+{
+ return register_filesystem(&pstore_fs_type);
+}
+module_init(init_pstore_fs)
+
+MODULE_AUTHOR("Tony Luck <[email protected]>");
+MODULE_LICENSE("GPL");
diff --git a/fs/pstore/internal.h b/fs/pstore/internal.h
new file mode 100644
index 0000000..1f274ff
--- /dev/null
+++ b/fs/pstore/internal.h
@@ -0,0 +1,5 @@
+extern void pstore_get_records(void);
+extern int pstore_mkfile(char *name, char *data, size_t size,
+ struct timespec time, void *private);
+extern void pstore_erase(void *private);
+extern int pstore_is_mounted(void);
diff --git a/fs/pstore/platform.c b/fs/pstore/platform.c
new file mode 100644
index 0000000..63f08db
--- /dev/null
+++ b/fs/pstore/platform.c
@@ -0,0 +1,194 @@
+/*
+ * Persistent Storage - platform driver interface parts.
+ *
+ * Copyright (C) 2010 Intel Corporation <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/atomic.h>
+#include <linux/types.h>
+#include <linux/errno.h>
+#include <linux/init.h>
+#include <linux/kmsg_dump.h>
+#include <linux/module.h>
+#include <linux/pstore.h>
+#include <linux/string.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+
+#include "internal.h"
+
+/*
+ * pstore_lock just protects "psinfo" during
+ * calls to pstore_register()
+ */
+static DEFINE_SPINLOCK(pstore_lock);
+static struct pstore_info *psinfo;
+
+#define PSTORE_NAMELEN 64
+
+struct pstore_private {
+ u64 id;
+ int (*erase)(u64);
+};
+
+/*
+ * callback from kmsg_dump. (s2,l2) has the most recently
+ * written bytes, older bytes are in (s1,l1). Save as much
+ * as we can from the end of the buffer.
+ */
+static void pstore_dump(struct kmsg_dumper *dumper,
+ enum kmsg_dump_reason reason,
+ const char *s1, unsigned long l1,
+ const char *s2, unsigned long l2)
+{
+ unsigned long s1_start, s2_start;
+ unsigned long l1_cpy, l2_cpy;
+ char *dst = psinfo->buf;
+
+ /* Don't dump oopses to persistent store */
+ if (reason == KMSG_DUMP_OOPS)
+ return;
+
+ l2_cpy = min(l2, psinfo->bufsize);
+ l1_cpy = min(l1, psinfo->bufsize - l2_cpy);
+
+ s2_start = l2 - l2_cpy;
+ s1_start = l1 - l1_cpy;
+
+ mutex_lock(&psinfo->mutex);
+ memcpy(dst, s1 + s1_start, l1_cpy);
+ memcpy(dst + l1_cpy, s2 + s2_start, l2_cpy);
+
+ psinfo->write(PSTORE_TYPE_DMESG, l1_cpy + l2_cpy);
+ mutex_unlock(&psinfo->mutex);
+}
+
+static struct kmsg_dumper pstore_dumper = {
+ .dump = pstore_dump,
+};
+
+/*
+ * platform specific persistent storage driver registers with
+ * us here. If pstore is already mounted, call the platform
+ * read function right away to populate the file system. If not
+ * then the pstore mount code will call us later to fill out
+ * the file system.
+ *
+ * Register with kmsg_dump to save last part of console log on panic.
+ */
+int pstore_register(struct pstore_info *psi)
+{
+ struct module *owner = psi->owner;
+
+ spin_lock(&pstore_lock);
+ if (psinfo) {
+ spin_unlock(&pstore_lock);
+ return -EBUSY;
+ }
+ psinfo = psi;
+ spin_unlock(&pstore_lock);
+
+ if (owner && !try_module_get(owner)) {
+ psinfo = NULL;
+ return -EINVAL;
+ }
+
+ if (pstore_is_mounted())
+ pstore_get_records();
+
+ kmsg_dump_register(&pstore_dumper);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(pstore_register);
+
+/*
+ * Read all the records from the persistent store. Create and
+ * file files in our filesystem.
+ */
+void pstore_get_records(void)
+{
+ struct pstore_info *psi = psinfo;
+ size_t size;
+ u64 id;
+ int type;
+ char name[PSTORE_NAMELEN];
+ struct pstore_private *private;
+ struct timespec time;
+
+ if (!psi)
+ return;
+
+ mutex_lock(&psinfo->mutex);
+ for (;;) {
+ if (psi->read(&id, &type, &size, &time) <= 0)
+ break;
+ switch (type) {
+ case PSTORE_TYPE_DMESG:
+ sprintf(name, "dmesg-%s-%lld", psi->name, id);
+ break;
+ case PSTORE_TYPE_MCE:
+ sprintf(name, "mce-%s-%lld", psi->name, id);
+ break;
+ default:
+ sprintf(name, "type%d-%s-%lld", type, psi->name, id);
+ break;
+ }
+ private = kmalloc(sizeof *private, GFP_KERNEL);
+ private->id = id;
+ private->erase = psi->erase;
+ pstore_mkfile(name, psi->buf, size, time, private);
+ }
+ mutex_unlock(&psinfo->mutex);
+}
+
+/*
+ * Call platform driver to write a record to the
+ * persistent store. We don't worry about making
+ * this visible in the pstore filesystem as the
+ * presumption is that we only save things to the
+ * store in the dying moments of OS failure. Hence
+ * nobody will see the entries in the filesystem.
+ */
+int pstore_write(int type, char *buf, size_t size)
+{
+ int ret;
+
+ if (!psinfo)
+ return -ENODEV;
+ if (size > psinfo->bufsize)
+ return -EFBIG;
+
+ mutex_lock(&psinfo->mutex);
+ memcpy(psinfo->buf, buf, size);
+ ret = psinfo->write(type, size);
+ mutex_unlock(&psinfo->mutex);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(pstore_write);
+
+/*
+ * When a file is unlinked from our file system we call the
+ * platform driver to erase the record from persistent store.
+ */
+void pstore_erase(void *private)
+{
+ struct pstore_private *p = private;
+
+ p->erase(p->id);
+ kfree(p);
+}
diff --git a/include/linux/magic.h b/include/linux/magic.h
index ff690d0..e87fd5a 100644
--- a/include/linux/magic.h
+++ b/include/linux/magic.h
@@ -26,6 +26,7 @@
#define ISOFS_SUPER_MAGIC 0x9660
#define JFFS2_SUPER_MAGIC 0x72b6
#define ANON_INODE_FS_MAGIC 0x09041934
+#define PSTOREFS_MAGIC 0x6165676C

#define MINIX_SUPER_MAGIC 0x137F /* original minix fs */
#define MINIX_SUPER_MAGIC2 0x138F /* minix fs, 30 char names */
diff --git a/include/linux/pstore.h b/include/linux/pstore.h
new file mode 100644
index 0000000..4d635c0
--- /dev/null
+++ b/include/linux/pstore.h
@@ -0,0 +1,57 @@
+/*
+ * Persistent Storage - pstore.h
+ *
+ * Copyright (C) 2010 Intel Corporation <[email protected]>
+ *
+ * This code is the generic layer to export data records from platform
+ * level persistent storage via a file system.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+#ifndef _LINUX_PSTORE_H
+#define _LINUX_PSTORE_H
+
+/* types */
+#define PSTORE_TYPE_DMESG 0
+#define PSTORE_TYPE_MCE 1
+
+struct pstore_info {
+ struct module *owner;
+ char *name;
+ struct mutex mutex; /* serialize access to 'buf' */
+ char *buf;
+ size_t bufsize;
+ int (*read)(u64 *id, int *type, size_t *size,
+ struct timespec *time);
+ int (*write)(int type, size_t size);
+ int (*erase)(u64 id);
+};
+
+#if defined(CONFIG_PSTORE) || defined(CONFIG_PSTORE_MODULE)
+extern int pstore_register(struct pstore_info *);
+extern int pstore_write(int type, char *buf, size_t size);
+#else
+static inline int
+pstore_register(struct pstore_info *psi)
+{
+ return -ENODEV;
+}
+static inline int
+pstore_write(int type, char *buf, size_t size)
+{
+ return -ENODEV;
+}
+#endif
+
+#endif /*_LINUX_PSTORE_H*/


2010-12-08 06:24:44

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC] persistent store (version 3) (part 1 of 2)

Hey Tony,

looks good. Minor nitpicks below.

On Mon, Dec 06, 2010 at 10:29:43AM -0800, Luck, Tony wrote:
> Generic code for a "persistent store" file system as an interface
> to provide a friendly user level ABI to platform specific drivers
> for that access to some form of non-volatile storage that can be
> used to save the dying words of an OS instance and make them
> available after the reset/reboot.
>
> Signed-off-by: Tony Luck <[email protected]>
>
> ---
>
> Part 2 (smaller Cc: list) has the changes to the X86 ACPI/APEI/ERST
> code that makes use of this.
>
> Version 3 drops use of /sys and the "daft" "erase" file that is used
> to clear entries from the persistent store. This version uses the much
> more natural "rm file" to delete entries from the platform store. This
> code is based on fs/ramfs - with most operations removed (cannot make
> new files, directories, symlinks, hook for unlink to pass the erase
> command down to the platform layer.
>
> The bit that I think needs most eyeballs is pstore_mkfile() where I push
> the data from the persistent store into the file. I couldn't find any
> examples where other kernel code does this to copy - so I made it up.
> Questions I have:
> 1) Is "struct file" too big to be on the stack? I can change it to kmalloc() it.

Well, this happens during normal operation when we init the pstore and
since we don't need it after pstore_mkwrite has returned (do we?), I
guess we should be fine.

> 2) Did I get all the bits I need to fake a write to the file?
> 3) Is this whole thinng OK, or is there a better way? Peter had suggested
> that I use aops->write_begin() & aops->write_end(). But I got lost in
> the fs/vm code calling sequence working out what I'd need to set up
> before calling these - using do_sync_write() seemed a lot easier, but
> I've learned that "easy" is quite often not "right" :-)
>
> Here's what the user sees now:
>
> # ls -l /dev/pstore
> total 16
> -r--r--r-- 1 root root 7896 Dec 3 10:56 dmesg-erst-5546531799825383425
> # grep RIP: /dev/pstore/dmesg-erst-5546531799825383425
> <4>[ 552.268202] RIP: 0010:[<ffffffff812a3a25>] [<ffffffff812a3a25>] sysrq_handle_crash+0x16/0x20
> # rm /dev//pstore/dmesg-erst-5546531799825383425
>
>
> diff --git a/Documentation/ABI/testing/pstore b/Documentation/ABI/testing/pstore
> new file mode 100644
> index 0000000..717fddb
> --- /dev/null
> +++ b/Documentation/ABI/testing/pstore
> @@ -0,0 +1,35 @@
> +Where: /dev/pstore/...
> +Date: January 2011
> +Kernel Version: 2.6.38
> +Contact: [email protected]
> +Description: Generic interface to platform dependent persistent storage.
> +
> + Platforms that provide a mechanism to preserve some data
> + across system reboots can register with this driver to
> + provide a generic interface to show records captured in
> + the dying moments. In the case of a panic() the last part

"panic" (I'd remove the
brackets)

> + of the console log is captured, but other interesting
> + data can also be saved.
> +
> + # mount -t pstore - /dev/pstore
> +
> + $ ls -l /dev/pstore
> + total 0
> + -r--r--r-- 1 root root 7896 Nov 30 15:38 dmesg-erst-1
> +
> + Different users of this interface will result in different
> + filename prefixes. Currently two are defined:
> +
> + "dmesg" - saved console log
> + "mce" - architecture dependent data from fatal h/w error
> +
> + Once the information in a file has been read, removing
> + the file will signal to the underlying persistent storage
> + device that it can reclaim the space for later re-use.
> +
> + $ rm /dev/pstore/dmesg-erst-1
> +
> + The expectation is that all files in /dev/pstore
> + will be saved elsewhere and erased from persistent store
> + soon after boot to free up space ready for the next
> + catastrophe.

"next catastrophe", hehe, this sounds very optimistic :)

> diff --git a/fs/Kconfig b/fs/Kconfig
> index 771f457..2bbe47f 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -188,6 +188,7 @@ source "fs/omfs/Kconfig"
> source "fs/hpfs/Kconfig"
> source "fs/qnx4/Kconfig"
> source "fs/romfs/Kconfig"
> +source "fs/pstore/Kconfig"
> source "fs/sysv/Kconfig"
> source "fs/ufs/Kconfig"
> source "fs/exofs/Kconfig"
> diff --git a/fs/Makefile b/fs/Makefile
> index a7f7cef..db71a5b 100644
> --- a/fs/Makefile
> +++ b/fs/Makefile
> @@ -121,3 +121,4 @@ obj-$(CONFIG_BTRFS_FS) += btrfs/
> obj-$(CONFIG_GFS2_FS) += gfs2/
> obj-$(CONFIG_EXOFS_FS) += exofs/
> obj-$(CONFIG_CEPH_FS) += ceph/
> +obj-$(CONFIG_PSTORE) += pstore/
> diff --git a/fs/pstore/Kconfig b/fs/pstore/Kconfig
> new file mode 100644
> index 0000000..867d0ac
> --- /dev/null
> +++ b/fs/pstore/Kconfig
> @@ -0,0 +1,13 @@
> +config PSTORE
> + bool "Persistant store support"
> + default n
> + help
> + This option enables generic access to platform level
> + persistent storage via "pstore" filesystem that can
> + be mounted as /dev/pstore. Only useful if you have
> + a platform level driver that registers with pstore to
> + provide the data, so you probably should just go say "Y"
> + (or "M") to a platform specific persistent store driver
> + (e.g. ACPI_APEI on X86) which will select this for you.
> + If you don't have a platform persistent store driver,
> + say N.
> diff --git a/fs/pstore/Makefile b/fs/pstore/Makefile
> new file mode 100644
> index 0000000..760f4bc
> --- /dev/null
> +++ b/fs/pstore/Makefile
> @@ -0,0 +1,7 @@
> +#
> +# Makefile for the linux pstorefs routines.
> +#
> +
> +obj-y += pstore.o
> +
> +pstore-objs += inode.o platform.o
> diff --git a/fs/pstore/inode.c b/fs/pstore/inode.c
> new file mode 100644
> index 0000000..5673b83
> --- /dev/null
> +++ b/fs/pstore/inode.c
> @@ -0,0 +1,185 @@
> +/*
> + * Persistent Storage - ramfs parts.
> + *
> + * Copyright (C) 2010 Intel Corporation <[email protected]>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
> + */
> +
> +#include <linux/module.h>
> +#include <linux/fs.h>
> +#include <linux/pagemap.h>
> +#include <linux/highmem.h>
> +#include <linux/time.h>
> +#include <linux/init.h>
> +#include <linux/string.h>
> +#include <linux/mount.h>
> +#include <linux/ramfs.h>
> +#include <linux/sched.h>
> +#include <linux/magic.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +
> +#include "internal.h"
> +
> +#define pstore_get_inode ramfs_get_inode
> +
> +static int pstore_unlink(struct inode *dir, struct dentry *dentry)
> +{
> + pstore_erase(dentry->d_inode->i_private);
> +
> + return simple_unlink(dir, dentry);
> +}
> +
> +static const struct inode_operations pstore_dir_inode_operations = {
> + .lookup = simple_lookup,
> + .unlink = pstore_unlink,
> +};
> +
> +static const struct super_operations pstore_ops = {
> + .statfs = simple_statfs,
> + .drop_inode = generic_delete_inode,
> + .show_options = generic_show_options,
> +};
> +
> +static struct super_block *pstore_sb;
> +static struct vfsmount *pstore_mnt;
> +
> +int pstore_is_mounted(void)
> +{
> + return pstore_mnt != NULL;
> +}
> +
> +/*
> + * Make a regular file in the root directory of our file system.
> + * Load it up with "size" bytes of data from "buf".
> + * Set the mtime & ctime to the date that this record was originally stored.
> + */
> +int pstore_mkfile(char *name, char *data, size_t size, struct timespec time,
> + void *private)
> +{
> + struct dentry *root = pstore_sb->s_root;
> + struct dentry *dentry;
> + struct inode *inode;
> + struct file f;
> + ssize_t n;
> + mm_segment_t old_fs = get_fs();
> +
> + inode = pstore_get_inode(pstore_sb, root->d_inode, S_IFREG | 0444, 0);
> + if (!inode)
> + return -ENOMEM;
> +
> + mutex_lock(&root->d_inode->i_mutex);
> +
> + inode->i_private = private;
> +
> + dentry = d_alloc_name(root, name);
> + if (!IS_ERR(dentry))
> + d_add(dentry, inode);

what happens if it IS_ERR? Error handling like

goto d_alloc_error;

d_alloc_error:
iput(inode);
return -ENOSPC;


or similar, at least this is what ramfs seems to be doing.

> +
> + mutex_unlock(&root->d_inode->i_mutex);
> +
> + memset(&f, '0', sizeof f);
> + f.f_mapping = inode->i_mapping;
> + f.f_path.dentry = dentry;
> + f.f_path.mnt = pstore_mnt;
> + f.f_pos = 0;
> + f.f_op = inode->i_fop;
> + set_fs(KERNEL_DS);
> + n = do_sync_write(&f, data, size, &f.f_pos);
> + set_fs(old_fs);
> +
> + if (time.tv_sec)
> + inode->i_mtime = inode->i_ctime = time;
> +
> + return (n == size) ? 0 : -EIO;
> +}
> +
> +int pstore_fill_super(struct super_block *sb, void *data, int silent)
> +{
> + struct inode *inode = NULL;
> + struct dentry *root;
> + int err;
> +
> + save_mount_options(sb, data);
> +
> + pstore_sb = sb;
> +
> + sb->s_maxbytes = MAX_LFS_FILESIZE;
> + sb->s_blocksize = PAGE_CACHE_SIZE;
> + sb->s_blocksize_bits = PAGE_CACHE_SHIFT;
> + sb->s_magic = PSTOREFS_MAGIC;
> + sb->s_op = &pstore_ops;
> + sb->s_time_gran = 1;
> +
> + inode = pstore_get_inode(sb, NULL, S_IFDIR | 0755, 0);
> + if (!inode) {
> + err = -ENOMEM;
> + goto fail;
> + }
> + /* override ramfs "dir" options so we catch unlink(2) */
> + inode->i_op = &pstore_dir_inode_operations;
> +
> + root = d_alloc_root(inode);
> + sb->s_root = root;
> + if (!root) {
> + err = -ENOMEM;
> + goto fail;
> + }
> +
> + pstore_get_records();
> +
> + return 0;
> +fail:
> + iput(inode);
> + return err;
> +}
> +
> +static int pstore_get_sb(struct file_system_type *fs_type,
> + int flags, const char *dev_name, void *data, struct vfsmount *mnt)
> +{
> + struct dentry *root;
> +
> + root = mount_nodev(fs_type, flags, data, pstore_fill_super);
> + if (IS_ERR(root))
> + return -ENOMEM;
> +
> + mnt->mnt_root = root;
> + mnt->mnt_sb = root->d_sb;
> + pstore_mnt = mnt;
> +
> + return 0;
> +}
> +
> +static void pstore_kill_sb(struct super_block *sb)
> +{
> + kill_litter_super(sb);
> + pstore_sb = NULL;
> + pstore_mnt = NULL;
> +}
> +
> +static struct file_system_type pstore_fs_type = {
> + .name = "pstore",
> + .get_sb = pstore_get_sb,
> + .kill_sb = pstore_kill_sb,
> +};
> +
> +static int __init init_pstore_fs(void)
> +{
> + return register_filesystem(&pstore_fs_type);
> +}
> +module_init(init_pstore_fs)
> +
> +MODULE_AUTHOR("Tony Luck <[email protected]>");
> +MODULE_LICENSE("GPL");
> diff --git a/fs/pstore/internal.h b/fs/pstore/internal.h
> new file mode 100644
> index 0000000..1f274ff
> --- /dev/null
> +++ b/fs/pstore/internal.h
> @@ -0,0 +1,5 @@
> +extern void pstore_get_records(void);
> +extern int pstore_mkfile(char *name, char *data, size_t size,
> + struct timespec time, void *private);
> +extern void pstore_erase(void *private);
> +extern int pstore_is_mounted(void);
> diff --git a/fs/pstore/platform.c b/fs/pstore/platform.c
> new file mode 100644
> index 0000000..63f08db
> --- /dev/null
> +++ b/fs/pstore/platform.c
> @@ -0,0 +1,194 @@
> +/*
> + * Persistent Storage - platform driver interface parts.
> + *
> + * Copyright (C) 2010 Intel Corporation <[email protected]>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
> + */
> +
> +#include <linux/atomic.h>
> +#include <linux/types.h>
> +#include <linux/errno.h>
> +#include <linux/init.h>
> +#include <linux/kmsg_dump.h>
> +#include <linux/module.h>
> +#include <linux/pstore.h>
> +#include <linux/string.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +
> +#include "internal.h"
> +
> +/*
> + * pstore_lock just protects "psinfo" during
> + * calls to pstore_register()
> + */
> +static DEFINE_SPINLOCK(pstore_lock);
> +static struct pstore_info *psinfo;
> +
> +#define PSTORE_NAMELEN 64
> +
> +struct pstore_private {
> + u64 id;
> + int (*erase)(u64);
> +};
> +
> +/*
> + * callback from kmsg_dump. (s2,l2) has the most recently
> + * written bytes, older bytes are in (s1,l1). Save as much
> + * as we can from the end of the buffer.
> + */
> +static void pstore_dump(struct kmsg_dumper *dumper,
> + enum kmsg_dump_reason reason,
> + const char *s1, unsigned long l1,
> + const char *s2, unsigned long l2)
> +{
> + unsigned long s1_start, s2_start;
> + unsigned long l1_cpy, l2_cpy;
> + char *dst = psinfo->buf;
> +
> + /* Don't dump oopses to persistent store */
> + if (reason == KMSG_DUMP_OOPS)
> + return;
> +
> + l2_cpy = min(l2, psinfo->bufsize);
> + l1_cpy = min(l1, psinfo->bufsize - l2_cpy);
> +
> + s2_start = l2 - l2_cpy;
> + s1_start = l1 - l1_cpy;
> +
> + mutex_lock(&psinfo->mutex);
> + memcpy(dst, s1 + s1_start, l1_cpy);
> + memcpy(dst + l1_cpy, s2 + s2_start, l2_cpy);
> +
> + psinfo->write(PSTORE_TYPE_DMESG, l1_cpy + l2_cpy);
> + mutex_unlock(&psinfo->mutex);
> +}
> +
> +static struct kmsg_dumper pstore_dumper = {
> + .dump = pstore_dump,
> +};
> +
> +/*
> + * platform specific persistent storage driver registers with
> + * us here. If pstore is already mounted, call the platform
> + * read function right away to populate the file system. If not
> + * then the pstore mount code will call us later to fill out
> + * the file system.
> + *
> + * Register with kmsg_dump to save last part of console log on panic.
> + */
> +int pstore_register(struct pstore_info *psi)
> +{
> + struct module *owner = psi->owner;
> +
> + spin_lock(&pstore_lock);
> + if (psinfo) {
> + spin_unlock(&pstore_lock);
> + return -EBUSY;
> + }
> + psinfo = psi;
> + spin_unlock(&pstore_lock);
> +
> + if (owner && !try_module_get(owner)) {
> + psinfo = NULL;
> + return -EINVAL;
> + }
> +
> + if (pstore_is_mounted())
> + pstore_get_records();
> +
> + kmsg_dump_register(&pstore_dumper);

You don't check psi->write() method's existence anymore, I'm assuming
this is implied now... ?

> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(pstore_register);
> +
> +/*
> + * Read all the records from the persistent store. Create and
> + * file files in our filesystem.
> + */
> +void pstore_get_records(void)
> +{
> + struct pstore_info *psi = psinfo;
> + size_t size;
> + u64 id;
> + int type;
> + char name[PSTORE_NAMELEN];
> + struct pstore_private *private;
> + struct timespec time;
> +
> + if (!psi)
> + return;
> +
> + mutex_lock(&psinfo->mutex);
> + for (;;) {
> + if (psi->read(&id, &type, &size, &time) <= 0)
> + break;
> + switch (type) {
> + case PSTORE_TYPE_DMESG:
> + sprintf(name, "dmesg-%s-%lld", psi->name, id);
> + break;
> + case PSTORE_TYPE_MCE:
> + sprintf(name, "mce-%s-%lld", psi->name, id);
> + break;
> + default:
> + sprintf(name, "type%d-%s-%lld", type, psi->name, id);
> + break;
> + }
> + private = kmalloc(sizeof *private, GFP_KERNEL);
> + private->id = id;
> + private->erase = psi->erase;
> + pstore_mkfile(name, psi->buf, size, time, private);
> + }
> + mutex_unlock(&psinfo->mutex);
> +}
> +
> +/*
> + * Call platform driver to write a record to the
> + * persistent store. We don't worry about making
> + * this visible in the pstore filesystem as the
> + * presumption is that we only save things to the
> + * store in the dying moments of OS failure. Hence
> + * nobody will see the entries in the filesystem.
> + */
> +int pstore_write(int type, char *buf, size_t size)
> +{
> + int ret;
> +
> + if (!psinfo)
> + return -ENODEV;

newline.

> + if (size > psinfo->bufsize)
> + return -EFBIG;
> +
> + mutex_lock(&psinfo->mutex);
> + memcpy(psinfo->buf, buf, size);
> + ret = psinfo->write(type, size);
> + mutex_unlock(&psinfo->mutex);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(pstore_write);
> +
> +/*
> + * When a file is unlinked from our file system we call the
> + * platform driver to erase the record from persistent store.
> + */
> +void pstore_erase(void *private)
> +{
> + struct pstore_private *p = private;
> +
> + p->erase(p->id);
> + kfree(p);
> +}
> diff --git a/include/linux/magic.h b/include/linux/magic.h
> index ff690d0..e87fd5a 100644
> --- a/include/linux/magic.h
> +++ b/include/linux/magic.h
> @@ -26,6 +26,7 @@
> #define ISOFS_SUPER_MAGIC 0x9660
> #define JFFS2_SUPER_MAGIC 0x72b6
> #define ANON_INODE_FS_MAGIC 0x09041934
> +#define PSTOREFS_MAGIC 0x6165676C

what does that mean anyway? "aegl" :)

>
> #define MINIX_SUPER_MAGIC 0x137F /* original minix fs */
> #define MINIX_SUPER_MAGIC2 0x138F /* minix fs, 30 char names */
> diff --git a/include/linux/pstore.h b/include/linux/pstore.h
> new file mode 100644
> index 0000000..4d635c0
> --- /dev/null
> +++ b/include/linux/pstore.h
> @@ -0,0 +1,57 @@
> +/*
> + * Persistent Storage - pstore.h
> + *
> + * Copyright (C) 2010 Intel Corporation <[email protected]>
> + *
> + * This code is the generic layer to export data records from platform
> + * level persistent storage via a file system.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
> + */
> +#ifndef _LINUX_PSTORE_H
> +#define _LINUX_PSTORE_H
> +
> +/* types */
> +#define PSTORE_TYPE_DMESG 0
> +#define PSTORE_TYPE_MCE 1

You could make this into a proper enum

enum pstore_type_id {
PSTORE_TYPE_DMESG = 0,
PSTORE_TYPE_MCE = 1,
PSTORE_TYPE_MAX,
};

so that...

> +
> +struct pstore_info {
> + struct module *owner;
> + char *name;
> + struct mutex mutex; /* serialize access to 'buf' */

[ maybe a more descriptive variable name like buf_mutex or whatever ]

> + char *buf;
> + size_t bufsize;
> + int (*read)(u64 *id, int *type, size_t *size,
> + struct timespec *time);
> + int (*write)(int type, size_t size);

... you can enforce typechecking for pstore->write:

int (*write)(enum pstore_type_id type, size_t size);


> + int (*erase)(u64 id);
> +};
> +
> +#if defined(CONFIG_PSTORE) || defined(CONFIG_PSTORE_MODULE)

What is CONFIG_PSTORE_MODULE? Can't seem to find it in your (2 of 2)
message either.

> +extern int pstore_register(struct pstore_info *);
> +extern int pstore_write(int type, char *buf, size_t size);
> +#else
> +static inline int
> +pstore_register(struct pstore_info *psi)
> +{
> + return -ENODEV;
> +}
> +static inline int
> +pstore_write(int type, char *buf, size_t size)
> +{
> + return -ENODEV;
> +}
> +#endif
> +
> +#endif /*_LINUX_PSTORE_H*/
>

--
Regards/Gruss,
Boris.

2010-12-08 17:12:24

by Luck, Tony

[permalink] [raw]
Subject: Re: [RFC] persistent store (version 3) (part 1 of 2)

On Tue, Dec 7, 2010 at 10:24 PM, Borislav Petkov <[email protected]> wrote:
> looks good. Minor nitpicks below.

Thanks for looking at this.

>> 1) Is "struct file" too big to be on the stack? I can change it to kmalloc() it.
>
> Well, this happens during normal operation when we init the pstore and
> since we don't need it after pstore_mkwrite has returned (do we?), I
> guess we should be fine.
struct file is 184 bytes (with the CONFIG options I am using, it can get
a bit larger, but not much).

>> + ? ? ? ? ? ? the dying moments. ?In the case of a panic() the last part
>
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?"panic" (I'd remove the
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? brackets)

Ok. Will do.

>> + ? ? dentry = d_alloc_name(root, name);
>> + ? ? if (!IS_ERR(dentry))
>> + ? ? ? ? ? ? d_add(dentry, inode);
>
> what happens if it IS_ERR? Error handling like
>
> ? ? ? ?goto d_alloc_error;
>
> d_alloc_error:
> ? ? ? ?iput(inode);
> ? ? ? ?return -ENOSPC;
>
>
> or similar, at least this is what ramfs seems to be doing.

Good point - I need to clean up if things fail (and akpm would
like to see me re-factor so that there is only one "return"
statement for better maintainabiliy). I'll fix up stuff here.

>> + ? ? kmsg_dump_register(&pstore_dumper);
>
> You don't check psi->write() method's existence anymore, I'm assuming
> this is implied now... ?

Andrew's comments on version 2 were:
>It doesn't seem appropriate to check this here. It's a programming
>error! Just install the thing and let the kernel oops - the programmer
>will notice.

So I dropped the tests ... just checking whether the function pointer
was NULL wasn't a very strong test anyway.

>> + ? ? if (!psinfo)
>> + ? ? ? ? ? ? return -ENODEV;
>
> newline.

OK.

>> +#define PSTOREFS_MAGIC ? ? ? ? ? ? ? 0x6165676C
>
> what does that mean anyway? "aegl" :)

My initials: Anthony Eric George Luck. I've been using "aegl"
as my Unix login since 1979 (6th Edition on a pdp11/34).

>> +/* types */
>> +#define ? ? ?PSTORE_TYPE_DMESG ? ? ? 0
>> +#define ? ? ?PSTORE_TYPE_MCE ? ? ? ? 1
>
> You could make this into a proper enum
>
> enum pstore_type_id {
> ? ? ? ?PSTORE_TYPE_DMESG ? ? ? = 0,
> ? ? ? ?PSTORE_TYPE_MCE ? ? ? ? = 1,
> ? ? ? ?PSTORE_TYPE_MAX,
> };
OK. Type checking is nice. I think I might need a
PSTORE_TYPE_UNKNOWN to handle the case
where a new platform driver saves a record to
persistent store, and then the user boots an older
kernel that doesn't know about the new type (in
the APEI/ERST driver the type turns into a UUID
in the UEFI header for the record - so there is some
mapping going on at that level).

>> +struct pstore_info {
>> + ? ? struct module ? *owner;
>> + ? ? char ? ? ? ? ? ?*name;
>> + ? ? struct mutex ? ?mutex; ?/* serialize access to 'buf' */
>
> ?[ maybe a more descriptive variable name like buf_mutex or whatever ]

Sure. Will change.

>> +#if defined(CONFIG_PSTORE) || defined(CONFIG_PSTORE_MODULE)
>
> What is CONFIG_PSTORE_MODULE? Can't seem to find it in your (2 of 2)
> message either.

This is a remnant of when PSTORE was tristate - when I chose 'm'
rather than 'y' in "make menuconfig" I ended up with CONFIG_PSTORE_MODULE
rather than CONFIG_PSTORE. But now it is just a "bool" this isn't
needed any more. Will fix.

-Tony

2010-12-08 23:35:43

by Luck, Tony

[permalink] [raw]
Subject: Re: [RFC] persistent store (version 3) (part 1 of 2)

> > what happens if it IS_ERR?
>
> Good point - I need to clean up if things fail (and akpm would
> like to see me re-factor so that there is only one "return"
> statement for better maintainabiliy). I'll fix up stuff here.

Here's what I'll have in the next version. With all the error
handling the pstore_mkfile() routine was getting a bit big, so
I split out the writing-to-the-file part into another function.
The pair of functions now look like this:

/*
* Set up a file structure as if we had opened this file and
* write our data to it.
*/
static int pstore_writefile(struct inode *inode, struct dentry *dentry,
char *data, size_t size)
{
struct file f;
ssize_t n;
mm_segment_t old_fs = get_fs();

memset(&f, '0', sizeof f);
f.f_mapping = inode->i_mapping;
f.f_path.dentry = dentry;
f.f_path.mnt = pstore_mnt;
f.f_pos = 0;
f.f_op = inode->i_fop;
set_fs(KERNEL_DS);
n = do_sync_write(&f, data, size, &f.f_pos);
set_fs(old_fs);

return n == size;
}

/*
* Make a regular file in the root directory of our file system.
* Load it up with "size" bytes of data from "buf".
* Set the mtime & ctime to the date that this record was originally stored.
*/
int pstore_mkfile(char *name, char *data, size_t size, struct timespec time,
void *private)
{
struct dentry *root = pstore_sb->s_root;
struct dentry *dentry;
struct inode *inode;
int rc = 0;

inode = pstore_get_inode(pstore_sb, root->d_inode, S_IFREG | 0444, 0);
if (!inode) {
rc = -ENOMEM;
goto fail;
}
inode->i_private = private;

mutex_lock(&root->d_inode->i_mutex);

dentry = d_alloc_name(root, name);
if (IS_ERR(dentry)) {
mutex_unlock(&root->d_inode->i_mutex);
rc = -ENOSPC;
goto fail1;
}

d_add(dentry, inode);

mutex_unlock(&root->d_inode->i_mutex);

if (!pstore_writefile(inode, dentry, data, size)) {
inode->i_nlink--;
mutex_lock(&root->d_inode->i_mutex);
d_delete(dentry);
dput(dentry);
mutex_unlock(&root->d_inode->i_mutex);
rc = -ENOSPC;
goto fail;
}

if (time.tv_sec)
inode->i_mtime = inode->i_ctime = time;

return 0;

fail1:
iput(inode);
fail:
return rc;
}

2010-12-09 13:15:01

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC] persistent store (version 3) (part 1 of 2)

On Wed, Dec 08, 2010 at 03:35:40PM -0800, Luck, Tony wrote:
> > > what happens if it IS_ERR?
> >
> > Good point - I need to clean up if things fail (and akpm would
> > like to see me re-factor so that there is only one "return"
> > statement for better maintainabiliy). I'll fix up stuff here.
>
> Here's what I'll have in the next version. With all the error
> handling the pstore_mkfile() routine was getting a bit big, so
> I split out the writing-to-the-file part into another function.
> The pair of functions now look like this:
>
> /*
> * Set up a file structure as if we had opened this file and
> * write our data to it.
> */
> static int pstore_writefile(struct inode *inode, struct dentry *dentry,
> char *data, size_t size)
> {
> struct file f;
> ssize_t n;
> mm_segment_t old_fs = get_fs();
>
> memset(&f, '0', sizeof f);
> f.f_mapping = inode->i_mapping;
> f.f_path.dentry = dentry;
> f.f_path.mnt = pstore_mnt;
> f.f_pos = 0;
> f.f_op = inode->i_fop;
> set_fs(KERNEL_DS);
> n = do_sync_write(&f, data, size, &f.f_pos);
> set_fs(old_fs);
>
> return n == size;
> }

Good.

>
> /*
> * Make a regular file in the root directory of our file system.
> * Load it up with "size" bytes of data from "buf".
> * Set the mtime & ctime to the date that this record was originally stored.
> */
> int pstore_mkfile(char *name, char *data, size_t size, struct timespec time,
> void *private)
> {
> struct dentry *root = pstore_sb->s_root;
> struct dentry *dentry;
> struct inode *inode;
> int rc = 0;
>
> inode = pstore_get_inode(pstore_sb, root->d_inode, S_IFREG | 0444, 0);
> if (!inode) {
> rc = -ENOMEM;
> goto fail;
> }
> inode->i_private = private;
>
> mutex_lock(&root->d_inode->i_mutex);
>
> dentry = d_alloc_name(root, name);
> if (IS_ERR(dentry)) {
> mutex_unlock(&root->d_inode->i_mutex);
> rc = -ENOSPC;
> goto fail1;
> }
>
> d_add(dentry, inode);
>
> mutex_unlock(&root->d_inode->i_mutex);
>
> if (!pstore_writefile(inode, dentry, data, size)) {
> inode->i_nlink--;
> mutex_lock(&root->d_inode->i_mutex);
> d_delete(dentry);
> dput(dentry);
> mutex_unlock(&root->d_inode->i_mutex);
> rc = -ENOSPC;
> goto fail;
> }

don't we have to iput() the inode here too if pstore_writefile() fails?

>
> if (time.tv_sec)
> inode->i_mtime = inode->i_ctime = time;
>
> return 0;
>
> fail1:
> iput(inode);
> fail:
> return rc;
> }

So this version is ok. However, I'd go and move the error paths to
labels near the end of the function since this makes the main part of
what it does more readable, IMHO. IOW, something like this (untested, of
course):

int pstore_mkfile(char *name, char *data, size_t size, struct timespec time,
void *private)
{
struct dentry *root = pstore_sb->s_root;
struct dentry *dentry;
struct inode *inode;
int rc;

rc = -ENOMEM;
inode = pstore_get_inode(pstore_sb, root->d_inode, S_IFREG | 0444, 0);
if (!inode)
goto fail;

inode->i_private = private;

mutex_lock(&root->d_inode->i_mutex);

rc = -ENOSPC;
dentry = d_alloc_name(root, name);
if (IS_ERR(dentry))
goto fail_alloc;

d_add(dentry, inode);

mutex_unlock(&root->d_inode->i_mutex);

if (!pstore_writefile(inode, dentry, data, size))
goto fail_write;

if (time.tv_sec)
inode->i_mtime = inode->i_ctime = time;

return 0;

fail_write:
inode->i_nlink--;
mutex_lock(&root->d_inode->i_mutex);
d_delete(dentry);
dput(dentry);
/* we fall through to unlock the mutex below */

fail_alloc:
mutex_unlock(&root->d_inode->i_mutex);
iput(inode);

fail:
return rc;
}


Hmm.

--
Regards/Gruss,
Boris.

2010-12-09 18:23:35

by Luck, Tony

[permalink] [raw]
Subject: Re: [RFC] persistent store (version 3) (part 1 of 2)

On Thu, Dec 9, 2010 at 5:14 AM, Borislav Petkov <[email protected]> wrote:
>> ? ? ? if (!pstore_writefile(inode, dentry, data, size)) {
>> ? ? ? ? ? ? ? inode->i_nlink--;
>> ? ? ? ? ? ? ? mutex_lock(&root->d_inode->i_mutex);
>> ? ? ? ? ? ? ? d_delete(dentry);
>> ? ? ? ? ? ? ? dput(dentry);
>> ? ? ? ? ? ? ? mutex_unlock(&root->d_inode->i_mutex);
>> ? ? ? ? ? ? ? rc = -ENOSPC;
>> ? ? ? ? ? ? ? goto fail;
>> ? ? ? }
>
> don't we have to iput() the inode here too if pstore_writefile() fails?

No. d_delete() called iput() for us (passing through
dentry_iput along the way) - so we must not do it again.
This upsets the traditional layout of having the error
recovery part of the function undo all the things that
we did leading up to the error. Pity, because your
version is easier to read.

-Tony

2010-12-09 19:12:05

by Luck, Tony

[permalink] [raw]
Subject: Re: [RFC] persistent store (version 3) (part 1 of 2)

> This upsets the traditional layout of having the error
> recovery part of the function undo all the things that
> we did leading up to the error. Pity, because your
> version is easier to read.

But most of your "move the error fixups to the tail, so the
normal code path is easier to follow" bit do still hold and
can be used. How about this:

-Tony

---

int pstore_mkfile(char *name, char *data, size_t size, struct timespec time,
void *private)
{
struct dentry *root = pstore_sb->s_root;
struct dentry *dentry;
struct inode *inode;
int rc;

rc = -ENOMEM;
inode = pstore_get_inode(pstore_sb, root->d_inode, S_IFREG | 0444, 0);
if (!inode)
goto fail;
inode->i_private = private;

mutex_lock(&root->d_inode->i_mutex);

rc = -ENOSPC;
dentry = d_alloc_name(root, name);
if (IS_ERR(dentry))
goto fail_alloc;
d_add(dentry, inode);

mutex_unlock(&root->d_inode->i_mutex);

if (!pstore_writefile(inode, dentry, data, size))
goto fail_write;

if (time.tv_sec)
inode->i_mtime = inode->i_ctime = time;

return 0;

fail_write:
inode->i_nlink--;
mutex_lock(&root->d_inode->i_mutex);
d_delete(dentry);
dput(dentry);
mutex_unlock(&root->d_inode->i_mutex);
goto fail;

fail_alloc:
mutex_unlock(&root->d_inode->i_mutex);
iput(inode);

fail:
return rc;
}

2010-12-09 20:47:15

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC] persistent store (version 3) (part 1 of 2)

On Thu, Dec 09, 2010 at 11:12:02AM -0800, Luck, Tony wrote:
> > This upsets the traditional layout of having the error
> > recovery part of the function undo all the things that
> > we did leading up to the error. Pity, because your
> > version is easier to read.
>
> But most of your "move the error fixups to the tail, so the
> normal code path is easier to follow" bit do still hold and
> can be used. How about this:
>
> -Tony
>
> ---
>
> int pstore_mkfile(char *name, char *data, size_t size, struct timespec time,
> void *private)
> {
> struct dentry *root = pstore_sb->s_root;
> struct dentry *dentry;
> struct inode *inode;
> int rc;
>
> rc = -ENOMEM;
> inode = pstore_get_inode(pstore_sb, root->d_inode, S_IFREG | 0444, 0);
> if (!inode)
> goto fail;

I'd add a newline here.

> inode->i_private = private;
>
> mutex_lock(&root->d_inode->i_mutex);
>
> rc = -ENOSPC;
> dentry = d_alloc_name(root, name);
> if (IS_ERR(dentry))
> goto fail_alloc;

ditto.

> d_add(dentry, inode);
>
> mutex_unlock(&root->d_inode->i_mutex);
>
> if (!pstore_writefile(inode, dentry, data, size))
> goto fail_write;
>
> if (time.tv_sec)
> inode->i_mtime = inode->i_ctime = time;
>
> return 0;
>
> fail_write:
> inode->i_nlink--;
> mutex_lock(&root->d_inode->i_mutex);
> d_delete(dentry);
> dput(dentry);
> mutex_unlock(&root->d_inode->i_mutex);
> goto fail;

Yeah, either that or add a "return rc" to save us the jmp. But since
those paths wouldn't be hit in the normal case, I don't think it matters
that much.

>
> fail_alloc:
> mutex_unlock(&root->d_inode->i_mutex);
> iput(inode);
>
> fail:
> return rc;

Thanks.

--
Regards/Gruss,
Boris.

2010-12-13 11:40:20

by Marco Stornelli

[permalink] [raw]
Subject: Re: [RFC] persistent store (version 3) (part 1 of 2)

Hi Tony,

could you cc linux fs devel and maybe linux embedded mailing list?
Currently I try to insert in mainline the pramfs. Maybe we have the
same goal and we could "merge" our effort. What do you think? Sure the
kernel from this point of view has got a lack of support, I see
different approach and solution to this. I saw in the staging tree a
pmem driver of google guys that it seems to be designed for the same
goal.

Marco

>Generic code for a "persistent store" file system as an interface
>to provide a friendly user level ABI to platform specific drivers
>for that access to some form of non-volatile storage that can be
>used to save the dying words of an OS instance and make them
>available after the reset/reboot.
>
>Signed-off-by: Tony Luck <[email protected]>

2010-12-13 17:56:51

by Luck, Tony

[permalink] [raw]
Subject: RE: [RFC] persistent store (version 3) (part 1 of 2)

>could you cc linux fs devel and maybe linux embedded mailing list?
>Currently I try to insert in mainline the pramfs. Maybe we have the
>same goal and we could "merge" our effort. What do you think? Sure the
>kernel from this point of view has got a lack of support, I see
>different approach and solution to this. I saw in the staging tree a
>pmem driver of google guys that it seems to be designed for the same
>goal.

Marco,

My Cc: list is already getting rather long ... I worry about
adding more mailing lists to it. Embedded is a bit worrying
for me as I've been a bit cavalier about memory allocation in
the current version (on the theory that I was targeting servers,
and a few Kbytes, or even tens of Kbytes wasn't worth worrying
about on a server - which these days is likely to have a minimum
of 4 GBytes).

I grepped around for "pmem" in drivers/staging in linux-next and
got a lot of hits - but none of them leapt out as something
similar. Can you point a bit more specifically?

-Tony

2010-12-14 08:35:06

by Marco Stornelli

[permalink] [raw]
Subject: Re: [RFC] persistent store (version 3) (part 1 of 2)

2010/12/13 Luck, Tony <[email protected]>:
>>could you cc linux fs devel and maybe linux embedded mailing list?
>>Currently I try to insert in mainline the pramfs. Maybe we have the
>>same goal and we could "merge" our effort. What do you think? Sure the
>>kernel from this point of view has got a lack of support, I see
>>different approach and solution to this. I saw in the staging tree a
>>pmem driver of google guys that it seems to be designed for the same
>>goal.
>
> Marco,
>
> My Cc: list is already getting rather long ... I worry about
> adding more mailing lists to it. ?Embedded is a bit worrying
> for me as I've been a bit cavalier about memory allocation in
> the current version (on the theory that I was targeting servers,
> and a few Kbytes, or even tens of Kbytes wasn't worth worrying
> about on a server - which these days is likely to have a minimum
> of 4 GBytes).
>
> I grepped around for "pmem" in drivers/staging in linux-next and
> got a lot of hits - but none of them leapt out as something
> similar. ?Can you point a bit more specifically?

I saw (very quickly) in drivers/staging/dream/pmem.c

Marco

2010-12-14 17:00:03

by Tony Luck

[permalink] [raw]
Subject: Re: [RFC] persistent store (version 3) (part 1 of 2)

On Tue, Dec 14, 2010 at 12:35 AM, Marco Stornelli
<[email protected]> wrote:
> I saw (very quickly) in drivers/staging/dream/pmem.c

Most recent commit for that says:

Staging: dream: remove dream driver and arch from tree

This code is stalled, with no one working on it anymore, and the main
msm code is now going through the proper channels to get merged
correctly.

So remove it as it contains a number of kernel information leaks and it
is doubtful if it even still builds anymore.

:-(

-Tony

2010-12-14 17:53:07

by Marco Stornelli

[permalink] [raw]
Subject: Re: [RFC] persistent store (version 3) (part 1 of 2)

Il 14/12/2010 18:00, Tony Luck ha scritto:
> On Tue, Dec 14, 2010 at 12:35 AM, Marco Stornelli
> <[email protected]> wrote:
>> I saw (very quickly) in drivers/staging/dream/pmem.c
>
> Most recent commit for that says:
>
> Staging: dream: remove dream driver and arch from tree
>
> This code is stalled, with no one working on it anymore, and the main
> msm code is now going through the proper channels to get merged
> correctly.
>
> So remove it as it contains a number of kernel information leaks and it
> is doubtful if it even still builds anymore.
>
> :-(
>
> -Tony
>

Yeah, maybe it wasn't really ready for mainline :) However the idea
behind it, it was the same more or less, to have a piece of memory to
write some persistent information.

Marco

2010-12-14 23:15:03

by Luck, Tony

[permalink] [raw]
Subject: Re: [RFC] persistent store (version 3) (part 1 of 2)

On Tue, Dec 14, 2010 at 9:49 AM, Marco Stornelli
<[email protected]> wrote:
> Yeah, maybe it wasn't really ready for mainline :) However the idea
> behind it, it was the same more or less, to have a piece of memory to
> write some persistent information.

I dug back into old linux-next trees and found the code in the
November editions.

You are right - this does look very similar to what I have done. I
think my code
may be more adaptable to other platforms as I tried hard to separate out the
generic part that implements the file system view from the platform driver
that peeks and pokes at the hardware to access the persistent memory.
It is also accessible to user space without any custom tools.

The dream/pmem.c code might do a better job at handling larger amounts
of persistent memory than my code, but it looks like it needs some special
user tools that use the ioctl(... PMEM_*, ...) operations to set things up.

-Tony