2009-03-18 17:46:42

by Boaz Harrosh

[permalink] [raw]
Subject: [PATCHSET 0/8 version 4] exofs for kernel 2.6.30

What's new since last iteration:

* I completely re-wrote the [PATCH 4/8] exofs: address_space_operations
in which we actually write/read to/from osd-storage. The difference is
that now we try to accumulate as many contiguous pages as possible and
send them as one large request. As opposed to writing each page at a
time, in the previous patchset.

* [PATCH 5/8] exofs: dir_inode and directory operations received lots
of love thanks to Evgeniy Polyakov's grate comments.

exofs is a file system that uses an OSD device as it's back store.

OSD is a new T10 command set that views storage devices not as a large/flat
array of sectors but as a container of objects, each having a length, quota,
time attributes and more. Each object is addressed by a 64bit ID, and is
contained in a 64bit ID partition. Each object has associated attributes
attached to it, which are integral part of the object and provide metadata about
the object. The standard defines some common obligatory attributes, but user
attributes can be added as needed.

Here is the list of patches
[PATCH 1/8] exofs: Kbuild, Headers and osd utils
[PATCH 2/8] exofs: file and file_inode operations
[PATCH 3/8] exofs: symlink_inode and fast_symlink_inode operations
[PATCH 4/8] exofs: address_space_operations
[PATCH 5/8] exofs: dir_inode and directory operations
[PATCH 6/8] exofs: super_operations and file_system_type
[PATCH 7/8] exofs: Documentation
[PATCH 8/8] fs: Add exofs to Kernel build

This patchset is also available on:
git-clone git://git.open-osd.org/linux-open-osd.git linux-next
or on the web at:
http://git.open-osd.org/gitweb.cgi?p=linux-open-osd.git;a=shortlog;h=refs/heads/linux-next

(Above tree is based on Linus v2.6.29-rc8-212-g8144737)

If anyone wants to actually run this code and test it
then please start reading at:
http://open-osd.org
You will need to checkout the out-of-tree git (below) for the user-mode utilities.
Also the exofs.txt file in patch 7/8 should help

If you want to review the user-mode library and supporting plumbings,
git-clone git://git.open-osd.org/open-osd.git
or on the web at:
http://git.open-osd.org/gitweb.cgi?p=open-osd.git;a=summary

Boaz


2009-03-18 17:59:09

by Boaz Harrosh

[permalink] [raw]
Subject: [PATCH 1/8] exofs: Kbuild, Headers and osd utils

This patch includes osd infrastructure that will be used later by
the file system.

Also the declarations of constants, on disk structures,
and prototypes.

And the Kbuild+Kconfig files needed to build the exofs module.

Signed-off-by: Boaz Harrosh <[email protected]>
---
fs/exofs/Kbuild | 30 +++++++++
fs/exofs/Kconfig | 13 ++++
fs/exofs/common.h | 185 +++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/exofs/exofs.h | 127 ++++++++++++++++++++++++++++++++++++
fs/exofs/osd.c | 153 +++++++++++++++++++++++++++++++++++++++++++
5 files changed, 508 insertions(+), 0 deletions(-)
create mode 100644 fs/exofs/Kbuild
create mode 100644 fs/exofs/Kconfig
create mode 100644 fs/exofs/common.h
create mode 100644 fs/exofs/exofs.h
create mode 100644 fs/exofs/osd.c

diff --git a/fs/exofs/Kbuild b/fs/exofs/Kbuild
new file mode 100644
index 0000000..63d822c
--- /dev/null
+++ b/fs/exofs/Kbuild
@@ -0,0 +1,30 @@
+#
+# Kbuild for the EXOFS module
+#
+# Copyright (C) 2008 Panasas Inc. All rights reserved.
+#
+# Authors:
+# Boaz Harrosh <[email protected]>
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License version 2
+#
+# Kbuild - Gets included from the Kernels Makefile and build system
+#
+
+ifneq ($(OSD_INC),)
+# we are built out-of-tree Kconfigure everything as on
+
+CONFIG_EXOFS_FS=m
+ccflags-y += -DCONFIG_EXOFS_FS -DCONFIG_EXOFS_FS_MODULE
+# ccflags-y += -DCONFIG_EXOFS_DEBUG
+
+# if we are built out-of-tree and the hosting kernel has OSD headers
+# then "ccflags-y +=" will not pick the out-off-tree headers. Only by doing
+# this it will work. This might break in future kernels
+KBUILD_CPPFLAGS := -I$(OSD_INC) $(KBUILD_CPPFLAGS)
+
+endif
+
+exofs-y := osd.o
+obj-$(CONFIG_EXOFS_FS) += exofs.o
diff --git a/fs/exofs/Kconfig b/fs/exofs/Kconfig
new file mode 100644
index 0000000..86194b2
--- /dev/null
+++ b/fs/exofs/Kconfig
@@ -0,0 +1,13 @@
+config EXOFS_FS
+ tristate "exofs: OSD based file system support"
+ depends on SCSI_OSD_ULD
+ help
+ EXOFS is a file system that uses an OSD storage device,
+ as its backing storage.
+
+# Debugging-related stuff
+config EXOFS_DEBUG
+ bool "Enable debugging"
+ depends on EXOFS_FS
+ help
+ This option enables EXOFS debug prints.
diff --git a/fs/exofs/common.h b/fs/exofs/common.h
new file mode 100644
index 0000000..bcc4882
--- /dev/null
+++ b/fs/exofs/common.h
@@ -0,0 +1,185 @@
+/*
+ * common.h - Common definitions for both Kernel and user-mode utilities
+ *
+ * Copyright (C) 2005, 2006
+ * Avishay Traeger ([email protected]) ([email protected])
+ * Copyright (C) 2005, 2006
+ * International Business Machines
+ * Copyright (C) 2008, 2009
+ * Boaz Harrosh <[email protected]>
+ *
+ * Copyrights for code taken from ext2:
+ * Copyright (C) 1992, 1993, 1994, 1995
+ * Remy Card ([email protected])
+ * Laboratoire MASI - Institut Blaise Pascal
+ * Universite Pierre et Marie Curie (Paris VI)
+ * from
+ * linux/fs/minix/inode.c
+ * Copyright (C) 1991, 1992 Linus Torvalds
+ *
+ * This file is part of exofs.
+ *
+ * exofs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation. Since it is based on ext2, and the only
+ * valid version of GPL for the Linux kernel is version 2, the only valid
+ * version of GPL for exofs is version 2.
+ *
+ * exofs is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with exofs; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#ifndef __EXOFS_COM_H__
+#define __EXOFS_COM_H__
+
+#include <linux/types.h>
+
+#include <scsi/osd_attributes.h>
+#include <scsi/osd_initiator.h>
+#include <scsi/osd_sec.h>
+
+/****************************************************************************
+ * Object ID related defines
+ * NOTE: inode# = object ID - EXOFS_OBJ_OFF
+ ****************************************************************************/
+#define EXOFS_MIN_PID 0x10000 /* Smallest partition ID */
+#define EXOFS_OBJ_OFF 0x10000 /* offset for objects */
+#define EXOFS_SUPER_ID 0x10000 /* object ID for on-disk superblock */
+#define EXOFS_ROOT_ID 0x10002 /* object ID for root directory */
+
+/* exofs Application specific page/attribute */
+# define EXOFS_APAGE_FS_DATA (OSD_APAGE_APP_DEFINED_FIRST + 3)
+# define EXOFS_ATTR_INODE_DATA 1
+
+/*
+ * The maximum number of files we can have is limited by the size of the
+ * inode number. This is the largest object ID that the file system supports.
+ * Object IDs 0, 1, and 2 are always in use (see above defines).
+ */
+enum {
+ EXOFS_UINT64_MAX = (~0LL),
+ EXOFS_MAX_INO_ID = (sizeof(ino_t) * 8 == 64) ? EXOFS_UINT64_MAX :
+ (1LL << (sizeof(ino_t) * 8 - 1)),
+ EXOFS_MAX_ID = (EXOFS_MAX_INO_ID - 1 - EXOFS_OBJ_OFF),
+};
+
+/****************************************************************************
+ * Misc.
+ ****************************************************************************/
+#define EXOFS_BLKSHIFT 12
+#define EXOFS_BLKSIZE (1UL << EXOFS_BLKSHIFT)
+
+/****************************************************************************
+ * superblock-related things
+ ****************************************************************************/
+#define EXOFS_SUPER_MAGIC 0x5DF5
+
+/*
+ * The file system control block - stored in an object's data (mainly, the one
+ * with ID EXOFS_SUPER_ID). This is where the in-memory superblock is stored
+ * on disk. Right now it just has a magic value, which is basically a sanity
+ * check on our ability to communicate with the object store.
+ */
+struct exofs_fscb {
+ __le64 s_nextid; /* Highest object ID used */
+ __le32 s_numfiles; /* Number of files on fs */
+ __le16 s_magic; /* Magic signature */
+ __le16 s_newfs; /* Non-zero if this is a new fs */
+};
+
+/****************************************************************************
+ * inode-related things
+ ****************************************************************************/
+#define EXOFS_IDATA 5
+
+/*
+ * The file control block - stored in an object's attributes. This is where
+ * the in-memory inode is stored on disk.
+ */
+struct exofs_fcb {
+ __le64 i_size; /* Size of the file */
+ __le16 i_mode; /* File mode */
+ __le16 i_links_count; /* Links count */
+ __le32 i_uid; /* Owner Uid */
+ __le32 i_gid; /* Group Id */
+ __le32 i_atime; /* Access time */
+ __le32 i_ctime; /* Creation time */
+ __le32 i_mtime; /* Modification time */
+ __le32 i_flags; /* File flags (unused for now)*/
+ __le32 i_generation; /* File version (for NFS) */
+ __le32 i_data[EXOFS_IDATA]; /* Short symlink names and device #s */
+};
+
+#define EXOFS_INO_ATTR_SIZE sizeof(struct exofs_fcb)
+
+/* This is the Attribute the fcb is stored in */
+static const struct __weak osd_attr g_attr_inode_data = ATTR_DEF(
+ EXOFS_APAGE_FS_DATA,
+ EXOFS_ATTR_INODE_DATA,
+ EXOFS_INO_ATTR_SIZE);
+
+/****************************************************************************
+ * dentry-related things
+ ****************************************************************************/
+#define EXOFS_NAME_LEN 255
+
+/*
+ * The on-disk directory entry
+ */
+struct exofs_dir_entry {
+ __le64 inode_no; /* inode number */
+ __le16 rec_len; /* directory entry length */
+ u8 name_len; /* name length */
+ u8 file_type; /* umm...file type */
+ char name[EXOFS_NAME_LEN]; /* file name */
+};
+
+enum {
+ EXOFS_FT_UNKNOWN,
+ EXOFS_FT_REG_FILE,
+ EXOFS_FT_DIR,
+ EXOFS_FT_CHRDEV,
+ EXOFS_FT_BLKDEV,
+ EXOFS_FT_FIFO,
+ EXOFS_FT_SOCK,
+ EXOFS_FT_SYMLINK,
+ EXOFS_FT_MAX
+};
+
+#define EXOFS_DIR_PAD 4
+#define EXOFS_DIR_ROUND (EXOFS_DIR_PAD - 1)
+#define EXOFS_DIR_REC_LEN(name_len) \
+ (((name_len) + offsetof(struct exofs_dir_entry, name) + \
+ EXOFS_DIR_ROUND) & ~EXOFS_DIR_ROUND)
+
+/*************************
+ * function declarations *
+ *************************/
+/* osd.c */
+void exofs_make_credential(u8 cred_a[OSD_CAP_LEN],
+ const struct osd_obj_id *obj);
+
+int exofs_check_ok_resid(struct osd_request *or, u64 *in_resid, u64 *out_resid);
+static inline int exofs_check_ok(struct osd_request *or)
+{
+ return exofs_check_ok_resid(or, NULL, NULL);
+}
+int exofs_sync_op(struct osd_request *or, int timeout, u8 *cred);
+int exofs_async_op(struct osd_request *or,
+ osd_req_done_fn *async_done, void *caller_context, u8 *cred);
+
+int extract_attr_from_req(struct osd_request *or, struct osd_attr *attr);
+
+int osd_req_read_kern(struct osd_request *or,
+ const struct osd_obj_id *obj, u64 offset, void *buff, u64 len);
+
+int osd_req_write_kern(struct osd_request *or,
+ const struct osd_obj_id *obj, u64 offset, void *buff, u64 len);
+
+#endif /*ifndef __EXOFS_COM_H__*/
diff --git a/fs/exofs/exofs.h b/fs/exofs/exofs.h
new file mode 100644
index 0000000..304e052
--- /dev/null
+++ b/fs/exofs/exofs.h
@@ -0,0 +1,127 @@
+/*
+ * Copyright (C) 2005, 2006
+ * Avishay Traeger ([email protected]) ([email protected])
+ * Copyright (C) 2005, 2006
+ * International Business Machines
+ * Copyright (C) 2008, 2009
+ * Boaz Harrosh <[email protected]>
+ *
+ * Copyrights for code taken from ext2:
+ * Copyright (C) 1992, 1993, 1994, 1995
+ * Remy Card ([email protected])
+ * Laboratoire MASI - Institut Blaise Pascal
+ * Universite Pierre et Marie Curie (Paris VI)
+ * from
+ * linux/fs/minix/inode.c
+ * Copyright (C) 1991, 1992 Linus Torvalds
+ *
+ * This file is part of exofs.
+ *
+ * exofs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation. Since it is based on ext2, and the only
+ * valid version of GPL for the Linux kernel is version 2, the only valid
+ * version of GPL for exofs is version 2.
+ *
+ * exofs is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with exofs; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include <linux/fs.h>
+#include <linux/time.h>
+#include "common.h"
+
+#ifndef __EXOFS_H__
+#define __EXOFS_H__
+
+#define EXOFS_ERR(fmt, a...) printk(KERN_ERR "exofs: " fmt, ##a)
+
+#ifdef CONFIG_EXOFS_DEBUG
+#define EXOFS_DBGMSG(fmt, a...) \
+ printk(KERN_NOTICE "exofs @%s:%d: " fmt, __func__, __LINE__, ##a)
+#else
+#define EXOFS_DBGMSG(fmt, a...) \
+ do {} while (0)
+#endif
+
+/* u64 has problems with printk this will cast it to unsigned long long */
+#define _LLU(x) (unsigned long long)(x)
+
+/*
+ * our extension to the in-memory superblock
+ */
+struct exofs_sb_info {
+ struct osd_dev *s_dev; /* returned by get_osd_dev */
+ osd_id s_pid; /* partition ID of file system*/
+ int s_timeout; /* timeout for OSD operations */
+ uint64_t s_nextid; /* highest object ID used */
+ uint32_t s_numfiles; /* number of files on fs */
+ spinlock_t s_next_gen_lock; /* spinlock for gen # update */
+ u32 s_next_generation; /* next gen # to use */
+ atomic_t s_curr_pending; /* number of pending commands */
+ uint8_t s_cred[OSD_CAP_LEN]; /* all-powerful credential */
+};
+
+/*
+ * our extension to the in-memory inode
+ */
+struct exofs_i_info {
+ unsigned long i_flags; /* various atomic flags */
+ uint32_t i_data[EXOFS_IDATA];/*short symlink names and device #s*/
+ uint32_t i_dir_start_lookup; /* which page to start lookup */
+ wait_queue_head_t i_wq; /* wait queue for inode */
+ uint64_t i_commit_size; /* the object's written length */
+ uint8_t i_cred[OSD_CAP_LEN];/* all-powerful credential */
+ struct inode vfs_inode; /* normal in-memory inode */
+};
+
+/*
+ * our inode flags
+ */
+#define OBJ_2BCREATED 0 /* object will be created soon*/
+#define OBJ_CREATED 1 /* object has been created on the osd*/
+
+static inline int obj_2bcreated(struct exofs_i_info *oi)
+{
+ return test_bit(OBJ_2BCREATED, &(oi->i_flags));
+}
+
+static inline void set_obj_2bcreated(struct exofs_i_info *oi)
+{
+ set_bit(OBJ_2BCREATED, &(oi->i_flags));
+}
+
+static inline int obj_created(struct exofs_i_info *oi)
+{
+ return test_bit(OBJ_CREATED, &(oi->i_flags));
+}
+
+static inline void set_obj_created(struct exofs_i_info *oi)
+{
+ set_bit(OBJ_CREATED, &(oi->i_flags));
+}
+
+int __exofs_wait_obj_created(struct exofs_i_info *oi);
+static inline int wait_obj_created(struct exofs_i_info *oi)
+{
+ if (likely(obj_created(oi)))
+ return 0;
+
+ return __exofs_wait_obj_created(oi);
+}
+
+/*
+ * get to our inode from the vfs inode
+ */
+static inline struct exofs_i_info *exofs_i(struct inode *inode)
+{
+ return container_of(inode, struct exofs_i_info, vfs_inode);
+}
+
+#endif
diff --git a/fs/exofs/osd.c b/fs/exofs/osd.c
new file mode 100644
index 0000000..b249ae9
--- /dev/null
+++ b/fs/exofs/osd.c
@@ -0,0 +1,153 @@
+/*
+ * Copyright (C) 2005, 2006
+ * Avishay Traeger ([email protected]) ([email protected])
+ * Copyright (C) 2005, 2006
+ * International Business Machines
+ * Copyright (C) 2008, 2009
+ * Boaz Harrosh <[email protected]>
+ *
+ * This file is part of exofs.
+ *
+ * exofs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation. Since it is based on ext2, and the only
+ * valid version of GPL for the Linux kernel is version 2, the only valid
+ * version of GPL for exofs is version 2.
+ *
+ * exofs is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with exofs; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include <scsi/scsi_device.h>
+#include <scsi/osd_sense.h>
+
+#include "exofs.h"
+
+int exofs_check_ok_resid(struct osd_request *or, u64 *in_resid, u64 *out_resid)
+{
+ struct osd_sense_info osi;
+ int ret = osd_req_decode_sense(or, &osi);
+
+ if (ret) { /* translate to Linux codes */
+ if (osi.additional_code == scsi_invalid_field_in_cdb) {
+ if (osi.cdb_field_offset == OSD_CFO_STARTING_BYTE)
+ ret = -EFAULT;
+ if (osi.cdb_field_offset == OSD_CFO_OBJECT_ID)
+ ret = -ENOENT;
+ else
+ ret = -EINVAL;
+ } else if (osi.additional_code == osd_quota_error)
+ ret = -ENOSPC;
+ else
+ ret = -EIO;
+ }
+
+ /* FIXME: should be include in osd_sense_info */
+ if (in_resid)
+ *in_resid = or->in.req ? or->in.req->data_len : 0;
+
+ if (out_resid)
+ *out_resid = or->out.req ? or->out.req->data_len : 0;
+
+ return ret;
+}
+
+void exofs_make_credential(u8 cred_a[OSD_CAP_LEN], const struct osd_obj_id *obj)
+{
+ osd_sec_init_nosec_doall_caps(cred_a, obj, false, true);
+}
+
+/*
+ * Perform a synchronous OSD operation.
+ */
+int exofs_sync_op(struct osd_request *or, int timeout, uint8_t *credential)
+{
+ int ret;
+
+ or->timeout = timeout;
+ ret = osd_finalize_request(or, 0, credential, NULL);
+ if (ret) {
+ EXOFS_DBGMSG("Faild to osd_finalize_request() => %d\n", ret);
+ return ret;
+ }
+
+ ret = osd_execute_request(or);
+
+ if (ret)
+ EXOFS_DBGMSG("osd_execute_request() => %d\n", ret);
+ /* osd_req_decode_sense(or, ret); */
+ return ret;
+}
+
+/*
+ * Perform an asynchronous OSD operation.
+ */
+int exofs_async_op(struct osd_request *or, osd_req_done_fn *async_done,
+ void *caller_context, u8 *cred)
+{
+ int ret;
+
+ ret = osd_finalize_request(or, 0, cred, NULL);
+ if (ret) {
+ EXOFS_DBGMSG("Faild to osd_finalize_request() => %d\n", ret);
+ return ret;
+ }
+
+ ret = osd_execute_request_async(or, async_done, caller_context);
+
+ if (ret)
+ EXOFS_DBGMSG("osd_execute_request_async() => %d\n", ret);
+ return ret;
+}
+
+int extract_attr_from_req(struct osd_request *or, struct osd_attr *attr)
+{
+ struct osd_attr cur_attr = {.attr_page = 0}; /* start with zeros */
+ void *iter = NULL;
+ int nelem;
+
+ do {
+ nelem = 1;
+ osd_req_decode_get_attr_list(or, &cur_attr, &nelem, &iter);
+ if ((cur_attr.attr_page == attr->attr_page) &&
+ (cur_attr.attr_id == attr->attr_id)) {
+ attr->len = cur_attr.len;
+ attr->val_ptr = cur_attr.val_ptr;
+ return 0;
+ }
+ } while (iter);
+
+ return -EIO;
+}
+
+int osd_req_read_kern(struct osd_request *or,
+ const struct osd_obj_id *obj, u64 offset, void* buff, u64 len)
+{
+ struct request_queue *req_q = or->osd_dev->scsi_device->request_queue;
+ struct bio *bio = bio_map_kern(req_q, buff, len, GFP_KERNEL);
+
+ if (!bio)
+ return -ENOMEM;
+
+ osd_req_read(or, obj, bio, offset);
+ return 0;
+}
+
+int osd_req_write_kern(struct osd_request *or,
+ const struct osd_obj_id *obj, u64 offset, void* buff, u64 len)
+{
+ struct request_queue *req_q = or->osd_dev->scsi_device->request_queue;
+ struct bio *bio = bio_map_kern(req_q, buff, len, GFP_KERNEL);
+
+ if (!bio)
+ return -ENOMEM;
+
+ osd_req_write(or, obj, bio, offset);
+ return 0;
+}
--
1.6.2.1

2009-03-18 18:00:21

by Boaz Harrosh

[permalink] [raw]
Subject: [PATCH 2/8] exofs: file and file_inode operations

implementation of the file_operations and inode_operations for
regular data files.

Most file_operations are generic vfs implementations except:
- exofs_truncate will truncate the OSD object as well
- Generic file_fsync is not good for none_bd devices so open code it
- The default for .flush in Linux is todo nothing so call exofs_fsync
on the file.

Signed-off-by: Boaz Harrosh <[email protected]>
---
fs/exofs/Kbuild | 2 +-
fs/exofs/exofs.h | 14 +++++
fs/exofs/file.c | 82 ++++++++++++++++++++++++++++++
fs/exofs/inode.c | 148 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 245 insertions(+), 1 deletions(-)
create mode 100644 fs/exofs/file.c
create mode 100644 fs/exofs/inode.c

diff --git a/fs/exofs/Kbuild b/fs/exofs/Kbuild
index 63d822c..269281f 100644
--- a/fs/exofs/Kbuild
+++ b/fs/exofs/Kbuild
@@ -26,5 +26,5 @@ KBUILD_CPPFLAGS := -I$(OSD_INC) $(KBUILD_CPPFLAGS)

endif

-exofs-y := osd.o
+exofs-y := osd.o inode.o file.o
obj-$(CONFIG_EXOFS_FS) += exofs.o
diff --git a/fs/exofs/exofs.h b/fs/exofs/exofs.h
index 304e052..28deb67 100644
--- a/fs/exofs/exofs.h
+++ b/fs/exofs/exofs.h
@@ -124,4 +124,18 @@ static inline struct exofs_i_info *exofs_i(struct inode *inode)
return container_of(inode, struct exofs_i_info, vfs_inode);
}

+/*************************
+ * function declarations *
+ *************************/
+/* inode.c */
+void exofs_truncate(struct inode *inode);
+int exofs_setattr(struct dentry *, struct iattr *);
+
+/*********************
+ * operation vectors *
+ *********************/
+/* file.c */
+extern const struct inode_operations exofs_file_inode_operations;
+extern const struct file_operations exofs_file_operations;
+
#endif
diff --git a/fs/exofs/file.c b/fs/exofs/file.c
new file mode 100644
index 0000000..4738c3f
--- /dev/null
+++ b/fs/exofs/file.c
@@ -0,0 +1,82 @@
+/*
+ * Copyright (C) 2005, 2006
+ * Avishay Traeger ([email protected]) ([email protected])
+ * Copyright (C) 2005, 2006
+ * International Business Machines
+ * Copyright (C) 2008, 2009
+ * Boaz Harrosh <[email protected]>
+ *
+ * Copyrights for code taken from ext2:
+ * Copyright (C) 1992, 1993, 1994, 1995
+ * Remy Card ([email protected])
+ * Laboratoire MASI - Institut Blaise Pascal
+ * Universite Pierre et Marie Curie (Paris VI)
+ * from
+ * linux/fs/minix/inode.c
+ * Copyright (C) 1991, 1992 Linus Torvalds
+ *
+ * This file is part of exofs.
+ *
+ * exofs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation. Since it is based on ext2, and the only
+ * valid version of GPL for the Linux kernel is version 2, the only valid
+ * version of GPL for exofs is version 2.
+ *
+ * exofs is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with exofs; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include <linux/buffer_head.h>
+
+#include "exofs.h"
+
+static int exofs_release_file(struct inode *inode, struct file *filp)
+{
+ return 0;
+}
+
+static int exofs_file_fsync(struct file *filp, struct dentry *dentry,
+ int datasync)
+{
+ int ret1, ret2;
+ struct address_space *mapping = filp->f_mapping;
+
+ ret1 = filemap_write_and_wait(mapping);
+ ret2 = file_fsync(filp, dentry, datasync);
+
+ return ret1 ? ret1 : ret2;
+}
+
+static int exofs_flush(struct file *file, fl_owner_t id)
+{
+ exofs_file_fsync(file, file->f_path.dentry, 1);
+ /* TODO: Flush the OSD target */
+ return 0;
+}
+
+const struct file_operations exofs_file_operations = {
+ .llseek = generic_file_llseek,
+ .read = do_sync_read,
+ .write = do_sync_write,
+ .aio_read = generic_file_aio_read,
+ .aio_write = generic_file_aio_write,
+ .mmap = generic_file_mmap,
+ .open = generic_file_open,
+ .release = exofs_release_file,
+ .fsync = exofs_file_fsync,
+ .flush = exofs_flush,
+ .splice_read = generic_file_splice_read,
+ .splice_write = generic_file_splice_write,
+};
+
+const struct inode_operations exofs_file_inode_operations = {
+ .truncate = exofs_truncate,
+ .setattr = exofs_setattr,
+};
diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
new file mode 100644
index 0000000..b0bda1e
--- /dev/null
+++ b/fs/exofs/inode.c
@@ -0,0 +1,148 @@
+/*
+ * Copyright (C) 2005, 2006
+ * Avishay Traeger ([email protected]) ([email protected])
+ * Copyright (C) 2005, 2006
+ * International Business Machines
+ * Copyright (C) 2008, 2009
+ * Boaz Harrosh <[email protected]>
+ *
+ * Copyrights for code taken from ext2:
+ * Copyright (C) 1992, 1993, 1994, 1995
+ * Remy Card ([email protected])
+ * Laboratoire MASI - Institut Blaise Pascal
+ * Universite Pierre et Marie Curie (Paris VI)
+ * from
+ * linux/fs/minix/inode.c
+ * Copyright (C) 1991, 1992 Linus Torvalds
+ *
+ * This file is part of exofs.
+ *
+ * exofs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation. Since it is based on ext2, and the only
+ * valid version of GPL for the Linux kernel is version 2, the only valid
+ * version of GPL for exofs is version 2.
+ *
+ * exofs is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with exofs; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include <linux/writeback.h>
+#include <linux/buffer_head.h>
+
+#include "exofs.h"
+
+#ifdef CONFIG_EXOFS_DEBUG
+# define EXOFS_DEBUG_OBJ_ISIZE 1
+#endif
+
+/******************************************************************************
+ * INODE OPERATIONS
+ *****************************************************************************/
+
+/*
+ * Test whether an inode is a fast symlink.
+ */
+static inline int exofs_inode_is_fast_symlink(struct inode *inode)
+{
+ struct exofs_i_info *oi = exofs_i(inode);
+
+ return S_ISLNK(inode->i_mode) && (oi->i_data[0] != 0);
+}
+
+/*
+ * get_block_t - Fill in a buffer_head
+ * An OSD takes care of block allocation so we just fake an allocation by
+ * putting in the inode's sector_t in the buffer_head.
+ * TODO: What about the case of create==0 and @iblock does not exist in the
+ * object?
+ */
+static int exofs_get_block(struct inode *inode, sector_t iblock,
+ struct buffer_head *bh_result, int create)
+{
+ map_bh(bh_result, inode->i_sb, iblock);
+ return 0;
+}
+
+const struct osd_attr g_attr_logical_length = ATTR_DEF(
+ OSD_APAGE_OBJECT_INFORMATION, OSD_ATTR_OI_LOGICAL_LENGTH, 8);
+
+/*
+ * Truncate a file to the specified size - all we have to do is set the size
+ * attribute. We make sure the object exists first.
+ */
+void exofs_truncate(struct inode *inode)
+{
+ struct exofs_sb_info *sbi = inode->i_sb->s_fs_info;
+ struct exofs_i_info *oi = exofs_i(inode);
+ struct osd_obj_id obj = {sbi->s_pid, inode->i_ino + EXOFS_OBJ_OFF};
+ struct osd_request *or;
+ struct osd_attr attr;
+ loff_t isize = i_size_read(inode);
+ __be64 newsize;
+ int ret;
+
+ if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode)
+ || S_ISLNK(inode->i_mode)))
+ return;
+ if (exofs_inode_is_fast_symlink(inode))
+ return;
+ if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
+ return;
+ inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+
+ nobh_truncate_page(inode->i_mapping, isize, exofs_get_block);
+
+ or = osd_start_request(sbi->s_dev, GFP_KERNEL);
+ if (unlikely(!or)) {
+ EXOFS_ERR("ERROR: exofs_truncate: osd_start_request failed\n");
+ goto fail;
+ }
+
+ osd_req_set_attributes(or, &obj);
+
+ newsize = cpu_to_be64((u64)isize);
+ attr = g_attr_logical_length;
+ attr.val_ptr = &newsize;
+ osd_req_add_set_attr_list(or, &attr, 1);
+
+ /* if we are about to truncate an object, and it hasn't been
+ * created yet, wait
+ */
+ if (unlikely(wait_obj_created(oi)))
+ goto fail;
+
+ ret = exofs_sync_op(or, sbi->s_timeout, oi->i_cred);
+ osd_end_request(or);
+ if (ret)
+ goto fail;
+
+out:
+ mark_inode_dirty(inode);
+ return;
+fail:
+ make_bad_inode(inode);
+ goto out;
+}
+
+/*
+ * Set inode attributes - just call generic functions.
+ */
+int exofs_setattr(struct dentry *dentry, struct iattr *iattr)
+{
+ struct inode *inode = dentry->d_inode;
+ int error;
+
+ error = inode_change_ok(inode, iattr);
+ if (error)
+ return error;
+
+ error = inode_setattr(inode, iattr);
+ return error;
+}
--
1.6.2.1

2009-03-18 18:03:38

by Boaz Harrosh

[permalink] [raw]
Subject: [PATCH 3/8] exofs: symlink_inode and fast_symlink_inode operations

Generic implementation of symlink ops.

Signed-off-by: Boaz Harrosh <[email protected]>
---
fs/exofs/Kbuild | 2 +-
fs/exofs/exofs.h | 4 +++
fs/exofs/symlink.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 62 insertions(+), 1 deletions(-)
create mode 100644 fs/exofs/symlink.c

diff --git a/fs/exofs/Kbuild b/fs/exofs/Kbuild
index 269281f..42d5299 100644
--- a/fs/exofs/Kbuild
+++ b/fs/exofs/Kbuild
@@ -26,5 +26,5 @@ KBUILD_CPPFLAGS := -I$(OSD_INC) $(KBUILD_CPPFLAGS)

endif

-exofs-y := osd.o inode.o file.o
+exofs-y := osd.o inode.o file.o symlink.o
obj-$(CONFIG_EXOFS_FS) += exofs.o
diff --git a/fs/exofs/exofs.h b/fs/exofs/exofs.h
index 28deb67..d3b8bde 100644
--- a/fs/exofs/exofs.h
+++ b/fs/exofs/exofs.h
@@ -138,4 +138,8 @@ int exofs_setattr(struct dentry *, struct iattr *);
extern const struct inode_operations exofs_file_inode_operations;
extern const struct file_operations exofs_file_operations;

+/* symlink.c */
+extern const struct inode_operations exofs_symlink_inode_operations;
+extern const struct inode_operations exofs_fast_symlink_inode_operations;
+
#endif
diff --git a/fs/exofs/symlink.c b/fs/exofs/symlink.c
new file mode 100644
index 0000000..36e2d7b
--- /dev/null
+++ b/fs/exofs/symlink.c
@@ -0,0 +1,57 @@
+/*
+ * Copyright (C) 2005, 2006
+ * Avishay Traeger ([email protected]) ([email protected])
+ * Copyright (C) 2005, 2006
+ * International Business Machines
+ * Copyright (C) 2008, 2009
+ * Boaz Harrosh <[email protected]>
+ *
+ * Copyrights for code taken from ext2:
+ * Copyright (C) 1992, 1993, 1994, 1995
+ * Remy Card ([email protected])
+ * Laboratoire MASI - Institut Blaise Pascal
+ * Universite Pierre et Marie Curie (Paris VI)
+ * from
+ * linux/fs/minix/inode.c
+ * Copyright (C) 1991, 1992 Linus Torvalds
+ *
+ * This file is part of exofs.
+ *
+ * exofs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation. Since it is based on ext2, and the only
+ * valid version of GPL for the Linux kernel is version 2, the only valid
+ * version of GPL for exofs is version 2.
+ *
+ * exofs is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with exofs; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include <linux/namei.h>
+
+#include "exofs.h"
+
+static void *exofs_follow_link(struct dentry *dentry, struct nameidata *nd)
+{
+ struct exofs_i_info *oi = exofs_i(dentry->d_inode);
+
+ nd_set_link(nd, (char *)oi->i_data);
+ return NULL;
+}
+
+const struct inode_operations exofs_symlink_inode_operations = {
+ .readlink = generic_readlink,
+ .follow_link = page_follow_link_light,
+ .put_link = page_put_link,
+};
+
+const struct inode_operations exofs_fast_symlink_inode_operations = {
+ .readlink = generic_readlink,
+ .follow_link = exofs_follow_link,
+};
--
1.6.2.1

2009-03-18 18:06:08

by Boaz Harrosh

[permalink] [raw]
Subject: [PATCH 4/8] exofs: address_space_operations

OK Now we start to read and write from osd-objects. We try to
collect at most contiguous pages as possible in a single write/read.
The first page index is the object's offset.

TODO:
In 64-bit a single bio can carry at most 128 pages.
Add support of chaining multiple bios

Signed-off-by: Boaz Harrosh <[email protected]>
---
fs/exofs/exofs.h | 6 +
fs/exofs/inode.c | 690 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 696 insertions(+), 0 deletions(-)

diff --git a/fs/exofs/exofs.h b/fs/exofs/exofs.h
index d3b8bde..f30de6e 100644
--- a/fs/exofs/exofs.h
+++ b/fs/exofs/exofs.h
@@ -130,6 +130,9 @@ static inline struct exofs_i_info *exofs_i(struct inode *inode)
/* inode.c */
void exofs_truncate(struct inode *inode);
int exofs_setattr(struct dentry *, struct iattr *);
+int exofs_write_begin(struct file *file, struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned flags,
+ struct page **pagep, void **fsdata);

/*********************
* operation vectors *
@@ -138,6 +141,9 @@ int exofs_setattr(struct dentry *, struct iattr *);
extern const struct inode_operations exofs_file_inode_operations;
extern const struct file_operations exofs_file_operations;

+/* inode.c */
+extern const struct address_space_operations exofs_aops;
+
/* symlink.c */
extern const struct inode_operations exofs_symlink_inode_operations;
extern const struct inode_operations exofs_fast_symlink_inode_operations;
diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
index b0bda1e..175679a 100644
--- a/fs/exofs/inode.c
+++ b/fs/exofs/inode.c
@@ -35,6 +35,7 @@

#include <linux/writeback.h>
#include <linux/buffer_head.h>
+#include <scsi/scsi_device.h>

#include "exofs.h"

@@ -42,6 +43,695 @@
# define EXOFS_DEBUG_OBJ_ISIZE 1
#endif

+struct page_collect {
+ struct exofs_sb_info *sbi;
+ struct request_queue *req_q;
+ struct inode *inode;
+ unsigned expected_pages;
+
+ struct bio *bio;
+ unsigned nr_pages;
+ unsigned long length;
+ long pg_first;
+};
+
+void _pcol_init(struct page_collect *pcol, unsigned expected_pages,
+ struct inode *inode)
+{
+ struct exofs_sb_info *sbi = inode->i_sb->s_fs_info;
+ struct request_queue *req_q = sbi->s_dev->scsi_device->request_queue;
+
+ pcol->sbi = sbi;
+ pcol->req_q = req_q;
+ pcol->inode = inode;
+ pcol->expected_pages = expected_pages;
+
+ pcol->bio = NULL;
+ pcol->nr_pages = 0;
+ pcol->length = 0;
+ pcol->pg_first = -1;
+
+ EXOFS_DBGMSG("_pcol_init ino=0x%lx expected_pages=%u\n", inode->i_ino,
+ expected_pages);
+}
+
+void _pcol_reset(struct page_collect *pcol)
+{
+ pcol->expected_pages -= min(pcol->nr_pages, pcol->expected_pages);
+
+ pcol->bio = NULL;
+ pcol->nr_pages = 0;
+ pcol->length = 0;
+ pcol->pg_first = -1;
+ EXOFS_DBGMSG("_pcol_reset ino=0x%lx expected_pages=%u\n",
+ pcol->inode->i_ino, pcol->expected_pages);
+
+ /* this is probably the end of the loop but in writes
+ * it might not end here. don't be left with nothing
+ */
+ if (!pcol->expected_pages)
+ pcol->expected_pages = 128;
+}
+
+int pcol_try_alloc(struct page_collect *pcol)
+{
+ int pages = min_t(unsigned, pcol->expected_pages, BIO_MAX_PAGES);
+
+ for (; pages; pages >>= 1) {
+ pcol->bio = bio_alloc(GFP_KERNEL, pages);
+ if (likely(pcol->bio))
+ return 0;
+ }
+
+ EXOFS_ERR("Failed to kcalloc expected_pages=%d\n",
+ pcol->expected_pages);
+ return -ENOMEM;
+}
+
+void pcol_free(struct page_collect *pcol)
+{
+ bio_put(pcol->bio);
+ pcol->bio = NULL;
+}
+
+int pcol_add_page(struct page_collect *pcol, struct page *page, unsigned len)
+{
+ int added_len = bio_add_pc_page(pcol->req_q, pcol->bio, page, len, 0);
+ if (unlikely(len != added_len))
+ return -ENOMEM;
+
+ ++pcol->nr_pages;
+ pcol->length += len;
+ return 0;
+}
+
+static int update_read_page(struct page *page, int ret)
+{
+ if (ret == 0) {
+ /* Everything is OK */
+ SetPageUptodate(page);
+ if (PageError(page))
+ ClearPageError(page);
+ } else if (ret == -EFAULT) {
+ /* In this case we were trying to read something that wasn't on
+ * disk yet - return a page full of zeroes. This should be OK,
+ * because the object should be empty (if there was a write
+ * before this read, the read would be waiting with the page
+ * locked */
+ clear_highpage(page);
+
+ SetPageUptodate(page);
+ if (PageError(page))
+ ClearPageError(page);
+ ret = 0; /* recovered error */
+ } else /* Error */
+ SetPageError(page);
+
+ return ret;
+}
+
+static void update_write_page(struct page *page, int ret)
+{
+ if (ret) {
+ mapping_set_error(page->mapping, ret);
+ SetPageError(page);
+ }
+ end_page_writeback(page);
+}
+
+static int _readpage(struct page *page, bool is_sync);
+
+static int __readpages_done(struct osd_request *or, struct page_collect *pcol,
+ bool do_unlock)
+{
+ struct bio_vec *bvec;
+ int i;
+ u64 resid;
+ u64 good_bytes;
+ u64 length = 0;
+ int ret = exofs_check_ok_resid(or, &resid, NULL);
+
+ osd_end_request(or);
+
+ if (!ret)
+ good_bytes = pcol->length;
+ else if (ret && !resid)
+ good_bytes = 0;
+ else
+ good_bytes = pcol->length - resid;
+
+ EXOFS_DBGMSG("readpages_done(%ld) good_bytes=%llx"
+ " length=%zx nr_pages=%u\n",
+ pcol->inode->i_ino, _LLU(good_bytes), pcol->length,
+ pcol->nr_pages);
+
+ __bio_for_each_segment(bvec, pcol->bio, i, 0) {
+ struct page *page = bvec->bv_page;
+ struct inode *inode = page->mapping->host;
+
+ if (inode != pcol->inode)
+ continue; /* osd might add more pages at end */
+
+ if ((length < good_bytes) || (i == 0)) {
+ ret = update_read_page(page, (i == 0) ? ret : 0);
+ if (do_unlock)
+ unlock_page(page);
+ EXOFS_DBGMSG(" readpages_done(%ld, %ld)\n",
+ inode->i_ino, page->index);
+ } else {
+ /* can not happen on single sync_readpage */
+ BUG_ON(!do_unlock);
+
+ /* try a single page read and only then it is
+ * marked as SetPageError()
+ */
+ EXOFS_ERR(" readpages_done(%ld, %ld) bad_bytes\n",
+ inode->i_ino, page->index);
+ _readpage(page, false);
+ }
+
+ length += bvec->bv_len;
+ }
+
+ pcol_free(pcol);
+ EXOFS_DBGMSG("readpages_done END\n");
+ return ret;
+}
+
+static void readpages_done(struct osd_request *or, void *p)
+{
+ struct page_collect *pcol = p;
+
+ __readpages_done(or, pcol, true);
+ atomic_dec(&pcol->sbi->s_curr_pending);
+ kfree(p);
+}
+
+void _unlock_pcol_pages(struct page_collect *pcol, int ret, int rw)
+{
+ struct bio_vec *bvec;
+ int i;
+
+ __bio_for_each_segment(bvec, pcol->bio, i, 0) {
+ struct page *page = bvec->bv_page;
+
+ if (rw == READ)
+ update_read_page(page, ret);
+ else
+ update_write_page(page, ret);
+
+ unlock_page(page);
+ }
+ pcol_free(pcol);
+}
+
+int read_exec(struct page_collect *pcol, bool is_sync)
+{
+ struct exofs_i_info *oi = exofs_i(pcol->inode);
+ struct osd_obj_id obj = {pcol->sbi->s_pid,
+ pcol->inode->i_ino + EXOFS_OBJ_OFF};
+ struct osd_request *or = NULL;
+ struct page_collect *pcol_copy = NULL;
+ loff_t i_start = pcol->pg_first << PAGE_CACHE_SHIFT;
+ int ret;
+
+ if (!pcol->bio)
+ return 0;
+
+ /* see comment in _readpage() about sync reads */
+ WARN_ON(is_sync && (pcol->nr_pages != 1));
+
+ or = osd_start_request(pcol->sbi->s_dev, GFP_KERNEL);
+ if (unlikely(!or)) {
+ ret = -ENOMEM;
+ goto err;
+ }
+
+ osd_req_read(or, &obj, pcol->bio, i_start);
+
+ if (is_sync) {
+ exofs_sync_op(or, pcol->sbi->s_timeout, oi->i_cred);
+ return __readpages_done(or, pcol, false);
+ }
+
+ pcol_copy = kmalloc(sizeof(*pcol_copy), GFP_KERNEL);
+ if (!pcol_copy) {
+ ret = -ENOMEM;
+ goto err;
+ }
+
+ *pcol_copy = *pcol;
+ ret = exofs_async_op(or, readpages_done, pcol_copy, oi->i_cred);
+ if (unlikely(ret))
+ goto err;
+
+ atomic_inc(&pcol->sbi->s_curr_pending);
+
+ EXOFS_DBGMSG("read_exec obj=%llx start=%llx length=%zx\n",
+ obj.id, _LLU(i_start), pcol->length);
+
+ /* pages ownership was passed to pcol_copy */
+ _pcol_reset(pcol);
+ return 0;
+
+err:
+ if (!is_sync)
+ _unlock_pcol_pages(pcol, ret, READ);
+ kfree(pcol_copy);
+ if (or)
+ osd_end_request(or);
+ return ret;
+}
+
+static int readpage_strip(void *data, struct page *page)
+{
+ struct page_collect *pcol = data;
+ struct inode *inode = pcol->inode;
+ struct exofs_i_info *oi = exofs_i(inode);
+ loff_t i_size = i_size_read(inode);
+ pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
+ size_t len;
+ int ret;
+
+ /* FIXME: Just for debugging, will be removed */
+ if (PageUptodate(page))
+ EXOFS_ERR("PageUptodate(%ld, %ld)\n", pcol->inode->i_ino,
+ page->index);
+
+ if (page->index < end_index)
+ len = PAGE_CACHE_SIZE;
+ else if (page->index == end_index)
+ len = i_size & ~PAGE_CACHE_MASK;
+ else
+ len = 0;
+
+ if (!len || !obj_created(oi)) {
+ /* this will be out of bounds, or doesn't exist yet.
+ * Current page is cleared and the request is split
+ */
+ clear_highpage(page);
+
+ SetPageUptodate(page);
+ if (PageError(page))
+ ClearPageError(page);
+
+ unlock_page(page);
+ EXOFS_DBGMSG("readpage_strip(%ld, %ld) empty page, splitting\n",
+ inode->i_ino, page->index);
+
+ return read_exec(pcol, false);
+ }
+
+try_again:
+
+ if (unlikely(pcol->pg_first == -1)) {
+ pcol->pg_first = page->index;
+ } else if (unlikely((pcol->pg_first + pcol->nr_pages) !=
+ page->index)) {
+ /* Discontinuity detected, split the request */
+ ret = read_exec(pcol, false);
+ if (unlikely(ret))
+ goto fail;
+ goto try_again;
+ }
+
+ if (!pcol->bio) {
+ ret = pcol_try_alloc(pcol);
+ if (unlikely(ret))
+ goto fail;
+ }
+
+ if (len != PAGE_CACHE_SIZE)
+ zero_user(page, len, PAGE_CACHE_SIZE - len);
+
+ EXOFS_DBGMSG(" readpage_strip(%ld, %ld) len=%zx\n", inode->i_ino,
+ page->index, len);
+
+ ret = pcol_add_page(pcol, page, len);
+ if (ret) {
+ EXOFS_DBGMSG("Failed pcol_add_page pages[i]=%p "
+ "len=%zx nr_pages=%u length=%zx\n",
+ page, len, pcol->nr_pages, pcol->length);
+
+ /* split the request, and start again with current page */
+ ret = read_exec(pcol, false);
+ if (unlikely(ret))
+ goto fail;
+
+ goto try_again;
+ }
+
+ return 0;
+
+fail:
+ /* SetPageError(page); ??? */
+ unlock_page(page);
+ return ret;
+}
+
+static int exofs_readpages(struct file *file, struct address_space *mapping,
+ struct list_head *pages, unsigned nr_pages)
+{
+ struct page_collect pcol;
+ int ret;
+
+ _pcol_init(&pcol, nr_pages, mapping->host);
+
+ ret = read_cache_pages(mapping, pages, readpage_strip, &pcol);
+ if (ret) {
+ EXOFS_ERR("read_cache_pages => %d\n", ret);
+ return ret;
+ }
+
+ return read_exec(&pcol, false);
+}
+
+static int _readpage(struct page *page, bool is_sync)
+{
+ struct page_collect pcol;
+ int ret;
+
+ _pcol_init(&pcol, 1, page->mapping->host);
+
+ /* readpage_strip might call read_exec(,async) inside at several places
+ * but this is safe for is_async=0 since read_exec will not do anything
+ * when we have a single page.
+ */
+ ret = readpage_strip(&pcol, page);
+ if (ret) {
+ EXOFS_ERR("_readpage => %d\n", ret);
+ return ret;
+ }
+
+ return read_exec(&pcol, is_sync);
+}
+
+/*
+ * We don't need the file
+ */
+static int exofs_readpage(struct file *file, struct page *page)
+{
+ return _readpage(page, false);
+}
+
+static int exofs_writepage(struct page *page, struct writeback_control *wbc2);
+
+static void writepages_done(struct osd_request *or, void *p)
+{
+ struct page_collect *pcol = p;
+ struct bio_vec *bvec;
+ int i;
+ u64 resid;
+ u64 good_bytes;
+ u64 length = 0;
+
+ int ret = exofs_check_ok_resid(or, NULL, &resid);
+
+ osd_end_request(or);
+ atomic_dec(&pcol->sbi->s_curr_pending);
+
+ if (likely(!ret))
+ good_bytes = pcol->length;
+ else if (ret && !resid)
+ good_bytes = 0;
+ else
+ good_bytes = pcol->length - resid;
+
+ EXOFS_DBGMSG("writepages_done(%lx) good_bytes=%llx"
+ " length=%zx nr_pages=%u\n",
+ pcol->inode->i_ino, _LLU(good_bytes), pcol->length,
+ pcol->nr_pages);
+
+ __bio_for_each_segment(bvec, pcol->bio, i, 0) {
+ struct page *page = bvec->bv_page;
+ struct inode *inode = page->mapping->host;
+
+ if (inode != pcol->inode)
+ continue; /* osd might add more pages to a bio */
+
+ if ((length < good_bytes) || (i == 0)) {
+ update_write_page(page, ret);
+ unlock_page(page);
+ EXOFS_DBGMSG(" writepages_done(%lx, %lx)"
+ " good_bytes ret=%x\n",
+ inode->i_ino, page->index, ret);
+ } else {
+ /* try a single page write and only then it is
+ * marked as SetPageError()
+ */
+ EXOFS_ERR(" writepages_done(%lx, %lx) bad_bytes\n",
+ inode->i_ino, page->index);
+
+ exofs_writepage(page, NULL);
+ }
+
+ length += bvec->bv_len;
+ }
+
+ pcol_free(pcol);
+ kfree(pcol);
+ EXOFS_DBGMSG("writepages_done END\n");
+}
+
+int write_exec(struct page_collect *pcol)
+{
+ struct exofs_i_info *oi = exofs_i(pcol->inode);
+ struct osd_obj_id obj = {pcol->sbi->s_pid,
+ pcol->inode->i_ino + EXOFS_OBJ_OFF};
+ struct osd_request *or = NULL;
+ struct page_collect *pcol_copy = NULL;
+ loff_t i_start = pcol->pg_first << PAGE_CACHE_SHIFT;
+ int ret;
+
+ if (!pcol->bio)
+ return 0;
+
+ or = osd_start_request(pcol->sbi->s_dev, GFP_KERNEL);
+ if (unlikely(!or)) {
+ EXOFS_ERR("write_exec: Faild to osd_start_request()\n");
+ ret = -ENOMEM;
+ goto err;
+ }
+
+ pcol_copy = kmalloc(sizeof(*pcol_copy), GFP_KERNEL);
+ if (!pcol_copy) {
+ EXOFS_ERR("write_exec: Faild to kmalloc(pcol)\n");
+ ret = -ENOMEM;
+ goto err;
+ }
+
+ *pcol_copy = *pcol;
+
+ osd_req_write(or, &obj, pcol_copy->bio, i_start);
+ ret = exofs_async_op(or, writepages_done, pcol_copy, oi->i_cred);
+ if (unlikely(ret)) {
+ EXOFS_ERR("write_exec: exofs_async_op() Faild\n");
+ goto err;
+ }
+
+ atomic_inc(&pcol->sbi->s_curr_pending);
+ EXOFS_DBGMSG("write_exec(%lx, %lx) start=%llx length=%zx\n",
+ pcol->inode->i_ino, pcol->pg_first, _LLU(i_start),
+ pcol->length);
+ /* pages ownership was passed to pcol_copy */
+ _pcol_reset(pcol);
+ return 0;
+
+err:
+ _unlock_pcol_pages(pcol, ret, WRITE);
+ kfree(pcol_copy);
+ if (or)
+ osd_end_request(or);
+ return ret;
+}
+
+static int writepage_strip(struct page *page,
+ struct writeback_control *wbc_unused, void *data)
+{
+ struct page_collect *pcol = data;
+ struct inode *inode = pcol->inode;
+ struct exofs_i_info *oi = exofs_i(inode);
+ loff_t i_size = i_size_read(inode);
+ pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
+ size_t len;
+ int ret;
+
+ BUG_ON(!PageLocked(page));
+
+ ret = wait_obj_created(oi);
+ if (unlikely(ret))
+ goto fail;
+
+ if (page->index < end_index)
+ /* in this case, the page is within the limits of the file */
+ len = PAGE_CACHE_SIZE;
+ else {
+ len = i_size & ~PAGE_CACHE_MASK;
+
+ if (page->index > end_index || !len) {
+ /* in this case, the page is outside the limits
+ * (truncate in progress)
+ */
+ ret = write_exec(pcol);
+ if (unlikely(ret))
+ goto fail;
+ if (PageError(page))
+ ClearPageError(page);
+ unlock_page(page);
+ return 0;
+ }
+ }
+
+try_again:
+
+ if (unlikely(pcol->pg_first == -1)) {
+ pcol->pg_first = page->index;
+ } else if (unlikely((pcol->pg_first + pcol->nr_pages) !=
+ page->index)) {
+ /* Discontinuity detected, split the request */
+ ret = write_exec(pcol);
+ if (unlikely(ret))
+ goto fail;
+ goto try_again;
+ }
+
+ if (!pcol->bio) {
+ ret = pcol_try_alloc(pcol);
+ if (unlikely(ret))
+ goto fail;
+ }
+
+ EXOFS_DBGMSG(" writepage_strip(%lx, %lx) len=%zx\n", inode->i_ino,
+ page->index, len);
+
+ ret = pcol_add_page(pcol, page, len);
+ if (unlikely(ret)) {
+ EXOFS_DBGMSG("Failed pcol_add_page "
+ "nr_pages=%u total_length=%zx\n",
+ pcol->nr_pages, pcol->length);
+
+ /* split the request, next loop will start again */
+ ret = write_exec(pcol);
+ if (unlikely(ret)) {
+ EXOFS_DBGMSG("write_exec faild => %d", ret);
+ goto fail;
+ }
+
+ goto try_again;
+ }
+
+ BUG_ON(PageWriteback(page));
+ set_page_writeback(page);
+
+ return 0;
+
+fail:
+ set_bit(AS_EIO, &page->mapping->flags);
+ unlock_page(page);
+ return ret;
+}
+
+int exofs_writepages(struct address_space *mapping,
+ struct writeback_control *wbc)
+{
+ struct page_collect pcol;
+ long start, end, expected_pages;
+ int ret;
+
+ start = wbc->range_start >> PAGE_CACHE_SHIFT;
+ end = (wbc->range_end == LLONG_MAX) ?
+ start + mapping->nrpages :
+ wbc->range_end >> PAGE_CACHE_SHIFT;
+
+ if (start || end)
+ expected_pages = min(end - start + 1, 32L);
+ else
+ expected_pages = mapping->nrpages;
+
+ EXOFS_DBGMSG("inode(%lx) wbc->start=0x%llx wbc->end=0x%llx"
+ " m->nrpages=%lu start=%ld end=%ld\n",
+ mapping->host->i_ino, wbc->range_start, wbc->range_end,
+ mapping->nrpages, start, end);
+
+ _pcol_init(&pcol, expected_pages, mapping->host);
+
+ ret = write_cache_pages(mapping, wbc, writepage_strip, &pcol);
+ if (ret) {
+ EXOFS_ERR("write_cache_pages => %d\n", ret);
+ return ret;
+ }
+
+ return write_exec(&pcol);
+}
+
+static int exofs_writepage(struct page *page, struct writeback_control *wbc)
+{
+ struct page_collect pcol;
+ int ret;
+
+ _pcol_init(&pcol, 1, page->mapping->host);
+
+ ret = writepage_strip(page, NULL, &pcol);
+ if (ret) {
+ EXOFS_ERR("exofs_writepage => %d\n", ret);
+ return ret;
+ }
+
+ return write_exec(&pcol);
+}
+
+int exofs_write_begin(struct file *file, struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned flags,
+ struct page **pagep, void **fsdata)
+{
+ int ret = 0;
+ struct page *page;
+
+ page = *pagep;
+ if (page == NULL) {
+ ret = simple_write_begin(file, mapping, pos, len, flags, pagep,
+ fsdata);
+ if (ret) {
+ EXOFS_DBGMSG("simple_write_begin faild\n");
+ return ret;
+ }
+
+ page = *pagep;
+ }
+
+ /* read modify write */
+ if (!PageUptodate(page) && (len != PAGE_CACHE_SIZE)) {
+ ret = _readpage(page, true);
+ if (ret) {
+ /*SetPageError was done by _readpage. Is it ok?*/
+ unlock_page(page);
+ EXOFS_DBGMSG("__readpage_filler faild\n");
+ }
+ }
+
+ return ret;
+}
+
+static int exofs_write_begin_export(struct file *file,
+ struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned flags,
+ struct page **pagep, void **fsdata)
+{
+ *pagep = NULL;
+
+ return exofs_write_begin(file, mapping, pos, len, flags, pagep,
+ fsdata);
+}
+
+const struct address_space_operations exofs_aops = {
+ .readpage = exofs_readpage,
+ .readpages = exofs_readpages,
+ .writepage = exofs_writepage,
+ .writepages = exofs_writepages,
+ .write_begin = exofs_write_begin_export,
+ .write_end = simple_write_end,
+};
+
/******************************************************************************
* INODE OPERATIONS
*****************************************************************************/
--
1.6.2.1

2009-03-18 18:10:26

by Boaz Harrosh

[permalink] [raw]
Subject: [PATCH 5/8] exofs: dir_inode and directory operations

implementation of directory and inode operations.

* A directory is treated as a file, and essentially contains a list
of <file name, inode #> pairs for files that are found in that
directory. The object IDs correspond to the files' inode numbers
and are allocated using a 64bit incrementing global counter.
* Each file's control block (AKA on-disk inode) is stored in its
object's attributes. This applies to both regular files and other
types (directories, device files, symlinks, etc.).

Signed-off-by: Boaz Harrosh <[email protected]>
---
fs/exofs/Kbuild | 2 +-
fs/exofs/dir.c | 656 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/exofs/exofs.h | 26 +++
fs/exofs/inode.c | 272 ++++++++++++++++++++++
fs/exofs/namei.c | 342 ++++++++++++++++++++++++++++
5 files changed, 1297 insertions(+), 1 deletions(-)
create mode 100644 fs/exofs/dir.c
create mode 100644 fs/exofs/namei.c

diff --git a/fs/exofs/Kbuild b/fs/exofs/Kbuild
index 42d5299..d5c8c54 100644
--- a/fs/exofs/Kbuild
+++ b/fs/exofs/Kbuild
@@ -26,5 +26,5 @@ KBUILD_CPPFLAGS := -I$(OSD_INC) $(KBUILD_CPPFLAGS)

endif

-exofs-y := osd.o inode.o file.o symlink.o
+exofs-y := osd.o inode.o file.o symlink.o namei.o dir.o
obj-$(CONFIG_EXOFS_FS) += exofs.o
diff --git a/fs/exofs/dir.c b/fs/exofs/dir.c
new file mode 100644
index 0000000..0ca733a
--- /dev/null
+++ b/fs/exofs/dir.c
@@ -0,0 +1,656 @@
+/*
+ * Copyright (C) 2005, 2006
+ * Avishay Traeger ([email protected]) ([email protected])
+ * Copyright (C) 2005, 2006
+ * International Business Machines
+ * Copyright (C) 2008, 2009
+ * Boaz Harrosh <[email protected]>
+ *
+ * Copyrights for code taken from ext2:
+ * Copyright (C) 1992, 1993, 1994, 1995
+ * Remy Card ([email protected])
+ * Laboratoire MASI - Institut Blaise Pascal
+ * Universite Pierre et Marie Curie (Paris VI)
+ * from
+ * linux/fs/minix/inode.c
+ * Copyright (C) 1991, 1992 Linus Torvalds
+ *
+ * This file is part of exofs.
+ *
+ * exofs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation. Since it is based on ext2, and the only
+ * valid version of GPL for the Linux kernel is version 2, the only valid
+ * version of GPL for exofs is version 2.
+ *
+ * exofs is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with exofs; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "exofs.h"
+
+static inline unsigned exofs_chunk_size(struct inode *inode)
+{
+ return inode->i_sb->s_blocksize;
+}
+
+static inline void exofs_put_page(struct page *page)
+{
+ kunmap(page);
+ page_cache_release(page);
+}
+
+static inline unsigned long dir_pages(struct inode *inode)
+{
+ return (inode->i_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+}
+
+static unsigned exofs_last_byte(struct inode *inode, unsigned long page_nr)
+{
+ unsigned last_byte = inode->i_size;
+
+ last_byte -= page_nr << PAGE_CACHE_SHIFT;
+ if (last_byte > PAGE_CACHE_SIZE)
+ last_byte = PAGE_CACHE_SIZE;
+ return last_byte;
+}
+
+static int exofs_commit_chunk(struct page *page, loff_t pos, unsigned len)
+{
+ struct address_space *mapping = page->mapping;
+ struct inode *dir = mapping->host;
+ int err = 0;
+
+ dir->i_version++;
+
+ if (!PageUptodate(page))
+ SetPageUptodate(page);
+
+ if (pos+len > dir->i_size) {
+ i_size_write(dir, pos+len);
+ mark_inode_dirty(dir);
+ }
+ set_page_dirty(page);
+
+ if (IS_DIRSYNC(dir))
+ err = write_one_page(page, 1);
+ else
+ unlock_page(page);
+
+ return err;
+}
+
+static void exofs_check_page(struct page *page)
+{
+ struct inode *dir = page->mapping->host;
+ unsigned chunk_size = exofs_chunk_size(dir);
+ char *kaddr = page_address(page);
+ unsigned offs, rec_len;
+ unsigned limit = PAGE_CACHE_SIZE;
+ struct exofs_dir_entry *p;
+ char *error;
+
+ /* if the page is the last one in the directory */
+ if ((dir->i_size >> PAGE_CACHE_SHIFT) == page->index) {
+ limit = dir->i_size & ~PAGE_CACHE_MASK;
+ if (limit & (chunk_size - 1))
+ goto Ebadsize;
+ if (!limit)
+ goto out;
+ }
+ for (offs = 0; offs <= limit - EXOFS_DIR_REC_LEN(1); offs += rec_len) {
+ p = (struct exofs_dir_entry *)(kaddr + offs);
+ rec_len = le16_to_cpu(p->rec_len);
+
+ if (rec_len < EXOFS_DIR_REC_LEN(1))
+ goto Eshort;
+ if (rec_len & 3)
+ goto Ealign;
+ if (rec_len < EXOFS_DIR_REC_LEN(p->name_len))
+ goto Enamelen;
+ if (((offs + rec_len - 1) ^ offs) & ~(chunk_size-1))
+ goto Espan;
+ }
+ if (offs != limit)
+ goto Eend;
+out:
+ SetPageChecked(page);
+ return;
+
+Ebadsize:
+ EXOFS_ERR("ERROR [exofs_check_page]: "
+ "size of directory #%lu is not a multiple of chunk size",
+ dir->i_ino
+ );
+ goto fail;
+Eshort:
+ error = "rec_len is smaller than minimal";
+ goto bad_entry;
+Ealign:
+ error = "unaligned directory entry";
+ goto bad_entry;
+Enamelen:
+ error = "rec_len is too small for name_len";
+ goto bad_entry;
+Espan:
+ error = "directory entry across blocks";
+ goto bad_entry;
+bad_entry:
+ EXOFS_ERR(
+ "ERROR [exofs_check_page]: bad entry in directory #%lu: %s - "
+ "offset=%lu, inode=%llu, rec_len=%d, name_len=%d",
+ dir->i_ino, error, (page->index<<PAGE_CACHE_SHIFT)+offs,
+ _LLU(le64_to_cpu(p->inode_no)),
+ rec_len, p->name_len);
+ goto fail;
+Eend:
+ p = (struct exofs_dir_entry *)(kaddr + offs);
+ EXOFS_ERR("ERROR [exofs_check_page]: "
+ "entry in directory #%lu spans the page boundary"
+ "offset=%lu, inode=%llu",
+ dir->i_ino, (page->index<<PAGE_CACHE_SHIFT)+offs,
+ _LLU(le64_to_cpu(p->inode_no)));
+fail:
+ SetPageChecked(page);
+ SetPageError(page);
+}
+
+static struct page *exofs_get_page(struct inode *dir, unsigned long n)
+{
+ struct address_space *mapping = dir->i_mapping;
+ struct page *page = read_mapping_page(mapping, n, NULL);
+
+ if (!IS_ERR(page)) {
+ kmap(page);
+ if (!PageChecked(page))
+ exofs_check_page(page);
+ if (PageError(page))
+ goto fail;
+ }
+ return page;
+
+fail:
+ exofs_put_page(page);
+ return ERR_PTR(-EIO);
+}
+
+static inline int exofs_match(int len, const unsigned char *name,
+ struct exofs_dir_entry *de)
+{
+ if (len != de->name_len)
+ return 0;
+ if (!de->inode_no)
+ return 0;
+ return !memcmp(name, de->name, len);
+}
+
+static inline
+struct exofs_dir_entry *exofs_next_entry(struct exofs_dir_entry *p)
+{
+ return (struct exofs_dir_entry *)((char *)p + le16_to_cpu(p->rec_len));
+}
+
+static inline unsigned
+exofs_validate_entry(char *base, unsigned offset, unsigned mask)
+{
+ struct exofs_dir_entry *de = (struct exofs_dir_entry *)(base + offset);
+ struct exofs_dir_entry *p =
+ (struct exofs_dir_entry *)(base + (offset&mask));
+ while ((char *)p < (char *)de) {
+ if (p->rec_len == 0)
+ break;
+ p = exofs_next_entry(p);
+ }
+ return (char *)p - base;
+}
+
+static unsigned char exofs_filetype_table[EXOFS_FT_MAX] = {
+ [EXOFS_FT_UNKNOWN] = DT_UNKNOWN,
+ [EXOFS_FT_REG_FILE] = DT_REG,
+ [EXOFS_FT_DIR] = DT_DIR,
+ [EXOFS_FT_CHRDEV] = DT_CHR,
+ [EXOFS_FT_BLKDEV] = DT_BLK,
+ [EXOFS_FT_FIFO] = DT_FIFO,
+ [EXOFS_FT_SOCK] = DT_SOCK,
+ [EXOFS_FT_SYMLINK] = DT_LNK,
+};
+
+#define S_SHIFT 12
+static unsigned char exofs_type_by_mode[S_IFMT >> S_SHIFT] = {
+ [S_IFREG >> S_SHIFT] = EXOFS_FT_REG_FILE,
+ [S_IFDIR >> S_SHIFT] = EXOFS_FT_DIR,
+ [S_IFCHR >> S_SHIFT] = EXOFS_FT_CHRDEV,
+ [S_IFBLK >> S_SHIFT] = EXOFS_FT_BLKDEV,
+ [S_IFIFO >> S_SHIFT] = EXOFS_FT_FIFO,
+ [S_IFSOCK >> S_SHIFT] = EXOFS_FT_SOCK,
+ [S_IFLNK >> S_SHIFT] = EXOFS_FT_SYMLINK,
+};
+
+static inline
+void exofs_set_de_type(struct exofs_dir_entry *de, struct inode *inode)
+{
+ mode_t mode = inode->i_mode;
+ de->file_type = exofs_type_by_mode[(mode & S_IFMT) >> S_SHIFT];
+}
+
+static int
+exofs_readdir(struct file *filp, void *dirent, filldir_t filldir)
+{
+ loff_t pos = filp->f_pos;
+ struct inode *inode = filp->f_path.dentry->d_inode;
+ unsigned int offset = pos & ~PAGE_CACHE_MASK;
+ unsigned long n = pos >> PAGE_CACHE_SHIFT;
+ unsigned long npages = dir_pages(inode);
+ unsigned chunk_mask = ~(exofs_chunk_size(inode)-1);
+ unsigned char *types = NULL;
+ int need_revalidate = (filp->f_version != inode->i_version);
+
+ if (pos > inode->i_size - EXOFS_DIR_REC_LEN(1))
+ return 0;
+
+ types = exofs_filetype_table;
+
+ for ( ; n < npages; n++, offset = 0) {
+ char *kaddr, *limit;
+ struct exofs_dir_entry *de;
+ struct page *page = exofs_get_page(inode, n);
+
+ if (IS_ERR(page)) {
+ EXOFS_ERR("ERROR: "
+ "bad page in #%lu",
+ inode->i_ino);
+ filp->f_pos += PAGE_CACHE_SIZE - offset;
+ return PTR_ERR(page);
+ }
+ kaddr = page_address(page);
+ if (unlikely(need_revalidate)) {
+ if (offset) {
+ offset = exofs_validate_entry(kaddr, offset,
+ chunk_mask);
+ filp->f_pos = (n<<PAGE_CACHE_SHIFT) + offset;
+ }
+ filp->f_version = inode->i_version;
+ need_revalidate = 0;
+ }
+ de = (struct exofs_dir_entry *)(kaddr + offset);
+ limit = kaddr + exofs_last_byte(inode, n) -
+ EXOFS_DIR_REC_LEN(1);
+ for (; (char *)de <= limit; de = exofs_next_entry(de)) {
+ if (de->rec_len == 0) {
+ EXOFS_ERR("ERROR: "
+ "zero-length directory entry");
+ exofs_put_page(page);
+ return -EIO;
+ }
+ if (de->inode_no) {
+ int over;
+ unsigned char d_type = DT_UNKNOWN;
+
+ if (types && de->file_type < EXOFS_FT_MAX)
+ d_type = types[de->file_type];
+
+ offset = (char *)de - kaddr;
+ over = filldir(dirent, de->name, de->name_len,
+ (n<<PAGE_CACHE_SHIFT) | offset,
+ le64_to_cpu(de->inode_no),
+ d_type);
+ if (over) {
+ exofs_put_page(page);
+ return 0;
+ }
+ }
+ filp->f_pos += le16_to_cpu(de->rec_len);
+ }
+ exofs_put_page(page);
+ }
+
+ return 0;
+}
+
+struct exofs_dir_entry *exofs_find_entry(struct inode *dir,
+ struct dentry *dentry, struct page **res_page)
+{
+ const unsigned char *name = dentry->d_name.name;
+ int namelen = dentry->d_name.len;
+ unsigned reclen = EXOFS_DIR_REC_LEN(namelen);
+ unsigned long start, n;
+ unsigned long npages = dir_pages(dir);
+ struct page *page = NULL;
+ struct exofs_i_info *oi = exofs_i(dir);
+ struct exofs_dir_entry *de;
+
+ if (npages == 0)
+ goto out;
+
+ *res_page = NULL;
+
+ start = oi->i_dir_start_lookup;
+ if (start >= npages)
+ start = 0;
+ n = start;
+ do {
+ char *kaddr;
+ page = exofs_get_page(dir, n);
+ if (!IS_ERR(page)) {
+ kaddr = page_address(page);
+ de = (struct exofs_dir_entry *) kaddr;
+ kaddr += exofs_last_byte(dir, n) - reclen;
+ while ((char *) de <= kaddr) {
+ if (de->rec_len == 0) {
+ EXOFS_ERR(
+ "ERROR: exofs_find_entry: "
+ "zero-length directory entry");
+ exofs_put_page(page);
+ goto out;
+ }
+ if (exofs_match(namelen, name, de))
+ goto found;
+ de = exofs_next_entry(de);
+ }
+ exofs_put_page(page);
+ }
+ if (++n >= npages)
+ n = 0;
+ } while (n != start);
+out:
+ return NULL;
+
+found:
+ *res_page = page;
+ oi->i_dir_start_lookup = n;
+ return de;
+}
+
+struct exofs_dir_entry *exofs_dotdot(struct inode *dir, struct page **p)
+{
+ struct page *page = exofs_get_page(dir, 0);
+ struct exofs_dir_entry *de = NULL;
+
+ if (!IS_ERR(page)) {
+ de = exofs_next_entry(
+ (struct exofs_dir_entry *)page_address(page));
+ *p = page;
+ }
+ return de;
+}
+
+ino_t exofs_inode_by_name(struct inode *dir, struct dentry *dentry)
+{
+ ino_t res = 0;
+ struct exofs_dir_entry *de;
+ struct page *page;
+
+ de = exofs_find_entry(dir, dentry, &page);
+ if (de) {
+ res = le64_to_cpu(de->inode_no);
+ exofs_put_page(page);
+ }
+ return res;
+}
+
+int exofs_set_link(struct inode *dir, struct exofs_dir_entry *de,
+ struct page *page, struct inode *inode)
+{
+ loff_t pos = page_offset(page) +
+ (char *) de - (char *) page_address(page);
+ unsigned len = le16_to_cpu(de->rec_len);
+ int err;
+
+ lock_page(page);
+ err = exofs_write_begin(NULL, page->mapping, pos, len,
+ AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
+ if (err)
+ EXOFS_ERR("exofs_set_link: exofs_write_begin FAILD => %d\n",
+ err);
+
+ de->inode_no = cpu_to_le64(inode->i_ino);
+ exofs_set_de_type(de, inode);
+ if (likely(!err))
+ err = exofs_commit_chunk(page, pos, len);
+ exofs_put_page(page);
+ dir->i_mtime = dir->i_ctime = CURRENT_TIME;
+ mark_inode_dirty(dir);
+ return err;
+}
+
+int exofs_add_link(struct dentry *dentry, struct inode *inode)
+{
+ struct inode *dir = dentry->d_parent->d_inode;
+ const unsigned char *name = dentry->d_name.name;
+ int namelen = dentry->d_name.len;
+ unsigned chunk_size = exofs_chunk_size(dir);
+ unsigned reclen = EXOFS_DIR_REC_LEN(namelen);
+ unsigned short rec_len, name_len;
+ struct page *page = NULL;
+ struct exofs_sb_info *sbi = inode->i_sb->s_fs_info;
+ struct exofs_dir_entry *de;
+ unsigned long npages = dir_pages(dir);
+ unsigned long n;
+ char *kaddr;
+ loff_t pos;
+ int err;
+
+ for (n = 0; n <= npages; n++) {
+ char *dir_end;
+
+ page = exofs_get_page(dir, n);
+ err = PTR_ERR(page);
+ if (IS_ERR(page))
+ goto out;
+ lock_page(page);
+ kaddr = page_address(page);
+ dir_end = kaddr + exofs_last_byte(dir, n);
+ de = (struct exofs_dir_entry *)kaddr;
+ kaddr += PAGE_CACHE_SIZE - reclen;
+ while ((char *)de <= kaddr) {
+ if ((char *)de == dir_end) {
+ name_len = 0;
+ rec_len = chunk_size;
+ de->rec_len = cpu_to_le16(chunk_size);
+ de->inode_no = 0;
+ goto got_it;
+ }
+ if (de->rec_len == 0) {
+ EXOFS_ERR("ERROR: exofs_add_link: "
+ "zero-length directory entry");
+ err = -EIO;
+ goto out_unlock;
+ }
+ err = -EEXIST;
+ if (exofs_match(namelen, name, de))
+ goto out_unlock;
+ name_len = EXOFS_DIR_REC_LEN(de->name_len);
+ rec_len = le16_to_cpu(de->rec_len);
+ if (!de->inode_no && rec_len >= reclen)
+ goto got_it;
+ if (rec_len >= name_len + reclen)
+ goto got_it;
+ de = (struct exofs_dir_entry *) ((char *) de + rec_len);
+ }
+ unlock_page(page);
+ exofs_put_page(page);
+ }
+
+ EXOFS_ERR("exofs_add_link: BAD dentry=%p or inode=%p", dentry, inode);
+ return -EINVAL;
+
+got_it:
+ pos = page_offset(page) +
+ (char *)de - (char *)page_address(page);
+ err = exofs_write_begin(NULL, page->mapping, pos, rec_len, 0,
+ &page, NULL);
+ if (err)
+ goto out_unlock;
+ if (de->inode_no) {
+ struct exofs_dir_entry *de1 =
+ (struct exofs_dir_entry *)((char *)de + name_len);
+ de1->rec_len = cpu_to_le16(rec_len - name_len);
+ de->rec_len = cpu_to_le16(name_len);
+ de = de1;
+ }
+ de->name_len = namelen;
+ memcpy(de->name, name, namelen);
+ de->inode_no = cpu_to_le64(inode->i_ino);
+ exofs_set_de_type(de, inode);
+ err = exofs_commit_chunk(page, pos, rec_len);
+ dir->i_mtime = dir->i_ctime = CURRENT_TIME;
+ mark_inode_dirty(dir);
+ sbi->s_numfiles++;
+
+out_put:
+ exofs_put_page(page);
+out:
+ return err;
+out_unlock:
+ unlock_page(page);
+ goto out_put;
+}
+
+int exofs_delete_entry(struct exofs_dir_entry *dir, struct page *page)
+{
+ struct address_space *mapping = page->mapping;
+ struct inode *inode = mapping->host;
+ struct exofs_sb_info *sbi = inode->i_sb->s_fs_info;
+ char *kaddr = page_address(page);
+ unsigned from = ((char *)dir - kaddr) & ~(exofs_chunk_size(inode)-1);
+ unsigned to = ((char *)dir - kaddr) + le16_to_cpu(dir->rec_len);
+ loff_t pos;
+ struct exofs_dir_entry *pde = NULL;
+ struct exofs_dir_entry *de = (struct exofs_dir_entry *) (kaddr + from);
+ int err;
+
+ while (de < dir) {
+ if (de->rec_len == 0) {
+ EXOFS_ERR("ERROR: exofs_delete_entry:"
+ "zero-length directory entry");
+ err = -EIO;
+ goto out;
+ }
+ pde = de;
+ de = exofs_next_entry(de);
+ }
+ if (pde)
+ from = (char *)pde - (char *)page_address(page);
+ pos = page_offset(page) + from;
+ lock_page(page);
+ err = exofs_write_begin(NULL, page->mapping, pos, to - from, 0,
+ &page, NULL);
+ if (err)
+ EXOFS_ERR("exofs_delete_entry: exofs_write_begin FAILD => %d\n",
+ err);
+ if (pde)
+ pde->rec_len = cpu_to_le16(to - from);
+ dir->inode_no = 0;
+ if (likely(!err))
+ err = exofs_commit_chunk(page, pos, to - from);
+ inode->i_ctime = inode->i_mtime = CURRENT_TIME;
+ mark_inode_dirty(inode);
+ sbi->s_numfiles--;
+out:
+ exofs_put_page(page);
+ return err;
+}
+
+/* kept aligned on 4 bytes */
+#define THIS_DIR ".\0\0"
+#define PARENT_DIR "..\0"
+
+int exofs_make_empty(struct inode *inode, struct inode *parent)
+{
+ struct address_space *mapping = inode->i_mapping;
+ struct page *page = grab_cache_page(mapping, 0);
+ unsigned chunk_size = exofs_chunk_size(inode);
+ struct exofs_dir_entry *de;
+ int err;
+ void *kaddr;
+
+ if (!page)
+ return -ENOMEM;
+
+ err = exofs_write_begin(NULL, page->mapping, 0, chunk_size, 0,
+ &page, NULL);
+ if (err) {
+ unlock_page(page);
+ goto fail;
+ }
+
+ kaddr = kmap_atomic(page, KM_USER0);
+ de = (struct exofs_dir_entry *)kaddr;
+ de->name_len = 1;
+ de->rec_len = cpu_to_le16(EXOFS_DIR_REC_LEN(1));
+ memcpy(de->name, THIS_DIR, sizeof(THIS_DIR));
+ de->inode_no = cpu_to_le64(inode->i_ino);
+ exofs_set_de_type(de, inode);
+
+ de = (struct exofs_dir_entry *)(kaddr + EXOFS_DIR_REC_LEN(1));
+ de->name_len = 2;
+ de->rec_len = cpu_to_le16(chunk_size - EXOFS_DIR_REC_LEN(1));
+ de->inode_no = cpu_to_le64(parent->i_ino);
+ memcpy(de->name, PARENT_DIR, sizeof(PARENT_DIR));
+ exofs_set_de_type(de, inode);
+ kunmap_atomic(page, KM_USER0);
+ err = exofs_commit_chunk(page, 0, chunk_size);
+fail:
+ page_cache_release(page);
+ return err;
+}
+
+int exofs_empty_dir(struct inode *inode)
+{
+ struct page *page = NULL;
+ unsigned long i, npages = dir_pages(inode);
+
+ for (i = 0; i < npages; i++) {
+ char *kaddr;
+ struct exofs_dir_entry *de;
+ page = exofs_get_page(inode, i);
+
+ if (IS_ERR(page))
+ continue;
+
+ kaddr = page_address(page);
+ de = (struct exofs_dir_entry *)kaddr;
+ kaddr += exofs_last_byte(inode, i) - EXOFS_DIR_REC_LEN(1);
+
+ while ((char *)de <= kaddr) {
+ if (de->rec_len == 0) {
+ EXOFS_ERR("ERROR: exofs_empty_dir: "
+ "zero-length directory entry"
+ "kaddr=%p, de=%p\n", kaddr, de);
+ goto not_empty;
+ }
+ if (de->inode_no != 0) {
+ /* check for . and .. */
+ if (de->name[0] != '.')
+ goto not_empty;
+ if (de->name_len > 2)
+ goto not_empty;
+ if (de->name_len < 2) {
+ if (le64_to_cpu(de->inode_no) !=
+ inode->i_ino)
+ goto not_empty;
+ } else if (de->name[1] != '.')
+ goto not_empty;
+ }
+ de = exofs_next_entry(de);
+ }
+ exofs_put_page(page);
+ }
+ return 1;
+
+not_empty:
+ exofs_put_page(page);
+ return 0;
+}
+
+const struct file_operations exofs_dir_operations = {
+ .llseek = generic_file_llseek,
+ .read = generic_read_dir,
+ .readdir = exofs_readdir,
+};
diff --git a/fs/exofs/exofs.h b/fs/exofs/exofs.h
index f30de6e..abf66b2 100644
--- a/fs/exofs/exofs.h
+++ b/fs/exofs/exofs.h
@@ -124,6 +124,11 @@ static inline struct exofs_i_info *exofs_i(struct inode *inode)
return container_of(inode, struct exofs_i_info, vfs_inode);
}

+/*
+ * Maximum count of links to a file
+ */
+#define EXOFS_LINK_MAX 32000
+
/*************************
* function declarations *
*************************/
@@ -133,10 +138,27 @@ int exofs_setattr(struct dentry *, struct iattr *);
int exofs_write_begin(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata);
+extern struct inode *exofs_iget(struct super_block *, unsigned long);
+struct inode *exofs_new_inode(struct inode *, int);
+
+/* dir.c: */
+int exofs_add_link(struct dentry *, struct inode *);
+ino_t exofs_inode_by_name(struct inode *, struct dentry *);
+int exofs_delete_entry(struct exofs_dir_entry *, struct page *);
+int exofs_make_empty(struct inode *, struct inode *);
+struct exofs_dir_entry *exofs_find_entry(struct inode *, struct dentry *,
+ struct page **);
+int exofs_empty_dir(struct inode *);
+struct exofs_dir_entry *exofs_dotdot(struct inode *, struct page **);
+int exofs_set_link(struct inode *, struct exofs_dir_entry *, struct page *,
+ struct inode *);

/*********************
* operation vectors *
*********************/
+/* dir.c: */
+extern const struct file_operations exofs_dir_operations;
+
/* file.c */
extern const struct inode_operations exofs_file_inode_operations;
extern const struct file_operations exofs_file_operations;
@@ -144,6 +166,10 @@ extern const struct file_operations exofs_file_operations;
/* inode.c */
extern const struct address_space_operations exofs_aops;

+/* namei.c */
+extern const struct inode_operations exofs_dir_inode_operations;
+extern const struct inode_operations exofs_special_inode_operations;
+
/* symlink.c */
extern const struct inode_operations exofs_symlink_inode_operations;
extern const struct inode_operations exofs_fast_symlink_inode_operations;
diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
index 175679a..7b0c696 100644
--- a/fs/exofs/inode.c
+++ b/fs/exofs/inode.c
@@ -836,3 +836,275 @@ int exofs_setattr(struct dentry *dentry, struct iattr *iattr)
error = inode_setattr(inode, iattr);
return error;
}
+
+/*
+ * Read an inode from the OSD, and return it as is. We also return the size
+ * attribute in the 'sanity' argument if we got compiled with debugging turned
+ * on.
+ */
+static int exofs_get_inode(struct super_block *sb, struct exofs_i_info *oi,
+ struct exofs_fcb *inode, uint64_t *sanity)
+{
+ struct exofs_sb_info *sbi = sb->s_fs_info;
+ struct osd_request *or;
+ struct osd_attr attr;
+ struct osd_obj_id obj = {sbi->s_pid,
+ oi->vfs_inode.i_ino + EXOFS_OBJ_OFF};
+ int ret;
+
+ exofs_make_credential(oi->i_cred, &obj);
+
+ or = osd_start_request(sbi->s_dev, GFP_KERNEL);
+ if (unlikely(!or)) {
+ EXOFS_ERR("exofs_get_inode: osd_start_request failed.\n");
+ return -ENOMEM;
+ }
+ osd_req_get_attributes(or, &obj);
+
+ /* we need the inode attribute */
+ osd_req_add_get_attr_list(or, &g_attr_inode_data, 1);
+
+#ifdef EXOFS_DEBUG_OBJ_ISIZE
+ /* we get the size attributes to do a sanity check */
+ osd_req_add_get_attr_list(or, &g_attr_logical_length, 1);
+#endif
+
+ ret = exofs_sync_op(or, sbi->s_timeout, oi->i_cred);
+ if (ret)
+ goto out;
+
+ attr = g_attr_inode_data;
+ ret = extract_attr_from_req(or, &attr);
+ if (ret) {
+ EXOFS_ERR("exofs_get_inode: extract_attr_from_req failed\n");
+ goto out;
+ }
+
+ WARN_ON(attr.len != EXOFS_INO_ATTR_SIZE);
+ memcpy(inode, attr.val_ptr, EXOFS_INO_ATTR_SIZE);
+
+#ifdef EXOFS_DEBUG_OBJ_ISIZE
+ attr = g_attr_logical_length;
+ ret = extract_attr_from_req(or, &attr);
+ if (ret) {
+ EXOFS_ERR("ERROR: extract attr from or failed\n");
+ goto out;
+ }
+ *sanity = get_unaligned_be64(attr.val_ptr);
+#endif
+
+out:
+ osd_end_request(or);
+ return ret;
+}
+
+/*
+ * Fill in an inode read from the OSD and set it up for use
+ */
+struct inode *exofs_iget(struct super_block *sb, unsigned long ino)
+{
+ struct exofs_i_info *oi;
+ struct exofs_fcb fcb;
+ struct inode *inode;
+ uint64_t uninitialized_var(sanity);
+ int ret;
+
+ inode = iget_locked(sb, ino);
+ if (!inode)
+ return ERR_PTR(-ENOMEM);
+ if (!(inode->i_state & I_NEW))
+ return inode;
+ oi = exofs_i(inode);
+
+ /* read the inode from the osd */
+ ret = exofs_get_inode(sb, oi, &fcb, &sanity);
+ if (ret)
+ goto bad_inode;
+
+ init_waitqueue_head(&oi->i_wq);
+ set_obj_created(oi);
+
+ /* copy stuff from on-disk struct to in-memory struct */
+ inode->i_mode = le16_to_cpu(fcb.i_mode);
+ inode->i_uid = le32_to_cpu(fcb.i_uid);
+ inode->i_gid = le32_to_cpu(fcb.i_gid);
+ inode->i_nlink = le16_to_cpu(fcb.i_links_count);
+ inode->i_ctime.tv_sec = (signed)le32_to_cpu(fcb.i_ctime);
+ inode->i_atime.tv_sec = (signed)le32_to_cpu(fcb.i_atime);
+ inode->i_mtime.tv_sec = (signed)le32_to_cpu(fcb.i_mtime);
+ inode->i_ctime.tv_nsec =
+ inode->i_atime.tv_nsec = inode->i_mtime.tv_nsec = 0;
+ oi->i_commit_size = le64_to_cpu(fcb.i_size);
+ i_size_write(inode, oi->i_commit_size);
+ inode->i_blkbits = EXOFS_BLKSHIFT;
+ inode->i_generation = le32_to_cpu(fcb.i_generation);
+
+#ifdef EXOFS_DEBUG_OBJ_ISIZE
+ if ((inode->i_size != sanity) &&
+ (!exofs_inode_is_fast_symlink(inode))) {
+ EXOFS_ERR("WARNING: Size of object from inode and "
+ "attributes differ (%lld != %llu)\n",
+ inode->i_size, _LLU(sanity));
+ }
+#endif
+
+ oi->i_dir_start_lookup = 0;
+
+ if ((inode->i_nlink == 0) && (inode->i_mode == 0)) {
+ ret = -ESTALE;
+ goto bad_inode;
+ }
+
+ if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode)) {
+ if (fcb.i_data[0])
+ inode->i_rdev =
+ old_decode_dev(le32_to_cpu(fcb.i_data[0]));
+ else
+ inode->i_rdev =
+ new_decode_dev(le32_to_cpu(fcb.i_data[1]));
+ } else {
+ memcpy(oi->i_data, fcb.i_data, sizeof(fcb.i_data));
+ }
+
+ if (S_ISREG(inode->i_mode)) {
+ inode->i_op = &exofs_file_inode_operations;
+ inode->i_fop = &exofs_file_operations;
+ inode->i_mapping->a_ops = &exofs_aops;
+ } else if (S_ISDIR(inode->i_mode)) {
+ inode->i_op = &exofs_dir_inode_operations;
+ inode->i_fop = &exofs_dir_operations;
+ inode->i_mapping->a_ops = &exofs_aops;
+ } else if (S_ISLNK(inode->i_mode)) {
+ if (exofs_inode_is_fast_symlink(inode))
+ inode->i_op = &exofs_fast_symlink_inode_operations;
+ else {
+ inode->i_op = &exofs_symlink_inode_operations;
+ inode->i_mapping->a_ops = &exofs_aops;
+ }
+ } else {
+ inode->i_op = &exofs_special_inode_operations;
+ if (fcb.i_data[0])
+ init_special_inode(inode, inode->i_mode,
+ old_decode_dev(le32_to_cpu(fcb.i_data[0])));
+ else
+ init_special_inode(inode, inode->i_mode,
+ new_decode_dev(le32_to_cpu(fcb.i_data[1])));
+ }
+
+ unlock_new_inode(inode);
+ return inode;
+
+bad_inode:
+ iget_failed(inode);
+ return ERR_PTR(ret);
+}
+
+int __exofs_wait_obj_created(struct exofs_i_info *oi)
+{
+ if (!obj_created(oi)) {
+ BUG_ON(!obj_2bcreated(oi));
+ wait_event(oi->i_wq, obj_created(oi));
+ }
+ return unlikely(is_bad_inode(&oi->vfs_inode)) ? -EIO : 0;
+}
+/*
+ * Callback function from exofs_new_inode(). The important thing is that we
+ * set the obj_created flag so that other methods know that the object exists on
+ * the OSD.
+ */
+static void create_done(struct osd_request *or, void *p)
+{
+ struct inode *inode = p;
+ struct exofs_i_info *oi = exofs_i(inode);
+ struct exofs_sb_info *sbi = inode->i_sb->s_fs_info;
+ int ret;
+
+ ret = exofs_check_ok(or);
+ osd_end_request(or);
+ atomic_dec(&sbi->s_curr_pending);
+
+ if (unlikely(ret)) {
+ EXOFS_ERR("object=0x%llx creation faild in pid=0x%llx",
+ _LLU(sbi->s_pid), _LLU(inode->i_ino + EXOFS_OBJ_OFF));
+ make_bad_inode(inode);
+ } else
+ set_obj_created(oi);
+
+ atomic_dec(&inode->i_count);
+ wake_up(&oi->i_wq);
+}
+
+/*
+ * Set up a new inode and create an object for it on the OSD
+ */
+struct inode *exofs_new_inode(struct inode *dir, int mode)
+{
+ struct super_block *sb;
+ struct inode *inode;
+ struct exofs_i_info *oi;
+ struct exofs_sb_info *sbi;
+ struct osd_request *or;
+ struct osd_obj_id obj;
+ int ret;
+
+ sb = dir->i_sb;
+ inode = new_inode(sb);
+ if (!inode)
+ return ERR_PTR(-ENOMEM);
+
+ oi = exofs_i(inode);
+
+ init_waitqueue_head(&oi->i_wq);
+ set_obj_2bcreated(oi);
+
+ sbi = sb->s_fs_info;
+
+ sb->s_dirt = 1;
+ inode->i_uid = current->cred->fsuid;
+ if (dir->i_mode & S_ISGID) {
+ inode->i_gid = dir->i_gid;
+ if (S_ISDIR(mode))
+ mode |= S_ISGID;
+ } else {
+ inode->i_gid = current->cred->fsgid;
+ }
+ inode->i_mode = mode;
+
+ inode->i_ino = sbi->s_nextid++;
+ inode->i_blkbits = EXOFS_BLKSHIFT;
+ inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
+ oi->i_commit_size = inode->i_size = 0;
+ spin_lock(&sbi->s_next_gen_lock);
+ inode->i_generation = sbi->s_next_generation++;
+ spin_unlock(&sbi->s_next_gen_lock);
+ insert_inode_hash(inode);
+
+ mark_inode_dirty(inode);
+
+ obj.partition = sbi->s_pid;
+ obj.id = inode->i_ino + EXOFS_OBJ_OFF;
+ exofs_make_credential(oi->i_cred, &obj);
+
+ or = osd_start_request(sbi->s_dev, GFP_KERNEL);
+ if (unlikely(!or)) {
+ EXOFS_ERR("exofs_new_inode: osd_start_request failed\n");
+ return ERR_PTR(-ENOMEM);
+ }
+
+ osd_req_create_object(or, &obj);
+
+ /* increment the refcount so that the inode will still be around when we
+ * reach the callback
+ */
+ atomic_inc(&inode->i_count);
+
+ ret = exofs_async_op(or, create_done, inode, oi->i_cred);
+ if (ret) {
+ atomic_dec(&inode->i_count);
+ osd_end_request(or);
+ return ERR_PTR(-EIO);
+ }
+ atomic_inc(&sbi->s_curr_pending);
+
+ return inode;
+}
diff --git a/fs/exofs/namei.c b/fs/exofs/namei.c
new file mode 100644
index 0000000..77fdd76
--- /dev/null
+++ b/fs/exofs/namei.c
@@ -0,0 +1,342 @@
+/*
+ * Copyright (C) 2005, 2006
+ * Avishay Traeger ([email protected]) ([email protected])
+ * Copyright (C) 2005, 2006
+ * International Business Machines
+ * Copyright (C) 2008, 2009
+ * Boaz Harrosh <[email protected]>
+ *
+ * Copyrights for code taken from ext2:
+ * Copyright (C) 1992, 1993, 1994, 1995
+ * Remy Card ([email protected])
+ * Laboratoire MASI - Institut Blaise Pascal
+ * Universite Pierre et Marie Curie (Paris VI)
+ * from
+ * linux/fs/minix/inode.c
+ * Copyright (C) 1991, 1992 Linus Torvalds
+ *
+ * This file is part of exofs.
+ *
+ * exofs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation. Since it is based on ext2, and the only
+ * valid version of GPL for the Linux kernel is version 2, the only valid
+ * version of GPL for exofs is version 2.
+ *
+ * exofs is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with exofs; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "exofs.h"
+
+static inline int exofs_add_nondir(struct dentry *dentry, struct inode *inode)
+{
+ int err = exofs_add_link(dentry, inode);
+ if (!err) {
+ d_instantiate(dentry, inode);
+ return 0;
+ }
+ inode_dec_link_count(inode);
+ iput(inode);
+ return err;
+}
+
+static struct dentry *exofs_lookup(struct inode *dir, struct dentry *dentry,
+ struct nameidata *nd)
+{
+ struct inode *inode;
+ ino_t ino;
+
+ if (dentry->d_name.len > EXOFS_NAME_LEN)
+ return ERR_PTR(-ENAMETOOLONG);
+
+ ino = exofs_inode_by_name(dir, dentry);
+ inode = NULL;
+ if (ino) {
+ inode = exofs_iget(dir->i_sb, ino);
+ if (IS_ERR(inode))
+ return ERR_CAST(inode);
+ }
+ return d_splice_alias(inode, dentry);
+}
+
+static int exofs_create(struct inode *dir, struct dentry *dentry, int mode,
+ struct nameidata *nd)
+{
+ struct inode *inode = exofs_new_inode(dir, mode);
+ int err = PTR_ERR(inode);
+ if (!IS_ERR(inode)) {
+ inode->i_op = &exofs_file_inode_operations;
+ inode->i_fop = &exofs_file_operations;
+ inode->i_mapping->a_ops = &exofs_aops;
+ mark_inode_dirty(inode);
+ err = exofs_add_nondir(dentry, inode);
+ }
+ return err;
+}
+
+static int exofs_mknod(struct inode *dir, struct dentry *dentry, int mode,
+ dev_t rdev)
+{
+ struct inode *inode;
+ int err;
+
+ if (!new_valid_dev(rdev))
+ return -EINVAL;
+
+ inode = exofs_new_inode(dir, mode);
+ err = PTR_ERR(inode);
+ if (!IS_ERR(inode)) {
+ init_special_inode(inode, inode->i_mode, rdev);
+ mark_inode_dirty(inode);
+ err = exofs_add_nondir(dentry, inode);
+ }
+ return err;
+}
+
+static int exofs_symlink(struct inode *dir, struct dentry *dentry,
+ const char *symname)
+{
+ struct super_block *sb = dir->i_sb;
+ int err = -ENAMETOOLONG;
+ unsigned l = strlen(symname)+1;
+ struct inode *inode;
+ struct exofs_i_info *oi;
+
+ if (l > sb->s_blocksize)
+ goto out;
+
+ inode = exofs_new_inode(dir, S_IFLNK | S_IRWXUGO);
+ err = PTR_ERR(inode);
+ if (IS_ERR(inode))
+ goto out;
+
+ oi = exofs_i(inode);
+ if (l > sizeof(oi->i_data)) {
+ /* slow symlink */
+ inode->i_op = &exofs_symlink_inode_operations;
+ inode->i_mapping->a_ops = &exofs_aops;
+ memset(oi->i_data, 0, sizeof(oi->i_data));
+
+ err = page_symlink(inode, symname, l);
+ if (err)
+ goto out_fail;
+ } else {
+ /* fast symlink */
+ inode->i_op = &exofs_fast_symlink_inode_operations;
+ memcpy(oi->i_data, symname, l);
+ inode->i_size = l-1;
+ }
+ mark_inode_dirty(inode);
+
+ err = exofs_add_nondir(dentry, inode);
+out:
+ return err;
+
+out_fail:
+ inode_dec_link_count(inode);
+ iput(inode);
+ goto out;
+}
+
+static int exofs_link(struct dentry *old_dentry, struct inode *dir,
+ struct dentry *dentry)
+{
+ struct inode *inode = old_dentry->d_inode;
+
+ if (inode->i_nlink >= EXOFS_LINK_MAX)
+ return -EMLINK;
+
+ inode->i_ctime = CURRENT_TIME;
+ inode_inc_link_count(inode);
+ atomic_inc(&inode->i_count);
+
+ return exofs_add_nondir(dentry, inode);
+}
+
+static int exofs_mkdir(struct inode *dir, struct dentry *dentry, int mode)
+{
+ struct inode *inode;
+ int err = -EMLINK;
+
+ if (dir->i_nlink >= EXOFS_LINK_MAX)
+ goto out;
+
+ inode_inc_link_count(dir);
+
+ inode = exofs_new_inode(dir, S_IFDIR | mode);
+ err = PTR_ERR(inode);
+ if (IS_ERR(inode))
+ goto out_dir;
+
+ inode->i_op = &exofs_dir_inode_operations;
+ inode->i_fop = &exofs_dir_operations;
+ inode->i_mapping->a_ops = &exofs_aops;
+
+ inode_inc_link_count(inode);
+
+ err = exofs_make_empty(inode, dir);
+ if (err)
+ goto out_fail;
+
+ err = exofs_add_link(dentry, inode);
+ if (err)
+ goto out_fail;
+
+ d_instantiate(dentry, inode);
+out:
+ return err;
+
+out_fail:
+ inode_dec_link_count(inode);
+ inode_dec_link_count(inode);
+ iput(inode);
+out_dir:
+ inode_dec_link_count(dir);
+ goto out;
+}
+
+static int exofs_unlink(struct inode *dir, struct dentry *dentry)
+{
+ struct inode *inode = dentry->d_inode;
+ struct exofs_dir_entry *de;
+ struct page *page;
+ int err = -ENOENT;
+
+ de = exofs_find_entry(dir, dentry, &page);
+ if (!de)
+ goto out;
+
+ err = exofs_delete_entry(de, page);
+ if (err)
+ goto out;
+
+ inode->i_ctime = dir->i_ctime;
+ inode_dec_link_count(inode);
+ err = 0;
+out:
+ return err;
+}
+
+static int exofs_rmdir(struct inode *dir, struct dentry *dentry)
+{
+ struct inode *inode = dentry->d_inode;
+ int err = -ENOTEMPTY;
+
+ if (exofs_empty_dir(inode)) {
+ err = exofs_unlink(dir, dentry);
+ if (!err) {
+ inode->i_size = 0;
+ inode_dec_link_count(inode);
+ inode_dec_link_count(dir);
+ }
+ }
+ return err;
+}
+
+static int exofs_rename(struct inode *old_dir, struct dentry *old_dentry,
+ struct inode *new_dir, struct dentry *new_dentry)
+{
+ struct inode *old_inode = old_dentry->d_inode;
+ struct inode *new_inode = new_dentry->d_inode;
+ struct page *dir_page = NULL;
+ struct exofs_dir_entry *dir_de = NULL;
+ struct page *old_page;
+ struct exofs_dir_entry *old_de;
+ int err = -ENOENT;
+
+ old_de = exofs_find_entry(old_dir, old_dentry, &old_page);
+ if (!old_de)
+ goto out;
+
+ if (S_ISDIR(old_inode->i_mode)) {
+ err = -EIO;
+ dir_de = exofs_dotdot(old_inode, &dir_page);
+ if (!dir_de)
+ goto out_old;
+ }
+
+ if (new_inode) {
+ struct page *new_page;
+ struct exofs_dir_entry *new_de;
+
+ err = -ENOTEMPTY;
+ if (dir_de && !exofs_empty_dir(new_inode))
+ goto out_dir;
+
+ err = -ENOENT;
+ new_de = exofs_find_entry(new_dir, new_dentry, &new_page);
+ if (!new_de)
+ goto out_dir;
+ inode_inc_link_count(old_inode);
+ err = exofs_set_link(new_dir, new_de, new_page, old_inode);
+ new_inode->i_ctime = CURRENT_TIME;
+ if (dir_de)
+ drop_nlink(new_inode);
+ inode_dec_link_count(new_inode);
+ if (err)
+ goto out_dir;
+ } else {
+ if (dir_de) {
+ err = -EMLINK;
+ if (new_dir->i_nlink >= EXOFS_LINK_MAX)
+ goto out_dir;
+ }
+ inode_inc_link_count(old_inode);
+ err = exofs_add_link(new_dentry, old_inode);
+ if (err) {
+ inode_dec_link_count(old_inode);
+ goto out_dir;
+ }
+ if (dir_de)
+ inode_inc_link_count(new_dir);
+ }
+
+ old_inode->i_ctime = CURRENT_TIME;
+
+ exofs_delete_entry(old_de, old_page);
+ inode_dec_link_count(old_inode);
+
+ if (dir_de) {
+ err = exofs_set_link(old_inode, dir_de, dir_page, new_dir);
+ inode_dec_link_count(old_dir);
+ if (err)
+ goto out_dir;
+ }
+ return 0;
+
+
+out_dir:
+ if (dir_de) {
+ kunmap(dir_page);
+ page_cache_release(dir_page);
+ }
+out_old:
+ kunmap(old_page);
+ page_cache_release(old_page);
+out:
+ return err;
+}
+
+const struct inode_operations exofs_dir_inode_operations = {
+ .create = exofs_create,
+ .lookup = exofs_lookup,
+ .link = exofs_link,
+ .unlink = exofs_unlink,
+ .symlink = exofs_symlink,
+ .mkdir = exofs_mkdir,
+ .rmdir = exofs_rmdir,
+ .mknod = exofs_mknod,
+ .rename = exofs_rename,
+ .setattr = exofs_setattr,
+};
+
+const struct inode_operations exofs_special_inode_operations = {
+ .setattr = exofs_setattr,
+};
--
1.6.2.1

2009-03-18 18:11:25

by Boaz Harrosh

[permalink] [raw]
Subject: [PATCH 6/8] exofs: super_operations and file_system_type

This patch ties all operation vectors into a file system superblock
and registers the exofs file_system_type at module's load time.

* The file system control block (AKA on-disk superblock) resides in
an object with a special ID (defined in common.h).
Information included in the file system control block is used to
fill the in-memory superblock structure at mount time. This object
is created before the file system is used by mkexofs.c It contains
information such as:
- The file system's magic number
- The next inode number to be allocated

Signed-off-by: Boaz Harrosh <[email protected]>
---
fs/exofs/Kbuild | 2 +-
fs/exofs/exofs.h | 22 +++
fs/exofs/inode.c | 188 ++++++++++++++++++++
fs/exofs/super.c | 520 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 731 insertions(+), 1 deletions(-)
create mode 100644 fs/exofs/super.c

diff --git a/fs/exofs/Kbuild b/fs/exofs/Kbuild
index d5c8c54..592f40d 100644
--- a/fs/exofs/Kbuild
+++ b/fs/exofs/Kbuild
@@ -26,5 +26,5 @@ KBUILD_CPPFLAGS := -I$(OSD_INC) $(KBUILD_CPPFLAGS)

endif

-exofs-y := osd.o inode.o file.o symlink.o namei.o dir.o
+exofs-y := osd.o inode.o file.o symlink.o namei.o dir.o super.o
obj-$(CONFIG_EXOFS_FS) += exofs.o
diff --git a/fs/exofs/exofs.h b/fs/exofs/exofs.h
index abf66b2..76155d7 100644
--- a/fs/exofs/exofs.h
+++ b/fs/exofs/exofs.h
@@ -54,6 +54,15 @@
#define _LLU(x) (unsigned long long)(x)

/*
+ * struct to hold what we get from mount options
+ */
+struct exofs_mountopt {
+ const char *dev_name;
+ uint64_t pid;
+ int timeout;
+};
+
+/*
* our extension to the in-memory superblock
*/
struct exofs_sb_info {
@@ -125,6 +134,14 @@ static inline struct exofs_i_info *exofs_i(struct inode *inode)
}

/*
+ * ugly struct so that we can pass two arguments to update_inode's callback
+ */
+struct updatei_args {
+ struct exofs_sb_info *sbi;
+ struct exofs_fcb fcb;
+};
+
+/*
* Maximum count of links to a file
*/
#define EXOFS_LINK_MAX 32000
@@ -140,6 +157,8 @@ int exofs_write_begin(struct file *file, struct address_space *mapping,
struct page **pagep, void **fsdata);
extern struct inode *exofs_iget(struct super_block *, unsigned long);
struct inode *exofs_new_inode(struct inode *, int);
+extern int exofs_write_inode(struct inode *, int);
+extern void exofs_delete_inode(struct inode *);

/* dir.c: */
int exofs_add_link(struct dentry *, struct inode *);
@@ -170,6 +189,9 @@ extern const struct address_space_operations exofs_aops;
extern const struct inode_operations exofs_dir_inode_operations;
extern const struct inode_operations exofs_special_inode_operations;

+/* super.c */
+extern const struct super_operations exofs_sops;
+
/* symlink.c */
extern const struct inode_operations exofs_symlink_inode_operations;
extern const struct inode_operations exofs_fast_symlink_inode_operations;
diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
index 7b0c696..0f52e76 100644
--- a/fs/exofs/inode.c
+++ b/fs/exofs/inode.c
@@ -1108,3 +1108,191 @@ struct inode *exofs_new_inode(struct inode *dir, int mode)

return inode;
}
+
+/*
+ * Callback function from exofs_update_inode().
+ */
+static void updatei_done(struct osd_request *or, void *p)
+{
+ struct updatei_args *args = p;
+
+ osd_end_request(or);
+
+ atomic_dec(&args->sbi->s_curr_pending);
+
+ kfree(args);
+}
+
+/*
+ * Write the inode to the OSD. Just fill up the struct, and set the attribute
+ * synchronously or asynchronously depending on the do_sync flag.
+ */
+static int exofs_update_inode(struct inode *inode, int do_sync)
+{
+ struct exofs_i_info *oi = exofs_i(inode);
+ struct super_block *sb = inode->i_sb;
+ struct exofs_sb_info *sbi = sb->s_fs_info;
+ struct osd_obj_id obj = {sbi->s_pid, inode->i_ino + EXOFS_OBJ_OFF};
+ struct osd_request *or;
+ struct osd_attr attr;
+ struct exofs_fcb *fcb;
+ struct updatei_args *args;
+ int ret;
+
+ args = kzalloc(sizeof(*args), GFP_KERNEL);
+ if (!args)
+ return -ENOMEM;
+
+ fcb = &args->fcb;
+
+ fcb->i_mode = cpu_to_le16(inode->i_mode);
+ fcb->i_uid = cpu_to_le32(inode->i_uid);
+ fcb->i_gid = cpu_to_le32(inode->i_gid);
+ fcb->i_links_count = cpu_to_le16(inode->i_nlink);
+ fcb->i_ctime = cpu_to_le32(inode->i_ctime.tv_sec);
+ fcb->i_atime = cpu_to_le32(inode->i_atime.tv_sec);
+ fcb->i_mtime = cpu_to_le32(inode->i_mtime.tv_sec);
+ oi->i_commit_size = i_size_read(inode);
+ fcb->i_size = cpu_to_le64(oi->i_commit_size);
+ fcb->i_generation = cpu_to_le32(inode->i_generation);
+
+ if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode)) {
+ if (old_valid_dev(inode->i_rdev)) {
+ fcb->i_data[0] =
+ cpu_to_le32(old_encode_dev(inode->i_rdev));
+ fcb->i_data[1] = 0;
+ } else {
+ fcb->i_data[0] = 0;
+ fcb->i_data[1] =
+ cpu_to_le32(new_encode_dev(inode->i_rdev));
+ fcb->i_data[2] = 0;
+ }
+ } else
+ memcpy(fcb->i_data, oi->i_data, sizeof(fcb->i_data));
+
+ or = osd_start_request(sbi->s_dev, GFP_KERNEL);
+ if (unlikely(!or)) {
+ EXOFS_ERR("exofs_update_inode: osd_start_request failed.\n");
+ ret = -ENOMEM;
+ goto free_args;
+ }
+
+ osd_req_set_attributes(or, &obj);
+
+ attr = g_attr_inode_data;
+ attr.val_ptr = fcb;
+ osd_req_add_set_attr_list(or, &attr, 1);
+
+ if (!obj_created(oi)) {
+ EXOFS_DBGMSG("!obj_created\n");
+ BUG_ON(!obj_2bcreated(oi));
+ wait_event(oi->i_wq, obj_created(oi));
+ EXOFS_DBGMSG("wait_event done\n");
+ }
+
+ if (do_sync) {
+ ret = exofs_sync_op(or, sbi->s_timeout, oi->i_cred);
+ osd_end_request(or);
+ goto free_args;
+ } else {
+ args->sbi = sbi;
+
+ ret = exofs_async_op(or, updatei_done, args, oi->i_cred);
+ if (ret) {
+ osd_end_request(or);
+ goto free_args;
+ }
+ atomic_inc(&sbi->s_curr_pending);
+ goto out; /* deallocation in updatei_done */
+ }
+
+free_args:
+ kfree(args);
+out:
+ EXOFS_DBGMSG("ret=>%d\n", ret);
+ return ret;
+}
+
+int exofs_write_inode(struct inode *inode, int wait)
+{
+ return exofs_update_inode(inode, wait);
+}
+
+int exofs_sync_inode(struct inode *inode)
+{
+ struct writeback_control wbc = {
+ .sync_mode = WB_SYNC_ALL,
+ .nr_to_write = 0, /* sys_fsync did this */
+ };
+
+ return sync_inode(inode, &wbc);
+}
+
+/*
+ * Callback function from exofs_delete_inode() - don't have much cleaning up to
+ * do.
+ */
+static void delete_done(struct osd_request *or, void *p)
+{
+ struct exofs_sb_info *sbi;
+ osd_end_request(or);
+ sbi = p;
+ atomic_dec(&sbi->s_curr_pending);
+}
+
+/*
+ * Called when the refcount of an inode reaches zero. We remove the object
+ * from the OSD here. We make sure the object was created before we try and
+ * delete it.
+ */
+void exofs_delete_inode(struct inode *inode)
+{
+ struct exofs_i_info *oi = exofs_i(inode);
+ struct super_block *sb = inode->i_sb;
+ struct exofs_sb_info *sbi = sb->s_fs_info;
+ struct osd_obj_id obj = {sbi->s_pid, inode->i_ino + EXOFS_OBJ_OFF};
+ struct osd_request *or;
+ int ret;
+
+ truncate_inode_pages(&inode->i_data, 0);
+
+ if (is_bad_inode(inode))
+ goto no_delete;
+
+ mark_inode_dirty(inode);
+ exofs_update_inode(inode, inode_needs_sync(inode));
+
+ inode->i_size = 0;
+ if (inode->i_blocks)
+ exofs_truncate(inode);
+
+ clear_inode(inode);
+
+ or = osd_start_request(sbi->s_dev, GFP_KERNEL);
+ if (unlikely(!or)) {
+ EXOFS_ERR("exofs_delete_inode: osd_start_request failed\n");
+ return;
+ }
+
+ osd_req_remove_object(or, &obj);
+
+ /* if we are deleting an obj that hasn't been created yet, wait */
+ if (!obj_created(oi)) {
+ BUG_ON(!obj_2bcreated(oi));
+ wait_event(oi->i_wq, obj_created(oi));
+ }
+
+ ret = exofs_async_op(or, delete_done, sbi, oi->i_cred);
+ if (ret) {
+ EXOFS_ERR(
+ "ERROR: @exofs_delete_inode exofs_async_op failed\n");
+ osd_end_request(or);
+ return;
+ }
+ atomic_inc(&sbi->s_curr_pending);
+
+ return;
+
+no_delete:
+ clear_inode(inode);
+}
diff --git a/fs/exofs/super.c b/fs/exofs/super.c
new file mode 100644
index 0000000..9153db2
--- /dev/null
+++ b/fs/exofs/super.c
@@ -0,0 +1,520 @@
+/*
+ * Copyright (C) 2005, 2006
+ * Avishay Traeger ([email protected]) ([email protected])
+ * Copyright (C) 2005, 2006
+ * International Business Machines
+ * Copyright (C) 2008, 2009
+ * Boaz Harrosh <[email protected]>
+ *
+ * Copyrights for code taken from ext2:
+ * Copyright (C) 1992, 1993, 1994, 1995
+ * Remy Card ([email protected])
+ * Laboratoire MASI - Institut Blaise Pascal
+ * Universite Pierre et Marie Curie (Paris VI)
+ * from
+ * linux/fs/minix/inode.c
+ * Copyright (C) 1991, 1992 Linus Torvalds
+ *
+ * This file is part of exofs.
+ *
+ * exofs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation. Since it is based on ext2, and the only
+ * valid version of GPL for the Linux kernel is version 2, the only valid
+ * version of GPL for exofs is version 2.
+ *
+ * exofs is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with exofs; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include <linux/string.h>
+#include <linux/parser.h>
+#include <linux/vfs.h>
+#include <linux/random.h>
+
+#include "exofs.h"
+
+/******************************************************************************
+ * MOUNT OPTIONS
+ *****************************************************************************/
+
+/*
+ * exofs-specific mount-time options.
+ */
+enum { Opt_pid, Opt_to, Opt_mkfs, Opt_format, Opt_err };
+
+/*
+ * Our mount-time options. These should ideally be 64-bit unsigned, but the
+ * kernel's parsing functions do not currently support that. 32-bit should be
+ * sufficient for most applications now.
+ */
+static match_table_t tokens = {
+ {Opt_pid, "pid=%u"},
+ {Opt_to, "to=%u"},
+ {Opt_err, NULL}
+};
+
+/*
+ * The main option parsing method. Also makes sure that all of the mandatory
+ * mount options were set.
+ */
+static int parse_options(char *options, struct exofs_mountopt *opts)
+{
+ char *p;
+ substring_t args[MAX_OPT_ARGS];
+ int option;
+ bool s_pid = false;
+
+ EXOFS_DBGMSG("parse_options %s\n", options);
+ /* defaults */
+ memset(opts, 0, sizeof(*opts));
+ opts->timeout = BLK_DEFAULT_SG_TIMEOUT;
+
+ while ((p = strsep(&options, ",")) != NULL) {
+ int token;
+ char str[32];
+
+ if (!*p)
+ continue;
+
+ token = match_token(p, tokens, args);
+ switch (token) {
+ case Opt_pid:
+ if (0 == match_strlcpy(str, &args[0], sizeof(str)))
+ return -EINVAL;
+ opts->pid = simple_strtoull(str, NULL, 0);
+ if (opts->pid < EXOFS_MIN_PID) {
+ EXOFS_ERR("Partition ID must be >= %u",
+ EXOFS_MIN_PID);
+ return -EINVAL;
+ }
+ s_pid = 1;
+ break;
+ case Opt_to:
+ if (match_int(&args[0], &option))
+ return -EINVAL;
+ if (option <= 0) {
+ EXOFS_ERR("Timout must be > 0");
+ return -EINVAL;
+ }
+ opts->timeout = option * HZ;
+ break;
+ }
+ }
+
+ if (!s_pid) {
+ EXOFS_ERR("Need to specify the following options:\n");
+ EXOFS_ERR(" -o pid=pid_no_to_use\n");
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+/******************************************************************************
+ * INODE CACHE
+ *****************************************************************************/
+
+/*
+ * Our inode cache. Isn't it pretty?
+ */
+static struct kmem_cache *exofs_inode_cachep;
+
+/*
+ * Allocate an inode in the cache
+ */
+static struct inode *exofs_alloc_inode(struct super_block *sb)
+{
+ struct exofs_i_info *oi;
+
+ oi = kmem_cache_alloc(exofs_inode_cachep, GFP_KERNEL);
+ if (!oi)
+ return NULL;
+
+ oi->vfs_inode.i_version = 1;
+ return &oi->vfs_inode;
+}
+
+/*
+ * Remove an inode from the cache
+ */
+static void exofs_destroy_inode(struct inode *inode)
+{
+ kmem_cache_free(exofs_inode_cachep, exofs_i(inode));
+}
+
+/*
+ * Initialize the inode
+ */
+static void exofs_init_once(void *foo)
+{
+ struct exofs_i_info *oi = foo;
+
+ inode_init_once(&oi->vfs_inode);
+}
+
+/*
+ * Create and initialize the inode cache
+ */
+static int init_inodecache(void)
+{
+ exofs_inode_cachep = kmem_cache_create("exofs_inode_cache",
+ sizeof(struct exofs_i_info), 0,
+ SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD,
+ exofs_init_once);
+ if (exofs_inode_cachep == NULL)
+ return -ENOMEM;
+ return 0;
+}
+
+/*
+ * Destroy the inode cache
+ */
+static void destroy_inodecache(void)
+{
+ kmem_cache_destroy(exofs_inode_cachep);
+}
+
+/******************************************************************************
+ * SUPERBLOCK FUNCTIONS
+ *****************************************************************************/
+
+/*
+ * Write the superblock to the OSD
+ */
+static void exofs_write_super(struct super_block *sb)
+{
+ struct exofs_sb_info *sbi;
+ struct exofs_fscb *fscb;
+ struct osd_request *or;
+ struct osd_obj_id obj;
+ int ret;
+
+ fscb = kzalloc(sizeof(struct exofs_fscb), GFP_KERNEL);
+ if (!fscb) {
+ EXOFS_ERR("exofs_write_super: memory allocation failed.\n");
+ return;
+ }
+
+ lock_kernel();
+ sbi = sb->s_fs_info;
+ fscb->s_nextid = cpu_to_le64(sbi->s_nextid);
+ fscb->s_numfiles = cpu_to_le32(sbi->s_numfiles);
+ fscb->s_magic = cpu_to_le16(sb->s_magic);
+ fscb->s_newfs = 0;
+
+ or = osd_start_request(sbi->s_dev, GFP_KERNEL);
+ if (unlikely(!or)) {
+ EXOFS_ERR("exofs_write_super: osd_start_request failed.\n");
+ goto out;
+ }
+
+ obj.partition = sbi->s_pid;
+ obj.id = EXOFS_SUPER_ID;
+ ret = osd_req_write_kern(or, &obj, 0, fscb, sizeof(*fscb));
+ if (unlikely(ret)) {
+ EXOFS_ERR("exofs_write_super: osd_req_write_kern failed.\n");
+ goto out;
+ }
+
+ ret = exofs_sync_op(or, sbi->s_timeout, sbi->s_cred);
+ if (unlikely(ret)) {
+ EXOFS_ERR("exofs_write_super: exofs_sync_op failed.\n");
+ goto out;
+ }
+ sb->s_dirt = 0;
+
+out:
+ if (or)
+ osd_end_request(or);
+ unlock_kernel();
+ kfree(fscb);
+}
+
+/*
+ * This function is called when the vfs is freeing the superblock. We just
+ * need to free our own part.
+ */
+static void exofs_put_super(struct super_block *sb)
+{
+ int num_pend;
+ struct exofs_sb_info *sbi = sb->s_fs_info;
+
+ /* make sure there are no pending commands */
+ for (num_pend = atomic_read(&sbi->s_curr_pending); num_pend > 0;
+ num_pend = atomic_read(&sbi->s_curr_pending)) {
+ wait_queue_head_t wq;
+ init_waitqueue_head(&wq);
+ wait_event_timeout(wq,
+ (atomic_read(&sbi->s_curr_pending) == 0),
+ msecs_to_jiffies(100));
+ }
+
+ osduld_put_device(sbi->s_dev);
+ kfree(sb->s_fs_info);
+ sb->s_fs_info = NULL;
+}
+
+/*
+ * Read the superblock from the OSD and fill in the fields
+ */
+static int exofs_fill_super(struct super_block *sb, void *data, int silent)
+{
+ struct inode *root;
+ struct exofs_mountopt *opts = data;
+ struct exofs_sb_info *sbi; /*extended info */
+ struct exofs_fscb fscb; /*on-disk superblock info */
+ struct osd_request *or = NULL;
+ struct osd_obj_id obj;
+ int ret;
+
+ sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);
+ if (!sbi)
+ return -ENOMEM;
+ sb->s_fs_info = sbi;
+
+ /* use mount options to fill superblock */
+ sbi->s_dev = osduld_path_lookup(opts->dev_name);
+ if (IS_ERR(sbi->s_dev)) {
+ ret = PTR_ERR(sbi->s_dev);
+ sbi->s_dev = NULL;
+ goto free_sbi;
+ }
+
+ sbi->s_pid = opts->pid;
+ sbi->s_timeout = opts->timeout;
+
+ /* fill in some other data by hand */
+ memset(sb->s_id, 0, sizeof(sb->s_id));
+ strcpy(sb->s_id, "exofs");
+ sb->s_blocksize = EXOFS_BLKSIZE;
+ sb->s_blocksize_bits = EXOFS_BLKSHIFT;
+ atomic_set(&sbi->s_curr_pending, 0);
+ sb->s_bdev = NULL;
+ sb->s_dev = 0;
+
+ /* read data from on-disk superblock object */
+ obj.partition = sbi->s_pid;
+ obj.id = EXOFS_SUPER_ID;
+ exofs_make_credential(sbi->s_cred, &obj);
+
+ or = osd_start_request(sbi->s_dev, GFP_KERNEL);
+ if (unlikely(!or)) {
+ if (!silent)
+ EXOFS_ERR(
+ "exofs_fill_super: osd_start_request failed.\n");
+ ret = -ENOMEM;
+ goto free_sbi;
+ }
+ ret = osd_req_read_kern(or, &obj, 0, &fscb, sizeof(fscb));
+ if (unlikely(ret)) {
+ if (!silent)
+ EXOFS_ERR(
+ "exofs_fill_super: osd_req_read_kern failed.\n");
+ ret = -ENOMEM;
+ goto free_sbi;
+ }
+
+ ret = exofs_sync_op(or, sbi->s_timeout, sbi->s_cred);
+ if (unlikely(ret)) {
+ if (!silent)
+ EXOFS_ERR("exofs_fill_super: exofs_sync_op failed.\n");
+ ret = -EIO;
+ goto free_sbi;
+ }
+
+ sb->s_magic = le16_to_cpu(fscb.s_magic);
+ sbi->s_nextid = le64_to_cpu(fscb.s_nextid);
+ sbi->s_numfiles = le32_to_cpu(fscb.s_numfiles);
+
+ /* make sure what we read from the object store is correct */
+ if (sb->s_magic != EXOFS_SUPER_MAGIC) {
+ if (!silent)
+ EXOFS_ERR("ERROR: Bad magic value\n");
+ ret = -EINVAL;
+ goto free_sbi;
+ }
+
+ /* start generation numbers from a random point */
+ get_random_bytes(&sbi->s_next_generation, sizeof(u32));
+ spin_lock_init(&sbi->s_next_gen_lock);
+
+ /* set up operation vectors */
+ sb->s_op = &exofs_sops;
+ root = exofs_iget(sb, EXOFS_ROOT_ID - EXOFS_OBJ_OFF);
+ if (IS_ERR(root)) {
+ EXOFS_ERR("ERROR: exofs_iget failed\n");
+ ret = PTR_ERR(root);
+ goto free_sbi;
+ }
+ sb->s_root = d_alloc_root(root);
+ if (!sb->s_root) {
+ iput(root);
+ EXOFS_ERR("ERROR: get root inode failed\n");
+ ret = -ENOMEM;
+ goto free_sbi;
+ }
+
+ if (!S_ISDIR(root->i_mode)) {
+ dput(sb->s_root);
+ sb->s_root = NULL;
+ EXOFS_ERR("ERROR: corrupt root inode (mode = %hd)\n",
+ root->i_mode);
+ ret = -EINVAL;
+ goto free_sbi;
+ }
+
+ ret = 0;
+out:
+ if (or)
+ osd_end_request(or);
+ return ret;
+
+free_sbi:
+ osduld_put_device(sbi->s_dev); /* NULL safe */
+ kfree(sbi);
+ goto out;
+}
+
+/*
+ * Set up the superblock (calls exofs_fill_super eventually)
+ */
+static int exofs_get_sb(struct file_system_type *type,
+ int flags, const char *dev_name,
+ void *data, struct vfsmount *mnt)
+{
+ struct exofs_mountopt opts;
+ int ret;
+
+ ret = parse_options(data, &opts);
+ if (ret)
+ return ret;
+
+ opts.dev_name = dev_name;
+ return get_sb_nodev(type, flags, &opts, exofs_fill_super, mnt);
+}
+
+/*
+ * Return information about the file system state in the buffer. This is used
+ * by the 'df' command, for example.
+ */
+static int exofs_statfs(struct dentry *dentry, struct kstatfs *buf)
+{
+ struct super_block *sb = dentry->d_sb;
+ struct exofs_sb_info *sbi = sb->s_fs_info;
+ struct osd_obj_id obj = {sbi->s_pid, 0};
+ struct osd_attr attrs[] = {
+ ATTR_DEF(OSD_APAGE_PARTITION_QUOTAS,
+ OSD_ATTR_PQ_CAPACITY_QUOTA, sizeof(__be64)),
+ ATTR_DEF(OSD_APAGE_PARTITION_INFORMATION,
+ OSD_ATTR_PI_USED_CAPACITY, sizeof(__be64)),
+ };
+ uint64_t capacity = ~0;
+ uint64_t used = ~0;
+ struct osd_request *or;
+ uint8_t cred_a[OSD_CAP_LEN];
+ int ret;
+
+ /* get used/capacity attributes */
+ exofs_make_credential(cred_a, &obj);
+
+ or = osd_start_request(sbi->s_dev, GFP_KERNEL);
+ if (unlikely(!or)) {
+ EXOFS_DBGMSG("exofs_statfs: osd_start_request failed.\n");
+ return -ENOMEM;
+ }
+
+ osd_req_get_attributes(or, &obj);
+ osd_req_add_get_attr_list(or, attrs, ARRAY_SIZE(attrs));
+ ret = exofs_sync_op(or, sbi->s_timeout, cred_a);
+ if (unlikely(ret))
+ goto out;
+
+ ret = extract_attr_from_req(or, &attrs[0]);
+ if (likely(!ret))
+ capacity = get_unaligned_be64(attrs[0].val_ptr);
+ else
+ EXOFS_DBGMSG("exofs_statfs: get capacity failed.\n");
+
+ ret = extract_attr_from_req(or, &attrs[1]);
+ if (likely(!ret))
+ used = get_unaligned_be64(attrs[1].val_ptr);
+ else
+ EXOFS_DBGMSG("exofs_statfs: get used-space failed.\n");
+
+ /* fill in the stats buffer */
+ buf->f_type = EXOFS_SUPER_MAGIC;
+ buf->f_bsize = EXOFS_BLKSIZE;
+ buf->f_blocks = (capacity >> EXOFS_BLKSHIFT);
+ buf->f_bfree = ((capacity - used) >> EXOFS_BLKSHIFT);
+ buf->f_bavail = buf->f_bfree;
+ buf->f_files = sbi->s_numfiles;
+ buf->f_ffree = EXOFS_MAX_ID - sbi->s_numfiles;
+ buf->f_namelen = EXOFS_NAME_LEN;
+
+out:
+ osd_end_request(or);
+ return ret;
+}
+
+const struct super_operations exofs_sops = {
+ .alloc_inode = exofs_alloc_inode,
+ .destroy_inode = exofs_destroy_inode,
+ .write_inode = exofs_write_inode,
+ .delete_inode = exofs_delete_inode,
+ .put_super = exofs_put_super,
+ .write_super = exofs_write_super,
+ .statfs = exofs_statfs,
+};
+
+/******************************************************************************
+ * INSMOD/RMMOD
+ *****************************************************************************/
+
+/*
+ * struct that describes this file system
+ */
+static struct file_system_type exofs_type = {
+ .owner = THIS_MODULE,
+ .name = "exofs",
+ .get_sb = exofs_get_sb,
+ .kill_sb = generic_shutdown_super,
+};
+
+static int __init init_exofs(void)
+{
+ int err;
+
+ err = init_inodecache();
+ if (err)
+ goto out;
+
+ err = register_filesystem(&exofs_type);
+ if (err)
+ goto out_d;
+
+ return 0;
+out_d:
+ destroy_inodecache();
+out:
+ return err;
+}
+
+static void __exit exit_exofs(void)
+{
+ unregister_filesystem(&exofs_type);
+ destroy_inodecache();
+}
+
+MODULE_AUTHOR("Avishay Traeger <[email protected]>");
+MODULE_DESCRIPTION("exofs");
+MODULE_LICENSE("GPL");
+
+module_init(init_exofs)
+module_exit(exit_exofs)
--
1.6.2.1

2009-03-18 18:12:46

by Boaz Harrosh

[permalink] [raw]
Subject: [PATCH 7/8] exofs: Documentation

Added some documentation in exofs.txt, as well as a BUGS file.

For further reading, operation instructions, example scripts
and up to date infomation and code please see:
http://open-osd.org

Signed-off-by: Boaz Harrosh <[email protected]>
---
Documentation/filesystems/exofs.txt | 176 +++++++++++++++++++++++++++++++++++
fs/exofs/BUGS | 3 +
2 files changed, 179 insertions(+), 0 deletions(-)
create mode 100644 Documentation/filesystems/exofs.txt
create mode 100644 fs/exofs/BUGS

diff --git a/Documentation/filesystems/exofs.txt b/Documentation/filesystems/exofs.txt
new file mode 100644
index 0000000..0ced74c
--- /dev/null
+++ b/Documentation/filesystems/exofs.txt
@@ -0,0 +1,176 @@
+===============================================================================
+WHAT IS EXOFS?
+===============================================================================
+
+exofs is a file system that uses an OSD and exports the API of a normal Linux
+file system. Users access exofs like any other local file system, and exofs
+will in turn issue commands to the local OSD initiator.
+
+OSD is a new T10 command set that views storage devices not as a large/flat
+array of sectors but as a container of objects, each having a length, quota,
+time attributes and more. Each object is addressed by a 64bit ID, and is
+contained in a 64bit ID partition. Each object has associated attributes
+attached to it, which are integral part of the object and provide metadata about
+the object. The standard defines some common obligatory attributes, but user
+attributes can be added as needed.
+
+===============================================================================
+ENVIRONMENT
+===============================================================================
+
+To use this file system, you need to have an object store to run it on. You
+may download a target from:
+http://open-osd.org
+
+See Documentation/scsi/osd.txt for how to setup a working osd environment.
+
+===============================================================================
+USAGE
+===============================================================================
+
+1. Download and compile exofs and open-osd initiator:
+ You need an external Kernel source tree or kernel headers from your
+ distribution. (anything based on 2.6.26 or later).
+
+ a. download open-osd including exofs source using:
+ [parent-directory]$ git clone git://git.open-osd.org/open-osd.git
+
+ b. Build the library module like this:
+ [parent-directory]$ make -C KSRC=$(KER_DIR) open-osd
+
+ This will build both the open-osd initiator as well as the exofs kernel
+ module. Use whatever parameters you compiled your Kernel with and
+ $(KER_DIR) above pointing to the Kernel you compile against. See the file
+ open-osd/top-level-Makefile for an example.
+
+2. Get the OSD initiator and target set up properly, and login to the target.
+ See Documentation/scsi/osd.txt for farther instructions. Also see ./do-osd
+ for example script that does all these steps.
+
+3. Insmod the exofs.ko module:
+ [exofs]$ insmod exofs.ko
+
+4. Make sure the directory where you want to mount exists. If not, create it.
+ (For example, mkdir /mnt/exofs)
+
+5. At first run you will need to invoke the mkfs.exofs application
+
+ As an example, this will create the file system on:
+ /dev/osd0 partition ID 65536
+
+ mkfs.exofs --pid=65536 --format /dev/osd0
+
+ The --format is optional if not specified no OSD_FORMAT will be
+ preformed and a clean file system will be created in the specified pid,
+ in the available space of the target. (Use --format=size_in_meg to limit
+ the total LUN space available)
+
+ If pid already exist it will be deleted and a new one will be created in it's
+ place. Be careful.
+
+ An exofs lives inside a single OSD partition. You can create multiple exofs
+ filesystems on the same device using multiple pids.
+
+ (run mkfs.exofs without any parameters for usage help message)
+
+6. Mount the file system.
+
+ For example, to mount /dev/osd0, partition ID 0x10000 on /mnt/exofs:
+
+ mount -t exofs -o pid=65536 /dev/osd0 /mnt/exofs/
+
+7. For reference (See do-exofs example script):
+ do-exofs start - an example of how to perform the above steps.
+ do-exofs stop - an example of how to unmount the file system.
+ do-exofs format - an example of how to format and mkfs a new exofs.
+
+8. Extra compilation flags (uncomment in fs/exofs/Kbuild):
+ CONFIG_EXOFS_DEBUG - for debug messages and extra checks.
+
+===============================================================================
+exofs mount options
+===============================================================================
+Similar to any mount command:
+ mount -t exofs -o exofs_options /dev/osdX mount_exofs_directory
+
+Where:
+ -t exofs: specifies the exofs file system
+
+ /dev/osdX: X is a decimal number. /dev/osdX was created after a successful
+ login into an OSD target.
+
+ mount_exofs_directory: The directory to mount the file system on
+
+ exofs specific options: Options are separated by commas (,)
+ pid=<integer> - The partition number to mount/create as
+ container of the filesystem.
+ This option is mandatory
+ to=<integer> - Timeout in ticks for a single command
+ default is (60 * HZ) [for debugging only]
+
+===============================================================================
+DESIGN
+===============================================================================
+
+* The file system control block (AKA on-disk superblock) resides in an object
+ with a special ID (defined in common.h).
+ Information included in the file system control block is used to fill the
+ in-memory superblock structure at mount time. This object is created before
+ the file system is used by mkexofs.c It contains information such as:
+ - The file system's magic number
+ - The next inode number to be allocated
+
+* Each file resides in its own object and contains the data (and it will be
+ possible to extend the file over multiple objects, though this has not been
+ implemented yet).
+
+* A directory is treated as a file, and essentially contains a list of <file
+ name, inode #> pairs for files that are found in that directory. The object
+ IDs correspond to the files' inode numbers and will be allocated according to
+ a bitmap (stored in a separate object). Now they are allocated using a
+ counter.
+
+* Each file's control block (AKA on-disk inode) is stored in its object's
+ attributes. This applies to both regular files and other types (directories,
+ device files, symlinks, etc.).
+
+* Credentials are generated per object (inode and superblock) when they is
+ created in memory (read off disk or created). The credential works for all
+ operations and is used as long as the object remains in memory.
+
+* Async OSD operations are used whenever possible, but the target may execute
+ them out of order. The operations that concern us are create, delete,
+ readpage, writepage, update_inode, and truncate. The following pairs of
+ operations should execute in the order written, and we need to prevent them
+ from executing in reverse order:
+ - The following are handled with the OBJ_CREATED and OBJ_2BCREATED
+ flags. OBJ_CREATED is set when we know the object exists on the OSD -
+ in create's callback function, and when we successfully do a read_inode.
+ OBJ_2BCREATED is set in the beginning of the create function, so we
+ know that we should wait.
+ - create/delete: delete should wait until the object is created
+ on the OSD.
+ - create/readpage: readpage should be able to return a page
+ full of zeroes in this case. If there was a write already
+ en-route (i.e. create, writepage, readpage) then the page
+ would be locked, and so it would really be the same as
+ create/writepage.
+ - create/writepage: if writepage is called for a sync write, it
+ should wait until the object is created on the OSD.
+ Otherwise, it should just return.
+ - create/truncate: truncate should wait until the object is
+ created on the OSD.
+ - create/update_inode: update_inode should wait until the
+ object is created on the OSD.
+ - Handled by VFS locks:
+ - readpage/delete: shouldn't happen because of page lock.
+ - writepage/delete: shouldn't happen because of page lock.
+ - readpage/writepage: shouldn't happen because of page lock.
+
+===============================================================================
+LICENSE/COPYRIGHT
+===============================================================================
+The exofs file system is based on ext2 v0.5b (distributed with the Linux kernel
+version 2.6.10). All files include the original copyrights, and the license
+is GPL version 2 (only version 2, as is true for the Linux kernel). The
+Linux kernel can be downloaded from http://www.kernel.org.
diff --git a/fs/exofs/BUGS b/fs/exofs/BUGS
new file mode 100644
index 0000000..1b2d4c6
--- /dev/null
+++ b/fs/exofs/BUGS
@@ -0,0 +1,3 @@
+- Out-of-space may cause a severe problem if the object (and directory entry)
+ were written, but the inode attributes failed. Then if the filesystem was
+ unmounted and mounted the kernel can get into an endless loop doing a readdir.
--
1.6.2.1

2009-03-18 18:13:24

by Boaz Harrosh

[permalink] [raw]
Subject: [PATCH 8/8] fs: Add exofs to Kernel build

- Add exofs to fs/Kconfig under "menu 'Miscellaneous filesystems'"
- Add exofs to fs/Makefile

Signed-off-by: Boaz Harrosh <[email protected]>
---
fs/Kconfig | 2 ++
fs/Makefile | 1 +
2 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 93945dd..d0c544c 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -223,6 +223,8 @@ source "fs/romfs/Kconfig"
source "fs/sysv/Kconfig"
source "fs/ufs/Kconfig"

+source "fs/exofs/Kconfig"
+
endif # MISC_FILESYSTEMS

menuconfig NETWORK_FILESYSTEMS
diff --git a/fs/Makefile b/fs/Makefile
index dc20db3..f5f5ce7 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -124,3 +124,4 @@ obj-$(CONFIG_DEBUG_FS) += debugfs/
obj-$(CONFIG_OCFS2_FS) += ocfs2/
obj-$(CONFIG_BTRFS_FS) += btrfs/
obj-$(CONFIG_GFS2_FS) += gfs2/
+obj-$(CONFIG_EXOFS_FS) += exofs/
--
1.6.2.1

2009-03-21 13:26:22

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [PATCH 7/8] exofs: Documentation

Hi.

On Wed, Mar 18, 2009 at 08:10:58PM +0200, Boaz Harrosh ([email protected]) wrote:
> +++ b/fs/exofs/BUGS
> @@ -0,0 +1,3 @@
> +- Out-of-space may cause a severe problem if the object (and directory entry)
> + were written, but the inode attributes failed. Then if the filesystem was
> + unmounted and mounted the kernel can get into an endless loop doing a readdir.

Does it also mean that damaged media may end up freezing the machine
during the mount?

--
Evgeniy Polyakov

2009-03-22 08:45:33

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCH 7/8] exofs: Documentation

Evgeniy Polyakov wrote:
> Hi.
>
> On Wed, Mar 18, 2009 at 08:10:58PM +0200, Boaz Harrosh ([email protected]) wrote:
>> +++ b/fs/exofs/BUGS
>> @@ -0,0 +1,3 @@
>> +- Out-of-space may cause a severe problem if the object (and directory entry)
>> + were written, but the inode attributes failed. Then if the filesystem was
>> + unmounted and mounted the kernel can get into an endless loop doing a readdir.
>
> Does it also mean that damaged media may end up freezing the machine
> during the mount?
>

I had such situation and it was able to mount. Some data was lost.

But sure, if the damage was in a way like above it would. The bad
situation is when there is a directory entry, there is a corresponding
object, but there is an error reading the associated attribute. the
redir code does not expect to fail, independent from iget()

It is difficult for me to repeat this problem because I've changed
the osd-target I'm running with, and with the new target attributes
are stored an a DB, so the create-object fails with ENOSP long before
I'm no longer able to write attributes. But I'm sure one day this problem
will come to hunt me.

Thanks
Boaz

2009-03-22 10:22:47

by Marcin Slusarz

[permalink] [raw]
Subject: Re: [PATCH 4/8] exofs: address_space_operations

Boaz Harrosh wrote:
> (...)
> +struct page_collect {
> + struct exofs_sb_info *sbi;
> + struct request_queue *req_q;
> + struct inode *inode;
> + unsigned expected_pages;
> +
> + struct bio *bio;
> + unsigned nr_pages;
> + unsigned long length;
> + long pg_first;
> +};
> (...)
> +int pcol_try_alloc(struct page_collect *pcol)
> +{
> + int pages = min_t(unsigned, pcol->expected_pages, BIO_MAX_PAGES);
> +
> + for (; pages; pages >>= 1) {
> + pcol->bio = bio_alloc(GFP_KERNEL, pages);
> + if (likely(pcol->bio))
> + return 0;
> + }
> +
> + EXOFS_ERR("Failed to kcalloc expected_pages=%d\n",

%u

> + pcol->expected_pages);
> + return -ENOMEM;
> +}
> +
> (...)
> +static int __readpages_done(struct osd_request *or, struct page_collect *pcol,
> + bool do_unlock)
> +{
> + struct bio_vec *bvec;
> + int i;
> + u64 resid;
> + u64 good_bytes;
> + u64 length = 0;
> + int ret = exofs_check_ok_resid(or, &resid, NULL);
> +
> + osd_end_request(or);
> +
> + if (!ret)
> + good_bytes = pcol->length;
> + else if (ret && !resid)
> + good_bytes = 0;
> + else
> + good_bytes = pcol->length - resid;

Second ret check is not needed.

> (...)
> +
> +int read_exec(struct page_collect *pcol, bool is_sync)

read_exec is too generic name for globally visible symbol

> +{
> (...)
> +static void writepages_done(struct osd_request *or, void *p)
> +{
> + struct page_collect *pcol = p;
> + struct bio_vec *bvec;
> + int i;
> + u64 resid;
> + u64 good_bytes;
> + u64 length = 0;
> +
> + int ret = exofs_check_ok_resid(or, NULL, &resid);
> +
> + osd_end_request(or);
> + atomic_dec(&pcol->sbi->s_curr_pending);
> +
> + if (likely(!ret))
> + good_bytes = pcol->length;
> + else if (ret && !resid)
> + good_bytes = 0;
> + else
> + good_bytes = pcol->length - resid;

Ret check again.

> (...)
> +
> +int write_exec(struct page_collect *pcol)

Too generic name.

> (...)

2009-03-22 10:43:05

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCH 4/8] exofs: address_space_operations

Marcin Slusarz wrote:
> Boaz Harrosh wrote:
>> (...)
>> +struct page_collect {
>> + struct exofs_sb_info *sbi;
>> + struct request_queue *req_q;
>> + struct inode *inode;
>> + unsigned expected_pages;
>> +
>> + struct bio *bio;
>> + unsigned nr_pages;
>> + unsigned long length;
>> + long pg_first;
>> +};
>> (...)
>> +int pcol_try_alloc(struct page_collect *pcol)
>> +{
>> + int pages = min_t(unsigned, pcol->expected_pages, BIO_MAX_PAGES);
>> +
>> + for (; pages; pages >>= 1) {
>> + pcol->bio = bio_alloc(GFP_KERNEL, pages);
>> + if (likely(pcol->bio))
>> + return 0;
>> + }
>> +
>> + EXOFS_ERR("Failed to kcalloc expected_pages=%d\n",
>
> %u
>
>> + pcol->expected_pages);
>> + return -ENOMEM;
>> +}
>> +
>> (...)
>> +static int __readpages_done(struct osd_request *or, struct page_collect *pcol,
>> + bool do_unlock)
>> +{
>> + struct bio_vec *bvec;
>> + int i;
>> + u64 resid;
>> + u64 good_bytes;
>> + u64 length = 0;
>> + int ret = exofs_check_ok_resid(or, &resid, NULL);
>> +
>> + osd_end_request(or);
>> +
>> + if (!ret)
>> + good_bytes = pcol->length;
>> + else if (ret && !resid)
>> + good_bytes = 0;
>> + else
>> + good_bytes = pcol->length - resid;
>
> Second ret check is not needed.
>
>> (...)
>> +
>> +int read_exec(struct page_collect *pcol, bool is_sync)
>
> read_exec is too generic name for globally visible symbol
>
>> +{
>> (...)
>> +static void writepages_done(struct osd_request *or, void *p)
>> +{
>> + struct page_collect *pcol = p;
>> + struct bio_vec *bvec;
>> + int i;
>> + u64 resid;
>> + u64 good_bytes;
>> + u64 length = 0;
>> +
>> + int ret = exofs_check_ok_resid(or, NULL, &resid);
>> +
>> + osd_end_request(or);
>> + atomic_dec(&pcol->sbi->s_curr_pending);
>> +
>> + if (likely(!ret))
>> + good_bytes = pcol->length;
>> + else if (ret && !resid)
>> + good_bytes = 0;
>> + else
>> + good_bytes = pcol->length - resid;
>
> Ret check again.
>
>> (...)
>> +
>> +int write_exec(struct page_collect *pcol)
>
> Too generic name.
>
>> (...)
>

Right on all accounts. Thanks
will repost

Boaz

2009-03-22 14:01:34

by Boaz Harrosh

[permalink] [raw]
Subject: [PATCH 4/8 ver5] exofs: address_space_operations


OK Now we start to read and write from osd-objects. We try to
collect at most contiguous pages as possible in a single write/read.
The first page index is the object's offset.

TODO:
In 64-bit a single bio can carry at most 128 pages.
Add support of chaining multiple bios

Signed-off-by: Boaz Harrosh <[email protected]>
---
fs/exofs/exofs.h | 6 +
fs/exofs/inode.c | 691 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 697 insertions(+), 0 deletions(-)

diff --git a/fs/exofs/exofs.h b/fs/exofs/exofs.h
index d3b8bde..f30de6e 100644
--- a/fs/exofs/exofs.h
+++ b/fs/exofs/exofs.h
@@ -130,6 +130,9 @@ static inline struct exofs_i_info *exofs_i(struct inode *inode)
/* inode.c */
void exofs_truncate(struct inode *inode);
int exofs_setattr(struct dentry *, struct iattr *);
+int exofs_write_begin(struct file *file, struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned flags,
+ struct page **pagep, void **fsdata);

/*********************
* operation vectors *
@@ -138,6 +141,9 @@ int exofs_setattr(struct dentry *, struct iattr *);
extern const struct inode_operations exofs_file_inode_operations;
extern const struct file_operations exofs_file_operations;

+/* inode.c */
+extern const struct address_space_operations exofs_aops;
+
/* symlink.c */
extern const struct inode_operations exofs_symlink_inode_operations;
extern const struct inode_operations exofs_fast_symlink_inode_operations;
diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
index b0bda1e..ab7b56e 100644
--- a/fs/exofs/inode.c
+++ b/fs/exofs/inode.c
@@ -35,6 +35,7 @@

#include <linux/writeback.h>
#include <linux/buffer_head.h>
+#include <scsi/scsi_device.h>

#include "exofs.h"

@@ -42,6 +43,696 @@
# define EXOFS_DEBUG_OBJ_ISIZE 1
#endif

+struct page_collect {
+ struct exofs_sb_info *sbi;
+ struct request_queue *req_q;
+ struct inode *inode;
+ unsigned expected_pages;
+
+ struct bio *bio;
+ unsigned nr_pages;
+ unsigned long length;
+ long pg_first;
+};
+
+static void _pcol_init(struct page_collect *pcol, unsigned expected_pages,
+ struct inode *inode)
+{
+ struct exofs_sb_info *sbi = inode->i_sb->s_fs_info;
+ struct request_queue *req_q = sbi->s_dev->scsi_device->request_queue;
+
+ pcol->sbi = sbi;
+ pcol->req_q = req_q;
+ pcol->inode = inode;
+ pcol->expected_pages = expected_pages;
+
+ pcol->bio = NULL;
+ pcol->nr_pages = 0;
+ pcol->length = 0;
+ pcol->pg_first = -1;
+
+ EXOFS_DBGMSG("_pcol_init ino=0x%lx expected_pages=%u\n", inode->i_ino,
+ expected_pages);
+}
+
+static void _pcol_reset(struct page_collect *pcol)
+{
+ pcol->expected_pages -= min(pcol->nr_pages, pcol->expected_pages);
+
+ pcol->bio = NULL;
+ pcol->nr_pages = 0;
+ pcol->length = 0;
+ pcol->pg_first = -1;
+ EXOFS_DBGMSG("_pcol_reset ino=0x%lx expected_pages=%u\n",
+ pcol->inode->i_ino, pcol->expected_pages);
+
+ /* this is probably the end of the loop but in writes
+ * it might not end here. don't be left with nothing
+ */
+ if (!pcol->expected_pages)
+ pcol->expected_pages = 128;
+}
+
+static int pcol_try_alloc(struct page_collect *pcol)
+{
+ int pages = min_t(unsigned, pcol->expected_pages, BIO_MAX_PAGES);
+
+ for (; pages; pages >>= 1) {
+ pcol->bio = bio_alloc(GFP_KERNEL, pages);
+ if (likely(pcol->bio))
+ return 0;
+ }
+
+ EXOFS_ERR("Failed to kcalloc expected_pages=%u\n",
+ pcol->expected_pages);
+ return -ENOMEM;
+}
+
+static void pcol_free(struct page_collect *pcol)
+{
+ bio_put(pcol->bio);
+ pcol->bio = NULL;
+}
+
+static int pcol_add_page(struct page_collect *pcol, struct page *page,
+ unsigned len)
+{
+ int added_len = bio_add_pc_page(pcol->req_q, pcol->bio, page, len, 0);
+ if (unlikely(len != added_len))
+ return -ENOMEM;
+
+ ++pcol->nr_pages;
+ pcol->length += len;
+ return 0;
+}
+
+static int update_read_page(struct page *page, int ret)
+{
+ if (ret == 0) {
+ /* Everything is OK */
+ SetPageUptodate(page);
+ if (PageError(page))
+ ClearPageError(page);
+ } else if (ret == -EFAULT) {
+ /* In this case we were trying to read something that wasn't on
+ * disk yet - return a page full of zeroes. This should be OK,
+ * because the object should be empty (if there was a write
+ * before this read, the read would be waiting with the page
+ * locked */
+ clear_highpage(page);
+
+ SetPageUptodate(page);
+ if (PageError(page))
+ ClearPageError(page);
+ ret = 0; /* recovered error */
+ } else /* Error */
+ SetPageError(page);
+
+ return ret;
+}
+
+static void update_write_page(struct page *page, int ret)
+{
+ if (ret) {
+ mapping_set_error(page->mapping, ret);
+ SetPageError(page);
+ }
+ end_page_writeback(page);
+}
+
+static int _readpage(struct page *page, bool is_sync);
+
+static int __readpages_done(struct osd_request *or, struct page_collect *pcol,
+ bool do_unlock)
+{
+ struct bio_vec *bvec;
+ int i;
+ u64 resid;
+ u64 good_bytes;
+ u64 length = 0;
+ int ret = exofs_check_ok_resid(or, &resid, NULL);
+
+ osd_end_request(or);
+
+ if (likely(!ret))
+ good_bytes = pcol->length;
+ else if (!resid)
+ good_bytes = 0;
+ else
+ good_bytes = pcol->length - resid;
+
+ EXOFS_DBGMSG("readpages_done(%ld) good_bytes=%llx"
+ " length=%zx nr_pages=%u\n",
+ pcol->inode->i_ino, _LLU(good_bytes), pcol->length,
+ pcol->nr_pages);
+
+ __bio_for_each_segment(bvec, pcol->bio, i, 0) {
+ struct page *page = bvec->bv_page;
+ struct inode *inode = page->mapping->host;
+
+ if (inode != pcol->inode)
+ continue; /* osd might add more pages at end */
+
+ if ((length < good_bytes) || (i == 0)) {
+ ret = update_read_page(page, (i == 0) ? ret : 0);
+ if (do_unlock)
+ unlock_page(page);
+ EXOFS_DBGMSG(" readpages_done(%ld, %ld)\n",
+ inode->i_ino, page->index);
+ } else {
+ /* can not happen on single sync_readpage */
+ BUG_ON(!do_unlock);
+
+ /* try a single page read and only then it is
+ * marked as SetPageError()
+ */
+ EXOFS_ERR(" readpages_done(%ld, %ld) bad_bytes\n",
+ inode->i_ino, page->index);
+ _readpage(page, false);
+ }
+
+ length += bvec->bv_len;
+ }
+
+ pcol_free(pcol);
+ EXOFS_DBGMSG("readpages_done END\n");
+ return ret;
+}
+
+static void readpages_done(struct osd_request *or, void *p)
+{
+ struct page_collect *pcol = p;
+
+ __readpages_done(or, pcol, true);
+ atomic_dec(&pcol->sbi->s_curr_pending);
+ kfree(p);
+}
+
+static void _unlock_pcol_pages(struct page_collect *pcol, int ret, int rw)
+{
+ struct bio_vec *bvec;
+ int i;
+
+ __bio_for_each_segment(bvec, pcol->bio, i, 0) {
+ struct page *page = bvec->bv_page;
+
+ if (rw == READ)
+ update_read_page(page, ret);
+ else
+ update_write_page(page, ret);
+
+ unlock_page(page);
+ }
+ pcol_free(pcol);
+}
+
+static int read_exec(struct page_collect *pcol, bool is_sync)
+{
+ struct exofs_i_info *oi = exofs_i(pcol->inode);
+ struct osd_obj_id obj = {pcol->sbi->s_pid,
+ pcol->inode->i_ino + EXOFS_OBJ_OFF};
+ struct osd_request *or = NULL;
+ struct page_collect *pcol_copy = NULL;
+ loff_t i_start = pcol->pg_first << PAGE_CACHE_SHIFT;
+ int ret;
+
+ if (!pcol->bio)
+ return 0;
+
+ /* see comment in _readpage() about sync reads */
+ WARN_ON(is_sync && (pcol->nr_pages != 1));
+
+ or = osd_start_request(pcol->sbi->s_dev, GFP_KERNEL);
+ if (unlikely(!or)) {
+ ret = -ENOMEM;
+ goto err;
+ }
+
+ osd_req_read(or, &obj, pcol->bio, i_start);
+
+ if (is_sync) {
+ exofs_sync_op(or, pcol->sbi->s_timeout, oi->i_cred);
+ return __readpages_done(or, pcol, false);
+ }
+
+ pcol_copy = kmalloc(sizeof(*pcol_copy), GFP_KERNEL);
+ if (!pcol_copy) {
+ ret = -ENOMEM;
+ goto err;
+ }
+
+ *pcol_copy = *pcol;
+ ret = exofs_async_op(or, readpages_done, pcol_copy, oi->i_cred);
+ if (unlikely(ret))
+ goto err;
+
+ atomic_inc(&pcol->sbi->s_curr_pending);
+
+ EXOFS_DBGMSG("read_exec obj=%llx start=%llx length=%zx\n",
+ obj.id, _LLU(i_start), pcol->length);
+
+ /* pages ownership was passed to pcol_copy */
+ _pcol_reset(pcol);
+ return 0;
+
+err:
+ if (!is_sync)
+ _unlock_pcol_pages(pcol, ret, READ);
+ kfree(pcol_copy);
+ if (or)
+ osd_end_request(or);
+ return ret;
+}
+
+static int readpage_strip(void *data, struct page *page)
+{
+ struct page_collect *pcol = data;
+ struct inode *inode = pcol->inode;
+ struct exofs_i_info *oi = exofs_i(inode);
+ loff_t i_size = i_size_read(inode);
+ pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
+ size_t len;
+ int ret;
+
+ /* FIXME: Just for debugging, will be removed */
+ if (PageUptodate(page))
+ EXOFS_ERR("PageUptodate(%ld, %ld)\n", pcol->inode->i_ino,
+ page->index);
+
+ if (page->index < end_index)
+ len = PAGE_CACHE_SIZE;
+ else if (page->index == end_index)
+ len = i_size & ~PAGE_CACHE_MASK;
+ else
+ len = 0;
+
+ if (!len || !obj_created(oi)) {
+ /* this will be out of bounds, or doesn't exist yet.
+ * Current page is cleared and the request is split
+ */
+ clear_highpage(page);
+
+ SetPageUptodate(page);
+ if (PageError(page))
+ ClearPageError(page);
+
+ unlock_page(page);
+ EXOFS_DBGMSG("readpage_strip(%ld, %ld) empty page, splitting\n",
+ inode->i_ino, page->index);
+
+ return read_exec(pcol, false);
+ }
+
+try_again:
+
+ if (unlikely(pcol->pg_first == -1)) {
+ pcol->pg_first = page->index;
+ } else if (unlikely((pcol->pg_first + pcol->nr_pages) !=
+ page->index)) {
+ /* Discontinuity detected, split the request */
+ ret = read_exec(pcol, false);
+ if (unlikely(ret))
+ goto fail;
+ goto try_again;
+ }
+
+ if (!pcol->bio) {
+ ret = pcol_try_alloc(pcol);
+ if (unlikely(ret))
+ goto fail;
+ }
+
+ if (len != PAGE_CACHE_SIZE)
+ zero_user(page, len, PAGE_CACHE_SIZE - len);
+
+ EXOFS_DBGMSG(" readpage_strip(%ld, %ld) len=%zx\n", inode->i_ino,
+ page->index, len);
+
+ ret = pcol_add_page(pcol, page, len);
+ if (ret) {
+ EXOFS_DBGMSG("Failed pcol_add_page pages[i]=%p "
+ "len=%zx nr_pages=%u length=%zx\n",
+ page, len, pcol->nr_pages, pcol->length);
+
+ /* split the request, and start again with current page */
+ ret = read_exec(pcol, false);
+ if (unlikely(ret))
+ goto fail;
+
+ goto try_again;
+ }
+
+ return 0;
+
+fail:
+ /* SetPageError(page); ??? */
+ unlock_page(page);
+ return ret;
+}
+
+static int exofs_readpages(struct file *file, struct address_space *mapping,
+ struct list_head *pages, unsigned nr_pages)
+{
+ struct page_collect pcol;
+ int ret;
+
+ _pcol_init(&pcol, nr_pages, mapping->host);
+
+ ret = read_cache_pages(mapping, pages, readpage_strip, &pcol);
+ if (ret) {
+ EXOFS_ERR("read_cache_pages => %d\n", ret);
+ return ret;
+ }
+
+ return read_exec(&pcol, false);
+}
+
+static int _readpage(struct page *page, bool is_sync)
+{
+ struct page_collect pcol;
+ int ret;
+
+ _pcol_init(&pcol, 1, page->mapping->host);
+
+ /* readpage_strip might call read_exec(,async) inside at several places
+ * but this is safe for is_async=0 since read_exec will not do anything
+ * when we have a single page.
+ */
+ ret = readpage_strip(&pcol, page);
+ if (ret) {
+ EXOFS_ERR("_readpage => %d\n", ret);
+ return ret;
+ }
+
+ return read_exec(&pcol, is_sync);
+}
+
+/*
+ * We don't need the file
+ */
+static int exofs_readpage(struct file *file, struct page *page)
+{
+ return _readpage(page, false);
+}
+
+static int exofs_writepage(struct page *page, struct writeback_control *wbc2);
+
+static void writepages_done(struct osd_request *or, void *p)
+{
+ struct page_collect *pcol = p;
+ struct bio_vec *bvec;
+ int i;
+ u64 resid;
+ u64 good_bytes;
+ u64 length = 0;
+
+ int ret = exofs_check_ok_resid(or, NULL, &resid);
+
+ osd_end_request(or);
+ atomic_dec(&pcol->sbi->s_curr_pending);
+
+ if (likely(!ret))
+ good_bytes = pcol->length;
+ else if (!resid)
+ good_bytes = 0;
+ else
+ good_bytes = pcol->length - resid;
+
+ EXOFS_DBGMSG("writepages_done(%lx) good_bytes=%llx"
+ " length=%zx nr_pages=%u\n",
+ pcol->inode->i_ino, _LLU(good_bytes), pcol->length,
+ pcol->nr_pages);
+
+ __bio_for_each_segment(bvec, pcol->bio, i, 0) {
+ struct page *page = bvec->bv_page;
+ struct inode *inode = page->mapping->host;
+
+ if (inode != pcol->inode)
+ continue; /* osd might add more pages to a bio */
+
+ if ((length < good_bytes) || (i == 0)) {
+ update_write_page(page, ret);
+ unlock_page(page);
+ EXOFS_DBGMSG(" writepages_done(%lx, %lx)"
+ " good_bytes ret=%x\n",
+ inode->i_ino, page->index, ret);
+ } else {
+ /* try a single page write and only then it is
+ * marked as SetPageError()
+ */
+ EXOFS_ERR(" writepages_done(%lx, %lx) bad_bytes\n",
+ inode->i_ino, page->index);
+
+ exofs_writepage(page, NULL);
+ }
+
+ length += bvec->bv_len;
+ }
+
+ pcol_free(pcol);
+ kfree(pcol);
+ EXOFS_DBGMSG("writepages_done END\n");
+}
+
+static int write_exec(struct page_collect *pcol)
+{
+ struct exofs_i_info *oi = exofs_i(pcol->inode);
+ struct osd_obj_id obj = {pcol->sbi->s_pid,
+ pcol->inode->i_ino + EXOFS_OBJ_OFF};
+ struct osd_request *or = NULL;
+ struct page_collect *pcol_copy = NULL;
+ loff_t i_start = pcol->pg_first << PAGE_CACHE_SHIFT;
+ int ret;
+
+ if (!pcol->bio)
+ return 0;
+
+ or = osd_start_request(pcol->sbi->s_dev, GFP_KERNEL);
+ if (unlikely(!or)) {
+ EXOFS_ERR("write_exec: Faild to osd_start_request()\n");
+ ret = -ENOMEM;
+ goto err;
+ }
+
+ pcol_copy = kmalloc(sizeof(*pcol_copy), GFP_KERNEL);
+ if (!pcol_copy) {
+ EXOFS_ERR("write_exec: Faild to kmalloc(pcol)\n");
+ ret = -ENOMEM;
+ goto err;
+ }
+
+ *pcol_copy = *pcol;
+
+ osd_req_write(or, &obj, pcol_copy->bio, i_start);
+ ret = exofs_async_op(or, writepages_done, pcol_copy, oi->i_cred);
+ if (unlikely(ret)) {
+ EXOFS_ERR("write_exec: exofs_async_op() Faild\n");
+ goto err;
+ }
+
+ atomic_inc(&pcol->sbi->s_curr_pending);
+ EXOFS_DBGMSG("write_exec(%lx, %lx) start=%llx length=%zx\n",
+ pcol->inode->i_ino, pcol->pg_first, _LLU(i_start),
+ pcol->length);
+ /* pages ownership was passed to pcol_copy */
+ _pcol_reset(pcol);
+ return 0;
+
+err:
+ _unlock_pcol_pages(pcol, ret, WRITE);
+ kfree(pcol_copy);
+ if (or)
+ osd_end_request(or);
+ return ret;
+}
+
+static int writepage_strip(struct page *page,
+ struct writeback_control *wbc_unused, void *data)
+{
+ struct page_collect *pcol = data;
+ struct inode *inode = pcol->inode;
+ struct exofs_i_info *oi = exofs_i(inode);
+ loff_t i_size = i_size_read(inode);
+ pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
+ size_t len;
+ int ret;
+
+ BUG_ON(!PageLocked(page));
+
+ ret = wait_obj_created(oi);
+ if (unlikely(ret))
+ goto fail;
+
+ if (page->index < end_index)
+ /* in this case, the page is within the limits of the file */
+ len = PAGE_CACHE_SIZE;
+ else {
+ len = i_size & ~PAGE_CACHE_MASK;
+
+ if (page->index > end_index || !len) {
+ /* in this case, the page is outside the limits
+ * (truncate in progress)
+ */
+ ret = write_exec(pcol);
+ if (unlikely(ret))
+ goto fail;
+ if (PageError(page))
+ ClearPageError(page);
+ unlock_page(page);
+ return 0;
+ }
+ }
+
+try_again:
+
+ if (unlikely(pcol->pg_first == -1)) {
+ pcol->pg_first = page->index;
+ } else if (unlikely((pcol->pg_first + pcol->nr_pages) !=
+ page->index)) {
+ /* Discontinuity detected, split the request */
+ ret = write_exec(pcol);
+ if (unlikely(ret))
+ goto fail;
+ goto try_again;
+ }
+
+ if (!pcol->bio) {
+ ret = pcol_try_alloc(pcol);
+ if (unlikely(ret))
+ goto fail;
+ }
+
+ EXOFS_DBGMSG(" writepage_strip(%lx, %lx) len=%zx\n", inode->i_ino,
+ page->index, len);
+
+ ret = pcol_add_page(pcol, page, len);
+ if (unlikely(ret)) {
+ EXOFS_DBGMSG("Failed pcol_add_page "
+ "nr_pages=%u total_length=%zx\n",
+ pcol->nr_pages, pcol->length);
+
+ /* split the request, next loop will start again */
+ ret = write_exec(pcol);
+ if (unlikely(ret)) {
+ EXOFS_DBGMSG("write_exec faild => %d", ret);
+ goto fail;
+ }
+
+ goto try_again;
+ }
+
+ BUG_ON(PageWriteback(page));
+ set_page_writeback(page);
+
+ return 0;
+
+fail:
+ set_bit(AS_EIO, &page->mapping->flags);
+ unlock_page(page);
+ return ret;
+}
+
+static int exofs_writepages(struct address_space *mapping,
+ struct writeback_control *wbc)
+{
+ struct page_collect pcol;
+ long start, end, expected_pages;
+ int ret;
+
+ start = wbc->range_start >> PAGE_CACHE_SHIFT;
+ end = (wbc->range_end == LLONG_MAX) ?
+ start + mapping->nrpages :
+ wbc->range_end >> PAGE_CACHE_SHIFT;
+
+ if (start || end)
+ expected_pages = min(end - start + 1, 32L);
+ else
+ expected_pages = mapping->nrpages;
+
+ EXOFS_DBGMSG("inode(%lx) wbc->start=0x%llx wbc->end=0x%llx"
+ " m->nrpages=%lu start=%ld end=%ld\n",
+ mapping->host->i_ino, wbc->range_start, wbc->range_end,
+ mapping->nrpages, start, end);
+
+ _pcol_init(&pcol, expected_pages, mapping->host);
+
+ ret = write_cache_pages(mapping, wbc, writepage_strip, &pcol);
+ if (ret) {
+ EXOFS_ERR("write_cache_pages => %d\n", ret);
+ return ret;
+ }
+
+ return write_exec(&pcol);
+}
+
+static int exofs_writepage(struct page *page, struct writeback_control *wbc)
+{
+ struct page_collect pcol;
+ int ret;
+
+ _pcol_init(&pcol, 1, page->mapping->host);
+
+ ret = writepage_strip(page, NULL, &pcol);
+ if (ret) {
+ EXOFS_ERR("exofs_writepage => %d\n", ret);
+ return ret;
+ }
+
+ return write_exec(&pcol);
+}
+
+int exofs_write_begin(struct file *file, struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned flags,
+ struct page **pagep, void **fsdata)
+{
+ int ret = 0;
+ struct page *page;
+
+ page = *pagep;
+ if (page == NULL) {
+ ret = simple_write_begin(file, mapping, pos, len, flags, pagep,
+ fsdata);
+ if (ret) {
+ EXOFS_DBGMSG("simple_write_begin faild\n");
+ return ret;
+ }
+
+ page = *pagep;
+ }
+
+ /* read modify write */
+ if (!PageUptodate(page) && (len != PAGE_CACHE_SIZE)) {
+ ret = _readpage(page, true);
+ if (ret) {
+ /*SetPageError was done by _readpage. Is it ok?*/
+ unlock_page(page);
+ EXOFS_DBGMSG("__readpage_filler faild\n");
+ }
+ }
+
+ return ret;
+}
+
+static int exofs_write_begin_export(struct file *file,
+ struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned flags,
+ struct page **pagep, void **fsdata)
+{
+ *pagep = NULL;
+
+ return exofs_write_begin(file, mapping, pos, len, flags, pagep,
+ fsdata);
+}
+
+const struct address_space_operations exofs_aops = {
+ .readpage = exofs_readpage,
+ .readpages = exofs_readpages,
+ .writepage = exofs_writepage,
+ .writepages = exofs_writepages,
+ .write_begin = exofs_write_begin_export,
+ .write_end = simple_write_end,
+};
+
/******************************************************************************
* INODE OPERATIONS
*****************************************************************************/
--
1.6.2.1

2009-03-23 13:08:24

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCHSET 0/8 version 4] exofs for kernel 2.6.30

git diff --stat -p 690dd5e9e739cb0c66a792c5d7949f6e97113427..linux-next -- fs/exofs/
fs/exofs/Kbuild | 2 +-
fs/exofs/exofs.h | 17 -----------------
fs/exofs/file.c | 4 ++++
fs/exofs/inode.c | 45 ++++++++++++++++++++++-----------------------
fs/exofs/super.c | 9 +++++++++
5 files changed, 36 insertions(+), 41 deletions(-)

diff --git a/fs/exofs/Kbuild b/fs/exofs/Kbuild
index 592f40d..8c5253e 100644
--- a/fs/exofs/Kbuild
+++ b/fs/exofs/Kbuild
@@ -22,7 +22,7 @@ ccflags-y += -DCONFIG_EXOFS_FS -DCONFIG_EXOFS_FS_MODULE
# if we are built out-of-tree and the hosting kernel has OSD headers
# then "ccflags-y +=" will not pick the out-off-tree headers. Only by doing
# this it will work. This might break in future kernels
-KBUILD_CPPFLAGS := -I$(OSD_INC) $(KBUILD_CPPFLAGS)
+LINUXINCLUDE := -I$(OSD_INC) $(LINUXINCLUDE)

endif

diff --git a/fs/exofs/exofs.h b/fs/exofs/exofs.h
index 76155d7..d54753d 100644
--- a/fs/exofs/exofs.h
+++ b/fs/exofs/exofs.h
@@ -54,15 +54,6 @@
#define _LLU(x) (unsigned long long)(x)

/*
- * struct to hold what we get from mount options
- */
-struct exofs_mountopt {
- const char *dev_name;
- uint64_t pid;
- int timeout;
-};
-
-/*
* our extension to the in-memory superblock
*/
struct exofs_sb_info {
@@ -134,14 +125,6 @@ static inline struct exofs_i_info *exofs_i(struct inode *inode)
}

/*
- * ugly struct so that we can pass two arguments to update_inode's callback
- */
-struct updatei_args {
- struct exofs_sb_info *sbi;
- struct exofs_fcb fcb;
-};
-
-/*
* Maximum count of links to a file
*/
#define EXOFS_LINK_MAX 32000
diff --git a/fs/exofs/file.c b/fs/exofs/file.c
index 4738c3f..2712f68 100644
--- a/fs/exofs/file.c
+++ b/fs/exofs/file.c
@@ -49,6 +49,10 @@ static int exofs_file_fsync(struct file *filp, struct dentry *dentry,
struct address_space *mapping = filp->f_mapping;

ret1 = filemap_write_and_wait(mapping);
+ /*Note: file_fsync below also calles sync_blockdev, which is a no-op
+ * for exofs, but other then that it does sync_inode and
+ * sync_superblock which is what we need here.
+ */
ret2 = file_fsync(filp, dentry, datasync);

return ret1 ? ret1 : ret2;
diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
index 0f52e76..739629a 100644
--- a/fs/exofs/inode.c
+++ b/fs/exofs/inode.c
@@ -55,7 +55,7 @@ struct page_collect {
long pg_first;
};

-void _pcol_init(struct page_collect *pcol, unsigned expected_pages,
+static void _pcol_init(struct page_collect *pcol, unsigned expected_pages,
struct inode *inode)
{
struct exofs_sb_info *sbi = inode->i_sb->s_fs_info;
@@ -75,7 +75,7 @@ void _pcol_init(struct page_collect *pcol, unsigned expected_pages,
expected_pages);
}

-void _pcol_reset(struct page_collect *pcol)
+static void _pcol_reset(struct page_collect *pcol)
{
pcol->expected_pages -= min(pcol->nr_pages, pcol->expected_pages);

@@ -93,7 +93,7 @@ void _pcol_reset(struct page_collect *pcol)
pcol->expected_pages = 128;
}

-int pcol_try_alloc(struct page_collect *pcol)
+static int pcol_try_alloc(struct page_collect *pcol)
{
int pages = min_t(unsigned, pcol->expected_pages, BIO_MAX_PAGES);

@@ -103,18 +103,19 @@ int pcol_try_alloc(struct page_collect *pcol)
return 0;
}

- EXOFS_ERR("Failed to kcalloc expected_pages=%d\n",
+ EXOFS_ERR("Failed to kcalloc expected_pages=%u\n",
pcol->expected_pages);
return -ENOMEM;
}

-void pcol_free(struct page_collect *pcol)
+static void pcol_free(struct page_collect *pcol)
{
bio_put(pcol->bio);
pcol->bio = NULL;
}

-int pcol_add_page(struct page_collect *pcol, struct page *page, unsigned len)
+static int pcol_add_page(struct page_collect *pcol, struct page *page,
+ unsigned len)
{
int added_len = bio_add_pc_page(pcol->req_q, pcol->bio, page, len, 0);
if (unlikely(len != added_len))
@@ -173,9 +174,9 @@ static int __readpages_done(struct osd_request *or, struct page_collect *pcol,

osd_end_request(or);

- if (!ret)
+ if (likely(!ret))
good_bytes = pcol->length;
- else if (ret && !resid)
+ else if (!resid)
good_bytes = 0;
else
good_bytes = pcol->length - resid;
@@ -227,7 +228,7 @@ static void readpages_done(struct osd_request *or, void *p)
kfree(p);
}

-void _unlock_pcol_pages(struct page_collect *pcol, int ret, int rw)
+static void _unlock_pcol_pages(struct page_collect *pcol, int ret, int rw)
{
struct bio_vec *bvec;
int i;
@@ -245,7 +246,7 @@ void _unlock_pcol_pages(struct page_collect *pcol, int ret, int rw)
pcol_free(pcol);
}

-int read_exec(struct page_collect *pcol, bool is_sync)
+static int read_exec(struct page_collect *pcol, bool is_sync)
{
struct exofs_i_info *oi = exofs_i(pcol->inode);
struct osd_obj_id obj = {pcol->sbi->s_pid,
@@ -452,7 +453,7 @@ static void writepages_done(struct osd_request *or, void *p)

if (likely(!ret))
good_bytes = pcol->length;
- else if (ret && !resid)
+ else if (!resid)
good_bytes = 0;
else
good_bytes = pcol->length - resid;
@@ -493,7 +494,7 @@ static void writepages_done(struct osd_request *or, void *p)
EXOFS_DBGMSG("writepages_done END\n");
}

-int write_exec(struct page_collect *pcol)
+static int write_exec(struct page_collect *pcol)
{
struct exofs_i_info *oi = exofs_i(pcol->inode);
struct osd_obj_id obj = {pcol->sbi->s_pid,
@@ -631,7 +632,7 @@ fail:
return ret;
}

-int exofs_writepages(struct address_space *mapping,
+static int exofs_writepages(struct address_space *mapping,
struct writeback_control *wbc)
{
struct page_collect pcol;
@@ -1110,6 +1111,14 @@ struct inode *exofs_new_inode(struct inode *dir, int mode)
}

/*
+ * struct to pass two arguments to update_inode's callback
+ */
+struct updatei_args {
+ struct exofs_sb_info *sbi;
+ struct exofs_fcb fcb;
+};
+
+/*
* Callback function from exofs_update_inode().
*/
static void updatei_done(struct osd_request *or, void *p)
@@ -1218,16 +1227,6 @@ int exofs_write_inode(struct inode *inode, int wait)
return exofs_update_inode(inode, wait);
}

-int exofs_sync_inode(struct inode *inode)
-{
- struct writeback_control wbc = {
- .sync_mode = WB_SYNC_ALL,
- .nr_to_write = 0, /* sys_fsync did this */
- };
-
- return sync_inode(inode, &wbc);
-}
-
/*
* Callback function from exofs_delete_inode() - don't have much cleaning up to
* do.
diff --git a/fs/exofs/super.c b/fs/exofs/super.c
index 9153db2..989952b 100644
--- a/fs/exofs/super.c
+++ b/fs/exofs/super.c
@@ -45,6 +45,15 @@
*****************************************************************************/

/*
+ * struct to hold what we get from mount options
+ */
+struct exofs_mountopt {
+ const char *dev_name;
+ uint64_t pid;
+ int timeout;
+};
+
+/*
* exofs-specific mount-time options.
*/
enum { Opt_pid, Opt_to, Opt_mkfs, Opt_format, Opt_err };


Attachments:
exofs-ver5-to-ver4.diff (6.61 kB)

2009-03-24 09:09:41

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCHSET 0/8 version 4] exofs for kernel 2.6.30

Boaz Harrosh wrote:
> What's new since last iteration:
>
> * I completely re-wrote the [PATCH 4/8] exofs: address_space_operations
> in which we actually write/read to/from osd-storage. The difference is
> that now we try to accumulate as many contiguous pages as possible and
> send them as one large request. As opposed to writing each page at a
> time, in the previous patchset.
>
> * [PATCH 5/8] exofs: dir_inode and directory operations received lots
> of love thanks to Evgeniy Polyakov's grate comments.
>
> exofs is a file system that uses an OSD device as it's back store.
>
> OSD is a new T10 command set that views storage devices not as a large/flat
> array of sectors but as a container of objects, each having a length, quota,
> time attributes and more. Each object is addressed by a 64bit ID, and is
> contained in a 64bit ID partition. Each object has associated attributes
> attached to it, which are integral part of the object and provide metadata about
> the object. The standard defines some common obligatory attributes, but user
> attributes can be added as needed.
>
> Here is the list of patches
> [PATCH 1/8] exofs: Kbuild, Headers and osd utils
> [PATCH 2/8] exofs: file and file_inode operations
> [PATCH 3/8] exofs: symlink_inode and fast_symlink_inode operations
> [PATCH 4/8] exofs: address_space_operations
> [PATCH 5/8] exofs: dir_inode and directory operations
> [PATCH 6/8] exofs: super_operations and file_system_type
> [PATCH 7/8] exofs: Documentation
> [PATCH 8/8] fs: Add exofs to Kernel build
>
> This patchset is also available on:
> git-clone git://git.open-osd.org/linux-open-osd.git linux-next
> or on the web at:
> http://git.open-osd.org/gitweb.cgi?p=linux-open-osd.git;a=shortlog;h=refs/heads/linux-next
>
> (Above tree is based on Linus v2.6.29-rc8-212-g8144737)
>
> If anyone wants to actually run this code and test it
> then please start reading at:
> http://open-osd.org
> You will need to checkout the out-of-tree git (below) for the user-mode utilities.
> Also the exofs.txt file in patch 7/8 should help
>
> If you want to review the user-mode library and supporting plumbings,
> git-clone git://git.open-osd.org/open-osd.git
> or on the web at:
> http://git.open-osd.org/gitweb.cgi?p=open-osd.git;a=summary
>
> Boaz
>

Hi Linus

In the matter of above new exofs file system.

Andrew Morton has suggested that you might prefer to directly
pull form the open-osd git-tree instead of him pushing it through
his tree?

The exofs tree will be pushed only at second stage of the merge window
as it is dependent on patches to the osd-initiator which sit in
James's scsi-misc tree.

I'm monitoring the [email protected] mailing list and once
I see all dependent patches are in main-line I'll send you a pull
request, or Andrew which ever you prefer?

Some background ml thread:
http://www.spinics.net/lists/linux-scsi/msg32104.html

The code was sitting in linux-next since 2.6.29-rc1. It is Kconfigured
off by default and it will only impacted osd early adaptors/developers.

Thank you very much in advance
Boaz Harrosh

2009-03-30 21:30:47

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCHSET 0/8 version 4] exofs for kernel 2.6.30

On Wed, 18 Mar 2009 19:45:01 +0200
Boaz Harrosh <[email protected]> wrote:

> What's new since last iteration:
>
> * I completely re-wrote the [PATCH 4/8] exofs: address_space_operations
> in which we actually write/read to/from osd-storage. The difference is
> that now we try to accumulate as many contiguous pages as possible and
> send them as one large request. As opposed to writing each page at a
> time, in the previous patchset.
>
> * [PATCH 5/8] exofs: dir_inode and directory operations received lots
> of love thanks to Evgeniy Polyakov's grate comments.
>
> exofs is a file system that uses an OSD device as it's back store.
>
> OSD is a new T10 command set that views storage devices not as a large/flat
> array of sectors but as a container of objects, each having a length, quota,
> time attributes and more. Each object is addressed by a 64bit ID, and is
> contained in a 64bit ID partition. Each object has associated attributes
> attached to it, which are integral part of the object and provide metadata about
> the object. The standard defines some common obligatory attributes, but user
> attributes can be added as needed.
>
> Here is the list of patches
> [PATCH 1/8] exofs: Kbuild, Headers and osd utils
> [PATCH 2/8] exofs: file and file_inode operations
> [PATCH 3/8] exofs: symlink_inode and fast_symlink_inode operations
> [PATCH 4/8] exofs: address_space_operations
> [PATCH 5/8] exofs: dir_inode and directory operations
> [PATCH 6/8] exofs: super_operations and file_system_type
> [PATCH 7/8] exofs: Documentation
> [PATCH 8/8] fs: Add exofs to Kernel build

Are all the prerequisites for exofs now in mainline?

> This patchset is also available on:
> git-clone git://git.open-osd.org/linux-open-osd.git linux-next
> or on the web at:
> http://git.open-osd.org/gitweb.cgi?p=linux-open-osd.git;a=shortlog;h=refs/heads/linux-next

Well I could merge them, but given that you have a git tree, a more
convenient path would be for us to include your tree in linux-next and
then you ask Linus to pull it directly when the time comes.

I'm unsure when that time will come. Who has reviewed this work and
what was the result?

2009-03-31 03:01:53

by Stephen Rothwell

[permalink] [raw]
Subject: Re: [PATCHSET 0/8 version 4] exofs for kernel 2.6.30

On Mon, 30 Mar 2009 14:22:00 -0700 Andrew Morton <[email protected]> wrote:
>
> > This patchset is also available on:
> > git-clone git://git.open-osd.org/linux-open-osd.git linux-next
> > or on the web at:
> > http://git.open-osd.org/gitweb.cgi?p=linux-open-osd.git;a=shortlog;h=refs/heads/linux-next
>
> Well I could merge them, but given that you have a git tree, a more
> convenient path would be for us to include your tree in linux-next and
> then you ask Linus to pull it directly when the time comes.

$ git cat-file -p next-20090330:Next/Trees | grep osd
osd git git://git.open-osd.org/linux-open-osd.git#linux-next

i.e. its in there. Those particular patches have been in linux-next
since next-20090326 and presumably have been built by the
all{yes,mod}config builds I do.

> I'm unsure when that time will come. Who has reviewed this work and
> what was the result?

That is a good question. The other is "Is this destined for 2.6.30?"

--
Cheers,
Stephen Rothwell [email protected]
http://www.canb.auug.org.au/~sfr/


Attachments:
(No filename) (1.04 kB)
(No filename) (197.00 B)
Download all attachments

2009-03-31 07:14:17

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [PATCHSET 0/8 version 4] exofs for kernel 2.6.30

Hi.

On Tue, Mar 31, 2009 at 02:01:24PM +1100, Stephen Rothwell ([email protected]) wrote:
> > I'm unsure when that time will come. Who has reviewed this work and
> > what was the result?
>
> That is a good question. The other is "Is this destined for 2.6.30?"

There was a lengthy discussion in scsi list.
I made another review of the previous revision and we talked a bit about
ducumented features/bugs for the current one.

--
Evgeniy Polyakov

2009-03-31 07:24:18

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCHSET 0/8 version 4] exofs for kernel 2.6.30

On 03/31/2009 12:22 AM, Andrew Morton wrote:
> On Wed, 18 Mar 2009 19:45:01 +0200
> Boaz Harrosh <[email protected]> wrote:
>
>> What's new since last iteration:
>>
>> * I completely re-wrote the [PATCH 4/8] exofs: address_space_operations
>> in which we actually write/read to/from osd-storage. The difference is
>> that now we try to accumulate as many contiguous pages as possible and
>> send them as one large request. As opposed to writing each page at a
>> time, in the previous patchset.
>>
>> * [PATCH 5/8] exofs: dir_inode and directory operations received lots
>> of love thanks to Evgeniy Polyakov's grate comments.
>>
>> exofs is a file system that uses an OSD device as it's back store.
>>
>> OSD is a new T10 command set that views storage devices not as a large/flat
>> array of sectors but as a container of objects, each having a length, quota,
>> time attributes and more. Each object is addressed by a 64bit ID, and is
>> contained in a 64bit ID partition. Each object has associated attributes
>> attached to it, which are integral part of the object and provide metadata about
>> the object. The standard defines some common obligatory attributes, but user
>> attributes can be added as needed.
>>
>> Here is the list of patches
>> [PATCH 1/8] exofs: Kbuild, Headers and osd utils
>> [PATCH 2/8] exofs: file and file_inode operations
>> [PATCH 3/8] exofs: symlink_inode and fast_symlink_inode operations
>> [PATCH 4/8] exofs: address_space_operations
>> [PATCH 5/8] exofs: dir_inode and directory operations
>> [PATCH 6/8] exofs: super_operations and file_system_type
>> [PATCH 7/8] exofs: Documentation
>> [PATCH 8/8] fs: Add exofs to Kernel build
>
> Are all the prerequisites for exofs now in mainline?
>

Yes they are all in

>> This patchset is also available on:
>> git-clone git://git.open-osd.org/linux-open-osd.git linux-next
>> or on the web at:
>> http://git.open-osd.org/gitweb.cgi?p=linux-open-osd.git;a=shortlog;h=refs/heads/linux-next
>
> Well I could merge them, but given that you have a git tree, a more
> convenient path would be for us to include your tree in linux-next

As Stephan said they are there since 2.6.29-rc1

> and
> then you ask Linus to pull it directly when the time comes.
>

I was hoping the time is now

> I'm unsure when that time will come. Who has reviewed this work and
> what was the result?
>
>

The patches have been reveiwed on linux-kernel and linux-fsdevel for
5-6 rounds. Each round drew it's comments which I fixed and so on.

The code is pretty well as far as styling and layout. And I'm not sure
what else can be done for them.

As far as quality robustness and performance, that's hard to say since
it has not been used outside of the labs yet. Perhaps being in mainline
will give the exposure it needs to stabilize.

Exofs is the only current candidate user for the osd in mainline, and in a
shape that does things relatively well as far as we could test it.

It has lots of work still to do on it in order to make it the pNFS-Objects
filesystem it needs to be. One reason it is very important for me that it
will go in mainline is because of the pNFS patches that need to come. pNFS
is an out-of-tree staging area that holds the up coming pNFS stuff that will
be added to Linux. It currently holds patches that will eventually go through
4 different git trees / maintainers. if exofs is left outside it is another
headache added. And of course is the hassle of keeping out-of-tree code for
another round, that's depressing. I don't see what is the danger of inclusion
and what is to be gained with keeping this out-of-tree? Please advise you have
much longer experience with these thing then me.

For long term plans By next kernel exofs should be pNFS capable, and a
pNFS-object-layout driver should be added for the pNFS client side. The
layout driver is based on the same infrastructure as exofs. Then farther
down exofs will need to be multi-device raids and all.

Lots of more work to do, but first thing first, can this be included in
2.6.30?

Thanks
Boaz

2009-03-31 07:43:42

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCHSET 0/8 version 4] exofs for kernel 2.6.30

On 03/31/2009 10:20 AM, Boaz Harrosh wrote:
> On 03/31/2009 12:22 AM, Andrew Morton wrote:
>> On Wed, 18 Mar 2009 19:45:01 +0200
>> Boaz Harrosh <[email protected]> wrote:
>>
>>> What's new since last iteration:
>>>
>>> * I completely re-wrote the [PATCH 4/8] exofs: address_space_operations
>>> in which we actually write/read to/from osd-storage. The difference is
>>> that now we try to accumulate as many contiguous pages as possible and
>>> send them as one large request. As opposed to writing each page at a
>>> time, in the previous patchset.
>>>
>>> * [PATCH 5/8] exofs: dir_inode and directory operations received lots
>>> of love thanks to Evgeniy Polyakov's grate comments.
>>>
>>> exofs is a file system that uses an OSD device as it's back store.
>>>
>>> OSD is a new T10 command set that views storage devices not as a large/flat
>>> array of sectors but as a container of objects, each having a length, quota,
>>> time attributes and more. Each object is addressed by a 64bit ID, and is
>>> contained in a 64bit ID partition. Each object has associated attributes
>>> attached to it, which are integral part of the object and provide metadata about
>>> the object. The standard defines some common obligatory attributes, but user
>>> attributes can be added as needed.
>>>
>>> Here is the list of patches
>>> [PATCH 1/8] exofs: Kbuild, Headers and osd utils
>>> [PATCH 2/8] exofs: file and file_inode operations
>>> [PATCH 3/8] exofs: symlink_inode and fast_symlink_inode operations
>>> [PATCH 4/8] exofs: address_space_operations
>>> [PATCH 5/8] exofs: dir_inode and directory operations
>>> [PATCH 6/8] exofs: super_operations and file_system_type
>>> [PATCH 7/8] exofs: Documentation
>>> [PATCH 8/8] fs: Add exofs to Kernel build
>> Are all the prerequisites for exofs now in mainline?
>>
>
> Yes they are all in
>
>>> This patchset is also available on:
>>> git-clone git://git.open-osd.org/linux-open-osd.git linux-next
>>> or on the web at:
>>> http://git.open-osd.org/gitweb.cgi?p=linux-open-osd.git;a=shortlog;h=refs/heads/linux-next
>> Well I could merge them, but given that you have a git tree, a more
>> convenient path would be for us to include your tree in linux-next
>
> As Stephan said they are there since 2.6.29-rc1
>
>> and
>> then you ask Linus to pull it directly when the time comes.
>>
>
> I was hoping the time is now
>
>> I'm unsure when that time will come. Who has reviewed this work and
>> what was the result?
>>
>>
>
> The patches have been reveiwed on linux-kernel and linux-fsdevel for
> 5-6 rounds. Each round drew it's comments which I fixed and so on.
>

I forgot to say. Some of the people that sent comments where:
Marcin Slusarz <[email protected]>
Evgeniy Polyakov <[email protected]>
Jeff Garzik <[email protected]>
Pavel Machek <[email protected]>
Benny Halevy <[email protected]>
Alan Cox <[email protected]>
Andrew Morton <[email protected]>

And then there was a long flame war about user-mode API, which I now
have, and all in kernel utilities are gone.

To the best of my knowledge I have addressed all comments, at least
no one complained.

<snip>

Thanks
Boaz

2009-03-31 08:12:08

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 1/8] exofs: Kbuild, Headers and osd utils

On Wed, 18 Mar 2009 19:57:36 +0200 Boaz Harrosh <[email protected]> wrote:

> This patch includes osd infrastructure that will be used later by
> the file system.
>
> Also the declarations of constants, on disk structures,
> and prototypes.
>
> And the Kbuild+Kconfig files needed to build the exofs module.
>
> ...
>
> --- /dev/null
> +++ b/fs/exofs/Kbuild
> @@ -0,0 +1,30 @@
> +#
> +# Kbuild for the EXOFS module
> +#
> +# Copyright (C) 2008 Panasas Inc. All rights reserved.
> +#
> +# Authors:
> +# Boaz Harrosh <[email protected]>
> +#
> +# This program is free software; you can redistribute it and/or modify
> +# it under the terms of the GNU General Public License version 2
> +#
> +# Kbuild - Gets included from the Kernels Makefile and build system
> +#
> +
> +ifneq ($(OSD_INC),)
> +# we are built out-of-tree Kconfigure everything as on
> +
> +CONFIG_EXOFS_FS=m
> +ccflags-y += -DCONFIG_EXOFS_FS -DCONFIG_EXOFS_FS_MODULE
> +# ccflags-y += -DCONFIG_EXOFS_DEBUG
> +
> +# if we are built out-of-tree and the hosting kernel has OSD headers
> +# then "ccflags-y +=" will not pick the out-off-tree headers. Only by doing
> +# this it will work. This might break in future kernels
> +KBUILD_CPPFLAGS := -I$(OSD_INC) $(KBUILD_CPPFLAGS)
> +
> +endif

But this patch is putting the fs into the tree, so all the above is unneeded.

>
> ...
>
> + * Object IDs 0, 1, and 2 are always in use (see above defines).
> + */
> +enum {
> + EXOFS_UINT64_MAX = (~0LL),

Use ULLONG_MAX?

~0ULL would be more consistent.

> + EXOFS_MAX_INO_ID = (sizeof(ino_t) * 8 == 64) ? EXOFS_UINT64_MAX :
> + (1LL << (sizeof(ino_t) * 8 - 1)),

Tricky, needs a comment.

Would be clearer to use 1ULL.

> + EXOFS_MAX_ID = (EXOFS_MAX_INO_ID - 1 - EXOFS_OBJ_OFF),
> +};
> +
> +/****************************************************************************
> + * Misc.
> + ****************************************************************************/
> +#define EXOFS_BLKSHIFT 12
> +#define EXOFS_BLKSIZE (1UL << EXOFS_BLKSHIFT)
> +
> +/****************************************************************************
> + * superblock-related things
> + ****************************************************************************/
> +#define EXOFS_SUPER_MAGIC 0x5DF5

Should be in include/linux/magic.h

>
> ...
>
> +/*
> + * The file control block - stored in an object's attributes. This is where
> + * the in-memory inode is stored on disk.
> + */
> +struct exofs_fcb {
> + __le64 i_size; /* Size of the file */
> + __le16 i_mode; /* File mode */
> + __le16 i_links_count; /* Links count */
> + __le32 i_uid; /* Owner Uid */
> + __le32 i_gid; /* Group Id */
> + __le32 i_atime; /* Access time */
> + __le32 i_ctime; /* Creation time */
> + __le32 i_mtime; /* Modification time */
> + __le32 i_flags; /* File flags (unused for now)*/
> + __le32 i_generation; /* File version (for NFS) */
> + __le32 i_data[EXOFS_IDATA]; /* Short symlink names and device #s */
> +};

There is no room for future expansion. Would that be appropriate/wise?
I guess it would need versioning information somewhere too.

>
> ...
>
> +/* u64 has problems with printk this will cast it to unsigned long long */
> +#define _LLU(x) (unsigned long long)(x)

ug.

Normally the response is "please open-code this". But given that one
day real soon this printk(u64) problem will be fixed, I guess the use
of _LLU will make it easy to find and delete all the now-unneeded
casts.

>
> ...
>
> +/*
> + * our inode flags
> + */
> +#define OBJ_2BCREATED 0 /* object will be created soon*/
> +#define OBJ_CREATED 1 /* object has been created on the osd*/
> +
> +static inline int obj_2bcreated(struct exofs_i_info *oi)
> +{
> + return test_bit(OBJ_2BCREATED, &(oi->i_flags));
> +}

unneeded parentheses around oi->i_flags.

> +static inline void set_obj_2bcreated(struct exofs_i_info *oi)
> +{
> + set_bit(OBJ_2BCREATED, &(oi->i_flags));
> +}
> +
> +static inline int obj_created(struct exofs_i_info *oi)
> +{
> + return test_bit(OBJ_CREATED, &(oi->i_flags));
> +}
> +
> +static inline void set_obj_created(struct exofs_i_info *oi)
> +{
> + set_bit(OBJ_CREATED, &(oi->i_flags));
> +}

dittoes.

>
> ...
>

2009-03-31 08:12:35

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 2/8] exofs: file and file_inode operations

On Wed, 18 Mar 2009 19:58:47 +0200 Boaz Harrosh <[email protected]> wrote:

> implementation of the file_operations and inode_operations for
> regular data files.
>
> Most file_operations are generic vfs implementations except:
> - exofs_truncate will truncate the OSD object as well
> - Generic file_fsync is not good for none_bd devices so open code it
> - The default for .flush in Linux is todo nothing so call exofs_fsync
> on the file.
>
>
> ...
>
> +static int exofs_file_fsync(struct file *filp, struct dentry *dentry,
> + int datasync)
> +{
> + int ret1, ret2;
> + struct address_space *mapping = filp->f_mapping;
> +
> + ret1 = filemap_write_and_wait(mapping);
> + ret2 = file_fsync(filp, dentry, datasync);
> +
> + return ret1 ? ret1 : ret2;
> +}

It might be better to abort if filemap_write_and_wait() failed. if the
hardware is bad, these things can take a looooong time retrying and
timing out. There's no point in doubling the delay.

>
> ...
>

2009-03-31 08:13:01

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 6/8] exofs: super_operations and file_system_type

On Wed, 18 Mar 2009 20:09:51 +0200 Boaz Harrosh <[email protected]> wrote:

> This patch ties all operation vectors into a file system superblock
> and registers the exofs file_system_type at module's load time.
>
> * The file system control block (AKA on-disk superblock) resides in
> an object with a special ID (defined in common.h).
> Information included in the file system control block is used to
> fill the in-memory superblock structure at mount time. This object
> is created before the file system is used by mkexofs.c It contains
> information such as:
> - The file system's magic number
> - The next inode number to be allocated
>
>
> ...
>
> +static int exofs_statfs(struct dentry *dentry, struct kstatfs *buf)
> +{
> + struct super_block *sb = dentry->d_sb;
> + struct exofs_sb_info *sbi = sb->s_fs_info;
> + struct osd_obj_id obj = {sbi->s_pid, 0};
> + struct osd_attr attrs[] = {
> + ATTR_DEF(OSD_APAGE_PARTITION_QUOTAS,
> + OSD_ATTR_PQ_CAPACITY_QUOTA, sizeof(__be64)),
> + ATTR_DEF(OSD_APAGE_PARTITION_INFORMATION,
> + OSD_ATTR_PI_USED_CAPACITY, sizeof(__be64)),
> + };
> + uint64_t capacity = ~0;
> + uint64_t used = ~0;

My brain hurts.

~0 is signed 0xffffffff.

When assigning to a u64 it gets signed extended to signed
0xffffffffffffffff and then converted to unsigned 0xffffffffffffffff.

I think. Just as with plain old "-1". Perhaps using plain old "-1"
would be clearer here.

>
> ...
>
> +const struct super_operations exofs_sops = {

This can in fact be made static, I believe.

>
> ...
>

2009-03-31 08:13:28

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 5/8] exofs: dir_inode and directory operations

On Wed, 18 Mar 2009 20:08:49 +0200 Boaz Harrosh <[email protected]> wrote:

> implementation of directory and inode operations.
>
> * A directory is treated as a file, and essentially contains a list
> of <file name, inode #> pairs for files that are found in that
> directory. The object IDs correspond to the files' inode numbers
> and are allocated using a 64bit incrementing global counter.
> * Each file's control block (AKA on-disk inode) is stored in its
> object's attributes. This applies to both regular files and other
> types (directories, device files, symlinks, etc.).
>
>
> ...
>
> +static inline unsigned long dir_pages(struct inode *inode)
> +{
> + return (inode->i_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
> +}

Do we need i_size_read() here? Probably not if it's always called
under i_mutex. Needs checking and commenting please.

> +static unsigned exofs_last_byte(struct inode *inode, unsigned long page_nr)
> +{
> + unsigned last_byte = inode->i_size;
> +
> + last_byte -= page_nr << PAGE_CACHE_SHIFT;

hm. Strange to left-shift an unsigned long and then copy it to a
smaller type.

Are the types here appropriately chosen?

> + if (last_byte > PAGE_CACHE_SIZE)
> + last_byte = PAGE_CACHE_SIZE;
> + return last_byte;
> +}
> +
> +static int exofs_commit_chunk(struct page *page, loff_t pos, unsigned len)
>
> ...
>

This all looks vaguely familiar :)

2009-03-31 08:13:55

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 4/8 ver5] exofs: address_space_operations

On Sun, 22 Mar 2009 15:58:46 +0200 Boaz Harrosh <[email protected]> wrote:

>
> OK Now we start to read and write from osd-objects. We try to
> collect at most contiguous pages as possible in a single write/read.
> The first page index is the object's offset.
>
> TODO:
> In 64-bit a single bio can carry at most 128 pages.
> Add support of chaining multiple bios
>
>
> ...
>
> +static int write_exec(struct page_collect *pcol)
> +{
> + struct exofs_i_info *oi = exofs_i(pcol->inode);
> + struct osd_obj_id obj = {pcol->sbi->s_pid,
> + pcol->inode->i_ino + EXOFS_OBJ_OFF};
> + struct osd_request *or = NULL;
> + struct page_collect *pcol_copy = NULL;
> + loff_t i_start = pcol->pg_first << PAGE_CACHE_SHIFT;

bug. On 32-bit this shift will overflow prior to getting promoted to
64-bit. Do:

loff_t i_start = (loff_t)pcol->pg_first << PAGE_CACHE_SHIFT;

>
> ...
>
> +static int writepage_strip(struct page *page,
> + struct writeback_control *wbc_unused, void *data)

Some of these functions could do with some comments explaining why they exist.

> + struct page_collect *pcol = data;
> + struct inode *inode = pcol->inode;
> + struct exofs_i_info *oi = exofs_i(inode);
> + loff_t i_size = i_size_read(inode);
> + pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
> + size_t len;
> + int ret;
> +
>
> ...
>

2009-03-31 08:14:34

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCHSET 0/8 version 4] exofs for kernel 2.6.30

On Tue, 31 Mar 2009 10:41:29 +0300 Boaz Harrosh <[email protected]> wrote:

>
> I forgot to say. Some of the people that sent comments where:
> Marcin Slusarz <[email protected]>
> Evgeniy Polyakov <[email protected]>
> Jeff Garzik <[email protected]>
> Pavel Machek <[email protected]>
> Benny Halevy <[email protected]>
> Alan Cox <[email protected]>
> Andrew Morton <[email protected]>
>
> And then there was a long flame war about user-mode API, which I now
> have, and all in kernel utilities are gone.
>
> To the best of my knowledge I have addressed all comments, at least
> no one complained.
>

Let me take a quick look...

<quickly looks>

OK, I foud a few minor things, but the code looks fine to me. However
I cannot speak to the overall design and to the actual desirability of
adding the feature to Linux.

I'd say that unless someone pipes up with a showstopper objection in
the next few days, please send the pull request to Linus near the end of
the week.

2009-03-31 09:00:41

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCH 1/8] exofs: Kbuild, Headers and osd utils

On 03/31/2009 11:04 AM, Andrew Morton wrote:
> On Wed, 18 Mar 2009 19:57:36 +0200 Boaz Harrosh <[email protected]> wrote:
>
>> This patch includes osd infrastructure that will be used later by
>> the file system.
>>
>> Also the declarations of constants, on disk structures,
>> and prototypes.
>>
>> And the Kbuild+Kconfig files needed to build the exofs module.
>>
>> ...
>>
>> --- /dev/null
>> +++ b/fs/exofs/Kbuild
>> @@ -0,0 +1,30 @@
>> +#
>> +# Kbuild for the EXOFS module
>> +#
>> +# Copyright (C) 2008 Panasas Inc. All rights reserved.
>> +#
>> +# Authors:
>> +# Boaz Harrosh <[email protected]>
>> +#
>> +# This program is free software; you can redistribute it and/or modify
>> +# it under the terms of the GNU General Public License version 2
>> +#
>> +# Kbuild - Gets included from the Kernels Makefile and build system
>> +#
>> +
>> +ifneq ($(OSD_INC),)
>> +# we are built out-of-tree Kconfigure everything as on
>> +
>> +CONFIG_EXOFS_FS=m
>> +ccflags-y += -DCONFIG_EXOFS_FS -DCONFIG_EXOFS_FS_MODULE
>> +# ccflags-y += -DCONFIG_EXOFS_DEBUG
>> +
>> +# if we are built out-of-tree and the hosting kernel has OSD headers
>> +# then "ccflags-y +=" will not pick the out-off-tree headers. Only by doing
>> +# this it will work. This might break in future kernels
>> +KBUILD_CPPFLAGS := -I$(OSD_INC) $(KBUILD_CPPFLAGS)
>> +
>> +endif
>
> But this patch is putting the fs into the tree, so all the above is unneeded.
>
>> ...
>>
>> + * Object IDs 0, 1, and 2 are always in use (see above defines).
>> + */
>> +enum {
>> + EXOFS_UINT64_MAX = (~0LL),
>
> Use ULLONG_MAX?
>
> ~0ULL would be more consistent.
>
>> + EXOFS_MAX_INO_ID = (sizeof(ino_t) * 8 == 64) ? EXOFS_UINT64_MAX :
>> + (1LL << (sizeof(ino_t) * 8 - 1)),
>
> Tricky, needs a comment.
>
> Would be clearer to use 1ULL.
>
>> + EXOFS_MAX_ID = (EXOFS_MAX_INO_ID - 1 - EXOFS_OBJ_OFF),
>> +};
>> +

OK, OK, OK

>> +/****************************************************************************
>> + * Misc.
>> + ****************************************************************************/
>> +#define EXOFS_BLKSHIFT 12
>> +#define EXOFS_BLKSIZE (1UL << EXOFS_BLKSHIFT)
>> +
>> +/****************************************************************************
>> + * superblock-related things
>> + ****************************************************************************/
>> +#define EXOFS_SUPER_MAGIC 0x5DF5
>
> Should be in include/linux/magic.h
>

Is this relevant for OSD, I guess if there are going to
be more OSD filesystems then yes.

I will do it, thanks.

>> ...
>>
>> +/*
>> + * The file control block - stored in an object's attributes. This is where
>> + * the in-memory inode is stored on disk.
>> + */
>> +struct exofs_fcb {
>> + __le64 i_size; /* Size of the file */
>> + __le16 i_mode; /* File mode */
>> + __le16 i_links_count; /* Links count */
>> + __le32 i_uid; /* Owner Uid */
>> + __le32 i_gid; /* Group Id */
>> + __le32 i_atime; /* Access time */
>> + __le32 i_ctime; /* Creation time */
>> + __le32 i_mtime; /* Modification time */
>> + __le32 i_flags; /* File flags (unused for now)*/
>> + __le32 i_generation; /* File version (for NFS) */
>> + __le32 i_data[EXOFS_IDATA]; /* Short symlink names and device #s */
>> +};
>
> There is no room for future expansion. Would that be appropriate/wise?
> I guess it would need versioning information somewhere too.
>

In osd we have the size-of-the-attribute it sits in. So if in future we
add members we can switch according to size, also we can just stick it in
a different attribute number, so like EXOFS_ATTR_INODE_DATA_VER1
EXOFS_ATTR_INODE_DATA_VER2 attribute. Presence of, means support. Hell we can
even be backward compatible with having 2 or three versions at once.

>> ...
>>
>> +/* u64 has problems with printk this will cast it to unsigned long long */
>> +#define _LLU(x) (unsigned long long)(x)
>
> ug.
>
> Normally the response is "please open-code this". But given that one
> day real soon this printk(u64) problem will be fixed, I guess the use
> of _LLU will make it easy to find and delete all the now-unneeded
> casts.
>

Exactly my thoughts

>> ...
>>
>> +/*
>> + * our inode flags
>> + */
>> +#define OBJ_2BCREATED 0 /* object will be created soon*/
>> +#define OBJ_CREATED 1 /* object has been created on the osd*/
>> +
>> +static inline int obj_2bcreated(struct exofs_i_info *oi)
>> +{
>> + return test_bit(OBJ_2BCREATED, &(oi->i_flags));
>> +}
>
> unneeded parentheses around oi->i_flags.
>
>> +static inline void set_obj_2bcreated(struct exofs_i_info *oi)
>> +{
>> + set_bit(OBJ_2BCREATED, &(oi->i_flags));
>> +}
>> +
>> +static inline int obj_created(struct exofs_i_info *oi)
>> +{
>> + return test_bit(OBJ_CREATED, &(oi->i_flags));
>> +}
>> +
>> +static inline void set_obj_created(struct exofs_i_info *oi)
>> +{
>> + set_bit(OBJ_CREATED, &(oi->i_flags));
>> +}
>
> dittoes.
>
>> ...
>>
>

Thanks
will fix

Boaz

2009-03-31 09:01:01

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCH 2/8] exofs: file and file_inode operations

On 03/31/2009 11:04 AM, Andrew Morton wrote:
> On Wed, 18 Mar 2009 19:58:47 +0200 Boaz Harrosh <[email protected]> wrote:
>
>> implementation of the file_operations and inode_operations for
>> regular data files.
>>
>> Most file_operations are generic vfs implementations except:
>> - exofs_truncate will truncate the OSD object as well
>> - Generic file_fsync is not good for none_bd devices so open code it
>> - The default for .flush in Linux is todo nothing so call exofs_fsync
>> on the file.
>>
>>
>> ...
>>
>> +static int exofs_file_fsync(struct file *filp, struct dentry *dentry,
>> + int datasync)
>> +{
>> + int ret1, ret2;
>> + struct address_space *mapping = filp->f_mapping;
>> +
>> + ret1 = filemap_write_and_wait(mapping);
>> + ret2 = file_fsync(filp, dentry, datasync);
>> +
>> + return ret1 ? ret1 : ret2;
>> +}
>
> It might be better to abort if filemap_write_and_wait() failed. if the
> hardware is bad, these things can take a looooong time retrying and
> timing out. There's no point in doubling the delay.
>
>> ...
>>
>

OK I got confused by existing code, good point
Boaz

2009-03-31 09:06:49

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCH 4/8 ver5] exofs: address_space_operations

On 03/31/2009 11:04 AM, Andrew Morton wrote:
> On Sun, 22 Mar 2009 15:58:46 +0200 Boaz Harrosh <[email protected]> wrote:
>
>> OK Now we start to read and write from osd-objects. We try to
>> collect at most contiguous pages as possible in a single write/read.
>> The first page index is the object's offset.
>>
>> TODO:
>> In 64-bit a single bio can carry at most 128 pages.
>> Add support of chaining multiple bios
>>
>>
>> ...
>>
>> +static int write_exec(struct page_collect *pcol)
>> +{
>> + struct exofs_i_info *oi = exofs_i(pcol->inode);
>> + struct osd_obj_id obj = {pcol->sbi->s_pid,
>> + pcol->inode->i_ino + EXOFS_OBJ_OFF};
>> + struct osd_request *or = NULL;
>> + struct page_collect *pcol_copy = NULL;
>> + loff_t i_start = pcol->pg_first << PAGE_CACHE_SHIFT;
>
> bug. On 32-bit this shift will overflow prior to getting promoted to
> 64-bit. Do:
>
> loff_t i_start = (loff_t)pcol->pg_first << PAGE_CACHE_SHIFT;
>

In that case I might make pcol->pg_first loff_t.

Why is inode->i_index not an loff_t then?
Page-index <=> byte-offset, is done all the time 12 bits does not
make a difference.

>> ...
>>
>> +static int writepage_strip(struct page *page,
>> + struct writeback_control *wbc_unused, void *data)
>
> Some of these functions could do with some comments explaining why they exist.
>
>> + struct page_collect *pcol = data;
>> + struct inode *inode = pcol->inode;
>> + struct exofs_i_info *oi = exofs_i(inode);
>> + loff_t i_size = i_size_read(inode);
>> + pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
>> + size_t len;
>> + int ret;
>> +
>>
>> ...
>>
>

Thanks
Boaz

2009-03-31 10:23:20

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 4/8 ver5] exofs: address_space_operations

On Tue, 31 Mar 2009 12:04:36 +0300 Boaz Harrosh <[email protected]> wrote:

> >> +static int write_exec(struct page_collect *pcol)
> >> +{
> >> + struct exofs_i_info *oi = exofs_i(pcol->inode);
> >> + struct osd_obj_id obj = {pcol->sbi->s_pid,
> >> + pcol->inode->i_ino + EXOFS_OBJ_OFF};
> >> + struct osd_request *or = NULL;
> >> + struct page_collect *pcol_copy = NULL;
> >> + loff_t i_start = pcol->pg_first << PAGE_CACHE_SHIFT;
> >
> > bug. On 32-bit this shift will overflow prior to getting promoted to
> > 64-bit. Do:
> >
> > loff_t i_start = (loff_t)pcol->pg_first << PAGE_CACHE_SHIFT;
> >
>
> In that case I might make pcol->pg_first loff_t.

That would work.

> Why is inode->i_index not an loff_t then?

hm, what's i_index?

> Page-index <=> byte-offset, is done all the time 12 bits does not
> make a difference.

Page indices are 32-bit on 32-bit CPUs. File offsets are 64-bit. We
are careful to avoid the above overflow bug whenever the conversion
from page index to file size is made. Try

fgrep '(loff_t)' mm/*.c

2009-03-31 10:24:56

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCH 5/8] exofs: dir_inode and directory operations

On 03/31/2009 11:04 AM, Andrew Morton wrote:
> On Wed, 18 Mar 2009 20:08:49 +0200 Boaz Harrosh <[email protected]> wrote:
>
>> implementation of directory and inode operations.
>>
>> * A directory is treated as a file, and essentially contains a list
>> of <file name, inode #> pairs for files that are found in that
>> directory. The object IDs correspond to the files' inode numbers
>> and are allocated using a 64bit incrementing global counter.
>> * Each file's control block (AKA on-disk inode) is stored in its
>> object's attributes. This applies to both regular files and other
>> types (directories, device files, symlinks, etc.).
>>
>>
>> ...
>>
>> +static inline unsigned long dir_pages(struct inode *inode)
>> +{
>> + return (inode->i_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
>> +}
>
> Do we need i_size_read() here? Probably not if it's always called
> under i_mutex. Needs checking and commenting please.
>

Don't know, I'll have a look

>> +static unsigned exofs_last_byte(struct inode *inode, unsigned long page_nr)
>> +{
>> + unsigned last_byte = inode->i_size;
>> +
>> + last_byte -= page_nr << PAGE_CACHE_SHIFT;
>
> hm. Strange to left-shift an unsigned long and then copy it to a
> smaller type.
>

wrong type, thanks!

> Are the types here appropriately chosen?
>
>> + if (last_byte > PAGE_CACHE_SIZE)
>> + last_byte = PAGE_CACHE_SIZE;
>> + return last_byte;
>> +}
>> +
>> +static int exofs_commit_chunk(struct page *page, loff_t pos, unsigned len)
>>
>> ...
>>
>
> This all looks vaguely familiar :)

Yep ;). Don't forget that all this started with a cp ...

Boaz

2009-03-31 10:29:53

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCH 4/8 ver5] exofs: address_space_operations

On 03/31/2009 01:15 PM, Andrew Morton wrote:
> On Tue, 31 Mar 2009 12:04:36 +0300 Boaz Harrosh <[email protected]> wrote:
>
>>>> +static int write_exec(struct page_collect *pcol)
>>>> +{
>>>> + struct exofs_i_info *oi = exofs_i(pcol->inode);
>>>> + struct osd_obj_id obj = {pcol->sbi->s_pid,
>>>> + pcol->inode->i_ino + EXOFS_OBJ_OFF};
>>>> + struct osd_request *or = NULL;
>>>> + struct page_collect *pcol_copy = NULL;
>>>> + loff_t i_start = pcol->pg_first << PAGE_CACHE_SHIFT;
>>> bug. On 32-bit this shift will overflow prior to getting promoted to
>>> 64-bit. Do:
>>>
>>> loff_t i_start = (loff_t)pcol->pg_first << PAGE_CACHE_SHIFT;
>>>
>> In that case I might make pcol->pg_first loff_t.
>
> That would work.
>
>> Why is inode->i_index not an loff_t then?
>
> hm, what's i_index?
>

sorry, I meant page index

>> Page-index <=> byte-offset, is done all the time 12 bits does not
>> make a difference.
>
> Page indices are 32-bit on 32-bit CPUs. File offsets are 64-bit. We
> are careful to avoid the above overflow bug whenever the conversion
> from page index to file size is made. Try
>
> fgrep '(loff_t)' mm/*.c
>

right! which means that Linux does not support 64 bit offsets on
32 bit, but only 44 bits. But I guess exofs will not change that.

Boaz

2009-03-31 10:31:54

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCH 6/8] exofs: super_operations and file_system_type

On 03/31/2009 11:04 AM, Andrew Morton wrote:
> On Wed, 18 Mar 2009 20:09:51 +0200 Boaz Harrosh <[email protected]> wrote:
>
>> This patch ties all operation vectors into a file system superblock
>> and registers the exofs file_system_type at module's load time.
>>
>> * The file system control block (AKA on-disk superblock) resides in
>> an object with a special ID (defined in common.h).
>> Information included in the file system control block is used to
>> fill the in-memory superblock structure at mount time. This object
>> is created before the file system is used by mkexofs.c It contains
>> information such as:
>> - The file system's magic number
>> - The next inode number to be allocated
>>
>>
>> ...
>>
>> +static int exofs_statfs(struct dentry *dentry, struct kstatfs *buf)
>> +{
>> + struct super_block *sb = dentry->d_sb;
>> + struct exofs_sb_info *sbi = sb->s_fs_info;
>> + struct osd_obj_id obj = {sbi->s_pid, 0};
>> + struct osd_attr attrs[] = {
>> + ATTR_DEF(OSD_APAGE_PARTITION_QUOTAS,
>> + OSD_ATTR_PQ_CAPACITY_QUOTA, sizeof(__be64)),
>> + ATTR_DEF(OSD_APAGE_PARTITION_INFORMATION,
>> + OSD_ATTR_PI_USED_CAPACITY, sizeof(__be64)),
>> + };
>> + uint64_t capacity = ~0;
>> + uint64_t used = ~0;
>
> My brain hurts.
>
> ~0 is signed 0xffffffff.
>
> When assigning to a u64 it gets signed extended to signed
> 0xffffffffffffffff and then converted to unsigned 0xffffffffffffffff.
>
> I think. Just as with plain old "-1". Perhaps using plain old "-1"
> would be clearer here.
>
>> ...
>>
>> +const struct super_operations exofs_sops = {
>
> This can in fact be made static, I believe.
>
>> ...
>>
>

OK, OK.

Thanks will fix
Boaz

2009-03-31 18:53:31

by Benny Halevy

[permalink] [raw]
Subject: Re: [osd-dev] [PATCH 6/8] exofs: super_operations and file_system_type

On Mar. 31, 2009, 11:04 +0300, Andrew Morton <[email protected]> wrote:
> On Wed, 18 Mar 2009 20:09:51 +0200 Boaz Harrosh <[email protected]> wrote:
>
>> This patch ties all operation vectors into a file system superblock
>> and registers the exofs file_system_type at module's load time.
>>
>> * The file system control block (AKA on-disk superblock) resides in
>> an object with a special ID (defined in common.h).
>> Information included in the file system control block is used to
>> fill the in-memory superblock structure at mount time. This object
>> is created before the file system is used by mkexofs.c It contains
>> information such as:
>> - The file system's magic number
>> - The next inode number to be allocated
>>
>>
>> ...
>>
>> +static int exofs_statfs(struct dentry *dentry, struct kstatfs *buf)
>> +{
>> + struct super_block *sb = dentry->d_sb;
>> + struct exofs_sb_info *sbi = sb->s_fs_info;
>> + struct osd_obj_id obj = {sbi->s_pid, 0};
>> + struct osd_attr attrs[] = {
>> + ATTR_DEF(OSD_APAGE_PARTITION_QUOTAS,
>> + OSD_ATTR_PQ_CAPACITY_QUOTA, sizeof(__be64)),
>> + ATTR_DEF(OSD_APAGE_PARTITION_INFORMATION,
>> + OSD_ATTR_PI_USED_CAPACITY, sizeof(__be64)),
>> + };
>> + uint64_t capacity = ~0;
>> + uint64_t used = ~0;
>
> My brain hurts.
>
> ~0 is signed 0xffffffff.
>
> When assigning to a u64 it gets signed extended to signed
> 0xffffffffffffffff and then converted to unsigned 0xffffffffffffffff.

Right (I think, I'm not sure in what order)

>
> I think. Just as with plain old "-1". Perhaps using plain old "-1"
> would be clearer here.

or maybe ~0ULL or ~(uint64_t)0 to be extremely anal about it.

Benny

>
>> ...
>>
>> +const struct super_operations exofs_sops = {
>
> This can in fact be made static, I believe.
>
>> ...
>>
>
> _______________________________________________
> osd-dev mailing list
> [email protected]
> http://mailman.open-osd.org/mailman/listinfo/osd-dev

2009-04-01 08:07:52

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [osd-dev] [PATCH 6/8] exofs: super_operations and file_system_type

On 03/31/2009 09:52 PM, Benny Halevy wrote:
> On Mar. 31, 2009, 11:04 +0300, Andrew Morton <[email protected]> wrote:
>> ~0 is signed 0xffffffff.
>>
>> When assigning to a u64 it gets signed extended to signed
>> 0xffffffffffffffff and then converted to unsigned 0xffffffffffffffff.
>
> Right (I think, I'm not sure in what order)
>
>> I think. Just as with plain old "-1". Perhaps using plain old "-1"
>> would be clearer here.
>
> or maybe ~0ULL or ~(uint64_t)0 to be extremely anal about it.
>
> Benny
>

There is only one right way => ULLONG_MAX. Takes care of the human factor
too. (BTW that one is defined (~0ULL))

Thanks
Boaz

2009-04-01 09:06:23

by Benny Halevy

[permalink] [raw]
Subject: Re: [osd-dev] [PATCH 6/8] exofs: super_operations and file_system_type

On Apr. 01, 2009, 11:05 +0300, Boaz Harrosh <[email protected]> wrote:
> On 03/31/2009 09:52 PM, Benny Halevy wrote:
>> On Mar. 31, 2009, 11:04 +0300, Andrew Morton <[email protected]> wrote:
>>> ~0 is signed 0xffffffff.
>>>
>>> When assigning to a u64 it gets signed extended to signed
>>> 0xffffffffffffffff and then converted to unsigned 0xffffffffffffffff.
>> Right (I think, I'm not sure in what order)
>>
>>> I think. Just as with plain old "-1". Perhaps using plain old "-1"
>>> would be clearer here.
>> or maybe ~0ULL or ~(uint64_t)0 to be extremely anal about it.
>>
>> Benny
>>
>
> There is only one right way => ULLONG_MAX. Takes care of the human factor
> too. (BTW that one is defined (~0ULL))

Ideally, since the variable is a uint64_t, you'd want a U64_MAX.
unsigned long long may, at some point, be larger than uint64 on
some architectures. With the available defs ~(uint64_t)0 or even
just ~0 seem more portable...

Benny

>
> Thanks
> Boaz
>

2009-04-01 09:24:16

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCHSET 0/8 version 4] exofs for kernel 2.6.30

Boaz Harrosh wrote:
> If anyone wants to actually run this code and test it
> then please start reading at:
> http://open-osd.org
> You will need to checkout the out-of-tree git (below) for the user-mode utilities.
> Also the exofs.txt file in patch 7/8 should help


hum... trying to play with this. If you want exofs to go upstream, I
think you should have a release tarball containing the user-mode utils
posted somewhere. Would make life a lot easier, both on early adopters
and also on distribution packagers.

Jeff


2009-04-01 11:23:40

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCHSET 0/8 version 4] exofs for kernel 2.6.30

On 04/01/2009 12:23 PM, Jeff Garzik wrote:
> Boaz Harrosh wrote:
>> If anyone wants to actually run this code and test it
>> then please start reading at:
>> http://open-osd.org
>> You will need to checkout the out-of-tree git (below) for the user-mode utilities.
>> Also the exofs.txt file in patch 7/8 should help
>
>
> hum... trying to play with this. If you want exofs to go upstream, I
> think you should have a release tarball containing the user-mode utils
> posted somewhere. Would make life a lot easier, both on early adopters
> and also on distribution packagers.
>
> Jeff

You are absolutely right, once 2.6.30 will be out there will not be a need
to compile Kernel modules.

About the binary package. I must admit I'm a total novice. What do I need to do?

One x86_32, one x86_64? What glibc, does it matter what distro I compile on?

I want to have a "make rpm" and "make deb" but I've never done that, I was hoping
someone more experienced would pick it up.

But you are right I have it on my schedule to work on the Wiki, installation and
init-scripts, right after this final push to mainline.

Sorry, for not having this already
Boaz

BTW:
Source tar balls are available from the gitweb GUI by pressing on the
"snapshot" link next to any commit. I should link to it from the WiKi

Best regards
Boaz

2009-04-02 00:39:36

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCHSET 0/8 version 4] exofs for kernel 2.6.30

Boaz Harrosh wrote:
> On 04/01/2009 12:23 PM, Jeff Garzik wrote:
>> Boaz Harrosh wrote:
>>> If anyone wants to actually run this code and test it
>>> then please start reading at:
>>> http://open-osd.org
>>> You will need to checkout the out-of-tree git (below) for the user-mode utilities.
>>> Also the exofs.txt file in patch 7/8 should help
>>
>> hum... trying to play with this. If you want exofs to go upstream, I
>> think you should have a release tarball containing the user-mode utils
>> posted somewhere. Would make life a lot easier, both on early adopters
>> and also on distribution packagers.
>>
>> Jeff
>
> You are absolutely right, once 2.6.30 will be out there will not be a need
> to compile Kernel modules.
>
> About the binary package. I must admit I'm a total novice. What do I need to do?

All you need on your end is a sane setup for installation, including
building of shared libraries and installing necessary headers for
userland programs.

Each individual distribution can easily package your exofs-utils into a
deb or RPM.

Some of my projects have to do this. Here is one way, the highly
standardized GNU autotools.

Take a look at autogen.sh, configure.ac, Makefile.am,
include/Makefile.am and lib/Makefile.am from
git://git.kernel.org/pub/scm/daemon/distsrv/chunkd.git

That demonstrates how to handle building and installing a shared
library, header files and programs.

A lot of people dislike GNU autotools, but it's main benefit here is
that Debian/Red Hat/Novell/Canonical/etc. are well-versed in creating
.deb or .rpm from GNU autotools builds. It makes integration into a
Linux distribution much easier.


> BTW:
> Source tar balls are available from the gitweb GUI by pressing on the
> "snapshot" link next to any commit. I should link to it from the WiKi

Oh yeah, I forgot about that.

Jeff


2009-04-02 12:51:57

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCHSET 0/8 version 4] exofs for kernel 2.6.30

On 04/02/2009 03:39 AM, Jeff Garzik wrote:
> Boaz Harrosh wrote:
>> On 04/01/2009 12:23 PM, Jeff Garzik wrote:
>>> Boaz Harrosh wrote:
>>>> If anyone wants to actually run this code and test it
>>>> then please start reading at:
>>>> http://open-osd.org
>>>> You will need to checkout the out-of-tree git (below) for the user-mode utilities.
>>>> Also the exofs.txt file in patch 7/8 should help
>>> hum... trying to play with this. If you want exofs to go upstream, I
>>> think you should have a release tarball containing the user-mode utils
>>> posted somewhere. Would make life a lot easier, both on early adopters
>>> and also on distribution packagers.
>>>
>>> Jeff
>> You are absolutely right, once 2.6.30 will be out there will not be a need
>> to compile Kernel modules.
>>
>> About the binary package. I must admit I'm a total novice. What do I need to do?
>
> All you need on your end is a sane setup for installation, including
> building of shared libraries and installing necessary headers for
> userland programs.
>
> Each individual distribution can easily package your exofs-utils into a
> deb or RPM.
>
> Some of my projects have to do this. Here is one way, the highly
> standardized GNU autotools.
>
> Take a look at autogen.sh, configure.ac, Makefile.am,
> include/Makefile.am and lib/Makefile.am from
> git://git.kernel.org/pub/scm/daemon/distsrv/chunkd.git
>
> That demonstrates how to handle building and installing a shared
> library, header files and programs.
>
> A lot of people dislike GNU autotools, but it's main benefit here is
> that Debian/Red Hat/Novell/Canonical/etc. are well-versed in creating
> .deb or .rpm from GNU autotools builds. It makes integration into a
> Linux distribution much easier.
>

This is precious information for me, I'll have a look and copy above procedure
for open-osd, thanks, Next beer is on me.

>
>> BTW:
>> Source tar balls are available from the gitweb GUI by pressing on the
>> "snapshot" link next to any commit. I should link to it from the WiKi
>
> Oh yeah, I forgot about that.
>
> Jeff
>

Thank you
Boaz