exofs: an OSD-based file system.
What's new since the RFC:
First - The name. As Jeff suggested, osdfs is too generic a name for any
particular OSD file system. So Avishay Traeger decided he would like to
call it exofs - extended object file system - which credits its ext2
origin.
2nd - Move to .write_begin/.write_end from the old .prepare_write/.commit_write,
which were just removed in the 2.6.28-rc1 Kernel.
3rd - Use the new osd_req_decode_sense() API to ignore read errors only
when we expect them, i.e. reads past the end of an object. (This used to work
with the old IBM initiator.)
4th - Fix a NUL-termination bug with symlinks.
5th - file_fsync does not work properly for non-block-device filesystems,
so open-code it in an exofs_fsync.
6th - Linux's default for .flush, called at file close, is to do nothing.
We don't like that, especially for networked OSD (iSCSI), so implement
exofs_flush by calling the above exofs_fsync.
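To make the 3rd item concrete, here is a minimal user-space sketch of the
sense-to-errno translation that check_ok() in patch 1/9 is meant to perform.
The enum names and sketch_sense_to_errno() are hypothetical stand-ins for the
osd library's sense fields, not its API; only the errno mapping follows the
patch. The -EFAULT case is presumably the "read past the end of the object"
error a reader is allowed to ignore.

```c
#include <errno.h>

/* Hypothetical stand-ins for the decoded sense information. */
enum sketch_sense_code {
	SKETCH_INVALID_FIELD_IN_CDB,
	SKETCH_QUOTA_ERROR,
	SKETCH_SENSE_OTHER
};

enum sketch_cdb_field {
	SKETCH_FIELD_STARTING_BYTE,
	SKETCH_FIELD_OBJECT_ID,
	SKETCH_FIELD_OTHER
};

/*
 * Translate a decoded sense result to a Linux error code.
 * failed == 0 means the command completed without error.
 */
static int sketch_sense_to_errno(int failed, enum sketch_sense_code code,
				 enum sketch_cdb_field field)
{
	if (!failed)
		return 0;

	if (code == SKETCH_INVALID_FIELD_IN_CDB) {
		if (field == SKETCH_FIELD_STARTING_BYTE)
			return -EFAULT;	/* offset outside the object */
		if (field == SKETCH_FIELD_OBJECT_ID)
			return -ENOENT;	/* no such object */
		return -EINVAL;
	}
	if (code == SKETCH_QUOTA_ERROR)
		return -ENOSPC;

	return -EIO;
}
```

A read path can then compare the result against -EFAULT and treat only that
value as expected, while every other non-zero value stays a real failure.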
This patchset depends on the open-osd initiator library, which must
be accepted into Linux first.
(git://git.open-osd.org/linux-open-osd.git linux-next)
Andrew, Al?
Please, which maintainer should a filesystem like this go through?
Our intention with exofs is to make it exportable by the Linux
pNFS server, as a reference implementation of a pNFS-object-layout
server. A pNFS-objects client implementation is also in the works.
(See all about pNFS in Linux at:
http://wiki.linux-nfs.org/wiki/index.php/PNFS_prototype_design)
exofs was originally developed by Avishay Traeger <[email protected]>
from IBM. A very old version of it is hosted on sourceforge as the osdfs
project. The original code was based on ext2 of the 2.6.10 Kernel and ran
over IBM's old osd-initiator Linux driver.
Since then it was picked up by us, open-osd, and was both forward-ported to
the current Kernel and converted to run over our osd Kernel Library.
I have mechanically divided the code into parts, each introducing a
group of vfs function vectors, all tied together at the end into a full
filesystem. Each patch can be compiled, but it will only run at the very end.
This was done in the hope of easier reviewing.
Here is the list of patches:
[PATCH 1/9] exofs: osd Swiss army knife
[PATCH 2/9] exofs: file and file_inode operations
[PATCH 3/9] exofs: symlink_inode and fast_symlink_inode operations
[PATCH 4/9] exofs: address_space_operations
[PATCH 5/9] exofs: dir_inode and directory operations
[PATCH 6/9] exofs: super_operations and file_system_type
[PATCH 7/9] exofs: mkexofs
[PATCH 8/9] exofs: Documentation
[PATCH 9/9] fs: Add exofs to Kernel build
This patchset is also available on:
git-clone git://git.open-osd.org/linux-open-osd.git linux-next-exofs
or on the web at:
http://git.open-osd.org/gitweb.cgi?p=linux-open-osd.git;a=shortlog;h=refs/heads/linux-next-exofs
(The above tree is based on Linus 2.6.28.rc8-cefb3d0. A branch based on scsi-misc
is also present; see the osd && osd-exofs branches.)
If anyone wants to actually run this code and test it,
please start at: http://open-osd.org
The exofs.txt file in patch 8/9 should also help.
Thank you for the review
Boaz
This patch contains all the OSD infrastructure that will be used later
by the file system, along with the declarations of constants, on-disk
structures, and prototypes, and the Kbuild+Kconfig files needed to build
the exofs module.
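As a rough illustration of two of the constants declared in common.h, the
user-space sketch below mirrors the inode# <-> object ID offset and the
directory record length rounding. The SKETCH_* names and helper functions are
hypothetical and exist only for this example; the values (0x10000, an 8-byte
fixed header, 4-byte padding) are taken from the defines in this patch.

```c
#include <stdint.h>

#define SKETCH_OBJ_OFF		0x10000ULL	/* same value as EXOFS_OBJ_OFF */
#define SKETCH_DIR_PAD		4
#define SKETCH_DIR_ROUND	(SKETCH_DIR_PAD - 1)

/* inode# = object ID - offset, so object ID = inode# + offset */
static uint64_t sketch_ino_to_objid(uint64_t ino)
{
	return ino + SKETCH_OBJ_OFF;
}

static uint64_t sketch_objid_to_ino(uint64_t objid)
{
	return objid - SKETCH_OBJ_OFF;
}

/*
 * Length of a directory record: 8 bytes of fixed header
 * (inode, rec_len, name_len, file_type) plus the name,
 * rounded up to a multiple of SKETCH_DIR_PAD.
 */
static unsigned sketch_dir_rec_len(unsigned name_len)
{
	return (name_len + 8 + SKETCH_DIR_ROUND) & ~SKETCH_DIR_ROUND;
}
```

With these definitions inode 2 maps to object 0x10002 (the root directory),
and a one-character name occupies a 12-byte record.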
Signed-off-by: Boaz Harrosh <[email protected]>
---
fs/exofs/Kbuild | 30 +++++
fs/exofs/Kconfig | 13 ++
fs/exofs/common.h | 154 ++++++++++++++++++++++++
fs/exofs/exofs.h | 183 +++++++++++++++++++++++++++++
fs/exofs/osd.c | 334 +++++++++++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 714 insertions(+), 0 deletions(-)
create mode 100644 fs/exofs/Kbuild
create mode 100644 fs/exofs/Kconfig
create mode 100644 fs/exofs/common.h
create mode 100644 fs/exofs/exofs.h
create mode 100644 fs/exofs/osd.c
diff --git a/fs/exofs/Kbuild b/fs/exofs/Kbuild
new file mode 100644
index 0000000..fd3351e
--- /dev/null
+++ b/fs/exofs/Kbuild
@@ -0,0 +1,30 @@
+#
+# Kbuild for the EXOFS module
+#
+# Copyright (C) 2008 Panasas Inc. All rights reserved.
+#
+# Authors:
+# Boaz Harrosh <[email protected]>
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License version 2
+#
+# Kbuild - Gets included from the Kernel's Makefile and build system
+#
+
+ifneq ($(OSD_INC),)
+# we are built out-of-tree Kconfigure everything as on
+
+CONFIG_EXOFS_FS=m
+ccflags-y += -DCONFIG_EXOFS_FS -DCONFIG_EXOFS_FS_MODULE
+# ccflags-y += -DCONFIG_EXOFS_DEBUG
+
+# if we are built out-of-tree and the hosting kernel has OSD headers
+# then "ccflags-y +=" will not pick the out-of-tree headers. Only by doing
+# this will it work. This might break in future kernels
+KBUILD_CPPFLAGS := -I$(OSD_INC) $(KBUILD_CPPFLAGS)
+
+endif
+
+exofs-objs := osd.o
+obj-$(CONFIG_EXOFS_FS) += exofs.o
diff --git a/fs/exofs/Kconfig b/fs/exofs/Kconfig
new file mode 100644
index 0000000..86194b2
--- /dev/null
+++ b/fs/exofs/Kconfig
@@ -0,0 +1,13 @@
+config EXOFS_FS
+ tristate "exofs: OSD based file system support"
+ depends on SCSI_OSD_ULD
+ help
+ EXOFS is a file system that uses an OSD storage device,
+ as its backing storage.
+
+# Debugging-related stuff
+config EXOFS_DEBUG
+ bool "Enable debugging"
+ depends on EXOFS_FS
+ help
+ This option enables EXOFS debug prints.
diff --git a/fs/exofs/common.h b/fs/exofs/common.h
new file mode 100644
index 0000000..9a165b3
--- /dev/null
+++ b/fs/exofs/common.h
@@ -0,0 +1,154 @@
+/*
+ * Copyright (C) 2005, 2006
+ * Avishay Traeger ([email protected]) ([email protected])
+ * Copyright (C) 2005, 2006
+ * International Business Machines
+ *
+ * Copyrights for code taken from ext2:
+ * Copyright (C) 1992, 1993, 1994, 1995
+ * Remy Card ([email protected])
+ * Laboratoire MASI - Institut Blaise Pascal
+ * Universite Pierre et Marie Curie (Paris VI)
+ * from
+ * linux/fs/minix/inode.c
+ * Copyright (C) 1991, 1992 Linus Torvalds
+ *
+ * This file is part of exofs.
+ *
+ * exofs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation. Since it is based on ext2, and the only
+ * valid version of GPL for the Linux kernel is version 2, the only valid
+ * version of GPL for exofs is version 2.
+ *
+ * exofs is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with exofs; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#ifndef __EXOFS_COM_H__
+#define __EXOFS_COM_H__
+
+#include <linux/types.h>
+#include <linux/timex.h>
+
+#include <scsi/osd_attributes.h>
+#include <scsi/osd_initiator.h>
+#include <scsi/osd_sec.h>
+
+/****************************************************************************
+ * Object ID related defines
+ * NOTE: inode# = object ID - EXOFS_OBJ_OFF
+ ****************************************************************************/
+#define EXOFS_OBJ_OFF 0x10000 /* offset for objects */
+#define EXOFS_SUPER_ID 0x10000 /* object ID for on-disk superblock */
+#define EXOFS_BM_ID 0x10001 /* object ID for ID bitmap */
+#define EXOFS_ROOT_ID 0x10002 /* object ID for root directory */
+#define EXOFS_TEST_ID 0x10003 /* object ID for test object */
+
+/* exofs Application specific page/attribute */
+#ifndef OSD_PAGE_NUM_IBM_UOBJ_FS_DATA
+# define OSD_PAGE_NUM_IBM_UOBJ_FS_DATA (OSD_APAGE_APP_DEFINED_FIRST + 3)
+# define OSD_ATTR_NUM_IBM_UOBJ_FS_DATA_INODE 1
+#endif
+
+/*
+ * The maximum number of files we can have is limited by the size of the
+ * inode number. This is the largest object ID that the file system supports.
+ * Object IDs EXOFS_OBJ_OFF + 0, 1 and 2 are always in use (see above defines).
+ */
+enum {
+ EXOFS_UINT64_MAX = (~0LL),
+ EXOFS_MAX_INO_ID = (sizeof(ino_t) * 8 == 64) ? EXOFS_UINT64_MAX :
+ (1LL << (sizeof(ino_t) * 8 - 1)),
+ EXOFS_MAX_ID = (EXOFS_MAX_INO_ID - 1 - EXOFS_OBJ_OFF),
+};
+
+/****************************************************************************
+ * Misc.
+ ****************************************************************************/
+#define EXOFS_BLKSHIFT 12
+#define EXOFS_BLKSIZE (1UL << EXOFS_BLKSHIFT)
+
+/****************************************************************************
+ * superblock-related things
+ ****************************************************************************/
+#define EXOFS_SUPER_MAGIC 0x5DF5
+
+/*
+ * The file system control block - stored in an object's data (mainly, the one
+ * with ID EXOFS_SUPER_ID). This is where the in-memory superblock is stored
+ * on disk. Right now it just has a magic value, which is basically a sanity
+ * check on our ability to communicate with the object store.
+ */
+struct exofs_fscb {
+ uint32_t s_nextid; /* Highest object ID used */
+ uint32_t s_numfiles; /* Number of files on fs */
+ uint16_t s_magic; /* Magic signature */
+ uint16_t s_newfs; /* Non-zero if this is a new fs */
+};
+
+/****************************************************************************
+ * inode-related things
+ ****************************************************************************/
+#define EXOFS_IDATA 5
+
+/*
+ * The file control block - stored in an object's attributes. This is where
+ * the in-memory inode is stored on disk.
+ */
+struct exofs_fcb {
+ uint64_t i_size; /* Size of the file */
+ uint16_t i_mode; /* File mode */
+ uint16_t i_links_count; /* Links count */
+ uint32_t i_uid; /* Owner Uid */
+ uint32_t i_gid; /* Group Id */
+ uint32_t i_atime; /* Access time */
+ uint32_t i_ctime; /* Creation time */
+ uint32_t i_mtime; /* Modification time */
+ uint32_t i_flags; /* File flags */
+ uint32_t i_version; /* File version */
+ uint32_t i_generation; /* File version (for NFS) */
+ uint32_t i_data[EXOFS_IDATA]; /* Short symlink names and device #s */
+};
+
+#define EXOFS_INO_ATTR_SIZE sizeof(struct exofs_fcb)
+
+/****************************************************************************
+ * dentry-related things
+ ****************************************************************************/
+#define EXOFS_NAME_LEN 255
+
+/*
+ * The on-disk directory entry
+ */
+struct exofs_dir_entry {
+ uint32_t inode; /* inode number */
+ uint16_t rec_len; /* directory entry length */
+ uint8_t name_len; /* name length */
+ uint8_t file_type; /* umm...file type */
+ char name[EXOFS_NAME_LEN]; /* file name */
+};
+
+enum {
+ EXOFS_FT_UNKNOWN,
+ EXOFS_FT_REG_FILE,
+ EXOFS_FT_DIR,
+ EXOFS_FT_CHRDEV,
+ EXOFS_FT_BLKDEV,
+ EXOFS_FT_FIFO,
+ EXOFS_FT_SOCK,
+ EXOFS_FT_SYMLINK,
+ EXOFS_FT_MAX
+};
+
+#define EXOFS_DIR_PAD 4
+#define EXOFS_DIR_ROUND (EXOFS_DIR_PAD - 1)
+#define EXOFS_DIR_REC_LEN(name_len) (((name_len) + 8 + EXOFS_DIR_ROUND) & \
+ ~EXOFS_DIR_ROUND)
+#endif
diff --git a/fs/exofs/exofs.h b/fs/exofs/exofs.h
new file mode 100644
index 0000000..8534450
--- /dev/null
+++ b/fs/exofs/exofs.h
@@ -0,0 +1,183 @@
+/*
+ * Copyright (C) 2005, 2006
+ * Avishay Traeger ([email protected]) ([email protected])
+ * Copyright (C) 2005, 2006
+ * International Business Machines
+ *
+ * Copyrights for code taken from ext2:
+ * Copyright (C) 1992, 1993, 1994, 1995
+ * Remy Card ([email protected])
+ * Laboratoire MASI - Institut Blaise Pascal
+ * Universite Pierre et Marie Curie (Paris VI)
+ * from
+ * linux/fs/minix/inode.c
+ * Copyright (C) 1991, 1992 Linus Torvalds
+ *
+ * This file is part of exofs.
+ *
+ * exofs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation. Since it is based on ext2, and the only
+ * valid version of GPL for the Linux kernel is version 2, the only valid
+ * version of GPL for exofs is version 2.
+ *
+ * exofs is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with exofs; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include <linux/fs.h>
+#include <linux/time.h>
+#include "common.h"
+
+#ifndef __EXOFS_H__
+#define __EXOFS_H__
+
+#define EXOFS_ERR(fmt, a...) printk(KERN_ERR "exofs: " fmt, ##a)
+
+#ifdef CONFIG_EXOFS_DEBUG
+#define EXOFS_DBGMSG(fmt, a...) \
+ printk(KERN_NOTICE "exofs @%s:%d: " fmt, __func__, __LINE__, ##a)
+#else
+#define EXOFS_DBGMSG(fmt, a...) \
+ do {} while (0)
+#endif
+
+/* u64 has problems with printk; this casts it to unsigned long long */
+#define _LLU(x) (unsigned long long)(x)
+
+/*
+ * our extension to the in-memory superblock
+ */
+struct exofs_sb_info {
+ struct osd_dev *s_dev; /* returned by get_osd_dev */
+ uint64_t s_pid; /* partition ID of file system*/
+ int s_timeout; /* timeout for OSD operations */
+ uint32_t s_nextid; /* highest object ID used */
+ uint32_t s_numfiles; /* number of files on fs */
+ spinlock_t s_next_gen_lock; /* spinlock for gen # update */
+ u32 s_next_generation; /* next gen # to use */
+ atomic_t s_curr_pending; /* number of pending commands */
+ uint8_t s_cred[OSD_CAP_LEN]; /* all-powerful credential */
+};
+
+/*
+ * our inode flags
+ */
+#ifdef ARCH_HAS_ATOMIC_UNSIGNED
+typedef unsigned exofs_iflags_t;
+#else
+typedef unsigned long exofs_iflags_t;
+#endif
+
+#define OBJ_2BCREATED 0 /* object will be created soon*/
+#define OBJ_CREATED 1 /* object has been created on the osd*/
+
+#define Obj2BCreated(oi) \
+ test_bit(OBJ_2BCREATED, &(oi->i_flags))
+#define SetObj2BCreated(oi) \
+ set_bit(OBJ_2BCREATED, &(oi->i_flags))
+
+#define ObjCreated(oi) \
+ test_bit(OBJ_CREATED, &(oi->i_flags))
+#define SetObjCreated(oi) \
+ set_bit(OBJ_CREATED, &(oi->i_flags))
+
+/*
+ * our extension to the in-memory inode
+ */
+struct exofs_i_info {
+ exofs_iflags_t i_flags; /* various atomic flags */
+ __le32 i_data[EXOFS_IDATA];/*short symlink names and device #s*/
+ uint32_t i_dir_start_lookup; /* which page to start lookup */
+ wait_queue_head_t i_wq; /* wait queue for inode */
+ uint64_t i_commit_size; /* the object's written length */
+ uint8_t i_cred[OSD_CAP_LEN];/* all-powerful credential */
+ struct inode vfs_inode; /* normal in-memory inode */
+};
+
+/*
+ * get to our inode from the vfs inode
+ */
+static inline struct exofs_i_info *EXOFS_I(struct inode *inode)
+{
+ return container_of(inode, struct exofs_i_info, vfs_inode);
+}
+
+/*************************
+ * function declarations *
+ *************************/
+/* osd.c */
+void make_credential(uint8_t[], uint64_t, uint64_t);
+int check_ok(struct osd_request *);
+int exofs_sync_op(struct osd_request *, int, uint8_t *);
+int exofs_async_op(struct osd_request *, osd_req_done_fn *, void *, char *);
+
+int prepare_get_attr_list_add_entry(struct osd_request *req,
+ uint32_t page_num,
+ uint32_t attr_num,
+ uint32_t attr_len);
+int prepare_set_attr_list_add_entry(struct osd_request *req,
+ uint32_t page_num,
+ uint32_t attr_num,
+ uint16_t attr_len,
+ const unsigned char *attr_val);
+int extract_next_attr_from_req(struct osd_request *req,
+ uint32_t *page_num, uint32_t *attr_num,
+ uint16_t *attr_len, uint8_t **attr_val);
+struct osd_request *prepare_osd_format_lun(struct osd_dev *dev,
+ uint64_t formatted_capacity);
+struct osd_request *prepare_osd_create_partition(struct osd_dev *dev,
+ uint64_t requested_id);
+struct osd_request *prepare_osd_remove_partition(struct osd_dev *dev,
+ uint64_t requested_id);
+struct osd_request *prepare_osd_create(struct osd_dev *dev,
+ uint64_t part_id,
+ uint64_t requested_id);
+struct osd_request *prepare_osd_remove(struct osd_dev *dev,
+ uint64_t part_id,
+ uint64_t obj_id);
+struct osd_request *prepare_osd_set_attr(struct osd_dev *dev,
+ uint64_t part_id,
+ uint64_t obj_id);
+struct osd_request *prepare_osd_get_attr(struct osd_dev *dev,
+ uint64_t part_id,
+ uint64_t obj_id);
+struct osd_request *prepare_osd_read(struct osd_dev *dev,
+ uint64_t part_id,
+ uint64_t obj_id,
+ uint64_t length,
+ uint64_t offset,
+ int cmd_data_use_sg,
+ unsigned char *cmd_data);
+struct osd_request *prepare_osd_write(struct osd_dev *dev,
+ uint64_t part_id,
+ uint64_t obj_id,
+ uint64_t length,
+ uint64_t offset,
+ int cmd_data_use_sg,
+ const unsigned char *cmd_data);
+struct osd_request *prepare_osd_list(struct osd_dev *dev,
+ uint64_t part_id,
+ uint32_t list_id,
+ uint64_t alloc_len,
+ uint64_t initial_obj_id,
+ int use_sg,
+ void *data);
+int extract_list_from_req(struct osd_request *req,
+ uint64_t *total_matches_p,
+ uint64_t *num_ids_retrieved_p,
+ uint64_t *list_of_ids_p[],
+ int *is_list_of_partitions_p,
+ int *list_isnt_up_to_date_p,
+ uint64_t *continuation_tag_p,
+ uint32_t *list_id_for_more_p);
+
+void free_osd_req(struct osd_request *req);
+
+#endif
diff --git a/fs/exofs/osd.c b/fs/exofs/osd.c
new file mode 100644
index 0000000..3859d3e
--- /dev/null
+++ b/fs/exofs/osd.c
@@ -0,0 +1,334 @@
+/*
+ * Copyright (C) 2005, 2006
+ * Avishay Traeger ([email protected]) ([email protected])
+ * Copyright (C) 2005, 2006
+ * International Business Machines
+ *
+ * This file is part of exofs.
+ *
+ * exofs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation. Since it is based on ext2, and the only
+ * valid version of GPL for the Linux kernel is version 2, the only valid
+ * version of GPL for exofs is version 2.
+ *
+ * exofs is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with exofs; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include <scsi/scsi_device.h>
+#include <scsi/osd_sense.h>
+
+#include "exofs.h"
+
+int check_ok(struct osd_request *req)
+{
+ struct osd_sense_info osi;
+ int ret = osd_req_decode_sense(req, &osi);
+
+ if (ret) { /* translate to Linux codes */
+ if (osi.additional_code == scsi_invalid_field_in_cdb) {
+ if (osi.cdb_field_offset == OSD_CFO_STARTING_BYTE)
+ ret = -EFAULT;
+ else if (osi.cdb_field_offset == OSD_CFO_OBJECT_ID)
+ ret = -ENOENT;
+ else
+ ret = -EINVAL;
+ } else if (osi.additional_code == osd_quota_error)
+ ret = -ENOSPC;
+ else
+ ret = -EIO;
+ }
+
+ return ret;
+}
+
+void make_credential(uint8_t cred_a[OSD_CAP_LEN], uint64_t pid, uint64_t oid)
+{
+ struct osd_obj_id obj = {
+ .partition = pid,
+ .id = oid
+ };
+
+ osd_sec_init_nosec_doall_caps(cred_a, &obj, false, true);
+}
+
+/*
+ * Perform a synchronous OSD operation.
+ */
+int exofs_sync_op(struct osd_request *req, int timeout, uint8_t *credential)
+{
+ int ret;
+
+ req->timeout = timeout;
+ ret = osd_finalize_request(req, 0, credential, NULL);
+ if (ret) {
+ EXOFS_DBGMSG("Failed to osd_finalize_request() => %d\n", ret);
+ return ret;
+ }
+
+ ret = osd_execute_request(req);
+
+ if (ret)
+ EXOFS_DBGMSG("osd_execute_request() => %d\n", ret);
+ /* osd_req_decode_sense(or, ret); */
+ return ret;
+}
+
+/*
+ * Perform an asynchronous OSD operation.
+ */
+int exofs_async_op(struct osd_request *req, osd_req_done_fn *async_done,
+ void *caller_context, char *credential)
+{
+ int ret;
+
+ ret = osd_finalize_request(req, 0, credential, NULL);
+ if (ret) {
+ EXOFS_DBGMSG("Failed to osd_finalize_request() => %d\n", ret);
+ return ret;
+ }
+
+ ret = osd_execute_request_async(req, async_done, caller_context);
+
+ if (ret)
+ EXOFS_DBGMSG("osd_execute_request_async() => %d\n", ret);
+ return ret;
+}
+
+int prepare_get_attr_list_add_entry(struct osd_request *req,
+ uint32_t page_num,
+ uint32_t attr_num,
+ uint32_t attr_len)
+{
+ struct osd_attr attr = {
+ .page = page_num,
+ .attr_id = attr_num,
+ .len = attr_len,
+ };
+
+ return osd_req_add_get_attr_list(req, &attr, 1);
+}
+
+int prepare_set_attr_list_add_entry(struct osd_request *req,
+ uint32_t page_num,
+ uint32_t attr_num,
+ uint16_t attr_len,
+ const unsigned char *attr_val)
+{
+ struct osd_attr attr = {
+ .page = page_num,
+ .attr_id = attr_num,
+ .len = attr_len,
+ .val_ptr = (u8 *)attr_val,
+ };
+
+ return osd_req_add_set_attr_list(req, &attr, 1);
+}
+
+int extract_next_attr_from_req(struct osd_request *req,
+ uint32_t *page_num, uint32_t *attr_num,
+ uint16_t *attr_len, uint8_t **attr_val)
+{
+ struct osd_attr attr = {.page = 0}; /* start with zeros */
+ void *iter = NULL;
+ int nelem;
+
+ do {
+ nelem = 1;
+ osd_req_decode_get_attr_list(req, &attr, &nelem, &iter);
+ if ((attr.page == *page_num) && (attr.attr_id == *attr_num)) {
+ *attr_len = attr.len;
+ *attr_val = attr.val_ptr;
+ return 0;
+ }
+ } while (iter);
+
+ return -EIO;
+}
+
+struct osd_request *prepare_osd_format_lun(struct osd_dev *dev,
+ uint64_t formatted_capacity)
+{
+ struct osd_request *or = osd_start_request(dev, GFP_KERNEL);
+
+ if (!or)
+ return NULL;
+
+ osd_req_format(or, formatted_capacity);
+
+ return or;
+}
+
+struct osd_request *prepare_osd_create_partition(struct osd_dev *dev,
+ uint64_t requested_id)
+{
+ struct osd_request *or = osd_start_request(dev, GFP_KERNEL);
+
+ if (!or)
+ return NULL;
+
+ osd_req_create_partition(or, requested_id);
+
+ return or;
+}
+
+struct osd_request *prepare_osd_remove_partition(struct osd_dev *dev,
+ uint64_t requested_id)
+{
+ struct osd_request *or = osd_start_request(dev, GFP_KERNEL);
+
+ if (!or)
+ return NULL;
+
+ osd_req_remove_partition(or, requested_id);
+
+ return or;
+}
+
+struct osd_request *prepare_osd_create(struct osd_dev *dev,
+ uint64_t part_id,
+ uint64_t requested_id)
+{
+ struct osd_obj_id obj = {
+ .partition = part_id,
+ .id = requested_id
+ };
+ struct osd_request *or = osd_start_request(dev, GFP_KERNEL);
+
+ if (!or)
+ return NULL;
+
+ osd_req_create_object(or, &obj);
+
+ return or;
+}
+
+struct osd_request *prepare_osd_remove(struct osd_dev *dev,
+ uint64_t part_id,
+ uint64_t obj_id)
+{
+ struct osd_obj_id obj = {
+ .partition = part_id,
+ .id = obj_id
+ };
+ struct osd_request *or = osd_start_request(dev, GFP_KERNEL);
+
+ if (!or)
+ return NULL;
+
+ osd_req_remove_object(or, &obj);
+
+ return or;
+}
+
+struct osd_request *prepare_osd_set_attr(struct osd_dev *dev,
+ uint64_t part_id,
+ uint64_t obj_id)
+{
+ struct osd_obj_id obj = {
+ .partition = part_id,
+ .id = obj_id
+ };
+ struct osd_request *or = osd_start_request(dev, GFP_KERNEL);
+
+ if (!or)
+ return NULL;
+
+ osd_req_set_attributes(or, &obj);
+
+ return or;
+}
+
+struct osd_request *prepare_osd_get_attr(struct osd_dev *dev,
+ uint64_t part_id,
+ uint64_t obj_id)
+{
+ struct osd_obj_id obj = {
+ .partition = part_id,
+ .id = obj_id
+ };
+ struct osd_request *or = osd_start_request(dev, GFP_KERNEL);
+
+ if (!or)
+ return NULL;
+
+ osd_req_get_attributes(or, &obj);
+
+ return or;
+}
+
+struct osd_request *prepare_osd_read(struct osd_dev *dev,
+ uint64_t part_id,
+ uint64_t obj_id,
+ uint64_t length,
+ uint64_t offset,
+ int cmd_data_use_sg,
+ unsigned char *cmd_data)
+{
+ struct osd_obj_id obj = {
+ .partition = part_id,
+ .id = obj_id
+ };
+ struct osd_request *or = osd_start_request(dev, GFP_KERNEL);
+ struct request_queue *req_q = dev->scsi_device->request_queue;
+ struct bio *bio;
+
+ if (!or)
+ return NULL;
+
+ BUG_ON(cmd_data_use_sg);
+ bio = bio_map_kern(req_q, cmd_data, length, or->alloc_flags);
+ if (!bio) {
+ osd_end_request(or);
+ return NULL;
+ }
+
+ osd_req_read(or, &obj, bio, offset);
+ EXOFS_DBGMSG("osd_req_read(p=%llX, ob=%llX, l=%llu, of=%llu)\n",
+ _LLU(part_id), _LLU(obj_id), _LLU(length), _LLU(offset));
+ return or;
+}
+
+struct osd_request *prepare_osd_write(struct osd_dev *dev,
+ uint64_t part_id,
+ uint64_t obj_id,
+ uint64_t length,
+ uint64_t offset,
+ int cmd_data_use_sg,
+ const unsigned char *cmd_data)
+{
+ struct osd_obj_id obj = {
+ .partition = part_id,
+ .id = obj_id
+ };
+ struct osd_request *or = osd_start_request(dev, GFP_KERNEL);
+ struct request_queue *req_q = dev->scsi_device->request_queue;
+ struct bio *bio;
+
+ if (!or)
+ return NULL;
+
+ BUG_ON(cmd_data_use_sg);
+ bio = bio_map_kern(req_q, (u8 *)cmd_data, length, or->alloc_flags);
+ if (!bio) {
+ osd_end_request(or);
+ return NULL;
+ }
+
+ osd_req_write(or, &obj, bio, offset);
+ EXOFS_DBGMSG("osd_req_write(p=%llX, ob=%llX, l=%llu, of=%llu)\n",
+ _LLU(part_id), _LLU(obj_id), _LLU(length), _LLU(offset));
+ return or;
+}
+
+void free_osd_req(struct osd_request *req)
+{
+ osd_end_request(req);
+}
--
1.6.0.1
Implementation of the file_operations and inode_operations for
regular data files.
Most file_operations are generic vfs implementations except:
- exofs_truncate will truncate the OSD object as well
- Generic file_fsync is not good for non-block devices, so open-code it
- The default for .flush in Linux is to do nothing, so call exofs_fsync
on the file.
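The fsync path in this patch runs two steps, page write-back followed by the
generic sync, and reports the first failure using the GNU "a ? : b"
shorthand. A minimal user-space sketch of that error-combining rule in
portable C follows; sketch_first_error() is a hypothetical helper name and
not part of the patch.

```c
/*
 * Combine the results of two steps that must both run:
 * report the first failure, or 0 if both succeeded.
 * Portable equivalent of the GNU extension "ret1 ? : ret2".
 */
static int sketch_first_error(int ret1, int ret2)
{
	return ret1 ? ret1 : ret2;
}
```

Both steps always execute; only the reporting prefers the earlier error, so a
write-back failure is not masked by a later successful sync.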
Signed-off-by: Boaz Harrosh <[email protected]>
---
fs/exofs/Kbuild | 2 +-
fs/exofs/exofs.h | 11 ++++
fs/exofs/file.c | 77 +++++++++++++++++++++++++++++
fs/exofs/inode.c | 140 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 229 insertions(+), 1 deletions(-)
create mode 100644 fs/exofs/file.c
create mode 100644 fs/exofs/inode.c
diff --git a/fs/exofs/Kbuild b/fs/exofs/Kbuild
index fd3351e..d9dedc9 100644
--- a/fs/exofs/Kbuild
+++ b/fs/exofs/Kbuild
@@ -26,5 +26,5 @@ KBUILD_CPPFLAGS := -I$(OSD_INC) $(KBUILD_CPPFLAGS)
endif
-exofs-objs := osd.o
+exofs-objs := osd.o inode.o file.o
obj-$(CONFIG_EXOFS_FS) += exofs.o
diff --git a/fs/exofs/exofs.h b/fs/exofs/exofs.h
index 8534450..f11250c 100644
--- a/fs/exofs/exofs.h
+++ b/fs/exofs/exofs.h
@@ -180,4 +180,15 @@ int extract_list_from_req(struct osd_request *req,
void free_osd_req(struct osd_request *req);
+/* inode.c */
+void exofs_truncate(struct inode *inode);
+int exofs_setattr(struct dentry *, struct iattr *);
+
+/*********************
+ * operation vectors *
+ *********************/
+/* file.c */
+extern struct inode_operations exofs_file_inode_operations;
+extern struct file_operations exofs_file_operations;
+
#endif
diff --git a/fs/exofs/file.c b/fs/exofs/file.c
new file mode 100644
index 0000000..073dcf7
--- /dev/null
+++ b/fs/exofs/file.c
@@ -0,0 +1,77 @@
+/*
+ * Copyright (C) 2005, 2006
+ * Avishay Traeger ([email protected]) ([email protected])
+ * Copyright (C) 2005, 2006
+ * International Business Machines
+ *
+ * Copyrights for code taken from ext2:
+ * Copyright (C) 1992, 1993, 1994, 1995
+ * Remy Card ([email protected])
+ * Laboratoire MASI - Institut Blaise Pascal
+ * Universite Pierre et Marie Curie (Paris VI)
+ * from
+ * linux/fs/minix/inode.c
+ * Copyright (C) 1991, 1992 Linus Torvalds
+ *
+ * This file is part of exofs.
+ *
+ * exofs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation. Since it is based on ext2, and the only
+ * valid version of GPL for the Linux kernel is version 2, the only valid
+ * version of GPL for exofs is version 2.
+ *
+ * exofs is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with exofs; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include <linux/buffer_head.h>
+
+#include "exofs.h"
+
+static int exofs_release_file(struct inode *inode, struct file *filp)
+{
+ return 0;
+}
+
+int exofs_file_fsync(struct file *filp, struct dentry *dentry, int datasync)
+{
+ int ret1, ret2;
+ struct address_space *mapping = filp->f_mapping;
+
+ ret1 = filemap_write_and_wait(mapping);
+ ret2 = file_fsync(filp, dentry, datasync);
+
+ return ret1 ? : ret2;
+}
+
+static int exofs_flush(struct file *file, fl_owner_t id)
+{
+ exofs_file_fsync(file, file->f_path.dentry, 1);
+ /* TODO: Flush the OSD target */
+ return 0;
+}
+
+struct file_operations exofs_file_operations = {
+ .llseek = generic_file_llseek,
+ .read = do_sync_read,
+ .write = do_sync_write,
+ .aio_read = generic_file_aio_read,
+ .aio_write = generic_file_aio_write,
+ .mmap = generic_file_mmap,
+ .open = generic_file_open,
+ .release = exofs_release_file,
+ .fsync = exofs_file_fsync,
+ .flush = exofs_flush,
+};
+
+struct inode_operations exofs_file_inode_operations = {
+ .truncate = exofs_truncate,
+ .setattr = exofs_setattr,
+};
diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
new file mode 100644
index 0000000..931025b
--- /dev/null
+++ b/fs/exofs/inode.c
@@ -0,0 +1,140 @@
+/*
+ * Copyright (C) 2005, 2006
+ * Avishay Traeger ([email protected]) ([email protected])
+ * Copyright (C) 2005, 2006
+ * International Business Machines
+ *
+ * Copyrights for code taken from ext2:
+ * Copyright (C) 1992, 1993, 1994, 1995
+ * Remy Card ([email protected])
+ * Laboratoire MASI - Institut Blaise Pascal
+ * Universite Pierre et Marie Curie (Paris VI)
+ * from
+ * linux/fs/minix/inode.c
+ * Copyright (C) 1991, 1992 Linus Torvalds
+ *
+ * This file is part of exofs.
+ *
+ * exofs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation. Since it is based on ext2, and the only
+ * valid version of GPL for the Linux kernel is version 2, the only valid
+ * version of GPL for exofs is version 2.
+ *
+ * exofs is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with exofs; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include <linux/writeback.h>
+#include <linux/buffer_head.h>
+
+#include "exofs.h"
+
+/*
+ * Test whether an inode is a fast symlink.
+ */
+static inline int exofs_inode_is_fast_symlink(struct inode *inode)
+{
+ struct exofs_i_info *oi = EXOFS_I(inode);
+
+ return S_ISLNK(inode->i_mode) && (oi->i_data[0] != 0);
+}
+
+/*
+ * get_block_t - Fill in a buffer_head
+ * An OSD takes care of block allocation so we just fake an allocation by
+ * putting in the inode's sector_t in the buffer_head.
+ * TODO: What about the case of create==0 and @iblock does not exist in the
+ * object?
+ */
+int exofs_get_block(struct inode *inode, sector_t iblock,
+ struct buffer_head *bh_result, int create)
+{
+ map_bh(bh_result, inode->i_sb, iblock);
+ return 0;
+}
+
+/******************************************************************************
+ * INODE OPERATIONS
+ *****************************************************************************/
+
+/*
+ * Truncate a file to the specified size - all we have to do is set the size
+ * attribute. We make sure the object exists first.
+ */
+void exofs_truncate(struct inode *inode)
+{
+ struct exofs_sb_info *sbi = inode->i_sb->s_fs_info;
+ struct exofs_i_info *oi = EXOFS_I(inode);
+ struct osd_request *req = NULL;
+ loff_t isize = i_size_read(inode);
+ uint64_t newsize;
+ int ret;
+
+ if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode)
+ || S_ISLNK(inode->i_mode)))
+ return;
+ if (exofs_inode_is_fast_symlink(inode))
+ return;
+ if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
+ return;
+ inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+
+ nobh_truncate_page(inode->i_mapping, isize, exofs_get_block);
+
+ req = prepare_osd_set_attr(sbi->s_dev, sbi->s_pid,
+ inode->i_ino + EXOFS_OBJ_OFF);
+ if (!req) {
+ printk(KERN_ERR "ERROR: prepare set_attr failed.\n");
+ goto fail;
+ }
+
+ newsize = cpu_to_be64((uint64_t) isize);
+ prepare_set_attr_list_add_entry(req, OSD_APAGE_OBJECT_INFORMATION,
+ OSD_ATTR_OI_LOGICAL_LENGTH, 8,
+ (unsigned char *)(&newsize));
+
+ /* if we are about to truncate an object, and it hasn't been
+ * created yet, wait
+ */
+ if (!ObjCreated(oi)) {
+ if (!Obj2BCreated(oi))
+ BUG();
+ else
+ wait_event(oi->i_wq, ObjCreated(oi));
+ }
+
+ ret = exofs_sync_op(req, sbi->s_timeout, oi->i_cred);
+ free_osd_req(req);
+ if (ret)
+ goto fail;
+
+out:
+ mark_inode_dirty(inode);
+ return;
+fail:
+ make_bad_inode(inode);
+ goto out;
+}
+
+/*
+ * Set inode attributes - just call generic functions.
+ */
+int exofs_setattr(struct dentry *dentry, struct iattr *iattr)
+{
+ struct inode *inode = dentry->d_inode;
+ int error;
+
+ error = inode_change_ok(inode, iattr);
+ if (error)
+ return error;
+
+ error = inode_setattr(inode, iattr);
+ return error;
+}
--
1.6.0.1
Generic implementation of symlink ops.
Signed-off-by: Boaz Harrosh <[email protected]>
---
fs/exofs/Kbuild | 2 +-
fs/exofs/exofs.h | 4 +++
fs/exofs/symlink.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 59 insertions(+), 1 deletions(-)
create mode 100644 fs/exofs/symlink.c
diff --git a/fs/exofs/Kbuild b/fs/exofs/Kbuild
index d9dedc9..b372058 100644
--- a/fs/exofs/Kbuild
+++ b/fs/exofs/Kbuild
@@ -26,5 +26,5 @@ KBUILD_CPPFLAGS := -I$(OSD_INC) $(KBUILD_CPPFLAGS)
endif
-exofs-objs := osd.o inode.o file.o
+exofs-objs := osd.o inode.o file.o symlink.o
obj-$(CONFIG_EXOFS_FS) += exofs.o
diff --git a/fs/exofs/exofs.h b/fs/exofs/exofs.h
index f11250c..6f9b56c 100644
--- a/fs/exofs/exofs.h
+++ b/fs/exofs/exofs.h
@@ -191,4 +191,8 @@ int exofs_setattr(struct dentry *, struct iattr *);
extern struct inode_operations exofs_file_inode_operations;
extern struct file_operations exofs_file_operations;
+/* symlink.c */
+extern struct inode_operations exofs_symlink_inode_operations;
+extern struct inode_operations exofs_fast_symlink_inode_operations;
+
#endif
diff --git a/fs/exofs/symlink.c b/fs/exofs/symlink.c
new file mode 100644
index 0000000..4275451
--- /dev/null
+++ b/fs/exofs/symlink.c
@@ -0,0 +1,54 @@
+/*
+ * Copyright (C) 2005, 2006
+ * Avishay Traeger ([email protected]) ([email protected])
+ * Copyright (C) 2005, 2006
+ * International Business Machines
+ *
+ * Copyrights for code taken from ext2:
+ * Copyright (C) 1992, 1993, 1994, 1995
+ * Remy Card ([email protected])
+ * Laboratoire MASI - Institut Blaise Pascal
+ * Universite Pierre et Marie Curie (Paris VI)
+ * from
+ * linux/fs/minix/inode.c
+ * Copyright (C) 1991, 1992 Linus Torvalds
+ *
+ * This file is part of exofs.
+ *
+ * exofs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation. Since it is based on ext2, and the only
+ * valid version of GPL for the Linux kernel is version 2, the only valid
+ * version of GPL for exofs is version 2.
+ *
+ * exofs is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with exofs; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include <linux/namei.h>
+
+#include "exofs.h"
+
+static void *exofs_follow_link(struct dentry *dentry, struct nameidata *nd)
+{
+ struct exofs_i_info *oi = EXOFS_I(dentry->d_inode);
+ nd_set_link(nd, (char *)oi->i_data);
+ return NULL;
+}
+
+struct inode_operations exofs_symlink_inode_operations = {
+ .readlink = generic_readlink,
+ .follow_link = page_follow_link_light,
+ .put_link = page_put_link,
+};
+
+struct inode_operations exofs_fast_symlink_inode_operations = {
+ .readlink = generic_readlink,
+ .follow_link = exofs_follow_link,
+};
--
1.6.0.1
Now we start to read from and write to osd-objects, page-by-page.
The page index, shifted by PAGE_CACHE_SHIFT, is the byte offset into the object.
Signed-off-by: Boaz Harrosh <[email protected]>
---
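For reviewers, a minimal user-space sketch of the page-to-object mapping
described above. It is not part of the patch. The SKETCH_* names and the
4 KiB page size are assumptions made only for illustration; the kernel code
uses PAGE_CACHE_SHIFT and PAGE_CACHE_SIZE. The length rule mirrors the
end_index/offset logic in exofs_writepage: a full page below the last page,
the tail on the last page, and nothing past end of file.

```c
#include <stdint.h>

/* Assumed for illustration: 4 KiB pages (PAGE_CACHE_SHIFT == 12). */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1ULL << SKETCH_PAGE_SHIFT)

/* Byte offset inside the object at which page 'index' starts. */
static uint64_t sketch_page_start(uint64_t index)
{
	/* The index is already 64 bits wide here, so the shift cannot wrap. */
	return index << SKETCH_PAGE_SHIFT;
}

/*
 * Number of bytes of page 'index' that lie inside a file of 'i_size'
 * bytes: a full page below the last page, the tail on the last page,
 * and 0 for pages at or past end of file.
 */
static uint64_t sketch_page_len(uint64_t index, uint64_t i_size)
{
	uint64_t end_index = i_size >> SKETCH_PAGE_SHIFT;

	if (index < end_index)
		return SKETCH_PAGE_SIZE;
	if (index > end_index)
		return 0;
	return i_size & (SKETCH_PAGE_SIZE - 1);
}
```

With a 10000-byte file, pages 0 and 1 are transferred whole and page 2
carries only the remaining 1808 bytes.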
fs/exofs/exofs.h | 6 +
fs/exofs/inode.c | 315 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 321 insertions(+), 0 deletions(-)
diff --git a/fs/exofs/exofs.h b/fs/exofs/exofs.h
index 6f9b56c..a094cd7 100644
--- a/fs/exofs/exofs.h
+++ b/fs/exofs/exofs.h
@@ -183,6 +183,9 @@ void free_osd_req(struct osd_request *req);
/* inode.c */
void exofs_truncate(struct inode *inode);
int exofs_setattr(struct dentry *, struct iattr *);
+int exofs_write_begin(struct file *file, struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned flags,
+ struct page **pagep, void **fsdata);
/*********************
* operation vectors *
@@ -191,6 +194,9 @@ int exofs_setattr(struct dentry *, struct iattr *);
extern struct inode_operations exofs_file_inode_operations;
extern struct file_operations exofs_file_operations;
+/* inode.c */
+extern struct address_space_operations exofs_aops;
+
/* symlink.c */
extern struct inode_operations exofs_symlink_inode_operations;
extern struct inode_operations exofs_fast_symlink_inode_operations;
diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
index 931025b..b904e97 100644
--- a/fs/exofs/inode.c
+++ b/fs/exofs/inode.c
@@ -36,6 +36,8 @@
#include "exofs.h"
+static int __readpage_filler(struct page *page, bool is_async_unlock);
+
/*
* Test whether an inode is a fast symlink.
*/
@@ -60,6 +62,319 @@ int exofs_get_block(struct inode *inode, sector_t iblock,
return 0;
}
+int exofs_write_begin(struct file *file, struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned flags,
+ struct page **pagep, void **fsdata)
+{
+ int ret = 0;
+ struct page *page;
+
+ page = *pagep;
+ if (page == NULL) {
+ ret = simple_write_begin(file, mapping, pos, len, flags, pagep,
+ fsdata);
+ page = *pagep;
+ }
+
+ /* read modify write */
+ if (!ret && !PageUptodate(page) && (len != PAGE_CACHE_SIZE))
+ ret = __readpage_filler(page, false);
+
+ return ret;
+}
+
+int exofs_write_begin_export(struct file *file, struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned flags,
+ struct page **pagep, void **fsdata)
+{
+ *pagep = NULL;
+
+ return exofs_write_begin(file, mapping, pos, len, flags, pagep,
+ fsdata);
+}
+
+/*
+ * Callback function when writepage finishes. Check for errors, unlock, clean
+ * up, etc.
+ */
+void writepage_done(struct osd_request *req, void *p)
+{
+ int ret;
+ struct page *page = (struct page *)p;
+ struct inode *inode = page->mapping->host;
+ struct exofs_sb_info *sbi = inode->i_sb->s_fs_info;
+
+ ret = check_ok(req);
+ free_osd_req(req);
+ atomic_dec(&sbi->s_curr_pending);
+
+ if (ret) {
+ if (ret == -ENOSPC)
+ set_bit(AS_ENOSPC, &page->mapping->flags);
+ else
+ set_bit(AS_EIO, &page->mapping->flags);
+
+ SetPageError(page);
+ }
+
+ end_page_writeback(page);
+ unlock_page(page);
+}
+
+/*
+ * Write a page to disk. page->index gives us the page number. The page is
+ * locked before this function is called. We write asynchronously and then the
+ * callback function (writepage_done) is called. We signify that the operation
+ * has completed by unlocking the page and calling end_page_writeback().
+ */
+static int exofs_writepage(struct page *page, struct writeback_control *wbc)
+{
+ struct inode *inode = page->mapping->host;
+ struct exofs_i_info *oi = EXOFS_I(inode);
+ loff_t i_size = i_size_read(inode);
+ unsigned long end_index = i_size >> PAGE_CACHE_SHIFT;
+ unsigned offset = 0;
+ struct osd_request *req = NULL;
+ struct exofs_sb_info *sbi;
+ uint64_t start;
+ uint64_t len = PAGE_CACHE_SIZE;
+ unsigned char *kaddr;
+ int ret = 0;
+
+ if (!PageLocked(page))
+ BUG();
+
+ /* if the object has not been created, and we are not in sync mode,
+ * just return. otherwise, wait. */
+ if (!ObjCreated(oi)) {
+ if (!Obj2BCreated(oi))
+ BUG();
+
+ if (wbc->sync_mode == WB_SYNC_NONE) {
+ redirty_page_for_writepage(wbc, page);
+ unlock_page(page);
+ ret = 0;
+ goto out;
+ } else {
+ wait_event(oi->i_wq, ObjCreated(oi));
+ }
+ }
+
+ /* in this case, the page is within the limits of the file */
+ if (page->index < end_index)
+ goto do_it;
+
+ offset = i_size & (PAGE_CACHE_SIZE - 1);
+ len = offset;
+
+ /*in this case, the page is outside the limits (truncate in progress)*/
+ if (page->index >= end_index + 1 || !offset) {
+ unlock_page(page);
+ goto out;
+ }
+
+do_it:
+ BUG_ON(PageWriteback(page));
+ set_page_writeback(page);
+ start = (uint64_t)page->index << PAGE_CACHE_SHIFT;
+ sbi = inode->i_sb->s_fs_info;
+
+ kaddr = page_address(page);
+
+ req = prepare_osd_write(sbi->s_dev, sbi->s_pid,
+ inode->i_ino + EXOFS_OBJ_OFF, len, start, 0,
+ kaddr);
+ if (!req) {
+ printk(KERN_ERR "ERROR: writepage failed.\n");
+ ret = -ENOMEM;
+ goto fail;
+ }
+
+ oi->i_commit_size = min_t(uint64_t, oi->i_commit_size, len + start);
+
+ ret = exofs_async_op(req, writepage_done, (void *)page, oi->i_cred);
+ if (ret) {
+ free_osd_req(req);
+ goto fail;
+ }
+ atomic_inc(&sbi->s_curr_pending);
+out:
+ return ret;
+fail:
+ set_bit(AS_EIO, &page->mapping->flags);
+ end_page_writeback(page);
+ unlock_page(page);
+ goto out;
+}
+
+/*
+ * Callback for readpage
+ */
+int __readpage_done(struct osd_request *req, void *p, int unlock)
+{
+ struct page *page = (struct page *)p;
+ struct inode *inode = page->mapping->host;
+ struct exofs_sb_info *sbi = inode->i_sb->s_fs_info;
+ int ret;
+
+ ret = check_ok(req);
+ free_osd_req(req);
+ atomic_dec(&sbi->s_curr_pending);
+
+ if (ret == 0) {
+
+ /* Everything is OK */
+ SetPageUptodate(page);
+ if (PageError(page))
+ ClearPageError(page);
+ } else if (ret == -EFAULT) {
+ char *kaddr;
+
+ /* In this case we were trying to read something that wasn't on
+ * disk yet - return a page full of zeroes. This should be OK,
+ * because the object should be empty (if there was a write
+ * before this read, the read would be waiting with the page
+ * locked). */
+ kaddr = page_address(page);
+ memset(kaddr, 0, PAGE_CACHE_SIZE);
+
+ SetPageUptodate(page);
+ if (PageError(page))
+ ClearPageError(page);
+ } else /* Error */
+ SetPageError(page);
+
+ if (unlock)
+ unlock_page(page);
+
+ return ret;
+}
+
+void readpage_done(struct osd_request *req, void *p)
+{
+ __readpage_done(req, p, true);
+}
+
+/*
+ * Read a page from the OSD
+ */
+static int __readpage_filler(struct page *page, bool is_async_unlock)
+{
+ struct osd_request *req = NULL;
+ struct inode *inode = page->mapping->host;
+ struct exofs_i_info *oi = EXOFS_I(inode);
+ ino_t ino = inode->i_ino;
+ loff_t i_size = i_size_read(inode);
+ loff_t i_start = (loff_t)page->index << PAGE_CACHE_SHIFT;
+ unsigned long end_index = i_size >> PAGE_CACHE_SHIFT;
+ struct super_block *sb = inode->i_sb;
+ struct exofs_sb_info *sbi = sb->s_fs_info;
+ uint64_t amount;
+ unsigned char *kaddr;
+ int ret = 0;
+
+ if (!PageLocked(page))
+ BUG();
+
+ if (PageUptodate(page))
+ goto out;
+
+ if (page->index < end_index)
+ amount = PAGE_CACHE_SIZE;
+ else
+ amount = i_size & (PAGE_CACHE_SIZE - 1);
+
+ /* this will be out of bounds, or doesn't exist yet */
+ if ((page->index >= end_index + 1) || !ObjCreated(oi) || !amount
+ /*|| (i_start >= oi->i_commit_size)*/) {
+ kaddr = kmap_atomic(page, KM_USER0);
+ memset(kaddr, 0, PAGE_CACHE_SIZE);
+ flush_dcache_page(page);
+ kunmap_atomic(kaddr, KM_USER0);
+ SetPageUptodate(page);
+ if (PageError(page))
+ ClearPageError(page);
+ if (is_async_unlock)
+ unlock_page(page);
+ goto out;
+ }
+
+ if (amount != PAGE_CACHE_SIZE) {
+ kaddr = kmap_atomic(page, KM_USER0);
+ memset(kaddr + amount, 0, PAGE_CACHE_SIZE - amount);
+ flush_dcache_page(page);
+ kunmap_atomic(kaddr, KM_USER0);
+ }
+
+ kaddr = page_address(page);
+
+ req = prepare_osd_read(sbi->s_dev, sbi->s_pid, ino + EXOFS_OBJ_OFF,
+ amount, i_start, 0, kaddr);
+ if (!req) {
+ printk(KERN_ERR "ERROR: readpage failed.\n");
+ ret = -ENOMEM;
+ unlock_page(page);
+ goto out;
+ }
+
+ atomic_inc(&sbi->s_curr_pending);
+ if (!is_async_unlock) {
+ exofs_sync_op(req, sbi->s_timeout, oi->i_cred);
+ ret = __readpage_done(req, page, false);
+ } else {
+ ret = exofs_async_op(req, readpage_done, page, oi->i_cred);
+ if (ret) {
+ free_osd_req(req);
+ unlock_page(page);
+ atomic_dec(&sbi->s_curr_pending);
+ }
+ }
+
+out:
+ return ret;
+}
+
+static int readpage_filler(struct page *page)
+{
+ int ret = __readpage_filler(page, true);
+
+ return ret;
+}
+
+/*
+ * We don't need the file
+ */
+static int exofs_readpage(struct file *file, struct page *page)
+{
+ return readpage_filler(page);
+}
+
+/*
+ * We don't need the data
+ */
+static int readpage_strip(void *data, struct page *page)
+{
+ return readpage_filler(page);
+}
+
+/*
+ * read a bunch of pages - usually for readahead
+ */
+static int exofs_readpages(struct file *file, struct address_space *mapping,
+ struct list_head *pages, unsigned nr_pages)
+{
+ return read_cache_pages(mapping, pages, readpage_strip, NULL);
+}
+
+struct address_space_operations exofs_aops = {
+ .readpage = exofs_readpage,
+ .readpages = exofs_readpages,
+ .writepage = exofs_writepage,
+ .write_begin = exofs_write_begin_export,
+ .write_end = simple_write_end,
+ .writepages = generic_writepages,
+};
+
/******************************************************************************
* INODE OPERATIONS
*****************************************************************************/
--
1.6.0.1
Implementation of directory and inode operations.
* A directory is treated as a file, and essentially contains a list
of <file name, inode #> pairs for files that are found in that
directory. The object IDs correspond to the files' inode numbers
and are allocated using a 64-bit incrementing global counter.
* Each file's control block (AKA on-disk inode) is stored in its
object's attributes. This applies to both regular files and other
types (directories, device files, symlinks, etc.).
Signed-off-by: Boaz Harrosh <[email protected]>
---
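For reviewers, a minimal user-space sketch of how a directory chunk of
<file name, inode #> records can be walked. It is not part of the patch.
The record header layout and the SKETCH_REC_LEN rounding are assumptions
modelled on ext2; the real struct exofs_dir_entry and EXOFS_DIR_REC_LEN live
in exofs.h and may differ. All sketch_* names are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Assumed 8-byte record header; the name bytes follow it directly. */
struct sketch_de_head {
	uint32_t inode;      /* 0 marks an unused record */
	uint16_t rec_len;    /* distance in bytes to the next record */
	uint8_t  name_len;   /* length of the name, no NUL terminator */
	uint8_t  file_type;
};

/* Header plus name, rounded up to a multiple of 4 (assumed, as in ext2). */
#define SKETCH_REC_LEN(name_len) (((name_len) + 8u + 3u) & ~3u)

/* Write one record at byte offset 'offs' of the chunk. */
static void sketch_put(unsigned char *chunk, unsigned offs, uint32_t inode,
		       const char *name, uint16_t rec_len)
{
	struct sketch_de_head h = { inode, rec_len, (uint8_t)strlen(name), 0 };

	memcpy(chunk + offs, &h, sizeof(h));
	memcpy(chunk + offs + sizeof(h), name, h.name_len);
}

/* Return the inode number stored for 'name', or 0 if it is not found. */
static uint32_t sketch_lookup(const unsigned char *chunk, unsigned size,
			      const char *name)
{
	size_t len = strlen(name);
	unsigned offs = 0;

	while (offs + SKETCH_REC_LEN(1) <= size) {
		struct sketch_de_head h;

		memcpy(&h, chunk + offs, sizeof(h));
		if (h.rec_len == 0)
			return 0;	/* corrupt chunk: stop, do not loop forever */
		if (h.inode && h.name_len == len &&
		    !memcmp(chunk + offs + sizeof(h), name, len))
			return h.inode;
		offs += h.rec_len;	/* rec_len links to the next record */
	}
	return 0;
}
```

The last record in a chunk has its rec_len stretched to the end of the
chunk, which is how free space is tracked without a separate bitmap.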
fs/exofs/Kbuild | 2 +-
fs/exofs/dir.c | 649 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/exofs/exofs.h | 26 +++
fs/exofs/inode.c | 266 ++++++++++++++++++++++
fs/exofs/namei.c | 351 +++++++++++++++++++++++++++++
5 files changed, 1293 insertions(+), 1 deletions(-)
create mode 100644 fs/exofs/dir.c
create mode 100644 fs/exofs/namei.c
diff --git a/fs/exofs/Kbuild b/fs/exofs/Kbuild
index b372058..27c738c 100644
--- a/fs/exofs/Kbuild
+++ b/fs/exofs/Kbuild
@@ -26,5 +26,5 @@ KBUILD_CPPFLAGS := -I$(OSD_INC) $(KBUILD_CPPFLAGS)
endif
-exofs-objs := osd.o inode.o file.o symlink.o
+exofs-objs := osd.o inode.o file.o symlink.o namei.o dir.o
obj-$(CONFIG_EXOFS_FS) += exofs.o
diff --git a/fs/exofs/dir.c b/fs/exofs/dir.c
new file mode 100644
index 0000000..fb644eb
--- /dev/null
+++ b/fs/exofs/dir.c
@@ -0,0 +1,649 @@
+/*
+ * Copyright (C) 2005, 2006
+ * Avishay Traeger ([email protected]) ([email protected])
+ * Copyright (C) 2005, 2006
+ * International Business Machines
+ *
+ * Copyrights for code taken from ext2:
+ * Copyright (C) 1992, 1993, 1994, 1995
+ * Remy Card ([email protected])
+ * Laboratoire MASI - Institut Blaise Pascal
+ * Universite Pierre et Marie Curie (Paris VI)
+ * from
+ * linux/fs/minix/inode.c
+ * Copyright (C) 1991, 1992 Linus Torvalds
+ *
+ * This file is part of exofs.
+ *
+ * exofs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation. Since it is based on ext2, and the only
+ * valid version of GPL for the Linux kernel is version 2, the only valid
+ * version of GPL for exofs is version 2.
+ *
+ * exofs is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with exofs; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include <linux/pagemap.h>
+#include <linux/smp_lock.h>
+#include <linux/buffer_head.h>
+
+#include "exofs.h"
+
+static inline unsigned exofs_chunk_size(struct inode *inode)
+{
+ return inode->i_sb->s_blocksize;
+}
+
+static inline void exofs_put_page(struct page *page)
+{
+ kunmap(page);
+ page_cache_release(page);
+}
+
+static inline unsigned long dir_pages(struct inode *inode)
+{
+ return (inode->i_size+PAGE_CACHE_SIZE-1)>>PAGE_CACHE_SHIFT;
+}
+
+static unsigned exofs_last_byte(struct inode *inode, unsigned long page_nr)
+{
+ unsigned last_byte = inode->i_size;
+
+ last_byte -= page_nr << PAGE_CACHE_SHIFT;
+ if (last_byte > PAGE_CACHE_SIZE)
+ last_byte = PAGE_CACHE_SIZE;
+ return last_byte;
+}
+
+static int exofs_commit_chunk(struct page *page, loff_t pos, unsigned len)
+{
+ struct address_space *mapping = page->mapping;
+ struct inode *dir = mapping->host;
+ int err = 0;
+
+ dir->i_version++;
+
+ if (!PageUptodate(page))
+ SetPageUptodate(page);
+
+ if (pos+len > dir->i_size) {
+ i_size_write(dir, pos+len);
+ mark_inode_dirty(dir);
+ }
+ set_page_dirty(page);
+
+ if (IS_DIRSYNC(dir))
+ err = write_one_page(page, 1);
+ else
+ unlock_page(page);
+
+ return err;
+}
+
+static void exofs_check_page(struct page *page)
+{
+ struct inode *dir = page->mapping->host;
+ unsigned chunk_size = exofs_chunk_size(dir);
+ char *kaddr = page_address(page);
+ unsigned offs, rec_len;
+ unsigned limit = PAGE_CACHE_SIZE;
+ struct exofs_dir_entry *p;
+ char *error;
+
+ /* if the page is the last one in the directory */
+ if ((dir->i_size >> PAGE_CACHE_SHIFT) == page->index) {
+ limit = dir->i_size & ~PAGE_CACHE_MASK;
+ if (limit & (chunk_size - 1))
+ goto Ebadsize;
+ if (!limit)
+ goto out;
+ }
+ for (offs = 0; offs <= limit - EXOFS_DIR_REC_LEN(1); offs += rec_len) {
+ p = (struct exofs_dir_entry *)(kaddr + offs);
+ rec_len = p->rec_len;
+
+ if (rec_len < EXOFS_DIR_REC_LEN(1))
+ goto Eshort;
+ if (rec_len & 3)
+ goto Ealign;
+ if (rec_len < EXOFS_DIR_REC_LEN(p->name_len))
+ goto Enamelen;
+ if (((offs + rec_len - 1) ^ offs) & ~(chunk_size-1))
+ goto Espan;
+ }
+ if (offs != limit)
+ goto Eend;
+out:
+ SetPageChecked(page);
+ return;
+
+Ebadsize:
+ printk(KERN_ERR "ERROR [exofs_check_page]: "
+ "size of directory #%lu is not a multiple of chunk size\n",
+ dir->i_ino
+ );
+ goto fail;
+Eshort:
+ error = "rec_len is smaller than minimal";
+ goto bad_entry;
+Ealign:
+ error = "unaligned directory entry";
+ goto bad_entry;
+Enamelen:
+ error = "rec_len is too small for name_len";
+ goto bad_entry;
+Espan:
+ error = "directory entry across blocks";
+ goto bad_entry;
+bad_entry:
+ printk(KERN_ERR
+ "ERROR [exofs_check_page]: bad entry in directory #%lu: %s - "
+ "offset=%lu, inode=%lu, rec_len=%d, name_len=%d\n",
+ dir->i_ino, error, (page->index<<PAGE_CACHE_SHIFT)+offs,
+ (unsigned long) le32_to_cpu(p->inode),
+ rec_len, p->name_len);
+ goto fail;
+Eend:
+ p = (struct exofs_dir_entry *)(kaddr + offs);
+ printk(KERN_ERR "ERROR [exofs_check_page]: "
+ "entry in directory #%lu spans the page boundary - "
+ "offset=%lu, inode=%lu\n",
+ dir->i_ino, (page->index<<PAGE_CACHE_SHIFT)+offs,
+ (unsigned long) le32_to_cpu(p->inode));
+fail:
+ SetPageChecked(page);
+ SetPageError(page);
+}
+
+static struct page *exofs_get_page(struct inode *dir, unsigned long n)
+{
+ struct address_space *mapping = dir->i_mapping;
+ struct page *page = read_cache_page(mapping, n,
+ (filler_t *)mapping->a_ops->readpage, NULL);
+ if (!IS_ERR(page)) {
+ wait_on_page_locked(page);
+ kmap(page);
+ if (!PageUptodate(page))
+ goto fail;
+ if (!PageChecked(page))
+ exofs_check_page(page);
+ if (PageError(page))
+ goto fail;
+ }
+ return page;
+
+fail:
+ exofs_put_page(page);
+ return ERR_PTR(-EIO);
+}
+
+static inline int exofs_match(int len, const unsigned char *name,
+ struct exofs_dir_entry *de)
+{
+ if (len != de->name_len)
+ return 0;
+ if (!de->inode)
+ return 0;
+ return !memcmp(name, de->name, len);
+}
+
+static inline
+struct exofs_dir_entry *exofs_next_entry(struct exofs_dir_entry *p)
+{
+ return (struct exofs_dir_entry *)((char *)p + p->rec_len);
+}
+
+static inline unsigned
+exofs_validate_entry(char *base, unsigned offset, unsigned mask)
+{
+ struct exofs_dir_entry *de = (struct exofs_dir_entry *)(base + offset);
+ struct exofs_dir_entry *p =
+ (struct exofs_dir_entry *)(base + (offset&mask));
+ while ((char *)p < (char *)de) {
+ if (p->rec_len == 0)
+ break;
+ p = exofs_next_entry(p);
+ }
+ return (char *)p - base;
+}
+
+static unsigned char exofs_filetype_table[EXOFS_FT_MAX] = {
+ [EXOFS_FT_UNKNOWN] = DT_UNKNOWN,
+ [EXOFS_FT_REG_FILE] = DT_REG,
+ [EXOFS_FT_DIR] = DT_DIR,
+ [EXOFS_FT_CHRDEV] = DT_CHR,
+ [EXOFS_FT_BLKDEV] = DT_BLK,
+ [EXOFS_FT_FIFO] = DT_FIFO,
+ [EXOFS_FT_SOCK] = DT_SOCK,
+ [EXOFS_FT_SYMLINK] = DT_LNK,
+};
+
+#define S_SHIFT 12
+static unsigned char exofs_type_by_mode[S_IFMT >> S_SHIFT] = {
+ [S_IFREG >> S_SHIFT] = EXOFS_FT_REG_FILE,
+ [S_IFDIR >> S_SHIFT] = EXOFS_FT_DIR,
+ [S_IFCHR >> S_SHIFT] = EXOFS_FT_CHRDEV,
+ [S_IFBLK >> S_SHIFT] = EXOFS_FT_BLKDEV,
+ [S_IFIFO >> S_SHIFT] = EXOFS_FT_FIFO,
+ [S_IFSOCK >> S_SHIFT] = EXOFS_FT_SOCK,
+ [S_IFLNK >> S_SHIFT] = EXOFS_FT_SYMLINK,
+};
+
+static inline
+void exofs_set_de_type(struct exofs_dir_entry *de, struct inode *inode)
+{
+ mode_t mode = inode->i_mode;
+ de->file_type = exofs_type_by_mode[(mode & S_IFMT)>>S_SHIFT];
+}
+
+static int
+exofs_readdir(struct file *filp, void *dirent, filldir_t filldir)
+{
+ loff_t pos = filp->f_pos;
+ struct inode *inode = filp->f_dentry->d_inode;
+ unsigned int offset = pos & ~PAGE_CACHE_MASK;
+ unsigned long n = pos >> PAGE_CACHE_SHIFT;
+ unsigned long npages = dir_pages(inode);
+ unsigned chunk_mask = ~(exofs_chunk_size(inode)-1);
+ unsigned char *types = NULL;
+ int need_revalidate = (filp->f_version != inode->i_version);
+ int ret;
+
+ if (pos > inode->i_size - EXOFS_DIR_REC_LEN(1))
+ goto success;
+
+ types = exofs_filetype_table;
+
+ for ( ; n < npages; n++, offset = 0) {
+ char *kaddr, *limit;
+ struct exofs_dir_entry *de;
+ struct page *page = exofs_get_page(inode, n);
+
+ if (IS_ERR(page)) {
+ printk(KERN_ERR "ERROR: "
+ "bad page in #%lu\n",
+ inode->i_ino);
+ filp->f_pos += PAGE_CACHE_SIZE - offset;
+ ret = -EIO;
+ goto done;
+ }
+ kaddr = page_address(page);
+ if (need_revalidate) {
+ offset = exofs_validate_entry(kaddr, offset, chunk_mask);
+ need_revalidate = 0;
+ }
+ de = (struct exofs_dir_entry *)(kaddr+offset);
+ limit = kaddr + exofs_last_byte(inode, n) - EXOFS_DIR_REC_LEN(1);
+ for (; (char *)de <= limit; de = exofs_next_entry(de)) {
+ if (de->rec_len == 0) {
+ printk(KERN_ERR "ERROR: "
+ "zero-length directory entry\n");
+ ret = -EIO;
+ exofs_put_page(page);
+ goto done;
+ }
+ if (de->inode) {
+ int over;
+ unsigned char d_type = DT_UNKNOWN;
+
+ if (types && de->file_type < EXOFS_FT_MAX)
+ d_type = types[de->file_type];
+
+ offset = (char *)de - kaddr;
+ over = filldir(dirent, de->name, de->name_len,
+ (n<<PAGE_CACHE_SHIFT) | offset,
+ de->inode, d_type);
+ if (over) {
+ exofs_put_page(page);
+ goto success;
+ }
+ }
+ filp->f_pos += de->rec_len;
+ }
+ exofs_put_page(page);
+ }
+
+success:
+ ret = 0;
+done:
+ filp->f_version = inode->i_version;
+ return ret;
+}
+
+struct exofs_dir_entry *exofs_find_entry(struct inode *dir,
+ struct dentry *dentry, struct page **res_page)
+{
+ const unsigned char *name = dentry->d_name.name;
+ int namelen = dentry->d_name.len;
+ unsigned reclen = EXOFS_DIR_REC_LEN(namelen);
+ unsigned long start, n;
+ unsigned long npages = dir_pages(dir);
+ struct page *page = NULL;
+ struct exofs_i_info *oi = EXOFS_I(dir);
+ struct exofs_dir_entry *de;
+
+ if (npages == 0)
+ goto out;
+
+ *res_page = NULL;
+
+ start = oi->i_dir_start_lookup;
+ if (start >= npages)
+ start = 0;
+ n = start;
+ do {
+ char *kaddr;
+ page = exofs_get_page(dir, n);
+ if (!IS_ERR(page)) {
+ kaddr = page_address(page);
+ de = (struct exofs_dir_entry *) kaddr;
+ kaddr += exofs_last_byte(dir, n) - reclen;
+ while ((char *) de <= kaddr) {
+ if (de->rec_len == 0) {
+ printk(KERN_ERR
+ "ERROR: exofs_find_entry: "
+ "zero-length directory entry\n");
+ exofs_put_page(page);
+ goto out;
+ }
+ if (exofs_match(namelen, name, de))
+ goto found;
+ de = exofs_next_entry(de);
+ }
+ exofs_put_page(page);
+ }
+ if (++n >= npages)
+ n = 0;
+ } while (n != start);
+out:
+ return NULL;
+
+found:
+ *res_page = page;
+ oi->i_dir_start_lookup = n;
+ return de;
+}
+
+struct exofs_dir_entry *exofs_dotdot(struct inode *dir, struct page **p)
+{
+ struct page *page = exofs_get_page(dir, 0);
+ struct exofs_dir_entry *de = NULL;
+
+ if (!IS_ERR(page)) {
+ de = exofs_next_entry(
+ (struct exofs_dir_entry *)page_address(page));
+ *p = page;
+ }
+ return de;
+}
+
+ino_t exofs_inode_by_name(struct inode *dir, struct dentry *dentry)
+{
+ ino_t res = 0;
+ struct exofs_dir_entry *de;
+ struct page *page;
+
+ de = exofs_find_entry(dir, dentry, &page);
+ if (de) {
+ res = de->inode;
+ kunmap(page);
+ page_cache_release(page);
+ }
+ return res;
+}
+
+void exofs_set_link(struct inode *dir, struct exofs_dir_entry *de,
+ struct page *page, struct inode *inode)
+{
+ loff_t pos = page_offset(page) +
+ (char *) de - (char *) page_address(page);
+ unsigned len = le16_to_cpu(de->rec_len);
+ int err;
+
+ lock_page(page);
+ err = exofs_write_begin(NULL, page->mapping, pos, len,
+ AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
+ BUG_ON(err);
+ de->inode = inode->i_ino;
+ exofs_set_de_type(de, inode);
+ err = exofs_commit_chunk(page, pos, len);
+ exofs_put_page(page);
+ dir->i_mtime = dir->i_ctime = CURRENT_TIME;
+ mark_inode_dirty(dir);
+}
+
+int exofs_add_link(struct dentry *dentry, struct inode *inode)
+{
+ struct inode *dir = dentry->d_parent->d_inode;
+ const unsigned char *name = dentry->d_name.name;
+ int namelen = dentry->d_name.len;
+ unsigned chunk_size = exofs_chunk_size(dir);
+ unsigned reclen = EXOFS_DIR_REC_LEN(namelen);
+ unsigned short rec_len, name_len;
+ struct page *page = NULL;
+ struct exofs_sb_info *sbi = inode->i_sb->s_fs_info;
+ struct exofs_dir_entry *de;
+ unsigned long npages = dir_pages(dir);
+ unsigned long n;
+ char *kaddr;
+ loff_t pos;
+ int err;
+
+ for (n = 0; n <= npages; n++) {
+ char *dir_end;
+
+ page = exofs_get_page(dir, n);
+ err = PTR_ERR(page);
+ if (IS_ERR(page))
+ goto out;
+ lock_page(page);
+ kaddr = page_address(page);
+ dir_end = kaddr + exofs_last_byte(dir, n);
+ de = (struct exofs_dir_entry *)kaddr;
+ kaddr += PAGE_CACHE_SIZE - reclen;
+ while ((char *)de <= kaddr) {
+ if ((char *)de == dir_end) {
+ name_len = 0;
+ rec_len = chunk_size;
+ de->rec_len = chunk_size;
+ de->inode = 0;
+ goto got_it;
+ }
+ if (de->rec_len == 0) {
+ printk(KERN_ERR "ERROR: exofs_add_link: "
+ "zero-length directory entry\n");
+ err = -EIO;
+ goto out_unlock;
+ }
+ err = -EEXIST;
+ if (exofs_match(namelen, name, de))
+ goto out_unlock;
+ name_len = EXOFS_DIR_REC_LEN(de->name_len);
+ rec_len = de->rec_len;
+ if (!de->inode && rec_len >= reclen)
+ goto got_it;
+ if (rec_len >= name_len + reclen)
+ goto got_it;
+ de = (struct exofs_dir_entry *) ((char *) de + rec_len);
+ }
+ unlock_page(page);
+ exofs_put_page(page);
+ }
+ BUG();
+ return -EINVAL;
+
+got_it:
+ pos = page_offset(page) +
+ (char *)de - (char *)page_address(page);
+ err = exofs_write_begin(NULL, page->mapping, pos, rec_len, 0,
+ &page, NULL);
+ if (err)
+ goto out_unlock;
+ if (de->inode) {
+ struct exofs_dir_entry *de1 =
+ (struct exofs_dir_entry *)((char *)de + name_len);
+ de1->rec_len = rec_len - name_len;
+ de->rec_len = name_len;
+ de = de1;
+ }
+ de->name_len = namelen;
+ memcpy(de->name, name, namelen);
+ de->inode = inode->i_ino;
+ exofs_set_de_type(de, inode);
+ err = exofs_commit_chunk(page, pos, rec_len);
+ dir->i_mtime = dir->i_ctime = CURRENT_TIME;
+ mark_inode_dirty(dir);
+ sbi->s_numfiles++;
+
+out_put:
+ exofs_put_page(page);
+out:
+ return err;
+out_unlock:
+ unlock_page(page);
+ goto out_put;
+}
+
+int exofs_delete_entry(struct exofs_dir_entry *dir, struct page *page)
+{
+ struct address_space *mapping = page->mapping;
+ struct inode *inode = mapping->host;
+ struct exofs_sb_info *sbi = inode->i_sb->s_fs_info;
+ char *kaddr = page_address(page);
+ unsigned from = ((char *)dir - kaddr) & ~(exofs_chunk_size(inode)-1);
+ unsigned to = ((char *)dir - kaddr) + dir->rec_len;
+ loff_t pos;
+ struct exofs_dir_entry *pde = NULL;
+ struct exofs_dir_entry *de = (struct exofs_dir_entry *) (kaddr + from);
+ int err;
+
+ while ((char *)de < (char *)dir) {
+ if (de->rec_len == 0) {
+ printk(KERN_ERR "ERROR: exofs_delete_entry: "
+ "zero-length directory entry\n");
+ err = -EIO;
+ goto out;
+ }
+ pde = de;
+ de = exofs_next_entry(de);
+ }
+ if (pde)
+ from = (char *)pde - (char *)page_address(page);
+ pos = page_offset(page) + from;
+ lock_page(page);
+ err = exofs_write_begin(NULL, page->mapping, pos, to - from, 0,
+ &page, NULL);
+ BUG_ON(err);
+ if (pde)
+ pde->rec_len = cpu_to_le16(to - from);
+ dir->inode = 0;
+ err = exofs_commit_chunk(page, pos, to - from);
+ inode->i_ctime = inode->i_mtime = CURRENT_TIME;
+ mark_inode_dirty(inode);
+ sbi->s_numfiles--;
+out:
+ exofs_put_page(page);
+ return err;
+}
+
+int exofs_make_empty(struct inode *inode, struct inode *parent)
+{
+ struct address_space *mapping = inode->i_mapping;
+ struct page *page = grab_cache_page(mapping, 0);
+ unsigned chunk_size = exofs_chunk_size(inode);
+ struct exofs_dir_entry *de;
+ int err;
+ void *kaddr;
+
+ if (!page)
+ return -ENOMEM;
+
+ err = exofs_write_begin(NULL, page->mapping, 0, chunk_size, 0,
+ &page, NULL);
+ if (err) {
+ unlock_page(page);
+ goto fail;
+ }
+
+ kaddr = kmap_atomic(page, KM_USER0);
+ de = (struct exofs_dir_entry *)kaddr;
+ de->name_len = 1;
+ de->rec_len = EXOFS_DIR_REC_LEN(1);
+ memcpy(de->name, ".\0\0", 4);
+ de->inode = inode->i_ino;
+ exofs_set_de_type(de, inode);
+
+ de = (struct exofs_dir_entry *)(kaddr + EXOFS_DIR_REC_LEN(1));
+ de->name_len = 2;
+ de->rec_len = chunk_size - EXOFS_DIR_REC_LEN(1);
+ de->inode = parent->i_ino;
+ memcpy(de->name, "..\0", 4);
+ exofs_set_de_type(de, inode);
+ kunmap_atomic(kaddr, KM_USER0);
+ err = exofs_commit_chunk(page, 0, chunk_size);
+fail:
+ page_cache_release(page);
+ return err;
+}
+
+int exofs_empty_dir(struct inode *inode)
+{
+ struct page *page = NULL;
+ unsigned long i, npages = dir_pages(inode);
+
+ for (i = 0; i < npages; i++) {
+ char *kaddr;
+ struct exofs_dir_entry *de;
+ page = exofs_get_page(inode, i);
+
+ if (IS_ERR(page))
+ continue;
+
+ kaddr = page_address(page);
+ de = (struct exofs_dir_entry *)kaddr;
+ kaddr += exofs_last_byte(inode, i) - EXOFS_DIR_REC_LEN(1);
+
+ while ((char *)de <= kaddr) {
+ if (de->rec_len == 0) {
+ printk(KERN_ERR "ERROR: exofs_empty_dir: "
+ "zero-length directory entry\n");
+ printk(KERN_ERR "kaddr=%p, de=%p\n", kaddr, de);
+ goto not_empty;
+ }
+ if (de->inode != 0) {
+ /* check for . and .. */
+ if (de->name[0] != '.')
+ goto not_empty;
+ if (de->name_len > 2)
+ goto not_empty;
+ if (de->name_len < 2) {
+ if (de->inode !=
+ inode->i_ino)
+ goto not_empty;
+ } else if (de->name[1] != '.')
+ goto not_empty;
+ }
+ de = exofs_next_entry(de);
+ }
+ exofs_put_page(page);
+ }
+ return 1;
+
+not_empty:
+ exofs_put_page(page);
+ return 0;
+}
+
+struct file_operations exofs_dir_operations = {
+ .llseek = generic_file_llseek,
+ .read = generic_read_dir,
+ .readdir = exofs_readdir,
+};
diff --git a/fs/exofs/exofs.h b/fs/exofs/exofs.h
index a094cd7..7330b59 100644
--- a/fs/exofs/exofs.h
+++ b/fs/exofs/exofs.h
@@ -109,6 +109,11 @@ static inline struct exofs_i_info *EXOFS_I(struct inode *inode)
return container_of(inode, struct exofs_i_info, vfs_inode);
}
+/*
+ * Maximum count of links to a file
+ */
+#define EXOFS_LINK_MAX 32000
+
/*************************
* function declarations *
*************************/
@@ -182,14 +187,31 @@ void free_osd_req(struct osd_request *req);
/* inode.c */
void exofs_truncate(struct inode *inode);
+extern struct inode *exofs_iget(struct super_block *, unsigned long);
+struct inode *exofs_new_inode(struct inode *, int);
int exofs_setattr(struct dentry *, struct iattr *);
int exofs_write_begin(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata);
+/* dir.c: */
+int exofs_add_link(struct dentry *, struct inode *);
+ino_t exofs_inode_by_name(struct inode *, struct dentry *);
+int exofs_delete_entry(struct exofs_dir_entry *, struct page *);
+int exofs_make_empty(struct inode *, struct inode *);
+struct exofs_dir_entry *exofs_find_entry(struct inode *, struct dentry *,
+ struct page **);
+int exofs_empty_dir(struct inode *);
+struct exofs_dir_entry *exofs_dotdot(struct inode *, struct page **);
+void exofs_set_link(struct inode *, struct exofs_dir_entry *, struct page *,
+ struct inode *);
+
/*********************
* operation vectors *
*********************/
+/* dir.c: */
+extern struct file_operations exofs_dir_operations;
+
/* file.c */
extern struct inode_operations exofs_file_inode_operations;
extern struct file_operations exofs_file_operations;
@@ -197,6 +219,10 @@ extern struct file_operations exofs_file_operations;
/* inode.c */
extern struct address_space_operations exofs_aops;
+/* namei.c */
+extern struct inode_operations exofs_dir_inode_operations;
+extern struct inode_operations exofs_special_inode_operations;
+
/* symlink.c */
extern struct inode_operations exofs_symlink_inode_operations;
extern struct inode_operations exofs_fast_symlink_inode_operations;
diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
index b904e97..25a562e 100644
--- a/fs/exofs/inode.c
+++ b/fs/exofs/inode.c
@@ -439,6 +439,177 @@ fail:
}
/*
+ * Read an inode from the OSD, and return it as is. We also return the size
+ * attribute in the 'sanity' argument if we got compiled with debugging turned
+ * on.
+ */
+int exofs_get_inode(struct super_block *sb, struct exofs_i_info *oi,
+ struct exofs_fcb *inode, uint64_t *sanity)
+{
+ struct exofs_sb_info *sbi = sb->s_fs_info;
+ struct osd_request *req = NULL;
+ uint32_t page;
+ uint32_t attr;
+ uint16_t expected;
+ uint8_t *buf;
+ uint64_t o_id;
+ int ret;
+
+ o_id = oi->vfs_inode.i_ino + EXOFS_OBJ_OFF;
+
+ make_credential(oi->i_cred, sbi->s_pid, o_id);
+
+ req = prepare_osd_get_attr(sbi->s_dev, sbi->s_pid, o_id);
+ if (!req) {
+ printk(KERN_ERR "ERROR: prepare get_attr failed.\n");
+ return -ENOMEM;
+ }
+
+ /* we need the inode attribute */
+ prepare_get_attr_list_add_entry(req,
+ OSD_PAGE_NUM_IBM_UOBJ_FS_DATA,
+ OSD_ATTR_NUM_IBM_UOBJ_FS_DATA_INODE,
+ EXOFS_INO_ATTR_SIZE);
+
+#ifdef EXOFS_DEBUG
+ /* we get the size attributes to do a sanity check */
+ prepare_get_attr_list_add_entry(req,
+ OSD_APAGE_OBJECT_INFORMATION,
+ OSD_ATTR_OI_LOGICAL_LENGTH, 8);
+#endif
+
+ ret = exofs_sync_op(req, sbi->s_timeout, oi->i_cred);
+ if (ret)
+ goto out;
+
+ page = OSD_PAGE_NUM_IBM_UOBJ_FS_DATA;
+ attr = OSD_ATTR_NUM_IBM_UOBJ_FS_DATA_INODE;
+ expected = EXOFS_INO_ATTR_SIZE;
+ ret = extract_next_attr_from_req(req, &page, &attr, &expected, &buf);
+ if (ret) {
+ printk(KERN_ERR "ERROR: extract attr from req failed\n");
+ goto out;
+ }
+ memcpy(inode, buf, sizeof(struct exofs_fcb));
+
+#ifdef EXOFS_DEBUG
+ page = OSD_APAGE_OBJECT_INFORMATION;
+ attr = OSD_ATTR_OI_LOGICAL_LENGTH;
+ expected = 8;
+ ret = extract_next_attr_from_req(req, &page, &attr, &expected, &buf);
+ if (ret) {
+ printk(KERN_ERR "ERROR: extract attr from req failed\n");
+ goto out;
+ }
+ *sanity = be64_to_cpu(*((uint64_t *) buf));
+#endif
+
+out:
+ free_osd_req(req);
+ return ret;
+}
+
+/*
+ * Fill in an inode read from the OSD and set it up for use
+ */
+struct inode *exofs_iget(struct super_block *sb, unsigned long ino)
+{
+ struct exofs_i_info *oi;
+ struct exofs_fcb fcb;
+ struct inode *inode;
+ uint64_t sanity;
+ int ret;
+ int n;
+
+ inode = iget_locked(sb, ino);
+ if (!inode)
+ return ERR_PTR(-ENOMEM);
+ if (!(inode->i_state & I_NEW))
+ return inode;
+ oi = EXOFS_I(inode);
+
+ /* read the inode from the osd */
+ ret = exofs_get_inode(sb, oi, &fcb, &sanity);
+ if (ret)
+ goto bad_inode;
+
+ init_waitqueue_head(&oi->i_wq);
+ SetObjCreated(oi);
+
+ /* copy stuff from on-disk struct to in-memory struct */
+ inode->i_mode = be16_to_cpu(fcb.i_mode);
+ inode->i_uid = be32_to_cpu(fcb.i_uid);
+ inode->i_gid = be32_to_cpu(fcb.i_gid);
+ inode->i_nlink = be16_to_cpu(fcb.i_links_count);
+ inode->i_ctime.tv_sec = be32_to_cpu(fcb.i_ctime);
+ inode->i_atime.tv_sec = be32_to_cpu(fcb.i_atime);
+ inode->i_mtime.tv_sec = be32_to_cpu(fcb.i_mtime);
+ inode->i_atime.tv_nsec = inode->i_mtime.tv_nsec =
+ inode->i_ctime.tv_nsec = 0;
+ i_size_write(inode, oi->i_commit_size = be64_to_cpu(fcb.i_size));
+ inode->i_blkbits = EXOFS_BLKSHIFT;
+ inode->i_generation = be32_to_cpu(fcb.i_generation);
+
+#ifdef EXOFS_DEBUG
+ if ((inode->i_size != sanity) &&
+ (!exofs_inode_is_fast_symlink(inode))) {
+ printk(KERN_WARNING
+ "WARNING: Size of object from inode and "
+ "attributes differ (%lld != %llu)\n",
+ inode->i_size, _LLU(sanity));
+ }
+#endif
+
+ oi->i_dir_start_lookup = 0;
+
+ if ((inode->i_nlink == 0) && (inode->i_mode == 0)) {
+ ret = -ESTALE;
+ goto bad_inode;
+ }
+
+ if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode)) {
+ if (fcb.i_data[0])
+ inode->i_rdev = old_decode_dev(fcb.i_data[0]);
+ else
+ inode->i_rdev = new_decode_dev(fcb.i_data[1]);
+ } else
+ for (n = 0; n < EXOFS_IDATA; n++)
+ oi->i_data[n] = fcb.i_data[n];
+
+ if (S_ISREG(inode->i_mode)) {
+ inode->i_op = &exofs_file_inode_operations;
+ inode->i_fop = &exofs_file_operations;
+ inode->i_mapping->a_ops = &exofs_aops;
+ } else if (S_ISDIR(inode->i_mode)) {
+ inode->i_op = &exofs_dir_inode_operations;
+ inode->i_fop = &exofs_dir_operations;
+ inode->i_mapping->a_ops = &exofs_aops;
+ } else if (S_ISLNK(inode->i_mode)) {
+ if (exofs_inode_is_fast_symlink(inode))
+ inode->i_op = &exofs_fast_symlink_inode_operations;
+ else {
+ inode->i_op = &exofs_symlink_inode_operations;
+ inode->i_mapping->a_ops = &exofs_aops;
+ }
+ } else {
+ inode->i_op = &exofs_special_inode_operations;
+ if (fcb.i_data[0])
+ init_special_inode(inode, inode->i_mode,
+ old_decode_dev(le32_to_cpu(fcb.i_data[0])));
+ else
+ init_special_inode(inode, inode->i_mode,
+ new_decode_dev(le32_to_cpu(fcb.i_data[1])));
+ }
+
+ unlock_new_inode(inode);
+ return inode;
+
+bad_inode:
+ iget_failed(inode);
+ return ERR_PTR(ret);
+}
+
+/*
* Set inode attributes - just call generic functions.
*/
int exofs_setattr(struct dentry *dentry, struct iattr *iattr)
@@ -453,3 +624,98 @@ int exofs_setattr(struct dentry *dentry, struct iattr *iattr)
error = inode_setattr(inode, iattr);
return error;
}
+
+/*
+ * Callback function from exofs_new_inode(). The important thing is that we
+ * set the ObjCreated flag so that other methods know that the object exists on
+ * the OSD.
+ */
+void create_done(struct osd_request *req, void *p)
+{
+ struct inode *inode = (struct inode *)p;
+ struct exofs_i_info *oi = EXOFS_I(inode);
+ struct exofs_sb_info *sbi = inode->i_sb->s_fs_info;
+ int ret;
+
+ ret = check_ok(req);
+ free_osd_req(req);
+ atomic_dec(&sbi->s_curr_pending);
+
+ if (ret)
+ make_bad_inode(inode);
+ else
+ SetObjCreated(oi);
+
+ atomic_dec(&inode->i_count);
+}
+
+/*
+ * Set up a new inode and create an object for it on the OSD
+ */
+struct inode *exofs_new_inode(struct inode *dir, int mode)
+{
+ struct super_block *sb;
+ struct inode *inode;
+ struct exofs_i_info *oi;
+ struct exofs_sb_info *sbi;
+ struct osd_request *req = NULL;
+ int ret;
+
+ sb = dir->i_sb;
+ inode = new_inode(sb);
+ if (!inode)
+ return ERR_PTR(-ENOMEM);
+
+ oi = EXOFS_I(inode);
+
+ init_waitqueue_head(&oi->i_wq);
+ SetObj2BCreated(oi);
+
+ sbi = sb->s_fs_info;
+
+ sb->s_dirt = 1;
+ inode->i_uid = current->fsuid;
+ if (dir->i_mode & S_ISGID) {
+ inode->i_gid = dir->i_gid;
+ if (S_ISDIR(mode))
+ mode |= S_ISGID;
+ } else
+ inode->i_gid = current->fsgid;
+ inode->i_mode = mode;
+
+ inode->i_ino = sbi->s_nextid++;
+ inode->i_blkbits = EXOFS_BLKSHIFT;
+ inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
+ oi->i_commit_size = inode->i_size = 0;
+ spin_lock(&sbi->s_next_gen_lock);
+ inode->i_generation = sbi->s_next_generation++;
+ spin_unlock(&sbi->s_next_gen_lock);
+ insert_inode_hash(inode);
+
+ mark_inode_dirty(inode);
+
+ req = prepare_osd_create(sbi->s_dev, sbi->s_pid,
+ inode->i_ino + EXOFS_OBJ_OFF);
+ if (!req) {
+ printk(KERN_ERR "ERROR: prepare_osd_create failed\n");
+ return ERR_PTR(-EIO);
+ }
+
+ make_credential(oi->i_cred, sbi->s_pid, inode->i_ino + EXOFS_OBJ_OFF);
+
+ /* increment the refcount so that the inode will still be around when we
+ * reach the callback
+ */
+ atomic_inc(&inode->i_count);
+
+ ret = exofs_async_op(req, create_done, (void *)inode, oi->i_cred);
+ if (ret) {
+ atomic_dec(&inode->i_count);
+ free_osd_req(req);
+ return ERR_PTR(-EIO);
+ }
+ atomic_inc(&sbi->s_curr_pending);
+
+ return inode;
+}
+
diff --git a/fs/exofs/namei.c b/fs/exofs/namei.c
new file mode 100644
index 0000000..5e3cbe8
--- /dev/null
+++ b/fs/exofs/namei.c
@@ -0,0 +1,351 @@
+/*
+ * Copyright (C) 2005, 2006
+ * Avishay Traeger ([email protected]) ([email protected])
+ * Copyright (C) 2005, 2006
+ * International Business Machines
+ *
+ * Copyrights for code taken from ext2:
+ * Copyright (C) 1992, 1993, 1994, 1995
+ * Remy Card ([email protected])
+ * Laboratoire MASI - Institut Blaise Pascal
+ * Universite Pierre et Marie Curie (Paris VI)
+ * from
+ * linux/fs/minix/inode.c
+ * Copyright (C) 1991, 1992 Linus Torvalds
+ *
+ * This file is part of exofs.
+ *
+ * exofs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation. Since it is based on ext2, and the only
+ * valid version of GPL for the Linux kernel is version 2, the only valid
+ * version of GPL for exofs is version 2.
+ *
+ * exofs is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with exofs; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "exofs.h"
+
+static inline void exofs_inc_count(struct inode *inode)
+{
+ inode->i_nlink++;
+ mark_inode_dirty(inode);
+}
+
+static inline void exofs_dec_count(struct inode *inode)
+{
+ inode->i_nlink--;
+ mark_inode_dirty(inode);
+}
+
+static inline int exofs_add_nondir(struct dentry *dentry, struct inode *inode)
+{
+ int err = exofs_add_link(dentry, inode);
+ if (!err) {
+ d_instantiate(dentry, inode);
+ return 0;
+ }
+ exofs_dec_count(inode);
+ iput(inode);
+ return err;
+}
+
+static struct dentry *exofs_lookup(struct inode *dir, struct dentry *dentry,
+ struct nameidata *nd)
+{
+ struct inode *inode;
+ ino_t ino;
+
+ if (dentry->d_name.len > EXOFS_NAME_LEN)
+ return ERR_PTR(-ENAMETOOLONG);
+
+ ino = exofs_inode_by_name(dir, dentry);
+ inode = NULL;
+ if (ino) {
+ inode = exofs_iget(dir->i_sb, ino);
+ if (IS_ERR(inode))
+ return ERR_CAST(inode);
+ }
+ if (inode)
+ return d_splice_alias(inode, dentry);
+ d_add(dentry, inode);
+ return NULL;
+}
+
+static int exofs_create(struct inode *dir, struct dentry *dentry, int mode,
+ struct nameidata *nd)
+{
+ struct inode *inode = exofs_new_inode(dir, mode);
+ int err = PTR_ERR(inode);
+ if (!IS_ERR(inode)) {
+ inode->i_op = &exofs_file_inode_operations;
+ inode->i_fop = &exofs_file_operations;
+ inode->i_mapping->a_ops = &exofs_aops;
+ mark_inode_dirty(inode);
+ err = exofs_add_nondir(dentry, inode);
+ }
+ return err;
+}
+
+static int exofs_mknod(struct inode *dir, struct dentry *dentry, int mode,
+ dev_t rdev)
+{
+ struct inode *inode;
+ int err;
+
+ if (!new_valid_dev(rdev))
+ return -EINVAL;
+
+ inode = exofs_new_inode(dir, mode);
+ err = PTR_ERR(inode);
+ if (!IS_ERR(inode)) {
+ init_special_inode(inode, inode->i_mode, rdev);
+ mark_inode_dirty(inode);
+ err = exofs_add_nondir(dentry, inode);
+ }
+ return err;
+}
+
+static int exofs_symlink(struct inode *dir, struct dentry *dentry,
+ const char *symname)
+{
+ struct super_block *sb = dir->i_sb;
+ int err = -ENAMETOOLONG;
+ unsigned l = strlen(symname)+1;
+ struct inode *inode;
+ struct exofs_i_info *oi;
+
+ if (l > sb->s_blocksize)
+ goto out;
+
+ inode = exofs_new_inode(dir, S_IFLNK | S_IRWXUGO);
+ err = PTR_ERR(inode);
+ if (IS_ERR(inode))
+ goto out;
+
+ oi = EXOFS_I(inode);
+ if (l > sizeof(oi->i_data)) {
+ /* slow symlink */
+ inode->i_op = &exofs_symlink_inode_operations;
+ inode->i_mapping->a_ops = &exofs_aops;
+ memset((char *)(oi->i_data), 0, sizeof(oi->i_data));
+
+ err = page_symlink(inode, symname, l);
+ if (err)
+ goto out_fail;
+ } else {
+ /* fast symlink */
+ inode->i_op = &exofs_fast_symlink_inode_operations;
+ memcpy((char *)(oi->i_data), symname, l);
+ inode->i_size = l-1;
+ }
+ mark_inode_dirty(inode);
+
+ err = exofs_add_nondir(dentry, inode);
+out:
+ return err;
+
+out_fail:
+ exofs_dec_count(inode);
+ iput(inode);
+ goto out;
+}
+
+static int exofs_link(struct dentry *old_dentry, struct inode *dir,
+ struct dentry *dentry)
+{
+ struct inode *inode = old_dentry->d_inode;
+
+ if (inode->i_nlink >= EXOFS_LINK_MAX)
+ return -EMLINK;
+
+ inode->i_ctime = CURRENT_TIME;
+ exofs_inc_count(inode);
+ atomic_inc(&inode->i_count);
+
+ return exofs_add_nondir(dentry, inode);
+}
+
+static int exofs_mkdir(struct inode *dir, struct dentry *dentry, int mode)
+{
+ struct inode *inode;
+ int err = -EMLINK;
+
+ if (dir->i_nlink >= EXOFS_LINK_MAX)
+ goto out;
+
+ exofs_inc_count(dir);
+
+ inode = exofs_new_inode(dir, S_IFDIR | mode);
+ err = PTR_ERR(inode);
+ if (IS_ERR(inode))
+ goto out_dir;
+
+ inode->i_op = &exofs_dir_inode_operations;
+ inode->i_fop = &exofs_dir_operations;
+ inode->i_mapping->a_ops = &exofs_aops;
+
+ exofs_inc_count(inode);
+
+ err = exofs_make_empty(inode, dir);
+ if (err)
+ goto out_fail;
+
+ err = exofs_add_link(dentry, inode);
+ if (err)
+ goto out_fail;
+
+ d_instantiate(dentry, inode);
+out:
+ return err;
+
+out_fail:
+ exofs_dec_count(inode);
+ exofs_dec_count(inode);
+ iput(inode);
+out_dir:
+ exofs_dec_count(dir);
+ goto out;
+}
+
+static int exofs_unlink(struct inode *dir, struct dentry *dentry)
+{
+ struct inode *inode = dentry->d_inode;
+ struct exofs_dir_entry *de;
+ struct page *page;
+ int err = -ENOENT;
+
+ de = exofs_find_entry(dir, dentry, &page);
+ if (!de)
+ goto out;
+
+ err = exofs_delete_entry(de, page);
+ if (err)
+ goto out;
+
+ inode->i_ctime = dir->i_ctime;
+ exofs_dec_count(inode);
+ err = 0;
+out:
+ return err;
+}
+
+static int exofs_rmdir(struct inode *dir, struct dentry *dentry)
+{
+ struct inode *inode = dentry->d_inode;
+ int err = -ENOTEMPTY;
+
+ if (exofs_empty_dir(inode)) {
+ err = exofs_unlink(dir, dentry);
+ if (!err) {
+ inode->i_size = 0;
+ exofs_dec_count(inode);
+ exofs_dec_count(dir);
+ }
+ }
+ return err;
+}
+
+static int exofs_rename(struct inode *old_dir, struct dentry *old_dentry,
+ struct inode *new_dir, struct dentry *new_dentry)
+{
+ struct inode *old_inode = old_dentry->d_inode;
+ struct inode *new_inode = new_dentry->d_inode;
+ struct page *dir_page = NULL;
+ struct exofs_dir_entry *dir_de = NULL;
+ struct page *old_page;
+ struct exofs_dir_entry *old_de;
+ int err = -ENOENT;
+
+ old_de = exofs_find_entry(old_dir, old_dentry, &old_page);
+ if (!old_de)
+ goto out;
+
+ if (S_ISDIR(old_inode->i_mode)) {
+ err = -EIO;
+ dir_de = exofs_dotdot(old_inode, &dir_page);
+ if (!dir_de)
+ goto out_old;
+ }
+
+ if (new_inode) {
+ struct page *new_page;
+ struct exofs_dir_entry *new_de;
+
+ err = -ENOTEMPTY;
+ if (dir_de && !exofs_empty_dir(new_inode))
+ goto out_dir;
+
+ err = -ENOENT;
+ new_de = exofs_find_entry(new_dir, new_dentry, &new_page);
+ if (!new_de)
+ goto out_dir;
+ exofs_inc_count(old_inode);
+ exofs_set_link(new_dir, new_de, new_page, old_inode);
+ new_inode->i_ctime = CURRENT_TIME;
+ if (dir_de)
+ new_inode->i_nlink--;
+ exofs_dec_count(new_inode);
+ } else {
+ if (dir_de) {
+ err = -EMLINK;
+ if (new_dir->i_nlink >= EXOFS_LINK_MAX)
+ goto out_dir;
+ }
+ exofs_inc_count(old_inode);
+ err = exofs_add_link(new_dentry, old_inode);
+ if (err) {
+ exofs_dec_count(old_inode);
+ goto out_dir;
+ }
+ if (dir_de)
+ exofs_inc_count(new_dir);
+ }
+
+ old_inode->i_ctime = CURRENT_TIME;
+
+ exofs_delete_entry(old_de, old_page);
+ exofs_dec_count(old_inode);
+
+ if (dir_de) {
+ exofs_set_link(old_inode, dir_de, dir_page, new_dir);
+ exofs_dec_count(old_dir);
+ }
+ return 0;
+
+
+out_dir:
+ if (dir_de) {
+ kunmap(dir_page);
+ page_cache_release(dir_page);
+ }
+out_old:
+ kunmap(old_page);
+ page_cache_release(old_page);
+out:
+ return err;
+}
+
+struct inode_operations exofs_dir_inode_operations = {
+ .create = exofs_create,
+ .lookup = exofs_lookup,
+ .link = exofs_link,
+ .unlink = exofs_unlink,
+ .symlink = exofs_symlink,
+ .mkdir = exofs_mkdir,
+ .rmdir = exofs_rmdir,
+ .mknod = exofs_mknod,
+ .rename = exofs_rename,
+ .setattr = exofs_setattr,
+};
+
+struct inode_operations exofs_special_inode_operations = {
+ .setattr = exofs_setattr,
+};
--
1.6.0.1
This patch ties all operation vectors into a file system superblock
and registers the exofs file_system_type at module load time.
* The file system control block (AKA on-disk superblock) resides in
an object with a special ID (defined in common.h).
Information included in the file system control block is used to
fill the in-memory superblock structure at mount time. This object
is created by mkexofs.c before the file system is used. It contains
information such as:
- The file system's magic number
- The next inode number to be allocated
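As a rough illustration of the mount-time step described above, here is a
minimal user-space sketch. The struct layouts, the sketch_* names and the
magic value are placeholders made up for this example; the real definitions
live in the exofs headers. It only shows the idea: copy the control block
read from the special object, check the magic number, and fill the
in-memory structure.

```c
#include <stdint.h>
#include <string.h>

/* Placeholder magic value, for illustration only. */
#define SKETCH_SUPER_MAGIC 0x5DF5u

/* Hypothetical stand-in for the on-disk file system control block. */
struct sketch_fscb {
	uint64_t s_nextid;	/* next inode number to be allocated */
	uint32_t s_numfiles;	/* number of files in the file system */
	uint16_t s_magic;	/* the file system's magic number */
	uint16_t s_newfs;
};

/* Hypothetical stand-in for the in-memory superblock extension. */
struct sketch_sb_info {
	uint64_t s_nextid;
	uint32_t s_numfiles;
};

/*
 * Fill the in-memory structure from the raw control block bytes.
 * Returns 0 on success, -1 if the buffer is too short or the magic
 * number does not match.
 */
int sketch_fill_super(const void *raw, size_t len, struct sketch_sb_info *sbi)
{
	struct sketch_fscb fscb;

	if (len < sizeof(fscb))
		return -1;
	memcpy(&fscb, raw, sizeof(fscb));

	if (fscb.s_magic != SKETCH_SUPER_MAGIC)
		return -1;

	sbi->s_nextid = fscb.s_nextid;
	sbi->s_numfiles = fscb.s_numfiles;
	return 0;
}
```

The sketch ignores byte order and the OSD read itself; in the patch the
control block is fetched with a synchronous OSD read before the magic
check.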
Signed-off-by: Boaz Harrosh <[email protected]>
---
fs/exofs/Kbuild | 2 +-
fs/exofs/exofs.h | 30 ++++
fs/exofs/inode.c | 195 +++++++++++++++++++++-
fs/exofs/super.c | 502 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 727 insertions(+), 2 deletions(-)
create mode 100644 fs/exofs/super.c
diff --git a/fs/exofs/Kbuild b/fs/exofs/Kbuild
index 27c738c..e293cb9 100644
--- a/fs/exofs/Kbuild
+++ b/fs/exofs/Kbuild
@@ -26,5 +26,5 @@ KBUILD_CPPFLAGS := -I$(OSD_INC) $(KBUILD_CPPFLAGS)
endif
-exofs-objs := osd.o inode.o file.o symlink.o namei.o dir.o
+exofs-objs := osd.o inode.o file.o symlink.o namei.o dir.o super.o
obj-$(CONFIG_EXOFS_FS) += exofs.o
diff --git a/fs/exofs/exofs.h b/fs/exofs/exofs.h
index 7330b59..75c608d 100644
--- a/fs/exofs/exofs.h
+++ b/fs/exofs/exofs.h
@@ -52,6 +52,17 @@
#define _LLU(x) (unsigned long long)(x)
/*
+ * struct to hold what we get from mount options
+ */
+struct exofs_mountopt {
+ const char *dev_name;
+ uint64_t pid;
+ int timeout;
+ bool mkfs;
+ int format; /*in Mbyte*/
+};
+
+/*
* our extension to the in-memory superblock
*/
struct exofs_sb_info {
@@ -110,6 +121,14 @@ static inline struct exofs_i_info *EXOFS_I(struct inode *inode)
}
/*
+ * ugly struct so that we can pass two arguments to update_inode's callback
+ */
+struct updatei_args {
+ struct exofs_sb_info *sbi;
+ struct exofs_fcb *fcb;
+};
+
+/*
* Maximum count of links to a file
*/
#define EXOFS_LINK_MAX 32000
@@ -188,12 +207,20 @@ void free_osd_req(struct osd_request *req);
/* inode.c */
void exofs_truncate(struct inode *inode);
extern struct inode *exofs_iget(struct super_block *, unsigned long);
+extern int exofs_write_inode(struct inode *, int);
+extern void exofs_delete_inode(struct inode *);
struct inode *exofs_new_inode(struct inode *, int);
int exofs_setattr(struct dentry *, struct iattr *);
int exofs_write_begin(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata);
+/* super.c: */
+#ifdef EXOFS_DEBUG
+void exofs_dprint_internal(char *str, ...);
+#endif
+extern void exofs_write_super(struct super_block *);
+
/* dir.c: */
int exofs_add_link(struct dentry *, struct inode *);
ino_t exofs_inode_by_name(struct inode *, struct dentry *);
@@ -223,6 +250,9 @@ extern struct address_space_operations exofs_aops;
extern struct inode_operations exofs_dir_inode_operations;
extern struct inode_operations exofs_special_inode_operations;
+/* super.c */
+extern struct super_operations exofs_sops;
+
/* symlink.c */
extern struct inode_operations exofs_symlink_inode_operations;
extern struct inode_operations exofs_fast_symlink_inode_operations;
diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
index 25a562e..e24690b 100644
--- a/fs/exofs/inode.c
+++ b/fs/exofs/inode.c
@@ -37,6 +37,7 @@
#include "exofs.h"
static int __readpage_filler(struct page *page, bool is_async_unlock);
+static int exofs_update_inode(struct inode *inode, int do_sync);
/*
* Test whether an inode is a fast symlink.
@@ -49,6 +50,18 @@ static inline int exofs_inode_is_fast_symlink(struct inode *inode)
}
/*
+ * Callback function from exofs_delete_inode() - don't have much cleaning up to
+ * do.
+ */
+void delete_done(struct osd_request *req, void *p)
+{
+ struct exofs_sb_info *sbi;
+ free_osd_req(req);
+ sbi = (struct exofs_sb_info *)p;
+ atomic_dec(&sbi->s_curr_pending);
+}
+
+/*
* get_block_t - Fill in a buffer_head
* An OSD takes care of block allocation so we just fake an allocation by
* putting in the inode's sector_t in the buffer_head.
@@ -94,6 +107,62 @@ int exofs_write_begin_export(struct file *file, struct address_space *mapping,
}
/*
+ * Called when the refcount of an inode reaches zero. We remove the object
+ * from the OSD here. We make sure the object was created before we try and
+ * delete it.
+ */
+void exofs_delete_inode(struct inode *inode)
+{
+ struct exofs_i_info *oi = EXOFS_I(inode);
+ struct osd_request *req = NULL;
+ struct super_block *sb = inode->i_sb;
+ struct exofs_sb_info *sbi = sb->s_fs_info;
+ int ret;
+
+ truncate_inode_pages(&inode->i_data, 0);
+
+ if (is_bad_inode(inode))
+ goto no_delete;
+ mark_inode_dirty(inode);
+ exofs_update_inode(inode, inode_needs_sync(inode));
+
+ inode->i_size = 0;
+ if (inode->i_blocks)
+ exofs_truncate(inode);
+
+ clear_inode(inode);
+
+ req = prepare_osd_remove(sbi->s_dev, sbi->s_pid,
+ inode->i_ino + EXOFS_OBJ_OFF);
+ if (!req) {
+ printk(KERN_ERR "ERROR: prepare_osd_remove failed\n");
+ return;
+ }
+
+ /* if we are deleting an obj that hasn't been created yet, wait */
+ if (!ObjCreated(oi)) {
+ if (!Obj2BCreated(oi))
+ BUG();
+ else
+ wait_event(oi->i_wq, ObjCreated(oi));
+ }
+
+ ret = exofs_async_op(req, delete_done, sbi, oi->i_cred);
+ if (ret) {
+ printk(KERN_ERR
+ "ERROR: @exofs_delete_inode exofs_async_op failed\n");
+ free_osd_req(req);
+ return;
+ }
+ atomic_inc(&sbi->s_curr_pending);
+
+ return;
+
+no_delete:
+ clear_inode(inode);
+}
+
+/*
* Callback function when writepage finishes. Check for errors, unlock, clean
* up, etc.
*/
@@ -610,6 +679,131 @@ bad_inode:
}
/*
+ * Callback function from exofs_update_inode().
+ */
+void updatei_done(struct osd_request *req, void *p)
+{
+ struct updatei_args *args = (struct updatei_args *)p;
+
+ free_osd_req(req);
+
+ atomic_dec(&args->sbi->s_curr_pending);
+
+ kfree(args->fcb);
+ kfree(args);
+ args = NULL;
+}
+
+/*
+ * Write the inode to the OSD. Just fill up the struct, and set the attribute
+ * synchronously or asynchronously depending on the do_sync flag.
+ */
+static int exofs_update_inode(struct inode *inode, int do_sync)
+{
+ struct exofs_i_info *oi = EXOFS_I(inode);
+ struct super_block *sb = inode->i_sb;
+ struct exofs_sb_info *sbi = sb->s_fs_info;
+ struct osd_request *req = NULL;
+ struct exofs_fcb *fcb = NULL;
+ int ret;
+ int n;
+
+ fcb = kmalloc(sizeof(struct exofs_fcb), GFP_KERNEL);
+ if (!fcb) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ fcb->i_mode = cpu_to_be16(inode->i_mode);
+ fcb->i_uid = cpu_to_be32(inode->i_uid);
+ fcb->i_gid = cpu_to_be32(inode->i_gid);
+ fcb->i_links_count = cpu_to_be16(inode->i_nlink);
+ fcb->i_ctime = cpu_to_be32(inode->i_ctime.tv_sec);
+ fcb->i_atime = cpu_to_be32(inode->i_atime.tv_sec);
+ fcb->i_mtime = cpu_to_be32(inode->i_mtime.tv_sec);
+ fcb->i_size = cpu_to_be64(oi->i_commit_size = i_size_read(inode));
+ fcb->i_generation = cpu_to_be32(inode->i_generation);
+
+ if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode)) {
+ if (old_valid_dev(inode->i_rdev)) {
+ fcb->i_data[0] = old_encode_dev(inode->i_rdev);
+ fcb->i_data[1] = 0;
+ } else {
+ fcb->i_data[0] = 0;
+ fcb->i_data[1] = new_encode_dev(inode->i_rdev);
+ fcb->i_data[2] = 0;
+ }
+ } else
+ for (n = 0; n < EXOFS_IDATA; n++)
+ fcb->i_data[n] = oi->i_data[n];
+
+ req = prepare_osd_set_attr(sbi->s_dev, sbi->s_pid,
+ (uint64_t) (inode->i_ino + EXOFS_OBJ_OFF));
+ if (!req) {
+ printk(KERN_ERR "ERROR: prepare set_attr failed.\n");
+ kfree(fcb);
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ prepare_set_attr_list_add_entry(req,
+ OSD_PAGE_NUM_IBM_UOBJ_FS_DATA,
+ OSD_ATTR_NUM_IBM_UOBJ_FS_DATA_INODE,
+ EXOFS_INO_ATTR_SIZE,
+ (unsigned char *)fcb);
+
+ if (!ObjCreated(oi)) {
+ if (!Obj2BCreated(oi))
+ BUG();
+ else
+ wait_event(oi->i_wq, ObjCreated(oi));
+ }
+
+ if (do_sync) {
+ ret = exofs_sync_op(req, sbi->s_timeout, oi->i_cred);
+ free_osd_req(req);
+ kfree(fcb);
+ } else {
+ struct updatei_args *args = NULL;
+
+ args = kmalloc(sizeof(struct updatei_args), GFP_KERNEL);
+ if (!args) {
+ kfree(fcb);
+ ret = -ENOMEM;
+ goto out;
+ }
+ args->sbi = sbi;
+ args->fcb = fcb;
+
+ ret = exofs_async_op(req, updatei_done, args, oi->i_cred);
+ if (ret) {
+ free_osd_req(req);
+ kfree(fcb);
+ kfree(args);
+ goto out;
+ }
+ atomic_inc(&sbi->s_curr_pending);
+ }
+out:
+ return ret;
+}
+
+int exofs_write_inode(struct inode *inode, int wait)
+{
+ return exofs_update_inode(inode, wait);
+}
+
+int exofs_sync_inode(struct inode *inode)
+{
+ struct writeback_control wbc = {
+ .sync_mode = WB_SYNC_ALL,
+ .nr_to_write = 0, /* sys_fsync did this */
+ };
+
+ return sync_inode(inode, &wbc);
+}
+
+/*
* Set inode attributes - just call generic functions.
*/
int exofs_setattr(struct dentry *dentry, struct iattr *iattr)
@@ -624,7 +818,6 @@ int exofs_setattr(struct dentry *dentry, struct iattr *iattr)
error = inode_setattr(inode, iattr);
return error;
}
-
/*
* Callback function from exofs_new_inode(). The important thing is that we
* set the ObjCreated flag so that other methods know that the object exists on
diff --git a/fs/exofs/super.c b/fs/exofs/super.c
new file mode 100644
index 0000000..8ecf700
--- /dev/null
+++ b/fs/exofs/super.c
@@ -0,0 +1,502 @@
+/*
+ * Copyright (C) 2005, 2006
+ * Avishay Traeger ([email protected]) ([email protected])
+ * Copyright (C) 2005, 2006
+ * International Business Machines
+ *
+ * Copyrights for code taken from ext2:
+ * Copyright (C) 1992, 1993, 1994, 1995
+ * Remy Card ([email protected])
+ * Laboratoire MASI - Institut Blaise Pascal
+ * Universite Pierre et Marie Curie (Paris VI)
+ * from
+ * linux/fs/minix/inode.c
+ * Copyright (C) 1991, 1992 Linus Torvalds
+ *
+ * This file is part of exofs.
+ *
+ * exofs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation. Since it is based on ext2, and the only
+ * valid version of GPL for the Linux kernel is version 2, the only valid
+ * version of GPL for exofs is version 2.
+ *
+ * exofs is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with exofs; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include <linux/string.h>
+#include <linux/parser.h>
+#include <linux/vfs.h>
+#include <linux/random.h>
+
+#include "exofs.h"
+
+/******************************************************************************
+ * MOUNT OPTIONS
+ *****************************************************************************/
+
+/*
+ * exofs-specific mount-time options.
+ */
+enum { Opt_lun, Opt_tid, Opt_pid, Opt_to, Opt_mkfs, Opt_format, Opt_err };
+
+/*
+ * Our mount-time options. These should ideally be 64-bit unsigned, but the
+ * kernel's parsing functions do not currently support that. 32-bit should be
+ * sufficient for most applications now.
+ */
+static match_table_t tokens = {
+ {Opt_pid, "pid=%u"},
+ {Opt_to, "to=%u"},
+ {Opt_err, NULL}
+};
+
+/*
+ * The main option parsing method. Also makes sure that all of the mandatory
+ * mount options were set.
+ */
+static int parse_options(char *options, struct exofs_mountopt *opts)
+{
+ char *p;
+ substring_t args[MAX_OPT_ARGS];
+ int option;
+ int s_pid = 0;
+
+ EXOFS_DBGMSG("parse_options %s\n", options);
+ /* defaults */
+ memset(opts, 0, sizeof(*opts));
+ opts->timeout = BLK_DEFAULT_SG_TIMEOUT;
+
+ while ((p = strsep(&options, ",")) != NULL) {
+ int token;
+ if (!*p)
+ continue;
+
+ token = match_token(p, tokens, args);
+ switch (token) {
+ case Opt_pid:
+ if (match_int(&args[0], &option))
+ return -EINVAL;
+ if (option < 65536) {
+ EXOFS_ERR("Partition ID must be >= 65536\n");
+ return -EINVAL;
+ }
+ opts->pid = option;
+ s_pid = 1;
+ break;
+ case Opt_to:
+ if (match_int(&args[0], &option))
+ return -EINVAL;
+ if (option <= 0) {
+ EXOFS_ERR("Timeout must be > 0\n");
+ return -EINVAL;
+ }
+ opts->timeout = option * HZ;
+ break;
+ }
+ }
+
+ if (!s_pid) {
+ EXOFS_ERR("Need to specify the following options:\n");
+ EXOFS_ERR(" -o pid=pid_no_to_use\n");
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+/******************************************************************************
+ * INODE CACHE
+ *****************************************************************************/
+
+/*
+ * Our inode cache. Isn't it pretty?
+ */
+static struct kmem_cache *exofs_inode_cachep;
+
+/*
+ * Allocate an inode in the cache
+ */
+static struct inode *exofs_alloc_inode(struct super_block *sb)
+{
+ struct exofs_i_info *oi;
+
+ oi = kmem_cache_alloc(exofs_inode_cachep, GFP_KERNEL);
+ if (!oi)
+ return NULL;
+
+ oi->vfs_inode.i_version = 1;
+ return &oi->vfs_inode;
+}
+
+/*
+ * Remove an inode from the cache
+ */
+static void exofs_destroy_inode(struct inode *inode)
+{
+ kmem_cache_free(exofs_inode_cachep, EXOFS_I(inode));
+}
+
+/*
+ * Initialize the inode
+ */
+static void exofs_init_once(void *foo)
+{
+ struct exofs_i_info *oi = foo;
+
+ inode_init_once(&oi->vfs_inode);
+}
+
+/*
+ * Create and initialize the inode cache
+ */
+static int init_inodecache(void)
+{
+ exofs_inode_cachep = kmem_cache_create("exofs_inode_cache",
+ sizeof(struct exofs_i_info),
+ 0, SLAB_RECLAIM_ACCOUNT,
+ exofs_init_once);
+ if (exofs_inode_cachep == NULL)
+ return -ENOMEM;
+ return 0;
+}
+
+/*
+ * Destroy the inode cache
+ */
+static void destroy_inodecache(void)
+{
+ kmem_cache_destroy(exofs_inode_cachep);
+}
+
+/******************************************************************************
+ * SUPERBLOCK FUNCTIONS
+ *****************************************************************************/
+
+/*
+ * Write the superblock to the OSD
+ */
+void exofs_write_super(struct super_block *sb)
+{
+ struct exofs_sb_info *sbi;
+ struct exofs_fscb *fscb = NULL;
+ struct osd_request *req = NULL;
+
+ fscb = kzalloc(sizeof(struct exofs_fscb), GFP_KERNEL);
+ if (!fscb)
+ return;
+
+ lock_kernel();
+ sbi = sb->s_fs_info;
+ fscb->s_nextid = sbi->s_nextid;
+ fscb->s_magic = sb->s_magic;
+ fscb->s_numfiles = sbi->s_numfiles;
+ fscb->s_newfs = 0;
+
+ req = prepare_osd_write(sbi->s_dev, sbi->s_pid, EXOFS_SUPER_ID,
+ sizeof(struct exofs_fscb), 0, 0,
+ (unsigned char *)(fscb));
+ if (!req) {
+ EXOFS_ERR("ERROR: write super failed.\n");
+ goto out;
+ }
+
+ exofs_sync_op(req, sbi->s_timeout, sbi->s_cred);
+ free_osd_req(req);
+ sb->s_dirt = 0;
+out:
+ unlock_kernel();
+ kfree(fscb);
+}
+
+/*
+ * This function is called when the vfs is freeing the superblock. We just
+ * need to free our own part.
+ */
+static void exofs_put_super(struct super_block *sb)
+{
+ int num_pend;
+ struct exofs_sb_info *sbi = sb->s_fs_info;
+
+ /* make sure there are no pending commands */
+ for (num_pend = atomic_read(&sbi->s_curr_pending); num_pend > 0;
+ num_pend = atomic_read(&sbi->s_curr_pending)) {
+ wait_queue_head_t wq;
+ init_waitqueue_head(&wq);
+ wait_event_timeout(wq,
+ (atomic_read(&sbi->s_curr_pending) == 0),
+ msecs_to_jiffies(100));
+ }
+
+ osduld_put_device(sbi->s_dev);
+ kfree(sb->s_fs_info);
+ sb->s_fs_info = NULL;
+}
+
+/*
+ * Read the superblock from the OSD and fill in the fields
+ */
+static int exofs_fill_super(struct super_block *sb, void *data, int silent)
+{
+ struct inode *root;
+ struct exofs_mountopt *opts = data;
+ struct exofs_sb_info *sbi = NULL; /*extended info */
+ struct exofs_fscb fscb; /*on-disk superblock info */
+ struct osd_request *req = NULL;
+ int ret;
+
+ sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);
+ if (!sbi)
+ return -ENOMEM;
+ sb->s_fs_info = sbi;
+
+ /* use mount options to fill superblock */
+ sbi->s_dev = osduld_path_lookup(opts->dev_name);
+ if (IS_ERR(sbi->s_dev)) {
+ ret = PTR_ERR(sbi->s_dev);
+ sbi->s_dev = NULL;
+ goto free_sbi;
+ }
+
+ sbi->s_pid = opts->pid;
+ sbi->s_timeout = opts->timeout;
+
+ /* fill in some other data by hand */
+ memset(sb->s_id, 0, sizeof(sb->s_id));
+ strcpy(sb->s_id, "exofs");
+ sb->s_blocksize = EXOFS_BLKSIZE;
+ sb->s_blocksize_bits = EXOFS_BLKSHIFT;
+ atomic_set(&sbi->s_curr_pending, 0);
+ sb->s_bdev = NULL;
+ sb->s_dev = 0;
+
+ /* read data from on-disk superblock object */
+ make_credential(sbi->s_cred, sbi->s_pid, EXOFS_SUPER_ID);
+
+ req = prepare_osd_read(sbi->s_dev, sbi->s_pid, EXOFS_SUPER_ID,
+ sizeof(struct exofs_fscb), 0, 0,
+ (unsigned char *)(&fscb));
+ if (!req) {
+ if (!silent)
+ EXOFS_ERR("ERROR: could not prepare read request.\n");
+ ret = -ENOMEM;
+ goto free_sbi;
+ }
+
+ ret = exofs_sync_op(req, sbi->s_timeout, sbi->s_cred);
+ if (ret != 0) {
+ if (!silent)
+ EXOFS_ERR("ERROR: read super failed.\n");
+ ret = -EIO;
+ goto free_sbi;
+ }
+
+ sb->s_magic = fscb.s_magic;
+ sbi->s_nextid = fscb.s_nextid;
+ sbi->s_numfiles = fscb.s_numfiles;
+
+ /* make sure what we read from the object store is correct */
+ if (sb->s_magic != EXOFS_SUPER_MAGIC) {
+ if (!silent)
+ EXOFS_ERR("ERROR: Bad magic value\n");
+ ret = -EINVAL;
+ goto free_sbi;
+ }
+
+ /* start generation numbers from a random point */
+ get_random_bytes(&sbi->s_next_generation, sizeof(u32));
+ spin_lock_init(&sbi->s_next_gen_lock);
+
+ /* set up operation vectors */
+ sb->s_op = &exofs_sops;
+ root = exofs_iget(sb, EXOFS_ROOT_ID - EXOFS_OBJ_OFF);
+ if (IS_ERR(root)) {
+ EXOFS_ERR("ERROR: exofs_iget failed\n");
+ ret = PTR_ERR(root);
+ goto free_sbi;
+ }
+ sb->s_root = d_alloc_root(root);
+ if (!sb->s_root) {
+ iput(root);
+ EXOFS_ERR("ERROR: get root inode failed\n");
+ ret = -ENOMEM;
+ goto free_sbi;
+ }
+
+ if (!S_ISDIR(root->i_mode)) {
+ dput(sb->s_root);
+ sb->s_root = NULL;
+ EXOFS_ERR("ERROR: corrupt root inode (mode = %hd)\n",
+ root->i_mode);
+ ret = -EINVAL;
+ goto free_sbi;
+ }
+
+ ret = 0;
+out:
+ if (req)
+ free_osd_req(req);
+ return ret;
+
+free_sbi:
+ osduld_put_device(sbi->s_dev); /* NULL safe */
+ kfree(sbi);
+ goto out;
+}
+
+/*
+ * Set up the superblock (calls exofs_fill_super eventually)
+ */
+static int exofs_get_sb(struct file_system_type *type,
+ int flags, const char *dev_name,
+ void *data, struct vfsmount *mnt)
+{
+ struct exofs_mountopt opts;
+ int ret;
+
+ ret = parse_options((char *) data, &opts);
+ if (ret)
+ return ret;
+
+ opts.dev_name = dev_name;
+ return get_sb_nodev(type, flags, &opts, exofs_fill_super, mnt);
+}
+
+/*
+ * Return information about the file system state in the buffer. This is used
+ * by the 'df' command, for example.
+ */
+static int exofs_statfs(struct dentry *dentry, struct kstatfs *buf)
+{
+ struct super_block *sb = dentry->d_sb;
+ struct exofs_sb_info *sbi = sb->s_fs_info;
+ uint8_t cred_a[OSD_CAP_LEN];
+ struct osd_request *req = NULL;
+ uint32_t page;
+ uint32_t attr;
+ uint16_t expected;
+ uint64_t capacity;
+ uint64_t used;
+ uint8_t *data;
+ int ret;
+
+ /* get used/capacity attributes */
+ make_credential(cred_a, sbi->s_pid, 0);
+
+ req = prepare_osd_get_attr(sbi->s_dev, sbi->s_pid, 0);
+ if (!req) {
+ EXOFS_ERR("ERROR: prepare get_attr failed.\n");
+ return -ENOMEM;
+ }
+
+ prepare_get_attr_list_add_entry(req,
+ OSD_APAGE_PARTITION_QUOTAS,
+ OSD_ATTR_PQ_CAPACITY_QUOTA,
+ 8);
+
+ prepare_get_attr_list_add_entry(req,
+ OSD_APAGE_PARTITION_INFORMATION,
+ OSD_ATTR_PI_USED_CAPACITY,
+ 8);
+
+ ret = exofs_sync_op(req, sbi->s_timeout, cred_a);
+ if (ret)
+ goto out;
+
+ page = OSD_APAGE_PARTITION_QUOTAS;
+ attr = OSD_ATTR_PQ_CAPACITY_QUOTA;
+ expected = 8;
+ ret = extract_next_attr_from_req(req, &page, &attr, &expected, &data);
+ if (ret) {
+ EXOFS_ERR("ERROR: extract attr from req failed\n");
+ goto out;
+ }
+ capacity = be64_to_cpu(*((uint64_t *)data));
+
+ page = OSD_APAGE_PARTITION_INFORMATION;
+ attr = OSD_ATTR_PI_USED_CAPACITY;
+ expected = 8;
+ ret = extract_next_attr_from_req(req, &page, &attr, &expected, &data);
+ if (ret) {
+ EXOFS_ERR("ERROR: extract attr from req failed\n");
+ goto out;
+ }
+ used = be64_to_cpu(*((uint64_t *)data));
+
+ /* fill in the stats buffer */
+ buf->f_type = EXOFS_SUPER_MAGIC;
+ buf->f_bsize = EXOFS_BLKSIZE;
+ buf->f_blocks = (capacity >> EXOFS_BLKSHIFT);
+ buf->f_bfree = ((capacity - used) >> EXOFS_BLKSHIFT);
+ buf->f_bavail = buf->f_bfree;
+ buf->f_files = sbi->s_numfiles;
+ buf->f_ffree = EXOFS_MAX_ID - sbi->s_numfiles;
+ buf->f_namelen = EXOFS_NAME_LEN;
+out:
+ free_osd_req(req);
+
+ return ret;
+}
+
+struct super_operations exofs_sops = {
+ .alloc_inode = exofs_alloc_inode,
+ .destroy_inode = exofs_destroy_inode,
+ .write_inode = exofs_write_inode,
+ .delete_inode = exofs_delete_inode,
+ .put_super = exofs_put_super,
+ .write_super = exofs_write_super,
+ .statfs = exofs_statfs,
+};
+
+/******************************************************************************
+ * INSMOD/RMMOD
+ *****************************************************************************/
+
+/*
+ * struct that describes this file system
+ */
+static struct file_system_type exofs_type = {
+ .owner = THIS_MODULE,
+ .name = "exofs",
+ .get_sb = exofs_get_sb,
+ .kill_sb = generic_shutdown_super,
+};
+
+static int __init init_exofs(void)
+{
+ int err;
+
+ err = init_inodecache();
+ if (err)
+ goto out;
+
+ err = register_filesystem(&exofs_type);
+ if (err)
+ goto out_d;
+
+ return 0;
+out_d:
+ destroy_inodecache();
+out:
+ return err;
+}
+
+static void __exit exit_exofs(void)
+{
+ unregister_filesystem(&exofs_type);
+ destroy_inodecache();
+}
+
+MODULE_AUTHOR("Avishay Traeger <[email protected]>");
+MODULE_DESCRIPTION("exofs");
+MODULE_LICENSE("GPL");
+
+module_init(init_exofs)
+module_exit(exit_exofs)
--
1.6.0.1
We need a mechanism to prepare the file system (mkfs).
I chose to implement that by means of a couple of
mount options, because there is no user-mode API for committing
OSD commands, and because all this stuff is highly internal to
the file system itself.
- Added two mount options, mkfs=0/1 and format=capacity_in_meg, so mkfs/format
can be executed by kernel code just before mount. An mkexofs utility
can now be implemented by means of a script that mounts and unmounts the
file system with the proper options.
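As a rough illustration of what such an mkexofs wrapper would have to do,
the sketch below builds the option string for the one-time "mkfs" mount.
The function name sketch_mkfs_opts() is hypothetical; only the pid=, mkfs=
and format= option names come from this patch. Treat it as a sketch under
those assumptions, not as a proposed utility.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the option string for a one-time "mkfs" mount.
 * format_mb == 0 means "do not format", so format= is left out.
 * Returns 0 on success, -1 if the buffer is too small.
 */
int sketch_mkfs_opts(char *buf, size_t len, unsigned int pid,
		     unsigned int format_mb)
{
	size_t used;
	int n;

	assert(buf != NULL && len > 0);

	n = snprintf(buf, len, "pid=%u,mkfs=1", pid);
	if (n < 0 || (size_t)n >= len)
		return -1;

	if (format_mb == 0)
		return 0;

	/* Append the optional format= option after the base string. */
	used = strlen(buf);
	n = snprintf(buf + used, len - used, ",format=%u", format_mb);
	if (n < 0 || (size_t)n >= len - used)
		return -1;
	return 0;
}
```

The resulting string would be handed to mount as the -o argument (or as
the data argument of mount(2)), followed by an unmount; later mounts would
pass only pid=.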
Signed-off-by: Boaz Harrosh <[email protected]>
---
fs/exofs/Kbuild | 2 +-
fs/exofs/exofs.h | 3 +
fs/exofs/mkexofs.c | 605 ++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/exofs/super.c | 18 ++
4 files changed, 627 insertions(+), 1 deletions(-)
create mode 100644 fs/exofs/mkexofs.c
diff --git a/fs/exofs/Kbuild b/fs/exofs/Kbuild
index e293cb9..639c181 100644
--- a/fs/exofs/Kbuild
+++ b/fs/exofs/Kbuild
@@ -26,5 +26,5 @@ KBUILD_CPPFLAGS := -I$(OSD_INC) $(KBUILD_CPPFLAGS)
endif
-exofs-objs := osd.o inode.o file.o symlink.o namei.o dir.o super.o
+exofs-objs := osd.o inode.o file.o symlink.o namei.o dir.o super.o mkexofs.o
obj-$(CONFIG_EXOFS_FS) += exofs.o
diff --git a/fs/exofs/exofs.h b/fs/exofs/exofs.h
index 75c608d..f53d18b 100644
--- a/fs/exofs/exofs.h
+++ b/fs/exofs/exofs.h
@@ -204,6 +204,9 @@ int extract_list_from_req(struct osd_request *req,
void free_osd_req(struct osd_request *req);
+/* mkexofs.c */
+int exofs_mkfs(struct osd_dev *dev, uint64_t p_id, uint64_t format_size_meg);
+
/* inode.c */
void exofs_truncate(struct inode *inode);
extern struct inode *exofs_iget(struct super_block *, unsigned long);
diff --git a/fs/exofs/mkexofs.c b/fs/exofs/mkexofs.c
new file mode 100644
index 0000000..79df3e3
--- /dev/null
+++ b/fs/exofs/mkexofs.c
@@ -0,0 +1,605 @@
+/*
+ * mkexofs.c - make an exofs file system.
+ *
+ * Copyright (C) 2005, 2006
+ * Avishay Traeger ([email protected]) ([email protected])
+ * Copyright (C) 2005, 2006
+ * International Business Machines
+ *
+ * Copyrights from mke2fs.c:
+ * Copyright (C) 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002,
+ * 2003, 2004, 2005 by Theodore Ts'o.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "exofs.h"
+#include <linux/random.h>
+
+/* #define __MKEXOFS_DEBUG_CHECKS 1 */
+
+static int kick_it(struct osd_request *req, int timeout, uint8_t *cred_a,
+ const char *op)
+{
+ return exofs_sync_op(req, timeout, cred_a);
+}
+
+/* Format the LUN to the specified size */
+static int format(uint64_t lun_capacity, struct osd_dev *dev, int timeout)
+{
+ struct osd_request *req = prepare_osd_format_lun(dev, lun_capacity);
+ uint8_t cred_a[OSD_CAP_LEN];
+ int ret;
+
+ make_credential(cred_a, 0, 0);
+
+ if (req == NULL) {
+ EXOFS_ERR("ERROR: Failed to allocate request.\n");
+ return -ENOMEM;
+ }
+
+ ret = kick_it(req, timeout, cred_a, "format");
+
+ free_osd_req(req);
+
+ return ret;
+}
+
+static int create_partition(struct osd_dev *dev, uint64_t p_id, int timeout)
+{
+ struct osd_request *req;
+ uint8_t cred_a[OSD_CAP_LEN];
+ bool try_remove = false;
+ int ret;
+
+ make_credential(cred_a, p_id, 0);
+
+create_part:
+ req = prepare_osd_create_partition(dev, p_id);
+ if (!req)
+ return -ENOMEM;
+ ret = kick_it(req, timeout, cred_a, "create partition");
+ free_osd_req(req);
+
+ if (ret && !try_remove) {
+ try_remove = true;
+ req = prepare_osd_remove_partition(dev, p_id);
+ if (!req)
+ return -ENOMEM;
+ ret = kick_it(req, timeout, cred_a, "remove partition");
+ free_osd_req(req);
+ if (!ret) /* Try again now */
+ goto create_part;
+ }
+
+ return ret;
+}
+
+#ifdef __MKEXOFS_DEBUG_CHECKS
+static int list(struct osd_dev *dev, uint64_t p_id, int timeout)
+{
+ struct osd_request *req;
+ uint8_t cred_a[OSD_CAP_LEN];
+ unsigned char *buf = NULL;
+ int ret;
+ uint64_t total_matches;
+ uint64_t total_ret;
+ uint64_t *id_list;
+ int is_part, is_utd;
+ uint64_t cont;
+ uint32_t more;
+ int i;
+
+ buf = kzalloc(1024, GFP_KERNEL);
+ if (!buf) {
+ EXOFS_ERR("ERROR: Failed to allocate memory.\n");
+ return -ENOMEM;
+ }
+
+ make_credential(cred_a, p_id, 0);
+
+ req = prepare_osd_list(dev, p_id, 0, 1024, 0, 0, buf);
+ if (req == NULL) {
+ EXOFS_ERR("ERROR: Failed to allocate request.\n");
+ ret = -ENOMEM;
+ goto out_buf;
+ }
+
+ ret = kick_it(req, timeout, cred_a, "list");
+ if (ret != 0)
+ goto out;
+
+ ret = extract_list_from_req(req, &total_matches, &total_ret, &id_list,
+ &is_part, &is_utd, &cont, &more);
+
+ EXOFS_DBGMSG("created %llu objects:\n", _LLU(total_ret));
+ for (i = 0 ; i < total_ret ; i++)
+ EXOFS_DBGMSG("%llu\n", _LLU(id_list[i]));
+out:
+ free_osd_req(req);
+out_buf:
+ kfree(buf);
+ return ret;
+}
+#endif
+
+static int create(struct osd_dev *dev, uint64_t p_id, uint64_t o_id,
+ int timeout)
+{
+ struct osd_request *req;
+ uint8_t cred_a[OSD_CAP_LEN];
+ int ret;
+
+ make_credential(cred_a, p_id, o_id);
+ req = prepare_osd_create(dev, p_id, o_id);
+
+ if (req == NULL) {
+ EXOFS_ERR("ERROR: Failed to allocate request.\n");
+ return -ENOMEM;
+ }
+
+ ret = kick_it(req, timeout, cred_a, "create");
+
+ free_osd_req(req);
+
+ return ret;
+}
+
+static int write_super(struct osd_dev *dev, uint64_t p_id, int timeout,
+ int newfile)
+{
+ struct osd_request *req;
+ uint8_t cred_a[OSD_CAP_LEN];
+ struct exofs_fscb data;
+ int ret;
+
+ make_credential(cred_a, p_id, EXOFS_SUPER_ID);
+
+ data.s_nextid = 4;
+ data.s_magic = EXOFS_SUPER_MAGIC;
+ data.s_newfs = 1;
+ if (newfile)
+ data.s_numfiles = 1;
+ else
+ data.s_numfiles = 0;
+
+ req = prepare_osd_write(dev, p_id, EXOFS_SUPER_ID,
+ sizeof(struct exofs_fscb), 0, 0,
+ (unsigned char *)(&data));
+
+ if (req == NULL) {
+ EXOFS_ERR("ERROR: Failed to allocate request.\n");
+ return -ENOMEM;
+ }
+
+ ret = kick_it(req, timeout, cred_a, "write super");
+
+ free_osd_req(req);
+
+ return ret;
+}
+
+#ifdef __MKEXOFS_DEBUG_CHECKS
+static int read_super(struct osd_dev *dev, uint64_t p_id, int timeout)
+{
+ struct osd_request *req;
+ uint8_t cred_a[OSD_CAP_LEN];
+ struct exofs_fscb data;
+ int ret;
+
+ make_credential(cred_a, p_id, EXOFS_SUPER_ID);
+
+ req = prepare_osd_read(dev, p_id, EXOFS_SUPER_ID,
+ sizeof(struct exofs_fscb), 0, 0,
+ (unsigned char *)(&data));
+
+ if (req == NULL) {
+ EXOFS_ERR("ERROR: Failed to allocate request.\n");
+ return -ENOMEM;
+ }
+
+ ret = kick_it(req, timeout, cred_a, "read super");
+ if (ret)
+ goto out;
+
+ EXOFS_DBGMSG("nextid:\t%u\n", data.s_nextid);
+ EXOFS_DBGMSG("magic:\t%u\n", data.s_magic);
+ EXOFS_DBGMSG("numfiles:\t%u\n", data.s_numfiles);
+out:
+ free_osd_req(req);
+
+ return ret;
+}
+#endif
+
+static int write_bitmap(struct osd_dev *dev, uint64_t p_id, int timeout)
+{
+ struct osd_request *req;
+ uint8_t cred_a[OSD_CAP_LEN];
+ uint64_t off = 0;
+ unsigned int id = 3;
+ int ret;
+
+ /* XXX: For now just use counter - later make bitmap */
+ make_credential(cred_a, p_id, EXOFS_BM_ID);
+
+ req = prepare_osd_write(dev, p_id, EXOFS_BM_ID, sizeof(unsigned int),
+ off, 0, (unsigned char *)&id);
+ if (req == NULL) {
+ EXOFS_ERR("ERROR: Failed to allocate request.\n");
+ return -ENOMEM;
+ }
+
+ ret = kick_it(req, timeout, cred_a, "write bitmap");
+
+ free_osd_req(req);
+
+ return ret;
+}
+
+#ifdef __MKEXOFS_DEBUG_CHECKS
+static int write_testfile(struct osd_dev *dev, uint64_t p_id, int timeout)
+{
+ struct osd_request *req;
+ uint8_t cred_a[OSD_CAP_LEN];
+ uint64_t off = 0;
+ unsigned char buf[64] = {0};
+ int ret;
+
+ strcpy((char *)buf, "This file is a test, it is only a test.");
+ make_credential(cred_a, p_id, EXOFS_TEST_ID);
+
+ req = prepare_osd_write(dev, p_id, EXOFS_TEST_ID, 64, off, 0, buf);
+
+ if (req == NULL) {
+ EXOFS_ERR("ERROR: Failed to allocate request.\n");
+ return -ENOMEM;
+ }
+
+ ret = kick_it(req, timeout, cred_a, "write test file");
+
+ free_osd_req(req);
+
+ return ret;
+}
+
+static int read_testfile(struct osd_dev *dev, uint64_t p_id, int timeout)
+{
+ struct osd_request *req;
+ uint8_t cred_a[OSD_CAP_LEN];
+ unsigned char data[64];
+ int ret;
+
+ make_credential(cred_a, p_id, EXOFS_TEST_ID);
+
+ req = prepare_osd_read(dev, p_id, EXOFS_TEST_ID, 64, 0, 0, data);
+
+ if (req == NULL) {
+ EXOFS_ERR("ERROR: Failed to allocate request.\n");
+ return -ENOMEM;
+ }
+
+ ret = kick_it(req, timeout, cred_a, "read test file");
+ if (ret)
+ goto out;
+
+ EXOFS_DBGMSG("test file: %s\n", data);
+
+out:
+ free_osd_req(req);
+
+ return ret;
+}
+#endif
+
+static int write_rootdir(struct osd_dev *dev, uint64_t p_id, int timeout,
+ int newfile)
+{
+ struct osd_request *req;
+ uint8_t cred_a[OSD_CAP_LEN];
+ struct exofs_dir_entry *dir;
+ uint64_t off = 0;
+ unsigned char *buf = NULL;
+ int filetype = EXOFS_FT_DIR << 8;
+ int filetype2 = EXOFS_FT_REG_FILE << 8;
+ int rec_len;
+ int done;
+ int ret;
+
+ buf = kzalloc(EXOFS_BLKSIZE, GFP_KERNEL);
+ if (!buf) {
+ EXOFS_ERR("ERROR: Failed to allocate memory.\n");
+ return -ENOMEM;
+ }
+ dir = (struct exofs_dir_entry *)buf;
+
+ /* create entry for '.' */
+ dir->name[0] = '.';
+ dir->name_len = 1 | filetype;
+ dir->inode = EXOFS_ROOT_ID - EXOFS_OBJ_OFF;
+ dir->rec_len = EXOFS_DIR_REC_LEN(1);
+ rec_len = EXOFS_BLKSIZE - EXOFS_DIR_REC_LEN(1);
+
+ /* create entry for '..' */
+ dir = (struct exofs_dir_entry *) (buf + dir->rec_len);
+ dir->name[0] = '.';
+ dir->name[1] = '.';
+ dir->name_len = 2 | filetype;
+ dir->inode = EXOFS_ROOT_ID - EXOFS_OBJ_OFF;
+ if (newfile) {
+ rec_len -= EXOFS_DIR_REC_LEN(2);
+ dir->rec_len = EXOFS_DIR_REC_LEN(2);
+ } else
+ dir->rec_len = rec_len;
+ done = EXOFS_DIR_REC_LEN(1) + dir->rec_len;
+
+ /* create entry for 'test', if specified */
+ if (newfile) {
+ dir = (struct exofs_dir_entry *) (buf + done);
+ dir->inode = EXOFS_TEST_ID - EXOFS_OBJ_OFF;
+ dir->name_len = 4 | filetype2;
+ dir->name[0] = 't';
+ dir->name[1] = 'e';
+ dir->name[2] = 's';
+ dir->name[3] = 't';
+ dir->rec_len = rec_len;
+ }
+
+ make_credential(cred_a, p_id, EXOFS_ROOT_ID);
+
+ req = prepare_osd_write(dev, p_id, EXOFS_ROOT_ID, EXOFS_BLKSIZE, off,
+ 0, buf);
+ if (req == NULL) {
+ EXOFS_ERR("ERROR: Failed to allocate request.\n");
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ ret = kick_it(req, timeout, cred_a, "write rootdir");
+ free_osd_req(req);
+out:
+ kfree(buf);
+ return ret;
+}
+
+static int set_inode(struct osd_dev *dev, uint64_t p_id, int timeout,
+ uint64_t o_id, uint16_t mode)
+{
+ struct osd_request *req;
+ uint8_t cred_a[OSD_CAP_LEN];
+ struct exofs_fcb in = {0};
+ struct exofs_fcb *inode = &in;
+ uint32_t i_generation;
+ int ret;
+
+ inode->i_mode = cpu_to_be16(mode);
+ inode->i_uid = inode->i_gid = 0;
+ inode->i_links_count = cpu_to_be16(2);
+ inode->i_ctime = inode->i_atime = inode->i_mtime =
+ cpu_to_be32(CURRENT_TIME.tv_sec);
+ inode->i_size = cpu_to_be64(EXOFS_BLKSIZE);
+ if (o_id != EXOFS_ROOT_ID)
+ inode->i_size = cpu_to_be64(64);
+
+ get_random_bytes(&i_generation, sizeof(i_generation));
+ inode->i_generation = cpu_to_be32(i_generation);
+
+ make_credential(cred_a, p_id, o_id);
+
+ req = prepare_osd_set_attr(dev, p_id, o_id);
+
+ if (req == NULL) {
+ EXOFS_ERR("ERROR: Failed to allocate request.\n");
+ return -ENOMEM;
+ }
+
+ prepare_set_attr_list_add_entry(req,
+ OSD_PAGE_NUM_IBM_UOBJ_FS_DATA,
+ OSD_ATTR_NUM_IBM_UOBJ_FS_DATA_INODE,
+ EXOFS_INO_ATTR_SIZE,
+ (unsigned char *)inode);
+
+ ret = kick_it(req, timeout, cred_a, "set inode");
+
+ free_osd_req(req);
+
+ return ret;
+}
+
+#ifdef __MKEXOFS_DEBUG_CHECKS
+static int get_root_attr(struct osd_dev *dev, uint64_t p_id, int timeout)
+{
+ struct osd_request *req;
+ uint8_t cred_a[OSD_CAP_LEN];
+ struct exofs_fcb in = {0};
+ struct exofs_fcb *inode = &in;
+ uint32_t page = OSD_PAGE_NUM_IBM_UOBJ_FS_DATA;
+ uint32_t attr = OSD_ATTR_NUM_IBM_UOBJ_FS_DATA_INODE;
+ uint16_t expected = EXOFS_INO_ATTR_SIZE;
+ uint8_t *buf;
+ int ret;
+
+ make_credential(cred_a, p_id, EXOFS_ROOT_ID);
+
+ req = prepare_osd_get_attr(dev, p_id, EXOFS_ROOT_ID);
+ if (req == NULL) {
+ EXOFS_ERR("ERROR: Failed to allocate request.\n");
+ return -ENOMEM;
+ }
+
+ prepare_get_attr_list_add_entry(req,
+ OSD_PAGE_NUM_IBM_UOBJ_FS_DATA,
+ OSD_ATTR_NUM_IBM_UOBJ_FS_DATA_INODE,
+ EXOFS_INO_ATTR_SIZE);
+
+ ret = kick_it(req, timeout, cred_a, "get root inode");
+ if (ret)
+ goto out;
+
+ ret = extract_next_attr_from_req(req, &page, &attr, &expected, &buf);
+ if (ret) {
+ EXOFS_ERR("ERROR: extract attr from req failed\n");
+ goto out;
+ }
+
+ memcpy(inode, buf, sizeof(struct exofs_fcb));
+
+ EXOFS_DBGMSG("mode: %u\n", be16_to_cpu(inode->i_mode));
+ EXOFS_DBGMSG("uid: %u\n", be32_to_cpu(inode->i_uid));
+ EXOFS_DBGMSG("gid: %u\n", be32_to_cpu(inode->i_gid));
+ EXOFS_DBGMSG("links: %u\n", be16_to_cpu(inode->i_links_count));
+ EXOFS_DBGMSG("ctime: %u\n", be32_to_cpu(inode->i_ctime));
+ EXOFS_DBGMSG("atime: %u\n", be32_to_cpu(inode->i_atime));
+ EXOFS_DBGMSG("mtime: %u\n", be32_to_cpu(inode->i_mtime));
+ EXOFS_DBGMSG("gen: %u\n", be32_to_cpu(inode->i_generation));
+ EXOFS_DBGMSG("size: %llu\n", _LLU(be64_to_cpu(inode->i_size)));
+
+out:
+ free_osd_req(req);
+
+ return ret;
+}
+#endif
+
+/*
+ * This function creates an exofs file system on the specified OSD partition.
+ */
+int exofs_mkfs(struct osd_dev *dev, uint64_t p_id, uint64_t format_size_meg)
+{
+ int err;
+ const int to_format = (4 * 60 * HZ);
+ const int to_gen = (60 * HZ);
+ bool newfile = false;
+
+ /* Get a handle */
+ EXOFS_DBGMSG("setting up exofs on partition %llu:\n", _LLU(p_id));
+
+ /* Format LUN if requested */
+ if (format_size_meg > 0) {
+ EXOFS_DBGMSG("formatting %llu MB...\n", _LLU(format_size_meg));
+ err = format(format_size_meg * 1024 * 1024, dev, to_format);
+ if (err)
+ goto out;
+ EXOFS_DBGMSG(" OK\n");
+ }
+
+ /* Create partition */
+ EXOFS_DBGMSG("creating partition...\n");
+ err = create_partition(dev, p_id, to_gen);
+ if (err)
+ goto out;
+ EXOFS_DBGMSG(" OK\n");
+
+ /* Create object with known ID for superblock info */
+ EXOFS_DBGMSG("creating superblock...\n");
+ err = create(dev, p_id, EXOFS_SUPER_ID, to_gen);
+ if (err)
+ goto out;
+ EXOFS_DBGMSG(" OK\n");
+
+ /* Create root directory object */
+ EXOFS_DBGMSG("creating root directory...\n");
+ err = create(dev, p_id, EXOFS_ROOT_ID, to_gen);
+ if (err)
+ goto out;
+ EXOFS_DBGMSG(" OK\n");
+
+ /* Create bitmap object */
+ EXOFS_DBGMSG("creating free ID bitmap...\n");
+ err = create(dev, p_id, EXOFS_BM_ID, to_gen);
+ if (err)
+ goto out;
+ EXOFS_DBGMSG(" OK\n");
+
+#ifdef __MKEXOFS_DEBUG_CHECKS
+ /* Create a test file, if specified by options */
+ if (newfile) {
+ EXOFS_DBGMSG("creating test file...\n");
+ err = create(dev, p_id, EXOFS_TEST_ID, to_gen);
+ if (err)
+ goto out;
+ EXOFS_DBGMSG(" OK\n");
+ }
+#endif
+
+ /* Write superblock */
+ EXOFS_DBGMSG("writing superblock...\n");
+ err = write_super(dev, p_id, to_gen, newfile);
+ if (err)
+ goto out;
+ EXOFS_DBGMSG(" OK\n");
+
+ /* Write root directory */
+ EXOFS_DBGMSG("writing root directory...\n");
+ err = write_rootdir(dev, p_id, to_gen, newfile);
+ if (err)
+ goto out;
+ EXOFS_DBGMSG(" OK\n");
+
+ /* Set root partition inode attribute */
+ EXOFS_DBGMSG("writing root inode...\n");
+ err = set_inode(dev, p_id, to_gen, EXOFS_ROOT_ID,
+ 0040000 | (0777 & ~022));
+ if (err)
+ goto out;
+ EXOFS_DBGMSG(" OK\n");
+
+#ifdef __MKEXOFS_DEBUG_CHECKS
+ /* Set test file inode attribute */
+ if (newfile) {
+ EXOFS_DBGMSG("writing test inode...\n");
+ err = set_inode(dev, p_id, to_gen, EXOFS_TEST_ID,
+ 0100000 | (0777 & ~022));
+ if (err)
+ goto out;
+ EXOFS_DBGMSG(" OK\n");
+ }
+#endif
+ /* Write bitmap */
+ EXOFS_DBGMSG("writing free ID bitmap...\n");
+ err = write_bitmap(dev, p_id, to_gen);
+ if (err)
+ goto out;
+ EXOFS_DBGMSG(" OK\n");
+
+#ifdef __MKEXOFS_DEBUG_CHECKS
+ /* Write test file */
+ if (newfile) {
+ EXOFS_DBGMSG("writing test file...\n");
+ err = write_testfile(dev, p_id, to_gen);
+ if (err)
+ goto out;
+ EXOFS_DBGMSG(" OK\n");
+ }
+
+ /* some debug info */
+ {
+ EXOFS_DBGMSG("listing:\n");
+ list(dev, p_id, to_gen);
+ EXOFS_DBGMSG("contents of superblock:\n");
+ read_super(dev, p_id, to_gen);
+ EXOFS_DBGMSG("contents of root inode:\n");
+ get_root_attr(dev, p_id, to_gen);
+ if (newfile) {
+ EXOFS_DBGMSG("contents of test file:\n");
+ read_testfile(dev, p_id, to_gen);
+ }
+ }
+#endif
+ EXOFS_DBGMSG("\nsetup complete: enjoy your shiny new exofs!\n");
+
+out:
+ return err;
+}
diff --git a/fs/exofs/super.c b/fs/exofs/super.c
index 8ecf700..459b935 100644
--- a/fs/exofs/super.c
+++ b/fs/exofs/super.c
@@ -55,6 +55,8 @@ enum { Opt_lun, Opt_tid, Opt_pid, Opt_to, Opt_mkfs, Opt_format, Opt_err };
static match_table_t tokens = {
{Opt_pid, "pid=%u"},
{Opt_to, "to=%u"},
+ {Opt_mkfs, "mkfs=%u"},
+ {Opt_format, "format=%u"},
{Opt_err, NULL}
};
@@ -100,6 +102,16 @@ static int parse_options(char *options, struct exofs_mountopt *opts)
}
opts->timeout = option * HZ;
break;
+ case Opt_mkfs:
+ if (match_int(&args[0], &option))
+ return -EINVAL;
+ opts->mkfs = option != 0;
+ break;
+ case Opt_format:
+ if (match_int(&args[0], &option))
+ return -EINVAL;
+ opts->format = option;
+ break;
}
}
@@ -277,6 +289,12 @@ static int exofs_fill_super(struct super_block *sb, void *data, int silent)
sb->s_bdev = NULL;
sb->s_dev = 0;
+ /* see if we need to make the file system on the OSD */
+ if (opts->mkfs) {
+ ret = exofs_mkfs(sbi->s_dev, sbi->s_pid, opts->format);
+ if (ret)
+ goto free_sbi;
+ }
/* read data from on-disk superblock object */
make_credential(sbi->s_cred, sbi->s_pid, EXOFS_SUPER_ID);
--
1.6.0.1
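To make the record-length bookkeeping in write_rootdir() above easier to
check, here is a small stand-alone sketch of it. It assumes an ext2-style
record-length macro (8-byte entry header, 4-byte alignment) and a 4096-byte
block; both are assumptions based on exofs' ext2 origin, since the actual
EXOFS_DIR_REC_LEN and EXOFS_BLKSIZE definitions are not part of this hunk.
All sk_/sketch_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values, for illustration only. */
#define SK_BLKSIZE	4096u
#define SK_DIR_PAD	4u
#define SK_DIR_REC_LEN(name_len) \
	(((name_len) + 8u + SK_DIR_PAD - 1u) & ~(SK_DIR_PAD - 1u))

/*
 * Fill rec_len[] for ".", ".." and, optionally, "test".
 * The last entry always absorbs the rest of the block, so the record
 * lengths add up to exactly one block. Returns the number of entries.
 */
size_t sketch_rootdir_rec_lens(int with_test_file, unsigned int rec_len[3])
{
	unsigned int left = SK_BLKSIZE;
	size_t n = 0;

	assert(rec_len != NULL);

	rec_len[n] = SK_DIR_REC_LEN(1u);		/* "." */
	left -= rec_len[n++];

	if (with_test_file) {
		rec_len[n] = SK_DIR_REC_LEN(2u);	/* ".." */
		left -= rec_len[n++];
	}

	rec_len[n++] = left;	/* ".." or "test" spans the remainder */
	return n;
}
```

Under these assumptions the plain root directory comes out as 12 + 4084
bytes, and the debug variant with the test file as 12 + 12 + 4072 bytes.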
Added some documentation in exofs.txt, as well as a BUGS file.
For further reading, operation instructions, example scripts,
and up-to-date information and code, please see:
http://open-osd.org
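The create-ordering rules listed in the DESIGN section of exofs.txt (added
by this patch) can be condensed into a small decision function. The sketch
below is only a reading aid: the flag bit values, the sk_ names and the
enum of actions are hypothetical, and only the OBJ_CREATED/OBJ_2BCREATED
concept and the per-operation rules are taken from the text.

```c
#include <assert.h>

/* Hypothetical flag bits; the text only names the two flags. */
enum sk_flags {
	SK_OBJ_2BCREATED = 1u << 0,	/* create has been started */
	SK_OBJ_CREATED   = 1u << 1,	/* object known to exist on the OSD */
};

enum sk_op {
	SK_DELETE, SK_TRUNCATE, SK_UPDATE_INODE,
	SK_WRITEPAGE_SYNC, SK_WRITEPAGE_ASYNC, SK_READPAGE,
};

enum sk_action { SK_PROCEED, SK_WAIT, SK_SKIP, SK_ZERO_PAGE };

/* What should an operation do, given the object's creation state? */
enum sk_action sketch_order(unsigned int flags, enum sk_op op)
{
	/* The text only covers objects that exist or are being created. */
	assert(flags & (SK_OBJ_2BCREATED | SK_OBJ_CREATED));

	if (flags & SK_OBJ_CREATED)
		return SK_PROCEED;

	/* Create is still in flight. */
	switch (op) {
	case SK_READPAGE:
		return SK_ZERO_PAGE;	/* nothing on the OSD yet */
	case SK_WRITEPAGE_ASYNC:
		return SK_SKIP;		/* "it should just return" */
	default:
		return SK_WAIT;		/* delete, truncate, update_inode,
					 * sync writepage */
	}
}
```

The pairs that the text says are covered by page locks (readpage/delete
and so on) are deliberately left out of the sketch.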
Signed-off-by: Boaz Harrosh <[email protected]>
---
Documentation/filesystems/exofs.txt | 173 +++++++++++++++++++++++++++++++++++
fs/exofs/BUGS | 6 +
2 files changed, 179 insertions(+), 0 deletions(-)
create mode 100644 Documentation/filesystems/exofs.txt
create mode 100644 fs/exofs/BUGS
diff --git a/Documentation/filesystems/exofs.txt b/Documentation/filesystems/exofs.txt
new file mode 100644
index 0000000..ec5040c
--- /dev/null
+++ b/Documentation/filesystems/exofs.txt
@@ -0,0 +1,173 @@
+===============================================================================
+WHAT IS EXOFS?
+===============================================================================
+
+exofs is a file system that uses an OSD and exports the API of a normal Linux
+file system. Users access exofs like any other local file system, and exofs
+will in turn issue commands to the local initiator.
+
+===============================================================================
+ENVIRONMENT
+===============================================================================
+
+To use this file system, you need to have an object store to run it on. You
+may download a target from:
+http://open-osd.org
+
+See drivers/scsi/osd/README for how to set up a working osd environment.
+
+===============================================================================
+USAGE
+===============================================================================
+
+1. Download and compile exofs and open-osd initiator:
+ You need an external Kernel source tree or kernel headers from your
+ distribution. (anything based on 2.6.26 or later).
+
+ a. download open-osd including exofs source using:
+ [parent-directory]$ git clone git://git.open-osd.org/open-osd.git
+
+ b. Build the library module like this:
+ [parent-directory]$ make -C KSRC=$(KER_DIR) open-osd
+
+ This will build both the open-osd initiator as well as the exofs kernel
+ module. Use whatever parameters you compiled your Kernel with and
+ $(KER_DIR) above pointing to the Kernel you compile against. See the file
+ open-osd/top-level-Makefile for an example.
+
+2. Get the OSD initiator and target set up properly, and login to the target.
+ See drivers/scsi/osd/README for further instructions. Also see ./do-osd-test
+ for an example script that does all these steps.
+
+3. Insmod the exofs.ko module:
+ [exofs]$ insmod exofs.ko
+
+4. Make sure the directory where you want to mount exists. If not, create it.
+ (For example, /mnt/exofs)
+
+5. On the first run you will need to invoke the mkexofs.c routine.
+
+ As an example, this will create the file system on
+ /dev/osd0, partition ID 65540, with a maximum capacity of 1024 megabytes:
+
+ mount -t exofs -o pid=65540,mkfs=1,format=1024 /dev/osd0 /mnt/exofs/
+
+ The format=1024 option is optional. If it is not specified, no OSD_FORMAT
+ is performed and a clean file system is created in the specified pid, in
+ the available space of the target.
+ If the pid already exists it will be deleted and a new one will be created
+ in its place. Be careful.
+
+6. Mount the file system. The above command left the filesystem mounted,
+ but on subsequent runs mkfs=1 should not be specified.
+
+ For example, to mount /dev/osd0, partition ID 65540 on /mnt/exofs:
+
+ mount -t exofs -o pid=65540 /dev/osd0 /mnt/exofs/
+
+7. For reference (under fs/exofs/):
+ do-exofs start - an example of how to perform the above steps.
+ do-exofs stop - an example of how to unmount the file system.
+
+8. Extra compilation flags (uncomment in fs/exofs/Kbuild):
+ EXOFS_DEBUG - for debug messages and extra checks.
+
+===============================================================================
+exofs mount options
+===============================================================================
+Similar to any mount command:
+ mount -t exofs -o exofs_options /dev/osdX mount_exofs_directory
+
+Where:
+ -t exofs: specifies the exofs file system
+
+ /dev/osdX: X is a decimal number. /dev/osdX was created after a successful
+ login into an OSD target.
+
+ mount_exofs_directory: The directory to mount the file system on
+
+ exofs_options: Options are separated by commas (,)
+ pid=<integer> - The partition number to mount/create as
+ container of the filesystem.
+ This option is mandatory
+ mkfs=<1/0> - If mkfs=1 make a new filesystem before mount.
+ Default is 0 - don't make. If mkfs=0, pid must
+ exist and an mkfs=1 must have been previously
+ performed on it.
+ format=<integer>- If mkfs=1 is specified then the format=
+ parameter will also invoke an OSD_FORMAT
+ command prior to creation of the filesystem
+ partition (mkfs). The integer specified is in
+ megabytes. If not specified or set to 0 then
+ no format is executed, and a partition is
+ created in the available space.
+ If mkfs=0 this option is ignored.
+ to=<integer> - Timeout in seconds for a single command
+ default is 60 (60 * HZ ticks) [for debugging only]
+
+===============================================================================
+DESIGN
+===============================================================================
+
+* The file system control block (AKA on-disk superblock) resides in an object
+ with a special ID (defined in common.h).
+ Information included in the file system control block is used to fill the
+ in-memory superblock structure at mount time. This object is created by
+ mkexofs.c before the file system is used. It contains information such as:
+ - The file system's magic number
+ - The next inode number to be allocated
+
+* Each file resides in its own object and contains the data (and it will be
+ possible to extend the file over multiple objects, though this has not been
+ implemented yet).
+
+* A directory is treated as a file, and essentially contains a list of <file
+ name, inode #> pairs for files that are found in that directory. The object
+ IDs correspond to the files' inode numbers and will be allocated according to
+ a bitmap (stored in a separate object). For now they are allocated using a
+ counter.
+
+* Each file's control block (AKA on-disk inode) is stored in its object's
+ attributes. This applies to both regular files and other types (directories,
+ device files, symlinks, etc.).
+
+* Credentials are generated per object (inode and superblock) when it is
+ created in memory (read off disk or created). The credential works for all
+ operations and is used as long as the object remains in memory.
+
+* Async OSD operations are used whenever possible, but the target may execute
+ them out of order. The operations that concern us are create, delete,
+ readpage, writepage, update_inode, and truncate. The following pairs of
+ operations should execute in the order written, and we need to prevent them
+ from executing in reverse order:
+ - The following are handled with the OBJ_CREATED and OBJ_2BCREATED
+ flags. OBJ_CREATED is set when we know the object exists on the OSD -
+ in create's callback function, and when we successfully do a read_inode.
+ OBJ_2BCREATED is set in the beginning of the create function, so we
+ know that we should wait.
+ - create/delete: delete should wait until the object is created
+ on the OSD.
+ - create/readpage: readpage should be able to return a page
+ full of zeroes in this case. If there was a write already
+ en-route (i.e. create, writepage, readpage) then the page
+ would be locked, and so it would really be the same as
+ create/writepage.
+ - create/writepage: if writepage is called for a sync write, it
+ should wait until the object is created on the OSD.
+ Otherwise, it should just return.
+ - create/truncate: truncate should wait until the object is
+ created on the OSD.
+ - create/update_inode: update_inode should wait until the
+ object is created on the OSD.
+ - Handled by VFS locks:
+ - readpage/delete: shouldn't happen because of page lock.
+ - writepage/delete: shouldn't happen because of page lock.
+ - readpage/writepage: shouldn't happen because of page lock.
+
+===============================================================================
+LICENSE/COPYRIGHT
+===============================================================================
+The exofs file system is based on ext2 v0.5b (distributed with the Linux kernel
+version 2.6.10). All files include the original copyrights, and the license
+is GPL version 2 (only version 2, as is true for the Linux kernel). The
+Linux kernel can be downloaded from http://www.kernel.org.
diff --git a/fs/exofs/BUGS b/fs/exofs/BUGS
new file mode 100644
index 0000000..6d6e1f9
--- /dev/null
+++ b/fs/exofs/BUGS
@@ -0,0 +1,6 @@
+- Some mount time options should have been 64-bit, but are declared as 32-bit
+ because that's what the kernel's parsing methods support at this time.
+
+- Out-of-space may cause a severe problem if the object (and directory entry)
+ were written, but the inode attributes failed. Then if the filesystem was
+ unmounted and mounted the kernel can get into an endless loop doing a readdir.
--
1.6.0.1
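The create-ordering rules in the design notes above can be read as a small
decision table. What follows is only a userspace sketch of that table, not
code from the patch set; the enum and function names are hypothetical.

```c
#include <stdbool.h>

/* Operations that may race with an in-flight create (from the README). */
enum exofs_op {
	OP_DELETE,
	OP_READPAGE,
	OP_WRITEPAGE_SYNC,
	OP_WRITEPAGE_ASYNC,
	OP_TRUNCATE,
	OP_UPDATE_INODE,
};

/* What the operation should do, given whether OBJ_CREATED is set. */
enum create_race_action {
	CR_PROCEED,		/* object already exists on the OSD */
	CR_WAIT_FOR_CREATE,	/* sleep until create's callback sets OBJ_CREATED */
	CR_DEFER,		/* just return; the page is written later */
	CR_ZERO_FILL,		/* readpage: nothing on the OSD yet, return zeroes */
};

static enum create_race_action action_vs_create(enum exofs_op op,
						 bool obj_created)
{
	if (obj_created)
		return CR_PROCEED;

	switch (op) {
	case OP_READPAGE:
		return CR_ZERO_FILL;
	case OP_WRITEPAGE_ASYNC:
		return CR_DEFER;
	case OP_DELETE:
	case OP_WRITEPAGE_SYNC:
	case OP_TRUNCATE:
	case OP_UPDATE_INODE:
		return CR_WAIT_FOR_CREATE;
	}
	return CR_WAIT_FOR_CREATE;
}
```

The pairs the README lists under "Handled by VFS locks" are deliberately
absent: they are serialized by the page lock, not by these flags.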
- Add exofs to fs/Kconfig under "menu 'Miscellaneous filesystems'"
- Add exofs to fs/Makefile
Signed-off-by: Boaz Harrosh <[email protected]>
---
fs/Kconfig | 1 +
fs/Makefile | 1 +
2 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/fs/Kconfig b/fs/Kconfig
index 522469a..a2e6129 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -1122,6 +1122,7 @@ config UFS_DEBUG
Y here. This will result in _many_ additional debugging messages to be
written to the system log.
+source "fs/exofs/Kconfig"
endmenu
menuconfig NETWORK_FILESYSTEMS
diff --git a/fs/Makefile b/fs/Makefile
index d9f8afe..920250b 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -122,3 +122,4 @@ obj-$(CONFIG_HPPFS) += hppfs/
obj-$(CONFIG_DEBUG_FS) += debugfs/
obj-$(CONFIG_OCFS2_FS) += ocfs2/
obj-$(CONFIG_GFS2_FS) += gfs2/
+obj-$(CONFIG_EXOFS_FS) += exofs/
--
1.6.0.1
This patch contains all the osd infrastructure that will be used later
by the file system.
It also adds the declarations of constants, on-disk structures, and prototypes,
as well as the Kbuild+Kconfig files needed to build the exofs module.
Signed-off-by: Boaz Harrosh <[email protected]>
---
fs/exofs/Kbuild | 30 +++++
fs/exofs/Kconfig | 13 ++
fs/exofs/common.h | 154 ++++++++++++++++++++++++
fs/exofs/exofs.h | 183 +++++++++++++++++++++++++++++
fs/exofs/osd.c | 334 +++++++++++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 714 insertions(+), 0 deletions(-)
create mode 100644 fs/exofs/Kbuild
create mode 100644 fs/exofs/Kconfig
create mode 100644 fs/exofs/common.h
create mode 100644 fs/exofs/exofs.h
create mode 100644 fs/exofs/osd.c
diff --git a/fs/exofs/Kbuild b/fs/exofs/Kbuild
new file mode 100644
index 0000000..fd3351e
--- /dev/null
+++ b/fs/exofs/Kbuild
@@ -0,0 +1,30 @@
+#
+# Kbuild for the EXOFS module
+#
+# Copyright (C) 2008 Panasas Inc. All rights reserved.
+#
+# Authors:
+# Boaz Harrosh <[email protected]>
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License version 2
+#
+# Kbuild - Gets included from the Kernel's Makefile and build system
+#
+
+ifneq ($(OSD_INC),)
+# we are built out-of-tree; configure everything as if enabled in Kconfig
+
+CONFIG_EXOFS_FS=m
+ccflags += -DCONFIG_EXOFS_FS -DCONFIG_EXOFS_FS_MODULE
+# ccflags += -DCONFIG_EXOFS_DEBUG
+
+# if we are built out-of-tree and the hosting kernel has OSD headers
+# then "ccflags-y +=" will not pick the out-of-tree headers. Only prepending
+# to KBUILD_CPPFLAGS works. This might break in future kernels
+KBUILD_CPPFLAGS := -I$(OSD_INC) $(KBUILD_CPPFLAGS)
+
+endif
+
+exofs-objs := osd.o
+obj-$(CONFIG_EXOFS_FS) += exofs.o
diff --git a/fs/exofs/Kconfig b/fs/exofs/Kconfig
new file mode 100644
index 0000000..86194b2
--- /dev/null
+++ b/fs/exofs/Kconfig
@@ -0,0 +1,13 @@
+config EXOFS_FS
+ tristate "exofs: OSD based file system support"
+ depends on SCSI_OSD_ULD
+ help
+ EXOFS is a file system that uses an OSD storage device,
+ as its backing storage.
+
+# Debugging-related stuff
+config EXOFS_DEBUG
+ bool "Enable debugging"
+ depends on EXOFS_FS
+ help
+ This option enables EXOFS debug prints.
diff --git a/fs/exofs/common.h b/fs/exofs/common.h
new file mode 100644
index 0000000..9a165b3
--- /dev/null
+++ b/fs/exofs/common.h
@@ -0,0 +1,154 @@
+/*
+ * Copyright (C) 2005, 2006
+ * Avishay Traeger ([email protected]) ([email protected])
+ * Copyright (C) 2005, 2006
+ * International Business Machines
+ *
+ * Copyrights for code taken from ext2:
+ * Copyright (C) 1992, 1993, 1994, 1995
+ * Remy Card ([email protected])
+ * Laboratoire MASI - Institut Blaise Pascal
+ * Universite Pierre et Marie Curie (Paris VI)
+ * from
+ * linux/fs/minix/inode.c
+ * Copyright (C) 1991, 1992 Linus Torvalds
+ *
+ * This file is part of exofs.
+ *
+ * exofs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation. Since it is based on ext2, and the only
+ * valid version of GPL for the Linux kernel is version 2, the only valid
+ * version of GPL for exofs is version 2.
+ *
+ * exofs is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with exofs; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#ifndef __EXOFS_COM_H__
+#define __EXOFS_COM_H__
+
+#include <linux/types.h>
+#include <linux/timex.h>
+
+#include <scsi/osd_attributes.h>
+#include <scsi/osd_initiator.h>
+#include <scsi/osd_sec.h>
+
+/****************************************************************************
+ * Object ID related defines
+ * NOTE: inode# = object ID - EXOFS_OBJ_OFF
+ ****************************************************************************/
+#define EXOFS_OBJ_OFF 0x10000 /* offset for objects */
+#define EXOFS_SUPER_ID 0x10000 /* object ID for on-disk superblock */
+#define EXOFS_BM_ID 0x10001 /* object ID for ID bitmap */
+#define EXOFS_ROOT_ID 0x10002 /* object ID for root directory */
+#define EXOFS_TEST_ID 0x10003 /* object ID for test object */
+
+/* exofs Application specific page/attribute */
+#ifndef OSD_PAGE_NUM_IBM_UOBJ_FS_DATA
+# define OSD_PAGE_NUM_IBM_UOBJ_FS_DATA (OSD_APAGE_APP_DEFINED_FIRST + 3)
+# define OSD_ATTR_NUM_IBM_UOBJ_FS_DATA_INODE 1
+#endif
+
+/*
+ * The maximum number of files we can have is limited by the size of the
+ * inode number. This is the largest object ID that the file system supports.
+ * Object IDs 0, 1, and 2 are always in use (see above defines).
+ */
+enum {
+ EXOFS_UINT64_MAX = (~0LL),
+ EXOFS_MAX_INO_ID = (sizeof(ino_t) * 8 == 64) ? EXOFS_UINT64_MAX :
+ (1LL << (sizeof(ino_t) * 8 - 1)),
+ EXOFS_MAX_ID = (EXOFS_MAX_INO_ID - 1 - EXOFS_OBJ_OFF),
+};
+
+/****************************************************************************
+ * Misc.
+ ****************************************************************************/
+#define EXOFS_BLKSHIFT 12
+#define EXOFS_BLKSIZE (1UL << EXOFS_BLKSHIFT)
+
+/****************************************************************************
+ * superblock-related things
+ ****************************************************************************/
+#define EXOFS_SUPER_MAGIC 0x5DF5
+
+/*
+ * The file system control block - stored in an object's data (mainly, the one
+ * with ID EXOFS_SUPER_ID). This is where the in-memory superblock is stored
+ * on disk. Right now it just has a magic value, which is basically a sanity
+ * check on our ability to communicate with the object store.
+ */
+struct exofs_fscb {
+ uint32_t s_nextid; /* Highest object ID used */
+ uint32_t s_numfiles; /* Number of files on fs */
+ uint16_t s_magic; /* Magic signature */
+ uint16_t s_newfs; /* Non-zero if this is a new fs */
+};
+
+/****************************************************************************
+ * inode-related things
+ ****************************************************************************/
+#define EXOFS_IDATA 5
+
+/*
+ * The file control block - stored in an object's attributes. This is where
+ * the in-memory inode is stored on disk.
+ */
+struct exofs_fcb {
+ uint64_t i_size; /* Size of the file */
+ uint16_t i_mode; /* File mode */
+ uint16_t i_links_count; /* Links count */
+ uint32_t i_uid; /* Owner Uid */
+ uint32_t i_gid; /* Group Id */
+ uint32_t i_atime; /* Access time */
+ uint32_t i_ctime; /* Creation time */
+ uint32_t i_mtime; /* Modification time */
+ uint32_t i_flags; /* File flags */
+ uint32_t i_version; /* File version */
+ uint32_t i_generation; /* File version (for NFS) */
+ uint32_t i_data[EXOFS_IDATA]; /* Short symlink names and device #s */
+};
+
+#define EXOFS_INO_ATTR_SIZE sizeof(struct exofs_fcb)
+
+/****************************************************************************
+ * dentry-related things
+ ****************************************************************************/
+#define EXOFS_NAME_LEN 255
+
+/*
+ * The on-disk directory entry
+ */
+struct exofs_dir_entry {
+ uint32_t inode; /* inode number */
+ uint16_t rec_len; /* directory entry length */
+ uint8_t name_len; /* name length */
+ uint8_t file_type; /* umm...file type */
+ char name[EXOFS_NAME_LEN]; /* file name */
+};
+
+enum {
+ EXOFS_FT_UNKNOWN,
+ EXOFS_FT_REG_FILE,
+ EXOFS_FT_DIR,
+ EXOFS_FT_CHRDEV,
+ EXOFS_FT_BLKDEV,
+ EXOFS_FT_FIFO,
+ EXOFS_FT_SOCK,
+ EXOFS_FT_SYMLINK,
+ EXOFS_FT_MAX
+};
+
+#define EXOFS_DIR_PAD 4
+#define EXOFS_DIR_ROUND (EXOFS_DIR_PAD - 1)
+#define EXOFS_DIR_REC_LEN(name_len) (((name_len) + 8 + EXOFS_DIR_ROUND) & \
+ ~EXOFS_DIR_ROUND)
+#endif
diff --git a/fs/exofs/exofs.h b/fs/exofs/exofs.h
new file mode 100644
index 0000000..8534450
--- /dev/null
+++ b/fs/exofs/exofs.h
@@ -0,0 +1,183 @@
+/*
+ * Copyright (C) 2005, 2006
+ * Avishay Traeger ([email protected]) ([email protected])
+ * Copyright (C) 2005, 2006
+ * International Business Machines
+ *
+ * Copyrights for code taken from ext2:
+ * Copyright (C) 1992, 1993, 1994, 1995
+ * Remy Card ([email protected])
+ * Laboratoire MASI - Institut Blaise Pascal
+ * Universite Pierre et Marie Curie (Paris VI)
+ * from
+ * linux/fs/minix/inode.c
+ * Copyright (C) 1991, 1992 Linus Torvalds
+ *
+ * This file is part of exofs.
+ *
+ * exofs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation. Since it is based on ext2, and the only
+ * valid version of GPL for the Linux kernel is version 2, the only valid
+ * version of GPL for exofs is version 2.
+ *
+ * exofs is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with exofs; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include <linux/fs.h>
+#include <linux/time.h>
+#include "common.h"
+
+#ifndef __EXOFS_H__
+#define __EXOFS_H__
+
+#define EXOFS_ERR(fmt, a...) printk(KERN_ERR "exofs: " fmt, ##a)
+
+#ifdef CONFIG_EXOFS_DEBUG
+#define EXOFS_DBGMSG(fmt, a...) \
+ printk(KERN_NOTICE "exofs @%s:%d: " fmt, __func__, __LINE__, ##a)
+#else
+#define EXOFS_DBGMSG(fmt, a...) \
+ do {} while (0)
+#endif
+
+/* u64 has problems with printk this will cast it to unsigned long long */
+#define _LLU(x) (unsigned long long)(x)
+
+/*
+ * our extension to the in-memory superblock
+ */
+struct exofs_sb_info {
+ struct osd_dev *s_dev; /* returned by get_osd_dev */
+ uint64_t s_pid; /* partition ID of file system*/
+ int s_timeout; /* timeout for OSD operations */
+ uint32_t s_nextid; /* highest object ID used */
+ uint32_t s_numfiles; /* number of files on fs */
+ spinlock_t s_next_gen_lock; /* spinlock for gen # update */
+ u32 s_next_generation; /* next gen # to use */
+ atomic_t s_curr_pending; /* number of pending commands */
+ uint8_t s_cred[OSD_CAP_LEN]; /* all-powerful credential */
+};
+
+/*
+ * our inode flags
+ */
+#ifdef ARCH_HAS_ATOMIC_UNSIGNED
+typedef unsigned exofs_iflags_t;
+#else
+typedef unsigned long exofs_iflags_t;
+#endif
+
+#define OBJ_2BCREATED 0 /* object will be created soon*/
+#define OBJ_CREATED 1 /* object has been created on the osd*/
+
+#define Obj2BCreated(oi) \
+ test_bit(OBJ_2BCREATED, &(oi->i_flags))
+#define SetObj2BCreated(oi) \
+ set_bit(OBJ_2BCREATED, &(oi->i_flags))
+
+#define ObjCreated(oi) \
+ test_bit(OBJ_CREATED, &(oi->i_flags))
+#define SetObjCreated(oi) \
+ set_bit(OBJ_CREATED, &(oi->i_flags))
+
+/*
+ * our extension to the in-memory inode
+ */
+struct exofs_i_info {
+ exofs_iflags_t i_flags; /* various atomic flags */
+ __le32 i_data[EXOFS_IDATA];/*short symlink names and device #s*/
+ uint32_t i_dir_start_lookup; /* which page to start lookup */
+ wait_queue_head_t i_wq; /* wait queue for inode */
+ uint64_t i_commit_size; /* the object's written length */
+ uint8_t i_cred[OSD_CAP_LEN];/* all-powerful credential */
+ struct inode vfs_inode; /* normal in-memory inode */
+};
+
+/*
+ * get to our inode from the vfs inode
+ */
+static inline struct exofs_i_info *EXOFS_I(struct inode *inode)
+{
+ return container_of(inode, struct exofs_i_info, vfs_inode);
+}
+
+/*************************
+ * function declarations *
+ *************************/
+/* osd.c */
+void make_credential(uint8_t[], uint64_t, uint64_t);
+int check_ok(struct osd_request *);
+int exofs_sync_op(struct osd_request *, int, uint8_t *);
+int exofs_async_op(struct osd_request *, osd_req_done_fn *, void *, char *);
+
+int prepare_get_attr_list_add_entry(struct osd_request *req,
+ uint32_t page_num,
+ uint32_t attr_num,
+ uint32_t attr_len);
+int prepare_set_attr_list_add_entry(struct osd_request *req,
+ uint32_t page_num,
+ uint32_t attr_num,
+ uint16_t attr_len,
+ const unsigned char *attr_val);
+int extract_next_attr_from_req(struct osd_request *req,
+ uint32_t *page_num, uint32_t *attr_num,
+ uint16_t *attr_len, uint8_t **attr_val);
+struct osd_request *prepare_osd_format_lun(struct osd_dev *dev,
+ uint64_t formatted_capacity);
+struct osd_request *prepare_osd_create_partition(struct osd_dev *dev,
+ uint64_t requested_id);
+struct osd_request *prepare_osd_remove_partition(struct osd_dev *dev,
+ uint64_t requested_id);
+struct osd_request *prepare_osd_create(struct osd_dev *dev,
+ uint64_t part_id,
+ uint64_t requested_id);
+struct osd_request *prepare_osd_remove(struct osd_dev *dev,
+ uint64_t part_id,
+ uint64_t obj_id);
+struct osd_request *prepare_osd_set_attr(struct osd_dev *dev,
+ uint64_t part_id,
+ uint64_t obj_id);
+struct osd_request *prepare_osd_get_attr(struct osd_dev *dev,
+ uint64_t part_id,
+ uint64_t obj_id);
+struct osd_request *prepare_osd_read(struct osd_dev *dev,
+ uint64_t part_id,
+ uint64_t obj_id,
+ uint64_t length,
+ uint64_t offset,
+ int cmd_data_use_sg,
+ unsigned char *cmd_data);
+struct osd_request *prepare_osd_write(struct osd_dev *dev,
+ uint64_t part_id,
+ uint64_t obj_id,
+ uint64_t length,
+ uint64_t offset,
+ int cmd_data_use_sg,
+ const unsigned char *cmd_data);
+struct osd_request *prepare_osd_list(struct osd_dev *dev,
+ uint64_t part_id,
+ uint32_t list_id,
+ uint64_t alloc_len,
+ uint64_t initial_obj_id,
+ int use_sg,
+ void *data);
+int extract_list_from_req(struct osd_request *req,
+ uint64_t *total_matches_p,
+ uint64_t *num_ids_retrieved_p,
+ uint64_t *list_of_ids_p[],
+ int *is_list_of_partitions_p,
+ int *list_isnt_up_to_date_p,
+ uint64_t *continuation_tag_p,
+ uint32_t *list_id_for_more_p);
+
+void free_osd_req(struct osd_request *req);
+
+#endif
diff --git a/fs/exofs/osd.c b/fs/exofs/osd.c
new file mode 100644
index 0000000..3859d3e
--- /dev/null
+++ b/fs/exofs/osd.c
@@ -0,0 +1,334 @@
+/*
+ * Copyright (C) 2005, 2006
+ * Avishay Traeger ([email protected]) ([email protected])
+ * Copyright (C) 2005, 2006
+ * International Business Machines
+ *
+ * This file is part of exofs.
+ *
+ * exofs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation. Since it is based on ext2, and the only
+ * valid version of GPL for the Linux kernel is version 2, the only valid
+ * version of GPL for exofs is version 2.
+ *
+ * exofs is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with exofs; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include <scsi/scsi_device.h>
+#include <scsi/osd_sense.h>
+
+#include "exofs.h"
+
+int check_ok(struct osd_request *req)
+{
+ struct osd_sense_info osi;
+ int ret = osd_req_decode_sense(req, &osi);
+
+ if (ret) { /* translate to Linux codes */
+ if (osi.additional_code == scsi_invalid_field_in_cdb) {
+ if (osi.cdb_field_offset == OSD_CFO_STARTING_BYTE)
+ ret = -EFAULT;
+ else if (osi.cdb_field_offset == OSD_CFO_OBJECT_ID)
+ ret = -ENOENT;
+ else
+ ret = -EINVAL;
+ } else if (osi.additional_code == osd_quota_error)
+ ret = -ENOSPC;
+ else
+ ret = -EIO;
+ }
+
+ return ret;
+}
+
+void make_credential(uint8_t cred_a[OSD_CAP_LEN], uint64_t pid, uint64_t oid)
+{
+ struct osd_obj_id obj = {
+ .partition = pid,
+ .id = oid
+ };
+
+ osd_sec_init_nosec_doall_caps(cred_a, &obj, false, true);
+}
+
+/*
+ * Perform a synchronous OSD operation.
+ */
+int exofs_sync_op(struct osd_request *req, int timeout, uint8_t *credential)
+{
+ int ret;
+
+ req->timeout = timeout;
+ ret = osd_finalize_request(req, 0, credential, NULL);
+ if (ret) {
+ EXOFS_DBGMSG("Failed to osd_finalize_request() => %d\n", ret);
+ return ret;
+ }
+
+ ret = osd_execute_request(req);
+
+ if (ret)
+ EXOFS_DBGMSG("osd_execute_request() => %d\n", ret);
+ /* osd_req_decode_sense(or, ret); */
+ return ret;
+}
+
+/*
+ * Perform an asynchronous OSD operation.
+ */
+int exofs_async_op(struct osd_request *req, osd_req_done_fn *async_done,
+ void *caller_context, char *credential)
+{
+ int ret;
+
+ ret = osd_finalize_request(req, 0, credential, NULL);
+ if (ret) {
+ EXOFS_DBGMSG("Failed to osd_finalize_request() => %d\n", ret);
+ return ret;
+ }
+
+ ret = osd_execute_request_async(req, async_done, caller_context);
+
+ if (ret)
+ EXOFS_DBGMSG("osd_execute_request_async() => %d\n", ret);
+ return ret;
+}
+
+int prepare_get_attr_list_add_entry(struct osd_request *req,
+ uint32_t page_num,
+ uint32_t attr_num,
+ uint32_t attr_len)
+{
+ struct osd_attr attr = {
+ .page = page_num,
+ .attr_id = attr_num,
+ .len = attr_len,
+ };
+
+ return osd_req_add_get_attr_list(req, &attr, 1);
+}
+
+int prepare_set_attr_list_add_entry(struct osd_request *req,
+ uint32_t page_num,
+ uint32_t attr_num,
+ uint16_t attr_len,
+ const unsigned char *attr_val)
+{
+ struct osd_attr attr = {
+ .page = page_num,
+ .attr_id = attr_num,
+ .len = attr_len,
+ .val_ptr = (u8 *)attr_val,
+ };
+
+ return osd_req_add_set_attr_list(req, &attr, 1);
+}
+
+int extract_next_attr_from_req(struct osd_request *req,
+ uint32_t *page_num, uint32_t *attr_num,
+ uint16_t *attr_len, uint8_t **attr_val)
+{
+ struct osd_attr attr = {.page = 0}; /* start with zeros */
+ void *iter = NULL;
+ int nelem;
+
+ do {
+ nelem = 1;
+ osd_req_decode_get_attr_list(req, &attr, &nelem, &iter);
+ if ((attr.page == *page_num) && (attr.attr_id == *attr_num)) {
+ *attr_len = attr.len;
+ *attr_val = attr.val_ptr;
+ return 0;
+ }
+ } while (iter);
+
+ return -EIO;
+}
+
+struct osd_request *prepare_osd_format_lun(struct osd_dev *dev,
+ uint64_t formatted_capacity)
+{
+ struct osd_request *or = osd_start_request(dev, GFP_KERNEL);
+
+ if (!or)
+ return NULL;
+
+ osd_req_format(or, formatted_capacity);
+
+ return or;
+}
+
+struct osd_request *prepare_osd_create_partition(struct osd_dev *dev,
+ uint64_t requested_id)
+{
+ struct osd_request *or = osd_start_request(dev, GFP_KERNEL);
+
+ if (!or)
+ return NULL;
+
+ osd_req_create_partition(or, requested_id);
+
+ return or;
+}
+
+struct osd_request *prepare_osd_remove_partition(struct osd_dev *dev,
+ uint64_t requested_id)
+{
+ struct osd_request *or = osd_start_request(dev, GFP_KERNEL);
+
+ if (!or)
+ return NULL;
+
+ osd_req_remove_partition(or, requested_id);
+
+ return or;
+}
+
+struct osd_request *prepare_osd_create(struct osd_dev *dev,
+ uint64_t part_id,
+ uint64_t requested_id)
+{
+ struct osd_obj_id obj = {
+ .partition = part_id,
+ .id = requested_id
+ };
+ struct osd_request *or = osd_start_request(dev, GFP_KERNEL);
+
+ if (!or)
+ return NULL;
+
+ osd_req_create_object(or, &obj);
+
+ return or;
+}
+
+struct osd_request *prepare_osd_remove(struct osd_dev *dev,
+ uint64_t part_id,
+ uint64_t obj_id)
+{
+ struct osd_obj_id obj = {
+ .partition = part_id,
+ .id = obj_id
+ };
+ struct osd_request *or = osd_start_request(dev, GFP_KERNEL);
+
+ if (!or)
+ return NULL;
+
+ osd_req_remove_object(or, &obj);
+
+ return or;
+}
+
+struct osd_request *prepare_osd_set_attr(struct osd_dev *dev,
+ uint64_t part_id,
+ uint64_t obj_id)
+{
+ struct osd_obj_id obj = {
+ .partition = part_id,
+ .id = obj_id
+ };
+ struct osd_request *or = osd_start_request(dev, GFP_KERNEL);
+
+ if (!or)
+ return NULL;
+
+ osd_req_set_attributes(or, &obj);
+
+ return or;
+}
+
+struct osd_request *prepare_osd_get_attr(struct osd_dev *dev,
+ uint64_t part_id,
+ uint64_t obj_id)
+{
+ struct osd_obj_id obj = {
+ .partition = part_id,
+ .id = obj_id
+ };
+ struct osd_request *or = osd_start_request(dev, GFP_KERNEL);
+
+ if (!or)
+ return NULL;
+
+ osd_req_get_attributes(or, &obj);
+
+ return or;
+}
+
+struct osd_request *prepare_osd_read(struct osd_dev *dev,
+ uint64_t part_id,
+ uint64_t obj_id,
+ uint64_t length,
+ uint64_t offset,
+ int cmd_data_use_sg,
+ unsigned char *cmd_data)
+{
+ struct osd_obj_id obj = {
+ .partition = part_id,
+ .id = obj_id
+ };
+ struct osd_request *or = osd_start_request(dev, GFP_KERNEL);
+ struct request_queue *req_q = dev->scsi_device->request_queue;
+ struct bio *bio;
+
+ if (!or)
+ return NULL;
+
+ BUG_ON(cmd_data_use_sg);
+ bio = bio_map_kern(req_q, cmd_data, length, or->alloc_flags);
+ if (!bio) {
+ osd_end_request(or);
+ return NULL;
+ }
+
+ osd_req_read(or, &obj, bio, offset);
+ EXOFS_DBGMSG("osd_req_read(p=%llX, ob=%llX, l=%llu, of=%llu)\n",
+ _LLU(part_id), _LLU(obj_id), _LLU(length), _LLU(offset));
+ return or;
+}
+
+struct osd_request *prepare_osd_write(struct osd_dev *dev,
+ uint64_t part_id,
+ uint64_t obj_id,
+ uint64_t length,
+ uint64_t offset,
+ int cmd_data_use_sg,
+ const unsigned char *cmd_data)
+{
+ struct osd_obj_id obj = {
+ .partition = part_id,
+ .id = obj_id
+ };
+ struct osd_request *or = osd_start_request(dev, GFP_KERNEL);
+ struct request_queue *req_q = dev->scsi_device->request_queue;
+ struct bio *bio;
+
+ if (!or)
+ return NULL;
+
+ BUG_ON(cmd_data_use_sg);
+ bio = bio_map_kern(req_q, (u8 *)cmd_data, length, or->alloc_flags);
+ if (!bio) {
+ osd_end_request(or);
+ return NULL;
+ }
+
+ osd_req_write(or, &obj, bio, offset);
+ EXOFS_DBGMSG("osd_req_write(p=%llX, ob=%llX, l=%llu, of=%llu)\n",
+ _LLU(part_id), _LLU(obj_id), _LLU(length), _LLU(offset));
+ return or;
+}
+
+void free_osd_req(struct osd_request *req)
+{
+ osd_end_request(req);
+}
--
1.6.0.1
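Two pieces of arithmetic in common.h are worth spelling out: the mapping
"inode# = object ID - EXOFS_OBJ_OFF", and the directory record length, which
is the 8-byte fixed header (inode, rec_len, name_len, file_type) plus the name,
rounded up to a multiple of EXOFS_DIR_PAD. A userspace sketch follows. The
macros are copied from the patch; the two helper functions are hypothetical
names added for illustration.

```c
#include <stdint.h>

#define EXOFS_OBJ_OFF	0x10000ULL	/* offset for objects */
#define EXOFS_DIR_PAD	4
#define EXOFS_DIR_ROUND	(EXOFS_DIR_PAD - 1)
#define EXOFS_DIR_REC_LEN(name_len) \
	(((name_len) + 8 + EXOFS_DIR_ROUND) & ~EXOFS_DIR_ROUND)

/* Hypothetical helper: object ID used on the OSD for a given inode number. */
static inline uint64_t exofs_oid_of_ino(uint64_t ino)
{
	return ino + EXOFS_OBJ_OFF;
}

/* Hypothetical helper: inode number for a given OSD object ID. */
static inline uint64_t exofs_ino_of_oid(uint64_t oid)
{
	return oid - EXOFS_OBJ_OFF;
}
```

With these definitions EXOFS_ROOT_ID (0x10002) maps to inode 2, and a
one-character name and a four-character name both occupy a 12-byte record.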
On Tue, 16 Dec 2008 17:33:48 +0200
Boaz Harrosh <[email protected]> wrote:
> We need a mechanism to prepare the file system (mkfs).
> I chose to implement that by means of a couple of
> mount-options. Because there is no user-mode API for committing
> OSD commands. And also, all this stuff is highly internal to
> the file system itself.
>
> - Added two mount options mkfs=0/1,format=capacity_in_meg, so mkfs/format
> can be executed by kernel code just before mount. An mkexofs utility
> can now be implemented by means of a script that mounts and unmounts the
> file system with proper options.
Doing mkfs in-kernel is unusual. I don't think the above description
sufficiently helps the uninitiated understand why mkfs cannot be done
in userspace as usual. Please flesh it out a bit.
What are the dependencies for this filesystem code? I assume that it
depends on various block- and scsi-level patches? Which ones, and
what is their status, and is this code even compileable without them?
Thanks.
On Tue, 16 Dec 2008 16:52:54 +0200
Boaz Harrosh <[email protected]> wrote:
> In this patch are all the osd infrastructure that will be used later
> by the file system.
>
> Also the declarations of constants, on disk structures, and prototypes.
>
> And the Kbuild+Kconfig files needed to build the exofs module.
>
>
> ...
>
> +struct exofs_sb_info {
> + struct osd_dev *s_dev; /* returned by get_osd_dev */
> + uint64_t s_pid; /* partition ID of file system*/
> + int s_timeout; /* timeout for OSD operations */
> + uint32_t s_nextid; /* highest object ID used */
> + uint32_t s_numfiles; /* number of files on fs */
> + spinlock_t s_next_gen_lock; /* spinlock for gen # update */
> + u32 s_next_generation; /* next gen # to use */
> + atomic_t s_curr_pending; /* number of pending commands */
> + uint8_t s_cred[OSD_CAP_LEN]; /* all-powerful credential */
> +};
> +
> +/*
> + * our inode flags
> + */
> +#ifdef ARCH_HAS_ATOMIC_UNSIGNED
This doesn't exist, and it would be fairly bad to introduce it. Please
kill the ifdefs.
> +typedef unsigned exofs_iflags_t;
> +#else
> +typedef unsigned long exofs_iflags_t;
> +#endif
Then please kill the typedef altogether and replace it with `unsigned
long' everywhere.
> +#define OBJ_2BCREATED 0 /* object will be created soon*/
> +#define OBJ_CREATED 1 /* object has been created on the osd*/
> +
> +#define Obj2BCreated(oi) \
> + test_bit(OBJ_2BCREATED, &(oi->i_flags))
> +#define SetObj2BCreated(oi) \
> + set_bit(OBJ_2BCREATED, &(oi->i_flags))
> +
> +#define ObjCreated(oi) \
> + test_bit(OBJ_CREATED, &(oi->i_flags))
> +#define SetObjCreated(oi) \
> + set_bit(OBJ_CREATED, &(oi->i_flags))
- please only implement code in macros when it CANNOT be implemented
in C. There are numerous reasons. One of which is that the above
macros will happily compile when passed a pointer to ANY struct which
has an i_flags field. If it were a properly typechecked C function,
that can't happen.
- These "functions" have odd names. This:
static inline void obj_created(struct exofs_i_info *ei)
would be more Linux-like.
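For illustration, the four macros could become typechecked inline functions
along these lines. This is a userspace sketch: the helper names are
hypothetical, the struct is cut down to the one field used, and plain bit
operations stand in for the kernel's atomic test_bit()/set_bit().

```c
#include <stdbool.h>

/* Bit numbers, as in the patch. */
enum {
	OBJ_2BCREATED = 0,	/* object will be created soon */
	OBJ_CREATED = 1,	/* object has been created on the osd */
};

/* Cut-down stand-in for the in-memory inode extension. */
struct exofs_i_info {
	unsigned long i_flags;
};

static inline bool obj_2bcreated(const struct exofs_i_info *oi)
{
	return (oi->i_flags >> OBJ_2BCREATED) & 1UL;
}

static inline void set_obj_2bcreated(struct exofs_i_info *oi)
{
	oi->i_flags |= 1UL << OBJ_2BCREATED;
}

static inline bool obj_created(const struct exofs_i_info *oi)
{
	return (oi->i_flags >> OBJ_CREATED) & 1UL;
}

static inline void set_obj_created(struct exofs_i_info *oi)
{
	oi->i_flags |= 1UL << OBJ_CREATED;
}
```

Passing anything other than a struct exofs_i_info pointer to these now draws
a compiler diagnostic, which the macro versions would not.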
> +/*
> + * our extension to the in-memory inode
> + */
> +struct exofs_i_info {
> + exofs_iflags_t i_flags; /* various atomic flags */
> + __le32 i_data[EXOFS_IDATA];/*short symlink names and device #s*/
> + uint32_t i_dir_start_lookup; /* which page to start lookup */
> + wait_queue_head_t i_wq; /* wait queue for inode */
> + uint64_t i_commit_size; /* the object's written length */
> + uint8_t i_cred[OSD_CAP_LEN];/* all-powerful credential */
> + struct inode vfs_inode; /* normal in-memory inode */
> +};
> +
> +/*
> + * get to our inode from the vfs inode
> + */
> +static inline struct exofs_i_info *EXOFS_I(struct inode *inode)
> +{
> + return container_of(inode, struct exofs_i_info, vfs_inode);
> +}
yeah, well. We got lazy when we converted EXT2_I from a macro to a C
function. That doesn't mean that the mistake should have been copied :)
exofs_i() would be a more suitable name.
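A lower-case accessor would look roughly like this. The sketch is userspace
only: struct inode is reduced to a single member, and container_of() is
spelled out with offsetof() as an assumption standing in for the kernel macro.

```c
#include <stddef.h>

/* Minimal stand-in for the VFS inode. */
struct inode {
	unsigned long i_ino;
};

/* Cut-down exofs inode with the VFS inode embedded, as in the patch. */
struct exofs_i_info {
	unsigned long i_flags;
	struct inode vfs_inode;
};

/* Userspace approximation of the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Get to our inode from the vfs inode. */
static inline struct exofs_i_info *exofs_i(struct inode *inode)
{
	return container_of(inode, struct exofs_i_info, vfs_inode);
}
```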
> +/*************************
> + * function declarations *
> + *************************/
>
> ...
>
> +#include <scsi/scsi_device.h>
> +#include <scsi/osd_sense.h>
> +
> +#include "exofs.h"
> +
> +int check_ok(struct osd_request *req)
eek. This is a kernel-wide symbol. The choice of identifier is bad.
> +{
> + struct osd_sense_info osi;
> + int ret = osd_req_decode_sense(req, &osi);
> +
> + if (ret) { /* translate to Linux codes */
> + if (osi.additional_code == scsi_invalid_field_in_cdb) {
> + if (osi.cdb_field_offset == OSD_CFO_STARTING_BYTE)
> + ret = -EFAULT;
> + if (osi.cdb_field_offset == OSD_CFO_OBJECT_ID)
> + ret = -ENOENT;
> + else
> + ret = -EINVAL;
> + } else if (osi.additional_code == osd_quota_error)
> + ret = -ENOSPC;
> + else
> + ret = -EIO;
> + }
> +
> + return ret;
> +}
> +
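As an aside on the translation quoted above: the second `if` has no `else`
before it, so a starting-byte error is set to -EFAULT and then overwritten
with -EINVAL. Below is a userspace sketch of the intended mapping, under an
fs-prefixed name. The enums and struct are hypothetical stand-ins for the osd
library's sense types, and the function name is only a suggestion.

```c
#include <errno.h>

/* Hypothetical stand-ins for the sense values decoded by the osd library. */
enum sketch_sense {
	SK_INVALID_FIELD_IN_CDB,
	SK_QUOTA_ERROR,
	SK_OTHER_SENSE,
};

enum sketch_cdb_field {
	SK_CFO_STARTING_BYTE,
	SK_CFO_OBJECT_ID,
	SK_CFO_OTHER,
};

struct sketch_sense_info {
	enum sketch_sense additional_code;
	enum sketch_cdb_field cdb_field_offset;
};

/* Translate a decoded sense result into a Linux error code. */
static int exofs_check_ok_sketch(int decode_ret,
				 const struct sketch_sense_info *osi)
{
	if (!decode_ret)
		return 0;

	if (osi->additional_code == SK_INVALID_FIELD_IN_CDB) {
		if (osi->cdb_field_offset == SK_CFO_STARTING_BYTE)
			return -EFAULT;
		else if (osi->cdb_field_offset == SK_CFO_OBJECT_ID)
			return -ENOENT;
		return -EINVAL;
	}
	if (osi->additional_code == SK_QUOTA_ERROR)
		return -ENOSPC;
	return -EIO;
}
```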
> +void make_credential(uint8_t cred_a[OSD_CAP_LEN], uint64_t pid, uint64_t oid)
Ditto. I suspect I'm going to see a lot of this. Please review the
entire fs for its namespace niceness
> +{
> + struct osd_obj_id obj = {
> + .partition = pid,
> + .id = oid
> + };
> +
> + osd_sec_init_nosec_doall_caps(cred_a, &obj, false, true);
> +}
> +
>
> ...
>
> +int prepare_get_attr_list_add_entry(struct osd_request *req,
> + uint32_t page_num,
> + uint32_t attr_num,
> + uint32_t attr_len)
> +{
> + struct osd_attr attr = {
> + .page = page_num,
Kernel developers expect a field called "page" to have type `struct
page *'. osd_attr.page is thus designed to confuse.
>
> ...
>
On Tue, 16 Dec 2008 17:17:25 +0200
Boaz Harrosh <[email protected]> wrote:
>
> implementation of the file_operations and inode_operations for
> regular data files.
>
> Most file_operations are generic vfs implementations except:
> - exofs_truncate will truncate the OSD object as well
> - Generic file_fsync is not good for non-block-device filesystems so open code it
> - The default for .flush in Linux is to do nothing so call exofs_fsync
> on the file.
>
> ...
>
> +int exofs_file_fsync(struct file *filp, struct dentry *dentry, int datasync)
> +{
> + int ret1, ret2;
> + struct address_space *mapping = filp->f_mapping;
> +
> + ret1 = filemap_write_and_wait(mapping);
> + ret2 = file_fsync(filp, dentry, datasync);
> +
> + return ret1 ? : ret2;
mutter. That gccism always makes me fall over dazed and confused.
Maybe that's just me.
Did we really want to call file_fsync() if filemap_write_and_wait() failed?
> +}
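For reference, `ret1 ? : ret2` is the GNU C conditional with the middle
operand omitted; it yields ret1 when ret1 is non-zero and ret2 otherwise. The
userspace sketch below contrasts the quoted behaviour with a variant that
skips the second step once the first has failed. The struct and the two step
functions are hypothetical stand-ins for filemap_write_and_wait() and
file_fsync(); which behaviour is wanted is the open question above.

```c
/* Records what each step returns and how often the second one ran. */
struct sync_steps {
	int flush_ret;	/* result of flushing the data pages */
	int meta_ret;	/* result of the second sync step */
	int meta_calls;	/* how many times the second step was called */
};

static int flush_pages(struct sync_steps *s)
{
	return s->flush_ret;
}

static int sync_meta(struct sync_steps *s)
{
	s->meta_calls++;
	return s->meta_ret;
}

/* As quoted: always run both steps, report the first error. */
static int fsync_always_both(struct sync_steps *s)
{
	int ret1 = flush_pages(s);
	int ret2 = sync_meta(s);

	return ret1 ? ret1 : ret2;	/* portable spelling of ret1 ? : ret2 */
}

/* Alternative: stop as soon as the first step fails. */
static int fsync_short_circuit(struct sync_steps *s)
{
	int ret = flush_pages(s);

	if (ret)
		return ret;
	return sync_meta(s);
}
```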
>
> ...
>
> +struct file_operations exofs_file_operations = {
> +struct inode_operations exofs_file_inode_operations = {
These both could/should be made const.
> + .truncate = exofs_truncate,
> + .setattr = exofs_setattr,
> +};
On Tue, 16 Dec 2008 17:21:18 +0200
Boaz Harrosh <[email protected]> wrote:
> +struct inode_operations exofs_symlink_inode_operations = {
> +struct inode_operations exofs_fast_symlink_inode_operations = {
Can be made const (I think)
On Tue, 16 Dec 2008 17:22:37 +0200
Boaz Harrosh <[email protected]> wrote:
>
> OK Now we start to read and write from osd-objects, page-by-page.
> The page index is the object's offset.
>
>
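Since the page index determines the object offset, the byte range written for
a page follows from the file size alone. The userspace sketch below mirrors
the bounds logic of the writepage code under review, assuming 4 KiB pages. The
struct, function, and SKETCH_* names are hypothetical.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT	12	/* assumption: 4 KiB pages */
#define SKETCH_PAGE_SIZE	(1ULL << SKETCH_PAGE_SHIFT)

/* Byte range of the object covered by one page; len == 0 means skip. */
struct page_extent {
	uint64_t start;
	uint64_t len;
};

static struct page_extent page_write_extent(uint64_t i_size, uint64_t index)
{
	/* index is 64-bit here, so the shift cannot overflow on 32-bit */
	struct page_extent ext = { index << SKETCH_PAGE_SHIFT, 0 };
	uint64_t end_index = i_size >> SKETCH_PAGE_SHIFT;
	uint64_t tail = i_size & (SKETCH_PAGE_SIZE - 1);

	if (index < end_index)
		ext.len = SKETCH_PAGE_SIZE;	/* page fully inside the file */
	else if (index == end_index)
		ext.len = tail;			/* partial last page, may be 0 */
	/* index > end_index: page is past EOF, nothing to write */

	return ext;
}
```

Doing the shift on a 64-bit value is a choice made for the sketch; where the
page index is a 32-bit unsigned long, shifting it before widening to 64 bits
would truncate large offsets.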
> ...
>
> +/*
> + * Callback function when writepage finishes. Check for errors, unlock, clean
> + * up, etc.
> + */
> +void writepage_done(struct osd_request *req, void *p)
> +{
> + int ret;
> + struct page *page = (struct page *)p;
unneeded cast
> + struct inode *inode = page->mapping->host;
> + struct exofs_sb_info *sbi = inode->i_sb->s_fs_info;
> +
> + ret = check_ok(req);
> + free_osd_req(req);
> + atomic_dec(&sbi->s_curr_pending);
> +
> + if (ret) {
> + if (ret == -ENOSPC)
> + set_bit(AS_ENOSPC, &page->mapping->flags);
> + else
> + set_bit(AS_EIO, &page->mapping->flags);
> +
> + SetPageError(page);
> + }
> +
> + end_page_writeback(page);
> + unlock_page(page);
> +}
> +
> +/*
> + * Write a page to disk. page->index gives us the page number. The page is
> + * locked before this function is called. We write asynchronously and then the
> + * callback function (writepage_done) is called. We signify that the operation
> + * has completed by unlocking the page and calling end_page_writeback().
> + */
> +static int exofs_writepage(struct page *page, struct writeback_control *wbc)
> +{
> + struct inode *inode = page->mapping->host;
> + struct exofs_i_info *oi = EXOFS_I(inode);
> + loff_t i_size = i_size_read(inode);
> + unsigned long end_index = i_size >> PAGE_CACHE_SHIFT;
> + unsigned offset = 0;
> + struct osd_request *req = NULL;
> + struct exofs_sb_info *sbi;
> + uint64_t start;
> + uint64_t len = PAGE_CACHE_SIZE;
> + unsigned char *kaddr;
> + int ret = 0;
> +
> + if (!PageLocked(page))
> + BUG();
Could use BUG_ON(!PageLocked)
> + /* if the object has not been created, and we are not in sync mode,
> + * just return. otherwise, wait. */
> + if (!ObjCreated(oi)) {
> + if (!Obj2BCreated(oi))
> + BUG();
BUG_ON()?
> + if (wbc->sync_mode == WB_SYNC_NONE) {
> + redirty_page_for_writepage(wbc, page);
> + unlock_page(page);
> + ret = 0;
> + goto out;
> + } else {
> + wait_event(oi->i_wq, ObjCreated(oi));
> + }
> + }
> +
> + /* in this case, the page is within the limits of the file */
> + if (page->index < end_index)
> + goto do_it;
> +
> + offset = i_size & (PAGE_CACHE_SIZE - 1);
> + len = offset;
> +
> + /*in this case, the page is outside the limits (truncate in progress)*/
> + if (page->index >= end_index + 1 || !offset) {
> + unlock_page(page);
> + goto out;
> + }
> +
> +do_it:
> + BUG_ON(PageWriteback(page));
> + set_page_writeback(page);
> + start = page->index << PAGE_CACHE_SHIFT;
> + sbi = inode->i_sb->s_fs_info;
> +
> + kaddr = page_address(page);
> +
> + req = prepare_osd_write(sbi->s_dev, sbi->s_pid,
> + inode->i_ino + EXOFS_OBJ_OFF, len, start, 0,
> + kaddr);
Does prepare_osd_write() modify the memory at *kaddr? If so, does it
do the needed flush_dcache_page()?
> +
> + if (!req) {
> + printk(KERN_ERR "ERROR: writepage failed.\n");
> + ret = -ENOMEM;
> + goto fail;
> + }
> +
> + oi->i_commit_size = min_t(uint64_t, oi->i_commit_size, len + start);
> +
> + ret = exofs_async_op(req, writepage_done, (void *)page, oi->i_cred);
> + if (ret) {
> + free_osd_req(req);
> + goto fail;
> + }
> + atomic_inc(&sbi->s_curr_pending);
> +out:
> + return ret;
> +fail:
> + set_bit(AS_EIO, &page->mapping->flags);
> + end_page_writeback(page);
> + unlock_page(page);
> + goto out;
> +}
> +
> +/*
> + * Callback for readpage
> + */
> +int __readpage_done(struct osd_request *req, void *p, int unlock)
> +{
> + struct page *page = (struct page *)p;
unneeded cast.
> + struct inode *inode = page->mapping->host;
> + struct exofs_sb_info *sbi = inode->i_sb->s_fs_info;
> + int ret;
> +
> + ret = check_ok(req);
> + free_osd_req(req);
> + atomic_dec(&sbi->s_curr_pending);
> +
> + if (ret == 0) {
> +
> + /* Everything is OK */
> + SetPageUptodate(page);
> + if (PageError(page))
> + ClearPageError(page);
> + } else if (ret == -EFAULT) {
> + char *kaddr;
> +
> + /* In this case we were trying to read something that wasn't on
> + * disk yet - return a page full of zeroes. This should be OK,
> + * because the object should be empty (if there was a write
> + * before this read, the read would be waiting with the page
> + * locked */
> + kaddr = page_address(page);
> + memset(kaddr, 0, PAGE_CACHE_SIZE);
There is I think a missing flush_dcache_page() here. Use of the
(somewhat misnamed) zero_user() would be an appropriate fix and
cleanup.
> + SetPageUptodate(page);
> + if (PageError(page))
> + ClearPageError(page);
> + } else /* Error */
> + SetPageError(page);
> +
> + if (unlock)
> + unlock_page(page);
> +
> + return ret;
> +}
> +
> +void readpage_done(struct osd_request *req, void *p)
> +{
> + __readpage_done(req, p, true);
> +}
> +
> +/*
> + * Read a page from the OSD
> + */
> +static int __readpage_filler(struct page *page, bool is_async_unlock)
> +{
> + struct osd_request *req = NULL;
> + struct inode *inode = page->mapping->host;
> + struct exofs_i_info *oi = EXOFS_I(inode);
> + ino_t ino = inode->i_ino;
> + loff_t i_size = i_size_read(inode);
> + loff_t i_start = page->index << PAGE_CACHE_SHIFT;
> + unsigned long end_index = i_size >> PAGE_CACHE_SHIFT;
Using pgoff_t for this would have some small documentation benefit.
> + struct super_block *sb = inode->i_sb;
> + struct exofs_sb_info *sbi = sb->s_fs_info;
> + uint64_t amount;
> + unsigned char *kaddr;
> + int ret = 0;
> +
> + if (!PageLocked(page))
> + BUG();
BUG_ON?
> + if (PageUptodate(page))
> + goto out;
> +
> + if (page->index < end_index)
> + amount = PAGE_CACHE_SIZE;
> + else
> + amount = i_size & (PAGE_CACHE_SIZE - 1);
> +
> + /* this will be out of bounds, or doesn't exist yet */
> + if ((page->index >= end_index + 1) || !ObjCreated(oi) || !amount
> + /*|| (i_start >= oi->i_commit_size)*/) {
> + kaddr = kmap_atomic(page, KM_USER0);
> + memset(kaddr, 0, PAGE_CACHE_SIZE);
> + flush_dcache_page(page);
> + kunmap_atomic(page, KM_USER0);
There's a flush_dcache_page() ;)
Could use clear_highpage() here.
> + SetPageUptodate(page);
> + if (PageError(page))
> + ClearPageError(page);
> + if (is_async_unlock)
> + unlock_page(page);
> + goto out;
> + }
> +
> + if (amount != PAGE_CACHE_SIZE) {
> + kaddr = kmap_atomic(page, KM_USER0);
> + memset(kaddr + amount, 0, PAGE_CACHE_SIZE - amount);
> + flush_dcache_page(page);
> + kunmap_atomic(page, KM_USER0);
Use zero_user()?
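The read path above works out how many bytes of a page are backed by file data and zeroes the remainder. A userspace sketch of that arithmetic, assuming a 4096-byte page, is given below. DEMO_PAGE_SHIFT, DEMO_PAGE_SIZE, valid_bytes() and zero_tail() are hypothetical names. The kernel's zero_user() also takes care of mapping the page and flushing the dcache, which this sketch does not model.

```c
#include <stdint.h>
#include <string.h>

#define DEMO_PAGE_SHIFT 12
#define DEMO_PAGE_SIZE  (1UL << DEMO_PAGE_SHIFT)

/* Number of bytes of page "index" that lie inside a file of size i_size. */
static size_t valid_bytes(uint64_t i_size, unsigned long index)
{
	unsigned long end_index = (unsigned long)(i_size >> DEMO_PAGE_SHIFT);

	if (index < end_index)
		return DEMO_PAGE_SIZE;	/* page is fully inside the file */
	if (index > end_index)
		return 0;		/* page is entirely past end of file */
	/* last page: only the remainder of i_size is valid (may be 0) */
	return (size_t)(i_size & (DEMO_PAGE_SIZE - 1));
}

/* Zero everything in the page buffer after the first "amount" bytes. */
static void zero_tail(unsigned char *kaddr, size_t amount)
{
	memset(kaddr + amount, 0, DEMO_PAGE_SIZE - amount);
}
```

For a 10000-byte file, pages 0 and 1 are full, page 2 holds 10000 - 8192 valid bytes, and page 3 is out of bounds.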
> + }
> +
> + kaddr = page_address(page);
> +
> + req = prepare_osd_read(sbi->s_dev, sbi->s_pid, ino + EXOFS_OBJ_OFF,
> + amount, i_start, 0, kaddr);
flush_dcache_page()?
> + if (!req) {
> + printk(KERN_ERR "ERROR: readpage failed.\n");
> + ret = -ENOMEM;
> + unlock_page(page);
> + goto out;
> + }
> +
> + atomic_inc(&sbi->s_curr_pending);
> + if (!is_async_unlock) {
> + exofs_sync_op(req, sbi->s_timeout, oi->i_cred);
> + ret = __readpage_done(req, page, false);
> + } else {
> + ret = exofs_async_op(req, readpage_done, page, oi->i_cred);
> + if (ret) {
> + free_osd_req(req);
> + unlock_page(page);
> + atomic_dec(&sbi->s_curr_pending);
> + }
> + }
> +
> +out:
> + return ret;
> +}
> +
> +static int readpage_filler(struct page *page)
> +{
> + int ret = __readpage_filler(page, true);
> +
> + return ret;
> +}
> +
> +/*
> + * We don't need the file
> + */
> +static int exofs_readpage(struct file *file, struct page *page)
> +{
> + return readpage_filler(page);
> +}
> +
> +/*
> + * We don't need the data
> + */
> +static int readpage_strip(void *data, struct page *page)
> +{
> + return readpage_filler(page);
> +}
> +
> +/*
> + * read a bunch of pages - usually for readahead
> + */
> +static int exofs_readpages(struct file *file, struct address_space *mapping,
> + struct list_head *pages, unsigned nr_pages)
> +{
> + return read_cache_pages(mapping, pages, readpage_strip, NULL);
> +}
> +
> +struct address_space_operations exofs_aops = {
const.
> + .readpage = exofs_readpage,
> + .readpages = exofs_readpages,
> + .writepage = exofs_writepage,
> + .write_begin = exofs_write_begin_export,
> + .write_end = simple_write_end,
> + .writepages = generic_writepages,
> +};
On Tue, 16 Dec 2008 17:28:57 +0200
Boaz Harrosh <[email protected]> wrote:
> implementation of directory and inode operations.
>
> * A directory is treated as a file, and essentially contains a list
> of <file name, inode #> pairs for files that are found in that
> directory. The object IDs correspond to the files' inode numbers
> and are allocated using a 64bit incrementing global counter.
> * Each file's control block (AKA on-disk inode) is stored in its
> object's attributes. This applies to both regular files and other
> types (directories, device files, symlinks, etc.).
>
> ...
>
> fs/exofs/dir.c | 649 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
yes, this does look rather ext2-like ;)
How long ago was the code cloned from ext2? iirc there have been a
number of fairly subtle bugs fixed in ext2/dir.c over the past year or
three. If the code was not quite recently cloned then I'd suggest that
you spend a bit of time looking through the ext2 changelogs, see if
there are any bugfixes which need to be ported.
On Tue, 16 Dec 2008 17:31:30 +0200
Boaz Harrosh <[email protected]> wrote:
> +struct super_operations exofs_sops = {
const. I'm sure I missed many of these.
Andrew Morton wrote:
>> +int prepare_get_attr_list_add_entry(struct osd_request *req,
>> + uint32_t page_num,
>> + uint32_t attr_num,
>> + uint32_t attr_len)
>> +{
>> + struct osd_attr attr = {
>> + .page = page_num,
>
> Kernel developers expect a field called "page" to have type `struct
> page *'. osd_attr.page is thus designed to confuse.
>
>> ...
>>
>
Rant below (can be ignored):
This single fix will cause a massive change to the open-osd
initiator patchset (18 patches) and a resubmission. I made the mistake
because the name comes from a file in which all naming conventions
are taken from the OSD standard text. However, that is no excuse
for reusing a well-known Kernel construct name. I will fix it, and
will be more careful in the future.
Thanks
Boaz
Andrew Morton wrote:
>> +
>> + kaddr = page_address(page);
>> +
>> + req = prepare_osd_write(sbi->s_dev, sbi->s_pid,
>> + inode->i_ino + EXOFS_OBJ_OFF, len, start, 0,
>> + kaddr);
>
> Does prepare_osd_write() modify the memory at *kaddr? If so, does it
> do the needed flush_dcache_page()?
>
kaddr is not modified by the CPU. This is just a very BAD API left over from the old
osd-initiator days. The address is used for preparing a BIO that is submitted
to the HW later. I will change all these places to receive a page* directly.
(It was meant to be changed in the future, when I want to support reading and
writing a page* array.)
<snip>
>> + } else if (ret == -EFAULT) {
>> + char *kaddr;
>> +
>> + /* In this case we were trying to read something that wasn't on
>> + * disk yet - return a page full of zeroes. This should be OK,
>> + * because the object should be empty (if there was a write
>> + * before this read, the read would be waiting with the page
>> + * locked */
>> + kaddr = page_address(page);
>> + memset(kaddr, 0, PAGE_CACHE_SIZE);
>
> There is I think a missing flush_dcache_page() here. Use of the
> (somewhat misnamed) zero_user() would be an appropriate fix and
> cleanup.
>
What happened here is that the HW never actually touched the page in question
before it was returned to the CPU. Do I need to flush_dcache_page() anyway?
This is not relevant, though, since I will use zero_user() as you suggested.
Should I use clear_highpage() instead, as this clears a full page?
<snip>
>> +
>> + /* this will be out of bounds, or doesn't exist yet */
>> + if ((page->index >= end_index + 1) || !ObjCreated(oi) || !amount
>> + /*|| (i_start >= oi->i_commit_size)*/) {
>> + kaddr = kmap_atomic(page, KM_USER0);
>> + memset(kaddr, 0, PAGE_CACHE_SIZE);
>> + flush_dcache_page(page);
>> + kunmap_atomic(page, KM_USER0);
>
> There's a flush_dcache_page() ;)
>
> Could use clear_highpage() here.
>
Thanks, sounds much better.
>> + SetPageUptodate(page);
>> + if (PageError(page))
>> + ClearPageError(page);
>> + if (is_async_unlock)
>> + unlock_page(page);
>> + goto out;
>> + }
>> +
>> + if (amount != PAGE_CACHE_SIZE) {
>> + kaddr = kmap_atomic(page, KM_USER0);
>> + memset(kaddr + amount, 0, PAGE_CACHE_SIZE - amount);
>> + flush_dcache_page(page);
>> + kunmap_atomic(page, KM_USER0);
>
> Use zero_user()?
>
Will change
Thanks
Boaz
On Wed, 2008-12-31 at 17:19 +0200, Boaz Harrosh wrote:
> Andrew Morton wrote:
> > On Tue, 16 Dec 2008 17:33:48 +0200
> > Boaz Harrosh <[email protected]> wrote:
> >
> >> We need a mechanism to prepare the file system (mkfs).
> >> I chose to implement that by means of a couple of
> >> mount-options. Because there is no user-mode API for committing
> >> OSD commands. And also, all this stuff is highly internal to
> >> the file system itself.
> >>
> >> - Added two mount options mkfs=0/1,format=capacity_in_meg, so mkfs/format
> >> can be executed by kernel code just before mount. An mkexofs utility
> >> can now be implemented by means of a script that mounts and unmount the
> >> file system with proper options.
> >
> > Doing mkfs in-kernel is unusual. I don't think the above description
> > sufficiently helps the uninitiated understand why mkfs cannot be done
> > in userspace as usual. Please flesh it out a bit.
>
> There are a few main reasons.
> - There is no user-mode API for initiating OSD commands. Such a subsystem
> would be hundredfold bigger than the mkfs code submitted. I think it would be
> hard and stupid to maintain a complex user-mode API just for creating
> a couple of objects and writing a couple of on disk structures.
This is really a reflection of the whole problem with the OSD paradigm.
In theory, a filesystem on OSD is a thin layer of metadata mapping
objects to files. Get this right and the storage will manage things,
like security and access and attributes (there's even a natural mapping
to the VFS concept of extended attributes). Plus, the storage has
enough information to manage persistence, backups and replication.
The real problem is that no-one has actually managed to come up with a
useful VFS<->OSD mapping layer (even by extending or altering the VFS).
Every filesystem that currently uses OSD has a separate direct OSD
speaking interface (i.e. it slices out the block layer to do this and
talks directly to the storage).
I suppose this could be taken to show that such a layer is impossibly
complex, as you assert, but its lack is reflected in strange looking
design decisions like in-kernel mkfs. It would also mean that there
would be very little layered code sharing between OSD-based filesystems.
> - I intend to refactor the code further to make use of more super.c services,
> so to make this addition even smaller. Also future direction of raid over
> multiple objects will make even more kernel infrastructure needed which
> will need even more user-mode code duplication.
> - I anticipate problems that are not yet addressed in this body of work
> but will be in the future, mainly that a single OSD-target (lun) can
> be shared by lots of FSs, and a single FS can span many OSD-targets.
> Some central management is much easier to do in Kernel.
>
> >
> > What are the dependencies for this filesystem code? I assume that it
> > depends on various block- and scsi-level patches? Which ones, and
> > what is their status, and is this code even compileable without them?
> >
>
> This OSD-based file system is dependent on the open-osd initiator library
> code that I've submitted for inclusion for 2.6.29. It has been sitting
> in linux-next for a while now, and has not been receiving any comments
> for the last two updated patchsets I've sent to scsi-misc/lkml. However
> it has not yet been submitted into James's scsi-misc git tree, and James
> is the ultimate maintainer that should submit this work. I hope it will
> still be submitted into 2.6.29, as this code is totally self sufficient
> and does not endanger or change any other Kernel subsystems.
> (All the needed ground work was already submitted to Linus since 2.6.26)
> So why should it not?
I don't like it mainly because it's not truly a useful general framework
for others to build on. However, as argued above, there might not
actually be such a useful framework, so as long as the only two
consumers (you and Lustre) want an interface like this, I'll put it in.
James
Andrew Morton wrote:
> On Tue, 16 Dec 2008 17:33:48 +0200
> Boaz Harrosh <[email protected]> wrote:
>
>> We need a mechanism to prepare the file system (mkfs).
>> I chose to implement that by means of a couple of
>> mount-options. Because there is no user-mode API for committing
>> OSD commands. And also, all this stuff is highly internal to
>> the file system itself.
>>
>> - Added two mount options mkfs=0/1,format=capacity_in_meg, so mkfs/format
>> can be executed by kernel code just before mount. An mkexofs utility
>> can now be implemented by means of a script that mounts and unmount the
>> file system with proper options.
>
> Doing mkfs in-kernel is unusual. I don't think the above description
> sufficiently helps the uninitiated understand why mkfs cannot be done
> in userspace as usual. Please flesh it out a bit.
There are a few main reasons.
- There is no user-mode API for initiating OSD commands. Such a subsystem
would be hundredfold bigger than the mkfs code submitted. I think it would be
hard and stupid to maintain a complex user-mode API just for creating
a couple of objects and writing a couple of on disk structures.
- I intend to refactor the code further to make use of more super.c services,
so to make this addition even smaller. Also future direction of raid over
multiple objects will make even more kernel infrastructure needed which
will need even more user-mode code duplication.
- I anticipate problems that are not yet addressed in this body of work
but will be in the future, mainly that a single OSD-target (lun) can
be shared by lots of FSs, and a single FS can span many OSD-targets.
Some central management is much easier to do in Kernel.
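To make the mount-option approach concrete, here is a userspace sketch of parsing the two options named in the patch description, mkfs=0/1 and format=capacity_in_meg. The struct is a cut-down, hypothetical copy of exofs_mountopt, and demo_parse_options() is an illustrative name. The actual patch may parse its options differently, for example with the kernel's own option-parsing helpers.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Subset of the patch's struct exofs_mountopt, for illustration only. */
struct demo_mountopt {
	bool mkfs;
	int  format;	/* capacity in megabytes */
};

/*
 * Parse a comma-separated option string such as "mkfs=1,format=1024".
 * Unknown options are rejected. Returns 0 on success, -1 on error.
 */
static int demo_parse_options(const char *options, struct demo_mountopt *opts)
{
	const char *p = options;

	opts->mkfs = false;
	opts->format = 0;

	while (*p) {
		if (strncmp(p, "mkfs=", 5) == 0)
			opts->mkfs = atoi(p + 5) != 0;
		else if (strncmp(p, "format=", 7) == 0)
			opts->format = atoi(p + 7);
		else
			return -1;

		p = strchr(p, ',');	/* advance to the next option, if any */
		if (!p)
			break;
		p++;
	}
	return 0;
}
```

An mkexofs script would then only need to mount with the right option string and unmount again, as the patch description suggests.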
>
> What are the dependencies for this filesystem code? I assume that it
> depends on various block- and scsi-level patches? Which ones, and
> what is their status, and is this code even compileable without them?
>
This OSD-based file system is dependent on the open-osd initiator library
code that I've submitted for inclusion for 2.6.29. It has been sitting
in linux-next for a while now, and has not been receiving any comments
for the last two updated patchsets I've sent to scsi-misc/lkml. However
it has not yet been submitted into James's scsi-misc git tree, and James
is the ultimate maintainer that should submit this work. I hope it will
still be submitted into 2.6.29, as this code is totally self sufficient
and does not endanger or change any other Kernel subsystems.
(All the needed ground work was already submitted to Linus since 2.6.26)
So why should it not?
Once the open-osd initiator library is accepted this file system
could be accepted. I was hoping for the 2.6.30 time frame. (One Kernel
after the open-osd library)
> Thanks.
Thank you dear Andrew for your most valuable input.
I will constify all the code that should be const, fix the global name space
litter, inline the macros, and lower-case the inlines. I will also remove
the typedefs.
I will reply to individual patches, I have a couple of questions. But
all your comments are right and I will take care of them.
When, if, all is fixed, through which tree/maintainer can exofs be submitted?
Thanks
Boaz
Andrew Morton wrote:
> On Tue, 16 Dec 2008 17:17:25 +0200
> Boaz Harrosh <[email protected]> wrote:
>
>> implementation of the file_operations and inode_operations for
>> regular data files.
>>
>> Most file_operations are generic vfs implementations except:
>> - exofs_truncate will truncate the OSD object as well
>> - Generic file_fsync is not good for non-block-device filesystems, so open code it
>> - The default for .flush in Linux is to do nothing, so call exofs_fsync
>> on the file.
>>
>> ...
>>
>> +int exofs_file_fsync(struct file *filp, struct dentry *dentry, int datasync)
>> +{
>> + int ret1, ret2;
>> + struct address_space *mapping = filp->f_mapping;
>> +
>> + ret1 = filemap_write_and_wait(mapping);
>> + ret2 = file_fsync(filp, dentry, datasync);
>> +
>> + return ret1 ? : ret2;
>
> mutter. That gccism always makes me fall over dazed and confused.
> Maybe that's just me.
>
I've seen it done and felt like you exactly, only I liked the feeling. I'll change
it.
> Did we really want to call file_fsync() if filemap_write_and_wait() failed?
>
I think it cannot hurt, other places do the same including generic code.
On Wed, 31 Dec 2008 17:33:41 +0200 Boaz Harrosh <[email protected]> wrote:
> Andrew Morton wrote:
> >> +int prepare_get_attr_list_add_entry(struct osd_request *req,
> >> + uint32_t page_num,
> >> + uint32_t attr_num,
> >> + uint32_t attr_len)
> >> +{
> >> + struct osd_attr attr = {
> >> + .page = page_num,
> >
> > Kernel developers expect a field called "page" to have type `struct
> > page *'. osd_attr.page is thus designed to confuse.
> >
> >> ...
> >>
> >
>
> Rant below (can be ignored):
> This single fix will cause a massive change to the open-osd
> initiator patchset, (18 patches), and resubmission. I made the mistake
> because this name originates from a file that all naming conventions
> are taken from the OSD standard text. However this is no excuse
> for using a well known Kernel construct name. I will fix it. And
> will be more careful in the future.
The world wouldn't end if you left the code as-is. We've done worse things :)
Andrew Morton wrote:
> On Tue, 16 Dec 2008 17:28:57 +0200
> Boaz Harrosh <[email protected]> wrote:
>
>> implementation of directory and inode operations.
>>
>> * A directory is treated as a file, and essentially contains a list
>> of <file name, inode #> pairs for files that are found in that
>> directory. The object IDs correspond to the files' inode numbers
>> and are allocated using a 64bit incrementing global counter.
>> * Each file's control block (AKA on-disk inode) is stored in its
>> object's attributes. This applies to both regular files and other
>> types (directories, device files, symlinks, etc.).
>>
>> ...
>>
>> fs/exofs/dir.c | 649 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> yes, this does look rather ext2-like ;)
>
> How long ago was the code cloned from ext2? iirc there have been a
> number of fairly subtle bugs fixed in ext2/dir.c over the past year or
> three. If the code was not quite recently cloned then I'd suggest that
> you spend a bit of time looking through the ext2 changelogs, see if
> there are any bugfixes which need to be ported.
>
>
Long! Like Linux-v2.6.10 ;)
I will git-log the files in question and see if any of the bugs
are relevant here. (They should be).
Thanks that is most valuable input.
Boaz
On Wed, 31 Dec 2008 17:19:44 +0200 Boaz Harrosh <[email protected]> wrote:
> Andrew Morton wrote:
> > On Tue, 16 Dec 2008 17:33:48 +0200
> > Boaz Harrosh <[email protected]> wrote:
> >
> >> We need a mechanism to prepare the file system (mkfs).
> >> I chose to implement that by means of a couple of
> >> mount-options. Because there is no user-mode API for committing
> >> OSD commands. And also, all this stuff is highly internal to
> >> the file system itself.
> >>
> >> - Added two mount options mkfs=0/1,format=capacity_in_meg, so mkfs/format
> >> can be executed by kernel code just before mount. An mkexofs utility
> >> can now be implemented by means of a script that mounts and unmount the
> >> file system with proper options.
> >
> > Doing mkfs in-kernel is unusual. I don't think the above description
> > sufficiently helps the uninitiated understand why mkfs cannot be done
> > in userspace as usual. Please flesh it out a bit.
>
> There are a few main reasons.
> - There is no user-mode API for initiating OSD commands. Such a subsystem
> would be hundredfold bigger than the mkfs code submitted. I think it would be
> hard and stupid to maintain a complex user-mode API just for creating
> a couple of objects and writing a couple of on disk structures.
> - I intend to refactor the code further to make use of more super.c services,
> so to make this addition even smaller. Also future direction of raid over
> multiple objects will make even more kernel infrastructure needed which
> will need even more user-mode code duplication.
> - I anticipate problems that are not yet addressed in this body of work
> but will be in the future, mainly that a single OSD-target (lun) can
> be shared by lots of FSs, and a single FS can span many OSD-targets.
> Some central management is much easier to do in Kernel.
OK. Please add the above info to the changelog for that patch.
> >
> > What are the dependencies for this filesystem code? I assume that it
> > depends on various block- and scsi-level patches? Which ones, and
> > what is their status, and is this code even compileable without them?
> >
>
> This OSD-based file system is dependent on the open-osd initiator library
> code that I've submitted for inclusion for 2.6.29. It has been sitting
> in linux-next for a while now, and has not been receiving any comments
> for the last two updated patchsets I've sent to scsi-misc/lkml. However
> it has not yet been submitted into James's scsi-misc git tree, and James
> is the ultimate maintainer that should submit this work. I hope it will
> still be submitted into 2.6.29, as this code is totally self sufficient
> and does not endanger or change any other Kernel subsystems.
> (All the needed ground work was already submitted to Linus since 2.6.26)
> So why should it not?
>
> Once the open-osd initiator library is accepted this file system
> could be accepted. I was hoping for the 2.6.30 time frame. (One Kernel
> after the open-osd library)
>
> > Thanks.
>
> Thank you dear Andrew for your most valuable input.
>
> I will constify all the const needed code. will fix the global name space
> litter, will inline the macros and lower case the inlines. Will remove
> the typedefs.
>
> I will reply to individual patches, I have a couple of questions. But
> all your comments are right and I will take care of them.
>
> When, if, all is fixed, through which tree/maintainer can exofs be submitted?
I can merge them. Or you can run a git tree of your own, add it to
linux-next and ask Linus to pull it at the appropriate time.
On Tue, Dec 16, 2008 at 05:31:30PM +0200, Boaz Harrosh wrote:
>
> This patch ties all operation vectors into a file system superblock
> and registers the exofs file_system_type at module's load time.
>
> * The file system control block (AKA on-disk superblock) resides in
> an object with a special ID (defined in common.h).
> Information included in the file system control block is used to
> fill the in-memory superblock structure at mount time. This object
> is created before the file system is used by mkexofs.c. It contains
> information such as:
> - The file system's magic number
> - The next inode number to be allocated
>
> Signed-off-by: Boaz Harrosh <[email protected]>
Some minor comments below.
> ---
> fs/exofs/Kbuild | 2 +-
> fs/exofs/exofs.h | 30 ++++
> fs/exofs/inode.c | 195 +++++++++++++++++++++-
> fs/exofs/super.c | 502 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 4 files changed, 727 insertions(+), 2 deletions(-)
> create mode 100644 fs/exofs/super.c
>
> diff --git a/fs/exofs/Kbuild b/fs/exofs/Kbuild
> index 27c738c..e293cb9 100644
> --- a/fs/exofs/Kbuild
> +++ b/fs/exofs/Kbuild
> @@ -26,5 +26,5 @@ KBUILD_CPPFLAGS := -I$(OSD_INC) $(KBUILD_CPPFLAGS)
>
> endif
>
> -exofs-objs := osd.o inode.o file.o symlink.o namei.o dir.o
> +exofs-objs := osd.o inode.o file.o symlink.o namei.o dir.o super.o
> obj-$(CONFIG_EXOFS_FS) += exofs.o
> diff --git a/fs/exofs/exofs.h b/fs/exofs/exofs.h
> index 7330b59..75c608d 100644
> --- a/fs/exofs/exofs.h
> +++ b/fs/exofs/exofs.h
> @@ -52,6 +52,17 @@
> #define _LLU(x) (unsigned long long)(x)
>
> /*
> + * struct to hold what we get from mount options
> + */
> +struct exofs_mountopt {
> + const char *dev_name;
> + uint64_t pid;
> + int timeout;
> + bool mkfs;
> + int format; /*in Mbyte*/
> +};
> +
> +/*
> * our extension to the in-memory superblock
> */
> struct exofs_sb_info {
> @@ -110,6 +121,14 @@ static inline struct exofs_i_info *EXOFS_I(struct inode *inode)
> }
>
> /*
> + * ugly struct so that we can pass two arguments to update_inode's callback
> + */
> +struct updatei_args {
> + struct exofs_sb_info *sbi;
> + struct exofs_fcb *fcb;
> +};
> +
> +/*
> * Maximum count of links to a file
> */
> #define EXOFS_LINK_MAX 32000
> @@ -188,12 +207,20 @@ void free_osd_req(struct osd_request *req);
> /* inode.c */
> void exofs_truncate(struct inode *inode);
> extern struct inode *exofs_iget(struct super_block *, unsigned long);
> +extern int exofs_write_inode(struct inode *, int);
> +extern void exofs_delete_inode(struct inode *);
> struct inode *exofs_new_inode(struct inode *, int);
> int exofs_setattr(struct dentry *, struct iattr *);
> int exofs_write_begin(struct file *file, struct address_space *mapping,
> loff_t pos, unsigned len, unsigned flags,
> struct page **pagep, void **fsdata);
>
> +/* super.c: */
> +#ifdef EXOFS_DEBUG
> +void exofs_dprint_internal(char *str, ...);
> +#endif
> +extern void exofs_write_super(struct super_block *);
> +
> /* dir.c: */
> int exofs_add_link(struct dentry *, struct inode *);
> ino_t exofs_inode_by_name(struct inode *, struct dentry *);
> @@ -223,6 +250,9 @@ extern struct address_space_operations exofs_aops;
> extern struct inode_operations exofs_dir_inode_operations;
> extern struct inode_operations exofs_special_inode_operations;
>
> +/* super.c */
> +extern struct super_operations exofs_sops;
> +
> /* symlink.c */
> extern struct inode_operations exofs_symlink_inode_operations;
> extern struct inode_operations exofs_fast_symlink_inode_operations;
> diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
> index 25a562e..e24690b 100644
> --- a/fs/exofs/inode.c
> +++ b/fs/exofs/inode.c
> @@ -37,6 +37,7 @@
> #include "exofs.h"
>
> static int __readpage_filler(struct page *page, bool is_async_unlock);
> +static int exofs_update_inode(struct inode *inode, int do_sync);
>
> /*
> * Test whether an inode is a fast symlink.
> @@ -49,6 +50,18 @@ static inline int exofs_inode_is_fast_symlink(struct inode *inode)
> }
>
> /*
> + * Callback function from exofs_delete_inode() - don't have much cleaning up to
> + * do.
> + */
> +void delete_done(struct osd_request *req, void *p)
Too generic a name for a non-static symbol (should it be static?)
> +{
> + struct exofs_sb_info *sbi;
> + free_osd_req(req);
> + sbi = (struct exofs_sb_info *)p;
> + atomic_dec(&sbi->s_curr_pending);
> +}
> +
> +/*
> * get_block_t - Fill in a buffer_head
> * An OSD takes care of block allocation so we just fake an allocation by
> * putting in the inode's sector_t in the buffer_head.
> @@ -94,6 +107,62 @@ int exofs_write_begin_export(struct file *file, struct address_space *mapping,
> }
>
> /*
> + * Called when the refcount of an inode reaches zero. We remove the object
> + * from the OSD here. We make sure the object was created before we try and
> + * delete it.
> + */
> +void exofs_delete_inode(struct inode *inode)
> +{
> + struct exofs_i_info *oi = EXOFS_I(inode);
> + struct osd_request *req = NULL;
> + struct super_block *sb = inode->i_sb;
> + struct exofs_sb_info *sbi = sb->s_fs_info;
> + int ret;
> +
> + truncate_inode_pages(&inode->i_data, 0);
> +
> + if (is_bad_inode(inode))
> + goto no_delete;
> + mark_inode_dirty(inode);
> + exofs_update_inode(inode, inode_needs_sync(inode));
> +
> + inode->i_size = 0;
> + if (inode->i_blocks)
> + exofs_truncate(inode);
> +
> + clear_inode(inode);
> +
> + req = prepare_osd_remove(sbi->s_dev, sbi->s_pid,
> + inode->i_ino + EXOFS_OBJ_OFF);
> + if (!req) {
> + printk(KERN_ERR "ERROR: prepare_osd_remove failed\n");
> + return;
> + }
> +
> + /* if we are deleting an obj that hasn't been created yet, wait */
> + if (!ObjCreated(oi)) {
> + if (!Obj2BCreated(oi))
> + BUG();
> + else
> + wait_event(oi->i_wq, ObjCreated(oi));
> + }
> +
> + ret = exofs_async_op(req, delete_done, sbi, oi->i_cred);
> + if (ret) {
> + printk(KERN_ERR
> + "ERROR: @exofs_delete_inode exofs_async_op failed\n");
> + free_osd_req(req);
> + return;
> + }
> + atomic_inc(&sbi->s_curr_pending);
> +
> + return;
> +
> +no_delete:
> + clear_inode(inode);
> +}
> +
> +/*
> * Callback function when writepage finishes. Check for errors, unlock, clean
> * up, etc.
> */
> @@ -610,6 +679,131 @@ bad_inode:
> }
>
> /*
> + * Callback function from exofs_update_inode().
> + */
> +void updatei_done(struct osd_request *req, void *p)
> +{
> + struct updatei_args *args = (struct updatei_args *)p;
> +
> + free_osd_req(req);
> +
> + atomic_dec(&args->sbi->s_curr_pending);
> +
> + kfree(args->fcb);
> + kfree(args);
> + args = NULL;
last line is a no-op
> +}
> +
> +/*
> + * Write the inode to the OSD. Just fill up the struct, and set the attribute
> + * synchronously or asynchronously depending on the do_sync flag.
> + */
> +static int exofs_update_inode(struct inode *inode, int do_sync)
> +{
> + struct exofs_i_info *oi = EXOFS_I(inode);
> + struct super_block *sb = inode->i_sb;
> + struct exofs_sb_info *sbi = sb->s_fs_info;
> + struct osd_request *req = NULL;
> + struct exofs_fcb *fcb = NULL;
> + int ret;
> + int n;
> +
> + fcb = kmalloc(sizeof(struct exofs_fcb), GFP_KERNEL);
> + if (!fcb) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + fcb->i_mode = cpu_to_be16(inode->i_mode);
> + fcb->i_uid = cpu_to_be32(inode->i_uid);
> + fcb->i_gid = cpu_to_be32(inode->i_gid);
> + fcb->i_links_count = cpu_to_be16(inode->i_nlink);
> + fcb->i_ctime = cpu_to_be32(inode->i_ctime.tv_sec);
> + fcb->i_atime = cpu_to_be32(inode->i_atime.tv_sec);
> + fcb->i_mtime = cpu_to_be32(inode->i_mtime.tv_sec);
> + fcb->i_size = cpu_to_be64(oi->i_commit_size = i_size_read(inode));
> + fcb->i_generation = cpu_to_be32(inode->i_generation);
> +
> + if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode)) {
> + if (old_valid_dev(inode->i_rdev)) {
> + fcb->i_data[0] = old_encode_dev(inode->i_rdev);
> + fcb->i_data[1] = 0;
> + } else {
> + fcb->i_data[0] = 0;
> + fcb->i_data[1] = new_encode_dev(inode->i_rdev);
> + fcb->i_data[2] = 0;
> + }
> + } else
> + for (n = 0; n < EXOFS_IDATA; n++)
> + fcb->i_data[n] = oi->i_data[n];
Why not use memcpy() here?
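A minimal userspace sketch of that suggestion. IDATA_WORDS and copy_idata() are hypothetical stand-ins for EXOFS_IDATA and the loop above, and it assumes both arrays have the same element type and length, which is what the element-by-element copy implies.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define IDATA_WORDS 5	/* hypothetical stand-in for EXOFS_IDATA */

/* Copy the whole array in one call instead of an element-by-element loop. */
void copy_idata(uint32_t *dst, const uint32_t *src)
{
	memcpy(dst, src, IDATA_WORDS * sizeof(dst[0]));
}
```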
> +
> + req = prepare_osd_set_attr(sbi->s_dev, sbi->s_pid,
> + (uint64_t) (inode->i_ino + EXOFS_OBJ_OFF));
> + if (!req) {
> + printk(KERN_ERR "ERROR: prepare set_attr failed.\n");
> + kfree(fcb);
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + prepare_set_attr_list_add_entry(req,
> + OSD_PAGE_NUM_IBM_UOBJ_FS_DATA,
> + OSD_ATTR_NUM_IBM_UOBJ_FS_DATA_INODE,
> + EXOFS_INO_ATTR_SIZE,
> + (unsigned char *)fcb);
> +
> + if (!ObjCreated(oi)) {
> + if (!Obj2BCreated(oi))
> + BUG();
> + else
> + wait_event(oi->i_wq, ObjCreated(oi));
> + }
> +
> + if (do_sync) {
> + ret = exofs_sync_op(req, sbi->s_timeout, oi->i_cred);
> + free_osd_req(req);
> + kfree(fcb);
> + } else {
> + struct updatei_args *args = NULL;
> +
> + args = kmalloc(sizeof(struct updatei_args), GFP_KERNEL);
> + if (!args) {
> + kfree(fcb);
> + ret = -ENOMEM;
> + goto out;
> + }
> + args->sbi = sbi;
> + args->fcb = fcb;
> +
> + ret = exofs_async_op(req, updatei_done, args, oi->i_cred);
> + if (ret) {
> + free_osd_req(req);
> + kfree(fcb);
> + kfree(args);
> + goto out;
> + }
> + atomic_inc(&sbi->s_curr_pending);
> + }
> +out:
> + return ret;
> +}
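All the kfree(fcb) calls can be moved after "out:" (as long as fcb is set to
NULL once the async callback owns it, since updatei_done() frees it).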
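Roughly what a single cleanup point could look like, as a userspace sketch rather than a tested kernel patch. update_sketch() and its parameters are hypothetical; malloc()/free() stand in for kmalloc()/kfree(). The one case that needs care is a successful async submit, where the completion callback (updatei_done() in the patch) becomes the owner of fcb.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for exofs_update_inode().
 * *callback_owned receives the buffer when ownership moves to the
 * (imaginary) completion callback, and is NULL otherwise.
 */
int update_sketch(int do_sync, int fail_async, void **callback_owned)
{
	void *fcb;
	int ret = 0;

	*callback_owned = NULL;
	fcb = malloc(64);
	if (!fcb) {
		ret = -ENOMEM;
		goto out;
	}

	if (do_sync)
		goto out;	/* synchronous: we still own fcb */

	if (fail_async) {
		ret = -EIO;	/* submission failed: we still own fcb */
		goto out;
	}

	/* Async submit succeeded: the callback frees fcb from now on. */
	*callback_owned = fcb;
	fcb = NULL;
out:
	free(fcb);		/* free(NULL) is a no-op, as is kfree(NULL) */
	return ret;
}
```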
> +
> +int exofs_write_inode(struct inode *inode, int wait)
> +{
> + return exofs_update_inode(inode, wait);
> +}
> +
> +int exofs_sync_inode(struct inode *inode)
> +{
> + struct writeback_control wbc = {
> + .sync_mode = WB_SYNC_ALL,
> + .nr_to_write = 0, /* sys_fsync did this */
> + };
> +
> + return sync_inode(inode, &wbc);
> +}
> +
> +/*
> * Set inode attributes - just call generic functions.
> */
> int exofs_setattr(struct dentry *dentry, struct iattr *iattr)
> @@ -624,7 +818,6 @@ int exofs_setattr(struct dentry *dentry, struct iattr *iattr)
> error = inode_setattr(inode, iattr);
> return error;
> }
> -
> /*
> * Callback function from exofs_new_inode(). The important thing is that we
> * set the ObjCreated flag so that other methods know that the object exists on
> diff --git a/fs/exofs/super.c b/fs/exofs/super.c
> new file mode 100644
> index 0000000..8ecf700
> --- /dev/null
> +++ b/fs/exofs/super.c
> @@ -0,0 +1,502 @@
> +/*
> + * Copyright (C) 2005, 2006
> + * Avishay Traeger ([email protected]) ([email protected])
> + * Copyright (C) 2005, 2006
> + * International Business Machines
> + *
> + * Copyrights for code taken from ext2:
> + * Copyright (C) 1992, 1993, 1994, 1995
> + * Remy Card ([email protected])
> + * Laboratoire MASI - Institut Blaise Pascal
> + * Universite Pierre et Marie Curie (Paris VI)
> + * from
> + * linux/fs/minix/inode.c
> + * Copyright (C) 1991, 1992 Linus Torvalds
> + *
> + * This file is part of exofs.
> + *
> + * exofs is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation. Since it is based on ext2, and the only
> + * valid version of GPL for the Linux kernel is version 2, the only valid
> + * version of GPL for exofs is version 2.
> + *
> + * exofs is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with exofs; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
> + */
> +
> +#include <linux/string.h>
> +#include <linux/parser.h>
> +#include <linux/vfs.h>
> +#include <linux/random.h>
> +
> +#include "exofs.h"
> +
> +/******************************************************************************
> + * MOUNT OPTIONS
> + *****************************************************************************/
> +
> +/*
> + * exofs-specific mount-time options.
> + */
> +enum { Opt_lun, Opt_tid, Opt_pid, Opt_to, Opt_mkfs, Opt_format, Opt_err };
> +
> +/*
> + * Our mount-time options. These should ideally be 64-bit unsigned, but the
> + * kernel's parsing functions do not currently support that. 32-bit should be
> + * sufficient for most applications now.
> + */
> +static match_table_t tokens = {
> + {Opt_pid, "pid=%u"},
> + {Opt_to, "to=%u"},
> + {Opt_err, NULL}
> +};
> +
> +/*
> + * The main option parsing method. Also makes sure that all of the mandatory
> + * mount options were set.
> + */
> +static int parse_options(char *options, struct exofs_mountopt *opts)
> +{
> + char *p;
> + substring_t args[MAX_OPT_ARGS];
> + int option;
> + int s_pid = 0;
> +
> + EXOFS_DBGMSG("parse_options %s\n", options);
> + /* defaults */
> + memset(opts, 0, sizeof(*opts));
> + opts->timeout = BLK_DEFAULT_SG_TIMEOUT;
> +
> + while ((p = strsep(&options, ",")) != NULL) {
> + int token;
> + if (!*p)
> + continue;
> +
> + token = match_token(p, tokens, args);
> + switch (token) {
> + case Opt_pid:
> + if (match_int(&args[0], &option))
> + return -EINVAL;
> + if (option < 65536) {
> + EXOFS_ERR("Partition ID must be >= 65536");
> + return -EINVAL;
> + }
> + opts->pid = option;
> + s_pid = 1;
> + break;
> + case Opt_to:
> + if (match_int(&args[0], &option))
> + return -EINVAL;
> + if (option <= 0) {
> + EXOFS_ERR("Timeout must be > 0");
> + return -EINVAL;
> + }
> + opts->timeout = option * HZ;
> + break;
> + }
> + }
> +
> + if (!s_pid) {
> + EXOFS_ERR("Need to specify the following options:\n");
> + EXOFS_ERR(" -o pid=pid_no_to_use\n");
> + return -EINVAL;
> + }
> +
> + return 0;
> +}
> +
> +/******************************************************************************
> + * INODE CACHE
> + *****************************************************************************/
> +
> +/*
> + * Our inode cache. Isn't it pretty?
> + */
> +static struct kmem_cache *exofs_inode_cachep;
> +
> +/*
> + * Allocate an inode in the cache
> + */
> +static struct inode *exofs_alloc_inode(struct super_block *sb)
> +{
> + struct exofs_i_info *oi;
> +
> + oi = kmem_cache_alloc(exofs_inode_cachep, GFP_KERNEL);
> + if (!oi)
> + return NULL;
> +
> + oi->vfs_inode.i_version = 1;
> + return &oi->vfs_inode;
> +}
> +
> +/*
> + * Remove an inode from the cache
> + */
> +static void exofs_destroy_inode(struct inode *inode)
> +{
> + kmem_cache_free(exofs_inode_cachep, EXOFS_I(inode));
> +}
> +
> +/*
> + * Initialize the inode
> + */
> +static void exofs_init_once(void *foo)
> +{
> + struct exofs_i_info *oi = foo;
> +
> + inode_init_once(&oi->vfs_inode);
> +}
> +
> +/*
> + * Create and initialize the inode cache
> + */
> +static int init_inodecache(void)
> +{
> + exofs_inode_cachep = kmem_cache_create("exofs_inode_cache",
> + sizeof(struct exofs_i_info),
> + 0, SLAB_RECLAIM_ACCOUNT,
> + exofs_init_once);
> + if (exofs_inode_cachep == NULL)
> + return -ENOMEM;
> + return 0;
> +}
> +
> +/*
> + * Destroy the inode cache
> + */
> +static void destroy_inodecache(void)
> +{
> + kmem_cache_destroy(exofs_inode_cachep);
> +}
> +
> +/******************************************************************************
> + * SUPERBLOCK FUNCTIONS
> + *****************************************************************************/
> +
> +/*
> + * Write the superblock to the OSD
> + */
> +void exofs_write_super(struct super_block *sb)
> +{
> + struct exofs_sb_info *sbi;
> + struct exofs_fscb *fscb = NULL;
> + struct osd_request *req = NULL;
> +
> + fscb = kzalloc(sizeof(struct exofs_fscb), GFP_KERNEL);
> + if (!fscb)
> + return;
> +
> + lock_kernel();
> + sbi = sb->s_fs_info;
> + fscb->s_nextid = sbi->s_nextid;
> + fscb->s_magic = sb->s_magic;
> + fscb->s_numfiles = sbi->s_numfiles;
> + fscb->s_newfs = 0;
> +
> + req = prepare_osd_write(sbi->s_dev, sbi->s_pid, EXOFS_SUPER_ID,
> + sizeof(struct exofs_fscb), 0, 0,
> + (unsigned char *)(fscb));
> + if (!req) {
> + EXOFS_ERR("ERROR: write super failed.\n");
Missing unlock_kernel() here,
or just goto out.
> + kfree(fscb);
> + return;
> + }
> +
> + exofs_sync_op(req, sbi->s_timeout, sbi->s_cred);
> + free_osd_req(req);
> + sb->s_dirt = 0;
out:
> + unlock_kernel();
> + kfree(fscb);
> +}
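The point about the early return in exofs_write_super() can be sketched in userspace as follows. The names are hypothetical, and a plain counter stands in for lock_kernel()/unlock_kernel(); the only thing shown is that every exit taken after the lock goes through the same label, so the lock stays balanced.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

int sketch_lock_depth;	/* hypothetical stand-in for the big kernel lock */

void sketch_lock(void)   { sketch_lock_depth++; }
void sketch_unlock(void) { sketch_lock_depth--; }

int write_super_sketch(int prepare_fails)
{
	int ret = 0;
	char *fscb = calloc(1, 32);

	if (!fscb)
		return -ENOMEM;	/* lock not taken yet, plain return is fine */

	sketch_lock();
	if (prepare_fails) {
		ret = -ENOMEM;
		goto out;	/* do not return with the lock held */
	}
	/* ... build and submit the write request here ... */
out:
	sketch_unlock();
	free(fscb);
	return ret;
}
```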
> +
> +/*
> + * This function is called when the vfs is freeing the superblock. We just
> + * need to free our own part.
> + */
> +static void exofs_put_super(struct super_block *sb)
> +{
> + int num_pend;
> + struct exofs_sb_info *sbi = sb->s_fs_info;
> +
> + /* make sure there are no pending commands */
> + for (num_pend = atomic_read(&sbi->s_curr_pending); num_pend > 0;
> + num_pend = atomic_read(&sbi->s_curr_pending)) {
> + wait_queue_head_t wq;
> + init_waitqueue_head(&wq);
> + wait_event_timeout(wq,
> + (atomic_read(&sbi->s_curr_pending) == 0),
> + msecs_to_jiffies(100));
> + }
> +
> + osduld_put_device(sbi->s_dev);
> + kfree(sb->s_fs_info);
> + sb->s_fs_info = NULL;
> +}
> +
> +/*
> + * Read the superblock from the OSD and fill in the fields
> + */
> +static int exofs_fill_super(struct super_block *sb, void *data, int silent)
> +{
> + struct inode *root;
> + struct exofs_mountopt *opts = data;
> + struct exofs_sb_info *sbi = NULL; /*extended info */
> + struct exofs_fscb fscb; /*on-disk superblock info */
> + struct osd_request *req = NULL;
> + int ret;
> +
> + sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);
> + if (!sbi)
> + return -ENOMEM;
> + sb->s_fs_info = sbi;
> +
> + /* use mount options to fill superblock */
> + sbi->s_dev = osduld_path_lookup(opts->dev_name);
> + if (IS_ERR(sbi->s_dev)) {
> + ret = PTR_ERR(sbi->s_dev);
> + sbi->s_dev = NULL;
> + goto free_sbi;
> + }
> +
> + sbi->s_pid = opts->pid;
> + sbi->s_timeout = opts->timeout;
> +
> + /* fill in some other data by hand */
> + memset(sb->s_id, 0, sizeof(sb->s_id));
wasn't it zeroed by kzalloc?
> + strcpy(sb->s_id, "exofs");
> + sb->s_blocksize = EXOFS_BLKSIZE;
> + sb->s_blocksize_bits = EXOFS_BLKSHIFT;
> + atomic_set(&sbi->s_curr_pending, 0);
> + sb->s_bdev = NULL;
> + sb->s_dev = 0;
> +
> + /* read data from on-disk superblock object */
> + make_credential(sbi->s_cred, sbi->s_pid, EXOFS_SUPER_ID);
> +
> + req = prepare_osd_read(sbi->s_dev, sbi->s_pid, EXOFS_SUPER_ID,
> + sizeof(struct exofs_fscb), 0, 0,
> + (unsigned char *)(&fscb));
> + if (!req) {
> + if (!silent)
> + EXOFS_ERR("ERROR: could not prepare read request.\n");
> + ret = -ENOMEM;
> + goto free_sbi;
> + }
> +
> + ret = exofs_sync_op(req, sbi->s_timeout, sbi->s_cred);
> + if (ret != 0) {
> + if (!silent)
> + EXOFS_ERR("ERROR: read super failed.\n");
> + ret = -EIO;
> + goto free_sbi;
> + }
> +
> + sb->s_magic = fscb.s_magic;
> + sbi->s_nextid = fscb.s_nextid;
> + sbi->s_numfiles = fscb.s_numfiles;
> +
> + /* make sure what we read from the object store is correct */
> + if (sb->s_magic != EXOFS_SUPER_MAGIC) {
> + if (!silent)
> + EXOFS_ERR("ERROR: Bad magic value\n");
> + ret = -EINVAL;
> + goto free_sbi;
> + }
> +
> + /* start generation numbers from a random point */
> + get_random_bytes(&sbi->s_next_generation, sizeof(u32));
> + spin_lock_init(&sbi->s_next_gen_lock);
> +
> + /* set up operation vectors */
> + sb->s_op = &exofs_sops;
> + root = exofs_iget(sb, EXOFS_ROOT_ID - EXOFS_OBJ_OFF);
> + if (IS_ERR(root)) {
> + EXOFS_ERR("ERROR: exofs_iget faild\n");
typo (failed)
> + ret = PTR_ERR(root);
> + goto free_sbi;
> + }
> + sb->s_root = d_alloc_root(root);
> + if (!sb->s_root) {
> + iput(root);
> + EXOFS_ERR("ERROR: get root inode failed\n");
> + ret = -ENOMEM;
> + goto free_sbi;
> + }
> +
> + if (!S_ISDIR(root->i_mode)) {
> + dput(sb->s_root);
> + sb->s_root = NULL;
> + EXOFS_ERR("ERROR: corrupt root inode (mode = %hd)\n",
> + root->i_mode);
> + ret = -EINVAL;
> + goto free_sbi;
> + }
> +
> + ret = 0;
> +out:
> + if (req)
> + free_osd_req(req);
> + return ret;
> +
> +free_sbi:
> + osduld_put_device(sbi->s_dev); /* NULL safe */
> + kfree(sbi);
> + goto out;
> +}
> +
> +/*
> + * Set up the superblock (calls exofs_fill_super eventually)
> + */
> +static int exofs_get_sb(struct file_system_type *type,
> + int flags, const char *dev_name,
> + void *data, struct vfsmount *mnt)
> +{
> + struct exofs_mountopt opts;
> + int ret;
> +
> + ret = parse_options((char *) data, &opts);
> + if (ret)
> + return ret;
> +
> + opts.dev_name = dev_name;
> + return get_sb_nodev(type, flags, &opts, exofs_fill_super, mnt);
> +}
> +
> +/*
> + * Return information about the file system state in the buffer. This is used
> + * by the 'df' command, for example.
> + */
> +static int exofs_statfs(struct dentry *dentry, struct kstatfs *buf)
> +{
> + struct super_block *sb = dentry->d_sb;
> + struct exofs_sb_info *sbi = sb->s_fs_info;
> + uint8_t cred_a[OSD_CAP_LEN];
> + struct osd_request *req = NULL;
> + uint32_t page;
> + uint32_t attr;
> + uint16_t expected;
> + uint64_t capacity;
> + uint64_t used;
> + uint8_t *data;
> + int ret;
> +
> + /* get used/capacity attributes */
> + make_credential(cred_a, sbi->s_pid, 0);
> +
> + req = prepare_osd_get_attr(sbi->s_dev, sbi->s_pid, 0);
> + if (!req) {
> + EXOFS_ERR("ERROR: prepare get_attr failed.\n");
> + return -1;
> + }
> +
> + prepare_get_attr_list_add_entry(req,
> + OSD_APAGE_PARTITION_QUOTAS,
> + OSD_ATTR_PQ_CAPACITY_QUOTA,
> + 8);
> +
> + prepare_get_attr_list_add_entry(req,
> + OSD_APAGE_PARTITION_INFORMATION,
> + OSD_ATTR_PI_USED_CAPACITY,
> + 8);
> +
> + ret = exofs_sync_op(req, sbi->s_timeout, cred_a);
> + if (ret)
> + goto out;
> +
> + page = OSD_APAGE_PARTITION_QUOTAS;
> + attr = OSD_ATTR_PQ_CAPACITY_QUOTA;
> + expected = 8;
> + ret = extract_next_attr_from_req(req, &page, &attr, &expected, &data);
> + if (ret) {
> + EXOFS_ERR("ERROR: extract attr from req failed\n");
> + goto out;
> + }
> + capacity = be64_to_cpu(*((uint64_t *)data));
> +
> + page = OSD_APAGE_PARTITION_INFORMATION;
> + attr = OSD_ATTR_PI_USED_CAPACITY;
> + expected = 8;
> + ret = extract_next_attr_from_req(req, &page, &attr, &expected, &data);
> + if (ret) {
> + EXOFS_ERR("ERROR: extract attr from req failed\n");
> + goto out;
> + }
> + used = be64_to_cpu(*((uint64_t *)data));
> +
> + /* fill in the stats buffer */
> + buf->f_type = EXOFS_SUPER_MAGIC;
> + buf->f_bsize = EXOFS_BLKSIZE;
> + buf->f_blocks = (capacity >> EXOFS_BLKSHIFT);
> + buf->f_bfree = ((capacity - used) >> EXOFS_BLKSHIFT);
> + buf->f_bavail = buf->f_bfree;
> + buf->f_files = sbi->s_numfiles;
> + buf->f_ffree = EXOFS_MAX_ID - sbi->s_numfiles;
> + buf->f_namelen = EXOFS_NAME_LEN;
> +out:
> + free_osd_req(req);
> +
> + return ret;
> +}
> +
> +struct super_operations exofs_sops = {
> + .alloc_inode = exofs_alloc_inode,
> + .destroy_inode = exofs_destroy_inode,
> + .write_inode = exofs_write_inode,
> + .delete_inode = exofs_delete_inode,
> + .put_super = exofs_put_super,
> + .write_super = exofs_write_super,
> + .statfs = exofs_statfs,
> +};
> +
> +/******************************************************************************
> + * INSMOD/RMMOD
> + *****************************************************************************/
> +
> +/*
> + * struct that describes this file system
> + */
> +static struct file_system_type exofs_type = {
> + .owner = THIS_MODULE,
> + .name = "exofs",
> + .get_sb = exofs_get_sb,
> + .kill_sb = generic_shutdown_super,
> +};
> +
> +static int __init init_exofs(void)
> +{
> + int err;
> +
> + err = init_inodecache();
> + if (err)
> + goto out;
> +
> + err = register_filesystem(&exofs_type);
> + if (err)
> + goto out_d;
> +
> + return 0;
> +out_d:
> + destroy_inodecache();
> +out:
> + return err;
> +}
> +
> +static void __exit exit_exofs(void)
> +{
> + unregister_filesystem(&exofs_type);
> + destroy_inodecache();
> +}
> +
> +MODULE_AUTHOR("Avishay Traeger <[email protected]>");
> +MODULE_DESCRIPTION("exofs");
> +MODULE_LICENSE("GPL");
> +
> +module_init(init_exofs)
> +module_exit(exit_exofs)
> --
> 1.6.0.1
Hi!
> Added some documentation in exofs.txt, as well as a BUGS file.
>
> For further reading, operation instructions, example scripts
> and up to date infomation and code please see:
> http://open-osd.org
>
> Signed-off-by: Boaz Harrosh <[email protected]>
> +===============================================================================
> +WHAT IS EXOFS?
> +===============================================================================
> +
> +exofs is a file system that uses an OSD and exports the API of a normal Linux
> +file system. Users access exofs like any other local file system, and exofs
> +will in turn issue commands to the local initiator.
> +
Which tells me pretty much nothing. I guess it should explain what is
OSD... there are way too many TLAs and ETLAs in FOSS world.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Pavel Machek wrote:
> Hi!
>
>> Added some documentation in exofs.txt, as well as a BUGS file.
>>
>> For further reading, operation instructions, example scripts
>> and up to date infomation and code please see:
>> http://open-osd.org
>>
>> Signed-off-by: Boaz Harrosh <[email protected]>
>
>> +===============================================================================
>> +WHAT IS EXOFS?
>> +===============================================================================
>> +
>> +exofs is a file system that uses an OSD and exports the API of a normal Linux
>> +file system. Users access exofs like any other local file system, and exofs
>> +will in turn issue commands to the local initiator.
>> +
>
> Which tells me pretty much nothing. I guess it should explain what is
> OSD... there are way too many TLAs and ETLAs in FOSS world.
> Pavel
Thanks, you are completely right. I need to give a short description of OSD
and point to the osd.txt file that was added in the prerequisite patchset.
Will do.
Marcin Slusarz wrote:
> On Tue, Dec 16, 2008 at 05:31:30PM +0200, Boaz Harrosh wrote:
>> This patch ties all operation vectors into a file system superblock
>> and registers the exofs file_system_type at module's load time.
>>
>> * The file system control block (AKA on-disk superblock) resides in
>> an object with a special ID (defined in common.h).
>> Information included in the file system control block is used to
>> fill the in-memory superblock structure at mount time. This object
>> is created before the file system is used by mkexofs.c It contains
>> information such as:
>> - The file system's magic number
>> - The next inode number to be allocated
>>
>> Signed-off-by: Boaz Harrosh <[email protected]>
>
> Some minor comments below.
>
Thank you, Marcin, for your comments. They are all valid and I will
fix them.
As a side note, most of your comments are on code inherited from
ext2, but this is a good chance to fix it here.
>> ---
<snip>
>> + sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);
>> + if (!sbi)
>> + return -ENOMEM;
>> + sb->s_fs_info = sbi;
>> +
>> + /* use mount options to fill superblock */
>> + sbi->s_dev = osduld_path_lookup(opts->dev_name);
>> + if (IS_ERR(sbi->s_dev)) {
>> + ret = PTR_ERR(sbi->s_dev);
>> + sbi->s_dev = NULL;
>> + goto free_sbi;
>> + }
>> +
>> + sbi->s_pid = opts->pid;
>> + sbi->s_timeout = opts->timeout;
>> +
>> + /* fill in some other data by hand */
>> + memset(sb->s_id, 0, sizeof(sb->s_id));
>
> wasn't it zeroed by kzalloc?
>
That is a different kzalloc (it zeroes sbi, not sb), though I agree that a
memset is a bit hysterical before a strcpy.
>> + strcpy(sb->s_id, "exofs");
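To illustrate: strcpy() writes the terminating NUL itself, so clearing the buffer first only matters if something later reads bytes past the terminator. A minimal userspace sketch; ID_LEN and set_id() are hypothetical, and snprintf() is used here as a bounded alternative to strcpy(), which is not what the patch does.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define ID_LEN 32	/* hypothetical buffer size */

/* Store a short id string; the copy is bounded and always NUL-terminated. */
void set_id(char *id, const char *name)
{
	snprintf(id, ID_LEN, "%s", name);
}
```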
Thanks
Boaz
On Dec. 31, 2008, 17:57 +0200, James Bottomley <[email protected]> wrote:
> On Wed, 2008-12-31 at 17:19 +0200, Boaz Harrosh wrote:
>> Andrew Morton wrote:
>>> On Tue, 16 Dec 2008 17:33:48 +0200
>>> Boaz Harrosh <[email protected]> wrote:
>>>
>>>> We need a mechanism to prepare the file system (mkfs).
>>>> I chose to implement that by means of a couple of
>>>> mount-options. Because there is no user-mode API for committing
>>>> OSD commands. And also, all this stuff is highly internal to
>>>> the file system itself.
>>>>
>>>> - Added two mount options mkfs=0/1,format=capacity_in_meg, so mkfs/format
>>>> can be executed by kernel code just before mount. An mkexofs utility
>>>> can now be implemented by means of a script that mounts and unmount the
>>>> file system with proper options.
>>> Doing mkfs in-kernel is unusual. I don't think the above description
>>> sufficiently helps the uninitiated understand why mkfs cannot be done
>>> in userspace as usual. Please flesh it out a bit.
>> There are a few main reasons.
>> - There is no user-mode API for initiating OSD commands. Such a subsystem
>> would be hundredfold bigger then the mkfs code submitted. I think it would be
>> hard and stupid to maintain a complex user-mode API just for creating
>> a couple of objects and writing a couple of on disk structures.
>
> This is really a reflection of the whole problem with the OSD paradigm.
>
> In theory, a filesystem on OSD is a thin layer of metadata mapping
> objects to files. Get this right and the storage will manage things,
> like security and access and attributes (there's even a natural mapping
> to the VFS concept of extended attributes). Plus, the storage has
> enough information to manage persistence, backups and replication.
>
> The real problem is that no-one has actually managed to come up with a
> useful VFS<->OSD mapping layer (even by extending or altering the VFS).
> Every filesystem that currently uses OSD has a separate direct OSD
> speaking interface (i.e. it slices out the block layer to do this and
> talks directly to the storage).
>
> I suppose this could be taken to show that such a layer is impossibly
> complex, as you assert, but its lack is reflected in strange looking
> design decisions like in-kernel mkfs. It would also mean that there
> would be very little layered code sharing between ODS based filesystems.
I think that we may need to gain some more experience before we can extract
the commonalities of such file systems. So far we have come up with the
lowest common denominator: the osd initiator library, which deals
with command formatting and execution, including attrs, sense status,
and security.
To provide a higher-level abstraction that would help with "administrative"
tasks such as mkfs, we have floated an idea in the past:
a file system that represents the contents of an OSD in a namespace,
for example: partition_id / object_id / {data, attrs / ..., ctl / ...}.
Such a file system could provide a generic mapping which one could
use to easily develop management applications for the OSD. That said,
it is out of the scope of exofs, which focuses mostly on the filesystem
data and metadata paths.
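As a rough illustration of that namespace idea only: the helper below builds a path of the form partition_id/object_id/leaf. osd_ns_path() is hypothetical, and printing the ids in decimal is an assumption; nothing like this exists in exofs or the initiator library.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical layout: <partition_id>/<object_id>/<leaf>,
 * where leaf is e.g. "data", "attrs" or "ctl".
 * Returns 0 on success, -1 if the path does not fit in buf.
 */
int osd_ns_path(char *buf, size_t len, uint64_t pid, uint64_t oid,
		const char *leaf)
{
	int n = snprintf(buf, len, "%" PRIu64 "/%" PRIu64 "/%s",
			 pid, oid, leaf);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```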
>
>> - I intend to refactor the code further to make use of more super.c services,
>> so to make this addition even smaller. Also future direction of raid over
>> multiple objects will make even more kernel infrastructure needed which
>> will need even more user-mode code duplication.
>> - I anticipate problems that are not yet addressed in this body of work
>> but will be in the future, mainly that a single OSD-target (lun) can
>> be shared by lots of FSs, and a single FS can span many OSD-targets.
>> Some central management is much easier to do in Kernel.
>>
>>> What are the dependencies for this filesystem code? I assume that it
>>> depends on various block- and scsi-level patches? Which ones, and
>>> what is their status, and is this code even compileable without them?
>>>
>> This OSD-based file system is dependent on the open-osd initiator library
>> code that I've submitted for inclusion for 2.6.29. It has been sitting
>> in linux-next for a while now, and has not been receiving any comments
>> for the last two updated patchsets I've sent to scsi-misc/lkml. However
>> it has not yet been submitted into Jame's scsi-misc git tree, and James
>> is the ultimate maintainer that should submit this work. I hope it will
>> still be submitted into 2.6.29, as this code is totally self sufficient
>> and does not endangers or changes any other Kernel subsystems.
>> (All the needed ground work was already submitted to Linus since 2.6.26)
>> So why should it not?
>
> I don't like it mainly because it's not truly a useful general framework
> for others to build on. However, as argued above, there might not
> actually be such a useful framework, so as long as the only two
> consumers (you and Lustre) want an interface like this, I'll put it in.
Not to mention pNFS over objects, which is just around the corner.
The pnfs-obj layout driver will use the osd initiator library as well
for distributed data I/O access (while the metadata server, to be based
on exofs, accesses the OSD for metadata and security ops too).
Benny
>
> James
>
>
Benny Halevy wrote:
> On Dec. 31, 2008, 17:57 +0200, James Bottomley <[email protected]> wrote:
>> On Wed, 2008-12-31 at 17:19 +0200, Boaz Harrosh wrote:
>>> Andrew Morton wrote:
>>>> On Tue, 16 Dec 2008 17:33:48 +0200
>>>> Boaz Harrosh <[email protected]> wrote:
>>>>
>>>>> We need a mechanism to prepare the file system (mkfs).
>>>>> I chose to implement that by means of a couple of
>>>>> mount-options. Because there is no user-mode API for committing
>>>>> OSD commands. And also, all this stuff is highly internal to
>>>>> the file system itself.
>>>>>
>>>>> - Added two mount options mkfs=0/1,format=capacity_in_meg, so mkfs/format
>>>>> can be executed by kernel code just before mount. An mkexofs utility
>>>>> can now be implemented by means of a script that mounts and unmount the
>>>>> file system with proper options.
>>>> Doing mkfs in-kernel is unusual. I don't think the above description
>>>> sufficiently helps the uninitiated understand why mkfs cannot be done
>>>> in userspace as usual. Please flesh it out a bit.
>>> There are a few main reasons.
>>> - There is no user-mode API for initiating OSD commands. Such a subsystem
>>> would be hundredfold bigger then the mkfs code submitted. I think it would be
>>> hard and stupid to maintain a complex user-mode API just for creating
>>> a couple of objects and writing a couple of on disk structures.
>> This is really a reflection of the whole problem with the OSD paradigm.
>>
>> In theory, a filesystem on OSD is a thin layer of metadata mapping
>> objects to files. Get this right and the storage will manage things,
>> like security and access and attributes (there's even a natural mapping
>> to the VFS concept of extended attributes). Plus, the storage has
>> enough information to manage persistence, backups and replication.
>>
>> The real problem is that no-one has actually managed to come up with a
>> useful VFS<->OSD mapping layer (even by extending or altering the VFS).
>> Every filesystem that currently uses OSD has a separate direct OSD
>> speaking interface (i.e. it slices out the block layer to do this and
>> talks directly to the storage).
>>
>> I suppose this could be taken to show that such a layer is impossibly
>> complex, as you assert, but its lack is reflected in strange looking
>> design decisions like in-kernel mkfs. It would also mean that there
>> would be very little layered code sharing between ODS based filesystems.
>
> I think that we may need to gain some more experience to extract the
> commonalities of such file systems. Currently we came up with the
> lowest possible denominator the osd initiator library that deals
> with command formatting and execution, including attrs, sense status,
> and security.
Not putting words in James' mouth, but I definitely agree that the
in-kernel mkfs raises a red flag or two. mkfs.ext3 for block-based
filesystems has direct and intimate knowledge of ext3 filesystem
structure, and it writes that information from userland directly to the
block(s) necessary.
Similarly, mkfs for an object-based filesystem should be issuing SCSI
commands to the OSD device from userland, AFAICS.
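For reference, the generic userland pass-through on Linux is the SG_IO ioctl. The sketch below only fills in an sg_io_hdr_t for a standard 6-byte INQUIRY; fill_inquiry() and the two macros are hypothetical names, and the 5 second timeout is an arbitrary choice. It shows the mechanism in general and says nothing about how OSD's much longer command descriptors would be carried.

```c
#include <assert.h>
#include <string.h>
#include <scsi/sg.h>

#define INQUIRY_OPCODE	0x12
#define INQUIRY_CDB_LEN	6

/* Prepare (but do not submit) an SG_IO request for a standard INQUIRY. */
void fill_inquiry(sg_io_hdr_t *hdr, unsigned char *cdb,
		  unsigned char *data, unsigned short data_len,
		  unsigned char *sense, unsigned char sense_len)
{
	memset(cdb, 0, INQUIRY_CDB_LEN);
	cdb[0] = INQUIRY_OPCODE;
	cdb[3] = (unsigned char)(data_len >> 8);	/* allocation length, MSB */
	cdb[4] = (unsigned char)(data_len & 0xff);	/* allocation length, LSB */

	memset(hdr, 0, sizeof(*hdr));
	hdr->interface_id = 'S';		/* required by the sg interface */
	hdr->dxfer_direction = SG_DXFER_FROM_DEV;
	hdr->cmd_len = INQUIRY_CDB_LEN;
	hdr->cmdp = cdb;
	hdr->dxfer_len = data_len;
	hdr->dxferp = data;
	hdr->mx_sb_len = sense_len;
	hdr->sbp = sense;
	hdr->timeout = 5000;			/* milliseconds */
}
```

The filled header would then be passed to ioctl(fd, SG_IO, &hdr) on an open SCSI device node; that step is left out because it needs real hardware.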
> To provide a higher level abstraction that would help with "administrative"
> tasks like mkfs and the like we already tossed an idea in the past -
> a file system that will represent the contents of an OSD in a namespace,
> for example: partition_id / object_id / {data, attrs / ..., ctl / ...}.
> Such a file system could provide a generic mapping which one could
> use to easily develop management applications for the OSD. That said,
> it's out of the scope of exofs which focuses mostly on the filesystem
> data and metadata paths.
That's far too complex for what is necessary. Just issue SCSI commands
from userland. We don't need an abstract interface specifically for
low-level details. The VFS is that abstract interface; anything else
should be low-level and purpose-built.
Jeff
Andrew Morton wrote:
>>> Boaz Harrosh <[email protected]> wrote:
>> When, if, all is fixed, through which tree/maintainer can exofs be submitted?
>
> I can merge them. Or you can run a git tree of your own, add it to
> linux-next and ask Linus to pull it at the appropriate time.
>
Hi James
Andrew suggested that maybe I should push the exofs file system directly to
Linus, as it is pretty orthogonal to any other work. Sitting in linux-next
will quickly expose any changes in the VFS and will force me to keep
the tree up to date.
If that is so, and it is accepted by Linus, would you rather that the
open-osd initiator library also be submitted through the same tree?
The conflicts with scsi are very narrow. The only real dependency
is the ULD being a SCSI ULD. I will routinely ask for your ACK on any scsi
or ULD related patches, which are very few. This way it will be easier
to manage the dependencies between the OSD work, the OSD pNFS-Objects
trees at the pNFS project, and the pNFSD+EXOFS export. One less dependency.
[I have had such a public tree at git.open-osd.org for a while now]
Thanks
Boaz
On Jan. 01, 2009, 11:54 +0200, Jeff Garzik <[email protected]> wrote:
> Benny Halevy wrote:
>> On Dec. 31, 2008, 17:57 +0200, James Bottomley <[email protected]> wrote:
>>> On Wed, 2008-12-31 at 17:19 +0200, Boaz Harrosh wrote:
>>>> Andrew Morton wrote:
>>>>> On Tue, 16 Dec 2008 17:33:48 +0200
>>>>> Boaz Harrosh <[email protected]> wrote:
>>>>>
>>>>>> We need a mechanism to prepare the file system (mkfs).
>>>>>> I chose to implement that by means of a couple of
>>>>>> mount-options. Because there is no user-mode API for committing
>>>>>> OSD commands. And also, all this stuff is highly internal to
>>>>>> the file system itself.
>>>>>>
>>>>>> - Added two mount options mkfs=0/1,format=capacity_in_meg, so mkfs/format
>>>>>> can be executed by kernel code just before mount. An mkexofs utility
>>>>>> can now be implemented by means of a script that mounts and unmount the
>>>>>> file system with proper options.
>>>>> Doing mkfs in-kernel is unusual. I don't think the above description
>>>>> sufficiently helps the uninitiated understand why mkfs cannot be done
>>>>> in userspace as usual. Please flesh it out a bit.
>>>> There are a few main reasons.
>>>> - There is no user-mode API for initiating OSD commands. Such a subsystem
>>>> would be hundredfold bigger then the mkfs code submitted. I think it would be
>>>> hard and stupid to maintain a complex user-mode API just for creating
>>>> a couple of objects and writing a couple of on disk structures.
>>> This is really a reflection of the whole problem with the OSD paradigm.
>>>
>>> In theory, a filesystem on OSD is a thin layer of metadata mapping
>>> objects to files. Get this right and the storage will manage things,
>>> like security and access and attributes (there's even a natural mapping
>>> to the VFS concept of extended attributes). Plus, the storage has
>>> enough information to manage persistence, backups and replication.
>>>
>>> The real problem is that no-one has actually managed to come up with a
>>> useful VFS<->OSD mapping layer (even by extending or altering the VFS).
>>> Every filesystem that currently uses OSD has a separate direct OSD
>>> speaking interface (i.e. it slices out the block layer to do this and
>>> talks directly to the storage).
>>>
>>> I suppose this could be taken to show that such a layer is impossibly
>>> complex, as you assert, but its lack is reflected in strange looking
>>> design decisions like in-kernel mkfs. It would also mean that there
>>> would be very little layered code sharing between ODS based filesystems.
>> I think that we may need to gain some more experience to extract the
>> commonalities of such file systems. Currently we came up with the
>> lowest possible denominator the osd initiator library that deals
>> with command formatting and execution, including attrs, sense status,
>> and security.
>
> Not putting words in James' mouth, but I definitely agree that the
> in-kernel mkfs raises a red flag or two. mkfs.ext3 for block-based
> filesystems has direct and intimate knowledge of ext3 filesystem
> structure, and it writes that information from userland directly to the
> block(s) necessary.
Personally, I'm not sure that maintaining that intimate knowledge in a
user space program is an ideal model with respect to keeping both
in sync, avoiding code duplication, and dealing with upgrade issues
(e.g. upgrading the kernel and not the user space utils).
The main advantage I can see in doing that is keeping the kernel
code small without bloating it with rarely-used logic. However,
the mkfs logic for exofs is so small that it adds little to the
module footprint, so justifying a user space util on that ground
is questionable IMO.
>
> Similarly, mkfs for an object-based filesystem should be issuing SCSI
> commands to the OSD device from userland, AFAICS.
That's possible...
Benny
>
>
>> To provide a higher level abstraction that would help with "administrative"
>> tasks like mkfs and the like we already tossed an idea in the past -
>> a file system that will represent the contents of an OSD in a namespace,
>> for example: partition_id / object_id / {data, attrs / ..., ctl / ...}.
>> Such a file system could provide a generic mapping which one could
>> use to easily develop management applications for the OSD. That said,
>> it's out of the scope of exofs which focuses mostly on the filesystem
>> data and metadata paths.
>
> That's far too complex for what is necessary. Just issue SCSI commands
> from userland. We don't need an abstract interface specifically for
> low-level details. The VFS is that abstract interface; anything else
> should be low-level and purpose-built.
>
> Jeff
>
On Thu, Jan 01, 2009 at 04:23:00PM +0200, Benny Halevy wrote:
> Personally, I'm not sure if maintaining that intimate knowledge in a
> user space program is an ideal model with respect to keeping both
> in sync, avoiding code duplication, and dealing with upgrade issues
> (e.g. upgrading the kernel and not the user space utils)
The other 30-40 filesystems that Linux supports manage to do it this
way. I'm not sure why osdfs is different in this regard.
You need to be careful with the filesystem layout anyway -- when you
upgrade the kernel, it still needs to be able to access all the files
contained in existing filesystems. And it needs to create new files
which are still readable by older kernels (users have this pesky habit
of downgrading).
--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
Andrew Morton wrote:
> On Wed, 31 Dec 2008 17:33:41 +0200 Boaz Harrosh <[email protected]> wrote:
>
>> Andrew Morton wrote:
>>>> +int prepare_get_attr_list_add_entry(struct osd_request *req,
>>>> + uint32_t page_num,
>>>> + uint32_t attr_num,
>>>> + uint32_t attr_len)
>>>> +{
>>>> + struct osd_attr attr = {
>>>> + .page = page_num,
>>> Kernel developers expect a field called "page" to have type `struct
>>> page *'. osd_attr.page is thus designed to confuse.
>>>
>>>> ...
>>>>
>> Rant below (can be ignored):
>> This single fix will cause a massive change to the open-osd
>> initiator patchset, (18 patches), and resubmission .I made the mistake
>> because this name originates from a file that all naming conventions
>> are taken from the OSD standard text. However this is no excuse
>> for using a well known Kernel construct name. I will fix it. And
>> will be more careful in the future.
>
> The world wouldn't end if you left the code as-is. We've done worse things :)
Too late, I've already changed it. I had an Internet outage yesterday so I've only just
pushed the new trees.
I'm glad I did, because I found in the exofs code, inside the same file, a "u32 page"
next to a "struct page *page", which is really bad. Now attr_page everywhere
is much clearer.
[ As usual:
git-clone git://git.open-osd.org/linux-open-osd.git linux-next-exofs
or on the web at:
http://git.open-osd.org/gitweb.cgi?p=linux-open-osd.git;a=shortlog;h=refs/heads/linux-next-exofs
]
I will submit another round of patches once I address all the other comments.
Thanks
Boaz
On Thu, 1 January 2009 16:23:00 +0200, Benny Halevy wrote:
>
> Personally, I'm not sure if maintaining that intimate knowledge in a
> user space program is an ideal model with respect to keeping both
> in sync, avoiding code duplication, and dealing with upgrade issues
> (e.g. upgrading the kernel and not the user space utils)
None of those problems actually matter, because you will have them
anyway. If your filesystem is any good, someone will reimplement it for
Windows, Grub, UBoot, Solaris or some other system. And even if it
isn't any good, you still need to stay compatible with your own
implementation from last year.
Ok, maybe code duplication is a valid concern. But that will hardly
outweigh the arguments in favor of a userland mkfs. The only exception
I am aware of is jffs2, where a newly erased flash happens to be a valid
(empty) filesystem. And even there you can view flash_eraseall as a
trivial mkfs program. ;)
Jörn
--
It's just what we asked for, but not what we want!
-- anonymous
On Thu, Jan 01, 2009 at 11:22:45AM +0200, Benny Halevy wrote:
> On Dec. 31, 2008, 17:57 +0200, James Bottomley <[email protected]> wrote:
> > I don't like it mainly because it's not truly a useful general framework
> > for others to build on. However, as argued above, there might not
> > actually be such a useful framework, so as long as the only two
> > consumers (you and Lustre) want an interface like this, I'll put it in.
>
> Not to mention pnfs over objects which is coming up around the corner.
> The pnfs-obj layout driver will use the osd initiator library as well
> for distributed data I/O access (while the metadata server, to be based
> on exofs accesses the OSD for metadata and security ops too)
What state is that project in right now?
--b.
On Jan. 02, 2009, 1:26 +0200, "J. Bruce Fields" <[email protected]> wrote:
> On Thu, Jan 01, 2009 at 11:22:45AM +0200, Benny Halevy wrote:
>> On Dec. 31, 2008, 17:57 +0200, James Bottomley <[email protected]> wrote:
>>> I don't like it mainly because it's not truly a useful general framework
>>> for others to build on. However, as argued above, there might not
>>> actually be such a useful framework, so as long as the only two
>>> consumers (you and Lustre) want an interface like this, I'll put it in.
>> Not to mention pnfs over objects which is coming up around the corner.
>> The pnfs-obj layout driver will use the osd initiator library as well
>> for distributed data I/O access (while the metadata server, to be based
>> on exofs accesses the OSD for metadata and security ops too)
>
> What state is that project in right now?
I hope to release the pnfs-obj layout driver in a few weeks,
after finishing with cleaning up the nfs41 and pnfs patch sets.
Still, there's more work to be done on the back end side, exporting
exofs over (p)NFS, and then we'd be able to provide full pnfs
over objects functionality.
Benny
>
> --b.
Hi!
> > In this patch are all the osd infrastructure that will be used later
> > by the file system.
> >
> > Also the declarations of constants, on disk structures, and prototypes.
> >
> > And the Kbuild+Kconfig files needed to build the exofs module.
> >
> >
> > ...
> >
> > +struct exofs_sb_info {
> > + struct osd_dev *s_dev; /* returned by get_osd_dev */
> > + uint64_t s_pid; /* partition ID of file system*/
> > + int s_timeout; /* timeout for OSD operations */
> > + uint32_t s_nextid; /* highest object ID used */
> > + uint32_t s_numfiles; /* number of files on fs */
> > + spinlock_t s_next_gen_lock; /* spinlock for gen # update */
> > + u32 s_next_generation; /* next gen # to use */
> > + atomic_t s_curr_pending; /* number of pending commands */
> > + uint8_t s_cred[OSD_CAP_LEN]; /* all-powerful credential */
> > +};
> > +
> > +/*
> > + * our inode flags
> > + */
> > +#ifdef ARCH_HAS_ATOMIC_UNSIGNED
>
> This doesn't exist, and it would be fairly bad to introduce it. Please
> kill the ifdefs.
>
> > +typedef unsigned exofs_iflags_t;
> > +#else
> > +typedef unsigned long exofs_iflags_t;
> > +#endif
>
> Then please kill the typedef altogether and replace it with `unsigned
> long' everywhere
Hmmm... and add a note somewhere that we assume unsigned long to be atomic...?
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Thu, 2009-01-01 at 15:33 +0200, Boaz Harrosh wrote:
> Andrew Morton wrote:
> >>> Boaz Harrosh <[email protected]> wrote:
> >> When, if, all is fixed, through which tree/maintainer can exofs be submitted?
> >
> > I can merge them. Or you can run a git tree of your own, add it to
> > linux-next and ask Linus to pull it at the appropriate time.
> >
>
> Hi James
>
> Andrew suggested that maybe I should push exofs file system directly to
> Linus as it is pretty orthogonal to any other work. Sitting in linux-next
> will quickly expose any advancements in VFS and will force me to keep
> the tree uptodate.
>
> If that is so, and is accepted by Linus, would you rather that also the
> open-osd initiator library will be submitted through the same tree?
> The conflicts with scsi are very very narrow. The only real dependency
> is the ULD being a SCSI ULD. I will routinely ask your ACK on any scsi
> or ULD related patches. Which are very few. This way it will be easier
> to manage the dependencies between the OSD work, the OSD pNFS-Objects
> trees at pNFS project, and the pNFSD+EXOFS export. One less dependency.
>
> [I already have such a public tree at git.open-osd.org for a while now]
Since it's sitting in SCSI, at least the libosd piece belongs on the
SCSI mailing list, so I think it makes sense to continue updating it via
the SCSI tree.
What's the status of the major number request from LANANA? That's patch
number one, and I haven't heard that they've confirmed the selection of
260 yet; or is LANANA now dead and it's a matter of who gets the major
into the tree first?
James
Pavel Machek wrote:
> Hi!
>
>>> In this patch are all the osd infrastructure that will be used later
>>> by the file system.
>>>
>>> Also the declarations of constants, on disk structures, and prototypes.
>>>
>>> And the Kbuild+Kconfig files needed to build the exofs module.
>>>
>>>
>>> ...
>>>
>>> +struct exofs_sb_info {
>>> + struct osd_dev *s_dev; /* returned by get_osd_dev */
>>> + uint64_t s_pid; /* partition ID of file system*/
>>> + int s_timeout; /* timeout for OSD operations */
>>> + uint32_t s_nextid; /* highest object ID used */
>>> + uint32_t s_numfiles; /* number of files on fs */
>>> + spinlock_t s_next_gen_lock; /* spinlock for gen # update */
>>> + u32 s_next_generation; /* next gen # to use */
>>> + atomic_t s_curr_pending; /* number of pending commands */
>>> + uint8_t s_cred[OSD_CAP_LEN]; /* all-powerful credential */
>>> +};
>>> +
>>> +/*
>>> + * our inode flags
>>> + */
>>> +#ifdef ARCH_HAS_ATOMIC_UNSIGNED
>> This doesn't exist, and it would be fairly bad to introduce it. Please
>> kill the ifdefs.
>>
>>> +typedef unsigned exofs_iflags_t;
>>> +#else
>>> +typedef unsigned long exofs_iflags_t;
>>> +#endif
>> Then please kill the typedef altogether and replace it with `unsigned
>> long' everywhere
>
> Hmmm.. .and at a note somewhere that we assume unsigned long to be atomic...?
>
I think I'll just use unsigned. It's more than enough; I'm not using more than 3
bits for now. Is unsigned workable for all ARCHs?
Thanks
James Bottomley wrote:
> On Thu, 2009-01-01 at 15:33 +0200, Boaz Harrosh wrote:
>> Andrew Morton wrote:
>>>>> Boaz Harrosh <[email protected]> wrote:
>>>> When, if, all is fixed, through which tree/maintainer can exofs be submitted?
>>> I can merge them. Or you can run a git tree of your own, add it to
>>> linux-next and ask Linus to pull it at the appropriate time.
>>>
>> Hi James
>>
>> Andrew suggested that maybe I should push exofs file system directly to
>> Linus as it is pretty orthogonal to any other work. Sitting in linux-next
>> will quickly expose any advancements in VFS and will force me to keep
>> the tree uptodate.
>>
>> If that is so, and is accepted by Linus, would you rather that also the
>> open-osd initiator library will be submitted through the same tree?
>> The conflicts with scsi are very very narrow. The only real dependency
>> is the ULD being a SCSI ULD. I will routinely ask your ACK on any scsi
>> or ULD related patches. Which are very few. This way it will be easier
>> to manage the dependencies between the OSD work, the OSD pNFS-Objects
>> trees at pNFS project, and the pNFSD+EXOFS export. One less dependency.
>>
>> [I already have such a public tree at git.open-osd.org for a while now]
>
> Since it's sitting in SCSI, at least the libosd piece belongs over the
> SCSI mailing list, so I think it makes sense to continue updating it via
> the SCSI tree.
>
> What's the status of the major number request from LANANA. That's patch
> number one, and I haven't heard that they've confirmed the selection of
> 260 yet; or is LANANA now dead and it's who gets the major into the tree
> first?
>
> James
>
LANANA seems dead. I was unable to get a response to any of my e-mails.
Andrew?
Thanks James. I would personally prefer that these patches carry
your sign-off, thus gaining from your long-acquired instincts.
That would be really great.
I will send a new batch tomorrow morning, as Andrew had concerns with
some member names. If you prefer a git tree, drop me a note and
I'll send you a URL instead.
Thanks
Boaz
James Bottomley wrote:
> On Wed, 2008-12-31 at 17:19 +0200, Boaz Harrosh wrote:
>> Andrew Morton wrote:
>>> On Tue, 16 Dec 2008 17:33:48 +0200
>>> Boaz Harrosh <[email protected]> wrote:
>>>
>>>> We need a mechanism to prepare the file system (mkfs).
>>>> I chose to implement that by means of a couple of
>>>> mount-options. Because there is no user-mode API for committing
>>>> OSD commands. And also, all this stuff is highly internal to
>>>> the file system itself.
>>>>
>>>> - Added two mount options mkfs=0/1,format=capacity_in_meg, so mkfs/format
>>>> can be executed by kernel code just before mount. An mkexofs utility
>>>> can now be implemented by means of a script that mounts and unmount the
>>>> file system with proper options.
>>> Doing mkfs in-kernel is unusual. I don't think the above description
>>> sufficiently helps the uninitiated understand why mkfs cannot be done
>>> in userspace as usual. Please flesh it out a bit.
>> There are a few main reasons.
>> - There is no user-mode API for initiating OSD commands. Such a subsystem
>> would be hundredfold bigger then the mkfs code submitted. I think it would be
>> hard and stupid to maintain a complex user-mode API just for creating
>> a couple of objects and writing a couple of on disk structures.
>
> This is really a reflection of the whole problem with the OSD paradigm.
Certainly not a problem of the OSD paradigm, just maybe a problem
of the current code boundaries laid out by years of block-devices.
> In theory, a filesystem on OSD is a thin layer of metadata mapping
> objects to files. Get this right and the storage will manage things,
- objects to files. Get this right and the storage will manage things,
+ files to objects. Get this right and the storage will manage things,
[objects to files is what some of the osd-targets do.]
> like security and access and attributes (there's even a natural mapping
> to the VFS concept of extended attributes). Plus, the storage has
> enough information to manage persistence, backups and replication.
>
Sounds perfect to me.
> The real problem is that no-one has actually managed to come up with a
> useful VFS<->OSD mapping layer (even by extending or altering the VFS).
> Every filesystem that currently uses OSD has a separate direct OSD
> speaking interface (i.e. it slices out the block layer to do this and
> talks directly to the storage).
I'm not sure what you mean.
Let's take the VFS<->BLOCKS mapping for example. Each FS has its own
interpretation of what that means. Is btrfs less perfect than xfs,
or vice versa?
I guess you did not mean "mapping" but meant "interface" or API.
(Or more likely I misunderstand the meaning of "mapping" ;)
Well, that is exactly what I was attempting to submit: a general-purpose,
low-level but easy-to-use objects API for kernel clients, be it a
dead-simple exofs or a complex multi-head beast like a pNFS-Objects
file system. The same library/API/interface will be used for NFS clients,
NFSD servers, reconstruction, security, whatever.
The block layer is not sliced out; only the elevator function is, since
BIO merging, if any, is not device-global but per-object/file, and the
elevator does not currently support that. (Profiling shows that it will
be needed.)
BTW, the block-based filesystems are just a large minority in the kernel. The
majority do not use the block layer either.
>
> I suppose this could be taken to show that such a layer is impossibly
> complex, as you assert, but its lack is reflected in strange looking
> design decisions like in-kernel mkfs. It would also mean that there
> would be very little layered code sharing between ODS based filesystems.
- would be very little layered code sharing between ODS based filesystems.
+ would be very little layered code sharing between OSD based filesystems.
I disagree.
All the OSD-Based file systems (In Linux) should absolutely only use the
open-osd library submitted. I myself will work on a couple. If anything is
missing that could not be added later, I would like to know about it.
User-mode Interface is another matter. There are some ideas and some already
implemented.
[Hosted on open-osd.org
see: http://git.open-osd.org/gitweb.cgi?p=osc-osd/.git;a=summary
look inside the osd-initiator directory]
And I have a toy interface that adds no new entries into the Kernel in
the form of an OSDVFS module, that will let you access the raw OSD device
through the VFS name-space.
The lack of any user-mode API is just down to the lack of any current need/priority,
or to my being the only one working on OSD. It is nothing that could not be solved
in two weeks of pragmatic work. Surely it's not a paradigm problem.
>
>> - I intend to refactor the code further to make use of more super.c services,
>> so to make this addition even smaller. Also future direction of raid over
>> multiple objects will make even more kernel infrastructure needed which
>> will need even more user-mode code duplication.
>> - I anticipate problems that are not yet addressed in this body of work
>> but will be in the future, mainly that a single OSD-target (lun) can
>> be shared by lots of FSs, and a single FS can span many OSD-targets.
>> Some central management is much easier to do in Kernel.
>>
>>> What are the dependencies for this filesystem code? I assume that it
>>> depends on various block- and scsi-level patches? Which ones, and
>>> what is their status, and is this code even compileable without them?
>>>
>> This OSD-based file system is dependent on the open-osd initiator library
>> code that I've submitted for inclusion for 2.6.29. It has been sitting
>> in linux-next for a while now, and has not been receiving any comments
>> for the last two updated patchsets I've sent to scsi-misc/lkml. However
>> it has not yet been submitted into Jame's scsi-misc git tree, and James
>> is the ultimate maintainer that should submit this work. I hope it will
>> still be submitted into 2.6.29, as this code is totally self sufficient
>> and does not endangers or changes any other Kernel subsystems.
>> (All the needed ground work was already submitted to Linus since 2.6.26)
>> So why should it not?
>
> I don't like it mainly because it's not truly a useful general framework
> for others to build on. However, as argued above, there might not
> actually be such a useful framework, so as long as the only two
> consumers (you and Lustre) want an interface like this, I'll put it in.
>
Time will tell, but I believe the exact opposite. I believe, and strive,
for this OSD body of work to be useful for anybody that needs to talk
T10-OSD in Linux, for any purpose. Anything missing should be
easy to add.
> James
>
>
To summarize the way I see it:
- James is right in that we cannot currently see the full OSD picture since
we do not have a user-mode API, so the usefulness of it all is unclear.
[I will send an RFD soon, and hope all interested will chime in on the
discussion]
- That said, all the submitted code is still relevant and useful,
though in a few places it takes the pragmatic, easy route over
long-term correctness. [Which can be fixed]
- exofs/OSD is not the first FS that depends on a non-block-dev stack of
its own. The lower level (OSD) is presented to the kernel as a char-dev plus
an additional API, which is common to other FS/stack models. The lower OSD
level also has the potential to be a generic layer that can be used by lots
of users and use cases, not only filesystems.
Thank you James for your consideration
Boaz
On Sun, Jan 04, 2009 at 05:20:42PM +0200, Boaz Harrosh wrote:
>
> User-mode Interface is another matter. There are some ideas and some already
> implemented.
> [Hosted on open-osd.org
> see: http://git.open-osd.org/gitweb.cgi?p=osc-osd/.git;a=summary
> look inside the osd-initiator directory]
> And I have a toy interface that adds no new entries into the Kernel in
> the form of an OSDVFS module, that will let you access the raw OSD device
> through the VFS name-space.
>
> The lack of any user-mode API is just the lack of any current need/priority,
> or that I'm the only one working on OSD. But nothing that could not be solved
> in two weeks of pragmatic work. Surly it's not a paradigm problem.
For mkfs/repair, direct use by databases, etc. you want a userspace
library, too. The easiest way to get started would be to simply take the
kernel libosd and make it work on top of SG_IO.
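A minimal sketch of what "on top of SG_IO" means mechanically, assuming the
Linux sg interface from <scsi/sg.h>. It is not libosd code: the sketch_*
helpers are hypothetical names, and it sends a plain 6-byte INQUIRY rather
than an OSD command, since building OSD CDBs is exactly the part a userspace
libosd would have to supply.

```c
#include <assert.h>
#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

/* Hypothetical helper: fill an sg_io_hdr for a device-to-host transfer. */
static void sketch_fill_sg_read(struct sg_io_hdr *hdr,
				unsigned char *cdb, unsigned char cdb_len,
				void *buf, unsigned int buf_len,
				unsigned char *sense, unsigned char sense_len,
				unsigned int timeout_ms)
{
	memset(hdr, 0, sizeof(*hdr));
	hdr->interface_id = 'S';		/* required by the sg v3 interface */
	hdr->dxfer_direction = SG_DXFER_FROM_DEV;
	hdr->cmdp = cdb;
	hdr->cmd_len = cdb_len;
	hdr->dxferp = buf;
	hdr->dxfer_len = buf_len;
	hdr->sbp = sense;			/* sense data lands here on error */
	hdr->mx_sb_len = sense_len;
	hdr->timeout = timeout_ms;		/* milliseconds */
}

/*
 * Hypothetical helper: send a standard INQUIRY to an already-open SCSI
 * device node. Returns 0 on success, -1 if the ioctl fails or the
 * device reports a non-zero SCSI status.
 */
static int sketch_inquiry(int fd, unsigned char *buf, unsigned char len)
{
	unsigned char cdb[6] = { 0x12, 0, 0, 0, len, 0 };	/* INQUIRY */
	unsigned char sense[32];
	struct sg_io_hdr hdr;

	sketch_fill_sg_read(&hdr, cdb, sizeof(cdb), buf, len,
			    sense, sizeof(sense), 5000);
	if (ioctl(fd, SG_IO, &hdr) < 0)
		return -1;
	return hdr.status == 0 ? 0 : -1;
}
```

A caller would open the device node, call sketch_inquiry(fd, buf, 36) and
close it again. An mkfs-style tool would follow the same pattern with CDBs
that create the partition and objects instead of INQUIRY.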
On Sun 2009-01-04 10:43:09, Boaz Harrosh wrote:
> Pavel Machek wrote:
> > Hi!
> >
> >>> In this patch are all the osd infrastructure that will be used later
> >>> by the file system.
> >>>
> >>> Also the declarations of constants, on disk structures, and prototypes.
> >>>
> >>> And the Kbuild+Kconfig files needed to build the exofs module.
> >>>
> >>>
> >>> ...
> >>>
> >>> +struct exofs_sb_info {
> >>> + struct osd_dev *s_dev; /* returned by get_osd_dev */
> >>> + uint64_t s_pid; /* partition ID of file system*/
> >>> + int s_timeout; /* timeout for OSD operations */
> >>> + uint32_t s_nextid; /* highest object ID used */
> >>> + uint32_t s_numfiles; /* number of files on fs */
> >>> + spinlock_t s_next_gen_lock; /* spinlock for gen # update */
> >>> + u32 s_next_generation; /* next gen # to use */
> >>> + atomic_t s_curr_pending; /* number of pending commands */
> >>> + uint8_t s_cred[OSD_CAP_LEN]; /* all-powerful credential */
> >>> +};
> >>> +
> >>> +/*
> >>> + * our inode flags
> >>> + */
> >>> +#ifdef ARCH_HAS_ATOMIC_UNSIGNED
> >> This doesn't exist, and it would be fairly bad to introduce it. Please
> >> kill the ifdefs.
> >>
> >>> +typedef unsigned exofs_iflags_t;
> >>> +#else
> >>> +typedef unsigned long exofs_iflags_t;
> >>> +#endif
> >> Then please kill the typedef altogether and replace it with `unsigned
> >> long' everywhere
> >
> > Hmmm.. .and at a note somewhere that we assume unsigned long to be atomic...?
> >
>
> I think I'll just use unsigned. It's more then enough I'm not using more then 3
> bits for now. Is unsigned workable for all ARCHs?
Please just use atomic_t.
(see "atomics: document that linux expects certain atomic behaviour"
thread for discussion)
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Pavel Machek wrote:
> On Sun 2009-01-04 10:43:09, Boaz Harrosh wrote:
>> Pavel Machek wrote:
>>> Hi!
>>>
>>>>> +#ifdef ARCH_HAS_ATOMIC_UNSIGNED
>>>> This doesn't exist, and it would be fairly bad to introduce it. Please
>>>> kill the ifdefs.
>>>>
>>>>> +typedef unsigned exofs_iflags_t;
>>>>> +#else
>>>>> +typedef unsigned long exofs_iflags_t;
>>>>> +#endif
>>>> Then please kill the typedef altogether and replace it with `unsigned
>>>> long' everywhere
>>> Hmmm.. .and at a note somewhere that we assume unsigned long to be atomic...?
>>>
>> I think I'll just use unsigned. It's more then enough I'm not using more then 3
>> bits for now. Is unsigned workable for all ARCHs?
<added>
> /*
> * our extension to the in-memory inode
> */
> struct exofs_i_info {
> unsigned long i_flags; /* various atomic flags */
<snip>
>
> /*
> * our inode flags
> */
> #define OBJ_2BCREATED 0 /* object will be created soon*/
> #define OBJ_CREATED 1 /* object has been created on the osd*/
>
> static inline int obj_2bcreated(struct exofs_i_info *oi)
> {
> return test_bit(OBJ_2BCREATED, &(oi->i_flags));
> }
>
> static inline void set_obj_2bcreated(struct exofs_i_info *oi)
> {
> set_bit(OBJ_2BCREATED, &(oi->i_flags));
> }
>
> static inline int obj_created(struct exofs_i_info *oi)
> {
> return test_bit(OBJ_CREATED, &(oi->i_flags));
> }
>
> static inline void set_obj_created(struct exofs_i_info *oi)
> {
> set_bit(OBJ_CREATED, &(oi->i_flags));
> }
</added>
>
> Please just use atomic_t.
>
> (see "atomics: document that linux expects certain atomic behaviour"
> thread for discussion)
> Pavel
I have a problem with this. i_flags is meant to be used with
set_bit() and test_bit(). On some ARCHs like x86_64 they take an
"unsigned long *"; on most others they take a "void *" and cast internally
to a "u32 *". (For x86_64 I must use "unsigned long", anything else warns.)
I think if I declare "unsigned long" but only use 32 bits of flags then
I should be in the clear with ALL archs. I'll see if that works once
this code sits in linux-next. (That's real ugly, I think.)
Should set_bit() and test_bit() only be used from arch/ code? What
can regular kernel code use?
Thanks
Boaz
> Pavel Machek wrote:
> > On Sun 2009-01-04 10:43:09, Boaz Harrosh wrote:
> >> Pavel Machek wrote:
> >>> Hi!
> >>>
> >>>>> +#ifdef ARCH_HAS_ATOMIC_UNSIGNED
> >>>> This doesn't exist, and it would be fairly bad to introduce it. Please
> >>>> kill the ifdefs.
> >>>>
> >>>>> +typedef unsigned exofs_iflags_t;
> >>>>> +#else
> >>>>> +typedef unsigned long exofs_iflags_t;
> >>>>> +#endif
> >>>> Then please kill the typedef altogether and replace it with `unsigned
> >>>> long' everywhere
> >>> Hmmm.. .and at a note somewhere that we assume unsigned long to be atomic...?
> >>>
> >> I think I'll just use unsigned. It's more then enough I'm not using more then 3
> >> bits for now. Is unsigned workable for all ARCHs?
>
> <added>
> > /*
> > * our extension to the in-memory inode
> > */
> > struct exofs_i_info {
> > unsigned long i_flags; /* various atomic flags */
> <snip>
> >
> > /*
> > * our inode flags
> > */
> > #define OBJ_2BCREATED 0 /* object will be created soon*/
> > #define OBJ_CREATED 1 /* object has been created on the osd*/
> >
> > static inline int obj_2bcreated(struct exofs_i_info *oi)
> > {
> > return test_bit(OBJ_2BCREATED, &(oi->i_flags));
> > }
> >
> > static inline void set_obj_2bcreated(struct exofs_i_info *oi)
> > {
> > set_bit(OBJ_2BCREATED, &(oi->i_flags));
> > }
> >
> > static inline int obj_created(struct exofs_i_info *oi)
> > {
> > return test_bit(OBJ_CREATED, &(oi->i_flags));
> > }
> >
> > static inline void set_obj_created(struct exofs_i_info *oi)
> > {
> > set_bit(OBJ_CREATED, &(oi->i_flags));
> > }
> </added>
>
> >
> > Please just use atomic_t.
> >
> > (see "atomics: document that linux expects certain atomic behaviour"
> > thread for discussion)
> > Pavel
>
> I have a problem with this. The context of i_flags is to be used with
> set_bit() and test_bit(). In some ARCHs like x86_64 they take an
> "unsigned long *" in most others they take a "void *" and cast internally
> to a "u32 *". (for x86_64 I must use "unsigned long", anything else warns)
>
> I think if I declare "unsigned long" but only use 32 bits flags then
> I should be in the clear with ALL archs, I'll see if that works once
> this code sits in linux-next. (That's real ugly I think)
>
> Is set_bit() and test_bit() should only be used from arch/ code? What
> can regular kernel code use?
I believe using test_bit/set_bit on the first 32 bits of an unsigned long is
okay and portable. Just don't call it atomic :-).
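A minimal userspace sketch of the flag-word pattern discussed above, assuming
the GCC/Clang __atomic builtins as stand-ins for the kernel's set_bit() and
test_bit(). The sketch_* names are hypothetical and this is not exofs code.
The flags live in an unsigned long and only bit numbers below 32 are used, so
the same bit positions exist whether long is 32 or 64 bits wide.

```c
#include <assert.h>

/* Bit numbers, as in the quoted exofs header. Keep them below 32. */
#define OBJ_2BCREATED	0	/* object will be created soon */
#define OBJ_CREATED	1	/* object has been created on the osd */

/* Hypothetical stand-in for the in-memory inode extension. */
struct sketch_i_info {
	unsigned long i_flags;
};

/* Userspace stand-in for set_bit(): atomically OR in one bit. */
static inline void sketch_set_bit(unsigned int nr, unsigned long *addr)
{
	__atomic_fetch_or(addr, 1UL << nr, __ATOMIC_SEQ_CST);
}

/* Userspace stand-in for test_bit(): read the word and test one bit. */
static inline int sketch_test_bit(unsigned int nr, const unsigned long *addr)
{
	return (int)((__atomic_load_n(addr, __ATOMIC_SEQ_CST) >> nr) & 1UL);
}

static inline int sketch_obj_2bcreated(struct sketch_i_info *oi)
{
	return sketch_test_bit(OBJ_2BCREATED, &oi->i_flags);
}

static inline void sketch_set_obj_2bcreated(struct sketch_i_info *oi)
{
	sketch_set_bit(OBJ_2BCREATED, &oi->i_flags);
}

static inline int sketch_obj_created(struct sketch_i_info *oi)
{
	return sketch_test_bit(OBJ_CREATED, &oi->i_flags);
}

static inline void sketch_set_obj_created(struct sketch_i_info *oi)
{
	sketch_set_bit(OBJ_CREATED, &oi->i_flags);
}
```

Setting one flag leaves the other untouched, which is the property the
per-bit helpers are there to provide.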
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Dec 31, 2008 15:57 +0000, James Bottomley wrote:
> I don't like it mainly because it's not truly a useful general framework
> for others to build on. However, as argued above, there might not
> actually be such a useful framework, so as long as the only two
> consumers (you and Lustre) want an interface like this, I'll put it in.
To be clear - Lustre has nothing to do with T10-OSD interfaces.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On Dec. 16, 2008, 17:15 +0200, Boaz Harrosh <[email protected]> wrote:
> In this patch are all the osd infrastructure that will be used later
> by the file system.
>
> Also the declarations of constants, on disk structures, and prototypes.
>
> And the Kbuild+Kconfig files needed to build the exofs module.
>
> Signed-off-by: Boaz Harrosh <[email protected]>
> ---
> fs/exofs/Kbuild | 30 +++++
> fs/exofs/Kconfig | 13 ++
> fs/exofs/common.h | 154 ++++++++++++++++++++++++
> fs/exofs/exofs.h | 183 +++++++++++++++++++++++++++++
> fs/exofs/osd.c | 334 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> 5 files changed, 714 insertions(+), 0 deletions(-)
> create mode 100644 fs/exofs/Kbuild
> create mode 100644 fs/exofs/Kconfig
> create mode 100644 fs/exofs/common.h
> create mode 100644 fs/exofs/exofs.h
> create mode 100644 fs/exofs/osd.c
>
> diff --git a/fs/exofs/Kbuild b/fs/exofs/Kbuild
> new file mode 100644
> index 0000000..fd3351e
> --- /dev/null
> +++ b/fs/exofs/Kbuild
> @@ -0,0 +1,30 @@
> +#
> +# Kbuild for the EXOFS module
> +#
> +# Copyright (C) 2008 Panasas Inc. All rights reserved.
> +#
> +# Authors:
> +# Boaz Harrosh <[email protected]>
> +#
> +# This program is free software; you can redistribute it and/or modify
> +# it under the terms of the GNU General Public License version 2
> +#
> +# Kbuild - Gets included from the Kernels Makefile and build system
> +#
> +
> +ifneq ($(OSD_INC),)
> +# we are built out-of-tree Kconfigure everything as on
> +
> +CONFIG_EXOFS_FS=m
> +ccflags += -DCONFIG_EXOFS_FS -DCONFIG_EXOFS_FS_MODULE
> +# ccflags += -DCONFIG_EXOFS_DEBUG
> +
> +# if we are built out-of-tree and the hosting kernel has OSD headers
> +# then "ccflags-y +=" will not pick the out-of-tree headers. Only by doing
> +# this it will work. This might break in future kernels
> +KBUILD_CPPFLAGS := -I$(OSD_INC) $(KBUILD_CPPFLAGS)
> +
> +endif
> +
> +exofs-objs := osd.o
> +obj-$(CONFIG_EXOFS_FS) += exofs.o
> diff --git a/fs/exofs/Kconfig b/fs/exofs/Kconfig
> new file mode 100644
> index 0000000..86194b2
> --- /dev/null
> +++ b/fs/exofs/Kconfig
> @@ -0,0 +1,13 @@
> +config EXOFS_FS
> + tristate "exofs: OSD based file system support"
> + depends on SCSI_OSD_ULD
> + help
> + EXOFS is a file system that uses an OSD storage device,
> + as its backing storage.
> +
> +# Debugging-related stuff
> +config EXOFS_DEBUG
> + bool "Enable debugging"
> + depends on EXOFS_FS
> + help
> + This option enables EXOFS debug prints.
> diff --git a/fs/exofs/common.h b/fs/exofs/common.h
> new file mode 100644
> index 0000000..9a165b3
> --- /dev/null
> +++ b/fs/exofs/common.h
> @@ -0,0 +1,154 @@
> +/*
> + * Copyright (C) 2005, 2006
> + * Avishay Traeger ([email protected]) ([email protected])
> + * Copyright (C) 2005, 2006
> + * International Business Machines
> + *
> + * Copyrights for code taken from ext2:
> + * Copyright (C) 1992, 1993, 1994, 1995
> + * Remy Card ([email protected])
> + * Laboratoire MASI - Institut Blaise Pascal
> + * Universite Pierre et Marie Curie (Paris VI)
> + * from
> + * linux/fs/minix/inode.c
> + * Copyright (C) 1991, 1992 Linus Torvalds
> + *
> + * This file is part of exofs.
> + *
> + * exofs is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation. Since it is based on ext2, and the only
> + * valid version of GPL for the Linux kernel is version 2, the only valid
> + * version of GPL for exofs is version 2.
> + *
> + * exofs is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with exofs; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
> + */
> +
> +#ifndef __EXOFS_COM_H__
> +#define __EXOFS_COM_H__
> +
> +#include <linux/types.h>
> +#include <linux/timex.h>
> +
> +#include <scsi/osd_attributes.h>
> +#include <scsi/osd_initiator.h>
> +#include <scsi/osd_sec.h>
> +
> +/****************************************************************************
> + * Object ID related defines
> + * NOTE: inode# = object ID - EXOFS_OBJ_OFF
> + ****************************************************************************/
> +#define EXOFS_OBJ_OFF 0x10000 /* offset for objects */
> +#define EXOFS_SUPER_ID 0x10000 /* object ID for on-disk superblock */
> +#define EXOFS_BM_ID 0x10001 /* object ID for ID bitmap */
> +#define EXOFS_ROOT_ID 0x10002 /* object ID for root directory */
> +#define EXOFS_TEST_ID 0x10003 /* object ID for test object */
> +
> +/* exofs Application specific page/attribute */
> +#ifndef OSD_PAGE_NUM_IBM_UOBJ_FS_DATA
> +# define OSD_PAGE_NUM_IBM_UOBJ_FS_DATA (OSD_APAGE_APP_DEFINED_FIRST + 3)
> +# define OSD_ATTR_NUM_IBM_UOBJ_FS_DATA_INODE 1
> +#endif
> +
> +/*
> + * The maximum number of files we can have is limited by the size of the
> + * inode number. This is the largest object ID that the file system supports.
> + * Object IDs 0, 1, and 2 are always in use (see above defines).
> + */
> +enum {
> + EXOFS_UINT64_MAX = (~0LL),
> + EXOFS_MAX_INO_ID = (sizeof(ino_t) * 8 == 64) ? EXOFS_UINT64_MAX :
> + (1LL << (sizeof(ino_t) * 8 - 1)),
> + EXOFS_MAX_ID = (EXOFS_MAX_INO_ID - 1 - EXOFS_OBJ_OFF),
> +};
> +
> +/****************************************************************************
> + * Misc.
> + ****************************************************************************/
> +#define EXOFS_BLKSHIFT 12
> +#define EXOFS_BLKSIZE (1UL << EXOFS_BLKSHIFT)
> +
> +/****************************************************************************
> + * superblock-related things
> + ****************************************************************************/
> +#define EXOFS_SUPER_MAGIC 0x5DF5
> +
> +/*
> + * The file system control block - stored in an object's data (mainly, the one
> + * with ID EXOFS_SUPER_ID). This is where the in-memory superblock is stored
> + * on disk. Right now it just has a magic value, which is basically a sanity
> + * check on our ability to communicate with the object store.
> + */
> +struct exofs_fscb {
> + uint32_t s_nextid; /* Highest object ID used */
> + uint32_t s_numfiles; /* Number of files on fs */
> + uint16_t s_magic; /* Magic signature */
> + uint16_t s_newfs; /* Non-zero if this is a new fs */
> +};
> +
> +/****************************************************************************
> + * inode-related things
> + ****************************************************************************/
> +#define EXOFS_IDATA 5
> +
> +/*
> + * The file control block - stored in an object's attributes. This is where
> + * the in-memory inode is stored on disk.
> + */
> +struct exofs_fcb {
> + uint64_t i_size; /* Size of the file */
> + uint16_t i_mode; /* File mode */
> + uint16_t i_links_count; /* Links count */
> + uint32_t i_uid; /* Owner Uid */
> + uint32_t i_gid; /* Group Id */
> + uint32_t i_atime; /* Access time */
> + uint32_t i_ctime; /* Creation time */
> + uint32_t i_mtime; /* Modification time */
> + uint32_t i_flags; /* File flags */
> + uint32_t i_version; /* File version */
> + uint32_t i_generation; /* File version (for NFS) */
> + uint32_t i_data[EXOFS_IDATA]; /* Short symlink names and device #s */
> +};
This shouldn't be stored as a blob in a single attribute;
rather, each field should be stored in its own attribute,
all grouped in one attribute page.
The same goes for the super block, which should be
stored in a user attributes page on the partition object.
Also, please add a metadata version attribute for each
such page, numbered as the first attribute in the page,
for future compatibility.
The metadata contents must be endian safe as well.
As we discussed, it makes sense to retain the little endian
heritage from ext2...
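A minimal sketch of the layout suggested above, assuming hypothetical attribute numbers (the SKETCH_ATTR_INO_* names and the sketch_* helpers are illustrative only and do not appear in the patch). It shows a version attribute numbered first in the page and a host-independent little-endian encoding for a 64-bit field. In kernel code the existing cpu_to_le64()/le64_to_cpu() helpers and the __le64 type would normally do this job.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical attribute numbers inside one inode attribute page.
 * The version attribute is numbered first so that a future layout
 * change can be detected before any other attribute is interpreted.
 */
enum sketch_inode_attr {
	SKETCH_ATTR_INO_VERSION = 1,	/* metadata layout version */
	SKETCH_ATTR_INO_SIZE,		/* 64-bit file size */
	SKETCH_ATTR_INO_MODE,		/* 16-bit file mode */
	SKETCH_ATTR_INO_UID,		/* 32-bit owner uid */
};

/* Store a 64-bit value in little-endian byte order on any host. */
void sketch_put_le64(uint8_t out[8], uint64_t v)
{
	for (size_t i = 0; i < 8; i++)
		out[i] = (uint8_t)(v >> (8 * i));
}

/* Read back a 64-bit little-endian value on any host. */
uint64_t sketch_get_le64(const uint8_t in[8])
{
	uint64_t v = 0;

	for (size_t i = 0; i < 8; i++)
		v |= (uint64_t)in[i] << (8 * i);
	return v;
}
```

With one attribute per field, a later revision could add or widen a field without re-encoding the others, which is what the version attribute is meant to make safe.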
Thanks,
Benny
> +
> +#define EXOFS_INO_ATTR_SIZE sizeof(struct exofs_fcb)
> +
> +/****************************************************************************
> + * dentry-related things
> + ****************************************************************************/
> +#define EXOFS_NAME_LEN 255
> +
> +/*
> + * The on-disk directory entry
> + */
> +struct exofs_dir_entry {
> + uint32_t inode; /* inode number */
> + uint16_t rec_len; /* directory entry length */
> + uint8_t name_len; /* name length */
> + uint8_t file_type; /* umm...file type */
> + char name[EXOFS_NAME_LEN]; /* file name */
> +};
> +
> +enum {
> + EXOFS_FT_UNKNOWN,
> + EXOFS_FT_REG_FILE,
> + EXOFS_FT_DIR,
> + EXOFS_FT_CHRDEV,
> + EXOFS_FT_BLKDEV,
> + EXOFS_FT_FIFO,
> + EXOFS_FT_SOCK,
> + EXOFS_FT_SYMLINK,
> + EXOFS_FT_MAX
> +};
> +
> +#define EXOFS_DIR_PAD 4
> +#define EXOFS_DIR_ROUND (EXOFS_DIR_PAD - 1)
> +#define EXOFS_DIR_REC_LEN(name_len) (((name_len) + 8 + EXOFS_DIR_ROUND) & \
> + ~EXOFS_DIR_ROUND)
> +#endif
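As a worked illustration of two definitions in common.h above, here is a small user-space sketch. The macros are copied from the patch; the sketch_ino_to_oid() and sketch_oid_to_ino() helpers are hypothetical names introduced only to spell out the "inode# = object ID - EXOFS_OBJ_OFF" note. The constant 8 in EXOFS_DIR_REC_LEN matches the fixed part of struct exofs_dir_entry (4 + 2 + 1 + 1 bytes), and the result is rounded up to a multiple of EXOFS_DIR_PAD.

```c
#include <stdint.h>

/* Copied from the common.h hunk above. */
#define EXOFS_OBJ_OFF	0x10000
#define EXOFS_DIR_PAD	4
#define EXOFS_DIR_ROUND	(EXOFS_DIR_PAD - 1)
#define EXOFS_DIR_REC_LEN(name_len) \
	(((name_len) + 8 + EXOFS_DIR_ROUND) & ~EXOFS_DIR_ROUND)

/* Hypothetical helpers, not part of the patch. */
uint64_t sketch_ino_to_oid(uint64_t ino)
{
	return ino + EXOFS_OBJ_OFF;	/* inode 2 maps to object 0x10002 */
}

uint64_t sketch_oid_to_ino(uint64_t oid)
{
	return oid - EXOFS_OBJ_OFF;	/* object 0x10002 maps to inode 2 */
}
```

For example, a one-character name needs 1 + 8 = 9 bytes, which rounds up to a 12-byte record, and the root directory object (EXOFS_ROOT_ID, 0x10002) corresponds to inode number 2.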
> diff --git a/fs/exofs/exofs.h b/fs/exofs/exofs.h
> new file mode 100644
> index 0000000..8534450
> --- /dev/null
> +++ b/fs/exofs/exofs.h
> @@ -0,0 +1,183 @@
> +/*
> + * Copyright (C) 2005, 2006
> + * Avishay Traeger ([email protected]) ([email protected])
> + * Copyright (C) 2005, 2006
> + * International Business Machines
> + *
> + * Copyrights for code taken from ext2:
> + * Copyright (C) 1992, 1993, 1994, 1995
> + * Remy Card ([email protected])
> + * Laboratoire MASI - Institut Blaise Pascal
> + * Universite Pierre et Marie Curie (Paris VI)
> + * from
> + * linux/fs/minix/inode.c
> + * Copyright (C) 1991, 1992 Linus Torvalds
> + *
> + * This file is part of exofs.
> + *
> + * exofs is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation. Since it is based on ext2, and the only
> + * valid version of GPL for the Linux kernel is version 2, the only valid
> + * version of GPL for exofs is version 2.
> + *
> + * exofs is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with exofs; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
> + */
> +
> +#include <linux/fs.h>
> +#include <linux/time.h>
> +#include "common.h"
> +
> +#ifndef __EXOFS_H__
> +#define __EXOFS_H__
> +
> +#define EXOFS_ERR(fmt, a...) printk(KERN_ERR "exofs: " fmt, ##a)
> +
> +#ifdef CONFIG_EXOFS_DEBUG
> +#define EXOFS_DBGMSG(fmt, a...) \
> + printk(KERN_NOTICE "exofs @%s:%d: " fmt, __func__, __LINE__, ##a)
> +#else
> +#define EXOFS_DBGMSG(fmt, a...) \
> + do {} while (0)
> +#endif
> +
> +/* u64 has problems with printk this will cast it to unsigned long long */
> +#define _LLU(x) (unsigned long long)(x)
> +
> +/*
> + * our extension to the in-memory superblock
> + */
> +struct exofs_sb_info {
> + struct osd_dev *s_dev; /* returned by get_osd_dev */
> + uint64_t s_pid; /* partition ID of file system*/
> + int s_timeout; /* timeout for OSD operations */
> + uint32_t s_nextid; /* highest object ID used */
> + uint32_t s_numfiles; /* number of files on fs */
> + spinlock_t s_next_gen_lock; /* spinlock for gen # update */
> + u32 s_next_generation; /* next gen # to use */
> + atomic_t s_curr_pending; /* number of pending commands */
> + uint8_t s_cred[OSD_CAP_LEN]; /* all-powerful credential */
> +};
> +
> +/*
> + * our inode flags
> + */
> +#ifdef ARCH_HAS_ATOMIC_UNSIGNED
> +typedef unsigned exofs_iflags_t;
> +#else
> +typedef unsigned long exofs_iflags_t;
> +#endif
> +
> +#define OBJ_2BCREATED 0 /* object will be created soon*/
> +#define OBJ_CREATED 1 /* object has been created on the osd*/
> +
> +#define Obj2BCreated(oi) \
> + test_bit(OBJ_2BCREATED, &(oi->i_flags))
> +#define SetObj2BCreated(oi) \
> + set_bit(OBJ_2BCREATED, &(oi->i_flags))
> +
> +#define ObjCreated(oi) \
> + test_bit(OBJ_CREATED, &(oi->i_flags))
> +#define SetObjCreated(oi) \
> + set_bit(OBJ_CREATED, &(oi->i_flags))
> +
> +/*
> + * our extension to the in-memory inode
> + */
> +struct exofs_i_info {
> + exofs_iflags_t i_flags; /* various atomic flags */
> + __le32 i_data[EXOFS_IDATA];/*short symlink names and device #s*/
> + uint32_t i_dir_start_lookup; /* which page to start lookup */
> + wait_queue_head_t i_wq; /* wait queue for inode */
> + uint64_t i_commit_size; /* the object's written length */
> + uint8_t i_cred[OSD_CAP_LEN];/* all-powerful credential */
> + struct inode vfs_inode; /* normal in-memory inode */
> +};
> +
> +/*
> + * get to our inode from the vfs inode
> + */
> +static inline struct exofs_i_info *EXOFS_I(struct inode *inode)
> +{
> + return container_of(inode, struct exofs_i_info, vfs_inode);
> +}
> +
> +/*************************
> + * function declarations *
> + *************************/
> +/* osd.c */
> +void make_credential(uint8_t[], uint64_t, uint64_t);
> +int check_ok(struct osd_request *);
> +int exofs_sync_op(struct osd_request *, int, uint8_t *);
> +int exofs_async_op(struct osd_request *, osd_req_done_fn *, void *, char *);
> +
> +int prepare_get_attr_list_add_entry(struct osd_request *req,
> + uint32_t page_num,
> + uint32_t attr_num,
> + uint32_t attr_len);
> +int prepare_set_attr_list_add_entry(struct osd_request *req,
> + uint32_t page_num,
> + uint32_t attr_num,
> + uint16_t attr_len,
> + const unsigned char *attr_val);
> +int extract_next_attr_from_req(struct osd_request *req,
> + uint32_t *page_num, uint32_t *attr_num,
> + uint16_t *attr_len, uint8_t **attr_val);
> +struct osd_request *prepare_osd_format_lun(struct osd_dev *dev,
> + uint64_t formatted_capacity);
> +struct osd_request *prepare_osd_create_partition(struct osd_dev *dev,
> + uint64_t requested_id);
> +struct osd_request *prepare_osd_remove_partition(struct osd_dev *dev,
> + uint64_t requested_id);
> +struct osd_request *prepare_osd_create(struct osd_dev *dev,
> + uint64_t part_id,
> + uint64_t requested_id);
> +struct osd_request *prepare_osd_remove(struct osd_dev *dev,
> + uint64_t part_id,
> + uint64_t obj_id);
> +struct osd_request *prepare_osd_set_attr(struct osd_dev *dev,
> + uint64_t part_id,
> + uint64_t obj_id);
> +struct osd_request *prepare_osd_get_attr(struct osd_dev *dev,
> + uint64_t part_id,
> + uint64_t obj_id);
> +struct osd_request *prepare_osd_read(struct osd_dev *dev,
> + uint64_t part_id,
> + uint64_t obj_id,
> + uint64_t length,
> + uint64_t offset,
> + int cmd_data_use_sg,
> + unsigned char *cmd_data);
> +struct osd_request *prepare_osd_write(struct osd_dev *dev,
> + uint64_t part_id,
> + uint64_t obj_id,
> + uint64_t length,
> + uint64_t offset,
> + int cmd_data_use_sg,
> + const unsigned char *cmd_data);
> +struct osd_request *prepare_osd_list(struct osd_dev *dev,
> + uint64_t part_id,
> + uint32_t list_id,
> + uint64_t alloc_len,
> + uint64_t initial_obj_id,
> + int use_sg,
> + void *data);
> +int extract_list_from_req(struct osd_request *req,
> + uint64_t *total_matches_p,
> + uint64_t *num_ids_retrieved_p,
> + uint64_t *list_of_ids_p[],
> + int *is_list_of_partitions_p,
> + int *list_isnt_up_to_date_p,
> + uint64_t *continuation_tag_p,
> + uint32_t *list_id_for_more_p);
> +
> +void free_osd_req(struct osd_request *req);
> +
> +#endif
> diff --git a/fs/exofs/osd.c b/fs/exofs/osd.c
> new file mode 100644
> index 0000000..3859d3e
> --- /dev/null
> +++ b/fs/exofs/osd.c
> @@ -0,0 +1,334 @@
> +/*
> + * Copyright (C) 2005, 2006
> + * Avishay Traeger ([email protected]) ([email protected])
> + * Copyright (C) 2005, 2006
> + * International Business Machines
> + *
> + * This file is part of exofs.
> + *
> + * exofs is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation. Since it is based on ext2, and the only
> + * valid version of GPL for the Linux kernel is version 2, the only valid
> + * version of GPL for exofs is version 2.
> + *
> + * exofs is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with exofs; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
> + */
> +
> +#include <scsi/scsi_device.h>
> +#include <scsi/osd_sense.h>
> +
> +#include "exofs.h"
> +
> +int check_ok(struct osd_request *req)
> +{
> + struct osd_sense_info osi;
> + int ret = osd_req_decode_sense(req, &osi);
> +
> + if (ret) { /* translate to Linux codes */
> + if (osi.additional_code == scsi_invalid_field_in_cdb) {
> + if (osi.cdb_field_offset == OSD_CFO_STARTING_BYTE)
> + ret = -EFAULT;
> + else if (osi.cdb_field_offset == OSD_CFO_OBJECT_ID)
> + ret = -ENOENT;
> + else
> + ret = -EINVAL;
> + } else if (osi.additional_code == osd_quota_error)
> + ret = -ENOSPC;
> + else
> + ret = -EIO;
> + }
> +
> + return ret;
> +}
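The sense-to-errno translation can be sketched in isolation as follows. This is a user-space illustration only, assuming that each CDB field offset is meant to map to exactly one errno value. The SK_* enums are hypothetical stand-ins for the osd_sense values used by check_ok() and are not the library's definitions.

```c
#include <errno.h>

/* Hypothetical stand-ins for the sense values tested in check_ok(). */
enum sketch_code { SK_INVALID_FIELD_IN_CDB, SK_QUOTA_ERROR, SK_OTHER_CODE };
enum sketch_field { SK_CFO_STARTING_BYTE, SK_CFO_OBJECT_ID, SK_CFO_OTHER };

/* Map one decoded sense result to a negative Linux error code. */
int sketch_sense_to_errno(enum sketch_code code, enum sketch_field field)
{
	if (code == SK_INVALID_FIELD_IN_CDB) {
		if (field == SK_CFO_STARTING_BYTE)
			return -EFAULT;	/* bad offset, e.g. read past the end */
		else if (field == SK_CFO_OBJECT_ID)
			return -ENOENT;	/* no such object */
		else
			return -EINVAL;	/* any other bad CDB field */
	} else if (code == SK_QUOTA_ERROR) {
		return -ENOSPC;
	}
	return -EIO;			/* everything else */
}
```

Chaining the inner tests with else keeps the -EFAULT result from being overwritten by the final -EINVAL branch.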
> +
> +void make_credential(uint8_t cred_a[OSD_CAP_LEN], uint64_t pid, uint64_t oid)
> +{
> + struct osd_obj_id obj = {
> + .partition = pid,
> + .id = oid
> + };
> +
> + osd_sec_init_nosec_doall_caps(cred_a, &obj, false, true);
> +}
> +
> +/*
> + * Perform a synchronous OSD operation.
> + */
> +int exofs_sync_op(struct osd_request *req, int timeout, uint8_t *credential)
> +{
> + int ret;
> +
> + req->timeout = timeout;
> + ret = osd_finalize_request(req, 0, credential, NULL);
> + if (ret) {
> + EXOFS_DBGMSG("Failed to osd_finalize_request() => %d\n", ret);
> + return ret;
> + }
> +
> + ret = osd_execute_request(req);
> +
> + if (ret)
> + EXOFS_DBGMSG("osd_execute_request() => %d\n", ret);
> + /* osd_req_decode_sense(or, ret); */
> + return ret;
> +}
> +
> +/*
> + * Perform an asynchronous OSD operation.
> + */
> +int exofs_async_op(struct osd_request *req, osd_req_done_fn *async_done,
> + void *caller_context, char *credential)
> +{
> + int ret;
> +
> + ret = osd_finalize_request(req, 0, credential, NULL);
> + if (ret) {
> + EXOFS_DBGMSG("Failed to osd_finalize_request() => %d\n", ret);
> + return ret;
> + }
> +
> + ret = osd_execute_request_async(req, async_done, caller_context);
> +
> + if (ret)
> + EXOFS_DBGMSG("osd_execute_request_async() => %d\n", ret);
> + return ret;
> +}
> +
> +int prepare_get_attr_list_add_entry(struct osd_request *req,
> + uint32_t page_num,
> + uint32_t attr_num,
> + uint32_t attr_len)
> +{
> + struct osd_attr attr = {
> + .page = page_num,
> + .attr_id = attr_num,
> + .len = attr_len,
> + };
> +
> + return osd_req_add_get_attr_list(req, &attr, 1);
> +}
> +
> +int prepare_set_attr_list_add_entry(struct osd_request *req,
> + uint32_t page_num,
> + uint32_t attr_num,
> + uint16_t attr_len,
> + const unsigned char *attr_val)
> +{
> + struct osd_attr attr = {
> + .page = page_num,
> + .attr_id = attr_num,
> + .len = attr_len,
> + .val_ptr = (u8 *)attr_val,
> + };
> +
> + return osd_req_add_set_attr_list(req, &attr, 1);
> +}
> +
> +int extract_next_attr_from_req(struct osd_request *req,
> + uint32_t *page_num, uint32_t *attr_num,
> + uint16_t *attr_len, uint8_t **attr_val)
> +{
> + struct osd_attr attr = {.page = 0}; /* start with zeros */
> + void *iter = NULL;
> + int nelem;
> +
> + do {
> + nelem = 1;
> + osd_req_decode_get_attr_list(req, &attr, &nelem, &iter);
> + if ((attr.page == *page_num) && (attr.attr_id == *attr_num)) {
> + *attr_len = attr.len;
> + *attr_val = attr.val_ptr;
> + return 0;
> + }
> + } while (iter);
> +
> + return -EIO;
> +}
> +
> +struct osd_request *prepare_osd_format_lun(struct osd_dev *dev,
> + uint64_t formatted_capacity)
> +{
> + struct osd_request *or = osd_start_request(dev, GFP_KERNEL);
> +
> + if (!or)
> + return NULL;
> +
> + osd_req_format(or, formatted_capacity);
> +
> + return or;
> +}
> +
> +struct osd_request *prepare_osd_create_partition(struct osd_dev *dev,
> + uint64_t requested_id)
> +{
> + struct osd_request *or = osd_start_request(dev, GFP_KERNEL);
> +
> + if (!or)
> + return NULL;
> +
> + osd_req_create_partition(or, requested_id);
> +
> + return or;
> +}
> +
> +struct osd_request *prepare_osd_remove_partition(struct osd_dev *dev,
> + uint64_t requested_id)
> +{
> + struct osd_request *or = osd_start_request(dev, GFP_KERNEL);
> +
> + if (!or)
> + return NULL;
> +
> + osd_req_remove_partition(or, requested_id);
> +
> + return or;
> +}
> +
> +struct osd_request *prepare_osd_create(struct osd_dev *dev,
> + uint64_t part_id,
> + uint64_t requested_id)
> +{
> + struct osd_obj_id obj = {
> + .partition = part_id,
> + .id = requested_id
> + };
> + struct osd_request *or = osd_start_request(dev, GFP_KERNEL);
> +
> + if (!or)
> + return NULL;
> +
> + osd_req_create_object(or, &obj);
> +
> + return or;
> +}
> +
> +struct osd_request *prepare_osd_remove(struct osd_dev *dev,
> + uint64_t part_id,
> + uint64_t obj_id)
> +{
> + struct osd_obj_id obj = {
> + .partition = part_id,
> + .id = obj_id
> + };
> + struct osd_request *or = osd_start_request(dev, GFP_KERNEL);
> +
> + if (!or)
> + return NULL;
> +
> + osd_req_remove_object(or, &obj);
> +
> + return or;
> +}
> +
> +struct osd_request *prepare_osd_set_attr(struct osd_dev *dev,
> + uint64_t part_id,
> + uint64_t obj_id)
> +{
> + struct osd_obj_id obj = {
> + .partition = part_id,
> + .id = obj_id
> + };
> + struct osd_request *or = osd_start_request(dev, GFP_KERNEL);
> +
> + if (!or)
> + return NULL;
> +
> + osd_req_set_attributes(or, &obj);
> +
> + return or;
> +}
> +
> +struct osd_request *prepare_osd_get_attr(struct osd_dev *dev,
> + uint64_t part_id,
> + uint64_t obj_id)
> +{
> + struct osd_obj_id obj = {
> + .partition = part_id,
> + .id = obj_id
> + };
> + struct osd_request *or = osd_start_request(dev, GFP_KERNEL);
> +
> + if (!or)
> + return NULL;
> +
> + osd_req_get_attributes(or, &obj);
> +
> + return or;
> +}
> +
> +struct osd_request *prepare_osd_read(struct osd_dev *dev,
> + uint64_t part_id,
> + uint64_t obj_id,
> + uint64_t length,
> + uint64_t offset,
> + int cmd_data_use_sg,
> + unsigned char *cmd_data)
> +{
> + struct osd_obj_id obj = {
> + .partition = part_id,
> + .id = obj_id
> + };
> + struct osd_request *or = osd_start_request(dev, GFP_KERNEL);
> + struct request_queue *req_q = dev->scsi_device->request_queue;
> + struct bio *bio;
> +
> + if (!or)
> + return NULL;
> +
> + BUG_ON(cmd_data_use_sg);
> + bio = bio_map_kern(req_q, cmd_data, length, or->alloc_flags);
> + if (!bio) {
> + osd_end_request(or);
> + return NULL;
> + }
> +
> + osd_req_read(or, &obj, bio, offset);
> + EXOFS_DBGMSG("osd_req_read(p=%llX, ob=%llX, l=%llu, of=%llu)\n",
> + _LLU(part_id), _LLU(obj_id), _LLU(length), _LLU(offset));
> + return or;
> +}
> +
> +struct osd_request *prepare_osd_write(struct osd_dev *dev,
> + uint64_t part_id,
> + uint64_t obj_id,
> + uint64_t length,
> + uint64_t offset,
> + int cmd_data_use_sg,
> + const unsigned char *cmd_data)
> +{
> + struct osd_obj_id obj = {
> + .partition = part_id,
> + .id = obj_id
> + };
> + struct osd_request *or = osd_start_request(dev, GFP_KERNEL);
> + struct request_queue *req_q = dev->scsi_device->request_queue;
> + struct bio *bio;
> +
> + if (!or)
> + return NULL;
> +
> + BUG_ON(cmd_data_use_sg);
> + bio = bio_map_kern(req_q, (u8 *)cmd_data, length, or->alloc_flags);
> + if (!bio) {
> + osd_end_request(or);
> + return NULL;
> + }
> +
> + osd_req_write(or, &obj, bio, offset);
> + EXOFS_DBGMSG("osd_req_write(p=%llX, ob=%llX, l=%llu, of=%llu)\n",
> + _LLU(part_id), _LLU(obj_id), _LLU(length), _LLU(offset));
> + return or;
> +}
> +
> +void free_osd_req(struct osd_request *req)
> +{
> + osd_end_request(req);
> +}
On Sun, 2009-01-04 at 17:20 +0200, Boaz Harrosh wrote:
> James Bottomley wrote:
> > On Wed, 2008-12-31 at 17:19 +0200, Boaz Harrosh wrote:
> >> Andrew Morton wrote:
> >>> On Tue, 16 Dec 2008 17:33:48 +0200
> >>> Boaz Harrosh <[email protected]> wrote:
> >>>
> >>>> We need a mechanism to prepare the file system (mkfs).
> >>>> I chose to implement that by means of a couple of
> >>>> mount-options. Because there is no user-mode API for committing
> >>>> OSD commands. And also, all this stuff is highly internal to
> >>>> the file system itself.
> >>>>
> >>>> - Added two mount options mkfs=0/1,format=capacity_in_meg, so mkfs/format
> >>>> can be executed by kernel code just before mount. An mkexofs utility
> >>>> can now be implemented by means of a script that mounts and unmounts the
> >>>> file system with proper options.
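To make the option format concrete, here is a small user-space sketch of parsing a "mkfs=0/1,format=capacity_in_meg" string. The struct and function names are hypothetical and this is not the parser from the patch set; in-kernel code would more likely build on the match_token() helpers from linux/parser.h.

```c
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result of parsing the two mount options. */
struct sketch_mount_opts {
	int mkfs;		/* 1: create the file system before mounting */
	uint64_t format_mb;	/* 0: do not format, else capacity in MiB */
};

/* Parse a comma-separated option string; returns 0 on success, -1 on error. */
int sketch_parse_opts(const char *opts, struct sketch_mount_opts *out)
{
	out->mkfs = 0;
	out->format_mb = 0;

	while (*opts) {
		size_t len = strcspn(opts, ",");	/* length of this token */
		char *end;

		if (len > 5 && strncmp(opts, "mkfs=", 5) == 0 &&
		    isdigit((unsigned char)opts[5])) {
			unsigned long v = strtoul(opts + 5, &end, 10);

			if (end != opts + len || v > 1)
				return -1;
			out->mkfs = (int)v;
		} else if (len > 7 && strncmp(opts, "format=", 7) == 0 &&
			   isdigit((unsigned char)opts[7])) {
			unsigned long long v = strtoull(opts + 7, &end, 10);

			if (end != opts + len)
				return -1;
			out->format_mb = v;
		} else {
			return -1;	/* unknown or malformed option */
		}

		opts += len;
		if (*opts == ',')
			opts++;		/* skip the separator */
	}
	return 0;
}
```

A wrapper script would then mount once with something like "mkfs=1,format=1024" and unmount again, which is the mkexofs flow described above.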
> >>> Doing mkfs in-kernel is unusual. I don't think the above description
> >>> sufficiently helps the uninitiated understand why mkfs cannot be done
> >>> in userspace as usual. Please flesh it out a bit.
> >> There are a few main reasons.
> >> - There is no user-mode API for initiating OSD commands. Such a subsystem
> >> would be a hundredfold bigger than the mkfs code submitted. I think it would be
> >> hard and stupid to maintain a complex user-mode API just for creating
> >> a couple of objects and writing a couple of on disk structures.
> >
> > This is really a reflection of the whole problem with the OSD paradigm.
>
> Certainly not a problem of the OSD paradigm, just maybe a problem
> of the current code boundaries laid out by years of block-devices.
Not having a suggestion for redrawing the boundaries is a problem of the
paradigm. Right at the moment, using OSD is all or nothing: there's
no migration path for block based filesystems, or even a good idea how
they'd take advantage of OSD. Most OSD based filesystems are for
special purpose things (mainly cluster FS).
> > In theory, a filesystem on OSD is a thin layer of metadata mapping
> > objects to files. Get this right and the storage will manage things,
> - objects to files. Get this right and the storage will manage things,
> + files to objects. Get this right and the storage will manage things,
> [objects to files is what some of the osd-targets do.]
> > like security and access and attributes (there's even a natural mapping
> > to the VFS concept of extended attributes). Plus, the storage has
> > enough information to manage persistence, backups and replication.
> >
>
> Sounds perfect to me.
>
> > The real problem is that no-one has actually managed to come up with a
> > useful VFS<->OSD mapping layer (even by extending or altering the VFS).
> > Every filesystem that currently uses OSD has a separate direct OSD
> > speaking interface (i.e. it slices out the block layer to do this and
> > talks directly to the storage).
>
> I'm not sure what you mean.
> Let's take VFS<->BLOCKS mapping for example. Each FS has its own
> interpretation of what that means; is btrfs less perfect than xfs,
> or vice versa?
> I guess you did not mean "mapping" but meant "Interface" or API.
> (or more likely I misunderstand the meaning of "mapping" ;)
No ... by mapping I mean mapping of VFS functions.
For example, an OSD filesystem should be user mountable: if the user has
the security key (could possibly do this in userspace). Additionally,
an OSD with attributes should be pluggable into the VFS layer
sufficiently to allow attribute search, even if the VFS has no idea of
the metadata layout, we can still get objects back. We'd also better be
able to do backup and restore of object based devices.
The basic problem for OSD, at least as I see it, is that unless it can
provide some compelling relevance to current filesystem problems (like
attribute search is 10x faster over OSD vs block or X filesystem gets a
2x performance improvement using OSD vs block ...) it's doomed forever
to be a niche player: nice idea but no relevance to the real world.
> Well that is exactly what I was attempting to submit. A general-purpose
> low-level but easy-to-use, objects API for kernel clients. be it a
> dead-simple exofs, or a complex multi-head beast like a pNFS-Objects
> file system. The same library/API/Interface will be used for NFS-Clients
> NFSD-Servers, reconstruction, security, whatever.
OK ... perhaps I missed the description of how a general purpose
filesystem might use this then?
> The block-layer is not sliced out, Only the elevator function is, since
> BIO merging, if any, are not device global but per-object/file, and the
> elevator does not currently support that. (Profiling shows that it will
> be needed)
Um, your submission path is character. You pick up block again because
SCSI uses it for queues, but it's not really part of your paradigm.
> BTW. The block-based filesystems are just a big minority in Kernel. The
> majority does not use block-layer either.
>
> >
> > I suppose this could be taken to show that such a layer is impossibly
> > complex, as you assert, but its lack is reflected in strange looking
> > design decisions like in-kernel mkfs. It would also mean that there
> > would be very little layered code sharing between ODS based filesystems.
> - would be very little layered code sharing between ODS based filesystems.
> + would be very little layered code sharing between OSD based filesystems.
>
> I disagree.
> All the OSD-Based file systems (In Linux) should absolutely only use the
> open-osd library submitted. I myself will work on a couple. If anything is
> missing that could not be added later, I would like to know about it.
But that's precisely the problem: "OSD based filesystems" implying that
if you want to use OSD you write a new filesystem.
> User-mode Interface is another matter. There are some ideas and some already
> implemented.
> [Hosted on open-osd.org
> see: http://git.open-osd.org/gitweb.cgi?p=osc-osd/.git;a=summary
> look inside the osd-initiator directory]
> And I have a toy interface that adds no new entries into the Kernel in
> the form of an OSDVFS module, that will let you access the raw OSD device
> through the VFS name-space.
OK, so this is moving it more towards general usability.
> The lack of any user-mode API is just the lack of any current need/priority,
> or that I'm the only one working on OSD. But nothing that could not be solved
> in two weeks of pragmatic work. Surely it's not a paradigm problem.
It's an indicator of one. If you buy my premise that OSD cannot be
relevant without compelling user cases, then the lack of a user API can
be viewed as a symptom of this.
> >
> >> - I intend to refactor the code further to make use of more super.c services,
> >> so to make this addition even smaller. Also future direction of raid over
> >> multiple objects will make even more kernel infrastructure needed which
> >> will need even more user-mode code duplication.
> >> - I anticipate problems that are not yet addressed in this body of work
> >> but will be in the future, mainly that a single OSD-target (lun) can
> >> be shared by lots of FSs, and a single FS can span many OSD-targets.
> >> Some central management is much easier to do in Kernel.
> >>
> >>> What are the dependencies for this filesystem code? I assume that it
> >>> depends on various block- and scsi-level patches? Which ones, and
> >>> what is their status, and is this code even compileable without them?
> >>>
> >> This OSD-based file system is dependent on the open-osd initiator library
> >> code that I've submitted for inclusion for 2.6.29. It has been sitting
> >> in linux-next for a while now, and has not been receiving any comments
> >> for the last two updated patchsets I've sent to scsi-misc/lkml. However
> >> it has not yet been submitted into James's scsi-misc git tree, and James
> >> is the ultimate maintainer that should submit this work. I hope it will
> >> still be submitted into 2.6.29, as this code is totally self sufficient
> >> and does not endanger or change any other Kernel subsystems.
> >> (All the needed ground work was already submitted to Linus since 2.6.26)
> >> So why should it not?
> >
> > I don't like it mainly because it's not truly a useful general framework
> > for others to build on. However, as argued above, there might not
> > actually be such a useful framework, so as long as the only two
> > consumers (you and Lustre) want an interface like this, I'll put it in.
> >
>
> Time will tell, but I believe the exact opposite. I believe and strive
> for this OSD body of work to be useful for anybody that needs to talk
> T10-OSD in Linux, for any purpose. Anything missing should be
> easily added.
>
> > James
> >
> >
>
> To summarize the way I see it:
> - James is right in that we can not currently see the full OSD picture since
> we do not have a user-mode API, so the usefulness of it all is unclear.
> [I will send an RFD soon, and hope all interested will chime in on the
> discussion]
> - That said, all the submitted code is still relevant and useful,
> though in a few places it takes the route of pragmatic-easy vs
> long-term-correctness. [Which can be fixed]
> - exofs/OSD is not the first FS that depends on a non-block-dev/its-own
> stack. The lower level (OSD) is represented to kernel as a char-dev +
> Additional API, common to other FS/stack models. Though the lower OSD
> level has the potential to be a generic layer that can be used by lots
> of users and use cases, not only FS type.
Right, so I'm reasonably happy to accept libosd for what it is: an
enabler for a few specialised applications.
I think your choice of using a character device will turn out to be a
design mistake because the migration path of existing filesystems is
bound to be a block device with extra features (which they may or may
not make use of) but only if there's a way to make OSD relevant to
users.
James
James Bottomley wrote:
> On Sun, 2009-01-04 at 17:20 +0200, Boaz Harrosh wrote:
>> James Bottomley wrote:
>>> On Wed, 2008-12-31 at 17:19 +0200, Boaz Harrosh wrote:
>>>> Andrew Morton wrote:
>>>>> On Tue, 16 Dec 2008 17:33:48 +0200
>>>>> Boaz Harrosh <[email protected]> wrote:
>>>>> Doing mkfs in-kernel is unusual. I don't think the above description
>>>>> sufficiently helps the uninitiated understand why mkfs cannot be done
>>>>> in userspace as usual. Please flesh it out a bit.
>>>> There are a few main reasons.
>>>> - There is no user-mode API for initiating OSD commands. Such a subsystem
>>>> would be hundredfold bigger than the mkfs code submitted. I think it would be
>>>> hard and stupid to maintain a complex user-mode API just for creating
>>>> a couple of objects and writing a couple of on disk structures.
>>> This is really a reflection of the whole problem with the OSD paradigm.
>> Certainly not a problem of the OSD paradigm, just maybe a problem
>> of the current code boundaries laid out by years of block-devices.
>
> Not having a suggestion for redrawing the boundaries is a problem of the
> paradigm. Right at the moment using OSD is an all or nothing, there's
> no migration path for block based filesystems, or even a good idea how
> they'd take advantage of OSD. Most OSD based filesystems are for
> special purpose things (mainly cluster FS).
I think you both are talking past each other a bit.
There is no inherent "problem with the paradigm" with regards to
creating a userspace mkfs and userspace filesystem access library.
Yes, it's annoying to maintain two parallel codebases, but from
experience we have found that that is what is best. A userspace library
is used by a wide variety of users: specialized filesystem tools,
filesystem repair tools, filesystem creation and optimization tools,
FUSE implementations, the list goes on.
It has nothing to do with "block-based code boundaries".
History and experience have shown that we want a minimal, purpose-built
filesystem in the kernel, with all the other filesystem tools external
to the kernel. That has proven the most robust over time, IMO (although
noises about in-kernel fsck are beginning to appear)
>>> In theory, a filesystem on OSD is a thin layer of metadata mapping
>>> objects to files. Get this right and the storage will manage things,
>> - objects to files. Get this right and the storage will manage things,
>> + files to objects. Get this right and the storage will manage things,
>> [objects to files is what some of the osd-targets do.]
>>> like security and access and attributes (there's even a natural mapping
>>> to the VFS concept of extended attributes). Plus, the storage has
>>> enough information to manage persistence, backups and replication.
I'm a bit lost in the quoting, but to respond...
One should not make assumptions that an in-kernel OSD filesystem will
simply turn all the "inode-ish" (object manipulation) duties wholesale
to the OSD storage device(s). That is an implementation detail.
To conjure an example, an OSD filesystem designer may wish to store
collections of VFS extended attributes as a single OSD object, for
performance or caching reasons.
Or, as discussed at the filesystem/storage summit I attended, a separate
layer handles replication and OSD device aggregation (read: RAID) just
like MD manages RAID[0156] now.
>>> The real problem is that no-one has actually managed to come up with a
>>> useful VFS<->OSD mapping layer (even by extending or altering the VFS).
>>> Every filesystem that currently uses OSD has a separate direct OSD
>>> speaking interface (i.e. it slices out the block layer to do this and
>>> talks directly to the storage).
>> I'm not sure what you mean.
>> Let's take VFS<->BLOCKS mapping for example. Each FS has its own
>> interpretation of what that means, btrfs is less perfect than xfs
>> or vice versa?
>> I guess you did not mean "mapping" but meant "Interface" or API.
>> (or more likely I misunderstand the meaning of "mapping" ;)
>
> No ... by mapping I mean mapping of VFS functions.
>
> For example, an OSD filesystem should be user mountable: if the user has
I think that's setting the bar too high. It would be nice if an OSD
filesystem were user-mountable, but that obviously is less compatible
with existing tools, admin knowledge, and site policies.
> the security key (could possibly do this in userspace). Additionally,
> an OSD with attributes should be pluggable into the VFS layer
> sufficiently to allow attribute search, even if the VFS has no idea of
> the metadata layout, we can still get objects back. We'd also better be
> able to do backup and restore of object based devices.
Sure. tar/cpio/pax at the userspace level, or exofs-specific
dump+restore tools running in userspace. Just like with other
filesystems :)
> The basic problem for OSD, at least as I see it is that unless it can
> provide some compelling relevance to current filesystem problems (like
> attribute search is 10x faster over OSD vs block or X filesystem gets a
> 2x performance improvement using OSD vs block ...) it's doomed forever
> to be a niche player: nice idea but no relevance to the real world.
Let's get exofs into the kernel, and prove you wrong (or right).
I know you have wonderful anecdotes about how OSD has been around
forever and you consider it a failed paradigm; but new work is occurring,
and people are talking about how this might be the successor to
sector-based devices.
Let's not be closed-minded and close doors before they can be opened.
At this point, OSD is a fun and interesting research experiment that
might have promise for the future.
That's Linux's bread-n-butter: be on the cutting edge, experimenting
with new technologies. Some pan out, others don't.
But I don't see any compelling reason for an overall pushback _against_
OSD devices and filesystems.
>> Well that is exactly what I was attempting to submit. A general-purpose
>> low-level but easy-to-use, objects API for kernel clients. be it a
>> dead-simple exofs, or a complex multi-head beast like a pNFS-Objects
>> file system. The same library/API/Interface will be used for NFS-Clients,
>> NFSD-Servers, reconstruction, security, whatever.
>
> OK ... perhaps I missed the description of how a general purpose
> filesystem might use this then?
>
>> The block-layer is not sliced out, Only the elevator function is, since
>> BIO merging, if any, is not device global but per-object/file, and the
>> elevator does not currently support that. (Profiling shows that it will
>> be needed)
>
> Um, your submission path is character. You pick up block again because
> SCSI uses it for queues, but it's not really part of your paradigm.
>
>> BTW. The block-based filesystems are just a big minority in Kernel. The
>> majority does not use block-layer either.
>>
>>> I suppose this could be taken to show that such a layer is impossibly
>>> complex, as you assert, but its lack is reflected in strange looking
>>> design decisions like in-kernel mkfs. It would also mean that there
>>> would be very little layered code sharing between ODS based filesystems.
>> - would be very little layered code sharing between ODS based filesystems.
>> + would be very little layered code sharing between OSD based filesystems.
>>
>> I disagree.
>> All the OSD-Based file systems (In Linux) should absolutely only use the
>> open-osd library submitted. I myself will work on a couple. If anything is
>> missing that could not be added later, I would like to know about it.
>
> But that's precisely the problem: "OSD based filesystems" implying that
> if you want to use OSD you write a new filesystem.
Are you somehow assuming that existing block-based filesystems will take
advantage of OSD? I hope not; that would be silly.
_Of course_ using OSD implies a new filesystem. You are using a wholly
different method of interacting with storage.
Just like NFS implies a new filesystem, because networked RPC is wholly
different from sector-based storage as well.
> It's an indicator of one. If you buy my premise that OSD cannot be
> relevant without compelling user cases, then the lack of a user API can
> be viewed as a symptom of this.
If having a compelling user case was a prereq for kernel inclusion, well
over half the code would be gone.
> I think your choice of using a character device will turn out to be a
> design mistake because the migration path of existing filesystems is
> bound to be a block device with extra features (which they may or may
> not make use of) but only if there's a way to make OSD relevant to
> users.
It is fantasy to think we will be migrating ext4 to OSD. That fantasy
is not a compelling reason to block OSD development.
To sum,
* exofs needs a userspace library, around which the standard filesystem
tools will be built, most notably mkfs, dump, restore, fsck
* talk of migrating existing filesystems is wildly premature (and a bit
of a silly argument, since you are also arguing that OSD lacks
compelling use cases)
* an in-kernel OSD-based filesystem needs some sort of generic in-kernel
libosd API, so that multiple OSD filesystems do not reinvent the wheel
each time.
* OSD was bound to be annoying, because it forces the kernel filesystem
to either (a) talk SCSI or (b) use messages that can be converted to
SCSI OSD commands, like existing drivers convert the block layer's READ
and WRITE to device-specific commands.
* Trying to force OSD to export a block device is pushing a square peg
through a round hole. Thus, the best (and only) alternative is
character device. What you really want is a Third Way(tm): a mmap'able
message device, since you really want to export an API to userspace.
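As an illustration of option (b) above, here is a minimal C sketch of a
protocol-agnostic object request being packed into a transport-specific
command, much as a block driver turns a generic READ or WRITE into a
device command. Everything in it is hypothetical: `struct obj_request`,
`struct wire_cmd`, `obj_request_to_wire()` and the `WIRE_ACT_*` codes are
made up for this example and are neither the open-osd library API nor
real T10 OSD encodings.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, protocol-agnostic object request (NOT the open-osd API). */
enum obj_op { OBJ_READ, OBJ_WRITE };

struct obj_request {
	enum obj_op op;
	uint64_t partition_id;
	uint64_t object_id;
	uint64_t offset;	/* byte offset inside the object */
	uint64_t length;	/* number of bytes to transfer */
};

/* Hypothetical wire command for one transport; the layout is illustrative. */
struct wire_cmd {
	uint16_t action;	/* placeholder code, not a real T10 value */
	uint8_t payload[32];	/* four big-endian 64-bit fields */
};

#define WIRE_ACT_READ	0x0001	/* placeholder */
#define WIRE_ACT_WRITE	0x0002	/* placeholder */

/* Store a 64-bit value in big-endian byte order. */
static inline void put_be64(uint8_t *p, uint64_t v)
{
	for (int i = 7; i >= 0; i--) {
		p[i] = (uint8_t)(v & 0xff);
		v >>= 8;
	}
}

/* Load a 64-bit value stored in big-endian byte order. */
static inline uint64_t get_be64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/*
 * Translate the generic request into the transport's command format.
 * Returns 0 on success, -1 on a NULL argument or an unknown operation.
 */
int obj_request_to_wire(const struct obj_request *req, struct wire_cmd *cmd)
{
	if (!req || !cmd)
		return -1;

	switch (req->op) {
	case OBJ_READ:
		cmd->action = WIRE_ACT_READ;
		break;
	case OBJ_WRITE:
		cmd->action = WIRE_ACT_WRITE;
		break;
	default:
		return -1;
	}

	put_be64(cmd->payload + 0, req->partition_id);
	put_be64(cmd->payload + 8, req->object_id);
	put_be64(cmd->payload + 16, req->offset);
	put_be64(cmd->payload + 24, req->length);
	return 0;
}
```

The filesystem only ever fills in `struct obj_request`; a different
transport would supply its own translation function without the
filesystem changing.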
Jeff
On Mon, 2009-01-12 at 14:23 -0500, Jeff Garzik wrote:
> > It's an indicator of one. If you buy my premise that OSD cannot be
> > relevant without compelling user cases, then the lack of a user API can
> > be viewed as a symptom of this.
>
> If having a compelling user case was a prereq for kernel inclusion, well
> over half the code would be gone.
I'm not holding this against inclusion ... I'm saying it's a symptom of
the generic relevance to user issues problem that OSD has.
> > I think your choice of using a character device will turn out to be a
> > design mistake because the migration path of existing filesystems is
> > bound to be a block device with extra features (which they may or may
> > not make use of) but only if there's a way to make OSD relevant to
> > users.
>
> It is fantasy to think we will be migrating ext4 to OSD. That fantasy
> is not a compelling reason to block OSD development.
OK, so your quote managed to miss this bit:
"Right, so I'm reasonably happy to accept libosd for what it is: an
enabler for a few specialised applications. "
I can't see how that can be construed as "blocking OSD development".
The word "accept" is conventionally used in Linux parlance to mean "will
send upstream".
> To sum,
>
> * exofs needs a userspace library, around which the standard filesystem
> tools will be built, most notably mkfs, dump, restore, fsck
>
> * talk of migrating existing filesystems is wildly premature (and a bit
> of a silly argument, since you are also arguing that OSD lacks
> compelling use cases)
So criticising lacking compelling use cases while at the same time
suggesting how to find them is wrong?
Actually, if the only use case OSD can bring to the table is requiring
new filesystems, then there's nothing of general user relevance for it
on the horizon ... anywhere. There's never going to be a compelling
reason to move the consumer OSDs in the various development labs to
production because nothing would be able to use them on a mass scale.
If we could derive a benefit from OSD in existing filesystems, then they
do have user relevance, and Seagate and the others might just consider
releasing the devices.
Note that "providing benefit to" does not equate to "rewriting the
filesystem for" ... and it shouldn't; the benefit really should be
incremental. And that's the crux of my criticism. While OSD are
separate things that we have to rewrite whole filesystems for, they're
never going to set the world on fire. If they could be used with only
incremental effort, they might. The bridge for the incremental effort
will come from a properly designed kernel API.
> * an in-kernel OSD-based filesystem needs some sort of generic in-kernel
> libosd API, so that multiple OSD filesystems do not reinvent the wheel
> each time.
>
> * OSD was bound to be annoying, because it forces the kernel filesystem
> to either (a) talk SCSI or (b) use messages that can be converted to
> SCSI OSD commands, like existing drivers convert the block layer's READ
> and WRITE to device-specific commands.
OK, so what you're arguing is that unlike block devices where we can
produce a useful generic abstraction that is protocol agnostic, for OSD
we can't? As I've said before, I think this might be true, but fear it
dooms OSD to being too difficult to use.
> * Trying to force OSD to export a block device is pushing a square peg
> through a round hole. Thus, the best (and only) alternative is
> character device. What you really want is a Third Way(tm): a mmap'able
> message device, since you really want to export an API to userspace.
only allowing a character tap raises the effort bar on getting other
filesystems to use it, because they're all block based ... that's what I
think is the mistake.
James
James Bottomley wrote:
> On Mon, 2009-01-12 at 14:23 -0500, Jeff Garzik wrote:
>>> It's an indicator of one. If you buy my premise that OSD cannot be
>>> relevant without compelling user cases, then the lack of a user API can
>>> be viewed as a symptom of this.
>> If having a compelling user case was a prereq for kernel inclusion, well
>> over half the code would be gone.
>
> I'm not holding this against inclusion ... I'm saying it's a symptom of
> the generic relevance to user issues problem that OSD has.
>
>>> I think your choice of using a character device will turn out to be a
>>> design mistake because the migration path of existing filesystems is
>>> bound to be a block device with extra features (which they may or may
>>> not make use of) but only if there's a way to make OSD relevant to
>>> users.
>> It is fantasy to think we will be migrating ext4 to OSD. That fantasy
>> is not a compelling reason to block OSD development.
>
> OK, so your quote managed to miss this bit:
>
> "Right, so I'm reasonably happy to accept libosd for what it is: an
> enabler for a few specialised applications. "
>
> I can't see how that can be construed as "blocking OSD development".
> The word "accept" is conventionally used in Linux parlance to mean "will
> send upstream".
Yet you continue to expend energy complaining about migrating
block-based filesystems to OSD, a complex, overhead-laden undertaking
_no one_ has proposed or entertained.
>> To sum,
>>
>> * exofs needs a userspace library, around which the standard filesystem
>> tools will be built, most notably mkfs, dump, restore, fsck
>>
>> * talk of migrating existing filesystems is wildly premature (and a bit
>> of a silly argument, since you are also arguing that OSD lacks
>> compelling use cases)
>
> So criticising lacking compelling use cases while at the same time
> suggesting how to find them is wrong?
>
> Actually, If the only use case OSD can bring to the table is requiring
> new filesystems, then there's nothing of general user relevance for it
> on the horizon ... anywhere. There's never going to be a compelling
> reason to move the consumer OSDs in the various development labs to
> production because nothing would be able to use them on a mass scale.
> If we could derive a benefit from OSD in existing filesystems, then they
> do have user relevance, and Seagate and the others might just consider
> releasing the devices.
If Seagate were to release a production OSD device, do you really think
they would prefer a block-based filesystem hacked to work with OSDs? I
don't think so.
Existing block filesystems are very much purpose built for sector-based
storage as implemented on modern storage devices. No kernel API can
hand-wave that away.
The whole point of OSDs is to move some of the overhead to the storage
device, not _add_ to the overhead.
> Note that "providing benefit to" does not equate to "rewriting the
> filesystem for" ... and it shouldn't; the benefit really should be
> incremental. And that's the crux of my criticism. While OSD are
> separate things that we have to rewrite whole filesystems for, they're
> never going to set the world on fire. If they could be used with only
> incremental effort, they might. The bridge for the incremental effort
> will come from a properly designed kernel API.
Well, hey, if you wanna expend energy creating a kernel API that
presents a complex OSD as simple block-based storage, go for it. AFAICS
it's just extra overhead and complexity when a new filesystem could do
the job much better.
And I seriously doubt Linus or anyone else will want to hack up a
block-based filesystem in this manner. Better to create a silly "for
argument's sake" OSD block device, upon which any block-based filesystem
can be mounted. (Note I said block device, _not_ filesystem)
>> * an in-kernel OSD-based filesystem needs some sort of generic in-kernel
>> libosd API, so that multiple OSD filesystems do not reinvent the wheel
>> each time.
>>
>> * OSD was bound to be annoying, because it forces the kernel filesystem
>> to either (a) talk SCSI or (b) use messages that can be converted to
>> SCSI OSD commands, like existing drivers convert the block layer's READ
>> and WRITE to device-specific commands.
>
> OK, so what you're arguing is that unlike block devices where we can
> produce a useful generic abstraction that is protocol agnostic, for OSD
> we can't? As I've said before, I think this might be true, but fear it
> dooms OSD to being too difficult to use.
No, a generic abstraction is "(b)" in my quoted paragraph.
But it's certainly easy to create an OSD block device client, that
simulates sector-based storage, if you are motivated in that direction.
But that only makes sense if you want the extra overhead (square peg,
round hole), which no sane person will want. Face it, only screwballs
want to mount ext4 on an OSD.
>> * Trying to force OSD to export a block device is pushing a square peg
>> through a round hole. Thus, the best (and only) alternative is
>> character device. What you really want is a Third Way(tm): a mmap'able
>> message device, since you really want to export an API to userspace.
>
> only allowing a character tap raises the effort bar on getting other
> filesystems to use it, because they're all block based ...
That's irrelevant, since no one is calling for block-based filesystems
to be converted to use OSD.
And I can only imagine the push-back, should someone actually propose
doing so. Filesystems are very much purpose-built for their storage
paradigm.
Jeff
James Bottomley wrote:
> Um, your submission path is character. You pick up block again because
> SCSI uses it for queues, but it's not really part of your paradigm.
> I think your choice of using a character device will turn out to be a
> design mistake because the migration path of existing filesystems is
> bound to be a block device with extra features (which they may or may
> not make use of) but only if there's a way to make OSD relevant to
> users.
We mount character devices already when it's appropriate.
Look at JFFS, JFFS2, UBIFS and LOGFS. All of them operate on MTD
devices, which are character device interfaces to flash storage, using
the common MTD interface instead of the block layer.
This is quite correct, because block devices have specific
characteristics (generic block caching and ability to read/write each
block independently) which neither flash nor OSDs have.
Imho, OSDs are similar to flash in this respect. There is no
fixed-size block/sector indexed storage device, therefore a block
device would be wrong.
Admittedly lumping everything else under "character" is daft, when you
can't read and write character streams to the device, but that's unix
for you. Character device used to mean serial ports etc. until it
became "any old crap that's not a block device". :-)
-- Jamie
On Mon, 2009-01-12 at 15:22 -0500, Jeff Garzik wrote:
> James Bottomley wrote:
> > On Mon, 2009-01-12 at 14:23 -0500, Jeff Garzik wrote:
> >>> It's an indicator of one. If you buy my premise that OSD cannot be
> >>> relevant without compelling user cases, then the lack of a user API can
> >>> be viewed as a symptom of this.
> >> If having a compelling user case was a prereq for kernel inclusion, well
> >> over half the code would be gone.
> >
> > I'm not holding this against inclusion ... I'm saying it's a symptom of
> > the generic relevance to user issues problem that OSD has.
> >
> >>> I think your choice of using a character device will turn out to be a
> >>> design mistake because the migration path of existing filesystems is
> >>> bound to be a block device with extra features (which they may or may
> >>> not make use of) but only if there's a way to make OSD relevant to
> >>> users.
> >> It is fantasy to think we will be migrating ext4 to OSD. That fantasy
> >> is not a compelling reason to block OSD development.
> >
> > OK, so your quote managed to miss this bit:
> >
> > "Right, so I'm reasonably happy to accept libosd for what it is: an
> > enabler for a few specialised applications. "
> >
> > I can't see how that can be construed as "blocking OSD development".
> > The word "accept" is conventionally used in Linux parlance to mean "will
> > send upstream".
>
> Yet you continue to expend energy complaining about migrating
> block-based filesystems to OSD, a complex, overhead-laden undertaking
> _no one_ has proposed or entertained.
You're the one who keeps suggesting migration, not me. I keep
suggesting ways to make OSD more relevant to current user problems.
A maintainer doesn't have to like everything they merge.
> >> To sum,
> >>
> >> * exofs needs a userspace library, around which the standard filesystem
> >> tools will be built, most notably mkfs, dump, restore, fsck
> >>
> >> * talk of migrating existing filesystems is wildly premature (and a bit
> >> of a silly argument, since you are also arguing that OSD lacks
> >> compelling use cases)
> >
> > So criticising lacking compelling use cases while at the same time
> > suggesting how to find them is wrong?
> >
> > Actually, If the only use case OSD can bring to the table is requiring
> > new filesystems, then there's nothing of general user relevance for it
> > on the horizon ... anywhere. There's never going to be a compelling
> > reason to move the consumer OSDs in the various development labs to
> > production because nothing would be able to use them on a mass scale.
>
> > If we could derive a benefit from OSD in existing filesystems, then they
> > do have user relevance, and Seagate and the others might just consider
> > releasing the devices.
>
> If Seagate were to release a production OSD device, do you really think
> they would prefer a block-based filesystem hacked to work with OSDs? I
> don't think so.
Um, speaking with my business hat on, I'd really beg to differ ... you
don't release a product into an empty market. You pick an existing one,
or fill a fundamental need that a market nucleates around. If that
means block based filesystems hacked to work with OSDs, I think they'd
take it, yes.
> Existing block filesystems are very much purpose built for sector-based
> storage as implemented on modern storage devices. No kernel API can
> hand-wave that away.
>
> The whole point of OSDs is to move some of the overhead to the storage
> device, not _add_ to the overhead.
Well, that was the idea, with OSD version 1. The problem is that the
benchmarks didn't confirm that letting the disk take care of object
placement was a win over block based filesystems. If you want to
migrate objects across disks (i.e. cfs paradigm), then it is a win, but
not really for performance. That's why OSDv2 has been beefing up
attributes and security.
The interesting question is what does it take to allow arbitrary
filesystems to benefit from this.
> > Note that "providing benefit to" does not equate to "rewriting the
> > filesystem for" ... and it shouldn't; the benefit really should be
> > incremental. And that's the crux of my criticism. While OSD are
> > separate things that we have to rewrite whole filesystems for, they're
> > never going to set the world on fire. If they could be used with only
> > incremental effort, they might. The bridge for the incremental effort
> > will come from a properly designed kernel API.
>
> Well, hey, if you wanna expend energy creating a kernel API that
> presents a complex OSD as simple block-based storage, go for it. AFAICS
> it's just extra overhead and complexity when a new filesystem could do
> the job much better.
Because writing a new filesystem is so much easier?
> And I seriously doubt Linus or anyone else will want to hack up a
> block-based filesystem in this manner. Better to create a silly "for
> argument's sake" OSD block device, upon which any block-based filesystem
> can be mounted. (Note I said block device, _not_ filesystem)
That's a possibility ... as I said before: a block device with extra
features that allows incremental use in the filesystem.
> >> * an in-kernel OSD-based filesystem needs some sort of generic in-kernel
> >> libosd API, so that multiple OSD filesystems do not reinvent the wheel
> >> each time.
> >>
> >> * OSD was bound to be annoying, because it forces the kernel filesystem
> >> to either (a) talk SCSI or (b) use messages that can be converted to
> >> SCSI OSD commands, like existing drivers convert the block layer's READ
> >> and WRITE to device-specific commands.
> >
> > OK, so what you're arguing is that unlike block devices where we can
> > produce a useful generic abstraction that is protocol agnostic, for OSD
> > we can't? As I've said before, I think this might be true, but fear it
> > dooms OSD to being too difficult to use.
>
> No, a generic abstraction is "(b)" in my quoted paragraph.
>
> But it's certainly easy to create an OSD block device client, that
> simulates sector-based storage, if you are motivated in that direction.
>
> But that only makes sense if you want the extra overhead (square peg,
> round hole), which no sane person will want. Face it, only screwballs
> want to mount ext4 on an OSD.
So what's your proposal for lowering the barrier to adoption then?
> >> * Trying to force OSD to export a block device is pushing a square peg
> >> through a round hole. Thus, the best (and only) alternative is
> >> character device. What you really want is a Third Way(tm): a mmap'able
> >> message device, since you really want to export an API to userspace.
> >
> > only allowing a character tap raises the effort bar on getting other
> > filesystems to use it, because they're all block based ...
>
> That's irrelevant, since no one is calling for block-based filesystems
> to be converted to use OSD.
It's relevant to lowering the barrier to adoption, unless there's some
other means I haven't seen.
> And I can only imagine the push-back, should someone actually propose
> doing so. Filesystems are very much purpose-built for their storage
> paradigm.
Filesystems are complex and difficult beasts to get right. Btrfs took a
year to get to the point of kernel inclusion and will take some little
time longer to get enterprises to the point of trusting data to it. So
if we say a two year lead time, that would mean that even if someone
started a general purpose OSD based filesystem today, it wouldn't be
ready for the consumer market until 2011. That's not really going to
convince the disk vendors that OSD based devices should be marketed
today.
James
On Jan. 13, 2009, 1:25 +0200, James Bottomley <[email protected]> wrote:
> On Mon, 2009-01-12 at 15:22 -0500, Jeff Garzik wrote:
>> James Bottomley wrote:
>>> On Mon, 2009-01-12 at 14:23 -0500, Jeff Garzik wrote:
>>>>> It's an indicator of one. If you buy my premise that OSD cannot be
>>>>> relevant without compelling user cases, then the lack of a user API can
>>>>> be viewed as a symptom of this.
>>>> If having a compelling user case was a prereq for kernel inclusion, well
>>>> over half the code would be gone.
>>> I'm not holding this against inclusion ... I'm saying it's a symptom of
>>> the generic relevance to user issues problem that OSD has.
>>>
>>>>> I think your choice of using a character device will turn out to be a
>>>>> design mistake because the migration path of existing filesystems is
>>>>> bound to be a block device with extra features (which they may or may
>>>>> not make use of) but only if there's a way to make OSD relevant to
>>>>> users.
>>>> It is fantasy to think we will be migrating ext4 to OSD. That fantasy
>>>> is not a compelling reason to block OSD development.
>>> OK, so your quote managed to miss this bit:
>>>
>>> "Right, so I'm reasonably happy to accept libosd for what it is: an
>>> enabler for a few specialised applications. "
>>>
>>> I can't see how that can be construed as "blocking OSD development".
>>> The word "accept" is conventionally used in Linux parlance to mean "will
>>> send upstream".
>> Yet you continue to expend energy complaining about migrating
>> block-based filesystems to OSD, a complex, overhead-laden undertaking
>> _no one_ has proposed or entertained.
>
> You're the one who keeps suggesting migration, not me. I keep
> suggesting ways to make OSD more relevant to current user problems.
>
> A maintainer doesn't have to like everything they merge.
>
>>>> To sum,
>>>>
>>>> * exofs needs a userspace library, around which the standard filesystem
>>>> tools will be built, most notably mkfs, dump, restore, fsck
>>>>
>>>> * talk of migrating existing filesystems is wildly premature (and a bit
>>>> of a silly argument, since you are also arguing that OSD lacks
>>>> compelling use cases)
>>> So criticising lacking compelling use cases while at the same time
>>> suggesting how to find them is wrong?
>>>
>>> Actually, If the only use case OSD can bring to the table is requiring
>>> new filesystems, then there's nothing of general user relevance for it
>>> on the horizon ... anywhere. There's never going to be a compelling
>>> reason to move the consumer OSDs in the various development labs to
>>> production because nothing would be able to use them on a mass scale.
>>> If we could derive a benefit from OSD in existing filesystems, then they
>>> do have user relevance, and Seagate and the others might just consider
>>> releasing the devices.
>> If Seagate were to release a production OSD device, do you really think
>> they would prefer a block-based filesystem hacked to work with OSDs? I
>> don't think so.
>
> Um, speaking with my business hat on, I'd really beg to differ ... you
> don't release a product into an empty market. You pick an existing one,
> or fill a fundamental need that a market nucleates around. If that
> means block based filesystems hacked to work with OSDs, I think they'd
> take it, yes.
>
>> Existing block filesystems are very much purpose built for sector-based
>> storage as implemented on modern storage devices. No kernel API can
>> hand-wave that away.
>>
>> The whole point of OSDs is to move some of the overhead to the storage
>> device, not _add_ to the overhead.
>
> Well, that was the idea, with OSD version 1. The problem is that the
> benchmarks didn't confirm that letting the disk take care of object
> placement was a win over block based filesystems. If you want to
> migrate objects across disks (i.e. cfs paradigm), then it is a win, but
> not really for performance. That's why OSDv2 has been beefing up
> attributes and security.
IMO the main advantage of moving block allocation down to the OSD target
is more apparent with distributed file systems a-la pNFS over objects,
where parallelizing that task is key to scalable performance.
The thing is that the target needs to implement its own mapping from
object logical offsets into disk blocks and this is usually done
using some kind of a (possibly trimmed down) local file system.
Therefore the I/O performance of a single OSD is likely to be similar
to a single file server's. I'm not sure what the case will be when comparing
an OSD with a local file system mounted over a block device over
a storage network, e.g. FC or iSCSI - that could be an interesting
research topic. I guess that the main issue there is to cache enough
metadata on the host to minimize transfer latencies (assuming
latency of a directly attached device is always better than
a fabric-attached one).
Anyhow, capacity management via partitions and object allocation,
plus quotas and the fine-grained OSD security model, is a big area
that's worth investigating, to say the least.
>
> The interesting question is what does it take to allow arbitrary
> filesystems to benefit from this.
One direction is to mount the file system over an object or a set
of objects exported via exofs using mount -o loop.
And the user doesn't necessarily have to be a filesystem. It could
be a database too... or anything that typically works
over a block device.
Benny
>
>>> Note that "providing benefit to" does not equate to "rewriting the
>>> filesystem for" ... and it shouldn't; the benefit really should be
>>> incremental. And that's the crux of my criticism. While OSD are
>>> separate things that we have to rewrite whole filesystems for, they're
>>> never going to set the world on fire. If they could be used with only
>>> incremental effort, they might. The bridge for the incremental effort
>>> will come from a properly designed kernel API.
>> Well, hey, if you wanna expend energy creating a kernel API that
>> presents a complex OSD as simple block-based storage, go for it. AFAICS
>> it's just extra overhead and complexity when a new filesystem could do
>> the job much better.
>
> Because writing a new filesystem is so much easier?
>
>> And I seriously doubt Linus or anyone else will want to hack up a
>> block-based filesystem in this manner. Better to create a silly "for
>> argument's sake" OSD block device, upon which any block-based filesystem
>> can be mounted. (Note I said block device, _not_ filesystem)
>
> That's a possibility ... as I said before: a block device with extra
> features that allows incremental use in the filesystem.
I can understand representing a single object as a block device (although I
think that using a file for that should be good enough and easier) but
why represent the whole OSD as a block device? The OSD holds partitions
and objects, each with attributes and OSD security related support. Hence
representing that in a namespace using a filesystem seems straightforward.
Benny
>
>>>> * an in-kernel OSD-based filesystem needs some sort of generic in-kernel
>>>> libosd API, so that multiple OSD filesystems do not reinvent the wheel
>>>> each time.
>>>>
>>>> * OSD was bound to be annoying, because it forces the kernel filesystem
>>>> to either (a) talk SCSI or (b) use messages that can be converted to
>>>> SCSI OSD commands, like existing drivers convert the block layer's READ
>>>> and WRITE to device-specific commands.
>>> OK, so what you're arguing is that unlike block devices where we can
>>> produce a useful generic abstraction that is protocol agnostic, for OSD
>>> we can't? As I've said before, I think this might be true, but fear it
>>> dooms OSD to being too difficult to use.
>> No, a generic abstraction is "(b)" in my quoted paragraph.
>>
>> But it's certainly easy to create an OSD block device client, that
>> simulates sector-based storage, if you are motivated in that direction.
>>
>> But that only makes sense if you want the extra overhead (square peg,
>> round hole), which no sane person will want. Face it, only screwballs
>> want to mount ext4 on an OSD.
>
> So what's your proposal for lowering the barrier to adoption then?
>
>>>> * Trying to force OSD to export a block device is pushing a square peg
>>>> through a round hole. Thus, the best (and only) alternative is
>>>> character device. What you really want is a Third Way(tm): a mmap'able
>>>> message device, since you really want to export an API to userspace.
>>> only allowing a character tap raises the effort bar on getting other
>>> filesystems to use it, because they're all block based ...
>> That's irrelevant, since no one is calling for block-based filesystems
>> to be converted to use OSD.
>
> It's relevant to lowering the barrier to adoption, unless there's some
> other means I haven't seen.
>
>> And I can only imagine the push-back, should someone actually propose
>> doing so. Filesystems are very much purpose-built for their storage
>> paradigm.
>
> Filesystems are complex and difficult beasts to get right. Btrfs took a
> year to get to the point of kernel inclusion and will take some little
> time longer to get enterprises to the point of trusting data to it. So
> if we say a two year lead time, that would mean that even if someone
> started a general purpose OSD based filesystem today, it wouldn't be
> ready for the consumer market until 2011. That's not really going to
> convince the disk vendors that OSD based devices should be marketed
> today.
>
> James
>
>
> _______________________________________________
> osd-dev mailing list
> [email protected]
> http://mailman.open-osd.org/mailman/listinfo/osd-dev
Benny Halevy wrote:
> IMO the main advantage of moving block allocation down to the OSD target
> is more apparent with distributed file systems a-la pNFS over objects
> where paralleling that task is a key for scalable performance.
>
> The thing is that the target needs to implement its own mapping from
> object logical offsets into disk blocks and this is usually done
> using some kind of a (possibly trimmed down) local file system.
> Therefore the I/O performance of a single OSD is likely to be similar
> to a single file server's.
Well, modern SATA devices are already mini-filesystems internally, when
you consider logical block remapping etc.
And the claim by drive research guys at the filesystem/storage summit
was that OSD offered the potential to better optimize storage based on
access/usage patterns.
(of course, whether or not reality bears out this guess is another question)
> I can understand representing a single object as a block device (although I
> think that using a file for that should be good enough and easier) but
> why representing the whole OSD as a block device? The OSD holds partitions
> and objects each with attributes and OSD security related support. Hence
> representing that in a namespace using a filesystem seems straight forward.
I am actually considering writing a simple "osdblk" driver, that would
represent a single object as a block device.
This would NOT replace exofs or other OSD filesystems, but it would be
nice to have, and it will give me more experience with OSDs.
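
A minimal sketch of the address translation such a driver would need,
assuming a 512-byte logical sector and a single backing object of known
size. The names below (sector_to_extent, struct obj_extent,
SKETCH_SECTOR_SIZE) are hypothetical and are not part of libosd or of any
posted osdblk code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SECTOR_SIZE 512u   /* assumed logical sector size */

/* Byte range inside one OSD object (hypothetical type, not libosd). */
struct obj_extent {
	uint64_t offset;   /* byte offset within the object */
	uint64_t length;   /* number of bytes */
};

/*
 * Translate a block request (start sector, sector count) into a byte
 * range of the backing object. Returns 0 on success, -1 if the request
 * would run past the last whole sector of an object of obj_size bytes.
 */
static int sector_to_extent(uint64_t sector, uint32_t nsectors,
			    uint64_t obj_size, struct obj_extent *out)
{
	uint64_t capacity = obj_size / SKETCH_SECTOR_SIZE; /* whole sectors */

	assert(out != NULL);
	if (sector > capacity || nsectors > capacity - sector)
		return -1;
	out->offset = sector * SKETCH_SECTOR_SIZE;
	out->length = (uint64_t)nsectors * SKETCH_SECTOR_SIZE;
	return 0;
}
```

The resulting (offset, length) pair is what would be handed to an OSD READ
or WRITE on that object; everything else (queueing, completion) is left out.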
Jeff
On Jan. 13, 2009, 15:24 +0200, Jeff Garzik <[email protected]> wrote:
> Benny Halevy wrote:
>> IMO the main advantage of moving block allocation down to the OSD target
>> is more apparent with distributed file systems a-la pNFS over objects
>> where paralleling that task is a key for scalable performance.
>>
>> The thing is that the target needs to implement its own mapping from
>> object logical offsets into disk blocks and this is usually done
>> using some kind of a (possibly trimmed down) local file system.
>> Therefore the I/O performance of a single OSD is likely to be similar
>> to a single file server's.
>
> Well, modern SATA devices are already mini-filesystems internally, when
> you consider logical block remapping etc.
>
> And the claim by drive research guys at the filesystem/storage summit
> was that OSD offered the potential to better optimize storage based on
> access/usage patterns.
>
> (of course, whether or not reality bears out this guess is another question)
That's true for multi-user access, where knowing the context for each I/O
request - i.e. the object that holds it - provides a crucial hint for
read-ahead and write allocation, whereas for a dumb device that doesn't
know anything about the filesystem's internals, it's much harder to
associate different blocks with their respective containers, or "streams"
(in case the container is typically accessed in a sequential pattern).
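
To illustrate the hint, here is a rough sketch of how a target could detect
sequential streams once every request carries its object ID. It is only an
illustration under simple assumptions (a small direct-mapped table, exact
offset continuation); STREAM_SLOTS, struct stream_state and note_read are
made-up names, not taken from any real OSD target:

```c
#include <stdbool.h>
#include <stdint.h>

#define STREAM_SLOTS 64   /* assumed size of the tracking table */

struct stream_state {
	uint64_t obj_id;     /* object this slot currently tracks */
	uint64_t next_off;   /* offset right after the last request */
	bool     in_use;
};

static struct stream_state streams[STREAM_SLOTS];

/*
 * Record a read of [off, off + len) on obj_id. Returns true when the read
 * continues exactly where the previous read of the same object ended,
 * i.e. when read-ahead for that object looks worthwhile.
 */
static bool note_read(uint64_t obj_id, uint64_t off, uint64_t len)
{
	struct stream_state *s = &streams[obj_id % STREAM_SLOTS];
	bool sequential = s->in_use && s->obj_id == obj_id &&
			  s->next_off == off;

	s->obj_id = obj_id;
	s->next_off = off + len;
	s->in_use = true;
	return sequential;
}
```

Two clients reading two objects in an interleaved fashion are still seen as
two sequential streams here, which is the information a sector-only device
does not get.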
>
>
>> I can understand representing a single object as a block device (although I
>> think that using a file for that should be good enough and easier) but
>> why representing the whole OSD as a block device? The OSD holds partitions
>> and objects each with attributes and OSD security related support. Hence
>> representing that in a namespace using a filesystem seems straight forward.
>
> I am actually considering writing a simple "osdblk" driver, that would
> represent a single object as a block device.
>
> This would NOT replace exofs or other OSD filesystems, but it would be
> nice to have, and it will give me more experience with OSDs.
That's awesome!
It would be really interesting to benchmark one against the other.
Benny
>
> Jeff
>
>
James Bottomley wrote:
> On Mon, 2009-01-12 at 15:22 -0500, Jeff Garzik wrote:
>> If Seagate were to release a production OSD device, do you really think
>> they would prefer a block-based filesystem hacked to work with OSDs? I
>> don't think so.
>
> Um, speaking with my business hat on, I'd really beg to differ ... you
> don't release a product into an empty market. you pick an existing one,
> or fill a fundamental need that a market nucleates around. If that
> means block based filesystems hacked to work with OSDs, I think they'd
> take it, yes.
It seems unlikely drive manufacturers would get excited about a
sub-optimal solution that does not even approach using the full
potential of the product.
Plus, given the existence of an OSD-specific filesystem (exofs, at the
very least), it seems unlikely that end users who own OSDs would choose
the sub-optimal solution when an OSD-specific filesystem exists.
>>> Note that "providing benefit to" does not equate to "rewriting the
>>> filesystem for" ... and it shouldn't; the benefit really should be
>>> incremental. And that's the crux of my criticism. While OSD are
>>> separate things that we have to rewrite whole filesystems for, they're
>>> never going to set the world on fire. If they could be used with only
>>> incremental effort, they might. The bridge for the incremental effort
>>> will come from a properly designed kernel API.
>> Well, hey, if you wanna expend energy creating a kernel API that
>> presents a complex OSD as simple block-based storage, go for it. AFAICS
>> it's just extra overhead and complexity when a new filesystem could do
>> the job much better.
>
> Because writing a new filesystem is so much easier?
Yes, easier -- both technically and politically -- than hacking XFS or
ext4 to support two vastly different storage APIs (linear sector or
object-based).
It might be a tad easier to hack btrfs to do objects.
>>>> * an in-kernel OSD-based filesystem needs some sort of generic in-kernel
>>>> libosd API, so that multiple OSD filesystems do not reinvent the wheel
>>>> each time.
>>>>
>>>> * OSD was bound to be annoying, because it forces the kernel filesystem
>>>> to either (a) talk SCSI or (b) use messages that can be converted to
>>>> SCSI OSD commands, like existing drivers convert the block layer's READ
>>>> and WRITE to device-specific commands.
>>> OK, so what you're arguing is that unlike block devices where we can
>>> produce a useful generic abstraction that is protocol agnostic, for OSD
>>> we can't? As I've said before, I think this might be true, but fear it
>>> dooms OSD to being too difficult to use.
>> No, a generic abstraction is "(b)" in my quoted paragraph.
>>
>> But it's certainly easy to create an OSD block device client, that
>> simulates sector-based storage, if you are motivated in that direction.
>>
>> But that only makes sense if you want the extra overhead (square peg,
>> round hole), which no sane person will want. Face it, only screwballs
>> want to mount ext4 on an OSD.
>
> So what's your proposal for lowering the barrier to adoption then?
Once exofs is in upstream, installers can easily choose that when an OSD
device is detected.
> Filesystems are complex and difficult beasts to get right. Btrfs took a
> year to get to the point of kernel inclusion and will take some little
> time longer to get enterprises to the point of trusting data to it. So
> if we say a two year lead time, that would mean that even if someone
> started a general purpose OSD based filesystem today, it wouldn't be
> ready for the consumer market until 2011. That's not really going to
> convince the disk vendors that OSD based devices should be marketed
> today.
And you have a similar sales job and lag time, when hacking -- read
destabilizing -- a filesystem to work with OSDs as well as sector-based
devices.
Jeff
> > +#define EXOFS_SUPER_ID 0x10000 /* object ID for on-disk superblock */
And if an OS failure breaks the super block and you have only one, how do
you recover it?
> > +#define EXOFS_BM_ID 0x10001 /* object ID for ID bitmap */
> > +#define EXOFS_ROOT_ID 0x10002 /* object ID for root directory */
> > +#define EXOFS_TEST_ID 0x10003 /* object ID for test object */
Ditto some of the others
> > + EXOFS_UINT64_MAX = (~0LL),
> > + EXOFS_MAX_INO_ID = (sizeof(ino_t) * 8 == 64) ? EXOFS_UINT64_MAX :
> > + (1LL << (sizeof(ino_t) * 8 - 1)),
Ok so that's quite a big number
> > + uint32_t s_nextid; /* Highest object ID used */
but that is a smaller one
> > + uint32_t s_numfiles; /* Number of files on fs */
as is this
> > + uint32_t i_atime; /* Access time */
> > + uint32_t i_ctime; /* Creation time */
> > + uint32_t i_mtime; /* Modification time */
2038 ? - bits are cheap
> It seems unlikely drive manufacturers would get excited about a
> sub-optimal solution that does not even approach using the full
> potential of the product.
You forgot the more important people
Mr Customer, would you like your data centre to use a new magic OSD fs or
the existing one you trust.
Now in my experience that is a *dumb* question because the answer is
obvious...
> Plus, given the existence of an OSD-specific filesystem (exofs, at the
> very least), it seems unlikely that end users who own OSDs would choose
> the sub-optimal solution when an OSD-specific filesystem exists.
Actually until you can show zillions of users stably using them the
people with the money won't buy them in the first place 8)
> > ready for the consumer market until 2011. That's not really going to
> > convince the disk vendors that OSD based devices should be marketed
> > today.
>
> And you have a similar sales job and lag time, when hacking -- read
> destabilizing -- a filesystem to work with OSDs as well as sector-based
> devices.
2011 sounds optimistic for major OSD adoption in any space except for
flash storage where OSD type knowledge means you can do much better jobs
on erase management.
Alan Cox wrote:
>> It seems unlikely drive manufacturers would get excited about a
>> sub-optimal solution that does not even approach using the full
>> potential of the product.
>
> You forgot the more important people
>
> Mr Customer, would you like your data centre to use a new magic OSD fs or
> the existing one you trust.
>
> Now in my experience that is a *dumb* question because the answer is
> obvious...
The choice is between "new magic OSD fs" and "new fs that used to be
ext4, before we hacked it up".
"existing one you trust" is not an option...
>> Plus, given the existence of an OSD-specific filesystem (exofs, at the
>> very least), it seems unlikely that end users who own OSDs would choose
>> the sub-optimal solution when an OSD-specific filesystem exists.
>
> Actually until you can show zillions of users stably using them the
> people with the money won't buy them in the first place 8)
Yeah, at this point the discussion devolves into talk of carts, horses,
chickens and eggs... :)
>>> ready for the consumer market until 2011. That's not really going to
>>> convince the disk vendors that OSD based devices should be marketed
>>> today.
>> And you have a similar sales job and lag time, when hacking -- read
>> destabilizing -- a filesystem to work with OSDs as well as sector-based
>> devices.
>
> 2011 sounds optimistic for major OSD adoption in any space except for
> flash storage where OSD type knowledge means you can do much better jobs
> on erase management.
His number, not mine...
At this point OSD is a fun and interesting research project.
Overall, I think Linux should have OSD support so that we are ready for
whatever the future brings. Even if OSD goes nowhere, it will still
have more users than many of the existing Linux drivers and architectures :)
Jeff
Alan Cox wrote:
>>> +#define EXOFS_SUPER_ID 0x10000 /* object ID for on-disk superblock */
>
> And if an OS failure breaks the super block and you have only one how do
> you recover it ?
There is really nothing in this object but the "next_id", which is recoverable
by an fsck utility (not yet submitted) with a simple osd-list-partition
command. Same for num-of-files. All these values are just cached values,
for convenience.
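
A rough sketch of that recovery, assuming the fsck utility already holds the
object IDs returned by listing the partition. The listing call itself is not
shown, and treating every object with an ID at or above EXOFS_ROOT_ID as a
file is an assumption of this sketch, not something the patch states:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXOFS_ROOT_ID 0x10002ULL   /* root directory ID from the patch */

/* Values the superblock caches (hypothetical in-memory form). */
struct sb_counters {
	uint64_t next_id;    /* highest object ID in use */
	uint64_t numfiles;   /* objects at or above the root ID (assumption) */
};

/* Rebuild both cached values from a listing of the partition's objects. */
static struct sb_counters rebuild_counters(const uint64_t *ids, size_t n)
{
	struct sb_counters c = { 0, 0 };

	assert(ids != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (ids[i] > c.next_id)
			c.next_id = ids[i];
		if (ids[i] >= EXOFS_ROOT_ID)
			c.numfiles++;
	}
	return c;
}
```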
>
>>> +#define EXOFS_BM_ID 0x10001 /* object ID for ID bitmap */
Not used, will be dropped.
>>> +#define EXOFS_ROOT_ID 0x10002 /* object ID for root directory */
OK. Only one, but so are all other directories. I'll think about it.
I'll probably postpone it and do it together with the RAID management.
>>> +#define EXOFS_TEST_ID 0x10003 /* object ID for test object */
Not used, will be dropped.
>
> Ditto some of the others
>
>>> + EXOFS_UINT64_MAX = (~0LL),
>>> + EXOFS_MAX_INO_ID = (sizeof(ino_t) * 8 == 64) ? EXOFS_UINT64_MAX :
>>> + (1LL << (sizeof(ino_t) * 8 - 1)),
>
> Ok so thats quite a big number
>
>>> + uint32_t s_nextid; /* Highest object ID used */
I fixed all that to be __le64, you can see that on the web here:
http://git.open-osd.org/gitweb.cgi?p=open-osd.git;a=shortlog;h=refs/heads/exofs
I will submit another round ASAP. I'm currently busy with the user-mode library.
Once done I'll post the next exofs round. All these and lots of other areas were
converted to proper __leXX types. Especially the directory code had all these
missing. It was all triggered thanks to Morton, who pointed me to all these ext2
bug-fixes since 2.6.10.
>
> but that is a smaller one
>
>>> + uint32_t s_numfiles; /* Number of files on fs */
>
> as is this
Yes, also fixed to __le64
>
>>> + uint32_t i_atime; /* Access time */
>>> + uint32_t i_ctime; /* Creation time */
>>> + uint32_t i_mtime; /* Modification time */
>
> 2038 ? - bits are cheap
>
OK, Avisi got lazy. This is a copy from ext2. I will fix that,
thanks for pointing this out. I will put __le64 for seconds
and also add __le64 for nanoseconds while at it.
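
As a userspace illustration only (in the kernel this would be __le64 fields
with cpu_to_le64()/le64_to_cpu()), a sketch of a 64-bit little-endian
on-disk timestamp. struct sketch_disk_time, put_le64 and get_le64 are
hypothetical names and this is not the actual exofs layout:

```c
#include <stdint.h>

/* Store v at p in little-endian byte order, whatever the host order. */
static void put_le64(uint8_t *p, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		p[i] = (uint8_t)(v >> (8 * i));
}

/* Read a little-endian 64-bit value back from p. */
static uint64_t get_le64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v |= (uint64_t)p[i] << (8 * i);
	return v;
}

/* Hypothetical on-disk timestamp: 16 bytes, both fields little-endian. */
struct sketch_disk_time {
	uint8_t sec[8];    /* seconds since the epoch, 64 bits */
	uint8_t nsec[8];   /* nanoseconds within that second */
};
```

With 64 bits for the seconds, values that no longer fit in 32 bits
round-trip unchanged.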
Thanks Ingo for your review
Boaz
Boaz Harrosh wrote:
>
> Thanks Ingo for your review
> Boaz
>
Oops what's up with me?!!?
Thanks Alan
Boaz
Alan Cox wrote:
> > > +#define EXOFS_SUPER_ID 0x10000 /* object ID for on-disk superblock */
>
> And if an OS failure breaks the super block and you have only one how do
> you recover it ?
Having one super block would be silly.
But aren't most kinds of replication better done behind the OSD level,
on the storage fabric? OSD is all about letting the fabric decide
things like allocation and durability strategies after all.
With multiple super blocks at the filesystem level, some OS failures
that would trash one of the super blocks would simply trash all the copies.
I wonder how much less likely trashing one super block object than
trashing a set of them would be.
-- Jamie
Jamie Lokier wrote:
> Having one super block would be silly.
Yep.
> But aren't most kinds of replication better done behind the OSD level,
> on the storage fabric? OSD is all about letting the fabric decide
> things like allocation and durability strategies after all.
Probably, but one cannot _assume_ that. The OSD device might just be a
dumb, non-replicated OSD simulator, or in the future, a singleton SATA
drive.
Jeff
On Jan. 13, 2009, 17:17 +0200, Jeff Garzik <[email protected]> wrote:
> Jamie Lokier wrote:
>> Having one super block would be silly.
>
> Yep.
>
>
>> But aren't most kinds of replication better done behind the OSD level,
>> on the storage fabric? OSD is all about letting the fabric decide
>> things like allocation and durability strategies after all.
>
> Probably, but one cannot _assume_ that. The OSD device might just be a
> dumb, non-replicated OSD simulator, or in the future, a singleton SATA
> drive.
>
> Jeff
>
>
>
Alan asked about an _os_ failure. I consider it different from a disk
level failure, which is typically handled by RAID. At the OS level I'd
care more about the self consistency of the metadata and its corruption
due to the OS (or the OSD) failing to update it atomically.
In exofs's case the metadata in the superblock is inexpensive to recover.
It holds the last object ID created. If, when using it, the filesystem
finds an already existing object it can detect the last object created
using a logarithmic search (or even a linear one assuming the sb is
synced frequently enough). Therefore I wouldn't spend cycles on
replicating it.
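
The logarithmic search could look roughly like the sketch below. It assumes
a probe that answers "does object N exist?" and, importantly, that object
IDs are dense above the stale value (no holes from deleted objects); without
that assumption only the linear scan is safe. find_last_id, obj_exists_fn
and the stand-in probe below_limit are made-up names, and overflow of the ID
space is ignored:

```c
#include <stdbool.h>
#include <stdint.h>

/* Probe: does an object with this ID exist? (hypothetical callback) */
typedef bool (*obj_exists_fn)(uint64_t id, void *ctx);

/*
 * Find the highest existing object ID, given that 'start' (the stale
 * value from the superblock) exists and that IDs above it are dense.
 */
static uint64_t find_last_id(uint64_t start, obj_exists_fn exists, void *ctx)
{
	uint64_t lo = start;   /* always an ID known to exist */
	uint64_t step = 1;
	uint64_t hi;

	/* Gallop: double the step until a probe hits a missing ID. */
	while (exists(lo + step, ctx)) {
		lo += step;
		step *= 2;
	}
	hi = lo + step;        /* known to be missing */

	/* Binary search: lo exists, hi does not. */
	while (hi - lo > 1) {
		uint64_t mid = lo + (hi - lo) / 2;

		if (exists(mid, ctx))
			lo = mid;
		else
			hi = mid;
	}
	return lo;
}

/* Stand-in probe for trying the sketch: IDs up to *ctx exist. */
static bool below_limit(uint64_t id, void *ctx)
{
	return id <= *(const uint64_t *)ctx;
}
```

Both phases take a number of probes logarithmic in the distance between the
stale value and the real last ID.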
Benny
> > Now in my experience that is a *dumb* question because the answer is
> > obvious...
>
> The choice is between "new magic OSD fs" and "new fs that used to be
> ext4, before we hacked it up".
>
> "existing one you trust" is not an option...
No it isn't. The choice is existing technology followed by a "thank you
goodbye Mr OSD salesman".
I'm not saying we shouldn't work on an OSD file system and I'm glad IBM
folks are but that it can be done slowly. Also for most fs folks an OSD
emulator testing might not be a bad idea - say one stacked on ext3 8)
Alan Cox wrote:
>>> Now in my experience that is a *dumb* question because the answer is
>>> obvious...
>> The choice is between "new magic OSD fs" and "new fs that used to be
>> ext4, before we hacked it up".
>>
>> "existing one you trust" is not an option...
>
> No it isn't. The choice is existing technology followed by a "thank you
> goodbye Mr OSD salesman".
>
> I'm not saying we shouldn't work on an OSD file system and I'm glad IBM
> folks are but that it can be done slowly.
IBM has not been working on OSD for a long time now. We at open-osd are.
That is me and Benny (a bit) and other people that hang out on the mailing list.
So it is mostly Panasas these days.
On git.open-osd.org we are hosting various OSD projects, mainly the submitted
work plus inherited code from OSC, which is not active anymore as of Q3 2008.
> Also for most fs folks an OSD
> emulator testing might not be a bad idea - say one stacked on ext3 8)
>
One of the projects on open-osd.org is OSC's osd-target, which is based
on the scsi tgt framework and implements an OSD in user mode over any local
filesystem. It supports any SCSI transport supported by tgt, that is: iscsi,
fcoe, iser, kernel-tgt. This is what we test against. I have just been porting
that project to FreeBSD. It has a very small footprint compared to, let's say, NFS.
Thanks
Boaz
BTW, where can the latest libosd be found?
I'll want to use that for osdblk (export a single OSD object as a Linux
block device).
Jeff
Jeff Garzik wrote:
> BTW, where can the latest libosd be found?
>
> I'll want to use that for osdblk (export a single OSD object as a Linux
> block device).
>
> Jeff
>
You are most welcome, thank you.
The in kernel patches are at:
git-clone git://git.open-osd.org/linux-open-osd.git linux-next
or on the web at:
http://git.open-osd.org/gitweb.cgi?p=linux-open-osd.git;a=shortlog;h=refs/heads/linux-next
But you might find the out-of-tree project more complete:
git-clone git://git.open-osd.org/open-osd.git
or on the web at:
http://git.open-osd.org/gitweb.cgi?p=open-osd.git;a=shortlog;h=refs/heads/exofs
To set up an osd-target and all that, see the material also hosted on open-osd.org.
Please start reading at http://open-osd.org and the links from that page.
(Being first, you get to debug my documentation)
I'm patiently awaiting patches ;)
sincerely yours
Boaz