As we promised on the linux-ext4 list on October 31, here is the patch
that adds secure deletion via a trash-bin functionality for ext4. It is a
compromise solution that combines secure deletion with the trash-bin support
(the latter had been requested by even more people than the former :-).
The patch moves the files marked with the -s (secure deletion) or the -u
(undelete) ext2/3/4 attributes to a .trash/<uid>/ directory. It does so
reliably because it does it in the kernel upon every unlink operation.
User-mode tools can miss some unlinks, which is unacceptable for secure
deletion. The per-user subdirectories allow us to solve the problem
with permissions. The .trash directory is owned by root and has
drwxr-xr-x permissions. The per-uid subdirectories are owned and only
accessible by their corresponding owners (drwx------) so users cannot
read each other's files in the trash. Alternative solutions would
require changing a lot of ext3 code.
Right now we do not store the whole path to the file moved into the
trash-bin. We keep the filenames or append them with numbers in case of
collisions. In the future we may change it to store the whole path
information if necessary.
Later (e.g., when the system is idle) a user-mode daemon can just scan
.trash's subdirectories, overwrite the files marked with -s, and unlink
the files. The purging and data overwriting is performed entirely in user
space. It may or may not be necessary to add a mechanism for the file
system to initiate the user-mode deletion once the space becomes scarce.
The benefits of this secure deletion approach are obvious:
1) small kernel patch;
2) two solutions (trash-bin and secure deletion) in one;
3) the user-mode part can be made arbitrarily complex and overwrite with
many patterns, many times, and at configurable times; and
4) most of the code can be reused for other file systems in the future.
We will really appreciate any comments, help, and feedback.
Thank you,
Nikolai Joukov, Harry Papaxenopoulos, and Erez Zadok.
Filesystems and Storage Laboratory,
Stony Brook University
Signed-off-by: Nikolai Joukov <[email protected]>
Signed-off-by: Harry Papaxenopoulos <[email protected]>
Signed-off-by: Erez Zadok <[email protected]>
---
fs/ext4/Makefile | 1 +
fs/ext4/namei.c | 40 ++++++-
fs/ext4/super.c | 6 +
fs/ext4/tb.c | 237 +++++++++++++++++++++++++++++++++++++++
fs/ext4/tb.h | 18 +++
fs/Kconfig | 9 +
6 files changed, 309 insertions(+), 2 deletions(-)
diff -Naur 2.6.19/fs/ext4/Makefile 2.6.19tb/fs/ext4/Makefile
--- 2.6.19/fs/ext4/Makefile 2006-12-02 19:49:53.000000000 -0500
+++ 2.6.19tb/fs/ext4/Makefile 2006-12-02 19:46:13.000000000 -0500
@@ -10,3 +10,4 @@
ext4dev-$(CONFIG_EXT4DEV_FS_XATTR) += xattr.o xattr_user.o xattr_trusted.o
ext4dev-$(CONFIG_EXT4DEV_FS_POSIX_ACL) += acl.o
ext4dev-$(CONFIG_EXT4DEV_FS_SECURITY) += xattr_security.o
+ext4dev-$(CONFIG_EXT4DEV_FS_TRASH_BIN) += tb.o
diff -Naur 2.6.19/fs/ext4/namei.c 2.6.19tb/fs/ext4/namei.c
--- 2.6.19/fs/ext4/namei.c 2006-12-02 19:49:49.000000000 -0500
+++ 2.6.19tb/fs/ext4/namei.c 2006-12-04 13:03:35.000000000 -0500
@@ -41,6 +41,7 @@
#include "namei.h"
#include "xattr.h"
#include "acl.h"
+#include "tb.h"
/*
* define how far ahead to read directories while searching them.
@@ -2067,6 +2068,10 @@
struct inode * inode;
struct buffer_head * bh;
struct ext4_dir_entry_2 * de;
+ int trashed = 0;
+#ifdef CONFIG_EXT4DEV_FS_TRASH_BIN
+ struct dentry *user_dentry = NULL;
+#endif
handle_t *handle;
/* Initialize quotas before so that eventual writes go
@@ -2096,13 +2101,40 @@
inode->i_ino, inode->i_nlink);
inode->i_nlink = 1;
}
- retval = ext4_delete_entry(handle, dir, de, bh);
+#ifdef CONFIG_EXT4DEV_FS_TRASH_BIN
+ if (EXT4_I(dentry->d_inode)->i_flags &
+ (EXT4_UNRM_FL | EXT4_SECRM_FL)) {
+
+ /*
+ * We put this code here to optimize the common case. Since
+ * lookups are expensive, we try to reserve from making any,
+ * unless one of the trash-bin flags are set. The cleanest
+ * way though is to probably move this code outside the
+ * above if statement.
+ */
+ user_dentry = ext4_get_user_dentry(dir, 1);
+ if (IS_ERR(user_dentry)) {
+ retval = PTR_ERR(user_dentry);
+ user_dentry = NULL;
+ goto end_unlink;
+ }
+
+ if (inode->i_nlink == 1 && user_dentry->d_inode &&
+ user_dentry->d_inode->i_ino != dir->i_ino) {
+ retval = ext4_trash_entry(dir, dentry);
+ trashed = 1;
+ }
+ }
+#endif
+ if (!trashed)
+ retval = ext4_delete_entry(handle, dir, de, bh);
if (retval)
goto end_unlink;
dir->i_ctime = dir->i_mtime = CURRENT_TIME_SEC;
ext4_update_dx_flag(dir);
ext4_mark_inode_dirty(handle, dir);
- drop_nlink(inode);
+ if (!trashed)
+ drop_nlink(inode);
if (!inode->i_nlink)
ext4_orphan_add(handle, inode);
inode->i_ctime = dir->i_ctime;
@@ -2112,6 +2144,10 @@
end_unlink:
ext4_journal_stop(handle);
brelse (bh);
+#ifdef CONFIG_EXT4DEV_FS_TRASH_BIN
+ if (user_dentry)
+ dput(user_dentry);
+#endif
return retval;
}
diff -Naur 2.6.19/fs/ext4/super.c 2.6.19tb/fs/ext4/super.c
--- 2.6.19/fs/ext4/super.c 2006-12-02 19:49:49.000000000 -0500
+++ 2.6.19tb/fs/ext4/super.c 2006-12-03 21:09:05.000000000 -0500
@@ -41,6 +41,7 @@
#include "xattr.h"
#include "acl.h"
#include "namei.h"
+#include "tb.h"
static int ext4_load_journal(struct super_block *, struct ext4_super_block *,
unsigned long journal_devnum);
@@ -1840,6 +1841,11 @@
goto failed_mount4;
}
+#ifdef CONFIG_EXT4DEV_FS_TRASH_BIN
+ if (ext4_create_tb(sb))
+ goto failed_mount4;
+#endif
+
ext4_setup_super (sb, es, sb->s_flags & MS_RDONLY);
/*
* akpm: core read_super() calls in here with the superblock locked.
diff -Naur 2.6.19/fs/ext4/tb.c 2.6.19tb/fs/ext4/tb.c
--- 2.6.19/fs/ext4/tb.c 1969-12-31 19:00:00.000000000 -0500
+++ 2.6.19tb/fs/ext4/tb.c 2006-12-04 13:03:35.000000000 -0500
@@ -0,0 +1,237 @@
+/* linux/fs/ext4/tb.c
+ *
+ * Copyright (C) 2006 Stony Brook University
+ * Nikolai Joukov, Harry Papaxenopoulos, and Erez Zadok
+ */
+#include <linux/fs.h>
+#include <linux/jbd.h>
+#include <linux/ext3_fs.h>
+#include <linux/mount.h>
+#include <linux/namei.h>
+#include <asm/current.h>
+#include "tb.h"
+
+int ext4_create_tb(struct super_block *sb)
+{
+ int err = 0;
+ struct dentry *dentry, *root_dentry;
+ struct inode *mnt_inode;
+
+ root_dentry = sb->s_root;
+ mnt_inode = root_dentry->d_inode;
+
+ dentry = lookup_one_len(TB_NAME, root_dentry, TB_NAME_LEN);
+
+ if (IS_ERR(dentry)) {
+ err = PTR_ERR(dentry);
+ goto out;
+ }
+
+ if (!dentry->d_inode && mnt_inode->i_op && mnt_inode->i_op->mkdir)
{
+ mutex_lock(&mnt_inode->i_mutex);
+ err = mnt_inode->i_op->mkdir(mnt_inode, dentry, TB_MODE);
+ mutex_unlock(&mnt_inode->i_mutex);
+ if (err)
+ goto release_out;
+ }
+
+ dentry->d_inode->i_mode = TB_MODE;
+ mark_inode_dirty(dentry->d_inode);
+release_out:
+ dput(dentry);
+out:
+ return err;
+}
+
+struct dentry *tb_sillyname(struct dentry *tb_dentry, struct dentry
*dentry)
+{
+ unsigned sillycounter = 1;
+ char *silly = NULL;
+ char *tmp = NULL;
+ struct dentry *sdentry = NULL;
+
+ if (!tb_dentry || !tb_dentry->d_inode) {
+ sdentry = ERR_PTR(-EIO);
+ goto out;
+ }
+
+ silly = kmalloc(PATH_MAX, GFP_KERNEL);
+ if (!silly)
+ goto out;
+
+ tmp = kmalloc(PATH_MAX, GFP_KERNEL);
+ if (!tmp)
+ goto out;
+
+ sdentry = lookup_one_len(dentry->d_name.name, tb_dentry,
+ dentry->d_name.len);
+ if (IS_ERR(sdentry))
+ goto out;
+
+ while (sdentry->d_inode != NULL) {
+ if (PATH_MAX - 5 <= dentry->d_name.len)
+ snprintf(tmp, PATH_MAX - 6, dentry->d_name.name);
+ else
+ sprintf(tmp, dentry->d_name.name);
+
+ sprintf(silly, "%s_%u", dentry->d_name.name, sillycounter);
+ dput(sdentry);
+ sdentry = lookup_one_len(silly, tb_dentry, strlen(silly));
+ if (IS_ERR(sdentry))
+ goto out;
+ sillycounter++;
+ }
+out:
+ kfree(silly);
+ kfree(tmp);
+ return sdentry;
+}
+
+struct dentry *ext4_get_user_dentry(struct inode *old_dir, int create)
+{
+ int err = 0;
+ struct dentry *user_dentry = NULL;
+ struct dentry *tb_dentry = NULL;
+ struct dentry *root_dentry = NULL;
+ char *name = NULL;
+ struct inode *tb_inode;
+
+ if (old_dir && old_dir->i_sb && old_dir->i_sb->s_root)
+ root_dentry = old_dir->i_sb->s_root;
+ else {
+ user_dentry = ERR_PTR(-EIO);
+ goto out;
+ }
+
+ tb_dentry = lookup_one_len(TB_NAME, root_dentry, TB_NAME_LEN);
+ if (IS_ERR(tb_dentry)) {
+ user_dentry = tb_dentry;
+ tb_dentry = NULL;
+ goto out;
+ }
+
+ tb_inode = tb_dentry->d_inode;
+ if (!tb_inode) {
+ user_dentry = ERR_PTR(-EIO);
+ goto out;
+ }
+
+ name = kmalloc(UID_MAX_LEN, GFP_KERNEL);
+ if (!name)
+ goto out;
+
+ sprintf(name, "%d", current->fsuid);
+
+ user_dentry = lookup_one_len(name, tb_dentry, strlen(name));
+ if (IS_ERR(user_dentry))
+ goto out;
+
+ if (!user_dentry->d_inode & create) {
+ if (tb_inode && tb_inode->i_op && tb_inode->i_op->permission)
+ err = tb_inode->i_op->permission(tb_inode,
+ MAY_EXEC, NULL);
+ else
+ err = generic_permission(tb_inode,
+ MAY_EXEC, NULL);
+ if (err) {
+ user_dentry = ERR_PTR(err);
+ goto out;
+ }
+
+ mutex_lock(&tb_inode->i_mutex);
+ err = tb_inode->i_op->mkdir(tb_inode, user_dentry, USER_MODE);
+ mutex_unlock(&tb_inode->i_mutex);
+ if (err) {
+ user_dentry = ERR_PTR(err);
+ goto out;
+ }
+ }
+
+out:
+ kfree(name);
+ if (tb_dentry)
+ dput(tb_dentry);
+ return user_dentry;
+}
+
+int ext4_trash_entry(struct inode *old_dir, struct dentry *old_dentry)
+{
+ int err = 0;
+ struct dentry *new_dentry = NULL;
+ struct dentry *user_dentry = NULL;
+ struct dentry *tb_dentry = NULL;
+ struct dentry *root_dentry = NULL;
+ struct inode *user_inode;
+ struct inode *tb_inode;
+
+ if (old_dir && old_dir->i_sb && old_dir->i_sb->s_root)
+ root_dentry = old_dir->i_sb->s_root;
+ else {
+ err = -EIO;
+ goto out;
+ }
+
+ tb_dentry = lookup_one_len(TB_NAME, root_dentry, TB_NAME_LEN);
+ if (IS_ERR(tb_dentry)) {
+ err = PTR_ERR(tb_dentry);
+ tb_dentry = NULL;
+ goto out;
+ }
+
+ tb_inode = tb_dentry->d_inode;
+ if (!tb_inode) {
+ err = -EIO;
+ goto out;
+ }
+
+
+ user_dentry = ext4_get_user_dentry(old_dir, 1);
+ if (IS_ERR(user_dentry)) {
+ err = PTR_ERR(user_dentry);
+ user_dentry = NULL;
+ goto out;
+ }
+
+ user_inode = user_dentry->d_inode;
+
+ new_dentry = tb_sillyname(user_dentry, old_dentry);
+ if (IS_ERR(new_dentry)) {
+ err = PTR_ERR(new_dentry);
+ new_dentry = NULL;
+ goto out;
+ }
+
+ if (tb_inode && tb_inode->i_op && tb_inode->i_op->permission)
+ err = tb_inode->i_op->permission(user_inode,
+ MAY_WRITE | MAY_EXEC, NULL);
+ else
+ err = generic_permission(user_inode, MAY_WRITE | MAY_EXEC, NULL);
+
+ if (err)
+ goto out;
+
+ if (user_inode->i_op && user_inode->i_op->rename) {
+ mutex_lock(&user_inode->i_mutex);
+ err = user_inode->i_op->rename(old_dir, old_dentry,
+ user_inode, new_dentry);
+ mutex_unlock(&user_inode->i_mutex);
+ if (!err)
+ d_move(old_dentry, new_dentry);
+ } else {
+ err = -ENOSYS;
+ goto out;
+ }
+
+ if (!d_unhashed(old_dentry))
+ d_drop(old_dentry);
+
+out:
+ if (new_dentry)
+ dput(new_dentry);
+ if (user_dentry)
+ dput(user_dentry);
+ if (tb_dentry)
+ dput(tb_dentry);
+ return err;
+}
+
diff -Naur 2.6.19/fs/ext4/tb.h 2.6.19tb/fs/ext4/tb.h
--- 2.6.19/fs/ext4/tb.h 1969-12-31 19:00:00.000000000 -0500
+++ 2.6.19tb/fs/ext4/tb.h 2006-12-03 21:09:05.000000000 -0500
@@ -0,0 +1,18 @@
+/* linux/fs/ext4/tb.h
+ *
+ * Copyright (C) 2006 Stony Brook University
+ * Nikolai Joukov, Harry Papaxenopoulos, and Erez Zadok
+ */
+#ifdef CONFIG_EXT4DEV_FS_TRASH_BIN
+
+#define TB_NAME ".trash"
+#define TB_NAME_LEN 6
+#define TB_MODE S_IRUGO | S_IXUGO | S_IRWXU | S_IFDIR
+#define USER_MODE S_IRWXU | S_IFDIR
+#define UID_MAX_LEN 6
+#define TB_DEBUG 1
+
+extern int ext4_create_tb(struct super_block *);
+extern int ext4_trash_entry(struct inode *, struct dentry *);
+extern struct dentry *ext4_get_user_dentry(struct inode *old_dir, int create);
+#endif /*CONFIG_EXT4_FS_TRASH_BIN*/
diff -Naur 2.6.19/fs/Kconfig 2.6.19tb/fs/Kconfig
--- 2.6.19/fs/Kconfig 2006-12-02 19:49:39.000000000 -0500
+++ 2.6.19tb/fs/Kconfig 2006-12-02 20:04:18.000000000 -0500
@@ -207,6 +207,15 @@
If you are not using a security module that requires using
extended attributes for file security labels, say N.
+config EXT4DEV_FS_TRASH_BIN
+ bool "Trash bin"
+ depends on EXT4DEV_FS
+ help
+ Trash bin functionality enables users to easily recover
+ previously deleted files.
+
+ If you do not know what trash bin does, say N.
+
config JBD
tristate
help
On Mon, Dec 04, 2006 at 01:33:55PM -0500, Nikolai Joukov wrote:
> As we promised on the linux-ext4 list on October 31, here is the patch
> that adds secure deletion via a trash-bin functionality for ext4. It is a
> compromise solution that combines secure deletion with the trash-bin support
> (the latter had been requested by even more people than the former :-).
Given that almost all of the code for this uses vfs interfaces and
only a couple of simple filesystem hooks, is there any reason for
this being ext4 specific? i.e. why aren't you hooking the vfs layer
so we get a single undelete/secure delete implementation for all
filesystems?
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
> > As we promised on the linux-ext4 list on October 31, here is the patch
> > that adds secure deletion via a trash-bin functionality for ext4. It is a
> > compromise solution that combines secure deletion with the trash-bin support
> > (the latter had been requested by even more people than the former :-).
>
> Given that almost all of the code for this uses vfs interfaces and
> only a couple of simple filesystem hooks, is there any reason for
> this being ext4 specific? i.e. why aren't you hooking the vfs layer
> so we get a single undelete/secure delete implementation for all
> filesystems?
You are right. Actually, we mentioned it as a benefit number 4 of this
approach in the original post. Most of the code is not
file-system--specific and can be used by any other (all other?) file
system(s). The only complication is that only ext2/3/4 and reiser file
systems already support the per-file secure deletion and undelete
attributes.
Since ext4 is in early development now, we believe it'd be easier to get
such code into ext4 than into the main-line VFS. If there's enough
interested among the kernel maintainers, we'd be happy to move this
functionality to the VFS and provide f/s hooks for
secure-deletion/trash-bin.
I guess, the right thing to do would be to move the common trash-bin
(tb.c and tb.h) code into the /fs and /include/linux directories.
Also, VFS would require just a couple of trivial changes to support
something like '-o trashbin' mount-time option for all file systems.
In addition, individual file systems may support per-file attributes for
this (e.g., ext2/3/4).
Since I just postponed my moving house (I am joining IBM soon) I think we
may have time to try it out this week.
Nikolai.
---------------
Filesystems and Storage Laboratory,
Stony Brook University
On Tue, Dec 05, 2006 at 11:41:28AM -0500, Nikolai Joukov wrote:
> > > As we promised on the linux-ext4 list on October 31, here is the patch
> > > that adds secure deletion via a trash-bin functionality for ext4. It is a
> > > compromise solution that combines secure deletion with the trash-bin support
> > > (the latter had been requested by even more people than the former :-).
> >
> > Given that almost all of the code for this uses vfs interfaces and
> > only a couple of simple filesystem hooks, is there any reason for
> > this being ext4 specific? i.e. why aren't you hooking the vfs layer
> > so we get a single undelete/secure delete implementation for all
> > filesystems?
>
> You are right. Actually, we mentioned it as a benefit number 4 of this
> approach in the original post. Most of the code is not
> file-system--specific and can be used by any other (all other?) file
> system(s). The only complication is that only ext2/3/4 and reiser file
> systems already support the per-file secure deletion and undelete
> attributes.
They are defined but unused in 2.6.19, right? I can't see anywhere
in the 2.6.19 ext2/3/4/reiser trees that actually those flags,
including setting and retrieving them from disk. JFS i can see
sets, clears and retreives them, but not the fielsystems you
mention. Though I might just be blind..... ;)
If all we need to add to XFS is support for those flags, then XFS
support would be trivial to add.
Oh, damn. I take that back. We're almost out of flag space in the on
disk inode - these two flags would use the last 2 flag bits so this
may require an on disk inode format change in XFS. This will be
a little more complex than I first thought, but not impossible
as we already support two on-disk inode format versions.
> Since ext4 is in early development now, we believe it'd be easier to get
> such code into ext4 than into the main-line VFS. If there's enough
> interested among the kernel maintainers, we'd be happy to move this
> functionality to the VFS and provide f/s hooks for
> secure-deletion/trash-bin.
I'd certainly like to see new functionality like this added at the
VFS rather than in any particular filesystem. Don't know about
anyone else, though....
> I guess, the right thing to do would be to move the common trash-bin
> (tb.c and tb.h) code into the /fs and /include/linux directories.
And call them trashbin.[ch] ;)
> Also, VFS would require just a couple of trivial changes to support
> something like '-o trashbin' mount-time option for all file systems.
> In addition, individual file systems may support per-file attributes for
> this (e.g., ext2/3/4).
Sounds like a good idea to me.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
On Wed, Dec 06, 2006 at 08:11:00PM +1100, David Chinner wrote:
> They are defined but unused in 2.6.19, right? I can't see anywhere
> in the 2.6.19 ext2/3/4/reiser trees that actually those flags,
> including setting and retrieving them from disk. JFS i can see
> sets, clears and retreives them, but not the fielsystems you
> mention. Though I might just be blind..... ;)
>
> If all we need to add to XFS is support for those flags, then XFS
> support would be trivial to add.
>
> Oh, damn. I take that back. We're almost out of flag space in the on
> disk inode - these two flags would use the last 2 flag bits so this
> may require an on disk inode format change in XFS. This will be
> a little more complex than I first thought, but not impossible
> as we already support two on-disk inode format versions.
Hrm. I was toying around with the idea of using a flag to mark inodes as
whiteouts (similar to what BSD does) for Unionfs. I remember that Jan Blunck
tried similar thing in his implementation of VFS unionfs mounts.
I am not entirely convinced that whiteout inode flag is the right way to do
things, but I'm just raising this now as I wouldn't want to wait for new
ondisk format for XFS to say that Unionfs supports XFS. (Assuming that it is
the right approach.) ;-)
Josef "Jeff" Sipek.
--
I already backed up the box once, I can do it again!
On Wed, Dec 06, 2006 at 07:56:19PM -0500, Josef Sipek wrote:
> On Wed, Dec 06, 2006 at 08:11:00PM +1100, David Chinner wrote:
> > They are defined but unused in 2.6.19, right? I can't see anywhere
> > in the 2.6.19 ext2/3/4/reiser trees that actually those flags,
> > including setting and retrieving them from disk. JFS i can see
> > sets, clears and retreives them, but not the fielsystems you
> > mention. Though I might just be blind..... ;)
> >
> > If all we need to add to XFS is support for those flags, then XFS
> > support would be trivial to add.
> >
> > Oh, damn. I take that back. We're almost out of flag space in the on
> > disk inode - these two flags would use the last 2 flag bits so this
> > may require an on disk inode format change in XFS. This will be
> > a little more complex than I first thought, but not impossible
> > as we already support two on-disk inode format versions.
>
> Hrm. I was toying around with the idea of using a flag to mark inodes as
> whiteouts (similar to what BSD does) for Unionfs. I remember that Jan Blunck
> tried similar thing in his implementation of VFS unionfs mounts.
>
> I am not entirely convinced that whiteout inode flag is the right way to do
> things, but I'm just raising this now as I wouldn't want to wait for new
> ondisk format for XFS to say that Unionfs supports XFS. (Assuming that it is
> the right approach.) ;-)
Maybe we should be using EAs for this sort of thing instead of flags
on the inode? If we keep adding inode flags for generic features
then we are going to force more than just XFS into inode format
changes eventually....
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
On Thu, Dec 07, 2006 at 12:44:27PM +1100, David Chinner wrote:
> Maybe we should be using EAs for this sort of thing instead of flags
> on the inode? If we keep adding inode flags for generic features
> then we are going to force more than just XFS into inode format
> changes eventually....
Aren't EAs slow? Maybe not on XFS but on other filesystems...
Josef "Jeff" Sipek.
--
Linux, n.:
Generous programmers from around the world all join forces to help
you shoot yourself in the foot for free.
On Wed, Dec 06, 2006 at 09:35:30PM -0500, Josef Sipek wrote:
> On Thu, Dec 07, 2006 at 12:44:27PM +1100, David Chinner wrote:
> > Maybe we should be using EAs for this sort of thing instead of flags
> > on the inode? If we keep adding inode flags for generic features
> > then we are going to force more than just XFS into inode format
> > changes eventually....
>
> Aren't EAs slow? Maybe not on XFS but on other filesystems...
Only when they don't fit in the inode itself and extra
disk seeks are needed to retrieve them.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
On Thu, 2006-12-07 at 13:49 +1100, David Chinner wrote:
> On Wed, Dec 06, 2006 at 09:35:30PM -0500, Josef Sipek wrote:
> > On Thu, Dec 07, 2006 at 12:44:27PM +1100, David Chinner wrote:
> > > Maybe we should be using EAs for this sort of thing instead of flags
> > > on the inode? If we keep adding inode flags for generic features
> > > then we are going to force more than just XFS into inode format
> > > changes eventually....
> >
> > Aren't EAs slow? Maybe not on XFS but on other filesystems...
>
> Only when they don't fit in the inode itself and extra
> disk seeks are needed to retrieve them.
>
> Cheers,
>
> Dave.
Also keep in mind that the EA doesn't actually have to have a physical
representation on disk (or, rather, it does, but it doesn't need to be
the same representation used by EAs in the user namespace).
This means that if one of those slow EA filesystems still has room for
flags in the inode, it can synthesize the EA on demand.
This is even preferable to ioctls for the interface to new filesystem
metadata -- if a backup or archive program knows how to store EAs, it
will be able to save and restore any new exotic metadata without any
extra effort.
--
Nicholas Miell <[email protected]>
On Wed, 2006-12-06 at 20:11 +1100, David Chinner wrote:
> ...
> If all we need to add to XFS is support for those flags, then XFS
> support would be trivial to add.
>
> Oh, damn. I take that back. We're almost out of flag space in the on
> disk inode - these two flags would use the last 2 flag bits so this
> may require an on disk inode format change in XFS. This will be
> a little more complex than I first thought, ...
It should be OK - you can do it without an inode version revision
if you take a second 16 bits for "di_flags2" from here...
xfs_dinode_core {
...
__uint8_t di_pad[8]; /* unused, zeroed space */
Its guaranteed zeroed initially (i.e. all flags unset) and the XFS
get/set flags APIs are 32 bits, so you should be OK there.
Also, it may also be possible to reclaim di_onlink at some point (maybe
now, since 16 bits would be good here) if mkfs.xfs is changed to always
create v2 inodes (dynamic conversion ATM IIRC)... not 100% sure though,
needs more code analysis.
cheers.
--
Nathan
On Wed, Dec 06, 2006 at 07:14:58PM -0800, Nicholas Miell wrote:
> On Thu, 2006-12-07 at 13:49 +1100, David Chinner wrote:
> > On Wed, Dec 06, 2006 at 09:35:30PM -0500, Josef Sipek wrote:
> > > On Thu, Dec 07, 2006 at 12:44:27PM +1100, David Chinner wrote:
> > > > Maybe we should be using EAs for this sort of thing instead of flags
> > > > on the inode? If we keep adding inode flags for generic features
> > > > then we are going to force more than just XFS into inode format
> > > > changes eventually....
> > >
> > > Aren't EAs slow? Maybe not on XFS but on other filesystems...
> >
> > Only when they don't fit in the inode itself and extra
> > disk seeks are needed to retrieve them.
> >
> > Cheers,
> >
> > Dave.
>
> Also keep in mind that the EA doesn't actually have to have a physical
> representation on disk (or, rather, it does, but it doesn't need to be
> the same representation used by EAs in the user namespace).
>
> This means that if one of those slow EA filesystems still has room for
> flags in the inode, it can synthesize the EA on demand.
Interesting point. Filesystems such as XFS would be unaffected (at least
with attr v1, not sure how attr2 does things in XFS wrt EAs.)
Another question is, suppose that filesystem x is one of the slow EA
filesystems, exposing this whiteout inode flag as EA would require fs x to
be compiled with EA support. As a matter of fact, it would require all the
filesystems to have EA support. Sure, overall it is less work for me, but I
am more interested in doing the Right Thing(tm) - which may be something
completely different from EA or inode flag (e.g., what Ted Tso suggested at
OLS - effectively a metadata-only ondisk format for unionfs.)
> This is even preferable to ioctls for the interface to new filesystem
> metadata -- if a backup or archive program knows how to store EAs, it
> will be able to save and restore any new exotic metadata without any
> extra effort.
Agreed.
Josef "Jeff" Sipek.
--
UNIX is user-friendly ... it's just selective about who it's friends are
On Thu, 2006-12-07 at 12:44 +1100, David Chinner wrote:
> Maybe we should be using EAs for this sort of thing instead of flags
> on the inode? If we keep adding inode flags for generic features
> then we are going to force more than just XFS into inode format
> changes eventually....
You do need to be judicious in what you add, but we got up to here
(well over 10 years) with 16 bits worth of space - increasing to 32
bits is going to last a fairly long time... I wouldn't switch to a
different scheme until that point (which is also the point where the
system calls are going to have to change anyway).
cheers.
--
Nathan
> They are defined but unused in 2.6.19, right? I can't see anywhere
> in the 2.6.19 ext2/3/4/reiser trees that actually those flags,
> including setting and retrieving them from disk. JFS i can see
> sets, clears and retreives them, but not the fielsystems you
> mention. Though I might just be blind..... ;)
Well, at least Ext2/3/4 and JFS do store these attributes and retrieve
them from the disk. However, not a single one of these file systems
supports the related functionality (secure deletion or a trash-bin).
Clearly, this is confusing for the users.
> If all we need to add to XFS is support for those flags, then XFS
> support would be trivial to add.
>
> Oh, damn. I take that back. We're almost out of flag space in the on
> disk inode - these two flags would use the last 2 flag bits so this
> may require an on disk inode format change in XFS. This will be
> a little more complex than I first thought, but not impossible
> as we already support two on-disk inode format versions.
If you do not see any obvious use for these two bits in the near future
and ready to sacrifice them (or at least one of them) for secure deletion
and trash-bin attributes we would be more than happy to add the actual
functionality to XFS as well.
> I'd certainly like to see new functionality like this added at the
> VFS rather than in any particular filesystem. Don't know about
> anyone else, though....
>
> > I guess, the right thing to do would be to move the common trash-bin
> > (tb.c and tb.h) code into the /fs and /include/linux directories.
>
> And call them trashbin.[ch] ;)
In fact, we have already done this :-)
> > Also, VFS would require just a couple of trivial changes to support
> > something like '-o trashbin' mount-time option for all file systems.
> > In addition, individual file systems may support per-file attributes for
> > this (e.g., ext2/3/4).
>
> Sounds like a good idea to me.
Thank you! So we will submit the corresponding patches soon. We will
start from the trashbin.c and patches for the individual file systems. A
little later we will submit a VFS patch to add support for the
'-o trashbin' mount option.
Nikolai Joukov
----------------------------------
Filesystems and Storage Laboratory
Stony Brook University