2007-11-06 17:15:41

by Jan Kara

[permalink] [raw]
Subject: [RFC] [PATCH 0/3] Recursive mtime for ext3

Hello,

in following three patches is implemented recursive mtime feature for
ext3. The first two patches are mostly clean-up patches, the third patch
implements the feature itself. If somebody is interested in testing this
(or even writing a support of this feature in rsync and similar), please
contact me. Attached are sources of simple tools set_recmod, get_recmod
for testing the feature and also a patch implementing basic support of
the feature in e2fsprogs. Comments welcome.

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR


Attachments:
(No filename) (546.00 B)
set_recmod.c (643.00 B)
get_recmod.c (723.00 B)
e2fsprogs-rec_mtime.diff (2.37 kB)
Download all attachments

2007-11-06 17:18:22

by Jan Kara

[permalink] [raw]
Subject: [RFC] [PATCH 1/3] Recursive mtime for ext3

Hello,

the following patch makes more lightweight handling of
EXT3_I(inode)->i_flags possible.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
---

Implement atomic updates of EXT3_I(inode)->i_flags. So far the i_flags access
was guarded mostly by i_mutex but this is quite heavy-weight. We now use
inode->i_lock to protect i_flags reading and updates in ext3. This patch
introduces a bogus warning that jflag and oldflags may be uninitialized -
anyone knows how to cleanly get rid of it?

Signed-off-by: Jan Kara <[email protected]>

diff -rupX /home/jack/.kerndiffexclude linux-2.6.23/fs/ext3/dir.c linux-2.6.23-1-i_flags_atomicity/fs/ext3/dir.c
--- linux-2.6.23/fs/ext3/dir.c 2007-10-11 12:01:23.000000000 +0200
+++ linux-2.6.23-1-i_flags_atomicity/fs/ext3/dir.c 2007-11-05 14:04:56.000000000 +0100
@@ -108,10 +108,10 @@ static int ext3_readdir(struct file * fi
sb = inode->i_sb;

#ifdef CONFIG_EXT3_INDEX
- if (EXT3_HAS_COMPAT_FEATURE(inode->i_sb,
- EXT3_FEATURE_COMPAT_DIR_INDEX) &&
- ((EXT3_I(inode)->i_flags & EXT3_INDEX_FL) ||
- ((inode->i_size >> sb->s_blocksize_bits) == 1))) {
+ if (is_dx(inode) ||
+ (EXT3_HAS_COMPAT_FEATURE(inode->i_sb, \
+ EXT3_FEATURE_COMPAT_DIR_INDEX) &&
+ (inode->i_size >> sb->s_blocksize_bits) == 1)) {
err = ext3_dx_readdir(filp, dirent, filldir);
if (err != ERR_BAD_DX_DIR) {
ret = err;
@@ -121,7 +121,9 @@ static int ext3_readdir(struct file * fi
* We don't set the inode dirty flag since it's not
* critical that it get flushed back to the disk.
*/
+ spin_lock(&inode->i_lock);
EXT3_I(filp->f_path.dentry->d_inode)->i_flags &= ~EXT3_INDEX_FL;
+ spin_unlock(&inode->i_lock);
}
#endif
stored = 0;
diff -rupX /home/jack/.kerndiffexclude linux-2.6.23/fs/ext3/ialloc.c linux-2.6.23-1-i_flags_atomicity/fs/ext3/ialloc.c
--- linux-2.6.23/fs/ext3/ialloc.c 2006-11-29 22:57:37.000000000 +0100
+++ linux-2.6.23-1-i_flags_atomicity/fs/ext3/ialloc.c 2007-11-05 14:14:50.000000000 +0100
@@ -278,7 +278,7 @@ static int find_group_orlov(struct super
ndirs = percpu_counter_read_positive(&sbi->s_dirs_counter);

if ((parent == sb->s_root->d_inode) ||
- (EXT3_I(parent)->i_flags & EXT3_TOPDIR_FL)) {
+ ext3_test_inode_flags(parent, EXT3_TOPDIR_FL)) {
int best_ndir = inodes_per_group;
int best_group = -1;

@@ -566,7 +566,11 @@ got:
ei->i_dir_start_lookup = 0;
ei->i_disksize = 0;

+ /* Guard reading of directory's i_flags, created inode is safe as
+ * noone has a reference to it yet */
+ spin_lock(&dir->i_lock);
ei->i_flags = EXT3_I(dir)->i_flags & ~EXT3_INDEX_FL;
+ spin_unlock(&dir->i_lock);
if (S_ISLNK(mode))
ei->i_flags &= ~(EXT3_IMMUTABLE_FL|EXT3_APPEND_FL);
/* dirsync only applies to directories */
diff -rupX /home/jack/.kerndiffexclude linux-2.6.23/fs/ext3/inode.c linux-2.6.23-1-i_flags_atomicity/fs/ext3/inode.c
--- linux-2.6.23/fs/ext3/inode.c 2007-10-11 12:01:23.000000000 +0200
+++ linux-2.6.23-1-i_flags_atomicity/fs/ext3/inode.c 2007-11-05 14:24:39.000000000 +0100
@@ -2557,8 +2557,10 @@ int ext3_get_inode_loc(struct inode *ino

void ext3_set_inode_flags(struct inode *inode)
{
- unsigned int flags = EXT3_I(inode)->i_flags;
+ unsigned int flags;

+ spin_lock(&inode->i_lock);
+ flags = EXT3_I(inode)->i_flags;
inode->i_flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC);
if (flags & EXT3_SYNC_FL)
inode->i_flags |= S_SYNC;
@@ -2570,13 +2572,16 @@ void ext3_set_inode_flags(struct inode *
inode->i_flags |= S_NOATIME;
if (flags & EXT3_DIRSYNC_FL)
inode->i_flags |= S_DIRSYNC;
+ spin_unlock(&inode->i_lock);
}

/* Propagate flags from i_flags to EXT3_I(inode)->i_flags */
void ext3_get_inode_flags(struct ext3_inode_info *ei)
{
- unsigned int flags = ei->vfs_inode.i_flags;
+ unsigned int flags;

+ spin_lock(&ei->vfs_inode.i_lock);
+ flags = ei->vfs_inode.i_flags;
ei->i_flags &= ~(EXT3_SYNC_FL|EXT3_APPEND_FL|
EXT3_IMMUTABLE_FL|EXT3_NOATIME_FL|EXT3_DIRSYNC_FL);
if (flags & S_SYNC)
@@ -2589,6 +2594,7 @@ void ext3_get_inode_flags(struct ext3_in
ei->i_flags |= EXT3_NOATIME_FL;
if (flags & S_DIRSYNC)
ei->i_flags |= EXT3_DIRSYNC_FL;
+ spin_unlock(&ei->vfs_inode.i_lock);
}

void ext3_read_inode(struct inode * inode)
@@ -2781,7 +2787,9 @@ static int ext3_do_update_inode(handle_t
raw_inode->i_mtime = cpu_to_le32(inode->i_mtime.tv_sec);
raw_inode->i_blocks = cpu_to_le32(inode->i_blocks);
raw_inode->i_dtime = cpu_to_le32(ei->i_dtime);
+ spin_lock(&inode->i_lock);
raw_inode->i_flags = cpu_to_le32(ei->i_flags);
+ spin_unlock(&inode->i_lock);
#ifdef EXT3_FRAGMENTS
raw_inode->i_faddr = cpu_to_le32(ei->i_faddr);
raw_inode->i_frag = ei->i_frag_no;
@@ -3209,10 +3217,12 @@ int ext3_change_inode_journal_flag(struc
* the inode's in-core data-journaling state flag now.
*/

+ spin_lock(&inode->i_lock);
if (val)
EXT3_I(inode)->i_flags |= EXT3_JOURNAL_DATA_FL;
else
EXT3_I(inode)->i_flags &= ~EXT3_JOURNAL_DATA_FL;
+ spin_unlock(&inode->i_lock);
ext3_set_aops(inode);

journal_unlock_updates(journal);
diff -rupX /home/jack/.kerndiffexclude linux-2.6.23/fs/ext3/ioctl.c linux-2.6.23-1-i_flags_atomicity/fs/ext3/ioctl.c
--- linux-2.6.23/fs/ext3/ioctl.c 2007-10-11 12:01:23.000000000 +0200
+++ linux-2.6.23-1-i_flags_atomicity/fs/ext3/ioctl.c 2007-11-05 14:32:12.000000000 +0100
@@ -29,7 +29,9 @@ int ext3_ioctl (struct inode * inode, st
switch (cmd) {
case EXT3_IOC_GETFLAGS:
ext3_get_inode_flags(ei);
+ spin_lock(&inode->i_lock);
flags = ei->i_flags & EXT3_FL_USER_VISIBLE;
+ spin_unlock(&inode->i_lock);
return put_user(flags, (int __user *) arg);
case EXT3_IOC_SETFLAGS: {
handle_t *handle = NULL;
@@ -51,10 +53,19 @@ int ext3_ioctl (struct inode * inode, st
flags &= ~EXT3_DIRSYNC_FL;

mutex_lock(&inode->i_mutex);
- oldflags = ei->i_flags;
+ handle = ext3_journal_start(inode, 1);
+ if (IS_ERR(handle)) {
+ mutex_unlock(&inode->i_mutex);
+ return PTR_ERR(handle);
+ }
+ if (IS_SYNC(inode))
+ handle->h_sync = 1;
+ err = ext3_reserve_inode_write(handle, inode, &iloc);
+ if (err)
+ goto flags_err;

- /* The JOURNAL_DATA flag is modifiable only by root */
- jflag = flags & EXT3_JOURNAL_DATA_FL;
+ spin_lock(&inode->i_lock);
+ oldflags = ei->i_flags;

/*
* The IMMUTABLE and APPEND_ONLY flags can only be changed by
@@ -64,8 +75,9 @@ int ext3_ioctl (struct inode * inode, st
*/
if ((flags ^ oldflags) & (EXT3_APPEND_FL | EXT3_IMMUTABLE_FL)) {
if (!capable(CAP_LINUX_IMMUTABLE)) {
- mutex_unlock(&inode->i_mutex);
- return -EPERM;
+ spin_unlock(&inode->i_lock);
+ err = -EPERM;
+ goto flags_err;
}
}

@@ -73,28 +85,19 @@ int ext3_ioctl (struct inode * inode, st
* The JOURNAL_DATA flag can only be changed by
* the relevant capability.
*/
+ jflag = flags & EXT3_JOURNAL_DATA_FL;
if ((jflag ^ oldflags) & (EXT3_JOURNAL_DATA_FL)) {
if (!capable(CAP_SYS_RESOURCE)) {
- mutex_unlock(&inode->i_mutex);
- return -EPERM;
+ spin_unlock(&inode->i_lock);
+ err = -EPERM;
+ goto flags_err;
}
}

-
- handle = ext3_journal_start(inode, 1);
- if (IS_ERR(handle)) {
- mutex_unlock(&inode->i_mutex);
- return PTR_ERR(handle);
- }
- if (IS_SYNC(inode))
- handle->h_sync = 1;
- err = ext3_reserve_inode_write(handle, inode, &iloc);
- if (err)
- goto flags_err;
-
flags = flags & EXT3_FL_USER_MODIFIABLE;
flags |= oldflags & ~EXT3_FL_USER_MODIFIABLE;
ei->i_flags = flags;
+ spin_unlock(&inode->i_lock);

ext3_set_inode_flags(inode);
inode->i_ctime = CURRENT_TIME_SEC;
diff -rupX /home/jack/.kerndiffexclude linux-2.6.23/include/linux/ext3_fs.h linux-2.6.23-1-i_flags_atomicity/include/linux/ext3_fs.h
--- linux-2.6.23/include/linux/ext3_fs.h 2007-07-16 17:47:28.000000000 +0200
+++ linux-2.6.23-1-i_flags_atomicity/include/linux/ext3_fs.h 2007-11-05 14:31:44.000000000 +0100
@@ -514,6 +514,17 @@ static inline int ext3_valid_inum(struct
(ino >= EXT3_FIRST_INO(sb) &&
ino <= le32_to_cpu(EXT3_SB(sb)->s_es->s_inodes_count));
}
+
+static inline unsigned int ext3_test_inode_flags(struct inode *inode, u32 flags)
+{
+ unsigned int ret;
+
+ spin_lock(&inode->i_lock);
+ ret = EXT3_I(inode)->i_flags & flags;
+ spin_unlock(&inode->i_lock);
+ return ret;
+}
+
#else
/* Assume that user mode programs are passing in an ext3fs superblock, not
* a kernel struct super_block. This will allow us to call the feature-test
@@ -666,9 +677,18 @@ struct ext3_dir_entry_2 {
*/

#ifdef CONFIG_EXT3_INDEX
- #define is_dx(dir) (EXT3_HAS_COMPAT_FEATURE(dir->i_sb, \
- EXT3_FEATURE_COMPAT_DIR_INDEX) && \
- (EXT3_I(dir)->i_flags & EXT3_INDEX_FL))
+static inline int is_dx(struct inode *dir)
+{
+ int ret = 0;
+
+ if (EXT3_HAS_COMPAT_FEATURE(dir->i_sb, \
+ EXT3_FEATURE_COMPAT_DIR_INDEX)) {
+ spin_lock(&dir->i_lock);
+ ret = EXT3_I(dir)->i_flags & EXT3_INDEX_FL;
+ spin_unlock(&dir->i_lock);
+ }
+ return ret;
+}
#define EXT3_DIR_LINK_MAX(dir) (!is_dx(dir) && (dir)->i_nlink >= EXT3_LINK_MAX)
#define EXT3_DIR_LINK_EMPTY(dir) ((dir)->i_nlink == 2 || (dir)->i_nlink == 1)
#else
diff -rupX /home/jack/.kerndiffexclude linux-2.6.23/include/linux/ext3_fs_i.h linux-2.6.23-1-i_flags_atomicity/include/linux/ext3_fs_i.h
--- linux-2.6.23/include/linux/ext3_fs_i.h 2007-07-16 17:47:28.000000000 +0200
+++ linux-2.6.23-1-i_flags_atomicity/include/linux/ext3_fs_i.h 2007-11-05 14:26:43.000000000 +0100
@@ -69,7 +69,7 @@ struct ext3_block_alloc_info {
*/
struct ext3_inode_info {
__le32 i_data[15]; /* unconverted */
- __u32 i_flags;
+ __u32 i_flags; /* Guarded by inode->i_lock */
#ifdef EXT3_FRAGMENTS
__u32 i_faddr;
__u8 i_frag_no;

2007-11-06 17:19:19

by Jan Kara

[permalink] [raw]
Subject: [RFC] [PATCH 2/3] Recursive mtime for ext3

Make space reserved for fragments as unused as they were never implemented.
Remove also related initializations. We later use the space for recursive
mtime.

Signed-off-by: Jan Kara <[email protected]>

diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-1-i_flags_atomicity/fs/ext3/ialloc.c linux-2.6.23-2-make_flags_unused/fs/ext3/ialloc.c
--- linux-2.6.23-1-i_flags_atomicity/fs/ext3/ialloc.c 2007-11-05 14:14:50.000000000 +0100
+++ linux-2.6.23-2-make_flags_unused/fs/ext3/ialloc.c 2007-11-05 14:37:33.000000000 +0100
@@ -576,11 +576,6 @@ got:
/* dirsync only applies to directories */
if (!S_ISDIR(mode))
ei->i_flags &= ~EXT3_DIRSYNC_FL;
-#ifdef EXT3_FRAGMENTS
- ei->i_faddr = 0;
- ei->i_frag_no = 0;
- ei->i_frag_size = 0;
-#endif
ei->i_file_acl = 0;
ei->i_dir_acl = 0;
ei->i_dtime = 0;
diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-1-i_flags_atomicity/fs/ext3/inode.c linux-2.6.23-2-make_flags_unused/fs/ext3/inode.c
--- linux-2.6.23-1-i_flags_atomicity/fs/ext3/inode.c 2007-11-05 14:24:39.000000000 +0100
+++ linux-2.6.23-2-make_flags_unused/fs/ext3/inode.c 2007-11-05 14:38:05.000000000 +0100
@@ -2651,11 +2651,6 @@ void ext3_read_inode(struct inode * inod
}
inode->i_blocks = le32_to_cpu(raw_inode->i_blocks);
ei->i_flags = le32_to_cpu(raw_inode->i_flags);
-#ifdef EXT3_FRAGMENTS
- ei->i_faddr = le32_to_cpu(raw_inode->i_faddr);
- ei->i_frag_no = raw_inode->i_frag;
- ei->i_frag_size = raw_inode->i_fsize;
-#endif
ei->i_file_acl = le32_to_cpu(raw_inode->i_file_acl);
if (!S_ISREG(inode->i_mode)) {
ei->i_dir_acl = le32_to_cpu(raw_inode->i_dir_acl);
@@ -2790,11 +2785,6 @@ static int ext3_do_update_inode(handle_t
spin_lock(&inode->i_lock);
raw_inode->i_flags = cpu_to_le32(ei->i_flags);
spin_unlock(&inode->i_lock);
-#ifdef EXT3_FRAGMENTS
- raw_inode->i_faddr = cpu_to_le32(ei->i_faddr);
- raw_inode->i_frag = ei->i_frag_no;
- raw_inode->i_fsize = ei->i_frag_size;
-#endif
raw_inode->i_file_acl = cpu_to_le32(ei->i_file_acl);
if (!S_ISREG(inode->i_mode)) {
raw_inode->i_dir_acl = cpu_to_le32(ei->i_dir_acl);
diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-1-i_flags_atomicity/fs/ext3/super.c linux-2.6.23-2-make_flags_unused/fs/ext3/super.c
--- linux-2.6.23-1-i_flags_atomicity/fs/ext3/super.c 2007-11-05 15:04:19.000000000 +0100
+++ linux-2.6.23-2-make_flags_unused/fs/ext3/super.c 2007-11-05 15:01:37.000000000 +0100
@@ -1584,17 +1584,7 @@ static int ext3_fill_super (struct super
goto failed_mount;
}
}
- sbi->s_frag_size = EXT3_MIN_FRAG_SIZE <<
- le32_to_cpu(es->s_log_frag_size);
- if (blocksize != sbi->s_frag_size) {
- printk(KERN_ERR
- "EXT3-fs: fragsize %lu != blocksize %u (unsupported)\n",
- sbi->s_frag_size, blocksize);
- goto failed_mount;
- }
- sbi->s_frags_per_block = 1;
sbi->s_blocks_per_group = le32_to_cpu(es->s_blocks_per_group);
- sbi->s_frags_per_group = le32_to_cpu(es->s_frags_per_group);
sbi->s_inodes_per_group = le32_to_cpu(es->s_inodes_per_group);
if (EXT3_INODE_SIZE(sb) == 0)
goto cantfind_ext3;
@@ -1618,12 +1608,6 @@ static int ext3_fill_super (struct super
sbi->s_blocks_per_group);
goto failed_mount;
}
- if (sbi->s_frags_per_group > blocksize * 8) {
- printk (KERN_ERR
- "EXT3-fs: #fragments per group too big: %lu\n",
- sbi->s_frags_per_group);
- goto failed_mount;
- }
if (sbi->s_inodes_per_group > blocksize * 8) {
printk (KERN_ERR
"EXT3-fs: #inodes per group too big: %lu\n",
diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-1-i_flags_atomicity/include/linux/ext3_fs.h linux-2.6.23-2-make_flags_unused/include/linux/ext3_fs.h
--- linux-2.6.23-1-i_flags_atomicity/include/linux/ext3_fs.h 2007-11-05 14:31:44.000000000 +0100
+++ linux-2.6.23-2-make_flags_unused/include/linux/ext3_fs.h 2007-11-05 14:37:33.000000000 +0100
@@ -291,27 +291,24 @@ struct ext3_inode {
__le32 i_generation; /* File version (for NFS) */
__le32 i_file_acl; /* File ACL */
__le32 i_dir_acl; /* Directory ACL */
- __le32 i_faddr; /* Fragment address */
+ __le32 i_obsolete_faddr; /* Unused */
union {
struct {
- __u8 l_i_frag; /* Fragment number */
- __u8 l_i_fsize; /* Fragment size */
+ __u16 l_i_obsolete_frag; /* Unused */
__u16 i_pad1;
__le16 l_i_uid_high; /* these 2 fields */
__le16 l_i_gid_high; /* were reserved2[0] */
__u32 l_i_reserved2;
} linux2;
struct {
- __u8 h_i_frag; /* Fragment number */
- __u8 h_i_fsize; /* Fragment size */
+ __u16 h_i_obsolete_frag; /* Unused */
__u16 h_i_mode_high;
__u16 h_i_uid_high;
__u16 h_i_gid_high;
__u32 h_i_author;
} hurd2;
struct {
- __u8 m_i_frag; /* Fragment number */
- __u8 m_i_fsize; /* Fragment size */
+ __u16 m_i_obsolete_frag; /* Unused */
__u16 m_pad1;
__u32 m_i_reserved2[2];
} masix2;
diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-1-i_flags_atomicity/include/linux/ext3_fs_sb.h linux-2.6.23-2-make_flags_unused/include/linux/ext3_fs_sb.h
--- linux-2.6.23-1-i_flags_atomicity/include/linux/ext3_fs_sb.h 2007-10-11 12:01:28.000000000 +0200
+++ linux-2.6.23-2-make_flags_unused/include/linux/ext3_fs_sb.h 2007-11-05 14:50:55.000000000 +0100
@@ -28,10 +28,7 @@
* third extended-fs super-block data in memory
*/
struct ext3_sb_info {
- unsigned long s_frag_size; /* Size of a fragment in bytes */
- unsigned long s_frags_per_block;/* Number of fragments per block */
unsigned long s_inodes_per_block;/* Number of inodes per block */
- unsigned long s_frags_per_group;/* Number of fragments in a group */
unsigned long s_blocks_per_group;/* Number of blocks in a group */
unsigned long s_inodes_per_group;/* Number of inodes in a group */
unsigned long s_itb_per_group; /* Number of inode table blocks per group */

2007-11-06 17:19:47

by Jan Kara

[permalink] [raw]
Subject: [RFC] [PATCH 3/3] Recursive mtime for ext3

Implement recursive mtime (rtime) feature for ext3. The feature works as
follows: In each directory we keep a flag EXT3_RTIME_FL (modifiable by a user)
whether rtime should be updated. In case a directory or a file in it is
modified and when the flag is set, directory's rtime is updated, the flag is
cleared, and we move to the parent. If the flag is set there, we clear it,
update rtime and continue upwards upto the root of the filesystem. In case a
regular file or symlink is modified, we pick arbitrary of its parents (actually
the one that happens to be at the head of i_dentry list) and start the rtime
update algorith there.

As the flag is always cleared after updating rtime and we don't climb up the
tree if the flag is cleared, we have constant amortized complexity of rtime
updates. That's for theoretical time consumption ;) Practically, there's no
measurable performance impact for a test case like: touch every file in a
kernel tree where every directory has RTIME flag set.

Intended use case is that application which wants to watch any modification in
a subtree scans the subtree and sets flags for all inodes there. Next time, it
just needs to recurse in directories having rtime newer than the start of the
previous scan. There it can handle modifications and set the flag again. It is
up to application to watch out for hardlinked files. It can e.g. build their
list and check their mtime separately (when a hardlink to a file is created its
inode is modified and rtimes properly updated and thus any application has an
effective way of finding new hardlinked files).

Signed-off-by: Jan Kara <[email protected]>

diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-2-ext3_make_frags_unused/fs/ext3/ialloc.c linux-2.6.23-3-ext3_recursive_mtime/fs/ext3/ialloc.c
--- linux-2.6.23-2-ext3_make_frags_unused/fs/ext3/ialloc.c 2007-11-05 16:58:10.000000000 +0100
+++ linux-2.6.23-3-ext3_recursive_mtime/fs/ext3/ialloc.c 2007-11-05 16:58:53.000000000 +0100
@@ -569,7 +569,7 @@ got:
/* Guard reading of directory's i_flags, created inode is safe as
* noone has a reference to it yet */
spin_lock(&dir->i_lock);
- ei->i_flags = EXT3_I(dir)->i_flags & ~EXT3_INDEX_FL;
+ ei->i_flags = EXT3_I(dir)->i_flags & ~(EXT3_INDEX_FL | EXT3_RTIME_FL);
spin_unlock(&dir->i_lock);
if (S_ISLNK(mode))
ei->i_flags &= ~(EXT3_IMMUTABLE_FL|EXT3_APPEND_FL);
@@ -579,6 +579,7 @@ got:
ei->i_file_acl = 0;
ei->i_dir_acl = 0;
ei->i_dtime = 0;
+ ei->i_rtime = inode->i_mtime.tv_sec;
ei->i_block_alloc_info = NULL;
ei->i_block_group = group;

diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-2-ext3_make_frags_unused/fs/ext3/inode.c linux-2.6.23-3-ext3_recursive_mtime/fs/ext3/inode.c
--- linux-2.6.23-2-ext3_make_frags_unused/fs/ext3/inode.c 2007-11-05 16:58:10.000000000 +0100
+++ linux-2.6.23-3-ext3_recursive_mtime/fs/ext3/inode.c 2007-11-06 16:13:50.000000000 +0100
@@ -1232,6 +1232,8 @@ static int ext3_ordered_commit_write(str
ret2 = ext3_journal_stop(handle);
if (!ret)
ret = ret2;
+ if (!ret)
+ ext3_update_rtimes(inode);
return ret;
}

@@ -1255,6 +1257,8 @@ static int ext3_writeback_commit_write(s
ret2 = ext3_journal_stop(handle);
if (!ret)
ret = ret2;
+ if (!ret)
+ ext3_update_rtimes(inode);
return ret;
}

@@ -1288,6 +1292,8 @@ static int ext3_journalled_commit_write(
ret2 = ext3_journal_stop(handle);
if (!ret)
ret = ret2;
+ if (!ret)
+ ext3_update_rtimes(inode);
return ret;
}

@@ -2386,6 +2392,10 @@ out_stop:
ext3_orphan_del(handle, inode);

ext3_journal_stop(handle);
+ /* We update time only for linked inodes. Unlinked ones already
+ * notified parent during unlink... */
+ if (inode->i_nlink)
+ ext3_update_rtimes(inode);
}

static ext3_fsblk_t ext3_get_inode_block(struct super_block *sb,
@@ -2628,6 +2638,8 @@ void ext3_read_inode(struct inode * inod
inode->i_ctime.tv_sec = (signed)le32_to_cpu(raw_inode->i_ctime);
inode->i_mtime.tv_sec = (signed)le32_to_cpu(raw_inode->i_mtime);
inode->i_atime.tv_nsec = inode->i_ctime.tv_nsec = inode->i_mtime.tv_nsec = 0;
+ if (EXT3_HAS_COMPAT_FEATURE(inode->i_sb, EXT3_FEATURE_COMPAT_RTIME))
+ ei->i_rtime = le32_to_cpu(raw_inode->i_rtime);

ei->i_state = 0;
ei->i_dir_start_lookup = 0;
@@ -2780,6 +2792,8 @@ static int ext3_do_update_inode(handle_t
raw_inode->i_atime = cpu_to_le32(inode->i_atime.tv_sec);
raw_inode->i_ctime = cpu_to_le32(inode->i_ctime.tv_sec);
raw_inode->i_mtime = cpu_to_le32(inode->i_mtime.tv_sec);
+ if (EXT3_HAS_COMPAT_FEATURE(inode->i_sb, EXT3_FEATURE_COMPAT_RTIME))
+ raw_inode->i_rtime = cpu_to_le32(ei->i_rtime);
raw_inode->i_blocks = cpu_to_le32(inode->i_blocks);
raw_inode->i_dtime = cpu_to_le32(ei->i_dtime);
spin_lock(&inode->i_lock);
@@ -2978,6 +2992,8 @@ int ext3_setattr(struct dentry *dentry,

if (!rc && (ia_valid & ATTR_MODE))
rc = ext3_acl_chmod(inode);
+ if (!rc)
+ ext3_update_rtimes(inode);

err_out:
ext3_std_error(inode->i_sb, error);
@@ -3129,6 +3145,7 @@ void ext3_dirty_inode(struct inode *inod
handle_t *current_handle = ext3_journal_current_handle();
handle_t *handle;

+ /* Reserve 2 blocks for inode and superblock */
handle = ext3_journal_start(inode, 2);
if (IS_ERR(handle))
goto out;
diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-2-ext3_make_frags_unused/fs/ext3/ioctl.c linux-2.6.23-3-ext3_recursive_mtime/fs/ext3/ioctl.c
--- linux-2.6.23-2-ext3_make_frags_unused/fs/ext3/ioctl.c 2007-11-05 15:42:03.000000000 +0100
+++ linux-2.6.23-3-ext3_recursive_mtime/fs/ext3/ioctl.c 2007-11-05 16:58:53.000000000 +0100
@@ -23,10 +23,20 @@ int ext3_ioctl (struct inode * inode, st
struct ext3_inode_info *ei = EXT3_I(inode);
unsigned int flags;
unsigned short rsv_window_size;
+ unsigned int rtime;

ext3_debug ("cmd = %u, arg = %lu\n", cmd, arg);

switch (cmd) {
+ case EXT3_IOC_GETRTIME:
+ if (!test_opt(inode->i_sb, RTIME))
+ return -ENOTSUPP;
+ if (!S_ISDIR(inode->i_mode))
+ return -ENOTDIR;
+ spin_lock(&inode->i_lock);
+ rtime = ei->i_rtime;
+ spin_unlock(&inode->i_lock);
+ return put_user(rtime, (unsigned int __user *) arg);
case EXT3_IOC_GETFLAGS:
ext3_get_inode_flags(ei);
spin_lock(&inode->i_lock);
@@ -49,8 +59,10 @@ int ext3_ioctl (struct inode * inode, st
if (get_user(flags, (int __user *) arg))
return -EFAULT;

- if (!S_ISDIR(inode->i_mode))
+ if (!S_ISDIR(inode->i_mode)) {
flags &= ~EXT3_DIRSYNC_FL;
+ flags &= ~EXT3_RTIME_FL;
+ }

mutex_lock(&inode->i_mutex);
handle = ext3_journal_start(inode, 1);
diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-2-ext3_make_frags_unused/fs/ext3/namei.c linux-2.6.23-3-ext3_recursive_mtime/fs/ext3/namei.c
--- linux-2.6.23-2-ext3_make_frags_unused/fs/ext3/namei.c 2007-10-09 22:31:38.000000000 +0200
+++ linux-2.6.23-3-ext3_recursive_mtime/fs/ext3/namei.c 2007-11-05 16:58:53.000000000 +0100
@@ -65,6 +65,59 @@ static struct buffer_head *ext3_append(h
return bh;
}

+/* We don't want to get new handle for every inode updated. Thus we batch
+ * updates of this many inodes into one transaction */
+#define RTIME_UPDATES_PER_TRANS 16
+
+/* Walk up the directory tree and modify rtimes.
+ * We journal i_rtime updates into a separate transaction - we don't guarantee
+ * consistency between other inode times and rtime. Only consistency between
+ * i_flags and i_rtime. */
+int __ext3_update_rtimes(struct inode *inode)
+{
+ struct dentry *dentry = list_entry(inode->i_dentry.next, struct dentry,
+ d_alias);
+ handle_t *handle;
+ int updates = 0;
+ int err = 0;
+
+ /* We should not have any transaction started - noone knows how many
+ * inode updates will be needed */
+ WARN_ON(ext3_journal_current_handle() != NULL);
+ if (!S_ISDIR(inode->i_mode)) {
+ dentry = dentry->d_parent;
+ inode = dentry->d_inode;
+ }
+ while (ext3_test_inode_flags(inode, EXT3_RTIME_FL)) {
+ if (!updates) {
+ /* For inode updates + superblock */
+ handle = ext3_journal_start(inode, RTIME_UPDATES_PER_TRANS + 1);
+ if (IS_ERR(handle))
+ return PTR_ERR(handle);
+ updates = RTIME_UPDATES_PER_TRANS;
+ }
+
+ spin_lock(&inode->i_lock);
+ EXT3_I(inode)->i_rtime = get_seconds();
+ EXT3_I(inode)->i_flags &= ~EXT3_RTIME_FL;
+ spin_unlock(&inode->i_lock);
+ ext3_mark_inode_dirty(handle, inode);
+ if (!--updates) {
+ err = ext3_journal_stop(handle);
+ if (err)
+ return err;
+ }
+
+ if (dentry == inode->i_sb->s_root)
+ break;
+ dentry = dentry->d_parent;
+ inode = dentry->d_inode;
+ }
+ if (updates)
+ err = ext3_journal_stop(handle);
+ return err;
+}
+
#ifndef assert
#define assert(test) J_ASSERT(test)
#endif
@@ -1738,6 +1791,8 @@ retry:
ext3_journal_stop(handle);
if (err == -ENOSPC && ext3_should_retry_alloc(dir->i_sb, &retries))
goto retry;
+ if (!err)
+ ext3_update_rtimes(dir);
return err;
}

@@ -1773,6 +1828,8 @@ retry:
ext3_journal_stop(handle);
if (err == -ENOSPC && ext3_should_retry_alloc(dir->i_sb, &retries))
goto retry;
+ if (!err)
+ ext3_update_rtimes(dir);
return err;
}

@@ -1847,6 +1904,8 @@ out_stop:
ext3_journal_stop(handle);
if (err == -ENOSPC && ext3_should_retry_alloc(dir->i_sb, &retries))
goto retry;
+ if (!err)
+ ext3_update_rtimes(dir);
return err;
}

@@ -2123,6 +2182,8 @@ static int ext3_rmdir (struct inode * di

end_rmdir:
ext3_journal_stop(handle);
+ if (!retval)
+ ext3_update_rtimes(dir);
brelse (bh);
return retval;
}
@@ -2177,6 +2238,8 @@ static int ext3_unlink(struct inode * di

end_unlink:
ext3_journal_stop(handle);
+ if (!retval)
+ ext3_update_rtimes(dir);
brelse (bh);
return retval;
}
@@ -2234,6 +2297,8 @@ out_stop:
ext3_journal_stop(handle);
if (err == -ENOSPC && ext3_should_retry_alloc(dir->i_sb, &retries))
goto retry;
+ if (!err)
+ ext3_update_rtimes(dir);
return err;
}

@@ -2270,6 +2335,10 @@ retry:
ext3_journal_stop(handle);
if (err == -ENOSPC && ext3_should_retry_alloc(dir->i_sb, &retries))
goto retry;
+ if (!err) {
+ ext3_update_rtimes(dir);
+ ext3_update_rtimes(inode);
+ }
return err;
}

@@ -2429,6 +2498,10 @@ end_rename:
brelse (old_bh);
brelse (new_bh);
ext3_journal_stop(handle);
+ if (!retval) {
+ ext3_update_rtimes(old_dir);
+ ext3_update_rtimes(new_dir);
+ }
return retval;
}

diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-2-ext3_make_frags_unused/fs/ext3/super.c linux-2.6.23-3-ext3_recursive_mtime/fs/ext3/super.c
--- linux-2.6.23-2-ext3_make_frags_unused/fs/ext3/super.c 2007-11-05 16:58:10.000000000 +0100
+++ linux-2.6.23-3-ext3_recursive_mtime/fs/ext3/super.c 2007-11-05 16:58:53.000000000 +0100
@@ -684,7 +684,7 @@ enum {
Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota,
- Opt_grpquota
+ Opt_grpquota, Opt_rtime
};

static match_table_t tokens = {
@@ -734,6 +734,7 @@ static match_table_t tokens = {
{Opt_quota, "quota"},
{Opt_usrquota, "usrquota"},
{Opt_barrier, "barrier=%u"},
+ {Opt_rtime, "rtime"},
{Opt_err, NULL},
{Opt_resize, "resize"},
};
@@ -1066,6 +1067,14 @@ clear_qf_name:
case Opt_bh:
clear_opt(sbi->s_mount_opt, NOBH);
break;
+ case Opt_rtime:
+ if (!EXT3_HAS_COMPAT_FEATURE(sb, EXT3_FEATURE_COMPAT_RTIME)) {
+ printk("EXT3-fs: rtime option available only "
+ "if superblock has RTIME feature.\n");
+ return 0;
+ }
+ set_opt(sbi->s_mount_opt, RTIME);
+ break;
default:
printk (KERN_ERR
"EXT3-fs: Unrecognized mount option \"%s\" "
diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-2-ext3_make_frags_unused/include/linux/ext3_fs.h linux-2.6.23-3-ext3_recursive_mtime/include/linux/ext3_fs.h
--- linux-2.6.23-2-ext3_make_frags_unused/include/linux/ext3_fs.h 2007-11-05 16:58:10.000000000 +0100
+++ linux-2.6.23-3-ext3_recursive_mtime/include/linux/ext3_fs.h 2007-11-06 16:34:43.000000000 +0100
@@ -177,10 +177,11 @@ struct ext3_group_desc
#define EXT3_NOTAIL_FL 0x00008000 /* file tail should not be merged */
#define EXT3_DIRSYNC_FL 0x00010000 /* dirsync behaviour (directories only) */
#define EXT3_TOPDIR_FL 0x00020000 /* Top of directory hierarchies*/
+#define EXT3_RTIME_FL 0x00100000 /* Update recursive mtime (directories only) */
#define EXT3_RESERVED_FL 0x80000000 /* reserved for ext3 lib */

-#define EXT3_FL_USER_VISIBLE 0x0003DFFF /* User visible flags */
-#define EXT3_FL_USER_MODIFIABLE 0x000380FF /* User modifiable flags */
+#define EXT3_FL_USER_VISIBLE 0x0013DFFF /* User visible flags */
+#define EXT3_FL_USER_MODIFIABLE 0x001380FF /* User modifiable flags */

/*
* Inode dynamic state flags
@@ -229,6 +230,7 @@ struct ext3_new_group_data {
#endif
#define EXT3_IOC_GETRSVSZ _IOR('f', 5, long)
#define EXT3_IOC_SETRSVSZ _IOW('f', 6, long)
+#define EXT3_IOC_GETRTIME _IOR('f', 9, unsigned int)

/*
* ioctl commands in 32 bit emulation
@@ -291,7 +293,7 @@ struct ext3_inode {
__le32 i_generation; /* File version (for NFS) */
__le32 i_file_acl; /* File ACL */
__le32 i_dir_acl; /* Directory ACL */
- __le32 i_obsolete_faddr; /* Unused */
+ __le32 i_rtime; /* Recursive Modification Time - directories only */
union {
struct {
__u16 l_i_obsolete_frag; /* Unused */
@@ -381,6 +383,7 @@ struct ext3_inode {
#define EXT3_MOUNT_QUOTA 0x80000 /* Some quota option set */
#define EXT3_MOUNT_USRQUOTA 0x100000 /* "old" user quota */
#define EXT3_MOUNT_GRPQUOTA 0x200000 /* "old" group quota */
+#define EXT3_MOUNT_RTIME 0x400000 /* Update rtime */

/* Compatibility, for having both ext2_fs.h and ext3_fs.h included at once */
#ifndef _LINUX_EXT2_FS_H
@@ -580,6 +583,7 @@ static inline unsigned int ext3_test_ino
#define EXT3_FEATURE_COMPAT_EXT_ATTR 0x0008
#define EXT3_FEATURE_COMPAT_RESIZE_INODE 0x0010
#define EXT3_FEATURE_COMPAT_DIR_INDEX 0x0020
+#define EXT3_FEATURE_COMPAT_RTIME 0x0080

#define EXT3_FEATURE_RO_COMPAT_SPARSE_SUPER 0x0001
#define EXT3_FEATURE_RO_COMPAT_LARGE_FILE 0x0002
@@ -854,6 +858,13 @@ extern int ext3_orphan_add(handle_t *, s
extern int ext3_orphan_del(handle_t *, struct inode *);
extern int ext3_htree_fill_tree(struct file *dir_file, __u32 start_hash,
__u32 start_minor_hash, __u32 *next_hash);
+extern int __ext3_update_rtimes(struct inode *inode);
+static inline int ext3_update_rtimes(struct inode *inode)
+{
+ if (test_opt(inode->i_sb, RTIME))
+ return __ext3_update_rtimes(inode);
+ return 0;
+}

/* resize.c */
extern int ext3_group_add(struct super_block *sb,
diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-2-ext3_make_frags_unused/include/linux/ext3_fs_i.h linux-2.6.23-3-ext3_recursive_mtime/include/linux/ext3_fs_i.h
--- linux-2.6.23-2-ext3_make_frags_unused/include/linux/ext3_fs_i.h 2007-11-05 15:42:03.000000000 +0100
+++ linux-2.6.23-3-ext3_recursive_mtime/include/linux/ext3_fs_i.h 2007-11-05 16:58:53.000000000 +0100
@@ -78,6 +78,7 @@ struct ext3_inode_info {
ext3_fsblk_t i_file_acl;
__u32 i_dir_acl;
__u32 i_dtime;
+ __u32 i_rtime;

/*
* i_block_group is the number of the block group which contains

2007-11-06 17:41:59

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [RFC] [PATCH 3/3] Recursive mtime for ext3

On Tue, 6 Nov 2007 18:19:45 +0100
Jan Kara <[email protected]> wrote:

> Implement recursive mtime (rtime) feature for ext3. The feature works
> as follows: In each directory we keep a flag EXT3_RTIME_FL
> (modifiable by a user) whether rtime should be updated. In case a
> directory or a file in it is modified and when the flag is set,
> directory's rtime is updated, the flag is cleared, and we move to the
> parent. If the flag is set there, we clear it, update rtime and
> continue upwards upto the root of the filesystem. In case a regular
> file or symlink is modified, we pick arbitrary of its parents
> (actually the one that happens to be at the head of i_dentry list)
> and start the rtime update algorith there.

Ok since mtime (and rtime) are part of the inode and not the dentry...
how do you deal with hardlinks? And with cases of files that have been
unlinked? (ok the later is a wash obviously other than not crashing)

--
If you want to reach me at my work email, use [email protected]
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2007-11-06 18:01:02

by Al Viro

[permalink] [raw]
Subject: Re: [RFC] [PATCH 3/3] Recursive mtime for ext3

On Tue, Nov 06, 2007 at 06:19:45PM +0100, Jan Kara wrote:
> Implement recursive mtime (rtime) feature for ext3. The feature works as
> follows: In each directory we keep a flag EXT3_RTIME_FL (modifiable by a user)
> whether rtime should be updated. In case a directory or a file in it is
> modified and when the flag is set, directory's rtime is updated, the flag is
> cleared, and we move to the parent. If the flag is set there, we clear it,
> update rtime and continue upwards upto the root of the filesystem. In case a
> regular file or symlink is modified, we pick arbitrary of its parents (actually
> the one that happens to be at the head of i_dentry list) and start the rtime
> update algorith there.

*ewwww*

Nothing like undeterministic behaviour, is there?

> Intended use case is that application which wants to watch any modification in
> a subtree scans the subtree and sets flags for all inodes there. Next time, it
> just needs to recurse in directories having rtime newer than the start of the
> previous scan. There it can handle modifications and set the flag again. It is
> up to application to watch out for hardlinked files. It can e.g. build their
> list and check their mtime separately (when a hardlink to a file is created its
> inode is modified and rtimes properly updated and thus any application has an
> effective way of finding new hardlinked files).

You know, you can do that with aush^H^Hdit right now...

2007-11-06 18:08:13

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC] [PATCH 3/3] Recursive mtime for ext3

Arjan van de Ven wrote:
> On Tue, 6 Nov 2007 18:19:45 +0100
> Jan Kara <[email protected]> wrote:
>
>> Implement recursive mtime (rtime) feature for ext3. The feature works
>> as follows: In each directory we keep a flag EXT3_RTIME_FL
>> (modifiable by a user) whether rtime should be updated. In case a
>> directory or a file in it is modified and when the flag is set,
>> directory's rtime is updated, the flag is cleared, and we move to the
>> parent. If the flag is set there, we clear it, update rtime and
>> continue upwards upto the root of the filesystem. In case a regular
>> file or symlink is modified, we pick arbitrary of its parents
>> (actually the one that happens to be at the head of i_dentry list)
>> and start the rtime update algorith there.
>
> Ok since mtime (and rtime) are part of the inode and not the dentry...
> how do you deal with hardlinks? And with cases of files that have been
> unlinked? (ok the later is a wash obviously other than not crashing)
>

There is only one possible answer... he only updates the directory path
that was used to touch the particular file involved. Thus, the
semantics gets grotty not just in the presence of hard links, but also
in the presence of bind- and other non-root mounts.

-hpa

2007-11-06 19:41:03

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC] [PATCH 3/3] Recursive mtime for ext3

On Tue, Nov 06, 2007 at 06:19:45PM +0100, Jan Kara wrote:
> Intended use case is that application which wants to watch any
> modification in a subtree scans the subtree and sets flags for all
> inodes there. Next time, it just needs to recurse in directories
> having rtime newer than the start of the previous scan. There it can
> handle modifications and set the flag again. It is up to application
> to watch out for hardlinked files. It can e.g. build their list and
> check their mtime separately (when a hardlink to a file is created
> its inode is modified and rtimes properly updated and thus any
> application has an effective way of finding new hardlinked files).

Umm, yuck.

What if more than one application wants to use this facility?

The application is using a global per-inode flag that is written out
to disk. So sweeping the entire subtree and setting this flag will
involve a lot of disk i/o; as does setting a mod-time, since it could
potentially require a large number of inode updates, and then the
application needs to sweep through the subtree and reset the flags
(resulting in more disk i/o). The performance would seem to me to be
really pessimal.

In addition, after you crash, there might not be any application
waiting to watch modifications in that subtree, and yet the flags
would still be set so the system would still be paying the performance
penalties of needing to propagate modtimes until all of the flags
disappear --- and for a large subtree, that might not be for a long,
long time.

So if the goal is some kind of modification notification system that
watches a subtree efficiently, avoiding some of the deficiencies of
inotify and dnotify, the interface doesn't seem to be the right way to
go about things. The fact that only one application at a time can use
this interface, even if you ignore the issues of hard links and the
performance problems and the lack of cleanup after a reboot, seems in
my mind to just be a irreparable fatal flaw to this particular scheme.

Regards,

- Ted

2007-11-07 11:51:13

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC] [PATCH 3/3] Recursive mtime for ext3

On Tue 06-11-07 10:04:47, H. Peter Anvin wrote:
> Arjan van de Ven wrote:
> >On Tue, 6 Nov 2007 18:19:45 +0100
> >Jan Kara <[email protected]> wrote:
> >
> >>Implement recursive mtime (rtime) feature for ext3. The feature works
> >>as follows: In each directory we keep a flag EXT3_RTIME_FL
> >>(modifiable by a user) whether rtime should be updated. In case a
> >>directory or a file in it is modified and when the flag is set,
> >>directory's rtime is updated, the flag is cleared, and we move to the
> >>parent. If the flag is set there, we clear it, update rtime and
> >>continue upwards upto the root of the filesystem. In case a regular
> >>file or symlink is modified, we pick arbitrary of its parents
> >>(actually the one that happens to be at the head of i_dentry list)
> >>and start the rtime update algorith there.
> >
> >Ok since mtime (and rtime) are part of the inode and not the dentry...
> >how do you deal with hardlinks? And with cases of files that have been
> >unlinked? (ok the later is a wash obviously other than not crashing)
Unlinked files are easy - you just don't propagate the rtime anywhere.
For hardlinks see below.

> There is only one possible answer... he only updates the directory path
> that was used to touch the particular file involved. Thus, the
> semantics gets grotty not just in the presence of hard links, but also
> in the presence of bind- and other non-root mounts.
Update of recursive mtime does not pass filesystem boundaries (i.e.
mountpoints) so bind mounts and such are non-issue (hmm, at least that was
my original idea but as I'm looking now I don't handle bind mounts properly
so that needs to be fixed). With hardlinks, you are right that the
behaviour is undeterministic - I tried to argue in the text of the mail
that this does not actually matter - there are not many hardlinks on usual
system and so the application can check hardlinked files in the old way -
i.e. look at mtime.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2007-11-07 14:36:06

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC] [PATCH 3/3] Recursive mtime for ext3

On Tue 06-11-07 14:40:12, Theodore Tso wrote:
> On Tue, Nov 06, 2007 at 06:19:45PM +0100, Jan Kara wrote:
> > Intended use case is that application which wants to watch any
> > modification in a subtree scans the subtree and sets flags for all
> > inodes there. Next time, it just needs to recurse in directories
> > having rtime newer than the start of the previous scan. There it can
> > handle modifications and set the flag again. It is up to application
> > to watch out for hardlinked files. It can e.g. build their list and
> > check their mtime separately (when a hardlink to a file is created
> > its inode is modified and rtimes properly updated and thus any
> > application has an effective way of finding new hardlinked files).
>
> Umm, yuck.
>
> What if more than one application wants to use this facility?
That should be fine - let's see: Each application keeps somewhere a time when
it started a scan of a subtree (or it can actually remember a time when it
set the flag for each directory), during the scan, it sets the flag on
each directory. When it wakes up to recheck the subtree it just compares
the rtime against the stored time - if rtime is greater, subtree has been
modified since the last scan and we recurse in it and when we are finished
with it we set the flag. Now notice that we don't care about the flag when
we check for changes - we care only for rtime - so if there are several
applications interested in the same subtree, the flag just gets set more
often and thus the update of rtime happens more often but the same scheme
still works fine.

> The application is using a global per-inode flag that is written out
> to disk. So sweeping the entire subtree and setting this flag will
> involve a lot of disk i/o; as does setting a mod-time, since it could
> potentially require a large number of inode updates, and then the
> application needs to sweep through the subtree and reset the flags
> (resulting in more disk i/o). The performance would seem to me to be
> really pessimal.
I don't get it here - you need to scan the whole subtree and set the flag
only during the initial scan. Later, you need to scan and set the flag only
for directories in whose subtree something changed. Similarty rtime needs
to be updated for each inode at most once after the scan. Maybe we have
different different ideas of use-cases: I consider this useful for larger
subtrees which change only seldom (or only their small parts) or you want
to check for changes only once per some longer time - so uses like backup
with rsync, updatedb, cachefiles for trees with config files (like KDE has)
etc. There the penalty for additional IO is during rtime updates is quite
negligible - if you have some usecase you'd like to measure, please propose
it and I'll measure it. I have tested the following:
Create a tree of depth 5 where each directory has 5 subdirectories and
the leaf directories have 10 files in it. You set the flag on all
directories (umount and mount again) and then touch one file in every directory.
With the feature enabled this takes 36.1176s (average from 5 tests) with
deviation 0.29509. Without the feature it takes 35.75480 with deviation
0.15433. So the difference in performance is 1% which is just slightly
above the error and I'd find this test case quite pesimistic for the
intended usage...

> In addition, after you crash, there might not be any application
> waiting to watch modifications in that subtree, and yet the flags
> would still be set so the system would still be paying the performance
> penalties of needing to propagate modtimes until all of the flags
> disappear --- and for a large subtree, that might not be for a long,
> long time.
I don't quite understand what you are afraid here - I think we
misunderstand a bit - are you aware that we don't propagate the
modification up once we hit a directory with a flag not set - hence all
possible updates in future will write each inode at most once?

> So if the goal is some kind of modification notification system that
> watches a subtree efficiently, avoiding some of the deficiencies of
> inotify and dnotify, the interface doesn't seem to be the right way to
> go about things. The fact that only one application at a time can use
> this interface, even if you ignore the issues of hard links and the
> performance problems and the lack of cleanup after a reboot, seems in
> my mind to just be a irreparable fatal flaw to this particular scheme.
I hope I've refuted most of your objections ;) Thanks for having look
at the feature.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2007-11-07 14:54:11

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC] [PATCH 3/3] Recursive mtime for ext3

On Tue 06-11-07 18:01:00, Al Viro wrote:
> On Tue, Nov 06, 2007 at 06:19:45PM +0100, Jan Kara wrote:
> > Implement recursive mtime (rtime) feature for ext3. The feature works as
> > follows: In each directory we keep a flag EXT3_RTIME_FL (modifiable by a user)
> > whether rtime should be updated. In case a directory or a file in it is
> > modified and when the flag is set, directory's rtime is updated, the flag is
> > cleared, and we move to the parent. If the flag is set there, we clear it,
> > update rtime and continue upwards upto the root of the filesystem. In case a
> > regular file or symlink is modified, we pick arbitrary of its parents (actually
> > the one that happens to be at the head of i_dentry list) and start the rtime
> > update algorith there.
>
> *ewwww*
>
> Nothing like undeterministic behaviour, is there?
Oh yes, there is :) But I tried to argue it does not really matter -
application would have to handle hardlinks in a special way but I find that
acceptable given how rare they are...

> > Intended use case is that application which wants to watch any modification in
> > a subtree scans the subtree and sets flags for all inodes there. Next time, it
> > just needs to recurse in directories having rtime newer than the start of the
> > previous scan. There it can handle modifications and set the flag again. It is
> > up to application to watch out for hardlinked files. It can e.g. build their
> > list and check their mtime separately (when a hardlink to a file is created its
> > inode is modified and rtimes properly updated and thus any application has an
> > effective way of finding new hardlinked files).
>
> You know, you can do that with aush^H^Hdit right now...
Interesting idea, no I have not thought about this. I guess you mean
watching all the VFS modification events and then do the checking and propagation
from user space... My first feeling is that the performance penalty would be
considerably higher (currently I am at 1% performance penalty for quite
pessimistic test case) but in case the current patch would be considered
unacceptable, I can try how large the penalty would be. Thanks for
suggestion.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2007-11-08 00:21:34

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC] [PATCH 3/3] Recursive mtime for ext3

On Wed, Nov 07, 2007 at 03:36:05PM +0100, Jan Kara wrote:
> > What if more than one application wants to use this facility?
>
> That should be fine - let's see: Each application keeps somewhere a time when
> it started a scan of a subtree (or it can actually remember a time when it
> set the flag for each directory), during the scan, it sets the flag on
> each directory. When it wakes up to recheck the subtree it just compares
> the rtime against the stored time - if rtime is greater, subtree has been
> modified since the last scan and we recurse in it and when we are finished
> with it we set the flag. Now notice that we don't care about the flag when
> we check for changes - we care only for rtime - so if there are several
> applications interested in the same subtree, the flag just gets set more
> often and thus the update of rtime happens more often but the same scheme
> still works fine.

OK, so in this case you don't need to set rtime on the every single
file inode, but only directory inode, right? Because you're only
using checking the rtime at the directory level, and not the flag.
And it's just as easy for you to check the rtime flag for the file's
containing directory (modulo magic vis-a-vis hard links) as the file's
inode.

I'm just really wishing that rtime and the rtime flag didn't have live
on disk, but could rather be in memory. If you only needed to save
the directory flags and rtimes, that might actually be doable.

Note by the way that since you need to own the file/directory to set
flags, this means that only programs that are running as root or
running as the uid who owns the entire subtree will be able to use
this scheme. One advantage of doing in kernel memory is that you
might be able to support watching a tree that is not owned by the
watcher.

> I don't get it here - you need to scan the whole subtree and set the flag
> only during the initial scan. Later, you need to scan and set the flag only
> for directories in whose subtree something changed. Similarty rtime needs
> to be updated for each inode at most once after the scan.

OK, so in the worst case every single file in a kernel source tree
might change after doing an extreme git checkout. That means around
36k of files get updated. So if you have to set/clear the rtime flag
during the checkout process 36k file inodes would have to have their
rtime flag cleared, plus 2k worth of directory inodes; but those would
probably be folded into other changes made to the inodes anyway. But
then when trackerd goes back and scans the subtree, if you are
actually setting rtime flags for every single file inode, then that's
38k of indoes that need updating. If you only need to set the rtime
flags for directories, that's only 2k worth of extra gratuitous inode
updates.

- Ted

2007-11-08 10:56:45

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC] [PATCH 3/3] Recursive mtime for ext3

On Wed 07-11-07 19:20:38, Theodore Tso wrote:
> On Wed, Nov 07, 2007 at 03:36:05PM +0100, Jan Kara wrote:
> > > What if more than one application wants to use this facility?
> >
> > That should be fine - let's see: Each application keeps somewhere a time when
> > it started a scan of a subtree (or it can actually remember a time when it
> > set the flag for each directory), during the scan, it sets the flag on
> > each directory. When it wakes up to recheck the subtree it just compares
> > the rtime against the stored time - if rtime is greater, subtree has been
> > modified since the last scan and we recurse in it and when we are finished
> > with it we set the flag. Now notice that we don't care about the flag when
> > we check for changes - we care only for rtime - so if there are several
> > applications interested in the same subtree, the flag just gets set more
> > often and thus the update of rtime happens more often but the same scheme
> > still works fine.
>
> OK, so in this case you don't need to set rtime on the every single
> file inode, but only directory inode, right? Because you're only
Yes, that's actually what I'm doing - sorry if I didn't make it clear
earlier.

> using checking the rtime at the directory level, and not the flag.
> And it's just as easy for you to check the rtime flag for the file's
> containing directory (modulo magic vis-a-vis hard links) as the file's
> inode.
Exactly.

> I'm just really wishing that rtime and the rtime flag didn't have live
> on disk, but could rather be in memory. If you only needed to save
> the directory flags and rtimes, that might actually be doable.
I already gave some thought to this but there seemed to be some
drawbacks. Query I want to support is: given a directory, tell me which of
its subdirectories (arbitrarily deep below) have been modified since time
T. That is what you need to support faster rsync, updatedb and similar
loads. Also I want to allow a reboot to happen inbetween the modification
and a query (handling a crash correctly would be nice too but honestly my
current implementation is not completely reliable in this regard either) so
some pernament storage is needed in any case. What I can imagine we could
do is to report all modifications to userspace - that has a problem that
there are *many* possible modifications but we are interested only whether
there happened some since time T. We could improve this by an in-memory
inode flag "I'm not interested in modifications any further" and reporting
the change only if the parent directory does not have this flag set (note
that this flag gets lost when we evict the inode from memory). But I would
say that in the end all this message passing, climbing the tree from
userspace and maintaining data structure in memory and on disk would cost
use more than the current implementation... Also it has the disadvantage
that we miss the modifications which happen before we start the userspace
daemon catching the events.
Doing this in kernel memory has a problem how to solve the persistency
across reboots (dumping mod's to userspace on request?) and also on my
system you'd have roughly a few MB of pinned memory for these purposes...
Plausible but I don't really like it...

> Note by the way that since you need to own the file/directory to set
> flags, this means that only programs that are running as root or
> running as the uid who owns the entire subtree will be able to use
> this scheme. One advantage of doing in kernel memory is that you
> might be able to support watching a tree that is not owned by the
> watcher.
Yes, that is the advantage. On the other hand we could allow setting that
particular flag even without being an owner of the inode. In fact, I
don't currently see use case where you won't be either root (rsync,
updatedb) or an owner of the files (watching config file trees) but I guess
people would find some :).

> > I don't get it here - you need to scan the whole subtree and set the flag
> > only during the initial scan. Later, you need to scan and set the flag only
> > for directories in whose subtree something changed. Similarty rtime needs
> > to be updated for each inode at most once after the scan.
>
> OK, so in the worst case every single file in a kernel source tree
> might change after doing an extreme git checkout. That means around
> 36k of files get updated. So if you have to set/clear the rtime flag
> during the checkout process 36k file inodes would have to have their
> rtime flag cleared, plus 2k worth of directory inodes; but those would
> probably be folded into other changes made to the inodes anyway. But
Yes, here the impact is hardly measurable as I've written in the previous
email.

> then when trackerd goes back and scans the subtree, if you are
> actually setting rtime flags for every single file inode, then that's
> 38k of indoes that need updating. If you only need to set the rtime
> flags for directories, that's only 2k worth of extra gratuitous inode
> updates.
As I wrote above, the flag is only set on directories so yes a scan
modifies 2k directory inodes. But such scan happens only when you run rsync
- I don't aim at something like 'trackerd' which would watch the filesystem
all the time. My idea is that those applications that currently do "scan
the tree, stat all files, check mtimes if something changed" would in
future do "scan those directories that have rtime newer than T". So the
overall balance is:
Currently: scan and stat all files in the tree, compare mtimes
With rtime: scan those directories in whose subtree has something changed
- stat all files in them, modify directory inodes

So you trade a lot of reading for some writes... I can actually try to measure
how much it will improve rsync scan on my computer :)

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2007-11-08 14:39:00

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC] [PATCH 3/3] Recursive mtime for ext3

Ah, OK, so the two things that I didn't get from your patch
description are:

1) the rtime flag and rtime field are only set on directories
2) the intended use is not trackerd and its ilk, but rsync and updatedb,
so it is desirable that scan/queries be persistent across reboots

But then the major hole in this scheme is still the issue of hard
links. The rsync program is still going to have to scan the entire
subtree looking for hard links, since an inode with multiple links
into the directory tree can't guarantee that all of its parent
directories will have their rtime field updated.

A program like updatedb which only cares about filenames will be OK,
since that means it really only cares about knowing when directories
have changed, and you can't have hard links to directories.

The other problem, of course, is that this feature would become ext
2/3/4 specific, and I could see future filesystems possibly wanting
this. So this raises the question of whether the interface should be
at the VFS layer or not --- and if so, how to handle querying whether
a particulra filesystem supports it, and what happens if you have a
subtree which is covered by a filesystem that doesn't support rtime?

So a program like rsync would need to scan /proc/self/mounts to see
whether or not it would be safe to use this feature in the first
place. And, of course, rsync would need to know whether it has write
access to the tree in order to set flags in the directory, and what to
do if some portion of the subtree isn't writeable by rsync.

On Thu, Nov 08, 2007 at 11:56:42AM +0100, Jan Kara wrote:
> > Note by the way that since you need to own the file/directory to set
> > flags, this means that only programs that are running as root or
> > running as the uid who owns the entire subtree will be able to use
> > this scheme. One advantage of doing in kernel memory is that you
> > might be able to support watching a tree that is not owned by the
> > watcher.
> Yes, that is the advantage. On the other hand we could allow setting that
> particular flag even without being an owner of the inode. In fact, I
> don't currently see use case where you won't be either root (rsync,
> updatedb) or an owner of the files (watching config file trees) but I guess
> people would find some :).

Sometimes people like to use rsync to copy a subtree to which they
have read access but not write access. (And here note that it's not
enough to have write access, you actually need to *own* all of the
directories in the subtree).

Yes, it's safe to let any user *set* the rtime flag, but we couldn't
let them clear the rtime flag, since then they would be able to hide a
file modification from some other (potentially privileged) process.
Speaking of security, I assume your patch will never allow rtime to go
backwards (for example if the user attempts to backdate a file's mtime
field using the utime() or utimes() system call)?

I guess I'm convinced that updatedb could use this facility, but there
are enough asteriks around it that I'm not sure that rsync could
safely use this feature in production. I don't doubt that in a cold
cache case, it would speed up rsync, but because it doesn't handle
hard links, it's not reliable. Since rsync often gets used for
backups, this is a big deal. There are also questions about what to
do if rsync doesn't have write access to the filesystem, or if there
is a non-rtime capable filesystem mounted in the subtree, etc., that
can be worked around, but would add a lot of complexity and grottiness
to the rsync source tree. Is the rsync maintainer really willing to
add all of the necessary hair to support this rtime facility into
their program?

- Ted

2007-11-08 15:28:24

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC] [PATCH 3/3] Recursive mtime for ext3

On Thu 08-11-07 09:37:59, Theodore Tso wrote:
> Ah, OK, so the two things that I didn't get from your patch
> description are:
>
> 1) the rtime flag and rtime field are only set on directories
> 2) the intended use is not trackerd and its ilk, but rsync and updatedb,
> so it is desirable that scan/queries be persistent across reboots
>
> But then the major hole in this scheme is still the issue of hard
> links. The rsync program is still going to have to scan the entire
> subtree looking for hard links, since an inode with multiple links
> into the directory tree can't guarantee that all of its parent
> directories will have their rtime field updated.
Not really - initially rsync can scan a tree for hardlinks and remember
where they are. If a hardlink to a file is created, an rtime update is
sent up the tree via the path used to create the link. So during next scan,
rsync will see the file is modified and finds out that its nlink is > 1
and adds it to the list of hardlinked files.
So for things like regular backups hardlinks can be dealt with
efficiently.

> A program like updatedb which only cares about filenames will be OK,
> since that means it really only cares about knowing when directories
> have changed, and you can't have hard links to directories.
>
> The other problem, of course, is that this feature would become ext
> 2/3/4 specific, and I could see future filesystems possibly wanting
> this. So this raises the question of whether the interface should be
> at the VFS layer or not --- and if so, how to handle querying whether
> a particulra filesystem supports it, and what happens if you have a
> subtree which is covered by a filesystem that doesn't support rtime?
>
> So a program like rsync would need to scan /proc/self/mounts to see
> whether or not it would be safe to use this feature in the first
Yes, being filesystem specific and thus requiring special handling of
mount points is a disadvantage of this approach.

> place. And, of course, rsync would need to know whether it has write
> access to the tree in order to set flags in the directory, and what to
> do if some portion of the subtree isn't writeable by rsync.
Yes, the cases where we cannot modify the flag in a tree would have to be
handled (similarly as the cases where the filesystem simply does not
support the feature). I don't think it wouldn't be too complicated but I have
not the modification for rsync yet, so I can underestimate...

> On Thu, Nov 08, 2007 at 11:56:42AM +0100, Jan Kara wrote:
> > > Note by the way that since you need to own the file/directory to set
> > > flags, this means that only programs that are running as root or
> > > running as the uid who owns the entire subtree will be able to use
> > > this scheme. One advantage of doing in kernel memory is that you
> > > might be able to support watching a tree that is not owned by the
> > > watcher.
> > Yes, that is the advantage. On the other hand we could allow setting that
> > particular flag even without being an owner of the inode. In fact, I
> > don't currently see use case where you won't be either root (rsync,
> > updatedb) or an owner of the files (watching config file trees) but I guess
> > people would find some :).
>
> Sometimes people like to use rsync to copy a subtree to which they
> have read access but not write access. (And here note that it's not
> enough to have write access, you actually need to *own* all of the
> directories in the subtree).
Yes, so in such cases my feature won't be able to help. But I think
there are still enough cases where it would help.

> Yes, it's safe to let any user *set* the rtime flag, but we couldn't
> let them clear the rtime flag, since then they would be able to hide a
> file modification from some other (potentially privileged) process.
Good point.

> Speaking of security, I assume your patch will never allow rtime to go
> backwards (for example if the user attempts to backdate a file's mtime
> field using the utime() or utimes() system call)?
No, the patch does not allow this. But anyway in case user has enough
rights to change file's mtime, would it really be a security concern?

> I guess I'm convinced that updatedb could use this facility, but there
> are enough asteriks around it that I'm not sure that rsync could
> safely use this feature in production. I don't doubt that in a cold
> cache case, it would speed up rsync, but because it doesn't handle
> hard links, it's not reliable. Since rsync often gets used for
> backups, this is a big deal. There are also questions about what to
> do if rsync doesn't have write access to the filesystem, or if there
> is a non-rtime capable filesystem mounted in the subtree, etc., that
> can be worked around, but would add a lot of complexity and grottiness
> to the rsync source tree. Is the rsync maintainer really willing to
> add all of the necessary hair to support this rtime facility into
> their program?
Hardlinks can be worked-around as I wrote above and there would have to
be a fallback in case we cannot set the flag. So I agree the code would be
more complicated but I think it could be done in a quite clean way - but of
course that has to be proven by a patch which I don't have yet. I have not
spoken to rsync maintainers about this - first I want to have at least a
preliminary version of a patch for rsync so that we have something to
talk about...
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR