2014-11-21 19:59:20

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH 0/4] add support for a lazytime mount option

This is an updated version of what had originally been an
ext4-specific patch which significantly improves performance by lazily
writing timestamp updates (and in particular, mtime updates) to disk.
The in-memory timestamps are always correct, but they are only written
to disk when required for correctness.

This provides a huge performance boost for ext4 due to how it handles
journalling, but it's valuable for all file systems running on flash
storage or drive-managed SMR disks by reducing the metadata write
load. So upon request, I've moved the functionality to the VFS layer.
Once the /sbin/mount program adds support for MS_LAZYTIME, all file
systems should be able to benefit from this optimization.

There is still an ext4-specific optimization, which may be applicable
to other file systems that store more than one inode per block, but it
will require file-system-specific code. It is purely optional,
however.

Please note the changes to update_time() and the new write_time() inode
operations, which impact btrfs and xfs. The changes are
fairly simple, but I would appreciate confirmation from the btrfs and
xfs teams that I got things right. Thanks!!

Theodore Ts'o (4):
fs: split update_time() into update_time() and write_time()
vfs: add support for a lazytime mount option
vfs: don't let the dirty time inodes get more than a day stale
ext4: add support for a lazytime mount option

fs/btrfs/inode.c | 10 ++++++
fs/ext4/inode.c | 46 +++++++++++++++++++++++++-
fs/ext4/super.c | 9 +++++
fs/fs-writeback.c | 39 +++++++++++++++++++++-
fs/inode.c | 88 ++++++++++++++++++++++++++++++++++++++++++-------
fs/proc_namespace.c | 1 +
fs/sync.c | 7 ++++
fs/xfs/xfs_iops.c | 39 +++++++++-------------
include/linux/fs.h | 5 +++
include/uapi/linux/fs.h | 1 +
10 files changed, 209 insertions(+), 36 deletions(-)

--
2.1.0

_______________________________________________
xfs mailing list
[email protected]
http://oss.sgi.com/mailman/listinfo/xfs


2014-11-21 19:59:44

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH 1/4] fs: split update_time() into update_time() and write_time()

In preparation for adding support for the lazytime mount option, we
need to be able to separate out the update_time() and write_time()
inode operations. Currently, only btrfs and xfs use update_time().

We needed to preserve update_time() because btrfs wants to have a
special btrfs_root_readonly() check; otherwise we could drop the
update_time() inode operation entirely.

Signed-off-by: Theodore Ts'o <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
fs/btrfs/inode.c | 10 ++++++++++
fs/inode.c | 29 ++++++++++++++++++-----------
fs/xfs/xfs_iops.c | 39 ++++++++++++++++-----------------------
include/linux/fs.h | 1 +
4 files changed, 45 insertions(+), 34 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d23362f..a5e0d0d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5574,6 +5574,11 @@ static int btrfs_update_time(struct inode *inode, struct timespec *now,
inode->i_mtime = *now;
if (flags & S_ATIME)
inode->i_atime = *now;
+ return 0;
+}
+
+static int btrfs_write_time(struct inode *inode)
+{
return btrfs_dirty_inode(inode);
}

@@ -9462,6 +9467,7 @@ static const struct inode_operations btrfs_dir_inode_operations = {
.get_acl = btrfs_get_acl,
.set_acl = btrfs_set_acl,
.update_time = btrfs_update_time,
+ .write_time = btrfs_write_time,
.tmpfile = btrfs_tmpfile,
};
static const struct inode_operations btrfs_dir_ro_inode_operations = {
@@ -9470,6 +9476,7 @@ static const struct inode_operations btrfs_dir_ro_inode_operations = {
.get_acl = btrfs_get_acl,
.set_acl = btrfs_set_acl,
.update_time = btrfs_update_time,
+ .write_time = btrfs_write_time,
};

static const struct file_operations btrfs_dir_file_operations = {
@@ -9540,6 +9547,7 @@ static const struct inode_operations btrfs_file_inode_operations = {
.get_acl = btrfs_get_acl,
.set_acl = btrfs_set_acl,
.update_time = btrfs_update_time,
+ .write_time = btrfs_write_time,
};
static const struct inode_operations btrfs_special_inode_operations = {
.getattr = btrfs_getattr,
@@ -9552,6 +9560,7 @@ static const struct inode_operations btrfs_special_inode_operations = {
.get_acl = btrfs_get_acl,
.set_acl = btrfs_set_acl,
.update_time = btrfs_update_time,
+ .write_time = btrfs_write_time,
};
static const struct inode_operations btrfs_symlink_inode_operations = {
.readlink = generic_readlink,
@@ -9565,6 +9574,7 @@ static const struct inode_operations btrfs_symlink_inode_operations = {
.listxattr = btrfs_listxattr,
.removexattr = btrfs_removexattr,
.update_time = btrfs_update_time,
+ .write_time = btrfs_write_time,
};

const struct dentry_operations btrfs_dentry_operations = {
diff --git a/fs/inode.c b/fs/inode.c
index 26753ba..8f5c4b5 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1499,17 +1499,24 @@ static int relatime_need_update(struct vfsmount *mnt, struct inode *inode,
*/
static int update_time(struct inode *inode, struct timespec *time, int flags)
{
- if (inode->i_op->update_time)
- return inode->i_op->update_time(inode, time, flags);
-
- if (flags & S_ATIME)
- inode->i_atime = *time;
- if (flags & S_VERSION)
- inode_inc_iversion(inode);
- if (flags & S_CTIME)
- inode->i_ctime = *time;
- if (flags & S_MTIME)
- inode->i_mtime = *time;
+ int ret;
+
+ if (inode->i_op->update_time) {
+ ret = inode->i_op->update_time(inode, time, flags);
+ if (ret)
+ return ret;
+ } else {
+ if (flags & S_ATIME)
+ inode->i_atime = *time;
+ if (flags & S_VERSION)
+ inode_inc_iversion(inode);
+ if (flags & S_CTIME)
+ inode->i_ctime = *time;
+ if (flags & S_MTIME)
+ inode->i_mtime = *time;
+ }
+ if (inode->i_op->write_time)
+ return inode->i_op->write_time(inode);
mark_inode_dirty_sync(inode);
return 0;
}
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index ec6dcdc..0e9653c 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -984,10 +984,8 @@ xfs_vn_setattr(
}

STATIC int
-xfs_vn_update_time(
- struct inode *inode,
- struct timespec *now,
- int flags)
+xfs_vn_write_time(
+ struct inode *inode)
{
struct xfs_inode *ip = XFS_I(inode);
struct xfs_mount *mp = ip->i_mount;
@@ -1004,21 +1002,16 @@ xfs_vn_update_time(
}

xfs_ilock(ip, XFS_ILOCK_EXCL);
- if (flags & S_CTIME) {
- inode->i_ctime = *now;
- ip->i_d.di_ctime.t_sec = (__int32_t)now->tv_sec;
- ip->i_d.di_ctime.t_nsec = (__int32_t)now->tv_nsec;
- }
- if (flags & S_MTIME) {
- inode->i_mtime = *now;
- ip->i_d.di_mtime.t_sec = (__int32_t)now->tv_sec;
- ip->i_d.di_mtime.t_nsec = (__int32_t)now->tv_nsec;
- }
- if (flags & S_ATIME) {
- inode->i_atime = *now;
- ip->i_d.di_atime.t_sec = (__int32_t)now->tv_sec;
- ip->i_d.di_atime.t_nsec = (__int32_t)now->tv_nsec;
- }
+
+ ip->i_d.di_ctime.t_sec = (__int32_t) inode->i_ctime.tv_sec;
+ ip->i_d.di_ctime.t_nsec = (__int32_t) inode->i_ctime.tv_nsec;
+
+ ip->i_d.di_mtime.t_sec = (__int32_t)inode->i_mtime.tv_sec;
+ ip->i_d.di_mtime.t_nsec = (__int32_t) inode->i_mtime.tv_nsec;
+
+ ip->i_d.di_atime.t_sec = (__int32_t) inode->i_atime.tv_sec;
+ ip->i_d.di_atime.t_nsec = (__int32_t) inode->i_atime.tv_nsec;
+
xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
xfs_trans_log_inode(tp, ip, XFS_ILOG_TIMESTAMP);
return xfs_trans_commit(tp, 0);
@@ -1129,7 +1122,7 @@ static const struct inode_operations xfs_inode_operations = {
.removexattr = generic_removexattr,
.listxattr = xfs_vn_listxattr,
.fiemap = xfs_vn_fiemap,
- .update_time = xfs_vn_update_time,
+ .write_time = xfs_vn_write_time,
};

static const struct inode_operations xfs_dir_inode_operations = {
@@ -1156,7 +1149,7 @@ static const struct inode_operations xfs_dir_inode_operations = {
.getxattr = generic_getxattr,
.removexattr = generic_removexattr,
.listxattr = xfs_vn_listxattr,
- .update_time = xfs_vn_update_time,
+ .write_time = xfs_vn_write_time,
.tmpfile = xfs_vn_tmpfile,
};

@@ -1184,7 +1177,7 @@ static const struct inode_operations xfs_dir_ci_inode_operations = {
.getxattr = generic_getxattr,
.removexattr = generic_removexattr,
.listxattr = xfs_vn_listxattr,
- .update_time = xfs_vn_update_time,
+ .write_time = xfs_vn_write_time,
.tmpfile = xfs_vn_tmpfile,
};

@@ -1198,7 +1191,7 @@ static const struct inode_operations xfs_symlink_inode_operations = {
.getxattr = generic_getxattr,
.removexattr = generic_removexattr,
.listxattr = xfs_vn_listxattr,
- .update_time = xfs_vn_update_time,
+ .write_time = xfs_vn_write_time,
};

STATIC void
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9ab779e..3633239 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1545,6 +1545,7 @@ struct inode_operations {
int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
u64 len);
int (*update_time)(struct inode *, struct timespec *, int);
+ int (*write_time)(struct inode *);
int (*atomic_open)(struct inode *, struct dentry *,
struct file *, unsigned open_flag,
umode_t create_mode, int *opened);
--
2.1.0


2014-11-21 19:59:24

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH 4/4] ext4: add support for a lazytime mount option

Add an optimization for the MS_LAZYTIME mount option so that we will
opportunistically write out any inodes with the I_DIRTY_TIME flag set
in a particular inode table block when we need to update some inode in
that inode table block anyway.

Also add some temporary code so that we can set the lazytime mount
option without needing a modified /sbin/mount program which can set
MS_LAZYTIME. We can eventually make this go away once util-linux has
added support.

Google-Bug-Id: 18297052

Signed-off-by: Theodore Ts'o <[email protected]>
---
fs/ext4/inode.c | 46 +++++++++++++++++++++++++++++++++++++++++++++-
fs/ext4/super.c | 9 +++++++++
fs/inode.c | 36 ++++++++++++++++++++++++++++++++++++
include/linux/fs.h | 2 ++
4 files changed, 92 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3356ab5..07ceafb 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4163,6 +4163,49 @@ static int ext4_inode_blocks_set(handle_t *handle,
}

/*
+ * Opportunistically update the other time fields for other inodes in
+ * the same inode table block.
+ */
+static void ext4_update_other_inodes_time(struct super_block *sb,
+ unsigned long orig_ino, char *buf)
+{
+ struct ext4_inode_info *ei;
+ struct ext4_inode *raw_inode;
+ unsigned long ino;
+ struct inode *inode;
+ int i, inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
+ int inode_size = EXT4_INODE_SIZE(sb);
+
+ ino = orig_ino & ~(inodes_per_block - 1);
+ for (i = 0; i < inodes_per_block; i++, ino++, buf += inode_size) {
+ if (ino == orig_ino)
+ continue;
+ inode = find_active_inode_nowait(sb, ino);
+ if (!inode ||
+ (inode->i_state & I_DIRTY_TIME) == 0) {
+ iput(inode);
+ continue;
+ }
+ raw_inode = (struct ext4_inode *) buf;
+ ei = EXT4_I(inode);
+
+ spin_lock(&inode->i_lock);
+ inode->i_state &= ~I_DIRTY_TIME;
+ inode->i_ts_dirty_day = 0;
+ spin_unlock(&inode->i_lock);
+
+ spin_lock(&ei->i_raw_lock);
+ EXT4_INODE_SET_XTIME(i_ctime, inode, raw_inode);
+ EXT4_INODE_SET_XTIME(i_mtime, inode, raw_inode);
+ EXT4_INODE_SET_XTIME(i_atime, inode, raw_inode);
+ ext4_inode_csum_set(inode, raw_inode, ei);
+ spin_unlock(&ei->i_raw_lock);
+ iput(inode);
+ }
+}
+
+
+/*
* Post the struct inode info into an on-disk inode location in the
* buffer-cache. This gobbles the caller's reference to the
* buffer_head in the inode location struct.
@@ -4273,9 +4316,10 @@ static int ext4_do_update_inode(handle_t *handle,
}

ext4_inode_csum_set(inode, raw_inode, ei);
-
spin_unlock(&ei->i_raw_lock);

+ ext4_update_other_inodes_time(inode->i_sb, inode->i_ino, bh->b_data);
+
BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
rc = ext4_handle_dirty_metadata(handle, NULL, bh);
if (!err)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 4b79f39..1ac1914 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1133,6 +1133,7 @@ enum {
Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
Opt_usrquota, Opt_grpquota, Opt_i_version,
Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_mblk_io_submit,
+ Opt_lazytime, Opt_nolazytime,
Opt_nomblk_io_submit, Opt_block_validity, Opt_noblock_validity,
Opt_inode_readahead_blks, Opt_journal_ioprio,
Opt_dioread_nolock, Opt_dioread_lock,
@@ -1195,6 +1196,8 @@ static const match_table_t tokens = {
{Opt_i_version, "i_version"},
{Opt_stripe, "stripe=%u"},
{Opt_delalloc, "delalloc"},
+ {Opt_lazytime, "lazytime"},
+ {Opt_nolazytime, "nolazytime"},
{Opt_nodelalloc, "nodelalloc"},
{Opt_removed, "mblk_io_submit"},
{Opt_removed, "nomblk_io_submit"},
@@ -1450,6 +1453,12 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
case Opt_i_version:
sb->s_flags |= MS_I_VERSION;
return 1;
+ case Opt_lazytime:
+ sb->s_flags |= MS_LAZYTIME;
+ return 1;
+ case Opt_nolazytime:
+ sb->s_flags &= ~MS_LAZYTIME;
+ return 1;
}

for (m = ext4_mount_opts; m->token != Opt_err; m++)
diff --git a/fs/inode.c b/fs/inode.c
index f0d6232..89cfca7 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1292,6 +1292,42 @@ struct inode *ilookup(struct super_block *sb, unsigned long ino)
}
EXPORT_SYMBOL(ilookup);

+/**
+ * find_active_inode_nowait - find an active inode in the inode cache
+ * @sb: super block of file system to search
+ * @ino: inode number to search for
+ *
+ * Search for an active inode @ino in the inode cache, and if the
+ * inode is in the cache, the inode is returned with an incremented
+ * reference count. If the inode is being freed or is newly
+ * initialized, return nothing instead of trying to wait for the inode
+ * initialization or destruction to be complete.
+ */
+struct inode *find_active_inode_nowait(struct super_block *sb,
+ unsigned long ino)
+{
+ struct hlist_head *head = inode_hashtable + hash(sb, ino);
+ struct inode *inode, *ret_inode = NULL;
+
+ spin_lock(&inode_hash_lock);
+ hlist_for_each_entry(inode, head, i_hash) {
+ if ((inode->i_ino != ino) ||
+ (inode->i_sb != sb))
+ continue;
+ spin_lock(&inode->i_lock);
+ if ((inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) == 0) {
+ __iget(inode);
+ ret_inode = inode;
+ }
+ spin_unlock(&inode->i_lock);
+ goto out;
+ }
+out:
+ spin_unlock(&inode_hash_lock);
+ return ret_inode;
+}
+EXPORT_SYMBOL(find_active_inode_nowait);
+
int insert_inode_locked(struct inode *inode)
{
struct super_block *sb = inode->i_sb;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e3574cd..dbbd642 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2413,6 +2413,8 @@ extern struct inode *ilookup(struct super_block *sb, unsigned long ino);

extern struct inode * iget5_locked(struct super_block *, unsigned long, int (*test)(struct inode *, void *), int (*set)(struct inode *, void *), void *);
extern struct inode * iget_locked(struct super_block *, unsigned long);
+extern struct inode *find_active_inode_nowait(struct super_block *,
+ unsigned long);
extern int insert_inode_locked4(struct inode *, unsigned long, int (*test)(struct inode *, void *), void *);
extern int insert_inode_locked(struct inode *);
#ifdef CONFIG_DEBUG_LOCK_ALLOC
--
2.1.0


2014-11-21 19:59:22

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH 2/4] vfs: add support for a lazytime mount option

Add a new mount option which enables a new "lazytime" mode. This mode
causes atime, mtime, and ctime updates to only be made to the
in-memory version of the inode. The on-disk times will only get
updated when (a) the inode needs to be updated for some non-time-related
change, (b) userspace calls fsync(), syncfs(), or sync(), or (c) an
undeleted inode is about to be evicted from memory.

This is OK according to POSIX because there are no guarantees after a
crash unless userspace explicitly requests it via an fsync(2) call.

For workloads which feature a large number of random writes to a
preallocated file, the lazytime mount option significantly reduces
writes to the inode table. The repeated 4k writes to a single block
will result in undesirable stress on flash devices and SMR disk
drives. Even on conventional HDDs, the repeated writes to the inode
table block will trigger Adjacent Track Interference (ATI) remediation
latencies, which very negatively impact 99.9th percentile latencies ---
a very big deal for web serving tiers, for example.

Google-Bug-Id: 18297052

Signed-off-by: Theodore Ts'o <[email protected]>
---
fs/fs-writeback.c | 38 +++++++++++++++++++++++++++++++++++++-
fs/inode.c | 18 ++++++++++++++++++
fs/proc_namespace.c | 1 +
fs/sync.c | 7 +++++++
include/linux/fs.h | 1 +
include/uapi/linux/fs.h | 1 +
6 files changed, 65 insertions(+), 1 deletion(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index ef9bef1..ce7de22 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -483,7 +483,7 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
inode->i_state &= ~I_DIRTY_PAGES;
dirty = inode->i_state & I_DIRTY;
- inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
+ inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_TIME);
spin_unlock(&inode->i_lock);
/* Don't write the inode if only I_DIRTY_PAGES was set */
if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
@@ -1277,6 +1277,41 @@ static void wait_sb_inodes(struct super_block *sb)
iput(old_inode);
}

+/*
+ * This works like wait_sb_inodes(), but it is called *before* we kick
+ * the bdi so the inodes can get written out.
+ */
+static void flush_sb_dirty_time(struct super_block *sb)
+{
+ struct inode *inode, *old_inode = NULL;
+
+ WARN_ON(!rwsem_is_locked(&sb->s_umount));
+ spin_lock(&inode_sb_list_lock);
+ list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+ int dirty_time;
+
+ spin_lock(&inode->i_lock);
+ if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
+ spin_unlock(&inode->i_lock);
+ continue;
+ }
+ dirty_time = inode->i_state & I_DIRTY_TIME;
+ __iget(inode);
+ spin_unlock(&inode->i_lock);
+ spin_unlock(&inode_sb_list_lock);
+
+ iput(old_inode);
+ old_inode = inode;
+
+ if (dirty_time)
+ mark_inode_dirty(inode);
+ cond_resched();
+ spin_lock(&inode_sb_list_lock);
+ }
+ spin_unlock(&inode_sb_list_lock);
+ iput(old_inode);
+}
+
/**
* writeback_inodes_sb_nr - writeback dirty inodes from given super_block
* @sb: the superblock
@@ -1388,6 +1423,7 @@ void sync_inodes_sb(struct super_block *sb)
return;
WARN_ON(!rwsem_is_locked(&sb->s_umount));

+ flush_sb_dirty_time(sb);
bdi_queue_work(sb->s_bdi, &work);
wait_for_completion(&done);

diff --git a/fs/inode.c b/fs/inode.c
index 8f5c4b5..6e91aca 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -534,6 +534,18 @@ static void evict(struct inode *inode)
BUG_ON(!(inode->i_state & I_FREEING));
BUG_ON(!list_empty(&inode->i_lru));

+ if (inode->i_nlink && inode->i_state & I_DIRTY_TIME) {
+ if (inode->i_op->write_time)
+ inode->i_op->write_time(inode);
+ else if (inode->i_sb->s_op->write_inode) {
+ struct writeback_control wbc = {
+ .sync_mode = WB_SYNC_NONE,
+ };
+ mark_inode_dirty(inode);
+ inode->i_sb->s_op->write_inode(inode, &wbc);
+ }
+ }
+
if (!list_empty(&inode->i_wb_list))
inode_wb_list_del(inode);

@@ -1515,6 +1527,12 @@ static int update_time(struct inode *inode, struct timespec *time, int flags)
if (flags & S_MTIME)
inode->i_mtime = *time;
}
+ if (inode->i_sb->s_flags & MS_LAZYTIME) {
+ spin_lock(&inode->i_lock);
+ inode->i_state |= I_DIRTY_TIME;
+ spin_unlock(&inode->i_lock);
+ return 0;
+ }
if (inode->i_op->write_time)
return inode->i_op->write_time(inode);
mark_inode_dirty_sync(inode);
diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
index 73ca174..f98234a 100644
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -44,6 +44,7 @@ static int show_sb_opts(struct seq_file *m, struct super_block *sb)
{ MS_SYNCHRONOUS, ",sync" },
{ MS_DIRSYNC, ",dirsync" },
{ MS_MANDLOCK, ",mand" },
+ { MS_LAZYTIME, ",lazytime" },
{ 0, NULL }
};
const struct proc_fs_info *fs_infop;
diff --git a/fs/sync.c b/fs/sync.c
index bdc729d..db7930e 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -177,8 +177,15 @@ SYSCALL_DEFINE1(syncfs, int, fd)
*/
int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
{
+ struct inode *inode = file->f_mapping->host;
+
if (!file->f_op->fsync)
return -EINVAL;
+ if (!datasync && inode->i_state & I_DIRTY_TIME) {
+ spin_lock(&inode->i_lock);
+ inode->i_state |= I_DIRTY_SYNC;
+ spin_unlock(&inode->i_lock);
+ }
return file->f_op->fsync(file, start, end, datasync);
}
EXPORT_SYMBOL(vfs_fsync_range);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3633239..489b2f2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1721,6 +1721,7 @@ struct super_operations {
#define __I_DIO_WAKEUP 9
#define I_DIO_WAKEUP (1 << I_DIO_WAKEUP)
#define I_LINKABLE (1 << 10)
+#define I_DIRTY_TIME (1 << 11)

#define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)

diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 3735fa0..cc9713a 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -90,6 +90,7 @@ struct inodes_stat_t {
#define MS_KERNMOUNT (1<<22) /* this is a kern_mount call */
#define MS_I_VERSION (1<<23) /* Update inode I_version field */
#define MS_STRICTATIME (1<<24) /* Always perform atime updates */
+#define MS_LAZYTIME (1<<25) /* Update the on-disk [acm]times lazily */

/* These sb flags are internal to the kernel */
#define MS_NOSEC (1<<28)
--
2.1.0


2014-11-21 19:59:23

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH 3/4] vfs: don't let the dirty time inodes get more than a day stale

Guarantee that the on-disk timestamps will be no more than 24 hours
stale.

Signed-off-by: Theodore Ts'o <[email protected]>
---
fs/fs-writeback.c | 1 +
fs/inode.c | 7 ++++++-
include/linux/fs.h | 1 +
3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index ce7de22..eb04277 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1141,6 +1141,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
if (flags & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
trace_writeback_dirty_inode_start(inode, flags);

+ inode->i_ts_dirty_day = 0;
if (sb->s_op->dirty_inode)
sb->s_op->dirty_inode(inode, flags);

diff --git a/fs/inode.c b/fs/inode.c
index 6e91aca..f0d6232 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1511,6 +1511,7 @@ static int relatime_need_update(struct vfsmount *mnt, struct inode *inode,
*/
static int update_time(struct inode *inode, struct timespec *time, int flags)
{
+ int days_since_boot = jiffies / (HZ * 86400);
int ret;

if (inode->i_op->update_time) {
@@ -1527,12 +1528,16 @@ static int update_time(struct inode *inode, struct timespec *time, int flags)
if (flags & S_MTIME)
inode->i_mtime = *time;
}
- if (inode->i_sb->s_flags & MS_LAZYTIME) {
+ if ((inode->i_sb->s_flags & MS_LAZYTIME) &&
+ (!inode->i_ts_dirty_day ||
+ inode->i_ts_dirty_day == days_since_boot)) {
spin_lock(&inode->i_lock);
inode->i_state |= I_DIRTY_TIME;
spin_unlock(&inode->i_lock);
+ inode->i_ts_dirty_day = days_since_boot;
return 0;
}
+ inode->i_ts_dirty_day = 0;
if (inode->i_op->write_time)
return inode->i_op->write_time(inode);
mark_inode_dirty_sync(inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 489b2f2..e3574cd 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -575,6 +575,7 @@ struct inode {
struct timespec i_ctime;
spinlock_t i_lock; /* i_blocks, i_bytes, maybe i_size */
unsigned short i_bytes;
+ unsigned short i_ts_dirty_day;
unsigned int i_blkbits;
blkcnt_t i_blocks;

--
2.1.0


2014-11-21 20:08:48

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH 1/4] fs: split update_time() into update_time() and write_time()

On Fri, Nov 21, 2014 at 2:59 PM, Theodore Ts'o <[email protected]> wrote:
> In preparation for adding support for the lazytime mount option, we
> need to be able to separate out the update_time() and write_time()
> inode operations. Currently, only btrfs and xfs uses update_time().
>
> We needed to preserve update_time() because btrfs wants to have a
> special btrfs_root_readonly() check; otherwise we could drop the
> update_time() inode operation entirely.

No objections here, I'll give the queue a whirl. You can add my
acked-by-or-Ill-fix-whatever-breaks

I look forward to the patch that makes us all lazy by default.

-chris




2014-11-21 20:19:07

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH 3/4] vfs: don't let the dirty time inodes get more than a day stale

On Nov 21, 2014, at 1:59 PM, Theodore Ts'o <[email protected]> wrote:
> Guarantee that the on-disk timestamps will be no more than 24 hours
> stale.
>
> Signed-off-by: Theodore Ts'o <[email protected]>
> ---
> fs/fs-writeback.c | 1 +
> fs/inode.c | 7 ++++++-
> include/linux/fs.h | 1 +
> 3 files changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index ce7de22..eb04277 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1141,6 +1141,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
> if (flags & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
> trace_writeback_dirty_inode_start(inode, flags);
>
> + inode->i_ts_dirty_day = 0;
> if (sb->s_op->dirty_inode)
> sb->s_op->dirty_inode(inode, flags);
>
> diff --git a/fs/inode.c b/fs/inode.c
> index 6e91aca..f0d6232 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1511,6 +1511,7 @@ static int relatime_need_update(struct vfsmount *mnt, struct inode *inode,
> */
> static int update_time(struct inode *inode, struct timespec *time, int flags)
> {
> + int days_since_boot = jiffies / (HZ * 86400);
> int ret;
>
> if (inode->i_op->update_time) {
> @@ -1527,12 +1528,16 @@ static int update_time(struct inode *inode, struct timespec *time, int flags)
> if (flags & S_MTIME)
> inode->i_mtime = *time;
> }
> - if (inode->i_sb->s_flags & MS_LAZYTIME) {
> + if ((inode->i_sb->s_flags & MS_LAZYTIME) &&
> + (!inode->i_ts_dirty_day ||
> + inode->i_ts_dirty_day == days_since_boot)) {
> spin_lock(&inode->i_lock);
> inode->i_state |= I_DIRTY_TIME;
> spin_unlock(&inode->i_lock);
> + inode->i_ts_dirty_day = days_since_boot;

It isn't clear if this is correct. It looks like the condition will
only be entered if i_ts_dirty_day == days_since_boot, but that is only
set inside the condition? Shouldn't this be:

inode->i_ts_dirty_day = ~0U;

if ((inode->i_sb->s_flags & MS_LAZYTIME) &&
inode->i_ts_dirty_day != days_since_boot) {
spin_lock(&inode->i_lock);
inode->i_state |= I_DIRTY_TIME;
spin_unlock(&inode->i_lock);
inode->i_ts_dirty_day = days_since_boot;
}

and "days_since_boot" should be declared unsigned short so it wraps
in the same way as i_ts_dirty_day, maybe using typeof(i_ts_dirty_day)
so that there isn't any need to update this in the future if the type
changes. Conceivably this could be an unsigned char, if that packed
into struct inode better, at the risk of losing a timestamp update on
an inode in cache for 256+ days and only modifying it 256 days later.

Cheers, Andreas

> return 0;
> }
> + inode->i_ts_dirty_day = 0;
> if (inode->i_op->write_time)
> return inode->i_op->write_time(inode);
> mark_inode_dirty_sync(inode);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 489b2f2..e3574cd 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -575,6 +575,7 @@ struct inode {
> struct timespec i_ctime;
> spinlock_t i_lock; /* i_blocks, i_bytes, maybe i_size */
> unsigned short i_bytes;
> + unsigned short i_ts_dirty_day;
> unsigned int i_blkbits;
> blkcnt_t i_blocks;
>
> --
> 2.1.0
>


Attachments:
signature.asc (833.00 B)
Message signed with OpenPGP using GPGMail

2014-11-21 21:36:51

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 3/4] vfs: don't let the dirty time inodes get more than a day stale

On Fri, Nov 21, 2014 at 02:19:07PM -0600, Andreas Dilger wrote:
> > - if (inode->i_sb->s_flags & MS_LAZYTIME) {
> > + if ((inode->i_sb->s_flags & MS_LAZYTIME) &&
> > + (!inode->i_ts_dirty_day ||
> > + inode->i_ts_dirty_day == days_since_boot)) {
> > spin_lock(&inode->i_lock);
> > inode->i_state |= I_DIRTY_TIME;
> > spin_unlock(&inode->i_lock);
> > + inode->i_ts_dirty_day = days_since_boot;
>
> It isn't clear if this is correct. It looks like the condition will
> only be entered if i_ts_dirty_day == days_since_boot, but that is only
> set inside the condition?

If i_ts_dirty_day is zero, the timestamps don't have to be written to
disk. This is either because the inode has been written to disk, or
because the system has been up for less than a day, so that the last
time we did a lazy mtime update (i.e., skipped the call to
mark_inode_dirty), jiffies / (HZ * 86400) was still zero.

If it is non-zero, then the timestamps were updated but were not sent
to disk N days since the system was booted. So long as it remains N
days since the system was booted, we can skip calling
mark_inode_dirty(). But once it becomes N+1 days since the system was
booted, then we will call mark_inode_dirty() and set i_ts_dirty_day to
zero.

I'll add a comment so it's a bit more obvious what we're doing here,
but I'm pretty sure what we currently have is in fact correct.

> and "days_since_boot" should be declared unsigned short so it wraps
> in the same way as i_ts_dirty_day

Good point, thanks. This will only be an issue after the system has
been up for almost 90 years, but we might as well get it right.

- Ted

2014-11-21 21:42:47

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 1/4] fs: split update_time() into update_time() and write_time()

Out of curiosity, why does btrfs_update_time() need to call
btrfs_root_readonly()? Why can't it just depend on the
__mnt_want_write() call in touch_atime()?

Surely if there are times when it's not OK to write into a btrfs file
system while mnt_is_readonly() returns false, the VFS is going to get
very confused anyway.

If the btrfs_update_time() is not necessary, then we could drop
btrfs_update_time() and update_time() from the inode operations
entirely, and depend on the VFS-level code in update_time().

- Ted

2014-11-21 23:09:25

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH 3/4] vfs: don't let the dirty time inodes get more than a day stale


On Nov 21, 2014, at 3:36 PM, Theodore Ts'o <[email protected]> wrote:
>> and "days_since_boot" should be declared unsigned short so it wraps
>> in the same way as i_ts_dirty_day
>
> Good point, thanks. This will only be an issue after the system has
> been up for almost 90 years, but we might as well get it right,

Speaking of stuff in the future, it looks like the patch to fix the
ext4 dates-beyond-epoch handling hasn't landed yet?

The last email I found related to that is in "[PATCH] fs: ext4:
Sign-extend tv_sec after ORing in epoch bits", but the patch that I
actually thought was ready to land was "[PATCH v8 1/2] ext4: Fix
handling of extended tv_sec (bug 23732)" and the matching e2fsprogs
patch:

http://lkml.org/lkml/2014/2/13/667
http://lkml.org/lkml/2014/2/13/670

Would be nice to get them landed, since we have less than 20 years
left, and at some point there will be embedded systems or VMs using
ext4 that may live long enough to hit the bug...

Cheers, Andreas







2014-11-24 15:21:01

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 1/4] fs: split update_time() into update_time() and write_time()

On Fri, Nov 21, 2014 at 02:59:21PM -0500, Theodore Ts'o wrote:
> We needed to preserve update_time() because btrfs wants to have a
> special btrfs_root_readonly() check; otherwise we could drop the
> update_time() inode operation entirely.

Can't btrfs just set the immutable flag on every inode that is read
when the root has the BTRFS_ROOT_SUBVOL_RDONLY flag? That would
cut down the places that need this check to the ioctl path so that
we prevent users from clearing the immutable flag.

_______________________________________________
xfs mailing list
[email protected]
http://oss.sgi.com/mailman/listinfo/xfs

2014-11-24 15:56:58

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 1/4] fs: split update_time() into update_time() and write_time()

On Mon, Nov 24, 2014 at 07:21:01AM -0800, Christoph Hellwig wrote:
> On Fri, Nov 21, 2014 at 02:59:21PM -0500, Theodore Ts'o wrote:
> > We needed to preserve update_time() because btrfs wants to have a
> > special btrfs_root_readonly() check; otherwise we could drop the
> > update_time() inode operation entirely.
>
> Can't btrfs just set the immutable flag on every inode that is read
> when the root has the BTRFS_ROOT_SUBVOL_RDONLY flag? That would
> cut down the places that need this check to the ioctl path so that
> we prevent users from clearing the immutable flag.

Sounds like a good plan to me, although I'm not sure I understand how
the BTRFS_ROOT_SUBVOL_RDONLY flag works; I would have thought there
are all sorts of places in the VFS layer that currently check
MS_RDONLY and MNT_READONLY but _not_ BTRFS_ROOT_SUBVOL_RDONLY, and
I'm surprised that isn't causing other problems.

But unless there's something more subtle going on, it would seem to me
that setting the immutable flag on each inode would be a better way to
go in any case.

- Ted

2014-11-24 16:38:30

by David Sterba

[permalink] [raw]
Subject: Re: [PATCH 1/4] fs: split update_time() into update_time() and write_time()

On Fri, Nov 21, 2014 at 04:42:45PM -0500, Theodore Ts'o wrote:
> Out of curiosity, why does btrfs_update_time() need to call
> btrfs_root_readonly()? Why can't it just depend on the
> __mnt_want_write() call in touch_atime()?

mnt_want_write looks only at the mountpoint flags, the readonly
subvolume status is external to that.

> Surely if there are times when it's not OK to write into a btrfs file
> system and mnt_is_readonly() returns false, the VFS is going to get
> very confused anyway.
>
> If the btrfs_update_time() is not necessary, then we could drop
> btrfs_update_time() and update_time() from the inode operations
> entirely, and depend on the VFS-level code in update_time().

It is necessary and the whole .update_time callback was added
intentionally, see commits

c3b2da314834499f34cba94f7053e55f6d6f92d8
fs: introduce inode operation ->update_time

e41f941a23115e84a8550b3d901a13a14b2edc2f
Btrfs: move over to use ->update_time

2bc5565286121d2a77ccd728eb3484dff2035b58
Btrfs: don't update atime on RO subvolumes

2014-11-24 17:22:19

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 1/4] fs: split update_time() into update_time() and write_time()

On Mon, Nov 24, 2014 at 05:38:30PM +0100, David Sterba wrote:
>
> It is necessary and the whole .update_time callback was added
> intentionally, see commits
>
> c3b2da314834499f34cba94f7053e55f6d6f92d8
> fs: introduce inode operation ->update_time
>
> e41f941a23115e84a8550b3d901a13a14b2edc2f
> Btrfs: move over to use ->update_time

Being able to signal an error if the time update fails is still
possible even if we drop update_time(), because the new write_time()
function will return an error.

> 2bc5565286121d2a77ccd728eb3484dff2035b58
> Btrfs: don't update atime on RO subvolumes

Yes, but this doesn't answer my question about other places where the
VFS is only checking MS_RDONLY and MNT_READONLY besides just
update_atime(). Maybe we should be exposing an "is_readonly(inode)"
inode operations function to address this?

- Ted

2014-11-24 17:34:30

by David Sterba

[permalink] [raw]
Subject: Re: [PATCH 1/4] fs: split update_time() into update_time() and write_time()

On Mon, Nov 24, 2014 at 07:21:01AM -0800, Christoph Hellwig wrote:
> On Fri, Nov 21, 2014 at 02:59:21PM -0500, Theodore Ts'o wrote:
> > We needed to preserve update_time() because btrfs wants to have a
> > special btrfs_root_readonly() check; otherwise we could drop the
> > update_time() inode operation entirely.
>
> Can't btrfs just set the immutable flag on every inode that is read
> when the root has the BTRFS_ROOT_SUBVOL_RDONLY flag?

This would lead to a change in return code from EROFS to EPERM/EACCES for
all syscalls that do not pass through the permission() callback. Also,
each file from a readonly subvol would now show the 'i' attribute, and
there's no way to determine whether the file had had the 'i' attribute
before it was snapshotted.

> That would
> cut down the places that need this check to the ioctl path so that
> we prevent users from clearing the immutable flag.

This means that even root or a capable user would not be able to clear
the flag.

Even though the extra btrfs_root_readonly checks are not great, the
number of surprises that the immutable bit could bring is IMO not great
either.

These are the callbacks that now implement the extra check:

- update_time
- setattr
- setflags (ioctl)
- setxattr
- removexattr

The btrfs_root_readonly checks in setxattr and removexattr are
unnecessary because they're done through xattr_permission. I'll send a
patch to remove them.

'setflags' is an ioctl and all the checks have to be done there.
The generic permission() callback cannot be used here because it would
fail to clear e.g. the immutable flag.

'setattr' does limited permission checks (IMMUTABLE and APPEND), nothing
that calls to the filesystem directly or indirectly.

The remaining one is 'update_time'. I'm not sure the amount of work
that switching to the IMMUTABLE bit would need is justified, compared to
this one instance of an extra btrfs_root_readonly test. As the VFS layer
has no notion of subvolumes, it's not able to determine the RO/RW status
without asking the filesystem anyway.


2014-11-24 18:09:51

by David Sterba

[permalink] [raw]
Subject: Re: [PATCH 1/4] fs: split update_time() into update_time() and write_time()

On Mon, Nov 24, 2014 at 12:22:16PM -0500, Theodore Ts'o wrote:
> On Mon, Nov 24, 2014 at 05:38:30PM +0100, David Sterba wrote:
> >
> > It is necessary and the whole .update_time callback was added
> > intentionally, see commits
> >
> > c3b2da314834499f34cba94f7053e55f6d6f92d8
> > fs: introduce inode operation ->update_time
> >
> > e41f941a23115e84a8550b3d901a13a14b2edc2f
> > Btrfs: move over to use ->update_time
>
> Being able to signal an error if the time update fails is still
> possible even if we drop update_time(), because the new write_time()
> function will return an error.

Fine, that means your change does not break the current status. I was
providing a more complete list of related commits.

> > 2bc5565286121d2a77ccd728eb3484dff2035b58
> > Btrfs: don't update atime on RO subvolumes
>
> Yes, but this doesn't answer my question about other places where the
> VFS is only checking MS_RDONLY and MNT_READONLY besides just
> update_atime(). Maybe we should be exposing an "is_readonly(inode)"
> inode operations function to address this?

Yes, if this is a lightweight check then it would allow us to remove
the filesystem-specific checks.

2014-11-25 01:52:45

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 2/4] vfs: add support for a lazytime mount option

On Fri, Nov 21, 2014 at 02:59:22PM -0500, Theodore Ts'o wrote:
> Add a new mount option which enables a new "lazytime" mode. This mode
> causes atime, mtime, and ctime updates to only be made to the
> in-memory version of the inode. The on-disk times will only get
> updated when (a) the inode needs to be updated for some non-time-
> related change, (b) userspace calls fsync(), syncfs() or sync(), or
> (c) just before an undeleted inode is evicted from memory.
>
> This is OK according to POSIX because there are no guarantees after a
> crash unless userspace explicitly requests it via an fsync(2) call.
>
> For workloads which feature a large number of random writes to a
> preallocated file, the lazytime mount option significantly reduces
> writes to the inode table. The repeated 4k writes to a single block
> will result in undesirable stress on flash devices and SMR disk
> drives. Even on conventional HDD's, the repeated writes to the inode
> table block will trigger Adjacent Track Interference (ATI) remediation
> latencies, which very negatively impact 99.9 percentile latencies ---
> which is a very big deal for web serving tiers (for example).
>
> Google-Bug-Id: 18297052
>
> Signed-off-by: Theodore Ts'o <[email protected]>
> ---
> fs/fs-writeback.c | 38 +++++++++++++++++++++++++++++++++++++-
> fs/inode.c | 18 ++++++++++++++++++
> fs/proc_namespace.c | 1 +
> fs/sync.c | 7 +++++++
> include/linux/fs.h | 1 +
> include/uapi/linux/fs.h | 1 +
> 6 files changed, 65 insertions(+), 1 deletion(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index ef9bef1..ce7de22 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -483,7 +483,7 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
> if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
> inode->i_state &= ~I_DIRTY_PAGES;
> dirty = inode->i_state & I_DIRTY;
> - inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
> + inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_TIME);
> spin_unlock(&inode->i_lock);
> /* Don't write the inode if only I_DIRTY_PAGES was set */
> if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
> @@ -1277,6 +1277,41 @@ static void wait_sb_inodes(struct super_block *sb)
> iput(old_inode);
> }
>
> +/*
> + * This works like wait_sb_inodes(), but it is called *before* we kick
> + * the bdi so the inodes can get written out.
> + */
> +static void flush_sb_dirty_time(struct super_block *sb)
> +{
> + struct inode *inode, *old_inode = NULL;
> +
> + WARN_ON(!rwsem_is_locked(&sb->s_umount));
> + spin_lock(&inode_sb_list_lock);
> + list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> + int dirty_time;
> +
> + spin_lock(&inode->i_lock);
> + if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
> + spin_unlock(&inode->i_lock);
> + continue;
> + }
> + dirty_time = inode->i_state & I_DIRTY_TIME;
> + __iget(inode);
> + spin_unlock(&inode->i_lock);
> + spin_unlock(&inode_sb_list_lock);
> +
> + iput(old_inode);
> + old_inode = inode;
> +
> + if (dirty_time)
> + mark_inode_dirty(inode);
> + cond_resched();
> + spin_lock(&inode_sb_list_lock);
> + }
> + spin_unlock(&inode_sb_list_lock);
> + iput(old_inode);
> +}

This just seems wrong to me, not to mention extremely expensive when we have
millions of cached inodes on the superblock.

Why can't we just add a function like mark_inode_dirty_time() which
puts the inode on the dirty inode list with a writeback time 24
hours in the future rather than 30s in the future?



> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -534,6 +534,18 @@ static void evict(struct inode *inode)
> BUG_ON(!(inode->i_state & I_FREEING));
> BUG_ON(!list_empty(&inode->i_lru));
>
> + if (inode->i_nlink && inode->i_state & I_DIRTY_TIME) {
> + if (inode->i_op->write_time)
> + inode->i_op->write_time(inode);
> + else if (inode->i_sb->s_op->write_inode) {
> + struct writeback_control wbc = {
> + .sync_mode = WB_SYNC_NONE,
> + };
> + mark_inode_dirty(inode);
> + inode->i_sb->s_op->write_inode(inode, &wbc);
> + }
> + }
> +

Eviction is too late for this. I'm pretty sure that it won't get
this far as iput_final() should catch the I_DIRTY_TIME in the !drop
case via write_inode_now().


> int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
> {
> + struct inode *inode = file->f_mapping->host;
> +
> if (!file->f_op->fsync)
> return -EINVAL;
> + if (!datasync && inode->i_state & I_DIRTY_TIME) {

FWIW, I'm surprised gcc isn't throwing warnings about that. Please
make sure there isn't any ambiguity in the code by writing those
checks like this:

if (!datasync && (inode->i_state & I_DIRTY_TIME)) {

> + spin_lock(&inode->i_lock);
> + inode->i_state |= I_DIRTY_SYNC;
> + spin_unlock(&inode->i_lock);
> + }
> return file->f_op->fsync(file, start, end, datasync);

When we mark the inode I_DIRTY_TIME, we should also be marking it
I_DIRTY_SYNC so that all the sync operations know that they should
be writing this inode. That's partly why I also think these inodes
should be tracked on the dirty inode list....

> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 3633239..489b2f2 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1721,6 +1721,7 @@ struct super_operations {
> #define __I_DIO_WAKEUP 9
> #define I_DIO_WAKEUP (1 << I_DIO_WAKEUP)
> #define I_LINKABLE (1 << 10)
> +#define I_DIRTY_TIME (1 << 11)
>
> #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)

Shouldn't I_DIRTY also include I_DIRTY_TIME now?

--
Dave Chinner
[email protected]

2014-11-25 01:53:34

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 3/4] vfs: don't let the dirty time inodes get more than a day stale

On Fri, Nov 21, 2014 at 02:59:23PM -0500, Theodore Ts'o wrote:
> Guarantee that the on-disk timestamps will be no more than 24 hours
> stale.
>
> Signed-off-by: Theodore Ts'o <[email protected]>

If we put these inodes on the dirty inode list with a writeback
time of 24 hours, this is completely unnecessary.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2014-11-25 04:33:35

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 2/4] vfs: add support for a lazytime mount option

On Tue, Nov 25, 2014 at 12:52:39PM +1100, Dave Chinner wrote:
> > +static void flush_sb_dirty_time(struct super_block *sb)
> > +{
...
> > +}
>
> This just seems wrong to me, not to mention extremely expensive when we have
> millions of cached inodes on the superblock.

#1, It only gets called on a sync(2) or syncfs(2), which can be pretty
expensive as it is, so I didn't really worry about it.

#2, We're already iterating over all of the inodes in the sync(2) or
syncfs(2) path, so the code path in question is already O(N) in the
number of inodes.

> Why can't we just add a function like mark_inode_dirty_time() which
> puts the inode on the dirty inode list with a writeback time 24
> hours in the future rather than 30s in the future?

I was concerned about putting them on the dirty inode list because it
would mean extra inodes that the writeback threads would have to skip
over and ignore (since they would not be dirty in the inode or data
pages sense).

Another solution would be to use a separate linked list for dirtytime
inodes, but that means adding some extra fields to the inode
structure, which some might view as bloat. Given #1 and #2 above,
yes, we're doubling the CPU cost for sync(2) and syncfs(2), since
we're now iterating over all of the inodes twice, but I believe that
is a better trade-off than bloating the inode structure with an extra
set of linked lists or increasing the CPU cost to the writeback path
(which gets executed far more often than the sync or syncfs paths).


> Eviction is too late for this. I'm pretty sure that it won't get
> this far as iput_final() should catch the I_DIRTY_TIME in the !drop
> case via write_inode_now().

Actually, the tracepoint for fs_lazytime_evict() does get triggered
from time to time; but only when the inode is evicted due to memory
pressure, i.e., via the evict_inodes() path.

I thought about possibly doing this in iput_final(), but that would
mean that whenever we closed the last fd on the file, we would push
the inode out to disk. For files that we are writing, that's not so
bad; but if we enable strictatime with lazytime, then we would be
updating the atime for inodes that had only been read, on every
close --- which in the case of, say, all of the files in the kernel
tree, would be a little unfortunate.

Of course, the combination of strict atime and lazytime would result
in a pretty heavy write load on a umount or sync(2), so I suspect
keeping the relatime mode would make sense for most people; but for
those people who need strict POSIX compliance, it seemed like doing
something that worked well for strictatime plus lazytime would be a
good thing, which is why I tried to defer things as much as possible.

> if (!datasync && (inode->i_state & I_DIRTY_TIME)) {
>
> > + spin_lock(&inode->i_lock);
> > + inode->i_state |= I_DIRTY_SYNC;
> > + spin_unlock(&inode->i_lock);
> > + }
> > return file->f_op->fsync(file, start, end, datasync);
>
> When we mark the inode I_DIRTY_TIME, we should also be marking it
> I_DIRTY_SYNC so that all the sync operations know that they should
> be writing this inode. That's partly why I also think these inodes
> should be tracked on the dirty inode list....

The whole point of what I was doing is that I_DIRTY_TIME is not part
of I_DIRTY, and that in update_time() we set I_DIRTY_TIME instead
of I_DIRTY_SYNC. The goal is that these inodes would not end up on
the dirty list, because that way they wouldn't be forced out to disk
until either (a) the inode is written out for other reasons (i.e., a
change in i_size or i_blocks, etc.), (b) the inode is explicitly
fsync'ed, or (c) at umount(2), sync(2), or syncfs(2) of the file
system.

That way, the timestamps are in the memory copy inode, but *not*
written on disk until they are opportunistically written out due to
some other modification of the inode which sets I_DIRTY_SYNC.

If we were to set I_DIRTY_SYNC alongside I_DIRTY_TIME, and put these
inodes on the dirty inode list, then I_DIRTY_TIME would effectively be
a no-op, and there would be no point to this whole exercise.

It may be that lazytime will never be the default, because it is so
different from what we are currently doing. But I think it is worth
doing, even if it is an optional mount option which is only used under
special circumstances. For myself, we will be using it at Google and
I will be using it on my laptop because it definitely reduces the
write load to the SSD; I've measured this via the tracepoints.

If there are significant objections to doing this in the VFS layer, I'm
happy to go back to doing this as ext4-specific code. There were a
few bits that were a bit more dodgy, and I can't make sync(2) and
syncfs(2) flush the dirty timestamps if I do it as an ext4-specific
hack, but for my use cases, I don't really care about those things.
The main reason why I redid this patch set as a VFS-level change
was because Christoph and others specifically requested it. But if you
strongly object, I can always go back to doing this in the ext4 code....

Cheers,

- Ted


2014-11-25 04:45:08

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 3/4] vfs: don't let the dirty time inodes get more than a day stale

On Tue, Nov 25, 2014 at 12:53:32PM +1100, Dave Chinner wrote:
> On Fri, Nov 21, 2014 at 02:59:23PM -0500, Theodore Ts'o wrote:
> > Guarantee that the on-disk timestamps will be no more than 24 hours
> > stale.
> >
> > Signed-off-by: Theodore Ts'o <[email protected]>
>
> If we put these inodes on the dirty inode list with a writeback
> time of 24 hours, this is completely unnecessary.

What do you mean by "a writeback time of 24 hours"? Do you mean
creating a new field in the inode which specifies when the writeback
should happen? I still worry about the dirty inode list getting
somewhat large in the strictatime && lazytime case, and the inode
bloat nazis coming after us for adding a new field to the inode
structure.

Or do you mean trying to abuse the dirtied_when field in some way?

- Ted


2014-11-25 15:32:18

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCH 2/4] vfs: add support for a lazytime mount option

On 11/25/2014 06:33 AM, Theodore Ts'o wrote:
<>
>
> I was concerned about putting them on the dirty inode list because it
> would be extra inodes for the writeback threads would have to skip
> over and ignore (since they would not be dirty in the inde or data
> pages sense).
>
> Another solution would be to use a separate linked list for dirtytime
> inodes, but that means adding some extra fields to the inode
> structure, which some might view as bloat.

You could use the same list head for both lists.

If the inode is on the dirty-inode list, there is no need to add it
to the dirty-time list; it will be written soon anyway. Otherwise,
you add it to the dirty-time list.

If you (real-)dirty an inode, you first remove it from the
dirty-time list, then add it to the dirty-inode list.

So at any given time it is only on one list.

<>

Cheers
Boaz

2014-11-25 15:51:41

by David Sterba

[permalink] [raw]
Subject: Re: [PATCH 1/4] fs: split update_time() into update_time() and write_time()

On Mon, Nov 24, 2014 at 06:34:30PM +0100, David Sterba wrote:
> The btrfs_root_readonly checks in setxattr and removexattr are
> unnecessary because they're done through xattr_permission. I'll send a
> patch to remove them.

Does not work because the security.* and system.* namespaces do not call
the permission() hook, so no patch. However, if the proposed
inode_is_readonly callback is merged, we can replace the btrfs-specific
checks with is_readonly check in xattr_permission().

2014-11-25 17:01:00

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 1/4] fs: split update_time() into update_time() and write_time()

On Tue, Nov 25, 2014 at 04:51:41PM +0100, David Sterba wrote:
> Does not work because the security.* and system.* namespaces do not call
> the permission() hook, so no patch. However, if the proposed
> inode_is_readonly callback is merged, we can replace the btrfs-specific
> checks with is_readonly check in xattr_permission().

I think that patch should go first in the series. But I really need to
find some time to review the whole thing before commenting with profound
half-knowledge...

2014-11-25 17:19:27

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 2/4] vfs: add support for a lazytime mount option

On Mon 24-11-14 23:33:35, Ted Tso wrote:
> On Tue, Nov 25, 2014 at 12:52:39PM +1100, Dave Chinner wrote:
> > Eviction is too late for this. I'm pretty sure that it won't get
> > this far as iput_final() should catch the I_DIRTY_TIME in the !drop
> > case via write_inode_now().
>
> Actually, the tracepoint for fs_lazytime_evict() does get triggered
> from time to time; but only when the inode is evicted due to memory
> pressure, i.e., via the evict_inodes() path.
>
> I thought about possibly doing this in iput_final(), but that would
> mean that whenever we closed the last fd on the file, we would push
> the inode out to disk. For files that we are writing, that's not so
> bad; but if we enable strictatime with lazytime, then we would be
> updating the atime for inodes that had been only been read on every
> close --- which in the case of say, all of the files in the kernel
> tree, would be a little unfortunate.
Actually, I'd also prefer to do the writing from iput_final(). My main
reason is that the shrinker starts behaving very differently when you put
inodes with I_DIRTY_TIME onto the LRU. See inode_lru_isolate() and in
particular:
	/*
	 * Referenced or dirty inodes are still in use. Give them another
	 * pass through the LRU as we cannot reclaim them now.
	 */
	if (atomic_read(&inode->i_count) ||
	    (inode->i_state & ~I_REFERENCED)) {
		list_del_init(&inode->i_lru);
		spin_unlock(&inode->i_lock);
		this_cpu_dec(nr_unused);
		return LRU_REMOVED;
	}

Regarding your concern that we'd write the inode when a file is closed:
that's not true. We'll write the inode only after the corresponding
dentry is evicted and thus drops its inode reference. That doesn't seem
too bad to me.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR


2014-11-25 17:30:40

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 2/4] vfs: add support for a lazytime mount option

On Mon 24-11-14 23:33:35, Ted Tso wrote:
> On Tue, Nov 25, 2014 at 12:52:39PM +1100, Dave Chinner wrote:
> > > +static void flush_sb_dirty_time(struct super_block *sb)
> > > +{
> ...
> > > +}
> >
> > This just seems wrong to me, not to mention extremely expensive when we have
> > millions of cached inodes on the superblock.
>
> #1, It only gets called on a sync(2) or syncfs(2), which can be pretty
> expensive as it is, so I didn't really worry about it.
>
> #2, We're already iterating over all of the inodes in the sync(2) or
> syncfs(2) path, so the code path in question is already O(N) in the
> number of inodes.
>
> > Why can't we just add a function like mark_inode_dirty_time() which
> > puts the inode on the dirty inode list with a writeback time 24
> > hours in the future rather than 30s in the future?
>
> I was concerned about putting them on the dirty inode list because it
> would be extra inodes for the writeback threads would have to skip
> over and ignore (since they would not be dirty in the inde or data
> pages sense).
I agree this isn't going to work easily. Currently the flusher relies on
the dirty list being sorted by i_dirtied_when, and that gets harder to
maintain if we ever have inodes with i_dirtied_when in the future (we'd
have to sort in newly dirtied inodes).

> Another solution would be to use a separate linked list for dirtytime
> inodes, but that means adding some extra fields to the inode
> structure, which some might view as bloat. Given #1 and #2 above,
> yes, we're doubling the CPU cost for sync(2) and syncfs(2), since
> we're not iterating over all of the inodes twice, but I believe that
> is a better trade-off than bloating the inode structure with an extra
> set of linked lists or increasing the CPU cost to the writeback path
> (which gets executed way more often than the sync or syncfs paths).
This would be possible, and as Boaz says, it might be possible to reuse
the same list_head in the inode for this. Getting rid of the full scan of
all superblock inodes would be nice, since the scan gets really expensive
for large numbers of inodes (think of i_sb_lock contention) and this
makes it twice as bad, so I'd prefer to do this if possible.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR


2014-11-25 17:31:26

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 3/4] vfs: don't let the dirty time inodes get more than a day stale

On Fri 21-11-14 14:59:23, Ted Tso wrote:
> Guarantee that the on-disk timestamps will be no more than 24 hours
> stale.
Hmm, how about reusing i_dirtied_when for this? Using that field even
makes good sense to me...

Honza

> Signed-off-by: Theodore Ts'o <[email protected]>
> ---
> fs/fs-writeback.c | 1 +
> fs/inode.c | 7 ++++++-
> include/linux/fs.h | 1 +
> 3 files changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index ce7de22..eb04277 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1141,6 +1141,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
> if (flags & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
> trace_writeback_dirty_inode_start(inode, flags);
>
> + inode->i_ts_dirty_day = 0;
> if (sb->s_op->dirty_inode)
> sb->s_op->dirty_inode(inode, flags);
>
> diff --git a/fs/inode.c b/fs/inode.c
> index 6e91aca..f0d6232 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1511,6 +1511,7 @@ static int relatime_need_update(struct vfsmount *mnt, struct inode *inode,
> */
> static int update_time(struct inode *inode, struct timespec *time, int flags)
> {
> + int days_since_boot = jiffies / (HZ * 86400);
> int ret;
>
> if (inode->i_op->update_time) {
> @@ -1527,12 +1528,16 @@ static int update_time(struct inode *inode, struct timespec *time, int flags)
> if (flags & S_MTIME)
> inode->i_mtime = *time;
> }
> - if (inode->i_sb->s_flags & MS_LAZYTIME) {
> + if ((inode->i_sb->s_flags & MS_LAZYTIME) &&
> + (!inode->i_ts_dirty_day ||
> + inode->i_ts_dirty_day == days_since_boot)) {
> spin_lock(&inode->i_lock);
> inode->i_state |= I_DIRTY_TIME;
> spin_unlock(&inode->i_lock);
> + inode->i_ts_dirty_day = days_since_boot;
> return 0;
> }
> + inode->i_ts_dirty_day = 0;
> if (inode->i_op->write_time)
> return inode->i_op->write_time(inode);
> mark_inode_dirty_sync(inode);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 489b2f2..e3574cd 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -575,6 +575,7 @@ struct inode {
> struct timespec i_ctime;
> spinlock_t i_lock; /* i_blocks, i_bytes, maybe i_size */
> unsigned short i_bytes;
> + unsigned short i_ts_dirty_day;
> unsigned int i_blkbits;
> blkcnt_t i_blocks;
>
> --
> 2.1.0
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2014-11-25 17:34:26

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 4/4] ext4: add support for a lazytime mount option

On Fri 21-11-14 14:59:24, Ted Tso wrote:
> Add an optimization for the MS_LAZYTIME mount option so that we will
> opportunistically write out any inodes with the I_DIRTY_TIME flag set
> in a particular inode table block when we need to update some inode in
> that inode table block anyway.
>
> Also add some temporary code so that we can set the lazytime mount
> option without needing a modified /sbin/mount program which can set
> MS_LAZYTIME. We can eventually make this go away once util-linux has
> added support.
...
> diff --git a/fs/inode.c b/fs/inode.c
> index f0d6232..89cfca7 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1292,6 +1292,42 @@ struct inode *ilookup(struct super_block *sb, unsigned long ino)
> }
> EXPORT_SYMBOL(ilookup);
>
> +/**
> + * find_active_inode_nowait - find an active inode in the inode cache
> + * @sb: super block of file system to search
> + * @ino: inode number to search for
> + *
> + * Search for an active inode @ino in the inode cache, and if the
> + * inode is in the cache, the inode is returned with an incremented
> + * reference count. If the inode is being freed or is newly
> + * initialized, return nothing instead of trying to wait for the inode
> + * initialization or destruction to be complete.
> + */
> +struct inode *find_active_inode_nowait(struct super_block *sb,
> + unsigned long ino)
> +{
> + struct hlist_head *head = inode_hashtable + hash(sb, ino);
> + struct inode *inode, *ret_inode = NULL;
> +
> + spin_lock(&inode_hash_lock);
> + hlist_for_each_entry(inode, head, i_hash) {
> + if ((inode->i_ino != ino) ||
> + (inode->i_sb != sb))
> + continue;
> + spin_lock(&inode->i_lock);
> + if ((inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) == 0) {
> + __iget(inode);
> + ret_inode = inode;
> + }
> + spin_unlock(&inode->i_lock);
> + goto out;
> + }
> +out:
> + spin_unlock(&inode_hash_lock);
> + return ret_inode;
> +}
> +EXPORT_SYMBOL(find_active_inode_nowait);
> +
Please move this function definition into a separate patch so that it
isn't hidden in an ext4-specific patch. Otherwise it looks good.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR


2014-11-25 17:57:16

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 2/4] vfs: add support for a lazytime mount option

On Tue, Nov 25, 2014 at 06:19:27PM +0100, Jan Kara wrote:
> Actually, I'd also prefer to do the writing from iput_final(). My main
> reason is that the shrinker starts behaving very differently when you put
> inodes with I_DIRTY_TIME on the LRU. See inode_lru_isolate() and in
> particular:
> 	/*
> 	 * Referenced or dirty inodes are still in use. Give them another
> 	 * pass through the LRU as we cannot reclaim them now.
> 	 */
> 	if (atomic_read(&inode->i_count) ||
> 	    (inode->i_state & ~I_REFERENCED)) {
> 		list_del_init(&inode->i_lru);
> 		spin_unlock(&inode->i_lock);
> 		this_cpu_dec(nr_unused);
> 		return LRU_REMOVED;
> 	}

I must be missing something; how would the shrinker behave
differently? I_DIRTY_TIME shouldn't have any effect on the shrinker;
note that I_DIRTY_TIME is *not* part of I_DIRTY, and this was quite
deliberate, because I didn't want I_DIRTY_TIME to have any effect on
any of the other parts of the writeback or inode management code.

> Regarding your concern that we'd write the inode when file is closed -
> that's not true. We'll write the inode only after corresponding dentry is
> evicted and thus drops inode reference. That doesn't seem too bad to me.

True, fair enough. It's not quite so lazy, but it should be close
enough.

I'm still not seeing the benefit in waiting until the last possible
minute to write out the timestamps; evict() can block as it is if
there is any writeback that needs to be completed, and if that
writeback happens to pages subject to delalloc, the timestamp update
could happen for free at that point.

- Ted


2014-11-25 19:26:09

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 2/4] vfs: add support for a lazytime mount option

On Tue, Nov 25, 2014 at 06:30:40PM +0100, Jan Kara wrote:
> This would be possible and as Boaz says, it might be possible to reuse
> the same list_head in the inode for this. Getting rid of the full scan of
> all superblock inodes would be nice (as the scan gets really expensive for
> large numbers of inodes (think of i_sb_lock contention) and this makes it
> twice as bad) so I'd prefer to do this if possible.

Fair enough, I'll give this a try. Personally, I've never been that
solicitous about the efficiency of sync, since if you ever use it,
you tend to destroy performance just because of contention on the disk
drive head caused by the writeback, never mind the i_sb_lock
contention. ("I am sync(2), the destroyer of tail latency SLOs...")

In fact there has sometimes been discussion about disabling sync(2)
for non-root users, because the opportunity for mischief when a
developer logs in and types sync out of reflex is too high. Of course,
if we ever did this, I'm sure such a patch would never be accepted
upstream, but that's OK; most people don't seem to care about tail
latency outside of Facebook and Google anyway....

- Ted


2014-11-25 20:18:10

by Jan Kara

Subject: Re: [PATCH 2/4] vfs: add support for a lazytime mount option

On Tue 25-11-14 12:57:16, Ted Tso wrote:
> On Tue, Nov 25, 2014 at 06:19:27PM +0100, Jan Kara wrote:
> > Actually, I'd also prefer to do the writing from iput_final(). My main
> > reason is that the shrinker starts behaving very differently when you put
> > inodes with I_DIRTY_TIME on the LRU. See inode_lru_isolate() and in
> > particular:
> > 	/*
> > 	 * Referenced or dirty inodes are still in use. Give them another
> > 	 * pass through the LRU as we cannot reclaim them now.
> > 	 */
> > 	if (atomic_read(&inode->i_count) ||
> > 	    (inode->i_state & ~I_REFERENCED)) {
> > 		list_del_init(&inode->i_lru);
> > 		spin_unlock(&inode->i_lock);
> > 		this_cpu_dec(nr_unused);
> > 		return LRU_REMOVED;
> > 	}
>
> I must be missing something; how would the shrinker behave
> differently? I_DIRTY_TIME shouldn't have any effect on the shrinker;
> note that I_DIRTY_TIME is *not* part of I_DIRTY, and this was quite
> deliberate, because I didn't want I_DIRTY_TIME to have any effect on
> any of the other parts of the writeback or inode management code.
Sure, but the test checks whether the inode has *any* bit other than
I_REFERENCED set. So I_DIRTY_TIME will trigger the test and we just remove
the inode from the LRU list. You could exclude I_DIRTY_TIME from this test
to avoid this problem, but then the shrinker latency would get much larger
because it will suddenly do IO in evict(). So I still think doing the
write in iput_final() is the best solution.

> > Regarding your concern that we'd write the inode when file is closed -
> > that's not true. We'll write the inode only after corresponding dentry is
> > evicted and thus drops inode reference. That doesn't seem too bad to me.
>
> True, fair enough. It's not quite so lazy, but it should be close
> enough.
>
> I'm still not seeing the benefit in waiting until the last possible
> minute to write out the timestamps; evict() can block as it is if
> there are any writeback that needs to be completed, and if the
> writeback happens to pages subject to delalloc, the timestamp update
> could happen for free at that point.
Yeah, doing IO from evict() is OK in principle, but the change in shrinker
success rate / latency worries me... It would certainly need careful
testing under memory pressure & IO load with lots of outstanding timestamp
updates to see how the shrinker behaves (change in CPU consumption, numbers
of evicted inodes, etc.).

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2014-11-25 23:48:51

by Dave Chinner

Subject: Re: [PATCH 3/4] vfs: don't let the dirty time inodes get more than a day stale

On Mon, Nov 24, 2014 at 11:45:08PM -0500, Theodore Ts'o wrote:
> On Tue, Nov 25, 2014 at 12:53:32PM +1100, Dave Chinner wrote:
> > On Fri, Nov 21, 2014 at 02:59:23PM -0500, Theodore Ts'o wrote:
> > > Guarantee that the on-disk timestamps will be no more than 24 hours
> > > stale.
> > >
> > > Signed-off-by: Theodore Ts'o <[email protected]>
> >
> > If we put these inodes on the dirty inode list with a writeback
> > time of 24 hours, this is completely unnecessary.
>
> What do you mean by "a writeback time of 24 hours"? Do you mean
> creating a new field in the inode which specifies when the writeback
> should happen?

No.

> I still worry about the dirty inode list getting
> somewhat large in the strictatime && lazytime case, and the inode
> bloat nazis coming after us for adding a new field to struct inode.

Use another pure inode time dirty list, and move the inode to the
existing dirty list when it gets marked I_DIRTY.

> Or do you mean trying to abuse the dirtied_when field in some way?

No abuse necessary at all. Just a different inode_dirtied_after()
check is required if the inode is on the time-dirty list in
move_expired_inodes().

Cheers,

Dave.
--
Dave Chinner
[email protected]

2014-11-26 00:24:14

by Dave Chinner

Subject: Re: [PATCH 2/4] vfs: add support for a lazytime mount option

On Mon, Nov 24, 2014 at 11:33:35PM -0500, Theodore Ts'o wrote:
> On Tue, Nov 25, 2014 at 12:52:39PM +1100, Dave Chinner wrote:
> > > +static void flush_sb_dirty_time(struct super_block *sb)
> > > +{
> ...
> > > +}
> >
> > This just seems wrong to me, not to mention extremely expensive when we have
> > millions of cached inodes on the superblock.
>
> #1, It only gets called on a sync(2) or syncfs(2), which can be pretty
> expensive as it is, so I didn't really worry about it.
>
> #2, We're already iterating over all of the inodes in the sync(2) or
> syncfs(2) path, so the code path in question is already O(N) in the
> number of inodes.
>
> > Why can't we just add a function like mark_inode_dirty_time() which
> > puts the inode on the dirty inode list with a writeback time 24
> > hours in the future rather than 30s in the future?
>
> I was concerned about putting them on the dirty inode list because they
> would be extra inodes that the writeback threads would have to skip
> over and ignore (since they would not be dirty in the inode or data
> pages sense).

Create another list - we already have multiple dirty inode lists in
the struct bdi_writeback....

> Another solution would be to use a separate linked list for dirtytime
> inodes, but that means adding some extra fields to the inode
> structure, which some might view as bloat.

We already have an i_wb_list in the inode for tracking the dirty
list the inode belongs to.

> Given #1 and #2 above,
> yes, we're doubling the CPU cost for sync(2) and syncfs(2), since
> we're now iterating over all of the inodes twice, but I believe that
> is a better trade-off than bloating the inode structure with an extra
> set of linked lists or increasing the CPU cost to the writeback path
> (which gets executed way more often than the sync or syncfs paths).

There is no need to bloat the inode at all, therefore we shouldn't
be causing sync/syncfs regressions by enabling lazytime...

> > Eviction is too late for this. I'm pretty sure that it won't get
> > this far as iput_final() should catch the I_DIRTY_TIME in the !drop
> > case via write_inode_now().
>
> Actually, the tracepoint for fs_lazytime_evict() does get triggered
> from time to time; but only when the inode is evicted due to memory
> pressure, i.e., via the evict_inodes() path.

That's indicative of a bug - if it's dirty then you shouldn't be
evicting it. The LRU shrinker explicitly avoids reclaiming dirty
inodes. Also, evict_inodes() is only called in the unmount path,
and that happens after a sync_filesystem() call so that shouldn't be
finding dirty inodes, either....

> I thought about possibly doing this in iput_final(), but that would
> mean that whenever we closed the last fd on the file, we would push
> the inode out to disk.

I get the feeling from your responses that you really don't
understand the VFS inode lifecycle or how the writeback code works.
Inodes don't get dropped from the inode cache when the last open FD
on them is closed unless they are unlinked files. The dentry cache
still holds a reference to the inode....

> For files that we are writing, that's not so
> bad; but if we enable strictatime with lazytime, then we would be
> updating the atime for inodes that had been only been read on every
> close --- which in the case of say, all of the files in the kernel
> tree, would be a little unfortunate.
>
> Of course, the combination of strict atime and lazytime would result
> in a pretty heavy write load on a umount or sync(2), so I suspect
> keeping the relatime mode would make sense for most people, but for
> those people who need strict POSIX compliance, it seemed like doing
> something that worked well for strictatime plus lazytime would be a
> good thing, which is why I tried to defer things as much as possible.
>
> > > +	if (!datasync && (inode->i_state & I_DIRTY_TIME)) {
> > > +		spin_lock(&inode->i_lock);
> > > +		inode->i_state |= I_DIRTY_SYNC;
> > > +		spin_unlock(&inode->i_lock);
> > > +	}
> > > 	return file->f_op->fsync(file, start, end, datasync);
> >
> > When we mark the inode I_DIRTY_TIME, we should also be marking it
> > I_DIRTY_SYNC so that all the sync operations know that they should
> > be writing this inode. That's partly why I also think these inodes
> > should be tracked on the dirty inode list....
>
> The whole point of what I was doing is that I_DIRTY_TIME was not part
> of I_DIRTY, and that when in update_time() we set I_DIRTY_TIME instead
> of I_DIRTY_SYNC.

I_DIRTY_SYNC only matters once you get the inode into the fsync code
or deep into the inode writeback code (i.e.
__writeback_single_inode()). If we don't expire the inode in the
high-level writeback code, then the only time we'll get into
__writeback_single_inode() is through specific foreground attempts
to write the inode. In which case, we should be writing the inode if
it is I_DIRTY_TIME, and so I_DIRTY_SYNC will trigger all the correct
code paths to be taken to get us to writing it.

And then we don't need hacks like the above to get fsync to write
the inode....

> The goal is that these inodes would not end up on
> the dirty list, because that way they wouldn't be forced out to disk
> until either (a) the inode is written out for other reasons (i.e., a
> change in i_size or i_blocks, etc.), (b) the inode is explicitly
> fsync'ed, or (c) the umount(2), sync(2), or syncfs(2) of the file
> system.

Precisely. If we don't expire the inode from the dirty list, it
will only get written back when inode writeback is explicitly asked
for.

> That way, the timestamps are in the memory copy inode, but *not*
> written on disk until they are opportunistically written out due to
> some other modification of the inode which sets I_DIRTY_SYNC.
>
> If we were to set I_DIRTY_SYNC alongside I_DIRTY_TIME, and put these
> inodes on the dirty inode list, then I_DIRTY_TIME would effectively be
> a no-op, and there would be no point to this whole exercise.

Except for the fact that I_DIRTY_TIME would result in the inode
being put on a different dirty list, and then when it's marked dirty
properly it can be moved onto the dirty inode list that will trigger
background writeback in 30s....

> If there is significant objections to doing this in the VFS layer, I'm
> happy to go back to doing this as in ext4-specific code.
> The main reason why I redid this patch set as a VFS specific change
> was because Cristoph and others specifically requested it. But if you
> strongly object, I can always go back to doing this in the ext4 code....

Don't be a drama queen, Ted. lazytime is something that needs to be
done at the VFS but it needs to be done efficiently.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2014-11-26 10:20:17

by Theodore Ts'o

Subject: Re: [PATCH 3/4] vfs: don't let the dirty time inodes get more than a day stale

On Wed, Nov 26, 2014 at 10:48:51AM +1100, Dave Chinner wrote:
> No abuse necessary at all. Just a different inode_dirtied_after()
> check is required if the inode is on the time-dirty list in
> move_expired_inodes().

I'm still not sure what you have in mind here. When would this be
checked? It sounds like you want to set a timeout such that an
inode which had its timestamps updated lazily 24 hours earlier
would get written out. Yes? But that implies something is
going to have to scan the list of inodes on the dirty time list
periodically. When are you proposing that this take place?

The various approaches that come to mind all seem more complex than
what I have in this patch 3 of 4, and I'm not sure it's worth the
complexity.

- Ted


2014-11-26 22:39:01

by Dave Chinner

Subject: Re: [PATCH 3/4] vfs: don't let the dirty time inodes get more than a day stale

On Wed, Nov 26, 2014 at 05:20:17AM -0500, Theodore Ts'o wrote:
> On Wed, Nov 26, 2014 at 10:48:51AM +1100, Dave Chinner wrote:
> > No abuse necessary at all. Just a different inode_dirtied_after()
> > check is required if the inode is on the time-dirty list in
> > move_expired_inodes().
>
> I'm still not sure what you have in mind here. When would this be
> checked?

Have you looked at where move_expired_inodes() gets called from?
It's called periodically from background writeback by queue_io(),
and sync uses the same infrastructure to expire all inodes on the
dirty list....

> It sounds like you want to set a timeout such that when an
> inode which had its timestamps updated lazily 24 hours earlier, the
> inode would get written out. Yes? But that implies something is
> going to have to scan the list of inodes on the dirty time list
> periodically. When are you proposing that this take place?

The writeback code already does this for dirty inodes. It does it in
move_expired_inodes(), which moves the inodes whose i_dirtied_when is
older than 30s. It's *trivial* to add a time-dirty inode list and
scan that at the same time to pull off inodes that are older than
24 hours.

> The various approaches that come to mind all seem more complex than
> what I have in this patch 3 of 4, and I'm not sure it's worth the
> complexity.

The "once a day" stuff you've added is a horrible, nasty hack. I
wasn't going to say anything about it (i.e. if you can't say
anything nice...). The existing dirty inode writeback expiry code
does *everything* we need already; we just need to plumb in a new
list and add an expiry check of that list to move inodes to the b_io
list when they have been timestamp-dirty for more than 24 hours...

Cheers,

Dave.
--
Dave Chinner
[email protected]