2014-11-22 16:54:22

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH-v2 0/5] add support for a lazytime mount option

This is an updated version of what had originally been an
ext4-specific patch which significantly improves performance by lazily
writing timestamp updates (and in particular, mtime updates) to disk.
The in-memory timestamps are always correct, but they are only written
to disk when required for correctness.

This provides a huge performance boost for ext4 due to how it handles
journalling, but it's valuable for all file systems running on flash
storage or drive-managed SMR disks by reducing the metadata write
load. So upon request, I've moved the functionality to the VFS layer.
Once the /sbin/mount program adds support for MS_LAZYTIME, all file
systems should be able to benefit from this optimization.

There is still an ext4-specific optimization, which may be applicable
for other file systems which store more than one inode in a block, but
it will require file system specific code. It is purely optional,
however.

Please note the changes to update_time() and the new write_time() inode
operations functions, which impact btrfs and xfs. The changes are
fairly simple, but I would appreciate confirmation from the btrfs and
xfs teams that I got things right. Thanks!!

Any objections to my carrying these patches in the ext4 git tree?

Changes since -v1:
- Added explanatory comments in update_time() regarding i_ts_dirty_days
- Fix type used for days_since_boot
- Improve SMP scalability in update_time and ext4_update_other_inodes_time
- Added tracepoints to help test and characterize how often and under
what circumstances inodes have their timestamps lazily updated

Theodore Ts'o (5):
fs: split update_time() into update_time() and write_time()
vfs: add support for a lazytime mount option
vfs: don't let the dirty time inodes get more than a day stale
vfs: add lazytime tracepoints for better debugging
ext4: add support for a lazytime mount option

fs/btrfs/inode.c | 10 +++++
fs/ext4/inode.c | 48 ++++++++++++++++++--
fs/ext4/super.c | 9 ++++
fs/fs-writeback.c | 42 +++++++++++++++++-
fs/inode.c | 104 +++++++++++++++++++++++++++++++++++++++-----
fs/proc_namespace.c | 1 +
fs/sync.c | 7 +++
fs/xfs/xfs_iops.c | 39 +++++++----------
include/linux/fs.h | 5 +++
include/trace/events/ext4.h | 29 ++++++++++++
include/trace/events/fs.h | 56 ++++++++++++++++++++++++
include/uapi/linux/fs.h | 1 +
12 files changed, 313 insertions(+), 38 deletions(-)
create mode 100644 include/trace/events/fs.h

--
2.1.0

_______________________________________________
xfs mailing list
[email protected]
http://oss.sgi.com/mailman/listinfo/xfs


2014-11-22 16:54:24

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH-v2 2/5] vfs: add support for a lazytime mount option

Add a new mount option which enables a new "lazytime" mode. This mode
causes atime, mtime, and ctime updates to only be made to the
in-memory version of the inode. The on-disk times will only get
updated when (a) if the inode needs to be updated for some non-time
related change, (b) if userspace calls fsync(), syncfs() or sync(), or
(c) just before an undeleted inode is evicted from memory.

This is OK according to POSIX because there are no guarantees after a
crash unless userspace explicitly requests via a fsync(2) call.

For workloads which feature a large number of random write to a
preallocated file, the lazytime mount option significantly reduces
writes to the inode table. The repeated 4k writes to a single block
will result in undesirable stress on flash devices and SMR disk
drives. Even on conventional HDD's, the repeated writes to the inode
table block will trigger Adjacent Track Interference (ATI) remediation
latencies, which very negatively impact 99.9 percentile latencies ---
which is a very big deal for web serving tiers (for example).

Google-Bug-Id: 18297052

Signed-off-by: Theodore Ts'o <[email protected]>
---
fs/fs-writeback.c | 38 +++++++++++++++++++++++++++++++++++++-
fs/inode.c | 20 ++++++++++++++++++++
fs/proc_namespace.c | 1 +
fs/sync.c | 7 +++++++
include/linux/fs.h | 1 +
include/uapi/linux/fs.h | 1 +
6 files changed, 67 insertions(+), 1 deletion(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index ef9bef1..ce7de22 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -483,7 +483,7 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
inode->i_state &= ~I_DIRTY_PAGES;
dirty = inode->i_state & I_DIRTY;
- inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
+ inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_TIME);
spin_unlock(&inode->i_lock);
/* Don't write the inode if only I_DIRTY_PAGES was set */
if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
@@ -1277,6 +1277,41 @@ static void wait_sb_inodes(struct super_block *sb)
iput(old_inode);
}

+/*
+ * This works like wait_sb_inodes(), but it is called *before* we kick
+ * the bdi so the inodes can get written out.
+ */
+static void flush_sb_dirty_time(struct super_block *sb)
+{
+ struct inode *inode, *old_inode = NULL;
+
+ WARN_ON(!rwsem_is_locked(&sb->s_umount));
+ spin_lock(&inode_sb_list_lock);
+ list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+ int dirty_time;
+
+ spin_lock(&inode->i_lock);
+ if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
+ spin_unlock(&inode->i_lock);
+ continue;
+ }
+ dirty_time = inode->i_state & I_DIRTY_TIME;
+ __iget(inode);
+ spin_unlock(&inode->i_lock);
+ spin_unlock(&inode_sb_list_lock);
+
+ iput(old_inode);
+ old_inode = inode;
+
+ if (dirty_time)
+ mark_inode_dirty(inode);
+ cond_resched();
+ spin_lock(&inode_sb_list_lock);
+ }
+ spin_unlock(&inode_sb_list_lock);
+ iput(old_inode);
+}
+
/**
* writeback_inodes_sb_nr - writeback dirty inodes from given super_block
* @sb: the superblock
@@ -1388,6 +1423,7 @@ void sync_inodes_sb(struct super_block *sb)
return;
WARN_ON(!rwsem_is_locked(&sb->s_umount));

+ flush_sb_dirty_time(sb);
bdi_queue_work(sb->s_bdi, &work);
wait_for_completion(&done);

diff --git a/fs/inode.c b/fs/inode.c
index 8f5c4b5..11fe81b 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -534,6 +534,18 @@ static void evict(struct inode *inode)
BUG_ON(!(inode->i_state & I_FREEING));
BUG_ON(!list_empty(&inode->i_lru));

+ if (inode->i_nlink && inode->i_state & I_DIRTY_TIME) {
+ if (inode->i_op->write_time)
+ inode->i_op->write_time(inode);
+ else if (inode->i_sb->s_op->write_inode) {
+ struct writeback_control wbc = {
+ .sync_mode = WB_SYNC_NONE,
+ };
+ mark_inode_dirty(inode);
+ inode->i_sb->s_op->write_inode(inode, &wbc);
+ }
+ }
+
if (!list_empty(&inode->i_wb_list))
inode_wb_list_del(inode);

@@ -1515,6 +1527,14 @@ static int update_time(struct inode *inode, struct timespec *time, int flags)
if (flags & S_MTIME)
inode->i_mtime = *time;
}
+ if (inode->i_sb->s_flags & MS_LAZYTIME) {
+ if (inode->i_state & I_DIRTY_TIME)
+ return 0;
+ spin_lock(&inode->i_lock);
+ inode->i_state |= I_DIRTY_TIME;
+ spin_unlock(&inode->i_lock);
+ return 0;
+ }
if (inode->i_op->write_time)
return inode->i_op->write_time(inode);
mark_inode_dirty_sync(inode);
diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
index 73ca174..f98234a 100644
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -44,6 +44,7 @@ static int show_sb_opts(struct seq_file *m, struct super_block *sb)
{ MS_SYNCHRONOUS, ",sync" },
{ MS_DIRSYNC, ",dirsync" },
{ MS_MANDLOCK, ",mand" },
+ { MS_LAZYTIME, ",lazytime" },
{ 0, NULL }
};
const struct proc_fs_info *fs_infop;
diff --git a/fs/sync.c b/fs/sync.c
index bdc729d..db7930e 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -177,8 +177,15 @@ SYSCALL_DEFINE1(syncfs, int, fd)
*/
int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
{
+ struct inode *inode = file->f_mapping->host;
+
if (!file->f_op->fsync)
return -EINVAL;
+ if (!datasync && inode->i_state & I_DIRTY_TIME) {
+ spin_lock(&inode->i_lock);
+ inode->i_state |= I_DIRTY_SYNC;
+ spin_unlock(&inode->i_lock);
+ }
return file->f_op->fsync(file, start, end, datasync);
}
EXPORT_SYMBOL(vfs_fsync_range);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3633239..489b2f2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1721,6 +1721,7 @@ struct super_operations {
#define __I_DIO_WAKEUP 9
#define I_DIO_WAKEUP (1 << I_DIO_WAKEUP)
#define I_LINKABLE (1 << 10)
+#define I_DIRTY_TIME (1 << 11)

#define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)

diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 3735fa0..cc9713a 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -90,6 +90,7 @@ struct inodes_stat_t {
#define MS_KERNMOUNT (1<<22) /* this is a kern_mount call */
#define MS_I_VERSION (1<<23) /* Update inode I_version field */
#define MS_STRICTATIME (1<<24) /* Always perform atime updates */
+#define MS_LAZYTIME (1<<25) /* Update the on-disk [acm]times lazily */

/* These sb flags are internal to the kernel */
#define MS_NOSEC (1<<28)
--
2.1.0

_______________________________________________
xfs mailing list
[email protected]
http://oss.sgi.com/mailman/listinfo/xfs

2014-11-22 16:54:25

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH-v2 3/5] vfs: don't let the dirty time inodes get more than a day stale

Guarantee that the on-disk timestamps will be no more than 24 hours
stale.

Signed-off-by: Theodore Ts'o <[email protected]>
---
fs/fs-writeback.c | 1 +
fs/inode.c | 16 +++++++++++++++-
include/linux/fs.h | 1 +
3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index ce7de22..eb04277 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1141,6 +1141,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
if (flags & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
trace_writeback_dirty_inode_start(inode, flags);

+ inode->i_ts_dirty_day = 0;
if (sb->s_op->dirty_inode)
sb->s_op->dirty_inode(inode, flags);

diff --git a/fs/inode.c b/fs/inode.c
index 11fe81b..0d939a8 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1511,6 +1511,7 @@ static int relatime_need_update(struct vfsmount *mnt, struct inode *inode,
*/
static int update_time(struct inode *inode, struct timespec *time, int flags)
{
+ unsigned short days_since_boot = jiffies / (HZ * 86400);
int ret;

if (inode->i_op->update_time) {
@@ -1527,14 +1528,27 @@ static int update_time(struct inode *inode, struct timespec *time, int flags)
if (flags & S_MTIME)
inode->i_mtime = *time;
}
- if (inode->i_sb->s_flags & MS_LAZYTIME) {
+ /*
+ * If i_ts_dirty_day is zero, then either we have not deferred
+ * timestamp updates, or the system has been up for less than
+ * a day (so days_since_boot is zero), so we defer timestamp
+ * updates in that case and set the I_DIRTY_TIME flag. If a
+ * day or more has passed, then i_ts_dirty_day will be
+ * different from days_since_boot, and then we should update
+ * the on-disk inode and then we can clear i_ts_dirty_day.
+ */
+ if ((inode->i_sb->s_flags & MS_LAZYTIME) &&
+ (!inode->i_ts_dirty_day ||
+ inode->i_ts_dirty_day == days_since_boot)) {
if (inode->i_state & I_DIRTY_TIME)
return 0;
spin_lock(&inode->i_lock);
inode->i_state |= I_DIRTY_TIME;
spin_unlock(&inode->i_lock);
+ inode->i_ts_dirty_day = days_since_boot;
return 0;
}
+ inode->i_ts_dirty_day = 0;
if (inode->i_op->write_time)
return inode->i_op->write_time(inode);
mark_inode_dirty_sync(inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 489b2f2..e3574cd 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -575,6 +575,7 @@ struct inode {
struct timespec i_ctime;
spinlock_t i_lock; /* i_blocks, i_bytes, maybe i_size */
unsigned short i_bytes;
+ unsigned short i_ts_dirty_day;
unsigned int i_blkbits;
blkcnt_t i_blocks;

--
2.1.0


2014-11-22 16:54:26

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH-v2 4/5] vfs: add lazytime tracepoints for better debugging

Signed-off-by: Theodore Ts'o <[email protected]>
---
fs/fs-writeback.c | 5 ++++-
fs/inode.c | 5 +++++
include/trace/events/fs.h | 56 +++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 65 insertions(+), 1 deletion(-)
create mode 100644 include/trace/events/fs.h

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index eb04277..cab2d6d 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -27,6 +27,7 @@
#include <linux/backing-dev.h>
#include <linux/tracepoint.h>
#include <linux/device.h>
+#include <trace/events/fs.h>
#include "internal.h"

/*
@@ -1304,8 +1305,10 @@ static void flush_sb_dirty_time(struct super_block *sb)
iput(old_inode);
old_inode = inode;

- if (dirty_time)
+ if (dirty_time) {
+ trace_fs_lazytime_flush(inode);
mark_inode_dirty(inode);
+ }
cond_resched();
spin_lock(&inode_sb_list_lock);
}
diff --git a/fs/inode.c b/fs/inode.c
index 0d939a8..5a9a7b0 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -20,6 +20,9 @@
#include <linux/list_lru.h>
#include "internal.h"

+#define CREATE_TRACE_POINTS
+#include <trace/events/fs.h>
+
/*
* Inode locking rules:
*
@@ -544,6 +547,7 @@ static void evict(struct inode *inode)
mark_inode_dirty(inode);
inode->i_sb->s_op->write_inode(inode, &wbc);
}
+ trace_fs_lazytime_evict(inode);
}

if (!list_empty(&inode->i_wb_list))
@@ -1546,6 +1550,7 @@ static int update_time(struct inode *inode, struct timespec *time, int flags)
inode->i_state |= I_DIRTY_TIME;
spin_unlock(&inode->i_lock);
inode->i_ts_dirty_day = days_since_boot;
+ trace_fs_lazytime_defer(inode);
return 0;
}
inode->i_ts_dirty_day = 0;
diff --git a/include/trace/events/fs.h b/include/trace/events/fs.h
new file mode 100644
index 0000000..ca06d5c
--- /dev/null
+++ b/include/trace/events/fs.h
@@ -0,0 +1,56 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM fs
+
+#if !defined(_TRACE_FS_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_FS_H
+
+#include <linux/tracepoint.h>
+
+DECLARE_EVENT_CLASS(fs__inode,
+ TP_PROTO(struct inode *inode),
+
+ TP_ARGS(inode),
+
+ TP_STRUCT__entry(
+ __field( dev_t, dev )
+ __field( ino_t, ino )
+ __field( uid_t, uid )
+ __field( gid_t, gid )
+ __field( __u16, mode )
+ ),
+
+ TP_fast_assign(
+ __entry->dev = inode->i_sb->s_dev;
+ __entry->ino = inode->i_ino;
+ __entry->uid = i_uid_read(inode);
+ __entry->gid = i_gid_read(inode);
+ __entry->mode = inode->i_mode;
+ ),
+
+ TP_printk("dev %d,%d ino %lu mode 0%o uid %u gid %u",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ (unsigned long) __entry->ino, __entry->mode,
+ __entry->uid, __entry->gid)
+);
+
+DEFINE_EVENT(fs__inode, fs_lazytime_defer,
+ TP_PROTO(struct inode *inode),
+
+ TP_ARGS(inode)
+);
+
+DEFINE_EVENT(fs__inode, fs_lazytime_evict,
+ TP_PROTO(struct inode *inode),
+
+ TP_ARGS(inode)
+);
+
+DEFINE_EVENT(fs__inode, fs_lazytime_flush,
+ TP_PROTO(struct inode *inode),
+
+ TP_ARGS(inode)
+);
+#endif /* _TRACE_FS_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
--
2.1.0


2014-11-22 16:54:23

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH-v2 1/5] fs: split update_time() into update_time() and write_time()

In preparation for adding support for the lazytime mount option, we
need to be able to separate out the update_time() and write_time()
inode operations. Currently, only btrfs and xfs uses update_time().

We needed to preserve update_time() because btrfs wants to have a
special btrfs_root_readonly() check; otherwise we could drop the
update_time() inode operation entirely.

Signed-off-by: Theodore Ts'o <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
fs/btrfs/inode.c | 10 ++++++++++
fs/inode.c | 29 ++++++++++++++++++-----------
fs/xfs/xfs_iops.c | 39 ++++++++++++++++-----------------------
include/linux/fs.h | 1 +
4 files changed, 45 insertions(+), 34 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d23362f..a5e0d0d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5574,6 +5574,11 @@ static int btrfs_update_time(struct inode *inode, struct timespec *now,
inode->i_mtime = *now;
if (flags & S_ATIME)
inode->i_atime = *now;
+ return 0;
+}
+
+static int btrfs_write_time(struct inode *inode)
+{
return btrfs_dirty_inode(inode);
}

@@ -9462,6 +9467,7 @@ static const struct inode_operations btrfs_dir_inode_operations = {
.get_acl = btrfs_get_acl,
.set_acl = btrfs_set_acl,
.update_time = btrfs_update_time,
+ .write_time = btrfs_write_time,
.tmpfile = btrfs_tmpfile,
};
static const struct inode_operations btrfs_dir_ro_inode_operations = {
@@ -9470,6 +9476,7 @@ static const struct inode_operations btrfs_dir_ro_inode_operations = {
.get_acl = btrfs_get_acl,
.set_acl = btrfs_set_acl,
.update_time = btrfs_update_time,
+ .write_time = btrfs_write_time,
};

static const struct file_operations btrfs_dir_file_operations = {
@@ -9540,6 +9547,7 @@ static const struct inode_operations btrfs_file_inode_operations = {
.get_acl = btrfs_get_acl,
.set_acl = btrfs_set_acl,
.update_time = btrfs_update_time,
+ .write_time = btrfs_write_time,
};
static const struct inode_operations btrfs_special_inode_operations = {
.getattr = btrfs_getattr,
@@ -9552,6 +9560,7 @@ static const struct inode_operations btrfs_special_inode_operations = {
.get_acl = btrfs_get_acl,
.set_acl = btrfs_set_acl,
.update_time = btrfs_update_time,
+ .write_time = btrfs_write_time,
};
static const struct inode_operations btrfs_symlink_inode_operations = {
.readlink = generic_readlink,
@@ -9565,6 +9574,7 @@ static const struct inode_operations btrfs_symlink_inode_operations = {
.listxattr = btrfs_listxattr,
.removexattr = btrfs_removexattr,
.update_time = btrfs_update_time,
+ .write_time = btrfs_write_time,
};

const struct dentry_operations btrfs_dentry_operations = {
diff --git a/fs/inode.c b/fs/inode.c
index 26753ba..8f5c4b5 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1499,17 +1499,24 @@ static int relatime_need_update(struct vfsmount *mnt, struct inode *inode,
*/
static int update_time(struct inode *inode, struct timespec *time, int flags)
{
- if (inode->i_op->update_time)
- return inode->i_op->update_time(inode, time, flags);
-
- if (flags & S_ATIME)
- inode->i_atime = *time;
- if (flags & S_VERSION)
- inode_inc_iversion(inode);
- if (flags & S_CTIME)
- inode->i_ctime = *time;
- if (flags & S_MTIME)
- inode->i_mtime = *time;
+ int ret;
+
+ if (inode->i_op->update_time) {
+ ret = inode->i_op->update_time(inode, time, flags);
+ if (ret)
+ return ret;
+ } else {
+ if (flags & S_ATIME)
+ inode->i_atime = *time;
+ if (flags & S_VERSION)
+ inode_inc_iversion(inode);
+ if (flags & S_CTIME)
+ inode->i_ctime = *time;
+ if (flags & S_MTIME)
+ inode->i_mtime = *time;
+ }
+ if (inode->i_op->write_time)
+ return inode->i_op->write_time(inode);
mark_inode_dirty_sync(inode);
return 0;
}
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index ec6dcdc..0e9653c 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -984,10 +984,8 @@ xfs_vn_setattr(
}

STATIC int
-xfs_vn_update_time(
- struct inode *inode,
- struct timespec *now,
- int flags)
+xfs_vn_write_time(
+ struct inode *inode)
{
struct xfs_inode *ip = XFS_I(inode);
struct xfs_mount *mp = ip->i_mount;
@@ -1004,21 +1002,16 @@ xfs_vn_update_time(
}

xfs_ilock(ip, XFS_ILOCK_EXCL);
- if (flags & S_CTIME) {
- inode->i_ctime = *now;
- ip->i_d.di_ctime.t_sec = (__int32_t)now->tv_sec;
- ip->i_d.di_ctime.t_nsec = (__int32_t)now->tv_nsec;
- }
- if (flags & S_MTIME) {
- inode->i_mtime = *now;
- ip->i_d.di_mtime.t_sec = (__int32_t)now->tv_sec;
- ip->i_d.di_mtime.t_nsec = (__int32_t)now->tv_nsec;
- }
- if (flags & S_ATIME) {
- inode->i_atime = *now;
- ip->i_d.di_atime.t_sec = (__int32_t)now->tv_sec;
- ip->i_d.di_atime.t_nsec = (__int32_t)now->tv_nsec;
- }
+
+ ip->i_d.di_ctime.t_sec = (__int32_t) inode->i_ctime.tv_sec;
+ ip->i_d.di_ctime.t_nsec = (__int32_t) inode->i_ctime.tv_nsec;
+
+ ip->i_d.di_mtime.t_sec = (__int32_t)inode->i_mtime.tv_sec;
+ ip->i_d.di_mtime.t_nsec = (__int32_t) inode->i_mtime.tv_nsec;
+
+ ip->i_d.di_atime.t_sec = (__int32_t) inode->i_atime.tv_sec;
+ ip->i_d.di_atime.t_nsec = (__int32_t) inode->i_atime.tv_nsec;
+
xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
xfs_trans_log_inode(tp, ip, XFS_ILOG_TIMESTAMP);
return xfs_trans_commit(tp, 0);
@@ -1129,7 +1122,7 @@ static const struct inode_operations xfs_inode_operations = {
.removexattr = generic_removexattr,
.listxattr = xfs_vn_listxattr,
.fiemap = xfs_vn_fiemap,
- .update_time = xfs_vn_update_time,
+ .write_time = xfs_vn_write_time,
};

static const struct inode_operations xfs_dir_inode_operations = {
@@ -1156,7 +1149,7 @@ static const struct inode_operations xfs_dir_inode_operations = {
.getxattr = generic_getxattr,
.removexattr = generic_removexattr,
.listxattr = xfs_vn_listxattr,
- .update_time = xfs_vn_update_time,
+ .write_time = xfs_vn_write_time,
.tmpfile = xfs_vn_tmpfile,
};

@@ -1184,7 +1177,7 @@ static const struct inode_operations xfs_dir_ci_inode_operations = {
.getxattr = generic_getxattr,
.removexattr = generic_removexattr,
.listxattr = xfs_vn_listxattr,
- .update_time = xfs_vn_update_time,
+ .write_time = xfs_vn_write_time,
.tmpfile = xfs_vn_tmpfile,
};

@@ -1198,7 +1191,7 @@ static const struct inode_operations xfs_symlink_inode_operations = {
.getxattr = generic_getxattr,
.removexattr = generic_removexattr,
.listxattr = xfs_vn_listxattr,
- .update_time = xfs_vn_update_time,
+ .write_time = xfs_vn_write_time,
};

STATIC void
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9ab779e..3633239 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1545,6 +1545,7 @@ struct inode_operations {
int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
u64 len);
int (*update_time)(struct inode *, struct timespec *, int);
+ int (*write_time)(struct inode *);
int (*atomic_open)(struct inode *, struct dentry *,
struct file *, unsigned open_flag,
umode_t create_mode, int *opened);
--
2.1.0

_______________________________________________
xfs mailing list
[email protected]
http://oss.sgi.com/mailman/listinfo/xfs

2014-11-22 16:54:36

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH-v2 5/5] ext4: add support for a lazytime mount option

Add an optimization for the MS_LAZYTIME mount option so that we will
opportunistically write out any inodes with the I_DIRTY_TIME flag set
in a particular inode table block when we need to update some inode in
that inode table block anyway.

Also add some temporary code so that we can set the lazytime mount
option without needing a modified /sbin/mount program which can set
MS_LAZYTIME. We can eventually make this go away once util-linux has
added support.

Google-Bug-Id: 18297052

Signed-off-by: Theodore Ts'o <[email protected]>
---
fs/ext4/inode.c | 48 ++++++++++++++++++++++++++++++++++++++++++---
fs/ext4/super.c | 9 +++++++++
fs/inode.c | 36 ++++++++++++++++++++++++++++++++++
include/linux/fs.h | 2 ++
include/trace/events/ext4.h | 29 +++++++++++++++++++++++++++
5 files changed, 121 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3356ab5..03149b4 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4163,6 +4163,50 @@ static int ext4_inode_blocks_set(handle_t *handle,
}

/*
+ * Opportunistically update the other time fields for other inodes in
+ * the same inode table block.
+ */
+static void ext4_update_other_inodes_time(struct super_block *sb,
+ unsigned long orig_ino, char *buf)
+{
+ struct ext4_inode_info *ei;
+ struct ext4_inode *raw_inode;
+ unsigned long ino;
+ struct inode *inode;
+ int i, inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
+ int inode_size = EXT4_INODE_SIZE(sb);
+
+ ino = orig_ino & ~(inodes_per_block - 1);
+ for (i = 0; i < inodes_per_block; i++, ino++, buf += inode_size) {
+ if (ino == orig_ino)
+ continue;
+ inode = find_active_inode_nowait(sb, ino);
+ if (!inode ||
+ (inode->i_state & I_DIRTY_TIME) == 0 ||
+ !spin_trylock(&inode->i_lock)) {
+ iput(inode);
+ continue;
+ }
+ inode->i_state &= ~I_DIRTY_TIME;
+ inode->i_ts_dirty_day = 0;
+ spin_unlock(&inode->i_lock);
+
+ ei = EXT4_I(inode);
+ raw_inode = (struct ext4_inode *) buf;
+
+ spin_lock(&ei->i_raw_lock);
+ EXT4_INODE_SET_XTIME(i_ctime, inode, raw_inode);
+ EXT4_INODE_SET_XTIME(i_mtime, inode, raw_inode);
+ EXT4_INODE_SET_XTIME(i_atime, inode, raw_inode);
+ ext4_inode_csum_set(inode, raw_inode, ei);
+ spin_unlock(&ei->i_raw_lock);
+ trace_ext4_other_inode_update_time(inode, orig_ino);
+ iput(inode);
+ }
+}
+
+
+/*
* Post the struct inode info into an on-disk inode location in the
* buffer-cache. This gobbles the caller's reference to the
* buffer_head in the inode location struct.
@@ -4260,7 +4304,6 @@ static int ext4_do_update_inode(handle_t *handle,
for (block = 0; block < EXT4_N_BLOCKS; block++)
raw_inode->i_block[block] = ei->i_data[block];
}
-
if (likely(!test_opt2(inode->i_sb, HURD_COMPAT))) {
raw_inode->i_disk_version = cpu_to_le32(inode->i_version);
if (ei->i_extra_isize) {
@@ -4271,10 +4314,9 @@ static int ext4_do_update_inode(handle_t *handle,
cpu_to_le16(ei->i_extra_isize);
}
}
-
ext4_inode_csum_set(inode, raw_inode, ei);
-
spin_unlock(&ei->i_raw_lock);
+ ext4_update_other_inodes_time(inode->i_sb, inode->i_ino, bh->b_data);

BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
rc = ext4_handle_dirty_metadata(handle, NULL, bh);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 4b79f39..1ac1914 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1133,6 +1133,7 @@ enum {
Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
Opt_usrquota, Opt_grpquota, Opt_i_version,
Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_mblk_io_submit,
+ Opt_lazytime, Opt_nolazytime,
Opt_nomblk_io_submit, Opt_block_validity, Opt_noblock_validity,
Opt_inode_readahead_blks, Opt_journal_ioprio,
Opt_dioread_nolock, Opt_dioread_lock,
@@ -1195,6 +1196,8 @@ static const match_table_t tokens = {
{Opt_i_version, "i_version"},
{Opt_stripe, "stripe=%u"},
{Opt_delalloc, "delalloc"},
+ {Opt_lazytime, "lazytime"},
+ {Opt_nolazytime, "nolazytime"},
{Opt_nodelalloc, "nodelalloc"},
{Opt_removed, "mblk_io_submit"},
{Opt_removed, "nomblk_io_submit"},
@@ -1450,6 +1453,12 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
case Opt_i_version:
sb->s_flags |= MS_I_VERSION;
return 1;
+ case Opt_lazytime:
+ sb->s_flags |= MS_LAZYTIME;
+ return 1;
+ case Opt_nolazytime:
+ sb->s_flags &= ~MS_LAZYTIME;
+ return 1;
}

for (m = ext4_mount_opts; m->token != Opt_err; m++)
diff --git a/fs/inode.c b/fs/inode.c
index 5a9a7b0..17eb1c0 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1296,6 +1296,42 @@ struct inode *ilookup(struct super_block *sb, unsigned long ino)
}
EXPORT_SYMBOL(ilookup);

+/**
+ * find_active_inode_nowait - find an active inode in the inode cache
+ * @sb: super block of file system to search
+ * @ino: inode number to search for
+ *
+ * Search for an active inode @ino in the inode cache, and if the
+ * inode is in the cache, the inode is returned with an incremented
+ * reference count. If the inode is being freed or is newly
+ * initialized, return nothing instead of trying to wait for the inode
+ * initialization or destruction to be complete.
+ */
+struct inode *find_active_inode_nowait(struct super_block *sb,
+ unsigned long ino)
+{
+ struct hlist_head *head = inode_hashtable + hash(sb, ino);
+ struct inode *inode, *ret_inode = NULL;
+
+ spin_lock(&inode_hash_lock);
+ hlist_for_each_entry(inode, head, i_hash) {
+ if ((inode->i_ino != ino) ||
+ (inode->i_sb != sb))
+ continue;
+ spin_lock(&inode->i_lock);
+ if ((inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) == 0) {
+ __iget(inode);
+ ret_inode = inode;
+ }
+ spin_unlock(&inode->i_lock);
+ goto out;
+ }
+out:
+ spin_unlock(&inode_hash_lock);
+ return ret_inode;
+}
+EXPORT_SYMBOL(find_active_inode_nowait);
+
int insert_inode_locked(struct inode *inode)
{
struct super_block *sb = inode->i_sb;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e3574cd..dbbd642 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2413,6 +2413,8 @@ extern struct inode *ilookup(struct super_block *sb, unsigned long ino);

extern struct inode * iget5_locked(struct super_block *, unsigned long, int (*test)(struct inode *, void *), int (*set)(struct inode *, void *), void *);
extern struct inode * iget_locked(struct super_block *, unsigned long);
+extern struct inode *find_active_inode_nowait(struct super_block *,
+ unsigned long);
extern int insert_inode_locked4(struct inode *, unsigned long, int (*test)(struct inode *, void *), void *);
extern int insert_inode_locked(struct inode *);
#ifdef CONFIG_DEBUG_LOCK_ALLOC
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index ff4bd1b..aca53c0 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -75,6 +75,35 @@ struct extent_status;
{ FALLOC_FL_ZERO_RANGE, "ZERO_RANGE"})


+TRACE_EVENT(ext4_other_inode_update_time,
+ TP_PROTO(struct inode *inode, ino_t orig_ino),
+
+ TP_ARGS(inode, orig_ino),
+
+ TP_STRUCT__entry(
+ __field( dev_t, dev )
+ __field( ino_t, ino )
+ __field( ino_t, orig_ino )
+ __field( uid_t, uid )
+ __field( gid_t, gid )
+ __field( __u16, mode )
+ ),
+
+ TP_fast_assign(
+ __entry->orig_ino = orig_ino;
+ __entry->dev = inode->i_sb->s_dev;
+ __entry->ino = inode->i_ino;
+ __entry->uid = i_uid_read(inode);
+ __entry->gid = i_gid_read(inode);
+ __entry->mode = inode->i_mode;
+ ),
+
+ TP_printk("dev %d,%d orig_ino %lu ino %lu mode 0%o uid %u gid %u",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ (unsigned long) __entry->orig_ino, __entry->ino,
+ __entry->mode, __entry->uid, __entry->gid)
+);
+
TRACE_EVENT(ext4_free_inode,
TP_PROTO(struct inode *inode),

--
2.1.0


2014-11-24 09:07:55

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH-v2 0/5] add support for a lazytime mount option

What's the test coverage for this? xfstest generic/192 tests that
atime is persisted over remounts, which we had a bug with when XFS
used to have a lazy atime implementation somewhat similar to the
proposal.

We should have something similar for c/mtime as well. Also a test to
ensure timestamps are persisted afer a fsync, although right now I can't
imagine how to do that genericly as no other filesystem seems to have
an equivaent to XFS_IOC_GOINGDOWN.

It seems you also handle i_version updates lazily. although that's
not mentioned anywhere. I actually have a clarification request out on
the IETF NFSv4 list about the persistance requirements for the change
counter but I've not seen an answer to it yet.

2014-11-24 11:57:27

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH-v2 0/5] add support for a lazytime mount option

On Mon, Nov 24, 2014 at 01:07:55AM -0800, Christoph Hellwig wrote:
> What's the test coverage for this? xfstest generic/192 tests that
> atime is persisted over remounts, which we had a bug with when XFS
> used to have a lazy atime implementation somewhat similar to the
> proposal.
>
> We should have something similar for c/mtime as well. Also a test to
> ensure timestamps are persisted afer a fsync, although right now I can't
> imagine how to do that genericly as no other filesystem seems to have
> an equivaent to XFS_IOC_GOINGDOWN.

generic/003 will show up problems if there are any [acm]time
persistences across remounts, so we have it already. An earlier
(buggy) version of this patch actually tripped generic/003, so I can
attest that it works.

As far as testing to make sure timestamps are persisted after an
fsync, we should be able to do something genericly using dm_flaky, I
would imagine. We'll need to suppress this in some circumstances
where we know that the file system doesn't have a journal enabled (and
thus has no such guarantees) but I have that issue today with the
various dm_flaky tests, and what I would probably suggest doing is
putting all of the dm_flaky tests in a separate xfstests group, so
that when I test file system configurations in nojournal mode, I can
suppress all of the dm_flakey tests very easily.

> It seems you also handle i_version updates lazily. although that's
> not mentioned anywhere. I actually have a clarification request out on
> the IETF NFSv4 list about the persistance requirements for the change
> counter but I've not seen an answer to it yet.

If we want to be paranoid, we handle i_version updates non-lazily; I
can see arguments in favor of that.

Ext4 only enables MS_I_VERSION if the user asks for it explicitly, so
it wouldn't cause me any problems. However, xfs and btrfs enables it
by default, so that means xfs and btrfs wouldn't see the benefits of
lazytime (if you're going to have to push I_VERSION to disk, you might
as well update the [acm]time while you're at it). I've always thought
that we *should* do is to only enable it if nfsv4 is serving the file
system, and not otherwise, though, which would also give us
consistency across all the file systems.

What do folks think?

- Ted

2014-11-24 12:27:21

by Rasmus Villemoes

[permalink] [raw]
Subject: Re: [PATCH-v2 3/5] vfs: don't let the dirty time inodes get more than a day stale

On Sat, Nov 22 2014, Theodore Ts'o <[email protected]> wrote:

> Guarantee that the on-disk timestamps will be no more than 24 hours
> stale.
>
> static int update_time(struct inode *inode, struct timespec *time, int flags)
> {
> + unsigned short days_since_boot = jiffies / (HZ * 86400);
> int ret;
>

This seems to wrap every 49 days (assuming 32 bit jiffies and HZ==1000),
so on-disk updates can be delayed indefinitely, assuming just the right
delays between writes.

> if (inode->i_op->update_time) {
> @@ -1527,14 +1528,27 @@ static int update_time(struct inode *inode, struct timespec *time, int flags)
> if (flags & S_MTIME)
> inode->i_mtime = *time;
> }
> - if (inode->i_sb->s_flags & MS_LAZYTIME) {
> + /*
> + * If i_ts_dirty_day is zero, then either we have not deferred
> + * timestamp updates, or the system has been up for less than
> + * a day (so days_since_boot is zero), so we defer timestamp
> + * updates in that case and set the I_DIRTY_TIME flag. If a
> + * day or more has passed, then i_ts_dirty_day will be
> + * different from days_since_boot, and then we should update
> + * the on-disk inode and then we can clear i_ts_dirty_day.
> + */

AFAICT days_since_boot is not actually 0 immediately after boot
due to

#define INITIAL_JIFFIES ((unsigned long)(unsigned int) (-300*HZ))

On 32 bit platforms, days_since_boot will be 0 shortly after, while on
64 bit it will always be >= 49. Not exactly sure how this affects the
above logic.

Would it make sense to introduce days_since_boot as a global variable
and avoid these issues? This would presumably also make update_time a
few cycles faster (avoiding a division-by-constant), but not sure if
that's important. And something of course needs to update
days_since_boot, but that should be doable.

Rasmus


2014-11-24 17:10:54

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH-v2 3/5] vfs: don't let the dirty time inodes get more than a day stale

On Mon, Nov 24, 2014 at 01:27:21PM +0100, Rasmus Villemoes wrote:
> On Sat, Nov 22 2014, Theodore Ts'o <[email protected]> wrote:
>
> > Guarantee that the on-disk timestamps will be no more than 24 hours
> > stale.
> >
> > + unsigned short days_since_boot = jiffies / (HZ * 86400);
>
> This seems to wrap every 49 days (assuming 32 bit jiffies and HZ==1000),
> so on-disk updates can be delayed indefinitely, assuming just the right
> delays between writes.

Good point, I'll fix this.

> Would it make sense to introduce days_since_boot as a global variable
> and avoid these issues? This would presumably also make update_time a
> few cycles faster (avoiding a division-by-constant), but not sure if
> that's important. And something of course needs to update
> days_since_boot, but that should be doable.

I can do this fairly simply like this:

get_monotonic_boottime(&uptime);
daycode = uptime.tv_sec / (HZ * 86400);

and we only need to do this if lazytime is set, and the inode isn't
marked as I_DIRTY_TIME:

if ((inode->i_sb->s_flags & MS_LAZYTIME) &&
!(flags & S_VERSION)) {
if (inode->i_state & I_DIRTY_TIME)
return 0;
get_monotonic_boottime(&uptime);
daycode = do_div64(uptime.tv_sec do_div, (HZ * 86400));
if (!inode->i_ts_dirty_day ||
inode->i_ts_dirty_day == daycode) {
spin_lock(&inode->i_lock);
inode->i_state |= I_DIRTY_TIME;
spin_unlock(&inode->i_lock);
inode->i_ts_dirty_day = daycode;
return 0;
}
}

So I'm not entirely sure it's worth it to create a global variable for
days since boot; I've been runnin with this patch in my laptop, we
wouldn't be triggering the get_monotonic_bootime() function all that
often. (Since once the dirty_time flg is set, we don't need to check
about whether we need to set it again.) And if we *did* care, it
would be simple enough to use a static counter which only recalculates
daycode every 30 or 60 minutes.


Cheers,

- Ted

2014-11-24 22:11:45

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH-v2 0/5] add support for a lazytime mount option

On Mon, Nov 24, 2014 at 06:57:27AM -0500, Theodore Ts'o wrote:
> If we want to be paranoid, we handle i_version updates non-lazily; I
> can see arguments in favor of that.
>
> Ext4 only enables MS_I_VERSION if the user asks for it explicitly, so
> it wouldn't cause me any problems. However, xfs and btrfs enables it
> by default, so that means xfs and btrfs wouldn't see the benefits of
> lazytime (if you're going to have to push I_VERSION to disk, you might
> as well update the [acm]time while you're at it). I've always thought
> that we *should* do is to only enable it if nfsv4 is serving the file
> system, and not otherwise, though, which would also give us
> consistency across all the file systems.

I guess you need to worry about the case where you shutdown nfsd, modify
a file, then restart nfsd--you don't want a client to miss the
modification in that case.

--b.

2014-11-25 00:32:15

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH-v2 0/5] add support for a lazytime mount option

On Mon, Nov 24, 2014 at 05:11:45PM -0500, J. Bruce Fields wrote:
> On Mon, Nov 24, 2014 at 06:57:27AM -0500, Theodore Ts'o wrote:
> > If we want to be paranoid, we handle i_version updates non-lazily; I
> > can see arguments in favor of that.
> >
> > Ext4 only enables MS_I_VERSION if the user asks for it explicitly, so
> > it wouldn't cause me any problems. However, xfs and btrfs enables it
> > by default, so that means xfs and btrfs wouldn't see the benefits of
> > lazytime (if you're going to have to push I_VERSION to disk, you might
> > as well update the [acm]time while you're at it). I've always thought
> > that we *should* do is to only enable it if nfsv4 is serving the file
> > system, and not otherwise, though, which would also give us
> > consistency across all the file systems.
>
> I guess you need to worry about the case where you shutdown nfsd, modify
> a file, then restart nfsd--you don't want a client to miss the
> modification in that case.

Hmmm, is there a way we can determine whether the file system is being
exported via nfsv4 is running, without using MS_I_VERSION? Maybe a
new flag? We could just disable lazytime if nfsd is running but
otherwise always increment i_version if lazytime is enabled. My main
concern with i_version updates was not the in-memry update of
i_version, but rather the unnecessary extra metadata write "tax" that
would be inflicted on all users, including the many that aren't
serving files via NFSv4.

That way, if you shutdown the nfsv4 server, i_version would still be
updated, but we wouldn't be forcing the writes to disk, but then when
nfs v4 server is updated again, the i_version tax would be paid again.

- Ted